### Applied Machine Learning Midterm Project
### Nick Elias

* CSIS 44670-81
* Date: 4/18/2025

### Overview

- Businesses and organizations often need to understand the relationships between different factors to make better decisions.  
    - For example, a company may want to predict the fuel efficiency of a car based on its weight and engine size or estimate home prices based on square footage and location.  
- Regression analysis helps identify and quantify these relationships between numerical features, providing insights that can be used for forecasting and decision-making.  

This project demonstrates your ability to apply regression modeling techniques to a real-world dataset. You will:  
- Load and explore a dataset.  
- Choose and justify features for predicting a target variable.  
- Train a regression model and evaluate performance.  
- Compare multiple regression approaches.  
- Document your work in a structured Jupyter Notebook.  
- Conduct a peer review of a classmate's project.  

# Introduction

In [17]:
# Standard libraries
import os
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.preprocessing import (
    LabelEncoder,
    OneHotEncoder
)

# Machine Learning - Model Selection
from sklearn.model_selection import (
    train_test_split,
    cross_val_score
)

# Machine Learning - Models (regression, to match the project goal)
from sklearn.linear_model import LinearRegression

# Machine Learning - Metrics (regression)
from sklearn.metrics import (
    r2_score,
    mean_absolute_error,
    mean_squared_error
)

## Section 1. Import and Inspect the Data
### 1.1 Load the dataset and display the first 10 rows.
### 1.2 Check for missing values and display summary statistics.
### Reflection 1: What do you notice about the dataset? Are there any data issues?

In [18]:
# Load the dataset
data = pd.read_csv('data/train.csv')

# Display all columns in the DataFrame
pd.set_option('display.max_columns', None)

# Display the first 10 rows
print("First 10 rows of the dataset:")
print(data.head(10))

First 10 rows of the dataset:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   
5   6          50       RL         85.0    14115   Pave   NaN      IR1   
6   7          20       RL         75.0    10084   Pave   NaN      Reg   
7   8          60       RL          NaN    10382   Pave   NaN      IR1   
8   9          50       RM         51.0     6120   Pave   NaN      Reg   
9  10         190       RL         50.0     7420   Pave   NaN      Reg   

  LandContour Utilities LotConfig LandSlope Neighborhood Condition1  \
0         Lvl    AllPub    Inside       Gtl      CollgCr       Norm   
1         Lvl

In [19]:
print(data.dtypes)

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object


In [20]:
# Display columns with missing data
missing_data = data.isnull().sum()
missing_columns = missing_data[missing_data > 0]
print("Columns with missing data:")
print(missing_columns)

Columns with missing data:
LotFrontage      259
Alley           1369
MasVnrType       872
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64


In [21]:
# Display summary statistics
print("\nSummary statistics of the dataset:")
print(data.describe())


Summary statistics of the dataset:
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726   
std       1.112799    30.202904     20.645407   181.06

### **Dataset Observations:**

The dataset contains real estate data related to housing prices.

*General Structure:*

- The dataset has 1,460 rows and 81 columns, a mix of numerical (int64, float64) and categorical (object) columns.
- The target variable is SalePrice, the sale price of each house.

*Columns with Missing Data:*

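- Of the 1,460 rows, several columns are mostly empty: PoolQC (1,453 missing), MiscFeature (1,406), Alley (1,369), and Fence (1,179). FireplaceQu (690) and LotFrontage (259) are missing in a smaller share of rows.
- Mostly empty columns are candidates for removal; the others may be imputed, depending on their importance.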
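
One way to turn the raw missing counts into a drop-or-impute decision is to look at the share of missing rows per column. The sketch below uses a tiny made-up frame (the column names come from the dataset, the values and the 50% cutoff are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; values are illustrative only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: columns missing in more than half the rows
# are flagged as candidates for removal.
DROP_THRESHOLD = 0.5
drop_candidates = missing_share[missing_share > DROP_THRESHOLD].index.tolist()

print(missing_share)
print("Drop candidates:", drop_candidates)
```

On the real data, the same two lines would be applied to `data` instead of `toy`.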

*Features:*

- Numerical Features: Examples include LotFrontage, LotArea, YearBuilt, GrLivArea, and SalePrice.
- Categorical Features: Examples include MSZoning, Street, LotShape, and Neighborhood.

*Potential Relationships:*

- Features like GrLivArea, OverallQual, and YearBuilt might have a strong correlation with SalePrice.
- Categorical features like Neighborhood and MSZoning could also influence housing prices.

*Outliers and Anomalies:*

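- LotArea has a maximum of 215,245 against a 75th percentile of about 11,600, and LotFrontage has a maximum of 313 against a 75th percentile of 80, so both are right-skewed with likely outliers.
- LotFrontage is missing in 259 rows.
- The five garage columns, including the numerical GarageYrBlt, are each missing in 81 rows. This may indicate houses without a garage, so they need special handling rather than a plain numeric fill.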

*Target Variable:*

- SalePrice is the target variable for regression analysis.

## Section 2. Data Exploration and Preparation
### 2.1 Explore data patterns and distributions
- Create histograms, boxplots, and count plots for categorical variables (as applicable).
- Identify patterns, outliers, and anomalies in feature distributions.
- Check for class imbalance in the target variable (as applicable).
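
As a numeric companion to a boxplot, the 1.5 × IQR rule (the usual boxplot whisker rule) can flag outliers directly. This is a minimal sketch: the helper name `iqr_outlier_mask` is hypothetical, and the series holds a few LotArea values from the first rows shown earlier plus the reported maximum.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# A handful of LotArea values, including the dataset maximum.
lot_area = pd.Series(
    [8450, 9600, 11250, 9550, 14260, 10084, 215245], name="LotArea"
)

mask = iqr_outlier_mask(lot_area)
print(lot_area[mask])  # rows flagged as outliers
```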


### 2.2 Handle missing values and clean data
- Impute or drop missing values (as applicable).
- Remove or transform outliers (as applicable).
- Convert categorical data to numerical format using encoding (as applicable).
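
A minimal sketch of the cleaning steps above, assuming median imputation for a numeric column and one-hot encoding for a categorical one. The frame is made up; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric gap and one categorical column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "MSZoning": ["RL", "RL", "RM", "RL"],
})

clean = toy.copy()

# Impute the missing numeric value with the column median (an assumed choice).
clean["LotFrontage"] = clean["LotFrontage"].fillna(clean["LotFrontage"].median())

# One-hot encode the categorical column into 0/1 indicator columns.
clean = pd.get_dummies(clean, columns=["MSZoning"], dtype=int)

print(clean)
```

Median imputation is less sensitive to extreme values than the mean, which matters for skewed columns such as LotFrontage.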


### 2.3 Feature selection and engineering
- Create new features (as applicable).
- Transform or combine existing features to improve model performance (as applicable).
- Scale or normalize data (as applicable).
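
The sketch below illustrates one possible engineered feature and a scaling step. `HouseAge` is a hypothetical feature (YrSold minus YearBuilt), not something the assignment prescribes, and the values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; column names come from the dataset, values are illustrative.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915],
    "YrSold": [2008, 2007, 2008, 2006],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# Hypothetical engineered feature: age of the house at the time of sale.
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]

# Standardize the numeric features to zero mean and unit variance.
cols = ["HouseAge", "GrLivArea"]
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy[cols]), columns=cols, index=toy.index)

print(scaled.round(3))
```

In a real workflow the scaler would be fit on the training split only, which is what the pipelines in Section 5 take care of.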


### Reflection 2: What patterns or anomalies do you see? Do any features stand out? What preprocessing steps were necessary to clean and improve the data? Did you create or modify any features to improve performance?

## Section 3. Feature Selection and Justification
### 3.1 Choose features and target
- Select two or more input features (numerical for regression, numerical and/or categorical for classification)
- Select a target variable (as applicable)
    - Regression: Continuous target variable (e.g., price, temperature).
    - Classification: Categorical target variable (e.g., gender, species).
    - Clustering: No target variable.
- Justify your selection with reasoning.


### 3.2 Define X and y
- Assign input features to X
- Assign target variable to y (as applicable)
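
Defining X and y can be as short as the following sketch. The three features are the ones named as promising in the observations above; choosing exactly these is an assumption, and the rows are made up.

```python
import pandas as pd

# Toy frame standing in for the cleaned housing data (illustrative values).
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "OverallQual": [7, 6, 7, 7],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "SalePrice": [200000, 180000, 220000, 140000],
})

# Input features (a DataFrame) and target (a Series).
features = ["GrLivArea", "OverallQual", "YearBuilt"]
X = toy[features]
y = toy["SalePrice"]

print(X.shape, y.shape)
```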


### Reflection 3: Why did you choose these features? How might they impact predictions or accuracy?

## Section 4. Train a Model (Linear Regression)
### 4.1 Split the data into training and test sets using train_test_split (or StratifiedShuffleSplit if class imbalance is an issue).
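
A minimal sketch of the split, using synthetic data so the block runs on its own. The 80/20 ratio and `random_state=42` are assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and target.
rng = np.random.default_rng(42)
n = 100
X = pd.DataFrame({
    "GrLivArea": rng.uniform(600, 3000, n),
    "OverallQual": rng.integers(1, 11, n),
})
y = (50 * X["GrLivArea"] + 10000 * X["OverallQual"]
     + rng.normal(0, 5000, n)).rename("SalePrice")

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

`StratifiedShuffleSplit` is aimed at categorical targets, so for a continuous target such as SalePrice a plain `train_test_split` is the usual starting point.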


### 4.2 Train model using Scikit-Learn model.fit() method
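
Training follows the standard scikit-learn pattern of constructing an estimator and calling `fit`. The sketch uses noise-free synthetic data with known coefficients so the result is easy to check; real housing data will not fit this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data generated from y = 3*x0 + 2*x1 + 5 (no noise).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 2))
y_train = 3.0 * X_train[:, 0] + 2.0 * X_train[:, 1] + 5.0

# Fit ordinary least squares.
model = LinearRegression()
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```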


### 4.3 Evaluate performance, for example:
- Regression: R^2, MAE, RMSE (newer scikit-learn releases add `root_mean_squared_error` and deprecate `mean_squared_error(..., squared=False)`)
- Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix
- Clustering: Inertia, Silhouette Score
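
The three regression metrics can be computed as in this sketch, on a small hand-made example. RMSE is taken as the square root of MSE so the code does not depend on which scikit-learn version is installed.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hand-made true and predicted values (illustrative only).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

r2 = r2_score(y_true, y_pred)                        # share of variance explained
mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large errors more

print(f"R^2:  {r2:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

MAE and RMSE are in the units of the target (dollars for SalePrice), while R^2 is unitless.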

### Reflection 4: How well did the model perform? Any surprises in the results?

## Section 5. Improve the Model or Try Alternates (Implement Pipelines)
### 5.1 Implement Pipeline 1: Imputer → StandardScaler → Linear Regression.
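
A minimal sketch of Pipeline 1 on synthetic data with a few injected gaps. The median imputation strategy and the step names are assumptions; the assignment only says "Imputer".

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: y depends linearly on two features, plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 5% of the first feature to simulate missing values.
X_missing = X.copy()
X_missing[rng.random(200) < 0.05, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.25, random_state=0
)

# Imputer -> StandardScaler -> Linear Regression
pipe1 = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe1.fit(X_train, y_train)
pipe1_r2 = pipe1.score(X_test, y_test)  # R^2 on the held-out rows

print(f"Pipeline 1 test R^2: {pipe1_r2:.3f}")
```

Because the imputer and scaler sit inside the pipeline, they are fit on the training rows only, which avoids leaking test-set statistics into the model.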


### 5.2 Implement Pipeline 2: Imputer → Polynomial Features (degree=3) → StandardScaler → Linear Regression.
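
Pipeline 2 adds a polynomial expansion before scaling. The sketch uses a synthetic cubic relationship so a degree-3 model has something to capture; the imputation strategy and step names are assumptions, and the imputer is a no-op here because the toy data has no gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a cubic relationship: y = x^3 - x + noise.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Imputer -> Polynomial Features (degree=3) -> StandardScaler -> Linear Regression
pipe2 = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe2.fit(X_train, y_train)
pipe2_r2 = pipe2.score(X_test, y_test)

print(f"Pipeline 2 test R^2: {pipe2_r2:.3f}")
```

With many input features, degree-3 expansion grows the feature count quickly and can overfit, so comparing train and test scores is worthwhile.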


### 5.3 Compare performance of all models across the same performance metrics
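
One way to keep the comparison fair is to score every model with the same helper on the same split. In this sketch the `evaluate` helper, the model labels, and the synthetic cubic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic cubic data shared by all models.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(240, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.1, size=240)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2
)

models = {
    "Linear Regression": LinearRegression(),
    "Pipeline 1 (scaled)": Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]),
    "Pipeline 2 (degree 3)": Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]),
}

def evaluate(model):
    """Fit on the training split and return test-set metrics as a dict."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

# One row per model, one column per metric.
results = pd.DataFrame({name: evaluate(m) for name, m in models.items()}).T
print(results.round(4))
```

On data without missing values, plain linear regression and Pipeline 1 score the same, since ordinary least squares predictions do not change when features are standardized. Scaling matters more for regularized or distance-based models and for numerical stability with polynomial terms.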


### Reflection 5: Which models performed better? How does scaling impact results?

## Section 6. Final Thoughts & Insights
### 6.1 Summarize findings.


### 6.2 Discuss challenges faced.


### 6.3 If you had more time, what would you try next?


### Reflection 6: What did you learn from this project?