<a href="https://colab.research.google.com/github/Athdam05/California-housing-ML-model/blob/main/Housing_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Perform a regression analysis on the "housing.csv" dataset. This involves loading the data, handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and testing sets. Train and evaluate Linear Regression, Ridge Regression, and Decision Tree Regressor models, comparing their performance using RMSE and MAE. Finally, analyze the models for underfitting/overfitting and discuss a real-world machine learning issue observed in the context of this dataset or models, summarizing the findings.

## Load Data and Initial Inspection

### Subtask:
Load the `housing.csv` dataset into a pandas DataFrame. Perform an initial inspection to understand its structure, identify data types, and check for any immediate signs of missing values or categorical features.


**Reasoning**:
To address the subtask, I need to load the specified CSV file into a pandas DataFrame, perform initial inspections including displaying the head, checking data types and non-null values with `.info()`, generating descriptive statistics with `.describe()`, and identifying missing values using `.isnull().sum()`.



In [1]:
import pandas as pd

# Load the housing.csv file into a pandas DataFrame
df = pd.read_csv('/content/housing.csv')

# Display the first 5 rows of the DataFrame
print('First 5 rows of the DataFrame:')
print(df.head())
print('\n')

# Display the concise summary of the DataFrame
print('DataFrame Info:')
df.info()
print('\n')

# Generate descriptive statistics of the DataFrame
print('Descriptive Statistics:')
print(df.describe())
print('\n')

# Check for the total count of missing values in each column
print('Missing values per column:')
print(df.isnull().sum())

First 5 rows of the DataFrame:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


DataFrame Info:
<class 'pandas.core.frame.

## Preprocess Data

### Subtask:
Handle missing values in the dataset using appropriate imputation strategies (e.g., mean, median, or mode imputation). Encode categorical variables using one-hot encoding. Apply feature scaling (e.g., StandardScaler) to numerical features to ensure all features contribute equally to the model training. Split the preprocessed dataset into training and testing sets.


**Reasoning**:
The subtask requires handling missing values, separating features and target, encoding categorical variables, scaling numerical features, and splitting the data. I will address all these steps sequentially in a single code block.



In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Fill missing values in 'total_bedrooms' with its median
df['total_bedrooms'].fillna(df['total_bedrooms'].median(), inplace=True)
print("Missing values after imputation:")
print(df.isnull().sum())
print('\n')

# 2. Separate target variable and features
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# 3. Identify categorical and numerical columns
categorical_cols = ['ocean_proximity']
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Remove 'ocean_proximity' from numerical_cols if it was mistakenly included, although select_dtypes handles this
numerical_cols = [col for col in numerical_cols if col not in categorical_cols]

# 4. Apply one-hot encoding to the 'ocean_proximity' column
X_categorical = pd.get_dummies(X[categorical_cols], drop_first=True)

# Concatenate one-hot encoded features with numerical features
X_numerical = X[numerical_cols]
X_processed = pd.concat([X_numerical, X_categorical], axis=1)

# 5. Initialize and apply StandardScaler to numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_processed[numerical_cols])
X_processed[numerical_cols] = X_scaled

print("First 5 rows of preprocessed features (X_processed):")
print(X_processed.head())
print('\n')

# 6. Split the preprocessed dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Missing values after imputation:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64


First 5 rows of preprocessed features (X_processed):
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0  -1.327835  1.052548            0.982143    -0.804819       -0.972476   
1  -1.322844  1.043185           -0.607019     2.045890        1.357143   
2  -1.332827  1.038503            1.856182    -0.535746       -0.827024   
3  -1.337818  1.038503            1.856182    -0.624215       -0.719723   
4  -1.337818  1.038503            1.856182    -0.462404       -0.612423   

   population  households  median_income  ocean_proximity_INLAND  \
0   -0.974429   -0.977033       2.344766                   False   
1    0.861439    1.669961       2.332238                   False   
2   

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['total_bedrooms'].fillna(df['total_bedrooms'].median(), inplace=True)


**Reasoning**:
The previous code produced a `FutureWarning` due to the use of `inplace=True` with `fillna()`. To resolve this, I will modify the line to assign the result of `fillna()` back to the column, which is the recommended practice in newer pandas versions.



In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Fill missing values in 'total_bedrooms' with its median
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
print("Missing values after imputation:")
print(df.isnull().sum())
print('\n')

# 2. Separate target variable and features
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# 3. Identify categorical and numerical columns
categorical_cols = ['ocean_proximity']
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Remove 'ocean_proximity' from numerical_cols if it was mistakenly included, although select_dtypes handles this
numerical_cols = [col for col in numerical_cols if col not in categorical_cols]

# 4. Apply one-hot encoding to the 'ocean_proximity' column
X_categorical = pd.get_dummies(X[categorical_cols], drop_first=True)

# Concatenate one-hot encoded features with numerical features
X_numerical = X[numerical_cols]
X_processed = pd.concat([X_numerical, X_categorical], axis=1)

# 5. Initialize and apply StandardScaler to numerical features
scaler = StandardScaler()
X_processed[numerical_cols] = scaler.fit_transform(X_processed[numerical_cols])

print("First 5 rows of preprocessed features (X_processed):")
print(X_processed.head())
print('\n')

# 6. Split the preprocessed dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Missing values after imputation:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64


First 5 rows of preprocessed features (X_processed):
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0  -1.327835  1.052548            0.982143    -0.804819       -0.972476   
1  -1.322844  1.043185           -0.607019     2.045890        1.357143   
2  -1.332827  1.038503            1.856182    -0.535746       -0.827024   
3  -1.337818  1.038503            1.856182    -0.624215       -0.719723   
4  -1.337818  1.038503            1.856182    -0.462404       -0.612423   

   population  households  median_income  ocean_proximity_INLAND  \
0   -0.974429   -0.977033       2.344766                   False   
1    0.861439    1.669961       2.332238                   False   
2   

## Train and Evaluate Linear Regression Model

### Subtask:
Train a Linear Regression model on the preprocessed training data. Evaluate its performance by calculating Root Mean Squared Error (RMSE) on both the training and testing sets, and Mean Absolute Error (MAE) on the testing set. Store these metrics for comparison.


**Reasoning**:
To address the subtask, I will first import the required modules for Linear Regression and evaluation metrics. Then, I will instantiate and train the Linear Regression model, make predictions on both training and testing sets, calculate RMSE for both sets and MAE for the testing set, print these metrics, and finally store them in a dictionary for future comparison.



In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# 1. Instantiate a Linear Regression model
linear_reg_model = LinearRegression()

# 2. Train the model
linear_reg_model.fit(X_train, y_train)
print("Linear Regression model trained successfully.\n")

# 3. Make predictions on the training data
y_train_pred_lr = linear_reg_model.predict(X_train)

# 4. Make predictions on the testing data
y_test_pred_lr = linear_reg_model.predict(X_test)

# 5. Calculate RMSE for the training set
rmse_train_lr = np.sqrt(mean_squared_error(y_train, y_train_pred_lr))

# 6. Calculate RMSE for the testing set
rmse_test_lr = np.sqrt(mean_squared_error(y_test, y_test_pred_lr))

# 7. Calculate MAE for the testing set
mae_test_lr = mean_absolute_error(y_test, y_test_pred_lr)

# 8. Print the calculated metrics
print(f"Linear Regression - Training RMSE: {rmse_train_lr:.2f}")
print(f"Linear Regression - Testing RMSE: {rmse_test_lr:.2f}")
print(f"Linear Regression - Testing MAE: {mae_test_lr:.2f}\n")

# 9. Store these metrics in a dictionary for later comparison
model_performance = {}
model_performance['Linear Regression'] = {
    'Train RMSE': rmse_train_lr,
    'Test RMSE': rmse_test_lr,
    'Test MAE': mae_test_lr
}

print("Metrics stored in 'model_performance' dictionary:")
print(model_performance)

Linear Regression model trained successfully.

Linear Regression - Training RMSE: 68433.94
Linear Regression - Testing RMSE: 70060.52
Linear Regression - Testing MAE: 50670.74

Metrics stored in 'model_performance' dictionary:
{'Linear Regression': {'Train RMSE': np.float64(68433.93736666226), 'Test RMSE': np.float64(70060.52184473521), 'Test MAE': 50670.73824097188}}


## Train and Evaluate Ridge Regression Model

### Subtask:
Train a Ridge Regression model on the preprocessed training data, experimenting with a suitable alpha value. Evaluate its performance using RMSE on both the training and testing sets, and MAE on the testing set. Store these metrics for comparison.


**Reasoning**:
To address the subtask, I will import the `Ridge` model, instantiate it with an alpha value, train it on the preprocessed data, make predictions, calculate and print the RMSE for both training and testing sets, and MAE for the testing set, and then store these metrics in the `model_performance` dictionary.



In [5]:
from sklearn.linear_model import Ridge

# 1. Instantiate a Ridge Regression model with alpha=1.0
ridge_reg_model = Ridge(alpha=1.0)

# 2. Train the model
ridge_reg_model.fit(X_train, y_train)
print("Ridge Regression model trained successfully.\n")

# 3. Make predictions on the training data
y_train_pred_ridge = ridge_reg_model.predict(X_train)

# 4. Make predictions on the testing data
y_test_pred_ridge = ridge_reg_model.predict(X_test)

# 5. Calculate RMSE for the training set
rmse_train_ridge = np.sqrt(mean_squared_error(y_train, y_train_pred_ridge))

# 6. Calculate RMSE for the testing set
rmse_test_ridge = np.sqrt(mean_squared_error(y_test, y_test_pred_ridge))

# 7. Calculate MAE for the testing set
mae_test_ridge = mean_absolute_error(y_test, y_test_pred_ridge)

# 8. Print the calculated metrics
print(f"Ridge Regression - Training RMSE: {rmse_train_ridge:.2f}")
print(f"Ridge Regression - Testing RMSE: {rmse_test_ridge:.2f}")
print(f"Ridge Regression - Testing MAE: {mae_test_ridge:.2f}\n")

# 9. Store these metrics in the 'model_performance' dictionary
model_performance['Ridge Regression'] = {
    'Train RMSE': rmse_train_ridge,
    'Test RMSE': rmse_test_ridge,
    'Test MAE': mae_test_ridge
}

print("Metrics stored in 'model_performance' dictionary:")
print(model_performance)

Ridge Regression model trained successfully.

Ridge Regression - Training RMSE: 68435.25
Ridge Regression - Testing RMSE: 70068.81
Ridge Regression - Testing MAE: 50678.36

Metrics stored in 'model_performance' dictionary:
{'Linear Regression': {'Train RMSE': np.float64(68433.93736666226), 'Test RMSE': np.float64(70060.52184473521), 'Test MAE': 50670.73824097188}, 'Ridge Regression': {'Train RMSE': np.float64(68435.2525775098), 'Test RMSE': np.float64(70068.80893507139), 'Test MAE': 50678.35859391172}}


## Train and Evaluate Decision Tree Regressor Model

### Subtask:
Train a Decision Tree Regressor model on the preprocessed training data. Evaluate its performance using RMSE on both the training and testing sets, and MAE on the testing set. Store these metrics for comparison.


**Reasoning**:
To address the subtask, I will import the `DecisionTreeRegressor` model, instantiate it, train it on the preprocessed data, make predictions on both training and testing sets, calculate and print the RMSE for both sets and MAE for the testing set, and then store these metrics in the `model_performance` dictionary.



In [6]:
from sklearn.tree import DecisionTreeRegressor

# 1. Instantiate a Decision Tree Regressor model
decision_tree_reg_model = DecisionTreeRegressor(random_state=42)

# 2. Train the model
decision_tree_reg_model.fit(X_train, y_train)
print("Decision Tree Regressor model trained successfully.\n")

# 3. Make predictions on the training data
y_train_pred_dt = decision_tree_reg_model.predict(X_train)

# 4. Make predictions on the testing data
y_test_pred_dt = decision_tree_reg_model.predict(X_test)

# 5. Calculate RMSE for the training set
rmse_train_dt = np.sqrt(mean_squared_error(y_train, y_train_pred_dt))

# 6. Calculate RMSE for the testing set
rmse_test_dt = np.sqrt(mean_squared_error(y_test, y_test_pred_dt))

# 7. Calculate MAE for the testing set
mae_test_dt = mean_absolute_error(y_test, y_test_pred_dt)

# 8. Print the calculated metrics
print(f"Decision Tree Regressor - Training RMSE: {rmse_train_dt:.2f}")
print(f"Decision Tree Regressor - Testing RMSE: {rmse_test_dt:.2f}")
print(f"Decision Tree Regressor - Testing MAE: {mae_test_dt:.2f}\n")

# 9. Store these metrics in the 'model_performance' dictionary
model_performance['Decision Tree Regressor'] = {
    'Train RMSE': rmse_train_dt,
    'Test RMSE': rmse_test_dt,
    'Test MAE': mae_test_dt
}

print("Metrics stored in 'model_performance' dictionary:")
print(model_performance)

Decision Tree Regressor model trained successfully.

Decision Tree Regressor - Training RMSE: 0.00
Decision Tree Regressor - Testing RMSE: 69790.18
Decision Tree Regressor - Testing MAE: 44214.67

Metrics stored in 'model_performance' dictionary:
{'Linear Regression': {'Train RMSE': np.float64(68433.93736666226), 'Test RMSE': np.float64(70060.52184473521), 'Test MAE': 50670.73824097188}, 'Ridge Regression': {'Train RMSE': np.float64(68435.2525775098), 'Test RMSE': np.float64(70068.80893507139), 'Test MAE': 50678.35859391172}, 'Decision Tree Regressor': {'Train RMSE': np.float64(0.0), 'Test RMSE': np.float64(69790.18312903232), 'Test MAE': 44214.67223837209}}


## Generate Model Comparison Table

### Subtask:
Create a comparison table summarizing the RMSE (Train), RMSE (Test), and MAE (Test) for all trained models (Linear Regression, Ridge Regression, and Decision Tree Regressor).


**Reasoning**:
To fulfill the subtask, I need to convert the existing `model_performance` dictionary into a pandas DataFrame to create a structured comparison table and then display it.



In [7]:
import pandas as pd

# Convert the model_performance dictionary into a pandas DataFrame
comparison_df = pd.DataFrame.from_dict(model_performance, orient='index')

# Display the comparison table
print("\nModel Performance Comparison Table:")
print(comparison_df)


Model Performance Comparison Table:
                           Train RMSE     Test RMSE      Test MAE
Linear Regression        68433.937367  70060.521845  50670.738241
Ridge Regression         68435.252578  70068.808935  50678.358594
Decision Tree Regressor      0.000000  69790.183129  44214.672238


## Analyze Model Performance and Discuss ML Issues

### Subtask:
Provide a brief note explaining where underfitting (high bias) and overfitting (high variance) might have occurred across the different models based on their training and testing errors. Also, comment on at least one real-world machine learning issue observed (e.g., noisy features, outliers, non-linearity, or dataset bias) in the context of this dataset or models.


### Model Performance Analysis and Real-World ML Issues

**Underfitting and Overfitting Analysis:**

*   **Linear Regression:**
    *   `Train RMSE`: 68433.94
    *   `Test RMSE`: 70060.52
    *   Both training and testing RMSE values are relatively high and quite close to each other. This suggests that the Linear Regression model might be **underfitting** the data, meaning it's too simple to capture the underlying patterns. The model has high bias, as it performs similarly (and not excellently) on both seen and unseen data.

*   **Ridge Regression:**
    *   `Train RMSE`: 68435.25
    *   `Test RMSE`: 70068.81
    *   Similar to Linear Regression, Ridge Regression also shows high and close RMSE values for both training and testing sets. This indicates that Ridge Regression, despite regularization, also appears to be **underfitting** the dataset. The regularization did not significantly improve the model's ability to capture complexity, suggesting the core issue might be the model's simplicity for this data.

*   **Decision Tree Regressor:**
    *   `Train RMSE`: 0.00
    *   `Test RMSE`: 69790.18
    *   The Decision Tree Regressor exhibits a perfect `Train RMSE` of 0.00, which is a strong indicator of **overfitting**. The model has learned the training data perfectly, including noise, but its performance drops significantly on the unseen test data. This model has high variance, failing to generalize well.

**Real-World Machine Learning Issue: Non-linearity and Outliers**

One significant real-world machine learning issue observed in the context of this dataset and models is the presence of **non-linear relationships** within the data that linear models struggle to capture effectively, and potentially the impact of **outliers**.

1.  **Non-linearity**: The similar and relatively high RMSEs for both Linear and Ridge Regression suggest that the relationship between the features (like `median_income`, `total_rooms`, `housing_median_age`, etc.) and the target variable (`median_house_value`) is likely not purely linear. Simple linear models cannot adequately represent the true complexity of the housing market data. This could be due to interaction effects between features or thresholds that affect housing prices in a non-linear fashion (e.g., beyond a certain income level, house prices might increase at a different rate).

2.  **Outliers/Data Distribution**: While not explicitly analyzed in detail, the descriptive statistics from the initial inspection (e.g., large differences between mean and median for `total_rooms`, `total_bedrooms`, `population`, `households`, and the max values being significantly higher than 75th percentile) hint at the presence of outliers or skewed distributions in several numerical features. These outliers can heavily influence linear models by pulling the regression line towards them, leading to higher overall error rates. For decision trees, extreme outliers might be captured as specific nodes if the tree grows deep enough, contributing to the overfitting observed.

To address these issues, more complex models (like Random Forests or Gradient Boosting) or techniques such as feature engineering to create non-linear features, robust scaling methods, or outlier detection and handling might be necessary. The task already hints at a complex relationship between features and target, and the simple models confirm this complexity is not fully captured.

## Final Task

### Subtask:
Summarize the regression analysis, including the performance of different models and key insights regarding model complexity and generalization.


## Summary:

### Data Analysis Key Findings

*   **Dataset Overview**: The `housing.csv` dataset contains 20,640 entries and 10 columns. `total_bedrooms` had 207 missing values, which were imputed using the median. `ocean_proximity` was identified as a categorical feature.
*   **Data Preprocessing**:
    *   Missing values in `total_bedrooms` were filled with the median.
    *   The `ocean_proximity` categorical feature was one-hot encoded.
    *   Numerical features were scaled using `StandardScaler`.
    *   The data was split into training (16,512 samples) and testing (4,128 samples) sets.
*   **Model Performance Comparison**:
    *   **Linear Regression**: Showed a Training RMSE of \$68,433.94\$, Testing RMSE of \$70,060.52\$, and Testing MAE of \$50,670.74\$. Both RMSEs were high and close, indicating potential underfitting.
    *   **Ridge Regression**: Exhibited similar performance with a Training RMSE of \$68,435.25\$, Testing RMSE of \$70,068.81\$, and Testing MAE of \$50,678.36\$. This also suggested underfitting, as regularization did not significantly alter performance compared to standard Linear Regression.
    *   **Decision Tree Regressor**: Achieved a Training RMSE of \$0.00\$ but a Testing RMSE of \$69,790.18\$ and Testing MAE of \$44,214.67\$. The perfect training score with a much higher test score is a clear sign of severe overfitting.
*   **Model Complexity and Generalization**:
    *   Linear and Ridge Regression models appear to be **underfitting** the data. Their high and similar training and testing error rates suggest they are too simple to capture the complex relationships within the housing data.
    *   The Decision Tree Regressor severely **overfitted** the training data, as evidenced by its \$0.00\$ training RMSE, but failed to generalize well to unseen data, resulting in a high testing RMSE.
*   **Real-World Machine Learning Issues**: The analysis revealed two key issues:
    *   **Non-linearity**: The poor performance of linear models suggests that the relationship between features and housing prices is likely non-linear, which simple linear models cannot adequately capture.
    *   **Outliers/Data Distribution**: Initial inspection indicated skewed distributions and potential outliers in several numerical features, which can negatively impact model training, especially for linear models and unconstrained decision trees.

### Insights or Next Steps

*   The dataset exhibits non-linear relationships and potentially outliers, making simple linear models prone to underfitting and unconstrained decision trees prone to overfitting. More sophisticated models or feature engineering are necessary to improve predictive performance.
*   Future steps should include exploring more robust regression models (e.g., Random Forest, Gradient Boosting), hyperparameter tuning for the Decision Tree to mitigate overfitting (e.g., setting `max_depth`), and potentially advanced outlier detection and handling techniques.
