# California Housing Prices 
## Pre-Processing, Training and Modeling
### by Anthony Medina

## Table of Contents
#### Pre-Processing
1. Introduction and Notebook Objectives
2. Imports and Loading the data
3. Dealing with Categorical Variables
4. Dropping the Longitude and Latitude columns
5. Normalizing and Scaling Data
#### Training and Modeling
6. Splitting the data into testing and training sets
7. Modeling Imports
8. Model 1 Linear Regression
9. Model 2 Polynomial Regression
10. Model 3 Ridge Regression
11. Model 4 Lasso Regression
#### Model Choice
12. Model Choice

# Pre-Processing

### 1. Introduction and Notebook Objectives
We'll be preparing a cleaned-up version of the [Kaggle California Housing Prices](https://www.kaggle.com/datasets/camnugent/california-housing-prices) data set for modeling.

The objective is to prepare the data for machine learning.
By the end of this notebook, categorical variables will have been converted to numeric and the numeric features will have been scaled. We will also have tried a few models and picked the one that performs best.

### 2. Imports and Loading the data

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split

In [2]:
# Import the data from the cleaned data folder
house_data = pd.read_csv('../cleaned_data/ready_for_EDA.csv', index_col = 0)

In [3]:
house_data.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880 | 129 | 322 | 126 | 83252.0 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099 | 1106 | 2401 | 1138 | 83014.0 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467 | 190 | 496 | 177 | 72574.0 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274 | 235 | 558 | 219 | 56431.0 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627 | 280 | 565 | 259 | 38462.0 | 342200.0 | NEAR BAY |


In [4]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20433 non-null  float64
 1   latitude            20433 non-null  float64
 2   housing_median_age  20433 non-null  float64
 3   total_rooms         20433 non-null  int64  
 4   total_bedrooms      20433 non-null  int64  
 5   population          20433 non-null  int64  
 6   households          20433 non-null  int64  
 7   median_income       20433 non-null  float64
 8   median_house_value  20433 non-null  float64
 9   ocean_proximity     20433 non-null  object 
dtypes: float64(5), int64(4), object(1)
memory usage: 1.7+ MB


In [5]:
house_data.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms             int64
total_bedrooms          int64
population              int64
households              int64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

### 3. Dealing with Categorical Variables

In [6]:
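# One-hot encode ocean_proximity; drop_first=True drops the first category
# so the remaining dummy columns are not perfectly collinear
df = pd.get_dummies(house_data, columns=['ocean_proximity'], drop_first=True)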

In [7]:
df.dtypes

longitude                     float64
latitude                      float64
housing_median_age            float64
total_rooms                     int64
total_bedrooms                  int64
population                      int64
households                      int64
median_income                 float64
median_house_value            float64
ocean_proximity_INLAND          uint8
ocean_proximity_ISLAND          uint8
ocean_proximity_NEAR BAY        uint8
ocean_proximity_NEAR OCEAN      uint8
dtype: object
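
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The category labels are the five `ocean_proximity` values in the Kaggle data set (the `<1H OCEAN` label is an assumption from that source, since it never appears in the output above); the frame itself is purely illustrative.

```python
import pandas as pd

# Toy frame with one row per category (illustrative, not the housing data)
toy = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
})

encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)

# The first category in sorted order ("<1H OCEAN") gets no column of its own;
# a row from that category is encoded as all zeros in the remaining columns.
print(encoded.columns.tolist())
print(encoded.iloc[0].astype(int).tolist())
```

The dropped category acts as the baseline, so each dummy coefficient in a linear model is read relative to it.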

### 4. Dropping the Longitude and Latitude columns

In [8]:
# Drop both coordinate columns in a single call
df = df.drop(columns=['longitude', 'latitude'])
df.head()

|   | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 880 | 129 | 322 | 126 | 83252.0 | 452600.0 | 0 | 0 | 1 | 0 |
| 1 | 21.0 | 7099 | 1106 | 2401 | 1138 | 83014.0 | 358500.0 | 0 | 0 | 1 | 0 |
| 2 | 52.0 | 1467 | 190 | 496 | 177 | 72574.0 | 352100.0 | 0 | 0 | 1 | 0 |
| 3 | 52.0 | 1274 | 235 | 558 | 219 | 56431.0 | 341300.0 | 0 | 0 | 1 | 0 |
| 4 | 52.0 | 1627 | 280 | 565 | 259 | 38462.0 | 342200.0 | 0 | 0 | 1 | 0 |


### 5. Normalizing and Scaling Data

In [9]:
from sklearn.preprocessing import MinMaxScaler
standard_col = ['total_rooms', 'total_bedrooms', 'population', 'households', 'housing_median_age','median_income']

In [10]:
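# Rescale each listed column to the [0, 1] range with MinMaxScaler
# (the scaler is fit on the full data set, before the train/test split)
scaler = MinMaxScaler()
scaler.fit(df[standard_col])
df[standard_col] = scaler.transform(df[standard_col])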
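
As a quick illustration of what `MinMaxScaler` does, the sketch below compares it with the formula `(x - min) / (max - min)` on a hypothetical single column. The values are made up and are not taken from the housing data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data (not from the housing set)
values = np.array([[10.0], [20.0], [40.0]])

scaled = MinMaxScaler().fit_transform(values)

# The same rescaling written out by hand, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.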

# Training and Modeling

### 6. Splitting the data into testing and training sets

In [11]:
X = df.drop('median_house_value', axis =1).values
y = df['median_house_value'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 21)
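
In this notebook the scaler was fit before the split, so the test rows contributed to the min and max used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic stand-in data; the `*_demo` names are hypothetical, and the results reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features and target (assumption: not the housing data)
rng = np.random.default_rng(21)
X_demo = rng.uniform(0, 1000, size=(100, 3))
y_demo = rng.uniform(0, 500000, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=21
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse those values on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected because the test rows played no part in fitting the scaler.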

### 7. Modeling Imports

In [12]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline 

### 8. Model 1 Linear Regression

In [13]:
# create a linear regression model
lr = LinearRegression()

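# define the parameter grid to search
# note: n_jobs only controls parallelism and does not change the fitted model,
# so fit_intercept is the only setting here that can affect the scores
params = {'fit_intercept': [True, False],
          'n_jobs': [1, 2, 3, 4, 5]}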


# use grid search to find optimal hyperparameters
grid = GridSearchCV(lr, params, cv=5)
grid.fit(X_train, y_train)

# print the best hyperparameters
print(grid.best_params_)

# predict on the test set using the optimized model
y_pred = grid.predict(X_test)

# evaluate the performance of the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print('R-squared: ', r2)
print('MSE: ', mse)
print('RMSE: ', rmse)

{'fit_intercept': True, 'n_jobs': 1}
R-squared:  0.6352576876332579
MSE:  4907868535.531821
RMSE:  70056.18127996859
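
For reference, the three metrics printed above can be written out by hand. The sketch below does this on hypothetical toy arrays and checks the result against scikit-learn; it also confirms that the reported RMSE is just the square root of the reported MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions (not from the housing data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# MSE is the mean squared residual; RMSE is its square root
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R-squared is 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))

# Square root of the MSE reported for Model 1
reported_rmse = np.sqrt(4907868535.531821)
print(round(reported_rmse, 2))
```

Because RMSE is in the same units as `median_house_value`, it is the easiest of the three to read as a typical prediction error in dollars.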


### 9. Model 2 Polynomial Regression

In [14]:
# create a PolynomialFeatures transformer (the degree is set by the grid search)
poly = PolynomialFeatures()

# create a StandardScaler to normalize the features
scaler = StandardScaler()

# create a LinearRegression model
lr = LinearRegression()

# define the parameter grid to search
# (lr__n_jobs only controls parallelism and does not change the fitted model)
params = {'poly__degree': [2, 3, 4],
          'lr__fit_intercept': [True, False],
          'lr__n_jobs': [1, 2, 3, 4]}

# create a pipeline to combine the transformers and the model
# (Pipeline was already imported in the modeling imports cell)
pipe = Pipeline(steps=[('poly', poly), ('scaler', scaler), ('lr', lr)])

# use grid search to find optimal hyperparameters
grid = GridSearchCV(pipe, params, cv=5)
grid.fit(X_train, y_train)

# print the best hyperparameters
print(grid.best_params_)

# predict on the test set using the optimized model
y_pred = grid.predict(X_test)

# evaluate the performance of the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print('R-squared: ', r2)
print('MSE: ', mse)
print('RMSE: ', rmse)

{'lr__fit_intercept': True, 'lr__n_jobs': 1, 'poly__degree': 2}
R-squared:  -1.6166739750942327e+20
MSE:  2.1753503954869408e+30
RMSE:  1474906910786894.8
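
An R-squared this far below zero points to a numerical problem rather than a merely weak fit. One plausible contributor, which I have not verified on this data, is that polynomial expansion of the 0/1 `ocean_proximity` dummy columns creates duplicate and all-zero features. The sketch below shows the effect on two hypothetical, mutually exclusive indicator columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two mutually exclusive 0/1 indicator columns, as one-hot encoding produces
# (hypothetical rows, not the housing data)
dummies = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly_demo.fit_transform(dummies)

# Column order for two inputs a, b at degree 2: a, b, a^2, a*b, b^2
a, b, a_sq, a_b, b_sq = expanded.T

print(np.array_equal(a_sq, a))            # squaring a 0/1 column duplicates it
print(np.all(a_b == 0))                   # exclusive indicators never overlap
print(np.linalg.matrix_rank(expanded))    # far fewer independent columns than 5
```

Duplicate and constant columns make the least-squares problem rank-deficient, which can lead to unstable coefficients. Expanding only the continuous features, or fitting a regularized model such as Ridge on the expanded features, are two options that might help here.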


### 10. Model 3 Ridge Regression

In [15]:
# Define the hyperparameters to tune
parameters = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0], 'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg']}

# Create a Ridge regression model
ridge = Ridge()

# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(ridge, parameters, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_alpha = grid_search.best_params_['alpha']
best_solver = grid_search.best_params_['solver']

# Train a new Ridge model with the best hyperparameters
ridge_best = Ridge(alpha=best_alpha, solver=best_solver)
ridge_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred = ridge_best.predict(X_test)

# Evaluate the model using MSE, RMSE, and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print("Best Alpha:", best_alpha)
print("Best Solver:", best_solver)
print("R-squared:", r2)
print("Mean Squared Error (MSE):", mse)
print('RMSE: ', rmse)

Best Alpha: 0.1
Best Solver: sparse_cg
R-squared: 0.6348951792611737
Mean Squared Error (MSE): 4912746344.803999
RMSE:  70090.98618798282
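
A side note on the refit step: `GridSearchCV` refits the best parameter combination on the full training data by default (`refit=True`), so the fitted search object already holds a model equivalent to the one trained by hand above. The sketch below checks this on synthetic data; the `*_demo` names and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for X_train, y_train)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=80)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0]}, cv=5)
search.fit(X_demo, y_demo)

# Model refit automatically by GridSearchCV on all of X_demo, y_demo
refit_model = search.best_estimator_

# Model trained by hand with the same best alpha
manual_model = Ridge(alpha=search.best_params_["alpha"]).fit(X_demo, y_demo)

same = np.allclose(refit_model.predict(X_demo), manual_model.predict(X_demo))
print(same)
```

Calling `search.predict(...)` uses `best_estimator_` directly, which is what the Linear Regression and Polynomial Regression cells already rely on.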


### 11. Model 4 Lasso Regression

In [16]:
# Define the hyperparameters to tune
parameters = {'alpha': [0.1, 1.0, 10.0], 'max_iter': [500, 1500, 2000]}

# Create a Lasso regression model
lasso = Lasso()

# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(lasso, parameters, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_alpha = grid_search.best_params_['alpha']
best_max_iter = grid_search.best_params_['max_iter']

# Train a new Lasso model with the best hyperparameters
lasso_best = Lasso(alpha=best_alpha, max_iter=best_max_iter)
lasso_best.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lasso_best.predict(X_test)

# Evaluate the model using MSE, RMSE, and R-squared
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Best Alpha:", best_alpha)
print("Best Max Iterations:", best_max_iter)
print("R-squared:", r2)
print("Mean Squared Error (MSE):", mse)
print('RMSE: ', rmse)

Best Alpha: 0.1
Best Max Iterations: 500
R-squared: 0.6352516232554682
Mean Squared Error (MSE): 4907950136.069877
RMSE:  70056.76367111085


### 12. Model Choice

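To pick the best model, I compared RMSE on the test set and went with the lowest. Polynomial Regression was ruled out because its test R-squared was hugely negative (about -1.6e+20), and Ridge Regression came in slightly higher at 70090.99.
It was close between Linear Regression and Lasso Regression.

---
Linear Regression: 70056.18

Lasso Regression: 70056.76

---

I decided to go with the Linear Regression model: the gap of about 0.58 in RMSE is negligible, and Linear Regression is the simpler model, with no regularization strength to tune.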
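
The comparison can also be done in code. The sketch below copies the four test-set RMSE values printed earlier in this notebook into a dictionary and picks the smallest; the dictionary and variable names are my own for illustration.

```python
# Test-set RMSE values copied from the model outputs in this notebook
rmse_by_model = {
    "Linear Regression": 70056.18127996859,
    "Polynomial Regression": 1474906910786894.8,
    "Ridge Regression": 70090.98618798282,
    "Lasso Regression": 70056.76367111085,
}

# Select the model with the lowest RMSE
best_model = min(rmse_by_model, key=rmse_by_model.get)

# Print the models from best to worst
for name, value in sorted(rmse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} {value:,.2f}")

# Gap between the two closest models
gap = rmse_by_model["Lasso Regression"] - rmse_by_model["Linear Regression"]
print(best_model, round(gap, 2))
```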