
# Training a SDGRegressor model to predict prices of houses.

## Table of Contents
<ul>
    <li><a href="#Data Preparation">Data Preparation</a>
       <ul>
           <li><a href="#Read Data">Read Data</a></li>
           <li><a href="#Clean Data">Clean Data</a></li>
           <li><a href="#Feature Engineering">Feature Engineering</a></li>
           <li><a href="#Feature Scaling">Feature Scaling</a></li>
       </ul>
    <li><a href="#Model Training">Model Training</a>
    <li><a href="#Model Tuning">Model Tuning</a></li>
    <li><a href="#Model Testing">Model Testing</a></li>
       <ul>
           <li><a href="#Load Model">Load Model</a></li>
           <li><a href="#Predict Values">Predict Values</a></li>
           <li><a href="#Evaluate Model">Evaluate Model</a></li>
       </ul>
    <li><a href="#Compare Models">Compare Models</a></li>

</ul>

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action="ignore")

<a id='Data Preparation'></a>
## Data Preparation

<a id='Read Data'></a>
### Read Data
In this section:
>- we will read and split the data into train dataset and test dataset

In [2]:
housing = pd.read_csv('housing.csv')

In [3]:
# check fot null values
housing.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

>- There are 207 missing values in the total_bedrooms columns

In [4]:
# display stats on the housing data 
housing.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


#### split data into training  and testing datasets

In [5]:
from sklearn.model_selection import train_test_split
# split data into 80% train and 20% test
train, test = train_test_split(housing, test_size=0.2, random_state=42)

In [6]:
# confirm the split percentages
train_percent = (train.shape[0]/housing.shape[0]) * 100
test_percent = 100 - train_percent
print(train.shape, test.shape)
print(f'test: {test_percent}%, train: {train_percent}%')

(16512, 10) (4128, 10)
test: 20.0%, train: 80.0%


<a id='Clean Data'></a>
### Clean Data
In this section:
>- we will handle the missing values in the median_house_columns.
>- we will also handle categorical columns since machine learning works best with numberical values_we have to transform the text values to numerical values.
>- we will chose the best methods to achieve the above tasks

#### Handling missing values in the numerical columns
Note: This is a demonstration of how the missing data will be replaced 

In [7]:
from sklearn.impute import SimpleImputer

# create a SimpleInputer instance to replace the missing values in the numerical columns with the columns'  median  
imputer = SimpleImputer(strategy='median')

# drop categorical columns from training dataset
train_numerical_columns = train.drop('ocean_proximity', axis=1)
print(train_numerical_columns.dtypes)

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dtype: object


In [8]:
train_transformed = imputer.fit_transform(train_numerical_columns)
train_transformed

array([[-1.1703e+02,  3.2710e+01,  3.3000e+01, ...,  6.2300e+02,
         3.2596e+00,  1.0300e+05],
       [-1.1816e+02,  3.3770e+01,  4.9000e+01, ...,  7.5600e+02,
         3.8125e+00,  3.8210e+05],
       [-1.2048e+02,  3.4660e+01,  4.0000e+00, ...,  3.3600e+02,
         4.1563e+00,  1.7260e+05],
       ...,
       [-1.1838e+02,  3.4030e+01,  3.6000e+01, ...,  5.2700e+02,
         2.9344e+00,  2.2210e+05],
       [-1.2196e+02,  3.7580e+01,  1.5000e+01, ...,  5.5900e+02,
         5.7192e+00,  2.8350e+05],
       [-1.2242e+02,  3.7770e+01,  5.2000e+01, ...,  1.2420e+03,
         2.5755e+00,  3.2500e+05]])

>- This is array of how the numerical columns of the data would like 

#### Handling Categorical columns


In [9]:
# first let select the categorical columns from the train dataset
train_categircal_columns = train[['ocean_proximity']]
train_categircal_columns['ocean_proximity'].unique()

array(['NEAR OCEAN', 'INLAND', '<1H OCEAN', 'NEAR BAY', 'ISLAND'],
      dtype=object)

>- There are only 5 unique categories in the ocean_proximity columnn hence the OneHotEncoder approach will be used to handle the numerical columns

In [10]:
from sklearn.preprocessing import OneHotEncoder

# create a OneHotEncoder instance to handle the categorcal columns 
cat_encoder = OneHotEncoder()

# fit and transform the categorical data
train_cat_encoder = cat_encoder.fit_transform(train_categircal_columns)
train_cat_encoder

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

>- This creates a sparse array

In [11]:
# lets display the sparse array in a numpy array format
train_cat_encoder.toarray()

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

<a id='Feature Engineering'></a>
### Feature Engineering
In this section:
>- we will choose and create new features which will help us improve the model will train

In [12]:
train['room_per_household'] = train['total_rooms']/ train['households']
train['bedrooms_per_rooms'] = train['total_bedrooms']/train['total_rooms']
train['population_per_household'] = train['population']/train['households']

In [13]:
train.corr()['median_house_value'].sort_values(ascending=False)

median_house_value          1.000000
median_income               0.690647
room_per_household          0.158485
total_rooms                 0.133989
housing_median_age          0.103706
households                  0.063714
total_bedrooms              0.047980
population_per_household   -0.022030
population                 -0.026032
longitude                  -0.046349
latitude                   -0.142983
bedrooms_per_rooms         -0.257419
Name: median_house_value, dtype: float64

>- One of the created feature **room_per_household**  has a better correlation compared to some old features}

<a id='Feature Scaling'></a>
### Feature Scaling
In this section 
>- We will check the range_the max and min values in the columns and see how best we can standardize them
>- we will also create a pipeline with the selected approaches in the Clean Data section to handle missing values and text values  

In [14]:
y_train = train['median_house_value'].copy()
x_train = train.drop('median_house_value', axis=1)
train_num_columns = train.drop(['ocean_proximity', 'median_house_value'], axis=1)
train_num_columns

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,room_per_household,bedrooms_per_rooms,population_per_household
14196,-117.03,32.71,33.0,3126.0,627.0,2300.0,623.0,3.2596,5.017657,0.200576,3.691814
8267,-118.16,33.77,49.0,3382.0,787.0,1314.0,756.0,3.8125,4.473545,0.232703,1.738095
17445,-120.48,34.66,4.0,1897.0,331.0,915.0,336.0,4.1563,5.645833,0.174486,2.723214
14265,-117.11,32.69,36.0,1421.0,367.0,1418.0,355.0,1.9425,4.002817,0.258269,3.994366
2271,-119.80,36.78,43.0,2382.0,431.0,874.0,380.0,3.5542,6.268421,0.180940,2.300000
...,...,...,...,...,...,...,...,...,...,...,...
11284,-117.96,33.78,35.0,1330.0,201.0,658.0,217.0,6.3700,6.129032,0.151128,3.032258
11964,-117.43,34.02,33.0,3084.0,570.0,1753.0,449.0,3.0500,6.868597,0.184825,3.904232
5390,-118.38,34.03,36.0,2101.0,569.0,1756.0,527.0,2.9344,3.986717,0.270823,3.332068
860,-121.96,37.58,15.0,3575.0,597.0,1777.0,559.0,5.7192,6.395349,0.166993,3.178891


>- since the ranges for the variuos columns are not uniform they will have to be standardized

In [15]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a pipeline to handle the numerical columns of the training dataset
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')), ('std_scaler', StandardScaler())])

# Create a pipeline to handle the numerical columns of the training dataset
categorical_encoder = OneHotEncoder()

#### Create a full pipeline

In [16]:
from sklearn.compose import ColumnTransformer
# Separate the columns into categorical and numerical columns
numerical_attributes = list(train_num_columns)
categorical_attributes = ['ocean_proximity']


full_pipeline = ColumnTransformer([('num', num_pipeline, numerical_attributes), ('cat', categorical_encoder, categorical_attributes)])

x_train = full_pipeline.fit_transform(x_train)

print(x_train.shape)




(16512, 16)


<a id='Model Training'></a>
## Model Training

In [38]:
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# create an instace for SGDRegressor
sdg_reg_1 = SGDRegressor()
# train the model with train data 
sdg_reg_1.fit(x_train, y_train)


SGDRegressor()

In [39]:
# make predoictios with model created 
predictions = sdg_reg_1.predict(x_train)
predictions

array([180082.4625534 , 287140.89800384, 245588.79305412, ...,
       194946.47922533, 281342.22304792, 273424.49163324])

In [40]:
# Calcualate the root meann squared error to to evealute how the nodel is performing making predictions using the train features
mse = mean_squared_error(y_true=y_train,y_pred=predictions)
rmse = np.sqrt(mse)
rmse

67826.15311590998

<a id='Model Tuning'></a>
## Model Tuning

In [20]:
from sklearn.model_selection import GridSearchCV
param_grid =[{'loss': ['huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],  'learning_rate': ['optimal', 'invscaling', 'adaptive'],  'tol':[0.001, 0.008]}, \
             {'shuffle': [True], 'loss': ['huber', 'epsilon_insensitive'],  'learning_rate': ['optimal'], \
              'max_iter':[500, 1000], 'penalty':['l2', 'l1', 'elasticnet']}]


In [43]:
param_grid_2 = {
    'alpha': 10.0 ** -np.arange(1, 7),
    'loss': ['squared_loss', 'huber', 'epsilon_insensitive'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'invscaling'],
}

In [21]:
sgd_reg = SGDRegressor()
grid_search = GridSearchCV(sgd_reg,param_grid, scoring='neg_mean_squared_error', cv=10, return_train_score=True)


In [44]:
grid_search_2 = GridSearchCV(sgd_reg, param_grid_2, scoring='neg_mean_squared_error', cv=10, return_train_score=True)
grid_search_2.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=SGDRegressor(),
             param_grid={'alpha': array([1.e-01, 1.e-02, 1.e-03, 1.e-04, 1.e-05, 1.e-06]),
                         'learning_rate': ['constant', 'optimal', 'invscaling'],
                         'loss': ['squared_loss', 'huber',
                                  'epsilon_insensitive'],
                         'penalty': ['l2', 'l1', 'elasticnet']},
             return_train_score=True, scoring='neg_mean_squared_error')

In [45]:
grid_search_2.best_estimator_

SGDRegressor(alpha=1e-06, learning_rate='optimal', loss='epsilon_insensitive',
             penalty='l1')

In [22]:
grid_search.fit(x_train, y_train)

GridSearchCV(cv=10, estimator=SGDRegressor(),
             param_grid=[{'learning_rate': ['optimal', 'invscaling',
                                            'adaptive'],
                          'loss': ['huber', 'epsilon_insensitive',
                                   'squared_epsilon_insensitive'],
                          'tol': [0.001, 0.008]},
                         {'learning_rate': ['optimal'],
                          'loss': ['huber', 'epsilon_insensitive'],
                          'max_iter': [500, 1000],
                          'penalty': ['l2', 'l1', 'elasticnet'],
                          'shuffle': [True]}],
             return_train_score=True, scoring='neg_mean_squared_error')

In [23]:
grid_search.best_estimator_

SGDRegressor(learning_rate='adaptive', loss='epsilon_insensitive', tol=0.008)

In [41]:
# Save the model
import pickle 
filename = 'sgd_reg_housing_model_tuned.pkl'
filename_2 = 'sgd_reg_housing_model_original.pkl'


# filename = 'forest_housing_model.sav'

pickle.dump(grid_search.best_estimator_, open(filename, 'wb'))
pickle.dump(sdg_reg_1, open(filename_2, 'wb'))

model = pickle.load(open(filename, 'rb'))

<a id='Model Testing'></a>
## Model Testing

#### Data Preparation for Test Data

In [25]:
# ckeck the number of rows and columns in train and test datasets
test.shape

(4128, 10)

In [26]:
#check for null values in the test dataset
test.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

##### Feature Engineering

In [27]:
# Create the same features used in the traing of the model
test['room_per_household'] = test['total_rooms']/ test['households']
test['bedrooms_per_rooms'] = test['total_bedrooms']/test['total_rooms']
test['population_per_household'] = test['population']/test['households']


In [28]:
print(test.shape)

(4128, 13)


In [29]:
y_test = test['median_house_value'].copy() # create the target label for teh test data
x_test = test.drop('median_house_value', axis=1) # create the features for the test data
test_num = test.drop(['median_house_value', 'ocean_proximity'], axis=1)

In [30]:
x_test.shape

(4128, 12)

##### Clean and Scale data

In [31]:
# Use the same pipeline to clean and scale the features of test data
numerical_attribs = list(test_num)
categorical_attribs = ['ocean_proximity']


full_pipeline = ColumnTransformer([('num', num_pipeline, numerical_attribs), ('cat', categorical_encoder, categorical_attribs)])

x_test = full_pipeline.fit_transform(x_test)

In [32]:
print(x_test.shape)


(4128, 16)


<a id='Load Model'></a>
### Load Model

In [33]:
import pickle

In [34]:
model = pickle.load(open('sgd_reg_housing_model_tuned.pkl', 'rb'))


<a id='Predict Values'></a>
### Predict Values

In [35]:
predictions = model.predict(x_test)
predictions

array([122981.26400534, 126769.49373578, 129028.05473276, ...,
       144818.50420978, 125961.41475154, 131799.65254161])

<a id='Evaluate Model'></a>
### Evaluate Model

In [36]:
def evaluate_model(y_true, y_pred):
    mse = mean_squared_error(y_true=y_true, y_pred=y_pred) 
    rmse = np.sqrt(mse)
    return rmse
evaluate_model(y_test, predictions)

133357.4352561383

>- The tuned version of the model has a higher root mean squared error hence we will i will use the original model

<a id='Comapre Models'></a>
## Compare Models

In [42]:
rnf_model = pickle.load(open('forest_housing_model.pkl', 'rb')) # load the RandomForestModel
rnf_pred = rnf_model.predict(x_test)
sdg_model_1 = pickle.load(open('sgd_reg_housing_model_original.pkl', 'rb')) # load the first SGDRegressor model created
sgd_pred = sdg_model_1.predict(x_test)
print('RandomForestModel: ', evaluate_model(y_test, rnf_pred), 'SGDRegressor Model: ', evaluate_model(y_test, sgd_pred))

RandomForestModel:  70030.35065599649 SGDRegressor Model:  69282.40164000388


>- Comparing the two models now the SGDRegressor model achieved a lower root mean squared vlues and hence the preferred one.
>- Note the SgdRegressor model was trained several times.

In [47]:
pred = grid_search.best_estimator_.predict(x_test)

In [48]:
evaluate_model(y_test, pred)

133357.4352561383

In [49]:
y = y_test > 40000

In [50]:
y

20046    True
3024     True
15663    True
20484    True
9814     True
         ... 
15362    True
16623    True
18086    True
2144     True
3665     True
Name: median_house_value, Length: 4128, dtype: bool