## Imports 

In [1]:
import numpy as np
import pandas as pd 
import os
import matplotlib as mpl
import matplotlib.ticker as ticker
import sklearn 
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, Normalizer
from sklearn.preprocessing import OrdinalEncoder  # ordinal encoding, an alternative to one-hot
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_squared_error 
import category_encoders as ce #Encoding 

## Load Data into dataframe

We read housing.csv from the working directory into a pandas dataframe and preview the first five rows.

In [2]:
HousingData = pd.read_csv("housing.csv") 
HousingData.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


## General Statistics 

The .describe() method computes the five-number summary (min, 25%, 50%, 75%, max) along with the count, mean, and standard deviation of each numeric feature. The count row already hints at missing data: total_bedrooms has 20433 entries, against 20640 for every other column.

In [3]:
HousingData.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


Non-numeric features must be encoded before data mining or machine learning methods can be applied. Here we inspect the datatype of each feature using .info(). The only feature that needs encoding is ocean_proximity, which has the object dtype. The output also confirms that total_bedrooms is the only column with missing values.

In [4]:
HousingData.info()  # ocean_proximity is an object column and must be encoded

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
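The columns that need encoding can also be listed programmatically rather than read off the .info() output. The following is a minimal sketch on a small stand-in frame; the name `toy` and its values are illustrative assumptions, not part of the housing data.

```python
import pandas as pd

# Toy frame standing in for HousingData (values are illustrative only).
toy = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Every column that is not numeric is a candidate for encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Excluding "number" rather than including "object" keeps the check independent of how a given pandas version stores strings.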


## Preprocessing Technique #1: Handling NaN values before standardizing and encoding the categorical feature.

Combining isnull() with any(axis=1) flags every row that contains a NaN value in at least one feature. Leaving NaN values in place can skew statistics and affect the data mining algorithms used downstream of preprocessing, and many scikit-learn estimators will not accept them at all. There are many ways to fill NaN values. One of the most common is to replace them with the mean of the feature in which they occur.

In [5]:
NAN_location = HousingData[HousingData.isnull().any(axis=1)].head()  # first five rows with a NaN in any column

In [6]:
NAN_location

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
290,-122.16,37.77,47.0,1256.0,,570.0,218.0,4.375,161900.0,NEAR BAY
341,-122.17,37.75,38.0,992.0,,732.0,259.0,1.6196,85100.0,NEAR BAY
538,-122.28,37.78,29.0,5154.0,,3741.0,1273.0,2.5762,173400.0,NEAR BAY
563,-122.24,37.75,45.0,891.0,,384.0,146.0,4.9489,247100.0,NEAR BAY
696,-122.1,37.69,41.0,746.0,,387.0,161.0,3.9063,178400.0,NEAR BAY
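Since .head() only shows the first five affected rows, it can help to count the missing values per column as well. This is a small sketch on a stand-in frame; `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for HousingData (values are illustrative only).
toy = pd.DataFrame({
    "total_rooms": [1256.0, 992.0, 880.0, 7099.0],
    "total_bedrooms": [np.nan, np.nan, 129.0, 1106.0],
})

# Number of NaN values in each column.
missing_per_column = toy.isnull().sum()

# All rows that contain at least one NaN (no .head() truncation).
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(len(rows_with_missing))
```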


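Here we fill the NaN values in the 'total_bedrooms' column with the column mean using .fillna() and assign the result back to the column. Calling .fillna(..., inplace=True) on the selected column is avoided, since recent pandas versions warn that such chained in-place calls may not modify the original dataframe.

In [7]:
HousingData['total_bedrooms'] = HousingData['total_bedrooms'].fillna(HousingData['total_bedrooms'].mean())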
HousingData

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


We verify that 'total_bedrooms' no longer contains NaN values using .isnull().values.any(), which returns False when none remain.

In [8]:
HousingData['total_bedrooms'].isnull().values.any() #verification of no NAN values 

False
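The cell above computes the mean over the full dataset. When a train/test split is planned, one common alternative is to learn the mean from the training rows only and reuse it for the test rows. The sketch below uses scikit-learn's SimpleImputer on made-up arrays; `X_train` and `X_test` are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature arrays; NaN marks a missing value.
X_train = np.array([[100.0], [200.0], [np.nan], [300.0]])
X_test = np.array([[np.nan], [400.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the training mean and fills the training NaN with it.
X_train_filled = imputer.fit_transform(X_train)

# transform reuses the training mean, so no test information leaks into it.
X_test_filled = imputer.transform(X_test)

print(X_train_filled.ravel())
print(X_test_filled.ravel())
```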

In [9]:
y_data = HousingData['median_house_value']  # set the label aside
# keep only the numeric features: drop the label and the not-yet-encoded categorical column
x_data = HousingData.drop(['median_house_value', 'ocean_proximity'], axis=1)

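## Preprocessing Technique #2: Encoding the ocean_proximity categorical feature. The string variable is encoded prior to standardization for ridge regression.

One alternative to one-hot encoding is to apply a label encoder to ocean_proximity. A label encoder maps each category to an integer, which implies an ordering among the categories that a linear model will treat as meaningful. The label-encoder cell below is commented out, and the notebook uses one-hot encoding instead.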

In [10]:
#Ocean_proximity_encoder = preprocessing.LabelEncoder() # label encoder used for ocean proximity feature.
#HousingData['ocean_proximity'] = Ocean_proximity_encoder.fit_transform(HousingData['ocean_proximity']) #select target feature and apply fit.transform function to that column

In [11]:
onehot = pd.get_dummies(HousingData.ocean_proximity, prefix='op_')
onehot.head()

Unnamed: 0,op__<1H OCEAN,op__INLAND,op__ISLAND,op__NEAR BAY,op__NEAR OCEAN
0,0,0,0,1,0
1,0,0,0,1,0
2,0,0,0,1,0
3,0,0,0,1,0
4,0,0,0,1,0
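To make the difference between the two encodings concrete, the sketch below applies both to a handful of category values. The list `values` is an assumption for illustration. OrdinalEncoder is used here in place of LabelEncoder because scikit-learn intends LabelEncoder for target labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few illustrative category values (not the full dataset).
values = ["NEAR BAY", "INLAND", "ISLAND", "INLAND"]

# Ordinal encoding: one integer per category, assigned in sorted order
# (INLAND=0, ISLAND=1, NEAR BAY=2), which implies an artificial ranking.
ordinal = OrdinalEncoder().fit_transform(np.array(values).reshape(-1, 1))

# One-hot encoding: one indicator column per category, no implied order.
onehot = pd.get_dummies(pd.Series(values), prefix="op")

print(ordinal.ravel())
print(onehot)
```

With one-hot encoding every row has exactly one active indicator, so no category is treated as larger or smaller than another.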


In [12]:
enc = OneHotEncoder(handle_unknown='ignore')  # defined but not used in this cell; get_dummies does the encoding
enc_df = pd.get_dummies(HousingData.ocean_proximity,prefix='op_')
housing_df = HousingData.join(enc_df)
housing_df = housing_df.drop(['ocean_proximity'],axis =1)
housing_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,op__<1H OCEAN,op__INLAND,op__ISLAND,op__NEAR BAY,op__NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,0,1,0,0,0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,0,1,0,0,0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,0,1,0,0,0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,0,1,0,0,0


In [13]:
HousingData["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

## Preprocessing Technique #3: Z-score normalization for ridge regression. Because ridge regression applies a penalty to the size of the coefficients, the features must be on a common scale for that penalty to act equally on all of them.


In [14]:
#scaler = StandardScaler() # define scaler to be used (standardized needed for ridge regression)
#scaled_HousingData = scaler.fit_transform(HousingData)
#X_scaled = preprocessing.scale(HousingData) 
#transformer = Normalizer().fit(HousingData)

scaler = StandardScaler()
scaler.fit(x_data)
scaled_df = pd.DataFrame(scaler.transform(x_data),columns = x_data.columns)
scaled_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
0,-1.327835,1.052548,0.982143,-0.804819,-0.975228,-0.974429,-0.977033,2.344766
1,-1.322844,1.043185,-0.607019,2.04589,1.355088,0.861439,1.669961,2.332238
2,-1.332827,1.038503,1.856182,-0.535746,-0.829732,-0.820777,-0.843637,1.782699
3,-1.337818,1.038503,1.856182,-0.624215,-0.722399,-0.766028,-0.733781,0.932968
4,-1.337818,1.038503,1.856182,-0.462404,-0.615066,-0.759847,-0.629157,-0.012881


In [15]:
print(scaled_df.mean(axis=0)) #verify mean 0
print(scaled_df.std(axis=0)) # verify std of 1

longitude            -6.527810e-15
latitude              1.256263e-15
housing_median_age    8.557001e-16
total_rooms           1.475181e-16
total_bedrooms        2.724793e-16
population           -6.465442e-17
households            2.139358e-16
median_income         3.734255e-16
dtype: float64
longitude             1.000024
latitude              1.000024
housing_median_age    1.000024
total_rooms           1.000024
total_bedrooms        1.000024
population            1.000024
households            1.000024
median_income         1.000024
dtype: float64


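As we can see, the mean of every feature is 0 up to floating-point error (values on the order of 1e-15 or smaller), and the standard deviation is 1.000024. The standard deviation is not exactly 1 because StandardScaler divides by the population standard deviation (ddof=0), while pandas .std() reports the sample standard deviation (ddof=1) by default. With 20640 rows the two differ by a factor of sqrt(20640/20639), which is about 1.000024.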
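The 1.000024 in the output can be reproduced on any column of the same length. The sketch below uses randomly generated data as a stand-in (an assumption; the real features are not needed to show the effect) and compares the two ddof conventions after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random stand-in column with the same number of rows as the housing data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(loc=5.0, scale=2.0, size=20640)})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled["x"].std(ddof=0)     # convention used by StandardScaler
sample_std = scaled["x"].std(ddof=1)  # pandas default
expected = np.sqrt(20640 / 20639)     # ratio between the two conventions

print(pop_std, sample_std, expected)
```

Passing ddof=0 to .std() in the verification cell would therefore report a standard deviation of 1 up to floating-point error.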

The numeric features are now standardized in scaled_df and ocean_proximity is one-hot encoded in housing_df, so the next steps, dimensionality reduction via PCA or ridge regression, can take place.
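As a rough sketch of what the ridge regression step might look like, the example below fits a scaled Ridge model on synthetic data. The data, the coefficients, alpha=1.0, and the 80/20 split are all assumptions for illustration and not results from the housing dataset. Wrapping the scaler and model in a pipeline means the scaler is fit on the training rows only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 0.01])
y = 3.0 * X[:, 0] + 0.02 * X[:, 1] + 50.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardize, then fit ridge; the scaler only ever sees the training rows.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

score = r2_score(y_test, model.predict(X_test))
print(score)
```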