# Iterating The Machine Learning Model With Normalized Data

In this notebook I repeat the normalization step that did not work out in the original run.
I then apply the normalized data to our models and check whether it has a significant impact.
I skip visualizing the data in this notebook. Please refer to the data wrangling notebook [here](https://github.com/Caparisun/Linear_Regression_Project/blob/master/Notebooks_and_data/2.Datawrangling.ipynb).

In [10]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing
import statsmodels.formula.api as sm

In [11]:
# define column header names in a list
names=["id", "date", "bedrooms", "bathrooms","sqft_living", "sqft_lot", "floors", "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode", "lat", "long", "sqft_living15", "sqft_lot15", "price"]

In [12]:
# Import data into pandas dataframe
df = pd.read_csv('regression_data.csv',names=names )

In [13]:
# check if the import worked 
df.head()

|  | id | date | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | ... | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/14 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | ... | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 | 221900 |
| 1 | 6414100192 | 12/9/14 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | ... | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 | 538000 |
| 2 | 5631500400 | 2/25/15 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | ... | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 | 180000 |
| 3 | 2487200875 | 12/9/14 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | ... | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 | 604000 |
| 4 | 1954400510 | 2/18/15 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | ... | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 | 510000 |


In [14]:
# we will directly drop the columns "ID", "lat", "long", "yr_built" and "date" 
# since they provide no value to our analysis according to the correlation matrix

df_drop=df.drop(['id', 'lat','long','yr_built','date'], axis = 1) 
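
The drop above relies on a correlation matrix from the data wrangling notebook. As a rough sketch of how such a ranking can be produced, the snippet below computes each column's absolute Pearson correlation with `price` on a small made-up frame. The frame `toy`, the `threshold` of 0.3, and the variable names are assumptions for illustration, not the values used in this project.

```python
import pandas as pd

# Toy stand-in for the housing data; the "id" values are made up for illustration.
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "bedrooms": [3, 3, 2, 4, 3],
    "id": [1, 5, 4, 2, 3],
    "price": [221900, 538000, 180000, 604000, 510000],
})

# Absolute Pearson correlation of every column with the target, strongest first.
corr_with_price = (
    toy.corr()["price"].drop("price").abs().sort_values(ascending=False)
)
print(corr_with_price)

# Hypothetical cut-off: columns below it are treated as uninformative.
threshold = 0.3
weak_columns = corr_with_price[corr_with_price < threshold].index.tolist()
toy_reduced = toy.drop(columns=weak_columns)
print(toy_reduced.columns.tolist())
```

A low linear correlation does not prove a column is useless (location columns, for instance, can matter in non-linear ways), so a ranking like this is a starting point rather than a rule.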


In [15]:
# check again if the columns were removed successfully
df_drop.head()

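|  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_renovated | zipcode | sqft_living15 | sqft_lot15 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 0 | 98178 | 1340 | 5650 | 221900 |
| 1 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1991 | 98125 | 1690 | 7639 | 538000 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 0 | 98028 | 2720 | 8062 | 180000 |
| 3 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 0 | 98136 | 1360 | 5000 | 604000 |
| 4 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 0 | 98074 | 1800 | 7503 | 510000 |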


In [16]:
# define a quick function to convert the year renovated into a boolean value
# we know from our real estate experience that a renovation has an impact on the price,
# but the year of renovation usually doesn't matter, only the fact that the house was renovated

def boolean(x):
    # 1 if a renovation year is recorded, 0 otherwise
    # (the single return also avoids an unassigned result for unexpected values)
    return 1 if x > 0 else 0
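
The same 0/1 flag can also be computed without a row-wise function by comparing the whole column at once. The sketch below uses a small made-up series; `yr_renovated` and `renovated_flag` are illustrative names, and it assumes, as the function above does, that 0 means "never renovated".

```python
import pandas as pd

# Toy renovation years; 0 stands for "never renovated".
yr_renovated = pd.Series([0, 1991, 0, 2005])

# The comparison yields booleans; astype(int) turns them into 0/1.
renovated_flag = (yr_renovated > 0).astype(int)
print(renovated_flag.tolist())  # [0, 1, 0, 1]
```

On a column of this size the speed difference is negligible; the main benefit is that the intent fits on one line.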

In [17]:
# apply boolean function to yr_renovated column
df_drop['yr_renovated']=df['yr_renovated'].apply(boolean)
# check if that worked by looking at the head again
df_drop.head()

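|  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_renovated | zipcode | sqft_living15 | sqft_lot15 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 0 | 98178 | 1340 | 5650 | 221900 |
| 1 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1 | 98125 | 1690 | 7639 | 538000 |
| 2 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 0 | 98028 | 2720 | 8062 | 180000 |
| 3 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 0 | 98136 | 1360 | 5000 | 604000 |
| 4 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 0 | 98074 | 1800 | 7503 | 510000 |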


In [18]:
# identify outliers in the bedrooms column
i = df_drop[df_drop.bedrooms > 10]
i

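|  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_renovated | zipcode | sqft_living15 | sqft_lot15 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8748 | 11 | 3.0 | 3000 | 4960 | 2.0 | 0 | 0 | 3 | 7 | 2400 | 600 | 1 | 98106 | 1420 | 4960 | 520000 |
| 15856 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 0 | 98103 | 1330 | 4700 | 640000 |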


In [19]:
# drop the 33-bedroom outlier (index label 15856) from the bedrooms column
df_drop = df_drop.drop(index=15856)
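
Dropping by a hard-coded row number works here because the frame still carries its default index. As a hedged alternative, the sketch below shows the same removal by index label and by a boolean condition on a made-up frame; the cut-off of 30 bedrooms and the names `toy`, `by_label`, and `by_mask` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with a non-default index to show why labels matter.
toy = pd.DataFrame(
    {"bedrooms": [3, 11, 33, 4], "price": [221900, 520000, 640000, 604000]},
    index=[10, 20, 30, 40],
)

# Option 1: drop by index label.
by_label = toy.drop(index=30)

# Option 2: keep only rows that satisfy a condition (hypothetical cut-off).
by_mask = toy[toy["bedrooms"] < 30]

print(by_label.equals(by_mask))  # True
```

The condition-based form keeps working if the data is re-imported in a different row order, at the cost of having to justify the cut-off.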

## Normalizing the data

In [20]:
# normalizing the columns so the values are rescaled to the range 0 to 1
# we use preprocessing from sklearn to achieve this

x = df_drop.values # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler() # choose the scaler used for normalizing
x_scaled = min_max_scaler.fit_transform(x) # create normalized values
df_norm = pd.DataFrame(x_scaled) # create new normalized dataframe

In [21]:
# define headers for the normalized dataframe
df_norm.columns =["bedrooms", "bathrooms","sqft_living", "sqft_lot", "floors", "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement", "yr_renovated", "zipcode", "sqft_living15",  "sqft_lot15", "price"]
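
Min-max scaling maps each value to `(x - min) / (max - min)` within its column. Because the scaler above returns a bare array, the column names have to be reattached by hand, and the new frame gets a fresh 0..n-1 index instead of the gapped index left by the dropped row. The sketch below applies the same formula directly in pandas on a made-up frame so labels and index are preserved; `toy`, `col_min`, `col_range`, and `toy_norm` are illustrative names, and this is not a drop-in replacement for the scaler object.

```python
import pandas as pd

toy = pd.DataFrame(
    {"bedrooms": [1, 3, 11], "sqft_living": [770, 1180, 2570]},
    index=[0, 1, 3],  # gap in the index, as after dropping a row
)

# x' = (x - min) / (max - min), computed column by column.
col_min = toy.min()
col_range = toy.max() - col_min
toy_norm = (toy - col_min) / col_range

print(toy_norm)
```

In this toy frame a value of 3 bedrooms with a column minimum of 1 and maximum of 11 becomes (3 - 1) / (11 - 1) = 0.2. A constant column would have a range of zero and produce NaN with this formula, so it would need separate handling.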

In [25]:
# check if the normalization worked by looking at the first two rows
df_norm.head(2)

|  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_renovated | zipcode | sqft_living15 | sqft_lot15 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.066667 | 0.061503 | 0.003108 | 0.0 | 0.0 | 0.0 | 0.5 | 0.4 | 0.089602 | 0.0 | 0.0 | 0.893939 | 0.161934 | 0.005742 | 0.01888 |
| 1 | 0.2 | 0.233333 | 0.167046 | 0.004072 | 0.4 | 0.0 | 0.0 | 0.5 | 0.4 | 0.199115 | 0.082988 | 1.0 | 0.626263 | 0.222165 | 0.008027 | 0.060352 |
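
Two rows only show that the values look plausible. A fuller check is that every column has a minimum of 0 and a maximum of 1 after scaling. The sketch below runs that check on a made-up stand-in for the normalized frame; `toy_norm` and `ranges_ok` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the normalized dataframe.
toy_norm = pd.DataFrame({"bedrooms": [0.0, 0.2, 1.0], "price": [0.0, 0.35, 1.0]})

# Per-column extremes; describe() shows the same numbers in a compact table.
mins = toy_norm.min()
maxs = toy_norm.max()
print(toy_norm.describe().loc[["min", "max"]])

# isclose tolerates tiny floating point deviations from exactly 0 and 1.
ranges_ok = bool(np.isclose(mins, 0.0).all() and np.isclose(maxs, 1.0).all())
print(ranges_ok)
```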


### I will now export this data as a CSV file and use it in a copy of the applying_model notebook. 
Please refer to [this](https://github.com/Caparisun/Linear_Regression_Project/blob/master/Notebooks_and_data/3.Applying_Model.ipynb) notebook to see the original.

In [26]:
df_norm.to_csv('norm_model.csv')
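
By default `to_csv` also writes the row index, which shows up as an extra `Unnamed: 0` column when the file is read back with `read_csv`. The sketch below demonstrates the difference on a made-up frame, using in-memory buffers instead of files; whether the downstream notebook expects that extra column is not known here, so treat `index=False` as an option rather than a correction.

```python
import io
import pandas as pd

toy_norm = pd.DataFrame({"bedrooms": [0.2, 0.2], "price": [0.01888, 0.060352]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_norm.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'bedrooms', 'price']

# index=False writes only the data columns.
without_index = io.StringIO()
toy_norm.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['bedrooms', 'price']
```

If the file was already written with the index, reading it with `pd.read_csv(..., index_col=0)` restores the original layout.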


***