<B><I>The code below is my solution to the second lab in the [Supervised Machine Learning Course](http://www.coursera.org/learn/supervised-machine-learning-regression/home/welcome) by IBM on Coursera. I hope you learn a thing or two from it, like I did 😊😊</I></B>

# Machine Learning Foundation

## Course 2, Part b: Regression Setup, Train-test Split LAB 

## Introduction

We will be working with a data set based on [housing prices in Ames, Iowa](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). It was compiled for educational use to be a modernized and expanded alternative to the well-known Boston Housing dataset. This version of the data set has had some missing values filled for convenience.

There are a large number of features, so a selection of them is described in the list below.

### Target

* SalePrice: The property's sale price in dollars. 

### Features

* MoSold: Month Sold
* YrSold: Year Sold   
* SaleType: Type of sale
* SaleCondition: Condition of sale
* MSSubClass: The building class
* MSZoning: The general zoning classification
* ...

## Question 1

* Import the data using Pandas and examine the shape. There are 79 feature columns plus the target, the sale price (`SalePrice`).
* There are three different types: integers (`int64`), floats (`float64`), and strings (`object`, categoricals). Examine how many there are of each data type. 

In [None]:
import pandas as pd
import numpy as np

# Import the data using the file path
data = pd.read_csv("../input/regression/Ames_Housing_Sales.csv")

data.shape

In [None]:
data.info()
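
`data.info()` prints one row per column, which gets long with 80 columns. A shorter way to count the columns of each data type is `dtypes.value_counts()`. The sketch below uses a small made-up frame (`toy`, with hypothetical values) so that it runs without the CSV file; on the real data the equivalent call would be `data.dtypes.value_counts()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Ames data
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],      # integers
    "LotFrontage": [65.0, 80.0, 68.0],   # floats
    "MSZoning": ["RL", "RL", "RM"],      # string categorical
    "SaleType": ["WD", "WD", "New"],     # string categorical
})

# Count how many columns there are of each data type
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```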

## Question 2

A significant challenge, particularly when dealing with data that have many columns, is ensuring each column gets encoded correctly. 

This is particularly true with data columns that are ordered categoricals (ordinals) vs unordered categoricals. Unordered categoricals should be one-hot encoded; however, this can significantly increase the number of features and create features that are highly correlated with each other.

Determine how many total features would be present, relative to what currently exists, if all string (object) features are one-hot encoded. Recall that the total number of one-hot encoded columns is `n-1`, where `n` is the number of categories.

In [None]:
#Create a copy of the dataframe so that changes won't be applied to the original one
X=data.copy()
y=X.pop('SalePrice') #Extract the target variable from the dataframe

#Select the object/categorical columns 
cat_cols =[col for col in X.columns if X[col].dtype=='object']

#Select the numerical columns 
num_cols =[col for col in X.columns if X[col].dtype in ['int64','float64']]
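
The cell above separates the string and numerical columns but does not compute the count that Question 2 asks for. One possible way to do it, assuming the `n-1` rule stated in the question, is to sum `nunique() - 1` over the string columns. The frame `toy` and its values are hypothetical and only serve to keep the example runnable on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Ames features
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],   # 3 categories
    "SaleType": ["WD", "New", "WD", "WD"],  # 2 categories
    "LotArea": [8450, 9600, 11250, 9550],   # numeric, left untouched
})

# Treat every non-numeric column as a string categorical
cat_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# A column with n categories becomes n-1 one-hot columns
n_categories = toy[cat_cols].nunique()
ohe_cols = int((n_categories - 1).sum())

# Net change: the one-hot columns replace the original string columns
extra_cols = ohe_cols - len(cat_cols)
print("one-hot columns:", ohe_cols, "| net extra columns:", extra_cols)
```

Note that a `OneHotEncoder` created without the `drop` argument keeps all `n` columns per feature, so its column count will be higher than the `n-1` figure.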

In [None]:
#Run the following code for easier interaction with the dataframe
pd.set_option('display.max_columns',None)

## Question 3

Let's create a new data set where all of the above categorical features will be one-hot encoded. We can fit this data and see how it affects the results.

* Use the dataframe `.copy()` method to create a completely separate copy of the dataframe for one-hot encoding.
* On this new dataframe, one-hot encode each of the appropriate columns and add it back to the dataframe. Be sure to drop the original column.
* For the data that are not one-hot encoded, drop the columns that are string categoricals.

For the first step, numerically encoding the string categoricals, either Scikit-learn's `LabelEncoder` or `DictVectorizer` can be used. However, the former is probably easier since it doesn't require specifying a numerical value for each category, and we are going to one-hot encode all of the numerical values anyway. (Can you think of a time when `DictVectorizer` might be preferred?)

In [None]:
#Encode the categorical variables
from sklearn.preprocessing import OneHotEncoder
#Note: scikit-learn >= 1.2 uses sparse_output=False; older versions used sparse=False
ohe=OneHotEncoder(sparse_output=False)
enc_cat=pd.DataFrame(ohe.fit_transform(X[cat_cols]),index=X.index)

#OneHotEncoder returns a plain array without column names. We put them back using the code below
#(note that each column is a unique feature with a value of either 1 or 0)
#Passing cat_cols gives names such as MSZoning_RL instead of x0_RL, which keeps the names unique
#(get_feature_names_out needs scikit-learn >= 1.0; older versions used get_feature_names)
enc_cat.columns=ohe.get_feature_names_out(cat_cols)
enc_cat.columns

#Create a new dataframe that has the numerical columns and the encoded categorical ones
new_df=pd.concat([X[num_cols],enc_cat],axis=1)

#Check
new_df.shape

In [None]:
# Determine how many extra columns would be created
enc_cat.shape[1]-len(cat_cols)

*We may also use `pd.get_dummies`, as shown below. However, the sklearn encoder is usually the safer choice for modeling, because once it is fitted it can apply the same encoding to new data. For a more detailed discussion, check this answer on [stackoverflow](https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki)*

In [None]:
gd_enc=pd.get_dummies(X[cat_cols])
gd_enc.head()

## Question 4

* Create train and test splits of both data sets. To ensure the data gets split the same way, use the same `random_state` in each of the two splits.
* For each data set, fit a basic linear regression model on the training data. 
* Calculate the mean squared error on both the train and test sets for the respective models. Which model produces smaller error on the test data and why?
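
To see why a shared `random_state` matters, the sketch below splits two different feature frames that describe the same rows. The frames `X_a` and `X_b` are hypothetical stand-ins for the non-encoded and one-hot encoded data sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for two versions of the feature matrix (same rows)
X_a = pd.DataFrame({"a": np.arange(10)})
X_b = pd.DataFrame({"b": np.arange(10) * 2, "c": np.arange(10) * 3})
y = pd.Series(np.arange(10))

Xa_train, Xa_test, ya_train, ya_test = train_test_split(X_a, y, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_b, y, random_state=0)

# Same seed and same number of rows, so the same rows land in each test set
same_rows = Xa_test.index.equals(Xb_test.index)
print("identical test rows:", same_rows)
```

Because both models are then evaluated on exactly the same houses, any difference in test error comes from the features rather than from the split.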

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error 
from sklearn.model_selection import train_test_split

#First predict the sale price using only the numerical columns of the original dataset
x_train,x_test,y_train,y_test=train_test_split(X[num_cols],y,random_state=0)
lr=LinearRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
msr_org=mean_squared_error(y_test,y_pred) #Argument order is (y_true, y_pred)

In [None]:
#Now predict the sale price using the one-hot encoded dataset
x_train_ohe,x_test_ohe,y_train_ohe,y_test_ohe=train_test_split(new_df,y,random_state=0)
lr2=LinearRegression()
lr2.fit(x_train_ohe,y_train_ohe)
y_pred_ohe=lr2.predict(x_test_ohe)
msr_ohe=mean_squared_error(y_test_ohe,y_pred_ohe) #Argument order is (y_true, y_pred)

In [None]:
print('-For the original dataframe the MSE is:',round(msr_org,2),'\n','\n'+'-For the encoded dataframe the MSE is:',round(msr_ohe,2))
print('\n'+"-The mean squared error increased by a factor of:",round(msr_ohe/msr_org,2))

Note that the error values on the one-hot encoded data are very different for the train and test data. In particular, the errors on the test data are much higher. Based on the lecture, this is because the one-hot encoded model is overfitting the data. We will learn how to deal with issues like this in the next lesson.
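
The comparison between train and test error can be made explicit with a small helper that returns both values. This is a minimal sketch: the function name `train_test_mse` is hypothetical, and the data are synthetic (many features, few rows, only one feature that matters), chosen to reproduce the overfitting pattern rather than taken from the Ames data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def train_test_mse(X, y, random_state=0):
    """Fit a linear regression and return (train MSE, test MSE)."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse

# Synthetic data (an assumption for illustration): 60 rows, 40 features,
# but only the first feature actually drives the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 40))
y_syn = 3 * X_syn[:, 0] + rng.normal(size=60)

train_mse, test_mse = train_test_mse(X_syn, y_syn)
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

In this notebook the same helper could be called as `train_test_mse(X[num_cols], y)` and `train_test_mse(new_df, y)`. A test error far above the train error is the usual sign of overfitting.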

## Question 5

For each of the data sets (one-hot encoded and not encoded):

* Scale all the non-one-hot-encoded values using one of the following: `StandardScaler`, `MinMaxScaler`, `MaxAbsScaler`.
* Compare the error calculated on the test sets

Be sure to calculate the skew (to decide if a transformation should be done) and fit the scaler on *ONLY* the training data, but then apply it to both the train and test data identically.
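
A minimal sketch of the skew check and the train-only fit is given here. Several things in it are assumptions for illustration: the toy columns are randomly generated, the cutoff `skew_limit = 0.75` is a rule of thumb rather than a fixed rule, and `np.log1p` is only one possible transformation for non-negative, right-skewed columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one right-skewed column, one roughly symmetric column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9, sigma=1, size=200),
    "YearBuilt": rng.normal(loc=1970, scale=20, size=200),
})

train, test = train_test_split(toy, random_state=0)
train, test = train.copy(), test.copy()

# Measure skew on the training rows only
skew = train.skew()
skew_limit = 0.75  # assumed cutoff
skewed_cols = skew[skew.abs() > skew_limit].index.tolist()

# Apply the same transformation to both splits
train[skewed_cols] = np.log1p(train[skewed_cols])
test[skewed_cols] = np.log1p(test[skewed_cols])

# Fit the scaler on train, then reuse the fitted scaler on test
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)
print("log-transformed columns:", skewed_cols)
```

Fitting the scaler on the training rows alone keeps information about the test set out of the preprocessing step.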

In [None]:
#We may answer the above question by using the simple code below

#Step one: import the chosen scalers and build the function
from sklearn.preprocessing import StandardScaler,MinMaxScaler,MaxAbsScaler
def msr_for_scaled(x,y,scaler):
    #Declare the actual and predicted values as global so they may be used later for building plots
    #(after the loop below they hold the values from the last scaler that was run)
    global y_test , y_pred_ss
    x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)
    ss=scaler()
    sca_tr_val=ss.fit_transform(x_train) #Fit the scaler on the training data only
    sca_te_val=ss.transform(x_test)      #Apply the same fitted scaler to the test data
    lr=LinearRegression()
    lr.fit(sca_tr_val,y_train)
    y_pred_ss=lr.predict(sca_te_val)
    msr_ss=mean_squared_error(y_test,y_pred_ss)
    print('The mean squared error for prediction using the',scaler.__name__,'method is:',round(msr_ss,2))
    return msr_ss

#Step two: calculate the mean squared error values for each of the scaling methods
for scaler in [StandardScaler,MinMaxScaler,MaxAbsScaler]:
    msr_for_scaled(X[num_cols],y,scaler)
    print("\n")

## Question 6

Plot predictions vs actual for one of the models.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


sns.set_context('talk')
sns.set_style('ticks')
sns.set_palette('dark')

ax = plt.axes()
# we are going to use y_test and y_pred_ss (set by the last scaler run in Question 5)
ax.scatter(y_test, y_pred_ss, alpha=0.7)

ax.set(xlabel='Ground truth', 
       ylabel='Predictions',
       title='Ames, Iowa House Price Predictions vs Truth, using Linear Regression');

<div class="alert alert-block alert-info">
<B><I>That's it. Thanks for your time. Keep on learning and spreading the knowledge.</I></B>

<I>Until next time 🖐🖐</I>
</div>

---
### Machine Learning Foundation (C) 2020 IBM Corporation