#### In this lesson, we are going to build a machine learning model step by step.

##  Understanding The Problem
The first step to building a machine learning model is to identify and understand the problem statement. Before starting on any project, make sure you understand its objectives and requirements. You need to know what problem you are trying to solve before attempting to solve it.

We will be working on [Kaggle's Housing Prices Competition](https://www.kaggle.com/c/home-data-for-ml-course/overview) using the Ames Housing dataset. The sole purpose of our model will be to predict sale prices of houses in Ames, Iowa.

##  Getting The Data


We will load our data using pandas. The read_csv() function returns a pandas dataframe containing the data; we load the training file and the test file separately.

In [1]:
import pandas as pd
df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

##  Exploring The Data
Exploring the data is necessary to gain insights about it. First, we are going to take a quick look at the first few rows using the head() method.

In [2]:
df.head()

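| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |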


Each row represents a home. There are 81 columns. Now, let's get a quick description of the data using the info() method.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
...

From this we can see that most of our data are categorical attributes. The dataset has 1460 entries in total, which is fairly small by machine learning standards. Some of the attributes have missing values. For example, the *PoolQC* attribute has only seven non-null values, meaning that 1453 of its values are missing. We will take care of this in the Dealing With Missing Values section. There is also the describe() method, which shows a summary of the numerical attributes.

In [4]:
df.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


## Cleaning The Data
A large part of Data Science has to do with data exploration and cleaning. As our data contains a lot of categorical features and missing values, we are going to clean it up before performing any serious EDA.
#### Dealing With Missing Values
Most Machine Learning algorithms cannot work with missing values, which makes our dataset unusable for building models in its current state. To fix this, there are two basic options:
- Get rid of the attribute with missing values
- Set the missing values to some value (median, mean, zero, -99999, etc.)

Let's check the total number of null values of each feature.

In [5]:
df.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

We can see that some of our features consist almost entirely of null values. Since we have no reliable way to recover these values, we will drop the features that have too many of them.

In [6]:
def drop_feats(dataframe):
    # Drop every column that has 500 or more missing values.
    for i in dataframe.columns:
        if dataframe[i].isnull().sum() >= 500:
            dataframe = dataframe.drop(i, axis=1)
    return dataframe

The function above takes a dataframe and drops every feature that has 500 or more missing values. Let's apply it to our dataset.
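To see which columns this cutoff removes, the same rule can be applied by hand to the missing-value counts reported earlier in this lesson. This is only a plain-Python sketch: the dictionary `missing_counts` and the helper `columns_to_drop` are names made up for illustration, and only a handful of columns are included.

```python
# Missing-value counts copied from the outputs shown earlier in this lesson.
# Only a few columns are listed; the names below are illustrative.
missing_counts = {
    "LotFrontage": 259,
    "Alley": 1369,
    "MasVnrType": 8,
    "FireplaceQu": 690,
    "GarageType": 81,
    "PoolQC": 1453,
}
N_ROWS = 1460    # number of entries reported by df.info()
THRESHOLD = 500  # same cutoff as drop_feats


def columns_to_drop(counts, threshold):
    """Return the columns whose missing count reaches the threshold."""
    return [name for name, n_missing in counts.items() if n_missing >= threshold]


for name, n_missing in missing_counts.items():
    share = n_missing / N_ROWS
    print(f"{name:<12} {n_missing:>5} missing ({share:.1%})")

print(columns_to_drop(missing_counts, THRESHOLD))
```

Columns such as LotFrontage fall below the cutoff, so they are kept and their gaps are filled in later.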

In [7]:
df_dropped = drop_feats(df)

Now let's check out our new dataset info.

In [8]:
df_dropped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
...

The number of columns in our dataset has been reduced from 81 to 76. By running the function, we've managed to drop the 'Alley', 'FireplaceQu', 'PoolQC', 'Fence' and 'MiscFeature' columns in one go.

#### Handling Categorical Variables
Now, we have to convert our categorical features to numbers in order to be able to work with them. Scikit-Learn has a series of built-in encoder classes just for this purpose. We will be using the *LabelEncoder* class for this dataset.

In [9]:
from sklearn.preprocessing import LabelEncoder

LabelEncoder works by mapping each category to an integer, starting from zero. For example:

In [10]:
df_dropped['Street'].value_counts()

Pave    1454
Grvl       6
Name: Street, dtype: int64

As we can see, the 'Street' feature has two categories. LabelEncoder assigns codes to the categories in sorted order, so applying it to this feature maps 'Grvl' to 0 and 'Pave' to 1.
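The mapping can be reproduced in a few lines of plain Python. This sketch assumes the behaviour documented for scikit-learn's LabelEncoder, where the learned classes are kept in sorted order, so the codes follow alphabetical order rather than frequency. The helper names `fit_labels` and `encode` are hypothetical and exist only for this illustration.

```python
def fit_labels(values):
    """Map each distinct category to an integer, in sorted order."""
    classes = sorted(set(values))
    return {category: code for code, category in enumerate(classes)}


def encode(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[value] for value in values]


# A toy 'Street' column with the two categories seen above.
street = ["Pave", "Pave", "Grvl", "Pave"]
mapping = fit_labels(street)
print(mapping)                  # {'Grvl': 0, 'Pave': 1}
print(encode(street, mapping))  # [1, 1, 0, 1]
```

Because the codes are assigned alphabetically, the most frequent category does not necessarily receive the code 0.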

Moving on, we notice that some of our categorical features still contain null values. LabelEncoder cannot work with null values, so we need to fill them in first.

In [11]:
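def fill_cats(df):
    # Fill missing values in text (object) columns with a placeholder.
    for i in df.columns:
        if df[i].dtype == 'object':
            # Assign the result back instead of using inplace=True on a
            # column selection, which newer pandas versions may not apply to df.
            df[i] = df[i].fillna('unknown')
    return df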

The function above works by filling categorical null values with the 'unknown' placeholder.

In [12]:
df_filled_cats = fill_cats(df_dropped)

In [13]:
df_filled_cats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
...

As seen above, we no longer have null categorical variables. Now, let's encode our categorical features.

In [14]:
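def encode_cats(df):
    # Replace each text (object) column with integer codes,
    # using a fresh LabelEncoder fitted on that column.
    for i in df.columns:
        if df[i].dtype == 'object':
            df[i] = LabelEncoder().fit_transform(df[i])
    return df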

In [15]:
df_encode_cats = encode_cats(df_filled_cats)

In [16]:
df_encode_cats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null int32
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null int32
LotShape         1460 non-null int32
LandContour      1460 non-null int32
Utilities        1460 non-null int32
LotConfig        1460 non-null int32
LandSlope        1460 non-null int32
Neighborhood     1460 non-null int32
Condition1       1460 non-null int32
Condition2       1460 non-null int32
BldgType         1460 non-null int32
HouseStyle       1460 non-null int32
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null int32
RoofMatl         1460 non-null int32
Exterior1st      1460 non-null int32
Exterior2nd      1460 non-null int32
Mas

As we can see, we no longer have object data types in our data. Our data is almost ready for modelling.

#### Handling Numerical Variables
Our data still contains three numerical features with null values: LotFrontage, MasVnrArea and GarageYrBlt. We will fill these in with the SimpleImputer class.

In [17]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

The SimpleImputer works by filling in missing values with a value computed according to the chosen strategy. Its three main strategies are 'mean', 'median' and 'most_frequent'.
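To make the 'median' strategy concrete, here is a small plain-Python sketch of the idea: learn the median from the observed values, then use it wherever a value is missing. It is not the SimpleImputer implementation; the function names and the toy column are assumptions for illustration, with `None` standing in for a missing value.

```python
from statistics import median


def fit_median(values):
    """Compute the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def fill_missing(values, fill_value):
    """Replace every missing entry (None) with fill_value."""
    return [fill_value if v is None else v for v in values]


# Toy LotFrontage-like column; None marks a missing value.
lot_frontage = [65.0, 80.0, None, 60.0, 84.0, None]
learned = fit_median(lot_frontage)
print(learned)                              # 72.5
print(fill_missing(lot_frontage, learned))  # [65.0, 80.0, 72.5, 60.0, 84.0, 72.5]
```

Keeping the "learn" and "fill" steps separate mirrors the fit/transform split: the value learned from one dataset can be reused to fill gaps in another.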

In [18]:
data = imputer.fit_transform(df_encode_cats)

Transforming the data with SimpleImputer returns a NumPy array, which we then convert back to a pandas dataframe.

In [19]:
data = pd.DataFrame(data, columns=df_encode_cats.columns)

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
Id               1460 non-null float64
MSSubClass       1460 non-null float64
MSZoning         1460 non-null float64
LotFrontage      1460 non-null float64
LotArea          1460 non-null float64
Street           1460 non-null float64
LotShape         1460 non-null float64
LandContour      1460 non-null float64
Utilities        1460 non-null float64
LotConfig        1460 non-null float64
LandSlope        1460 non-null float64
Neighborhood     1460 non-null float64
Condition1       1460 non-null float64
Condition2       1460 non-null float64
BldgType         1460 non-null float64
HouseStyle       1460 non-null float64
OverallQual      1460 non-null float64
OverallCond      1460 non-null float64
YearBuilt        1460 non-null float64
YearRemodAdd     1460 non-null float64
RoofStyle        1460 non-null float64
RoofMatl         1460 non-null float64
Exterior1st      1460 non-null float64
...

## Splitting The Data
Before continuing, we need to split our dataset into features and labels.

In [21]:
y = data['SalePrice'].copy()
X = data.drop(['Id','SalePrice'],axis=1)

We also have to split these further into training sets and validation sets.

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.2,random_state=42)
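Conceptually, the split shuffles the rows and sets aside 20% of them for validation. The sketch below shows that idea in plain Python on row indices only. It is an illustration, not scikit-learn's implementation: the helper `split_indices` is hypothetical, and even with the same seed it will not pick the same rows as train_test_split.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and validation parts."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(indices)
    n_val = round(n_rows * test_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1460)
print(len(train_idx), len(val_idx))  # 1168 292
```

With 1460 rows and test_size=0.2, 292 rows go to validation and the remaining 1168 are used for training.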

## Feature Scaling 
Feature scaling is one of the most important transformations to apply to your data. Machine Learning algorithms generally don't perform well when the input features have very different scales. We will use min-max scaling on our data. In min-max scaling (also known as normalization), values are shifted and rescaled so that they end up ranging from 0 to 1.
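In formula form, min-max scaling maps each value x to (x - min) / (max - min), where min and max are taken from the data the scaler was fitted on. A minimal plain-Python sketch of that formula is shown below. The helper names are made up for illustration, and mapping a constant column to 0 is a choice made for this sketch.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Rescale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map it to 0 in this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# LotArea values from the first rows shown by df.head().
train_lot_area = [8450, 9600, 11250, 9550, 14260]
lo, hi = fit_min_max(train_lot_area)
print(min_max_scale(train_lot_area, lo, hi))

# A value outside the fitted range lands outside [0, 1].
print(min_max_scale([20000], lo, hi))
```

The second call shows why the range is learned once and then reused: new data is scaled with the same minimum and maximum, so values can fall slightly outside the 0 to 1 range.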

In [24]:
from sklearn.preprocessing import MinMaxScaler

Now, let's fit our scaler to the training data only, so that nothing from the validation set influences the scaling.

In [25]:
scaler = MinMaxScaler()
scaler.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 1))

Next, we create a function we can use to transform our data. Like the SimpleImputer, it returns a numpy array which we can then convert back to a pandas dataframe.

In [26]:
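def scaled(x):
    # Apply the fitted scaler, then rebuild a dataframe with the
    # original column names and row labels.
    x_scaled = scaler.transform(x)
    return pd.DataFrame(x_scaled, columns=x.columns, index=x.index)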

In [27]:
X_train = scaled(X_train)
X_val = scaled(X_val)

## Building a Model
Finally, we are ready to select and train a model. We will be using a Random Forest Regression model.

In [28]:
from sklearn.ensemble import RandomForestRegressor

Now, let's fit our model.

In [29]:
model = RandomForestRegressor()
model.fit(X_train,y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

Now, let's make predictions on our validation set.

In [30]:
preds = model.predict(X_val)

Now, let's measure this model's RMSE on the validation set using Scikit-Learn's mean_squared_error function.

In [31]:
from sklearn.metrics import mean_squared_error
import numpy as np

In [32]:
mse=mean_squared_error(y_val,preds)
rmse=np.sqrt(mse)
print(rmse)

30060.78513899419
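For reference, RMSE is the square root of the average squared difference between the actual and predicted values. The plain-Python sketch below spells out the calculation on made-up numbers and then relates the validation RMSE printed above to the mean SalePrice reported by describe(). The function name `rmse` and the toy prices are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat) ** 2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy sale prices and predictions (made-up numbers for illustration).
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]
print(rmse(actual, predicted))

# The validation RMSE above as a fraction of the mean SalePrice
# from df.describe(): roughly 0.17.
relative_error = 30060.78513899419 / 180921.19589
print(relative_error)
```

Because the errors are squared before averaging, a few large misses weigh more heavily on the RMSE than many small ones.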


This is clearly not a great score. It can be improved with cross-validation, hyperparameter tuning and a number of other methods, which will be covered in upcoming lessons.

## Predicting the Test Data
Before predicting on our test data, let's refit our model on all of the labelled data (training and validation sets combined) so that it can learn from more examples.

In [33]:
X = scaled(X)

In [34]:
model.fit(X, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

Now, let's prepare our test data for predictions.

In [35]:
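# Drop Id and the five sparse columns that were removed from the training data.
tests = test_df.drop(['Id','Alley','FireplaceQu','PoolQC','Fence', 'MiscFeature'],axis=1)
tests = fill_cats(tests)
# Note: encode_cats and fit_transform are re-fitted on the test set here, so the
# category codes and medians can differ from those learned on the training data.
tests = encode_cats(tests)
tests = imputer.fit_transform(tests)
# Scale with the scaler that was fitted on the training features.
test_scaled = scaler.transform(tests)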

In [36]:
test_preds = model.predict(test_scaled)
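One thing to watch in the preparation above: encode_cats fits a fresh LabelEncoder on whatever dataframe it receives, so a category can receive a different code in the test set than it did during training, for example when a category is absent from one of the two files. A common remedy is to learn the mapping once on the training data and reuse it. The sketch below illustrates the idea in plain Python. The helper names, the toy zoning values and the choice of -1 for unseen categories are assumptions for illustration, not part of the original notebook.

```python
def fit_mapping(train_values):
    """Learn category -> code from the training column only."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}


def apply_mapping(values, mapping, unseen_code=-1):
    """Encode values with a fixed mapping; unseen categories get unseen_code."""
    return [mapping.get(value, unseen_code) for value in values]


train_zoning = ["RL", "RM", "FV", "RL"]
test_zoning = ["RM", "RL", "RM"]  # "FV" does not appear in this toy test set

mapping = fit_mapping(train_zoning)
print(apply_mapping(test_zoning, mapping))                   # [2, 1, 2]

# Refitting on the test column gives different codes for the same
# categories, which is what reusing the training mapping avoids.
print(apply_mapping(test_zoning, fit_mapping(test_zoning)))  # [1, 0, 1]

# A category never seen in training falls back to the unseen code.
print(apply_mapping(["RH"], mapping))                        # [-1]
```

The same reasoning applies to the imputer: reusing values learned from the training data keeps the training and test features on a consistent footing.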

Now, all that remains is to export our predictions to a csv file.

In [37]:
output = pd.DataFrame({'Id': test_df.Id,
                       'SalePrice': test_preds})
output.to_csv('houseprices.csv', index=False)

This submission scores 17042.64005 on the leaderboard.

## Exercises
- Submit the houseprices.csv file on [Kaggle](https://www.kaggle.com/c/home-data-for-ml-course/submit)
- Improve predictions by dropping some more columns