## Introduction

Hello fellow Kagglers! In this notebook I will show you how you can reach the top 1% of the leaderboard in a simple and efficient way. First of all, I am really thankful to the amazing Kaggle community, which has helped me learn so many things. I created this notebook for learning purposes and to give back to the community, so I will keep updating it from time to time.

#### My main objectives for this project are:
+ Get to the top 1% with a minimum number of lines of code.
+ Learn to use Pipeline.
+ Explain each and every step and the logic behind it.
+ Create our own predictions from scratch without using public kernels.
+ Learn to use three models:- RandomForestRegressor, GradientBoostingRegressor, CatBoostRegressor.
<a id='top'></a> <br>
## NOTEBOOK CONTENT
1. [Imports](#1)
1. [Load Data](#2)
1. [Preprocessing](#3)
1. [Implement Pipeline](#4)
1. [Create Models](#5)
1. [Evaluate Models](#6)
1. [Predict Test set](#7)
1. [Make Submission](#8)

<a id="1"></a> <br>
## 1: Imports

In [None]:
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split

<a id="2"></a> <br>
## 2: Load Data

In [None]:
X_full = pd.read_csv("/kaggle/input/home-data-for-ml-course/train.csv", index_col= 'Id')
X_test_full = pd.read_csv("/kaggle/input/home-data-for-ml-course/test.csv", index_col = 'Id')

Let's check the shape of our datasets with the .shape attribute.

In [None]:
X_full.shape, X_test_full.shape

We see that X_full has 80 columns (79 features + 1 target variable) and X_test_full has 79 columns (the 79 features only).

<a id="3"></a> <br>
## 3: Preprocessing
### 3.a Remove rows with missing target, separate target variable from feature variables
The .dropna() method with axis=0 drops rows that contain null values. Here we set subset=['SalePrice'], which tells it to look only at the 'SalePrice' column, so only the rows whose 'SalePrice' is null are dropped.
A row whose target value is missing cannot be used to train our model, so we remove it.

In [None]:
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
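To see what subset does in isolation, here is a minimal sketch on a tiny made-up frame. The values and the toy_ names are hypothetical and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): row 2 has no target, row 3 has a missing feature.
toy = pd.DataFrame(
    {"LotArea": [8450, 9600, np.nan], "SalePrice": [208500, np.nan, 223500]},
    index=pd.Index([1, 2, 3], name="Id"),
)

# Only rows whose SalePrice is null are dropped; the NaN in LotArea is kept.
toy_clean = toy.dropna(axis=0, subset=["SalePrice"])

# Separate the target from the features, as in the cell above.
toy_y = toy_clean["SalePrice"]
toy_X = toy_clean.drop(columns=["SalePrice"])

print(toy_X)
```

The missing LotArea value survives this step on purpose: missing features are handled later by the imputers inside the pipeline.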

<a id="2"></a> <br>
### 3.b Split our data into train and validation

In [None]:
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

In [None]:
X_train_full.shape, X_valid_full.shape, y_train.shape, y_valid.shape

<a id="2"></a> <br>
### 3.c Select categorical columns with relatively low cardinality
In this problem we one-hot encode all the categorical columns.
Note:- If a categorical variable has 100 unique values, then its one-hot encoding will create 100 new columns, so our data will become high-dimensional.
This makes training hard. This phenomenon is called the
#### CURSE OF DIMENSIONALITY
So we first keep only the categorical columns which have fewer than 10 unique values.

(Note:- There are many other ways of tackling this issue. One method is to look at the frequency of each value and combine the values with less than .05% frequency into one category.)

In [None]:
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]
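The frequency-based alternative mentioned in the note can be sketched as follows. This is only an illustration: group_rare, min_freq and the label "Other" are hypothetical names, the threshold is arbitrary, and the rest of this notebook does not use this helper.

```python
import pandas as pd

def group_rare(series, min_freq=0.05, other_label="Other"):
    """Replace categories whose relative frequency is below min_freq with other_label."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < min_freq].index
    # Keep the original value unless it belongs to a rare category.
    return series.where(~series.isin(rare), other_label)

# Made-up column: A and B are common, C and D are rare.
toy_col = pd.Series(["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2)
grouped = group_rare(toy_col, min_freq=0.05)

print(toy_col.nunique(), "->", grouped.nunique())
```

In a real workflow the rare categories would be determined on the training split only and then applied to the validation and test sets.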

<a id="2"></a> <br>
### 3.d Select numerical columns
A numerical column has either an int or a float dtype. Here we select every column that is not of object dtype.

In [None]:
# numerical_cols = [cname for cname in X_train_full.columns if 
#                 X_train_full[cname].dtype in ['int64', 'float64']]
# update:- 
numerical_cols = X_train_full.select_dtypes(exclude=['object']).columns.tolist()

<a id="2"></a> <br>
### 3.e categorical_cols + numerical_cols
We create copies so that we don't tamper with the original datasets.


In [None]:
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

<a id="4"></a> <br>
## 4: Implement Pipeline
A Pipeline is very useful and can save a lot of time.

Some benefits of a pipeline:-

1) Cleaner code

2) Fewer Bugs

3) Easier to Productionize

4) More options for Model Validation
#### So let's implement a pipeline
### Imports

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

<a id="2"></a> <br>
### 4.a Preprocessing for numerical data
Let's use SimpleImputer to fill all missing values in our numerical columns. With strategy='constant' and no fill_value given, missing numerical values are replaced with 0.

In [None]:
numerical_transformer = SimpleImputer(strategy='constant')
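A quick check of what this imputer does, on a small made-up array. The median variant is shown only as a common alternative to compare against; it is not what this notebook uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with two missing entries.
toy_num = np.array([[1.0, np.nan], [np.nan, 4.0], [3.0, 6.0]])

# strategy='constant' without fill_value replaces missing numeric entries with 0.
filled_zero = SimpleImputer(strategy="constant").fit_transform(toy_num)

# strategy='median' replaces each missing entry with the median of its column.
filled_median = SimpleImputer(strategy="median").fit_transform(toy_num)

print(filled_zero)
print(filled_median)
```

Tree-based models are usually fairly tolerant of a constant fill, but it is worth trying other strategies and comparing the validation MAE.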

<a id="2"></a> <br>
### 4.b Preprocessing for categorical data
For categorical data we create a preprocessing pipeline with two steps:

1) We first fill all missing values with the most frequent value in that column, also called the mode.

2) We then convert the result to a one-hot encoding.

In [None]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
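Here is a minimal sketch of how this two-step pipeline behaves, including what handle_unknown='ignore' does with a category it never saw during fitting. The frames and the toy_ names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Training column with one missing value; "Pave" is the mode.
toy_train = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl", np.nan]}, dtype=object)
# New data containing "Dirt", which was never seen while fitting.
toy_new = pd.DataFrame({"Street": ["Grvl", "Dirt"]}, dtype=object)

toy_cat_pipe.fit(toy_train)
encoded = toy_cat_pipe.transform(toy_new).toarray()

# Columns follow the sorted categories learned in fit: Grvl, Pave.
print(encoded)
```

The unseen category becomes a row of zeros instead of raising an error, which is why the same pipeline can be applied safely to the test set.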

<a id="2"></a> <br>
### 4.c Bundle preprocessing for numerical and categorical data
We use ColumnTransformer from sklearn, which lets us apply different preprocessing to selected columns.
Here we apply numerical_transformer to all the numerical columns and categorical_transformer to the categorical columns.

(Note:- numerical_cols is a list containing the names of all numerical columns.)

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
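To get a feel for what the bundled preprocessor produces, here is a small sketch on a made-up frame. The column values and the toy_ names are hypothetical; note that a column not listed in any transformer is dropped by default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    "Alley": pd.Series(["a", "b", "c"], dtype=object),  # not listed below, so it is dropped
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["LotArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
])

toy_out = toy_preprocessor.fit_transform(toy)

# 1 numeric column + 2 one-hot columns for Street.
print(toy_out.shape)
```

This is also why we filtered the columns into my_cols earlier: only the columns named in numerical_cols and categorical_cols reach the model.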

<a id="5"></a> <br>
## 5: Create Models
Now we create 3 different models and compare their results. 

(We can fine-tune our models even further with hyperparameter tuning using GridSearchCV, RandomizedSearchCV, etc., but for now we use mostly default settings.)
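For reference, a hedged sketch of how such tuning can be wired to a pipeline. It uses synthetic data and a tiny grid so that it runs on its own; X_demo, demo_pipeline and the grid values are placeholders, not tuned settings for this competition.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best setting
```

Because the whole pipeline is passed to the search, the preprocessing is refit inside every fold, which avoids leaking validation data into the imputers.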
### 5.a Model1:- RandomForestRegressor
Here we create another pipeline which has two steps:- preprocessor and model1.

When we call .fit(), it fits and transforms with the preprocessor, then fits model1 on the transformed data.

When we call .predict(), it only transforms with the already fitted preprocessor, then predicts with model1.

(Note:- we can create our own custom objects for use in a pipeline, but that is an advanced topic.)

In [None]:
# Define model1
model1 = RandomForestRegressor(n_estimators=800,random_state=20)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline1 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model1)
                             ])

# Preprocessing of training data, fit model 
my_pipeline1.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds1 = my_pipeline1.predict(X_valid)
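The fit and predict behaviour described above can be checked by writing the same steps out by hand. This is a sketch on synthetic data with a much smaller forest; all names here (X_demo, pipe, imputer, forest) are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
X_demo[::10, 0] = np.nan  # inject a few missing values so the imputer has work to do
X_fit, X_new, y_fit = X_demo[:80], X_demo[80:], y_demo[:80]

# With a pipeline: one fit call, one predict call.
pipe = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe_preds = pipe.fit(X_fit, y_fit).predict(X_new)

# The same steps by hand.
imputer = SimpleImputer(strategy="constant")
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(imputer.fit_transform(X_fit), y_fit)          # fit: fit_transform, then fit
manual_preds = forest.predict(imputer.transform(X_new))  # predict: transform only, then predict

print(np.allclose(pipe_preds, manual_preds))
```

The manual version is easy to get wrong (for example by calling fit_transform on new data), which is one of the bugs a pipeline helps to avoid.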

<a id="2"></a> <br>
### 5.b Model2:- GradientBoostingRegressor

In [None]:
# Define model2
from sklearn.ensemble import GradientBoostingRegressor
model2 = GradientBoostingRegressor(n_estimators=600, random_state=32)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline2 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model2)
                             ])

# Preprocessing of training data, fit model 
my_pipeline2.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds2 = my_pipeline2.predict(X_valid)

<a id="2"></a> <br>
### 5.c Model3:- CatBoostRegressor

In [None]:
# Define model3
import catboost as cb
model3 = cb.CatBoostRegressor(loss_function='RMSE',random_state=20,verbose=False)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline3 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model3)
                             ])

# Preprocessing of training data, fit model 
my_pipeline3.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds3 = my_pipeline3.predict(X_valid)

<a id="6"></a> <br>
## 6: Evaluate Models
### 6.a Let's look at the MAE of each model

In [None]:
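# Mean absolute error of each model on the validation set
score = mean_absolute_error(y_valid, preds1)
print('RandomForestRegressor MAE:', score)
score = mean_absolute_error(y_valid, preds2)
print('GradientBoostingRegressor MAE:', score)
score = mean_absolute_error(y_valid, preds3)
print('CatBoostRegressor MAE:', score)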

We see that their MAEs are as follows:

RandomForestRegressor: 17226.008848458903

GradientBoostingRegressor: 15563.235198675247

CatBoostRegressor: 16001.783475611544
<a id="2"></a> <br>
### 6.b Average of their predictions
Generally, averaging the predictions of very different models with similar scores gives a great boost on the leaderboard.

Logic:- If the models have similar scores, their errors are of similar size overall, and if the models are very different, they tend to make their errors on different data points. Averaging their predictions lets these errors partly cancel out, which usually lowers the overall error.

In [None]:
preds = (preds1 + preds2 + preds3) / 3

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

Clearly we can see that the averaging of predictions has improved our score.
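The logic behind averaging can be illustrated with simulated predictions. This is an idealised sketch: the two "models" are just the true values plus independent random noise of the same size, which real models only approximate.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
y_true = rng.uniform(100_000, 300_000, size=5_000)

# Two hypothetical models with the same error level but independent mistakes.
preds_a = y_true + rng.normal(0, 20_000, size=y_true.size)
preds_b = y_true + rng.normal(0, 20_000, size=y_true.size)

mae_a = mean_absolute_error(y_true, preds_a)
mae_b = mean_absolute_error(y_true, preds_b)
mae_avg = mean_absolute_error(y_true, (preds_a + preds_b) / 2)

print(round(mae_a), round(mae_b), round(mae_avg))
```

The more correlated the errors of the models are, the smaller this gain becomes, which is why averaging very similar models helps much less.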
<a id="7"></a> <br>
## 7: Predict Test set
The real power of the pipeline can be seen here. We don't need to preprocess the test set separately: we just call .predict() and the pipeline preprocesses it automatically.

In [None]:
# Preprocessing of test data, get predictions from each fitted pipeline
preds_test1 = my_pipeline1.predict(X_test)
preds_test2 = my_pipeline2.predict(X_test)
preds_test3 = my_pipeline3.predict(X_test)
# Average the three sets of test predictions
preds_test = (preds_test1 + preds_test2 + preds_test3) / 3

<a id="8"></a> <br>
## 8: Make submission

In [None]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission73.csv', index=False)

# End Note
## **Things you can try to improve performance.**
#### 1. Use hyperparameter tuning to find the best parameters
#### 2. Use StandardScaler to scale all the features (Note: don't scale target variables)
#### 3. Try using other models like XGBRegressor (XGBoost) or a neural network
##### (Remember: averaging the predictions of very different models (e.g. tree-based and NN) with similar scores will give a great boost in performance)
#### 4. When you get the best combination of models, write a for loop that generates 50 different predictions using different random seeds and take their average. You will see that there is some improvement in performance.
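A minimal sketch of the seed-averaging idea from tip 4, on synthetic data and with only 5 seeds so that it runs quickly. The data, the model settings and the variable names are placeholders; in the notebook you would loop over your own pipelines instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

seed_preds = []
for seed in range(5):  # the tip suggests 50 seeds; 5 keeps this sketch fast
    # subsample < 1 makes each fit depend on the random seed
    model = GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=seed)
    model.fit(X_tr, y_tr)
    seed_preds.append(model.predict(X_va))

# Average the predictions across seeds, one value per validation row.
avg_preds = np.mean(seed_preds, axis=0)
print(avg_preds.shape)
```

Seed averaging only helps when the model has some randomness to average over, for example row or column subsampling.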
## <font color='orange'><b>If you have any doubts feel free to ask below, I would be happy to help.</b></font>