This project is my entry to the Kaggle competition House Prices: Advanced Regression Techniques.
The target was to rank in the top 5% of the leaderboard, and hence the target score was chosen to be an RMSLE below 0.1.
I ranked in the top 5% of the leaderboard, at 280th position out of 5,110 entries.
- train.csv - the training set
- data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
- You can find more information about the data fields here
2. Data Cleaning (Data Cleaning Notebook)
Some outliers were detected while visualizing the data, as shown below (two points with GrLivArea > 4000 and SalePrice < 200000).
After removing the outliers, the plot shows a clear improvement in the linearity of the relationship.
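A minimal sketch of that filter, assuming the training data from train.csv is loaded into a pandas DataFrame named `train`:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Drop the two points with a very large living area but an unusually low sale price.
outliers = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 200000)
train = train.drop(train[outliers].index)
```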
While inspecting the target variable, it was found to be positively skewed, so the skew was removed by applying a Box-Cox transformation with lambda = 0.15.
As you can see, the target variable now follows a near-perfect normal distribution.
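Continuing with the same `train` DataFrame, a sketch of that transformation using `scipy.special.boxcox1p` (the exact call used in the notebook may differ):

```python
from scipy.special import boxcox1p

# Box-Cox transform of (1 + SalePrice) with the fixed lambda from the text.
lam = 0.15
train["SalePrice"] = boxcox1p(train["SalePrice"], lam)
```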
There was a lot of missing data in the dataset, as some features were designed to have certain properties reflected by null values; hence, a number of transformations were applied.
Some of the transformations used to fill the missing values are illustrated below.
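For example, a sketch of two typical fill rules, assuming columns such as `PoolQC` or `FireplaceQu` where a missing value simply means the feature is absent (the notebook handles a longer list of columns):

```python
# Categorical columns where NaN means "feature not present" (no pool, no fireplace, ...).
for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
    train[col] = train[col].fillna("None")

# Numeric counterparts of the same kind of features default to 0.
for col in ["GarageArea", "GarageCars", "TotalBsmtSF"]:
    train[col] = train[col].fillna(0)
```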
Feature Engineering (Feature Engineering Notebook)
Total square footage was added as a new feature by summing the square-foot areas of all floors of each house in the dataset.
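A sketch of that feature, assuming it is built from the basement plus first- and second-floor areas:

```python
# Total living area of the house across all floors.
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]
```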
scikit-learn's LabelEncoder was used for label encoding of the categorical features.
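A sketch of that encoding on a few illustrative columns (the notebook may encode a longer list):

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column with integer labels.
for col in ["ExterQual", "ExterCond", "BsmtQual", "KitchenQual"]:
    train[col] = LabelEncoder().fit_transform(train[col].astype(str))
```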
Almost 50% of the features were skewed, some of them heavily, so the skew was removed with a Box-Cox transformation (lambda = 0.15) applied to every feature with skew > 0.75.
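A sketch of that step, assuming the threshold is applied to the numeric feature columns only:

```python
from scipy.special import boxcox1p
from scipy.stats import skew

# Skew of every numeric feature (the target was already transformed above).
features = train.drop(columns=["SalePrice"])
numeric_cols = features.dtypes[features.dtypes != "object"].index
skewness = features[numeric_cols].apply(lambda x: skew(x.dropna()))

# Box-Cox transform only the features whose skew exceeds the threshold.
for col in skewness[skewness > 0.75].index:
    train[col] = boxcox1p(train[col], 0.15)
```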
Model Building (Model Building Notebook)
According to the rules of the competition, the evaluation metric was the Root Mean Square Logarithmic Error (RMSLE).
Note that in the formulation below, X is the predicted value and Y is the actual value.
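Written out, the standard form of RMSLE is:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(X_i + 1) - \log(Y_i + 1)\bigr)^2}$$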
The target was to get into the top 5% of the Kaggle leaderboard; thus, the goal was an RMSLE below 0.1.
Base models were chosen, tuned with GridSearchCV, and then scored to gauge performance (a sketch of the scoring helper follows the score list below).
- Lasso Score: 0.1116 (Std: 0.0074)
- ElasticNet Score: 0.1116 (Std: 0.0074)
- Kernel Ridge Score: 0.1152 (Std: 0.0075)
- Gradient Boosting Score: 0.1166 (Std: 0.0083)
- XGBoost Score: 0.1161 (Std: 0.0072)
- LightGBM Score: 0.1164 (Std: 0.0062)
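The scores above are the mean and standard deviation of a cross-validated RMSE on the transformed target; a minimal sketch of such a scoring helper (the function name, fold count and random seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

def rmsle_cv(model, X, y, n_folds=5):
    """Cross-validated RMSE of `model`; the target is already log/Box-Cox transformed."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf))
    return rmse  # the reported numbers are rmse.mean() and rmse.std()
```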
We begin with the simple approach of averaging base models. We build a new class to extend scikit-learn with our model, and also to leverage encapsulation and code reuse (inheritance).
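A sketch of such an averaging wrapper (class and attribute names are illustrative, not necessarily those in the notebook):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    """Fit clones of several base models and average their predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        # Fit a fresh clone of every base model on the full training data.
        self.models_ = [clone(m) for m in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        # Simple arithmetic mean of the base-model predictions.
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
```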
- Averaged Based Model Class Score: 0.1087 (Std: 0.0076)
It seems that even this simplest stacking approach really improves the score. This encourages us to go further and explore a less simple stacking approach.
In this approach, we add a meta-model on top of the averaged base models and use the out-of-fold predictions of these base models to train the meta-model.
The procedure, for the training part, may be described as follows:
1. Split the total training set into two disjoint sets (here, train and holdout).
2. Train several base models on the first part (train).
3. Test these base models on the second part (holdout).
4. Use the predictions from the holdout fold (called out-of-fold predictions) as the inputs, and the correct responses (target variable) as the outputs, to train a higher-level learner called the meta-model.
The first three steps are done iteratively. If we take, for example, 5-fold stacking, we first split the training data into 5 folds and then do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold (the holdout).
So, after 5 iterations, we can be sure that the entire dataset has been used to obtain out-of-fold predictions, which we then use as new features to train our meta-model.
For the prediction part, we average the predictions of all base models on the test data and use them as meta-features, on which the final prediction is made by the meta-model.
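A sketch of such a stacking regressor, assuming `X` and `y` are numpy arrays (the class name and fold count are illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    """Train base models out-of-fold, then fit a meta-model on their predictions."""

    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        self.base_models_ = [[] for _ in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        # Out-of-fold predictions become the training features of the meta-model.
        out_of_fold = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_idx, holdout_idx in kfold.split(X, y):
                instance = clone(model)
                instance.fit(X[train_idx], y[train_idx])
                self.base_models_[i].append(instance)
                out_of_fold[holdout_idx, i] = instance.predict(X[holdout_idx])

        self.meta_model_.fit(out_of_fold, y)
        return self

    def predict(self, X):
        # Average each base model's fold instances, then let the meta-model combine them.
        meta_features = np.column_stack([
            np.column_stack([inst.predict(X) for inst in instances]).mean(axis=1)
            for instances in self.base_models_
        ])
        return self.meta_model_.predict(meta_features)
```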
After this, all the results were combined in an ensemble to produce the final predictions.
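A minimal sketch of such a blend, with purely illustrative weights (the actual weights are not stated here), assuming `stacked_pred`, `xgb_pred` and `lgb_pred` hold the test-set predictions of the stacked model, XGBoost and LightGBM on the Box-Cox-transformed scale:

```python
from scipy.special import inv_boxcox1p

# Weighted blend of the stacked model and the boosted models (weights are illustrative),
# then invert the lambda = 0.15 Box-Cox transform to recover sale prices.
final_pred = inv_boxcox1p(0.70 * stacked_pred + 0.15 * xgb_pred + 0.15 * lgb_pred, 0.15)
```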
- Final Ensemble Score: 0.076
The final score reached was RMSLE = 0.076, i.e. below 0.1, and hence the target was achieved.