# Advanced Machine Learning

## Regression

The dataset used here contains house prices. You can find more information about it on this [website](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

In [50]:
# all libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cross_decomposition
from sklearn.linear_model import *
from sklearn.svm import *
from sklearn.ensemble import *
from sklearn.model_selection import *
from sklearn.metrics import *

### Preparation of the data for the machine learning (ML) models

In [7]:
train_df = pd.read_csv('train1.csv')
test_df = pd.read_csv('test1.csv')

Here we one-hot encode the categorical features with `pd.get_dummies`, then replace the remaining missing values, which are all numeric, with the mean of the corresponding feature.

In [8]:
Y = train_df['SalePrice']

In [9]:
X_train = pd.get_dummies(train_df.drop('SalePrice', axis=1).drop('Id', axis=1))
X_train = X_train.fillna(X_train.mean())

In [10]:
X_test = pd.get_dummies(test_df.drop('Id', axis=1))
X_test = X_test.fillna(X_test.mean())
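
As a small illustration of the two steps above, the sketch below applies `pd.get_dummies` and mean imputation to a toy frame. The column names and values are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a gap, one categorical column.
toy = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 10000.0],
    "Street": ["Pave", "Grvl", "Pave"],
})

# One-hot encode the categorical column; the numeric column passes through unchanged.
toy_encoded = pd.get_dummies(toy)

# Fill each remaining gap with the mean of its own column.
toy_filled = toy_encoded.fillna(toy_encoded.mean())

print(toy_filled)
```

The missing `LotArea` value is replaced by the mean of the two observed values, while the `Street` column becomes one indicator column per category.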

In [11]:
# keep only the training columns that also exist in the test set
for col in list(X_train.columns):
    if col not in X_test.columns:
        X_train = X_train.drop(col, axis=1)
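
The loop above only trims the training frame. If the test frame could also contain columns that are missing from the training frame, one possible alternative is `DataFrame.align` with `join="inner"`, which keeps the shared columns in both frames. This is a sketch on made-up frames, not what the notebook ran.

```python
import pandas as pd

# Hypothetical one-hot encoded frames whose columns only partly overlap.
train_toy = pd.DataFrame({"LotArea": [8000, 9000], "Roof_Metal": [1, 0], "Roof_Tile": [0, 1]})
test_toy = pd.DataFrame({"Roof_Tile": [1, 0], "LotArea": [7000, 9500], "Roof_Wood": [0, 1]})

# join="inner" keeps only the columns present in both frames,
# and both results come back with the same column order.
train_aligned, test_aligned = train_toy.align(test_toy, join="inner", axis=1)

print(list(train_aligned.columns))
print(list(test_aligned.columns))
```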

### Building some ML models

#### Linear Regression

In [20]:
# score() returns R^2 (here on the training data), reported as a percentage
reg = LinearRegression().fit(X_train, Y)
acc_reg = round(reg.score(X_train, Y) * 100, 2)
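
For a scikit-learn regressor, `score` returns the coefficient of determination R², not a classification accuracy. The sketch below computes R² by hand on made-up numbers and checks it against `sklearn.metrics.r2_score`; the helper name `r_squared` is just for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_toy = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.0, 2.0, 3.0, 4.0]  # perfect predictions
bad_pred = [4.0, 3.0, 2.0, 1.0]   # worse than predicting the mean

r2_good = r_squared(y_toy, good_pred)
r2_bad = r_squared(y_toy, bad_pred)

print(r2_good, r2_bad, r2_score(y_toy, bad_pred))
```

R² is 1 for perfect predictions and drops below 0 when a model does worse than always predicting the mean of the target.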

#### Support Vector Machine for Regression (SVR)

In [26]:
svr = SVR(gamma='auto')
svr.fit(X_train, Y)
acc_svr = round(svr.score(X_train, Y) * 100, 2)

#### Random Forest

In [22]:
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, Y)
acc_random_forest = round(random_forest.score(X_train, Y) * 100, 2)

#### Gradient Boosting Regressor

In [23]:
gradient_boost = GradientBoostingRegressor()
gradient_boost.fit(X_train, Y)
acc_gradient_boost = round(gradient_boost.score(X_train, Y) * 100, 2)

Now let's compare the models by their scores on the training data.

In [24]:
results = pd.DataFrame({
    'Model': ['SVR', 'Linear Regression', 'Random Forest', 'Gradient Boosting Regressor'],
    'Score': [acc_svr, acc_reg, acc_random_forest, acc_gradient_boost]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df

| Score | Model |
|-------|-------|
| 98.22 | Random Forest |
| 96.64 | Gradient Boosting Regressor |
| 91.72 | Linear Regression |
| -5.09 | SVR |
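
These scores are computed on the same rows the models were fitted on, which tends to flatter flexible models such as random forests. The sketch below shows the gap between a training score and a held-out score. It uses synthetic data from `make_regression` as a stand-in, so the numbers say nothing about the housing data itself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: any noisy regression task will do).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)

train_r2 = forest.score(X_fit, y_fit)  # scored on rows the forest has already seen
val_r2 = forest.score(X_val, y_val)    # scored on held-out rows

print(f"train R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}")
```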


As the table shows, the Random Forest regressor comes first. However, these scores were computed on the training data, so let's check how it performs under cross-validation.
The idea of cross-validation is to split the training set into k folds. The model is trained and evaluated k times, each time using a different fold for evaluation and the remaining k-1 folds for training.
Here we use k = 10 folds. The score used here is the MSE (mean squared error), so lower is better.

In [64]:
#cval = LeaveOneOut().get_n_splits(X_train)
rf = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(rf, X_train, Y, cv=10, scoring = make_scorer(mean_squared_error))
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [6.27378324e+08 6.51009833e+08 4.87253519e+08 1.50458445e+09
 1.13304490e+09 6.94430084e+08 5.88324386e+08 5.86830313e+08
 1.75345429e+09 7.43044327e+08]
Mean: 876935442.0479609
Standard Deviation: 413725285.498749
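
To make the procedure concrete, here is a hand-written version of the k-fold loop. It is a sketch on synthetic data (an assumption, so that the block runs on its own) and uses `KFold` without shuffling. It also converts each fold's MSE to an RMSE, which is expressed in the same units as the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for X_train and Y.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

fold_mse = []
for fit_idx, val_idx in KFold(n_splits=10).split(X_demo):
    # Train on k-1 folds, evaluate on the fold that was left out.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_demo[fit_idx], y_demo[fit_idx])
    predictions = model.predict(X_demo[val_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], predictions))

fold_mse = np.array(fold_mse)
fold_rmse = np.sqrt(fold_mse)  # same units as the target

print("Mean MSE:", fold_mse.mean())
print("Mean RMSE:", fold_rmse.mean())
```

Applied to the mean MSE reported above, the square root is roughly 29,600, which is easier to read against the scale of `SalePrice` than the raw MSE.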


### Model hyperparameter tuning

There are many hyperparameters, so we cannot tune all of them. We select a few and give each a list of candidate values; the grid search then reports the best combination.
We'll only tune the hyperparameters of the random forest model, as it scored best.

In [66]:
param_grid = {
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 200],
    # 'auto' is only accepted by older scikit-learn releases
    "max_features": ['auto', 'sqrt', 'log2'],
}

rf = RandomForestRegressor(n_estimators=100, random_state=1, n_jobs=-1, oob_score=True)
# the iid argument is only accepted by older scikit-learn releases
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1, cv=10, iid=False)
clf.fit(X_train, Y)
clf.best_params_

{'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 200}
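
To get a feel for the cost of this search, the grid can be expanded with `ParameterGrid` and its combinations counted. The sketch below copies the grid under a separate name, `param_grid_demo`, so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched above.
param_grid_demo = {
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 200],
    "max_features": ["auto", "sqrt", "log2"],
}

combinations = list(ParameterGrid(param_grid_demo))
n_combinations = len(combinations)  # 3 * 3 * 2 * 3
n_fits = n_combinations * 10        # one fit per combination and per fold (cv=10)

print(n_combinations, n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the full training set.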

Now let's use the new parameters.

In [71]:
random_forest = RandomForestRegressor( min_samples_leaf = 1, 
                                       min_samples_split = 5,   
                                       n_estimators=200, 
                                       max_features='auto', 
                                       oob_score=True, 
                                       random_state=1, 
                                       n_jobs=-1)

random_forest.fit(X_train, Y)
Y_prediction = random_forest.predict(X_test)

round(random_forest.oob_score_, 3)*100

86.2

Now that we have a tuned model, we can evaluate its performance more accurately. So far we have used the out-of-bag (OOB) score; now we'll use another metric, the score Kaggle computes on the test-set predictions.

In [73]:
# predictions to submit to Kaggle
# build a fresh frame so that X_test is not modified in place
sub_data = pd.DataFrame({'Id': range(1461, 2920), 'SalePrice': Y_prediction})
sub_data.to_csv('submission_data1.csv', index=False)

After submission, I got **0.14803** as a score.
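
The notebook does not say how this score is computed. Assuming it is the metric described on the competition's evaluation page, namely the RMSE between the logarithm of the predicted price and the logarithm of the observed price, it could be reproduced locally on a validation split with a helper like the one below. The function name and the prices are made up for the example.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """Root mean squared error between log(y_pred) and log(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, not taken from the dataset.
true_prices = np.array([100000.0, 200000.0, 300000.0])

perfect = rmse_of_logs(true_prices, true_prices)
ten_percent_high = rmse_of_logs(true_prices, true_prices * 1.10)

print(perfect, ten_percent_high)
```

Because the metric works on logarithms, it penalises relative errors: overshooting every price by 10% gives log(1.1), about 0.095, whatever the price level.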