# Notes for this notebook:

## Models:
I've used 3 different algorithms:
- XGBoost
- Random Forest
- Support Vector Machine

## Each Model before and after Tuning:
### XGBoost
The baseline model had these scores (RMSE is the root-mean-squared error on the test set, lower is better; the other three are R^2 values, higher is better; cross-val is the mean over the cross-validation folds of the training set):
- RMSE: 4.6751
- cross-val R^2: 0.9015
- R^2 training: 0.9964
- R^2 test: 0.9193

After tuning the model WITH THE FIRST STRATEGY, I managed to get these scores:
- RMSE: 4.0183
- cross-val R^2: 0.9191
- R^2 training: 0.9874
- R^2 test: 0.9403

After tuning the model WITH THE SECOND STRATEGY, I managed to get these scores:
- RMSE: 4.0473
- cross-val R^2: 0.9215
- R^2 training: 0.9876
- R^2 test: 0.9395

I got a good improvement in test R^2, RMSE and cross-val score for both tuning strategies, and a decrease in training R^2 (a good thing, since it reduced the over-fitting). In this case it basically did not matter which strategy I used.

### Random Forest
The baseline model had these scores:
- RMSE: 5.2859
- cross-val R^2: 0.9070
- R^2 training: 0.9863
- R^2 test: 0.8117

After tuning the model, I managed to get these scores:
- RMSE: 5.5303
- cross-val R^2: 0.8933
- R^2 training: 0.9573
- R^2 test: 0.8849

I got a big increase in R^2 on the test data, but the RMSE got slightly worse (it went up).

### Support Vector Machine
The baseline model had these scores:
- RMSE: 8.9383
- cross-val R^2: 0.6907
- R^2 training: 0.7339
- R^2 test: 0.7050

After tuning the model, I managed to get these scores:
- RMSE: 5.7128
- cross-val R^2: 0.8668
- R^2 training: 0.9359
- R^2 test: 0.8795

I got a REALLY high increase in all scores, with only a small increase in over-fitting (the gap between training and test R^2 grew from about 2.9 to about 5.6 percentage points). I could have continued to tune the C parameter even further, but I think this would have led to an even more over-fitted model.

## Result Discussion (Some stuff I want to point out)
- I got the best baseline and "after tuning" score with XGBoost.
- Most models ended up somewhat over-fitted.
- I would expect RF and XGBoost to get more similar results, but on this dataset the boosting strategy outperformed bagging.
- For the RF model I manually tested different max_depth values to try to reduce over-fitting. Lower values reduced both the training and the test score, while the over-fitting percentage stayed the same. I therefore left max_depth at a high value, since the over-fitting was the same percentage-wise but both scores were higher.
- I think the over-fitting comes from either too little data or a bad distribution of the data.
- Comparison of the two XGB tuning strategies:
    - The 2nd strategy got a higher cross-validation score, but the 1st strategy got slightly better scores on the test data (RMSE and test R^2).
    - This most likely comes from me not doing a good enough job.
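
The over-fitting discussed above can be read as the gap between the training score and the test score. A minimal sketch under that assumption, using the training/test scores copied from the lists earlier in these notes (the helper name `overfit_gap` is made up for illustration and is not part of the notebook):

```python
# (training score, test score) pairs copied from the summary lists above.
scores = {
    "XGBoost baseline": (0.9964, 0.9193),
    "XGBoost strategy 1": (0.9874, 0.9403),
    "XGBoost strategy 2": (0.9876, 0.9395),
    "Random Forest baseline": (0.9863, 0.8117),
    "Random Forest tuned": (0.9573, 0.8849),
    "SVM baseline": (0.7339, 0.7050),
    "SVM tuned": (0.9359, 0.8795),
}


def overfit_gap(train_score, test_score):
    """Hypothetical helper: train/test gap in percentage points."""
    return round((train_score - test_score) * 100, 2)


gaps = {name: overfit_gap(train, test) for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name:<24} gap = {gap:5.2f} points")
```

A smaller gap is not automatically better: the SVM baseline has the smallest gap simply because it fits both sets poorly, so the gap is best read together with the test score.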

In [40]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
import numpy as np

In [41]:
df = pd.read_excel(r'C:\Users\marku\Desktop\ML\MLGit\datasets\Concrete_Data.xls')

In [42]:
# Shorten the verbose column names in a single rename call.
df.rename(columns={
    'Cement (component 1)(kg in a m^3 mixture)': 'Cement',
    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)': 'Blast Furnace Slag',
    'Fly Ash (component 3)(kg in a m^3 mixture)': 'Fly Ash',
    'Water  (component 4)(kg in a m^3 mixture)': 'Water',
    'Superplasticizer (component 5)(kg in a m^3 mixture)': 'Superplasticizer',
    'Coarse Aggregate  (component 6)(kg in a m^3 mixture)': 'Coarse Aggregate',
    'Fine Aggregate (component 7)(kg in a m^3 mixture)': 'Fine Aggregate',
    'Age (day)': 'Age',
    'Concrete compressive strength(MPa, megapascals) ': 'Strength',
}, inplace=True)

In [43]:
df.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


# XGBoost

In [44]:
df_XGB = df.copy()

In [45]:
features = df_XGB.drop('Strength', axis=1)
targets = df_XGB['Strength']
train_X, test_X, train_y, test_y = train_test_split(features, targets, random_state=42, test_size=0.25)

XGBModel = XGBRegressor()
XGB_scores = cross_val_score(XGBModel, train_X, train_y)
(XGB_scores.mean(), XGB_scores.std())

(0.9015121636351117, 0.011785011693389415)

In [46]:
XGBModel.fit(train_X, train_y)
pred = XGBModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

4.675146846418575

In [47]:
(XGBModel.score(train_X, train_y), XGBModel.score(test_X, test_y))

(0.9964513444166627, 0.9193093588600099)
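
Two different metrics appear in the cells above: `mean_squared_error(..., squared=False)` returns the RMSE (in the units of the target, here MPa), while `.score()` on a regressor returns R^2. A minimal sketch of both, computed by hand with NumPy on made-up numbers (the helper names `rmse` and `r2` are for illustration only and are not part of the notebook):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-squared error: typical size of a prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up strengths (MPa) and predictions, purely for illustration.
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_pred = np.array([32.0, 38.0, 53.0, 57.0])

print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Taking the square root of the MSE by hand like this also avoids relying on the `squared` argument, which newer scikit-learn releases have deprecated.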

# XGB TUNING: STRATEGY 1
A sequential strategy: tune one parameter (or a small group of parameters) at a time and keep the best value fixed for the following steps. This will probably perform worse than a joint grid search over all parameters (strategy 2, which I use later), since a change in one parameter can affect which value is best for another. I did this mostly to see how much worse this time-saving strategy is compared to the joint search.
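
The sequential idea can be sketched as a small loop: run one grid search per stage and carry the winning values into the next stage. This is only an illustration under assumptions: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, synthetic data instead of the concrete dataset, and the helper names `sequential_grid_search` and `make_gbr` are made up.

```python
from math import prod

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


def sequential_grid_search(make_estimator, stages, X, y, cv=3):
    """Tune one parameter group at a time, carrying earlier winners forward."""
    fixed = {}
    last_score = None
    for grid in stages:
        search = GridSearchCV(make_estimator(**fixed), grid, cv=cv)
        search.fit(X, y)
        fixed.update(search.best_params_)
        last_score = search.best_score_
    return fixed, last_score


def make_gbr(**params):
    # Stand-in estimator (assumption): not the XGBRegressor used in this notebook.
    return GradientBoostingRegressor(n_estimators=50, random_state=0, **params)


# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

stages = [
    {"max_depth": [2, 3]},
    {"subsample": [0.6, 0.8]},
    {"learning_rate": [0.05, 0.1]},
]

best_params, best_score = sequential_grid_search(make_gbr, stages, X_demo, y_demo)

# Candidates evaluated: a sum over stages, versus a product for one joint grid.
n_sequential = sum(prod(len(v) for v in grid.values()) for grid in stages)
n_joint = prod(len(v) for grid in stages for v in grid.values())
print(best_params, round(best_score, 3), n_sequential, n_joint)
```

The saving grows quickly with the number of parameters, at the price of never testing combinations that only work well together.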

In [48]:
XGBParam1 = {
    'max_depth': [1,2,4,6,8,10],
    'min_child_weight': [2,4,6,8]
}

XGB_Grid1 = GridSearchCV(XGBRegressor(verbosity=0,
                                      gamma=0,
                                      subsample=0.6,
                                      learning_rate=0.1,
                                      n_estimators=200), XGBParam1)
XGB_Grid1.fit(train_X, train_y)
print(XGB_Grid1.best_params_, XGB_Grid1.best_score_)

{'max_depth': 6, 'min_child_weight': 8} 0.9213333774781347


In [49]:
XGBParam2 = {
    'gamma':[0.0,0.05,0.1,0.2,0.3,0.4,0.5]
}

XGB_Grid2 = GridSearchCV(XGBRegressor(verbosity=0,
                                      max_depth=6,
                                      min_child_weight=8,
                                      subsample=0.6,
                                      learning_rate=0.1,
                                      n_estimators=200), XGBParam2)
XGB_Grid2.fit(train_X, train_y)
print(XGB_Grid2.best_params_, XGB_Grid2.best_score_)

{'gamma': 0.3} 0.9222771296404264


In [50]:
XGBParam3 = {
    'subsample':[0.5,0.6,0.7,0.8,0.9]
}

XGB_Grid3 = GridSearchCV(XGBRegressor(verbosity=0,
                                      max_depth=6,
                                      min_child_weight=8,
                                      gamma=0.3,
                                      learning_rate=0.04,
                                      n_estimators=200), XGBParam3)
XGB_Grid3.fit(train_X, train_y)
print(XGB_Grid3.best_params_, XGB_Grid3.best_score_)

{'subsample': 0.6} 0.9157727095772961


In [51]:
XGBParam4 = {
    'learning_rate':[0.01, 0.02, 0.025, 0.03,0.04,0.06]
}

XGB_Grid4 = GridSearchCV(XGBRegressor(verbosity=0,
                                      max_depth=6,
                                      min_child_weight=8,
                                      gamma=0.3,
                                      subsample=0.6,
                                      n_estimators=200), XGBParam4)
XGB_Grid4.fit(train_X, train_y)
print(XGB_Grid4.best_params_, XGB_Grid4.best_score_)

{'learning_rate': 0.06} 0.9169117832981719


In [52]:
XGBParam5 = {
    'n_estimators':[60, 80, 100, 150, 200, 300, 350]
}

XGB_Grid5 = GridSearchCV(XGBRegressor(verbosity=0,
                                      max_depth=6,
                                      min_child_weight=8,
                                      gamma=0.3,
                                      subsample=0.6,
                                      learning_rate=0.06), XGBParam5)
XGB_Grid5.fit(train_X, train_y)
print(XGB_Grid5.best_params_, XGB_Grid5.best_score_)

{'n_estimators': 350} 0.9191195128130119


In [53]:
XGBModel = XGBRegressor(verbosity=0,
                        max_depth=6,
                        min_child_weight=8,
                        gamma=0.3,
                        subsample=0.6,
                        learning_rate=0.06,
                        n_estimators=350)
XGB_scores = cross_val_score(XGBModel, train_X, train_y)
(XGB_scores.mean(), XGB_scores.std())

(0.9191195128130119, 0.011486247362884448)

In [54]:
XGBModel.fit(train_X, train_y)
pred = XGBModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

4.018388775226135

In [55]:
(XGBModel.score(train_X, train_y), XGBModel.score(test_X, test_y))

(0.9874586527472798, 0.9403876068468414)

# XGB TUNING: STRATEGY 2

In [56]:
XGBParam1 = {
    'gamma':[0.0,0.1,0.2],
    'learning_rate':[0.02, 0.025, 0.03],
    'max_depth': [2,4,6],
    'min_child_weight': [2,4,6],
    'n_estimators':[350, 375],
    'subsample':[0.6,0.7,0.8]
}

XGB_Grid1 = GridSearchCV(XGBRegressor(verbosity=0), XGBParam1)
XGB_Grid1.fit(train_X, train_y)
print(XGB_Grid1.best_params_, XGB_Grid1.best_score_)

{'gamma': 0.1, 'learning_rate': 0.03, 'max_depth': 6, 'min_child_weight': 4, 'n_estimators': 375, 'subsample': 0.6} 0.9195087750396537


In [57]:
XGBParam2 = {
    'gamma':[0.0,0.1,0.2],
    'learning_rate':[0.025,0.03,0.04],
    'max_depth': [5,6,7],
    'min_child_weight': [2,3,4,5],
    'n_estimators':[375,400],
    'subsample':[0.5,0.6,0.7]
}

XGB_Grid2 = GridSearchCV(XGBRegressor(verbosity=0), XGBParam2)
XGB_Grid2.fit(train_X, train_y)
print(XGB_Grid2.best_params_, XGB_Grid2.best_score_)

{'gamma': 0.1, 'learning_rate': 0.03, 'max_depth': 7, 'min_child_weight': 5, 'n_estimators': 400, 'subsample': 0.6} 0.9205356356435013


In [58]:
XGBParam3 = {
    'gamma':[0.05,0.1,0.15],
    'learning_rate':[0.025, 0.03, 0.035],
    'max_depth': [6,7,8],
    'min_child_weight': [4,5,6],
    'n_estimators':[425,450],
    'subsample':[0.5,0.6,0.7]
}

XGB_Grid3 = GridSearchCV(XGBRegressor(verbosity=0), XGBParam3)
XGB_Grid3.fit(train_X, train_y)
print(XGB_Grid3.best_params_, XGB_Grid3.best_score_)

{'gamma': 0.1, 'learning_rate': 0.03, 'max_depth': 8, 'min_child_weight': 6, 'n_estimators': 450, 'subsample': 0.5} 0.9215652752651919
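
For a rough sense of what each strategy costs, the grids in this notebook can be counted. The sketch below only records how many values each parameter had in each grid (copied from the cells above) and assumes `GridSearchCV` ran with 5 folds, its default in recent scikit-learn versions; the final refit of each search is not counted.

```python
from math import prod

CV_FOLDS = 5  # assumption: GridSearchCV's default number of folds

# Number of values tried per parameter, one dict per grid search.
strategy_1 = [
    {"max_depth": 6, "min_child_weight": 4},
    {"gamma": 7},
    {"subsample": 5},
    {"learning_rate": 6},
    {"n_estimators": 7},
]
strategy_2 = [
    {"gamma": 3, "learning_rate": 3, "max_depth": 3,
     "min_child_weight": 3, "n_estimators": 2, "subsample": 3},
    {"gamma": 3, "learning_rate": 3, "max_depth": 3,
     "min_child_weight": 4, "n_estimators": 2, "subsample": 3},
    {"gamma": 3, "learning_rate": 3, "max_depth": 3,
     "min_child_weight": 3, "n_estimators": 2, "subsample": 3},
]


def total_candidates(grids):
    """Sum over grids of the number of parameter combinations in each grid."""
    return sum(prod(sizes.values()) for sizes in grids)


candidates_1 = total_candidates(strategy_1)
candidates_2 = total_candidates(strategy_2)
fits_1 = candidates_1 * CV_FOLDS
fits_2 = candidates_2 * CV_FOLDS

print(f"strategy 1: {candidates_1} candidates, {fits_1} model fits")
print(f"strategy 2: {candidates_2} candidates, {fits_2} model fits")
```

Under these assumptions the second strategy trains roughly 33 times as many models as the first for a very similar final score.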


In [59]:
XGBModel = XGBRegressor(
    verbosity=0,
    gamma=0.1,
    n_estimators=450,
    min_child_weight=6,
    subsample=0.5,
    learning_rate=0.03,
    max_depth=8
)
XGB_scores = cross_val_score(XGBModel, train_X, train_y)
(XGB_scores.mean(), XGB_scores.std())

(0.9215652752651919, 0.012761042480223028)

In [60]:
XGBModel.fit(train_X, train_y)
pred = XGBModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

4.047333872235488

In [61]:
(XGBModel.score(train_X, train_y), XGBModel.score(test_X, test_y))

(0.9876977076797194, 0.9395257186238439)

A good score for both training and testing, but still over-fitted.

# RandomForest Regressor

In [64]:
df_RF = df.copy()

In [65]:
features = df_RF.drop('Strength', axis=1)
targets = df_RF['Strength']
train_X, test_X, train_y, test_y = train_test_split(features, targets, random_state=42, test_size=0.25)

In [66]:
RFModel = RandomForestRegressor(random_state=42)

RF_scores = cross_val_score(RFModel, train_X, train_y)
(RF_scores.mean(), RF_scores.std())

(0.8952102744525048, 0.011172437631651155)

In [67]:
RFModel.fit(train_X, train_y)
pred = RFModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

5.524616365752195

In [68]:
(RFModel.score(train_X, train_y), RFModel.score(test_X, test_y))

(0.9853549909121149, 0.8873225774233262)

The baseline model is very over-fitted. I will try to correct this.

# RF TUNING

In [69]:
RFParam1 = {
    "max_depth":[2,3,4,5,6],
    "n_estimators":[100, 150, 200, 300],
    "max_samples": [0.1,0.2,0.3],
    'min_samples_leaf':[4,5,6,7]

}

RF_Grid1 = GridSearchCV(RandomForestRegressor(), RFParam1)
RF_Grid1.fit(train_X, train_y)
print(RF_Grid1.best_params_, RF_Grid1.best_score_)

{'max_depth': 6, 'max_samples': 0.3, 'min_samples_leaf': 4, 'n_estimators': 300} 0.8295725416612214


In [70]:
RFParam2 = {
    "max_depth":[5,6,7,8],
    "n_estimators":[275,300,350],
    "max_samples": [0.25,0.3,0.35,0.4],
    'min_samples_leaf':[3,4,5]
}

RF_Grid2 = GridSearchCV(RandomForestRegressor(), RFParam2)
RF_Grid2.fit(train_X, train_y)
print(RF_Grid2.best_params_, RF_Grid2.best_score_)

{'max_depth': 8, 'max_samples': 0.4, 'min_samples_leaf': 3, 'n_estimators': 275} 0.8592249336930777


In [99]:
RFParam3 = {
    "max_depth":[7,8,9,10,15],
    "n_estimators":[250,275,300],
    "max_samples": [0.35,0.4,0.5,0.6],
    'min_samples_leaf':[2,3,4,5],

}

RF_Grid3 = GridSearchCV(RandomForestRegressor(), RFParam3)
RF_Grid3.fit(train_X, train_y)
print(RF_Grid3.best_params_, RF_Grid3.best_score_)

{'max_depth': 15, 'max_samples': 0.6, 'min_samples_leaf': 2, 'n_estimators': 275} 0.8823164098581608


I won't continue the grid search, since the best max_depth only keeps going higher and higher. This would lead to extreme over-fitting.
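
One way to check whether deeper trees still help on held-out data is to sweep `max_depth` and look at the training score, the test score and the gap between them side by side. A minimal sketch on synthetic data (not the concrete dataset; the depth values and `n_estimators=50` are arbitrary choices for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

rows = []
for depth in [2, 4, 8, 15]:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)  # R^2 on the rows the model has seen
    test_r2 = model.score(X_te, y_te)   # R^2 on held-out rows
    rows.append((depth, train_r2, test_r2, train_r2 - test_r2))

for depth, train_r2, test_r2, gap in rows:
    print(f"max_depth={depth:>2}  train={train_r2:.3f}  test={test_r2:.3f}  gap={gap:.3f}")
```

If the test score stops improving while the gap keeps growing, the extra depth is mostly memorising the training rows.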

In [100]:
RFModel = RandomForestRegressor(max_depth=15,
                                max_samples=0.6,
                                min_samples_leaf=2,
                                n_estimators=275,
                                random_state=42)
RF_scores = cross_val_score(RFModel, train_X, train_y)
(RF_scores.mean(), RF_scores.std())

(0.8808971732530037, 0.01625403675825014)

In [101]:
RFModel.fit(train_X, train_y)
pred = RFModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

5.720365693322723

In [102]:
(RFModel.score(train_X, train_y), RFModel.score(test_X, test_y))

(0.9559532002838189, 0.8791962989878279)

The model gets very over-fitted, but it's much better than the baseline model.

# SVR

In [136]:
df_SVR = df.copy()
df_SVR = pd.get_dummies(df_SVR)

features = df_SVR.drop('Strength', axis=1)
targets = df_SVR['Strength']

scaler = RobustScaler()
features = scaler.fit_transform(features)

train_X, test_X, train_y, test_y = train_test_split(features, targets, random_state=42, test_size=0.25)
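
In the cell above the scaler is fitted on all rows before the split, so the test rows influence the scaling. One common alternative, sketched here on synthetic data rather than the concrete dataset, is to put the scaler and the model in a pipeline so that the scaler is fitted on training rows only. This is not what the notebook ran; the scores reported in these notes come from the original cell.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Split first, then let the pipeline fit the scaler inside each training fold.
svr_pipeline = make_pipeline(RobustScaler(), SVR())
cv_scores = cross_val_score(svr_pipeline, X_tr, y_tr)

svr_pipeline.fit(X_tr, y_tr)
fitted_scaler = svr_pipeline.named_steps["robustscaler"]
test_r2 = svr_pipeline.score(X_te, y_te)  # test rows are scaled with training statistics

print(cv_scores.mean(), cv_scores.std(), test_r2)
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name, for example `svr__C`.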

In [137]:
SVRModel = SVR()

SVR_scores = cross_val_score(SVRModel, train_X, train_y)
(SVR_scores.mean(), SVR_scores.std())

(0.6906945648411797, 0.03792460043737687)

In [138]:
SVRModel.fit(train_X, train_y)
pred = SVRModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

8.938343611426133

In [139]:
(SVRModel.score(train_X, train_y), SVRModel.score(test_X, test_y))

(0.7338929272537011, 0.7050511243718853)

# SVR TUNING

In [121]:
SVRParam1 = {"C":np.arange(0.25,15,0.25),
            'gamma':np.arange(0,0.5,0.01)}

SVR_Grid1 = GridSearchCV(SVR(), SVRParam1)
SVR_Grid1.fit(train_X, train_y)
print(SVR_Grid1.best_params_, SVR_Grid1.best_score_)

{'C': 14.75, 'gamma': 0.25} 0.8537640880327395


In [122]:
SVRParam2 = {"C":np.arange(10,20,0.25),
             'gamma':np.arange(0,0.5,0.01)}

SVR_Grid2 = GridSearchCV(SVR(), SVRParam2)
SVR_Grid2.fit(train_X, train_y)
print(SVR_Grid2.best_params_, SVR_Grid2.best_score_)

{'C': 19.75, 'gamma': 0.25} 0.8595797867324297


In [129]:
SVRParam3 = {"C":np.arange(17.5,30,0.25),
             'gamma':np.arange(0,0.5,0.01)}

SVR_Grid3 = GridSearchCV(SVR(), SVRParam3)
SVR_Grid3.fit(train_X, train_y)
print(SVR_Grid3.best_params_, SVR_Grid3.best_score_)

{'C': 29.75, 'gamma': 0.16} 0.8651117968799211


It seems like the best value of the parameter "C" will just keep going higher and higher, therefore I stop the grid search here.
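
When the best `C` keeps landing on the upper edge of a linear range, one common option is to search on a logarithmic scale, which covers several orders of magnitude with few candidates. A minimal sketch on synthetic data (the ranges below are arbitrary choices for illustration, not values tuned for the concrete dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

log_grid = {
    "C": np.logspace(-1, 3, 5),      # 0.1, 1, 10, 100, 1000
    "gamma": np.logspace(-3, 0, 4),  # 0.001, 0.01, 0.1, 1
}

search = GridSearchCV(SVR(), log_grid)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
# If the winner sits on the edge of the grid, the range may need extending.
hit_upper_edge = bool(np.isclose(best_C, log_grid["C"].max()))
print(search.best_params_, round(search.best_score_, 3), hit_upper_edge)
```

A coarse logarithmic pass like this can be followed by a finer linear grid around the winning value.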

In [133]:
SVRModel = SVR(C=29.75, gamma=0.26)

SVM_scores = cross_val_score(SVRModel, train_X, train_y)
(SVM_scores.mean(), SVM_scores.std())

(0.866877196500645, 0.027392335000372466)

In [134]:
SVRModel.fit(train_X, train_y)
pred = SVRModel.predict(test_X)
mean_squared_error(test_y, pred, squared=False)

5.712804370034576

In [135]:
(SVRModel.score(train_X, train_y), SVRModel.score(test_X, test_y))

(0.9359456979439603, 0.879515450659403)

A great increase in R^2, but a decent amount of over-fitting.