The basic principle of decision trees is to split each parent node into child nodes that are as distinct as possible, according to an optimization criterion such as Gini impurity or entropy. While doing so, we can impose constraints, for example limiting the depth of the tree in order to reduce overfitting.
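To make the two criteria concrete, here is a minimal sketch that computes Gini impurity and entropy for a toy node and for a candidate split. The helper names (`gini_impurity`, `entropy`, `weighted_child_impurity`) and the toy labels are illustrative assumptions, not part of scikit-learn or of this kernel.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_child_impurity(left, right, impurity_fn):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * impurity_fn(left) + len(right) * impurity_fn(right)) / n

# Hypothetical toy node with two balanced classes
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]  # a split that separates the classes perfectly

print(gini_impurity(parent))  # 0.5
print(entropy(parent))        # 1.0
print(weighted_child_impurity(left, right, gini_impurity))  # 0.0
```

A tree learner prefers the split whose weighted child impurity is lowest, which is what "as distinct as possible" means here.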

This process makes no assumption about the distribution of the explanatory variables, unlike some statistical and machine learning models that are sensitive to skewed distributions. For that reason, we do not apply any transformation to the explanatory variables to correct problems such as skewness.
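One way to see why skewness matters little for trees is that splits depend only on the ordering of each feature's values, so a monotonic transformation such as a logarithm leaves the partition of the training data unchanged. The sketch below checks this on synthetic data; the data, the variable names, and the choice of `log2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical, heavily skewed features: powers of two spanning six orders of magnitude
exponents = rng.integers(0, 20, size=(200, 3))
X_skewed = 2.0 ** exponents
X_logged = np.log2(X_skewed)  # monotonic transformation that removes the skew

y_demo = (exponents[:, 0] + exponents[:, 1] > 18).astype(int)

tree_skewed = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_skewed, y_demo)
tree_logged = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_logged, y_demo)

# Both trees split the training samples in the same way
same_predictions = np.array_equal(
    tree_skewed.predict(X_skewed), tree_logged.predict(X_logged)
)
print(same_predictions)
```

The thresholds stored in the two trees differ, since they live on different scales, but the training samples end up in the same leaves.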

In this kernel we are going to use a random forest. The main idea of this machine learning model is: "While an individual tree is overfit to the training data and is likely to have large error, Random Forests uses the insight that a suitably large number of uncorrelated errors average out to zero to solve this problem. Random Forest chooses multiple random samples of observations from the training data, with replacement, constructing a tree from each one. Since each tree learns from different data, they are fairly uncorrelated from one another" [1].
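The sampling-with-replacement step described in the quote can be sketched by hand. This is only an illustration on synthetic data: the variable names are assumptions, and it leaves out the random feature subsetting at each split that `RandomForestClassifier` also performs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Average the positive-class probabilities of the individual trees
avg_proba = np.mean([t.predict_proba(X_demo)[:, 1] for t in trees], axis=0)
print(avg_proba[:5])
```

Each tree sees a different resample of the rows, so their errors are only partly correlated, and averaging smooths them out.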

In [1]:
# Loading the packages
import numpy as np
import pandas as pd 
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt

In [2]:
# Loading the training dataset
df_train = pd.read_csv("../input/train.csv")

In [3]:
y = df_train["target"]
# We exclude the target and id columns from the training dataset
df_train.pop("target");
df_train.pop("id")
colnames = df_train.columns
X = df_train 
del df_train
X = X.values # Converting pandas dataframe to numpy array 
y = y.values # Converting pandas series to numpy array 

Random forests have many parameters; you can see them on the following web page:

[Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

To find the best set of parameters we are going to use GridSearchCV, which performs an exhaustive search over a specified grid of parameter values.

In [7]:
from sklearn.metrics import make_scorer 
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# We are going to perform a grid search

# Initialize the classifier
clf = RandomForestClassifier(class_weight = 'balanced', random_state=0)

# Create the parameters list you wish to tune, using a dictionary if needed.
# parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}

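# Note: 'auto' was equivalent to 'sqrt' for classifiers and was removed in
# scikit-learn 1.3; drop it from the list when running on a recent version.
parameters = {'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'max_depth': [1, 2, 3, 4, 5, 6],
              'criterion': ['gini', 'entropy'],
              'max_features': ['auto', 'sqrt', 'log2'],
             }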

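# Make a ROC AUC scoring object using make_scorer().
# Note: built this way, the scorer applies roc_auc_score to the hard labels
# returned by predict(); the built-in scoring='roc_auc' uses predicted scores instead.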
scorer = make_scorer(roc_auc_score)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer, verbose=1, cv=10)

grid_fit = grid_obj.fit(X, y)


Fitting 10 folds for each of 360 candidates, totalling 3600 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 3600 out of 3600 | elapsed:   48.7s finished
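The candidate count reported in the log above follows directly from the grid: it is the product of the number of values per parameter. A small sketch using scikit-learn's `ParameterGrid` confirms the arithmetic; the name `grid` is an illustrative assumption, and the values are copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
grid = {
    'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'max_depth': [1, 2, 3, 4, 5, 6],
    'criterion': ['gini', 'entropy'],
    'max_features': ['auto', 'sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(grid))  # 10 * 6 * 2 * 3
n_fits = n_candidates * 10               # one fit per candidate per fold (cv=10)
print(n_candidates, n_fits)  # 360 3600
```

Because the count grows multiplicatively, adding one more parameter with a handful of values can multiply the running time several times over.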


In [6]:
# Get the estimator
best_clf = grid_fit.best_estimator_
print(best_clf)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=2, max_features='log2',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=6, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False)


We are going to copy the parameters found above into the next code cell.

In [15]:
model = RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=2, max_features='log2',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=6, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
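As an alternative to copying the printed parameters by hand, the winning combination is also available programmatically through the `best_params_` attribute of a fitted `GridSearchCV`. The sketch below is a hedged illustration on synthetic data with a deliberately small grid; the names `demo_grid` and `demo_model` are assumptions, and it uses the built-in `scoring='roc_auc'` rather than the scorer defined earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the competition data)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

demo_grid = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    {'n_estimators': [5, 10], 'max_depth': [2, 3]},
    scoring='roc_auc',
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# Rebuild a model from the best parameters instead of retyping them
demo_model = RandomForestClassifier(
    class_weight='balanced', random_state=0, **demo_grid.best_params_
)
demo_model.fit(X_demo, y_demo)
print(demo_grid.best_params_)
```

With the default `refit=True`, `best_estimator_` is already refit on all the data passed to `fit`, so rebuilding the model is mainly useful when you want an explicit, reproducible definition.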

We fit the model on the whole training dataset.

In [16]:
model.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=2, max_features='log2',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=6, n_jobs=None, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

Random forests can also be used to identify which variables are the better predictors. We will use permutation importance to measure the importance of the explanatory variables. I suggest reading the following web page to understand how it works:

https://www.kaggle.com/dansbecker/permutation-importance
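As a rough illustration of the idea, scikit-learn ships its own `permutation_importance` function in `sklearn.inspection`, which shuffles one column at a time and records the drop in score. The sketch below runs it on synthetic data; it is not the eli5 call used in this kernel, and the data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data with a few informative features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Shuffle each column n_repeats times and measure the drop in accuracy
result = permutation_importance(
    demo_model, X_demo, y_demo, scoring='accuracy', n_repeats=5, random_state=0
)

ranking = result.importances_mean.argsort()[::-1]  # most important first
for i in ranking[:3]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Computing the importances on held-out data rather than the training data usually gives a more honest picture, since a model can lean on features it has merely memorised.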



In [17]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, random_state=0).fit(X, y)

eli5.show_weights(perm, feature_names = colnames.tolist())

| Weight | Feature |
|---|---|
| 0.0544 ± 0.0224 | 33 |
| 0.0296 ± 0.0302 | 279 |
| 0.0264 ± 0.0230 | 272 |
| 0.0224 ± 0.0218 | 83 |
| 0.0224 ± 0.0148 | 237 |
| 0.0200 ± 0.0209 | 241 |
| 0.0168 ± 0.0128 | 91 |
| 0.0152 ± 0.0155 | 199 |
| 0.0144 ± 0.0096 | 216 |
| 0.0128 ± 0.0138 | 19 |


We can interpret the numbers in the table above as follows: 

"The first number in each row shows how much model performance decreased with a random shuffling (in this case, using "accuracy" as the performance metric)"[2].

"Like most things in data science, there is some randomness to the exact performance change from a shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one-reshuffling to the next"[2].

According to the table above, the best predictors are: 33, 279, 272, 83, 237, 241, 91, 199, 216, 19, 65, 141, 70, 243, 137, 26, 90. Finally, we generate predictions for the test set and create the submission file.


In [18]:
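df_test = pd.read_csv("../input/test.csv")
df_test.pop("id");
X_test = df_test.values  # Converting to a numpy array, as was done for the training data
del df_test
# Keep the predicted probability of the positive class (column 1 matches model.classes_[1])
y_pred = model.predict_proba(X_test)[:, 1]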
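The column order of `predict_proba` follows the fitted model's `classes_` attribute, so taking column 1 assumes that the positive class is stored second. A small sketch on synthetic data shows how to look the column up explicitly; the data and the names `demo_model` and `positive_col` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with labels 0 and 1
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_model = RandomForestClassifier(
    n_estimators=6, max_depth=2, random_state=0
).fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo)  # shape (n_samples, n_classes)

# Find the column that holds the probability of class 1
positive_col = int(np.where(demo_model.classes_ == 1)[0][0])
positive_proba = proba[:, positive_col]
print(demo_model.classes_, positive_col)
```

For a 0/1 target the lookup returns column 1, which matches the indexing used in the cell above.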


In [19]:
# submit prediction
smpsb_df = pd.read_csv("../input/sample_submission.csv")
smpsb_df["target"] = y_pred
smpsb_df.to_csv("random_forests.csv", index=False)

## References

[1] https://towardsdatascience.com/random-forests-and-the-bias-variance-tradeoff-3b77fee339b4

[2] https://www.kaggle.com/dansbecker/permutation-importance