<a href="https://colab.research.google.com/github/Sarthak016/MachineLearning/blob/main/Classification_BLUEPRINT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic using Classification

The following topics are covered in this Colab notebook:

- Downloading a real-world dataset
- Preparing a dataset for training
- Training and interpreting decision trees
- Training and interpreting random forests
- Overfitting, hyperparameter tuning & regularization
- Making predictions on single inputs


# Problem Statement

This tutorial takes a practical and coding-focused approach. We'll define the terms _machine learning_ and _linear regression_ in the context of a problem, and later generalize their definitions. We'll work through a typical machine learning problem step-by-step:


> **QUESTION**: ACME Insurance Inc. offers affordable health insurance to thousands of customers all over the United States. As the lead data scientist at ACME, **you're tasked with creating an automated system to estimate the annual medical expenditure for new customers**, using information such as their age, sex, BMI, children, smoking habits and region of residence. 
>
> Estimates from your system will be used to determine the annual insurance premium (amount paid every month) offered to the customer. Due to regulatory requirements, you must be able to explain why your system outputs a certain prediction.

# Step 1 - Download and Explore the Data

The dataset is available as a ZIP file at the following url:

> Load the data from the file `data.csv` into a pandas DataFrame.

In [None]:
# Import pandas and numpy to read csv file
import pandas as pd
import numpy as np
# Show every column when a DataFrame is displayed
pd.set_option("display.max_columns", None)

In [None]:
# Read the csv file
data=pd.read_csv("/content/drive/MyDrive/data.csv")

In [None]:
data

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,


The dataset contains 569 rows (see the `count` row in the summary above). Each row of the dataset contains the measurements for one record.

Our objective is to find a way to predict the value in the `diagnosis` column using the values in the other columns. If we can do so for the historical data, then we should be able to predict the diagnosis for new records too, simply by collecting the same measurements.

Let's check the data type for each column.

In [None]:
data.info()

>Here are some statistics for the numerical columns:

In [None]:
data.describe()

>  How many rows and columns does the dataset contain? 

In [None]:
n_rows = data.shape[0]

In [None]:
n_cols = data.shape[1]

In [None]:
print('The dataset contains {} rows and {} columns.'.format(n_rows, n_cols))

## Exploratory Analysis and Visualization

Let's explore the data by visualizing the distribution of values in some columns of the dataset, and the relationships between the target column and the other columns.


* Libraries that we are going to use in this Colab notebook

In [None]:
# Libraries that we are going to use in this Colab notebook
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#The following settings will improve the default style and font sizes for our charts
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

> What percentage of the values in each column is missing?

In [None]:
# Step 1: make a list of the features that have at least one missing value
feature_with_na = [feature for feature in data.columns if data[feature].isnull().sum() > 0]
# Step 2: print each feature name and its percentage of missing values
for feature in feature_with_na:
    print(feature, np.round(data[feature].isnull().mean() * 100, 2), "% missing values")

In [None]:
# Drop the columns in which more than 40% of the values are NaN
perc=40.0
min_count=int(((100-perc)/100)*data.shape[0] + 1)
data=data.dropna(axis=1,thresh=min_count)

In [None]:
pip install -U dataprep

In [None]:
# Use dataprep profiling to get an idea of the dataset
from dataprep.eda import create_report
report=create_report(data)
report

### Numerical Variables

In [None]:
# list of numerical variables
numerical_features = [feature for feature in data.columns if data[feature].dtypes != 'O']

print('Number of numerical variables: ', len(numerical_features))
# visualise the numerical variables
data[numerical_features].head()

In [None]:
## Lets analyse the continuous values by creating histograms to understand the distribution
df = data[numerical_features]
fig = plt.figure(figsize = (25, 35))
i=1
for n in df.columns:
    plt.subplot(7, 5, i)
    figure = sns.histplot(x = data[n],hue = data['diagnosis'], palette = ['#676FA3', '#FF5959'], bins = 40)
    figure.set(xlabel = None, ylabel = None)
    plt.title(str(n), loc = 'center')
    plt.xticks(rotation = 20, fontsize = 10)
    i += 1

### Categorical Variables

In [None]:
# list of categorical variables
categorical_features = [feature for feature in data.columns if data[feature].dtypes == 'O']

print('Number of categorical variables: ', len(categorical_features))
# visualise the categorical variables
data[categorical_features].head()

In [None]:
# Unique number of categorical features
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(data[feature].unique())))

In [None]:
# Find out the relationship between each categorical variable and the target (`diagnosis`)

df = data[categorical_features]
plt.figure(figsize = (25, 25))
i = 1
for c in df.columns:
    plt.subplot(5, 2, i)
    # Pass column names (not Series) when `data` is given
    figure = sns.countplot(data = data, x = c, hue = 'diagnosis', palette = ['#676FA3', '#FF5959'])
    figure.set(xlabel = None, ylabel = None)
    plt.title(str(c), loc='center')
    plt.xticks( fontsize = 10)
    i += 1



### Outliers

In [None]:
for feature in numerical_features:
    dataset=data.copy()
    if 0 in dataset[feature].unique():
        pass
    else:
        dataset[feature]=np.log(dataset[feature])
        dataset.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()   

### Correlation

In [None]:
plt.figure(figsize=(15,10))
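# Correlations are only defined for numeric columns, so restrict the heatmap to them
sns.heatmap(data[numerical_features].corr().abs(),annot=True,cmap='coolwarm',linewidth=1,linecolor='black')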

# Step 2 - Prepare the Dataset for Training


Before we can train the model, we need to prepare the dataset. Here are the steps we'll follow:

1. Identify the input and target column(s) for training the model.
2. Identify numeric and categorical input columns.
3. [Impute](https://scikit-learn.org/stable/modules/impute.html) (fill) missing values in numeric columns
4. [Scale](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) values in numeric columns to a $(0,1)$ range.
5. [Encode](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) categorical data into one-hot vectors.
6. Split the dataset into training and validation sets.


## Identify Inputs and Targets

Not all columns of the dataset are useful for modeling. Note the following:

- The `id` column is a unique identifier for each record and isn't useful for training the model.
- The `diagnosis` column contains the value we need to predict, i.e. it's the target column.
- Data from all the other columns can be used as inputs to the model.
 

> Create a list `input_cols` of column names containing data that can be used as input to train the model, and identify the target column as the variable `target_col`.

In [None]:
# Identify the name of the target column (a single string, not a list)
target_col = 'diagnosis'

# Identify the input columns (a list of column names): everything except the identifier and the target
input_cols = [col for col in data.columns if col not in ('id', target_col)]

In [None]:
# It's always good practice to print and check the result of the code you execute
print(input_cols)

In [None]:
# It's always good practice to print and check the result of the code you execute
print(target_col)

Make sure that the identifier column and the target column are not included in `input_cols`.

Now that we've identified the input and target columns, we can separate input & target data.

In [None]:
# Separate input & target data
# Copy the slice so that later in-place assignments don't trigger SettingWithCopyWarning
inputs_df = data[input_cols].copy()
targets = data[target_col]

## Identify Numeric and Categorical Data
The next step in data preparation is to identify numeric and categorical columns. We can do this by looking at the data type of each column.

> **QUESTION 5**: Create two lists `numeric_cols` and `categorical_cols` containing names of numeric and categorical input columns within the dataframe respectively. Numeric columns have data types `int64` and `float64`, whereas categorical columns have the data type `object`.
>
> *Hint*: See this [StackOverflow question](https://stackoverflow.com/questions/25039626/how-do-i-find-numeric-columns-in-pandas). 

In [None]:
# Select column names by data type
numeric_cols = inputs_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = inputs_df.select_dtypes(include=['object']).columns.tolist()

## Impute Numerical Data
Some of the numeric columns in our dataset contain missing values (NaN).

In [None]:
# using isna() to calculate the null values in Numeric columns
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

Machine learning models can't work with missing data. The process of filling missing values is called [imputation](https://scikit-learn.org/stable/modules/impute.html).

<img src="https://i.imgur.com/W7cfyOp.png" width="480">

There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the `SimpleImputer` class from `sklearn.impute`.
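To make the idea concrete, here is a minimal sketch of mean imputation on a single made-up column (the values are hypothetical and not taken from the dataset). `SimpleImputer(strategy='mean')` applies the same idea to every column it is fitted on.

```python
import numpy as np

# Hypothetical column with one missing value (not from the dataset)
column = np.array([1.0, np.nan, 3.0, 5.0])

# Mean of the observed (non-missing) values: (1 + 3 + 5) / 3
column_mean = np.nanmean(column)

# Replace every NaN with the column mean
imputed = np.where(np.isnan(column), column_mean, column)
print(imputed)
```

The missing entry is filled with the mean of the three observed values, so the column mean is unchanged by the imputation.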


> **QUESTION 6**: Impute (fill) missing values in the numeric columns of `inputs_df` using a `SimpleImputer`. 

In [None]:
# Import SimpleImputer from sklearn library
from sklearn.impute import SimpleImputer

# 1. Create the imputer
imputer = SimpleImputer(strategy = 'mean')

# 2. Fit the imputer to the numeric columns
imputer.fit(inputs_df[numeric_cols])

# 3. Transform and replace the numeric columns
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])

In [None]:
# using isna()  to check the null values in Numeric columns
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

## Scale Numerical Values
The numeric columns in our dataset have varying ranges.

In [None]:
# using describe function to see statistics information and .loc to filter min and max from describe function
inputs_df[numeric_cols].describe().loc[['min', 'max']]

A good practice is to [scale numeric features](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) to a small range of values e.g. $(0,1)$. Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.
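As a small illustration, min-max scaling maps each value $x$ in a column to $(x - \min) / (\max - \min)$. The sketch below applies that formula by hand to a made-up column; the numbers are hypothetical and only meant to show the arithmetic that `MinMaxScaler` performs with its default range.

```python
import numpy as np

# Hypothetical feature values (not from the dataset)
values = np.array([10.0, 20.0, 25.0, 50.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)
```

Every value keeps its relative position between the column minimum and maximum, which is why the scaler must be fitted on the data before it can transform it.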


> **QUESTION 7**: Scale numeric values to the $(0, 1)$ range using `MinMaxScaler` from `sklearn.preprocessing`.

In [None]:
# Import MinMaxScaler from sklearn library
from sklearn.preprocessing import MinMaxScaler

# Create the scaler
scaler = MinMaxScaler()

# Fit the scaler to the numeric columns
scaler.fit(inputs_df[numeric_cols])

# Transform and replace the numeric columns
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])

After scaling, the ranges of all numeric columns should be (0, 1).

In [None]:
# Let's check that scaling worked or not
inputs_df[numeric_cols].describe().loc[['min', 'max']]

## Encode Categorical Columns
Our dataset contains several categorical columns, each with a different number of categories.

In [None]:
# Print the number of unique categories in each categorical column
inputs_df[categorical_cols].nunique().sort_values(ascending=False)



Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
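Here is a minimal sketch of what one-hot encoding produces, using a made-up `colour` column (hypothetical, not part of the dataset). For brevity it uses `pd.get_dummies` from pandas rather than `OneHotEncoder`; the resulting 0/1 columns follow the same idea.

```python
import pandas as pd

# Hypothetical categorical column (not from the dataset)
toy = pd.DataFrame({'colour': ['red', 'green', 'red', 'blue']})

# One new 0/1 column per unique category
one_hot = pd.get_dummies(toy['colour'], prefix='colour', dtype=int)
print(one_hot)
```

Each row contains exactly one `1`, in the column that matches its original category.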

> **QUESTION 8**: Encode categorical columns in the dataset as one-hot vectors using `OneHotEncoder` from `sklearn.preprocessing`. Add a new binary (0/1) column for each category

In [None]:
# Import OneHotEncoder from sklearn library
from sklearn.preprocessing import OneHotEncoder

# 1. Create the encoder (`sparse_output` was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# 2. Fit the encoder to the categorical columns and
# 3. generate column names for each category (skipped when there are no categorical inputs)
encoded_cols = []
if categorical_cols:
    encoder.fit(inputs_df[categorical_cols])
    encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
len(encoded_cols)

In [None]:
# 4. Transform and add new one-hot category columns
if encoded_cols:
    inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])

The new one-hot category columns should now be added to `inputs_df`.

## Training and Validation Set
Finally, let's split the dataset into a training and validation set. We'll use a randomly selected 25% subset of the data for validation. Also, we'll use just the numeric and encoded columns, since the inputs to our model must be numbers.

In [None]:
# Import train_test_split from sklearn library to make split of data into train sets and validation sets
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df[numeric_cols + encoded_cols], 
                                                                        targets, 
                                                                        test_size=0.25, 
                                                                        random_state=42)


In [None]:
# It's always good practice to print and check the result of the executed code.
train_inputs

In [None]:
# It's always good practice to print and check the result of the executed code.
train_targets

In [None]:
# It's always good practice to print and check the result of the executed code.
val_inputs

In [None]:
# It's always good practice to print and check the result of the executed code.
val_targets

# Models

In [None]:
pip install catboost

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance
import lightgbm 


In [None]:
# Each entry is [display name, estimator]; every estimator must be a classifier
models = [
           ['LogisticRegression',                LogisticRegression()],
           ['KNeighborsClassifier',              KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')],
           ['SVC',                               SVC(kernel='linear', C=1.0, random_state=0)],
           ['DecisionTreeClassifier',            DecisionTreeClassifier(random_state=42)],
           ['RandomForestClassifier',            RandomForestClassifier(random_state=42)],
           ['ExtraTreesClassifier',              ExtraTreesClassifier(random_state=42)],
           ['GradientBoostingClassifier',        GradientBoostingClassifier(random_state=42)],
           ['XGBClassifier',                     XGBClassifier(objective= 'binary:logistic',random_state=42)],
           ['Light-GBM',                         lightgbm.LGBMClassifier(random_state=42)]
          ]
           
           

In [None]:
# Run all the proposed models and collect the results in the list model_data
import time
from sklearn import metrics

model_data = []
for name, model in models:

    model_data_dic = {}
    model_data_dic["Name"] = name
    # Time only the call to fit()
    start = time.time()
    model.fit(train_inputs, train_targets)
    end = time.time()
    model_data_dic["Train_Time"] = end - start
    # Training set
    model_data_dic["Train_Accuracy"] = metrics.accuracy_score(train_targets, model.predict(train_inputs))
    # Validation set
    model_data_dic["Test_Accuracy"] = metrics.accuracy_score(val_targets, model.predict(val_inputs))

    model_data.append(model_data_dic)

In [None]:
# Convert list to dataframe
df = pd.DataFrame(model_data)

In [None]:
# Plot the accuracy columns that were recorded in model_data
df.plot(x="Name", y=['Train_Accuracy', 'Test_Accuracy'], kind="bar", title='Accuracy Results', figsize=(10, 8));

* Observations
1. The model with the highest `Test_Accuracy` in the chart above performs best on data it has not seen during training.
2. A model whose `Train_Accuracy` is much higher than its `Test_Accuracy` is overfitting the training set.

In [None]:
from sklearn import metrics
# generate evaluation metrics for training set
def evalaute_train(model,train_inputs,train_targets):
  
    print ("Train - Accuracy :", metrics.accuracy_score(train_targets, model.predict(train_inputs)))
    print ("Train - AUC :", metrics.roc_auc_score(train_targets, model.predict_proba(train_inputs)[:,1]))
    print ("Train - Confusion matrix :",metrics.confusion_matrix(train_targets, model.predict(train_inputs)))
    print ("-----------------------------------------------------------------------------------------")
    print ("Train - classification report :", metrics.classification_report(train_targets, model.predict(train_inputs)))

# generate evaluation metrics for validation set
def evalaute_test(model,val_inputs,val_targets):

    print ("Test - Accuracy :", metrics.accuracy_score(val_targets, model.predict(val_inputs)))
    print ("Test - AUC :", metrics.roc_auc_score(val_targets, model.predict_proba(val_inputs)[:,1]))
    print ("Test - Confusion matrix :",metrics.confusion_matrix(val_targets, model.predict(val_inputs)))
    print ("-----------------------------------------------------------------------------------------")
    print ("Test - classification report :", metrics.classification_report(val_targets, model.predict(val_inputs)))


# Model 1 - Train a Logistic Regression Model


Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model: 

- we take a linear combination (or weighted sum) of the input features
- we apply the sigmoid function to the result to obtain a number between 0 and 1
- this number represents the probability of the input belonging to the positive class
- instead of RMSE, the cross entropy loss function is used to evaluate the results


Here's a visual summary of how a logistic regression model is structured ([source](http://datahacker.rs/005-pytorch-logistic-regression-in-pytorch/)):


<img src="https://i.imgur.com/YMaMo5D.png" width="480">

The sigmoid function applied to the linear combination of inputs has the following formula:

<img src="https://i.imgur.com/sAVwvZP.png" width="400">


The output of the sigmoid function is called a logistic, hence the name _logistic regression_. For a mathematical discussion of logistic regression, sigmoid activation and cross entropy, check out [this YouTube playlist](https://www.youtube.com/watch?v=-la3q9d7AKQ&list=PLNeKWBMsAzboR8vvhnlanxCNr2V7ITuxy&index=1). Logistic regression can also be applied to multi-class classification problems, with a few modifications.
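The pieces described above can be sketched in a few lines of NumPy. The weights, bias and inputs below are hypothetical values chosen only to show the arithmetic (weighted sum, then sigmoid, then cross entropy); they are not learned from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p):
    """Average cross-entropy loss between 0/1 labels and predicted probabilities."""
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical weights, bias, inputs and labels
weights = np.array([0.5, -0.25])
bias = 0.1
x = np.array([[2.0, 1.0],
              [0.0, 4.0]])
y = np.array([1, 0])

z = x @ weights + bias           # linear combination of the inputs
p = sigmoid(z)                   # predicted probability of the positive class
loss = binary_cross_entropy(y, p)
print(z, p, loss)
```

A positive weighted sum gives a probability above 0.5 and a negative one gives a probability below 0.5, so the sign of `z` decides the predicted class.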





> **QUESTION 9**: Create and train a logistic regression model using the `LogisticRegression` class from `sklearn.linear_model`.

In [None]:
from sklearn.linear_model import LogisticRegression

# instantiate a logistic regression model, and fit with train_inputs and train_targets
model = LogisticRegression(solver='liblinear')
model.fit(train_inputs, train_targets)

`model.fit` uses the following strategy for training the model:

1. We initialize a model with random parameters (weights & biases).
2. We pass some inputs into the model to obtain predictions.
3. We compare the model's predictions with the actual targets using the loss function.
4. We use an optimization technique (like least squares, gradient descent etc.) to reduce the loss by adjusting the weights & biases of the model
5. We repeat steps 2 to 4 till the predictions from the model are good enough.
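The training loop described in the list above can be sketched with plain gradient descent on toy data. This is an illustration under stated assumptions, not what scikit-learn runs internally: `LogisticRegression` relies on dedicated solvers such as `liblinear` or `lbfgs`. The data, learning rate and number of steps are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the label is 1 when x0 + x1 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b):
    # Cross-entropy loss; probabilities are clipped to avoid log(0)
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Initialize the model with small random weights and a zero bias
w = rng.normal(scale=0.01, size=2)
b = 0.0
initial_loss = loss_fn(w, b)

learning_rate = 0.5
for step in range(200):
    p = sigmoid(X @ w + b)              # obtain predictions
    grad_w = X.T @ (p - y) / len(y)     # gradient of the loss w.r.t. the weights
    grad_b = np.mean(p - y)             # gradient of the loss w.r.t. the bias
    w -= learning_rate * grad_w         # adjust the parameters to reduce the loss
    b -= learning_rate * grad_b

final_loss = loss_fn(w, b)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(initial_loss, final_loss, accuracy)
```

Only the prediction, comparison and update steps are repeated inside the loop; the parameters are initialized once before it starts.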

<img src="https://www.deepnetts.com/blog/wp-content/uploads/2019/02/SupervisedLearning.png" width="480">

In [None]:
evalaute_train(model,train_inputs,train_targets)

In [None]:
evalaute_test(model,val_inputs,val_targets)

## Regularization
With an increase in the number of variables, the probability of over-fitting also increases.
`LASSO (L1)` and `Ridge (L2)` can be applied for logistic regression as well to avoid overfitting.
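As a rough sketch of the difference between the two penalties, the example below fits both on synthetic data in which most features are noise. The data and the value `C=0.05` are assumptions chosen for illustration. L1 regularization tends to push the weights of uninformative features to exactly zero, whereas L2 only shrinks them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 20 features, only 3 of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# A smaller C means stronger regularization; liblinear supports both penalties
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear', random_state=0).fit(X, y)
l2_model = LogisticRegression(penalty='l2', C=0.05, solver='liblinear', random_state=0).fit(X, y)

# Count the coefficients that are exactly zero under each penalty
n_zero_l1 = int((l1_model.coef_ == 0).sum())
n_zero_l2 = int((l2_model.coef_ == 0).sum())
print('Zero weights with L1:', n_zero_l1)
print('Zero weights with L2:', n_zero_l2)
```

Note that in scikit-learn `C` is the inverse of the regularization strength, so lowering `C` regularizes more.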

In [None]:
from sklearn.linear_model import LogisticRegression

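# l1 regularization gives better results
# (the default solver does not support the l1 penalty, so liblinear is used here)
model = LogisticRegression(penalty='l1', C=10, solver='liblinear', random_state=0)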

model.fit(train_inputs,train_targets)

In [None]:
evalaute_train(model,train_inputs,train_targets)

In [None]:
evalaute_test(model,val_inputs,val_targets)

## Feature Importance

Let's look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.

> **QUESTION 11**: Identify the weights (or coefficients) assigned to different features by the model.
> 
> *Hint:* Read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
weights = model.coef_.flatten()

Let's create a dataframe to view the weight assigned to each column.

In [None]:
weights_df = pd.DataFrame({
    'columns': train_inputs.columns,
    'weight': weights
}).sort_values('weight', ascending=False)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=weights_df.head(10), x='weight', y='columns');

# Model 2 - Train a Support Vector Machine (SVM)

1. SVM is comparatively less sensitive to outliers than logistic regression, as it only cares about the points that are closest to the decision boundary (the support vectors).


2. Key Parameters
* C: This is the penalty parameter. It controls the trade-off between a smooth decision boundary and classifying the training points correctly, default=1
* Kernel: A kernel is a similarity function for pattern analysis. It must be one of rbf/linear/poly/sigmoid/precomputed, default='rbf' (Radial Basis Function). Choosing an appropriate kernel will result in a better model fit
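To illustrate why the kernel choice matters, here is a small sketch on synthetic data that is not linearly separable (two concentric circles generated with `make_circles`). The dataset and its parameters are assumptions made for this example only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data (hypothetical): an inner circle of one class surrounded by a ring of the other
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same C, different kernels
linear_acc = SVC(kernel='linear', C=1.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', C=1.0).fit(X, y).score(X, y)

print('Linear kernel accuracy:', linear_acc)
print('RBF kernel accuracy:', rbf_acc)
```

A straight line cannot separate a circle from the ring around it, so the linear kernel struggles here while the RBF kernel can fit the curved boundary.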

In [None]:
from sklearn.svm import SVC

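# probability=True enables predict_proba, which the evaluation helpers need to compute AUC
model = SVC(kernel='linear', C=1.0, probability=True, random_state=0)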
model.fit(train_inputs,train_targets)

In [None]:
evalaute_train(model,train_inputs,train_targets)

In [None]:
evalaute_test(model,val_inputs,val_targets)

## Plotting SVM Decision Boundaries

In [None]:
# Let's use sklearn's make_classification function to create some test data.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0,
                           weights=[.5, .5], random_state=0)
# build a simple linear SVM model
clf = SVC(kernel='linear', random_state=0)
clf.fit(X, y)
# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]
# compute the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# Plot the data points and the decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30)
plt.plot(xx, yy, 'k-', label='decision boundary')
# plot the margins and circle the support vectors
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
            facecolors='none', edgecolors='k', label='support vectors')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

# Model 3 - Train a k-Nearest Neighbors (kNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
model.fit(train_inputs,train_targets)


In [None]:
evalaute_train(model,train_inputs,train_targets)

In [None]:
evalaute_test(model,val_inputs,val_targets)

# Model 4 - Training and Visualizing Decision Trees

A decision tree in general parlance represents a hierarchical series of binary decisions:

<img src="https://i.imgur.com/qSH4lqz.png" width="480">

A decision tree in machine learning works in exactly the same way, except that we let the computer figure out the optimal structure & hierarchy of decisions, instead of coming up with criteria manually.

## Training

We can use `DecisionTreeClassifier` from `sklearn.tree` to train a decision tree.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create the model
model = DecisionTreeClassifier(criterion = 'entropy',random_state=42)

# Fit the model
model.fit(train_inputs, train_targets)

A decision tree has now been fitted to the training data.

## Evaluation

Let's evaluate the decision tree using the accuracy score.

In [None]:
evalaute_train(model,train_inputs,train_targets)

In [None]:
evalaute_test(model,val_inputs,val_targets)

The decision tree also returns probabilities for each prediction.
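As a minimal sketch, the probabilities can be read with `predict_proba`. The example below uses synthetic data from `make_classification` (an assumption, so that the snippet runs on its own). For a decision tree, each probability is the fraction of training samples of that class in the leaf the input falls into.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical), used only to keep the example self-contained
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# One row per input, one column per class (in the order given by tree.classes_)
probs = tree.predict_proba(X[:5])
row_sums = probs.sum(axis=1)

print(tree.classes_)
print(probs)
```

Each row sums to 1, and `predict` returns the class with the highest probability in that row.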

The training set accuracy is close to 100%! But we can't rely solely on the training set accuracy; we must evaluate the model on the validation set too. 

We can make predictions and compute accuracy in one step using `model.score`
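A short sketch of that shortcut, again on synthetic data (an assumption made so the snippet is self-contained): `model.score` on a classifier returns the same value as calling `predict` and then `accuracy_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical), split the same way as in this notebook
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# One step: predict and compute accuracy together
score_one_step = tree.score(X_va, y_va)

# Two steps: predict first, then compute accuracy
score_two_steps = accuracy_score(y_va, tree.predict(X_va))

print(score_one_step, score_two_steps)
```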

Although the training accuracy is 100%, the accuracy on the validation set is lower. It should be compared with a simple baseline that always predicts the most frequent class before the model is trusted.

## Visualization

We can visualize the decision tree _learned_ from the training data.

In [None]:
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(80,20))
plot_tree(model, feature_names=list(train_inputs.columns), max_depth=2, filled=True);

Can you see how the model classifies a given input as a series of decisions? The tree is truncated here, but following any path from the root node down to a leaf will result in one of the target classes. Do you see how a decision tree differs from a logistic regression model?


**How a Decision Tree is Created**

Note the impurity value in each box (labelled `entropy` for the tree trained above, and `gini` when the default criterion is used). This is the loss function used by the decision tree to decide which column should be used for splitting the data, and at what point the column should be split. A lower Gini index indicates a better split. A perfect split (only one class on each side) has a Gini index of 0. 

For a mathematical discussion of the Gini Index, watch this video: https://www.youtube.com/watch?v=-W0DnxQK1Eo . It has the following formula:

<img src="https://i.imgur.com/CSC0gAo.png" width="240">

Conceptually speaking, while training, the model evaluates all possible splits across all possible columns and picks the best one. Then, it recursively performs an optimal split for the two portions. In practice, however, it's very inefficient to check all possible splits, so the model uses a heuristic (predefined strategy) combined with some randomization.
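As an illustration of how a candidate split is scored, the sketch below computes the Gini index (one minus the sum of squared class proportions) for a made-up set of labels and for two hypothetical splits of it. The labels are invented for the example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index of a group of labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Gini index of a split: impurity of each side weighted by its size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Hypothetical labels at a node before splitting
parent = ['M', 'M', 'B', 'B']

print(gini_impurity(parent))                   # evenly mixed node
print(weighted_gini(['M', 'M'], ['B', 'B']))   # perfect split
print(weighted_gini(['M', 'B'], ['M', 'B']))   # split that separates nothing
```

The perfect split scores 0, while the split that leaves both sides evenly mixed scores the same as the unsplit node, so the tree would prefer the first one.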

The iterative approach of the machine learning workflow in the case of a decision tree involves growing the tree layer-by-layer:

<img src="https://www.deepnetts.com/blog/wp-content/uploads/2019/02/SupervisedLearning.png" width="480">


Let's check the depth of the tree that was created.

In [None]:
model.tree_.max_depth

We can also display the tree as text, which can be easier to follow for deeper trees.

In [None]:
tree_text = export_text(model, max_depth=10, feature_names=list(train_inputs.columns))
print(tree_text[:5000])

## Feature Importance

Based on the gini index computations, a decision tree assigns an "importance" value to each feature. These values can be used to interpret the results given by a decision tree.

In [None]:
model.feature_importances_

Let's turn this into a dataframe and visualize the most important features.

In [None]:
importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
importance_df.head(10)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

## Hyperparameter Tuning and Overfitting

As we saw in the previous section, our decision tree classifier memorized all training examples, leading to a 100% training accuracy, while the validation accuracy was only marginally better than a dumb baseline model. This phenomenon is called overfitting, and in this section, we'll look at some strategies for reducing overfitting. The process of reducing overfitting is known as _regularization_.


The `DecisionTreeClassifier` accepts several arguments, some of which can be modified to reduce overfitting.

These arguments are called hyperparameters because they must be configured manually (as opposed to the parameters within the model, which are _learned_ from the data). We'll explore a couple of hyperparameters:

- `max_depth`
- `max_leaf_nodes`

### `max_depth`

By reducing the maximum depth of the decision tree, we can prevent the tree from memorizing all training examples, which may lead to better generalization.

In [None]:
model = DecisionTreeClassifier(max_depth=3, random_state=42)

model.fit(train_inputs, train_targets)

We can compute the accuracy of the model on the training and validation sets using `model.score`

In [None]:
model.score(train_inputs, train_targets)

In [None]:
model.score(val_inputs, val_targets)

Great, while the training accuracy of the model has gone down, the validation accuracy of the model has increased significantly.

In [None]:
model.classes_

In [None]:
plt.figure(figsize=(80,20))
plot_tree(model, feature_names=list(train_inputs.columns), filled=True, rounded=True, class_names=[str(c) for c in model.classes_]);

In [None]:
print(export_text(model, feature_names=list(train_inputs.columns)))

Let's experiment with different depths using a helper function.

In [None]:
from sklearn.metrics import accuracy_score

def test_params(**params):
    # Train a tree with the given hyperparameters and return (training accuracy, validation accuracy)
    model = DecisionTreeClassifier(random_state=42, **params).fit(train_inputs, train_targets)
    train_score = accuracy_score(train_targets, model.predict(train_inputs))
    val_score = accuracy_score(val_targets, model.predict(val_inputs))
    return train_score, val_score

def test_param_and_plot(param_name, param_values):
    # Plot training and validation accuracy for each value of a single hyperparameter
    train_scores, val_scores = [], []
    for value in param_values:
        params = {param_name: value}
        train_score, val_score = test_params(**params)
        train_scores.append(train_score)
        val_scores.append(val_score)
    plt.figure(figsize=(10,6))
    plt.title('Overfitting curve: ' + param_name)
    plt.plot(param_values, train_scores, 'b-o')
    plt.plot(param_values, val_scores, 'r-o')
    plt.xlabel(param_name)
    plt.ylabel('Score')
    plt.legend(['Training', 'Validation'])

In [None]:
def max_depth_error(md):
    # Train a tree with the given max_depth and return its training and validation error (1 - accuracy)
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(X_train, train_targets)
    train_err = 1 - model.score(X_train, train_targets)
    val_err = 1 - model.score(X_val, val_targets)
    return {'Max Depth': md, 'Training Error': train_err, 'Validation Error': val_err}

In [None]:
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])

In [None]:
plt.figure()
plt.plot(errors_df['Max Depth'], errors_df['Training Error'])
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'])
plt.title('Training vs. Validation Error')
plt.xticks(range(0,21, 2))
plt.xlabel('Max. Depth')
plt.ylabel('Prediction Error (1 - Accuracy)')
plt.legend(['Training', 'Validation'])

This is a common pattern you'll see with all machine learning algorithms:

<img src="https://i.imgur.com/EJCrSZw.png" width="480">





You'll often need to tune hyperparameters carefully to find the optimal fit. In the above case, it appears that a maximum depth of 7 results in the lowest validation error.

In [None]:
model = DecisionTreeClassifier(max_depth=7, random_state=42).fit(X_train, train_targets)
model.score(X_val, val_targets)

### `max_leaf_nodes`

Another way to control the size or complexity of a decision tree is to limit the number of leaf nodes. This allows branches of the tree to have varying depths.

In [None]:
model = DecisionTreeClassifier(max_leaf_nodes=128, random_state=42)
model.fit(X_train, train_targets)

In [None]:
model.score(X_train, train_targets)

In [None]:
model.score(X_val, val_targets)

In [None]:
model.tree_.max_depth


Notice that the model was able to achieve a greater depth of 12 for certain paths while keeping other paths shorter.

In [None]:
model_text = export_text(model, feature_names=list(X_train.columns))
print(model_text[:3000])

# Model 5 -Training a Random Forest

While tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters. This is called a random forest model. 

The key idea here is that each decision tree in the forest will make different kinds of errors, and upon averaging, many of their errors will cancel out. This idea is also commonly known as the "wisdom of the crowd":

<img src="https://i.imgur.com/4Dg0XK4.png" width="480">
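
As a rough illustration of why averaging helps, the sketch below computes how often a majority vote is correct in a binary problem. It assumes the voters make independent errors, which real trees trained on overlapping data only approximate; `majority_vote_accuracy` is a hypothetical helper, not part of scikit-learn.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is correct."""
    min_votes = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct)**(n_voters - k)
        for k in range(min_votes, n_voters + 1)
    )

# Each voter is right 60% of the time; use odd counts to avoid ties
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The accuracy of the vote rises with the number of voters even though no individual voter improves. In a real forest the gain is smaller, because the trees' errors are correlated.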

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn import metrics

# 5-fold stratified cross-validation (shuffle=True is required when random_state is set)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
num_trees = 100

model = RandomForestClassifier(n_estimators=num_trees, n_jobs=-1, random_state=42)
model.fit(train_inputs, train_targets)
results = cross_val_score(model, train_inputs, train_targets, cv=kfold)
print("\nRandom Forest (Bagging) - CV Train : ", results.mean())

print("Random Forest (Bagging) - Validation : ", metrics.accuracy_score(val_targets, model.predict(val_inputs)))

`n_jobs` allows the random forest to use multiple parallel workers to train decision trees, and `random_state=42` ensures that we get the same results for each execution.

In [None]:
evalaute_train(model,train_inputs,train_targets)

In [None]:
evalaute_test(model,val_inputs,val_targets)

Once again, the training accuracy is almost 100%, but this time the validation accuracy is much better. In fact, it is better than the best single decision tree we had trained so far. Do you see the power of random forests?

This general technique of combining the results of many models is called "ensembling". It works because most errors of individual models cancel out on averaging. Here's what it looks like visually:

<img src="https://i.imgur.com/qJo8D8b.png" width="640">


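We can also look at the probabilities for the predictions. The probability of a class is the average of the class probabilities predicted by the individual trees. When every tree is grown until its leaves are pure, this is simply the fraction of trees that predicted the given class.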

In [None]:
train_probs = model.predict_proba(X_train)
train_probs
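
The "fraction of trees" idea can be sketched by hand. The example below assumes each tree casts a single hard vote for one row; the votes and the labels `'B'` and `'M'` are hypothetical and are not taken from the model above.

```python
from collections import Counter

# Hypothetical votes from a 10-tree forest for a single row
tree_votes = ['M', 'B', 'M', 'M', 'B', 'M', 'M', 'M', 'B', 'M']

# Probability of each class = share of trees that voted for it
vote_counts = Counter(tree_votes)
class_probs = {label: count / len(tree_votes) for label, count in sorted(vote_counts.items())}
print(class_probs)

# The forest predicts the class with the highest probability
predicted = max(class_probs, key=class_probs.get)
print(predicted)
```

Each row of the array returned by `predict_proba` has the same meaning: one probability per class, in the order given by `model.classes_`.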

We can access individual decision trees using `model.estimators_`.

In [None]:
model.estimators_[0]

In [None]:
plt.figure(figsize=(80,20))
plot_tree(model.estimators_[0], max_depth=2, feature_names=X_train.columns, filled=True, rounded=True, class_names=model.classes_);

In [None]:
plt.figure(figsize=(80,20))
plot_tree(model.estimators_[20], max_depth=2, feature_names=X_train.columns, filled=True, rounded=True, class_names=model.classes_);

In [None]:
len(model.estimators_)

## Feature Importance

Just like a decision tree, a random forest also assigns an "importance" to each feature, by combining the importance values from the individual trees.

In [None]:
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');


Notice that the distribution is a lot less skewed than that for a single decision tree.
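
The way the forest combines per-tree importances can be checked directly. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not this notebook's dataset); `forest`, `per_tree` and `averaged` are made-up names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Average the per-tree importances and renormalize so they sum to 1
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

# Compare with the importances reported by the forest itself
matches = np.allclose(averaged, forest.feature_importances_)
print(matches)
```

Because each tree sees a different bootstrap sample and different candidate features, the individual trees disagree on which features matter, and averaging spreads the importance over more features.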

## Hyperparameter Tuning with Random Forests

Just like decision trees, random forests also have several hyperparameters. In fact, many of these hyperparameters are applied to the underlying decision trees.

Let's study some of the hyperparameters for random forests. You can learn more about them here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
?RandomForestClassifier

Let's create a base model with which we can compare models with tuned hyperparameters.

In [None]:
base_model = RandomForestClassifier(random_state=42, n_jobs=-1).fit(X_train, train_targets)

In [None]:
base_train_acc = base_model.score(X_train, train_targets)
base_val_acc = base_model.score(X_val, val_targets)

In [None]:
base_accs = base_train_acc, base_val_acc
base_accs

We can use this as a benchmark for hyperparameter tuning.

### `n_estimators`

This argument controls the number of decision trees in the random forest. The default value is 100. For larger datasets, it helps to have a greater number of estimators. As a general rule, try to have as few estimators as needed. 


**10 estimators**

In [None]:
model = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=10)
model.fit(X_train, train_targets)

In [None]:
model.score(X_train, train_targets), model.score(X_val, val_targets)

In [None]:
base_accs

### `max_depth` and `max_leaf_nodes`

These arguments are passed directly to each decision tree, and control the maximum depth and the maximum number of leaf nodes of each tree, respectively. By default, no maximum depth is specified, which is why each tree can fit the rows it was trained on almost perfectly. You can specify a `max_depth` to reduce overfitting.

<img src="https://i.imgur.com/EJCrSZw.png" width="480">


Let's define a helper function `test_params` to make it easy to test hyperparameters.

In [None]:
def test_params(**params):
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params).fit(X_train, train_targets)
    return model.score(X_train, train_targets), model.score(X_val, val_targets)

Let's test a few values of `max_depth` and `max_leaf_nodes`.

In [None]:
test_params(max_depth=10,max_leaf_nodes=300)

In [None]:
test_params(max_depth=25,max_leaf_nodes=500)

In [None]:
base_accs

### `max_features`

Instead of picking all features (columns) for every split, we can specify that only a fraction of features be chosen randomly to figure out a split.

<img src="https://i.imgur.com/FXGWMDY.png" width="720">

Notice that the default value `sqrt` (named `auto` in older versions of scikit-learn) causes only $\sqrt{n}$ out of the total $n$ features to be chosen randomly at each split. This is one of the reasons each decision tree in the forest is different. While it may seem counterintuitive, choosing all features for every split of every tree leads to very similar trees, so the random forest will not generalize as well.
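
To get a feel for the numbers, here is a small sketch of how many features are examined per split under the $\sqrt{n}$ and $\log_2 n$ rules. It assumes 30 input columns and that the count is rounded down, which matches my understanding of scikit-learn's behavior; the variable names are hypothetical.

```python
import math
import random

n_features = 30  # assumed number of input columns, for illustration

# Number of candidate features examined at each split (rounded down, at least 1)
n_sqrt = max(1, int(math.sqrt(n_features)))
n_log2 = max(1, int(math.log2(n_features)))
print(n_sqrt, n_log2)

# A fresh random subset of column indices is drawn at every split
rng = random.Random(42)
candidates = rng.sample(range(n_features), n_sqrt)
print(sorted(candidates))
```

Passing an integer such as `max_features=6` sets the count directly instead of deriving it from the number of columns.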

In [None]:
test_params(max_features='log2')

In [None]:
test_params(max_features=6)

In [None]:
base_accs

### `min_samples_split` and `min_samples_leaf`

By default, the decision tree classifier tries to split every node that has 2 or more rows (`min_samples_split=2`) and allows a leaf to contain a single row (`min_samples_leaf=1`). You can increase the values of these arguments to change this behavior and reduce overfitting, especially for very large datasets.
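
The two thresholds can be read as a simple rule applied to every candidate split. The function below is a simplified sketch of that rule with integer thresholds only; `split_allowed` is a hypothetical helper, not scikit-learn code.

```python
def split_allowed(n_node, n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a node with n_node rows may be split into n_left and n_right rows."""
    if n_node < min_samples_split:
        return False  # node is too small to be considered for splitting
    # each child must keep at least min_samples_leaf rows
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf

# With the defaults, even a 2-row node can be split into two single-row leaves
print(split_allowed(2, 1, 1))

# With stricter settings, small nodes and lopsided splits are rejected
print(split_allowed(50, 25, 25, min_samples_split=100, min_samples_leaf=60))
print(split_allowed(150, 100, 50, min_samples_split=100, min_samples_leaf=60))
print(split_allowed(150, 90, 60, min_samples_split=100, min_samples_leaf=60))
```

Larger thresholds stop the tree from carving out tiny groups of rows, which is where much of the memorization happens.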

In [None]:
test_params(min_samples_split=3, min_samples_leaf=2)

In [None]:
test_params(min_samples_split=100, min_samples_leaf=60)

In [None]:
base_accs

### `min_impurity_decrease`

This argument is used to control the threshold for splitting nodes. A node will be split if this split induces a decrease of the impurity (Gini index) greater than or equal to this value. Its default value is 0, and you can increase it to reduce overfitting.
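
The quantity compared against `min_impurity_decrease` can be computed by hand. The sketch below follows the weighted impurity decrease formula given in the scikit-learn documentation, using the Gini index; the labels and the helper names `gini` and `impurity_decrease` are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of splitting `parent` into `left` and `right`."""
    n_t = len(parent)
    return (n_t / n_total) * (
        gini(parent) - len(left) / n_t * gini(left) - len(right) / n_t * gini(right)
    )

parent = ['M'] * 5 + ['B'] * 5

# A split that separates the classes perfectly removes all impurity
perfect = impurity_decrease(parent, ['M'] * 5, ['B'] * 5, n_total=10)

# A split that leaves both children as mixed as the parent gains nothing
useless = impurity_decrease(parent, ['M', 'B'], ['M'] * 4 + ['B'] * 4, n_total=10)
print(perfect, useless)
```

A split is kept only if this value reaches the threshold, so raising `min_impurity_decrease` discards splits that barely improve the purity of the children.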



In [None]:
test_params(min_impurity_decrease=1e-7)

In [None]:
test_params(min_impurity_decrease=1e-2)

In [None]:
base_accs

### `bootstrap`, `max_samples` 

By default, a random forest doesn't use the entire dataset for training each decision tree. Instead, it applies a technique called bootstrapping. For each tree, rows from the dataset are picked one by one randomly, with replacement, i.e., some rows may not show up at all, while some rows may show up multiple times.


<img src="https://i.imgur.com/W8UGaEA.png" width="640">

Bootstrapping helps the random forest generalize better, because each decision tree only sees a fraction of the training set, and some rows randomly get a higher weight than others.
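
Sampling with replacement is easy to simulate. The sketch below assumes a dataset of 10,000 rows identified only by their index and draws a single bootstrap sample of the same size; all names are hypothetical.

```python
import random

rng = random.Random(42)
n_rows = 10_000
row_ids = range(n_rows)

# Draw n_rows row indices with replacement: one bootstrap sample
bootstrap_sample = rng.choices(row_ids, k=n_rows)

# Share of distinct rows that made it into the sample
unique_fraction = len(set(bootstrap_sample)) / n_rows
print(f"{unique_fraction:.3f}")
```

For large datasets the share of distinct rows approaches $1 - 1/e \approx 0.632$, so each tree is trained without seeing roughly a third of the rows.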

In [None]:
test_params(bootstrap=False)

In [None]:
base_accs

When bootstrapping is enabled, you can also control the number or fraction of rows to be considered for each bootstrap using `max_samples`. This can further generalize the model.

<img src="https://i.imgur.com/rsdrL1W.png" width="640">

In [None]:
test_params(max_samples=0.9)

In [None]:
base_accs

### `class_weight`

In [None]:
model.classes_

In [None]:
test_params(class_weight='balanced')

In [None]:
test_params(class_weight={'No': 1, 'Yes': 2})

In [None]:
base_accs

We've increased the accuracy from 84.5% with a single decision tree to 85.7% with a well-tuned random forest. Depending on the dataset and the kind of problem, you may or may not see a significant improvement with hyperparameter tuning.

This could be due to any of the following reasons:

- We may not have found the right mix of hyperparameters to regularize (reduce overfitting) the model properly, and we should keep trying to improve the model.

- We may have reached the limits of the modeling technique we're currently using (Random Forests), and we should try another modeling technique e.g. gradient boosting.

- We may have reached the limits of what we can predict using the given amount of data, and we may need more data to improve the model.

- We may have reached the limits of how well we can predict whether it will rain tomorrow using the given weather measurements, and we may need more features (columns) to further improve the model. In many cases, we can also generate new features using existing features (this is called feature engineering).

- Whether it will rain tomorrow may be an inherently random or chaotic phenomenon that simply cannot be predicted beyond a certain accuracy with any amount of data, any number of weather measurements, or any modeling technique.

Remember that ultimately all models are wrong, but some are useful. If you can rely on the model we've created today to make a travel decision for tomorrow, then the model is useful, even though it may sometimes be wrong.

Finally, let's also compute the accuracy of our model on the test set.

In [None]:
model.score(X_test, test_targets)

# Model 6 -Training Extremely Randomized Trees (ExtraTrees)

This algorithm is an effort to introduce more randomness to the bagging process. Tree splits are chosen completely at random from the range of values in the sample at each split, which allows us to reduce the variance of the model further, at the cost of a slight increase in bias.
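
The difference in how a split threshold is picked can be sketched for a single feature. The values below are hypothetical, and the code only illustrates the idea rather than reproducing scikit-learn's implementation.

```python
import random

rng = random.Random(42)

# Hypothetical values of one feature for the rows that reached a node
feature_values = [11.4, 12.9, 13.5, 14.2, 17.8, 20.1]

# Random forest style: every midpoint between consecutive sorted values
# is a candidate, and the one giving the best split is kept
sorted_values = sorted(feature_values)
candidate_thresholds = [(a + b) / 2 for a, b in zip(sorted_values, sorted_values[1:])]
print(candidate_thresholds)

# Extra-trees style: a single threshold is drawn uniformly at random
# from the feature's range, with no search over candidates
random_threshold = rng.uniform(min(feature_values), max(feature_values))
print(random_threshold)
```

Skipping the threshold search makes each tree cheaper to build and less tuned to the training rows, which is where the lower variance and slightly higher bias come from.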

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn import metrics

# 5-fold stratified cross-validation (shuffle=True is required when random_state is set)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
num_trees = 100

model = ExtraTreesClassifier(n_estimators=num_trees)
model.fit(train_inputs, train_targets)

results = cross_val_score(model, train_inputs, train_targets, cv=kfold)
print("\nExtra Trees - CV Train : ", results.mean())

print("Extra Trees - Validation : ", metrics.accuracy_score(val_targets, model.predict(val_inputs)))

# Model 7 -Training a Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn import metrics

# Using Gradient Boosting of 100 iterations
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
num_trees = 100

model = GradientBoostingClassifier(n_estimators=num_trees, learning_rate=0.1, random_state=2017)
model.fit(X_train, y_train)

results = cross_val_score(model, train_inputs, train_targets, cv=kfold)
print("\nGradient Boosting - CV Train : %.2f" % results.mean())

print("Gradient Boosting - Train : %.2f" % metrics.accuracy_score(train_targets, model.predict(train_inputs)))
print("Gradient Boosting - Test : %.2f" % metrics.accuracy_score(y_test, model.predict(X_test)))

# Model 8 -Training a Xgboost (eXtreme Gradient Boosting)

In [None]:
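import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn import metrics

num_rounds = 100  # assumed number of boosting rounds; the original cell leaves num_rounds undefined
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)

# Cross-validate without early stopping, because cross_val_score does not pass an eval_set to fit
cv_model = XGBClassifier(n_estimators=num_rounds, objective='binary:logistic', random_state=2017)
results = cross_val_score(cv_model, train_inputs, train_targets, cv=kfold)
print("\nxgBoost - CV Train : %.2f" % results.mean())

# Use early_stopping_rounds to stop training when the eval_set score stops improving
# (set in the constructor, as in xgboost >= 1.6; older versions accept it in fit instead)
model = XGBClassifier(n_estimators=num_rounds, objective='binary:logistic', random_state=2017, early_stopping_rounds=20)
model.fit(train_inputs, train_targets, eval_set=[(train_inputs, train_targets)], verbose=False)

print("xgBoost - Train : %.2f" % metrics.accuracy_score(train_targets, model.predict(train_inputs)))
print("xgBoost - Test : %.2f" % metrics.accuracy_score(val_targets, model.predict(val_inputs)))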


# Model 9 -Training a LightGBM

In [None]:
# build the lightgbm model
import lightgbm as lgb
LGBM = lgb.LGBMClassifier()
LGBM.fit(train_inputs, train_targets)

In [None]:
evalaute_train(LGBM,train_inputs,train_targets)

In [None]:
evalaute_test(LGBM,val_inputs,val_targets)

#Hyperparameter tuning using optuna

In [None]:
pip install optuna

In [None]:
import optuna  
from sklearn.metrics import log_loss
from optuna.integration import LightGBMPruningCallback

def objective(trial, X, y):
  
    param_grid = {
        # "device_type": trial.suggest_categorical("device_type", ['gpu']),
        "n_estimators": trial.suggest_categorical("n_estimators", [10000]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step=100),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 0.95, step=0.1),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 0.95, step=0.1),
    }

    # X and y are not used here: the fixed train/validation split defined earlier is used instead
    model = lgb.LGBMClassifier(objective="binary", **param_grid)

    # Early stopping is passed as a callback (the fit argument was removed in newer LightGBM versions)
    model.fit(train_inputs, train_targets,
        eval_set=[(val_inputs, val_targets)],
        eval_metric="binary_logloss",
        callbacks=[lgb.early_stopping(100), LightGBMPruningCallback(trial, "binary_logloss")])

    # Log loss on the validation set: lower is better, so the study minimizes it
    preds = model.predict_proba(val_inputs)
    loss = log_loss(val_targets, preds)

    return loss

In [None]:
study = optuna.create_study(direction="minimize", study_name="LGBM Classifier")
func = lambda trial: objective(trial, X, y)
study.optimize(func, n_trials=20)

In [None]:
study.best_params

#Model-10 Catboost

In [None]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

def objective(trial):
    # Search space for the CatBoost hyperparameters
    param = {
        "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]),
        "used_ram_limit": "3gb",
    }

    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)

    Cat = CatBoostClassifier(**param)

    Cat.fit(X_train, y_train, eval_set=[(X_test, y_test)], cat_features=categorical_features_indices,verbose=0, early_stopping_rounds=100)

    preds = Cat.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    return accuracy

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, timeout=600)

# Making Predictions on New Inputs

Let's define a helper function to make predictions on new inputs.

In [None]:
def predict_input(model, single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    pred = model.predict(X_input)[0]
    prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]
    return pred, prob

In [None]:
new_input = {'Date': '2021-06-19',
             'Location': 'Launceston',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

## Saving and Loading Trained Models

We can save our trained model, including everything it learned during training, to disk, so that we needn't retrain the model from scratch each time we wish to use it. Along with the model, it's also important to save imputers, scalers, encoders and even column names. Anything that will be required while generating predictions using the model should be saved.

We can use the `joblib` module to save and load Python objects on the disk. 

In [None]:
import joblib
aussie_rain = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

In [None]:
joblib.dump(aussie_rain, 'aussie_rain.joblib')

The object can be loaded back using `joblib.load`

In [None]:
aussie_rain2 = joblib.load('aussie_rain.joblib')

In [None]:
test_preds2 = aussie_rain2['model'].predict(X_test)
accuracy_score(test_targets, test_preds2)

## Summary and References

The following topics were covered in this tutorial:

- Downloading a real-world dataset
- Preparing a dataset for training
- Training and interpreting decision trees
- Training and interpreting random forests
- Overfitting, hyperparameter tuning & regularization
- Making predictions on single inputs



We also introduced the following terms:

* Decision tree
* Random forest
* Overfitting
* Hyperparameter
* Hyperparameter tuning
* Regularization
* Ensembling
* Generalization
* Bootstrapping


Check out the following resources to learn more: 

- https://scikit-learn.org/stable/modules/tree.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
- https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering
- https://www.kaggle.com/willkoehrsen/intro-to-model-tuning-grid-and-random-search