<h1 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Tabular Playground Series - Jan 2021</h1>

<h1 style="background-color:#45ffa3;text-align:left;color:#aa45ff">Contents</h1>

- Basic Data Analysis and Visualization
- Linear Algorithms
- Tree Based Algorithms
    - Decision Tree
    - Random Forest
    - Gradient Boosting (GBM)
    - XGBoost

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Importing required libraries</h2>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import mean_squared_error,mean_absolute_error

import xgboost as xgb
from xgboost import XGBRegressor,XGBRFRegressor

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
sns.set_style("darkgrid")

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Reading the data</h2>

In [None]:
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv',index_col='id')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv',index_col='id')
df_sub = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/sample_submission.csv',index_col='id')

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
# df_sub.head()

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_train.describe()

In [None]:
df_test.describe()

In [None]:
# get the number of missing data points per column
df_train.isnull().sum()

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Bivariate Analysis</h2>

**Bivariate analysis** is the simultaneous analysis of two variables (attributes). It asks whether the two variables
are associated and how strong that association is, or whether the two variables differ and how significant
the difference is.
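
As a small illustration of measuring the strength of an association, the sketch below compares Pearson correlation (linear association) with Spearman correlation (monotonic association, computed here as Pearson on ranks). The data is synthetic and the variable names are assumptions made for this example, not part of the competition dataset.

```python
import numpy as np
import pandas as pd

# Synthetic pair of variables (hypothetical): y grows monotonically with x, but not linearly
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 1, 500), name="x")
y = pd.Series(np.exp(5 * x) + rng.normal(0, 0.5, 500), name="y")

pearson = x.corr(y)                  # strength of the linear association
spearman = x.rank().corr(y.rank())   # Pearson on ranks = Spearman, strength of the monotonic association

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

For a monotonic but curved relationship like this one, the Spearman value should come out higher than the Pearson value.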

### Scatter plot of features vs. target

Scatter plot of each feature in train vs. target values.

In [None]:
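def plot_feature_target_scatter(df, features):
    # 5 x 3 grid with one panel per feature (no stray empty figure from plt.figure())
    fig, axes = plt.subplots(5, 3, figsize=(14, 24))
    axes = axes.ravel()

    for ax, feature in zip(axes, features):
        ax.scatter(df[feature], df['target'], marker='+', color='purple')
        ax.set_xlabel(feature, fontsize=9)
    # hide panels that have no feature assigned (14 features in 15 panels)
    for ax in axes[len(features):]:
        ax.set_visible(False)
    plt.show()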

In [None]:
features = ['cont1', 'cont2','cont3','cont4', 'cont5', 'cont6', 'cont7',
            'cont8', 'cont9','cont10','cont11', 'cont12', 'cont13', 'cont14']

plot_feature_target_scatter(df_train[::15], features)

### Features distribution

In [None]:
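def plot_feature_distribution(df1, df2, features):
    fig, axes = plt.subplots(5, 3, figsize=(14, 24))
    axes = axes.ravel()

    for ax, feature in zip(axes, features):
        # sns.distplot is deprecated; histplot (seaborn >= 0.11) with stat="density"
        # draws a comparable normalized histogram with a KDE overlay
        sns.histplot(df1[feature], color="orange", kde=True, bins=120, stat="density", label='train', ax=ax)
        sns.histplot(df2[feature], color="purple", kde=True, bins=120, stat="density", label='test', ax=ax)
        ax.set_xlabel(feature, fontsize=9)
        ax.legend()
    # hide panels that have no feature assigned
    for ax in axes[len(features):]:
        ax.set_visible(False)
    plt.show()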

In [None]:
plot_feature_distribution(df_train[::15],df_test[::10], features)

### Features correlation

In [None]:
plt.figure(figsize=(16, 16))
# DataFrame.corr() computes Pearson correlation by default, here on the train features
heatmap = sns.heatmap(np.round(df_train[features].corr(), 3), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Pearson correlation - train features', fontdict={'fontsize':10}, pad=10)
plt.show()

In [None]:
features_target = features + ['target']
plt.figure(figsize=(16, 16))
# DataFrame.corr() computes Pearson correlation by default
heatmap = sns.heatmap(np.round(df_train[features_target].corr(), 3), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Pearson correlation - train features and target', fontdict={'fontsize':10}, pad=10)
plt.show()

In [None]:
# Separate the target, then hold out 40% of the training data for validation
target = df_train.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df_train, target, train_size=0.60)

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Linear Regression</h2>

Linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables.

In multiple linear regression, more than one predictor variable is used to predict the response variable.
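
To make the idea concrete, here is a minimal sketch of multiple linear regression on synthetic data with three predictors. The data, the coefficients, and the intercept of 4 are assumptions chosen for illustration. The sketch also fits a second model with `fit_intercept=False` to show what that option does when the true relationship has a non-zero intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical): three zero-mean predictors and an intercept of 4
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coef + rng.normal(scale=0.1, size=200)

with_intercept = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

# Root mean squared error on the data used for fitting
rmse_with = np.sqrt(mean_squared_error(y, with_intercept.predict(X)))
rmse_without = np.sqrt(mean_squared_error(y, no_intercept.predict(X)))

print("coefficients:", with_intercept.coef_, "intercept:", with_intercept.intercept_)
print(f"RMSE with intercept:    {rmse_with:.3f}")
print(f"RMSE without intercept: {rmse_without:.3f}")
```

On data like this, dropping the intercept forces the fitted plane through the origin, so the error grows by roughly the size of the missing offset. Whether that matters for a real dataset depends on where the features and the target are centered.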

In [None]:
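def plot_results(name, y, yhat, num_to_plot=10000, lims=(0,12), figsize=(15,8)):
    plt.figure(figsize=figsize)
    # RMSE via np.sqrt: the `squared` argument is deprecated in newer scikit-learn releases
    score = np.sqrt(mean_squared_error(y, yhat))
    # newer seaborn releases require x and y as keyword arguments;
    # np.asarray avoids any index alignment between a Series and an array
    sns.scatterplot(x=np.asarray(y)[:num_to_plot], y=np.asarray(yhat)[:num_to_plot])
    plt.plot(lims, lims)  # diagonal: perfect predictions would fall on this line
    plt.ylim(lims)
    plt.xlim(lims)
    plt.title(f'{name}: {score:0.5f}', fontsize=18)
    plt.show()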

In [None]:
linear_model = LinearRegression(fit_intercept=False)
linear_model.fit(X_train, y_train)
y_linear = linear_model.predict(X_test)
score_linear = np.sqrt(mean_squared_error(y_test, y_linear))  # RMSE
print(f'{score_linear:0.5f}')

In [None]:
plot_results('Linear',y_test,y_linear)

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Lasso Regression Model</h2>

In [None]:
lasso_model = Lasso(fit_intercept=False)
lasso_model.fit(X_train, y_train)
y_lasso = lasso_model.predict(X_test)
score_lasso = np.sqrt(mean_squared_error(y_test, y_lasso))  # RMSE
print(f'{score_lasso:0.5f}')

In [None]:
plot_results('Lasso',y_test,y_lasso)

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Introduction to Tree Based Algorithms</h2>

Tree-based algorithms are among the most widely used supervised learning methods. They often give predictive models good accuracy and stability, and a single tree is easy to interpret. Unlike linear models, they can capture non-linear relationships well. They apply to both classification and regression problems.
Methods such as decision trees, random forests, and gradient boosting are popular across all kinds of data science problems.
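
The claim about non-linear relationships can be checked with a short sketch. It fits a linear model and a shallow decision tree to a synthetic sine-shaped target and compares their validation RMSE. The data and the `max_depth=5` setting are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic non-linear relationship (hypothetical): y = sin(2x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=1000)
X_fit, X_val, y_fit, y_val = X[:700], X[700:], y[:700], y[700:]

linear = LinearRegression().fit(X_fit, y_fit)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_fit, y_fit)

rmse_linear = np.sqrt(mean_squared_error(y_val, linear.predict(X_val)))
rmse_tree = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

print(f"Linear RMSE: {rmse_linear:.3f}")
print(f"Tree RMSE:   {rmse_tree:.3f}")
```

A straight line cannot follow the oscillation, while the tree approximates it with piecewise-constant steps, so the tree should score a clearly lower RMSE here.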

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">What is a Decision Tree?</h2>

A decision tree is a supervised learning algorithm that is used mostly for classification problems, although it also handles regression. It works with both categorical and continuous input and output variables. The technique splits the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter (differentiator) among the input variables.
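
One way to picture the "most significant splitter" for a continuous target is to search for the threshold that makes the two resulting groups as homogeneous as possible. The sketch below does this for a single feature, scoring each candidate by the weighted variance of the two groups. The function name `best_split` and the toy data are hypothetical, and real libraries use more efficient implementations of the same idea.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_variance) of the best binary split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # weighted average of the within-group variances (lower means more homogeneous)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data (hypothetical): the target jumps between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.5, 5.2, 9.0, 9.3, 8.8])

threshold, score = best_split(x, y)
print(threshold, score)
```

A full tree repeats this search over every feature at every node and then recurses into the two resulting groups.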

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Types of Decision Trees</h2>

The type of decision tree depends on the type of target variable. There are two types:

1. **Categorical Variable Decision Tree:** A decision tree with a categorical target variable.
2. **Continuous Variable Decision Tree:** A decision tree with a continuous target variable.
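
In scikit-learn these two types correspond to `DecisionTreeClassifier` and `DecisionTreeRegressor`. The following sketch fits one of each on synthetic data (the features and both targets are assumptions made for illustration) to show that the first predicts class labels and the second predicts real values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic features with one categorical and one continuous target (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y_class = (X[:, 0] + X[:, 1] > 1).astype(int)   # categorical target: class 0 or 1
y_cont = 2 * X[:, 0] + X[:, 1]                  # continuous target

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_cont)

class_predictions = clf.predict(X)
value_predictions = reg.predict(X)

print(class_predictions[:5])   # class labels
print(value_predictions[:5])   # real-valued predictions
```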


<h3>Basic terminology used with Decision trees:</h3>

- **Root Node:** Represents the entire population or sample, which is then divided into two or more homogeneous sets.
- **Splitting:** The process of dividing a node into two or more sub-nodes.
- **Decision Node:** A sub-node that splits into further sub-nodes.
- **Leaf / Terminal Node:** A node that does not split.
- **Pruning:** The process of removing the sub-nodes of a decision node. It is the opposite of splitting.
- **Branch / Sub-Tree:** A subsection of the entire tree.
- **Parent and Child Node:** A node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.
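
These terms can be mapped onto a fitted scikit-learn tree. The sketch below trains a depth-2 regressor on synthetic data (the data and the feature names `f0` and `f1` are assumptions), prints its structure, and counts leaf nodes and decision nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (hypothetical): the target depends on a threshold in each feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(X[:, 0] > 0.5, 10.0, 0.0) + np.where(X[:, 1] > 0.5, 1.0, 0.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Text rendering: the first split is the root node, indented lines are its sub-trees,
# and lines showing "value" are leaf (terminal) nodes
print(export_text(tree, feature_names=["f0", "f1"]))

n_nodes = tree.tree_.node_count
n_leaves = tree.get_n_leaves()
n_decision = n_nodes - n_leaves   # nodes that split further, including the root
depth = tree.get_depth()

print(f"nodes={n_nodes}, leaves={n_leaves}, decision nodes={n_decision}, depth={depth}")
```

Limiting `max_depth` as done here is a form of constraint applied while the tree grows; pruning, by contrast, removes sub-nodes after they have been created.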

<h3>Advantages and Disadvantages</h3>

**Advantages**

- Easy to Understand
- Useful in Data exploration
- Less data cleaning required
- Data type is not a constraint
- Non-parametric method

**Disadvantages**

- Overfitting: Overfitting is one of the main practical difficulties for decision tree models. It can be mitigated by setting constraints on model parameters and by pruning.
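
The effect of constraints and pruning can be seen in a short sketch. It compares an unconstrained tree with a tree limited by `max_depth` and `min_samples_leaf`, and with a tree pruned through cost-complexity pruning (`ccp_alpha`). The synthetic data and the specific parameter values are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy data (hypothetical): only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 5))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.5, size=2000)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def rmse(model, X, y):
    return np.sqrt(mean_squared_error(y, model.predict(X)))

models = {
    "unconstrained": DecisionTreeRegressor(random_state=0),
    "constrained": DecisionTreeRegressor(max_depth=5, min_samples_leaf=20, random_state=0),
    "pruned": DecisionTreeRegressor(ccp_alpha=0.005, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    results[name] = (rmse(model, X_fit, y_fit), rmse(model, X_val, y_val))
    print(f"{name:14s} train RMSE={results[name][0]:.3f}  validation RMSE={results[name][1]:.3f}")
```

The unconstrained tree memorizes the training data, so its training error is essentially zero while its validation error is noticeably worse. The constrained tree gives up some training accuracy and typically generalizes better.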

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Decision Tree Regressor Model</h2>

In [None]:
dtree_model = DecisionTreeRegressor(random_state=0)
dtree_model.fit(X_train, y_train)
y_dtree = dtree_model.predict(X_test)
score_dtree = np.sqrt(mean_squared_error(y_test, y_dtree))  # RMSE
print(f'{score_dtree:0.5f}')

In [None]:
plot_results('Decision Tree',y_test,y_dtree)

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Random Forest Regressor Model</h2>

In [None]:
rf_model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
rf_model.fit(X_train, y_train)
y_rf = rf_model.predict(X_test)
# store the score under its own name instead of overwriting score_dtree
score_rf = np.sqrt(mean_squared_error(y_test, y_rf))  # RMSE
print(f'{score_rf:0.5f}')

In [None]:
plot_results('Random Forest',y_test,y_rf)

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Gradient Boosting Regressor Model</h2>

In [None]:
gb_model = GradientBoostingRegressor(n_estimators=100,max_depth=5)
gb_model.fit(X_train, y_train)
y_gb = gb_model.predict(X_test)
score_gb = np.sqrt(mean_squared_error(y_test, y_gb))  # RMSE
print(f'{score_gb:0.5f}')

In [None]:
plot_results('Gradient Boosting',y_test,y_gb)

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">XGBoost (eXtreme Gradient Boosting)</h2>

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. Its support for parallel computation can make it several times faster than earlier gradient boosting implementations, although the exact speedup depends on the data and hardware. It supports various objective functions, including regression, classification, and ranking.
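
To show the gradient boosting idea that XGBoost builds on, here is a minimal sketch for squared-error regression using scikit-learn trees. This is not XGBoost's implementation: it omits regularization, second-order gradients, and parallelism. The function names `fit_boosted_trees` and `predict_boosted` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit an additive model: start from the mean, then add small trees fitted to residuals."""
    base = float(np.mean(y))
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, residual)
        prediction += learning_rate * tree.predict(X)  # take a small step towards the residuals
        trees.append(tree)
    return base, learning_rate, trees

def predict_boosted(model, X):
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic data (hypothetical)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

model = fit_boosted_trees(X, y)
rmse_mean = np.sqrt(np.mean((y - model[0]) ** 2))
rmse_boosted = np.sqrt(np.mean((y - predict_boosted(model, X)) ** 2))

print(f"RMSE of the constant mean prediction: {rmse_mean:.3f}")
print(f"RMSE after boosting:                  {rmse_boosted:.3f}")
```

Each round corrects part of what the previous rounds got wrong, which is why `n_estimators` and `learning_rate` are usually tuned together.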

In [None]:
# XGBoost hyper-parameter tuning
# Note: this grid has 288 combinations, so 5-fold CV runs 1440 fits and can take a long time
def hyperParameterTuning(X_train, y_train):
    param_tuning = {
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5, 7, 10],
        'min_child_weight': [1, 3, 5],
        'subsample': [0.5, 0.7],
        'colsample_bytree': [0.5, 0.7],
        'n_estimators' : [100, 200, 500],
        'objective': ['reg:squarederror']
    }

    xgb_model = XGBRegressor(n_jobs = -1)
    gsearch = GridSearchCV(estimator = xgb_model,
                           param_grid = param_tuning,
                           # rank candidates by RMSE rather than the default R^2
                           scoring = 'neg_root_mean_squared_error',
                           cv = 5,
                           verbose = 1)

    gsearch.fit(X_train,y_train)
    return gsearch.best_params_



In [None]:
# hyperParameterTuning(X_train, y_train)

### Best Fit

In [None]:
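xgb_model = XGBRegressor(
        objective = 'reg:squarederror',
        colsample_bytree = 0.5,
        learning_rate = 0.05,
        max_depth = 6,
        min_child_weight = 1,
        n_estimators = 1000,
        subsample = 0.7,
        # xgboost >= 1.6 takes early stopping in the constructor; passing it to fit()
        # has been deprecated since 1.6 and is removed in later releases
        early_stopping_rounds = 5)

# training stops once the validation score has not improved for 5 consecutive rounds
xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)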

y_pred_xgb = xgb_model.predict(X_test)

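mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
# RMSE as well, so the score is comparable with the models above
score_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))

print("MAE: ", mae_xgb)
print(f"RMSE: {score_xgb:0.5f}")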

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Submission</h2>

In [None]:
# dtree_model = DecisionTreeRegressor()
# dtree_model.fit(df_train, target)
# df_sub['target'] = dtree_model.predict(df_test)
# df_sub.to_csv('dtree_submission.csv')

# model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
# model.fit(df_train, target)
# df_sub['target'] = model.predict(df_test)
# df_sub.to_csv('submission.csv')

# model=GradientBoostingRegressor(n_estimators=100,max_depth=5)
# model.fit(df_train, target)
# df_sub['target'] = model.predict(df_test)
# df_sub.to_csv('submission.csv')


df_sub['target'] = xgb_model.predict(df_test)
df_sub.to_csv("submission.csv")

<h2 style="background-color:#45ffa3;text-align:center;color:#aa45ff">Reference Notebook</h2>

- [Tabular Playground Series January EDA](https://www.kaggle.com/gpreda/tabular-playground-series-january-eda)