# **LTFS Data Science FinHack 2**

## **Problem statement**

LTFS receives a lot of requests for its various finance offerings that include housing loan, two-wheeler loan, real estate financing and micro loans. The number of applications received is something that varies a lot with season. Going through these applications is a manual process and is tedious. Accurately forecasting the number of cases received can help with resource and manpower management resulting into quick response on applications and more efficient processing.

We have been appointed with the task of forecasting daily cases for **next 3 months for 2 different business segments** at the **country level** keeping in consideration the following major Indian festivals (inclusive but not exhaustive list): Diwali, Dussehra, Ganesh Chaturthi, Navratri, Holi etc. (We are free to use any publicly available open source external datasets). Some other examples could be:

 + Weather
 + Macroeconomic variables

we also note that the external dataset must belong to a reliable source.

## **Data Dictionary**

The train data has been provided in the following way:

 + For business segment 1, historical data has been made available at branch ID level
 + For business segment 2, historical data has been made available at State level.
 

## **Train File**

|Variable|	Definition|
|:------:|:----------:|
|application_date|Date of application|
|application_date|	Date of application|
|segment|	Business Segment (1/2)|
|branch_id|	Anonymised id for branch at which application was received|
|state|	State in which application was received (Karnataka, MP etc.)|
|zone|	Zone of state in which application was received (Central, East etc.)|
|case_count|	(Target) Number of cases/applications received|

## **Test File**

Forecasting needs to be done at country level for the dates provided in test set for each segment.

|Variable|	Definition|
|:------:|:----------:|
|id|	Unique id for each sample in test set|
|application_date|	Date of application|
| segment|	Business Segment (1/2)|

## **Evaluation**

**Evaluation Metric**

The evaluation metric for scoring the forecasts is MAPE (Mean Absolute Percentage Error) M with the formula:

$$M = \frac{100}{n}\sum_{t = 1}^{n}|\frac{A_t - F_t}{A_t}|$$
 
Where $A_t$ is the actual value and $F_t$ is the forecast value.


The Final score is calculated using $MAPE$ for both the segments using the formula:

$Final Score = 0.5*MAPE_{Segment1} + 0.5*MAPE_{Segment2}$


## **Getting started**

### **Importing libraries**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
%matplotlib inline

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

### **Reading data**

In [0]:
# Setting the path
import os
path = "/content/drive/My Drive/Colab Notebooks (1)/LTFS Data Science FinHack 2"
os.chdir(path)

In [0]:
# Importing the dataset
train = pd.read_csv("./Input/train_fwYjLYX.csv")
test = pd.read_csv("./Input/test_1eLl9Yf.csv")
holidays = pd.read_csv("./Input/holiday_list_2017_2018_2019.csv")
Sample_submission = pd.read_csv("./Input/sample_submission_IIzFVsf.csv")

## **Data Preprocessing**

In [5]:
train.head()

Unnamed: 0,application_date,segment,branch_id,state,zone,case_count
0,2017-04-01,1,1.0,WEST BENGAL,EAST,40.0
1,2017-04-03,1,1.0,WEST BENGAL,EAST,5.0
2,2017-04-04,1,1.0,WEST BENGAL,EAST,4.0
3,2017-04-05,1,1.0,WEST BENGAL,EAST,113.0
4,2017-04-07,1,1.0,WEST BENGAL,EAST,76.0


In [6]:
# Data preprocessing function
train_v2 = pd.DataFrame(train.groupby(['application_date', 'segment'])['case_count'].sum()).reset_index()
train_v2.head()

Unnamed: 0,application_date,segment,case_count
0,2017-04-01,1,299.0
1,2017-04-01,2,897.0
2,2017-04-02,2,605.0
3,2017-04-03,1,42.0
4,2017-04-03,2,2016.0


In [7]:
holidays['application_date'] = pd.to_datetime(holidays['DATE'])
holidays = holidays[['application_date', 'HOLIDAY']]
holidays.head()

Unnamed: 0,application_date,HOLIDAY
0,2017-01-01,New Year's Day
1,2017-01-14,Makar Sankranti / Pongal
2,2017-01-26,Republic Day
3,2017-02-24,Maha Shivaratri
4,2017-03-13,Holi


In [8]:
Diwali_HOLIDAY = pd.DataFrame()
diwali = ["2017-10-17", "2017-10-18", "2017-10-19", "2017-10-20", "2017-10-21",
          "2018-11-05", "2018-11-06", "2018-11-07", "2018-11-08", "2018-11-09",
          "2019-10-25", "2019-10-26", "2019-10-27", "2019-10-28", "2019-10-29"]

Diwali_HOLIDAY['application_date'] = diwali
Diwali_HOLIDAY['application_date'] = pd.to_datetime(Diwali_HOLIDAY['application_date'])
Diwali_HOLIDAY['Diwali_HOLIDAY'] = 1
# temp
print(Diwali_HOLIDAY)

Dussehra_HOLIDAY = pd.DataFrame()
Dussehra = ["2017-09-22","2017-09-23","2017-09-24", "2017-09-25", "2017-09-26",
            "2017-09-27", "2017-09-28", "2017-09-29", "2017-09-30",
            "2018-10-11", "2018-10-12", "2018-10-13", "2018-10-14", "2018-10-15",
            "2018-10-16", "2018-10-17", "2018-10-18", "2018-10-19",
            "2019-09-30", "2019-10-01", "2019-10-02", "2019-10-03", "2019-10-04",
            "2019-10-05", "2019-10-06", "2019-10-07", "2019-10-08"]

Dussehra_HOLIDAY['application_date'] = Dussehra
Dussehra_HOLIDAY['application_date'] = pd.to_datetime(Dussehra_HOLIDAY['application_date'])
Dussehra_HOLIDAY['Dussehra_HOLIDAY'] = 1
# temp
print(Dussehra_HOLIDAY)

   application_date  Diwali_HOLIDAY
0        2017-10-17               1
1        2017-10-18               1
2        2017-10-19               1
3        2017-10-20               1
4        2017-10-21               1
5        2018-11-05               1
6        2018-11-06               1
7        2018-11-07               1
8        2018-11-08               1
9        2018-11-09               1
10       2019-10-25               1
11       2019-10-26               1
12       2019-10-27               1
13       2019-10-28               1
14       2019-10-29               1
   application_date  Dussehra_HOLIDAY
0        2017-09-22                 1
1        2017-09-23                 1
2        2017-09-24                 1
3        2017-09-25                 1
4        2017-09-26                 1
5        2017-09-27                 1
6        2017-09-28                 1
7        2017-09-29                 1
8        2017-09-30                 1
9        2018-10-11                 1
10    

## **Feature engineering**

In [0]:
def feature_eng(train_v2):
    train_v2['application_date'] = pd.to_datetime(train_v2['application_date'])

    train_v2 = pd.merge(train_v2, holidays, on = 'application_date', how = 'left')
    train_v2['HOLIDAY'] = train_v2['HOLIDAY'].fillna('Non-Holiday')
    train_v2['Holiday_flag'] = np.where(train_v2['HOLIDAY'] == 'Non-Holiday', 0, 1)

    train_v2 = pd.merge(train_v2, Diwali_HOLIDAY, on = 'application_date', how = 'left')
    train_v2['Diwali_HOLIDAY'] = train_v2['Diwali_HOLIDAY'].fillna(0)

    train_v2 = pd.merge(train_v2, Dussehra_HOLIDAY, on = 'application_date', how = 'left')
    train_v2['Dussehra_HOLIDAY'] = train_v2['Dussehra_HOLIDAY'].fillna(0)

    train_v2['year'] = train_v2['application_date'].dt.year
    train_v2['Month'] = train_v2['application_date'].dt.month
    train_v2['Date'] = train_v2['application_date'].dt.day
    train_v2['weekday'] = train_v2['application_date'].dt.weekday_name

    Seasons = {6: 'Monsoon', 7: 'Monsoon', 8: 'Monsoon', 9: 'Monsoon',
               10: 'Winter', 11: 'Winter', 12: 'Winter', 1: 'Winter',
               2: 'Summer', 3: 'Summer', 4: 'Summer', 5: 'Summer'}
  
    train_v2['Seasons'] = train_v2['Month'].map(Seasons)

    train_v2['segment'] = np.where(train_v2['segment'] == 1, 1, 0)

    Month_Index = {4: 0.5, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 1: 1, 2: 1, 3: 1.5}

    train_v2['Month_Index'] = train_v2['Month'].map(Month_Index)

    Date_index = {1 : 0.5,2 : 0.5,3 : 0.5,4 : 0.5,5 : 0.5,6 : 0.5,7 : 0.5,8 : 0.5,9 : 0.5,10 : 0.5,
                  11 : 1,12 : 1,13 : 1.5,14 : 1.5,15 : 1.5,16 : 1.5,17 : 1.5,18 : 1.5,19 : 1.5,20 : 1.5,21 : 1.5,
                  22 : 1.5,23 : 1,24 : 1,25 : 1,26 : 1,27 : 1,28 : 1,29 : 0.5,30 : 0.5,31 : 0.5}

    train_v2['Date_index'] = train_v2['Date'].map(Date_index)

    dummy_col = ['weekday', 'Seasons']
    temp = train_v2[dummy_col]
    temp = pd.get_dummies(temp)

    train_v2 = train_v2.drop(dummy_col, axis = 1)
    train_v2 = pd.concat([train_v2, temp], axis = 1)

    train_v2 = train_v2.drop(['application_date','HOLIDAY'], axis = 1)
  
    return train_v2

## **Machine Learning**

### **Creating X and y**

In [10]:
X = train_v2.drop(['case_count'], axis = 1)
y = np.log(train_v2['case_count'])

X = feature_eng(X)

X_train = X
y_train = y

print("Shape of features :", X.shape)
print("Shape of labels :", y.shape)

X.head()

Shape of features : (1650, 19)
Shape of labels : (1650,)


Unnamed: 0,segment,Holiday_flag,Diwali_HOLIDAY,Dussehra_HOLIDAY,year,Month,Date,Month_Index,Date_index,weekday_Friday,weekday_Monday,weekday_Saturday,weekday_Sunday,weekday_Thursday,weekday_Tuesday,weekday_Wednesday,Seasons_Monsoon,Seasons_Summer,Seasons_Winter
0,1,0,0.0,0.0,2017,4,1,0.5,0.5,0,0,1,0,0,0,0,0,1,0
1,0,0,0.0,0.0,2017,4,1,0.5,0.5,0,0,1,0,0,0,0,0,1,0
2,0,0,0.0,0.0,2017,4,2,0.5,0.5,0,0,0,1,0,0,0,0,1,0
3,1,0,0.0,0.0,2017,4,3,0.5,0.5,0,1,0,0,0,0,0,0,1,0
4,0,0,0.0,0.0,2017,4,3,0.5,0.5,0,1,0,0,0,0,0,0,1,0


### **Splitting data into train, validation and test**

In [0]:
# # Dividing data into train and validation set
# from sklearn.model_selection import train_test_split

# validation_percent = 0.30
# test_percent = 0.50
# seed = 786

# X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size = validation_percent, random_state = seed)
# X_validation, X_test, y_validation, y_test = train_test_split(X_validation, y_validation, test_size = test_percent, random_state = seed)

# # Shape of data
# print("Number of rows and columns in train dataset:",X_train.shape)
# print("Number of rows and columns in validation dataset:",X_validation.shape)
# print("Number of rows and columns in test dataset:",X_test.shape)

# print("Number of rows and columns in target variable for training:",y_train.shape)
# print("Number of rows and columns in target variable for validation:",y_validation.shape)
# print("Number of rows and columns in target variable for test:",y_test.shape)

### **Model evualuation**

In [0]:
import sklearn.metrics as sklm
from sklearn.svm import LinearSVR, SVR
from sklearn.linear_model import LinearRegression, SGDRegressor, Lasso, Ridge,  PassiveAggressiveRegressor, Perceptron, ElasticNet, LassoLars, BayesianRidge, HuberRegressor
from sklearn.neighbors import KNeighborsRegressor, NearestCentroid
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, GradientBoostingRegressor 
from xgboost import XGBRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from time import time
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error, r2_score

In [0]:
def mape(forecast, actual):
    mape = np.mean(np.abs(forecast - actual)/np.abs(actual))
    return mape

In [0]:
def accuracy_summary(Regressor, x_train, y_train):
    t0 = time()
    model = Regressor.fit(x_train, y_train)
    y_pred = model.predict(x_train)
    train_test_time = time() - t0
    accuracy = r2_score(y_train, y_pred)
    #accuracy = mape(y_pred, y_train)
    return accuracy, train_test_time

In [0]:
seed = 123
names = ["Linear Regression", "Lasso","Ridge", "ElasticNet", "LassoLars", "BayesianRidge",
         "HuberRegressor","SGDRegressor", "Linear SVR", 
         "Support Vector Machine with RBF kernel","Passive-Aggresive","KNeighborsRegressor",
         "DecisionTreeRegressor","RandomForestRegressor","AdaBoostRegressor", 
         "GradientBoostingRegressor", "XGBRegressor-linear"]

Regressors = [
    LinearRegression(),
    Lasso(random_state=seed),
    Ridge(random_state=seed),
    ElasticNet(random_state=seed),
    LassoLars(),
    BayesianRidge(),
    HuberRegressor(),
    SGDRegressor(random_state=seed),
    LinearSVR(random_state=seed),
    SVR(),
    PassiveAggressiveRegressor(random_state=seed),
    KNeighborsRegressor(),
    DecisionTreeRegressor(random_state=seed),
    RandomForestRegressor(random_state=seed, n_estimators=500),
    AdaBoostRegressor(random_state=seed, n_estimators=500),
    GradientBoostingRegressor(loss = 'huber', random_state=seed, n_estimators=500),
    XGBRegressor(n_estimators=500, random_state=seed),
    XGBRegressor(n_estimators=500, random_state=seed, objective='count:poisson'),
    XGBRegressor(n_estimators=500, random_state=seed, objective='reg:gamma'),
    XGBRegressor(n_estimators=500, random_state=seed, objective='reg:tweedie')
    ]

zipped_reg = zip(names,Regressors)

def Regressor_comparator(Regressor=zipped_reg):
    result = []
    for n,c in Regressor:
        checker_pipeline = Pipeline([
            ('Regressor', c)
        ])
        print("Validation result for {}".format(n))
        print (c)
        reg_accuracy,tt_time = accuracy_summary(checker_pipeline, X_train, y_train)
        result.append((n,reg_accuracy,tt_time))
    return result

In [16]:
Regression_result = Regressor_comparator()
Regression_result

Validation result for Linear Regression
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Validation result for Lasso
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)
Validation result for Ridge
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=123, solver='auto', tol=0.001)
Validation result for ElasticNet
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=123, selection='cyclic', tol=0.0001, warm_start=False)
Validation result for LassoLars
LassoLars(alpha=1.0, copy_X=True, eps=2.220446049250313e-16, fit_intercept=True,
          fit_path=True, max_iter=500, normalize=True, positive=False,
          precompute='auto', verbose=False)
Validation

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Validation result for Support Vector Machine with RBF kernel
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
Validation result for Passive-Aggresive
PassiveAggressiveRegressor(C=1.0, average=False, early_stopping=False,
                           epsilon=0.1, fit_intercept=True,
                           loss='epsilon_insensitive', max_iter=1000,
                           n_iter_no_change=5, random_state=123, shuffle=True,
                           tol=0.001, validation_fraction=0.1, verbose=0,
                           warm_start=False)
Validation result for KNeighborsRegressor
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')
Validation result for DecisionTreeRegressor
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
           

  if getattr(data, 'base', None) is not None and \


[('Linear Regression', 0.6227758992591237, 0.007465362548828125),
 ('Lasso', 0.04081807986529795, 0.004074573516845703),
 ('Ridge', 0.622768967049937, 0.003392934799194336),
 ('ElasticNet', 0.04579374891566246, 0.003769397735595703),
 ('LassoLars', 0.0, 0.003196239471435547),
 ('BayesianRidge', 0.6227363446971127, 0.0044384002685546875),
 ('HuberRegressor', 0.5515341700417602, 0.06464385986328125),
 ('SGDRegressor', -1.6031283825632608e+30, 0.024270296096801758),
 ('Linear SVR', -0.271237260866541, 0.13211464881896973),
 ('Support Vector Machine with RBF kernel',
  -0.009709019226528337,
  0.34661316871643066),
 ('Passive-Aggresive', -1.2043052488874233, 0.00657343864440918),
 ('KNeighborsRegressor', 0.6018406505084566, 0.02986001968383789),
 ('DecisionTreeRegressor', 0.9999999997137249, 0.012966632843017578),
 ('RandomForestRegressor', 0.9721440309943423, 2.6890580654144287),
 ('AdaBoostRegressor', 0.6267497620315605, 0.08448958396911621),
 ('GradientBoostingRegressor', 0.814930144990

In [17]:
Regression_result_df = pd.DataFrame(Regression_result)
Regression_result_df.columns = ['Regressor', 'R2-Score', 'Train and test time']
Regression_result_df = Regression_result_df.sort_values(by='R2-Score', ascending=False)
Regression_result_df['R2-Score'] = (Regression_result_df['R2-Score']*100).round(1).astype(str) + '%'
Regression_result_df

Unnamed: 0,Regressor,R2-Score,Train and test time
12,DecisionTreeRegressor,100.0%,0.01
13,RandomForestRegressor,97.2%,2.69
16,XGBRegressor-linear,87.6%,0.73
15,GradientBoostingRegressor,81.5%,1.43
14,AdaBoostRegressor,62.7%,0.08
0,Linear Regression,62.3%,0.01
2,Ridge,62.3%,0.0
5,BayesianRidge,62.3%,0.0
11,KNeighborsRegressor,60.2%,0.03
6,HuberRegressor,55.2%,0.06


### **Tuning Randomforest model**

In [0]:
import math
def print_metrics(y_true, y_predicted):
    ## First compute R^2 and the adjusted R^2
    r2 = sklm.r2_score(y_true, y_predicted)
    MAPE = mape(y_predicted, y_true)
    # r2_adj = r2 - (y_true.shape[0] - 1)/(y_true.shape[0] - n_parameters - 1) * (1 - r2)
    
    ## Print the usual metrics and the R^2 values
    print('Mean Square Error      = ' + str(sklm.mean_squared_error(y_true, y_predicted)))
    print('Root Mean Square Error = ' + str(math.sqrt(sklm.mean_squared_error(y_true, y_predicted))))
    print('Mean Absolute Error    = ' + str(sklm.mean_absolute_error(y_true, y_predicted)))
    print('Median Absolute Error  = ' + str(sklm.median_absolute_error(y_true, y_predicted)))
    print('R^2                    = ' + str(r2))
    print('MAPE                    = ' + str(MAPE))

In [0]:
import numpy.random as nr
import sklearn.model_selection as ms

nr.seed(123)
inside = ms.KFold(n_splits=10, shuffle = True)

nr.seed(321)
outside = ms.KFold(n_splits=10, shuffle = True)


In [20]:
model_baseline = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=500, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

model_baseline = model_baseline.fit(X_train, y_train)

print("Cross-validation accuracy")
cv_estimate = ms.cross_val_score(model_baseline, X_train, y_train, scoring = 'r2',
                                 cv = outside) # Use the outside folds
print('Mean performance metric = %4.3f' % np.mean(cv_estimate))

print('SDT of the metric       = %4.3f' % np.std(cv_estimate))
print('Outcomes by cv fold')
for i, x in enumerate(cv_estimate):
    print('Fold %2d    %4.3f' % (i+1, x))

Cross-validation accuracy
Mean performance metric = 0.777
SDT of the metric       = 0.061
Outcomes by cv fold
Fold  1    0.837
Fold  2    0.734
Fold  3    0.710
Fold  4    0.805
Fold  5    0.874
Fold  6    0.686
Fold  7    0.742
Fold  8    0.731
Fold  9    0.845
Fold 10    0.804


In [21]:
nr.seed(3456)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"n_estimators": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model_baseline, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.n_estimators)

n_estimators_tuned = clf.best_estimator_.n_estimators

500


In [22]:
nr.seed(786)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"max_features": [1,2,3,4,5,6,7, 8, 9, 10, 11, 12, 13, 14, 15, 16,17,18,19,None]}

model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.max_features)

max_features_tuned = clf.best_estimator_.max_features

10


In [23]:
nr.seed(786)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"min_samples_split": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features= max_features_tuned, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.min_samples_split)

min_samples_split_tuned = clf.best_estimator_.min_samples_split

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

ValueError: min_sam

2


In [24]:
nr.seed(786)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features= max_features_tuned, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=min_samples_split_tuned,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.min_samples_leaf)

min_samples_leaf_tuned = clf.best_estimator_.min_samples_leaf

1


In [25]:
nr.seed(786)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}

model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features= max_features_tuned, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=min_samples_leaf_tuned, min_samples_split=min_samples_split_tuned,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.max_depth)
max_depth_tuned = clf.best_estimator_.max_depth

30


In [26]:
nr.seed(786)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"bootstrap": [True, False]}

model = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=max_depth_tuned,
                      max_features= max_features_tuned, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=min_samples_leaf_tuned, min_samples_split=min_samples_split_tuned,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.bootstrap)
bootstrap_tuned = clf.best_estimator_.bootstrap

True


In [27]:
nr.seed(999)
## Define the dictionary for the grid search and the model object to search on
param_grid = {"criterion": ["mse", "mae"]}

model = RandomForestRegressor(bootstrap=bootstrap_tuned, criterion='mse', max_depth=max_depth_tuned,
                      max_features= max_features_tuned, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=min_samples_leaf_tuned, min_samples_split=min_samples_split_tuned,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

## Perform the grid search over the parameters
clf = ms.GridSearchCV(estimator = model, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      return_train_score = True)

## Fit the cross validated grid search over the data 
clf.fit(X_train, y_train)

## And print the best parameter value
print(clf.best_estimator_.criterion)
criterion_tuned = clf.best_estimator_.criterion

mse


### **Final Model**

In [28]:
model = RandomForestRegressor(bootstrap=bootstrap_tuned, criterion=criterion_tuned, max_depth=max_depth_tuned,
                      max_features= max_features_tuned, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=min_samples_leaf_tuned, min_samples_split=min_samples_split_tuned,
                      min_weight_fraction_leaf=0.0, n_estimators=n_estimators_tuned,
                      n_jobs=None, oob_score=False, random_state=123, verbose=0,
                      warm_start=False)

model = model.fit(X, y)

print("Cross-validation accuracy")
cv_estimate = ms.cross_val_score(model_baseline, X, y, scoring = 'r2',
                                 cv = outside) # Use the outside folds
print('Mean performance metric = %4.3f' % np.mean(cv_estimate))

print('SDT of the metric       = %4.3f' % np.std(cv_estimate))
print('Outcomes by cv fold')
for i, x in enumerate(cv_estimate):
    print('Fold %2d    %4.3f' % (i+1, x))

Cross-validation accuracy
Mean performance metric = 0.783
SDT of the metric       = 0.053
Outcomes by cv fold
Fold  1    0.822
Fold  2    0.792
Fold  3    0.716
Fold  4    0.858
Fold  5    0.783
Fold  6    0.786
Fold  7    0.778
Fold  8    0.695
Fold  9    0.742
Fold 10    0.863


## **Predicting test data**

In [29]:
test_v2 = test.drop(['id'], axis = 1)
test_v2 = feature_eng(test_v2)

print("Shape of features :", test_v2.shape)

Shape of features : (182, 18)


In [30]:
feature_list = X.columns.tolist()
dummy_add = list(set(feature_list) - set(test_v2.columns))

for newcol in dummy_add:
    test_v2[newcol] = 0

test_v2 = test_v2[feature_list]
test_v2.head()

Unnamed: 0,segment,Holiday_flag,Diwali_HOLIDAY,Dussehra_HOLIDAY,year,Month,Date,Month_Index,Date_index,weekday_Friday,weekday_Monday,weekday_Saturday,weekday_Sunday,weekday_Thursday,weekday_Tuesday,weekday_Wednesday,Seasons_Monsoon,Seasons_Summer,Seasons_Winter
0,1,0,0.0,0.0,2019,7,6,1.0,0.5,0,0,1,0,0,0,0,1,0,0
1,1,0,0.0,0.0,2019,7,7,1.0,0.5,0,0,0,1,0,0,0,1,0,0
2,1,0,0.0,0.0,2019,7,8,1.0,0.5,0,1,0,0,0,0,0,1,0,0
3,1,0,0.0,0.0,2019,7,9,1.0,0.5,0,0,0,0,0,1,0,1,0,0
4,1,0,0.0,0.0,2019,7,10,1.0,0.5,0,0,0,0,0,0,1,1,0,0


In [32]:
test_v2 = test_v2.drop_duplicates()
test['case_count'] = np.exp(model.predict(test_v2))
test['case_count'] = test['case_count'].round(0)
test.head()

Unnamed: 0,id,application_date,segment,case_count
0,1,2019-07-06,1,1910.0
1,2,2019-07-07,1,1288.0
2,3,2019-07-08,1,2883.0
3,4,2019-07-09,1,2809.0
4,5,2019-07-10,1,2977.0


In [0]:
Submission = test[['id', 'application_date', 'segment', 'case_count']]

Submission.to_csv("./Output/Submission_RF_v6.csv", index = False)