# **LTFS Data Science FinHack 2**

## **Problem statement**

LTFS receives a large number of requests for its finance offerings, which include housing loans, two-wheeler loans, real estate financing and micro loans. The number of applications received varies considerably with the season. Going through these applications is a manual and tedious process. Accurately forecasting the number of cases received can help with resource and manpower management, resulting in quicker responses to applications and more efficient processing.

We have been tasked with forecasting daily cases for the **next 3 months for 2 different business segments** at the **country level**, taking into account major Indian festivals such as Diwali, Dussehra, Ganesh Chaturthi, Navratri and Holi (an illustrative, not exhaustive, list). We are free to use any publicly available open-source external datasets, for example:

 + Weather
 + Macroeconomic variables

Any external dataset used must come from a reliable source.

## **Data Dictionary**

The train data has been provided in the following way:

 + For business segment 1, historical data has been made available at branch ID level
 + For business segment 2, historical data has been made available at State level.
 

## **Train File**

|Variable|	Definition|
|:------:|:----------:|
|application_date|Date of application|
|segment|	Business Segment (1/2)|
|branch_id|	Anonymised id for branch at which application was received|
|state|	State in which application was received (Karnataka, MP etc.)|
|zone|	Zone of state in which application was received (Central, East etc.)|
|case_count|	(Target) Number of cases/applications received|

## **Test File**

Forecasting needs to be done at country level for the dates provided in test set for each segment.

|Variable|	Definition|
|:------:|:----------:|
|id|	Unique id for each sample in test set|
|application_date|	Date of application|
|segment|	Business Segment (1/2)|

## **Evaluation**

**Evaluation Metric**

The forecasts are scored using MAPE (Mean Absolute Percentage Error), denoted $M$ and defined as:

$$M = \frac{100}{n}\sum_{t = 1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$

where $A_t$ is the actual value, $F_t$ is the forecast value and $n$ is the number of forecasts.


The final score is the equally weighted average of the MAPE values of the two segments:

$$\text{Final Score} = 0.5 \times \text{MAPE}_{\text{Segment 1}} + 0.5 \times \text{MAPE}_{\text{Segment 2}}$$
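
As a quick numeric check of the two formulas above, the sketch below computes MAPE per segment and combines the results with equal weights. The function names and the toy numbers are assumptions made for illustration; they are not part of the competition material.

```python
import numpy as np

def mape_percent(actual, forecast):
    """MAPE in percent: (100 / n) * sum(|A_t - F_t| / |A_t|)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def final_score(mape_segment1, mape_segment2):
    """Equal-weight average of the two segment-level MAPE values."""
    return 0.5 * mape_segment1 + 0.5 * mape_segment2

# Toy values, made up purely for illustration
seg1 = mape_percent([100, 200], [110, 180])   # relative errors of 10% and 10%
seg2 = mape_percent([50, 400], [40, 400])     # relative errors of 20% and 0%
score = final_score(seg1, seg2)
print(seg1, seg2, score)
```

Note that the metric is undefined when an actual value $A_t$ is zero, so such days would need special handling.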


## **Getting started**

### **Importing libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
%matplotlib inline

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

### **Reading data**

In [2]:
# Setting the path
import os
path = "E:/Data Science/LTFS-Data-Science-FinHack-2"
os.chdir(path)

In [3]:
# Importing the dataset
train = pd.read_csv("./Input/train_fwYjLYX.csv")
test = pd.read_csv("./Input/test_1eLl9Yf.csv")
Sample_submission = pd.read_csv("./Input/sample_submission_IIzFVsf.csv")

## **Data Preprocessing**

In [4]:
train.head()

| |application_date|segment|branch_id|state|zone|case_count|
|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
|0|2017-04-01|1|1.0|WEST BENGAL|EAST|40.0|
|1|2017-04-03|1|1.0|WEST BENGAL|EAST|5.0|
|2|2017-04-04|1|1.0|WEST BENGAL|EAST|4.0|
|3|2017-04-05|1|1.0|WEST BENGAL|EAST|113.0|
|4|2017-04-07|1|1.0|WEST BENGAL|EAST|76.0|


In [5]:
# Aggregate the branch/state-level rows to country level: total cases per date and segment
train_v2 = pd.DataFrame(train.groupby(['application_date', 'segment'])['case_count'].sum()).reset_index()
train_v2.head()

| |application_date|segment|case_count|
|:------:|:------:|:------:|:------:|
|0|2017-04-01|1|299.0|
|1|2017-04-01|2|897.0|
|2|2017-04-02|2|605.0|
|3|2017-04-03|1|42.0|
|4|2017-04-03|2|2016.0|


## **Feature engineering**

In [6]:
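def feature_eng(train_v2):
    """Build calendar features (year, month, day, weekday, season) from application_date."""
    train_v2 = train_v2.copy()  # avoid modifying the caller's DataFrame
    train_v2['application_date'] = pd.to_datetime(train_v2['application_date'])
    train_v2['year'] = train_v2['application_date'].dt.year
    train_v2['Month'] = train_v2['application_date'].dt.month
    train_v2['Date'] = train_v2['application_date'].dt.day
    # .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same labels
    train_v2['weekday'] = train_v2['application_date'].dt.day_name()

    # Map each month to a coarse season
    Seasons = {6: 'Monsoon', 7: 'Monsoon', 8: 'Monsoon', 9: 'Monsoon',
               10: 'Winter', 11: 'Winter', 12: 'Winter', 1: 'Winter',
               2: 'Summer', 3: 'Summer', 4: 'Summer', 5: 'Summer'}

    train_v2['Seasons'] = train_v2['Month'].map(Seasons)

    # Encode segment as binary: 1 for segment 1, 0 for segment 2
    train_v2['segment'] = np.where(train_v2['segment'] == 1, 1, 0)

    # One-hot encode the categorical columns (dtype=int keeps the dummies as 0/1)
    dummy_col = ['weekday', 'Seasons']
    temp = pd.get_dummies(train_v2[dummy_col], dtype=int)

    train_v2 = train_v2.drop(dummy_col, axis = 1)
    train_v2 = pd.concat([train_v2, temp], axis = 1)

    # The raw date is not needed once the calendar features are extracted
    train_v2 = train_v2.drop(['application_date'], axis = 1)

    return train_v2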
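
The problem statement asks for festivals to be taken into account, while the features above are purely calendar-based. One possible way to add that signal is a 0/1 flag for days close to a festival. The sketch below is only an illustration: `add_festival_flag`, the `is_festival` column, the window size and the single date in `festival_dates` are all assumptions, and real dates should be taken from a reliable calendar.

```python
import pandas as pd

def add_festival_flag(df, festival_dates, window_days=2, date_col='application_date'):
    """Add a 0/1 column marking rows within `window_days` of any festival date."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    flag = pd.Series(False, index=out.index)
    for festival in pd.to_datetime(list(festival_dates)):
        # Mark every row whose date lies inside the window around this festival
        flag |= (dates - festival).abs() <= pd.Timedelta(days=window_days)
    out['is_festival'] = flag.astype(int)
    return out

# Placeholder date (assumed); replace with a verified festival calendar
festival_dates = ['2018-11-07']
demo = pd.DataFrame({'application_date': pd.date_range('2018-11-03', periods=8, freq='D')})
demo = add_festival_flag(demo, festival_dates)
print(demo)
```

Such a flag would have to be computed before `application_date` is dropped, and in the same way for the train and test data.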

## **Machine Learning**

### **Creating X and y**

In [7]:
X = train_v2.drop(['case_count'], axis = 1)
y = np.log(train_v2['case_count'])

X = feature_eng(X)

print("Shape of features :", X.shape)
print("Shape of labels :", y.shape)

X.head()

Shape of features : (1650, 14)
Shape of labels : (1650,)


| |segment|year|Month|Date|weekday_Friday|weekday_Monday|weekday_Saturday|weekday_Sunday|weekday_Thursday|weekday_Tuesday|weekday_Wednesday|Seasons_Monsoon|Seasons_Summer|Seasons_Winter|
|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
|0|1|2017|4|1|0|0|1|0|0|0|0|0|1|0|
|1|0|2017|4|1|0|0|1|0|0|0|0|0|1|0|
|2|0|2017|4|2|0|0|0|1|0|0|0|0|1|0|
|3|1|2017|4|3|0|1|0|0|0|0|0|0|1|0|
|4|0|2017|4|3|0|1|0|0|0|0|0|0|1|0|
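
The target is transformed with `np.log`, which returns `-inf` for a count of zero. The rows shown here contain no zero counts, but if a zero-case day were present, `np.log1p` with its inverse `np.expm1` would be one way to keep the transform finite. A minimal sketch on made-up counts:

```python
import numpy as np

counts = np.array([0.0, 5.0, 299.0, 2016.0])  # toy counts, including a zero

y_log1p = np.log1p(counts)      # log(1 + x), finite at x = 0
recovered = np.expm1(y_log1p)   # inverse transform: exp(y) - 1

print(y_log1p)
print(recovered)
```

If this transform were adopted, predictions would need to be mapped back with `np.expm1` rather than `np.exp`.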


### **Splitting data into train, validation and test**

In [8]:
# Split the data into train (70%), validation (15%) and test (15%) sets
from sklearn.model_selection import train_test_split

validation_percent = 0.30  # share held out from training (validation + test together)
test_percent = 0.50        # share of the held-out rows used as the test set
seed = 786

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size = validation_percent, random_state = seed)
X_validation, X_test, y_validation, y_test = train_test_split(X_validation, y_validation, test_size = test_percent, random_state = seed)

# Shape of data
print("Number of rows and columns in train dataset:",X_train.shape)
print("Number of rows and columns in validation dataset:",X_validation.shape)
print("Number of rows and columns in test dataset:",X_test.shape)

print("Number of rows and columns in target variable for training:",y_train.shape)
print("Number of rows and columns in target variable for validation:",y_validation.shape)
print("Number of rows and columns in target variable for test:",y_test.shape)

Number of rows and columns in train dataset: (1155, 14)
Number of rows and columns in validation dataset: (247, 14)
Number of rows and columns in test dataset: (248, 14)
Number of rows and columns in target variable for training: (1155,)
Number of rows and columns in target variable for validation: (247,)
Number of rows and columns in target variable for test: (248,)
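
`train_test_split` shuffles rows at random, so validation and test rows can fall between training dates. Because the goal is to forecast future dates, a chronological hold-out may give a more realistic error estimate. The sketch below shows one way to do that on a toy daily series; `chronological_split` and its default fractions are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def chronological_split(df, date_col, valid_frac=0.15, test_frac=0.15):
    """Split a frame into train/validation/test blocks ordered by date."""
    df = df.sort_values(date_col).reset_index(drop=True)
    n = len(df)
    n_test = int(round(n * test_frac))
    n_valid = int(round(n * valid_frac))
    n_train = n - n_valid - n_test
    train = df.iloc[:n_train]                      # earliest rows
    valid = df.iloc[n_train:n_train + n_valid]     # next block in time
    test = df.iloc[n_train + n_valid:]             # most recent rows
    return train, valid, test

# Toy daily series standing in for the aggregated training data
toy = pd.DataFrame({
    'application_date': pd.date_range('2017-04-01', periods=100, freq='D'),
    'case_count': np.arange(100, dtype=float),
})
train_part, valid_part, test_part = chronological_split(toy, 'application_date')
print(len(train_part), len(valid_part), len(test_part))
```

With two segments per date, the split would have to be applied before `application_date` is dropped by `feature_eng`.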


### **Model evaluation**

In [9]:
import sklearn.metrics as sklm
from sklearn.svm import LinearSVR, SVR
from sklearn.linear_model import LinearRegression, SGDRegressor, Lasso, Ridge, PassiveAggressiveRegressor, Perceptron
from sklearn.neighbors import KNeighborsRegressor, NearestCentroid
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, GradientBoostingRegressor 
from xgboost import XGBRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from time import time
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error, r2_score

In [10]:
def mape(forecast, actual):
    """Mean absolute percentage error, returned as a fraction (multiply by 100 for percent)."""
    return np.mean(np.abs(forecast - actual)/np.abs(actual))
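
Because `y` is the logarithm of `case_count`, applying `mape` directly to model output measures the error in log units, whereas the competition metric is defined on the case counts themselves. The sketch below, with the hypothetical helper `mape_on_counts` and made-up numbers, shows how much the two can differ.

```python
import numpy as np

def mape_on_counts(log_forecast, log_actual):
    """MAPE (as a fraction) after undoing the log transform on both inputs."""
    forecast = np.exp(np.asarray(log_forecast, dtype=float))
    actual = np.exp(np.asarray(log_actual, dtype=float))
    return np.mean(np.abs(forecast - actual) / np.abs(actual))

# Toy example: actual counts 100 and 1000, forecasts 110 and 900
log_actual = np.log([100.0, 1000.0])
log_forecast = np.log([110.0, 900.0])

# Error measured on the log values, as in the model comparison in this notebook
log_scale_error = np.mean(np.abs(log_forecast - log_actual) / np.abs(log_actual))
# Error measured on the counts, as in the competition metric
count_scale_error = mape_on_counts(log_forecast, log_actual)
print(log_scale_error, count_scale_error)
```

In this toy case the count-scale error is 10%, while the log-scale figure is several times smaller, so the log-scale values should not be read as leaderboard estimates.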

In [11]:
def accuracy_summary(Regressor, x_train, y_train, x_validation, y_validation):
    """Fit the regressor and return (validation MAPE, fit-and-predict time in seconds)."""
    t0 = time()
    model = Regressor.fit(x_train, y_train)
    y_pred = model.predict(x_validation)
    train_test_time = time() - t0
    # Despite the variable name, this is MAPE on the log-scale target: lower is better
    accuracy = mape(y_pred, y_validation)
    return accuracy, train_test_time

In [12]:
seed = 123
names = ["Linear Regression", "SGDRegressor", "Linear SVR", "Lasso","Ridge", "Passive-Aggresive",
        "DecisionTreeRegressor","RandomForestRegressor","AdaBoostRegressor", "GradientBoostingRegressor", "XGBRegressor"]

Regressors = [
    LinearRegression(),
    SGDRegressor(random_state=seed),
    LinearSVR(random_state=seed),
    #SVR(),
    Lasso(random_state=seed),
    Ridge(random_state=seed),
    PassiveAggressiveRegressor(random_state=seed),
    DecisionTreeRegressor(random_state=seed),
    RandomForestRegressor(random_state=seed, n_estimators=500),
    AdaBoostRegressor(random_state=seed, n_estimators=500),
    GradientBoostingRegressor(random_state=seed, n_estimators=500),
    XGBRegressor(n_estimators=500, random_state=seed)
    ]

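# Store the pairs in a list: a bare zip object is exhausted after one pass,
# so a second call to Regressor_comparator() would return an empty result
zipped_reg = list(zip(names,Regressors))

def Regressor_comparator(Regressor=zipped_reg):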
    result = []
    for n,c in Regressor:
        checker_pipeline = Pipeline([
            ('Regressor', c)
        ])
        print("Validation result for {}".format(n))
        print (c)
        reg_accuracy,tt_time = accuracy_summary(checker_pipeline, X_train, y_train, X_validation, y_validation)
        result.append((n,reg_accuracy,tt_time))
    return result

In [13]:
Regression_result = Regressor_comparator()
Regression_result

Validation result for Linear Regression
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Validation result for SGDRegressor
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=123,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)
Validation result for Linear SVR
LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
          intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
          random_state=123, tol=0.0001, verbose=0)
Validation result for Lasso
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False)




Validation result for Ridge
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=123, solver='auto', tol=0.001)
Validation result for Passive-Aggresive
PassiveAggressiveRegressor(C=1.0, average=False, early_stopping=False,
                           epsilon=0.1, fit_intercept=True,
                           loss='epsilon_insensitive', max_iter=1000,
                           n_iter_no_change=5, random_state=123, shuffle=True,
                           tol=0.001, validation_fraction=0.1, verbose=0,
                           warm_start=False)
Validation result for DecisionTreeRegressor
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=123, splitter='best')

[('Linear Regression', 0.10513573449395285, 0.2325127124786377),
 ('SGDRegressor', 60169262908319.36, 0.018990278244018555),
 ('Linear SVR', 0.3326559027917069, 0.1659390926361084),
 ('Lasso', 0.15777142146560802, 0.008999824523925781),
 ('Ridge', 0.10512046272408691, 0.00599980354309082),
 ('Passive-Aggresive', 0.17484383055544325, 0.006994009017944336),
 ('DecisionTreeRegressor', 0.08117820373237314, 0.019991636276245117),
 ('RandomForestRegressor', 0.05420337132096428, 2.840772867202759),
 ('AdaBoostRegressor', 0.09231015135427076, 0.061965227127075195),
 ('GradientBoostingRegressor', 0.07042529886880469, 0.6486122608184814),
 ('XGBRegressor', 0.07042746358188312, 0.4531240463256836)]

In [14]:
Regression_result_df = pd.DataFrame(Regression_result)
# The second element of each tuple is the validation MAPE (as a fraction), not R^2
Regression_result_df.columns = ['Regressor', 'MAPE', 'Train and test time']
Regression_result_df['MAPE'] = (Regression_result_df['MAPE']*100).round(1).astype(str) + '%'
Regression_result_df

| |Regressor|MAPE|Train and test time|
|:------:|:------:|:------:|:------:|
|0|Linear Regression|10.5%|0.23|
|1|SGDRegressor|6016926290831936.0%|0.02|
|2|Linear SVR|33.3%|0.17|
|3|Lasso|15.8%|0.01|
|4|Ridge|10.5%|0.01|
|5|Passive-Aggresive|17.5%|0.01|
|6|DecisionTreeRegressor|8.1%|0.02|
|7|RandomForestRegressor|5.4%|2.84|
|8|AdaBoostRegressor|9.2%|0.06|
|9|GradientBoostingRegressor|7.0%|0.65|
|10|XGBRegressor|7.0%|0.45|


### **Tuning the random forest model**

In [15]:
# All other hyperparameters are left at the scikit-learn defaults
model = RandomForestRegressor(n_estimators=500, random_state=123)

model = model.fit(X_train, y_train)
y_predict_validation = model.predict(X_validation)
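
The model above is fitted with fixed hyperparameters. If an actual search were wanted, `GridSearchCV` is one standard option. The sketch below runs on synthetic data so that it is self-contained; the grid values, the scoring choice and the `X_demo`/`y_demo` names are assumptions and not settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

# Small illustrative grid; a real search would cover more values
param_grid = {
    'max_depth': [3, None],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=123),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',  # scikit-learn maximises, hence the negated error
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(-search.best_score_)  # mean absolute error of the best setting
```

For time-ordered data, passing `cv=TimeSeriesSplit(...)` from `sklearn.model_selection` instead of `cv=3` keeps each validation fold later than its training fold.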

In [16]:
import math
def print_metrics(y_true, y_predicted):
    ## First compute R^2 and the adjusted R^2
    r2 = sklm.r2_score(y_true, y_predicted)
    MAPE = mape(y_predicted, y_true)
    # r2_adj = r2 - (y_true.shape[0] - 1)/(y_true.shape[0] - n_parameters - 1) * (1 - r2)
    
    ## Print the usual metrics and the R^2 values
    print('Mean Square Error      = ' + str(sklm.mean_squared_error(y_true, y_predicted)))
    print('Root Mean Square Error = ' + str(math.sqrt(sklm.mean_squared_error(y_true, y_predicted))))
    print('Mean Absolute Error    = ' + str(sklm.mean_absolute_error(y_true, y_predicted)))
    print('Median Absolute Error  = ' + str(sklm.median_absolute_error(y_true, y_predicted)))
    print('R^2                    = ' + str(r2))
    print('MAPE                    = ' + str(MAPE))
    
    # print('Adjusted R^2           = ' + str(r2_adj))

print_metrics(y_validation, y_predict_validation)

Mean Square Error      = 0.3008800222513618
Root Mean Square Error = 0.5485253159621366
Mean Absolute Error    = 0.28422202650153067
Median Absolute Error  = 0.12658874394035635
R^2                    = 0.8364501577346957
MAPE                    = 0.05420337132096428


In [17]:
y_predict_test = model.predict(X_test)
print_metrics(y_test, y_predict_test)

Mean Square Error      = 0.569617587022461
Root Mean Square Error = 0.7547301418536702
Mean Absolute Error    = 0.338705583720686
Median Absolute Error  = 0.1411385306477726
R^2                    = 0.7266298043732462
MAPE                    = 0.07737910962115546


## **Predicting test data**

In [18]:
test_v2 = test.drop(['id'], axis = 1)
test_v2 = feature_eng(test_v2)

print("Shape of features :", test_v2.shape)

Shape of features : (180, 13)


In [19]:
feature_list = X.columns.tolist()
dummy_add = list(set(feature_list) - set(test_v2.columns))

for newcol in dummy_add:
    test_v2[newcol] = 0

test_v2 = test_v2[feature_list]
test_v2.head()

| |segment|year|Month|Date|weekday_Friday|weekday_Monday|weekday_Saturday|weekday_Sunday|weekday_Thursday|weekday_Tuesday|weekday_Wednesday|Seasons_Monsoon|Seasons_Summer|Seasons_Winter|
|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
|0|1|2019|7|6|0|0|1|0|0|0|0|1|0|0|
|1|1|2019|7|7|0|0|0|1|0|0|0|1|0|0|
|2|1|2019|7|8|0|1|0|0|0|0|0|1|0|0|
|3|1|2019|7|9|0|0|0|0|0|1|0|1|0|0|
|4|1|2019|7|10|0|0|0|0|0|0|1|1|0|0|
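
The column alignment above (adding the missing dummy columns as zeros, then reordering) can also be written as a single `reindex` call. A minimal sketch on a toy frame, where `train_cols` and `test_demo` are made-up stand-ins for `feature_list` and `test_v2`:

```python
import pandas as pd

# Column order expected by the model (shortened for the example)
train_cols = ['segment', 'Month', 'Seasons_Monsoon', 'Seasons_Summer', 'Seasons_Winter']

# Toy test frame that only contains one season dummy, in a different order
test_demo = pd.DataFrame({'Seasons_Monsoon': [1, 1], 'segment': [1, 0], 'Month': [7, 8]})

# Missing columns are created and filled with 0; the order follows train_cols
aligned = test_demo.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Unlike the loop, `reindex` also drops any column that is not in the target list, which may or may not be desirable.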


In [20]:
test['case_count'] = np.exp(model.predict(test_v2))
test['case_count'] = test['case_count'].round(0)
test.head()

| |id|application_date|segment|case_count|
|:------:|:------:|:------:|:------:|:------:|
|0|1|2019-07-06|1|2547.0|
|1|2|2019-07-07|1|1433.0|
|2|3|2019-07-08|1|2896.0|
|3|4|2019-07-09|1|3171.0|
|4|5|2019-07-10|1|3453.0|


In [21]:
Submission = test[['id', 'application_date', 'segment', 'case_count']]

Submission.to_csv("./Output/Submission_v2.csv", index = False)