# JanataHack: Mobility Analytics

Welcome to Sigma Cab Private Limited, a cab aggregator service. Customers can download the app on their smartphones and book a cab from anywhere in the cities where the company operates. Sigma Cabs, in turn, searches for cabs across various service providers and offers the client the best of the available options. The company has been in operation for a little less than a year. During this period, it has captured surge_pricing_type from the service providers.

You have been hired by Sigma Cabs as a Data Scientist and asked to build a predictive model that helps them predict the surge_pricing_type proactively. This would in turn help them match the right cabs with the right customers quickly and efficiently.

## Approach

* No feature engineering was done.
* Created a separate group for null values in categorical variables.
* Replaced null values in continuous variables with the mean or median.
* Used CatBoost with 5-fold CV.
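
The null-handling steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual code: the helper `impute` and the toy values are hypothetical, and only the column names come from the competition data.

```python
import numpy as np
import pandas as pd

def impute(df, cat_fill, num_fill):
    """Return a copy of df with nulls filled (hypothetical helper).

    cat_fill: {column: label used as the separate 'missing' category}
    num_fill: {column: value, e.g. a mean or median, used for nulls}
    """
    out = df.copy()
    for col, label in cat_fill.items():
        out[col] = out[col].fillna(label)
    for col, value in num_fill.items():
        out[col] = out[col].fillna(value)
    return out

# Toy data: the values are made up for illustration.
toy = pd.DataFrame({
    "Type_of_Cab": ["A", None, "B"],
    "Var1": [10.0, np.nan, 30.0],
})

filled = impute(
    toy,
    cat_fill={"Type_of_Cab": "F"},            # nulls become their own group
    num_fill={"Var1": toy["Var1"].median()},  # nulls become the column median
)
print(filled)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on an attribute, avoids chained-assignment problems in recent pandas versions.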


In [10]:
# import libraries

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
#%matplotlib inline 
import os
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

#read data

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
sub = pd.read_csv('sample_submission.csv')

# Categorical columns: missing values get their own category
df_train['Type_of_Cab'] = df_train['Type_of_Cab'].fillna('F')
df_test['Type_of_Cab'] = df_test['Type_of_Cab'].fillna('F')
df_train['Confidence_Life_Style_Index'] = df_train['Confidence_Life_Style_Index'].fillna('D')
df_test['Confidence_Life_Style_Index'] = df_test['Confidence_Life_Style_Index'].fillna('D')

# Customer_Since_Months: 99 marks missing values; cast to int so it can be used as a categorical feature
df_train['Customer_Since_Months'] = df_train['Customer_Since_Months'].fillna(99).astype(int)
df_test['Customer_Since_Months'] = df_test['Customer_Since_Months'].fillna(99).astype(int)

# Continuous columns: fill with the mean (Life_Style_Index) or median (Var1) of each file
df_train['Life_Style_Index'] = df_train['Life_Style_Index'].fillna(df_train['Life_Style_Index'].mean())
df_test['Life_Style_Index'] = df_test['Life_Style_Index'].fillna(df_test['Life_Style_Index'].mean())
df_train['Var1'] = df_train['Var1'].fillna(df_train['Var1'].median())
df_test['Var1'] = df_test['Var1'].fillna(df_test['Var1'].median())



In [11]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131662 entries, 0 to 131661
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Trip_ID                      131662 non-null  object 
 1   Trip_Distance                131662 non-null  float64
 2   Type_of_Cab                  131662 non-null  object 
 3   Customer_Since_Months        131662 non-null  int32  
 4   Life_Style_Index             131662 non-null  float64
 5   Confidence_Life_Style_Index  131662 non-null  object 
 6   Destination_Type             131662 non-null  object 
 7   Customer_Rating              131662 non-null  float64
 8   Cancellation_Last_1Month     131662 non-null  int64  
 9   Var1                         131662 non-null  float64
 10  Var2                         131662 non-null  int64  
 11  Var3                         131662 non-null  int64  
 12  Gender                       131662 non-null  object 
 13 

In [12]:
from sklearn.model_selection import StratifiedKFold,KFold
# Set up folds
K = 5
kf = KFold(n_splits = K, random_state = 7, shuffle = True)
skf = StratifiedKFold(n_splits = K, random_state = 7, shuffle = True)
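
The cell above defines both a plain and a stratified splitter. As a rough illustration of what stratification does, the sketch below splits made-up, imbalanced labels with `StratifiedKFold` and counts the classes in each validation fold. The label counts and the `_toy` names are assumptions for the example, not properties of the competition data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 10 of class 1, 20 of class 2, 20 of class 3.
y_toy = np.array([1] * 10 + [2] * 20 + [3] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features do not affect the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

fold_counts = []
for _, valid_idx in skf_toy.split(X_toy, y_toy):
    # Count how many samples of each class land in this validation fold.
    counts = {int(c): int((y_toy[valid_idx] == c).sum()) for c in (1, 2, 3)}
    fold_counts.append(counts)
    print(counts)
```

Each validation fold keeps the 1:2:2 class ratio of the full label vector, whereas a plain `KFold` split gives no such guarantee.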

In [13]:
MAX_ROUNDS = 1000
OPTIMIZE_ROUNDS = False
#LEARNING_RATE = 0.1

In [14]:
from sklearn.metrics import accuracy_score
X = df_train.drop(columns=['Trip_ID', 'Surge_Pricing_Type'])
y = df_train['Surge_Pricing_Type']
X_test = df_test.drop(columns='Trip_ID')
y_valid_pred = 0*y
y_test_pred = 0
accuracy = 0
result={}
# Categorical feature columns, specified by name
cat_columns = ['Type_of_Cab','Confidence_Life_Style_Index','Destination_Type','Gender','Customer_Since_Months']
#fitting catboost classifier model
j=1
model = CatBoostClassifier(n_estimators=MAX_ROUNDS,verbose=False)
for i, (train_index, test_index) in enumerate(kf.split(df_train)):

#for train_index, test_index in skf.split(X, y):  
    # Create data for this fold
    y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
    X_train, X_valid = X.iloc[train_index,:], X.iloc[test_index,:]
    print( "\nFold ", j)
    #print( "\nFold ", i)
    
    # Run model for this fold
    if OPTIMIZE_ROUNDS:
        fit_model = model.fit( X_train, y_train,
                               eval_set=(X_valid, y_valid), cat_features=cat_columns,
                               use_best_model=True
                             )
        print( "  N trees = ", model.tree_count_ )
    else:
        fit_model = model.fit( X_train, y_train,cat_features=cat_columns )
        
    # Generate validation predictions for this fold
    pred = fit_model.predict(X_valid)
    y_valid_pred.iloc[test_index] = pred.reshape(-1)
    print(accuracy_score(y_valid,pred))
    accuracy+=accuracy_score(y_valid,pred)
    # Accumulate test set predictions
    y_test_pred += fit_model.predict(X_test)
    result[j]=fit_model.predict(X_test)
    j+=1
results = y_test_pred / K  # Mean of the predicted labels across folds; not used for the submission
print(accuracy / K)  # Mean validation accuracy across folds


Fold  1
0.7068697072114837

Fold  2
0.707857061481791

Fold  3
0.7057952301382349

Fold  4
0.7008962479112867

Fold  5
0.7046559319459214
0.7052148357377435
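
The loop accumulates predicted class labels in `y_test_pred`, and a mean of nominal labels is hard to interpret. One alternative, which the notebook does not use, is to average per-fold class probabilities (the kind of array a classifier's `predict_proba` returns) and take the most likely class. The sketch below uses made-up probabilities and assumes the labels are 1, 2 and 3.

```python
import numpy as np

classes = np.array([1, 2, 3])  # assumed label set for Surge_Pricing_Type

# Made-up class probabilities for two test rows from three folds,
# standing in for per-fold predict_proba output.
fold_probs = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),
    np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]),
]

# Average the probabilities across folds, then pick the most likely class per row.
mean_probs = np.mean(fold_probs, axis=0)
soft_vote = classes[mean_probs.argmax(axis=1)]
print(soft_vote)
```

Soft voting of this kind uses each fold's confidence rather than only its top choice, at the cost of keeping the probability arrays around.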


In [15]:
d = pd.DataFrame()
for i in range(1, 6):
    d = pd.concat([d,pd.DataFrame(result[i])],axis=1)
d.columns=['1','2','3','4','5']
#d.to_csv("d.csv",index=False)

In [16]:
re = d.mode(axis=1)[0]
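
The `d.mode(axis=1)[0]` call amounts to a majority vote across the five folds. A small sketch with made-up votes shows the behaviour, including what happens on a tie; the frame `votes` is hypothetical.

```python
import pandas as pd

# Made-up per-fold predictions for four test rows (columns = folds 1..5).
votes = pd.DataFrame({
    "1": [1, 2, 3, 1],
    "2": [1, 2, 3, 2],
    "3": [2, 2, 1, 2],
    "4": [1, 3, 3, 1],
    "5": [1, 2, 3, 3],
})

# mode(axis=1) returns one column per tied mode, sorted ascending, so
# column 0 holds the most frequent label, with ties going to the smaller label.
majority = votes.mode(axis=1)[0].astype(int)
print(majority.tolist())
```

The last row is a tie between labels 1 and 2, and the vote resolves to 1. When ties occur, pandas pads the shorter rows with NaN, which is likely why the labels in the submission appear as floats such as `1.0`; casting back to `int`, as in the sketch, restores integer labels.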

In [17]:
sub.Surge_Pricing_Type = re
sub.to_csv('cb_k.csv',index = False)


In [40]:
sub

Unnamed: 0,Trip_ID,Surge_Pricing_Type
0,T0005689459,1.0
1,T0005689462,2.0
2,T0005689463,2.0
3,T0005689466,2.0
4,T0005689468,2.0
...,...,...
87390,T0005908503,2.0
87391,T0005908504,2.0
87392,T0005908505,1.0
87393,T0005908511,2.0
