# Scalable modelling approach
<b>Topics Covered</b>
* Split Data: Train and Test 
* Stratified sampling

Author: Pulkit Gupta

Date: 22-May-2018

### Importing Packages

This section imports the packages used throughout the notebook.


In [2]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

### Reading Data

In [3]:
data=pd.read_csv("../data/churn.csv")

### Sample data

In [4]:
data.head(5)

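| Unnamed: 0 | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | ... | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False. |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False. |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |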


### Define Target Feature and Id features

In [5]:
target_variable = ["Churn?"]
id_variable = ["Phone"]

### Data Preparation
<b>Define the following:</b>
* independent features (X)
* target feature (y)
* id features (ids)


In [6]:
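# Features: every column except the target and the id
X = data.drop(target_variable + id_variable, axis=1)
# Target and ids are kept as single-column DataFrames
y = data[target_variable]
ids = data[id_variable]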

In [7]:
# Basic Check of null values
data.isnull().values.any()

False
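
The check above returns a single boolean for the whole frame. When it comes back `True`, a per-column count is usually the next step. The sketch below shows one way to get it on a small toy frame; the values are made up for illustration and are not taken from the churn data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry in "Day Mins".
toy = pd.DataFrame({
    "Day Mins": [265.1, np.nan, 243.4],
    "Day Calls": [110, 123, 114],
})

# Same whole-frame check as in the notebook: True if any cell is null.
has_nulls = toy.isnull().values.any()

# Per-column null counts, keeping only the columns that have at least one.
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]

print(has_nulls)
print(columns_with_nulls)
```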

### Convert categorical variable into dummy/indicator variables

In [8]:
X_with_dummy_features = pd.get_dummies(X,drop_first=True)
y_with_dummy_features = pd.get_dummies(y,drop_first=True)
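
To make the effect of `drop_first=True` concrete, here is a small sketch on toy data. The column names are borrowed from the churn file, but the rows are illustrative only. Numeric columns pass through unchanged, while each categorical column with k levels becomes k-1 indicator columns.

```python
import pandas as pd

# Toy features: one categorical column and one numeric column.
toy_X = pd.DataFrame({
    "Int'l Plan": ["no", "yes", "no"],
    "Day Mins": [265.1, 161.6, 243.4],
})
# Toy target with the same string labels as the churn file.
toy_y = pd.DataFrame({"Churn?": ["False.", "True.", "False."]})

# drop_first=True drops the first level ("no" / "False."), so a binary
# column collapses to a single indicator column.
toy_X_dummies = pd.get_dummies(toy_X, drop_first=True)
toy_y_dummies = pd.get_dummies(toy_y, drop_first=True)

print(toy_X_dummies.columns.tolist())
print(toy_y_dummies.columns.tolist())
```

Because the encoded target is a single indicator column, summing it counts the churned customers. Depending on the pandas version, the indicator dtype may be `uint8` or `bool`; the sum works the same way in both cases.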

### Validation Sets

A reliable way to evaluate a model is to score it on a hold-out set that is excluded from training. scikit-learn's `train_test_split` utility performs this split:

* Example of split without stratified sampling

In [23]:
xtrain, xtest, ytrain, ytest, idtrain, idtest = \
    creating_test_train_without_stratified_samples(y_with_dummy_features,
                                                   ids,
                                                   X_with_dummy_features,
                                                   1000,
                                                   0.3)

No. of True Cases in training data set for 356
No. of True Cases in testing data set for 127
Ratio of True Cases in training data set:  0.15
Ratio of True Cases in testing data set:  0.13


* Example of split with stratified sampling

<cite>
<b>Stratified sampling</b> is a probability sampling technique that divides the entire population into subgroups, or strata, and then randomly selects the final subjects proportionally from each stratum.</cite>

In [22]:
xtrain, xtest, ytrain, ytest, idtrain, idtest = \
    creating_test_train_with_stratified_samples(y_with_dummy_features,
                                                ids,
                                                X_with_dummy_features,
                                                1000,
                                                0.3)

No. of True Cases in training data set for 338
No. of True Cases in testing data set for 145
Ratio of True Cases in training data set:  0.14
Ratio of True Cases in testing data set:  0.14
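
The same comparison can be reproduced without the churn file. The sketch below uses synthetic labels (140 positives out of 1000, an assumption chosen only to mimic an imbalanced target) and calls `train_test_split` with and without `stratify`. With stratification, the positive rate in both parts stays at the overall rate; without it, the two rates are free to drift apart.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 140 positives out of 1000 (hypothetical data).
y = np.array([1] * 140 + [0] * 860)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Plain random split: class proportions are left to chance.
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=1000)

# Stratified split: class proportions of y are preserved in both parts.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1000, stratify=y)

print("plain:     ", round(y_train_plain.mean(), 3), round(y_test_plain.mean(), 3))
print("stratified:", round(y_train_strat.mean(), 3), round(y_test_strat.mean(), 3))
```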


## Supporting Functions

In [14]:
def creating_test_train_without_stratified_samples(df_target, df_id, df_features, seed,
                                                   ratio_test):
    '''
        Objective: Split all three pandas datasets (df_target, df_id, df_features)
                   into training and testing datasets using a plain random split

        Arguments: target dataset, id dataset, features dataset, seed value,
                   test split ratio

        Returns  : training and testing datasets for features, target and ids,
                   in the order xtrain, xtest, ytrain, ytest, id_train, id_test
    '''
    xtrain, xtest, ytrain, ytest, id_train, id_test = train_test_split(
        df_features, df_target, df_id, test_size=ratio_test, random_state=seed)

    # The target is a single indicator column, so its sum counts the True cases
    n_true_train = ytrain.values.ravel().sum()
    n_true_test = ytest.values.ravel().sum()
    print ("No. of True Cases in training data set for", n_true_train)
    print ("No. of True Cases in testing data set for", n_true_test)

    print ("Ratio of True Cases in training data set: ", round(n_true_train / len(ytrain), 2))
    print ("Ratio of True Cases in testing data set: ", round(n_true_test / len(ytest), 2))
    return xtrain, xtest, ytrain, ytest, id_train, id_test

In [None]:
def creating_test_train_with_stratified_samples(df_target, df_id, df_features, seed,
                                                ratio_test):
    '''
        Objective: Split all three pandas datasets (df_target, df_id, df_features)
                   into training and testing datasets, stratified on the target

        Arguments: target dataset, id dataset, features dataset, seed value,
                   test split ratio

        Returns  : training and testing datasets for features, target and ids,
                   in the order xtrain, xtest, ytrain, ytest, id_train, id_test
    '''
    # Stratify on the target so both parts keep the overall ratio of True cases
    xtrain, xtest, ytrain, ytest, id_train, id_test = train_test_split(
        df_features, df_target, df_id, test_size=ratio_test,
        stratify=df_target, random_state=seed)

    # The target is a single indicator column, so its sum counts the True cases
    n_true_train = ytrain.values.ravel().sum()
    n_true_test = ytest.values.ravel().sum()
    print ("No. of True Cases in training data set for", n_true_train)
    print ("No. of True Cases in testing data set for", n_true_test)

    print ("Ratio of True Cases in training data set: ", round(n_true_train / len(ytrain), 2))
    print ("Ratio of True Cases in testing data set: ", round(n_true_test / len(ytest), 2))
    return xtrain, xtest, ytrain, ytest, id_train, id_test
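
The two helpers above differ only in the `stratify` argument, so they could be folded into one function. The following is a minimal sketch of that idea, not part of the original notebook: the name `split_train_test` and the `stratified` flag are hypothetical, and the toy frames exist only to make the example runnable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_test(df_target, df_id, df_features, seed, ratio_test,
                     stratified=True):
    """Split features, target and ids together (hypothetical helper).

    When `stratified` is True the split preserves the class ratio of
    `df_target`; otherwise it is a plain random split.
    """
    stratify_on = df_target if stratified else None
    return train_test_split(
        df_features, df_target, df_id,
        test_size=ratio_test, random_state=seed, stratify=stratify_on)


# Toy data: 10 churners out of 50 customers (illustrative values only).
toy_y = pd.DataFrame({"Churn?_True.": [1] * 10 + [0] * 40})
toy_ids = pd.DataFrame({"Phone": ["000-%04d" % i for i in range(50)]})
toy_X = pd.DataFrame({"Day Mins": range(50)})

xtrain, xtest, ytrain, ytest, id_train, id_test = split_train_test(
    toy_y, toy_ids, toy_X, seed=1000, ratio_test=0.3)

print(len(xtrain), len(xtest))
print(round(ytrain.values.mean(), 2), round(ytest.values.mean(), 2))
```

Passing all three frames to a single `train_test_split` call keeps their rows aligned, so `id_test` still identifies the customers behind each row of `xtest`.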