### Tasks/Activities List
Your code should contain the following activities and analyses:
- Collect the data from the zip file linked here.
- Split the dataset into dependent and independent variables.
- Perform train, test split.
- Apply StandardScaler() to the train and test dependent variable.
- Fit the best parameters.
- Model Prediction
- Model Validation Statistics
- Create a FastAPI app
- Create routes for getting predictions
- Create a route for retraining the model


In [1]:
import pandas as pd 
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,accuracy_score,f1_score,recall_score, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model  import  LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
import os
# pipelines
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
import warnings
warnings.filterwarnings("ignore")
print("All libraries sucessfully loaded ")

All libraries sucessfully loaded 


In [2]:
df=pd.read_csv("tel_churn.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,address,income,employ,reside,longmon,tollmon,equipmon,cardmon,wiremon,...,custcat_Plus service,custcat_Total service,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72,AgeGroup_Junior,AgeGroup_Senior
0,0,9,64,5,2,3.7,0.0,0.0,7.5,0.0,...,0,0,0,1,0,0,0,0,1,0
1,1,7,136,5,6,4.4,20.75,0.0,15.25,35.7,...,0,1,1,0,0,0,0,0,1,0
2,2,24,116,29,2,18.15,18.0,0.0,30.25,0.0,...,1,0,0,0,0,0,0,1,1,0
3,3,12,33,0,1,9.45,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,0,0,1,0
4,4,9,30,2,4,6.3,0.0,0.0,0.0,0.0,...,1,0,0,1,0,0,0,0,1,0


In [3]:
df=df.drop('Unnamed: 0',axis=1)

In [4]:
# Splitting the independent and dependent variables

x = df.drop('churn', axis=1)# feature
y = df['churn'] # target 

In [33]:
x.dtypes

address                   int64
income                    int64
employ                    int64
reside                    int64
longmon                 float64
                         ...   
tenure_group_37 - 48      int64
tenure_group_49 - 60      int64
tenure_group_61 - 72      int64
AgeGroup_Junior           int64
AgeGroup_Senior           int64
Length: 68, dtype: object

In [5]:
## We are using standard scaler as Variables that are measured at different scales do not contribute equally to 
#the model fitting & model learned function and might end up creating a bias.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
X_train, X_test, Y_train, y_test = train_test_split(X_scaled, y,stratify=y, test_size=0.2,random_state=1)


In [6]:
## Logistic Regression 
log_reg = LogisticRegression()

log_reg.fit(X_train,Y_train)

In [7]:
## Creating function that will evaluate model. 
## It will help us to reduced the code size and increase reusability of code

def evaluate_model(model,x_train,y_train,x_test,y_test,fit=False):
    '''
    Model Evaluation for Classifier
    :param  model : model object 
    :param x_train: Train features
    :param y_train: Train Target 
    :param x_test: Test features
    :param y_test: Test Target 
    :param fit bool : True if model is already fited else false

    :return: Train and Test Classification report and AUC- ROC Graph
    '''
    if fit == False:
        model.fit(x_train,y_train)
    train_pred=model.predict(x_train)
    print("Training report")
    print(classification_report(y_train, train_pred))    
    
    print("Testing report")
    test_pred=model.predict(x_test)    
    print(classification_report(y_test, test_pred))

# evaluating the model 
evaluate_model(log_reg,X_train,Y_train,X_test,y_test,fit=True)

model_score_log_reg= log_reg.score(X_test, y_test)
print('Test Score: ',model_score_log_reg)

Training report
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       581
           1       0.66      0.47      0.55       219

    accuracy                           0.79       800
   macro avg       0.74      0.69      0.70       800
weighted avg       0.78      0.79      0.78       800

Testing report
              precision    recall  f1-score   support

           0       0.83      0.90      0.87       145
           1       0.67      0.53      0.59        55

    accuracy                           0.80       200
   macro avg       0.75      0.72      0.73       200
weighted avg       0.79      0.80      0.79       200

Test Score:  0.8



- we need to check recall, precision & f1 score for the minority class, and it's quite evident that the precision, recall & f1 score is too low for Class 1, i.e. churned customers.


In [8]:
##Base model on SMOTE :
## Using SMOTE To balance Training dataset 
# transform the dataset

oversample = SMOTE()
x_train, y_train = oversample.fit_resample(X_train, Y_train)

log_reg_smote = LogisticRegression()

log_reg_smote.fit(x_train,y_train)

print("Training data size")
print(x_train.shape)
print(y_train.value_counts())
evaluate_model(log_reg_smote,x_train,y_train,X_test,y_test,fit=True)

model_score_log_reg_smote= log_reg_smote.score(X_test, y_test)
print('Test Score: ',model_score_log_reg_smote)

Training data size
(1162, 68)
0    581
1    581
Name: churn, dtype: int64
Training report
              precision    recall  f1-score   support

           0       0.78      0.75      0.77       581
           1       0.76      0.79      0.78       581

    accuracy                           0.77      1162
   macro avg       0.77      0.77      0.77      1162
weighted avg       0.77      0.77      0.77      1162

Testing report
              precision    recall  f1-score   support

           0       0.90      0.72      0.80       145
           1       0.51      0.78      0.62        55

    accuracy                           0.73       200
   macro avg       0.70      0.75      0.71       200
weighted avg       0.79      0.73      0.75       200

Test Score:  0.735


In [9]:
# %
#applying smoteenn to balance our data set 
sm = SMOTEENN()
X_resampled, y_resampled = sm.fit_resample(x,y)

xr_train,xr_test,yr_train,yr_test=train_test_split(X_resampled, y_resampled,test_size=0.2)

In [10]:
#logistic regression with smoteenn
model_LR_smoteen=LogisticRegression()

model_LR_smoteen.fit(xr_train,yr_train)
model1_score_r = model_LR_smoteen.score(xr_test, yr_test)

evaluate_model(model_LR_smoteen,xr_train,yr_train,xr_test,yr_test,fit=True)
print('Test Score :',model1_score_r)


Training report
              precision    recall  f1-score   support

           0       0.89      0.88      0.88       273
           1       0.90      0.91      0.91       342

    accuracy                           0.90       615
   macro avg       0.90      0.89      0.89       615
weighted avg       0.90      0.90      0.90       615

Testing report
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        70
           1       0.89      0.92      0.90        84

    accuracy                           0.89       154
   macro avg       0.89      0.89      0.89       154
weighted avg       0.89      0.89      0.89       154

Test Score : 0.8896103896103896


In [11]:
#Support Vector Machine 
svm=SVC()
svm.fit(X_train,Y_train)

In [12]:
# %
evaluate_model(svm,X_train,Y_train,X_test,y_test,fit=True)

model_score_svm= svm.score(X_test, y_test)
print('Test Score: ',model_score_svm)

Training report
              precision    recall  f1-score   support

           0       0.83      0.97      0.90       581
           1       0.88      0.49      0.63       219

    accuracy                           0.84       800
   macro avg       0.86      0.73      0.76       800
weighted avg       0.85      0.84      0.82       800

Testing report
              precision    recall  f1-score   support

           0       0.81      0.93      0.87       145
           1       0.71      0.44      0.54        55

    accuracy                           0.80       200
   macro avg       0.76      0.68      0.70       200
weighted avg       0.78      0.80      0.78       200

Test Score:  0.795


In [13]:
# %
#applying SMOTE
svm_smote=SVC()

#fitting the scv smote model
svm_smote.fit(x_train,y_train)

#evaluating the model
evaluate_model(svm_smote,X_train,Y_train,X_test,y_test,fit=True)

model_score_svm_smote= svm_smote.score(X_test, y_test)
print('Test Score: ',model_score_svm_smote)

Training report
              precision    recall  f1-score   support

           0       0.94      0.90      0.92       581
           1       0.76      0.86      0.80       219

    accuracy                           0.89       800
   macro avg       0.85      0.88      0.86       800
weighted avg       0.89      0.89      0.89       800

Testing report
              precision    recall  f1-score   support

           0       0.87      0.79      0.83       145
           1       0.55      0.69      0.61        55

    accuracy                           0.76       200
   macro avg       0.71      0.74      0.72       200
weighted avg       0.78      0.76      0.77       200

Test Score:  0.76


In [14]:
#Support Vector Machine with smoteenn

model_scv_smoteen=SVC()

model_scv_smoteen.fit(xr_train,yr_train)
model2_score_r = model_scv_smoteen.score(xr_test, yr_test)

evaluate_model(model_scv_smoteen,xr_train,yr_train,xr_test,yr_test,fit=True)
print('Test Score :',model2_score_r)


Training report
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       273
           1       0.87      0.95      0.91       342

    accuracy                           0.90       615
   macro avg       0.90      0.89      0.89       615
weighted avg       0.90      0.90      0.90       615

Testing report
              precision    recall  f1-score   support

           0       0.92      0.79      0.85        70
           1       0.84      0.94      0.89        84

    accuracy                           0.87       154
   macro avg       0.88      0.86      0.87       154
weighted avg       0.88      0.87      0.87       154

Test Score : 0.8701298701298701


- SMOTE ENN Performed better for linear models 


In [15]:
# %%
# Random Forest
rf_clf = RandomForestClassifier(n_jobs=-1)
rf_clf.fit(x_train,y_train)
evaluate_model(rf_clf,x_train,y_train,X_test,y_test,fit=True)

model_score_rand_clf = rf_clf.score(X_test, y_test)
print('Test Score: ',model_score_rand_clf)

Training report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       581
           1       1.00      1.00      1.00       581

    accuracy                           1.00      1162
   macro avg       1.00      1.00      1.00      1162
weighted avg       1.00      1.00      1.00      1162

Testing report
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       145
           1       0.60      0.58      0.59        55

    accuracy                           0.78       200
   macro avg       0.72      0.72      0.72       200
weighted avg       0.78      0.78      0.78       200

Test Score:  0.78


In [16]:
## Hyperparameter tunining 
# we are tuning three hyperparameters right now, we are passing the different values for both parameters
grid_param = {
    "n_estimators" : [90,100,115,130],
    'criterion': ['gini', 'entropy'],
    'max_depth' : range(2,20,1),
    'min_samples_leaf' : range(1,10,1),
    'min_samples_split': range(2,10,1),
    'max_features' : ['auto','log2']
}
random_search = RandomizedSearchCV(estimator=rf_clf,param_distributions=grid_param,cv=5,n_jobs =-1,verbose = 3)
random_search.fit(x_train,y_train)
# %%
# %%
print(random_search.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
{'n_estimators': 90, 'min_samples_split': 9, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 18, 'criterion': 'entropy'}


In [17]:
rand_clf_tune = RandomForestClassifier(criterion= 'entropy',
 max_depth = 14,
 max_features = 'log2',
 min_samples_leaf = 1,
 min_samples_split= 4,
 n_estimators = 115,random_state=6)

rand_clf_tune.fit(x_train,y_train)
evaluate_model(rand_clf_tune,x_train,y_train,X_test,y_test,fit=True)

model_score_rand_clf_tune = rand_clf_tune.score(X_test, y_test)
print('Test Score: ',model_score_rand_clf_tune)


Training report
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       581
           1       1.00      0.99      1.00       581

    accuracy                           1.00      1162
   macro avg       1.00      1.00      1.00      1162
weighted avg       1.00      1.00      1.00      1162

Testing report
              precision    recall  f1-score   support

           0       0.83      0.87      0.85       145
           1       0.61      0.55      0.58        55

    accuracy                           0.78       200
   macro avg       0.72      0.71      0.71       200
weighted avg       0.77      0.78      0.78       200

Test Score:  0.78


In [18]:
#%
# Random Forest with smoteenn

model_rf_smoteen=RandomForestClassifier(criterion= 'entropy',
 max_depth = 14,
 max_features = 'log2',
 min_samples_leaf = 1,
 min_samples_split= 4,
 n_estimators = 115,random_state=6)

In [19]:
model_rf_smoteen.fit(xr_train,yr_train)
yr_predict4 = model_rf_smoteen.predict(xr_test)
model3_score_r = model_rf_smoteen.score(xr_test, yr_test)

evaluate_model(model_scv_smoteen,xr_train,yr_train,xr_test,yr_test,fit=True)
print('Test Score :',model3_score_r)


Training report
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       273
           1       0.87      0.95      0.91       342

    accuracy                           0.90       615
   macro avg       0.90      0.89      0.89       615
weighted avg       0.90      0.90      0.90       615

Testing report
              precision    recall  f1-score   support

           0       0.92      0.79      0.85        70
           1       0.84      0.94      0.89        84

    accuracy                           0.87       154
   macro avg       0.88      0.86      0.87       154
weighted avg       0.88      0.87      0.87       154

Test Score : 0.922077922077922


In [20]:
# %
# Gossian Naive Bayes 
nb_clf=GaussianNB()
nb_clf.fit(x_train,y_train)
evaluate_model(nb_clf,x_train,y_train,X_test,y_test,fit=True)

model_score_nb_clf = nb_clf.score(X_test, y_test)
print(model_score_nb_clf)


Training report
              precision    recall  f1-score   support

           0       0.80      0.52      0.63       581
           1       0.64      0.87      0.74       581

    accuracy                           0.70      1162
   macro avg       0.72      0.70      0.69      1162
weighted avg       0.72      0.70      0.69      1162

Testing report
              precision    recall  f1-score   support

           0       0.94      0.54      0.69       145
           1       0.43      0.91      0.58        55

    accuracy                           0.65       200
   macro avg       0.69      0.73      0.64       200
weighted avg       0.80      0.65      0.66       200

0.645


In [21]:
# Gossian Naive Bayes with smoteenn

model_nb_smoteen=GaussianNB()

model_nb_smoteen.fit(xr_train,yr_train)
yr_predict3 = model_nb_smoteen.predict(xr_test)
model4_score_r = model_nb_smoteen.score(xr_test, yr_test)

evaluate_model(model_scv_smoteen,xr_train,yr_train,xr_test,yr_test,fit=True)
print('Test Score :',model4_score_r)


Training report
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       273
           1       0.87      0.95      0.91       342

    accuracy                           0.90       615
   macro avg       0.90      0.89      0.89       615
weighted avg       0.90      0.90      0.90       615

Testing report
              precision    recall  f1-score   support

           0       0.92      0.79      0.85        70
           1       0.84      0.94      0.89        84

    accuracy                           0.87       154
   macro avg       0.88      0.86      0.87       154
weighted avg       0.88      0.87      0.87       154

Test Score : 0.8766233766233766


### Pickling the model


In [23]:
import pickle
pickle_out=open("model_rf_smoteen.pkl","wb")
pickle.dump(model_rf_smoteen,pickle_out)
pickle_out.close()