## [Predict the churn risk rate](https://www.kaggle.com/parv619/hackerearth-how-not-to-lose-a-customer-in-10-days)
Max. score: 100
Churn rate is a marketing metric that describes the number of customers who leave a business over a specific time period. Every user is assigned a prediction value that estimates their state of churn at any given time. This value is based on:

- User demographic information
- Browsing behavior
- Historical purchase data, among other information

It factors in our unique and proprietary predictions of how long a user will remain a customer. This score is updated every day for all users who have a minimum of one conversion. The values assigned are between 1 and 5.

Task:
Your task is to predict the churn score for a website based on the features provided in the dataset.

In [49]:
import pandas as pd
import numpy as np
import copy
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Import Dataset

In [50]:
dataset_train=pd.read_csv('./train.csv')
dataset_train_copy=dataset_train.copy()
# dataset_test=pd.read_csv('./test.csv')

In [51]:
dataset_train_copy.head()

Unnamed: 0,customer_id,Name,age,gender,security_no,region_category,membership_category,joining_date,joined_through_referral,referral_id,...,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback,churn_risk_score
0,fffe4300490044003600300030003800,Pattie Morrisey,18,F,XW0DQ7H,Village,Platinum Membership,2017-08-17,No,xxxxxxxx,...,300.63,53005.25,17.0,781.75,Yes,Yes,No,Not Applicable,Products always in Stock,2
1,fffe43004900440032003100300035003700,Traci Peery,32,F,5K0N3X1,City,Premium Membership,2017-08-28,?,CID21329,...,306.34,12838.38,10.0,,Yes,No,Yes,Solved,Quality Customer Care,1
2,fffe4300490044003100390032003600,Merideth Mcmeen,44,F,1F2TCL3,Town,No Membership,2016-11-11,Yes,CID12313,...,516.16,21027.0,22.0,500.69,No,Yes,Yes,Solved in Follow-up,Poor Website,5
3,fffe43004900440036003000330031003600,Eufemia Cardwell,37,M,VJGJ33N,City,No Membership,2016-10-29,Yes,CID3793,...,53.27,25239.56,6.0,567.66,No,Yes,Yes,Unsolved,Poor Website,5
4,fffe43004900440031003900350030003600,Meghan Kosak,31,F,SVZXCWB,City,No Membership,2017-09-12,No,xxxxxxxx,...,113.13,24483.66,16.0,663.06,No,Yes,Yes,Solved,Poor Website,5


In [52]:
# dataset_test.describe()

In [53]:
dataset_train_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36992 entries, 0 to 36991
Data columns (total 25 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   customer_id                   36992 non-null  object 
 1   Name                          36992 non-null  object 
 2   age                           36992 non-null  int64  
 3   gender                        36992 non-null  object 
 4   security_no                   36992 non-null  object 
 5   region_category               31564 non-null  object 
 6   membership_category           36992 non-null  object 
 7   joining_date                  36992 non-null  object 
 8   joined_through_referral       36992 non-null  object 
 9   referral_id                   36992 non-null  object 
 10  preferred_offer_types         36704 non-null  object 
 11  medium_of_operation           36992 non-null  object 
 12  internet_option               36992 non-null  object 
 13  l

In [54]:
# Check for NaN data
display(dataset_train_copy.isnull().sum())

customer_id                        0
Name                               0
age                                0
gender                             0
security_no                        0
region_category                 5428
membership_category                0
joining_date                       0
joined_through_referral            0
referral_id                        0
preferred_offer_types            288
medium_of_operation                0
internet_option                    0
last_visit_time                    0
days_since_last_login              0
avg_time_spent                     0
avg_transaction_value              0
avg_frequency_login_days           0
points_in_wallet                3443
used_special_discount              0
offer_application_preference       0
past_complaint                     0
complaint_status                   0
feedback                           0
churn_risk_score                   0
dtype: int64

In [55]:
# class distribution
print(dataset_train_copy.groupby('churn_risk_score').size())

churn_risk_score
-1     1163
 1     2652
 2     2741
 3    10424
 4    10185
 5     9827
dtype: int64
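
The distribution above includes 1163 rows labelled -1, which falls outside the 1 to 5 range given in the task description. One option, not taken in this notebook, is to drop those rows before training. The sketch below shows the idea on a small made-up frame; `toy` and `valid` are illustrative names, and the values are not from the real dataset.

```python
import pandas as pd

# Toy stand-in for the training frame; the values are illustrative only.
toy = pd.DataFrame({
    "avg_time_spent": [300.63, 306.34, 516.16, 53.27, 113.13],
    "churn_risk_score": [2, 1, -1, 5, 5],
})

# Keep only rows whose label lies in the documented 1-5 range.
valid = toy[toy["churn_risk_score"].between(1, 5)].reset_index(drop=True)

# Class counts after filtering, sorted by label.
print(valid["churn_risk_score"].value_counts().sort_index())
```

Whether dropping is appropriate depends on how the competition scores the -1 rows, which the source does not state.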


In [56]:
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
dataset_train_copy = dataset_train_copy.ffill()

In [57]:
display(dataset_train_copy.isnull().sum())

customer_id                     0
Name                            0
age                             0
gender                          0
security_no                     0
region_category                 0
membership_category             0
joining_date                    0
joined_through_referral         0
referral_id                     0
preferred_offer_types           0
medium_of_operation             0
internet_option                 0
last_visit_time                 0
days_since_last_login           0
avg_time_spent                  0
avg_transaction_value           0
avg_frequency_login_days        0
points_in_wallet                0
used_special_discount           0
offer_application_preference    0
past_complaint                  0
complaint_status                0
feedback                        0
churn_risk_score                0
dtype: int64
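
Forward fill copies the value from the previous row, so the result depends on row order: customer 1's missing `points_in_wallet` ends up as 781.75, the value from customer 0. A possible alternative, shown here only as a sketch on made-up data, is to fill categorical columns with the most frequent value and numeric columns with the median. The frame `toy` and the choice of mode and median are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column; values are illustrative.
toy = pd.DataFrame({
    "region_category": ["Village", np.nan, "Town", "City", "City"],
    "points_in_wallet": [781.75, np.nan, 500.69, 567.66, 663.06],
})

filled = toy.copy()

# Categorical column: fill with the most frequent category.
filled["region_category"] = filled["region_category"].fillna(
    filled["region_category"].mode()[0]
)

# Numeric column: fill with the median of the observed values.
filled["points_in_wallet"] = filled["points_in_wallet"].fillna(
    filled["points_in_wallet"].median()
)

print(filled)
```

Unlike forward fill, this gives the same result regardless of how the rows are ordered.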

In [58]:
print(dataset_train_copy['membership_category'].value_counts())

Basic Membership       7724
No Membership          7692
Gold Membership        6795
Silver Membership      5988
Premium Membership     4455
Platinum Membership    4338
Name: membership_category, dtype: int64


In [59]:
dataset_train_encoded = dataset_train_copy.copy()
categ = ['region_category','membership_category','complaint_status','feedback']
encoder = LabelEncoder()
# dataset_train_encoded = encoder.fit_transform(dataset_train_copy)
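# The same encoder object is refit on each column in turn, so afterwards
# encoder.classes_ only holds the classes of the last column ('feedback').
dataset_train_encoded[categ] = dataset_train_encoded[categ].apply(encoder.fit_transform)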
# dataset_train_encoded = dataset_train_encoded.apply(encoder.fit_transform)
# dataset_train_encoded['membership_category'] = encoder.fit_transform(dataset_train_copy['membership_category'])
dataset_train_encoded.head()

Unnamed: 0,customer_id,Name,age,gender,security_no,region_category,membership_category,joining_date,joined_through_referral,referral_id,...,avg_time_spent,avg_transaction_value,avg_frequency_login_days,points_in_wallet,used_special_discount,offer_application_preference,past_complaint,complaint_status,feedback,churn_risk_score
0,fffe4300490044003600300030003800,Pattie Morrisey,18,F,XW0DQ7H,2,3,2017-08-17,No,xxxxxxxx,...,300.63,53005.25,17.0,781.75,Yes,Yes,No,1,4,2
1,fffe43004900440032003100300035003700,Traci Peery,32,F,5K0N3X1,0,4,2017-08-28,?,CID21329,...,306.34,12838.38,10.0,781.75,Yes,No,Yes,2,5,1
2,fffe4300490044003100390032003600,Merideth Mcmeen,44,F,1F2TCL3,1,2,2016-11-11,Yes,CID12313,...,516.16,21027.0,22.0,500.69,No,Yes,Yes,3,3,5
3,fffe43004900440036003000330031003600,Eufemia Cardwell,37,M,VJGJ33N,0,2,2016-10-29,Yes,CID3793,...,53.27,25239.56,6.0,567.66,No,Yes,Yes,4,3,5
4,fffe43004900440031003900350030003600,Meghan Kosak,31,F,SVZXCWB,0,2,2017-09-12,No,xxxxxxxx,...,113.13,24483.66,16.0,663.06,No,Yes,Yes,2,3,5


In [60]:
print(encoder.classes_)

['No reason specified' 'Poor Customer Service' 'Poor Product Quality'
 'Poor Website' 'Products always in Stock' 'Quality Customer Care'
 'Reasonable Price' 'Too many ads' 'User Friendly Website']
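
Because a single `LabelEncoder` is reused, only the `feedback` classes are printed above and the mappings for the other columns are lost. If those mappings are needed later, for example to encode a test set the same way, one option is to keep one encoder per column. This is a minimal sketch on toy data; the `encoders` dictionary is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns; values are illustrative.
toy = pd.DataFrame({
    "region_category": ["Village", "City", "Town", "City"],
    "feedback": ["Poor Website", "Reasonable Price", "Too many ads", "Poor Website"],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each column's mapping stays available after the loop.
print(list(encoders["region_category"].classes_))
print(encoders["feedback"].inverse_transform(encoded["feedback"]))
```

`LabelEncoder` assigns integers in sorted order of the class names, which is why `City`, `Town` and `Village` map to 0, 1 and 2 in the encoded table above.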


In [61]:
# Split dataset into features and target variable
feature_cols = ['region_category','membership_category',
'avg_time_spent','days_since_last_login',
'avg_transaction_value',
'points_in_wallet',
'complaint_status',
'feedback'
]
X = dataset_train_encoded[feature_cols] # Features
y = dataset_train_encoded.churn_risk_score # Target variable

In [62]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test
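
# The classes are imbalanced (for example 2652 rows for score 1 versus 10424
# for score 3), so a plain random split may not preserve class proportions.
# A possible variant, sketched here on synthetic data, passes `stratify` to
# train_test_split. X_toy and y_toy are made-up stand-ins for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced 5-class target (200 rows).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.repeat([1, 2, 3, 4, 5], [10, 10, 60, 60, 60])

# stratify=y_toy keeps each class's share equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Per-class counts in the held-out 20%.
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```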

In [63]:
# Create Random Forest classifier object
# clf = KNeighborsClassifier(n_neighbors=37)
clf=RandomForestClassifier(n_estimators = 500,random_state = 1)
# clf=DecisionTreeClassifier(min_samples_split=100,random_state = 1)
# clf=GaussianNB()

# Train Random Forest Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [64]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)*100)
# Model Classification Report
print("Classification Report:",metrics.classification_report(y_test, y_pred))
# Model Confusion Matrix
print(metrics.confusion_matrix(y_test, y_pred))

Accuracy: 74.52358426814433
Classification Report:               precision    recall  f1-score   support

          -1       0.00      0.00      0.00       236
           1       0.71      0.79      0.74       506
           2       0.78      0.74      0.76       548
           3       0.87      0.92      0.90      2087
           4       0.65      0.58      0.62      2020
           5       0.70      0.81      0.75      2002

    accuracy                           0.75      7399
   macro avg       0.62      0.64      0.63      7399
weighted avg       0.72      0.75      0.73      7399

[[   0   25   10   76   69   56]
 [   0  398  104    0    4    0]
 [   0  141  404    0    3    0]
 [   0    0    0 1915  171    1]
 [   0    0    0  200 1179  641]
 [   0    0    0    0  384 1618]]
  _warn_prf(average, modifier, msg_start, len(result))
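
The confusion matrix can be read row by row: each row is a true class (in the order -1, 1, 2, 3, 4, 5) and each column a predicted class. As a check on the report, the sketch below recomputes per-class recall and overall accuracy directly from the matrix printed above. The variable names are illustrative.

```python
import numpy as np

labels = [-1, 1, 2, 3, 4, 5]

# Confusion matrix copied from the output above (rows: true, cols: predicted).
cm = np.array([
    [0,   25,  10,   76,   69,   56],
    [0,  398, 104,    0,    4,    0],
    [0,  141, 404,    0,    3,    0],
    [0,    0,   0, 1915,  171,    1],
    [0,    0,   0,  200, 1179,  641],
    [0,    0,   0,    0,  384, 1618],
])

# Recall per class: correct predictions divided by the true-class row total.
recall = cm.diagonal() / cm.sum(axis=1)

# Accuracy: all correct predictions divided by all samples.
accuracy = cm.diagonal().sum() / cm.sum()

for label, r in zip(labels, recall):
    print(f"{label:>2}: recall {r:.2f}")
print(f"accuracy: {accuracy:.4f}")
```

The first column is all zeros, meaning the model never predicts -1. That is why precision for that class is undefined and scikit-learn emits the `_warn_prf` warning.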
