# What is Churn Rate?
The churn rate is the percentage of subscribers to a service who discontinue their subscriptions to the service within a given time period. For a company to expand its clientele, its growth rate, as measured by the number of new customers, must exceed its churn rate.
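
As a minimal sketch of the definition above, churn rate over a period can be computed as cancellations divided by the subscriber count at the start of the period. The function name `churn_rate` and the numbers below are hypothetical and only serve to illustrate the arithmetic.

```python
def churn_rate(subscribers_at_start: int, cancellations: int) -> float:
    """Return churn over a period as a percentage of the starting subscriber base."""
    if subscribers_at_start <= 0:
        raise ValueError("subscribers_at_start must be positive")
    return 100.0 * cancellations / subscribers_at_start


# Hypothetical month: 2,000 subscribers at the start, 150 cancellations.
monthly_churn = churn_rate(2000, 150)
print(f"Monthly churn rate: {monthly_churn:.1f}%")
```

If the same hypothetical company gained fewer than 150 new subscribers that month, its subscriber base would shrink.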
### INTRODUCTION
Subscription products are often the main source of revenue for companies across all industries. These products can come in the form of a 'one size fits all' all-encompassing subscription, or in multi-level memberships. Regardless of how they structure their memberships, or what industry they are in, companies almost always try to minimize customer churn (a.k.a. subscription cancellations). To retain their customers, these companies first need to identify the behavioral patterns that act as catalysts for disengagement from the product.

<b> Market:</b> The target audience is the entirety of a company's subscription base. They are the ones companies want to keep. <br>
<b>Product:</b> The subscription products that customers are already enrolled in can provide value that users may not have imagined, or that they may have forgotten.<br>
<b> Goal:</b> The objective of this model is to predict which users are likely to churn, so that the company can focus on re-engaging these users with the product. These efforts can include email reminders about the benefits of the product, focusing especially on features that are new or that the user has been shown to value.
    
<b> BUSINESS CHALLENGE </b>
In this Case Study we will be working for a fin-tech company that provides a subscription product to its users, which allows them to manage their bank accounts (savings accounts, credit cards, etc), provides them with personalized coupons, informs them of the latest low-APR loans available in the market, and educates them on the best available methods to save money (like videos on saving money on taxes, free courses on financial health, etc).
We are in charge of identifying users who are likely to cancel their subscription so that we can start building new features that they may be interested in. These features can increase the engagement and interest of our users towards the product.<br>

<b> DATA </b>
By subscribing to the membership, our customers have provided us with data on their finances, as well as how they handle those finances through the product. We also have some demographic information we acquired from them during the sign-up process. Financial data can often be unreliable and delayed. As a result, companies can sometimes build their marketing models using only demographic data, and data related to finances handled through the product itself. Therefore, we will be restricting ourselves to only using that type of data. Furthermore, product-related data is more indicative of what new features we should be creating as a company.<br>
<b> DATASET </b>
Mock-up dataset based on trends found in real world case studies; 27 000 instances and 30 features (40 after creating dummy variables of categorical ones)<br>
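
The dummy-variable step mentioned above is not shown in this notebook, so the following is only a sketch of how such columns could be produced with `pd.get_dummies`. The three rows and the exact category spellings are hypothetical; the column names `housing` and `payment_type` come from the column description in this notebook.

```python
import pandas as pd

# Hypothetical rows that mimic two of the categorical columns in this dataset.
raw = pd.DataFrame({
    "age": [28, 35, 26],
    "housing": ["R", "O", "NA"],
    "payment_type": ["Bi-Weekly", "Monthly", "Weekly"],
})

# Each categorical column is expanded into one 0/1 column per label.
encoded = pd.get_dummies(raw, columns=["housing", "payment_type"])
print(encoded.columns.tolist())
```

Each label becomes its own column (for example `housing_O` and `housing_R`), which is why the feature count grows after encoding.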

<b> Goal </b>
Predict which users are likely to churn, so that the company can focus on re-engaging these users with the product.

<strong> Description of Each Column </strong> <br>
<b> churn ------>  Target column. Whether the user churned or not </b>               
 1 user id ------> User identifier column. Each user is identified by a numeric code
 2   age   ------> age of user                 
 3   housing  ------> Categorical var with 3 labels: NA: data not available, O: owner of the house, R: renting the house
 4   credit_score ------> Credit score of user            
 5   deposits ------> No. of times money deposited         
 6   withdrawal  ------> No. of times money withdrawn
 7   purchases_partners ------> No. of purchases made at partner stores
 8   purchases ------> No. of purchases made outside partner stores
 9   cc_taken ------> credit card taken         
 10  cc_recommended ------>  credit card recommended     
 11  cc_disliked ------>  credit card disliked         
 12  cc_liked  ------>  credit card liked            
 13  cc_application_begin ------> credit card application started  
 14  app_downloaded  ------>  App was downloaded in mobile        
 15  web_user ------>  The browser was used              
 16  app_web_user ------> Both App and Web browser used             
 17  ios_user ------> Apple phone user             
 18  android_user ------> Android phone user            
 19  registered_phones ------> Registered phone numbers. If only one is registered, the column is populated with 0, as one phone number is required to be registered
 20  payment_type ------> Frequency user gets paid Bi-weekly/Semi/Weekly/Monthly/NA 
 21  waiting_4_loan ------> Loan is awaited       
 22  cancelled_loan  ------> Loan is cancelled          
 23  received_loan  ------>  Loan received         
 24  rejected_loan  ------> Loan rejected         
 25  zodiac_sign   ------> Zodiac sign of user           
 26  left_for_two_month_plus ------> 1 if the user left for 2 months or more and then returned, else 0
 27  left_for_one_month ------> whether user left for 1 month and came back next month  
 28  rewards_earned ------> Points that are earned       
 29  reward_rate  ------>  rate of being rewarded        
 30  is_referred ------> Came through referral. Used some referral code.

# Importing the libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the dataset that was created during Feature Engineering

In [3]:
df= pd.read_csv('new_ChurnData.csv')

### Checking if Data populated Correctly

In [4]:
df.head()

Unnamed: 0,user,churn,age,deposits,purchases_partners,purchases,cc_recommended,cc_application_begin,reward_rate,housing_O,...,zodiac_sign_Cancer,zodiac_sign_Capricorn,zodiac_sign_Gemini,zodiac_sign_Leo,zodiac_sign_Libra,zodiac_sign_Pisces,zodiac_sign_Sagittarius,zodiac_sign_Scorpio,zodiac_sign_Taurus,zodiac_sign_Virgo
0,23547,0,28.0,0,1,0,96,5,1.47,0,...,0,0,0,1,0,0,0,0,0,0
1,58313,0,35.0,47,86,47,285,9,2.17,0,...,0,1,0,0,0,0,0,0,0,0
2,8095,0,26.0,26,38,25,74,26,1.1,0,...,0,1,0,0,0,0,0,0,0,0
3,3120,1,32.0,5,111,5,227,17,1.83,0,...,0,0,0,0,0,0,0,0,1,0
4,41406,0,21.0,0,4,0,0,0,0.07,0,...,1,0,0,0,0,0,0,0,0,0


In [5]:
df.tail()

Unnamed: 0,user,churn,age,deposits,purchases_partners,purchases,cc_recommended,cc_application_begin,reward_rate,housing_O,...,zodiac_sign_Cancer,zodiac_sign_Capricorn,zodiac_sign_Gemini,zodiac_sign_Leo,zodiac_sign_Libra,zodiac_sign_Pisces,zodiac_sign_Sagittarius,zodiac_sign_Scorpio,zodiac_sign_Taurus,zodiac_sign_Virgo
17201,41813,0,29.0,1,5,1,5,0,0.03,0,...,0,0,0,0,0,0,0,1,0,0
17202,49903,1,28.0,0,26,0,31,0,0.6,0,...,0,0,0,0,0,0,0,0,0,1
17203,24291,1,24.0,0,0,0,81,2,1.07,0,...,0,0,0,1,0,0,0,0,0,0
17204,47663,1,46.0,2,16,2,58,2,0.9,0,...,0,0,0,0,0,0,0,0,0,0
17205,52752,1,34.0,0,4,0,11,0,0.13,0,...,1,0,0,0,0,0,0,0,0,0
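
The `Unnamed: 0` column in the outputs above looks like a row index that was written to the CSV when the file was saved; that is an assumption, since the feature engineering notebook is not shown here. Because only `user` is dropped later on, this column would otherwise stay in the feature set. A small sketch of how `index_col=0` avoids loading it as a regular column, using an in-memory stand-in for the real file:

```python
import io

import pandas as pd

# Tiny stand-in for 'new_ChurnData.csv'; the real file is not reproduced here.
csv_text = ",user,churn,age\n0,23547,0,28.0\n1,58313,0,35.0\n"

# Default read: the unnamed first column is loaded as a regular column.
with_index_column = pd.read_csv(io.StringIO(csv_text))
print(with_index_column.columns.tolist())

# Treat the first column as the index instead of a feature.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(without_index_column.columns.tolist())
```

Whether to drop the column in this case study is a judgment call; the results reported in this notebook were produced with it included.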


# Data Pre-Processing

In [6]:
x = df.loc[:, df.columns != "churn"]
y = df['churn']

In [7]:
x.head()

Unnamed: 0,user,age,deposits,purchases_partners,purchases,cc_recommended,cc_application_begin,reward_rate,housing_O,housing_R,...,zodiac_sign_Cancer,zodiac_sign_Capricorn,zodiac_sign_Gemini,zodiac_sign_Leo,zodiac_sign_Libra,zodiac_sign_Pisces,zodiac_sign_Sagittarius,zodiac_sign_Scorpio,zodiac_sign_Taurus,zodiac_sign_Virgo
0,23547,28.0,0,1,0,96,5,1.47,0,1,...,0,0,0,1,0,0,0,0,0,0
1,58313,35.0,47,86,47,285,9,2.17,0,1,...,0,1,0,0,0,0,0,0,0,0
2,8095,26.0,26,38,25,74,26,1.1,0,1,...,0,1,0,0,0,0,0,0,0,0
3,3120,32.0,5,111,5,227,17,1.83,0,1,...,0,0,0,0,0,0,0,0,1,0
4,41406,21.0,0,4,0,0,0,0.07,0,0,...,1,0,0,0,0,0,0,0,0,0


In [8]:
y.head()

0    0
1    0
2    0
3    1
4    0
Name: churn, dtype: int64

# Split the Dataset

In [9]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

In [10]:
# Remove the user ID column, as it is not a real feature.
# We keep it aside so that each prediction can later be matched to the user it belongs to.

train_identifier = x_train['user']
x_train = x_train.drop(columns=['user'])

test_identifier = x_test['user']
x_test = x_test.drop(columns=['user'])

# Feature Scaling

In [11]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [12]:
# Scale x_train and store the result in a new variable.
# StandardScaler returns a NumPy array, which loses the index and column names,
# so we convert the result back to a DataFrame.

x_train_scaled= pd.DataFrame(sc.fit_transform(x_train))
x_test_scaled = pd.DataFrame(sc.transform(x_test))

In [13]:
# give x_train_scaled the column names of the original x_train set
x_train_scaled.columns = x_train.columns.values

# give x_test_scaled the column names of the original x_test set
x_test_scaled.columns = x_test.columns.values

# restore the original indexes as well
x_train_scaled.index = x_train.index.values
x_test_scaled.index = x_test.index.values

In [14]:
# Replace the original training and test sets with their scaled versions
x_train= x_train_scaled
x_test= x_test_scaled
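
The scaling steps above can also be written more compactly by passing `columns` and `index` directly to the `DataFrame` constructor. This is a sketch on hypothetical miniature frames (the `_demo` names are not part of the notebook), not a replacement for the cells above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature train/test frames with a non-default index.
x_train_demo = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "deposits": [0.0, 10.0, 20.0]}, index=[7, 3, 9]
)
x_test_demo = pd.DataFrame({"age": [30.0], "deposits": [5.0]}, index=[4])

scaler = StandardScaler()

# Fit on the training data only, then reuse the same statistics for the test data.
x_train_demo_scaled = pd.DataFrame(
    scaler.fit_transform(x_train_demo),
    columns=x_train_demo.columns,
    index=x_train_demo.index,
)
x_test_demo_scaled = pd.DataFrame(
    scaler.transform(x_test_demo),
    columns=x_test_demo.columns,
    index=x_test_demo.index,
)
print(x_train_demo_scaled)
print(x_test_demo_scaled)
```

Fitting the scaler on the training set and only transforming the test set, as the notebook does, keeps test-set statistics out of the training process.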

# Applying machine learning algorithms


### Creating a function that will give the Following output when any model is run

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred = clf.predict(x_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = clf.predict(x_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

## LOGISTIC REGRESSION

In [16]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

print_score(classifier, x_train, y_train, x_test, y_test, train=True)
print_score(classifier, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 65.74%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.694813     0.594087   0.65744      0.644450      0.653276
recall        0.743694     0.534531   0.65744      0.639113      0.657440
f1-score      0.718423     0.562738   0.65744      0.640581      0.654222
support    8088.000000  5676.000000   0.65744  13764.000000  13764.000000
_______________________________________________
Confusion Matrix: 
 [[6015 2073]
 [2642 3034]]

Test Result:
Accuracy Score: 65.98%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.697946     0.596923  0.659791     0.647434      0.656181
recall        0.740466     0.545327  0.659791     0.642896      0.659791
f1-score      0.718577     0.569960  0.659791     0.644268      0.657135
support    2019.000000  142
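
To make the report easier to read, the following sketch recomputes accuracy, precision, recall, and F1 for the churn class (label 1) directly from the training confusion matrix printed above. The variable names are illustrative only.

```python
import numpy as np

# Training-set confusion matrix from the logistic regression output above.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = np.array([[6015, 2073],
               [2642, 3034]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_churn = tp / (tp + fp)   # of users predicted to churn, how many did
recall_churn = tp / (tp + fn)      # of users who churned, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"accuracy={accuracy:.5f} precision={precision_churn:.6f} "
      f"recall={recall_churn:.6f} f1={f1_churn:.6f}")
```

The recall for class 1 is the figure most directly tied to the business goal, since it measures the share of churning users the model actually flags.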

## DECISION TREE

In [17]:
# DECISION TREE

# Fitting Model to the Training Set
# Importing the Classifier

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(x_train, y_train)

print_score(tree_clf, x_train, y_train, x_test, y_test, train=True)
print_score(tree_clf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                0       1  accuracy  macro avg  weighted avg
precision     1.0     1.0       1.0        1.0           1.0
recall        1.0     1.0       1.0        1.0           1.0
f1-score      1.0     1.0       1.0        1.0           1.0
support    8088.0  5676.0       1.0    13764.0       13764.0
_______________________________________________
Confusion Matrix: 
 [[8088    0]
 [   0 5676]]

Test Result:
Accuracy Score: 64.09%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.690882     0.567218  0.640906     0.629050      0.639757
recall        0.701833     0.554462  0.640906     0.628147      0.640906
f1-score      0.696314     0.560768  0.640906     0.628541      0.640276
support    2019.000000  1423.000000  0.640906  3442.000000   3442.000000
__________________
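
The gap between 100% training accuracy and roughly 64% test accuracy suggests the unrestricted tree is memorizing the training set. The sketch below illustrates the effect of limiting tree depth on a synthetic dataset from `make_classification`; the data, the `max_depth=4` value, and the variable names are assumptions for illustration, and no claim is made about how this would change the scores on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (not the churn dataset).
X, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_demo, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until every training sample is classified correctly.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# A depth-limited tree cannot memorize the noise in the training labels.
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("max_depth=4", shallow_tree)]:
    print(f"{name}: train={model.score(X_tr, y_tr):.3f} test={model.score(X_te, y_te):.3f}")
```

Parameters such as `max_depth` or `min_samples_leaf` would normally be chosen with cross-validation rather than fixed by hand.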

## RANDOM FOREST

In [18]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=5)
rf_clf.fit(x_train, y_train)

print_score(rf_clf, x_train, y_train, x_test, y_test, train=True)
print_score(rf_clf, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 96.54%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.963807     0.967794  0.965417      0.965801      0.965451
recall        0.977868     0.947674  0.965417      0.962771      0.965417
f1-score      0.970787     0.957629  0.965417      0.964208      0.965361
support    8088.000000  5676.000000  0.965417  13764.000000  13764.000000
_______________________________________________
Confusion Matrix: 
 [[7909  179]
 [ 297 5379]]

Test Result:
Accuracy Score: 66.85%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.702678     0.610502  0.668507     0.656590      0.664570
recall        0.753839     0.547435  0.668507     0.650637      0.668507
f1-score      0.727360     0.577251  0.668507     0.652305      0.665301
support    2019.000000  142
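
Since the business goal is to learn which product behaviors relate to churn, a fitted random forest's `feature_importances_` attribute is one way to rank features. The sketch below uses synthetic data and hypothetical feature names; on the churn data one would pass `x_train.columns` as the index instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not the churn dataset).
X, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances are a rough guide only; they can favor features with many distinct values.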

## SVM

In [21]:
from sklearn.svm import SVC


print("=======================Linear Kernel SVM==========================")
model = SVC(kernel='linear')
model.fit(x_train, y_train)

print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

print("=======================Polynomial Kernel SVM==========================")

model = SVC(kernel='poly', degree=2, gamma='auto')
model.fit(x_train, y_train)

print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

print("=======================Radial Kernel SVM==========================")

model = SVC(kernel='rbf', gamma=1)
model.fit(x_train, y_train)

print_score(model, x_train, y_train, x_test, y_test, train=True)
print_score(model, x_train, y_train, x_test, y_test, train=False)

Train Result:
Accuracy Score: 65.23%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy     macro avg  weighted avg
precision     0.716724     0.572405  0.652281      0.644564      0.657209
recall        0.675074     0.619803  0.652281      0.647438      0.652281
f1-score      0.695276     0.595162  0.652281      0.645219      0.653991
support    8088.000000  5676.000000  0.652281  13764.000000  13764.000000
_______________________________________________
Confusion Matrix: 
 [[5460 2628]
 [2158 3518]]

Test Result:
Accuracy Score: 64.93%
_______________________________________________
CLASSIFICATION REPORT:
                     0            1  accuracy    macro avg  weighted avg
precision     0.714588     0.569677  0.649332     0.642133      0.654679
recall        0.669638     0.620520  0.649332     0.645079      0.649332
f1-score      0.691383     0.594013  0.649332     0.642698      0.651128
support    2019.000000  142
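
The user identifiers set aside before scaling can be joined back to the predictions so that each prediction is tied to a user. The sketch below shows one way to do this with hypothetical stand-ins for `test_identifier`, `y_test`, and a model's predictions; the `_demo` names and the `predicted_churn` column are illustrative.

```python
import pandas as pd

# Hypothetical stand-ins sharing the same index, as produced by train_test_split.
test_identifier_demo = pd.Series([23547, 58313, 8095], index=[10, 11, 12], name="user")
y_test_demo = pd.Series([0, 1, 0], index=[10, 11, 12], name="churn")
y_pred_demo = [0, 1, 1]  # e.g. the output of model.predict(x_test)

# Align user IDs and true labels on their shared index, then attach predictions by position.
final_results = pd.concat([test_identifier_demo, y_test_demo], axis=1)
final_results["predicted_churn"] = y_pred_demo
final_results = final_results.reset_index(drop=True)
print(final_results)
```

Users with `predicted_churn` equal to 1 would then be the candidates for re-engagement efforts.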