
# churn prediction with classification

We will do a churn prediction project, predicting if they will swith away from our company

This way we can give them promotional emails and stay

We will use Binary Classification

The ML strategy applied to approach this problem is binary classification, which for one instance ($i^{th}$ customer), can be expressed as:

$$\large g\left(x_{i}\right) = y_{i}$$

In the formula, $y_i$ is the model's prediction and belongs to {0,1}, with 0 being the negative value or no churning, and 1 the positive value or churning. The output corresponds to the likelihood of churning.

In brief, the main idea behind this project is to build a model with historical data from customers and assign a score of the likelihood of churning.

y is a score from 0 to 1, where 0 means no churn and 1 means churn.

basically we use some info to do this

## 3.1 data prep

In [None]:
# lets import the some modules and import the data
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt

In [None]:
#now we get data and view it

df = pd.read_csv('data/churn.csv')

df.head().T # transposes to db to make rows become columns

Now lets do our standard data preparation

In [None]:
# makes all columns lower case snake case, same with all values of type object

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorial_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorial_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
    
df.head().T

In [None]:
# now we need to convert the types, if you run df.dTypes it shows some wacky types
# the columns are total charges, and churn. 

df.totalcharges = pd.to_numeric(df.totalcharges, errors = 'coerce')
# coerce ignores the erros
df.totalcharges = df.totalcharges.fillna(0)
# get rid of the nulls


# churn requres it be binary instead of yes or no
df.churn = (df.churn == 'yes').astype(int)

## 3.2 Set up validation framework

- We did this before using numpy and pandas to do the 60 20 20 division
- this time we will use scikit learn

In [None]:
from sklearn.model_selection import train_test_split


# we first split with full_train and test, then we split the "full_train" into train and validation
df_full_train, df_test = train_test_split(df, test_size= 0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size= 0.25, random_state=1) # 0.25 because the size scales


In [None]:
# check if the size is right
len(df_train), len(df_val), len(df_test)

In [None]:
#reset index 
df_train.reset_index(drop=True)
df_val.reset_index(drop=True)
df_test.reset_index(drop=True)

In [None]:
# get the y values

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_train.churn.values

#delete churn value form all df

del df_train['churn']
del df_val['churn']
del df_test['churn']

## 3.3 EDA

- Look for missing values
- Look at target variable
- Look at numerical and categorical variables

In [None]:
# lets see the churn rate right now
# normalize turns into percentage
df_full_train.churn.value_counts(normalize=True)

In [None]:
# another way to get the churn rate is computing the meawn

global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)

In [None]:
# now lets get sone numerical variables

#if you run dtypees there are 3  numerical values that we care about:
# totalCharges, tenure, and montlycharges

numerical = ['tenure', 'monthlycharges', 'totalcharges']

df_full_train.columns
categorial = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport','streamingtv', 'streamingmovies', 'contract', 'paperlessbilling','paymentmethod']

df_full_train[categorial].nunique()
# now we can get the number of unique values and turn them into features

## 3.4 Feature importnace
this is a part of EDA which is where we identify if features matter or not
4 sub modules: 
- churn rate + risk ratio (for categorical) "does it matter?"
- mutual information (for categorical) "how much does it matter?"
- correlation (for numerical) "how much does it matter?"

In [None]:
# churn rate
# we need to see which one of thses actually affect our result

churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()

print(f"churn_female: {churn_female}, churn_male: {churn_male}")


In [None]:
churn_partner = df_full_train[df_full_train["partner"] == "yes"].churn.mean()
churn_no_partner = df_full_train[df_full_train["partner"] == "no"].churn.mean()
print(f"churn_partner: {churn_partner}, churn_no_partner: {churn_no_partner}")

In [None]:
# now we get the difference to see if that actually affects the churn rate ornot

global_churn_rate - churn_no_partner

There are 2 things that matter
1. difference, < 0 means more likely to churn, and > 0 means less likely to churn
2. risk ratio, if it is > 1 it means more likely, <1 means less likely


this gives us intuition on how much something affect it

In [None]:
# now we get the risk ratio
churn_no_partner / global_churn_rate, churn_partner/global_churn_rate



In [None]:
# now we can do a select statmenet that calculates churn based on gender
# then we add the 2 columns, diff and risk ratio

df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_churn_rate
df_group['risk'] = df_group['mean'] / global_churn_rate
df_group

In [None]:
# lets make it in a loop so we can see if which ones have more corrilation compared to others

for c in categorial:
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn_rate
    df_group['risk'] = df_group['mean'] / global_churn_rate
    # display(df_group) 

### lets do mutual information now, to measure the importance of different variables

tells us how much we can learn about one variable if we know the value of another

In [None]:
# no need to implement
from sklearn.metrics import mutual_info_score

In [None]:
mutual_info_score(df_full_train.contract, df_full_train.churn)

In [None]:
mutual_info_score(df_full_train.gender, df_full_train.churn)

from this we can see that since the number for contract is bigger, it means that it affects us a lot mroe than gender

In [None]:
# lets apply this to all
def mutual_info_churn_score(series):
    return mutual_info_score(series, df_full_train.churn)

mi = df_full_train[categorial].apply(mutual_info_churn_score)
mi.sort_values(ascending = False)

#we can see contract is the most important and geneder is the leaset

Now lets do correlation, this is the importance of numerical variables

there is a formula for this, which gets you the correlation coefficient

- when correlation is positive, then it drectly correlates
- if it is negative then it has an inverse relation
- 0.0-0.2 is low, 0.2-0.5 is moderate, 0.6-1.0 is strong 

In [None]:
df_full_train[numerical].corrwith(df_full_train.churn)

# this gets us the correlation coefficient for numerical features. Kinda cool right?

we can see there is an inverse relationship between tenure and churn, and the corrilation is quite noticable despite the number only being 0.35

## 3.5 One-hot encoding

we are going to use sklearn to encode categorical features. this is what we used in our previous lesson

In [None]:
from sklearn.feature_extraction import DictVectorizer

#this is smart and if we have non categorical variables, it does not transform

In [None]:
#seperate this into rows
dicts = df_train[['gender', 'contract']].iloc[:100].to_dict(orient = 'records')
dicts

In [None]:
dv = DictVectorizer(sparse=False)
dv.fit(dicts)

dv.transform(dicts) #now we get the dict


In [None]:
dv.get_feature_names_out() # this actually gets us the column names

now lets actually do it to everything instead of just 2 columns

In [71]:
train_dicts = df_train[categorial + numerical].to_dict(orient = 'records')
dv = DictVectorizer(sparse=False)

X_train = dv.fit_transform(train_dicts)


In [72]:
# lets do the same for validation and test data set

val_dicts = df_val[categorial + numerical].to_dict(orient = 'records')
X_val = dv.transform(val_dicts) 

# 3.6 Logistic regression


In general, supervised models can be represented with this formula:  

$$
g(x_i) = y_i
$$

Depending on what is the type of target variable, the supervised task can be regression or classification (binary or multiclass). Binary classification tasks can have negative (0) or positive (1) target values. The output of these models is the probability of \( x_i \) belonging to the positive class.  

Logistic regression is similar to linear regression because both models take into account the bias term and weighted sum of features. The difference between these models is that the output of linear regression is a real number, while logistic regression outputs a value between zero and one, applying the sigmoid function to the linear regression formula.  

$$
g(x_i) = Sigmoid(w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n)
$$

$$
Sigmoid(z) = \frac{1}{1 + e^{-z}}
$$

In this way, the sigmoid function allows transforming a score into a probability.  

**in simple terms**:
you just put the sigmoid function ouside of regression, so the result falls in between 0 and 1


In [73]:
# lets train our model with sklearn

from sklearn.linear_model import LogisticRegression

In [75]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,
,solver,'lbfgs'
,max_iter,100


below are some ways where we can get the coefficient or the intercept

In [78]:
model.intercept_[0]

np.float64(-0.10910575411731115)

In [77]:
model.coef_[0].round(3)

array([ 0.476, -0.175, -0.408, -0.03 , -0.078,  0.063, -0.089, -0.081,
       -0.034, -0.073, -0.336,  0.317, -0.089,  0.004, -0.258,  0.142,
        0.009,  0.063, -0.089, -0.081,  0.266, -0.089, -0.284, -0.231,
        0.123, -0.166,  0.059, -0.087, -0.032,  0.07 , -0.059,  0.142,
       -0.25 ,  0.216, -0.121, -0.089,  0.102, -0.071, -0.089,  0.052,
        0.213, -0.089, -0.232, -0.07 ,  0.   ])

In [82]:
# now we use the model
# if it is 0 and 1 then they are hard predictions, to avoid this we use predict_proba
# that gets us the soft predictions

# it has 2 columns, one is probability of being 0, and the other is probability of 1
# so right side is probability of churning

model.predict_proba(X_train)

#w e only care about the chance of churning so we get the right column
# lets do it to validation

y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
churn_decision.astype(int)

# lets get the users that will churn
df_val[churn_decision].customerid

2504    8433-wxgna
4597    3440-jpscl
2343    2637-fkfsy
5591    7228-omtpn
4482    6711-fldfb
           ...    
2611    5976-jcjrh
4211    2034-cgrhz
3999    5276-kqwhg
6240    6521-yytyi
5282    3049-solay
Name: customerid, Length: 312, dtype: object

In [85]:
# lets see how accurate our model is
(y_val == churn_decision).mean()
#we can see we got 80% of them right

np.float64(0.8026969481902059)

# 3.7 Model Interpolation