<div class="alert alert-block alert-success">
    <h1 align="center">Machine Learning in Python</h1>
    <h3 align="center">Churn Modelling</h3>
    <h4 align="center"><a href="http://www.iran-machinelearning.ir">Soheil Tehranipour</a></h4>
</div>

Customer churn prediction aims to identify which customers are likely to leave a business and to understand why. In this tutorial we look at customer churn in a telecom business. We will build models to predict churn and use precision, recall, and F1-score to measure their performance.

### We will go through the project in these steps:
<h3 style='color:blue'>Handle imbalanced data in churn prediction.</h3>


1. Import Libraries
2. Load Data
3. EDA
4. Visualization
5. Preprocessing (Encoding, Scaling, Imputation)
6. Training the model
7. Evaluation

# 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

# 2. Load the data into a DataFrame

In [None]:
df = pd.read_csv("Telecom_customer_churn.csv")
df.sample(5)

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df['Churn'].value_counts()

# 3. Do Some EDA

In [None]:
df.isnull().sum().sum()

In [None]:
pd.to_numeric(df.TotalCharges,errors='coerce').isnull().sum()

**A quick glance at the dtypes above shows that TotalCharges is stored as an object column although it should be a float. Coercing it to numeric reveals entries that cannot be parsed: they hold a single space instead of a number.**
**Let's convert the column to numbers.**
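
To see why the coercion check flags those rows, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset). With `errors="coerce"`, anything that cannot be parsed as a number becomes `NaN`, so counting nulls afterwards counts the bad entries.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numeric strings plus one blank entry
charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries (here the single space) are turned into NaN
converted = pd.to_numeric(charges, errors="coerce")
n_bad = converted.isnull().sum()

print(converted.dtype, n_bad)
```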

**Remove rows with space in TotalCharges**

In [None]:
df1 = df[df["TotalCharges"]!=" "].copy()  # copy so later column assignments modify df1 itself, not a view of df

In [None]:
df.shape

In [None]:
df1.shape

In [None]:
pd.to_numeric(df1.TotalCharges,errors='coerce').isnull().sum()

In [None]:
df1.TotalCharges = pd.to_numeric(df1.TotalCharges)

In [None]:
type(df1)

In [None]:
df1.shape

In [None]:
df1.dtypes

# 4. Let's do some Visualization

In [None]:
tenure_churn_no = df1[df1.Churn=='No'].tenure
tenure_churn_yes = df1[df1.Churn=='Yes'].tenure

plt.xlabel("tenure")
plt.ylabel("Number Of Customers")
plt.title("Customer Churn Prediction Visualization")

plt.hist([tenure_churn_yes, tenure_churn_no], rwidth=0.95, color=['green','red'],label=['Churn=Yes','Churn=No'])
plt.legend()


In [None]:
mc_churn_no = df1[df1.Churn=='No'].MonthlyCharges      
mc_churn_yes = df1[df1.Churn=='Yes'].MonthlyCharges      

plt.xlabel("Monthly Charges")
plt.ylabel("Number Of Customers")
plt.title("Customer Churn Prediction Visualization")

plt.hist([mc_churn_yes, mc_churn_no], rwidth=0.95, color=['green','red'],label=['Churn=Yes','Churn=No'])
plt.legend()

# 5. Data Preprocessing

**Many of the columns hold values such as Yes and No. Let's print the unique values of the object columns to see what they contain.**

In [None]:
def print_unique_col_values(df):
    # Print the distinct values of every object (string) column
    for column in df:
        if df[column].dtypes=='object':
            print(f'{column}: {df[column].unique()}')

In [None]:
print_unique_col_values(df1)

**Some columns contain "No internet service" or "No phone service"; both can be replaced with a simple "No".**

In [None]:
df1.replace("No internet service","No",inplace=True)
df1.replace("No phone service","No",inplace=True)

In [None]:
df1

**Convert Yes and No to 1 or 0**

In [None]:
yes_no_columns = ['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
                  'DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']
for col in yes_no_columns:
    # Assign the result back: an in-place replace on df1[col] may act on a temporary copy
    df1[col] = df1[col].replace({'Yes': 1,'No': 0})

In [None]:
for col in df1:
    print(f'{col}: {df1[col].unique()}') 

**One hot encoding for categorical columns**

In [None]:
df1['gender'] = df1['gender'].replace({'Male': 1,'Female': 0})

In [None]:
for col in df1:
    print(f'{col}: {df1[col].unique()}') 

In [None]:
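# Use columns= (the 2nd positional argument is prefix); drop the customerID identifier, if present, before encoding
df1 = pd.get_dummies(df1.drop(columns='customerID', errors='ignore'), columns=['Contract','PaymentMethod','InternetService'])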

In [None]:
df1
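
As a small illustration of the one-hot encoding step, the sketch below encodes a single categorical column of a made-up frame. The toy values are assumptions for demonstration only. Each distinct category becomes its own indicator column, named `<column>_<value>`.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Only the columns listed in `columns=` are encoded; the rest pass through unchanged
encoded = pd.get_dummies(toy, columns=["Contract"])

print(encoded.columns.tolist())
# ['tenure', 'Contract_Month-to-month', 'Contract_One year']
```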

**Scaling**

In [None]:
df1.columns

In [None]:
cols_to_scale = ['tenure','MonthlyCharges','TotalCharges']

scaler = MinMaxScaler()
df1[cols_to_scale] = scaler.fit_transform(df1[cols_to_scale])
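
As a rough illustration of what `MinMaxScaler` does to each column, the sketch below compares it with the formula `(x - min) / (max - min)` computed by hand. The three `tenure` values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy tenure column (one feature, three customers)
tenure = np.array([[1.0], [12.0], [72.0]])

# Manual min-max scaling: smallest value maps to 0, largest to 1
manual = (tenure - tenure.min()) / (tenure.max() - tenure.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(tenure)

print(np.allclose(manual, scaled))
```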


In [None]:
for col in df1:
    print(f'{col}: {df1[col].unique()}')

## Train test split

In [None]:
X = df1.drop('Churn',axis='columns')
y = df1.Churn.astype(np.float32)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)

In [None]:
y_train.value_counts()

In [None]:
y.value_counts()

In [None]:
y_test.value_counts()

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
X_train[:10]

In [None]:
len(X_train.columns)

# 6 & 7. Train and evaluate the model
**Use logistic regression classifier**

In [None]:
def log_reg(X_train, y_train, X_test, y_test, weights):
    if weights==-1:
        model = LogisticRegression()
    else:
        model = LogisticRegression(class_weight={0:weights[0], 1:weights[1]})

    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print("Accuracy", acc, "\n")

    y_pred = model.predict(X_test)
    print("preds", y_pred[:5], "\n")

    cl_rep = classification_report(y_test,y_pred)
    print(cl_rep)

* A class weight tells the model how much importance to give to a particular class: a larger weight on class 1 makes misclassified churners cost more during training.
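
The weights in this notebook are picked by hand. As a point of comparison, and not something the original cells do, scikit-learn can derive "balanced" weights from the class frequencies as `n_samples / (n_classes * class_count)`. The sketch below shows this on a made-up label vector and passes the result to `LogisticRegression` in the same dictionary form that `log_reg` uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 customers stayed (0), 2 churned (1)
y_toy = np.array([0] * 8 + [1] * 2)

# "balanced" weights: n_samples / (n_classes * count of each class)
balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
print(balanced)  # [0.625 2.5  ]

# Same dictionary form as in log_reg: {class_label: weight}
model = LogisticRegression(class_weight={0: balanced[0], 1: balanced[1]})
```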

In [None]:
weights = -1 # pass -1 to use Logistic Regression without weights
log_reg(X_train, y_train, X_test, y_test, weights)

In [None]:
weights = [1, 1.5] # weight for class 0, weight for class 1
log_reg(X_train, y_train, X_test, y_test, weights)

* Without class weights we get 0.66 precision and 0.54 recall for the churn class (1).
* With class weights we get 0.50 precision and 0.79 recall.
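
Since F1 is the harmonic mean of precision and recall, the two pairs of figures above can be turned into F1-scores directly. The helper name `f1_from` is hypothetical; the inputs are the rounded values reported above, so the results are approximate.

```python
def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_plain = f1_from(0.66, 0.54)     # no class weights
f1_weighted = f1_from(0.50, 0.79)  # class weights [1, 1.5]

print(f"{f1_plain:.2f}")     # 0.59
print(f"{f1_weighted:.2f}")  # 0.61
```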

# Imbalanced dataset

### Method 1: Undersampling

In [None]:
# Class count
count_class_0, count_class_1 = df1.Churn.value_counts()

# Divide by class
df_class_0 = df1[df1['Churn'] == 0]
df_class_1 = df1[df1['Churn'] == 1]

In [None]:
count_class_0

In [None]:
count_class_1

In [None]:
# Undersample the 0-class to the size of the 1-class and concat the DataFrames of both classes
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print('Random under-sampling:')
print(df_test_under.Churn.value_counts())
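
The same idea on a toy frame makes the effect easy to check by eye. The data and the `random_state` below are assumptions added for reproducibility; the cell above samples without a fixed seed.

```python
import pandas as pd

# Toy frame: 7 rows of class 0 (majority), 3 rows of class 1 (minority)
toy = pd.DataFrame({"x": range(10), "Churn": [0] * 7 + [1] * 3})

majority = toy[toy["Churn"] == 0]
minority = toy[toy["Churn"] == 1]

# Keep only as many majority rows as there are minority rows
majority_under = majority.sample(len(minority), random_state=0)
toy_under = pd.concat([majority_under, minority], axis=0)

counts = toy_under["Churn"].value_counts()
print(counts)
```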

In [None]:
X = df_test_under.drop('Churn',axis='columns')
y = df_test_under['Churn']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)

In [None]:
# Number of samples per class in the training data
y_train.value_counts()

#### Applying Logistic Regression

In [None]:
weights = -1 # pass -1 to use Logistic Regression without weights
log_reg(X_train, y_train, X_test, y_test, weights)

<h4 style='color:blue'>With undersampling: F1-score for minority class 1 improved from 0.59 to 0.75</h4>

### Method 2: Oversampling

In [None]:
# Oversample 1-class and concat the DataFrames of both classes
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_test_over.Churn.value_counts())

In [None]:
X = df_test_over.drop('Churn',axis='columns')
y = df_test_over['Churn']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15, stratify=y)

In [None]:
# Number of samples per class in the training data
y_train.value_counts()

#### Logistic Regression

In [None]:
weights = -1 # pass -1 to use Logistic Regression without weights
log_reg(X_train, y_train, X_test, y_test, weights)

<h4 style='color:blue'>With oversampling: F1-score for minority class 1 improved from 0.59 to 0.76</h4>
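
One thing to keep in mind when reading this score: oversampling before the split means copies of the same customer can end up in both the training and the test set, which can make the test score look better than it would on unseen customers. A variant that splits first and oversamples only the training part might look like the sketch below. It runs on made-up data, and names such as `train_bal` are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 rows of class 0, 20 rows of class 1
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=100), "Churn": [0] * 80 + [1] * 20})

# 1) Split first, so the test set keeps the original class ratio
train, test = train_test_split(toy, test_size=0.2, random_state=15, stratify=toy["Churn"])

# 2) Oversample the minority class inside the training set only
train_0 = train[train["Churn"] == 0]
train_1 = train[train["Churn"] == 1]
train_1_over = train_1.sample(len(train_0), replace=True, random_state=15)
train_bal = pd.concat([train_0, train_1_over], axis=0)

# No original row appears in both the balanced training set and the test set
overlap = set(train_bal.index) & set(test.index)
print(train_bal["Churn"].value_counts())
print(len(overlap))
```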

### Method 3: SMOTE

To install the imbalanced-learn library, run "pip install imbalanced-learn".

In [None]:
X = df1.drop('Churn',axis='columns')
y = df1['Churn']

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)

y_sm.value_counts()
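
Unlike random oversampling, SMOTE does not duplicate rows. It creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step with NumPy on made-up points. It is a simplification (a single nearest neighbour, one synthetic point) and not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

# Pick a sample and find its nearest minority neighbour
i = 0
dists = np.linalg.norm(minority - minority[i], axis=1)
dists[i] = np.inf  # exclude the point itself
j = int(np.argmin(dists))

# The synthetic point lies on the segment between the sample and its neighbour
lam = rng.random()  # random factor in [0, 1)
synthetic = minority[i] + lam * (minority[j] - minority[i])

print(j, synthetic)
```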

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=15, stratify=y_sm)

In [None]:
# Number of samples per class in the training data
y_train.value_counts()

## Final results: Logistic Regression

In [None]:
weights = -1 # pass -1 to use Logistic Regression without weights
log_reg(X_train, y_train, X_test, y_test, weights)

<h4 style='color:blue'>With SMOTE oversampling: F1-score for minority class 1 improved from 0.59 to 0.81</h4>

<img src="https://webna.ir/wp-content/uploads/2018/08/%D9%85%DA%A9%D8%AA%D8%A8-%D8%AE%D9%88%D9%86%D9%87.png" width=50% />