Student name: Masinde Victor Kiprono

Student pace: Hybrid

Instructor name: Maryann Mwikali

# Predicting Customer Churn For SyriaTel Company

## Business Understanding
This project deals with a company, SyriaTel, that wants to know more about their customer churn. Customers leave a company due to different reasons and my project aims to uncover the reasons and predict customer churn. The company can then use the information gained from this project to work on retaining their customers.

## Import The Necessary Libraries to Notebook
For this project, I am going to use data that has already been collected and stored on Kaggle. The data is stored in csv format. I will import the necessary libraries that will enable me to read my data from the csv file. I will also import other libraries that will help me in editing and visualizing my data, as well as some libraries from **scikit-learn** that will help me in **modelling**.

# Import libraries necessary for your project
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.metrics import roc_curve, auc
sns.set_style('darkgrid')
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier,plot_tree


I will use the libraries above to read the contents of the csv file into my notebook in preparation for analysis. I will create a variable `df` where I will save my data. After saving the data inside my variable, I will check the structure of the data by calling `df.head()`, which shows a preview of the data.

#Read the data into the notebook using pandas
df=pd.read_csv('Customer churn.csv')
df.head()

## Data understanding
I will use different built-in pandas methods to check the structure of my data. I will inspect the number of rows and columns. I will also check the summary of my data.

This dataset was sourced from Kaggle and has 3333 rows and 21 columns. The data is recorded in different data types, including floats, integers and objects. The columns are clearly named and describe customer activity in the telecommunications sector.

#Check for the data summary
df.info()

#Check for the statistical summary of the data
df.describe()

#Check for the number of columns and rows
df.shape

#print the names of columns in our data
df.columns

## Data cleaning
We check our data to see if it is ready for modelling. The data needs to be free from duplicates, missing values and wrongly recorded data. Unnecessary columns are also dropped at this point.

#Checking for duplicates
df.duplicated().sum()

#Checking for missing values
df.isnull().sum()

`state` and `area code` both play the same role, which is describing the geographical characteristics of the customer. We can drop `state` and keep `area code`. This way we ensure that there won't be repetitive features in our model.

#Drop the state column (columns= already names the axis, so axis=1 is not needed)
df = df.drop(columns=['state'])

Checking the column containing phone numbers, we notice that it is recorded with a '-'. After removing it, we make this column our index, since phone numbers are unique to each customer and can serve as identifiers here.

# remove the '-' in phone number column
# Convert from object to integer
df['phone number'] = df['phone number'].str.replace('-', '').astype(int) 
df.set_index('phone number', inplace=True)

#Confirm if the column has been set as index
df.head()
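Before relying on phone numbers as an index, it can help to confirm that they really are unique. The snippet below is a minimal sketch on made-up toy values (the frame `toy` and its contents are illustrative assumptions, not rows from the dataset); it repeats the hyphen removal and only sets the index when `is_unique` holds.

```python
import pandas as pd

# Toy stand-in for the real data; the phone numbers are invented for illustration
toy = pd.DataFrame({
    'phone number': ['382-4657', '371-7191', '358-1921'],
    'churn': [False, False, True],
})

# Strip the hyphen literally (regex=False) and convert to integer
toy['phone number'] = (
    toy['phone number'].str.replace('-', '', regex=False).astype(int)
)

# Only use the column as an index if every value is unique
is_unique = toy['phone number'].is_unique
if is_unique:
    toy = toy.set_index('phone number')

print(is_unique)
print(toy)
```

The same `is_unique` check could be run on `df['phone number']` before calling `set_index`.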

## Exploratory data analysis
We will use `univariate`, `bivariate`, and `multivariate` analysis to perform a thorough investigation of the data in this section.

The goal of this kind of data exploration is to find potential `correlations` between the features and to understand how the variables are distributed, which will be crucial for feature engineering and modelling. Features that have a high correlation with the target are often good for building baseline models.

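#Check for the correlation of the numeric variables with churn
#numeric_only=True skips the text columns (needs pandas 1.5 or newer)
df.corr(numeric_only=True)['churn']

#correlation matrix 
corr_matrix = df.corr(numeric_only=True)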

# Generate the correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.title('Correlation Matrix between Variables')
plt.show();
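To pick out the features most related to the target, the correlations with `churn` can be ranked by absolute value. This is a small sketch on invented toy data (the column values are assumptions for illustration only), assuming pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Toy frame with one text column that correlation must skip (values are invented)
toy = pd.DataFrame({
    'day minutes': [120.0, 250.0, 180.0, 300.0, 90.0, 280.0],
    'plan': ['no', 'yes', 'no', 'yes', 'no', 'no'],
    'service calls': [1, 4, 2, 5, 0, 1],
    'churn': [False, True, False, True, False, True],
})

# Correlation of each numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)['churn']
    .drop('churn')                          # the target trivially correlates with itself
    .sort_values(key=abs, ascending=False)  # rank by strength, ignoring sign
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.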

# Checking the number of customers who churned and who remained
plt.figure(figsize=(10,8))
df['churn'].value_counts().plot(kind='bar', edgecolor='black')
plt.xlabel('customer churn')
plt.ylabel('No of Customers')
plt.title('Bar Chart of Customer Churn');

df['churn'].value_counts()

The graph above shows that more customers remained with the company (2850) than terminated their contracts. This class imbalance can bias a model such as logistic regression towards the majority class, but it can be handled using methods like `SMOTE`.
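To put the imbalance in numbers, the counts quoted in this notebook (2850 of the 3333 customers stayed) can be turned into proportions. The sketch below also computes the accuracy a model would reach by always predicting "not churned", which is a useful reference point when judging later accuracy scores. The labels `stayed` and `churned` are just illustrative names.

```python
import pandas as pd

# Counts taken from the text: 3333 customers in total, 2850 of whom stayed
counts = pd.Series({'stayed': 2850, 'churned': 3333 - 2850})

# Share of each class in the data
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()

print(proportions.round(4))
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```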

## Modelling


For a classification model, my target variable takes the form of classes (False and True). To be used in a model, it must first be converted to numbers. This is a `Binary classification` problem since the target has only two classes: 0 will represent False (not churned) while 1 will represent True (churned).



df['churn'] = df['churn'].astype(int)
df['churn'].value_counts()

Our model cannot work with categorical variables in training and testing.

We need to create dummy variables for the categorical columns and drop the first dummy of each, which then serves as the reference class.

#convert area_code, international plan, and voice_mail_plan to integers 1s and 0s
df = pd.get_dummies(df, columns=['area code', 'international plan', 'voice mail plan'],drop_first=True)
df.head()
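The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up frame (values are illustrative assumptions): each encoded column loses its first category, so a row whose dummies are all 0 belongs to that reference category.

```python
import pandas as pd

# Toy data with two categorical columns (values are invented)
toy = pd.DataFrame({
    'area code': [408, 415, 510, 415],
    'international plan': ['no', 'yes', 'no', 'no'],
})

# One dummy column per category, minus the first category of each column
encoded = pd.get_dummies(
    toy, columns=['area code', 'international plan'], drop_first=True
)
print(encoded)
```

Here `area code` 408 and `international plan` "no" become the reference classes: they have no column of their own and are represented by zeros in the remaining dummies.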

A model has features and a target. We need to separate them before feeding them to the model. `X` will be assigned the features while `y` will be assigned the target variable.

y = df['churn']
X = df.drop(columns=['churn'])

Now that we have our data ready, we split it into two: one set for `training` the model and another for `testing` it, to see how well the model generalises to unseen data. For the training set we take 80% of our data, since we want the model to use it to learn the underlying patterns. The test set is the remaining 20% of our original data. We pass a `random_state` to the function for reproducibility when the code is run again.

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
print(X_train.shape)
print(X_test.shape)
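With an imbalanced target, a plain random split can leave the train and test sets with slightly different churn rates. One option, shown here as a sketch on synthetic data rather than as the approach used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both sets. The arrays `X_toy` and `y_toy` are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows with a 15% positive class
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 150 + [0] * 850)

# stratify=y_toy preserves the 15% rate in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```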

## 1. LOGISTIC REGRESSION
**1.1 Baseline Model**

This is the logistic regression model without tuning any of the parameters

#Perform scaling: fit the scaler on the training features only, then apply it to the test features
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
#create an object for logistic regression
logreg= LogisticRegression()

#fit the model with your features and target variables for training data
logreg.fit(X_train_scaled, y_train)
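The scale-then-fit steps above can also be bundled into a single estimator. The following is a sketch of that alternative on synthetic data generated with `make_classification` (an assumption for illustration, not the churn data): a scikit-learn pipeline fits the scaler on the training data only and reuses it automatically at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and target
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# The pipeline scales and then fits; at predict time it applies the same scaling
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline removes the risk of accidentally fitting it on the test set.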

Once we have fitted the model, we can use it together with the scaled test features (`X_test_scaled`) to make a prediction (`y_pred`). We will then compare these predictions with the real values (`y_test`) and calculate the accuracy of our model. We are using the data from our test set to see how our model performs on unseen data. This gives us a true picture of whether our model has learnt the underlying patterns well or not.

#Make a prediction (y_pred) using the model
y_pred= logreg.predict(X_test_scaled)

###### Evaluating the model
Having the predicted y, we can now evaluate our model. We want to see how far our predicted values are from the real values. We can calculate the **accuracy** and the **confusion matrix**, which shows how the model's predictions fall into the four outcomes `(True Positive, False Positive, True Negative and False Negative)`. We can also calculate the **classification report**.

#Calculate for the accuracy of the model.
accuracy= accuracy_score(y_test,y_pred)
accuracy

The ratio of correctly predicted instances to total instances is 85.76%. Accuracy shows how well our model predicts both classes overall, and the higher the value the better. It can be misleading on imbalanced data, though, and should not be used alone.

#Work out the confusion matrix for the baseline model.
conf_matrix= confusion_matrix(y_test,y_pred)
print(conf_matrix)

#We can visualize the confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(conf_matrix,annot=True, fmt="d", cmap='Blues', cbar=False,xticklabels=["Not Churned", "Churned"],
            yticklabels=["Not Churned", "Churned"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Test Data');

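Our confusion matrix covers the 667 samples in the test set. Reading it together with the classification report below (rows are actual classes, columns are predicted classes):

    561 customers who did not churn were correctly predicted as not churned (true negatives).
    99 customers who churned were incorrectly predicted as not churned (false negatives).
    2 customers who churned were correctly predicted as churned (true positives).
    5 customers who did not churn were incorrectly predicted as churned (false positives).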
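Reading a confusion matrix is easier when the four cells are named. For a binary problem, scikit-learn returns the matrix as `[[TN, FP], [FN, TP]]`, so `ravel()` unpacks it in that order. The sketch below uses a small hand-made example (the label arrays are invented for illustration).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels: 1 = churned, 0 = not churned
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The same unpacking could be applied to `conf_matrix` from the cell above.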

#Calculate classification reports
class_report= classification_report(y_test,y_pred)
print(class_report)

class 1 - churned class

class 0 - not churned class

The precision for class 0 is 85% while that of class 1 is 29%.
This means that out of the instances predicted as class 1, 29% were actually from class 1.
Out of the instances predicted as class 0, 85% were actually from class 0.

The recall for class 0 is 99% while that of class 1 is 2%.
This means that out of all actual instances of class 0, 99% were correctly identified by the model.
Out of all actual instances of class 1, only 2% were correctly identified by the model.

The F1 score, the harmonic mean of precision and recall, is 92% for class 0 and 4% for class 1.
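The percentages in the classification report can be reproduced by hand from the four confusion-matrix counts. The sketch below assumes scikit-learn's layout (rows are actual classes), under which the counts reported earlier read as TN=561, FP=5, FN=99 and TP=2; that reading is an inference from the reported precision and recall, not something printed by the notebook.

```python
# Counts from the baseline confusion matrix, read as [[TN, FP], [FN, TP]] (assumed layout)
tn, fp, fn, tp = 561, 5, 99, 2

# Class 1 (churned): how reliable and how complete the positive predictions are
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (not churned): the same formulas with the roles of the classes swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}, f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}")
```

The recall of about 2% for class 1 comes directly from the 2 detected churners out of 101 actual churners in the test set.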

`ROC CURVE AND AUC`

Two crucial instruments for assessing the effectiveness of binary classification models are the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).
ROC plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The closer this curve is to the top left corner, the better the performance of the model.

The AUC provides a single value to help gauge the model's performance. A value of 1 shows that the model separates negative and positive samples perfectly. A value of 0.5 shows that the model is no better than random guessing and will not be useful for classification. A value above 0.5 means that the model is better than random guessing.
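The AUC values described above can be checked on a tiny hand-made example. In this sketch (all scores are invented for illustration), `roc_auc_score` returns 1.0 when every positive is scored above every negative, 0.5 when the scores carry no information, and the fraction of correctly ordered positive/negative pairs in between.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Every positive scored above every negative: perfect ranking
perfect_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
# Identical scores for everyone: no ranking information at all
constant_scores = np.full(6, 0.5)
# One negative (0.6) outranks one positive (0.4): 8 of 9 pairs ordered correctly
mixed_scores = np.array([0.1, 0.6, 0.3, 0.4, 0.8, 0.9])

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_constant = roc_auc_score(y_true, constant_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)

print(auc_perfect, auc_constant, round(auc_mixed, 3))
```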


#Plotting the ROC curve and AUC

#predict probabilities of class churned
y_prob= logreg.predict_proba(X_test_scaled)[:,1]
#Calculate the ROC curve
fpr,tpr,thresholds = roc_curve(y_test, y_prob)
#We can also calculate the AUC
roc_auc = auc(fpr,tpr)

#Plot the graph
plt.figure(figsize=(8,6))
plt.style.use('seaborn-v0_8-darkgrid')  # this style was named 'seaborn-darkgrid' before Matplotlib 3.6
plt.plot(fpr,tpr,color='r',label=f"ROC Curve(AUC={roc_auc:.2f})")
plt.plot([0,1],[0,1],linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend(loc='lower right')
plt.show()

In our case, the AUC is 0.72, which is greater than 0.5. This shows that the logistic regression model has moderate discriminatory power: it ranks churned customers above non-churned ones more often than random guessing would, but it is far from a perfect classifier.

High accuracy, precision, and recall for class 0 show that the model predicts the negative class (not churned) well.
However, as the low precision, recall, and F1-score for class 1 show, it performs badly on the positive class (churned).
Put otherwise, most customers who churned are missed by the model, resulting in false negatives. It is not accurately identifying the clients who have left.

This model is better than guessing, but relying on it could be costly for the business because it fails to flag most of the customers who churn.

##### Tuning the logistic regression model
*Using the SMOTE technique*

SMOTE (Synthetic Minority Over-sampling Technique) addresses target class imbalance by generating synthetic samples for the minority class. Rather than duplicating existing rows, it creates new points by interpolating between a minority sample and one of its nearest minority-class neighbours. After resampling, both classes have the same number of training samples. We can then train another model on the balanced data. The test set (`X_test_scaled`, `y_test`) remains untouched so that it still represents real-world data when we test the performance of the model.

#Balance the target classes using SMOTE 
smote =SMOTE(k_neighbors=5, random_state=42,sampling_strategy='minority')

#Apply SMOTE to the scaled training data only
X_train_bal,y_train_bal = smote.fit_resample(X_train_scaled, y_train)
y_train_bal.value_counts()
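To give an intuition for how SMOTE creates a synthetic sample, the sketch below reimplements only its core interpolation step in NumPy. It is a simplified illustration, not the `imblearn` implementation: the helper `synthetic_sample` and the `minority` points are hypothetical names and values made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthetic_sample(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                           # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_points = np.array(
    [synthetic_sample(minority, k=2, rng=rng) for _ in range(5)]
)
print(new_points)
```

Because each new point lies on the line segment between two existing minority samples, the synthetic data stays inside the region the minority class already occupies.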

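**Tuned Model**

We now train a second logistic regression model on the SMOTE-balanced training data. The model's hyperparameters stay at their defaults; only the training data changes. We will then compare it with our baseline to see if there are improvements.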

#Create an object for this model
logreg1= LogisticRegression()

#Fit the model
logreg1.fit(X_train_bal,y_train_bal)

#Make predictions to help in calculating accuracy
y_pred1 = logreg1.predict(X_test_scaled)


We proceed to evaluate the performance of this model, which gives us a way of comparing it with the first model.

#Calculate Accuracy of the tuned model
accuracy1= accuracy_score(y_test,y_pred1)
accuracy1

#Calculate the confusion matrix of the tuned model
conf_matrix1= confusion_matrix(y_test,y_pred1)
conf_matrix1

#We can visualize the confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(conf_matrix1,annot=True, fmt="d", cmap='Set1_r', cbar=False,xticklabels=["Not Churned", "Churned"],
            yticklabels=["Not Churned", "Churned"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Test Data');

#Calculate the classification report of the tuned model
class_report1= classification_report(y_test,y_pred1)
print(class_report1)

#Plotting the ROC curve and AUC

#predict probabilities of class churned
y_prob1= logreg1.predict_proba(X_test_scaled)[:,1]
#Calculate the ROC curve
fpr1,tpr1,thresholds1 = roc_curve(y_test, y_prob1)
#We can also calculate the AUC
roc_auc1 = auc(fpr1,tpr1)

#Plot the graph
plt.figure(figsize=(8,6))
plt.style.use('seaborn-v0_8-darkgrid')  # this style was named 'seaborn-darkgrid' before Matplotlib 3.6
plt.plot(fpr1,tpr1,color='r',label=f"ROC Curve(AUC={roc_auc1:.2f})")
plt.plot([0,1],[0,1],linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend(loc='lower right')
plt.show()

## 2. DECISION TREE CLASSIFIER

Just like logistic regression, we will train, test and evaluate a Decision Tree Classifier. We will start by creating a `DecisionTreeClassifier` and fit it on `X_train_bal` and `y_train_bal`, the training data we obtained earlier after addressing class imbalance.

#create an object for decision tree
dtc= DecisionTreeClassifier(random_state=42)

#fit the model
dtc_model=dtc.fit(X_train_bal,y_train_bal)

#Make a prediction using test data
y_pred2 = dtc_model.predict(X_test_scaled)


After fitting the model, we can use the predicted y together with y_test to calculate accuracy, precision, recall and f1_score 

#Calculate accuracy 
accuracy2=accuracy_score(y_test, y_pred2)
accuracy2

#Work out the confusion matrix for the  model.
conf_matrix2= confusion_matrix(y_test,y_pred2)
print(conf_matrix2)

#We can visualize the confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(conf_matrix2,annot=True, fmt="d", cbar=False,xticklabels=["Not Churned", "Churned"],
            yticklabels=["Not Churned", "Churned"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Test Data');

#Calculate class report metrics
class_report2= classification_report(y_test,y_pred2)
print(class_report2)
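Since the same metrics are computed for every model in this notebook, they could be gathered by a small helper so that models are easier to compare side by side. The function `summarize` below is a hypothetical helper written for illustration, demonstrated on synthetic data from `make_classification` rather than on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def summarize(y_true, y_hat):
    """Return accuracy plus precision, recall and F1 for the positive class."""
    return {
        'accuracy': accuracy_score(y_true, y_hat),
        'precision': precision_score(y_true, y_hat, zero_division=0),
        'recall': recall_score(y_true, y_hat, zero_division=0),
        'f1': f1_score(y_true, y_hat, zero_division=0),
    }

# Synthetic, imbalanced stand-in for the churn data (about 15% positives)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
summary = summarize(y_te, tree.predict(X_te))
print(summary)
```

In this notebook the helper could be called as `summarize(y_test, y_pred)`, `summarize(y_test, y_pred1)` and `summarize(y_test, y_pred2)` to line up the three models.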

#plot the ROC curve and AUC

#predict probabilities of class churned
y_prob2= dtc_model.predict_proba(X_test_scaled)[:,1]
#Calculate the ROC curve
fpr2,tpr2,thresholds2 = roc_curve(y_test, y_prob2)
#We can also calculate the AUC
roc_auc2 = auc(fpr2,tpr2)

#Plot the graph
plt.figure(figsize=(8,6))
plt.style.use('seaborn-v0_8-darkgrid')  # this style was named 'seaborn-darkgrid' before Matplotlib 3.6
plt.plot(fpr2,tpr2,color='b',label=f"ROC Curve(AUC={roc_auc2:.2f})")
plt.plot([0,1],[0,1],linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend(loc='lower right')
plt.show()

plt.figure(figsize=(20, 10))  # Specify the figure size
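# Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index here
plot_tree(dtc_model, filled=True, feature_names=list(X.columns), class_names=['class_0', 'class_1'])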
plt.show()
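A fully grown tree is often too large to read as a plot. As a complementary sketch (on synthetic data, with hypothetical feature names `f0` to `f3`), `export_text` prints a tree's decision rules as plain text and `feature_importances_` summarises which features drive the splits. The `max_depth=3` limit is an assumption chosen only to keep the printed rules short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the churn data; feature names are placeholders
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=42)
feature_names = ['f0', 'f1', 'f2', 'f3']

# A shallow tree keeps the printed rules readable
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# Text version of the decision rules, one line per split or leaf
rules = export_text(shallow_tree, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to the tree's splits (sums to 1)
importances = dict(zip(feature_names, shallow_tree.feature_importances_))
print(importances)
```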
