In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import confusion_matrix, classification_report 
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE

#Load dataset
customers = pd.read_csv('customers.csv')

### What features are most likely associated with churn?

In [20]:
# We will answer this by computing each feature's correlation coefficient with churn. But first:

# Convert categorical variables to numerical variables
customers['churn'] = customers['churn'].map({'No': 0, 'Yes': 1})
customers['phoneService'] = customers['phoneservice'].map({'No': 0, 'Yes': 1})
customers['multipleLines'] = customers['multiplelines'].map({'No phone service': 0, 'No': 1, 'Yes': 2})
customers['internetService'] = customers['internetservice'].map({'No': 0, 'DSL': 1, 'Fiber optic': 2})
customers['onlineSecurity'] = customers['onlinesecurity'].map({'No internet service': 0, 'No': 1, 'Yes': 2})
customers['onlineBackup'] = customers['onlinebackup'].map({'No internet service': 0, 'No': 1, 'Yes': 2})
customers['deviceProtection'] = customers['deviceprotection'].map({'No internet service': 0, 'No': 1, 'Yes': 2})
customers['techSupport'] = customers['techsupport'].map({'No internet service': 0, 'No': 1, 'Yes': 2})
customers['streamingTV'] = customers['streamingtv'].map({'No internet service': 0, 'No': 1, 'Yes': 2})
customers['streamingMovies'] = customers['streamingmovies'].map({'No internet service': 0, 'No': 1, 'Yes': 2})
customers['paperlessBilling'] = customers['paperlessbilling'].map({'No': 0, 'Yes': 1})
customers['paymentMethod'] = customers['paymentmethod'].map({'Electronic check': 0, 'Mailed check': 1, 'Bank transfer (automatic)': 2, 'Credit card (automatic)': 3})

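# Calculate the correlation coefficients
# Sorted so that the highest correlation coefficients are on top
# (numeric_only=True skips the remaining text columns, which pandas 2.0+ requires)
correlations = customers.corr(numeric_only=True)['churn'].sort_values(ascending=False)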

# Print the results
print(correlations)

churn               1.000000
internetService     0.316350
monthlycharges      0.192858
paperlessBilling    0.191454
streamingTV         0.164509
streamingMovies     0.162672
seniorcitizen       0.150541
deviceProtection    0.084402
onlineBackup        0.073934
multipleLines       0.036148
techSupport         0.026744
onlineSecurity      0.023014
phoneService        0.011691
totalcharges       -0.199484
paymentMethod      -0.262918
tenure             -0.354049
Name: churn, dtype: float64


So, based on this, internetService, monthlycharges and paperlessBilling have the strongest positive correlations with churn.
However, even the largest of these (about 0.32) is only a moderate correlation, so there are clearly other factors affecting churn, which we will explore later.


In our SQL script, we found that customers with higher monthly charges churned more. To solidify this association, we will perform a hypothesis test to determine whether the correlation between monthly charges and churn could be due to chance or is statistically significant. We will use Pearson's correlation test, which tells us whether there is a linear relationship between the two variables and whether that relationship is significant. The result is a correlation coefficient and a p-value. The p-value is what we use to decide between the two hypotheses below.

Null Hypothesis: The null hypothesis states that there is no relationship between the two variables, so any correlation we observe is merely due to chance. Therefore, our null hypothesis is: There is no significant correlation between the two variables (monthly charges and churn).

Alternative Hypothesis: The alternative hypothesis states the opposite: there is a relationship between the two variables and it is not due to chance. Therefore, our alternative hypothesis is: There is a significant correlation between the two variables.

As noted above, the p-value drives the decision. If the p-value is less than the significance level (0.05), we reject the null hypothesis. If the p-value is greater than the significance level, we fail to reject the null hypothesis. That does not prove the null hypothesis is true; it only means the data do not provide enough evidence against it.
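
To make "due to chance" concrete, here is a minimal sketch of a permutation test on made-up data. It is not the Pearson test from scipy used in this notebook, only an illustration of the same decision rule: shuffle the churn labels many times and count how often the shuffled correlation is at least as large as the observed one. The data, the helpers `pearson_r` and `permutation_p_value`, and all parameter values are assumptions for the example.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=500, seed=0):
    """Share of label shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: churn probability rises with the monthly charge
rng = random.Random(42)
charges = [rng.uniform(20, 120) for _ in range(400)]
churn = [1 if rng.random() < 0.1 + 0.8 * (c - 20) / 100 else 0 for c in charges]

r = pearson_r(charges, churn)
p_value = permutation_p_value(charges, churn)
alpha = 0.05
decision = "reject" if p_value < alpha else "fail to reject"
print(f"r = {r:.3f}, p-value = {p_value:.4f}: {decision} the null hypothesis")
```

With only 500 shuffles the smallest p-value this sketch can report is 1/501, so it cannot reproduce the tiny p-values that an analytic test returns on a large sample.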

Let us see what our p-value is.

In [15]:
# Select the monthly charges and churn columns
monthly_charges = customers['monthlycharges']
churn = customers['churn']

# Perform Pearson correlation test
corr, p_value = pearsonr(monthly_charges, churn)

# Print results
print("Correlation coefficient:", corr)
print("P-value:", p_value)


Correlation coefficient: 0.19285821847007886
P-value: 6.760843118056653e-60


This is a very small p-value, far below the significance level of 0.05, so we can reject the null hypothesis. This means there is a statistically significant relationship between monthlycharges and churn that is very unlikely to be due to chance. Note that significant does not mean strong: the correlation itself (about 0.19) is weak.

Therefore, customers with higher monthly charges are at greater risk of churning. I would recommend reviewing the pricing of services to ensure it is reasonable for all customers, because some customers may be leaving due to high monthly charges. Lowering prices may retain some of them and reduce the churn rate.
I would also conduct a customer survey to find out whether customers are churning because the price does not match the services being offered.

### Let's now predict customer churn using Logistic Regression.

In [134]:
# We drop customerID and gender because they are not relevant to our model. We also drop totalcharges because
# it is highly correlated with tenure, so keeping it would be redundant and cause multicollinearity issues.
to_drop = ['customerID','gender','totalcharges']
customers = customers.drop(to_drop,axis=1)
categorical_cols = ['deviceprotection','onlinebackup','multiplelines','phoneservice','onlinesecurity','techsupport','internetservice', 'contract','tenure','paymentmethod','paperlessbilling','streamingtv','streamingmovies','partner','dependents']
encoded_features = pd.get_dummies(customers, columns=categorical_cols)
features = encoded_features.drop('churn', axis=1)
target = encoded_features['churn']

In [145]:


# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Increase the maximum number of iterations so the solver can converge
logreg = LogisticRegression(max_iter=1000)

# Fit the model to the training data
logreg.fit(X_train, y_train)

# I will now evaluate the model 
y_pred = logreg.predict(X_test) 

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.84      0.89      0.86      1549
         Yes       0.63      0.52      0.57       561

    accuracy                           0.79      2110
   macro avg       0.73      0.71      0.72      2110
weighted avg       0.78      0.79      0.78      2110



Nice! In the classification report, we are interested in the precision, recall, f1-score and accuracy.

The accuracy of our model is 0.79. This means the model correctly predicts whether a customer will churn or not about 79% of the time.

The precision for Yes is 0.63. This means that when the model predicts that a customer will churn, it is correct 63% of the time.
The precision for No is 0.84. This means that when the model predicts that a customer won't churn, it is correct 84% of the time.

The recall for Yes is 0.52. This means the model correctly identifies 52% of the customers who actually churned.
The recall for No is 0.89. This means the model correctly identifies 89% of the customers who did not churn.

The F1 score for No is 0.86 and for Yes it is 0.57. The F1 score is the harmonic mean of precision and recall, so it takes both into account. An F1 score of 0.86 means the model is doing pretty well at predicting who will not churn. However, the F1 score for predicting who will churn is 0.57, which is low by comparison. So, this version of the model does better at predicting who will not churn than who will.
The weighted avg row averages each metric (precision, recall and F1 score) across both classes, weighting each class by its support. The model's weighted avg F1 score is 0.78, which means the model is doing pretty well, but it could still be improved.
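
The definitions above can be checked by hand. This is a minimal sketch on a made-up set of ten labels (not the notebook's data); `y_true`, `y_pred` and the helper `binary_metrics` are hypothetical names for the example.

```python
def binary_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # flagged by mistake
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Hypothetical labels: 4 actual churners and 6 non-churners
y_true = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"]

yes = binary_metrics(y_true, y_pred, "Yes")
no = binary_metrics(y_true, y_pred, "No")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Weighted average: each class's score weighted by how many true samples it has
weighted_f1 = (yes["f1"] * yes["support"] + no["f1"] * no["support"]) / len(y_true)

for name, scores in (("Yes", yes), ("No", no)):
    print(name, {key: round(value, 2) for key, value in scores.items()})
print("accuracy:", round(accuracy, 2), "weighted F1:", round(weighted_f1, 2))
```

In this toy example the model flags three customers as churners and two of them really churned, so the precision for Yes is 2/3. It finds two of the four actual churners, so the recall for Yes is 2/4.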


In [143]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Increase the maximum number of iterations so the solver can converge
logreg = LogisticRegression(max_iter=1000)
# Apply recursive feature elimination to select the most important features, so we can improve the model's performance
rfe = RFE(logreg, n_features_to_select=10)
rfe.fit(X_train, y_train)
rfe_features = X_train.columns[rfe.support_]

# Train the logistic regression model on the new features 
logreg.fit(X_train[rfe_features], y_train)

# Make predictions on the testing data and evaluate the model
y_pred = logreg.predict(X_test[rfe_features])
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.87      0.78      0.82      1549
         Yes       0.53      0.67      0.59       561

    accuracy                           0.75      2110
   macro avg       0.70      0.73      0.71      2110
weighted avg       0.78      0.75      0.76      2110



Based on the new classification report, the model improved in the way I wanted. My goal was a model that does a better job of predicting the customers who will churn, and both the recall and the f1-score for Yes improved. The model now correctly identifies 67% of the customers who actually churned, up from 52% in the previous model, so it is better at identifying potential churners. However, there are still some downsides. The precision for Yes decreased from 63% to 53%, which means the model is now correct only 53% of the time when it predicts that a customer will churn.
Because we used RFE for feature selection, the model was trained on a smaller set of features, and the accuracy decreased from 0.79 to 0.75. RFE keeps the features it ranks as most important, but that subset is not guaranteed to be optimal. The accuracy is lower than before, but it is still decent.
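
For readers unfamiliar with RFE, here is a minimal sketch on synthetic data generated with `make_classification` (not the customers dataset). RFE repeatedly fits the estimator, drops the weakest feature (for logistic regression, the one with the smallest coefficient magnitude) and refits until `n_features_to_select` features remain. The feature names `f0` to `f7` and the choice of 3 features are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 carry signal (an assumption for this sketch)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Keep the 3 features that survive recursive elimination
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for every kept feature
selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("Selected features:", selected)
print("Ranking (1 = kept):", dict(zip(feature_names, rfe.ranking_)))
```

Features with a ranking above 1 were eliminated, and a higher ranking means the feature was dropped earlier.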

While the second model with RFE does better at identifying customers who actually churned, it has lower precision. It will therefore flag more customers as churners who don't actually end up churning. If a company were to use this model, it would end up spending money to retain customers who were not going to churn.

On the other hand, the first model has higher precision. It produces fewer false positives (customers who are predicted to churn but don't), but it misses more of the actual churners (false negatives).

So, a company using the first model wouldn't overspend on retaining customers who won't actually churn, but it would still lose money because some customers who go on to churn were not flagged by the model.

With the second model, the company would overspend on retaining customers who won't actually churn, but it would catch more of the actual churners: the second model identifies 67% of customers who actually churned, versus 52% for the first model. So both models have their pros and cons.
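
To put rough numbers on this trade-off, here is a back-of-the-envelope sketch. It rebuilds approximate counts from the reported precision and recall for Yes and the 561 actual churners in the test set, then applies two purely hypothetical cost settings. The cost figures and the helpers `counts_from_report` and `expected_cost` are assumptions, and the sketch ignores whether a retention offer actually works.

```python
def counts_from_report(precision, recall, n_churners):
    """Approximate true positives, false positives and false negatives."""
    tp = recall * n_churners               # churners the model catches
    fp = tp * (1 - precision) / precision  # non-churners flagged by mistake
    fn = n_churners - tp                   # churners the model misses
    return tp, fp, fn


def expected_cost(precision, recall, n_churners, offer_cost, churn_loss):
    """Every flagged customer gets an offer; every missed churner is lost."""
    tp, fp, fn = counts_from_report(precision, recall, n_churners)
    return offer_cost * (tp + fp) + churn_loss * fn


N_CHURNERS = 561  # support for Yes in both classification reports
models = {
    "model 1 (all features)": (0.63, 0.52),  # (precision, recall) for Yes
    "model 2 (RFE)": (0.53, 0.67),
}
# Hypothetical (offer_cost, churn_loss) pairs, chosen only for illustration
scenarios = {
    "cheap offer, costly churn": (50, 500),
    "costly offer, cheap churn": (500, 50),
}

costs = {}
for scenario, (offer_cost, churn_loss) in scenarios.items():
    costs[scenario] = {}
    for name, (precision, recall) in models.items():
        cost = expected_cost(precision, recall, N_CHURNERS, offer_cost, churn_loss)
        costs[scenario][name] = cost
        print(f"{scenario} | {name}: {cost:,.0f}")
```

Under these assumed costs, the second model is cheaper when losing a customer costs far more than a retention offer, and the first model is cheaper when the offer is the expensive part.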

To conclude, it boils down to the specific goals of the company. In this project, the goal of the company is to minimize the number of customers who churn, even if it means that some customers who will not actually churn are flagged as well. Therefore, the second model is the better fit for our needs.

On the other hand, another company may prefer to minimize the number of erroneous churn predictions so that it does not spend resources unnecessarily. In that case, the first model would be better because its precision is higher, so its churn predictions are more often correct.
The optimal situation would be to have a balance between precision and recall: a model whose churn predictions are usually correct and that also identifies a decent percentage of the customers who actually churn.
We could still tune the first model further, for example by choosing a classification model other than Logistic Regression or by applying undersampling or oversampling. Another way to tune the model is to use a feature selection algorithm, which is why I used Recursive Feature Elimination to choose the features to train on, in the hope that it would improve the model. This is my first predictive model. I think this is a good start.
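
As one illustration of the oversampling idea mentioned above, here is a minimal sketch of random oversampling in plain Python on made-up rows. It duplicates randomly chosen minority-class rows until every class is the same size. The helper `random_oversample` and the toy data are assumptions for the example. In practice this would be applied to the training split only, so that duplicated rows never leak into the test set.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):  # zero iterations for the majority class
            new_rows.append(rng.choice(pool))
            new_labels.append(label)
    return new_rows, new_labels


# Hypothetical rows of [monthly charge, tenure] with an imbalanced target
rows = [[70.5, 2], [99.9, 1], [20.1, 60], [45.0, 24], [30.3, 48], [25.0, 72]]
labels = ["Yes", "Yes", "No", "No", "No", "No"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(labels), "->", Counter(balanced_labels))
```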