#### Question: How well can we predict customer churn based on demographic features (gender, age, marital status, and number of dependents)?

##### Expectations
This question is useful for the telecommunications company because it shows which demographic factors are most strongly associated with churn. By analyzing the data and building a predictive model, the company can identify the customer segments most at risk and tailor retention strategies to them.

##### Information about the data:
The data is stored in an Excel file named `Telco_customer_churn_demographics.xlsx`. The file contains 7043 rows. Each row represents a customer, and each column holds one customer attribute, as described in the column metadata. The demographic features that we have are:
1. Gender
2. Age
3. Marriage status
4. Dependents

#### EDA

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
# Load the data from /Dataset/Telco_customer_churn_demographics.xlsx
dataset1 = pd.read_excel('../Dataset/Telco_customer_churn_demographics.xlsx')

In [None]:
# we need a column from another Excel file, so we load that file and join it with the dataset

# Load the data from /Dataset/Telco_customer_churn.xlsx
dataset2 = pd.read_excel('../Dataset/Telco_customer_churn.xlsx')

In [None]:
# rename the column to match the column name in the dataset
dataset2.rename(columns={'CustomerID':'Customer ID'}, inplace=True)

In [None]:
# Join the two datasets on the column 'Customer ID'
dataset = pd.merge(dataset1, dataset2, on='Customer ID')

In [None]:
# Check for any missing values
dataset.isnull().sum()

In [None]:
# sanity-check the merge: Gender_x and Gender_y should agree on every row (expect 0 mismatches)
difference = dataset['Gender_x'] != dataset['Gender_y']
difference.sum()
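
The check above compares the duplicated `Gender` columns after the join. As a complementary sketch, `pd.merge` can itself assert that the join is one-to-one and report keys that appear in only one file. The frames `left` and `right` below are made-up stand-ins, not the Telco files, and the outer join is an assumption chosen so that unmatched keys stay visible.

```python
import pandas as pd

# Toy stand-ins for dataset1 and dataset2 (hypothetical values)
left = pd.DataFrame({'Customer ID': ['A1', 'B2', 'C3'],
                     'Gender': ['Male', 'Female', 'Male']})
right = pd.DataFrame({'Customer ID': ['A1', 'B2', 'D4'],
                      'Gender': ['Male', 'Female', 'Female'],
                      'Churn Value': [0, 1, 0]})

# validate raises MergeError if a key repeats on either side;
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(left, right, on='Customer ID', how='outer',
                  validate='one_to_one', indicator=True)

# rows whose key exists in only one of the two tables
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['Customer ID', '_merge']])

# for matched rows, the overlapping Gender columns should agree
matched = merged[merged['_merge'] == 'both']
n_mismatch = (matched['Gender_x'] != matched['Gender_y']).sum()
print(n_mismatch)
```

If `unmatched` is empty and `n_mismatch` is 0 on the real data, the inner join used in this notebook loses no customers.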

In [None]:
# check the data types of the columns
dataset.dtypes

In [None]:
# keep only the demographic features and the target column

my_columns = ['Gender_x', 'Age', 'Married',
              'Number of Dependents', 'Churn Value']

dataset = dataset[my_columns]

In [None]:
# turn the categorical variables into 0/1 dummy variables
# (dtype=int keeps them numeric; recent pandas versions default to bool)
dataset = pd.get_dummies(dataset, dtype=int)

# check the data types of the columns
dataset.dtypes

In [None]:
# drop the redundant dummy columns: one 0/1 column per binary variable is enough
if 'Gender_x_Female' in dataset.columns:
    dataset = dataset.drop(
        ['Gender_x_Female', 'Married_No'], axis=1)


# rename the remaining dummy columns to plain feature names
dataset.rename(columns={'Gender_x_Male': 'Gender',
               'Married_Yes': 'Married'}, inplace=True)

# check the data types of the columns
dataset.dtypes    

##### Gender encoding: 1 = male, 0 = female
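
The same 0/1 encoding can be obtained in a single step. The sketch below is one possible shortcut, not the notebook's method: it runs on a made-up frame `toy` and relies on `drop_first=True` dropping the alphabetically first level of each categorical column (`Female` and `No` here).

```python
import pandas as pd

# Hypothetical rows with the same column names as the notebook
toy = pd.DataFrame({'Gender_x': ['Male', 'Female', 'Female'],
                    'Married': ['Yes', 'No', 'Yes'],
                    'Age': [34, 61, 45]})

# drop_first=True keeps one dummy per binary variable;
# dtype=int gives 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

# rename the surviving dummies to plain feature names
encoded = encoded.rename(columns={'Gender_x_Male': 'Gender',
                                  'Married_Yes': 'Married'})
print(encoded)
```

Here `Gender` is 1 for male and 0 for female, and `Married` is 1 for yes and 0 for no, matching the convention stated above.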

In [None]:
# check the head of the dataset
dataset.head()

In [None]:
# split the dataset into training and test sets
train, test = train_test_split(dataset, test_size=0.2, random_state=42)

In [None]:
# visualize the distribution of the demographic features vs churn value
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
ax[0, 0].hist(train[train['Churn Value'] == 0]['Age'], bins=20, color='blue', alpha=0.5)
ax[0, 0].hist(train[train['Churn Value'] == 1]['Age'], bins=20, color='orange', alpha=0.5)
ax[0, 0].set_title('Age')

ax[0, 1].hist(train[train['Churn Value'] == 0]['Number of Dependents'], bins=10, color='blue', alpha=0.5)
ax[0, 1].hist(train[train['Churn Value'] == 1]['Number of Dependents'], bins=10, color='orange', alpha=0.5)
ax[0, 1].set_title('Number of Dependents')

ax[1, 0].hist(train[train['Churn Value'] == 0]['Gender'],
              bins=2, color='blue', alpha=0.5, rwidth=0.8)
ax[1, 0].hist(train[train['Churn Value'] == 1]['Gender'],
              bins=2, color='orange', alpha=0.5, rwidth=0.8)
ax[1, 0].set_title('Gender')

ax[1, 1].hist(train[train['Churn Value'] == 0]['Married'],
              bins=2, color='blue', alpha=0.5,rwidth=0.8)
ax[1, 1].hist(train[train['Churn Value'] == 1]['Married'],
              bins=2, color='orange', alpha=0.5, rwidth=0.8)
ax[1, 1].set_title('Married')

### Initial insights

1. Age seems to be a good predictor of churn (older customers tend to churn more).
2. Gender does not seem to have any effect on the churn rate.
3. It is hard to tell whether the number of dependents affects the churn rate; further analysis is needed.
4. Single customers appear to churn more than married customers.
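
Overlaid histograms make these impressions hard to quantify, because the two churn groups differ in size. A minimal sketch of how the same comparison could be put into numbers: since `Churn Value` is 0/1, its mean within a group is that group's churn rate. The frame `toy` and the age bands are invented for illustration.

```python
import pandas as pd

# Hypothetical customers (not taken from the Telco data)
toy = pd.DataFrame({
    'Age': [22, 35, 47, 58, 66, 71, 29, 80],
    'Married': [1, 1, 0, 0, 1, 0, 1, 0],
    'Churn Value': [0, 0, 1, 0, 1, 1, 0, 1],
})

# mean of a 0/1 target per group = churn rate of that group
rate_by_married = toy.groupby('Married')['Churn Value'].mean()
print(rate_by_married)

# bin Age into bands (cut points are an arbitrary choice) and repeat
age_band = pd.cut(toy['Age'], bins=[0, 40, 65, 120],
                  labels=['<=40', '41-65', '>65'])
rate_by_age = toy.groupby(age_band, observed=True)['Churn Value'].mean()
print(rate_by_age)
```

Applied to `train`, the same two `groupby` calls would give churn rates per marital status and per age band that can be compared directly.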

In [None]:
# perform a correlation analysis on the dataset to see which features are highly correlated
plt.figure(figsize=(20, 10))
sns.heatmap(train.corr(method='pearson'), annot=True, cmap='gist_heat')
plt.show()


### Correlation matrix analysis
1. The features seem to be uncorrelated with each other.
2. We see a positive correlation between churn value and Age.
3. We see a negative correlation between churn value and both Number of Dependents and Married.
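
Reading signs and magnitudes off a heatmap is error-prone, so one option is to extract only the target's column of the correlation matrix and rank it by absolute value. This is a sketch on an invented frame `toy`; the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical customers (not taken from the Telco data)
toy = pd.DataFrame({
    'Age': [22, 35, 47, 58, 66, 71, 29, 80],
    'Married': [1, 1, 0, 0, 1, 0, 1, 0],
    'Number of Dependents': [0, 2, 0, 1, 0, 0, 3, 0],
    'Churn Value': [0, 0, 1, 0, 1, 1, 0, 1],
})

# correlation of every feature with the target, strongest first
corr_with_churn = (toy.corr(method='pearson')['Churn Value']
                   .drop('Churn Value')
                   .sort_values(key=abs, ascending=False))
print(corr_with_churn)
```

Passing `method='spearman'` or `method='kendall'` instead gives the rank-based versions of the same ranking.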


In [None]:
# let's check monotonic (not necessarily linear) relationships with rank correlations
#sns.pairplot(train, hue='Churn Value')

# Correlation with Spearman's Rank Correlation:
plt.figure(figsize=(20, 10))
sns.heatmap(train.corr(method='spearman'), annot=True, cmap='gist_heat')
plt.show()



In [None]:
# Correlation with Kendall's Rank Correlation:
plt.figure(figsize=(20, 10))
sns.heatmap(train.corr(method='kendall'), annot=True, cmap='gist_heat')
plt.show()


#### Model building:

In [None]:
# Let's build a logistic regression model

# drop the target variable from the training set
X_train = train.drop('Churn Value', axis=1)

# select the target variable from the training set
y_train = train['Churn Value']

# drop the target variable from the test set
X_test = test.drop('Churn Value', axis=1)

# select the target variable from the test set
y_test = test['Churn Value']

# balance the training set with SMOTE (training data only, so the test set keeps its original class ratio)
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [None]:
# Define hyperparameter values to be tuned
param_grid = {'penalty': ['l1', 'l2'],
              'C': [0.01, 0.1, 1],
              'solver': ['liblinear', 'saga'],
              'class_weight': [None, 'balanced']}

# Create a logistic regression model
logreg = LogisticRegression()

# Create a GridSearchCV object to perform hyperparameter tuning
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy',n_jobs=10)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters: ", best_params)

# Refit a logistic regression model with the best hyperparameters on the full training data
best_logreg = LogisticRegression(**best_params)
best_logreg.fit(X_train, y_train)

# Predict the churn value for the test set
y_pred = best_logreg.predict(X_test)

# Evaluate the model performance
report = classification_report(y_test, y_pred)
print(report)
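
With its default `refit=True`, `GridSearchCV` retrains the best configuration on the whole training set, so the fitted model is also available as `best_estimator_` without a second `fit`. The sketch below shows this on synthetic data from `make_classification` (not the Telco data). Scoring by balanced accuracy and tuning only `C` are assumptions made here to keep the example small; the notebook itself scores by plain accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 70% / 30%)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1]},
    cv=5,
    scoring='balanced_accuracy',  # less sensitive to class imbalance
)
search.fit(X_tr, y_tr)

# best_estimator_ is already fitted on all of X_tr
best_model = search.best_estimator_
y_pred_toy = best_model.predict(X_te)
print(search.best_params_)
```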

### Logistic regression results analysis

The model reaches a very poor accuracy of about 56% on the test set.
It is not able to reliably predict customer churn from the demographic data alone.
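
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The labels are invented, with a 30% churn share chosen purely as an assumption; on the real data the baseline would be fitted on `X_train`, `y_train` before resampling and scored on `X_test`, `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 30 churners, 70 non-churners
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# always predicts the majority class (here: no churn)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

base_acc = accuracy_score(y_toy, y_base)   # share of the majority class
base_recall = recall_score(y_toy, y_base)  # churn recall of the baseline
print(base_acc, base_recall)
```

On this toy example the baseline scores 0.70 accuracy while finding no churners at all, which is why accuracy alone can mislead on imbalanced labels.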

In [None]:
# Let's build a random forest model

# Define hyperparameter values to be tuned
param_grid = {
    "n_estimators": [100, 500],
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Create a random forest model
rf = RandomForestClassifier()

# Create a GridSearchCV object to perform hyperparameter tuning
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy',n_jobs=10)
grid_search.fit(X_train, y_train)

# Get the best hyperparameter
best_params = grid_search.best_params_
print("Best Hyperparameters: ", best_params)

# Fit the random forest model with best hyperparameter to the training data
best_rf = RandomForestClassifier(**best_params)
best_rf.fit(X_train, y_train)

# Predict the churn value for the test set
y_pred = best_rf.predict(X_test)

# Evaluate the model performance
report = classification_report(y_test, y_pred)
print(report)

In [None]:
# Let's build an SVM model

# Define hyperparameter values to be tuned
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"]
}

# Create an SVM model
svm = SVC()

# Create a GridSearchCV object to perform hyperparameter tuning
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy',n_jobs=10)
grid_search.fit(X_train, y_train)

# Get the best hyperparameter
best_params = grid_search.best_params_
print("Best Hyperparameters: ", best_params)

# Fit the SVM model with best hyperparameter to the training data
best_svm = SVC(**best_params)
best_svm.fit(X_train, y_train)

# Predict the churn value for the test set
y_pred = best_svm.predict(X_test)

# Evaluate the model performance
report = classification_report(y_test, y_pred)
print(report)

#### Results Interpretation:
Based on the results of the three models, it appears that predicting customer churn based solely on demographic features (gender, age, marital status, and number of dependents) is challenging. The best model achieved an accuracy of around 57%, which is only slightly better than the 50% expected from random guessing between two classes.

The Logistic Regression model had the highest recall for the churn class (0.78), which means it correctly identified 78% of the customers who actually churned. However, its precision for the churn class was low (0.37): only 37% of the customers it flagged as churners actually churned, so it produced a large number of false positives.
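
To make the two metrics concrete, here is a small worked example. The counts are hypothetical, chosen only so that the ratios come out near the reported 0.78 and 0.37. They are not the notebook's actual confusion matrix.

```python
# Hypothetical counts for the churn class (illustration only)
tp = 78    # churners correctly flagged
fn = 22    # churners the model missed
fp = 133   # non-churners wrongly flagged as churners

# recall: share of actual churners that were caught
recall = tp / (tp + fn)

# precision: share of flagged customers who really churned
precision = tp / (tp + fp)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 2), round(precision, 2), round(f1, 2))
```

With these counts, about 1.7 non-churners are flagged for every churner that is caught, which is the practical cost of the low precision.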

The Random Forest and SVM models had similar results with an accuracy of around 56%, and relatively balanced precision and recall scores for predicting customer churn.

Based on these results, it seems that demographic features alone may not be sufficient to accurately predict customer churn. To improve the accuracy of the predictive model, it may be necessary to include additional features.

In [None]:
# save the dataset as an Excel file to use for Tableau visualization
dataset.to_excel('q1.xlsx', index=False)
