# **Retention Strategies**

### **Assignment**

Start by defining and training a classification model to predict the churn of customers of a telecommunication company. Based on the data analysis, and possibly, the feature importance analysis in the model, suggest strategies that can be used to retain the customers. Simulate your strategies by altering the data and confirm their effectiveness using the model.

### **Data Description**

The customer churn data is provided in the file Telco-Customer-Churn.json. Each row represents a customer, and each column contains one of the customer's attributes, as described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn;
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies;
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges;
- Demographic info about customers – gender, age range, and if they have partners and dependents.

### **Practicalities**

Define, train and evaluate a predictive model that takes the provided data as input. You may split the data into training, validation and testing sets at your discretion. Do not use external data for this project. You may use any algorithm of your choice or compare multiple models.

Make sure that the solution reflects your entire thought process: how the code is structured matters more than the final metrics.


#### To download the dataset, <a href="https://drive.google.com/drive/folders/1mu1QKuC4t2PugPl-24VtysphrwLfyXzt?usp=sharing">click here</a>

In [1]:
import pandas as pd

# Load the dataset
file_path = "Telco-Customer-Churn.json"
df = pd.read_json(file_path)

# Inspect the first few rows of the dataset
print(df.head())

# Check for any missing values
print(df.isnull().sum())

# Get a summary of the dataset
print(df.info())


   customerID Churn                                           customer  \
0  0002-ORFBO    No  {'gender': 'Female', 'SeniorCitizen': 0, 'Part...   
1  0003-MKNFE    No  {'gender': 'Male', 'SeniorCitizen': 0, 'Partne...   
2  0004-TLHLJ   Yes  {'gender': 'Male', 'SeniorCitizen': 0, 'Partne...   
3  0011-IGKFF   Yes  {'gender': 'Male', 'SeniorCitizen': 1, 'Partne...   
4  0013-EXCHZ   Yes  {'gender': 'Female', 'SeniorCitizen': 1, 'Part...   

                                             phone  \
0   {'PhoneService': 'Yes', 'MultipleLines': 'No'}   
1  {'PhoneService': 'Yes', 'MultipleLines': 'Yes'}   
2   {'PhoneService': 'Yes', 'MultipleLines': 'No'}   
3   {'PhoneService': 'Yes', 'MultipleLines': 'No'}   
4   {'PhoneService': 'Yes', 'MultipleLines': 'No'}   

                                            internet  \
0  {'InternetService': 'DSL', 'OnlineSecurity': '...   
1  {'InternetService': 'DSL', 'OnlineSecurity': '...   
2  {'InternetService': 'Fiber optic', 'OnlineSecu...   
3  {'I

In [2]:
# Expand the 'customer', 'phone', 'internet', and 'account' columns
customer_df = pd.json_normalize(df['customer'])
phone_df = pd.json_normalize(df['phone'])
internet_df = pd.json_normalize(df['internet'])
account_df = pd.json_normalize(df['account'])

# Combine the expanded dataframes with the original dataframe
df_flattened = pd.concat([df[['customerID', 'Churn']], customer_df, phone_df, internet_df, account_df], axis=1)

# Inspect the flattened data
print(df_flattened.head())

# Check for any missing values in the new dataframe
print(df_flattened.isnull().sum())


   customerID Churn  gender  SeniorCitizen Partner Dependents  tenure  \
0  0002-ORFBO    No  Female              0     Yes        Yes       9   
1  0003-MKNFE    No    Male              0      No         No       9   
2  0004-TLHLJ   Yes    Male              0      No         No       4   
3  0011-IGKFF   Yes    Male              1     Yes         No      13   
4  0013-EXCHZ   Yes  Female              1     Yes         No       3   

  PhoneService MultipleLines InternetService  ... OnlineBackup  \
0          Yes            No             DSL  ...          Yes   
1          Yes           Yes             DSL  ...           No   
2          Yes            No     Fiber optic  ...           No   
3          Yes            No     Fiber optic  ...          Yes   
4          Yes            No     Fiber optic  ...           No   

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0               No         Yes         Yes              No        One year   
1       
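The flattening step above can be illustrated on a tiny example. This is a minimal sketch with two made-up records (the IDs and values are hypothetical, not taken from the data file); it shows how `pd.json_normalize` turns a nested `Charges` dictionary into the dotted column names `Charges.Monthly` and `Charges.Total` that the next cell relies on.

```python
import pandas as pd

# Two hypothetical records shaped like the nested rows printed above
records = [
    {"customerID": "A-1", "Churn": "No",
     "customer": {"gender": "Female", "tenure": 9},
     "account": {"Contract": "One year",
                 "Charges": {"Monthly": 65.6, "Total": "593.3"}}},
    {"customerID": "B-2", "Churn": "Yes",
     "customer": {"gender": "Male", "tenure": 4},
     "account": {"Contract": "Month-to-month",
                 "Charges": {"Monthly": 73.9, "Total": "280.85"}}},
]
raw = pd.DataFrame(records)

# Expand each dictionary column and place the pieces side by side
parts = [raw[["customerID", "Churn"]]]
for col in ["customer", "account"]:
    parts.append(pd.json_normalize(raw[col].tolist()))
flat = pd.concat(parts, axis=1)

# Nested keys are joined with "." by default, e.g. "Charges.Monthly"
print(flat.columns.tolist())
```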

In [3]:
from sklearn.preprocessing import LabelEncoder

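# Convert 'Churn' to binary values (1 for "Yes", 0 for everything else).
# Note: any value other than "Yes", including a blank or missing label, is mapped to 0.
df_flattened['Churn'] = df_flattened['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)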

# Convert categorical columns using Label Encoding
# (codes follow the sorted order of the labels, so the numeric order carries no meaning)
categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                    'PaperlessBilling', 'PaymentMethod']

le = LabelEncoder()
for col in categorical_cols:
    df_flattened[col] = le.fit_transform(df_flattened[col])

# Convert 'Charges.Monthly' and 'Charges.Total' to numeric (float) types
df_flattened['Charges.Monthly'] = pd.to_numeric(df_flattened['Charges.Monthly'], errors='coerce')
df_flattened['Charges.Total'] = pd.to_numeric(df_flattened['Charges.Total'], errors='coerce')

# Drop any rows with missing values in 'Charges.Monthly' and 'Charges.Total' if there are any
df_flattened.dropna(subset=['Charges.Monthly', 'Charges.Total'], inplace=True)

# Inspect the preprocessed data
print(df_flattened.head())
print(df_flattened.info())



   customerID  Churn  gender  SeniorCitizen  Partner  Dependents  tenure  \
0  0002-ORFBO      0       0              0        1           1       9   
1  0003-MKNFE      0       1              0        0           0       9   
2  0004-TLHLJ      1       1              0        0           0       4   
3  0011-IGKFF      1       1              1        1           0      13   
4  0013-EXCHZ      1       0              1        1           0       3   

   PhoneService  MultipleLines  InternetService  ...  OnlineBackup  \
0             1              0                0  ...             2   
1             1              2                0  ...             0   
2             1              0                1  ...             0   
3             1              0                1  ...             2   
4             1              0                1  ...             0   

   DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  \
0                 0            2            2 
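Label encoding gives nominal features such as `Contract` or `PaymentMethod` integer codes, which a linear model like logistic regression will read as an ordered scale. One common alternative is one-hot encoding. The sketch below uses a small hypothetical frame rather than the real data, and is meant as an option to compare against, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the flattened data
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "PaymentMethod": ["Mailed check", "Electronic check",
                      "Mailed check", "Credit card (automatic)"],
    "tenure": [4, 9, 40, 13],
})

# One indicator column per category; drop_first avoids a redundant column
encoded = pd.get_dummies(
    toy, columns=["Contract", "PaymentMethod"], drop_first=True, dtype=int
)
print(encoded.head())
```

With `drop_first=True`, the first category of each feature (in sorted order) becomes the implicit baseline, so the remaining indicator coefficients are read relative to it.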

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Drop 'customerID' (an identifier with no predictive value) and the target 'Churn' from the features
X = df_flattened.drop(['customerID', 'Churn'], axis=1)
y = df_flattened['Churn']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
logreg = LogisticRegression(max_iter=1000)

# Train the model
logreg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logreg.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nAccuracy Score:", accuracy_score(y_test, y_pred))


Confusion Matrix:
[[993 106]
 [159 194]]

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.90      0.88      1099
           1       0.65      0.55      0.59       353

    accuracy                           0.82      1452
   macro avg       0.75      0.73      0.74      1452
weighted avg       0.81      0.82      0.81      1452


Accuracy Score: 0.8174931129476584


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
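The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows, assuming a `StandardScaler` placed in front of the classifier in a scikit-learn pipeline. The data here is synthetic (one tenure-like and one charges-like column on very different scales), so the score it prints says nothing about the churn data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features on very different scales (illustrative only)
X_demo = np.column_stack([
    rng.integers(0, 72, 500),     # tenure-like, months
    rng.uniform(20, 8000, 500),   # total-charges-like
])
y_demo = (X_demo[:, 0] + rng.normal(0, 15, 500) < 24).astype(int)

# Standardize each feature, then fit the logistic regression
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

print("iterations used:", scaled_logreg[-1].n_iter_[0])
print("training accuracy:", scaled_logreg.score(X_demo, y_demo))
```

Because the scaler sits inside the pipeline, it is fitted on the training data only and then applied unchanged at prediction time.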


# SMOTE for Oversampling

In [5]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Apply SMOTE to balance the classes
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Initialize the Logistic Regression model
logreg_smote = LogisticRegression(max_iter=1000)

# Train the model on the balanced data
logreg_smote.fit(X_train_smote, y_train_smote)

# Make predictions on the test set
y_pred_smote = logreg_smote.predict(X_test)

# Evaluate the model
print("Confusion Matrix (with SMOTE):")
print(confusion_matrix(y_test, y_pred_smote))

print("\nClassification Report (with SMOTE):")
print(classification_report(y_test, y_pred_smote))

print("\nAccuracy Score (with SMOTE):", accuracy_score(y_test, y_pred_smote))


Confusion Matrix (with SMOTE):
[[839 260]
 [ 79 274]]

Classification Report (with SMOTE):
              precision    recall  f1-score   support

           0       0.91      0.76      0.83      1099
           1       0.51      0.78      0.62       353

    accuracy                           0.77      1452
   macro avg       0.71      0.77      0.72      1452
weighted avg       0.82      0.77      0.78      1452


Accuracy Score (with SMOTE): 0.7665289256198347


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
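To make the trade-off between the two logistic regression runs explicit, the churn-class precision and recall can be recomputed directly from the confusion matrices printed above. The helper name `churn_metrics` is introduced here for illustration only.

```python
import numpy as np

def churn_metrics(cm):
    """Return (precision, recall) for the positive (churn) class."""
    # scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return round(float(precision), 2), round(float(recall), 2)

baseline = churn_metrics([[993, 106], [159, 194]])
with_smote = churn_metrics([[839, 260], [79, 274]])

print("baseline   (precision, recall):", baseline)    # (0.65, 0.55)
print("with SMOTE (precision, recall):", with_smote)  # (0.51, 0.78)
```

SMOTE raises recall on churners from 0.55 to 0.78 at the cost of precision, which drops from 0.65 to 0.51. Which side of that trade-off matters more depends on the cost of a missed churner relative to the cost of an unnecessary retention offer.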


# Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the Random Forest model
rf = RandomForestClassifier(random_state=42, n_estimators=100)

# Train the model
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf.predict(X_test)

# Evaluate the model
print("Confusion Matrix (Random Forest):")
print(confusion_matrix(y_test, y_pred_rf))

print("\nClassification Report (Random Forest):")
print(classification_report(y_test, y_pred_rf))

print("\nAccuracy Score (Random Forest):", accuracy_score(y_test, y_pred_rf))


Confusion Matrix (Random Forest):
[[981 118]
 [190 163]]

Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.84      0.89      0.86      1099
           1       0.58      0.46      0.51       353

    accuracy                           0.79      1452
   macro avg       0.71      0.68      0.69      1452
weighted avg       0.78      0.79      0.78      1452


Accuracy Score (Random Forest): 0.7878787878787878
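The assignment also asks for feature importance analysis and for simulating a retention strategy by altering the data. The sketch below shows one possible shape for that step. It runs on synthetic stand-in data with an invented churn relationship, and the strategy (moving month-to-month customers to a one-year contract) is hypothetical, so the numbers it prints are not results for the telecom data set. Note also that a change in predicted probability is a what-if on the model, not evidence of a causal effect.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data; the relationship below is invented for illustration
demo = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "Contract": rng.integers(0, 3, n),   # 0 = month-to-month, 1 = one year, 2 = two year
    "MonthlyCharge": rng.uniform(20, 120, n),
})
logit = 1.0 - 0.04 * demo["tenure"] - 1.2 * demo["Contract"] + 0.01 * demo["MonthlyCharge"]
churn = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(demo, churn)

# Rank features by the forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))

# Hypothetical strategy: move every month-to-month customer to a one-year contract
scenario = demo.copy()
scenario.loc[scenario["Contract"] == 0, "Contract"] = 1

before = model.predict_proba(demo)[:, 1].mean()
after = model.predict_proba(scenario)[:, 1].mean()
print(f"mean predicted churn probability: {before:.3f} -> {after:.3f}")
```

On the real data, the same pattern would apply to `X_test` and the fitted `rf` model: copy the features, alter the columns a strategy targets, and compare the average predicted churn probability before and after.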
