### AIM
Perform Exploratory Data Analysis (EDA) on customer churn data from the telecommunications industry. Although there is no need to build a model from the data provided, you are asked to look for issues in the data and to find correlations among the variables in order to improve customer churn predictions.


### What is Churn Rate?
Churn rate is a critical metric of customer satisfaction. Low churn rates mean happy customers; high churn rates mean customers are leaving you. Even a small monthly or quarterly churn rate compounds over time: 1% monthly churn translates to roughly 11.4% yearly churn (1 - 0.99^12 ≈ 0.114).
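
The compounding effect can be checked with a few lines of Python. This is a minimal sketch that assumes a constant monthly churn rate applied to the customers still remaining each month; the function name `annual_churn` is illustrative and not part of the assignment.

```python
def annual_churn(monthly_rate: float, months: int = 12) -> float:
    """Fraction of an initial cohort lost after `months`, assuming a constant monthly churn rate."""
    retained = (1.0 - monthly_rate) ** months  # share of customers still active
    return 1.0 - retained


# Compare a few hypothetical monthly rates
for rate in (0.01, 0.02, 0.05):
    print(f"{rate:.0%} monthly churn -> {annual_churn(rate):.1%} yearly churn")
```

Because each month's loss applies to a shrinking base, the yearly figure is slightly below twelve times the monthly rate.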

### Instructions
The data should be investigated in two ways:
1) Manually, using the classic (legacy) EDA libraries: NumPy, Pandas, plotting libraries (Matplotlib, Seaborn, Plotly), and Python’s statsmodels modules.
2) By generating HTML reports with the Pandas Profiling and SweetViz Python libraries.

The analysis of the data should focus on predicting customer churn rate.

## Python Code for EDA of Telecom Customer Churn Data

### Mounting Google Drive in order to import the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing the classic EDA libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm

### Storing the dataset path in a variable and reading the CSV into a DataFrame

In [None]:
file_path = '/content/drive/MyDrive/telco-customer-churn.csv'
telecom_df = pd.read_csv(file_path)

### Displaying the loaded dataset in the Colab/Python notebook

In [None]:
telecom_df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


### A look at the top 5 rows of the telecom churn dataset

In [None]:
telecom_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Shape of the dataset (Rows,Columns)

In [None]:
telecom_df.shape

(7043, 21)

### Data Types present in the DataSet

In [None]:
telecom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### Finding duplicate values in dataset

In [None]:
telecom_df.duplicated().sum()

0

### Checking for Missing/Null Values

In [None]:
telecom_df.isnull().sum()

Unnamed: 0,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0
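
`isnull()` only counts true NaN/None values, so placeholders such as blank strings would not show up in the counts above. The following is a minimal sketch of one way to surface them. It uses a toy frame with made-up values; the blank `TotalCharges` entry is an assumption for illustration, not a row taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# isnull() reports nothing: a blank string is still a valid, non-null value.
print(toy["TotalCharges"].isnull().sum())

# Coercing to numeric turns unparseable entries into NaN, which can then be counted.
as_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
hidden_missing = as_numeric.isna() & toy["TotalCharges"].notna()

print(hidden_missing.sum())     # number of disguised missing values
print(toy.loc[hidden_missing])  # the offending rows
```

Inspecting the offending rows before imputing them helps decide whether a mean fill, a zero, or dropping the rows is the most sensible treatment.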


### Finding interquartile range for tenure of subscription




In [None]:
Q1 = telecom_df['tenure'].quantile(0.25)
Q3 = telecom_df['tenure'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("Q1 is ",Q1)
print("Q3 is ",Q3)
print("IQR is ",IQR)
print("Lower Bound is ",lower_bound)
print("Upper Bound is ",upper_bound)

Q1 is  9.0
Q3 is  55.0
IQR is  46.0
Lower Bound is  -60.0
Upper Bound is  124.0


### Finding interquartile range for monthly charges

In [None]:
Q1 = telecom_df['MonthlyCharges'].quantile(0.25)
Q3 = telecom_df['MonthlyCharges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("Q1 is ",Q1)
print("Q3 is ",Q3)
print("IQR is ",IQR)
print("Lower Bound is ",lower_bound)
print("Upper Bound is ",upper_bound)

Q1 is  35.5
Q3 is  89.85
IQR is  54.349999999999994
Lower Bound is  -46.02499999999999
Upper Bound is  171.375


### Finding interquartile range for Total charges

### TotalCharges contains non-numeric string values in some rows, so it must be converted to a numeric type first.

In [None]:
# errors='coerce' turns unparseable strings into NaN instead of raising, so no try/except is needed
telecom_df['TotalCharges'] = pd.to_numeric(telecom_df['TotalCharges'], errors='coerce')
Q1 = telecom_df['TotalCharges'].quantile(0.25)
Q3 = telecom_df['TotalCharges'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("Q1 is ",Q1)
print("Q3 is ",Q3)
print("IQR is ",IQR)
print("Lower Bound is ",lower_bound)
print("Upper Bound is ",upper_bound)

Q1 is  401.45
Q3 is  3794.7375
IQR is  3393.2875000000004
Lower Bound is  -4688.481250000001
Upper Bound is  8884.66875
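
The three IQR cells above repeat the same arithmetic, and they stop at printing the fences without counting how many values fall outside them. A small helper could cover both steps. This is a sketch only: the names `iqr_bounds` and `count_outliers` are hypothetical, and the toy `tenure` values (including the extreme 400) are invented to make the example self-contained.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> dict:
    """Return quartiles, IQR and Tukey fences for a numeric series (NaN values are ignored)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return {"Q1": q1, "Q3": q3, "IQR": iqr,
            "lower": q1 - k * iqr, "upper": q3 + k * iqr}


def count_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying strictly outside the Tukey fences."""
    bounds = iqr_bounds(series, k)
    outside = (series < bounds["lower"]) | (series > bounds["upper"])
    return int(outside.sum())


# Toy data: eight plausible tenures plus one deliberately extreme value
toy = pd.DataFrame({"tenure": [1, 2, 9, 24, 34, 45, 55, 72, 400]})
print(iqr_bounds(toy["tenure"]))
print(count_outliers(toy["tenure"]))
```

On the real data the same helper could be applied in a loop over `tenure`, `MonthlyCharges` and `TotalCharges`; the fences are only informative once they are compared with the observed values.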


### Transforming the whole data into Numeric Values

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

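# Iterate through the columns and label-encode every object-typed column.
# Note: this also converts customerID, a unique identifier, into arbitrary integers,
# and the single encoder is refit for each column, so the label mappings are not kept.
for column in telecom_df.columns:
  if telecom_df[column].dtype == object:
    telecom_df[column] = le.fit_transform(telecom_df[column])

# Display the updated DataFrame (info() prints directly and returns None)
telecom_df.info()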

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   int64  
 1   gender            7043 non-null   int64  
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   int64  
 4   Dependents        7043 non-null   int64  
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   int64  
 7   MultipleLines     7043 non-null   int64  
 8   InternetService   7043 non-null   int64  
 9   OnlineSecurity    7043 non-null   int64  
 10  OnlineBackup      7043 non-null   int64  
 11  DeviceProtection  7043 non-null   int64  
 12  TechSupport       7043 non-null   int64  
 13  StreamingTV       7043 non-null   int64  
 14  StreamingMovies   7043 non-null   int64  
 15  Contract          7043 non-null   int64  
 16  PaperlessBilling  7043 non-null   int64  


In [None]:
telecom_df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,5375,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,1,2,29.85,29.85,0
1,3962,1,0,0,0,34,1,0,0,2,...,2,0,0,0,1,0,3,56.95,1889.50,0
2,2564,1,0,0,0,2,1,0,0,2,...,0,0,0,0,0,1,3,53.85,108.15,1
3,5535,1,0,0,0,45,0,1,0,2,...,2,2,0,0,1,0,0,42.30,1840.75,0
4,6511,0,0,0,0,2,1,0,1,0,...,0,0,0,0,0,1,2,70.70,151.65,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,4853,1,0,1,1,24,1,2,0,2,...,2,2,2,2,1,1,3,84.80,1990.50,0
7039,1525,0,0,1,1,72,1,2,1,0,...,2,0,2,2,1,1,1,103.20,7362.90,0
7040,3367,0,0,1,1,11,0,1,0,2,...,0,0,0,0,0,1,2,29.60,346.45,0
7041,5934,1,1,1,0,4,1,2,1,0,...,0,0,0,0,0,1,3,74.40,306.60,1
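
The encoded frame above shows that `customerID` was turned into integers along with the genuine categorical columns, and that the label-to-code mappings were discarded. One possible alternative, sketched here on a toy frame, is to set identifier columns aside and keep one fitted encoder per column. The `id_columns` list and the `encoders` dictionary are illustrative names, and treating `customerID` as non-predictive is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same kinds of columns as the dataset
toy = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

id_columns = ["customerID"]  # assumption: identifiers carry no predictive signal
encoders = {}                # column name -> fitted LabelEncoder

encoded = toy.drop(columns=id_columns)
for column in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[column]):
        encoders[column] = LabelEncoder()
        encoded[column] = encoders[column].fit_transform(encoded[column])

# The stored encoders document the mapping and allow decoding later.
print(list(encoders["Contract"].classes_))
print(encoders["Churn"].inverse_transform(encoded["Churn"]))
print(encoded)
```

Keeping the encoders makes it possible to read later results in terms of the original labels rather than opaque integer codes.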


### Present dependencies and correlations among the various features in the data. List the most important variables (Feature Importance) that will affect the target label.

In [None]:
from sklearn.ensemble import RandomForestClassifier

X = telecom_df.drop('Churn', axis=1)
y = telecom_df['Churn']

model = RandomForestClassifier()
model.fit(X, y)
feature_importances = model.feature_importances_

# Print or visualize feature importances
print(feature_importances)

[0.13017526 0.02296104 0.0177689  0.01974807 0.01702667 0.14134218
 0.00459351 0.01971554 0.02669558 0.04905477 0.02355644 0.02258809
 0.03987391 0.01458903 0.01509756 0.06694652 0.02219364 0.04330437
 0.14554709 0.15722183]
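
The raw array is hard to interpret because the values are not tied to column names. Since `X` keeps the column order of `telecom_df`, the first entry (about 0.13) belongs to the encoded `customerID`, which may indicate that the forest is partly fitting an identifier rather than a real signal. The sketch below shows one way to label and sort importances and to list correlations with the target. It runs on synthetic data with a hypothetical churn rule, so the numbers it prints say nothing about the telecom dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded data (random values, not real customers)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, n),
    "MonthlyCharges": rng.uniform(18, 120, n),
    "noise": rng.normal(size=n),
})
# Hypothetical rule: short-tenure customers churn more often
toy["Churn"] = (toy["tenure"] + rng.normal(0, 10, n) < 20).astype(int)

X_toy = toy.drop(columns="Churn")
y_toy = toy["Churn"]

# Linear association of each feature with the target
correlations = toy.corr()["Churn"].drop("Churn").sort_values()
print(correlations)

# Importances labeled with column names and sorted from most to least important
rf_toy = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)
importances = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Fixing `random_state` makes the ranking reproducible between runs, which the unseeded forest above does not guarantee.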


### Split the dataset into training and test datasets (80/20 ratio). Using SweetViz’s ‘compare’ command, contrast the training vs. test datasets on the target (‘Churn’)

In [None]:
pip install sweetviz

Collecting sweetviz
  Downloading sweetviz-2.3.1-py3-none-any.whl.metadata (24 kB)
Downloading sweetviz-2.3.1-py3-none-any.whl (15.1 MB)
Installing collected packages: sweetviz
Successfully installed sweetviz-2.3.1


In [None]:
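from sklearn.model_selection import train_test_split
from sweetviz import compare

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Re-attach the target so SweetViz can contrast the two sets on 'Churn'
train_df = X_train.assign(Churn=y_train)
test_df = X_test.assign(Churn=y_test)

# Compare training and testing sets using SweetViz, with 'Churn' as the target feature
compare([train_df, "Training"], [test_df, "Test"], target_feat="Churn").show_html()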

Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
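
Alongside the HTML report, the train/test contrast on the target can be checked numerically. The sketch below uses a synthetic frame with an assumed churn share of about 25%, and adds `stratify`, which the split above does not use, to keep the churn rate nearly identical in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 customers, roughly a quarter of whom churn
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, 1000),
    "Churn": rng.choice([0, 1], size=1000, p=[0.75, 0.25]),
})

# stratify keeps the class proportions the same in both splits
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Churn"]
)

summary = pd.DataFrame(
    {
        "rows": [len(train_part), len(test_part)],
        "churn_rate": [train_part["Churn"].mean(), test_part["Churn"].mean()],
    },
    index=["train", "test"],
)
print(summary)
```

If the two churn rates differ noticeably without stratification, test metrics may partly reflect the split rather than the model.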


### Data Preprocessing

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Data Imputation (Handle missing values)
# Replace missing values with the mean of the column
imputer = SimpleImputer(strategy='mean')
telecom_df['TotalCharges'] = imputer.fit_transform(telecom_df[['TotalCharges']])

# Feature Scaling
# Standardize the continuous features (zero mean, unit variance) using StandardScaler
scaler = StandardScaler()
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
telecom_df[numerical_features] = scaler.fit_transform(telecom_df[numerical_features])

# Categorical features were already label-encoded above.
# Preview a one-hot encoding of three binary columns; the result is only displayed,
# not assigned, so telecom_df itself is unchanged.
pd.get_dummies(telecom_df, columns=['gender', 'Partner', 'Dependents'])

Unnamed: 0,customerID,SeniorCitizen,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,...,PaymentMethod,MonthlyCharges,TotalCharges,Churn,gender_0,gender_1,Partner_0,Partner_1,Dependents_0,Dependents_1
0,5375,0,-1.277445,0,1,0,0,2,0,0,...,2,-1.160323,-0.994971,0,True,False,False,True,True,False
1,3962,0,0.066327,1,0,0,2,0,2,0,...,3,-0.259629,-0.173876,0,False,True,True,False,True,False
2,2564,0,-1.236724,1,0,0,2,2,0,0,...,3,-0.362660,-0.960399,1,False,True,True,False,True,False
3,5535,0,0.514251,0,1,0,2,0,2,2,...,0,-0.746535,-0.195400,0,False,True,True,False,True,False
4,6511,0,-1.236724,1,0,1,0,0,0,0,...,2,0.197365,-0.941193,1,True,False,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,4853,0,-0.340876,1,2,0,2,0,2,2,...,3,0.665992,-0.129281,0,False,True,False,True,False,True
7039,1525,0,1.613701,1,2,1,0,2,2,0,...,1,1.277533,2.242808,0,True,False,False,True,False,True
7040,3367,0,-0.870241,0,1,0,2,0,0,0,...,2,-1.168632,-0.855182,0,True,False,False,True,False,True
7041,5934,1,-1.155283,1,2,1,0,0,0,0,...,3,0.320338,-0.872777,1,False,True,False,True,True,False
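
The cell above imputes and scales the full dataset before any split, so statistics from rows that later land in the test set influence the training features. A common way to avoid this is to wrap the steps in a scikit-learn `Pipeline` that is fitted on the training rows only. The sketch below is one possible arrangement on a toy frame with illustrative values; the step names and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with illustrative values and one missing TotalCharges entry
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10, 28, 62],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70,
                       99.65, 89.10, 29.75, 104.80, 56.15],
    "TotalCharges": [29.85, 1889.5, 108.15, 1840.75, 151.65,
                     np.nan, 1949.4, 301.9, 3046.05, 3487.95],
    "SeniorCitizen": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})
numeric_features = ["tenure", "MonthlyCharges", "TotalCharges"]

# Impute then scale the continuous columns; pass the rest through untouched
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_features)], remainder="passthrough"
)

train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)
train_ready = preprocess.fit_transform(train_toy)  # statistics learned from training rows only
test_ready = preprocess.transform(test_toy)        # the same statistics reused on test rows
print(train_ready.shape, test_ready.shape)
```

The transformed arrays place the three scaled columns first, followed by the passthrough column.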


### Addressing Data Imbalance using SMOTE



In [None]:
from imblearn.over_sampling import SMOTE

# Separate features (X) and target variable (y)
X = telecom_df.drop('Churn', axis=1)
y = telecom_df['Churn']

# Apply SMOTE to address class imbalance
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

### Splitting the data into 80/20 ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
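
Because resampling happens before the split here, synthetic minority rows can end up in the test set, which tends to make the reported scores optimistic. The ordering can be reversed: split first, then balance only the training portion. The sketch below illustrates that ordering with plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE (a deliberately simpler technique, swapped in to keep the example dependency-free), on synthetic data with an assumed 25% churn share.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the encoded dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, 1000),
    "MonthlyCharges": rng.uniform(18, 120, 1000),
    "Churn": rng.choice([0, 1], size=1000, p=[0.75, 0.25]),
})

# 1) Split first, so the test set contains only real, untouched rows
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Churn"]
)

# 2) Balance the training set only (random oversampling stands in for SMOTE)
majority = train_part[train_part["Churn"] == 0]
minority = train_part[train_part["Churn"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled])

print(train_balanced["Churn"].value_counts())
print(test_part["Churn"].value_counts(normalize=True))
```

The test set keeps the original class imbalance, so metrics computed on it reflect the distribution the model would face in practice.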

### Applying various models and performing evaluation.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

scaler = StandardScaler()  # Create a StandardScaler object
X_train = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test = scaler.transform(X_test)  # Transform the test data using the fitted scaler

# Initialize models
naive_bayes = GaussianNB()
logistic_regression = LogisticRegression(max_iter=1000, solver='saga')
random_forest = RandomForestClassifier()
xgboost = XGBClassifier()

# Train and evaluate models
models = [naive_bayes, logistic_regression, random_forest, xgboost]
model_names = ['Naive Bayes', 'Logistic Regression', 'Random Forests', 'XGBoost']

for model, name in zip(models, model_names):
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)

  accuracy = accuracy_score(y_test, y_pred)
  precision = precision_score(y_test, y_pred)
  recall = recall_score(y_test, y_pred)
  f1 = f1_score(y_test, y_pred)

  print(f"--- {name} ---")
  print(f"Accuracy: {accuracy:.4f}")
  print(f"Precision: {precision:.4f}")
  print(f"Recall: {recall:.4f}")
  print(f"F1-Score: {f1:.4f}")
  print()

--- Naive Bayes ---
Accuracy: 0.8101
Precision: 0.7929
Recall: 0.8465
F1-Score: 0.8188

--- Logistic Regression ---
Accuracy: 0.8324
Precision: 0.8042
Recall: 0.8847
F1-Score: 0.8425

--- Random Forests ---
Accuracy: 0.8580
Precision: 0.8599
Recall: 0.8599
F1-Score: 0.8599

--- XGBoost ---
Accuracy: 0.8527
Precision: 0.8683
Recall: 0.8360
F1-Score: 0.8519
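
The four summary scores do not show where each model makes its mistakes. A confusion matrix, a per-class report and ROC AUC give a fuller picture. This is a minimal sketch on synthetic data from `make_classification`; the `_demo` variable names are chosen to avoid overwriting the notebook's own variables, and the printed numbers are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 25% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred_demo = clf_demo.predict(X_te)
y_score_demo = clf_demo.predict_proba(X_te)[:, 1]  # probability of the positive class

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred_demo)
print(cm)

# Precision, recall and F1 for each class separately
print(classification_report(y_te, y_pred_demo, target_names=["No churn", "Churn"]))

# Threshold-independent ranking quality
auc = roc_auc_score(y_te, y_score_demo)
print(f"ROC AUC: {auc:.4f}")
```

For churn work, recall on the churn class is often the number to watch, since a missed churner usually costs more than a false alarm.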



### Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier()

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found
print("Best Parameters:", grid_search.best_params_)

# Evaluate the model with the best parameters
best_rf_classifier = grid_search.best_estimator_
y_pred = best_rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Best Parameters:", accuracy)

Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Accuracy with Best Parameters: 0.8599033816425121
