# Customer Churn Predictor

### Business Understanding

#### **Problem Statement**
Telco, a telecommunications company, wants to improve its customer retention strategies by predicting customer churn. Churn refers to customers discontinuing their service within a specified period. By identifying the patterns and factors that contribute to churn, Telco can implement targeted interventions to improve customer retention.

#### **Stakeholders:**
     - Chief Marketing Officer (CMO) Telco
     - Customer Service Director Telco
     - Chief Data Officer (CDO) Telco
     
#### **Key Metrics and Success Criteria**
     1. Accuracy - The model should reach an accuracy of at least 85% on balanced data. As a general baseline, a good model is expected to score above 0.80 (80%).
     2. Precision and recall - The model should achieve precision and recall of at least 80% each, so that it is reliable when it predicts churn and identifies most of the actual churners.
     3. Minimum F1 score - The F1 score should be at least 0.75. F1 balances the trade-off between precision and recall, so it stays informative even when the class distribution is imbalanced.
     4. AUC-ROC score - This should be at least 0.85. A high AUC-ROC indicates that the model separates churning from non-churning customers effectively.
     5. Confusion matrix - The number of false negatives (FN) should be kept low so that most churn cases are identified.
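
The criteria above can be checked directly from confusion-matrix counts. The sketch below is only an illustration: the counts `tp`, `fp`, `fn` and `tn` are hypothetical and are not results from this notebook. AUC-ROC is left out because it needs predicted probabilities rather than counts.

```python
# Hypothetical confusion-matrix counts (assumed for illustration only).
tp, fp, fn, tn = 80, 20, 15, 85

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # share of predicted churners who really churned
recall = tp / (tp + fn)             # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

# Thresholds taken from the success criteria listed above.
criteria = {
    "accuracy >= 0.85": accuracy >= 0.85,
    "precision >= 0.80": precision >= 0.80,
    "recall >= 0.80": recall >= 0.80,
    "f1 >= 0.75": f1 >= 0.75,
}

for name, passed in criteria.items():
    print(f"{name}: {'met' if passed else 'not met'}")
```

With these assumed counts the model would meet the precision, recall and F1 thresholds but miss the accuracy target, which shows why the criteria are evaluated together rather than one at a time.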
     

   


#### Features
    - customerID -- A unique customer identifier

    - gender -- Whether the customer is male or female

    - SeniorCitizen -- Whether the customer is a senior citizen or not

    - Partner -- Whether the customer has a partner or not (Yes, No)

    - Dependents -- Whether the customer has dependents or not (Yes, No)

    - tenure -- Number of months the customer has stayed with the company

    - PhoneService -- Whether the customer has a phone service or not (Yes, No)

    - MultipleLines -- Whether the customer has multiple lines or not

    - InternetService -- The customer's internet service type (DSL, Fiber optic, No)

    - OnlineSecurity -- Whether the customer has online security or not (Yes, No, No internet service)

    - OnlineBackup -- Whether the customer has online backup or not (Yes, No, No internet service)

    - DeviceProtection -- Whether the customer has device protection or not (Yes, No, No internet service)

    - TechSupport -- Whether the customer has tech support or not (Yes, No, No internet service)

    - StreamingTV -- Whether the customer has streaming TV or not (Yes, No, No internet service)

    - StreamingMovies -- Whether the customer has streaming movies or not (Yes, No, No internet service)

    - Contract -- The contract term of the customer (Month-to-month, One year, Two year)

    - PaperlessBilling -- Whether the customer has paperless billing or not (Yes, No)

    - PaymentMethod -- The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

    - MonthlyCharges -- The amount charged to the customer monthly

    - TotalCharges -- The total amount charged to the customer

    - Churn -- Whether the customer churned or not (Yes or No)

#### **Null Hypothesis**
(H0) There is no significant difference in churn rates among customers with different contract types.

#### **Alternative Hypothesis**
(H1) There is a significant difference in churn rates among customers with different contract types.

#### Analytical Questions
    1. What is the churn percentage for each payment method?
    2. How do key demographic factors (i.e., 'gender', 'Partner', 'SeniorCitizen', 'Dependents') influence customer churn?
    3. How does the tenure of a customer impact their likelihood of churning?
    4. Is there a significant correlation between the type of internet service and customer churn?
    5. Do customers with multiple services (e.g., phone service, internet service) show different churn rates compared to those with fewer services?
    6. How do different contract types affect customer churn rates?
   
    


    
    


### Data Understanding

#### **Importations**

In [None]:
# Data Manipulation Packages 

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import pyodbc
from dotenv import dotenv_values
import scipy.stats as stats
import warnings

from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, QuantileTransformer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, roc_curve, auc, precision_score, recall_score, f1_score 
from sklearn.model_selection import cross_val_score, GridSearchCV
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline as imbPipeline



warnings.filterwarnings('ignore')



#### **Load Datasets**

In [None]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values (r'C:\Users\Admin\OneDrive\OneDrive-Azubi\Customer-Churn-Prediction-\.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get('SERVER')
database = environment_variables.get('DATABASE')
username = environment_variables.get('USERNAME')
password = environment_variables.get('PASSWORD')

# Create a connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"


connection = pyodbc.connect(connection_string)

In [None]:
# Load the first 3,000 records from the database
query = "SELECT * FROM LP2_Telco_churn_first_3000"

data = pd.read_sql(query, connection)

data.head()

In [None]:
data.info()

In [None]:
# Load the next 2,000 records from the CSV file
df=pd.read_csv('../data/LP2_Telco-churn-second-2000.csv')
df.head()

In [None]:
df.columns

In [None]:
df.info()

#### **Merge the Train Datasets**

In [None]:
# Combine DataFrames
churn_prime = pd.concat([data, df], ignore_index=True)

churn_prime.head()

In [None]:
# Convert all True values to 'Yes' and False values to 'No' for data consistency

churn_prime.replace(True, 'Yes', inplace=True)
churn_prime.replace(False, 'No', inplace=True)

churn_prime.head()


In [None]:
# Change the TotalCharges datatype to float (non-numeric entries become NaN)

churn_prime['TotalCharges'] = pd.to_numeric(churn_prime['TotalCharges'], errors='coerce')

#### **Exploratory Data Analysis (EDA)**

 - Data Quality Assessment & Exploring data 

In [None]:
churn_prime.shape

In [None]:
churn_prime.info()

In [None]:
# Checking for duplicates 
churn_prime.duplicated().sum() 

In [None]:
# Missing values with their percentages 
churn_prime.isnull().sum().to_frame('Null Count').assign(Percentage=lambda x: (x['Null Count'] / len(churn_prime)) * 100)

In [None]:
#Statistical  Analysis of numeric values

churn_prime.describe().T

In [None]:
# Overview Analysis of categorical columns 

churn_prime.describe(include= 'object').T

In [None]:
# Columns in our combined dataset 

columns= churn_prime.columns
columns

In [None]:
# Unique values in each column

for column in columns:
    print(f'{column}')
    print(f'There are {churn_prime[column].unique().size} unique values')
    print(f'These are {churn_prime[column].unique()}')
    print('=' * 50)

#### **Univariate Analysis**


* For the numerical columns, we plotted histograms to inspect the distributions. The data is not normally distributed: three of the plots are bimodal rather than bell-shaped, while TotalCharges is unimodal with a long tail.

In [None]:
# Distribution of Numerical Feature
churn_prime.hist(figsize= (14,10),grid=False, color='skyblue')
plt.show()

* Checking for outliers 

In [None]:
# Create a figure with the specified size
plt.figure(figsize=(14,12))
sns.kdeplot(churn_prime.drop(['SeniorCitizen','TotalCharges'], axis=1), color='skyblue')
plt.grid(False)
plt.show()

In [None]:
plt.figure(figsize=(14,10))
sns.kdeplot(churn_prime['SeniorCitizen'])
plt.grid(False)
plt.show()

In [None]:
plt.figure(figsize=(14,10))
sns.kdeplot(churn_prime['TotalCharges'])
plt.grid(False)
plt.show()

In [None]:
# Create a box plot for multiple columns
plt.figure(figsize=(10, 6))
sns.boxplot(churn_prime[['tenure', 'MonthlyCharges']],  whis=1.5)

# Add titles and labels
plt.title('Box Plot of tenure, Monthly Charges')
plt.xlabel('Variables')
plt.ylabel('Distribution')

plt.grid(False)

# Display the plot
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(churn_prime[['TotalCharges']],  whis=4.5)
 
# Add titles and labels
plt.title('Box Plot of TotalCharges')
plt.xlabel('Variables')
plt.ylabel('Distribution')
 
plt.grid(False)
 
# Display the plot
plt.show()

#### **Bivariate Analysis**

In [None]:
numerical_columns = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

numeric_df = churn_prime.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [None]:

plt.figure(figsize=(10, 6))
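# Use the combined dataset and plot MonthlyCharges on the y-axis to match the title
sns.violinplot(x='Contract', y='MonthlyCharges', data=churn_prime, palette='muted')
plt.title('Monthly Charges by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Monthly Charges')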
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
# Use the combined dataset rather than only the second 2,000 records
sns.boxplot(x='Churn', y='MonthlyCharges', data=churn_prime, palette='muted')
plt.title('Monthly Charges by Churn Customers')
plt.xlabel('Churn')
plt.ylabel('Monthly Charges')
plt.show()

#### **Multivariate Analysis**

In [None]:
sns.pairplot(churn_prime[['MonthlyCharges', 'TotalCharges', 'tenure', 'Churn']], hue='Churn')
plt.show()

#### **Distribution and Counts for Categorical variables**

* The Contract column is the focus of our hypothesis, so we plotted its counts as a bar chart. Most customers are on a month-to-month contract.

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a bar chart for the 'Contract' column
plt.figure(figsize=(10, 6))
sns.countplot(churn_prime, x='Contract', order=churn_prime['Contract'].value_counts().index)

# Add titles and labels
plt.title('Bar Chart of Contract Types')
plt.xlabel('Contract Type')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a bar chart for the 'InternetService' column
plt.figure(figsize=(10, 6))
sns.countplot(churn_prime, x='InternetService', order=churn_prime['InternetService'].value_counts().index)

# Add titles and labels
plt.title('Bar Chart of InternetService Distribution')
plt.xlabel('InternetService')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a bar chart for the 'PaymentMethod' column
plt.figure(figsize=(10, 6))
sns.countplot(churn_prime, x='PaymentMethod', order=churn_prime['PaymentMethod'].value_counts().index)

# Add titles and labels
plt.title('Bar Chart of PaymentMethod Distribution')
plt.xlabel('PaymentMethod')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

#### Key Insights 
    - There are no duplicated rows in this dataset.
    - The numerical columns are skewed: their means differ noticeably from their medians (50th percentile). We take this into account during modeling.
    - The data has missing values.
    - There are outliers, visible in the long tails of the KDE plot and histogram for TotalCharges.
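
As a rough illustration of how the box plots flag outliers, the sketch below applies the IQR rule with the two `whis` settings used in the box plot cells above. The `total_charges` values are hypothetical, and the helper `outliers` is an assumed name, not part of the notebook.

```python
import statistics

# Hypothetical TotalCharges values; the last one is deliberately extreme.
total_charges = [100, 300, 400, 700, 1400, 2300, 3800, 8600]

# Quartiles with linear interpolation between data points.
q1, _, q3 = statistics.quantiles(total_charges, n=4, method="inclusive")
iqr = q3 - q1

def outliers(values, whis):
    """Return the values outside [q1 - whis * IQR, q3 + whis * IQR]."""
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < lower or v > upper]

for whis in (1.5, 4.5):
    print(f"whis={whis}: outliers = {outliers(total_charges, whis)}")
```

The wider the whiskers, the fewer points are drawn as outliers, so a plot with `whis=4.5` can look clean even when the conventional 1.5 rule would flag values.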
   
  

#### **Hypothesis Testing**

##### Null Hypothesis
(H0) There is no significant difference in churn rates among customers with different contract types.

##### Alternative Hypothesis
(H1) There is a significant difference in churn rates among customers with different contract types.

In [None]:
# Create a copy of the original DataFrame
df_train_chi = churn_prime.copy()

# Drop the row with the unknown value from the Churn column
df_train_chi.drop(index=2988, inplace=True)
df_train_chi.reset_index(drop=True, inplace=True)

# Drop 'customerID' column as it is not needed for analysis
df_train_chi.drop(columns=['customerID'], inplace=True)

# Convert Churn to binary
df_train_chi['Churn'] = df_train_chi['Churn'].map({'Yes': 1, 'No': 0})

# Replace invalid TotalCharges with NaN
df_train_chi['TotalCharges'] = pd.to_numeric(df_train_chi['TotalCharges'], errors='coerce')

# Define numerical and categorical columns
num_columns = df_train_chi.select_dtypes(include=['number']).columns
cat_columns = df_train_chi.select_dtypes(include=['object']).columns

# Impute missing values for numerical columns
imputer_num = SimpleImputer(strategy='median')
df_train_chi[num_columns] = imputer_num.fit_transform(df_train_chi[num_columns])

# Impute missing values for categorical columns
imputer_cat = SimpleImputer(strategy='most_frequent')
df_train_chi[cat_columns] = imputer_cat.fit_transform(df_train_chi[cat_columns])

# Create contingency table for Churn and Contract
contingency_table = pd.crosstab(df_train_chi['Churn'], df_train_chi['Contract'])

# Perform Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output the results
print("Chi-Square Test")
print("----------------")
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")

# Interpret the result based on the p-value
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: This means there is a significant difference in churn rates among customers with different contract types.")
else:
    print("Fail to reject the null hypothesis: This means there is no significant difference in churn rates among customers with different contract types.")


We used a chi-square test of independence with a significance level (alpha) of 0.05. If the p-value is below the significance level we reject the null hypothesis; otherwise we fail to reject it. In this output the p-value is extremely low, which is strong evidence against the null hypothesis. Therefore:
* We reject the null hypothesis.
* There is sufficient statistical evidence to conclude that churn rates differ significantly among customers with different contract types.
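
To make the test above more concrete, here is a minimal hand-computed sketch of the chi-square statistic for a 2 x 3 table (churn by contract type). The counts in `observed` are hypothetical and are not taken from the dataset. For a table of this shape there are 2 degrees of freedom, and in that special case the p-value has the closed form exp(-chi2 / 2), so no statistics library is needed.

```python
import math

# Hypothetical counts: rows are Churn = No / Yes,
# columns are Month-to-month, One year, Two year.
observed = [
    [150, 90, 110],
    [100, 10, 5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Survival function of the chi-square distribution, valid only for dof == 2.
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

In the notebook itself `chi2_contingency` handles the general case, including tables with other degrees of freedom.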

### **Data Preparation**

#### Handling missing values 

In [None]:
# Fill missing TotalCharges with the median; assign back instead of using inplace on a column selection
churn_prime['TotalCharges'] = churn_prime['TotalCharges'].fillna(churn_prime['TotalCharges'].median())


In [None]:
# Fill missing values in categorical columns with the most frequent value (mode)
miss_categ = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup',
              'DeviceProtection', 'TechSupport', 'StreamingTV',
              'StreamingMovies', 'Churn']

for col in miss_categ:
    mode_val = churn_prime[col].mode()[0]
    churn_prime[col] = churn_prime[col].fillna(mode_val)

In [None]:
# Export churn_prime to CSV for Power BI visualisation before further modeling

churn_prime.to_csv('churn_prime.csv', index= False)

##### Drop the customerID column: it has no statistical or computational significance, and its many unique values would interfere with the encoding process

In [None]:
churn_prime = churn_prime.drop('customerID', axis=1)

In [None]:
churn_prime.isnull().sum()

#### Split data into X and y (input and output variables)

In [None]:
# Input variables

X= churn_prime.drop ('Churn', axis= 1)
X.head()

In [None]:
# Output variable / target variable 
y= churn_prime['Churn']
y.value_counts()

In [None]:
(X.shape, y.shape) 

#### Split data into categorical and numerical columns

In [None]:
numerical_columns= X.select_dtypes('number').columns
numerical_columns 

In [None]:
categorical_columns= X.select_dtypes('object').columns
categorical_columns

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Print the shapes of the resulting datasets
(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)

In [None]:
# We use a LabelEncoder for y because the target is a 1-dimensional array

encoder = LabelEncoder()

# Fit the encoder to the target variable
y_train_encoded= encoder.fit_transform(y_train)
y_test_encoded= encoder.transform(y_test)
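
As a small illustration of what the encoding step does, the sketch below mimics a label encoder in plain Python: classes are sorted, then mapped to consecutive integers, so 'No' becomes 0 and 'Yes' becomes 1. The lists `y_train_demo` and `y_test_demo` are hypothetical stand-ins for the real targets.

```python
# Hypothetical target values standing in for y_train and y_test.
y_train_demo = ["No", "Yes", "No", "No", "Yes"]
y_test_demo = ["Yes", "No"]

# Learn the mapping from the training labels only (the "fit" step).
classes = sorted(set(y_train_demo))
mapping = {label: index for index, label in enumerate(classes)}

# Apply the same mapping to both splits (the "transform" step).
encoded_train = [mapping[label] for label in y_train_demo]
encoded_test = [mapping[label] for label in y_test_demo]

print(mapping)
print(encoded_train, encoded_test)
```

Because churn is encoded as 1, precision, recall and F1 computed later with default settings refer to the churn class.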


In [None]:
# Check skewness to determine which scaler to use 
X.select_dtypes('number').skew()

Decision
- StandardScaler is ruled out because our data is nowhere near bell-shaped.
- MinMaxScaler is ruled out because our data has outliers.
- We use RobustScaler because it is less sensitive to the skew and outliers in X_train.

In [None]:
X.describe().T

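We additionally apply a QuantileTransformer, which can map the data to an approximately bell-shaped distribution when its output distribution is set to normal (its default output is uniform).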
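
To illustrate why a robust scaler suits data with outliers, the sketch below scales a hypothetical `monthly_charges` list in two ways: with the median and interquartile range (the idea behind robust scaling) and with the mean and standard deviation (the idea behind standard scaling). The values are assumed for illustration and do not come from the dataset.

```python
import statistics

# Hypothetical charges; the last value is a deliberate outlier.
monthly_charges = [20.0, 25.0, 50.0, 70.0, 80.0, 90.0, 400.0]

# Robust scaling: centre on the median, divide by the interquartile range.
median = statistics.median(monthly_charges)
q1, _, q3 = statistics.quantiles(monthly_charges, n=4, method="inclusive")
iqr = q3 - q1
robust_scaled = [(v - median) / iqr for v in monthly_charges]

# Standard scaling: centre on the mean, divide by the standard deviation.
mean = statistics.fmean(monthly_charges)
std = statistics.pstdev(monthly_charges)
standard_scaled = [(v - mean) / std for v in monthly_charges]

print(f"median={median}, IQR={iqr}, mean={mean:.1f}, std={std:.1f}")
print("robust:  ", [round(v, 2) for v in robust_scaled])
print("standard:", [round(v, 2) for v in standard_scaled])
```

The single extreme value pulls the mean and standard deviation far from the bulk of the data, while the median and IQR stay close to the typical values.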

#### *Pipeline*

In [None]:
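numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
    # output_distribution='normal' gives the bell-shaped output described above; the default is 'uniform'
    ('QuantileTransformation', QuantileTransformer(output_distribution='normal')),
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Ignore categories seen only at prediction time instead of raising an error
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])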

preprocessor = ColumnTransformer(transformers=[
    ('num_pipeline', numeric_pipeline, numerical_columns),
    ('cat_pipeline', categorical_pipeline, categorical_columns),

])

In [None]:
preprocessor

#### **Modeling & Evaluation**

#### Train on unbalanced data 

In [None]:
# Define the models
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC(probability=True, random_state=42)),
    ('GBM', GradientBoostingClassifier(random_state=42)),
    ('Neural Network', MLPClassifier(random_state=42))
]


# Arrays to store individual model predictions and their probabilities
model_predictions = {}
model_probabilities = {}

# Store confusion matrices for each model
confusion_matrices = {}

for model_name, classifier in models:
    # Define the pipeline with the classifier
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])

    # Fit the pipeline on training data
    pipeline.fit(X_train, y_train_encoded)

    # Predict on test data
    y_pred = pipeline.predict(X_test)
    y_prob = pipeline.predict_proba(X_test)[:, 1]

    # Store predictions and probabilities
    model_predictions[model_name] = y_pred
    model_probabilities[model_name] = y_prob

    # Store confusion matrix
    cm = confusion_matrix(y_test_encoded, y_pred)
    confusion_matrices[model_name] = cm

    # Evaluate model performance
    print(model_name)
    print(classification_report(y_test_encoded, y_pred))
    print('=' * 50)

    # Calculate ROC AUC score
    roc_auc = roc_auc_score(y_test_encoded, y_prob)

    # Print ROC AUC score
    print(f'ROC AUC Score: {roc_auc:.4f}')
    print('=' * 50)

 

In [None]:
# Convert confusion matrices to DataFrame
df_scores = pd.DataFrame.from_dict({model_name: [conf_matrix] for model_name, conf_matrix in confusion_matrices.items()}, orient='index', columns=['confusion_matrix'])
df_scores 

In [None]:
def plot_confusion_matrices(df_scores, figsize=(15, 8), ncols=3):
    nrows = int(np.ceil(len(df_scores) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize)
    axes = axes.flatten()
    
    for i, (model_name, row) in enumerate(df_scores.iterrows()):
        conf_matrix = row['confusion_matrix']
        ax = axes[i]
        sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                    xticklabels=['Not Churn', 'Churn'], yticklabels=['Not Churn', 'Churn'], ax=ax)
        ax.set_xlabel('Predicted labels')
        ax.set_ylabel('True labels')
        ax.set_title(f'Confusion Matrix - {model_name}')
    
    plt.tight_layout()
    plt.show()

plot_confusion_matrices(df_scores)

In [None]:
# Plot ROC AUC curve for all models
plt.figure(figsize=(10, 8))

# Iterate over each model's probabilities and plot ROC curve
for model_name, y_prob in model_probabilities.items():
    # Compute ROC curve
    fpr, tpr, _ = roc_curve(y_test_encoded, y_prob)
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.plot(fpr, tpr, lw=2, label=f'{model_name} (AUC = {roc_auc:.2f})')

# Plot random guessing line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

# Set plot properties
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(False)
plt.show()

#### **Hyperparameter Tuning**

In [None]:
#Define parameter grids for tuning

param_grids = {
    'Logistic Regression': {
        'classifier__C': [0.01, 0.1, 1, 10, 100],
        'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear']
    },
    'Random Forest': {
        'classifier__n_estimators': [100, 200, 300],
        'classifier__max_depth': [None, 10, 20, 30],
        'classifier__min_samples_split': [2, 5, 10]
    },
    'KNN': {
        'classifier__n_neighbors': [3, 5, 7, 9],
        'classifier__weights': ['uniform', 'distance']
    },
    'SVM': {
        'classifier__C': [0.1, 1, 10, 100],
        'classifier__kernel': ['linear', 'rbf']
    },
    'GBM': {
        'classifier__n_estimators': [100, 200, 300],
        'classifier__learning_rate': [0.01, 0.1, 0.2],
        'classifier__max_depth': [3, 4, 5]
    },
    'Neural Network': {
        'classifier__hidden_layer_sizes': [(50,), (100,), (50, 50)],
        'classifier__activation': ['tanh', 'relu'],
        'classifier__solver': ['sgd', 'adam'],
        'classifier__alpha': [0.0001, 0.001, 0.01]
    }
}
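
Before running a grid search it helps to estimate how many model fits it will trigger. The sketch below counts the parameter combinations for two of the grids above and multiplies by the number of cross-validation folds. `grids_demo` is a hand-copied subset used only for illustration, and `cv_folds = 5` matches the `cv=5` setting used in the tuning cell.

```python
from math import prod

# Hand-copied subset of the parameter grids above (illustration only).
grids_demo = {
    "Logistic Regression": {
        "classifier__C": [0.01, 0.1, 1, 10, 100],
        "classifier__solver": ["newton-cg", "lbfgs", "liblinear"],
    },
    "Random Forest": {
        "classifier__n_estimators": [100, 200, 300],
        "classifier__max_depth": [None, 10, 20, 30],
        "classifier__min_samples_split": [2, 5, 10],
    },
}

cv_folds = 5
fits = {}
for name, grid in grids_demo.items():
    # Every combination of values is one candidate model.
    n_candidates = prod(len(values) for values in grid.values())
    fits[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fits[name]} fits")
```

Each grid search also refits the best candidate once on the full training set by default, so the real totals are one fit higher per model.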

In [None]:
# Perform Hyperparameter Tuning

best_estimators = {}

for model_name, classifier in models:
    # Define the pipeline with the classifier
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])

    # Get the parameter grid for the current model
    param_grid = param_grids[model_name]
    
    # Set up GridSearchCV
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    
    # Fit the GridSearchCV
    grid_search.fit(X_train, y_train_encoded)
    
    # Store the best estimator
    best_estimators[model_name] = grid_search.best_estimator_

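    # Evaluate the tuned model on the test set (not the stale predictions from the earlier loop)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    y_prob = best_model.predict_proba(X_test)[:, 1]

    # Calculate metrics
    accuracy = accuracy_score(y_test_encoded, y_pred)
    precision = precision_score(y_test_encoded, y_pred)
    recall = recall_score(y_test_encoded, y_pred)
    f1 = f1_score(y_test_encoded, y_pred)
    auc_roc = roc_auc_score(y_test_encoded, y_prob)
    
    # Print best parameters, cross-validated ROC AUC and test-set metrics
    print(f'Best parameters for {model_name}: {grid_search.best_params_}')
    print(f'Best cross-validated ROC AUC for {model_name}: {grid_search.best_score_:.4f}')
    print(f'Test ROC AUC: {auc_roc:.4f}')
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
    print(f'F1 Score: {f1:.4f}')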

    print('=' * 50)

Re-create the model list so that the balanced-data run starts from fresh, unfitted estimators.

In [None]:
# Define the models
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC(probability=True, random_state=42)),
    ('GBM', GradientBoostingClassifier(random_state=42)),
    ('Neural Network', MLPClassifier(random_state=42))
]

#### Train on balanced data 

In [None]:
balanced_table =pd.DataFrame(columns=['Model','Accuracy', 'Precision', 'Recall', 'F1_Score'])
balanced_pipeline= {}
 
for model_name, classifier in models:
   
    pipeline = imbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('OverSampler', SMOTE(random_state=42)),
        ('classifier', classifier)
    ])
 
    pipeline.fit(X_train,y_train_encoded)
   
    balanced_pipeline [model_name]= pipeline
 
    y_pred = pipeline.predict(X_test)
 
   
    balanced_metrics= classification_report(y_test_encoded, y_pred, output_dict=True)
 
    accuracy= balanced_metrics['accuracy']
    precision = balanced_metrics['weighted avg']['precision']
    recall = balanced_metrics['weighted avg']['recall']
    f1 = balanced_metrics['weighted avg']['f1-score']
 
    balanced_table.loc[len(balanced_table)]= [model_name, accuracy, precision, recall,f1]
 
balanced_table.sort_values(by='F1_Score')
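
The balanced run above relies on SMOTE inside an imblearn pipeline, which resamples only while fitting and leaves the test data untouched. The sketch below is a simplified, pure-Python illustration of the core idea, creating synthetic minority points by interpolating between a point and one of its nearest minority neighbours. It is not the imblearn implementation; the function `smote_like` and the sample points are hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote_like(minority_points, n_new=3)

for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.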

In [None]:
# View balanced data pipelines 
balanced_pipeline

#### Answering Analytical Questions 

In [None]:
# 1. What is the churn percentage for each payment method?

# Churn holds 'Yes'/'No' strings, so convert it to a boolean flag before averaging
churn_percentage = (
    churn_prime['Churn'].eq('Yes')
    .groupby(churn_prime['PaymentMethod'])
    .mean()
    .mul(100)
    .reset_index()
)
churn_percentage.columns = ['PaymentMethod', 'ChurnPercentage']
print(churn_percentage)


# Create a bar plot of churn percentage by payment method
plt.figure(figsize=(10, 6))
sns.barplot(x='PaymentMethod', y='ChurnPercentage', data=churn_percentage)

# Add title and labels
plt.title('Churn Percentage by Payment Method')
plt.xlabel('Payment Method')
plt.ylabel('Churn Percentage')

# Rotate x labels for better readability
plt.xticks(rotation=45)

plt.grid(False)

# Show plot
plt.show()


In [None]:
# 2. How do key demographic factors (i.e., 'gender', 'Partner', 'SeniorCitizen', 'Dependents') influence customer churn?


# Define the demographic features
demographic_features = ['gender', 'Partner', 'SeniorCitizen', 'Dependents']

# Plotting the churn rates for each demographic feature
plt.figure(figsize=(14, 10))

for i, feature in enumerate(demographic_features, 1):
    plt.subplot(2, 2, i)
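    # Churn holds 'Yes'/'No' strings, so average a boolean flag instead of the raw column
    churn_rates = churn_prime['Churn'].eq('Yes').groupby(churn_prime[feature]).mean() * 100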
    sns.barplot(x=churn_rates.index, y=churn_rates.values, palette='pastel')
    plt.title(f'Churn Rate by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Churn Rate (%)')
    plt.grid(False)

plt.tight_layout()
plt.show()


In [None]:
#6. How do different contract types affect customer churn?

# Plotting the count of churn for each contract type
plt.figure(figsize=(10, 6))
sns.countplot(data=churn_prime, x='Contract', hue='Churn', palette='muted')
plt.title('Comparison of Churn by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(False)
plt.show()

In [None]:
# 3. How does the tenure of a customer impact their likelihood of churning?

# Work on a copy of the real data so that churn_prime itself is not overwritten
tenure_df = churn_prime[['tenure', 'Churn']].copy()
tenure_df['Churned'] = tenure_df['Churn'].eq('Yes')

# Define 12-month tenure buckets that cover the full observed range
max_tenure = int(tenure_df['tenure'].max())
tenure_bins = list(range(0, max_tenure + 13, 12))
tenure_labels = [f'{low}-{high}' for low, high in zip(tenure_bins[:-1], tenure_bins[1:])]

# Assign each customer to a tenure bucket
tenure_df['tenure_bucket'] = pd.cut(tenure_df['tenure'], bins=tenure_bins, labels=tenure_labels, right=False)

# Calculate churn rates for each tenure bucket
churn_rates = tenure_df.groupby('tenure_bucket', observed=False)['Churned'].mean() * 100

# Plotting the churn rates
plt.figure(figsize=(10, 6))
sns.barplot(x=churn_rates.index, y=churn_rates.values, color='skyblue')
plt.title('Churn Rate by Tenure')
plt.xlabel('Tenure (months)')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(False)
plt.show()


In [None]:
# 4. Is there a significant correlation between the type of internet service and customer churn?

# Calculate churn rates by internet service type from the real data
churn_rates = churn_prime['Churn'].eq('Yes').groupby(churn_prime['InternetService']).mean() * 100

# Perform chi-square test of independence
contingency_table = pd.crosstab(churn_prime['InternetService'], churn_prime['Churn'])
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Print chi-square test results
print(f'Chi2 Statistic: {chi2}')
print(f'P-Value: {p}')

# Plot churn rates
plt.figure(figsize=(10, 6))
sns.barplot(x=churn_rates.index, y=churn_rates.values, palette='pastel')
plt.title('Churn Rate by Internet Service Type')
plt.xlabel('Internet Service Type')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(False)
plt.show()

# Interpretation
if p < 0.05:
    print("There is a significant association between Internet Service Type and Churn (p < 0.05).")
else:
    print("There is no significant association between Internet Service Type and Churn (p >= 0.05).")


In [None]:
# 5. Do customers with multiple services show different churn rates compared to those with fewer services?

services = ['PhoneService', 'InternetService', 'MultipleLines', 'OnlineSecurity',
            'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

# Work on a copy of the real data
service_df = churn_prime[services + ['Churn']].copy()

# Count the services each customer has ('Yes', or an actual internet service type)
service_df['NumberOfServices'] = service_df[services].isin(['Yes', 'DSL', 'Fiber optic']).sum(axis=1)

# Calculate churn rates by number of services
churn_rates = service_df['Churn'].eq('Yes').groupby(service_df['NumberOfServices']).mean() * 100

# Plotting the churn rates
plt.figure(figsize=(10, 6))
sns.barplot(x=churn_rates.index, y=churn_rates.values, palette='pastel')
plt.title('Churn Rate by Number of Services')
plt.xlabel('Number of Services')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.grid(False)
plt.show()


In [None]:
#from sklearn.impute import SimpleImputer #filling mising values

# For categorical features, use the most frequent value
#categorical_imputer = SimpleImputer(strategy='most_frequent')

# For numerical features, use the median
#numerical_imputer = SimpleImputer(strategy='median')

In [None]:
# Inspect data 
#churn_prime.isnull().sum()

In [None]:
#from sklearn.preprocessing import OneHotEncoder #Encoding

#categorical_encoder = OneHotEncoder(handle_unknown='ignore')


In [None]:
#from sklearn.preprocessing import StandardScaler #Standardizing our data 

#numerical_scaler = StandardScaler()


In [None]:
#from sklearn.pipeline import Pipeline # Create a preprocesing pipeline 

#categorical_pipeline = Pipeline([
   # ('imputer', categorical_imputer),
    #('encoder', categorical_encoder)
#])

#numerical_pipeline = Pipeline([
   # ('imputer', numerical_imputer),
    #('scaler', numerical_scaler)
#])


In [None]:
#from sklearn.compose import ColumnTransformer # Combine our pipelines 

#preprocessor = ColumnTransformer([
    #('cat', categorical_pipeline, categorical_features),
    #('num', numerical_pipeline, numerical_features)
#])


In [None]:
#from sklearn.ensemble import RandomForestClassifier # Build the final pipeline with  a ml model

#model = Pipeline([
    #('preprocessor', preprocessor),
   # ('classifier', RandomForestClassifier())
#])


In [None]:
#from sklearn.model_selection import train_test_split
#from sklearn.metrics import accuracy_score, confusion_matrix, classification_report #train and evaluate the model 

# Split the data
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
#model.fit(X_train, y_train)

# Make predictions
#y_pred = model.predict(X_test)

# Evaluate the model
#print('Accuracy:', accuracy_score(y_test, y_pred))
#print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
#print('Classification Report:\n', classification_report(y_test, y_pred))


#### Checklist 
    - Missing values are handled 
    - True converted to Yes and False to No 
    - Column names renamed 
    - MonthlyCharges and TotalCharges columns need standardized decimals
    - TotalCharges column should be a float datatype
    - At least 5 univariate, bivariate and multivariate analyses 
    - Categorical columns analysis 
    - Hypothesis 
    - Visuals should check collinearity and churn rate distribution
    - Analytical questions 
    - At least 4 models
    - Evaluation
    - Choose 1 model - key metrics must be met 
    - Hyperparameter tuning is a must 
    - Predict on the test set and visualize results
    - Highlight at least 5 key insights, challenges and the way forward 
    - Must have a conclusion