<h1 style='text-align: center; font-size: 50px;'>Enhancing Interconnect Loyalty Through Predictive Analytics</h1>

# Introduction:

**Interconnect**, a telecom operator, is working to improve customer loyalty by forecasting which clients are likely to stop using its services. Anticipating churn allows the company to take proactive steps, such as offering targeted promotions or customized plans, to retain those at risk.

This project involves building a predictive model using customer demographics, contract types, and service usage patterns. The model will help identify key indicators of churn and support data-driven marketing strategies.

The overall workflow covering data exploration, preparation, modeling, and evaluation is outlined in the work plan at the end of this notebook.

Ultimately, the goal is to deliver a reliable and interpretable tool that enables the marketing team to act early and effectively to reduce customer loss.

# Step 1: Exploratory Data Analysis (EDA)

# Initialization:

In [None]:
# Loading all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Loading Data:

## `Contract` Dataframe:

In [None]:
# Loading the data file:
contract_df = pd.read_csv('/datasets/final_provider/contract.csv')
contract_df.head()

In [None]:
# Renaming column names:
contract_df = contract_df.rename(columns={'customerID': 'customer_id',

                                          'BeginDate': 'begin_date',
                                          'EndDate': 'end_date',
                                          'Type': 'type',
                                          'PaperlessBilling': 'paperless_billing',
                                          'PaymentMethod': 'payment_method',
                                          'MonthlyCharges': 'monthly_charges',
                                          'TotalCharges': 'total_charges'}
                                )
contract_df.head()

In [None]:
# Data overview:
contract_df.info()

In [None]:
# Converting 'begin_date' into datetime:
contract_df['begin_date'] = pd.to_datetime(contract_df['begin_date'])
contract_df.head()

In [None]:
# Converting 'end_date' into datetime:
contract_df['end_date'] = pd.to_datetime(contract_df['end_date'], errors='coerce')
contract_df.head()

In [None]:
# Converting 'total_charges' into Numerical:
contract_df['total_charges'] = pd.to_numeric(contract_df['total_charges'], errors='coerce')
contract_df.head()

In [None]:
# Checking for missing values:
contract_df.isna().sum()

In [None]:
# Filling missing values in 'total_charges' with 'monthly_charges' (assigning back avoids chained-assignment issues):
contract_df['total_charges'] = contract_df['total_charges'].fillna(contract_df['monthly_charges'])
contract_df.head()

In [None]:
# Checking for duplicates:
contract_df.duplicated().sum()

In [None]:
# Checking for duplicates in 'customer_id':
contract_df['customer_id'].duplicated().sum()

In [None]:
# Creating the target variable 'churn':
contract_df['churn'] = contract_df['end_date'].notna().astype(int)
contract_df.head()
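
The target derivation above relies on `errors='coerce'` turning non-date entries into `NaT`, so that only customers with a real end date are flagged as churned. A minimal sketch on toy data follows; the `'No'` placeholder for active customers and the sample dates are assumptions for illustration, not values confirmed by the dataset.

```python
import pandas as pd

# Toy data; the 'No' placeholder for active customers is an assumption.
toy = pd.DataFrame({
    'customer_id': ['A', 'B', 'C'],
    'end_date': ['2019-12-01 00:00:00', 'No', 'No'],
})

# Non-date strings become NaT, so only rows with a real end date remain non-null.
toy['end_date'] = pd.to_datetime(toy['end_date'], errors='coerce')

# notna() is True only for customers who actually left.
toy['churn'] = toy['end_date'].notna().astype(int)

churn_flags = toy['churn'].tolist()
print(toy)
```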

## `Personal` Dataframe:

In [None]:
# Loading the data file:
personal_df = pd.read_csv('/datasets/final_provider/personal.csv')
personal_df.head()

In [None]:
# Renaming column names:
personal_df = personal_df.rename(columns={'customerID': 'customer_id',

                                          'gender': 'gender',
                                          'SeniorCitizen': 'senior_citizen',
                                          'Partner': 'partner',
                                          'Dependents': 'dependents'
                                         }
                                )
personal_df.head()

In [None]:
# Data overview:
personal_df.info()

In [None]:
# Checking for missing values:
personal_df.isna().sum()

In [None]:
# Checking for duplicates:
personal_df.duplicated().sum()

In [None]:
# Checking for duplicates in 'customer_id':
personal_df['customer_id'].duplicated().sum()

## `Internet` Dataframe:

In [None]:
# Loading the data file:
internet_df = pd.read_csv('/datasets/final_provider/internet.csv')
internet_df.head()

In [None]:
# Renaming column names:
internet_df = internet_df.rename(columns={'customerID': 'customer_id',

                                          'InternetService': 'internet_service',
                                          'OnlineSecurity': 'online_security',
                                          'OnlineBackup': 'online_backup',
                                          'DeviceProtection': 'device_protection',
                                          'TechSupport': 'tech_support',
                                          'StreamingTV': 'streaming_tv',
                                          'StreamingMovies': 'streaming_movies'
                                         }
                                )
internet_df.head()

In [None]:
# Data overview:
internet_df.info()

In [None]:
# Checking for missing values:
internet_df.isna().sum()

In [None]:
# Checking for duplicates:
internet_df.duplicated().sum()

In [None]:
# Checking for duplicates in 'customer_id':
internet_df['customer_id'].duplicated().sum()

## `Phone` Dataframe:

In [None]:
# Loading the data file:
phone_df = pd.read_csv('/datasets/final_provider/phone.csv')
phone_df.head()

In [None]:
# Renaming column names:
phone_df = phone_df.rename(columns={'customerID': 'customer_id',
                                    'MultipleLines': 'multiple_lines'
                                   }
                          )
phone_df.head()

In [None]:
# Data overview:
phone_df.info()

In [None]:
# Checking for missing values:
phone_df.isna().sum()

In [None]:
# Checking for duplicates:
phone_df.duplicated().sum()

In [None]:
# Checking for duplicates in 'customer_id':
phone_df['customer_id'].duplicated().sum()

# Merging all datasets into one DataFrame:

In [None]:
# Merging contract data into personal info
merged_df = personal_df.merge(contract_df, on='customer_id', how='left')
merged_df.info()

In [None]:
# Merging internet data into previous merge:
merged_df = merged_df.merge(internet_df, on='customer_id', how='left')
merged_df.info()

In [None]:
# Merging phone data into final dataset
merged_df = merged_df.merge(phone_df, on='customer_id', how='left')
merged_df.head()

In [None]:
# Data overview:
merged_df.info()

In [None]:
# Checking for missing values:
merged_df.isna().sum()

In [None]:
# Filling service columns with 'No' for customers absent from the internet or phone tables:
internet_related_cols = [
    'internet_service',
    'online_security',
    'online_backup',
    'device_protection',
    'tech_support',
    'streaming_tv',
    'streaming_movies',
    'multiple_lines'
]

merged_df[internet_related_cols] = merged_df[internet_related_cols].fillna('No')
merged_df.head()
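
The missing values filled above come from the left merges: a customer with no row in a service table gets `NaN` in that table's columns. The sketch below, on hypothetical toy frames, uses the `indicator=True` option of `merge` to show which rows had no match before filling them.

```python
import pandas as pd

# Toy frames; customer 'C' has no internet record (hypothetical data).
personal = pd.DataFrame({'customer_id': ['A', 'B', 'C']})
internet = pd.DataFrame({
    'customer_id': ['A', 'B'],
    'internet_service': ['DSL', 'Fiber optic'],
})

# indicator=True adds a '_merge' column marking rows found only on the left side.
merged = personal.merge(internet, on='customer_id', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'customer_id'].tolist()

# A missing service value means "not subscribed", so 'No' is a reasonable fill.
merged['internet_service'] = merged['internet_service'].fillna('No')
services = merged['internet_service'].tolist()
print(merged)
```

Checking `_merge` before filling is one way to confirm that every `NaN` really corresponds to an absent subscription rather than a data problem.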

In [None]:
# Checking for duplicates:
merged_df.duplicated().sum()

In [None]:
# Statistical summary of the dataset:
merged_df.describe()

In [None]:
# Checking the balance of the target variable:
merged_df['churn'].value_counts(normalize=True)

In [None]:
# Visualizing the distribution of Retained Customers Vs Churned:
plt.figure(figsize=(6, 5))
sns.countplot(x='churn', data=merged_df, palette={0: 'green', 1: 'red'})

plt.title('Distribution of Customers Retained vs Churned', fontweight='bold')
plt.xlabel('Churn Status', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')
plt.xticks([0, 1], ['Retained', 'Churned'])

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

We visualized the distribution of retained vs. churned customers using a bar chart:

- ~73.5% of customers were retained (green bar)

- ~26.5% of customers churned (red bar)

The chart reveals a clear class imbalance in the target variable.
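
With an imbalance like this, accuracy is easier to interpret against a majority-class baseline. The sketch below uses synthetic labels that mirror the reported 73.5% / 26.5% split (the counts of 735 and 265 are illustrative, not taken from the dataset) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels mirroring the reported split (illustrative counts).
y = np.array([0] * 735 + [1] * 265)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predicts the most frequent class, i.e. "retained".
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
```

Any model worth deploying should clear this baseline by a meaningful margin.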

In [None]:
# Visualizing the number of churned Vs retained customers for each contract type:
plt.figure(figsize=(6, 4))
sns.countplot(data=merged_df, x='type', hue='churn', palette={0: 'green', 1: 'red'})

plt.title('Distribution of Customer Churn by Contract Type', fontweight='bold')
plt.xlabel('Contract Type', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')
legend = plt.legend(title='Customer Status', labels=['Retained', 'Churned'])
legend.get_title().set_fontweight('bold')

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

This visualization explores how customer churn behavior varies across different contract types (month-to-month, one year, and two year). By comparing the number of retained vs churned customers, we can assess how contract length influences retention:

- **Month-to-month** contracts show the highest churn rate, with a large number of customers leaving compared to those retained.

- **One-year** and **two-year** contracts show significantly lower churn, with retention clearly outweighing churn.

This suggests that longer-term contracts are associated with better customer retention.
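
The count plot compares absolute numbers, so groups of different sizes are hard to compare directly. A possible complement is the churn rate per contract type, sketched here on hypothetical toy data (not the Interconnect dataset):

```python
import pandas as pd

# Hypothetical toy data used only to show the mechanics.
toy = pd.DataFrame({
    'type': ['Month-to-month'] * 4 + ['One year'] * 3 + ['Two year'] * 3,
    'churn': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of churned customers in each group.
churn_rate_by_type = (
    toy.groupby('type')['churn'].mean().sort_values(ascending=False)
)
print(churn_rate_by_type)
```

The same pattern applies to any categorical column in `merged_df`, for example `internet_service` or `payment_method`.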

In [None]:
# Visualizing the number of churned Vs retained customers for each internet service type:
plt.figure(figsize=(8, 5))
sns.countplot(x='internet_service', hue='churn', data=merged_df, palette={0: 'green', 1: 'red'})

plt.title('Distribution of Retained Vs Churned Customers by Internet Service Type', fontweight='bold')
plt.xlabel('Internet Service Type', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')
legend = plt.legend(title='Customer Status', labels=['Retained', 'Churned'])
legend.get_title().set_fontweight('bold')

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

This bar chart compares churn behavior across different types of internet services: DSL, Fiber optic, and No internet service.

- **DSL:** More customers stayed than left, suggesting moderate satisfaction or better retention strategies.

- **Fiber Optic:** Shows higher churn than DSL despite being a faster service, indicating possible issues like pricing or service reliability.

- **No Internet:** These customers had very low churn, possibly because they are less reliant on the service overall or represent more traditional usage profiles.

This visualization helps identify churn risk segments tied to service type, which could guide retention initiatives (e.g., improving fiber optic experience or pricing strategy).

In [None]:
# Visualizing the distribution of monthly charges by customer status:
plt.figure(figsize=(8, 5))
sns.histplot(data=merged_df, x='monthly_charges', hue='churn', multiple='stack', palette={0: 'green', 1: 'red'}, bins=30)

plt.title('Distribution of Monthly Charges for Retained vs Churned Customers', fontweight='bold')
plt.xlabel('Monthly Charges ($)', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')
legend = plt.legend(title='Customer Status', labels=['Churned', 'Retained'])
legend.get_title().set_fontweight('bold')

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

This chart shows how customer churn varies across different monthly charge amounts:

- Customers paying around **`$20` per month** are much more likely to stay. The bar for retained customers is significantly higher in this range compared to churned customers.


- As monthly charges increase **(especially between `$60` and `$100`)**, the number of churned customers rises. The gap between retained and churned customers narrows, showing that higher charges may be linked to more cancellations.



- At the highest price range **(above `$100`)**, churn continues but both retained and churned customer counts drop, likely due to fewer customers in that tier.

This suggests that lower charges are associated with better retention, while churn becomes more common as charges go up.

# Step 2: Feature Engineering:

In [None]:
# Filling missing end_date for active customers:
latest_date = merged_df['end_date'].max()
merged_df['end_date'] = merged_df['end_date'].fillna(latest_date)

In [None]:
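# Calculating customer tenure in months (from begin_date to end_date).
# Recent pandas versions reject the ambiguous 'M' unit, so an explicit average month length is used:
avg_month = pd.Timedelta(days=30.436875)  # 365.2425 / 12, the month length NumPy assigns to 'M'
merged_df['tenure'] = (merged_df['end_date'] - merged_df['begin_date']) / avg_month
merged_df['tenure'] = merged_df['tenure'].round().astype('Int64')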

merged_df[['begin_date', 'end_date', 'tenure']].head()
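
Dividing a duration by a fixed month length only approximates calendar months. An alternative sketch counts calendar months directly from the year and month components; it assumes the dates fall on the first of the month (an assumption about the data, and the sample dates are made up).

```python
import pandas as pd

# Made-up dates for illustration.
toy = pd.DataFrame({
    'begin_date': pd.to_datetime(['2019-01-01', '2017-06-01']),
    'end_date': pd.to_datetime(['2020-01-01', '2019-12-01']),
})

# Whole calendar months between the two dates; the day of month is ignored.
toy['tenure'] = (
    (toy['end_date'].dt.year - toy['begin_date'].dt.year) * 12
    + (toy['end_date'].dt.month - toy['begin_date'].dt.month)
)

tenure_months = toy['tenure'].tolist()
print(toy)
```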

In [None]:
# Plotting the distribution of tenure for retained vs churned customers:
plt.figure(figsize=(8, 5))
sns.histplot(data=merged_df, x='tenure', hue='churn', multiple='stack', bins=30, palette={0: 'green', 1: 'red'})

plt.title('Distribution of Tenure for Retained vs Churned Customers', fontweight='bold')
plt.xlabel('Tenure (Months)', fontweight='bold')
plt.ylabel('Number of Customers', fontweight='bold')

legend = plt.legend(title='Customer Status', labels=['Churned', 'Retained'])
legend.get_title().set_fontweight('bold')

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

This chart explores the relationship between customer tenure (in months) and churn behavior:

- We observe that churn is highest during the first few months, particularly between `0 and 10 months`, suggesting that many customers leave early, possibly due to unmet expectations or dissatisfaction.
- After the `12-month mark`, churn begins to decline steadily, indicating that customers who stay longer are more likely to remain loyal.
- A peak is also noticeable at `72 months`, which could represent a maximum tenure cap or a long-standing customer cohort.

In general, retained customers are spread across all tenure ranges, but churned customers are more concentrated in the early stages. This emphasizes the importance of focusing on customer experience and engagement within the first year to improve long-term retention.

In [None]:
# Visualizing correlation between Customer Features and Churn:
plt.figure(figsize=(8, 5))

numeric_df = merged_df.select_dtypes(include='number')
corr_matrix = numeric_df.corr()

sns.heatmap(corr_matrix, annot=True, cmap='RdBu', fmt=".2f", square=True, center=0, linewidths=0.5,linecolor='gray')
plt.title('Correlation Between Customer Features and Churn', fontweight='bold')

plt.tight_layout()
plt.show()

This heatmap visualizes the pairwise correlation between key numeric features in the dataset. A few key insights:

- `tenure` has a **moderate negative** correlation with churn `(-0.34)`, meaning customers with longer tenure are less likely to churn.

- `total_charges` and `tenure` are **strongly positively** correlated `(0.83)`, as expected since charges accumulate over time.

- `monthly_charges` has a **weak positive** correlation with churn `(0.19)`, suggesting customers with higher monthly bills may be slightly more likely to churn.

- `senior_citizen` has a **very weak positive** correlation with churn `(0.15)`, indicating age may have a limited influence.

- `total_charges` is **moderately** correlated with monthly_charges `(0.65)`, reflecting the connection between recurring payments and total cost over time.

The numeric feature most strongly correlated with churn is `tenure`: the longer a customer stays, the less likely they are to churn. Monthly charges also play a role: higher monthly costs slightly increase churn risk. These insights can guide retention strategies, such as offering discounts to new or high-paying customers to encourage loyalty.
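
Reading one column of the heatmap can be easier as a sorted list. The sketch below extracts each feature's correlation with the target and orders it by absolute strength; the toy values are hypothetical and chosen only to illustrate the mechanics.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
toy = pd.DataFrame({
    'tenure': [1, 2, 3, 60, 70, 72],
    'monthly_charges': [90, 80, 85, 20, 25, 30],
    'churn': [1, 1, 1, 0, 0, 0],
})

# Correlation of every numeric feature with the target, excluding the target itself.
churn_corr = toy.corr()['churn'].drop('churn')

# Reorder so the strongest relationships (by absolute value) come first.
churn_corr = churn_corr.reindex(churn_corr.abs().sort_values(ascending=False).index)
print(churn_corr)
```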

In [None]:
# Listing existing categorical columns to encode:
categorical_features = [
    'gender', 'partner', 'dependents', 'internet_service', 'online_security',
    'online_backup', 'device_protection', 'tech_support', 'streaming_tv',
    'streaming_movies', 'multiple_lines', 'type', 'paperless_billing',
    'payment_method'
]

# One-hot encoding the categorical columns and dropping the first category to avoid multicollinearity:
encoded_df = pd.get_dummies(merged_df, columns=categorical_features, drop_first=True)

encoded_df.info()
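
To make the effect of `drop_first=True` concrete, the sketch below encodes a single toy column both ways. The category labels are illustrative; the dropped category becomes the implicit reference level (all remaining dummies equal to zero).

```python
import pandas as pd

# Illustrative category labels.
toy = pd.DataFrame({'type': ['Month-to-month', 'One year', 'Two year']})

# Full encoding: one column per category (columns always sum to 1 per row).
full = pd.get_dummies(toy, columns=['type'])

# Reduced encoding: the first category (alphabetically) is dropped.
reduced = pd.get_dummies(toy, columns=['type'], drop_first=True)

full_columns = list(full.columns)
reduced_columns = list(reduced.columns)
print(full_columns)
print(reduced_columns)
```

Dropping one level matters mainly for linear models such as Logistic Regression; tree-based models are generally unaffected by the redundant column.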

# Step 3: Model Building and Evaluation:

## Training:

In [None]:
# Splitting the dataset into features and target:
target = 'churn'
X = encoded_df.drop([target, 'customer_id', 'begin_date', 'end_date'], axis=1)
y = encoded_df[target]

# Splitting the data into train and test sets (80% train / 20% test):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Displaying the size of each subset:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
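
The split above is not stratified, so the churn share in the train and test sets can drift slightly from the overall rate. A stratified split is one option, sketched here on synthetic labels with the reported class ratio (illustrative counts); passing `stratify=y` is a suggestion, not part of the original pipeline, and it would change the reported metrics.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the reported 73.5% / 26.5% split (illustrative counts).
y = np.array([0] * 735 + [1] * 265)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions (nearly) identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Churn share: train {train_rate:.3f}, test {test_rate:.3f}")
```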


# `Random Forest Model`:

In [None]:
# Defining hyperparameter grid:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

# Initializing the Random Forest model:
rf_model = RandomForestClassifier(random_state=42)

# GridSearchCV setup:
grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1
)

# Fitting the Random Forest model on training data:
grid_search.fit(X_train, y_train)

# Using the best model from grid_search:
best_rf = grid_search.best_estimator_

# Predicting on the test set:
y_pred = best_rf.predict(X_test)
y_proba = best_rf.predict_proba(X_test)[:, 1] 

# Evaluating the Random Forest model:
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)

# Displaying the metrics:
print('--- Tuned Random Forest Training Summary ---')
print(f"Accuracy            : {accuracy:.2f}")
print(f"ROC-AUC Score       : {roc_auc:.2f}")

The tuned Random Forest model achieved an **accuracy of 0.84** and a **ROC-AUC score of 0.89** on the test set. These metrics indicate that the model is performing well in terms of both overall classification accuracy and its ability to distinguish between churned and retained customers.

- The **high ROC-AUC** suggests the model is particularly effective at ranking customers by their likelihood to churn, which is critical in business applications where targeting the most at risk users matters more than just overall correctness.

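- While **the accuracy** confirms that the model classifies most customers correctly, it should be read against the class balance: because about 73.5% of customers were retained, always predicting "retained" would already score roughly 0.74. **The ROC-AUC provides a more nuanced view** by evaluating performance across all classification thresholds, which makes it less sensitive to this imbalance than accuracy.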

Overall, the tuned Random Forest can serve as a strong baseline model for churn prediction, with solid generalization to unseen data and good potential for informing retention-focused marketing strategies.
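
The ranking interpretation of ROC-AUC can be checked directly: it equals the probability that a randomly chosen churned customer receives a higher score than a randomly chosen retained one. The sketch below verifies this on made-up labels and scores by counting pairs and comparing with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels (1 = churned) and predicted churn probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every churned customer with every retained one; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sklearn_auc)
```

Here 8 of the 9 churned/retained pairs are ordered correctly, so both computations give 8/9.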

# `Logistic Regression Model`:

In [None]:
# Defining hyperparameter grid:
param_grid = {
    'C': [0.01, 0.1, 1, 10],               
    'penalty': ['l2'],                     
    'solver': ['lbfgs']                    
}

# Initializing the Logistic Regression model:
lr_model = LogisticRegression(random_state=42, max_iter=1000)

# GridSearchCV setup:
grid_search = GridSearchCV(
    estimator=lr_model,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1
)

# Fitting the Logistic Regression model on training data:
grid_search.fit(X_train, y_train)

# Using the best model from GridSearchCV:
best_lr = grid_search.best_estimator_

# Predicting on the test set:
y_pred_lr = best_lr.predict(X_test)
y_proba_lr = best_lr.predict_proba(X_test)[:, 1]

# Evaluating the Logistic Regression model:
accuracy_lr = accuracy_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_proba_lr)

# Displaying the metrics (using the Logistic Regression results, not the Random Forest ones):
print('--- Tuned Logistic Regression Training Summary ---')
print(f"Accuracy            : {accuracy_lr:.2f}")
print(f"ROC-AUC Score       : {roc_auc_lr:.2f}")

The tuned `Logistic Regression model` achieved an **accuracy of 0.82** and a **ROC-AUC score of 0.86** on the test dataset.

- The **accuracy of 82%** indicates that the model correctly classified the majority of customer cases overall.
- The **high ROC-AUC score (0.86)** suggests strong discriminative power, meaning the model is effective at distinguishing between customers who are likely to churn and those who are not.

Together with its comparatively simple structure, this performance makes Logistic Regression a reliable and interpretable baseline model for churn prediction. Precision and recall were not measured here, so the balance between them remains to be checked.
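
Accuracy and ROC-AUC alone do not show how errors split between missed churners and false alarms. A minimal sketch of how precision and recall could be computed follows; the labels and predictions are made up, and in the notebook `y_test` and `y_pred_lr` would take their place.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions (1 = churned).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of predicted churners who really churned.
precision = precision_score(y_true, y_pred)
# Recall: share of actual churners the model caught.
recall = recall_score(y_true, y_pred)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

For retention campaigns, recall on the churn class is often the more important of the two, since a missed churner cannot be targeted at all.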

## Testing:

# `Random Forest Model`:

In [None]:
# Random Forest Test Set Evaluation:
y_pred_rf = best_rf.predict(X_test)
y_proba_rf = best_rf.predict_proba(X_test)[:, 1]

# Calculating evaluation metrics
accuracy_rf = accuracy_score(y_test, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test, y_proba_rf)

# Displaying results
print("\n--- Random Forest Test Set Evaluation ---")
print(f"Accuracy            : {accuracy_rf:.2f}")
print(f"ROC-AUC Score       : {roc_auc_rf:.2f}")

The `Random Forest` model performed well, correctly classifying **84%** of test cases. Its high **ROC-AUC score (0.89)** shows a strong ability to rank customers by churn risk, making it a reliable and powerful model.

# `Logistic Regression Model`:

In [None]:
# Logistic Regression Test Set Evaluation:
y_pred_lr = best_lr.predict(X_test)
y_proba_lr = best_lr.predict_proba(X_test)[:, 1]

# Calculating evaluation metrics
accuracy_lr = accuracy_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_proba_lr)

# Displaying results
print("\n--- Logistic Regression Test Set Evaluation ---")
print(f"Accuracy            : {accuracy_lr:.2f}")
print(f"ROC-AUC Score       : {roc_auc_lr:.2f}")

`Logistic Regression` also performed well, with **82%** accuracy and a solid **ROC-AUC (0.86)** score. While slightly behind Random Forest, it's simpler and easier to interpret, which can be valuable in practice.
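
The interpretability mentioned above comes from the model's coefficients: each one is the change in the log-odds of churn per unit of the feature. The sketch below fits a model on hypothetical toy data with a single `tenure` feature; in the notebook, `best_lr.coef_[0]` paired with `X_train.columns` would play the same role.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: short-tenure customers churn, long-tenure customers stay.
X = pd.DataFrame({'tenure': [1, 2, 3, 4, 60, 65, 70, 72]})
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = LogisticRegression(max_iter=1000).fit(X, y)

# A negative coefficient lowers the log-odds of churn as the feature grows;
# exp(coefficient) is the multiplicative change in the odds per extra month.
coefficients = pd.Series(model.coef_[0], index=X.columns)
odds_ratios = np.exp(coefficients)

print(coefficients)
print(odds_ratios)
```

Coefficients are only comparable across features when the features share a scale, so standardizing the inputs first is advisable before ranking them.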

# General Conclusion:

In this project, we developed a machine learning pipeline to predict customer churn for **Interconnect**, a telecom provider. Using demographic, contract, and service usage data, we explored key churn patterns, trained classification models, and evaluated their performance based on **accuracy** and **ROC-AUC** metrics.

- The analysis revealed that approximately `26.5%` of customers churned while `73.5%` were retained. Customers with month-to-month contracts showed the highest churn risk, while those on one- or two-year contracts demonstrated significantly better retention. Churn was also highest among customers with short tenure, especially within the first 10 months, suggesting the importance of early engagement. Fiber optic users and customers with higher monthly charges were also more likely to churn, while long-tenure customers tended to remain loyal.

- We trained and tuned two models, `Random Forest` and `Logistic Regression`, to predict churn. The tuned `Random Forest` achieved an accuracy of **0.84** and a `ROC-AUC` of **0.89**, slightly outperforming `Logistic Regression`, which achieved an accuracy of **0.82** and a `ROC-AUC` of **0.86**. This indicates that `Random Forest` is better at ranking customers by churn risk, while `Logistic Regression` remains a simpler and more interpretable model for business stakeholders.


This analysis provided a strong predictive foundation for understanding churn behavior and prioritizing customer retention strategies. Moving forward, **Interconnect** can use these models to prioritize early intervention. To reduce churn, we recommend offering incentives for long-term contracts, improving onboarding for new users, and targeting high-risk customers with personalized retention offers.