# Introduction to Machine Learning – Project 2025

## Dataset: Customer Churn

This notebook follows the project instructions step by step.

We chose **Dataset A: Churn Data** (from `ChurnData.csv`).

We will follow the required steps:


1. Dataset selection and problem definition

2. Scenario / about the dataset

3. Data loading and summary

4. Data wrangling / preprocessing

5. Exploratory Data Analysis (EDA)

6. Model development (multiple ML algorithms)

7. Model evaluation (metrics & plots)

8. Model refinement and conclusions



## Step 2 – Scenario / About the Dataset

You are working as a **data analyst** for a telecom company.

The company wants to understand and **predict customer churn** (whether a customer is likely to leave).


**Goal:** build a machine learning solution that predicts whether a customer will churn, based on customer attributes (tenure, income, services, etc.).


**Type of problem:** this is a **supervised classification** problem because:

- The target variable `churn` takes discrete values (0 = no churn, 1 = churn).

- We want to assign each customer to one of these classes.
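
To make the classification framing concrete, a majority-class baseline gives the accuracy that any real model has to beat. The sketch below is only an illustration: `X_demo` and `y_demo` are hypothetical names, and the 70/30 class ratio is an assumption, not a property of `ChurnData.csv`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the churn target: 70 non-churners, 30 churners (assumed ratio)
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Once the real target has been loaded, the same check can be run on it. A model whose accuracy is close to this baseline has learned little beyond the class frequencies.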



## Step 3 – Data Loading and Description

In [None]:

import pandas as pd

# Load churn dataset
df = pd.read_csv("/mnt/data/Project/ChurnData.csv")

# Basic info
df.head()


In [None]:

# Shape of the data (rows, columns)
df.shape


In [None]:

# Column names and data types
df.dtypes


In [None]:

# Basic statistics for numerical features
df.describe()


### Short description of the attributes
Brief description of each attribute (the wording can be adapted if needed):

- `tenure`: Number of months the customer has been with the company.

- `age`: Age of the customer.

- `address`: Related to how long the customer has lived at the current address.

- `income`: Normalized income indicator.

- `ed`: Education level.

- `employ`: Years of employment.

- `equip`: Type or presence of company equipment.

- `callcard`, `wireless`, `voice`, `pager`, `internet`: Usage or subscription indicators for different services.

- `longmon`, `tollmon`, `equipmon`, `cardmon`, `wiremon`: Monthly billing amounts for different services.

- `longten`, `tollten`, `cardten`: Tenure-related metrics for different services.

- `callwait`, `confer`, `ebill`: Service features (call waiting, conference, electronic billing).

- `loglong`, `logtoll`, `lninc`: Log-transformed numeric features.

- `custcat`: Customer category (segment).

- `churn`: Target variable (1 = churn, 0 = no churn).



## Step 4 – Data Wrangling / Preprocessing

In [None]:

# Check for missing values
df.isna().sum()


In [None]:

# Handle missing values (if any)
# - Numerical columns: fill with the median
# - Categorical columns: fill with the mode
# Note: these statistics are computed on the full dataset; for a strict
# evaluation, compute them on the training split only.

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(include=["object", "bool"]).columns.tolist()

for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Total number of remaining missing values (should be 0)
df.isna().sum().sum()


In [None]:

# Separate features X and target y
target_col = "churn"
X = df.drop(columns=[target_col])
# Cast to int in case the labels were read from the CSV as floats (0.0 / 1.0)
y = df[target_col].astype(int)

X.head()


In [None]:

# Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

X_train.shape, X_test.shape


In [None]:

# Scale features (many models work better with scaled data).
# The scaler is fitted on the training set only, so no test-set statistics leak in.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


**Summary of preprocessing:**
- Checked and imputed missing values.
- Split data into training and test sets.
- Scaled numerical features for the models.
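
One way to keep imputation and scaling from seeing the test set is to wrap them in a scikit-learn `Pipeline`, so every statistic is learned from the training split only. The following is a minimal sketch under stated assumptions: it runs on synthetic data from `make_classification`, and the column names `f0`..`f5` and the variables `X_demo`, `y_demo`, and `pipe` are hypothetical stand-ins, not part of the project data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (hypothetical column names f0..f5)
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
y_demo = pd.Series(y_arr, name="churn")

# Inject a few missing values so the imputer has something to do
X_demo.iloc[::25, 0] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Median imputation and scaling are fitted on the training split only;
# the test split is transformed with those training statistics
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to cross-validation or grid search, which then repeats the preprocessing inside each fold.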


## Step 5 – Exploratory Data Analysis (EDA)

In [None]:

import matplotlib.pyplot as plt

# Distribution of target variable (churn)
y.value_counts().plot(kind='bar')
plt.title('Class distribution: churn')
plt.xlabel('Churn (0 = no, 1 = yes)')
plt.ylabel('Count')
plt.show()


In [None]:

# Example: histogram of a few important numeric features
cols_to_plot = ["tenure", "age", "income", "longmon"]

for col in cols_to_plot:
    df[col].hist(bins=20)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.show()


In [None]:

# Correlation matrix (numeric features only)
corr = df.corr(numeric_only=True)
corr


In [None]:

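# Correlation of each feature with churn, sorted from most positive to most negative
# (churn itself is dropped because its correlation with itself is always 1)
corr_with_churn = corr['churn'].drop('churn').sort_values(ascending=False)
corr_with_churn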


**Key findings (to be adapted after running the notebook):**
- Some features, such as tenure, income, and long-distance usage, may correlate with churn.
- Class distribution may be slightly imbalanced (depending on counts).
- Certain service-related features (e.g. `ebill`, `internet`) might be associated with higher or lower churn.
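
To quantify findings like these, the class shares and the churn rate within each group of a binary feature can be computed directly. The sketch below uses a tiny hand-made table; the values are illustrative only and say nothing about the real relationship between `ebill` and `churn`.

```python
import pandas as pd

# Tiny hand-made stand-in for the churn data (values are illustrative only)
demo = pd.DataFrame({
    "ebill": [0, 0, 0, 0, 1, 1, 1, 1],
    "churn": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of each class, to judge how imbalanced the target is
class_share = demo["churn"].value_counts(normalize=True)
print(class_share)

# The mean of a 0/1 target within each group is the churn rate of that group
churn_rate_by_ebill = demo.groupby("ebill")["churn"].mean()
print(churn_rate_by_ebill)
```

On the real data, replacing `demo` with `df` and looping over the service indicator columns gives one churn rate table per feature.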


## Step 6 – Model Development
We will train **multiple machine learning algorithms**:
- Logistic Regression (baseline, commonly taught in class).
- Random Forest Classifier.
- Gradient Boosting Classifier (a more advanced ensemble method that may qualify as “not taught in class”).


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

trained_models = {}

for name, clf in models.items():
    clf.fit(X_train_scaled, y_train)
    trained_models[name] = clf
    print(f"Trained: {name}")


## Step 7 – Model Evaluation

In [None]:

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

results = []

for name, clf in trained_models.items():
    y_pred = clf.predict(X_test_scaled)
    y_proba = clf.predict_proba(X_test_scaled)[:, 1] if hasattr(clf, "predict_proba") else None
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba) if y_proba is not None else None
    
    results.append({
        "model": name,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
        "roc_auc": auc
    })
    
    print("==", name, "==")
    print("Accuracy:", acc)
    print("Precision:", prec)
    print("Recall:", rec)
    print("F1-score:", f1)
    if auc is not None:
        print("ROC AUC:", auc)
    print("Classification report:\n", classification_report(y_test, y_pred))
    print("\n")


In [None]:

import pandas as pd

results_df = pd.DataFrame(results)
results_df


In [None]:

# Optional: simple bar plot of F1-score by model
results_df.set_index("model")["f1"].plot(kind="bar")
plt.title("F1-score by model")
plt.ylabel("F1-score")
plt.show()


## Step 8 – Model Refinement & Conclusions

### Comparison of models

After running the notebook, compare the metrics:

- Which model has the highest **F1-score**?

- Which model has the best **ROC AUC** (when available)?

- Is there a trade‑off between precision and recall?
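
The precision/recall trade-off can be made visible by varying the decision threshold applied to the predicted probabilities. The following sketch assumes synthetic, mildly imbalanced data and a plain logistic regression; the thresholds 0.3, 0.5, and 0.7 are arbitrary illustration values, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# A lower threshold flags more customers as churners: recall can only stay
# the same or rise, usually at the cost of precision
rows = []
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    rows.append({
        "threshold": threshold,
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
    })
    print(rows[-1])

# Confusion matrix at the 0.5 threshold: rows = actual class, columns = predicted class
cm = confusion_matrix(y_te, (proba >= 0.5).astype(int))
print(cm)
```

If missing a churner costs more than contacting a loyal customer, a threshold below 0.5 may be the better operating point.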


### Possible refinements

- Try **hyperparameter tuning** (GridSearchCV or RandomizedSearchCV) for the best models.

- Perform **feature selection** or regularization to reduce overfitting.

- Try additional or more advanced algorithms (e.g. XGBoost, LightGBM) if available.

- Handle possible class imbalance (e.g. using `class_weight='balanced'` or resampling techniques).
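
As a rough sketch of how tuning and imbalance handling could be combined, `GridSearchCV` can search over `class_weight` together with the other hyperparameters. The grid values below are assumptions chosen to keep the example fast, not recommended settings, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Small illustrative grid (assumed values); class_weight is tuned like any other parameter
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

# 5-fold cross-validation on the training split, optimizing F1 for the churn class
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")

# The test split is used once, after tuning, for the final estimate
test_f1 = f1_score(y_te, search.predict(X_te))
print(f"Test F1: {test_f1:.3f}")
```

Tuning on cross-validation folds of the training data keeps the test set untouched until the final comparison.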


### Limitations of the models

- The dataset size and quality may limit generalization.

- Features may not capture all reasons for customer churn.

- Models assume that the future will be similar to the historical data.


### Final recommendation to the client

- Choose the model with the best balance between recall (detecting churners) and precision (avoiding false alarms).

- Use the predictions to **prioritize retention campaigns** for customers at high risk of churn.
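
One possible way to turn predictions into a campaign list is to rank customers by predicted churn probability and contact the top of the list. The sketch below assumes synthetic data; the campaign size `top_n`, the column names, and the variable names are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (hypothetical column names f0..f5)
X_arr, y_arr = make_classification(n_samples=300, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.3, random_state=42, stratify=y_arr
)

model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# Score each held-out customer and sort from highest to lowest predicted risk
risk = pd.DataFrame(
    {"churn_probability": model.predict_proba(X_te)[:, 1], "actual_churn": y_te},
    index=X_te.index,
).sort_values("churn_probability", ascending=False)

top_n = 10  # hypothetical campaign size
campaign_targets = risk.head(top_n)
print(campaign_targets)
```

The campaign size would in practice follow from the retention budget and the cost of contacting a customer who was not going to leave.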

