### File Upload

This cell opens Colab's file picker so the raw CSV can be uploaded into the session.

In [None]:
from google.colab import files
uploaded = files.upload()

### Data Loading

Here we load the raw dataset into a Pandas DataFrame for analysis.

In [None]:
import pandas as pd
df = pd.read_csv("churn.csv")
df.sample(10)

### Initial Data Inspection

We inspect data types, ranges, and missing values to understand dataset quality.

In [None]:
df.info()
df.describe()

### Missing Values

We count the missing values in each column.

In [None]:
df.isnull().sum()

### Column Names

We list the column names for reference in later steps.

In [None]:
df.columns

### Data Cleaning

We check the dtype of `TotalCharges`. A charge column stored as `object` rather than a numeric dtype would indicate malformed entries.

In [None]:
df['TotalCharges'].dtype

### Converting TotalCharges

Blank strings in `TotalCharges` are replaced with missing values, and the column is converted to a numeric dtype.

In [None]:
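# Treat blank strings as missing, then convert to numeric;
# errors='coerce' turns any other unparseable entry into NaN instead of raising
df['TotalCharges'] = df['TotalCharges'].replace(' ', pd.NA)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')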
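
As a small illustration of the conversion, the sketch below runs `pd.to_numeric` with `errors="coerce"` on a made-up Series. The values and the names `raw`, `cleaned`, and `n_missing` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: two valid charges and one blank string
raw = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" converts anything that cannot be parsed into NaN
cleaned = pd.to_numeric(raw, errors="coerce")

# Count the entries that became missing
n_missing = int(cleaned.isna().sum())
print(cleaned.dtype, n_missing)
```

The blank entry becomes `NaN`, which is why a missing-value count right after the conversion is a useful check.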

### Remaining Missing Values

We count how many `TotalCharges` entries are missing after the conversion.

In [None]:
df['TotalCharges'].isnull().sum()

### Dropping Incomplete Rows

Rows that contain any missing value are dropped.

In [None]:
df.dropna(inplace=True)

### Verifying the Cleanup

We confirm that no missing values remain.

In [None]:
df.isnull().sum()

### Duplicate Rows

We count fully duplicated rows.

In [None]:
df.duplicated().sum()

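### Class Balance

We check the proportion of each `Churn` class. Calling `value_counts` on the whole DataFrame would count unique rows instead, so the target column is selected explicitly.

In [None]:
df['Churn'].value_counts(normalize=True)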

### Charges and Tenure by Churn

Box plots compare `MonthlyCharges` and `tenure` between the two `Churn` groups.

In [None]:
import matplotlib.pyplot as plt
df.boxplot(column = "MonthlyCharges",by = "Churn")
df.boxplot(column = "tenure",by = "Churn")
plt.show()

### Outlier Check on Numeric Columns

We draw a box plot for every non-object column to look for outliers, then print the list of those columns.

In [None]:
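import seaborn as sns

cols = []
for col in df.columns:
    # Skip string (object) columns; plot everything else
    if df[col].dtype != 'object':
        cols.append(col)
        # Pass the values as x explicitly so the plot is horizontal
        sns.boxplot(x=df[col])
        plt.xlabel(col)
        plt.ylabel('range')
        plt.show()
print(cols)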

### Features and Target

We separate the feature matrix `X` from the target `y` (`Churn`).

In [None]:
X = df.drop('Churn', axis=1)
y = df['Churn']

### Train–Test Split

We split the dataset using stratification to preserve class proportions.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
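
To see what stratification does, here is a minimal sketch on synthetic labels with a 20% positive rate. The arrays `X_demo` and `y_demo` are made up for illustration and are unrelated to the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 negatives, 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the original positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Without `stratify`, the positive rate in a small test split can drift away from the rate in the full data purely by chance.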

### Preprocessing Pipeline

A reusable preprocessing pipeline is built to avoid data leakage and ensure consistency.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### Feature Groups

We list the categorical and numeric columns that the preprocessing pipeline will handle separately.

In [None]:
categorical_features = [
    'gender', 'Partner', 'Dependents',
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
    'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
    'PaymentMethod',
]

numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

### Preprocessing Pipeline

A reusable preprocessing pipeline is built to avoid data leakage and ensure consistency.

In [None]:
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline(
    [('imputer', SimpleImputer(strategy='mean')),
     ('scaler', StandardScaler())]
)

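# handle_unknown="ignore" keeps transform() from failing on a category
# that did not appear in the training split
cat_pipeline = Pipeline(
    [('encoder', OneHotEncoder(handle_unknown="ignore"))]
)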

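### Applying the Preprocessor

The numeric and categorical pipelines are combined in a `ColumnTransformer`, which is fit on the training split only and then applied to the test split to avoid leakage. Columns not listed in either group are dropped, which is the `ColumnTransformer` default.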

In [None]:
preprocessor = ColumnTransformer(
    [
        ("num", num_pipeline, numeric_features),
        ("cat", cat_pipeline, categorical_features)
    ]
)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
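
One caveat: because `X_train` is transformed once up front, any later cross-validation reuses scaling statistics computed on all training folds. A common alternative is to keep the preprocessor inside the pipeline so it is refit within each fold. The sketch below shows that pattern on synthetic data; the DataFrame `toy`, the label `toy_y`, and the pipeline names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the raw (untransformed) features
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n).astype(float),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
# Synthetic binary label, loosely tied to tenure
toy_y = (toy["tenure"] + rng.normal(0, 20, size=n) < 30).astype(int)

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# Preprocessing and model live in one estimator, so each CV fold
# fits the imputer, scaler, and encoder on its own training part only
full_pipeline = Pipeline([
    ("prep", toy_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

toy_scores = cross_val_score(full_pipeline, toy, toy_y, cv=5, scoring="roc_auc")
print(toy_scores.mean(), toy_scores.std())
```

Whether the difference matters in practice depends on the data; for mean imputation and standard scaling it is often small, but the single-pipeline form removes the question.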

### Baseline Model: Logistic Regression

A simple, interpretable baseline model is trained for comparison.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

### Baseline Evaluation

We evaluate the baseline on the held-out test set with a classification report and ROC-AUC.

In [None]:
from sklearn.metrics import roc_auc_score, classification_report

y_pred = model.predict(X_test)
# Probability of the second class in model.classes_
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
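
ROC-AUC can be read as the fraction of positive/negative pairs that the scores rank correctly. The sketch below checks this on four made-up examples, encoding the positive class explicitly as 1. The labels and scores are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted churn scores
churn_labels = pd.Series(["No", "No", "Yes", "Yes"])
churn_scores = [0.1, 0.4, 0.35, 0.8]

# Encode the positive class explicitly so there is no ambiguity
y_true_demo = (churn_labels == "Yes").astype(int)

# Three of the four (positive, negative) pairs are ordered correctly
auc_demo = roc_auc_score(y_true_demo, churn_scores)
print(auc_demo)
```

Encoding `Churn` as 0/1 before modeling is optional here, but it makes the meaning of column `[:, 1]` of `predict_proba` explicit.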

### Cross-Validation

We estimate the baseline's ROC-AUC with 5-fold cross-validation on the training set.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=5,
    scoring="roc_auc"
)

scores.mean(), scores.std()


### Model Comparison

We define a helper that scores any estimator with 5-fold cross-validated ROC-AUC, so that candidate models are compared on equal terms.

In [None]:
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X, y):
    scores = cross_val_score(
        model,
        X,
        y,
        cv=5,
        scoring="roc_auc"
    )
    return scores.mean(), scores.std()

### Linear SVM

A linear-kernel SVM is scored with the cross-validation helper.

In [None]:
from sklearn.svm import SVC

svm_clf = SVC(
    kernel="linear",
    probability=True,
    random_state=42
)
svm_pipeline = Pipeline([
    ("model", svm_clf)
])

mean_auc, std_auc = evaluate_model(
    svm_pipeline,
    X_train,
    y_train
)

mean_auc, std_auc

### RBF SVM

An RBF-kernel SVM is scored the same way.

In [None]:
svm_rbf = SVC(
    kernel="rbf",
    C=1.0,
    gamma="scale",
    probability=True,
    random_state=42
)

svm_rbf_pipeline = Pipeline([
    ("model", svm_rbf)
])

evaluate_model(svm_rbf_pipeline, X_train, y_train)


### Decision Tree

An unconstrained decision tree is scored as a reference point.

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=42)

tree_pipeline = Pipeline([
    ("model", tree_clf)
])

evaluate_model(tree_pipeline, X_train, y_train)


### Regularized Decision Tree

Limiting depth and leaf size (`max_depth=5`, `min_samples_leaf=50`) constrains the tree to reduce overfitting.

In [None]:
tree_clf_reg = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=50,
    random_state=42
)

tree_reg_pipeline = Pipeline([
    ("model", tree_clf_reg)
])

evaluate_model(tree_reg_pipeline, X_train, y_train)


### Random Forest

A random forest with 200 trees is scored with the same helper.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf_pipeline = Pipeline([
    ("model", rf_clf)
])

evaluate_model(rf_pipeline, X_train, y_train)


### Comparison Table

All candidate models are scored in one loop and the results are collected in a table.

In [None]:
models = {
    "Logistic Regression": model,
    "Linear SVM": svm_clf,
    "RBF SVM": svm_rbf,
    "Decision Tree": tree_clf_reg,
    "Random Forest": rf_clf
}

results = []

# Use `clf` as the loop variable so the baseline `model` is not overwritten
for name, clf in models.items():
    pipeline = Pipeline([
        ("model", clf)
    ])
    mean_auc, std_auc = evaluate_model(pipeline, X_train, y_train)
    results.append((name, mean_auc, std_auc))

pd.DataFrame(results, columns=["Model", "ROC-AUC Mean", "ROC-AUC Std"])


### Hyperparameter Tuning

We define the random forest search space for cross-validated tuning. The `model__` prefix routes each parameter to the pipeline step named `model`.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_distributions = {
    "model__n_estimators": [100, 200, 300, 500],
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 5, 10],
    "model__max_features": ["sqrt", "log2"]
}

### Random Forest Pipeline for Tuning

A fresh random forest pipeline is created so that the search can set its hyperparameters.

In [None]:
rf_pipeline = Pipeline([
    ("model", RandomForestClassifier(
        random_state=42,
        n_jobs=-1
    ))
])


### Randomized Search

`RandomizedSearchCV` samples 20 parameter combinations and scores each with 5-fold cross-validated ROC-AUC.

In [None]:
random_search = RandomizedSearchCV(
    rf_pipeline,
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="roc_auc",
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
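
For a quick feel of how the search and the `model__` parameter prefix work, here is a scaled-down sketch on synthetic data from `make_classification`. The dataset, the reduced search space `toy_space`, and the small `n_iter` are assumptions chosen so it runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_pipeline = Pipeline([
    ("model", RandomForestClassifier(random_state=0))
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
toy_space = {
    "model__n_estimators": [10, 20],
    "model__max_depth": [None, 3, 5],
}

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=toy_space,
    n_iter=3,          # sample 3 of the 6 possible combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

After fitting, `best_estimator_` holds a pipeline refit on all the data passed to `fit`, since `refit=True` is the default.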


### Best Parameters and Score

We inspect the best parameter combination and its cross-validated ROC-AUC.

In [None]:
# A notebook cell only displays its last expression, so print both values
print(random_search.best_params_)
print(random_search.best_score_)

### Learning Curve

We plot training and validation ROC-AUC against training-set size to check whether the tuned model overfits or would benefit from more data.

In [None]:
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

train_sizes, train_scores, val_scores = learning_curve(
    random_search.best_estimator_,
    X_train,
    y_train,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 5)
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.plot(train_sizes, train_mean, label="Training AUC")
plt.plot(train_sizes, val_mean, label="Validation AUC")
plt.legend()
plt.xlabel("Training size")
plt.ylabel("ROC-AUC")
plt.show()


### Feature Names

We recover the one-hot encoded column names from the fitted preprocessor so that feature importances can be labeled.

In [None]:
ohe = preprocessor.named_transformers_["cat"].named_steps["encoder"]

cat_features = ohe.get_feature_names_out(categorical_features)
all_features = list(numeric_features) + list(cat_features)

### Feature Importances

We plot the 15 largest feature importances of the tuned random forest.

In [None]:
importances = random_search.best_estimator_.named_steps["model"].feature_importances_

feat_imp = (
    pd.Series(importances, index=all_features)
    .sort_values(ascending=False)[:15]
)

feat_imp.plot(kind="barh")
plt.gca().invert_yaxis()
plt.show()


### Dimensionality Reduction (PCA)

PCA is applied before clustering, keeping as many components as needed to explain 90% of the variance (`n_components=0.9`).

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.9, random_state=42)
X_train_pca = pca.fit_transform(X_train)

X_train_pca.shape
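
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data that is built from two latent factors; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Ten observed features driven by two latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

pca_demo = PCA(n_components=0.9)
X_reduced = pca_demo.fit_transform(X_demo)

# Total variance retained by the kept components
explained = pca_demo.explained_variance_ratio_.sum()
print(pca_demo.n_components_, explained)
```

Checking `explained_variance_ratio_` after fitting is a cheap way to confirm how much information the reduced representation retains.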

### Choosing the Number of Clusters

We fit K-Means for k = 2 to 10 on the PCA-reduced data and plot the inertia to look for an elbow.

In [None]:
from sklearn.cluster import KMeans

inertia = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_train_pca)
    inertia.append(kmeans.inertia_)

import matplotlib.pyplot as plt

plt.plot(k_range, inertia, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()


### K-Means Clustering

We fit K-Means with three clusters and attach the cluster labels to the training data.

In [None]:
import pandas as pd

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_train_pca)

# X_train is a NumPy array after preprocessing, so wrap it in a DataFrame
# (labeled with the recovered feature names) before adding the 'Cluster' column
X_train_clustered = pd.DataFrame(X_train, columns=all_features)
X_train_clustered["Cluster"] = clusters

### Cluster Profiles

We compare the mean feature values of each cluster.

In [None]:
X_train_clustered.groupby("Cluster").mean()

### Anomaly Detection

We flag unusual customers with an Isolation Forest, assuming that roughly 5% of the training rows are anomalous (`contamination=0.05`).

In [None]:
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(
    contamination=0.05,
    random_state=42
)

anomaly_labels = iso_forest.fit_predict(X_train)
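
The `contamination` parameter sets the score threshold so that about that share of the training rows is labeled `-1`. A minimal sketch on synthetic Gaussian data (hypothetical, not the churn features) shows the effect:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 400 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))

demo_forest = IsolationForest(contamination=0.05, random_state=0)
labels = demo_forest.fit_predict(X_demo)

# fit_predict returns 1 for inliers and -1 for flagged rows
flagged_share = (labels == -1).mean()
print(flagged_share)
```

Because the share of flagged rows is fixed by `contamination`, the flags indicate the most unusual rows relative to the rest, not proof that those customers are truly anomalous.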

### Inspecting Anomalies

We select the rows flagged as anomalies (label `-1`) and preview them.

In [None]:
import pandas as pd
anomalies = X_train[anomaly_labels == -1]
anomalies = pd.DataFrame(anomalies)
anomalies.head()

### Final Test Evaluation

The tuned random forest is evaluated once on the held-out test set with ROC-AUC and a classification report.

In [None]:
best_model = random_search.best_estimator_

best_model.fit(X_train, y_train)

from sklearn.metrics import roc_auc_score, classification_report

y_test_prob = best_model.predict_proba(X_test)[:, 1]
y_test_pred = best_model.predict(X_test)

print("Test ROC-AUC:", roc_auc_score(y_test, y_test_prob))
print(classification_report(y_test, y_test_pred))