In [None]:
import pandas as pd
import numpy as np
# For visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.options.display.max_rows = None
pd.options.display.max_columns = None
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import recall_score

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/HumayDS/Big-data-analysis/main/Churn_Modelling.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,2,3,15619304,Onio,502,,Female,42,8,159660.8,3,1,0,113931.57,1
3,3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [None]:
Customer_id = df['CustomerId']

In [None]:
#Drop redundant columns
df = df.drop(['Unnamed: 0' , 'RowNumber' , 'CustomerId','Surname'] , axis = 1)

In [None]:
# Fill missing values in the categorical columns with the mode (most frequent value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Geography'] = df['Geography'].fillna(df['Geography'].mode()[0])

In [None]:
# Create dummy variables (one-hot encoding)
# Run this cell only once
categorical_cols = df.select_dtypes(include='object').columns
df_dummies = pd.get_dummies(df[categorical_cols], drop_first=True, dtype=int)
df = df.drop(columns=categorical_cols)
df = pd.concat([df, df_dummies], axis=1)
df.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1,0


#XGBoost
##It's an optimized implementation of gradient boosting, designed for speed and performance.
###It's widely used in Kaggle competitions, industry projects, and research because it offers:
###High accuracy
###Fast training speed
###Built-in handling of missing data
###Built-in regularization to limit overfitting

###XGBoost stands for Extreme Gradient Boosting.
###Boosting = building a strong model by combining many weak models (usually shallow decision trees).
###Each new tree is trained to fix the errors made by the previous ones.
###Gradient boosting specifically fits each new tree to the gradient of the loss function, which is how the error is reduced step by step.
#XGBoost builds trees step by step, and at each step:
###It looks at where the current model makes mistakes.
###It builds a new tree to correct those mistakes.
###It adds the new tree to the ensemble; the final prediction combines all trees.

Let's say you're predicting house prices.

1. Start with a simple model (like predicting the average price).

2. Calculate the errors (the difference between the actual and predicted prices).

3. Build a small decision tree that predicts those errors.

4. Add the new tree's predictions to the current model to improve it.

5. Repeat steps 2–4 many times (each tree fixes the mistakes of the model so far).

6. Combine all trees for the final result.

This is "boosting."
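The house-price walkthrough above can be sketched in a few lines. This is a minimal illustration of plain gradient boosting with squared error, not XGBoost's actual implementation; the data, the `learning_rate` value, and the helper `boosted_predict` are assumptions made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(50, 250, size=(200, 1))                 # hypothetical feature: living area
y = 1000 * X[:, 0] + rng.normal(0, 10000, size=200)     # hypothetical house prices

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start with the average price
trees, mse_history = [], []

for _ in range(50):
    residuals = y - prediction                            # errors of the current model
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                                # small tree that predicts the errors
    prediction = prediction + learning_rate * tree.predict(X)   # add it to the model
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

def boosted_predict(X_new):
    """Combine the starting value and all trees for the final prediction."""
    out = np.full(X_new.shape[0], y.mean())
    for t in trees:
        out += learning_rate * t.predict(X_new)
    return out

print(f"MSE after first tree: {mse_history[0]:.0f}")
print(f"MSE after last tree:  {mse_history[-1]:.0f}")
```

Each round lowers the training error a little, which is the "each tree fixes the previous mistakes" idea in code form.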

🪵 Random Forest = Many independent trees that vote together → stable, simple, and fast.

⚡ XGBoost = Many dependent trees that learn from each other's mistakes → often more accurate, but more complex to tune.

Random Forest = "A classroom of students answering the same question independently, then taking a majority vote."

XGBoost = "A classroom where each student learns from the previous one's mistakes to improve the final answer."

In Random Forest, trees are independent.
In XGBoost, trees are dependent.

XGBoost adds learning_rate, regularization (L1/L2), and gamma, which don't exist in Random Forest.

XGBOOST ✅ Handles missing values automatically. During training, it learns the best direction (left or right) to take when a feature is missing.
Less robust to outliers, because boosting focuses on correcting errors, and outliers create large errors that can distort learning.
RANDOM FOREST ❌ Typically does not handle missing values automatically. In most implementations you must fill or impute them before training (e.g., with the mean, median, or mode).
Fairly robust to outliers, because Random Forest averages many trees → a single outlier has little influence.


###🔹 Random Forest avoids overfitting through randomness and averaging.
###🔹 XGBoost can overfit because it learns sequentially, but offers strong regularization tools to control it.

# Hyperparameters of XGBoost

max_depth → Maximum depth of each decision tree (the longest chain of splits from the root to a leaf).
🔹 Higher = more complex model → risk of overfitting.
🔹 Lower = simpler model → might underfit.
🔹 Usually between 3–10.
🔹 Start around 5–6.

learning_rate → Controls how much each new tree contributes to the model. This is one of the most influential XGBoost parameters, and it interacts strongly with n_estimators.

Small → slower learning (more trees needed) but usually more accurate.

Large → faster learning but riskier.
🔹 Usually between 0.01–0.3.
🔹 Start with 0.1; reduce if overfitting occurs.

0.05–0.1 → usually the most stable and accurate models

0.01 → for very large datasets (needs many more trees)

0.2–0.3 → faster results, medium accuracy

Above 0.3 → generally not recommended

n_estimators → Number of boosting rounds (trees).
🔹 More trees = better performance (up to a limit).
🔹 Too many trees → longer training and overfitting risk (especially if learning_rate is high).
🔹 Usually 100–1000.
🔹 Use early stopping to find the optimal number.
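One way to see what "finding the optimal number of trees" means is to track the validation loss after every boosting round and keep the round where it is lowest. The sketch below swaps in scikit-learn's `GradientBoostingClassifier` and its `staged_predict_proba` method rather than XGBoost, so it does not depend on a particular XGBoost version; the synthetic dataset and all parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): stands in for a real training set
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_tr, y_tr)

# Validation loss after 1, 2, ..., 300 trees
val_losses = [log_loss(y_val, proba) for proba in model.staged_predict_proba(X_val)]
best_n_trees = int(np.argmin(val_losses)) + 1

print(f"Best number of trees on the validation set: {best_n_trees}")
print(f"Validation log-loss there: {val_losses[best_n_trees - 1]:.4f}")
print(f"Validation log-loss with all 300 trees: {val_losses[-1]:.4f}")
```

Early stopping automates the same idea: training halts once the validation loss has stopped improving for a set number of rounds.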

subsample → Fraction of observations (rows) sampled for each tree.

Each tree sees a different subset → more robust model.

1.0 → 100% of data

0.8 → 80% of data

0.5 → 50% of data
🔹 Adds randomness.
🔹 Lower = reduces overfitting.
🔹 Too low = underfitting.
🔹 Usually 0.5–1.0; 0.8 is a good start.

Example: subsample = 0.5 on a training set of 1,000 rows:

Tree 1 → randomly selects 500 rows

Tree 2 → another random 500 rows

Tree 3 → another random 500 rows
➡ Each tree sees different points → the model is less able to memorize individual points → generalization usually improves
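The row sampling in this example can be imitated with NumPy. This is only a sketch of the idea (drawing a fresh random subset of rows per tree); the row count and the seed are assumptions, and XGBoost's internal sampling is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, subsample = 1000, 0.5
n_sampled = int(n_rows * subsample)   # 500 rows per tree

# Draw an independent random subset of row indices for each of three trees
samples = [rng.choice(n_rows, size=n_sampled, replace=False) for _ in range(3)]

for i, rows in enumerate(samples, start=1):
    print(f"Tree {i}: {len(rows)} rows, smallest indices {np.sort(rows)[:5]}")

# Two trees share only part of their rows
overlap = len(np.intersect1d(samples[0], samples[1]))
print(f"Rows shared by tree 1 and tree 2: {overlap}")
```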

colsample_bytree → Fraction of features (columns) sampled for each tree.
🔹 Reduces correlation between trees.
🔹 Lower = reduces overfitting.
🔹 Too low = underfitting.
🔹 Usually 0.5–1.0, often 0.8.

reg_alpha (L1 penalty) → Higher = stronger regularization → reduces overfitting.

Encourages sparsity (some leaf weights are pushed to exactly zero).

Range: 0–5

reg_lambda (L2 penalty) → Penalizes large leaf weights but does not make them zero.

Range: 1–10

Regularization = mechanism to prevent overfitting.

If the model is too complex → overfitting occurs

Regularization → adds a "penalty" → simplifies the model

In XGBoost, there are two penalty terms:

L2 penalty → reg_lambda → makes the model more "cautious", reduces overfitting

1 → normal regularization

5 → stronger penalty → simpler model

0 → no penalty → risky, overfitting possible

L1 penalty → reg_alpha → can zero out unnecessary leaves, simplifies the model

Useful for datasets with many features

Analogy:

L1 (alpha) → "cut some"

L2 (lambda) → "soften all a bit"

Defaults: reg_alpha = 0, reg_lambda = 1 (usually sufficient for small datasets)
For large datasets with many features → try reg_alpha = 0.1–1, reg_lambda = 1–5
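To make "cut some" versus "soften all a bit" concrete, here is a small sketch of how the two penalties act on a single leaf. It assumes the usual closed-form leaf weight from the gradient-boosting formulation, where G is the sum of gradients and H the sum of Hessians in the leaf: the L1 term shrinks |G| toward zero and the L2 term enlarges the denominator. The function name `leaf_weight` and the numbers are hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight w = -soft_threshold(G, alpha) / (H + lambda)  (illustrative sketch)."""
    shrunk = max(abs(grad_sum) - reg_alpha, 0.0)   # L1: cut |G| by alpha, floor at zero
    sign = 1.0 if grad_sum > 0 else -1.0
    return -sign * shrunk / (hess_sum + reg_lambda)   # L2: lambda softens every weight

G, H = -4.0, 3.0   # hypothetical gradient and Hessian sums for one leaf

print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=0.0))   # no penalty
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=1.0))   # L2 only: 1.0
print(leaf_weight(G, H, reg_alpha=1.0, reg_lambda=1.0))   # L1 + L2: 0.75
print(leaf_weight(G, H, reg_alpha=5.0, reg_lambda=1.0))   # strong L1: 0.0
```

With a large enough `reg_alpha` the leaf contributes nothing at all, while `reg_lambda` only ever scales the weight down.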

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import pandas as pd

# 🎯 Define features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# 1️⃣ Train/test split (stratify=y keeps the class ratio the same in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Original class distribution in Train:", y_train.value_counts())

# 2️⃣ Create and fit XGBoost model
xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.5,
    reg_lambda=2,
    #scale_pos_weight=4,   # Uncomment if the classes are imbalanced (algorithm-level fix: negatives / positives, e.g. 80/20 = 4)
    random_state=42,
)

xgb_model.fit(X_train, y_train)

# 3️⃣ Predictions
pred_train = xgb_model.predict(X_train)
pred_test = xgb_model.predict(X_test)

# Accuracy
acc_train = accuracy_score(y_train, pred_train)
acc_test = accuracy_score(y_test, pred_test)

print(f"🔹 Train accuracy: {acc_train:.4f}")
print(f"🔹 Test accuracy:  {acc_test:.4f}\n")

# 4️⃣ Custom confusion matrix (rows = predicted class, columns = actual class)
def confusion_matrix_custom(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()   # sklearn's order for binary labels [0, 1]

    cm_df = pd.DataFrame([[tp, fp],
                          [fn, tn]],
                         index=["Predicted 1", "Predicted 0"],
                         columns=["Actual 1", "Actual 0"])

    print("Confusion Matrix (Predicted on left, Actual on top):\n", cm_df)
    print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}\n")

    return cm_df

# === For Train set ===
print("=== For Train set ===")
print(classification_report(y_train, pred_train))
confusion_matrix_custom(y_train, pred_train)

# === For Test set ===
print("=== For Test set ===")
print(classification_report(y_test, pred_test))
confusion_matrix_custom(y_test, pred_test)


Original class distribution in Train: Exited
0    5574
1    1426
Name: count, dtype: int64
🔹 Train accuracy: 0.8917
🔹 Test accuracy:  0.8647

=== For Train set ===
              precision    recall  f1-score   support

           0       0.90      0.98      0.93      5574
           1       0.85      0.57      0.68      1426

    accuracy                           0.89      7000
   macro avg       0.88      0.77      0.81      7000
weighted avg       0.89      0.89      0.88      7000

Confusion Matrix (Predicted on left, Actual on top):
              Actual 1  Actual 0
Predicted 1       806       138
Predicted 0       620      5436

TP: 806, TN: 5436, FP: 138, FN: 620

=== For Test set ===
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      2389
           1       0.76      0.49      0.60       611

    accuracy                           0.86      3000
   macro avg       0.82      0.73      0.76      3000
weighted avg       0.8

Unnamed: 0,Actual 1,Actual 0
Predicted 1,299,94
Predicted 0,312,2295


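SMOTE / Oversampling:

If recall on the minority class is still critical, the minority class can be enlarged artificially.

scale_pos_weight and SMOTE both correct for class imbalance, so combining them stacks two corrections: after SMOTE the classes are already balanced, and a scale_pos_weight above 1 pushes the model further toward predicting class 1 (higher recall, lower precision).

###SMOTE - Synthetic Minority Oversampling Technique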
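The core idea of SMOTE is to create each synthetic minority point on the line segment between a real minority point and one of its nearest minority neighbors. The NumPy sketch below illustrates only that interpolation step; it is not the imbalanced-learn implementation, and the function name `smote_like_samples` and the toy points are assumptions.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbors of each point

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))        # pick a real minority point
        nb = neighbors[base, rng.integers(k)]       # pick one of its neighbors
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Hypothetical minority-class points (two features)
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3],
                       [1.5, 1.1], [1.1, 1.6], [0.9, 0.7]])
synthetic = smote_like_samples(X_minority, n_new=10)
print(synthetic.round(2))
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.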

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import pandas as pd
from collections import Counter

# 🎯 Define features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# 1️⃣ Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # stratify=y ensures class distribution is preserved
)

# 2️⃣ Increase minority class using SMOTE (training set only; the test set stays untouched)
smote = SMOTE(random_state=42)   # named smote so it does not shadow the statsmodels alias sm
X_res, y_res = smote.fit_resample(X_train, y_train)

print("Original class distribution:", Counter(y_train))
print("Class distribution after SMOTE:", Counter(y_res))

# 3️⃣ Calculate scale_pos_weight (negatives / positives; equals 1.00 after SMOTE)
neg, pos = Counter(y_res)[0], Counter(y_res)[1]
scale_pos_weight = neg / pos
print(f"\nscale_pos_weight = {scale_pos_weight:.2f}\n")

# 4️⃣ Create and fit XGBoost model
xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.5,
    reg_lambda=2,
    scale_pos_weight=4,   # hard-coded: weights class 1 four times on top of SMOTE (the value computed above is not used)
    random_state=42,
)

xgb_model.fit(X_res, y_res)

# 5️⃣ Predictions (train metrics are computed on the resampled training data)
pred_train = xgb_model.predict(X_res)
pred_test = xgb_model.predict(X_test)

# Accuracy
acc_train = accuracy_score(y_res, pred_train)
acc_test = accuracy_score(y_test, pred_test)

print(f"🔹 Train accuracy: {acc_train:.4f}")
print(f"🔹 Test accuracy:  {acc_test:.4f}\n")

# 6️⃣ Confusion matrices (rows = predicted class, columns = actual class)
def confusion_matrix_custom(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()   # sklearn's order for binary labels [0, 1]

    cm_df = pd.DataFrame([[tp, fp],
                          [fn, tn]],
                         index=["Predicted 1", "Predicted 0"],
                         columns=["Actual 1", "Actual 0"])

    print("Confusion Matrix (Predicted on left, Actual on top):\n", cm_df)
    print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}\n")

    return cm_df

# For Train set
print("=== For Train set ===")
print(classification_report(y_res, pred_train))
confusion_matrix_custom(y_res, pred_train)

# For Test set
print("=== For Test set ===")
print(classification_report(y_test, pred_test))
confusion_matrix_custom(y_test, pred_test)

Original class distribution: Counter({0: 5574, 1: 1426})
Class distribution after SMOTE: Counter({0: 5574, 1: 5574})

scale_pos_weight = 1.00

🔹 Train accuracy: 0.8175
🔹 Test accuracy:  0.6790

=== For Train set ===
              precision    recall  f1-score   support

           0       0.96      0.66      0.78      5574
           1       0.74      0.97      0.84      5574

    accuracy                           0.82     11148
   macro avg       0.85      0.82      0.81     11148
weighted avg       0.85      0.82      0.81     11148

Confusion Matrix (Predicted on left, Actual on top):
              Actual 1  Actual 0
Predicted 1      5422      1882
Predicted 0       152      3692

TP: 5422, TN: 3692, FP: 1882, FN: 152

=== For Test set ===
              precision    recall  f1-score   support

           0       0.95      0.63      0.76      2389
           1       0.38      0.87      0.53       611

    accuracy                           0.68      3000
   macro avg       0.6

Unnamed: 0,Actual 1,Actual 0
Predicted 1,533,885
Predicted 0,78,1504


Decision Tree, Random Forest, XGBoost ----->>> Tree-Based Models

SVM, KNN, GMM ----->> Distance-Based Models

#What is SVM? (Support Vector Machines)

Effective for small to medium-sized data

Support Vector Machines (SVMs) are supervised machine-learning models used for classification and regression. An SVM tries to find the best boundary (called a hyperplane) that separates the data into classes. The best hyperplane is the one with the maximum margin, i.e. the largest distance between the boundary and the closest data points of any class.

The data points that lie closest to the boundary are called support vectors. They "support", or define, the exact position of the hyperplane.

Imagine two classes of points (red and blue). Many lines could separate them, but SVM chooses the optimal one.

Optimal = the separating hyperplane with the maximum margin

Margin = the distance between the hyperplane and the closest data points on each side.
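Support vectors and the margin can be inspected directly on a tiny example. The sketch below assumes six hand-made points that are separable by the vertical line x = 1 and uses a very large `C` to approximate a hard margin; for a linear SVM with weight vector w, the distance from the hyperplane to the closest points is 1 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 on the left, class 1 on the right
X = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
              [2.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6)   # very large C: (almost) no misclassification allowed
svm.fit(X, y)

w = svm.coef_[0]
margin = 1.0 / np.linalg.norm(w)    # distance from the hyperplane to the closest points

print("Support vectors:\n", svm.support_vectors_)
print(f"Margin: {margin:.3f}")
```

Only points at x = 0 or x = 2 can be support vectors here; the two outer points lie farther from the boundary and do not affect it.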

In [None]:

from IPython.display import Image

# Display an image from the web
url = "https://data-flair.training/blogs/wp-content/uploads/sites/2/2019/07/introduction-to-SVM.png"
display(Image(url=url))


Key Hyperparameters of SVMs

1.Kernel

2.C

3.Gamma

##1. A kernel is a function that helps SVM separate data when it is NOT linearly separable.
kernel = 'linear' → If the data is linearly separable, SVM draws a straight line (a flat hyperplane).
kernel = 'rbf' → If the data is not linearly separable, the kernel implicitly maps it into a higher-dimensional space where it becomes separable.
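The difference between the two kernels shows up clearly on data that no straight line can separate. This sketch uses scikit-learn's `make_circles` (one class forms a ring around the other); the dataset and parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_tr, y_tr)
    scores[kernel] = model.score(X_te, y_te)   # test accuracy

print(scores)
```

The linear kernel stays close to guessing on this data, while the RBF kernel separates the inner circle from the outer ring.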

In [None]:

from IPython.display import Image

# Display an image from the web
url = "https://media.geeksforgeeks.org/wp-content/uploads/20250513094345928254/svm.webp?utm_source=chatgpt.com"
display(Image(url=url))


Use the linear kernel when: the data is already almost linearly separable, or you want a simpler, faster model.
Use the RBF kernel when: the data has complex, non-linear patterns and you want higher model performance.

In [None]:

from IPython.display import Image

# Display an image from the web
url = "https://cdn.hashnode.com/res/hashnode/image/upload/v1735885699184/222d7252-7ece-4e31-97f5-577bb8577797.png?utm_source=chatgpt.com"
display(Image(url=url))

# 2. C parameter

C = how much penalty SVM gives to misclassified points.

C is small → SVM is relaxed (soft, tolerant): it accepts some misclassified points in exchange for a wider margin.

C is large → SVM is strict: it tries to classify every training point correctly.

A very large C can overfit.

A very small C can underfit, while a moderate C often generalizes better.

A common starting grid:

➤ C ∈ {0.01, 0.1, 1, 10, 100}

If the model underfits, increase C step by step:
1 → 10 → 100
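A grid like the one above is usually searched with cross-validation rather than by hand. The following is a minimal sketch with `GridSearchCV` on synthetic data; the dataset and the 5-fold setting are assumptions, and scaling is placed inside the pipeline so each fold fits its own scaler.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption): stands in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100]}   # "svc" is the step name make_pipeline assigns

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print("Best C:", grid.best_params_["svc__C"])
print(f"Mean CV accuracy: {grid.best_score_:.3f}")
```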

#3. Gamma

Gamma controls the "reach" of each training point in the RBF kernel. It acts like the inverse of the influence radius:

Small γ → each point has influence over a very wide area (smoother boundary, may underfit).

Large γ → each point only affects points very close to it (may overfit).

As gamma increases, the influence radius shrinks: each point has a very "local" influence, the model memorizes even tiny details, and this leads to overfitting.
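The "reach" of a point can be read off the RBF kernel formula, K(x, x') = exp(-γ · ||x - x'||²). The short sketch below evaluates it at a few distances for three gamma values; the helper name `rbf_kernel_value` and the chosen distances are assumptions for illustration.

```python
import numpy as np

def rbf_kernel_value(distance, gamma):
    """Similarity between two points that are `distance` apart under the RBF kernel."""
    return np.exp(-gamma * distance ** 2)

distances = np.array([0.0, 0.5, 1.0, 2.0])

for gamma in (0.1, 1.0, 10.0):
    values = rbf_kernel_value(distances, gamma)
    print(f"gamma={gamma:>4}: {np.round(values, 4)}")
```

With gamma = 0.1 a point two units away still has noticeable similarity, while with gamma = 10 the similarity is already almost zero at distance 1, which is the "very local influence" described above.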

In [None]:
from IPython.display import Image

# Display an image from the web
url = "https://scikit-learn.org/stable/_images/sphx_glr_plot_rbf_parameters_001.png"
display(Image(url=url))

# Scaling is very important for SVM (but not for tree-based models)

Scaling = converting all features to a comparable numerical range.

Age --->> 26

Salary ---->> 1800

If you don't scale:

Salary has large numbers → it dominates the distance calculation, so the model pays more attention to salary

Age has small numbers → the model effectively ignores it

Scaling puts everything on a similar scale so no feature dominates unfairly.

For distance-based models such as SVM and KNN, scaling is essential, because differences in feature magnitudes can significantly affect the results. Logistic Regression is not distance-based, but it also benefits from scaling when it is regularized or fitted with a gradient-based solver.

For tree-based models such as Decision Tree, Random Forest, and XGBoost, scaling is not needed, because these models look only at the relative order of feature values when splitting nodes, not at their measurement units.
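The age and salary example can be checked numerically. In this sketch the three customers are hypothetical: B differs from A mainly in age, C differs from A mainly in salary. On raw values the salary gap decides almost everything; after standardization both gaps count about equally.

```python
import numpy as np

# Hypothetical customers: columns are [age, salary]
customers = np.array([
    [26.0, 1800.0],   # A
    [60.0, 1810.0],   # B: very different age, almost the same salary
    [27.0, 2500.0],   # C: almost the same age, different salary
])

def distance(a, b):
    return float(np.linalg.norm(a - b))

# Euclidean distances on the raw values
raw_ab = distance(customers[0], customers[1])
raw_ac = distance(customers[0], customers[2])

# Standardize each column (mean 0, standard deviation 1), then measure again
scaled = (customers - customers.mean(axis=0)) / customers.std(axis=0)
scaled_ab = distance(scaled[0], scaled[1])
scaled_ac = distance(scaled[0], scaled[2])

print(f"Raw:    A-B = {raw_ab:.1f},  A-C = {raw_ac:.1f}")
print(f"Scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

Without scaling, a KNN or SVM model would treat B as far more similar to A than C is, even though B is 34 years older.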

In [None]:
from IPython.display import Image

# Display an image from the web
url = "https://miro.medium.com/v2/resize:fit:640/format:webp/1*Bx8sWhleKvBdSWECm6eeFg.png"
display(Image(url=url))

Data should be scaled when distance (or feature magnitude) plays an important role in the model.
Distance-based models are sensitive to the magnitude of numerical values.
If the features have very different scales, the results will be dominated by the features with the largest values.

Models that need scaling include:

KNN (K-Nearest Neighbors)

K-means clustering

Hierarchical clustering

SVM (especially RBF kernel, polynomial kernel)

PCA (because variance is measured)

Neural Networks / MLP / Deep Learning

✔ Do NOT require scaling:

Decision Tree

Random Forest

XGBoost

LightGBM

CatBoost

These models are based on a decision-tree structure, and the magnitude of the numbers plays no role.

Tree models simply split features based on thresholds → whether a value is 10 or 10,000 does not matter.
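A quick way to convince yourself that trees ignore feature magnitude is to multiply one feature by 1,000 and refit. The sketch below uses synthetic data and a single `DecisionTreeClassifier`; the dataset and depth are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Same data, but the first feature is measured in units 1,000 times smaller
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_rescaled, y)

# The split thresholds move with the units, but the resulting predictions do not
same_predictions = np.array_equal(
    tree_raw.predict(X), tree_rescaled.predict(X_rescaled)
)
print("Identical predictions:", same_predictions)
```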

###Normalization vs Standardization
Normalization (Min-Max Scaling)

Compresses each feature into the range 0 to 1.

It is a linear rescaling, so the shape of the original distribution is preserved; only the range changes.

Standardization (Z-score Scaling)

Transforms values so that the mean becomes 0 and the standard deviation becomes 1.

If there are outliers → Standardization (Z-score) usually works better.

Normalization (Min-Max), on the other hand, gets heavily distorted: a single extreme value defines the minimum or maximum and squeezes all other values into a narrow band.
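Both formulas are short enough to apply by hand. The sketch below runs them on a hypothetical column with one extreme value so the effect of that value on each result is visible; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical feature with one extreme value at the end
values = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 1000.0])

# Normalization (Min-Max): (x - min) / (max - min)
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization (Z-score): (x - mean) / std
z_score = (values - values.mean()) / values.std()

print("Min-Max:", np.round(min_max, 4))
print("Z-score:", np.round(z_score, 4))
```

After Min-Max scaling the five ordinary values all fall between 0 and about 0.02, because the single value of 1000 defines the upper end of the range.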

In [None]:
from sklearn.preprocessing import StandardScaler

# Separate features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# Initialize StandardScaler
scaler = StandardScaler()

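# Fit and transform the features
# (for illustration only: when training a model, fit the scaler on the training split and reuse it on the test split)
X_scaled = scaler.fit_transform(X)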

# Convert the scaled features back to a DataFrame
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Combine scaled features with the target variable
df_scaled = pd.concat([X_scaled_df, y.reset_index(drop=True)], axis=1)

print("First 5 rows of the standardized DataFrame (df_scaled):")
print(df_scaled.head())

First 5 rows of the standardized DataFrame (df_scaled):
   CreditScore       Age    Tenure   Balance  NumOfProducts  HasCrCard  \
0    -0.326221  0.293517 -1.041760 -1.225848      -0.911583   0.646092   
1    -0.440036  0.198164 -1.387538  0.117350      -0.911583  -1.547768   
2    -1.536794  0.293517  1.032908  1.333053       2.527057   0.646092   
3     0.501521  0.007457 -1.387538 -1.225848       0.807737  -1.547768   
4     2.063884  0.388871 -1.041760  0.785728      -0.911583   0.646092   

   IsActiveMember  EstimatedSalary  Geography_Germany  Geography_Spain  \
0        0.970243         0.021886          -0.578736        -0.573809   
1        0.970243         0.216534          -0.578736         1.742740   
2       -1.030670         0.240687          -0.578736        -0.573809   
3       -1.030670        -0.108918          -0.578736        -0.573809   
4        0.970243        -0.365276          -0.578736         1.742740   

   Gender_Male  Exited  
0    -1.096209       1  
1   

##Applying SVM

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import SVC
import pandas as pd

# 🎯 Define features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# 1️⃣ Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Original class distribution in Train:")
print(y_train.value_counts(), "\n")

# 2️⃣ Scaling (VERY IMPORTANT for SVM): fit on train only, then apply to test
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# 3️⃣ Create and fit SVM model
svm_model = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    class_weight='balanced',   # weights classes inversely to their frequency; remove if the dataset is balanced
    random_state=42
)

svm_model.fit(X_train_scaled, y_train)

# 4️⃣ Predictions
pred_train = svm_model.predict(X_train_scaled)
pred_test = svm_model.predict(X_test_scaled)

# Accuracy
acc_train = accuracy_score(y_train, pred_train)
acc_test = accuracy_score(y_test, pred_test)

print(f"🔹 Train accuracy: {acc_train:.4f}")
print(f"🔹 Test accuracy:  {acc_test:.4f}\n")

# 5️⃣ Custom confusion matrix (rows = predicted class, columns = actual class)
def confusion_matrix_custom(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()   # sklearn's order for binary labels [0, 1]

    cm_df = pd.DataFrame([[tp, fp],
                          [fn, tn]],
                         index=["Predicted 1", "Predicted 0"],
                         columns=["Actual 1", "Actual 0"])

    print("Confusion Matrix (Predicted on left, Actual on top):\n", cm_df)
    print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}\n")
    return cm_df

# === For Train set ===
print("=== For Train set ===")
print(classification_report(y_train, pred_train))
confusion_matrix_custom(y_train, pred_train)

# === For Test set ===
print("=== For Test set ===")
print(classification_report(y_test, pred_test))
confusion_matrix_custom(y_test, pred_test)


Original class distribution in Train:
Exited
0    5574
1    1426
Name: count, dtype: int64

🔹 Train accuracy: 0.8131
🔹 Test accuracy:  0.7917

=== For Train set ===
              precision    recall  f1-score   support

           0       0.94      0.82      0.87      5574
           1       0.53      0.80      0.64      1426

    accuracy                           0.81      7000
   macro avg       0.73      0.81      0.75      7000
weighted avg       0.86      0.81      0.83      7000

Confusion Matrix (Predicted on left, Actual on top):
              Actual 1  Actual 0
Predicted 1      1139      1021
Predicted 0       287      4553

TP: 1139, TN: 4553, FP: 1021, FN: 287

=== For Test set ===
              precision    recall  f1-score   support

           0       0.93      0.80      0.86      2389
           1       0.49      0.76      0.60       611

    accuracy                           0.79      3000
   macro avg       0.71      0.78      0.73      3000
weighted avg      

Unnamed: 0,Actual 1,Actual 0
Predicted 1,464,478
Predicted 0,147,1911


# KNN (K-Nearest Neighbors)
KNN (K-Nearest Neighbors) is a model that makes predictions by looking at the K closest neighbors of a new data point.

❌ Slow at prediction time on large datasets (distances to the training points must be computed for every new point)

###1 A new data point comes in.

Example: a new customer

###2 KNN finds the K nearest neighbors

It uses a distance measure (usually Euclidean distance).

###3 Look at their labels

Example:
Among 5 neighbors:

3 are class "1"

2 are class "0"

###4 Majority vote

Since "1" is more common → predict 1.

What is K?

K = how many neighbors the model should look at.

✔ Small K → model is sensitive to noise, may overfit

✔ Large K → model becomes too smooth, may underfit

Common starting values (odd, so that a two-class vote cannot tie):
K = 3, 5, 7, 9
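The four steps above fit into a few lines of NumPy. This is a from-scratch sketch for illustration, not scikit-learn's `KNeighborsClassifier`; the function name `knn_predict` and the toy points are assumptions, chosen so that the 5 nearest neighbors split 3 to 2 as in the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())            # count the labels among them
    return votes.most_common(1)[0][0]

# Hypothetical training points (two features) and their labels
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1], [1.5, 1.4]])
y_train = np.array([1, 1, 1, 0, 0, 0, 0])

new_customer = np.array([1.1, 1.0])
print(knn_predict(X_train, y_train, new_customer, k=5))   # 3 votes for "1", 2 for "0" -> 1
```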

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

# 🎯 Define features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# 1️⃣ Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Original class distribution in Train:")
print(y_train.value_counts(), "\n")

# 2️⃣ Scaling (VERY IMPORTANT for KNN): fit on train only, then apply to test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# 3️⃣ Create and fit KNN model
knn_model = KNeighborsClassifier(
    n_neighbors=5,      # K = 5 (you can tune this)
)

knn_model.fit(X_train_scaled, y_train)

# 4️⃣ Predictions
pred_train = knn_model.predict(X_train_scaled)
pred_test = knn_model.predict(X_test_scaled)

# Accuracy
acc_train = accuracy_score(y_train, pred_train)
acc_test = accuracy_score(y_test, pred_test)

print(f"🔹 Train accuracy (KNN): {acc_train:.4f}")
print(f"🔹 Test accuracy (KNN):  {acc_test:.4f}\n")

# 5️⃣ Custom confusion matrix (rows = predicted class, columns = actual class)
def confusion_matrix_custom(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()   # sklearn's order for binary labels [0, 1]

    cm_df = pd.DataFrame([[tp, fp],
                          [fn, tn]],
                         index=["Predicted 1", "Predicted 0"],
                         columns=["Actual 1", "Actual 0"])

    print("Confusion Matrix (Predicted on left, Actual on top):\n", cm_df)
    print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}\n")
    return cm_df

# === For Train set ===
print("=== For Train set (KNN) ===")
print(classification_report(y_train, pred_train))
confusion_matrix_custom(y_train, pred_train)

# === For Test set ===
print("=== For Test set (KNN) ===")
print(classification_report(y_test, pred_test))
confusion_matrix_custom(y_test, pred_test)


Original class distribution in Train:
Exited
0    5574
1    1426
Name: count, dtype: int64

🔹 Train accuracy (KNN): 0.8700
🔹 Test accuracy (KNN):  0.8213

=== For Train set (KNN) ===
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      5574
           1       0.80      0.48      0.60      1426

    accuracy                           0.87      7000
   macro avg       0.84      0.73      0.76      7000
weighted avg       0.86      0.87      0.86      7000

Confusion Matrix (Predicted on left, Actual on top):
              Actual 1  Actual 0
Predicted 1       687       171
Predicted 0       739      5403

TP: 687, TN: 5403, FP: 171, FN: 739

=== For Test set (KNN) ===
              precision    recall  f1-score   support

           0       0.85      0.94      0.89      2389
           1       0.60      0.37      0.46       611

    accuracy                           0.82      3000
   macro avg       0.73      0.65      0.67      3

Unnamed: 0,Actual 1,Actual 0
Predicted 1,225,150
Predicted 0,386,2239


##KNN has no algorithm-level option (such as class weights) for the imbalance problem, so we address it at the data level with SMOTE.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
import pandas as pd

# 🎯 Define features (X) and target (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

# 1️⃣ Train/test split (stratify to keep the imbalance ratio in both)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Original class distribution in Train:")
print(y_train.value_counts(), "\n")

# 2️⃣ Handle imbalance on TRAIN set only (SMOTE)
# Note: SMOTE runs here on unscaled features, so its neighbor search is dominated
# by large-valued columns such as Balance and EstimatedSalary
smote = SMOTE(random_state=42)   # named smote so it does not shadow the statsmodels alias sm
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("After SMOTE (Train):")
print(pd.Series(y_train_res).value_counts(), "\n")

# 3️⃣ Scaling (VERY IMPORTANT for KNN)
scaler = StandardScaler()
scaler.fit(X_train_res)              # fit on resampled train
X_train_res_scaled = scaler.transform(X_train_res)
X_test_scaled      = scaler.transform(X_test)

# 4️⃣ Create and fit KNN model
knn_model = KNeighborsClassifier(
    n_neighbors=5,      # you can tune this
    weights='uniform',  # or 'distance'
    metric='minkowski',
    p=2                 # Euclidean distance
)

knn_model.fit(X_train_res_scaled, y_train_res)

# 5️⃣ Predictions
# 🔹 For Train: evaluate on ORIGINAL train set (not SMOTE data)
X_train_scaled = scaler.transform(X_train)
pred_train = knn_model.predict(X_train_scaled)

# 🔹 For Test
pred_test = knn_model.predict(X_test_scaled)

# Accuracy
acc_train = accuracy_score(y_train, pred_train)
acc_test = accuracy_score(y_test, pred_test)

print(f"🔹 Train accuracy (KNN + SMOTE): {acc_train:.4f}")
print(f"🔹 Test accuracy  (KNN + SMOTE): {acc_test:.4f}\n")

# 6️⃣ Custom confusion matrix (rows = predicted class, columns = actual class)
def confusion_matrix_custom(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()   # sklearn's order for binary labels [0, 1]

    cm_df = pd.DataFrame([[tp, fp],
                          [fn, tn]],
                         index=["Predicted 1", "Predicted 0"],
                         columns=["Actual 1", "Actual 0"])

    print("Confusion Matrix (Predicted on left, Actual on top):\n", cm_df)
    print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}\n")
    return cm_df

# === For Train set ===
print("=== For Train set (KNN + SMOTE) ===")
print(classification_report(y_train, pred_train))
confusion_matrix_custom(y_train, pred_train)

# === For Test set ===
print("=== For Test set (KNN + SMOTE) ===")
print(classification_report(y_test, pred_test))
confusion_matrix_custom(y_test, pred_test)


Original class distribution in Train:
Exited
0    5574
1    1426
Name: count, dtype: int64 

After SMOTE (Train):
Exited
0    5574
1    5574
Name: count, dtype: int64

🔹 Train accuracy (KNN + SMOTE): 0.8264
🔹 Test accuracy  (KNN + SMOTE): 0.7547

=== For Train set (KNN + SMOTE) ===
              precision    recall  f1-score   support

           0       0.93      0.85      0.89      5574
           1       0.55      0.75      0.64      1426

    accuracy                           0.83      7000
   macro avg       0.74      0.80      0.76      7000
weighted avg       0.85      0.83      0.84      7000

Confusion Matrix (Predicted on left, Actual on top):
              Actual 1  Actual 0
Predicted 1      1074       863
Predicted 0       352      4711

TP: 1074, TN: 4711, FP: 863, FN: 352

=== For Test set (KNN + SMOTE) ===
              precision    recall  f1-score   support

           0       0.90      0.78      0.84      2389
           1       0.43      0.65      0.52       

Unnamed: 0,Actual 1,Actual 0
Predicted 1,395,520
Predicted 0,216,1869


##PCA (Principal Component Analysis)

PCA (Principal Component Analysis) is a method that simplifies high-dimensional data while minimizing information loss. It is like compressing complex, multi-dimensional information into fewer dimensions while preserving the main meaning.

PCA finds the directions in the data with the highest variance and projects the data onto those directions.

Suppose the dataset has 100 features (age, income, balance, credit score, etc.). Some of these 100 features carry duplicate information. Often just a few directions explain most of the information in those 100 features. These new directions are called Principal Components.

Why is PCA important?

✔️ 1. Reduces the number of features (dimensionality reduction)
For example: 100 variables → 5 components,
while about 90% of the variance (information) may remain.

✔️ 2. Speeds up models
KNN, Logistic Regression, and SVM run faster on fewer features.

✔️ 3. Solves multicollinearity problems
(Multicollinearity = the independent variables (X) are highly correlated.)
PCA combines highly correlated features into components that are uncorrelated with each other.

✔️ 4. Useful for visualization
Ideal for showing high-dimensional data in a 2D or 3D plot.

How does PCA work?

1️⃣ Standardizes the features

2️⃣ Builds the correlation matrix (the covariance matrix of the standardized features)

3️⃣ Finds the directions of highest variance (the eigenvectors of that matrix)

4️⃣ Creates new axes from them (PC1, PC2, PC3…)

5️⃣ Projects the data onto these axes
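The five steps can be followed one by one in NumPy. The sketch below builds a small synthetic dataset in which two of the three features carry nearly the same information, so most of the variance should land on the first component; the data and variable names are assumptions for illustration, and in practice `sklearn.decomposition.PCA` does the same work.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))

# Three features: the first two are strongly correlated, the third is independent
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Standardize the features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (proportional to the correlation matrix)
corr = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition: eigenvectors are the directions, eigenvalues their variance
eigvals, eigvecs = np.linalg.eigh(corr)

# 4. Sort the new axes by variance, largest first (PC1, PC2, PC3)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

# 5. Project the data onto the first two principal components
X_pca = X_std @ eigvecs[:, :2]

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", X_pca.shape)
```

Here two components keep almost all of the variance, because the third original direction was mostly a duplicate of information already present.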

#PCR (Principal Component Regression)
PCR = PCA + Linear Regression.

The dimensionality is first reduced with PCA, and then a Linear Regression is fitted on the new components instead of the original features.

Why use PCR?

✔ Solves multicollinearity
Highly correlated features are condensed into a single component through PCA.

✔ Reduces overfitting
Low-variance dimensions are dropped → the model becomes more stable.

✔ Very effective for high-dimensional data
Even if there are 500 features, 5–10 components may be enough.

When to use PCR

When the number of features is very large

When features are highly similar or correlated (multicollinearity)
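In scikit-learn, PCR can be expressed as a pipeline of three steps: scale, reduce with PCA, then fit a linear regression. The sketch below assumes synthetic data in which 20 correlated features are noisy mixtures of 2 hidden factors; the data, the split, and `n_components=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 2 hidden factors drive 20 highly correlated features (strong multicollinearity)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 20))
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=300)

# PCR = standardize -> PCA -> linear regression on the components
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X[:200], y[:200])

r2 = pcr.score(X[200:], y[200:])   # R^2 on the held-out rows
print(f"Test R^2 with 2 components instead of 20 features: {r2:.3f}")
```

In practice the number of components is a tuning choice, typically selected by cross-validation or by the share of variance explained.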