# Class Imbalance Strategies
- Class imbalance occurs when one class in a classification problem has far more samples than the other. It is common in many machine learning problems, such as fraud detection, anomaly detection, and medical diagnosis.

Resampling strategies modify the training data to balance the class distribution. They fall into three groups:

### 1. Oversampling
- **Random Oversampling**: Duplicate examples from the minority class.
- **SMOTE (Synthetic Minority Oversampling Technique)**: Create synthetic samples by interpolating between existing minority class samples.
- **ADASYN (Adaptive Synthetic Sampling)**: Similar to SMOTE but focuses more on harder-to-learn examples.

### 2. Undersampling
- **Random Undersampling**: Remove samples from the majority class at random.
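- **Tomek Links**: Remove majority class samples that form a Tomek link (a pair of opposite-class samples that are each other's nearest neighbor) to clean the decision boundary.
- **Cluster Centroids**: Replace groups of majority class samples with the centroids of clusters fitted on them.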

### 3. Hybrid Methods
- Combine over- and under-sampling to strike a balance  
  (e.g., **SMOTE + Tomek Links** or **SMOTE + ENN**).


# Oversampling
- Oversampling is a technique used to handle class imbalance by increasing the number of samples in the minority class. This is done to match the size of the majority class, helping the model learn from both classes more equally.

- Unlike undersampling, oversampling does not remove any data, so there is no loss of information. However, it can increase the risk of overfitting, especially if the same minority samples are simply duplicated.

- For example, suppose you have 10,000 samples:
  - 9,000 belong to the majority class (True)
  - 1,000 belong to the minority class (False)

- You can apply oversampling by replicating existing minority samples or generating new ones until the minority class reaches 9,000, matching the majority class.
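
The 9,000 / 1,000 example above can be sketched in plain Python. This is a minimal illustration of random oversampling only; the helper name `random_oversample` and the placeholder feature values are assumptions made for the example, not part of any library.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=42):
    """Duplicate randomly chosen samples of each smaller class
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels

# Toy data matching the example: 9,000 True and 1,000 False
labels = [True] * 9000 + [False] * 1000
samples = list(range(len(labels)))  # placeholder feature values
X_over, y_over = random_oversample(samples, labels)
print(Counter(y_over))
```

Because the extra minority rows are exact copies, the model sees the same points many times, which is where the overfitting risk mentioned above comes from.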

# Undersampling
- Undersampling is a technique used to handle class imbalance by reducing the number of samples in the majority class. This helps balance the dataset and can prevent the model from being biased toward the majority class.

- However, since undersampling discards data from the majority class, it may lead to the loss of important information, which can negatively impact model performance.

- For example, suppose you have 10,000 samples, where:
  - 9,000 belong to the majority class (True)
  - 1,000 belong to the minority class (False)

- If you keep all 1,000 samples from the minority class and randomly select 1,000 from the majority class (ignoring the remaining 8,000), this is called random undersampling.
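
As a rough sketch of random undersampling on the same toy numbers, the snippet below keeps every minority sample and draws an equally sized random subset of the majority class. The helper `random_undersample` is hypothetical and written only for this illustration.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, l in enumerate(labels) if l == cls]
        keep.extend(rng.sample(idx, target))  # sampling without replacement
    keep.sort()
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy data matching the example: 9,000 True and 1,000 False
labels = [True] * 9000 + [False] * 1000
samples = list(range(len(labels)))  # placeholder feature values
X_under, y_under = random_undersample(samples, labels)
print(Counter(y_under))
```

The 8,000 majority samples that are not drawn never reach the model, which is the information loss described above.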



- In undersampling, you reduce the number of majority class samples to match the minority class size. Then you train the model using this balanced dataset (both classes are present, but the majority class has fewer samples than before).

- In oversampling, you increase the number of minority class samples—usually by duplicating them or generating synthetic examples—to match the majority class. This also results in a balanced dataset, but without removing data from the majority class.

# SMOTE (Synthetic Minority Over-sampling Technique)
### 📌 What is SMOTE?
- SMOTE is an advanced oversampling technique used to deal with class imbalance in datasets.
Instead of simply duplicating existing minority class samples, SMOTE generates new, synthetic samples by interpolating between existing ones.
### 💡 How Does It Work?
- For each sample in the minority class, SMOTE:
  - Finds its k nearest neighbors (usually k=5) among the other minority class samples.
  - Randomly selects one of those neighbors.
  - Creates a new synthetic sample at a random point on the line segment between the original sample and the selected neighbor.
- This makes the new data more diverse and less prone to overfitting than simple duplication.
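
The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not the imblearn implementation: the function name `smote_sketch` is made up, the base sample is picked at random for each new point, and distances are plain Euclidean.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between a
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical 2-D minority samples inside the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
new_points = smote_sketch(minority, n_new=10, k=3)
print(new_points[:3])
```

Every synthetic point lies on a segment between two real minority samples, so it stays inside the region the minority class already occupies rather than repeating an existing point exactly.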


# SMOTE + Tomek Links
### 🔄 What is it?
- SMOTE + Tomek Links is a two-step process that:
  - Uses SMOTE to generate synthetic minority class samples (oversampling)
  - Uses Tomek Links to clean the dataset by removing overlapping samples (undersampling)
- A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. After SMOTE has created synthetic minority samples, some of them can land right next to a majority sample and form such a pair. The Tomek Links step deletes the samples in these pairs, which removes synthetic points that intrude into the majority region and leaves a cleaner boundary between the classes.


### This helps in:

- Balancing the dataset

- Removing noisy or borderline samples that might confuse the model
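
To make the idea of a Tomek link concrete, here is a small brute-force sketch that finds pairs of opposite-class samples that are each other's nearest neighbor and then drops both members. The function `find_tomek_links` and the toy points are assumptions for illustration; a library implementation would use a faster neighbor search.

```python
import math

def find_tomek_links(points, labels):
    """Return index pairs (i, j) of opposite-class mutual nearest neighbors."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links

# Hypothetical toy data: points 2 and 3 sit on the class boundary
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0), (3.1, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

links = find_tomek_links(points, labels)
print(links)  # [(2, 3)]

# Drop both members of every link to clean the boundary
linked = {idx for pair in links for idx in pair}
cleaned = [(p, l) for idx, (p, l) in enumerate(zip(points, labels)) if idx not in linked]
print(len(cleaned))  # 4
```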

In [29]:
# Handle Class Imbalance Using imblearn: Churn Prediction


In [30]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [31]:
df = pd.read_csv("D:\\utils\\DataSets\\churn.csv")
df.head()

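| | Call Failure | Complains | Subscription Length | Charge Amount | Seconds of Use | Frequency of use | Frequency of SMS | Distinct Called Numbers | Age Group | Tariff Plan | Status | Age | Customer Value | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 38 | 0 | 4370 | 71 | 5 | 17 | 3 | 1 | 1 | 30 | 197.64 | 0 |
| 1 | 0 | 0 | 39 | 0 | 318 | 5 | 7 | 4 | 2 | 1 | 2 | 25 | 46.035 | 0 |
| 2 | 10 | 0 | 37 | 0 | 2453 | 60 | 359 | 24 | 3 | 1 | 1 | 30 | 1536.52 | 0 |
| 3 | 10 | 0 | 38 | 0 | 4198 | 66 | 1 | 35 | 1 | 1 | 1 | 15 | 240.02 | 0 |
| 4 | 3 | 0 | 38 | 0 | 2393 | 58 | 2 | 33 | 1 | 1 | 1 | 15 | 145.805 | 0 |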


In [32]:
df.Churn.value_counts()

Churn
0    2655
1     495
Name: count, dtype: int64
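
A quick calculation on these counts shows how skewed the target is and why plain accuracy can mislead. The snippet only does arithmetic on the two numbers printed above; the variable names are chosen for this example.

```python
# Class counts taken from df.Churn.value_counts() above
counts = {0: 2655, 1: 495}

total = sum(counts.values())
minority_share = counts[1] / total            # fraction of churners
imbalance_ratio = counts[0] / counts[1]       # non-churners per churner
baseline_accuracy = max(counts.values()) / total  # always predicting class 0

print(f"total samples:     {total}")
print(f"minority share:    {minority_share:.3f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f} : 1")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

A model that always predicts "no churn" would already score about 84% accuracy while finding no churners at all, so the per-class precision and recall are the numbers to watch.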

In [33]:
df.isna().sum()

Call  Failure              0
Complains                  0
Subscription  Length       0
Charge  Amount             0
Seconds of Use             0
Frequency of use           0
Frequency of SMS           0
Distinct Called Numbers    0
Age Group                  0
Tariff Plan                0
Status                     0
Age                        0
Customer Value             0
Churn                      0
dtype: int64

In [34]:
X = df.drop('Churn', axis=1)  
y = df['Churn']  

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=2000)  # Increase max_iter if convergence issues occur
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.90      0.98      0.94       531
           1       0.82      0.42      0.56        99

    accuracy                           0.90       630
   macro avg       0.86      0.70      0.75       630
weighted avg       0.89      0.90      0.88       630



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
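
The warning above suggests scaling the data. One possible way to do that, sketched here on synthetic data rather than the churn file, is to put a `StandardScaler` in front of the model with scikit-learn's `make_pipeline`. The dataset from `make_classification` and the inflated first feature are stand-ins invented for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (hypothetical, not the real dataset)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], random_state=42)
X[:, 0] *= 5000  # exaggerate one feature's scale, similar to a "seconds" column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The scaler is fitted on the training data only, then applied before the model
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
scaled_model.fit(X_train, y_train)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(f"solver iterations used: {n_iter}")
```

With standardized features the solver usually needs far fewer iterations than the `max_iter` limit, although the exact count depends on the data.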


In [36]:
# Undersampling and oversampling are applied only to the training set, so the test set keeps its original class distribution

In [37]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

y_train_rus.value_counts()

Churn
0    396
1    396
Name: count, dtype: int64

In [38]:
y_train_rus

2430    0
1471    0
1423    0
2662    0
1011    0
       ..
3099    1
2378    1
2268    1
1831    1
1631    1
Name: Churn, Length: 792, dtype: int64

In [39]:
# Train a Logistic Regression model
model = LogisticRegression(max_iter=2000)  # Increase max_iter if convergence issues occur
model.fit(X_train_rus, y_train_rus)

# Predict on the test set
y_pred_rus = model.predict(X_test)
report = classification_report(y_test, y_pred_rus)
print(report)

              precision    recall  f1-score   support

           0       0.96      0.81      0.88       531
           1       0.45      0.84      0.58        99

    accuracy                           0.81       630
   macro avg       0.71      0.82      0.73       630
weighted avg       0.88      0.81      0.83       630





| Metric     | Definition                                                                                                                               |
|------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Precision  | Out of all the items the model predicted as positive, what proportion were actually positive? (True Positives / (True Positives + False Positives)) |
| Recall     | Out of all the actual positive items, what proportion did the model correctly identify? (True Positives / (True Positives + False Negatives))    |
| F1-score   | The harmonic mean of precision and recall, providing a balanced measure of the model's accuracy. (2 * (Precision * Recall) / (Precision + Recall)) |
| Support    | The number of actual occurrences of the class in the dataset.                                                                             |

In summary:

* **Precision** focuses on the accuracy of the positive predictions.
* **Recall** focuses on the model's ability to find all the actual positive instances.
* **F1-score** provides a balanced measure of both.
* **Support** indicates the number of actual instances of each class.
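
The formulas in the table can be checked with a few lines of Python. The counts below (40 true positives, 10 false positives, 60 false negatives) are hypothetical numbers picked for easy arithmetic, not values from the churn model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, F1 and support for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    support = tp + fn  # number of actual positive samples
    return precision, recall, f1, support

# Hypothetical confusion counts for the positive class
p, r, f1, support = precision_recall_f1(tp=40, fp=10, fn=60)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```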

These metrics are crucial for evaluating the performance of classification models. Depending on the specific problem, you might prioritize one metric over the others. For instance, in a spam detection system, high precision might be more important to avoid incorrectly classifying legitimate emails as spam (false positives). In a medical diagnosis system, high recall might be more critical to ensure that all actual cases of a disease are identified (minimizing false negatives).

✅ Precision increasing:
A smaller share of the model's positive predictions are false positives, so it is getting better at not misclassifying negative examples as positive.

✅ Recall increasing:
The model is missing fewer actual positives (fewer false negatives), so it is getting better at finding all the actual positive cases.

✅ F1-score increasing:
Since F1 is the harmonic mean of precision and recall, it rises when the two are better balanced overall. It does not guarantee that both improved: in the undersampling run above, class 1 recall rose from 0.42 to 0.84 while precision fell from 0.82 to 0.45, and F1 still moved from 0.56 to 0.58.
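
A small check with the class 1 numbers from the two reports above shows how F1 reacts when recall and precision move in opposite directions. The inputs are the rounded values printed in the reports, so the recomputed F1 can differ from the reported one in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 rows from the classification reports above (rounded values)
baseline = {"precision": 0.82, "recall": 0.42}
undersampled = {"precision": 0.45, "recall": 0.84}

f1_base = f1_from(**baseline)
f1_under = f1_from(**undersampled)

print(f"baseline     F1: {f1_base:.3f}")
print(f"undersampled F1: {f1_under:.3f}")
print("precision dropped:", undersampled["precision"] < baseline["precision"])
print("F1 still rose:    ", f1_under > f1_base)
```

Here the gain in recall outweighs the loss in precision, so F1 rises slightly even though the model now raises many more false alarms.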

In [40]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
y_train_smote.value_counts()

Churn
0    2124
1    2124
Name: count, dtype: int64

In [41]:
model = LogisticRegression(max_iter=2000)
model.fit(X_train_smote, y_train_smote)

y_pred_smote = model.predict(X_test)
report = classification_report(y_test, y_pred_smote)
print(report)

              precision    recall  f1-score   support

           0       0.97      0.80      0.88       531
           1       0.45      0.85      0.59        99

    accuracy                           0.81       630
   macro avg       0.71      0.83      0.73       630
weighted avg       0.88      0.81      0.83       630





In [42]:
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_tomek, y_tomek = smt.fit_resample(X_train, y_train)
y_tomek.value_counts()

Churn
0    2091
1    2091
Name: count, dtype: int64

In [43]:
model = LogisticRegression(max_iter=2000)
model.fit(X_tomek, y_tomek)

y_pred_tomek = model.predict(X_test)
report = classification_report(y_test, y_pred_tomek)
print(report)

              precision    recall  f1-score   support

           0       0.97      0.80      0.88       531
           1       0.45      0.86      0.59        99

    accuracy                           0.81       630
   macro avg       0.71      0.83      0.73       630
weighted avg       0.89      0.81      0.83       630



