
# Handling Imbalanced Data

**Problems caused by Imbalanced Data in Classification Tasks**

- What is Imbalanced Data?

  - Refers to datasets where one class significantly outnumbers the other(s)

  - Challenges with Imbalanced Data

    - Bias Toward Majority Class

      - Machine learning models tend to favor the majority class because of its higher frequency

    - Misleading Evaluation Metrics

      - Metrics like accuracy can be misleading: on a 99:1 dataset, a model that always predicts the majority class scores 99% accuracy while detecting no minority samples

    - Limited information for Minority Class

      - Too few minority class samples give the model little to learn from, which can lead to underfitting on that class

  - Applications with Imbalanced Data

    - Fraud detection, medical diagnosis, and anomaly detection
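The misleading-accuracy problem described above can be seen with a few lines of plain Python. This is a minimal sketch on made-up labels (990 negatives, 10 positives, chosen only for illustration), not data from any real dataset; the helper names `accuracy` and `recall` are hypothetical.

```python
# Hypothetical toy labels: 990 majority-class (0) and 10 minority-class (1) samples.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"Accuracy: {acc:.2%}")         # 99.00%
print(f"Minority recall: {rec:.2%}")  # 0.00%
```

The accuracy looks excellent even though not a single minority sample is detected, which is why the metrics discussed later focus on the positive class.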

**Techniques to Handle Imbalanced Data**

- Resampling Techniques

  - Oversampling

    - Increase the number of minority class samples by duplicating existing samples or synthesizing new ones

    - Example: SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples by interpolating between a minority sample and its nearest minority neighbors

  - Undersampling

    - Reduce the number of majority class samples to balance the dataset

    - Risk: Loss of valuable information from the majority class

- Algorithmic Solutions

  - Class Weights

    - Assign higher weights to the minority class during model training, so that misclassifying it costs more

    - Many algorithms (e.g., Logistic Regression, Random Forest) have built-in support for class weights

  - Anomaly Detection Models

    - Treat the minority class as anomalies, focusing the model on detecting them

- Evaluation Metrics for Imbalanced Data

  - F1-Score

    - Harmonic mean of precision and recall, so it penalizes both false positives and false negatives

  - ROC-AUC

    - Measures the ability to distinguish between classes across all classification thresholds

  - Precision-Recall Curve

    - Focuses on performance for the positive class, which often makes it more informative than ROC-AUC when positives are rare
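To make the F1-Score definition above concrete, here is a small sketch that computes precision, recall, and F1 directly from true and predicted labels. The labels are invented for illustration, and `precision_recall_f1` is a hypothetical helper, not a scikit-learn function (in practice `sklearn.metrics.f1_score` does this for you).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # correctly flagged positive
        elif p == positive and t != positive:
            fp += 1  # false alarm
        elif p != positive and t == positive:
            fn += 1  # missed positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical toy data: 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990
# The model finds 6 of the 10 positives and raises 4 false alarms.
y_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 986

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Accuracy:  {acc:.2%}")        # 99.20%
print(f"Precision: {precision:.2f}")  # 0.60
print(f"Recall:    {recall:.2f}")     # 0.60
print(f"F1-Score:  {f1:.2f}")         # 0.60
```

Accuracy stays above 99% here, while F1 shows that the model is right about the positive class only 60% of the time in both directions.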

**1. Apply SMOTE to handle class imbalance, train a classifier, and evaluate its performance using metrics like ROC-AUC and F1-Score**
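Before the notebook cells, it may help to see roughly what SMOTE does. The sketch below illustrates only the core idea (pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them). It is a simplified, hypothetical illustration and not the `imblearn` implementation; the function name `smote_like_oversample`, its parameters, and the toy points are all assumptions.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position between base and neighbor.
        gap = rng.random()
        new_point = tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        synthetic.append(new_point)
    return synthetic


# Hypothetical 2-D minority class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)

for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies instead of simply repeating existing rows.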

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

from google.colab import files
uploaded = files.upload()

Saving creditcard.csv to creditcard.csv


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, f1_score
from imblearn.over_sampling import SMOTE

# Load the dataset
df = pd.read_csv("creditcard.csv")

# Explore Dataset
print("Dataset Info: \n")
print(df.info())
print("Class Distribution: \n")
print(df["Class"].value_counts())

# Split Dataset (stratify so the train and test sets keep the same class ratio)
X = df.drop("Class", axis=1)
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a baseline Random Forest on the original data, using balanced class weights
rf_model = RandomForestClassifier(random_state=42, class_weight="balanced")
rf_model.fit(X_train, y_train)

# predict and evaluate
y_pred = rf_model.predict(X_test)
print("Classification Report: \n")
print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
print(f"ROC-AUC: {roc_auc:.2f}")

# Apply SMOTE to the training set only, so the test set keeps the original class distribution
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Display new class distribution
print("New Class Distribution after SMOTE: \n")
print(pd.Series(y_resampled).value_counts())

# Train Random Forest on resampled data
rf_model_resampled = RandomForestClassifier(random_state=42)
rf_model_resampled.fit(X_resampled, y_resampled)

# Predict and Evaluate
y_pred_smote = rf_model_resampled.predict(X_test)
print("Classification Report after SMOTE: \n")
print(classification_report(y_test, y_pred_smote))

roc_auc_smote = roc_auc_score(y_test, rf_model_resampled.predict_proba(X_test)[:, 1])
print(f"ROC-AUC after SMOTE: {roc_auc_smote:.2f}")

Dataset Info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64