<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML241ENSkillsNetwork820-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Tackle Imbalanced Data Challenge**


Estimated time needed: **60** minutes


In this lab, you will identify imbalanced data problems in four binary classification scenarios, each with a skewed class distribution:


| Task Name     | Class Ratio (Negative vs. Positive)  |
| ------------- |:-------------:|
| _Credit Card Fraud Detection_      | ~1000 : 1      | 
| _Predicting Customer Churn_ | ~5 : 1      | 
| _Tumor Type Estimation_ | ~2 : 1     | 
| _Predicting Job Change_ | ~10 : 1      | 
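
A ratio like the ones in the table can be estimated by counting labels. The sketch below is only an illustration: the labels are made up, and `class_ratio` is a hypothetical helper, not part of the lab's code or datasets.

```python
import pandas as pd


def class_ratio(y):
    """Return the majority-to-minority count ratio for a label column (hypothetical helper)."""
    counts = pd.Series(y).value_counts()
    return counts.max() / counts.min()


# Made-up labels: 50 negatives and 10 positives
y_toy = pd.Series([0] * 50 + [1] * 10)
ratio = class_ratio(y_toy)
print(f"Class ratio (negative vs. positive): ~{ratio:.0f} : 1")
```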


Next, you will tackle the imbalanced data challenges in the above tasks using class weighting and resampling methods:
- Effective class weighting strategies assign a larger weight to the minority class, so that its samples have a larger impact on the model training process
- Resampling methods build a rebalanced dataset from the original dataset, either by oversampling the minority class or by undersampling the majority class


## Objectives


After completing this lab you will be able to:


* Identify typical patterns of imbalanced data challenges
* Apply the `Class Re-weighting` method to adjust the impact of each class on the model training process
* Apply `Oversampling` and `Undersampling` to rebalance the classes in a dataset
* Evaluate your tuned classifiers using robust metrics such as `F-score` and `AUC`
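
To see why metrics such as F-score and AUC are preferred over accuracy on imbalanced data, consider a classifier that always predicts the majority class. This is a minimal sketch on made-up labels (95 negatives, 5 positives), not a result from the lab's datasets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up ground truth: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority class with a constant score
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)

accuracy = accuracy_score(y_true, y_pred)        # looks high despite finding no positives
f1 = f1_score(y_true, y_pred, zero_division=0)   # no true positives, so F1 is zero
auc = roc_auc_score(y_true, y_score)             # constant scores carry no ranking information

print(f"accuracy={accuracy:.2f}, f1={f1:.2f}, auc={auc:.2f}")
```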


In [1]:
import pandas as pd
import numpy as np
import imblearn
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split, learning_curve, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from collections import Counter

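Grid search helper functions. Each one uses `GridSearchCV` with 5-fold cross-validation to find the parameter combination, including the class weights, that gives the best F1 score.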

In [2]:
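def grid_search_lr(X_train, y_train, random_state=123):
    """Return the best class weights for logistic regression, selected by 5-fold CV on F1."""
    # Candidate weights favor class 1, assumed to be the positive (minority) class
    param_grid = {
        'class_weight': [
            {0: 0.05, 1: 0.95},
            {0: 0.1, 1: 0.9},
            {0: 0.2, 1: 0.8}
        ]
    }
    lr_model = LogisticRegression(random_state=random_state, max_iter=1000)
    grid_search = GridSearchCV(
        estimator=lr_model,
        param_grid=param_grid,
        scoring='f1',  # F1 score of the positive class
        cv=5,
        verbose=1
    )
    grid_search.fit(X_train, y_train)
    # Dictionary of the best-scoring parameters, e.g. {'class_weight': {...}}
    return grid_search.best_params_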

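def grid_search_rf(X_train, y_train, random_state=123):
    """Return the best random forest parameters, selected by 5-fold CV on F1."""
    # 4 * 3 * 2 * 3 = 72 combinations, each fitted 5 times (once per fold)
    param_grid = {
        'max_depth': [5, 10, 15, 20],
        'n_estimators': [25, 50, 100],
        'min_samples_split': [2, 5],
        # Candidate weights favor class 1, assumed to be the positive (minority) class
        'class_weight': [
            {0: 0.1, 1: 0.9},
            {0: 0.2, 1: 0.8},
            {0: 0.3, 1: 0.7}
        ]
    }
    rf_model = RandomForestClassifier(random_state=random_state)
    grid_search = GridSearchCV(
        estimator=rf_model,
        param_grid=param_grid,
        scoring='f1',  # F1 score of the positive class
        cv=5,
        verbose=1
    )
    grid_search.fit(X_train, y_train)
    # Dictionary of the best-scoring parameters
    return grid_search.best_params_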
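
The pattern used by these helpers can be tried on synthetic data. The sketch below repeats the logistic regression search on a dataset from `make_classification`; the dataset, its roughly 9 : 1 class ratio, and the names `X_demo`, `y_demo`, and `tuned_model` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (assumption): about 90% negatives, 10% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=123
)

# Same class-weight candidates as the logistic regression helper
param_grid = {
    "class_weight": [{0: 0.05, 1: 0.95}, {0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8}]
}
search = GridSearchCV(
    estimator=LogisticRegression(random_state=123, max_iter=1000),
    param_grid=param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_demo, y_demo)
best_params = search.best_params_
print(best_params)

# The returned dictionary can be unpacked into a new estimator
tuned_model = LogisticRegression(random_state=123, max_iter=1000, **best_params)
tuned_model.fit(X_demo, y_demo)
```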

In [None]:
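def split_data(df, y_column, random_state=123):
    """Split df into X_train, X_test, y_train, y_test with a stratified 80/20 split."""
    X = df.drop(y_column, axis=1)
    y = df[y_column]
    # stratify=y keeps the class ratio the same in the training and test sets
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=random_state)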
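
The effect of `stratify=y` can be checked on a small made-up DataFrame. In this sketch the column names `feature` and `Class` and the 9 : 1 ratio are assumptions for illustration, not taken from the lab's datasets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 90 negatives and 10 positives
df_demo = pd.DataFrame({"feature": range(100), "Class": [0] * 90 + [1] * 10})

X = df_demo.drop("Class", axis=1)
y = df_demo["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Both splits keep the original 10% positive rate
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Without stratification, a small test set drawn from a highly skewed dataset could end up with very few positive samples, or none at all, which would make metrics such as F-score unreliable.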