In [None]:
%pip install scikit-learn imbalanced-learn




The code below demonstrates two oversampling techniques, random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE), on a toy dataset with class imbalance. A synthetic dataset of 1,000 samples is generated in which class 0 makes up 10% of the samples and class 1 the remaining 90%, and it is then split into training and testing sets. RandomOverSampler balances the classes by duplicating randomly chosen minority-class instances, while SMOTE generates new minority-class samples by interpolating between an existing sample and one of its nearest minority-class neighbors. Each technique is applied separately to the training set only; the test set is left untouched so that it still reflects the original class distribution. The code then prints the fraction of class 1 samples before and after oversampling. It drops from about 0.90 to 0.50, which shows that both techniques bring the two classes to parity. Training on the balanced set keeps the model from being dominated by the majority class, which often improves minority-class recall, although it does not by itself guarantee better performance on unseen data.
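
The difference between duplicating and interpolating can be sketched in plain Python. This is an illustration only, not imbalanced-learn's implementation: the helper names `random_oversample` and `interpolate_sample` are hypothetical, and the interpolation partner is drawn from all minority points rather than from the k nearest neighbors that SMOTE uses.

```python
import random


def random_oversample(minority, n_target, rng):
    """Duplicate existing minority points (sampling with replacement) up to n_target."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return minority + extra


def interpolate_sample(minority, rng):
    """Create one synthetic point on the segment between two minority points.

    Simplification (assumption): the partner is any other minority point,
    whereas SMOTE restricts it to one of the k nearest minority neighbors.
    """
    a, b = rng.sample(minority, 2)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))


rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # made-up 2-D minority points

duplicated = random_oversample(minority, 6, rng)
synthetic = interpolate_sample(minority, rng)

print(len(duplicated))                         # size after oversampling
print(all(p in minority for p in duplicated))  # duplicates are exact copies
print(synthetic)                               # a new point between two originals
```

Random oversampling only ever repeats points that already exist, whereas the interpolated point is new but stays within the region spanned by the minority class.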

In [None]:
# Importing necessary libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Generating a toy dataset with class imbalance
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate oversampling techniques
ros = RandomOverSampler(random_state=42)
smote = SMOTE(random_state=42)

# Random Oversampling
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# SMOTE Oversampling
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

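# Labels are 0/1, so y.sum() / len(y) is the fraction of class 1 (the majority class).
# Compare that fraction before and after each oversampling method.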
print("Class distribution before Random Oversampling:", y_train.sum() / len(y_train))
print("Class distribution after Random Oversampling:", y_train_ros.sum() / len(y_train_ros))

print("Class distribution before SMOTE:", y_train.sum() / len(y_train))
print("Class distribution after SMOTE:", y_train_smote.sum() / len(y_train_smote))


Class distribution before Random Oversampling: 0.90375
Class distribution after Random Oversampling: 0.5
Class distribution before SMOTE: 0.90375
Class distribution after SMOTE: 0.5
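
The printed values are the share of class 1 only. As a small sketch of a fuller report, the snippet below lists the share of every class using `collections.Counter`. The label list is a stand-in, not the notebook's `y_train`: it is built from the counts implied by the output above (0.90375 of 800 training rows is 723 rows of class 1, leaving 77 of class 0), and `class_shares` is a hypothetical helper name.

```python
from collections import Counter

# Stand-in labels with the counts implied by the printed fraction (assumption:
# 800 training rows, i.e. 80% of the 1,000 generated samples).
y_train_demo = [1] * 723 + [0] * 77


def class_shares(labels):
    """Return {class: share of samples} for every class present in labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


before = class_shares(y_train_demo)

# Oversampling to parity adds minority rows until both classes have 723 samples.
y_balanced_demo = y_train_demo + [0] * (723 - 77)
after = class_shares(y_balanced_demo)

print(before)
print(after)
print(len(y_balanced_demo))
```

Reporting both classes makes it explicit which class is the minority, and the final length shows that oversampling grows the training set rather than discarding data.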


The code below addresses class imbalance with a support vector classifier (SVC) inside a pipeline that combines oversampling and undersampling. The dataset is split into training and testing sets, and a RandomOverSampler and a RandomUnderSampler are defined to oversample the minority class and undersample the majority class, respectively. Both are chained with an SVC in an imbalanced-learn Pipeline, which applies the resampling steps only during fitting: the classifier is trained on resampled data, while predictions on the test set use the original, untouched samples. Note that with sampling_strategy='minority' the oversampler already raises the minority class to the size of the majority class, so the undersampling step that follows has nothing left to remove. Combining the two is more useful when each step is given a partial target ratio. After the pipeline is fitted to the training data, predictions are made on the testing data and evaluated with the classification report, which lists precision, recall, and F1-score for each class. On an imbalanced test set these per-class metrics are more informative than accuracy alone. Because the toy classes are well separated (class_sep=2), the near-perfect scores likely reflect the dataset more than the effect of resampling.
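
How the two resampling steps interact can be sketched by tracking class counts only. The functions below are a simplified model under stated assumptions, not library code: `oversample_minority` and `undersample_majority` are hypothetical names that mimic the effect of `sampling_strategy='minority'` and `sampling_strategy='majority'` on counts, and the starting counts (77 and 723) are the training-set sizes implied by the earlier output.

```python
def oversample_minority(counts):
    """Model of over-sampling with 'minority': raise the smallest class to the largest count."""
    minority = min(counts, key=counts.get)
    out = dict(counts)
    out[minority] = max(counts.values())
    return out


def undersample_majority(counts):
    """Model of under-sampling with 'majority': cut the largest class to the smallest count."""
    majority = max(counts, key=counts.get)
    out = dict(counts)
    out[majority] = min(counts.values())
    return out


train_counts = {0: 77, 1: 723}  # assumed training-set class counts

# Order used in the pipeline: oversample first, then undersample.
after_over = oversample_minority(train_counts)
after_both = undersample_majority(after_over)

# Reversed order for comparison: undersample first, then oversample.
under_first = oversample_minority(undersample_majority(train_counts))

print(after_over)
print(after_both)
print(under_first)
```

In either order the first step already balances the classes, so the second step changes nothing; the order only decides whether the balanced set is large (duplicated minority rows) or small (discarded majority rows).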

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# X and y are the features and labels generated in the previous cell

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the resampling strategy
over_sampler = RandomOverSampler(sampling_strategy='minority')
under_sampler = RandomUnderSampler(sampling_strategy='majority')

# Define the classifier
clf = SVC()

# Create a pipeline with resampling and classifier
pipeline = Pipeline([
    ('over_sampling', over_sampler),
    ('under_sampling', under_sampler),
    ('classifier', clf)
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the performance
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      0.96      0.98        23
           1       0.99      1.00      1.00       177

    accuracy                           0.99       200
   macro avg       1.00      0.98      0.99       200
weighted avg       1.00      0.99      0.99       200
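
The figures in the report can be reproduced from raw prediction counts. The sketch below assumes the confusion counts that are consistent with the rounded values above: 22 of the 23 class 0 samples predicted correctly, one predicted as class 1, and all 177 class 1 samples predicted correctly. These counts are inferred from the report, not printed by the notebook, and `prf` is a hypothetical helper name.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts, inferred from the rounded report.
# Class 0: 22 correct, 1 missed, and nothing wrongly predicted as class 0.
p0, r0, f0 = prf(tp=22, fp=0, fn=1)
# Class 1: 177 correct, none missed, and the one missed class 0 sample counts as a false positive.
p1, r1, f1_1 = prf(tp=177, fp=1, fn=0)

accuracy = (22 + 177) / 200
macro_f1 = (f0 + f1_1) / 2

print(f"class 0: precision={p0:.4f} recall={r0:.4f} f1={f0:.4f}")
print(f"class 1: precision={p1:.4f} recall={r1:.4f} f1={f1_1:.4f}")
print(f"accuracy={accuracy:.4f} macro f1={macro_f1:.4f}")
```

Under these counts a single misclassified minority sample is the only error, which is why class 0 recall is the one metric visibly below 1.00.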

