# Data processing

## Dealing with an imbalanced dataset (resampling)

### Importing libraries:

In [9]:
import pandas as pd  # For data manipulation 
from imblearn.over_sampling import SMOTE  # For oversampling
from imblearn.under_sampling import RandomUnderSampler  # For undersampling
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets


### Loading dataset:

In [10]:
df = pd.read_csv("creditcard.csv")  # Loading in dataset

### Splitting attributes:

In [12]:
x = df.drop("Class", axis=1)  # All attributes except Class
y = df["Class"]  # Target column (Class)

### Training and test sets:

In [14]:
# Splits data into 80% training and 20% testing, with the seed set to 69 for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 69)

# Checks the data before resampling
print(f"Before resampling: {y_train.value_counts()}")

Before resampling: Class
0    227435
1       410
Name: count, dtype: int64
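
To put these counts in perspective, the short calculation below derives the minority share and the imbalance ratio from the numbers printed above. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the training-set output above
majority_count = 227435  # Class 0
minority_count = 410     # Class 1

total = majority_count + minority_count
minority_share = minority_count / total            # fraction of training rows in Class 1
imbalance_ratio = majority_count / minority_count  # Class 0 rows per Class 1 row

print(f"Minority share: {minority_share:.2%}")
print(f"Imbalance ratio: {imbalance_ratio:.0f} to 1")
```

Class 1 makes up about 0.18% of the training rows, or roughly one row for every 555 rows of Class 0.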


### Applying oversampling:

In [15]:
# Apply SMOTE for oversampling the minority class
smote = SMOTE(sampling_strategy = "minority", random_state = 69)
xResample, yResample = smote.fit_resample(x_train, y_train)

print(f"After resample: {yResample.value_counts()}")

After resample: Class
0    227435
1    227435
Name: count, dtype: int64
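
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and placing a new point somewhere on the line segment between the two. The sketch below shows only that interpolation step on hand-written 2D points. The function name `interpolate` and the sample values are hypothetical, and the real `imblearn` implementation also handles the neighbour search and decides how many samples to generate.

```python
import random

def interpolate(point, neighbour, rng):
    """Return a synthetic point on the segment between point and neighbour."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(69)  # seeded for reproducibility, as in the notebook

# Two hypothetical minority-class samples (made-up values)
minority_point = [1.0, 2.0]
minority_neighbour = [3.0, 6.0]

synthetic = interpolate(minority_point, minority_neighbour, rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.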


### Applying undersampling:

In [17]:
# Apply random undersampling to the majority class
undersample = RandomUnderSampler(sampling_strategy = "majority", random_state = 69)
x_resample, y_resample = undersample.fit_resample(x_train, y_train)

# Checking to see the data after resampling
print(f"After undersample: {y_resample.value_counts()}")

After undersample: Class
0    410
1    410
Name: count, dtype: int64
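
Random undersampling can be pictured as keeping every minority row and drawing, without replacement, an equally sized random subset of majority rows. The sketch below illustrates this on a small made-up label list using only the standard library. It is a simplified illustration, not the `RandomUnderSampler` implementation.

```python
import random
from collections import Counter

# Hypothetical labels: 20 majority (0) rows and 3 minority (1) rows
labels = [0] * 20 + [1] * 3

majority_idx = [i for i, label in enumerate(labels) if label == 0]
minority_idx = [i for i, label in enumerate(labels) if label == 1]

rng = random.Random(69)
# Draw as many majority rows as there are minority rows, without replacement
kept_majority = rng.sample(majority_idx, k=len(minority_idx))

kept_idx = sorted(kept_majority + minority_idx)
resampled_labels = [labels[i] for i in kept_idx]
print(Counter(resampled_labels))
```

Each class ends up with three rows. The trade-off is visible in the notebook output: undersampling discards most majority rows (227,435 down to 410), whereas SMOTE keeps them all and adds synthetic minority rows.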


### Why resample:

In an imbalanced dataset, machine learning models often become "biased" towards the majority class, as it dominates the data. The model may learn to always predict the majority class, leading to high "accuracy" but low performance when detecting the minority class. This is risky: imagine approving a fraudulent transaction, or falsely diagnosing a positive patient as negative.

For example, imagine a dataset where 98% of the data is negative (non-fraudulent transactions or negative patients). If the model simply predicts every patient or transaction to be negative, it would reach 98% accuracy while never detecting a single positive case.
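
The 98% scenario can be checked with a few lines of plain Python. The sketch below builds a made-up label list with 98 negatives and 2 positives, scores a model that always predicts negative, and compares accuracy with recall (the share of actual positives that were detected). All names and data are illustrative.

```python
# Hypothetical ground truth: 98 negatives (0) and 2 positives (1)
y_true = [0] * 98 + [1] * 2

# A "model" that always predicts the majority (negative) class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.0%}")  # 98%
print(f"Recall:   {recall:.0%}")    # 0%
```

Accuracy alone hides the failure, which is why a metric that focuses on the minority class, such as recall, is worth checking alongside it.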