
In [1]:
!gdown -O /tmp/ml.py 174lBNvDBJSVWs3OpNL3a68cnhWIcWYuY
%run /tmp/ml.py

Downloading...
From: https://drive.google.com/uc?id=174lBNvDBJSVWs3OpNL3a68cnhWIcWYuY
To: /tmp/ml.py
  0% 0.00/1.31k [00:00<?, ?B/s]100% 1.31k/1.31k [00:00<00:00, 4.76MB/s]


# Balancing

Most datasets are imbalanced, and this is something to keep in mind whenever you evaluate a classifier. The accuracy paradox tells us that if 90% of the data has the same label, then a classifier that always predicts that label already reaches 90% accuracy without learning anything useful. A first step is therefore to check the skew of the dataset, to see whether the accuracy paradox applies.
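To make the accuracy paradox concrete, here is a minimal sketch on toy labels (the 90/10 split and the all-zero features are assumptions for illustration, not the wine data). It uses scikit-learn's `DummyClassifier` as a baseline that always predicts the most frequent class, and compares accuracy with the F1 score.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (an assumption for illustration): 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # the features are irrelevant to this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

accuracy = accuracy_score(y, pred)
# zero_division=0 avoids a warning: the baseline never predicts a positive.
f1 = f1_score(y, pred, zero_division=0)
print(accuracy, f1)
```

The baseline scores high on accuracy while its F1 score for the positive class is zero, which is why the rest of this notebook reports F1 rather than accuracy.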

In [2]:
from pipetorch import Kaggle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from math import sqrt
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt

# Data

Suppose we want to buy a good bottle of wine. A bottle that received a '6' does not qualify for our purposes, so we are looking for a bottle rated at least 7. It turns out that only about 13.6% of the wines in our dataset meet this requirement.

In [3]:
filepath = Kaggle('uciml/red-wine-quality-cortez-et-al-2009').file()

Downloading dataset uciml/red-wine-quality-cortez-et-al-2009 from kaggle to /root/.pipetorchuser/red-wine-quality-cortez-et-al-2009


In [4]:
df = pd.read_csv(filepath)
df.quality = df.quality > 6
# how many bottles of wine are rated as good?
sum(df.quality)/len(df)

0.1357098186366479
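The same skew check can be written with pandas helpers. The sketch below uses a small hypothetical Series of quality scores (not the wine data) and relies on the fact that the mean of a boolean Series is the share of `True` values.

```python
import pandas as pd

# Hypothetical quality scores for illustration, not the wine data.
quality = pd.Series([5, 6, 7, 5, 6, 8, 5, 6, 5, 6])
is_good = quality > 6

# Share of each class, and the positive rate as a single number.
class_share = is_good.value_counts(normalize=True)
positive_rate = is_good.mean()
print(class_share)
print(positive_rate)
```

`value_counts(normalize=True)` reports both classes at once, which can be easier to read than a single ratio when there are more than two labels.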

In [5]:
# Split the dataset in a train and validation set.
# use 20% of the dataset for validation.
# in this case, keep X and y together to make it easier to balance in part 2
train, valid = train_test_split(df, test_size=0.2)

In [6]:
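# Create train_X, valid_X, etc.; quality is the target variable.
train_X = train.drop(columns='quality').to_numpy()
train_y = train.quality.to_numpy()
valid_X = valid.drop(columns='quality').to_numpy()
valid_y = valid.quality.to_numpy()

# Scaled copies of the features. Note: they are computed here but not
# used below, so the model is fit on the raw, unscaled features.
scaler = StandardScaler()
train_X_scaled = scaler.fit_transform(train_X)
valid_X_scaled = scaler.transform(valid_X)

# Fitting on unscaled features is the likely reason the solver reports
# that it reached its iteration limit in the output of this cell.
model = LogisticRegression()
model.fit(train_X, train_y)

pred_y = model.predict(valid_X)

# Report the F1 score on the validation set.
f1_score(valid_y, pred_y)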

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.46376811594202894

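With an F1 score of about 0.46 on the validation set, the model is clearly underperforming on the minority class of good wines. The exact value will vary between runs, because the train/validation split is random.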
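The solver warning in the output above suggests scaling the data. One way to do that is to put the scaler and the model in a single pipeline, so the scaled features are always the ones the model sees. The sketch below is not the notebook's own cell: it runs on synthetic data from `make_classification` (an assumption, chosen to roughly mimic 11 features and about 13.5% positives), so its score says nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (an assumption for illustration).
X, y = make_classification(n_samples=1600, n_features=11, n_informative=5,
                           weights=[0.865], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The pipeline fits the scaler on the training data only and applies
# the same transformation before every predict call.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

valid_pred = clf.predict(X_valid)
score = f1_score(y_valid, valid_pred)
print(score)
```

Passing `stratify=y` keeps the class ratio the same in both splits, which makes the validation F1 less sensitive to an unlucky split when positives are rare.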

# Balance

In [7]:
train.groupby(by='quality').quality.count()

quality
False    1109
True      170
Name: quality, dtype: int64

In [9]:
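# Balance the training data using 'resample', then fit the model again and report the F1 score.
# The minority class (quality == True) is upsampled with replacement to 1100 rows,
# roughly matching the 1109 majority rows counted above.
train_resampled = pd.concat([train[train.quality==0],
                            resample(train[train.quality==1], n_samples=1100)])

train_X_resampled = train_resampled.drop(columns='quality').to_numpy()
train_y_resampled = train_resampled.quality.to_numpy()

# The validation set is not resampled; it keeps its original class distribution.
valid_X = valid.drop(columns='quality').to_numpy()
valid_y = valid.quality.to_numpy()

# Note: the scaler is fit on the original train_X, not on the resampled data,
# and the scaled arrays are not used by the model below.
scaler = StandardScaler()
train_X_scaled = scaler.fit_transform(train_X)
valid_X_scaled = scaler.transform(valid_X)

model = LogisticRegression()
model.fit(train_X_resampled, train_y_resampled)

pred_y = model.predict(valid_X)

# Report the F1 score on the (unbalanced) validation set.
f1_score(valid_y, pred_y)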

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.5985401459854015
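To see what the upsampling step does in isolation, here is a minimal sketch on a toy frame (the 8/2 split and the column `x` are assumptions for illustration). It draws from the minority class with replacement until both classes have the same number of rows, using a fixed `random_state` so the draw is repeatable.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame (an assumption): 8 negatives, 2 positives.
toy = pd.DataFrame({"x": range(10),
                    "quality": [False] * 8 + [True] * 2})
majority = toy[~toy.quality]
minority = toy[toy.quality]

# Draw from the minority class with replacement until it matches the majority.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])

counts = balanced.quality.value_counts()
print(counts)
```

Only the training data should be balanced this way. The upsampled rows are copies of existing minority rows, so resampling before the train/validation split would leak duplicates into the validation set.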

In [None]:
halt_notebook()