# Introduction 

This notebook analyzes SMOTE, BorderlineSMOTE, Hybrid SMOTE, and Hybrid BorderlineSMOTE.

I ran the default SMOTE and BorderlineSMOTE methods on 3 imbalanced datasets. I also ran a custom sampling strategy, defined through a custom dictionary, on 6 datasets.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#SMOTEs
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
import math

# IMPORTS FOR TRAINING
from sklearn.utils import shuffle
from sklearn.neural_network import MLPClassifier
#from xgboost import XGBClassifier
#from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn import linear_model
#from sklearn.linear_model import RidgeClassifier, LogisticRegression
#from sklearn import svm

#RESULTS
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, classification_report, confusion_matrix


from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Augmentation Methods

Both [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) and [BorderlineSMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.BorderlineSMOTE.html) are implemented in the imblearn library. They are designed mainly for imbalanced binary datasets, where they oversample the smaller label until it matches the larger one. Both operate on numerical representations, so we load the vectorized text data that we created in the vectorizer notebook.
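
To make the reliance on numerical vectors concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its minority-class neighbors. The function name `interpolate` and the toy points are made up for illustration. This is not imblearn's code: the library also searches for the k nearest neighbors, and BorderlineSMOTE only oversamples minority samples close to the class boundary.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


# Hypothetical 2-dimensional minority samples (real text vectors are much longer).
rng = random.Random(42)
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = interpolate(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because the new point is an arithmetic mix of two vectors, the same step cannot be applied to raw text, which is why the vectorized data is required.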

The ctweet dataset has 3 labels, and the float `sampling_strategy` that our plain SMOTE and BorderlineSMOTE helpers use is only available for binary targets. To solve this, we pass a dictionary of target counts per label, which lets us raise the two smaller labels to the size of the largest one. We call this the Hybrid version.
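
As a rough sketch of that idea, the snippet below builds a dictionary that maps every label to the count of the largest label. The helper name `balance_to_majority` and the toy label list are assumptions made for illustration; the real ctweet counts are not reproduced here.

```python
from collections import Counter


def balance_to_majority(labels):
    """Map every label to the majority-class count."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {label: majority for label in counts}


# Hypothetical 3-class label column: 6, 3, and 2 samples.
labels = [0] * 6 + [1] * 3 + [2] * 2

strategy = balance_to_majority(labels)
print(strategy)  # {0: 6, 1: 6, 2: 6}
```

A dictionary of this shape can be passed as `sampling_strategy` to `SMOTE` or `BorderlineSMOTE`, where each key is a class and each value is the desired number of samples for that class after resampling.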
_____

In [2]:
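def HybridSMOTE(x, y):
    # new_dict is a global {label: target_count} mapping defined in a later cell
    sampler = SMOTE(sampling_strategy=new_dict, random_state=42)
    x, y = sampler.fit_resample(x, y)

    return x, y

def HybridBORDER(x, y):
    # Same dictionary-based strategy, applied with BorderlineSMOTE
    sampler = BorderlineSMOTE(sampling_strategy=new_dict, random_state=42)
    x, y = sampler.fit_resample(x, y)

    return x, y

def smote(x, y):
    # A float strategy is the minority/majority ratio after resampling (binary targets only)
    sampler = SMOTE(sampling_strategy=1.0, random_state=42)
    x, y = sampler.fit_resample(x, y)

    return x, y

def BlSmote(x, y):
    # Same ratio-based strategy, applied with BorderlineSMOTE
    sampler = BorderlineSMOTE(sampling_strategy=1.0, random_state=42)
    x, y = sampler.fit_resample(x, y)

    return x, y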

# Data sets

We have 6 datasets to test our methods. Each dataset has its own loader function, so you only need to change the file paths.

To see why `.iloc` is needed, try loading the files without it. When the vectorizer notebook saves the vectorized data to a .csv file, an extra column is added that we are not interested in.
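
The snippet below is a small illustration of how such an extra column can appear. It assumes the vectorizer notebook saved its DataFrame with pandas' default `to_csv` settings, which write the row index as the first column; that notebook is not shown here, so treat this as a likely explanation rather than a confirmed one. The toy values are made up.

```python
import io

import pandas as pd

# Hypothetical 2x2 block of vectorized text features.
vectors = pd.DataFrame([[0.5, 1.5], [2.5, 3.5]])

# Save and reload through an in-memory buffer instead of a real file.
buffer = io.StringIO()
vectors.to_csv(buffer)  # index=True by default, so the row index is written too
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))  # ['Unnamed: 0', '0', '1']

# Dropping the first column recovers the original feature values.
trimmed = reloaded.iloc[:, 1:]
print(trimmed.shape)  # (2, 2)
```

If that is indeed the cause, saving with `to_csv(path, index=False)` or loading with `read_csv(path, index_col=0)` would avoid the extra column without `.iloc`.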

Some functions slice with [:150000] or [:20000] during assignment. This keeps the labels matched with the data, which we reduced to that many rows in the vectorizing phase.
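
The alignment concern can be sketched with a tiny helper that cuts features and labels to the same length, so that row i still belongs to label i. The name `truncate_together` is hypothetical and the lists stand in for the real DataFrame and label column.

```python
def truncate_together(features, labels, limit):
    """Cut both sequences to the same length so rows and labels stay paired."""
    n = min(limit, len(features), len(labels))
    return features[:n], labels[:n]


# Hypothetical case: 3 vectorized rows were kept, but the label file has 5 rows.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels = [0, 1, 1, 0, 1]

x, y = truncate_together(features, labels, limit=3)
print(len(x), len(y))  # 3 3
```

This only works when the vectorizer kept the first rows of the file in their original order, which is what the slices in the loader functions assume.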
______

In [3]:
def get_reddit():
  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/reddit/encoded_reddit150000_train.csv").iloc[:, 1:]
  word_vectors_train = pd.DataFrame(b)

  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/reddit/encoded_reddit20000_test.csv").iloc[:, 1:]
  word_vectors_test = pd.DataFrame(b)

  train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/reddit/train.csv")
  test_data = pd.read_csv('/content/drive/MyDrive/TUBITAK_TASK1/data/reddit/test.csv')

  train_labels = train_data["Y"]
  test_labels = test_data["Y"]

  x_train = word_vectors_train[:150000]
  y_train = train_labels[:150000]

  x_test = word_vectors_test[:20000]
  y_test = test_labels[:20000]

  del word_vectors_train
  del word_vectors_test
  del train_data
  del test_data

  return x_train, y_train, x_test, y_test

In [4]:
def get_stweet():
  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/stweet/encoded_stweet150000_train.csv").iloc[:, 1:]
  word_vectors_train = pd.DataFrame(b)

  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/stweet/encoded_stweet20000_test.csv").iloc[:, 1:]
  word_vectors_test = pd.DataFrame(b)

  train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/stweet/train.csv")
  test_data = pd.read_csv('/content/drive/MyDrive/TUBITAK_TASK1/data/stweet/test.csv')

  train_labels = train_data["Y"]
  test_labels = test_data["Y"]

  x_train = word_vectors_train[:150000]
  y_train = train_labels[:150000]

  x_test = word_vectors_test[:20000]
  y_test = test_labels[:20000]

  del word_vectors_train
  del word_vectors_test
  del train_data
  del test_data

  return x_train, y_train, x_test, y_test

In [5]:
def get_ctweet():
  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/ctweet/encoded_ctweet_train.csv").iloc[:, 1:]
  word_vectors_train = pd.DataFrame(b)

  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/ctweet/encoded_ctweet_test.csv").iloc[:, 1:]
  word_vectors_test = pd.DataFrame(b)

  train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/ctweet/train.csv")
  test_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/ctweet/test.csv")

  train_labels = train_data["Y"]
  test_labels = test_data["Y"]

  x_train = word_vectors_train
  y_train = train_labels

  x_test = word_vectors_test
  y_test = test_labels

  return x_train, y_train, x_test, y_test

In [6]:
def get_sarcasm():
  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/sarcasm/encoded_sarcasm_train.csv").iloc[:, 1:]
  word_vectors_train = pd.DataFrame(b)

  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/sarcasm/encoded_sarcasm_test.csv").iloc[:, 1:]
  word_vectors_test = pd.DataFrame(b)

  train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/sarcasm/train.csv")
  test_data = pd.read_csv('/content/drive/MyDrive/TUBITAK_TASK1/data/sarcasm/test.csv')

  train_labels = train_data["Y"]
  test_labels = test_data["Y"]

  x_train = word_vectors_train
  y_train = train_labels

  x_test = word_vectors_test
  y_test = test_labels

  return x_train, y_train, x_test, y_test

In [7]:
def get_toxic():
  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/toxic/encoded_toxic_train.csv").iloc[:, 1:]
  word_vectors_train = pd.DataFrame(b)

  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/toxic/encoded_toxic_test.csv").iloc[:, 1:]
  word_vectors_test = pd.DataFrame(b)

  train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/toxic/train.csv")
  test_data = pd.read_csv('/content/drive/MyDrive/TUBITAK_TASK1/data/toxic/test.csv')

  train_labels = train_data["Y"]
  test_labels = test_data["Y"]

  x_train = word_vectors_train
  y_train = train_labels

  x_test = word_vectors_test
  y_test = test_labels

  return x_train, y_train, x_test, y_test

In [8]:
def get_food():
  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/food/encoded_food_train.csv").iloc[:, 1:]
  word_vectors_train = pd.DataFrame(b)

  b = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/encoded_texts/food/encoded_food20000_test.csv").iloc[:, 1:]
  word_vectors_test = pd.DataFrame(b)

  train_data = pd.read_csv("/content/drive/MyDrive/TUBITAK_TASK1/data/food/train.csv")
  test_data = pd.read_csv('/content/drive/MyDrive/TUBITAK_TASK1/data/food/test.csv')

  train_labels = train_data["Y"]
  test_labels = test_data["Y"]

  x_train = word_vectors_train
  y_train = train_labels

  x_test = word_vectors_test[:20000]
  y_test = test_labels[:20000]

  return x_train, y_train, x_test, y_test

# Train the model
I used MLPClassifier to get the highest possible scores in the shortest time. Both training and metric reporting are wrapped in functions.
_____

In [9]:
def train_model(x_train, y_train, x_test):
  clf = MLPClassifier(max_iter=10, hidden_layer_sizes = (50,50)).fit(x_train, y_train)
  y_pred = clf.predict(x_test)
  return y_pred

In [10]:
def print_metrics(y_test, y_pred):
  print('Precision: %.4f' % precision_score(y_test, y_pred, average='weighted'))
  print('Recall: %.4f' % recall_score(y_test, y_pred, average='weighted'))
  print('Accuracy: %.4f' % accuracy_score(y_test, y_pred))
  print('F1 Score: %.4f' % f1_score(y_test, y_pred, average='weighted'))
  print(classification_report(y_test, y_pred))
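
`print_metrics` passes `average='weighted'`, which averages the per-class scores using each class's support (its number of true samples) as the weight. The sketch below shows that arithmetic on made-up numbers and contrasts it with the unweighted macro average; `weighted_average` is a hypothetical helper, not a scikit-learn function.

```python
def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its support."""
    total = sum(supports.values())
    return sum(scores[label] * supports[label] for label in scores) / total


# Hypothetical per-class precision and support for a binary problem.
scores = {0: 0.90, 1: 0.60}
supports = {0: 30, 1: 10}

weighted = weighted_average(scores, supports)
macro = sum(scores.values()) / len(scores)

print(round(weighted, 4))  # 0.825
print(round(macro, 4))     # 0.75
```

The larger class pulls the weighted score toward its own value, so on imbalanced test sets the weighted and macro averages can differ noticeably.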

In [15]:
x_train, y_train, x_test, y_test = get_sarcasm()

In [16]:
# Count the samples per class so we can decide how much to augment each one

dictionary = y_train.value_counts().to_dict()

dictionary

{0: 10479, 1: 9554}

In [17]:
# ASSIGN NEW DICTIONARY TO ADJUST AMOUNT OF AUGMENTATION

augment_zero = int(dictionary[0] * 2)
augment_one = int(dictionary[1] * 2)
#augment_two = int(dictionary[2] * 2)

new_dict = {0 : augment_zero, 1: augment_one}
new_dict

{0: 20958, 1: 19108}

In [18]:
#AUGMENTATION
x_new, y_new = HybridBORDER(x_train, y_train)



In [19]:
# Check the augmented label count

dictionary = y_new.value_counts().to_dict()

dictionary

{0: 20958, 1: 19108}

In [20]:
# Shuffle features and labels together so each row keeps its label
# (shuffle was already imported in the first cell)
x_new, y_new = shuffle(x_new, y_new)
x_test, y_test = shuffle(x_test, y_test)

In [21]:
#Training augmented data
y_pred = train_model(x_new, y_new, x_test)
print_metrics(y_test, y_pred)

Precision: 0.8581
Recall: 0.8576
Accuracy: 0.8576
F1 Score: 0.8576
              precision    recall  f1-score   support

           0       0.88      0.85      0.86      4506
           1       0.84      0.87      0.85      4080

    accuracy                           0.86      8586
   macro avg       0.86      0.86      0.86      8586
weighted avg       0.86      0.86      0.86      8586



