# LexiGuard: XGBoost experiments

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CalleRosa40/lexiguard/blob/main/xgboost.ipynb)

Run `data_preprocessing.ipynb` first to create the data file(s) loaded below.

## Setup

In [12]:
# Python standard lib
import time; full_run_time_start = time.time() # start timing exec right away

# other "usual suspects"
import numpy as np
import pandas as pd

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# imbalanced-learn
from imblearn.over_sampling import SMOTE

# XGBoost
from xgboost import XGBClassifier

# display all df columns (default is 20)
pd.options.display.max_columns = None

# show all data in columns
pd.options.display.max_colwidth = None

# load utility functions
%run functions.ipynb

## Load data

In [13]:
# Watch out which file is loaded here! Full data or just sample?
df = pd.read_csv('data/lexiguard_data_50000.csv')
df.shape

(49797, 6)

In [14]:
df.head()

,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,"Naturally you can feel it in your urine. \nThat's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)","Naturally you can feel it in your urine. That's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)",naturally you can feel it in your urine that 's one of the common expressions in the germanic languages and the hallmark of those who after the fact say they knew that all along,naturally you can feel it in your urine that be one of the common expression in the germanic language and the hallmark of those who after the fact say they know that all along,naturally feel urine common expression germanic language hallmark fact know,0
1,Yum! What's not to love: water+maple syrup - together.,Yum! What's not to love: water+maple syrup - together.,yum what 's not to love water+maple syrup together,yum what be not to love water+maple syrup together,yum love water+maple syrup,0
2,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,catou i will wager that mutual fund sales will be just as strong this rrsp season as last year the fact is that your typical mutual fund owner is n't going to suddenly discover diy investing in individual stocks and salesperson at the bank is n't going to suddenly abandon the sales pitch for lower cost etfs or gics,catou i will wager that mutual fund sale will be just as strong this rrsp season as last year the fact be that your typical mutual fund owner be not go to suddenly discover diy invest in individual stock and salesperson at the bank be not go to suddenly abandon the sale pitch for low cost etf or gic,catou wager mutual fund sale strong rrsp season year fact typical mutual fund owner go suddenly discover diy invest individual stock salesperson bank go suddenly abandon sale pitch low cost etf gic,0
3,"""The shortage of priests is not a shortage of vocations but a shortness of sight.""\n\nExactly! Very well put!","""The shortage of priests is not a shortage of vocations but a shortness of sight."" Exactly! Very well put!",the shortage of priests is not a shortage of vocations but a shortness of sight exactly very well put,the shortage of priest be not a shortage of vocation but a shortness of sight exactly very well put,shortage priest shortage vocation shortness sight exactly,0
4,"I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.","I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.",i do nt disagree with that it takes money to deal drugs street level drug sales people they nit really dealers are from many parts of the federal population,i do not disagree with that it take money to deal drug street level drug sale people they nit really dealer be from many part of the federal population,not disagree take money deal drug street level drug sale people nit dealer part federal population,0


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49797 entries, 0 to 49796
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   raw                  49797 non-null  object
 1   clean                49797 non-null  object
 2   clean_pp             49797 non-null  object
 3   clean_pp_lemma       49797 non-null  object
 4   clean_pp_lemma_stop  49797 non-null  object
 5   toxic                49797 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 2.3+ MB


## Handle missing values

Preprocessing should already have removed all missing values, but check again to be safe.

In [16]:
print('Checking for NaN\'s ...')

if df.isna().sum().sum() != 0:
    print('NaN\'s found.')
    rows_before = df.shape[0]
    print('Rows before dropping:', rows_before)
    print('Dropping rows ...')
    df.dropna(inplace=True)
    df.reset_index(drop=True, inplace=True)
    rows_after = df.shape[0]
    print('Rows after dropping:', rows_after)
    print('Rows dropped:', rows_before - rows_after)

else:
    print('No NaN\'s found.')

Checking for NaN's ...
No NaN's found.


## Check for imbalance in data

In [17]:
value_counts = df.toxic.value_counts()
nontoxic_count = value_counts[0]
toxic_count = value_counts[1]
nontoxic_perc =\
    round((nontoxic_count / (nontoxic_count + toxic_count)) * 100, 1)
toxic_perc =\
    round((toxic_count / (nontoxic_count + toxic_count)) * 100, 1)

print(f'Nontoxic (0): {nontoxic_count} ({nontoxic_perc} %)')
print(f'Toxic (1): {toxic_count} ({toxic_perc} %)')

Nontoxic (0): 45760 (91.9 %)
Toxic (1): 4037 (8.1 %)
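
The same shares can also be read off directly with `value_counts(normalize=True)`. A minimal sketch, using a toy series rebuilt from the counts printed above (the series `toxic` is a stand-in for `df.toxic`, not the real data):

```python
import pandas as pd

# Stand-in for df.toxic, built from the class counts printed above.
toxic = pd.Series([0] * 45760 + [1] * 4037, name="toxic")

# normalize=True returns relative frequencies instead of absolute counts.
shares = (toxic.value_counts(normalize=True) * 100).round(1)

print(f"Nontoxic (0): {shares[0]} %")
print(f"Toxic (1): {shares[1]} %")
```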


## Function: Create bag of words

In [18]:
def bow(s):
    vect = CountVectorizer()
    return vect.fit_transform(s)

## Function: TF-IDF

In [19]:
def tf_idf(s):
    vect = TfidfVectorizer()
    return vect.fit_transform(s)
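
Both helpers fit the vectorizer on the full column, so the vocabulary (and, for TF-IDF, the document frequencies) also reflects rows that later end up in the test split. A stricter variant, sketched here on made-up toy texts, splits first, fits on the training texts only, and reuses the fitted vectorizer for the test texts. The texts, labels, and variable names are illustrative assumptions, not part of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical), standing in for one of the text columns.
texts = [
    "you are great", "what a nice day", "thanks for the help",
    "lovely weather today", "you are an idiot", "shut up you fool",
    "i hate you", "this is stupid garbage",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw texts first, keeping the class ratio in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vect = TfidfVectorizer()
X_train = vect.fit_transform(texts_train)  # learn vocabulary + IDF on train only
X_test = vect.transform(texts_test)        # reuse them; unseen words are ignored

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, so a model trained on `X_train` can score `X_test` directly.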

## Run XGB on different data cols

TODO: Try out further, more sophisticated representations/embeddings.

In [20]:
# parameters for model
params = {'random_state': 42, 'n_jobs': -1}

# instantiate model with parameters
xgb = XGBClassifier(**params)

y = df.toxic

def run_experiment(X, data_desc):
    print('Running:', data_desc)
    
    # Split data using 80% for train, 20% for test. Make sure test data has
    # same toxic/nontoxic ratio as train data by using stratify arg.
    X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    # use SMOTE to balance train data
    sm = SMOTE(random_state=42)
    X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

    # train model and time execution
    train_time_start = time.time()
    xgb.fit(X_train_sm, y_train_sm)
    train_time = time.time() - train_time_start
    train_time_str = f'{int(train_time // 60)}m {round(train_time % 60)}s'

    # predict test data
    y_pred = xgb.predict(X_test)

    eval_model(xgb.__class__.__name__,
               params,
               data_desc,
               train_time_str,
               y_test, y_pred)

run_experiment(bow(df.raw), 'BOW on col "raw"')
run_experiment(bow(df.clean), 'BOW on col "clean"')
run_experiment(bow(df.clean_pp), 'BOW on col "clean_pp"')
run_experiment(bow(df.clean_pp_lemma), 'BOW on col "clean_pp_lemma"')
run_experiment(bow(df.clean_pp_lemma_stop), 'BOW on col "clean_pp_lemma_stop"')
run_experiment(tf_idf(df.clean_pp_lemma_stop),
               'TF-IDF on col "clean_pp_lemma_stop"')

Running: BOW on col "raw"
Running: BOW on col "clean"
Running: BOW on col "clean_pp"
Running: BOW on col "clean_pp_lemma"
Running: BOW on col "clean_pp_lemma_stop"
Running: TF-IDF on col "clean_pp_lemma_stop"
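
The effect of the `stratify` argument used in `run_experiment` can be checked on synthetic labels. This sketch uses made-up data with roughly the class ratio seen above (`X_demo` and `y_demo` are hypothetical names) and compares the toxic share in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (hypothetical): 920 nontoxic, 80 toxic.
y_demo = np.array([0] * 920 + [1] * 80)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, both splits should keep the overall toxic share.
print(f"train toxic share: {y_tr.mean():.3f}")
print(f"test toxic share:  {y_te.mean():.3f}")
```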


## Show test results + total exec time

In [21]:
test_results

Unnamed: 0,model_name,model_params,data_desc,f1,acc,recall,prec,cf_matrix,exec_time,notes
0,XGBClassifier,{},"BOW on col ""raw""",0.48306,0.94026,0.34449,0.80814,"[[9087, 66], [529, 278]]",0m 20s,
1,XGBClassifier,{},"BOW on col ""clean""",0.48427,0.94076,0.34325,0.82196,"[[9093, 60], [530, 277]]",0m 21s,
2,XGBClassifier,{},"BOW on col ""clean_pp""",0.48747,0.94046,0.34944,0.80571,"[[9085, 68], [525, 282]]",0m 21s,
3,XGBClassifier,{},"BOW on col ""clean_pp_lemma""",0.47528,0.93926,0.33953,0.79191,"[[9081, 72], [533, 274]]",0m 18s,
4,XGBClassifier,{},"BOW on col ""clean_pp_lemma_stop""",0.48833,0.93835,0.36307,0.74555,"[[9053, 100], [514, 293]]",0m 12s,
5,XGBClassifier,{},"TF-IDF on col ""clean_pp_lemma_stop""",0.55875,0.92912,0.5539,0.56368,"[[8807, 346], [360, 447]]",0m 60s,
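
The metric columns can be recomputed from `cf_matrix`, assuming the scikit-learn layout `[[TN, FP], [FN, TP]]` (an assumption, since `eval_model` is defined in `functions.ipynb` and not shown here). A small sketch using the TF-IDF row, with a hypothetical helper `metrics_from_confusion`:

```python
def metrics_from_confusion(cm):
    """Return (f1, acc, recall, prec) for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    prec = tp / (tp + fp)        # share of predicted toxic that is truly toxic
    recall = tp / (tp + fn)      # share of truly toxic that was found
    f1 = 2 * prec * recall / (prec + recall)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return f1, acc, recall, prec

f1, acc, recall, prec = metrics_from_confusion([[8807, 346], [360, 447]])
print(round(f1, 5), round(acc, 5), round(recall, 5), round(prec, 5))
```

Under that layout assumption, the rounded values agree with the `f1`, `acc`, `recall`, and `prec` columns of the last row. They also make the trade-off visible: the TF-IDF run catches more toxic comments (higher recall) at the cost of more false alarms (lower precision).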


In [22]:
full_run_time = time.time() - full_run_time_start
print(f'Full run time: {int(full_run_time // 60)}m {round(full_run_time % 60)}s')

Full run time: 3m 1s
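
The `0m 60s` entry in the results table is a formatting artifact: the seconds are rounded after the remainder is taken, so a remainder of 59.5 s or more is shown as 60. One way to avoid this, sketched with a hypothetical helper `format_duration`, is to round the total first and split it into minutes and seconds afterwards:

```python
def format_duration(seconds):
    """Format a duration in seconds as 'Xm Ys', rounding before splitting."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"

print(format_duration(59.7))   # 1m 0s
print(format_duration(181.2))  # 3m 1s
```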
