# LexiGuard: Baseline model (Bag of Words + logistic regression)

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CalleRosa40/lexiguard/blob/main/baseline_model.ipynb)

Run `data_preprocessing.ipynb` first to create the data file(s) loaded below.

## Setup

In [1]:
# Python standard lib
import time; full_run_time_start = time.time() # start timing exec right away

# other "usual suspects"
import numpy as np
import pandas as pd

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# imbalanced-learn
from imblearn.over_sampling import SMOTE

# display all df columns (default is 20)
pd.options.display.max_columns = None

# show all data in columns
pd.options.display.max_colwidth = None

# load utility functions
%run functions.ipynb

## Load data

In [2]:
# Watch out which file is loaded here! Full data or just sample?
df = pd.read_csv('data/lexiguard_data_10000.csv')
df.shape

(9964, 6)

In [3]:
df.head()

|  | raw | clean | clean_pp | clean_pp_lemma | clean_pp_lemma_stop | toxic |
|---|---|---|---|---|---|---|
| 0 | Naturally you can feel it in your urine. \nTha... | Naturally you can feel it in your urine. That'... | naturally you can feel it in your urine that '... | naturally you can feel it in your urine that b... | naturally feel urine common expression germani... | 0 |
| 1 | Yum! What's not to love: water+maple syrup - ... | Yum! What's not to love: water+maple syrup - t... | yum what 's not to love water+maple syrup toge... | yum what be not to love water+maple syrup toge... | yum love water+maple syrup | 0 |
| 2 | Catou I will wager that mutual fund sales will... | Catou I will wager that mutual fund sales will... | catou i will wager that mutual fund sales will... | catou i will wager that mutual fund sale will ... | catou wager mutual fund sale strong rrsp seaso... | 0 |
| 3 | "The shortage of priests is not a shortage of ... | "The shortage of priests is not a shortage of ... | the shortage of priests is not a shortage of v... | the shortage of priest be not a shortage of vo... | shortage priest shortage vocation shortness si... | 0 |
| 4 | I dont disagree with that. It takes money to d... | I dont disagree with that. It takes money to d... | i do nt disagree with that it takes money to d... | i do not disagree with that it take money to d... | not disagree take money deal drug street level... | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9964 entries, 0 to 9963
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   raw                  9964 non-null   object
 1   clean                9964 non-null   object
 2   clean_pp             9964 non-null   object
 3   clean_pp_lemma       9964 non-null   object
 4   clean_pp_lemma_stop  9964 non-null   object
 5   toxic                9964 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 467.2+ KB


## Handle missing values

Preprocessing should already have removed all missing values; this check is only a safeguard.

In [5]:
print('Checking for NaN\'s ...')

if df.isna().sum().sum() != 0:
    print('NaN\'s found.')
    rows_before = df.shape[0]
    print('Rows before dropping:', rows_before)
    print('Dropping rows ...')
    df.dropna(inplace=True)
    df.reset_index(drop=True, inplace=True)
    rows_after = df.shape[0]
    print('Rows after dropping:', rows_after)
    print('Rows dropped:', rows_before - rows_after)

else:
    print('No NaN\'s found.')

Checking for NaN's ...
No NaN's found.


## Check for imbalance in data

In [6]:
value_counts = df.toxic.value_counts()
nontoxic_count = value_counts[0]
toxic_count = value_counts[1]
nontoxic_perc =\
    round((nontoxic_count / (nontoxic_count + toxic_count)) * 100, 1)
toxic_perc =\
    round((toxic_count / (nontoxic_count + toxic_count)) * 100, 1)

print(f'Nontoxic (0): {nontoxic_count} ({nontoxic_perc} %)')
print(f'Toxic (1): {toxic_count} ({toxic_perc} %)')

Nontoxic (0): 9123 (91.6 %)
Toxic (1): 841 (8.4 %)
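
With roughly 92 % of comments labeled nontoxic, accuracy alone says little about model quality. As a quick illustration using the counts printed above, a hypothetical classifier that always predicts "nontoxic" would already reach about 91.6 % accuracy while flagging no toxic comment at all. The variable names `majority_acc` and `majority_recall` are illustrative and not part of the notebook:

```python
# Class counts as printed above
nontoxic_count = 9123
toxic_count = 841
total = nontoxic_count + toxic_count

# Hypothetical "always predict nontoxic" classifier
majority_acc = nontoxic_count / total   # every nontoxic row is counted as correct
majority_recall = 0 / toxic_count       # no toxic row is ever flagged

print(f'Accuracy: {majority_acc:.3f}')           # Accuracy: 0.916
print(f'Recall (toxic): {majority_recall:.3f}')  # Recall (toxic): 0.000
```

For this reason the results further down are probably better compared on F1, recall, and precision than on accuracy.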


## Function: Create bag of words

In [7]:
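def bow(s):
    """Return a sparse bag-of-words count matrix for the text series s."""
    # Note: the vocabulary is fitted on all rows passed in, including rows
    # that later end up in the test split.
    vect = CountVectorizer()
    return vect.fit_transform(s)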
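
To make the representation concrete, here is a small pure-Python sketch of what a bag-of-words matrix looks like. It only mimics two `CountVectorizer` defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`) and is not how scikit-learn is implemented; the two toy documents are made up for illustration:

```python
import re
from collections import Counter

# Toy documents for illustration only (not taken from the dataset)
docs = ['You are great', 'you are not great, not at all']

# Mimic CountVectorizer defaults: lowercase, tokens of 2+ word characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')
counts = [Counter(token_pattern.findall(d.lower())) for d in docs]

# Vocabulary: all distinct tokens, sorted alphabetically
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary term
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)   # ['all', 'are', 'at', 'great', 'not', 'you']
print(matrix)  # [[0, 1, 0, 1, 0, 1], [1, 1, 1, 1, 2, 1]]
```

Word order is discarded; only the per-document counts remain, which is why the second row has a 2 in the `not` column.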

## Run baseline model (logistic regression) on different data cols

In [8]:
# parameters for model
params = {'max_iter': 2_000}

# load model with parameters
lr = LogisticRegression(**params)

y = df.toxic

def run_experiment(X, data_desc):
    # Split data using 80% for train, 20% for test. Make sure test data has
    # same toxic/nontoxic ratio as train data by using stratify arg.
    X_train, X_test, y_train, y_test =\
        train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    # use SMOTE to balance train data
    sm = SMOTE(random_state=42)
    X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

    # train model and time execution
    train_time_start = time.time()
    lr.fit(X_train_sm, y_train_sm)
    train_time = time.time() - train_time_start
    # round first, then split, so 59.6 s shows as "1m 0s" rather than "0m 60s"
    minutes, seconds = divmod(round(train_time), 60)
    train_time_str = f'{minutes}m {seconds}s'

    # predict test data
    y_pred = lr.predict(X_test)

    eval_model(lr.__class__.__name__,
               params,
               data_desc,
               train_time_str,
               y_test, y_pred)

run_experiment(bow(df.raw), 'BOW on col "raw"')
run_experiment(bow(df.clean), 'BOW on col "clean"')
run_experiment(bow(df.clean_pp), 'BOW on col "clean_pp"')
run_experiment(bow(df.clean_pp_lemma), 'BOW on col "clean_pp_lemma"')
run_experiment(bow(df.clean_pp_lemma_stop), 'BOW on col "clean_pp_lemma_stop"')
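
Note that `bow()` fits the vectorizer on the whole column before `run_experiment` splits it, so the vocabulary also contains words that occur only in test rows. A minimal sketch of one alternative, splitting the raw texts first and fitting `CountVectorizer` on the training part only, might look like this. The toy texts and labels are made up for illustration, and SMOTE is left out to keep the example short:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data for illustration only (not from the LexiGuard dataset)
texts = [
    'you are wonderful', 'what a lovely day', 'thanks for the help',
    'great point well made', 'i agree with this', 'nice work everyone',
    'you are an idiot', 'shut up you fool', 'what a stupid idiot',
    'nobody likes you fool',
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw texts first, keeping the class ratio in both parts
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Learn the vocabulary from the training texts only ...
vect = CountVectorizer()
X_train = vect.fit_transform(texts_train)
# ... and reuse it, unchanged, to encode the test texts
X_test = vect.transform(texts_test)

model = LogisticRegression(max_iter=2_000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Words that were not seen during fitting are ignored by `transform`, which is closer to how the model would meet new comments in practice.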

## Show test results + total exec time

In [9]:
test_results

|  | model_name | model_params | data_desc | f1 | acc | recall | prec | cf_matrix | exec_time | notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | {'max_iter': 2000} | BOW on col "raw" | 0.22804 | 0.79277 | 0.3631 | 0.16621 | [[1519, 306], [107, 61]] | 0m 2s |  |
| 1 | LogisticRegression | {'max_iter': 2000} | BOW on col "clean" | 0.23178 | 0.79378 | 0.36905 | 0.16894 | [[1520, 305], [106, 62]] | 0m 2s |  |
| 2 | LogisticRegression | {'max_iter': 2000} | BOW on col "clean_pp" | 0.24254 | 0.79629 | 0.3869 | 0.17663 | [[1522, 303], [103, 65]] | 0m 2s |  |
| 3 | LogisticRegression | {'max_iter': 2000} | BOW on col "clean_pp_lemma" | 0.23619 | 0.7988 | 0.36905 | 0.17367 | [[1530, 295], [106, 62]] | 0m 2s |  |
| 4 | LogisticRegression | {'max_iter': 2000} | BOW on col "clean_pp_lemma_stop" | 0.24172 | 0.7702 | 0.43452 | 0.16743 | [[1462, 363], [95, 73]] | 0m 1s |  |
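
The metric columns can be recomputed from `cf_matrix`. The sketch below assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` with toxic (1) as the positive class and uses the first row (BOW on col "raw") as input. How `eval_model` in `functions.ipynb` actually computes the metrics is not shown in this notebook, so this is only a cross-check:

```python
# Confusion matrix of the first experiment (BOW on col "raw"),
# assumed layout: [[TN, FP], [FN, TP]]
cf_matrix = [[1519, 306], [107, 61]]
(tn, fp), (fn, tp) = cf_matrix

acc = (tp + tn) / (tp + tn + fp + fn)     # share of all correct predictions
prec = tp / (tp + fp)                     # flagged comments that are truly toxic
recall = tp / (tp + fn)                   # toxic comments that were flagged
f1 = 2 * prec * recall / (prec + recall)  # harmonic mean of prec and recall

print(f'f1={f1:.5f} acc={acc:.5f} recall={recall:.5f} prec={prec:.5f}')
# f1=0.22804 acc=0.79277 recall=0.36310 prec=0.16621
```

A precision of about 0.17 means that roughly five out of six comments flagged as toxic are in fact nontoxic.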


In [10]:
full_run_time = time.time() - full_run_time_start
minutes, seconds = divmod(round(full_run_time), 60)  # avoids output like "0m 60s"
print(f'Full run time: {minutes}m {seconds}s')

Full run time: 0m 16s
