# LexiGuard: LSTM (final model)

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CalleRosa40/lexiguard/blob/main/lstm.ipynb)

Run `data_preprocessing.ipynb` first to create the data file(s) that this notebook loads.

## Setup

In [1]:
# Python standard lib
import time; full_run_time_start = time.time() # start timing exec right away
import pickle

# other "usual suspects"
import pandas as pd

# scikit-learn
from sklearn.model_selection import train_test_split

# imbalanced-learn
from imblearn.over_sampling import SMOTE

# TensorFlow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# display all df columns (default is 20)
pd.options.display.max_columns = None

# show full cell contents instead of truncating long strings
pd.options.display.max_colwidth = None

# load utility functions
%run functions.ipynb

## Load data

In [2]:
# Check which file is loaded here: the full data set or just a sample?
df = pd.read_csv('data/lexiguard_data_5000.csv')
df.shape

(4984, 6)

In [3]:
df.head()

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,"Naturally you can feel it in your urine. \nThat's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)","Naturally you can feel it in your urine. That's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)",naturally you can feel it in your urine that 's one of the common expressions in the germanic languages and the hallmark of those who after the fact say they knew that all along,naturally you can feel it in your urine that be one of the common expression in the germanic language and the hallmark of those who after the fact say they know that all along,naturally feel urine common expression germanic language hallmark fact know,0
1,Yum! What's not to love: water+maple syrup - together.,Yum! What's not to love: water+maple syrup - together.,yum what 's not to love water+maple syrup together,yum what be not to love water+maple syrup together,yum love water+maple syrup,0
2,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,catou i will wager that mutual fund sales will be just as strong this rrsp season as last year the fact is that your typical mutual fund owner is n't going to suddenly discover diy investing in individual stocks and salesperson at the bank is n't going to suddenly abandon the sales pitch for lower cost etfs or gics,catou i will wager that mutual fund sale will be just as strong this rrsp season as last year the fact be that your typical mutual fund owner be not go to suddenly discover diy invest in individual stock and salesperson at the bank be not go to suddenly abandon the sale pitch for low cost etf or gic,catou wager mutual fund sale strong rrsp season year fact typical mutual fund owner go suddenly discover diy invest individual stock salesperson bank go suddenly abandon sale pitch low cost etf gic,0
3,"""The shortage of priests is not a shortage of vocations but a shortness of sight.""\n\nExactly! Very well put!","""The shortage of priests is not a shortage of vocations but a shortness of sight."" Exactly! Very well put!",the shortage of priests is not a shortage of vocations but a shortness of sight exactly very well put,the shortage of priest be not a shortage of vocation but a shortness of sight exactly very well put,shortage priest shortage vocation shortness sight exactly,0
4,"I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.","I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.",i do nt disagree with that it takes money to deal drugs street level drug sales people they nit really dealers are from many parts of the federal population,i do not disagree with that it take money to deal drug street level drug sale people they nit really dealer be from many part of the federal population,not disagree take money deal drug street level drug sale people nit dealer part federal population,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4984 entries, 0 to 4983
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   raw                  4984 non-null   object
 1   clean                4984 non-null   object
 2   clean_pp             4984 non-null   object
 3   clean_pp_lemma       4984 non-null   object
 4   clean_pp_lemma_stop  4984 non-null   object
 5   toxic                4984 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 233.8+ KB


## Handle missing values

Preprocessing should have removed all missing values, but check anyway.

In [5]:
print('Checking for NaN\'s ...')

if df.isna().sum().sum() != 0:
    print('NaN\'s found.')
    rows_before = df.shape[0]
    print('Rows before dropping:', rows_before)
    print('Dropping rows ...')
    df.dropna(inplace=True)
    df.reset_index(drop=True, inplace=True)
    rows_after = df.shape[0]
    print('Rows after dropping:', rows_after)
    print('Rows dropped:', rows_before - rows_after)

else:
    print('No NaN\'s found.')

Checking for NaN's ...
No NaN's found.


## Check for imbalance in data

In [6]:
value_counts = df.toxic.value_counts()
nontoxic_count = value_counts[0]
toxic_count = value_counts[1]
nontoxic_perc =\
    round((nontoxic_count / (nontoxic_count + toxic_count)) * 100, 1)
toxic_perc =\
    round((toxic_count / (nontoxic_count + toxic_count)) * 100, 1)

print(f'Nontoxic (0): {nontoxic_count} ({nontoxic_perc} %)')
print(f'Toxic (1): {toxic_count} ({toxic_perc} %)')

Nontoxic (0): 4555 (91.4 %)
Toxic (1): 429 (8.6 %)
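
One common way to put a number on this imbalance is the "balanced" class-weight heuristic, `n_samples / (n_classes * class_count)`. The sketch below applies it to the counts printed above. `balanced_class_weights` is a hypothetical helper that is not part of this notebook, and the counts cover the whole data set rather than the training split. A dict like this could be passed to Keras via `fit(class_weight=...)` as an alternative to resampling; this notebook balances the training data with SMOTE instead.

```python
def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for each class label."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Class counts printed by the cell above (whole data set, not the train split)
counts = {0: 4555, 1: 429}
weights = balanced_class_weights(counts)
print({label: round(w, 3) for label, w in weights.items()})
# → {0: 0.547, 1: 5.809}
```

Under this heuristic a toxic row would weigh roughly ten times as much as a nontoxic one.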


## Prepare data for LSTM

TODO: Optimize LSTM.

In [8]:
X = df['clean_pp_lemma_stop']
y = df['toxic']

# Split the data into train and test sets
X_train, X_test, y_train, y_test =\
    train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenize and convert text to sequences
max_words = 10000  # Set the maximum number of words to consider
max_len = 100  # Set the maximum length of each sequence
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to a fixed length
X_train_padded = pad_sequences(X_train_seq, maxlen=max_len)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len)

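# Use SMOTE to balance the training data only (the test set stays untouched).
# Note: SMOTE interpolates between feature vectors, so the synthetic rows here
# are blends of padded token IDs rather than real token sequences.
sm = SMOTE(random_state=42)
X_train_padded_sm, y_train_sm = sm.fit_resample(X_train_padded, y_train)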
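
To make the tokenize-and-pad step more concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: `fit_vocab`, `to_sequence`, and `pad_pre` are hypothetical helpers, and the texts are assumed to be lowercased and whitespace-separated already. It mirrors three behaviours of the Keras defaults used above: indices start at 1 (0 is reserved for padding), words unseen during fitting are dropped, and padding and truncation both happen at the front of the sequence.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Map words to indices 1, 2, ... by descending frequency (0 = padding)."""
    freq = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in freq.most_common()]
    # Like Keras' num_words, keep only indices strictly below num_words
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_sequence(text, vocab):
    """Convert a text to indices, silently dropping unknown words."""
    return [vocab[word] for word in text.split() if word in vocab]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

train_texts = ['yum love syrup', 'love water', 'love syrup']
vocab = fit_vocab(train_texts, num_words=10)
padded = [pad_pre(to_sequence(t, vocab), maxlen=4) for t in train_texts]
unseen = pad_pre(to_sequence('love pancake syrup', vocab), maxlen=4)

print(padded)  # → [[0, 3, 1, 2], [0, 0, 1, 4], [0, 0, 1, 2]]
print(unseen)  # → [0, 0, 1, 2]
```

The vocabulary is built from the training texts only, which is why "pancake" disappears from the unseen sentence. The notebook does the same by calling `fit_on_texts` on `X_train` alone.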

## Build LSTM model

In [9]:
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))
model.add(LSTM(units=64))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
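
As a rough sanity check on the model size, the number of trainable parameters can be worked out by hand. The arithmetic below assumes the default layer settings (biases enabled, no extra options); under that assumption the total should match what `model.summary()` reports.

```python
max_words, embed_dim, lstm_units = 10000, 128, 64

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = max_words * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# → 1280000 49408 65 1329473
```

Almost all of the parameters sit in the embedding layer, so `max_words` and `output_dim` dominate the model size.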

## Train model

In [11]:
train_time_start = time.time() # timer

model.fit(X_train_padded_sm, y_train_sm,
          epochs=5,
          batch_size=32,
          validation_data=(X_test_padded, y_test))

train_time = time.time() - train_time_start # timer
train_mins, train_secs = divmod(round(train_time), 60)  # round first to avoid "1m 60s"
train_time_str = f'{train_mins}m {train_secs}s'

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Save model + tokenizer for later use in dashboard

In [12]:
# Save model architecture as JSON
model_json = model.to_json()
with open('data/lstm_architect.json', 'w') as file:
    file.write(model_json)

# Save model weights
model.save_weights('data/lstm_weights.h5')

# Pickle tokenizer
with open('data/lstm_tokenizer.pkl', 'wb') as file:
    pickle.dump(tokenizer, file)
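
For completeness, the dashboard side might load these three files back roughly as follows. This is a sketch, not code from the dashboard: `load_tokenizer` and `load_lstm` are hypothetical helper names, and the default paths simply repeat the ones used above. Unpickling needs the same Keras tokenizer class to be importable, and `pickle.load` should only be used on files you trust.

```python
import pickle

def load_tokenizer(path='data/lstm_tokenizer.pkl'):
    """Unpickle the tokenizer saved by the cell above."""
    with open(path, 'rb') as file:
        return pickle.load(file)

def load_lstm(arch_path='data/lstm_architect.json',
              weights_path='data/lstm_weights.h5'):
    """Rebuild the model from its JSON architecture, then load the weights."""
    # Imported here so that defining these helpers does not require TensorFlow
    from tensorflow.keras.models import model_from_json
    with open(arch_path) as file:
        model = model_from_json(file.read())
    model.load_weights(weights_path)
    return model
```

New texts would then have to go through the same `texts_to_sequences` and `pad_sequences(..., maxlen=100)` steps before being passed to `predict`.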

## Predict test data

In [13]:
# Predict test data
y_pred = (model.predict(X_test_padded) > 0.5).astype(int)



## Evaluate model

In [14]:
eval_model(model.__class__.__name__,
           'epochs=5, batch_size=32, validation_data=(X_test_padded, y_test)',
           '5,000 rows from "clean_pp_lemma_stop"',
           train_time_str,
           y_test, y_pred)

## Show test results + total exec time

In [15]:
test_results

Unnamed: 0,model_name,model_params,data_desc,f1,acc,recall,prec,cf_matrix,exec_time,notes
0,Sequential,"epochs=5, batch_size=32, validation_data=(X_test_padded, y_test)","5,000 rows from ""clean_pp_lemma_stop""",0.22222,0.51555,0.78409,0.12946,"[[445, 464], [19, 69]]",2m 43s,
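
The reported scores can be recomputed from the `cf_matrix` column alone. The sketch below assumes the matrix uses the scikit-learn layout (rows are true labels, columns are predictions, label order 0 then 1). `eval_model` lives in `functions.ipynb` and is not shown here, so that layout is an assumption, but the numbers it yields agree with the results row.

```python
# Confusion matrix from the results row: [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = [[445, 464], [19, 69]]

acc = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
prec = tp / (tp + fp)
f1 = 2 * prec * recall / (prec + recall)
# Accuracy of always predicting "nontoxic" on the same test set
baseline_acc = (tn + fp) / (tp + tn + fp + fn)

print(f'acc={acc:.5f} recall={recall:.5f} prec={prec:.5f} f1={f1:.5f}')
# → acc=0.51555 recall=0.78409 prec=0.12946 f1=0.22222
print(f'majority-class baseline acc={baseline_acc:.5f}')
```

The model finds most of the 88 toxic test comments (high recall) but also flags about half of the nontoxic ones, which explains the low precision. Always predicting "nontoxic" would score about 0.91 accuracy on this test set, so accuracy is not a useful yardstick here.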


In [16]:
full_run_time = time.time() - full_run_time_start
full_mins, full_secs = divmod(round(full_run_time), 60)  # round first to avoid "Xm 60s"
print(f'Full run time: {full_mins}m {full_secs}s')

Full run time: 8m 1s
