<a href="https://colab.research.google.com/github/Deurru/AISaturdaysProject/blob/master/ToxicityFlair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Toxicity Classifier using LSTM and Zalando's Flair

### 01 - Basic start operations
Mount the Google Drive folder containing the datasets, install and import the libraries we are going to use to work on the data, and check whether the hardware accelerator is enabled.

In [0]:
# Mount Google Drive (On every new session)

import os

from google.colab import drive

drive.mount("/content/drive/")

# Use an absolute path so that re-running this cell does not fail
os.chdir("/content/drive/My Drive")

In [0]:
# installs flair library
!pip install flair

In [0]:
# Main imports here
import torch
import flair
import numpy as np
import pandas as pd

In [5]:
# version checks (not mandatory)
print(torch.__version__)
print(flair.__version__)

1.0.1.post2
0.4.1


In [6]:
# Check for CUDA. False means the GPU is not enabled: go to Runtime > Change runtime type > Hardware accelerator and select "GPU".
print(torch.cuda.is_available())
print(torch.version.cuda)

False
10.0.130
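
Since the check above can return `False`, later cells are easier to write if they pick the device once and fall back to the CPU. A minimal sketch, assuming nothing beyond `torch` itself; the `device` and `batch` names are illustrative and not part of the original notebook:

```python
import torch

# Fall back to the CPU when no GPU runtime is attached.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models) are moved to the chosen device with .to(device).
batch = torch.zeros(2, 3).to(device)
print(device, batch.device)
```

The same `.to(device)` call applies to a model before training, so the rest of the code does not need to know whether a GPU is present.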


In [7]:
# Check the working directory and unzip the dataset. Run the unzip once, then comment it out.
!pwd # Checks working directory
# !unzip Datasets/jigsaw_unintended_bias_in_toxicity_classification.zip -d Datasets

/content/drive/My Drive


### 02 - Exploring the dataset

In [0]:
# define path to source file and load data into pandas dataframe
train_path = "./Datasets/train.csv"
dataset = pd.read_csv(train_path)

In [18]:
# check shape, column names and basic stats of attributes of dataset
print(dataset.shape)
print(dataset.columns)
print(dataset.describe())

(1804874, 45)
Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count'],
      dtype='object')
                 id        target  severe_toxicity       obscene  \
count  1.804874e+06  1.804874e+06     1.804874e+06  1.804874e+06   
mean   3.738434e+06  1.030173e-01     4.582099e-03  1.387721e-0

In [14]:
dataset.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


In [15]:
dataset.tail()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
1804869,6333967,0.0,"Maybe the tax on ""things"" would be collected w...",0.0,0.0,0.0,0.0,0.0,,,...,399385,approved,0,0,0,0,0,0.0,0,4
1804870,6333969,0.0,What do you call people who STILL think the di...,0.0,0.0,0.0,0.0,0.0,,,...,399528,approved,0,0,0,0,0,0.0,0,4
1804871,6333982,0.0,"thank you ,,,right or wrong,,, i am following ...",0.0,0.0,0.0,0.0,0.0,,,...,399457,approved,0,0,0,0,0,0.0,0,4
1804872,6334009,0.621212,Anyone who is quoted as having the following e...,0.030303,0.030303,0.045455,0.621212,0.0,,,...,399519,approved,0,0,0,0,0,0.0,0,66
1804873,6334010,0.0,Students defined as EBD are legally just as di...,0.0,0.0,0.0,0.0,0.0,,,...,399318,approved,0,0,0,0,0,0.0,0,4


Unnamed: 0,id,target,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,bisexual,...,parent_id,article_id,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
count,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,405130.0,405130.0,405130.0,...,1026228.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0,1804874.0
mean,3738434.0,0.1030173,0.004582099,0.01387721,0.02263571,0.08115273,0.009311271,0.011964,0.003205,0.001884,...,3722687.0,281359.7,0.2779269,0.04420696,0.1091173,2.446167,0.5843688,0.006605974,1.439019,8.784694
std,2445187.0,0.1970757,0.02286128,0.06460419,0.07873156,0.1760657,0.04942218,0.087166,0.050193,0.026077,...,2450261.0,103929.3,1.055313,0.2449359,0.4555363,4.727924,1.866589,0.04529782,17.87041,43.50086
min,59848.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,61006.0,2006.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
25%,796975.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,796018.8,160120.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
50%,5223774.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5222993.0,332126.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0
75%,5769854.0,0.1666667,0.0,0.0,0.0,0.09090909,0.0,0.0,0.0,0.0,...,5775758.0,366237.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,6.0
max,6334010.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,6333965.0,399541.0,102.0,21.0,31.0,300.0,187.0,1.0,1866.0,4936.0


### 03 - Cleaning the Dataset
Dropping the columns we do not need for the type of training we are going to choose, eliminating NaNs, normalizing variables, and removing outliers (scores too high, texts too long, etc.).
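
The cleaning steps could look roughly like the sketch below. It assumes that only `comment_text` and `target` are kept, that a comment counts as toxic when `target >= 0.5` (the usual convention for this dataset), and that `max_chars` is a purely illustrative length cutoff. The `clean_comments` helper and the toy frame are hypothetical and stand in for the real `dataset`:

```python
import pandas as pd


def clean_comments(df, max_chars=1000):
    """Keep text and label, drop missing rows and length outliers."""
    # Keep only the two columns assumed to be needed, and drop rows with NaNs.
    out = df[["comment_text", "target"]].dropna().copy()
    # Strip surrounding whitespace so empty comments can be detected.
    out["comment_text"] = out["comment_text"].str.strip()
    # Binarize the score (assumption: target >= 0.5 means toxic).
    out["label"] = (out["target"] >= 0.5).astype(int)
    # Drop empty comments and comments longer than the illustrative cutoff.
    lengths = out["comment_text"].str.len()
    keep = (lengths > 0) & (lengths <= max_chars)
    return out[keep].reset_index(drop=True)


# Toy data with one missing text and one overly long comment.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "comment_text": ["Thank you!!", None, "x" * 2000, "you are a bunch of losers"],
    "target": [0.0, 0.1, 0.0, 0.893617],
    "asian": [None, None, None, 0.0],
})
cleaned = clean_comments(toy)
print(cleaned)
```

On the real data the call would be `clean_comments(dataset)`; the right cutoff for "too long" is best chosen after looking at the length distribution.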

### 04 - Defining Dataloaders
Splitting the dataset into training and validation sets (a test set is already available), defining transforms, and creating dataloaders.
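
One possible shape for this step, using only `torch.utils.data`. This is a sketch under assumptions: the "transform" is a naive lowercase whitespace tokenizer with a vocabulary built from the data, the split is 75/25, and the `CommentDataset` class, the `collate` function, and the toy texts are all hypothetical:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split


class CommentDataset(Dataset):
    """Wraps (text, label) pairs and maps tokens to integer ids."""

    def __init__(self, texts, labels):
        self.texts = [t.lower().split() for t in texts]  # naive tokenizer
        self.labels = labels
        # Index 0 is reserved for padding, 1 for unknown tokens.
        self.vocab = {"<pad>": 0, "<unk>": 1}
        for tokens in self.texts:
            for tok in tokens:
                self.vocab.setdefault(tok, len(self.vocab))

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = [self.vocab.get(tok, 1) for tok in self.texts[idx]]
        label = torch.tensor(self.labels[idx], dtype=torch.float)
        return torch.tensor(ids, dtype=torch.long), label


def collate(batch):
    """Pad the variable-length sequences of a batch to a common length."""
    seqs, labels = zip(*batch)
    return pad_sequence(seqs, batch_first=True, padding_value=0), torch.stack(labels)


torch.manual_seed(0)  # makes the random split reproducible
texts = ["thank you", "this is so cool", "bunch of losers", "great idea",
         "you are an idiot", "kudos to the team", "what a stupid take", "nice work"]
labels = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

full = CommentDataset(texts, labels)
n_val = int(0.25 * len(full))
train_set, val_set = random_split(full, [len(full) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=2, shuffle=True, collate_fn=collate)
val_loader = DataLoader(val_set, batch_size=2, collate_fn=collate)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```

Padding inside `collate` keeps each batch only as long as its longest comment, which tends to be cheaper than padding the whole dataset to one fixed length.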

### 05 - Defining Network Architecture
Defining our LSTM, then creating and implementing its functions (training, backpropagation, etc.).
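
A minimal sketch of such a network in PyTorch. It assumes a plain `nn.Embedding` layer as a stand-in for Flair embeddings, a single bidirectional LSTM layer, and one output logit per comment; the class name `ToxicityLSTM` and all layer sizes are illustrative. Backpropagation itself does not need to be written by hand, since autograd derives it from `forward`:

```python
import torch
import torch.nn as nn


class ToxicityLSTM(nn.Module):
    """Embedding -> bidirectional LSTM -> linear layer producing one logit."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        final = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2 * hidden_dim)
        return self.classifier(final).squeeze(1)    # raw logits, shape (batch,)


model = ToxicityLSTM(vocab_size=100)
batch = torch.randint(1, 100, (4, 12))  # 4 fake comments of 12 token ids each
logits = model(batch)
print(logits.shape)
```

Note that this simple version lets the forward direction read trailing padding tokens; packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` is one way to avoid that.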

### 06 - Training the Model
Defining the model's hyperparameters.
Running our training data through the model, checking the loss, saving the best-fitting model, and fine-tuning the hyperparameters.
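
The loop described above might be sketched as follows. To keep the example self-contained, a small feed-forward network and random features stand in for the LSTM and the real comments; the hyperparameter values are illustrative and not tuned. The loop itself does not depend on the model type:

```python
import copy

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hyperparameters (illustrative values only).
LEARNING_RATE = 1e-2
BATCH_SIZE = 16
EPOCHS = 5

# Stand-in data: the label is 1 when the feature sum is positive.
features = torch.randn(64, 8)
targets = (features.sum(dim=1) > 0).float()
loader = DataLoader(TensorDataset(features, targets), batch_size=BATCH_SIZE, shuffle=True)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()  # expects raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

best_loss, best_state, history = float("inf"), None, []
for epoch in range(EPOCHS):
    model.train()
    running = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb).squeeze(1), yb)
        loss.backward()   # autograd computes the gradients
        optimizer.step()
        running += loss.item() * len(xb)
    epoch_loss = running / len(loader.dataset)
    history.append(epoch_loss)
    # Keep a copy of the weights from the epoch with the lowest loss so far.
    if epoch_loss < best_loss:
        best_loss = epoch_loss
        best_state = copy.deepcopy(model.state_dict())
    print(f"epoch {epoch + 1}: loss={epoch_loss:.4f}")
```

Here the "best" model is chosen by training loss for brevity; selecting it by validation loss is usually a safer guard against overfitting.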


### 07 - Validating the Model
Running our validation dataset through the model. Repeating step 06 if necessary.
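
A validation pass differs from training mainly in that no gradients are computed and no weights are updated. The sketch below assumes a binary classifier that returns one logit per example; the `evaluate` helper, the 0.5 threshold, and the untrained stand-in model are hypothetical:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, loss_fn, threshold=0.5):
    """Return mean loss and accuracy over a loader without updating weights."""
    model.eval()  # switches off dropout and similar training-only behavior
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            logits = model(xb).squeeze(1)
            total_loss += loss_fn(logits, yb).item() * len(xb)
            preds = (torch.sigmoid(logits) >= threshold).float()
            correct += (preds == yb).sum().item()
            seen += len(xb)
    return total_loss / seen, correct / seen


torch.manual_seed(0)
val_x = torch.randn(32, 8)
val_y = (val_x.sum(dim=1) > 0).float()
val_loader = DataLoader(TensorDataset(val_x, val_y), batch_size=8)

model = nn.Linear(8, 1)  # untrained stand-in for the real network
val_loss, val_acc = evaluate(model, val_loader, nn.BCEWithLogitsLoss())
print(f"val_loss={val_loss:.4f} val_acc={val_acc:.2f}")
```

Because toxic comments are a small minority in this dataset, accuracy alone can look good for a model that never predicts "toxic", so a metric such as ROC AUC is worth tracking alongside it.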

### 08 - Testing the model
After steps 06 and 07 have been completed satisfactorily, running the test set through the model. If the model generalizes well, saving the model. Otherwise, restarting from step 06.
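
Saving the final model could be done through its `state_dict`, as in the sketch below. The `nn.Linear` stand-in, the temporary directory, and the `toxicity_model.pt` file name are assumptions for illustration; in the notebook the path would point somewhere under the mounted Drive folder:

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # stand-in for the trained network

# Hypothetical checkpoint location.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "toxicity_model.pt")
torch.save(model.state_dict(), checkpoint_path)

# To reuse the model, rebuild the same architecture and load the weights.
restored = nn.Linear(8, 1)
restored.load_state_dict(torch.load(checkpoint_path))
restored.eval()

# Both models should now produce identical outputs.
sample = torch.randn(4, 8)
with torch.no_grad():
    same = torch.allclose(model(sample), restored(sample))
print(same)
```

Saving the `state_dict` rather than the whole model object keeps the checkpoint independent of how the class is defined or imported later.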