<a href="https://colab.research.google.com/github/Deurru/AISaturdaysProject/blob/master/ToxicityFlair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Toxicity Classifier using LSTM and Zalando's Flair

### 01 - Basic start operations
Mount the Google Drive folder that contains the datasets, install and import the libraries used to work with the data, and check whether the hardware accelerator is enabled.

In [1]:
# Mount Google Drive (On every new session)

import os

from google.colab import drive

drive.mount("/content/drive/")

# Absolute path, so the cell still works when it is re-run in the same session
os.chdir("/content/drive/My Drive")

Mounted at /content/drive/


In [2]:
# installs flair library
!pip install flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/44/54/76374f9a448ca765446502e7f2bb53c976e9c055102290fe6f8b0b038b37/flair-0.4.1.tar.gz (78kB)
Collecting segtok>=1.5.7 (from flair)
  Downloading https://files.pythonhosted.org/packages/1d/59/6ed78856ab99d2da04084b59e7da797972baa0efecb71546b16d48e49d9b/segtok-1.5.7.tar.gz
Collecting mpld3>=0.3 (from flair)
  Downloading https://files.pythonhosted.org/packages/91/95/a52d3a83d0a29ba0d6898f6727e9858fe7a43f6c2ce81a5fe7e05f0f4912/mpld3-0.3.tar.gz (788kB)
Collecting sqlitedict>=1.6.0 (from flair)
  Downloading https://files.pythonhosted.org/packages/0f/1c/c757b93147a219cf1e25cef7e1ad9b595b7f802159493c45ce116521caff/sqlitedict-1.6.0.tar.gz
Collecting deprecated>=1.2.4 (from flair)
  Downloading https://files.pythonhosted.org/packages/9f/7a/003fa432f1e45625626549726c2fbb7a29baa764e9d1fdb2323a5d779f8

In [0]:
# Main imports here
import torch
import flair
import numpy as np
import pandas as pd

In [4]:
# version checks (not mandatory)
print(torch.__version__)
print(flair.__version__)

1.0.1.post2
0.4.1


In [5]:
# Check for CUDA. If this prints False, the GPU is not enabled: go to Runtime > Change Runtime Type > Hardware Accelerator and select "GPU"
print(torch.cuda.is_available())
print(torch.version.cuda)

False
10.0.130
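
The result of this check is usually turned into a single device object that tensors and models are moved to. A minimal sketch, assuming the name `device` (it is not defined anywhere in the original notebook):

```python
import torch

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
# `device` is an illustrative name, not one used elsewhere in this notebook.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models) are moved to the chosen device with .to(device).
example = torch.zeros(2, 3).to(device)
print(device, example.device)
```

With the output shown above (`False`), this would select the CPU until the GPU runtime is enabled.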


In [6]:
# Directory check and dataset unzipping. Run the unzip once, then comment it out.
!pwd # Checks working directory
# !unzip Datasets/jigsaw_unintended_bias_in_toxicity_classification.zip -d Datasets

/content/drive/My Drive


### 02 - Exploring the dataset

In [0]:
# define path to source file and load data into pandas dataframe
train_path = "./Datasets/train.csv"
trainset = pd.read_csv(train_path)

In [10]:
# check shape, column names and basic stats of attributes of dataset
print(trainset.shape)
print(trainset.columns)
print(trainset.describe())

(1804874, 45)
Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date',
       'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow',
       'sad', 'likes', 'disagree', 'sexual_explicit',
       'identity_annotator_count', 'toxicity_annotator_count'],
      dtype='object')
                 id        target  severe_toxicity       obscene  \
count  1.804874e+06  1.804874e+06     1.804874e+06  1.804874e+06   
mean   3.738434e+06  1.030173e-01     4.582099e-03  1.387721e-0

In [11]:
trainset.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,,,...,2006,rejected,0,0,0,0,0,0.0,0,4
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,...,2006,rejected,0,0,0,1,0,0.0,4,47


In [12]:
trainset.tail()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
1804869,6333967,0.0,"Maybe the tax on ""things"" would be collected w...",0.0,0.0,0.0,0.0,0.0,,,...,399385,approved,0,0,0,0,0,0.0,0,4
1804870,6333969,0.0,What do you call people who STILL think the di...,0.0,0.0,0.0,0.0,0.0,,,...,399528,approved,0,0,0,0,0,0.0,0,4
1804871,6333982,0.0,"thank you ,,,right or wrong,,, i am following ...",0.0,0.0,0.0,0.0,0.0,,,...,399457,approved,0,0,0,0,0,0.0,0,4
1804872,6334009,0.621212,Anyone who is quoted as having the following e...,0.030303,0.030303,0.045455,0.621212,0.0,,,...,399519,approved,0,0,0,0,0,0.0,0,66
1804873,6334010,0.0,Students defined as EBD are legally just as di...,0.0,0.0,0.0,0.0,0.0,,,...,399318,approved,0,0,0,0,0,0.0,0,4


### 03 - Cleaning the Dataset
Dropping the columns we do not need for the type of training we are going to choose, eliminating NaNs, normalizing variables, and removing outliers (scores too high, texts too long, etc.).

In [0]:
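def preprocess(data):
    '''
    Replace punctuation and special symbols in a pandas Series of texts with spaces.
    Credit goes to https://www.kaggle.com/gpreda/jigsaw-fast-compact-solution
    '''
    # Characters to replace; the backslash before × is escaped explicitly
    punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'

    def clean_special_chars(text, punct):
        # Swap each listed character for a single space
        for p in punct:
            text = text.replace(p, ' ')
        return text

    data = data.astype(str).apply(lambda x: clean_special_chars(x, punct))
    return data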
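
A quick way to sanity-check this kind of cleaning is to run it on a couple of hand-written strings. The sketch below repeats the inner helper with a shortened, ASCII-only punctuation set (`string.punctuation`, an assumption made to keep the example small) and a made-up two-row Series:

```python
import string
import pandas as pd

def clean_special_chars(text, punct):
    # Replace every punctuation character with a single space.
    for p in punct:
        text = text.replace(p, " ")
    return text

# Shortened punctuation set (ASCII only), used here purely for illustration.
punct = string.punctuation

sample = pd.Series(["haha, you guys!", "well-done..."])
cleaned = sample.astype(str).apply(lambda x: clean_special_chars(x, punct))
print(cleaned.tolist())
```

Because each character is replaced by a space rather than deleted, runs of spaces remain in the result; a later `str.split()` would collapse them.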

In [0]:
# Drops non relevant columns: Creates a list of labels, then passes that list 
# to .drop() for the columns with those labels to be dropped from the dataframe
droplist = ['created_date','publication_id', 'parent_id', 'article_id', 
            'rating', 'funny', 'wow', 'sad', 'likes', 'disagree', 
            'identity_annotator_count', 'toxicity_annotator_count']
trainset_clean = trainset.drop(columns = droplist)

In [24]:
print(trainset_clean.shape)
trainset_clean.columns

(1804874, 33)


Index(['id', 'target', 'comment_text', 'severe_toxicity', 'obscene',
       'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual',
       'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu',
       'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability',
       'jewish', 'latino', 'male', 'muslim', 'other_disability',
       'other_gender', 'other_race_or_ethnicity', 'other_religion',
       'other_sexual_orientation', 'physical_disability',
       'psychiatric_or_mental_illness', 'transgender', 'white',
       'sexual_explicit'],
      dtype='object')

In [25]:
# Checks number of NaN values per column.
trainset_clean.isnull().sum()

id                                           0
target                                       0
comment_text                                 0
severe_toxicity                              0
obscene                                      0
identity_attack                              0
insult                                       0
threat                                       0
asian                                  1399744
atheist                                1399744
bisexual                               1399744
black                                  1399744
buddhist                               1399744
christian                              1399744
female                                 1399744
heterosexual                           1399744
hindu                                  1399744
homosexual_gay_or_lesbian              1399744
intellectual_or_learning_disability    1399744
jewish                                 1399744
latino                                 1399744
male         
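
The identity columns are empty for most rows, so they need a decision before training: drop them or fill them. One possible approach, shown on a toy frame, fills the missing identity scores with 0. Treating a missing score as 0 is an assumption, and both the toy frame and the `identity_cols` subset are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy stand-in for trainset_clean with two of the identity columns.
toy = pd.DataFrame({
    "target": [0.0, 0.893617, 0.0],
    "asian": [np.nan, 0.0, np.nan],
    "female": [np.nan, 1.0, np.nan],
})

identity_cols = ["asian", "female"]  # illustrative subset

# Count NaNs per column, as in the cell above.
print(toy.isnull().sum())

# Fill missing identity scores with 0 (an assumption, see the note above).
toy[identity_cols] = toy[identity_cols].fillna(0.0)
print(toy.isnull().sum().sum())
```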

### 04 - Tokenization & Defining Dataloaders
Splitting the dataset into training and validation sets (a test set is already available), defining transforms, and creating dataloaders.
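
The split itself can be sketched with plain pandas. The 20% validation fraction, the seed, and the toy frame below are illustrative choices, not values from the original notebook:

```python
import pandas as pd

# Toy stand-in for the cleaned training frame.
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "target": [i / 10 for i in range(10)],
})

# Hold out 20% of the rows for validation; the fraction and the seed
# are illustrative choices.
valid = toy.sample(frac=0.2, random_state=42)
train = toy.drop(valid.index)

print(len(train), len(valid))
```

Dropping the sampled index from the original frame guarantees that no row ends up in both sets.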

In [0]:
from flair.data import Sentence
# Build a flair Sentence from the first comment (column 2 is comment_text)
TrySentence = Sentence(trainset_clean.iloc[0, 2])
print(TrySentence)

Sentence: "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!" - 19 Tokens


In [0]:
for token in TrySentence:
  print(token)

Token: 1 This
Token: 2 is
Token: 3 so
Token: 4 cool.
Token: 5 It's
Token: 6 like,
Token: 7 'would
Token: 8 you
Token: 9 want
Token: 10 your
Token: 11 mother
Token: 12 to
Token: 13 read
Token: 14 this??'
Token: 15 Really
Token: 16 great
Token: 17 idea,
Token: 18 well
Token: 19 done!


Removing punctuation from the sentences.

### 05 - Defining Network Architecture
Defining our LSTM, creating and implementing its functions (train, backprop, etc.).
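
As a rough idea of what such a network could look like in PyTorch, here is a minimal sketch. The class name `ToxicityLSTM`, the layer sizes, and the choice of a single output logit are all assumptions; the input is assumed to be pre-computed word embeddings:

```python
import torch
import torch.nn as nn

class ToxicityLSTM(nn.Module):
    """Hypothetical minimal classifier: embeddings in, one toxicity logit out."""

    def __init__(self, embedding_dim=100, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim), e.g. pre-computed word embeddings
        _, (h_n, _) = self.lstm(x)
        # h_n: (num_layers, batch, hidden_dim); use the last layer's final state
        return self.fc(h_n[-1]).squeeze(1)

model = ToxicityLSTM()
dummy_batch = torch.randn(4, 19, 100)  # 4 sentences of 19 tokens each
logits = model(dummy_batch)
print(logits.shape)
```

Backpropagation does not need to be written by hand here: calling `.backward()` on a loss computed from `logits` lets autograd handle it.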

### 06 - Training the Model
Defining the model's hyperparameters.
Running our training data through the model, checking the loss, saving the best-fitting model, and fine-tuning the hyperparameters.
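
The loop described here (train, check the loss, keep the best model) might be sketched as follows. The tiny stand-in model, the random data, the learning rate, and the epoch count are placeholders chosen only to make the loop runnable:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model and random data, only to make the loop runnable.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))
x_tr, y_tr = torch.randn(64, 8), torch.randint(0, 2, (64,)).float()
x_va, y_va = torch.randn(16, 8), torch.randint(0, 2, (16,)).float()

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # illustrative lr

best_loss, best_state = float("inf"), None
for epoch in range(5):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_tr).squeeze(1), y_tr)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_va).squeeze(1), y_va).item()

    # Keep a copy of the weights whenever validation loss improves.
    if val_loss < best_loss:
        best_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
    print(f"epoch {epoch}: train {loss.item():.4f}  val {val_loss:.4f}")

# Restore the best weights; torch.save(best_state, path) would persist them.
model.load_state_dict(best_state)
```

Selecting on validation loss rather than training loss is a design choice: it favors the weights that generalize best, not the ones that fit the training data most closely.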


In [0]:
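# Note: these are Kaggle-kernel input paths; in Colab, point them at the Drive copies instead
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

x_train = train['comment_text']
# Binary label: 1 if the toxicity score is at least 0.5, else 0
y_train = np.where(train['target'] >= 0.5, 1, 0)
# Auxiliary targets: the overall score plus the five toxicity subtype scores
y_aux_train = train[['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']]
x_test = test['comment_text']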
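
To make the thresholding step concrete, here is the same `np.where` call applied to a handful of toy scores (two of them are the non-zero `target` values visible in the samples printed earlier; the rest are made up):

```python
import numpy as np
import pandas as pd

# Toy target scores; 0.893617 and 0.621212 appear in the printed samples.
scores = pd.Series([0.0, 0.893617, 0.621212, 0.5, 0.49])

# Scores at or above 0.5 become the positive (toxic) class.
labels = np.where(scores >= 0.5, 1, 0)
print(labels)
```

Note that a score of exactly 0.5 is counted as toxic because the comparison is `>=`.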

### 07 - Validating the Model
Running our validation dataset through the model. Repeating step 06 if necessary.
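
One simple validation check is accuracy computed from the model's logits. The sketch below uses hand-written logits and labels as hypothetical stand-ins for real model output:

```python
import torch

# Hypothetical model outputs (logits) and true binary labels for a validation batch.
logits = torch.tensor([2.0, -1.0, 0.3, -0.2])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

with torch.no_grad():
    probs = torch.sigmoid(logits)          # logits -> probabilities
    preds = (probs >= 0.5).float()         # same 0.5 threshold as the labels
    accuracy = (preds == labels).float().mean().item()

print(accuracy)
```

Since the mean `target` in the training set is only about 0.10, the classes are likely imbalanced, so accuracy alone can be misleading and is best read alongside other metrics.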

### 08 - Testing the Model
After steps 06 and 07 have been completed satisfactorily, running the test set through the model. If the model generalizes well, saving the model. Otherwise, going back to step 06 and starting over.