## Prerequisites



In [1]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

### Note! Some of these models support only multiclass classification. When selecting your dataset,  
### make sure that for algorithms which do not support multilabel classification you use only examples with exactly one label. 
### Examples without a label in any of the provided categories are clean messages, without any toxicity.

In [2]:
df = pd.read_csv("../jigsaw-toxic-comment-classification-challenge/train.csv")

In [3]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
df.shape

(159571, 8)

### One way to make training simpler is to use only the examples assigned to one category versus the clean examples.  
For example:  
- Select only messages with obscene label == 1  
- Select all of the "clean" messages  

Implement a model which performs binary classification, to decide whether a message is obscene or not.   

##### If you want to perform multilabel classification, make sure you understand the difference between multilabel and multiclass classification and that you are solving the correct task: choose only algorithms applicable to this type of problem.

#### To work with a multiclass task:  
Select only messages that have exactly one label assigned: a message cannot belong to 2 or more categories.  

#### To work with a multilabel task: 
You can work with the whole dataset: some of your messages have only 1 label, some have more than 1. 
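
The three ways of selecting data described above can be sketched as follows. This is a minimal illustration on a hypothetical toy dataframe (`toy`) that only mimics the label columns of `train.csv`; treating "clean" as an extra class in the multiclass case is an assumption, not something the task requires.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame(
    [
        ["clean text", 0, 0, 0, 0, 0, 0],
        ["only obscene", 0, 0, 1, 0, 0, 0],
        ["toxic and insult", 1, 0, 0, 0, 1, 0],
        ["only threat", 0, 0, 0, 1, 0, 0],
    ],
    columns=["comment_text"] + LABELS,
)

# Number of labels assigned to each message.
n_labels = toy[LABELS].sum(axis=1)

# Binary task: obscene messages vs clean messages.
binary = toy[(toy["obscene"] == 1) | (n_labels == 0)]
binary_y = binary["obscene"]

# Multiclass task: keep messages with at most one label.
multiclass = toy[n_labels <= 1].copy()
single = multiclass[LABELS].sum(axis=1) == 1
# The class is the single active label, or "clean" when no label is set.
multiclass["target"] = multiclass[LABELS].idxmax(axis=1).where(single, "clean")

# Multilabel task: the whole dataset, one 0/1 column per category.
multilabel_y = toy[LABELS].to_numpy()
```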

## Text vectorization

Previously we worked only with word vectorization. But we need a vector for each text, not only for the words in it. 

Before starting text vectorization, make sure you are working with clean data: use the dataset created on the previous day, cleaned of punctuation and stop words, lemmatized or stemmed, etc. 

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

In [6]:
def preprocess_text(tokenizer, lemmatizer, stop_words, punctuation, text): 
    tokens = tokenizer(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return [token for token in lemmas if token not in stop_words and token not in punctuation]

df['cleaned'] = df['comment_text'].apply(lambda x: [t.lemma_ for t in nlp(x.replace('\n', ' ').replace('\r', ' ')) if not (t.is_stop or t.is_punct or t.text[0]==' ')])

In [7]:
df['cleaned'][50]

['BI',
 'say',
 'want',
 'talk',
 'lead',
 'section',
 'write',
 'promoter',
 'speculate',
 '1994',
 'skyhook',
 'concept',
 'cost',
 'competitive',
 'realistically',
 'think',
 'achievable',
 'space',
 'elevator',
 'skyhook',
 'competitive',
 'rotate',
 'tether',
 'concept',
 'addition',
 'rotate',
 'skyhook',
 'fact',
 'deem',
 'engineeringly',
 'feasible',
 'presently',
 'available',
 'material',
 'addition',
 'rotate',
 'skyhook',
 'fact',
 'deem',
 'engineeringly',
 'feasible',
 'presently',
 'available',
 'material',
 'statement',
 'appear',
 'come',
 'Ref',
 '3',
 'page',
 '10',
 'quote',
 'mass',
 'tether',
 'start',
 'exceed',
 '200',
 'time',
 'mass',
 'payload',
 'indication',
 'particular',
 'scenario',
 'consider',
 'engineeringly',
 'feasible',
 'presently',
 'available',
 'material',
 'application',
 'feasible',
 'near',
 'future',
 'well',
 'material',
 'available',
 'high',
 'tensile',
 'strength',
 'high',
 'operational',
 'temperature',
 'go',
 'shall',
 'presently',

In [8]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

In [9]:
vocab = set(flat_nested(df.cleaned.tolist()))

In [10]:
len(vocab)

246806

As we can see, the vocabulary is probably too large.  
Let's try to make it smaller.  
For example, let's get rid of words whose count in our dataset is below some threshold.

In [11]:
from collections import Counter, defaultdict 

cnt_vocab = Counter(flat_nested(df.cleaned.tolist()))

In [12]:
cnt_vocab.most_common(40)

[('article', 70999),
 ('page', 55153),
 ('edit', 38916),
 ('Wikipedia', 37946),
 ('talk', 29391),
 ('like', 27736),
 ('think', 25163),
 ('know', 23704),
 ('source', 22905),
 ('add', 19115),
 ('time', 18144),
 ('people', 17679),
 ('use', 17294),
 ('=', 16518),
 ('need', 15488),
 ('want', 14842),
 ('say', 14838),
 ('delete', 14436),
 ('link', 14377),
 ('block', 14204),
 ('find', 14043),
 ('remove', 13793),
 ('work', 13526),
 ('look', 12644),
 ('user', 12295),
 ('write', 12202),
 ('way', 11834),
 ('information', 11795),
 ('go', 11765),
 ('change', 11765),
 ('comment', 11724),
 ('section', 11241),
 ('editor', 11185),
 ('point', 11086),
 ('good', 10959),
 ('deletion', 10922),
 ('try', 10894),
 ('Thanks', 10888),
 ('thing', 10782),
 ('help', 10665)]

You can remove words that are too short or too rare: the cell below keeps only tokens longer than `threshold_len` characters that occur more than `threshold_count` times. 

In [13]:
threshold_count = 10
threshold_len = 1 
cleaned_vocab = [token for token, count in cnt_vocab.items() if count > threshold_count and len(token) > threshold_len]

In [14]:
len(cleaned_vocab)

24227

Much better!  
Let's try to vectorize the text by summing the one-hot vectors of its words. 

In [15]:
vocabulary = defaultdict()

for i, token in enumerate(cleaned_vocab): 
    empty_vec = np.zeros(len(cleaned_vocab))
    empty_vec[i] = 1 
    vocabulary[token] = empty_vec

In [16]:
vocabulary['source']

array([0., 0., 0., ..., 0., 0., 0.])

Right now we have vectors for words (words are one-hot vectorized).  
Let's try to create vectors for texts: 

In [17]:
sample_text = df.cleaned[10]
print(sample_text)

['Fair', 'use', 'rationale', 'Image', 'Wonju.jpg', 'Thanks', 'upload', 'Image', 'Wonju.jpg', 'notice', 'image', 'page', 'specify', 'image', 'fair', 'use', 'explanation', 'rationale', 'use', 'Wikipedia', 'article', 'constitute', 'fair', 'use', 'addition', 'boilerplate', 'fair', 'use', 'template', 'write', 'image', 'description', 'page', 'specific', 'explanation', 'rationale', 'image', 'article', 'consistent', 'fair', 'use', 'image', 'description', 'page', 'edit', 'include', 'fair', 'use', 'rationale', 'upload', 'fair', 'use', 'medium', 'consider', 'check', 'specify', 'fair', 'use', 'rationale', 'page', 'find', 'list', 'image', 'page', 'edit', 'click', 'contribution', 'link', 'locate', 'Wikipedia', 'page', 'log', 'select', 'Image', 'dropdown', 'box', 'Note', 'fair', 'use', 'image', 'upload', '4', '2006', 'lack', 'explanation', 'delete', 'week', 'upload', 'describe', 'criterium', 'speedy', 'deletion', 'question', 'ask', 'Media', 'copyright', 'question', 'page', 'Thank', 'talk', 'contribs'

### One-hot vectorization and count vectorization

In [18]:
sample_vector = np.zeros(len(cleaned_vocab))

for token in sample_text: 
    try: 
        sample_vector += vocabulary[token]
    except KeyError: 
        print(token)

Wonju.jpg
Wonju.jpg
4
Wonju.jpg
Wonju.jpg


In [19]:
sample_vector

array([0., 2., 0., ..., 0., 0., 0.])

Now we have a count vectorization of our text.   
Use this pipeline to create vectors for all of the texts. Save them into an np.array whose i-th row is the vector representing the i-th text of the dataframe.  

In [20]:
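vectors = np.zeros((df.shape[0], len(cleaned_vocab)), dtype=np.int32)
for i in range(df.shape[0]):  # df.size is an int attribute, so use df.shape[0] for the row count
    for token in df['cleaned'][i]: 
        try: 
            # one-hot vectors are float64, so cast before the in-place add into the int32 array
            vectors[i] += vocabulary[token].astype(np.int32)
        except KeyError: 
            # token was filtered out of the vocabulary
            continue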

MemoryError: Unable to allocate 14.4 GiB for an array with shape (159571, 24227) and data type int32
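
The dense array above does not fit in memory because almost all of its cells are zeros. One possible workaround, sketched here on hypothetical toy data, is to store only the non-zero counts in a `scipy.sparse.csr_matrix`. The helper name `count_vectorize_sparse` and the toy inputs are assumptions made for illustration; in the notebook the arguments would presumably be `df.cleaned.tolist()` and `cleaned_vocab`.

```python
import numpy as np
from scipy.sparse import csr_matrix

def count_vectorize_sparse(texts, vocab_tokens):
    """Build a sparse document-term count matrix from tokenized texts."""
    # A token -> column index map replaces the dense one-hot vectors.
    token_to_col = {token: j for j, token in enumerate(vocab_tokens)}
    rows, cols = [], []
    for i, tokens in enumerate(texts):
        for token in tokens:
            j = token_to_col.get(token)
            if j is not None:  # skip out-of-vocabulary tokens
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows), dtype=np.int32)
    # Duplicate (row, col) pairs are summed when the matrix is built,
    # which turns repeated tokens into counts.
    return csr_matrix((data, (rows, cols)), shape=(len(texts), len(vocab_tokens)))

toy_texts = [["fair", "use", "image", "use"], ["page", "image"], ["unknown"]]
toy_vocab = ["fair", "use", "image", "page"]
X = count_vectorize_sparse(toy_texts, toy_vocab)
print(X.toarray())
```

Most scikit-learn classifiers imported at the top of the notebook accept sparse input directly; `GaussianNB` is an exception and needs a dense array.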

### The next step is to train any classification model on top of the obtained vectors and report the quality. 

Select any of the proposed pipelines for the text classification task (binary, multiclass or multilabel).  

To measure model performance, you first need training and test sets. Once you have selected the texts for your task, use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html to get at least two sets: train and test.  

You train the model on the train examples and evaluate it on the test examples, to understand how the model works on unseen data. 

### Train-test split 

In [None]:
### Your code here, splitting your dataset into train and test parts. 
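
As a starting point, here is a minimal sketch of the split on hypothetical stand-in arrays. The values of `test_size`, `random_state` and the use of `stratify` are illustrative choices, not requirements of the task.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 10 texts with 3 features each, and binary labels.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# stratify=y keeps the class proportions similar in both parts,
# which matters when toxic examples are much rarer than clean ones.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```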

### TF-IDF score 

#### Please review this article again, or read it if you have not done so before. 

https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7

#### Implement the calculation of a tf-idf score for each word in your vocabulary. 

The main goal of this task is to create a dictionary whose keys are tokens and whose values are the corresponding tf-idf scores.

#### Calculate it MANUALLY and compare the resulting scores with the sklearn implementation:  
from sklearn.feature_extraction.text import TfidfTransformer 

#### Tip: 

##### TF = (Number of times the word occurs in the current text) / (Total number of words in the current text)  

##### IDF = log(Total number of documents / Number of documents with word t in it)

##### TF-IDF = TF*IDF 
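
The formulas above can be sketched in plain Python as follows. This is only one possible reading: it assumes the common logarithmic form of IDF (natural logarithm) and returns one score dictionary per text, since TF depends on the text. The function name `tf_idf_per_document` and the toy texts are hypothetical.

```python
import math
from collections import Counter

def tf_idf_per_document(texts):
    """Return one {token: tf-idf} dict per tokenized text."""
    n_docs = len(texts)
    # Document frequency: in how many texts each token appears at least once.
    doc_freq = Counter(token for tokens in texts for token in set(tokens))
    idf = {token: math.log(n_docs / df) for token, df in doc_freq.items()}
    scores = []
    for tokens in texts:
        counts = Counter(tokens)
        total = len(tokens)
        # TF is the share of the text taken up by the token.
        scores.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return scores

toy_texts = [["fair", "use", "image", "use"], ["page", "image"], ["page", "edit"]]
toy_scores = tf_idf_per_document(toy_texts)
print(toy_scores[0])
```

When comparing with sklearn, keep in mind that `TfidfTransformer` by default uses raw counts as TF, a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, so the numbers will not match this sketch exactly unless you account for those differences.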

When you have calculated a tf-idf score for each word in your vocabulary, revectorize the texts.  
Instead of putting the number of occurrences of the i-th word in the i-th cell of the text vector, use its tf-idf score.   

Revectorize the documents, save vectors into np.array. 

In [None]:
### Your code here for obtaining a tf-idf vectorized documents. 

### Training the model 

As mentioned above, select any text classification model suitable for the chosen task and train it. 

Once the model is trained, you need to evaluate it. 

Read about True positive, False positive, False negative and True negative counts and how to calculate them:   

https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative 

##### Calculate TP, FP, FN and TN on the test set for your model to measure its performance. 


In [None]:
TP = 0  ## Your code here 
FP = 0  ## Your code here 
FN = 0  ## Your code here 
TN = 0  ## Your code here 
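
One way to count the four outcomes for a binary task is sketched below. The helper name `confusion_counts` and the toy label lists are hypothetical; in the notebook `y_true` would be the test labels and `y_pred` the model's predictions on the test set.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn

toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(toy_true, toy_pred))  # (3, 1, 1, 3)
```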

#### The next step is to calculate Precision, Recall, F1 and F2 scores 

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

In [None]:
prec = 0  ## Your code here 
rec = 0  ## Your code here 
F1 = 0  ## Your code here 
F2 = 0  ## Your code here 
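
The metrics can be derived from the counts above with the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The sketch below uses a hypothetical helper `precision_recall_fbeta` and made-up counts; returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and the F-beta score from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

# Made-up counts for illustration.
toy_prec, toy_rec, toy_f1 = precision_recall_fbeta(tp=3, fp=1, fn=2, beta=1.0)
_, _, toy_f2 = precision_recall_fbeta(tp=3, fp=1, fn=2, beta=2.0)
print(toy_prec, toy_rec, toy_f1, toy_f2)
```

F2 weights recall more heavily than precision, which can be a sensible choice when missing a toxic message is considered worse than flagging a clean one.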

Calculate these metrics for both the count vectorization and the tf-idf vectorization.  
Compare them. 

### Conclusions and improvements 

All of the vectorization pipelines above used every word available in our dictionary. As an experiment, try to use only the most meaningful words, selected by TF-IDF score (for example, keep no more than 10 words per text for vectorization). 

Compare this approach with the first and second ones. Did your model improve? 
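
Selecting the most meaningful words of a text could look like the sketch below. It assumes you already have a `{token: tf-idf}` dictionary for the text; the helper name `top_k_tokens` and the scores are hypothetical.

```python
def top_k_tokens(token_scores, k=10):
    """Return the k tokens with the highest scores, best first."""
    ranked = sorted(token_scores.items(), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked[:k]]

# Made-up tf-idf scores for one text.
toy_text_scores = {"image": 0.1, "use": 0.55, "fair": 0.27, "page": 0.0}
kept = top_k_tokens(toy_text_scores, k=2)
print(kept)  # ['use', 'fair']
```

Only the kept tokens would then contribute to the text vector; all other cells stay at zero.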



### Additionally, visualisations 

You now have a vector for each word in your vocabulary. 
These vectors have more than 18000 dimensions (24227 in the run above), so they cannot be plotted directly in 2D space. 

So research algorithms that perform dimensionality reduction (t-SNE, PCA). 
Try to visualise the obtained vectors in a vector space, using only a subset of the vocabulary (about 100 words); don't plot all of the words. 
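
As one possible starting point, PCA can be sketched with NumPy alone via the singular value decomposition. The helper name `pca_2d` is hypothetical, and `np.eye(100)` stands in for the one-hot vectors of 100 words; in practice you might prefer `sklearn.decomposition.PCA` or `sklearn.manifold.TSNE`.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in for the one-hot vectors of a 100-word subset of the vocabulary.
word_vectors = np.eye(100)
coords = pca_2d(word_vectors)
print(coords.shape)  # (100, 2)
```

The two columns of `coords` can be passed as x and y to any scatter plot, with the words as point labels.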

At this step you will probably realise that this type of vectorization is not the best way to vectorize words. 

Please analyse the obtained results and explain why the visualisation looks the way it does. 