## Prerequisites



In [1]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB


from os import chdir
chdir(
    r'C:\Users\laplace-transform\AppData\Local\Programs\Python\Python37\notebooks\2020-knu-ai-master\jigsaw-toxic-comment-classification-challenge'
)

### Note! Some of these models support only multiclass classification. When selecting your dataset,  
### make sure that for algorithms which do not support multilabel classification you use only examples with exactly one label. 
### Examples without a label in any of the provided categories are clean messages, without any toxicity.

In [2]:
df = pd.read_csv("../jigsaw-toxic-comment-classification-challenge/train.csv")

In [3]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
df.shape

(159571, 8)

### As one of the methods to make the training simpler, use only examples assigned to one chosen category vs. clean examples.  
For example:  
- Select only messages with obscene label == 1  
- Select all of the "clean" messages  
Implement a model which performs binary classification: decide whether a message is obscene or not.   

##### If you want to perform a multilabel classification, please understand the difference between multilabel and multiclass classification and be sure that you are solving the correct task - choose only algorithms applicable for solving this type of problem.

#### To work with multiclass task:  
You only need to select messages which have only one label assigned: message cannot be assigned to 2 or more categories.  

#### To work with multilabel task: 
You can work with the whole dataset - some of your messages have only 1 label, some more than 1. 
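
The three ways of selecting data described above can be sketched with pandas as follows. This is only an illustration on a tiny made-up frame (`toy`, `LABELS` and the derived names are assumptions, not part of the assignment); the column names mirror those in `train.csv`.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Tiny hypothetical frame standing in for train.csv
toy = pd.DataFrame(
    {
        "comment_text":  ["a", "b", "c", "d"],
        "toxic":         [0, 1, 1, 0],
        "severe_toxic":  [0, 0, 0, 0],
        "obscene":       [0, 0, 1, 1],
        "threat":        [0, 0, 0, 0],
        "insult":        [0, 0, 0, 0],
        "identity_hate": [0, 0, 0, 0],
    }
)

# Number of labels assigned to each message
n_labels = toy[LABELS].sum(axis=1)

# Binary task: clean messages vs. messages labelled as obscene
clean = toy[n_labels == 0]
binary = pd.concat([clean, toy[toy["obscene"] == 1]])

# Multiclass task: keep only messages with exactly one label
multiclass = toy[n_labels == 1].copy()
multiclass["label"] = multiclass[LABELS].idxmax(axis=1)  # name of the single label

# Multilabel task: the whole dataset, one 0/1 target column per label
multilabel_targets = toy[LABELS].to_numpy()

print(len(binary), multiclass["label"].tolist(), multilabel_targets.shape)
```
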

## Text vectorization

Previously we worked only with word vectorization. But we need a vector for each text, not only for the words in it. 

Before starting text vectorization, please make sure you are working with clean data: use the dataset created on the previous day, cleaned of punctuation and stop words, lemmatized or stemmed, etc. 

In [5]:
from string import punctuation

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english'))

In [6]:
def preprocess_text(tokenizer, lemmatizer, stop_words, punctuation, text): 
    tokens = tokenizer(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return [token for token in lemmas if token not in stop_words and token not in punctuation]

df['cleaned'] = df.comment_text.apply(lambda x: preprocess_text(word_tokenize, lemmatizer, stop_words, punctuation, x))

In [7]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, edits, made, username, hardcore,..."
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[d'aww, match, background, colour, 'm, seeming..."
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, 'm, really, trying, edit, war, 's, ..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[``, ca, n't, make, real, suggestion, improvem..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[sir, hero, chance, remember, page, 's]"


In [8]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

In [9]:
vocab = set(flat_nested(df.cleaned.tolist()))

In [10]:
len(vocab)

249531

As we can see, the vocabulary is probably too large.  
Let's try to make it smaller.  
For example, let's get rid of words whose count in our dataset is below some threshold.

In [11]:
from collections import Counter, defaultdict 

cnt_vocab = Counter(flat_nested(df.cleaned.tolist()))

In [12]:
cnt_vocab.most_common(10)

[("''", 241319),
 ('``', 156982),
 ('article', 73264),
 ("'s", 66766),
 ("n't", 57144),
 ('wa', 56590),
 ('page', 56239),
 ('wikipedia', 45413),
 ('talk', 35356),
 ('ha', 31896)]

You can remove words which are shorter than a particular length or occur fewer than N times. 

In [13]:
threshold_count = 10
threshold_len = 4 
cleaned_vocab = [token for token, count in cnt_vocab.items() if count > threshold_count and len(token) > threshold_len]

In [14]:
len(cleaned_vocab)

18705

Much better!  
Let's try to vectorize the text by summing the one-hot vectors of its words. 

In [15]:
vocabulary = defaultdict()

for i, token in enumerate(cleaned_vocab): 
    empty_vec = np.zeros(len(cleaned_vocab))
    empty_vec[i] = 1 
    vocabulary[token] = empty_vec

In [16]:
vocabulary['hardcore']

array([0., 0., 0., ..., 0., 0., 0.])

Right now we have vectors for words (words are one-hot vectorized).  
Let's try to create vectors for texts: 

In [17]:
sample_text = df.cleaned[10]
print(sample_text)

['``', 'fair', 'use', 'rationale', 'image', 'wonju.jpg', 'thanks', 'uploading', 'image', 'wonju.jpg', 'notice', 'image', 'page', 'specifies', 'image', 'used', 'fair', 'use', 'explanation', 'rationale', 'use', 'wikipedia', 'article', 'constitutes', 'fair', 'use', 'addition', 'boilerplate', 'fair', 'use', 'template', 'must', 'also', 'write', 'image', 'description', 'page', 'specific', 'explanation', 'rationale', 'using', 'image', 'article', 'consistent', 'fair', 'use', 'please', 'go', 'image', 'description', 'page', 'edit', 'include', 'fair', 'use', 'rationale', 'uploaded', 'fair', 'use', 'medium', 'consider', 'checking', 'specified', 'fair', 'use', 'rationale', 'page', 'find', 'list', "'image", 'page', 'edited', 'clicking', '``', "''", 'contribution', "''", "''", 'link', 'located', 'top', 'wikipedia', 'page', 'logged', 'selecting', '``', "''", 'image', "''", "''", 'dropdown', 'box', 'note', 'fair', 'use', 'image', 'uploaded', '4', 'may', '2006', 'lacking', 'explanation', 'deleted', 'one

### One-hot vectorization and count vectorization

In [18]:
sample_vector = np.zeros(len(cleaned_vocab))

for token in sample_text: 
    try: 
        sample_vector += vocabulary[token]
    except KeyError: 
        # token was filtered out of the vocabulary, skip it
        continue

In [56]:
sample_vector

Unnamed: 0,insult,cleaned
0,0,"[explanation, edits, made, username, hardcore,..."
1,0,"[d'aww, match, background, colour, 'm, seeming..."
2,0,"[hey, man, 'm, really, trying, edit, war, 's, ..."
3,0,"[``, ca, n't, make, real, suggestion, improvem..."
4,0,"[sir, hero, chance, remember, page, 's]"
...,...,...
159566,0,"[``, second, time, asking, view, completely, c..."
159567,0,"[ashamed, horrible, thing, put, talk, page, 12..."
159568,0,"[spitzer, umm, actual, article, prostitution, ..."
159569,0,"[look, like, wa, actually, put, speedy, first,..."


Right now we have a count vectorization of our text.   
Use this pipeline to create vectors for all of the texts. Save them into an np.array. The i-th row of the np.array is the vector which represents the i-th text from the dataframe.  

In [48]:
from scipy.sparse import csr_matrix

def vocabulary_interact_vect(
    sample_text:     np.ndarray,
    vocabulary_len:  int,
    vect_vocabulary: defaultdict,
) -> np.ndarray:
    
    text_vector = np.zeros(len(vect_vocabulary))
    
    for token in sample_text:
        try: 
            text_vector += vect_vocabulary[token]
        except KeyError: 
            continue
    return text_vector

# sparse version, because we're going to use a matrix representation of the train data
def vocabulary_interact_sparse(
    corpus:          pd.Series,
    vocabulary_len:  int,
    vect_vocabulary: defaultdict,
) -> csr_matrix:
    
    corpus_len = len(corpus)
    
    texts_vectorized = csr_matrix((corpus_len, vocabulary_len))
    
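    for j in range(corpus_len):
        
        # .iloc addresses the j-th text by position, whatever the index is
        current_text_vector = vocabulary_interact_vect(
            corpus.iloc[j], vocabulary_len, vect_vocabulary
        )
        
        # put the text vector into row j of an otherwise empty matrix;
        # vocabulary_len is used here instead of relying on a global N
        current_sparse_matr = csr_matrix(
            (current_text_vector, (np.full(vocabulary_len, j), np.arange(vocabulary_len))), 
            shape = (corpus_len, vocabulary_len)
        )
        
        texts_vectorized += current_sparse_matr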
    
    return texts_vectorized

In [51]:
N = len(cleaned_vocab)

text_vectorized = vocabulary_interact_sparse(
    corpus = pd.Series([df.cleaned[10]]), vocabulary_len = N, vect_vocabulary = vocabulary
)

text_vectorized.toarray()

1 <class 'pandas.core.series.Series'>


array([[3., 0., 0., ..., 0., 0., 0.]])
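
Adding one full-size sparse matrix per text becomes slow on a corpus of this size. As a possible alternative, the sketch below builds the same kind of count matrix in a single pass by collecting (row, column) pairs. The names `count_matrix`, `toy_vocab` and `toy_corpus` are hypothetical and only serve the illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def count_matrix(corpus, vocab_tokens):
    """Build an (n_texts, n_vocab) sparse count matrix in one pass."""
    token_to_col = {tok: i for i, tok in enumerate(vocab_tokens)}
    rows, cols = [], []
    for row, tokens in enumerate(corpus):
        for tok in tokens:
            col = token_to_col.get(tok)
            if col is not None:          # skip out-of-vocabulary tokens
                rows.append(row)
                cols.append(col)
    data = np.ones(len(rows))
    # duplicate (row, col) pairs are summed when the CSR matrix is built,
    # which turns repeated tokens into counts
    return csr_matrix((data, (rows, cols)), shape=(len(corpus), len(vocab_tokens)))

# Hypothetical toy data
toy_vocab = ["image", "page", "article"]
toy_corpus = [["image", "page", "image", "unknown"], ["article"], []]

counts = count_matrix(toy_corpus, toy_vocab)
print(counts.toarray())
```

With the notebook's own objects, a call such as `count_matrix(df.cleaned, cleaned_vocab)` should give one row per text, assuming the column order of `cleaned_vocab` is what you want for the features.
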

### The next step is to train any classification model on top of the obtained vectors and report the quality. 

Please select any of the proposed pipelines for performing a text classification task (binary, multiclass or multilabel).  

To measure the performance of our model, we first need to create training and test sets. Once you have selected the texts for your task, please use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html to obtain at least two sets - train and test.  

You will use the train examples to train your model and the test examples to evaluate it - to understand how your model works on unseen data. 

### Train-test split 

In [76]:
### Your code here, splitting your dataset into train and test parts.

# This time i'm going to handle binary classification task
# First of all, I'll divide df into two separate groups:
# - non-toxic data
# - insult labelled data

df_categories = [
    'identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic'
]

crucial_data = df[[df_categories[1], 'cleaned']]

df_non_toxic = crucial_data[~df[df_categories].any(axis = 'columns')]
df_insulting = crucial_data[df.insult != 0]

# pd.concat is used because DataFrame.append was removed in pandas 2.0
df_combined = pd.concat([df_non_toxic, df_insulting]).reset_index(drop = True)

print(
    df_combined.head(),
    df_combined.tail(),
    sep = '\n\n'
)

   insult                                            cleaned
0       0  [explanation, edits, made, username, hardcore,...
1       0  [d'aww, match, background, colour, 'm, seeming...
2       0  [hey, man, 'm, really, trying, edit, war, 's, ...
3       0  [``, ca, n't, make, real, suggestion, improvem...
4       0            [sir, hero, chance, remember, page, 's]

        insult                                            cleaned
151218       1  [``, previous, conversation, fucking, shit, ea...
151219       1                        [mischievious, pubic, hair]
151220       1  [absurd, edits, absurd, edits, great, white, s...
151221       1  [``, hey, listen, n't, ever, delete, edits, ev...
151222       1  ['m, going, keep, posting, stuff, u, deleted, ...


In [79]:
from sklearn.model_selection import train_test_split

custom_test_size = 0.25

# our list of texts
X = df_combined['cleaned']

# their labels
Y = df_combined['insult']

# making train and test sets for future model
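# texts and labels are split together so that they stay aligned
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size = custom_test_size
)

# reset indices so that texts and labels can be addressed by position
X_train, X_test = X_train.reset_index(drop = True), X_test.reset_index(drop = True)
Y_train, Y_test = Y_train.reset_index(drop = True), Y_test.reset_index(drop = True)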

In [80]:
# Train set
X_train.head()

0    [.., malechh, actually, referred, tribe, count...
1    [article, could, tidied, someone, familiar, en...
2    [deletion, pretty, ridiculous, dude, need, sti...
3    [added, picture, 2, interpreter, see, hand, ge...
4    [thank, experimenting, wikipedia, test, worked...
Name: cleaned, dtype: object

In [81]:
# Test set
X_test.head()

0                 [already, responded, workshop, page]
1    [``, search, eliminates, upenn.edu, hit, appea...
2    [``, metalcore, modern, fusion, hardcore, punk...
3    [``, part, ``, '', preserve, reasonable, conte...
4    [btw, need, serious, fact, checking, claim, as...
Name: cleaned, dtype: object

### TF-IDF score 

#### Please review this article again, or read it if you have not done so before. 

https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7

#### Implement calculating a tf-idf score for each of the words from your vocabulary. 

The main goal of this task is to create a dictionary - keys of the dictionary would be tokens and values would be corresponding tf-idf score of the token.

#### Calculate it MANUALLY and compare the received scores for words with the sklearn implementation:  

#### Tip: 

##### TF = (Number of times the word occurs in the current text) / (Total number of words in the current text)  

##### IDF = (Total number of documents / Number of documents with word t in it)

##### TF-IDF = TF*IDF 

Once you have calculated a tf-idf score for each of the words in your vocabulary, revectorize the texts.  
Instead of putting the number of occurrences of the i-th word into the i-th cell of the text vector, use its tf-idf score.   

Revectorize the documents and save the vectors into an np.array. 
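
To make the formulas from the tip concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The corpus and the helper names (`tf`, `idf`, `tf_idf_vector`) are assumptions for illustration only. Note that TF depends on the document, so each text gets its own score for every vocabulary word.

```python
import math

# Hypothetical toy corpus: each text is a list of cleaned tokens
toy_corpus = [
    ["fair", "use", "image", "image"],
    ["talk", "page", "image"],
    ["talk", "talk", "article", "page"],
]

def tf(term, doc):
    """Share of the tokens in `doc` that are equal to `term`."""
    return doc.count(term) / len(doc) if doc else 0.0

def idf(term, corpus, use_log=False):
    """Total number of documents / number of documents containing `term`."""
    n_containing = sum(term in doc for doc in corpus)
    if n_containing == 0:
        return 0.0                      # term never occurs, avoid division by zero
    ratio = len(corpus) / n_containing
    return math.log(ratio) if use_log else ratio

def tf_idf_vector(doc, corpus, vocab, use_log=False):
    """One tf-idf score per vocabulary word, in vocabulary order."""
    return [tf(t, doc) * idf(t, corpus, use_log) for t in vocab]

vocab = sorted({t for doc in toy_corpus for t in doc})
first_vec = dict(zip(vocab, tf_idf_vector(toy_corpus[0], toy_corpus, vocab)))
print(first_vec)
```

In the first text, "image" has TF = 2/4 and IDF = 3/2, while "fair" has TF = 1/4 and IDF = 3/1, so both end up with the same score of 0.75: a rare word mentioned once weighs as much as a common word mentioned twice.
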

In [65]:
### Your code here for obtaining a tf-idf vectorized documents. 

# Like mentioned above, we're going to define 
# the tf-idf vectorization method manually.

def TF_binary(
    term:     str,       # our token
    document: list       # our text - a list of cleaned tokens
) -> bool:
    return term in document


def TF(
    term:     str,       # our token
    document: list       # our text - a list of cleaned tokens
) -> float:
    return document.count(term)/len(document) if document else 0


def IDF(
    term:     str,       # our token
    corpus:   pd.Series, # list of all texts, to which the mentioned one belongs
    use_log:  bool       # apply log func on a result or not
) -> float:
    TF_binary_v_rough = np.vectorize(lambda doc: TF_binary(term, doc))
    
    return (np.log if use_log else (lambda t: t))(
        len(corpus)/len(corpus[TF_binary_v_rough(corpus)])
    )


def TF_IDF(
    term:     str,       # our token
    document: list,      # our text - a list of cleaned tokens
    corpus:   pd.Series, # list of all texts, to which the mentioned one belongs
    use_log:  bool       # apply log func to the IDF or not
) -> float:
    return TF(term, document) * (
        IDF(term, corpus, use_log)
    )

In [66]:
df_sample = df.cleaned[10]
ex_term = 'one'

print(
    "Term:{tm:>10}\n"
    "TF_bin:{tb:>6}\n"
    "TF_std:{ts:>25}\n".format(
        tm = ex_term,
        tb = TF_binary(ex_term, df_sample),
        ts = TF(ex_term, df_sample) # equals 1/69 in the output below
    )
)

df_sample_series = df.cleaned[8:13]

print(df_sample_series, end = '\n\n')

print(
    "Term:{tm:>10}\n"
    "IDF_std:{ib:>22}\n"
    "IDF_log:{il:>22}\n".format(
        tm = ex_term,
        ib = IDF(ex_term, df_sample_series, use_log = False),
        il = IDF(ex_term, df_sample_series, use_log = True)
    )
)

print(
    "Term:{tm:>10}\n"
    "TF-IDF_std: {tib}\n"
    "TF-IDF_log: {til}\n".format(
        tm = ex_term,
        tib = TF_IDF(ex_term, df_sample, df_sample_series, use_log = False),
        til = TF_IDF(ex_term, df_sample, df_sample_series, use_log = True)
    )
)

del df_sample_series, df_sample, ex_term

Term:       one
TF_bin:     1
TF_std:     0.014492753623188406

8     [sorry, word, 'nonsense, wa, offensive, anyway...
9             [alignment, subject, contrary, dulithgow]
10    [``, fair, use, rationale, image, wonju.jpg, t...
11             [bbq, man, let, discus, it-maybe, phone]
12    [hey, ..., it.., talk, ..., exclusive, group, ...
Name: cleaned, dtype: object

Term:       one
IDF_std:    1.6666666666666667
IDF_log:    0.5108256237659907

Term:       one
TF-IDF_std: 0.024154589371980676
TF-IDF_log: 0.007403269909652039



In [67]:
def TF_IDF_dict(
    corpus:     pd.Series,   # list of all texts
    vocabulary: defaultdict, # generated set of unique words
    use_log:    bool         # apply log func to the IDF or not
) -> defaultdict:
    
    tf_idf_dict = defaultdict()
    
    for token in vocabulary:
        
        # per-document TF of the token, scaled by the token's corpus-wide IDF
        tf_idf_dict[token] = np.array(
            [(doc.count(token)/len(doc) if doc else 0) for doc in corpus]
        ) * (np.log if use_log else (lambda t: t))(
            len(corpus)/len(corpus[np.vectorize(lambda doc: token in doc)(corpus)])
        )
    
    # token -> array of tf-idf scores, one score per document
    return tf_idf_dict

In [None]:
TF_IDF_final = TF_IDF_dict(
    corpus = df.cleaned,
    vocabulary = vocabulary,
    use_log = False
)

TF_IDF_final

In [83]:
# Now we're calculating count matrix for X_train

#X_train_count_matrix = vocabulary_interact_sparse(
#    corpus = X_train, vocabulary_len = N, vect_vocabulary = vocabulary
#)

In [84]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_idf_trans = TfidfTransformer()

#tf_idf_trans.fit(texts_vectorized)

#sparse_matr = tf_idf_trans.transform(texts_vectorized)
#print(sparse_matr)

#trans_array = sparse_matr.toarray()
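
When comparing manual scores with scikit-learn, keep in mind that `TfidfTransformer` does not use the formulas from the tip as they are: it takes raw counts as TF, and with `smooth_idf=False` its IDF is `ln(n / df) + 1`; by default it additionally smooths the IDF and L2-normalises each row. The sketch below, on a made-up count matrix, reproduces the scikit-learn result by hand for the unsmoothed, unnormalised setting.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 texts x 3 vocabulary words
counts = np.array([[2., 1., 0.],
                   [0., 1., 1.],
                   [0., 0., 3.]])

# Switch off smoothing and normalisation to get the plainest variant
transformer = TfidfTransformer(norm=None, smooth_idf=False)
sk_tfidf = transformer.fit_transform(csr_matrix(counts)).toarray()

# The same computation by hand: raw count * (ln(n_docs / doc_freq) + 1)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
manual = counts * (np.log(n_docs / doc_freq) + 1)

print(np.allclose(sk_tfidf, manual))
```

Scores computed with TF as a fraction of the text length and IDF as a plain ratio will therefore differ numerically from the scikit-learn ones, even though the ranking of words inside a text is often similar.
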

### Training the model 

As mentioned before, select any text classification model suitable for the chosen task and train it. 

When the model is trained, you need to evaluate it somehow. 

Read about True positive, False positive, False negative and True negative counts and how to calculate them:   

https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative 

##### Calculate TP, FP, FN and TN on the test set for your model to measure its performance. 


In [47]:
#linear_classifier = LogisticRegression(
#    random_state = 0
#)

ValueError: setting an array element with a sequence.
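
A `ValueError` like the one above is commonly raised when an estimator receives a column of token lists instead of a numeric matrix. A minimal sketch of fitting `LogisticRegression` on count vectors is shown below; the matrix `X_counts` and the labels `y` are made-up stand-ins for the vectorized train set and its `insult` labels.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Hypothetical count vectors for 6 texts over a 3-word vocabulary
X_counts = csr_matrix(np.array([
    [3, 0, 0], [2, 1, 0], [4, 0, 1],
    [0, 2, 1], [0, 3, 0], [1, 4, 2],
], dtype=float))

# Hypothetical binary labels (1 = insult, 0 = clean)
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression(random_state=0)
clf.fit(X_counts, y)                 # the model accepts sparse numeric input

predictions = clf.predict(X_counts)  # in practice, predict on the test matrix
print(predictions)
```
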

In [56]:
TP = 0  ## Your code here 
FP = 0  ## Your code here 
FN = 0  ## Your code here 
TN = 0  ## Your code here 
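
One possible way to fill in the four counters is to walk over pairs of true and predicted labels. The function name `confusion_counts` and the example labels are hypothetical; in the notebook the inputs would be the test labels and the model predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # predicted positive, really positive
        elif pred == positive:
            fp += 1          # predicted positive, really negative
        elif truth == positive:
            fn += 1          # predicted negative, really positive
        else:
            tn += 1          # predicted negative, really negative
    return tp, fp, fn, tn

# Hypothetical labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

TP, FP, FN, TN = confusion_counts(y_true, y_pred)
print(TP, FP, FN, TN)  # 3 1 1 3
```
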

#### The next step is to calculate  Precision, Recall, F1 and F2 score 

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

In [24]:
prec = 0  ## Your code here 
rec = 0  ## Your code here 
F1 = 0  ## Your code here 
F2 = 0  ## Your code here 
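
The four metrics follow directly from the counts. The sketch below uses the general F-beta formula, of which F1 and F2 are the cases beta = 1 and beta = 2; the counts are made up for the example and the function names are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 puts more weight on recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

# Hypothetical counts
tp, fp, fn = 6, 2, 4

prec, rec = precision_recall(tp, fp, fn)
f1 = f_beta(prec, rec, beta=1)
f2 = f_beta(prec, rec, beta=2)
print(prec, rec, f1, f2)
```

Here precision is 0.75 and recall is 0.6, so F2 (0.625) lands closer to the recall than F1 (about 0.667) does.
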

Calculate these metrics for the vectorization created using count vectorizing and for tf-idf vectorization.  
Compare them. 

### Conclusions and improvements 

All of the vectorization pipelines above used every word available in our vocabulary. As an experiment, try to use only the most meaningful words, selected by their TF-IDF score (for example, keep no more than 10 words per text for vectorization). 

Compare this approach with the first and second ones. Did your model improve? 
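
One way to try this experiment is to zero out everything except the k highest scores in each text vector. The helper `keep_top_k` and the example row are hypothetical, and ties are broken arbitrarily by the sort order.

```python
import numpy as np

def keep_top_k(vector, k):
    """Keep the k largest entries of a tf-idf vector, set the rest to zero."""
    vector = np.asarray(vector, dtype=float)
    if k <= 0:
        return np.zeros_like(vector)
    if k >= np.count_nonzero(vector):
        return vector.copy()             # fewer than k words present, keep all
    top_idx = np.argsort(vector)[-k:]    # positions of the k largest scores
    pruned = np.zeros_like(vector)
    pruned[top_idx] = vector[top_idx]
    return pruned

# Hypothetical tf-idf vector of one text
tfidf_row = np.array([0.0, 0.75, 0.1, 0.4, 0.0, 0.9])
pruned_row = keep_top_k(tfidf_row, k=2)
print(pruned_row)
```
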



### Additionally, visualisations 

For now you have a vector for each word from your vocabulary. 
These vectors have length > 18000, so the dimension of your space is more than 18000 - it's impossible to visualise it directly in 2D space. 

So research algorithms which perform dimensionality reduction (t-SNE, PCA). 
Try to visualise the obtained vectors in a vector space using only a subset of the vocabulary (about 100 words); don't plot all of the words. 

At this step you will probably realise that this type of vectorization is not the best way to vectorize words. 

Please analyse the obtained results and explain why the visualisation looks like this. 
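
As a hint for the analysis, the sketch below looks at the geometry of one-hot word vectors with plain NumPy, under the assumption that each of 100 words gets its own axis. It checks the pairwise distances and performs PCA through an SVD of the centred data; all names are illustrative.

```python
import numpy as np

n_words = 100
one_hot = np.eye(n_words)            # row i is the one-hot vector of word i

# Pairwise Euclidean distances between all word vectors
diffs = one_hot[:, None, :] - one_hot[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))
off_diag = dists[~np.eye(n_words, dtype=bool)]
print(off_diag.min(), off_diag.max())   # every pair is equally far apart

# PCA via SVD: share of variance captured by each principal component
centered = one_hot - one_hot.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = singular_values ** 2 / (singular_values ** 2).sum()
print(explained[:2].sum())              # share kept by a 2D projection
```

Every pair of distinct words is at the same distance, the square root of 2, so no word is "closer" to any other, and the first two principal components keep only about 2% of the variance. A 2D plot of such vectors carries almost no information about word meaning.
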