# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training, you need to answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('../../data/Toxis comments/train.csv').fillna(' ')
test = pd.read_csv('../../data/Toxis comments/test.csv').fillna(' ')

In [3]:
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

You can read more about them [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).

In [4]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [43]:
# Try different vectorizers, n-gram ranges, stop words, and cutoffs for rare and overly frequent words
word_vectorizer = TfidfVectorizer(analyzer='word',
                                  # ngram_range=(1,2),
                                  # vocabulary=None,
                                  # max_features=5000,
                                 ) # TfidfVectorizer or CountVectorizer

In [29]:
all_word_features = word_vectorizer.fit_transform(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

## What words are the most common in the dataset?
### First method:

In [7]:
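# Sum each column of the sparse matrix directly; calling .todense() first
# would materialize the full documents-by-vocabulary matrix in memory.
sums = all_word_features.sum(axis=0)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
d = list(zip(word_vectorizer.get_feature_names_out(), sums.tolist()[0]))
words_sorted = sorted(d, key=lambda x: x[1], reverse=True)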

In [8]:
words_sorted[:5]

[('the', 918456),
 ('to', 538918),
 ('of', 409932),
 ('and', 408809),
 ('you', 393819)]

### Second method:

In [9]:
from collections import Counter
import re

In [10]:
def get_words(text):
    """return list of the words"""
    pattern = r'[a-z]+'
    words = re.findall(pattern, text.lower())
    
    return words

In [11]:
words_lists = all_text.apply(get_words)

In [12]:
all_words = dict()

for item in words_lists:
    
    for word in list(item):
        if word in all_words:
            all_words[word] += 1
        else:
            all_words[word] = 1

In [13]:
sorted(all_words.items(), key=lambda x: x[1], reverse=True)[:5]

[('the', 919075), ('to', 539244), ('i', 434390), ('a', 412567), ('of', 410841)]

### Third method:

In [14]:
result = Counter(all_words)
result.most_common(5)

[('the', 919075), ('to', 539244), ('i', 434390), ('a', 412567), ('of', 410841)]
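
The manual dictionary loop and the `Counter` wrapper can also be merged into one step by letting `Counter` do the counting. The following is a minimal sketch on made-up input; `count_words` and `toy_texts` are hypothetical names, and the regular expression is the same one used in `get_words`.

```python
from collections import Counter
import re


def count_words(texts):
    """Count lowercase a-z word occurrences across an iterable of strings."""
    counter = Counter()
    for text in texts:
        # Counter.update adds one count for every item in the list
        counter.update(re.findall(r'[a-z]+', text.lower()))
    return counter


# Hypothetical toy input, not the competition data
toy_texts = ["The cat and the dog", "the end"]
toy_counter = count_words(toy_texts)
print(toy_counter.most_common(1))  # [('the', 3)]
```

With the notebook's data, `count_words(all_text)` would play the role of `all_words`.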

## Logistic regression

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

We will train one classifier per class.

To validate the quality of the model we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.
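
As a quick illustration of what `cross_val_score` returns, here is a minimal sketch on synthetic data. The `make_classification` features stand in for the TF-IDF matrix and a single label column; `cv=3`, `max_iter=1000`, and all `demo_*` names are assumptions chosen for the example, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and one binary target column
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

demo_classifier = LogisticRegression(C=1.0, max_iter=1000)

# cv=3 requests three folds; with a classifier and an integer cv,
# scikit-learn uses stratified folds. One ROC AUC value is returned per fold.
demo_fold_scores = cross_val_score(demo_classifier, X_demo, y_demo,
                                   cv=3, scoring='roc_auc')

# The per-class "CV score" is the mean over the folds
demo_cv_score = np.mean(demo_fold_scores)
print(demo_fold_scores, demo_cv_score)
```

Repeating this once per label column gives the one-classifier-per-class setup described above.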

In [42]:
scores = []

c_values = [0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 7.5, 10.0]

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for class_name in class_names:
    c_list = []
    train_target = train[class_name]
    print('CLASS NAME: {}'.format(class_name))
    for c in c_values:
        classifier = LogisticRegression(C=c, random_state=32)

        # Mean ROC AUC over the cross-validation folds
        cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))

        print('CV score for c = {} is {}'.format(c, cv_score))
        c_list.append((c, cv_score))
    # Keep the (C, score) pair with the best CV score for this class
    scores.append(max(c_list, key=lambda x: x[1]))
    print('-' * 20)

# Average of the best per-class scores
sc = [item[1] for item in scores]
print('Total score is {}'.format(np.mean(sc)))

CLASS NAME: toxic
CV score for c = 15.0 is 0.9712074727068446
--------------------
CLASS NAME: severe_toxic
CV score for c = 15.0 is 0.9831489787444582
--------------------
CLASS NAME: obscene
CV score for c = 15.0 is 0.9832291290386114
--------------------
CLASS NAME: threat
CV score for c = 15.0 is 0.987027109000404
--------------------
CLASS NAME: insult
CV score for c = 15.0 is 0.9761482352859209
--------------------
CLASS NAME: identity_hate
CV score for c = 15.0 is 0.9724700488426598
--------------------
Total score is 0.9788718289364833


In [31]:
scores

[(10.0, 0.9708742349886449),
 (5.0, 0.9835302944420797),
 (10.0, 0.9830055071311312),
 (10.0, 0.9867756782977338),
 (10.0, 0.9760689019280289),
 (10.0, 0.9723599802257598)]

Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
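
One possible way to tune the vectorizer and the classifier together is to wrap them in a `Pipeline` and search over both with `GridSearchCV`. The sketch below runs on a handful of made-up comments; the texts, labels, grid values, `cv=3`, and all `demo_*` names are assumptions for illustration only, and a real search on the competition data would need a grid chosen with running time in mind.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook these would be train_text and train[class_name]
demo_texts = [
    "you are an idiot", "what a stupid idea", "shut up you fool",
    "you stupid fool", "idiot go away", "such a dumb idiot",
    "thanks for the help", "great edit thank you", "nice work on the article",
    "thank you for the fix", "good point well made", "great article nice work",
]
demo_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refit inside every CV fold, so validation folds
# do not influence the vocabulary or the IDF weights.
demo_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameter names follow the '<step name>__<parameter>' convention
demo_param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2],
    'clf__C': [0.1, 1.0, 10.0],
}

demo_search = GridSearchCV(demo_pipeline, demo_param_grid, scoring='roc_auc', cv=3)
demo_search.fit(demo_texts, demo_labels)
print(demo_search.best_params_, demo_search.best_score_)
```

After fitting, `demo_search.predict_proba(texts)[:, 1]` gives positive-class probabilities from the best pipeline refit on all of the training data.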


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [17]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [32]:
# Best C for each class, in the order of class_names (taken from `scores` above)
c_values = [10.0, 5.0, 10.0, 10.0, 10.0, 10.0]

for class_name, c in zip(class_names, c_values):
    classifier = LogisticRegression(C=c, random_state=32)
    classifier.fit(train_word_features, train[class_name])

    # Probability of the positive class for each test comment
    submission[class_name] = classifier.predict_proba(test_word_features)[:, 1]

In [33]:
submission.head()

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.999188 | 0.130817 | 0.996346 | 0.028062 | 0.935121 | 0.244781 |
| 1 | 0000247867823ef7 | 0.001927 | 0.001057 | 0.00088 | 0.000164 | 0.002757 | 0.001109 |
| 2 | 00013b17ad220c46 | 0.021117 | 0.004374 | 0.009299 | 0.000861 | 0.009835 | 0.002597 |
| 3 | 00017563c3f7919a | 0.001044 | 0.000919 | 0.000781 | 0.000341 | 0.000653 | 0.000336 |
| 4 | 00017695ad8997eb | 0.01296 | 0.003997 | 0.005524 | 0.000832 | 0.005872 | 0.001444 |


In [34]:
submission.to_csv('submission.csv', index=False)