# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, you need to answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
from wordcloud import STOPWORDS

In [2]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

In [3]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train = pd.read_csv('../data/train.csv').fillna(' ')
test = pd.read_csv('../data/test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

For more details, see [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
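To make the difference between the two representations concrete, here is a minimal sketch on a toy corpus. The corpus and the variable names are assumptions made up for illustration; they are not part of the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the competition data.
toy_corpus = [
    "you are nice",
    "you are rude",
    "rude rude comment",
]

# Bag of words: each column is a vocabulary term, each cell a raw count.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_corpus)
vocabulary = sorted(count_vectorizer.vocabulary_)  # terms in column order
print(vocabulary)
print(counts.toarray())

# TF-IDF: the same columns, but terms that occur in many documents
# are down-weighted, and each row is L2-normalised by default.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(toy_corpus)
print(tfidf.toarray().round(2))
```

In the first document, "nice" occurs in only one document while "you" occurs in two, so TF-IDF gives "nice" the larger weight even though both have a raw count of 1.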

In [4]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

# With max_features=1 the vocabulary keeps only the single most frequent term in the corpus.
word_vectorizer = CountVectorizer(ngram_range=(1,1), max_df=1.0, vocabulary=None, max_features=1, analyzer='word')
word_vectorizer.fit(all_text)
word = list(word_vectorizer.vocabulary_.keys())[0]
printmd('The most frequent word in the combined dataset is: **_{}_**'.format(word))

Try different vectorizers, different n-gram sizes, stop words, pruning of rare words (`min_df`), and pruning of overly frequent words (`max_df`).
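As a rough illustration of what each of these options does to the vocabulary, the sketch below fits `CountVectorizer` on a toy corpus with one option changed at a time. The corpus, the helper `vocabulary_of`, and the chosen thresholds are assumptions for demonstration only; the same parameters exist on `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to make the effect of each option visible.
toy_corpus = [
    "the comment is rude",
    "the comment is very rude",
    "the reply is polite",
    "the reply is not rude",
]

def vocabulary_of(**vectorizer_options):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    vectorizer = CountVectorizer(**vectorizer_options)
    vectorizer.fit(toy_corpus)
    return sorted(vectorizer.vocabulary_)

baseline = vocabulary_of()
with_bigrams = vocabulary_of(ngram_range=(1, 2))          # unigrams and bigrams
without_stop_words = vocabulary_of(stop_words=["the", "is"])
without_rare = vocabulary_of(min_df=2)                    # keep terms found in >= 2 documents
without_frequent = vocabulary_of(max_df=0.75)             # drop terms found in > 75% of documents

for name, vocab in [
    ("baseline", baseline),
    ("with_bigrams", with_bigrams),
    ("without_stop_words", without_stop_words),
    ("without_rare", without_rare),
    ("without_frequent", without_frequent),
]:
    print(name, len(vocab), vocab)
```

On the real data the interesting question is how each setting changes the cross-validated ROC AUC, not just the vocabulary size.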

In [5]:
%%time
word_vectorizer_tfidf = TfidfVectorizer( 
    analyzer='word', 
    ngram_range=(1, 1),
    vocabulary=None, 
    stop_words=list(STOPWORDS),  # recent scikit-learn versions require a list, not a set
    max_features=30000, 
    smooth_idf=True, 
    sublinear_tf=True,
    min_df=3, 
    norm='l2',
    strip_accents='unicode')

word_vectorizer_tfidf.fit(all_text)
train_word_features_tfidf = word_vectorizer_tfidf.transform(train_text)
test_word_features_tfidf = word_vectorizer_tfidf.transform(test_text)

char_vectorizer_tfidf = TfidfVectorizer( 
    sublinear_tf=True,
    analyzer='char_wb', 
    ngram_range=(3, 5),
    max_features=30000, 
    smooth_idf=False,  
    norm='l2',
    strip_accents='unicode')

char_vectorizer_tfidf.fit(all_text)
train_char_features_tfidf = char_vectorizer_tfidf.transform(train_text)
test_char_features_tfidf = char_vectorizer_tfidf.transform(test_text)

CPU times: user 29.2 s, sys: 280 ms, total: 29.5 s
Wall time: 29.6 s


In [7]:
%%time
train_features = hstack([train_word_features_tfidf, train_char_features_tfidf])
test_features = hstack([test_word_features_tfidf, test_char_features_tfidf])

CPU times: user 1.92 s, sys: 1.51 s, total: 3.43 s
Wall time: 3.42 s
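The cell above joins the word-level and character-level feature matrices column-wise. A minimal sketch of what `scipy.sparse.hstack` does, using two small made-up blocks (the matrices and names here are hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical feature blocks for 2 documents: 3 word features and 2 char features.
word_block = csr_matrix(np.array([[1.0, 0.0, 2.0],
                                  [0.0, 3.0, 0.0]]))
char_block = csr_matrix(np.array([[0.5, 0.0],
                                  [0.0, 0.7]]))

# hstack concatenates columns, so both blocks must have the same number of rows.
# Converting to CSR afterwards is optional; it makes row slicing efficient.
combined = hstack([word_block, char_block]).tocsr()
print(combined.shape)
print(combined.toarray())
```

Each document keeps one row, and its feature vector is simply the word features followed by the character features.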


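def max_roc_auc():
    """For each class, raise C from `start` in increments of `step` until the
    cross-validated ROC AUC stops improving (at most 30 improvements)."""
    best_cs = []
    best_scores = []
    start = 0.3
    step = 0.1
    for class_name in class_names:
        best_score = 0
        best_c = 0
        result = cycle(start, class_name)  # [cv_score, c]
        i = 0
        while (i < 30) and (result[0] > best_score):
            best_score, best_c = result
            i += 1
            # Next candidate is start + step * i, so no value of C is skipped.
            result = cycle(start + step * i, class_name)
        best_cs.append(best_c)
        best_scores.append(best_score)
    return best_cs, best_scores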

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [8]:
def cycle(c, feature):
    """Cross-validate a logistic regression with regularization parameter C=c
    on the target column `feature`; return [mean ROC AUC, c]."""
    train_target = train[feature]
    classifier = LogisticRegression(C=c, solver='sag', n_jobs=-1)
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    print('CV score for class {} is {} - {}'.format(feature, cv_score, c))
    return [cv_score, c]

We will train one classifier per class.
To validate the quality of the model, we use the function [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html).
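The one-classifier-per-class scheme can be sketched on synthetic data as follows. Everything here (the random features, the three label names, `C=1.0`, `cv=3`) is an assumption chosen so the example runs in isolation; it is not the notebook's actual data or settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features: 300 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

# Three hypothetical binary labels; each depends on one feature plus noise.
noise = rng.normal(scale=0.5, size=(300, 3))
Y = (X[:, :3] + noise > 0).astype(int)
toy_class_names = ["label_a", "label_b", "label_c"]

toy_scores = {}
for column, name in enumerate(toy_class_names):
    # One independent binary classifier per label column.
    classifier = LogisticRegression(C=1.0, max_iter=1000)
    fold_scores = cross_val_score(classifier, X, Y[:, column], cv=3, scoring="roc_auc")
    toy_scores[name] = float(np.mean(fold_scores))
    print("CV ROC AUC for {}: {:.3f}".format(name, toy_scores[name]))

# The overall score is the mean of the per-label scores.
mean_score = float(np.mean(list(toy_scores.values())))
print("Mean ROC AUC: {:.3f}".format(mean_score))
```

Averaging the per-class ROC AUC values mirrors how the notebook reports its "Total score".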

%%time
c_wv, scores = max_roc_auc()
print('Total score is {}'.format(np.mean(scores)))
c_wv

In [None]:
scores = []
c_wv = [1.3, 0.5, 1.2, 1.5, 0.9, 1]
for i, j in zip(class_names, c_wv):
    scores.append(cycle(j, i)[0])
print('Total score is {}'.format(np.mean(scores)))

Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
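One possible way to organise such a search is a `Pipeline` combined with `GridSearchCV`, sketched below on a tiny made-up dataset. The texts, labels, and parameter grid are hypothetical. Note also that, unlike the cells above, this sketch refits the vectorizer inside each cross-validation fold instead of fitting it once on all text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 marks a "toxic" comment.
texts = [
    "you are an idiot", "what a stupid idea", "shut up you fool",
    "stupid idiot fool", "you fool this is stupid", "idiot shut up",
    "thank you for the help", "what a great idea", "nice work on this",
    "great help thank you", "this is nice work", "good idea thank you",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every combination is scored by cross-validated ROC AUC.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

On the full dataset a grid like this can be slow, so it may be more practical to tune one class at a time or to use a smaller grid.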



---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [None]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [None]:
%%time
for c, class_name in zip(c_wv, class_names):
    classifier = LogisticRegression(C=c, solver='sag', n_jobs=-1)
    train_target = train[class_name]
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]

In [None]:
submission.to_csv('submission.csv', index=False)