# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training, answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [4]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline

In [2]:
# `%install_ext` is not available in current IPython, so install the extension with pip instead
%pip install ipython-autotime
%load_ext autotime


In [5]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('../data/train.csv').fillna(' ')
test = pd.read_csv('../data/test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

For more details on both, see [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
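
To make the difference between the two representations concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_corpus` is an illustrative assumption, not part of the homework data). It prints the raw counts next to the TF-IDF weights, both computed with default vectorizer settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_corpus = [
    "you are great",
    "you are awful",
    "great great day",
]

# Bag of words: each column is a term, each cell is a raw count
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus)

# TF-IDF: counts are reweighted so that terms found in fewer documents weigh more,
# and each row is L2-normalized (the scikit-learn default)
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_corpus)

# Recover the column order from the fitted vocabulary (term -> column index)
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

In the second sentence, "awful" occurs in only one document, so it receives a larger TF-IDF weight than "you", which occurs in two, even though both have a raw count of 1.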

In [6]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

# Homework questions

## The most popular word
The most frequent word is 'the'. To do: recompute with TfidfVectorizer.

In [5]:
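vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix in one pass
vector = vectorizer.fit_transform(all_text)
# scikit-learn >= 1.0; on older versions use get_feature_names() (removed in 1.2)
features = vectorizer.get_feature_names_out()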

In [6]:
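# Total count of every term over all documents (a 1 x n_features matrix)
smv = vector.sum(axis=0)
# Term with the largest total count
features[smv.argmax()]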

'the'
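
One way to redo this ranking with TF-IDF is to sum each term's weight over all documents and take the largest total. The sketch below does that on a made-up corpus (`toy_corpus` is an assumption for illustration). It is one possible reading of "recompute with TfidfVectorizer", and it says nothing about what the result would be on the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_corpus = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_corpus)

# Sum the TF-IDF weight of each term over all documents
total = np.asarray(weights.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; invert it to look terms up by column
index_to_term = {index: term for term, index in tfidf_vec.vocabulary_.items()}
top_term = index_to_term[int(total.argmax())]

print(top_term)
```

Passing `stop_words='english'` to the vectorizer would drop function words such as "the" before ranking, which may be closer to what the question intends.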

## Choose optimal vectorizer with params

In [35]:
%%time
word_vectorizer = CountVectorizer(ngram_range=(1,2))
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

CPU times: user 1min 35s, sys: 3.4 s, total: 1min 38s
Wall time: 1min 39s


In [7]:
%%time
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    min_df=5,
    stop_words='english',
    binary=True,
    max_features=32500)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

CPU times: user 32.2 s, sys: 416 ms, total: 32.6 s
Wall time: 32.9 s


In [23]:
%%time
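word_vectorizer = HashingVectorizer()
# HashingVectorizer is stateless, so fit() learns nothing; it is kept for symmetry with the cells above
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)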

CPU times: user 13.6 s, sys: 249 ms, total: 13.8 s
Wall time: 13.9 s
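
The timings above compare only speed. To compare the vectorizers on quality, each one can be wrapped in a pipeline with the same classifier and scored by cross-validated ROC AUC. The following is a minimal sketch on made-up data. The texts, the labels, the `candidates` dictionary and the chosen parameters are all illustrative assumptions, and scores on twelve sentences say nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 "toxic" and 6 "clean" comments
texts = [
    "you are an idiot", "what a stupid idea", "shut up you fool",
    "idiot and a fool", "such a stupid fool", "you stupid idiot",
    "thank you for the help", "what a lovely day", "great idea thanks",
    "have a nice day", "thanks for the great help", "lovely and nice idea",
]
labels = np.array([1] * 6 + [0] * 6)

# Candidate vectorizers; the parameters are illustrative, not tuned
candidates = {
    'count': CountVectorizer(),
    'tfidf': TfidfVectorizer(sublinear_tf=True),
    'hashing': HashingVectorizer(n_features=2 ** 10),
}

results = {}
for name, vectorizer in candidates.items():
    # The vectorizer is refit inside every CV fold together with the classifier
    pipeline = make_pipeline(vectorizer, LogisticRegression())
    fold_scores = cross_val_score(pipeline, texts, labels, cv=3, scoring='roc_auc')
    results[name] = fold_scores.mean()
    print('{}: mean ROC AUC = {:.3f}'.format(name, results[name]))
```

Putting the vectorizer inside the pipeline means its vocabulary and IDF weights are learned only from the training part of each fold. This differs from the cells above, which fit the vectorizer on `all_text` once.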


## Choose optimal regression params

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [70]:
?LogisticRegression

We will train one classifier per class.

To validate the quality of the model, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

{'identity_hate': {'tol': 0.0001},
 'insult': {'tol': 1e-05},
 'obscene': {'tol': 1e-05},
 'severe_toxic': {'tol': 0.0001},
 'threat': {'tol': 1e-05},
 'toxic': {'tol': 1e-05}}


{'identity_hate': {'intercept_scaling': 19.2},
 'insult': {'intercept_scaling': 18.6},
 'obscene': {'intercept_scaling': 18.9},
 'severe_toxic': {'intercept_scaling': 19.6},
 'threat': {'intercept_scaling': 3.9},
 'toxic': {'intercept_scaling': 19.2}}
 
{'identity_hate': {'C': 1.5},
 'insult': {'C': 1.5},
 'obscene': {'C': 1.8},
 'severe_toxic': {'C': 0.7},
 'threat': {'C': 2.3},
 'toxic': {'C': 2.1}}


0.9801799060936968

0.9801942996120685

0.9801953784313565

0.9801961844417346

0.9801958807718631

In [9]:
classifier = LogisticRegression()

In [81]:
params = {
    'C': [0.7, 1.5, 1.8, 2.1, 2.3],
    'tol': [0.0001, 0.00001],
    'intercept_scaling': [3.9, 18.6, 18.9, 19.2, 19.6]
#     'C': [0.65, 0.7, 0.75, 1.4, 1.5, 1.6, 1.7, 1.8, 2.1, 2.2, 2.4, 2.5],
#     'C': [0.5, 1.0, 1.5, 2.0],
#     'C': np.arange(.1, 2.5, 0.1)
#     'C': list()
#     'tol': [0.00001, 0.0001],
#     'intercept_scaling': list(np.arange(3.5, 4.5, 0.1)) + list(np.arange(17.2, 20.0, 0.1))
}

grid = GridSearchCV(LogisticRegression(), params, n_jobs=-1, scoring='roc_auc')
best_params = {}
for class_name in class_names:
    train_target = train[class_name]
    grid.fit(train_word_features, train_target)
    best_params[class_name] = grid.best_params_
    print(class_name + " finished")

toxic finished
severe_toxic finished
obscene finished
threat finished
insult finished
identity_hate finished


In [84]:
best_params

{'identity_hate': {'C': 1.5, 'intercept_scaling': 19.2, 'tol': 0.0001},
 'insult': {'C': 1.5, 'intercept_scaling': 19.2, 'tol': 1e-05},
 'obscene': {'C': 1.8, 'intercept_scaling': 18.6, 'tol': 1e-05},
 'severe_toxic': {'C': 0.7, 'intercept_scaling': 19.2, 'tol': 1e-05},
 'threat': {'C': 2.3, 'intercept_scaling': 18.6, 'tol': 0.0001},
 'toxic': {'C': 2.1, 'intercept_scaling': 19.6, 'tol': 0.0001}}

In [83]:
%%time

scores = []

for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(**best_params[class_name])
    cv_score = np.mean(cross_val_score(classifier,
                                       train_word_features,
                                       train_target,
                                       scoring='roc_auc'))

    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9708049252003486
CV score for class severe_toxic is 0.9863405424188875
CV score for class obscene is 0.9862588274321614
CV score for class threat is 0.9844704274520101
CV score for class insult is 0.9773616766393504
CV score for class identity_hate is 0.9759658723927512
Total score is 0.9802003785892515
CPU times: user 14.6 s, sys: 469 ms, total: 15.1 s
Wall time: 15.1 s


Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
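
Vectorizer and classifier parameters can be tuned together by putting both steps into a `Pipeline` and searching over the combined grid with `GridSearchCV`. The sketch below shows the mechanics on made-up data. The texts, the labels, the step names `tfidf` and `clf`, and the grid values are assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 6 "toxic" and 6 "clean" comments
texts = [
    "you are an idiot", "what a stupid idea", "shut up you fool",
    "idiot and a fool", "such a stupid fool", "you stupid idiot",
    "thank you for the help", "what a lovely day", "great idea thanks",
    "have a nice day", "thanks for the great help", "lovely and nice idea",
]
labels = np.array([1] * 6 + [0] * 6)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

# Grid keys follow the '<step name>__<parameter>' convention
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': [0.5, 1.0, 2.0],
}

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3)
search.fit(texts, labels)

print(search.best_params_)
print('best mean ROC AUC: {:.3f}'.format(search.best_score_))
```

On the real data the same search would be run once per entry of `class_names`, with `train_text` and `train[class_name]` in place of the toy inputs. Note that refitting the vectorizer for every grid point costs far more than searching over classifier parameters on precomputed features.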


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [85]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [86]:
%%time
for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(**best_params[class_name])
    classifier.fit(train_word_features, train_target)
    submission[class_name] = classifier.predict_proba(test_word_features)[:, 1]    

CPU times: user 7.61 s, sys: 136 ms, total: 7.75 s
Wall time: 7.83 s


In [87]:
submission.to_csv('submission.csv', index=False)