# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your predictions to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer the ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion
from scipy.sparse import hstack

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('data/train.csv').fillna(' ')
test = pd.read_csv('data/test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

For more detail on both, see [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
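To make the difference between the two representations concrete, here is a minimal sketch on a made-up three-document corpus (the documents are hypothetical and are not taken from the competition data). It uses both vectorizers with their default settings, so the numbers are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the competition data.
docs = [
    "you are great",
    "you are awful",
    "great article",
]

# Bag of words: each column is a token, each cell is a raw count.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: the same columns, but tokens that occur in fewer documents
# get a larger weight, and each row is L2-normalized by default.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Recover the column order from the fitted vocabulary (token -> column index).
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

In the second document, "awful" appears in only one document of the corpus while "you" appears in two, so "awful" receives the larger TF-IDF weight even though both have a raw count of 1.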

In [3]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [4]:
wv = CountVectorizer()
t = wv.fit_transform(all_text)

In [5]:
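# Total number of occurrences of each token across the whole corpus
t2 = t.sum(axis=0)
t2 = t2.tolist()[0]

# Pair each count with its token and show the five most frequent tokens.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tz = zip(t2, wv.get_feature_names_out().tolist())
sorted(tz, reverse=True)[:5]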

[(918456, 'the'),
 (538918, 'to'),
 (409932, 'of'),
 (408809, 'and'),
 (393819, 'you')]

In [6]:
wv = CountVectorizer(stop_words='english')
t = wv.fit_transform(all_text)

In [7]:
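# Same count as above, now with English stop words removed
t2 = t.sum(axis=0)
t2 = t2.tolist()[0]

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tz = zip(t2, wv.get_feature_names_out().tolist())
sorted(tz, reverse=True)[:5]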

[(105723, 'article'),
 (83699, 'wikipedia'),
 (79164, 'page'),
 (53805, 'talk'),
 (52864, 'like')]

In [8]:
# Try different vectorizers, n-gram sizes, stop words, and cutoffs for
# rare words (min_df) and overly frequent words (max_df)
word_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  analyzer='word',
                                  token_pattern=r'\w{1,}',
                                  ngram_range=(1,1),
                                  max_features=30000)

# Optional extra argument to try: stop_words='english'

In [9]:
word_vectorizer.fit(all_text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=30000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents='unicode', sublinear_tf=True,
        token_pattern='\\w{1,}', tokenizer=None, use_idf=True,
        vocabulary=None)

In [10]:
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

In [11]:
# char_vectorizer = TfidfVectorizer(
#     sublinear_tf=True,
#     strip_accents='unicode',
#     analyzer='char_wb',
#     ngram_range=(1,1),
#     max_features=50000)

In [12]:
# char_vectorizer.fit(all_text)
# train_char_features = char_vectorizer.transform(train_text)
# test_char_features = char_vectorizer.transform(test_text)

In [13]:
# train_features = hstack([train_char_features, train_word_features])
# test_features = hstack([test_char_features, test_word_features])

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [53]:
classifier = LogisticRegression(C=0.9, class_weight='balanced') # 2.1

We will train one classifier per class.

To validate the quality of the model, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

In [54]:
scores = []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.972569101586317
CV score for class severe_toxic is 0.9855397599389809
CV score for class obscene is 0.985722221409507
CV score for class threat is 0.9880001843291139
CV score for class insult is 0.9793803969402547
CV score for class identity_hate is 0.9758994266733962
Total score is 0.9811851818129282


Try to find better parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
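One possible way to organize such a search is sketched below. It is an assumption about how to approach the task, not part of the original assignment: the vectorizer and the classifier are wrapped in a `Pipeline` and passed to `GridSearchCV` with `scoring='roc_auc'`. The `texts` and `labels` variables are hypothetical toy data standing in for `train['comment_text']` and a single label column, and the grid values are arbitrary starting points.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for train['comment_text'] and one label column.
texts = [
    "thanks for the helpful edit",
    "you are a stupid idiot",
    "nice article, well sourced",
    "shut up you moron",
    "please see the talk page",
    "what an idiot, go away",
] * 5
labels = [0, 1, 0, 1, 0, 1] * 5

# Vectorizer and classifier in one estimator, so the vectorizer is
# refit on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid only; the values are assumptions, not recommendations.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Note that this sketch fits the vectorizer inside each fold, whereas the notebook fits `word_vectorizer` once on `all_text` (train and test together), so scores from the two setups are not directly comparable. On the real data the search would have to be repeated for each of the six classes, or the per-class scores averaged.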


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [16]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [17]:
for class_name in class_names:
    train_target = train[class_name]
    classifier.fit(train_word_features, train_target)
    submission[class_name] = classifier.predict_proba(test_word_features)[:, 1]
    print(class_name)

toxic
severe_toxic
obscene
threat
insult
identity_hate


In [18]:
submission.head()

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.999015 | 0.156693 | 0.995919 | 0.047108 | 0.942477 | 0.191801 |
| 1 | 0000247867823ef7 | 0.002442 | 0.000954 | 0.001372 | 0.000232 | 0.002601 | 0.001832 |
| 2 | 00013b17ad220c46 | 0.02082 | 0.003083 | 0.010181 | 0.000829 | 0.011003 | 0.003214 |
| 3 | 00017563c3f7919a | 0.002783 | 0.002249 | 0.002676 | 0.001245 | 0.003816 | 0.000705 |
| 4 | 00017695ad8997eb | 0.020028 | 0.003529 | 0.006074 | 0.002031 | 0.005411 | 0.00199 |


In [19]:
submission.to_csv('submission.csv', index=False)

###### Kaggle score: 0.9759
	 