# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [29]:
import numpy as np
import pandas as pd
import nltk

from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

In [40]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [30]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('train.csv').fillna(' ')
test = pd.read_csv('test.csv').fillna(' ')

In [4]:
test.iloc[1250]['comment_text']

"==Categories for discussion nomination of Category:Baycroft School== \n\n :Category:Baycroft School, which you created, has been nominated for discussion. If you would like to participate in the discussion, you are invited to add your comments at the category's entry on the Categories for discussion page. Thank you."

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

For more details, see [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
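
As a rough illustration of the difference between the two representations, the sketch below runs both vectorizers on a tiny made-up corpus (the three sentences are hypothetical and not taken from the competition data). Bag of words stores raw counts, while TF-IDF downweights terms that appear in many documents and, with default settings, L2-normalises each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not from the competition data)
docs = [
    "you are great",
    "you are awful",
    "awful awful comment",
]

# Bag of words: one column per term, each cell holds a raw count
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)
print(bow_matrix.toarray())

# TF-IDF: same columns, but terms found in many documents get a lower
# weight, and each row is L2-normalised by default
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf_matrix.toarray().round(2))
```

In the first document, "great" occurs in only one document while "you" occurs in two, so "great" ends up with the larger TF-IDF weight even though both have a raw count of 1.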

In [33]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [4]:
from string import punctuation  # needed by my_preprocessor

def my_preprocessor(s):
    return s.lower().translate(str.maketrans('', '', punctuation))

In [34]:
# Try different vectorizers, n-gram sizes, stop words, and cutoffs for rare (min_df) and overly frequent (max_df) words
word_vectorizer = CountVectorizer(stop_words="english", analyzer='word', ngram_range=(1, 4), strip_accents='unicode', token_pattern=r'\w{1,}')  # TfidfVectorizer or CountVectorizer
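
To get a feel for the settings mentioned in the comment above (n-gram range, stop words, and the rare-word and frequent-word cutoffs, which scikit-learn exposes as `min_df` and `max_df`), here is a minimal sketch that counts the vocabulary size under each option. The four-sentence corpus and the `vocab_size` helper are made up for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
docs = [
    "the cat sat",
    "the cat ran",
    "the dog ran",
    "the bird flew",
]

def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    return len(CountVectorizer(**kwargs).fit(docs).vocabulary_)

sizes = {
    "default": vocab_size(),                        # unigrams only
    "bigrams": vocab_size(ngram_range=(1, 2)),      # unigrams + bigrams
    "min_df=2": vocab_size(min_df=2),               # drop terms seen in fewer than 2 documents
    "max_df=0.75": vocab_size(max_df=0.75),         # drop terms seen in more than 75% of documents
    "stop_words": vocab_size(stop_words="english"), # drop common English words such as "the"
}
for name, size in sizes.items():
    print(name, size)
```

On the real data the same trade-off applies at scale: wider n-gram ranges grow the vocabulary (and fitting time) quickly, while `min_df`, `max_df`, and `max_features` keep it in check.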

In [35]:
# stop_words is only applied when analyzer='word', so it is omitted here
char_vectorizer = CountVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(1, 5), max_features=40000)

In [37]:
%%time
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

CPU times: user 3min 59s, sys: 20.9 s, total: 4min 20s
Wall time: 4min 33s


In [38]:
%%time
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

CPU times: user 12min 11s, sys: 47.9 s, total: 12min 58s
Wall time: 13min 28s


In [39]:
# One transformer per feature matrix: the learned IDF weights are tied to a single vocabulary
word_tfidf_trans = TfidfTransformer(sublinear_tf=True)
char_tfidf_trans = TfidfTransformer(sublinear_tf=True)

In [40]:
word_tfidf_trans.fit(train_word_features)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=True, use_idf=True)

In [42]:
char_tfidf_trans.fit(train_char_features)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=True, use_idf=True)

In [41]:
train_word_features = word_tfidf_trans.transform(train_word_features)

In [43]:
train_char_features = char_tfidf_trans.transform(train_char_features)

In [14]:
test_word_features = word_tfidf_trans.transform(test_word_features)
test_char_features = char_tfidf_trans.transform(test_char_features)

# Task 1 - the most frequent term

In [51]:
word_vectorizer = CountVectorizer(stop_words="english")

In [52]:
all_mx = word_vectorizer.fit_transform(all_text)

In [55]:
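# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
frequency = list(zip(word_vectorizer.get_feature_names_out(), np.asarray(all_mx.sum(axis=0)).ravel()))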

In [56]:
frequency = sorted(frequency, key=lambda x: -x[1])

In [59]:
frequency[0]

('article', 105723)

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [44]:
log_clf = LogisticRegression(C=2, class_weight='balanced')  # Try different parameters and find the best ones with cross-validation

In [91]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [92]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words="english")),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression(C=2))
])

In [10]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (5, 10)
}

In [8]:
train_pred = train['toxic']

In [12]:
gs_clf = GridSearchCV(text_clf, parameters, scoring='roc_auc')

We will train one classifier per class.

To validate model quality, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

In [48]:
def get_score(clf):
    scores = []

    for class_name in class_names:
        train_target = train[class_name]

        # cross_val_score clones and fits clf on each fold, so no separate fit is needed here
        cv_score = np.mean(cross_val_score(clf, train_features, train_target, cv=3, scoring='roc_auc'))

        print('CV score for class {} is {:.4f}'.format(class_name, cv_score))
        scores.append(cv_score)

    print('Total score is {:.4f}'.format(np.mean(scores)))

In [47]:
train_features = hstack([train_char_features, train_word_features])

In [49]:
get_score(log_clf)

CV score for class toxic is 0.9785
CV score for class severe_toxic is 0.9893
CV score for class obscene is 0.9909
CV score for class threat is 0.9900
CV score for class insult is 0.9834
CV score for class identity_hate is 0.9837
Total score is 0.9860


Try to find the best parameters for `word_vectorizer` and `classifier` by optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
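
One possible way to run such a search is `GridSearchCV` over a vectorizer-plus-classifier pipeline, scored with ROC AUC. The sketch below is only an illustration under assumptions: the twelve toy comments, their labels, and the small parameter grid are invented, and `TfidfVectorizer` stands in for the `CountVectorizer` + `TfidfTransformer` pair. On the real data you would pass `train['comment_text']` and one label column at a time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 6 "toxic" and 6 "clean" comments
texts = [
    "you are an idiot", "what a stupid idea", "shut up you fool",
    "idiot go away", "stupid fool", "you fool nobody likes you",
    "thanks for the edit", "great article well done", "please add a source",
    "nice work on the page", "I agree with this change", "thank you for helping",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [1, 10],
}

# 3-fold stratified cross-validation, scored by ROC AUC
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the validation folds do not influence the vocabulary or the IDF weights.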


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [15]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [18]:
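for class_name in class_names:
    train_target = train[class_name]
    # Fit one model per label on the word-level features
    log_clf.fit(train_word_features, train_target)
    # Column 1 of predict_proba is the probability of the positive class
    submission[class_name] = log_clf.predict_proba(test_word_features)[:, 1]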

In [19]:
submission.to_csv('submission.csv', index=False)