# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training, you need to answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd
import re

from string import punctuation
from wordcloud import STOPWORDS
from collections import Counter

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('../../data/Toxis comments/train.csv').fillna(' ')
test = pd.read_csv('../../data/Toxis comments/test.csv').fillna(' ')

In [3]:
train.shape

(159571, 8)

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

You can read more about them [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
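
As a minimal sketch of the difference between the two, the example below vectorizes a tiny made-up corpus. `toy_corpus` and the other variable names are hypothetical and are not part of the competition data. `CountVectorizer` returns raw term counts, while `TfidfVectorizer` (with its default settings) down-weights terms that occur in many documents and normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_corpus = [
    "you are great",
    "you are awful",
    "great article thanks",
]

# Bag of words: each column is a term, each value is a raw count
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_corpus)

# TF-IDF: the same columns, but rare terms get larger weights
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(toy_corpus)

# Recover the column order from the fitted vocabulary (term -> column index)
vocabulary = sorted(count_vectorizer.vocabulary_,
                    key=count_vectorizer.vocabulary_.get)

print(vocabulary)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

In the second document, "awful" appears in only one document and "you" in two, so "awful" should receive the larger TF-IDF weight even though both have a raw count of 1.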

In [4]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [5]:
# Keep only the 1000 most frequent terms (max_features=1000);
# a larger vocabulary can cause a MemoryError when counting the most common words
word_vectorizer = TfidfVectorizer(analyzer='word',
                                  max_features=1000)

In [6]:
all_word_features = word_vectorizer.fit_transform(all_text)
x_train = word_vectorizer.transform(train_text)
x_test = word_vectorizer.transform(test_text)

# Which words are the most common in the dataset?
The exact ranking depends on which vectorizer or counting method is used, but the most common word is the same in every case: 'the'.

### First method:

In [7]:
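# Sum the TF-IDF weights per term directly on the sparse matrix
# (avoids building a dense copy of the whole matrix in memory)
sums = all_word_features.sum(axis=0)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from scikit-learn 1.0
d = list(zip(word_vectorizer.get_feature_names_out().tolist(), sums.tolist()[0]))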
words_sorted = sorted(d, key=lambda x: x[1], reverse=True)

In [8]:
words_sorted[:5]

[('the', 35525.038806166674),
 ('to', 23633.798627869706),
 ('you', 22468.364330022436),
 ('of', 19213.909587014077),
 ('and', 18905.674429729585)]

### Second method:

In [9]:
def get_words(text):
    """return list of the words"""
    pattern = r'[a-z]+'
    words = re.findall(pattern, text.lower())
    
    return words

In [10]:
words_lists = all_text.apply(get_words)

In [11]:
all_words = dict()

for item in words_lists:
    
    for word in list(item):
        if word in all_words:
            all_words[word] += 1
        else:
            all_words[word] = 1

In [12]:
sorted(all_words.items(), key=lambda x: x[1], reverse=True)[:5]

[('the', 919075), ('to', 539244), ('i', 434390), ('a', 412567), ('of', 410841)]

### Third method:

In [13]:
result = Counter(all_words)
result.most_common(5)

[('the', 919075), ('to', 539244), ('i', 434390), ('a', 412567), ('of', 410841)]

# Logistic regression

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

We will train one classifier per class.

To validate the quality of the model we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.
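
The per-class scheme can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `X_toy`, `y_toy`, the label names, and `cv=3` are hypothetical and unrelated to the competition data. Each label gets its own `LogisticRegression`, and `cross_val_score` with `scoring='roc_auc'` returns one score per fold, which is then averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples = 300

# Hypothetical dense features standing in for the TF-IDF matrix
X_toy = rng.normal(size=(n_samples, 5))

# Hypothetical binary labels: each depends on one feature plus noise
y_toy = {
    'label_a': (X_toy[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int),
    'label_b': (X_toy[:, 1] + 0.5 * rng.normal(size=n_samples) > 0).astype(int),
}

toy_scores = {}
for label, y in y_toy.items():
    # One independent binary classifier per label
    clf = LogisticRegression(C=1.0)
    fold_scores = cross_val_score(clf, X_toy, y, cv=3, scoring='roc_auc')
    toy_scores[label] = fold_scores.mean()

print(toy_scores)
```

Averaging the per-label scores gives a single number comparable to the competition metric, which is the mean column-wise ROC AUC.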

### Add some features to the train and test datasets
*These features do not improve our predictions.*

In [14]:
add_more_features = False

In [15]:
if add_more_features:
    # TRAIN

    train['count_word']=train["comment_text"].apply(lambda x: len(str(x).split()))
    #Unique word count
    train['count_unique_word']=train["comment_text"].apply(lambda x: len(set(str(x).split())))
    #punctuation count
    train["count_punctuations"] =train["comment_text"].apply(lambda x: len([c for c in str(x) if c in punctuation]))
    #upper case words count
    train["count_words_upper"] = train["comment_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    #Number of stopwords
    train["count_stopwords"] = train["comment_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
    #Average length of the words
    train["mean_word_len"] = train["comment_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))


    #TEST

    test['count_word']=test["comment_text"].apply(lambda x: len(str(x).split()))
    #Unique word count
    test['count_unique_word']=test["comment_text"].apply(lambda x: len(set(str(x).split())))
    #punctuation count
    test["count_punctuations"] =test["comment_text"].apply(lambda x: len([c for c in str(x) if c in punctuation]))
    #upper case words count
    test["count_words_upper"] = test["comment_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    #Number of stopwords
    test["count_stopwords"] = test["comment_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
    #Average length of the words
    test["mean_word_len"] = test["comment_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

In [16]:
if add_more_features:
    # create sparse CSR matrices from the six added feature columns
    features_train = csr_matrix(train[train.columns[-6:]].fillna(0))
    features_test = csr_matrix(test[test.columns[-6:]].fillna(0))

    # concatenate with the train/test word features (keep CSR format so rows can be indexed)
    x_train = hstack([x_train, features_train], format='csr')
    x_test = hstack([x_test, features_test], format='csr')

In [2]:
scores = []  # (best C value, its CV score) for each class_name
c_values = [0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 7.5, 10.0]


for class_name in class_names:
    
    c_list = []
    print('CLASS NAME: {}'.format(class_name))
    
    for c in c_values:
        classifier = LogisticRegression(C=c, random_state=32)
        
        y_train = train[class_name]
        cv_score = np.mean(cross_val_score(classifier, x_train, y_train, scoring='roc_auc'))
    
        print('CV score for c = {} is {}'.format(c, cv_score))
        
        c_list.append((c, cv_score))
        
    scores.append(max(c_list, key=lambda x: x[1]))
    
    print('-' * 20)

sc = [item[1] for item in scores]
print('Total score is {}'.format(np.mean(sc)))


NameError: name 'target_features' is not defined

In [18]:
scores

[]

## Photos of the previous results
### 1
![Photos of the previous results](http://localhost:8888/notebooks/pictures/Toxic_comments1.jpg)
![Photos of the previous results](http://localhost:8888/notebooks/pictures/Toxic_comments2.jpg)
### 2
![Photos of the previous results](http://localhost:8888/notebooks/pictures/Toxic_comments3.jpg)

Try to find the best parameters for `word_vectorizer` and `classifier` by optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
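
One possible way to search vectorizer and classifier parameters together is a `Pipeline` wrapped in `GridSearchCV`. The sketch below is not the notebook's method: the texts, labels, parameter grid, and `cv=2` are hypothetical and chosen only so that the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = toxic, 0 = not toxic
toy_texts = [
    "you are an idiot",
    "shut up you stupid fool",
    "what a stupid idiot",
    "you fool nobody likes you",
    "thanks for the helpful edit",
    "great article well written",
    "please add a source for this",
    "thanks this is well sourced",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word')),
    ('clf', LogisticRegression()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=2)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training part of every fold, so the validation score does not depend on vocabulary learned from the held-out texts.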


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [19]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [20]:
c_values = [item[0] for item in scores]

for class_name, c in zip(class_names, c_values):
    
    classifier = LogisticRegression(C=c, random_state=32)
    
    y_train = train[class_name]
    classifier.fit(x_train, y_train)
    
    submission[class_name] = classifier.predict_proba(x_test)[:, 1]    

In [21]:
submission.head()

|   | id |
|---|---|
| 0 | 00001cee341fdb12 |
| 1 | 0000247867823ef7 |
| 2 | 00013b17ad220c46 |
| 3 | 00017563c3f7919a |
| 4 | 00017695ad8997eb |


In [22]:
submission.to_csv('submission.csv', index=False)