# MIPT Winter School
# Machine Learning Practicum: Basic Track
*Author: [Ilya Zakharkin](https://vk.com/ilyazakharkin) | @izakharkin*

## Sentiment Analysis of Movie Reviews

<img src="https://lh4.googleusercontent.com/proxy/1sG9fGlYWYqxND3cM-lrddATFg1cQHUI3GjtP19mtfn8qiSXcli4PbDxdFUk9z1EfxomhFkng8UNlgePX5tpLKVjC8jO7nF-K1BQJHoXjP9ivlv9AX3hsqvC4BRRmSYK4uP2zvQ5Ap1OH5k">

<img src="https://miro.medium.com/max/3260/1*8XIjunF2z6dmsVlkEuOUaw.png" width=500>

In this notebook we analyze a dataset of movie reviews from *IMDB* and classify each review as positive (1) or negative (0).

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Recap: working with categorical features

* LabelEncoder:

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder

In [3]:
fruits = pd.DataFrame(data={'Фрукт': ['яблоко', 'груша', 'яблоко', 'банан', 'киви']})

In [4]:
fruits

|   | Фрукт  |
|---|--------|
| 0 | яблоко |
| 1 | груша  |
| 2 | яблоко |
| 3 | банан  |
| 4 | киви   |


In [5]:
encoder = LabelEncoder()
result = encoder.fit_transform(fruits.values.reshape(-1,))

In [6]:
fruits['LabelEncoder'] = result
fruits

|   | Фрукт  | LabelEncoder |
|---|--------|--------------|
| 0 | яблоко | 3            |
| 1 | груша  | 1            |
| 2 | яблоко | 3            |
| 3 | банан  | 0            |
| 4 | киви   | 2            |
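To see where these integer codes come from, the mapping can be reproduced by hand. The sketch below is only an illustration, not scikit-learn's implementation: it sorts the unique values and uses each value's position in that order as its code. The helper name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each value to the index of its class in sorted order."""
    classes = sorted(set(values))                      # plays the role of encoder.classes_
    index = {cls: i for i, cls in enumerate(classes)}  # class -> integer code
    return [index[v] for v in values], classes


fruit_names = ['яблоко', 'груша', 'яблоко', 'банан', 'киви']
codes, classes = label_encode(fruit_names)
print(classes)  # unique values in sorted order
print(codes)    # integer code for each original value
```

The codes carry an order (here alphabetical) that has nothing to do with the fruits themselves, which is one reason one-hot encoding is often preferred for linear models.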


* OneHotEncoder:

In [7]:
from sklearn.preprocessing import OneHotEncoder

In [8]:
encoder = OneHotEncoder()
result = encoder.fit_transform(fruits['Фрукт'].values.reshape(-1,1))

In [9]:
result

<5x4 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [10]:
result.todense()

matrix([[0., 0., 0., 1.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.]])
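The same matrix can be built by hand, which makes the encoding explicit. This is a minimal sketch under the assumption that columns follow the sorted unique values; `one_hot_encode` is a hypothetical helper, not part of scikit-learn.

```python
def one_hot_encode(values):
    """Return one row per value with a single 1 in the column of its class."""
    classes = sorted(set(values))                       # column order: sorted unique values
    column = {cls: j for j, cls in enumerate(classes)}
    rows = []
    for v in values:
        row = [0] * len(classes)
        row[column[v]] = 1                              # exactly one "hot" position per row
        rows.append(row)
    return rows, classes


rows, classes = one_hot_encode(['яблоко', 'груша', 'яблоко', 'банан', 'киви'])
for row in rows:
    print(row)
```

Each row has exactly one nonzero entry, which is why the encoder returns a sparse matrix with 5 stored elements for 5 rows.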

In [11]:
fruits

|   | Фрукт  | LabelEncoder |
|---|--------|--------------|
| 0 | яблоко | 3            |
| 1 | груша  | 1            |
| 2 | яблоко | 3            |
| 3 | банан  | 0            |
| 4 | киви   | 2            |


### Downloading the data

In [12]:
import numpy as np
import pandas as pd

import nltk

In [13]:
!pip install nltk



In [14]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/izakharkin/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [15]:
from nltk.corpus import movie_reviews

* Dataset information:

In [16]:
print(movie_reviews.readme())

Sentiment Polarity Dataset Version 2.0
Bo Pang and Lillian Lee

http://www.cs.cornell.edu/people/pabo/movie-review-data/

Distributed with NLTK with permission from the authors.


Introduction

This README v2.0 (June, 2004) for the v2.0 polarity dataset comes from
the URL http://www.cs.cornell.edu/people/pabo/movie-review-data .


What's New -- June, 2004

This dataset represents an enhancement of the review corpus v1.0
described in README v1.1: it contains more reviews, and labels were
created with an improved rating-extraction system.


Citation Info 

This data was first used in Bo Pang and Lillian Lee,
``A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization 
Based on Minimum Cuts'',  Proceedings of the ACL, 2004.

@InProceedings{Pang+Lee:04a,
  author =       {Bo Pang and Lillian Lee},
  title =        {A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts},
  booktitle =    "Proceedings of the ACL",
  year =      

In [92]:
movie_reviews.raw().split('\n')[:10]

['plot : two teen couples go to a church party , drink and then drive . ',
 'they get into an accident . ',
 'one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . ',
 "what's the deal ? ",
 'watch the movie and " sorta " find out . . . ',
 'critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . ',
 "which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . ",
 'they seem to have taken this pretty neat concept , but executed it terribly . ',
 'so what are the problems with the movie ? ',
 "well , its main problem is that it's simply too jumbled . "]

* To make it easy to separate the reviews by class, get the file IDs of each category:

In [93]:
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

In [94]:
negfeats = [movie_reviews.words(fileids=[f]) for f in negids]
posfeats = [movie_reviews.words(fileids=[f]) for f in posids]

In [95]:
negfeats[:5]

[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...],
 ['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...],
 ['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...],
 ['"', 'quest', 'for', 'camelot', '"', 'is', 'warner', ...],
 ['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...]]

In [96]:
posfeats[-5:]

[['wow', '!', 'what', 'a', 'movie', '.', 'it', "'", 's', ...],
 ['richard', 'gere', 'can', 'be', 'a', 'commanding', ...],
 ['glory', '--', 'starring', 'matthew', 'broderick', ...],
 ['steven', 'spielberg', "'", 's', 'second', 'epic', ...],
 ['truman', '(', '"', 'true', '-', 'man', '"', ')', ...]]

* Create a **DataFrame** and the label vector **y**:

In [21]:
sentiments = pd.DataFrame({'sentiment': negfeats + posfeats})  # all sentiments
labels = np.array([0] * len(negfeats) + [1] * len(posfeats))  # 0 - bad, 1 - good

In [22]:
sentiments.head()

|   | sentiment |
|---|-----------|
| 0 | (plot, :, two, teen, couples, go, to, a, churc... |
| 1 | (the, happy, bastard, ', s, quick, movie, revi... |
| 2 | (it, is, movies, like, these, that, make, a, j... |
| 3 | (", quest, for, camelot, ", is, warner, bros, ... |
| 4 | (synopsis, :, a, mentally, unstable, man, unde... |


In [23]:
sentiments.tail()

|      | sentiment |
|------|-----------|
| 1995 | (wow, !, what, a, movie, ., it, ', s, everythi... |
| 1996 | (richard, gere, can, be, a, commanding, actor,... |
| 1997 | (glory, --, starring, matthew, broderick, ,, d... |
| 1998 | (steven, spielberg, ', s, second, epic, film, ... |
| 1999 | (truman, (, ", true, -, man, ", ), burbank, is... |


In [24]:
sentiments.shape

(2000, 1)

In [25]:
len(posfeats) / (len(negfeats) + len(posfeats))

0.5

### Converting the data to a ***Bag-of-Words*** representation:

<img src="https://miro.medium.com/max/2552/1*MeSYCKGDOdwkJKVZKxJuvg.png" width=650>

In [26]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [27]:
cnt_vect = CountVectorizer()

In [28]:
data = [' '.join(sent) for sent in sentiments['sentiment']]

In [29]:
X = cnt_vect.fit_transform(data)

In [30]:
X

<2000x39659 sparse matrix of type '<class 'numpy.int64'>'
	with 666842 stored elements in Compressed Sparse Row format>

In [31]:
y = labels
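What the vectorizer does can be sketched in a few lines of plain Python. The sketch assumes the default `CountVectorizer` behaviour of lowercasing the text and keeping tokens of two or more word characters. The toy corpus and the helper `bag_of_words` are hypothetical, and the result is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w\w+\b')  # tokens of two or more word characters


def bag_of_words(docs):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)                       # term -> occurrences in this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


toy_docs = ['a good , good movie', 'a bad movie']  # hypothetical mini-corpus
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded: each review becomes a vector of term counts, one column per vocabulary entry, which is how 2000 reviews turn into a 2000x39659 matrix.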

### Trying logistic regression as a sentiment classifier

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

In [33]:
clf = LogisticRegression()

In [34]:
vect_logreg = Pipeline([('vectorizer', cnt_vect), ('logreg', clf)])

* Measure quality with the **accuracy score**:

In [35]:
acc_scores = cross_val_score(vect_logreg, data, y, scoring='accuracy')

In [36]:
np.mean(acc_scores)

0.8360216503929078
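`cross_val_score` splits the data into folds, trains on all but one, and scores on the held-out fold. When `cv` is not passed, the number of folds depends on the scikit-learn version (3 before 0.22, 5 from 0.22 on), and for classifiers the folds are stratified by class. The sketch below shows only the basic, unstratified splitting idea; `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous test folds."""
    # The first (n_samples % n_folds) folds get one extra sample.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


for train, test in kfold_indices(n_samples=7, n_folds=3):
    print('train:', train, 'test:', test)
```

Stratification matters for this dataset: the labels are ordered (all negative reviews first), so contiguous unstratified folds like the ones above would contain a single class each.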

* Measure quality with **ROC AUC**:

In [37]:
roc_aucs = cross_val_score(vect_logreg, data, y, scoring='roc_auc')

In [38]:
np.mean(roc_aucs)

0.9107764937833774
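ROC AUC can be read as the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is quadratic in the number of samples and meant only as an illustration, with hypothetical toy labels and scores.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

Unlike accuracy, this metric depends only on the ranking of the scores, not on a decision threshold.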

* Fit the model on *all the data*:

In [39]:
clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

### Top 5 most important words for classification:

In [40]:
idx_arr = np.argsort(np.abs(clf.coef_))[0][::-1][:5]
idx_arr

array([ 2954, 37056, 39195, 14159, 38417])

In [41]:
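# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
np.array(cnt_vect.get_feature_names())[idx_arr]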

array(['bad', 'unfortunately', 'worst', 'fun', 'waste'], dtype='<U58')
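Ranking by absolute coefficient value hides the direction of each word's effect. A small sketch that keeps the sign next to the word is shown below; the coefficients are hypothetical values chosen only to illustrate the ranking, and `top_features` is not a scikit-learn function.

```python
def top_features(names, weights, k=5):
    """Return the k (name, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


# Hypothetical coefficients, not the values learned in this notebook.
toy_names = ['bad', 'fun', 'movie', 'waste']
toy_weights = [-0.9, 0.6, 0.05, -0.7]
for name, weight in top_features(toy_names, toy_weights, k=3):
    direction = 'positive' if weight > 0 else 'negative'
    print(f'{name}: {weight:+.2f} ({direction})')
```

With the fitted model, the same idea would apply to `clf.coef_[0]` and the vectorizer's feature names: a negative weight pushes the prediction toward class 0, a positive one toward class 1.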

---

## Experimenting with models, stop words, and n-grams

* Let's check the quality of the baseline **without stop words or n-grams**:

In [42]:
count_logreg_pipe = Pipeline([('vectorizer', CountVectorizer()), ('logreg', LogisticRegression())])
tfidf_logreg_pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('logreg', LogisticRegression())])

In [43]:
cv_acc_scores_count = cross_val_score(count_logreg_pipe, data, y, cv=5)
cv_acc_scores_tfidf = cross_val_score(tfidf_logreg_pipe, data, y, cv=5)

In [50]:
print(
    'CountVectorizer + LogisticRegression CV accuracy: '
    f'{cv_acc_scores_count.mean():.5f} ({cv_acc_scores_count.std():.5f})'
)

CountVectorizer + LogisticRegression CV accuracy: 0.84100 (0.01678)


In [52]:
print(
    'TfidfVectorizer + LogisticRegression CV accuracy: '
    f'{cv_acc_scores_tfidf.mean():.5f} ({cv_acc_scores_tfidf.std():.5f})'
)

TfidfVectorizer + LogisticRegression CV accuracy: 0.82100 (0.00406)
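The difference between the two vectorizers is the weighting. TF-IDF scales each count by how rare the term is across documents and then normalizes each row. The sketch below assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to the best of my knowledge matches the `TfidfVectorizer` defaults. The toy corpus and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (one dict per document)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_docs = [['good', 'movie'], ['bad', 'movie']]  # hypothetical tokenized reviews
for row in tfidf(toy_docs):
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

The term `movie` occurs in every document and therefore gets a lower weight than `good` or `bad`, even though all three have a raw count of 1.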


* Now vary *min_df* in *CountVectorizer* and see how the quality changes:

In [54]:
param_grid = [10, 20, 30, 40, 50]

for min_df in param_grid:
    count_logreg_pipe = Pipeline([
        ('vectorizer', CountVectorizer(min_df=min_df)), 
        ('logreg', LogisticRegression())
    ])
    cv_acc_scores_count = cross_val_score(count_logreg_pipe, data, y, cv=5)
    print(
        f'CountVectorizer + LogisticRegression CV accuracy with min_df={min_df}: '
        f'{cv_acc_scores_count.mean():.3f} ({cv_acc_scores_count.std():.3f})\n'
    )

CountVectorizer + LogisticRegression CV accuracy with min_df=10: 0.839 (0.012)

CountVectorizer + LogisticRegression CV accuracy with min_df=20: 0.833 (0.017)

CountVectorizer + LogisticRegression CV accuracy with min_df=30: 0.825 (0.014)

CountVectorizer + LogisticRegression CV accuracy with min_df=40: 0.817 (0.009)

CountVectorizer + LogisticRegression CV accuracy with min_df=50: 0.813 (0.013)
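An integer `min_df` drops every term that occurs in fewer than that many documents, so raising it shrinks the vocabulary. A minimal sketch of that filtering on a hypothetical toy corpus (`prune_vocabulary` is an illustrative helper, not a scikit-learn function):

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_df):
    """Keep terms that occur in at least min_df documents."""
    # set(doc) ensures a term is counted once per document, not once per occurrence.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


toy_docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'plot', 'movie']]
for min_df in (1, 2, 3):
    print(min_df, prune_vocabulary(toy_docs, min_df))
```

The results above suggest that on this corpus the rarer words still carry useful signal, since accuracy falls steadily as `min_df` grows from 10 to 50.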



* Try different models:

In [55]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

In [56]:
classifiers = [LogisticRegression(solver='lbfgs'), LinearSVC(), SGDClassifier()]
pipelines = [Pipeline([('vectorizer', CountVectorizer()), ('classifier', clf)]) for clf in classifiers]

In [57]:
for pipeline in pipelines:
    cv_acc_scores = cross_val_score(pipeline, data, y, cv=5)
    print(f'CV accuracy: {cv_acc_scores.mean():.3f} ({cv_acc_scores.std():.3f})\n')

CV accuracy: 0.842 (0.022)

CV accuracy: 0.833 (0.016)

CV accuracy: 0.834 (0.013)
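All three classifiers are linear models; they differ mainly in the loss they minimize and in the optimizer. As far as I know, `LinearSVC` defaults to the squared hinge loss and `SGDClassifier` to the hinge loss, while logistic regression uses the log loss. The sketch below compares these losses as functions of the signed margin `y * f(x)` with `y` in {-1, +1}; regularization is left out.

```python
import math


def logistic_loss(margin):
    """Log loss for a signed margin y * f(x), with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-margin))


def hinge_loss(margin):
    """Hinge loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def squared_hinge_loss(margin):
    """Squared hinge loss: penalizes margin violations quadratically."""
    return max(0.0, 1.0 - margin) ** 2


for margin in (-1.0, 0.0, 1.0, 2.0):
    print(f'margin={margin:+.1f}  log={logistic_loss(margin):.3f}  '
          f'hinge={hinge_loss(margin):.3f}  sq_hinge={squared_hinge_loss(margin):.3f}')
```

The losses behave similarly near the decision boundary, which is consistent with the three accuracies above lying within one standard deviation of each other.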



* Compare the **stop words** from **sklearn** and from **nltk**:

In [58]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/izakharkin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [59]:
stop_words_nltk = nltk.corpus.stopwords.words('english')

In [60]:
stop_words = ['english', stop_words_nltk]

pipelines = [
    Pipeline([
        ('vectorizer', CountVectorizer(stop_words=words)), 
        ('logreg', LogisticRegression())
    ]) 
    for words in stop_words
]

names = ['sklearn', 'nltk']
for i, pipeline in enumerate(pipelines):
    cv_acc_scores = cross_val_score(pipeline, data, y, cv=5)
    print(
        f'CountVectorizer + LogisticRegression + stop-words: {names[i]} CV accuracy: '
        f'{cv_acc_scores.mean():.3f} ({cv_acc_scores.std():.3f})\n'
    )

CountVectorizer + LogisticRegression + stop-words: sklearn CV accuracy: 0.839 (0.010)

CountVectorizer + LogisticRegression + stop-words: nltk CV accuracy: 0.841 (0.010)
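Stop-word removal is just a filter on tokens. The sketch below uses a tiny hypothetical stop list (neither sklearn's nor nltk's) to show the mechanics and one possible pitfall for sentiment analysis.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word collection."""
    stop_set = set(stop_words)          # set membership is O(1) on average
    return [tok for tok in tokens if tok not in stop_set]


toy_stop_words = ['the', 'is', 'a', 'not']   # hypothetical list for illustration
tokens = ['the', 'movie', 'is', 'not', 'a', 'waste']
print(remove_stop_words(tokens, toy_stop_words))
```

If a stop list contains negations such as `not`, filtering can remove exactly the word that flips the sentiment, which may be one reason stop-word removal brings no gain over the baseline here.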



* Finally, try **1-2-word-grams** and **3-5-char-grams**:

In [69]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('logreg', LogisticRegression())
])

In [70]:
cv_acc_scores = cross_val_score(pipeline, data, y, cv=5)
print(
    'CountVectorizer + LogisticRegression + 1-2-word-grams CV accuracy: '
    f'{cv_acc_scores.mean():.3f} ({cv_acc_scores.std():.3f})'
)

CountVectorizer + LogisticRegression + 1-2-word-grams CV accuracy: 0.853 (0.017)
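With `ngram_range=(1, 2)` the vocabulary contains both single words and pairs of adjacent words. A minimal sketch of how such features are generated from a token list (`word_ngrams` is a hypothetical helper):

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


print(word_ngrams(['not', 'very', 'good'], 1, 2))
```

Bigrams such as `not very` keep a little local word order that unigrams lose, at the cost of a much larger vocabulary.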


In [71]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))), 
    ('logreg', LogisticRegression())
])

In [72]:
cv_acc_scores = cross_val_score(pipeline, data, y, cv=5)
print(
    'CountVectorizer + LogisticRegression + 3-5-char-grams CV accuracy: '
    f'{cv_acc_scores.mean():.3f} ({cv_acc_scores.std():.3f})'
)

CountVectorizer + LogisticRegression + 3-5-char-grams CV accuracy: 0.820 (0.011)
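The `char_wb` analyzer builds character n-grams only inside word boundaries, padding each word with a space on both sides. The sketch below is a simplified version of that idea: unlike scikit-learn, it produces nothing for a padded word shorter than `n`. `char_wb_ngrams` is a hypothetical helper.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams built inside each space-padded word (simplified)."""
    grams = []
    for word in text.split():
        padded = f' {word} '                      # pad so that word edges become features
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


print(char_wb_ngrams('bad film', 3, 4))
```

Character n-grams are robust to typos and word forms, but on this corpus they score below the word-level features.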


## Useful links

Deep Learning School at the Phystech School of Applied Mathematics and Informatics (FPMI), MIPT:
* [Official website](https://www.dlschool.org/) | [Github](https://github.com/DLSchool/deep_learning_2018-19) | [YouTube](https://www.youtube.com/channel/UCFTNoZYjkg-3LZTHrHfV1nQ/) | [VK](https://vk.com/dlschool_mipt) | [Telegram]()

Courses:
* [Yandex specialization in Data Analysis and Machine Learning](https://ru.coursera.org/specializations/machine-learning-data-analysis)
* [Open Machine Learning Course by OpenDataScience](https://habr.com/ru/company/ods/blog/322626/)
* [deeplearning.ai](https://www.coursera.org/specializations/deep-learning) on Deep Learning (neural networks)
* [Stanford cs231n](http://cs231n.stanford.edu/) on Computer Vision
* [Stanford cs224n](http://web.stanford.edu/class/cs224n/) on Natural Language Processing

Demos:
* [A collection of interactive machine learning demos](http://arogozhnikov.github.io/2016/04/28/demonstrations-for-ml-courses.html)