# Lab 04. Text Classification


This lab is devoted to text classification tasks.
- **Part 1 [8 points]** is about a very common NLP problem: sentiment analysis.
- **Part 2 [7 points]** includes tasks on POS tagging and word embeddings.


#### Evaluation

Each task has its own value, **15 points** in total. If you use open-source code, please make sure to include the URL.

#### How to submit

- Name your file according to this convention: `lab04_GroupNo_Surname_Name.ipynb`. If you don't have a group number, put `nan` instead.
- Attach it to an **email** with the **subject** `lab04_GroupNo_Surname_Name.ipynb`
- Send it to `cosmic.research.ml@yandex.ru`


Data can be downloaded from: https://disk.yandex.ru/d/ixeu6m2KBG80ig

The deadline is **2021-11-17 23:00:00 +03:00**

## Part 1. Bag of Words vs. Bag of Popcorn [8 points]

This task is based on the [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition. The goal is to label film reviews as positive or negative.

Reviews may look like this:

```
I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not hurt either. Sure some of its offensive and gratuitous but this is not the only movie like that. Eastwood is in good form as Dirty Harry, and I liked Pat Hingle in this movie as the small town cop. If you liked DIRTY HARRY, then you should see this one, its a lot better than THE DEAD POOL. 4/5
```

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

NLTK (Natural Language Toolkit) is a suite of libraries and programs for symbolic and statistical natural language processing, written in Python.

In [2]:
reviews = pd.read_csv("reviews.tsv", sep="\t")
reviews.head(3)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...


In [3]:
X = reviews["review"]
y = reviews["sentiment"]

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000, random_state=42, stratify=y)

In [6]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((20000,), (5000,), (20000,), (5000,))

### Time to extract features

In this part of the assignment we will apply several methods of feature extraction and compare them.

**Task 1.1 [0.5 point] - Simple BOW (Bag-of-Words)** 

In this task we will build a simple BOW representation, without any preprocessing.

For this purpose we will use [*CountVectorizer*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) - a method that transforms text dataset into a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html).

**CountVectorizer converts a collection of text documents to a matrix of token counts.**

Import CountVectorizer:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

Now try each of these approaches:
- fit vectorizer on X_train, apply to X_train, X_test
- fit vectorizer on X_train, apply to X_train; fit on X_test, apply to X_test
- fit vectorizer on X, apply to X_train, X_test

Report output matrix sizes in each case. 
- What is the difference? 
- Which of these approaches is the most fair and correct?

Use the most fair and correct one to get `X_train_0` and `X_test_0` - they will be needed for further tasks.

In [8]:
vectorizer = CountVectorizer()
tmp = vectorizer.fit_transform(['Hi, hi', 
                                'Hello, hi',
                                'Hi',
                                'hello'])
tmp.toarray()

array([[0, 2],
       [1, 1],
       [0, 1],
       [1, 0]], dtype=int64)

In [9]:
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.metrics import mean_squared_error
#from time import time

**1. fit vectorizer on X_train, apply to X_train, X_test**

In [10]:
count_vectorizer = CountVectorizer()

#X_0 = count_vectorizer.fit_transform(X.to_list())
#X_train_0, X_test_0, y_train_0, y_test_0 = train_test_split(X_0, y, test_size=5000, random_state=42, stratify=y)
#print(X_train_0.shape, X_test_0.shape)

X_train_0 = count_vectorizer.fit_transform(X_train.to_list())
X_test_0  = count_vectorizer.transform(X_test.to_list())
print(X_train_0.shape, X_test_0.shape)

#cl1 = RandomForestClassifier()
#start = time()
#cl1.fit(X_train_0, y_train)
#end = time()
#print('lead time: {:.3f} seconds'.format(end - start))
#cl1.fit(X_train_0.toarray(), y_train)
#ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

#cl1_score_train = cl1.score(X_train_0,y_train)
#cl1_score_test  = cl1.score(X_test_0,y_test)

#pred_train = cl1.predict(X_train_0)
#cl1_mse_train   = mean_squared_error(pred_train, y_train)
#pred_test = cl1.predict(X_test_0)
#cl1_mse_test  = mean_squared_error(pred_test, y_test)

(20000, 68482) (5000, 68482)


**2. fit vectorizer on X_train, apply to X_train; fit on X_test, apply to X_test**

In [11]:
count_vectorizer = CountVectorizer()
X_train_0 = count_vectorizer.fit_transform(X_train.to_list())
X_test_0  = count_vectorizer.fit_transform(X_test.to_list())
print(X_train_0.shape, X_test_0.shape)

#cl2 = RandomForestClassifier()
#start = time()
#cl2.fit(X_train_0, y_train)
#end = time()
#print('lead time: {:.3f} seconds'.format(end - start))

#cl2_score_train = cl2.score(X_train_0,y_train)
#cl2_score_test  = cl2.score(X_test_0,y_test)

#pred_train = cl2.predict(X_train_0)
#cl2_mse_train   = mean_squared_error(pred_train, y_train)
#pred_test = cl2.predict(X_test_0)
#cl2_mse_test  = mean_squared_error(pred_test, y_test)

#ValueError: Number of features of the model must match the input. Model n_features is 68482 and input n_features is 38591 

(20000, 68482) (5000, 38591)


**3. fit vectorizer on X, apply to X_train, X_test**

In [12]:
count_vectorizer = CountVectorizer()
X_0 = count_vectorizer.fit_transform(X.to_list())
X_train_0 = count_vectorizer.transform(X_train.to_list())
X_test_0  = count_vectorizer.transform(X_test.to_list())
print(X_train_0.shape, X_test_0.shape)

#cl3 = RandomForestClassifier()
#start = time()
#cl3.fit(X_train_0, y_train)
#end = time()
#print('lead time: {:.3f} seconds'.format(end - start))

#cl3_score_train = cl3.score(X_train_0,y_train)
#cl3_score_test  = cl3.score(X_test_0,y_test)

#pred_train = cl3.predict(X_train_0)
#cl3_mse_train   = mean_squared_error(pred_train, y_train)
#pred_test = cl3.predict(X_test_0)
#cl3_mse_test  = mean_squared_error(pred_test, y_test)

(20000, 74849) (5000, 74849)


In [13]:
#d = {'score train':  [cl1_score_train,cl2_score_train, cl3_score_train] , 
#     'score test':   [cl1_score_test, cl2_score_test,  cl3_score_test],
#     'mse train':    [cl1_mse_train,  cl2_mse_train,   cl3_mse_train],
#     'mse test':     [cl1_mse_test,   cl2_mse_test,    cl3_mse_test]
#    }
#pd.DataFrame(data=d)

To be honest, I do not fully understand how training works with sparse matrices.
Is the matrix somehow converted to an array dynamically?
When I tried to convert it explicitly, Python complained that the array is too large.

If no such dynamic conversion happens, I am afraid that what is written above is nonsense.

This applies especially to the second case: the matrices X_train_0 and X_test_0 must at least have the same number of columns.
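On the shape mismatch in the second case: every call to `fit` builds its own vocabulary, so a vectorizer refitted on the test set produces columns that do not correspond to the train columns. The sketch below illustrates this on a toy corpus (the sentences are made up for illustration). As for the first question, scikit-learn estimators such as LogisticRegression and RandomForestClassifier accept SciPy sparse matrices directly, so calling `.toarray()` is not needed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for illustration only.
train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["bad acting", "good acting"]

# Approach 1: fit on train, reuse the same vocabulary for test.
shared = CountVectorizer()
train_matrix = shared.fit_transform(train_docs)
test_matrix = shared.transform(test_docs)
print(train_matrix.shape, test_matrix.shape)  # same number of columns

# Approach 2: refit on test, which builds a different vocabulary.
refit = CountVectorizer()
test_refit = refit.fit_transform(test_docs)
print(sorted(shared.vocabulary_))
print(sorted(refit.vocabulary_))
print(test_refit.shape)
```

With approach 1, tokens that appear only in the test set (here `acting`) are simply ignored, which mirrors what happens with unseen words at prediction time.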

**Task 1.2 [0.5 point] - S___se matrices**

What is the data type of `X_train_0` and `X_test_0`? What are those?

How do they differ from regular NumPy arrays? Name several formats in which such matrices are stored and what each is good for.

**fit_transform()** and **transform()** return X: a sparse matrix of shape (n_samples, n_features), the document-term matrix.

In [14]:
type(X_train_0)

scipy.sparse.csr.csr_matrix

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

Compressed Sparse Row matrix

In [15]:
X_train_0

<20000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 2760558 stored elements in Compressed Sparse Row format>

In [16]:
#X_train_0.toarray()
#ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

In [17]:
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, (row, col))).toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]], dtype=int32)

Only the positions and values of the non-zero entries are stored; all other entries are implicitly zero.

**Sparse matrix classes:** https://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices hold a lot of information but take up considerably less memory.
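The memory claim can be checked with a rough sketch. The matrix below is synthetic and only mimics a mostly-zero document-term matrix; the byte counts are taken from the three arrays that the CSR format keeps. Other SciPy formats make different trade-offs: CSC is suited to column slicing, COO to building a matrix from (row, column, value) triples, and LIL or DOK to incremental construction.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix, similar in spirit to a document-term matrix.
dense = np.zeros((1000, 5000), dtype=np.int64)
dense[0, 10] = 3
dense[42, 4999] = 1
dense[999, 0] = 7

sparse = csr_matrix(dense)

# CSR keeps three arrays: non-zero values, their column indices, and row pointers.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print("dense bytes: ", dense.nbytes)
print("sparse bytes:", sparse_bytes)
print("stored elements:", sparse.nnz)
```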

*Answer:*

**Task 1.3 [1 points] - Training**

Train LogisticRegression and RandomForest on this data representation.
- Compare training time 
- Compare Accuracy, precision, recall 
- Plot ROC Curve and calculate ROC AUC (don't forget to predict_proba) 
- Plot Precision-Recall curve and calculate f1-score (for example, with `plt.subplots(nrows=1, ncols=2)`)
- Print the trickiest misclassified objects. Why were they hard to classify?


In [18]:
#import time as tm
from time import time

In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, precision_recall_curve, roc_curve, roc_auc_score

In [20]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=500)

start = time()
rf_model.fit(X_train_0, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

rf_pred_train = rf_model.predict(X_train_0)
rf_pred_test = rf_model.predict(X_test_0)

accuracy_score_rf_train = accuracy_score(y_train, rf_pred_train)
accuracy_score_rf_test = accuracy_score(y_test, rf_pred_test)

precision_score_rf_train = precision_score(y_train, rf_pred_train)
precision_score_rf_test = precision_score(y_test, rf_pred_test)

recall_score_rf_train = recall_score(y_train, rf_pred_train)
recall_score_rf_test = recall_score(y_test, rf_pred_test)

f1_score_rf_train = f1_score(y_train, rf_pred_train)
f1_score_rf_test  = f1_score(y_test, rf_pred_test)

precision_recall_curve_rf_train = precision_recall_curve(y_train, rf_pred_train)
precision_recall_curve_rf_test  = precision_recall_curve(y_test, rf_pred_test)

roc_curve_rf_train = roc_curve(y_train, rf_pred_train)
roc_curve_rf_test  = roc_curve(y_test, rf_pred_test)

roc_auc_score_rf_train = roc_auc_score(y_train, rf_pred_train)
roc_auc_score_rf_test  = roc_auc_score(y_test, rf_pred_test)

lead time: 879.605 seconds


In [21]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(max_iter=100_000)  # max_iter should be an integer

start = time()
lr_model.fit(X_train_0, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lr_pred_train = lr_model.predict(X_train_0)
lr_pred_test  = lr_model.predict(X_test_0)

precision_score_lr_train = precision_score(y_train, lr_pred_train)
precision_score_lr_test  = precision_score(y_test,  lr_pred_test)

recall_score_lr_train = recall_score(y_train, lr_pred_train)
recall_score_lr_test  = recall_score(y_test,  lr_pred_test)

f1_score_lr_train = f1_score(y_train, lr_pred_train)
f1_score_lr_test  = f1_score(y_test, lr_pred_test)

precision_recall_curve_lr_train = precision_recall_curve(y_train, lr_pred_train)
precision_recall_curve_lr_test  = precision_recall_curve(y_test, lr_pred_test)

roc_curve_lr_train = roc_curve(y_train, lr_pred_train)
roc_curve_lr_test  = roc_curve(y_test, lr_pred_test)

roc_auc_score_lr_train = roc_auc_score(y_train, lr_pred_train)
roc_auc_score_lr_test  = roc_auc_score(y_test, lr_pred_test)

lead time: 17.527 seconds



In [23]:
d = {'RandomForestClassifier':   [precision_score_rf_train, 
                                  precision_score_rf_test,
                                     recall_score_rf_train,
                                     recall_score_rf_test,
                                         f1_score_rf_train,
                                         f1_score_rf_test,
                                    roc_auc_score_rf_train, 
                                    roc_auc_score_rf_test],
     'LogisticRegression':       [precision_score_lr_train,  
                                  precision_score_lr_test,
                                     recall_score_lr_train,
                                     recall_score_lr_test,
                                         f1_score_lr_train,
                                         f1_score_lr_test,
                                    roc_auc_score_lr_train,
                                    roc_auc_score_lr_test]}
pd.DataFrame(data=d, index = ['precision_score_train',
                              'precision_score_test',
                              'recall_score_train',
                              'recall_score_test',
                              'f1_score_train',
                              'f1_score_test',
                              'roc_auc_score_train', 
                              'roc_auc_score_test'])

Unnamed: 0,RandomForestClassifier,LogisticRegression
precision_score_train,1.0,0.9992
precision_score_test,0.846273,0.87564
recall_score_train,1.0,0.9988
recall_score_test,0.872,0.89
f1_score_train,1.0,0.999
f1_score_test,0.858944,0.882761
roc_auc_score_train,1.0,0.999
roc_auc_score_test,0.8568,0.8818
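
Note that the ROC AUC values in this table were computed from hard 0/1 predictions (`predict`), whereas the task asks for `predict_proba`. The sketch below uses hand-written scores, a hypothetical stand-in for `model.predict_proba(X_test)[:, 1]`, to show that thresholding before computing ROC AUC changes the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and positive-class probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.60, 0.90])

# ROC AUC from continuous scores uses the full ranking of the samples.
auc_from_proba = roc_auc_score(y_true, proba)

# Thresholding at 0.5 first collapses the ranking to two values.
hard_labels = (proba >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_proba, auc_from_labels)
```

The same applies to `roc_curve` and `precision_recall_curve`: with hard labels they contain only a single non-trivial threshold, so the plotted curves degenerate into a few line segments.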



Which model gives higher scores? Any ideas why? Please suggest 1-2 reasons.

*Answer:*

LogisticRegression gave the better results on the test set.

### More sophisticated feature preprocessing

As we have seen, simple BOW can give us some result - it's time to improve it.

**Task 1.4 [1 point] - Frequencies calculation**

- Calculate the top-20 words in the train set and the test set. *Are they meaningful?*
- Import `stopwords` and print some of them. What are those?
- Recalculate the top-20 words in each set, but exclude stop words.
- Does the top-20 now include more useful words?

In [24]:
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer
from nltk.corpus import stopwords

https://pythonworld.ru/moduli/modul-collections.html

**Calculate top-20 words in train set and test set. Are they meaningful?**

In [25]:
#top-20 words in train set
X_train

15061    A very silly movie, this starts with a soft po...
10112    1st watched 8/3/2003 - 2 out of 10(Dir-Brad Sy...
24550    This is a really heart-warming family movie. I...
2570     Nicole Kidman is a wonderful actress and here ...
16053    It's very hard to say just what was going on w...
                               ...                        
20210    The Three Stooges has always been some of the ...
15989    And a made for TV movie too, this movie was go...
20873    Back in my days as an usher \Private Lessons\"...
10422    I might not be a huge admirer of the original ...
21768    This movie was an impressive one. My first exp...
Name: review, Length: 20000, dtype: object

In [26]:
s = ' '.join(map(str, X_train))

In [27]:
Whitespace = WhitespaceTokenizer().tokenize(s)
Counter_Whitespace = Counter(Whitespace)
#sorted(Counter_Whitespace.items())
top = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]
top[0:20]

[('the', 230191),
 ('a', 124236),
 ('and', 122525),
 ('of', 114632),
 ('to', 106231),
 ('is', 82852),
 ('in', 68303),
 ('I', 52896),
 ('that', 51864),
 ('this', 45937),
 ('it', 43691),
 ('/><br', 40824),
 ('was', 37452),
 ('as', 33868),
 ('with', 33567),
 ('for', 32891),
 ('The', 27102),
 ('but', 27011),
 ('on', 24827),
 ('movie', 24625)]

In [28]:
WordPunct = WordPunctTokenizer().tokenize(s)
Counter_WordPunct = Counter(WordPunct)
top = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_WordPunct.items()], reverse=True)]
top[0:20]

[('the', 232718),
 (',', 210946),
 ('.', 180270),
 ('and', 126065),
 ('a', 125408),
 ('of', 115592),
 ('to', 107393),
 ("'", 104320),
 ('is', 85332),
 ('br', 81648),
 ('in', 70114),
 ('I', 65898),
 ('it', 62493),
 ('that', 56535),
 ('s', 50464),
 ('this', 48839),
 ('-', 44759),
 ('/><', 40824),
 ('/>', 39157),
 ('was', 38383)]

In [29]:
TreebankWord = TreebankWordTokenizer().tokenize(s)
Counter_TreebankWord = Counter(TreebankWord)
top = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_TreebankWord.items()], reverse=True)]
top[0:20]

[('the', 231895),
 (',', 221238),
 ('and', 125458),
 ('a', 125014),
 ('of', 115057),
 ('to', 106660),
 ('is', 86915),
 ('/', 81829),
 ('>', 81789),
 ('<', 81742),
 ('br', 81648),
 ('in', 69149),
 ('I', 65008),
 ('it', 56833),
 ('that', 55557),
 ("''", 52892),
 ("'s", 49338),
 ('this', 47639),
 ('was', 39719),
 ('as', 34629)]

The words are not meaningful.
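
The remaining steps of the task (excluding stop words and recounting) could look roughly like the sketch below. It is a minimal illustration, not the reference solution: `STOP_WORDS` is a tiny hand-picked set standing in for `nltk.corpus.stopwords.words("english")`, `top_words` is a hypothetical helper, and the sample reviews are made up. It also lowercases the text and drops the `<br />` tags that pollute the counts above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; an assumption standing in for
# nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "and", "of", "to", "is", "in", "i", "that", "this", "it", "was"}

def top_words(texts, n=20, stop_words=frozenset()):
    """Return the n most common lowercase tokens, skipping stop words."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"<br\s*/?>", " ", text)         # drop HTML line breaks
        tokens = re.findall(r"[a-z']+", text.lower())  # letters and apostrophes only
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

sample = [
    "The plot is good and the acting is good.<br /><br />I liked it.",
    "This movie was a waste of time. The plot is bad.",
]
print(top_words(sample, n=3))
print(top_words(sample, n=3, stop_words=STOP_WORDS))
```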

**Task 1.5 [1 point] - Word Freqs by class**

Do you think the top-100 tokens for the positive and negative classes will be different? Use data to prove your point.

In [30]:
positive_class = X[y==1] 
negative_class = X[y==0] 
positive_class.shape[0] + negative_class.shape[0] == X.shape[0]

True

In [31]:
positive_class

0        With all this stuff going down at the moment w...
1        \The Classic War of the Worlds\" by Timothy Hi...
4        Superbly trashy and wondrously unpretentious 8...
5        I dont know why people think this is such a ba...
9        <br /><br />This movie is full of references. ...
                               ...                        
24987    First off, I'd like to make a correction on an...
24988    While originally reluctant to jump on the band...
24989    I heard about this movie when watching VH1's \...
24990    I've never been huge on IMAX films. They're co...
24999    I saw this movie as a child and it broke my he...
Name: review, Length: 12500, dtype: object

In [32]:
y[1]

1

In [33]:
negative_class

2        The film starts with a manager (Nicholas Bell)...
3        It must be assumed that those who praised this...
6        This movie could have been very good, but come...
7        I watched this video at a friend's house. I'm ...
8        A friend of mine bought this film for £1, and ...
                               ...                        
24994    Unimaginably stupid, redundant and humiliating...
24995    It seems like more consideration has gone into...
24996    I don't believe they made this film. Completel...
24997    Guy is a loser. Can't get girls, needs to buil...
24998    This 30 minute documentary Buñuel made in the ...
Name: review, Length: 12500, dtype: object

In [34]:
y[24994]

0

In [35]:
s_positive = ' '.join(map(str, positive_class))
s_negative = ' '.join(map(str, negative_class))

In [None]:
Whitespace = WhitespaceTokenizer().tokenize(s_positive)
Counter_Whitespace = Counter(Whitespace)
top_1 = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]
#top_1[0:20]

In [37]:
Whitespace = WhitespaceTokenizer().tokenize(s_negative)
Counter_Whitespace = Counter(Whitespace)
top_2 = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]
#top_2[0:20]

[('the', 138618),
 ('a', 75668),
 ('and', 68388),
 ('of', 67631),
 ('to', 67359),
 ('is', 47871),
 ('in', 39784),
 ('I', 35045),
 ('that', 32617),
 ('this', 31174),
 ('it', 27443),
 ('/><br', 26318),
 ('was', 25389),
 ('for', 20199),
 ('with', 19689),
 ('as', 18580),
 ('but', 17331),
 ('movie', 17129),
 ('The', 17100),
 ('on', 15380)]

*Answer:* for the most part they coincide.
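
To back this answer with numbers, the overlap between the two top-N lists can be measured directly. The sketch below is one possible way, using set operations on toy reviews made up for illustration; `top_tokens` is a hypothetical helper, and in the notebook it would be applied to `positive_class` and `negative_class` with `n=100`.

```python
from collections import Counter

def top_tokens(texts, n):
    """Set of the n most frequent whitespace-separated tokens."""
    counts = Counter(token for text in texts for token in text.split())
    return {token for token, _ in counts.most_common(n)}

# Toy reviews, made up for illustration.
positive = ["the film is great", "the plot is great and moving"]
negative = ["the film is awful", "the plot is awful and boring"]

top_pos = top_tokens(positive, n=3)
top_neg = top_tokens(negative, n=3)

shared = top_pos & top_neg
jaccard = len(shared) / len(top_pos | top_neg)
print("shared:", sorted(shared))
print("only positive:", sorted(top_pos - top_neg))
print("only negative:", sorted(top_neg - top_pos))
print("Jaccard overlap:", jaccard)
```

The tokens that appear in only one of the two lists are usually the interesting ones, since they are candidates for class-specific vocabulary.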

**Task 1.6 [2 points] - Reducing dimensionality**

The goal is to reduce the number of features to 15000.

Implement the following methods of dimensionality reduction:
1. Use CountVectorizer, but leave only 15k most frequent tokens
2. Use HashingVectorizer with 15k features
3. Use 15k most important features from perspective of previously trained RandomForest

*Hints:*
- in 1 and 2 you don't have to apply nltk.corpus.stopwords, the vectorizers have a `stop_words` parameter
- in 1 look for the `vocabulary` parameter
- in 3... remember `lab02`? You may use `X_train_0` and `X_test_0` as input matrices

Train LogisticRegression and RandomForest on each dataset and compare ROC AUC scores of the classifiers.
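
The three reduction methods can be sketched on a toy corpus as follows. This is an illustration under stated assumptions, not the expected solution: the documents and labels are made up, `k = 3` stands in for 15000, and method 1 uses `max_features`, which keeps the most frequent tokens without building a `vocabulary` by hand.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Toy corpus and labels, made up for illustration; k stands in for 15000.
docs = ["a good movie", "a bad movie", "good plot good cast", "bad plot bad cast"]
labels = [1, 0, 1, 0]
k = 3

# 1. Keep only the k most frequent tokens.
count_k = CountVectorizer(max_features=k, stop_words="english")
X_count = count_k.fit_transform(docs)

# 2. Hash tokens into a fixed number of columns (no vocabulary is stored).
hash_k = HashingVectorizer(n_features=k, stop_words="english")
X_hash = hash_k.transform(docs)

# 3. Keep the k columns that a fitted forest considers most important.
X_full = CountVectorizer().fit_transform(docs)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_full, labels)
top_idx = np.argsort(forest.feature_importances_)[::-1][:k]
X_rf = X_full[:, top_idx]

print(X_count.shape, X_hash.shape, X_rf.shape)
```

In the notebook, each vectorizer would be fitted on `X_train` only and then applied to `X_test`, and the column indices from method 3 would be reused to slice `X_test_0`.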

In [38]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

**1. Use CountVectorizer, but leave only 15k most frequent tokens**

In [39]:
X_train_0, X_test_0

(<20000x74849 sparse matrix of type '<class 'numpy.int64'>'
 	with 2760558 stored elements in Compressed Sparse Row format>,
 <5000x74849 sparse matrix of type '<class 'numpy.int64'>'
 	with 685303 stored elements in Compressed Sparse Row format>)

In [40]:
X

0        With all this stuff going down at the moment w...
1        \The Classic War of the Worlds\" by Timothy Hi...
2        The film starts with a manager (Nicholas Bell)...
3        It must be assumed that those who praised this...
4        Superbly trashy and wondrously unpretentious 8...
                               ...                        
24995    It seems like more consideration has gone into...
24996    I don't believe they made this film. Completel...
24997    Guy is a loser. Can't get girls, needs to buil...
24998    This 30 minute documentary Buñuel made in the ...
24999    I saw this movie as a child and it broke my he...
Name: review, Length: 25000, dtype: object

In [None]:
s = ' '.join(map(str, X))

In [42]:
#s = ' '.join(map(str, X_train))

In [50]:
s[0:10]

'A very sil'

In [43]:
#TreebankWord = TreebankWordTokenizer().tokenize(s)
#Counter_TreebankWord = Counter(TreebankWord)
#top = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_TreebankWord.items()], reverse=True)]
#top[0:20]

#MemoryError
#----> 1 TreebankWord = TreebankWordTokenizer().tokenize(s)

In [None]:
Whitespace = WhitespaceTokenizer().tokenize(s)

In [None]:
Whitespace = WhitespaceTokenizer().tokenize(s)
#Whitespace = WhitespaceTokenizer().tokenize(s)
Counter_Whitespace = Counter(Whitespace)

In [None]:
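# Keep only the 15000 most frequent tokens; fit on the training set only
vectorizer = CountVectorizer(max_features=15000)
X_train_1 = vectorizer.fit_transform(X_train)
X_test_1 = vectorizer.transform(X_test)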

In [None]:
#Whitespace = WhitespaceTokenizer().tokenize(s)
vectorizer = CountVectorizer()
#Counter_Whitespace = Counter(Whitespace)
top = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]
top[15000]

In [None]:
# Counter has no max_features argument; most_common(n) returns the n most frequent tokens
top_15k = Counter_Whitespace.most_common(15000)

In [None]:
# Example from the scikit-learn CountVectorizer documentation.
# The result is stored in X_demo so that the review data in X is not overwritten.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X_demo = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()
# array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
#        'this'], ...)
print(X_demo.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
vectorizer2.get_feature_names_out()
# array(['and this', 'document is', 'first document', 'is the', 'is this',
#        'second document', 'the first', 'the second', 'the third', 'third one',
#        'this document', 'this is', 'this the'], ...)
print(X2.toarray())
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
#  [0 1 0 1 0 1 0 1 0 0 1 0 0]
#  [1 0 0 1 0 0 0 0 1 1 0 1 0]
#  [0 0 1 0 1 0 1 0 0 0 0 0 1]]

In [None]:
# Example from the scikit-learn HashingVectorizer documentation.
# The result is stored in X_demo so that the review data in X is not overwritten.
from sklearn.feature_extraction.text import HashingVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = HashingVectorizer(n_features=2**4)
X_demo = vectorizer.fit_transform(corpus)
print(X_demo.shape)
# (4, 16)

**Task 1.7 [2 points] - Token Normalization**

Choose the best working method from the previous task. Try to improve it by applying a token normalization technique.

You may use one of normalizers imported below, but feel free to experiment.

Do the following:
- Apply normalizer to X_train, X_test
- Build BOW with CountVectorizer + stopwords. What are the shapes of train and test matrices now?
- Reduce dimensionality with the best method from Task 1.6. You may try all of them
- Train LR/RF to examine whether ROC AUC or Accuracy was improved.
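
One way to plug a normalizer into the BOW step is to pass a custom analyzer to `CountVectorizer`. The sketch below assumes `PorterStemmer` as the normalizer and reuses scikit-learn's own tokenization and English stop-word list; `stemmed_analyzer` is a hypothetical helper and the two reviews are made up for illustration.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(text):
    """Tokenize and drop stop words with scikit-learn, then stem each token."""
    return [stemmer.stem(token) for token in base_analyzer(text)]

# Toy reviews, made up for illustration.
docs = ["The actors acted well", "An actor acting badly"]

plain = CountVectorizer(stop_words="english").fit(docs)
stemmed = CountVectorizer(analyzer=stemmed_analyzer).fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(stemmed.vocabulary_))
```

Stemming merges inflected forms into one column, so the vocabulary shrinks before any explicit dimensionality reduction is applied. `WordNetLemmatizer` could be swapped in the same way, although it needs the WordNet corpus to be downloaded first.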

In [46]:
from nltk.stem import WordNetLemmatizer, PorterStemmer

## Part 2. Word Embeddings [7 points]

In [47]:
import gensim.downloader

ModuleNotFoundError: No module named 'gensim'

Here is the list of pretrained word embedding models. We suggest using `glove-wiki-gigaword-100`.

In [None]:
list(gensim.downloader.info()['models'].keys())

In [None]:
word_embeddings = gensim.downloader.load("glove-wiki-gigaword-100")

**Task 2.1 [2 points] - WordEmbeddings Geometry**

As you probably know, the vector space of word embeddings has non-trivial geometry: some word relations (such as country-capital or singular-plural) can be represented by vector offsets, for example **(king - man) + woman = queen**

<img src="https://linkme.ufanet.ru/images/5687a2011b49eb2413912f1c7d0fb0bd.png" width=600px>

Check this statement on words from the above picture with `word_embeddings.most_similar` function. Pay attention to `positive` and `negative` params.

Provide **several** examples, make sure to present different relations: some for nouns, some for verbs.
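
With the real model, such a check would be a call like `word_embeddings.most_similar(positive=['king', 'woman'], negative=['man'])`. Since that requires downloading the pretrained vectors, the sketch below reproduces the idea on hypothetical 2-D vectors chosen by hand; `analogy` is a simplified stand-in whose ranking may differ in details (such as normalisation) from what gensim does.

```python
import numpy as np

# Hypothetical 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
toy_vectors = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)

ranking = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(ranking)
```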

**Task 2.2 [2 points] - POS analysis**

Use a POS tagger to find the most common POS tags in the dataset.
Here you may read about nltk-taggers: [link](https://www.inf.ed.ac.uk/teaching/courses/icl/nltk/tagging.pdf)

- If you were to design POS-related weights, how would you do it?
- Which POS would get a higher weight?
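
Counting tags and attaching weights can be sketched without running a tagger. The list below is tagged by hand in the `(token, tag)` format that `nltk.pos_tag` returns, and the weights in `pos_weights` are purely hypothetical, based on the assumption that adjectives and adverbs tend to carry more sentiment than function words.

```python
from collections import Counter

# Hand-tagged tokens in the (token, tag) format that nltk.pos_tag returns;
# the tags are written by hand for illustration, not produced by a tagger.
tagged = [
    ("Eastwood", "NNP"), ("is", "VBZ"), ("in", "IN"), ("good", "JJ"),
    ("form", "NN"), ("and", "CC"), ("the", "DT"), ("plot", "NN"),
    ("is", "VBZ"), ("pretty", "RB"), ("good", "JJ"),
]

# Most common POS tags.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))

# Hypothetical weights: adjectives and adverbs get a larger weight,
# any tag not listed here falls back to 0.5.
pos_weights = {"JJ": 2.0, "RB": 1.5, "NN": 1.0, "VBZ": 1.0}
weighted = [(token, pos_weights.get(tag, 0.5)) for token, tag in tagged]
print(weighted[:4])
```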

**Task 2.3 [3 points] - WordEmbeddings**

Use dense vector representations to construct a vector representation of each review, then train a model (LR or RF).

Compare the results of the new model with the results of the models above.

**Important**
- If you simply sum the embeddings of all tokens to get an embedding of the whole review, the task is worth **[2 points]**.
- For **[3 points]** you have to use either TF-IDF weights or weights that you designed from POS tags.