# Lab 04. Text Classification


This lab is devoted to text classification tasks.
- **Part 1 [8 points]** is about a very common NLP problem: sentiment analysis.
- **Part 2 [7 points]** includes tasks on POS tagging and word embeddings.


#### Evaluation

Each task has its value, **15 points** in total. If you use some open-source code please make sure to include the url.

#### How to submit

- Name your file according to this convention: `lab04_GroupNo_Surname_Name.ipynb`. If you don't have group number, put `nan` instead.
- Attach it to an **email** with **topic** `lab04_GroupNo_Surname_Name.ipynb`
- Send it to `cosmic.research.ml@yandex.ru`


Data can be downloaded from: https://disk.yandex.ru/d/ixeu6m2KBG80ig

The deadline is **2021-11-17 23:00:00 +03:00**

## Part 1. Bag of Words vs. Bag of Popcorn [8 points]

This task is based on [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition. The goal is to label film reviews as positive or negative. 

Reviews may look like this:

```
I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not hurt either. Sure some of its offensive and gratuitous but this is not the only movie like that. Eastwood is in good form as Dirty Harry, and I liked Pat Hingle in this movie as the small town cop. If you liked DIRTY HARRY, then you should see this one, its a lot better than THE DEAD POOL. 4/5
```

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

NLTK (Natural Language Toolkit) is a suite of libraries and programs for symbolic and statistical natural language processing, written in Python.

In [2]:
reviews = pd.read_csv("reviews.tsv", sep="\t")
reviews.head(3)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...


In [3]:
X = reviews["review"]
y = reviews["sentiment"]

In [4]:
reviews.shape

(25000, 3)

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000, random_state=42, stratify=y)

In [7]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((20000,), (5000,), (20000,), (5000,))

### Time to extract features

In this part of the assignment we will apply several methods of feature extraction and compare them.

**Task 1.1 [0.5 point] - Simple BOW (Bag-of-Words)** 

In this task we will build a simple BOW representation without any preprocessing.

For this purpose we will use [*CountVectorizer*](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) - a method that transforms text dataset into a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html).

**CountVectorizer converts a collection of text documents to a matrix of token counts.**

Import CountVectorizer:

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

Now try each of these approaches:
- fit vectorizer on X_train, apply to X_train, X_test
- fit vectorizer on X_train, apply to X_train; fit on X_test, apply to X_test
- fit vectorizer on X, apply to X_train, X_test

Report output matrix sizes in each case. 
- What is the difference? 
- Which of these approaches is the most fair and correct?

Use the most fair and correct one to get `X_train_0` and `X_test_0` - they will be needed for further tasks.

In [9]:
vectorizer = CountVectorizer()
tmp = vectorizer.fit_transform(['Hi, hi', 
                                'Hello, hi',
                                'Hi',
                                'hello'])
tmp.toarray()

array([[0, 2],
       [1, 1],
       [0, 1],
       [1, 0]], dtype=int64)

**1. fit vectorizer on X_train, apply to X_train, X_test**

In [10]:
count_vectorizer = CountVectorizer()
X_train_0 = count_vectorizer.fit_transform(X_train.to_list())
X_test_0  = count_vectorizer.transform(X_test.to_list())
print(X_train_0.shape, X_test_0.shape)

(20000, 68482) (5000, 68482)


**2. fit vectorizer on X_train, apply to X_train; fit on X_test, apply to X_test**

In [11]:
count_vectorizer = CountVectorizer()
X_train_0 = count_vectorizer.fit_transform(X_train.to_list())
X_test_0  = count_vectorizer.fit_transform(X_test.to_list())
print(X_train_0.shape, X_test_0.shape)

(20000, 68482) (5000, 38591)


**3. fit vectorizer on X, apply to X_train, X_test**

In [12]:
count_vectorizer = CountVectorizer()
X_0 = count_vectorizer.fit_transform(X.to_list())
X_train_0 = count_vectorizer.transform(X_train.to_list())
X_test_0  = count_vectorizer.transform(X_test.to_list())
print(X_train_0.shape, X_test_0.shape)

(20000, 74849) (5000, 74849)


To be honest, I don't fully understand how training works with **sparse matrices**.
Is the matrix converted to a dense array on the fly?
When I tried to convert it to an array myself, Python complained that the array was too big.

If no such on-the-fly conversion happens, I'm afraid the code above does not make sense.

This applies especially to the second case: the matrices X_train_0 and X_test_0 must at least have the same number of columns.

Since the second approach produced different dimensions, it is definitely not suitable.

In the third case the dimensionality seems too large to me, because the vocabulary also includes words that occur only in the test set.

**I think the first approach is the best one.**
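A minimal sketch of the first approach (fit on the training set, transform both sets) on a toy corpus; the documents and labels are made up for illustration. It also suggests an answer to the question about sparse input: scikit-learn estimators such as `LogisticRegression` accept the sparse matrix directly, so no dense conversion is needed.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for this sketch
train_docs = ["good movie", "bad movie", "good plot", "bad acting"]
train_labels = [1, 0, 1, 0]
test_docs = ["good acting", "terrible movie"]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)   # vocabulary is learned from the training set only
X_te = vec.transform(test_docs)        # unseen test words ("terrible") are ignored

print(X_tr.shape, X_te.shape)          # both matrices share the same columns
print(issparse(X_tr))

# The estimator is fitted on the sparse matrix as-is
clf = LogisticRegression().fit(X_tr, train_labels)
print(clf.predict(X_te))
```

Because the vocabulary comes from the training data alone, nothing about the test set influences the features the model sees during training.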

**Task 1.2 [0.5 point] - S___se matrices**

What is the data type of `X_train_0` and `X_test_0`? What are those?

How do they differ from regular np.arrays? Name several formats in which these special matrices are stored and what each is good for.

In [13]:
type(X)

pandas.core.series.Series

In [14]:
type(X_train_0)

scipy.sparse.csr.csr_matrix

## **Sparse matrix: a matrix whose elements are mostly zero.**

https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D0%B7%D1%80%D0%B5%D0%B6%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BC%D0%B0%D1%82%D1%80%D0%B8%D1%86%D0%B0

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html

**scipy.sparse** provides seven sparse matrix types:
1. bsr_matrix: Block Sparse Row format
2. coo_matrix: COOrdinate format (also known as IJV or triplet format)
3. csc_matrix: Compressed Sparse Column format
4. csr_matrix: Compressed Sparse Row format
5. lil_matrix: List of Lists format
6. dok_matrix: Dictionary of Keys format
7. dia_matrix: DIAgonal format
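As a rough illustration of what the formats are good for, the sketch below builds a small matrix incrementally in LIL format and then converts it to CSR, CSC, and COO. The matrix contents are made up; only the conversion methods and the `format` and `nnz` attributes are the point.

```python
import numpy as np
from scipy.sparse import lil_matrix

# LIL is convenient for building a matrix element by element
m = lil_matrix((3, 4), dtype=np.int64)
m[0, 1] = 5
m[2, 3] = 7

# CSR is the usual choice for row slicing and matrix-vector products
csr = m.tocsr()
# CSC is the column-oriented counterpart, efficient for column slicing
csc = csr.tocsc()
# COO stores explicit (row, col, value) triplets
coo = csr.tocoo()

print(csr.format, csc.format, coo.format)
print(csr.nnz)            # number of stored elements
print(csr[0].toarray())   # row access in CSR
```

A common pattern is to construct in LIL, DOK, or COO and convert to CSR or CSC before doing arithmetic or training a model.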

### 1. bsr_matrix: Block Sparse Row matrix

In [15]:
from scipy.sparse import bsr_matrix
bsr_matrix((3, 4), dtype=np.int8).toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

In [16]:
row = np.array( [0, 0, 1, 2, 2, 2])
col = np.array( [0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
bsr_matrix((data, (row, col)), shape=(3, 3)).toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]], dtype=int32)

In [17]:
data = np.array([1, 2, 3, 4, 5, 7]).repeat(4).reshape(6, 2, 2)
#data

In [18]:
indptr  = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
bsr_matrix((data,indices,indptr), shape=(6, 6)).toarray()

array([[1, 1, 0, 0, 2, 2],
       [1, 1, 0, 0, 2, 2],
       [0, 0, 0, 0, 3, 3],
       [0, 0, 0, 0, 3, 3],
       [4, 4, 5, 5, 7, 7],
       [4, 4, 5, 5, 7, 7]])

In [19]:
indptr  = np.array([0, 1, 2, 3])
indices = np.array([0, 1, 2, 0, 0, 0])
bsr_matrix((data,indices,indptr), shape=(6, 6)).toarray()

array([[1, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 2, 2, 0, 0],
       [0, 0, 2, 2, 0, 0],
       [0, 0, 0, 0, 3, 3],
       [0, 0, 0, 0, 3, 3]])

In [20]:
indptr  = np.array([0, 1, 2, 4])
indices = np.array([1, 2, 0, 0, 0, 0])
bsr_matrix((data,indices,indptr), shape=(6, 6)).toarray()

array([[0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 2, 2],
       [0, 0, 0, 0, 2, 2],
       [7, 7, 0, 0, 0, 0],
       [7, 7, 0, 0, 0, 0]])

### 2. scipy.sparse.coo_matrix - a sparse matrix in COOrdinate format.

COO stores a list of (row, column, value) tuples. Ideally, the entries are sorted first by row index and then by column index to improve random access times. (Source: Wikipedia.)

In [21]:
from scipy.sparse import coo_matrix
coo_matrix((3, 4), dtype=np.int8).toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

In [22]:
row  = np.array([0, 3, 1, 0])
col  = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
coo_matrix((data, (row, col)), shape=(4, 4)).toarray()

array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])
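The triplet structure described above can be inspected directly: a COO matrix exposes its three parallel arrays as the `row`, `col`, and `data` attributes. The sketch reuses the values from the cell above.

```python
import numpy as np
from scipy.sparse import coo_matrix

row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
m = coo_matrix((data, (row, col)), shape=(4, 4))

# The three parallel arrays are exposed as attributes
triplets = list(zip(m.row.tolist(), m.col.tolist(), m.data.tolist()))
print(triplets)

# Convert to CSR for element access
print(m.tocsr()[0, 2])
```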

### 3. scipy.sparse.csc_matrix

In [23]:
from scipy.sparse import csc_matrix
csc_matrix((3, 4), dtype=np.int8).toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

In [24]:
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csc_matrix((data, (row, col)), shape=(3, 3)).toarray()

array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]], dtype=int32)

In [25]:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()

array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

### 4. scipy.sparse.csr_matrix

In [26]:
from scipy.sparse import csr_matrix
csr_matrix((3, 4), dtype=np.int8).toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

In [27]:
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, (row, col)), shape=(3, 3)).toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]], dtype=int32)

In [28]:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])
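How `indptr` and `indices` encode the rows in the cell above can be sketched by decoding the three arrays by hand: row `i` owns the slice `indptr[i]:indptr[i + 1]` of both `indices` (column numbers) and `data` (values).

```python
import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
m = csr_matrix((data, indices, indptr), shape=(3, 3))

# Decode each row into (column, value) pairs using the row-pointer array
rows = []
for i in range(m.shape[0]):
    start, end = indptr[i], indptr[i + 1]
    rows.append(list(zip(indices[start:end].tolist(), data[start:end].tolist())))
    print(i, rows[-1])
```

CSC works the same way with the roles of rows and columns swapped, which is why the same three arrays produce the transposed matrix in the CSC example.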

In [29]:
row = np.array([0, 1, 2, 0])
col = np.array([0, 1, 1, 0])
data = np.array([1, 2, 4, 8])
csr_matrix((data, (row, col)), shape=(3, 3)).toarray()

array([[9, 0, 0],
       [0, 2, 0],
       [0, 4, 0]], dtype=int32)

### 5. scipy.sparse.dia_matrix

### 6. dok_matrix

### 7. lil_matrix

### 8. spmatrix

In [30]:
type(X_train_0)

scipy.sparse.csr.csr_matrix

In [31]:
X_train_0

<20000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 2760558 stored elements in Compressed Sparse Row format>

In [32]:
#X_train_0.toarray()
#ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
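A back-of-the-envelope estimate suggests why the dense conversion fails while the sparse matrix fits in memory easily. The shape and the number of stored elements are taken from the output above; the 4-byte index size is an assumption about how SciPy stores the CSR index arrays here.

```python
n_rows, n_cols, nnz = 20000, 74849, 2760558   # from the repr of X_train_0 above

dense_bytes = n_rows * n_cols * 8             # int64 -> 8 bytes per element
# CSR: one value (int64) and one column index (assumed int32) per stored element,
# plus a row-pointer array of length n_rows + 1 (assumed int32)
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"dense: {dense_bytes / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")
print(f"density: {nnz / (n_rows * n_cols):.4%}")
```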

*Answer:*

**Task 1.3 [1 points] - Training**

Train LogisticRegression and Random Forest on these data representations.
- Compare training time
- Compare accuracy, precision, recall
- Plot the ROC curve and calculate ROC AUC (don't forget to use `predict_proba`)
- Plot the Precision-Recall curve and calculate the F1 score (for example, with `plt.subplots(nrows=1, ncols=2)`)
- Print the trickiest misclassified objects. Why were they hard to classify?


In [33]:
#import time as tm
from time import time

In [34]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, precision_recall_curve, roc_curve, roc_auc_score

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [36]:
rf_model = RandomForestClassifier(n_estimators=500)

start = time()
rf_model.fit(X_train_0, y_train)
end = time()

print('lead time: {:.3f} seconds'.format(end - start))

rf_pred_train = rf_model.predict(X_train_0)
rf_pred_test = rf_model.predict(X_test_0)

accuracy_score_rf_train = accuracy_score(y_train, rf_pred_train)
accuracy_score_rf_test = accuracy_score(y_test, rf_pred_test)

precision_score_rf_train = precision_score(y_train, rf_pred_train)
precision_score_rf_test = precision_score(y_test, rf_pred_test)

recall_score_rf_train = recall_score(y_train, rf_pred_train)
recall_score_rf_test = recall_score(y_test, rf_pred_test)

f1_score_rf_train = f1_score(y_train, rf_pred_train)
f1_score_rf_test  = f1_score(y_test, rf_pred_test)

precision_recall_curve_rf_train = precision_recall_curve(y_train, rf_pred_train)
precision_recall_curve_rf_test  = precision_recall_curve(y_test, rf_pred_test)

roc_curve_rf_train = roc_curve(y_train, rf_pred_train)
roc_curve_rf_test  = roc_curve(y_test, rf_pred_test)

roc_auc_score_rf_train = roc_auc_score(y_train, rf_pred_train)
roc_auc_score_rf_test  = roc_auc_score(y_test, rf_pred_test)

lead time: 338.635 seconds


In [37]:
lr_model = LogisticRegression(max_iter=100_000)  # integer value; recent scikit-learn versions reject the float 1e5

start = time()
lr_model.fit(X_train_0, y_train)
end = time()

print('lead time: {:.3f} seconds'.format(end - start))

lr_pred_train = lr_model.predict(X_train_0)
lr_pred_test  = lr_model.predict(X_test_0)

precision_score_lr_train = precision_score(y_train, lr_pred_train)
precision_score_lr_test  = precision_score(y_test,  lr_pred_test)

recall_score_lr_train = recall_score(y_train, lr_pred_train)
recall_score_lr_test  = recall_score(y_test,  lr_pred_test)

f1_score_lr_train = f1_score(y_train, lr_pred_train)
f1_score_lr_test  = f1_score(y_test, lr_pred_test)

precision_recall_curve_lr_train = precision_recall_curve(y_train, lr_pred_train)
precision_recall_curve_lr_test  = precision_recall_curve(y_test, lr_pred_test)

roc_curve_lr_train = roc_curve(y_train, lr_pred_train)
roc_curve_lr_test  = roc_curve(y_test, lr_pred_test)

roc_auc_score_lr_train = roc_auc_score(y_train, lr_pred_train)
roc_auc_score_lr_test  = roc_auc_score(y_test, lr_pred_test)

lead time: 15.927 seconds


In [38]:
d = {'RandomForestClassifier':   [precision_score_rf_train, 
                                  precision_score_rf_test,
                                     recall_score_rf_train,
                                     recall_score_rf_test,
                                         f1_score_rf_train,
                                         f1_score_rf_test,
                                    roc_auc_score_rf_train, 
                                    roc_auc_score_rf_test],
     'LogisticRegression':       [precision_score_lr_train,  
                                  precision_score_lr_test,
                                     recall_score_lr_train,
                                     recall_score_lr_test,
                                         f1_score_lr_train,
                                         f1_score_lr_test,
                                    roc_auc_score_lr_train,
                                    roc_auc_score_lr_test]}
pd.DataFrame(data=d, index = ['precision_score_train',
                              'precision_score_test',
                              'recall_score_train',
                              'recall_score_test',
                              'f1_score_train',
                              'f1_score_test',
                              'roc_auc_score_train', 
                              'roc_auc_score_test'])

Unnamed: 0,RandomForestClassifier,LogisticRegression
precision_score_train,1.0,0.9992
precision_score_test,0.843099,0.87564
recall_score_train,1.0,0.9988
recall_score_test,0.8748,0.89
f1_score_train,1.0,0.999
f1_score_test,0.858657,0.882761
roc_auc_score_train,1.0,0.999
roc_auc_score_test,0.856,0.8818
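The ROC AUC values in the table above were computed from the hard 0/1 labels returned by `predict`. The task hint suggests `predict_proba` instead, since a ROC curve needs a continuous score to sweep thresholds over. A minimal sketch on synthetic data (a stand-in, not the review dataset) shows the difference:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the review features (not the lab data)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class: a continuous score with many thresholds
proba = model.predict_proba(X_te)[:, 1]
# Hard labels: only 0/1, so the "curve" has a single interior point
labels = model.predict(X_te)

auc_proba = roc_auc_score(y_te, proba)
auc_labels = roc_auc_score(y_te, labels)
fpr, tpr, thresholds = roc_curve(y_te, proba)

print(f"ROC AUC from probabilities: {auc_proba:.3f}")
print(f"ROC AUC from hard labels:   {auc_labels:.3f}")
print(f"points on the ROC curve:    {len(fpr)}")
```

The same applies to `precision_recall_curve`: passing `proba` rather than `labels` yields a full curve instead of a single operating point.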


In [39]:
#X_train_importances = [X_train_importances.append(X_train_0[:,i]) for i in ind_15k]

Which model gives higher scores? Any ideas why? Please suggest 1-2 reasons.

*Answer:*

**LogisticRegression performed better on the test set.**

I'm not sure why the forest did worse.

Maybe because of irony? A "great film" can be meant negatively. But logistic regression is not immune to this either.

Maybe trees simply do poorly on such huge, sparse feature spaces.

### More sophisticated feature preprocessing

As we have seen, simple BOW can give us some result - it's time to improve it.

**Task 1.4 [1 point] - Frequencies calculation**

- Calculate top-20 words in train set and test set. *Are they meaningful?*
- Import `stopwords` and print some of them. What are those?
- Recalculate top-20 words in each set, but exclude stop words.
- Does the top-20 now include more useful words?

In [40]:
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

https://pythonworld.ru/moduli/modul-collections.html

**Calculate top-20 words in train set and test set. Are they meaningful?**

In [41]:
#top-20 words in train set
X_train

15061    A very silly movie, this starts with a soft po...
10112    1st watched 8/3/2003 - 2 out of 10(Dir-Brad Sy...
24550    This is a really heart-warming family movie. I...
2570     Nicole Kidman is a wonderful actress and here ...
16053    It's very hard to say just what was going on w...
                               ...                        
20210    The Three Stooges has always been some of the ...
15989    And a made for TV movie too, this movie was go...
20873    Back in my days as an usher \Private Lessons\"...
10422    I might not be a huge admirer of the original ...
21768    This movie was an impressive one. My first exp...
Name: review, Length: 20000, dtype: object

In [42]:
s = ' '.join(map(str, X_train))

In [43]:
s[0:100]

'A very silly movie, this starts with a soft porn sequence, ventures into farcelike comedy in the art'

In [44]:
Whitespace = WhitespaceTokenizer().tokenize(s)
Counter_Whitespace = Counter(Whitespace)
top_Whitespace = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]

In [45]:
WordPunct = WordPunctTokenizer().tokenize(s)
Counter_WordPunct = Counter(WordPunct)
top_WordPunct = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_WordPunct.items()], reverse=True)]

In [46]:
TreebankWord = TreebankWordTokenizer().tokenize(s)
Counter_TreebankWord = Counter(TreebankWord)
top_TreebankWord = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_TreebankWord.items()], reverse=True)]

In [47]:
print('top_Whitespace','top_WordPunct','top_TreebankWord', sep = '\t\t\t')
print('-'*100)
for i in range(20): 
    print(top_Whitespace[i], top_WordPunct[i], top_TreebankWord[i], sep = '\t\t\t')

top_Whitespace			top_WordPunct			top_TreebankWord
----------------------------------------------------------------------------------------------------
('the', 230191)			('the', 232718)			('the', 231895)
('a', 124236)			(',', 210946)			(',', 221238)
('and', 122525)			('.', 180270)			('and', 125458)
('of', 114632)			('and', 126065)			('a', 125014)
('to', 106231)			('a', 125408)			('of', 115057)
('is', 82852)			('of', 115592)			('to', 106660)
('in', 68303)			('to', 107393)			('is', 86915)
('I', 52896)			("'", 104320)			('/', 81829)
('that', 51864)			('is', 85332)			('>', 81789)
('this', 45937)			('br', 81648)			('<', 81742)
('it', 43691)			('in', 70114)			('br', 81648)
('/><br', 40824)			('I', 65898)			('in', 69149)
('was', 37452)			('it', 62493)			('I', 65008)
('as', 33868)			('that', 56535)			('it', 56833)
('with', 33567)			('s', 50464)			('that', 55557)
('for', 32891)			('this', 48839)			("''", 52892)
('The', 27102)			('-', 44759)			("'s", 49338)
('but', 27011)			('/><', 40824)			('thi

The words are not meaningful: they are mostly function words, punctuation, and HTML remnants such as `br`.
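As a side note, the sort-and-swap idiom used in the cells above can be written more compactly with `Counter.most_common`. A small sketch on a made-up sentence:

```python
from collections import Counter

text = "the movie was good and the plot was good too"   # made-up sample
counts = Counter(text.split())                           # whitespace tokenization

# most_common(n) returns the n most frequent (token, count) pairs
top3 = counts.most_common(3)
print(top3)
```

The ordering of tokens with equal counts may differ from the sorted-tuple version, so ties are the only place where the two approaches can disagree.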

In [48]:
from nltk.corpus import stopwords

https://pythonspot.com/nltk-stop-words/

In [49]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [50]:
stopWords = stopwords.words('english')
stopWords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [51]:
def del_stopWorsd(Words,stopWords):
    n = len(Words)
    new_Words = []
    for i in range(n):
        if not(Words[i][0] in stopWords):
            new_Words.append(Words[i])
    return new_Words

In [52]:
top_Whitespace_del_stopWorsd   = del_stopWorsd(top_Whitespace,stopWords)
top_WordPunct_del_stopWorsd    = del_stopWorsd(top_WordPunct, stopWords)
top_TreebankWord_del_stopWorsd = del_stopWorsd(top_TreebankWord,stopWords)

print('top_Whitespace','top_WordPunct','top_TreebankWord', sep = '\t\t\t')
print('-'*100)
for i in range(20): 
    print(top_Whitespace_del_stopWorsd[i], 
          top_WordPunct_del_stopWorsd[i], 
          top_TreebankWord_del_stopWorsd[i], 
          sep = '\t\t\t')

top_Whitespace			top_WordPunct			top_TreebankWord
----------------------------------------------------------------------------------------------------
('I', 52896)			(',', 210946)			(',', 221238)
('/><br', 40824)			('.', 180270)			('/', 81829)
('The', 27102)			("'", 104320)			('>', 81789)
('movie', 24625)			('br', 81648)			('<', 81742)
('film', 22030)			('I', 65898)			('br', 81648)
('one', 16664)			('-', 44759)			('I', 65008)
('like', 14536)			('/><', 40824)			("''", 52892)
('This', 9886)			('/>', 39157)			("'s", 49338)
('would', 9528)			('The', 36086)			('The', 34492)
('good', 9043)			('movie', 34953)			('movie', 29399)
('It', 8759)			('film', 31956)			(')', 29058)
('really', 8691)			('\\"', 31496)			('(', 28452)
('even', 8569)			('.<', 29402)			('film', 27091)
('see', 8190)			('(', 27203)			("n't", 26125)
('-', 7494)			('one', 19670)			('\\', 20220)
('get', 6977)			('like', 15645)			('!', 19962)
('much', 6888)			('It', 14736)			('one', 18004)
('story', 6838)			(')', 13515)			('like',

There is still a lot of "junk": **'/><br'**, **'br'**, **','**, **'>'**, **'\\'**, etc.

In [53]:
'br' in stopwords.words()

False

In [54]:
stopSign = ['.', ',', '<', '>', '"', '\\', '|', '/', ';', ':', '-', '\\', "'",
            '/><', '(', ')', '/><br','\\"','.<', '-','/>','br','...','!', '&', '?', "''",
           '/>The', '<br', '/>I']

In [55]:
top_Whitespace_del   = del_stopWorsd(top_Whitespace_del_stopWorsd,stopSign)
top_WordPunct_del  = del_stopWorsd(top_WordPunct_del_stopWorsd,stopSign)
top_TreebankWord_del = del_stopWorsd(top_TreebankWord_del_stopWorsd,stopSign)

print('top_Whitespace','top_WordPunct','top_TreebankWord', sep = '\t\t\t')
print('-'*100)
for i in range(20): 
    print(top_Whitespace_del[i], 
          top_WordPunct_del[i], 
          top_TreebankWord_del[i], 
          sep = '\t\t\t')

top_Whitespace			top_WordPunct			top_TreebankWord
----------------------------------------------------------------------------------------------------
('I', 52896)			('I', 65898)			('I', 65008)
('The', 27102)			('The', 36086)			("'s", 49338)
('movie', 24625)			('movie', 34953)			('The', 34492)
('film', 22030)			('film', 31956)			('movie', 29399)
('one', 16664)			('one', 19670)			('film', 27091)
('like', 14536)			('like', 15645)			("n't", 26125)
('This', 9886)			('It', 14736)			('one', 18004)
('would', 9528)			('This', 12012)			('like', 15097)
('good', 9043)			('good', 11492)			('It', 14448)
('It', 8759)			('time', 9922)			('This', 11832)
('really', 8691)			('would', 9834)			('would', 10538)
('even', 8569)			('story', 9267)			('good', 10154)
('see', 8190)			('really', 9108)			('really', 8956)
('get', 6977)			('see', 8963)			('even', 8848)
('much', 6888)			('even', 8896)			('see', 8568)
('story', 6838)			('much', 7635)			('story', 8060)
('time', 6236)			('well', 7301)			('time', 7700)
('

**Whitespace** tokenization surfaces more useful words at first; however, once the symbols are removed, all three methods give roughly the same result.
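Tokens such as 'I', 'The', and 'This' survive the filtering above because the NLTK stop list is lowercase and the membership test is case-sensitive. A sketch of a case-insensitive filter follows; the tiny stop set is illustrative only and stands in for `stopwords.words('english')`.

```python
from collections import Counter

# Small illustrative stop list; in the notebook this would be stopwords.words('english')
stop_set = {"i", "the", "this", "it", "a", "is"}

tokens = "I liked The movie , this movie is a good movie".split()

# Lowercase before the lookup so that 'I' and 'The' are filtered too,
# and keep only alphabetic tokens to drop punctuation and markup debris
filtered = [t.lower() for t in tokens if t.lower() not in stop_set and t.isalpha()]

print(Counter(filtered).most_common(2))
```

Using a `set` for the stop list also makes each lookup constant-time, which matters when filtering a vocabulary of this size.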

**Task 1.5 [1 point] - Word Freqs by class**

Do you think the top-100 tokens for the positive and negative classes will be different? Use data to prove your point.

In [56]:
positive_class = X[y==1] 
negative_class = X[y==0] 
positive_class.shape[0] + negative_class.shape[0] == X.shape[0]

True

In [57]:
s_positive = ' '.join(map(str, positive_class))
s_negative = ' '.join(map(str, negative_class))

In [58]:
Whitespace = WhitespaceTokenizer().tokenize(s_positive)
Counter_Whitespace = Counter(Whitespace)
top_positive = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]

Whitespace = WhitespaceTokenizer().tokenize(s_negative)
Counter_Whitespace = Counter(Whitespace)
top_negative = [(l,k) for k,l in sorted([(j,i) for i,j in Counter_Whitespace.items()], reverse=True)]

In [59]:
stopSign = ['.', ',', '<', '>', '"', '\\', '|', '/', ';', ':', '-', '\\', "'",
            '/><', '(', ')', '/><br','\\"','.<', '-','/>','br','...','!', '&', '?', "''",
           '/>The', '<br', '/>I', ]

In [60]:
top_positive = del_stopWorsd(top_positive,stopWords)
top_positive = del_stopWorsd(top_positive,stopSign) 
#top_positive[:20]

top_negative = del_stopWorsd(top_negative,stopWords)
top_negative = del_stopWorsd(top_negative,stopSign) 
#top_negative[:20]

In [61]:
#print('top_positive','top_negative', sep = '\t\t\t')
#print('-'*100)
#for i in range(100): 
#    print(top_positive[i], top_negative[i], sep = '\t\t\t')

*Answer:*

Roughly the same result ('I', 'film', 'see', ...).

Words with a positive connotation are more frequent in top_positive: ('like', 7978), ('good', 5793), ('great', 5100), ('well', 3606), ('best', 3387), ('love', 3325), ('better', 1879)...

However, they occur in top_negative as well: ('like', 10155) and ('better', 2527) are even more frequent there, and ('good', 5642) is almost as frequent.
This will increase the training error.
But ('great', 2091), ('love', 1648), ('best', 1588) are roughly half as frequent.
Words with a negative connotation are more frequent there: ('bad', 5096), ('never', 3010), ('least', 1775), but there do not seem to be that many of them either.
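One way to back this comparison with data is to take the top-N tokens of each class as sets and look at their intersection and differences. The sketch below uses made-up reviews in place of `positive_class` and `negative_class`; `top_tokens` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def top_tokens(docs, n):
    """Return the set of the n most frequent lowercase whitespace tokens."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return {tok for tok, _ in counts.most_common(n)}

# Made-up reviews standing in for positive_class and negative_class
pos_docs = ["great film great cast", "great story loved it"]
neg_docs = ["bad film bad cast", "bad story hated it"]

top_pos = top_tokens(pos_docs, 3)
top_neg = top_tokens(neg_docs, 3)

print("shared:       ", sorted(top_pos & top_neg))
print("only positive:", sorted(top_pos - top_neg))
print("only negative:", sorted(top_neg - top_pos))
```

On the real data, the size of the two difference sets for N = 100 gives a direct measure of how much the class vocabularies diverge.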

**Task 1.6 [2 points] - Reducing dimensionality**

The goal is to reduce the number of features to 15000.

Implement the following methods of dimensionality reduction:
1. Use CountVectorizer, but leave only 15k most frequent tokens
2. Use HashingVectorizer with 15k features
3. Use 15k most important features from perspective of previously trained RandomForest

*Hints:*
- in 1 and 2 you don't have to apply nltk.corpus.stopwords; the vectorizers have a `stop_words` parameter
- in 1 look for `vocabulary` parameter
- in 3... remember `lab02`? You may use `X_train_0` and `X_test_0` as input matrices

Train LogisticRegression and RandomForest on each dataset and compare ROC AUC scores of the classifiers.

In [62]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

In [63]:
from sklearn.metrics import roc_auc_score

In [64]:
len(top_TreebankWord)

146099

In [65]:
top = top_TreebankWord
top = del_stopWorsd(top,stopWords)
top = del_stopWorsd(top,stopSign)

In [66]:
top_15k = top[:15000]  # use the filtered list built in the previous cell

In [67]:
list_top_15k = [w[0] for w in top_15k]

In [68]:
#count_vectorizer = CountVectorizer()
#count_vectorizer.fit(list_top_15k)

**1. Use CountVectorizer, but leave only 15k most frequent tokens**

In [69]:
count_vectorizer = CountVectorizer(stop_words = 'english', max_features = 15000)
count_vectorizer.fit(X_train.to_list())

CountVectorizer(max_features=15000, stop_words='english')

In [70]:
X_train_cv = count_vectorizer.transform(X_train.to_list())
X_test_cv  = count_vectorizer.transform(X_test.to_list())

In [71]:
rf_cv = RandomForestClassifier()

start = time()
rf_cv.fit(X_train_cv, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

rf_cv_pred_train = rf_cv.predict(X_train_cv)
rf_cv_pred_test = rf_cv.predict(X_test_cv)

rf_cv_score_train = roc_auc_score(y_train, rf_cv_pred_train)
rf_cv_score_test  = roc_auc_score(y_test,  rf_cv_pred_test)

lead time: 37.891 seconds


In [72]:
rf_cv_score_train, rf_cv_score_test

(1.0, 0.8514)

In [73]:
lr_cv = LogisticRegression()

start = time()
lr_cv.fit(X_train_cv, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lr_cv_pred_train = lr_cv.predict(X_train_cv)
lr_cv_pred_test  = lr_cv.predict(X_test_cv)

lr_cv_score_train = roc_auc_score(y_train, lr_cv_pred_train)
lr_cv_score_test  = roc_auc_score(y_test,  lr_cv_pred_test)

lead time: 0.843 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [74]:
lr_cv_score_train, lr_cv_score_test

(0.9959, 0.8738)

**2. Use HashingVectorizer with 15k features**

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

In [75]:
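# NOTE: the task asks for 15k features, but n_features=1500 is used here;
# the scores reported below were obtained with this value
vectorizer = HashingVectorizer(stop_words = 'english', n_features=1500)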
hashing_vectorizer = vectorizer.fit(X_train.to_list())

In [76]:
X_train_hv = hashing_vectorizer.transform(X_train.to_list())
X_test_hv  = hashing_vectorizer.transform(X_test.to_list())
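Unlike `CountVectorizer`, `HashingVectorizer` keeps no vocabulary: each token is mapped to a column index by a hash function, so the output width is fixed by `n_features` and `fit` has nothing to learn. A small sketch with made-up documents:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["good movie", "bad movie", "a completely unseen sentence"]   # made-up examples

# No vocabulary is stored: each token is mapped to a column by a hash function,
# so transform() works without fitting and never meets an "unknown" word
hv = HashingVectorizer(n_features=15000, stop_words="english")
X_h = hv.transform(docs)

print(X_h.shape)                 # number of columns equals n_features
print(hasattr(hv, "vocabulary_"))
```

The trade-off is that different tokens can collide in the same column, and the smaller `n_features` is, the more collisions occur.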

In [77]:
rf_hv = RandomForestClassifier()

start = time()
rf_hv.fit(X_train_hv, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

rf_hv_pred_train = rf_hv.predict(X_train_hv)
rf_hv_pred_test = rf_hv.predict(X_test_hv)

rf_hv_score_train = roc_auc_score(y_train, rf_hv_pred_train)
rf_hv_score_test  = roc_auc_score(y_test,  rf_hv_pred_test)

lead time: 56.483 seconds


In [78]:
rf_hv_score_train, rf_hv_score_test

(1.0, 0.8022000000000001)

In [79]:
lr_hv = LogisticRegression()

start = time()
lr_hv.fit(X_train_hv, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lr_hv_pred_train = lr_hv.predict(X_train_hv)
lr_hv_pred_test  = lr_hv.predict(X_test_hv)

lr_hv_score_train = roc_auc_score(y_train, lr_hv_pred_train)
lr_hv_score_test  = roc_auc_score(y_test,  lr_hv_pred_test)

lead time: 0.523 seconds


In [80]:
lr_hv_score_train, lr_hv_score_test

(0.84725, 0.8288)

**3. Use 15k most important features from perspective of previously trained RandomForest**

In [81]:
X_train_0.shape, X_test_0.shape

((20000, 74849), (5000, 74849))

In [82]:
rf_model = RandomForestClassifier()
rf_model.fit(X_train_0, y_train)
#rf_pred_train = rf_model.predict(X_train_0)
#rf_pred_test = rf_model.predict(X_test_0)

RandomForestClassifier()

In [83]:
rf_feature_importances = rf_model.feature_importances_

In [84]:
type(rf_feature_importances)

numpy.ndarray

In [85]:
d = {'rf_feature_importances': rf_feature_importances}
df = pd.DataFrame(data=d) 
df = df.sort_values(by=['rf_feature_importances'])

In [86]:
n = len(rf_feature_importances)-1
df_15k = df[n-15000:n]

In [87]:
ind_15k = df_15k.index 

In [88]:
X_train_0.shape

(20000, 74849)

In [89]:
X_train_importances = X_train_0[:,ind_15k]

In [90]:
X_train_importances.shape

(20000, 15000)

In [91]:
X_test_importances = X_test_0[:,ind_15k]
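Note that the slice `df[n-15000:n]` with `n = len(rf_feature_importances) - 1` stops one row short of the end of the ascending-sorted frame, so it leaves out the single highest-ranked feature. A sketch of an alternative selection with `np.argsort`, using made-up importances:

```python
import numpy as np

# Made-up importances for 6 features; in the notebook this would be rf_feature_importances
importances = np.array([0.05, 0.30, 0.10, 0.00, 0.40, 0.15])
k = 3

# argsort is ascending, so the last k positions are the k largest importances
top_k_idx = np.argsort(importances)[-k:]
print(sorted(top_k_idx.tolist()))

# Column selection then works the same way on dense or sparse matrices: X[:, top_k_idx]
X_demo = np.arange(12).reshape(2, 6)
print(X_demo[:, top_k_idx].shape)
```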

In [92]:
rf_rf = RandomForestClassifier()

start = time()
rf_rf.fit(X_train_importances, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

rf_rf_pred_train = rf_rf.predict(X_train_importances)
rf_rf_pred_test  = rf_rf.predict(X_test_importances)

rf_rf_score_train = roc_auc_score(y_train, rf_rf_pred_train)
rf_rf_score_test  = roc_auc_score(y_test,  rf_rf_pred_test)

lead time: 46.225 seconds


In [93]:
lr_rf = LogisticRegression(max_iter=100_000)  # integer value; recent scikit-learn versions reject the float 1e5

start = time()
lr_rf.fit(X_train_importances, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lead time: 8.736 seconds


In [94]:
lr_rf_pred_train = lr_rf.predict(X_train_importances)
lr_rf_pred_test  = lr_rf.predict(X_test_importances)

lr_rf_score_train = roc_auc_score(y_train, lr_rf_pred_train)
lr_rf_score_test  = roc_auc_score(y_test,  lr_rf_pred_test)

In [95]:
d = {'CountVectorizer':       [rf_cv_score_train, rf_cv_score_test,lr_cv_score_train, lr_cv_score_test],
     'HashingVectorizer':     [rf_hv_score_train, rf_hv_score_test,lr_hv_score_train, lr_hv_score_test],
     'RandomForestClassifier':[rf_rf_score_train, rf_rf_score_test,lr_rf_score_train, lr_rf_score_test]}

pd.DataFrame(data=d, index =  ['rf_train','rf_test','lr_train','lr_test' ])

Unnamed: 0,CountVectorizer,HashingVectorizer,RandomForestClassifier
rf_train,1.0,1.0,1.0
rf_test,0.8514,0.8022,0.8406
lr_train,0.9959,0.84725,0.99725
lr_test,0.8738,0.8288,0.8738
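
The scores in this table come from `roc_auc_score` applied to the output of `predict`, that is, to hard 0/1 labels. ROC AUC is a ranking metric, and with only two distinct score values it reduces to balanced accuracy. The sketch below illustrates the difference with a small pure-Python AUC; the helper `auc` and the toy labels and probabilities are assumptions made for illustration and are not part of the lab data.

```python
def auc(y_true, scores):
    """ROC AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.30, 0.45, 0.80, 0.40, 0.60]

# Hard labels, as predict() would return them with a 0.5 threshold.
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = auc(y_true, proba)
auc_from_hard = auc(y_true, hard)

# Balanced accuracy of the hard labels, for comparison.
tpr = sum(h == 1 for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(h == 0 for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)
balanced_accuracy = (tpr + tnr) / 2

print(auc_from_proba, auc_from_hard, balanced_accuracy)
```

With scikit-learn, the corresponding change would be to pass `model.predict_proba(X)[:, 1]` to `roc_auc_score` instead of `model.predict(X)`.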


**Task 1.7 [2 points] - Token Normalization**

Choose the best working method from the previous task. Try to improve it by applying a token normalization technique.

You may use one of **normalizers imported below**, but feel free to experiment.

Do the following:
- Apply normalizer to X_train, X_test
- Build BOW with CountVectorizer + stopwords. What are the shapes of train and test matrices now?
- Reduce dimensionality with the best method from Task 2.6. You may try all of them
- Train LR/RF to examine whether ROC AUC or Accuracy was improved.

In [96]:
from nltk.stem import WordNetLemmatizer, PorterStemmer

**Lemmatization** and **stemming** are both ways of reducing a word to its base form.

The difference between **stemming** and **lemmatization** is that lemmatization takes context into account and converts a word to its meaningful base form (lemma), whereas stemming simply strips the last few characters, which often produces incorrect meanings and misspelled words.

For example, lemmatization that treats **"caring"** as a verb returns **"care"**, whereas crude suffix stripping would cut off "ing" and leave **"car"**. The NLTK classes imported above behave somewhat differently: `WordNetLemmatizer.lemmatize` treats a word as a noun unless `pos` is given, so it returns "caring" unchanged, and `PorterStemmer` returns "care".
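
The contrast can be sketched with two toy functions: a stemmer that blindly strips a suffix and a lemmatizer that looks the word up in a small dictionary. Both are illustrative assumptions; neither reproduces the algorithm of `PorterStemmer` or `WordNetLemmatizer`.

```python
# Toy illustration only: real stemmers and lemmatizers use far richer rules.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMA_DICT = {"caring": "care", "cared": "care", "cares": "care", "mice": "mouse"}

def naive_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if it is known, otherwise the word unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["caring", "cared", "mice"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
```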

In [97]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [98]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caring"))

stemmer = PorterStemmer()
print(stemmer.stem("caring"))

caring
care


In [99]:
def Lemmatizer(X):
    Y = []
    for i in X:
        word_list = TreebankWordTokenizer().tokenize(i)
        word_list = del_stopWorsd(word_list,stopWords)
        word_list = del_stopWorsd(word_list,stopSign)
        lemm_srt  = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
        Y.append(lemm_srt)
    return Y

In [100]:
#X_train[:2]

In [101]:
#X_train_Lemmatizer = Lemmatizer(X_train[:2])
#X_train_Lemmatizer

In [102]:
X_train_Lemmatizer = Lemmatizer(X_train)
X_test_Lemmatizer = Lemmatizer(X_test)

In [103]:
def Stemmer(X):
    Y = []
    for i in X:
        word_list = TreebankWordTokenizer().tokenize(i)
        word_list = del_stopWorsd(word_list,stopWords)
        word_list = del_stopWorsd(word_list,stopSign)
        lemm_srt  = ' '.join([stemmer.stem(w) for w in word_list])
        Y.append(lemm_srt)
    return Y

In [104]:
X_train_Stemmer = Stemmer(X_train)
X_test_Stemmer = Stemmer(X_test)

In [105]:
type(X_train_Lemmatizer)

list

In [106]:
len(X_train_Lemmatizer)

20000

In [107]:
count_vectorizer = CountVectorizer(stop_words = 'english', max_features = 15000)
count_vectorizer.fit(X_train_Lemmatizer)
# MemoryError: 

CountVectorizer(max_features=15000, stop_words='english')

In [108]:
X_train = count_vectorizer.transform(X_train_Lemmatizer)
X_test  = count_vectorizer.transform(X_test_Lemmatizer)

In [109]:
rf = RandomForestClassifier()

start = time()
rf.fit(X_train, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

rf_pred_train = rf.predict(X_train)
rf_pred_test = rf.predict(X_test)

rf_roc_auc_train_l = roc_auc_score(y_train, rf_pred_train)
rf_roc_auc_test_l  = roc_auc_score(y_test,  rf_pred_test)

lead time: 33.605 seconds


In [110]:
lr = LogisticRegression()

start = time()
lr.fit(X_train, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lr_pred_train = lr.predict(X_train)
lr_pred_test  = lr.predict(X_test)

lr_roc_auc_train_l = roc_auc_score(y_train, lr_pred_train)
lr_roc_auc_test_l  = roc_auc_score(y_test,  lr_pred_test)

lead time: 0.695 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [111]:
count_vectorizer = CountVectorizer(stop_words = 'english', max_features = 15000)
count_vectorizer.fit(X_train_Stemmer)

CountVectorizer(max_features=15000, stop_words='english')

In [112]:
X_train = count_vectorizer.transform(X_train_Stemmer)
X_test = count_vectorizer.transform(X_test_Stemmer)

In [113]:
rf = RandomForestClassifier()

start = time()
rf.fit(X_train, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

rf_pred_train = rf.predict(X_train)
rf_pred_test = rf.predict(X_test)

rf_roc_auc_train_s = roc_auc_score(y_train, rf_pred_train)
rf_roc_auc_test_s  = roc_auc_score(y_test,  rf_pred_test)

lead time: 33.052 seconds


In [114]:
lr = LogisticRegression()

start = time()
lr.fit(X_train, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lr_pred_train = lr.predict(X_train)
lr_pred_test  = lr.predict(X_test)

lr_roc_auc_train_s = roc_auc_score(y_train, lr_pred_train)
lr_roc_auc_test_s  = roc_auc_score(y_test,  lr_pred_test)

lead time: 0.689 seconds


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [115]:
d = {'Lemmatizer': [rf_roc_auc_train_l, rf_roc_auc_test_l, lr_roc_auc_train_l, lr_roc_auc_test_l],
     'Stemmer':    [rf_roc_auc_train_s, rf_roc_auc_test_s, lr_roc_auc_train_s, lr_roc_auc_test_s]}
pd.DataFrame(data=d, index=['rf_train', 'rf_test', 'lr_train', 'lr_test'])

Unnamed: 0,Lemmatizer,Stemmer
rf_train,1.0,1.0
rf_test,0.8296,0.8246
lr_train,0.98235,0.97995
lr_test,0.85,0.8496


With both the lemmatizer and the stemmer, LogisticRegression scores higher on the test set than RandomForestClassifier (0.8500 vs. 0.8296 and 0.8496 vs. 0.8246).

The two normalizers give almost the same results, with the lemmatizer marginally ahead for both models. Neither improves on the LogisticRegression test score of 0.8738 from the previous comparison table.

## Part 2. Word Embeddings [7 points]

In [116]:
#!pip3 install gensim==3.8.3

In [117]:
import gensim.downloader

Here is the list of pretrained word embedding models. We suggest using `glove-wiki-gigaword-100`.

In [118]:
list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [119]:
#word_embeddings = gensim.downloader.load("glove-wiki-gigaword-100")
#---------------------------------------------------------------------------
#MemoryError

**MemoryError**: Unable to allocate 153. MiB for an array with shape (400000, 100) and data type float32

**Task 2.1 [2 points] - WordEmbeddings Geometry**

As you probably know, the vector space of word embeddings has non-trivial geometry: some word relations (like country-capital or singular-plural) can be represented by vectors, like: **(king - man) + woman = queen**

<img src="https://linkme.ufanet.ru/images/5687a2011b49eb2413912f1c7d0fb0bd.png" width=600px>

Check this statement on words from the above picture with `word_embeddings.most_similar` function. Pay attention to `positive` and `negative` params.

Provide **several** examples, make sure to present different relations: some for nouns, some for verbs.

In [120]:
#trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
#queen

In [121]:
#nouns
#trained_model.most_similar(positive=['', ''], negative=[''])

In [122]:
#verbs
#trained_model.most_similar(positive=['', ''], negative=[''])
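
Roughly speaking, `most_similar` adds the vectors of the `positive` words, subtracts those of the `negative` words, and ranks the rest of the vocabulary by cosine similarity to the result. The sketch below reproduces that idea on hand-made three-dimensional vectors; the vectors and the simplified `most_similar` helper are assumptions for illustration, not gensim's implementation or real GloVe values.

```python
import math

# Toy embeddings (hypothetical values): dimensions ~ royalty, male, female.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.0],
    "apple":  [0.0, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    exclude = set(positive) | set(negative)  # query words are not candidates
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

result = most_similar(vectors, positive=["woman", "king"], negative=["man"])
print(result)
```

The same `positive`/`negative` pattern applies to verb relations, for example a tense analogy built from two present/past pairs.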

**Task 2.2 [2 points] - POS analysis**

Use POS tagger to calculate most common POS in the dataset. 
Here you may read about nltk-taggers: [link](https://www.inf.ed.ac.uk/teaching/courses/icl/nltk/tagging.pdf)

- If you were to design POS-related weights, how would you do it? 
- What POS would get the higher weight? 

**Part-of-Speech Tagging**

Part-of-speech tags divide words into categories based on how they can be combined to form sentences. For example, articles can combine with nouns but not with verbs.

Part-of-speech tags also provide information about the semantic content of a word. For example, nouns typically express "things", and prepositions express relationships between "things".

Most part-of-speech tag sets use the same basic categories, such as "noun", "verb", "adjective", and "preposition".

However, tag sets differ both in how finely they divide words into categories and in how they define those categories. For example, "is" might be tagged as a verb in one tag set, but as a form of "to be" in another.

This variety is reasonable, since part-of-speech tags are used in different ways for different tasks. This tutorial uses the tag set listed below, which is a simplification of the widely used Brown Corpus tag set. The full Brown Corpus tag set has 87 basic tags. For more on tag sets, see *Foundations of Statistical Natural Language Processing* (Manning & Schutze), pp. 139–145.

* AT Article
* NN Noun
* VB Verb
* JJ Adjective
* IN Preposition
* CD Number
* END Sentence-ending punctuation

https://www.nltk.org/book/ch05.html

https://www.guru99.com/pos-tagging-chunking-nltk.html

In [123]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [124]:
from nltk import pos_tag

In [125]:
text ="learn php from guru99 and make study easy".split()
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]


In [126]:
#X

In [127]:
#X_Lemmatizer = Lemmatizer(X)
#len(X_Lemmatizer)
#X_Lemmatizer[:10]

In [128]:
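# pos_tag returns Penn Treebank tags, so the Brown-style tags 'AT' and 'END' never match
# and stay 0; exact matching also ignores subtypes such as 'NNS' or 'VBD'.
m = X.shape[0]
AT, NN, VB, JJ, IN, CD, END = 0, 0, 0, 0, 0, 0, 0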
for j in range(m):
    tokens_tag = pos_tag(X[j].split())
    n = len(tokens_tag)
    for i in range(n):
        AT += int(tokens_tag[i][1] == 'AT')
        NN += int(tokens_tag[i][1] == 'NN')
        VB += int(tokens_tag[i][1] == 'VB')
        JJ += int(tokens_tag[i][1] == 'JJ')
        IN += int(tokens_tag[i][1] == 'IN')
        CD += int(tokens_tag[i][1] == 'CD')
        END += int(tokens_tag[i][1] == 'END')
AT, NN, VB, JJ, IN, CD, END

(0, 977965, 218833, 506548, 657833, 60646, 0)

In [129]:
d = {'AT':AT, 'NN':NN, 'VB':VB, 'JJ':JJ, 'IN':IN, 'CD':CD, 'END':END}
d

{'AT': 0,
 'NN': 977965,
 'VB': 218833,
 'JJ': 506548,
 'IN': 657833,
 'CD': 60646,
 'END': 0}
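
`AT` and `END` are zero because `nltk.pos_tag` emits Penn Treebank tags by default, where articles fall under `DT` and sentence-final punctuation is tagged `.`; inflected forms such as `NNS` or `VBD` are also separate tags. A possible way to count coarse categories is sketched below on the tagged example from earlier in this part; the grouping in `coarse_tag` is an assumption, not a standard mapping.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag example above.
tagged = [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'),
          ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]

def coarse_tag(tag):
    """Map a Penn Treebank tag to a coarse category (hypothetical grouping)."""
    if tag.startswith('NN'):
        return 'NOUN'   # NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'VERB'   # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'ADJ'    # JJ, JJR, JJS
    if tag == 'DT':
        return 'DET'    # determiners, including articles
    if tag == 'IN':
        return 'PREP'
    if tag == 'CD':
        return 'NUM'
    if tag == '.':
        return 'END'    # sentence-final punctuation
    return 'OTHER'

counts = Counter(coarse_tag(tag) for _, tag in tagged)
print(counts.most_common())
```

On the full dataset the same `Counter` could be updated once per review with the output of `pos_tag`.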

1st place - **JJ Adjective**: good/bad, deep, weak, excellent, terrible, cute, vile, scary

2nd place - **VB Verb**: liked it, I laughed/cried, made me think, kept me in suspense

The weights below are considerably lower, because these parts of speech reveal little about the attitude towards the film.

3rd place - **NN Noun**: war, humor, lightness, relaxation/tension, fear/joy

4th place - **END Sentence-ending punctuation**: !, ?!, ..., !!!11

5th place - **CD Number**: 10 out of 10, watched it a thousand times, a one-time watch

Lowest weights

6th place - **IN Preposition**, **AT Article**: carry very little information

**Task 2.3 [3 points] - WordEmbeddings**

Use dense vector representations to construct vector-representation of each review, then train a model (LR or RF).

Compare results of the new model to results of the models above.
**Important**
- If you just sum embeddings of each token to get an embedding of the whole review, the cost of the task is **[2 points]**
- For **[3 points]** you have to use either TF-IDF weight or weights that you designed from POS tags.
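
One possible way to build such a review vector is a weighted average of the token embeddings, where the weight of each token is its TF-IDF value or a POS-based weight. The sketch below uses hand-made two-dimensional vectors and weights; `review_embedding`, the toy vectors and the weights are all assumptions for illustration.

```python
def review_embedding(tokens, vectors, weights, dim):
    """Weighted average of token vectors; tokens without a vector are skipped."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # out-of-vocabulary token
        w = weights.get(tok, 1.0)  # default weight when none is given
        total = [t + w * x for t, x in zip(total, vectors[tok])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [t / weight_sum for t in total]

# Toy data (hypothetical): 2-d embeddings and per-token weights.
vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "plot": [0.5, 0.5]}
weights = {"good": 3.0, "movie": 1.0}  # e.g. TF-IDF or POS-based weights

emb = review_embedding(["good", "movie", "unknownword"], vectors, weights, dim=2)
empty = review_embedding(["unknownword"], vectors, weights, dim=2)
print(emb, empty)
```

Stacking one such vector per review gives a dense feature matrix that can be passed to LR or RF in the same way as the sparse matrices above.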

`TfidfVectorizer` (`from sklearn.feature_extraction.text import TfidfVectorizer`) converts a collection of raw documents to a matrix of TF-IDF features.

In [130]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Equivalent to CountVectorizer followed by TfidfTransformer.

**Term Frequency (TF)-Inverse Document Frequency (IDF)**

**1. Term Frequency (TF)**: the probability of finding a word **wi** in a document **dj**:

n - the number of times wi occurs in dj

m - the total number of words in dj

**TF(wi,dj) = n / m**

**2. Inverse Document Frequency (IDF)**

By the logic of IDF, a word that occurs in every document is not very useful. IDF measures how unique a word is across the whole corpus.

**IDF(wi,Dc) = log(N/ni)** 

Dc - all documents in the corpus,

N - the total number of documents,

ni - the number of documents that contain the word wi.

**3. TF-IDF: the product of TF and IDF** 

**TF(wi, dj) * IDF(wi, Dc)**

Words that occur more often in a document than in the rest of the corpus get a higher weight.
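
The formulas can be checked by hand on the four-document corpus used above. This is a minimal sketch of the textbook definition (TF = n / m, IDF = log(N / ni)); the helper names `tf`, `idf` and `tf_idf` are chosen for illustration, and punctuation is dropped and text lowercased by hand.

```python
import math

corpus = ["this is the first document",
          "this document is the second document",
          "and this is the third one",
          "is this the first document"]
docs = [text.split() for text in corpus]

def tf(word, doc):
    """Share of the document's tokens that are `word` (n / m)."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(N / ni); assumes the word occurs in at least one document."""
    n_i = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_i)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

second_doc = docs[1]
for word in ["this", "document", "second"]:
    print(word, round(tf_idf(word, second_doc, docs), 4))
```

"this" occurs in every document, so its IDF and its TF-IDF are zero, while "second" occurs in one document only and gets the highest weight of the three. `TfidfVectorizer` does not use exactly this formula: by default it smooths the IDF as ln((1 + N) / (1 + ni)) + 1 and L2-normalizes each row, so its values differ from the ones printed here.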

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

In [2]:
reviews = pd.read_csv("reviews.tsv", sep="\t")
reviews.head(3)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...


In [3]:
from sklearn.model_selection import train_test_split
X = reviews["review"]
y = reviews["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000, random_state=42, stratify=y)

In [4]:
from nltk.stem import WordNetLemmatizer, PorterStemmer

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stopWords = stopwords.words('english')

stopSign = ['.', ',', '<', '>', '"', '\\', '|', '/', ';', ':', '-', '\\', "'",
            '/><', '(', ')', '/><br','\\"','.<', '-','/>','br','...','!', '&', '?', "''",
           '/>The', '<br', '/>I']

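def del_stopWorsd(Words,stopWords):
    # Note: when Words is a list of tokens, Words[i][0] is the token's first character,
    # so any token that starts with a one-letter stop word ('a', 'i', 's', 't', ...) is dropped.
    n = len(Words)
    new_Words = []
    for i in range(n):
        if not(Words[i][0] in stopWords):
            new_Words.append(Words[i])
    return new_Words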

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def Lemmatizer(X):
    Y = []
    for i in X:
        word_list = TreebankWordTokenizer().tokenize(i)
        word_list = del_stopWorsd(word_list,stopWords)
        word_list = del_stopWorsd(word_list,stopSign)
        lemm_srt  = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
        Y.append(lemm_srt)
    return Y

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
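
The filter in `del_stopWorsd` tests `Words[i][0]`, which for a list of string tokens is the first character of each token rather than the token itself. The sketch below contrasts that test with a whole-token test on a toy sentence; the small `stop_words` set is a stand-in for the NLTK list and is an assumption for illustration.

```python
# Small stand-in for nltk's English stop-word list (assumption for illustration).
stop_words = {"i", "a", "the", "is", "s", "t"}

tokens = ["This", "is", "a", "terrific", "story", "about", "art"]

# First-character test, as in del_stopWorsd: drops every token whose first
# character happens to be a one-letter stop word.
by_first_char = [tok for tok in tokens if tok[0] not in stop_words]

# Whole-token, case-insensitive test.
by_token = [tok for tok in tokens if tok.lower() not in stop_words]

print(by_first_char)
print(by_token)
```

Using a `set` for the stop words also makes each membership test constant-time, which matters when filtering all 25,000 reviews.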


In [6]:
X_train = Lemmatizer(X_train)
X_test = Lemmatizer(X_test)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
vectorizer = TfidfVectorizer()
X_TV_train = vectorizer.fit_transform(X_train)

In [9]:
X_TV_test = vectorizer.transform(X_test)

In [10]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500)

In [11]:
from time import time
start = time()
rf.fit(X_TV_train, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lead time: 208.532 seconds


In [12]:
rf.score(X_TV_train, y_train), rf.score(X_TV_test, y_test)

(1.0, 0.8362)

In [13]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [14]:
start = time()
lr.fit(X_TV_train, y_train)
end = time()
print('lead time: {:.3f} seconds'.format(end - start))

lead time: 2.560 seconds


In [15]:
lr.score(X_TV_train, y_train), lr.score(X_TV_test, y_test)

(0.92045, 0.8752)

https://datastart.ru/blog/read/plavnoe-vvedenie-v-natural-language-processing-nlp