<a href="https://colab.research.google.com/github/Krahjotdaan/MachineLearning/blob/main/Vectorizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Text Data

## Practice - Step 0 - Data Preparation
- load the movie reviews from the file **movie_reviews.csv**
- print the number of positive and negative reviews
- create a new feature: review length
- compute the correlation between review length and review positivity
- train a logistic regression and compute its accuracy

In [None]:
import pandas as pd
import numpy as np
import string

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
df = pd.read_csv('imdb_14000.csv').drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,"Petter Mattei's ""Love in the Time of Money"" is...",positive
4,"Probably my all-time favorite movie, a story o...",positive


In [None]:
df['review'].iloc[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [None]:
df['review_len'] = df['review'].apply(len)
df

Unnamed: 0,review,sentiment,review_len
0,One of the other reviewers has mentioned that ...,positive,1764
1,A wonderful little production. <br /><br />The...,positive,998
2,I thought this was a wonderful way to spend ti...,positive,926
3,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1317
4,"Probably my all-time favorite movie, a story o...",positive,656
...,...,...,...
13995,I never read the book. Now I don't really want...,negative,1094
13996,"""Shinobi"" is one of those movies that thinks t...",negative,1516
13997,The teasers for Tree of Palme try to pass it o...,negative,1831
13998,I've read comments that you shouldn't watch th...,negative,1188


In [None]:
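# Use the review length as the only feature; double brackets keep X two-dimensional
X = df[['review_len']]
y = df['sentiment']

# Accuracy is measured on the training data itself, so it is an optimistic estimate
lr = LogisticRegression().fit(X, y)
accuracy_score(y, lr.predict(X))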

0.49757142857142855
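
The cells above do not cover two items from the task list: the class counts and the correlation between length and sentiment. A minimal sketch of both on a made-up four-row frame follows; the `toy` data and the 0/1 encoding of `sentiment` are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews DataFrame
toy = pd.DataFrame({
    'review': ['Loved it', 'Dull and far too long', 'A wonderful little film', 'Awful'],
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
})
toy['review_len'] = toy['review'].str.len()

# Number of reviews per class
counts = toy['sentiment'].value_counts()
print(counts)

# Encode the label as 0/1 so that a Pearson correlation with length is defined
is_positive = (toy['sentiment'] == 'positive').astype(int)
corr = toy['review_len'].corr(is_positive)
print(corr)
```

On the real data the same calls would be applied to `df`. A correlation close to zero would be consistent with the length-only model scoring about 0.5.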

## Practice - Step 1 - Bag of Words
- build a "bag of words" with CountVectorizer
- print the resulting vocabulary and its size
- get the feature matrix and inspect its representation

In [None]:
vectorizer = CountVectorizer()
X_vect = vectorizer.fit_transform(df['review'])
y = df['sentiment']

In [None]:
X_vect.shape

(14000, 60467)

In [None]:
X_vect

<14000x60467 sparse matrix of type '<class 'numpy.int64'>'
	with 1916627 stored elements in Compressed Sparse Row format>

In [None]:
X_vect.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
m = X_vect.toarray()

In [None]:
len(m[0])

60467

In [None]:
len(m[0][m[0] != 0])

186

In [None]:
186 / 60467 * 100

0.3076058015115683
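
Calling `toarray()` on a 14000 x 60467 `int64` matrix allocates roughly 6.8 GB (14000 * 60467 * 8 bytes). The vocabulary and the same sparsity figures can be read from the sparse matrix directly. A sketch on a toy corpus, where `docs` is a made-up stand-in for `df['review']`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for df['review']
docs = ['a great movie', 'a boring movie', 'great acting, great story']

vec = CountVectorizer()
X = vec.fit_transform(docs)

# vocabulary_ maps each token to its column index
vocab = vec.vocabulary_
print(len(vocab), sorted(vocab))

# Number of non-zero entries per row, taken from the sparse matrix itself
nnz_per_row = X.getnnz(axis=1)
print(nnz_per_row)

# Share of non-zero cells in the whole matrix, in percent
density = X.nnz / (X.shape[0] * X.shape[1]) * 100
print(density)
```

With the default token pattern, single-character tokens such as `a` are dropped, which is why they do not appear in the vocabulary.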

## Practice - Step 2 - Comparing Models
Train the following models:
- logistic regression
- kNN
- decision tree

Compute the cross-validation score for each model.

In [None]:
lr = LogisticRegression(solver='saga', max_iter=200)
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier(max_depth=10)

print('LR:', cross_val_score(lr, X_vect, y, cv=3).mean())
print('KNN:', cross_val_score(knn, X_vect, y, cv=3).mean())
print('TREE:', cross_val_score(tree, X_vect, y, cv=3).mean())



LR: 0.8650717282364223
KNN: 0.612357092979673
TREE: 0.7154999154582461
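
The three calls have the same shape, so a loop over a dict of models is one way to keep the comparison compact. The sketch below runs on a made-up twelve-review corpus and wraps each model in a pipeline so the vectorizer is re-fit inside every fold. That is a design choice that differs from the cells above, where the vectorizer sees the whole dataset first. The corpus, `n_neighbors=3`, and the default solver are assumptions chosen to suit the tiny data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus: six positive reviews followed by six negative ones
docs = [
    'a wonderful and moving film', 'great acting and a great story',
    'one of the best movies this year', 'perfect cast and wonderful music',
    'a great and touching ending', 'the best performance of her career',
    'the worst film of the year', 'awful acting and a boring plot',
    'a complete waste of time', 'boring, bad and far too long',
    'bad script and awful dialogue', 'a waste of a great cast',
]
labels = ['positive'] * 6 + ['negative'] * 6

models = {
    'LR': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'TREE': DecisionTreeClassifier(max_depth=10, random_state=0),
}

scores = {}
for name, model in models.items():
    # The vectorizer is part of the pipeline, so it only sees the training folds
    pipe = make_pipeline(CountVectorizer(), model)
    scores[name] = cross_val_score(pipe, docs, labels, cv=3).mean()
    print(name, scores[name])
```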


## Practice - Step 3 - Significant Words
Get the feature importances from the logistic regression.

In [None]:
lr = LogisticRegression(solver='saga', max_iter=200).fit(X_vect, y)



In [None]:
lr.coef_[0]

array([-0.0078809 , -0.01821229, -0.00089261, ..., -0.00363336,
       -0.00011346,  0.00066911])

In [None]:
vectorizer.get_feature_names_out()

array(['00', '000', '00001', ..., 'über', 'überwoman', 'ünfaithful'],
      dtype=object)

In [None]:
w = sorted(zip(vectorizer.get_feature_names_out(), lr.coef_[0]), key = lambda x: x[1])
df_w = pd.DataFrame(w, columns = ['token', 'lr_weight'])
df_w

Unnamed: 0,token,lr_weight
0,worst,-0.782425
1,awful,-0.631925
2,bad,-0.595884
3,waste,-0.592165
4,boring,-0.558941
...,...,...
60462,best,0.390325
60463,perfect,0.408951
60464,wonderful,0.419023
60465,great,0.484766


## Practice - Step 4 - Improving Word Extraction
Repeat steps 1-2 with the **min_df** parameter (the minimum document frequency a word needs in order to be kept).

Repeat steps 1-2 with the **max_df** parameter (the maximum document frequency a word may have and still be kept).

Add stop-word handling.
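
How each option prunes the vocabulary is easiest to see on a few sentences. In `CountVectorizer`, an integer `min_df`/`max_df` is an absolute number of documents and a float is a share of documents. The sketch below uses a made-up four-document corpus; the thresholds are chosen for this toy data, not for the reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 'the' is in 4 documents, 'was' in 3, 'movie' and 'great' in 2
docs = ['the movie was great', 'the movie was awful', 'the plot was great', 'the ending']

settings = {
    'default': {},
    'min_df=2': {'min_df': 2},        # keep words found in at least 2 documents
    'max_df=0.8': {'max_df': 0.8},    # drop words found in more than 80% of documents
    "stop_words='english'": {'stop_words': 'english'},  # drop the built-in English stop list
}

vocabularies = {}
for name, kwargs in settings.items():
    vec = CountVectorizer(**kwargs).fit(docs)
    vocabularies[name] = sorted(vec.vocabulary_)
    print(name, len(vocabularies[name]), vocabularies[name])
```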

In [None]:
vectorizer = CountVectorizer(min_df=30)
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)  # cross_val_score clones and fits the model itself

print('LR:', cross_val_score(lr, X_vect, y, cv=3).mean())

In [None]:
vectorizer = CountVectorizer(max_df=0.8)
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)

print('LR:', cross_val_score(lr, X_vect, y, cv=3).mean())

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)

print('LR:', cross_val_score(lr, X_vect, y, cv=3).mean())

In [None]:
vectorizer = CountVectorizer(max_df=0.8)
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)

print('LR:', cross_val_score(lr, X_vect, y, cv=6).mean())

## Practice - Step 5 - Word Importance
Run the cell below and interpret the result.

Scale the original dataset with **tf-idf** and repeat steps 1-3.

Which of the approaches tried above could improve the result? Test your hypothesis.
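
To help interpret the numbers, the tf-idf weights can be rebuilt by hand from raw counts. With its default settings, `TfidfVectorizer` uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, multiplies it by the term count, and then scales every row to unit Euclidean length. The sketch below assumes those defaults and checks the manual result against the library on the same four sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts (tf)
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the smooth_idf=True default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then L2-normalize each row (the norm='l2' default)
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

Words that occur in every document (`this`, `is`, `the`) get the smallest idf, so their weight shrinks relative to rarer words such as `second`.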

In [None]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())

X.toarray()

In [None]:
vectorizer = CountVectorizer(max_df=0.8)
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)

print('LR:', cross_val_score(lr, X_vect, y, cv=5).mean())

(14000, 60458)
LR: 0.8780000000000001

In [None]:
vectorizer = TfidfVectorizer(max_df=0.8)
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)

print('LR:', cross_val_score(lr, X_vect, y, cv=5).mean())

(14000, 60458)
LR: 0.880142857142857


## Practice - Step 6 - Word Order
Train a logistic regression on data with n-grams and print the significant features.

In [None]:
# ngram_range=(1, 1): unigrams only
# ngram_range=(1, 2): unigrams and bigrams
# ngram_range=(2, 2): bigrams only (used below)
vectorizer = TfidfVectorizer(max_df=0.8, ngram_range=(2, 2))
X_vect = vectorizer.fit_transform(df['review'])
print(X_vect.shape)
y = df['sentiment']

lr = LogisticRegression(solver='saga', max_iter=200)

print('LR:', cross_val_score(lr, X_vect, y, cv=5).mean())

(14000, 938691)
LR: 0.8562857142857142


In [None]:
lr = LogisticRegression(solver='saga', max_iter=200).fit(X_vect, y)
w = sorted(zip(vectorizer.get_feature_names_out(), lr.coef_[0]), key = lambda x: x[1])
df_w = pd.DataFrame(w, columns = ['token', 'lr_weight'])
df_w

Unnamed: 0,token,lr_weight
0,the worst,-7.482333
1,the only,-4.154146
2,waste of,-4.065876
3,at all,-3.258438
4,supposed to,-3.219869
...,...,...
938686,as the,2.843732
938687,it is,2.934924
938688,one of,3.189383
938689,is great,3.815831


## Practice - Step 7 - Optimization
Use grid search to find the optimal hyperparameter values for the models, train the models with them, compute the quality, and print the significant features.

In [None]:
from sklearn.pipeline import Pipeline

vec = TfidfVectorizer(min_df = 3, max_df = 0.8, ngram_range = (1,2))
lr = LogisticRegression(solver = 'saga')

pipe = Pipeline(steps = [('vec', vec), ('lr', lr)])
X = df['review']
y = df['sentiment']

cross_val_score(pipe, X, y).mean()

0.8852857142857141

In [None]:
vec = TfidfVectorizer()
lr = LogisticRegression(solver = 'saga')

pipe = Pipeline(steps = [('vec', vec), ('lr', lr)])
X = df['review']
y = df['sentiment']

params = {
    #'vec__min_df': np.arange(0.001, 0.2, 0.1),
    'vec__ngram_range': [(1, 1), (1, 2)],
    #'lr__C': np.arange(0.01, 0.2, 0.1)
}

gs = GridSearchCV(pipe, params, n_jobs = -1, verbose = 2, cv = 3)
gs.fit(X, y)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


In [None]:
gs.best_score_, gs.best_params_

(0.8753571517899967, {'vec__ngram_range': (1, 2)})

## Practice - Step 8 - Natural Language Processing
- Using the **nltk** library (Natural Language Toolkit), split a sample text into word tokens, then into sentence tokens.
- Normalize the words of the sample text

For the original movie review dataset:
- Extract word tokens
- Normalize the words
- Remove punctuation and convert all words to lowercase
- Vectorize the normalized text with scaling
- Train a logistic regression and compute the quality metric with cross-validation
- Print the significant features
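
The dataset part of the list can be sketched as a single normalization function applied to every review. To stay free of downloads, this sketch swaps `word_tokenize` for the regex-based `wordpunct_tokenize` and uses stemming rather than lemmatization. The function name `normalize` and the sample sentence are made up for illustration.

```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize

stemmer = SnowballStemmer('english')

def normalize(text):
    """Lowercase, tokenize, drop punctuation-only tokens, and stem the rest."""
    tokens = wordpunct_tokenize(text.lower())
    # Keep only tokens that contain at least one letter or digit
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    return ' '.join(stemmer.stem(w) for w in words)

result = normalize('The movies were AMAZING, really!')
print(result)
```

The returned string can be passed to `TfidfVectorizer` as is, for example after mapping the function over the review column with `apply(normalize)`. The reviews also contain `<br />` markup, which this sketch does not strip.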

In [None]:
data = '''Natural language processing (NLP) is a subfield of linguistics, computer science,
and artificial intelligence concerned with the interactions between computers and
human language, in particular how to program computers to process and analyze
large amounts of natural language data. The result is a computer capable of "understanding"
the contents of documents, including the contextual nuances of the language within them.
The technology can then accurately extract information and insights contained in
the documents as well as categorize and organize the documents themselves. '''

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
words = word_tokenize(data)
print(words)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'result', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.']


In [None]:
sents = sent_tokenize(data)
for s in sents:
  print(s)
  print('-'*80)

Natural language processing (NLP) is a subfield of linguistics, computer science,
and artificial intelligence concerned with the interactions between computers and
human language, in particular how to program computers to process and analyze
large amounts of natural language data.
--------------------------------------------------------------------------------
The result is a computer capable of "understanding"
the contents of documents, including the contextual nuances of the language within them.
--------------------------------------------------------------------------------
The technology can then accurately extract information and insights contained in
the documents as well as categorize and organize the documents themselves.
--------------------------------------------------------------------------------


In [None]:
words = word_tokenize(data)
stem = SnowballStemmer('english')

words_stem = [stem.stem(w) for w in words]

# Use a separate name so the reviews DataFrame `df` is not overwritten
df_words = pd.DataFrame(zip(words, words_stem), columns=['Orig', 'Stem'])
df_words[:20]

Unnamed: 0,Orig,Stem
0,Natural,natur
1,language,languag
2,processing,process
3,(,(
4,NLP,nlp
5,),)
6,is,is
7,a,a
8,subfield,subfield
9,of,of


In [None]:
words = word_tokenize(data)

stem = SnowballStemmer('english')
words_stem = [stem.stem(w) for w in words]

lem = WordNetLemmatizer()
words_lem = [lem.lemmatize(w) for w in words]

# Use a separate name so the reviews DataFrame `df` is not overwritten
df_words = pd.DataFrame(zip(words, words_stem, words_lem), columns=['Orig', 'Stem', 'Lem'])
df_words[:20]

Unnamed: 0,Orig,Stem,Lem
0,Natural,natur,Natural
1,language,languag,language
2,processing,process,processing
3,(,(,(
4,NLP,nlp,NLP
5,),),)
6,is,is,is
7,a,a,a
8,subfield,subfield,subfield
9,of,of,of


## Practice - Step 9 - Working with Russian

In [None]:
data_rus = '''Вчера после работы заскочила в Магнит за продуктами. Передо мной на кассу стояла женщина с сынишкой лет 5. Все время, пока мы ждали своей очереди, ребенок канючил:
— Ма-а-ам, я хочу большой Киндер! Давай купим большой Киндер! Ну ма-а-ам, купи!
Уставшая от этих стонов женщина обернулась к сыну и указала на лежащий на ленте сверток.
— Нет, не куплю. Ты же видишь, мы сегодня купили конфеты!
— Ну да, да... — горестно вздохнул ребятенок, но уже в следующий момент в его глазах засветилась неугасимая надежда. — Но ведь КОГДА-НИБУДЬ ты мне его обязательно купишь, правда?!
И, окрыленный этой мыслью, он вприпрыжку поскакал за уже расплатившейся матерью, радостно улыбаясь всем вокруг.
Эх, как все-таки мало детям надо для счастья.'''

In [None]:
!pip install pymorphy2
import pymorphy2

In [None]:
words = word_tokenize(data_rus)

stem = SnowballStemmer('russian')
words_stem = [stem.stem(w) for w in words]

lem = pymorphy2.MorphAnalyzer()
words_lem = [lem.normal_forms(w)[0] for w in words]

# Use a separate name so the reviews DataFrame `df` is not overwritten
df_words = pd.DataFrame(zip(words, words_stem, words_lem), columns=['Orig', 'Stem', 'Lem'])
df_words[:20]

Unnamed: 0,Orig,Stem,Lem
0,Вчера,вчер,вчера
1,после,посл,после
2,работы,работ,работа
3,заскочила,заскоч,заскочить
4,в,в,в
5,Магнит,магн,магнит
6,за,за,за
7,продуктами,продукт,продукт
8,.,.,.
9,Передо,перед,перед
