# Importing libraries

In [2]:
from nltk.tokenize import wordpunct_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import pandas as pd

# Loading the data
## About the dataset
### Columns - [text, label]
1. text - the text of a movie review
2. label:
    - 0 - negative review
    - 1 - positive review
- Dataset link: https://www.kaggle.com/datasets/thedevastator/imdb-movie-review-sentiment-dataset?select=train.csv

In [3]:
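# Load both CSV files and combine them into one DataFrame.
# ignore_index=True avoids duplicate index labels after concatenation.
df1 = pd.read_csv('train.csv')
df2 = pd.read_csv('test.csv')
df = pd.concat([df1, df2], ignore_index=True)
df.head()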

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [4]:
df.label.value_counts()

label
0    25000
1    25000
Name: count, dtype: int64

#### Example of a negative review

In [5]:
df[df['label'] == 0].iloc[15].text

"This film is just plain horrible. John Ritter doing pratt falls, 75% of the actors delivering their lines as if they were reading them from cue cards, poor editing, horrible sound mixing (dialogue is tough to pick up in places over the background noise), and a plot that really goes nowhere. I didn't think I'd ever say this, but Dorothy Stratten is not the worst actress in this film. There are at least 3 others that suck more. Patti Hansen delivers her lines with the passion of Ben Stein. I started to wonder if she wasn't dead inside. Even Bogdanovich's kids are awful (the oldest one is definitely reading her lines from a cue card). This movie is seriously horrible. There's a reason Bogdanovich couldn't get another project until 4 years later. Please don't watch it. If you see it in your television listings, cancel your cable. If a friend suggests it to you, reconsider your friendship. If your spouse wants to watch it, you're better off finding another soulmate. I'd rather gouge my eye

#### Example of a positive review

In [6]:
df[df['label'] == 1].iloc[10].text

"Lars von Trier's Europa is a worthy echo of The Third Man, about an American coming to post-World War II Europe and finds himself entangled in a dangerous mystery.<br /><br />Jean-Marc Barr plays Leopold Kessler, a German-American who refused to join the US Army during the war, arrives in Frankfurt as soon as the war is over to work with his uncle as a sleeping car conductor on the Zentropa Railway. What he doesn't know is the war is still secretly going on with an underground terrorist group called the Werewolves who target American allies. Leopold is strongly against taking any sides, but is drawn in and seduced by Katharina Hartmann (Barbara Sukowa), the femme fatale daughter of the owner of the railway company. Her father was a Nazi sympathizer, but is pardoned by the American Colonel Harris (Eddie Considine) because he can help get the German transportation system up and running again. The colonel soon enlists, or forces, Leopold to be a spy (without giving him a choice or chance

# Data preprocessing
1. Convert all text to lowercase and tokenize it into words.
2. Vectorize the texts with the TF-IDF algorithm.

In [7]:
df['text'] = df['text'].apply(lambda x: ' '.join(wordpunct_tokenize(x.lower())))
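To make the preprocessing step concrete, here is a minimal sketch of what lowercasing plus `wordpunct_tokenize` does to a single sentence. It uses the standard-library `re` module with the pattern `\w+|[^\w\s]+`, which is the pattern NLTK's `WordPunctTokenizer` is built on. The helper name `normalize` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

# Pattern behind nltk's WordPunctTokenizer: runs of word characters,
# or runs of non-word, non-space characters (punctuation).
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def normalize(text: str) -> str:
    """Lowercase the text, split it into word/punctuation tokens, re-join with spaces."""
    return " ".join(_WORDPUNCT.findall(text.lower()))


sample = "Don't recommend it to anyone!"
print(normalize(sample))  # don ' t recommend it to anyone !
```

Note that a contraction such as "don't" becomes three tokens. With its default settings, `TfidfVectorizer` also lowercases the input and keeps only tokens of two or more word characters, so the lone `'`, `t`, and `!` are dropped at the vectorization step.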

In [8]:
vec = TfidfVectorizer()
X = vec.fit_transform(df['text'])
y = df['label']

# Splitting the data into training and test sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
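One caveat about the cells above: `TfidfVectorizer` is fitted on the full dataset before the split, so the vocabulary and IDF weights are computed with the test reviews included. A minimal sketch of an alternative follows, assuming a tiny made-up corpus (`texts` and `labels` are hypothetical data, not the IMDB reviews). It splits the raw texts first and lets a scikit-learn pipeline fit TF-IDF on the training part only. This is one possible arrangement, not the notebook's method, and the accuracies reported below were not obtained with it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical data, only for illustration).
texts = [
    "great movie , loved it", "awful plot , boring", "wonderful acting",
    "terrible and boring", "loved the story", "awful acting , terrible",
    "great story , wonderful", "boring movie",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw texts first ...
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# ... then let the pipeline fit TF-IDF on the training texts only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0))
model.fit(texts_train, y_train)

# The pipeline vectorizes raw strings itself at prediction time.
predictions = model.predict(texts_test)
print(len(texts_train), len(texts_test), predictions.shape)
```

A side benefit of the pipeline is that new reviews can be passed to `model.predict` as plain strings, without a separate `transform` call.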

# Training and evaluating the models

In [13]:
log_reg = LogisticRegression(random_state=0).fit(X_train, y_train)

# Mean accuracy on the training set and on the held-out test set
print("Training accuracy: {:.2f}".format(log_reg.score(X_train, y_train)))
print("Test accuracy: {:.2f}".format(log_reg.score(X_test, y_test)))

Training accuracy: 0.93
Test accuracy: 0.90


In [15]:
rm_forest = RandomForestClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Mean accuracy on the training set and on the held-out test set
print("Training accuracy: {:.2f}".format(rm_forest.score(X_train, y_train)))
print("Test accuracy: {:.2f}".format(rm_forest.score(X_test, y_test)))

Training accuracy: 0.83
Test accuracy: 0.81


# Usage example

### Example: predicting a negative review

In [20]:
text = [' '.join(wordpunct_tokenize("I really hate this movie! The plot is so boring! Don't recommend it to anyone".lower()))]
text = vec.transform(text)

# Each array holds the probabilities [negative, positive]; vec.transform already returns a (1, n_features) matrix
print("Logistic regression prediction: ", log_reg.predict_proba(text)[0])
print("Random forest prediction: ", rm_forest.predict_proba(text)[0])

Logistic regression prediction:  [0.94278559 0.05721441]
Random forest prediction:  [0.50013103 0.49986897]


### Example: predicting a positive review

In [21]:
text = [' '.join(wordpunct_tokenize("It was the best movie I've ever seen! I mean the plot is so awesome and intresting. I'm defenetly going to rewatch it!".lower()))]
text = vec.transform(text)

# Each array holds the probabilities [negative, positive]
print("Logistic regression prediction: ", log_reg.predict_proba(text)[0])
print("Random forest prediction: ", rm_forest.predict_proba(text)[0])

Logistic regression prediction:  [0.19490628 0.80509372]
Random forest prediction:  [0.48379417 0.51620583]


### Example: predicting a neutral review

In [22]:
text = [' '.join(wordpunct_tokenize("It wasn't that bad but I mean the plot is not that intesting either. Would I recommend it? I'm not sure. You can watch it, maybe you will like it more.".lower()))]
text = vec.transform(text)

# predict_log_proba returns natural-log probabilities; predict returns the class label
print("Logistic regression log-probabilities: ", log_reg.predict_log_proba(text))
print("Logistic regression prediction: ", log_reg.predict(text))
print("Random forest prediction: ", rm_forest.predict_proba(text))

Logistic regression log-probabilities:  [[-0.2008149  -1.70409944]]
Logistic regression prediction:  [0]
Random forest prediction:  [[0.49198092 0.50801908]]
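The neutral-review cell prints log-probabilities (from `predict_log_proba`), while the two earlier examples print plain probabilities, so the numbers are not directly comparable. A small sketch of the conversion, using only the values shown in the output above:

```python
import math

# Log-probabilities printed by predict_log_proba in the cell above.
log_proba = [-0.2008149, -1.70409944]

# exp() maps natural-log probabilities back to ordinary probabilities.
proba = [math.exp(v) for v in log_proba]

print([round(p, 3) for p in proba])  # [0.818, 0.182]
```

So logistic regression assigns roughly 82% to the negative class for this review, which matches the predicted label `0`. Calling `predict_proba` directly would give the same values in the same form as the other two examples.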


# Conclusion

Logistic regression performs considerably better than the random forest limited to `max_depth=5` (0.90 vs. 0.81 test accuracy), so it is the better choice for this prediction task. The random forest's probabilities also stay close to 0.5 on all three sample reviews, while logistic regression gives confident predictions.

Logistic regression already gives a fairly good result, but it could likely be improved further with more advanced deep learning models for text analysis that use attention and self-attention.