## Data preprocessing file
The input is our dataset. The output is the same dataset cleaned of stray, uninformative characters, lemmatized, split into training and test sets, and vectorized with TF-IDF over unigrams, bigrams, and trigrams.
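
As a rough illustration of the vectorization step, the sketch below runs `TfidfVectorizer` with `ngram_range=(1, 3)` on a toy corpus. The documents and variable names are invented for the example and are not part of the dataset. The vocabulary is learned from the training texts only, and the test texts are then mapped onto it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only
train_docs = ["good movie", "bad movie", "really good plot"]
test_docs = ["good plot", "terrible acting"]

# Unigrams, bigrams, and trigrams, as in the notebook (no max_features cap here)
vec = TfidfVectorizer(ngram_range=(1, 3))
X_train_toy = vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_test_toy = vec.transform(test_docs)         # reuses the training vocabulary

print(sorted(vec.vocabulary_))                # the learned n-grams
print(X_train_toy.shape, X_test_toy.shape)
print(X_test_toy[1].nnz)                      # non-zero features in "terrible acting"
```

A test document that shares no n-grams with the training texts, such as "terrible acting" here, ends up as an all-zero row.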

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

In [2]:
data = pd.read_csv('./IMDB Dataset.csv')

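# Lemmatize the text word by word
def lemmatize_text(text):
    # lemmatize() is called without a POS tag, so every word is treated as a noun
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

def clean_text(text: str) -> str:
    """
    Lowercases the text, strips HTML tags and special characters, collapses
    runs of repeated characters, removes stop words, and lemmatizes the result.
    Parameters:
    text (str): The original text.
    Returns:
    str: The cleaned text.
    """
    text = text.lower()
    # Remove HTML tags such as <br />
    text = re.sub(r'<.*?>', '', text)
    # Keep only Latin letters, digits, and whitespace
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Collapse a character repeated three or more times in a row into one
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    # Drop English stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])
    text = lemmatize_text(text)
    return text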
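
The effect of the three regular-expression steps is easiest to see on a short string. The sketch below copies only the regex part of `clean_text` (no stop-word removal or lemmatization). The helper name `apply_regex_steps`, its `tag_replacement` parameter, and the sample string are hypothetical and exist only for this illustration.

```python
import re

def apply_regex_steps(text: str, tag_replacement: str = '') -> str:
    """Regex-only part of the cleaning; tag_replacement is a hypothetical knob."""
    text = text.lower()
    text = re.sub(r'<.*?>', tag_replacement, text)   # strip HTML tags
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)       # strip punctuation
    text = re.sub(r'(.)\1{2,}', r'\1', text)         # collapse 3+ repeats
    return text

sample = "It is today.<br /><br />It's sooooo good!!!"

# Same behavior as the notebook: tags are replaced with an empty string
merged = apply_regex_steps(sample)
# Variant for comparison: tags are replaced with a space
separated = apply_regex_steps(sample, tag_replacement=' ')

print(merged)                        # it is todayits so good
print(' '.join(separated.split()))   # it is today its so good
```

Because tags are removed without leaving a space, words on either side of `<br />` are glued together, which is where tokens such as `todayits` in the cleaned reviews come from. The repeated-character rule also applies to digits, so `1000` becomes `10`.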

In [3]:
# Download the stop-word list and the WordNet data
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Text before cleaning

In [4]:
print(data.iloc[7,0])

This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air.


In [6]:
print(data.iloc[12,0])

So im not a big fan of Boll's work but then again not many are. I enjoyed his movie Postal (maybe im the only one). Boll apparently bought the rights to use Far Cry long ago even before the game itself was even finsished. <br /><br />People who have enjoyed killing mercs and infiltrating secret research labs located on a tropical island should be warned, that this is not Far Cry... This is something Mr Boll have schemed together along with his legion of schmucks.. Feeling loneley on the set Mr Boll invites three of his countrymen to play with. These players go by the names of Til Schweiger, Udo Kier and Ralf Moeller.<br /><br />Three names that actually have made them selfs pretty big in the movie biz. So the tale goes like this, Jack Carver played by Til Schweiger (yes Carver is German all hail the bratwurst eating dudes!!) However I find that Tils acting in this movie is pretty badass.. People have complained about how he's not really staying true to the whole Carver agenda but we on

### Text after cleaning

In [7]:
# Clean the reviews and binarize the target (positive = 1, otherwise 0)
data["review"] = data['review'].apply(clean_text)
data['sentiment'] = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
print(data.iloc[7,0])

show amazing fresh innovative idea 70 first aired first 7 8 year brilliant thing dropped 1990 show really funny anymore continued decline complete waste time todayits truly disgraceful far show fallen writing painfully bad performance almost bad mildly entertaining respite guesthosts show probably wouldnt still air find hard believe creator handselected original cast also chose band hack followed one recognize brilliance see fit replace mediocrity felt must give 2 star respect original cast made show huge success show awful cant believe still air


In [8]:
print(data.iloc[12,0])

im big fan boll work many enjoyed movie postal maybe im one boll apparently bought right use far cry long ago even game even finsished people enjoyed killing mercs infiltrating secret research lab located tropical island warned far cry something mr boll schemed together along legion schmuck feeling loneley set mr boll invite three countryman play player go name til schweiger udo kier ralf moellerthree name actually made self pretty big movie biz tale go like jack carver played til schweiger yes carver german hail bratwurst eating dude however find tils acting movie pretty badass people complained he really staying true whole carver agenda saw carver first person perspective dont really know looked like kicking however storyline film beyond demented see evil mad scientist dr krieger played udo kier making geneticallymutatedsoldiers gm called performing topsecret research island reminds spoiler vancouver reason thats right palm tree instead got nice rich lumberjackwoods havent even gone 

In [17]:
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)

# Vectorize the text with TF-IDF over unigrams, bigrams, and trigrams (at most 8000 features)
tfidf = TfidfVectorizer(ngram_range=(1,3),max_features=8000)
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()

train_data = {
    'features': X_train_tfidf,
    'labels': y_train
}

test_data = {
    'features': X_test_tfidf,
    'labels': y_test
}

# Save the training data to a pickle file
with open('./files/train_data.pkl', 'wb') as f:
    pickle.dump(train_data, f)

# Save the test data to a pickle file
with open('./files/test_data.pkl', 'wb') as f:
    pickle.dump(test_data, f)

# Save the cleaned text to a pickle file
with open('./files/Cleaned_Data.pkl', 'wb') as f:
    pickle.dump(data, f)
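
A later notebook would presumably read these files back with `pickle.load`. The sketch below shows the round trip on a toy dictionary with the same two keys, `features` and `labels`. The values and the temporary path are invented for the example, so it does not touch `./files`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for train_data: same keys as in the notebook, toy values
toy_train_data = {
    'features': [[0.0, 0.7], [0.5, 0.0]],
    'labels': [1, 0],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_data.pkl')
    # Write the dictionary in binary mode, as in the notebook
    with open(path, 'wb') as f:
        pickle.dump(toy_train_data, f)
    # Read it back in binary mode
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(sorted(loaded))             # keys of the restored dictionary
print(loaded == toy_train_data)   # the round trip preserves the contents
```

Two details are worth keeping in mind. `open(..., 'wb')` does not create missing directories, so `./files` has to exist before the cell runs. The fitted `tfidf` object is not among the saved files, so if new reviews need to be vectorized later, it may be worth pickling it in the same way.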
