## Data preprocessing file  
The input is our dataset. The output is the same data cleaned of stray, uninformative characters, lemmatized, vectorized with TF-IDF over unigrams, bigrams and trigrams, and split into training and test sets.
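
To make the n-gram terminology concrete, here is a minimal sketch of how unigrams, bigrams and trigrams are formed from a token sequence. The helper `word_ngrams` is hypothetical and is not part of scikit-learn; `TfidfVectorizer` does its own tokenization and also computes the TF-IDF weights, which this sketch does not attempt.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "never get old".split()
print(word_ngrams(tokens))
# ['never', 'get', 'old', 'never get', 'get old', 'never get old']
```

Each of these strings is a candidate feature. In the notebook the vocabulary is capped with `max_features=8000`.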

In [5]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

In [14]:
data = pd.read_csv('./IMDB Dataset.csv')

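# Lemmatize the text word by word.
# Relies on the global `lemmatizer` created in the NLTK setup cell;
# lemmatize() treats every word as a noun by default.
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])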

def clean_text(text: str) -> str:
    """
    Clean a review: lowercase it, strip HTML tags and special characters,
    collapse repeated characters, remove stop words, and lemmatize.
    Parameters:
    text (str): Raw text.
    Returns:
    str: Cleaned text.
    """
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)  # strip HTML tags such as <br />
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # keep only letters, digits and whitespace
    text = re.sub(r'(.)\1{2,}', r'\1', text)  # collapse runs of 3+ identical characters to one
    text = ' '.join([word for word in text.split() if word not in stop_words])
    text = lemmatize_text(text)
    return text
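
The three regular expressions in `clean_text` can be traced on a small string. This is only an illustrative sketch: the sample sentence is made up, and the stop-word and lemmatization steps are left out so that the block needs nothing beyond `re`.

```python
import re

# Made-up sample, not taken from the dataset
sample = "Sooooo good!!!<br /><br />It's worth $1000."

step1 = sample.lower()
step2 = re.sub(r'<.*?>', '', step1)            # drop HTML tags
step3 = re.sub(r'[^A-Za-z0-9\s]', '', step2)   # drop punctuation and symbols
step4 = re.sub(r'(.)\1{2,}', r'\1', step3)     # collapse runs of 3+ identical characters

print(step3)  # sooooo goodits worth 1000
print(step4)  # so goodits worth 10
```

Two side effects are visible here. Because tags are replaced with an empty string, the words on either side of `<br />` are glued together (`goodits`); substituting a space instead would keep them apart. The repeated-character rule also applies to digits, so `1000` becomes `10`.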

In [10]:
# Download the stop-word list and WordNet data, then set up the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Text before cleaning

In [7]:
print(data.iloc[5,0])

Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. It just never gets old, despite my having seen it some 15 or more times in the last 25 years. Paul Lukas' performance brings tears to my eyes, and Bette Davis, in one of her very few truly sympathetic roles, is a delight. The kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. And the mother's slow awakening to what's happening in the world and under her own roof is believable and startling. If I had a dozen thumbs, they'd all be "up" for this movie.


### Text after cleaning

In [15]:
# Clean the reviews and binarize the target (positive -> 1, anything else -> 0)
data["review"] = data['review'].apply(clean_text)
data['sentiment'] = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
print(data.iloc[5,0])

probably alltime favorite movie story selflessness sacrifice dedication noble cause preachy boring never get old despite seen 15 time last 25 year paul lukas performance brings tear eye bette davis one truly sympathetic role delight kid grandma say like dressedup midget child make fun watch mother slow awakening whats happening world roof believable startling dozen thumb theyd movie


In [17]:
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)

# Vectorize the text with TF-IDF over unigrams, bigrams and trigrams
tfidf = TfidfVectorizer(ngram_range=(1,3),max_features=8000)
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()

train_data = {
    'features': X_train_tfidf,
    'labels': y_train
}

test_data = {
    'features': X_test_tfidf,
    'labels': y_test
}

# Save the training data to pickle
with open('./files/train_data.pkl', 'wb') as f:
    pickle.dump(train_data, f)

# Save the test data to pickle
with open('./files/test_data.pkl', 'wb') as f:
    pickle.dump(test_data, f)

# Save the cleaned text to pickle
with open('./files/Cleaned_Data.pkl', 'wb') as f:
    pickle.dump(data, f)
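
A later notebook would presumably read these files back with `pickle.load`. The sketch below shows that round trip on a toy dictionary with the same `features` / `labels` keys. The values and the temporary directory are placeholders, not the real data or the `./files` path.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real train_data dictionary
train_data = {'features': [[0.0, 0.5], [0.3, 0.0]], 'labels': [1, 0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_data.pkl')

    # Write the dictionary, then read it back
    with open(path, 'wb') as f:
        pickle.dump(train_data, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded == train_data)  # True
```

Note that `open(..., 'wb')` does not create missing directories, so `./files` has to exist before the saving cell runs; `os.makedirs('./files', exist_ok=True)` is one way to ensure that.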
