**Датасет**: https://www.kaggle.com/datasets/andrewmvd/steam-reviews

**Об этом наборе данных**: Набор данных содержит более 6,4 миллиона общедоступных рецензий на английском языке из раздела Steam Reviews магазина Steam, управляемого компанией Valve. Каждый отзыв описывается текстом отзыва, идентификатором игры, к которой он относится, настроением отзыва (положительным или отрицательным) и количеством пользователей, посчитавших отзыв полезным.

In [139]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec, FastText
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/dm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Загрузка датасета

In [140]:
# Загрузка данных
df = pd.read_csv("dataset.csv")

df.dropna(subset=['review_text'], inplace=True)
df.head()

Unnamed: 0,app_id,app_name,review_text,review_score,review_votes
0,203160,Tomb Raider,Definitely one of the best games of 2013! A mu...,1,0
1,300,Day of Defeat: Source,Really fun! A bit sad that there aren't so man...,1,1
2,234710,Poker Night 2,Great little poker game with fun characters (i...,1,0
3,113200,The Binding of Isaac,Okay I have just bought the Binding of Issac f...,1,0
4,252030,Valdis Story: Abyssal City,Apart from the at times annoying platforming a...,1,0


# BOW (Bag of Words)

In [141]:
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(df['review_text'])

# TF-IDF

In [142]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['review_text'])

# Word2Vec

In [143]:
tokenized_reviews = [word_tokenize(review.lower()) for review in df['review_text']]
word2vec_model = Word2Vec(tokenized_reviews, vector_size=100, window=3, min_count=1, sg=1)

# FastText

In [144]:
fasttext_model = FastText(tokenized_reviews, vector_size=100, window=3, min_count=1, sg=1)

# Пример текста для поиска

In [145]:
search_text = "best game"

# Результаты поиска

In [146]:
search_vector_bow = bow_vectorizer.transform([search_text])
search_vector_tfidf = tfidf_vectorizer.transform([search_text])
search_vector_word2vec = sum([word2vec_model.wv[word] for word in word_tokenize(search_text.lower()) if word in word2vec_model.wv])
search_vector_fasttext = sum([fasttext_model.wv[word] for word in word_tokenize(search_text.lower()) if word in fasttext_model.wv])

results = []

# Результаты для BOW
bow_cosine_similarities = cosine_similarity(search_vector_bow, bow_matrix).flatten()
bow_top_indices = bow_cosine_similarities.argsort()[-5:][::-1]
results.append(('BOW', df.loc[bow_top_indices, 'review_text'].values))

# Результаты для TF-IDF
tfidf_cosine_similarities = cosine_similarity(search_vector_tfidf, tfidf_matrix).flatten()
tfidf_top_indices = tfidf_cosine_similarities.argsort()[-5:][::-1]
results.append(('TF-IDF', df.loc[tfidf_top_indices, 'review_text'].values))

# Результаты для Word2Vec
word2vec_similarities = [(cosine_similarity(search_vector_word2vec.reshape(1, -1), vector.reshape(1, -1)).item(), i) for i, vector in enumerate(word2vec_model.wv.vectors)]
word2vec_top_indices = [index for _, index in sorted(word2vec_similarities, reverse=True)[:5] if index < len(df)]
results.append(('Word2Vec', df.loc[word2vec_top_indices, 'review_text'].values))

# Результаты для FastText
fasttext_similarities = [(cosine_similarity(search_vector_fasttext.reshape(1, -1), vector.reshape(1, -1)).item(), i) for i, vector in enumerate(fasttext_model.wv.vectors)]
fasttext_top_indices = [index for _, index in sorted(fasttext_similarities, reverse=True)[:5] if index < len(df)]
results.append(('FastText', df.loc[fasttext_top_indices, 'review_text'].values))

# Вывод результатов

In [147]:
for model, texts in results:
    print(f"\nTop reviews using {model}:")
    for i, text in enumerate(texts, 1):
        print(f"{i}. {text}")


Top reviews using BOW:
1. I really like this game, For the people thinking about playing it, I'd suggest you to do one thing as I did. Don't play the logical way you think will benefit you the most and make you to 'win' this game. Play it the way as you were the person who got these chances, play it like YOU were the faust, when I did it, I figured out, I would die. Isn't that nice? :)  Graphics looks good the only thing I find that is bad is that I can't play this game in windowed mode, but it seems like it's running in borderless mode so Alt+tabbing is working without any problems.  On the other side, audio is really bad, it seems like it's lagging but that may be my problem only. Characters are not talking, only subtitles are there, but I don't really mind it.  This game is very short when you play it only once but I believe this game is not meant to be played only once, considering every action you do changes the story. First time I played this game I finished it in 40 minutes, se