## 1 - Data Collection

I start the project by collecting a small sample to work with. The cells below collect 2000 reviews for a single game from the Steam reviews API, store them in a DataFrame, and save the DataFrame in Feather format, since Feather files are a lot smaller than the equivalent CSV files.

In [None]:
import pandas as pd
import requests

In [None]:
def get_reviews(appid, params):
        url_start = 'https://store.steampowered.com/appreviews/'
        response = requests.get(url=url_start+appid, params=params)
        return response.json() # return data extracted from the json response

In [None]:
appid = '1091500' # Cyberpunk 2077, good mix of positive and negative, just for samples
cursor = '*'
params = { # https://partner.steamgames.com/doc/store/getreviews
        'json' : 1,
        'filter' : 'all', # sort by: recent, updated, all (helpfulness)
        'language' : 'english', # https://partner.steamgames.com/doc/store/localization
        'day_range' : 9223372036854775807, # shows reviews from all time
        'cursor' : cursor.encode(), # for pagination
        'review_type' : 'all', # all, positive, negative
        'purchase_type' : 'all', # all, non_steam_purchase, steam
        'num_per_page' : 100 # max amount per request
    }

In [None]:
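n = 2000 # total number of reviews to collect
results = []
for i in range(n//100): # one request per 100 reviews
    result = get_reviews(appid, params)
    results += result['reviews']
    params['cursor'] = result['cursor'] # the returned cursor points at the next page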
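The loop above always issues `n//100` requests, so it keeps requesting even if the game runs out of reviews. A minimal sketch of a more defensive cursor loop is below. `collect_reviews` and `fake_fetch` are hypothetical names, and `fake_fetch` is an in-memory stand-in for `get_reviews`, so nothing here touches the network.

```python
def collect_reviews(fetch_page, n_reviews, start_cursor='*'):
    """Follow the cursor returned with each page until enough reviews are collected."""
    results, cursor, seen = [], start_cursor, set()
    while len(results) < n_reviews and cursor not in seen:
        seen.add(cursor)                 # guard against a cursor that repeats
        page = fetch_page(cursor)
        if not page['reviews']:          # an empty page means nothing is left
            break
        results.extend(page['reviews'])
        cursor = page['cursor']
    return results[:n_reviews]

# Hypothetical stand-in for the API: three pages keyed by cursor.
def fake_fetch(cursor):
    pages = {'*': (['a', 'b', 'c'], 'c1'),
             'c1': (['d', 'e'], 'c2'),
             'c2': ([], 'c2')}
    reviews, next_cursor = pages[cursor]
    return {'reviews': reviews, 'cursor': next_cursor}

sample = collect_reviews(fake_fetch, n_reviews=4)
everything = collect_reviews(fake_fetch, n_reviews=10)
```

With the real API, `fetch_page` could be a small wrapper that sets `params['cursor']` and calls `get_reviews(appid, params)`.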

In [None]:
reviews_df = pd.DataFrame(results)[['review', 'voted_up']]
reviews_df

In [None]:
reviews_df['voted_up'].value_counts()

In [None]:
reviews_df.to_feather('data/sample_reviews.feather') # to_feather has no index argument; the default RangeIndex is fine

Here I read in the sample data and perform the train/test split.

In [4]:
from sklearn.model_selection import train_test_split

In [2]:
reviews_df = pd.read_feather('data/sample_reviews.feather') # the file was saved in feather format, not csv
reviews_df

Unnamed: 0,review,voted_up
0,While it does feel like they needed a bit more...,True
1,Game is asbolutely good. The Night City is som...,True
2,This game has a JoJo reference.,True
3,"Cheers everyone, after 8 years we finally made...",True
4,made my penis to perfection in a call with fri...,True
...,...,...
1995,The game doesn't bring anything new to the tab...,False
1996,pp go smol ( ͡° ͜ʖ ͡°)\n\npp go big (˵ ͡☉ ͜ʖ ͡...,True
1997,"Great characters, nice city, thrilling storyli...",True
1998,So here is my review after all of this time.\n...,True


In [3]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    2000 non-null   object
 1   voted_up  2000 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 17.7+ KB


In [5]:
df_train, df_test = train_test_split(reviews_df, random_state=212)
X_train, y_train = df_train['review'], df_train['voted_up']
X_test, y_test = df_test['review'], df_test['voted_up']
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1500,), (1500,), (500,), (500,))
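The 1500/500 shapes come from `train_test_split` holding out 25% of the rows when no size is given. As a rough illustration of the idea only, here is a stdlib sketch of a seeded shuffle split. `simple_split` is a hypothetical helper and does not produce the same permutation as sklearn.

```python
import math
import random

def simple_split(items, test_size=0.25, seed=212):
    """Shuffle indices with a seeded RNG and hold out a test_size fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

rows = list(range(2000))  # stand-in for the 2000 review rows
train_rows, test_rows = simple_split(rows)
sizes = (len(train_rows), len(test_rows))
```

Fixing the seed is what makes the split repeatable between runs, which is the role `random_state=212` plays above.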

## 2 - Data Processing

In this section I run through the pre-processing and feature engineering steps I think I might use with the full dataset.

### Tokenization

NLTK's RegexpTokenizer lets me keep only Latin characters and digits. Steam reviews have a language option, but reviews marked as English are often written in other languages. In the same step, I remove Steam's bracketed markup tags and punctuation from the text.

In [6]:
import nltk
from nltk.tokenize import RegexpTokenizer
import numpy as np
import re
from string import punctuation
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
punctuation_list = list(punctuation) + ['`', '’', '…']

In [8]:
def tokenize(review):
    review = re.sub(r'\[.*?\]', '', review) # remove markdown tags, only needed for Steam reviews
    review = review.translate(str.maketrans('', '', ''.join(punctuation_list))) # remove all punctuation
    tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+') # tokenize words with only numbers and latin characters
    return tokenizer.tokenize(review.lower())
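To see what this function does to a single review, here is a stdlib-only sketch of the same three steps. It assumes `re.findall` with the same pattern is an adequate stand-in for `RegexpTokenizer`. `tokenize_sketch` and both sample strings are made up for illustration.

```python
import re
from string import punctuation

extra_punctuation = ['`', '’', '…']
strip_table = str.maketrans('', '', punctuation + ''.join(extra_punctuation))

def tokenize_sketch(review):
    review = re.sub(r'\[.*?\]', '', review)   # drop bracketed tags such as [h1] or [/h1]
    review = review.translate(strip_table)    # delete punctuation characters outright
    return re.findall(r'[a-zA-Z0-9]+', review.lower())  # keep Latin letters and digits only

tokens = tokenize_sketch('[h1]Great[/h1] game… 10/10, don’t miss it!')
cyrillic = tokenize_sketch('Отличная игра, 10/10')
```

Because punctuation is deleted rather than replaced by a space, `10/10` becomes the single token `1010` and `don’t` becomes `dont`. A review written entirely in Cyrillic is reduced to its digits.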

In [9]:
X_train_tokenized = list(map(tokenize, X_train))
X_test_tokenized = list(map(tokenize, X_test))
len(X_train_tokenized), len(X_test_tokenized)

(1500, 500)

### Stop-Words Removal

In [10]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
stopwords_list = stopwords.words('english') + punctuation_list
len(stopwords_list)

214

In [12]:
X_train_stopworded = [[word for word in review if word not in stopwords_list] for review in X_train_tokenized]
X_test_stopworded = [[word for word in review if word not in stopwords_list] for review in X_test_tokenized]
len(X_train_stopworded), len(X_test_stopworded)

(1500, 500)
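A small sketch of the same filtering step, using a `set` for the stop words so each membership check is a constant-time lookup rather than a scan of a 214-item list. The stop-word set here is a made-up mini list, not NLTK's, and `remove_stopwords` is a hypothetical helper.

```python
stopword_set = {'the', 'is', 'a', 'it', 'and', 'this'}  # hypothetical mini list, not NLTK's

def remove_stopwords(tokenized_reviews, stopword_set):
    """Drop every token that appears in the stop-word set."""
    return [[w for w in review if w not in stopword_set] for review in tokenized_reviews]

sample = [['this', 'game', 'is', 'a', 'buggy', 'mess'], ['it', 'is', 'the']]
filtered = remove_stopwords(sample, stopword_set)
```

The second sample review ends up empty, which can also happen with very short real reviews, so later steps need to cope with empty documents.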

### Lemmatization

In [13]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
lemmatizer = WordNetLemmatizer() 
X_train_lemmatized = [list(map(lemmatizer.lemmatize, review)) for review in X_train_stopworded]
X_test_lemmatized = [list(map(lemmatizer.lemmatize, review)) for review in X_test_stopworded]
len(X_train_lemmatized), len(X_test_lemmatized)

(1500, 500)

### Finalizing

In [15]:
X_train_preprocessed = [' '.join(review) for review in X_train_lemmatized]
X_test_preprocessed = [' '.join(review) for review in X_test_lemmatized]
X_train_split = [review.split(' ') for review in X_train_preprocessed]
X_test_split = [review.split(' ') for review in X_test_preprocessed]
len(X_train_preprocessed), len(X_test_preprocessed)

(1500, 500)

### Bag of Words

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
cv = CountVectorizer()
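# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() needs scikit-learn >= 1.0
X_train_bow = pd.DataFrame(cv.fit_transform(X_train_preprocessed).todense(), columns=cv.get_feature_names_out())
X_test_bow = pd.DataFrame(cv.transform(X_test_preprocessed).todense(), columns=cv.get_feature_names_out())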
X_train_bow.shape, X_test_bow.shape

((1500, 12529), (500, 12529))
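Both matrices have 12529 columns because the vocabulary is learned from the training reviews only and then reused for the test reviews. A stdlib sketch of that fit/transform idea is below. `fit_vocabulary` and `to_counts` are hypothetical helpers, and unlike `CountVectorizer`'s default tokenizer this sketch keeps single-character tokens.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a sorted vocabulary from the training documents only."""
    return sorted({token for doc in docs for token in doc.split()})

def to_counts(doc, vocabulary):
    """Count vocabulary terms in one document; unseen words are ignored."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

train_docs = ['good game good story', 'bad game']
test_docs = ['good port bad launch']
vocabulary = fit_vocabulary(train_docs)
train_matrix = [to_counts(d, vocabulary) for d in train_docs]
test_matrix = [to_counts(d, vocabulary) for d in test_docs]
```

Words that appear only in the test documents (`port` and `launch` here) contribute nothing, which is why the column count stays fixed.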

### TF-IDF

This Vectorizer also allows for n-gram creation, which I will use in the full dataset.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
tf = TfidfVectorizer()
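# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() needs scikit-learn >= 1.0
X_train_tf = pd.DataFrame(tf.fit_transform(X_train_preprocessed).todense(), columns=tf.get_feature_names_out())
X_test_tf = pd.DataFrame(tf.transform(X_test_preprocessed).todense(), columns=tf.get_feature_names_out())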
X_train_tf.shape, X_test_tf.shape

((1500, 12529), (500, 12529))
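To make the n-gram and weighting ideas concrete, here is a stdlib sketch that builds unigrams plus bigrams and then applies a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is my understanding of `TfidfVectorizer`'s defaults. `ngrams` and `tfidf` are hypothetical helpers, not sklearn functions.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams from n_min to n_max words, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def tfidf(docs):
    """Weight term counts by smoothed IDF and L2-normalize each document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

docs = [ngrams(d.split()) for d in ['great game', 'buggy game']]
vectors, idf = tfidf(docs)
```

`game` appears in both documents and gets the lowest weight, while the bigrams `great game` and `buggy game` are each unique to one document. With the real vectorizer, the equivalent switch would be the `ngram_range=(1, 2)` argument.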

### Document Embeddings

I'd like to come back to this and try spaCy's document embeddings, but gensim's D2VTransformer is easier to use for now, as it works just like an sklearn transformer.

In [94]:
from gensim.sklearn_api import D2VTransformer
from sklearn.preprocessing import MinMaxScaler

In [112]:
vectorizer = D2VTransformer()
scaler = MinMaxScaler((1, 2)) # scaled to prevent negative values, which do not work with Naive Bayes models
X_train_embed = scaler.fit_transform(pd.DataFrame(vectorizer.fit_transform(X_train_split)))
X_test_embed = scaler.transform(pd.DataFrame(vectorizer.transform(X_test_split)))
X_train_embed.shape, X_test_embed.shape

((1500, 100), (500, 100))
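A stdlib sketch of what scaling to the range (1, 2) does, under the assumption that each column is scaled independently using the minimum and maximum seen in the training data. `fit_min_max` and `scale` are hypothetical helpers and the numbers are made up.

```python
def fit_min_max(rows):
    """Record the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def scale(rows, mins, maxs, low=1.0, high=2.0):
    """Map each value linearly so the training min goes to low and the training max to high."""
    scaled = []
    for row in rows:
        scaled.append([
            low + (x - mn) / (mx - mn) * (high - low) if mx > mn else low
            for x, mn, mx in zip(row, mins, maxs)
        ])
    return scaled

train = [[-0.5, 0.2], [0.5, 0.4], [0.0, 0.3]]
test = [[-1.0, 0.3]]
mins, maxs = fit_min_max(train)
train_scaled = scale(train, mins, maxs)
test_scaled = scale(test, mins, maxs)
```

Test values can fall outside the fitted range: the first test value lands at 0.5, below the lower bound of 1. With a (1, 2) range a test value only turns negative if it lies more than one full training range below the training minimum, which leaves more headroom than a (0, 1) range would.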

## 3 - EDA

I didn't perform an EDA on the sample data, but this is where in the process it would occur.

## 4 - Base Models

I try a few sklearn classifiers here on each of my feature sets. The big takeaways are that TF-IDF is the best-performing feature set on average and that the gensim document embeddings greatly underperform. In terms of models, Logistic Regression has the best average test accuracy across feature sets. SVM with TF-IDF scores about the same on the test set, but it took so long to run that I won't be attempting it on the larger dataset.

In [46]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [50]:
def get_metrics(y_train, y_hat_train, y_test, y_hat_test):
    train_accuracy = accuracy_score(y_train, y_hat_train)
    test_accuracy = accuracy_score(y_test, y_hat_test)
    train_precision = precision_score(y_train, y_hat_train)
    test_precision = precision_score(y_test, y_hat_test)
    train_recall = recall_score(y_train, y_hat_train)
    test_recall = recall_score(y_test, y_hat_test)
    
    print('\t\tAccuracy\tPrecision\tRecall')
    print(f'Training:\t{round(train_accuracy, 2)}\t\t{round(train_precision, 2)}\t\t{round(train_recall, 2)}')
    print(f'Testing:\t{round(test_accuracy, 2)}\t\t{round(test_precision, 2)}\t\t{round(test_recall, 2)}')
    
    return {'train_accuracy':train_accuracy, 'train_precision':train_precision, 'train_recall':train_recall,
            'test_accuracy':test_accuracy, 'test_precision':test_precision, 'test_recall':test_recall}
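For reference, the three scores reported by `get_metrics` can be computed by hand from the confusion counts. This is a small stdlib sketch with made-up labels. `binary_metrics` is a hypothetical helper, and returning 0.0 for an empty denominator is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for boolean labels, with True as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # predicted positive, actually positive
    fp = sum(1 for t, p in pairs if not t and p)      # predicted positive, actually negative
    fn = sum(1 for t, p in pairs if t and not p)      # predicted negative, actually positive
    tn = sum(1 for t, p in pairs if not t and not p)  # predicted negative, actually negative
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
scores = binary_metrics(y_true, y_pred)
```

Here precision and recall are both 2/3: one of the three positive predictions is wrong, and one of the three positive reviews is missed.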

In [124]:
model_metrics = []

### Basic Model

In [65]:
y_train.value_counts(normalize=True)

True     0.596667
False    0.403333
Name: voted_up, dtype: float64

In [66]:
train_preds = [True]*len(y_train)
test_preds = [True]*len(y_test)
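As a sanity check on what an all-True predictor should score, here is a stdlib sketch. It assumes 895 positive labels out of 1500 training rows, which is what the 0.596667 share printed above implies.

```python
n_train, n_positive = 1500, 895   # assumption: 895 / 1500 = 0.596667, the share printed above
y_true = [True] * n_positive + [False] * (n_train - n_positive)
y_pred = [True] * n_train         # always predict the positive class

true_positives = sum(t and p for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_train
precision = true_positives / sum(y_pred)
recall = true_positives / sum(y_true)
```

By construction, recall is 1.0 and accuracy and precision both equal the share of positive labels. Numbers that differ from this for an all-True predictor, such as a recall below 1.0, usually mean the prediction lists were overwritten by another cell before the metrics cell ran.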

In [125]:
model_name = 'Predict only "Suggested"'
data_name = 'None'
print(f'{model_name}\t{data_name}')
metrics = {'model':model_name, 'data':data_name}
metrics.update(get_metrics(y_train, train_preds, y_test, test_preds))
model_metrics.append(metrics)

Predict only "Suggested"	None
		Accuracy	Precision	Recall
Training:	0.63		0.63		0.93
Testing:	0.63		0.64		0.91


### Baseline Models

In [126]:
models = [('Logistic Regression', LogisticRegression),
          ('Multinomial Naive Bayes', MultinomialNB),
          ('Random Forest', RandomForestClassifier),
          ('Support Vector Machines', SVC)]
datasets = [('Bag of Words', X_train_bow, X_test_bow),
             ('TF-IDF', X_train_tf, X_test_tf),
             ('Document Embeddings', X_train_embed, X_test_embed)]

In [127]:
for model_name, model in models:
    for data_name, X_tr, X_te in datasets: # separate names so the raw X_train/X_test Series are not overwritten
        classifier = model()
        classifier.fit(X_tr, y_train)
        train_preds = classifier.predict(X_tr)
        test_preds = classifier.predict(X_te)

        print(f'{model_name}\t\t{data_name}')
        metrics = {'model':model_name, 'data':data_name}
        metrics.update(get_metrics(y_train, train_preds, y_test, test_preds))
        print()
        model_metrics.append(metrics)

Logistic Regression		Bag of Words
		Accuracy	Precision	Recall
Training:	0.98		0.97		1.0
Testing:	0.82		0.84		0.87

Logistic Regression		TF-IDF
		Accuracy	Precision	Recall
Training:	0.92		0.89		0.98
Testing:	0.83		0.81		0.95

Logistic Regression		Document Embeddings
		Accuracy	Precision	Recall
Training:	0.63		0.63		0.93
Testing:	0.64		0.65		0.92

Multinomial Naive Bayes		Bag of Words
		Accuracy	Precision	Recall
Training:	0.94		0.93		0.97
Testing:	0.8		0.83		0.86

Multinomial Naive Bayes		TF-IDF
		Accuracy	Precision	Recall
Training:	0.85		0.8		1.0
Testing:	0.73		0.71		0.98

Multinomial Naive Bayes		Document Embeddings
		Accuracy	Precision	Recall
Training:	0.6		0.61		0.92
Testing:	0.61		0.63		0.91

Random Forest		Bag of Words
		Accuracy	Precision	Recall
Training:	1.0		1.0		1.0
Testing:	0.77		0.81		0.82

Random Forest		TF-IDF
		Accuracy	Precision	Recall
Training:	1.0		1.0		1.0
Testing:	0.77		0.78		0.88

Random Forest		Document Embeddings
		Accuracy	Precision	Recall
Training:	1.0		1.0		1.0


In [128]:
model_metrics_df = pd.DataFrame(model_metrics)
model_metrics_df.sort_values(by='test_accuracy', ascending=False)

Unnamed: 0,model,data,train_accuracy,train_precision,train_recall,test_accuracy,test_precision,test_recall
11,Support Vector Machines,TF-IDF,0.995333,0.992239,1.0,0.83,0.819718,0.932692
2,Logistic Regression,TF-IDF,0.920667,0.893509,0.984358,0.828,0.81044,0.945513
1,Logistic Regression,Bag of Words,0.983333,0.972826,1.0,0.816,0.839506,0.871795
4,Multinomial Naive Bayes,Bag of Words,0.938,0.929336,0.969832,0.802,0.827692,0.862179
8,Random Forest,TF-IDF,0.998,0.996659,1.0,0.772,0.779661,0.884615
7,Random Forest,Bag of Words,0.998,0.997768,0.998883,0.766,0.805643,0.823718
10,Support Vector Machines,Bag of Words,0.824667,0.776224,0.992179,0.752,0.72488,0.971154
5,Multinomial Naive Bayes,TF-IDF,0.847333,0.796263,1.0,0.734,0.707657,0.977564
9,Random Forest,Document Embeddings,1.0,1.0,1.0,0.672,0.702186,0.823718
3,Logistic Regression,Document Embeddings,0.632,0.629239,0.932961,0.644,0.651584,0.923077


In [129]:
model_metrics_df.groupby(by='model').mean(numeric_only=True).sort_values(by='test_accuracy', ascending=False) # numeric_only skips the text 'data' column

Unnamed: 0_level_0,train_accuracy,train_precision,train_recall,test_accuracy,test_precision,test_recall
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Logistic Regression,0.845333,0.831858,0.972439,0.762667,0.767176,0.913462
Support Vector Machines,0.815333,0.798072,0.973557,0.737333,0.729749,0.936966
Random Forest,0.998667,0.998142,0.999628,0.736667,0.762496,0.844017
Multinomial Naive Bayes,0.795778,0.778821,0.962384,0.716667,0.72282,0.915598
"Predict only ""Suggested""",0.626,0.625753,0.928492,0.63,0.644647,0.907051


In [130]:
model_metrics_df.groupby(by='data').mean(numeric_only=True).sort_values(by='test_accuracy', ascending=False) # numeric_only skips the text 'model' column

Unnamed: 0_level_0,train_accuracy,train_precision,train_recall,test_accuracy,test_precision,test_recall
data,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
TF-IDF,0.940333,0.919668,0.996089,0.791,0.779369,0.935096
Bag of Words,0.936,0.919038,0.990223,0.784,0.79943,0.882212
Document Embeddings,0.715,0.716464,0.944693,0.64,0.657882,0.890224
,0.626,0.625753,0.928492,0.63,0.644647,0.907051


### Gridsearch

I don't need a grid search for this preliminary run; I'll add it to the final notebooks.

## 5 - Neural Networks

I want to try these for the full dataset, but I'm not going to run these on this sample.

## 6 - Topic Modeling

gensim's LDA topic modeling is the first approach I run through. I separated the reviews into positive and negative sets, but the resulting topics were more or less the same. Getting a meaningful result will take more work: I need to remove tokens that are common to both classes. This may also be worth trying with bigrams once I build those for the full dataset. For now, I won't bother adding this to the final project.
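One possible way to do the token filtering described above, sketched with the stdlib only: compute each token's document frequency per class and drop tokens that are frequent in both. `document_frequency`, `shared_tokens` and the `min_share` threshold are hypothetical names and values for illustration, not something I have tuned.

```python
from collections import Counter

def document_frequency(reviews):
    """Share of reviews in which each token appears at least once."""
    n = len(reviews)
    counts = Counter(token for review in reviews for token in set(review))
    return {token: c / n for token, c in counts.items()}

def shared_tokens(pos_reviews, neg_reviews, min_share=0.75):
    """Tokens that appear in at least min_share of the reviews of both classes."""
    pos_df = document_frequency(pos_reviews)
    neg_df = document_frequency(neg_reviews)
    return {t for t in pos_df.keys() & neg_df.keys()
            if pos_df[t] >= min_share and neg_df[t] >= min_share}

pos = [['game', 'great', 'story'], ['game', 'fun']]
neg = [['game', 'crash', 'refund'], ['game', 'bug', 'story']]
common = shared_tokens(pos, neg)
pos_filtered = [[t for t in review if t not in common] for review in pos]
neg_filtered = [[t for t in review if t not in common] for review in neg]
```

The filtered lists could then be passed to `corpora.Dictionary` in place of `X_split_pos` and `X_split_neg`.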

In [19]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis.gensim
pyLDAvis.enable_notebook()


In [32]:
df_tmp = pd.DataFrame(zip(X_train_preprocessed, y_train)) # column 0 = review text, column 1 = voted_up
X_split_pos = [review.split(' ') for review in df_tmp.loc[df_tmp[1]][0].to_numpy()]
X_split_neg = [review.split(' ') for review in df_tmp.loc[~df_tmp[1]][0].to_numpy()]

In [28]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1500 non-null   object
 1   1       1500 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 13.3+ KB

In [34]:
id2word = corpora.Dictionary(X_split_pos)
corpus = [id2word.doc2bow(review) for review in X_split_pos]
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10, iterations=100, random_state=212)
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word)
vis


In [35]:
id2word = corpora.Dictionary(X_split_neg)
corpus = [id2word.doc2bow(review) for review in X_split_neg]
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10, iterations=100, random_state=212)
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word)
vis


## 7 - Unlabeled Data Analysis

When the full models are built, I will need to run them against unlabeled data that I collect from Reddit and/or Twitter.