# Data Processing

In addition to reading in the raw data from gzip-compressed pickle files (smaller file sizes than CSV), I also perform the train-test split here. The split could have been done after the processing, but I wanted to make sure the differently processed files stayed in the same order.

## Preprocessing

Most of the functions I use are the same as in the sample dataset. I've moved them into scripts/preprocessing.py so I can reuse them later on the target data. The steps are separated out rather than combined in a pipeline to make debugging easier. The large data size caused many errors and long runtimes, so running each step individually was the most reliable way to get through it. Avoiding pipelines now may also let me drop scikit-learn from the final product, which could help in getting all the libraries I need loaded onto Heroku.
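A minimal sketch of what running the steps one at a time looks like. Only the three function names come from this notebook; the function bodies are illustrative stand-ins (including the assumption that the markup is BBCode-style tags), not the actual code in scripts/preprocessing.py.

```python
import re
from string import punctuation

# Hypothetical stand-ins for the functions in scripts/preprocessing.py.
def remove_markdown(text):
    """Replace bracketed tags such as [b] and [/b] with a space (assumed markup)."""
    return re.sub(r'\[/?[^\]]+\]', ' ', text)

def remove_punctuation(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans('', '', punctuation))

def tokenize(text):
    """Split on whitespace (the real function may rely on nltk instead)."""
    return text.split()

reviews = ['[b]Great[/b] game, really fun!', "Don't buy it."]

# Each step runs on its own, so a failure points at a single function.
step = list(map(remove_markdown, reviews))
step = list(map(remove_punctuation, step))
tokens = list(map(tokenize, step))
print(tokens)
```

Because every intermediate result is a plain list, any step can be rerun or inspected without repeating the earlier ones.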

In [1]:
# This allows importing of scripts, which are stored in a folder one level up
import sys
sys.path.append('..')

In [2]:
import numpy as np
import pandas as pd
from scripts import preprocessing
from string import punctuation
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
stopwords_list = stopwords.words('english') + list(punctuation) + ['`', '’', '…', '\n']

In [4]:
# Convert once to a set so each membership test is O(1) rather than a list scan
stopwords_set = set(stopwords_list)

for i in range(10):
    print(i+1, 'of 10 started')
    df = pd.read_pickle(f'../data/reviews_raw_{i}.pkl.gz')

    # Save the labels for this chunk
    y = df['voted_up'].to_numpy()
    pd.DataFrame(y, columns=['voted_up']).to_pickle(f'../data/processed/y_{i}.pkl.gz')

    # Clean and tokenize the reviews one step at a time
    X = df['review'].to_numpy()
    X = list(map(preprocessing.remove_markdown, X))
    X = list(map(preprocessing.remove_punctuation, X))
    X = list(map(preprocessing.tokenize, X))
    pd.DataFrame([' '.join(review) for review in X]).to_pickle(f'../data/processed/X_preprocessed_{i}.pkl.gz')

    # Save a second copy with stopwords removed
    X_stopword = [[word for word in review if word not in stopwords_set] for review in X]
    pd.DataFrame([' '.join(review) for review in X_stopword]).to_pickle(f'../data/processed/X_stopword_{i}.pkl.gz')

1 of 10 started
2 of 10 started
3 of 10 started
4 of 10 started
5 of 10 started
6 of 10 started
7 of 10 started
8 of 10 started
9 of 10 started
10 of 10 started


## Train-Test Split

To perform the train-test split, I combine all 10 chunks of each dataset, then run sklearn's train_test_split function. I use a fixed random state so that all three datasets, which have the same number of rows, are split identically. This matters because my computer cannot load all three at once, so the split has to be run on each dataset separately.
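The reason a shared seed keeps the separately split files aligned can be shown in miniature. The sketch below uses only the standard library as a stand-in for scikit-learn, and `split_indices` is a hypothetical helper that is not part of this project: a seeded shuffle of row positions depends only on the seed and the number of rows, so two datasets of equal length get the same split.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Return (train, test) row positions from a seeded shuffle."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = int(n_rows * test_size)
    return positions[n_test:], positions[:n_test]

labels = [f'y{i}' for i in range(10)]
reviews = [f'review {i}' for i in range(10)]

# Split each dataset in its own call, as if only one fit in memory at a time.
train_y, test_y = split_indices(len(labels), 0.2, seed=404)
train_X, test_X = split_indices(len(reviews), 0.2, seed=404)

# Row positions match, so every review still lines up with its label.
print(train_y == train_X, test_y == test_X)
```

If the datasets had different lengths, or the seed differed between calls, the positions would no longer match and reviews would be paired with the wrong labels.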

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
for data_name in ['y', 'X_stopword', 'X_preprocessed']:
    print('Starting', data_name)

    # DataFrame.append was removed in pandas 2.0, so concatenate the 10 chunks instead
    df = pd.concat([
        pd.read_pickle(f'../data/processed/{data_name}_{i}.pkl.gz') for i in range(10)
    ])

    # The same random_state produces the same row split for every dataset
    df_train, df_test = train_test_split(df, test_size=0.2, random_state=404)
    df_train.to_pickle(f'../data/processed/{data_name}_train.pkl.gz')
    df_test.to_pickle(f'../data/processed/{data_name}_test.pkl.gz')

Starting y
Starting X_stopword
Starting X_preprocessed


## Feature Engineering

I already know that TF-IDF performs the best, but I'm still interested to see how neural networks perform with the gensim document embeddings. These embeddings are much quicker and smaller than the TF-IDF vectorizers, so it isn't any trouble to run and save the data.

### TF-IDF

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tf = TfidfVectorizer(max_features=8000, stop_words=stopwords_list)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_tf = pd.DataFrame(tf.fit_transform(X_train_join).todense(), columns=tf.get_feature_names_out())
X_test_tf = pd.DataFrame(tf.transform(X_test_join).todense(), columns=tf.get_feature_names_out())

In [None]:
X_train_tf.to_feather('../data/processed/X_train_tf.feather')
X_test_tf.to_feather('../data/processed/X_test_tf.feather')

### TF-IDF with Bigrams

TF-IDF with Bigrams performed the best after running the models, so I pickled the vectorizer to use again later. When I get the ability to run bigger models and vectorizers, I may come back and try other levels of n-grams.
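To make concrete what the bigram setting adds, here is a small hand-rolled sketch of the terms that `ngram_range=(1,2)` produces for one review. `word_ngrams` is a hypothetical helper written for illustration; the vectorizer does its own tokenizing and then keeps only the `max_features` most frequent terms across the corpus.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """List every run of n consecutive tokens for each n in the range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

features = word_ngrams(['not', 'very', 'fun'])
print(features)  # ['not', 'very', 'fun', 'not very', 'very fun']
```

Bigrams such as "not very" keep a little word order that single words lose, which is one plausible reason this setting scored better.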

In [1]:
import pandas as pd
from pickle import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
tf_bigram = TfidfVectorizer(max_features=1000, ngram_range=(1,2))
X_train = pd.read_pickle('../data/processed/X_preprocessed_train.pkl.gz')[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_bigram = pd.DataFrame(tf_bigram.fit_transform(X_train).todense(), columns=tf_bigram.get_feature_names_out())

X_train_bigram.to_pickle('../data/processed/X_bigram_train.pkl.gz')
# Use a context manager so the file handle is closed after writing
with open('../final_model/tfidf_bigram_vectorizer.pk', 'wb') as f:
    dump(tf_bigram, f)

MemoryError: Unable to allocate 12.7 GiB for an array with shape (1706317, 1000) and data type float64
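
The size in this error follows directly from the shape: converting to a dense array allocates one float64 (8 bytes) for every cell, including the zeros. A quick check of the arithmetic, using only the numbers reported in the message:

```python
rows, cols = 1706317, 1000   # shape reported in the MemoryError
bytes_per_value = 8          # float64

dense_gib = rows * cols * bytes_per_value / 2**30
print(f'{dense_gib:.1f} GiB')  # 12.7 GiB
```

The vectorizer itself returns a scipy sparse matrix that stores only the nonzero entries, so the allocation comes from the `.todense()` conversion rather than from fitting.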

In [9]:
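# Use a context manager so the file handle is closed after reading
with open('../final_model/tfidf_bigram_vectorizer.pk', 'rb') as f:
    tf_bigram = load(f)
X_test = pd.read_pickle('../data/processed/X_preprocessed_test.pkl.gz')[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_bigram = pd.DataFrame(tf_bigram.transform(X_test).todense(), columns=tf_bigram.get_feature_names_out())
X_test_bigram.to_pickle('../data/processed/X_bigram_test.pkl.gz')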

| | review |
|---|---|
| 53104 | this game is like a simulationtheres nothing m... |
| 130983 | timbermanesque really fun fastpaced simplistic... |
| 199208 | the genre of 2d soulslikes has gotten a lot of... |
| 181619 | i dont understand how games like this even ent... |
| 26270 | introduction i am setsuna is a short touching ... |
| ... | ... |
| 132893 | its like having sex with your space bar |
| 94655 | here is the origin of duke nukem one of the mo... |
| 202043 | the game has a lot of potential some interesti... |
| 201016 | an interesting short visual novel with multipl... |


In [None]:
X_train_bigram.to_pickle('../data/processed/X_bigram_train.pkl.gz')
X_test_bigram.to_pickle('../data/processed/X_bigram_test.pkl.gz')

In [None]:
# Use a context manager so the file handle is closed after writing
with open('../final_model/tfidf_bigram_vectorizer.pk', 'wb') as f:
    dump(tf_bigram, f)

### Document Embeddings

In [None]:
# gensim.sklearn_api was removed in gensim 4.0, so this import requires gensim 3.x
from gensim.sklearn_api import D2VTransformer
from sklearn.preprocessing import MinMaxScaler

In [None]:
d2v = D2VTransformer()
X_train_embed = d2v.fit_transform(X_train_pre)
X_test_embed = d2v.transform(X_test_pre)

scaler = MinMaxScaler((1, 2))
X_train_embed = pd.DataFrame(scaler.fit_transform(X_train_embed))
X_test_embed = pd.DataFrame(scaler.transform(X_test_embed))

X_train_embed.columns = X_train_embed.columns.astype(str)
X_test_embed.columns = X_test_embed.columns.astype(str)

In [None]:
X_train_embed.to_feather('../data/processed/X_train_embed.feather')
X_test_embed.to_feather('../data/processed/X_test_embed.feather')

Some of these final processed files are too large to upload to GitHub, so the entire data/processed folder has been added to .gitignore. You will need to run this notebook yourself to generate the same files. The raw data is still included in the GitHub upload.