# Data Processing

In addition to reading in the data (stored as gzip-compressed pickle files in the cells below, which are smaller than CSV), I also perform the train-test split here. Perhaps the split could have been done after the feature processing, but doing it first ensures the different processed files keep their rows in the same order.

## Preprocessing

Most of the functions I use are the same as in the sample dataset. I've moved them into functions in scripts/preprocessing.py so I can reuse them later on the target data. The steps are kept separate instead of combined in a pipeline to make debugging easier: the large data size caused many errors and long runtimes, and running each step individually was the most reliable way to get through them. Skipping pipelines may also let me avoid scikit-learn in a final product, which could make it easier to load all the libraries I need onto Heroku.

In [1]:
# This allows importing of scripts, which are stored in a folder one level up
import sys
sys.path.append('..')

In [2]:
import pandas as pd
from scripts import preprocessing
from string import punctuation
import numpy as np
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
stopwords_list = stopwords.words('english') + list(punctuation) + ['`', '’', '…', '\n']
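
As a small illustration of how a list like this gets used, the sketch below filters one tokenized review. The short example_stopwords list and the drop_stopwords helper are made up for this example so that it runs without the NLTK corpus; the real list comes from stopwords.words('english').

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
example_stopwords = ['the', 'a', 'is', 'and', 'it']

# A set makes each membership check constant-time
stopword_set = set(example_stopwords + list(punctuation) + ['`', '’', '…', '\n'])

def drop_stopwords(tokens):
    """Return the tokens that are neither stopwords nor punctuation."""
    return [token for token in tokens if token not in stopword_set]

tokens = ['the', 'game', 'is', 'fun', '!', 'and', 'cheap']
filtered = drop_stopwords(tokens)
print(filtered)  # ['game', 'fun', 'cheap']
```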

In [9]:
for i in range(10):
    print(i+1, 'of 10 started')
    df = pd.read_pickle(f'../data/reviews_raw_{str(i)}.pkl.gz')
    y = df['voted_up'].to_numpy()
    
    X = df['review'].to_numpy()
    X = list(map(preprocessing.remove_markdown, X))
    X = list(map(preprocessing.remove_punctuation, X))
    X = list(map(preprocessing.tokenize, X))
    
    # A set gives constant-time lookups; the result is the same as checking the list
    stopword_set = set(stopwords_list)
    X_stopword = [[word for word in review if word not in stopword_set] for review in X]
    
    df_preprocessed = pd.DataFrame({'reviews_preprocessed': X,
                                    'reviews_stopworded': X_stopword,
                                    'voted_up': y})
    df_preprocessed.to_pickle(f'../data/processed/reviews_preprocessed_{str(i)}.pkl.gz')

0 of 10 started
1 of 10 started
2 of 10 started
3 of 10 started
4 of 10 started
5 of 10 started
6 of 10 started
7 of 10 started
8 of 10 started
9 of 10 started


## Train-Test Split

In [3]:
from sklearn.model_selection import train_test_split

In [11]:
reviews_df = pd.DataFrame({'reviews_preprocessed': [],
                           'reviews_stopworded': [],
                           'voted_up': []})
for i in range(0, 10):
    print(i+1, 'of 10 started')
    reviews_df = reviews_df.append(pd.read_pickle(f'../data/processed/reviews_preprocessed_{str(i)}.pkl.gz'))
reviews_df

1 of 10 started


MemoryError: 
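
One likely contributor to the MemoryError above is that DataFrame.append copies the whole accumulated frame on every iteration (the method was also removed in pandas 2.0). A possible alternative, sketched here with tiny in-memory stand-ins for the ten chunk files, is to collect the chunks in a list and call pd.concat once. The load_chunks helper and the fake data are hypothetical, and this only reduces the repeated copying; it does not guarantee the full dataset fits in memory.

```python
import pandas as pd

def load_chunks(paths, reader=pd.read_pickle):
    """Read every chunk, then concatenate once instead of appending in a loop."""
    frames = [reader(path) for path in paths]
    return pd.concat(frames, ignore_index=True)

# Tiny in-memory stand-ins for the ten pickled chunks (made-up data)
fake_chunks = {
    f'chunk_{i}': pd.DataFrame({'reviews_preprocessed': [['good', 'game']],
                                'reviews_stopworded': [['good', 'game']],
                                'voted_up': [i % 2 == 0]})
    for i in range(10)
}

# Passing dict.get as the reader lets the sketch run without touching disk
reviews_df = load_chunks(fake_chunks.keys(), reader=fake_chunks.get)
print(reviews_df.shape)  # (10, 3)
```

With the real files, the call would pass the ten ../data/processed/reviews_preprocessed_{i}.pkl.gz paths and keep the default reader.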

In [None]:
df_train, df_test = train_test_split(reviews_df, test_size=0.2, random_state=404)

X_train_preprocessed = df_train['reviews_preprocessed'].to_numpy()
X_train_stopworded = df_train['reviews_stopworded'].to_numpy()
y_train = df_train['voted_up'].to_numpy()

X_test_preprocessed = df_test['reviews_preprocessed'].to_numpy()
X_test_stopworded = df_test['reviews_stopworded'].to_numpy()
y_test = df_test['voted_up'].to_numpy()

len(X_train_preprocessed), len(X_train_stopworded), len(y_train), len(X_test_preprocessed), len(X_test_stopworded), len(y_test)
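
To illustrate why splitting the combined DataFrame once keeps the processed versions in the same order, here is a minimal sketch on made-up data. Each row carries both token lists and the label, so every column extracted afterwards shares the same row order. The toy frame is an assumption for the example, not the real review data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: row i holds the token 'tok<i>' in both text columns
df = pd.DataFrame({'reviews_preprocessed': [[f'tok{i}', 'the'] for i in range(10)],
                   'reviews_stopworded': [[f'tok{i}'] for i in range(10)],
                   'voted_up': [i % 2 == 0 for i in range(10)]})

# Splitting whole rows means the two text columns can never drift apart
df_train, df_test = train_test_split(df, test_size=0.2, random_state=404)

aligned = all(row.reviews_preprocessed[0] == row.reviews_stopworded[0]
              for row in df_train.itertuples())
print(len(df_train), len(df_test), aligned)
```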

In [None]:
pd.DataFrame(X_train_preprocessed, columns=['review']).to_pickle('../data/processed/x_train_preprocessed.pkl.gz')
pd.DataFrame(X_train_stopworded, columns=['review']).to_pickle('../data/processed/x_train_stopworded.pkl.gz')
pd.DataFrame(y_train, columns=['voted_up']).to_pickle('../data/processed/y_train.pkl.gz')

pd.DataFrame(X_test_preprocessed, columns=['review']).to_pickle('../data/processed/x_test_preprocessed.pkl.gz')
pd.DataFrame(X_test_stopworded, columns=['review']).to_pickle('../data/processed/x_test_stopworded.pkl.gz')
pd.DataFrame(y_test, columns=['voted_up']).to_pickle('../data/processed/y_test.pkl.gz')

## Feature Engineering

I already know that TF-IDF performs the best, but I'm still interested to see how neural networks perform with the gensim document embeddings. These embeddings are much quicker to compute and smaller than the TF-IDF features, so it isn't any trouble to run and save the data.

### TF-IDF

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
tf = TfidfVectorizer(max_features=8000, stop_words=stopwords_list)
X_train_tf = pd.DataFrame(tf.fit_transform(X_train_join).todense(), columns=tf.get_feature_names())
X_test_tf = pd.DataFrame(tf.transform(X_test_join).todense(), columns=tf.get_feature_names())

In [21]:
X_train_tf.to_feather('../data/processed/X_train_tf.feather')
X_test_tf.to_feather('../data/processed/X_test_tf.feather')

### TF-IDF with Bigrams

TF-IDF with Bigrams performed the best after running the models, so I pickled the vectorizer to use again later. When I get the ability to run bigger models and vectorizers, I may come back and try other levels of n-grams.

In [13]:
from pickle import dump

In [22]:
tf_bigram = TfidfVectorizer(max_features=8000, ngram_range=(1,2))
X_train_bigram = pd.DataFrame(tf_bigram.fit_transform(X_train_join).todense(), columns=tf_bigram.get_feature_names())
X_test_bigram = pd.DataFrame(tf_bigram.transform(X_test_join).todense(), columns=tf_bigram.get_feature_names())

In [23]:
X_train_bigram.to_feather('../data/processed/X_train_bigram.feather')
X_test_bigram.to_feather('../data/processed/X_test_bigram.feather')

In [16]:
# The with block closes the file handle even if dump raises
with open('../final_model/vectorizer.pk', 'wb') as f:
    dump(tf_bigram, f)
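
For reference, using the pickled vectorizer later might look like the sketch below. It fits a small vectorizer on made-up text and writes it to a temporary path that stands in for ../final_model/vectorizer.pk, then loads it back and calls transform only. Note that pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that saved them.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a small vectorizer on made-up text so there is something to save
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(['great game', 'boring game'])

# A temporary path stands in for ../final_model/vectorizer.pk
path = os.path.join(tempfile.mkdtemp(), 'vectorizer.pk')
with open(path, 'wb') as f:
    pickle.dump(vectorizer, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# Transform only: the restored object keeps the vocabulary and IDF weights it was fitted with
features = restored.transform(['great boring game'])
print(features.shape)  # (1, 5)
```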

### Document Embeddings

In [24]:
from gensim.sklearn_api import D2VTransformer
from sklearn.preprocessing import MinMaxScaler

In [25]:
d2v = D2VTransformer()
X_train_embed = d2v.fit_transform(X_train_pre)
X_test_embed = d2v.transform(X_test_pre)

scaler = MinMaxScaler((1, 2))
X_train_embed = pd.DataFrame(scaler.fit_transform(X_train_embed))
X_test_embed = pd.DataFrame(scaler.transform(X_test_embed))

X_train_embed.columns = X_train_embed.columns.astype(str)
X_test_embed.columns = X_test_embed.columns.astype(str)

In [26]:
X_train_embed.to_feather('../data/processed/X_train_embed.feather')
X_test_embed.to_feather('../data/processed/X_test_embed.feather')

Some of these final processed files are too large to upload to GitHub, so the entire data/processed folder has been added to .gitignore. You will need to run this notebook yourself to generate the same files. The raw data is still included in the GitHub upload.