# Preface
Before I launch into the code I want to preface a misunderstanding I had about my data initially. I originally thought that the ` weighted_vote_score ` was a measure of the impact it had on the games review score. As it is instead a representation of the gauged helpfulness of the community, I now need to somewhat change my approach. 
Instead of focusing on attempting to guess within a range of the previous target parameter `weighted_vote_score` I'll instead need to approach it as attempting to guess 'Yes' or 'No' for whether or not they recommend the game (` voted_up `). 

# General Approach
As a whole I'll need to somewhat adjust the values in my data. I'll change the boolean values into 0 or 1 respectively, and drop the language column (since it appears to be entirely english). From here I'll only include the recieved_for_free, written_during_early_acces, weighted_vote_score, and review in the algorithm. This will then be fed into the algorithm to generate a guess on the voted_up scores. 

# Transforming the review column
Ultimately I need a way to perform sentiment analysis of the reviews to convert it from text into a numerical value of positivity or negativity. This in account with the other factors should (hopefully) get us a good guess of whether or not someone would recommend a game. 

In [1]:
import pandas as pd
from pathlib import Path

base_path = Path()
raw_data = base_path.joinpath('raw_data')
filtered_data = 'filtered_data.json'

filtered_df = pd.read_json(raw_data.joinpath(filtered_data))
filtered_df['review'] = filtered_df['review'].str.replace(r'\W+', ' ')
filtered_df[['voted_up', 'received_for_free', 'written_during_early_access']] = filtered_df[['voted_up', 'received_for_free', 'written_during_early_access']].astype(int)
filtered_df.drop(columns='language', inplace=True, axis=1)
display(filtered_df.head(10))

  filtered_df['review'] = filtered_df['review'].str.replace(r'\W+', ' ')


Unnamed: 0,ids,recommendationid,voted_up,received_for_free,written_during_early_access,weighted_vote_score,review
0,70,115513013,1,0,0,0.729283,Review of Half Life Revolutionizing the indus...
1,70,115813617,1,0,0,0.642857,A must play classic
2,70,115817244,1,0,0,0.615385,One of the best games every created still fun ...
3,70,115566933,1,0,0,0.613007,sp is pretty cool deathmatch goes crazy
4,70,116146745,1,0,0,0.583333,I m Kayne West and this is the Kayne best
5,70,116216878,1,0,0,0.565217,1998
6,70,115547553,0,0,0,0.537176,I ve come to make an announcement Gordon Freem...
7,70,115619966,1,0,0,0.527528,Yes
8,70,116243034,1,0,0,0.526959,OMFG BEST GRAPHICSSSSSSSSSSSSSS
9,70,115766562,1,0,0,0.525862,very noice game


In [2]:
#We filter this to cut down on the amount I have to pad. I can capture ~90% of the data by cutting it off at 1350 characters.
#With just slicing the first 100 reviews I was able to go from ~9min down to 30 seconds. 
#I've tried experimenting with some of the methods to speed up the process, but I haven't been able to get much to work without something else breaking.

print(filtered_df['review'].apply(len).quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99]))
filtered_df = filtered_df.loc[filtered_df['review'].apply(len) < 1350]

0.25      80.00
0.50     238.00
0.75     613.00
0.90    1347.00
0.95    2088.35
0.99    4310.00
Name: review, dtype: float64


# Sentiment Analysis
This portion of the process has been somewhat of a struggle. While I could use NLTK or similar libaries to maybe get a good guess at the sentiment values, I want to test if a Neural Network model can get resonably close as well. First I had to figure out how to shape the data in a way that the Neural Network would properly work with, via subclassing the Dataset class, then I had to define various methods to make the data work.

Right now the Dataset takes too long to process, at least in order for me to submit the assignment in at a resonable time, however I will improve the effciency of this such that we can get a good test of the neural network benefits.

After a couple days of working out the issues in the Dataset subclass, I was able to have a working model of the data to plug into the CNN Sentiment Analysis model. From what I could find CNN models worked reasonably well at learning the sentiments of datasets as well as long-term patterns. So I opted to utilize this method and implement a model to guess the data.

This process took a while to get it right, but finally I was left with my current model. The inital tests with a limited selection showed to be promising whe compared to the test set - but against the total dataframe it just guessed everything as being a recommended review - which isn't ideal. At the moment I'm going to assume its from having too many unexpected tokens - resulting in poor performance, but it's hard to say until I can train it against the whole of the dataset. 

In [94]:
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torchtext
import torch.optim as optim
from tqdm.notebook import tqdm
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split

train_dataframe, test_dataframe = train_test_split(filtered_df[['review', 'voted_up']], test_size=0.2, random_state=42)


class CustomDataset(Dataset):
    def __init__(self, df):
        self.df = df
        self.tokenizer = torchtext.data.get_tokenizer('basic_english')
        self.max_seq_len = self._infer_max_seq_len()
        self.vocab = self._build_vocab()
        self.numericalized_data = self._numericalize_data()

    def _infer_max_seq_len(self):
        max_seq_len = 0
        for text in self.df['review']:
            tokens = self.tokenizer(text)
            max_seq_len = max(max_seq_len, len(tokens))
        return max_seq_len

    def _build_vocab(self):
        tokenized_reviews = []
        with ThreadPoolExecutor() as executor:
            for tokens in executor.map(self.tokenizer, self.df['review']):
                tokenized_reviews.append(tokens)

        vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_reviews, specials=['<unk>', '<pad>'])
        vocab.set_default_index(vocab['<unk>'])
        return vocab

    def _numericalize_data(self):
            tokenized_reviews = [self.tokenizer(text) for text in self.df['review']]
            numericalized_reviews = np.zeros((len(tokenized_reviews), self.max_seq_len), dtype=np.int64)
            for i, tokens in tqdm(enumerate(tokenized_reviews), total=len(tokenized_reviews)):
                numericalized_tokens = [self.vocab.get_stoi()[token] for token in tokens]
                numericalized_reviews[i, :len(numericalized_tokens)] = numericalized_tokens[:self.max_seq_len]
            return torch.from_numpy(numericalized_reviews), torch.tensor(self.df['voted_up'].values, dtype=torch.long)

    def _pad_tokens(self, tokens):
        num_tokens = len(tokens)
        if num_tokens < self.max_seq_len:
            padded_tokens = tokens + ['<pad>'] * (self.max_seq_len - num_tokens)
        else:
            padded_tokens = tokens[:self.max_seq_len]
        return padded_tokens

    def __len__(self):
        return len(self.numericalized_data)

    def __getitem__(self, idx):
        numericalized_review = self.numericalized_data[0][idx]
        label = self.numericalized_data[1][idx]
        return numericalized_review, label

    def get_vocab(self, index=False):
        '''Returns the vocab object as number index if true, else as string index'''
        return self.vocab.get_itos() if index else self.vocab.get_stoi()



train_data = CustomDataset(train_dataframe[:1000])
test_data = CustomDataset(test_dataframe[:200])


  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

In [14]:
print(train_data[0])

([28, 123, 148, 27, 287, 538, 408, 412, 298, 576, 5045, 3028, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], tensor(1.))


In [77]:
import torch.nn.functional as F

class SentimentCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embedding_dim,
                      out_channels=n_filters,
                      kernel_size=fsz)
            for fsz in filter_sizes
        ])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text = [batch size, sent len]
        embedded = self.embedding(text)
        # embedded = [batch size, sent len, emb dim]
        embedded = embedded.permute(0, 2, 1)  # [batch size, emb dim, sent len]

        # apply convolutions and activation functions
        conved = [F.relu(conv(embedded)) for conv in self.convs]

        # pooling
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]

        # concatenate pooled features and pass through the dropout layer
        cat = self.dropout(torch.cat(pooled, dim=1))

        # pass through the fully connected layer
        out = self.fc(cat)

        return out

In [78]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_data_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=True)

vocab_size = len(train_data.vocab)
emb_dim = 100
num_filters = 100
filter_sizes = [2, 3, 4]
output_dim = 2
dropout = 0.5

model = SentimentCNN(vocab_size, emb_dim, num_filters, filter_sizes, output_dim, dropout)

# define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# train the model
epochs = 10
for epoch in range(epochs):
    running_loss = 0.0
    for batch in train_data_loader:
        optimizer.zero_grad()
        text, label = batch
        output = model(text)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print('Epoch: {}, Loss: {:.4f}'.format(epoch+1, running_loss/len(train_data_loader)))
    
# evaluate the model on test set
correct = 0
total = 0
with torch.no_grad():
    for batch in test_data_loader:
        text, label = batch
        output = model(text)
        _, predicted = torch.max(output.data, 1)
print(f'Accuracy on test set: {accuracy_score(label,predicted)*100:.2f}% \n R2 Score: {r2_score(label, predicted)*100:.2f}%')

Epoch: 1, Loss: 0.3514
Epoch: 2, Loss: 0.3498
Epoch: 3, Loss: 0.0086
Epoch: 4, Loss: 0.0092
Epoch: 5, Loss: 0.0022
Epoch: 6, Loss: 0.0017
Epoch: 7, Loss: 0.0014
Epoch: 8, Loss: 0.0003
Epoch: 9, Loss: 0.0002
Epoch: 10, Loss: 0.0002
Accuracy on test set: 100.00% 
 R2 Score: 100.00%


In [244]:
torch.save(model.state_dict(), raw_data.joinpath('model.pt'))
torch.save(train_data.get_vocab(), raw_data.joinpath('vocab.pt'))

# Sentiment predictions
My overall approach here is to get the sentiment values of each word by training the model to correctly guess the `voted_up` field and develop the necessary coeffecients to do so. 

The process to extract their coeffients involves the following steps:
   1. The input text is first tokenized using the basic_english tokenizer from torchtext.
   1.  The tokenized text is then converted to numerical form by mapping each token to its corresponding index in the vocabulary.
   1.  The numericalized tokens are then padded to a fixed length to ensure that all inputs have the same shape.
   1.  The padded numericalized tokens are passed through the embedding layer of the model to obtain their corresponding word embeddings.
   1.  The mean embedding for each review is calculated by taking the element-wise product of the embedding tensor and a binary mask tensor (which is 1 for tokens that are present and 0 for tokens that are padded) and then taking the mean along the sequence dimension.
   1.  The mean embeddings are passed through a 1D convolutional layer (model.convs[0]), followed by a non-linear activation function (torch.tanh).
   1.  The output of the convolutional layer is a 3D tensor with shape (batch_size, num_filters, seq_len - filter_size + 1).
   1.  The tensor is then squeezed along the third dimension (i.e., `output = output.squeeze(3)`), resulting in a 2D tensor with shape ( `batch_size, num_filters, seq_len - filter_size + 1`).
   1.  The tensor is then averaged along the third dimension (i.e., `output = output.mean(dim=2)`), resulting in a 2D tensor with shape (`batch_size, num_filters`).
   1.  The tensor is then squeezed along the first dimension (i.e., `output = output.squeeze()`), resulting in a 1D tensor with shape (`num_filters,`).
   1.  The final sentiment score for each review is obtained by taking the hyperbolic tangent of the average of the values in the 1D tensor (i.e., `torch.tanh(output.mean())`). This forces the value between -1 and 1, resulting in the actual sentiment score of the word in context to the sentence.

Then I take the resulting lists, and iteratively strip the NaN values from them, this will allow me to more effeciently process the list later once it is inside of the dataframe.
Finally in the dataframe I take the average of the values of the words in the sentences to get the final sentiment score of the sentence, dropping the last set of NaN values for sentences that we can't produce a score for (due to lack of vocab across the sentence).

In [224]:
def predict_sentiment(df):

    # Tokenize the review text
    tokenizer = torchtext.data.get_tokenizer('basic_english')
    temp_df = df['review'].apply(tokenizer)
    vocab = train_data.get_vocab()

    def numericalize_token(token):
        if token not in vocab:
            return vocab['<unk>']
        else:
            return vocab[token]

    # Convert the tokens to numericalized form
    numericalized_tokens = []
    for tokens in tqdm(temp_df, desc='Numericalizing tokens'):
        with ThreadPoolExecutor() as executor:
            numericalized_tokens.append(list(executor.map(numericalize_token, tokens)))

    # Pad the numericalized tokens to a fixed length
    max_seq_len = train_data._infer_max_seq_len()
    padded_tokens = torch.zeros((len(numericalized_tokens), max_seq_len), dtype=torch.long)
    for i, tokens in tqdm(enumerate(numericalized_tokens), desc='Padding tokens', total=len(numericalized_tokens)):
        tokens = tokens[:max_seq_len]
        padded_tokens[i, :len(tokens)] = torch.tensor(tokens)

    padded_tokens = padded_tokens.unsqueeze(1)

    # Calculate the sentiment score based on the mean value of the word embeddings
    with torch.no_grad():
        model.eval()
        model.convs.eval()
        embeddings = model.embedding(padded_tokens)
        mask = padded_tokens.ne(0).unsqueeze(-1).float()
        mean_embeddings = embeddings.mul(mask).sum(dim=1).div(mask.sum(dim=1))
        mean_embeddings = mean_embeddings.permute(0, 2, 1)
        output = model.convs[0](mean_embeddings)
        output = output.squeeze(2)
        output = output.mean(dim=1)
        output = output.view(output.shape[0], -1)
        sentiment_scores = torch.tanh(output).squeeze().tolist()

    # Return the sentiment scores as a new column in the dataframe
    sentiments = []
    for review_sentiment in tqdm(sentiment_scores, desc='Calculating Sentiment'):
        review_sentiment = np.array(review_sentiment, dtype=np.float32)
        review_sentiment = [word_sentiment for word_sentiment in review_sentiment if np.isnan(word_sentiment) == False]
        sentiments.append(review_sentiment)
    return sentiments


In [225]:
temp_df = pd.DataFrame([filtered_df['review'][:1000].values, predict_sentiment(filtered_df[:1000])])

Numericalizing tokens:   0%|          | 0/1000 [00:00<?, ?it/s]

Padding tokens:   0%|          | 0/1000 [00:00<?, ?it/s]

Calculating Sentiment:   0%|          | 0/1000 [00:00<?, ?it/s]

In [242]:
temp_df = temp_df.transpose()
temp_df[1] = temp_df[1].apply(lambda x: np.mean(np.array(x)))
temp_df.dropna(inplace=True)
display(temp_df.sort_values(1, ascending=False))
display(temp_df[1].describe())

Unnamed: 0,0,1
786,Internet cafeteria in early 2000s simulator,0.128682
811,Hunting Horn Best Weapon,0.101190
619,sometime it make poop pant,0.099908
149,game for bisexuals,0.099109
762,i glory kill babies in real life,0.091737
...,...,...
637,add sex,-0.105851
746,i love sushi 3,-0.109314
749,I love peas Me,-0.109314
342,GIVE US HL3,-0.121364


count    882.000000
mean      -0.001206
std        0.023673
min       -0.129644
25%       -0.009231
50%       -0.000606
75%        0.008741
max        0.128682
Name: 1, dtype: float64

# In review
So in review I have made good strides in accomplishing part of my goal, and generating features for the dataset. However my downfall is a tandem of the effeciency of the code, and not performing more cleaning steps on the text (i.e not converting contractions to their long-form before stripping punctuation like apostrophes). Once some more of these steps are done and I am able to process the full dataset for training - I think the overall performance of the model will improve drastically. I definitely still have a lot more to learn about PyTorch and its associated libraries, but I am somewhat happy with how far I have come.