# Analysis of Sentiment Data Timeline

## Introduction
This notebook contains code for sentiment analysis of game reviews from Steam.

A summary of the contents and a table of contents follow below. The code that mines the reviews through the Steam API lives in a separate notebook. The data is not provided because of the API Terms of Use (https://steamcommunity.com/dev/apiterms). The choices made are explained in the report.

The code is intended to be finalized as a Python package at a later stage.


### Plan
- Data import and processing, e.g. balancing the dataset
- Model evaluation (GTA V reviews)
- Majority voting
- Model selection and final training
- Model validation (Wolcen reviews)
- Visualization of the predictions over the time axis

### Algorithms
- Baseline classifier (dummy classifier)
- Multinomial Naive Bayes
- SVM (Linear)
- Logistic Regression
- KNN
- Random Subspaces (SVM)

### Preprocessing
- Unigrams with Term Frequency
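
As a minimal sketch of what this preprocessing means, the toy example below contrasts term frequency (`binary=False`, as used in this notebook) with term presence (`binary=True`). The two reviews are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, made up for illustration only
docs = ["good good game", "bad game"]

# Term frequency: each cell counts how often a unigram occurs
tf = CountVectorizer(binary=False)
tf_matrix = tf.fit_transform(docs).toarray()

# Term presence: each cell is 1 if the unigram occurs at all
tp = CountVectorizer(binary=True)
tp_matrix = tp.fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(tf.vocabulary_)
print(vocab)      # ['bad', 'game', 'good']
print(tf_matrix)  # [[0 1 2], [1 1 0]]
print(tp_matrix)  # [[0 1 1], [1 1 0]]
```

The only difference is the repeated word: "good" counts as 2 under term frequency and as 1 under term presence.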

## Table of Contents
0. Imports
1. Data Ingestion
2. Helper Functions
3. Model Evaluation 
    1. Dummy Classifier
    2. Multinomial Naive Bayes
    3. SGD - Linear SVM
    4. Logistic Regression
    5. KNN
    6. Random Subspaces (SVM)
4. Majority Voting
5. Validation of Final Model
6. Visualization

## 0. Imports

In [1]:
# Basics
import numpy as np
import pandas as pd
from datetime import datetime

# Helpers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Models
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Visualization
from bokeh.plotting import figure
from bokeh.io import output_notebook, show, output_file, save
from bokeh.models import ColumnDataSource, HoverTool, Panel
from bokeh.models.widgets import Tabs
from bokeh.models import DatetimeTickFormatter
%matplotlib inline

## 1. Data Ingestion

In [2]:
# Import data
df = pd.read_parquet('D:\\data\\test_train\\review_merged.parquet')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274865 entries, 0 to 274864
Data columns (total 18 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   language                     274865 non-null  string        
 1   review                       274865 non-null  string        
 2   timestamp_created            274865 non-null  datetime64[ns]
 3   timestamp_updated            274865 non-null  datetime64[ns]
 4   voted_up                     274865 non-null  boolean       
 5   votes_up                     274865 non-null  Int64         
 6   votes_funny                  274865 non-null  Int64         
 7   weighted_vote_score          274865 non-null  float32       
 8   comment_count                274865 non-null  Int64         
 9   steam_purchase               274865 non-null  boolean       
 10  received_for_free            274865 non-null  boolean       
 11  written_during_early_acces

In [3]:
# Convert from boolean to int and check dataset balance
df['voted_up'] = df['voted_up'].astype('int64')
df['voted_up'].value_counts()

1    205874
0     68991
Name: voted_up, dtype: int64

In [4]:
# Balance Dataset
# Divide dataframe into positive (voted_up == 1) and negative (voted_up == 0)
df_pos = df[df['voted_up'] == 1]
df_neg = df[df['voted_up'] == 0]

# Under-sample larger dataframe
if len(df_pos.index) == len(df_neg.index):
    # Dataset is balanced
    pass
elif len(df_pos.index) > len(df_neg.index):
    # Positive has higher count, under-sample positive and then merge again
    df_pos = df_pos.sample(len(df_neg.index))
    df = pd.concat([df_pos, df_neg], axis=0)
else:
    # Negative has higher count, under-sample negative and then merge again
    df_neg = df_neg.sample(len(df_pos.index))
    df = pd.concat([df_pos, df_neg], axis=0)

In [5]:
# Verify that data is now balanced
df['voted_up'].value_counts()

1    68991
0    68991
Name: voted_up, dtype: int64
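
The under-sampling step can also be written so that it works for any number of classes and gives a repeatable draw. This is a sketch on a made-up toy frame, not the notebook's own method; the `toy` frame and the `random_state` value are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up)
toy = pd.DataFrame({
    "review": [f"r{i}" for i in range(10)],
    "voted_up": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
})

# Size of the smallest class
n_min = toy["voted_up"].value_counts().min()

# Draw n_min rows from every class; random_state makes the draw repeatable
balanced = pd.concat(
    [group.sample(n=n_min, random_state=42) for _, group in toy.groupby("voted_up")],
    axis=0,
)

print(balanced["voted_up"].value_counts().to_dict())
```

Fixing the seed matters here because the sampled rows determine every metric reported later in the notebook.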

In [6]:
# Test train split
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['voted_up'], test_size=0.3, random_state=42)

In [7]:
# Fit the count vectorizer on the train data; binary=False gives term frequency (TF), binary=True gives term presence (TP)
vectorizer = CountVectorizer(stop_words='english', binary=False)
X = vectorizer.fit_transform(X_train)
Y = vectorizer.transform(X_test)

## 2. Helper Functions

In [8]:
def evaluate(y_true, y_predicted):
    """Prints evaluation metrics from the predicted classification.

    Using Scikit-learn functions with y_true and y_predicted to calculate metrics. 
    These are then printed into an easily read format.

    Args:
        y_true: The true y labels.
        y_predicted: The predicted y labels.

    Returns:
        N/A

    Raises:
        N/A
    """
        
    conf_mat = confusion_matrix(y_true, y_predicted)
    
    print("======== CONFUSION MATRIX ========")
    print("\t0\t1")
    print(f"0\t{conf_mat[0][0]}\t{conf_mat[0][1]}")
    print(f"1\t{conf_mat[1][0]}\t{conf_mat[1][1]}")
    print('\n')
    
    print("======== CLASSIFICATION REPORT ========")
    print(classification_report(y_true, y_predicted))
    print('\n')

## 3. Model Evaluation 

### 3A. Dummy Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html

In [9]:
dummy_clf = DummyClassifier(strategy="stratified")

# Train
dummy_clf.fit(X, y_train)

# Predict
y_dummy_train = dummy_clf.predict(X)
y_dummy_test = dummy_clf.predict(Y)

In [10]:
evaluate(y_train, y_dummy_train)

	0	1
0	24015	24172
1	24179	24221


              precision    recall  f1-score   support

           0       0.50      0.50      0.50     48187
           1       0.50      0.50      0.50     48400

    accuracy                           0.50     96587
   macro avg       0.50      0.50      0.50     96587
weighted avg       0.50      0.50      0.50     96587





In [11]:
evaluate(y_test, y_dummy_test)

	0	1
0	10480	10324
1	10276	10315


              precision    recall  f1-score   support

           0       0.50      0.50      0.50     20804
           1       0.50      0.50      0.50     20591

    accuracy                           0.50     41395
   macro avg       0.50      0.50      0.50     41395
weighted avg       0.50      0.50      0.50     41395





### 3B. Multinomial Naive Bayes
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [12]:
# Train
clf = MultinomialNB()
clf.fit(X, y_train)

MultinomialNB()

In [13]:
# Predict
y_pred_train = clf.predict(X)
y_pred_test = clf.predict(Y)

In [14]:
evaluate(y_train, y_pred_train)

	0	1
0	41980	6207
1	5356	43044


              precision    recall  f1-score   support

           0       0.89      0.87      0.88     48187
           1       0.87      0.89      0.88     48400

    accuracy                           0.88     96587
   macro avg       0.88      0.88      0.88     96587
weighted avg       0.88      0.88      0.88     96587





In [15]:
evaluate(y_test, y_pred_test)

	0	1
0	18042	2762
1	2841	17750


              precision    recall  f1-score   support

           0       0.86      0.87      0.87     20804
           1       0.87      0.86      0.86     20591

    accuracy                           0.86     41395
   macro avg       0.86      0.86      0.86     41395
weighted avg       0.86      0.86      0.86     41395
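
To show how the classification report follows from the confusion matrix, the sketch below recomputes precision, recall and accuracy by hand from the Naive Bayes test matrix printed above. Rows are the true class and columns the predicted class, which is the layout `confusion_matrix` returns.

```python
import numpy as np

# Confusion matrix copied from the Naive Bayes test output
# (rows: true class, columns: predicted class)
conf_mat = np.array([[18042, 2762],
                     [2841, 17750]])

# Precision: correct predictions of a class / all predictions of that class
precision = np.diag(conf_mat) / conf_mat.sum(axis=0)

# Recall: correct predictions of a class / all true members of that class
recall = np.diag(conf_mat) / conf_mat.sum(axis=1)

# Accuracy: all correct predictions / all samples
accuracy = np.trace(conf_mat) / conf_mat.sum()

print(np.round(precision, 2))  # [0.86 0.87]
print(np.round(recall, 2))     # [0.87 0.86]
print(round(float(accuracy), 2))  # 0.86
```

The rounded values agree with the report for this cell.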





### 3C. SGD - Linear SVM
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

In [16]:
# Train
clf_sgd = SGDClassifier(max_iter=1000, tol=1e-4)
clf_sgd.fit(X, y_train)

SGDClassifier(tol=0.0001)

In [17]:
# Predict
y_pred_train_sgd = clf_sgd.predict(X)
y_pred_test_sgd = clf_sgd.predict(Y)

In [18]:
evaluate(y_train, y_pred_train_sgd)

	0	1
0	40443	7744
1	3350	45050


              precision    recall  f1-score   support

           0       0.92      0.84      0.88     48187
           1       0.85      0.93      0.89     48400

    accuracy                           0.89     96587
   macro avg       0.89      0.89      0.88     96587
weighted avg       0.89      0.89      0.88     96587





In [19]:
evaluate(y_test, y_pred_test_sgd)

	0	1
0	16851	3953
1	1794	18797


              precision    recall  f1-score   support

           0       0.90      0.81      0.85     20804
           1       0.83      0.91      0.87     20591

    accuracy                           0.86     41395
   macro avg       0.87      0.86      0.86     41395
weighted avg       0.87      0.86      0.86     41395





### 3D. Logistic Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [20]:
# Train
clf_lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
clf_lr.fit(X, y_train)

LogisticRegression(max_iter=1000)

In [21]:
# Predict
y_pred_train_lr = clf_lr.predict(X)
y_pred_test_lr = clf_lr.predict(Y)

In [22]:
evaluate(y_train, y_pred_train_lr)

	0	1
0	42155	6032
1	3049	45351


              precision    recall  f1-score   support

           0       0.93      0.87      0.90     48187
           1       0.88      0.94      0.91     48400

    accuracy                           0.91     96587
   macro avg       0.91      0.91      0.91     96587
weighted avg       0.91      0.91      0.91     96587





In [23]:
evaluate(y_test, y_pred_test_lr)

	0	1
0	17084	3720
1	2015	18576


              precision    recall  f1-score   support

           0       0.89      0.82      0.86     20804
           1       0.83      0.90      0.87     20591

    accuracy                           0.86     41395
   macro avg       0.86      0.86      0.86     41395
weighted avg       0.86      0.86      0.86     41395





### 3E. KNN
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

In [24]:
# Train
clf_knn = KNeighborsClassifier()
clf_knn.fit(X, y_train)

KNeighborsClassifier()

In [25]:
# Predict
y_pred_train_knn = clf_knn.predict(X)
y_pred_test_knn = clf_knn.predict(Y)

In [26]:
evaluate(y_train, y_pred_train_knn)

	0	1
0	34689	13498
1	2900	45500


              precision    recall  f1-score   support

           0       0.92      0.72      0.81     48187
           1       0.77      0.94      0.85     48400

    accuracy                           0.83     96587
   macro avg       0.85      0.83      0.83     96587
weighted avg       0.85      0.83      0.83     96587





In [27]:
evaluate(y_test, y_pred_test_knn)

	0	1
0	13461	7343
1	1709	18882


              precision    recall  f1-score   support

           0       0.89      0.65      0.75     20804
           1       0.72      0.92      0.81     20591

    accuracy                           0.78     41395
   macro avg       0.80      0.78      0.78     41395
weighted avg       0.80      0.78      0.78     41395





### 3F. Random Subspaces (SVM)

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier

SVC training scales poorly when both the feature count and the sample count are large, up to O(n_features * n_samples^3).
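
For reference, the scikit-learn documentation describes Random Subspaces as bagging where each estimator is trained on a random subset of the features rather than of the samples. The sketch below shows that configuration on synthetic data. It is an illustration under assumptions, not the setup used in the cell that follows: `LinearSVC` is swapped in for `SVC` to keep it fast, and the data and parameter values are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import LinearSVC

# Synthetic data standing in for the vectorized reviews
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Random subspaces: every estimator sees all samples (bootstrap=False)
# but only a random half of the features (max_features=0.5), drawn
# without replacement (bootstrap_features=False).
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
subspace_clf = BaggingClassifier(
    LinearSVC(),
    n_estimators=5,
    max_samples=1.0,
    bootstrap=False,
    max_features=0.5,
    bootstrap_features=False,
    random_state=0,
)
subspace_clf.fit(X_toy, y_toy)

# Each entry lists the feature indices one estimator was trained on
subset_sizes = [len(features) for features in subspace_clf.estimators_features_]
print(subset_sizes)  # [10, 10, 10, 10, 10]
```

Training each SVM on fewer features reduces the per-estimator cost, which is the main motivation given the complexity noted above.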

In [28]:
# Train
clf_rs = BaggingClassifier(base_estimator=SVC(), n_estimators=2, random_state=422, bootstrap_features=True, n_jobs=4)
clf_rs.fit(X, y_train)

BaggingClassifier(base_estimator=SVC(), bootstrap_features=True, n_estimators=2,
                  n_jobs=4, random_state=422)

In [29]:
# Predict
y_pred_train_3 = clf_rs.predict(X)
y_pred_test_3 = clf_rs.predict(Y)

In [30]:
evaluate(y_train, y_pred_train_3)

	0	1
0	38322	9865
1	6028	42372


              precision    recall  f1-score   support

           0       0.86      0.80      0.83     48187
           1       0.81      0.88      0.84     48400

    accuracy                           0.84     96587
   macro avg       0.84      0.84      0.84     96587
weighted avg       0.84      0.84      0.84     96587





In [31]:
evaluate(y_test, y_pred_test_3)

	0	1
0	16446	4358
1	3005	17586


              precision    recall  f1-score   support

           0       0.85      0.79      0.82     20804
           1       0.80      0.85      0.83     20591

    accuracy                           0.82     41395
   macro avg       0.82      0.82      0.82     41395
weighted avg       0.82      0.82      0.82     41395





## 4. Majority Voting

In [32]:
# Sum the 0/1 predictions of Multinomial NB, Linear SVM (SGD) and KNN
voting_test = y_pred_test + y_pred_test_sgd + y_pred_test_knn

In [33]:
voting_test[voting_test <= 1] = 0
voting_test[voting_test >= 2] = 1
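
The thresholding above can be read as follows: with three voters that each output 0 or 1, the sum per review lies between 0 and 3, and a sum of at least 2 means the majority voted for class 1. A small sketch with made-up predictions:

```python
import numpy as np

# Made-up 0/1 predictions from three classifiers for five reviews
pred_a = np.array([1, 0, 1, 0, 1])
pred_b = np.array([1, 0, 0, 0, 1])
pred_c = np.array([0, 1, 1, 0, 1])

# Sum of votes per review ranges from 0 to 3
votes = pred_a + pred_b + pred_c

# With three voters, at least two votes for class 1 is a majority
majority = (votes >= 2).astype(int)

print(votes)     # [2 1 2 0 3]
print(majority)  # [1 0 1 0 1]
```

Using an odd number of voters avoids ties, so no tie-breaking rule is needed.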

In [34]:
evaluate(y_test, voting_test)

	0	1
0	17210	3594
1	1891	18700


              precision    recall  f1-score   support

           0       0.90      0.83      0.86     20804
           1       0.84      0.91      0.87     20591

    accuracy                           0.87     41395
   macro avg       0.87      0.87      0.87     41395
weighted avg       0.87      0.87      0.87     41395





## 5. Validation of Final Model

In [35]:
dv = pd.read_parquet('D:\\data\\validation\\review_merged.parquet')
dv['voted_up'] = dv['voted_up'].astype('int64')

In [36]:
# Use complete training corpus for training validation model
vectorizer_val = CountVectorizer(stop_words='english', binary=False)
X_train_val = vectorizer_val.fit_transform(df['review'])
y_train_val = df['voted_up']

clf_NB_val = MultinomialNB()
clf_NB_val.fit(X_train_val, y_train_val)

clf_sgd_val = SGDClassifier(max_iter=1000, tol=1e-4)
clf_sgd_val.fit(X_train_val, y_train_val)

clf_lr_val = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
clf_lr_val.fit(X_train_val, y_train_val)

LogisticRegression(max_iter=1000)

In [37]:
V = vectorizer_val.transform(dv['review'])
NB_eval = clf_NB_val.predict(V)
SVM_eval = clf_sgd_val.predict(V)
LR_eval = clf_lr_val.predict(V)

In [38]:
evaluate(dv['voted_up'], NB_eval)

	0	1
0	8839	1828
1	2945	10721


              precision    recall  f1-score   support

           0       0.75      0.83      0.79     10667
           1       0.85      0.78      0.82     13666

    accuracy                           0.80     24333
   macro avg       0.80      0.81      0.80     24333
weighted avg       0.81      0.80      0.80     24333





In [39]:
evaluate(dv['voted_up'], SVM_eval)

	0	1
0	8137	2530
1	1746	11920


              precision    recall  f1-score   support

           0       0.82      0.76      0.79     10667
           1       0.82      0.87      0.85     13666

    accuracy                           0.82     24333
   macro avg       0.82      0.82      0.82     24333
weighted avg       0.82      0.82      0.82     24333





In [40]:
evaluate(dv['voted_up'], LR_eval)

	0	1
0	8342	2325
1	2092	11574


              precision    recall  f1-score   support

           0       0.80      0.78      0.79     10667
           1       0.83      0.85      0.84     13666

    accuracy                           0.82     24333
   macro avg       0.82      0.81      0.82     24333
weighted avg       0.82      0.82      0.82     24333





In [41]:
majority_vote = NB_eval + SVM_eval + LR_eval
majority_vote[majority_vote <= 1] = 0
majority_vote[majority_vote >= 2] = 1
evaluate(dv['voted_up'], majority_vote)

	0	1
0	8420	2247
1	1914	11752


              precision    recall  f1-score   support

           0       0.81      0.79      0.80     10667
           1       0.84      0.86      0.85     13666

    accuracy                           0.83     24333
   macro avg       0.83      0.82      0.83     24333
weighted avg       0.83      0.83      0.83     24333





## 6. Visualization
Inspiration: https://towardsdatascience.com/interactive-histograms-with-bokeh-202b522265f3

In [42]:
df_viz = dv.copy()  # work on a copy so dv itself is not modified
df_viz['predicted'] = majority_vote

In [43]:
df_viz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24333 entries, 0 to 24332
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   language                     24333 non-null  string        
 1   review                       24333 non-null  string        
 2   timestamp_created            24333 non-null  datetime64[ns]
 3   timestamp_updated            24333 non-null  datetime64[ns]
 4   voted_up                     24333 non-null  int64         
 5   votes_up                     24333 non-null  Int64         
 6   votes_funny                  24333 non-null  Int64         
 7   weighted_vote_score          24333 non-null  float32       
 8   comment_count                24333 non-null  Int64         
 9   steam_purchase               24333 non-null  boolean       
 10  received_for_free            24333 non-null  boolean       
 11  written_during_early_access  24333 non-nu

In [44]:
df_viz_pos = df_viz[df_viz["predicted"] == 1]
df_viz_neg = df_viz[df_viz["predicted"] == 0]

In [45]:
def plot_reviews(df_pos, df_neg, x_label, vectorizer, log_scale=False, save_name=None):
    """Creates a figure with positive and negative reviews as two histograms, where negative reviews are inverted along the y-axis.

    Uses Bokeh and NumPy to calculate histograms for both positive and negative reviews. These are
    then plotted, and a hover tool with tooltips is added.

    Args:
        df_pos: Dataframe of positive reviews.
        df_neg: Dataframe of negative reviews.
        x_label: The name of the column that holds the timestamp for each review.
        vectorizer: Fitted vectorizer used to count the most common words in each interval.
        log_scale: Optional; If the counts are very uneven, a log scale can be used by setting log_scale=True.
        save_name: Optional; Set to an output name if the figure should be saved to HTML instead of shown, e.g. save_name='plot_7'.
    Returns:
        N/A

    Raises:
        N/A
    """
    
    # Define Colors
    # Positive, negative, hover, background
    colors=["#99B898", "#FF847C", "#FECEA8", "#2A363B"]

    # Helper function for handling the datetime to string conversion
    def fix_time(time):
        """Converts a timestamp in milliseconds to a formatted date string.

        The value is converted to seconds and formatted with the datetime package.

        Args:
            time: Timestamp in milliseconds since the epoch (a histogram bin edge).

        Returns:
            Date string formatted as YYYY-MM-DD.

        Raises:
            N/A
        """
        return datetime.fromtimestamp(time // 1000).strftime('%Y-%m-%d')

    # Use NumPy to calculate the histogram and save it to a ColumnDataSource
    def create_columndatastore(df, x_label, sentiment, log_scale):
        """Creates the ColumnDataSource for Bokeh from a histogram.

        The histogram is calculated using NumPy and then the data is prepared for plotting before creating 
        the ColumnDataSource object.

        Args:
            df: Dataframe with datetime data.
            x_label: The label for the datetime data.
            sentiment: Controls if the histogram is inverted or not. 'neg' inverts the y-axis and 'pos' uses it as is.
            log_scale: If y-axis is scaled by log or not.

        Returns:
            ColumnDataSource object.

        Raises:
            N/A
        """
            
        hist, edges = np.histogram(df[x_label].astype(np.int64) // 10**6, bins = 100)
        
        hist_df = pd.DataFrame({x_label: hist,
                                 "left": edges[:-1],
                                 "right": edges[1:]})
        hist_df["interval"] = [f"{fix_time(left)} to {fix_time(right)}" for left, 
                               right in zip(hist_df["left"], hist_df["right"])]
        
        # Calculate 5 most common words for each interval to use with tool tip later
        n = 5
        most_common = []
        for row in hist_df.itertuples():
            # Select all samples within the datetime span
            df_selection = df[(df[x_label].astype(np.int64) // 10**6 >= row.left) & 
                              (df[x_label].astype(np.int64) // 10**6 <= row.right)]
            
            # Transform and get frequency
            tmp_vector = vectorizer.transform(df_selection['review'].tolist())
            freqs = zip(vectorizer.get_feature_names(), tmp_vector.sum(axis=0).tolist()[0])  
            
            # Sort, format and save
            top_n = sorted(freqs, key=lambda x: -x[1])[:n]
            formatted = ''.join([f"{a}: {b}<br>" for a,b in top_n])
            most_common.append(formatted)
        
        # Add to hist_df
        hist_df['top_n'] = most_common
        
        if log_scale:
            with np.errstate(divide='ignore'):
                hist_df[x_label] = np.nan_to_num(np.log10(hist_df[x_label]), nan=0.0, posinf=0.0, neginf=0.0)
            
        # If negative reviews flip the axis
        if sentiment == 'neg':
            hist_df[x_label] = -hist_df[x_label]
        
        return ColumnDataSource(hist_df)

    src_pos = create_columndatastore(df_pos, x_label, 'pos', log_scale)
    src_neg = create_columndatastore(df_neg, x_label, 'neg', log_scale)
    
    # Change tool tips strings if log scale is used.
    if log_scale:
        y_axis = "Log base 10"
    else:
        y_axis = "Count"

    # Define the plot
    plot = figure(plot_height = 600, plot_width = 1000,
                    title = "Histogram of Reviews",
                    x_axis_label = "Date",
                    y_axis_label = y_axis)    

    # Positive plot
    plot.quad(bottom = 0, top = x_label,left = "left", 
        right = "right", source = src_pos, fill_color = colors[0], 
        line_color = "black", fill_alpha = 1,
        hover_fill_alpha = 0.8, hover_fill_color = colors[2])
    
    # Negative plot
    plot.quad(bottom = 0, top = x_label,left = "left", 
        right = "right", source = src_neg, fill_color = colors[1], 
        line_color = "black", fill_alpha = 1,
        hover_fill_alpha = 0.8, hover_fill_color = colors[2])

    plot.xaxis.formatter=DatetimeTickFormatter(
            hours=["%d %B %Y"],
            days=["%d %B %Y"],
            months=["%d %B %Y"],
            years=["%d %B %Y"],
        )

    # Add Hover Tool
    hover = HoverTool(tooltips = [('Interval', '@interval'),
                              (y_axis, str("@" + x_label)),
                              ('Top 5', '@top_n{safe}')])
    plot.add_tools(hover)
    
    # Plot or save
    if save_name:
        # Save files for offline use
        output_file(f'{save_name}.html', mode='inline')
        save(plot)
    else:
        show(plot)
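
The time handling inside `plot_reviews` can be checked in isolation. The sketch below follows the same steps on made-up timestamps: convert to milliseconds, bin with NumPy, format the bin edges, and apply the log scale. It uses three bins instead of 100 for readability and formats dates in UTC (an assumption made here so the result does not depend on the local timezone, whereas `fix_time` uses local time).

```python
from datetime import datetime, timezone

import numpy as np
import pandas as pd

# Made-up timestamps standing in for a review timestamp column
stamps = pd.Series(pd.to_datetime([
    "2020-02-13", "2020-02-14", "2020-02-14", "2020-03-01",
])).astype("datetime64[ns]")

# datetime64[ns] -> integer nanoseconds -> milliseconds since the epoch
millis = stamps.astype(np.int64) // 10**6

# Three bins for readability; the notebook uses 100
hist, edges = np.histogram(millis, bins=3)

# Bin edges are floats in milliseconds; convert to seconds before formatting
labels = [
    datetime.fromtimestamp(edge / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    for edge in edges
]

# log10 of an empty bin is -inf, so map it back to 0 as plot_reviews does
with np.errstate(divide="ignore"):
    log_hist = np.nan_to_num(np.log10(hist), neginf=0.0)

print(hist)      # [3 0 1]
print(labels)
print(log_hist)
```

Note that on the log scale a bin with one review and an empty bin both end up at 0, since log10(1) is 0.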

In [46]:
show_in_notebook=False

if show_in_notebook:
    # Show plots from Bokeh in notebook
    output_notebook()

In [47]:
plot_reviews(df_viz_pos, df_viz_neg, 'timestamp_updated', vectorizer_val)


In [48]:
plot_reviews(df_viz_pos, df_viz_neg, 'timestamp_updated', vectorizer_val, log_scale=True)

In [49]:
# Save plots in html
plot_reviews(df_viz_pos, df_viz_neg, 'timestamp_updated', vectorizer_val, save_name='validation_plot')
plot_reviews(df_viz_pos, df_viz_neg, 'timestamp_updated', vectorizer_val, log_scale=True, save_name='validation_plot_log')