To do list:

- import the modules required
- import data
- filter out the columns required
- clean the data (check for null values)
- use TDD to process the text
- split data into training and test data

- use TDD to develop models (fit and predict)
- measure the accuracy of the models and decide whether they need improving
- models that will be used: Naive Bayes and support vector machines

In [1]:
import pandas as pd
import numpy as np
import unittest
import string
from imblearn.over_sampling import SMOTE
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

### Import the data

In [2]:
df = pd.read_csv('opencritic_data_with_ratings.csv', index_col= 'Unnamed: 0')

df

Unnamed: 0,name,developer,publisher,genre,release_date,description,critics_average_score,critic_review,date_of_review,critics_score,opencritic_rating_critics_score
0,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,There's a category of games I think of as Satu...,2019-10-22 00:00:00,79.0,strong
1,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,"With The Outer Worlds, Obsidian has found its ...",2019-10-22 00:00:00,85.0,strong
3,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,"A deep, funny, and intricately designed RPG re...",2019-10-29 00:00:00,90.0,mighty
4,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,The Outer Worlds marks Obsidian operating at t...,2019-10-22 00:00:00,90.0,mighty
6,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,Classic RPG sensibilities enhance wonderful ch...,2019-10-22 00:00:00,93.0,mighty
...,...,...,...,...,...,...,...,...,...,...,...
15080,Star Wars Battlefront 2,"DICE,",Electronic Arts,"Action, First-Person Shooter, Vehicle Combat",2017-11-17,Embark on an all-new Battlefront experience fr...,68,Star Wars Battlefront 2 manages to make nearly...,2017-11-24 00:00:00,40.0,fair
15081,Star Wars Battlefront 2,"DICE,",Electronic Arts,"Action, First-Person Shooter, Vehicle Combat",2017-11-17,Embark on an all-new Battlefront experience fr...,68,"An engaging campaign, a satisfying arcade mode...",2017-11-25 00:00:00,80.0,strong
15083,Star Wars Battlefront 2,"DICE,",Electronic Arts,"Action, First-Person Shooter, Vehicle Combat",2017-11-17,Embark on an all-new Battlefront experience fr...,68,Star Wars Battlefront II is a complete experie...,2017-11-20 00:00:00,85.0,strong
15084,Star Wars Battlefront 2,"DICE,",Electronic Arts,"Action, First-Person Shooter, Vehicle Combat",2017-11-17,Embark on an all-new Battlefront experience fr...,68,"I do really enjoy playing this, but when you l...",2017-11-21 00:00:00,70.0,strong


Columns required for training and predicting the models:

- critic_review
- critics_score
- opencritic_rating_critics_score (the rating category, used as the target label)

In [3]:
df2 = df.loc[:,['critic_review','critics_score','opencritic_rating_critics_score']]

df2

Unnamed: 0,critic_review,critics_score,opencritic_rating_critics_score
0,There's a category of games I think of as Satu...,79.0,strong
1,"With The Outer Worlds, Obsidian has found its ...",85.0,strong
3,"A deep, funny, and intricately designed RPG re...",90.0,mighty
4,The Outer Worlds marks Obsidian operating at t...,90.0,mighty
6,Classic RPG sensibilities enhance wonderful ch...,93.0,mighty
...,...,...,...
15080,Star Wars Battlefront 2 manages to make nearly...,40.0,fair
15081,"An engaging campaign, a satisfying arcade mode...",80.0,strong
15083,Star Wars Battlefront II is a complete experie...,85.0,strong
15084,"I do really enjoy playing this, but when you l...",70.0,strong


In [4]:
#how many reviews fall into each rating category?
ratings_grouped = df.groupby('opencritic_rating_critics_score').critic_review.count()
ratings_grouped

#only 5 weak and 233 fair reviews, so these classes are heavily under-represented
#plan: train on the data as-is and check the accuracy score first, then try oversampling the minority classes with SMOTE

opencritic_rating_critics_score
fair       233
mighty    2559
strong    3890
weak         5
Name: critic_review, dtype: int64

Null values have already been removed from this CSV file.

Use TDD to process the text, ready for training the model

## Process text code

In [5]:
#remove punctuation
def remove_punctuation(reviews):
    punctuation = string.punctuation
    new_list = []
    for review in reviews:
        new_string = ''
        for letter in review:
            if letter not in punctuation:
                new_string += letter
        new_list.append(new_string)
    
    return new_list #explain the difficulty in refactoring this code due to the number of components

#lemmatize
def lemmatize_review(reviews):
    new_list = []
    lem = WordNetLemmatizer()
    for review in reviews:
        #tokenize review
        tokenised = word_tokenize(review)
        lemmatized_list = []
        for word in tokenised:
            lemmatized_word = lem.lemmatize(word)
            lemmatized_list.append(lemmatized_word)
        new_string = " ".join(lemmatized_list)
        new_list.append(new_string)
    return new_list

#use TfidfVectorizer to lowercase and tokenize the reviews, remove stopwords and return a dense TF-IDF array
def tfidfvectorize_reviews(review):
    #recent scikit-learn versions expect stop_words to be a list (a set is rejected)
    stop_words = stopwords.words('english')
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    fit_review = vectorizer.fit_transform(review)
    matrix = fit_review.toarray()
    return matrix
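
The character-by-character loop in `remove_punctuation` can be condensed. A minimal sketch, assuming the same input (an iterable of strings) and the same output (a list of strings); the name `remove_punctuation_translate` is hypothetical and not part of the notebook:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(reviews):
    """Return a list of the reviews with ASCII punctuation stripped."""
    return [review.translate(_PUNCT_TABLE) for review in reviews]

cleaned = remove_punctuation_translate(["Deep, funny & sharp!", "It's fine..."])
print(cleaned)
```

Like the original, this only strips the characters in `string.punctuation`, so typographic quotes or dashes in a review would pass through unchanged.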
    


## Unit tests

In [6]:
class TestTextFormatting(unittest.TestCase):
    
    def test_remove_punc_exists(self):
        self.assertIsNotNone(remove_punctuation)
    
    def test_punctuation_removed(self):
        res = remove_punctuation(['x'+ string.punctuation, 'y'+ string.punctuation])
        for r in res:
            self.assertNotRegex(r, r'[\W]')  #raw string avoids an invalid escape sequence warning
        
    def test_lemmatizer_exists(self):
        self.assertIsNotNone(lemmatize_review)
        
    def test_words_lemmatized(self):
        tester = ["corpora rockers", "players syllabi"]
        res = lemmatize_review(tester)
        self.assertEqual(res, ["corpus rocker", "player syllabus"])
    
    def test_vectorizer_exists(self):
        self.assertIsNotNone(tfidfvectorize_reviews)
        
    def test_words_vectorized(self):
        tester = ["This is a review",
          "This is a very good review",
          "This is a very bad review",
          "This is an awful review dont watch",
          "This review is outstanding"
          ]
        res = tfidfvectorize_reviews(tester)
        self.assertEqual(type(res), np.ndarray)
    
unittest.main(argv=['ignored', '-v'], exit=False)

test_lemmatizer_exists (__main__.TestTextFormatting) ... ok
test_punctuation_removed (__main__.TestTextFormatting) ... ok
test_remove_punc_exists (__main__.TestTextFormatting) ... ok
test_vectorizer_exists (__main__.TestTextFormatting) ... ok
test_words_lemmatized (__main__.TestTextFormatting) ... ok
test_words_vectorized (__main__.TestTextFormatting) ... ok

----------------------------------------------------------------------
Ran 6 tests in 1.108s

OK


<unittest.main.TestProgram at 0x1e282ac6348>

Now that the functions pass their tests, apply them to the reviews.

In [46]:
text_to_process = df2.loc[:,'critic_review']
y = df2.loc[:,['opencritic_rating_critics_score']].reset_index(drop=True)

remove_punc = remove_punctuation(text_to_process)

lemmatize_text = lemmatize_review(remove_punc)

tfidf_vectorize_text = tfidfvectorize_reviews(lemmatize_text)

In [47]:
y

Unnamed: 0,opencritic_rating_critics_score
0,strong
1,strong
2,mighty
3,mighty
4,mighty
...,...
6682,fair
6683,strong
6684,strong
6685,strong


In [48]:
tfidf_vectorize_text

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [49]:
X = pd.DataFrame(data=tfidf_vectorize_text)
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12922,12923,12924,12925,12926,12927,12928,12929,12930,12931
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6682,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6684,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
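
Calling `.toarray()` materialises every zero in the matrix. A rough size estimate, assuming float64 values, the 12,932 columns shown above (labelled 0 to 12931) and a row count of 6,687 taken from the sum of the class counts printed earlier (an assumption, since the table display is truncated):

```python
# Back-of-the-envelope size of the dense TF-IDF matrix.
n_rows = 233 + 2559 + 3890 + 5   # class counts from the groupby output (assumed to cover every row)
n_cols = 12931 + 1               # columns are labelled 0..12931
bytes_per_value = 8              # float64

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{n_rows} x {n_cols} float64 values = {dense_bytes / 1e6:.0f} MB")
```

The sparse matrix returned by `fit_transform` stores only the non-zero entries, and both `MultinomialNB` and `LinearSVC` accept sparse input, so skipping `.toarray()` is worth considering if memory becomes a problem.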


Split the data into training and testing sets

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
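
With only 5 weak reviews, a purely random split can leave a class out of one side. A minimal sketch of a stratified split on made-up labels (the toy `labels` and `features` below are illustrative, not the notebook's data); `stratify` is an existing `train_test_split` parameter:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels standing in for the rating categories (illustrative only).
labels = ["strong"] * 40 + ["mighty"] * 26 + ["fair"] * 6 + ["weak"] * 5
features = [[i] for i in range(len(labels))]  # placeholder feature rows

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.33,
    stratify=labels,   # keep class proportions roughly equal in both splits
    random_state=0,    # fixed seed so the split is reproducible
)

print(Counter(y_tr))
print(Counter(y_te))
```

In the notebook this would correspond to passing `stratify=y` to the existing call.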

In [51]:
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12922,12923,12924,12925,12926,12927,12928,12929,12930,12931
4107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1085,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
548,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2907,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1642,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1007,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [52]:
y_train

Unnamed: 0,opencritic_rating_critics_score
4107,strong
1085,strong
3052,strong
548,mighty
2907,strong
...,...
6371,mighty
1462,strong
1642,mighty
1007,strong


In [53]:
X_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12922,12923,12924,12925,12926,12927,12928,12929,12930,12931
6573,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
532,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3210,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1309,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2180,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6030,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
y_test

Unnamed: 0,opencritic_rating_critics_score
6573,strong
532,mighty
3210,strong
1309,mighty
6131,strong
...,...
2668,strong
1411,mighty
2180,strong
6030,strong


In [55]:
#import the naive bayes model
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

In [56]:
NB = MultinomialNB()
NB_scores = cross_val_score(NB, X, np.ravel(y), cv=5)

In [57]:
np.mean(NB_scores)

0.5305924402959127
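
To put the 0.53 in context, it helps to compare it with the accuracy of always predicting the most common category. A quick calculation from the class counts printed earlier:

```python
# Class counts copied from the groupby output earlier in the notebook.
class_counts = {"fair": 233, "mighty": 2559, "strong": 3890, "weak": 5}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(f"always predicting '{majority_label}': {baseline_accuracy:.4f}")
```

On these counts the baseline is about 0.58, so the mean cross-validated accuracy of 0.53 above does not yet beat a constant prediction.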

In [58]:
#perform k-fold cross validation on the rating categories using a linear support vector classifier
#note: despite its name, regr is a classifier (LinearSVC), not a regressor
from sklearn import svm

regr = svm.LinearSVC()
svm_scores = cross_val_score(regr, X, np.ravel(y), cv=5)

In [59]:
np.mean(svm_scores)

0.4396732975349179

In [60]:
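#use SMOTE to oversample the minority classes
#with the default sampling_strategy every class except the majority (strong) is resampled, so mighty is oversampled as well as fair and weak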
oversample = SMOTE(k_neighbors=4)
oversampled_X, oversampled_y = oversample.fit_resample(X, y)


In [61]:
oversampled_NB_scores = cross_val_score(NB, oversampled_X, np.ravel(oversampled_y), cv=10)
np.mean(oversampled_NB_scores)

0.7170308483290488

In [62]:
oversampled_svm_scores = cross_val_score(regr, oversampled_X, np.ravel(oversampled_y), cv=10)
np.mean(oversampled_svm_scores)

0.7519280205655527

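Potential downside of SMOTE: synthetic samples for a minority class are interpolated between that class's own nearest neighbours, without reference to the other classes. The review vocabulary overlaps between rating categories in nuanced ways, and SMOTE does not take that overlap into account, so some synthetic samples may fall in regions that really belong to another class.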