To-do list:

- import the modules required
- import data
- filter out the columns required
- clean the data (check for null values)
- use TDD to process the text
- split data into training and test data

- use TDD to develop models (fit and predict)
- measure accuracy of the models and decide whether to make improvements on the models
- models that will be used: Naive Bayes and support vector machines (SVM)
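
The last four steps of the plan (split, fit, predict, measure accuracy) could be sketched as below. This is only an illustration on toy data: the review texts, the scores, and the cut-off of 75 used to turn a numeric score into a class label are all assumptions, not values from the dataset, and `MultinomialNB` and `LinearSVC` are just one possible choice of Naive Bayes and SVM implementations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical); the notebook itself uses
# the critic_review and critics_score columns.
texts = [
    "a brilliant and polished game", "superb design and great fun",
    "wonderful story with great combat", "excellent world and brilliant writing",
    "a dull and broken mess", "boring combat and awful bugs",
    "terrible pacing and dull story", "awful controls and broken quests",
]
scores = [90, 88, 85, 92, 40, 35, 50, 45]

# Assumption: convert the numeric score to a class label with a cut-off of 75.
labels = [1 if s >= 75 else 0 for s in scores]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vectorizer on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

results = {}
for name, model in [("naive_bayes", MultinomialNB()), ("svm", LinearSVC())]:
    model.fit(X_train_vec, y_train)
    predictions = model.predict(X_test_vec)
    results[name] = accuracy_score(y_test, predictions)

print(results)
```

Fitting the vectorizer on the training split alone keeps vocabulary and IDF weights from the test reviews out of the model, so the accuracy figure is a fairer estimate.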

In [36]:
import pandas as pd
import numpy as np
import unittest
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Import the data

In [37]:
df = pd.read_csv('opencritic_data_cleaned.csv', index_col='Unnamed: 0')

df.head()

Unnamed: 0,name,developer,publisher,genre,release_date,description,critics_average_score,critic_review,date_of_review,critics_score
0,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,There's a category of games I think of as Satu...,2019-10-22 00:00:00,79.0
1,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,"With The Outer Worlds, Obsidian has found its ...",2019-10-22 00:00:00,85.0
2,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,"A conventional, easygoing scifi RPG with sligh...",2019-10-22 00:00:00,
3,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,"A deep, funny, and intricately designed RPG re...",2019-10-29 00:00:00,90.0
4,The Outer Worlds,Obsidian Entertainment,Private Division,RPG,2019-10-25,The Outer Worlds is a new single-player first-...,82,The Outer Worlds marks Obsidian operating at t...,2019-10-22 00:00:00,90.0


Columns required for training the models and making predictions:

- critic_review
- critics_score

In [38]:
df2 = df.loc[:,['critic_review','critics_score']]

df2

Unnamed: 0,critic_review,critics_score
0,There's a category of games I think of as Satu...,79.0
1,"With The Outer Worlds, Obsidian has found its ...",85.0
2,"A conventional, easygoing scifi RPG with sligh...",
3,"A deep, funny, and intricately designed RPG re...",90.0
4,The Outer Worlds marks Obsidian operating at t...,90.0
...,...,...
15339,For Honor developed from a promising concept t...,60.0
15340,For Honor is an incredibly competitive multipl...,75.0
15341,As an arena sword fighter For Honor does an ad...,
15342,For Honor is an impressive fighting game with ...,90.0


Clean the data (drop rows with null values)

In [39]:
df2 = df2.dropna()

type(df2)

pandas.core.frame.DataFrame

15333 - 11759 = 3574 rows dropped
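
A count like the one above can be computed rather than worked out by hand. A minimal sketch, assuming a toy frame with the same two column names (the rows here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with the same column names as df2.
sample = pd.DataFrame({
    "critic_review": ["Great RPG", "Easygoing scifi RPG", None, "Solid fighter"],
    "critics_score": [79.0, np.nan, 85.0, 90.0],
})

# Nulls per column, before dropping anything.
null_counts = sample.isna().sum()

rows_before = len(sample)
cleaned = sample.dropna()  # drops a row if any column is null
rows_dropped = rows_before - len(cleaned)

print(null_counts)
print(f"{rows_dropped} of {rows_before} rows dropped")
```

Printing `isna().sum()` first also shows which column the nulls come from, which `dropna()` on its own does not reveal.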

Use TDD to process the text, ready for training the model

In [53]:
# Take the reviews from df2 (nulls already dropped), not from the original df
reviews = df2.loc[:, 'critic_review'].values


## Process text code

In [45]:
# Remove punctuation from each review
def remove_punctuation(reviews):
    """Return a new list of reviews with every string.punctuation character removed."""
    punctuation = string.punctuation
    new_list = []
    for review in reviews:
        new_string = ''
        for letter in review:
            if letter not in punctuation:
                new_string += letter
        new_list.append(new_string)

    # TODO: explain the difficulty in refactoring this code due to the number of components
    return new_list

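# Lemmatize each word in a list of words
def lemmatize_review(review):
    """Lemmatize each item of `review`, which is expected to be a list of words."""
    lem = WordNetLemmatizer()
    lemmatized_review = [lem.lemmatize(word) for word in review]
    return lemmatized_review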

# Use TfidfVectorizer to lowercase the words, remove stop words and tokenize the data
def tfidfvectorize_reviews(review):
    # stopwords.words() already returns a list; newer scikit-learn releases
    # validate stop_words and may reject a set
    stop_words = stopwords.words('english')
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    fit_review = vectorizer.fit_transform(review)
    matrix = fit_review.toarray()  # dense NumPy array
    return matrix
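
On the refactoring note attached to `remove_punctuation`: one possible simplification is `str.translate` with a deletion table, which replaces the nested loops with a single call per review. The sketch below is an alternative under a hypothetical name (`remove_punctuation_translate`), not the notebook's own function.

```python
import string

# Translation table mapping every ASCII punctuation character to None (delete).
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_translate(reviews):
    """Hypothetical alternative to remove_punctuation using str.translate."""
    return [review.translate(_PUNCTUATION_TABLE) for review in reviews]


cleaned = remove_punctuation_translate(["Great game!!!", "So-so, really?"])
print(cleaned)  # ['Great game', 'Soso really']
```

It takes and returns the same types as the original (a list of strings in, a list of strings out), so the existing unit tests could be pointed at it unchanged.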
    


## Unit tests

In [51]:
class TestTextFormatting(unittest.TestCase):
    
    def test_remove_punc_exists(self):
        self.assertIsNotNone(remove_punctuation)
    
    def test_punctuation_removed(self):
        res = remove_punctuation(['x'+ string.punctuation, 'y'+ string.punctuation])
        for r in res:
            self.assertNotRegex(r, r'[\W]')
        
    def test_lemaatizer_exists(self):
        self.assertIsNotNone(lemmatize_review)
        
    def test_words_lemmatized(self):
        string_to_lem = ["corpora", "rockers", "players", "syllabi"]
        res = lemmatize_review(string_to_lem)
        self.assertEqual(res, ["corpus", "rocker", "player", "syllabus"])
    
    def test_vectorizer_exists(self):
        self.assertIsNotNone(tfidfvectorize_reviews)
        
    def test_words_vectorized(self):
        tester = ["This is a review",
          "This is a very good review",
          "This is a very bad review",
          "This is an awful review dont watch",
          "This review is outstanding"
          ]
        res = tfidfvectorize_reviews(tester)
        self.assertIsInstance(res, np.ndarray)
    
unittest.main(argv=['ignored', '-v'], exit=False)

test_lemaatizer_exists (__main__.TestTextFormatting) ... ok
test_punctuation_removed (__main__.TestTextFormatting) ... ok
test_remove_punc_exists (__main__.TestTextFormatting) ... ok
test_vectorizer_exists (__main__.TestTextFormatting) ... ok
test_words_lemmatized (__main__.TestTextFormatting) ... ok
test_words_vectorized (__main__.TestTextFormatting) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.005s

OK


<unittest.main.TestProgram at 0x1e88c279588>
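
One gap in the punctuation test: the pattern `[\W]` does not match the underscore, because `_` counts as a word character, so a leftover `_` would not fail the test. A stricter check could compare each character against `string.punctuation` directly. The sketch below repeats an equivalent `remove_punctuation` so it runs on its own; the test class name is hypothetical.

```python
import string
import unittest


def remove_punctuation(reviews):
    # Same behaviour as the notebook's function, repeated so the sketch runs alone.
    return ["".join(ch for ch in review if ch not in string.punctuation)
            for review in reviews]


class TestNoPunctuationLeft(unittest.TestCase):

    def test_every_punctuation_character_removed(self):
        res = remove_punctuation(["x" + string.punctuation, "y_z"])
        for cleaned in res:
            # Collect any punctuation that survived, underscore included.
            leftovers = [ch for ch in cleaned if ch in string.punctuation]
            self.assertEqual(leftovers, [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNoPunctuationLeft)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```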

Now that the functions have passed their tests, apply them to the reviews...
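
A rough sketch of what that application step might look like, on three made-up reviews. Two things here are assumptions rather than the notebook's approach: scikit-learn's built-in `"english"` stop-word list stands in for the NLTK list so the example needs no corpus download, and the result is kept as a sparse matrix instead of calling `.toarray()`. Lemmatization is left out because it needs the WordNet corpus; note also that `lemmatize_review` works on a list of words, so each review would have to be tokenized before being passed to it.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the `reviews` array built from the DataFrame.
sample_reviews = [
    "A deep, funny, and intricately designed RPG!",
    "A conventional, easygoing scifi RPG.",
    "An impressive fighting game with real depth.",
]

# Step 1: strip punctuation (same effect as remove_punctuation).
no_punct = ["".join(ch for ch in review if ch not in string.punctuation)
            for review in sample_reviews]

# Step 2: TF-IDF. Assumption: the built-in "english" stop-word list
# replaces the NLTK list used in tfidfvectorize_reviews.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(no_punct)  # SciPy sparse matrix

print(features.shape)                     # (number of reviews, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:5])  # first few vocabulary terms
```

With thousands of reviews the vocabulary can grow large, so a dense array from `.toarray()` may take far more memory than the sparse matrix; most scikit-learn estimators, including Naive Bayes and linear SVMs, accept the sparse form directly.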