# Game "Safety" Classification From Game Reviews
by [CSpanias](https://cspanias.github.io/aboutme/), 1st Week's Project for [Solving Business Problems with NLP](https://omdena.com/course/solving-business-problems-with-nlp/) by Omdena

# CONTENT
1. [Webscraping Data](#webscraping)
1. [Functions for Text PreProcessing, Stemming \& Model Evaluation](#functions)
1. [Data Wrangling](#datawrangling)
1. [NLP PipeLine](#pipeline)
    1. [Basic NLP Count-Based Features](#CountBasedFeatures)
    1. [Sentiment Analysis](#sentimentanalysis)
    1. [Term Frequency-Inverse Document Frequency](#tfidf) 
    1. [Logistic Regression](#logreg)
    1. [Random Forest Classifier](#rfc)
        1. [Hyperparameter Tuning](#gs)
    1. [Sentiment Analysis B](#vader)
1. [Conclusions](#conclusions)

<a name="webscraping"></a>
# 1. Webscraping Data

The data were scraped using __[ParseHub](https://www.parsehub.com/)__, a free and straightforward tool for web scraping.

Notice that in order to ensure that the games will have reviews, my __starting URL__ had the games __sorted by Stars: High to Low__:

<img src="sort.PNG" align='left' alt="sort" style="width: 80%;" />

In addition, reviews were __filtered by both Parent & Kids Popularity__:

<img src="filters.PNG" align='left' alt="filters" style="width: 80%;" />

 You can see the __step-by-step commands__ used on the following GIF image.
 
 __Note__: Uncomment and render as markdown cell to see it.

In [None]:
#![parsehub_commands.gif](attachment:parsehub_commands.gif)

<a name="functions"></a>
# 2. Functions for Text Pre-Processing, Stemming & Model Evaluation

First, we have to __import the required libraries__ that we aim to work with.

In [None]:
# import required libraries
import pandas as pd # import dataset, create and manipulate dataframes



import string # count-based features


from pprint import pprint # pretty print
from gensim.parsing.preprocessing import remove_stopwords

from sklearn.linear_model import LogisticRegression # model
from sklearn.ensemble import RandomForestClassifier # model
from sklearn.feature_extraction.text import TfidfVectorizer # count-based language models
from sklearn.metrics import classification_report, make_scorer # model evaluation metrics
from sklearn.metrics import accuracy_score, f1_score # model evaluation metrics
from sklearn.model_selection import GridSearchCV # split & evaluate dataset, hyperparameter optimization
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict # cross-validation
from collections import Counter # count-based calculations
from textblob import TextBlob # sentiment analysis



pd.options.mode.chained_assignment = None  # hide warnings

Next, to __avoid repetitive code chunks__, we will create some relatively simple functions to use later in our NLP pipeline:
1. `normalize_document` for basic text preprocessing tasks
2. `simple_text_preprocessor` for stemming and some basic text preprocessing tasks
3. `extended_classification_report` for evaluating our models
4. `generate_confusion_matrix` for visualizing our results

The `normalize_document` function performs some basic text pre-processing tasks on any text document. More specifically:
1. Remove special characters (any characters that are not alphabetic, numeric or whitespace) using regular expressions.
2. Remove leading and trailing whitespace.
3. Expand contracted words, e.g. `It's` &rarr; `It is`.
4. Tokenize text (split sentences into individual words).
5. Remove stopwords such as the, a, an, etc.
6. Join tokens back into a single string, i.e. like it was first inputted, but "cleaned".

__Note__: A great [article](https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a) on how stopword removal differs across libraries (__NLTK__, __spaCy__, __gensim__, __scikit-learn__).

In [1]:
import contractions # expand contractions
import re # regular expressions
import numpy as np # vectorize functions and perform calculations
from nltk.tokenize import word_tokenize # tokenize strings
from gensim.parsing.preprocessing import STOPWORDS # removing stopwords

def normalize_document(doc):
    """Normalize the document by performing basic text pre-processing tasks."""
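    # remove special characters (flags must be passed by keyword; the fourth
    # positional argument of re.sub is `count`, not `flags`)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # remove leading and trailing whitespace
    nowhite = doc.strip()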
    # expand contractions
    expanded = contractions.fix(nowhite)
    # tokenize document
    tokens = word_tokenize(expanded)
    # remove stopwords
    filtered_tokens = [token for token in tokens if token not in STOPWORDS]
    # re-create document from tokens
    doc = ' '.join(filtered_tokens)

    return doc

# vectorize function for faster computations
normalize_corpus = np.vectorize(normalize_document)
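
To see what this kind of normalization does to a single review, here is a minimal, dependency-free sketch. It mirrors only the special-character, whitespace and stopword steps; `toy_normalize` and `TOY_STOPWORDS` are hypothetical stand-ins (the notebook itself relies on gensim's `STOPWORDS` and NLTK's tokenizer), and contraction expansion is left out.

```python
import re

# Hypothetical miniature stopword list, for illustration only
TOY_STOPWORDS = {"the", "a", "an", "is", "it", "and"}

def toy_normalize(doc):
    """Strip special characters, trim whitespace and drop stopwords."""
    # keep only ASCII letters, digits and whitespace; flags go in by keyword
    # because the fourth positional argument of re.sub is `count`
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc, flags=re.A)
    # str.split() with no arguments also discards leading/trailing whitespace
    tokens = doc.split()
    return " ".join(t for t in tokens if t.lower() not in TOY_STOPWORDS)

print(toy_normalize("  The game is fun, and it is SAFE!  "))
```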

We will also define a function that __stems__ a document, that is, strips each word's __affixes__. We will apply it __before computing the tf-idf features__ (more on this later).

In [2]:
from nltk.stem import PorterStemmer # stemming

def simple_text_preprocessor(document):
    """Perform basic text pre-processing tasks."""
    # load up a simple porter stemmer - nothing fancy
    ps = PorterStemmer()

    # lower case
    document = str(document).lower()

    # expand contractions
    document = contractions.fix(document)

    # remove unnecessary characters
    document = re.sub(r'[^a-zA-Z]',r' ', document)
    document = re.sub(r'nbsp', r'', document)
    document = re.sub(' +', ' ', document)

    # simple porter stemming
    document = ' '.join([ps.stem(word) for word in document.split()])

    # stopwords removal
    document = ' '.join([word for word in document.split() if word not in STOPWORDS])

    return document

# vectorize function
stp = np.vectorize(simple_text_preprocessor)
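
Stemming maps inflected forms of a word onto a shared stem, so that they count as the same term later on. The sketch below is a deliberately crude suffix stripper, not the Porter algorithm used above; `crude_stem` and its suffix list are assumptions made purely for illustration.

```python
def crude_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["playing", "played", "plays", "play"]
print({w: crude_stem(w) for w in words})
```

All four forms collapse to the same stem, which is the effect we want before building a vocabulary.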

Next we define `extended_classification_report`, a function (which I am sure could use a lot of refactoring) that evaluates a model using __cross-validation__. It is what its name says: an extension of the original __classification report__ from scikit-learn. Note that it expects a 10-split cross-validator, since the column names are hard-coded.

In [3]:
from sklearn.model_selection import cross_validate

def extended_classification_report(model, kf, X, y):
       
    # define scoring metrics
    scoring = ['accuracy', 'precision', 'recall', 'f1', 'neg_brier_score', 'neg_log_loss', 'roc_auc']
    
    # cross-validate model
    model_scores = cross_validate(model, X, y, cv=kf, scoring=scoring, return_train_score=True)

    accuracy_train = []
    accuracy_test = []
    precision_splits = []
    recall_splits = []
    f1_splits = []
    brier_splits = []
    logloss_splits = []
    rocauc_splits = []
    for key, value in model_scores.items():
        if key == 'train_accuracy':
            accuracy_train.append(value)
        if key == 'test_accuracy':
            accuracy_test.append(value)
        if key == 'test_precision':
            precision_splits.append(value)
        if key == 'test_recall':
            recall_splits.append(value)
        if key == 'test_f1':
            f1_splits.append(value)
        if key == 'test_neg_brier_score':
            brier_splits.append(value)
        if key == 'test_neg_log_loss':
            logloss_splits.append(value)
        if key == 'test_roc_auc':
            rocauc_splits.append(value)


    # set column names
    split_cols_names = ['split 1', 'split 2', 'split 3', 'split 4', 'split 5',
                        'split 6', 'split 7', 'split 8', 'split 9', 'split 10']

    # convert lists of scores to dataframe
    accuracy_train = pd.DataFrame(accuracy_train, columns=split_cols_names )
    accuracy_test = pd.DataFrame(accuracy_test, columns=split_cols_names)
    precision_splits = pd.DataFrame(precision_splits, columns=split_cols_names)
    recall_splits = pd.DataFrame(recall_splits, columns=split_cols_names)
    f1_splits = pd.DataFrame(f1_splits, columns=split_cols_names)
    brier_splits = pd.DataFrame(brier_splits, columns=split_cols_names)
    logloss_splits = pd.DataFrame(logloss_splits, columns=split_cols_names)
    rocauc_splits = pd.DataFrame(rocauc_splits, columns=split_cols_names)

    # rename rows
    accuracy_train.rename(index = {0: "Accuracy Train"}, inplace=True)
    accuracy_test.rename(index = {0: "Accuracy Test"}, inplace=True)
    precision_splits.rename(index = {0: "Precision"}, inplace = True)
    recall_splits.rename(index = {0: "Recall"}, inplace = True)
    f1_splits.rename(index = {0: "F1"}, inplace = True)
    brier_splits.rename(index = {0: "Brier"}, inplace = True)
    logloss_splits.rename(index = {0: "LogLoss"}, inplace = True)
    rocauc_splits.rename(index = {0: "RocAuc"}, inplace = True)


    # merge all dataframes into a single one
    metrics_model = pd.concat([accuracy_train, accuracy_test, precision_splits, recall_splits, f1_splits,
                         brier_splits, logloss_splits, rocauc_splits])

    # calculate mean scores for each row
    mean_scores = metrics_model.mean(axis=1)

    # append column to the dataframe
    metrics_model['mean'] = round(mean_scores, 4)
    
    # display dataframe as a table
    return display(metrics_model)
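
`cross_validate` returns a dictionary that maps keys such as `test_f1` to one score per split, so a summary does not have to hard-code the number of splits. A minimal sketch of that idea follows; the scores in `fake_scores` are made up (they are not output from this notebook) and `summarize_scores` is a hypothetical helper.

```python
from statistics import mean

# Made-up per-split scores, shaped like the dict returned by cross_validate
fake_scores = {
    "fit_time": [0.10, 0.12, 0.11],
    "train_accuracy": [0.95, 0.96, 0.94],
    "test_accuracy": [0.90, 0.88, 0.92],
    "test_f1": [0.91, 0.89, 0.93],
}

def summarize_scores(scores):
    """Return {metric: (per-split scores, rounded mean)} for train/test metrics."""
    summary = {}
    for key, values in scores.items():
        # skip timing entries such as fit_time and score_time
        if key.startswith(("train_", "test_")):
            summary[key] = (list(values), round(mean(values), 4))
    return summary

for metric, (splits, avg) in summarize_scores(fake_scores).items():
    print(f"{metric:15s} splits={splits} mean={avg}")
```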

Lastly, we define a short function, `generate_confusion_matrix`, for visualizing our results.

In [4]:
import seaborn as sns # visualization
import matplotlib.pyplot as plt # visualization
from sklearn.metrics import confusion_matrix

def generate_confusion_matrix(y, y_pred):
    """Generate a confusion matrix based on a seaborn heatmap."""
    cm = confusion_matrix(y, y_pred)
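    # visualize confusion matrix with seaborn heatmap
    # (scikit-learn puts actual classes on rows and predicted classes on
    # columns, ordered by label value: 0 first, then 1)
    cm_matrix = pd.DataFrame(data=cm,
                             columns=['Predicted Negative:0', 'Predicted Positive:1'],
                             index=['Actual Negative:0', 'Actual Positive:1'])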
    fig, ax = plt.subplots(figsize=(7,7))  
    return sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu');
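
For reference, scikit-learn's `confusion_matrix` places the actual classes on the rows and the predicted classes on the columns, sorted by label, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The dependency-free sketch below rebuilds that layout by hand for a few made-up labels and derives precision, recall and F1 from it; `toy_confusion` is a hypothetical helper, not part of scikit-learn.

```python
def toy_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1 (rows = actual)."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

# made-up labels, for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

(tn, fp), (fn, tp) = toy_confusion(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(toy_confusion(y_true, y_pred), precision, recall, round(f1, 4))
```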

<a name="datawrangling"></a>
# 2. Data Wrangling

In [None]:
# import dataset
df = pd.read_csv('https://raw.githubusercontent.com/CSpanias/nlp_resources/main/nlp_omdena/w1/game_data.csv')

In [None]:
# inspect first 5 rows
df.head()

We can see that the __column names are unnecessarily long__, so it seems like a good idea to rename them.

In [None]:
# check column names
print(df.columns)

# rename columns
df.rename(columns={
    'game_title_name': 'title',
    'game_title_game_age': 'age',
    'game_title_kid_review_name': 'review_kid',
    'game_title_parent_review_name': 'review_parent'
}, inplace=True)

# check first 5 rows
df.head()

In [None]:
# check basic stats
df.info()

In [None]:
# check for duplicates
print("Number of duplicated rows: {}.".format(df.duplicated().sum()))

In [None]:
# drop duplicated rows
df.drop_duplicates(keep='first', inplace=True)

# check duplicates
print("Number of duplicated rows: {}.".format(df.duplicated().sum()))

The age column, which will form the basis for our classification, includes the `+` symbol, whitespace, and the string `age`.

We will __clean that up using a regular expression__ and extract only what is relevant to us, i.e. the numeric characters.

We will also __replace `NaN` values with an empty string__ and then __convert the column to numeric__.

In [None]:
# clean age column: drop every non-digit character
df.age = df.age.str.replace(pat=r'\D', repl='', regex=True)

# check first 5 rows
df.head()
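
A quick way to sanity-check a digit-extraction pattern is to run it on a few strings with the standard `re` module. The sample values below are made up to resemble the scraped `age` entries, and the pattern used is `\D`, which matches any character that is not a digit.

```python
import re

# made-up examples resembling the raw `age` values
samples = ["age 17+", "age 8+", " age 10+ "]

cleaned = [re.sub(r"\D", "", s) for s in samples]
print(cleaned)
```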

In [None]:
# replace NaN values
df.replace(np.nan,'',regex=True, inplace=True)
df.head()

In [None]:
# convert age column to numeric
df.age = pd.to_numeric(df.age)

# check dtype of age
df.info()

We are not interested in separating the kids' reviews from the parents' reviews in this project, so we will __concatenate the two into a single column__.

In [None]:
# merge reviews into 1 column (the space keeps adjacent words from fusing)
df['reviews'] = df.review_kid + ' ' + df.review_parent

# discard unnecessary columns
df.drop(columns=['review_kid', 'review_parent'], inplace=True)

# check 1st five rows
df.head()

In [None]:
# make every review lower-case
df['reviews'] = df['reviews'].apply(str.lower)

# check first 5 rows
df.head()

In [None]:
# check missing values
df.isna().sum()

As we can see, there are __303 missing values__ in the age column.

We will fill those with __the forward fill method__: each missing value belongs to the game listed before it, so it has the same age as the cell above.

In [None]:
# forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
df.ffill(inplace=True)

# check for NaNs
df.isna().sum()
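
Forward filling simply carries the last seen value into the gaps that follow it. Here is a plain-Python sketch of the idea on a made-up column; `forward_fill` is an illustrative helper, not the pandas implementation.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# made-up ages: each gap belongs to the game listed just above it
ages = [17, None, None, 10, None, 13]
print(forward_fill(ages))
```

Note that a gap at the very start has nothing to copy from and therefore stays empty.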

Now, we will create our target column based on age:
* If the game has an `age 17+` sign, we want to classify this as `non-safe` and label it as `0`, otherwise we will classify it as `safe` and label it as `1`.


In [None]:
# create category of interest, 1 = safe, 0 = non-safe
df['safe'] = df.apply(lambda row: 1 if row['age'] < 17  else 0, axis=1)

# check 1st five rows
df.head()

Since we have our target column, we don't really need `age` anymore (and we never needed `title` to begin with!).

In [None]:
# discard unnecessary columns
df.drop(columns=['title', 'age'], inplace=True)

In [None]:
# check values (value_counts() is indexed by label: 1 = safe, 0 = non-safe)
print("The target distribution is {} safe (1) and {} non-safe (0) game titles."
      .format(df.safe.value_counts()[1], df.safe.value_counts()[0]))

In [None]:
from wordcloud import WordCloud # visualization

# generate a wordcloud for safe titles
safe_wordcloud = WordCloud(width=512, height=512).generate(' '.join(df['reviews'][df['safe']==1]))
plt.figure(figsize=(6, 4), facecolor='k')
plt.imshow(safe_wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

In [None]:
# generate a wordcloud for non-safe titles
non_safe_wordcloud = WordCloud(width=512, height=512).generate(' '.join(df['reviews'][df['safe']==0]))
plt.figure(figsize=(6, 4), facecolor='k')
plt.imshow(non_safe_wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

In [None]:
# normalize 'reviews' column
norm_corpus = normalize_corpus(list(df['reviews']))

In [None]:
# check shape
print("The 'review' column has {} rows.\n".format(df.reviews.shape[0]))

# check first 5 rows
print("The first 5 reviews are:\n{}\n\n".format(df.reviews.head()))

In [None]:
# assign feature & target variables
X = df.drop(['safe'], axis = 1)
y = df['safe']

# check shape of features & target sets
print("Features' set shape: {} | Target's set shape {}."
      .format(X.shape, y.shape))

<a name="pipeline"></a>
# 3. NLP Pipeline

<a name="CountBasedFeatures"></a>
## 3.1 Basic NLP Count-Based Features

A number of basic text-based features can also be created, which are sometimes helpful for **improving text classification models**.

Some examples are:

- __Word Count:__ total number of words in the documents
- __Character Count:__ total number of characters in the documents
- __Average Word Density:__ average length of the words used in the documents
- __Punctuation Count:__ total number of punctuation marks in the documents
- __Upper Case Count:__ total number of upper case words in the documents
- __Title Word Count:__ total number of proper case (title) words in the documents

Since we chose to __lower-case our reviews__ during the text preprocessing step, we won't need the __upper-case__ & __title-case__ word count features.

**Note**: The aforementioned information comes from [this](https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/) article.

In [None]:
# calculate total number of characters
X['char_count'] = X['reviews'].apply(len)
# calculate total number of words
X['word_count'] = X['reviews'].apply(lambda x: len(x.split()))
# calculate average word density (+1 avoids division by zero for empty reviews)
X['word_density'] = X['char_count'] / (X['word_count']+1)
# calculate total number of punctuation marks
X['punctuation_count'] = X['reviews'].apply(lambda x: sum(ch in string.punctuation for ch in x))

In [None]:
# check df
X.head()
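
The same four features can be checked by hand on a single string, which helps when verifying the column values. The review below is made up for illustration.

```python
import string

review = "great game, but too violent for kids!"  # made-up review

char_count = len(review)
word_count = len(review.split())
# +1 in the denominator mirrors the guard against empty reviews
word_density = char_count / (word_count + 1)
punctuation_count = sum(ch in string.punctuation for ch in review)

print(char_count, word_count, round(word_density, 2), punctuation_count)
```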

<a name="sentimentanalysis"></a>
## 3.2 Sentiment Analysis

> _"Sentiment Analysis is the process of determining whether a piece of writing is positive, negative or neutral."_ ([Lexalytics](https://www.lexalytics.com/technology/sentiment-analysis))

We want to **infer safety from game reviews**, which are highly **subjective** and **opinionated**, and in which people often **express strong emotions** and **feelings**.

This makes these text documents a good candidate for **extracting sentiment as a feature**.

The general expectation is that a **"safe" review** (label 1) should have a **positive sentiment** and a **"non-safe" review** (label 0) should have a **negative sentiment**.

**`TextBlob`** is an excellent open-source library for performing **sentiment analysis**. It relies on a **sentiment lexicon**, which it leverages to give both **polarity and subjectivity scores**.

* Polarity is a float in the range \[-1,1\], where -1 indicates negative sentiment and +1 indicates positive sentiment.
* Subjectivity is a float in the range \[0,1\], where 0 is very objective and 1 is very subjective. Subjective sentences generally refer to opinion, emotion, or judgment.

This is **unsupervised**, **lexicon-based sentiment analysis**, where **we don't have any pre-labeled data** saying which review might have a positive or negative sentiment.

**Note**: The above information comes from [this](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72) and [this](https://www.analyticsvidhya.com/blog/2021/01/sentiment-analysis-vader-or-textblob/) article.

In [None]:
# calculate review's sentiment 
x_snt_obj = X['reviews'].apply(lambda row: TextBlob(row).sentiment)
# create a column for polarity scores
X['Polarity'] = [obj.polarity for obj in x_snt_obj.values]
# create a column for subjectivity scores
X['Subjectivity'] = [obj.subjectivity for obj in x_snt_obj.values]

In [None]:
# check df
X.head()
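
To make "lexicon-based" concrete: such a scorer looks up each word in a dictionary of pre-assigned polarity values and aggregates the hits. The sketch below averages the scores of the matched words. The tiny lexicon and the `toy_polarity` helper are invented for illustration and are far simpler than TextBlob's analyzer.

```python
# Invented miniature lexicon: word -> polarity in [-1, 1]
TOY_LEXICON = {"fun": 0.6, "great": 0.8, "violent": -0.7, "scary": -0.5}

def toy_polarity(text):
    """Average the polarity of the words found in the lexicon (0.0 if none)."""
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(toy_polarity("great fun for the whole family"))
print(toy_polarity("too violent and scary"))
print(toy_polarity("nothing to report"))
```

No labels are involved at any point, which is what makes the approach unsupervised.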

In [None]:
# create a new column with cleaned text
X['clean_reviews'] = stp(X['reviews'].values)

# check first 5 rows
X.head()

In [None]:
# remove the 2 columns
X_metadata = X.drop(['reviews', 'clean_reviews'], axis=1).reset_index(drop=True)

# check first 5 rows
X_metadata.head()

<a name="tfidf"></a>
## 3.3 Term Frequency-Inverse Document Frequency

__Term Frequency-Inverse Document Frequency__ (tf-idf) uses a combination of two metrics in its computation, namely: __term frequency__ (tf) and __inverse document frequency__ (idf).

This technique was developed for ranking results for queries in search engines and now it is an indispensable model in the world of __information retrieval__ and NLP.

__Note__: More information about [__tfidf__](https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a).
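
As a rough sketch of the computation: the term frequency counts how often a term occurs in a document, and the inverse document frequency down-weights terms that occur in many documents. The snippet below follows the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The toy corpus and the `toy_tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

corpus = ["fun game fun", "violent game", "fun puzzl game"]  # made-up, pre-stemmed
for vec in toy_tfidf(corpus):
    print({t: round(w, 3) for t, w in vec.items()})
```

A term such as `game`, which appears in every document, ends up with a lower weight than a rarer term in the same document.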

In [None]:
# instantiate vectorizer
tv = TfidfVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))

# fit vectorizer to 'clean_reviews' and convert the result to a numpy array
X_tv = tv.fit_transform(X['clean_reviews']).toarray()
# create a pandas DataFrame (get_feature_names() was removed in scikit-learn 1.2)
X_tv = pd.DataFrame(X_tv, columns=tv.get_feature_names_out())

# check first 5 rows
X_tv.head()

Now we will __concatenate the two dataframes__, the one that holds the __review metadata__ and the one with the __tf-idf scores__, into one.

In [None]:
# concatenate the 2 dataframes
X_comb = pd.concat([X_metadata, X_tv], axis=1)

# check first 5 rows
X_comb.head()

<a name="logreg"></a>
## 3.4 Logistic Regression

In [None]:
# instantiate log reg
lr = LogisticRegression(C=1, random_state=42, solver='liblinear')

# choose how many train/test sets we want by "n_splits"
kfold = StratifiedKFold(n_splits=10, shuffle=True)

# evaluate model
extended_classification_report(model=lr, kf=kfold, X=X_comb, y=y)

In [None]:
# predict using cv
y_pred = cross_val_predict(lr, X_comb, y, cv=kfold)

# generate cm
generate_confusion_matrix(y=y, y_pred=y_pred);
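
Stratification means every fold keeps roughly the same class ratio as the full dataset. The sketch below illustrates the idea by dealing the indices of each class round-robin into folds. It is an illustrative stand-in (`toy_stratified_folds` is hypothetical), not scikit-learn's `StratifiedKFold` algorithm, and it does no shuffling.

```python
from collections import Counter, defaultdict

def toy_stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so class ratios stay balanced."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # deal the indices of one class out like cards, one fold at a time
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

labels = [1] * 8 + [0] * 4  # made-up, imbalanced 2:1
for fold in toy_stratified_folds(labels, n_splits=4):
    print(sorted(fold), Counter(labels[i] for i in fold))
```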

<a name="rfc"></a>
## 3.5 Random Forest Classifier

In [None]:
# instantiate model
rfc = RandomForestClassifier()

# evaluate model
extended_classification_report(model=rfc, kf=kfold, X=X_comb, y=y);

In [None]:
# predict using cv
y_pred = cross_val_predict(rfc, X_comb, y, cv=kfold)

# generate cm
generate_confusion_matrix(y=y, y_pred=y_pred);

<a name="gs"></a>
## 3.5.1 Hyperparameter Tuning

__Note__: Details on how to tune a Random Forest model can be found in this [article](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74).

In [None]:
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rfc.get_params())

In [None]:
# create the parameter grid to search over
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 20, 40, 60, 80, 100],
    'max_features': [2, 3],
    'min_samples_leaf': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 15, 100],
    'n_estimators': [100, 250, 500, 750, 1000, 1200]
}

# choose how many train/test sets we want by "n_splits"
kfold = StratifiedKFold(n_splits=3, shuffle=True)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rfc, param_grid = param_grid, 
                          cv = kfold, n_jobs = -1, verbose = 3)

# Fit the grid search to the data
grid_search.fit(X_comb, y)
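
Grid search fits one model per parameter combination per fold, so it helps to estimate the cost before launching it. Using the same grid values as above, the count can be worked out with the standard library alone:

```python
from math import prod

# same values as the grid above, repeated here so the snippet is self-contained
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 20, 40, 60, 80, 100],
    'max_features': [2, 3],
    'min_samples_leaf': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 15, 100],
    'n_estimators': [100, 250, 500, 750, 1000, 1200],
}
n_folds = 3

# one candidate per element of the Cartesian product of all value lists
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With this grid that comes to 1,680 candidates and 5,040 fits, plus one final refit on the full data, which is why `n_jobs=-1` is worth setting. `RandomizedSearchCV` is a common alternative when a grid grows this large.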

In [None]:
# check best parameters
print("Parameters suggested by GS:\n\n{}".format(grid_search.best_params_))

# retrieve the model refit on the full data with the best params
best_grid = grid_search.best_estimator_

In [None]:
# choose how many train/test sets we want by "n_splits"
kfold = StratifiedKFold(n_splits=10, shuffle=True)

# evaluate model
extended_classification_report(model=best_grid, kf=kfold, X=X_comb, y=y)

In [None]:
# predict using cv
y_pred = cross_val_predict(best_grid, X_comb, y, cv=kfold)

# generate cm
generate_confusion_matrix(y=y, y_pred=y_pred);

<a name="vader"></a>
# 4. Sentiment Analysis B

[TextBlob vs Vader](https://www.analyticsvidhya.com/blog/2021/01/sentiment-analysis-vader-or-textblob/)

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

<a name="conclusions"></a>
# 5. Conclusions

1. As we can see, __Logistic Regression__ does a pretty good job, with a __mean F1 score of 89%__.


2. __Random Forest__ without any tuning performs almost perfectly, with a __mean F1 score of 97%__.


3. __Hyperparameter tuning__ using grid search increased the performance even further and raised the __mean F1 score to over 99%__!