<a id='ReturntoTop'></a>


<header>
  <div style="display:flex; align-items:center;">
    <div style="flex-grow:1;">
      <h1>NLP with Steam Video Game Reviews</h1>
      <h3>Notebook 2 - Data Cleaning and Prep</h3>
      <p>Author: David Lappin | Date: 5/12/2023 - */**/2023 </p>
    </div>
    <img src="bannerphoto/banner.jpg" alt="your-image-description" style="height:225px; margin-left:50px; border: 8px solid black;border-radius: 5%;">
  </div>
</header>

------------------------------------------------------------------------------------------------------------------------------

# Introduction and Purpose

# Table of Contents

[Packages Import](#1)

[Data Import](#2)

[Blank](#3)

[Blank](#4)

[Blank](#5)


# Packages Import
<a id='1'></a>
[Return to Top](#ReturntoTop)

**Matplotlib** - Used as needed for basic visualizations

**NumPy** - Supports large, multi-dimensional arrays and matrices, and provides a large collection of high-level mathematical functions to operate on these arrays.

**Pandas** - Additional data manipulation and analysis

**sklearn** - machine learning library

**seaborn** - graphing and visualization package

**nltk.corpus stopwords** - Allows for the removal of English stop words (a, an, am, for, etc.) in the user reviews

**nltk.stem PorterStemmer** - Removes the more common morphological and inflectional endings from English words

**nltk WordNetLemmatizer** - Reduces a word to its base or dictionary form (its lemma)

**re** - Regular expression support

**nltk.stem SnowballStemmer** - Strips word endings to reduce a word to its stem, which, unlike a lemma, is not always a dictionary word


In [72]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.metrics as sk_metrics

#text cleaning 
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk import WordNetLemmatizer
from nltk.stem import SnowballStemmer

# Data Import
[Return to Top](#ReturntoTop)
<a id='2'></a>

Import the reviews csv and explore some of the data within:

In [129]:
#import data

raw_df = pd.read_csv('data/train.csv')
raw_df.head(10)

| | review_id | title | year | user_review | user_suggestion |
|---|---|---|---|---|---|
| 0 | 1 | Spooky's Jump Scare Mansion | 2016.0 | I'm scared and hearing creepy voices. So I'll... | 1 |
| 1 | 2 | Spooky's Jump Scare Mansion | 2016.0 | Best game, more better than Sam Pepper's YouTu... | 1 |
| 2 | 3 | Spooky's Jump Scare Mansion | 2016.0 | A littly iffy on the controls, but once you kn... | 1 |
| 3 | 4 | Spooky's Jump Scare Mansion | 2015.0 | Great game, fun and colorful and all that.A si... | 1 |
| 4 | 5 | Spooky's Jump Scare Mansion | 2015.0 | Not many games have the cute tag right next to... | 1 |
| 5 | 6 | Spooky's Jump Scare Mansion | 2015.0 | Early Access ReviewIt's pretty cute at first, ... | 1 |
| 6 | 7 | Spooky's Jump Scare Mansion | 2017.0 | Great game. it's a cute little horror game tha... | 1 |
| 7 | 8 | Spooky's Jump Scare Mansion | 2015.0 | Spooky's Jump Scare Mansion is a Free Retro ma... | 1 |
| 8 | 9 | Spooky's Jump Scare Mansion | 2015.0 | Somewhere between light hearted, happy parody ... | 0 |
| 9 | 10 | Spooky's Jump Scare Mansion | 2015.0 | This game with its cute little out of the wall... | 1 |


The data looks like it loaded correctly. We can confirm with some additional info:

In [130]:
#general data information

raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17494 entries, 0 to 17493
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   review_id        17494 non-null  int64  
 1   title            17494 non-null  object 
 2   year             17316 non-null  float64
 3   user_review      17494 non-null  object 
 4   user_suggestion  17494 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 683.5+ KB


### Observations:
It looks like all the data is as it was in the last notebook. First, let's convert the review text to string type (instead of object), and then we can begin to clean the text of stop words, symbols, emojis, non-English text, etc.

In [131]:
#changing to string

raw_df['user_review'].astype(str).head()

0    I'm scared and hearing creepy voices.  So I'll...
1    Best game, more better than Sam Pepper's YouTu...
2    A littly iffy on the controls, but once you kn...
3    Great game, fun and colorful and all that.A si...
4    Not many games have the cute tag right next to...
Name: user_review, dtype: object

In [132]:
#commit change to string

raw_df['user_review'] = raw_df['user_review'].astype(str)

In [133]:
#check the data type for sanity check 

raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17494 entries, 0 to 17493
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   review_id        17494 non-null  int64  
 1   title            17494 non-null  object 
 2   year             17316 non-null  float64
 3   user_review      17494 non-null  object 
 4   user_suggestion  17494 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 683.5+ KB


In [134]:
print(raw_df['user_review'].dtype)

object


### Observations:

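Even after the conversion, the column still reports the `object` datatype. This is expected: in the pandas version used here, Python strings are stored in `object` columns by default, so `astype(str)` does not change the reported dtype. We can go ahead and work on cleaning the review text to prep for tokenization and modeling. Let's first review the previous selection of reviews: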
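If a dedicated string dtype is preferred over `object`, pandas offers the `"string"` extension dtype. The sketch below is not part of the original workflow; the two-row `reviews` series is made up for illustration, and the rest of this notebook works the same with either dtype.

```python
import pandas as pd

# Toy stand-in for raw_df['user_review'] (illustrative data only)
reviews = pd.Series(["Great game.", "Not for me."])

# Cast to pandas' dedicated string extension dtype
as_string = reviews.astype("string")

print(as_string.dtype)

# The .str accessor works as usual on the string dtype
lengths = as_string.str.len().tolist()
print(lengths)
```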

# Removing "Early Access Reviews"

After doing some research into Steam and its Early Access Reviews, I found that these are reviews based on pre-release or beta versions of the games. Essentially, these reviews are written by people playing an unfinished game and are intended to provide crowdsourced feedback to developers.

**Per Steam's Website Docs:** (https://partner.steamgames.com/doc/store/earlyaccess)

> *What is Early Access?
Steam Early Access enables you to sell your game on Steam while it is still being developed, and provide context to customers that a product should be considered "unfinished." Early Access is a place for games that are in a playable alpha or beta state, are worth the current value of the playable build, and that you plan to continue to develop for release.

> Releasing a game in Early Access helps set context for prospective customers and provides them with information about your plans and goals before a "final" release.*

These specific reviews might be a great way to predict the positive or negative sentiment of a game BEFORE it is released. Modeling these specific reviews could be a way of indicating the success or failure of a game pre-launch. They are not, however, reviews of a completed game and therefore should not be included in this project. We will remove these reviews before continuing to clean up the text.

**Let's look again at the sample reviews from Notebook: '1.0_EDA_and_Exploration'**

In [135]:
#Show some samples of the user reviews

for x in raw_df['user_review'].sample(n=10, random_state=1001):
    print(x)
    print('\n')

Clicker game that doesn't need you to click.You can just leave the game to play at background while you do your stuffs (including work, yeah, I've done it. Don't try this at supervised office computer, though), and come back once in a while to upgrade all your crusaders, then get back to what you were doing.The fairly recent addition of Mission makes collecting crusaders worth it.This game is an example of how F2P should be, many freebies - you can buy Jeweled Chests using Rubies which generously given by daily tasks and missions, and every event there are at least 3 Jeweled Event Chests for you, plus 1 extra for you newsletter subscriber.Definitely 5/7


ehh ส์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์

### Observations:

As we can see, in this selection alone, there are 4 examples of Early Access reviews. Let's go ahead and replace these with NaN (null) values:

In [136]:
#replaces all values in the user review column that start with "Early Access Review" with NaN
#Note: the '^' indicates "begins with" and the '.*' at the end indicates "followed by any characters"
#regex=True ensures that the replacement is applied using regular expression matching

raw_df['user_review'] = raw_df['user_review'].replace(r'^Early Access Review.*', np.nan, regex=True)
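To make the role of the `^` anchor concrete, here is a small sketch using only the standard `re` module. The three sample strings are made up for illustration (the first imitates the prefix format seen in the data); the point is that only text that begins with the prefix is flagged, not text that mentions it later.

```python
import re

# Same pattern as in the cell above
early_access = re.compile(r'^Early Access Review.*')

# Illustrative strings, not taken from the data set
samples = [
    "Early Access ReviewIt's pretty cute at first",
    "Great game. it's a cute little horror game",
    "Bought it after reading an Early Access Review",
]

# True only where the text starts with the prefix
flags = [bool(early_access.search(text)) for text in samples]
print(flags)  # [True, False, False]
```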

Let's look at the same sample again to see if the "Early Access" reviews have been replaced with NaN values:

In [137]:
for x in raw_df['user_review'].sample(n=10, random_state=1001):
    print(x)
    print('\n')

Clicker game that doesn't need you to click.You can just leave the game to play at background while you do your stuffs (including work, yeah, I've done it. Don't try this at supervised office computer, though), and come back once in a while to upgrade all your crusaders, then get back to what you were doing.The fairly recent addition of Mission makes collecting crusaders worth it.This game is an example of how F2P should be, many freebies - you can buy Jeweled Chests using Rubies which generously given by daily tasks and missions, and every event there are at least 3 Jeweled Event Chests for you, plus 1 extra for you newsletter subscriber.Definitely 5/7


ehh ส์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์ั์ํ์่์ํ์ั์ํ์๋์ํ์ั์ํ์็์ํ์ั์ํ์๋์ํ์

### Observations:
Great! It looks like they have been replaced. Let's now check how many were removed by summing all NaN values in the `user_review` column:

In [138]:
#check for Nan values in user_review column

raw_df['user_review'].isnull().sum()

5733

### Observations

5,733 Early Access reviews have been flagged, about a third of the 17,494 rows. While this is a sizable share of the data set, these reviews do not fit the objective of the project, so we will drop them in the next few cells.

After that, we can move on to text cleanup to address the punctuation, excess spaces, emojis, etc.



In [139]:
#sanity check to determine if the Nan values will be removed

raw_df.dropna(subset=['user_review'], inplace=False).isnull().sum()

review_id            0
title                0
year               178
user_review          0
user_suggestion      0
dtype: int64

In [140]:
#confirm drop values

raw_df.dropna(subset=['user_review'], inplace=True)

In [141]:
#check for Nan values in user_review column

raw_df['user_review'].isnull().sum()

0

In [142]:
#check the new length of the dataframe

len(raw_df)

11761
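As a quick consistency check on the counts reported above, the arithmetic can be written out directly. The variable names are only illustrative; the numbers are the ones printed by the earlier cells.

```python
# Counts taken from the cell outputs above
total_rows = 17494          # rows reported by raw_df.info()
early_access_rows = 5733    # NaN count after the replacement

# Rows expected to remain after dropping the Early Access reviews
remaining_rows = total_rows - early_access_rows

# Share of the original data that was removed
removed_share = early_access_rows / total_rows

print(remaining_rows)            # 11761
print(round(removed_share, 3))   # 0.328
```

The remaining count matches the new length of the dataframe, and the removed share confirms the "about a third" estimate.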

# Text Cleaning and Prep

The following is a comprehensive collection of text cleaning code sourced from a similar project (link to source below), which used a similar Steam data set for sentiment analysis. The data set in that project was significantly larger and potentially had more issues than we have uncovered here, so the code is very thorough. As a precaution, and because it won't hurt our data to cover all the bases, we will include it all.

In general, the code below defines text cleaning functions for the following:
- Removal of unwanted symbols, emojis, spaces, numbers and punctuation
- Removal of English Stop words
- Reducing words to their root form via 'Stemming'

I have included additional markdown to explain what is happening in each function.

**Source - https://www.kaggle.com/code/danielbeltsazar/steam-games-reviews-analysis-sentiment-analysis**

In [143]:
#Remove hyperlinks, HTML tags, and HTML entities left over in the review text

def clean(raw):
    """ Remove hyperlinks and markup """
    result = re.sub("<[a][^>]*>(.+?)</[a]>", 'Link.', raw)
    result = re.sub('&gt;', "", result)
    result = re.sub('&#x27;', "'", result)
    result = re.sub('&quot;', '"', result)
    result = re.sub('&#x2F;', ' ', result)
    result = re.sub('<p>', ' ', result)
    result = re.sub('</i>', '', result)
    result = re.sub('&#62;', '', result)
    result = re.sub('<i>', ' ', result)
    result = re.sub("\n", '', result)
    return result


#Remove all digits from the text

def remove_num(texts):
    output = re.sub(r'\d+', '', texts)
    return output


#Remove emojis that fall inside the Unicode ranges listed below

def deEmojify(x):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'', x)


#Collapse runs of multiple spaces into a single space

def unify_whitespaces(x):
    cleaned_string = re.sub(' +', ' ', x)
    return cleaned_string 


#Replace any run of characters other than letters, digits, and ?!., with a single space

def remove_symbols(x):
    cleaned_string = re.sub(r"[^a-zA-Z0-9?!.,]+", ' ', x)
    return cleaned_string

#Delete the remaining punctuation characters (nothing is inserted in their place)


def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":",  "!",'"',','))
    return final



#Lowercase the text and remove English stop words

stop=set(stopwords.words("english"))
stemmer=PorterStemmer()      #not used by the pipeline below
lemma=WordNetLemmatizer()    #not used by the pipeline below

def remove_stopword(text):
    text=[word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

#Reduce each word to its stem with the Snowball stemmer
#Note: nltk.word_tokenize needs the NLTK tokenizer data to be downloaded first

snowball_stemmer = SnowballStemmer('english')

def Stemming(text):
    word_tokens = nltk.word_tokenize(text)
    stemmed_word = [snowball_stemmer.stem(word) for word in word_tokens]
    return ' '.join(stemmed_word)



In [144]:
def cleaning(df,review):
    df[review] = df[review].apply(clean)
    df[review] = df[review].apply(deEmojify)
    df[review] = df[review].str.lower()
    df[review] = df[review].apply(remove_num)
    df[review] = df[review].apply(remove_symbols)
    df[review] = df[review].apply(remove_punctuation)
    df[review] = df[review].apply(remove_stopword)
    df[review] = df[review].apply(unify_whitespaces)
    df[review] = df[review].apply(Stemming)

In [145]:

cleaning(raw_df,'user_review')


In [146]:
for x in raw_df['user_review'].sample(n=10, random_state=1001):
    print(x)
    print('\n')

start play creativers two year ago start play close halloween play creativers saw ghost leafi confus never seen went kill walk hill saw like okay halloween went tame one brought back workshop name ghosti halloween event sad ghost leafi ghost creatur wish creativers kept ghost creatur great nightim creatur halloween although get close halloween year hope bring back creativers still great game


learn curv steep first hrs die non stop mayb happen suck get hang realli enjoy game amaz see good qualiti free play actual fp pay wini play light assault time like jetpack say sure everi class feel differ uniqu gun uniqu even gun differ stat feel differ higbi awesom need good enough rig play upgrad gb ram run smooth much wrongi see peopl complain interact dev guess that kind con dont care much dev interactionsfin word addict


lack atmospher kiddi art direct wannab wow graphic hand hold quest sparkl show exact go quest deiti charact background impact charact game play racial bonus super conserv b
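A few tokens in the output above, such as `interactionsfin`, look like two words fused together. One likely cause is that `remove_punctuation` deletes punctuation without leaving a space, and many reviews have no space after a period. The sketch below contrasts that behaviour with a hypothetical variant, `remove_punctuation_spaced`, which is an assumption of mine and not part of the sourced code. The sample sentence is made up.

```python
import re

def remove_punctuation_joined(text):
    # Mirrors the notebook's approach: punctuation is deleted outright
    return "".join(ch for ch in text if ch not in ("?", ".", ";", ":", "!", '"', ","))

def remove_punctuation_spaced(text):
    # Hypothetical variant: swap punctuation for a space, then collapse repeats
    spaced = re.sub(r'[?.;:!",]', ' ', text)
    return re.sub(' +', ' ', spaced).strip()

# Made-up sentence with no space after the period
sample = "dev interactions.Final word: addictive"

joined = remove_punctuation_joined(sample)
spaced = remove_punctuation_spaced(sample)

print(joined)  # dev interactionsFinal word addictive
print(spaced)  # dev interactions Final word addictive
```

Whether the spaced variant is worth adopting depends on how often fused tokens show up in the vocabulary; switching would change the cleaned output shown above.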

In [147]:
raw_df2 = raw_df.copy()




In [148]:
from collections import Counter

#split each cleaned review into a list of words
raw_df2['temp_list'] = raw_df2['user_review'].apply(lambda x:str(x).split())

#count every word across all reviews and keep the 20 most common
top = Counter([item for sublist in raw_df2['temp_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

------------------------------------------------------------------------------------------------------------------------------

# Summary and Next Steps:
<a id='7'></a>
[Return to Top](#ReturntoTop)

**From the Preliminary EDA we found the following out and completed the following tasks for our dataset:**
- This is a relatively small data set with only ~17,000 reviews
- There are 44 unique video game titles with multiple reviews for each. Each review also logs an overall (binary) recommendation.
- There were no duplicate values and no missing reviews (only the `year` column has missing values, 178 of them), and no data imputation was required
- Of the reviews, we noted that some have non-English words and symbols.
- There are a few data imbalances in our data
    - There are more overall positive reviews than negative reviews
    - There are varying amounts of reviews by each title
    - There are varying amounts of positive and negative reviews within each title (some are highly positive and others are highly negative)
- Generally speaking, the games are mostly positively reviewed (assuming positive percentage greater than 50%)


**Next Steps: (Data Cleaning and Prep)**
- First, we will need to remove non-English reviews
    - re-evaluate the amount of data we have and the distribution between positive and negative reviews after this
- Determine if we will employ methods to reduce data imbalances (e.g., under- vs. oversampling, class weighting, data augmentation)

    

### Next steps are located in the Second Notebook - '2.0_DataCleaning_and_Prep'