### Cleaning the dataset

1. Importing the necessary libraries
2. Creating a pandas DataFrame and getting an overview of the data
3. Counting the negative and positive reviews, and dropping any rows with missing or invalid values
4. Removing special characters from the reviews
5. Optional: autocorrecting words to reduce the information lost to spelling errors
6. Removing stopwords and non-English words to improve the performance of the ML algorithms
7. Lemmatizing words to further improve performance
8. Saving the cleaned dataset to a CSV file

In [11]:
!pip install pandas
!pip install nltk



In [12]:
#from autocorrect import Speller
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download("words")
nltk.download("punkt")
nltk.download('wordnet')
nltk.download("stopwords")

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
reviews = pd.read_csv("Steam_Reviews")
# Dropping rows (not columns) based on their Genre value:
# - "MOBA": not enough reviews could be scraped for this genre
# - "Genre": repeated header rows, which are not valid reviews and can be safely removed
reviews = reviews[reviews["Genre"] != "MOBA"]
reviews = reviews[reviews["Genre"] != "Genre"]
# Checking for the genre distribution
counts = reviews["Genre"].value_counts()
counts

Genre
RPG           140305
Coop          130000
FPS           129998
Strategy      113478
Platformer    111787
Fighter       103367
Name: count, dtype: int64

In [14]:
# Removing rows with missing values, then removing rows where "Recommended" holds the literal string "Recommended"
# (a repeated header row, which appears whenever a new API call was started for a new game).
# Finally, counting how many positive and negative reviews remain.
reviews = reviews.dropna()
reviews = reviews[reviews["Recommended"] != "Recommended"]
reviews["Recommended"].value_counts()

Recommended
True     623064
False    104108
Name: count, dtype: int64
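
The counts above point to a clearly imbalanced label distribution. A quick check of the proportions, using only the two numbers printed above (the variable names are illustrative), could look like this:

```python
# Counts copied from the value_counts() output above
positive = 623064
negative = 104108

total = positive + negative
positive_share = positive / total
negative_share = negative / total

print(f"total reviews:  {total}")
print(f"positive share: {positive_share:.1%}")
print(f"negative share: {negative_share:.1%}")
```

Roughly one review in seven is negative. This is worth keeping in mind if a classifier is later evaluated on the `Recommended` label, since always predicting `True` would already be right most of the time.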

In [17]:
# Removing special characters
def delete_specials(Review):
    # Remove every character that is neither a word character (\w: letters, digits and underscore, including non-ASCII letters) nor whitespace (\s)
    return re.sub(r'[^\w\s]', '', Review)

In [21]:
reviews["Review"] = reviews["Review"].apply(lambda x: delete_specials(x))
reviews

Unnamed: 0,Review,Recommended,Genre
0,spy sappin my dispensers,True,FPS
1,very nc better than overwatch shite game,True,FPS
2,it is good game,True,FPS
3,spy among us,True,FPS
4,Full of bots no updates except cosmetic crap U...,False,FPS
...,...,...,...
751844,allahı var,True,Strategy
751845,fun game,True,Strategy
751846,Its really good now with the new 160 patch Wev...,True,Strategy
751847,Its good its like COH2 but with new maps and b...,True,Strategy
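
To make the behaviour of the pattern `[^\w\s]` concrete, here is a small standalone sketch (the sample strings are made up for illustration). In Python 3, `\w` matches Unicode letters and digits plus the underscore, which is consistent with `allahı var` surviving in the output above. Deleting punctuation without a replacement also glues word parts together, so `It's` becomes `Its`.

```python
import re

def delete_specials(review):
    # Same pattern as in the notebook: drop everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', review)

print(delete_specials("It's really good!!!"))    # apostrophe and exclamation marks are removed
print(delete_specials("allahı var :)"))          # the non-ASCII letter is kept
print(delete_specials("well-made co_op game"))   # hyphen is removed, underscore is kept
```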


Fuzzy word matching (autocorrection) was considered at length, but the approach has two problems:

- Many gaming-specific words are simply not present in the dictionaries these libraries ship with, so some "important" words might be corrected into something completely different.
- It is computationally expensive, as some of these libraries point out themselves, and this was also observed when working with datasets above 100k entries.

This raises the question of whether autocorrection is really worth doing, especially on very large datasets. For now it is excluded, since it takes too much time to be a worthwhile investment.
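
The first problem can be illustrated with a small sketch. It does not use the `autocorrect` package from the commented-out import; instead it swaps in `difflib.get_close_matches` from the standard library as a stand-in fuzzy matcher. The toy vocabulary, the protected-word set and the function name `correct_word` are all hypothetical.

```python
from difflib import get_close_matches

def correct_word(word, vocabulary, protected=frozenset()):
    """Return the closest vocabulary entry, unless the word is already known or protected."""
    if word in protected or word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word

toy_vocabulary = ["mob", "game", "good"]   # hypothetical stand-in dictionary

print(correct_word("gaem", toy_vocabulary))                       # a typo is repaired
print(correct_word("moba", toy_vocabulary))                       # a genre term is "corrected" away
print(correct_word("moba", toy_vocabulary, protected={"moba"}))   # protected terms pass through
```

A protected list of this kind addresses the first problem but not the second: every unknown word still triggers a search over the whole vocabulary, and that search is where the running time goes.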

In [22]:
# Correcting misspelled words
#def autocorrector(Review):
#    return spell(Review)

In [23]:
#spell = Speller()
#reviews["Corrected_reviews"] = reviews["Review"].apply(lambda x: autocorrector(x))

The next step is to:

- Remove stopwords
- Lemmatize words
- Filter out any non-English words

The code blocks below do exactly that.

In [26]:
english_vocabulary = set(words.words()) # NLTK's English word list, stored as a set for fast membership checks
stop_words = set(stopwords.words("english")) # English stopwords to remove from the reviews, also stored as a set
lemmatize_words = WordNetLemmatizer()   # lemmatizer object; its .lemmatize() method reduces a word to its base form

In [27]:
def filter_words(Review):
    tokens = nltk.word_tokenize(Review) # split the review into word tokens
    # Keep only tokens whose lowercase form is in the NLTK word list and is not a stopword, then lemmatize the survivors
    filtered_text = [
        lemmatize_words.lemmatize(word.lower())
        for word in tokens
        if word.lower() in english_vocabulary and word.lower() not in stop_words
    ]
    return " ".join(filtered_text) # join the remaining tokens back into a single string

In [28]:
reviews["Review"] = reviews["Review"].apply(lambda x: filter_words(x))
reviews.head()

Unnamed: 0,Review,Recommended,Genre
0,spy,True,FPS
1,better overwatch game,True,FPS
2,good game,True,FPS
3,spy among u,True,FPS
4,full except cosmetic crap unbalanced pure garbage,False,FPS
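
Note that `filter_words` checks the vocabulary before lemmatizing. If the word list contains mostly base forms, inflected forms such as plurals are dropped before the lemmatizer ever sees them. This may be why a word like `dispensers` disappears in the output above, although that has not been verified here. The sketch below contrasts the two orderings with toy stand-ins: `toy_vocabulary`, `toy_stop_words` and `toy_lemmatize` are hypothetical replacements for the NLTK objects, so the example runs without any downloads.

```python
toy_vocabulary = {"spy", "dispenser", "game", "good"}   # base forms only (hypothetical)
toy_stop_words = {"my", "is", "it", "a"}                # hypothetical stopword list
toy_lemmas = {"dispensers": "dispenser", "games": "game"}

def toy_lemmatize(word):
    # Stand-in for WordNetLemmatizer.lemmatize: look up a base form, else return the word unchanged
    return toy_lemmas.get(word, word)

def filter_then_lemmatize(review):
    # Ordering used in the notebook: vocabulary check on the raw token, lemmatize afterwards
    tokens = review.lower().split()
    return " ".join(toy_lemmatize(t) for t in tokens
                    if t in toy_vocabulary and t not in toy_stop_words)

def lemmatize_then_filter(review):
    # Alternative ordering: lemmatize first, then check the base form against the vocabulary
    lemmas = (toy_lemmatize(t) for t in review.lower().split() if t not in toy_stop_words)
    return " ".join(lemma for lemma in lemmas if lemma in toy_vocabulary)

sample = "spy sappin my dispensers"
print(filter_then_lemmatize(sample))
print(lemmatize_then_filter(sample))
```

Whether the alternative ordering is preferable on the real data depends on how many inflected forms the NLTK word list actually contains, which would need to be checked separately.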


Lastly, rows whose review is now empty are removed and the cleaned dataset is saved to a CSV file.

In [29]:
reviews = reviews[reviews["Review"] != ""]
reviews

Unnamed: 0,Review,Recommended,Genre
0,spy,True,FPS
1,better overwatch game,True,FPS
2,good game,True,FPS
3,spy among u,True,FPS
4,full except cosmetic crap unbalanced pure garbage,False,FPS
...,...,...,...
751843,like major worked much fun opinion variety,True,Strategy
751845,fun game,True,Strategy
751846,really good new patch weve gotten new free new...,True,Strategy
751847,good like new better graphic,True,Strategy


In [30]:
reviews.to_csv("Cleaned_Steam_Reviews.csv", index = False)

For the genre classification, it is also worthwhile to remove words that immediately hint at the genre, for example "This is the best RPG" or "This is what a first person shooter is all about!". This step removes these words and creates a second training/test file.

In [36]:
# Words that hint at the genre or name a specific game, stored as a set for fast membership checks
genre_words = {"rpg", "role playing game", "overwatch", "role", "shooter", "shoot", "fps", "platformer", "coop", "moba", "fight", "fighter", "baldur", "dota", "tekken",
               "fighterz", "ori", "persona", "helldiver", "skyrim", "smite", "payday", "pseudoregalia", "mortal", "kombat", "guilty", "gear", "brawlhalla", "strategy", "warhammer"}

In [37]:
def filter_genre(Review):
    tokens = nltk.word_tokenize(Review) # split the review into word tokens
    filtered_text = [word for word in tokens if word.lower() not in genre_words]
    return " ".join(filtered_text)
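
Because `filter_genre` compares one token at a time, a multi-word entry such as `"role playing game"` can never match as a whole phrase. If whole phrases should be removed, one option is a regular expression with word boundaries. The following is a minimal sketch: the function name `remove_genre_terms` and the shortened term list are illustrative assumptions, not part of the notebook.

```python
import re

# Shortened, illustrative term list; the phrase entries are assumptions for this sketch
genre_terms = ["role playing game", "first person shooter", "rpg", "fps", "shooter"]

# Longest terms first, so that a phrase wins over a shorter term it contains
ordered = sorted(genre_terms, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b", re.IGNORECASE)

def remove_genre_terms(review):
    without_terms = pattern.sub(" ", review)
    # Collapse the whitespace left behind by the removed terms
    return " ".join(without_terms.split())

print(remove_genre_terms("best role playing game ever"))
print(remove_genre_terms("what a first person shooter is all about"))
print(remove_genre_terms("sharpshooter build is fun"))   # no match inside a longer word
```

The word boundaries keep a term from being removed when it only occurs inside a longer word, which mirrors the whole-token comparison of the original function.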

In [38]:
reviews["Review"] = reviews["Review"].apply(lambda x: filter_genre(x))
reviews.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reviews["Review"] = reviews["Review"].apply(lambda x: filter_genre(x))


Unnamed: 0,Review,Recommended,Genre
0,spy,True,FPS
1,better game,True,FPS
2,good game,True,FPS
3,spy among u,True,FPS
4,full except cosmetic crap unbalanced pure garbage,False,FPS
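
The warning printed above is pandas' `SettingWithCopyWarning`. A likely cause is that `reviews` was produced by boolean filtering in an earlier cell and a column is then assigned on that filtered result. One common way to avoid it is to take an explicit copy right after filtering. The sketch below shows the pattern on a tiny made-up DataFrame; the names `df` and `non_empty` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Review": ["good game", "", "best rpg"],
    "Genre": ["FPS", "FPS", "RPG"],
})

# .copy() makes the filtered result an independent DataFrame, so later column
# assignments are unambiguous and leave the original untouched
non_empty = df[df["Review"] != ""].copy()
non_empty["Review"] = non_empty["Review"].str.upper()

print(non_empty)
print(len(df), len(non_empty))
```

Applied to the notebook, this would mean appending `.copy()` to the filtering step that removes empty reviews, assuming that step is indeed where the warning originates.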


In [39]:
reviews = reviews[reviews["Review"] != ""]
reviews

Unnamed: 0,Review,Recommended,Genre
0,spy,True,FPS
1,better game,True,FPS
2,good game,True,FPS
3,spy among u,True,FPS
4,full except cosmetic crap unbalanced pure garbage,False,FPS
...,...,...,...
751843,like major worked much fun opinion variety,True,Strategy
751845,fun game,True,Strategy
751846,really good new patch weve gotten new free new...,True,Strategy
751847,good like new better graphic,True,Strategy


In [40]:
reviews.to_csv("Cleaned_Steam_Reviews_without_Genre.csv", index = False)