## Satisfactory Reviews - An NLP Project

Earlier this year, I discovered a wonderful game named Satisfactory. I immediately fell in love with the concept of landing on an unknown planet, where the only goal is to use the planet's resources to build the ultimate factory. With a huge map jam-packed with resources, aliens, and poisonous clouds (and lizard doggos!), one has to locate resources, build factories, and spend an inordinate amount of time just making the process more efficient.

*MUST INCREASE EFFICIENCY AT ALL COSTS*

As a data scientist (who willingly spent a ridiculous amount of time increasing the efficiency of everything), I needed to know what others had to say about this game. I built a web scraper that collects user reviews of Satisfactory from the Steam community website, and pulled around 6,000 reviews posted in the months since the last major update.

I wanted to focus on four things for each reviewer: the review text, whether they recommended the game, the length of the review, and how long they had played. Using this data, we'll be able to see which words are common across all reviews and determine whether the time played affects the recommendation.

My current hypothesis is thus:  

### *Players that log at least 50 hours in the game will recommend this game to others.* 



We'll start by importing the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [2]:
# reading in the dataframe
df = pd.read_csv(r"C:\Users\Katya\Documents\GitHub\SF-Project\Steam_Reviews_526870_20210803.csv")

# take a quick peek
df.head(10)

Unnamed: 0,SteamId,ProfileURL,ReviewText,Review,ReviewLength(Chars),PlayHours,DatePosted
0,7.65612E+16,https://steamcommunity.com/profiles/7656119884...,EARLY ACCESS REVIEW\r\n0yuyou9yfui9ftrygyg6ftd...,Recommended,59,36.6 hrs on record,Posted: August 3
1,7.65612E+16,https://steamcommunity.com/profiles/7656119829...,EARLY ACCESS REVIEW\r\nHm.I think it's a good ...,Recommended,40,24.3 hrs on record,Posted: August 3
2,7.65612E+16,https://steamcommunity.com/profiles/7656119816...,EARLY ACCESS REVIEW\r\nI uhh....... already fe...,Recommended,59,21.7 hrs on record,Posted: August 3
3,7.65612E+16,https://steamcommunity.com/profiles/7656119820...,"EARLY ACCESS REVIEW\r\nIts pretty cool, really...",Recommended,60,4.4 hrs on record,Posted: August 3
4,7.65612E+16,https://steamcommunity.com/profiles/7656119907...,EARLY ACCESS REVIEW\r\nvery good,Recommended,26,5.6 hrs on record,Posted: August 3
5,7.65612E+16,https://steamcommunity.com/profiles/7656119831...,EARLY ACCESS REVIEW\r\nThe game where u watch ...,Recommended,106,216.7 hrs on record,Posted: August 3
6,7.65612E+16,https://steamcommunity.com/profiles/7656119903...,EARLY ACCESS REVIEW\r\nIt's addicting!,Recommended,32,403.6 hrs on record,Posted: August 3
7,7.65612E+16,https://steamcommunity.com/profiles/7656119904...,EARLY ACCESS REVIEW\r\nThis game is already in...,Recommended,88,77.9 hrs on record,Posted: August 3
8,burakyilmaz,https://steamcommunity.com/id/burakyilmaz/,"EARLY ACCESS REVIEW\r\nIf you like automation,...",Recommended,213,393.0 hrs on record,Posted: August 3
9,Merleo,https://steamcommunity.com/id/Merleo/,EARLY ACCESS REVIEW\r\nAfter the fun i have be...,Recommended,875,76.9 hrs on record,Posted: August 3


The dataframe contains the following columns:

- `SteamId`: The Steam ID of the player
- `ProfileURL`: The link to the player's profile
- `ReviewText`: The text content of the review
- `Review`: Whether the player recommends the game or not
- `ReviewLength(Chars)`: The length of the review, in characters
- `PlayHours`: The number of hours played
- `DatePosted`: When the review was posted

Now that we have an idea of what the columns are, we'll go ahead with some basic cleaning. Since we're not focusing on the players' profiles in this project, we'll drop the `SteamId` and `ProfileURL` columns next.




In [3]:
# dropping user information
df.drop(['SteamId', 'ProfileURL'], axis=1, inplace=True)

With the user information removed, we'll start cleaning the `ReviewText` column by removing the "EARLY ACCESS REVIEW" text at the beginning of each review. When the reviews were scraped from the website, the review content box also included the "Early Access Review" label, which was collected along with the text.

In [4]:
# cleaning the review text column (removing the EARLY ACCESS REVIEW prefix)
# regex=False treats the pattern as a literal string on every pandas version
df['ReviewText'] = df['ReviewText'].str.replace('EARLY ACCESS REVIEW\r\n', '', regex=False)

# sanity check
df

Unnamed: 0,ReviewText,Review,ReviewLength(Chars),PlayHours,DatePosted
0,0yuyou9yfui9ftrygyg6ftdryujiouhyeasrtgf4r,Recommended,59,36.6 hrs on record,Posted: August 3
1,Hm.I think it's a good game,Recommended,40,24.3 hrs on record,Posted: August 3
2,I uhh....... already feel my body getting fatter,Recommended,59,21.7 hrs on record,Posted: August 3
3,"Its pretty cool, really slow however in the st...",Recommended,60,4.4 hrs on record,Posted: August 3
4,very good,Recommended,26,5.6 hrs on record,Posted: August 3
...,...,...,...,...,...
6175,It is such a complex grinding game :),Recommended,48,157.2 hrs on record,Posted: April 18
6176,"Terrible early access game, literally nothing ...",Recommended,63,394.7 hrs on record,Posted: April 18
6177,bean,Recommended,22,186.1 hrs on record,Posted: April 18
6178,This game tells me that i have too much free time,Recommended,57,143.8 hrs on record,Posted: April 18


Before going deeper into text cleaning with the NLTK library, a few other columns need attention. First, `DatePosted` has "Posted: " right before the date, which will be removed. Second, the `PlayHours` column contains "hrs on record" right after the number of hours. This text is not useful to us, so it will be removed as well.

In [5]:
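# cleaning DatePosted to have only the dates
df['DatePosted'] = df['DatePosted'].str.replace('Posted: ', '', regex=False)

# cleaning PlayHours to contain only the numerical values (still stored as strings)
df['PlayHours'] = df['PlayHours'].str.replace(' hrs on record', '', regex=False)

# renaming some columns
# note: rename() returns a new dataframe; without assigning the result (or passing inplace=True),
# df itself keeps the original 'ReviewLength(Chars)' name
df.rename(columns={'ReviewLength(Chars)':'CharLenOfReview'})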

Unnamed: 0,ReviewText,Review,CharLenOfReview,PlayHours,DatePosted
0,0yuyou9yfui9ftrygyg6ftdryujiouhyeasrtgf4r,Recommended,59,36.6,August 3
1,Hm.I think it's a good game,Recommended,40,24.3,August 3
2,I uhh....... already feel my body getting fatter,Recommended,59,21.7,August 3
3,"Its pretty cool, really slow however in the st...",Recommended,60,4.4,August 3
4,very good,Recommended,26,5.6,August 3
...,...,...,...,...,...
6175,It is such a complex grinding game :),Recommended,48,157.2,April 18
6176,"Terrible early access game, literally nothing ...",Recommended,63,394.7,April 18
6177,bean,Recommended,22,186.1,April 18
6178,This game tells me that i have too much free time,Recommended,57,143.8,April 18
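
Stripping the " hrs on record" suffix leaves `PlayHours` as text, so sorting and `max()` would still compare the values character by character. Below is a minimal sketch of turning the column into floats. The helper name `hours_to_float` is hypothetical, the sample values are made up, and the thousands-separator handling is an assumption about how large values might be formatted (none of the rows shown above contain one).

```python
import pandas as pd

def hours_to_float(hours: pd.Series) -> pd.Series:
    """Convert strings such as '36.6' or '1,204.5 hrs on record' to floats."""
    cleaned = (
        hours.astype(str)
        .str.replace(" hrs on record", "", regex=False)
        .str.replace(",", "", regex=False)  # assumed thousands separator
        .str.strip()
    )
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# small made-up sample, not rows from the scraped file
sample = pd.Series(["36.6", "995.1", "1,204.5 hrs on record", "n/a"])
sample_hours = hours_to_float(sample)
print(sample_hours.max())
```

On the real data this would be applied as `df['PlayHours'] = hours_to_float(df['PlayHours'])`, followed by a check of how many values ended up as NaN.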


Next we'll use the `map()` function to assign binary values (0, 1) to the `Review` column so that the data is more interpretable.

In [10]:
# assigning binary values for Review column (1 for Recommended, 0 for Not Recommended)
df['Review'] = df['Review'].map({'Recommended': 1, 'Not Recommended': 0})

#sanity check
df.head()

Unnamed: 0,ReviewText,Review,ReviewLength(Chars),PlayHours,DatePosted
0,0yuyou9yfui9ftrygyg6ftdryujiouhyeasrtgf4r,1,59,36.6,August 3
1,Hm.I think it's a good game,1,40,24.3,August 3
2,I uhh....... already feel my body getting fatter,1,59,21.7,August 3
3,"Its pretty cool, really slow however in the st...",1,60,4.4,August 3
4,very good,1,26,5.6,August 3
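
One thing to watch with `map()` and a dictionary: any value that is not a key in the dictionary silently becomes NaN. The sketch below, on made-up labels rather than the scraped data, shows one way to check for that. The stray trailing space is an invented example of a label that would fail to match.

```python
import pandas as pd

# made-up labels for illustration; the last one has a stray trailing space
labels = pd.Series(["Recommended", "Not Recommended", "Recommended "])
mapping = {"Recommended": 1, "Not Recommended": 0}

mapped = labels.map(mapping)

# map() returns NaN for any value that is not a key in the dictionary
unmapped = labels[mapped.isna()]
print(f"{len(unmapped)} label(s) did not match the mapping")

# stripping whitespace first makes the mapping more forgiving
mapped_clean = labels.str.strip().map(mapping)
```

The same behavior explains why re-running the mapping cell on a column that has already been converted turns every value into NaN: the integers 1 and 0 are not keys in the dictionary.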


Let's find out how many players do and don't recommend the game, and what share of all reviews each group represents.

In [28]:
# count the reviews in each class (Review is 1 for Recommended, 0 for Not Recommended)
total_recommended = (df['Review'] == 1).sum()
total_not_recommended = (df['Review'] == 0).sum()

print(f"There are {total_recommended} players that recommend the game, and {total_not_recommended} that don't recommend this game.")
print("*" * 50)
print(f"Those that recommend the game account for {total_recommended / len(df) * 100:.2f}% of the total.")
print(f"People who don't recommend the game account for {total_not_recommended / len(df) * 100:.2f}% of the total.")

There are 5985 players that recommend the game, and 195 that don't recommend this game.
**************************************************
Those that recommend the game account for 96.84% of the total.
People who don't recommend the game account for 3.16% of the total.
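
The same counts and shares can also be obtained with `value_counts()`, which may be handy if more classes are ever added. This is a small sketch on a made-up column standing in for `df['Review']`; the `summary` frame and its column names are only illustrative.

```python
import pandas as pd

# made-up binary column standing in for df['Review']
review = pd.Series([1, 1, 1, 0], name="Review")

counts = review.value_counts()                # absolute counts per class
shares = review.value_counts(normalize=True)  # fractions that sum to 1

# combine both views into one small table
summary = pd.DataFrame({"count": counts, "share_pct": (shares * 100).round(2)})
print(summary)
```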


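Next, we'll look at the longest play time recorded in each group.

In [8]:
# Review holds 1/0 after the mapping above, so filter on 1 for Recommended
df_recom = df.loc[df['Review'] == 1]

# PlayHours is still a string column here, so max() compares text character by character
# (e.g. '96.8' > '150.0'), not numeric values
df_recom['PlayHours'].max()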

'995.1'

In [9]:
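# Review holds 1/0 after the mapping above, so filter on 0 for Not Recommended
df_norecom = df.loc[df['Review'] == 0]

# as above, this is the lexicographic maximum of a string column
df_norecom['PlayHours'].max()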

'96.8'
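
The maxima above come back in quotes, which means `PlayHours` is still text and the values are the alphabetical maxima rather than the longest play times. To test the 50-hour hypothesis, the column needs to be numeric first. Here is a rough sketch of that comparison on a made-up toy dataframe (the rows, the `toy` name, and the `AtLeast50h` column are all invented for illustration, and this is only one of several ways to set up the comparison).

```python
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    "Review": [1, 1, 0, 1, 1, 0],
    "PlayHours": ["12.0", "75.5", "3.2", "210.0", "64.1", "49.9"],
})

# convert to floats so that comparisons are numeric rather than lexicographic
toy["PlayHours"] = pd.to_numeric(toy["PlayHours"], errors="coerce")

# flag players at or above the 50-hour threshold from the hypothesis
toy["AtLeast50h"] = toy["PlayHours"] >= 50

# the mean of a 0/1 column is the share of recommendations in each group
rate_by_group = toy.groupby("AtLeast50h")["Review"].mean()
print(rate_by_group)
```

On the real data, with only 195 of the 6180 reviews not recommending the game, the two groups will be heavily imbalanced, so the group sizes are worth reporting alongside the rates.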