# Steam/Metacritic Review Analysis - by Josh McGiff
## Introduction

This Jupyter notebook is designed for a lecture that guides students through the analysis of Steam game reviews using Natural Language Processing (NLP) techniques. The notebook covers various steps involved in the analysis, including data retrieval, Named Entity Recognition (NER), visualisation of entities, and exploring features of reviews through NLP methods. Additionally, it introduces the use of machine learning models for sentiment analysis and sexism detection in reviews.



### Retrieving Review Data

To begin the analysis, we need to retrieve review data from the Steam platform using their API. We'll define functions to fetch reviews based on an AppID and filter.


In [None]:
import requests
import pandas as pd

def get_review_data(app_id, params={'json': 1}):
    base_url = 'https://store.steampowered.com/appreviews/'
    response = requests.get(url=base_url + str(app_id), params=params)
    return response.json()


In [None]:

def get_n_reviews(app_id, filter, n=1000):
    reviews = []
    cursor = '*'
    base_url = 'https://store.steampowered.com/appreviews/'

    while n > 0:
        params = {
            'json': 1,
            'filter': filter,  
            'language': 'english',
            'day_range': 9223372036854775807,
            'review_type': 'all',
            'purchase_type': 'all',
            'cursor': cursor.encode(),
            'num_per_page': min(100, n)
        }

        response = requests.get(url=base_url + str(app_id), params=params)
        data = response.json()

        cursor = data['cursor']
        reviews += data['reviews']
        n -= min(100, n)

        if len(data['reviews']) < 100:
            break

    return reviews
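
The cursor-based loop above can be tried without touching the network. The sketch below is illustrative only: `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake page function merely imitates the `reviews` and `cursor` fields that the notebook reads from the Steam response.

```python
def fetch_all(fetch_page, n, page_size=100):
    """Collect up to n items by following the cursor returned with each page."""
    items, cursor = [], '*'
    while len(items) < n:
        want = min(page_size, n - len(items))
        page = fetch_page(cursor, want)
        items += page['reviews']
        cursor = page['cursor']
        if len(page['reviews']) < want:  # a short page means nothing is left
            break
    return items


# Hypothetical stand-in for the HTTP call: serves 250 fake reviews in pages.
FAKE_REVIEWS = [{'review': f'review {i}'} for i in range(250)]


def fake_fetch_page(cursor, count):
    start = 0 if cursor == '*' else int(cursor)
    chunk = FAKE_REVIEWS[start:start + count]
    return {'reviews': chunk, 'cursor': str(start + len(chunk))}


print(len(fetch_all(fake_fetch_page, n=1000)))  # 250 (the source ran out)
print(len(fetch_all(fake_fetch_page, n=120)))   # 120 (stopped at the limit)
```

Passing the page function in as an argument keeps the paging logic separate from the request itself, which makes the stopping conditions easy to check.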


# 🔎 Choose a Steam product to analyse by searching the platform. Once a selection is made, extract the AppID from the URL of the page:
- The URL format is: https://store.steampowered.com/app/[EXTRACT THIS APPID]/GameTitle/
- e.g. 292120 from https://store.steampowered.com/app/292120/FINAL_FANTASY_XIII/
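
If you would rather not copy the number by hand, the AppID can be pulled out of the URL with a regular expression. This is a minimal sketch; `extract_app_id` is a hypothetical helper, not part of the notebook.

```python
import re


def extract_app_id(url):
    """Return the numeric AppID from a Steam store URL, or None if absent."""
    match = re.search(r'store\.steampowered\.com/app/(\d+)', url)
    return int(match.group(1)) if match else None


url = 'https://store.steampowered.com/app/292120/FINAL_FANTASY_XIII/'
print(extract_app_id(url))  # 292120
```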

In [None]:
app_id = 1635590

reviews = []
reviews += (get_n_reviews(app_id, 'all'))
print(reviews)

### Named Entity Recognition (NER)

Before diving into the analysis, it's essential to understand the entities mentioned in the reviews. We'll use spaCy's NER model to extract named entities from the reviews and visualise them.


- Install required Named Entity Recognition model from the spaCy library

In [None]:
# Only need to run once!:
# !python -m spacy download en_core_web_sm

Extract the named entities from the reviews and output them to [app_id].csv

In [None]:
import re
import spacy
import pandas as pd

# Extract the review text and the corresponding upvote status
df = pd.DataFrame(reviews)[['review', 'voted_up']]

# Load the English language model for spaCy
nlp = spacy.load("en_core_web_sm")

# Function to clean text: removes non-alphanumeric characters and extra whitespaces
def clean_text(text):
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove non-alphanumeric characters
    cleaned_text = ' '.join(cleaned_text.split())      # Remove extra whitespaces including newlines
    return cleaned_text.strip()

# Clean the text in the 'review' column and apply Named Entity Recognition (NER)
def process_review(review):
    processed_review = clean_text(review)
    doc = nlp(processed_review.strip())                 # Strip to remove leading/trailing whitespaces
    entities = [{'Entity': ent.text, 'Label': ent.label_} for ent in doc.ents]
    return processed_review, entities

# Apply text cleaning and NER to each review in the DataFrame
df['review'], df['Entities'] = zip(*df['review'].apply(process_review))

# Write the DataFrame to a CSV file
csv_filename = str(app_id) + '.csv'
df.to_csv(csv_filename, index=False)

print('DataFrame with entities saved to', csv_filename)
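
One thing to be aware of: a list of dictionaries does not survive a CSV round trip as a list. It is written as its string representation and has to be parsed when read back. The sketch below shows this with the standard-library `csv` module and made-up data; pandas `to_csv` stores object columns as strings in the same way.

```python
import ast
import csv
import io

# Made-up row in the same shape as the notebook's DataFrame
rows = [{'review': 'Great port of Final Fantasy',
         'Entities': [{'Entity': 'Final Fantasy', 'Label': 'WORK_OF_ART'}]}]

# Write to an in-memory CSV; the list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['review', 'Entities'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the 'Entities' cell is now plain text
buffer.seek(0)
raw = list(csv.DictReader(buffer))[0]['Entities']
print(type(raw).__name__)  # str

# ast.literal_eval parses Python literals only, so it cannot run arbitrary code
entities = ast.literal_eval(raw)
print(entities[0]['Entity'])  # Final Fantasy
```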


Visualise the most frequently occurring entities with a bar chart:

In [None]:
import ast
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the DataFrame containing the entities
df = pd.read_csv(csv_filename)

# The 'Entities' column is stored as text in the CSV, so parse each cell back into a list.
# ast.literal_eval only accepts Python literals, which is safer than eval.
all_entities = [entity['Entity'] for entities in df['Entities'] for entity in ast.literal_eval(entities)]

# Create a DataFrame to count the occurrences of each entity
entity_counts = pd.Series(all_entities).value_counts().reset_index()
entity_counts.columns = ['Entity', 'Count']

# Plot the top N most common entities
top_n = 10  # Change this value to visualize more or fewer entities
top_entities = entity_counts.head(top_n)

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x='Count', y='Entity', data=top_entities, palette='viridis')
plt.xlabel('Count')
plt.ylabel('Entity')
plt.title(f'Top {top_n} Most Common Entities')
plt.tight_layout()
plt.show()


Visualise the most frequently occurring entities with a word cloud:

In [None]:
from collections import Counter
from wordcloud import WordCloud

# Count whole entities so that multi-word names (e.g. "Final Fantasy") stay intact
entity_frequencies = Counter(all_entities)

# Generate a word cloud sized by entity frequency
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(entity_frequencies)

# Plot the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Entity Word Cloud')
plt.show()


Visualise the most frequently occurring entity types with a bar chart:

In [None]:
import ast
import pandas as pd
import matplotlib.pyplot as plt

# Load the DataFrame containing entities
df = pd.read_csv(csv_filename)

# Parse the stored entity lists once, then flatten entities and labels into separate lists
parsed_entities = [ast.literal_eval(entities) for entities in df['Entities']]
all_entities = [entity['Entity'] for entities in parsed_entities for entity in entities]
all_labels = [entity['Label'] for entities in parsed_entities for entity in entities]

# Count the occurrences of each label
label_counts = pd.Series(all_labels).value_counts()

# Plotting
plt.figure(figsize=(10, 6))
label_counts.plot(kind='bar')  # one bar per label; there is nothing to stack for a single Series
plt.xlabel('Entity Label')
plt.ylabel('Count')
plt.title('Entity Label Counts')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


**Analysis using NER alone gives only a limited view. Contextually informative text, such as adjectives, is highly valuable when building a fuller picture of how a product is perceived from its reviews.**

**Therefore, let's try to investigate various features of the reviews:**

In [None]:
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt


# Read the CSV file into a DataFrame
df = pd.read_csv(csv_filename)

# Convert the 'review' column to strings
df['review'] = df['review'].astype(str)

# Combine all the reviews into a single string
combined_reviews = ' '.join(df['review'])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(combined_reviews)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


In [None]:
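import nltk
from nltk import FreqDist, ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

# Download the tokenizer and stopword resources (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')   # newer NLTK releases look for this resource name
nltk.download('stopwords')   # required by stopwords.words('english') below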


# Tokenize the text
tokens = word_tokenize(combined_reviews)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Generate bigrams
bigrams = list(ngrams(filtered_tokens, 2))

# No term filter is applied yet, so every bigram is kept. To focus on specific
# terms (e.g. 'protagonist'), add a condition to this comprehension.
specific_term_bigrams = [bigram for bigram in bigrams]

# Calculate the frequency distribution of the bigrams
specific_term_bigram_freq = FreqDist(specific_term_bigrams)

# Extract the 20 most common bigrams
most_common_specific_term_bigrams = specific_term_bigram_freq.most_common(20)

# Extract bigrams and their frequencies
bigram, freq = zip(*most_common_specific_term_bigrams)

# Plot the bar plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(bigram)), freq, color='skyblue')
plt.yticks(range(len(bigram)), [' '.join(b) for b in bigram])  # Set bigram labels
plt.xlabel('Frequency')
plt.ylabel('Bigram')
plt.title('Most Frequently Occurring Bigrams')
plt.gca().invert_yaxis()  # Invert y-axis to display the most frequent bigrams at the top
plt.show()
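
The cell above counts every bigram. To narrow the chart to particular terms such as 'protagonist', one option is to keep only the bigrams that contain at least one target word. The sketch below uses plain Python and made-up tokens; `target_terms` is a hypothetical choice.

```python
from collections import Counter

# Made-up tokens standing in for filtered_tokens
tokens = ['main', 'character', 'feels', 'flat', 'but', 'the', 'protagonist',
          'grows', 'and', 'main', 'character', 'shines']

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))

# Hypothetical terms of interest
target_terms = {'character', 'protagonist'}

# Keep a bigram if it shares at least one word with the target terms
filtered = [bigram for bigram in bigrams if target_terms & set(bigram)]

counts = Counter(filtered)
print(counts.most_common(1))  # [(('main', 'character'), 2)]
```

The same condition can be dropped into the list comprehension in the notebook cell, since `ngrams` also yields tuples of words.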

In [None]:
import nltk
from nltk import FreqDist, ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt


# Tokenize the text
tokens = word_tokenize(combined_reviews)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Generate trigrams
trigrams = list(ngrams(filtered_tokens, 3))

# No term filter is applied yet, so every trigram is kept. To focus on specific
# terms (e.g. 'protagonist'), add a condition to this comprehension.
specific_term_trigrams = [trigram for trigram in trigrams]

# Calculate the frequency distribution of the trigrams
specific_term_trigram_freq = FreqDist(specific_term_trigrams)

# Extract the 20 most common trigrams
most_common_specific_term_trigrams = specific_term_trigram_freq.most_common(20)

# Extract trigrams and their frequencies
trigram, freq = zip(*most_common_specific_term_trigrams)

# Plot the bar plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(trigram)), freq, color='skyblue')
plt.yticks(range(len(trigram)), [' '.join(t) for t in trigram])  # Set trigram labels
plt.xlabel('Frequency')
plt.ylabel('Trigram')
plt.title('Most Frequently Occurring Trigrams')
plt.gca().invert_yaxis()  # Invert y-axis to display the most frequent trigrams at the top
plt.show()


**While analysing all the reviews together may yield insights into the general perception of the chosen product, important information can be hidden when one class of review dominates the sample.**

**Imagine that you are the developer of the product being reviewed, and the game has 80% positive and 20% negative reviews. Selecting the subset that contains only the negative reviews could help to highlight the reasons for the poor reception. This could enable developers to fix bug-related issues or adopt product-specific changes that could improve future products.**
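
As a small illustration of that split, the sketch below separates a handful of made-up reviews by their `voted_up` flag (the field the notebook reads from the Steam response) and reports the negative share.

```python
# Made-up reviews in the same shape as the Steam response entries
sample_reviews = [
    {'review': 'Loved every minute', 'voted_up': True},
    {'review': 'Runs well on my laptop', 'voted_up': True},
    {'review': 'Crashes on launch', 'voted_up': False},
    {'review': 'Great soundtrack', 'voted_up': True},
    {'review': 'Solid sequel', 'voted_up': True},
]

# Keep only the reviews that did not recommend the game
negative_reviews = [r for r in sample_reviews if not r['voted_up']]

negative_share = len(negative_reviews) / len(sample_reviews)
print(f'{negative_share:.0%} negative')       # 20% negative
print([r['review'] for r in negative_reviews])  # ['Crashes on launch']
```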

In [None]:
app_id = 1635590

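# Steam documents `filter` as an ordering option ('recent', 'updated', 'all'),
# so 'negative' is not a valid value for it. Fetch as before, then keep only
# the reviews that were not recommended (voted_up is False).
reviews = [review for review in get_n_reviews(app_id, 'all') if not review['voted_up']]
print(len(reviews), 'negative reviews retrieved')
print(reviews)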

In [None]:
import re
import spacy
import pandas as pd

# Extract the review text and the corresponding upvote status
df = pd.DataFrame(reviews)[['review', 'voted_up']]

# Load the English language model for spaCy
nlp = spacy.load("en_core_web_sm")

# Function to clean text: removes non-alphanumeric characters and extra whitespaces
def clean_text(text):
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove non-alphanumeric characters
    cleaned_text = ' '.join(cleaned_text.split())      # Remove extra whitespaces including newlines
    return cleaned_text.strip()

# Clean the text in the 'review' column and apply Named Entity Recognition (NER)
def process_review(review):
    processed_review = clean_text(review)
    doc = nlp(processed_review.strip())                 # Strip to remove leading/trailing whitespaces
    entities = [{'Entity': ent.text, 'Label': ent.label_} for ent in doc.ents]
    return processed_review, entities

# Apply text cleaning and NER to each review in the DataFrame
df['review'], df['Entities'] = zip(*df['review'].apply(process_review))

# Write the DataFrame to a CSV file
csv_filename = str(app_id) + '.csv'
df.to_csv(csv_filename, index=False)

print('DataFrame with entities saved to', csv_filename)


Visualise the most frequently occurring entities with a bar chart:

In [None]:
import ast
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the DataFrame containing the entities
df = pd.read_csv(csv_filename)

# The 'Entities' column is stored as text in the CSV, so parse each cell back into a list.
# ast.literal_eval only accepts Python literals, which is safer than eval.
all_entities = [entity['Entity'] for entities in df['Entities'] for entity in ast.literal_eval(entities)]

# Create a DataFrame to count the occurrences of each entity
entity_counts = pd.Series(all_entities).value_counts().reset_index()
entity_counts.columns = ['Entity', 'Count']

# Plot the top N most common entities
top_n = 10  # Change this value to visualise more or fewer entities
top_entities = entity_counts.head(top_n)

# Plotting
plt.figure(figsize=(10, 6))
sns.barplot(x='Count', y='Entity', data=top_entities, palette='viridis')
plt.xlabel('Count')
plt.ylabel('Entity')
plt.title(f'Top {top_n} Most Common Entities')
plt.tight_layout()
plt.show()


Visualise the most frequently occurring entities with a word cloud:

In [None]:
from collections import Counter
from wordcloud import WordCloud

# Count whole entities so that multi-word names (e.g. "Final Fantasy") stay intact
entity_frequencies = Counter(all_entities)

# Generate a word cloud sized by entity frequency
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(entity_frequencies)

# Plot the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Entity Word Cloud')
plt.show()


Visualise the most frequently occurring entity types with a bar chart:

In [None]:
import ast
import pandas as pd
import matplotlib.pyplot as plt

# Load the DataFrame containing entities
df = pd.read_csv(csv_filename)

# Parse the stored entity lists once, then flatten entities and labels into separate lists
parsed_entities = [ast.literal_eval(entities) for entities in df['Entities']]
all_entities = [entity['Entity'] for entities in parsed_entities for entity in entities]
all_labels = [entity['Label'] for entities in parsed_entities for entity in entities]

# Count the occurrences of each label
label_counts = pd.Series(all_labels).value_counts()

# Plotting
plt.figure(figsize=(10, 6))
label_counts.plot(kind='bar')  # one bar per label; there is nothing to stack for a single Series
plt.xlabel('Entity Label')
plt.ylabel('Count')
plt.title('Entity Label Counts')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


Analysis using NER alone gives only a limited view. Contextually informative text, such as adjectives, is highly valuable when building a fuller picture of how a product is perceived from its reviews.

Therefore, let's try to investigate various features of the reviews:

### Exploring Review Features

Understanding various features of reviews beyond just entities is crucial. We'll explore features like word frequency, bigrams, and trigrams to gain deeper insights into the reviews.
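
For intuition about what bigram and trigram counting involves, the sketch below does it with the standard library alone. `ngram_counts` is a hypothetical helper, and the sentence is made up; the notebook cells that follow use NLTK for the same job.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list: tokens[0:], tokens[1:], ...
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


tokens = 'the game crashes a lot the game crashes on launch'.split()

print(ngram_counts(tokens, 1).most_common(1))  # [(('the',), 2)]
print(ngram_counts(tokens, 2).most_common(1))  # [(('the', 'game'), 2)]
print(ngram_counts(tokens, 3).most_common(1))  # [(('the', 'game', 'crashes'), 2)]
```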


In [None]:
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt


# Read the CSV file into a DataFrame
df = pd.read_csv(csv_filename)

# Convert the 'review' column to strings
df['review'] = df['review'].astype(str)

# Combine all the reviews into a single string
combined_reviews = ' '.join(df['review'])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(combined_reviews)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


In [None]:
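import nltk
from nltk import FreqDist, ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

# Download the tokenizer and stopword resources (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')   # newer NLTK releases look for this resource name
nltk.download('stopwords')   # required by stopwords.words('english') below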


# Tokenize the text
tokens = word_tokenize(combined_reviews)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Generate bigrams
bigrams = list(ngrams(filtered_tokens, 2))

# No term filter is applied yet, so every bigram is kept. To focus on specific
# terms (e.g. 'protagonist'), add a condition to this comprehension.
specific_term_bigrams = [bigram for bigram in bigrams]

# Calculate the frequency distribution of the bigrams
specific_term_bigram_freq = FreqDist(specific_term_bigrams)

# Extract the 20 most common bigrams
most_common_specific_term_bigrams = specific_term_bigram_freq.most_common(20)

# Extract bigrams and their frequencies
bigram, freq = zip(*most_common_specific_term_bigrams)

# Plot the bar plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(bigram)), freq, color='skyblue')
plt.yticks(range(len(bigram)), [' '.join(b) for b in bigram])  # Set bigram labels
plt.xlabel('Frequency')
plt.ylabel('Bigram')
plt.title('Most Frequently Occurring Bigrams')
plt.gca().invert_yaxis()  # Invert y-axis to display the most frequent bigrams at the top
plt.show()

In [None]:
import nltk
from nltk import FreqDist, ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

# Tokenize the text
tokens = word_tokenize(combined_reviews)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Generate trigrams
trigrams = list(ngrams(filtered_tokens, 3))

# No term filter is applied yet, so every trigram is kept. To focus on specific
# terms (e.g. 'protagonist'), add a condition to this comprehension.
specific_term_trigrams = [trigram for trigram in trigrams]

# Calculate the frequency distribution of the trigrams
specific_term_trigram_freq = FreqDist(specific_term_trigrams)

# Extract the 20 most common trigrams
most_common_specific_term_trigrams = specific_term_trigram_freq.most_common(20)

# Extract trigrams and their frequencies
trigram, freq = zip(*most_common_specific_term_trigrams)

# Plot the bar plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(trigram)), freq, color='skyblue')
plt.yticks(range(len(trigram)), [' '.join(t) for t in trigram])  # Set trigram labels
plt.xlabel('Frequency')
plt.ylabel('Trigram')
plt.title('Most Frequently Occurring Trigrams')
plt.gca().invert_yaxis()  # Invert y-axis to display the most frequent trigrams at the top
plt.show()


### 🌟 Bonus section: Sentiment Analysis

Sentiment analysis helps in understanding the overall sentiment expressed in reviews. We'll utilise a pre-trained sentiment analysis model to classify reviews as positive or negative.


**The previous sections use a wide range of NLP and text analysis methods. However, they do not cover many of the more frequently used machine learning methods, such as sentiment analysis, text summarisation and machine translation.**

What if we wanted to compare the rating (i.e. positive/negative) of a review with the results of a sentiment analysis model? Is it possible that reviewers gave a positive recommendation, even if their written review was predominantly negative?
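
One way to frame that comparison is to line up each review's `voted_up` flag with a predicted label and list the disagreements. The sketch below uses made-up reviews and hypothetical model outputs; the `'POSITIVE'`/`'NEGATIVE'` label names are an assumption and depend on whichever sentiment model is used.

```python
# Made-up reviews in the same shape as the Steam response entries
sample_reviews = [
    {'review': 'Buggy mess but I keep playing', 'voted_up': True},
    {'review': 'Fantastic story', 'voted_up': True},
    {'review': 'Refunded after an hour', 'voted_up': False},
]

# Hypothetical sentiment labels, one per review, in the same order
predicted_labels = ['NEGATIVE', 'POSITIVE', 'NEGATIVE']

# A mismatch is a recommendation that disagrees with the predicted sentiment
mismatches = [
    review['review']
    for review, label in zip(sample_reviews, predicted_labels)
    if review['voted_up'] != (label == 'POSITIVE')
]

print(mismatches)  # ['Buggy mess but I keep playing']
```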

What if we wanted to investigate the number of reviews that engage in hate speech? Movies, games and shows are frequently "review-bombed" by masses of users due to discriminatory views. Filtering your analysis of reviews to exclude discriminatory content could provide more genuine and informative feedback on the product.

Interesting and concerning links:
- "How Captain Marvel and Brie Larson beat the internet’s sexist trolls" - https://www.vox.com/culture/2019/3/8/18254584/captain-marvel-boycott-controversy
- "Marvel's Eternals Gets Review Bombed for LGBTQ+ Relationship" - https://thedirect.com/article/marvel-eternals-reviews-bombed-lgbtq-relationship





As a proof of concept, let's use a sexism detection model from Hugging Face (https://huggingface.co/NLP-LTU/bertweet-large-sexism-detector). Note that the code below loads a different checkpoint, `NLP-LTU/distilbert-sexism-detector`.

In [None]:
# Import necessary modules from the transformers library
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Import the PyTorch library
import torch

# Load the pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained('NLP-LTU/distilbert-sexism-detector')

# Load the tokenizer associated with the pre-trained model
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Create a text classification pipeline using the pre-trained model and tokenizer
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)


In [None]:
def preprocess_text(text, max_len=512):
    # Tokenize the text
    tokens = tokenizer.encode(text, add_special_tokens=True)
    
    # Truncate to the desired length (no padding is applied)
    tokens = tokens[:max_len]

    return tokens
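
Slicing the ID list also cuts off the closing special token when a review is too long. That does no harm in this notebook, because the IDs are decoded back to text before classification, but if truncated IDs were passed to a model directly the closing token would need to be kept. A minimal sketch, assuming the IDs 101 and 102 stand for the opening and closing tokens (an illustrative assumption; `truncate_ids` is a hypothetical helper):

```python
def truncate_ids(ids, max_len):
    """Truncate a token-ID list to max_len while keeping the final (closing) token."""
    if len(ids) <= max_len:
        return list(ids)
    # Reserve the last slot for the closing token
    return list(ids[:max_len - 1]) + [ids[-1]]


ids = [101, 7, 8, 9, 10, 102]   # 101/102 assumed to be the opening/closing tokens
print(truncate_ids(ids, 4))     # [101, 7, 8, 102]
print(truncate_ids(ids, 10))    # [101, 7, 8, 9, 10, 102]
```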

In [None]:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_filename)

# Reviews that were empty after cleaning are read back as NaN, so coerce them to strings
reviews = df['review'].fillna('').astype(str).tolist()

# Truncate each review to at most max_len tokens
max_len = 511
processed_reviews = [preprocess_text(review, max_len) for review in reviews]

# Decode the truncated token IDs back to text with the same tokenizer that encoded them
decoded_reviews = [tokenizer.decode(tokens, skip_special_tokens=True) for tokens in processed_reviews]


In [None]:
# Apply the classifier to each review in the list
predictions = [classifier(review) for review in decoded_reviews]

In [None]:
# Initialise an empty list to store reviews identified as sexist
sexistReviews = []

# Iterate through each review and its corresponding prediction
for review, prediction in zip(reviews, predictions):
    # Check if the predicted label for the review is "sexist"
    if prediction[0]["label"] == "sexist":
        # Print the review
        print(f"Review: {review}")
        # Append the review to the list of sexist reviews
        sexistReviews.append(review)

# Print the list of sexist reviews
print(sexistReviews)
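
Each pipeline prediction also carries a confidence score, which can be used to flag only confident predictions and to produce the filtered set of reviews for further analysis. The sketch below uses made-up predictions in the pipeline's list-of-dicts shape; the `'not sexist'` label name and the 0.8 threshold are assumptions for illustration.

```python
# Placeholder reviews with hypothetical predictions, one list per review
sample_reviews = ['review A', 'review B', 'review C']
sample_predictions = [
    [{'label': 'not sexist', 'score': 0.98}],
    [{'label': 'sexist', 'score': 0.91}],
    [{'label': 'sexist', 'score': 0.55}],
]

THRESHOLD = 0.8  # assumed cut-off; tune it on labelled examples


def is_flagged(prediction, threshold=THRESHOLD):
    """True when the top prediction is 'sexist' with enough confidence."""
    top = prediction[0]
    return top['label'] == 'sexist' and top['score'] >= threshold


pairs = list(zip(sample_reviews, sample_predictions))
flagged = [review for review, prediction in pairs if is_flagged(prediction)]
kept = [review for review, prediction in pairs if not is_flagged(prediction)]

print(flagged)  # ['review B']
print(kept)     # ['review A', 'review C']
print(f'{len(flagged) / len(sample_reviews):.1%} flagged')  # 33.3% flagged
```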


### Conclusion

Through this notebook, students will gain hands-on experience in analysing text data, extracting insights, and applying machine learning models to understand user sentiments and identify problematic content in reviews.
