# Dark Souls II Reviews (2023)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

## Steam Reviews as of 12/18/23:

In [3]:
df = pd.read_csv('reviews.csv')
reviews = df.copy()
reviews = reviews.set_index('recommendationid')
reviews.drop(columns=['Unnamed: 0'], inplace=True)
reviews

recommendationid,review,init_date,update_date,in_early_access,voted_up
161822555,I know Dark Souls 2 gets a lot of hate but per...,1711822731,1711822731,False,False
161819114,If you're expecting it to be like the original...,1711819586,1711819586,False,True
161818428,shit.,1711819000,1711819000,False,False
161816145,,1711816954,1711816954,False,True
161816097,genuinely bad,1711816908,1711816908,False,False
...,...,...,...,...,...
15162268,Try tongue but hole,1427932431,1428081346,False,True
15162220,"So far so good, played it for 20mins so far wi...",1427932153,1427932153,False,True
15162161,Still haven't died!\n\nBonedrinker Rufus 109 -...,1427931845,1427931845,False,True
15162057,Needs more cow Bell\n\n10/10,1427931196,1427931196,False,True


Extracting just the date text from DatePosted:

In [None]:
dates = [re.compile(r'Posted:|,').sub(' ', date).strip() for date in reviews.DatePosted]
dates = [re.compile(r' +').sub(' ', date).strip() for date in dates]
reviews.DatePosted = dates
reviews[['DatePosted']].head()
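
As a rough illustration of what the two substitutions above do, here is a minimal sketch on made-up strings. The sample values and the helper name `clean_date` are assumptions for illustration only; the real DatePosted format may differ.

```python
import re

def clean_date(raw):
    # Replace the "Posted:" label and any commas with a space
    no_label = re.sub(r'Posted:|,', ' ', raw)
    # Collapse runs of spaces and trim the ends
    return re.sub(r' +', ' ', no_label).strip()

# Hypothetical raw values, not taken from the dataset
sample_dates = ["Posted: 18 December, 2023", "Posted: 3 March"]
cleaned = [clean_date(d) for d in sample_dates]
```

The second sample has no year, which is the kind of entry that later ends up without a value in the Year column.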

Separating the year out of the 'DatePosted' column:
- Some entries don't have a year (will investigate later)

In [None]:
reviews['Date'] = pd.to_datetime(reviews.DatePosted, errors='coerce')
reviews['Year'] = reviews['Date'].dt.strftime('%Y')
reviews.drop(columns=['DatePosted', 'Date'], inplace=True)
reviews

In [None]:
reviews.Year.isna().sum()
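
One possible way to investigate the entries without a year is to keep the raw strings next to the parsed result and look at the rows where parsing failed. This is only a sketch: the sample strings and the explicit `format` are assumptions, and it assumes an English locale for month names.

```python
import pandas as pd

# Hypothetical cleaned date strings; the last one has no year
raw_dates = pd.Series(["18 December 2023", "2 April 2015", "3 March"])

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw_dates, format="%d %B %Y", errors="coerce")
years = parsed.dt.strftime("%Y")

# Raw strings that could not be parsed, for manual inspection
unparsed = raw_dates[parsed.isna()]
```

Inspecting `unparsed` before dropping the raw column shows whether the missing years follow one pattern or several.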

Cleaning up the reviews:

In [None]:
from nltk.corpus import stopwords
from textblob import TextBlob

- Fixing misspellings:

In [None]:
from autocorrect import Speller

spell = Speller(lang='en')

In [None]:
# r = [spell(review) if pd.notna(review) else review for review in reviews.Review]

- Removing anything that's not a letter (urls, esc seqs, etc.):

In [None]:
# Removing urls:
r = [re.sub(r'http\S+', '', review).lower().strip() if pd.notna(review) else review for review in reviews.Review]

# Removing esc sequences, punctuation, and numbers:
    # There's some ASCII art in some of the reviews
r = [re.sub(r'[^A-Za-z]', ' ', review).strip() if pd.notna(review) else review for review in r]

# Removing stop words (joined with '|' so the pattern has no empty alternative at the end):
stop_wrds = re.compile('|'.join(rf'\b{re.escape(wrd)}\b' for wrd in stopwords.words('english')))
r = [stop_wrds.sub('', review).strip() if pd.notna(review) else review for review in r]

# Removing multiple and trailing whitespaces:
r = [re.sub(r' +', ' ', review).strip() if pd.notna(review) else review for review in r]

reviews['Review'] = r
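
The cleaning steps above can be sketched as one function applied to a single string. This is an illustrative sketch, not the notebook's actual pipeline: `clean_review` and `TOY_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for `stopwords.words('english')` so the example runs without NLTK data.

```python
import re

# Small stand-in for the NLTK English stop-word list (assumption)
TOY_STOP_WORDS = {"the", "is", "a", "it"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    # 1. Drop URLs and lowercase
    text = re.sub(r'http\S+', '', text).lower()
    # 2. Replace everything that is not a letter with a space
    text = re.sub(r'[^a-z]', ' ', text)
    # 3. Remove stop words as whole words
    stop_pattern = '|'.join(rf'\b{re.escape(w)}\b' for w in sorted(stop_words))
    text = re.sub(stop_pattern, '', text)
    # 4. Collapse repeated spaces and trim
    return re.sub(r' +', ' ', text).strip()

example = clean_review("The port is ROUGH, see http://example.com/guide (7/10)")
```

Note that step 2 also removes apostrophes, so contractions are split into fragments before the stop-word step runs.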

In [None]:
reviews.dtypes

In [None]:
reviews.describe()

## EDA:

Distribution of whether or not people recommend buying the game:

In [None]:
sns.barplot(data=reviews['Recommended?'].value_counts().reset_index(), 
            x='Recommended?',
            y='count');

- Most people recommend playing DS2

Years captured in the webscraped reviews:

In [None]:
yr_dist = reviews.groupby('Year')['Recommended?'].count().reset_index()
yr_dist.rename(columns={'Recommended?':'count'}, inplace=True)

sns.barplot(data=yr_dist, 
            x='Year',
            y='count');

- Most reviews are fairly recent
    - Game released    : March 2014
    - Remaster released: April 2015 (focus of this project)

Proportion of reviews that do and don't recommend DS2 in each year:

In [None]:
yr_plt = reviews.groupby(['Year', 'Recommended?']).count().reset_index()
yr_plt.rename(columns={'Review':'proportion'}, inplace=True)
yr_plt['proportion'] = 100 * (yr_plt['proportion'] / yr_plt.groupby(['Year'])['proportion'].transform('sum'))

sns.catplot(data=yr_plt,
            kind='bar',
            x='Year',
            y='proportion',
            hue='Recommended?'
            );
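
For reference, the per-year percentage calculation can be sketched on a toy frame. The data below is invented, and the sketch uses `size()` rather than `count()`, which is an assumption about intent: `size()` counts every row, while `count()` skips rows whose review text is missing.

```python
import pandas as pd

# Invented rows, only to show the mechanics
toy = pd.DataFrame({
    "Year": ["2015", "2015", "2015", "2016", "2016"],
    "Recommended?": ["Recommended", "Recommended", "Not Recommended",
                     "Recommended", "Not Recommended"],
})

# Number of reviews per (year, recommendation) pair
counts = toy.groupby(["Year", "Recommended?"]).size().reset_index(name="n")

# Share of each pair within its year, as a percentage
counts["proportion"] = 100 * counts["n"] / counts.groupby("Year")["n"].transform("sum")
```

Within each year the `proportion` values add up to 100, which is what makes the bars comparable across years with very different review counts.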

In [None]:
yr_plt[yr_plt['Recommended?'] == 'Recommended'].plot(kind='line',
                                                     x='Year',
                                                     y='proportion');

- Contrary to popular belief, the game received more praise than criticism at its initial release
- Over the years, reception of the game has been consistently positive

## Sentiment Analysis:
- Seeing why people were positive or negative about the game
    - Comments on story, gameplay, etc

Since this analysis focuses on the review text itself, drop any rows that have no review:

In [None]:
reviews = reviews.dropna(subset=['Review'])
reviews.shape

Top 10 Most Common Words in the Reviews:

In [None]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(sublinear_tf=True,
                        analyzer='word',
                        max_features=4000,
                        tokenizer=word_tokenize,
                        stop_words=stopwords.words("english"))

In [None]:
review_txt = reviews.Review.values.flatten()
tfidf_array = tfidf.fit_transform(review_txt).toarray()
tfidf_df = pd.DataFrame(tfidf_array)
tfidf_df.columns = tfidf.get_feature_names_out()
tfidf_df.head()

In [None]:
most_unique = tfidf_df.idxmax(axis=1)
top_10 = most_unique.value_counts()[:10]
top_10

In [None]:
top_10 = top_10.reset_index()
top_10.rename(columns={'index':'word'}, inplace=True)
sns.barplot(data=top_10,
            x='word',
            y='count');

- Most popular word w/ semantic meaning: Triple A 
    - def: an informal classification used to classify video games produced and distributed by a mid-sized or major publisher
- Top words used seem to be positive, but this is looking at the words without context

Other popular words:

In [None]:
from wordcloud import WordCloud

In [None]:
pop_wrds = " ".join( review for review in reviews.Review)
wordcloud = WordCloud(max_font_size=150, max_words=100, background_color="white").generate( pop_wrds )
plt.figure(figsize=(30, 10))
plt.imshow(wordcloud, interpolation="bilinear") 
plt.axis("off")
plt.title('Popular Words in DS2 Reviews')
plt.show()

- The most common word in the reviews isn't very informative, and neither are several of the other popular words
    - Looking at subsets of the reviews could be more useful

### Topic Modeling:
- Exploring certain aspects on why people like the game
    - Also look for critiques of the game within positive reviews (likely present, given DS2's reputation in the community)

- Exploring why people don't like the game:
    - Also get positive aspects within this subset of the reviews
    
- Algorithms I can use to perform topic modeling:
    1. Latent Dirichlet Allocation (LDA) 
    2. Non-negative Matrix Factorization (NMF)

Splitting the reviews by how many do and don't recommend buying the game:

In [None]:
pos_reviews = reviews[reviews['Recommended?'] == 'Recommended']
neg_reviews = reviews[reviews['Recommended?'] == 'Not Recommended']

In [None]:
pos_reviews.shape, neg_reviews.shape

Function to display the output of the models:

In [None]:
def display_topics(model, feature_names, no_top_words):
    """Return a DataFrame with the top words and weights of each topic in a fitted model."""
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the no_top_words largest weights, in descending order
        top_idx = topic.argsort()[:-no_top_words - 1:-1]
        topic_dict["Topic %d words" % (topic_idx + 1)] = [str(feature_names[i]) for i in top_idx]
        topic_dict["Topic %d weights" % (topic_idx + 1)] = ['{:.1f}'.format(topic[i]) for i in top_idx]
    return pd.DataFrame(topic_dict)
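
The slice `argsort()[:-no_top_words - 1:-1]` in the function above is compact, so here is a small sketch of what it selects. The weights and words are invented for illustration.

```python
import numpy as np

# Invented topic weights and matching vocabulary
weights = np.array([0.2, 1.5, 0.1, 0.9, 0.4])
words = np.array(["boss", "game", "port", "souls", "hitbox"])
n_top = 3

# argsort gives indices in ascending order of weight;
# the reversed slice keeps the last n_top of them, largest first
top_idx = weights.argsort()[:-n_top - 1:-1]
top_words = [str(w) for w in words[top_idx]]
```

Here `top_idx` holds the positions of the three largest weights in descending order, so `top_words` lists the highest-weighted word first.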

LDA: a probabilistic graphical model that takes raw term counts (CountVectorizer output) as input
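
A minimal sketch of the counts-to-LDA flow on a toy corpus, assuming scikit-learn is available. The four sentences and the `toy_` names are invented, and two topics on four documents is far too little data to interpret; the point is only the shape of the inputs and outputs.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini corpus
toy_reviews = [
    "great game great bosses",
    "love the bosses and the world",
    "bad port bad controls",
    "controls feel bad on keyboard",
]

# LDA models word counts, so the input is a document-term count matrix
toy_counts = CountVectorizer().fit_transform(toy_reviews)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
# One row per document, one column per topic; each row is a topic mixture
doc_topics = toy_lda.fit_transform(toy_counts)
```

Each row of `doc_topics` sums to 1, and `toy_lda.components_` holds one row of word weights per topic, which is what `display_topics` reads.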

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
count_vector = CountVectorizer()

tf = count_vector.fit_transform(reviews.Review).toarray()
tf_feat_names = count_vector.get_feature_names_out()

pos_tf = count_vector.fit_transform(pos_reviews.Review).toarray()
pos_tf_feat_names = count_vector.get_feature_names_out()

neg_tf = count_vector.fit_transform(neg_reviews.Review).toarray()
neg_tf_feat_names = count_vector.get_feature_names_out()

In [None]:
lda = LatentDirichletAllocation(n_components=3, random_state=42069)
lda.fit(tf)

In [None]:
no_top_words = 10
display_topics(lda, tf_feat_names, no_top_words)

- Interpreted topics that were identified:
    1. People saying how good the game is
    2. Bosses/enemies
    3. People expressing their likes or dislikes of the game

In [None]:
pos_lda = LatentDirichletAllocation(n_components=3, random_state=42069)
pos_lda.fit(pos_tf)

In [None]:
display_topics(pos_lda, pos_tf_feat_names, no_top_words)

- Interpreted topics that were identified:
    1. People expressing that they loved the game (expected since I'm looking at the subset of reviews that recommend the game)
    2. (similar to 1st topic)
    3. Bosses/enemies

In [None]:
neg_lda = LatentDirichletAllocation(n_components=3, random_state=42069)
neg_lda.fit(neg_tf)

In [None]:
display_topics(neg_lda, neg_tf_feat_names, no_top_words)

- Interpreted topics that were identified:
    1. Bosses/enemies
    2. Controls/PC port of the game
    3. Players commenting that it's the worst Dark Souls game they've played

NMF: a linear-algebra method that factorizes the TF-IDF matrix into non-negative document-topic and topic-word matrices

In [None]:
from sklearn.decomposition import NMF

In [None]:
nmf = NMF(n_components=3, random_state=42069)
nmf.fit(tfidf_array)

In [None]:
display_topics(nmf, tfidf.get_feature_names_out(), no_top_words)

- Interpreted topics that were identified:
    1. Positive experiences from the game
    2. (similar to 1st topic)
    3. Mixed reception of the game (love and hate)

In [None]:
pos_txt = pos_reviews.Review.values.flatten()
pos_tfidf_array = tfidf.fit_transform(pos_txt).toarray()
nmf.fit(pos_tfidf_array)

In [None]:
display_topics(nmf, tfidf.get_feature_names_out(), no_top_words)

- Interpreted topics that were identified:
    1. Positive outlooks on the game
    2. (similar to 1st topic)
    3. People expressing their opinion on the game, ranging from good to bad

In [None]:
neg_txt = neg_reviews.Review.values.flatten()
neg_tfidf_array = tfidf.fit_transform(neg_txt).toarray()
nmf.fit(neg_tfidf_array)

In [None]:
display_topics(nmf, tfidf.get_feature_names_out(), no_top_words)

- Interpreted topics that were identified:
    1. Vague but concerned w/ enemies
    2. Very negative perspectives on the game
    3. Negative experience regarding bosses, hitboxes, and game design

## Conclusion: