# Review Analysis and Classification for Battlefield 2042

By Liam Zalubas

## Introduction: 

The Battlefield franchise of video games is generally known for combining multiplayer infantry and vehicle combat in dynamic, destructible environments. Battlefield 2042 is the latest entry in the series, released in November of 2021. The game is available on consoles and PC, and on PC it is sold through the Steam, Origin, and Epic Games digital marketplaces. From this point forward, I will be referring to the Steam version of the game, as it is the version I was able to collect data on most readily. Because of this, any conclusions drawn from this project may not be applicable to the game as a whole.

In terms of player reception, Battlefield 2042 initially released to overwhelmingly negative reviews, but has been updated consistently since release. Over two years after its release, the game has seen a significant increase in positive reviews, bringing its overall review score from "Mostly Negative" (~31% positive) to "Mostly Positive" (~71% positive).

This project aims to analyze the reviews of Battlefield 2042 on Steam by training a classifier to predict when a review was posted (specifically, in which month), given only the text of the review. We will then use this classifier to see whether there are any trends in the reviews over time.

## Data Gathering: 

The first step is to gather the data. To download reviews on Steam, we can use the Steam API, a web API which allows us to access data from the Steam store and community pages. For Python, we can import the `steamreviews` library to access the API. The following code requests some reviews of Battlefield 2042 from the API and saves the results to a pandas DataFrame.

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import steamreviews

# Set the app ID for Battlefield 2042
app_id = 1517290

# Set the request parameters
request_params = dict()
request_params['review_type'] = 'all'
request_params['filter'] = 'all'
request_params['day_range'] = '7'
request_params['language'] = 'english'
request_params['purchase_type'] = 'steam'

# Fetch the reviews using the steamreviews library
review_dict, query_count = steamreviews.download_reviews_for_app_id(app_id, chosen_request_params=request_params)

# Create a list to store the review data
review_data = []

# Iterate over the reviews in review_dict
for review_id in review_dict['reviews']:
    # Extract the desired values from the review
    review_text = review_dict['reviews'][review_id]['review']
    latest_update = review_dict['reviews'][review_id]['timestamp_updated']
    is_recommended = review_dict['reviews'][review_id]['voted_up']

    # Append the review data to the list
    review_data.append([review_id, review_text, latest_update, is_recommended])

# Create a pandas DataFrame from the review data
df = pd.DataFrame(review_data, columns=['review_id', 'review_text', 'latest_update', 'is_recommended'])

# Print the DataFrame
print(df)


[appID = 1517290] expected #reviews = 80453
     review_id                                        review_text  \
0    153522729  My friend was 10 feet in front of me picking m...   
1    153527630  This could be a pretty decent game, but unfort...   
2    153479542  Not the greatest Battlefield experience, but i...   
3    153479197                                 it fixed play game   
4    153477433                                               Nice   
..         ...                                                ...   
121  153444017  I avoided this game for a long time due to the...   
122  153443143  (Only jumped in a quarter into season 6)\nHone...   
123  153355918  They've done a lot for the game since the horr...   
124  153207345      dog shit, battlefield will never be the same.   
125  153234689      This game suck so much, i tried i really did.   

     latest_update  is_recommended  
0       1702490638            True  
1       1702496004           False  
2       17024302
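
The `latest_update` values in the output above are Unix timestamps (seconds since 1 January 1970, UTC). Since the goal is to predict the month of a review, each timestamp eventually has to become a month label. The following is a minimal sketch of that conversion; the helper name `month_label` and the choice to work in UTC are my own assumptions, not part of the original notebook.

```python
from datetime import datetime, timezone


def month_label(unix_seconds):
    """Return the UTC calendar month of a Unix timestamp as 'YYYY-MM'."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m")


# The first two 'latest_update' values shown in the output above
print(month_label(1702490638))
print(month_label(1702496004))
```

Using an explicit time zone keeps the labels the same no matter which machine runs the code.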

## Data Ethics: 

When collecting data, it is important to consider the ethics of the data collection and how mistakes or oversights in the collection process can skew our results. In this particular example, we are considering only reviews written in English on the Steam store. Because this game is also available on other digital distribution platforms for the PC, on gaming consoles (Xbox, PlayStation), and in other countries, we must recognize that this data set is a subset of a larger data set. This means that any conclusions we draw from it may not be applicable to public opinion on the game as a whole.

## Munging, Wrangling, and Cleaning Data: 

The next step in this project is to clean the data. This means removing any data that is not relevant to our analysis and making sure that the data is in a format that is easy to work with. In this case, we will keep only reviews written in English (the request above already sets `language` to `english`) and then remove any "stop words" from the review text itself.

- I have chosen to remove non-English reviews because I am not familiar enough with other languages, and do not trust online translation services enough, to be confident that I can accurately translate reviews into English. From our previous discussion of data ethics, we know this means our analysis cannot be used to draw conclusions about the experiences of non-English-speaking players.

- I have chosen to remove stop words because they do not provide any useful information about the review itself. Examples of stop words include "the", "a", and "of". These are words which are common in the English language but do not convey individual meaning without context.
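
To make the idea of stop-word removal concrete, here is a minimal sketch. It is a stand-in written for illustration, not the lost `preprocess_text` method: the function name `remove_stop_words`, the tiny `STOP_WORDS` set, and the tokenizing pattern are all assumptions of mine. A real stop-word list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "it", "to",
              "in", "this", "i", "but", "be"}


def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


print(remove_stop_words("This could be a pretty decent game"))
```

A review made up entirely of stop words comes back as an empty list, which is the case where the row would be dropped from the data set.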

We will use the ```preprocess_text``` method to clean the review text for each review in our data set, as well as the ```increment_word``` method to keep track of the frequency of each word in our data set:

In [12]:
# it appears I just lost all my code in this block before this point.
# NOTE: preprocess_text and time_period_list were defined in the lost code,
# so they must be defined again before this cell can run.
import datetime

# indices of rows whose processed text is empty; dropped after the loop
rows_to_drop = []

# for each row in the DataFrame, process the review and update the 'review_text' column
for index, row in df.iterrows():
    review_text = row['review_text']

    # get the review's timestamp to find the time period
    review_timestamp = int(row['latest_update'])
    # convert the timestamp to a datetime object
    review_datetime = datetime.datetime.fromtimestamp(review_timestamp)

    # iterate over the time periods and find the one containing the review
    for interval in time_period_list:
        if interval[0] <= review_datetime <= interval[1]:
            time_period = interval
            break
    else:
        # no matching time period was found, so skip this review
        continue

    processed_text = preprocess_text(review_text, time_period)

    # if processed_text is empty, mark the row for removal from the DataFrame
    if not processed_text:
        rows_to_drop.append(index)
    else:
        df.at[index, 'review_text'] = processed_text

# remove the marked rows in one step, rather than while iterating
df.drop(rows_to_drop, inplace=True)

# Print the updated dataframe
# print(df)

## Analysis: 

Assuming I had not lost the majority of my code, I would now have:
- A comprehensive list of all the reviews for Battlefield 2042 on Steam
- A list of all the words used in the reviews and their frequency for each month since the game's release
- A word count for each month since the game's release
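
As a rough illustration of the per-month word frequencies described in this list, the sketch below groups already-cleaned reviews by month and counts their words. The function name `monthly_word_counts`, the `(timestamp, words)` input format, and the sample reviews are all made up for this example; they are not the lost code or real data.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def monthly_word_counts(reviews):
    """Map 'YYYY-MM' to a Counter of words, given (unix_timestamp, words) pairs."""
    counts = defaultdict(Counter)
    for unix_seconds, words in reviews:
        moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
        counts[moment.strftime("%Y-%m")].update(words)
    return dict(counts)


# Made-up sample: (timestamp, words left after cleaning)
sample = [
    (1637366400, ["broken", "refund", "broken"]),  # November 2021
    (1702490638, ["fixed", "fun"]),                # December 2023
    (1702496004, ["fun", "decent"]),               # December 2023
]
counts = monthly_word_counts(sample)
for month, counter in sorted(counts.items()):
    print(month, sum(counter.values()), counter.most_common(2))
```

Summing a month's counter gives the total word count for that month, and the counter itself holds the frequency of each word.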

From here, we would be able to use the word count to train a classifier to predict the month a review was posted based on the text of the review. 

To do this, we could use the ```train_test_split``` function from the ```sklearn``` library (found in ```sklearn.model_selection```) to split our data into a training set and a testing set.
- We split the data so that we can train our classifier on one subset and then test it on reviews it has never seen. If we tested our classifier on the same data we trained it on, we would get a false sense of accuracy.
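
To show what such a split does, here is a standard-library sketch of a hold-out split. It is not `sklearn`'s implementation: the function name `holdout_split`, its parameters, and the sample data are assumptions for illustration. It returns its four parts in the same order `train_test_split` uses (training items, testing items, training labels, testing labels).

```python
import random


def holdout_split(items, labels, test_fraction=0.25, seed=0):
    """Shuffle paired items and labels, then split them into train and test parts."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up reviews and month labels
texts = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]
months = ["2021-11"] * 4 + ["2023-12"] * 4
x_train, x_test, y_train, y_test = holdout_split(texts, months)
print(len(x_train), len(x_test))
```

Shuffling before splitting matters here because the reviews arrive roughly in time order, so an unshuffled split could leave whole months out of the training set.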

We would then use the ```fit``` method to train our classifier on the training set, and the ```predict``` method to predict the month a review was posted based on the text of the review. 

- Training our classifier means showing it many reviews together with the month each one was posted, so that it learns which words are typical of which month. Our model would use Multinomial Naive Bayes, a classification algorithm that applies Bayes' Theorem to word counts. For each review, our classifier would estimate the probability of the review having been posted in each month, and then predict the month with the highest probability.
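
The calculation behind Multinomial Naive Bayes can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions (the class name `TinyMultinomialNB`, add-one smoothing, ignoring words never seen in training, and made-up reviews), not `sklearn`'s implementation. For each month it adds the log of the month's prior probability to the log probability of each word given that month, then picks the month with the highest total.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # reviews per month
        self.word_counts = defaultdict(Counter)   # word frequencies per month
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, words):
        """Return log P(month) + sum of log P(word | month) for each month."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            denominator = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                if word in self.vocab:  # unseen words are ignored
                    score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)


# Made-up training reviews (already cleaned) and their months
docs = [["broken", "refund"], ["broken", "bugs"], ["fixed", "fun"], ["fun", "improved"]]
labels = ["2021-11", "2021-11", "2023-12", "2023-12"]
model = TinyMultinomialNB().fit(docs, labels)
print(model.predict(["broken"]))
print(model.predict(["fun", "game"]))
```

Working with logarithms avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared in a given month from forcing that month's probability to zero.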

We could then use the ```score``` method to determine the accuracy of our classifier.

- Scoring our classifier means asking it to predict the month for each review in our testing set, and then comparing each prediction to the month the review was actually posted. The percentage of correct predictions is our accuracy score. The higher our accuracy score, the more accurate our classifier is at making this particular prediction.
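
The accuracy calculation itself is simple enough to write by hand. The sketch below uses made-up labels, and its function names are my own. It also computes a majority-class baseline, which is an addition of mine rather than part of the plan above: it is the accuracy we would get by always guessing the most common month, which is a useful point of comparison if reviews are spread unevenly across months.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of reviews whose predicted month equals the actual month."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always guessing the most common month."""
    return Counter(actual).most_common(1)[0][1] / len(actual)


# Made-up testing-set labels and predictions
actual = ["2021-11", "2021-11", "2021-11", "2023-12"]
predicted = ["2021-11", "2021-11", "2023-12", "2023-12"]
print(accuracy(predicted, actual))
print(majority_baseline(actual))
```

In this made-up case both numbers are the same, which would suggest the classifier is doing no better than guessing the busiest month every time.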