# Review Analysis and Classification for Battlefield 2042

By Liam Zalubas

## Introduction: 

The Battlefield franchise of video games is generally known for combining multiplayer infantry and vehicle combat in dynamic and destructible environments. Battlefield 2042 is the latest entry in the series, released in November 2021. The game is available on console and PC hardware, and on PC it is sold through the Steam, Origin, and Epic Games digital marketplaces. From this point forward, I will refer only to the Steam version of the game, as it is the version I was able to collect data on most readily. Because of this, any conclusions drawn from this project may not apply to the game as a whole.

In terms of player reception, Battlefield 2042 initially released to overwhelmingly negative reviews, but has been updated consistently since release. Over two years after its release, the game has seen a significant increase in positive reviews, bringing its overall review score from "Mostly Negative" (~31% positive) to "Mostly Positive" (~71% positive).

This project aims to analyze the Steam reviews of Battlefield 2042 by training a classifier to predict the most common words in a review from two features: whether the review is positive or negative, and the date the review was most recently updated.


## Data Gathering: 

The first step in this project was to gather the data. Steam exposes a web API that provides access to data from its store and community pages, including user reviews. In Python, the `steamreviews` library wraps the review portion of this API. The following code requests English-language reviews of Battlefield 2042 written by Steam purchasers and stores the results in a pandas DataFrame.

```python
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk
import steamreviews

# Set the app ID for Battlefield 2042
app_id = 1517290

# Set the request parameters
request_params = {
    'review_type': 'all',
    'filter': 'all',
    'day_range': '1',
    'language': 'english',
    'purchase_type': 'steam',
}

# Fetch the reviews using the steamreviews library
review_dict, query_count = steamreviews.download_reviews_for_app_id(
    app_id, chosen_request_params=request_params
)

# Collect one row per review: ID, text, last-update timestamp, recommendation
review_data = []
for review_id, review in review_dict['reviews'].items():
    review_data.append([
        review_id,
        review['review'],
        review['timestamp_updated'],
        review['voted_up'],
    ])

# Create a pandas DataFrame from the review data
df = pd.DataFrame(review_data, columns=['review_id', 'review_text', 'latest_update', 'is_recommended'])

# Print the DataFrame
print(df)
```


[appID = 1517290] expected #reviews = 80451

| | review_id | review_text | latest_update | is_recommended |
|---|---|---|---|---|
| 0 | 153250869 | My good friends in the NTABC gaming group reco... | 1702163585 | True |
| 1 | 153250298 | Absolutley Great! | 1702163017 | True |
| 2 | 153244819 | Super boring not fun at all and super boring¨\n\n | 1702157746 | False |
| 3 | 153244702 | under rated game fs | 1702157625 | True |
| 4 | 153242987 | Cringy as hell | 1702156183 | False |
| ... | ... | ... | ... | ... |
| 2291 | 153477311 | The big battles are fun. I did was not playing... | 1702426784 | True |
| 2292 | 153381807 | Games aight. I do love joining loosing games t... | 1702310302 | True |
| 2293 | 153444017 | I avoided this game for a long time due to the... | 1702389257 | True |
| 2294 | 153443143 | (Only jumped in a quarter into season 6)\nHone... | 1702388319 | True |
| 2295 | 153355918 | They've done a lot for the game since the horr... | 1702276913 | True |

[2296 rows x 4 columns]
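
The `latest_update` values in the output above appear to be Unix timestamps (seconds since 1 January 1970 UTC). That reading is an assumption based on their magnitude, not something stated in the output itself. Under that assumption, a minimal sketch of converting them to readable dates with only the standard library could look like this; the helper name `to_utc_datetime` is hypothetical.

```python
from datetime import datetime, timezone

def to_utc_datetime(unix_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Two timestamps copied from the first rows of the output above
sample_timestamps = [1702163585, 1702163017]
sample_dates = [to_utc_datetime(ts) for ts in sample_timestamps]

for ts, date in zip(sample_timestamps, sample_dates):
    print(ts, "->", date.isoformat())
```

With pandas, `pd.to_datetime(df['latest_update'], unit='s')` performs the same conversion on the whole column at once.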


## Data Ethics: 

When collecting data, it is important to consider the ethics of the data collection and how mistakes or oversights in the collection process can skew our results. In this particular example, we are considering only English-language reviews on the Steam store. Because this game is also available on other PC distribution platforms, on gaming consoles (Xbox, PlayStation), and in other countries, we must recognize that this data set is a subset of a larger one. This means that any conclusions we draw from it may not reflect public opinion on the game as a whole.
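
To put a number on how small this subset is, the sketch below divides the 2296 reviews in the DataFrame by the 80451 reviews the download step reported as expected. Both figures come from the output in the Data Gathering section. The variable names are illustrative, and the output does not say why the two counts differ.

```python
# Figures taken from the Data Gathering output
reviews_collected = 2296
reviews_expected = 80451

# Share of the expected Steam reviews that ended up in the data set
coverage = reviews_collected / reviews_expected
print(f"Collected {coverage:.1%} of the expected reviews")
```

Roughly 3% of the expected reviews are represented, and those are English-language Steam reviews only.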


## Munging, Wrangling, and Cleaning Data: 

The next step in this project is to clean the data. This means removing any data that is not relevant to our analysis and making sure that the remaining data is in a format that is easy to work with. In this case, we will remove any reviews that are not in English, then remove any "stop words" from the review text itself.
- I have chosen to remove non-English reviews because I am not familiar enough with other languages, and do not trust online translation services enough, to be confident that I can accurately translate reviews into English.
- I have chosen to remove stop words because they provide little useful information about the review itself. Examples of stop words include "the", "a", "of", and other words which are common in the English language but do not convey meaning on their own without context.
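
The stop-word step described above can be sketched as follows. This is a minimal illustration, not the project's final cleaning code: the short `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical, and a fuller list such as the English stop words shipped with NLTK (`nltk.corpus.stopwords`) could be substituted.

```python
import re

# A small, hand-picked stop-word set for illustration only (an assumption)
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "it", "to", "in", "at", "not"}

def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

# Review text copied from the Data Gathering output
print(remove_stop_words("Super boring not fun at all and super boring"))
```

In this sketch "not" is treated as a stop word, so "not fun" is reduced to "fun". That changes the apparent sentiment of the review, which is one reason to inspect a stop-word list before applying it.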


## Analysis: 

Use a classification algorithm (multinomial or Gaussian) OR regression to analyze your data. You should explain:
- The theory behind the classification or regression you perform,
- Why you chose the variables you did to focus on,
- Each step of the code, which you should include on the website,
- The results your model/analysis comes to and what they mean.
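
As one possible reading of the prompt above, the sketch below implements a multinomial Naive Bayes classifier from scratch. It estimates P(label) and P(word | label) from word counts, applies add-one (Laplace) smoothing, and predicts the label with the highest log-probability. It is an illustration only, not this project's analysis: the four training reviews are invented toy data, and the function names `train_multinomial_nb` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels):
    """Count words per label and return the pieces needed for prediction."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocabulary

def predict(tokens, label_counts, word_counts, vocabulary):
    """Return the label with the highest smoothed log-probability."""
    total_documents = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(label)
        score = math.log(count / total_documents)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # log P(word | label) with add-one smoothing
            numerator = word_counts[label][token] + 1
            denominator = total_words + len(vocabulary)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy reviews, already tokenized (hypothetical data)
train_docs = [
    ["fun", "big", "battles"],
    ["great", "fun", "game"],
    ["boring", "buggy", "game"],
    ["boring", "cringy"],
]
train_labels = [True, True, False, False]  # True = recommended

model = train_multinomial_nb(train_docs, train_labels)
print(predict(["fun", "battles"], *model))
print(predict(["boring", "buggy"], *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared under one label from forcing that label's probability to zero.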



## Pretty Pictures: 

Whether you run a regression or classification analysis you should help the reader understand your data by using a plot of some kind. 

Classification Algorithms - show the distribution of the data before you run the classification algorithm: histograms for multinomial classifications and Gaussian curves for Gaussian classification. (The code should also be included for these visualizations.)

These pictures should be helpful to you when explaining the theory behind the models you produce. 
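
A minimal sketch of the histogram step for a multinomial classifier follows, assuming the reviews have already been tokenized. It counts word frequencies per label with `collections.Counter` and defines a plotting helper built on matplotlib's `bar` method. The toy reviews and the names `word_frequencies` and `plot_word_frequencies` are hypothetical, and matplotlib is imported inside the helper so that the counting part runs without it.

```python
from collections import Counter

def word_frequencies(documents, labels, wanted_label):
    """Count how often each word appears in reviews with the given label."""
    counts = Counter()
    for tokens, label in zip(documents, labels):
        if label == wanted_label:
            counts.update(tokens)
    return counts

def plot_word_frequencies(counts, title, top_n=10):
    """Draw a bar chart of the top_n most common words (needs matplotlib)."""
    import matplotlib.pyplot as plt

    words, frequencies = zip(*counts.most_common(top_n))
    fig, ax = plt.subplots()
    ax.bar(words, frequencies)
    ax.set_xlabel("Word")
    ax.set_ylabel("Occurrences")
    ax.set_title(title)
    return fig

# Invented toy reviews, already tokenized (hypothetical data)
docs = [["fun", "battles"], ["fun", "game"], ["boring", "game"], ["boring"]]
labels = [True, True, False, False]  # True = recommended

positive_counts = word_frequencies(docs, labels, True)
negative_counts = word_frequencies(docs, labels, False)
print(positive_counts.most_common(3))
print(negative_counts.most_common(3))
```

Calling `plot_word_frequencies(positive_counts, "Words in positive reviews")` returns a Matplotlib figure that can be saved with `fig.savefig(...)`. Plotting the positive and negative counts as separate charts makes it easier to compare the two distributions.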



## Grading

### Information Provided

- Introduction. Helps orient the reader at the beginning of the tutorial and explains what they will be learning. 

- Data Gathering. Clearly explains the techniques that can be used with examples.

- Data Ethics. Clearly explains the unintended consequences of some data techniques and how to avoid bias and misrepresentation of information. 

- Data Cleaning. Clearly explains the difficulties faced by data scientists in data collection and standardization, and how to manage and manipulate data so that it can be helpful. 

- Data Analysis. Code, images, and documentation clearly present what can be learned from data using one of the techniques we learned in class. 

### Presentation of Information

- Understanding. After reading through the tutorial, does an uninformed reader feel informed about the topic and data science? Would a reader who already knew about the topic/data science feel like they learned more about it? 

- Readability. Is the information explained in a grammatically correct way that allows the reader to easily understand what they are looking at?

- Code. Is the code well written, well documented, reproducible, and does it help the reader understand the tutorial? Does it give good examples of specific techniques? 

- Style. Does the website look like it was made in 1995, or is there some styling that helps the reader focus, refer back to previous sections when they would like to revisit the material, and understand what they should be reading next?

- Images. Are the appropriate images used to help explain concepts throughout the website? Are labels and colors done appropriately so that the reader can understand the charts provided? 

- Citations. All data, code, and information originating from outside sources should be properly and consistently cited. 
