# Final Report Notebook

Here you will find a complete summary of my work for this project. It includes all of the code required to reproduce the project, along with detailed descriptions of, and motivations for, each step taken.

First we will make the necessary imports:

In [2]:
# General purpose
import pandas as pd

# Natural language processing


# Modeling



## Data Acquisition

#### VGChartz

The data in this project came from a few sources. First, we'll start with the scraped data from the video game database [VGChartz](https://www.vgchartz.com/).

The actual scraping work can be found in the exploratory notebook [Initial Web Scraping](../notebooks/01_explore_scrape.ipynb).

Below we will import the resulting data:

In [18]:
df = pd.read_csv('../data/nice_data.csv', low_memory=False)
df.shape

(16719, 16)

We end up with a ton of entries: 16,719 rows across 16 columns. Let's see how they look.

In [29]:
df.sample(3)

| | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16055 | IHRA Drag Racing: Sportsman Edition | XB | 2006.0 | Racing | Bethesda Softworks | 0.01 | 0.0 | 0.0 | 0.0 | 0.02 | 35.0 | 4.0 | tbd | NaN | Bethesda Softworks | E10+ |
| 13852 | Painkiller: Hell & Damnation | PS3 | 2013.0 | Shooter | Nordic Games | 0.02 | 0.01 | 0.0 | 0.01 | 0.04 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3234 | Tak 2: The Staff of Dreams | PS2 | 2004.0 | Platform | THQ | 0.3 | 0.24 | 0.0 | 0.08 | 0.62 | 71.0 | 31.0 | 8 | 10.0 | Avalanche Software | E |


Alright, we have a lot of missing data here. We're most interested in the 'Critic_Score' column, so let's see how many entries actually have a value there.

In [24]:
# Count the rows that have a non-missing critic score
n_scores = df.Critic_Score.notna().sum()
print(f'{n_scores} out of {df.shape[0]} possible critic scores are present')

8137 out of 16719 possible critic scores are present


Not too great: just over half of our entries (8,582 of 16,719) have no critic score, and those will need to be dropped. However, we are still left with over 8,000 games, so we'll take them to the next step.
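As a minimal sketch of that drop, the snippet below rebuilds a toy frame from the three sampled rows shown earlier (keeping only two columns) and removes the row with no critic score. The names `toy` and `scored` are illustrative only; the project's actual filtering may have been done differently in the exploratory notebooks.

```python
import pandas as pd

# Toy frame based on the three sampled rows above; only two columns are kept.
toy = pd.DataFrame({
    'Name': ['IHRA Drag Racing: Sportsman Edition',
             'Painkiller: Hell & Damnation',
             'Tak 2: The Staff of Dreams'],
    'Critic_Score': [35.0, None, 71.0],
})

# Keep only rows with a critic score; missing values in other columns
# would not cause a row to be dropped because of `subset`.
scored = toy.dropna(subset=['Critic_Score']).reset_index(drop=True)
print(scored.shape)
```

Passing `subset` matters here: a bare `dropna()` would also discard rows that are missing unrelated fields such as 'User_Count'.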

#### Wikipedia

For the next step in acquiring the data, we turn to the Wikipedia API. Using the available Python library, getting Wikipedia information is extremely easy. This is good for us, because we need to make a lot of requests.

To use the API, you supply the name of the relevant Wikipedia page, and it returns the text of any section you request. So we sent in all of our game titles, along with some variations on them, and got back most of our entries.
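The "title plus variations" lookup can be sketched roughly as follows. Everything here is hypothetical: `candidate_titles`, `first_section_text`, the specific variations tried, and the `fetch` callable are stand-ins for illustration, not the code from the wikiscraping notebook. In practice `fetch` would wrap the Wikipedia library call; here a dictionary plays that role so the sketch runs offline.

```python
from typing import Callable, Iterator, Optional


def candidate_titles(name: str) -> Iterator[str]:
    """Yield page titles to try for a game (illustrative variations only)."""
    yield name
    yield f'{name} (video game)'
    # Also try the title without its subtitle, if it has one.
    if ':' in name:
        yield name.split(':', 1)[0].strip()


def first_section_text(name: str,
                       fetch: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the first non-empty text found across the title variations.

    `fetch` is an assumed callable that takes a page title and returns the
    section text, or None when the page or section does not exist.
    """
    for title in candidate_titles(name):
        text = fetch(title)
        if text:
            return text
    return None


# Placeholder "API": a dict lookup standing in for real requests.
fake_pages = {'Some Game (video game)': 'Plot text goes here.'}

print(first_section_text('Some Game', fake_pages.get))
print(first_section_text('Missing Game', fake_pages.get))
```

Titles for which every variation fails simply return `None`, which is one way entries could drop out at this stage.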

We'll import the resulting data below, but the full work can be found in the [Wikiscraping Notebook](../notebooks/02_wikiscraping.ipynb).

In [48]:
df = pd.read_csv('../data/final_plots.csv').drop('Unnamed: 0', axis=1)
df.shape

(5143, 17)

Now we have 5,143 entries with full Wikipedia descriptions and critic scores.

Let's see one of these descriptions:

In [47]:
df.plots.sample(random_state=2).values

array(['Master of Illusion puts the player in the role of an illusionist who must learn and perfect his tricks. The game has three basic modes: Solo Magic, Magic Show and Magic Training. The first one is a compilation of varied minigames, the other two being the "meat of the game", or the important part, according to reviewers. In both, the objective is to perform tricks and earn points, which grant the player more tricks and illusions. The system has a limit for the points a player can earn in a full day, though this can be bypassed by changing the date on the Nintendo DS system.'],
      dtype=object)