# Executive Summary
---

# Preprocessing & Modeling
---

## Load in final data for preprocessing and modeling

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt 
from matplotlib.patches import Patch
plt.style.use('fivethirtyeight')

In [2]:
final_reviews_df = pd.read_csv('../data/final_all_console_reviews.csv')

In [3]:
final_reviews_df.head()

Unnamed: 0,console,video_game_name,summary,developer,genre(s),num_players,esrb_rating,critic_score,avg_user_score,user_review,user_score,target
0,ps4,Red Dead Redemption 2,developed by the creators of grand theft auto ...,Rockstar Games,"Action Adventure, Open-World",32,M,97,8.6,this site is a joke this the first time when i...,9,1
1,ps4,Red Dead Redemption 2,developed by the creators of grand theft auto ...,Rockstar Games,"Action Adventure, Open-World",32,M,97,8.6,fair review of rdr2 im almost <number> finishe...,7,1
2,ps4,Red Dead Redemption 2,developed by the creators of grand theft auto ...,Rockstar Games,"Action Adventure, Open-World",32,M,97,8.6,i really wanted to love it the overworld is be...,6,1
3,ps4,Red Dead Redemption 2,developed by the creators of grand theft auto ...,Rockstar Games,"Action Adventure, Open-World",32,M,97,8.6,beautiful graphics excellent voice acting lots...,7,1
4,ps4,Red Dead Redemption 2,developed by the creators of grand theft auto ...,Rockstar Games,"Action Adventure, Open-World",32,M,97,8.6,this game is really overrated the amazing envi...,7,1
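
The `Unnamed: 0` column in the output above usually means the CSV was written with its index and then read back without `index_col`. A minimal sketch of the difference follows; it uses a small in-memory CSV (hypothetical toy data, not the real file) so it runs on its own.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (assumption: the first, unnamed
# column is an index that was written out by an earlier to_csv call).
csv_text = ",console,critic_score\n0,ps4,97\n1,ps4,97\n"

# Default read: the saved index comes back as a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to treat that first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

If the extra column is not needed downstream, reading with `index_col=0` keeps it out of the feature set before the dummy-encoding step.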


## Let's take a look at some of the columns from our EDA to see if we need to make any edits.

## `avg_user_score` had some potential outliers (scores below 6.05), let's take a closer look

In [4]:
final_reviews_df.target.value_counts()

1    57740
0    54605
Name: target, dtype: int64

In [5]:
final_reviews_df[final_reviews_df['avg_user_score'] > 6.05].target.value_counts()

1    57740
0    48002
Name: target, dtype: int64

### Removing these outliers removes 6603 reviews, all with a target of 0 (54605 before, 48002 after; the 57740 reviews with a target of 1 are untouched). Let's go ahead and drop these from our dataframe.

In [6]:
final_reviews_df.drop((final_reviews_df[final_reviews_df['avg_user_score'] < 6.05].index), inplace=True)
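
One way to double-check how many reviews a threshold removes from each class is to tally the rows that fall below it before dropping them. The sketch below uses a hypothetical helper, `removed_by_target`, and a toy dataframe; neither is part of the original notebook.

```python
import pandas as pd

def removed_by_target(df, column, threshold, target_col="target"):
    """Count, per target class, the rows where `column` is strictly below `threshold`."""
    below = df[column] < threshold
    return df.loc[below, target_col].value_counts().sort_index()

# Toy data for illustration only.
toy = pd.DataFrame({
    "avg_user_score": [5.9, 6.0, 7.2, 8.6, 8.6],
    "target": [0, 0, 0, 1, 1],
})

removed = removed_by_target(toy, "avg_user_score", 6.05)
print(removed)
```

On the real data, the equivalent call would be `removed_by_target(final_reviews_df, 'avg_user_score', 6.05)`, run before the `drop`.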

## `critic_score` had some potential outliers (scores below 78), let's take a look

In [7]:
final_reviews_df.target.value_counts()

1    57740
0    48002
Name: target, dtype: int64

In [8]:
final_reviews_df[final_reviews_df['critic_score'] > 78].target.value_counts()

1    57718
0    47016
Name: target, dtype: int64

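### Keeping only reviews with a `critic_score` above 78 excludes 1008 reviews: 986 with a target of 0 and 22 with a target of 1. Let's go ahead and drop the low-scoring rows from our main dataframe. Note that the drop uses `< 78`, so reviews with a score of exactly 78 are kept.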

In [9]:
final_reviews_df.drop((final_reviews_df[final_reviews_df['critic_score'] < 78].index), inplace=True)
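
The preview cell filters on `critic_score > 78`, while the drop removes `critic_score < 78`, so rows scoring exactly 78 are counted as removed in the preview but survive the drop. A minimal sketch of one way to avoid that mismatch is to define the outlier mask once and reuse it for both steps. The toy dataframe is an assumption for illustration.

```python
import pandas as pd

# Toy data for illustration only; note the two rows sitting exactly at 78.
toy = pd.DataFrame({
    "critic_score": [75, 78, 78, 80, 97],
    "target": [0, 0, 1, 0, 1],
})

# Define the rule once so the preview and the drop cannot disagree.
is_outlier = toy["critic_score"] < 78

# Preview: class balance among the rows that would be kept.
preview = toy.loc[~is_outlier, "target"].value_counts().sort_index()

# Drop: remove exactly the rows flagged by the same mask.
cleaned = toy.drop(toy[is_outlier].index)
after_drop = cleaned["target"].value_counts().sort_index()

# For comparison, a strict "> 78" preview also excludes the boundary rows.
strict_preview = toy[toy["critic_score"] > 78]

print(preview)
print(after_drop)
print(len(strict_preview))
```

Whether a score of exactly 78 should count as an outlier is a modeling decision; the point of the sketch is only that both cells should encode the same answer.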

## `user_score` had some potential outliers (scores below 2.5), let's take a look

In [10]:
final_reviews_df.target.value_counts()

1    57718
0    47120
Name: target, dtype: int64

In [11]:
final_reviews_df[final_reviews_df['user_score'] > 2.5].target.value_counts()

1    54506
0    39923
Name: target, dtype: int64

In [12]:
final_reviews_df[final_reviews_df['user_score'] < 2.5]['video_game_name'].value_counts()[-100:-50]

Mortal Kombat X                                                              4
Chivalry 2                                                                   4
Warframe                                                                     4
Injustice 2: Legendary Edition                                               4
Forza Horizon 2                                                              4
Astro Bot: Rescue Mission                                                    4
Bug Fables: The Everlasting Sapling                                          4
The Orange Box                                                               4
Control: Ultimate Edition                                                    4
DiRT Rally                                                                   4
The King of Fighters XV                                                      4
DmC: Devil May Cry Definitive Edition                                        4
Metro Exodus: Complete Edition                      

### Removing these outliers removes 10409 reviews: 7197 with a target of 0 and 3212 with a target of 1. Most of these correspond to only a few reviews per game. Let's remove these outliers.

In [13]:
final_reviews_df.drop((final_reviews_df[final_reviews_df['user_score'] < 2.5].index), inplace=True)
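
The cutoffs used in this section come from the earlier EDA, which is not shown here. If they were read off boxplots, the conventional lower whisker is Q1 minus 1.5 times the interquartile range. The sketch below shows that rule on toy scores; the helper `lower_fence` and the data are hypothetical, and this is not necessarily how the cutoffs above were chosen.

```python
import pandas as pd

def lower_fence(series, k=1.5):
    """Return Q1 - k * IQR, the usual lower boxplot whisker bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q1 - k * (q3 - q1)

# Toy user scores for illustration only.
scores = pd.Series([1, 6, 7, 7, 8, 8, 9, 9, 10, 10])

fence = lower_fence(scores)
flagged = scores[scores < fence]

print(fence)
print(flagged.tolist())
```

Applying `lower_fence` to a column such as `final_reviews_df['user_score']` would give a data-driven cutoff to compare against the hand-picked one.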

In [17]:
final_reviews_df['video_game_name'].value_counts()[:10]

Red Dead Redemption 2                   2221
Elden Ring                              2177
The Witcher 3: Wild Hunt                2135
Grand Theft Auto V                      1638
DOOM Eternal                            1560
Metal Gear Solid V: The Phantom Pain    1436
Assassin's Creed Valhalla               1300
Fallout 4                               1275
Undertale                               1198
Sekiro: Shadows Die Twice               1141
Name: video_game_name, dtype: int64

## Now let's dummy our categorical columns

In [14]:
# reviews_df[reviews_df['video_game_name'] == 'MotoGP 22'].sample(n=1000, replace=True)
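
The dummy-encoding step itself is not shown in this section. As a rough sketch of one possible approach, single-valued columns such as `console` and `esrb_rating` can go through `pd.get_dummies`, while the comma-separated `genre(s)` column can be split into one indicator per genre with `Series.str.get_dummies`. The toy rows, the `xbox-one` and `T` values, and the `genre_` prefix are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the dataframe above.
toy = pd.DataFrame({
    "console": ["ps4", "xbox-one", "ps4"],
    "esrb_rating": ["M", "T", "M"],
    "genre(s)": [
        "Action Adventure, Open-World",
        "Action Adventure",
        "Role-Playing",
    ],
})

# One indicator per category; drop_first avoids a redundant reference column.
single = pd.get_dummies(toy[["console", "esrb_rating"]], drop_first=True)

# Multi-label column: split on ", " so each genre gets its own indicator.
genres = toy["genre(s)"].str.get_dummies(sep=", ").add_prefix("genre_")

encoded = pd.concat([single, genres], axis=1)
print(encoded)
```

Dropping the first level is a common choice for linear models; tree-based models generally do not need it, so `drop_first` can be left at its default there.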