# Predicting Global Sales of Games

## Problem

1. Can we accurately predict the global sales of a game before its release using the average of user ratings and critic ratings from previous titles?

2. Can we accurately predict the global sales of a game after receiving critic scores but before obtaining user ratings?

3. Can we accurately predict the total global sales of a game using all available features, including average user ratings and critic scores from previous titles?

# Plan for my solution:

Problem Analysis:

1. What type of problem is it?
2. Can we break the problem down into subproblems?
3. Is there anything we need to consider when tackling the problem?

Analysing the Dataset:

1. What is the taxonomy of the data?
2. What is the Data Type and its Category?
3. Is there anything we need to be aware of for preparing the data for statistical analysis?

Preparing Data:

1. Remove rows that contain empty values
2. Save the data into a new Clean_Games_Sales_Data.csv file
3. Validate the cleaned data

Statistical Analysis:

# Step 1: Problem Analysis

The problem is a regression problem, since we want to predict a numerical value (global sales) from input data.
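
To make "regression" concrete, here is a minimal sketch of a one-feature least-squares fit in plain Python. The numbers are made up for illustration and are not taken from the dataset, and `fit_line` and `predict` are hypothetical helper names, not part of any library.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the target for a single feature value."""
    return intercept + slope * x


# Made-up example: a score as the feature, sales as the target (not real data)
scores = [60.0, 70.0, 80.0, 90.0]
sales = [1.0, 1.5, 2.0, 2.5]

slope, intercept = fit_line(scores, sales)
print("Predicted sales for a score of 85:", predict(85.0, slope, intercept))
```

A real model for this problem would use more than one feature, but the idea is the same: learn coefficients from known games and apply them to a new one.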

1. Before sale: This scenario predicts global sales using only the average user and critic ratings of previous titles, before the game's release. It will probably give only a rough estimate and is unlikely to be as accurate as a model that uses all available features. Nonetheless, it could still be useful as an early indication of the game's potential sales performance.

2. After critic score: In this scenario, we predict global sales once the critic scores are available but before user ratings exist. This might yield better predictions than the first scenario, since critic scores can be a significant factor in game sales.

3. Total sale: This scenario uses all available features, including user ratings and critic scores, to predict global sales. It is likely to give the most accurate predictions because it takes the most information into account. Starting with this scenario makes sense, as it lets the model learn from the most comprehensive dataset, and we can experiment with the other scenarios later if needed.
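
The three scenarios differ mainly in which columns the model is allowed to see. The sketch below expresses that as one feature list per scenario. All column names here (`Prev_Avg_Critic_Score`, `Prev_Avg_User_Score`, `Critic_Score`, `User_Score`) are assumptions for illustration, the averages from previous titles would have to be computed separately, and the third scenario is reduced to the score columns for brevity.

```python
# Hypothetical column names; adjust them to the actual dataset.
PREVIOUS_TITLE_FEATURES = ["Prev_Avg_Critic_Score", "Prev_Avg_User_Score"]

SCENARIO_FEATURES = {
    "before_release": PREVIOUS_TITLE_FEATURES,
    "after_critic_score": PREVIOUS_TITLE_FEATURES + ["Critic_Score"],
    "all_features": PREVIOUS_TITLE_FEATURES + ["Critic_Score", "User_Score"],
}


def select_features(row, scenario):
    """Return the feature values the model may see in the given scenario."""
    return [row[name] for name in SCENARIO_FEATURES[scenario]]


# Made-up sample row (not real data)
sample = {
    "Prev_Avg_Critic_Score": 78.0,
    "Prev_Avg_User_Score": 7.4,
    "Critic_Score": 82.0,
    "User_Score": 8.1,
}

for scenario in SCENARIO_FEATURES:
    print(scenario, select_features(sample, scenario))
```

Keeping the feature lists in one place makes it easy to train the same model once per scenario and compare the results.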

# Step 2: Data Analysis

## Data Taxonomy

- It is a structured dataset as the data is stored in a CSV file with labeled columns.

## Data Type & Category

- The data is of mixed type. Numerical columns such as Year_of_Release and EU_Sales are quantitative, while the label columns are categorical (ordinal where the labels follow a specific order).
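
One way to check the type of each column is to test whether its values parse as numbers. The sketch below does this on a small made-up sample. Only `Year_of_Release` and `EU_Sales` are column names used in this notebook; `Genre` is an assumed column name, and `is_number` and `column_kind` are hypothetical helpers.

```python
import csv
import io

# Made-up sample rows (not real data)
SAMPLE_CSV = """Genre,Year_of_Release,EU_Sales
Sports,2006,1.20
Racing,1999,0.35
Puzzle,2010,0.00
"""


def is_number(text):
    """True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_kind(values):
    """Label a column 'numerical' if every non-empty value parses as a float."""
    non_empty = [v for v in values if v != ""]
    if non_empty and all(is_number(v) for v in non_empty):
        return "numerical"
    return "categorical"


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kinds = {name: column_kind([row[name] for row in rows]) for name in rows[0]}
print(kinds)
```

This only separates numerical from categorical columns. Whether a categorical column is ordinal still has to be decided by looking at what its labels mean.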

## Preparing the Data

- There are a lot of rows with empty values in the dataset, so these need to be removed.
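
Before dropping anything, it can help to see how many values are missing in each column. A minimal sketch on a made-up sample follows. `Name` and `Critic_Score` are assumed column names, and `count_missing` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up sample rows (not real data)
SAMPLE_CSV = """Name,Year_of_Release,Critic_Score
Game A,2006,76
Game B,,
Game C,2010,
"""


def count_missing(rows):
    """Count empty fields per column across a list of dict rows."""
    missing = Counter()
    for row in rows:
        for name, value in row.items():
            if value == "":
                missing[name] += 1
    return missing


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
for name in rows[0]:
    print(f"{name}: {missing[name]} missing of {len(rows)}")
```

If one column accounts for most of the gaps, dropping every incomplete row may discard far more data than expected, which is worth knowing before cleaning.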

# Step 3: Preparing the data



Removing rows that contain empty fields from the dataset

In [5]:
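import csv

# Read the raw data and keep only rows in which every field is non-empty
with open('Video_Games_Sales_as_at_22_Dec_2016.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    csv_data = [row for row in reader if all(row)]

# Write the remaining rows (header included) to a new file
with open('Clean_Games_Sales_Data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(csv_data)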
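
A quick way to validate the result is to confirm that no remaining row has an empty field and that every row has as many fields as the header. The sketch below applies the same `all(row)` filter to a made-up in-memory sample instead of the real file; `find_bad_rows` is a hypothetical helper.

```python
import csv
import io

# Made-up sample rows (not real data)
RAW_CSV = """Name,Year_of_Release,EU_Sales
Game A,2006,1.20
Game B,,0.35
Game C,2010,0.00
"""


def find_bad_rows(rows):
    """Return indices of rows with an empty field or the wrong number of fields."""
    width = len(rows[0])  # the header defines the expected width
    return [i for i, row in enumerate(rows) if len(row) != width or not all(row)]


raw_rows = list(csv.reader(io.StringIO(RAW_CSV)))
clean_rows = [row for row in raw_rows if all(row)]  # same filter as the cell above

print("Rows before cleaning:", len(raw_rows))
print("Rows after cleaning:", len(clean_rows))
print("Bad rows remaining:", find_bad_rows(clean_rows))
```

Note that `csv.reader` yields every field as a string, so a value like `0.00` counts as non-empty and the row is kept.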

# Step 4: Validating the Data

## Range Checking

I want to range check the features.

### Year of Release

In [7]:
import pandas as pd

# Load the CSV file into a pandas dataframe
df = pd.read_csv('Clean_Games_Sales_Data.csv')

# Find the minimum and maximum years of release
min_year = df['Year_of_Release'].min()
max_year = df['Year_of_Release'].max()

# Convert the values to integers
min_year = int(min_year)
max_year = int(max_year)

# Print the results
print("Minimum year of release:", min_year)
print("Maximum year of release:", max_year)

Minimum year of release: 1985
Maximum year of release: 2016
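
The minimum and maximum can also be compared against an expected range, so that violations are reported automatically. A minimal sketch, assuming a plausible range of 1970 to 2016 for release years. The lower bound is a guess and not something taken from the dataset, the years are made up, and `out_of_range` is a hypothetical helper.

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the inclusive range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


# Made-up release years (not real data)
years = [1985, 1999, 2016, 2020, 1901]

violations = out_of_range(years, low=1970, high=2016)
print("Out-of-range years:", violations)
```

An empty list means every value passed the range check.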


### EU Sales

In [9]:
import pandas as pd

# Load the CSV file into a pandas dataframe
df = pd.read_csv('Clean_Games_Sales_Data.csv')

# Find the minimum and maximum EU sales
min_sales = df['EU_Sales'].min()
max_sales = df['EU_Sales'].max()

# Print the results
print("Minimum EU sales:", min_sales)
print("Maximum EU sales:", max_sales)

Minimum EU sales: 0.0
Maximum EU sales: 28.96
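
Rather than repeating the cell for every column, the same check can be looped over all numeric columns. Below is a sketch in plain Python on a made-up sample; `column_ranges` is a hypothetical helper. With the dataframe already loaded, `df.describe()` reports the same minima and maxima for the numeric columns.

```python
import csv
import io

# Made-up sample rows (not real data)
SAMPLE_CSV = """Year_of_Release,EU_Sales
2006,1.20
1999,0.35
2010,0.00
"""


def column_ranges(rows, columns):
    """Return {column: (min, max)} for the given numeric columns."""
    ranges = {}
    for name in columns:
        values = [float(row[name]) for row in rows]
        ranges[name] = (min(values), max(values))
    return ranges


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ranges = column_ranges(rows, ["Year_of_Release", "EU_Sales"])
for name, (low, high) in ranges.items():
    print(f"{name}: min={low}, max={high}")
```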


## Boundary Checking

I want to boundary check the numerical features.
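
Boundary checking focuses on the values at, and just beyond, the limits of the allowed range. A minimal sketch, assuming the limits are inclusive. The limits 0.0 and 100.0 are placeholders and not values from the dataset, and `is_within` is a hypothetical helper.

```python
def is_within(value, low, high):
    """True if value lies in the inclusive range [low, high]."""
    return low <= value <= high


LOW, HIGH = 0.0, 100.0  # placeholder limits, not taken from the dataset

# Probe the limits themselves and the values just outside them
probes = [LOW - 0.01, LOW, HIGH, HIGH + 0.01]
for value in probes:
    print(value, "->", "ok" if is_within(value, LOW, HIGH) else "rejected")
```

The two limits should be accepted and the two values just outside them rejected. If that is not the case, the comparison operators in the check are off by one boundary.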

# Step 5: Statistical Analysis

I want to understand the data I'm working with before training an ML model on it.

I need to store the features of each sample in a list that can be used as input for the ML model.
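
Here is a sketch of how rows might be turned into a list of feature vectors and a list of targets. `Critic_Score` and `Global_Sales` are assumed column names, the rows are made up, and `build_dataset` is a hypothetical helper.

```python
import csv
import io

# Made-up sample rows (not real data)
SAMPLE_CSV = """Year_of_Release,Critic_Score,Global_Sales
2006,76,3.10
1999,88,5.40
2010,61,0.75
"""

FEATURES = ["Year_of_Release", "Critic_Score"]  # assumed feature columns
TARGET = "Global_Sales"  # assumed target column


def build_dataset(rows, features, target):
    """Split dict rows into feature vectors X and target values y."""
    X = [[float(row[name]) for name in features] for row in rows]
    y = [float(row[target]) for row in rows]
    return X, y


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
X, y = build_dataset(rows, FEATURES, TARGET)
print("First sample:", X[0], "target:", y[0])
```

Each inner list of `X` is one sample, and `y` holds the matching sales value at the same index.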

I need to analyse the metrics of my model and compare them with those of the other models I create:

1. Overfitting (reduced accuracy and performance, increased variability, decreased interpretability)
2. Accuracy
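
For a regression model, accuracy is usually expressed through error metrics, and overfitting tends to show up as a much lower error on the training data than on held-out data. The sketch below illustrates both ideas. The targets and predictions are made up, and `mae` and `r_squared` are helper functions written here, not library calls.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions (not real model output)
train_true, train_pred = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]
test_true, test_pred = [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]

train_mae = mae(train_true, train_pred)
test_mae = mae(test_true, test_pred)
print("Train MAE:", train_mae, "R^2:", r_squared(train_true, train_pred))
print("Test MAE:", test_mae, "R^2:", r_squared(test_true, test_pred))

# A large gap between training and test error hints at overfitting
print("Gap (test - train):", test_mae - train_mae)
```

Computing the same metrics for every model on the same held-out data gives a fair basis for comparing them.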
