# Final Project: Video Game Sales Prediction

**Problem Description:**

This dataset contains several observations, each representing a single video game, with features including the game title, its Metacritic rating, publisher, developer, and more.

**Warning**: Not every game has a matching Metacritic review, since some of the older games predate Metacritic. You will have to weigh your options for handling this missing data: drop the affected rows, fill in the values, etc.
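
As a rough illustration of the drop-versus-fill trade-off, here is a minimal sketch on a tiny made-up table. The column name `Critic_Score` and the sample values are assumptions for illustration only; check the real dataset for the actual column names.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real dataset's column names may differ.
games = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Critic_Score": [85.0, np.nan, 70.0, np.nan],  # hypothetical Metacritic column
    "Global_sales": [1.2, 0.4, 3.1, 0.9],
})

# Option 1: drop games that have no review score (loses rows).
dropped = games.dropna(subset=["Critic_Score"])

# Option 2: fill with the median, and keep a flag recording which rows were filled
# so a model can still tell real scores from imputed ones.
filled = games.copy()
filled["Critic_Score_missing"] = filled["Critic_Score"].isna()
filled["Critic_Score"] = filled["Critic_Score"].fillna(filled["Critic_Score"].median())

print(len(dropped))
print(filled)
```

Dropping is simpler but may discard most of the older games; filling keeps every row at the cost of inventing values, which is why the sketch also records a "was missing" flag.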

Your task is to predict the ```Global_sales``` column. To do so, you must perform some EDA to find which features are most strongly correlated with ```Global_sales```, clean and preprocess the data, split the dataset into training and test sets, fit several models, and evaluate their performance to find the best-performing one.



<hr>

<br>

## Overview

<br>

**Import Dataset**

**Exploratory Data Analysis:**
  - Identify Feature Data Types and Values Counts
  - Analyze Distributions Using Various Types of Plots
  - Identify What Each Feature's Distribution Is Describing About The Target Variable
    - [Click Link To Explore Whether It Is Possible To Transform This Distribution To A Normal Distribution](https://medium.com/ai-techsystems/gaussian-distribution-why-is-it-important-in-data-science-and-machine-learning-9adbe0e5f8ac)
  - Analyze Correlations Using A Heatmap
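
The EDA steps above could be sketched roughly as follows. The data here is randomly generated and the column names (`Genre`, `Critic_Score`, `Global_sales`) are assumptions, not the real dataset's schema; the log transform is just one possible way to make a right-skewed distribution look more normal.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
games = pd.DataFrame({
    "Genre": rng.choice(["Action", "Sports", "Puzzle"], size=n),
    "Critic_Score": rng.uniform(40, 95, size=n),
    "Global_sales": rng.lognormal(mean=0.0, sigma=1.0, size=n),
})

# Feature data types and value counts.
print(games.dtypes)
print(games["Genre"].value_counts())

# Skewness before and after a log transform (closer to 0 means more symmetric).
raw_skew = games["Global_sales"].skew()
log_skew = np.log1p(games["Global_sales"]).skew()
print(f"skew raw={raw_skew:.2f}, log1p={log_skew:.2f}")

# Correlation matrix of the numeric columns, sorted against the target.
corr = games.select_dtypes("number").corr()
print(corr["Global_sales"].sort_values(ascending=False))
```

The `corr` matrix can then be handed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to draw the heatmap.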

<br>

**Data Cleaning:**
  - Handle NaN Values, Possibly Using Imputation
  - Handle Duplicates
  - Fix Structural Errors
    - Typos
    - Possibly Bin Similar Values
  - Remove Unused Variables
  - Handle Outliers
  - [Click Link To Learn About Normalizing Features Using Z-Score](https://lazyprogrammer.me/what-the-hell-is-a-z-score/)
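
A minimal sketch of several of the cleaning steps above, on a small made-up table. The column names and values are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers, not a requirement of the project.

```python
import pandas as pd

# Made-up sample with a duplicate row, an inconsistent spelling, and one extreme value.
games = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "D", "E"],
    "Publisher": ["Nintendo", "Nintendo", "nintendo ", "Sega", "Sega", "Sega"],
    "Global_sales": [1.0, 1.0, 1.2, 0.8, 1.1, 40.0],
})

# Handle duplicates: drop rows that are exact copies.
games = games.drop_duplicates().reset_index(drop=True)

# Fix structural errors: normalize whitespace and capitalization.
games["Publisher"] = games["Publisher"].str.strip().str.title()

# Handle outliers: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = games["Global_sales"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = games["Global_sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = games[in_range].copy()

# Normalize using the z-score: (value - mean) / standard deviation.
sales = trimmed["Global_sales"]
trimmed["Global_sales_z"] = (sales - sales.mean()) / sales.std(ddof=0)

print(trimmed)
```

Whether an extreme value is an error or a genuine blockbuster is a judgment call; in a real project you may prefer to cap or transform such values rather than remove them.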
  
<br>

**Feature Engineering:**
  - Discretization For Numerical Values
  - Bin Nominal Categorical Values
    - After Binning One Hot Encode These Features
  - Encode Ordinal Categorical Values As Ordered Integers
    - Don't One Hot Encode Ordinal Categorical Values, To Preserve Their Ordering Information
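
The feature engineering ideas above might look like this in pandas. Everything in the sample is an assumption for illustration: the column names, the decade bins, the "fewer than 2 games" threshold for rare publishers, and the `E < T < M` ordering of the hypothetical `Rating` column.

```python
import pandas as pd

# Made-up sample; column names and values are hypothetical.
games = pd.DataFrame({
    "Year": [1995, 2003, 2008, 2015, 2011, 1999],
    "Publisher": ["Nintendo", "Nintendo", "Sega", "TinyStudio", "Nintendo", "IndieCo"],
    "Rating": ["E", "T", "M", "E", "T", "E"],
})

# Discretization: turn a numeric year into decade buckets.
games["Era"] = pd.cut(
    games["Year"],
    bins=[1989, 1999, 2009, 2019],
    labels=["1990s", "2000s", "2010s"],
)

# Bin nominal values: group publishers with fewer than 2 games into "Other".
counts = games["Publisher"].value_counts()
rare = counts[counts < 2].index
games["Publisher_binned"] = games["Publisher"].where(
    ~games["Publisher"].isin(rare), "Other"
)

# One hot encode the binned nominal feature.
one_hot = pd.get_dummies(games["Publisher_binned"], prefix="Publisher")

# Ordinal feature: map categories to integers that keep their order.
rating_order = {"E": 0, "T": 1, "M": 2}
games["Rating_encoded"] = games["Rating"].map(rating_order)

print(games)
print(one_hot)
```

Binning before one hot encoding keeps the number of new columns small; here four publishers collapse into two indicator columns.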

<br>

**View Distributions After Data Cleaning and Feature Engineering** 

<br>

**Preparation of Data:**
  - Separate the Dataset's Features From the Target Variable
  - Split the Data Into Training and Testing Sets
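
A minimal sketch of this preparation step using scikit-learn's `train_test_split`. The data is synthetic, the feature columns are hypothetical, and the 80/20 split and `random_state=42` are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (hypothetical columns).
rng = np.random.default_rng(0)
games = pd.DataFrame({
    "Critic_Score": rng.uniform(40, 95, size=100),
    "Year": rng.integers(1990, 2020, size=100),
    "Global_sales": rng.lognormal(size=100),
})

# Separate the features (X) from the target variable (y).
X = games.drop(columns=["Global_sales"])
y = games["Global_sales"]

# Hold out 20% of the rows for testing; fixing random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The test set should stay untouched until the final evaluation, so that the reported score reflects data the model has never seen.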

<br>

**Modeling:**
- Make a pipeline
- Make a hyper-parameter dictionary
- Perform a cross-validation grid search on the training set, setting the following parameters when initializing the `GridSearchCV` object:
  - `estimator`: the `pipeline` object instance
  - `param_grid`: the hyper-parameter dictionary
- Identify the best-performing model and its hyper-parameters from the grid search on the training set
- Evaluate the best performing model using the testing data
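
The modeling steps above can be sketched as follows. This is only one possible setup under stated assumptions: the data is synthetic, `Ridge` regression and `StandardScaler` are placeholder choices, and the `alpha` values, 5-fold cross-validation, and RMSE scoring are picked for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Make a pipeline: scaling followed by the model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Make a hyper-parameter dictionary; keys use the "<step name>__<parameter>" form.
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

# Cross-validated grid search on the training set only.
search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Best hyper-parameters found, then the refitted best model scored on the test set.
print(search.best_params_)
print("test RMSE:", -search.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold, so no information from the validation fold leaks into preprocessing.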

<br>

**REPEAT WITH NEW FEATURE ENGINEERING TECHNIQUES** 

**This is more of an iterative process!** 

You may build a model only to find that its performance is poor, which will require you to go back and engineer new features, or perform more EDA to ensure that you've selected the most important features for the problem at hand. Look at the article on [how to evaluate a linear regression model](https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/), as well as the article on [PCA](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60), to get an understanding of how each feature is impacting the model.
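
To make the evaluation step concrete, here is a small sketch that computes common regression metrics on hand-picked numbers, followed by the share of variance captured by each principal component of some synthetic features. All values are made up for illustration, and the choice of metrics is an assumption rather than a project requirement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Hand-picked true values and predictions (not real model output).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # share of variance explained
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Synthetic features: the first two columns are nearly copies of each other.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# PCA on standardized features; the ratios show how much variance each component holds.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)
```

Note that PCA describes variance among the features themselves, not their relationship to the target, so it is best read alongside the regression metrics rather than as a replacement for them.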

<br>

**Resources:**
- [Distributions and Correlations Exploratory Data Analysis, Along With Dataset Cleaning and Preparation](https://www.neuraldesigner.com/learning/tutorials/data-set)
- [Outliers](https://www.neuraldesigner.com/blog/3_methods_to_deal_with_outliers)
- [Tips On Feature Engineering](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb)
- [How To Evaluate A Linear Regression Model](https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/) 
- [Identifying Feature Importances with PCA](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)
- [Data Science Handbook](https://github.com/SoftStackFactory/PythonDataScienceHandbook/)


<hr>
<br>

# Good Luck!