# <span style="color:darkBlue"> Project Report:  Game Price Prediction Project</span>

## <span style="color:Purple"> 1) Introduction </span>

The objective of this project is to predict the prices of video games from features gathered from the SteamDB website. Accurate price prediction can help developers and publishers plan their marketing and sales efforts, and can help consumers make informed purchasing decisions.

## <span style="color:Purple"> 2) Data Collection </span>
The data collection phase involved web scraping from the SteamDB website to gather information on approximately 3000 video games. This phase was divided into two main steps:  

#### 2.1 Web Scraping Game Information

In the first step, game URLs and detailed game information were scraped from the SteamDB search pages. The key features collected for each game include:
- `NAME`: The title of the game.
- `STORE_GENRE`: The genre(s) of the game.
- `RATING_SCORE`: The overall rating score of the game.
- `N_SUPPORTED_LANGUAGES`: The number of languages supported by the game.
- `DEVELOPERS`: The developer(s) of the game.
- `SUPPORTED_PLATFORMS`: The platforms on which the game is available.
- `POSITIVE_REVIEWS`: The number of positive reviews.
- `NEGATIVE_REVIEWS`: The number of negative reviews.
- `TECHNOLOGIES`: The technologies used in the game.
- `RELEASE_DATE`: The release date of the game.
- `TOTAL_TWITCH_PEAK`: The peak number of viewers on Twitch.
- `PRICE`: The price of the game.
- `N_DLC`: The number of downloadable content (DLC) available.
- `24_HOUR_PEAK`: The peak number of players in the last 24 hours.

This step involved several challenges, including handling the large volume of data (3000 games) and ensuring the scraping process was efficient and reliable. The data collected in this step was saved in two files:
- `game_urls.txt`: Contains URLs of the games.
- `games_info.csv`: Contains detailed information of the games.
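
One common way to keep a long scraping run reliable is to retry failed requests with a growing pause between attempts. The following is a minimal sketch of that idea, not the code from `webScrape_1.ipynb`: the helper name `fetch_with_retry`, its parameters, and the stub `fake_fetch` are all hypothetical, and the real download function is left as a pluggable callable.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url); on any exception, wait and try again.

    Returns the fetched text, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                sleep(delay_s * attempt)
    return None


# Usage with a stand-in fetch function (hypothetical; no network access).
def fake_fetch(url: str) -> str:
    return f"<html>page for {url}</html>"


page = fetch_with_retry(fake_fetch, "https://example.invalid/app/1")
print(page)
```

Passing the download function in as an argument keeps the retry logic independent of the HTTP library used in the notebook.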

#### 2.2 Web Scraping Game Prices

In the second step, the focus was on gathering the prices of the games in USD to ensure consistency across different regions. This involved scraping the name, release date, and price of each game from their respective pages. This data was saved in:
- `games_details.csv`: Contains the name, release date, and price of the games.

*** 

#### Data Files and Notebooks

The following files and notebooks were created and used during the data collection phase:
1. **Data Files**:
   - `game_urls.txt`: A text file containing URLs of the games.
   - `games_info.csv`: A CSV file containing detailed game information.
   - `games_details.csv`: A CSV file containing the name, release date, and price of the games.

2. **Notebooks**:
   - `webScrape_1.ipynb`: Notebook used to scrape game URLs and detailed information, resulting in `game_urls.txt` and `games_info.csv`.
   - `webScrape_2_price.ipynb`: Notebook used to scrape game prices, resulting in `games_details.csv`.


The data collection phase provided a comprehensive dataset that includes various features of video games and their prices. This dataset forms the foundation for the subsequent analysis and price prediction models. The challenges encountered during web scraping were addressed effectively to ensure data accuracy and completeness.


## <span style="color:Purple">B) Preprocessing</span>



### 1. Introduction

The preprocessing phase ensures that the data is clean, consistent, and ready for subsequent analysis. It involves handling missing values, correcting data types, creating new features, and resolving inconsistencies. The preprocessing steps were executed in the `Preprocessing.ipynb` notebook, with `games_info.csv` and `games_details.csv` as inputs, resulting in the output file `preprocessed_game_info.csv`.

### 2. Methodology

#### a. Loading Data

The datasets `games_info.csv` and `games_details.csv` were loaded into Pandas DataFrames to facilitate data manipulation. This step sets the foundation for all subsequent data processing tasks.
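
As an illustration, loading might look like the sketch below. The in-memory CSV text and its values are made up for the example, and the use of `keep_default_na=False` is an assumption: by default `pd.read_csv` already parses the string `N/A` as missing, so this option is only needed if the raw strings should be kept for explicit handling later.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the two scraped CSV files (illustrative values).
info_csv = io.StringIO(
    "NAME,PRICE,RELEASE_DATE\n"
    "Game A,$9.99,12 March 2020\n"
    "Game B,N/A,N/A\n"
)
details_csv = io.StringIO(
    "NAME,RELEASE_DATE,PRICE\n"
    "Game A,12 March 2020,$9.99\n"
    "Game B,5 June 2018,$19.99\n"
)

# In a notebook these arguments would be file paths instead of StringIO objects.
# keep_default_na=False keeps the literal "N/A" strings instead of converting them to NaN.
games_info = pd.read_csv(info_csv, keep_default_na=False)
games_details = pd.read_csv(details_csv, keep_default_na=False)

print(games_info.shape, games_details.shape)
```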

#### b. Initial Data Exploration

Basic exploratory data analysis was performed to understand the structure and content of the datasets. This included checking the data types of each column, identifying missing values, and generating summary statistics to get a high-level overview of the data.
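
The three checks mentioned above map onto standard pandas calls. A minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "NAME": ["Game A", "Game B", "Game C"],
        "PRICE": [9.99, None, 19.99],
        "N_SUPPORTED_LANGUAGES": [5, 12, None],
    }
)

dtypes = df.dtypes                 # data type of each column
missing_counts = df.isna().sum()   # number of missing values per column
summary = df.describe()            # summary statistics for the numeric columns

print(dtypes, missing_counts, summary, sep="\n\n")
```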

#### c. Removing Duplicated Games

Duplicates in the dataset can lead to skewed analysis results. To prevent this, duplicate entries based on the 'NAME' column were identified and removed.
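
A short sketch of name-based deduplication, using made-up rows. Keeping the first occurrence (`keep="first"`) is an assumption; the report does not say which copy was retained.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "NAME": ["Game A", "Game B", "Game A"],
        "PRICE": [9.99, 4.99, 9.99],
    }
)

# Count rows whose NAME already appeared earlier in the frame.
n_duplicates = int(df.duplicated(subset="NAME").sum())

# Keep the first row for each NAME and renumber the index.
deduped = df.drop_duplicates(subset="NAME", keep="first").reset_index(drop=True)

print(n_duplicates, len(deduped))
```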

#### d. Handling Missing Values

Several steps were taken to handle missing values across different columns:

- **General Replacement**: 'N/A' values in the dataset were standardized by replacing them with Pandas' `pd.NA`.
- **'DEVELOPERS' Column**: To maintain data integrity, rows with missing values in the 'DEVELOPERS' column were dropped.
- **'RELEASE_DATE' Column**: The 'RELEASE_DATE' column was processed to extract the year, creating a new 'PUBLISH_YEAR' column. Missing years were filled with the median value of the column, and the original 'RELEASE_DATE' column was subsequently dropped.
- **'PRICE' Column**: Missing prices were filled using the corresponding values from the `games_details.csv` file. Prices were then cleaned and converted to numeric values, ensuring consistency across the dataset. Rows still missing a price were dropped to avoid incomplete entries.
- **'N_SUPPORTED_LANGUAGES' Column**: Missing values were filled with a default value, and the column was cleaned and converted to an integer type to ensure proper data formatting.
- **'RATING_SCORE' Column**: Missing values were filled with a placeholder, cleaned, and converted to numeric values. The placeholder values were then replaced with the mean rating score to ensure a realistic representation of the data.
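
Several of the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the sample rows, the date format (`12 March 2020`), the `$` price prefix, and the `%` rating suffix are all made up, and the rating step goes straight from `NaN` to the mean instead of using an intermediate placeholder. The `N_SUPPORTED_LANGUAGES` step is omitted because its default value is not stated.

```python
import pandas as pd

info = pd.DataFrame(
    {
        "NAME": ["Game A", "Game B", "Game C", "Game D"],
        "DEVELOPERS": ["Studio X", "N/A", "Studio Y", "Studio Z"],
        "RELEASE_DATE": ["12 March 2020", "5 June 2018", "N/A", "1 May 2016"],
        "PRICE": ["$9.99", "$4.99", "N/A", "$19.99"],
        "RATING_SCORE": ["80.00%", "70.00%", "N/A", "90.00%"],
    }
)
details = pd.DataFrame({"NAME": ["Game C"], "PRICE": ["$14.99"]})

# Standardize the scraped 'N/A' strings as pandas missing values.
info = info.replace("N/A", pd.NA)

# Drop rows without a developer.
info = info.dropna(subset=["DEVELOPERS"]).reset_index(drop=True)

# Extract a four-digit year, fill gaps with the median, then drop the raw date.
year = pd.to_numeric(
    info["RELEASE_DATE"].str.extract(r"(\d{4})", expand=False), errors="coerce"
)
info["PUBLISH_YEAR"] = year.fillna(year.median()).astype(int)
info = info.drop(columns=["RELEASE_DATE"])

# Fill missing prices from the second scrape, matched by game name.
price_lookup = details.set_index("NAME")["PRICE"]
info["PRICE"] = info["PRICE"].fillna(info["NAME"].map(price_lookup))

# Strip everything except digits and the decimal point, then convert to float.
info["PRICE"] = pd.to_numeric(
    info["PRICE"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)
info = info.dropna(subset=["PRICE"]).reset_index(drop=True)

# Convert ratings to numbers and fill the gaps with the mean rating.
rating = pd.to_numeric(info["RATING_SCORE"].str.rstrip("%"), errors="coerce")
info["RATING_SCORE"] = rating.fillna(rating.mean())

print(info)
```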

#### e. Data Cleaning and Feature Engineering

Several columns required specific cleaning and feature engineering steps:

- **'STORE_GENRE' Column**: Missing values were filled, unnecessary text was removed, and the genres were split into a list format to facilitate better analysis.
- **'24_HOUR_PEAK' Column**: This column was cleaned by filling missing values and extracting relevant numerical values, which were then converted to an integer type.
- **'TECHNOLOGIES' Column**: Missing values were filled with empty strings, and the technologies were split into lists for better usability during analysis.
- **'TOTAL_TWITCH_PEAK' Column**: The column was split into 'TWITCH_PEAK_HOUR' and 'TWITCH_PEAK_YEAR', both of which were cleaned and converted to numeric types. The original 'TOTAL_TWITCH_PEAK' column was dropped after extracting the necessary information.
- **'N_DLC' Column**: Missing values were temporarily filled during cleaning and then restored to missing. The percentage of null values was then calculated, and because it was high, the column was dropped.
- **'TOTAL_REVIEW' Column**: A new column was created to represent the proportion of positive reviews out of the total reviews, providing insights into the overall sentiment and reception of the games.
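
A sketch of three of these transformations on made-up rows. The comma-separated genre and technology strings, the thousands separator in `24_HOUR_PEAK`, and the choice to use `0` for games with no peak or no reviews are assumptions made for the example. The `TOTAL_TWITCH_PEAK` split is left out because the raw format of that column is not described here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "STORE_GENRE": ["Action, Adventure", None, "Indie"],
        "TECHNOLOGIES": ["Unity, FMOD", None, ""],
        "24_HOUR_PEAK": ["1,234", None, "56"],
        "POSITIVE_REVIEWS": [900, 50, 0],
        "NEGATIVE_REVIEWS": [100, 50, 0],
    }
)


def split_list(text: str) -> list:
    """Split a comma-separated string into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


# Missing genres and technologies become empty lists.
df["STORE_GENRE"] = df["STORE_GENRE"].fillna("").apply(split_list)
df["TECHNOLOGIES"] = df["TECHNOLOGIES"].fillna("").apply(split_list)

# Remove thousands separators, pull out the digits, and convert to int.
peak_digits = (
    df["24_HOUR_PEAK"].str.replace(",", "", regex=False).str.extract(r"(\d+)", expand=False)
)
df["24_HOUR_PEAK"] = pd.to_numeric(peak_digits, errors="coerce").fillna(0).astype(int)

# Proportion of positive reviews; games with no reviews get 0.0 (an assumption).
total = df["POSITIVE_REVIEWS"] + df["NEGATIVE_REVIEWS"]
df["TOTAL_REVIEW"] = (df["POSITIVE_REVIEWS"] / total.where(total > 0)).fillna(0.0)

print(df)
```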

### 3. Data Integration

The cleaned and preprocessed data from the primary dataset (`games_info.csv`) and the secondary dataset (`games_details.csv`) were merged to form a unified dataset. This step combined all relevant information into a single dataset for further analysis.
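
A minimal sketch of such a merge, assuming the two frames are joined on `NAME` and that names are unique in both after deduplication. The join key, the left join, and the column name `PRICE_USD` are hypothetical choices for the example; the report does not specify them.

```python
import pandas as pd

info = pd.DataFrame(
    {"NAME": ["Game A", "Game B"], "RATING_SCORE": [80.0, 70.0]}
)
details = pd.DataFrame(
    {"NAME": ["Game A", "Game B", "Game C"], "PRICE_USD": [9.99, 4.99, 14.99]}
)

# A left join keeps every row of the primary frame;
# validate="one_to_one" raises an error if NAME is duplicated on either side.
merged = info.merge(details, on="NAME", how="left", validate="one_to_one")

print(merged)
```

The `validate` argument is a cheap guard: if duplicates slipped through earlier steps, the merge fails loudly instead of silently multiplying rows.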

### 4. Final Dataset Preparation

A final check was conducted to ensure all preprocessing steps were correctly applied. This included verifying that there were no remaining missing values, that all data types were correct, and that all necessary transformations had been completed. The cleaned and prepared dataset was then saved as `preprocessed_game_info.csv`.
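
The final check could be sketched as below. The helper `validate` and the list of columns it expects to be numeric are hypothetical, and an in-memory buffer stands in for the output file so the example has no side effects.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "NAME": ["Game A", "Game B"],
        "PRICE": [9.99, 4.99],
        "PUBLISH_YEAR": [2020, 2018],
    }
)


def validate(frame: pd.DataFrame, numeric_columns=("PRICE", "PUBLISH_YEAR")) -> list:
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    missing = frame.isna().sum()
    for col, count in missing[missing > 0].items():
        problems.append(f"{col}: {count} missing value(s)")
    for col in numeric_columns:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"{col}: expected a numeric dtype, got {frame[col].dtype}")
    return problems


problems = validate(df)
buffer = io.StringIO()  # stand-in for the output CSV file
if not problems:
    # In a notebook, a file path would be passed here instead of the buffer.
    df.to_csv(buffer, index=False)

print(problems)
```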

### 5. Summary

In summary, the preprocessing phase involved:
- Loading and exploring the data.
- Removing duplicate entries.
- Handling missing values across various columns.
- Cleaning data and engineering new features.
- Integrating datasets to form a comprehensive dataset.

These steps ensured that the data was clean, consistent, and ready for subsequent analysis and modeling. The output of this phase is `preprocessed_game_info.csv`.

## <span style="color:Purple">C) Analysis</span>