# <span style="color:darkBlue"> Project Report: Steam Games Data Analysis</span>

## <span style="color:Purple"> A) Web Scraping </span>

### 1. Introduction
The goal of this part is to gather data on approximately 3000 games from the website [SteamDB](https://steamdb.info). The data collected includes various attributes of the games, which will be used for subsequent analysis, preprocessing, feature engineering, and model training.

### 2. Methodology

#### a. Initial Data Collection
1. **Target URL**: [SteamDB](https://steamdb.info)
2. **Objective**: Extract URLs of about 3000 games and gather detailed information for each game.
3. **Data Points Collected**:
    - NAME
    - STORE_GENRE
    - RATING_SCORE
    - N_SUPPORTED_LANGUAGES
    - DEVELOPERS
    - SUPPORTED_PLATFORMS
    - POSITIVE_REVIEWS
    - NEGATIVE_REVIEWS
    - TECHNOLOGIES
    - RELEASE_DATE
    - TOTAL_TWITCH_PEAK
    - PRICE
    - N_DLC
    - 24_HOUR_PEAK

#### b. Scraping Process
1. **Scripts and Notebooks**:
    - `weScrape_1.ipynb`: This notebook extracts game URLs and gathers detailed information for each game.
    - `weScrape_2_price.ipynb`: This notebook collects game prices in USD, because SteamDB can display prices in different currencies.
2. **Files Generated**:
    - `game_urls.txt`: List of URLs for the 3000 games.
    - `games_info.csv`: Contains detailed information about the games.
    - `games_details.csv`: Contains game names, release dates, and prices in USD.

#### c. Challenges Faced
1. **Time-Consuming Process**: Scraping a large number of games required significant time and computational resources. Multiple systems were used to expedite the process.
2. **Price Standardization**: Different currencies on the SteamDB website posed a challenge, which was addressed by writing a separate script to ensure all prices are in USD.

### 3. Results
The web scraping phase successfully gathered data for 3000 games, resulting in two primary datasets:
1. `games_info.csv`: Comprehensive details about each game.
2. `games_details.csv`: Focused on name, release date, and standardized price.


## <span style="color:Purple">B) Preprocessing</span>



### 1. Introduction

The preprocessing phase is crucial for ensuring that the data is clean, consistent, and ready for subsequent analysis. This phase involves handling missing values, correcting data types, creating new features, and addressing any inconsistencies. The preprocessing steps were executed in the `Preprocessing.ipynb` notebook, with `game_info.csv` and `game_details.csv` as inputs, resulting in the output file `preprocessed_game_info.csv`.

### 2. Methodology

#### a. Loading Data

The datasets `game_info.csv` and `game_details.csv` were loaded into Pandas DataFrames to facilitate data manipulation. This step sets the foundation for all subsequent data processing tasks.
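
As a rough illustration of this step, the sketch below loads two small CSV sources into DataFrames with `pd.read_csv`. The in-memory CSV text and its columns are stand-ins invented for the example so that it runs without the real files. In the notebook, the file paths would be passed instead.

```python
import io

import pandas as pd

# Hypothetical stand-in data; the real notebook would read the CSV files
# from disk, e.g. pd.read_csv("game_info.csv").
info_csv = io.StringIO(
    "NAME,DEVELOPERS,PRICE\n"
    "Game A,Studio X,$9.99\n"
    "Game B,Studio Y,$4.99\n"
)
details_csv = io.StringIO(
    "NAME,RELEASE_DATE,PRICE\n"
    "Game A,12 March 2020,9.99\n"
    "Game B,5 June 2018,4.99\n"
)

game_info = pd.read_csv(info_csv)
game_details = pd.read_csv(details_csv)

print(game_info.head())
print(game_details.head())
```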

#### b. Initial Data Exploration

Basic exploratory data analysis was performed to understand the structure and content of the datasets. This included checking the data types of each column, identifying missing values, and generating summary statistics to get a high-level overview of the data.
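
A minimal sketch of these checks, assuming a small made-up DataFrame in place of the scraped data:

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped dataset.
df = pd.DataFrame({
    "NAME": ["Game A", "Game B", "Game C"],
    "PRICE": ["$9.99", None, "$4.99"],
    "POSITIVE_REVIEWS": [120, 45, 300],
})

dtypes = df.dtypes                     # data type of each column
missing_counts = df.isna().sum()       # number of missing values per column
summary = df.describe(include="all")   # summary statistics for all columns

print(dtypes, missing_counts, summary, sep="\n\n")
```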

#### c. Removing Duplicated Games

Duplicates in the dataset can lead to skewed analysis results. To prevent this, duplicate entries based on the 'NAME' column were identified and removed.
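
One way to express this step is with `duplicated` and `drop_duplicates`. The sample rows are made up, and keeping the first occurrence of each name is an assumption, since the report does not say which copy was retained.

```python
import pandas as pd

# Hypothetical sample with one repeated game name.
games = pd.DataFrame({
    "NAME": ["Game A", "Game B", "Game A"],
    "PRICE": [9.99, 4.99, 9.99],
})

# Count rows whose NAME already appeared earlier in the frame.
n_duplicates = int(games.duplicated(subset="NAME").sum())

# Keep the first occurrence of each NAME (assumed policy) and renumber rows.
deduped = games.drop_duplicates(subset="NAME", keep="first").reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s)")
```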

#### d. Handling Missing Values

Several steps were taken to handle missing values across different columns:

- **General Replacement**: 'N/A' values in the dataset were standardized by replacing them with Pandas' `pd.NA`.
- **'DEVELOPERS' Column**: To maintain data integrity, rows with missing values in the 'DEVELOPERS' column were dropped.
- **'RELEASE_DATE' Column**: The 'RELEASE_DATE' column was processed to extract the year, creating a new 'PUBLISH_YEAR' column. Missing years were filled with the median value of the column, and the original 'RELEASE_DATE' column was subsequently dropped.
- **'PRICE' Column**: Missing prices were filled using corresponding values from the `game_details.csv` file. Prices were cleaned and converted to numeric values, ensuring consistency across the dataset. Remaining rows with missing prices were dropped to avoid incomplete data entries.
- **'N_SUPPORTED_LANGUAGES' Column**: Missing values were filled with a default value, and the column was cleaned and converted to an integer type to ensure proper data formatting.
- **'RATING_SCORE' Column**: Missing values were filled with a placeholder, cleaned, and converted to numeric values. The placeholder values were then replaced with the mean rating score to ensure a realistic representation of the data.
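
The sketch below illustrates three of the steps above: standardizing 'N/A', dropping rows without a developer, and imputing the year and rating. The raw string formats (a date ending in a four-digit year, a rating written as a percentage) are assumptions made for the example and may differ from the scraped data. The rating imputation is simplified to a direct mean fill rather than the placeholder approach described above.

```python
import pandas as pd

# Hypothetical raw rows; the string formats are assumed for illustration.
df = pd.DataFrame({
    "NAME": ["Game A", "Game B", "Game C", "Game D"],
    "DEVELOPERS": ["Studio X", "N/A", "Studio Y", "Studio Z"],
    "RELEASE_DATE": ["12 March 2020", "5 June 2018", "N/A", "1 January 2016"],
    "RATING_SCORE": ["80.00%", "90.00%", "60.00%", "N/A"],
})

# Standardize the 'N/A' marker to a proper missing value.
df = df.replace("N/A", pd.NA)

# Drop rows that have no developer information.
df = df.dropna(subset=["DEVELOPERS"]).reset_index(drop=True)

# Extract a four-digit year, fill gaps with the median, drop the original column.
year = pd.to_numeric(
    df["RELEASE_DATE"].str.extract(r"(\d{4})", expand=False), errors="coerce"
)
df["PUBLISH_YEAR"] = year.fillna(year.median()).astype(int)
df = df.drop(columns="RELEASE_DATE")

# Strip the percent sign, convert to float, and fill gaps with the mean.
rating = pd.to_numeric(df["RATING_SCORE"].str.rstrip("%"), errors="coerce")
df["RATING_SCORE"] = rating.fillna(rating.mean())

print(df)
```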

#### e. Data Cleaning and Feature Engineering

Several columns required specific cleaning and feature engineering steps:

- **'STORE_GENRE' Column**: Missing values were filled, unnecessary text was removed, and the genres were split into a list format to facilitate better analysis.
- **'24_HOUR_PEAK' Column**: This column was cleaned by filling missing values and extracting relevant numerical values, which were then converted to an integer type.
- **'TECHNOLOGIES' Column**: Missing values were filled with empty strings, and the technologies were split into lists for better usability during analysis.
- **'TOTAL_TWITCH_PEAK' Column**: The column was split into 'TWITCH_PEAK_HOUR' and 'TWITCH_PEAK_YEAR', both of which were cleaned and converted to numeric types. The original 'TOTAL_TWITCH_PEAK' column was dropped after extracting the necessary information.
- **'N_DLC' Column**: Missing values were temporarily filled during cleaning and then restored to null. The percentage of null values was then calculated, and the column was dropped because that percentage was high.
- **'TOTAL_REVIEW' Column**: A new column was created to represent the proportion of positive reviews out of the total reviews, providing insights into the overall sentiment and reception of the games.
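
Several of these transformations can be sketched as follows. The raw formats (comma-separated genres with a parenthesized number, peak counts with thousands separators) and the 40% drop threshold are assumptions for the example, not values taken from the notebook. The 'TOTAL_TWITCH_PEAK' split is omitted because its raw format is not described here.

```python
import pandas as pd

# Hypothetical raw rows; formats are assumed for illustration.
df = pd.DataFrame({
    "STORE_GENRE": ["Action (1), Indie (23)", None],
    "24_HOUR_PEAK": ["1,234", None],
    "POSITIVE_REVIEWS": [900, 30],
    "NEGATIVE_REVIEWS": [100, 10],
    "N_DLC": [None, 2],
})

# STORE_GENRE: fill gaps, strip the parenthesized numbers, split into lists.
df["STORE_GENRE"] = (
    df["STORE_GENRE"]
    .fillna("")
    .str.replace(r"\s*\(\d+\)", "", regex=True)
    .apply(lambda s: [g.strip() for g in s.split(",") if g.strip()])
)

# 24_HOUR_PEAK: fill gaps with "0", drop thousands separators, keep the digits.
df["24_HOUR_PEAK"] = (
    df["24_HOUR_PEAK"]
    .fillna("0")
    .str.replace(",", "", regex=False)
    .str.extract(r"(\d+)", expand=False)
    .astype(int)
)

# TOTAL_REVIEW: proportion of positive reviews among all reviews.
total = df["POSITIVE_REVIEWS"] + df["NEGATIVE_REVIEWS"]
df["TOTAL_REVIEW"] = df["POSITIVE_REVIEWS"] / total

# N_DLC: drop the column when too many values are missing.
NULL_THRESHOLD_PCT = 40.0  # assumed cutoff, not from the report
n_dlc_null_pct = df["N_DLC"].isna().mean() * 100
if n_dlc_null_pct > NULL_THRESHOLD_PCT:
    df = df.drop(columns="N_DLC")

print(df)
```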

### 3. Data Integration

The cleaned and preprocessed data from the primary dataset (`game_info.csv`) and the secondary dataset (`game_details.csv`) were merged to form a unified dataset. This step ensured that all relevant information was combined, providing a comprehensive dataset for further analysis.
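
A possible shape for this merge, assuming 'NAME' is the join key and that the secondary file's USD price is only used where the primary price is missing. The sample rows and the `_USD` suffix are invented for the example.

```python
import pandas as pd

# Hypothetical primary and secondary datasets.
game_info = pd.DataFrame({
    "NAME": ["Game A", "Game B", "Game C"],
    "PRICE": ["$9.99", None, None],
})
game_details = pd.DataFrame({
    "NAME": ["Game A", "Game B"],
    "PRICE": [9.99, 4.99],
})

# Left join keeps every game from the primary dataset.
merged = game_info.merge(game_details, on="NAME", how="left", suffixes=("", "_USD"))

# Clean the primary price to a number, then fall back to the USD price.
primary = pd.to_numeric(
    merged["PRICE"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)
merged["PRICE"] = primary.fillna(merged["PRICE_USD"])

# Drop the helper column and any rows that still have no price.
merged = (
    merged.drop(columns="PRICE_USD")
    .dropna(subset=["PRICE"])
    .reset_index(drop=True)
)

print(merged)
```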

### 4. Final Dataset Preparation

A final check was conducted to ensure all preprocessing steps were correctly applied. This included verifying that there were no remaining missing values, that all data types were correct, and that all necessary transformations had been completed. The cleaned and prepared dataset was then saved as `preprocessed_game_info.csv`.
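
A final check of this kind might look like the sketch below. The sample frame is made up, and the file is written to a temporary directory here so that the example has no side effects. The notebook would write `preprocessed_game_info.csv` to the project folder.

```python
import os
import tempfile

import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Hypothetical cleaned dataset.
df = pd.DataFrame({
    "NAME": ["Game A", "Game B"],
    "PRICE": [9.99, 4.99],
    "PUBLISH_YEAR": [2020, 2018],
})

# Verify that no missing values remain and that key columns are numeric.
remaining_nulls = int(df.isna().sum().sum())
assert remaining_nulls == 0, "unexpected missing values"
assert is_numeric_dtype(df["PRICE"])
assert is_integer_dtype(df["PUBLISH_YEAR"])

# Save without the index, then reload to confirm the file round-trips.
out_path = os.path.join(tempfile.mkdtemp(), "preprocessed_game_info.csv")
df.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)

print(reloaded.dtypes)
```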

### 5. Summary

In summary, the preprocessing phase involved:
- Loading and exploring the data.
- Removing duplicate entries.
- Handling missing values across various columns.
- Cleaning data and engineering new features.
- Integrating datasets to form a comprehensive dataset.

These steps ensured that the data was clean, consistent, and ready for subsequent analysis and modeling. The output of this phase is `preprocessed_game_info.csv`.

## <span style="color:Purple">C) Analysis</span>