# COGS 118B - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training an unsupervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project must include some elements of unsupervised learning, but you are welcome to include some supervised or other learning approaches as well.
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

# Names

- Xueyan Shi
- Yacheng Xiao
- Yimeng Wang
- Roberto Carlos
- Franz Beckenbaur

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

# Background

The computer gaming industry has grown rapidly into a multi-billion-dollar market, reflecting its widespread popularity and economic significance. However, the industry faces distinctive challenges, such as satisfying a multicultural player base. Research indicates that discount strategies are a crucial business tactic for boosting sales in this competitive market.<a name="dunote"></a>[<sup>[1]</sup>](#dunote) Previous studies have focused on predicting the timing of discounts, using historical data to anticipate when price reductions are likely to occur.<a name="linnote"></a>[<sup>[2]</sup>](#linnote) Building on this foundation, our research aims to predict not just the timing but the specific discount rates of computer games on Steam.

# Problem Statement
We aim to predict the discounted price of games on Steam. Every game on Steam is labeled with tags and features, which makes the dataset well suited to clustering algorithms. Although tags are generated by users, we can transform them into a one-hot encoding so that they are quantifiable. Performance is straightforward to measure: we can use prediction error (such as mean absolute error) for the predicted prices and a clustering quality measure (for example, one based on distances or on agreement with reference labels) for the clusters. We will use a fixed random seed so that our results are replicable.
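
To illustrate the reproducibility point, here is a minimal sketch (not the project's final pipeline) that clusters a tiny hand-made multi-hot tag matrix twice with the same seed and checks that the assignments match. The toy matrix, the choice of scikit-learn's `KMeans`, `n_clusters=2`, and the seed value `42` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

SEED = 42  # assumed value; any fixed integer serves the same purpose

# Toy multi-hot matrix (rows = games, columns = tags); made up for illustration.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])


def cluster(data, seed):
    """Fit KMeans with a fixed seed and return (labels, inertia)."""
    model = KMeans(n_clusters=2, n_init=10, random_state=seed)
    labels = model.fit_predict(data)
    return labels, model.inertia_


labels_a, inertia_a = cluster(X, SEED)
labels_b, inertia_b = cluster(X, SEED)

# Same data and same seed give identical cluster assignments.
print(np.array_equal(labels_a, labels_b))
print(labels_a, inertia_a)
```

Passing the seed explicitly through `random_state`, rather than relying on global state, keeps each model's randomness under control even when cells are re-run out of order.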

# Data
We will be using the [Steam Games Dataset](https://www.kaggle.com/datasets/nikatomashvili/steam-games-dataset) from Kaggle. It contains 71,000 games and 16 variables, including `Original Price`, `Discounted Price`, `All Reviews Summary`, `Release Date`, and `Popular Tags`.

In [1]:
import pandas as pd
df = pd.read_csv('Steam_Game_Dataset.csv')

Here's what an observation consists of:

In [2]:
df.columns

Index(['Title', 'Original Price', 'Discounted Price', 'Release Date', 'Link',
       'Game Description', 'Recent Reviews Summary', 'All Reviews Summary',
       'Recent Reviews Number', 'All Reviews Number', 'Developer', 'Publisher',
       'Supported Languages', 'Popular Tags', 'Game Features',
       'Minimum Requirements'],
      dtype='object')

Some variables, such as `Popular Tags` and `Game Features`, are stored as lists:

In [13]:
print(df.loc[0, 'Popular Tags'])   # Popular Tags of the first game
print(df.loc[0, 'Game Features'])  # Game Features of the first game

['RPG', 'Choices Matter', 'Character Customization', 'Story Rich', 'Adventure', 'Online Co-Op', 'CRPG', 'Multiplayer', 'Fantasy', 'Turn-Based Combat', 'Dungeons & Dragons', 'Co-op Campaign', 'Strategy', 'Singleplayer', 'Romance', 'Class-Based', 'Dark Fantasy', 'Combat', 'Controller', 'Stealth']
['Single-player', 'Online Co-op', 'LAN Co-op', 'Steam Achievements', 'Full controller support', 'Steam Cloud']


The dataset was obtained by web crawling, so it is very raw and contains garbled characters. We will need substantial pre-processing to clean and prune it before it is usable. In addition, some features are stored as lists, so we will need to build a `dict` that maps each distinct value to a column and then re-encode these features as one-hot vectors for ML.
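
The list-to-one-hot step described above could be sketched as follows. This is a minimal illustration under stated assumptions: the list columns are assumed to arrive from the CSV as strings such as `"['RPG', 'Strategy']"`, the helper `parse_list` is hypothetical, the toy rows are made up, and scikit-learn's `MultiLabelBinarizer` is swapped in for a hand-written `dict` (it builds the value-to-column mapping internally). Because a game can carry several tags, each row can contain more than one `1`.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def parse_list(cell):
    """Turn a string such as "['RPG', 'Strategy']" into a Python list.

    Missing or unparsable cells become an empty list (an assumed policy).
    """
    if isinstance(cell, list):
        return cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


# Toy stand-in for the real column; the values are made up.
toy = pd.DataFrame({
    "Popular Tags": [
        "['RPG', 'Strategy']",
        "['Strategy', 'Multiplayer']",
        None,
    ]
})

tags = toy["Popular Tags"].apply(parse_list)

# One column per distinct tag, 1 if the game has that tag and 0 otherwise.
mlb = MultiLabelBinarizer()
tag_matrix = pd.DataFrame(
    mlb.fit_transform(tags),
    columns=mlb.classes_,
    index=toy.index,
)
print(tag_matrix)
```

Using `ast.literal_eval` rather than `eval` means crawled text is only ever parsed as a literal and never executed.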

# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

**Prediction Accuracy**: Mean Absolute Error (MAE)

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is calculated as the average of the absolute differences between predicted and actual values, which makes it a straightforward and interpretable metric for assessing price prediction accuracy.
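
As a small worked example of this definition, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, the sketch below computes MAE by hand and with scikit-learn's `mean_absolute_error`. The prices are made-up numbers, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up discounted prices in dollars, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0, 45.0])

# MAE by definition: the mean of |y_true - y_pred|.
mae_manual = np.mean(np.abs(y_true - y_pred))

# The same quantity computed by scikit-learn.
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

The absolute errors here are 2, 2, 0, and 5, so both computations give 2.25, which reads directly as an average miss of $2.25 per game.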

**Clustering Performance**: Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) measures the similarity between two clusterings, considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. The ARI is adjusted for the chance grouping of elements, making it a more reliable metric for the quality of the clustering.
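
A short sketch of how ARI behaves, using scikit-learn's `adjusted_rand_score`. The label vectors are made up; in practice ARI needs some set of reference labels to compare against, and which labels would play that role for this dataset is an open choice not settled here.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up reference labels and cluster assignments for six games.
true_labels = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]  # identical partition, ids swapped
one_mistake = [0, 0, 1, 1, 1, 1]          # one game placed in the wrong cluster

ari_perfect = adjusted_rand_score(true_labels, same_up_to_renaming)
ari_partial = adjusted_rand_score(true_labels, one_mistake)

print(ari_perfect, ari_partial)
```

The first score is 1.0 because ARI compares partitions and ignores how cluster ids are numbered; the second is lower because one pair-level grouping disagrees with the reference.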

MAE will give us a direct measure of the accuracy of the price predictions, while ARI will assess the effectiveness of our clustering, ensuring that the foundation for our predictions—how games are grouped based on tags and features—is sound. This dual approach allows us to optimize both the clustering of games and the accuracy of the price predictions.

# Ethics & Privacy

This project uses publicly available data from platforms such as Kaggle, and we intend to adhere strictly to academic research protocols and respect all relevant data usage agreements. We will address potential inaccuracies in our dataset by cross-referencing information with alternative, independent sources. We focus mainly on public data provided by Steam; hence, we acknowledge that our insights may not apply to other gaming platforms.

# Team Expectations 

Our team aims for an efficient and respectful working environment along with the following expectations:

* **Open and Engaged Interaction**: We prioritize transparent and dynamic interactions. It is important for team members to comfortably share their perspectives, ideas, and any concerns they may have. We commit to listening with attention and respect to one another.

* **Contribution of Ideas**: We invite all team members to offer their insights and proposals. We will regard each other's contributions with respect and value, acknowledging the diversity of our ideas.

* **Continuous Collaboration**: Should any team member have suggestions or face obstacles, we encourage sharing them promptly via our Discord channel. We recognize the importance of immediate and open dialogue for resolving issues quickly and effectively.

* **Commitment to Deadlines**: Acknowledging the significance of adhering to timelines, we expect team members to complete their tasks punctually. Anyone who anticipates a delay should inform the team at least two days in advance, allowing us to work together towards a solution.

* **Fair Allocation of Responsibilities**: Tasks will be assigned fairly among all team members, ensuring an even distribution of work. Everyone is expected to participate equally in the project, preventing disproportionate workloads.



# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM |  Brainstorm topics/questions  | Do background research; Find datasets | 
| 1/25  |  10 AM |  Do background research on topic (ALL) | Discuss ethics; draft project proposal | 
| 2/1  | 10 AM  | Descriptive statistics; data visualization  | Assign group members to lead each specific part   |
| 2/12  | 6 PM  | Import & wrangle data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/22  | 12 PM  | Finalize wrangling; Begin programming | Discuss/edit project code; Complete project |
| 3/14  | 12 PM  | Complete analysis; Finish results | Discuss/second edit full project |
| 3/17  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="dunote"></a>1.[^](#du): L. Du, "Steam Game Discount Prediction Using Machine Learning Methods," 2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 2021, pp. 149-152, doi: 10.1109/MLBDBI54094.2021.00037. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9730988&isnumber=9730940<br>
<a name="linnote"></a>2.[^](#lin): Lin, D., Bezemer, CP., Zou, Y. et al. "An empirical study of game reviews on the Steam platform". Empir Software Eng 24, 170–207 (2019). https://doi.org/10.1007/s10664-018-9627-4<br>

