## Project Phase II

### Team Members:
- Neel Mallik
- Joshua Herren
- Kunal Lotun
- Kyle Hong


## Part 1: 
(1%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.

### Problem Motivation

Video games are becoming more popular by the day, with some releases becoming instant hits and others slowly rising in popularity as they gain an audience. For games that become popular right off the bat, how do they retain that popularity? Is it because of their genre? Their length? Many factors contribute to a game's initial success, but even more affect whether it stays popular, since holding an audience is a much harder struggle than attracting one. There are plenty of special cases. For example, games like Madden NFL, NBA 2K, and FC have a cult following that comes largely from their respective sports. On the other hand, plenty of games are completely imagined and have nothing to do with real life, yet still keep a massive, dedicated following for years. Think of games like Overwatch, which bears no resemblance to reality yet remains massively popular. This is what we aim to figure out. The key questions are as follows:

1. Are there any particular markers or trends that contribute to a game being popular?
2. If a game is popular on release, are there any markers or trends that indicate it will retain its popularity?
3. How do such markers or trends change across characteristics of the game (genre, year of release, country of origin)?
4. How can we predict, overall, whether a game will start off popular and retain its popularity? Along the same lines, are there any markers or trends that are consistent with how long a game retains its popularity? If so, how can we use these markers to gauge similar games and their popularity retention scores?

Motivation sources:
* .... (fill these in once we gather data I guess)
* .... 
* ....

### Summary of the Data Processing Pipeline

1. Query the RAWG API to get the raw data
2. Clean the data for analysis and visualization
3. Visualize the cleaned data by plotting it, so that trends are easy to read

In this section we will talk about which API we used, how we used it, etc. Once the part about gathering the data is covered, we move on to cleaning: what we cleaned, how we cleaned it, and how the overall process went. Finally, we talk about the formatting and visualization of the data: what exactly we analyzed, how we analyzed it, which visualizations we used, and which plotting libraries we used (seaborn, matplotlib, plotly, whatever). Then we describe how our data addresses each question. This will be BROAD, since we actually answer the questions later.

## Part 2: 
(2\%) Obtains, cleans, and merges all data sources involved in the project.

### Raw data collection

Below is what I did to get my raw data. Anybody in this group should be able to run it now to see what the data looks like, so we can decide how to change the way we collect it, or switch if anybody has a better idea or API.

In [13]:
# imports 
import requests
import pandas as pd

# Site
url = "https://api.rawg.io/api/games"
# API key used to pull data
key = "07dfd979a676418e813cbba97a9dfe4f"
pages = 5  # number of result pages to fetch
# the API returns a single page per GET request, so the loop below sends one request per page

# create empty list, we will store all of our data here
games = []
for page in range(1, pages + 1):  # iterate through each page
    params = {
        "key": key,
        "dates": "2024-01-01,2025-12-31",  # filter by date released
        "ordering": "-added",  # sort by added (so we don't get random games with no adds)
        "page": page  # the current page
    }

    # send the get, get the json response, which we can't use directly, so convert it to a dictionary
    data = requests.get(url, params=params).json()

    for game in data.get("results", []):
        temp = game.get('added_by_status') or {}  # fall back to an empty dict if the field is missing or None
        games.append({  # append column with corresponding data (via get("{name of column needed}"))
            "name": game.get("name"),
            # for example, we create the column "name" with data from game (the specific game in "results", which is the actual data from the API, which is, in this case, the name of the game)
            "released": game.get("released"),
            "rating (out of 5.0)": game.get("rating"),
            "ratings_count": game.get("ratings_count"),
            "length (h)": game.get("playtime"),
            "added": game.get("added"),
            #"added_by_status": game.get("added_by_status"),
            #if you guys want you can change the column titles I just set it to these based on my understanding of what they mean
            "yet to be downloaded": temp.get('yet', 0),
            "owned": temp.get('owned', 0),
            "beaten": temp.get('beaten', 0),
            "have not played yet": temp.get('toplay', 0),
            "dropped": temp.get('dropped', 0),
            "currently playing": temp.get('playing', 0),
            "genres": [genre["name"] for genre in game.get("genres", [])]
            # in this line, we create the column "genres". this is a list of genres (as strings), so we iterate through the list and add each name to the column

        })
# the loop above sends one GET request per page, for however many pages "pages" is set to

# convert the list of games to a dataframe (not strictly necessary, but it displays better, similar to the spotify example)
df = pd.DataFrame(games)
# display as a table rather than just text
display(df)


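The paging loop above can also be wrapped in a small function so each teammate can rerun it with a different page count. The sketch below is a minimal version under some assumptions: `fetch_games` and `flatten_game` are hypothetical helper names, not part of the RAWG API, and `get_page` is a stand-in callable that takes a page number and returns the decoded JSON dictionary for that page (in the notebook it would wrap `requests.get(url, params=params).json()`). The game in `fake_get_page` is made up purely to exercise the code.

```python
def flatten_game(game):
    """Turn one entry of the API's "results" list into a flat row (dict)."""
    # "added_by_status" may be absent or None, so fall back to an empty dict
    status = game.get("added_by_status") or {}
    return {
        "name": game.get("name"),
        "released": game.get("released"),
        "rating (out of 5.0)": game.get("rating"),
        "ratings_count": game.get("ratings_count"),
        "length (h)": game.get("playtime"),
        "added": game.get("added"),
        "yet to be downloaded": status.get("yet", 0),
        "owned": status.get("owned", 0),
        "beaten": status.get("beaten", 0),
        "have not played yet": status.get("toplay", 0),
        "dropped": status.get("dropped", 0),
        "currently playing": status.get("playing", 0),
        "genres": [genre["name"] for genre in game.get("genres") or []],
    }


def fetch_games(get_page, pages):
    """Collect flattened rows from pages 1..pages using the get_page callable."""
    rows = []
    for page in range(1, pages + 1):
        data = get_page(page)
        rows.extend(flatten_game(game) for game in data.get("results", []))
    return rows


# Made-up stand-in for the API: page 1 has one game, later pages are empty
def fake_get_page(page):
    if page == 1:
        return {"results": [{"name": "Example Game", "added": 10,
                             "added_by_status": None,
                             "genres": [{"name": "Action"}]}]}
    return {"results": []}


rows = fetch_games(fake_get_page, pages=2)
print(rows[0]["name"], rows[0]["owned"], rows[0]["genres"])
```

Passing the request function in as a parameter means the flattening logic can be tried on a hand-written response without spending API calls.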

In [3]:
# cleaning dataset function
def clean_data(df):
    """
    Cleans the dataframe by doing the following:
        - Converting release dates to datetime
        - Dropping duplicate games
        - Removing rows with missing values
        - Adding the active rate (current player ratio)
        - Turning the genres list into a string
        - Adding a popularity score, retention ratio, and engagement column
        - Removing games with 10 or fewer ratings
        - Naming the index "Game Number"

    Args:
        df (pd.DataFrame) : a data frame to clean

    Returns:
        df (pd.DataFrame) : cleaned data frame
    """

    df = df.drop_duplicates(subset=['name']) # Remove duplicates
    df = df.dropna(subset=['name', 'rating (out of 5.0)', 'ratings_count', 'added']).copy() # Remove rows missing a name, rating, rating count, or added count; the explicit copy makes the column assignments below operate on an independent frame
    df['released'] = pd.to_datetime(df['released'], errors='coerce') # Converts released to datetime
    df['genres'] = df['genres'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x) # Converts the genre to a string

    df['active rate'] = (df['currently playing'] / (df['owned'] + 1)).round(3) # Adds the current player ratio
    df['popularity score'] = (df['added'] + df['owned'] + df['currently playing']).round(3) # Adds a pop score, calculated by adding the amount added, owned, and currently playing
    df['retention ratio'] = ((df['currently playing'] + df['beaten']) / (df['dropped'] + 1)).round(3) # Adds a retention score, how well people who bought the game are still playing it
    df['engagement'] = ((df['currently playing'] + df['beaten']) / (df['added'] + 1)).round(3) # Fraction of interested users who actually played or completed the game
    df = df[df['ratings_count'] > 10] # Keep only games with more than 10 ratings
    df.index.name = "Game Number"

    return df

df = clean_data(df)
display(df)

Game Number,name,released,rating (out of 5.0),ratings_count,length (h),added,yet to be downloaded,owned,beaten,have not played yet,dropped,currently playing,genres,active rate,popularity score,retention ratio,engagement
0,Satisfactory,2024-09-11,4.29,333,12,2888,98,2222,102,146,207,113,"Indie, Strategy, Adventure, Action",0.051,5223,1.034,0.074
1,Vampire: The Masquerade – Bloodlines 2,2025-10-21,3.92,315,329,2404,286,683,16,1372,12,35,"Action, RPG",0.051,3122,3.923,0.021
2,S.T.A.L.K.E.R. 2: Heart of Chornobyl,2024-11-20,3.77,292,7,2140,336,534,104,1063,61,42,"Shooter, Adventure, Action, RPG",0.079,2716,2.355,0.068
3,V Rising,2024-05-08,3.73,167,7,2066,89,1587,69,142,147,32,"Massively Multiplayer, Adventure, Action",0.02,3685,0.682,0.049
4,Hollow Knight: Silksong,2025-09-04,4.26,160,13,1412,166,262,69,820,26,69,"Indie, Platformer, Adventure, Action",0.262,1743,5.111,0.098
5,Senua's Saga: Hellblade II,2024-05-21,3.96,173,0,1287,198,175,169,707,25,13,Action,0.074,1475,7.0,0.141
6,Synergy,2024-05-21,2.09,22,3,1018,27,961,3,22,5,0,"Strategy, Indie, Simulation",0.0,1979,0.5,0.003
7,Black Myth: Wukong,2024-08-20,4.31,148,0,995,125,179,126,508,30,27,"Adventure, Action, RPG",0.15,1201,4.935,0.154
8,Marvel Rivals,2024-12-06,3.71,105,4,938,29,729,27,11,88,54,Action,0.074,1721,0.91,0.086
9,Content Warning,2024-04-01,3.42,62,3,907,42,755,20,16,66,8,"Indie, Adventure, Action",0.011,1670,0.418,0.031
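
As a sanity check on the derived columns, the formulas in `clean_data` can be recomputed by hand for a single row. The sketch below does this in plain Python for the Satisfactory row of the table above; `derived_metrics` is a hypothetical helper written only for this check and is not part of the notebook.

```python
def derived_metrics(added, owned, beaten, dropped, playing):
    """Recompute the four derived columns of clean_data for one game."""
    return {
        "active rate": round(playing / (owned + 1), 3),
        "popularity score": added + owned + playing,
        "retention ratio": round((playing + beaten) / (dropped + 1), 3),
        "engagement": round((playing + beaten) / (added + 1), 3),
    }


# Counts copied from the Satisfactory row of the table above
satisfactory = derived_metrics(added=2888, owned=2222, beaten=102,
                               dropped=207, playing=113)
print(satisfactory)
# {'active rate': 0.051, 'popularity score': 5223, 'retention ratio': 1.034, 'engagement': 0.074}
```

The `+ 1` in each denominator avoids division by zero for games with no owners, drops, or adds. It also slightly lowers every ratio, which matters most for games with small counts.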


In [None]:
# Test 
# Print the first few (50+) rows of the cleaned CSV file 
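
One way this test cell could be filled in is sketched below. It assumes the cleaned frame is written to a CSV first; the file name `cleaned_games.csv` is a placeholder, and the two-row frame is a stand-in (values copied from the table above) so the snippet runs on its own. In the notebook, `df` would be the output of `clean_data`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned frame: two rows copied from the table above
df = pd.DataFrame(
    {
        "name": ["Satisfactory", "V Rising"],
        "rating (out of 5.0)": [4.29, 3.73],
        "ratings_count": [333, 167],
    }
)
df.index.name = "Game Number"

# Write the frame to a CSV (placeholder file name) and read it back
path = os.path.join(tempfile.gettempdir(), "cleaned_games.csv")
df.to_csv(path)
cleaned = pd.read_csv(path, index_col="Game Number")

# head(50) returns at most 50 rows, so it also works on shorter frames
preview = cleaned.head(50)
print(preview)
```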

## Part 3:
(2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded based on how much information they can effectively communicate to readers. Please make sure your visualizations are sufficiently distinct from each other.

In [1]:
# This is pretty simple, just put the data in graphs/plots
# Data (NOT FROM CSV) -> Dataframe -> Visualization/Analysis
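
As a starting point for this part, the sketch below draws one possible plot: rating against engagement, with point size scaled by the `added` count. It assumes matplotlib as the plotting library, uses five rows copied from the cleaned table so it runs on its own, and the helper name `plot_rating_vs_engagement` is made up for this example. With only a handful of games it shows the layout, not a trend.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the cleaned table (a stand-in for the full frame)
sample = pd.DataFrame(
    {
        "name": ["Satisfactory", "V Rising", "Hollow Knight: Silksong",
                 "Synergy", "Black Myth: Wukong"],
        "rating (out of 5.0)": [4.29, 3.73, 4.26, 2.09, 4.31],
        "added": [2888, 2066, 1412, 1018, 995],
        "engagement": [0.074, 0.049, 0.098, 0.003, 0.154],
    }
)


def plot_rating_vs_engagement(frame):
    """Scatter rating against engagement; point area grows with 'added'."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.scatter(frame["rating (out of 5.0)"], frame["engagement"],
               s=frame["added"] / 10, alpha=0.6)
    # Label each point with the game's name
    for _, row in frame.iterrows():
        ax.annotate(row["name"],
                    (row["rating (out of 5.0)"], row["engagement"]),
                    fontsize=8)
    ax.set_xlabel("Rating (out of 5.0)")
    ax.set_ylabel("Engagement ((playing + beaten) / (added + 1))")
    ax.set_title("Rating vs. engagement")
    fig.tight_layout()
    return fig, ax


fig, ax = plot_rating_vs_engagement(sample)
```

Inside the notebook, the `matplotlib.use("Agg")` line can be dropped so the figure displays inline, and `sample` would be replaced by the cleaned `df`.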

Talk here (after every few plots/graphs) about how our data can be analyzed, basically do the thinking for the reader. I'll be writing a lot in these parts so we at least get these points, even if a lot of it is BS and irrelevant. Make sure to answer a question or two depending on the graphs.

^^ This should be repeated for each graph/plot. By the very end, every question should be answered (pasted below from above):

1. Are there any particular markers or trends that contribute to a game being popular?
2. If a game is popular on release, are there any markers or trends that indicate it will retain its popularity?
3. How do such markers or trends change across characteristics of the game (genre, year of release, country of origin)?
4. How can we predict, overall, whether a game will start off popular and retain its popularity? Along the same lines, are there any markers or trends that are consistent with how long a game retains its popularity? If so, how can we use these markers to gauge similar games and their popularity retention scores?