# COGS 108 - Final Project 

# Overview

Games today are usually grouped into broad categories such as shooters, RPGs, and rogue-likes. Our main goal in this project is to find out which categories tend to receive the best user responses and how that response has shifted over time, so that we can identify the current trend in popular game categories. To do this, we will look at each release year separately and at how the games released in that year were received.

We will use two datasets, one for Steam games and one for mobile games, to see whether some categories are better suited to mobile devices than to computers.

# Names

- Edward Xie

# Group Members IDs

- Edwardxie72---A15534895

# Research Question

Which game categories receive the best user ratings, and how has this changed over the years?

Our second goal is to see whether these trends differ between mobile games and PC games.

## Background and Prior Work

## TODO
*Fill in your background and prior work here* 

References (include links):
- 1)
- 2)

# Hypothesis


## TODO
*Fill in your hypotheses here*

# Dataset(s)

*Dataset Information*

- Dataset Name: Steam Store Games
- Link to the dataset: https://www.kaggle.com/nikdavis/steam-store-games
- Number of observations: 27033

This dataset contains 27033 games from Steam along with information about each game. For each game we will use its Steam categories, its genre, its SteamSpy tags, its number of positive ratings, and its number of negative ratings.

- Dataset Name: Google Play Store Apps
- Link to the dataset: https://www.kaggle.com/gauthamp10/google-playstore-apps#Google-Playstore-Full.csv
- Number of observations: 244407

This dataset contains 244407 apps from the Play Store. We will extract the games from it and, for each game, use its category, its rating, and its review count.

*How we will combine the datasets:*
We will treat these as separate datasets, but will compare their results to see whether the trends differ between games on computers and games on mobile devices.


# Setup

In [1]:
#Imports (Following A2 as a guideline for processing data)
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

In [2]:
# Configure libraries (taken from A2)
# The seaborn library makes plots look nicer
sns.set()
sns.set_context('talk')

# Don't display too many rows/cols of DataFrames
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8

# Round decimals when displaying DataFrames
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas versions
pd.set_option('display.precision', 2)

# Data Cleaning

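The cleaning steps carried out in the cells below are:

- Steam: drop the columns we do not use, rename the remaining columns, and reduce the release date to the release year.
- Google Play: drop unused and empty columns, drop rows that are missing a category, rating, or review count, keep only apps whose category marks them as games, and strip the GAME_ prefix from the category.
- Both: add an AvgRating column and a NumRating column so the two datasets can be compared.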

In [4]:
# Read in data
df_steam = pd.read_csv('steam.csv', low_memory=False)
df_googleplay = pd.read_csv('Google-Playstore-Full.csv', low_memory=False)

In [5]:
# Preview and clean up steam data set first
df_steam.head(5)

Unnamed: 0,appid,name,release_date,english,...,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,...,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,...,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,...,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,...,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,...,624,415,5000000-10000000,3.99


In [6]:
# Drop the columns we don't need; keep the name so apps stay identifiable, though it isn't strictly necessary
# 'owners' is dropped because it is only given as a range, which could make the data inconsistent;
# we weight by the number of ratings instead
df_steam = df_steam.drop(columns=[
    'appid', 'english', 'developer', 'publisher', 'platforms', 'required_age',
    'achievements', 'average_playtime', 'median_playtime', 'price', 'owners',
])

In [7]:
# Rename categories before we start cleaning data
df_steam.columns = ['Name', 'Year', 'Steam Categories', 'Genre', 'Steamspy Categories', 'Positive Ratings', 'Negative Ratings']

In [8]:
# Extract only the year from the release date because we analyze by year
def extract_year(string):
    return int(string[:4])  # Only need first 4 characters for year

In [9]:
df_steam['Year'] = df_steam['Year'].apply(extract_year)  # Change all years in our df
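Slicing the first four characters works as long as every release date is a well-formed `YYYY-MM-DD` string. As a hedged alternative, the sketch below parses the dates with `pd.to_datetime` so that malformed entries become missing values instead of raising an error. The `release_dates` series is made-up toy data, not taken from the Steam dataset.

```python
import pandas as pd

# Toy stand-in for the 'Year' column before conversion (values are invented)
release_dates = pd.Series(['2000-11-01', '1999-04-01', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(release_dates, format='%Y-%m-%d', errors='coerce')

# Cast to a nullable integer dtype so missing years stay missing
years = parsed.dt.year.astype('Int64')
print(years.tolist())
```

Rows with a missing year could then be inspected or dropped before the per-year analysis.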

In [10]:
# Preview and clean up Google Play Data
df_googleplay.head(5)

Unnamed: 0,App Name,Category,Rating,Reviews,...,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,DoorDash - Food Delivery,FOOD_AND_DRINK,4.548561573,305034,...,,,,
1,TripAdvisor Hotels Flights Restaurants Attract...,TRAVEL_AND_LOCAL,4.400671482,1207922,...,,,,
2,Peapod,SHOPPING,3.656329393,1967,...,,,,
3,foodpanda - Local Food Delivery,FOOD_AND_DRINK,4.107232571,389154,...,,,,
4,My CookBook Pro (Ad Free),FOOD_AND_DRINK,4.647752285,2291,...,,,,


In [11]:
# Start by dropping unwanted data; keep the name so we have some way to identify apps, though it isn't strictly necessary
del df_googleplay['Installs']  # Dropped for the same reason as Steam's owners; we use the number of ratings instead
del df_googleplay['Size']
del df_googleplay['Price']
del df_googleplay['Content Rating']
del df_googleplay['Last Updated']
del df_googleplay['Minimum Version']
del df_googleplay['Latest Version']
del df_googleplay['Unnamed: 11']
del df_googleplay['Unnamed: 12']
del df_googleplay['Unnamed: 13']
del df_googleplay['Unnamed: 14']

In [12]:
df_googleplay = df_googleplay.dropna(subset=['Category', 'Rating', 'Reviews'])

In [13]:
# First drop all non-game apps; game categories start with the string "GAME_"
df_googleplay = df_googleplay[df_googleplay['Category'].str.startswith('GAME_')]
df_googleplay = df_googleplay.reset_index(drop=True)

In [14]:
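# Add two columns to the Steam data: AvgRating, the fraction of positive ratings (Pos / (Pos + Neg), on a 0-1 scale),
# and NumRating, the total number of ratings (Pos + Neg)
# Note: Google Play's AvgRating is on a 0-10 scale, so rescale one of them before comparing the two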
df_steam['AvgRating'] = df_steam['Positive Ratings'].div(df_steam['Positive Ratings'].add(df_steam['Negative Ratings']))
df_steam['NumRating'] = df_steam['Positive Ratings'].add(df_steam['Negative Ratings'])
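Because the Steam positive fraction lies between 0 and 1 while the Google Play rating is doubled to a 0-10 scale, the two need a common scale before they can be compared. The sketch below shows one possible way to rescale the Steam fraction to 0-10 while guarding against games with no ratings at all. The toy values and the `AvgRating10` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data using the notebook's column names; the values are made up
toy = pd.DataFrame({
    'Positive Ratings': [90, 3, 0],
    'Negative Ratings': [10, 1, 0],
})

total = toy['Positive Ratings'] + toy['Negative Ratings']

# Replace a zero total with NaN so games without ratings get a missing score
fraction = toy['Positive Ratings'] / total.replace(0, np.nan)

# Hypothetical column: the positive fraction rescaled to 0-10
toy['AvgRating10'] = fraction * 10
print(toy)
```

Whether a like/dislike fraction and a five-star average are truly comparable after rescaling is a separate question that the analysis would need to address.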

In [15]:
# Add extra columns to googleplay, one for avg rating(out of 10) and rename total to match steam
df_googleplay['Rating'] = df_googleplay['Rating'].astype(float)
df_googleplay['AvgRating'] = df_googleplay['Rating'].mul(2)
df_googleplay = df_googleplay.rename(columns={'Reviews': 'NumRating'})

In [16]:
# Clean up the categories for the games by stripping the 5-character "GAME_" prefix
df_googleplay['Category'] = df_googleplay['Category'].str.slice(start=5)
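Slicing off five characters assumes every remaining category really begins with `GAME_`. A slightly more defensive sketch, using made-up category values, filters on the prefix and then removes it with an anchored regular expression, so a value without the prefix would be left untouched rather than truncated.

```python
import pandas as pd

# Toy category values (invented for illustration)
categories = pd.Series(['GAME_ACTION', 'GAME_ROLE_PLAYING', 'FOOD_AND_DRINK'])

# Keep only values that start with the game prefix
is_game = categories.str.startswith('GAME_')

# Remove the prefix only where it appears at the start of the string
cleaned = (
    categories[is_game]
    .str.replace(r'^GAME_', '', regex=True)
    .reset_index(drop=True)
)
print(cleaned.tolist())
```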

In [17]:
# Preview mostly cleaned up data
df_steam.head(5)

Unnamed: 0,Name,Year,Steam Categories,Genre,...,Positive Ratings,Negative Ratings,AvgRating,NumRating
0,Counter-Strike,2000,Multi-player;Online Multi-Player;Local Multi-P...,Action,...,124534,3339,0.97,127873
1,Team Fortress Classic,1999,Multi-player;Online Multi-Player;Local Multi-P...,Action,...,3318,633,0.84,3951
2,Day of Defeat,2003,Multi-player;Valve Anti-Cheat enabled,Action,...,3416,398,0.9,3814
3,Deathmatch Classic,2001,Multi-player;Online Multi-Player;Local Multi-P...,Action,...,1273,267,0.83,1540
4,Half-Life: Opposing Force,1999,Single-player;Multi-player;Valve Anti-Cheat en...,Action,...,5250,288,0.95,5538


In [18]:
df_googleplay.head(5)

Unnamed: 0,App Name,Category,Rating,NumRating,AvgRating
0,King of Crabs,ACTION,3.28,785,6.56
1,Match 3 App Rewards: Daily Game Rewards,CASUAL,4.52,248,9.04
2,Brown Dust,ROLE_PLAYING,4.48,70260,8.95
3,Poly - Coloring Puzzle Art Book,PUZZLE,4.58,878,9.16
4,Legend of Empress,ROLE_PLAYING,3.82,750,7.64


In [19]:
# Clean up categories
# Convert Steam Categories + Genre + Steamspy Categories into one category and put it in an array
df_steam["Steam Categories"].unique()

array(['Multi-player;Online Multi-Player;Local Multi-Player;Valve Anti-Cheat enabled',
       'Multi-player;Valve Anti-Cheat enabled',
       'Single-player;Multi-player;Valve Anti-Cheat enabled', ...,
       'Online Multi-Player;Steam Achievements;Full controller support;In-App Purchases;Steam Cloud',
       'Multi-player;Local Multi-Player;Co-op;Local Co-op;Shared/Split Screen',
       'Multi-player;Online Multi-Player;Cross-Platform Multiplayer;Stats'],
      dtype=object)

In [20]:
df_steam["Genre"].unique()

array(['Action', 'Action;Free to Play', 'Action;Free to Play;Strategy',
       ...,
       'Action;Adventure;Indie;Massively Multiplayer;RPG;Strategy;Early Access',
       'Action;Adventure;Casual;Free to Play;Indie;RPG;Simulation;Sports;Strategy',
       'Casual;Free to Play;Massively Multiplayer;RPG;Early Access'],
      dtype=object)

In [21]:
df_steam["Steamspy Categories"].unique()

array(['Action;FPS;Multiplayer', 'FPS;World War II;Multiplayer',
       'FPS;Action;Sci-fi', ..., 'Casual;Adventure;Arcade',
       'Free to Play;Visual Novel',
       'Early Access;Adventure;Sexual Content'], dtype=object)

In [22]:
# Convert Google Play Categories into same style as Steam
df_googleplay["Category"].unique()

array(['ACTION', 'CASUAL', 'ROLE_PLAYING', 'PUZZLE', 'RACING',
       'ADVENTURE', 'ARCADE', 'STRATEGY', 'SPORTS', 'SIMULATION', 'MUSIC',
       'EDUCATIONAL', 'WORD', 'TRIVIA', 'BOARD', 'CASINO', 'CARD'],
      dtype=object)

In [27]:
# Combine all the categories into one single array
# Note: each Steam entry is still a ';'-separated combination of tags, not an individual tag
array_steam = df_steam["Steam Categories"].unique()
array_genre = df_steam["Genre"].unique()
array_steamspy = df_steam["Steamspy Categories"].unique()
array_googleplay = df_googleplay["Category"].unique()
all_categories = np.concatenate((array_steam, array_genre, array_steamspy, array_googleplay), axis=0)
all_categories

array(['Multi-player;Online Multi-Player;Local Multi-Player;Valve Anti-Cheat enabled',
       'Multi-player;Valve Anti-Cheat enabled',
       'Single-player;Multi-player;Valve Anti-Cheat enabled', ...,
       'BOARD', 'CASINO', 'CARD'], dtype=object)
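The combined array above still holds whole `;`-separated strings, so the same tag appears inside many different combinations. One possible next step, sketched here on made-up rows, is to split each string on `;` and give every tag its own row, which makes it straightforward to list or group by individual tags. The `Tag` column name is an assumption.

```python
import pandas as pd

# Toy rows in the same ';'-separated style as the Steam columns (values invented)
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B'],
    'Genre': ['Action;Free to Play', 'Action;Strategy'],
})

# Split each string into a list of tags, then produce one row per (game, tag) pair
exploded = toy.assign(Tag=toy['Genre'].str.split(';')).explode('Tag')

# Individual tags, without duplicates
unique_tags = sorted(exploded['Tag'].unique())
print(unique_tags)
```

The same pattern would apply to the Steam Categories and Steamspy Categories columns.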

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
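As a starting point only, and not the analysis this project has settled on, the sketch below shows one way the per-year, per-category trend described in the overview could be computed: an average rating weighted by the number of ratings, so that games with few ratings count for less. All values are invented, and the `Tag`, `Weighted`, and `WeightedAvgRating` names are assumptions.

```python
import pandas as pd

# Toy data: one row per (game, tag), with made-up ratings
toy = pd.DataFrame({
    'Year': [2018, 2018, 2019, 2019],
    'Tag': ['Action', 'Action', 'Action', 'Puzzle'],
    'AvgRating': [0.9, 0.5, 0.8, 0.7],
    'NumRating': [100, 300, 50, 10],
})

# Weighted mean = sum(rating * count) / sum(count) within each group
toy['Weighted'] = toy['AvgRating'] * toy['NumRating']
grouped = toy.groupby(['Year', 'Tag'])[['Weighted', 'NumRating']].sum()
trend = (grouped['Weighted'] / grouped['NumRating']).rename('WeightedAvgRating')
print(trend)
```

Calling `trend.unstack('Tag')` would give a year-by-tag table that could be plotted as one line per category.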

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*