# Gurmandeep Bal

## Research question/interests

Does a game rate higher when it belongs to a certain category?

In general, do games with more achievements have more categories associated with them? And are games with more achievements good games?

Does a game that has more categories associated with it do better than one that has only a few?

My first research question asks which categories the top ten studios and publishers (those that made the greatest number of games) do well at making. Doing well can be measured by the Metacritic score, the total number of positive reviews, the percentage of positive reviews, and the number of achievements in the game. This is relevant and interesting because it will help determine which categories a particular publisher is good at making, suggesting that the next game they make in the same category is also likely to be good. This is feasible because the required data is available, and novel because we have no existing metric for this question. It is ethical because the data was collected ethically.

For my second research question, I will look at whether games with more achievements have more categories associated with them, and whether games with more achievements are good games, using the same metrics of 'good' as before. This is relevant because, if some correlation is shown, a game with many achievements might be expected to be good. This is feasible because the required data is available, and novel because we have no existing metric for this question. It is ethical because the data was collected ethically.


As for my third research question, I want to see whether games that focus on one defining feature are 'better' than jack-of-all-trades games, using the same metrics of 'good' as before. It is relevant because, if there is a trend in which the number of categories implies better or worse games, new games may be judged on this factor. This is feasible because the required data is available, and novel because we have no existing metric for this question. It is ethical because the data was collected ethically.

In [2]:
# Run this cell to ensure that altair plots show up without having
# the notebook be really large.
# We will talk more about what these lines do later in the course

import os
import altair as alt
import pandas as pd
from toolz.curried import pipe
import ast
import numpy as np

# Create a new data transformer that stores the files in a directory
def json_dir(data, data_dir='altairdata'):
    os.makedirs(data_dir, exist_ok=True)
    return pipe(data, alt.to_json(filename=data_dir + '/{prefix}-{hash}.{extension}') )

# Register and enable the new transformer
alt.data_transformers.register('json_dir', json_dir)
alt.data_transformers.enable('json_dir')

# Handle large data sets (default shows only 5000)
# See here: https://altair-viz.github.io/user_guide/data_transformers.html
alt.data_transformers.disable_max_rows()

alt.renderers.enable('jupyterlab')

RendererRegistry.enable('jupyterlab')

In [3]:
path = '../../data/processed/cleaned_games.csv'
data = pd.read_csv(path)
print(data.shape)
data.head()

(58041, 21)


Unnamed: 0.1,Unnamed: 0,steam_appid,name,developers,publishers,categories,genres,required_age,n_achievements,platforms,...,additional_content,total_reviews,total_positive,total_negative,review_score,review_score_desc,positive_percentual,metacritic,is_free,price_initial (USD)
0,0,2719580,勇者の伝説の勇者,['ぽけそう'],['ぽけそう'],"['Single-player', 'Family Sharing']","['Casual', 'Indie']",0,0,['windows'],...,[],0,0,0,0.0,No user reviews,0.0,0,False,0.99
1,2,2719600,Lorhaven: Cursed War,['GoldenGod Games'],['GoldenGod Games'],"['Single-player', 'Multi-player', 'PvP', 'Shar...","['RPG', 'Strategy']",0,32,"['windows', 'mac']",...,[],9,8,1,0.0,9 user reviews,88.9,0,False,9.99
2,3,2719610,PUIQ: Demons,['Giammnn'],['Giammnn'],"['Single-player', 'Steam Achievements', 'Famil...","['Action', 'Casual', 'Indie', 'RPG']",0,28,['windows'],...,[],0,0,0,0.0,No user reviews,0.0,0,False,2.99
3,4,2719650,Project XSTING,['Saucy Melon'],['Saucy Melon'],"['Single-player', 'Steam Achievements', 'Steam...","['Action', 'Casual', 'Indie', 'Early Access']",0,42,['windows'],...,[],9,9,0,0.0,9 user reviews,100.0,0,False,7.99
4,7,2719710,Manor Madness,['Apericot Studio'],['Apericot Studio'],"['Single-player', 'Steam Achievements', 'HDR a...","['Action', 'Adventure', 'Indie', 'RPG', 'Simul...",0,5,"['windows', 'mac', 'linux']",...,[],0,0,0,0.0,No user reviews,0.0,0,True,0.0


Below are the top ten developers (studios) that made the most games, which is what my first research question looks at.

In [4]:
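countByDeveloper = data.groupby('developers')['name'].count().reset_index(name='Count')
# Sort by game count; note that iloc[1:11] skips the single highest-count row and keeps the next ten
top10Dev = countByDeveloper.sort_values('Count',ascending=False).iloc[1:11]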
top10Dev

Unnamed: 0,developers,Count
7086,['Creobit'],122
6087,['Choice of Games'],104
18030,['Laush Dmitriy Sergeevich'],99
16774,"['KOEI TECMO GAMES CO., LTD.']",79
10180,['EroticGamesClub'],76
14409,['Hosted Games'],64
9807,['Elephant Games'],59
4606,['Boogygames Studios'],58
29454,['Somer Games'],55
29382,['Sokpop Collective'],55


Each game's categories were stored as a single list-like string such as [Category1, Category2, ...]. To make the data usable for my question, I parsed these strings and exploded the column so that each category has its own row. Below is the resulting dataset for the top ten developers.

In [5]:
# Copy the filtered rows so the column assignment below works on an independent
# DataFrame rather than a slice of `data` (this avoids SettingWithCopyWarning)
dataWithTop10Dev = data[data['developers'].isin(top10Dev['developers'])].copy()
# Parse the stringified lists, then give each category its own row
dataWithTop10Dev['categories'] = dataWithTop10Dev['categories'].apply(ast.literal_eval)
dataWithTop10Dev = dataWithTop10Dev.explode('categories')
dataWithTop10Dev.describe()


Unnamed: 0.1,Unnamed: 0,steam_appid,required_age,n_achievements,total_reviews,total_positive,total_negative,review_score,positive_percentual,metacritic,price_initial (USD)
count,2566.0,2566.0,2566.0,2566.0,2566.0,2566.0,2566.0,2566.0,2566.0,2566.0,2566.0
mean,34223.561964,1398988.0,2.66212,25.219797,148.588465,126.095479,22.492985,3.344115,66.932931,2.66212,10.111645
std,19467.670206,752529.1,14.499177,102.562425,856.930597,746.047705,121.690505,3.386658,29.344558,14.499177,14.177698
min,51.0,299540.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,15789.0,882515.0,0.0,0.0,3.0,2.0,1.0,0.0,50.0,0.0,2.99
50%,36390.0,1257290.0,0.0,10.0,10.0,7.0,3.0,4.0,75.0,0.0,4.99
75%,49978.0,1859100.0,0.0,30.0,32.0,26.0,7.0,7.0,89.5,0.0,7.99
max,71238.0,3398790.0,86.0,1596.0,9188.0,8405.0,1485.0,9.0,100.0,86.0,69.99
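
The parse-and-explode step described above can be sketched on a small made-up table. The frame `toy` and its values are hypothetical stand-ins for the real data; only the column names follow the dataset.

```python
import ast

import pandas as pd

# Hypothetical stand-in for the real data: 'categories' holds stringified lists.
toy = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "developers": ["['Dev X']", "['Dev Y']", "['Dev X']"],
    "categories": [
        "['Single-player', 'Steam Achievements']",
        "['Multi-player']",
        "['Single-player']",
    ],
})

# Take an explicit copy of the filtered rows so the assignment below
# modifies an independent DataFrame.
subset = toy[toy["developers"] == "['Dev X']"].copy()

# Turn each string into a real list, then give every category its own row.
subset["categories"] = subset["categories"].apply(ast.literal_eval)
tidy = subset.explode("categories").reset_index(drop=True)

print(tidy[["name", "categories"]])
```

After `explode`, each game appears once per category, so summary statistics computed on the exploded frame (such as the `describe()` output above) weight games with many categories more heavily than games with few.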


This graph shows the proportion of each category among each studio's games, so the studios can be compared with each other. Categories that were represented by fewer than half of the developers, or that were not a game type (PvP, for example), were dropped for simplicity.

In [6]:
#print(dataWithTop10Dev['categories'].unique())
removeGenres = ['Captions available','In-App Purchases','Remote Play Together' ,
                'Multi-player','Partial Controller Support','Cross-Platform Multiplayer',
                'Online PvP','Full controller support', 'HDR avaliable', 
                'Includes level editor','HDR available', 'Remote Play on Tablet',
                'Shared/Split Screen Co-op', 'Shared/Split Screen','Shared/Split Screen PvP',
                'Stats','Steam Trading Cards','Steam Workshop']

filtered_data = dataWithTop10Dev[~dataWithTop10Dev['categories'].isin(removeGenres)].dropna()

alt.Chart(filtered_data).mark_bar().encode(
    y=alt.Y('developers'),
    x=alt.X('count(categories)').stack("normalize"),
    color='categories'
    
).properties(
    height=500,
    width=1000
)

<VegaLite 5 object>
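
The "represented by fewer than half of the developers" rule could also be computed from the data instead of listed by hand. The following is only a sketch on hypothetical data (the frame `tidy` and its values are made up). It does not cover the second rule, dropping categories that are not a game type, which still needs a manual list.

```python
import pandas as pd

# Hypothetical tidy data: one row per (developer, category) pair.
tidy = pd.DataFrame({
    "developers": ["Dev X", "Dev X", "Dev Y", "Dev Y", "Dev Z", "Dev Z"],
    "categories": [
        "Single-player", "Steam Cloud",
        "Single-player", "Steam Achievements",
        "Single-player", "Steam Achievements",
    ],
})

n_devs = tidy["developers"].nunique()

# For each category, count how many distinct developers have a game in it.
devs_per_category = tidy.groupby("categories")["developers"].nunique()

# Keep categories that at least half of the developers are represented in.
common = devs_per_category[devs_per_category >= n_devs / 2].index.tolist()

filtered = tidy[tidy["categories"].isin(common)]
print(common)
```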



Below is the distribution of review scores for each of the game developers. This will be useful when exploring the expectations of the research question. Some developers' review distributions are quite odd, which will need to be looked at later when examining this part of the question in more depth.

In [7]:
boxplot = alt.Chart(dataWithTop10Dev).mark_boxplot().encode(
    x=alt.X('developers', title='Developer'),
    y=alt.Y('review_score', title='Review Score'),
    color='developers'
).properties(
    height=400,
    width=600
)

boxplot

<VegaLite 5 object>
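
To connect these distributions back to the first research question, one possible next step is to summarise the 'doing well' metrics per developer and category. This is a rough sketch on made-up numbers, not the analysis itself. The column names `positive_percentual` and `n_achievements` follow the dataset, while the frame `tidy` and the output names `n_games`, `mean_positive_pct`, and `mean_achievements` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical exploded data: one row per (game, category) pair.
tidy = pd.DataFrame({
    "developers": ["Dev X", "Dev X", "Dev X", "Dev Y", "Dev Y"],
    "categories": ["Single-player", "Single-player", "Co-op", "Single-player", "Co-op"],
    "positive_percentual": [90.0, 70.0, 50.0, 60.0, 80.0],
    "n_achievements": [10, 30, 0, 5, 15],
})

# Aggregate the metrics for every (developer, category) combination.
summary = (
    tidy.groupby(["developers", "categories"])
    .agg(
        n_games=("positive_percentual", "count"),
        mean_positive_pct=("positive_percentual", "mean"),
        mean_achievements=("n_achievements", "mean"),
    )
    .reset_index()
)

# For each developer, pick the category with the highest mean positive percentage.
best = summary.loc[summary.groupby("developers")["mean_positive_pct"].idxmax()]
print(best[["developers", "categories", "mean_positive_pct"]])
```

In the real data, games with no user reviews show a `positive_percentual` of 0.0, so it may be worth filtering on `total_reviews` before averaging.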



Below are the achievement counts for the 20 highest-rated and 20 lowest-rated games, labelled by developer. In general, highly rated games do tend to have more achievements, which could be useful in further exploration.

In [None]:
# Work on an explicit copy so later changes do not affect `data`
catCountData = data.copy()

bottomRated = catCountData.sort_values(by='review_score',ascending=True).head(20)
topRated = catCountData.sort_values(by='review_score',ascending=False).head(20)

topChart = alt.Chart(topRated).mark_bar().encode(
    y='developers',
    x='n_achievements'
    
).properties(
    title='Count of achievements for top rated games, by developer'
)

bottomChart = alt.Chart(bottomRated).mark_bar().encode(
    y='developers',
    x='n_achievements'
).properties(
    title='Count of achievements for bottom rated games, by developer'
)

topChart | bottomChart

<VegaLite 5 object>
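
Comparing only the top and bottom 20 rows uses a small part of the data, and when many games share the same `review_score` the rows that `head(20)` returns depend on how ties are ordered. A complementary check is a rank correlation across all reviewed games. The sketch below uses made-up numbers and computes Spearman's coefficient as the Pearson correlation of the ranks, which needs only pandas. The frame `games` is hypothetical.

```python
import pandas as pd

# Hypothetical per-game data using the dataset's column names.
games = pd.DataFrame({
    "n_achievements": [0, 5, 10, 20, 40, 80],
    "review_score": [2, 4, 3, 6, 8, 9],
    "total_reviews": [12, 30, 15, 200, 500, 90],
})

# Only keep games that actually have user reviews.
reviewed = games[games["total_reviews"] > 0]

# Spearman's rho is the Pearson correlation between the two rank vectors.
rho = reviewed["n_achievements"].rank().corr(reviewed["review_score"].rank())
print(round(rho, 3))
```

A value near 1 would support the impression from the charts, while a value near 0 would suggest the top/bottom comparison is driven by a few games.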



Below is the number of distinct categories across each developer's games, for the 20 developers with the highest and the 20 with the lowest mean review score. Overall, the top-rated developers seem to cover more categories, which suggests this correlation could be tested further.

In [None]:
# Work on an explicit copy so that parsing 'categories' does not modify `data`
catCountData = data.copy()
catCountData['categories'] = data['categories'].apply(ast.literal_eval)
catCountData = catCountData.explode('categories')

catCountSummary = catCountData.groupby('developers').agg(
    count=('categories', 'nunique'),  
    score=('review_score', 'mean')  
).reset_index()

bottomRated = catCountSummary.sort_values(by='score',ascending=True).head(20)
topRated = catCountSummary.sort_values(by='score',ascending=False).head(20)

topChart = alt.Chart(topRated).mark_bar().encode(
    y='developers',
    x='count',
    tooltip='score',
    
).properties(
    title='Count of categories for top rated developers (by mean score)'
)

bottomChart = alt.Chart(bottomRated).mark_bar().encode(
    y='developers',
    x='count',
    tooltip='score'
).properties(
    title='Count of categories for bottom rated developers (by mean score)'
)

topChart | bottomChart


<VegaLite 5 object>
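
The cell above counts distinct categories per developer. The third research question is phrased per game, so a per-game count may be a closer match. As a hedged sketch on made-up data (the frame `games` and the column `n_categories` are hypothetical), the number of categories for each game can be taken from the length of its parsed list and then compared against the mean review score:

```python
import ast

import pandas as pd

# Hypothetical per-game data: 'categories' holds stringified lists.
games = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "categories": [
        "['Single-player']",
        "['Single-player', 'Steam Achievements', 'Steam Cloud']",
        "['Multi-player']",
        "['Single-player', 'Co-op']",
    ],
    "review_score": [4, 8, 6, 7],
})

# Number of categories per game = length of the parsed list.
games["n_categories"] = games["categories"].apply(ast.literal_eval).apply(len)

# How many games fall in each count, and their mean review score.
by_count = games.groupby("n_categories")["review_score"].agg(["count", "mean"])
print(by_count)
```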

