# Super Mario 64 Speedruns - Nationality Notebook



# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import geopandas as gpd
import json
from urllib.request import urlopen

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from IPython.display import display, Markdown

In [None]:
def _display(text):
    display(Markdown(text))

# Load Data

In [None]:
categories = ["0 Star", "1 Star", "16 Star", "70 Star", "120 Star"]
INPUT_DIR = "/kaggle/input/super-mario-64-speedruns"
df_dict = {}
for cat in categories:
    df_dict[cat] = pd.read_csv(os.path.join(INPUT_DIR, f"data_{cat}.csv"))

Let's see what our data looks like:

In [None]:
df_dict["70 Star"].head()

We can check details about our dataset by calling the `info` method (`describe` gives a similar summary of the numeric columns):

In [None]:
df_dict['0 Star'].info()

# Exploratory Data Analysis

## Country and Category Total Entries

A casual scan of the leaderboards shows a consistently heavy presence of runners from the US, and my educated guess is that the game is also very popular in its home country, Japan.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Define dataset paths and titles
dataset_paths = [
    '/kaggle/input/super-mario-64-speedruns/data_0 Star.csv',
    '/kaggle/input/super-mario-64-speedruns/data_1 Star.csv',
    '/kaggle/input/super-mario-64-speedruns/data_120 Star.csv',
    '/kaggle/input/super-mario-64-speedruns/data_16 Star.csv',
    '/kaggle/input/super-mario-64-speedruns/data_70 Star.csv',
]
dataset_titles = [
    'Zero Star',
    'One Star',
    '120 Star',
    '16 Star',
    '70 Star',
]

# Function to create and display pie chart for a given dataset
def create_pie_chart(dataset_path, title):
    data = pd.read_csv(dataset_path)
    
    # Group the data by 'player_country' and calculate the count of players in each country
    country_counts = data['player_country'].value_counts()
    
    # Limit the number of countries to plot by selecting the top 10 countries
    top_countries = country_counts.head(10)  
    
    # Create a pie chart
    plt.pie(top_countries, labels=top_countries.index, autopct='%1.1f%%', startangle=140)
    
    # Add a title
    plt.title(f'Top 10 Countries for {title}')

    # Display the chart
    plt.show()

# Loop through the datasets and create pie charts
for path, title in zip(dataset_paths, dataset_titles):
    create_pie_chart(path, title)

### Accounting for Population Size

I was right! Kinda! The US does appear the most frequently in every category, but it gives up some ground in the 1 Star category. The slots that follow are closely contested between Japan, Canada, and a number of European countries. But how do these numbers compare to each country's total population?

In [None]:
import matplotlib.pyplot as plt

# Population data
population = {
    'United States': 331002651,
    'Japan': 126476461,
    'Germany': 83783942,
    'Canada': 37742154,
    'France': 65273511,
    'England': 56079000,
    'Australia': 25499884,
    'Spain': 46754778,
    'Netherlands': 17134872,
    'Chile': 19116201,
    'South Korea': 51269185
}

# Countries and their corresponding populations
countries = list(population.keys())
populations = list(population.values())

# Create a pie chart
plt.figure(figsize=(10, 7))
plt.pie(populations, labels=countries, autopct='%1.1f%%', startangle=140)
plt.title('Population Distribution by Country')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Display the pie chart
plt.show()

### Finding Speedrunner Per Capita Proportions

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load your player data using pandas
data = pd.read_csv('/kaggle/input/super-mario-64-speedruns/data_0 Star.csv')

# Population data
population = {
    'United States': 331002651,
    'Japan': 126476461,
    'Germany': 83783942,
    'Canada': 37742154,
    'France': 65273511,
    'England': 56079000,
    'Australia': 25499884,
    'Spain': 46754778,
    'Netherlands': 17134872,
    'Chile': 19116201,
    'South Korea': 51269185
}

# Initialize a dictionary to store speedrunner proportions
speedrunner_proportions = {}

# Calculate the proportion of players per capita for each country
for country in population:
    # Filter the data for the current country
    country_data = data[data['player_country'] == country]
    
    # Calculate the proportion of players per capita
    proportion = len(country_data) / population[country]
    
    # Store the proportion in the dictionary
    speedrunner_proportions[country] = proportion

# Convert the dictionary to a pandas DataFrame for easier plotting
proportions_df = pd.DataFrame(list(speedrunner_proportions.items()), columns=['Country', 'Proportion'])

# Sort the DataFrame by proportion
proportions_df = proportions_df.sort_values(by='Proportion', ascending=False)

# Create a bar graph
plt.figure(figsize=(12, 6))
plt.bar(proportions_df['Country'], proportions_df['Proportion'], color='skyblue')
plt.xlabel('Country')
plt.ylabel('Leaderboard entries per capita')
plt.title('0 Star Leaderboard Entries per Capita by Country')
plt.xticks(rotation=45, ha="right")  # Rotate x-axis labels for better readability

# Display the bar graph
plt.tight_layout()
plt.show()
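
The per-capita values computed above are very small fractions (a few hundred entries at most, divided by populations in the tens or hundreds of millions), which can make the axis hard to read. Below is a minimal sketch of the same calculation scaled to entries per million residents. The `entry_counts` values are hypothetical placeholders, not numbers from the dataset, and `entries_per_million` is just an illustrative helper name; the populations are copied from the `population` dictionary above.

```python
# Hypothetical entry counts, for illustration only (NOT taken from the dataset).
entry_counts = {"United States": 150, "Japan": 60, "Canada": 30}

# Population figures copied from the notebook's `population` dictionary.
population = {
    "United States": 331002651,
    "Japan": 126476461,
    "Canada": 37742154,
}

def entries_per_million(counts, pops):
    """Return leaderboard entries per one million residents for each country."""
    return {
        country: counts.get(country, 0) / pops[country] * 1_000_000
        for country in pops
    }

rates = entries_per_million(entry_counts, population)

# Print countries from the highest rate to the lowest.
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {rate:.3f} entries per million")
```

In the notebook itself, `entry_counts` could instead come from `data['player_country'].value_counts().to_dict()`.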


### Nationality Proportions Observations
Well! A number of things. Firstly, we find some evidence for my original claim: the US and Japan have more leaderboard entries per capita than most countries, but multiple other countries fall into this group too! This data implies that speedrunning has a stronger presence in some countries than in others, at least when examining the top 500, and by a rather significant margin. I especially want to highlight Canada here, as it is by far the most overrepresented region, and, surprisingly, South Korea and England are both extreme underperformers.

To note, this should not be used to make sweeping observations about the skill of runners in any given category, but rather to highlight the cultural importance that might be placed on the activity. Italian runners remain in the top three for both 1 Star and 0 Star at the time of this writing, implying that even less represented regions should not be ignored.
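
One way to make "over" and "under" participation concrete is a representation ratio: a country's share of the entries divided by its share of the population, both taken over the same set of countries. A ratio above 1 means more entries than population alone would predict. The sketch below is only one possible definition, not necessarily the right one for this data; the `entry_counts` values are hypothetical placeholders, and `representation_ratio` is an illustrative name.

```python
# Hypothetical entry counts, for illustration only (NOT taken from the dataset).
entry_counts = {"United States": 150, "Japan": 60, "Canada": 30, "South Korea": 2}

# Population figures copied from the notebook's `population` dictionary.
population = {
    "United States": 331002651,
    "Japan": 126476461,
    "Canada": 37742154,
    "South Korea": 51269185,
}

def representation_ratio(counts, pops):
    """Share of entries divided by share of population, per country.

    Both shares are computed over the countries listed in `pops` only.
    """
    total_entries = sum(counts.get(country, 0) for country in pops)
    total_pop = sum(pops.values())
    if total_entries == 0:
        raise ValueError("no entries for the listed countries")
    return {
        country: (counts.get(country, 0) / total_entries) / (pops[country] / total_pop)
        for country in pops
    }

ratios = representation_ratio(entry_counts, population)
for country, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    label = "over" if ratio > 1 else "under"
    print(f"{country}: {ratio:.2f} ({label}-represented in this toy example)")
```

Because both shares are restricted to the listed countries, the ratios only compare those countries with each other; adding or removing a country changes every value.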

## Limitations
* Me: I remain human, and fallible. The proportion chart above in particular should be given extra scrutiny, as I am a student and the math on that one seems a bit wonky to me.
* Top 500: This dataset only covers the top 500 at the time the data was pulled; this means the data will shift over time, and it accounts for neither the history of the leaderboard nor the breadth of runners who fail to land on the top board.

## Further Study
Below is a list of items I present to both you, dear reader, and myself as further directions to take this data.
* Region Preference: Examining, per capita, each region's most preferred and least preferred game to run.
* Average Placement / Time Differences: Ranking regions by their aggregate performance on the leaderboard.
* ML Model to Estimate Time Saves: Examining how the first-place time on the leaderboard changes over time, and developing a model that can predict how far the run may be optimized in a given time frame. (This may be beyond the scope of this data.)
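
As a possible starting point for the Region Preference idea, here is a rough sketch that stacks per-category frames and computes, for each country, the share of its entries in each category. It reads "preference" as "which leaderboard category holds the most of a country's entries", which is an assumption, and it does not yet adjust for population. The `toy_frames` rows are made up and only stand in for the notebook's `df_dict`; `player_country` is the only real column name used.

```python
import pandas as pd

# Toy stand-ins for two of the per-category frames in `df_dict`; rows are made up.
toy_frames = {
    "0 Star": pd.DataFrame({"player_country": ["United States", "Japan", "Japan"]}),
    "120 Star": pd.DataFrame({"player_country": ["United States", "United States", "Canada"]}),
}

# Stack the frames, tagging each row with the category it came from.
combined = pd.concat(
    [frame.assign(category=name) for name, frame in toy_frames.items()],
    ignore_index=True,
)

# Rows: countries; columns: categories; values: share of that country's entries.
preference = pd.crosstab(
    combined["player_country"], combined["category"], normalize="index"
)

# The category holding the largest share of each country's entries.
preferred = preference.idxmax(axis=1)

print(preference)
print(preferred)
```

Swapping `toy_frames` for `df_dict` should give the same table for the real data, as long as every frame has a `player_country` column.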

*That's all I got! Feel free to correct me on any math or logical mistakes I made, and copy this notebook freely to tinker with however you like!*