# Analyzing 3x3 Data From WCA Competitions

In this project I use Python to analyze data on Rubik's Cube competitions hosted by the WCA, or "World Cube Association."
There are 17 events in total, including 3x3 blindfolded and 7x7, but this analysis focuses solely on the traditional 3x3 cube. Competitions are structured as follows. Each event has a certain number of rounds, typically 2-4. In each round, competitors generally do 5 solves; the best and worst times are thrown out and the middle 3 are averaged. This average determines the ranking in each round. Fewer competitors advance after each round, with the final round typically having 12-16 competitors. The competitor with the fastest average in the final round wins the event at that competition. For more information on the WCA and WCA competitions, visit the WCA website [here](https://www.worldcubeassociation.org/). The dataset can be exported from [here](https://www.worldcubeassociation.org/export/results).
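
To make the scoring rule concrete, here is a minimal sketch of the average described above. The function name `average_of_five` is my own, and the sketch assumes all five solves are valid times in centiseconds (the unit used in the export); it does not handle failed or missing solves. The example values are the first row of the results file loaded in the next section.

```python
def average_of_five(times_cs):
    """Trimmed mean of five solve times given in centiseconds.

    Assumption: all five values are valid, positive times.
    """
    if len(times_cs) != 5:
        raise ValueError("expected exactly five solves")
    ordered = sorted(times_cs)
    middle = ordered[1:4]  # drop the best and the worst solve
    return round(sum(middle) / 3 / 100, 2)  # centiseconds -> seconds


# 19.68 (best) and 22.03 (worst) are dropped; the rest are averaged
print(average_of_five([1968, 2203, 2138, 2139, 2108]))  # 21.28
```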

## Loading the Data

First we will import the necessary libraries.

In [15]:
import pandas as pd
from matplotlib import pyplot as plt

Next, I need to read the data into pandas dataframes. The data was exported from the WCA website in TSV format, and the columns we need are spread across 3 separate TSV files. First, there is the "results" file, which contains the individual averages done in WCA competitions. Second, there is the "persons" file, which contains information on every WCA competitor; we will use it only to get the gender of each competitor. Finally, there is the "competitions" file, which contains information on individual WCA competitions.

In [17]:
results = pd.read_csv(r'Data/WCA_export_Results.tsv', sep='\t', usecols=["competitionId", "eventId", "roundTypeId", "personName", "personId", "value1", "value2", "value3", "value4", "value5", "personCountryId"])
print(results.head())

  competitionId eventId roundTypeId               personName    personId  \
0  LyonOpen2007     333           1            Etienne Amany  2007AMAN01   
1  LyonOpen2007     333           1           Thomas Rouault  2004ROUA01   
2  LyonOpen2007     333           1  Antoine Simon-Chautemps  2005SIMO01   
3  LyonOpen2007     333           1           Irène Mallordy  2007MALL01   
4  LyonOpen2007     333           1       Marlène Desmaisons  2007DESM01   

   value1  value2  value3  value4  value5 personCountryId  
0    1968    2203    2138    2139    2108   Cote d_Ivoire  
1    2222    2153    1731    2334    2046          France  
2    3430    2581    2540    2789    2305          France  
3    2715    2452    2868    2632    2564          France  
4    2921    3184    2891    2677    2907          France  


In [18]:
persons = pd.read_csv(r"Data/WCA_export_Persons.tsv", sep="\t", usecols=["id", "gender"])
print(persons.head())

  gender          id
0      m  1982BORS01
1      m  1982BRIN01
2      m  1982CHIL01
3      f  1982FRID01
4      f  1982FRID01


In [19]:
competitions = pd.read_csv(r"Data/WCA_export_Competitions.tsv", sep="\t", usecols=["id", "countryId", "year", "month", "day"])
print(competitions.head())

                             id  countryId  year  month  day
0                 100Merito2018     Brazil  2018      4   14
1    100YearsRepublicAnkara2023     Turkey  2023     10   28
2  100YearsRepublicIstanbul2023     Turkey  2023     10   28
3      100YilMBACubeWeekend2023     Turkey  2023     12   16
4    10AniversarioGuatemala2023  Guatemala  2023     10   14


## Data Cleaning and Wrangling

The results dataframe needs a few changes. First, I keep only averages in which all 5 solves were completed. This matters for the later analysis of median solve time by solve number: competitions often impose time limits (cutoffs) that must be met within the first 2 solves, and competitors who miss them do not get to finish the average. If every recorded solve were included, solves 1 and 2 would have a higher median simply because slower competitors contribute only those two solves. Second, I reshape the dataframe so that every row is one solve. The resulting "variable" column records which solve number (1-5) the row refers to; I strip the "value" prefix so that only the digit remains. Finally, I keep only 3x3 solves and divide the times by 100 so that they are shown in seconds rather than centiseconds.

In [22]:
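# Keep only averages in which all 5 solves have a valid (positive) time
results = results[(results["value1"] > 0) & (results["value2"] > 0) & (results["value3"] > 0) & (results["value4"] > 0) & (results["value5"] > 0)]
# Reshape from one row per average to one row per solve
results = results.melt(id_vars=["competitionId", "eventId", "personName", "personId", "personCountryId", "roundTypeId"], value_vars=["value1", "value2", "value3", "value4", "value5"])
# "value3" -> "3" (the column still holds strings)
results["variable"] = results["variable"].str.replace("value", "")
# Keep only 3x3 solves; .copy() avoids assigning into a filtered view
results = results[results["eventId"] == "333"].copy()
# Convert centiseconds to seconds
results["value"] = results["value"]/100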
print(results.head())

  competitionId eventId               personName    personId personCountryId  \
0  LyonOpen2007     333            Etienne Amany  2007AMAN01   Cote d_Ivoire   
1  LyonOpen2007     333           Thomas Rouault  2004ROUA01          France   
2  LyonOpen2007     333  Antoine Simon-Chautemps  2005SIMO01          France   
3  LyonOpen2007     333           Irène Mallordy  2007MALL01          France   
4  LyonOpen2007     333       Marlène Desmaisons  2007DESM01          France   

  roundTypeId variable  value  
0           1        1  19.68  
1           1        1  22.22  
2           1        1  34.30  
3           1        1  27.15  
4           1        1  29.21  
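
To show what the filter and the reshape do, here is a minimal sketch on two made-up averages. The toy dataframe and the names `wide`, `complete` and `long` are my own. The sketch assumes that -1 marks an unfinished solve, and unlike the cell above it converts the solve number to an integer.

```python
import pandas as pd

# Two made-up averages; person "B" has one unfinished solve (assumed code: -1)
wide = pd.DataFrame({
    "personId": ["A", "B"],
    "value1": [1968, 2500],
    "value2": [2203, -1],
    "value3": [2138, 2600],
    "value4": [2139, 2550],
    "value5": [2108, 2700],
})

value_cols = [f"value{i}" for i in range(1, 6)]

# Keep only rows where every one of the five solves has a positive time
complete = wide[(wide[value_cols] > 0).all(axis=1)]

# Wide -> long: one row per solve
long = complete.melt(id_vars=["personId"], value_vars=value_cols,
                     var_name="solveNumber", value_name="time")
long["solveNumber"] = long["solveNumber"].str.replace("value", "").astype(int)
long["time"] = long["time"] / 100  # centiseconds -> seconds

print(long)
```

Only the five solves of person "A" remain, one per row, with times in seconds.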


Next, I need to adjust the "competitions" dataframe so that there is one single datetime column showing the date.

In [24]:
competitions["date"] = pd.to_datetime(competitions[["day", "month", "year"]])
competitions = competitions.drop(columns=["day", "month", "year"])
print(competitions.head())

                             id  countryId       date
0                 100Merito2018     Brazil 2018-04-14
1    100YearsRepublicAnkara2023     Turkey 2023-10-28
2  100YearsRepublicIstanbul2023     Turkey 2023-10-28
3      100YilMBACubeWeekend2023     Turkey 2023-12-16
4    10AniversarioGuatemala2023  Guatemala 2023-10-14


Finally, I need to merge the datasets together into one dataframe. I also rename a few of the columns to be more meaningful and reorder them. 

In [26]:
data = results.merge(persons, left_on = "personId", right_on = "id")
data = data.merge(competitions, left_on = "competitionId", right_on = "id")
data = data.drop(columns=["id_x", "id_y", "competitionId", "eventId"])
data = data.rename(columns={"variable": "solveNumber", "value": "time", "countryId": "competitionCountryId"})
data = data.reindex(columns = ["personId", "personName", "personCountryId", "gender", "time", "solveNumber", "roundTypeId", "date", "competitionCountryId"])
print(data.head())

     personId               personName personCountryId gender   time  \
0  2007AMAN01            Etienne Amany   Cote d_Ivoire      m  19.68   
1  2004ROUA01           Thomas Rouault          France      m  22.22   
2  2005SIMO01  Antoine Simon-Chautemps          France      m  34.30   
3  2007MALL01           Irène Mallordy          France      f  27.15   
4  2007DESM01       Marlène Desmaisons          France      f  29.21   

  solveNumber roundTypeId       date competitionCountryId  
0           1           1 2007-09-01               France  
1           1           1 2007-09-01               France  
2           1           1 2007-09-01               France  
3           1           1 2007-09-01               France  
4           1           1 2007-09-01               France  
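
One thing worth checking before trusting the merged counts: the preview of the persons file above lists the id 1982FRID01 twice. If ids repeat in that file, a merge on `personId` repeats every solve by those competitors. I have not checked how much this affects the numbers in this notebook. The sketch below uses made-up toy frames (`results_toy`, `persons_toy`) to show the effect, along with one possible guard: dropping duplicate ids and asking pandas to validate the merge.

```python
import pandas as pd

# Toy data: one person appears twice in the persons table
results_toy = pd.DataFrame({
    "personId": ["1982FRID01", "2007AMAN01"],
    "time": [25.00, 19.68],
})
persons_toy = pd.DataFrame({
    "id": ["1982FRID01", "1982FRID01", "2007AMAN01"],
    "gender": ["f", "f", "m"],
})

# Naive merge: the solve by 1982FRID01 is duplicated
naive = results_toy.merge(persons_toy, left_on="personId", right_on="id")
print(len(naive))  # 3 rows instead of 2

# Guarded merge: keep one row per id and let pandas verify uniqueness
persons_unique = persons_toy.drop_duplicates(subset="id")
checked = results_toy.merge(persons_unique, left_on="personId", right_on="id",
                            validate="many_to_one")
print(len(checked))  # 2 rows
```

With `validate="many_to_one"`, pandas raises an error if the right-hand table still has repeated keys, so the duplication cannot go unnoticed.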


## Analyzing Solve Speed by Solve Number

We will simply examine the median solve time by solve number.

In [72]:
by_number = data["time"].groupby(data["solveNumber"]).agg(["median", "count"])
print(by_number)

             median    count
solveNumber                 
1             15.94  1353509
2             15.93  1353509
3             15.90  1353509
4             15.88  1353509
5             15.87  1353509


The median time drops slightly but steadily over the course of an average, from 15.94 seconds on the first solve to 15.87 seconds on the fifth. The difference is small (0.07 seconds), but its direction is what I would expect. At the beginning of an average, competitors might be nervous and need time to adjust to the competition atmosphere; as the average goes on, they get used to the situation and become faster. I suspect this is not always the case, however. If the first few solves go very well, a competitor who knows that a very fast average is within reach might get nervous and perform poorly on the remaining solves.
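
Because the medians above pool all competitors, they do not show directly whether individual competitors speed up within their own average. A paired comparison would address that. The sketch below runs on made-up numbers, and `averageId` is a hypothetical key identifying one round of five solves. In the real data it would have to be built from person, competition and round, and the merged dataframe above no longer contains `competitionId`.

```python
import pandas as pd

# Made-up long-format data: two averages of five solves each
toy = pd.DataFrame({
    "averageId": [1] * 5 + [2] * 5,          # hypothetical key, not in the export
    "solveNumber": [1, 2, 3, 4, 5] * 2,
    "time": [16.0, 15.5, 15.8, 15.2, 15.0,
             12.0, 12.4, 11.9, 12.1, 12.3],
})

# One row per average, one column per solve number
per_average = toy.pivot(index="averageId", columns="solveNumber", values="time")

# Within-average change from the first to the fifth solve
diff = per_average[5] - per_average[1]

print(diff.median())     # median change in seconds (negative = faster)
print((diff < 0).mean())  # share of averages where solve 5 beat solve 1
```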

## Analyzing Solve Speed of New Competitors Over Time

I'm also interested in how fast new competitors have gotten over time. This could show us how Rubik's Cube hardware and solving techniques have advanced. We can find out what year a competitor started in by looking at the "personId" column: the first four characters of each ID are the year of that competitor's first competition. Below, I keep only solves done in the same calendar year as that first competition.

In [112]:
same_year = data[data["personId"].str[:4] == data["date"].dt.year.astype(str)]
same_year_medians = same_year["time"].groupby(same_year["date"].dt.year).agg(["median", "count"])
print(same_year_medians)

      median   count
date                
2003  21.420      35
2004  31.345     560
2005  31.480    1745
2006  29.650    2970
2007  32.580    6225
2008  28.980   12660
2009  28.900   23605
2010  26.650   27895
2011  25.710   30090
2012  26.640   29570
2013  27.190   44130
2014  28.030   66645
2015  27.680   87470
2016  26.560  130550
2017  24.930  180200
2018  26.600  194495
2019  26.370  209345
2020  29.600   24515
2021  22.760   30200
2022  22.840  251725
2023  24.700  372395
2024  26.140  299675
2025  30.060   44750


I find these results quite surprising. There does not seem to be much improvement over time; for example, the median solve time for new competitors in 2011 (25.71 seconds) was faster than in 2024 (26.14 seconds). One possible explanation is that competitions are far more widespread than they used to be. The number of competitions has increased drastically over time, which makes it easier for new competitors to compete right away. Many years ago, people might have waited much longer to attend their first competition because so few were available, so they arrived with more experience and faster times.
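
Note that the table above includes every solve from a competitor's first calendar year, so someone who debuts in January and improves by December contributes both sets of times. A stricter version would keep only the solves from each competitor's first competition date. The sketch below shows one way to do that on made-up data with the same column names as `data`. It assumes a competitor's earliest date identifies a single competition, which could fail if someone attended two competitions with the same listed date.

```python
import pandas as pd

# Made-up data: one competitor with two competitions in 2011, one debuting in 2024
toy = pd.DataFrame({
    "personId": ["2011AAAA01"] * 3 + ["2024BBBB01"] * 2,
    "date": pd.to_datetime(["2011-03-05", "2011-03-05", "2011-11-20",
                            "2024-06-01", "2024-06-01"]),
    "time": [24.0, 26.0, 15.0, 27.0, 25.0],
})

# Earliest competition date per competitor, aligned to every row
first_date = toy.groupby("personId")["date"].transform("min")

# Keep only solves from that first date
first_comp = toy[toy["date"] == first_date]

summary = first_comp.groupby(first_comp["date"].dt.year)["time"].agg(["median", "count"])
print(summary)
```

In the toy data, the 15.0-second solve from the competitor's second competition in 2011 is excluded, so it no longer pulls the 2011 median down.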

## Analyzing the Rise of Top Chinese Speedcubers

An interesting phenomenon has arisen among top competitors: many of them are now from China. I examine this trend and try to explain it. First, we can check whether China accounts for a larger percentage of all competitors than before. If it does, it would make sense that China also accounts for a larger percentage of top competitors.

In [92]:
data["year"] = data["date"].dt.year
total_competitors_per_year = data.groupby(data["year"])['personId'].nunique()
china_competitors_per_year = data[data['personCountryId'] == 'China'].groupby('year')['personId'].nunique()
china_comparison = pd.DataFrame({'TotalCompetitors': total_competitors_per_year, 'ChinaCount': china_competitors_per_year}).fillna(0).astype(int)
china_comparison['ChinaPercent'] = (china_comparison['ChinaCount'] / china_comparison['TotalCompetitors'] * 100).round(2)
data = data.drop(columns=["year"])
print(china_comparison)


      TotalCompetitors  ChinaCount  ChinaPercent
year                                            
2003                 8           0          0.00
2004               108           1          0.93
2005               300           0          0.00
2006               554           0          0.00
2007              1142          40          3.50
2008              2175         240         11.03
2009              4060         704         17.34
2010              5255         919         17.49
2011              6077         888         14.61
2012              6530         854         13.08
2013              8638        1273         14.74
2014             12139        1262         10.40
2015             16468        1644          9.98
2016             23596        3001         12.72
2017             31922        4672         14.64
2018             37991        6122         16.11
2019             42442        6577         15.50
2020             11268         521          4.62
2021              84

First, we should ignore the percentages for 2020-2023, since the pandemic shut down many competitions. For 2024 and so far in 2025, however, the percentages are actually much lower than they were previously, which is quite surprising considering the number of top Chinese cubers. Perhaps cubers in China are, on average, simply faster relative to other countries than they used to be. To test this hypothesis, I will compare the median solve time in China with that of all other countries over time.

In [101]:
data["year"] = data["date"].dt.year
median_yearly = data[data["personCountryId"] != "China"].groupby("year")['time'].median()
china_median_yearly = data[data['personCountryId'] == 'China'].groupby('year')['time'].median()
china_comparison2 = pd.DataFrame({'NonChinaMedian': median_yearly, 'ChinaMedian': china_median_yearly})
data = data.drop(columns=["year"])
print(china_comparison2)

      NonChinaMedian  ChinaMedian
year                             
2003           21.48          NaN
2004           27.36       25.045
2005           25.39          NaN
2006           24.34          NaN
2007           23.84       24.305
2008           21.71       25.660
2009           21.28       22.560
2010           19.56       19.835
2011           18.16       17.720
2012           17.55       17.110
2013           18.05       16.910
2014           18.17       16.290
2015           17.86       15.980
2016           17.55       16.400
2017           16.75       16.190
2018           16.16       15.960
2019           15.80       15.720
2020           15.43       14.740
2021           13.89       13.350
2022           15.11       13.710
2023           15.28       11.790
2024           14.60       11.390
2025           14.49       11.930


Here we can see that the data are consistent with our hypothesis. Just before the pandemic, in 2019, the median solve time in China was nearly identical to that of other countries (15.72 vs. 15.80 seconds), whereas so far in 2025 it is much faster (11.93 vs. 14.49 seconds). So what we are seeing in China is a smaller relative population of cubers who are, on average, much faster than they used to be.

Why might this be the case? Many top cubers from China were already quite fast as soon as they started competing. Many of them were averaging under 10 seconds from the start, while in many other countries those competing in their first competition might average 30 seconds. I suspect the culture around cubing in China has grown to be less casual and more competitive: people do not often attend competitions unless they are already very fast, focusing less on the fun of the competition and more on getting results. The overall population of people in China who can solve a Rubik's cube might be much larger than participation in WCA competitions suggests, which would explain why there are so many fast cubers from China.
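
The starting observation of this section, that many top competitors are now from China, is not measured anywhere above. One way to quantify it would be to rank competitors by their median time in each year and compute China's share of the fastest `TOP_N`. The sketch below does this on made-up data. `TOP_N` and the toy rows are my own choices, and ranking by yearly median is only one of several reasonable definitions of a "top" competitor.

```python
import pandas as pd

# Made-up solves for four competitors in one year
toy = pd.DataFrame({
    "personId": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"],
    "personCountryId": ["China", "China", "France", "France",
                        "China", "China", "Brazil", "Brazil"],
    "year": [2024] * 8,
    "time": [5.9, 6.1, 7.0, 7.4, 6.5, 6.7, 9.0, 9.2],
})

TOP_N = 2  # hypothetical cut-off for "top" competitors

# Median time per competitor per year
per_person = toy.groupby(["year", "personId", "personCountryId"],
                         as_index=False)["time"].median()

# Fastest TOP_N competitors in each year
top = per_person.sort_values("time").groupby("year").head(TOP_N)

# Percentage of those top competitors who are from China
china_share = (top["personCountryId"] == "China").groupby(top["year"]).mean() * 100
print(china_share)
```

Comparing this share with the `ChinaPercent` column computed earlier would show whether China is over-represented among the fastest competitors relative to its share of all competitors.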

## Conclusions

We answered a number of interesting questions through our analysis. First, we observed that competitors' solves become slightly faster as the average goes on, likely because they become less nervous. Second, we saw that new competitors have not gotten much faster over time, potentially because competitions are more widespread. Finally, we noted that while China's share of competitors has shrunk, the median times of Chinese competitors have become much faster, possibly due to a shift in cubing culture there.