# Wooden Spoons

## Introduction

This notebook is an analysis of wooden spoons in tennis.

The wooden spoon is a statistical booby prize that any early loser in a single-elimination tournament can earn. The winner of the wooden spoon is the player who is defeated in the first round by a player who is then defeated in the second round, by a player who is then defeated in the third, and so on, all the way up to the finalist (who loses to the tournament champion in the final). Unlikely as it sounds at first, just as every tournament has exactly one champion, it also has exactly one wooden spoon!
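To make the definition concrete, here is a minimal sketch on a made-up eight-player draw. The player names, the `matches` list, and the `wooden_spoon` helper are all hypothetical and exist only for illustration; the function starts from the losing finalist and repeatedly asks "whom did this player beat in the previous round?"

```python
# Toy 8-player draw: each tuple is (round, winner, loser). Names are made up.
matches = [
    ("QF", "A", "B"), ("QF", "C", "D"), ("QF", "E", "F"), ("QF", "G", "H"),
    ("SF", "A", "C"), ("SF", "E", "G"),
    ("F", "A", "E"),
]

def wooden_spoon(matches, rounds):
    """Return the chain of players from the losing finalist down to the spoon."""
    # Map (round, winner) -> loser so each step of the walk is a dict lookup.
    lost_to = {(rnd, winner): loser for rnd, winner, loser in matches}
    # Start with the player who lost the final.
    player = next(loser for rnd, _, loser in matches if rnd == rounds[-1])
    chain = [player]
    # Walk backwards through the earlier rounds.
    for rnd in reversed(rounds[:-1]):
        player = lost_to[(rnd, player)]
        chain.append(player)
    return chain

chain = wooden_spoon(matches, ["QF", "SF", "F"])
print(" beat ".join(chain))
print("Wooden spoon:", chain[-1])
```

The last name in the chain is the wooden spoon holder. Because every round contributes exactly one link, the chain (and so the spoon) is unique.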

Whereas the champion sits atop the longest possible list of wins, the holder of the wooden spoon sits beneath the longest possible list of losses. Therefore, if the champion of the tournament is supposedly the best player there, then logically the holder of the wooden spoon is the worst.

However, this is only really true if matches are deterministic, and the winner of each match is always the better of the two players. In reality, there is a lot of randomness involved (just enough to make the sport exciting). In almost every sport, even top-ranked players are statistically likely to suffer a shock first-round exit from time to time.
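As a rough illustration of how randomness changes the picture, the sketch below simulates many draws under an assumed win-probability model. Everything here is hypothetical: the `play_tournament` helper, the made-up strength ratings, the rule that player `a` beats player `b` with probability `s_a / (s_a + s_b)`, and the simple adjacent-pairs draw (no real seeding). It is not based on the tennis data used later in this notebook.

```python
import random

def play_tournament(strengths, rng):
    """Simulate one single-elimination draw; return (champion, wooden_spoon).

    strengths: positive numbers, one per player; length must be a power of two.
    """
    players = list(range(len(strengths)))
    victims = {}  # winner -> players they beat, in round order
    while len(players) > 1:
        next_round = []
        for a, b in zip(players[::2], players[1::2]):
            # Assumed model: a beats b with probability s_a / (s_a + s_b).
            p_a = strengths[a] / (strengths[a] + strengths[b])
            winner, loser = (a, b) if rng.random() < p_a else (b, a)
            victims.setdefault(winner, []).append(loser)
            next_round.append(winner)
        players = next_round
    champion = players[0]
    # Follow each player's most recent victim down to a first-round loser.
    spoon = champion
    while spoon in victims:
        spoon = victims[spoon][-1]
    return champion, spoon

rng = random.Random(0)
strengths = [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]  # made-up ratings
trials = 10_000
results = [play_tournament(strengths, rng) for _ in range(trials)]
best = 0  # index of the strongest player
print("strongest player wins title:", sum(c == best for c, _ in results) / trials)
print("strongest player takes spoon:", sum(s == best for _, s in results) / trials)
```

Under a model like this, the strongest player should take the spoon far less often than the title, but not never.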

The women's tennis world recently got a particularly interesting wooden spoon holder. The highest tier of tennis tournaments is the four Grand Slams, each held once a year. [Jeļena Ostapenko](https://en.wikipedia.org/wiki/Je%C4%BCena_Ostapenko), an aggressive, hard-hitting Latvian player, shook things up in 2017 by winning one of them, the French Open, prevailing over overwhelming favorite [Simona Halep](https://en.wikipedia.org/wiki/Simona_Halep) in the final to secure the first French Open title by an unseeded (low-ranked) player since 1933.

Well, the French Open came and went again, and this year we saw defending champion Ostapenko win...a wooden spoon.

The best player at the tournament in 2017, worst player in 2018. A slam and an anti-slam, in the same event, back to back! Impressive.

This got me thinking. Who are the wooden spoon holders? How often do top-ranking players earn them? [The best list I could find](http://www.mikero.com/misc/anti-slam/) is several years out of date, and only covers the ATP. [The data](https://github.com/JeffSackmann) is out there, courtesy of Jeff Sackmann. Is there anything to the wooden spoon?

## Data processing

The raw data comes from https://github.com/JeffSackmann/tennis_wta and https://github.com/JeffSackmann/tennis_atp. I downloaded and unzipped both repositories into a local `data` folder for this repository. Thanks Jeff!

In [7]:
%ls ../data

tennis_atp-master/  tennis_wta-master/


In [26]:
pd.set_option('display.max_columns', None)

In [10]:
import pandas as pd
from tqdm import tqdm
import numpy as np

atp = pd.concat([pd.read_csv('../data/tennis_atp-master/atp_matches_{0}.csv'.format(year)) for year in tqdm(range(1968, 2019))])

100%|██████████| 51/51 [00:00<00:00, 62.19it/s]

In [11]:
atp.shape

(167879, 49)

Focus on the Slams.

In [103]:
atp_slams = atp[atp['tourney_name'].isin(['Wimbledon', 'Roland Garros', 'US Open', 'Us Open', 'Australian Open'])]

In [104]:
atp_slams.shape

(24062, 49)

In [105]:
atp_slams.head()

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,winner_name,winner_hand,winner_ht,winner_ioc,winner_age,winner_rank,winner_rank_points,loser_id,loser_seed,loser_entry,loser_name,loser_hand,loser_ht,loser_ioc,loser_age,loser_rank,loser_rank_points,score,best_of,round,minutes,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
223,1968-560,US Open,Grass,96,G,19680829,1,100087,4.0,,John Newcombe,R,183.0,AUS,24.268309,,,110066,,,Allen Quay,R,,USA,,,,6-1 6-3 6-3,5,R128,,,,,,,,,,,,,,,,,,,
224,1968-560,US Open,Grass,96,G,19680829,2,100023,,,Ramanathan Krishnan,R,,IND,31.383984,,,109966,,,Warren Jacques,R,,AUS,30.472279,,,14-12 2-6 6-0 6-4,5,R128,,,,,,,,,,,,,,,,,,,
225,1968-560,US Open,Grass,96,G,19680829,3,109816,,,E Victor Seixas,R,185.0,USA,44.999316,,,100127,,,Tom Gorman,R,180.0,USA,21.69473,,,8-6 6-4 6-3,5,R128,,,,,,,,,,,,,,,,,,,
226,1968-560,US Open,Grass,96,G,19680829,4,109932,,,Charles Mckinley,R,,USA,27.646817,,,109982,,,William Tym,R,,USA,,,,6-3 8-6 2-6 3-6 6-1,5,R128,,,,,,,,,,,,,,,,,,,
227,1968-560,US Open,Grass,96,G,19680829,5,100060,15.0,,Marty Riessen,R,185.0,USA,26.735113,,,100141,,,Zeljko Franulovic,R,,CRO,21.212868,,,11-9 4-6 6-3 6-4,5,R128,,,,,,,,,,,,,,,,,,,


In [106]:
atp_slams['round'].value_counts()

R128    11628
R64      6265
R32      3184
R16      1592
QF        796
SF        398
F         199
Name: round, dtype: int64

In [183]:
def spoon(tourney):
    """Return (name, seed label) of the wooden spoon holder for one tournament's matches."""
    round_order = ['R128', 'R64', 'R32', 'R16', 'QF', 'SF', 'F']
    
    try:
        # Start from the losing finalist.
        latest_loser = tourney[tourney['round'] == 'F'].iloc[0].loser_name
    except IndexError:
        # Partial data (no final on record). Occurs occasionally.
        print("WARNING: could not parse data for tourney {0} with timestamp {1}".format(
            tourney.tourney_name.iloc[0], tourney.tourney_date.iloc[0]
        ))
        return ("Unknown", "Unseeded")
    
    # Walk backwards through the earlier rounds, following whoever the current player beat.
    # Boolean masks are used instead of query strings so that names containing
    # quote characters cannot break the lookup.
    for r in round_order[:-1][::-1]:
        beaten = tourney[(tourney['round'] == r) & (tourney['winner_name'] == latest_loser)]
        if beaten.empty:
            # No recorded win in this round (e.g. missing data), so stop here.
            break
        latest_loser = beaten.iloc[0].loser_name
            
    if pd.isnull(latest_loser):
        return ("Unknown", "Unseeded")
    
    seed = tourney[tourney['loser_name'] == latest_loser].iloc[0].loser_seed
    if pd.isnull(seed):
        seed = "Unseeded"
    else:
        seed = "[" + str(int(seed)) + "]"
        
    return latest_loser, seed

In [180]:
atp_slam_spoons = (
    atp_slams
        .groupby(['tourney_name', 'tourney_date'])
        .apply(lambda df: " ".join(spoon(df)))
        .reset_index()
        # Grab just the year from the int-formatted start date.
        .pipe(lambda df: df.assign(tourney_date=df.tourney_date.map(lambda v: str(v)[:4]).astype(int)))
        .rename(columns={'tourney_date': 'Year', 'tourney_name': 'Tournament', 0: 'Player'})
        # Recent US Opens are 'Us Open' for some reason.
        .replace('Us Open', 'US Open')
        # To maintain consistency with WTA list.
        .replace('Roland Garros', 'French Open')
        # Deal with doubled entries. The Australian was contested twice a couple of times due to calendar changes.
        # cf. http://www.mikero.com/misc/anti-slam/
        .groupby(['Tournament', 'Year'])
        .first()
        .unstack('Tournament')
        .replace(np.nan, 'Unknown')
        .applymap(lambda v: v.replace(" Unseeded", ""))
)
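The reshaping half of that pipeline can be hard to picture, so here is a small sketch on made-up data. The player labels (`P1` to `P4`), the dates, and the variable names `long_df` and `wide` are placeholders, not values from the real dataset. It shows how one row per tournament becomes a Year by Tournament table, with missing editions filled in as `Unknown`.

```python
import pandas as pd

# Hypothetical long-format results: one row per (tournament, start date).
long_df = pd.DataFrame({
    "tourney_name": ["Wimbledon", "Us Open", "Roland Garros", "Wimbledon"],
    "tourney_date": [19770620, 19770829, 19770523, 19780626],  # made-up dates
    "Player": ["P1", "P2", "P3", "P4"],
})

wide = (
    long_df
    # Integer division drops the month and day from a YYYYMMDD integer.
    .assign(Year=lambda df: df["tourney_date"] // 10000)
    # Normalize tournament names so variants land in the same column.
    .replace({"tourney_name": {"Us Open": "US Open", "Roland Garros": "French Open"}})
    .rename(columns={"tourney_name": "Tournament"})
    # Keep one entry per (Tournament, Year), then pivot tournaments into columns.
    .groupby(["Tournament", "Year"])["Player"]
    .first()
    .unstack("Tournament")
    .fillna("Unknown")
)
print(wide)
```

Integer division is used here for the year instead of string slicing; both give the same result for eight-digit dates.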

In [186]:
atp_slam_spoons

| Year | Australian Open | French Open | US Open | Wimbledon |
|---|---|---|---|---|
| 1968 | Unknown | Joseph Mateo | Billy Knight | Lance Lumsden |
| 1969 | Bob Giltinan | Ivan Molina | David Lloyd | Bob Giltinan |
| 1970 | Anthony Hammond | Jean Francois Caujolle | Onny Parun | Szabolcz Baranyi |
| 1971 | Alvin Gardiner | Bob Giltinan | D Richard Russell | Pat Cramer |
| 1972 | Jun Kuki | Ion Tiriac | Pat Dupre | Jim Osborne |
| 1973 | Robert Casey C100 | Piero Toci | Rayno Seegers | Neale Fraser |
| 1974 | John James | Michele Leclercq | Jun Kamiwazumi | Peter Szoke |
| 1975 | Joao Soares | Patricio Cornejo | Howard Schoenfield | John Andrews |
| 1976 | Unknown | Chris Kachel | Ove Nils Bengtson | Jay Royappa |
| 1977 | Allan Stone | Chris Kachel | Ricardo Cano | Mike Machette |

## WTA

In [138]:
wta = pd.concat([pd.read_csv('../data/tennis_wta-master/wta_matches_{0}.csv'.format(year), encoding='latin-1') for year in tqdm(range(1968, 2019))])

wta_slams = wta[wta['tourney_name'].isin(['Wimbledon', 'Roland Garros', 'French Open', 'US Open', 'Us Open', 'Australian Open'])]

100%|██████████| 51/51 [00:00<00:00, 100.61it/s]

In [184]:
wta_slam_spoons = (
    wta_slams
        .groupby(['tourney_name', 'tourney_date'])
        .apply(lambda df: " ".join(spoon(df)))
        .reset_index()
        # Grab just the year from the int-formatted start date.
        .pipe(lambda df: df.assign(tourney_date=df.tourney_date.map(lambda v: str(v)[:4]).astype(int)))
        .rename(columns={'tourney_date': 'Year', 'tourney_name': 'Tournament', 0: 'Player'})
        # Recent US Opens are 'Us Open' for some reason.
        .replace('Us Open', 'US Open')
        .replace('Roland Garros', 'French Open')
        # Deal with doubled entries. The Australian was contested twice a couple of times due to calendar changes.
        # cf. http://www.mikero.com/misc/anti-slam/
        .groupby(['Tournament', 'Year'])
        .first()
        .unstack('Tournament')
        .replace(np.nan, 'Unknown')
        .applymap(lambda v: v.replace(" Unseeded", ""))    
)



In [185]:
wta_slam_spoons

| Year | Australian Open | French Open | US Open | Wimbledon |
|---|---|---|---|---|
| 1968 | Unknown | Sally Holdsworth | Mary Lowdon | Gail Benedetti |
| 1969 | Wendy Gilchrist | Erzsebet Polgar | Eva Lundquist | Kerstin Seelbach |
| 1970 | L Cameron | Unknown | Mary Struthers | Odile De Roubin |
| 1971 | Janet Fallis | Helga Masthoff [5] | Lany Kaligis | Helen Amos |
| 1972 | Dorte Ekner | Unknown | Maria Teresa Nasuelli | Glynis Coles |
| 1973 | Frances Candy | Patti Hogan | Sharon Walsh Pete | Gertruida Walhof |
| 1974 | Unknown | Linky Boshoff | Janet Haas | Christina Sandberg |
| 1975 | Helen Cawley [5] | Sue Mappin | Mima Jausovec | Terry Holladay |
| 1976 | Unknown | Iris Riedel Kuhn | Carrie Meyer [14] | Florenta Mihai |
| 1977 | Kathleen Harter | Carrie Meyer | Robin Harris | Kathleen Harter |


PS: For some interesting properties of the wooden spoon, see https://web.archive.org/web/*/http://www.geocities.ws/andrewbroad/tennis/wooden_spoon/theory.html