# Intro to Pandas

For easy reference, the pandas documentation is here: https://pandas.pydata.org/pandas-docs/stable/index.html

To explore some pandas functionality, we're going to look at a dataset which is utterly relevant to the work we do today: 
The results of the [2018 Winter Olympics figure skating competition](https://en.wikipedia.org/wiki/Figure_skating_at_the_2018_Winter_Olympics)

This dataset comes from here: https://github.com/BuzzFeedNews/2018-02-olympic-figure-skating-analysis

Performance data is here: https://github.com/BuzzFeedNews/2018-02-olympic-figure-skating-analysis/blob/master/data/performances.csv

If we grab the raw .csv, we can import it directly into pandas (rad, right?). You can also import data from Excel, SQL, and HTML tables directly into pandas.

This creates a new dataframe from a .csv source:

In [1]:
import pandas as pd  # pd is common convention
df = pd.read_csv('https://raw.githubusercontent.com/BuzzFeedNews/2018-02-olympic-figure-skating-analysis/master/data/performances.csv')
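
If you're offline, `read_csv` will also take any file-like object instead of a URL. Here's a minimal sketch of that, using three rows copied from this dataset and trimmed to three columns (the `csv_text` and `sample_df` names are just for illustration):

```python
import io
import pandas as pd

# A few rows in the same shape as performances.csv, trimmed to three columns.
csv_text = """name,nation,total_segment_score
CHEN Nathan,USA,215.08
HANYU Yuzuru,JPN,206.17
UNO Shoma,JPN,202.73
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred from the text
```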

`info()` will give us some details about the dataframe we created.
We see that it has an index, 250 entries, and 11 columns.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   performance_id         250 non-null    object 
 1   competition            250 non-null    object 
 2   program                250 non-null    object 
 3   name                   250 non-null    object 
 4   nation                 250 non-null    object 
 5   rank                   250 non-null    int64  
 6   starting_number        250 non-null    int64  
 7   total_segment_score    250 non-null    float64
 8   total_element_score    250 non-null    float64
 9   total_component_score  250 non-null    float64
 10  total_deductions       250 non-null    float64
dtypes: float64(4), int64(2), object(5)
memory usage: 21.6+ KB


Jupyter notebooks do some nice formatting for you if you're displaying a pandas dataframe.
You can do it in a terminal, too, but it's not as pretty or easy to read.
Let's take a look at our data.

In [3]:
df.head(5)

Unnamed: 0,performance_id,competition,program,name,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
0,a3f8fac157,Olympic Winter Games 2018,Ice Dance - Free Dance,LAURIAULT Marie-Jade / le GAC Romain,FRA,17,1,89.62,47.04,42.58,0.0
1,d727237592,Olympic Winter Games 2018,Ice Dance - Free Dance,MYSLIVECKOVA Lucie / CSOLLEY Lukas,SVK,20,2,82.82,41.65,41.17,0.0
2,93fe6322fa,Olympic Winter Games 2018,Ice Dance - Free Dance,LORENZ Kavita / POLIZOAKIS Joti,GER,16,3,90.5,46.78,43.72,0.0
3,cb67dacba3,Olympic Winter Games 2018,Ice Dance - Free Dance,MIN Yura / GAMELIN Alexander,KOR,19,4,86.52,44.61,41.91,0.0
4,b79025399c,Olympic Winter Games 2018,Ice Dance - Free Dance,AGAFONOVA Alisa / UCAR Alper,TUR,18,5,87.76,44.01,43.75,0.0


In [4]:
# Easy descriptive statistics to help sanity-check the numeric data in your dataset

df.describe()

Unnamed: 0,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
count,250.0,250.0,250.0,250.0,250.0,250.0
mean,10.836,10.836,96.18888,49.40216,47.14272,-0.356
std,7.545385,7.545385,38.66708,19.953451,19.276711,0.726205
min,1.0,1.0,42.93,18.68,21.23,-6.0
25%,4.0,4.0,66.93,34.6525,31.6575,-1.0
50%,9.0,9.0,82.33,43.24,40.59,0.0
75%,16.0,16.0,122.145,61.81,60.67,0.0
max,30.0,30.0,215.08,127.64,96.62,0.0


You can also reference specific columns within a dataframe:

In [5]:
df['name']

0      LAURIAULT Marie-Jade / le GAC Romain
1        MYSLIVECKOVA Lucie / CSOLLEY Lukas
2           LORENZ Kavita / POLIZOAKIS Joti
3              MIN Yura / GAMELIN Alexander
4              AGAFONOVA Alisa / UCAR Alper
                       ...                 
245    DELLA MONICA Nicole / GUARISE Matteo
246           JAMES Vanessa / CIPRES Morgan
247           DUHAMEL Meagan / RADFORD Eric
248         SAVCHENKO Aljona / MASSOT Bruno
249     TARASOVA Evgenia / MOROZOV Vladimir
Name: name, Length: 250, dtype: object

When you pull out one column of a dataframe, what you get is what pandas calls a 'Series'

You can think of a Series as one-dimensional data, like a list or an array, and a dataframe as two dimensional data, like a spreadsheet. 
Note that they are both indexed. Even when I pull out a single Series, it's still ordered by the index. 

I can also slice a dataframe by telling it I only want to see certain columns. Like the events, names, and total scores:

Note the double brackets here, as opposed to last time. We're passing a list into an indexing operator. 

This is the same as writing:
```
columns_i_care_about = ['program', 'name', 'total_segment_score']
df[columns_i_care_about]
```

In [6]:
df[['program', 'name', 'total_segment_score']]

Unnamed: 0,program,name,total_segment_score
0,Ice Dance - Free Dance,LAURIAULT Marie-Jade / le GAC Romain,89.62
1,Ice Dance - Free Dance,MYSLIVECKOVA Lucie / CSOLLEY Lukas,82.82
2,Ice Dance - Free Dance,LORENZ Kavita / POLIZOAKIS Joti,90.50
3,Ice Dance - Free Dance,MIN Yura / GAMELIN Alexander,86.52
4,Ice Dance - Free Dance,AGAFONOVA Alisa / UCAR Alper,87.76
...,...,...,...
245,Team Event - Pair Skating Short Program,DELLA MONICA Nicole / GUARISE Matteo,67.62
246,Team Event - Pair Skating Short Program,JAMES Vanessa / CIPRES Morgan,68.49
247,Team Event - Pair Skating Short Program,DUHAMEL Meagan / RADFORD Eric,76.57
248,Team Event - Pair Skating Short Program,SAVCHENKO Aljona / MASSOT Bruno,75.36
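
The single versus double bracket difference is easy to check for yourself. This is a small sketch on a made-up two-row frame (`toy` and its contents are illustrative, not part of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["CHEN Nathan", "HANYU Yuzuru"],
    "nation": ["USA", "JPN"],
})

one_column = toy["name"]        # a single column label gives back a Series
one_col_frame = toy[["name"]]   # a list of labels gives back a DataFrame, even with one label

print(type(one_column).__name__)
print(type(one_col_frame).__name__)

# Both keep the original row index.
print(one_column.index.equals(toy.index))
```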


So who scored the highest, anyway? `total_segment_score` is the total score for a performance. Let's sort by that.

In [7]:
df.sort_values('total_segment_score', ascending=False)

Unnamed: 0,performance_id,competition,program,name,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
106,ac169ad456,Olympic Winter Games 2018,Men Single Skating - Free Skating,CHEN Nathan,USA,1,9,215.08,127.64,87.44,0.0
119,d4fa68d71a,Olympic Winter Games 2018,Men Single Skating - Free Skating,HANYU Yuzuru,JPN,2,22,206.17,109.55,96.62,0.0
121,73f4447f2c,Olympic Winter Games 2018,Men Single Skating - Free Skating,UNO Shoma,JPN,3,24,202.73,111.01,92.72,-1.0
120,c98ec5202d,Olympic Winter Games 2018,Men Single Skating - Free Skating,FERNANDEZ Javier,ESP,4,23,197.66,101.52,96.14,0.0
117,75d8d34efb,Olympic Winter Games 2018,Men Single Skating - Free Skating,JIN Boyang,CHN,5,20,194.45,109.69,85.76,-1.0
...,...,...,...,...,...,...,...,...,...,...,...
128,ea501f0f22,Olympic Winter Games 2018,Men Single Skating - Short Program,PANIOT Yaroslav,UKR,30,7,46.58,18.68,29.90,-2.0
210,2360e544d1,Olympic Winter Games 2018,Team Event - Ladies Single Skating Short Program,BUCHANAN Aimee,ISR,10,1,46.30,25.07,21.23,0.0
195,9690094547,Olympic Winter Games 2018,Team Event - Ice Dance Short Dance,TANKOVA Adel / ZILBERBERG Ronald,ISR,10,1,44.61,22.32,22.29,0.0
76,44961aae70,Olympic Winter Games 2018,Ladies Single Skating - Short Program,MAMBEKOVA Aiza,KAZ,30,9,44.40,21.29,23.11,0.0


Wait, that's not right, there are scores for a ton of different events! We need to sort that out.

With the `groupby` command, you can apply a pandas command to subsets of a dataframe broken up by group.
`df.groupby('program').max()` will give us the maximum value of every column within each program, but we only care about the `total_segment_score`.
So we'll filter down our results by indexing the dataframe that's returned, so we just get one Series.

This will give us a Series consisting of the maximum values in the column we specified. The Series is indexed by the column we grouped on, `program`.

In [8]:
df.groupby('program').max()['total_segment_score']

program
Ice Dance - Free Dance                              123.35
Ice Dance - Short Dance                              83.67
Ladies Single Skating - Free Skating                156.65
Ladies Single Skating - Short Program                82.92
Men Single Skating - Free Skating                   215.08
Men Single Skating - Short Program                  111.68
Pair Skating - Free Skating                         159.31
Pair Skating - Short Program                         82.39
Team Event - Ice Dance Free Dance                   118.10
Team Event - Ice Dance Short Dance                   80.51
Team Event - Ladies Single Skating Free Skating     158.08
Team Event - Ladies Single Skating Short Program     81.06
Team Event - Men Single Skating Free Skating        179.75
Team Event - Men Single Skating Short Program       103.25
Team Event - Pair Skating Free Skating              148.51
Team Event - Pair Skating Short Program              80.92
Name: total_segment_score, dtype: float64
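
You can also pick the column before aggregating, so pandas only has to reduce the one column you care about. A quick sketch on made-up data (the `toy` frame and its scores are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "program": ["Free", "Free", "Short", "Short"],
    "name": ["A", "B", "A", "B"],
    "total_segment_score": [120.0, 115.5, 80.0, 83.2],
})

# Select the column first, then take the maximum within each program.
best_scores = toy.groupby("program")["total_segment_score"].max()
print(best_scores)

# Same result as aggregating every column and then picking one out.
same = toy.groupby("program").max()["total_segment_score"]
print(best_scores.equals(same))
```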

But we want to know WHO WON! For that, we need the original row indexed by the maximum of each group.

Let's get the locations of the maximums first

In [9]:
index_locations = df.groupby('program')['total_segment_score'].idxmax()
index_locations

program
Ice Dance - Free Dance                               17
Ice Dance - Short Dance                              39
Ladies Single Skating - Free Skating                 65
Ladies Single Skating - Short Program                95
Men Single Skating - Free Skating                   106
Men Single Skating - Short Program                  146
Pair Skating - Free Skating                         164
Pair Skating - Short Program                        184
Team Event - Ice Dance Free Dance                   194
Team Event - Ice Dance Short Dance                  203
Team Event - Ladies Single Skating Free Skating     209
Team Event - Ladies Single Skating Short Program    219
Team Event - Men Single Skating Free Skating        223
Team Event - Men Single Skating Short Program       234
Team Event - Pair Skating Free Skating              238
Team Event - Pair Skating Short Program             249
Name: total_segment_score, dtype: int64

Now we'll look at the full rows at these index locations:

Note: nothing is stopping you from chaining this: 
```
df.loc[df.groupby('program')['total_segment_score'].idxmax()]
```
Pandas will allow you to chain as many method calls together as you want, and perform huge operations on one line. 
Which is a terrible idea if you ever expect anyone (including you) to be able to read and understand what you did.

In [10]:
df.loc[index_locations] 

Unnamed: 0,performance_id,competition,program,name,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
17,f9b3b1bf16,Olympic Winter Games 2018,Ice Dance - Free Dance,PAPADAKIS Gabriella / CIZERON Guillaume,FRA,1,18,123.35,63.98,59.37,0.0
39,80c4a7d391,Olympic Winter Games 2018,Ice Dance - Short Dance,VIRTUE Tessa / MOIR Scott,CAN,1,20,83.67,44.53,39.14,0.0
65,80154aa3f8,Olympic Winter Games 2018,Ladies Single Skating - Free Skating,ZAGITOVA Alina,OAR,2,22,156.65,81.62,75.03,0.0
95,b0bdffd51c,Olympic Winter Games 2018,Ladies Single Skating - Short Program,ZAGITOVA Alina,OAR,1,28,82.92,45.3,37.62,0.0
106,ac169ad456,Olympic Winter Games 2018,Men Single Skating - Free Skating,CHEN Nathan,USA,1,9,215.08,127.64,87.44,0.0
146,14db11efb7,Olympic Winter Games 2018,Men Single Skating - Short Program,HANYU Yuzuru,JPN,1,25,111.68,63.18,48.5,0.0
164,7bfaa8fc93,Olympic Winter Games 2018,Pair Skating - Free Skating,SAVCHENKO Aljona / MASSOT Bruno,GER,1,13,159.31,82.07,77.24,0.0
184,9e771ce55d,Olympic Winter Games 2018,Pair Skating - Short Program,SUI Wenjing / HAN Cong,CHN,1,17,82.39,44.49,37.9,0.0
194,cff3930586,Olympic Winter Games 2018,Team Event - Ice Dance Free Dance,VIRTUE Tessa / MOIR Scott,CAN,1,5,118.1,59.25,58.85,0.0
203,2cb6e9d049,Olympic Winter Games 2018,Team Event - Ice Dance Short Dance,VIRTUE Tessa / MOIR Scott,CAN,1,9,80.51,41.61,38.9,0.0
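
Here's the same `idxmax` and `.loc` pattern on a frame small enough to check by eye. The data is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "program": ["Free", "Free", "Short", "Short"],
    "name": ["A", "B", "A", "B"],
    "total_segment_score": [120.0, 115.5, 80.0, 83.2],
})

# idxmax returns the index label of the highest score within each program.
top_labels = toy.groupby("program")["total_segment_score"].idxmax()

# .loc looks rows up by label, so this pulls out the full top-scoring rows.
top_rows = toy.loc[top_labels]
print(top_rows[["program", "name", "total_segment_score"]])
```

One thing to keep in mind: if two rows tie for the maximum, `idxmax` only returns the first of them.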


Okay, there's one more problem. It turns out that those are the winners of the individual events. But gold medals are awarded based on your cumulative performance (why do they make it so hard?!)

To get this right, we need to aggregate by each event and filter out the Team Events, which are scored differently.

Let's start by adding a new column which is just the performance categories. 
We can do this by taking the first part of the program text: everything before the dash. We'll do this using the string methods built into pandas.

In [11]:
df['category'] = df['program'].str.split('-').str.get(0).str.strip()

Splitting on the dash turns *"Ice Dance - Free Dance"* into the list `['Ice Dance ', ' Free Dance']`

`get(0)` will give us just the first part of that: *'Ice Dance '*

`.strip()` will strip off the trailing whitespace, to give us: *'Ice Dance'*

Assigning this to `df['category']` gives us a new column in our dataframe that contains just that data.

Details on pandas string methods can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
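
To see each step of that chain on its own, here's a sketch on a two-element Series (the two program names come from the dataset; the variable names are illustrative):

```python
import pandas as pd

programs = pd.Series([
    "Ice Dance - Free Dance",
    "Team Event - Pair Skating Short Program",
])

parts = programs.str.split("-")   # each element becomes a list of pieces
first = parts.str.get(0)          # first piece, still carrying a trailing space
category = first.str.strip()      # whitespace removed from both ends

print(parts.tolist())
print(first.tolist())
print(category.tolist())
```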


Let's look at how our new column compares to the program column. We'll look at just those two columns in the dataframe 
(remember we're passing a list into an indexing operator, like we did before, with what look like double brackets) 

In [12]:
df[['program', 'category']]

Unnamed: 0,program,category
0,Ice Dance - Free Dance,Ice Dance
1,Ice Dance - Free Dance,Ice Dance
2,Ice Dance - Free Dance,Ice Dance
3,Ice Dance - Free Dance,Ice Dance
4,Ice Dance - Free Dance,Ice Dance
...,...,...
245,Team Event - Pair Skating Short Program,Team Event
246,Team Event - Pair Skating Short Program,Team Event
247,Team Event - Pair Skating Short Program,Team Event
248,Team Event - Pair Skating Short Program,Team Event


There, that looks great.

Now let's get rid of the Team Events. First, we find them all.

Reminder: Where a single equals sign is used to make assignments in Python (and most other languages), a double equals sign is a comparison, and means "is equal to". These are the rows where the category column is equal to 'Team Event'.

In [13]:
df['category'] == 'Team Event'

0      False
1      False
2      False
3      False
4      False
       ...  
245     True
246     True
247     True
248     True
249     True
Name: category, Length: 250, dtype: bool

That command gives us a new series that's just booleans (True/False values). We can use that to filter down our dataset.

We're using the index operator again, the brackets. This time, instead of passing in which columns we want, we're using
Boolean indexing, and passing it an expression telling it which rows we want to keep. In this case, we only want 
rows that don't have 'Team Event' in the category column. So we look at the `df['category']` column for those rows.
`!=` is the not equal to operator in Python (and most other languages).

We'll create a new dataframe with only the rows we want, called `singles_df`. 

Note: Pay attention to when we're assigning our dataframe result to a new value or just manipulating it for the readout. 
If we don't assign the value (even to the same dataframe variable, overwriting it), we're not keeping the work we're doing.

In [14]:
singles_df = df[df['category'] != 'Team Event']
singles_df

Unnamed: 0,performance_id,competition,program,name,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions,category
0,a3f8fac157,Olympic Winter Games 2018,Ice Dance - Free Dance,LAURIAULT Marie-Jade / le GAC Romain,FRA,17,1,89.62,47.04,42.58,0.0,Ice Dance
1,d727237592,Olympic Winter Games 2018,Ice Dance - Free Dance,MYSLIVECKOVA Lucie / CSOLLEY Lukas,SVK,20,2,82.82,41.65,41.17,0.0,Ice Dance
2,93fe6322fa,Olympic Winter Games 2018,Ice Dance - Free Dance,LORENZ Kavita / POLIZOAKIS Joti,GER,16,3,90.50,46.78,43.72,0.0,Ice Dance
3,cb67dacba3,Olympic Winter Games 2018,Ice Dance - Free Dance,MIN Yura / GAMELIN Alexander,KOR,19,4,86.52,44.61,41.91,0.0,Ice Dance
4,b79025399c,Olympic Winter Games 2018,Ice Dance - Free Dance,AGAFONOVA Alisa / UCAR Alper,TUR,18,5,87.76,44.01,43.75,0.0,Ice Dance
...,...,...,...,...,...,...,...,...,...,...,...,...
185,499773cc7c,Olympic Winter Games 2018,Pair Skating - Short Program,MARCHEI Valentina / HOTAREK Ondrej,ITA,7,18,74.50,40.36,34.14,0.0,Pair Skating
186,48e7d087f1,Olympic Winter Games 2018,Pair Skating - Short Program,DUHAMEL Meagan / RADFORD Eric,CAN,3,19,76.82,41.26,35.56,0.0,Pair Skating
187,a07c2e0f5b,Olympic Winter Games 2018,Pair Skating - Short Program,ZABIIAKO Natalia / ENBERT Alexander,OAR,8,20,74.35,40.13,34.22,0.0,Pair Skating
188,8872285901,Olympic Winter Games 2018,Pair Skating - Short Program,SAVCHENKO Aljona / MASSOT Bruno,GER,4,21,76.59,39.16,37.43,0.0,Pair Skating
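
Another way to write the same filter is to build the `==` mask once and flip it with `~`. A minimal sketch on made-up rows (the `toy` frame is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "category": ["Ice Dance", "Team Event", "Pair Skating"],
})

is_team = toy["category"] == "Team Event"   # boolean Series, one value per row

kept = toy[~is_team]                         # ~ inverts the mask
same = toy[toy["category"] != "Team Event"]  # the != form used above

print(kept)
print(kept.equals(same))
print(len(toy))  # the original frame is untouched
```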


Note that we're down from 250 rows to 190 rows and we've successfully filtered out the Team Events.
Now we can group by the categories to find our winners.

In [15]:
group = singles_df.groupby(['category','name'])

# This gives us a pandas groupby object.

group

# You'll pretty much always follow this command up with another to perform the operation
# you're grouping something together for.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1241a9b90>
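
The groupby object doesn't print anything useful, but you can still poke at it before aggregating. A small sketch on made-up data (the `toy` frame and its values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Ice Dance", "Ice Dance", "Pair Skating"],
    "name": ["A", "A", "B"],
    "total_segment_score": [80.0, 120.0, 75.0],
})

grouped = toy.groupby(["category", "name"])

print(grouped.ngroups)  # number of distinct (category, name) pairs
print(grouped.size())   # number of rows in each group

# Iterating yields the group key (a tuple here) and the rows in that group.
for key, frame in grouped:
    print(key, len(frame))
```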

In [16]:
# In this case, we want to sum the numeric values in the groups we've created

summed_df = group.sum()
summed_df

Unnamed: 0_level_0,Unnamed: 1_level_0,performance_id,competition,program,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
category,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Ice Dance,AGAFONOVA Alisa / UCAR Alper,b79025399cc309f0e91f,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,TURTUR,38,13,147.18,73.65,73.53,0.0
Ice Dance,BOBROVA Ekaterina / SOLOVIEV Dmitri,29d3629a8d9af05e3103,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,OAROAR,10,38,186.92,94.61,92.31,0.0
Ice Dance,CAPPELLINI Anna / LANOTTE Luca,1f92da49011c0cb2713c,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,ITAITA,11,40,184.91,95.27,90.64,-1.0
Ice Dance,CHOCK Madison / BATES Evan,ec76a70474ba726e7a33,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,USAUSA,19,33,175.58,88.43,89.15,-2.0
Ice Dance,COOMES Penny / BUCKLAND Nicholas,4754ed4e3bc8b28adc40,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,GBRGBR,20,21,170.32,86.36,83.96,0.0
...,...,...,...,...,...,...,...,...,...,...,...
Pair Skating,SUZAKI Miu / KIHARA Ryuichi,c39eade62e,Olympic Winter Games 2018,Pair Skating - Short Program,JPN,21,3,57.74,32.89,24.85,0.0
Pair Skating,TARASOVA Evgenia / MOROZOV Vladimir,02c6b6bb2f4de9f03c55,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,OAROAR,6,38,224.93,114.05,111.88,-1.0
Pair Skating,YU Xiaoyu / ZHANG Hao,84d7bd92a14632ef259d,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,CHNCHN,16,24,204.10,105.08,101.02,-2.0
Pair Skating,ZABIIAKO Natalia / ENBERT Alexander,ea904d3e61a07c2e0f5b,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,OAROAR,15,30,212.88,110.49,102.39,0.0


In [17]:
# By default, the groupby operations are applied to every column they're valid for
# There are a number of other groupby operations we can perform: mean, max, min, count, etc. 
# (count tallies the non-null values in each column, per group)

# Here's a count of performances by performer (and sorted).
# total_deductions has no nulls, so this counts every performance, not just the ones with a deduction.
df.groupby(['name']).count()['total_deductions'].sort_values(ascending=False)

# Note that this is a series because I'm pulling out a single column

name
MURAMOTO Kana / REED Chris             4
KOLYADA Mikhail                        4
SHIBUTANI Maia / SHIBUTANI Alex        4
DUHAMEL Meagan / RADFORD Eric          4
BOBROVA Ekaterina / SOLOVIEV Dmitri    4
                                      ..
RUSSO Giada                            1
TEN Denis                              1
KHNYCHENKOVA Anna                      1
NAZAROVA Alexandra / NIKITIN Maxim     1
ZIEGLER Miriam / KIEFER Severin        1
Name: total_deductions, Length: 107, dtype: int64
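
Because `count()` tallies non-null values, the result above is the number of performances per skater. To count only the performances that actually carry a deduction, one option is to compare first and then sum the booleans. This sketch assumes deductions are stored as negative numbers, as they are in this dataset; the `toy` frame is made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B"],
    "total_deductions": [0.0, -1.0, 0.0, 0.0, -2.0],
})

# count() tallies non-null values: this is the number of performances.
performances = toy.groupby("name")["total_deductions"].count()

# Comparing gives a boolean Series; summing booleans counts the True values.
had_deduction = (toy["total_deductions"] < 0).groupby(toy["name"]).sum()

print(performances)
print(had_deduction)
```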

Here are the average element and component scores by performer:

In [18]:
df.groupby(['name']).mean(numeric_only=True)[['total_element_score', 'total_component_score']]

Unnamed: 0_level_0,total_element_score,total_component_score
name,Unnamed: 1_level_1,Unnamed: 2_level_1
AGAFONOVA Alisa / UCAR Alper,36.825000,36.765000
ALEXANDROVSKAYA Ekaterina / WINDSOR\nHarley,34.700000,26.850000
ALIEV Dmitri,71.185000,63.570000
ASTAKHOVA Kristina / ROGONOV Alexei,50.550000,47.675000
AUSTMAN Larkyn,25.930000,26.490000
...,...,...
ZABIIAKO Natalia / ENBERT Alexander,59.516667,56.203333
ZAGITOVA Alina,69.993333,62.556667
ZAGORSKI Tiffani / GUERREIRO Jonathan,40.440000,40.680000
ZHOU Vincent,80.370000,57.975000


Back to our summed values grouped by category and name...
You may notice that our integer-based index is gone. Because of our groupby operation, we now have a multi-indexed
dataframe, using the indices that we specified in our groupby: category and name.

You can think of this as a multi-dimensional array, or you can simply think of it as a spreadsheet with two frozen columns
where each row is categorized twice. 

Note: You can have multiple indices along each axis, if you're feeling obsessive about data categorization.

In [19]:
summed_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 106 entries, ('Ice Dance', 'AGAFONOVA Alisa / UCAR Alper') to ('Pair Skating', 'ZIEGLER Miriam / KIEFER Severin')
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   performance_id         106 non-null    object 
 1   competition            106 non-null    object 
 2   program                106 non-null    object 
 3   nation                 106 non-null    object 
 4   rank                   106 non-null    int64  
 5   starting_number        106 non-null    int64  
 6   total_segment_score    106 non-null    float64
 7   total_element_score    106 non-null    float64
 8   total_component_score  106 non-null    float64
 9   total_deductions       106 non-null    float64
dtypes: float64(4), int64(2), object(4)
memory usage: 9.5+ KB
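
Here's a sketch of how you might pull rows back out of a multi-indexed dataframe like this one. The `toy` data is made up, and `summed` just mirrors the shape of `summed_df`:

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Ice Dance", "Ice Dance", "Pair Skating", "Pair Skating"],
    "name": ["A", "A", "B", "C"],
    "total_segment_score": [80.0, 120.0, 75.0, 150.0],
})

summed = toy.groupby(["category", "name"]).sum()

print(list(summed.index.names))               # the two index levels
print(summed.loc["Pair Skating"])             # every row under one category
print(summed.loc[("Ice Dance", "A")])         # one row, addressed by both levels
print(summed.reset_index().columns.tolist())  # index levels turned back into columns
```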


Let's take our new dataframe and sort it by the values we care about

In [20]:
sorted_df = summed_df.sort_values(['category','total_segment_score'], ascending=False)
sorted_df

Unnamed: 0_level_0,Unnamed: 1_level_0,performance_id,competition,program,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
category,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Pair Skating,SAVCHENKO Aljona / MASSOT Bruno,7bfaa8fc938872285901,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,GERGER,5,34,235.90,121.23,114.67,0.0
Pair Skating,SUI Wenjing / HAN Cong,85e473cad19e771ce55d,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,CHNCHN,4,32,235.47,120.78,114.69,0.0
Pair Skating,DUHAMEL Meagan / RADFORD Eric,d9343a9c1348e7d087f1,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,CANCAN,5,33,230.15,121.12,109.03,0.0
Pair Skating,TARASOVA Evgenia / MOROZOV Vladimir,02c6b6bb2f4de9f03c55,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,OAROAR,6,38,224.93,114.05,111.88,-1.0
Pair Skating,JAMES Vanessa / CIPRES Morgan,5458eddc1d1c6598ff9d,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,FRAFRA,11,26,218.53,112.26,106.27,0.0
...,...,...,...,...,...,...,...,...,...,...,...
Ice Dance,MYSLIVECKOVA Lucie / CSOLLEY Lukas,d7272375920332369044,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,SVKSVK,39,13,142.57,73.05,69.52,0.0
Ice Dance,NAZAROVA Alexandra / NIKITIN Maxim,f8cf0bf55f,Olympic Winter Games 2018,Ice Dance - Short Dance,UKR,21,2,57.97,27.26,30.71,0.0
Ice Dance,WANG Shiyue / LIU Xinyu,5abdd13573,Olympic Winter Games 2018,Ice Dance - Short Dance,CHN,22,10,57.81,29.28,29.53,-1.0
Ice Dance,MANSOUROVA Cortney / CESKA Michal,a8c457d283,Olympic Winter Games 2018,Ice Dance - Short Dance,CZE,23,1,53.53,29.11,24.42,0.0


Now our multi-indexed dataframe holds the summed scores for each category/name pair, sorted by category and then by highest total score. Our answer is the first entry for each category.

To slice it down to the winners, we can group again by just our first index and read out the first row of each group:

In [21]:
winners = sorted_df.groupby(['category']).head(1) 

# Here are your gold medal winners!

winners

Unnamed: 0_level_0,Unnamed: 1_level_0,performance_id,competition,program,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
category,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Pair Skating,SAVCHENKO Aljona / MASSOT Bruno,7bfaa8fc938872285901,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,GERGER,5,34,235.9,121.23,114.67,0.0
Men Single Skating,HANYU Yuzuru,d4fa68d71a14db11efb7,Olympic Winter Games 2018Olympic Winter Games ...,Men Single Skating - Free SkatingMen Single Sk...,JPNJPN,3,47,317.85,172.73,145.12,0.0
Ladies Single Skating,ZAGITOVA Alina,80154aa3f8b0bdffd51c,Olympic Winter Games 2018Olympic Winter Games ...,Ladies Single Skating - Free SkatingLadies Sin...,OAROAR,3,50,239.57,126.92,112.65,0.0
Ice Dance,VIRTUE Tessa / MOIR Scott,97fcedee9680c4a7d391,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,CANCAN,3,40,206.07,107.88,98.19,0.0


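I can also use the `.first()` command if I pull the name column out of the index (otherwise it won't be shown). Note that the categories come back in alphabetical order this time, because `groupby` sorts the group keys by default.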

In [22]:
sorted_df.reset_index().set_index('category').sort_values('total_segment_score', ascending=False).groupby(['category']).first()

Unnamed: 0_level_0,name,performance_id,competition,program,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Ice Dance,VIRTUE Tessa / MOIR Scott,97fcedee9680c4a7d391,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,CANCAN,3,40,206.07,107.88,98.19,0.0
Ladies Single Skating,ZAGITOVA Alina,80154aa3f8b0bdffd51c,Olympic Winter Games 2018Olympic Winter Games ...,Ladies Single Skating - Free SkatingLadies Sin...,OAROAR,3,50,239.57,126.92,112.65,0.0
Men Single Skating,HANYU Yuzuru,d4fa68d71a14db11efb7,Olympic Winter Games 2018Olympic Winter Games ...,Men Single Skating - Free SkatingMen Single Sk...,JPNJPN,3,47,317.85,172.73,145.12,0.0
Pair Skating,SAVCHENKO Aljona / MASSOT Bruno,7bfaa8fc938872285901,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,GERGER,5,34,235.9,121.23,114.67,0.0


Remember when I said you can chain pandas method calls? Yeah, not kidding. Here's the whole thing in one line. Just because you _can_ do this doesn't mean you should. As you can see, it may be concise, but it's much more difficult to read. 

In [23]:
df[df['category'] != 'Team Event'].groupby(['category','name']).sum().sort_values(['category','total_segment_score'], ascending=False).groupby(['category']).head(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,performance_id,competition,program,nation,rank,starting_number,total_segment_score,total_element_score,total_component_score,total_deductions
category,name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Pair Skating,SAVCHENKO Aljona / MASSOT Bruno,7bfaa8fc938872285901,Olympic Winter Games 2018Olympic Winter Games ...,Pair Skating - Free SkatingPair Skating - Shor...,GERGER,5,34,235.9,121.23,114.67,0.0
Men Single Skating,HANYU Yuzuru,d4fa68d71a14db11efb7,Olympic Winter Games 2018Olympic Winter Games ...,Men Single Skating - Free SkatingMen Single Sk...,JPNJPN,3,47,317.85,172.73,145.12,0.0
Ladies Single Skating,ZAGITOVA Alina,80154aa3f8b0bdffd51c,Olympic Winter Games 2018Olympic Winter Games ...,Ladies Single Skating - Free SkatingLadies Sin...,OAROAR,3,50,239.57,126.92,112.65,0.0
Ice Dance,VIRTUE Tessa / MOIR Scott,97fcedee9680c4a7d391,Olympic Winter Games 2018Olympic Winter Games ...,Ice Dance - Free DanceIce Dance - Short Dance,CANCAN,3,40,206.07,107.88,98.19,0.0
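
If you do want a chain, one way to keep it readable is to wrap it in parentheses and give each step its own line and comment. A sketch on made-up data (the `toy` frame and scores are invented; the steps mirror the ones above):

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Ice Dance", "Ice Dance", "Pair Skating", "Pair Skating", "Team Event"],
    "name": ["A", "B", "C", "D", "A"],
    "total_segment_score": [200.0, 190.0, 230.0, 235.0, 110.0],
})

winners = (
    toy[toy["category"] != "Team Event"]   # drop the team events
    .groupby(["category", "name"])         # one group per performer per category
    .sum()                                 # add up each performer's scores
    .sort_values("total_segment_score", ascending=False)
    .groupby("category")                   # regroup on the first index level
    .head(1)                               # keep the top row of each category
)
print(winners)
```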


And finally: getting data out of a pandas dataframe is just as easy as getting it in:

In [25]:
winners.to_csv('~/Documents/winners.csv')
summed_df.to_csv('~/Documents/skating_df.csv')
#winners.to_excel('~/Documents/winners.xlsx')  # needs an Excel writer such as openpyxl installed
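
When you read one of these files back in, remember that `to_csv` writes the index as an ordinary column. Here's a minimal round-trip sketch, using a temporary directory rather than `~/Documents` and a made-up two-row frame whose totals are copied from the winners table:

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame(
    {"total_segment_score": [206.07, 235.90]},
    index=pd.Index(["Ice Dance", "Pair Skating"], name="category"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "winners.csv")
    toy.to_csv(path)  # the index is written out as the first column
    # index_col tells read_csv which column to turn back into the index.
    restored = pd.read_csv(path, index_col="category")

print(restored)
```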