# Statistics On Dataframes
Import pandas

In [1]:
import pandas as pd

Load the data

In [2]:
df = pd.read_csv('athlete_events.csv', index_col='ID')

We can run many statistical functions on a single column (a Series) to get a single result. Missing values are skipped by default.

In [6]:
df['Year'].max()

2016

In [7]:
df['Age'].mean()

25.556898357297374

In [12]:
df['Weight'].count()

208241

In [14]:
df['Height'].sum()

36986879.0

In [15]:
df[['Age','Height', 'Weight']].mean()

Age        25.556898
Height    175.338970
Weight     70.702393
dtype: float64
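To make the handling of missing values concrete, here is a minimal sketch on a small, made-up frame (the `toy` data is hypothetical and is not part of `athlete_events.csv`). It shows that `count()` reports non-missing values rather than rows, and that `mean()` averages only over the values that are present.

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Age": [20.0, 30.0, None, 25.0],
    "Height": [170.0, 180.0, 175.0, None],
})

age_max = toy["Age"].max()      # largest non-missing value
age_mean = toy["Age"].mean()    # mean over the 3 non-missing values
age_count = toy["Age"].count()  # number of non-missing values, not rows
n_rows = len(toy)               # number of rows, missing or not

print(age_max, age_mean, age_count, n_rows)
```

This is why `df['Weight'].count()` above can be smaller than the number of rows in the file.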

We can even run a function on the entire DataFrame without specifying columns

In [7]:
df.max()

Name                                       zzet nce
Sex                                               M
Age                                              97
Height                                          226
Weight                                          214
Team                                           rn-2
NOC                                             ZIM
Games                                   2016 Summer
Year                                           2016
Season                                       Winter
City                                      Vancouver
Sport                                     Wrestling
Event     Wrestling Women's Middleweight, Freestyle
dtype: object
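Whether string columns show up in a whole-DataFrame result can depend on the pandas version and on whether a column mixes strings with missing values, so it may be safer to ask for numeric columns explicitly. The following is a minimal sketch on a hypothetical `toy` frame (not the athlete data), using the `numeric_only` parameter of `DataFrame.max`.

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Name": ["Ann", "bob", "Cid"],
    "Age": [20.0, 30.0, 25.0],
    "Height": [170.0, 180.0, 175.0],
})

# Restrict the reduction to numeric columns; "Name" is left out
numeric_max = toy.max(numeric_only=True)
print(numeric_max)
```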

## groupby()

## Order of Operations
*dataframe_name*[*filter expression*].groupby(*column_name*).agg_function()[[*column_names*]].head()
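As an illustration of that chain, here is a sketch on a hypothetical `toy` frame (column names mirror the athlete data, the values are made up). Each step of the pattern is on its own line; `numeric_only=True` is passed so the example does not depend on how a given pandas version treats string columns.

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004, 2004],
    "Sex": ["F", "M", "F", "F", "M"],
    "Age": [20.0, 30.0, 22.0, 26.0, 40.0],
    "Height": [170.0, 180.0, 160.0, 164.0, 190.0],
})

result = (
    toy[toy["Sex"] == "F"]        # 1. filter rows
    .groupby("Year")              # 2. group by a column
    .mean(numeric_only=True)      # 3. aggregate each group
    [["Age", "Height"]]           # 4. keep selected columns
    .head()                       # 5. show the first rows
)
print(result)
```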

Create a GroupBy object

In [4]:
by_year = df.groupby('Year') 
print(by_year)
print(type(by_year))

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0F70D0B0>
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
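Printing the object shows only its type because no computation has happened yet. A small sketch, on hypothetical toy data, of a few ways to look inside a GroupBy object before aggregating (`ngroups`, `get_group` and iteration are standard pandas features; the data is made up):

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004],
    "Age": [20.0, 30.0, 25.0],
})

by_year = toy.groupby("Year")

n_groups = by_year.ngroups            # number of distinct years
group_2000 = by_year.get_group(2000)  # the rows that belong to one group
# Iterating yields (key, sub-DataFrame) pairs
sizes = {year: len(rows) for year, rows in by_year}

print(n_groups, sizes)
```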


To get results from the GroupBy object, we need to apply an aggregate function.

In [16]:
by_year.mean().head()

Year,Age,Height,Weight
1896,23.580645,172.73913,71.387755
1900,29.034031,176.637931,74.556962
1904,26.69815,175.788732,72.197279
1906,27.125253,178.206226,75.917073
1908,26.970228,177.543158,75.386128


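In the pandas version used here, mean() is computed for the numeric columns only and the string columns are silently left out (min and max behave differently, as shown further down). In pandas 2.0 and later, pass numeric_only=True to get the same result. We can also be more specific about which columns we want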

In [20]:
by_year.mean()[['Age','Height']].head()

Year,Age,Height
1896,23.580645,172.73913
1900,29.034031,176.637931
1904,26.69815,175.788732
1906,27.125253,178.206226
1908,26.970228,177.543158
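The column selection can also be placed before the aggregate function, so that only the requested columns are computed. A minimal sketch on hypothetical toy data comparing the two orders:

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Year": [2000, 2000, 2004, 2004],
    "Age": [20.0, 30.0, 22.0, 26.0],
    "Height": [170.0, 180.0, 160.0, 164.0],
})
by_year = toy.groupby("Year")

# Aggregate every numeric column, then select two of them
after = by_year.mean(numeric_only=True)[["Age", "Height"]]
# Select the two columns first, then aggregate only those
before = by_year[["Age", "Height"]].mean()

print(before.equals(after))
```

Selecting first avoids work on columns that are discarded anyway and sidesteps the question of how string columns are handled.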


max and min also work on string columns, where values are compared lexicographically, so the result includes the text columns as well

In [22]:
by_year.max().head()

Year,Name,Sex,Age,Height,Weight,Team,NOC,Games,Season,City,Sport,Event
1896,Xenon Mikhailidis,M,40.0,188.0,106.0,United States,USA,1896 Summer,Summer,Athina,Wrestling,"Wrestling Men's Unlimited Class, Greco-Roman"
1900,van der Stoppen,M,71.0,191.0,102.0,Vesper Boat Club,USA,1900 Summer,Summer,Paris,Water Polo,Water Polo Men's Water Polo
1904,Zoltn Imre dn von Halmay,M,71.0,195.0,115.0,Winnipeg Shamrocks-1,USA,1904 Summer,Summer,St. Louis,Wrestling,"Wrestling Men's Welterweight, Freestyle"
1906,rpd Erds,M,54.0,196.0,114.0,United States,USA,1906 Summer,Summer,Athina,Wrestling,"Wrestling Men's Middleweight, Greco-Roman"
1908,tienne Poillot,M,61.0,201.0,115.0,Zut,USA,1908 Summer,Summer,London,Wrestling,"Wrestling Men's Middleweight, Greco-Roman"
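String comparison is by character code, so every lowercase letter sorts after every uppercase one. That is consistent with a name such as "van der Stoppen" being reported as the maximum. A short sketch with made-up names:

```python
import pandas as pd

# Hypothetical names, chosen to show the effect of letter case
names = pd.Series(["Zoe", "adam", "Bob"])

case_sensitive_max = names.max()               # lowercase sorts after uppercase
case_insensitive_max = names.str.lower().max() # compare after lowercasing

print(case_sensitive_max, case_insensitive_max)
```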


We can also check the number of non-missing values in each column

In [24]:
by_year.count().head()

Year,Name,Sex,Age,Height,Weight,Team,NOC,Games,Season,City,Sport,Event,Medal
1896,380,380,217,46,49,380,380,380,380,380,380,380,143
1900,1936,1936,1146,116,79,1936,1936,1936,1936,1936,1936,1936,604
1904,1301,1301,1027,213,147,1301,1301,1301,1301,1301,1301,1301,486
1906,1733,1733,990,257,205,1733,1733,1733,1733,1733,1733,1733,458
1908,3101,3101,2452,475,483,3101,3101,3101,3101,3101,3101,3101,831
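The counts differ between columns because `count()` skips missing values. If the number of rows per group is what is needed, `size()` can be used instead. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with one missing Age
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004],
    "Age": [20.0, None, 25.0],
})
by_year = toy.groupby("Year")

non_missing = by_year["Age"].count()  # non-missing Age values per year
rows = by_year.size()                 # rows per year, missing or not

print(non_missing)
print(rows)
```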


Create a DataFrame from multiple aggregate operations

In [28]:
pd.DataFrame([by_year.count()['Age'],by_year.mean()['Height'], by_year.max()['Weight']])

Year,1896,1900,1904,1906,1908,1912,1920,1924,1928,1932,...,1998,2000,2002,2004,2006,2008,2010,2012,2014,2016
Age,217.0,1146.0,1027.0,990.0,2452.0,3884.0,3447.0,4551.0,4611.0,2991.0,...,3603.0,13820.0,4109.0,13443.0,4382.0,13600.0,4402.0,12920.0,4891.0,13688.0
Height,172.73913,176.637931,175.788732,178.206226,177.543158,177.447989,175.752282,174.963039,175.162051,174.220115,...,174.581369,176.089721,174.702451,175.97285,174.623172,176.211062,174.918182,176.262469,174.81667,176.034266
Weight,106.0,102.0,115.0,114.0,115.0,125.0,146.0,146.0,125.0,110.0,...,123.0,180.0,123.0,198.0,127.0,214.0,116.0,214.0,116.0,170.0


We can use the **transpose** method to switch the axes and view the data in a more convenient way

In [8]:
pd.DataFrame([by_year.count()['Age'],by_year.mean()['Height'], by_year.max()['Weight']]).transpose().head(10)

Year,Age,Height,Weight
1896,217.0,172.73913,106.0
1900,1146.0,176.637931,102.0
1904,1027.0,175.788732,115.0
1906,990.0,178.206226,114.0
1908,2452.0,177.543158,115.0
1912,3884.0,177.447989,125.0
1920,3447.0,175.752282,146.0
1924,4551.0,174.963039,146.0
1928,4611.0,175.162051,125.0
1932,2991.0,174.220115,110.0
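One possible alternative to building the frame row-wise and transposing it is to place the aggregated Series side by side with `pd.concat(..., axis=1)`. The sketch below uses hypothetical toy data; a side effect worth noting is that each column keeps its own dtype, so the count stays an integer.

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004],
    "Age": [20.0, 30.0, 25.0],
    "Height": [170.0, 180.0, 175.0],
    "Weight": [60.0, 80.0, 70.0],
})
by_year = toy.groupby("Year")

# Each aggregated Series becomes one column, aligned on Year
summary = pd.concat(
    [by_year["Age"].count(), by_year["Height"].mean(), by_year["Weight"].max()],
    axis=1,
)
print(summary)
```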


An easier, more readable way to create complex "groupby" datasets is by using the ***agg*** method. This will allow us to specify our desired measures and aggregate operations for each in a single ***groupby*** operation

In [10]:
df.groupby('Year').agg({'Age':'count', 'Height':'mean', 'Weight':'max'}).head(10)

Year,Age,Height,Weight
1896,217,172.73913,106.0
1900,1146,176.637931,102.0
1904,1027,175.788732,115.0
1906,990,178.206226,114.0
1908,2452,177.543158,115.0
1912,3884,177.447989,125.0
1920,3447,175.752282,146.0
1924,4551,174.963039,146.0
1928,4611,175.162051,125.0
1932,2991,174.220115,110.0


### describe()
Get summary statistics (count, mean, standard deviation, min, quartiles and max) for a column

In [36]:
by_year.describe()['Age'].head(10)

Year,count,mean,std,min,25%,50%,75%,max
1896,217.0,23.580645,4.692803,10.0,21.0,23.0,26.0,40.0
1900,1146.0,29.034031,9.358352,13.0,22.0,27.0,34.0,71.0
1904,1027.0,26.69815,8.752523,14.0,21.5,24.0,29.0,71.0
1906,990.0,27.125253,7.913107,13.0,22.0,25.0,29.75,54.0
1908,2452.0,26.970228,7.820216,14.0,22.0,25.0,30.0,61.0
1912,3884.0,27.53862,8.052353,13.0,22.0,25.0,31.0,67.0
1920,3447.0,29.290978,8.273074,13.0,23.0,28.0,34.0,72.0
1924,4551.0,28.373325,8.47985,11.0,22.0,26.0,32.0,81.0
1928,4611.0,29.112557,10.533807,11.0,22.0,26.0,33.0,97.0
1932,2991.0,32.58208,13.795758,11.0,22.0,27.0,40.0,96.0


We can get statistics for a specific year with an additional filter

In [40]:
by_year.describe()['Age'].transpose()[2000]

count    13820.000000
mean        25.422504
std          5.440812
min         13.000000
25%         22.000000
50%         25.000000
75%         28.000000
max         63.000000
Name: 2000, dtype: float64
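The same row can likely be reached without transposing: selecting the column before calling `describe()` avoids computing statistics for every other column, and `.loc` picks a row by its index label. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2004],
    "Age": [20.0, 30.0, 25.0, 40.0],
})

# Describe only the Age column, one row of statistics per year
age_stats = toy.groupby("Year")["Age"].describe()
# Select the row for a single year by its index label
stats_2000 = age_stats.loc[2000]

print(stats_2000)
```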

## Pivot Tables

In [42]:
df = pd.read_csv('athlete_events.csv')
df.columns

Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
      dtype='object')

### pivot_table()
Create a pivot table with a single dimension (specifying index and values)

In [45]:
df.pivot_table(aggfunc='mean',values='Age',index='Year').head(10)

Year,Age
1896,23.580645
1900,29.034031
1904,26.69815
1906,27.125253
1908,26.970228
1912,27.53862
1920,29.290978
1924,28.373325
1928,29.112557
1932,32.58208
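These numbers match the per-year mean ages from the groupby section, which suggests that a one-dimensional pivot table is essentially a groupby followed by an aggregate. A minimal sketch of that correspondence on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004],
    "Age": [20.0, 30.0, 40.0],
})

pivot = toy.pivot_table(values="Age", index="Year", aggfunc="mean")
grouped = toy.groupby("Year")[["Age"]].mean()

print(pivot)
print(grouped)
```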


Add another axis by specifying the columns

In [50]:
df.pivot_table(aggfunc='mean', values='Age', index='NOC', columns='Season').head(10)

NOC,Summer,Winter
AFG,23.538462,
AHO,26.246575,31.6
ALB,25.888889,20.428571
ALG,24.45591,20.583333
AND,27.320755,21.12069
ANG,24.910112,
ANT,23.227273,
ANZ,24.246914,
ARG,26.370908,22.559055
ARM,24.948276,23.340426
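The empty cells are combinations with no data (for example, a NOC with no Winter entries). One way to think about the two-axis pivot is as a groupby on both keys followed by `unstack`, which moves one key into the columns. A sketch on hypothetical toy data (the NOC codes `AAA` and `BBB` are made up):

```python
import pandas as pd

# Hypothetical toy data; BBB has no Winter rows
toy = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB"],
    "Season": ["Summer", "Summer", "Winter", "Summer"],
    "Age": [20.0, 30.0, 40.0, 22.0],
})

pivot = toy.pivot_table(values="Age", index="NOC", columns="Season", aggfunc="mean")
# Group by both keys, then move Season from the index into the columns
unstacked = toy.groupby(["NOC", "Season"])["Age"].mean().unstack("Season")

print(pivot)
print(unstacked)
```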


We can specify multiple aggregations by passing a list to the ***aggfunc*** parameter

In [53]:
df.pivot_table(aggfunc=['mean', 'sum'], values='Age', index='NOC', columns='Season').head(10)

NOC,mean (Summer),mean (Winter),sum (Summer),sum (Winter)
AFG,23.538462,,1836.0,
AHO,26.246575,31.6,1916.0,158.0
ALB,25.888889,20.428571,1631.0,143.0
ALG,24.45591,20.583333,13035.0,247.0
AND,27.320755,21.12069,1448.0,2450.0
ANG,24.910112,,6651.0,
ANT,23.227273,,3066.0,
ANZ,24.246914,,1964.0,
ARG,26.370908,22.559055,70885.0,8595.0
ARM,24.948276,23.340426,4341.0,1097.0
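The result has two levels of column labels: the aggregate function on top and the season below it. A sketch, on hypothetical toy data, of how parts of such a table can be selected:

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB"],
    "Season": ["Summer", "Summer", "Winter", "Summer"],
    "Age": [20.0, 30.0, 40.0, 22.0],
})

table = toy.pivot_table(
    values="Age", index="NOC", columns="Season", aggfunc=["mean", "sum"]
)

means = table["mean"]                  # sub-table with the Summer/Winter means
summer_sum = table[("sum", "Summer")]  # one column, addressed by a tuple

print(means)
print(summer_sum)
```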


We can also aggregate multiple columns (measures) by passing a list to the ***values*** parameter

In [57]:
df.pivot_table(aggfunc='mean', values=['Age','Height', 'Weight'], index='NOC', columns='Season').head(10)

NOC,Age (Summer),Age (Winter),Height (Summer),Height (Winter),Weight (Summer),Weight (Winter)
AFG,23.538462,,170.592593,,65.901639,
AHO,26.246575,31.6,177.24,180.0,76.06,82.0
ALB,25.888889,20.428571,172.7,175.142857,71.833333,68.857143
ALG,24.45591,20.583333,174.733471,171.0,68.738144,63.25
AND,27.320755,21.12069,173.0,174.079545,68.297872,71.897727
ANG,24.910112,,178.204082,,74.142857,
ANT,23.227273,,175.121739,,66.172414,
ANZ,24.246914,,176.730769,,70.181818,
ARG,26.370908,22.559055,176.773888,174.537572,74.355384,69.583333
ARM,24.948276,23.340426,172.181818,171.022727,74.195906,68.3


Each axis can have multiple levels as well, by passing a list to the ***index*** or ***columns*** parameter

In [59]:
df.pivot_table(aggfunc='mean', values='Age', index=['NOC','Sport'], columns=['Season','Sex']).head(10)

NOC,Sport,Summer (F),Summer (M),Winter (F),Winter (M)
AFG,Athletics,20.75,22.277778,,
AFG,Boxing,,21.6,,
AFG,Hockey,,24.25,,
AFG,Judo,18.0,27.5,,
AFG,Taekwondo,,24.5,,
AFG,Wrestling,,24.4375,,
AHO,Athletics,21.0,23.454545,,
AHO,Bobsleigh,,,,31.25
AHO,Equestrianism,,53.0,,
AHO,Fencing,28.0,35.0,,
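With a multi-level index, rows are addressed by a label per level. The following sketch, on hypothetical toy data, shows three common selections: everything under one outer label, one inner label across all outer labels (`xs`), and a single cell.

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB"],
    "Sport": ["Judo", "Judo", "Boxing", "Judo"],
    "Sex": ["F", "M", "M", "F"],
    "Age": [20.0, 30.0, 40.0, 22.0],
})

table = toy.pivot_table(
    values="Age", index=["NOC", "Sport"], columns="Sex", aggfunc="mean"
)

aaa = table.loc["AAA"]                  # all sports for one NOC
judo = table.xs("Judo", level="Sport")  # one sport across all NOCs
cell = table.loc[("AAA", "Judo"), "F"]  # a single value

print(aaa)
print(judo)
print(cell)
```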


Just like with the ***groupby*** operation, we can pass the ***aggfunc*** parameter a dictionary that maps each measure to the aggregate (or list of aggregates) we want for it. In this case the ***values*** parameter is redundant and can be omitted

In [17]:
df.pivot_table(aggfunc={'Name':'count', 'Age': ['mean', 'std'], 'Height':['max', 'median']}, index=['NOC','Medal'], columns='Sex').head(10)

NOC,Medal,Age mean (F),Age mean (M),Age std (F),Age std (M),Height max (F),Height max (M),Height median (F),Height median (M),Name count (F),Name count (M)
AFG,Bronze,,23.0,,2.828427,,183.0,,183.0,,2.0
AHO,Silver,,19.0,,,,,,,,1.0
ALG,Bronze,23.0,22.428571,,2.370453,155.0,189.0,155.0,180.0,1.0,7.0
ALG,Gold,26.5,24.333333,3.535534,1.527525,162.0,175.0,160.0,170.0,2.0,3.0
ALG,Silver,,26.0,,2.828427,,186.0,,177.5,,4.0
ANZ,Bronze,,23.8,,5.167204,,188.0,,184.0,,5.0
ANZ,Gold,22.0,24.578947,,2.911964,,188.0,,173.0,1.0,19.0
ANZ,Silver,20.0,23.333333,,7.023769,,178.0,,174.0,1.0,3.0
ARG,Bronze,27.594595,26.148148,3.677707,5.307195,178.0,210.0,165.0,191.0,37.0,54.0
ARG,Gold,29.5,26.516854,0.707107,6.452822,164.0,208.0,157.0,178.0,2.0,89.0
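A table like this has three levels of column labels (measure, aggregate, Sex), which can be awkward to work with or export. One possible approach is to join the levels into single names. The sketch below uses hypothetical toy data and assumes the level order seen in the output above; the underscore-joined names are a convention chosen for this example, not something pandas produces by itself.

```python
import pandas as pd

# Hypothetical toy data, not taken from athlete_events.csv
toy = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB"],
    "Sex": ["F", "M", "M", "F"],
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [20.0, 30.0, 40.0, 22.0],
})

table = toy.pivot_table(
    index="NOC",
    columns="Sex",
    aggfunc={"Name": "count", "Age": ["mean", "max"]},
)

# Join the column levels into flat names such as "Age_mean_M"
flat = table.copy()
flat.columns = ["_".join(levels) for levels in table.columns]

print(flat)
```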
