# 1.2 Basics of statistics

In [1]:
import pandas as pd

Here we go over some basic statistical operations we can do with pandas DataFrames. To start, we'll load a DataFrame from a CSV file showing [the number of athletes broken down by gender for various organized sports in 1398](https://iranopendata.org/en/dataset/organized-athletes-categorized-by-sports-and-gender-1398/resource/a82ca9e9-d774-4558-950c-46b77ef9bd39). The year 1398 is in the Iranian calendar and runs roughly from March 2019 to March 2020.

In [2]:
athletes = pd.read_csv('../../input/IOD-00932-organized-athletes-categorized-by-sports-and-sex-1398-en.csv')

As before, we can use `head()` to preview the first 5 rows of our DataFrame and `shape` to see its dimensions. In this case, we have 52 rows and 4 columns.

In [3]:
athletes.head()

Unnamed: 0,Sports,Total,Male,Female
0,Total,3386522,1973949,1412573
1,Skate,25972,8731,17241
2,Squash,2363,875,1488
3,Ski,2315,1604,711
4,Sports Groups,21067,13803,7264


In [4]:
athletes.shape

(52, 4)
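
As a small illustration of how to read `shape`, the sketch below builds a tiny stand-in DataFrame from three of the rows previewed above (the name `sample` is made up for this example) and unpacks the tuple into a row count and a column count. Note that the index shown on the left of the preview is not counted as a column.

```python
import pandas as pd

# A tiny stand-in for the full dataset, using three rows from the preview above.
sample = pd.DataFrame({
    'Sports': ['Skate', 'Squash', 'Ski'],
    'Total': [25972, 2363, 2315],
    'Male': [8731, 875, 1604],
    'Female': [17241, 1488, 711],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
```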

If we want to get a better sense of any specific column, we can also use `unique()`. For example, we can see what values are in the `Sports` column:

In [5]:
athletes['Sports'].unique()

array(['Total', 'Skate', 'Squash', 'Ski', 'Sports Groups', 'Badminton',
       'Bodybuilding and Body Sculpting', 'Basketball',
       'Bowling and Billiards', 'Boxing', 'Tennis', 'Table Tennis',
       'Taekwondo', 'Shooting', 'Archery', 'Judo', 'Polo',
       'Track and Field', 'Biking', 'Martial Arts', 'Gymnastics',
       'Bureau of Prisons', 'Traditional Indigenous Games', 'Equestrian',
       'Chess', 'Fencing', 'Swimming', 'Soccer', 'Rowing', 'Golf',
       'Motorcycle Racing', 'Lifeguard', 'Hockey',
       'Public Fitness Groups', 'Weightlifting', 'Wushu', 'Karate',
       'Kabaddi', 'Wrestling', 'Kung Fu and Martial Arts',
       'Mountaineering', 'Handball', 'Volleyball', 'Labor Unions Sports',
       'Sports for the Deaf',
       'Sports for the Special Patients & Organ Receivers',
       'Traditional Sports & Heroic Wrestling',
       'Sports for the Wounded in Action & the Disabled', 'K-12 Sports',
       'College Sports', 'Triathlon',
       'Sports for the Blind & Visual

## Calculating new fields from old fields

Right now, our DataFrame has the raw number of male athletes and female athletes. If we wanted the percentage of female athletes for each sport, we could simply create a new column that is based on the values of other columns:

In [6]:
athletes['Percent_Female'] = athletes['Female'] / athletes['Total'] * 100

In [7]:
athletes.head()

Unnamed: 0,Sports,Total,Male,Female,Percent_Female
0,Total,3386522,1973949,1412573,41.711614
1,Skate,25972,8731,17241,66.383028
2,Squash,2363,875,1488,62.9708
3,Ski,2315,1604,711,30.712743
4,Sports Groups,21067,13803,7264,34.480467
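
To make the element-wise behavior explicit, here is a minimal, self-contained sketch that repeats the calculation on four rows copied from the preview above. The `sample` DataFrame and the `counts_add_up` check are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Four rows copied from the preview above (the overall "Total" row is left out).
sample = pd.DataFrame({
    'Sports': ['Skate', 'Squash', 'Ski', 'Sports Groups'],
    'Total': [25972, 2363, 2315, 21067],
    'Male': [8731, 875, 1604, 13803],
    'Female': [17241, 1488, 711, 7264],
})

# Column arithmetic is element-wise: each row's Female is divided by that row's Total.
sample['Percent_Female'] = sample['Female'] / sample['Total'] * 100

# Sanity check: Male and Female should add up to Total in every row.
counts_add_up = (sample['Male'] + sample['Female'] == sample['Total']).all()

print(sample.round(2))
print(counts_add_up)
```

A check like `counts_add_up` is a cheap way to confirm that the columns mean what we think they mean before deriving new fields from them.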


## Statistics about one field

If we wanted to, we could isolate the values of this new column:

In [10]:
athletes['Percent_Female']

0     41.711614
1     66.383028
2     62.970800
3     30.712743
4     34.480467
5     66.390934
6     46.231983
7     44.908654
8     17.144720
9      0.016035
10    46.780917
11    35.821523
12    48.132674
13    43.705197
14    52.326821
15    28.062049
16    40.174672
17    48.204183
18    29.069920
19    31.855006
20    49.451705
21    12.189156
22    25.780660
23    47.017707
24    41.845269
25    48.511905
26    54.623440
27     4.419968
28    53.077699
29    37.864823
30    10.121457
31    47.483709
32    46.814603
33    88.153490
34    20.950485
35    34.475257
36    47.358407
37    49.725554
38     0.926246
39    30.160776
40    36.557205
41    58.012079
42    58.677138
43    66.281513
44    32.123288
45    52.674157
46     0.000000
47    34.603520
48    39.860394
49    56.457486
50    54.697674
51    36.931818
Name: Percent_Female, dtype: float64

If we want things like the maximum, minimum, or median value, we can call these functions on that column:

In [11]:
athletes['Percent_Female'].max()

88.15348956992017

In [12]:
athletes['Percent_Female'].min()

0.0

In [13]:
athletes['Percent_Female'].median()

42.77523317164018

And if we sort our DataFrame by this column, we can see exactly which sport is the one with 0% female athletes and which is the one with 88.15%:

In [13]:
athletes.sort_values('Percent_Female')

Unnamed: 0,Sports,Total,Male,Female,Percent_Female
46,Traditional Sports & Heroic Wrestling,11415,11415,0,0.0
9,Boxing,37418,37412,6,0.016035
38,Wrestling,98354,97443,911,0.926246
27,Soccer,604122,577420,26702,4.419968
30,Motorcycle Racing,5434,4884,550,10.121457
21,Bureau of Prisons,2453,2154,299,12.189156
8,Bowling and Billiards,1534,1271,263,17.14472
34,Weightlifting,4019,3177,842,20.950485
22,Traditional Indigenous Games,133829,99327,34502,25.78066
15,Judo,41451,29819,11632,28.062049
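
Because `sort_values()` sorts in ascending order by default, a preview of the result starts with the lowest values. The sketch below, which uses a handful of rows taken from the tables above, shows two ways to get at the other end as well: sorting with `ascending=False`, or looking up the row label of the extreme value with `idxmax()` and `idxmin()`. The names `sample`, `highest_first`, `top_sport`, and `bottom_sport` are made up for this example.

```python
import pandas as pd

# A few rows taken from the tables above.
sample = pd.DataFrame({
    'Sports': ['Skate', 'Squash', 'Ski', 'Boxing',
               'Traditional Sports & Heroic Wrestling'],
    'Total': [25972, 2363, 2315, 37418, 11415],
    'Female': [17241, 1488, 711, 6, 0],
})
sample['Percent_Female'] = sample['Female'] / sample['Total'] * 100

# Sort from the highest percentage to the lowest.
highest_first = sample.sort_values('Percent_Female', ascending=False)

# idxmax() / idxmin() return the index label of the largest / smallest value,
# which can then be used with .loc to read another column from that row.
top_sport = sample.loc[sample['Percent_Female'].idxmax(), 'Sports']
bottom_sport = sample.loc[sample['Percent_Female'].idxmin(), 'Sports']

print(highest_first)
print(top_sport, '|', bottom_sport)
```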


We can also use the `describe()` function to see the percentiles. By default, it shows the 25th, 50th, and 75th percentiles. Percentiles indicate how a set of numbers is distributed. For example, the 50th percentile is the value that 50% of rows fall below and 50% fall above; in other words, it is the median.

In [23]:
athletes['Percent_Female'].describe()

count    52.000000
mean     40.248318
std      18.236251
min       0.000000
25%      31.569440
50%      42.775233
75%      50.375871
max      88.153490
Name: Percent_Female, dtype: float64
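
The claim that the 50th percentile is the median can be checked directly. The following sketch uses the first eight values of the `Percent_Female` column shown earlier (so its numbers differ from the full-column output above) and compares `median()` with `quantile(0.5)`, then counts what share of the values fall below that point.

```python
import pandas as pd

# First eight Percent_Female values from the column output shown earlier.
values = pd.Series(
    [41.711614, 66.383028, 62.970800, 30.712743,
     34.480467, 66.390934, 46.231983, 44.908654],
    name='Percent_Female',
)

median = values.median()
q50 = values.quantile(0.5)  # the 50th percentile, given as a fraction

# Share of values strictly below the median.
share_below = (values < median).mean()

print(median, q50, share_below)
```

With an even number of values, the median falls between the two middle values, so exactly half of them lie below it.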

We can also pass in a `percentiles` argument if we want to see other percentiles. For example, passing 2/3 (about 0.667) shows the 66.7th percentile. This roughly means that two-thirds of our rows have a percentage of female athletes less than or equal to 48.13%.

In [24]:
athletes['Percent_Female'].describe(percentiles=[2/3])

count    52.000000
mean     40.248318
std      18.236251
min       0.000000
50%      42.775233
66.7%    48.132674
max      88.153490
Name: Percent_Female, dtype: float64
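
The `percentiles` argument takes a list, so several percentiles can be requested at once. As a rough sketch, again using only the first eight values of the column rather than the full dataset, the example below asks for the 10th and 90th percentiles, reads them out of the summary by label, and cross-checks one of them against `quantile()`.

```python
import pandas as pd

# First eight Percent_Female values from the column output shown earlier.
values = pd.Series(
    [41.711614, 66.383028, 62.970800, 30.712743,
     34.480467, 66.390934, 46.231983, 44.908654],
    name='Percent_Female',
)

# Percentiles are given as fractions between 0 and 1.
summary = values.describe(percentiles=[0.1, 0.9])

# The summary is itself a Series, labeled by statistic name.
p10 = summary.loc['10%']
p90 = summary.loc['90%']

print(summary)
print(p10, values.quantile(0.1))
```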

## Statistics about multiple fields

You can also call these statistical functions on the whole DataFrame to get the results for every column at once. Note that for the `Sports` column, `min()` and `max()` return the strings that come first and last in alphabetical order, respectively.

In [18]:
athletes.min()

Sports            Archery
Total                 229
Male                  137
Female                  0
Percent_Female        0.0
dtype: object

In [17]:
athletes.max()

Sports               Wushu
Total              3386522
Male               1973949
Female             1412573
Percent_Female    88.15349
dtype: object

In [19]:
athletes.mean(numeric_only=True)

Total             130250.846154
Male               75921.115385
Female             54329.730769
Percent_Female        40.248318
dtype: float64

In [20]:
athletes.median(numeric_only=True)

Total             12076.000000
Male               6859.000000
Female             3528.500000
Percent_Female       42.775233
dtype: float64

In [33]:
athletes.describe()

Unnamed: 0,Total,Male,Female,Percent_Female
count,52.0,52.0,52.0,52.0
mean,130250.8,75921.12,54329.73,40.248318
std,476010.2,281895.3,202794.7,18.236251
min,229.0,137.0,0.0,0.0
25%,2933.5,1604.75,893.75,31.56944
50%,12076.0,6859.0,3528.5,42.775233
75%,71682.5,44800.0,23472.25,50.375871
max,3386522.0,1973949.0,1412573.0,88.15349
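
One caveat when calling `mean()` or `median()` on a whole DataFrame: these only make sense for numeric columns. In recent pandas versions (2.0 and later), a text column such as `Sports` causes an error rather than being skipped silently. A minimal sketch of two ways around this, using a small stand-in DataFrame named `sample`, is to pass `numeric_only=True` or to select the numeric columns explicitly.

```python
import pandas as pd

# Four rows copied from the preview near the top of this section.
sample = pd.DataFrame({
    'Sports': ['Skate', 'Squash', 'Ski', 'Sports Groups'],
    'Total': [25972, 2363, 2315, 21067],
    'Male': [8731, 875, 1604, 13803],
    'Female': [17241, 1488, 711, 7264],
})

# Option 1: ask pandas to ignore non-numeric columns.
means = sample.mean(numeric_only=True)

# Option 2: select the numeric columns yourself before aggregating.
medians = sample[['Total', 'Male', 'Female']].median()

print(means)
print(medians)
```

Selecting columns explicitly is more verbose, but it makes it obvious which fields the statistics cover.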
