# Sorting, Counting, Filtering & Querying

## Upload Olympics datasets to this notebook.

### Import ```summer.csv``` and store in a dataframe called ```dfs``` (<a href="https://raw.githubusercontent.com/sandeepmj/datasets/main/summer.csv">link</a>).


### Import ```winter.csv``` and store in a dataframe called ```dfw``` (<a href="https://raw.githubusercontent.com/sandeepmj/datasets/main/winter.csv">link</a>).

In [2]:
#import pandas
import pandas as pd

In [4]:
## import summer data
dfs = pd.read_csv("https://raw.githubusercontent.com/sandeepmj/datasets/main/summer.csv")
dfs.head(3)

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze


In [5]:
## import winter data
dfw = pd.read_csv("https://raw.githubusercontent.com/sandeepmj/datasets/main/winter.csv")
dfw.head(3)

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal
0,1924,Chamonix,Biathlon,Biathlon,"BERTHET, G.",FRA,Men,Military Patrol,Bronze
1,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, C.",FRA,Men,Military Patrol,Bronze
2,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, Maurice",FRA,Men,Military Patrol,Bronze


## Get a sense of the data

In [6]:
## SUMMER
## what exactly do we have: columns, datatypes, etc.
dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31165 entries, 0 to 31164
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Year             31165 non-null  int64 
 1   City             31165 non-null  object
 2   Sport            31165 non-null  object
 3   Discipline       31165 non-null  object
 4   Athlete Name     31165 non-null  object
 5   Athlete Country  31161 non-null  object
 6   Gender           31165 non-null  object
 7   Event            31165 non-null  object
 8   Medal            31165 non-null  object
dtypes: int64(1), object(8)
memory usage: 2.1+ MB


In [7]:
dfw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5770 entries, 0 to 5769
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Year             5770 non-null   int64 
 1   City             5770 non-null   object
 2   Sport            5770 non-null   object
 3   Discipline       5770 non-null   object
 4   Athlete Name     5770 non-null   object
 5   Athlete Country  5770 non-null   object
 6   Gender           5770 non-null   object
 7   Event            5770 non-null   object
 8   Medal            5770 non-null   object
dtypes: int64(1), object(8)
memory usage: 405.8+ KB


#### Multiple ```series``` make up a ```dataframe```

The key is to know which one you are working with because, once again, there are specific operations you can run on a ```series``` that you can't run on a ```dataframe```, and vice versa.
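
As a rough illustration of that difference, the sketch below builds a tiny stand-in table (not the real Olympics data) and checks which object each expression returns. The variable names are made up for this example.

```python
import pandas as pd

# Toy stand-in for the Olympics tables (illustrative values only)
df = pd.DataFrame({
    "Year": [1924, 1924, 2014],
    "City": ["Chamonix", "Chamonix", "Sochi"],
})

year = df["Year"]             # a single column comes back as a Series

print(type(df).__name__)      # DataFrame
print(type(year).__name__)    # Series

# .unique() is defined on a Series but not on a DataFrame
series_has_unique = hasattr(year, "unique")
frame_has_unique = hasattr(df, "unique")

# .columns is defined on a DataFrame but not on a Series
frame_has_columns = hasattr(df, "columns")
series_has_columns = hasattr(year, "columns")
```

Checking ```type()``` first is a quick way to find out why a method call fails on one object and works on the other.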

### Method 1 to return series: dot notation

```dataframe.column_name```

In [11]:
## return Year as a Pandas Series for winter games
dfw.Year

0       1924
1       1924
2       1924
3       1924
4       1924
        ... 
5765    2014
5766    2014
5767    2014
5768    2014
5769    2014
Name: Year, Length: 5770, dtype: int64

### Method 2 to return series: square brackets

```dataframe["column_name"]```

In [13]:
## return Year as a Pandas Series for winter games
dfw["Year"]

0       1924
1       1924
2       1924
3       1924
4       1924
        ... 
5765    2014
5766    2014
5767    2014
5768    2014
5769    2014
Name: Year, Length: 5770, dtype: int64

### Method 2 is preferred because you will often work with data with multiword column headers:

This will break:

```df.Multi Word Header```

This will NOT break:

```df["Multi Word Header"]```


In [17]:
## Call the winter games Athlete Name series using dot notation
# will break
dfw.Athlete Name

SyntaxError: invalid syntax (<ipython-input-17-875fa948d3ba>, line 3)

In [18]:
## Call the winter games Athlete Name series using square brackets
dfw["Athlete Name"]

0                BERTHET, G.
1             MANDRILLON, C.
2        MANDRILLON, Maurice
3            VANDELLE, André
4       AUFDENBLATTEN, Adolf
                ...         
5765            JONES, Jenny
5766         ANDERSON, Jamie
5767      MALTAIS, Dominique
5768            SAMKOVA, Eva
5769        TRESPEUCH, Chloe
Name: Athlete Name, Length: 5770, dtype: object

### If you pass a list of multiple column headers, it returns a dataframe with only those columns

####  ```df[["Column 1", "Column 2"]]```
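
A minimal sketch of how the brackets decide the return type, using a small made-up table whose rows are borrowed from the outputs in this notebook. Note that a list with a single header still returns a dataframe.

```python
import pandas as pd

# Toy table; the three rows are only for illustration
df = pd.DataFrame({
    "Year": [1896, 2012],
    "City": ["Athens", "London"],
    "Athlete Name": ["HAJOS, Alfred", "ADAMS, Nicola"],
})

one_col_series = df["Athlete Name"]     # string key      -> Series
one_col_frame = df[["Athlete Name"]]    # one-item list   -> DataFrame
two_cols = df[["Year", "City"]]         # list of headers -> DataFrame

print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(list(two_cols.columns))
```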

In [23]:
## Create a dataframe that holds only the summer games years and cities
dfs_Year_City = dfs[["Year", "City"]]
dfs_Year_City

Unnamed: 0,Year,City
0,1896,Athens
1,1896,Athens
2,1896,Athens
3,1896,Athens
4,1896,Athens
...,...,...
31160,2012,London
31161,2012,London
31162,2012,London
31163,2012,London


In [24]:
dfs_Year_City.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31165 entries, 0 to 31164
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Year    31165 non-null  int64 
 1   City    31165 non-null  object
dtypes: int64(1), object(1)
memory usage: 487.1+ KB


## Sorting by single value

```dataframe.sort_values(by="Value", ascending = True)``` for A-Z or smaller numbers to bigger numbers

```ascending = False``` for Z-A or bigger to smaller numbers.
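
A short sketch of both directions on a three-row toy table (rows borrowed from the winter output in this notebook). It also shows that ```sort_values``` returns a new dataframe and leaves the original order untouched.

```python
import pandas as pd

# Toy table for illustration only
df = pd.DataFrame({
    "Athlete Name": ["SAMKOVA, Eva", "AAMODT, Kjetil Andre", "JONES, Jenny"],
    "Year": [2014, 1992, 2014],
})

a_to_z = df.sort_values(by="Athlete Name", ascending=True)
z_to_a = df.sort_values(by="Athlete Name", ascending=False)

print(a_to_z["Athlete Name"].tolist())
print(z_to_a["Athlete Name"].tolist())

# The original row labels travel with the rows, and df itself is unchanged
print(a_to_z.index.tolist())
print(df["Athlete Name"].iloc[0])
```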

In [27]:
## sort by athlete names A-Z for winter olympics
dfw.sort_values (by = "Athlete Name", ascending = True)

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal
1834,1980,Lake Placid,Ice Hockey,Ice Hockey,"AAHLBERG, Mats",SWE,Men,Ice Hockey,Bronze
2052,1984,Sarajevo,Ice Hockey,Ice Hockey,"AAHLEN, Thomas",SWE,Men,Ice Hockey,Bronze
1985,1980,Lake Placid,Skiing,Cross Country Skiing,"AALAND, Per Knut",NOR,Men,4X10KM Relay,Silver
5303,2014,Sochi,Ice Hockey,Ice Hockey,"AALTONEN, Juhamatti",FIN,Men,Ice Hockey,Bronze
2738,1992,Albertville,Skiing,Alpine Skiing,"AAMODT, Kjetil Andre",NOR,Men,Super-G,Gold
...,...,...,...,...,...,...,...,...,...
2920,1994,Lillehammer,Ice Hockey,Ice Hockey,"ÖRNSKOG, Stefan",SWE,Men,Ice Hockey,Gold
994,1960,Squaw Valley,Skiing,Cross Country Skiing,"ÖSTBY, Einar",NOR,Men,4X10KM Relay,Silver
549,1948,St.Moritz,Skiing,Cross Country Skiing,"ÖSTENSSON, Nils",SWE,Men,18KM,Silver
556,1948,St.Moritz,Skiing,Cross Country Skiing,"ÖSTENSSON, Nils",SWE,Men,4X10KM Relay,Gold


## Sorting by multiple values

```dataframe.sort_values(by=["Value 1", "Value 2"], ascending = True)``` for A-Z or smaller numbers to bigger numbers

```ascending = False``` for Z-A or bigger to smaller numbers.


In [28]:
## Sort by values (City and Year) for winter olympics
dfw.sort_values(by = ["City", "Year"], ascending = True)

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal
2502,1992,Albertville,Biathlon,Biathlon,"ELORANTA, Harri",FIN,Men,10KM,Bronze
2503,1992,Albertville,Biathlon,Biathlon,"KIRCHNER, Mark",GER,Men,10KM,Gold
2504,1992,Albertville,Biathlon,Biathlon,"GROSS, Ricco",GER,Men,10KM,Silver
2505,1992,Albertville,Biathlon,Biathlon,"BEDARD, Myriam",CAN,Women,15KM,Bronze
2506,1992,Albertville,Biathlon,Biathlon,"HARVEY, Antje",GER,Women,15KM,Gold
...,...,...,...,...,...,...,...,...,...
5153,2010,Vancouver,Skiing,Snowboard,"WESCOTT, Seth",USA,Men,Snowboard Cross,Gold
5154,2010,Vancouver,Skiing,Snowboard,"ROBERTSON, Mike",CAN,Men,Snowboard Cross,Silver
5155,2010,Vancouver,Skiing,Snowboard,"NOBS, Olivia",SUI,Women,Snowboard Cross,Bronze
5156,2010,Vancouver,Skiing,Snowboard,"RICKER, Maelle",CAN,Women,Snowboard Cross,Gold


In [29]:
## sort by city and year but descending for summer games
dfs.sort_values(by=["City", "Year"], ascending = False)

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal
10674,1964,Tokyo,Aquatics,Diving,"GOMPF, Thomas Eugen",USA,Men,10M Platform,Bronze
10675,1964,Tokyo,Aquatics,Diving,"WEBSTER, Robert David",USA,Men,10M Platform,Gold
10676,1964,Tokyo,Aquatics,Diving,"DIBIASI, Klaus",ITA,Men,10M Platform,Silver
10677,1964,Tokyo,Aquatics,Diving,"ALEKSEEVA, Galina",URS,Women,10M Platform,Bronze
10678,1964,Tokyo,Aquatics,Diving,"BUSH, Lesley Leigh",USA,Women,10M Platform,Gold
...,...,...,...,...,...,...,...,...,...
5709,1928,Amsterdam,Wrestling,Wrestling Gre-R,"KOKKINEN, Väinö Anselmi",FIN,Men,67.5 - 75KG (Middleweight),Gold
5710,1928,Amsterdam,Wrestling,Wrestling Gre-R,"PAPP, L.",HUN,Men,67.5 - 75KG (Middleweight),Silver
5711,1928,Amsterdam,Wrestling,Wrestling Gre-R,"PELLINEN, Onni",FIN,Men,75 - 82.5KG (Light-Heavyweight),Bronze
5712,1928,Amsterdam,Wrestling,Wrestling Gre-R,"MOUSTAPHA, Ibrahim",EGY,Men,75 - 82.5KG (Light-Heavyweight),Gold


In [30]:
## you can store into a new df
dfx = dfs.sort_values(by=["City", "Year"], ascending = False)

## Should Winter and Summer data remain separate?


### Creating a new column with a default value

In [32]:
## In the summer dataframe, add "Season" column with the value "Summer"
dfs ["Season" ] = "Summer"
dfs

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold,Summer
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver,Summer
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze,Summer
3,1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100M Freestyle For Sailors,Gold,Summer
4,1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100M Freestyle For Sailors,Silver,Summer
...,...,...,...,...,...,...,...,...,...,...
31160,2012,London,Wrestling,Wrestling Freestyle,"JANIKOWSKI, Damian",POL,Men,Wg 84 KG,Bronze,Summer
31161,2012,London,Wrestling,Wrestling Freestyle,"REZAEI, Ghasem Gholamreza",IRI,Men,Wg 96 KG,Gold,Summer
31162,2012,London,Wrestling,Wrestling Freestyle,"TOTROV, Rustam",RUS,Men,Wg 96 KG,Silver,Summer
31163,2012,London,Wrestling,Wrestling Freestyle,"ALEKSANYAN, Artur",ARM,Men,Wg 96 KG,Bronze,Summer


In [33]:
## In the winter dataframe, add "Season" column with the value "Winter"
dfw ["Season" ] = "Winter"
dfw

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
0,1924,Chamonix,Biathlon,Biathlon,"BERTHET, G.",FRA,Men,Military Patrol,Bronze,Winter
1,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, C.",FRA,Men,Military Patrol,Bronze,Winter
2,1924,Chamonix,Biathlon,Biathlon,"MANDRILLON, Maurice",FRA,Men,Military Patrol,Bronze,Winter
3,1924,Chamonix,Biathlon,Biathlon,"VANDELLE, André",FRA,Men,Military Patrol,Bronze,Winter
4,1924,Chamonix,Biathlon,Biathlon,"AUFDENBLATTEN, Adolf",SUI,Men,Military Patrol,Gold,Winter
...,...,...,...,...,...,...,...,...,...,...
5765,2014,Sochi,Skiing,Snowboard,"JONES, Jenny",GBR,Women,Slopestyle,Bronze,Winter
5766,2014,Sochi,Skiing,Snowboard,"ANDERSON, Jamie",USA,Women,Slopestyle,Gold,Winter
5767,2014,Sochi,Skiing,Snowboard,"MALTAIS, Dominique",CAN,Women,Snowboard Cross,Silver,Winter
5768,2014,Sochi,Skiing,Snowboard,"SAMKOVA, Eva",CZE,Women,Snowboard Cross,Gold,Winter


In [34]:
## are the two column headers identical?
list(dfs.columns) == list(dfw.columns)

True

In [38]:
## join winter and summer dataframes together into a dataframe called oly
oly = pd.concat ([dfs,dfw], ignore_index = True) # ignore_index = True drops the original row labels and builds a new index

In [39]:
oly

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold,Summer
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver,Summer
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze,Summer
3,1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100M Freestyle For Sailors,Gold,Summer
4,1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100M Freestyle For Sailors,Silver,Summer
...,...,...,...,...,...,...,...,...,...,...
36930,2014,Sochi,Skiing,Snowboard,"JONES, Jenny",GBR,Women,Slopestyle,Bronze,Winter
36931,2014,Sochi,Skiing,Snowboard,"ANDERSON, Jamie",USA,Women,Slopestyle,Gold,Winter
36932,2014,Sochi,Skiing,Snowboard,"MALTAIS, Dominique",CAN,Women,Snowboard Cross,Silver,Winter
36933,2014,Sochi,Skiing,Snowboard,"SAMKOVA, Eva",CZE,Women,Snowboard Cross,Gold,Winter


In [41]:
### confirm that you have the correct number of rows after the join.
### you can scroll up to count the number of entries in dfw and dfs. 
### The totals should be equal to the value generated below
dfw.shape[0] + dfs.shape[0] == oly.shape[0]


True
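
To see what ```ignore_index``` changes, here is a small sketch with two made-up two-row tables (the names ```summer``` and ```winter``` are placeholders, not the notebook's ```dfs``` and ```dfw```).

```python
import pandas as pd

# Two tiny stand-in tables with the same columns
summer = pd.DataFrame({"Year": [1896, 1896], "City": ["Athens", "Athens"]})
winter = pd.DataFrame({"Year": [1924, 1924], "City": ["Chamonix", "Chamonix"]})
summer["Season"] = "Summer"
winter["Season"] = "Winter"

kept = pd.concat([summer, winter])                       # row labels repeat
fresh = pd.concat([summer, winter], ignore_index=True)   # row labels renumbered

print(kept.index.tolist())
print(fresh.index.tolist())

# Same row-count check as in the cell above
rows_match = len(summer) + len(winter) == len(fresh)
```

Without ```ignore_index = True``` the combined dataframe keeps duplicate row labels, which makes label-based lookups ambiguous.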

In [42]:
## show top and bottom
oly

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold,Summer
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver,Summer
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze,Summer
3,1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100M Freestyle For Sailors,Gold,Summer
4,1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100M Freestyle For Sailors,Silver,Summer
...,...,...,...,...,...,...,...,...,...,...
36930,2014,Sochi,Skiing,Snowboard,"JONES, Jenny",GBR,Women,Slopestyle,Bronze,Winter
36931,2014,Sochi,Skiing,Snowboard,"ANDERSON, Jamie",USA,Women,Slopestyle,Gold,Winter
36932,2014,Sochi,Skiing,Snowboard,"MALTAIS, Dominique",CAN,Women,Snowboard Cross,Silver,Winter
36933,2014,Sochi,Skiing,Snowboard,"SAMKOVA, Eva",CZE,Women,Snowboard Cross,Gold,Winter


In [43]:
### generate a random sample of 20 rows
### if you see only "Summer" or only "Winter" in the "Season" column, just run the cell again to
## confirm both seasons are in oly.
oly.sample(20)


Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
18631,1988,Seoul,Cycling,Cycling Road,"GRONE, Bernd",FRG,Men,Individual Road Race,Silver,Summer
13970,1976,Montreal,Aquatics,Swimming,"STRACHAN, Rodney",USA,Men,400M Individual Medley,Gold,Summer
23129,1996,Atlanta,Wrestling,Wrestling Free.,"ANGLE, Kurt",USA,Men,90 - 100KG (Heavyweight),Gold,Summer
26302,2004,Athens,Gymnastics,Artistic G.,"YANG, Tae Young",KOR,Men,Individual All-Round,Bronze,Summer
18926,1988,Seoul,Gymnastics,Artistic G.,"POTORAC, Gabriela",ROU,Women,Vault,Silver,Summer
12602,1968,Mexico,Volleyball,Volleyball,"MITSUMORI, Yasuaki",JPN,Men,Volleyball,Silver,Summer
36926,2014,Sochi,Skiing,Snowboard,"DUJMOVITS, Julia",AUT,Women,Parallel Slalom,Gold,Winter
21939,1996,Atlanta,Boxing,Boxing,"MESA, Arnaldo",CUB,Men,51 - 54KG (Bantamweight),Silver,Summer
33370,1984,Sarajevo,Skiing,Cross Country Skiing,"SVAN, Gunde Anders",SWE,Men,4X10KM Relay,Gold,Winter
907,1904,St Louis,Football,Football,"JANUARY, Thomas Thurston",USA,Men,Football,Gold,Summer


In [45]:
oly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36935 entries, 0 to 36934
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Year             36935 non-null  int64 
 1   City             36935 non-null  object
 2   Sport            36935 non-null  object
 3   Discipline       36935 non-null  object
 4   Athlete Name     36935 non-null  object
 5   Athlete Country  36931 non-null  object
 6   Gender           36935 non-null  object
 7   Event            36935 non-null  object
 8   Medal            36935 non-null  object
 9   Season           36935 non-null  object
dtypes: int64(1), object(9)
memory usage: 2.8+ MB


## Improving memory allocation

If we were working with a massive dataset, we'd want to find ways to reduce memory use and speed up processing.

For example, the following columns hold only a few values that repeat again and again as strings. This is highly inefficient. 

- The "Season" column has only "Winter" and "Summer", 
- The "Medal" column has only "Gold", "Silver" and "Bronze",
- The "Gender" column, at this point in the Olympics data, has only "Men" and "Women".

One way to improve memory allocation is to take columns that contain only a few distinct values and turn them into categories. 
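
A rough sketch of the saving on a made-up column of 3,000 repeated medal labels. The exact byte counts depend on the pandas version, so only the comparison matters here.

```python
import pandas as pd

# Made-up column: three labels repeated 1,000 times each
medals = pd.Series(["Gold", "Silver", "Bronze"] * 1000, name="Medal")
as_category = medals.astype("category")

# deep=True counts the memory used by the strings themselves
before = medals.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(before, after)

# Each label is stored once; the rows hold small integer codes
print(list(as_category.cat.categories))
```

The values still read the same after the conversion; only the storage changes.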



In [48]:
## the info() output above shows that "Season", "Medal" and "Gender" are all string objects.
## Note the memory usage there: 2.8+ MB
## convert them to categories (each column has at most about five distinct values)
oly [["Medal", "Gender", "Season"]] = \
oly [["Medal", "Gender", "Season"]].astype("category")

In [49]:
oly [["Medal", "Gender", "Season"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36935 entries, 0 to 36934
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Medal   36935 non-null  category
 1   Gender  36935 non-null  category
 2   Season  36935 non-null  category
dtypes: category(3)
memory usage: 108.7 KB


# Filter/Subset 



### Create a subset that holds the results from the Sochi Olympics

**Old way:**

```df[df["column_name"] == "value_in_column"]```

**Newer way:**

```df.query("ColumnName == 'some_value'")```
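
Both forms return the same rows. A minimal sketch on a toy table, where the intermediate ```mask``` variable is introduced only to show what the old way does under the hood:

```python
import pandas as pd

# Toy table for illustration only
df = pd.DataFrame({
    "City": ["Sochi", "London", "Sochi"],
    "Medal": ["Gold", "Silver", "Bronze"],
})

mask = df["City"] == "Sochi"              # one True/False per row
by_mask = df[mask]                        # old way
by_query = df.query("City == 'Sochi'")    # newer way

print(mask.tolist())
print(by_mask.equals(by_query))
```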

In [51]:
## return df that holds only Sochi data

df_sochi = oly.query("City == 'Sochi'")
df_sochi

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
36323,2014,Sochi,Biathlon,Biathlon,"LANDERTINGER, Dominik",AUT,Men,10KM,Silver,Winter
36324,2014,Sochi,Biathlon,Biathlon,"SOUKUP, Jaroslav",CZE,Men,10KM,Bronze,Winter
36325,2014,Sochi,Biathlon,Biathlon,"BJOERNDALEN, Ole Einar",NOR,Men,10KM,Gold,Winter
36326,2014,Sochi,Biathlon,Biathlon,"MORAVEC, Ondrej",CZE,Men,12.5Km Pursuit,Silver,Winter
36327,2014,Sochi,Biathlon,Biathlon,"BEATRIX, Jean Guillaume",FRA,Men,12.5Km Pursuit,Bronze,Winter
...,...,...,...,...,...,...,...,...,...,...
36930,2014,Sochi,Skiing,Snowboard,"JONES, Jenny",GBR,Women,Slopestyle,Bronze,Winter
36931,2014,Sochi,Skiing,Snowboard,"ANDERSON, Jamie",USA,Women,Slopestyle,Gold,Winter
36932,2014,Sochi,Skiing,Snowboard,"MALTAIS, Dominique",CAN,Women,Snowboard Cross,Silver,Winter
36933,2014,Sochi,Skiing,Snowboard,"SAMKOVA, Eva",CZE,Women,Snowboard Cross,Gold,Winter


**Syntax for querying multi-word column names:**

```dataframe.query('`column name` comparison_operator value')```

Note the opening and closing backtick ``` ` ``` marks around the column name.
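
A small sketch of the three layers of quoting on a toy table: double quotes around the whole query, backticks around the multi-word column, single quotes around the string value. The ```try``` block only demonstrates that leaving the backticks out raises an error; the exact error type is not assumed here.

```python
import pandas as pd

# Toy table; rows borrowed from outputs in this notebook
df = pd.DataFrame({
    "Athlete Name": ["FOURCADE, Martin", "ANDERSON, Jamie", "VAULTIER, Pierre"],
    "Athlete Country": ["FRA", "USA", "FRA"],
})

french = df.query("`Athlete Country` == 'FRA'")
print(french["Athlete Name"].tolist())

try:
    df.query("Athlete Country == 'FRA'")   # no backticks around the two-word name
    unquoted_failed = False
except Exception as err:
    unquoted_failed = True
    print(type(err).__name__)
```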

In [54]:
## query to create df so that only athletes from France appear
df_france = oly.query ("`Athlete Country` == 'FRA'")
df_france

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
17,1896,Athens,Athletics,Athletics,"LERMUSIAUX, Albin",FRA,Men,1500M,Bronze,Summer
47,1896,Athens,Athletics,Athletics,"TUFFERI, Alexandre",FRA,Men,Triple Jump,Silver,Summer
51,1896,Athens,Cycling,Cycling Track,"FLAMENG, Léon",FRA,Men,100KM,Gold,Summer
54,1896,Athens,Cycling,Cycling Track,"MASSON, Paul",FRA,Men,10KM,Gold,Summer
55,1896,Athens,Cycling,Cycling Track,"FLAMENG, Léon",FRA,Men,10KM,Silver,Summer
...,...,...,...,...,...,...,...,...,...,...
36846,2014,Sochi,Skiing,Freestyle Skiing,"ROLLAND, Kevin",FRA,Men,Ski Halfpipe,Bronze,Winter
36860,2014,Sochi,Skiing,Freestyle Skiing,"MARTINOD, Marie",FRA,Women,Ski Halfpipe,Silver,Winter
36903,2014,Sochi,Skiing,Ski Jumping,"MATTEL, Coline",FRA,Women,K90 Individual,Bronze,Winter
36917,2014,Sochi,Skiing,Snowboard,"VAULTIER, Pierre",FRA,Men,Snowboard Cross,Gold,Winter


## How many values in a column?

```df_name["column_name"].value_counts()```
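
A quick sketch on a four-row toy column. ```normalize=True``` is an optional argument that returns shares instead of raw counts.

```python
import pandas as pd

# Toy column for illustration only
gender = pd.Series(["Men", "Men", "Women", "Men"], name="Gender")

counts = gender.value_counts()                 # most frequent value first
shares = gender.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```

Keep in mind that ```value_counts()``` counts rows. In this dataset each row is one medal awarded to one athlete, so the result is a medal count, not a count of distinct people.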

In [57]:
## How many medals went to men and how many to women across all the games in oly?
oly["Gender"].value_counts()

Men      26690
Women    10245
Name: Gender, dtype: int64

## Filtering by dates

You'll learn more about date and time in the coming week. For now, please note how the ```Year``` column is an ```int64``` object and NOT a ```datetime``` object. We can still work with it for our needs.
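
Because ```Year``` is a plain integer, ordinary numeric comparisons work inside ```query```. The sketch below uses a made-up five-row table and shows a chained range comparison, the equivalent ```between``` mask, and a range combined with a second condition.

```python
import pandas as pd

# Toy table for illustration only
df = pd.DataFrame({
    "Year": [1896, 1920, 1936, 1948, 1952],
    "Sport": ["Tennis", "Tennis", "Boxing", "Tennis", "Tennis"],
})

# Chained comparison: both ends are included
in_range = df.query("1920 <= Year <= 1950")

# Same filter written as a boolean mask; between() is inclusive by default
in_range_mask = df[df["Year"].between(1920, 1950)]

# Range plus a second condition
tennis = df.query("(1920 <= Year <= 1950) and (Sport == 'Tennis')")

print(in_range["Year"].tolist())
print(tennis["Year"].tolist())
```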

In [60]:
## Find all the 1900 competitions
df_1900 = oly.query ("Year == 1900")
df_1900

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
151,1900,Paris,Aquatics,Swimming,"HALMAY, Zoltan",HUN,Men,1500M Freestyle,Bronze,Summer
152,1900,Paris,Aquatics,Swimming,"JARVIS, John Arthur",GBR,Men,1500M Freestyle,Gold,Summer
153,1900,Paris,Aquatics,Swimming,"WAHLE, Otto",AUT,Men,1500M Freestyle,Silver,Summer
154,1900,Paris,Aquatics,Swimming,"DROST, Johannes",NED,Men,200M Backstroke,Bronze,Summer
155,1900,Paris,Aquatics,Swimming,"HOPPENBERG, Ernst",GER,Men,200M Backstroke,Gold,Summer
...,...,...,...,...,...,...,...,...,...,...
658,1900,Paris,Tug of War,Tug of War,"COLLAS, Jean",FRA,Men,Tug Of War,Silver,Summer
659,1900,Paris,Tug of War,Tug of War,"GONDOUIN, Charles",FRA,Men,Tug Of War,Silver,Summer
660,1900,Paris,Tug of War,Tug of War,"HENRIQUEZ DE ZUBIERRA, Constantin",FRA,Men,Tug Of War,Silver,Summer
661,1900,Paris,Tug of War,Tug of War,"ROFFO, Joseph",FRA,Men,Tug Of War,Silver,Summer


In [63]:
### return all the competitions between 1920 and 1950.
df_1920_1950 = oly.query ("1920 <= Year <=1950")
df_1920_1950

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
2822,1920,Antwerp,Aquatics,Diving,"PRIESTE, Harry",USA,Men,10M Platform,Bronze,Summer
2823,1920,Antwerp,Aquatics,Diving,"PINKSTON, Clarence",USA,Men,10M Platform,Gold,Summer
2824,1920,Antwerp,Aquatics,Diving,"ADLERZ, Erik",SWE,Men,10M Platform,Silver,Summer
2825,1920,Antwerp,Aquatics,Diving,"OLLIVIER, Eva",SWE,Women,10M Platform,Bronze,Summer
2826,1920,Antwerp,Aquatics,Diving,"FRYLAND CLAUSEN, Stefani",DEN,Women,10M Platform,Gold,Summer
...,...,...,...,...,...,...,...,...,...,...
31731,1948,St.Moritz,Skiing,Nordic Combined,"HASU, Heikki",FIN,Men,Individual,Gold,Winter
31732,1948,St.Moritz,Skiing,Nordic Combined,"HUHTALA, Martti",FIN,Men,Individual,Silver,Winter
31733,1948,St.Moritz,Skiing,Ski Jumping,"SCHJELDERUP, Thorleif",NOR,Men,K90 Individual (70M),Bronze,Winter
31734,1948,St.Moritz,Skiing,Ski Jumping,"HUGSTED, Petter",NOR,Men,K90 Individual (70M),Gold,Winter


In [67]:
## return only Tennis competitions between 1920 and 1950.
df_1920_1950_Tennis = oly.query ("1920 <= Year <=1950 & Sport == 'Tennis'")
df_1920_1950_Tennis

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
4026,1920,Antwerp,Tennis,Tennis,"ALBARRAN, Pierre",FRA,Men,Doubles,Bronze,Summer
4027,1920,Antwerp,Tennis,Tennis,"DECUGIS, Max",FRA,Men,Doubles,Bronze,Summer
4028,1920,Antwerp,Tennis,Tennis,"TURNBULL, Oswald Graham Noel",GBR,Men,Doubles,Gold,Summer
4029,1920,Antwerp,Tennis,Tennis,"WOOSNAM, Maxwell",GBR,Men,Doubles,Gold,Summer
4030,1920,Antwerp,Tennis,Tennis,"KASHIO, Seiichiro",JPN,Men,Doubles,Silver,Summer
4031,1920,Antwerp,Tennis,Tennis,"KUMAGAE, Ichiya",JPN,Men,Doubles,Silver,Summer
4032,1920,Antwerp,Tennis,Tennis,"D'AYEN, Elisabeth",FRA,Women,Doubles,Bronze,Summer
4033,1920,Antwerp,Tennis,Tennis,"LENGLEN, Suzanne",FRA,Women,Doubles,Bronze,Summer
4034,1920,Antwerp,Tennis,Tennis,"MCKANE, Kathleen",GBR,Women,Doubles,Gold,Summer
4035,1920,Antwerp,Tennis,Tennis,"MCNAIR, Winifred Margaret",GBR,Women,Doubles,Gold,Summer


In [70]:
## return only Women's Tennis competitions between 1996 and 2019 
df_1996_2019_Tennis_W = oly.query ("1996 <= Year <= 2019 \
                                    & Sport == 'Tennis' \
                                    & Gender == 'Women'")
df_1996_2019_Tennis_W

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
22975,1996,Atlanta,Tennis,Tennis,"MARTINEZ, Conchita",ESP,Women,Doubles,Bronze,Summer
22976,1996,Atlanta,Tennis,Tennis,"SANCHEZ-VICARIO, Arantxa",ESP,Women,Doubles,Bronze,Summer
22977,1996,Atlanta,Tennis,Tennis,"FERNANDEZ, Gigi",USA,Women,Doubles,Gold,Summer
22978,1996,Atlanta,Tennis,Tennis,"FERNANDEZ, Mary Joe",USA,Women,Doubles,Gold,Summer
22979,1996,Atlanta,Tennis,Tennis,"NOVOTNA, Jana",CZE,Women,Doubles,Silver,Summer
22980,1996,Atlanta,Tennis,Tennis,"SUKOVA, Helena",CZE,Women,Doubles,Silver,Summer
22984,1996,Atlanta,Tennis,Tennis,"NOVOTNA, Jana",CZE,Women,Singles,Bronze,Summer
22985,1996,Atlanta,Tennis,Tennis,"DAVENPORT, Lindsay",USA,Women,Singles,Gold,Summer
22986,1996,Atlanta,Tennis,Tennis,"SANCHEZ-VICARIO, Arantxa",ESP,Women,Singles,Silver,Summer
24981,2000,Sydney,Tennis,Tennis,"CALLENS, Els",BEL,Women,Doubles,Bronze,Summer


In [80]:
## Women competing in tennis or boxing
df_Tennis_Box_W = oly.query ("Gender == 'Women' & (Sport == 'Tennis' | Sport == 'Boxing')")
df_Tennis_Box_W

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
639,1900,Paris,Tennis,Tennis,"JONES, Marion",ZZX,Women,Mixed Doubles,Bronze,Summer
640,1900,Paris,Tennis,Tennis,"ROSENBAUM, Hedwig",ZZX,Women,Mixed Doubles,Bronze,Summer
641,1900,Paris,Tennis,Tennis,"COOPER, Charlotte",GBR,Women,Mixed Doubles,Gold,Summer
642,1900,Paris,Tennis,Tennis,"PREVOST, Hélène",ZZX,Women,Mixed Doubles,Silver,Summer
647,1900,Paris,Tennis,Tennis,"JONES, Marion",USA,Women,Singles,Bronze,Summer
...,...,...,...,...,...,...,...,...,...,...
30949,2012,London,Tennis,Tennis,"ROBSON, Laura",GBR,Women,Mixed Doubles,Silver,Summer
30951,2012,London,Tennis,Tennis,"RAYMOND, Lisa",USA,Women,Mixed Doubles,Bronze,Summer
30955,2012,London,Tennis,Tennis,"WILLIAMS, Serena",USA,Women,Singles,Gold,Summer
30956,2012,London,Tennis,Tennis,"SHARAPOVA, Maria",RUS,Women,Singles,Silver,Summer


## What year was women's boxing introduced to the olympics?

In [82]:
## find all the names of sports first just to confirm boxing 
## does not appear in different names
## how is boxing listed?

In [85]:
oly["Sport"].unique()

array(['Aquatics', 'Athletics', 'Cycling', 'Fencing', 'Gymnastics',
       'Shooting', 'Tennis', 'Weightlifting', 'Wrestling', 'Archery',
       'Basque Pelota', 'Cricket', 'Croquet', 'Equestrian', 'Football',
       'Golf', 'Polo', 'Rowing', 'Rugby', 'Sailing', 'Tug of War',
       'Boxing', 'Lacrosse', 'Roque', 'Hockey', 'Jeu de paume', 'Rackets',
       'Skating', 'Water Motorsports', 'Modern Pentathlon', 'Ice Hockey',
       'Basketball', 'Canoe / Kayak', 'Handball', 'Judo', 'Volleyball',
       'Table Tennis', 'Badminton', 'Baseball', 'Softball', 'Taekwondo',
       'Triathlon', 'Canoe', 'Biathlon', 'Bobsleigh', 'Curling', 'Skiing',
       'Luge'], dtype=object)

In [87]:
#How many unique values are there?
oly["Sport"].nunique()

48
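
Scanning a long array by eye is error-prone, so one option is to search the labels directly. This is a sketch on a made-up column; the search term ```"box"``` is just an example.

```python
import pandas as pd

# Toy column for illustration only
sports = pd.Series(["Boxing", "Tennis", "Boxing", "Canoe", "Canoe / Kayak"])

labels = sports.unique()       # distinct labels, in order of first appearance
n_labels = sports.nunique()    # how many distinct labels there are

# Case-insensitive search for any label containing "box"
boxing_like = sports[sports.str.contains("box", case=False)].unique()

print(list(labels))
print(n_labels)
print(list(boxing_like))
```

The same search on "canoe" would return two labels here, which is the kind of duplicate naming the check is meant to catch.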

In [89]:
### When did boxing first appear as a sport in which women competed at the olympics?
oly.query ("Gender == 'Women' & Sport == 'Boxing'").sort_values(by = "Year", ascending = True)

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
29876,2012,London,Boxing,Boxing,"ADAMS, Nicola",GBR,Women,51 KG,Gold,Summer
29877,2012,London,Boxing,Boxing,"REN, Cancan",CHN,Women,51 KG,Silver,Summer
29878,2012,London,Boxing,Boxing,"ESPARZA, Marlen",USA,Women,51 KG,Bronze,Summer
29879,2012,London,Boxing,Boxing,"KOM, Mary",IND,Women,51 KG,Bronze,Summer
29896,2012,London,Boxing,Boxing,"TAYLOR, Katie",IRL,Women,60 KG,Gold,Summer
29897,2012,London,Boxing,Boxing,"OCHIGAVA, Sofya",RUS,Women,60 KG,Silver,Summer
29898,2012,London,Boxing,Boxing,"ARAUJO, Adriana",BRA,Women,60 KG,Bronze,Summer
29899,2012,London,Boxing,Boxing,"CHORIEVA, Mavzuna",TJK,Women,60 KG,Bronze,Summer
29912,2012,London,Boxing,Boxing,"SHIELDS, Claressa",USA,Women,75 KG,Gold,Summer
29913,2012,London,Boxing,Boxing,"TORLOPOVA, Nadezda",RUS,Women,75 KG,Silver,Summer


## Summer v. Winter

It's likely that the Summer Olympics have more medal winners than Winter Olympics. How can we check?

In [91]:
## what is the total medal count for winter v. summer?
oly["Season"].value_counts()


Summer    31165
Winter     5770
Name: Season, dtype: int64

## How many medals were handed out at each olympics between 1896 and 2014?

Show the result in chronological order.
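
Counting over two columns returns one row per (Year, Season) pair, and the result can then be ordered either by its index (chronological) or by its values (largest first). A minimal sketch on a made-up six-row table:

```python
import pandas as pd

# Toy table for illustration only
df = pd.DataFrame({
    "Year": [1924, 1924, 1924, 1928, 1928, 1928],
    "Season": ["Summer", "Summer", "Winter", "Summer", "Winter", "Winter"],
})

per_games = df[["Year", "Season"]].value_counts()

chronological = per_games.sort_index(ascending=True)      # by Year, then Season
largest_first = per_games.sort_values(ascending=False)    # by medal count

print(chronological)
print(largest_first)
```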

In [97]:
oly[["Year","Season"]].value_counts().sort_index(ascending = True)

Year  Season
1896  Summer     151
1900  Summer     512
1904  Summer     470
1908  Summer     804
1912  Summer     885
1920  Summer    1298
1924  Summer     884
      Winter     118
1928  Summer     710
      Winter      89
1932  Summer     615
      Winter     116
1936  Summer     875
      Winter     108
1948  Summer     814
      Winter     140
1952  Summer     889
      Winter     136
1956  Summer     885
      Winter     150
1960  Summer     882
      Winter     147
1964  Summer    1010
      Winter     185
1968  Summer    1031
      Winter     199
1972  Summer    1185
      Winter     200
1976  Summer    1305
      Winter     210
1980  Summer    1387
      Winter     218
1984  Summer    1459
      Winter     222
1988  Summer    1546
      Winter     264
1992  Summer    1705
      Winter     325
1994  Winter     343
1996  Summer    1859
1998  Winter     447
2000  Summer    2015
2002  Winter     481
2004  Summer    1998
2006  Winter     531
2008  Summer    2042
2010  Winter     529


## Filter and count

What are the top 5 sports in which France has won the most gold medals?

(HINT: different answers based on columns you decide to count)
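
One possible reading of the hint: counting rows gives one gold per athlete, so a team sport counts once for every player on the winning team. The sketch below uses made-up rows with placeholder athlete names and assumes that a unique (Year, Sport, Event, Gender) combination stands for one event. That assumption is an illustration, not a rule taken from the dataset.

```python
import pandas as pd

# Made-up gold-medal rows; athlete names are placeholders
gold = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2012, 2012],
    "Sport": ["Handball", "Handball", "Handball", "Fencing", "Aquatics"],
    "Event": ["Handball", "Handball", "Handball",
              "Foil Individual", "100M Freestyle"],
    "Gender": ["Men", "Men", "Men", "Women", "Men"],
    "Athlete Name": ["Player A", "Player B", "Player C", "Fencer D", "Swimmer E"],
})

# Count 1: every row, so each team member adds one gold
rows_per_sport = gold["Sport"].value_counts()

# Count 2: one row per event, so a team win counts once
one_row_per_event = gold.drop_duplicates(subset=["Year", "Sport", "Event", "Gender"])
events_per_sport = one_row_per_event["Sport"].value_counts()

print(rows_per_sport)
print(events_per_sport)
```

Adding ```.head(5)``` to either result keeps only the top five sports.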

In [104]:
gold_france = oly.query ("`Athlete Country` == 'FRA' & Medal == 'Gold'")
gold_france

Unnamed: 0,Year,City,Sport,Discipline,Athlete Name,Athlete Country,Gender,Event,Medal,Season
51,1896,Athens,Cycling,Cycling Track,"FLAMENG, Léon",FRA,Men,100KM,Gold,Summer
54,1896,Athens,Cycling,Cycling Track,"MASSON, Paul",FRA,Men,10KM,Gold,Summer
59,1896,Athens,Cycling,Cycling Track,"MASSON, Paul",FRA,Men,1KM Time Trial,Gold,Summer
62,1896,Athens,Cycling,Cycling Track,"MASSON, Paul",FRA,Men,Sprint Indivual,Gold,Summer
65,1896,Athens,Fencing,Fencing,"GRAVELOTTE, Eugène-Henri",FRA,Men,Foil Individual,Gold,Summer
...,...,...,...,...,...,...,...,...,...,...
36270,2010,Vancouver,Skiing,Nordic Combined,"LAMY CHAPPUIS, Jason",FRA,Men,"Individual, Ski Jumping K90 (70M)",Gold,Winter
36328,2014,Sochi,Biathlon,Biathlon,"FOURCADE, Martin",FRA,Men,12.5Km Pursuit,Gold,Winter
36332,2014,Sochi,Biathlon,Biathlon,"FOURCADE, Martin",FRA,Men,20KM,Gold,Winter
36843,2014,Sochi,Skiing,Freestyle Skiing,"CHAPUIS, Jean Frederic",FRA,Men,Ski Cross,Gold,Winter


In [109]:
gold_france["Sport"].value_counts()

Fencing              116
Cycling               66
Handball              30
Aquatics              24
Equestrian            23
Skiing                22
Rowing                17
Football              17
Rugby                 17
Sailing               16
Athletics             14
Judo                  12
Shooting               9
Weightlifting          9
Biathlon               8
Tennis                 7
Canoe / Kayak          7
Archery                6
Skating                6
Boxing                 4
Wrestling              4
Croquet                4
Gymnastics             3
Canoe                  2
Water Motorsports      1
Name: Sport, dtype: int64