Download this data to use in the exercise:

https://www.kaggle.com/the-guardian/olympic-games/data

This exercise looks at data on all of the past Olympics (up to 2014). Follow along with the text and use code cells to answer the questions.

First, use the following code to initialize your backend and upload the datasets to Colaboratory.

In [0]:
import pandas as pd
pd.options.display.max_rows = 200
# The following is code for uploading a file to the colab.research.google 
# environment.

# library for uploading files
from google.colab import files 

def upload_files():
    # initiates the upload - follow the dialogues that appear
    uploaded = files.upload()

    # verify the upload
    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
            name=fn, length=len(uploaded[fn])))

    # uploaded files need to be written to file to interact with them
    # as part of a file system
    for filename in uploaded.keys():
        with open(filename, 'wb') as f:
            f.write(uploaded[filename])

In [0]:
upload_files()

Saving dictionary.csv to dictionary.csv
Saving summer.csv to summer.csv
Saving winter.csv to winter.csv
User uploaded file "dictionary.csv" with length 7624 bytes
User uploaded file "summer.csv" with length 2573921 bytes
User uploaded file "winter.csv" with length 466225 bytes


## Basics

1) Load the three data files as DataFrames.

In [0]:
country_dict_df = pd.read_csv("dictionary.csv")
summer_df = pd.read_csv("summer.csv")
winter_df = pd.read_csv("winter.csv")

2) Get a basic feel for the schemas of these three DataFrames. Run some of the EDA techniques we discussed.

3) The semantics of this data all look pretty straightforward. Can we verify that the Population column is not scaled by a power of ten (for example, recorded in thousands or millions)? Try querying for just the 'Canada' row in the country dictionary DataFrame to verify this. (Tip: use `.loc`)
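
As a reminder of how `.loc` combines a row filter with a column selection, here is a minimal sketch. It uses a made-up table (`toy_df`, with fictional countries and numbers) rather than the real dictionary file, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy table; the countries and numbers are made up.
toy_df = pd.DataFrame({
    "Country": ["Atlantis", "Freedonia", "Ruritania"],
    "Code": ["ATL", "FRE", "RUR"],
    "Population": [1200, 56000, 340000],
})

# A boolean mask as the first argument to .loc selects rows.
row = toy_df.loc[toy_df["Country"] == "Freedonia"]

# An optional second argument selects one or more columns.
population = toy_df.loc[toy_df["Country"] == "Freedonia", "Population"]

print(row)
print(population)
```

The same pattern should carry over to the country dictionary DataFrame once you substitute its column names.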

4) Looking at just the summer Olympics, let's do some querying. Each query here can be done in one line:

>a) Display the row corresponding to athlete Kate Walsh's medal.

>b) Display just the City corresponding to athlete Kate Walsh's medal.

>c) List each sport found in the summer Olympics DataFrame.

>d) What medal did Cristian Gatu win in 1976?

>e) How many medals have been won by USA and China combined? Hint: Use `.count()` after a query to count the rows.
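
The query patterns these questions rely on can be sketched on a small made-up table. Everything in `toy_medals` (athletes, country codes, medals) is hypothetical; only the pandas calls are meant to transfer to the real data.

```python
import pandas as pd

# Hypothetical toy medal table; every value is made up.
toy_medals = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004, 2008],
    "City": ["Sydney", "Sydney", "Athens", "Athens", "Beijing"],
    "Sport": ["Rowing", "Judo", "Rowing", "Fencing", "Judo"],
    "Athlete": ["DOE, Jane", "ROE, Sam", "DOE, Jane", "POE, Alex", "ROE, Sam"],
    "Country": ["AAA", "BBB", "AAA", "CCC", "BBB"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "Gold"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
doe_2004 = toy_medals[(toy_medals["Athlete"] == "DOE, Jane") & (toy_medals["Year"] == 2004)]

# Filter rows and pick a single column in one .loc call.
doe_cities = toy_medals.loc[toy_medals["Athlete"] == "DOE, Jane", "City"]

# Distinct values of a column.
sports = toy_medals["Sport"].unique()

# .isin matches any of several values; .count() counts the non-null entries.
n_ab = toy_medals.loc[toy_medals["Country"].isin(["AAA", "BBB"]), "Medal"].count()

print(doe_2004)
print(doe_cities.tolist(), sports, n_ab)
```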


5) Let's do some sorting! I want you to make two DataFrame variables:


*   Sort the country dictionary by Population from lowest to highest
*   Sort the country dictionary by Population from highest to lowest

Make sure you save the results of each sort to a variable.

Next, call `.reset_index(drop=True)` on each of the new variables you created. What has this done?

Finally, we are interested in knowing the name of the country with the largest population and the name of the country with the
smallest population. Fortunately, the data structures we just created lend themselves nicely to this task. Query
for the Country name in the first row of each of the DataFrames you created in this question. (Hint: Use `.at[,]`)
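
A minimal sketch of the sort, re-index, and lookup sequence, run on a made-up three-row table (`toy_df` and its values are hypothetical). Printing the index before and after `reset_index` is one way to see what the call changes.

```python
import pandas as pd

# Hypothetical toy table; the countries and numbers are made up.
toy_df = pd.DataFrame({
    "Country": ["Atlantis", "Freedonia", "Ruritania"],
    "Population": [56000, 1200, 340000],
})

# sort_values returns a new DataFrame; the original is left unchanged.
ascending = toy_df.sort_values("Population")
descending = toy_df.sort_values("Population", ascending=False)

# Compare the index labels before and after resetting them.
print(ascending.index.tolist())
ascending = ascending.reset_index(drop=True)
descending = descending.reset_index(drop=True)
print(ascending.index.tolist())

# .at[row_label, column_label] returns a single scalar value.
smallest = ascending.at[0, "Country"]
largest = descending.at[0, "Country"]
print(smallest, largest)
```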


6) Grouping time! Remember the group and aggregate paradigm for these tasks.

Let's use the winter Olympics DataFrame this time. Can you display:

>a) The number of medals for each country? (hint: use `.groupby` and then `.count`)

>b) The number of medals for each sport, treating Men's and Women's categories as separate sports? (Hint: groupby can take a list of column names)

>c) The average number of medals each country won per year in which it competed? This one requires two groupby statements. First, groupby two columns and aggregate. Then use groupby on that output to compute an average.
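
The group-and-aggregate pattern, including the two-step version, might look like the sketch below. The table `toy` is made up, and grouping the intermediate result by index level is just one possible way to chain the second groupby (calling `.reset_index()` first and grouping by column would work too).

```python
import pandas as pd

# Hypothetical toy medal table; every value is made up.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2004, 2004, 2004, 2004],
    "Country": ["AAA", "AAA", "BBB", "AAA", "BBB", "BBB", "BBB"],
    "Sport": ["Skiing", "Skiing", "Skating", "Skating", "Skiing", "Skiing", "Skating"],
    "Gender": ["Men", "Women", "Men", "Men", "Women", "Women", "Men"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold", "Silver", "Bronze"],
})

# One grouping column: one count per country.
per_country = toy.groupby("Country")["Medal"].count()

# A list of columns: one count per (Sport, Gender) combination.
per_sport_gender = toy.groupby(["Sport", "Gender"])["Medal"].count()

# Step 1: count per (Country, Year). Step 2: average those counts per country.
per_country_year = toy.groupby(["Country", "Year"])["Medal"].count()
avg_per_year = per_country_year.groupby(level="Country").mean()

print(per_country)
print(per_sport_gender)
print(avg_per_year)
```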

7) value_counts() is a method that can be used on a Series (an individual column) to quickly summarize its contents. It is similar to calling groupby, then count, and then sort_values. I want to show how this method works and some tricks you can use with it. We are going to use the winter Olympics DataFrame again.

>a) Use .value_counts() on the Sport column to see what Sports exist and how many medals have been won in each sport. Save the output to a variable.

>b) Notice that the output of value_counts is a Series where the values are the counts and the indices are the labels. Display the largest value in the Series you produced in a). (Hint: use .iat)

>c) How many medals have been won in the "Skating" sport? (hint: sport names are the indices of the series so you can access them directly with the [] operator)

>d) Sometimes it can be easier to work with data as a DataFrame than a Series (e.g. using the loc operator). Use .reset_index() on your Series to turn the value_counts() series into a DataFrame.
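
The steps above can be sketched on a small made-up Series (`sports` and its values are hypothetical). Note that the column names produced by `.reset_index()` on a value_counts result differ between pandas versions, so the sketch prints them instead of assuming them.

```python
import pandas as pd

# Hypothetical toy column; the values are made up.
sports = pd.Series(
    ["Skiing", "Skating", "Skiing", "Biathlon", "Skiing", "Skating"],
    name="Sport",
)

# Counts are sorted from most to least frequent by default.
counts = sports.value_counts()

# Positional access: the first entry holds the largest count.
largest = counts.iat[0]

# Label access: the distinct values are the index of the result.
skating = counts["Skating"]

# Turn the Series into a two-column DataFrame.
counts_df = counts.reset_index()

print(counts)
print(largest, skating)
print(counts_df.columns.tolist())
```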

8) Joins. The country dictionary is a great example of where you can use a join to connect two DataFrames. Our goal for this question is to join the winter Olympics DataFrame with the country dictionary so that every row in the winter Olympics DataFrame also contains the Country name (not just the country code), Population, and GDP per Capita:

>a) Use set_index() on the country dictionary DataFrame and set the index to be the country code.

>b) Call .join on the winter Olympics DataFrame. The tricky part is deciding what the arguments should be. The first argument should be the DataFrame from a). The join should be done on the "Country" column. Give informative suffix values for lsuffix and rsuffix so that there are no name conflicts. You decide what value should be used for "how" so that each row of the result corresponds to a row in the winter Olympics DataFrame.
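
To see how the choice of "how" affects the result, the sketch below joins two made-up tables (`medals` and `codes`, both hypothetical) and compares row counts for two options. One medal row uses a code that is missing from the lookup table on purpose. The suffixes only affect column names that appear in both tables.

```python
import pandas as pd

# Hypothetical lookup table keyed by country code; values are made up.
codes = pd.DataFrame({
    "Country": ["Atlantis", "Freedonia"],
    "Code": ["ATL", "FRE"],
    "Population": [1200, 56000],
})

# Hypothetical medal table; "XXX" deliberately has no match in codes.
medals = pd.DataFrame({
    "Athlete": ["DOE, Jane", "ROE, Sam", "POE, Alex", "MOE, Kim"],
    "Country": ["ATL", "FRE", "ATL", "XXX"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold"],
})

# The join key on the right-hand side must be its index.
codes_by_code = codes.set_index("Code")

# on= names the left-hand column whose values are matched against that index.
left_joined = medals.join(
    codes_by_code, on="Country", how="left", lsuffix="_code", rsuffix="_name"
)
inner_joined = medals.join(
    codes_by_code, on="Country", how="inner", lsuffix="_code", rsuffix="_name"
)

# Compare how many rows each option keeps.
print(len(medals), len(left_joined), len(inner_joined))
print(left_joined)
```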

9) Apply. The apply method can be very powerful as it allows you to modify or create columns with arbitrary functions. Let's make a new column in the winter Olympics DataFrame that contains the Athlete's last name only. Follow these steps:

>a) Write a function that takes a string in the format "x, y" and returns the "x" portion. You can use the .split string method to accomplish this.

>b) Call .apply on the Athlete column and pass the function you wrote in a) as the argument. Assign the output to a new column named "athlete_last_name".
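
A minimal sketch of the apply pattern on made-up names. The helper `after_comma` and the column `athlete_first_name` are hypothetical and deliberately extract the other half of the string, so the last-name version is still left for you to write.

```python
import pandas as pd

def after_comma(text):
    """Return the part of an "x, y" string that follows the comma."""
    # split(",") gives ["x", " y"]; strip() removes the leading space.
    return text.split(",")[1].strip()

# Hypothetical toy table; the athlete names are made up.
toy = pd.DataFrame({"Athlete": ["DOE, Jane", "ROE, Sam", "POE, Alex"]})

# .apply calls the function once per value and returns a new Series.
toy["athlete_first_name"] = toy["Athlete"].apply(after_comma)

print(toy)
```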

## Advanced Questions



1) What country has the most medals in the summer Olympics overall? Please return the name of the country. Your code should look up the name of the country from the country codes dictionary.

For each country, in which sport has it won the most Gold medals? Consider:
 * 2) Just the summer Olympics.
 * 3) Just the winter Olympics.
 * 4) Both the summer and winter Olympics.
 
Some bonus goals:

*   Think about how to organize your code to avoid repetition
*   Display the results with informative column names
*   Use full names of the countries
*   Ignore countries that are not in the country codes dictionary
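
One possible way to avoid repetition is to put the shared logic in a function that accepts any medals table, then call it on each table and on their concatenation. The sketch below only shows that structure: the tables, the helper `gold_counts`, and the `Season` column are all hypothetical, and the helper computes a simpler quantity than the questions ask for.

```python
import pandas as pd

# Hypothetical toy tables; every value is made up.
summer_toy = pd.DataFrame({
    "Country": ["AAA", "BBB", "AAA"],
    "Medal": ["Gold", "Gold", "Silver"],
})
winter_toy = pd.DataFrame({
    "Country": ["AAA", "CCC"],
    "Medal": ["Gold", "Bronze"],
})

def gold_counts(medals_df):
    """Count Gold medals per country in any table with Country and Medal columns."""
    gold = medals_df[medals_df["Medal"] == "Gold"]
    return gold.groupby("Country")["Medal"].count().rename("gold_medals")

# Stack both tables, labelling each row with the games it came from.
both_toy = pd.concat(
    [summer_toy.assign(Season="Summer"), winter_toy.assign(Season="Winter")],
    ignore_index=True,
)

# The same function serves all three cases.
tables = {"summer": summer_toy, "winter": winter_toy, "both": both_toy}
results = {name: gold_counts(df) for name, df in tables.items()}

for name, result in results.items():
    print(name)
    print(result)
```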



5) For each country, what is the average medal count for each gender? The average should be across years. This time, report the average taking both the summer and winter Olympics into account.

6) Who are the top 5 athletes in terms of how many separate Olympics they have appeared at (NOT the most medals)? Use both the winter and summer Olympics. Treat the winter and summer Olympics as separate Olympics.

Optional: 7) Come up with your own analytical question. What interesting things can you ask the data? Use pandas to answer the question. If possible, explain how the answer could be used.