<a href="https://colab.research.google.com/github/SARA3SAEED/DA-Mu/blob/main/s00_olympic_case_study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project - Olympic Sports Analysis [Data Pre-processing]

- You can find the full project & the dataset at: https://www.kaggle.com/the-guardian/olympic-games
- In this project, we will consider these topics:
    - Data Cleaning & Manipulation
    - Data Grouping & Aggregation
    - Data Reshaping & Pivoting
    - Data Merging, Joining, & Concatenation

## Olympic Sports and Medals, 1896-2014
Which countries and athletes have won the most medals at the Olympic games?

### Importing libraries & data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [None]:
summer = pd.read_csv('data/summer.csv')

In [None]:
winter = pd.read_csv('data/winter.csv')

In [None]:
countries = pd.read_csv('data/dictionary.csv')

### Inspecting Datasets

In [None]:
summer.head()

In [None]:
summer.info()

In [None]:
winter.head()

In [None]:
winter.info()

In [None]:
countries.head()

In [None]:
countries.info()

In [None]:
# Listing all of the missing data in the 'countries' dataframe
countries[countries.isnull().any(axis = 1)].reset_index(drop=True)
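
The row filter above keeps every record with at least one missing field. As a small sketch on toy data (the `Population` and `GDP` columns are made up for illustration and are not assumed to match `dictionary.csv`), the same idea can be paired with a per-column count of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population": [1.0, np.nan, 3.0],
    "GDP": [np.nan, np.nan, 2.0],
})

# Rows with at least one missing value (same pattern as the cell above)
incomplete = toy[toy.isnull().any(axis=1)].reset_index(drop=True)

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

print(incomplete)
print(missing_per_column)
```

The per-column count helps decide whether a gap is worth filling or the affected rows can simply be dropped.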

### Proposed Questions

- ***Analysing all Summer editions data***
    - Can you find the __most decorated__ male / female __athletes__ of all time in the Summer editions?
    - Who are the top __athletes__ for each __medal type__ in the Summer editions?

- ***Which are the most successful countries in both Summer and Winter editions?***
    - What are the __Top 10__ Countries by __total medals__?
    - __Split__ the total medals of Top 10 Countries into __Summer / Winter__. Are there typical Summer/Winter Games Countries?
    - __Split__ the total medals of Top 10 Countries into __Gold, Silver, Bronze__.


==========

- ***Analysing all Summer editions data***
    - Can you find the __most decorated__ male / female __athletes__ of all time in the Summer editions?
    - Who are the top __athletes__ for each __medal type__ in the Summer editions?

##### Q. Can you find the most decorated male / female athletes of all time in the Summer editions?

In [None]:
summer.head()

In [None]:
# Rewrite athlete names in the Summer data from "SURNAME, Given" to "Given Surname"
summer['Athlete'] = summer['Athlete'].str.split(', ').str[::-1].str.join(' ')
summer['Athlete'] = summer['Athlete'].str.title()
summer.head()
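
The chain of `.str` calls above can be hard to read at a glance. Here is a minimal sketch of what each step does, run on a few toy names that assume the raw column uses a "SURNAME, Given" layout:

```python
import pandas as pd

# Toy names; the third has no comma and should pass through unchanged
names = pd.Series(["PHELPS, Michael", "LATYNINA, Larisa", "Ronaldinho"])

reordered = (
    names.str.split(", ")   # each entry becomes a list, e.g. ["PHELPS", "Michael"]
         .str[::-1]         # reverse every list element-wise
         .str.join(" ")     # glue the pieces back together with a space
         .str.title()       # normalise capitalisation
)

print(reordered.tolist())
```

Entries without a comma yield a one-element list, so reversing and joining leaves them as they were.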

In [None]:
# Adding the countries column to our dataframe
summer = summer.merge(countries,left_on='Country',right_on='Code',how='left')
summer.head()

In [None]:
summer=summer[['Year','City','Sport','Discipline','Athlete','Country_x','Gender','Event','Medal','Country_y']]
summer.columns=['Year','City','Sport','Discipline','Athlete','Code','Gender','Event','Medal','Country']
summer.head()
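
The `Country_x` / `Country_y` names selected above come from the merge itself: when both frames share a column name that is not the join key, pandas appends suffixes. A small sketch on toy frames (the values are illustrative) shows the effect and one way to tidy it with `rename`:

```python
import pandas as pd

medals = pd.DataFrame({"Athlete": ["A", "B", "C"], "Country": ["USA", "URS", "FRA"]})
lookup = pd.DataFrame({"Country": ["United States", "France"], "Code": ["USA", "FRA"]})

# Both frames have a 'Country' column, so the result holds
# 'Country_x' (from medals) and 'Country_y' (from lookup)
merged = medals.merge(lookup, left_on="Country", right_on="Code", how="left")

tidy = (
    merged.drop(columns="Code")  # duplicate of Country_x for matched rows
          .rename(columns={"Country_x": "Code", "Country_y": "Country"})
)

print(tidy)
```

Because this is a left join, codes with no entry in the lookup table (here `URS`) keep their row and get a missing `Country`.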

In [None]:
# The most decorated male athlete across all Summer editions
male_athlete = summer[summer['Gender']=='Men']['Athlete'].value_counts()[:1].index[0]
male_athlete

In [None]:
# His total number of medals
num_of_male_medals = summer[summer['Gender']=='Men']['Athlete'].value_counts()[:1].values[0]
num_of_male_medals

In [None]:
# The most decorated female athlete across all Summer editions
female_athlete = summer[summer['Gender']=='Women']['Athlete'].value_counts()[:1].index[0]
female_athlete

In [None]:
# Her total number of medals
num_of_female_medals = summer[summer['Gender']=='Women']['Athlete'].value_counts()[:1].values[0]
num_of_female_medals
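
The four cells above repeat the same `value_counts()` lookup for each gender. As a sketch on toy data (names and counts are invented), the name and medal count can be collected for every gender in a single pass:

```python
import pandas as pd

toy = pd.DataFrame({
    "Athlete": ["Ann", "Ann", "Bea", "Carl", "Carl", "Carl", "Dan"],
    "Gender":  ["Women", "Women", "Women", "Men", "Men", "Men", "Men"],
})

top = {}
for gender, group in toy.groupby("Gender"):
    counts = group["Athlete"].value_counts()
    # idxmax() gives the athlete's name, max() the number of medal rows
    top[gender] = (counts.idxmax(), int(counts.max()))

print(top)
```

If two athletes are tied, `idxmax()` returns only one of them, so ties need a separate check.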

##### Q. Who are the top athletes for each medal type in the Summer editions?

In [None]:
summer.head()

In [None]:
# Let's look at the medals won by 'Michael Phelps'
summer[summer.Athlete == 'Michael Phelps']

In [None]:
top_medals = summer.groupby(['Athlete','Medal'])['Sport'].count().reset_index().sort_values(by='Sport',ascending=False)
top_medals

In [None]:
top_medals = top_medals.drop_duplicates(subset=['Medal'],keep='first')
top_medals.columns = ['Athlete','Medal','Count']  # a flat list; a nested list would create a MultiIndex
top_medals
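
The pattern used above is: count rows per (athlete, medal) pair, sort by count, then keep the first row seen for each medal. A minimal sketch on invented data, using `size()` so the count column can be named directly:

```python
import pandas as pd

toy = pd.DataFrame({
    "Athlete": ["Ann", "Ann", "Bea", "Bea", "Bea", "Carl"],
    "Medal":   ["Gold", "Gold", "Gold", "Silver", "Silver", "Bronze"],
})

per_medal = (
    toy.groupby(["Athlete", "Medal"]).size().reset_index(name="Count")
       .sort_values("Count", ascending=False)          # biggest counts first
       .drop_duplicates(subset="Medal", keep="first")  # first row per medal = leader
       .set_index("Medal")
)

print(per_medal)
```

Note that `drop_duplicates` keeps a single athlete per medal, so a tie for first place is silently reduced to one name.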

##### Q. Calculate the medals per country and for the best male and female athletes across all Summer editions, and visualize the results

In [None]:
medals_country = summer.groupby(['Country','Medal'])['Gender'].count().reset_index().sort_values(by='Gender',ascending=False)
# pivot() takes keyword arguments only in recent pandas versions
medals_country = medals_country.pivot(index='Country', columns='Medal', values='Gender').fillna(0)
medals_country
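
The group, count, and pivot steps above produce a Country x Medal table of counts. As an alternative sketch (not the notebook's method), `pd.crosstab` builds the same kind of table in one call and fills absent combinations with 0. The data here is invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "France", "France"],
    "Medal":   ["Gold", "Gold", "Silver", "Bronze", "Gold"],
})

# Rows: countries, columns: medal types (sorted alphabetically), values: counts
table = pd.crosstab(toy["Country"], toy["Medal"])

# Country with the most gold medals
leader = table.sort_values("Gold", ascending=False).index[0]

print(table)
print(leader)
```

Like the `groupby` version, `crosstab` ignores rows whose country is missing, so historic codes without a full name would not appear in either table.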

In [None]:
top_10 = medals_country.sort_values(by='Gold',ascending=False)[:10]  # ten rows, ranked by gold medals
top_10

In [None]:
top_10.plot.barh(width=0.8,color=['#CD7F32','#FFDF00','#D3D3D3'])
fig = plt.gcf()
fig.set_size_inches(12,12)
plt.title('Medals Distribution Of Top 10 Countries (Summer Olympics)')
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18, 15))
# Pivoted medal columns come out alphabetically: Bronze, Gold, Silver
medal_colors = ['#CD7F32', '#FFDF00', '#D3D3D3']
# Overall ranking of athletes by medal count, used to choose who gets plotted
ranking = summer['Athlete'].value_counts().index

men = summer[summer['Gender'] == 'Men']
# size() counts rows; count() on 'Country' would skip medals whose country name is missing
men = men.groupby(['Athlete', 'Medal']).size().reset_index(name='Count')
men = men[men['Athlete'].isin(ranking[:15])]  # men among the 15 most decorated athletes
men = men.pivot(index='Athlete', columns='Medal', values='Count')
men.plot.barh(width=0.8, color=medal_colors, ax=ax[0])
ax[0].set_title('Best Male Athletes')
ax[0].set_ylabel('Athlete')

women = summer[summer['Gender'] == 'Women']
women = women.groupby(['Athlete', 'Medal']).size().reset_index(name='Count')
women = women[women['Athlete'].isin(ranking[:30])]  # women among the 30 most decorated athletes
women = women.pivot(index='Athlete', columns='Medal', values='Count')
women.plot.barh(width=0.8, color=medal_colors, ax=ax[1])
ax[1].set_title('Best Female Athletes')
ax[1].set_ylabel('')
plt.show()

==========

##### Q. Which are the most successful countries in both Summer and Winter editions?
- What are the __Top 10__ Countries by __total medals__?
- __Split__ the total medals of Top 10 Countries into __Summer / Winter__. Are there typical Summer/Winter Games Countries?
- __Split__ the total medals of Top 10 Countries into __Gold, Silver, Bronze__.

#### 1] Data Merging

In [None]:
summer.head()

In [None]:
winter.head()

In [None]:
countries.head()

In [None]:
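# `summer` was reshaped above (codes in 'Code', full names in 'Country'), so put it back
# into the raw layout (codes in 'Country') to match `winter`, then stack the two editions
summer_raw = summer.drop(columns = "Country").rename(columns = {"Code": "Country"})
olympics = pd.concat([summer_raw, winter], keys = ["Summer", "Winter"], names = ["Edition"]).reset_index().drop(columns = "level_1")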
olympics
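
The `keys` and `names` arguments are what turn the source of each row into an `Edition` column. A minimal sketch on two toy frames (values are illustrative) makes the intermediate index visible:

```python
import pandas as pd

summer_toy = pd.DataFrame({"Year": [2012, 2012], "Country": ["USA", "FRA"]})
winter_toy = pd.DataFrame({"Year": [2014], "Country": ["NOR"]})

# keys label each input frame; names gives that outer index level a name
stacked = pd.concat([summer_toy, winter_toy], keys=["Summer", "Winter"], names=["Edition"])

# reset_index() moves both index levels into columns; the unnamed inner level
# (the old row numbers) arrives as 'level_1' and is dropped
flat = stacked.reset_index().drop(columns="level_1")

print(flat)
```

Both inputs need the same column names for the rows to line up; a column present in only one frame is filled with missing values in the other.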

In [None]:
# We need to refine our 'olympics' dataframe by adding the 'country' column
olympics = olympics.merge(countries.iloc[:, :2], how = "left", left_on = "Country", right_on = "Code").drop(columns = ["Code"])
olympics

#### 2] Data Cleaning

In [None]:
olympics

##### Assign appropriate Column Headers to Country Codes and full Country Names

In [None]:
olympics.rename(columns = {"Country_x":"Code", "Country_y": "Country"}, inplace = True)

##### For some Country Codes, there is no corresponding __full Country Name__ available (e.g. for "URS") -> __missing values__ in olympics. Identify these Country Codes and search the Web for the full Country Names. __Replace__ missing values!

In [None]:
# Finding the missing data in the new dataframe
olympics.loc[olympics.Country.isnull()]

In [None]:
# List all of the old countries' codes
olympics.loc[olympics.Country.isnull()].Code.value_counts()

In [None]:
# Get the codes of the old countries (the index of the value counts)
old_indices = olympics.loc[olympics.Country.isnull()].Code.value_counts().index
old_indices

In [None]:
# Create a mapper to match the old countries' codes with their corresponding names
# (the names below must be listed in the same order as the codes in `old_indices`)
mapper = pd.Series(index=old_indices, name = "Country", data = ["Soviet Union", "East Germany", "Romania", "West Germany", "Czechoslovakia",
                               "Yugoslavia", "Unified Team", "Unified Team of Germany", "Mixed teams", "Serbia",
                              "Australasia", "Russian Empire", "Montenegro", "Trinidad and Tobago", "Bohemia",
                              "West Indies Federation", "Singapore", "Independent Olympic Participants"])

mapper

In [None]:
# Let's get the indices of all rows with a missing country so we can map them to names
missing_indices = olympics.loc[olympics.Country.isnull()].index
missing_indices

In [None]:
# Now, we need to map the names
olympics.loc[missing_indices, "Code"].map(mapper)

In [None]:
# Filling the missing data with the new names
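# Assign the result back: an inplace fillna on a selected column is unreliable in recent pandas
olympics["Country"] = olympics["Country"].fillna(olympics["Code"].map(mapper))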

In [None]:
olympics.loc[missing_indices]
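
The mapper above depends on the order in which `value_counts()` lists the codes. One order-independent alternative, sketched here on toy data with only two historic codes, is to map from an explicit code-to-name dictionary (the name `historic_names` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Code":    ["USA", "URS", "GDR", "URS"],
    "Country": ["United States", None, None, None],
})

# Explicit pairs, so the result does not depend on any ordering
historic_names = {"URS": "Soviet Union", "GDR": "East Germany"}

# map() looks up each code; fillna() only touches rows where Country is missing
toy["Country"] = toy["Country"].fillna(toy["Code"].map(historic_names))

print(toy)
```

Codes absent from the dictionary map to a missing value, so existing names are never overwritten and unknown codes stay visibly empty.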

##### Remove rows from olympics where the Country code is unknown

In [None]:
# Double-check for any missing data
olympics[olympics.Code.isna()]

In [None]:
# Drop these missing records
olympics.dropna(subset = ["Code"], inplace = True)

In [None]:
# Resetting the index to close the gaps left by the deleted records
olympics.reset_index(drop = True, inplace = True)

In [None]:
olympics.info()

##### Convert the column Medal into an ordered Categorical column ("Bronze" < "Silver" < "Gold")

In [None]:
olympics['Medal'] = olympics['Medal'].astype("category")

In [None]:
olympics.info()

In [None]:
olympics.Medal.sort_values()

In [None]:
# set_categories() no longer accepts inplace in recent pandas, so assign the result back
olympics["Medal"] = olympics["Medal"].cat.set_categories(["Bronze", "Silver", "Gold"], ordered = True)

In [None]:
olympics.Medal.sort_values()
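
Once the categories are ordered, sorting and comparisons follow the medal ranking instead of the alphabet. A small sketch, using `pd.CategoricalDtype` as one way to declare the order up front:

```python
import pandas as pd

# Declare the ranking once: Bronze < Silver < Gold
medal_order = pd.CategoricalDtype(["Bronze", "Silver", "Gold"], ordered=True)
medals = pd.Series(["Gold", "Bronze", "Silver", "Gold"]).astype(medal_order)

sorted_medals = medals.sort_values().tolist()       # follows the declared order
at_least_silver = int((medals >= "Silver").sum())   # comparisons work on ordered categories
best = medals.max()

print(sorted_medals, at_least_silver, best)
```

Without `ordered=True`, the comparison and `max()` calls would raise an error, and sorting would fall back to the order in which the categories happen to be stored.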

#### 3] Data Analysis & Visualization (EDA)

##### Q. What are the Top 10 Countries by total medals?

In [None]:
olympics

In [None]:
olympics.Country.value_counts()

In [None]:
top_10 = olympics.Country.value_counts().nlargest(10)
top_10

In [None]:
top_10.plot(kind = "bar", fontsize = 15, figsize=(12,8))
plt.title("Top 10 Countries by Medals", fontsize = 15)
plt.ylabel("Medals", fontsize = 14)
plt.show()

##### Q. Split the total medals of Top 10 Countries into Summer / Winter. Are there typical Summer/Winter Games Countries?

In [None]:
# Gathering the top10 data
olympics_10 = olympics[olympics.Country.isin(top_10.index)]
olympics_10

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=1.5, palette= "dark")
sns.countplot(data = olympics_10, x = "Country", order = top_10.index)
plt.title("Top 10 Countries by Medals", fontsize = 20)
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=1.5, palette= "dark")
sns.countplot(data = olympics_10, x = "Country", hue = "Edition", order = top_10.index)
plt.title("Top 10 Countries by Medals", fontsize = 20)
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=1.5, palette= "dark")
sns.countplot(data = olympics_10, x = "Edition", hue = "Country", hue_order = top_10.index)
plt.title("Top 10 Countries by Medals", fontsize = 20)
plt.show()
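
The count plots show absolute numbers, which makes it hard to judge whether a country is a "typical" Summer or Winter nation. As a hedged companion on invented data, a row-normalised `pd.crosstab` gives each country's share of medals per edition:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Norway", "Norway", "Norway", "USA", "USA", "USA", "USA"],
    "Edition": ["Winter", "Winter", "Summer", "Summer", "Summer", "Summer", "Winter"],
})

# normalize="index" divides each row by its total, so every row sums to 1
share = pd.crosstab(toy["Country"], toy["Edition"], normalize="index")

# Edition in which each country wins the larger share of its medals
dominant = share.idxmax(axis=1)

print(share.round(2))
print(dominant)
```

On the real data the same call could be applied to `olympics_10` with its `Country` and `Edition` columns.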

##### Q. Split the total medals of Top 10 Countries into Gold, Silver, Bronze

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=1.5, palette= "dark")
sns.countplot(data = olympics_10, x = "Country", hue = "Medal", order = top_10.index,
              hue_order = ["Gold", "Silver", "Bronze"], palette = ["gold", "silver", "brown"])
plt.title("Top 10 Countries by Medals", fontsize = 20)
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.set(font_scale=1.5, palette= "dark")
sns.countplot(data = olympics_10, x = "Medal", hue = "Country",
              order = ["Gold", "Silver", "Bronze"], hue_order= top_10.index)
plt.title("Top 10 Countries by Medals", fontsize = 20)
plt.show()

===========

# THANK YOU!