# Olympics Dataset Analysis

**INTRODUCTION:**

The modern Olympic Games or Olympics are leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 nations participating. The Olympic Games are normally held every four years, alternating between the Summer and Winter Olympics every two years in the four-year period.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

sns.set_theme()

import os

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
athletes = pd.read_csv('/kaggle/input/olympics-athletes-events-dataset-of-120-years/athlete_events.csv')
regions = pd.read_csv('/kaggle/input/noc-region/noc_regions (1).csv')


In [None]:
athletes.head()


In [None]:
regions.head()

In [None]:
athletes.shape

In [None]:
# Merge the two CSV files on the NOC code
athletes_df = athletes.merge(regions, how = 'left', on = 'NOC')
athletes_df.head()
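
A left merge keeps every athlete row, even when its NOC code has no match in the regions file. As a minimal sketch on made-up rows (the names and the `XXX` code are illustrative, not taken from the dataset), the `indicator` parameter of `merge` can be used to list the codes that found no region:

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are illustrative only)
athletes_demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NOC": ["IND", "USA", "XXX"],  # "XXX" is a made-up code with no match
})
regions_demo = pd.DataFrame({
    "NOC": ["IND", "USA"],
    "region": ["India", "USA"],
})

# indicator=True adds a "_merge" column describing where each row came from
merged_demo = athletes_demo.merge(regions_demo, how="left", on="NOC", indicator=True)

# Rows flagged "left_only" had no matching NOC in the regions table
unmatched_nocs = (
    merged_demo.loc[merged_demo["_merge"] == "left_only", "NOC"].unique().tolist()
)
print(unmatched_nocs)
```

Running the same check on the real frames would show which NOC codes end up with a missing `region`.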

In [None]:
athletes_df.rename(columns={'region':'Region', 'notes':'Notes'}, inplace = True)
athletes_df.head()

In [None]:
athletes.info()

In [None]:
athletes.describe()

As you can see, the age column varies widely: athletes aged 10 to 97 have participated in the Olympics.

In [None]:
# Checking for the null values in the data

nan_values = athletes_df.isna()
nan_columns = nan_values.any()
nan_columns

Six columns contain NaN values: *Age, Height, Weight, Medal, Region* and *Notes*.

In [None]:
athletes_df.isnull().sum()
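
Raw counts are easier to compare when expressed as a share of all rows. The sketch below uses a small made-up frame (not the real data) to show one way of getting the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative only)
demo = pd.DataFrame({
    "Age": [24.0, np.nan, 30.0, 21.0],
    "Medal": [np.nan, np.nan, "Gold", np.nan],
    "Sex": ["M", "F", "F", "M"],
})

# isna().mean() is the fraction of missing entries in each column
missing_pct = (demo.isna().mean() * 100).round(1)

# Keep only columns that actually have missing values, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied to `athletes_df`, the same two lines would rank the six columns by how incomplete they are.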

In [None]:
null_list = ['Age','Height','Weight','Medal','Region','Notes']
print(null_list)

In [None]:
# Indian Athlete details

athletes_df.query('Team=="India"').head(5)

In [None]:
# Top 10 teams by number of entries (value_counts already sorts in descending order)

top_10_countries = athletes_df.Team.value_counts().head(10)
top_10_countries

In [None]:
# Top 10 countries plotting

plt.figure(figsize=(10,5))
#plt.xticks(rotation=20)
plt.title('Overall participation by countries')
sns.barplot(x=top_10_countries.index, y=top_10_countries, palette = 'Set2')


In [None]:
plt.figure(figsize=(15,6))
plt.title("Age distribution of the athletes")
plt.xlabel('Age')
plt.ylabel('Number of participants')
plt.hist(athletes_df.Age.dropna(), bins = np.arange(10,80,2), color='magenta', edgecolor='white');


In [None]:
# Winter Olympics Sports

winter_sports = athletes_df[athletes_df.Season == 'Winter'].Sport.unique()
winter_sports

In [None]:
# Summer Olympics Sports

summer_sports = athletes_df[athletes_df.Season == 'Summer'].Sport.unique()
summer_sports

In [None]:
# Male & Female participants

gender_counts = athletes_df.Sex.value_counts()
gender_counts

The dataset contains 196,594 male and 74,522 female entries since the first modern Olympics, held in 1896. These are rows in the data, so an athlete who competed in several events is counted more than once.
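
Because `value_counts` works on rows, it counts entries rather than people. As a rough sketch on made-up rows, assuming the `ID` column identifies one athlete across all of their rows, the two counts can be compared like this:

```python
import pandas as pd

# Toy frame: athlete 1 has two entries, athlete 3 has three (illustrative only)
demo = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 3],
    "Sex": ["M", "M", "F", "F", "F", "F"],
})

# One count per row (what Sex.value_counts() reports)
entry_counts = demo["Sex"].value_counts()

# One count per athlete: keep the first row for each ID before counting
athlete_counts = demo.drop_duplicates("ID")["Sex"].value_counts()

print(entry_counts)
print(athlete_counts)
```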


In [None]:
plt.figure(figsize=(14,8))
plt.title('Gender Distribution in Olympics')
plt.pie(gender_counts, labels=gender_counts.index,  autopct='%1.2f%%', startangle = 190);


In [None]:
# Total Medals
total_medals_df= athletes_df.Medal.value_counts()
total_medals_df

In [None]:
# Number of female entries at each Summer Olympics

female_participants = athletes_df[(athletes_df.Sex=='F') & (athletes_df.Season=='Summer')][['Sex', 'Year']]
female_participants = female_participants.groupby('Year').count().reset_index()
female_participants.tail()

Women's participation has increased over the years.

In [None]:
womenOlympics = athletes_df[(athletes_df.Sex=='F') & (athletes_df.Season=='Summer')]

In [None]:
# Plotting of Women's Participation 

sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data = womenOlympics, palette="Spectral")
plt.title('Women Participation in the Olympics')

As you can see, the highest number of women participated in 2016, at the Rio Olympics.

Next, a line graph gives a clearer picture of how female athletes' participation has increased over the years.

In [None]:
part = womenOlympics.groupby('Year')['Sex'].value_counts()
plt.figure(figsize=(20, 10))
part.loc[:,'F'].plot()
plt.title('Plot of Female Athletes over the years')

**Since the beginning of the Olympics, women's participation has increased almost continuously, apart from small dips around 1960 and 1980. From 1980 onward, women's participation grew steadily until 2016.**
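
Absolute counts also rise when the Games as a whole grow, so the share of female entries per year can be a useful complement. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Toy frame: 1 of 4 entries is female in 1900, 2 of 4 in 2016 (illustrative only)
demo = pd.DataFrame({
    "Year": [1900, 1900, 1900, 1900, 2016, 2016, 2016, 2016],
    "Sex": ["M", "M", "M", "F", "M", "F", "F", "M"],
})

# The mean of a boolean column is the fraction of True values in each group
female_share = demo["Sex"].eq("F").groupby(demo["Year"]).mean().mul(100)
print(female_share)
```

On the real data, the same expression applied to the Summer rows of `athletes_df` would give a percentage series that can be plotted with `.plot()`.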

In [None]:
# Gold medal winning athletes

goldMedals = athletes_df[(athletes_df.Medal=='Gold')]
goldMedals.head()

In [None]:
goldMedals = goldMedals[np.isfinite(goldMedals['Age'])]

In [None]:
# Gold won beyond the age 55

goldMedals['ID'][goldMedals['Age']>55].count()

**27 gold medals** have been won by athletes older than 55.

In [None]:
sporting_event = goldMedals['Sport'][goldMedals['Age']>55]
sporting_event.head()

The output above lists sports in which athletes older than 55 have won gold medals.

In [None]:
# Plot for sporting event beyond age of 55

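sns.set(style="darkgrid")
plt.figure(figsize=(12,6))
# Pass the series as x explicitly; recent seaborn versions treat the first positional argument as data
sns.countplot(x=sporting_event, palette="Set1")
plt.title('Gold Medals for Athletes over 55 years')
plt.tight_layout()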


As you can see, the sport with the most gold medals won by athletes over 55 is **Shooting, with 6 gold medals.**
In total, 8 sports have gold medalists over the age of 55 in the Olympics.

However, *Alpinism, Art Competitions, Roque* and *Croquet* were all dropped before the mid-1950s, so they are no longer Olympic events.

In [None]:
# Number of Gold Medals from each country

goldMedals.Region.value_counts().reset_index(name='Medal')

The **US** clearly ranks first, with the highest gold medal count and a wide lead over every other country.
Many countries have only one gold medal.
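
Note that counting rows gives every member of a winning team their own gold medal. As a rough sketch on made-up rows, assuming the dataset's `Event` column identifies a single competition, one way to count each event once per country is to drop duplicate (Region, Year, Season, Event) combinations first:

```python
import pandas as pd

# Toy frame: two basketball players share one team gold (illustrative only)
demo = pd.DataFrame({
    "Region": ["USA", "USA", "USA", "India", "India"],
    "Year": [2016, 2016, 2016, 1980, 1980],
    "Season": ["Summer"] * 5,
    "Event": [
        "Basketball Men's Basketball",
        "Basketball Men's Basketball",
        "Swimming Men's 100 metres Freestyle",
        "Hockey Men's Hockey",
        "Hockey Men's Hockey",
    ],
    "Medal": ["Gold"] * 5,
})

gold = demo[demo["Medal"] == "Gold"]

# Row-based count: every team member contributes one medal
row_counts = gold["Region"].value_counts()

# Event-based count: one medal per country per event
event_counts = (
    gold.drop_duplicates(["Region", "Year", "Season", "Event"])["Region"]
    .value_counts()
)

print(row_counts)
print(event_counts)
```

This is only an approximation: in the rare case where two athletes from the same country tie for gold in one event, the event-based count would merge them.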

In [None]:
# Plotting Gold medals per country

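# Name the index explicitly so the column is called 'Region' regardless of the pandas version
totalGoldmedals = goldMedals.Region.value_counts().rename_axis('Region').reset_index(name='Medal').head(6)
g = sns.catplot(x="Region", y="Medal", data = totalGoldmedals, height=5, kind="bar", palette="magma")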
g.despine(left=True)
g.set_xlabels("Top 6 countries")
g.set_ylabels("Number of Medals")
plt.title('Gold Medals per Country')

In [None]:
# Rio Olympics
max_year = athletes_df.Year.max()
print("Rio Olympics - 2016")
team_names = athletes_df[(athletes_df.Year== max_year) & (athletes_df.Medal=='Gold')].Team

team_names.value_counts().head(10)

In [None]:
# Compute the top 20 teams once and reuse the result for both axes
top_20_gold = team_names.value_counts().head(20)
sns.barplot(x=top_20_gold.values, y=top_20_gold.index)

plt.ylabel(None);
plt.xlabel('Country-wise Gold Medals for Rio Olympics 2016');

In [None]:
# Keep medalists with known height and weight

not_null_medal = athletes_df[(athletes_df['Height'].notnull()) & (athletes_df['Weight'].notnull()) & (athletes_df['Medal'].notnull())]

In [None]:
# Plotting scatterplot "Height vs Weight of the Olympic Medalists"

plt.figure(figsize=(12, 10))
axis = sns.scatterplot(x="Height", y="Weight", data = not_null_medal, hue="Sex")
plt.title('Height vs Weight of the Olympic Medalists')
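
The scatterplot can be complemented with a single number. As a minimal sketch on made-up values (not the real data), the Pearson correlation between height and weight can be computed after dropping incomplete rows:

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (illustrative only)
demo = pd.DataFrame({
    "Height": [160.0, 170.0, np.nan, 180.0, 190.0],
    "Weight": [55.0, 68.0, 70.0, np.nan, 85.0],
})

# Keep only rows where both measurements are present
complete = demo.dropna(subset=["Height", "Weight"])

# Series.corr defaults to the Pearson correlation coefficient
r = complete["Height"].corr(complete["Weight"])
print(len(complete), round(r, 3))
```

On the real data, computing this separately for each value of `Sex` (for example with `groupby`) would match the two colour groups in the plot.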