
Dataset description: A comprehensive dataset of the Olympic Games, spanning from the 1896 Athens Olympics to the 2016 Rio Olympics. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete events).

1) Import the Python libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2) Load the Olympic athletes dataset:

In [None]:
df1 = pd.read_csv("/df1.csv", index_col=1).drop(columns=["Unnamed: 0"])
df1

3) Load the NOC (National Olympic Committee) regions dataset:

In [None]:
df2 = pd.read_csv("/noc_regions.csv")
df2

4) Examine both datasets in detail and review their summary statistics:


In [None]:
df1.info()

In [None]:
df1.describe()

In [None]:
df2.info()

In [None]:
df2.describe()

5) Both datasets have a NOC column, so let's merge them on it. Note that pd.merge defaults to an inner join, so only NOC codes present in both datasets are kept:

In [None]:
df = pd.merge(df1, df2, on='NOC')
df
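To check how many rows an inner join would drop, one option is a left join with an indicator column. The sketch below uses small made-up frames (`athletes`, `regions`, and their values are hypothetical stand-ins for df1 and df2), so it only illustrates the idea:

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (hypothetical rows, not taken from the real files)
athletes = pd.DataFrame({"Name": ["A", "B", "C"], "NOC": ["USA", "TUR", "XXX"]})
regions = pd.DataFrame({"NOC": ["USA", "TUR"], "region": ["USA", "Turkey"]})

# Default inner join: only NOC codes found in both frames survive
inner = pd.merge(athletes, regions, on="NOC")

# Left join with indicator=True adds a "_merge" column that marks unmatched rows
left = pd.merge(athletes, regions, on="NOC", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left), unmatched["NOC"].tolist())
```

If `unmatched` is empty on the real data, the inner and left joins give the same rows.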

6) This project focuses only on the Summer Olympics, so let's filter out all Winter Olympics rows and then do some basic analysis:

In [None]:
# .copy() gives an independent DataFrame, so later in-place edits do not act on a slice of the merged data
df = df[df["Season"]=="Summer"].copy()
df

7) After filtering, count the NULL (NaN) values in each column:

In [None]:
df.isnull().sum()

8) There are 3 types of medals, so let's look at the share of each type among the medals won (rows without a medal are NaN and are excluded by value_counts):

In [None]:
df.Medal.value_counts(normalize=True)

9) See what percentage of each column is made up of missing observations:

In [None]:
missing_percentage = 100 * (df.isna().sum().sort_values(ascending=False) / len(df))
missing_percentage[missing_percentage!=0]
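The same percentages can be obtained by taking the mean of the boolean missing-value mask, since the mean of True/False values is the fraction of True. A minimal sketch on a made-up frame (`toy` and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" is half missing, column "b" is complete
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})

# Fraction of missing values per column, expressed as a percentage
pct_via_sum = 100 * toy.isna().sum() / len(toy)
pct_via_mean = 100 * toy.isna().mean()

print(pct_via_mean[pct_via_mean != 0])
```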

10) Remove the notes column from the dataset:

In [None]:
df.drop(columns=["notes"], inplace=True)

11) Examine the values in the Weight column in detail (53764 entries are missing):

In [None]:
df['Weight'].value_counts()

12) The Weight column marks missing entries with '?'. Replace them with NaN:

In [None]:
df['Weight'] = df['Weight'].replace(['?'], np.nan)

13) Check the missing values again. The first check reported no missing values in the Weight column because '?' counted as a regular value; they now show up as NaN:

In [None]:
df.isnull().sum()

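14) Fill all missing values in the Age column with the column mean:

In [None]:
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())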

15) Fill all missing values in the Height column with the column mean:

In [None]:
# Assign the result back instead of using inplace=True on a selected column
df['Height'] = df['Height'].fillna(df['Height'].mean())

16) Change the data type of the Weight column from string to float:

In [None]:
df['Weight'] = df['Weight'].astype(float)
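The replace-then-cast approach of steps 12 and 16 can also be done in a single call with `pd.to_numeric(..., errors="coerce")`, which turns anything non-numeric into NaN. This is an alternative, not what the notebook does; the sketch uses a made-up Series (`weights` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical weights stored as strings, with "?" as the missing-value marker
weights = pd.Series(["80", "?", "64.5", np.nan])

# Notebook approach: swap the placeholder for NaN, then cast to float
replaced = weights.replace("?", np.nan).astype(float)

# Alternative: coerce every non-numeric entry to NaN in one step
coerced = pd.to_numeric(weights, errors="coerce")

print(coerced.isna().sum())
```

One difference to keep in mind: `errors="coerce"` silently converts any unparseable string, not just '?', so it can hide unexpected values that `astype(float)` would raise on.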

17) Now fill the missing values in the Weight column with the column mean:

In [None]:
# Assign the result back instead of using inplace=True on a selected column
df['Weight'] = df['Weight'].fillna(df['Weight'].mean())
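Mean imputation keeps the column mean unchanged but narrows its spread, because every filled value sits exactly at the center. A small sketch on made-up numbers (`heights` is hypothetical) shows the effect, which is worth remembering when reading later statistics on Age, Height and Weight:

```python
import numpy as np
import pandas as pd

# Hypothetical heights with two missing entries
heights = pd.Series([160.0, 170.0, np.nan, 180.0, np.nan])

# Same technique as in the notebook: fill NaN with the column mean
filled = heights.fillna(heights.mean())

# The mean is preserved, the standard deviation gets smaller
print(heights.mean(), filled.mean())
print(heights.std(), filled.std())
```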

18) Let's look at the distinct values in the region column:

In [None]:
df["region"].unique()

19) Only 21 of the 222203 rows have a missing region, so let's remove those 21 rows from the dataset. Dropping so few rows has a negligible effect on the analysis.

In [None]:
# Drop whole rows where region is NaN (dropna on the column alone would leave df unchanged)
df.dropna(subset=["region"], inplace=True)
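Calling `dropna` on a single column and calling it on the frame behave differently: the first only returns a shorter Series, while the second removes entire rows. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing region
toy = pd.DataFrame({"Name": ["A", "B", "C"], "region": ["USA", np.nan, "Turkey"]})

# Dropping on the column yields a shorter Series; the frame still has all rows
region_only = toy["region"].dropna()

# Dropping on the frame with subset= removes the row where region is NaN
cleaned = toy.dropna(subset=["region"])

print(len(toy), len(region_only), len(cleaned))
```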

20) There are 3 medal types, and the Medal column contains many NaN values for athletes who did not win one. Let's fill these NaN values with 'Medal Not Won':

In [None]:
# Assign the result back instead of using inplace=True on a selected column
df['Medal'] = df['Medal'].fillna("Medal Not Won")

21) After merging and partially cleaning the datasets, let's check whether any rows are exact duplicates of one another:

In [None]:
df[df.duplicated()]

22) How many duplicate rows are there in total?

In [None]:
df.duplicated().sum()

23) Remove the duplicate rows from the dataset:

In [None]:
df.drop_duplicates(inplace=True)

24) Check whether any duplicate rows remain in the dataset:

In [None]:
df.duplicated().sum()
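By default `duplicated()` flags only the second and later copies of a row, which is why its sum equals the number of rows that `drop_duplicates()` removes. To inspect every copy side by side, `keep=False` can be used. A sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame where the first two rows are identical
toy = pd.DataFrame({"Name": ["A", "A", "B"], "Event": ["100m", "100m", "100m"]})

# Default keep="first": only the later copy is flagged
later_copies = toy.duplicated()

# keep=False: every row that has an identical twin is flagged
all_copies = toy.duplicated(keep=False)

# drop_duplicates keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(later_copies.sum(), all_copies.sum(), len(deduped))
```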

25) Which country has sent the most athletes to the Summer Olympics, and how many did each country send in total over these years? Each row is one athlete in one event, so the counts below are entries rather than unique athletes.

In [None]:
athlete_count = df.Team.value_counts()
athlete_count
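Because an athlete who competes in several events appears in several rows, `value_counts()` on Team counts entries. If unique athletes are wanted instead, one option is `nunique()` per team. The sketch below uses a made-up frame and assumes that Name identifies an athlete, which may not hold on the real data when two athletes share a name:

```python
import pandas as pd

# Hypothetical rows: athlete "A" competes in two events for the same team
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "C"],
    "Team": ["United States", "United States", "United States", "France"],
})

# Entries per team (what value_counts on Team measures)
entries = toy["Team"].value_counts()

# Unique athletes per team (assumption: Name uniquely identifies an athlete)
unique_athletes = toy.groupby("Team")["Name"].nunique().sort_values(ascending=False)

print(entries["United States"], unique_athletes["United States"])
```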

26) Visualize the top 10 countries that sent the most athletes to the Summer Olympics with a bar chart:

In [None]:
plt.figure(figsize=[18,8])
sns.barplot(x=athlete_count[:10].index, y=athlete_count[:10])
plt.title("Countries That Sent the Most Athletes to the Olympics")
plt.xlabel("Countries")
plt.ylabel("Athlete Count");

27) Let's exclude the 'Medal Not Won' rows and see only the gold, silver and bronze medal winners (i.e. the list of medal winners):

In [None]:
df[df.Medal != 'Medal Not Won']

28) Let's keep only the medal winners and group by country to see which countries have won the most medals:

In [None]:
df_filtered = df[df.Medal != 'Medal Not Won']
medals_by_country = df_filtered.groupby('Team')['Medal'].count().sort_values(ascending=False)
medals_by_country

29) Let's visualize the top 10 countries with the most medals with a bar chart:

In [None]:
plt.figure(figsize=[18,8])
sns.barplot(x=medals_by_country[:10].index, y=medals_by_country[:10], palette="YlOrBr_r")
plt.title("Countries That Won the Most Medals in the Olympics");

30) How successful are countries at winning medals relative to the number of athletes they send? To keep the ratio meaningful, only countries with more than 1000 entries are considered:

In [None]:
filtered_athlete = athlete_count[athlete_count > 1000]
country_success = (medals_by_country / filtered_athlete).sort_values(ascending=False).dropna()
country_success
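Dividing two Series aligns them on their index labels, so any country missing from either side yields NaN, and `dropna()` then removes it. That is how the threshold filter carries through to the result. A sketch with made-up numbers (`medals` and `entries` are hypothetical):

```python
import pandas as pd

# Hypothetical medal counts for three countries
medals = pd.Series({"A": 50, "B": 10, "C": 5})

# Hypothetical entry counts; "C" is absent, as if removed by the > 1000 filter
entries = pd.Series({"A": 200, "B": 100})

# Division aligns on the index; "C" has no denominator and becomes NaN
ratio = (medals / entries).dropna().sort_values(ascending=False)

print(ratio)
```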

31) Now we can visualize this:

In [None]:
plt.figure(figsize=[15,10])
sns.barplot(x=country_success.values*100, y=country_success.index, palette='coolwarm_r')
plt.title("Medal-Winning Percentage by Country in the Olympics")
plt.xlabel("Percentage (%)");

32) How has the number of participants changed over the years?

In [None]:
athlete_by_year = df.groupby('Year')['Name'].count()
athlete_by_year


33) We can visualize the change in the number of participants over the years with a line chart:

In [None]:
plt.figure(figsize=[18,8])
plt.xticks(np.linspace(1896,2016,13))
plt.grid()
sns.lineplot(x=athlete_by_year.index, y=athlete_by_year.values)
plt.title("Change in the Number of Athletes Over the Years")
plt.ylabel("Athlete Count");

34) We can see the difference between male and female participants by sport:

In [None]:
plt.figure(figsize=(12, 5));
highest_sport = df['Sport'].value_counts().index
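# hue_order fixes the bar order so it matches the legend labels below (assumes Sex is coded 'M'/'F')
sns.countplot(data=df, x='Sport', hue='Sex', hue_order=['M', 'F'], order=highest_sport, palette=sns.color_palette("seismic", 2))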
plt.xticks(rotation=90)
plt.title('Sports with Gender distribution')
plt.xlabel('Sport', fontweight='bold')
plt.ylabel('No. of Athletes', fontweight='bold');
plt.legend(['Male','Female'],loc=1, shadow=True);

35) We can look at the participation of male and female athletes over time:

In [None]:
year_wise_participants = df.groupby('Year')['Sex'].value_counts()
year_wise_participants.head(10)
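The grouped counts come back as a Series with a (Year, Sex) MultiIndex. Unstacking the inner level gives one column per sex, which makes it easy to compute, for example, the share of female entries per year. A sketch on a made-up frame (`toy` is hypothetical, and the 'M'/'F' coding is an assumption):

```python
import pandas as pd

# Hypothetical rows: three entries in 1900 and two in 2016
toy = pd.DataFrame({
    "Year": [1900, 1900, 1900, 2016, 2016],
    "Sex": ["M", "M", "F", "M", "F"],
})

# Same pattern as in the notebook: counts per (Year, Sex)
counts = toy.groupby("Year")["Sex"].value_counts()

# Move Sex from the index into columns; missing combinations become 0
wide = counts.unstack(fill_value=0)

# Share of female entries per year
female_share = wide["F"] / wide.sum(axis=1)

print(wide)
print(female_share)
```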

36) Let's see the participation levels of male and female participants over the years with a bar chart:

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x = df['Year'], hue=df['Sex']);

37) How many medals of each type were won in each sport?

In [None]:
pd.DataFrame(df_filtered.groupby(['Sport','Medal']).Name.count().reset_index())

In [None]:
medals_in_sports = df_filtered.groupby(['Sport','Medal']).Name.count().reset_index()
medals_in_sports = medals_in_sports.pivot_table(index='Sport', columns='Medal', values='Name')
medals_in_sports = medals_in_sports.replace(np.nan, 0)
medals_in_sports
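The group, pivot and fill sequence above builds a Sport by Medal count table. `pd.crosstab` produces the same kind of table in one call and fills absent combinations with 0. This is an alternative sketch on made-up rows (`toy` is hypothetical), not the notebook's method:

```python
import pandas as pd

# Hypothetical medal rows
toy = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Rowing"],
    "Medal": ["Gold", "Bronze", "Gold"],
    "Name": ["A", "B", "C"],
})

# Count rows for every Sport/Medal combination; missing pairs are 0
table = pd.crosstab(toy["Sport"], toy["Medal"])

print(table)
```

Note that crosstab returns integer counts, whereas the pivot_table route followed by replacing NaN with 0 leaves float columns.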

38) We can look at how the United States' medal count has changed over the years. First, select the United States rows:

In [None]:
usa = df[df['Team']=='United States']
usa

39) Let's keep only the medal winners among the United States rows and filter out the non-medal winners:

In [None]:
usa_medals_count = usa[usa['Medal'] !='Medal Not Won']
usa_medals_count

40) How many medal-winning entries did the United States have in each year?

In [None]:
usa_medal = usa_medals_count.groupby('Year')['Medal'].count()
usa_medal

41) Let's visualize with a line chart how many medal-winning entries the United States had in each year:

In [None]:
plt.figure(figsize=[24,8])
plt.xticks(np.linspace(1896,2016,25))
plt.grid()
sns.lineplot(x=usa_medal.index, y=usa_medal.values);