In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [2]:
olympic_df = pd.read_csv('dataset_olympic_data/dataset_olympics.csv')

# FUNCTIONS

In [3]:
def print_unique_values(df):
    for column in df.columns:
        unique_values = df[column].unique()
        print(f"Unique values in column '{column}': {unique_values}")

# DATA EXPLORATION AND DATA WRANGLING

In [4]:
olympic_df.columns

Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
      dtype='object')

**COLUMNS**
1. **ID:** Identifier for each athlete.
2. **NAME:** The full name of the athlete.
3. **SEX:** The gender of the athlete, represented as 'M' for male and 'F' for female.
4. **AGE:** The age of the athlete at the time of the Olympics.
5. **HEIGHT:** The height of the athlete in centimeters.
6. **WEIGHT:** The weight of the athlete in kilograms.
7. **TEAM:** The country the athlete represents.
8. **NOC:** The National Olympic Committee (NOC) code for the country the athlete represents.
9. **GAMES:** The edition of the Olympics the athlete participated in, including the year and the season (Summer or Winter).
10. **YEAR:** The year of the Olympics.
11. **SEASON:** The season of the Olympics, either Summer or Winter.
12. **CITY:** The host city of the Olympics.
13. **SPORT:** The sport the athlete competed in.
14. **EVENT:** The specific event within the sport that the athlete competed in.
15. **MEDAL:** The type of medal won by the athlete, if any (Gold, Silver, Bronze, or NaN if no medal was won).


## IDEAS

TO DO for dataset exploration:
1. Compare male and female athletes' participation in the Games each year.
2. Correlation between age, sex, height and weight.
3. Distribution of medals per age and per sex.
4. Check whether medals are distributed correctly, i.e. whether any year has more medals than the maximum it can have. For this, study the Olympics description on Kaggle.
5. Check whether it makes sense to associate an ID with each country in the *noc_region.csv* dataset and use it to analyse each country's success.
6. Should I convert float data to int? Not all of it, only columns like medals, etc.
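As a starting point for idea 1, here is a minimal sketch of how participation per year and sex could be counted. It runs on a tiny made-up sample (the `sample` frame and its values are hypothetical, only the column names match `olympic_df`), and it assumes that each athlete should be counted once per Games even when they entered several events.

```python
import pandas as pd

# Tiny hypothetical sample with the same column names as olympic_df.
sample = pd.DataFrame({
    "ID":   [1, 1, 2, 3, 4, 5],
    "Sex":  ["M", "M", "F", "F", "M", "F"],
    "Year": [1992, 1992, 1992, 2012, 2012, 2012],
})

# An athlete has one row per event, so count distinct IDs rather than rows.
participation = (
    sample.groupby(["Year", "Sex"])["ID"]
    .nunique()
    .unstack(fill_value=0)  # one column per sex, one row per year
)
print(participation)
```

On the real data the same pattern would be applied to `olympic_df` before `Sex` is encoded, or with the 0/1 codes afterwards.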

## EXPLORING THE DATASET

In [5]:
olympic_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [6]:
olympic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      70000 non-null  int64  
 1   Name    70000 non-null  object 
 2   Sex     70000 non-null  object 
 3   Age     67268 non-null  float64
 4   Height  53746 non-null  float64
 5   Weight  52899 non-null  float64
 6   Team    70000 non-null  object 
 7   NOC     70000 non-null  object 
 8   Games   70000 non-null  object 
 9   Year    70000 non-null  int64  
 10  Season  70000 non-null  object 
 11  City    70000 non-null  object 
 12  Sport   70000 non-null  object 
 13  Event   70000 non-null  object 
 14  Medal   9690 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 8.0+ MB


In [7]:
olympic_df.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,70000.0,67268.0,53746.0,52899.0,70000.0
mean,18081.846986,25.644645,175.505303,70.900216,1977.766457
std,10235.613253,6.485239,10.384203,14.217489,30.103306
min,1.0,11.0,127.0,25.0,1896.0
25%,9325.75,21.0,168.0,61.0,1960.0
50%,18032.0,25.0,175.0,70.0,1984.0
75%,26978.0,28.0,183.0,79.0,2002.0
max,35658.0,88.0,223.0,214.0,2016.0


In [8]:
olympic_df.isna().sum()

ID            0
Name          0
Sex           0
Age        2732
Height    16254
Weight    17101
Team          0
NOC           0
Games         0
Year          0
Season        0
City          0
Sport         0
Event         0
Medal     60310
dtype: int64

In [9]:
olympic_df['Medal'].unique()

array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)

In [10]:
# Medal labels in display order (NaN, i.e. no medal, is excluded by value_counts)
medal_order = ['Gold', 'Silver', 'Bronze']
olympic_df['Medal'].value_counts()[medal_order]


Gold      3292
Silver    3190
Bronze    3208
Name: Medal, dtype: int64

## Data Cleaning

COLUMNS THAT CAN BE TRANSFORMED:
- SEX: M/F -> 0/1
- MEDAL: NaN, Bronze, Silver, Gold -> 0, 1, 2, 3

I can use two methods to encode the labels:
1. LabelEncoder from sklearn library
2. Doing it by hand

As I have only a few labels to encode, I prefer the second method. LabelEncoder assigns integers starting from 0 in the sorted (alphabetical) order of the labels, and such a generic assignment is not suitable here because the medal types have different priority/importance. LabelEncoder could be used for the Sex column, but for code clarity I will use one method for all the features I want to encode.
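A possible third option, not used in this notebook, is an ordered pandas `Categorical`, which keeps the ranking explicit without a hand-written dictionary. The sketch below is only an illustration on a four-value sample; `medals`, `medal_cat` and `medal_rank` are hypothetical names.

```python
import numpy as np
import pandas as pd

# The four distinct values found in the Medal column.
medals = pd.Series([np.nan, "Gold", "Bronze", "Silver"])

# Ordered categories: the position in the list defines the rank.
medal_cat = pd.Categorical(
    medals, categories=["Bronze", "Silver", "Gold"], ordered=True
)

# .codes is -1 for NaN and 0..2 for the categories, so adding 1 gives
# 0 = no medal, 1 = Bronze, 2 = Silver, 3 = Gold.
medal_rank = pd.Series(medal_cat.codes + 1, index=medals.index)
print(medal_rank.tolist())
```

This produces the same 0 to 3 scale as the manual mapping and handles the NaN case in the same step.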

In [11]:
# col = olympic_df.columns
olympic_df['Medal'].unique()
# print_unique_values(olympic_df[col])

array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)

In [12]:
# label_encoder = preprocessing.LabelEncoder() 
# olympic_df['Sex']= label_encoder.fit_transform(olympic_df['Sex']) 

In [13]:
replace_sex={
    'M':0,
    "F":1
}
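# Assign the result back: calling replace(..., inplace=True) on a column
# selection may not update olympic_df in recent pandas versions.
olympic_df['Sex'] = olympic_df['Sex'].map(replace_sex)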

# I WILL DO THIS AFTER I STUDIED NaN VALUES 
replace_medal={
    "Bronze":1,
    "Silver":2,
    "Gold":3
}
# map() leaves NaN (no medal) as NaN; assign back instead of using inplace=True.
olympic_df['Medal'] = olympic_df['Medal'].map(replace_medal)

### Missing values

The NaN entries in the Medal column are not really missing values: they indicate that the athlete did not win a medal. I therefore fill them with 0 using fillna().

In [14]:
olympic_df['Medal'] = olympic_df['Medal'].fillna(value=0)

In [15]:
olympic_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,0,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,0.0
1,2,A Lamusi,0,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,0.0
2,3,Gunnar Nielsen Aaby,0,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,0.0
3,4,Edgar Lindenau Aabye,0,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,3.0
4,5,Christine Jacoba Aaftink,1,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,0.0


In [16]:
olympic_df.isna().sum()

ID            0
Name          0
Sex           0
Age        2732
Height    16254
Weight    17101
Team          0
NOC           0
Games         0
Year          0
Season        0
City          0
Sport         0
Event         0
Medal         0
dtype: int64

In [17]:
olympic_df["Medal"].unique()

array([0., 3., 1., 2.])
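The medal codes are floats because the column held NaN before `fillna()`. Regarding idea 6, a minimal sketch of the conversion to integers follows. It uses a small hypothetical `sample` frame and assumes that ages are whole numbers (a cast to `Int64` raises an error for fractional values). The nullable `Int64` dtype is one way to keep the missing ages while storing the rest as integers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Medal already filled with 0, Age still has a gap.
sample = pd.DataFrame({
    "Age":   [24.0, np.nan, 21.0],
    "Medal": [0.0, 3.0, 1.0],
})

# No NaN left in Medal, so a plain integer cast is safe.
sample["Medal"] = sample["Medal"].astype(int)

# Age still contains NaN; the nullable Int64 dtype stores it as <NA>.
sample["Age"] = sample["Age"].astype("Int64")

print(sample.dtypes)
```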