# About the dataset
**Name:** This column contains the full name of the athlete participating in the Olympic Games.<br>
**Sex:** This column indicates the gender of the athlete. It has two unique values: "M" for male and "F" for female.<br>
**Age:** This column represents the age of the athlete at the time of the competition.<br>
**Team:** The name of the team or delegation that the athlete represents at the Olympics.<br>
**NOC:** The three-letter country code that identifies the athlete's National Olympic Committee (NOC).<br> 
**Year:** Represents the year in which the Olympic Games took place.<br> 
**Season:** Indicates whether the Olympic Games occurred in the "Summer" or "Winter" season. This distinction is important because different sports are played in each season.<br>
**City:** The host city where the Olympic event took place. This information can be useful for analyzing the impact of location and climate conditions on athlete performance.<br>
**Sport:** Represents the broad category of the sport in which the athlete competed (e.g., Athletics, Swimming, Gymnastics).<br> 
**Event:** The specific event within a sport in which the athlete participated (e.g., "100m Sprint", "Long Jump").<br> 
**Medal:** Indicates the type of medal won by the athlete. Possible values are "Gold", "Silver" and "Bronze"; the value is missing (NaN) if no medal was won.<br> 
**Country:** This column represents the full country name corresponding to the NOC code.<br> 
**Height:** The height of the athlete in centimetres.<br> 
**Weight:** The weight of the athlete in kilograms.<br> 

# Import the Libraries and Load the Data

In [3]:
#Import the Libraries
import numpy as np
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder,StandardScaler,MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,f1_score
from sklearn.svm import SVC
!pip install xgboost
import xgboost as xgb
from xgboost import XGBClassifier
import time



In [4]:
#Loading the file
df = pd.read_csv('Athletes_summer_games.csv')
df_athlete = pd.read_csv('Olympic_Athlete_Biography.csv')
df_region = pd.read_csv('Olympic_Country_Profiles.csv')

In [5]:
df.sample(1)

Unnamed: 0.1,Unnamed: 0,Name,Sex,Age,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
159196,193355,Janez Pristov,M,29.0,Yugoslavia,YUG,1936 Summer,1936,Summer,Berlin,Gymnastics,Gymnastics Men's Pommelled Horse,


In [6]:
# Rename columns so the names match the main dataset
df_region.rename(columns={'noc':'NOC','country':'Country'},inplace = True)

In [7]:
df_region.sample(1)

Unnamed: 0,NOC,Country
86,HON,Honduras


In [8]:
# Merging Country in main dataset
df = df.merge(df_region,on='NOC',how = 'left')

In [9]:
# dropping unnecessary column
df = df.drop(columns=['Unnamed: 0','Games'])

In [10]:
df.sample(1)

Unnamed: 0,Name,Sex,Age,Team,NOC,Year,Season,City,Sport,Event,Medal,Country
70124,Aksel Hroar Gresvig,M,30.0,Norway,NOR,1972,Summer,Munich,Sailing,Sailing Mixed Two Person Keelboat,,Norway
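
A left merge keeps every athlete row even when its NOC code has no match in the lookup table, which leaves `Country` empty for those rows. The following is a minimal sketch on made-up toy frames (not the real files) of how `indicator=True` could be used to list the unmatched codes; the code `"XXX"` is a hypothetical placeholder.

```python
import pandas as pd

# Toy stand-ins for df and df_region (illustrative values only)
athletes = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NOC": ["NOR", "HON", "XXX"],  # "XXX" is a hypothetical code with no lookup entry
})
regions = pd.DataFrame({"NOC": ["NOR", "HON"], "Country": ["Norway", "Honduras"]})

# indicator=True adds a "_merge" column describing where each row was found
merged = athletes.merge(regions, on="NOC", how="left", indicator=True)

# Codes present in the athlete table but absent from the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "NOC"].unique().tolist()
print(unmatched)
```

Inspecting such a list before continuing makes it easier to decide whether the lookup table needs extra entries or whether the affected rows should be handled separately.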


In [11]:
# Drop unnecessary columns and rename the rest to match the main dataset
df_athlete = df_athlete.drop(columns=['athlete_id','sex','born','country','country_noc','description','special_notes'])
df_athlete.rename(columns={'name':'Name','height':'Height','weight':'Weight'},inplace=True)
df_athlete.sample(1)

Unnamed: 0,Name,Height,Weight
70387,Laia Palau,181.0,69.0


In [12]:
# Merging Height & Weight in main dataset
df = df.merge(df_athlete,on='Name',how='left')

In [13]:
df.sample(3)

Unnamed: 0,Name,Sex,Age,Team,NOC,Year,Season,City,Sport,Event,Medal,Country,Height,Weight
51603,Paul Robert Easter,M,21.0,Great Britain,GBR,1984,Summer,Los Angeles,Swimming,Swimming Men's 200 metres Freestyle,,Great Britain,,
146216,Jennifer Lynn O'Donnell,F,18.0,United States,USA,1992,Summer,Barcelona,Archery,Archery Women's Team,,United States,,
161906,Jasna Ptujec,F,25.0,Yugoslavia,YUG,1984,Summer,Los Angeles,Handball,Handball Women's Handball,Gold,Yugoslavia,176.0,72.0
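
Athlete names are not guaranteed to be unique, so a merge on `Name` alone can multiply rows whenever the biography table holds more than one entry for a name. The sketch below uses invented toy data to show the effect and one possible guard, the `validate="many_to_one"` argument of `merge`, which raises an error instead of silently duplicating rows. Keeping the first biography row per name is only one assumed workaround and may pick the wrong athlete.

```python
import pandas as pd

results = pd.DataFrame({"Name": ["Kim Lee", "Kim Lee", "Ana Ruiz"],
                        "Year": [2012, 2016, 2016]})
# Two different athletes share the name "Kim Lee" in this toy biography table
bio = pd.DataFrame({"Name": ["Kim Lee", "Kim Lee", "Ana Ruiz"],
                    "Height": [170.0, 182.0, 165.0]})

merged = results.merge(bio, on="Name", how="left")
print(len(results), "->", len(merged))  # row count grows from 3 to 5

# validate= checks that the key is unique on the right side before merging
try:
    results.merge(bio, on="Name", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Assumed workaround: keep a single (first) biography row per name
bio_unique = bio.drop_duplicates(subset="Name")
safe = results.merge(bio_unique, on="Name", how="left", validate="many_to_one")
```

Comparing `len(df)` before and after the merge is a quick way to see whether this is happening on the real data.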


# Understanding of the data

In [15]:
#Dimensions of the data
df.shape

(241723, 14)

In [16]:
#Information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241723 entries, 0 to 241722
Data columns (total 14 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Name     241723 non-null  object 
 1   Sex      241723 non-null  object 
 2   Age      232357 non-null  float64
 3   Team     241723 non-null  object 
 4   NOC      241723 non-null  object 
 5   Year     241723 non-null  int64  
 6   Season   241723 non-null  object 
 7   City     241723 non-null  object 
 8   Sport    241723 non-null  object 
 9   Event    241723 non-null  object 
 10  Medal    37196 non-null   object 
 11  Country  241469 non-null  object 
 12  Height   68601 non-null   float64
 13  Weight   68601 non-null   float64
dtypes: float64(3), int64(1), object(10)
memory usage: 25.8+ MB


In [17]:
# Cast Country to string (note: missing values become the literal string 'nan', so they no longer count as null)
df['Country'] = df['Country'].astype(str)
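
Casting to `str` can hide missing values rather than handle them. As a small sketch, assuming a placeholder label such as `'Unknown'` is acceptable (a hypothetical choice, not used in this notebook), the gaps could be filled explicitly before the cast so that they stay identifiable:

```python
import numpy as np
import pandas as pd

country = pd.Series(["Norway", np.nan, "Honduras"])

# Fill the gaps with an explicit label first, then cast
country_clean = country.fillna("Unknown").astype(str)

# The formerly missing rows can still be counted afterwards
n_unknown = int((country_clean == "Unknown").sum())
print(country_clean.tolist(), n_unknown)
```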

In [18]:
#Information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241723 entries, 0 to 241722
Data columns (total 14 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Name     241723 non-null  object 
 1   Sex      241723 non-null  object 
 2   Age      232357 non-null  float64
 3   Team     241723 non-null  object 
 4   NOC      241723 non-null  object 
 5   Year     241723 non-null  int64  
 6   Season   241723 non-null  object 
 7   City     241723 non-null  object 
 8   Sport    241723 non-null  object 
 9   Event    241723 non-null  object 
 10  Medal    37196 non-null   object 
 11  Country  241723 non-null  object 
 12  Height   68601 non-null   float64
 13  Weight   68601 non-null   float64
dtypes: float64(3), int64(1), object(10)
memory usage: 25.8+ MB


In [19]:
# Missing values
df.isnull().sum()

Name            0
Sex             0
Age          9366
Team            0
NOC             0
Year            0
Season          0
City            0
Sport           0
Event           0
Medal      204527
Country         0
Height     173122
Weight     173122
dtype: int64

In [20]:
# Check for duplicate rows
df.duplicated().sum()

1765

In [21]:
# Drop the duplicate rows and confirm that none remain
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

In [22]:
# Maintain consistency in the Country column
Country_Mapping={'ROC':'Russian Olympic Committee'}
df['Country'] = df['Country'].replace(Country_Mapping)

In [23]:
df

Unnamed: 0,Name,Sex,Age,Team,NOC,Year,Season,City,Sport,Event,Medal,Country,Height,Weight
0,A Dijiang,M,24.0,China,CHN,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,People's Republic of China,,
1,A Lamusi,M,23.0,China,CHN,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,People's Republic of China,170.0,60.0
2,Gunnar Nielsen Aaby,M,24.0,Denmark,DEN,1920,Summer,Antwerpen,Football,Football Men's Football,,Denmark,,
3,Edgar Lindenau Aabye,M,34.0,Denmark/Sweden,DEN,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,Denmark,,
4,"Cornelia ""Cor"" Aalten (-Strannood)",F,18.0,Netherlands,NED,1932,Summer,Los Angeles,Athletics,Athletics Women's 100 metres,,Netherlands,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241718,ZYKOVA Yulia,F,25.0,Russia,ROC,2020,Summer,Tokyo,Shooting,50m Rifle 3 Positions Women,Silver,Russian Olympic Committee,,
241719,ZYUZINA Ekaterina,F,24.0,Russia,ROC,2020,Summer,Tokyo,Sailing,Women's One Person Dinghy - Laser Radial,,Russian Olympic Committee,,
241720,ZYUZINA Ekaterina,F,24.0,Russia,ROC,2020,Summer,Tokyo,Sailing,Women's One Person Dinghy - Laser Radial,,Russian Olympic Committee,,
241721,ZYZANSKA Sylwia,F,24.0,Poland,POL,2020,Summer,Tokyo,Archery,Women's Individual,,Poland,,


# Model Building

In [25]:
# Preview the data with duplicates dropped on Name, Event, Team, Country (the result is not assigned to df)
df.drop_duplicates(subset=['Name','Event','Team','Country'])

Unnamed: 0,Name,Sex,Age,Team,NOC,Year,Season,City,Sport,Event,Medal,Country,Height,Weight
0,A Dijiang,M,24.0,China,CHN,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,People's Republic of China,,
1,A Lamusi,M,23.0,China,CHN,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,People's Republic of China,170.0,60.0
2,Gunnar Nielsen Aaby,M,24.0,Denmark,DEN,1920,Summer,Antwerpen,Football,Football Men's Football,,Denmark,,
3,Edgar Lindenau Aabye,M,34.0,Denmark/Sweden,DEN,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,Denmark,,
4,"Cornelia ""Cor"" Aalten (-Strannood)",F,18.0,Netherlands,NED,1932,Summer,Los Angeles,Athletics,Athletics Women's 100 metres,,Netherlands,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241716,ZWOLINSKA Klaudia,F,22.0,Poland,POL,2020,Summer,Tokyo,Canoe Slalom,Women's Kayak,,Poland,,
241717,ZYKOVA Yulia,F,25.0,Russia,ROC,2020,Summer,Tokyo,Shooting,50m Rifle 3 Positions Women,Silver,Russian Olympic Committee,,
241719,ZYUZINA Ekaterina,F,24.0,Russia,ROC,2020,Summer,Tokyo,Sailing,Women's One Person Dinghy - Laser Radial,,Russian Olympic Committee,,
241721,ZYZANSKA Sylwia,F,24.0,Poland,POL,2020,Summer,Tokyo,Archery,Women's Individual,,Poland,,


**Interpretation:** Dropping duplicates on Name, Event, Team and Country keeps one row per athlete-event-country combination and removes repeated medal entries. Note that the call above only displays the deduplicated frame: its result is not assigned back, so df itself is unchanged in the following cells. The data is then used to build models that classify which medal category (Gold, Silver, Bronze, or none) applies, based on athlete attributes and event-related features.
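
`drop_duplicates` returns a new DataFrame unless `inplace=True` is passed, so the result has to be stored for the deduplication to take effect. A minimal sketch on toy data (all values invented) that shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Event": ["100m", "100m", "Long Jump"],
    "Team": ["X", "X", "Y"],
    "Country": ["X", "X", "Y"],
    "Year": [2012, 2016, 2016],
})
keys = ["Name", "Event", "Team", "Country"]

toy.drop_duplicates(subset=keys)            # result discarded, toy is unchanged
rows_without_assignment = len(toy)

deduped = toy.drop_duplicates(subset=keys)  # result stored in a new variable
rows_with_assignment = len(deduped)

print(rows_without_assignment, rows_with_assignment)
```

Because `Year` is not part of the subset, repeat appearances of the same athlete in the same event at different Games collapse to the first one, which may or may not be what is wanted.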

In [27]:
# Fill NaN values in the numeric columns
for col in ['Height', 'Weight', 'Age']:
    sport_avg = df.groupby('Sport')[col].transform('mean')  # Sport-wise mean
    overall_avg = df[col].mean()  # Overall mean (all sports)
    
    # If sport_avg is NaN (i.e., sport has no values), fill it with overall_avg
    df[col] = df[col].fillna(sport_avg).fillna(overall_avg)

**Interpretation:** Missing values in the numeric columns Height, Weight and Age are imputed with the mean for the athlete's sport, so that each filled value reflects typical athletes in that discipline. If a sport has no recorded values for a column, the overall mean across all sports is used as a fallback, so no nulls remain in these columns before the classification model is applied.
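
The two-level fallback can be checked on a small toy frame. In this sketch (values invented for illustration) Judo has a usable sport mean, while Polo has no recorded heights at all and therefore falls back to the overall mean:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Judo", "Polo", "Polo", "Rowing"],
    "Height": [170.0, np.nan, 180.0, np.nan, np.nan, 190.0],
})

# Per-row sport mean; stays NaN for Polo because every Polo height is missing
sport_avg = toy.groupby("Sport")["Height"].transform("mean")
# Mean over all recorded heights: (170 + 180 + 190) / 3 = 180
overall_avg = toy["Height"].mean()

# First try the sport mean, then fall back to the overall mean
toy["Height_filled"] = toy["Height"].fillna(sport_avg).fillna(overall_avg)
print(toy)
```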

In [29]:
# Fill missing medals with 'No Medal'
df['Medal'] = df['Medal'].fillna('No Medal')

In [30]:
# Encode the medal categories as integers in the given order
medal_order = [['No Medal', 'Gold', 'Silver', 'Bronze']]
medal_encoder = OrdinalEncoder(categories=medal_order)
df['Is_Medal'] = medal_encoder.fit_transform(df[['Medal']])
df['Is_Medal'] = df['Is_Medal'].astype(int)

**Interpretation:** In the two steps above, missing values in the 'Medal' column are filled with 'No Medal', so no nulls remain. An OrdinalEncoder then converts the medal types into integer labels (0 for No Medal, 1 for Gold, 2 for Silver, 3 for Bronze), making the column usable as a target for classification models that predict medal categories from athlete and event features.
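
To confirm which integer each medal receives, the encoder can be run on a tiny frame. This is a sketch on toy values that reuses the same category order as the cell above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

medals = pd.DataFrame({"Medal": ["Gold", "No Medal", "Bronze", "Silver"]})

# The position in this list determines the integer assigned to each category
encoder = OrdinalEncoder(categories=[["No Medal", "Gold", "Silver", "Bronze"]])
codes = encoder.fit_transform(medals[["Medal"]]).astype(int).ravel()

mapping = dict(zip(medals["Medal"], codes.tolist()))
print(mapping)
```

The integers act as class labels here. Their numeric order does not follow medal rank (Gold is 1, Bronze is 3), which is fine for classifiers but would matter if the column were ever treated as a score.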

In [32]:
# Group by the following columns and aggregate the numeric features
df_2 = df.groupby(['Year','Sport','Country','Sex','Is_Medal']).agg({'Height':'mean','Weight':'mean','Age':'mean'}).reset_index()
df_2

Unnamed: 0,Year,Sport,Country,Sex,Is_Medal,Height,Weight,Age
0,1896,Athletics,Australia,M,0,176.208009,69.208994,22.000000
1,1896,Athletics,Australia,M,1,176.208009,69.208994,22.000000
2,1896,Athletics,Denmark,M,0,176.208009,69.208994,26.250000
3,1896,Athletics,France,M,0,176.208009,69.208994,22.899637
4,1896,Athletics,France,M,2,176.208009,69.208994,19.000000
...,...,...,...,...,...,...,...,...
43307,2020,Wrestling,United States,M,1,171.919772,74.544279,25.500000
43308,2020,Wrestling,United States,M,2,171.919772,74.544279,25.000000
43309,2020,Wrestling,United States,M,3,171.919772,74.544279,28.500000
43310,2020,Wrestling,Uzbekistan,M,0,171.919772,74.544279,29.000000


**Interpretation:** This step aggregates the dataset to compute the average Height, Weight, and Age of athletes grouped by Year, Sport, Country, Sex, and Medal category (Is_Medal). It helps in understanding demographic and physical trends across different medal levels and prepares structured input features for building a multi-class classification model.
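
A toy version of the aggregation makes the resulting shape easier to see. In this sketch (all values invented) four athlete rows from one Year, Sport, Country and Sex collapse into two rows, one per medal category:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2016],
    "Sport": ["Judo"] * 4,
    "Country": ["Japan"] * 4,
    "Sex": ["M"] * 4,
    "Is_Medal": [0, 0, 1, 1],
    "Height": [170.0, 174.0, 168.0, 172.0],
    "Weight": [60.0, 66.0, 73.0, 81.0],
    "Age": [24.0, 26.0, 22.0, 28.0],
})

group_cols = ["Year", "Sport", "Country", "Sex", "Is_Medal"]
summary = (toy.groupby(group_cols)
              .agg({"Height": "mean", "Weight": "mean", "Age": "mean"})
              .reset_index())
print(summary)
```

Each output row is therefore a cohort average rather than an individual athlete, and the target `Is_Medal` is part of the grouping key.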

In [34]:
# Build a composite key column from Country, Sport and Sex
df_2['Country'] = df_2['Country'].str.strip()
df_2['Sex'] = df_2['Sex'].str.strip()
df_2['Sport'] = df_2['Sport'].str.strip()
df_2['Country'] = df_2['Country'].str.replace(" ","_")
df_2['Sex'] = df_2['Sex'].str.replace(" ","_")
df_2['Sport'] = df_2['Sport'].str.replace(" ","_")
df_2['Key_col'] = df_2['Country'].astype(str)+ "_" + df_2['Sport'].astype(str)+ "_" +  df_2["Sex"].astype(str)
df_2

Unnamed: 0,Year,Sport,Country,Sex,Is_Medal,Height,Weight,Age,Key_col
0,1896,Athletics,Australia,M,0,176.208009,69.208994,22.000000,Australia_Athletics_M
1,1896,Athletics,Australia,M,1,176.208009,69.208994,22.000000,Australia_Athletics_M
2,1896,Athletics,Denmark,M,0,176.208009,69.208994,26.250000,Denmark_Athletics_M
3,1896,Athletics,France,M,0,176.208009,69.208994,22.899637,France_Athletics_M
4,1896,Athletics,France,M,2,176.208009,69.208994,19.000000,France_Athletics_M
...,...,...,...,...,...,...,...,...,...
43307,2020,Wrestling,United_States,M,1,171.919772,74.544279,25.500000,United_States_Wrestling_M
43308,2020,Wrestling,United_States,M,2,171.919772,74.544279,25.000000,United_States_Wrestling_M
43309,2020,Wrestling,United_States,M,3,171.919772,74.544279,28.500000,United_States_Wrestling_M
43310,2020,Wrestling,Uzbekistan,M,0,171.919772,74.544279,29.000000,Uzbekistan_Wrestling_M


**Interpretation:** This code block standardizes the string columns by stripping leading and trailing whitespace and replacing the remaining spaces with underscores, ensuring consistency in categorical values. It also creates a composite key (Key_col) combining Country, Sport, and Sex, which identifies each athlete group and is useful for filtering, grouping, or model training on consistent cohorts across years.
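
The same key construction can be wrapped in a small function so the cleaning rules live in one place. `make_key` is a hypothetical helper written for illustration; unlike the cell above, it leaves the original columns untouched and only returns the key.

```python
import pandas as pd

def make_key(frame: pd.DataFrame) -> pd.Series:
    """Build Country_Sport_Sex keys (hypothetical helper, not part of the notebook)."""
    parts = [
        # Strip outer whitespace, then replace the remaining spaces with underscores
        frame[col].astype(str).str.strip().str.replace(" ", "_", regex=False)
        for col in ["Country", "Sport", "Sex"]
    ]
    return parts[0] + "_" + parts[1] + "_" + parts[2]

toy = pd.DataFrame({
    "Country": [" United States ", "Norway"],
    "Sport": ["Synchronized Swimming", "Sailing"],
    "Sex": ["F", "M"],
})
keys = make_key(toy).tolist()
print(keys)
```

Because country and sport names can themselves contain underscores after cleaning, such a key is convenient for grouping but cannot be split back into its three parts reliably.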

In [36]:
# Keep only the Key_col groups that have at least 2 rows of data
abc = pd.DataFrame(df_2.groupby('Key_col')['Is_Medal'].count().sort_values()).reset_index()
abc = abc.rename(columns = {'Is_Medal':'number_of_rows'})
abc = abc[abc['number_of_rows']>=2]
abc

Unnamed: 0,Key_col,number_of_rows
1840,Hungary_Art_Competitions_F,2
1841,Independent_Olympic_Athletes_Athletics_F,2
1842,Kazakhstan_Fencing_F,2
1843,Belgium_Synchronized_Swimming_F,2
1844,Norway_Figure_Skating_M,2
...,...,...
6288,United_States_Wrestling_M,87
6289,Italy_Fencing_M,88
6290,Great_Britain_Athletics_M,98
6291,United_States_Swimming_M,104


**Interpretation:** This code counts the rows of df_2 for each athlete grouping (Key_col) and keeps the groupings with at least two rows. Because df_2 holds one row per Year and medal category, a grouping can reach two rows either by appearing at more than one Games or by having more than one medal category within a single Games. The filter removes groupings with too little data to train a classification model on.
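
The row-count filter can be reproduced on toy data with `value_counts`. The sketch below (keys and values invented) also shows a stricter variant that requires at least two distinct Games; that variant is an assumption for comparison and is not what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "Key_col": ["A_Judo_M", "A_Judo_M", "A_Judo_M", "B_Polo_F", "C_Golf_M", "C_Golf_M"],
    "Year": [2012, 2016, 2016, 2016, 2016, 2016],
    "Is_Medal": [0, 0, 1, 0, 0, 1],
})

# Rows per key, as in the cell above
counts = toy["Key_col"].value_counts()
eligible = sorted(counts[counts >= 2].index)

# Stricter alternative (assumption): at least two different Games per key
games = toy.groupby("Key_col")["Year"].nunique()
eligible_by_games = sorted(games[games >= 2].index)

print(eligible, eligible_by_games)
```

Here `C_Golf_M` passes the row-count filter with two medal categories from a single Games, but would be dropped by the stricter rule.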

In [38]:
# Keep only the Key_col values that have enough history and also appear in the 2020 data
intersection_list = list(set(df_2.loc[df_2['Year'] == 2020, 'Key_col']) & set(abc['Key_col']))
len(intersection_list)

2344

**Interpretation:** This code computes the intersection between athlete groupings (Key_col) present in the 2020 Olympics and those with sufficient historical data (at least 2 entries). The resulting intersection_list represents valid data groups suitable for building classification models to predict medal outcomes in 2020, based on patterns from past performances.
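
The intersection step can be sketched with plain Python sets. The toy keys below are invented; sorting the result is an optional addition that gives a reproducible iteration order, since sets themselves have no guaranteed order.

```python
import pandas as pd

toy = pd.DataFrame({
    "Key_col": ["A_Judo_M", "B_Polo_F", "C_Golf_M"],
    "Year": [2016, 2020, 2020],
})
# Keys that passed the history filter (toy stand-in for abc)
eligible = pd.DataFrame({"Key_col": ["A_Judo_M", "C_Golf_M"]})

keys_2020 = set(toy.loc[toy["Year"] == 2020, "Key_col"])
usable = sorted(keys_2020 & set(eligible["Key_col"]))
print(usable)
```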

In [40]:
# Model Predictions
start_time = time.time()
final_df_clf = pd.DataFrame()

for i in intersection_list:
    train_data = df_2[(df_2['Key_col'] == i) & (df_2['Year'] <= 2016)].reset_index(drop=True)
    test_data = df_2[(df_2['Year'] == 2020) & (df_2['Key_col'] == i)].reset_index(drop=True)

    if len(train_data) < 2 or train_data['Is_Medal'].nunique() == 1:
        continue  

    if test_data.empty:
        continue  
    X_train = train_data[['Year', 'Height', 'Weight', 'Age']]
    y_train = train_data[['Is_Medal']]

    X_test = test_data[['Year', 'Height', 'Weight', 'Age']]
    y_test = test_data[['Is_Medal']]

    # Feature scaling
    standard_scaler = StandardScaler()
    minmax_scaler = MinMaxScaler()

    X_train_standard = standard_scaler.fit_transform(X_train)
    X_test_standard = standard_scaler.transform(X_test)

    X_train_minmax = minmax_scaler.fit_transform(X_train)
    X_test_minmax = minmax_scaler.transform(X_test)

    # Define models
    models = {
        "Logistic Regression": (LogisticRegression(), X_train_standard, X_test_standard),
        "Decision Tree": (DecisionTreeClassifier(), X_train_minmax, X_test_minmax),
        "Random Forest": (RandomForestClassifier(n_estimators=100, random_state=42, max_features='sqrt', n_jobs=-1), X_train_minmax, X_test_minmax),
        "SVC": (SVC(kernel='rbf', C=1, gamma='scale', random_state=42), X_train_standard, X_test_standard)
    }

    results = []
    for name, (model, X_train_scaled, X_test_scaled) in models.items():
        model.fit(X_train_scaled, y_train.values.ravel())
        y_pred = model.predict(X_test_scaled)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average="weighted")
        results.append([i, name, y_pred.tolist(), y_test.values.flatten().tolist(), acc, f1])

    results_df_clf = pd.DataFrame(results, columns=["Key_col", "Model", "Predictions", "Actual", "Accuracy", "F1 Score"])
    final_df_clf = pd.concat([results_df_clf, final_df_clf])

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution Time: {execution_time:.4f} seconds")

Execution Time: 506.2511 seconds


In [41]:
final_df_clf

Unnamed: 0,Key_col,Model,Predictions,Actual,Accuracy,F1 Score
0,Tunisia_Taekwondo_M,Logistic Regression,[3],[2],0.0,0.000000
1,Tunisia_Taekwondo_M,Decision Tree,[3],[2],0.0,0.000000
2,Tunisia_Taekwondo_M,Random Forest,[3],[2],0.0,0.000000
3,Tunisia_Taekwondo_M,SVC,[3],[2],0.0,0.000000
0,Italy_Archery_M,Logistic Regression,"[0, 0]","[0, 2]",0.5,0.333333
...,...,...,...,...,...,...
3,Greece_Tennis_M,SVC,[0],[0],1.0,1.000000
0,Belgium_Sailing_M,Logistic Regression,[0],[0],1.0,1.000000
1,Belgium_Sailing_M,Decision Tree,[0],[0],1.0,1.000000
2,Belgium_Sailing_M,Random Forest,[0],[0],1.0,1.000000


**Interpretation:** This code trains classifiers on data up to 2016 and predicts the medal category (or no medal) for the 2020 Olympics. The rows are the aggregated Year, Sport, Country, Sex and medal-category groups from df_2 rather than individual athletes. For each group key (Key_col) it fits four algorithms (Logistic Regression, Decision Tree, Random Forest, and SVC) and records their accuracy and weighted F1 score, which allows comparing model performance on this multiclass task.
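
Since the results table has one row per key and model, a per-model average is a natural way to compare the algorithms. The sketch below works on a handful of rows copied from the output above, so its numbers say nothing about the full results; on the real data the same `groupby` would be applied to `final_df_clf`.

```python
import pandas as pd

# A few rows copied from the results table shown above
sample = pd.DataFrame(
    [["Tunisia_Taekwondo_M", "Logistic Regression", 0.0, 0.0],
     ["Tunisia_Taekwondo_M", "SVC", 0.0, 0.0],
     ["Italy_Archery_M", "Logistic Regression", 0.5, 0.333333],
     ["Greece_Tennis_M", "SVC", 1.0, 1.0],
     ["Belgium_Sailing_M", "Logistic Regression", 1.0, 1.0]],
    columns=["Key_col", "Model", "Accuracy", "F1 Score"],
)

# Mean accuracy and F1 per model, best accuracy first
summary = (sample.groupby("Model")[["Accuracy", "F1 Score"]]
                 .mean()
                 .sort_values("Accuracy", ascending=False))
print(summary)
```

Many keys have only one or two test rows, so per-key accuracy takes just a few possible values; averages over many keys are more informative than any single row.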

In [43]:
final_df_clf.to_excel('Classification_results_final.xlsx',index=False)