# Assignment 2

**Generative AI rule:** For this assignment, you are allowed to use generative AI tools for assistance, but the code must be your original work. Any code that is not your own will be considered cheating.

## Data

This is a historical dataset on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016. The data was taken from Kaggle. The `athlete_events` dataset contains 271,116 rows and 15 columns.

**Important note**: Athletes with the same name might not be the same individuals. To accurately distinguish them, make sure to use their unique IDs.
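
To see why this matters, here is a minimal sketch on a made-up table. The rows and the name "Alex Kim" are hypothetical and not taken from the dataset; only the column names `ID` and `Name` come from the table below.

```python
import pandas as pd

# Hypothetical rows: two different athletes (ID 1 and ID 2) share one name.
toy = pd.DataFrame({
    "ID":   [1, 1, 2],
    "Name": ["Alex Kim", "Alex Kim", "Alex Kim"],
    "Year": [2000, 2004, 2012],
})

# Counting by Name merges the two people; counting by ID keeps them apart.
athletes_by_name = toy["Name"].nunique()
athletes_by_id = toy["ID"].nunique()
print(athletes_by_name, athletes_by_id)
```

Grouping or counting by `ID` rather than `Name` avoids this kind of undercount.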

**ATTRIBUTES:**

**athlete_events.csv**

| Column Name | Data Type | Description/Notes |
|:----:|:----:|:----|
| ID |  integer | Unique number for each athlete |
| Name | string | Athlete’s name |
| Sex | string | M or F |
| Age | float | In years; missing for some rows |
| Height | float | In centimeters; missing for some rows |
| Weight | float | In kilograms; missing for some rows |
| Team | string | Team name |
| NOC | string | National Olympic Committee, three-letter code (matches `NOC` in noc_regions.csv) |
| Games | string | Year and season |
| Year | integer |  |
| Season | string | Summer or Winter |
| City | string | Host city |
| Sport | string |  |
| Event | string |  |
| Medal | string | Gold, Silver, Bronze, or NA |

**Source:** Griffin, R. H. (2018). 120 years of Olympic history: athletes and results, athlete_events. Found at: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results#athlete_events.csv

Download athlete_events.csv from the link above and load it into a DataFrame called `athlete_events`:

In [7]:
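# Your code goes here
import pandas as pd
# Load the CSV under the name the instructions ask for
athlete_events = pd.read_csv('athlete_events.csv')
df = athlete_events  # shorter alias used in the cells below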

Use the `info()` method on this DataFrame to get a sense of the data:

In [8]:
# Your code goes here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


## Question 1

Identify athletes who meet all of the following criteria:

- They are male (Sex is "M").
- Their age is below 25.
- They have participated in the Summer Olympics.
- They have won a medal.

Once you have filtered the athletes based on the above criteria, calculate the average height and weight of these athletes.

In [16]:
# Your code goes here
filtered_athletes = df[
    (df['Sex'] == 'M') &
    (df['Age'] < 25) &
    (df['Season'] == 'Summer') &
    (df['Medal'].notna())]

In [12]:
print(filtered_athletes)

            ID                             Name Sex   Age  Height  Weight  \
86          25                 Alf Lied Aanning   M  24.0     NaN     NaN   
106         38                     Karl Jan Aas   M  20.0     NaN     NaN   
150         56                       Ren Abadie   M  21.0     NaN     NaN   
158         62                Giovanni Abagnale   M  21.0   198.0    90.0   
173         73                        Luc Abalo   M  23.0   182.0    86.0   
...        ...                              ...  ..   ...     ...     ...   
270960  135498                     Denis vegelj   M  20.0     NaN     NaN   
270981  135503                  Zurab Zviadauri   M  23.0   182.0    90.0   
271010  135521                    Anton Zwerina   M  23.0     NaN    66.0   
271013  135522  Klaas Erik "Klaas-Erik" Zwering   M  23.0   189.0    80.0   
271046  135544               Krzysztof Zwoliski   M  21.0   175.0    70.0   

               Team  NOC        Games  Year  Season            City  \
86  

In [13]:
# Calculate the average height and weight
average_height = filtered_athletes['Height'].mean()
average_weight = filtered_athletes['Weight'].mean()

print(f"Average height of filtered athletes: {average_height:.2f} cm")
print(f"Average weight of filtered athletes: {average_weight:.2f} kg")

Average height of filtered athletes: 181.07 cm
Average weight of filtered athletes: 77.58 kg


## Question 2

Using the dataset, group the athletes by "Team" and answer the following:

- Which team has the maximum number of athletes?
- Which team has competed in the highest number of distinct sports?


In [None]:
# Your code goes here

## Question 3

**True or False?**

> The average height of athletes who won a medal in Speed Skating is greater than the average height of athletes who won a medal in Basketball.

Write code to determine if this statement is True or False.

In [None]:
# Your code goes here

## Question 4

Identify athletes who have participated in multiple Olympic events (more than one unique event). For these athletes:

- Calculate the total number of events they participated in, summed over all of them.
- Identify the athletes who have won at least two medals (any type of medal). How many such athletes are there?

In [None]:
# Your code goes here
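
The sketch below shows one way to read this question, on made-up rows. The helper name `multi_event_stats` and the toy data are hypothetical. It assumes "total number of events" means the sum of each selected athlete's count of unique events; summing raw participation rows instead is another possible reading.

```python
import pandas as pd

def multi_event_stats(events: pd.DataFrame) -> tuple[int, int]:
    """Return (sum of unique-event counts, number of athletes with >= 2 medals)
    among athletes who took part in more than one unique event."""
    per_athlete = events.groupby("ID").agg(
        n_events=("Event", "nunique"),
        n_medals=("Medal", "count"),  # count() ignores missing medals
    )
    multi = per_athlete[per_athlete["n_events"] > 1]
    total_events = int(multi["n_events"].sum())
    n_two_medals = int((multi["n_medals"] >= 2).sum())
    return total_events, n_two_medals

# Hypothetical toy rows, only to make the sketch runnable on its own.
toy = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 3, 3],
    "Event": ["E1", "E2", "E2", "E1", "E1", "E3"],
    "Medal": ["Gold", "Silver", None, "Gold", None, None],
})

total_events, n_two_medals = multi_event_stats(toy)
print(total_events, n_two_medals)
```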

## Question 5

Identify all athletes who have won at least one medal and have participated in two or more different sports. For these athletes:

- Find the total number of different sports each of them participated in. Which athlete participated in the most sports, and in how many?
- Calculate the average number of medals won by these athletes. Which athlete won the most medals, and how many?

In [None]:
# Your code goes here
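
A possible sketch, again on made-up rows (the helper name `multi_sport_medalists`, the athlete names, and the toy values are all hypothetical). Athletes are grouped by `ID` together with `Name`, so that the name is available for reporting while the `ID` keeps namesakes apart.

```python
import pandas as pd

def multi_sport_medalists(events: pd.DataFrame) -> pd.DataFrame:
    """Athletes with at least one medal and at least two different sports."""
    per_athlete = events.groupby(["ID", "Name"]).agg(
        n_sports=("Sport", "nunique"),
        n_medals=("Medal", "count"),  # count() ignores missing medals
    )
    keep = (per_athlete["n_medals"] >= 1) & (per_athlete["n_sports"] >= 2)
    return per_athlete[keep]

# Hypothetical toy rows, only to make the sketch runnable on its own.
toy = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 2, 3, 3],
    "Name":  ["Ann", "Ann", "Bo", "Bo", "Bo", "Cy", "Cy"],
    "Sport": ["Judo", "Rowing", "Judo", "Rowing", "Sailing", "Judo", "Judo"],
    "Medal": ["Gold", None, "Gold", "Silver", "Bronze", "Gold", "Gold"],
})

selected = multi_sport_medalists(toy)
most_sports = selected["n_sports"].idxmax()   # (ID, Name) of the athlete with the most sports
most_medals = selected["n_medals"].idxmax()   # (ID, Name) of the athlete with the most medals
mean_medals = selected["n_medals"].mean()
print(selected)
print(most_sports, most_medals, mean_medals)
```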

## Question 6

Identify the top 3 most common sports in which athletes have won a Gold medal. List the sports in descending order of frequency.

In [None]:
# Your code goes here
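
A minimal sketch using `value_counts()`, which already sorts in descending order of frequency. The helper name `top_gold_sports` and the toy rows are hypothetical. The sketch counts Gold medal rows, so every member of a winning team contributes one row; counting one medal per event would require deduplicating first.

```python
import pandas as pd

def top_gold_sports(events: pd.DataFrame, n: int = 3) -> pd.Series:
    """The n sports with the most Gold medal rows, most frequent first."""
    gold = events[events["Medal"] == "Gold"]
    return gold["Sport"].value_counts().head(n)

# Hypothetical toy rows, only to make the sketch runnable on its own.
toy = pd.DataFrame({
    "Sport": ["Athletics"] * 4 + ["Swimming"] * 3 + ["Judo"] * 2 + ["Rowing"] * 2,
    "Medal": ["Gold"] * 9 + ["Gold", "Silver"],
})

top3 = top_gold_sports(toy)
print(top3)
```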

## Question 7

As part of a sponsorship deal, a company wants to endorse athletes who have performed consistently in the Olympics. The company defines a consistent athlete as one who has participated in more than one year and has won at least one medal in every year in which they participated.

How many such athletes exist?



In [None]:
# Your code goes here
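
One way to express this definition is in two grouping steps: first per athlete and year, then per athlete. The sketch below uses made-up rows; the helper name `count_consistent` and the toy data are hypothetical, and "year" is taken to be the `Year` column.

```python
import pandas as pd

def count_consistent(events: pd.DataFrame) -> int:
    """Athletes with more than one participation year and a medal in every such year."""
    # Step 1: for each (athlete, year), did they win at least one medal?
    per_year = (
        events.assign(won=events["Medal"].notna())
              .groupby(["ID", "Year"])["won"].any()
              .reset_index()
    )
    # Step 2: per athlete, count the years and check that every year has a medal.
    per_athlete = per_year.groupby("ID").agg(
        n_years=("Year", "nunique"),
        every_year=("won", "all"),
    )
    consistent = per_athlete[(per_athlete["n_years"] > 1) & per_athlete["every_year"]]
    return len(consistent)

# Hypothetical toy rows, only to make the sketch runnable on its own.
toy = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3, 4, 4],
    "Year":  [2000, 2000, 2004, 2000, 2004, 2000, 2008, 2012],
    "Medal": ["Gold", None, "Silver", "Gold", None, "Gold", "Bronze", "Bronze"],
})

n_consistent = count_consistent(toy)
print(n_consistent)
```

In the toy rows, athletes 1 and 4 qualify; athlete 2 has a year without a medal and athlete 3 competed in only one year.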

## Question 8

A sports analytics firm wants to identify athletes who have shown versatility by competing in both individual and team sports. The firm is particularly interested in those who have won at least one medal in both types of events.

Provide the total number of athletes who have won medals in both individual and team sports.

Note: The following is a list of all team sports to reference for this question:

```
['Basketball', 'Football', 'Tug-Of-War', 'Ice Hockey', 'Handball', 'Water Polo', 'Hockey', 'Rowing', 'Bobsleigh', 'Sailing', 'Baseball', 'Softball',
'Rugby Sevens', 'Volleyball', 'Beach Volleyball', 'Synchronized Swimming', 'Curling', 'Lacrosse', 'Polo', 'Cricket', 'Military Ski Patrol', 'Croquet']
```

In [None]:
# Your code goes here
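
A possible sketch using set intersection on athlete IDs, shown on made-up rows. The helper name `count_versatile_medalists` and the toy data are hypothetical. The list of team sports is copied from the note above, with 'Synchronized Swimming' and 'Curling' treated as two separate entries (an assumption about the intended list); any sport not in the list is treated as individual.

```python
import pandas as pd

# Team sports from the note above; every other sport counts as individual.
TEAM_SPORTS = {
    "Basketball", "Football", "Tug-Of-War", "Ice Hockey", "Handball",
    "Water Polo", "Hockey", "Rowing", "Bobsleigh", "Sailing", "Baseball",
    "Softball", "Rugby Sevens", "Volleyball", "Beach Volleyball",
    "Synchronized Swimming", "Curling", "Lacrosse", "Polo", "Cricket",
    "Military Ski Patrol", "Croquet",
}

def count_versatile_medalists(events: pd.DataFrame) -> int:
    """Athletes (by ID) with a medal in a team sport and a medal in an individual sport."""
    medals = events[events["Medal"].notna()]
    is_team = medals["Sport"].isin(TEAM_SPORTS)
    team_ids = set(medals.loc[is_team, "ID"])
    individual_ids = set(medals.loc[~is_team, "ID"])
    return len(team_ids & individual_ids)

# Hypothetical toy rows, only to make the sketch runnable on its own.
toy = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 3, 3],
    "Sport": ["Rowing", "Judo", "Rowing", "Judo", "Basketball", "Handball"],
    "Medal": ["Gold", "Silver", "Gold", None, "Gold", "Gold"],
})

n_versatile = count_versatile_medalists(toy)
print(n_versatile)
```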

## Question 9

A sports health association is analyzing athletes' BMI to assess fitness, defining "fit" athletes as those in the Normal weight category.

1. Create a **BMI** column using **Height** and **Weight**:  
$BMI = \frac{Weight_{kg}}{Height_m^2}$

Here $Height_m$ is the height in meters, so the Height column (in centimeters) must be converted first. Fill missing Height and Weight values with the average within the same Sex group, and round BMI to 1 decimal place.

2. Count the number of "fit" athletes, i.e. those with a BMI in the range 20-28 (inclusive).

Note: Some athletes have different heights and weights recorded over the years. Use the average over all of an athlete's measurements.

In [None]:
# Your code goes here
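
The steps above could be sketched as follows, on made-up rows. The helper name `count_fit_athletes` and the toy values are hypothetical. The order of operations is an assumption: missing values are filled per row with the Sex-group mean first, then measurements are averaged per athlete `ID`, and BMI is computed last.

```python
import pandas as pd

def count_fit_athletes(events: pd.DataFrame, low: float = 20.0, high: float = 28.0) -> int:
    """Count athletes whose rounded BMI lies in [low, high]."""
    filled = events[["ID", "Sex", "Height", "Weight"]].copy()
    # Fill missing values with the mean of the same Sex group.
    for col in ["Height", "Weight"]:
        sex_mean = filled.groupby("Sex")[col].transform("mean")
        filled[col] = filled[col].fillna(sex_mean)
    # Average each athlete's measurements over all of their rows.
    per_athlete = filled.groupby("ID")[["Height", "Weight"]].mean()
    # Height is in centimeters, so convert to meters before squaring.
    bmi = (per_athlete["Weight"] / (per_athlete["Height"] / 100) ** 2).round(1)
    return int(bmi.between(low, high).sum())  # between() is inclusive on both ends

# Hypothetical toy rows, only to make the sketch runnable on its own.
nan = float("nan")
toy = pd.DataFrame({
    "ID":     [1, 1, 2, 3, 4],
    "Sex":    ["M", "M", "M", "F", "F"],
    "Height": [180.0, 180.0, nan, 160.0, 170.0],
    "Weight": [80.0, 82.0, 100.0, 60.0, nan],
})

n_fit = count_fit_athletes(toy)
print(n_fit)
```

In the toy rows, athlete 2 receives the male mean height of 180 cm and ends up above the range, while athlete 4 receives the female mean weight of 60 kg and falls inside it.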