# Practice Exercise: Exploring data (Exploratory Data Analysis)

## Context:
- The data covers 120 years of Olympic Games (1896 to 2016), with information about athletes and medal results.
- We'll focus on practicing the summary statistics and data visualization techniques that we've learned in the course. 
- This dataset is popular for exploring how the Olympics have evolved over time, including the participation and performance of different genders and countries across various sports and events.

- Check out the original source if you are interested in using this data for other purposes (https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)

## Dataset Description:

We'll work with the data in `athlete_events.csv`.

Each row corresponds to an individual athlete competing in an individual Olympic event.

The columns are:
- **ID**: Unique number for each athlete
- **Name**: Athlete's name
- **Sex**: M or F
- **Age**: Integer
- **Height**: In centimeters
- **Weight**: In kilograms
- **Team**: Team name
- **NOC**: National Olympic Committee 3-letter code
- **Games**: Year and season
- **Year**: Integer
- **Season**: Summer or Winter
- **City**: Host city
- **Sport**: Sport
- **Event**: Event
- **Medal**: Gold, Silver, Bronze, or NA

## Objective: 
   - Examine/clean the dataset
   - Explore distributions of single numerical and categorical features via statistics and plots
   - Explore relationships of multiple features via statistics and plots

We are only going to explore part of the dataset; feel free to explore more if you are interested.

### 1. Import the libraries `Pandas` and `Seaborn`

### 2. Import the data from the csv file as a DataFrame named `olympics`

### 3. Look at the info summary and the head of the DataFrame
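As a rough sketch of steps 2 and 3, the snippet below reads a tiny in-memory CSV instead of the real file. The rows are made up for illustration; only the column names come from the dataset description. It also shows that `read_csv` parses the literal text `NA` as a missing value by default, which is why `Medal` shows up with missing entries.

```python
import io

import pandas as pd

# Made-up rows that mimic a few columns of athlete_events.csv (illustration only)
csv_text = """ID,Name,Sex,Age,Year,Season,Sport,Medal
1,Athlete A,M,24,1992,Summer,Basketball,NA
2,Athlete B,F,,2012,Summer,Judo,Gold
3,Athlete C,F,21,1988,Winter,Speed Skating,NA
"""

# With the real file this would be: pd.read_csv("athlete_events.csv")
olympics = pd.read_csv(io.StringIO(csv_text))

olympics.info()         # column dtypes and non-null counts
print(olympics.head())  # first rows (up to 5 by default)
```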

### 4. Impute the missing data

#### Use `IterativeImputer` in `sklearn` to impute based on columns `Year`, `Age`, `Height`, `Weight`

##### Import libraries

##### Build a list of columns that will be used for imputation, which are `Year`, `Age`, `Height`, `Weight`
The column `Year` doesn't have missing values, but we include it because it may help in modeling the other three columns: age, height, and weight could change across years.

##### Create an `IterativeImputer` object and set its `min_value` and `max_value` parameters to the minimum and maximum of the corresponding columns

##### Apply the imputer to fit and transform the columns to an imputed NumPy array

##### Assign the imputed array back to the original DataFrame's columns
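The imputation sub-steps above could look roughly like the sketch below. It runs on a small made-up DataFrame rather than the real data, and it assumes a scikit-learn version in which `min_value` and `max_value` accept one bound per column. `random_state=0` is an arbitrary choice to make the run repeatable.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for `olympics`; the values are made up for illustration
olympics = pd.DataFrame({
    "Year":   [1996, 2000, 2004, 2008, 2012, 2016],
    "Age":    [23.0, np.nan, 31.0, 19.0, 27.0, np.nan],
    "Height": [180.0, 170.0, np.nan, 165.0, 190.0, 175.0],
    "Weight": [80.0, 64.0, 77.0, np.nan, 95.0, 70.0],
})

cols_to_impute = ["Year", "Age", "Height", "Weight"]

# Keep imputed values within each column's observed minimum and maximum
imputer = IterativeImputer(
    min_value=olympics[cols_to_impute].min().to_numpy(),
    max_value=olympics[cols_to_impute].max().to_numpy(),
    random_state=0,
)

# fit_transform returns a NumPy array with the same column order
imputed_array = imputer.fit_transform(olympics[cols_to_impute])

# Write the imputed values back into the DataFrame
olympics[cols_to_impute] = imputed_array
print(olympics)
```

On a sample this small the imputer may emit a convergence warning; that does not affect the mechanics shown here.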

#### Fill the missing values in the column `Medal` with the string 'NA'

#### Double check that the columns are all imputed
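A minimal sketch of filling `Medal` and then verifying that nothing is missing, again on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `olympics` after the numerical columns were imputed
olympics = pd.DataFrame({
    "Age":   [24.0, 23.0, 21.0],
    "Medal": ["Gold", np.nan, np.nan],
})

# Replace missing medals with the literal string 'NA'
olympics["Medal"] = olympics["Medal"].fillna("NA")

# Count the remaining missing values per column; all should be 0
missing_per_column = olympics.isna().sum()
print(missing_per_column)
```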

### 5. Use the `describe` method to check the numerical columns

### 6. Plot the histograms of the numerical columns using `Pandas`

### 7. Plot the histogram with a rug plot of the column `Age` using `Seaborn`, with both 20 and 50 bins

### 8. Plot the boxplot of the column `Age` using `Pandas`

### 9. Plot the boxplot of the column `Age` using `Seaborn`

### 10. Calculate the first quartile, third quartile, and IQR of the column `Age`

### 11. Print out the lower and upper thresholds for outliers based on IQR for the column `Age`
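Steps 10 and 11 can be sketched as follows, assuming the common rule that flags values more than 1.5 × IQR beyond the quartiles. The ages are made up so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up ages (illustration only)
ages = pd.Series([11, 20, 22, 24, 25, 26, 28, 30, 33, 60], name="Age")

q1 = ages.quantile(0.25)  # first quartile
q3 = ages.quantile(0.75)  # third quartile
iqr = q3 - q1             # interquartile range

# Thresholds under the 1.5 * IQR rule (an assumption, not the only option)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"lower threshold={lower}, upper threshold={upper}")
```

With these numbers, Q1 is 22.5 and Q3 is 29.5, so the IQR is 7 and the thresholds are 12 and 40; the ages 11 and 60 fall outside them.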

### 12. Which `Sport` values appear for very young athletes?

#### Filter for the column `Sport` when the column `Age` has outliers of lower values

#### Look at the unique values of `Sport` and their counts when `Age` is a low-valued outlier

Did you find any sports popular for really young athletes?

### 13. Which `Sport` values appear for older athletes?

#### Filter for the column `Sport` when the column `Age` has outliers of higher values

#### Look at the unique values of `Sport` and their counts when `Age` is a high-valued outlier
Did you find any sports popular for older athletes?
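The filtering in steps 12 and 13 could be sketched like this, using boolean masks and `value_counts`. The data is made up, and the thresholds assume the 1.5 × IQR rule.

```python
import pandas as pd

# Made-up stand-in for `olympics` (illustration only)
olympics = pd.DataFrame({
    "Age": [11, 12, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 60, 65],
    "Sport": [
        "Gymnastics", "Gymnastics", "Athletics", "Rowing", "Judo",
        "Boxing", "Athletics", "Rowing", "Judo", "Boxing",
        "Athletics", "Rowing", "Shooting", "Art Competitions",
    ],
})

q1 = olympics["Age"].quantile(0.25)
q3 = olympics["Age"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Sports of athletes whose age is below the lower threshold
young_sports = olympics.loc[olympics["Age"] < lower, "Sport"]
print(young_sports.value_counts())

# Sports of athletes whose age is above the upper threshold
older_sports = olympics.loc[olympics["Age"] > upper, "Sport"]
print(older_sports.value_counts())
```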

### 14. Check for the number of unique values in each column

### 15. Use the `describe` method to check the non-numerical columns

### 16. Apply the `value_counts` method for each non-numerical column, check for their unique values and counts
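A short sketch of one way to loop over the non-numerical columns, selecting them with `select_dtypes(exclude="number")`. The rows are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for `olympics` (illustration only)
olympics = pd.DataFrame({
    "Age":    [24, 23, 21, 30],
    "Sex":    ["M", "F", "F", "M"],
    "Season": ["Summer", "Summer", "Winter", "Summer"],
})

# Every column that is not numeric
non_numeric_cols = olympics.select_dtypes(exclude="number").columns

# Unique values and their counts, one Series per column
counts = {col: olympics[col].value_counts() for col in non_numeric_cols}
for col, value_count in counts.items():
    print(value_count, "\n")
```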

### 17. Check the first record within the dataset for each Olympic `Sport`

*Hint: sort the DataFrame by `Year`, then group by `Sport`*
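The hint can be sketched in two ways on made-up data. Note the difference: `groupby(...).head(1)` returns the first row of each group as it is, while `groupby(...).first()` returns the first non-null value of each column, which can mix values from different rows when data is missing.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `olympics` (illustration only)
olympics = pd.DataFrame({
    "Year":   [2000, 1896, 1964, 1936],
    "Sport":  ["Judo", "Athletics", "Judo", "Athletics"],
    "Name":   ["Athlete D", "Athlete A", "Athlete B", "Athlete C"],
    "Height": [170.0, np.nan, 168.0, 180.0],
})

by_year = olympics.sort_values("Year")

# Earliest row per sport, kept exactly as recorded
first_rows = by_year.groupby("Sport").head(1)
print(first_rows)

# First non-null value per column within each sport
first_values = by_year.groupby("Sport").first()
print(first_values)
```

Here the earliest Athletics row has no `Height`, so `first()` reports 180.0 taken from the 1936 row, whereas `head(1)` keeps the missing value.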

### 18. What are the average `Age`, `Height`, `Weight` of female versus male Olympic athletes

### 19. What are the minimum, average, maximum `Age`, `Height`, `Weight` of athletes in different `Year`

### 20. What are the minimum, average, median, maximum `Age` of athletes for different `Season` and `Sex` combinations

### 21. What is the average `Age` of athletes, and the number of unique `Team`, `Sport`, `Event` values, for different `Season` and `Sex` combinations

### 22. What are the average `Age`, `Height`, `Weight` of athletes, for different `Medal`, `Season`, `Sex` combinations
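Steps 18 to 22 all follow the same `groupby` plus aggregation pattern. The sketch below shows two variants on made-up rows: several statistics for one column, and different statistics for different columns via named aggregation. The output column names (`avg_age`, `n_teams`, `n_sports`) are illustrative choices.

```python
import pandas as pd

# Made-up stand-in for `olympics` (illustration only)
olympics = pd.DataFrame({
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter", "Winter"],
    "Sex":    ["M", "F", "M", "F", "F", "M"],
    "Age":    [20, 30, 24, 22, 26, 40],
    "Team":   ["USA", "USA", "CAN", "NOR", "SWE", "NOR"],
    "Sport":  ["Judo", "Judo", "Rowing", "Luge", "Luge", "Curling"],
})

# Several statistics of one column per Season/Sex combination
age_stats = olympics.groupby(["Season", "Sex"])["Age"].agg(
    ["min", "mean", "median", "max"]
)
print(age_stats)

# Different statistics for different columns (named aggregation)
summary = olympics.groupby(["Season", "Sex"]).agg(
    avg_age=("Age", "mean"),
    n_teams=("Team", "nunique"),
    n_sports=("Sport", "nunique"),
)
print(summary)
```

Adding `Medal` to the list of grouping columns gives the three-way breakdown in the same way.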

### 23. Plot the scatterplot of `Height` and `Weight`

### 24. Plot the scatterplot of `Height` and `Weight`, using different colors and styles of dots for different `Sex`

### 25. Plot the pairwise relationships of `Age`, `Height`, `Weight`

### 26. Plot the pairwise relationships of `Age`, `Height`, `Weight`, with different colors for `Sex`

### 27. Print out the correlation matrix of `Age`, `Height`, `Weight`
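A minimal sketch using `DataFrame.corr`, which computes Pearson correlation by default. The values are made up, with `Weight` deliberately set to `Height - 105` so that one pair is perfectly correlated.

```python
import pandas as pd

# Made-up stand-in for `olympics` (illustration only)
olympics = pd.DataFrame({
    "Age":    [20, 25, 30, 22, 28],
    "Height": [160.0, 170.0, 180.0, 175.0, 165.0],
    "Weight": [55.0, 65.0, 75.0, 70.0, 60.0],
})

# Pairwise Pearson correlations of the three columns
corr_matrix = olympics[["Age", "Height", "Weight"]].corr()
print(corr_matrix)
```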

### 28. Use a heatmap to display the correlation matrix of `Age`, `Height`, `Weight`, with a colormap (`cmap`) of 'crest'

### 29. Plot the histograms of `Age`, with different colors for different `Sex`

### 30. Plot the histograms of `Age`, on separate plots for different `Sex`

### 31. Look at the changes of average `Age` across `Year` by line charts, with separate lines for different `Season` using different colors

### 32. Look at the distributions of `Age` for different `Sex` using boxplots

### 33. Look at the distributions of `Age` for different `Sex` using violin plots

### 34. Look at the distributions of `Age` for different `Sex` using boxplots, with different colors of plots for different `Season`

### 35. Use count plots to look at the changes in the number of athlete-events across `Year`, with different colors for different `Sex`, and separate plots for different `Season`
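One way to get colors per `Sex` and a separate panel per `Season` is Seaborn's `catplot` with `kind="count"`. This is a sketch on made-up rows, not the only possible approach.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for `olympics` (illustration only)
olympics = pd.DataFrame({
    "Year":   [2012, 2012, 2012, 2016, 2016, 2014, 2014, 2014],
    "Season": ["Summer"] * 5 + ["Winter"] * 3,
    "Sex":    ["M", "F", "M", "F", "M", "M", "F", "F"],
})

# Each row is one athlete-event, so counting rows counts athlete-events
g = sns.catplot(
    data=olympics,
    x="Year",      # one group of bars per year
    hue="Sex",     # color by sex
    col="Season",  # one panel per season
    kind="count",
)
```

With the full dataset there are many more years on the x-axis, so a wider figure or rotated tick labels may be needed.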