# COGS 108 - Data Checkpoint

## Authors

Beliz Akbulut: Background research, data curation, analysis

Anna Lewis: Background research, conceptualization, analysis

Karsten Jensen: Background research, data curation, experimental investigation

Camerin Oliver: Conceptualization, analysis

Ryenn Thompson: Conceptualization, analysis



## Research Question

How well can national health metrics (such as life expectancy at birth, infant and under-5 child mortality, and physicians per 1,000 people) forecast a country’s medal total in future Summer Olympic Games when models are trained only on recent prior Games?
- Does incorporating same-year national health indicators reduce the average out-of-sample prediction error for Olympic medal totals when evaluated using forward-chaining cross-validation across Olympic cycles and adjusted for economic/demographic variables?
- How much average prediction error is achieved by models using only economic and demographic variables (GDP per capita growth, population, urban population percentage)? 
- How much does the average out-of-sample error change when national health indicators are added to the baseline model?
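
The comparison in the last two sub-questions can be sketched as follows. This is a minimal illustration, assuming mean absolute error (MAE) as the "average prediction error"; the metric choice, the function name, and all numbers are hypothetical placeholders rather than results from our data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted medal totals."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out medal totals and predictions (illustrative numbers only)
actual = [10, 0, 4, 25]
baseline_pred = [14, 2, 1, 20]  # economic/demographic features only
health_pred = [12, 1, 3, 22]    # baseline features plus health indicators

baseline_mae = mean_absolute_error(actual, baseline_pred)
health_mae = mean_absolute_error(actual, health_pred)

# A negative change means adding health indicators lowered the error
error_change = health_mae - baseline_mae
print(baseline_mae, health_mae, error_change)
```

In the actual analysis, this difference would be averaged over every held-out Olympic cycle rather than computed once.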

## Background and Prior Work

The International Olympic Games is a longstanding tradition of global athletic competition and patriotism that has occurred regularly since the first Summer Games in 1896, which itself was inspired by the Ancient Greek Olympic Games. This year, Italy is hosting the Winter Olympics, and the increasing media coverage of athletes and events led us to consider how we might predict national Olympic performance.

As many of our group members are interested in public health and environmental safety, we wondered if there might be a relationship between national health indices and Olympic performance. In conducting preliminary research on the topic, we found several sources referencing this association, as well as other extraneous variables that affect Olympic performance. One research article we found, entitled “Assessment of Olympic performance in relation to economic, demographic, geographic, and social factors: quantile and Tobit approaches,” examined some of these factors and analyzed their effects on the 2016 Rio Olympics. The authors discussed how economic and social factors such as income classification, government corruption, and athletic health culture can strongly affect a nation’s athletic output and, ultimately, its historical performance at the Olympic Games [1]. Their analysis helped us establish a link between national conditions in athletes’ home countries and their corresponding medal successes at the Olympics. As hypothesized, the data in this research indicated that developing nations have been performing better over time as their economic conditions improve, leading us to believe that we may find similar results for public health conditions.

After looking at historical data from the World Health Organization’s website, we brainstormed specific parameters that we think would most strongly influence a nation’s overall health rating, and thus support stronger and more athletic future generations. We want to determine if “developing” countries that improved their public health scores in prior decades also performed better in the Olympic Games, and use that information to predict which nations can be expected to improve at future Games. We decided to include only Summer Olympic Games data so that differences between the Winter and Summer Games, such as geographic advantages and disadvantages between nations, seasonal health concerns, and lack of diversity in types of sports events, would not confound our findings.

Our modeling approach for this project is to use forward-chaining cross-validation to analyze historical trends across Olympic cycles and predict the performance of specific countries in a specific Games. We will train our model on both health data and medal performance data from 1976 to 2014 to predict the outcome of the 2016 Summer Olympics, and then assess how well our model matched the true performance of that Olympic year. We are focusing on predictive performance to test whether the link between national public health information and Olympic athletic performance is more than a spurious correlation and holds up as a genuinely predictive relationship. Olympic performance prediction is time-dependent. Policymakers and sports organizations need models that can forecast future medal counts using only historical data, not models that explain associations after outcomes are known. Forward-chaining mimics this realistic constraint by training exclusively on past cycles and evaluating on subsequent Games. Out-of-sample prediction error provides a more stringent test of whether health indicators have genuine predictive utility beyond economic and demographic baselines for our model.
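
The forward-chaining scheme described above can be sketched as a generator of train/test splits over Olympic years. This is a minimal sketch under stated assumptions: the function name and the `min_train` parameter (how many past Games must be available before the first evaluation) are hypothetical choices, not settings we have fixed.

```python
def forward_chaining_splits(years, min_train=3):
    """Yield (train_years, test_year) pairs where training uses only earlier Games."""
    ordered = sorted(set(years))
    for i in range(min_train, len(ordered)):
        # Everything before position i is training data; position i is held out
        yield ordered[:i], ordered[i]


olympic_years = list(range(1976, 2017, 4))  # 1976, 1980, ..., 2016
splits = list(forward_chaining_splits(olympic_years, min_train=3))

for train_years, test_year in splits:
    print(f"train on {train_years[0]}-{train_years[-1]}, test on {test_year}")
```

Each split evaluates on a Games that occurs strictly after every training year, so no information from the future leaks into the model.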

Due to the long history of the Olympic Games, many projects have focused on predicting and analyzing athletic performance data to find trends and outliers. For example, one project found on Kaggle, authored by user EricSBrown, sought to determine how historic national economic information, specifically GDP data, could predict a family’s fantasy draft picks for the 2022 Winter Olympics [2]. The creator of this project merged several different datasets, including GDP values, historic national Olympic medal data, and comparative time tracking between the two. This research helped us understand how to find correlations between datasets and use that information to create a predictive model for other instances. In our project, we intend to “test” our model several times by attempting to predict past Games using prior data, such as predicting different nations’ performance at the 2008 Olympics using their performance and health data from between 1980 and 2004.

Another project, published on GitHub by user jalwz17, predicted which countries would win the most medals at the 2024 Summer Olympics and the 2026 Winter Olympics [3]. They used a time series prediction model to determine which countries have consistently performed highly at the Olympic Games, including the United States and Germany, and which countries saw a rise over time in Olympic performance, such as China post-1984. This model resembles the one we want to construct in the sense that it analyzes historical performance trends to predict future successes, but we intend to add another layer of complexity to our project to find a causal relationship that explains the reason for changes in performance trends. Their project included several interesting visualization techniques that we intend to incorporate into our own work, including seaborn heatmaps and histograms, that aided in both comprehension and explanation of their findings.


## Hypothesis


We predict that nations with higher national health metrics perform better in the Summer Olympic Games. Based on historical trends and the development of health standards in various nations, we can anticipate which nations will win more medals overall, and which nations can be expected to improve over time in medal wins. In terms of health indices, we chose life expectancy at birth as a solid baseline variable to showcase general healthcare accessibility, from prenatal hospital care to maternal and birth health systems. We also chose infant and child under 5 mortality rates as a similar overall indicator of national health systems, as early infant and childhood healthcare quality can also indicate overall healthcare quality for a larger population, from access to effective medicines to professional healthcare providers. Our last health variable is physicians per 1,000 people, which should indicate overall population accessibility to healthcare and medical resources. In terms of economic variables, we chose GDP per capita growth and overall population to give us a general idea of different nations’ economic standing over time. We also chose urban population percentage to provide more context and insight into the structure of different nations and their histories of modern development. 


## Data

### Data overview

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://drive.google.com/uc?id=1CyX4u64tTUQHa0kgGVTKGXyFQmfPenc9', 'filename':'olympics_dataset.csv'},
    { 'url': 'https://drive.google.com/uc?id=1fz6WjVnBxmZM5Ok6Q0yVHRawxeZ_VH0x', 'filename':'noc_to_country.json'},
    {'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'},
    {'url': 'https://drive.google.com/uc?id=1S34HIQDCEt30pWrSnvVZo90lXwxlU10r', 'filename':'final_olympics_dataset.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress:  25%|██▌       | 1/4 [00:03<00:10,  3.39s/it]

Successfully downloaded: olympics_dataset.csv


Overall Download Progress:  50%|█████     | 2/4 [00:05<00:05,  2.76s/it]

Successfully downloaded: noc_to_country.json


Overall Download Progress:  75%|███████▌  | 3/4 [00:05<00:01,  1.55s/it]

Successfully downloaded: bad-drivers.csv


Overall Download Progress: 100%|██████████| 4/4 [00:07<00:00,  1.76s/it]

Successfully downloaded: final_olympics_dataset.csv





### World Bank Dataset
Name of Dataset: World Bank World Development Indicators (WDI) 

Link: https://data.worldbank.org/

Observations: 586 rows 

Variables: 10 columns 

Countries: 133

Years: 1976-2016

Variables included in the dataset:

- total_medals
- Country_code
- Country
- year
- gdp_per_capita_growth
- under5_mortality 
- physicians_per_1000
- life_expec_under5
- population
- urban_population_pct

#### Background

This dataset was gathered from the World Bank. It contains yearly data about health, demographics, and economics for individual countries. We decided to use it because the World Bank applies standardized methods and collects information consistently across countries and years, which allows us to compare trends over time. We will use a subset of these indicators to measure the health conditions of each nation, as well as some basic demographic and economic factors that could affect the performance of a nation participating in the Olympics.

#### Health Variables

**Life expectancy at birth, total (years)**

The average number of years a newborn is expected to live, based on that country's current death rates. A higher life expectancy generally indicates that children have greater access to health care, nutrition, and other markers of living standards.

**Mortality rate, under-5 (per 1,000 live births)**

The number of deaths among children under the age of five for every 1,000 live births. Countries with higher-quality child healthcare tend to show lower numbers.

**Physicians (per 1,000 people)**

The number of practicing physicians available per 1,000 people in a country. Higher values indicate better access to medical professionals and stronger healthcare system capacity. Increased physician density is generally associated with better access to care, improved disease prevention and treatment, and stronger overall population health.

#### Demographic and Economic Variables

**Population, total**

The number of people who live in a country. High-population countries may have a larger pool of potential Olympic athletes.

**Urban population (% of total population)**

The percentage of a country's population living in urban areas. Access to sports infrastructure, training resources, and health care can vary with where a country's population lives.

**GDP per capita growth (annual %)**

The annual percentage change in a country's GDP per capita, where GDP per capita is gross domestic product divided by population. It measures how quickly GDP per capita is growing each year and therefore captures a short-term trend in a country's economy.
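
As a hedged illustration of this definition, the growth rate can be written as follows, where $y_t$ is our own notation (not the World Bank's) for GDP per capita in year $t$:

$$
y_t = \frac{\mathrm{GDP}_t}{\mathrm{Population}_t}, \qquad g_t = 100 \times \frac{y_t - y_{t-1}}{y_{t-1}}
$$

For example, a hypothetical country whose GDP per capita rises from 10,000 to 10,300 in one year has $g_t = 100 \times 300 / 10{,}000 = 3\%$.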

#### Shortcomings & Limitations

Although the World Bank dataset has standard indicators that are comparable among countries, several variables have missing data for some countries and/or time periods, especially during earlier Olympic cycles. The health and demographic measures describe a country's population as a whole, so they do not account for factors specific to an individual athlete that could influence Olympic performance.
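
One way to quantify the missing-data concern is to compute the share of missing values per indicator within each Olympic year. The sketch below uses a toy table with made-up values; only the column names are taken from our dataset, and the approach itself is an assumption about how we might check this.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the country-year indicator table (values are illustrative, not real data)
toy = pd.DataFrame({
    "year": [1976, 1976, 2016, 2016],
    "physicians_per_1000": [np.nan, 1.2, 0.3, 2.5],
    "under5_mortality": [np.nan, 20.0, 70.0, 5.0],
})

# Flag missing cells, then average the flags within each Olympic year
missing_share = toy.drop(columns="year").isna().groupby(toy["year"]).mean()
print(missing_share)
```

A table like this makes it easy to see whether earlier cycles are missing enough data to justify dropping them or imputing values.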

#### Integration

The World Bank dataset will be merged with the Olympic medal dataset by matching country codes and Olympic years. Indicators from the World Bank will be aligned with the Summer Olympics (one row per country for each Olympic Games) in preparation for modeling.
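
The merge described above can be sketched with `pandas.DataFrame.merge`. This is a minimal sketch on toy frames: the values and country codes are made up, and the choice of an inner join is an assumption rather than a confirmed detail of our pipeline.

```python
import pandas as pd

# Toy frames standing in for the two cleaned sources (values are illustrative only)
medals = pd.DataFrame({
    "Country_code": ["AAA", "AAA", "BBB"],
    "year": [2012, 2016, 2016],
    "total_medals": [3, 5, 0],
})
indicators = pd.DataFrame({
    "Country_code": ["AAA", "AAA", "BBB", "BBB"],
    "year": [2012, 2016, 2015, 2016],
    "population": [1_000_000, 1_050_000, 2_000_000, 2_020_000],
})

# Inner join keeps only country-years present in both sources,
# which also drops non-Olympic years from the indicator table.
# validate="one_to_one" raises if either side has duplicate keys.
merged = medals.merge(
    indicators,
    on=["Country_code", "year"],
    how="inner",
    validate="one_to_one",
)
print(merged)
```

The `validate` argument is a cheap safeguard: it fails loudly if a country-year appears twice on either side instead of silently duplicating rows.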


In [3]:
# Relevant libraries
import pandas as pd
import numpy as np
import json 

# Loading data


### Olympic Medal Dataset

Name of Dataset: Olympics Dataset

Link: https://www.kaggle.com/datasets/stefanydeoliveira/summer-olympics-medals-1896-2024

Observations: 252565 rows 

Variables: 11 columns 

Years: 1896-2024

Variables included in the dataset:

- player_id
- Name
- Sex
- Team
- NOC
- Year
- Season
- City
- Sport
- Event
- Medal

This dataset tracks the medals won by each country at each Summer Olympic Games between 1976 and 2016. In the original dataset, each row records one athlete's result in one event as no medal, gold, silver, or bronze. For our purposes we reduce this to whether a medal was won: no medal is coded as 0 and any medal is coded as 1. There are 252565 rows and 11 columns. Several countries also changed names during the timespan we are observing, so during data cleaning we map historical names to their modern equivalents so that each country is counted once rather than as separate entities.

Reducing the medal column to medal or no medal, instead of bronze, silver, gold, or no medal, limits the types of analysis we can do. For example, we cannot analyze performance quality based on whether a country has consistently received bronze compared to silver or gold. Country size is also a factor that can confound the number of medals won per country, because larger countries can send more athletes than smaller countries and therefore have more chances to win medals. This dataset also does not account for changes over time in sport rules, eligibility, or the number of events held at each Olympics.

We will cross-reference the number of Olympic medals each country received at each Olympic Games in this dataset with our first dataset on health indicators.


In [4]:
data = pd.read_csv('data/00-raw/olympics_dataset.csv')
data.head()

# Let's take a look at the column names:
data.columns

Index(['player_id', 'Name', 'Sex', 'Team', 'NOC', 'Year', 'Season', 'City',
       'Sport', 'Event', 'Medal'],
      dtype='object')

In [5]:
# Number of observations
data.shape[0]

252565

We can see that each row records one athlete's entry in one event, along with the year, country, Olympic Games, and medal result.

In [6]:
# Check for any missing values
data.isna().any()

player_id    False
Name         False
Sex          False
Team         False
NOC          False
Year         False
Season       False
City         False
Sport        False
Event        False
Medal        False
dtype: bool

In [7]:
data['Medal'].unique()

array(['No medal', 'Gold', 'Bronze', 'Silver'], dtype=object)

Here we can see that there are no missing values in any of the columns of data. 

Upon taking a closer look at the Medal column, we see that there are 4 possible values: Gold, Silver, Bronze, and No medal. As we are not concerned with the specifics of the medal, we map this column to 1 for any medal (Gold, Silver, Bronze) and 0 for No medal.

In [8]:
# Normalize the medal column
data['Medal'] = data['Medal'].str.strip().str.lower()

# Convert to int, 0 if no medal, 1 if any type of medal
data['Medal'] = (data['Medal'] != 'no medal').astype(int)
data.head()


| | player_id | Name | Sex | Team | NOC | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A Dijiang | M | China | CHN | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | 0 |
| 1 | 1 | A Lamusi | M | China | CHN | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | 0 |
| 2 | 2 | Gunnar Aaby | M | Denmark | DEN | 1920 | Summer | Antwerpen | Football | Football Men's Football | 0 |
| 3 | 3 | Edgar Aabye | M | Denmark/Sweden | DEN | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | 1 |
| 4 | 26 | Cornelia (-strannood) | F | Netherlands | NED | 1932 | Summer | Los Angeles | Athletics | Athletics Women's 100 metres | 0 |


Now that we have a sense of what information is captured in our dataset, we can get rid of any additional information that is not relevant to our research question. To address ethical concerns and protect privacy, we don't want to include any information about specific player names or data. As we are only concerned with each country's medal count for each Summer Olympic Games, we only need the 'NOC', 'Year', and 'Medal' columns from our original dataset.

In [9]:
cols = ['NOC','Year', 'Medal']
olympic_medals = data[cols]

# Group by country, year and sum up medal counts. 
medal_count = olympic_medals.groupby(['NOC','Year'], as_index=False)['Medal'].sum()

# We are only concerned with Olympic Games that took place between 1976 and 2016.
in_range = medal_count[medal_count['Year'].isin(range(1976, 2017))]

in_range.head()

| | NOC | Year | Medal |
|---|---|---|---|
| 7 | AFG | 1980 | 0 |
| 8 | AFG | 1988 | 0 |
| 9 | AFG | 1996 | 0 |
| 10 | AFG | 2004 | 0 |
| 11 | AFG | 2008 | 1 |


Next, we need to ensure country names are consistent with their modern names so that we don't misrepresent their true medal count. For example, West Germany is represented 3 times, but we want to consolidate that into just Germany. Also, for the sake of readability, we want to map country codes to their names.

In [10]:
print((in_range['NOC'] == 'FRG').sum())

3


In [11]:
# JSON file mapping country codes to their official name
with open("data/00-raw/noc_to_country.json", "r") as f:
    noc_to_country = json.load(f)

# Historical NOC to current NOC
modern_merge = {
    "FRG": "GER",   # West Germany → Germany
    "GDR": "GER",   # East Germany → Germany
    "URS": "RUS",   # Soviet Union → Russia
    "EUN": "RUS",   # Unified Team → Russia
    "TCH": "CZE",   # Czechoslovakia → Czech Republic
    "YUG": "SRB",   # Yugoslavia → Serbia
    "SCG": "SRB",   # Serbia & Montenegro → Serbia
    "YAR": "YEM",   # North Yemen → Yemen
    "YMD": "YEM",   # South Yemen → Yemen
    "AHO": "NED",   # Netherlands Antilles → Netherlands
}


# Replace historical codes with modern NOC codes
in_range_copy = in_range.copy()
in_range_copy["NOC"] = in_range_copy["NOC"].replace(modern_merge)

# Map NOC codes to country names
in_range_copy['Country'] = in_range_copy['NOC'].map(noc_to_country)

olympics_cleaned = in_range_copy.drop(columns = ['NOC']).reset_index()
olympics_cleaned


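| | index | Year | Medal | Country |
|---|---|---|---|---|
| 0 | 7 | 1980 | 0 | Afghanistan |
| 1 | 8 | 1988 | 0 | Afghanistan |
| 2 | 9 | 1996 | 0 | Afghanistan |
| 3 | 10 | 2004 | 0 | Afghanistan |
| 4 | 11 | 2008 | 1 | Afghanistan |
| ... | ... | ... | ... | ... |
| 1849 | 3215 | 2000 | 0 | Zimbabwe |
| 1850 | 3216 | 2004 | 3 | Zimbabwe |
| 1851 | 3217 | 2008 | 4 | Zimbabwe |
| 1852 | 3218 | 2012 | 0 | Zimbabwe |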


Now that we've cleaned our Olympic Medal dataset, let's look at some quick summary statistics.

In [12]:
olympics_cleaned.groupby('Country', as_index=False)['Medal'].sum().sort_values(by='Medal', ascending = False)

| | Country | Medal |
|---|---|---|
| 196 | United States | 2540 |
| 153 | Russia | 2136 |
| 70 | Germany | 1932 |
| 10 | Australia | 997 |
| 40 | China | 909 |
| ... | ... | ... |
| 43 | Cook Islands | 0 |
| 121 | Monaco | 0 |
| 117 | Mauritania | 0 |
| 116 | Marshall Islands | 0 |


In [13]:
olympics_cleaned.groupby(['Country'], as_index=False)['Medal'].mean().sort_values(by='Medal', ascending= False)

| | Country | Medal |
|---|---|---|
| 196 | United States | 254.000000 |
| 153 | Russia | 213.600000 |
| 70 | Germany | 148.615385 |
| 40 | China | 101.000000 |
| 10 | Australia | 90.636364 |
| ... | ... | ... |
| 151 | Republic of the Congo | 0.000000 |
| 78 | Guinea-Bissau | 0.000000 |
| 38 | Chad | 0.000000 |
| 80 | Haiti | 0.000000 |


Here we can see that the United States has the most total medals, with Russia coming in second. We can also see that countries like Cambodia and the Marshall Islands won no medals over 1976-2016. Furthermore, the United States averages 254 medals per Olympics in our timeframe. Note that each row in the raw data is one athlete's entry in one event, so a medal in a team event is counted once per team member, and these totals run higher than official medal-table counts.
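
One caveat about the per-country mean: after historical codes are merged into modern ones, a country can have more than one row for the same Games. Germany's mean of 148.615385 equals 1932 / 13, yet only 11 Summer Games fall in 1976-2016, which suggests West and East Germany still occupy separate rows. A minimal sketch of re-aggregating after the code replacement follows; the counts are made up for illustration, and this is a suggested adjustment rather than a step the cells above perform.

```python
import pandas as pd

# Toy per-NOC medal counts (illustrative numbers, not the real totals)
counts = pd.DataFrame({
    "NOC": ["FRG", "GDR", "GER", "USA"],
    "Year": [1976, 1976, 1992, 1976],
    "Medal": [5, 7, 9, 11],
})

# Map historical codes to their modern equivalents
counts["NOC"] = counts["NOC"].replace({"FRG": "GER", "GDR": "GER"})

# After the replacement GER has two rows for 1976; summing restores one row per country-year
per_games = counts.groupby(["NOC", "Year"], as_index=False)["Medal"].sum()

# The mean is now taken over Games rather than over pre-merge rows
avg_per_games = per_games.groupby("NOC")["Medal"].mean()
print(avg_per_games)
```

With one row per country-year, the mean can be read directly as medals per Games attended.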

Finally, both of our datasets are cleaned and tidy. Now we want to combine them into one dataset so that the data is easier to use and understand later on. We'll combine them into one CSV called "final_olympics_dataset.csv", sorted alphabetically by country name.

In [14]:
final_df = pd.read_csv('data/00-raw/final_olympics_dataset.csv')
final_df.head()

| | total_medals | Country_code | Country | year | gdp_per_capita_growth | under5_mortality | physicians_per_1000 | life_expec_under5 | population | urban_population_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | AFG | Afghanistan | 2008 | 1.677279 | 96.3 | 0.183 | 59.708 | 26482622.0 | 21.124148 |
| 1 | 1.0 | AFG | Afghanistan | 2012 | 8.279369 | 81.2 | 0.246 | 61.735 | 30560034.0 | 23.343439 |
| 2 | 0.0 | AFG | Afghanistan | 2016 | -0.300121 | 70.0 | 0.284 | 62.646 | 34700612.0 | 24.658835 |
| 3 | 0.0 | ALB | Albania | 1992 | -6.622551 | 37.6 | 1.604 | 73.303 | 3247039.0 | 36.718428 |
| 4 | 0.0 | ALB | Albania | 1996 | 8.005514 | 32.4 | 1.318 | 74.113 | 3168033.0 | 38.899604 |


## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
    - This study does not contain individual subjects; therefore, no informed consent is needed. Rather, this project accumulates data from publicly accessible websites including the World Health Organization (WHO), and World Bank.    

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
    - Bias can enter our sources in several ways. Health organizations in different countries may report their data inaccurately or with varying quality. With the Olympics, there is also probable bias toward host countries, as their chance of winning more medals increases due to geographical location. To address this problem we will use data from trusted sources such as the WHO and World Bank, add economic information so that countries can be compared more fairly, and take into account which country hosts the Olympics in a given year.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
    - For this project our group is not using information from real people directly, rather, our data is from countries' medals and health reports. No individual can be identified from this data. 

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
    - Within our project we will take care not to carry bias towards any particular country when analyzing results. Furthermore, we will identify patterns across different countries, keeping in mind that Olympic performance can also relate to a country's socioeconomic status and other imbalances between countries.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
    - The data that we will collect is not privately owned, and does not use information from specific individuals. We will use shared project folders and Github to store all of our project data. Only our group members will have access to it. 

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed? 
    - This project does not use personal data from individuals, meaning that there is no data that will need to be removed. 

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
    - We plan on keeping this data while working on our project. Once the project is finished, there will be no additional need to use this information.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
    - Some blind spots we are concerned about are the impact of economic and demographic factors on Olympic success. We are addressing these potential blind spots in our analysis by measuring how much average prediction error is achieved by models using solely economic and demographic variables, and comparing it to the error when national health indicators are added.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
    - Most dataset biases will be addressed as we evaluate our model with and without national health indicators in an effort to uncover omitted confounding variables and/or imbalanced classes. Using a forward-chaining cross-validation model will help mitigate overfitting of our data and ensure our model performs well with unseen data. 

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
    - We will report a baseline error, error with solely economic and demographic variables, and error with national health indices variables included. This is done in an effort to provide the reader with a transparent analysis of our data and model. 

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
    - Our data will not involve the names or any identifiable information about Olympic athletes. The datasets that we will use will contain information solely pertaining to countries' medal count and overall health/economic metrics.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
    - Our project will provide a step-by-step documentation of our data collection, cleaning, EDA, and analysis of our model. 


### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
    - We are reporting the average out-of-sample prediction error as an additional metric that will help us determine how well our model performs on unseen data. This metric provides a clear summary of the predictive accuracy of our model across different Olympic cycles, allowing us to compare it to different models, such as the one created solely using economic and demographic data. 

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
    - As we have clearly defined the national health indices we will use to make our predictions, we can backtrack and investigate exactly which feature of the data caused the model to make a particular decision on Olympic success. 

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
    - Our only measurement of Olympic success is total medal count, and we will not account for the type of medal (Gold, Silver, Bronze) or distinguish between different sports. We made this decision to reduce the complexity of our model, and we do not expect this simplification to have an impact on the success of the model. We are aware of the possibility that some predictions may be influenced by a country's historical investment in a particular sport and/or socio-cultural variables not included in the dataset. 


### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
    - If users are harmed by our model, we, as a team, have discussed re-evaluating the sources from which we are getting our data to ensure that it is non-discriminatory and ensures an individual's privacy. In this event, we will also re-evaluate our model to ensure that it is free of any biases that could be contributing to the harm of others. 

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
    - If necessary, our team has agreed to remove this model from the internet. 

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
    - Because this model uses economic and health indicators to predict Olympic performance, a country may manipulate its sports funding, infrastructure, or health metrics to improve its rankings and predictive performance. We will monitor this potential misuse by ensuring that the data collected is reputable and unbiased.



## Team Expectations 

- Main communications will be through iMessage or Zoom meetings. It’s reasonable to wait up to 6 hours for a response to a message sent in the chat. We will meet virtually through Zoom at least once a week.
- Communication will be polite but to the point. Teammates can be clear about concerns, issues, or differences in opinion, but in a way that is still respectful to each other.
- Decisions will be made through majority vote. If a teammate is unresponsive and the majority have already voted on the task, the decision will rest with the majority.
- There will be specializations in tasks, with tasks being delegated among members in a fair manner. This may change depending on what is needed, and roles may rotate throughout the weeks. Tasks will be assigned based on preferences, team needs, and the skills of the members. The whole team can see current tasks and progress through our group GitHub and the shared document with our project timeline.
- When a team member is struggling with a task, they should communicate their concerns to the group in person or through text at least 24 hours before the task's deadline, and meet with the TA or ask the professor if they are stuck on something the group can’t help with. If helping the member doesn’t work, the group will collectively do its best to complete the affected section, and further communication will take place to make sure work is distributed fairly.



## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/04  |  3:30 PM | Brainstormed Research Question and Hypothesis  | Edited and finalized Research Question and Hypothesis. Assigned roles for Project Proposal. | 
| 2/17  |  3:00 PM |  Look into Olympic and health data trends. Determine the two datasets we will be comparing during the meeting. | Split up work for a data checkpoint. Decide datasets and Olympic years we will be analyzing. | 
| 2/18  | 3:00 PM  | Finalizing variables and metrics used.  | Decide sample size, clarifying everyone's part. Sharing clean data files.  |
| 3/05  | 3:30 PM  | Complete analysis; write our results. | Edit and finalize our project.  |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |