# COGS 108 - Project Proposal

## Authors

Team list and credits:
- Beliz Akbulut: Research question, Hypothesis, Ethics (C-E)
- Anna Lewis: Research question, Hypothesis, Background work and research 
- Karsten Jensen: Research question, Hypothesis, Data, Conceptualization
- Camerin Oliver: Team Expectations
- Ryenn Thompson: Research question, Ethics (A-B), Project Timeline Proposal 

## Research Question

How well can national health metrics—such as life expectancy, infant mortality, and obesity rates—forecast a country’s medal total in future Summer Olympic Games when models are trained only on recent prior Games?
- Does incorporating same-year national health indicators—such as life expectancy, infant mortality, and obesity rates—reduce the average out-of-sample prediction error for Olympic medal totals when evaluated using forward-chaining cross-validation across Olympic cycles and adjusted for economic/demographic variables? 
- How much does the average out-of-sample error change when national health indicators are added to the baseline model?
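
One way to picture the forward-chaining evaluation mentioned above is the short sketch below. It is only an illustration under stated assumptions: the function name `forward_chaining_splits`, the `min_train` parameter, and the list of Games years are hypothetical and not part of our finalized design, and the sketch uses an expanding training window (a rolling window of only the most recent Games would be an alternative).

```python
from typing import Iterator, List, Sequence, Tuple


def forward_chaining_splits(
    games: Sequence[int], min_train: int = 3
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_years, test_year) pairs in chronological order.

    Each fold trains on every Games held before the test Games, so the
    model is never fitted on data from the future.
    """
    ordered = sorted(set(games))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical window of Summer Games years (illustrative only).
summer_games = [1996, 2000, 2004, 2008, 2012, 2016]
folds = list(forward_chaining_splits(summer_games, min_train=3))
for train_years, test_year in folds:
    print(f"train on {train_years} -> test on {test_year}")
```

Averaging the prediction error over these folds would give the out-of-sample error that the sub-questions refer to.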


## Background and Prior Work

The Olympic Games are a longstanding tradition of global athletic competition and patriotism, held regularly since the first modern Summer Games in 1896, which were themselves inspired by the Ancient Greek Olympic Games. This year, Italy is hosting the Winter Olympics, and the increasing media coverage of athletes and events led us to consider how we might predict national Olympic performance. 

As many of our group members are interested in public health and environmental safety, we wondered if there might be a relationship between national health indices and Olympic performance. In conducting preliminary research on the topic, we found several sources referencing this association, as well as other extraneous variables that affect Olympic performance. One research article we found, entitled “Assessment of Olympic performance in relation to economic, demographic, geographic, and social factors: quantile and Tobit approaches,” uncovered some of these factors and analyzed their effects on the 2016 Rio Olympics. The authors discussed how factors such as income classification, government corruption, and athletic health culture can strongly affect a nation’s athletic output and, ultimately, its historical performance at the Olympic Games <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). Their analysis helped us establish a link between national conditions in athletes’ home countries and their corresponding medal successes at the Olympics. As hypothesized, the data in this research indicated that developing nations have been performing better over time as their economic conditions improve, leading us to believe that we may find similar results for public health conditions.

After looking at historical data from the World Health Organization’s website, we brainstormed specific parameters that we think would most strongly influence a nation’s overall health rating, and thus support stronger and more athletic future generations. We want to determine whether “developing” countries that improved their public health scores in recent decades also performed better in the Olympic Games, and use that information to predict which nations can be expected to improve at future Games. We decided to include only Summer Olympic Games data so that differences between the Winter and Summer Games, such as geographic advantages and disadvantages between nations, seasonal health concerns, and a narrower range of sports events, would not confound our findings. 

Due to the long history of the Olympic Games, many projects have focused on predicting and analyzing athletic performance data to find trends and outliers. For example, one project found on Kaggle, authored by user EricSBrown, sought to determine how historic national economic information, specifically GDP data, could predict a family’s fantasy draft picks for the 2022 Winter Olympics <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). The creator of this project merged several different datasets, including GDP value information, historic national Olympics medal data, and comparative time tracking between the two. This research helped us understand how to find correlations between datasets, and use that information to create a predictive model for other instances. In our project, we intend to “test” our model several times by attempting to predict past games using prior data, such as predicting different nations’ performance at the 2008 Olympics using their performance and health data from between 1980 and 2004. 

Another project, published on GitHub by user jalwz17, predicted which countries would win the most medals at the 2024 Summer Olympics and the 2026 Winter Olympics <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). They used a time series prediction model to identify both countries that have consistently performed highly at the Olympic Games, including the United States and Germany, and countries that saw a rise in Olympic performance over time, such as China post-1984. This model resembles the one we want to construct in that it analyzes historical performance trends to predict future successes, but we intend to add another layer to our project by looking for a relationship that could help explain changes in performance trends. Their project included several visualization techniques that we intend to incorporate into our own work, including seaborn heatmaps and histograms, which aided in both comprehension and explanation of their findings. 

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Wang Shasha, Babar Nawaz Abbasi & Ali Sohail (2023) Assessment of Olympic performance in relation to economic, demographic, geographic, and social factors: quantile and Tobit approaches, Economic Research-Ekonomska Istraživanja, 36:1, 2080735. https://doi.org/10.1080/1331677X.2022.2080735
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://www.kaggle.com/datasets/ericsbrown/winter-olympics-prediction-fantasy-draft-picks
3. <a name="cite_note-3"></a> [^](#cite_ref-3) https://github.com/jalwz17/Olympic_Medals_Analysis



## Hypothesis


We predict that nations with stronger national health metrics (for example, higher life expectancy and lower infant mortality) perform better in the Summer Olympic Games. Based on historical trends and the development of health standards in various nations, we expect to be able to anticipate which nations will win more medals overall, and which nations can be expected to improve their medal totals over time.

## Data

Our ideal dataset would cover between 6 and 10 Summer Olympic cycles (roughly 24 to 40 years), with a preference for more recent data. Each observation would be one country in one Olympic year, giving approximately (6 to 10) × (number of countries) rows. Our target variable would be a country’s Olympic medals per capita (or a similar measure). Predictors would include health indicators such as life expectancy, infant mortality, and obesity rates, along with economic/demographic controls such as GDP per capita, population, and host-country status. We would take medal data compiled from official Olympic results and collect health and economic/demographic data from international sources such as the WHO and the World Bank. We would store the merged data as a single table that can easily be read as a CSV file, where each row is a country-Olympic year (we plan to use ISO3 country codes). 
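
As a rough illustration of that table layout, the sketch below joins three toy sources on an (ISO3 code, Olympic year) key and derives a per-capita target. Everything in it is an assumption made for illustration: the country codes and numbers are placeholders rather than real statistics, `build_rows` is a hypothetical helper, and dropping incomplete country-years is just one possible way to handle missing data.

```python
import csv
import io
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (ISO3 country code, Olympic year)

# Toy placeholder values, NOT real statistics.
medals: Dict[Key, int] = {("AAA", 2016): 10, ("BBB", 2016): 2}
population: Dict[Key, int] = {("AAA", 2016): 50_000_000, ("BBB", 2016): 4_000_000}
life_expectancy: Dict[Key, float] = {("AAA", 2016): 79.5}  # BBB is missing


def build_rows(
    medals: Dict[Key, int],
    population: Dict[Key, int],
    life_expectancy: Dict[Key, float],
) -> List[dict]:
    """Join the sources on (iso3, year), keeping only complete country-years."""
    complete = medals.keys() & population.keys() & life_expectancy.keys()
    rows = []
    for iso3, year in sorted(complete):
        key = (iso3, year)
        rows.append({
            "iso3": iso3,
            "year": year,
            "medals": medals[key],
            "medals_per_million": medals[key] * 1_000_000 / population[key],
            "life_expectancy": life_expectancy[key],
        })
    return rows


rows = build_rows(medals, population, life_expectancy)

# Write the merged table as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice we would likely do the same join with a dataframe library, but the key idea is the same: one row per country per Olympic year.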

1.  GHO indicator data (example link, life expectancy at birth): https://ghoapi.azureedge.net/api/WHOSIS_000001
The WHO provides country-year health indicator data through its Global Health Observatory (GHO) OData API. The data are publicly accessible and downloadable in CSV form, which is what we will most likely use. We will download files similar to this “Life Expectancy at Birth” dataset for other variables such as infant mortality and obesity prevalence. These variables let us measure population health directly, and we can match them to their respective Olympic years for predictive modeling. 
2. World Bank development indicator data (example dataset: all countries, GDP per capita growth (annual %), 2000-2024): https://databank.worldbank.org/source/world-development-indicators#
These World Development Indicators (WDI) datasets are publicly available through the World Bank DataBank at the link above, where any combination of countries, series, and years can be selected. We tested downloading one as a CSV, which worked without issues. We plan to use real GDP per capita (constant USD) and population as key variables, and we can also extract related indicators such as poverty rates or health expenditures if needed. Because the data are reported consistently across countries and years, they should be relatively straightforward to use when building predictive models for Olympic medal outcomes. 
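
If we end up reading the GHO endpoint directly instead of downloading CSV files, matching indicator values to Olympic years might look like the sketch below. It is hedged heavily: the record field names (`value`, `SpatialDimType`, `SpatialDim`, `TimeDim`, `Dim1`, `NumericValue`) and the both-sexes code `SEX_BTSX` are our assumptions about the JSON response and should be checked against a real download. The helper names are hypothetical, and the sample payload is made up.

```python
import json
from typing import Dict, Iterable, Optional, Tuple
from urllib.request import urlopen

GHO_BASE = "https://ghoapi.azureedge.net/api/"


def fetch_indicator(code: str) -> dict:
    """Download one indicator as decoded JSON (network call, not run here)."""
    with urlopen(GHO_BASE + code) as response:
        return json.load(response)


def to_country_year(
    payload: dict, years: Iterable[int], dim1: Optional[str] = None
) -> Dict[Tuple[str, int], float]:
    """Keep country-level records for the requested years.

    Field names are assumptions about the response format. `dim1` optionally
    restricts records to one breakdown (for example, both sexes combined).
    """
    wanted = set(years)
    table: Dict[Tuple[str, int], float] = {}
    for record in payload.get("value", []):
        if record.get("SpatialDimType") != "COUNTRY":
            continue  # skip regional or global aggregates
        if dim1 is not None and record.get("Dim1") != dim1:
            continue
        year, value = record.get("TimeDim"), record.get("NumericValue")
        if year in wanted and value is not None:
            table[(record["SpatialDim"], year)] = value
    return table


# Made-up payload in the assumed format; values are placeholders.
sample_payload = {
    "value": [
        {"SpatialDimType": "COUNTRY", "SpatialDim": "AAA", "TimeDim": 2016,
         "Dim1": "SEX_BTSX", "NumericValue": 80.0},
        {"SpatialDimType": "COUNTRY", "SpatialDim": "AAA", "TimeDim": 2015,
         "Dim1": "SEX_BTSX", "NumericValue": 79.0},
        {"SpatialDimType": "COUNTRY", "SpatialDim": "AAA", "TimeDim": 2016,
         "Dim1": "SEX_MLE", "NumericValue": 77.0},
        {"SpatialDimType": "REGION", "SpatialDim": "EUR", "TimeDim": 2016,
         "Dim1": "SEX_BTSX", "NumericValue": 78.0},
    ]
}
table = to_country_year(sample_payload, years=[2012, 2016], dim1="SEX_BTSX")
print(table)
```

With a live connection, `to_country_year(fetch_indicator("WHOSIS_000001"), olympic_years)` would be the intended call, where `olympic_years` is whatever list of Summer Games years we settle on.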


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
- This study does not involve individual human subjects; therefore, no informed consent is needed. Instead, this project draws on data from publicly accessible sources, including the World Health Organization (WHO) and the World Bank.    

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
- Our sources may contain bias in several ways. Health organizations in different countries may report their data inaccurately or at differing levels of quality. With the Olympics, there is also probable bias toward host countries, as their chance of winning more medals increases due to geographical location. To mitigate these problems, we will use data from trusted sources such as the WHO and World Bank, add economic information so that countries can be compared on a more equal footing, and account for which country hosts the Olympics in a given year. 

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
- For this project, our group is not using information about real people directly; our data consist of country-level medal counts and health reports. No individual can be identified from this data. 

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
- We will take care not to favor or disfavor any particular country when analyzing results. We will also look for patterns across countries, keeping in mind that Olympic performance can relate to a country's socioeconomic status and other imbalances between countries. 

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
- The data that we will collect is not privately owned, and does not use information from specific individuals. We will use shared project folders and GitHub to store all of our project data. Only our group members will have access to it. 

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed? 
- This project does not use personal data from individuals, meaning that there is no data that will need to be removed. 

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
- We plan to keep this data while working on our project. Once the project is finished, we will have no further need for it. 

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
- One blind spot we are concerned about is the effect of economic and demographic factors on Olympic success. We address it by measuring the average prediction error of models that use only economic and demographic variables and comparing it to the error when national health indicators are added.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
- We will address most dataset biases by evaluating our model with and without national health indicators, in an effort to uncover omitted confounding variables and/or imbalanced classes. Evaluating with forward-chaining cross-validation will help guard against overfitting and check that our model performs well on unseen data. 

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
- We will report a baseline error, error with solely economic and demographic variables, and error with national health indices variables included. This is done in an effort to provide the reader with a transparent analysis of our data and model. 

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
- Our project will provide step-by-step documentation of our data collection, cleaning, EDA, and model analysis. 
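
To make the error comparison described under C.1 and C.3 concrete, here is a minimal sketch. It assumes mean absolute error (MAE) as the error metric, which we have not finalized, and all numbers are placeholders rather than results. The function names and the "naive baseline" label are hypothetical.

```python
from typing import Sequence, Tuple


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute gap between actual and predicted medal totals."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def error_change(reference_mae: float, new_mae: float) -> Tuple[float, float]:
    """Return (absolute change, relative change); negative means less error."""
    absolute = new_mae - reference_mae
    relative = absolute / reference_mae if reference_mae else float("nan")
    return absolute, relative


# Placeholder medal totals for one held-out Games; these are NOT real results.
actual = [10, 4, 0, 25]
predictions = {
    "naive baseline": [6, 7, 2, 18],
    "economic/demographic": [7, 6, 1, 20],
    "economic/demographic + health": [9, 5, 1, 22],
}
errors = {name: mean_absolute_error(actual, pred) for name, pred in predictions.items()}
absolute, relative = error_change(
    errors["economic/demographic"], errors["economic/demographic + health"]
)

for name, mae in errors.items():
    print(f"{name}: MAE = {mae:.2f}")
print(f"change from adding health indicators: {absolute:+.2f} ({relative:+.1%})")
```

Reporting all three errors side by side, plus the change between the last two, is what would let a reader judge how much the health indicators add.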


### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
- We are reporting the average out-of-sample prediction error as an additional metric that will help us determine how well our model performs on unseen data. This metric provides a clear summary of the predictive accuracy of our model across different Olympic cycles, allowing us to compare it to different models, such as the one created solely using economic and demographic data. 

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
- As we have clearly defined the national health indices we will use to make our predictions, we can backtrack and investigate exactly which feature of the data caused the model to make a particular decision on Olympic success. 

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
- Our only measurement of Olympic success is total medal count, and we will not account for the type of medal (Gold, Silver, Bronze) or distinguish between different sports. We made this decision to reduce the complexity of our model, and we do not expect this simplification to have an impact on the success of the model. We are aware of the possibility that some predictions may be influenced by a country's historical investment in a particular sport and/or socio-cultural variables not included in the dataset. 



### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 


* Main communications will be through iMessage or Zoom meetings. It’s reasonable to wait up to 6 hours for a response to a message sent in the chat. We will meet virtually over Zoom at least once a week.
* Communication will be polite but to the point. Members can be clear about concerns, issues, or differences of opinion, as long as they remain respectful to teammates.
* Decisions will be made by majority vote. If a teammate is nonresponsive and the majority has already voted on the task, the majority decision will stand.
* Tasks will be specialized and delegated among members in a fair manner. This may change depending on what is needed, and roles may rotate throughout the weeks. Tasks will be assigned based on preferences, team needs, and the skills of the members. The whole team can see current tasks and progress through our group GitHub and this shared document with our project timeline.
* When a team member is struggling with a task, they should tell the group in person or by text at least 24 hours before the task’s deadline, and meet with the TA or ask the professor if the group can’t point them in the right direction. If helping the member doesn’t work, the group will collectively do its best to complete the affected section, and we will follow up to make sure work stays distributed fairly.



## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/04  |  3:30 PM | Brainstormed Research Question and Hypothesis  | Edited and finalized Research Question and Hypothesis. Assigned roles for Project Proposal. | 
| 2/16  |  3:30 PM |  Import and clean Olympic and health data; explore data trends. | Review trends, collectively decide on approach to analysis. | 
| 2/22  | 3:30 PM  | Begin checking ideas using our data.  | Discuss results and any potential changes or edits.  |
| 3/05  | 3:30 PM  | Complete analysis; write our results. | Edit and finalize our project.  |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |