# COGS 108 - Project Proposal

## Authors

* Ruobing Wang: Experimental investigation, Visualization
* Zhuoqing Tang: Methodology, Writing – original draft
* Chaoyan Lei: Analysis, Conceptualization, Project administration
* Ziqing Yuan: Background research, Writing – review & editing
* Zhongzheng Liu: Data curation, Software

## Research Question

This project examines the relationship between pre-race betting odds and the average rate of return in horse races held at Saga Racecourse. Our research question is: do horses with lower pre-race odds have different average returns than horses with higher pre-race odds? The independent variable is the pre-race betting odds, and the dependent variable is the average rate of return. To assess the relationship between these variables, we will use exploratory data analysis (EDA) together with statistical methods such as correlation analysis and regression analysis. The project aims only to identify correlational relationships and makes no causal claims.


## Background and Prior Work

Horse racing is one of the oldest organized sports in the world, with a documented history dating back to ancient civilizations such as Greece, Rome, and China. Modern horse racing developed in Britain during the 17th and 18th centuries and later spread to the United States, where it was closely tied to gambling and probability-based decision making.<a href="#ref1">1</a> Today, horse racing generates large-scale structured data, including race conditions, horse characteristics, jockey performance, and betting odds, making it a natural domain for quantitative analysis and predictive modeling.

Beyond its historical and cultural significance, horse racing has long served as an important real-world testbed for studying human decision-making under uncertainty. Betting markets associated with horse races have frequently been analyzed to understand whether markets efficiently incorporate available information.

Prior economic research has shown that while betting odds often approximate true winning probabilities, systematic biases persist across races and contexts. One example is the “favorite–longshot bias,” in which longshots are overbet and favorites are underbet.<a href="#ref2">2</a> These findings suggest that even in environments with rich information, human judgment and behavior can deviate from purely rational models.

This work is relevant to our project because it establishes horse racing as a meaningful domain for studying probabilistic judgment and prediction. The presence of systematic biases suggests that historical race data contains informative patterns that can be analyzed and modeled, rather than being purely random or fully efficient. Building on this foundation, our project uses modern data analysis techniques to further explore predictive relationships within horse racing.

In this project, we use horse racing data as a case study for applying data analysis and predictive modeling to real-world decision-making contexts. By examining how measurable race and participant features relate to outcomes, we aim to better understand both the predictive structure of the data and how people react to it.

<a name="ref1">1</a>. Encyclopaedia Britannica. Horse racing. https://www.britannica.com/sports/horse-racing
<a href="#top">^</a>

<a name="ref2">2</a>. Thaler, R. H., & Ziemba, W. T. (1988). Parimutuel betting markets: Racetracks and lotteries. Journal of Economic Perspectives, 2(2), 161–174. https://www.aeaweb.org/articles?id=10.1257/jep.2.2.161
<a href="#top">^</a>



## Hypothesis


We hypothesize that, in horse races held at Saga Racecourse, horses with lower pre-race betting odds (i.e., higher market-implied probabilities of winning) tend to have lower average rates of return. This hypothesis rests on the observation that the betting market usually incorporates public information, such as the horses' past performance and information about the jockeys, so the odds partly reflect the relative strength of each horse and compress the potential return on horses that are likely to win.


## Data

To investigate the research question, we need a dataset covering every entry in every race held at Saga Racecourse, with the winning odds and the actual payout for each entry. The actual payout equals the winning odds multiplied by a win indicator (1 if the horse won, 0 if it lost), so we need the winning odds plus either the win indicator or the actual payout. Within a continuous time range, the data should contain no duplicate or missing entries; covering the full population of races in this way helps avoid sampling bias. Although the odds and the win indicator are the only required variables, unique identifiers for races and racehorses are strongly recommended for data management, debugging, and evaluation. About 10,000 races took place between February 2015 and December 2025, which is more than enough for a single-variable analysis.
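
The payout definition and the completeness checks described above could be sketched in pandas as follows. The column names (`race_id`, `horse_id`, `win_odds`, `won`), the sample rows, and the odds bin edges are all hypothetical placeholders chosen for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical entries (one row per horse per race); all values are made up.
entries = pd.DataFrame({
    "race_id": ["R1", "R1", "R1", "R2", "R2", "R2"],
    "horse_id": ["H1", "H2", "H3", "H1", "H4", "H5"],
    "win_odds": [1.8, 4.5, 12.0, 2.4, 3.1, 25.0],
    "won": [1, 0, 0, 0, 1, 0],  # 1 if the horse won, 0 otherwise
})

# Completeness checks: no missing values, no duplicated (race, horse) pairs.
assert not entries.isna().any().any()
assert not entries.duplicated(subset=["race_id", "horse_id"]).any()

# Actual payout per unit staked = winning odds x win indicator.
entries["payout"] = entries["win_odds"] * entries["won"]

# Average payout within odds ranges (bin edges are arbitrary placeholders).
bins = [1.0, 3.0, 10.0, float("inf")]
entries["odds_bin"] = pd.cut(entries["win_odds"], bins=bins, right=False)
avg_return = entries.groupby("odds_bin", observed=True)["payout"].mean()
print(avg_return)
```

Grouping by odds range is only one possible way to summarize average return; the final bin edges would need to be chosen after inspecting the real distribution of odds.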

The data will be collected with a web crawler, supplemented by manual collection, from an open and viable data source such as netkeiba.com, and temporarily stored as a dictionary. After conversion to tabular form, the data can be saved as a CSV file with one row per entry (that is, per horse per race) and analyzed as pandas DataFrames in Python.
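
A minimal sketch of the dictionary-to-CSV conversion described above is shown here. The nested structure of `scraped`, the field names, and the values are assumptions made for illustration; the real crawler output may look different.

```python
import csv
import io

# Hypothetical scraped structure: race_id -> list of per-horse records.
scraped = {
    "R1": [
        {"horse_id": "H1", "win_odds": 1.8, "won": 1},
        {"horse_id": "H2", "win_odds": 4.5, "won": 0},
    ],
    "R2": [
        {"horse_id": "H3", "win_odds": 2.4, "won": 0},
    ],
}

# Flatten to one row per horse per race.
rows = [
    {"race_id": race_id, **record}
    for race_id, records in scraped.items()
    for record in records
]

# Write CSV to an in-memory buffer; an open file object would work the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["race_id", "horse_id", "win_odds", "won"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Once written to disk, a file in this shape could be loaded for analysis with `pandas.read_csv`.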

One candidate dataset is the 'JRA日本中央競馬会 Horse Racing Dataset' at https://www.kaggle.com/datasets/takamotoki/jra-horse-racing-dataset. This dataset is open for use, but it was last updated in 2021, before the widespread use of artificial intelligence, which may have had unknown effects on bettors' ability to identify and wager on horses and therefore on the behavior of the odds. The dataset contains all Japan Racing Association racing data, which does not include races at Saga Racecourse. It does, however, contain the win and odds variables that our project requires.

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - Our data does not include information about human individuals.
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - We will collect data on all races over a continuous period of about 10 years, with no duplicate or missing entries, to reduce collection bias.
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - Our data does not include information about human individuals.
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
 - Our data does not include information about human individuals.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - We use only open, publicly available data.
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - Our data does not include information about human individuals.
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
 - We use only open, publicly available data.

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - The data is collected from a single racecourse (Saga Racecourse) and therefore does not represent all horse races across different regions. In addition, betting odds are market prices rather than natural probabilities, as betting companies may adjust odds to manage risk and avoid losses.
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - The analysis focuses on aggregated group-level results, and we avoid discussing or highlighting individual horses or jockeys.
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
 - We clearly document how the data was obtained, the analytical steps taken, the conclusions drawn, and the formulas used.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - The model only uses race-related variables such as odds and outcomes and does not include or proxy any protected human attributes.
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - This project does not make decisions affecting human groups, so group-level fairness testing is not applicable.
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considering additional metrics?
 - We considered how different performance metrics (e.g., average return versus risk) may affect the interpretation of results.
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - We will analyze the data using simple statistical summaries and explain the results in clear, non-technical terms.
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - The model is not deployed in a real-world system and is only used for academic analysis, so ongoing monitoring is not required.
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - Since the analysis does not directly impact users or decision-making, a formal redress mechanism is not applicable.
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - The project does not involve a production system, so rollback mechanisms are unnecessary.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
 - We clearly document how the data was obtained, the analytical steps taken, the conclusions drawn, and the formulas used.


## Team Expectations 

* *Team Expectation 1*: **Timely communication and response.** Each team member must maintain regular communication. When a teammate asks a question about the project, members must respond by 11:59 PM the same day. Team members must also report their progress regularly in the group chat.
* *Team Expectation 2*: **Complete assigned tasks on time.** Each team member must complete their assigned portion within the specified timeframe. A member who cannot finish on time must promptly notify the team in the group chat and discuss a new deadline.
* *Team Expectation 3*: **Active participation and discussion.** All team members are expected to participate actively in project discussions and to attend weekly team meetings on time. A member who cannot attend must promptly notify the group chat. During discussions, everyone should share their ideas and raise questions.
* *Team Expectation 4*: **Mutual respect and cooperation.** Everyone should respect each other's work. We encourage differing viewpoints, but we do not tolerate insults or personal accusations against other members. Discussions should be conducted rationally.
* *Team Expectation 5*: **Resolve conflicts.** When conflicts arise, we will first try to resolve them within the team. If that is not possible, we will promptly seek help from the relevant party or the professor. We will address issues fairly.

## Project Timeline Proposal



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |