# COGS 108 - Data Checkpoint

- Ryo Sakai: Writing - original draft, Writing - review & editing
- Servando Lara: Background research, Writing - original draft, Writing - review & editing
- Brook Isaac: 
- Kota Fukuda: Writing - original draft, Writing - review & editing
- Elisey Ovchinnikov: 


## Research Question

How well can season-to-season changes in professional male soccer players’ inflation-adjusted market values (as reported by Transfermarkt) from 2013–2023 be predicted using age, primary playing position (goalkeeper, defender, midfielder, forward), total league minutes played, and standardized performance metrics (goals, assists, expected goals, expected assists, key passes, defensive actions per 90)?

Among these variables, which are the strongest predictors of positive versus negative changes in market value between consecutive seasons?
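
To make the outcome variable concrete, the following is a minimal sketch of how a season-to-season percentage change in inflation-adjusted market value, plus a per-90 metric, could be computed with pandas. The column names, the toy values, and the `price_index` deflator are all illustrative assumptions; they do not describe the schema or contents of any dataset we have loaded.

```python
import pandas as pd

# Hypothetical toy data: one row per player-season. Column names and values
# are illustrative assumptions, not taken from a real dataset.
seasons = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "season": [2013, 2014, 2015, 2013, 2015],
    "market_value_eur": [10.0, 12.0, 9.0, 5.0, 8.0],  # nominal, in millions
    "minutes": [1800, 2700, 900, 450, 3000],
    "goals": [10, 15, 2, 1, 20],
})

# Hypothetical price index (base season = 1.0) used to deflate nominal values.
price_index = {2013: 1.00, 2014: 1.02, 2015: 1.05}


def add_value_change(df, index):
    """Add real value, season-to-season percentage change, and goals per 90."""
    out = df.sort_values(["player_id", "season"]).copy()
    out["real_value"] = out["market_value_eur"] / out["season"].map(index)

    grouped = out.groupby("player_id")
    prev_value = grouped["real_value"].shift(1)
    prev_season = grouped["season"].shift(1)

    # Only keep changes between consecutive seasons; gaps become NaN.
    consecutive = (out["season"] - prev_season) == 1
    out["pct_change"] = ((out["real_value"] - prev_value) / prev_value).where(consecutive)

    out["goals_per90"] = out["goals"] / out["minutes"] * 90
    return out


result = add_value_change(seasons, price_index)
print(result[["player_id", "season", "real_value", "pct_change", "goals_per90"]])
```

Leaving the change undefined when a player skips a season (player 2 in the toy data) keeps the outcome strictly "between consecutive seasons", at the cost of dropping those rows from any later model.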


## Background and Prior Work

Professional soccer operates as a global labor market in which player valuations influence transfers, wages, sponsorship visibility, and long-term club strategy. Top European leagues generate billions of dollars annually in revenue, and transfer spending alone regularly exceeds several billion euros per season.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) In this context, how players are valued has real economic consequences for clubs, agents, and players themselves. Because actual transfer fees only materialize when a transfer occurs, researchers and analysts often rely on publicly available “market value” estimates as a consistent proxy for how the market rates a player at a given moment. These values allow large-scale cross-league and cross-season statistical analysis that would not be possible using transfer fees alone. 

A widely used public source of such estimates is Transfermarkt, which publishes crowd-informed market values. Transfermarkt explicitly distinguishes its estimates from realized transfer fees and describes its valuation process as combining pricing logic with structured, moderated community discussion.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This is crucial for our project: it implies that market value captures meaningful signals (age, contract status, performance, reputation) but may also reflect perception-based biases such as league visibility, media exposure, or hype. Thus, market value is best understood not as a “true” intrinsic worth, but as an observable outcome of a social and economic evaluation process. 

Prior academic research has investigated which factors explain player market value. A recent systematic review synthesizes evidence that performance metrics, age, positional role, contract characteristics, and contextual variables all contribute to valuation patterns.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Müller et al. (2017) developed a data-driven framework to estimate market values across multiple European leagues and demonstrated that statistical modeling can reproduce meaningful valuation patterns using performance and contextual indicators.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Their work shows that valuation is partially predictable and structured, rather than arbitrary. 

However, much of the existing literature focuses either on specific subsets of leagues (e.g., top European competitions) or on predictive accuracy rather than interpretability. In other words, prior work often asks: Can we estimate market value? Our project instead emphasizes: Which measurable attributes are most strongly associated with market value in our dataset, and how do contextual factors compare to on-field performance? By focusing on interpretability within a regression-style framework, we aim to clarify the relative contribution of performance, demographic, and contextual predictors rather than maximizing prediction alone. 

Researchers have also documented structural effects that may influence valuation. For example, studies of relative age effects find that players born earlier in the selection year may have higher market values in youth contexts, suggesting that developmental structures shape perceived value.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) This highlights a broader concern: even when performance metrics are included, market value may embed selection effects and institutional bias. Additionally, competing “algorithmic valuation” proposals—such as those referencing CIES-style models incorporating age and contract length—demonstrate that valuation is multi-factor and contested, with no universally accepted formula.<a name="cite_ref-6"></a>[<sup>6</sup>](#cite_note-6) 

These prior studies motivate and delimit our contribution. Rather than proposing a new proprietary valuation model or claiming to identify a “true” player worth, our goal is to empirically examine associations between measurable player attributes (performance statistics, playing time, age, position, and league/club context) and publicly reported market value within our dataset. By situating our analysis within this broader literature, we treat market value as a socially informed economic signal and focus on interpreting patterns rather than asserting causal or normative conclusions. 

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Franceschi, M. et al. (2024). Determinants of football players’ valuation: A systematic review. Journal of Economic Surveys. https://onlinelibrary.wiley.com/doi/10.1111/joes.12552 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Transfermarkt. (2021). Transfermarkt Market Value explained – How is it determined? https://www.transfermarkt.co.in/transfermarkt-market-value-explained-how-is-it-determined-/view/news/385100 
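3. <a name="cite_note-3"></a> [^](#cite_ref-3) Franceschi, M. et al. (2024). Determinants of football players’ valuation: A systematic review. Journal of Economic Surveys. https://onlinelibrary.wiley.com/doi/10.1111/joes.12552 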
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Müller, O., Simons, A., & Weinmann, M. (2017). Beyond crowd judgments: Data-driven estimation of market value in association football. European Journal of Operational Research. https://www.sciencedirect.com/science/article/pii/S0377221717304332 
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Romann, M. et al. (2021). How Relative Age Effects Associate with Football Players’ Market Values. Frontiers in Sports and Active Living. https://pmc.ncbi.nlm.nih.gov/articles/PMC8309691/ 
6. <a name="cite_note-6"></a> [^](#cite_ref-6) Associated Press. (2023). Infantino refloats idea of using algorithm to set soccer player transfer fees. https://apnews.com/article/6b7f4c40581e08e69e4198bdcd579773 


## Hypothesis


We hypothesize that season-to-season percentage changes in players’ inflation-adjusted market value (Transfermarkt, 2013–2023) are significantly predicted by age and standardized performance metrics, including goals per 90 minutes and assists per 90 minutes, after controlling for total league minutes played and primary playing position (goalkeeper, defender, midfielder, forward). Specifically, we expect a nonlinear relationship between age and market value change (with younger players exhibiting larger positive changes), and a positive linear association between offensive performance metrics and market value growth.

Additionally, we hypothesize that nationality, operationalized as a categorical variable grouped by major footballing regions (e.g., countries hosting the Top-5 European leagues vs. all other regions), will remain a statistically significant predictor in a multivariate regression model, reflecting differences in league exposure and market demand.
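
One way the hypothesized model could be specified is sketched below: an ordinary least squares fit with a centered age term, its square (for the nonlinear age effect), a per-90 performance metric, minutes played, and dummy-coded position. The data are synthetic and the generating coefficients are arbitrary, chosen only so the fit can be checked against known values. The column names, the centering age of 26, and the use of `numpy.linalg.lstsq` rather than a dedicated statistics package are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data only: these are NOT real players and the coefficients
# below are arbitrary values used to generate a known signal.
df = pd.DataFrame({
    "age": rng.uniform(17, 36, n),
    "goals_per90": rng.uniform(0, 1, n),
    "minutes": rng.uniform(0, 3400, n),
    "position": rng.choice(["GK", "DF", "MF", "FW"], n),
})
age_c = df["age"] - 26  # center age to reduce collinearity with its square
df["pct_change"] = (
    0.10 - 0.03 * age_c - 0.002 * age_c**2
    + 0.40 * df["goals_per90"]
    + rng.normal(0, 0.05, n)
)


def design_matrix(data):
    """Build the regressor matrix: intercept, age terms, performance, controls."""
    age_centered = data["age"] - 26
    X = pd.DataFrame({
        "intercept": 1.0,
        "age_c": age_centered,
        "age_c_sq": age_centered**2,
        "goals_per90": data["goals_per90"],
        "minutes_k": data["minutes"] / 1000,  # minutes in thousands
    })
    # Dummy-code position, dropping one level as the reference category.
    dummies = pd.get_dummies(data["position"], prefix="pos", drop_first=True, dtype=float)
    return pd.concat([X, dummies], axis=1)


X = design_matrix(df)
coef, *_ = np.linalg.lstsq(X.to_numpy(), df["pct_change"].to_numpy(), rcond=None)
coefs = pd.Series(coef, index=X.columns)
print(coefs.round(4))
```

A negative coefficient on `age_c_sq` corresponds to the concave age profile described in the hypothesis. Significance tests would require standard errors, which this sketch does not compute.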


## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
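
As a rough illustration of steps 3.3 through 3.6 above (size, missingness, outliers, cleaning), here is a minimal pandas sketch on a made-up frame. The column names, the toy values, the 1.5 × IQR rule, and the choice to drop rows with missing `xg` are assumptions for demonstration only; the appropriate choices depend on the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; columns and values are illustrative assumptions.
players = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5, 6],
    "league": ["A", "A", "A", "B", "B", "B"],
    "minutes": [1800, 2500, 900, 2000, 1500, 3000],
    "xg": [4.0, 7.5, 1.2, np.nan, np.nan, 9.0],
    "market_value_eur": [5.0, 8.0, 2.0, 6.0, 4.0, 150.0],
})

# Size of the dataset (rows, columns).
print(players.shape)

# Share of missing values per column.
missing_share = players.isna().mean()

# Is missingness in xg related to league? A large gap between groups
# suggests the data are not missing at random.
missing_by_league = players["xg"].isna().groupby(players["league"]).mean()


def iqr_outliers(series, k=1.5):
    """Flag values more than k * IQR outside the interquartile range."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Flag (rather than delete) suspicious market values for manual review.
players["value_outlier"] = iqr_outliers(players["market_value_eur"])

# One possible cleaning choice: drop rows missing a required predictor.
clean = players.dropna(subset=["xg"])
print(missing_share, missing_by_league, clean, sep="\n\n")
```

Flagging outliers instead of removing them keeps genuinely high-value players in the data while still making them easy to inspect.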


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.
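
If market values and performance statistics come from separate sources, one way to combine them is a keyed merge on player and season. The sketch below uses two made-up frames; the key columns `player_id` and `season`, and the assumption that both sources share a player identifier, are hypothetical. In practice, matching players across sources may require name or birthdate reconciliation first.

```python
import pandas as pd

# Hypothetical toy frames standing in for two separate data sources.
values = pd.DataFrame({
    "player_id": [1, 1, 2, 3],
    "season": [2013, 2014, 2013, 2013],
    "market_value_eur": [10.0, 12.0, 5.0, 7.0],
})
stats = pd.DataFrame({
    "player_id": [1, 1, 2, 4],
    "season": [2013, 2014, 2013, 2013],
    "goals": [10, 15, 1, 3],
})

# Outer merge with an indicator column so unmatched rows are visible;
# validate raises an error if a key is duplicated on either side.
merged = values.merge(
    stats,
    on=["player_id", "season"],
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# How many rows matched, and how many exist in only one source?
match_counts = merged["_merge"].value_counts()
print(match_counts)

# Keep only player-seasons present in both sources for modeling.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Inspecting `match_counts` before filtering shows how much data the join would discard, which is itself a potential source of selection bias.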

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> This is relevant to our project because soccer player market values are not neutral measures of ability: they are affected by media exposure, league prestige, and subjective evaluation by clubs and analysts. As a result, the data may systematically overrepresent players from major European leagues while undervaluing players from less visible leagues or countries, which may introduce bias into any analysis that treats market value as an objective outcome.
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> We acknowledge that market value metrics may overlook non-quantifiable aspects of player contribution, such as leadership or tactical role. These limitations are considered when interpreting results and discussed explicitly in our analysis.
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> We recognize the risk that results could be misinterpreted as objective evaluations of player ability. To mitigate this, we clearly state that findings are descriptive and context-dependent, and we discourage any real-world decision-making based on our analysis.


## Team Expectations 

* *Actively Participate*
* *Join weekly meetings*
* *Ask questions*
* *Let others know before pushing code to GitHub*


## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3  |  7 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Edit, finalize, and submit proposal; Search for datasets | 
| 2/10  |  7 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics | 
| 2/17  | 7 PM  | Finalize Datasets/cleaning data  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/24  | 7 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 3/3  | 7 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/10  | 7 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/17  | Before 11:59 PM  | Final touch-ups and Final Thoughts | Turn in Final Project & Group Project Surveys |