## Motivation

Of all Washington state employees (the state, not the university), the highest paid are head coaches employed at [public institutions](https://fiscal.wa.gov/Salaries.aspx). Their salaries range from $1-3 million per year, exceeding even that of the university president. This isn't unique to Washington universities: the 20 highest-paid state employees in California are [head coaches in basketball or football](https://transparentcalifornia.com/salaries/all/). A common defense is that these coaches are paid more because they generate profit for the university. In reality, very few athletic departments have a [net positive revenue](https://www.bestcolleges.com/news/analysis/2020/11/20/do-college-sports-make-money/). Another argument is that athletic wins affect the number of university applicants. While this was found to be [true in 2009](https://www.researchgate.net/publication/23780052_The_Impact_of_College_Sports_Success_on_the_Quantity_and_Quality_of_Student_Applications#:~:text=Key%20findings%20include%20the%20following,in%20application%20rates%20after%20sports), it is unclear whether the trend still holds 12 years later.

## Related Work

In 2018, The Atlantic published an article, [College Sports Are Affirmative Action for Rich White Students](https://www.theatlantic.com/education/archive/2018/10/college-sports-benefits-white-students/573688/). Despite the diversity seen in high-profile collegiate sports like football and basketball, the overwhelming majority of student athletes are white. The article reports statistics that challenge common beliefs and surface curious relationships. One interesting finding is that 65% of student athletes in the NCAA are white. Another is that athleticism stifles Asian men's chances at Harvard during the admissions process. A third is that a majority of student athletes have household incomes of $250,000, and that many sports are associated with country-club-like activities such as sailing, lacrosse, and ice hockey, in which fewer than 10% of student athletes across all NCAA teams are black.

A 2017 article, [The March Madness Application Bump](https://www.theatlantic.com/education/archive/2017/03/the-march-madness-application-bump/519846/), discusses the impact that March Madness has on the number of applications a university receives. Cinderella teams are perceived underdog universities that gain popularity after beating larger rivals. While the number of applicants increases, the article found that the SAT scores of those applicants skew slightly lower. Because students apply to multiple colleges, sports could influence which fall-back schools they choose.

The [College Athletics Database](https://knightnewhousedata.org) addresses funding in athletic departments: it documents where funding comes from and where it is spent. While the data visualization website has some usability issues, it does help increase transparency about institutional funding.

https://www.washingtonpost.com/graphics/2018/sports/ncaa-applicants/

## Data

The sports datasets cover the *money-maker* sports of collegiate athletics:

- [College Basketball](https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset)
- [College Football](https://www.kaggle.com/datasets/jeffgallini/college-football-team-stats-2019?select=cfb21.csv)

#### College Basketball

This dataset comes from Kaggle.com. The original data is scraped from [Barttorvik.com](https://barttorvik.com/trank.php#), run by Bart Torvik, who appears to be an attorney with an extreme interest in NCAA basketball. His data is pulled from the NCAA stats website, and he is transparent about the dataset and the transformations involved. The Kaggle version additionally facets the data by year and replaces variable names with more understandable terminology. While there is no license on the original Bart Torvik dataset, his blog suggests that he is okay with data science projects using it. The Kaggle dataset is in the public domain.

Variables of interest include:

| Variable | Definition | Use Case |
|----------|------------|----------|
| Year     | Year of data recorded | Track the university scores over time |
| Team | The collegiate basketball team | Define certain schools of interest |
| Conf | Conference the basketball team plays in | Determine top school of each region |
| Games Played | Number of games played that year | Determine win ratio that year |
| Games Won | Number of games won by team | Determine win ratio that year |
| Power Rating | Chance of beating an average team | Help determine success of the team nationally |

#### College Football

The college football dataset also comes from Kaggle.com. Its data is pulled directly from the NCAA website and is available under an Open Data Commons license.

Variables of interest include:

| Variable | Definition | Use Case |
|----------|------------|----------|
| Year     | Year of data recorded | Track university scores over time |
| Team     | Collegiate football team | Define schools of interest |
| Conference | Conference the football team plays in | Determine top schools of each region |
| Games    | Number of games played that year | Determine win ratio that year |
| Win      | Number of games won that year | Determine win ratio that year |

#### College ScoreCard Dataset

An additional [dataset](https://collegescorecard.ed.gov/data/) from the US Department of Education can be used to compare these athletic datasets with university performance. The dataset does not list a specific license, so it is assumed to be in the public domain.

Variables of interest include:

| Variable | Definition | Use Case|
| ---------|------------|---------|
| INSTNM   | Institution name | Key for joining with the sports datasets above |
| Control  | Structure as public, private nonprofit, for-profit | Filter to only include public schools as they have published employee salaries |
| AVGFACSAL | Average faculty salary | Compare with the traditional head coach salary |
| ADM_RATE | Admission rate | Lower admission rates correlate with more applicants |
| COSTT4_A | Cost of attendance for academic year | Understand whether students pay more for school earnings |
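
The join on `INSTNM` and the filter on `Control` could be sketched as follows. This is only a rough sketch under several assumptions: that public institutions are coded as `1` in the control field (worth checking against the College Scorecard data dictionary, where the column name may also be capitalized differently), and that team names do not match `INSTNM` exactly, so a hand-built lookup is needed. The lookup `TEAM_TO_INSTNM`, the function names, and every sample row are hypothetical and made up for illustration.

```python
# Assumption: the control field codes public institutions as "1".
PUBLIC = "1"

# Hypothetical hand-built lookup from a sports-dataset team name to INSTNM.
TEAM_TO_INSTNM = {
    "Example St.": "Example State University",
    "Sample": "Sample Private College",
}

# Made-up rows standing in for the real CSV files.
scorecard_rows = [
    {"INSTNM": "Example State University", "Control": "1", "ADM_RATE": "0.50"},
    {"INSTNM": "Sample Private College", "Control": "2", "ADM_RATE": "0.30"},
]
team_rows = [
    {"Team": "Example St.", "Year": 2018},
    {"Team": "Sample", "Year": 2018},
    {"Team": "Unlisted", "Year": 2018},
]


def public_institutions(rows):
    """Index the public institutions by their INSTNM value."""
    return {row["INSTNM"]: row for row in rows if row["Control"] == PUBLIC}


def join_teams(teams, scorecard):
    """Attach Scorecard fields to each team row.

    Teams that are not public, or that have no entry in the lookup,
    are dropped from the result.
    """
    public = public_institutions(scorecard)
    joined = []
    for team in teams:
        instnm = TEAM_TO_INSTNM.get(team["Team"])
        if instnm in public:
            joined.append({**team, **public[instnm]})
    return joined


joined_rows = join_teams(team_rows, scorecard_rows)
```

Keeping the name lookup explicit makes unmatched teams easy to spot, at the cost of building the mapping by hand.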

## Unknowns

While some sites report the number of applicants an individual university received in a particular year, I have not found one that provides this as a complete dataset. Head coach salaries and applicant counts per university would therefore have to be collected manually. The biggest barrier is balancing this project with my capstone project, which I will prioritize.

## Feedback (Project Proposal)

My original idea was about university financial transparency and Washington state employee salary data. My peers' feedback questioned how I was defining transparency and what conclusions I could draw from salary data. Instead of treating individual salaries as a measure of university transparency, I will look at school-wide data. Instead of asking how a state employee can have a salary of $3 million, I can focus on a more quantifiable question: do sports have a significant impact on the university?

## Research Questions

1. Is there a correlation between team success and coach pay?
2. Do more people apply if the team does better?
3. Is a high coach salary justified?

## Methods

1. Download the athletics and education datasets.
2. Merge with the education variable "Control" and filter to public institutions only.
3. Use the win-loss record to determine the win ratio in a given year.
4. Find the top 3 and bottom 3 schools per year per sport, for 96 rows in total.
5. Create a new spreadsheet, manually filling in the head coach salary for that year and the admission rate for the following school year.
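
The win-ratio and ranking steps above might look roughly like the sketch below. It assumes each row carries a year, a team name, games played, and games won; the field names (`year`, `team`, `games`, `wins`), the function names, and the sample season are hypothetical and do not come from the Kaggle files.

```python
from collections import defaultdict


def win_ratio(games_played, games_won):
    """Fraction of games won, or None when no games were played."""
    return games_won / games_played if games_played else None


def top_and_bottom(rows, n=3):
    """Return {year: (top_n, bottom_n)} with teams ranked by win ratio.

    If a year has fewer than 2 * n teams, the two lists overlap.
    """
    by_year = defaultdict(list)
    for row in rows:
        ratio = win_ratio(row["games"], row["wins"])
        if ratio is not None:
            by_year[row["year"]].append({**row, "win_ratio": ratio})

    result = {}
    for year, teams in by_year.items():
        ranked = sorted(teams, key=lambda t: t["win_ratio"], reverse=True)
        result[year] = (ranked[:n], ranked[-n:])
    return result


# Made-up season of seven teams, each playing ten games.
sample_rows = [
    {"year": 2018, "team": name, "games": 10, "wins": wins}
    for name, wins in zip("ABCDEFG", [9, 8, 7, 5, 3, 2, 1])
]
top, bottom = top_and_bottom(sample_rows)[2018]
print([t["team"] for t in top], [t["team"] for t in bottom])
```

Running this once per sport would give six schools per year per sport, which is the unit the manual salary spreadsheet is built from.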

## Hypothesis

**Coach salary is negatively correlated with the admission rate of the school.**

## Null Hypothesis H(0)

**Coach salary has no correlation with the admission rate of a university.**
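
One possible way to test the hypothesis against the null is a Pearson correlation with a one-sided permutation test. This is a suggested technique, not a method the proposal commits to, and the sketch assumes the manually collected spreadsheet yields paired lists of salaries and following-year admission rates. The function names and all numbers below are made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient; both lists must vary."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """One-sided p-value for a negative correlation.

    Shuffles ys to break any pairing, and counts how often the shuffled
    correlation is at least as negative as the observed one.
    """
    rng = random.Random(seed)
    observed = pearson_r(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if pearson_r(xs, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Made-up salaries (millions of dollars) and admission rates.
salaries = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
admission_rates = [0.9, 0.8, 0.7, 0.5, 0.4, 0.2]

r = pearson_r(salaries, admission_rates)
p = permutation_p_value(salaries, admission_rates)
```

A small p-value would be grounds to reject the null hypothesis, though with only a handful of manually collected rows the result should be read cautiously.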

## Relevant Assumptions

1. College football and basketball results affect the following year's admission rates, not those of the year in which the games are played.

2. The analysis stops at 2019, when the coronavirus pandemic began.