# How Holistic Are the Ivy League Universities, Really?
This Jupyter notebook contains a report as well as an analysis (in wonderful, intertwined, new-age data science format) of the undergraduate admissions practices of the eight Ivy League universities: Brown, Cornell, Harvard, Dartmouth, Columbia, Princeton, Yale, and the University of Pennsylvania.

## What do the universities claim?
**_TO-DO: Get a quote from each of their admissions pages and cite in References._**
- Brown:
- Cornell:
- Harvard:
- Dartmouth:
- Columbia:
- Princeton:
- Yale:
- University of Pennsylvania:

## The Data
We begin by importing the modules we need, then load and examine our data, taken from the [Integrated Postsecondary Education Data System](https://nces.ed.gov/ipeds/use-the-data).

In [3]:
import pandas as pd
import numpy as np
import altair as alt

In [4]:
data = pd.read_csv('data/all_data_variables.csv')
labels = pd.read_csv('data/labels.csv')
data

Unnamed: 0,UnitID,Institution Name,Member of National Athletic Association (IC2021),Member of National Collegiate Athletic Association (NCAA) (IC2021),Member of National Athletic Association (IC2020),Member of National Collegiate Athletic Association (NCAA) (IC2020),Member of National Athletic Association (IC2019),Member of National Collegiate Athletic Association (NCAA) (IC2019),Member of National Athletic Association (IC2018),Member of National Collegiate Athletic Association (NCAA) (IC2018),...,Total price for in-district students living off campus (not with family) 2019-20 (DRVIC2019),Total price for in-state students living off campus (not with family) 2019-20 (DRVIC2019),Total price for out-of-state students living off campus (not with family) 2019-20 (DRVIC2019),Total price for in-district students living on campus 2018-19 (DRVIC2018),Total price for in-state students living on campus 2018-19 (DRVIC2018),Total price for out-of-state students living on campus 2018-19 (DRVIC2018),Total price for in-district students living off campus (not with family) 2018-19 (DRVIC2018),Total price for in-state students living off campus (not with family) 2018-19 (DRVIC2018),Total price for out-of-state students living off campus (not with family) 2018-19 (DRVIC2018),Unnamed: 251
0,217156,Brown University,1,1,1,1,1,1,1,1,...,,,,73802,73802,73802,,,,
1,190150,Columbia University in the City of New York,1,1,1,1,1,1,1,1,...,86257.0,86257.0,86257.0,76856,76856,76856,83470.0,83470.0,83470.0,
2,190415,Cornell University,1,1,1,1,1,1,1,1,...,76258.0,76258.0,76258.0,73904,73904,73904,73904.0,73904.0,73904.0,
3,182670,Dartmouth College,1,1,1,1,1,1,1,1,...,,,,74359,74359,74359,,,,
4,166027,Harvard University,1,1,1,1,1,1,1,1,...,,,,71650,71650,71650,,,,
5,186131,Princeton University,1,1,1,1,1,1,1,1,...,,,,70900,70900,70900,,,,
6,215062,University of Pennsylvania,1,1,1,1,1,1,1,1,...,75480.0,75480.0,75480.0,74408,74408,74408,74408.0,74408.0,74408.0,
7,130794,Yale University,1,1,1,1,1,1,1,1,...,,,,73900,73900,73900,,,,


In [5]:
# list(data.columns)

In its "raw" form, it looks like we can say, at minimum, the following about our data:
- We have 249 variables, many of them the same measure repeated across different years.
- The data spans 2018 to 2022 at most, with some variables ending in 2021.
- Binary data is encoded using 0s and 1s (this is useful to know, as it is already in a format that is conducive to a potential model).
- There is definitely a lot of missing data.
- As of now, the year is not encoded in a way that is particularly easy to extract.
- As we delve into the analysis, we will likely need to clean and extract various aspects of this data.
- We have information on membership in the NCAA, admissions percentages by various groups, GPA + ranking information, SAT + ACT scores, enrollment breakdowns by race, and cost of attendance.
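Since the year is buried in each column name, one way to make it usable is to split every name into a variable description, a survey code, and a year. The sketch below assumes every year-specific column ends in a parenthesized suffix of uppercase letters followed by a four-digit year, as in the columns shown above (`(ADM2021)`, `(IC2020)`, `(DRVIC2018)`); the helper name `split_column` and the sample list `columns` are illustrative and not part of the notebook.

```python
import re
import pandas as pd

# A few column names copied from the table above (illustrative sample).
columns = [
    "Institution Name",
    "Secondary school GPA (ADM2021)",
    "Member of National Collegiate Athletic Association (NCAA) (IC2021)",
    "Total price for in-state students living on campus 2018-19 (DRVIC2018)",
]

# Assumed suffix shape: "(<UPPERCASE SURVEY CODE><4-digit year>)" at the end of the name.
SUFFIX = re.compile(r"^(?P<variable>.*?)\s*\((?P<survey>[A-Z]+)(?P<year>\d{4})\)$")

def split_column(name):
    """Split a column name into variable, survey code, and year (None if no suffix)."""
    match = SUFFIX.match(name)
    if match is None:
        return {"column": name, "variable": name, "survey": None, "year": None}
    return {
        "column": name,
        "variable": match["variable"],
        "survey": match["survey"],
        "year": int(match["year"]),
    }

# One row per column, ready to filter or group by year.
column_info = pd.DataFrame([split_column(c) for c in columns])
```

In the notebook, the same function could be applied to `data.columns` to see which variables exist for which years. Columns that do not follow the assumed suffix pattern are passed through with no year, so they are easy to spot and handle separately.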

## Exploratory Data Analysis
We start by looking through a few visualizations of variables of interest to get an idea of their distribution.

### Grades and Ranking

In [6]:
# The two admissions-consideration variables we want to inspect.
grade_rank_cols = ['Secondary school GPA (ADM2021)', 'Secondary school rank (ADM2021)']
data_grades_ranks = data.loc[:, ['Institution Name'] + grade_rank_cols]
display(data_grades_ranks)
# Look up what each numeric code means for these two variables.
labels[labels['VariableName'].isin(grade_rank_cols)]

Unnamed: 0,Institution Name,Secondary school GPA (ADM2021),Secondary school rank (ADM2021)
0,Brown University,2,2
1,Columbia University in the City of New York,2,2
2,Cornell University,5,5
3,Dartmouth College,1,5
4,Harvard University,2,2
5,Princeton University,2,2
6,University of Pennsylvania,1,3
7,Yale University,2,2


Unnamed: 0,VariableName,Value,ValueLabel
0,Secondary school GPA (ADM2021),1,Required
1,Secondary school GPA (ADM2021),5,Considered but not required
2,Secondary school GPA (ADM2021),2,Recommended
3,Secondary school rank (ADM2021),1,Required
4,Secondary school rank (ADM2021),5,Considered but not required
5,Secondary school rank (ADM2021),2,Recommended
6,Secondary school rank (ADM2021),3,Neither required nor recommended


It's important to understand what the data above shows before we move on. We extracted two variables: GPA and ranking. However, according to the labels file, these numbers do not represent admitted students' performance on these metrics. Rather, they show how much weight the colleges (officially) give these metrics in their admissions decisions. According to this, the Ivy League universities take the following stance on GPA:
- **Required**: Dartmouth, University of Pennsylvania
- **Recommended**: Brown, Columbia, Harvard, Princeton, Yale
- **Considered but not required**: Cornell

Similarly, they have the following stance on ranking:
- **Required**: None
- **Recommended**: Brown, Columbia, Harvard, Princeton, Yale
- **Considered but not required**: Cornell, Dartmouth
- **Neither required nor recommended**: University of Pennsylvania
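The groupings above were read off by eye, but the same decoding can be done programmatically by joining the codes to the labels table. The following is a minimal sketch that rebuilds the two small tables shown above by hand so that it runs on its own; the names `codes`, `long_codes`, `decoded`, and `policy` are illustrative.

```python
import pandas as pd

gpa = "Secondary school GPA (ADM2021)"
rank = "Secondary school rank (ADM2021)"

# Codes copied from the output above.
codes = pd.DataFrame({
    "Institution Name": [
        "Brown University", "Columbia University in the City of New York",
        "Cornell University", "Dartmouth College", "Harvard University",
        "Princeton University", "University of Pennsylvania", "Yale University",
    ],
    gpa: [2, 2, 5, 1, 2, 2, 1, 2],
    rank: [2, 2, 5, 5, 2, 2, 3, 2],
})

# Label rows copied from the labels output above.
labels = pd.DataFrame({
    "VariableName": [gpa, gpa, gpa, rank, rank, rank, rank],
    "Value": [1, 5, 2, 1, 5, 2, 3],
    "ValueLabel": [
        "Required", "Considered but not required", "Recommended",
        "Required", "Considered but not required", "Recommended",
        "Neither required nor recommended",
    ],
})

# Reshape to one row per (institution, variable) so the codes can be joined to labels.
long_codes = codes.melt(id_vars="Institution Name", var_name="VariableName", value_name="Value")

# validate= raises if a (variable, code) pair has more than one label.
decoded = long_codes.merge(labels, on=["VariableName", "Value"], how="left", validate="many_to_one")

# Back to one row per institution, now with readable labels instead of codes.
policy = decoded.pivot(index="Institution Name", columns="VariableName", values="ValueLabel")
```

With the real DataFrames, `codes` would be `data_grades_ranks` and `labels` the full labels file. A left merge keeps any institution whose code has no matching label, which shows up as a missing value in `policy` instead of silently disappearing.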

Now, let's take a look at the actual distribution of these metrics for admitted students, drawn from secondary data sources (see Additional Data Sources for links; most figures come from the colleges' own class profile pages or from secondary-source estimates). Note that the GPAs are all normalized to be unweighted.
- Brown: Mean GPA of 4.0, 95% of students in top decile for ranking
- Cornell: Mean GPA of 4.0, 84% of students in top decile for ranking
- Harvard: Mean GPA of 4.0, 93% of students in top decile for ranking
- Dartmouth: Mean GPA of 4.0, 95% of students in top decile for ranking
- Columbia: Mean GPA of 4.0, 96% of students in top decile for ranking
- Princeton: Mean GPA of 3.9, class ranking data not available
- Yale: Mean GPA of 4.0, 95% of students in top decile for ranking
- University of Pennsylvania: Mean GPA of 3.9, 94% of students in top decile for ranking

Let's put this data into a DataFrame so we can easily generate a few visualizations.

In [7]:
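# Admitted-student figures transcribed from the list above.
# No class-rank figure was available for Princeton, so it is recorded as
# missing (NaN) rather than 0, which would otherwise be plotted as a real value.
data_dict = {'University': ['Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia', 'Princeton', 'Yale', 'University of Pennsylvania'],
             'Mean GPA': [4.0, 4.0, 4.0, 4.0, 4.0, 3.9, 4.0, 3.9],
             '% in Top Decile': [95, 84, 93, 95, 96, np.nan, 95, 94]
            }

# Use a new name so the IPEDS DataFrame stored in `data` is not overwritten.
admitted_stats = pd.DataFrame(data_dict)
admitted_stats

Unnamed: 0,University,Mean GPA,% in Top Decile
0,Brown,4.0,95.0
1,Cornell,4.0,84.0
2,Harvard,4.0,93.0
3,Dartmouth,4.0,95.0
4,Columbia,4.0,96.0
5,Princeton,3.9,
6,Yale,4.0,95.0
7,University of Pennsylvania,3.9,94.0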
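As a first visualization, a horizontal bar chart of the top-decile share makes the one outlier easy to see. This is only a sketch of one possible chart, not the notebook's final figure: the column names `mean_gpa` and `top_decile_pct`, the frame `plot_df`, and the helper `top_decile_chart` are illustrative. Princeton's unavailable ranking figure is treated as missing and dropped before plotting so it is not drawn as a zero.

```python
import numpy as np
import pandas as pd

# Figures copied from the list above; Princeton's ranking share is unavailable.
stats = pd.DataFrame({
    "University": ["Brown", "Cornell", "Harvard", "Dartmouth", "Columbia",
                   "Princeton", "Yale", "University of Pennsylvania"],
    "mean_gpa": [4.0, 4.0, 4.0, 4.0, 4.0, 3.9, 4.0, 3.9],
    "top_decile_pct": [95, 84, 93, 95, 96, np.nan, 95, 94],
})

# Drop rows without ranking data, then order from highest to lowest share.
plot_df = (
    stats.dropna(subset=["top_decile_pct"])
    .sort_values("top_decile_pct", ascending=False)
    .reset_index(drop=True)
)

def top_decile_chart(df):
    """Build a horizontal bar chart of the top-decile share per university."""
    import altair as alt  # imported here so the data prep above runs without Altair

    return (
        alt.Chart(df)
        .mark_bar()
        .encode(
            x=alt.X("top_decile_pct:Q", title="% of admitted students in top decile"),
            y=alt.Y("University:N", sort="-x", title=None),
            tooltip=["University", "mean_gpa", "top_decile_pct"],
        )
        .properties(title="Admitted students ranked in the top decile of their class")
    )
```

In a notebook cell, evaluating `top_decile_chart(plot_df)` as the last expression displays the chart. Mean GPA is left out of the bars because it barely varies (3.9 to 4.0), so it is only shown in the tooltip.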


### Test Scores

### Financial Aid

### Suggestions
- Feature importance analysis, because not all of the variables will be relevant
- SHEP
- After doing some quantitative analysis here (e.g., checking whether SAT scores and GPAs are consistent or scattered across universities, and finding correlations), see if I can find qualitative metrics or anecdotes online (A3-ethnography style) that help support my claims.
- I can look into the universities themselves and see if online athletic rankings map to my own athletic data.
- For the mini presentation, can probably just focus on 1-2 interesting things that showed up in the data, or perhaps something that is controversial.

# Additional Data Sources
- https://admissions.dartmouth.edu/apply/class-profile-testing
- https://admission.brown.edu/explore/brown-facts
- https://admissions.cornell.edu/sites/admissions.cornell.edu/files/ClassProfile%202025%20Profile%20Updated%20FINAL.pdf
- https://blog.collegevine.com/what-does-it-really-take-to-get-into-harvard
- https://undergrad.admissions.columbia.edu/class-2025-profile
- https://admissions.yale.edu/sites/default/files/yale_classprofile2025web.pdf
- https://www.upenn.edu/about/faq

**Note**: Some sources are listed separately in the file chatgpt.md; these are the sources that ChatGPT drew on for certain responses.