# How Holistic Are the Ivy League Universities, Really?
This Jupyter notebook contains a report and an analysis (in wonderful, intertwined, new-age data science format) of the undergraduate admissions practices of the eight Ivy League universities: Brown, Cornell, Harvard, Dartmouth, Columbia, Princeton, Yale, and the University of Pennsylvania.

## What do the universities claim?
- **Brown**: "Brown's admission process is holistic, and we review every application. The admission statistics available through Brown Facts, as well as grade and score ranges for the Class of 2026, may help to provide a broad perspective of the academic strength of our pool of applicants. However, please be aware that these data points are not a set of requirements and should not be used to predict odds of admission."
- **Cornell**: "Cornell is happy to provide general admissions statistical data from the most recently admitted class to help provide prospective applicants with a broad understanding of the kind of highly qualified candidates that have been admitted in the past. The Class Profile numbers, however, should not be interpreted to mean that objective data are the most important criteria in the selection process. Other factors, such as secondary school curriculum and performance, special talents, extracurricular activities, application essays, and interviews (where required) are critical to Cornell's decision making process as well."
- **Harvard**: "There is no formula for gaining admission to Harvard. Academic accomplishment in high school is important, but the Admissions Committee also considers many other criteria, such as community involvement, leadership and distinction in extracurricular activities, and personal qualities and character. We rely on teachers, counselors, and alumni to share information with us about an applicant's strength of character, his or her ability to overcome adversity, and other personal qualities."
- **Dartmouth**: "Every applicant is reviewed individually and holistically. We are aware that in many cases multiple applicants attend the same secondary school, and the decision for any one applicant will not determine the outcome of another."
- **Columbia**: "The Columbia University first-year class of College and Engineering students is chosen from a large and diverse group of applicants. Columbia employs a holistic approach in assessing candidates in order to evaluate which students are the best matches for Columbia's unique educational experience. In the process of selection, the Committee on Admissions considers each applicant's academic potential, intellectual strength and ability to think independently. The Committee also considers the general attitudes and character of the applicant, special abilities and interests, maturity, motivation, curiosity and whether they are likely to make productive use of the four years at Columbia. In its final selection, Columbia seeks diversity of personalities, achievements and talents, and of economic, social, ethnic, cultural, religious, racial and geographic backgrounds."
- **Princeton**: "For each class, we bring together a varied mix of high-achieving, intellectually gifted students from diverse backgrounds to create an exceptional learning community. We care about what students have accomplished in and out of the classroom. As you prepare your application, help us to appreciate your talents, academic accomplishments and personal achievements. We'll ask for your transcript and recommendations, and we will want to know more than just the statistics in your file. Tell us your story. Show us what’s special about you. Tell us how you would seize the academic and nonacademic opportunities at Princeton and contribute to the Princeton community."
- **Yale**: "Students commonly want to know what part of the college application “carries the most weight.” The truth is, there are many parts to your application, and together they help us discover and appreciate your particular mix of qualities. Academic criteria are important to Yale’s selective admissions process, but we look at far more than test scores and grades."
- **University of Pennsylvania**: "We look for students who aspire to develop and refine their talents and abilities within Penn’s liberal arts-based, practical, and interdisciplinary learning environment. Our ideal candidates are inspired to emulate our founder Benjamin Franklin by applying their knowledge in “service to society” to our community, the city of Philadelphia, and the wider world. To best understand prospective students’ paths through Penn, we approach applications holistically and with great care."

## The Data
We begin by importing the modules we need, then load and examine our data, taken from the [Integrated Postsecondary Education Data System](https://nces.ed.gov/ipeds/use-the-data).

In [7]:
import pandas as pd
import numpy as np
import altair as alt

In [8]:
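# Main IPEDS extract: one row per university, one column per variable and year
data = pd.read_csv('./data/all_data_variables.csv')
# Lookup table mapping coded values to their human-readable labels
labels = pd.read_csv('./data/labels.csv')
data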

Unnamed: 0,UnitID,Institution Name,Member of National Athletic Association (IC2021),Member of National Collegiate Athletic Association (NCAA) (IC2021),Member of National Athletic Association (IC2020),Member of National Collegiate Athletic Association (NCAA) (IC2020),Member of National Athletic Association (IC2019),Member of National Collegiate Athletic Association (NCAA) (IC2019),Member of National Athletic Association (IC2018),Member of National Collegiate Athletic Association (NCAA) (IC2018),...,Total price for in-district students living off campus (not with family) 2019-20 (DRVIC2019),Total price for in-state students living off campus (not with family) 2019-20 (DRVIC2019),Total price for out-of-state students living off campus (not with family) 2019-20 (DRVIC2019),Total price for in-district students living on campus 2018-19 (DRVIC2018),Total price for in-state students living on campus 2018-19 (DRVIC2018),Total price for out-of-state students living on campus 2018-19 (DRVIC2018),Total price for in-district students living off campus (not with family) 2018-19 (DRVIC2018),Total price for in-state students living off campus (not with family) 2018-19 (DRVIC2018),Total price for out-of-state students living off campus (not with family) 2018-19 (DRVIC2018),Unnamed: 251
0,217156,Brown University,1,1,1,1,1,1,1,1,...,,,,73802,73802,73802,,,,
1,190150,Columbia University in the City of New York,1,1,1,1,1,1,1,1,...,86257.0,86257.0,86257.0,76856,76856,76856,83470.0,83470.0,83470.0,
2,190415,Cornell University,1,1,1,1,1,1,1,1,...,76258.0,76258.0,76258.0,73904,73904,73904,73904.0,73904.0,73904.0,
3,182670,Dartmouth College,1,1,1,1,1,1,1,1,...,,,,74359,74359,74359,,,,
4,166027,Harvard University,1,1,1,1,1,1,1,1,...,,,,71650,71650,71650,,,,
5,186131,Princeton University,1,1,1,1,1,1,1,1,...,,,,70900,70900,70900,,,,
6,215062,University of Pennsylvania,1,1,1,1,1,1,1,1,...,75480.0,75480.0,75480.0,74408,74408,74408,74408.0,74408.0,74408.0,
7,130794,Yale University,1,1,1,1,1,1,1,1,...,,,,73900,73900,73900,,,,


In [9]:
list(data.columns)

['UnitID',
 'Institution Name',
 'Member of National Athletic Association (IC2021)',
 'Member of National Collegiate Athletic Association (NCAA) (IC2021)',
 'Member of National Athletic Association (IC2020)',
 'Member of National Collegiate Athletic Association (NCAA) (IC2020)',
 'Member of National Athletic Association (IC2019)',
 'Member of National Collegiate Athletic Association (NCAA) (IC2019)',
 'Member of National Athletic Association (IC2018)',
 'Member of National Collegiate Athletic Association (NCAA) (IC2018)',
 'Percent admitted - men (DRVADM2021)',
 'Percent admitted - women (DRVADM2021)',
 'Percent admitted - total (DRVADM2021)',
 'Percent admitted - men (DRVADM2020_RV)',
 'Percent admitted - women (DRVADM2020_RV)',
 'Percent admitted - total (DRVADM2020_RV)',
 'Percent admitted - men (DRVADM2019_RV)',
 'Percent admitted - women (DRVADM2019_RV)',
 'Percent admitted - total (DRVADM2019_RV)',
 'Percent admitted - men (DRVADM2018_RV)',
 'Percent admitted - women (DRVADM2018_

In its "raw" form, it looks like we can say, at minimum, the following about our data:
- We have 249 variables, many of them the same measure repeated across different years.
- The data stretches from 2018 to 2022 at the latest, with some variables ending in 2021.
- Binary data is encoded using 0s and 1s (this is useful to know, as it is already in a format that is conducive to a potential model).
- There is definitely a lot of missing data.
- As of now, the year is not encoded in a way that is particularly easy to extract.
- As we delve into the analysis, we will likely need to clean and extract various aspects of this data.
- We have information on membership in the NCAA, admissions percentages by various groups, GPA + ranking information, SAT + ACT scores, enrollment breakdowns by race, and cost of attendance.
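One way to make the year easier to extract is to parse it out of the column names. The sketch below is a minimal illustration, assuming every year-specific column ends with a parenthesised survey code followed by a four-digit year and an optional `_RV` suffix, as in the column names listed above. The helper name `split_column` and the pattern itself are hypothetical and would need checking against all 249 columns.

```python
import re

# Assumed naming pattern: "<label> (<SURVEYCODE><YEAR>[_RV])",
# e.g. "Percent admitted - men (DRVADM2020_RV)".
COLUMN_PATTERN = re.compile(
    r'^(?P<label>.*)\s\((?P<survey>[A-Z]+)(?P<year>\d{4})(?:_RV)?\)$'
)

def split_column(name):
    """Split a column name into (label, survey code, year).

    Columns that do not follow the assumed pattern (such as 'UnitID')
    are returned unchanged with None for the survey code and year.
    """
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return name, None, None
    return match['label'], match['survey'], int(match['year'])

examples = [
    'UnitID',
    'Member of National Collegiate Athletic Association (NCAA) (IC2021)',
    'Percent admitted - men (DRVADM2020_RV)',
]
for column in examples:
    print(split_column(column))
```

Applied to `data.columns`, the parsed year could then serve as a key for reshaping the table so that each row is one university in one year.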

## Exploratory Data Analysis
We start by looking through a few visualizations of variables of interest to get an idea of their distribution.

### Grades and Ranking

In [4]:
data_grades_ranks = data.loc[:, ['Institution Name', 'Secondary school GPA (ADM2021)', 'Secondary school rank (ADM2021)']]
display(data_grades_ranks)
labels[(labels['VariableName'] == 'Secondary school GPA (ADM2021)') | (labels['VariableName'] == 'Secondary school rank (ADM2021)')]

Unnamed: 0,Institution Name,Secondary school GPA (ADM2021),Secondary school rank (ADM2021)
0,Brown University,2,2
1,Columbia University in the City of New York,2,2
2,Cornell University,5,5
3,Dartmouth College,1,5
4,Harvard University,2,2
5,Princeton University,2,2
6,University of Pennsylvania,1,3
7,Yale University,2,2


Unnamed: 0,VariableName,Value,ValueLabel
0,Secondary school GPA (ADM2021),1,Required
1,Secondary school GPA (ADM2021),5,Considered but not required
2,Secondary school GPA (ADM2021),2,Recommended
3,Secondary school rank (ADM2021),1,Required
4,Secondary school rank (ADM2021),5,Considered but not required
5,Secondary school rank (ADM2021),2,Recommended
6,Secondary school rank (ADM2021),3,Neither required nor recommended


It's important to understand what the data above shows before we move on. We extracted two variables: GPA and ranking. However, according to the labels file, these numbers do not represent admitted students' performance on these metrics. Rather, they show how much weight the colleges (officially) give these metrics in their admissions decisions. According to this, the Ivy League schools take the following stance on GPA:
- **Required**: Dartmouth, University of Pennsylvania
- **Recommended**: Brown, Columbia, Harvard, Princeton, Yale
- **Considered but not required**: Cornell

Similarly, they have the following stance on ranking:
- **Required**: None
- **Recommended**: Brown, Columbia, Harvard, Princeton, Yale
- **Considered but not required**: Cornell, Dartmouth
- **Neither required nor recommended**: University of Pennsylvania.
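The two lists above were read off by hand, but the same decoding can be done with a join. The following is a minimal sketch, not the notebook's own method: it rebuilds small stand-ins for `data_grades_ranks` and `labels` from the values displayed above (the names `codes`, `label_rows`, and `policy` are hypothetical) and merges them on the variable name and code.

```python
import pandas as pd

gpa_col = 'Secondary school GPA (ADM2021)'
rank_col = 'Secondary school rank (ADM2021)'

# Coded values copied from the first table displayed above.
codes = pd.DataFrame({
    'Institution Name': [
        'Brown University', 'Columbia University in the City of New York',
        'Cornell University', 'Dartmouth College', 'Harvard University',
        'Princeton University', 'University of Pennsylvania', 'Yale University',
    ],
    gpa_col: [2, 2, 5, 1, 2, 2, 1, 2],
    rank_col: [2, 2, 5, 5, 2, 2, 3, 2],
})

# Code-to-label rows copied from the labels table displayed above.
label_rows = pd.DataFrame({
    'VariableName': [gpa_col] * 3 + [rank_col] * 4,
    'Value': [1, 5, 2, 1, 5, 2, 3],
    'ValueLabel': [
        'Required', 'Considered but not required', 'Recommended',
        'Required', 'Considered but not required', 'Recommended',
        'Neither required nor recommended',
    ],
})

# Reshape to one row per (school, variable), then attach the matching label.
long_codes = codes.melt(id_vars='Institution Name',
                        var_name='VariableName', value_name='Value')
decoded = long_codes.merge(label_rows, on=['VariableName', 'Value'], how='left')

# Back to one row per school, now with readable labels instead of codes.
policy = decoded.pivot(index='Institution Name',
                       columns='VariableName', values='ValueLabel')
print(policy)
```

Using `how='left'` keeps every school in the result even if a code has no entry in the labels file, in which case the label would show up as missing rather than silently dropping the row.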

Now, let's take a look at the actual distribution of these metrics for admitted students, drawn from secondary data sources (see the Additional Data Sources section for links; most figures come from the colleges' own class profile pages, and the rest from secondary-source estimates). Note that the GPAs are all normalized to be unweighted.
- Brown: Mean GPA of 4.0, 95% of students in top decile for ranking
- Cornell: Mean GPA of 4.0, 84% of students in top decile for ranking
- Harvard: Mean GPA of 4.0, 93% of students in top decile for ranking
- Dartmouth: Mean GPA of 4.0, 95% of students in top decile for ranking
- Columbia: Mean GPA of 4.0, 96% of students in top decile for ranking
- Princeton: Mean GPA of 3.9, no class ranking data available
- Yale: Mean GPA of 4.0, 95% of students in top decile for ranking
- University of Pennsylvania: Mean GPA of 3.9, 94% of students in top decile for ranking

Let's put this data into a DataFrame so we can easily generate a few visualizations.

In [5]:
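data_dict = {'University': ['Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia', 'Princeton', 'Yale', 'University of Pennsylvania'],
             'Mean GPA': [4.0, 4.0, 4.0, 4.0, 4.0, 3.9, 4.0, 3.9],
             # Percentage of admitted students in the top decile of their class.
             # Princeton has no class-rank data; its 0 is a placeholder, not a real value.
             'Percentage': [95, 84, 93, 95, 96, 0, 95, 94]
            }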

gpa_rank_data = pd.DataFrame(data_dict)
gpa_rank_data

Unnamed: 0,University,Mean GPA,Percentage
0,Brown,4.0,95
1,Cornell,4.0,84
2,Harvard,4.0,93
3,Dartmouth,4.0,95
4,Columbia,4.0,96
5,Princeton,3.9,0
6,Yale,4.0,95
7,University of Pennsylvania,3.9,94


Below, we use Altair to generate two bar charts side by side: one showing the average GPA of students admitted to these universities, and one showing the percentage of admitted students in the top decile of their high school graduating class. Note that Princeton has no class ranking data, so its bar in the second chart appears blank.

In [6]:
gpa_chart = alt.Chart(gpa_rank_data).mark_bar().encode(
    alt.X('University'),
    alt.Y('Mean GPA')
).properties(
    width=350,
    title='Average GPA of Admitted Students'
)

rank_chart = alt.Chart(gpa_rank_data).mark_bar().encode(
    alt.X('University'),
    alt.Y('Percentage')
).properties(
    width=350,
    title='Percentage of Admitted Students in Top Decile of Graduating Class'
)

gpa_chart | rank_chart
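Because Princeton's missing class-rank figure is stored as 0 in `gpa_rank_data`, any summary computed over the `Percentage` column would be pulled down by that placeholder. The sketch below is one possible way to handle this, assuming the 0 should be read as "no data": it masks Princeton's entry before averaging. The names `top_decile`, `naive_mean`, and `masked_mean` are hypothetical.

```python
import pandas as pd

# Same values as the DataFrame built above (only the columns needed here).
gpa_rank_data = pd.DataFrame({
    'University': ['Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia',
                   'Princeton', 'Yale', 'University of Pennsylvania'],
    'Percentage': [95, 84, 93, 95, 96, 0, 95, 94],
})

# Replace Princeton's placeholder 0 with NaN so it is treated as missing.
top_decile = gpa_rank_data['Percentage'].mask(
    gpa_rank_data['University'] == 'Princeton'
)

naive_mean = gpa_rank_data['Percentage'].mean()  # counts the placeholder 0
masked_mean = top_decile.mean()                  # skips NaN by default

print(round(naive_mean, 1), round(masked_mean, 1))
```

With the placeholder included, the mean comes out near 81.5; with it masked, the mean over the seven schools that report a figure is about 93.1.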

### Test Scores

In [13]:
data_grades_sat = data.loc[:, ['Institution Name', 'Admission test scores (ADM2021)']]
display(data_grades_sat)
labels[(labels['VariableName'] == 'Admission test scores (ADM2021)')]

Unnamed: 0,Institution Name,Admission test scores (ADM2021)
0,Brown University,5
1,Columbia University in the City of New York,5
2,Cornell University,5
3,Dartmouth College,5
4,Harvard University,5
5,Princeton University,5
6,University of Pennsylvania,5
7,Yale University,5


Unnamed: 0,VariableName,Value,ValueLabel
13,Admission test scores (ADM2021),1,Required
14,Admission test scores (ADM2021),5,Considered but not required


According to the data and corresponding labels above, in 2021 admissions test scores were considered but not required at all eight schools, which implies that perhaps they are not that important. This seems to have been the schools' official position for the past few years. Nevertheless, the schools still publish test score statistics for admitted students in their class profiles. Let's look at the most recent data:
- Brown: SAT: Middle 50% between 1500 and 1570, ACT: Middle 50% between 34 and 36
- Cornell: No data on average score, but the profile says 41% of enrolling students submitted SAT scores and 20% submitted ACT scores
- Harvard: No information on testing in the class profile
- Dartmouth: SAT Reading/Writing: 733, SAT Math: 750, ACT: 33
- Columbia: SAT: Middle 50% between 1490 and 1560, ACT: Middle 50% between 34 and 35
- Princeton: SAT: Middle 50% between 1490 and 1580, ACT: Middle 50% between 33 and 35
- Yale: No data published; the profile explicitly attributes this to test scores being optional
- University of Pennsylvania: SAT: Middle 50% between 1510 and 1560, ACT: Middle 50% between 34 and 36

As far as admissions test scores go, it seems the Ivy League schools can be taken more at their word. Yes, the reported scores are still high (especially for a middle 50%), but many of the universities (especially Yale, Cornell, and Harvard) appear to acknowledge that test scores are optional and to place less weight on them in admissions decisions.
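To get a rough sense of how similar the published SAT ranges are, the middle-50% figures from the list above can be reduced to a midpoint and a width. This is only a sketch under a simplifying assumption: the midpoint of a middle-50% range is a crude stand-in for a typical score, not the actual median, and only the four schools that publish a composite range are included. The names `sat_middle_50`, `summarize`, and `sat_summary` are hypothetical.

```python
# Middle-50% SAT ranges (25th to 75th percentile) copied from the list above.
sat_middle_50 = {
    'Brown': (1500, 1570),
    'Columbia': (1490, 1560),
    'Princeton': (1490, 1580),
    'University of Pennsylvania': (1510, 1560),
}

def summarize(low, high):
    """Return the midpoint and width of a (low, high) score range."""
    return (low + high) / 2, high - low

sat_summary = {
    school: summarize(*bounds) for school, bounds in sat_middle_50.items()
}

for school, (midpoint, width) in sat_summary.items():
    print(f'{school}: midpoint {midpoint:.0f}, width {width}')

# How far apart the schools' midpoints are from one another.
midpoints = [midpoint for midpoint, _ in sat_summary.values()]
spread = max(midpoints) - min(midpoints)
print(f'Spread of midpoints: {spread:.0f} points')
```

The midpoints differ by only 10 points across these four schools, which suggests that the schools that do report scores report nearly the same range.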

### Financial Aid

### Suggestions
- Future importance analysis because all the variables probably won't be relevant
- SHEP
- After doing some quantitative analysis here (for example, checking whether SAT scores and GPAs are consistent or scattered across universities, and finding some correlations), see if I can find qualitative metrics or anecdotes online (A3 ethnography-like) that help support my claims.
- I can look into the universities themselves and see if online athletic rankings map to my own athletic data.
- For the mini presentation, can probably just focus on 1-2 interesting things that showed up in the data, or perhaps something that is controversial.

# Additional Data Sources
- https://admissions.dartmouth.edu/apply/class-profile-testing
- https://admission.brown.edu/explore/brown-facts
- https://admissions.cornell.edu/sites/admissions.cornell.edu/files/ClassProfile%202025%20Profile%20Updated%20FINAL.pdf
- https://blog.collegevine.com/what-does-it-really-take-to-get-into-harvard
- https://undergrad.admissions.columbia.edu/class-2025-profile
- https://admissions.yale.edu/sites/default/files/yale_classprofile2025web.pdf
- https://www.upenn.edu/about/faq

**Note**: Some sources are listed in the file chatgpt.md for clarity, as these are sources that ChatGPT used in order to give certain responses.