# How Holistic Are the Ivy League Universities, Really?

# ABSTRACT
This Jupyter notebook contains a report as well as an analysis (in wonderful, intertwined, new-age data science format) of the undergraduate admissions practices of the 8 Ivy League Universities: Brown, Cornell, Harvard, Dartmouth, Columbia, Princeton, Yale, and the University of Pennsylvania. Originally, quantitative research questions were posed, aiming to determine with statistical significance what factors most correlated with admissions prospects. However, due to data limitations, the research questions and methodology were altered to take a more exploratory and ethnographic approach, with the details documented throughout. I find that while the Ivy Leagues are holistic in some sense, they still tend to admit a very particular type of student, and that they emphasize specific qualities (such as stellar academics) as baselines more than they let on. I conclude that if this information were more clearly available, many more students could have a fairer shot at admission into these schools.

# INTRODUCTION
The Ivy League Universities have been the coveted colleges for American and international students for decades. Although they are notoriously difficult to gain admission into, many of these schools describe holistic (and ultimately, fair) admissions policies on their websites.

However, such claims stand in direct contradiction to countless personal anecdotes and national-level news stories, such as the recent college admissions scandal. In this notebook report, I study the various factors which influence a student's admission into Ivy League Universities, and determine if these schools really are as holistic as they claim to be.

To be perfectly clear, the intention of this exploration and analysis is not necessarily to reveal anything dramatic (although that would definitely be interesting). Rather, the goal is to discover if there is any disconnect between the stated admissions criteria of these universities and what the data actually reflects. This could be as simple as an official statement which claims need-blind admissions (in terms of finances) and data which shows otherwise, or as astonishing as an internal policy which accepts anyone who donates enough money.

Practically speaking, this matters because potential applicants should know about any finding that differs from the schools' official statements. Many young students quite literally dedicate their lives to getting into these schools. If one is going to do this, it's helpful to know what to focus on. As such, the overarching goal of this analysis, stated simply, is to determine the factors that most influence a potential student's admission to these universities.

# BACKGROUND AND RELATED WORK
For convenience, I am listing the primary references I will use here, rather than at the end, as is traditionally done:
1. [Wealth, Legacy, and College Admission](https://link.springer.com/article/10.1007/s12115-019-00377-2)
2. [The Future of College Admissions: Discussion](https://link.springer.com/article/10.1007/s12115-019-00377-2)
3. [Merit and Competition in Selective College Admissions](https://muse.jhu.edu/article/262793)

Though I will undoubtedly find other relevant work as I work on the project (to be cited at the end), my reading of the three works above indicates that they provide a solid foundation to build on. This is primarily because many of the findings are related to my research goals, but still provide room for me to explore more (there doesn't seem to be exclusive focus on the Ivy Leagues in the existing work, though mention of top schools is certainly made). They additionally provide information across a fair span of time, which will help me to trace how policies have (or have not) evolved over the last two decades.

Killgore splits college admissions into two types: 1) merit based, which focuses on academic and non-academic achievements, and 2) competition based, which chooses students based on organizational needs (e.g. sports teams, band, etc.) [3]. Her findings indicated that "a) elite colleges’ maintenance of an illusion that student achievement determines admissions, and (b) that admissions practices are designed to maintain the colleges’ elite status." This connects to my first question, which focuses on the factors that influence admission.

In the more recent 2019 paper [1], the authors explicitly state, "Actually, the admissions scandal at Harvard and Yale, and other prestigious colleges, is just another example of how the rich and powerful make the rules or change and bend them to suit their own interests– and at the expense of ordinary people who are just trying to keep their heads above water and pay their bills." This seems to indicate social class does indeed play a large role in admissions.

Based on the data I have available (see subsection below), I don't expect I will be able to make definitive determinations about factors like social class. However, I do anticipate there is much to be learned about the points Killgore makes above. Specifically, I suspect there may be evidence pointing to the fact that admission into these universities is not simply about student achievement.

Before introducing my primary research questions, I provide a bit more helpful context in the subsections. First, let's take a direct look at what these colleges claim about admissions on their official websites. Then, let's see what the data looks like.

## What do the universities claim?
Below are excerpts taken from the official admissions websites of the Ivy League Universities. They provide helpful background on the analysis and discussion that follows.

- **Brown**: _"Brown's admission process is holistic, and we review every application. The admission statistics available through Brown Facts, as well as grade and score ranges for the Class of 2026, may help to provide a broad perspective of the academic strength of our pool of applicants. However, please be aware that these data points are not a set of requirements and should not be used to predict odds of admission."_
- **Cornell**: _"Cornell is happy to provide general admissions statistical data from the most recently admitted class to help provide prospective applicants with a broad understanding of the kind of highly qualified candidates that have been admitted in the past. The Class Profile numbers, however, should not be interpreted to mean that objective data are the most important criteria in the selection process. Other factors, such as secondary school curriculum and performance, special talents, extracurricular activities, application essays, and interviews (where required) are critical to Cornell's decision making process as well."_
- **Harvard**: _"There is no formula for gaining admission to Harvard. Academic accomplishment in high school is important, but the Admissions Committee also considers many other criteria, such as community involvement, leadership and distinction in extracurricular activities, and personal qualities and character. We rely on teachers, counselors, and alumni to share information with us about an applicant's strength of character, his or her ability to overcome adversity, and other personal qualities."_
- **Dartmouth**: _"Every applicant is reviewed individually and holistically. We are aware that in many cases multiple applicants attend the same secondary school, and the decision for any one applicant will not determine the outcome of another."_
- **Columbia**: _"The Columbia University first-year class of College and Engineering students is chosen from a large and diverse group of applicants. Columbia employs a holistic approach in assessing candidates in order to evaluate which students are the best matches for Columbia's unique educational experience. In the process of selection, the Committee on Admissions considers each applicant's academic potential, intellectual strength and ability to think independently. The Committee also considers the general attitudes and character of the applicant, special abilities and interests, maturity, motivation, curiosity and whether they are likely to make productive use of the four years at Columbia. In its final selection, Columbia seeks diversity of personalities, achievements and talents, and of economic, social, ethnic, cultural, religious, racial and geographic backgrounds."_
- **Princeton**: _"For each class, we bring together a varied mix of high-achieving, intellectually gifted students from diverse backgrounds to create an exceptional learning community. We care about what students have accomplished in and out of the classroom. As you prepare your application, help us to appreciate your talents, academic accomplishments and personal achievements. We'll ask for your transcript and recommendations, and we will want to know more than just the statistics in your file. Tell us your story. Show us what’s special about you. Tell us how you would seize the academic and nonacademic opportunities at Princeton and contribute to the Princeton community."_
- **Yale**: _"Students commonly want to know what part of the college application 'carries the most weight.' The truth is, there are many parts to your application, and together they help us discover and appreciate your particular mix of qualities. Academic criteria are important to Yale’s selective admissions process, but we look at far more than test scores and grades."_
- **University of Pennsylvania**: _"We look for students who aspire to develop and refine their talents and abilities within Penn’s liberal arts-based, practical, and interdisciplinary learning environment. Our ideal candidates are inspired to emulate our founder Benjamin Franklin by applying their knowledge in 'service to society' to our community, the city of Philadelphia, and the wider world. To best understand prospective students’ paths through Penn, we approach applications holistically and with great care."_

At the risk of being facetious, I must say, it almost sounds like all of the above are the same exact paragraph, rewritten into 8 different variations by ChatGPT. One theme emerges pretty clearly: Whether or not it is true, these universities want all potential candidates to believe that their application is treated with great care, and that there are no set criteria which determine admission.

## The Data
I plan to use data from the [Institute of Education Sciences](https://ies.ed.gov/) (IES), which describes itself as "the nation's leading source for rigorous, independent education research, evaluation and statistics." More specifically, my data will come from one of their sub-organizations, the [National Center for Education Statistics](https://nces.ed.gov/) (NCES). My data will be curated from their extensive [Integrated Postsecondary Education Data System](https://nces.ed.gov/ipeds/datacenter) (IPEDS), which allows users to select the institutions they are interested in as well as a list of variables from a huge number (quite literally hundreds) of available ones. This lends itself nicely to my overall research goals, and will help me home in on specific research questions as I explore.

### Data Terms of Use
The data from NCES is available under the following terms of use, which users must agree to before downloading the final data set they curate:

"Under law, public use data collected and distributed by the National Center for Education Statistics (NCES) may be used only for statistical purposes. Any effort to determine the identity of any reported case by public-use data users is prohibited by law. Violations are subject to Class E felony charges of a fine up to $250,000 and/or a prison term up to 5 years.
NCES does all it can to assure that the identity of data subjects cannot be disclosed. All direct identifiers, as well as any characteristics that might lead to identification, are omitted or modified in the dataset to protect the true characteristics of individual cases. Any intentional identification or disclosure of a person or institution violates the assurances of confidentiality given to the providers of the information. Therefore, users shall:

Use the data in any dataset for statistical purposes only.

Make no use of the identity of any person or institution discovered inadvertently, and advise NCES of any such discovery.

Not link any dataset with individually identifiable data from other NCES or non-NCES datasets.
To proceed you must signify your agreement to comply with the above-stated statutorily based requirements. This window will close and you can now download the file."

### Potential Ethical Considerations
The primary ethical consideration here, also stated in the terms of use above, is to ensure that no individual is exposed as a result of the analysis. This is especially important in the event that I end up being critical of any particular admissions practices. Luckily, the way NCES provides and stores data is already conducive to such anonymity, but I will need to be additionally careful to ensure I do not miss anything.

### Loading in the Data
Below, I import the requisite modules and then load and examine the data, having already downloaded it from the [Integrated Postsecondary Education Data System](https://nces.ed.gov/ipeds/use-the-data). For convenience, I import all the modules that will be needed throughout the notebook here.

In [2]:
import pandas as pd
import numpy as np
import altair as alt

In [33]:
# Read the CSV files into pandas DataFrames
data = pd.read_csv('./data/all_data_variables.csv')
labels = pd.read_csv('data/labels.csv')
data

Unnamed: 0,UnitID,Institution Name,Member of National Athletic Association (IC2021),Member of National Collegiate Athletic Association (NCAA) (IC2021),Member of National Athletic Association (IC2020),Member of National Collegiate Athletic Association (NCAA) (IC2020),Member of National Athletic Association (IC2019),Member of National Collegiate Athletic Association (NCAA) (IC2019),Member of National Athletic Association (IC2018),Member of National Collegiate Athletic Association (NCAA) (IC2018),...,Total price for in-district students living off campus (not with family) 2019-20 (DRVIC2019),Total price for in-state students living off campus (not with family) 2019-20 (DRVIC2019),Total price for out-of-state students living off campus (not with family) 2019-20 (DRVIC2019),Total price for in-district students living on campus 2018-19 (DRVIC2018),Total price for in-state students living on campus 2018-19 (DRVIC2018),Total price for out-of-state students living on campus 2018-19 (DRVIC2018),Total price for in-district students living off campus (not with family) 2018-19 (DRVIC2018),Total price for in-state students living off campus (not with family) 2018-19 (DRVIC2018),Total price for out-of-state students living off campus (not with family) 2018-19 (DRVIC2018),Unnamed: 251
0,217156,Brown University,1,1,1,1,1,1,1,1,...,,,,73802,73802,73802,,,,
1,190150,Columbia University in the City of New York,1,1,1,1,1,1,1,1,...,86257.0,86257.0,86257.0,76856,76856,76856,83470.0,83470.0,83470.0,
2,190415,Cornell University,1,1,1,1,1,1,1,1,...,76258.0,76258.0,76258.0,73904,73904,73904,73904.0,73904.0,73904.0,
3,182670,Dartmouth College,1,1,1,1,1,1,1,1,...,,,,74359,74359,74359,,,,
4,166027,Harvard University,1,1,1,1,1,1,1,1,...,,,,71650,71650,71650,,,,
5,186131,Princeton University,1,1,1,1,1,1,1,1,...,,,,70900,70900,70900,,,,
6,215062,University of Pennsylvania,1,1,1,1,1,1,1,1,...,75480.0,75480.0,75480.0,74408,74408,74408,74408.0,74408.0,74408.0,
7,130794,Yale University,1,1,1,1,1,1,1,1,...,,,,73900,73900,73900,,,,


That's a lot of columns. In part, this is because the data includes multiple years. However, even with that caveat taken into account, there is quite a bit going on here. Let's print the column names out in a nice, readable format with the code below. This will make it easier to extract out information of interest once we begin analyzing the data.

In [34]:
for col_name in list(data.columns):
    print(col_name)

UnitID
Institution Name
Member of National Athletic Association (IC2021)
Member of National Collegiate Athletic Association (NCAA) (IC2021)
Member of National Athletic Association (IC2020)
Member of National Collegiate Athletic Association (NCAA) (IC2020)
Member of National Athletic Association (IC2019)
Member of National Collegiate Athletic Association (NCAA) (IC2019)
Member of National Athletic Association (IC2018)
Member of National Collegiate Athletic Association (NCAA) (IC2018)
Percent admitted - men (DRVADM2021)
Percent admitted - women (DRVADM2021)
Percent admitted - total (DRVADM2021)
Percent admitted - men (DRVADM2020_RV)
Percent admitted - women (DRVADM2020_RV)
Percent admitted - total (DRVADM2020_RV)
Percent admitted - men (DRVADM2019_RV)
Percent admitted - women (DRVADM2019_RV)
Percent admitted - total (DRVADM2019_RV)
Percent admitted - men (DRVADM2018_RV)
Percent admitted - women (DRVADM2018_RV)
Percent admitted - total (DRVADM2018_RV)
Percent of full-time first-time under

In its "raw" form, it looks like we can say, at minimum, the following about our data:
- We have 249 variables, many of them the same, but across different years.
- The data stretches from 2018 - 2022, at the latest, with some variables ending in 2021.
- Binary data is encoded using 0s and 1s (this is useful to know, as it is already in a format that is conducive to a potential model).
- There is definitely a lot of missing data. But this makes sense, as different colleges likely report different metrics, and the presence of a metric for one college is probably the minimum needed to add the entire field.
- As of now, the year is not encoded in a way that is particularly easy to extract.
- As we delve into the analysis, we will likely need to clean and extract various aspects of this data.
- We have information on membership in the NCAA, admissions percentages by various groups, GPA + ranking information, SAT + ACT scores, enrollment breakdowns by race, and cost of attendance.
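As a rough illustration of how the year could be pulled out of the column names, the sketch below parses the trailing survey code with a regular expression. The pattern is an assumption inferred only from the column names printed above (for example `(DRVADM2020_RV)` or `(IC2021)`); the helper name `parse_column` is hypothetical, and other columns in the full file may follow a different convention.

```python
import re

# Assumed naming convention, inferred from the printed column names:
# "<variable description> (<SURVEYCODE><YEAR>[_RV])"
COLUMN_PATTERN = re.compile(
    r"^(?P<variable>.*)\s+\((?P<survey>[A-Z]+)(?P<year>\d{4})(?:_RV)?\)$"
)


def parse_column(col_name):
    """Return (variable, survey, year), or None if the name has no survey suffix."""
    match = COLUMN_PATTERN.match(col_name)
    if match is None:
        return None
    return match["variable"], match["survey"], int(match["year"])


examples = [
    "Institution Name",
    "Percent admitted - total (DRVADM2020_RV)",
    "Member of National Collegiate Athletic Association (NCAA) (IC2021)",
]
for name in examples:
    print(name, "->", parse_column(name))
```

Columns without a survey suffix (such as `Institution Name`) simply return `None`, so they can be kept as identifiers when reshaping the data by year.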

Now then — with an idea of how our data looks combined with the background knowledge from above, let's finally delve into some concrete research questions.

# RESEARCH QUESTIONS AND HYPOTHESES
In the rest of this notebook, I plan to explore the following research questions and associated hypotheses:
- What factors are the strongest predictors of whether or not a student is accepted into the Ivy League Universities? **Hypothesis: This is a more general question, so it is hard to predict exactly. But at a high level, I hypothesize that a few factors will play a huge role, suggesting that the admissions process is not really that holistic after all.**
- Is there a statistically significant correlation between various factors (social class, athletic potential, donation amounts, etc.) and chance of acceptance? **Hypothesis: I do think I will find a statistically significant correlation here**.

# METHODOLOGY

## Exploratory Data Analysis
We start by looking through a few visualizations of variables of interest to get an idea of their distribution.

### Grades and Ranking
Below, we extract out variables of interest related to grades and ranking for these universities, making a simplified version of the DataFrame.

In [5]:
# Extract out variables of interest
data_grades_ranks = data.loc[:, ['Institution Name', 'Secondary school GPA (ADM2021)', 'Secondary school rank (ADM2021)']]
display(data_grades_ranks)
labels[(labels['VariableName'] == 'Secondary school GPA (ADM2021)') | (labels['VariableName'] == 'Secondary school rank (ADM2021)')]

Unnamed: 0,Institution Name,Secondary school GPA (ADM2021),Secondary school rank (ADM2021)
0,Brown University,2,2
1,Columbia University in the City of New York,2,2
2,Cornell University,5,5
3,Dartmouth College,1,5
4,Harvard University,2,2
5,Princeton University,2,2
6,University of Pennsylvania,1,3
7,Yale University,2,2


Unnamed: 0,VariableName,Value,ValueLabel
0,Secondary school GPA (ADM2021),1,Required
1,Secondary school GPA (ADM2021),5,Considered but not required
2,Secondary school GPA (ADM2021),2,Recommended
3,Secondary school rank (ADM2021),1,Required
4,Secondary school rank (ADM2021),5,Considered but not required
5,Secondary school rank (ADM2021),2,Recommended
6,Secondary school rank (ADM2021),3,Neither required nor recommended


It's important to understand what the data above is showing before we move on. We extracted out two variables — GPA and ranking. However, according to the labels file, these numbers do not represent admitted students' performance on these metrics. Rather, they show how much the colleges (officially) consider these metrics in their admissions decisions. According to this, the Ivy Leagues have the following stance on GPA:
- **Required**: Dartmouth, University of Pennsylvania
- **Recommended**: Brown, Columbia, Harvard, Princeton, Yale
- **Considered but not required**: Cornell

Similarly, they have the following stance on ranking:
- **Required**: None
- **Recommended**: Brown, Columbia, Harvard, Princeton, Yale
- **Considered but not required**: Cornell, Dartmouth
- **Neither required nor recommended**: University of Pennsylvania.
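The manual lookup above could also be done programmatically. The sketch below is one possible approach, not the notebook's own method: it rebuilds a small subset of the two displayed tables by hand (values copied from the output above), reshapes the codes to long format, and joins them to the labels. The names `codes`, `label_table`, and `decoded` are hypothetical.

```python
import pandas as pd

# Hand-copied subset of the coded values displayed above.
codes = pd.DataFrame({
    "Institution Name": ["Cornell University", "Dartmouth College",
                         "University of Pennsylvania"],
    "Secondary school GPA (ADM2021)": [5, 1, 1],
    "Secondary school rank (ADM2021)": [5, 5, 3],
})

# Hand-copied version of the label rows displayed above.
label_table = pd.DataFrame({
    "VariableName": ["Secondary school GPA (ADM2021)"] * 3
                    + ["Secondary school rank (ADM2021)"] * 4,
    "Value": [1, 5, 2, 1, 5, 2, 3],
    "ValueLabel": ["Required", "Considered but not required", "Recommended",
                   "Required", "Considered but not required", "Recommended",
                   "Neither required nor recommended"],
})

# One row per (institution, variable), then attach the readable label.
long_codes = codes.melt(id_vars="Institution Name",
                        var_name="VariableName", value_name="Value")
decoded = long_codes.merge(label_table, on=["VariableName", "Value"], how="left")
print(decoded[["Institution Name", "VariableName", "ValueLabel"]])
```

A left join is used so that any code missing from the labels file would show up as a blank label rather than silently dropping the institution.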

Now, let's take a look at the actual distribution of these metrics for admitted students, drawn from secondary data sources (see the Reference list for links; the figures come mostly from the colleges' own class profile pages, from secondary source estimates, or from data provided by ChatGPT queries). Note that the GPAs are all normalized to be unweighted.
- Brown: Mean GPA of 4.0, 95% of students in top decile for ranking
- Cornell: Mean GPA of 4.0, 84% of students in top decile for ranking
- Harvard: Mean GPA of 4.0, 93% of students in top decile for ranking
- Dartmouth: Mean GPA of 4.0, 95% of students in top decile for ranking
- Columbia: Mean GPA of 4.0, 96% of students in top decile for ranking
- Princeton: Mean GPA of 3.9, N/A class ranking data
- Yale: Mean GPA of 4.0, 95% of students in top decile for ranking
- University of Pennsylvania: Mean GPA of 3.9, 94% of students in top decile for ranking

Let's put this data into a DataFrame so we can easily generate a few visualizations. Below, we collect the data above into a dictionary and convert it into a new DataFrame.

In [35]:
# Represent data as dictionary, which can easily be converted into a pandas DataFrame
data_dict = {'University': ['Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia', 'Princeton', 'Yale', 'University of Pennsylvania'],
             'Mean GPA': [4.0, 4.0, 4.0, 4.0, 4.0, 3.9, 4.0, 3.9],
             'Percentage': [95, 84, 93, 95, 96, 0, 95, 94]
            }

gpa_rank_data = pd.DataFrame(data_dict)
gpa_rank_data

Unnamed: 0,University,Mean GPA,Percentage
0,Brown,4.0,95
1,Cornell,4.0,84
2,Harvard,4.0,93
3,Dartmouth,4.0,95
4,Columbia,4.0,96
5,Princeton,3.9,0
6,Yale,4.0,95
7,University of Pennsylvania,3.9,94


Below, I use Altair to generate two bar charts side by side, one showing the average GPA of students admitted to these universities, and one showing the percentage of students in the top decile of their high school graduating class. In viewing the visualization, please note that Princeton has no class ranking data; it is encoded as 0 in the DataFrame above, so its bar simply appears blank.

In [36]:
# Generate GPA chart
gpa_chart = alt.Chart(gpa_rank_data).mark_bar().encode(
    alt.X('University'),
    alt.Y('Mean GPA')
).properties(
    width=350,
    title='Average GPA of Admitted Students'
)

# Generate ranking chart
rank_chart = alt.Chart(gpa_rank_data).mark_bar().encode(
    alt.X('University'),
    alt.Y('Percentage')
).properties(
    width=350,
    title = 'Percentage of Admitted Students in Top Decile of Graduating Class'
)

gpa_chart | rank_chart

Visually, the suggestion is quite striking: put simply, admitted students tend to have very high GPAs and class rankings.
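One caveat on the placeholder 0 used for Princeton's class ranking: it draws as an empty bar, but it would distort any summary statistic computed from that column. A minimal sketch of an alternative, assuming the same top-decile percentages listed earlier, is to store the gap as `NaN`, which pandas skips by default. The name `top_decile` is hypothetical.

```python
import numpy as np
import pandas as pd

# Same top-decile percentages as listed earlier, with Princeton's gap stored as NaN.
top_decile = pd.DataFrame({
    "University": ["Brown", "Cornell", "Harvard", "Dartmouth", "Columbia",
                   "Princeton", "Yale", "University of Pennsylvania"],
    "Percentage": [95, 84, 93, 95, 96, np.nan, 95, 94],
})

# mean() ignores NaN, so only the seven reporting schools are averaged.
mean_reported = top_decile["Percentage"].mean()

# Treating the gap as 0 counts Princeton as a real observation.
mean_with_zero = top_decile["Percentage"].fillna(0).mean()

print(f"Mean over reporting schools: {mean_reported:.1f}")   # 93.1
print(f"Mean with 0 placeholder:     {mean_with_zero:.1f}")  # 81.5
```

The difference of more than ten percentage points comes entirely from how the missing value is encoded, not from anything in the data itself.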

### Test Scores
We repeat the process above for test scores, again extracting out the variables of interest, and then generating a visualization with auxiliary data linked in the references.

In [37]:
# Extract variables of interest
data_grades_sat = data.loc[:, ['Institution Name', 'Admission test scores (ADM2021)']]
display(data_grades_sat)
labels[(labels['VariableName'] == 'Admission test scores (ADM2021)')]

Unnamed: 0,Institution Name,Admission test scores (ADM2021)
0,Brown University,5
1,Columbia University in the City of New York,5
2,Cornell University,5
3,Dartmouth College,5
4,Harvard University,5
5,Princeton University,5
6,University of Pennsylvania,5
7,Yale University,5


Unnamed: 0,VariableName,Value,ValueLabel
13,Admission test scores (ADM2021),1,Required
14,Admission test scores (ADM2021),5,Considered but not required


According to the data and corresponding labels above, it seems that in 2021, admissions test scores were considered but not required for all of these schools. The implication is that perhaps they are not that important. This has been the schools' stated position for the past few years, it seems. Nevertheless, the schools still publish test score statistics for admitted students in their class profiles. Let's look at the most recent data:
- Brown: SAT: Middle 50% between 1500 and 1570, ACT: Middle 50% between 34 and 36
- Cornell: No data on average score, but profile says 41% of enrolling students submitted SAT, 20% submitted ACT
- Harvard: No information on class profile about testing.
- Dartmouth: SAT Reading/Writing: 733, SAT Math: 750, ACT: 33
- Columbia: SAT: Middle 50% of students between 1490 and 1560, ACT: Middle 50% between 34 and 35.
- Princeton: SAT: Middle 50% between 1490 and 1580, ACT: Middle 50% between 33 and 35.
- Yale: No data published, explicitly acknowledge it was because the test scores were optional.
- University of Pennsylvania: SAT: Middle 50% between 1510 and 1560, ACT: Middle 50% between 34 and 36.

Again, I manually generate a new DataFrame from the additional data above, and I use it to generate two charts. This time, I generate a different sort of bar graph, which visualizes the interquartile spread (middle 50%) of the test scores for schools which provide this data.

In [38]:
data_dict = {'University': ['Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia', 'Princeton', 'Yale', 'University of Pennsylvania'],
             'Low SAT': [1500, 0, 0, 1440, 1490, 1490, 0, 1510],
             'High SAT': [1570, 0, 0, 1560, 1560, 1580, 0, 1560],
             'Low ACT': [34, 0, 0, 32, 34, 33, 0, 34],
             'High ACT': [36, 0, 0, 35, 35, 35, 0, 36]
            }

testing_data = pd.DataFrame(data_dict)

# The X2 parameter allows us to make bars that start and end in a specific place
sat_chart = alt.Chart(testing_data).mark_bar().encode(
    alt.X('Low SAT'),
    alt.X2('High SAT'),
    alt.Y('University')
).properties(
    width=500,
    height=350,
    title = 'Middle 50% of SAT Scores for Ivy Leagues'
)

act_chart = alt.Chart(testing_data).mark_bar().encode(
    alt.X('Low ACT'),
    alt.X2('High ACT'),
    alt.Y('University')
).properties(
    width=500,
    height=350,
    title = 'Middle 50% of ACT Scores for Ivy Leagues'
)

display(sat_chart)
display(act_chart)

As far as test scores go, it seems the Ivy Leagues can be taken more at their word. Yes, the reported scores are still high (especially for being the middle 50%), but at the same time, many of the universities (especially Yale, Cornell, and Harvard) seem to be acknowledging the fact that test scores are optional and placing less weight on them in admissions decisions.
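To put a number on "still high", the sketch below summarizes the SAT ranges entered in the DataFrame above, dropping the schools whose values are 0 placeholders. It is only a rough summary under that assumption; the name `reported` and the derived column names are hypothetical.

```python
import pandas as pd

# SAT bounds as entered in the DataFrame above (0 marks schools without data).
testing_data = pd.DataFrame({
    "University": ["Brown", "Cornell", "Harvard", "Dartmouth", "Columbia",
                   "Princeton", "Yale", "University of Pennsylvania"],
    "Low SAT": [1500, 0, 0, 1440, 1490, 1490, 0, 1510],
    "High SAT": [1570, 0, 0, 1560, 1560, 1580, 0, 1560],
})

# Keep only schools that actually report a range.
reported = testing_data[testing_data["High SAT"] > 0].copy()

# Midpoint and width of the middle 50% for each reporting school.
reported["SAT midpoint"] = (reported["Low SAT"] + reported["High SAT"]) / 2
reported["SAT spread"] = reported["High SAT"] - reported["Low SAT"]

print(reported)
print("Lowest 25th-percentile score among reporting schools:",
      reported["Low SAT"].min())
```

Filtering before summarizing matters here for the same reason as with class rank: a 0 placeholder would otherwise be read as a genuine score.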

### Financial Aid
Finally, let's see if there are any interesting patterns in the financial data we have.

In [39]:
data_price = data.loc[:, ['Institution Name', 'Total price for in-state students living on campus 2021-22 (DRVIC2021)', 'Total price for out-of-state students living on campus 2021-22 (DRVIC2021)']]
display(data_price)

Unnamed: 0,Institution Name,Total price for in-state students living on campus 2021-22 (DRVIC2021),Total price for out-of-state students living on campus 2021-22 (DRVIC2021)
0,Brown University,82570,82570
1,Columbia University in the City of New York,82584,82584
2,Cornell University,80287,80287
3,Dartmouth College,81501,81501
4,Harvard University,78028,78028
5,Princeton University,78490,78490
6,University of Pennsylvania,83298,83298
7,Yale University,82170,82170


Let's consider this in the context of some of the financial aid data released in these universities' most recent class profiles:
- Brown: No information
- Cornell: 83% of students who applied for need-based aid received it; average award was 49,022
- Harvard: Out of total tuition of 68,100, typical aid package included 55,900 of aid
- Dartmouth: 56% of students received aid, average award of 67,127
- Columbia: Average grant of 63,971 for those who received it
- Princeton: 61% of admitted students qualified for financial aid
- Yale: 51% of admitted students receiving need-based award, average award of 61,500
- University of Pennsylvania: No information

Finally, I generate a third type of bar chart. I group the data by university and visualize the difference between the total cost of attendance and the average financial aid package. Note that for universities that don't have available data, I leave the grouping blank by providing zero values.

In [44]:
data_dict = {'University': ['Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia', 'Princeton', 'Yale', 'University of Pennsylvania',
                            'Brown', 'Cornell', 'Harvard', 'Dartmouth', 'Columbia', 'Princeton', 'Yale', 'University of Pennsylvania'],
             'Type': ['Total Tuition ($)']*8 + ['Average Grant ($)']*8,
             'Cost': [0, 78000, 80000, 80000, 80000, 0, 78000, 0, 0, 49022, 55900, 67127, 63971, 0, 61500, 0],
            }

financial_data = pd.DataFrame(data_dict)

# The "column" property lets us generate the different groupings
# We provide a redundant encoding of color for 'Type' to make the values a bit easier to distinguish
finance_chart = alt.Chart(financial_data).mark_bar().encode(
    alt.X('Type'),
    alt.Y('Cost'),
    alt.Color('Type'),
    column='University'
).properties(
    title='Total Attendance Cost vs. Average Aid Given',
    width=100
)

finance_chart
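The gap that the chart shows visually can also be computed directly. The sketch below pairs the IPEDS 2021-22 on-campus total price shown earlier with the average award figures quoted from the class profiles, for the five schools that report one. Treat the result as a rough comparison only: the two sources may not cover the same year, Harvard's figure is described as a "typical" package, and the averages apply only to students who received aid. The name `aid` and the derived column names are hypothetical.

```python
import pandas as pd

aid = pd.DataFrame({
    "University": ["Cornell", "Harvard", "Dartmouth", "Columbia", "Yale"],
    # Total price for students living on campus, 2021-22 (IPEDS table above).
    "Total price": [80287, 78028, 81501, 82584, 82170],
    # Average or typical award quoted from the class profiles above.
    "Average award": [49022, 55900, 67127, 63971, 61500],
})

# Cost left over for an aid recipient, and the share of the price covered by aid.
aid["Remaining cost"] = aid["Total price"] - aid["Average award"]
aid["Share covered"] = (aid["Average award"] / aid["Total price"]).round(2)

print(aid.sort_values("Remaining cost"))
```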

## A Change in Approach
At this point, I seem to have run into a bit of an issue. It seems I did not consider the limitations of the data closely enough with regard to my initial research questions. In particular, I only have access to aggregate data regarding admissions. That is, there is no individual mapping of factors such as grades, test scores, etc. to admission outcome. This makes it difficult to quantitatively answer my first research question, as there is no outcome variable I can use to train a potential model. It also makes it downright impossible to answer my second question, as I cannot conduct a correlation analysis for each factor vs. admission outcome when I don't have individualized data.
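To make the limitation concrete, the sketch below shows the shape of data the original questions would have needed: one row per applicant, with a factor column and a 0/1 outcome column. The rows are entirely synthetic and invented for illustration; nothing like them exists in the IPEDS extract, and the resulting number carries no meaning about real admissions.

```python
import pandas as pd

# SYNTHETIC, made-up applicants: used only to show the required data shape.
applicants = pd.DataFrame({
    "gpa": [3.2, 3.5, 3.7, 3.8, 3.9, 4.0, 4.0, 3.6],
    "admitted": [0, 0, 0, 1, 0, 1, 1, 0],  # outcome variable missing from IPEDS
})

# Pearson correlation against a 0/1 outcome is the point-biserial correlation.
r = applicants["gpa"].corr(applicants["admitted"])
print(f"point-biserial r = {r:.2f}")
```

With only one aggregate row per university, there is no per-applicant `admitted` column to correlate against, which is why this kind of calculation cannot be run on the data at hand.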

At this point, I have a couple of options. I can change my research questions entirely and start from scratch, I can go hunting for additional data that might enable me to better answer these questions, or I can shift my research goals and methodology slightly to better align with the data I already have available. I will choose the third option, as it is the most realistic when taking into consideration both my time and knowledge restrictions.

Although I cannot use it to make direct causal claims, I do think the exploratory data analysis above provides great and useful insights with respect to my first research question (which, as a reminder, is concerned with determining the most important factors influencing admission into these universities). As such, at this point, I will table the second research question, and use the analysis above to spur exploration of a slightly altered version of the first research question:

**From the perspective of a potential applicant to an Ivy League University, what factors are the most important for maximizing chances of admission?**

As I mentioned, I don't have the data to answer this quantitatively. However, coupled with the light quantitative analysis above, I conduct an ethnography below from the perspective of such a student, in an effort to explore this research question in greater detail.

## An Ethnography of Sorts
After some exploratory data analysis, it seems that these colleges do make some effort to be holistic (financial aid data and optional test scores provide some evidence for this), but they are still not at the level they claim to be (the average GPAs and class rankings are heavily skewed toward a particular "type" of student, for lack of a better word).

For the next part of my analysis, I will conduct an online autoethnography, adopting the persona of a high school student who wants to gain admission to the Ivy League Universities. For simplicity, I will focus on one school, Columbia University. I will approach the ethnography in three phases, with each phase consisting of detailed note taking and reflection:
1. Exploring the school admissions page itself
2. Looking at the advice of college preparatory websites
3. Browsing Columbia's Reddit forum, r/columbia

As I explore these pages, I will keep the following questions/themes in mind: What qualities do I need to exhibit to gain admission? Does the process truly feel holistic or more atomistic? Additionally, I will keep an eye out for information regarding grades/ranking, test scores, and financial aid, to complement the analyses above.

### Part 1: [The Columbia Admissions Page](https://undergrad.admissions.columbia.edu/)
#### Notes
- Interesting opening: "Dear Aspiring Columbian: Take a deep breath. Now, let it out. The college application process is a personally meaningful milestone, and there are many resources to support you. This is a time to discover who you are, imagine who you want to become, and decide whether Columbia College or Columbia Engineering might be the right college for you." They go on to say that what they are looking for is the best _match_. But how do I become that?
- "Columbia is test-optional and does not have a "cut-off" GPA or test score for admission, and academics are considered alongside the full application."
- Some of the Columbia-specific application questions: "List the titles of the books, essays, poetry, short stories or plays you read outside of academic courses that you enjoyed most during secondary/high school (75 words or fewer)." "Why are you interested in attending Columbia University? We encourage you to consider the aspect(s) that you find unique and compelling about Columbia (200 words or fewer)." "In Columbia’s admissions process, we value who you are as a unique individual, distinct from your goals and achievements. In the last words of this writing supplement, we would like you to reflect on a source of happiness. Help us get to know you further by describing the first thing that comes to mind when you consider what simply brings you joy (35 words or fewer)." Seems to support this idea of being holistic, but how much do these essays really matter?
- Adjectives they use to describe admissions process: holistic, contextual, need blind, and committee based.
- Seems to indicate that if I need financial aid, I will get it.
- Look for academic preparation, curiosity, engagement with others, individual voice, knowledge of Columbia.
- They really want people to be excited about this Core Curriculum of theirs. I should look into this more.
- "Our responsibility is to admit students who will be successful in our rigorous curriculum, so first and foremost the review is grounded in an examination of the applicant's past and current academic performance. Only after a foundation of academic excellence has been demonstrated will other areas of achievement and potential for impact be considered." This quote stands in direct opposition to the claims of being holistic (it is on a different page, the Testing Policy one, but it is still within the admissions site).
- Standardized tests are optional. Keeps saying "not a disadvantage" if you don't submit them, but also will be reviewed if submitted. These two things feel somewhat contradictory. Why not exclude them altogether?

#### Reflection
There are a number of interesting points scattered throughout the Columbia admissions page. However, what I found most confusing is that there almost seems to be a contradiction in the advice they give to aspiring students. On the one hand, they encourage taking it slow, emphasize that their process does not look for a specific grade or test score, and insist that they just want students who are the best match. On the other hand, they also directly state, "Our responsibility is to admit students who will be successful in our rigorous curriculum, so first and foremost the review is grounded in an examination of the applicant's past and current academic performance. Only after a foundation of academic excellence has been demonstrated will other areas of achievement and potential for impact be considered." By its very definition, such a process is no longer holistic, since it quite literally places one metric first among all others. I am not quite sure what to make of this, but so far, I think they are really misrepresenting what they mean by holistic. They want people to _think_ their holistic process will encourage all aspects of an applicant equally from the get-go, but in reality, it feels like once academic excellence is established, other things are considered. This would imply that many people with great GPAs and ranking do not get in, but those that do get in have high GPAs and rankings. This also seems to match with some of the quantitative data above.

### Part 2: [Going Ivy](https://goingivy.com/colleges/how-to-get-into-columbia/)
#### Notes
- "If you have a goal of getting into Columbia, you will have to obtain As in nearly all of your high school classes, receive top scores on your standardized admission test, write compelling essays, and participate in extracurricular activities that showcase your abilities and qualities." This seems to match what I wrote in the previous section's reflection. But I need more.
- This page seems to emphasize test scores much more than Columbia's official page. Maybe it is outdated. Or maybe it's trying to tell me something. But later on, the page does mention that Columbia is test optional now.
- "While most students do not want to write their essays for their applications to Columbia, the essays and short answer questions are critical." Are they really that critical? I want to know more about this in particular.
- "Many high school students have the idea that they have to participate in as many extracurricular activities as possible to stand out on their college applications. When you are applying to an elite institution like Columbia, however, this approach is not effective. Columbia is more interested in the quality of your participation in your extracurricular activities rather than in how many you have participated in."
- "If you do not receive an invitation to be interviewed, don’t worry. This is not a sign that your application will be denied." Don't know if I believe this.
- "Make certain that you have a written schedule that you follow. You should carve out time in your schedule each week to relax and have fun. High school students are expected to have some fun and to build strong relationships with others. If you study all of the time and never take time for yourself, you will miss out on key developmental opportunities and appear to be one-dimensional." Ach. Even having fun is about getting into college now, indirectly.

#### Reflection
There was not a huge amount of new information on this site, but it did change, or rather, sharpen, some of my previous thoughts. Specifically, my suspicion that grades matter more than Columbia lets on officially, as a baseline at least, seems to be supported by the material on this page. However, there are a few claims made here that I am skeptical about, and want to explore further in the next section below. Specifically, I feel like an interview does indicate that you might have a better chance at admission, and that the essays don't matter nearly as much as this page claims. Let's see what Reddit has to say about these topics.

### Part 3: [Columbia Reddit](https://www.reddit.com/r/columbia/)
#### Notes
- Regarding interview chances: "Possibly but don’t read too much into it. My year a lot of people I talked to didn’t get an interview offer at all but got in so not getting one doesn’t necessarily mean much. How many interviews they conduct depends on a lot of other factors."
- On a post where someone asked for tips about getting in: "Think of a good reason why you want to go. Get a perfect SAT score. Ask your parents to donate at least 10 million dollars to Columbia. Have President Obama write your recommendation letter."
- Some say the Core Curriculum, because required, can actually be restrictive if you want to study something specific deeply.
- One piece of advice from current students for essays was to try and stress a broad set of interests.
- "high school senior thinking about double majoring in CS and environmental engineering, and premed, and possibly go into law as well. Do you think it’s possible to double major as well as premed and prelaw, on top of Columbia’s core curriculum?" Not super relevant, but saddens me a little, and felt it worth noting.
- "Yeah burnout is real. Enjoy the high of getting in. Read for fun. Learn about something irrelevant. Learn how to make pasta from scratch. You'll have plenty of time to study and hate yourself later."
- Within the essays, the idea of match comes up again: "The admissions officers want to know that you aren't just applying to Columbia for the sake of applying, and get to know your interests and how they would be best explored at Columbia over any other school."

#### Reflection
I found browsing the Reddit forum to be particularly useful because it provided information that was lacking on the official pages and the college preparatory site above. In particular, it seems what they say about interviews officially is the truth: Not getting one doesn't necessarily hurt your chances of acceptance. That's good to know. Of course, there is also the matter of the essays. One admitted student advised an applicant, "The admissions officers want to know that you aren't just applying to Columbia for the sake of applying, and get to know your interests and how they would be best explored at Columbia over any other school." This seems to indicate that I was half right about the essays. They don't appear to be of the utmost importance, _but_ they do play a role. Almost like checking a box off on a list.

At this point, I realize one could reach the conclusion that maybe the process is holistic after all. I mean, they don't _just_ use grades, but they expect good ones, and they don't _just_ use essays, but the essays need to reveal that you've done your homework about Columbia. Isn't that holistic? I am not so sure, however. I feel like it is more a situation where they look at these baselines, and then they want someone who has done something incredible (e.g. started a nonprofit, won a national competition in STEM, competed as a D1 athlete, etc.). This may not seem obvious from the ethnography above, but there are subtle hints. For example, the middle step, where I studied the Going Ivy site, emphasized quality of extracurriculars over quantity. It seemed to imply you needed to be stellar in something. Then there is that quote from Reddit which ends with Obama writing a rec letter. I am not sure how to classify this type of admissions process, but it sure isn't holistic, as, in a scattered way, the admitted applicants seem to fit similar overall profiles.

# FINDINGS
Based on the exploratory data analysis above as well as the online ethnography conducted, I now discuss some of the findings relevant to my revised research question of what factors a prospective student should consider in order to maximize chances of admission into Ivy League universities. To best understand this, it is logical to split the findings into two parts: what the official statements from the universities say and what the data seems to suggest.

## The Official Appearance
By and large, it's quite clear that the Ivy League universities stress how _holistic_ their overall process is. At the beginning of this analysis, when I looked into some background and related work, this seemed to be the case. The quotes taken from their official pages, especially, seem to affirm this claim. At the same time, the additional review of related work at the start seemed to suggest that these quotes were not to be taken at face value, and indeed the data supports this somewhat.

## What the Data Shows
Let's start with the exploratory data analysis. We can see from the graphs generated above that by and large, students admitted to these universities tend to have high GPAs, class rankings, and test scores. Yes, some data is missing, but the general trend is clear. Taken just by itself, this might be a little misleading, as it suggests that high grades alone are enough. However, if that were the case, more students would get in than there is space available.

The ethnography revealed something quite interesting. The word "holistic" at face value seems to suggest that it's not all about grades and scores, so how can this be reconciled with what the data seems to show? Well, a study of Columbia's official admissions page seems to reveal that academic excellence is an important baseline. In fact, this page (which is a bit deeper and not on the front admissions page, which is interesting) states quite clearly that only after academic excellence is established are other qualities considered. Additionally, Part 2 of the ethnography (reviewing a professional admissions consulting site) emphasizes this point, and also builds upon it. In particular, it indicates that beyond academics, admissions officers expect to see extremely high quality in whatever extracurriculars are chosen. If you are an athlete, you should be competitive at the national level. If you are in a club, you should hold a leadership position and use it to accomplish something meaningful and impactful. So on and so forth. Finally, the final part of the ethnography hints, somewhat jokingly, at the incredible level of accomplishment needed to secure admission (e.g. sardonically suggesting getting a letter of recommendation from Barack Obama).

With all of these things in mind, I summarize the tentative findings of my work. Based on the exploratory data analysis above and the ethnography, it seems a prospective Ivy League student should keep the following things in mind if they wish to maximize chances of admission:

- While there is no official minimum for grades and test scores (and in some cases test scores are not even required), these universities expect strong applications to showcase strong evidence of academic rigor and excellence. This is far from the only necessity, but the data suggests that it tends to be an important starting point.
- The number of extracurricular activities is not that important. This is important to mention as many students often try to join as many clubs as possible to appear attractive to these schools. However, it seems that these schools want to see incredible levels of achievement in 1-2 areas. Put simply, to have a chance at admission, one needs to stand out at whatever they do.
- The term _holistic_ is a little misleading. It seems to suggest, especially with other wording on the page, that perhaps academic excellence doesn't matter that much (since there is no official minimum), or that involvement in many different things can be attractive. The reality seems to be that no specific things are officially required, but both strong academic performance and some other factor that makes an applicant stand out immensely are important in the decision-making process.

# DISCUSSION

## Appropriateness of Methodology
As mentioned earlier when I shifted my research question, the methodology used above (specifically, an exploratory quantitative analysis followed by an ethnography) would not have been appropriate for my initial research questions, which involved making definitive statistical claims about factors influencing admissions. However, it fits well with my revised research question, which doesn't seek to mathematically determine the connection between various factors and my final conclusion, but rather to explore them through a qualitative lens. In particular, I wanted to know what is most important from the specific _perspective_ of a student applicant. For this, an ethnography taking on the persona of an applicant exploring the process of applying to one of these colleges felt appropriate. Additionally, the initial exploratory data analysis (conducted before the research question revision) provided me with some context for what appears to matter most for these admissions, and helped me to outline the high-level structure of the ethnography before actually conducting it.

## Implications
The best way to discuss the implications is to draw a distinction between what a student might think matters for gaining admission into these universities at first glance, and what actually ends up mattering.

Based on the way these schools market themselves, especially on their front pages, it's very easy to assume that the word "holistic" is somehow synonymous with "diverse." However, there is a difference between a diverse campus and a diverse individual. I know from experience, as well as from the work above, that it is very easy for students to fall into the trap of thinking that if they just do many different things, they will be a strong candidate for admission. However, the reality appears to be that they need to be academically strong no matter what, and then they really need to excel in 1-2 areas. This makes for a diverse campus overall, as everyone's individually strong extracurriculars will combine together in the admitted student pool.

I think the implications of this work are important. As it stands, this doesn't seem to be super well known. It isn't immediately obvious from the admissions pages; it is almost veiled somehow, hidden in a way. It is something I have heard anecdotally, and it is a theme that only _starts_ to emerge even after all of the work above. Even though the methods are qualitative, it is unlikely the majority of applicants consider and review material this deeply before applying (for instance, I never checked forums like Reddit, and I applied to many Ivy Leagues, unsuccessfully, as a high schooler). If this were to be stated more directly and publicly, I think many more students would have a fair shot at acceptance into these schools, especially as they would be able to start prepping early on.

## Limitations
There are a few important limitations with this work. Primarily, the methods used cannot be used to make any causal claims about factors that weigh into admission. At best, I can refer to the findings above as educated guesses. They are certainly supported by the evidence found, but they should be taken with a grain of salt.

Additionally, particularly within the exploratory data analysis, it is important to acknowledge that some data is missing, as it was not available on the class profiles of the respective universities.
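
One way to make this limitation more visible would be to report exactly which metrics are missing for which school. The following is only a minimal sketch under stated assumptions: the table `class_profiles`, its column names, the placeholder schools, and all values are hypothetical stand-ins, not the data used in the analysis above.

```python
import pandas as pd

# Hypothetical stand-in for a table of class-profile metrics.
# All names and values are placeholders, not real figures.
class_profiles = pd.DataFrame({
    "University": ["School A", "School B", "School C"],
    "avg_gpa": [3.9, None, 3.95],
    "pct_top_10_percent": [95.0, 94.0, None],
    "sat_median": [None, None, 1520.0],
}).set_index("University")

# Share of schools that do not report each metric
missing_share = class_profiles.isna().mean()
print(missing_share)

# Which metrics are missing for each school
missing_by_school = {
    school: [
        col for col in class_profiles.columns
        if pd.isna(class_profiles.loc[school, col])
    ]
    for school in class_profiles.index
}
print(missing_by_school)
```

A summary like this would let a reader see at a glance which comparisons across schools rest on incomplete information.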

Finally, the ethnography above currently only looks at Columbia University. It is reasonable to loosely extend the implications to the other Ivy Leagues, particularly in light of the similarities discovered during the background work and exploratory data analysis, but it would still be a useful area of future work to conduct similar ethnographies for the other universities as well.

# CONCLUSION
Overall, it appears to be the case that the Ivy League Universities are holistic, but not necessarily in the way they claim to be. In the related work, I referenced a paper which discussed how these colleges don't actually admit based on student achievement, but on the basis of maintaining their elite statuses. This appears to be somewhat true, but in a more subtle way that still gives the illusion of student achievement being the only factor at play.

In particular, the admissions process does take into account academic performance as well as the unique abilities and contributions of students. However, almost all admitted students seem to follow the same pattern: They performed extremely well academically, and they are extremely talented in some extracurricular. This makes it so that pointing to any individual admitted student supports the idea of holistic admission, but it also restricts admissions to a narrow paragon of excellence, which helps maintain the elite images these colleges wish to portray and protect. Of course, there must be exceptions, but this seems to be the general tale.

Though I did not end up fully exploring my first research question as initially posed due to the data limitations, I do think the work here supports my original hypothesis at least indirectly. It does appear that two factors, academic and extracurricular excellence, play quite a large role in admissions. However, they are broad enough that they can be split into many different factors if presented in the right way.

In future work, I would like to explore the influence of more niche factors, such as social class, donations, athletic potential, etc., on admission to these colleges. This work sets a good foundation for exploring such questions, and I would be curious to see, through a different set of data, if they are interrelated in some way.

But for now, I return to my original question: How holistic are the Ivy Leagues, really? By their definition, pretty holistic. But by the definition that comes to mind for most people, maybe not so holistic after all.

# Additional Data Sources and References Not Linked Throughout Notebook
- https://admissions.dartmouth.edu/apply/class-profile-testing
- https://admission.brown.edu/explore/brown-facts
- https://admissions.cornell.edu/sites/admissions.cornell.edu/files/ClassProfile%202025%20Profile%20Updated%20FINAL.pdf
- https://blog.collegevine.com/what-does-it-really-take-to-get-into-harvard
- https://undergrad.admissions.columbia.edu/class-2025-profile
- https://admissions.yale.edu/sites/default/files/yale_classprofile2025web.pdf
- https://www.upenn.edu/about/faq

**Note**: Some sources are listed in the file chatgpt.md for clarity, as these are sources that ChatGPT used in order to give certain responses. This file, for complete transparency, lists the full queries given to and responses received from ChatGPT in the course of gathering data for this research.