<div class="alert alert-block alert-danger">

# 2A: College and Social Mobility (COMPLETE)

*This notebook is intended for students who have completed up to:*
 
**Page 2.5**

</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this lesson, students will explore a data frame of over 2,000 colleges and universities. It contains information about income before and after college, and students will examine whether parental income or type of college has any impact on one's social mobility.

#### Includes:

- Practice effectively visualizing both quantitative and categorical variables
- Using two-variable visualizations to gauge how much variation in the response is explained by a predictor
- Exploring the relationships between visualizations and summary statistics

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 40-55 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

<img src="https://i.postimg.cc/hgQxmL5Q/johnston-gate-harvard.jpg" alt="Johnston Gates at Harvard University" width = 50%>

## College and the American Dream

We tend to think of colleges as places of opportunity. In particular, for low-income students, colleges present themselves as places of social mobility. If you can score a letter of admission, obtain financing, and work hard for four years, you'll get a degree that will open personal and financial doors for the rest of your life.

However, critics have challenged the above narrative. They say that colleges simply reproduce current social hierarchies. They claim that, for the most part, colleges serve and/or graduate students from already high-income families. In doing so, they cement inequities in opportunity.

### Motivating Question: Are colleges *actually* engines for social mobility? Do some create more economic advancement than others?

### The Dataset
**Description:** Describes family income (income while growing up) and adult income (income after graduating college) for students who attended college in the early 2000s. Data are in tidy format, with one row per college. See more in the official dataset [documentation](https://opportunityinsights.org/wp-content/uploads/2018/03/Codebook-MRC-Table-1.pdf).

##### Variables
- `name`: Name of college
- `location`: City in which college is located
- `state`: State in which college is located
- `class_size`: Average size of graduating classes during years measured
- `tier`: School tier (1-12), lower number indicates more prestigious tier of school
- `tier_name`: Description of tier (e.g. selective public, selective private, IVY Plus, etc.)
- `type`: School type (private non-profit, private for-profit, public)
- `median_family_income`: Median family income of students at the school
- `percent_from_bottom_20`: Percent of students at the school who come from bottom 20% income households
- `percent_from_bottom_20_and_reached_top_20`: Percent of all students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults


##### Data Sources: 
 - Chetty et al. (Opportunity Insights), Mobility Report Cards database: https://opportunityinsights.org/paper/mobilityreportcards/
    - Specifically: MRC Codebook Table 1, Preferred Estimates of Access and Mobility Rates by College:  https://opportunityinsights.org/data/?#resource-listing

<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  10-15 mins

</div>

### 1.0 - Exploring the College Mobility Dataset

**1.1 -** Run the following codeblock to download the dataset and display the first few rows.

In [None]:
# Download the dataset
colleges <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRKiYHNVKs8sWMo1FpG_whoNhiMGR5NQ36hiBqJbOtKnvpzStY9g-dLjAyPCDywnHVH_zOFoyWQPpyD/pub?gid=1276561522&single=true&output=csv")
head(colleges)

**1.2 -** In this dataset, each row represents a college. How many total colleges are included in the dataset?

In [None]:
# Sample Responses

str(colleges)

glimpse(colleges)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

There are 2,199 colleges in the dataset.

</div>
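
If only the count is needed, `nrow()` reports it directly, since each row is one college. A minimal sketch on a made-up data frame (`toy_colleges` and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data; names and values are invented for illustration
toy_colleges <- data.frame(
    name = c("College A", "College B", "College C"),
    tier = c(1, 6, 12)
)

# One row per college, so the row count is the number of colleges
n_colleges <- nrow(toy_colleges)
n_colleges
```

Running `nrow(colleges)` on the real data frame should agree with the row count reported by `str()` or `glimpse()`.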

**1.3 -** The `tier` column categorizes schools by their selectivity and prestige (1-12). Print out some of the Tier 1 colleges. Print out some of the Tier 12 colleges. What do you notice?

In [None]:
# Sample Response
filter(colleges, tier == 1)
filter(colleges, tier == 12)

# Sample Response
head(arrange(colleges, tier), 10)
tail(arrange(colleges, tier), 10)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The Tier 1 schools tend to be private, Ivy League universities. The Tier 12 schools tend to be for-profit technical schools that offer programs of less than two years.

</div>

**1.4 -** Come up with a strategy for using the data in this data frame to unpack our motivating questions. Which variables might be relevant? What relationships should we look at?

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

Students might zoom in on outcome variables such as `percent_from_bottom_20_and_reached_top_20` or `percent_from_bottom_20` and want to sort colleges by those variables. Encourage them to do so and discuss observations.

Some students might also consider relationships between variables like `tier` and `median_family_income` and the aforementioned outcome variables. This is a great idea (and exactly where we are headed). Lead into the next section which explores `median_family_income` of college students -- who are the colleges serving overall?
</div>

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  15 - 20 mins

</div>

### 2.0 - Whom do colleges serve?

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>


**2.1 -** The median household income in the United States in 2014 was $53,657. 

Visualize the median parental household income among these colleges. How do these median parental incomes compare to that of the country? What does this suggest about the vision of colleges as places for social mobility?

</div>

In [None]:
# Sample Response
gf_histogram(~median_family_income, data = colleges) %>% 
    gf_vline(xintercept=53657)

# Sample Response
gf_histogram(~median_family_income, data = colleges) %>%
    gf_boxplot(width = 20, alpha = .3) %>%
    gf_vline(xintercept=53657)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

We can see here that the majority of colleges tend to have median parental incomes above the median household income for the country. The right skew indicates that some colleges have parental median incomes **far** above that of the country.

</div>
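
The claim that "the majority" of colleges sit above the national median can also be checked with a summary statistic. A minimal sketch, assuming a made-up data frame (`toy_colleges` and its incomes are hypothetical; only the $53,657 figure comes from the prompt above):

```r
# Hypothetical median family incomes (in dollars) for five made-up colleges
toy_colleges <- data.frame(
    median_family_income = c(31000, 48000, 67000, 92000, 171000)
)

# 2014 US median household income, as stated in 2.1
us_median <- 53657

# The comparison gives TRUE/FALSE per college;
# the mean of a logical vector is the proportion of TRUE values
prop_above <- mean(toy_colleges$median_family_income > us_median)
prop_above
```

On the real data the same expression would use `colleges$median_family_income`. If that column contains missing values, adding `na.rm = TRUE` to `mean()` may be needed.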

**2.2 -** Which colleges serve the richest families? Did you expect these colleges? Do any of these names surprise you?

In [None]:
# Sample Response
arrange(colleges, desc(median_family_income))

**2.3 -** How are the colleges that serve the richest families different from colleges that have a lot of students who come from families in the bottom 20% of the income distribution?

In [None]:
# Sample Response
head(arrange(colleges, desc(median_family_income)), 20)
head(arrange(colleges, desc(percent_from_bottom_20)), 20)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

The tier descriptions of the schools are quite different. The schools that serve the richest families tend to be "elite" or "Ivy Plus", are private non-profit, and are often on the East Coast.

The schools with a large percentage of students from lower-income families tend to be non-selective or two-year (or less) schools. They are a mix of for-profit, public, and private, and many are in TX and NY.

</div>

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  15-20 mins

</div>

### 3.0 - Exploring Relationships

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.1 -** Do you think there is a relationship between `tier` and the percent of students who come from the bottom 20% income families? Write a word equation to express this idea and make a visualization. Comment on any trends you see.

</div>

In [None]:
# Sample Responses

gf_boxplot(percent_from_bottom_20 ~ factor(tier), data = colleges)

gf_jitter(percent_from_bottom_20 ~ tier, data = colleges)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

**percent_from_bottom_20 = tier + other stuff**

Lower tier schools (more prestigious) tend to serve few bottom 20% income students. The opposite is true for high tier (lower prestige) schools. There also seems to be variation by public/private. For example, Tier 3 is highly selective public and Tier 4 is highly selective private. The public schools tend to serve more bottom 20% income students.

</div>
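
A per-tier summary statistic can back up what the boxplot shows. The sketch below uses base R's `tapply()` to compute the median of the outcome within each tier. The data frame `toy_colleges` and its values are made up for illustration and are not from the dataset.

```r
# Hypothetical toy data: two made-up colleges in each of three tiers
toy_colleges <- data.frame(
    tier = c(1, 1, 3, 3, 12, 12),
    percent_from_bottom_20 = c(3, 5, 8, 10, 20, 30)
)

# Split the outcome by tier and take the median within each group
tier_medians <- tapply(
    toy_colleges$percent_from_bottom_20,
    toy_colleges$tier,
    median
)
tier_medians
```

The medians correspond to the center lines of the boxes in the boxplot, so the two views should tell the same story.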

**3.2 -** The dataset has a variable - `percent_from_bottom_20_and_reached_top_20` - that measures the percent of all students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults. The researchers behind the data refer to this as the **mobility rate.** 

We can hypothesize that some tiers have better mobility rates than others. Write this as a word equation and visualize this relationship. What do you notice?

In [None]:
# Sample Response
gf_boxplot(percent_from_bottom_20_and_reached_top_20 ~ factor(tier), data = colleges, width = .1)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

**mobility rate = tier + other stuff**

**percent_from_bottom_20_and_reached_top_20 = tier + other stuff**

The relationship isn't entirely clear. It seems that Tier 3 schools have generally the highest rate of mobility; however, there are several outlier Tier 5 and higher schools that have relatively high rates as well. Interestingly, Tier 11 & 12 schools (which had very high rates of accepting students from the bottom 20% income households) had low rates on this metric. This suggests that, while they accept many students from bottom 20% families, they are not the most successful at giving opportunities for mobility.  

</div>

**3.2, bonus -** Notice that this is a super long (but clear) variable name. Figure out a way to have a variable named `mobility_rate` (a shorter name) with the same values.

In [None]:
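# Sample Responses (either approach works)

# Option 1: copy the column with base R assignment
colleges$mobility_rate <- colleges$percent_from_bottom_20_and_reached_top_20

# Option 2: create the column with mutate()
colleges <- colleges %>%
    mutate(mobility_rate = percent_from_bottom_20_and_reached_top_20)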

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.3 -** If you haven't already, take a look at the top 20 schools, in terms of mobility rate. How are they different from the schools that serve mostly high income or mostly low income families?

</div>

In [None]:
# Sample Responses

head(arrange(colleges, desc(percent_from_bottom_20_and_reached_top_20)),20)

head(arrange(colleges, desc(mobility_rate)),20)

<div class="alert alert-block alert-warning">

<b>Sample Response:</b> 

These seem to mostly be small state universities or non-selective private schools that serve large urban areas (e.g. City College of NY, Cal State LA, Vaughn College Of Aeronautics And Technology). Note: Cal State LA is one of the main contributing universities that developed the CourseKata curriculum! Go Golden Eagles!

So, the most prestigious schools might not be the real engines for social mobility in higher education (since they accept so few students from lower-income households). Rather, modest universities in large urban areas do a better overall job in mobility rate.

</div>
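
The reasoning above can be made concrete with a little arithmetic. The mobility rate counts students who both came from the bottom 20% and reached the top 20%, so dividing it by `percent_from_bottom_20` gives the share of a school's bottom-20% students who reached the top 20%. The sketch below uses made-up numbers: the data frame, its values, and the column name `success_rate` are assumptions for illustration, and both inputs are assumed to be on the same scale.

```r
# Hypothetical toy data: two made-up schools, values in percent
toy_colleges <- data.frame(
    name = c("Selective U", "Urban State"),
    percent_from_bottom_20 = c(4, 30),
    mobility_rate = c(2, 9)
)

# Share of bottom-20% students who reached the top 20%, in percent
# (success_rate is a name invented here, not a column in the dataset)
toy_colleges$success_rate <-
    100 * toy_colleges$mobility_rate / toy_colleges$percent_from_bottom_20

toy_colleges
```

In this toy example the selective school moves a larger share of its low-income students to the top 20% (50% versus 30%), yet its mobility rate is lower (2 versus 9) because it enrolls so few of them.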