# 3B: College Mobility - Part II

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# This code will make sure the middle rows/columns don't get cut out (ellipsized) when you 
# print out a really large data frame (you can adjust the values for max rows/cols)
options(repr.matrix.max.rows=1400, repr.matrix.max.cols=200)

# Load the data frame
colleges <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRg0xH6f5WG5-Tyxm-hoPEuyy6bao9FtB1kCumuFrS_PhjZVy1sXCmJ1mR0aYMrOXCf-zuG5_vFz5By/pub?gid=0&single=true&output=csv")

<img src="https://i.postimg.cc/hgQxmL5Q/johnston-gate-harvard.jpg" alt="Johnston Gates at Harvard University" width = 50%>

## College and the American Dream

Previously, we explored whether colleges are places of social mobility. We looked at the mobility index of each college to see which ones help to bring the most people from the bottom 20% of household income to the top 20%. We saw that mid-tier schools tended to have the highest mobility rates, while low- and high-tier schools tended to have lower rates. The most prestigious schools (e.g., Ivy League) also tended to admit very few students from the bottom 20% of incomes to begin with.

Let's continue to explore the various factors that may contribute to a college's mobility index.

### Motivating Questions: Does the type of college have an effect on mobility? How about class size?

### The Dataset
**Description:** Describes family income (income while growing up) and adult income (income after graduating college) for students who attended college in the early 2000s. Data are in tidy format, with one row per college. See more in the official dataset [documentation](https://opportunityinsights.org/wp-content/uploads/2018/03/Codebook-MRC-Table-1.pdf).

##### Variables
- `name`: Name of college
- `location`: City in which college is located
- `state`: State in which college is located
- `class_size`: Average size of graduating classes during years measured
- `tier`: School tier (1-12), lower number indicates more prestigious tier of school
- `tier_name`: Description of tier (e.g., selective public, selective private, Ivy Plus, etc.)
- `type`: School type (private non-profit, private for-profit, public)
- `median_family_income`: Median family income of students at the school
- `percent_from_bottom_20`: Percent of students at the school who come from bottom 20% income households
- `percent_from_bottom_20_and_reached_top_20`: Percent of all students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults
- `mobility_index`: The same variable as `percent_from_bottom_20_and_reached_top_20` but renamed to be shorter


##### Data Sources: 
 - Chetty et al. (Opportunity Insights), Mobility Report Cards database: https://opportunityinsights.org/paper/mobilityreportcards/
    - Specifically: MRC Codebook Table 1, Preferred Estimates of Access and Mobility Rates by College:  https://opportunityinsights.org/data/?#resource-listing

### 1.0 - Explore `mobility_index`

**1.1** - Take a look at some of the data frame. What does each row represent? 

In [None]:
head(colleges)

**1.2** - Create a visualization to look at the distribution of `mobility_index`. What do you think is the maximum value that `mobility_index` could be? Are many of the colleges near that maximum value?

**1.3** - Fit the empty model for `mobility_index` and interpret the value. (Bonus: add the empty model to the data visualization you made above.)
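
As a reminder of what fitting an empty model involves, here is a minimal sketch on a tiny made-up data frame. The data frame `toy` and its `outcome` column are hypothetical and are not part of `colleges`. The sketch shows that the empty model's single estimate is the mean of the outcome.

```r
# Hypothetical toy data (made up for illustration; not from `colleges`)
toy <- data.frame(outcome = c(1, 2, 3, 6))

# The empty model has no explanatory variable, only an intercept (b0)
empty_model <- lm(outcome ~ NULL, data = toy)

# Extract the single parameter estimate
b0_estimate <- unname(coef(empty_model)[1])
b0_estimate

# Compare it with the mean of the outcome
mean(toy$outcome)
```

The same pattern should carry over to `mobility_index` in `colleges`. Elsewhere in this notebook, `gf_model()` is used to overlay a fitted model on a plot, so chaining it onto your histogram is one way to attempt the bonus.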

In [None]:
# Fit the empty model


### 2.0 - Does `type` explain variation in mobility?

#### Explore the Distribution

**2.1** - Do you think the `type` of college (private non-profit, private for-profit, or public) will explain any of the variation in `mobility_index`?

Make some predictions about what you might expect to find, then create a visualization to explore the hypothesis. Describe what you see.

#### Fit and Interpret the Model

**2.2** - Find the best fitting parameter estimates and put the model into GLM notation. 

Here is some markdown to get you started:

$Y_i = b_0 + b_1(X_{1i}) + b_2(X_{2i}) + e_i$
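
To see how the three estimates in this notation relate to group means, here is a small sketch on made-up data. The groups `A`, `B`, `C` and all values are hypothetical. It assumes R's default coding for a factor, in which $X_{1i}$ and $X_{2i}$ are 0/1 indicators for the second and third groups and the first group is the reference.

```r
# Hypothetical toy data with three groups (labels and values are made up)
toy <- data.frame(
    group = factor(rep(c("A", "B", "C"), each = 2)),
    outcome = c(1, 3, 4, 6, 8, 10)
)

# Fit the three-group model
group_model <- lm(outcome ~ group, data = toy)
estimates <- unname(coef(group_model))
estimates  # b0, b1, b2 in that order

# Group means, for comparison with the model's predictions
group_means <- tapply(toy$outcome, toy$group, mean)

# b0 is the prediction for the reference group (A);
# b1 and b2 are the adjustments added to b0 for groups B and C
predictions <- c(
    A = estimates[1],
    B = estimates[1] + estimates[2],
    C = estimates[1] + estimates[3]
)
predictions
group_means
```

In this sketch the model's prediction for each group equals that group's mean, which is a useful check when interpreting the estimates for `type`.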

<div class="alert alert-block alert-info">

#### Key Question:

**2.3** - Interpret the parameter estimates and add them to your visualization.

</div>

#### Make Predictions with the Model

**2.4** - Use the model to make a prediction for each group:

What would the model predict for a:

- for-profit college?
- private non-profit college?
- public college?

**2.5** - According to our model, which type of school generally has the lowest mobility index? Which has the highest?

#### Quantify Residual Error from the `type` Model and Empty Model

Below we have generated a couple of visualizations with the `type` model in red and the empty model in blue. In both visualizations, find the case that represents "California State University, Los Angeles" (selected and printed out below). 

**2.6** - Which model's prediction results in a smaller residual error for Cal State LA: the empty model or the type model? Is there much difference between the two model predictions? 
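
A residual is the observed value minus the model's prediction, so the same case can have a different residual under each model. The sketch below illustrates this on made-up data; the data frame `toy`, its groups, and the chosen row are all hypothetical.

```r
# Hypothetical toy data with two groups (made up for illustration)
toy <- data.frame(
    group = factor(c("A", "A", "B", "B")),
    outcome = c(1, 3, 7, 9)
)

empty_model <- lm(outcome ~ NULL, data = toy)
group_model <- lm(outcome ~ group, data = toy)

# Look at one case: row 4 (group B, outcome 9)
case_row <- 4

# Residual = observed - predicted, under each model
empty_residual <- toy$outcome[case_row] - fitted(empty_model)[[case_row]]
group_residual <- toy$outcome[case_row] - fitted(group_model)[[case_row]]

empty_residual
group_residual
```

For this made-up case the group model's prediction is closer to the observed value, but that will not be true of every case in a real data set.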

In [None]:
# Look at the row for California State University, Los Angeles
colleges[colleges$name == "California State University, Los Angeles", ]

# Fit the type model
type_model <- lm(mobility_index ~ type, data = colleges)

# Jitter plot: type model predictions in red, empty model prediction in blue
gf_jitter(mobility_index ~ type, data = colleges, color = "grey") %>%
    gf_model(type_model, color = "red") %>%
    gf_hline(yintercept = 1.92, color = "blue")

# Faceted histogram: same models, one panel per type
gf_histogram(~ mobility_index, data = colleges, fill = "grey") %>%
    gf_facet_grid(type ~ .) %>%
    gf_model(type_model, color = "red") %>%
    gf_vline(xintercept = 1.92, color = "blue")

**2.7** - Pick a different college. Identify it in your visualization and compare the prediction from the empty model to the `type` model. Which model's prediction results in a smaller residual error for this college? 

In [None]:
# if you want to print out all the available names of colleges:
#select(colleges, name)

In [None]:
# Look at the row for a different college 


<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**2.8** - Generate the ANOVA table to get the sum of squared error from the empty model and the type model. You'll see that the type model has less error relative to the empty model. How much less error? 

What is the proportion of error reduced by the type model?

</div>
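
The arithmetic behind these quantities can be sketched on made-up data. Everything in `toy` is hypothetical. The sketch computes the sum of squares by hand from each model's residuals; in this course the `supernova()` function would normally report the same quantities in a table.

```r
# Hypothetical toy data with two groups (made up for illustration)
toy <- data.frame(
    group = factor(c("A", "A", "B", "B")),
    outcome = c(1, 3, 7, 9)
)

empty_model <- lm(outcome ~ NULL, data = toy)
group_model <- lm(outcome ~ group, data = toy)

# Sum of squared residuals left over by each model
ss_total <- sum(resid(empty_model)^2)  # error from the empty model
ss_error <- sum(resid(group_model)^2)  # error from the group model

# Error reduced by the group model, as an amount and as a proportion (PRE)
ss_model <- ss_total - ss_error
pre <- ss_model / ss_total

c(ss_total = ss_total, ss_error = ss_error, ss_model = ss_model, pre = pre)
```

The proportion in the last step is the share of the empty model's error that the more complex model removes.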

### 3.0 - What other variables might explain the variation in mobility index?


#### Explore the Distribution

We have explored whether the type of school might explain variation in mobility index. 

**3.1** - Now it's your turn: which variable do you think would predict `mobility_index` better than `type`?

Write a hypothesis about what you might expect to find (both as a sentence and as a word equation), then create a visualization to explore the hypothesis. Describe what you see.

#### Fit and Interpret the Model

**3.2** - Find the best fitting parameter estimates and put the model into GLM notation. 

Here is some markdown to get you started:

$Y_i = b_0 + b_1(X_i) + e_i$
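
If the variable you choose is quantitative, $b_0$ is the intercept and $b_1$ is the slope of the fitted line. Here is a minimal sketch on made-up data; the columns `x` and `outcome` are hypothetical and do not come from `colleges`.

```r
# Hypothetical toy data with a quantitative explanatory variable
toy <- data.frame(
    x = c(1, 2, 3, 4),
    outcome = c(2, 5, 6, 9)
)

# Fit the model and pull out the two estimates
line_model <- lm(outcome ~ x, data = toy)
estimates <- unname(coef(line_model))
b0_estimate <- estimates[1]  # predicted outcome when x is 0
b1_estimate <- estimates[2]  # change in predicted outcome per one-unit increase in x

# Prediction for a case with x = 3, computed from the GLM notation
prediction_at_3 <- b0_estimate + b1_estimate * 3
prediction_at_3
```

If your chosen variable is categorical instead, the notation needs one indicator term for each group beyond the first, as in the `type` model.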

<div class="alert alert-block alert-info">

#### Key Question:

**3.3** - Interpret the parameter estimates and add the model predictions to your visualization.

</div>

#### Evaluate the Error of the Model

**3.4** - In the visualization you made above, find the case that represents "California State University, Los Angeles". Which model's prediction results in less residual error for that college: the empty model or your model? 

In [None]:
colleges[colleges$name == "California State University, Los Angeles", ]

**3.5** - Compare the Sum of Squares (SS) for each model (the empty model and your model). Which one has less SS error overall? Is there a very big difference between the two SS values? What does this mean?

**3.6** - Why is the SS for the empty model (SS Total) the same in the ANOVA table for the `type` model and in the ANOVA table for your model?
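
One way to investigate this question is to fit two different models to the same outcome and compare their totals. The sketch below does this on made-up data; the columns `group`, `x`, and `outcome` are hypothetical. It uses base R's `anova()` and adds up the "Sum Sq" column to get the total for each model.

```r
# Hypothetical toy data: one outcome, two possible explanatory variables
toy <- data.frame(
    group = factor(c("A", "A", "B", "B")),
    x = c(1, 2, 3, 4),
    outcome = c(2, 5, 6, 9)
)

# Total SS from each model's ANOVA table (model SS + error SS)
ss_total_from_group <- sum(anova(lm(outcome ~ group, data = toy))[["Sum Sq"]])
ss_total_from_x <- sum(anova(lm(outcome ~ x, data = toy))[["Sum Sq"]])

# Total SS computed directly from the outcome and its mean
ss_total_direct <- sum((toy$outcome - mean(toy$outcome))^2)

c(ss_total_from_group, ss_total_from_x, ss_total_direct)
```

Notice which columns of `toy` the last calculation uses and which it does not.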

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**3.7** - Which model, the `type` model or your model, appears to explain more of the variation in college mobility?

</div>

### 4.0 - Is there a general economic benefit from college?

What is the overall picture we are seeing with regard to college and economic mobility (considering our findings from the previous college notebook as well)?

Before we draw any final conclusions, let's consider our measure of social mobility.

The `mobility_index` represents the percent of all students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults. For [reference](https://engaging-data.com/household-spending-income/), the bottom 20\% by income earns about \$25,500 annually while the top 20\% earns about \$188,000 annually. But what if a student from the bottom 20% reaches the top 50% of incomes instead? Should that count as social mobility?

That is one measure of the economic impact of going to college. Another way to measure the economic value of a college degree is to simply look at the mean earnings of people with different levels of education.

**4.1** - Take a look at the graph below (generated from a different data frame called `earnings`; [data source](https://www.statista.com/statistics/184242/mean-earnings-by-educational-attainment/)).

What do you think these lines represent?

In [None]:
earnings <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vT8f6Hnonep3D7SBL73foZxYNkqCzssVaLeMvy7y0un3gIdmEDn2f2vp7WbFRrHyW63wYqydWY0AuxH/pub?gid=1804563070&single=true&output=csv")
earnings$edu_level_ordered <- factor(earnings$edu_level, levels = c("less_than_9th_grade", "HS_9th_12th_nongrad", "HS_grad_GED", "some_college_no_degree", "assoc_degree", "bach_degree", "masters_degree", "prof_degree", "doctorate_degree"))

gf_line(mean_earnings ~ year, color = ~edu_level_ordered, data = earnings)

**4.2** - According to the data, which level of education gets closest to the top 20% of income earnings (which would be over \$180,000 per year)?

**4.3** - According to the data, which levels of education have seen the steepest rise in their earnings over the years? Which levels of education appear to have experienced the smallest rise in their earnings over the years?

**4.4** - From any of the analyses in this Jupyter notebook, can we conclude that going to college helps people make more money? If you were advising someone about their future, would you encourage them to go to college?