<div class="alert alert-block alert-danger">

# 3B: College Mobility - Part II (COMPLETE)

*This notebook is intended for students who have completed up to:*
 
**Page 3.9**

</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this lesson, students will once again consider the relationship between attending college and achieving social mobility. This time they will explore whether the type of college (for-profit, private non-profit, or public) explains variation in mobility, or whether the graduating class size of the college has anything to do with mobility.


#### Includes:

- Fitting and interpreting a model with a categorical explanatory variable.
- Fitting and interpreting a model with a quantitative explanatory variable.
- Connecting parameter estimates to visualizations.
- Making predictions with models and evaluating the error around those predictions.

#### Resources:

- Optional [Printable Graph Handout](https://docs.google.com/document/d/1BCSPksPRqOm36RPtyPXM714sKqdfMzIy1sppwTLfpc8/edit?usp=sharing). This handout contains images of the relevant visualizations that are made throughout the lesson. They can be used for students to manually draw on, mark up, and make notes. This can give students the chance to process the graphs more deeply, and connect them to the models they are fitting.

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 55-75 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# This code will make sure the middle rows/columns don't get cut out (ellipsized) when you 
# print out a really large data frame (you can adjust the values for max rows/cols)
options(repr.matrix.max.rows=1400, repr.matrix.max.cols=200)

# Load the data frame
colleges <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRg0xH6f5WG5-Tyxm-hoPEuyy6bao9FtB1kCumuFrS_PhjZVy1sXCmJ1mR0aYMrOXCf-zuG5_vFz5By/pub?gid=0&single=true&output=csv")

<img src="https://i.postimg.cc/hgQxmL5Q/johnston-gate-harvard.jpg" alt="Johnston Gates at Harvard University" width = 50%>

## College and the American Dream

Previously, we explored whether colleges are places of social mobility. We looked at the mobility index of each college to see which ones help to bring the most people from the bottom 20% of household income to the top 20%. We saw that mid-tier schools tended to have the highest mobility rates, while the least and most prestigious schools tended to have lower rates. The most prestigious schools (low tier numbers, e.g., Ivy League) also tended not to admit very many students from the bottom 20% of incomes to begin with.

Let's continue to explore the various factors that may contribute to a college's mobility index.

### Motivating Questions: Does the type of college have an effect on mobility? How about class size?

### The Dataset
**Description:** Describes family income (income while growing up) and adult income (income after graduating college) for students who attended college in the early 2000s. Data are in tidy format, with one row per college. See more in the official dataset [documentation](https://opportunityinsights.org/wp-content/uploads/2018/03/Codebook-MRC-Table-1.pdf).

##### Variables
- `name`: Name of college
- `location`: City in which college is located
- `state`: State in which college is located
- `class_size`: Average size of graduating classes during years measured
- `tier`: School tier (1-12); a lower number indicates a more prestigious tier of school
- `tier_name`: Description of tier (e.g., selective public, selective private, Ivy Plus, etc.)
- `type`: School type (private non-profit, private for-profit, public)
- `median_family_income`: Median family income of students at the school
- `percent_from_bottom_20`: Percent of students at the school who come from bottom 20% income households
- `percent_from_bottom_20_and_reached_top_20`: Percent of all students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults
- `mobility_index`: The same variable as `percent_from_bottom_20_and_reached_top_20` but renamed to be shorter


##### Data Sources: 
 - Chetty et al. (Opportunity Insights), Mobility Report Cards database: https://opportunityinsights.org/paper/mobilityreportcards/
    - Specifically: MRC Codebook Table 1, Preferred Estimates of Access and Mobility Rates by College:  https://opportunityinsights.org/data/?#resource-listing

<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  5-10 mins

</div>

### 1.0 - Explore `mobility_index`

**1.1** - Take a look at some of the data frame. What does each row represent? 

In [None]:
head(colleges)

<div class="alert alert-block alert-warning">

**Sample Response:**

Each row is a college.

</div>

**1.2** - Create a visualization to look at the distribution of `mobility_index`. What do you think is the maximum value that `mobility_index` could be? Are many of the colleges near that maximum value?

In [None]:
# Sample Response
gf_histogram(~mobility_index, data = colleges) %>%
    gf_boxplot(fill = "white", width = 9)

# Sample Response
gf_boxplot(mobility_index~1, data = colleges)

<div class="alert alert-block alert-warning">

**Sample Response:**

Mobility index is a percentage (the percentage of students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults). It could be 100% (that would happen if all the students at a college came from poor households and reached the top 20% as adults). 

Most colleges have a mobility index that is really low (around 1.5% - 2.5%). The distribution is skewed, with a few colleges having a much higher index of about 10% - 16%.  

Students may be surprised at how low the mobility index is. It is a good opportunity to remind them of the prior notebook where they looked at the median family incomes at these colleges... colleges tend to serve already wealthy families.
</div>

**1.3** - Fit the empty model for `mobility_index` and interpret the value. (Bonus: add the empty model to the data visualization you made above.)

In [None]:
# Fit the empty model
empty_model <- lm(mobility_index ~ NULL, data = colleges)
empty_model

<div class="alert alert-block alert-warning">

**Sample Response:**

The empty model predicts that a college will have a mobility rate of about 1.92%.

</div>
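
As a quick check on what the empty model estimates, the sketch below fits an empty model to a small made-up data frame and compares the estimate to the mean of the outcome. The `toy` data frame and its values are hypothetical and are not taken from `colleges`.

```r
# Hypothetical toy data (NOT from the colleges data frame)
toy <- data.frame(mobility_index = c(1.2, 1.8, 2.5, 0.9, 3.1))

# Fit the empty model, using the same formula style as above
toy_empty <- lm(mobility_index ~ NULL, data = toy)

# The single estimate (b0) is the mean of the outcome variable
b0 <- unname(coef(toy_empty)[1])
b0
mean(toy$mobility_index)
```

The same logic applies to `empty_model`: its estimate of about 1.92 is the mean `mobility_index` across all colleges in the data frame.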

In [None]:
# bonus 
gf_histogram(~ mobility_index, data = colleges) %>%
    gf_boxplot(fill = "white", width = 9) %>%
    gf_model(empty_model)

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  20-25 mins

</div>

### 2.0 - Does `type` explain variation in mobility?

#### Explore the Distribution

**2.1** - Do you think the `type` of college (private non-profit, private for-profit, or public) will explain any of the variation in `mobility_index`?

Make some predictions about what you might expect to find, then create a visualization to explore the hypothesis. Describe what you see.

In [None]:
# Sample Response
gf_jitter(mobility_index ~ type, data = colleges)

# Sample Response
gf_histogram(~mobility_index, data = colleges) %>%
  gf_facet_grid(type ~ .) %>%
  gf_boxplot(width = 20)

# Sample Response
gf_dhistogram(~mobility_index, data = colleges) %>%
  gf_facet_grid(type ~ .) 

<div class="alert alert-block alert-warning">

**Sample Response:**

They all seem similarly distributed and skewed right. For-profit colleges have the fewest cases and the least variation. All three types have similar centers (around 1.5-2.5%). Public and private non-profit colleges include the cases with the highest mobility rates, but those cases appear to be outliers.


</div>

#### Fit and Interpret the Model

**2.2** - Find the best fitting parameter estimates and put the model into GLM notation. 

Here is some markdown to get you started:

$Y_i = b_0 + b_1(X_{1i}) + b_2(X_{2i}) + e_i$

In [None]:
type_model <- lm(mobility_index ~ type, data = colleges)
type_model

<div class="alert alert-block alert-warning">

**Sample Response:**

$Y_i = 1.723 + 0.004(X_{1i}) + 0.554(X_{2i}) + e_i$

OR

$mobility\_index_i = 1.723 + 0.004(typeprivatenonprofit_i) + 0.554(typepublic_i) + e_i$


</div>

<div class="alert alert-block alert-info">

#### Key Question:

**2.3** - Interpret the parameter estimates and add them to your visualization.

</div>

<div class="alert alert-block alert-warning">

**Sample Response:**

- $b_0$ = 1.723; this is the intercept, or group mean for the "for-profit" group.
- $b_1$ = 0.004; this is the increment that we add on to $b_0$ to get the mean of the "private non-profit" group.
- $b_2$ = 0.554; this is the increment that we add on to $b_0$ to get the mean of the "public" group.

**Instructor Note:**

Once they add the model to their visualization, you may also want to encourage students to connect the estimates to the visualization. For example, get them to verbalize and point to $b_0$ (the base mean), $b_1$ (the distance between $b_0$ and the mean of "private non-profit"), and $b_2$ (the distance between $b_0$ and the mean of "public"). The handout linked at the top of the notebook can also be used here so students can manually draw on the visualizations and mark them up with other notes, etc.

</div>
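
To make the "group mean plus increments" reading concrete, here is a minimal sketch on made-up data. The `toy` data frame, its values, and the name `toy_model` are hypothetical. The point is only that, with R's default coding, $b_0$ is the mean of the reference group and the other estimates are differences from it.

```r
# Hypothetical toy data: three college types, two colleges each
toy <- data.frame(
  type = factor(rep(c("for-profit", "private non-profit", "public"), each = 2)),
  mobility_index = c(1.5, 1.9, 1.6, 2.0, 2.1, 2.5)
)

# Fit a group model, as with type_model above
toy_model <- lm(mobility_index ~ type, data = toy)
b <- unname(coef(toy_model))

# Group means computed directly, as a named numeric vector
group_means <- sapply(split(toy$mobility_index, toy$type), mean)

# b[1] is the mean of the reference group (the first factor level);
# b[2] and b[3] are the other group means minus that reference mean
b
group_means - group_means[["for-profit"]]
```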

In [None]:
# Examples below use gf_model
# Students might also use gf_vline/gf_hline, gf_lm, predict, etc.

# Sample Response
gf_jitter(mobility_index ~ type, data = colleges, width = .1, alpha = .1) %>%
    gf_model(type_model, color = "red")

# Sample Response
gf_histogram(~mobility_index, data = colleges) %>%
    gf_facet_grid(type ~ .)%>%
    gf_model(type_model, color = "red")


#### Make Predictions with the Model

**2.4** - Use the model to make a prediction for each group:

What would the model predict for a:

- for-profit college?
- private non-profit college?
- public college?

<div class="alert alert-block alert-warning">

**Sample Response:**

For a for-profit college (multiply $b_1$ and $b_2$ by 0):

> $mobility\_index_i = 1.723 + 0.004(0) + 0.554(0) = 1.723 + 0 + 0$
> predicted mobility index = 1.723

For a private non-profit college (multiply $b_1$ by 1 and $b_2$ by 0):

> $mobility\_index_i = 1.723 + 0.004(1) + 0.554(0) = 1.723 + 0.004 + 0$
> predicted mobility index = 1.727

For a public college (multiply $b_1$ by 0 and $b_2$ by 1):

> $mobility\_index_i = 1.723 + 0.004(0) + 0.554(1) = 1.723 + 0 + 0.554$
> predicted mobility index = 2.277

**Instructor Note:**

Students can also use the `favstats()` or `predict()` functions; however, the goal of this exercise is to get them to think about how the model arrives at the values that `predict()` returns.

</div>
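
The hand calculations above can also be sketched in R. This uses only the rounded parameter estimates reported for the `type` model; the `codes` data frame and its column names are hypothetical, introduced here for illustration.

```r
# Rounded parameter estimates reported above for the type model
b0 <- 1.723  # mean of the for-profit (reference) group
b1 <- 0.004  # increment for private non-profit
b2 <- 0.554  # increment for public

# Dummy codes (X1, X2) for each type of college
codes <- data.frame(
  type = c("for-profit", "private non-profit", "public"),
  X1 = c(0, 1, 0),
  X2 = c(0, 0, 1)
)

# Prediction = b0 + b1*X1 + b2*X2 (no error term in a prediction)
codes$prediction <- b0 + b1 * codes$X1 + b2 * codes$X2
codes
```

Because the estimates are rounded, these predictions may differ slightly in the last decimal place from what `predict(type_model)` returns.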

In [None]:
# although R is not necessary for question 2.4, 
# after considering the GLM it might be helpful 
# to look at favstats() or predict()
favstats(mobility_index ~ type, data = colleges)

head(predict(type_model))

**2.5** - According to our model, which type of schools generally perform the lowest in the mobility index? The highest?

<div class="alert alert-block alert-warning">

**Sample Response:**

For-profit and private non-profit colleges have nearly equal mobility rates on average (about 1.72% and 1.73%), while public colleges have the highest mobility rates (although they are still pretty low at about 2.28%).

</div>

#### Quantify Residual Error from the `type` Model and Empty Model

Below we have generated a couple of visualizations with the `type` model in red and the empty model in blue. In both visualizations, find the case that represents "California State University, Los Angeles" (selected and printed out below). 

**2.6** - Which model's prediction results in a smaller residual error for Cal State LA: the empty model or the type model? Is there much difference between the two model predictions? 

In [None]:
# Look at the row for California State University, Los Angeles
colleges[colleges$name == "California State University, Los Angeles", ]

# Teachers may wish to pick one type of plot to focus on in this lesson
# Jitter Plot
gf_jitter(mobility_index ~ type, data = colleges, color = "grey") %>%
    gf_model(type_model, color = "red") %>%
    gf_hline(yintercept = 1.92, color = "blue")

# Faceted Histogram
gf_histogram(~mobility_index, data = colleges, fill = "grey") %>%
    gf_facet_grid(type ~ .)%>%
    gf_model(type_model, color = "red") %>%
    gf_vline(xintercept = 1.92, color = "blue")

<div class="alert alert-block alert-warning">

**Sample Response:**

Students should identify Cal State LA as a case in the "public" group that is near 9-10 on the mobility_index. 

The `type` model has a slightly smaller residual.

**Instructor Note:**

It may be helpful to refer to the printable handout for these questions as well, so students can draw and make notes.

</div>
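
For instructors who want the arithmetic behind this comparison, here is a small sketch. It assumes the rounded values that appear in this notebook: a mobility index of 9.92 for Cal State LA, an empty model estimate of 1.92, and the rounded `type` model estimates. The variable names are illustrative only.

```r
observed <- 9.92                  # mobility_index for Cal State LA
empty_prediction <- 1.92          # empty model estimate (rounded)
type_prediction <- 1.723 + 0.554  # type model prediction for a public college

# Residual = observed value - model prediction
resid_empty <- observed - empty_prediction
resid_type <- observed - type_prediction
c(empty = resid_empty, type = resid_type)
```

Both residuals are large and positive, and the `type` model's residual is only a little smaller, which matches what students should see in the plots.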

**2.7** - Pick a different college. Identify it in your visualization and compare the prediction from the empty model to the `type` model. Which model's prediction results in a smaller residual error for this college? 

In [None]:
# if you want to print out all the available names of colleges:
#select(colleges, name)

In [None]:
# Look at the row for a different college 
colleges[colleges$name == "Academy Of Art University", ]
colleges[colleges$name == "Harvard University", ]
colleges[colleges$name == "University Of Michigan - Ann Arbor", ]

<div class="alert alert-block alert-warning">

**Sample Response:**

Students should identify a new college in the data frame. For example: Academy of Art University

In the plot, they should locate Academy of Art University as a case in the "for profit" group that is near 1.64 on the mobility_index. 

The type model still makes a slightly better prediction; however, there is almost no difference between it and the empty model prediction.

**Instructor Note:**

You may want to prod them to answer this follow-up question: If there isn't much difference between the two model predictions, what does that suggest?


Possible Answers:

It would suggest that the type model is as good as the empty model. 

It would suggest that the DGP is random, or all unexplained variation, with regard to the relationship between type and mobility_index.

Depending on which college they select (i.e., which `type`), they may notice that the type model seems to work best for the "public" colleges, but that the predictions for the other two groups are much closer to the empty model. This can be considered a prelude to pairwise comparisons later in the book.

</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**2.8** - Generate the ANOVA table to get the sum of squared error from the empty model and the type model. You'll see that the type model has less error relative to the empty model. How much less error? 

What is the proportion of error reduced by the type model?

</div>

In [None]:
supernova(type_model)

<div class="alert alert-block alert-warning">

**Sample Response:**

Note the type model's SS is called SS Error but the empty model's SS is called SS Total.

The type model has the lower SS Error: it is 93.988 lower than SS Total. The unit is squared percentage points (mobility index is a percentage), which is hard to interpret directly. PRE is .0339, so the type model reduces error by about 3%.
</div>
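
The relationship between these numbers can be sketched as follows, assuming the values reported in this notebook (SS Total of 2769.23 and a reduction of 93.988). The variable names are illustrative only.

```r
ss_total <- 2769.23  # SS Total: squared error around the empty model
ss_model <- 93.988   # SS Model: error reduced by the type model

# SS Error is what is left over after the type model
ss_error <- ss_total - ss_model

# PRE = proportion of the empty model's error that the type model removes
pre <- ss_model / ss_total
round(ss_error, 2)
round(pre, 4)
```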

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  20-25 mins

</div>

### 3.0 - What other variables might explain the variation in mobility index?


#### Explore the Distribution

We have explored whether the type of school might explain variation in mobility index. 

**3.1** - But now it's your turn: what variable would be better than type at predicting mobility index?

Write a hypothesis about what you might expect to find (both as a sentence and as a word equation), then create a visualization to explore the hypothesis. Describe what you see.

<div class="alert alert-block alert-warning">

**Sample Response:**

The mobility index might be related to class size. Maybe smaller, more intimate schools do a better job of improving social mobility.

**social mobility = class size + other stuff**

(Describing visualization below.) The distribution seems to be pretty clumped up in the bottom corner. There is not a lot of variation in mobility_index, with most values below about 7%, and class_size is pretty clumped up under about 5000. There are some outliers on both dimensions. Overall, it doesn't look like there is a clear upward or downward trend.

</div>

In [None]:
# Sample Response
gf_jitter(mobility_index ~ class_size, data = colleges)


#### Fit and Interpret the Model

**3.2** - Find the best fitting parameter estimates and put the model into GLM notation. 

Here is some markdown to get you started:

$Y_i = b_0 + b_1(X_i) + e_i$

In [None]:
# sample response
class_size_model <- lm(mobility_index ~ class_size, data = colleges)
class_size_model

<div class="alert alert-block alert-warning">

**Sample Response:**

$Y_i = 1.89 + 0.00003318(X_i) + e_i$

OR

$mobility\_index_i = 1.89 + 0.00003318(class\_size_i) + e_i$

**Note to Instructors:**

Students may need assistance interpreting the values of the R output. For instance, the e+00 and e-05 tell us how many places to move the decimal in the positive or negative direction. Thus, a value like 3.318e-05 translates to 0.00003318, and can be interpreted as basically 0.

</div>
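
If students get stuck on the notation, one option is to have R print the value without the exponent. This is a small sketch using base R's `format()`; the variable name `slope` is illustrative only.

```r
slope <- 3.318e-05

# Print the same number in fixed (non-scientific) notation
slope_fixed <- format(slope, scientific = FALSE)
slope_fixed

# Predicted change in mobility_index for 1000 more students per class
slope * 1000
```

Seeing that even 1000 additional students changes the prediction by only about 0.03 percentage points can help students read the slope as "basically 0".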

<div class="alert alert-block alert-info">

#### Key Question:

**3.3** - Interpret the parameter estimates and add the model predictions to your visualization.

</div>

In [None]:
# Sample Response
gf_jitter(mobility_index ~ class_size, data = colleges) %>%
    gf_lm(color = "red")

# Sample Response
gf_jitter(mobility_index ~ class_size, data = colleges, color = "grey") %>%
    gf_model(class_size_model, color = "red")

<div class="alert alert-block alert-warning">

**Sample Response:**

- $b_0$ = 1.89; this is the y-intercept: the predicted mobility_index when class_size is zero (where the regression line crosses the y-axis).
- $b_1$ = 0.00003318 (approximately 0.00); this is the slope: for every 1-unit increase in class_size, the predicted mobility_index increases by 0.00003318.

**Instructor Note:**

Once they add the model to their visualization, you may also want to encourage students to connect the estimates to the visualization. For example, get them to verbalize and point to $b_0$ (the y-intercept), and $b_1$ (the slope as class_size increases). The handout linked at the top of the notebook can also be used here so students can manually draw on the visualizations and mark them up with other notes, etc.

</div>
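
To show how both estimates combine into a prediction, here is a minimal sketch using the rounded estimates reported above. The helper `predict_mobility()` and the class sizes are hypothetical; they are not values from the dataset.

```r
# Rounded parameter estimates reported above for the class size model
b0 <- 1.89        # y-intercept
b1 <- 0.00003318  # slope

# Hypothetical helper: prediction = b0 + b1 * class_size
predict_mobility <- function(class_size) b0 + b1 * class_size

# Hypothetical class sizes, chosen only for illustration
hypothetical_sizes <- c(0, 1000, 5000)
predictions <- predict_mobility(hypothetical_sizes)
predictions
```

A class size of 0 returns the intercept itself, and larger class sizes move the prediction up only slightly, which is why the regression line looks nearly flat.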

#### Evaluate the Error off the Model

**3.4** - In the visualization you made above, find the case in the visualization that represents "California State University, Los Angeles". Which model's prediction results in less residual error for that college: the empty model or your model? 

In [None]:
colleges[colleges$name == "California State University, Los Angeles", ]

<div class="alert alert-block alert-warning">

**Sample Response:**

The mobility index for Cal State LA is 9.92.

The class size model makes a slightly worse prediction (1.89) compared to the empty model's prediction (1.92), but the predictions are really not that different.

</div>

In [None]:
lm(mobility_index~NULL, data = colleges)

**3.5** - Compare the Sum of Squares (SS) for each model (the empty model or your model). Which one has less SS error overall? Is there a very big difference between the two SS values? What does this mean?

In [None]:
# sample response
supernova(class_size_model)

<div class="alert alert-block alert-warning">

**Sample Response:**

The SS from the class size model is slightly smaller (2765.11) than the SS from the empty model (2769.23), but the numbers are quite close. This means the class size model reduces the sum of the squared error off the model, but only by a little bit (it does not explain much of the variation). 

</div>

**3.6** - Why is the SS for the empty model the same for both the `type` model and your model?

<div class="alert alert-block alert-warning">

**Sample Response:**

For both complex models, the empty model is modeling the same thing: mobility_index.

We have the same outcome variable for each model.

</div>
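
This can be demonstrated on made-up data. In the sketch below, the `toy` data frame and the helper `ss_total_from()` are hypothetical. It computes SS Total directly from the outcome, then recovers it from two different models by adding up the rows of each ANOVA table.

```r
# Hypothetical toy data (NOT from the colleges data frame)
toy <- data.frame(
  mobility_index = c(1.5, 1.9, 1.6, 2.0, 2.1, 2.5),
  type = factor(rep(c("for-profit", "private non-profit", "public"), each = 2)),
  class_size = c(300, 1200, 500, 2500, 4000, 6000)
)

# SS Total depends only on the outcome variable
ss_total <- sum((toy$mobility_index - mean(toy$mobility_index))^2)

# Hypothetical helper: SS Model + SS Error from a fitted model's ANOVA table
ss_total_from <- function(model) sum(anova(model)[["Sum Sq"]])

ss_type <- ss_total_from(lm(mobility_index ~ type, data = toy))
ss_size <- ss_total_from(lm(mobility_index ~ class_size, data = toy))
c(direct = ss_total, type = ss_type, class_size = ss_size)
```

All three values agree because the explanatory variable only changes how SS Total is split between the model and the error, not the total itself.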

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

<br>

**3.7** - Which model, the `type` model or your model, appears to explain more of the variation in college mobility?

</div>

In [None]:
# sample answer
supernova(type_model)
supernova(class_size_model)

<div class="alert alert-block alert-warning">

**Sample Response:**

- SS Total (from empty model) = 2769
- SS Error (from type model) = 2675
- SS Error (from e.g., class_size model) = 2765

In this example (class size versus type model), the type model has the smaller SS Error (and larger PRE).

</div>
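
For a side-by-side comparison, PRE can be computed for both models from the SS values reported in this notebook. This is a sketch; the helper `pre()` and the variable names are hypothetical.

```r
# Hypothetical helper: proportional reduction in error
pre <- function(ss_error, ss_total) (ss_total - ss_error) / ss_total

ss_total <- 2769.23                  # SS Total (empty model)
ss_error_type <- ss_total - 93.988   # SS Error from the type model
ss_error_class_size <- 2765.11       # SS Error from the class size model

pre_type <- pre(ss_error_type, ss_total)
pre_class_size <- pre(ss_error_class_size, ss_total)
round(c(type = pre_type, class_size = pre_class_size), 4)
```

Neither model explains much, but the `type` model reduces noticeably more error than the class size model.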

<div class="alert alert-block alert-success">

### 4.0 - Approximate Time:  10-15 mins

</div>

### 4.0 - Is there a general economic benefit from college?

What is the overall picture we are seeing with regard to college and economic mobility (considering our findings from the past college notebook as well)?

Before we draw any final conclusions, let's consider our measure of social mobility.

The `mobility_index` represents the percent of all students at the school who came from the bottom 20% income households and who reached the top 20% of incomes as adults. For [reference](https://engaging-data.com/household-spending-income/), the bottom 20\% by income earns about \$25,500 annually while the top 20\% earns about \$188,000 annually. But, what if they reach the top 50% of incomes instead; should that count as social mobility?

That is one measure of the economic impact of going to college. Another way to measure the economic value of a college degree is to simply look at the mean earnings of people with different levels of education.

**4.1** - Take a look at the graph below (generated from a different data frame called `earnings`; [data source](https://www.statista.com/statistics/184242/mean-earnings-by-educational-attainment/)).

What do you think these lines represent?

In [None]:
earnings <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vT8f6Hnonep3D7SBL73foZxYNkqCzssVaLeMvy7y0un3gIdmEDn2f2vp7WbFRrHyW63wYqydWY0AuxH/pub?gid=1804563070&single=true&output=csv")
earnings$edu_level_ordered <- factor(earnings$edu_level, levels = c("less_than_9th_grade", "HS_9th_12th_nongrad", "HS_grad_GED", "some_college_no_degree", "assoc_degree", "bach_degree", "masters_degree", "prof_degree", "doctorate_degree"))

gf_line(mean_earnings ~ year, color = ~edu_level_ordered, data = earnings)

<div class="alert alert-block alert-warning">

**Sample Response:**

The lines represent the average annual earnings of people with various levels of education for the years 2005-2020.

</div>

**4.2** - According to the data, which level of education gets closest to the top 20% of incomes (about \$188,000 per year)?

<div class="alert alert-block alert-warning">

**Sample Response:**

Professional degrees have the highest incomes, but even those groups don't earn as much as the top 20\% (\$188,000) on average. They are closer to about \$140,000 annually.

</div>

**4.3** - According to the data, which levels of education have seen the steepest rise in their earnings over the years? Which levels of education appear to have experienced the smallest rise in their earnings over the years?

<div class="alert alert-block alert-warning">

**Sample Response:**

Those with BAs and above tend to show the biggest increases in earnings from 2005-2020.

Those with AAs and below have shown the least growth in earnings over the years.


</div>

**4.4** - From any of the analyses in this Jupyter notebook, can we conclude that going to college helps people make more money? If you were advising someone about their future, would you encourage them to go to college?

<div class="alert alert-block alert-warning">

**Sample Response:**

*Student responses will vary.*


</div>