# __Cost of Education in Republican vs Democrat leaning US States__

### Group 15: Alice Le, Jitao Zhang, Lincoln Lee, Yitong Gong

<center><img src="https://bpr.berkeley.edu/wp-content/uploads/2020/02/College-Debt-Essay-Cover-Photo-scaled.jpg" width = 400></center>

### __1. Introduction__

Post-secondary education is an important and often necessary step in the desired career paths of many students in the USA. The cost to attend, however, is often a prohibiting factor. With fair variance in the cost of attendance between schools in different states, it's fairly common to see students move across the country to attend school where it is more affordable. As a result, tuition costs has long been a point of contention in the US political arena, and a central talking point in presidential elections. But for all of the talking and debating that occurs in TV and during campaigns, are schools in Democrat or Republican leaning areas actually cheaper or more expensive to attend?

In other words, we are interested in **assessing whether there is a statistically significant difference between the average tuition cost in Republican-leaning states and Democrat-leaning states.**

We will investigate this using hypothesis testing on the difference of two population means, and also investigate the standard deviation of the difference in means to produce a confidence interval.

### __2. Preliminary Results__
Let's do a simple, preliminary investigation of the tuition cost means in republican vs democrat leaning states. We'll import the necessary libraries and the dataset first. We are using an open-source online dataset provided by Kaggle.com that contains information on each US state and their tuition costs across multiple years.

In [18]:
# All needed libraries:
library(tidyverse)
library(broom)
library(repr)
library(digest)
library(infer)
library(gridExtra)

# General Graphs' setting:
options(repr.plot.width = 10, repr.plot.height = 6)

# Import online dataset
original_dataset <- read_csv("https://raw.githubusercontent.com/Jitao-Z/dataset/main/nces330_20.csv")
head(original_dataset)

[1mRows: [22m[34m3548[39m [1mColumns: [22m[34m6[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): State, Type, Length, Expense
[32mdbl[39m (2): Year, Value

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


Year,State,Type,Length,Expense,Value
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>
2013,Alabama,Private,4-year,Fees/Tuition,13983
2013,Alabama,Private,4-year,Room/Board,8503
2013,Alabama,Public In-State,2-year,Fees/Tuition,4048
2013,Alabama,Public In-State,4-year,Fees/Tuition,8073
2013,Alabama,Public In-State,4-year,Room/Board,8473
2013,Alabama,Public Out-of-State,2-year,Fees/Tuition,7736


__2.1 Cleaning the data__

The dataset also contains information regarding miscellaneous student costs such as books and living costs. Further, as we are using the 2020 presidential election as a gauge for the political leaning of each state, we want to rely on the data from 2020 only to avoid including data from when the political landscape may have been different.

Let's filter for just the data we're interested in: student tuition in 2020.

In [19]:
set.seed(2356)

# filters out all the irrelevant columns
tuition <- original_dataset |>
    filter(!is.na(Value),
           Expense == "Fees/Tuition",
           Year == 2020)

Now, we want to group the data by state to find the average cost of tuition in that state. Then, we want to append a column to the data correlating each state with their presidential election result (source below).

In [20]:
# calculates the mean tuition fee for each state in 2020
# mean_tuition_fee represents the average tuition fee for each state in 2020
tuition <- tuition |>
    group_by(State) |>
    summarize(mean_tuition_fees = mean(Value))


# adds the third column of political leaning
# note that we are using the party affiliation of each state in the 2020 presidential election 
# to represent their political leanings 
# demo stands for democrat; repub stands for republican
tuition <- tuition |>
    mutate(political_leaning = as.factor(ifelse(grepl("Arizona|California|Colorado|Connecticut|Delaware|Georgia|Hawaii
                                                |Illinois|Maine|Maryland|Massachusetts|Michigan|Minnesota
                                                |Nevada|New Hampshire|New Jersey|New Mexico|New York|Oregon
                                                |Pennsylvania|Rhode Island|Vermont|Virginia|Washington|Wisconsin", State), 
                                            "demo", "repub")))

head(tuition)
nrow(tuition |> filter(political_leaning == "demo"))
nrow(tuition |> filter(political_leaning == "repub"))

State,mean_tuition_fees,political_leaning
<chr>,<dbl>,<fct>
Alabama,13628.8,repub
Alaska,18248.67,repub
Arizona,12235.4,demo
Arkansas,12422.0,repub
California,17368.4,demo
Colorado,14833.2,demo


__2.2 Visualization of the initial data__

Let's use two boxplots side-by-side to get a general idea of the average tuition fees of schools in democrat and republican-leaning states in 2020.

Note that even though the boxplot **will not** reflect the true population parameters, it could still give us some insights into what our dataset looks like and facilitate some first guesses in the difference of mean tuition fees between democrat and republican states.

In [1]:
tuition_boxplots <- tuition |>
    ggplot(mapping = aes(x = political_leaning, y = mean_tuition_fees, fill = political_leaning)) +
    geom_boxplot() +
    ggtitle("Boxplots for tuition fees of democrat and republican states in 2020") +
    labs(x = "Political leaning according to 2020 election", y = "Average tuition fee (USD)", fill = "Political leaning") +
    theme_bw() +
    theme(text = element_text(size = 15))

tuition_boxplots

ERROR: Error in ggplot(tuition, mapping = aes(x = political_leaning, y = mean_tuition_fees, : could not find function "ggplot"


__2.3 Computation of estimate of the parameter__

Let's now take a closer look at the **difference** of these two means, starting with some variable definitions.

- $\mu_1$: population mean of tuition fees of democrat states in 2020
- $\mu_2$: population mean of tuition fees of republican states in 2020

- $\bar{x_1}$: sample mean of tuition fees of democrat states in 2020
- $\bar{x_2}$: sample mean of tuition fees of republican states in 2020

The difference of the two sample means, $\bar{x_1} - \bar{x_2}$, is used as our estimate to approximate $\mu_1 - \mu_2$, our parameter of interest.

In [22]:
summary <- tuition |>
    group_by(political_leaning) |>
    summarize(mean = mean(mean_tuition_fees))

# diff represents x1_bar - x2_bar
estimates <- data.frame(x1_bar = summary$mean[1],
                        x2_bar = summary$mean[2]) |>
    mutate(diff = x1_bar - x2_bar)

estimates

x1_bar,x2_bar,diff
<dbl>,<dbl>,<dbl>
17996.75,14564.99,3431.759


Our initial rough guess says the difference between the two samples means is **3431.759 USD**, suggesting that it is that much cheaper to attend school in a Republican-leaning state (though again, we will need to conduct a more thorough analysis for a concrete estimate)

### __3. Methods: Plan__

__3.1 Is this preliminary exploration good enough?__

This report is based on a dataset of "Average cost of the undergraduate student by state USA" with observations ranging from 2013 to 2021. The exploratory data analysis filtered out the missing values from the dataset and categorized each state according to its political status. For our initial exploration, we only provided a simple comparison of two sample means, one from each group. It resulted in the finding that the mean tuition in Republican states is lower by 3431.759 USD.


_3.1.1 The Good Things_

The preliminary result is alright for getting a rough glimpse into what the population means look like, and what their difference is. Albeit simple, the sample size (number of schools surveyed) is 246, which is a decent sample size to capture a fair proportion of the student population in the US. Given the sample size, it is unlikely that the sample means would have produced an outlandishly incorrect value. Further, the data is from a reliable source; its original source is the US Department of Education and the dataset was updated just a month ago.


_3.1.2 Why this is not enough, and what to do next_

However, more is needed to conclude that the population means are in fact different since the difference in mean values between the two point estimates could simply be the result of sample variation.

For a more thorough analysis, we will use inferential statistics to conduct a hypothesis test to determine whether there is a statistically significant difference between the mean tuition of Democratic States and the average tuition of Republican States. This will be a two-tailed hypothesis test, where the null hypothesis is that the difference in population means equals zero, and the alternative hypothesis that it is not zero. We will use bootstrapping over the theoretical methods as a way to avoid making any assumptions about the distribution of the tuition costs in each population.

Furthermore, we will calculate a confidence interval to estimate a range between which the difference in the means lies. This will also be completed using bootstrapping, wherein a bootstrap distribution is used to find the standard error, and to generate an interval.


__3.2 Expected Findings__

The political landscape in recent years is very polarised, and the issue of education has been a dividing issue between the two parties for a long time. Further, the preliminary results state a large gap between the tuition fees of the two groups. We expect to find a statistically significant difference between the average tuition of Democratic and Republican states. We also expect to find a confidence interval that describes that Republican costs are cheaper with at least a 90% confidence level.

__3.3 Implications__

There are many stakeholders that may be interested in the cost of tuition. Finding a statistically significant difference in the population means would be meaningful to skeptical politicians, journalists or citizens that may be concerned about whether all the discussion in debates translates to real action and impact. This may motivate activists to re-evaluate their strategies, politicians to assess more impactful programs/policies, and corporations to increase lobbying efforts.

A confidence interval describing the difference in tuition between Republican and Democrat states may better inform incoming students and their loved ones, who are concerned about finding education within their budget. It may also reveal that the efforts of some parties may be less/more effective than expected.

__3.4 Future Questions__

The findings of this study may contribute to identifying the true population means of the two parties' tuition costs, which may be useful for future statistical analyses. Namely, some further explorations may investigate the granular correlation between _how_ Republican or Democrat-leaning a state is and what their tuition costs are; perhaps through a scatterplot and a correlation coefficient. Another related study may assess the correlation between public opinion and tuition costs, as a way of measuring the impact that citizens have on their tuition costs.

### __4. References__

Chirumamilla, B. (2023, February 9). Average cost of undergraduate student by State USA. Kaggle. Retrieved March 18, 2023, from https://www.kaggle.com/datasets/bhargavchirumamilla/average-cost-of-undergraduate-student-by-state-usa

College Debt Essay Cover Photo. (n.d.). Berkeley Political Review. Retrieved March 18, 2023, from https://bpr.berkeley.edu/wp-content/uploads/2020/02/College-Debt-Essay-Cover-Photo-scaled.jpg.

Freedberg, L. (2020, June 16). Democrats and Republicans in Congress spar over need for more Federal Education Aid. EdSource. Retrieved March 18, 2023, from https://edsource.org/2020/democrats-and-republicans-in-congress-spar-over-need-for-more-federal-education-aid/633765 

Hartig, H. (2021, August 13). Democrats overwhelmingly favor free college tuition, while Republicans are divided by age, Education. Pew Research Center. Retrieved March 18, 2023, from https://www.pewresearch.org/fact-tank/2021/08/11/democrats-overwhelmingly-favor-free-college-tuition-while-republicans-are-divided-by-age-education/ 

The New York Times. (2020, November 3). Presidential election results: Biden wins. The New York Times. Retrieved March 18, 2023, from https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html 