# Table of contents

1. [Introduction](#Introduction)
2. [Data overview](#Data-overview)
3. [Data wrangling](#Data-wrangling)
4. [Factor analysis](#Factor-analysis)
    1. [Factor interpretation](#Factor-interpretation)
5. [Exploratory analysis](#Exploratory-analysis) 
    1. [Box-Cox power transformation](#Box-Cox-power-transformation)
    2. [Linearity and normality](#Linearity-and-normality)
6. [Multiple linear regression](#Multiple-linear-regression)
    1. [Model interpretation](#Model-interpretation)
7. [Final remarks](#Final-remarks)


# Introduction

This analysis asks which factors are associated with COVID-19 infection rates in United States counties. I will use several datasets covering county health information, county COVID-19 infection counts, county population densities, and state political affiliation. I will then try to identify and evaluate risk factors connected to infection rates using multiple regression analysis. The resulting linear regression model is intended mainly for interpretation. The outcome of the study could help us better understand a population's vulnerability to COVID-19 based on community characteristics, using American counties as the reference point.

In [None]:
# Kaggle specific code

!pip install factor_analyzer

# https://www.kaggle.com/general/63534#672910
!pip install altair vega_datasets notebook vega # needs internet in settings (right panel)

# https://www.kaggle.com/jakevdp/altair-kaggle-renderer
# Define and register a kaggle renderer for Altair

import json
import altair as alt
from IPython.display import HTML

KAGGLE_HTML_TEMPLATE = """
<style>
.vega-actions a {{
    margin-right: 12px;
    color: #757575;
    font-weight: normal;
    font-size: 13px;
}}
.error {{
    color: red;
}}
</style>
<div id="{output_div}"></div>
<script>
requirejs.config({{
    "paths": {{
        "vega": "{base_url}/vega@{vega_version}?noext",
        "vega-lib": "{base_url}/vega-lib?noext",
        "vega-lite": "{base_url}/vega-lite@{vegalite_version}?noext",
        "vega-embed": "{base_url}/vega-embed@{vegaembed_version}?noext",
    }}
}});
function showError(el, error){{
    el.innerHTML = ('<div class="error">'
                    + '<p>JavaScript Error: ' + error.message + '</p>'
                    + "<p>This usually means there's a typo in your chart specification. "
                    + "See the javascript console for the full traceback.</p>"
                    + '</div>');
    throw error;
}}
require(["vega-embed"], function(vegaEmbed) {{
    const spec = {spec};
    const embed_opt = {embed_opt};
    const el = document.getElementById('{output_div}');
    vegaEmbed("#{output_div}", spec, embed_opt)
      .catch(error => showError(el, error));
}});
</script>
"""

class KaggleHtml(object):
    def __init__(self, base_url='https://cdn.jsdelivr.net/npm'):
        self.chart_count = 0
        self.base_url = base_url
        
    @property
    def output_div(self):
        return "vega-chart-{}".format(self.chart_count)
        
    def __call__(self, spec, embed_options=None, json_kwds=None):
        # we need to increment the div, because all charts live in the same document
        self.chart_count += 1
        embed_options = embed_options or {}
        json_kwds = json_kwds or {}
        html = KAGGLE_HTML_TEMPLATE.format(
            spec=json.dumps(spec, **json_kwds),
            embed_opt=json.dumps(embed_options),
            output_div=self.output_div,
            base_url=self.base_url,
            vega_version=alt.VEGA_VERSION,
            vegalite_version=alt.VEGALITE_VERSION,
            vegaembed_version=alt.VEGAEMBED_VERSION
        )
        return {"text/html": html}
    
alt.renderers.register('kaggle', KaggleHtml())
print("Define and register the kaggle renderer. Enable with\n\n"
      "    alt.renderers.enable('kaggle')")

In [None]:
alt.renderers.enable('kaggle')

In [None]:
# Import necessary libraries 

import pandas as pd
import numpy as np 
import altair as alt
import statsmodels.formula.api as smf
import statsmodels.api as sm  
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
from factor_analyzer.factor_analyzer import calculate_kmo
from scipy import stats
import itertools
from statsmodels.graphics.gofplots import qqplot
from scipy.stats import levene, normaltest

# Data overview

Let's take a look at the datasets this analysis will use. A few of them were put together by crawling Wikipedia pages. The rest come from official sources.

In [None]:
# https://github.com/nytimes/covid-19-data
# Cumulative counts of coronavirus cases in the US at the county level
county_infection = pd.read_csv('../input/county-covid-related/us-counties.csv')

In [None]:
county_infection.head()

In [None]:
# let's sort the counties first by date

county_infection['date'] = pd.to_datetime(county_infection['date'])
county_infection = county_infection.sort_values(by='date')

In [None]:
county_infection.tail()

> The latest date of the data is May 13th, 2020.

In [None]:
county_infection[(county_infection['state'] == 'Illinois') & (county_infection['county'] == 'Cook')]

In [None]:
county_infection[(county_infection['state'] == 'California') & (county_infection['county'] == 'Santa Clara')]

> It seems like each county's data starts with the first case of infection and then contains each subsequent day's cumulative count.
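
Because the counts are cumulative, daily new cases can be recovered by differencing within each county. The sketch below uses made-up numbers in the same layout as the file above; the `new_cases` column is only an illustration and is not used in the rest of the analysis.

```python
import pandas as pd

# Toy cumulative counts in the same layout as the infection file (values are made up)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03',
                            '2020-03-02', '2020-03-03']),
    'county': ['A', 'A', 'A', 'B', 'B'],
    'state': ['X', 'X', 'X', 'X', 'X'],
    'cases': [1, 3, 6, 2, 2],
})

toy = toy.sort_values(['state', 'county', 'date'])
# Difference within each county; the first row of a county has no previous day,
# so it falls back to the cumulative value itself
toy['new_cases'] = (
    toy.groupby(['state', 'county'])['cases'].diff().fillna(toy['cases']).astype(int)
)
print(toy)
```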

In [None]:
# https://en.wikipedia.org/wiki/County_(United_States)
# County population and density
county_population = pd.read_csv('../input/county-covid-related/county-population.csv')

In [None]:
county_population.head()

In [None]:
# https://en.wikipedia.org/wiki/Political_party_strength_in_U.S._states
# https://en.wikipedia.org/wiki/List_of_United_States_governors
# State party affiliation based on house representation
state_party_line = pd.read_csv('../input/county-covid-related/state_party_line.csv')

In [None]:
state_party_line.head()

In [None]:
# Source: https://www.countyhealthrankings.org/
# Access: https://app.namara.io/#/data_sets/579ee1c6-8f66-418c-9df9-d7b5b618c774?organizationId=5ea77ea08fb3bf000c9879a1
# County health information
county_health = pd.read_csv('../input/uncover/UNCOVER/county_health_rankings/county_health_rankings/us-county-health-rankings-2020.csv')

In [None]:
county_health.head()

In [None]:
county_health.columns[:75]

> This dataset contains extensive information about a county's attributes, including the rankings, quantiles, rates, and percentages of numerous demographic as well as health qualities. Of the many measurements of each quality, we probably only need one or two to avoid duplication. In addition, I will do a **factor analysis** on the columns to see if it makes sense.

For more information about these columns, please visit this [info](https://app.namara.io/#/data_sets/579ee1c6-8f66-418c-9df9-d7b5b618c774/info?organizationId=5ea77ea08fb3bf000c9879a1) page.

# Data wrangling

In this section, we want to prepare our data for further exploration and analysis. 

In [None]:
# Aggregate data related to county infection and basic characteristics
county = county_infection.merge(
    county_population, left_on=['county', 'state'], right_on=['county', 'state']
).merge(
    state_party_line, left_on=['state'], right_on=['state']
)

In [None]:
county.sample(5)

Let's look at the statistics of the counted days for the counties.

In [None]:
# Count the number of days each county data has
def count_days(series):
    time_series = pd.to_datetime(series)
    first_date = time_series.iloc[0]
    last_date = time_series.iloc[-1]
    
    return (last_date - first_date).days + 1

In [None]:
grouped_county = county.groupby(['state', 'county']).agg(days_counted=('date', count_days))

In [None]:
grouped_county.describe()

In [None]:
grouped_county.shape

> We have 2758 counties in the data. The minimum number of days counted for a county is only one, while the maximum is almost four months. I am happy that the median is 50 days. Ideally, I want every county in the analysis to have at least two months' worth of data so that each of its health characteristics has a decent chance of exerting its influence, if it has any at all. With the current data, I will only include counties with at least 50 days' worth of data, which keeps the eventual infection picture reasonably representative without excluding too much data. Please understand that I'm not a domain expert. I apologize that this cutoff point seems rather arbitrary, but I hope the rationale makes sense domain-wise.

With that said, for the next step, we want to group the infection data by county and create several aggregated columns, including counted days, confirmed infections as a percentage of the county population, death rate (deaths as a percentage of confirmed cases), and raw infection counts. We will also calculate those columns at the 50-day cutoff point so that, for model simplicity, the analysis does not have to account for the number of days. This is also where we will exclude counties that have fewer than 50 days of data.

In [None]:
# Find the value at the 50 day mark
def county_cumulative_days(series, days = 50):
    # This may not be 100% accurate because perhaps some days are missing, 
    # but that seems to happen rarely. So this should be accurate enough.
    if len(series) < days:
        return series.iloc[-1]
    else:
        return series.iloc[days - 1]
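
The positional lookup above assumes one row per day. If that assumption ever matters, a date-based lookup is one possible alternative. This is a minimal sketch on made-up data; `value_at_day` is a hypothetical helper, not part of the analysis, and it assumes the rows are already sorted by date.

```python
import pandas as pd

def value_at_day(dates, values, days=50):
    """Return the last cumulative value recorded on or before the `days`-th day
    counted from the first record (hypothetical helper, assumes sorted dates)."""
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    values = pd.Series(values).reset_index(drop=True)
    cutoff = dates.iloc[0] + pd.Timedelta(days=days - 1)
    return values[dates <= cutoff].iloc[-1]

# Made-up county series where March 3rd is missing
dates = ['2020-03-01', '2020-03-02', '2020-03-04', '2020-03-05']
cases = [1, 2, 5, 7]

print(value_at_day(dates, cases, days=3))  # falls back to the March 2nd count
print(cases[3 - 1])                        # positional lookup lands on March 4th
```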

In [None]:
# Group our data in terms of county and aggregate some columns to show overall infection rate 
# and death rate as well as at the 50 day mark
def group_county_data(data):
    grouped_data = data.groupby(['state', 'county']).agg(
        population=('population', lambda x: x.iloc[-1]),
        density_km=('density_km', lambda x: x.iloc[-1]),
        state_house_blue_perc=('state_house_blue_perc', lambda x: x.iloc[-1]),
        state_governor_party=('state_governor_party', lambda x: x.iloc[-1]),
        days_counted=('date', count_days),
        case_sum=('cases', lambda x: x.iloc[-1]),
        death_sum=('deaths', lambda x: x.iloc[-1]),
        case_count_50_days=('cases', county_cumulative_days),
        death_count_50_days=('deaths', county_cumulative_days)
    )
    
    grouped_data = grouped_data[grouped_data['days_counted'] >= 50]
    grouped_data['infection_rate'] = grouped_data['case_sum']/grouped_data['population']*100
    grouped_data['death_rate'] = grouped_data['death_sum']/grouped_data['case_sum']*100
    grouped_data = grouped_data[grouped_data['infection_rate'] != float("inf")]
    grouped_data['infection_rate_50_days'] = grouped_data['case_count_50_days']/grouped_data['population']*100
    grouped_data['death_rate_50_days'] = grouped_data['death_count_50_days']/grouped_data['case_count_50_days']*100
    
    return grouped_data.reset_index()

In [None]:
grouped_county = group_county_data(county)

In [None]:
grouped_county

> We end up with 1463 counties.

In [None]:
grouped_county.sample(5)

Next, let's tackle county health data.

In [None]:
# Remove state total rows first
county_health = county_health.dropna(subset=['county'])

Take a quick look over the data again.

In [None]:
county_health.sample(5)

In [None]:
county_health.columns

In [None]:
county_health.columns[:100]

> There are 507 columns. To reiterate my proposed course of action, we first want to drop the many alternative measurements of the same quality and keep only the rates. We also want to remove some redundant columns such as population. The goal is to keep the complexity at a manageable level while preserving the value of the information.

In [None]:
excluded_column_words = [
    'quartile',
    'ci_high',
    'ci_low',
    'fips',
    'num',
    'denominator',
    'ratio',
    'population',
]

In [None]:
filtered_columns = county_health.columns[~county_health.columns.str.contains('|'.join(excluded_column_words))]
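
To make the substring filter concrete, here is a small sketch on made-up column names (they only mimic the naming style of the health dataset). Any column whose name contains one of the excluded words anywhere is dropped, so short words such as `num` or `ratio` match broadly. `re.escape` is added here only as a precaution in case a word ever contains a regex special character.

```python
import re
import pandas as pd

# Made-up column names for illustration
columns = pd.Index([
    'percent_smokers',
    'num_deaths',
    'percent_uninsured_ci_low',
    'income_ratio',
    'life_expectancy',
])
excluded_words = ['ci_low', 'num', 'ratio']

# Build one alternation pattern and keep the columns that do not match it
pattern = '|'.join(re.escape(word) for word in excluded_words)
kept = columns[~columns.str.contains(pattern)]
print(list(kept))
```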

In [None]:
print(str(len(filtered_columns)) + ' columns remain!')

In [None]:
filtered_county_health = county_health[filtered_columns]

Next, let's merge the health data into the infection data, and check out the merged data.

In [None]:
county = grouped_county.merge(
    filtered_county_health, left_on=['county', 'state'], right_on=['county', 'state']
)

In [None]:
county

> We still have a lot of columns, and many of them may be missing values for more than half of the rows. We have no reasonable and accessible way of imputing the missing data here. We could fill in missing values from nearby counties, but that could be both erroneous and difficult. As a result, we will simply drop missing data, first by column and then by row. Let's deal with columns first because we want to keep as many rows as possible.

In [None]:
# Let's see the columns at near 90% cutoff points
county.dropna(thresh=1300, axis=1).info(max_cols=200)

> At the 90% row number cutoff point, we have a decent number of columns. Most of the columns seem important, so we will try to keep most of them by setting the cutoff point at 1370 rows to keep the indexes related to suicide.
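
For reference, `thresh` in `dropna` is the minimum number of non-missing values a column (with `axis=1`) must have in order to be kept. A toy illustration with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, 3.0, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# Keep only columns with at least 3 non-missing values ('c' is dropped)
kept_columns = toy.dropna(thresh=3, axis=1)
# Then drop any row that still has a missing value
complete = kept_columns.dropna()
print(kept_columns.columns.tolist(), complete.shape)
```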

In [None]:
county.dropna(thresh=1370, axis=1).dropna()

We are keeping a good amount of data. Let's go ahead with that decision.

In [None]:
county = county.dropna(thresh=1370, axis=1).dropna()

# Factor analysis

After data wrangling, we are still dealing with a large number of columns. If we continue with our analysis as is, it might suffer from the curse of dimensionality. Also, if we were to include interaction terms, the number of parameters could get close to the number of rows. Furthermore, there is a high chance that we will run into multicollinearity. For all these reasons, I have decided to run factor analysis as the next step to reduce dimensionality and find independent latent variables. Please refer to its [wiki](https://en.wikipedia.org/wiki/Factor_analysis) for more information on the technique itself.

In [None]:
# Exclude columns that won't be used as explanatory variables and can't be used in factor analysis
excluded_columns = [
    'state',
    'county', 
    'population',
    'state_house_blue_perc',
    'state_governor_party',
    'days_counted', 
    'case_sum', 
    'death_sum', 
    'case_count_50_days',
    'death_count_50_days', 
    'infection_rate', 
    'death_rate',
    'infection_rate_50_days', 
    'death_rate_50_days',
    'presence_of_water_violation'
]

In [None]:
county_non_factor = county[excluded_columns]

In [None]:
county_factor = county.drop(excluded_columns, axis=1)

In [None]:
len(county_factor.columns)

In [None]:
county_factor.columns

Let's check whether factor analysis is appropriate first. We will be using [Levene’s test](https://en.wikipedia.org/wiki/Levene%27s_test) and the [Kaiser-Meyer-Olkin test](https://www.statisticshowto.com/kaiser-meyer-olkin/). The former assesses whether the variables have equal variances (homoscedasticity) and is suited to samples that might not be perfectly normally distributed. The latter measures the suitability of the data for factor analysis.

To assess which function of the data to use in the Levene test, we need to look at the normality of the columns.

In [None]:
fig = county_factor.hist(
    column=county_factor.columns, 
    xlabelsize=0.1, 
    ylabelsize=0.1, 
    layout=(11, 7), 
    figsize=(10, 10),
    bins=50
)  
[x.title.set_size(0) for x in fig.ravel()]
plt.show()

> Seems like most columns have tailed distributions. Only a few have non-normal distributions. As a result, we will look at the test results for both `mean` and `trimmed` functions. I do acknowledge that there seems to be no perfect test to account for the variety of distributions here.

In [None]:
# Transpose so each variable (column), not each county (row), is one sample
levene(*county_factor.to_numpy().T, center='trimmed')

In [None]:
# Transpose so each variable (column), not each county (row), is one sample
levene(*county_factor.to_numpy().T, center='mean')

> The tests were both statistically significant, indicating that there is likely no homoscedasticity among the variables. 

In [None]:
kmo_all, kmo_model = calculate_kmo(county_factor)

In [None]:
kmo_model

> This score indicates that the data is excellent for factor analysis.
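
For intuition, the overall KMO statistic compares the squared correlations between variables with their squared partial correlations. The sketch below is my own NumPy re-implementation of that definition on simulated data, assuming the correlation matrix is invertible; it is not the `factor_analyzer` code, and `kmo_overall` is a hypothetical name.

```python
import numpy as np

def kmo_overall(data):
    """Overall KMO: sum of squared correlations divided by the sum of squared
    correlations plus squared partial correlations (off-diagonal entries only)."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlation of each pair, controlling for all other variables
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diagonal] ** 2).sum()
    p2 = (partial[off_diagonal] ** 2).sum()
    return r2 / (r2 + p2)

# Simulated data: four variables sharing one latent factor plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
data = latent + 0.5 * rng.normal(size=(500, 4))

kmo = kmo_overall(data)
print(round(kmo, 3))
```

Values close to 1 mean that the partial correlations are small relative to the raw correlations, which is the situation factor analysis needs.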

Let's check out all the original eigenvalues first.

In [None]:
fa = FactorAnalyzer()

# Using the varimax rotation because it makes it easier to identify each variable with a single factor.
fa.set_params(rotation='varimax')
fa.fit(county_factor)

In [None]:
ev, v = fa.get_eigenvalues()
ev[:30]

In [None]:
plt.scatter(range(1, len(ev)+1), ev)
plt.plot(range(1, len(ev)+1), ev)
plt.title('Scree plot')
plt.xlabel('Factor')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

> It seems like we have 14 factors that are significant (eigenvalue >= 1).
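
The eigenvalue cutoff used here is often called the Kaiser criterion. As a rough sketch of what it does, the code below simulates six variables driven by two independent latent factors (all numbers are made up) and counts the eigenvalues of the correlation matrix that exceed one. To my understanding, the first array returned by `get_eigenvalues()` is this same set of correlation-matrix eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independent latent factors, each driving three observed variables
f1, f2 = rng.normal(size=(2, 1000))
noise = 0.5 * rng.normal(size=(1000, 6))
data = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise

# Eigenvalues of the correlation matrix, largest first
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors whose eigenvalue exceeds 1
n_factors = int((eigenvalues > 1).sum())
print(eigenvalues.round(2), n_factors)
```

The eigenvalues of a correlation matrix always sum to the number of variables, so an eigenvalue above one means the factor carries more variance than a single standardized variable.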

In [None]:
fa = FactorAnalyzer()
fa.set_params(n_factors=14, rotation='varimax')
fa.fit(county_factor)

In [None]:
factor_loading = pd.DataFrame(fa.loadings_)

In [None]:
factor_loading.index = county_factor.columns

In [None]:
factor_loading.shape

In [None]:
factor_loading

## Factor interpretation

Now that we have collected all the significant factors, let's interpret them one by one. We will look at columns that have decent loadings (absolute value > .3) for easier interpretation.

In [None]:
def filter_decent_loadings(factor):
    return factor[(factor > 0.3) | (factor < -0.3)]

In [None]:
for factor in factor_loading.columns:
    print('Factor ' + str(factor + 1) + ' loadings: ')
    print()
    print(filter_decent_loadings(factor_loading[factor]))
    print()
    print()
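
As a complement to reading the printed loadings, one can also ask which factor each variable loads on most strongly. A small sketch with a made-up loading table (the variable names and values are illustrative only):

```python
import pandas as pd

# Made-up loadings: three variables (rows) on two factors (columns 0 and 1)
loadings = pd.DataFrame(
    {0: [0.85, 0.10, -0.72], 1: [0.05, 0.64, 0.31]},
    index=['percent_smokers', 'percent_uninsured', 'life_expectancy'],
)

# Factor (1-based) on which each variable has its largest absolute loading
dominant = loadings.abs().idxmax(axis=1) + 1
# The same 0.3 cutoff as above, written with abs()
decent = loadings[0][loadings[0].abs() > 0.3]

print(dominant.to_dict())
print(decent.to_dict())
```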

1. The first factor seems to encompass a lot of indexes related to general well-being. Some of its most substantial positive loadings(> .8) are for the years of potential life lost rate, the percentage of fair or poor health, the percentage of smokers, the percentage of physical or mental distress, and child poverty percentage. Some of its most substantial negative loadings(< -.7) are for median household income, life expectancy, general income, and college percentage. It loads highly on many variables connected to welfare, and its positive direction points towards poor welfare. Based on the loadings, I can identify this factor as **poor general well-being** with decent confidence.

2. The second factor has apparent connections to variables related to housing issues. Its biggest loadings(> .8) are for the percentage of severe housing problems, severe housing cost burden, and the percentage of severe housing cost burden. The percentage of homeowners has the lowest loadings for this factor. It is convincing that this factor is for **housing burden**.

3. The third factor seems to be connected to the prevalence of the Hispanic population. Its most substantial positive loadings(> 0.8) are for the percentage of Hispanic population and the percentage of people not proficient in English. Its strongest negative loading is for the percentage of non-Hispanic white population(~-.52). Other variables with lower factor loadings, such as housing problems and youth population, also seem to make sense for Hispanic population prevalence. As a result, I would determine this factor to be **Hispanic relative population size**.

4. The fourth factor is mostly about suicide rates in the opposite direction, so we could interpret this to be **inverse suicide rate**.

5. The fifth factor is for **uninsured rate** because uninsured rates are all that it loads highly on.

6. The sixth factor has mostly to do with care provider rates such as dentist and mental health(> .5). It seems to be inversely connected with rural percentage and long commute. Although its loadings are relatively weaker, we can probably conclude that it is for **care provider accessibility**.

7. The seventh factor seems to be about the population age as the extreme youth percentage has a positive loading(> .55), and the senior percentage has a very negative loading(< -.86). We can somewhat conclude that this factor is for **population youth**.

8. The eighth factor seems to be mostly about the crime rate and its contributing factors, so we will determine this as **crime risk**.

9. The ninth factor has relatively weak loadings overall. The theme seems to be about the overall income as it includes median household income, 80th & 20th percentile income, long commute, and white household income. We will loosely define this factor to be about **overall income level**.

10. The tenth factor seems to be highly related to population density. Its highest loadings are for density in km(\~.66), traffic volume(\~.4), and Asian population(\~.48), while its lowest loadings are related to driving alone to work. It is clear that this factor is about **population density level**.

11. The eleventh factor should be somewhat apparent, with its two biggest loadings being the percentage of American Indian Alaska Native(.75) and inadequate kitchen or plumbing facilities(.71). We will determine this factor to be about **native relative population size**.

12. The twelfth factor seems to be mainly connected to black population. Its most significant loading is for the percentage of black population(~.44). A lot of its other loadings are seemingly problems more common in black communities. Some examples are low high school graduation rate, crime rate, and single parent households. One of its negative loadings is the percentage of non-hispanic white population. I think we can determine that this factor is for **black relative population size**.

13. The thirteenth factor seems to be about urbanization with its most negative loading being about rural percentage and its most positive one about access to exercise opportunities. The [latter](https://www.countyhealthrankings.org/explore-health-rankings/measures-data-sources/county-health-rankings-model/health-factors/health-behaviors/diet-exercise/access-to-exercise-opportunities) is defined to be specifically about facilities, which are more plentiful in an urban environment. As a result, we will determine this factor to be about **urbanization level**.

14. The fourteenth factor is also straightforward as its loadings are all about food. We will determine it to be about **poor food environment index** due to its related variables and their directions.

> Please note that these interpretations are purely subjective and could be done better with more domain knowledge.

In [None]:
fa.get_factor_variance()

> Together the 14 factors explain about 74% of the total variance.
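
To show where such a number comes from: the proportion of variance explained by a factor is the sum of its squared loadings divided by the number of variables (assuming standardized variables). The sketch below uses a made-up loading matrix; as far as I can tell, `get_factor_variance()` returns these three quantities in the same order.

```python
import numpy as np

# Made-up loading matrix: 4 variables (rows) x 2 factors (columns)
loadings = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.6],
])

ss_loadings = (loadings ** 2).sum(axis=0)     # variance captured by each factor
proportion = ss_loadings / loadings.shape[0]  # share of the total variance
cumulative = proportion.cumsum()              # running total across factors
print(ss_loadings, proportion, cumulative)
```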

Now with the factors interpreted, let's transform the original columns into factor scores, and append `_fa_score` to the factor names. Then, we will merge the data back.

In [None]:
fa_score_columns = [
    'poor_general_wellbeing_fa_score',
    'housing_burden_fa_score',
    'hispanic_relative_population_fa_score',
    'inverse_sucicde_rate_fa_score',
    'uninsured_rate_fa_score',
    'care_provider_accessibility_fa_score',
    'population_youth_fa_score',
    'crime_risk_fa_score',
    'overall_income_fa_score',
    'population_density_fa_score',
    'native_relative_population_fa_score',
    'black_relative_population_fa_score',
    'urbanization_level_fa_score',
    'poor_food_environment_fa_score',
]

transformed_county_factor = pd.DataFrame(
    fa.transform(county_factor),
    columns=fa_score_columns
)

In [None]:
county = county_non_factor.reset_index(drop=True).join(transformed_county_factor)

# Exploratory analysis

In this section, we want to explore some factors' distribution and their relationships with the response variable. We will also compare the counties by their state governor parties. For the rest of the analysis, we will set the response variable as the infection rate at 50 days since the first case of a county.

In [None]:
# Remove some columns we are definitely not interested in
county = county.drop(columns=[
    'population', 
    'state_house_blue_perc', 
    'days_counted',
    'case_sum',
    'death_sum',
    'case_count_50_days',
    'death_count_50_days',
    'presence_of_water_violation'
])

In [None]:
county.sample(5)

In [None]:
county.info()

In [None]:
alt.Chart(county).mark_bar().encode(
    alt.X("infection_rate_50_days", bin=alt.Bin(extent=[0, 3], step=0.02)),
    y='count()',
).properties(
    width=800,
    height=400,
    title='Infection rate at 50 days since first case'
)

> It looks like we have a right skewness for the response variable distribution. Alternatively and maybe more accurately, we are looking at a Gamma distribution here.  Most counties' infection rates seem to lie below 0.8.

## Box-Cox power transformation

In any case, rather than building a generalized linear model with a Gamma distribution, we will reduce the skewness by transforming the response variable towards a normal distribution, which I find easier to interpret and more intuitive. We will use the [Box-Cox](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html) power transformation because it tends to be more powerful than a log or root transformation in correcting skewness.

In [None]:
infection_rate_50_days_boxcox, lmbda = stats.boxcox(county['infection_rate_50_days'])

In [None]:
lmbda

Let the transformed rate be `x` and the old rate be `y`. 

The formula of their relationship is `(lmbda * x + 1)^(1/lmbda) = y`, which in this case is approximately `(0.044x + 1)^22.73 = y`. Let's visualize what that means.
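
As a quick sanity check of this relationship, here is a minimal pure-Python sketch of the forward and inverse transforms. It uses the rounded value 0.044 quoted above rather than the fitted `lmbda`, and the sample rates are made up. SciPy also ships `scipy.special.inv_boxcox` for the inverse, which would be the more natural choice in the notebook itself.

```python
import math

LMBDA = 0.044  # rounded value quoted in the text, for illustration only

def boxcox(y, lmbda=LMBDA):
    """Forward Box-Cox transform for y > 0."""
    return math.log(y) if lmbda == 0 else (y ** lmbda - 1) / lmbda

def inv_boxcox(x, lmbda=LMBDA):
    """Inverse transform back to the original scale."""
    return math.exp(x) if lmbda == 0 else (lmbda * x + 1) ** (1 / lmbda)

# Round trip on a few made-up infection rates (in percent)
for rate in (0.05, 0.2, 0.8):
    x = boxcox(rate)
    print(f'rate={rate:<5} transformed={x:.3f} back={inv_boxcox(x):.3f}')
```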

In [None]:
county['infection_rate_50_days_boxcox'] = infection_rate_50_days_boxcox

In [None]:
alt.Chart(county).mark_line().encode(
    x='infection_rate_50_days_boxcox',
    y='infection_rate_50_days'
).properties(
    title='Infection rate Boxcox transformation relationship'
)

> Basically, as the Box-Cox transformed infection rate increases, the real infection rate increases roughly exponentially, meaning the coefficients have roughly exponential impacts on the real infection rate.

In [None]:
alt.Chart(county).mark_bar().encode(
    alt.X("infection_rate_50_days_boxcox", bin=alt.Bin(extent=[-5, 2], step=0.1)),
    y='count()',
).properties(
    width=800,
    height=400,
    title='Boxcox infection rate at 50 days since first case'
)

> The new response variable's distribution looks normal.

Next, let's look at the correlations with the transformed rate.

In [None]:
county[county['state_governor_party'] == 'blue'].select_dtypes('number').corr(method='pearson')['infection_rate_50_days_boxcox']

> For a county in a blue state, the more prominent positively correlated factors(>.25) are **urbanization**, **income level**, and **inverse suicide**. The more prominent negatively correlated factors are **crime**(-.2) and **care provider accessibility**(-.1).

In [None]:
county[county['state_governor_party'] == 'red'].select_dtypes('number').corr(method='pearson')['infection_rate_50_days_boxcox']

> For a county in a red state, the more prominent positively correlated factors(>.2) are **black population** and **inverse suicide**. The more prominent negatively correlated factors are **crime**(-.13) and **hispanic population**(-.11).

In [None]:
county.select_dtypes('number').corr(method='pearson')['infection_rate_50_days_boxcox']

> Overall, the more prominent positively correlated factors(>.2) are **income level** and **inverse suicide**. The more prominent negatively correlated factors are **crime**(-.18) and **care provider accessibility**(-.085).
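
The pooled and per-party correlations can differ a lot, which is why it is worth computing both. A toy sketch with made-up numbers (`factor_score` and `response` are placeholder column names):

```python
import pandas as pd

toy = pd.DataFrame({
    'state_governor_party': ['blue', 'blue', 'blue', 'red', 'red', 'red'],
    'factor_score': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'response': [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

# Pearson correlation within each party
by_party = {
    party: group['factor_score'].corr(group['response'])
    for party, group in toy.groupby('state_governor_party')
}
# Pearson correlation over all rows pooled together
overall = toy['factor_score'].corr(toy['response'])
print(by_party, overall)
```

In this toy case the two groups have perfectly opposite relationships, and the pooled correlation hides both of them.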

## Linearity and normality

Next, we will explore some of the more prominent explanatory variables and visualize their relationships with the infection rate as well as their distributions. We will first look at the positively correlated explanatory variables that are relatively prominent for both parties.

In [None]:
alt_y = alt.Y(
    'infection_rate_50_days_boxcox', 
    axis=alt.Axis(values=list(np.linspace(-6, 2, 81))),
    scale=alt.Scale(domain=(-5, 2), clamp=True)
)

In [None]:
alt.Chart(county).mark_point(filled=True, size=22).encode(
    x='inverse_sucicde_rate_fa_score',
    y=alt_y,
    color='state_governor_party'
).properties(
    width=800,
    height=400,
    title='Inverse suicide factor score vs Boxcox infection rate'
)

In [None]:
alt.Chart(county).mark_bar().encode(
    alt.X("inverse_sucicde_rate_fa_score", bin=alt.Bin(extent=[-3, 3], step=0.1)),
    y='count()',
)

In [None]:
alt.Chart(county).mark_point(filled=True, size=22).encode(
    x='overall_income_fa_score',
    y=alt_y,
    color='state_governor_party'
).properties(
    width=800,
    height=400,
    title='Overall income factor score vs Boxcox infection rate'
)

In [None]:
alt.Chart(county).mark_bar().encode(
    alt.X("overall_income_fa_score", bin=alt.Bin(extent=[-3, 3], step=0.1)),
    y='count()',
)

> As expected, the above factor scores, which have relatively high positive correlation coefficients regardless of state party, are normally distributed but have only weak positive linear relationships with the transformed infection rate.

Next, we will look at some positively correlated explanatory variables that are relatively prominent for only one party.

In [None]:
alt.Chart(county).mark_point(filled=True, size=22).encode(
    x='urbanization_level_fa_score',
    y=alt_y,
    color='state_governor_party'
).properties(
    width=800,
    height=400,
    title='Urbanization factor score vs Boxcox infection rate'
)

> Blue counties exhibit stronger linearity here.

In [None]:
alt.Chart(county).mark_bar().encode(
    alt.X("urbanization_level_fa_score", bin=alt.Bin(extent=[-3, 3], step=0.1)),
    y='count()',
)

In [None]:
alt.Chart(county).mark_point(filled=True, size=22).encode(
    x='black_relative_population_fa_score',
    y=alt_y,
    color='state_governor_party'
).properties(
    width=800,
    height=400,
    title='Black population factor score vs Boxcox infection rate'
)

> Red counties exhibit slightly stronger linearity here.

In [None]:
alt.Chart(county).mark_bar().encode(
    alt.X("black_relative_population_fa_score", bin=alt.Bin(extent=[-3, 3], step=0.1)),
    y='count()',
)

> Here we are looking at two factors that have different effects on counties in states with different governor parties. The urbanization factor has a fairly obvious but weak linear relationship with the transformed infection rate in blue counties, but that relationship cannot be found in red counties. On the other hand, the black population factor has a very weak linear relationship with the transformed infection rate only for counties in red states. Overall, these relationships are difficult to spot because they are not especially strong.

In conclusion, although linearity seems to be weak, it does exist between some explanatory variables and the response variable, so we can be confident of finding a somewhat useful linear equation. We can be fairly assured that decent normality is ensured and multicollinearity is alleviated with the factors. Moreover, based on the relationship graphs, we see no obvious pattern with the residuals, so we can be somewhat confident with homoscedasticity as well. With that said, we can proceed with the regression.

# Multiple linear regression

First, we will devise all combinations of model formulas and build them. Then, we will identify and explore the models with the highest adjusted R-squared and the lowest AIC and BIC scores. Finally, we will evaluate assumptions again and interpret the models.

In [None]:
interaction_term = 'state_governor_party'
response_variable = 'infection_rate_50_days_boxcox'
explanatory_variables = fa_score_columns

In [None]:
explanatory_variables

In [None]:
variable_combinations = []

for variable in explanatory_variables:
    variable_combinations.append([variable, variable + '*' + interaction_term])

In [None]:
formula_combinations = list(itertools.product(*variable_combinations))

In [None]:
print('There are ' + str(len(formula_combinations)) + ' combinations.')
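
To make the enumeration concrete, here is the same construction with three placeholder factor names instead of fourteen. Every formula contains every factor; the combinations only differ in which factors also get an interaction with the governor party. In the formula syntax, `a*b` expands to `a + b + a:b`.

```python
import itertools

variables = ['a_fa_score', 'b_fa_score', 'c_fa_score']  # placeholder names
interaction = 'state_governor_party'

# Each factor enters either alone or crossed with the interaction term
choices = [[v, v + '*' + interaction] for v in variables]
formulas = [
    'y ~ ' + ' + '.join(combo) for combo in itertools.product(*choices)
]

print(len(formulas))  # 2 choices per factor, so 2**3 formulas
print(formulas[0])
print(formulas[-1])
```

With fourteen factors, the same construction yields `2**14 = 16384` formulas.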

In [None]:
models = []
rsquared_adjs = []
formulas = []
aics = []
bics = []

for combo in formula_combinations:
    # Build a formula such as "response ~ f1 + f2*party + ..."
    explanatory_variable_part = ' + '.join(combo)
    formula = response_variable + ' ~ ' + explanatory_variable_part

    # Fit ordinary least squares and keep the fitted results object
    mod = smf.ols(formula=formula, data=county)
    res = mod.fit()

    models.append(res)
    formulas.append(formula)
    rsquared_adjs.append(res.rsquared_adj)
    aics.append(res.aic)
    bics.append(res.bic)

    # Report progress every 1600 fitted models
    if len(models) % 1600 == 0:
        print(str(len(models)) + ' models finished so far!')

In [None]:
result = pd.DataFrame({
    'formula': formulas,
    'rsquared_adj': rsquared_adjs,
    'aic': aics,
    'bic': bics,
    'model': models
})

In [None]:
result.sort_values(by='rsquared_adj', ascending=False).head()

In [None]:
result.sort_values(by='aic').head()

In [None]:
result.sort_values(by='bic').head()

> Looks like optimizing for AIC and adjusted R-squared gives the same model, while minimizing BIC results in a different model. We will focus on these two models for the rest of the analysis.
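
One likely reason the criteria disagree is that they penalize extra terms differently. The sketch below uses the standard definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters and n the number of observations. The values of `n_obs`, `llf`, and `k` are invented for illustration and are not taken from the fitted models.

```python
import math

def aic(llf, k):
    """Akaike information criterion from a log-likelihood and parameter count."""
    return 2 * k - 2 * llf

def bic(llf, k, n_obs):
    """Bayesian information criterion; each parameter costs ln(n_obs)."""
    return k * math.log(n_obs) - 2 * llf

# Illustrative numbers only (not from the notebook's models).
n_obs = 3000
small = {'llf': -4000.0, 'k': 15}   # fewer interaction terms
large = {'llf': -3985.0, 'k': 22}   # more interaction terms, slightly better fit

for name, m in [('small', small), ('large', large)]:
    print(name, aic(m['llf'], m['k']), round(bic(m['llf'], m['k'], n_obs), 1))
```

In this toy case AIC prefers the larger model while BIC prefers the smaller one. Whenever n exceeds e^2 (about 7.4), the per-parameter penalty of BIC is larger than that of AIC, so BIC tends to favor models with fewer terms.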

In [None]:
aic_res = result.loc[result['aic'].idxmin(), 'model']

In [None]:
bic_res = result.loc[result['bic'].idxmin(), 'model']

Next, we will confirm whether the two models' residuals look random and are somewhat normally distributed.

In [None]:
model_df = pd.DataFrame({
    'aic_model_residual': county['infection_rate_50_days_boxcox'].values - aic_res.fittedvalues,
    'bic_model_residual': county['infection_rate_50_days_boxcox'].values - bic_res.fittedvalues,
    'real_val': county['infection_rate_50_days_boxcox'],
    'aic_model_pred': aic_res.fittedvalues,
    'bic_model_pred': bic_res.fittedvalues,
})

In [None]:
# QQ plot for the AIC model residuals
qqplot(model_df['aic_model_residual'], line='s')
plt.show()

In [None]:
# QQ plot for the BIC model residuals
qqplot(model_df['bic_model_residual'], line='s')
plt.show()

In [None]:
alt.Chart(model_df).mark_point().encode(
    x='aic_model_pred',
    y='aic_model_residual',
) | alt.Chart(model_df).mark_point().encode(
    x='bic_model_pred',
    y='bic_model_residual',
)

> The QQ plots suggest that our residuals are approximately normal, although their distributions have slightly heavy right tails. I think this indicates that when the models underestimate, they tend to miss by more than when they overestimate. The residual plots display no obvious pattern.
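
To put a number on the right-tail impression, one option is the skewness of the residuals; a positive value is consistent with a longer right tail. Below is a minimal standard-library sketch, with a made-up list standing in for `model_df['aic_model_residual']`:

```python
def skewness(values):
    """Biased skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up residuals with a few large positive values (underestimates).
toy_residuals = [-0.6, -0.4, -0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.8, 2.0]
print(skewness(toy_residuals))
```

`scipy.stats.skew` computes the same biased estimate by default and could be applied to the real residual columns directly.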

## Model interpretation

Next, we will explore the models in depth. As we go through the models, the following sections will be relevant to keep in mind.

- [Box-Cox transformation](#Box-Cox-power-transformation) for the relationship between the Box-Cox transformed rate and the real infection rate

- [Factor interpretation](#Factor-interpretation) for understanding what each factor encompasses in more detail
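
Because the coefficients are on the Box-Cox scale, it can help to map values back to the original rate. Here is a minimal sketch of the transform and its inverse, assuming the usual one-parameter form. The `lmbda` value is a placeholder, not the value fitted earlier in this notebook, and `rate` is a hypothetical infection rate.

```python
import math

def boxcox(y, lmbda):
    """One-parameter Box-Cox transform for y > 0."""
    if lmbda == 0:
        return math.log(y)
    return (y ** lmbda - 1) / lmbda

def inv_boxcox(z, lmbda):
    """Inverse of boxcox: recover y from the transformed value z."""
    if lmbda == 0:
        return math.exp(z)
    return (lmbda * z + 1) ** (1 / lmbda)

lmbda = 0.2      # placeholder, not the fitted lambda
rate = 0.0035    # hypothetical infection rate
z = boxcox(rate, lmbda)
print(z, inv_boxcox(z, lmbda))
```

The transform is monotonically increasing, so a positive coefficient still means a higher infection rate, but a fixed change on the transformed scale does not correspond to a fixed change in the original units.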

We will discuss model differences first and then explore what they have in common.

In [None]:
print(aic_res.summary())

> Although this model is less parsimonious, it does provide useful information about party-specific effects for some factors, which the other model lacks. We will look at the significant ones (p < .05) here. The poor general well-being (.18) and housing burden (.13) factor scores in blue counties seem to have a moderate effect on the transformed infection rate. The native relative population factor score seems to have exactly opposite effects on the rate in blue (-.10) and red (.10) counties. Surprisingly, the Hispanic relative population factor score has a negative effect (-.11) on the infection rate for red counties.

In [None]:
print(bic_res.summary())

> This model is a lot simpler, with fewer interaction terms. Let's look at the significant coefficients that differ. The poor general well-being, housing burden, and black relative population factor scores are simplified to an overall positive effect (.13, .095, and .12, respectively) on the transformed infection rate. On the other hand, the Hispanic relative population factor score is generalized to a slight negative effect (-.086) on the infection rate.

The two models agree on a lot as well. Both land on a relatively substantial positive coefficient (.31) for the inverse suicide factor score as well as a comparatively strong negative coefficient (-.18) for the crime risk factor score. For the smaller effects, both identify the poor food environment (.036), black relative population (.12), care provider accessibility (-.11), and population density (.083) factor scores. For insignificant factors, both include the population youth and uninsured rate factor scores. In terms of party differences, both find that the overall income (.33 and -.20) and urbanization (.31 and -.13) factor scores have opposite effects on the transformed infection rate in blue and red counties, respectively.
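
When reading the summaries, recall that with treatment coding a term such as `x*state_governor_party` yields a slope for the reference party plus an interaction coefficient that is the difference for the other party. A sketch of combining the two follows. The parameter names and numbers are hypothetical and depend on how the party column is coded, so they should be checked against `aic_res.params.index`. The pairs quoted above may already be combined slopes; this only illustrates the arithmetic.

```python
def party_slopes(params, factor, interaction_name):
    """Return (reference-party slope, other-party slope) for one factor."""
    base = params[factor]
    return base, base + params[interaction_name]

# Hypothetical coefficients mimicking a fitted model's params mapping.
toy_params = {
    'factor_a': 0.30,
    'factor_a:state_governor_party[T.republican]': -0.45,
}
ref_slope, other_slope = party_slopes(
    toy_params, 'factor_a', 'factor_a:state_governor_party[T.republican]'
)
print(ref_slope, round(other_slope, 2))
```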

# Final remarks

I want to touch on some limitations of my analysis. First, I am not a domain expert. As a result, there is probably a lot of room for improvement throughout the analysis that would make more sense domain-wise. For instance, there might be a more reasonable response variable given our data. Another example is the factor analysis: a domain expert would likely have more domain-specific insights into the factors, translating to a more robust model interpretation.

Where assumptions or simplifications were made, I have erred on the side of caution. For instance, in my model interpretation, I merely described the coefficients without going into detail about what they imply for the general public. However, I think a more thorough investigation of the model would prove to be extremely valuable.

I want to thank the New York Times and County Health Rankings for the data. Thank you for reading. Stay safe!