# Linear regression modeling

## Dataframe setup

Although our data was cleaned for the EDA section, some extra cleaning is required before running our model. This includes creating dummy variables, dropping columns with too many missing values, and dropping counties with missing values in any column. Below, we load the required packages and clean the data as described.

In [1]:
# loading in necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
# Load the obesity data from our EDA and drop columns that have too many null values.
# Hawaii and Alaska had no obesity rates, which our model requires, so rows whose region code contains 'O' are dropped.
# The obesity proxies that went into the healthy access score are dropped as well.
obesity_df = pd.read_csv("../../processed_data/obesity_eda.csv")
obesity_df = obesity_df[~obesity_df.region.str.contains('O')]
obesity_df = obesity_df.drop(columns = ['primary_minority', 'supercenter_access_score', 'grocery_access_score', 'fullservice_access_score', 'farmersmarket_access_score', 'wic_available_per1000', 'snap_bens_per1000'])

# Loading in car access data
car_access_df = pd.read_csv("../../processed_data/car_access_2017.csv")

# Merged the two data sets
obesity_df = pd.merge(obesity_df, car_access_df, how = 'left', on = 'fips')

# made some numerical variables easier to read 
obesity_df['percent_no_car'] = obesity_df['percent_no_car'] * 100
obesity_df['pop_estimate'] = obesity_df['pop_estimate']/1000
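
A left merge keeps every county from the obesity data even when the car access file has no matching fips, which silently introduces nulls. The sketch below shows one way to check for unmatched counties with the `indicator` option of `pd.merge`. The two small frames are made-up stand-ins, not our real data.

```python
import pandas as pd

# Toy stand-ins for obesity_df and car_access_df (all values are made up)
left = pd.DataFrame({"fips": [1001, 1003, 1005], "obesity_rate": [30.1, 28.4, 35.0]})
right = pd.DataFrame({"fips": [1001, 1005], "percent_no_car": [0.05, 0.11]})

# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(left, right, how="left", on="fips", indicator=True)

# Rows labelled "left_only" had no match in the car access table
unmatched = merged.loc[merged["_merge"] == "left_only", "fips"].tolist()
print(unmatched)
```

Any fips listed here would carry a null percent_no_car after the merge.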

We wanted to find the counties within the contiguous US that could be described as food deserts. One definition of 'food desert', provided by the Annie E. Casey Foundation, can be found [here](https://www.aecf.org/blog/exploring-americas-food-deserts), and other sources present a similar idea. The food insecurity variable in our data, obtained from Feeding America and coded as the fi_rate column, was calculated using many of the factors described in the definition linked above; the calculation can be found [here](https://www.feedingamerica.org/sites/default/files/research/map-the-meal-gap/2016/2016-map-the-meal-gap-technical-brief.pdf). What was missing was a variable for access to food. Since we had already created a proxy for access to food (healthy_access_category, which encompasses grocery stores, supercenters, restaurants, and farmers markets), we decided to use both fi_rate and healthy_access_category to determine whether a county was labelled a food desert.

The third quartile for food insecurity is about 15.2, meaning 75 percent of counties are at or below that value. We decided that any county in the highest 25% would be considered the 'highest tier' of food insecurity. We chose the top 25% because it is marginally conservative. We then decided that, to be labelled a food desert, a county not only had to be within the top 25% of food-insecure counties in the United States but also had to have either low or medium access to healthy foods. In the code, the cutoff is rounded to 15.
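
As a minimal sketch of how a third-quartile cutoff like this can be derived, the example below calls `Series.quantile(0.75)` on a made-up set of food insecurity rates and flags the values above it. The numbers are illustrative only and are not taken from our data.

```python
import pandas as pd

# Made-up food insecurity rates for five hypothetical counties
fi = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# Third quartile: 75 percent of values are at or below this cutoff
q3 = fi.quantile(0.75)

# Flag the counties strictly above the cutoff
top_tier = fi > q3
print(q3, top_tier.tolist())
```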

In [3]:
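# Making our food desert indicator (stored in the food_dessert column):
# 1 = fi_rate above 15 and healthy access not 'high'; 0 = fi_rate of 15 or below.
# Counties matching neither condition (fi_rate above 15 with high access) get np.select's default of 0.
conditions = [(obesity_df["fi_rate"] > 15) & (obesity_df["healthy_access_category"] != 'high'), 
              (obesity_df["fi_rate"] <= 15)]
values = [1, 0]

obesity_df["food_dessert"] = np.select(conditions, values)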
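
To make the behaviour of `np.select` concrete, here is a small sketch that applies the same two conditions to four made-up counties. The second county has a high fi_rate but high access, so it matches neither condition and falls through to the default of 0.

```python
import numpy as np

# Four hypothetical counties (values are made up)
fi_rate = np.array([18.0, 18.0, 12.0, 15.0])
access = np.array(["low", "high", "medium", "low"])

# Same two conditions as the cell above, applied to the toy arrays
conditions = [(fi_rate > 15) & (access != "high"), fi_rate <= 15]
values = [1, 0]

# Rows matching no condition receive np.select's default, which is 0
flags = np.select(conditions, values)
print(flags.tolist())
```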

In [4]:
# dropping any remaining null values so our model can work
obesity_df = obesity_df.dropna()

In [5]:
# creating numeric codes for the categorical variables; the VIF test does not take in categorical variables

obesity_df['region'] = obesity_df['region'].map({'N':1, 'M':2, 'S':3, 'W':4})
obesity_df['healthy_access_category'] = obesity_df['healthy_access_category'].map({'low':1, 'medium':2, 'high':3})
obesity_df['class_category'] = obesity_df['class_category'].map({'low_income':1, 'lower_mid_class':2, 'mid_class':3, 'highest_income': 4})
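
The mapping above assigns integer codes, which implies an ordering between categories (reasonable for low/medium/high, less obviously so for region). As a point of comparison only, and not what this notebook does, one-hot dummy columns could be built with `pd.get_dummies`. The toy frame below is made up.

```python
import pandas as pd

# Hypothetical region column using the same codes as our data
toy = pd.DataFrame({"region": ["N", "M", "S", "W", "S"]})

# One column per category; drop_first removes one level to avoid perfect collinearity
dummies = pd.get_dummies(toy["region"], prefix="region", drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

Categories are sorted alphabetically, so 'M' is the dropped reference level here and a county in region M has zeros in all three remaining columns.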

### Checking which variables are free of collinearity, then running a multivariable regression model

To avoid multicollinearity ([independent variables that are highly correlated with each other](https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea)), we used the Variance Inflation Factor (VIF) technique. In the VIF method, each feature is regressed against all of the other features. We only kept variables with a VIF below 10.
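
As a rough illustration of what the VIF measures, the sketch below computes it by hand as 1 / (1 - R²), where R² comes from regressing one feature on the others. It uses synthetic data and, as an assumption of this sketch, includes an intercept in each auxiliary regression, so its numbers need not match what `variance_inflation_factor` returns for the same matrix.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (2-D array, one column per feature)."""
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        # Auxiliary regression of feature i on the remaining features plus an intercept
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: c is nearly a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

scores = vif(np.column_stack([a, b, c]))
print(scores)
```

In this toy setup, a and c should score far above the cutoff of 10 while b stays close to 1.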

#### Attempt 1: Running VIF on all columns followed by model (inefficient)

In this first attempt, we took every column in our dataset and ran VIF on it. We then put only the variables with a low VIF into the regression model. Following that, we removed variables from the model one by one if they did not have a significant p-value.

In [6]:
# selecting all of our candidate columns into one dataframe
indp_vars = obesity_df[['percent_pop_low_access_15',
       'percent_low_income_low_access_15', 'percent_no_car_low_access_15',
       'percent_snap_low_access_15', 'percent_child_low_access_15',
       'percent_senior_low_access_15', 'percent_white_low_access_15',
       'percent_black_low_access_15', 'percent_hispanic_low_access_15',
       'percent_asian_low_access_15', 'percent_nhna_low_access_15',
       'nhpi_low_access_15', 'percent_nhpi_low_access_15',
       'percent_multiracial_low_access_15', 'grocery_per1000', 'super_per1000',
       'convenience_per1000', 'specialty_per1000', 'snap_available_per1000',
        'farmers_markets_per1000', 'pct_fm_accepting_snap',
       'pct_fm_accept_wic', 'pct_fm_credit', 'fm_sell_frveg',
       'pct_fm_sell_frveg', 'region','cost_per_meal', 'est_annual_food_budget_shortfall',
       'school_lunch_prog_17', 'school_bfast_prog_17', 'smr_food_prog_17',
       'wic_parts_pop_17', 'fast_food_per1000', 'full_service_per1000',
       'pop_estimate', 'percent_white', 'percent_black',
       'percent_native_american', 'percent_asian', 'percent_nhpi',
       'percent_multi', 'percent_nonwhite_hispanic', 'median_household_income',
       'class_category', 'healthy_access_score',
       'healthy_access_category', 'percent_no_car', 'food_dessert']]

In [7]:
#https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
# creating a blank dataframe followed by adding each column to a new column called feature
vif_data = pd.DataFrame()
vif_data["feature"] = indp_vars.columns

# for every feature in vif_data, run VIF on it 
vif_data["VIF"] = [variance_inflation_factor(indp_vars.values, i)
                          for i in range(len(indp_vars.columns))]

# filter the data to include VIF less than ten
vif_data = vif_data[vif_data["VIF"] < 10]

In [8]:
# code was modeled and influenced by DS4A (Correlation One) material
formula= 'obesity_rate ~  percent_no_car_low_access_15 + grocery_per1000+ super_per1000+convenience_per1000+ specialty_per1000+farmers_markets_per1000+ pct_fm_accepting_snap+pct_fm_accept_wic+ pct_fm_credit+ fm_sell_frveg+pct_fm_sell_frveg+ smr_food_prog_17+ fast_food_per1000+full_service_per1000+ percent_native_american+ percent_asian+percent_nhpi+ percent_multi+ food_dessert'
model1 = sm.ols(formula = formula, data = obesity_df)
lin_reg = model1.fit()
print(lin_reg.summary())

                            OLS Regression Results                            
Dep. Variable:           obesity_rate   R-squared:                       0.402
Model:                            OLS   Adj. R-squared:                  0.398
Method:                 Least Squares   F-statistic:                     107.5
Date:                Sun, 11 Jul 2021   Prob (F-statistic):          1.77e-321
Time:                        04:49:57   Log-Likelihood:                -8144.4
No. Observations:                3062   AIC:                         1.633e+04
Df Residuals:                    3042   BIC:                         1.645e+04
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept       

#### Attempt 2: Running VIF on columns thought to have relevance in the literature

In this second attempt, we took only the columns in our dataset that had some evidence in the literature of having an effect on obesity and ran VIF on them. We then put only the variables with a low VIF into the regression model. Following that, we removed variables from the model one by one if they did not have a significant p-value.

In [9]:
# selecting only the columns with support in the literature
indp_vars2 = obesity_df[[ 'grocery_per1000', 'super_per1000',
       'convenience_per1000', 'snap_available_per1000','fast_food_per1000', 'full_service_per1000',
       'farmers_markets_per1000', 'cost_per_meal', 'smr_food_prog_17', 'pop_estimate', 'percent_white', 'percent_black',
       'percent_native_american', 'percent_asian', 'percent_nhpi',
       'percent_multi', 'percent_nonwhite_hispanic', 'food_dessert', 'school_lunch_prog_17', 'school_bfast_prog_17']]

In [10]:
#https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
# creating a blank dataframe followed by adding each column to a new column called feature
vif_data2 = pd.DataFrame()
vif_data2["feature"] = indp_vars2.columns

# for every feature in vif_data, run VIF on it 
vif_data2["VIF"] = [variance_inflation_factor(indp_vars2.values, i)
                          for i in range(len(indp_vars2.columns))]

# filter the data to include VIF less than ten
vif_data2 = vif_data2[vif_data2["VIF"] < 10]

array(['grocery_per1000', 'super_per1000', 'convenience_per1000',
       'fast_food_per1000', 'full_service_per1000',
       'farmers_markets_per1000', 'smr_food_prog_17', 'pop_estimate',
       'percent_black', 'percent_native_american', 'percent_asian',
       'percent_nhpi', 'percent_multi', 'percent_nonwhite_hispanic',
       'food_dessert'], dtype=object)

In [11]:
# code was modeled and influenced by DS4A (Correlation One) material
formula2= 'obesity_rate ~ super_per1000+ convenience_per1000+fast_food_per1000+ full_service_per1000+farmers_markets_per1000+ smr_food_prog_17+ pop_estimate+percent_black+ percent_native_american+ percent_asian+percent_nhpi+ percent_multi+ percent_nonwhite_hispanic+food_dessert'
model2 = sm.ols(formula = formula2, data = obesity_df)
lin_reg2 = model2.fit()
print(lin_reg2.summary())

Looks like our second model would retain the following variables at p = .01:

- supercenters
- convenience stores
- fast food
- full service restaurants
- population
- race factors (black, native, asian, hispanic)
- food desert

Total = 11 variables, or 12 if we chose a p-value of .05. At .05, the only addition is summer food programs.

## Predicting obesity

#### based on model two

In [12]:
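# statsmodels.api provides add_constant; this rebinds sm, which earlier pointed to statsmodels.formula.api
import statsmodels.api as sm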

In [13]:
# These are the coefficients of the variables in the model
lin_reg2.params

Intercept                    32.563944
super_per1000                16.007640
convenience_per1000           1.566608
fast_food_per1000            -0.694412
full_service_per1000         -2.293502
farmers_markets_per1000      -0.863413
smr_food_prog_17             -0.058752
pop_estimate                 -0.001028
percent_black                 0.079123
percent_native_american       0.059266
percent_asian                -0.474931
percent_nhpi                 -0.473993
percent_multi                -0.221359
percent_nonwhite_hispanic    -0.048267
food_dessert                  0.845674
dtype: float64

In [14]:
column_names = ['super_per1000', 'convenience_per1000', 'fast_food_per1000', 'full_service_per1000', 'farmers_markets_per1000', 'smr_food_prog_17', 'pop_estimate', 'percent_black', 'percent_native_american', 'percent_asian', 'percent_nhpi', 'percent_multi', 'percent_nonwhite_hispanic', 'food_dessert']

In [15]:
new_vals = pd.DataFrame(columns = column_names)

In [16]:
new_vals

super_per1000,convenience_per1000,fast_food_per1000,full_service_per1000,farmers_markets_per1000,smr_food_prog_17,pop_estimate,percent_black,percent_native_american,percent_asian,percent_nhpi,percent_multi,percent_nonwhite_hispanic,food_dessert


In [46]:
# The county FIPS code and the six food-environment values are typed in by the user
fips = float(input())
super_input = float(input())
convenience_store = float(input())
fast_food = float(input())
restaurant = float(input())
farmers_market = float(input())
summer_prog = float(input())
# Population, racial composition, and food desert status are looked up for that county
population = obesity_df.loc[obesity_df['fips'] == fips, 'pop_estimate'].iloc[0]
black = obesity_df.loc[obesity_df['fips'] == fips, 'percent_black'].iloc[0]
native = obesity_df.loc[obesity_df['fips'] == fips, 'percent_native_american'].iloc[0]
asian = obesity_df.loc[obesity_df['fips'] == fips, 'percent_asian'].iloc[0]
nhpi = obesity_df.loc[obesity_df['fips'] == fips, 'percent_nhpi'].iloc[0]
multi = obesity_df.loc[obesity_df['fips'] == fips, 'percent_multi'].iloc[0]
hisp = obesity_df.loc[obesity_df['fips'] == fips, 'percent_nonwhite_hispanic'].iloc[0]
food_des = obesity_df.loc[obesity_df['fips'] == fips, 'food_dessert'].iloc[0]

56037
.06
.63
.50
.90
.1
5.55


In [48]:
new_vals = pd.DataFrame({
    'super_per1000': [super_input],
    'convenience_per1000': [convenience_store],
    'fast_food_per1000': [fast_food],
    'full_service_per1000': [restaurant],
    'farmers_markets_per1000': [farmers_market],
    'smr_food_prog_17': [summer_prog],
    'pop_estimate': [population],
    'percent_black': [black],
    'percent_native_american': [native],
    'percent_asian': [asian],
    'percent_nhpi': [nhpi],
    'percent_multi': [multi],
    'percent_nonwhite_hispanic': [hisp],
    'food_dessert': [food_des],
})

In [49]:
new_vals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   super_per1000              1 non-null      float64
 1   convenience_per1000        1 non-null      float64
 2   fast_food_per1000          1 non-null      float64
 3   full_service_per1000       1 non-null      float64
 4   farmers_markets_per1000    1 non-null      float64
 5   smr_food_prog_17           1 non-null      float64
 6   pop_estimate               1 non-null      float64
 7   percent_black              1 non-null      float64
 8   percent_native_american    1 non-null      float64
 9   percent_asian              1 non-null      float64
 10  percent_nhpi               1 non-null      float64
 11  percent_multi              1 non-null      float64
 12  percent_nonwhite_hispanic  1 non-null      float64
 13  food_dessert               1 non-null      int64  
dty

In [50]:
xnew = sm.add_constant(new_vals)

In [51]:
ynewpred =  lin_reg2.predict(xnew)

In [52]:
ynewpred

0    30.337338
dtype: float64

In [53]:
obesity_df.loc[obesity_df['fips'] == 56037, 'obesity_rate'].iloc[0]

29.2
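
To show what the prediction does under the hood, the sketch below rebuilds part of it by hand: the intercept plus coefficient times value for the six inputs typed in above, using the coefficients printed by lin_reg2.params. The county-specific values (population, race percentages, food desert flag) are not shown in this notebook, so their combined contribution is only inferred here as the gap to the reported prediction.

```python
# Coefficients copied from the lin_reg2.params output above
intercept = 32.563944
coef = {
    "super_per1000": 16.007640,
    "convenience_per1000": 1.566608,
    "fast_food_per1000": -0.694412,
    "full_service_per1000": -2.293502,
    "farmers_markets_per1000": -0.863413,
    "smr_food_prog_17": -0.058752,
}

# The six values entered for county 56037
inputs = {
    "super_per1000": 0.06,
    "convenience_per1000": 0.63,
    "fast_food_per1000": 0.50,
    "full_service_per1000": 0.90,
    "farmers_markets_per1000": 0.1,
    "smr_food_prog_17": 5.55,
}

# Linear prediction restricted to the intercept and the six entered terms
partial = intercept + sum(coef[name] * inputs[name] for name in coef)

predicted = 30.337338   # model output shown above
actual = 29.2           # observed obesity rate for the county

county_terms = predicted - partial   # inferred contribution of the looked-up variables
error = predicted - actual           # how far the prediction is from the observed rate
print(round(partial, 4), round(county_terms, 4), round(error, 4))
```

The intercept and the six entered terms sum to about 31.69, the looked-up county variables pull the prediction down by about 1.35, and the final prediction is about 1.14 percentage points above the observed rate.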