# Introduction
This notebook analyzes the impact of smoke on the budget of healthcare services, specifically ambulatory health care services, in Whitman County, WA. It takes the smoke impact calculated in 'smoke_estimate.ipynb', together with yearly population, and uses a multiple linear regression to predict the product of wages and employees for each year.
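
The model described above can be written compactly as follows. This is only a sketch of the setup: the symbols are notation introduced here for illustration (they do not appear in the notebook), with $W_t$, $E_t$, $S_t$, and $P_t$ standing for the `Wages`, `Employees`, `smoke_impact`, and `Population` values in year $t$, and $B_t$ for the wages-times-employees target.

```latex
\begin{align}
B_t &= W_t \cdot E_t \\
B_t &= \beta_0 + \beta_1 S_t + \beta_2 P_t + \varepsilon_t
\end{align}
```

Here $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the coefficients estimated by ordinary least squares, and $\varepsilon_t$ is the error term.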

### License
This code example was developed by Chandler Ault for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). 



In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
# Data loading and preprocessing.
# Wages by year.
df_wages = pd.read_csv('../data/wages_timeseries.csv')

# Employee counts: average the observations within each year to get one value per year.
df_employees = pd.read_csv('../data/employee_timeseries.csv')
df_employees = df_employees.groupby('Year')['Employees'].mean().reset_index()

# Smoke impact estimates: align the year column name with the other tables.
df_smoke = pd.read_csv('../data/smoke_impact_timeseries.csv')
df_smoke = df_smoke.rename(columns={'Fire_Year': 'Year'})

# Population: derive the year from the observation date and use a readable column name.
df_pop = pd.read_csv('../data/population_timeseries.csv')
df_pop['Year'] = pd.to_datetime(df_pop['DATE']).dt.year
df_pop = df_pop.rename(columns={'WAWHIT5POP': 'Population'})



In [4]:
# Join the data. Outer joins keep every year present in any table, so years missing from a source show up as NaN.
df_final = df_smoke[['Year', 'smoke_impact', 'GIS_Acres']].merge(df_employees[['Year', 'Employees']], on='Year', how='outer')
df_final = df_final.merge(df_wages[['Year', 'Wages']], on='Year', how='outer')
df_final = df_final.merge(df_pop[['Year', 'Population']], on='Year', how='outer')

df_final.tail()

,Year,smoke_impact,GIS_Acres,Employees,Wages,Population
55,2019,1.313332,2041146.0,16.266667,55312.0,50.136
56,2020,8.033177,9381456.0,16.166667,58827.0,47.804
57,2021,,,16.7,61150.0,43.238
58,2022,,,16.7,64396.0,47.619
59,2023,,,17.416667,,
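
The empty fields in the rows above come from the outer joins: a year that is missing from one source is kept, with NaN in that source's columns. The following is a minimal sketch of that behavior on toy data. The values are hypothetical, and only the column names mirror the notebook; `indicator=True` is an optional `merge` argument used here to show which side each row came from.

```python
import pandas as pd

# Toy frames with hypothetical values; only the column names mirror the notebook.
smoke = pd.DataFrame({"Year": [2019, 2020], "smoke_impact": [1.3, 8.0]})
wages = pd.DataFrame({"Year": [2020, 2021], "Wages": [58827.0, 61150.0]})

# An outer join keeps every year found in either frame and fills gaps with NaN.
# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = smoke.merge(wages, on="Year", how="outer", indicator=True)

# Count how many years are covered by one source only versus both.
coverage = merged["_merge"].value_counts()

# Only rows with no missing values can be used to fit a model.
complete = merged.drop(columns="_merge").dropna()

print(merged)
print(coverage)
print(len(complete), "complete row(s)")
```

Checking coverage this way before modeling shows how many years will be dropped once rows with missing values are removed.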


In [8]:
# Create multiple linear regression.
# Keep only the years where every model variable is observed.
target_data = df_final[['smoke_impact', 'Population', 'Employees', 'Wages']].dropna()

# Predictors: smoke impact and population. Target: wages * employees.
X = target_data[['smoke_impact', 'Population']]
y = target_data['Wages'] * target_data['Employees']

# sm.OLS does not include an intercept by default, so add a constant column.
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.934
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                     120.7
Date:                Thu, 07 Dec 2023   Prob (F-statistic):           8.98e-11
Time:                        22:34:18   Log-Likelihood:                -246.77
No. Observations:                  20   AIC:                             499.5
Df Residuals:                      17   BIC:                             502.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -2.081e+06    1.8e+05    -11.555   