# Introduction
This notebook analyzes the impact of smoke on the budget of health care services in Whitman County, WA, specifically ambulatory health care services. It takes the smoke impact calculated in 'smoke_estimate.ipynb' together with yearly population and fits a multiple linear regression to predict wages*employees for each year, adjusted for inflation to 2023 dollars.
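
Written out, the regression fit later in this notebook has roughly the following form. The variable names match the columns used in the code; the coefficients $\beta_0, \beta_1, \beta_2$, the error term $\varepsilon_t$, and the year index $t$ are notation introduced here for illustration.

$$
\text{adjusted\_wage}_t = \beta_0 + \beta_1\,\text{smoke\_impact}_t + \beta_2\,\text{Population}_t + \varepsilon_t
$$

Under this reading, $\beta_1$ is the estimated change in inflation-adjusted wages associated with a one-unit change in smoke impact, holding population fixed.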

### License
This code example was developed by Chandler Ault for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). 



In [53]:
import pandas as pd
import statsmodels.api as sm

In [56]:
# Data loading and preprocessing.
df_wages = pd.read_csv('../data/wages_timeseries.csv').reset_index(drop=True)

# df_employees = df_employees.groupby('Year')['Employees'].mean().reset_index()

df_smoke = pd.read_csv('../data/smoke_impact_timeseries.csv').reset_index(drop=True)
df_smoke.rename(columns={'Fire_Year': 'Year'}, inplace=True)

df_cpi = pd.read_csv('../data/cpi.csv')
df_cpi = df_cpi.groupby('Year')['cpi'].mean().reset_index()

df_pop = pd.read_csv('../data/population_timeseries.csv').reset_index(drop=True)
df_pop['Year'] = pd.to_datetime(df_pop['DATE']).dt.year
df_pop.rename(columns={'WAWHIT5POP': 'Population'}, inplace=True)
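
The two preprocessing steps above (averaging CPI within each year and deriving `Year` from a date column) can be illustrated on a tiny in-memory example. This is only a sketch: the rows below are made up, the `Period` column is hypothetical, and it assumes the real `cpi.csv` holds several readings per year, which the `groupby(...).mean()` call suggests but the notebook does not show.

```python
import io
import pandas as pd

# Hypothetical CPI readings, two per year (the real file's layout is not shown).
cpi_csv = io.StringIO(
    "Year,Period,cpi\n"
    "2019,M01,250.0\n"
    "2019,M02,252.0\n"
    "2020,M01,256.0\n"
    "2020,M02,260.0\n"
)
df_cpi_demo = pd.read_csv(cpi_csv)

# Collapse to one row per year: the mean of that year's readings.
yearly_cpi = df_cpi_demo.groupby('Year')['cpi'].mean().reset_index()

# Hypothetical dated population rows, mirroring the DATE -> Year step.
df_pop_demo = pd.DataFrame({
    'DATE': ['2019-01-01', '2020-01-01'],
    'WAWHIT5POP': [50.1, 47.8],
})
df_pop_demo['Year'] = pd.to_datetime(df_pop_demo['DATE']).dt.year

print(yearly_cpi)
print(df_pop_demo[['Year', 'WAWHIT5POP']])
```

After these steps every table carries an integer `Year` column with one row per year, which is what the joins in the next cell rely on.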




In [60]:
# Join the data.
# Outer joins keep every year that appears in any table; values missing from a table become NaN.
df_final = df_smoke[['Year', 'smoke_impact', 'GIS_Acres']].merge(df_cpi[['Year', 'cpi']], on='Year', how='outer')
df_final = df_final.merge(df_wages[['Year', 'Wages']], on='Year', how='outer')
df_final = df_final.merge(df_pop[['Year', 'Population']], on='Year', how='outer')
base_year = 2023

# Adjust wages for inflation: express each year's wages in base-year dollars.
base_cpi = df_final.loc[df_final['Year'] == base_year, 'cpi'].iloc[0]
df_final['adjusted_wage'] = (df_final['Wages'] / df_final['cpi']) * base_cpi

df_final.tail()

    Year  smoke_impact  GIS_Acres         cpi    Wages  Population  adjusted_wage
55  2019      1.313332  2041146.0  255.991333  29214.0      50.136   34861.166086
56  2020      8.033177  9381456.0  258.476333  29894.0      47.804   35329.654738
57  2021           NaN        NaN  272.364500  33413.0      43.238   37474.954749
58  2022           NaN        NaN  295.078333  35219.0      47.619   36459.931311
59  2023           NaN        NaN  305.475333      NaN         NaN            NaN
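
As a check on the inflation adjustment, the 2019 row can be reproduced by hand. The sketch below uses plain Python and the values printed in the output above; the helper name `adjust_to_base_year` is introduced here for illustration and is not part of the notebook.

```python
def adjust_to_base_year(nominal, cpi_year, cpi_base):
    """Convert a nominal amount to base-year dollars using the CPI ratio."""
    return nominal / cpi_year * cpi_base


# Values copied from the 2019 and 2023 rows of the output above.
wages_2019 = 29214.0
cpi_2019 = 255.991333
cpi_2023 = 305.475333  # base year

adjusted_2019 = adjust_to_base_year(wages_2019, cpi_2019, cpi_2023)
print(adjusted_2019)
```

The result agrees with the `adjusted_wage` of 34861.166086 shown for 2019, up to the rounding of the printed CPI values. The 2023 row has a CPI but no wage figure, so its adjusted wage stays missing.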


In [62]:
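# Fit a multiple linear regression: adjusted wages on smoke impact and population.
# Rows with missing values (e.g. years without a smoke estimate) are dropped first.
target_data = df_final[['smoke_impact', 'Wages', 'Population', 'adjusted_wage']].dropna()

X = target_data[['smoke_impact', 'Population']]
y = target_data['adjusted_wage']

# statsmodels does not add an intercept automatically, so add a constant column.
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())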

                            OLS Regression Results                            
Dep. Variable:          adjusted_wage   R-squared:                       0.815
Model:                            OLS   Adj. R-squared:                  0.793
Method:                 Least Squares   F-statistic:                     37.40
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           5.95e-07
Time:                        19:32:00   Log-Likelihood:                -187.01
No. Observations:                  20   AIC:                             380.0
Df Residuals:                      17   BIC:                             383.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -4.939e+04   9073.959     -5.443   