# 1. Q. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$\text{expenditure} = 873 + 0.0012\,\text{annual\_income} + 0.00002\,\text{annual\_income}^2 - 223.57\,\text{have\_kids}$$

Here, expenditure is the annual spending on recreation in US dollars, annual_income is the annual income in US dollars, and have_kids is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given to make sure that your interpretations make sense statistically? Write up your answer.

# 1. A:

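The constant, 873, is the intercept: the predicted annual recreation spending, in dollars, of a family with no children and zero income. It is an extrapolation rather than a typical case.

Income enters the model twice, once linearly and once squared, so the two income coefficients must be read together. The effect of one additional dollar of income on expenditure is 0.0012 + 2 × 0.00002 × annual_income, which depends on the income level. The positive coefficient on the squared term means that spending rises with income at an increasing rate. Although 0.00002 looks close to zero, it multiplies the square of income, so at realistic incomes the quadratic term dominates the linear one. At an income of $50,000, for example, the implied effect is 0.0012 + 2 × 0.00002 × 50,000 ≈ 2.00 dollars of spending per extra dollar of income.

Holding income constant, families with kids spend on average $223.57 less per year on recreation than families without kids.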
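
Because income enters the model both linearly and squared, its effect on spending depends on the income level. The following is a minimal sketch of that arithmetic using the estimated coefficients from the model. The function names `predicted_expenditure` and `marginal_effect_of_income` and the example income are illustrative choices, not part of the assignment.

```python
# Coefficients copied from the estimated model.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_HAVE_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_HAVE_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of predicted spending with respect to income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


income = 50_000  # hypothetical income, chosen only for illustration
gap = predicted_expenditure(income, 1) - predicted_expenditure(income, 0)
print(f"Marginal effect at ${income:,}: {marginal_effect_of_income(income):.4f}")
print(f"Kids vs. no kids at the same income: {gap:.2f}")
```

The gap between families with and without children equals the have_kids coefficient at every income level, because the dummy variable only shifts the intercept.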

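The size of a coefficient says nothing about its statistical significance. To judge whether these interpretations hold up statistically, we need each coefficient's standard error, t-statistic, and p-value (or confidence interval), plus a joint F-test of the two income terms, since they act together. The number of observations, R-squared, and the overall F-statistic would show how well the model fits as a whole. Only with these can we decide whether any variable should be dropped.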
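
For reference, the usual significance test for a single coefficient compares the estimate with its standard error. Under the null hypothesis that the true coefficient is zero:

```latex
H_0:\ \beta_j = 0, \qquad t_j = \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)}
```

A large absolute value of the t-statistic, or equivalently a small p-value, is evidence against the null hypothesis. Since annual_income and its square act together, a joint F-test that both income coefficients are zero is the natural check for the income effect as a whole.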

# 2. Q. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Q. First, load the dataset from the weatherinszeged table from Thinkful's database.


- Q. Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?


- Q. Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.



# 2. A:

### 2 - I - A:

In [12]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import acf
from sqlalchemy import create_engine
from scipy.stats import bartlett, jarque_bera, levene, normaltest



# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

# Connection details for the course database.
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

# Load the full table into a DataFrame.
df = pd.read_sql_query('select * from weatherinszeged', con=engine)

# No further queries are needed, so release the connection.
engine.dispose()

df.head(3)

### 2 - II - Q. Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?

### 2 - II - A. 

In [19]:
# Keep only the relevant variables; copy so that adding a column
# does not write to a view of df.
linRegDf = df[['temperature','apparenttemperature','humidity','windspeed']].copy()
# Target: apparent temperature minus measured temperature.
linRegDf['tempDiff'] = linRegDf['apparenttemperature'] - linRegDf['temperature']
linRegDf = linRegDf[['tempDiff','humidity','windspeed']]

# Target and explanatory variables.
X = linRegDf[['humidity','windspeed']]
Y = linRegDf.tempDiff

# Add a constant (intercept) column to the features.
X = sm.add_constant(X)
# Estimate the coefficients by OLS and show the summary.
results = sm.OLS(Y, X).fit()
results.summary()



| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | tempDiff | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Mon, 09 Sep 2019 | Prob (F-statistic): | 0.0 |
| Time: | 21:44:31 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| windspeed | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 3935.747 | Durbin-Watson: | 0.267 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


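All three coefficients are statistically significant: each p-value is reported as 0.000 and none of the 95% confidence intervals contains zero. The constant, 2.44, is the predicted difference between apparent and measured temperature, in degrees, when humidity and windspeed are both zero.

Humidity has a sizeable effect on the temperature difference: holding windspeed fixed, a one-unit increase in humidity lowers the difference by about 3.03 degrees. Assuming humidity is recorded as a fraction between 0 and 1, as the scale of the coefficient suggests, a one-percentage-point increase in humidity lowers the difference by about 0.03 degrees. Holding humidity fixed, each additional unit of windspeed lowers the difference by about 0.12 degrees.

The negative sign on windspeed matches the expectation that wind makes it feel colder. The negative sign on humidity is less expected, since humid air is usually thought to feel warmer. Because the constant is positive, higher humidity pulls the predicted difference back toward zero at low windspeeds, which suggests that apparent temperature tracks the measured temperature more closely in humid conditions. The model explains about 29% of the variation in the temperature difference (R-squared of 0.288).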
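
The significance columns in the summary can be reproduced from each coefficient and its standard error. The sketch below recomputes the t-statistic and an approximate 95% confidence interval from the rounded values reported above. It assumes the normal critical value 1.96, which is reasonable with 96,450 residual degrees of freedom, and the helper name `t_and_ci` is hypothetical. Because the reported standard errors are rounded to three decimals, the results match the summary only approximately.

```python
Z_95 = 1.96  # normal approximation to the t critical value (large sample)

# (coefficient, standard error) as reported in the OLS summary.
reported = {
    "const": (2.4381, 0.021),
    "humidity": (-3.0292, 0.024),
    "windspeed": (-0.1193, 0.001),
}


def t_and_ci(coef, std_err, z=Z_95):
    """Return the t-statistic and an approximate 95% confidence interval."""
    t_stat = coef / std_err
    return t_stat, (coef - z * std_err, coef + z * std_err)


for name, (coef, se) in reported.items():
    t_stat, (low, high) = t_and_ci(coef, se)
    significant = abs(t_stat) > Z_95
    print(f"{name:>9}: t = {t_stat:8.1f}, CI = [{low:.3f}, {high:.3f}], "
          f"significant at 5%: {significant}")
```

A coefficient is significant at the 5% level when its interval excludes zero, which is the case for all three estimates here.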


### 2 - III - Q. Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.

### 2 - III - A.

In the model above, windspeed has a smaller per-unit coefficient than humidity, although the two variables are measured on different scales. To investigate further, let's re-estimate the model with an interaction term, the product of humidity and windspeed, added:

In [20]:
# Target and explanatory variables; copy so that the new column
# is not written to a view of linRegDf.
X = linRegDf[['humidity','windspeed']].copy()
Y = linRegDf.tempDiff
# Interaction term: humidity times windspeed.
X['humidWind'] = X.humidity * X.windspeed

# Add a constant (intercept) column to the features.
X = sm.add_constant(X)
# Estimate the coefficients by OLS and show the summary.
results = sm.OLS(Y, X).fit()
results.summary()



| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | tempDiff | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Mon, 09 Sep 2019 | Prob (F-statistic): | 0.0 |
| Time: | 22:01:07 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| windspeed | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| humidWind | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 4849.937 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


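All four coefficients are statistically significant at the 5% level; the constant has the largest p-value, 0.012. Adding the interaction changes the other coefficients substantially, and the signs of both humidity and windspeed flip from negative to positive. The constant is now 0.0839 degrees.

The humidity coefficient, 0.1775, is no longer the overall effect of humidity. It is the effect of a one-unit increase in humidity when windspeed is zero. In general the effect of humidity is 0.1775 - 0.2971 × windspeed, which turns negative once windspeed exceeds about 0.6 and grows more negative as the wind picks up. The earlier idea that an increase in humidity by itself brings apparent temperature toward the measured temperature lacked consideration of this joint effect of wind and humidity.

The windspeed coefficient changed as well, from -0.1193 to 0.0905. It now measures the effect of one more unit of windspeed when humidity is zero. In general the effect is 0.0905 - 0.2971 × humidity, which is negative whenever humidity is above roughly 0.30. The negative interaction coefficient, -0.2971, means that wind and humidity reinforce each other: the windier it is, the more humidity lowers the temperature difference, and vice versa. R-squared rises from 0.288 to 0.341, so the interaction term adds explanatory power.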
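
With an interaction term in the model, the effect of each variable depends on the level of the other. The sketch below evaluates those conditional effects from the coefficients in the second summary. The function names and the sample humidity and windspeed values are illustrative assumptions rather than values taken from the data, and humidity is assumed to be a fraction between 0 and 1.

```python
# Coefficients from the model with the humidity x windspeed interaction.
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def humidity_effect(windspeed):
    """Change in tempDiff per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in tempDiff per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Levels of the other variable at which each effect switches sign.
windspeed_break_even = -B_HUMIDITY / B_INTERACTION
humidity_break_even = -B_WINDSPEED / B_INTERACTION

for wind in (0, 5, 10, 20):  # hypothetical windspeeds
    print(f"windspeed={wind:>2}: humidity effect = {humidity_effect(wind):.3f}")
for hum in (0.0, 0.3, 0.7, 1.0):  # hypothetical humidity fractions
    print(f"humidity={hum:.1f}: windspeed effect = {windspeed_effect(hum):.3f}")
```

At a humidity of 0.7 the windspeed effect is about -0.117, close to the -0.119 estimated by the model without the interaction.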