## Assignments

To close out this checkpoint, you're going to do three assignments. For the first assignment, you'll write up a short answer to a question. For the other two assignments, you'll do your work in Jupyter notebooks.


Please submit links to all your work below. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to [these example solutions](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/4.solution_understanding_the_relationship.ipynb).

### 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$ expenditure = 873 + 0.0012annual\_income + 0.00002annual\_income^2 - 223.57have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating the families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer.

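* *have_kids*: holding income constant, families with children spend $223.57 less on recreation per year, on average, than families without children. Because *have_kids* is a dummy variable, this is a difference between the two groups, not a per-child effect.
* *annual_income*: income enters the model both linearly and squared, so its effect depends on the income level. One extra dollar of income is associated with a change in spending of roughly 0.0012 + 2 × 0.00002 × *annual_income* dollars, and the positive coefficient on the squared term means the effect grows with income. I'm unsure whether these magnitudes are plausible.
* The constant, 873, is the predicted spending for a family with zero income and no children.
* To check that these interpretations make sense statistically, we would need the standard error, t-statistic, and p-value of each coefficient. A regression summary table would show them.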
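
The income interpretation can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the coefficients are copied from the estimated model above. It evaluates the prediction and the derivative with respect to income at a few income levels, and confirms that the dummy shifts the prediction by a fixed amount.

```python
# Coefficients copied from the estimated model above.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_HAVE_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_HAVE_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of predicted expenditure with respect to annual income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# The dummy shifts the prediction by the same amount at any income level.
kids_gap = predicted_expenditure(50_000, 1) - predicted_expenditure(50_000, 0)

for income in (20_000, 50_000, 100_000):
    print(income, round(marginal_effect_of_income(income), 4))
print(round(kids_gap, 2))
```

Because of the squared term, the marginal effect is different at every income level, which is why a single "per dollar" interpretation of 0.0012 would be misleading.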

### 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

In [None]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
temp_hist = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


temp_hist.head()

In [None]:
temp_data = pd.DataFrame()
temp_data['target'] = temp_hist['apparenttemperature'] - temp_hist['temperature']
temp_data['humidity'] = temp_hist['humidity']
temp_data['windspeed'] = temp_hist['windspeed']

In [None]:
# We create a LinearRegression model object
lrm = linear_model.LinearRegression()

# We then select data and target 
data = temp_data.iloc[:, 1:]
target = temp_data['target']

# fit method estimates the coefficients using OLS
lrm.fit(data, target)

# Next we take a look at the results
# We need to manually add a constant in statsmodels' sm
data = sm.add_constant(data)

results = sm.OLS(target, data).fit()

results.summary()

In the previous checkpoint we determined that the estimated coefficients are statistically significant, and the same is true in this case. It's important to point out that the target is *apparenttemperature* - *temperature*, and the standard errors in this model appear to be smaller than in the last one.

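Holding windspeed constant, a one-unit increase in humidity is associated with a 3.0292 decrease in the target. Holding humidity constant, a one-unit increase in windspeed is associated with a 0.1193 decrease in the target. The constant, 2.438, is the predicted target when both humidity and windspeed are zero.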

Alternatively, you could describe this relationship as:

apparenttemperature - temperature = 2.438 + (-3.0292 * humidity) + (-0.1193 * windspeed)

* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [None]:
# First select data and target; copy so the new column doesn't touch temp_data
data2 = temp_data.iloc[:, 1:].copy()
data2['humidity_windspeed'] = data2['humidity'] * data2['windspeed']
target = temp_data['target']

# We create a LinearRegression model object
lrm = linear_model.LinearRegression()

# fit method estimates the coefficients using OLS
# (sklearn adds its own intercept, so fit before adding the constant)
lrm.fit(data2, target)

# We need to manually add a constant in statsmodels' sm
data2 = sm.add_constant(data2)

results = sm.OLS(target, data2).fit()

results.summary()

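All of the coefficients remain statistically significant. The coefficients on *humidity* and *windspeed* have flipped from negative to positive, while the interaction term has a negative coefficient. With the interaction in the model, the effect of each variable depends on the level of the other: the effect of humidity, for example, is 0.1775 - 0.2971 * windspeed.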

You could describe this relationship as: 

apparenttemperature - temperature = 0.0839 + (0.1775 * humidity) + (0.905 * windspeed) + (-0.2971 * humidity * windspeed)
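
With an interaction term, the coefficient on *humidity* alone is only its effect when windspeed is zero. The sketch below is an illustration with hypothetical function and variable names, using coefficients copied from the equation above. It computes the humidity effect at several windspeeds and the windspeed at which that effect changes sign.

```python
# Coefficients copied from the interaction model above.
B_HUMIDITY = 0.1775
B_INTERACTION = -0.2971


def humidity_effect(windspeed):
    """Change in the target per one-unit increase in humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


# Windspeed at which the humidity effect switches from positive to negative.
sign_change_windspeed = -B_HUMIDITY / B_INTERACTION

for w in (0, 5, 10, 20):
    print(w, round(humidity_effect(w), 4))
print(round(sign_change_windspeed, 3))
```

So the positive sign on *humidity* does not contradict the first model: once windspeed rises above a small value, the combined effect of humidity is negative again.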

### 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.
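
The step of excluding insignificant features can be sketched as a small helper. This is only an illustration: the helper name, the 0.05 threshold, and the example p-values are all made up and are not results from the house prices model.

```python
def split_by_significance(pvalues, alpha=0.05):
    """Split feature names into significant and insignificant lists."""
    significant = [name for name, p in pvalues.items() if p < alpha]
    insignificant = [name for name, p in pvalues.items() if p >= alpha]
    return significant, insignificant


# Made-up p-values for illustration only; in the notebook they would come
# from something like `results.pvalues.drop('const').to_dict()`.
example_pvalues = {"feature_a": 0.001, "feature_b": 0.32, "feature_c": 0.04}

keep, drop = split_by_significance(example_pvalues)
print(keep)
print(drop)
```

After dropping features, the model should be re-estimated, since the remaining coefficients and their p-values can change.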

In [None]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
homes_df = pd.read_sql_query('select * from houseprices',con=engine)


# no need for an open connection, as we're only doing a single query
engine.dispose()


In [None]:
# Preparing data for modeling about house prices 

# Index objects holding the non-numeric and numeric column names
non_numeric_columns = homes_df.select_dtypes(['object']).columns
numeric_columns = homes_df.select_dtypes(['int64', 'float64']).columns

# Dropping columns with missing data
homes_df = homes_df.drop(['poolqc', 'miscfeature', 'alley', 
                          'fence', 'fireplacequ', 'lotfrontage'], axis=1)

# Dropping missing observations
homes_df = homes_df.dropna(axis=0)

numeric_columns = numeric_columns.drop(['id'])

# Numeric columns that are still present after the drops above
FILL_LIST = [col for col in homes_df.columns if col in numeric_columns]

In [None]:
from scipy.stats.mstats import winsorize

homes_win = homes_df.copy()

for col in FILL_LIST:
    homes_win[col] = winsorize(homes_win[col], (.05, .14))

In [None]:
from sklearn import preprocessing

def cat_converter(df):
    for cols in df:
        if cols in non_numeric_columns:
            
            # Create a label (category) encoder object
            le = preprocessing.LabelEncoder()
            
            # Fit the encoder to the pandas column
            le.fit(df[cols])
            
            # Apply the fitted encoder to the pandas column
            df[cols] = le.transform(df[cols]) 
    return df

cat_converter(homes_win)

In [None]:
# selecting data and target
homes_mod1 = homes_win[['lotarea', 'bsmtfinsf1', 'secondflrsf', 'grlivarea', 'saleprice']]


In [None]:
# We create a LinearRegression model object
lrm = linear_model.LinearRegression()

data = homes_mod1.iloc[:, :-1]
target = homes_mod1['saleprice']

# fit method estimates the coefficients using OLS
lrm.fit(data, target)

# We need to manually add a constant
# in statsmodels' sm
data = sm.add_constant(data)

results = sm.OLS(target, data).fit()

results.summary()

All of the features used are statistically significant.

The relationships between features, coefficients, and the target can be described as such:

saleprice = -5339.76 + (lotarea * 1.1096) + (bsmtfinsf1 * 21.9535) + (secondflrsf * -34.5259) + (grlivarea * 117.3872)

**Note:** we must make sure the OLS assumptions hold, as multicollinearity is a problem for OLS.

The p-value of the constant seems a bit high compared to our first evaluation; removing a variable due to multicollinearity may adjust this. It makes sense that all the variables seem to be significant, given the chi-squared test I used to pick them.
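
One way to check the multicollinearity concern is the variance inflation factor (VIF). The sketch below is a minimal illustration on synthetic stand-in data (an assumption, not the house prices table), with a hypothetical helper name. It uses the identity that the VIFs are the diagonal of the inverse of the features' correlation matrix.

```python
import numpy as np


def vif_from_features(X):
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic stand-in data (an assumption): the third column is nearly the
# sum of the first two, mimicking overlapping square-footage features.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.1, size=300)
X = np.column_stack([a, b, c])

for name, vif in zip(["a", "b", "c"], vif_from_features(X)):
    print(name, round(float(vif), 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. Applied to the model above, this check might show whether features such as *secondflrsf* and *grlivarea* overlap enough to explain the negative sign on *secondflrsf*.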