# HiMCM Session 6
# Linear Regression Model


A **linear relationship** between two variables is one whose scatterplot looks roughly like a straight line.  **Linear regression** is a method for modelling how a **dependent variable** depends linearly on one or more **independent variables**.  The dependent variable (also called a **response variable**, among many other names) is what we are trying to model or predict, and is usually denoted $Y$.  The independent variables (also called **explanatory** or **input variables**) are the information we use to make the prediction, and are usually denoted $X_1, X_2, ...$.

The linear relationship is: $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \epsilon$$
where $\epsilon$ represents the error.

Here, $Y, X_1, X_2, ..., X_n$ are *random variables* that can take different values with different probabilities.

Linear regression finds the coefficients $\beta_0, \beta_1, ..., \beta_n$ that minimize the sum of the squared errors over all observations in the data.
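To make the least-squares idea concrete, here is a minimal sketch for the case of a single independent variable. The data points and the helper names `fit_line` and `sum_squared_errors` are hypothetical and chosen only for illustration; they are not part of the Boston dataset or of any library.

```python
# Hypothetical toy data (not from the Boston dataset).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]


def sum_squared_errors(b0, b1, xs, ys):
    """Sum of squared differences between each observed y and b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares estimates for one independent variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx            # slope
    b0 = y_mean - b1 * x_mean   # intercept
    return b0, b1


b0, b1 = fit_line(xs, ys)
print("intercept:", round(b0, 4), "slope:", round(b1, 4))

# Compare the fitted line with a guessed line y = 0 + 2x.
fitted_sse = sum_squared_errors(b0, b1, xs, ys)
guessed_sse = sum_squared_errors(0.0, 2.0, xs, ys)
print("fitted line SSE: ", fitted_sse)
print("guessed line SSE:", guessed_sse)
```

The guessed line looks reasonable by eye, but the fitted line has the smaller sum of squared errors, and that is the criterion least squares uses to pick the coefficients.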


As an example, we will look at a dataset about Boston housing prices in the 1970s.

In [None]:
# Load the dataset from the scikit-learn package.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston_dict = load_boston()

In [None]:
# Display the variable boston_dict:
boston_dict

In [None]:
# Display the list of keys
boston_dict.keys()

In [None]:
# Display the value linked to "feature_names"
boston_dict["feature_names"]

In [None]:
# The "DESCR" entry contains a description of the dataset
print(boston_dict["DESCR"])

To better analyze the data, let's convert it into a **Pandas data frame**.

In [None]:
import pandas as pd

boston = pd.DataFrame(data=boston_dict.data)

In [None]:
# Display the data frame
boston

In [None]:
# There are no column names, so let's repeat that command, but telling Pandas 
# that the column names are in feature_names
boston = pd.DataFrame(data=boston_dict.data,
                      columns=boston_dict.feature_names)

boston

In [None]:
# So far we have created a data frame with all independent variables.
# Let's also add the price data to the data frame.
boston["price"] = boston_dict.target 

# Display the first several rows to verify that the new column was added correctly.
boston.head()

In [None]:
# Let's look at the relationship between the price and the average number of rooms (RM)
boston.plot.scatter(x="RM", y="price")

There seems to be a linear relationship between RM and price. How can we find a line that best describes this relation?
- What is the mathematical expression of a straight line?
- Given two different lines, how do we tell which one fits the data better?
- How do we find the line that best fits the data?

Let's perform linear regression using the `statsmodels` library.

In [None]:
import statsmodels.formula.api as smf

# Fit price as a linear function of RM using ordinary least squares (OLS)
lm = smf.ols('price ~ RM', data=boston).fit()
lm.summary()

## Interpreting the Linear Regression Summary
- Dep. Variable
- Method
- Date and Time
- No. Observations
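- Df Residuals: The residual degrees of freedom (the number of observations minus the number of estimated coefficients, including the intercept)
- Df Model: The number of independent variables in the model
- **R-squared**: The fraction of the variation in the dependent variable that is explained by the independent variables
- **Adjusted R-squared**: R-squared adjusted for the number of independent variables; a better measure when comparing models with multiple independent variables
- **Prob(F statistic)**: How likely is it that a trend this strong would occur by pure chance if the independent variables had no effect?
- **coef**: Estimates of the model coefficients
- **P > |t|**: How likely is it to see an estimate this far from zero if the true value of the coefficient were actually zero? This value is called the **p-value**.
- [0.025, 0.975]: The 95% confidence interval of the coefficient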
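To see where some of these numbers come from, the sketch below computes R-squared, adjusted R-squared, and the residual degrees of freedom by hand for a one-variable model. It uses hypothetical toy data rather than the Boston dataset, and the variable names are illustrative only; the formulas are the standard textbook definitions.

```python
# Hypothetical toy data (not from the Boston dataset).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(xs)  # number of observations
p = 1        # number of independent variables (Df Model)

# Least-squares slope and intercept for one independent variable.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
predictions = [intercept + slope * x for x in xs]

# Variation left unexplained by the line, and total variation around the mean.
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predictions))
ss_tot = sum((y - y_mean) ** 2 for y in ys)

r_squared = 1 - ss_res / ss_tot
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
df_resid = n - p - 1  # observations minus estimated coefficients (Df Residuals)

print("R-squared:", round(r_squared, 4))
print("Adjusted R-squared:", round(adj_r_squared, 4))
print("Df Residuals:", df_resid)
```

Adjusted R-squared is never larger than R-squared, because it penalizes each additional independent variable.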


In [None]:
# Build a linear model on price and crime rate (CRIM)
# Is this a stronger relationship?


In [None]:
# Visualize the regression line with data
import seaborn as sns

sns.regplot(y="price", x="RM", data=boston, fit_reg=True)

## Is Linear Regression a Good Choice?

The **residuals** are the differences between the actual value of price and the value predicted by the regression line, one for each row of the data. If the linear model is a good fit, the histogram of the residuals should look roughly like a **normal distribution** centered at zero.
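As a complement to the histogram check, the following sketch shows what residuals can look like when a straight line is the wrong model. The data are hypothetical and deliberately curved ($y = x^2$), and all names are illustrative.

```python
# Hypothetical curved data: y = x squared, so a straight line is the wrong model.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]
n = len(xs)

# Least-squares slope and intercept for one independent variable.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# Residual = actual value minus the value predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)       # [2.0, -1.0, -2.0, -1.0, 2.0]
print(sum(residuals))  # 0.0
```

The residuals still sum to zero, as they always do for a least-squares line with an intercept, but they follow a clear pattern: positive at both ends and negative in the middle. A systematic pattern like this suggests that the line is missing curvature in the data, even if a summary number such as R-squared looks acceptable.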


In [None]:
# Display the residuals
lm.resid

In [None]:
lm.resid.hist(bins = 20)

In [None]:
lm

Another good way to check the model's performance is to plot the predicted prices against the true prices.

In [None]:
# Extract model predictions on the data
lm.fittedvalues

In [None]:
# Draw a scatter plot with the true price as the x coordinate and the 
# predicted price as the y coordinate.


**Discussion**
- If all the prices were predicted correctly, what would this plot have looked like?
- Where does the linear model fail badly?

# Homework \#3

1. Use a linear model to describe how the pupil-teacher ratio (PTRATIO) affects housing prices.

    - Perform linear regression and find the expression of the regression line.
    - Plot the line together with the data
    - Assess whether the linear model properly describes the relationship:
        - Is the R-squared value close to 1?
        - Are the p-values of coefficients close to zero?
        - Is the histogram of the residuals close to a normal distribution?
        - Is the true vs. prediction plot close to a diagonal line?
    
2. **Smart Air-Conditioning by Programmable Thermostats**

No one wants to spend more money than necessary on heating and air conditioning. But, everyone wants to be comfortable and cozy while at home. The development of programmable thermostats was an initial effort to assist in reducing energy costs. With a programmable thermostat, you can manually preset a schedule of temperature increases and decreases for weekdays and weekends, and day and night periods, to keep your home cool or warm when needed, but save energy when not needed. 

Suppose that you are designing a next-generation smart air-conditioning system for your house. Describe how this system can manage the home temperature automatically so that you and your family can live comfortably while the energy cost is reduced.

- Search for information on what existing smart air-conditioning systems can achieve.
- Describe what information your system needs to know.
- Describe how your system determines the home temperature given the information it collects.
- Spend **no more than 2 hours** on this assignment.
- Write an essay to report your design. Submit it to Liang.Zhao1@lehman.cuny.edu before **Thursday, June 24th at 11:59 PM**.

    