
# Tech Frontiers - Regression Project

---

In [1]:
from scipy import stats
from math import isnan
import numpy as np 
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


---

### Multiple Linear Regression

There are two data sets on which to practice creating a multiple linear regression model. 

---

`Fish.csv`
This data file comes from kaggle.com: https://www.kaggle.com/aungpyaeap/fish-market

As stated on the linked page: "This dataset is a record of 7 common different fish species in fish market sales. With this dataset, a predictive model can be performed using machine friendly data and estimate the weight of fish can be predicted."

**Response**:
- Weight (in grams)

**Features**:
- Length1 (vertical length in cm)
- Length2 (diagonal length in cm)
- Length3 (cross length in cm)
- Height (in cm)
- Width (diagonal width in cm)

The species name of the fish is also given. 

---

`housing_pricing.csv`
This data file was generated by Rachel Cox (so it is fake!). I did my best to generate realistic values for typical homes. 

**Response**: 
- sale price (in hundreds of thousands of dollars) - e.g. sale price = 2.479 means the selling price was $247,900

**Features**: 
- clouds - represents the proportion of the sky covered with clouds on a typical day. 1 indicates total cloud coverage, 0 would indicate no clouds
- distance to metro - distance in miles to the nearest big city
- num bathrooms 
- square footage
- lot acreage
- age of house - in years
- num bedrooms
- precipitation - monthly average precipitation in inches
- walkability - A number between 0 and 100 that indicates how pleasant it is to walk nearby (https://www.redfin.com/how-walk-score-works)
- temperature - Average yearly temperature

---

**Part A**: Read the data from the csv of your choosing into a Pandas DataFrame.  If you are reading in `Fish.csv`, I would recommend dropping the species column as it is non-numerical.

Also, make sure to re-order the columns so that the response variable is the last column.
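As a rough sketch of Part A, the snippet below reads a CSV, drops the non-numerical column, and moves the response to the last position. It uses a few made-up rows held in memory rather than the real file, and the column names (including `Species`) are assumptions based on the feature list above, so check them against your actual file.

```python
import io
import pandas as pd

# Made-up rows standing in for Fish.csv (illustrative values, not real data).
csv_text = """Species,Weight,Length1,Length2,Length3,Height,Width
A,100.0,10.0,11.0,12.0,4.0,2.0
B,200.0,20.0,21.0,22.0,6.0,3.0
C,300.0,30.0,31.0,32.0,8.0,4.0
"""

# With the real file this would be something like pd.read_csv("Fish.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Drop the non-numerical species column (column name is an assumption).
df = df.drop(columns=["Species"])

# Move the response column to the end.
response = "Weight"
df = df[[c for c in df.columns if c != response] + [response]]

print(df.columns.tolist())
```

The same list-comprehension reordering should work for the housing data once `response` is set to the name of the sale price column.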

**Part B:** Make separate scatter plots for each feature versus the response. From these plots, we will try to make inferences about which features appear to have a relationship with the response variable.

With the `house_pricing.csv` data, it might be useful to rescale the "sale price" values by multiplying by 100,000.
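One possible way to lay out the scatter plots for Part B is a row of panels, one per feature. The sketch below runs on synthetic data with hypothetical column names (`x1`, `x2`, `y`) and assumes the response is the last column, as arranged in Part A.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data: two features, response in the last column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 50), "x2": rng.uniform(0, 5, 50)})
df["y"] = 3.0 * df["x1"] + rng.normal(0, 1, 50)

response = df.columns[-1]
features = list(df.columns[:-1])

# One panel per feature, each plotted against the response.
fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4), squeeze=False)
for ax, feat in zip(axes[0], features):
    ax.scatter(df[feat], df[response], s=10)
    ax.set_xlabel(feat)
    ax.set_ylabel(response)
fig.tight_layout()
```

In this synthetic example only `x1` was used to generate `y`, so its panel should show a clear linear trend while the `x2` panel should look like a cloud.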

**Part C:** Use stats.linregress to fit simple linear regression models to the data.

Further documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
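To illustrate the call for Part C, here is a small sketch of `stats.linregress` on synthetic data (the true slope of 0.5 and intercept of 2.0 are made up for the example). For the assignment you would repeat the call once per feature, passing the feature column as `x` and the response as `y`.

```python
import numpy as np
from scipy import stats

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, 100)

# Fit a simple linear regression of y on x.
result = stats.linregress(x, y)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")
print(f"r^2={result.rvalue**2:.3f}, p-value={result.pvalue:.3g}")
```

The returned object also exposes `stderr`, the standard error of the estimated slope.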

**Part D:** Now, let's fit a multiple linear regression model! We will use statsmodels for this task. Execute the following cell to import the required package, then use `sm.OLS(...).fit()` to fit the model and `model.params` to print the regression coefficients to the screen.

Further documentation: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

In [95]:
import statsmodels.api as sm

**Part E**: 
Based on your MLR Model in **Part D**, use the full model to predict the fish weight when the following features are observed: 

- $\texttt{Length1}$: 26 cm
- $\texttt{Length2}$: 28 cm
- $\texttt{Length3}$: 31 cm
- $\texttt{Height}$: 9 cm
- $\texttt{Width}$: 4 cm

Based on your MLR Model for the house pricing, use the full model to predict the selling price when the following features are observed: 

- $\texttt{clouds}$: 0.75
- $\texttt{distance to metro}$: 5
- $\texttt{num bathrooms}$: 2
- $\texttt{square footage}$: 2425
- $\texttt{lot acreage}$: 1
- $\texttt{age of house}$: 55
- $\texttt{num bedrooms}$: 3
- $\texttt{precipitation}$: 1
- $\texttt{walkability}$: 58
- $\texttt{temperature}$: 60

**Part F**: Perform the appropriate statistical test at the $\alpha = 0.01$ significance level to determine if _at least one_ of the features is related to the response $y$.  

## Forward Select to Build an MLR Model

**Part G**: Write a function `forward_select(df, resp_str, maxk)` that takes in the DataFrame, the name of the column corresponding to the response, and the maximum number of desired features, and returns a list of feature names corresponding to the `maxk` most important features via forward selection.  At each stage in forward selection you should add the feature whose inclusion in the model would result in the lowest sum of squared errors $(SSE)$. Use your function to determine the best $k=3$ features to include in the model. Clearly indicate which feature was added in each stage. 

**Note**: The point of this exercise is to see if you can implement **forward_select** yourself.  You may of course use canned routines like statsmodels OLS, but you may not call any Python method that explicitly performs forward selection.

**Part H**: Write down the multiple linear regression model, including estimated parameters, obtained by your forward selection process. 

Use the reduced model to estimate the weight of a fish with:
- $\texttt{Length3}=31$, 
- $\texttt{Width}=4.5$, 
- $\texttt{Height}=9$

Use the reduced model to predict the sale price of a house with the following:
- $\texttt{age of house} = 55$,
- $\texttt{square footage} = 2425$,
- $\texttt{distance to metro} = 5$

<br>

---



## Backward Select to Build an MLR Model



**Part I**: Write a function `backward_select(df, resp_str, maxsse)` that takes in the DataFrame (`df`), the name of the column corresponding to the response (`resp_str`), and the maximum desired sum of squared errors (`maxsse`), and returns a list of feature names corresponding to the most important features via backward selection.  

`Fish.csv`
Use your code to determine the reduced MLR model with the minimal number of features such that the SSE of the reduced model is less than 3000000. 

At each stage in backward selection you should remove the feature that has the highest p-value in the hypothesis test of $H_0: \beta_k = 0$ against $H_1: \beta_k \neq 0$ for that feature's slope coefficient.

Your code should clearly indicate which feature was removed in each stage, and the SSE associated with the model fit before the feature's removal. _Specifically, please write your code to print the name of the feature that is going to be removed and the SSE before its removal_. Afterward, be sure to report all of the retained features and the SSE of the reduced model.

**Note**: If you are using `Fish.csv`, reorder the columns so that "Weight" is last. It's easier to do the backward select this way.

**Part J**: Write down the multiple linear regression model, including estimated parameters, obtained by your backward selection process. 

**Part K:** Consider the model you used in Part J, and consider the fact that you are trying to predict **Weight** or **sale price** respectively. What is one critical drawback to the MLR model (or any MLR model) for predicting response? What are some modifications that could improve on this issue?