# 3.6 Lab: Linear Regression

This notebook is intended as a Python translation of the R lab in Chapter 3, Section 6.


## Installation
There are multiple ways to get a notebook environment working. The easiest is probably to install Anaconda and work off of that premade setup.

However, I personally recommend using ASDF to install versions of Python and using Poetry to manage package dependencies. This is because I tend to use separate environments for each project I'm working on (including this one) and I previously had problems with `conda` installs.

Please look at the README.md in the base of the repository to see how to install things.


## 3.6.1 Libraries
Python is handy, but it doesn't appear to have the matrix and statistical functionality that is built into R. To better handle matrix math and statistical calculations, I will be using `numpy`, `pandas`, and `scipy` (specifically the `stats` module from `scipy`).

The original text loads libraries in order to incorporate functions and data sets not included in the base `R` distribution. This is normal practice in Python, where we routinely import libraries like `numpy`, `pandas`, and `matplotlib`.

Here we will load `numpy`, `pandas`, and `scipy` (this list may change as I go through the lab). We will also load the data from the `Boston` data set.

If you get an error saying a library can't be loaded, please check that you have installed the virtual environment correctly and cleanly.
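One quick way to see which libraries are visible to the current interpreter is sketched below. It uses only the standard library. The `REQUIRED` list and the `check_packages` helper are my own assumptions about what this lab needs, not something taken from the repository.

```python
from importlib import metadata, util

# Assumed list of packages for this lab; adjust as the notebook evolves.
REQUIRED = ["numpy", "pandas", "scipy"]


def check_packages(names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        if util.find_spec(name) is None:
            versions[name] = None  # not importable from this environment
            continue
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "unknown"  # importable, but no distribution metadata
    return versions


for name, version in check_packages(REQUIRED).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

If a package shows up as `NOT INSTALLED`, the notebook kernel is probably not running inside the environment that was set up for this project.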


In [1]:
import numpy as np
import pandas as pd
from scipy import stats

## 3.6.2 Simple Linear Regression

We are looking at the `Boston` data set, which records `medv` (median house value) for 506 census tracts in Boston. We seek to predict `medv` using 12 predictors such as `rm` (average number of rooms per house), `age` (average age of houses), and `lstat` (percent of households with low socioeconomic status).

In [5]:
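boston_dataframe_path = '../data/Boston.csv'
# Read a single row just to get the column names.
cols = list(pd.read_csv(boston_dataframe_path, nrows=1))
# Skip "Unnamed: 0", the name pandas gives to an unlabeled first column.
boston_dataframe = pd.read_csv(boston_dataframe_path, usecols=[c for c in cols if c != "Unnamed: 0"])
boston_dataframe.head(10)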

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,5.21,28.7
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,12.43,22.9
7,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,19.15,27.1
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,29.93,16.5
9,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,17.1,18.9
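If the extra column is just a row index that was written out when the CSV was saved (an assumption on my part, based on the `Unnamed: 0` name), then `index_col=0` may be a shorter way to get the same columns. The sketch below uses an in-memory copy of the first three rows shown above rather than the real file. The index values 1, 2, 3 in the first column are hypothetical and only there for illustration.

```python
import io

import pandas as pd

# First three rows from the output above, laid out the way the file is
# ASSUMED to look on disk: an unlabeled first column holding a row index.
# The index values 1, 2, 3 are hypothetical.
csv_text = """,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
"""

# Approach used in this notebook: read the header, then skip "Unnamed: 0".
cols = list(pd.read_csv(io.StringIO(csv_text), nrows=1))
filtered = pd.read_csv(
    io.StringIO(csv_text),
    usecols=[c for c in cols if c != "Unnamed: 0"],
)

# Alternative: treat the first column as the index instead of as data.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(filtered.columns) == list(indexed.columns))
print(filtered.index.tolist(), indexed.index.tolist())
```

The two frames hold the same columns and values. The difference is the row labels: the filtered version gets a fresh index starting at 0, while `index_col=0` keeps whatever labels are stored in the file.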


Next we need to fit a simple linear regression model with `medv` as the response and `lstat` as the predictor. The book does this with R's `lm()` function.

An analog for this is the `linregress` function inside the `scipy.stats` module.
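To make the connection to the least-squares formulas concrete, the sketch below computes the slope and intercept by hand and compares them with `linregress`. It uses only the ten `lstat`/`medv` pairs shown in the table above, so the numbers will differ from a fit on all 506 rows.

```python
import numpy as np
from scipy import stats

# The ten (lstat, medv) pairs from the head(10) output above.
lstat = np.array([4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.1])
medv = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

# Closed-form least squares for one predictor:
#   slope     = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
#   intercept = y_mean - slope * x_mean
x_centered = lstat - lstat.mean()
y_centered = medv - medv.mean()
slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
intercept = medv.mean() - slope * lstat.mean()

fit = stats.linregress(x=lstat, y=medv)

print(f"by hand:    intercept={intercept:.4f}, slope={slope:.4f}")
print(f"linregress: intercept={fit.intercept:.4f}, slope={fit.slope:.4f}")
```

Both lines should print the same estimates, which suggests `linregress` is doing ordinary least squares, the same fit `lm()` performs for a single predictor.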

In [9]:
?stats.linregress

Signature: stats.linregress(x, y=None, alternative='two-sided')
Docstring:
Calculate a linear least-squares regression for two sets of measurements.

Parameters
----------
x, y : array_like
    Two sets of measurements.  Both arrays should have the same length.  If
    only `x` is given (and ``y=None``), then it must be a two-dimensional
    array where one dimension has length 2.  The two sets of measurements
    are then found by splitting the array along the length-2 dimension. In
    the case where ``y=None`` and `x` is a 2x2 array, ``linregress(x)`` is
    equivalent to ``linregress(x[0], x[1])``.
alternative : {'two-sided', 'less', 'greater'}, optional
    Defines the alternative hypothesis. Default is 'two-sided'.
    The following options are available:

    * 'two-sided': the slope of the regr
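As a small illustration of the `alternative` parameter, the sketch below fits the ten `lstat`/`medv` pairs from the table above twice: once with the default two-sided test and once with `alternative='less'`, which asks whether the slope is negative. It assumes a SciPy version whose `linregress` accepts `alternative`, as the signature above does.

```python
import numpy as np
from scipy import stats

# The ten (lstat, medv) pairs from the head(10) output above.
lstat = np.array([4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.1])
medv = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

two_sided = stats.linregress(lstat, medv)                      # H1: slope != 0
one_sided = stats.linregress(lstat, medv, alternative="less")  # H1: slope < 0

print(f"slope:             {two_sided.slope:.4f}")
print(f"two-sided p-value: {two_sided.pvalue:.4g}")
print(f"one-sided p-value: {one_sided.pvalue:.4g}")
```

Because the fitted slope is negative for these rows, the one-sided p-value should come out as half of the two-sided one. The estimates themselves do not change.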

In [11]:
result = stats.linregress(x=boston_dataframe['lstat'], y=boston_dataframe['medv'])
print(f"Intercept: {result.intercept}, Intercept StdErr: {result.intercept_stderr}")
print(f"Slope: {result.slope}, Slope StdErr: {result.stderr}")
print(f"P value: {result.pvalue}")
print(f"Pearson correlation coefficient (r): {result.rvalue}")

Intercept: 34.55384087938311, Intercept StdErr: 0.562627354988433
Slope: -0.9500493537579909, Slope StdErr: 0.038733416212639406
P value: 5.081103394386929e-88
Pearson correlation coefficient (r): -0.7376627261740151
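The R lab goes on to report confidence intervals for the coefficients. `linregress` does not return them directly, but they can be sketched from the estimates and standard errors printed above, using the t distribution with n - 2 degrees of freedom. The numbers below are copied from that output, and n = 506 comes from the data set description. The helper name `confidence_interval` is my own.

```python
from scipy import stats

n_obs = 506  # census tracts in the Boston data set

# Estimates and standard errors copied from the linregress output above.
intercept, intercept_se = 34.55384087938311, 0.562627354988433
slope, slope_se = -0.9500493537579909, 0.038733416212639406
r_value = -0.7376627261740151


def confidence_interval(estimate, std_err, n, level=0.95):
    """Two-sided t-based interval with n - 2 degrees of freedom."""
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    return estimate - t_crit * std_err, estimate + t_crit * std_err


intercept_ci = confidence_interval(intercept, intercept_se, n_obs)
slope_ci = confidence_interval(slope, slope_se, n_obs)
r_squared = r_value ** 2  # for one predictor, R^2 is the squared correlation

print(f"Intercept 95% CI: ({intercept_ci[0]:.4f}, {intercept_ci[1]:.4f})")
print(f"Slope 95% CI:     ({slope_ci[0]:.4f}, {slope_ci[1]:.4f})")
print(f"R-squared:        {r_squared:.4f}")
```

With the fitted `result` object in hand, the same helper could be called as `confidence_interval(result.slope, result.stderr, len(boston_dataframe))` instead of pasting the numbers in.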


In [7]:
# See the printed results from lm() given in the text. Still to work out: how to add things like quantiles, residuals, and the residual standard error.
# The demo also calculates Multiple R-squared, Adjusted R-squared, and the F-statistic, although those values seem less necessary for simple linear regression.

34.55384087938311
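One possible way to fill in the pieces mentioned in the note above (residual quantiles, residual standard error, R-squared, adjusted R-squared, and the F-statistic) is to compute them from the residuals of the `linregress` fit. The function below is a sketch of that approach, not a replacement for R's `summary()`. The name `simple_regression_summary` is my own, and the demo uses only the ten rows shown in the `head(10)` output earlier.

```python
import numpy as np
from scipy import stats


def simple_regression_summary(x, y):
    """Return lm()-style summary statistics for a one-predictor fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = len(x), 1  # p = number of predictors

    fit = stats.linregress(x, y)
    residuals = y - (fit.intercept + fit.slope * x)

    rss = np.sum(residuals ** 2)       # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
    df_resid = n - p - 1

    r_squared = 1 - rss / tss
    f_stat = ((tss - rss) / p) / (rss / df_resid)
    return {
        # Min, 1Q, Median, 3Q, Max of the residuals
        "residual_quantiles": np.quantile(residuals, [0, 0.25, 0.5, 0.75, 1]),
        "rse": np.sqrt(rss / df_resid),
        "r_squared": r_squared,
        "adj_r_squared": 1 - (1 - r_squared) * (n - 1) / df_resid,
        "f_statistic": f_stat,
        "f_pvalue": stats.f.sf(f_stat, p, df_resid),
    }


# The ten (lstat, medv) pairs from the head(10) output, for demonstration only.
lstat = [4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.1]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]

summary = simple_regression_summary(lstat, medv)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With the full data this would be called as `simple_regression_summary(boston_dataframe['lstat'], boston_dataframe['medv'])`. For a single predictor the F-statistic equals the square of the slope's t-statistic and shares its p-value, which may be why those values add little in the simple case.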