## Simple and Multiple Linear Regression in Python

There are several options to run a regression analysis in Python. You can use Scikit-learn, Statsmodels, SciPy, and probably other packages.

Since in this class we only want to __briefly review how to do linear regression with Python__ (we already covered the theory of regression in the R class), I chose to illustrate regression analysis with Statsmodels, using one of the two interfaces it offers for this: the formula interface (statsmodels.formula.api).

I encourage you to explore and learn how to do regression analysis with other Python packages, and then select your __personal favorite package__ to do Linear Regression with Python.

In [None]:
# Importing required libraries

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

We will also import the _datasets_ package from the Scikit-learn library.

The _datasets_ package includes seven clean and ready-to-use datasets that can be imported to try out statistical and ML techniques.

One of these datasets is the familiar Boston dataset. I chose to use this dataset to illustrate regression in Python since you are already familiar with it. 

In [None]:
from sklearn import datasets

To load one of the datasets available in the __datasets package__, call the corresponding loader function, datasets.load_DATASETNAME()

In [None]:
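# Note: load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
boston_data = datasets.load_boston()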

In [None]:
type(boston_data)

In [None]:
boston_data.keys()

In [None]:
boston_data.DESCR

In [None]:
print(boston_data.DESCR)

Now, let's read the data imported from sklearn, which we called boston_data, into a Pandas data frame.

First, read all the columns with the X variables, that is, the predictors (or features, as they like to call them in Python circles). After this, we will add the outcome (target) variable to the data frame.

In [None]:
# Creating a data frame with the predictors

boston_df= pd.DataFrame(boston_data.data, columns= boston_data.feature_names)

In [None]:
# Adding the outcome to the data frame as a new column

boston_df['medv']= boston_data.target

In [None]:
boston_df.head()

In [None]:
boston_df.info()
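The same two-step pattern (predictors first, then the outcome) should work for any dataset returned by a Scikit-learn loader. The sketch below uses `load_diabetes()` purely as a stand-in, since it is still shipped with current Scikit-learn versions; the column name `progression` is my own choice and not part of the dataset.

```python
import pandas as pd
from sklearn import datasets

# load_diabetes() is only a stand-in here; it is not the Boston data
diabetes_data = datasets.load_diabetes()

# Step 1: predictors go in first, labeled with the feature names
diabetes_df = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)

# Step 2: the outcome is added as one more column (name chosen for illustration)
diabetes_df['progression'] = diabetes_data.target

print(diabetes_df.shape)
print(diabetes_df.head())
```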

## Simple Linear Regression

Let's do the same example we did in the Statistics with R class. That is, the linear regression of __medv__ versus __lstat__.

Before applying regression, let's do the usual preliminary analysis. Specifically, let's compute the __correlation coefficient__ between medv and lstat and do a __scatterplot__ between these two variables.

In [None]:
# Option 1 (of MANY) to compute the correlation coefficient between two variables
# Use the corr() method for data frames

boston_df[['LSTAT', 'medv']].corr()

In [None]:
boston_df.loc[:, ['LSTAT', 'medv']].corr()

How would you retrieve only the correlation coefficient value from the previous matrix? TRY IT !!!

In [None]:
# ENTER YOUR ANSWER HERE



In [None]:
# Option 2 (of MANY) to compute the correlation coefficient
# Import the pearsonr() function from SciPy
# (scipy.stats is the public location; the older scipy.stats.stats path is deprecated)

from scipy.stats import pearsonr

In [None]:
# The pearsonr() function returns the correlation coefficient and the p-value associated with this coefficient

pearsonr(boston_df['LSTAT'], boston_df['medv'])

In [None]:
# If you only want the correlation coefficient

pearsonr(boston_df['LSTAT'], boston_df['medv'])[0]
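Since these are only two of many options, here is a small sketch comparing a few of them on made-up data. The data frame `toy_df` and its columns are assumptions for illustration; the point is simply that the different routes should agree on the same Pearson coefficient.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Made-up data (an assumption for illustration) with a negative linear trend
rng = np.random.default_rng(0)
x = rng.uniform(2, 35, size=200)
y = 34 - 0.95 * x + rng.normal(0, 6, size=200)
toy_df = pd.DataFrame({'x': x, 'y': y})

# pandas: correlation between two Series
r_series = toy_df['x'].corr(toy_df['y'])

# NumPy: corrcoef() returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(toy_df['x'], toy_df['y'])[0, 1]

# SciPy: pearsonr() returns the coefficient together with its p-value
r_scipy, p_value = pearsonr(toy_df['x'], toy_df['y'])

print(r_series, r_numpy, r_scipy)
```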

__Scatterplot__

Do a scatterplot between lstat and medv using the functions in the __matplotlib library__.

Quoting from __w3schools__ ... "Most of Matplotlib utilities lies under the pyplot submodule"

That's why we import the pyplot submodule of Matplotlib.

In [None]:
import matplotlib.pyplot as plt

We are going to use the plt.scatter() function since it is the conventional way of doing a scatter plot. However, if you aren't trying to do a sophisticated scatter plot, as in our case, plt.plot() will also do the job.

In [None]:
plt.scatter(boston_df['LSTAT'], boston_df['medv'],c='green',marker='x')

plt.xlabel("LSTAT")

plt.ylabel("medv")

plt.title("Median house values VS % of lower-status population in the neighborhood")

plt.axhline(boston_df['medv'].mean(),c='red',ls='--')

# I added the next line because I did not like the way the tick marks were placed on the y axis

plt.yticks(np.arange(boston_df['medv'].min(), boston_df['medv'].max()+1, 5))

plt.show()

### Obtaining the simple linear regression model between medv and lstat

In [None]:
regression_object= smf.ols('medv~LSTAT', data=boston_df)

What does 'ols' stand for?

In [None]:
regression_model= regression_object.fit()

In [None]:
regression_model.summary()

In [None]:
regression_model.params
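The fitted results object exposes more than the coefficients. As a minimal sketch on made-up data (the names `toy_df` and `toy_model` are assumptions for illustration), the cell below pulls out the individual pieces that also appear in the summary table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data (an assumption for illustration)
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({'x': rng.uniform(2, 35, size=200)})
toy_df['y'] = 34 - 0.95 * toy_df['x'] + rng.normal(0, 6, size=200)

toy_model = smf.ols('y ~ x', data=toy_df).fit()

# params is a pandas Series indexed by term name
intercept = toy_model.params['Intercept']
slope = toy_model.params['x']
print(intercept, slope)

# Other quantities shown in summary() are available as attributes or methods
print(toy_model.rsquared)    # coefficient of determination
print(toy_model.pvalues)     # p-value of each coefficient
print(toy_model.conf_int())  # 95% confidence intervals by default
```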

__How to make predictions with the estimated equation?__

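Let's get the prediction of medv for the observations in boston_df. Calling predict() with no arguments returns the fitted values for the data used to estimate the model.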

In [None]:
regression_model.predict()

Now, let's use the equation to predict medv for new data.

For example, let's predict the values of medv based on the regression equation for five new values of lstat.

One of the valid options to pass the new data for the predictor is to do it as a dictionary whose key is the predictor name used in the formula.

Option 1: Create this dictionary with the dict() function, passing the predictor name as a keyword argument and the list of new values as its value.

In [None]:
# 5.5, 6, 7, 8.5, and 9.3

# These are five new values of lstat for observations (neighborhoods) that were not in the boston_df

regression_model.predict(dict(LSTAT=[5.5, 6, 7, 8.5, 9.3]))

Option 2: Pass a dictionary directly

In [None]:
regression_model.predict({'LSTAT': [5.5, 6, 7, 8.5, 9.3]})
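To see what predict() is doing, the sketch below compares its output with the estimated equation applied by hand, intercept + slope * x. It runs on made-up data; `toy_df`, `toy_model`, and `new_x` are assumed names for illustration. It also passes the new values as a data frame, which is another accepted input besides a dictionary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data (an assumption for illustration)
rng = np.random.default_rng(3)
toy_df = pd.DataFrame({'x': rng.uniform(2, 35, size=200)})
toy_df['y'] = 34 - 0.95 * toy_df['x'] + rng.normal(0, 6, size=200)

toy_model = smf.ols('y ~ x', data=toy_df).fit()

# New predictor values, passed as a data frame with the same column name as in the formula
new_x = pd.DataFrame({'x': [5.5, 6, 7, 8.5, 9.3]})
predicted = toy_model.predict(new_x)

# The same predictions computed by hand from the estimated coefficients
by_hand = toy_model.params['Intercept'] + toy_model.params['x'] * new_x['x']

print(predicted.values)
print(by_hand.values)
```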

#### Plot of Residuals versus Fitted (Predicted) values

First, let's repeat the previous scatter plot of medv VS LSTAT, but let's add the regression line to it this time.

In [None]:
plt.scatter(boston_df['LSTAT'], boston_df['medv'],c='green',marker='o')
plt.xlabel("LSTAT")
plt.ylabel("medv")
plt.title("Relationship between house values and % of lower-status population")
plt.axhline(boston_df['medv'].mean(),c='red',ls='--')
plt.yticks(np.arange(boston_df['medv'].min(), boston_df['medv'].max()+1, 5))

# This is the additional statement needed to plot the regression line in the scatterplot
plt.plot(boston_df['LSTAT'], regression_model.predict(), c='red', ls='-')

plt.show()

In [None]:
# Before doing the plot of the residuals, let's compute the residuals

residuals = boston_df['medv'] - regression_model.predict()

In [None]:
plt.scatter(regression_model.predict(), residuals,c='blue',marker='o')

plt.xlabel("Fitted")
plt.ylabel("Residuals")
plt.axhline(0,c='red',ls='--')
plt.show()
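The residuals and fitted values do not have to be computed by hand: the results object also stores them. The sketch below, on made-up data with assumed names `toy_df` and `toy_model`, checks that the stored values match the manual calculation used above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data (an assumption for illustration)
rng = np.random.default_rng(4)
toy_df = pd.DataFrame({'x': rng.uniform(2, 35, size=200)})
toy_df['y'] = 34 - 0.95 * toy_df['x'] + rng.normal(0, 6, size=200)

toy_model = smf.ols('y ~ x', data=toy_df).fit()

# Manual residuals: observed minus fitted
manual_residuals = toy_df['y'] - toy_model.predict()

# Stored on the results object
stored_residuals = toy_model.resid
stored_fitted = toy_model.fittedvalues

print(np.allclose(manual_residuals, stored_residuals))
print(np.allclose(toy_model.predict(), stored_fitted))

# With an intercept in the model, OLS residuals average to (numerically) zero
print(stored_residuals.mean())
```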

## Multiple Linear Regression

We are still using the Boston dataset example.

In [None]:
boston_df.info()

#### Regression of medv VS all 13 predictors

In R, we used a shortcut to obtain a model including all the predictors without the need to enter each predictor individually. The shortcut was:

lm (medv ~ ., data= Boston)

There does not seem to be such an option in the formula interface of the Statsmodels package. However, we can use a workaround: create the formula using string methods and pass this formula to the ols() function we used before in simple regression. See next:

In [None]:
boston_df.columns

What we are trying to get is this: 'medv~ CRIM+ZN+INDUS+CHAS+ ... all other predictors'

Let's get this part first: 'CRIM+ZN+INDUS+CHAS+ ... all other predictors'

In [None]:
all_predictors= '+'.join(boston_df.columns.difference(['medv']))

In [None]:
all_predictors

In [None]:
# Now, we can use all_predictors to get the formula we want

my_regression_formula = 'medv~' + all_predictors

In [None]:
my_regression_formula
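One detail worth noticing: `columns.difference()` returns the remaining column names in sorted order, not in their original order. This does not change the fitted model, only the order in which the coefficients are listed. The sketch below shows the difference on a small made-up data frame (`toy_df` is an assumed name, with only a few of the Boston column names) and one possible way to keep the original order.

```python
import pandas as pd

# Empty made-up data frame; only the column names matter here
toy_df = pd.DataFrame(columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'medv'])

# difference() drops 'medv' and returns the rest sorted alphabetically
sorted_predictors = '+'.join(toy_df.columns.difference(['medv']))

# A generator expression keeps the original column order instead
ordered_predictors = '+'.join(col for col in toy_df.columns if col != 'medv')

print('medv~' + sorted_predictors)
print('medv~' + ordered_predictors)
```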

In [None]:
regression_object2= smf.ols(my_regression_formula, data=boston_df)

In [None]:
regression_model2= regression_object2.fit()

In [None]:
regression_model2.summary()

In [None]:
regression_model2.params

If we only wanted the regression of medv with a subset of the predictors (e.g., crim, zn, and lstat), then:

In [None]:
regression_object3= smf.ols('medv~ CRIM+ ZN + LSTAT', data= boston_df)

In [None]:
regression_model3= regression_object3.fit()

In [None]:
regression_model3.params

Use this equation to predict medv for two new neighborhoods with the following characteristics:

Neighborhood 1: crim=0.85 , zn= 20, lstat= 8.5

Neighborhood 2: crim=0.75 , zn= 25, lstat= 8.8

In [None]:
regression_model3.predict({'CRIM':[0.85, 0.75],'ZN':[20, 25], 'LSTAT':[8.5, 8.8]})