# <center><b>Math for Data Science</b></center>

# <center><b>Simple Linear Regression</b></center>

# <center><b>Coding Notes</b></center>

<div class="alert alert-block alert-warning">
    <b><font size="4">Files needed for this presentation:</font></b>
</div>

[**Cereals.xlsx**](https://docs.google.com/spreadsheets/d/1w46w7MoPPxKaWPcp1XHAQRD5Vh81JT57/edit?usp=share_link&ouid=117745432621363033141&rtpof=true&sd=true)

## Display multiple outputs in one cell

In [29]:
# set up notebook to display multiple outputs in one cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print('The notebook is set up to display multiple outputs in one cell.')

The notebook is set up to display multiple outputs in one cell.


## Import statements

In [73]:
import pandas as pd
import numpy as np
import seaborn as sns

## Load the cereals dataset

In [32]:
cereals = pd.read_excel("Cereals.xlsx", usecols = ['Cereal Name', 'Rating', 'Sugar'])

In [41]:
cereals.head()
cereals.shape

                 Cereal Name     Rating  Sugar
0                  100%_Bran  68.402973      6
1          100%_Natural_Bran  33.983679      8
2                   All-Bran  59.425505      5
3  All-Bran_with_Extra_Fiber  93.704912      0
4             Almond_Delight  34.384843      8


(76, 3)

## Use the statsmodels library to perform simple linear regression

**Documentation:**&emsp;[**statsmodels**](https://www.statsmodels.org/stable/index.html)

In [35]:
import statsmodels.formula.api as smf

In [36]:
model = smf.ols(formula = 'Rating ~ Sugar', data = cereals)

In [38]:
results = model.fit()

In [39]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 Rating   R-squared:                       0.584
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     103.7
Date:                Sat, 18 Feb 2023   Prob (F-statistic):           1.01e-15
Time:                        11:29:15   Log-Likelihood:                -275.21
No. Observations:                  76   AIC:                             554.4
Df Residuals:                      74   BIC:                             559.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     59.8530      1.998     29.964      0.000      55.873      63.833
Sugar         -2.4614      0.242    -10.183      0.000      -2.943      -1.980
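
For intuition about what "Least Squares" means in the output above, the slope and intercept of a least-squares line can be sketched directly with NumPy. This is only a minimal sketch of the textbook formulas, not a description of how **statsmodels** computes its estimates internally. The arrays x and y are made-up toy values (not the cereals data), and the variable names are illustrative.

```python
import numpy as np

# hypothetical toy data (NOT the cereals dataset) ... used only to illustrate the formulas
x = np.array([0.0, 3.0, 6.0, 9.0, 12.0])      # predictor
y = np.array([60.0, 52.0, 47.0, 36.0, 30.0])  # response

# closed-form least-squares estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
intercept = y_bar - slope * x_bar

# cross-check against NumPy's degree-1 polynomial fit
check_slope, check_intercept = np.polyfit(x, y, deg=1)

print(f'slope: {slope:.4f}, intercept: {intercept:.4f}')
print(f'polyfit slope: {check_slope:.4f}, polyfit intercept: {check_intercept:.4f}')
```

The two printed lines should agree, since both approaches minimize the sum of squared residuals.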

## Interpret the regression output

- **LSRL Equation**: $\widehat{\text{Rating}} = 59.8530 - 2.4614 \cdot \text{Sugar}$
- **Slope interpretation**: For each additional gram of sugar, the LSRL predicts that the cereal's nutritional rating will decrease by 2.4614 points.
- **Intercept interpretation**: The LSRL predicts that a cereal with 0 grams of sugar will have a nutritional rating of 59.8530.
- **R-squared** = 0.584 ... When the LSRL is used to predict a cereal's nutritional rating from its grams of sugar, 58.4% of the variation in nutritional rating is accounted for by the linear relationship with grams of sugar.
- **SE(Sugar)** = 0.242 ... the "standard error of the slope": over repeated random samples, the slope of the sample regression line would typically differ by about 0.242 from the slope of the population (true) regression line for predicting nutritional rating from grams of sugar.
- **Hypothesis (Significance) Test**:
 - Hypotheses: $H_{0}$: the slope of the population regression line is 0 (no linear relationship); $H_{a}$: the slope is not 0.
 - Student's t Test Statistic for Sugar: t = -10.183
 - p-value for Student's t Test Statistic for Sugar: P>|t| = 0.000 (reported to three decimal places, so the p-value is less than 0.0005)
 - Decision about $H_{0}$: Since the p-value, ≈ 0.000, is less than α = 0.05, we reject $H_{0}$.
 - Conclusion about $H_{a}$: There is sufficient evidence to conclude that there is a linear relationship between a cereal's nutritional rating and its sugar content.
- **95% Confidence Interval**:
 - We are 95% confident that the interval from -2.943 to -1.980 captures the actual slope of the population regression line. That is, about 95% of all possible random samples of size n = 76 from this population of cereals would produce an interval that captures the slope of the population (true) regression line for predicting nutritional rating from grams of sugar.
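
The t statistic and the confidence interval for the slope can be reproduced by hand from the coefficient and its standard error. The sketch below uses the rounded values reported in these notes, so it matches the summary table only up to rounding. The critical value t_star is an assumption: an approximate t-table value for 74 degrees of freedom, not something printed by **statsmodels**.

```python
# values copied from the regression output in these notes
slope = -2.4614      # coef for Sugar
se_slope = 0.242     # std err for Sugar
df_resid = 74        # Df Residuals = n - 2 = 76 - 2

# t statistic for H0: population slope = 0
t_stat = (slope - 0) / se_slope

# ASSUMPTION: approximate critical value t* for 95% confidence with 74 degrees of freedom
# (taken from a t table; scipy.stats.t.ppf(0.975, df_resid) would give a more precise value)
t_star = 1.993

# confidence interval: estimate +/- (critical value) * (standard error)
margin = t_star * se_slope
ci_low, ci_high = slope - margin, slope + margin

print(f't statistic: {t_stat:.2f}')
print(f'95% CI for the slope: ({ci_low:.2f}, {ci_high:.2f})')
```

Small differences from the table (for example in the third decimal place of t) come from using the rounded standard error 0.242 rather than the full-precision value.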

In [43]:
# if you want just the parameters ... call the params attribute on the results

print(results.params)

Intercept    59.853017
Sugar        -2.461420
dtype: float64
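
Once the parameters are available, they can be plugged into the LSRL to make a prediction and compute a residual. The sketch below uses a plain dictionary as a stand-in for results.params (which is a pandas Series indexed by the same labels, so results.params['Sugar'] would look the same); predict_rating is a hypothetical helper name, not part of **statsmodels**.

```python
# stand-in for results.params ... values copied from the output above
params = {'Intercept': 59.853017, 'Sugar': -2.461420}

def predict_rating(sugar_grams):
    """Predicted nutritional rating from the LSRL for a given number of grams of sugar."""
    return params['Intercept'] + params['Sugar'] * sugar_grams

# All-Bran (row 2 of cereals.head()) has 5 grams of sugar and an actual rating of 59.425505
predicted_rating = predict_rating(5)
residual = 59.425505 - predicted_rating   # residual = actual - predicted

print(f'predicted rating: {predicted_rating:.3f}')
print(f'residual: {residual:.3f}')
```

A positive residual means the cereal's actual rating is higher than the LSRL predicts for its sugar content.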


In [46]:
# if you want to report confidence intervals, call the conf_int method on the results
# ... each interval gives a range of plausible values for the true (population) parameter; the default level is 95%
 
print(results.conf_int())

                   0          1
Intercept  55.872858  63.833176
Sugar      -2.943061  -1.979779


## Use the sklearn library to perform simple linear regression

**Documentation:**&emsp;[**sklearn ... scikit-learn Machine Learning in Python**](https://scikit-learn.org/stable/index.html)

In [47]:
from sklearn import linear_model

In [48]:
# create our LinearRegression object

lr = linear_model.LinearRegression()

In [49]:
# next, specify the predictor X, and the response y
# note -- it is uppercase X and lowercase y
# this will fail because cereals['Sugar'] is a 1D Series, but sklearn expects X to be 2D (rows = observations, columns = features)

predicted = lr.fit(X = cereals['Sugar'], y = cereals['Rating'])

ValueError: Expected 2D array, got 1D array instead:
array=[ 6  8  5  0  8 10 14  8  6  5 12  1  9  7 13  3  2 12 13  7  0  3 10  5
 13 11  7 10 12 12 15  9  5  3  4 11 10 11  6  9  3  6 12  3 11 11 13  6
  9  7  2 10 14  3  0  0  6 12  8  6  2  3  0  0  0 15  3  5  3 14  3  3
 12  3  3  8].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

- Since **sklearn** is built to work with NumPy arrays, there will be times when you have to manipulate your data before passing a DataFrame column into **sklearn**.
- The error message in the output above tells us that the array we passed is not in the correct shape: **sklearn** expects X to be 2D. Therefore, we need to reshape the input.
- If the data has a single feature (which is the case above), we should specify reshape(-1, 1) to get one column with one row per observation. If the data is a single sample (i.e. one observation with several features), we should specify reshape(1, -1) to get one row.
- To reshape our data, we use the values attribute ... when we call values on a Pandas DataFrame or Series, we get the NumPy ndarray representation of the data, which has a reshape method.
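
The difference between the two reshape calls is easiest to see by printing the shapes. This small sketch uses the first five Sugar values shown by cereals.head() in place of the full column; the variable names are illustrative.

```python
import numpy as np

# first five Sugar values from cereals.head()
sugar = np.array([6, 8, 5, 0, 8])

as_column = sugar.reshape(-1, 1)   # single feature: one row per observation
as_row = sugar.reshape(1, -1)      # single sample: one row, one column per feature

print(sugar.shape)       # (5,)   1D, which is what triggers the ValueError
print(as_column.shape)   # (5, 1) 2D, the shape sklearn expects for X here
print(as_row.shape)      # (1, 5) 2D, but treated as one observation with five features
```

Selecting the column with double brackets, cereals[['Sugar']], is another way to get a two-dimensional object that **sklearn** accepts.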

In [51]:
# to fix the error above

predicted = lr.fit(X = cereals['Sugar'].values.reshape(-1, 1), y = cereals['Rating'])

- Unfortunately, **sklearn** doesn't provide us with the nice summary table that **statsmodels** does.
- To obtain the slope coefficient in **sklearn**, access the coef_ attribute of the fitted model. It is an array with one entry per feature.
- To get the intercept, access the intercept_ attribute.

In [60]:
print(predicted.coef_)
print(predicted.intercept_)

[-2.46142021]
59.85301691061821


In [72]:
print(f'slope coefficient: {predicted.coef_}')
print(f'intercept: {predicted.intercept_}')

slope coefficient: [-2.46142021]
intercept: 59.85301691061821
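
The slope prints in brackets because coef_ is an array with one entry per feature. The sketch below shows how to pull out the slope as a plain number and how the fitted line turns new sugar values into predicted ratings. To keep it self-contained, the arrays coef and intercept are stand-ins holding the values printed above; with the fitted model, predicted.predict(X_new) should give the same numbers.

```python
import numpy as np

# stand-ins for predicted.coef_ and predicted.intercept_ ... values copied from the output above
coef = np.array([-2.46142021])     # one entry per feature, which is why it prints in brackets
intercept = 59.85301691061821

slope = coef[0]                    # index into the array to get the slope as a plain number

# new sugar values must be 2D, one row per cereal, just like the X passed to fit
X_new = np.array([0, 5, 10]).reshape(-1, 1)

# same arithmetic as the fitted line: X @ coef + intercept
predictions = X_new @ coef + intercept

print(f'slope: {slope}')
print(predictions)
```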
