# Linear Regression for Business Statistics <a id="heading"></a>

1. [**Regression Analysis: An Introduction**](#week-1-rai)
2. [**Regression Analysis: Hypothesis Testing and Goodness of Fit**](#week-2-rahtagof)
3. [**Regression Analysis: Dummy Variables, Multicollinearity**](#week-3-radvm)
4. [**Regression Analysis: Various Extensions**](#week-4-rave)
---

## Regression Analysis: An Introduction <a id="week-1-rai"></a>

[*Back to the heading*](#heading)

**What is Linear Regression?**

**Linear regression** attempts to fit a linear relation between a variable of interest and a set of variables that may be related to the variable of interest.

There are two main types of Linear Regression:

* Simple Regression - a regression with only one explanatory or X variable
* Multiple Regression - a regression with two or more explanatory or X variables

Overview of Regression:

1. Modeling - developing a regression model
2. Estimation - using software to estimate the model
3. Inference - interpreting the estimated regression model
4. Prediction - making predictions about the variable of interest

---

Example:

A sales manager at a toy retail company, which sells various kinds of toys in the local market, needs to project how many units of one particular toy the company will sell each month over the coming half year. In the past she based such projections on gut feeling, and she now wishes to be more scientific about the process.

Based on her experience, the manager concludes that the **monthly unit sales** depend on three important variables: the **price** at which the toy is sold, the **monthly amount that the company spends on advertising** the toy, and the **monthly amount spent on promotions** for the toy.

The final formula would be:

$$ Sales = \beta_0 + \beta_1 Price + \beta_2 AdExp + \beta_3 PromExp $$

Where:

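* Sales - the variable of interest, also called the Y / dependent / response / regressed / left-hand-side (L.H.S.) variable
* Price, AdExp, PromExp - the explanatory variables, also called X / independent / covariate / regressor / right-hand-side (R.H.S.) variables
* $\beta_0, \dots, \beta_3$ - the coefficient parameters (estimated from the data)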

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import t
from scipy.stats import norm
from math import sqrt
from math import ceil
from sklearn import linear_model
import statsmodels.api as sm

In [3]:
toys_data = pd.read_excel("Excel files/Toy-sales.xlsx", "Sheet1", index_col=None, na_values=["NA"])

In [5]:
toys_data.head()

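| | Month | Unit Sales | Price (\$) | Adexp ('000\$) | Promexp ('000\$) |
|---|---|---|---|---|---|
| 0 | 1 | 73959 | 8.75 | 50.04 | 61.13 |
| 1 | 2 | 71544 | 8.99 | 50.74 | 60.19 |
| 2 | 3 | 78587 | 7.5 | 50.14 | 59.16 |
| 3 | 4 | 80364 | 7.25 | 50.27 | 60.38 |
| 4 | 5 | 78771 | 7.4 | 51.25 | 59.71 |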


In [95]:
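# explanatory (X) variables and the variable of interest (Y)
X = toys_data[["Price ($)", "Adexp ('000$)", "Promexp ('000$)"]]
Y = toys_data["Unit Sales"]

# via sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)  # ordinary least squares; sklearn fits the intercept itself by default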
print("Regression via sklearn:\nCoefficients:", regr.coef_, "\n")
price, adexp, promexp = 8.1, 50, 60
sales_predict = regr.predict([[price, adexp, promexp]])
print(f"Predicted sales with Price {price}, AdExp {adexp}, PromExp {promexp} is: {sales_predict}")
print(f"R^2: {regr.score(X, Y)}, Intercept: {regr.intercept_}")
print("\n", "_" * 78, "\n\nRegression via statsmodels\n", sep="")

# via statsmodels
X = sm.add_constant(X)
model = sm.OLS(Y, X, hasconst=True).fit()
print_model = model.summary()  # alpha=0.05 by default
print(print_model)

Regression via sklearn:
Coefficients: [-5055.26986592   648.61214026  1802.61095612] 

Predicted sales with Price 8.1, AdExp 50, PromExp 60 is: [74542.74554463]
R^2: 0.8588446525398427, Intercept: -25096.832921870096

______________________________________________________________________________

Regression via statsmodels

                            OLS Regression Results                            
Dep. Variable:             Unit Sales   R-squared:                       0.859
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     40.56
Date:                Wed, 14 Apr 2021   Prob (F-statistic):           1.08e-08
Time:                        02:26:19   Log-Likelihood:                -203.48
No. Observations:                  24   AIC:                             415.0
Df Residuals:                      20   BIC:                             419.7
Df Model:                           3     
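The sklearn prediction printed above can be reproduced by hand by plugging the reported intercept and coefficients into the model equation. The sketch below is only a sanity check: the helper name `predict_sales` is hypothetical, and the numbers are copied from the printed output, so they carry its rounding.

```python
# Intercept and coefficients copied from the sklearn output above.
b0 = -25096.832921870096                      # intercept
b_price, b_adexp, b_promexp = -5055.26986592, 648.61214026, 1802.61095612


def predict_sales(price, adexp, promexp):
    """Evaluate Sales = b0 + b1*Price + b2*AdExp + b3*PromExp (hypothetical helper)."""
    return b0 + b_price * price + b_adexp * adexp + b_promexp * promexp


sales = predict_sales(price=8.1, adexp=50, promexp=60)
print(sales)  # agrees with the sklearn prediction up to the rounding of the printed coefficients
```

Read this way, each coefficient is the change in predicted unit sales for a one-unit change in that variable with the others held fixed. For example, a \$1 price increase lowers predicted monthly sales by about 5,055 units.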

$R^2$ here is the proportion of the variation in the Y variable that is explained by the regression model. It lies between 0 and 1, and the closer it is to 1, the better the fit.
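The summary above also reports an adjusted $R^2$. Assuming the standard definition, which penalizes $R^2$ for the number of explanatory variables, the reported value can be recovered from $R^2$, the number of observations $n$ and the number of X variables $k$. The function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Values taken from the Toy-sales output: R^2, 24 observations, 3 X variables.
adj = adjusted_r2(0.8588446525398427, n=24, k=3)
print(round(adj, 3))  # 0.838, matching "Adj. R-squared" in the summary
```

The denominator $n - k - 1 = 20$ is the "Df Residuals" value from the same summary.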

**Regression is a process that has errors**
* Residuals and Errors
* Residuals $= Y^{actual} - Y^{predicted}$
* Errors are typically distributed equally above and below the regression line
* R-square: A “goodness of fit” measure
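To make residuals and $R^2$ concrete, here is a minimal sketch on a tiny made-up dataset (not taken from the notebook's Excel files). It fits a simple regression by least squares in plain Python, computes the residuals as actual minus predicted, and derives $R^2$ as one minus the ratio of unexplained to total variation.

```python
# Tiny made-up dataset, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n

# Least-squares slope and intercept for a simple regression.
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # actual minus predicted

ss_res = sum(e ** 2 for e in residuals)                # unexplained variation
ss_tot = sum((yi - y_mean) ** 2 for yi in y)           # total variation in Y
r2 = 1 - ss_res / ss_tot

print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(abs(sum(residuals)) < 1e-9)        # True: residuals balance around the line
print(round(r2, 3))                      # 0.6
```

Some residuals fall above the fitted line and some below, and with an intercept in the model they sum to zero (up to floating-point error).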

**Why do we have errors in the regression model?**
* Omitted variables: factors that affect Y but are left out of the model
* Functional relationship between the Y and X variables: the true relationship may not be exactly linear

The theory of regression analysis is based on certain assumptions about these errors.
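To illustrate how the functional relationship between Y and X can generate errors: if the true relationship is curved but a straight line is fitted, the residuals are not random noise but follow a systematic pattern. The sketch below uses made-up data ($y = x^2$ exactly, with no noise), and the helper `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]          # truly quadratic, no noise at all

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(b0, b1)      # -2.0 4.0
print(residuals)   # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The U-shaped residuals (positive at the ends, negative in the middle) signal that the linear form is missing curvature, even though they still sum to zero.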

**NB:** The $\beta_i$ are actually unknown (just like $\mu$). What we estimate are the $b_i$, although analysts quite often mix these notations. Each $b_i$ can be considered a random variable that is normally distributed with $\beta_i$ as its mean and some standard deviation $\sigma_{b_i}$:

$$b_i \sim \text{Normal}(\beta_i, \sigma_{b_i})$$
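The claim that each $b_i$ is a random variable centered on $\beta_i$ can be checked with a small simulation. This is a sketch under stated assumptions: a simple regression with arbitrarily chosen true parameters (`beta0`, `beta1`), fixed X values and standard-normal noise. None of these come from the notebook's data.

```python
import random
import statistics

random.seed(0)

beta0, beta1 = 2.0, 0.5           # "true" parameters, chosen arbitrarily
x = list(range(30))               # fixed X values
x_mean = statistics.fmean(x)
sxx = sum((xi - x_mean) ** 2 for xi in x)


def estimate_slope():
    """Draw one noisy sample of Y and return the least-squares slope b1."""
    y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]
    y_mean = statistics.fmean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy / sxx


slopes = [estimate_slope() for _ in range(2000)]

print(statistics.fmean(slopes))   # close to beta1 = 0.5
print(statistics.stdev(slopes))   # spread of b1 across repeated samples
print(1 / sxx ** 0.5)             # textbook std of b1 when the noise std is 1
```

Every simulated sample gives a slightly different `b1`, but the estimates cluster around the true `beta1`, and their spread is close to the theoretical value for this setup.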

An important result that lets us assess the stability and precision of the coefficients and conduct hypothesis tests in a regression:

$$ \frac{b_i - \beta_i}{s_{b_i}} \sim t_{n-k-1} $$

Where:
* $i$ - index of the coefficient
* $n$ - number of observations
* $k$ - number of "X" variables
* $n-k-1$ - residual degrees of freedom
* $s_{b_i}$ - the standard error of $b_i$
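As an illustration of the ratio above, the sketch below computes the standard error and t statistic of the slope in a simple regression ($k = 1$) under the null hypothesis $\beta_1 = 0$. The dataset is made up, and the formula $s_{b_1} = s / \sqrt{S_{xx}}$ used for the standard error is the textbook one for simple regression, not something taken from the notebook.

```python
from math import sqrt

# Made-up data for a simple regression (k = 1 explanatory variable).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1

x_mean, y_mean = sum(x) / n, sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
b0 = y_mean - b1 * x_mean

df = n - k - 1                                   # residual degrees of freedom
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(ss_res / df)                            # residual standard error
s_b1 = s / sqrt(sxx)                             # standard error of b1

t_stat = (b1 - 0) / s_b1                         # H0: beta_1 = 0
print(df, round(s_b1, 4), round(t_stat, 4))      # 3 0.2828 2.1213
```

To turn `t_stat` into a p-value or a confidence interval, it would be compared with the $t_{n-k-1}$ distribution, for example via `scipy.stats.t`, which the notebook already imports. For the Toy-sales model, $n - k - 1 = 24 - 3 - 1 = 20$, which is the "Df Residuals" value in its summary.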

### Quiz 1

In [60]:
quiz_data = pd.read_excel("Excel files/1. Grocery Store Sales.xlsx", "Sheet1", index_col=None, na_values=["NA"])

In [62]:
quiz_data.head()

| | Store | Sales per Square Foot (\$) | Size of Store (in Sq. Ft.) | Advertising Dollars | Number of Products Offered in Store |
|---|---|---|---|---|---|
| 0 | 1 | 837 | 64796 | 22000 | 32920 |
| 1 | 2 | 748 | 74179 | 58000 | 25034 |
| 2 | 3 | 744 | 70298 | 58000 | 23989 |
| 3 | 4 | 853 | 63367 | 56000 | 31095 |
| 4 | 5 | 839 | 74412 | 67000 | 35055 |


1.

Download Grocery Store Sales, which provides data in the following categories: Sales per Square Foot, Size of Store (in Square Feet), Advertising Dollars (in thousands), and Number of Products Offered in Store, from a sample size of 70 grocery stores.

We want to see how changes in our independent variables affect Sales per Square Foot.

Please run one multiple regression including all independent variables to estimate the coefficients for each of our independent variables. 

What is the coefficient for Size of Store? Please round to three decimal places.

In [67]:
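X = quiz_data[["Size of Store (in Sq. Ft.)", "Advertising Dollars", "Number of Products Offered in Store"]]
Y = quiz_data["Sales per Square Foot ($)"]

# via sklearn (produces the first part of the output below)
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print("Regression via sklearn:\nCoefficients:", regr.coef_, "\n")
print(f"R^2: {regr.score(X, Y)}, Intercept: {regr.intercept_}")
print("\n---\nRegression via statsmodels")
# via statsmodels
X = sm.add_constant(X)
model = sm.OLS(Y, X, hasconst=True).fit()
print_model = model.summary()
print(print_model)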

Regression via sklearn:
Coefficients: [-0.00203405  0.00211828 -0.00369796] 

R^2: 0.3829461983505511, Intercept: 978.1349066228497

---
Regression via statsmodels
                                OLS Regression Results                               
Dep. Variable:     Sales per Square Foot ($)   R-squared:                       0.383
Model:                                   OLS   Adj. R-squared:                  0.355
Method:                        Least Squares   F-statistic:                     13.65
Date:                       Tue, 13 Apr 2021   Prob (F-statistic):           5.00e-07
Time:                               22:13:10   Log-Likelihood:                -373.65
No. Observations:                         70   AIC:                             755.3
Df Residuals:                             66   BIC:                             764.3
Df Model:                                  3                                         
Covariance Type:                   nonrobust                  

5.

What would be the expected Sales per Square Foot (in \$) if the Size of Store was 60,000 square feet, the store spent \$70,000 in Advertising Dollars, and it offered 30,000 products? Please round to two decimal places.

In [85]:
round(model.predict([1, 60000, 70000, 30000])[0], 2)  # the leading 1 is the value of the added constant (intercept) column

893.43