#### Lecture
* overfitting like memorizing - won't be able to predict on test data
     * usually means model is too complex
     * training score higher than test score
* underfitting
    * low performance on training and test
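
A rough sketch of both failure modes, assuming synthetic noisy sine data and polynomial models of increasing degree (neither is from the lecture). It reports mean squared error rather than a score, so lower is better: the degree-1 line underfits (high error on both sets), while the degree-12 polynomial nearly memorizes the 15 training points and does worse on the test set.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (synthetic, for illustration)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (1, 3, 12):
    # least squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={errors[degree][0]:.4f}  test MSE={errors[degree][1]:.4f}")
```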

* minimizing approximation error: Eapprox = (Etest - Etrain)
* better than approximation

* Linear Regression
    * most basic and popular ML/Stat technique
    * can be used to predict
    * linear relationship between DV and IV
    * yhat - predicted (dv) 
    * gives a line y=wx + b
    * w = weights, and b = constant that the algo gives

* Multiple Linear Regression
    * more than one feature - each with their own weights
    * one constant 
    * the greater the weight (w), the more influence that x has on target y (weights are only comparable when the features are on similar scales)
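
A minimal sketch of multiple linear regression with plain NumPy, assuming synthetic data with two features and known weights (`true_w`, `true_b` are made up for the example). A column of ones is appended so the single constant b is learned alongside the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: y = 3.0*x1 - 0.5*x2 + 4.0 + small noise
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -0.5])
true_b = 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# append a column of ones so the constant is estimated like any other weight
X1 = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = solution[:2], solution[2]

y_hat = X @ w + b  # predictions (yhat)
print("weights:", w, "constant:", b)
```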

* See walkthrough for more information

#### From Compass
* goal of ML is to find the function that predicts target, given a feature/s
* notation = L,l for "loss function" - how far a function h is from the target
* risk vs loss function
    * loss function 
        * how far an estimated value of a quantity is from its true value
        * y vs y^
        * examples:
            * squared error - always non-negative, sensitive to outliers
            * absolute error - not smooth (not differentiable at 0)
            * Lp loss - absolute error raised to the power p
            * Kullback-Leibler loss - more complex
        * loss function depends on selected data 
    * risk function
        * average measure of the loss
        * e.g. mean squared error = average of the squared error loss
        * expected value of the loss function - an integral over the data distribution, estimated in practice by the average over the sample
        * 'arg min' - looking for a function h from a pool of functions H to minimize risk function
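
A small sketch of loss vs risk, assuming a toy dataset and a hypothetical pool H of three functions h(x) = w*x. The loss is computed per point; the (empirical) risk is its average; the 'arg min' step picks the h in H with the lowest risk.

```python
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return np.abs(y - y_hat)

def lp_loss(y, y_hat, p):
    return np.abs(y - y_hat) ** p

def empirical_risk(losses):
    """Average of the per-point losses."""
    return float(np.mean(losses))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 8.0])

# hypothetical pool of functions H: h_w(x) = w * x
candidate_weights = (1.0, 2.0, 3.0)
risks = {w: empirical_risk(squared_error(y, w * x)) for w in candidate_weights}

# 'arg min': the function in H with the smallest risk
best_w = min(risks, key=risks.get)

y_hat = best_w * x
mse = empirical_risk(squared_error(y, y_hat))
mae = empirical_risk(absolute_error(y, y_hat))
l3 = empirical_risk(lp_loss(y, y_hat, p=3))
print(best_w, mse, mae, l3)
```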
    
* what can go wrong - generalization error
    * approximation error
        * chose an H or l that is too simple, so the accuracy of the model is low
        * not enough complexity to predict the target 
    * estimation error
        * lack of data points 
        * complex functions need more data to train
        * 100 pictures not enough for deep neural networks
    * optimization error
        * loss function is too complex
        * too many observations (data points) 

#### Linear Regression
* looking for a vector 'w' (weights) that will minimize our loss
* done by taking the derivative (gradient) of the loss with respect to w, setting the **gradient = 0**, and solving for w
* method of 'least squares' 
* getting sum of square residuals
    * subtract each value from the fitted line, square, then sum
    * sum of squared residuals should keep decreasing until smallest, end up with y=ax+b
* optimize a and b (slope and y int) to end up with smallest SSresiduals
* each point contributes a term ((a*x1+b)-y1)^2; sum these over all points
* take the derivative of this sum and find the point where the slope = 0
* take the derivative with respect to both the slope and the intercept - the optimal values for the best fit are where both derivatives are 0
* in summary:
    1. minimize the squared distance between the observed values and the line
    2. do this by taking the derivative and finding where it = 0
* the resulting line minimizes the sum of squares, i.e. it is the least squares fit
* to see how good the fitted line is, calculate R^2
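
The steps above can be sketched for a single feature, assuming a small made-up dataset. Setting both derivatives of the sum of squared residuals to 0 and solving gives the closed-form slope and intercept used below; the code then checks that the derivatives really are 0 at that point.

```python
import numpy as np

# made-up data lying roughly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# solution of dSS/da = 0 and dSS/db = 0
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b = y_mean - a * x_mean                                              # intercept

def ss_residuals(a, b):
    """Sum of squared residuals for the line y = a*x + b."""
    return float(np.sum(((a * x + b) - y) ** 2))

# derivatives of SS with respect to a and b, evaluated at the solution
grad_a = float(2 * np.sum(x * ((a * x + b) - y)))
grad_b = float(2 * np.sum((a * x + b) - y))

print(f"a={a:.2f}, b={b:.2f}, SS={ss_residuals(a, b):.4f}")
```

Nudging a or b away from these values (e.g. `ss_residuals(a + 0.1, b)`) gives a larger sum of squares.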

* calculate R^2, and p-value for R^2 
* first project the data onto the mean of y > find the variance around the mean (average sum of squares) 
* then on the original plot - find the variance around the fit (average SS(data-line))
* residuals from the line/fit are smaller, so Var(fit) < Var(mean)
* therefore some of the variance in the target y can be 'explained' by taking the feature x into account
* R^2 = (Var(mean) - Var(fit)) / Var(mean)
* R^2 = (SS(mean) - SS(fit)) / SS(mean) - same ratio
* e.g. 0.6 = 60% reduction in the variance of y when taking x into account
* in other words x explains 60% of the variance in y
* no residuals: R^2 = 1, x explains all the variance of y
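
A short sketch of the R^2 calculation, assuming a made-up dataset and a line fitted with `np.polyfit`. It computes the ratio from both the sums of squares and the variances to show they agree.

```python
import numpy as np

# made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.1, 2.9, 5.2, 4.8, 6.9])

a, b = np.polyfit(x, y, 1)  # least squares line y = a*x + b

ss_mean = np.sum((y - y.mean()) ** 2)    # squared distances to the mean
ss_fit = np.sum((y - (a * x + b)) ** 2)  # squared distances to the fitted line

r2_from_ss = (ss_mean - ss_fit) / ss_mean

# dividing both sums by n gives variances; the ratio is unchanged
var_mean, var_fit = ss_mean / len(y), ss_fit / len(y)
r2_from_var = (var_mean - var_fit) / var_mean

print(f"R^2 = {r2_from_ss:.3f}")
```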

* same thing for more complex models (multiple variables)
* for three variables, fit a plane instead of a line but still a least square fit 
* coefficient/weight for each variable, calculates residuals
* if a variable doesn't contribute anything - i.e. doesn't make SS(fit) smaller - then its coefficient becomes 0
* meaning adding variables never makes the fit worse: poor predictors just get coefficients of (near) zero based on SS(fit)
* but this is somewhat bad - any variable, even a random one, has some chance of reducing SS(fit) a little - leading to a better R^2
* adjusted R^2 scales R^2 by the number of parameters to compensate
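
A sketch of the usual adjusted R^2 formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of features. The function name is made up; the inputs are the rounded values printed in the two OLS summaries further down in these notes, so the results only match to about three decimals.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2: penalizes R^2 for the number of features in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# stock index model: R^2 = 0.898, 24 observations, 2 features
adj_stock = adjusted_r2(0.898, n_obs=24, n_features=2)

# house price model: R^2 = 0.835, 1458 observations, 10 features
adj_house = adjusted_r2(0.835, n_obs=1458, n_features=10)

print(round(adj_stock, 3), round(adj_house, 3))
```

The penalty is visible with 24 observations and barely noticeable with 1458.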

* is R^2 statistically significant?
* the p-value comes from F = variation explained by x / variation not explained by x (i.e. not explained by the fit)
* each variation is scaled by its degrees of freedom
* F = variation explained by fit / variation not explained by fit
* p-value: compare the observed F with the F values obtained from random data
* linear regression quantifies the relationship in your data
* want a large R^2 and a low p-value
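
A sketch of how F and its p-value relate to R^2, assuming the standard form F = (R^2 / p) / ((1 - R^2) / (n - p - 1)) and using `scipy.stats.f.sf` for the tail probability. The inputs are the rounded values from the stock index OLS summary further down in these notes, so F comes out close to, not exactly, the reported 92.07.

```python
from scipy import stats

def f_statistic(r2, n_obs, n_features):
    """F = variation explained by the fit / variation not explained, each per degree of freedom."""
    explained = r2 / n_features
    unexplained = (1 - r2) / (n_obs - n_features - 1)
    return explained / unexplained

n_obs, n_features = 24, 2
f_stock = f_statistic(0.898, n_obs, n_features)

# p-value: probability of an F at least this large under the F distribution
# with (n_features, n_obs - n_features - 1) degrees of freedom
p_stock = stats.f.sf(f_stock, n_features, n_obs - n_features - 1)

print(f"F = {f_stock:.2f}, p = {p_stock:.2e}")
```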


##### Linear Regression - using Statsmodels

* predictive model for linear relationships between dv and iv
* simple linear regression - one iv for the dv
* more complex - has multiple ivs Y = C + M1*X1 + M2*X2 + …


In [1]:
import pandas as pd
import statsmodels.api as sm

In [3]:
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]        
                }

df = pd.DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price']) 

# Two features here, so this is a multiple linear regression.
# For simple linear regression use a single feature, e.g. X = df['Interest_Rate'].
X = df[['Interest_Rate','Unemployment_Rate']]
Y = df['Stock_Index_Price']

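X = sm.add_constant(X) # add the constant (intercept) column; sm.OLS does not add one itself

model = sm.OLS(Y, X).fit() # ordinary least squares (OLS) linear regression
predictions = model.predict(X) # fitted values for the training rows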

print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:      Stock_Index_Price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sun, 29 May 2022   Prob (F-statistic):           4.04e-11
Time:                        13:18:35   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.24

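* adjusted R-squared reflects the fit of the model, adjusted for the number of variables - usually between 0 and 1 (can go negative for a very poor fit)
* const coeff - y intercept, the y value when all the x variables are zero
* coefficients for each variable - change in output y from a one unit change in that x, holding the other variables fixed
* std error - how precisely each coefficient is estimated (smaller = more precise)
* the coefficients can then be plugged into the equation for prediction 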

#### Linear Regression with House Pricing (df_train)

In [10]:
df_train = pd.read_csv('df_train.csv')

In [11]:
df_train.head()

Unnamed: 0,OverallQual,YearBuilt,ExterQual,BsmtQual,TotalBsmtSF,GrLivArea,FullBath,KitchenQual,GarageCars,OverallGrade,SalePrice
0,7.0,2003.0,4.0,4.0,856.0,1710.0,2.0,4.0,2.0,35.0,208500
1,6.0,1976.0,3.0,4.0,1262.0,1262.0,2.0,3.0,2.0,48.0,181500
2,7.0,2001.0,4.0,4.0,920.0,1786.0,2.0,4.0,2.0,35.0,223500
3,7.0,1915.0,3.0,3.0,756.0,1717.0,1.0,4.0,3.0,35.0,140000
4,8.0,2000.0,4.0,4.0,1145.0,2198.0,2.0,4.0,3.0,40.0,250000


In [4]:
import statsmodels.api as sm

In [12]:
df_train.columns

Index(['OverallQual', 'YearBuilt', 'ExterQual', 'BsmtQual', 'TotalBsmtSF',
       'GrLivArea', 'FullBath', 'KitchenQual', 'GarageCars', 'OverallGrade',
       'SalePrice'],
      dtype='object')

In [13]:
X = df_train[['OverallQual', 'YearBuilt', 'ExterQual', 'BsmtQual', 'TotalBsmtSF',
       'GrLivArea', 'FullBath', 'KitchenQual', 'GarageCars', 'OverallGrade']]
Y = df_train['SalePrice']

In [14]:
X = sm.add_constant(X) # adding the constant

In [15]:
lin_reg = sm.OLS(Y,X) # creates a linear regression object 

In [16]:
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.834
Method:                 Least Squares   F-statistic:                     732.0
Date:                Sun, 29 May 2022   Prob (F-statistic):               0.00
Time:                        13:33:52   Log-Likelihood:                -17206.
No. Observations:                1458   AIC:                         3.443e+04
Df Residuals:                    1447   BIC:                         3.449e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -8.992e+05   8.93e+04    -10.069   

#### Using sklearn

In [17]:
from sklearn.linear_model import LinearRegression

In [19]:
regressor = LinearRegression() # args that can be set: copy_X=True, fit_intercept=True, n_jobs=None, positive=False (normalize was removed in scikit-learn 1.2); fit_intercept controls the constant
regressor.fit(X, Y) 

In [21]:
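# np array of coefficients, one per column of X. The first entry is 0 because X still
# contains the 'const' column from sm.add_constant, while the intercept is fit separately
# (fit_intercept=True) and stored in regressor.intercept_; it can be turned off with fit_intercept=False.
# No p-values, so harder to interpret than statsmodels, but consistent with the other models in the library.
print(regressor.coef_)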

[     0.           5507.54189138    392.2863556   14466.78601472
    920.78618122     42.13854481     66.85496149 -11218.59562134
  11469.89475761   9314.43585305   1078.19597724]


In [22]:
regressor.score(X,Y) #R^2 for regressor

0.8349492071391901
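
A sketch of where sklearn keeps the constant, assuming synthetic data rather than `df_train`. With `fit_intercept=True` the intercept lives in `intercept_`, so a leftover column of ones gets a coefficient of 0, and `score` is the same R^2 that `r2_score` computes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# synthetic data: y = 5 + 2*f1 - 3*f2 + small noise
features = rng.normal(size=(100, 2))
y = 5.0 + 2.0 * features[:, 0] - 3.0 * features[:, 1] + rng.normal(scale=0.1, size=100)

# X with a leading column of ones, similar to the output of sm.add_constant
with_const = np.column_stack([np.ones(100), features])

reg_const = LinearRegression().fit(with_const, y)  # constant column is redundant here
reg_plain = LinearRegression().fit(features, y)    # usual sklearn usage, no constant column

print("coef_ with const column:", reg_const.coef_)
print("intercept_:", reg_const.intercept_)

r2_score_method = reg_plain.score(features, y)
r2_manual = r2_score(y, reg_plain.predict(features))
```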

#### Classification
* target variable is categorical (not numeric) - with at least 2 categories
* based on number of categories - can be binary (2 cats) or multinomial classification (3+)
* MNIST - for images (computer vision) - classifying hand-written digits
* text emails - natural language processing 
* internet is largely made up of these two types of data (images and text)
* spam vs real emails
* targeted online advertising (click or no click / buy or no buy)
* figures out pattern between inputs and targets -> then predict

* typically use `RandomForestClassifier`
``` python
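from sklearn.ensemble import RandomForestClassifier

# X: feature matrix, Y: categorical target (assumed to be defined already)
model = RandomForestClassifier()
model.fit(X, Y) # learning
predictions = model.predict(X)
model.score(X, Y) # mean accuracy (here measured on the training data)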
```
* accuracy/classification rate = #correct/#total
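
A sketch of the accuracy calculation on held-out data, assuming the iris dataset bundled with sklearn (a 3-class, i.e. multinomial, problem) and a 70/30 train/test split. The dataset and split are choices made for the example. Counting correct predictions by hand gives the same number as `model.score`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)          # learning
predictions = model.predict(X_test)  # predict on unseen data

# accuracy = #correct / #total
n_correct = int(np.sum(predictions == y_test))
accuracy = n_correct / len(y_test)

print(f"{n_correct}/{len(y_test)} correct, accuracy = {accuracy:.3f}")
```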

#### Polynomial Linear Regression
* similar to multiple linear regression - the exception is that the features are transformed
* still falls under linear regression e.g. y = b0 + b1*x1 + b2*x1^2 - one variable but different powers of that variable
* good for fitting curves in data
* 'linear' refers to the coefficients - the model is still linear in them - only the variables are being transformed
* can still be expressed as a linear equation
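
A minimal sketch, assuming noise-free made-up data from y = 1 + 2x + 3x^2 and sklearn's `PolynomialFeatures` for the transformation. The feature x is expanded into the columns x and x^2, and an ordinary `LinearRegression` is fitted on those columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# made-up curved data: y = 1 + 2x + 3x^2
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# transform the single feature into [x, x^2]; the constant is left to LinearRegression
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# plain linear regression on the transformed features
model = LinearRegression().fit(x_poly, y)

print("b1, b2:", model.coef_, "b0:", model.intercept_)
```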