# Notes on Topics 21 - 24

## Topic 21: OOP + Appendix: More OOP
- Classes should follow CamelCase convention
- search order (method resolution order): instance, class, superclass(es)

#### Intro to OOP - Crash Course
- Four main principles: 
    - Encapsulation (information hiding):
        - refers to bundling data with methods that can operate on that data within a class
        - idea of hiding data within a class, preventing anything outside that class from directly interacting with it (e.g. edit its attributes)
        - members of other classes can interact with attrs of another object through its methods
        - keeps programmer in control of access to data and prevents program from ending up in any strange or unwanted states
    - Abstraction: 
        - refers to only showing essential details and hiding everything else
        - users of the class should not worry about inner workings/details of the class
        - interface: the way sections of code can communicate with one another
            - typically done through (getter) methods that each class can access
        - implementation: implementation of the methods, or the method code, should be hidden
        - prevent classes from becoming entangled, or else changes in one class will have a ripple effect on the others
        - allows program to be worked on incrementally, prevents entanglement, and reduces complexity
    - Inheritance: 
        - derive classes from other classes
        - Access modifiers:
            - Public: members/classes can be accessed from anywhere in your program
            - Private: members can only be accessed within the class in which they are defined
            - Protected: members can be accessed within the class in which they are defined, as well as its subclasses
    - Polymorphism
        - methods can take on many forms
        - dynamic polymorphism
            - occurs during runtime
            - method signature is in both a subclass and a superclass, subclass overrides
            - methods share same name/parameters but have different implementation
        - static polymorphism (method overloading)
            - occurs during compile time
            - methods with the same name but different arguments (different number, type or order of parameters) are defined in the same class
- Getting vs Setting vs Deleting Methods:
    - implemented with the `@property` decorator (with `.setter` and `.deleter` for setting and deleting)
    - Getting methods: get info from an object (attr/var could only be referenced, not changed \[read-only\])
    - Setting methods: set attrs to different values
    
#### Appendix
- Domain Model
- Abstract super classes (class AbsCls(object))
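
The abstract-superclass idea can be sketched with the standard-library `abc` module. This is only an illustration: the notes do not show their `AbsCls(object)` class, so `Shape`, `Square`, and `Circle` are hypothetical names. The sketch also shows dynamic polymorphism, since each subclass overrides `area`.

```python
from abc import ABC, abstractmethod


class Shape(ABC):                      # hypothetical abstract superclass
    @abstractmethod
    def area(self):
        """Subclasses must provide their own implementation."""

    def describe(self):                # concrete method shared by all subclasses
        return f"{type(self).__name__} with area {self.area():.2f}"


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):                    # overrides the abstract method
        return self.side ** 2


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):                    # same name, different implementation
        return 3.141592653589793 * self.radius ** 2


shapes = [Square(2), Circle(1)]
descriptions = [s.describe() for s in shapes]   # same call, behaviour depends on the subclass
```

Calling `Shape()` directly raises a `TypeError`, because the abstract method has no implementation.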

In [None]:
class Employee:                                             # defining class
    
    num_of_emps = 0
    raise_amount = 1.04                                     # class variable
    
    def __init__(self, first, last, pay):                   # constructor
        self.first = first                                  # instance attribute (variable defined in constructor)
        self.last = last
        self.pay = pay
        Employee.num_of_emps += 1
    
    @property                                               # getter method
    def email(self):                                        # defining email attribute in the class like it is a method
        return '{}.{}@email.com'.format(self.first, self.last)
    
    @property
    def fullname(self):                                  
        return '{} {}'.format(self.first, self.last)
    
    @fullname.setter                                        # setter method
    def fullname(self, name):                               # executed when emp_1.fullname is assigned a string
        first, last = name.split(' ')
        self.first = first
        self.last = last
    
    @fullname.deleter                                       # deleter method (executed if we run 'del emp_1.fullname')
    def fullname(self):
        self.first = None
        self.last = None

    def apply_raise(self):                                  # instance method (function defined within the class)
        self.pay = int(self.pay * self.raise_amount)
        
    @classmethod                                            # class method decorator
    def set_raise_amt(cls, amount):                         # class method, can use as alternative constructor (seen in datetime module)
        cls.raise_amount = amount                           # must match the class variable name used in apply_raise
    
    @staticmethod                                           # static method decorator
    def is_workday(day):                                    # static method, doesn't auto pass in instance or class
        if day.weekday() == 5 or day.weekday() == 6:
            return False
        return True
        
class Manager(Employee):                                    # subclass(inherited superclass)
    
    def __init__(self, first, last, pay, employees=None):   # never set mutable data types as default arguments
        super().__init__(first, last, pay)
        if employees is None:
            self.employees = []
        else:
            self.employees = employees
            
    def add_emp(self, emp):
        if emp not in self.employees:
            self.employees.append(emp)

    def remove_emp(self, emp):
        if emp in self.employees:
            self.employees.remove(emp)

emp_1 = Employee('Corey','Schafer', 50000) # instance instantiated
emp_1.first = 'corey' # instance variable set outside of the constructor
Employee.set_raise_amt(1.05)

print(Employee.__dict__) # class variables and methods
print(emp_1.__dict__) # instance variables


Special (Magic/Dunder) Methods on bottom of page

<hr style="border:1px solid gray"> </hr>


## Topic 22: Linear Algebra + Appendix: More Lin. Alg.

### Systems of linear equations

In [None]:
"""
    q + d = 30
    .25q + .1d = 5.7
"""
import numpy as np
import numpy.linalg as la
A = np.array([[  1, 1],
              [.25,.1]])
s = np.array([[30],[5.7]])
la.solve(A,s)

### Scalars, Vectors, Matrices, and Tensors
- vector: 1-D tensor, matrix: 2-D tensor

In [None]:
# numpy methods/attrs (A and X stand for any 2-D array)
np.array([[1, 2], [3, 4]])        # array (vector/matrix)
x = np.linspace(0, 1, 5)          # np.linspace(start, stop, num): 'num' evenly spaced samples from 'start' to 'stop'
# x[i], x[-i], x[a:b]             # indexing and slicing, x[i] gets the i-th element
# x[::-1]                         # reverse the vector
# x[0::2]                         # every other element
np.asmatrix(A)                    # returns matrix([[,],[,]]) instead of array
                                  # does not make a copy if the input is already a matrix or an ndarray.
                                  # Equivalent to np.matrix(data, copy=False)
# np.mat(A)                       # alias of np.asmatrix in older NumPy releases
# X[:, :]                         # index or assign new value
# X.shape or np.shape(X)
# X.T or np.transpose(X)

### Matrix Multiplication
- element-wise multiplication: matrices must be same shape
- dot product: $A_{n \times m} \cdot B_{m \times o} = C_{n \times o}$
- cross product: $a \times b = \mid a \mid  \mid b \mid \sin(\theta) n $
     - $\mid a \mid$  is the magnitude (length) of vector $a$
     - $\mid b \mid$  is the magnitude (length) of vector $b$
     - $\theta$ is the angle between $a$ and $b$
     - $n$ is the unit vector at right angles to both $a$ and $b$
     - $a \times b = - (b \times a)$

In [None]:
A * B # element-wise/Hadamard Product
A.dot(B) # or
np.dot(A, B) # dot product
np.cross(a, b) # cross product
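
As a quick check of the three products listed above, here is a small sketch with two hand-picked 3-D vectors (the values are arbitrary and chosen only so the results are easy to verify by hand).

```python
import numpy as np

a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])

elementwise = a * b            # Hadamard product, same shape as a and b
dot = np.dot(a, b)             # scalar
cross = np.cross(a, b)         # vector at right angles to both a and b

# Check |a x b| = |a| |b| sin(theta), with theta recovered from the dot product
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.arccos(cos_theta)
magnitude = np.linalg.norm(a) * np.linalg.norm(b) * np.sin(theta)
```

Swapping the arguments, `np.cross(b, a)`, flips the sign of the result.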

### Solving Systems of Linear Equations with NumPy
- $A \cdot A^{-1} = I$
- **($A \cdot X = B$)**  -->  ($ A^{-1} \cdot A \cdot X = A^{-1} \cdot B$)  -->  ($I \cdot X = A^{-1} \cdot B$)  -->  **($X = A^{-1} \cdot B$)**

In [None]:
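np.zeros((r, c)) # or np.ones((r, c)), the shape is passed as a tuple
np.eye(4) # or
np.identity(4, dtype=int)
la.inv(A)
np.round(A) # round every element
la.solve(A, s)
np.square(x) # or np.sqrt(x)
np.arccos(x) # also np.sin(x), np.tan(x); angles are in radians
np.random.rand(r, c) # uniform random samples in [0, 1) with shape (r, c)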
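
The derivation $X = A^{-1} \cdot B$ can be tried on the coin system from the start of this topic ($q + d = 30$, $.25q + .1d = 5.7$). This is a minimal sketch; the variable names `X_inv` and `X_solve` are only for illustration.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.25, 0.10]])
B = np.array([30.0, 5.7])

X_inv = np.linalg.inv(A) @ B          # X = A^{-1} B, following the derivation
X_solve = np.linalg.solve(A, B)       # same answer without forming the inverse explicitly
identity_check = A @ np.linalg.inv(A) # should be (numerically) the identity matrix
```

Both approaches give $q = 18$ and $d = 12$; `solve` is usually preferred in practice because it avoids computing the inverse.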

### Regression Analysis using Linear Algebra and NumPy
$$
    \left[ {\begin{array}{cc}
   1 & 1 \\
   1 & 2 \\
   1 & 3 \\
  \end{array} } \right]
   \left[ {\begin{array}{c}
   c \\
   m \\
  \end{array} } \right] =
    \left[ {\begin{array}{c}
    1 \\
    2 \\
    2 \\
  \end{array} } \right] 
$$

If you don't include this constant (column of ones), then the fitted line is constrained to pass through the origin (0,0)

In [None]:
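from numpy.polynomial.polynomial import polyfit # least squares polynomial fit, returns coefficients from lowest degree to highest
# https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
# (the link documents np.polyfit, which returns the highest-degree coefficient first)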

In [None]:
c, m = polyfit(x, y, 1) # 1 is degree, c(intercept) and m(slope)
plt.plot(x, y, 'o')
plt.plot(x, c + (m * x), '-')
plt.xticks(x)

### Ordinary least squares 

A common measure to find and minimize the value of this error is called *Ordinary Least Squares*. 

This says that our dependent variable is composed of a linear part and an error term. The linear part is composed of an intercept and independent variable(s), along with their associated raw score regression weights.

In matrix terms, the same equation can be written as:

$ y = \boldsymbol{X} b + e $

This says that to get $y$, multiply $\boldsymbol{X}$ by the vector $b$ (the unknown parameters, the vector version of $m$ and $c$), then add an error term. We create a matrix $\boldsymbol{X}$, which has an extra column of **1**s in it for the intercept. For each day, the **1** multiplies the intercept, which is the first entry of the column vector $b$.

Let's assume that the error is equal to zero on average and drop it to sketch a proof:

$ y = \boldsymbol{X} b$


Now let's solve for $b$, so we need to get rid of $\boldsymbol{X}$. First we will make X into a nice square, symmetric matrix by multiplying both sides of the equation by $\boldsymbol{X}^T$ :

$\boldsymbol{X}^T y = \boldsymbol{X}^T \boldsymbol{X}b $

And now we have a square matrix that with any luck has an inverse, which we will call $(\boldsymbol{X}^T\boldsymbol{X})^{-1}$. Multiply both sides by this inverse, and we have

$(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T y =(\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{X}b $


$(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T y =I b $

$ b= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T y $

$ b= \hat{X} $

With least squares regression, in order to solve for the expected value of weights, referred to as $\hat{X}$ ("$X$-hat"), you need to solve the above equation. 

Remember all above variables represent vectors. The elements of the vector X-hat are the estimated regression coefficients $c$ and $m$ that you're looking for. They minimize the error between the model and the observed data in an elegant way that uses no calculus or complicated algebraic sums.


In [None]:
# Calculate an OLS Regression Line
X = np.array([[1, 1],[1, 2],[1, 3]])
y = np.array([1, 2, 2])
Xt = X.T
XtX = Xt.dot(X)
XtX_inv = np.linalg.inv(XtX)
Xty = Xt.dot(y)
x_hat = XtX_inv.dot(Xty) # the value for b shown above
x_hat

In [None]:
# plotting regression line
x = X[:, 1] # second column of X holds the x values
plt.plot(x, y, 'o')
plt.plot(x, x_hat[0] + (x_hat[1] * x), '-')

observed data $\rightarrow$ $y = b_0+b_1x_1+b_2x_2+ \ldots + b_px_p+ \epsilon $

predicted data $\rightarrow$ $\hat y = \hat b_0+\hat b_1x_1+\hat b_2x_2+ \ldots + \hat b_px_p $

error $\rightarrow$ $\epsilon = y - \hat y $
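
To make the observed / predicted / error split concrete, the sketch below reuses the three-point example from this section and computes $\hat y$ and $\epsilon$ directly. The comparison with `np.linalg.lstsq` is an addition for cross-checking, not part of the original notes.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]], dtype=float)   # column of ones + x values
y = np.array([1, 2, 2], dtype=float)

b_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # normal equation: intercept c, slope m
y_hat = X @ b_hat                          # predicted data
residuals = y - y_hat                      # error = y - y_hat
rss = np.sum(residuals ** 2)               # residual sum of squares

# Cross-check against NumPy's built-in least squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here the intercept is $2/3$ and the slope is $1/2$, and the residuals sum to zero because the model includes an intercept.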

### Regression Analysis using Linear Algebra and NumPy Lab
1. Prepare data for modeling
2. Perform an 80/20 train-test split
3. Calculate the beta, $\beta = (x_\text{train}^T \cdot x_\text{train})^{-1} \cdot x_\text{train}^T \cdot y_\text{train}$
4. Make predictions
5. Evaluate the model
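
The five lab steps can be sketched end to end with NumPy alone. The data here is synthetic (an assumption, since the lab's dataset is not shown in these notes), and the split is done by shuffling row indices rather than with scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Prepare data: synthetic features plus a column of ones for the intercept
n = 100
features = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1] + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), features])

# 2. 80/20 train-test split on shuffled row indices
idx = rng.permutation(n)
split = int(0.8 * n)
train, test = idx[:split], idx[split:]
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

# 3. beta = (X_train^T X_train)^{-1} X_train^T y_train
beta = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train

# 4. Make predictions on the held-out rows
y_pred = X_test @ beta

# 5. Evaluate the model with mean squared error
mse = np.mean((y_test - y_pred) ** 2)
```

Because the data was generated with intercept 3 and slopes 2 and -1, `beta` should land close to those values.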

### Computational Complexity: From OLS to Gradient Descent
- The Big O Notation:
    - $O(\log n)$: aka $\log$ time
    - $O(n)$: aka linear time
    - $O(n^2)$: quadratic
    - $O(n^3)$: cubic
        - inverting a matrix
- OLS linear regression is computed as: $(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T y$.

    - If $\boldsymbol{X}$ is an $(n \times k)$ matrix:

    - $(\boldsymbol{X}^T\boldsymbol{X})$ takes $O(n*k^2)$ time and produces a $(k \times k)$ matrix
    - The matrix inversion of a $(k \times k)$ matrix takes $O(k^3)$ time
    - $(\boldsymbol{X}^T y)$ takes $O(n*k)$ time and produces a $(k \times 1)$ vector
    - The final multiplication of the $(k \times k)$ inverse by the $(k \times 1)$ vector takes $O(k^2)$ time
    
**So the Big O running time for OLS is $O(k^2*(n + k))$, which is pretty expensive**

### Appendix: Vector Addition and Broadcasting in NumPy
- you can add/subtract (element-wise or broadcasting) vectors and matrices
- Broadcasting Analogy: scalar + vector : vector + matrix
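
A short sketch of the broadcasting analogy (the array values are arbitrary): a scalar is stretched to match a vector, and a vector is stretched across the rows of a matrix.

```python
import numpy as np

v = np.array([1, 2, 3])
M = np.array([[10, 20, 30],
              [40, 50, 60]])

scalar_plus_vector = 5 + v          # the scalar is added to every element of v
vector_plus_matrix = v + M          # v is added to every row of M
column = np.array([[100], [200]])   # shape (2, 1)
column_plus_matrix = column + M     # the column is added to every column of M
```

Shapes must be compatible: adding a length-2 vector to `M` would raise a `ValueError`, because 2 does not match the trailing dimension 3.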

### Appendix: Properties of Dot Product
- Distributive Property - matrix multiplication IS distributive
    - $A \cdot (B+C) = (A \cdot B + A \cdot C) $
- Associative Property - matrix multiplication IS associative
    - $A \cdot (B \cdot C) = (A \cdot B) \cdot C $
- Commutative Property - matrix multiplication is NOT commutative
    - $A \cdot B \neq B \cdot A $
- Commutative Property - vector multiplication IS commutative
    - $x^T \cdot y = y^T \cdot x$
- Simplification of the matrix product
    - $ (A \cdot B)^T = B^T \cdot A^T $
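
These properties can be spot-checked numerically. The sketch below uses random matrices (an arbitrary choice), so it illustrates the rules rather than proving them.

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))
x, y = rng.normal(size=3), rng.normal(size=3)

distributive = np.allclose(A @ (B + C), A @ B + A @ C)
associative = np.allclose(A @ (B @ C), (A @ B) @ C)
commutative_matrices = np.allclose(A @ B, B @ A)     # expected to be False in general
commutative_vectors = np.isclose(x @ y, y @ x)
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)
```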

<hr style="border:1px solid gray"> </hr>


## Topic 23: Calculus, Cost Function, and Gradient Descent + Appendix: More on Derivatives

### Intro to Derivatives
$ f'(x) = \lim_{ h\to0} \frac{f(x + h) - f(x)}{h} $

In [9]:
def term_output(array, input_value):
    return array[0]*input_value**array[1]
# ex. term_output(np.array([3, 2]), 2)
def output_at(array_of_terms, x_value):
    outputs = []
    for i in range(int(np.shape(array_of_terms)[0])):
        outputs.append(array_of_terms[i][0]*x_value**array_of_terms[i][1])
    return sum(outputs)
def delta_f(array_of_terms, x_value, delta_x):
    return output_at(array_of_terms, x_value + delta_x) - output_at(array_of_terms, x_value)
def derivative_of(array_of_terms, x_value, delta_x):
    delta = delta_f(array_of_terms, x_value, delta_x)
    return round(delta/delta_x, 3)
def tangent_line(array_of_terms, x_value, line_length = 4, delta_x = .01):
    y = output_at(array_of_terms, x_value)
    derivative_at = derivative_of(array_of_terms, x_value, delta_x)    
    x_dev = np.linspace(x_value - line_length/2, x_value + line_length/2, 50)
    tan = y + derivative_at *(x_dev - x_value)
    return {'x_dev':x_dev, 'tan':tan, 'lab': " f' (x) = " + str(derivative_at)}


In [None]:
lin_function = np.array([[4, 1], [15, 0]])
x_values = np.linspace(0, 5, 100)
y_values = [output_at(lin_function, x) for x in x_values]
plt.plot(x_values, y_values, label = "4x + 15")


### Derivatives of Non-Linear Functions

In [None]:
def make_plot(delta_a):
    lab= "delta x = " + str(delta_a)
    plt.plot(x, f(x), label = lab)
    plt.hlines(y=9, xmin=1, xmax=3, linestyle = "dashed", color= 'lightgrey')
    plt.vlines(x=2, ymin=1, ymax=4, linestyle = "dashed", color= 'lightgrey')
    plt.hlines(y=4, xmin=1, xmax=2, linestyle = "dashed", color= 'lightgrey')
    plt.vlines(x=3, ymin=1, ymax=9, linestyle = "dashed", color= 'lightgrey')
    # tangent line
    x_dev = np.linspace(1.5, 3.2, 100)
    a = 2
    fprime = (f(a+delta_a)-f(a))/delta_a 
    tan = f(a)+fprime*(x_dev-a)
    # plot of the function and the tangent
    plt.plot(x_dev, tan, color = "black", linestyle="dashed")
    plt.legend(loc="upper left", bbox_to_anchor=[.5, 1],
           ncol=2, fancybox=True);

### Rules for Derivatives + Lab (in Appendix)
- Power Rule
    - $f(x) = x^r $ ---> $ f'(x) = r*x^{r-1} $
- Constant Factor Rule
    - $\frac{\Delta}{\Delta x}(a*f(x)) = a * \frac{\Delta}{\Delta x}f(x) $  
- Addition Rule
    - the derivative of a sum of terms is the sum of the derivatives of each term
- Chain Rule
    - $ F(x) = f(g(x)) $ ---> $ F'(x) = f'(g(x))*g'(x) $

In [None]:
def find_term_derivative(term):
    constant = term[0]*term[1]
    exponent = term[1] - 1 
    return np.array([constant, exponent])
# return something which looks like: np.array([constant, exponent])
def find_derivative(function_terms):
    der_array = np.zeros(np.shape(function_terms))
    for i in range(int(np.shape(function_terms)[0])):
        der_array[i] = find_term_derivative(function_terms[i])
    return der_array
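
The chain rule from the list above is not covered by the term-based helpers, so here is a small numerical check. It is a sketch with a hypothetical pair of functions, $g(x) = 3x + 1$ and $f(u) = u^2$, and uses a central difference instead of the forward difference used elsewhere in these notes.

```python
def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


def g(x):            # inner function, g'(x) = 3
    return 3 * x + 1


def f(u):            # outer function, f'(u) = 2u
    return u ** 2


def F(x):            # composition F(x) = f(g(x))
    return f(g(x))


x0 = 2.0
chain_rule_value = 2 * g(x0) * 3             # f'(g(x0)) * g'(x0)
numeric_value = numerical_derivative(F, x0)  # should agree with the chain rule
```

Expanding by hand gives $F(x) = 9x^2 + 6x + 1$ and $F'(2) = 42$, which matches both values.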

### Derivatives: Conclusion
- local minima/maxima of a differentiable function can only occur where $ f'(x) = 0 $ (not every such point is a minimum or maximum)

### Intro to Gradient Descent
- minimize cost function
    - in linear regression, the cost function is RSS (Residual Sum of Squares) and is reduced by adjusting the parameters (slopes and intercept)

### Gradient Descent: Step Sizes (learning rate)
- slope of the cost curve tells us our step size

We use the following procedure to find the ideal $m$: 
1.  Randomly choose a value of $m$ (random initialization)
2.  Update $m$ with the formula $ m_{i+1} = m_i - a * slope_{m = m_i}$, where $a$ is the learning rate
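
The procedure above can be sketched on a one-parameter cost curve. The quadratic cost $(m - 3)^2$ is a made-up example, and the starting value is fixed at 0 rather than random so the run is repeatable.

```python
def cost(m):
    return (m - 3) ** 2        # hypothetical cost curve with its minimum at m = 3


def slope(m):
    return 2 * (m - 3)         # derivative of the cost with respect to m


learning_rate = 0.1
m = 0.0                        # fixed start instead of random initialization
history = [m]
for _ in range(100):
    m = m - learning_rate * slope(m)   # step against the slope
    history.append(m)
```

The steps shrink as the slope flattens near the minimum, so `m` approaches 3 without an explicit stopping rule.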

### Gradient Descent in 3D and The Gradient in Gradient Descent
- gradient descent: taking the shortest path to descend towards our minimum
- we denote the gradient of a function, $f(x, y)$, with $\nabla f(x, y)$
- if $\frac{\delta f}{\delta y} > \frac{\delta f}{\delta x}$, we should move more in the $y$ direction than in the $x$ direction, and vice versa. The ratio of the two moves is $ \frac{\delta f}{\delta y} $ divided by $ \frac{\delta f}{\delta x} $. For example, when $ \frac{\delta f}{\delta y} = 3 $ and $ \frac{\delta f}{\delta x} = 2$, we travel along a line with a slope of 3/2.
- gradient ascent = $\nabla f(x, y)$, gradient descent = $-\nabla f(x, y)$

### Gradient to Cost Function + Lab (in appendix)
- cost function for linear regression:
$$
\begin{align}
J(m, b) & = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 &&\text{cost function, $J$, (representing RSS)}\\
J(m, b) & = \sum_{i=1}^{n}(y_i - (mx_i + b))^2 &&\text{notice $\hat{y} = mx + b$}\\
\end{align}
$$

Take Partial Derivatives of the Cost Function:

$$
\begin{align}
\frac{\delta J}{\delta m}J(m, b) & = \boldsymbol{\frac{\delta J}{\delta m}}(y - (mx + b))^2 = -2x*(y - (mx + b )) &&\text{partial derivative with respect to} \textbf{ m}
\end{align}
$$

$$
\begin{align}
\frac{\delta J}{\delta b}J(m, b) & = \boldsymbol{\frac{\delta J}{\delta b}}(y - (mx + b))^2 = -2*(y - (mx + b)) &&\text{partial derivative with respect to} \textbf{ b}
\end{align}
$$

Replace $y - \hat{y}$ with $\epsilon$, our error:

$$ \frac{dJ}{dm}J(m,b) = -2*x(y - (mx + b )) = -2*x(y - \hat{y})  = -2x*\epsilon $$

$$ \frac{dJ}{db}J(m,b) = -2*(y - (mx + b)) = -2*(y - \hat{y}) = -2\epsilon $$

$$ \frac{dJ}{dm}J(m,b) = -2*\sum_{i=1}^n x_i(y_i - \hat{y}_i)  = -2*\sum_{i=1}^n x_i*\epsilon_i$$
$$ \frac{dJ}{db}J(m,b) = -2*\sum_{i=1}^n(y_i - \hat{y}_i) = -2*\sum_{i=1}^n \epsilon_i$$

In the context of gradient descent, we use these partial derivatives to decide each step.  Remember that our step should be in the opposite direction of our partial derivatives as we are *descending* towards the minimum.  So to take a gradient descent step we use the general formula of:

`current_m` =  `old_m` $ - \frac{dJ}{dm}J(m,b)$

`current_b` =  `old_b` $ - \frac{dJ}{db}J(m,b) $

or, substituting the expressions we just derived:

`current_m` = `old_m` $ -  (-2*\sum_{i=1}^n x_i*\epsilon_i )$

`current_b` =  `old_b` $ - ( -2*\sum_{i=1}^n \epsilon_i )$
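
The update rules above can be turned into a short loop. This sketch multiplies each gradient by a learning rate, as in the step-size section (the formulas above leave it out); the value 0.01, the iteration count, and the three data points are assumptions for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0])


def step(m, b, x, y, learning_rate):
    error = y - (m * x + b)              # epsilon_i for every observation
    grad_m = -2 * np.sum(x * error)      # dJ/dm
    grad_b = -2 * np.sum(error)          # dJ/db
    # move against the gradient, scaled by the learning rate
    return m - learning_rate * grad_m, b - learning_rate * grad_b


m, b = 0.0, 0.0
for _ in range(5000):
    m, b = step(m, b, x, y, learning_rate=0.01)
```

For these three points the loop settles at the same slope (0.5) and intercept (2/3) that the closed-form OLS solution gives.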

### Applying Gradient Descent Lab
Note that, for our gradients, when having multiple predictors $x_j$ with $j \in 1,\ldots, k$

$$ \frac{dJ}{dm_j}J(m_j,b) = -2\sum_{i = 1}^n x_{j,i}(y_i - (\sum_{l=1}^k m_l x_{l,i} + b)) = -2\sum_{i = 1}^n x_{j,i}*\epsilon_i$$
$$ \frac{dJ}{db}J(m_j,b) = -2\sum_{i = 1}^n(y_i - (\sum_{l=1}^k m_l x_{l,i} + b)) = -2\sum_{i = 1}^n \epsilon_i $$
    

So we'll have one gradient per predictor along with the gradient for the intercept!
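
With several predictors, all of the slope gradients can be computed in one matrix product. The following is a sketch on noise-free synthetic data; the true coefficients, learning rate, and iteration count are arbitrary assumptions.

```python
import numpy as np


def gradients(X, y, m, b):
    """X has shape (n, k), m has shape (k,). Returns (dJ/dm, dJ/db)."""
    error = y - (X @ m + b)          # epsilon_i for every observation
    grad_m = -2 * X.T @ error        # one entry per predictor j
    grad_b = -2 * np.sum(error)      # gradient for the intercept
    return grad_m, grad_b


rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
true_m, true_b = np.array([1.5, -2.0, 0.5]), 4.0
y = X @ true_m + true_b              # synthetic targets without noise

m, b = np.zeros(3), 0.0
learning_rate = 0.001
for _ in range(2000):
    grad_m, grad_b = gradients(X, y, m, b)
    m, b = m - learning_rate * grad_m, b - learning_rate * grad_b
```

`grad_m` has one entry per predictor, which matches the "one gradient per predictor" statement above.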

<hr style="border:1px solid gray"> </hr>


## Topic 24: Feature Selection, Ridge and Lasso (Regularization)
- Regularization: general technique of battling overfitting

### Ridge and Lasso Regression (L2/L1 Norm Regularization) + Lab
[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) and [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).
- Lasso:"Least Absolute Shrinkage and Selection Operator"
    - performs estimation and selection simultaneously
    - better when we have more features
- add penalty for large coefficients (AKA penalized estimations)
- Advantages:
    - reduce model complexity
    - may prevent overfitting
    - Some of them (e.g. Lasso) may perform variable selection at the same time (when coefficients are set to 0)
    - can be used to counter multicollinearity
- must standardize data before using either of these
- scale after train-test-split to prevent data-leakage
- only fit scaler on training data

$ \text{cost_function_linear}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - \sum_{j=1}^k(m_jx_{ij} ) -b )^2$
    
$ \text{cost_function_ridge}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k m_j^2 = \sum_{i=1}^n(y_i - \sum_{j=1}^k(m_jx_{ij})-b)^2 + \lambda \sum_{j=1}^k m_j^2$

$ \text{cost_function_lasso}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k \mid m_j \mid = \sum_{i=1}^n(y_i - \sum_{j=1}^k(m_jx_{ij})-b)^2 + \lambda \sum_{j=1}^k \mid m_j \mid$
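
The three cost functions differ only in the penalty term, which a few lines of NumPy make explicit. The function names and the tiny dataset are hypothetical. Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ used here.

```python
import numpy as np


def rss(X, y, m, b):
    return np.sum((y - (X @ m + b)) ** 2)                 # unpenalized cost


def ridge_cost(X, y, m, b, lam):
    return rss(X, y, m, b) + lam * np.sum(m ** 2)         # L2 penalty on the slopes


def lasso_cost(X, y, m, b, lam):
    return rss(X, y, m, b) + lam * np.sum(np.abs(m))      # L1 penalty on the slopes


X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
m = np.array([1.0, -0.5])
b = 0.0

linear = rss(X, y, m, b)
ridge = ridge_cost(X, y, m, b, lam=2.0)
lasso = lasso_cost(X, y, m, b, lam=2.0)
```

The intercept $b$ is not penalized in either case, matching the formulas above.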

- [Full Code](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/) of Ridge and Lasso Regression

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
y = data[['mpg']] # t-t split X and y
X = data.drop(['mpg', 'car name', 'origin'], axis=1)
X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
scale = MinMaxScaler() # scale after t-t-split
X_train_transformed = scale.fit_transform(X_train)
X_test_transformed = scale.transform(X_test)
ridge = Ridge(alpha=0.5) # fit our models, alpha is our lambda
ridge.fit(X_train_transformed, y_train) 
lasso = Lasso(alpha=0.5)
lasso.fit(X_train_transformed, y_train)
lin = LinearRegression()
lin.fit(X_train_transformed, y_train)
y_h_ridge_train = ridge.predict(X_train_transformed) #generate predictions
y_h_ridge_test = ridge.predict(X_test_transformed)
y_h_lasso_train = np.reshape(lasso.predict(X_train_transformed), (-1, 1)) # column vector, no hard-coded row count
y_h_lasso_test = np.reshape(lasso.predict(X_test_transformed), (-1, 1))
y_h_lin_train = lin.predict(X_train_transformed)
y_h_lin_test = lin.predict(X_test_transformed)
print('Train Error Ridge Model', np.sum((y_train - y_h_ridge_train)**2))
print('Test Error Ridge Model', np.sum((y_test - y_h_ridge_test)**2))
print('Train Error Lasso Model', np.sum((y_train - y_h_lasso_train)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_lasso_test)**2))
print('Train Error Unpenalized Linear Model (RSS)', np.sum((y_train - lin.predict(X_train_transformed))**2))
print('Test Error Unpenalized Linear Model (RSS)', np.sum((y_test - lin.predict(X_test_transformed))**2))
print('Ridge parameter coefficients:', ridge.coef_)
print('Lasso parameter coefficients:', lasso.coef_)
print('Linear model parameter coefficients:', lin.coef_)

In [None]:
# Lab code
# Remove "object"-type features from X
cont_features = [col for col in X.columns if X[col].dtype in [np.float64, np.int64]]
# Remove "object"-type features from X_train and X_test
X_train_cont = X_train.loc[:, cont_features]
X_test_cont = X_test.loc[:, cont_features]

from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.impute import SimpleImputer
# SimpleImputer fills the missing values in data using {Strategy} of the columns ex. {mean/median/most_frequent/constant}
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# Build naive linreg model. Impute missing values with median using SimpleImputer
impute = SimpleImputer(strategy='median')
X_train_imputed = impute.fit_transform(X_train_cont)
X_test_imputed = impute.transform(X_test_cont)
# Fit the model
linreg = LinearRegression()
linreg.fit(X_train_imputed, y_train)
# Print R2 and MSE for training and test sets
print('Training r^2:', linreg.score(X_train_imputed, y_train)) # r^2
print('Test r^2:', linreg.score(X_test_imputed, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train_imputed)))
print('Test MSE:', mean_squared_error(y_test, linreg.predict(X_test_imputed)))

# Scale the train and test data (Normalize), errors will remain the same
ss = StandardScaler()
X_train_imputed_scaled = ss.fit_transform(X_train_imputed)
X_test_imputed_scaled = ss.transform(X_test_imputed)
# Fit the model 
linreg_norm = LinearRegression()
linreg_norm.fit(X_train_imputed_scaled, y_train)
# Print R2 and MSE for training and test sets
print('Training r^2:', linreg_norm.score(X_train_imputed_scaled, y_train))
print('Test r^2:', linreg_norm.score(X_test_imputed_scaled, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg_norm.predict(X_train_imputed_scaled)))
print('Test MSE:', mean_squared_error(y_test, linreg_norm.predict(X_test_imputed_scaled)))

# Include categorical variables
features_cat = [col for col in X.columns if X[col].dtype == object] # np.object was removed from recent NumPy releases
# Fill missing values with the string 'missing' (no inplace fillna on a slice)
X_train_cat = X_train.loc[:, features_cat].fillna(value='missing')
X_test_cat = X_test.loc[:, features_cat].fillna(value='missing')
# OneHotEncode categorical variables
ohe = OneHotEncoder(handle_unknown='ignore')
# Transform training and test sets
X_train_ohe = ohe.fit_transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)
# Convert these columns into a DataFrame and add to continuous DF
columns = ohe.get_feature_names_out(input_features=X_train_cat.columns) # named get_feature_names in scikit-learn < 1.0
cat_train_df = pd.DataFrame(X_train_ohe.toarray(), columns=columns)
cat_test_df = pd.DataFrame(X_test_ohe.toarray(), columns=columns)
X_train_all = pd.concat([pd.DataFrame(X_train_imputed_scaled), cat_train_df], axis=1)
X_test_all = pd.concat([pd.DataFrame(X_test_imputed_scaled), cat_test_df], axis=1)
# severe overfitting may show if r^

# Lasso 
lasso = Lasso() # alpha/lambda default = 1, can iterate through different lambda's to lower test mse
lasso.fit(X_train_all, y_train)
print('Training r^2:', lasso.score(X_train_all, y_train))
print('Test r^2:', lasso.score(X_test_all, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train_all)))
print('Test MSE:', mean_squared_error(y_test, lasso.predict(X_test_all)))

# Ridge
ridge = Ridge() # alpha/lambda default = 1, can iterate through different lambda's to lower test mse
ridge.fit(X_train_all, y_train)
print('Training r^2:', ridge.score(X_train_all, y_train))
print('Test r^2:', ridge.score(X_test_all, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train_all)))
print('Test MSE:', mean_squared_error(y_test, ridge.predict(X_test_all)))

print(sum(abs(ridge.coef_) < 10**(-10))) # Number of Ridge params almost zero
print(sum(abs(lasso.coef_) < 10**(-10))) # Number of Lasso params almost zero
print(len(lasso.coef_)) # total number of coefficients (variables)
print(sum(abs(lasso.coef_) < 10**(-10))/ len(lasso.coef_)) # fraction of variables unselected by Lasso (coeff = 0)


In [None]:
def preprocess(X, y):
    '''Takes in features and target and implements all preprocessing steps for categorical and continuous features returning 
    train and test DataFrames with targets'''
    # Train-test split (75-25), set seed to 10
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
    # Remove "object"-type features and SalesPrice from X
    cont_features = [col for col in X.columns if X[col].dtype in [np.float64, np.int64]]
    X_train_cont = X_train.loc[:, cont_features]
    X_test_cont = X_test.loc[:, cont_features]
    # Impute missing values with median using SimpleImputer
    impute = SimpleImputer(strategy='median')
    X_train_imputed = impute.fit_transform(X_train_cont)
    X_test_imputed = impute.transform(X_test_cont)
    # Scale the train and test data
    ss = StandardScaler()
    X_train_imputed_scaled = ss.fit_transform(X_train_imputed)
    X_test_imputed_scaled = ss.transform(X_test_imputed)
    # Create X_cat which contains only the categorical variables
    features_cat = [col for col in X.columns if X[col].dtype == object]
    # Fill NaNs with a value indicating that it is missing
    X_train_cat = X_train.loc[:, features_cat].fillna(value='missing')
    X_test_cat = X_test.loc[:, features_cat].fillna(value='missing')
    # OneHotEncode categorical variables
    ohe = OneHotEncoder(handle_unknown='ignore')
    X_train_ohe = ohe.fit_transform(X_train_cat)
    X_test_ohe = ohe.transform(X_test_cat)
    columns = ohe.get_feature_names_out(input_features=X_train_cat.columns) # get_feature_names in scikit-learn < 1.0
    cat_train_df = pd.DataFrame(X_train_ohe.toarray(), columns=columns)
    cat_test_df = pd.DataFrame(X_test_ohe.toarray(), columns=columns)
    # Combine categorical and continuous features into the final dataframe
    X_train_all = pd.concat([pd.DataFrame(X_train_imputed_scaled), cat_train_df], axis=1)
    X_test_all = pd.concat([pd.DataFrame(X_test_imputed_scaled), cat_test_df], axis=1)
    return X_train_all, X_test_all, y_train, y_test

Graph the training and test error to find optimal alpha values

In [None]:
X_train_all, X_test_all, y_train, y_test = preprocess(X, y)
train_mse = []
test_mse = []
alphas = []
for alpha in np.linspace(0, 200, num=50):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train_all, y_train)
    train_preds = lasso.predict(X_train_all)
    train_mse.append(mean_squared_error(y_train, train_preds))
    test_preds = lasso.predict(X_test_all)
    test_mse.append(mean_squared_error(y_test, test_preds))
    alphas.append(alpha)
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
ax.plot(alphas, train_mse, label='Train')
ax.plot(alphas, test_mse, label='Test')
ax.set_xlabel('Alpha')
ax.set_ylabel('MSE')
# np.argmin() returns the index of the minimum value in a list
optimal_alpha = alphas[np.argmin(test_mse)]
# Add a vertical line where the test MSE is minimized
ax.axvline(optimal_alpha, color='black', linestyle='--')
ax.legend();
print(f'Optimal Alpha Value: {int(optimal_alpha)}')

### Feature and Model Selection: AIC and BIC
- AIC and BIC:
    - give a comprehensive measure of model performance, taking into account additional variables
- AIC (Akaike's Information Criterion):
    - generally used to compare each candidate model
    - for every model that uses MLE (Maximum Likelihood Estimation), the log-likelihood is automatically computed, so the AIC is very easy to calculate
    - acts as penalized log-likelihood criterion, giving a balance between a good fit (high value of log-likelihood) and complexity (complex models are penalized more than fairly simple ones)
    - unbounded, but the model with the lowest AIC should be selected
    - built into statsmodels and in sklearn (such as LassoLarsIC)
    - $ \text{AIC} = -2\ln(\hat{L}) + 2k $
    
        Where:

        - $k$ : length of the parameter space (i.e. the number of features)
        - $\hat{L}$ : the maximum value of the likelihood function for the model
        
    - Another way to phrase the equation is:

        $ \text{AIC(model)} = - 2 * \text{log-likelihood(model)} + 2 * \text{length of the parameter space} $


- BIC (Bayesian Information Criterion):
    - penalty is slightly changed and depends on the number of rows in the dataset
    - Like the AIC, the lower your BIC, the better your model is performing
    - $ \text{BIC} = - 2\ln(\hat{L}) + \ln(n) * k $

        where:

        - $\hat{L}$ and $k$ are the same as in AIC
        - $n$ : the number of data points (the sample size)
        
    - Another way to phrase the equation is:

    $ \text{BIC(model)} = -2 * \text{log-likelihood(model)} + \text{log(number of observations)} * \text{(length of the parameter space)} $
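
Both criteria are easy to compute once the log-likelihood is known. The sketch below assumes a linear model with normally distributed errors, for which the maximized log-likelihood has a closed form in terms of the residuals. It counts $k$ as the number of columns in the design matrix (including the intercept), which is one convention among several. The data and the helper `fit_and_score` are hypothetical.

```python
import numpy as np


def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood of a linear model with normal errors."""
    n = residuals.size
    sigma2 = np.sum(residuals ** 2) / n              # MLE of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def aic(log_lik, k):
    return -2 * log_lik + 2 * k


def bic(log_lik, k, n):
    return -2 * log_lik + np.log(n) * k


rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)                   # generated independently of y
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)


def fit_and_score(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS fit
    log_lik = gaussian_log_likelihood(y - X @ beta)
    k = X.shape[1]
    return aic(log_lik, k), bic(log_lik, k, n)


small = fit_and_score(np.column_stack([np.ones(n), x]))
large = fit_and_score(np.column_stack([np.ones(n), x, noise_feature]))
```

Each result is an `(AIC, BIC)` pair. Adding a column raises the BIC penalty by $\ln(n)$ and the AIC penalty by 2, so with $n = 200$ the BIC penalizes the larger model more heavily.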

### Feature Selection Methods
- Feature selection benefits include:
    - Decrease in computational complexity: With reduced features, it is easier to compute the model parameters and the amount of data storage required to maintain the features of your model decreases
    - Understanding your data: With feature selection, you gain more understanding of how features relate to one another
- Types of feature selection
    - There are different strategies/methods you can use to process features in an efficient way: 
        * Domain knowledge: 
            - knowledge/thoughts/research on important features
        * Wrapper methods: 
            - determine optimal subset of features using different combinations of features to train models and then calculating performance
            - Every subset is used, so this can end up being very computationally intensive and time consuming
            - highly effective, but challenging with large feature sets
            - ex. Recursive Feature Elimination (RFE) in linear regression
                - opposite of RFE is Forward Selection
        * Filter methods
            - carried out as a pre-processing step
            - different metrics are used to determine feature reduction
            - remove variables considered redundant
            - data scientist determines the cut-off point (to keep top n features)
            - ex. in linear regression, eliminate features that are highly correlated with one another
            - ex. use variance threshold
        * Embedded methods
            - included in actual formulation of ML algorithm
            - common ex. regularization, like Lasso
- full code shows data preprocessing, running a baseline model, adding polynomial features, filter methods, wrapper methods, embedded methods

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
X_poly_train = pd.DataFrame(poly.fit_transform(X_train_transformed), columns=poly.get_feature_names_out(features.columns))
X_poly_test = pd.DataFrame(poly.transform(X_test_transformed), columns=poly.get_feature_names_out(features.columns))
# Filter Methods
from sklearn.feature_selection import VarianceThreshold, f_regression, mutual_info_regression, SelectKBest
from sklearn.linear_model import LinearRegression # used in the loop below
threshold_ranges = np.linspace(0, 2, num=6)
for thresh in threshold_ranges:
    print(thresh)
    selector = VarianceThreshold(thresh)
    reduced_feature_train = selector.fit_transform(X_poly_train)
    reduced_feature_test = selector.transform(X_poly_test)
    lr = LinearRegression()
    lr.fit(reduced_feature_train, y_train)
    run_model(lr, reduced_feature_train, reduced_feature_test, y_train, y_test)
    print('--------------------------------------------------------------------')

# Keep the k best features (default k=10), scored by the F-statistic
selector = SelectKBest(score_func=f_regression)
X_k_best_train = selector.fit_transform(X_poly_train, y_train)
X_k_best_test = selector.transform(X_poly_test)
lr = LinearRegression()
lr.fit(X_k_best_train, y_train)
run_model(lr, X_k_best_train, X_k_best_test, y_train, y_test)

# Same selection, scored by mutual information instead
selector = SelectKBest(score_func=mutual_info_regression)
X_k_best_train = selector.fit_transform(X_poly_train, y_train)
X_k_best_test = selector.transform(X_poly_test)
lr = LinearRegression()
lr.fit(X_k_best_train, y_train)
run_model(lr, X_k_best_train, X_k_best_test, y_train, y_test)
# Wrapper Methods
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression
rfe = RFECV(LinearRegression(),cv=5)
X_rfe_train = rfe.fit_transform(X_poly_train, y_train)
X_rfe_test = rfe.transform(X_poly_test)
lm = LinearRegression().fit(X_rfe_train, y_train)
run_model(lm, X_rfe_train, X_rfe_test, y_train, y_test)
print('The optimal number of features is: ', rfe.n_features_)
# Embedded Methods
from sklearn.linear_model import LassoCV
lasso = LassoCV(max_iter=100000, cv=5)
lasso.fit(X_train_transformed, y_train)
run_model(lasso,X_train_transformed, X_test_transformed, y_train, y_test)
print('The optimal alpha for the Lasso Regression is: ', lasso.alpha_)
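
To make the variance-threshold filter from the cell above concrete, here is a minimal standard-library sketch of the idea: drop every column whose variance is not above a cut-off. The function `variance_filter` and the toy rows are hypothetical, and this is not sklearn's implementation; it only illustrates the rule that `VarianceThreshold` is understood to apply.

```python
from statistics import pvariance

def variance_filter(rows, names, threshold=0.0):
    """Keep the columns whose population variance is strictly above threshold."""
    columns = list(zip(*rows))  # transpose rows -> columns
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in rows]
    return kept_names, reduced

# Toy data: one constant column, one low-variance column, one high-variance column
rows = [
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
]
names = ["constant", "binary", "wide"]

kept_names, reduced = variance_filter(rows, names, threshold=0.5)
print(kept_names)
```

The cut-off is the judgment call mentioned in the filter-method notes: a threshold of 0 only removes the constant column, while 0.5 also removes the binary one. Variance depends on each feature's scale, so the same threshold can mean very different things for different columns.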

### Extensions to Linear Models Lab
- full code: look at baseline model, include interactions, include polynomials, full model R^2, find best Lasso regularization parameter, analyze the final result

### Generating Data + Lab
- generate data to evaluate and compare ML algorithms
- why generated datasets are preferred over real-world datasets:
    - Quick and easy generation - saves data collection time and effort
    - Predictable outcomes - have a higher degree of confidence in the result
    - Randomization - datasets can be randomized repeatedly to inspect performance in multiple cases
    - Simple data types - easier to visualize data and outcomes
- sklearn.datasets.make_blobs
    - generate any number of classes
    - can be used with a number of classifiers to see how accurately they perform
- sklearn.datasets.make_moons
    - used for binary classification problems and generates moon shaped patterns
    - allows you to specify the level of noise in the data
    - can try non-linear classification functions (like sigmoid and tanh etc.)
- sklearn.datasets.make_circles
    - creates values in the form of concentric circles
    - also suitable for testing complex, non-linear classifiers
- sklearn.datasets.make_regression
    - used to test regression algorithms
    - can further tweak the generated parameters to create non-linear relationships that can be solved using non-linear regression techniques (ex. squaring or cubing y)
- [sklearn API reference](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) where the datasets module is shown

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, n_features=2, cluster_std=2)
df = pd.DataFrame(dict(x=X[:, 0], y=X[:, 1], label=y))
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.1)
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=100, noise=0.05)
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=1, noise=5)
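
To see why concentric circles call for a non-linear classifier, here is a rough standard-library sketch of that kind of dataset. `toy_circles` is a hypothetical helper written for these notes, not the sklearn function, and its labeling convention (1 for the inner circle) is an assumption of this sketch.

```python
import math
import random

def toy_circles(n_samples=100, factor=0.5, noise=0.0, seed=0):
    """Return (points, labels): label 0 on the outer circle, 1 on the inner one."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_samples):
        label = i % 2                      # alternate between the two circles
        radius = factor if label == 1 else 1.0
        angle = rng.uniform(0, 2 * math.pi)
        x = radius * math.cos(angle) + rng.gauss(0, noise)
        y = radius * math.sin(angle) + rng.gauss(0, noise)
        points.append((x, y))
        labels.append(label)
    return points, labels

points, labels = toy_circles(n_samples=200, noise=0.0)

# No straight line separates the classes, but distance from the origin does.
radii = [math.hypot(x, y) for x, y in points]
predicted = [1 if r < 0.75 else 0 for r in radii]
print(predicted == labels)
```

With `noise=0.0` the radius rule separates the two classes exactly; raising `noise` blurs the rings, which is what makes the generated data useful for comparing classifiers.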

<hr style="border:1px solid gray"> </hr>


## Extra Notes

In [None]:
import random
random.randrange(1,10) # random integer from 1 to 9 (stop is exclusive)

In [None]:
import timeit
start = timeit.default_timer()
# ... code to be timed goes here ...
time_spent = timeit.default_timer() - start # elapsed time in seconds

In [None]:
import numpy as np
import numpy.linalg as la
np.sum(A)
la.norm(a)
la.eig(A) # [0] for eigenvalues, [1] for eigenvectors
np.zeros((r, c)) # or np.ones((r, c)); the shape is passed as a tuple
np.eye(4) # or
np.identity(4, dtype=int)
la.inv(A)
np.power(A,B) # elements in A raised to the power of elements in B
np.round(A)
la.solve(A,s)
np.square(x) # or np.sqrt(x)
np.arccos(x) # also np.arcsin(x), np.arctan(x); results are in radians
np.random.rand(r,c) # 0-1 uniform distribution
y_randterm = np.random.normal(loc, scale, size) # draws from normal distribution with mu = loc and std = scale
np.random.choice(a, size=size, replace=False) # size is an int or a tuple of ints
np.random.randint(low,high,size,dtype)
random_state = np.random.RandomState(42)
a = np.arange(start, stop, step, dtype)
np.asmatrix(A) # returns matrix([[,],[,]]) instead of array; np.mat(A) is an older alias, removed in NumPy 2.0
# .shape or np.shape(A)
# .size
# X.T or np.transpose(X)
x = np.linspace(start, stop, num)
np.argmin(a) # or np.argmax(a); returns the indices of the min/max values along an axis
import sys
np.savetxt(sys.stdout, bval_RSS, '%16.2f') 

In [None]:
import csv
# Create an empty list for storing the rows
data = []
# Read the data from the csv file
with open('windsor_housing.csv') as f:
    raw = csv.reader(f) # <_csv.reader object at 0x7f22780cc7b0>
    # Drop the very first line as it contains names for columns - not actual data 
    print(next(raw)) # ['lotsize', 'bedrooms', 'bathrms', 'stories', 'driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'garagepl', 'prefarea', 'price']
            
    # Read one row at a time. Prepend 1.0 to each row
    for row in raw:
        ones = [1.0]
        for r in row:
            ones.append(float(r)) # 'str' to 'float', append the row to [1.0]
        # Append the row to data 
        data.append(ones)
data = np.array(data)
data[:5,:] #first five rows of all columns

In [None]:
iterator = iter([1,2,3])

print(iterator) # <list_iterator object at 0x7f2278116ac0>
print(next(iterator)) # 1
print(next(iterator)) # 2

help(class_)

type(var)

isinstance(instance, class_/superclass)
issubclass(class_, superclass)

repr(obj), str(obj), len(obj) # call the special methods, same as obj.__repr__(), obj.__str__(), obj.__len__()

min(var), max(var)

vars(obj) # obj with __dict__ attr, returns this attr

sorted(list_, key=lambda x: x.var, reverse=True)

In [None]:
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10

In [None]:
# normalized RMSE
root_mean_sq_err/(y_train.max() - y_train.min())
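
For reference, the normalized RMSE above can be computed end to end as in the sketch below. The function name `normalized_rmse` and the sample values are made up for illustration; the normalization divides by the range of the observed targets, as in the expression above.

```python
import math

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the observed targets."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    rmse = math.sqrt(mse)
    return rmse / (max(y_true) - min(y_true))

# Hypothetical targets and predictions
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(normalized_rmse(y_true, y_pred))
```

Dividing by the range makes the error unitless, so models fitted to targets on different scales can be compared.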

In [None]:
import datetime
datetime.date(2015, 7, 10).weekday() # 4, i.e. Friday (Monday is 0)

In [None]:
import math
math.sqrt(x) # x must be a non-negative number

### Special Methods
https://docs.python.org/3/reference/datamodel.html#special-method-names
- repr: unambiguous representation of the object, used for debugging, logging, etc. Meant to be seen by other developers. If `__str__` is not defined, calling `str()` falls back to `__repr__`.
- str: readable representation of the object. Meant to be displayed to the end user.
- a good rule for `__repr__` is to return something that lets you recreate the same object.

In [None]:
def __repr__(self):
    return "Employee('{}', '{}', {})".format(self.first, self.last, self.pay) # pay left unquoted so the output recreates a number

def __str__(self):
    return '{} - {}'.format(self.fullname(), self.email)

def __add__(self, other):        # ex. int.__add__(1,2) same as 1+2, str.__add__('a','b') same as 'a' +'b'
    if isinstance(other, Employee):
        return self.pay + other.pay
    return NotImplemented # way to fall back on other object to handle operation
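
The fragments above can be put together into one small class to see how each special method is triggered. This is a sketch under assumptions: the `Employee` attributes follow the snippets above, while the email format, the `__len__` method, and the sample names are hypothetical additions for illustration.

```python
class Employee:
    def __init__(self, first, last, pay):
        self.first = first
        self.last = last
        self.pay = pay
        # Hypothetical email format, only for illustration
        self.email = '{}.{}@example.com'.format(first, last).lower()

    def fullname(self):
        return '{} {}'.format(self.first, self.last)

    def __repr__(self):
        # Unambiguous: the output can be pasted back to recreate the object
        return "Employee('{}', '{}', {})".format(self.first, self.last, self.pay)

    def __str__(self):
        # Readable: meant for display to an end user
        return '{} - {}'.format(self.fullname(), self.email)

    def __add__(self, other):
        if isinstance(other, Employee):
            return self.pay + other.pay
        return NotImplemented  # let Python try the other operand, then raise TypeError

    def __len__(self):
        return len(self.fullname())

emp_1 = Employee('Ada', 'Lovelace', 50000)
emp_2 = Employee('Alan', 'Turing', 60000)

print(repr(emp_1))    # Employee('Ada', 'Lovelace', 50000)
print(str(emp_1))     # Ada Lovelace - ada.lovelace@example.com
print(emp_1 + emp_2)  # 110000
print(len(emp_1))     # 12
```

Returning `NotImplemented` instead of raising directly means `emp_1 + 1` ends in a `TypeError` only after Python has checked whether the other operand can handle the operation.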


In [None]:
# sort dictionary
dict(sorted(genre_count_dict.items(), key=lambda item: item[1], reverse=True))

In [None]:
# merging dataframes
new_df = leftdf.join(rightdf, on='col', how='left') # may need rightdf.set_index('col', inplace=True) first
new_df = pd.merge(leftdf, rightdf, left_on='col', right_on='col', how='left')
