# Final Review
By: Pieter




# Assignment Quick Lookup

1. Supervised learning, Linear models, and Loss functions
2. Maximum Likelihood
3. Classification with Logistic Regression
4. Confidence Intervals & The Bootstrap
5. Model Selection & Cross Validation
6. Regularization
7. Midterm
8. Random Forest
9. Neural Networks 
10. Autoencoder
11. Clusters



# Assumptions

Assume that the standard deviation of the loss obtained when the median is used as the prediction is a good estimate of the noise present in the data. This is likely an overestimate.

# **Data Manipulation**

### Importing Data

```python
import pandas as pd

df = pd.read_csv('creditcard.csv')
df.head()
```

### Cleaning Data

Dropping Data
```python 
# Columns
X = df.drop('Class', axis='columns').values

# Multiple Columns
model_data = model_data.drop(columns=["DeviceName", "Outdoor_Humidity", "Discharge_Temperature"])

# Drop rows with missing values (dropna returns a new DataFrame, so reassign it)
model_data = model_data.dropna()
```

Convert Categorical Data into Numerical Data
```python
# In order to get dummies, you can convert the categorical data to categorical type
# with an explicit, ordered list of categories
model_data['work_rate_att'] = pd.Categorical(model_data.work_rate_att, categories=['Low','Medium','High'])
model_data['work_rate_def'] = pd.Categorical(model_data.work_rate_def, categories=['Low','Medium','High'])
model_data['preferred_foot'] = pd.Categorical(model_data.preferred_foot, categories = ['Left','Right'])

# Dummies, dropping the first category: k categories only need k-1 indicator columns
# Example: Use 2 columns instead of 3 to represent Low, Medium, High
model_data = pd.get_dummies(model_data, drop_first=True)

model_data.head()
```


### Split Test/Train Data
```python
from sklearn.model_selection import train_test_split

# Split Data into Training and Testing Data
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = test_size, random_state = 0)
```

### Data Frame Building

```python
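from sklearn.preprocessing import PolynomialFeatures

# Build Data Frame to extract Coef Columns by Name
p = PolynomialFeatures(degree=2).fit(x_test)
# get_feature_names_out replaces get_feature_names, which newer scikit-learn versions removed
features = pd.DataFrame(ridge_coefs, columns=p.get_feature_names_out(x_data.columns))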
```


### Functions

#### $R^2$
```python
def R_Squared(y_true, y_pred):
    rss = sum((y_true - y_pred)**2) 
    tss = sum(( y_true - y_true.mean())**2)
    r2 = 1 - rss/tss
    return r2
```

#### $RSS$
Residual Sum of Squares
```python
def RSS(y_true, y_pred):
    residual_sum_of_squares = sum((y_true - y_pred)**2) 
    return residual_sum_of_squares

```

#### $TSS$
Total Sum of Squares
```python
def TSS(y_true):
    total_sum_of_squares = sum(( y_true - y_true.mean())**2) 
    return total_sum_of_squares

```

#### $MAE$
Mean Absolute Error
```python
def mae(y,ypred):
    return abs(y - ypred).mean()
```

# **Graphs**

### Creating Line of best Fit 

```python
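import numpy as np
import matplotlib.pyplot as plt

# Create x values that span the range of the data
xp = np.linspace(66, 80, 30)

# Predict y values using model
# (a scikit-learn model with one feature expects a 2-D array, hence the reshape)
yp = model.predict(xp.reshape(-1, 1))

# Plot data
plt.plot(xp, yp)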
```

### Scatter 

```python
# Matplotlib
import matplotlib.pyplot as plt

def scatter_plot(x_data_pts, y_data_pts):
  # Plot
  fig, ax = plt.subplots(dpi = 120)
  fig.set_facecolor('white')

  # Plot Formatting 
  ax.set_title('X Title vs. Y Title')
  ax.set_xlabel('X Title')
  ax.set_ylabel('Y Title')

  # Plot Data
  # Label is for the legend
  ax.plot(x_data_pts, y_data_pts, 'k.', label='Data Set Name')

  # Legend
  ax.legend(loc=1)
  plt.show()
```

```python 
import seaborn as sns
df_test=pd.read_csv('hockey_draftees_test.csv')

# Seaborn
ax=sns.scatterplot(x=df_test.ht,y=df_test.wt)
ax.set_xlabel('Height')
ax.set_ylabel('Weight')

```

### Scatter Density

LAB 4 Q5
```python 
# Seaborn
import seaborn as sns
sns.jointplot(x=params[:,0],y=params[:,1])

```

### Histogram

LAB 4 Q6
```python
# Matplotlib
import matplotlib.pyplot as plt
plt.hist(params[:,1], edgecolor = 'white', density=True)
```

```python
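# histplot replaces the deprecated distplot; stat='density' keeps the y axis as a density
hist = sns.histplot(slope, kde=True, stat='density')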

hist.set_title("Slope Distribution")
hist.set_xlabel("Slope Value")
hist.set_ylabel("Probability")
hist.set_facecolor('white')
```

### Histogram with a line
LAB 6 #2
```python
# histplot replaces the deprecated distplot
ax = sns.histplot(Xtrain.stamina,
                  bins=50,
                  kde=True,
                  color='skyblue',
                  alpha=1)
ax.set(xlabel='stamina', ylabel='frequency', title="Before Standardization")
```

### Distribution Plot
LAB 5 Q1
```python
# Seaborn
# histplot replaces the deprecated distplot
sns.histplot(x, kde=True, stat='density')
```


# **LAB 1**

## Loss Functions

### OLS
Ordinary Least Squares

$$ RSS = \sum_{i=1}^N( y_a(i) - y_p(i))^2 $$

Take the square of the residuals between the actual and the predicted values and sum them

**Written:** OLS has the highest $R^2$ because it minimizes the RSS, and $R^2 = 1 - RSS/TSS$, so minimizing the RSS maximizes $R^2$.

```python

def linearModelLossRSS(b, X, y):

    yp = X @ b
    l = sum(np.square(y-yp))
    
    grad = (X.T @ (y-yp)) * -2
    return l, grad
  
```


### LAD 
Least Absolute Deviations

$$ LAD = \sum_{i=1}^N| y_a(i) - y_p(i) | $$


Take the absolute values of all the residuals between the actual and the predicted values

**Written:** The $R^2$ value of the LAD model is lower since it puts less emphasis on outliers compared to RSS.

Putting less emphasis on the outliers makes the line fit the bulk of the data better, but gives a worse overall squared-error fit (lower $R^2$)

```python
def linearModelLossLAD(b, X, y):
    
    yp = X @ b
    l = sum(abs(y-yp))
    grad = -np.sign((y-yp)) @ X    
    return l, grad
```

### Comparison 

| OLS | LAD |
|--|-|
|Not very Robust | Robust |
|Stable Solution| Unstable Solution|
| One Solution | Possibly Multiple Solutions|
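
The robustness row of the table can be checked on a toy example. The sketch below is only an illustration: the data and the single outlier are made up, and LAD is approximated with iteratively reweighted least squares (IRLS), which is a swapped-in technique and not necessarily the optimizer used in the lab.

```python
import numpy as np

# Hypothetical data: y = 2x exactly, except for one large outlier
x = np.arange(10, dtype=float)
y = 2.0 * x
y[9] = 100.0                      # the clean value would be 18
X = np.c_[np.ones_like(x), x]     # intercept column + slope column

# OLS: closed-form least-squares solution
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# LAD via IRLS: repeatedly solve a weighted least-squares problem
# with weights 1/|residual|, so large residuals count less
b_lad = b_ols.copy()
for _ in range(100):
    r = np.abs(y - X @ b_lad)
    sw = np.sqrt(1.0 / np.maximum(r, 1e-8))   # sqrt of the weights
    b_lad, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print("OLS slope:", b_ols[1])
print("LAD slope:", b_lad[1])
```

The OLS slope is pulled well away from 2 by the single outlier, while the LAD slope stays close to 2.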






# **LAB 2**

## Likelihood

- Measure the goodness of fit of a statistical model 
- Given data how well does the distribution fit the data?
- Data is fixed

#### Probability
- The chance of an event occurring
- Given a distribution, what is the chance of the event occurring?
- Distribution is fixed


### Maximum Likelihood
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. 


### Poisson Distribution

The Poisson distribution is a discrete probability distribution often used to describe count-based data, like how many snowflakes fall in a day.

$$\ell(\lambda; \mathbf{y}) = -\sum_{i=1}^N\Bigg( y_{i}\cdot \ln(\lambda) - \lambda - \ln(y_i!) \Bigg)$$

```python

import numpy as np
from scipy.special import gammaln

def poissonNegLogLikelihood(lam,y):
    
    # gammaln(y + 1) equals ln(y!) and avoids overflow for large counts
    neg_log_lik = -np.sum(y * np.log(lam) - lam - gammaln(y+1))
    return neg_log_lik


def poissonRegressionNegLogLikelihood(b, X, y):
    # Log link: lambda = exp(X b) keeps the rate positive
    lam = np.exp(X.dot(b))
           
    # Use poissonNegLogLikelihood to compute the likelihood
    neg_log_lik = poissonNegLogLikelihood(lam ,y)
    return neg_log_lik

```
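
As a quick sanity check of maximum likelihood, minimizing the Poisson negative log-likelihood over a single rate $\lambda$ should recover the sample mean. The sketch below assumes a small made-up array of counts and uses `scipy.optimize.minimize_scalar`; neither comes from the lab.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

# Hypothetical count data
y = np.array([2, 0, 3, 1, 4, 2, 5, 1])

def neg_log_lik(lam):
    # Same expression as the formula above, for a single shared rate
    return -np.sum(y * np.log(lam) - lam - gammaln(y + 1))

# Search for the rate that minimizes the negative log-likelihood
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
lam_hat = res.x

print("MLE of lambda:", lam_hat)
print("Sample mean:  ", y.mean())
```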


# **LAB 3**

### **Regressions**

### **Linear Regression:**
Creates a line of best fit

```python
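from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

linear_poly_pipeline = Pipeline([('poly_features', PolynomialFeatures(degree=2)),
                                  ('LR', LinearRegression())])
# Fit the pipeline defined above (the original called an undefined name)
linear_poly_pipeline.fit(x_train, y_train)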

```


### **Logistic Regression:**

Logistic Regression is used to predict a true / false outcome based on input parameters

An outcome is marked positive / true if the predicted probability is greater than the threshold. A common threshold is 0.5 (50%), but it can be raised or lowered to fit certain needs. For example, if it is very important to identify patients with cancer, the threshold might be set to 0.3. This will result in more false positives than false negatives, i.e. it is better to double check that the patient has cancer than to leave it untreated.

Uses maximum likelihood to determine the best fit

```python
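from sklearn.linear_model import LogisticRegression

# penalty=None disables regularization (older scikit-learn versions used penalty='none')
lr_amount = LogisticRegression(solver='lbfgs', penalty=None, max_iter=10000)
# Fit the model on the training data before calling predict

# True / False Predict
yp_ = lr_amount.predict(xp)

# Probability Predict (column 1 is the probability of the positive class)
yp = lr_amount.predict_proba(xp)[:, 1]

# Plotting (assumes xp holds a single feature)
sns.lineplot(x=np.ravel(xp), y=yp)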
```
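
The effect of lowering the threshold can be seen on a handful of predictions. The labels and probabilities below are made up purely for illustration, and `fp_fn` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical true labels (1 = cancer) and predicted probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p_hat = np.array([0.9, 0.6, 0.4, 0.35, 0.45, 0.32, 0.2, 0.1, 0.05, 0.6])

def fp_fn(threshold):
    """Count false positives and false negatives at a given threshold."""
    y_pred = (p_hat >= threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return fp, fn

print(fp_fn(0.5))  # (1, 2)
print(fp_fn(0.3))  # (3, 0)
```

Dropping the threshold from 0.5 to 0.3 removes the false negatives at the cost of more false positives.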

### **ROC** | Receiver Operating Characteristic

Plots the True Positive Rate against the False Positive Rate as the classification threshold varies

Percentage of Cancer samples labeled correctly as Cancer (True Positive)
$$ True Positive Rate = \frac{True Positives}{True Positives + False Negatives} $$

Percentage of Not Cancer samples labeled incorrectly as Cancer (False Positive)
$$ False Positive Rate = \frac{False Positives}{False Positives + True Negatives} $$

The ROC curve ends at (1, 1) because if everything is labeled cancer, all the cancer samples are labeled correctly. However, this means all the not cancer samples are labeled incorrectly

Helps us choose the optimal threshold


* True Positive: Cancer Samples Labeled as Cancer
* False Negative: Cancer Sample Labeled as Not Cancer
* True Negative: Not Cancer Sample Labeled as Not Cancer 
* False Positive: Not Cancer Sample Labeled as Cancer 


Precision:

The proportion of positive predictions that were correctly classified
$$ Precision = \frac{True Positives}{True Positives + False Positives} $$

Recall: (Same as True Positive Rate)

Percentage of Cancer samples labeled correctly as Cancer (True Positive)

$$ Recall= \frac{True Positives}{True Positives + False Negatives} $$

Note: Look at a confusion matrix if you are confused. It is easier to understand
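
A minimal sketch of reading these rates off a confusion matrix with scikit-learn. The label arrays are made-up examples; `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP for binary labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = cancer, 0 = not cancer
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)          # recall / true positive rate
fpr = fp / (fp + tn)          # false positive rate
precision = tp / (tp + fp)

print(tpr, fpr, precision)
```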

### **AUC** | Area Under the Curve

AUC makes it easy to compare one ROC curve to another to determine which one is better. The higher the AUC, the better the model is at properly classifying a sample.

One ROC could use a Random Forest while another ROC could use Logistic Regression. An AUC would tell us which ROC is better

```python
# Import 
from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve, auc

# ROC for all-variable classifier

# Predict Using Model
ytest_prob = lr_all.predict_proba(Xtest)

# False Positive Rate, True Positive Rate
fpr, tpr, _ = roc_curve(ytest, ytest_prob[:,1], pos_label=1)

# Plot
# Recent seaborn versions require x and y to be passed as keywords
ax=sns.lineplot(x=fpr,y=tpr)
ax.set(xlabel="FPR",ylabel="TPR")

# AUC
auc(fpr,tpr)

```


# **LAB 4**


### Confidence Interval 

A confidence interval is a range of values that we are fairly confident contains the true value

e.g. a 95% confidence interval for the mean is built so that, over repeated samples, about 95% of such intervals cover the true mean


The $100(1-\alpha)\%$ confidence interval is 

$$ \bar{x} \pm  t_{1-\alpha/2, n-1} \dfrac{\hat{\sigma}}{\sqrt{n}} $$


```python
# Imports
import numpy as np
from scipy.stats import t, sem

# Function
def confidence_interval(data):

    estimated_mean = np.mean(data)

    # Confidence level = 95% in this case
    offset = t.ppf( (1 + 0.95)/2, df=len(data)-1) * sem(data)
    bounds = [estimated_mean - offset, estimated_mean + offset]

    return estimated_mean, bounds
    
```

### Bootstrapping

Bootstrapping is any test or metric that uses random sampling with replacement, and falls under the broader class of resampling methods. 

This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.

For reasonably large samples, the bootstrap distribution of the mean is approximately normal (central limit theorem). This is not guaranteed for every statistic.

```python

import numpy as np
from sklearn.linear_model import LinearRegression

# Write a Bootstrap function that records the fitted models 
def BootstrapCoef(data,numboot=1000):
    regr = LinearRegression()
    n = len(data)
    theta = np.zeros((numboot,2))    
    for i in range(numboot):
        # Sample Data (same size as the original, with replacement)
        d = data.sample(n, replace=True)

        X_fit = np.c_[d.ht]

        # Fit Data using Linear Regression
        regr.fit(X_fit,d.wt)

        # Store Model Parameters (coef_ is an array with one entry per feature)
        theta[i,0]=regr.intercept_
        theta[i,1]=regr.coef_[0]
    return theta

params = BootstrapCoef(df,100)

```
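
The bootstrap distribution can also be turned into a confidence interval directly. The sketch below uses the percentile method on made-up, skewed data; the data, the number of resamples, and the choice of the percentile method are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # hypothetical skewed sample

numboot = 2000
boot_means = np.empty(numboot)
for i in range(numboot):
    # Resample the data with replacement and record the statistic
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = sample.mean()

# 95% percentile interval: middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```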

# **LAB 5**

### **Cross Validation**

Cross validation splits the data into blocks (folds). One block of data is kept for testing and the rest are used for training. Cross validation then cycles through each block, assigning it to the testing data and the rest of the blocks to the training data. Then it takes the average score over all training/testing splits at the end. 

Training Data:
Used to estimate the parameters of the machine learning method

Testing Data:
Used to evaluate how well the machine learning method works

Cross Validation Size (number of folds):

For each row of the table, lower = better

| Parameter | 2 Folds | 5-10 Folds | N Folds |
|-|-|-|-|
|Overestimation bias of prediction error| Bad | Present | Nearly Unbiased |
|Computational Cost | Low | Mid | High | 
|Variance of estimate| Low | Low | High |
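
The cycling described above can be written out by hand. This is a minimal sketch on synthetic data, assuming scikit-learn's `KFold` and a plain `LinearRegression`; the data and the fold count are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (an assumption for this example)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, test_idx in kf.split(X):
    # Train on four blocks, test on the held-out block
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_mae.append(np.mean(np.abs(y[test_idx] - y_pred)))

# The cross-validation score is the average over the folds
cv_mae = np.mean(fold_mae)
print(cv_mae)
```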


### Cross Validation Score

```python
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# cv is the number of folds
cv_score = cross_val_score(model2, Xtrain, ytrain, cv = 5, scoring=make_scorer(mae))
print(cv_score.mean())
```


### **Effective Test Size**

The effective test size $n$ needed to estimate the test loss to within a precision $d$, given a test loss standard deviation of $\sigma_l$, is
$$ n = \left(\frac{1.96 \sigma_l}{d}\right)^2$$

```python

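import numpy as np

def mae(y,ypred):
    return abs(y - ypred).mean()

# Baseline: predict the median for every sample
mu = mae(model_data.overall,model_data.overall.median())
loss = abs(model_data.overall - model_data.overall.median())  
sigma = loss.std()

# Test Size (d is the desired precision and must be chosen beforehand)
# 1.96 matches the formula above; round up to a whole number of samples
test_size = int(np.ceil((1.96*sigma/d)**2))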
```
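
As a worked example of the formula, with assumed values $\sigma_l = 5$ and $d = 0.5$ (both made up for illustration):

```python
import math

sigma_l = 5.0   # assumed standard deviation of the test loss
d = 0.5         # assumed desired precision

# n = (1.96 * sigma_l / d)^2 = 19.6^2 = 384.16, rounded up to a whole sample
n = math.ceil((1.96 * sigma_l / d) ** 2)
print(n)  # 385
```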



### **Linear Regression**

```python
model1 = Pipeline([
    ('linear_regression', LinearRegression())
])
```

# **LAB 6**

#### **Standardization**

Standardizing centers each feature around 0 and scales it to unit variance. This is because the StandardScaler converts the data to z-scores. A z-score measures how many standard deviations a data point is from the mean. For roughly normal data, the majority (68%) of the data will lie within 1 standard deviation, i.e. a z-score between -1 and +1. Standardizing the data allows for easy comparison between different features

```python 

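from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example Pipeline
model_pipeline = Pipeline([
    # Standardization
    ('standardize', StandardScaler()),
    # Linear Regression
    ('reg', LinearRegression())
])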

# Access Step inside Pipeline
standardizer_step = model_pipeline.named_steps['standardize']
transformed_X = standardizer_step.fit_transform(Xtrain)
```
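
To see that `StandardScaler` really produces z-scores, it can be compared with the manual calculation. The small array below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

Z = StandardScaler().fit_transform(X)

# Manual z-scores; StandardScaler uses the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z, manual))          # True
print(Z.mean(axis=0), Z.std(axis=0))   # means ~0, standard deviations ~1
```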


### **Ridge Regression (L2):**

Ridge regression = Sum of Squared Residuals + $\lambda \times slope^2$

$$ Ridge = \sum_{i=1}^N residuals^2 + Penalty \times Slope^2 $$
$$ Ridge = \sum_{i=1}^N( y_a(i) - \beta x(i))^2 + \lambda \beta^2 $$

Ridge regression adds a penalty on the slope.
The penalty $\lambda$ can be anywhere between 0 and infinity.

Ridge regression introduces bias into the model to potentially reduce the variance of its predictions. By accepting a slightly worse fit of the training data (when it is limited), Ridge Regression can provide better long-term predictions on the testing data. Ridge regression makes predictions less sensitive to the input variables: a change in an input produces a smaller change in the output.

Least squares needs at least as many data points as parameters (for example, 4 points to estimate 4 parameters). Ridge regression can estimate all parameters with fewer data points, e.g. 1000 parameters with only 500 data points or fewer. 

**Note:** Ridge Regression can shrink a slope asymptotically close to 0, but never exactly to 0.
Ridge Regression is better when most model features are useful

```python
ridge_pipeline = Pipeline([('poly_features', PolynomialFeatures()),
                           ('scaler', StandardScaler()), 
                           ('ridge_regression', Ridge(alpha=np.exp(2)))])

ridge_pipeline.fit(x_train, y_train)
yp_ridge = ridge_pipeline.predict(x_test)
```

### Ridge - Getting Best Lambda

```python
params = {'reg__alpha': np.exp(np.linspace(-8,6,15))}

# Grid search lets you test multiple values of lambda (alpha in scikit-learn)
gscv = GridSearchCV(pipeline, param_grid=params, cv=10, scoring = 'neg_mean_squared_error', refit=True)
gscv.fit(X_new_train, ytrain)

# Alternatively 
# The best lambda can be found using gscv.best_params_

results = pd.DataFrame(gscv.cv_results_)
plt.scatter( np.linspace(-8, 6,15), -results.mean_test_score)
plt.xlabel(r'$\log(\lambda)$')
```

### **Lasso Regression (L1):**

Lasso regression adds a penalty on the absolute value of the slope

Lasso regression = Sum of Squared Residuals + $\lambda \times |Slope|$
$$ Lasso = \sum_{i=1}^N residuals^2 + Penalty \times |Slope| $$
$$ Lasso = \sum_{i=1}^N( y_a(i) - \beta x(i))^2 + \lambda |\beta| $$

Lasso regression can shrink a slope all the way to 0.
Lasso regression is better when most model features are useless, since they can be excluded from the line of best fit. This has the potential to simplify a model greatly. In that situation it is better than ridge regression at reducing variance. 

```python
lasso_pipeline = Pipeline([('poly_features', PolynomialFeatures()), 
                           ('scaler', StandardScaler()), 
                           ('lasso', sk.linear_model.Lasso(alpha=np.exp(2)))])

lasso_pipeline.fit(x_train, y_train)
```

### **Elastic Net Regression:**

Elastic Net Regression combines the strength of Lasso and Ridge Regression

$$ Elastic = \sum_{i=1}^N residuals^2 + Penalty_L \times |Slope|+ Penalty_R \times Slope^2 $$
$$ Elastic = \sum_{i=1}^N( y_a(i) - \beta x(i))^2 + \lambda_L |\beta| + \lambda_R \beta^2 $$

Ridge Regression shrinks all of the parameters for the correlated variables together 

Lasso Regression picks one of the correlated terms and eliminates the other terms

Elastic Net regression groups and shrinks the parameters associated with the correlated variables and either leaves them in the equation or removes them all at once. 
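
The three penalties can be compared side by side on data with two nearly identical features. This is a rough sketch: the synthetic data, the `alpha` values, and `l1_ratio=0.5` are arbitrary choices for illustration, and the exact coefficients depend on them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                    # unrelated feature
X = np.c_[x1, x2, x3]
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Ridge:      ", ridge.coef_)   # weight shared between x1 and x2
print("Lasso:      ", lasso.coef_)   # weight concentrated, x3 pushed to ~0
print("Elastic Net:", enet.coef_)    # both correlated features kept
```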


### Plot Slope of Each Parameter (Lasso Coefficient Path)

```python
regularization_strength = np.exp(np.linspace(np.log(0.2),np.log(200),50))

coefs = np.zeros((regularization_strength.size, X.shape[1]))

for i,L in enumerate(regularization_strength):
    lasso_pipe = Pipeline([
        ('scale', StandardScaler()),
        ('linear_regression', Lasso(alpha=L, fit_intercept=True))
    ])
    
    lasso_pipe.fit(Xtrain, ytrain)
    coefs[i] = lasso_pipe.named_steps['linear_regression'].coef_    

fig, ax = plt.subplots(dpi = 120)
ax.plot(np.log(regularization_strength), coefs)
ax.set_xlabel(r'$\log(\lambda)$', fontsize = 16)
ax.set_ylabel(r'$\hat{\beta}$', fontsize = 16)
ax.set_title('Coefficient Path', fontsize = 18)
ax.set_xlim(-4,None)

for i, name in enumerate(DfFeatures.columns[:-1]):
    
    ax.annotate(name, xy = (-3, coefs[0,i]), ha = 'left', fontsize = 8)
```


# **LAB 8**

### **Random Forest**

* Step 1: Bootstrap the data set (sample randomly with replacement)
* Step 2: Create a tree, considering only a random subset of the variables at each split
* Step 3: Repeat n times
* Now we have a random forest

Testing new Data

When we get new data we run it through all the trees created in the last step. At each tree we record the result. Looking at the overall count of votes, we can then determine the classification of the data

Bagging:

Bootstrapping the data plus using the aggregate of the trees' predictions to make a decision

Out of Bag Data Set:

Data that did not end up in the bootstrapped data set. We use this to test if the random forest properly classifies the data. Out of Bag Error is the proportion of out of bag samples incorrectly classified

```python
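from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

num_trees = 500
seed = 0  # example value; any fixed integer makes the runs reproducible

# Bagged decision tree
# (the keyword is `estimator` in scikit-learn >= 1.2; older versions use `base_estimator`)
tree1 = DecisionTreeClassifier()
model_1 = BaggingClassifier(estimator = tree1, n_estimators = num_trees, random_state = seed)
model_1.fit(Xtrain, ytrain)
y_p1 = model_1.predict(Xtest)
bag_accuracy = accuracy_score(ytest, y_p1)

# Random Forest (max_features = 1: one candidate variable per split)
model_2 = RandomForestClassifier(max_features = 1, random_state = seed)
model_2 = model_2.fit(Xtrain, ytrain)
y_p2 = model_2.predict(Xtest)
rf_mf1_accuracy = accuracy_score(ytest, y_p2)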
```
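
The out of bag error does not have to be computed by hand: scikit-learn's `RandomForestClassifier` reports the out of bag accuracy when `oob_score=True`. The sketch below uses a synthetic data set from `make_classification`, which is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data (not the lab data)
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

# oob_score=True scores each sample using only the trees that did not see it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_
print("Out of bag error:", oob_error)
```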

### **Ada Boost**

In a forest of trees made with Ada Boost, the trees are usually just one node with two leaves (stumps). Each stump uses only one variable to make a decision, so stumps are weak learners. In an Ada Boost forest, some stumps have more weight than others in the final classification. Each stump is made by taking the previous stump's errors into account.

*   Step 1: Evaluate all stumps and determine which one best classifies the data
*   Step 2: Reweight the data, increasing the weight of misclassified samples
*   Step 3: Normalize the weights
*   Step 4: Select the next best stump 
*   Step 5: Bootstrap new data using the weights as probabilities 
*   Step 6: Set the weights equal again
*   Step 7: Repeat

Weight of Stump
$$ Amount of Say = \frac{1}{2} \log\left(\frac{1 - Total Error}{Total Error}\right)$$

For a misclassified sample (correctly classified samples use $e^{-Amount Of Say}$ instead):
$$ New Sample Weight = Sample Weight \times e^{Amount Of Say} $$
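
A small worked example of the two formulas, assuming 8 equally weighted samples of which one is misclassified (the numbers are made up). Correctly classified samples are scaled by $e^{-Amount Of Say}$, which is the standard AdaBoost update.

```python
import numpy as np

# 8 samples with equal starting weights; sample 2 is misclassified by the stump
weights = np.full(8, 1 / 8)
misclassified = np.array([False, False, True, False, False, False, False, False])

# Total error = sum of the weights of the misclassified samples
total_error = weights[misclassified].sum()
amount_of_say = 0.5 * np.log((1 - total_error) / total_error)

# Increase weights of misclassified samples, decrease the rest
new_weights = np.where(misclassified,
                       weights * np.exp(amount_of_say),
                       weights * np.exp(-amount_of_say))

# Normalize so the weights sum to 1
new_weights /= new_weights.sum()

print(amount_of_say)   # 0.5 * ln(7), about 0.973
print(new_weights)     # misclassified sample now carries half the total weight
```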

### **Gradient Boost**

* Builds bigger, fixed-size trees than AdaBoost
* Gradient boost scales all trees by the same value (the learning rate)


# **LAB 9**

### **Neural Networks**

Neural Networks can fit a squiggle to data

Hidden Layers:

Number of Layers between the input and output nodes

### Activation Functions
https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0
https://pytorch.org/docs/stable/nn.html#linear-layers


### Back propagation

*  Use the chain rule to find the derivative of the Sum of Squared Residuals (SSR) with respect to each parameter
*  Use gradient descent to optimize the unknown parameters
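
The two steps above can be sketched on the simplest possible case: a single linear neuron `y_pred = w * x + b` with no hidden layer. The data, learning rate, and iteration count are assumptions for illustration; a real network repeats the same chain rule through every layer.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # unknown parameters, starting guess
lr = 0.01         # learning rate (step size)

for _ in range(5000):
    y_pred = w * x + b
    residual = y - y_pred
    # Chain rule: dSSR/dw = sum(dSSR/dy_pred * dy_pred/dw) = sum(-2 * residual * x)
    grad_w = np.sum(-2 * residual * x)
    # Chain rule: dSSR/db = sum(-2 * residual * 1)
    grad_b = np.sum(-2 * residual)
    # Gradient descent: step against the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # approaches 2 and 1
```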


### Graphing

```python
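import matplotlib.pyplot as plt
from IPython.display import clear_output  # clears the notebook cell output between redraws

def live_plot(loss, train_acc, valid_acc=None, figsize=(7,5), title=''):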
    clear_output(wait=True)
    fig, ax1 = plt.subplots(figsize=figsize)
    ax1.plot(loss, label='Training Loss', color='red')
    ax1.legend(loc='lower left')
    ax1.set_ylabel('Cross Entropy Loss')
    ax2 = ax1.twinx()
    ax2.plot(train_acc, label='Training Accuracy', color='green')
    if valid_acc is not None:
        ax2.plot(valid_acc, label='Validation Accuracy', color='blue')
    ax2.legend(loc='lower right')
    ax2.set_ylabel('Accuracy (%)')
    ax2.set_xlabel('Epoch')
    plt.title(title)
    plt.show()

```

# **LAB 10**

### **Clustering**

### **PCA | Principal Component Analysis**

PCA tells you which variables are most important in representing the variation in the data

To test how well a line fits the data, PCA projects the data onto it. PCA either minimizes the distances from the points to the line (residuals) or, equivalently, maximizes the distances from the projected points to the origin (variance)

Each principal component (PC) is a linear combination of variables, sort of like a recipe. The unit vector along a PC is an eigenvector, and the proportions of each variable in it are called loading scores.

PC2 is the line through the origin perpendicular to PC1

Next the variation of each PC can be calculated. 

$$Variance PC1 = \frac{Sum Of Squares(PC1)}{n-1} $$


Scree Plot is a graphical representation of the percentages of variation that each PC accounts for
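
A minimal sketch of these ideas with scikit-learn's `PCA` on synthetic two-variable data (the data is made up). It checks the variance formula above against `explained_variance_` and prints the percentages a scree plot would show.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second variable is roughly half the first
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.c_[t, 0.5 * t + rng.normal(scale=0.1, size=200)]

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)   # projections of the centered data onto PC1 and PC2

# Variance of PC1 = sum of squared projections / (n - 1)
var_pc1 = np.sum(scores[:, 0] ** 2) / (X.shape[0] - 1)

print("Loading scores (rows are PCs):")
print(pca.components_)
print("Variance of PC1:", var_pc1)
print("Scree plot percentages:", 100 * pca.explained_variance_ratio_)
```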

### K Nearest Neighbor

K Nearest Neighbors classifies a new data point by looking at the classes of its K closest data points and taking a majority vote. 
Ideally keep K odd to avoid ties
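
A tiny sketch with scikit-learn's `KNeighborsClassifier`, using six made-up points in two well-separated groups and K = 3.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: class 0 near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Each new point takes the majority class of its 3 nearest neighbours
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # [0 1]
```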


### K Means Clustering 

* Step 1: Select the number of clusters (K)
* Step 2: Select K distinct data points as the initial clusters
* Step 3: Measure the distance from each point to each cluster
* Step 4: Assign each point to its nearest cluster
* Step 5: Move each cluster to the mean of its points and repeat until the assignments stop changing

To determine how well the clustering works, add up the total variation within each cluster. The best model is the one that minimizes this.

K means clustering specifically tries to put the data into the number of clusters you tell it to.

How to determine K?
Compare the total variation of the models (k, k+1). Graph these results. Where the variation drops off is where the optimal value of k should be.
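
The comparison across values of k can be sketched with scikit-learn's `KMeans`, whose `inertia_` attribute is the total within-cluster sum of squares. The three synthetic blobs below are an assumption for illustration, so the drop-off is expected at k = 3.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # total variation within the clusters

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against k gives the elbow plot: the variation drops sharply up to k = 3 and only slowly afterwards.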


### Hierarchical Clustering

Tells the user which points are most similar pairwise: the closest pair is merged first, and merging repeats to build a tree (dendrogram)
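
A minimal sketch using SciPy's `linkage` on five made-up points. Average linkage is an arbitrary choice here; each row of the linkage matrix records which two clusters were merged and at what distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical points: two tight pairs and one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [10.0, 0.0]])

# Each row of Z: [cluster a, cluster b, merge distance, size of new cluster]
Z = linkage(X, method="average")
print(Z)

# Cut the tree into at most 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The first row of `Z` merges points 0 and 1, the most similar pair.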