<h1 align="center">Module 4 Assessment</h1>

## Overview

This assessment is designed to test your understanding of the Mod 4 material. It covers:

* Calculus, Cost Function, and Gradient Descent
* Introduction to Logistic Regression
* Extensions to Linear Models

Read the instructions carefully. You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions please use your own words. The expectation is that you have not copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to communicate clearly.

---
## Calculus, Cost Function, and Gradient Descent [Suggested Time: 25 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line that goes through the scatterplot above can be written in the general form: $$y = mx + b$$

Of all the possible lines, we can show why that particular line was chosen using the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual} - \text{predicted})^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
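
As a rough illustration of the formula above, the sketch below computes RSS for a few candidate slopes. The data points and the candidate values of `m` are hypothetical (they are not taken from the scatterplot), and the intercept is held at 0 to keep the example small.

```python
# Hypothetical data points (assumption: not the data behind the plots above).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def rss(m, b, xs, ys):
    """Residual sum of squares for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Evaluate the cost at several candidate slopes, intercept fixed at 0.
costs = {m: rss(m, 0.0, xs, ys) for m in (1.0, 1.5, 2.0, 2.5)}
best_m = min(costs, key=costs.get)

print(costs)
print("lowest RSS at m =", best_m)
```

Plotting `costs` against `m` would trace out a curve of the same kind as the cost curve shown above, with its lowest point at the best-fitting slope.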

### 1. What is a more generalized name for the RSS curve above? How is it related to machine learning models?
#### Hint: RSS is just one of many metrics that could be used to do the same job. We are asking about that general use.

# Your answer here
The RSS curve is an example of a cost function (also called a loss function). A cost function measures how far a model's predictions are from the actual data, and fitting a machine learning model means finding the parameters that minimise it. It also allows various models (or iterations of the same model) to be compared by how closely each one represents the true data. In a 2D model, RSS is the sum of the squared vertical distances between the line of best fit and each actual data point. Past 3D it is difficult to visualize, but the cost function still exists.

RSS allows us to select the model which best fits the data, and also allows us to perform gradient descent, whereby we shift the value of $m$ step by step in the direction that reduces the error (minimises the cost function), instead of spending a huge amount of computation evaluating every possible value.

### 2. Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve up above? What is the relation between the position on the cost curve, the error, and the slope of the line?

# Your answer here
We would rather choose 0.05, because it is lower on the cost curve. With RSS as our cost function, a higher RSS means that the model is further away from (less representative of) our data than a model with a lower RSS. Here $m$ is the slope of the line, the coefficient applied to $x$ in the model we are trying to fit. Each position on the cost curve corresponds to one slope and the error that slope produces, so 0.05 is a better $m$ than 0.08 because it gives the regression model a smaller error.

In a 2D model, this means the line of best fit (regression model) with $m = 0.05$ fits the data better than the line with $m = 0.08$, so we would rather use the $m = 0.05$ regression model as a predictor of $y$.

![](visuals/gd.png)

### 3. Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

# Your answer here
Each step is proportional to the gradient of the cost function at the current point (the step is the learning rate multiplied by the gradient). Far from the bottom of the curve, differentiating the cost function gives a large (steep) gradient, so $m$ moves a large amount. The closer we get to the bottom, the less steep the gradient becomes, so each update to $m$ is smaller. This also helps us avoid overstepping the optimal $m$, which minimizes the cost function.
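
To make the shrinking steps concrete, here is a minimal sketch using a toy cost function $J(m) = (m - 3)^2$. The cost function, starting point, and learning rate are all assumptions chosen for illustration; they are not the curve shown in the visual.

```python
def gradient(m):
    """Derivative of the toy cost J(m) = (m - 3)**2 (an illustrative assumption)."""
    return 2 * (m - 3)


m = 0.0              # hypothetical starting slope
learning_rate = 0.1  # hypothetical, fixed for every step
step_sizes = []

for _ in range(8):
    step = learning_rate * gradient(m)  # step is proportional to the gradient
    m -= step
    step_sizes.append(abs(step))

print([round(s, 4) for s in step_sizes])
print("m after 8 steps:", round(m, 4))
```

The learning rate never changes here, so the steps shrink only because the gradient gets smaller as `m` approaches the minimum.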

### 4. What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

# Your answer here
The learning rate dictates how much we change $m$ on each iteration. A high learning rate means we take larger steps towards the minimum of the cost function, and a small learning rate means we move slowly towards the bottom.

A very high learning rate can overshoot the minimum each time, jumping back and forth across the optimal $m$, and may never converge. A very small learning rate will converge, but needs a very large number of iterations to reach the optimal $m$. Both extremes are computationally inefficient. The learning rate is a hyperparameter set by the user, who should have an understanding of the model.
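
The sketch below compares three learning rates on the toy cost $J(m) = (m - 3)^2$, whose minimum is at $m = 3$. The cost function, the helper name `descend`, and the specific learning rates are assumptions made for this example only.

```python
def descend(learning_rate, n_steps=50, start=0.0):
    """Run plain gradient descent on the toy cost J(m) = (m - 3)**2."""
    m = start
    for _ in range(n_steps):
        m -= learning_rate * 2 * (m - 3)  # gradient of J is 2 * (m - 3)
    return m


too_small = descend(0.001)  # barely moves from the start in 50 steps
reasonable = descend(0.1)   # ends very close to the minimum at m = 3
too_large = descend(1.1)    # overshoots further on every step and diverges

print("lr=0.001 ->", too_small)
print("lr=0.1   ->", reasonable)
print("lr=1.1   ->", too_large)
```

With the same number of steps, only the middle learning rate ends near the minimum; the small one has not arrived yet and the large one has moved away from it.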

---
## Introduction to Logistic Regression [Suggested Time: 25 min]
---

<!---
# load data
ads_df = pd.read_csv("raw_data/social_network_ads.csv")

# one hot encode categorical feature
def is_female(x):
    """Returns 1 if Female; else 0"""
    if x == "Female":
        return 1
    else:
        return 0
        
ads_df["Female"] = ads_df["Gender"].apply(is_female)
ads_df.drop(["User ID", "Gender"], axis=1, inplace=True)
ads_df.head()

# separate features and target
X = ads_df.drop("Purchased", axis=1)
y = ads_df["Purchased"]

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)

# preprocessing
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# save preprocessed train/test split objects
pickle.dump(X_train, open("write_data/social_network_ads/X_train_scaled.pkl", "wb"))
pickle.dump(X_test, open("write_data/social_network_ads/X_test_scaled.pkl", "wb"))
pickle.dump(y_train, open("write_data/social_network_ads/y_train.pkl", "wb"))
pickle.dump(y_test, open("write_data/social_network_ads/y_test.pkl", "wb"))

# build model
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

from sklearn.metrics import confusion_matrix

# create confusion matrix
# tn, fp, fn, tp
cnf_matrix = confusion_matrix(y_test, y_test_pred)
cnf_matrix

# build confusion matrix plot
plt.imshow(cnf_matrix,  cmap=plt.cm.Blues) #Create the basic matrix.

# Add title and Axis Labels
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')

# Add appropriate Axis Scales
class_names = sorted(set(y_test)) #Get class labels in a fixed order to add to matrix
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# Add Labels to Each Cell
thresh = cnf_matrix.max() / 2. #Used for text coloring below
#Here we iterate through the confusion matrix and append labels to our visualization.
for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plt.text(j, i, cnf_matrix[i, j],
                 horizontalalignment="center",
                 color="white" if cnf_matrix[i, j] > thresh else "black")

# Add a Side Bar Legend Showing Colors
plt.colorbar()

# Add padding
plt.tight_layout()
plt.savefig("visuals/cnf_matrix.png",
            dpi=150,
            bbox_inches="tight")
--->

![cnf matrix](visuals/cnf_matrix.png)

### 1. Using the confusion matrix up above, calculate precision, recall, and F-1 score.

In [26]:
# counts read from the confusion matrix above
tp = 30
fp = 4
fn = 12

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall / (precision + recall))

print(precision, recall, f1)

0.8823529411764706 0.7142857142857143 0.7894736842105262


### 2.  What is a real life example of when you would care more about recall than precision? Make sure to include information about errors in your explanation.

# Your answer here
A model tuned for recall will tend to have more false positives but fewer false negatives, while a model tuned for precision will have more false negatives but fewer false positives.

The medical field is one example where recall matters more: screening for a disease focuses on catching all instances and missing very few (few false negatives), even if this means there are a large number of false positives too.
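
The trade-off can be seen by moving the decision threshold of a classifier. The labels and scores below are hypothetical (they are not from the social network ads data), and `precision_recall` is a helper written only for this sketch.

```python
# Hypothetical true labels and predicted probabilities (illustrative only).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]


def precision_recall(threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)


strict = precision_recall(0.5)   # fewer positives predicted
lenient = precision_recall(0.3)  # more positives predicted

print("threshold 0.5 (precision, recall):", strict)
print("threshold 0.3 (precision, recall):", lenient)
```

Lowering the threshold catches every positive case in this toy data (higher recall) at the price of more false positives (lower precision), which is the choice a disease-screening setting would usually make.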

<!---
# load preprocessed train/test split objects
X_train = pickle.load(open("write_data/social_network_ads/X_train_scaled.pkl", "rb"))
X_test = pickle.load(open("write_data/social_network_ads/X_test_scaled.pkl", "rb"))
y_train = pickle.load(open("write_data/social_network_ads/y_train.pkl", "rb"))
y_test = pickle.load(open("write_data/social_network_ads/y_test.pkl", "rb"))

# build model
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

labels = ["Age", "Estimated Salary", "Female", "All Features"]
colors = sns.color_palette("Set2")
plt.figure(figsize=(10, 8))
# add one ROC curve per feature
for feature in range(3):
    # female feature is one hot encoded so it produces an ROC point rather than a curve
    # for this reason, female will not be included in the plot at all since it is
    # disingenuous to call it a curve.
    if feature == 2:
        pass
    else:
        X_train_feat = X_train[:, feature].reshape(-1, 1)
        X_test_feat = X_test[:, feature].reshape(-1, 1)
        logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='lbfgs')
        model_log = logreg.fit(X_train_feat, y_train)
        y_score = model_log.decision_function(X_test_feat)
        fpr, tpr, thresholds = roc_curve(y_test, y_score)
        lw = 2
        plt.plot(fpr, tpr, color=colors[feature],
                 lw=lw, label=labels[feature])

# add one ROC curve with all the features
model_log = logreg.fit(X_train, y_train)
y_score = model_log.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
lw = 2
plt.plot(fpr, tpr, color=colors[3], lw=lw, label=labels[3])

# create foundation of the plot
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i / 20.0 for i in range(21)])
plt.xticks([i / 20.0 for i in range(21)])
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve")
plt.legend()
plt.tight_layout()
plt.savefig("visuals/many_roc.png",
            dpi=150,
            bbox_inches="tight")
--->

### 3. Pick the best ROC curve from this graph and explain your choice. 

*Note: each ROC curve represents one model, each labeled with the feature(s) inside each model*.

<img src = "visuals/many_roc.png" width = "700">


# Your answer here
One argument could be that the orange model has the best curve, as you can have a 0.33 true positive rate with a 0 false positive rate.
However, given that the features are age and estimated salary, there is probably little cost involved with false positives, so the pink model is likely the best. This is because it reaches the highest true positive rate for the lowest increase in false positive rate, compared to the other models.
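
One common way to put a number on "best curve" is the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it by that pairwise definition. The two sets of scores are hypothetical and do not correspond to the models in the plot.

```python
def auc_by_ranking(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores from two imaginary models.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
model_b = [0.7, 0.4, 0.3, 0.2, 0.6, 0.5, 0.35, 0.25, 0.1, 0.05]

auc_a = auc_by_ranking(labels, model_a)
auc_b = auc_by_ranking(labels, model_b)
print("model A AUC:", auc_a)
print("model B AUC:", auc_b)
```

A larger AUC corresponds to a curve that bends further towards the top-left corner, although the preferred operating point still depends on the relative cost of false positives and false negatives.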

<!---
# sorting by 'Purchased' and then dropping the last 130 records
dropped_df = ads_df.sort_values(by="Purchased")[:-130]
dropped_df.reset_index(inplace=True)
pickle.dump(dropped_df, open("write_data/sample_network_data.pkl", "wb"))
--->

In [13]:
network_df = pickle.load(open("write_data/sample_network_data.pkl", "rb"))

# partition features and target 
X = network_df.drop("Purchased", axis=1)
y = network_df["Purchased"]

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train,y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f"The original classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.")

# get the area under the curve from an ROC curve
y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = round(roc_auc_score(y_test, y_score), 3)
print(f"The original classifier has an area under the ROC curve of {auc}.")

The original classifier has an accuracy score of 0.956.
The original classifier has an area under the ROC curve of 0.836.


### 4. The model above has an accuracy score that might be too good to believe. Using `y.value_counts()`, explain how `y` is affecting the accuracy score.

In [14]:
y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

In [22]:
# // your answer here //
network_df.head()

Unnamed: 0,index,Age,EstimatedSalary,Purchased,Female
0,0,19,19000,0,0
1,180,26,16000,0,0
2,181,31,71000,0,1
3,183,33,43000,0,0
4,184,33,60000,0,1


In [24]:
y_test_pred.sum()

2

Because there are very few purchases, the model almost always predicts 0. Since about 95% of the labels are 0 (257 of 270), a model that mostly predicts 0 ends up around 95% accurate. This accuracy is a poor representation of the model's quality, and does not mean it can predict purchases 95% of the time.

95% of the time, there is no purchase, so it is not hard for the model to be 95% accurate in predicting no purchase.
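
The check below works out the accuracy of a "model" that always predicts the majority class, using the counts printed by `y.value_counts()` above. It is a back-of-the-envelope sketch over the full `y`, not over the exact test split.

```python
# Class counts taken from the y.value_counts() output above.
counts = {0: 257, 1: 13}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_class] / total

# Always predicting 0 never finds a purchaser, so recall for class 1 is zero.
true_positives = 0
baseline_recall = true_positives / counts[1]

print("majority class:", majority_class)
print("baseline accuracy:", round(baseline_accuracy, 3))
print("baseline recall for purchases:", baseline_recall)
```

An accuracy near this baseline says little about whether the classifier has learned to identify purchases.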

### 5. What methods would you use to address the issues mentioned up above in question 4? 


# // your answer here //
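We could perform k-fold cross-validation to repeatedly split the data, retrain and retest the model, and take an average over all of the folds to see how the model is truly performing. This would give a more representative accuracy score. Because the underlying problem is class imbalance, cross-validation on its own is not enough: we could also resample the training data (oversampling purchases or undersampling non-purchases), weight the classes when fitting the model, and judge the model on precision, recall, F1, or AUC rather than accuracy alone.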
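
One option for an imbalanced target is to weight each class inversely to its frequency. The sketch below reproduces the formula that scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * count)`, in plain Python using the class counts shown earlier. Whether this weighting actually improves this particular model is an assumption that would need to be checked.

```python
# Class counts taken from the y.value_counts() output above.
counts = {0: 257, 1: 13}

n_samples = sum(counts.values())
n_classes = len(counts)

# Weight each class inversely to how often it appears.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, the 13 purchases carry the same total weight as the 257 non-purchases, so misclassifying a purchase costs the model far more than misclassifying a non-purchase.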

---
## Extensions to Linear Regression [Suggested Time: 25 min]
---

In this section, you're going to be creating linear models that are more complicated than a simple linear regression. In the cells below, we are importing relevant modules that you might need later on. We also load and prepare the dataset for you.

### 2. What is the optimal number of degrees for our polynomial features in this model? In general, how does increasing the polynomial degree relate to the Bias/Variance tradeoff?  (Note that this graph shows RMSE and not MSE.)

<img src ="visuals/rsme_poly_2.png" width = "600">

<!---
fig, ax = plt.subplots(figsize=(7, 7))
degree = list(range(1, 10 + 1))
ax.plot(degree, error_train[0:len(degree)], "-", label="Train Error")
ax.plot(degree, error_test[0:len(degree)], "-", label="Test Error")
ax.set_yscale("log")
ax.set_xlabel("Polynomial Feature Degree")
ax.set_ylabel("Root Mean Squared Error")
ax.legend()
ax.set_title("Relationship Between Degree and Error")
fig.tight_layout()
fig.savefig("visuals/rsme_poly.png",
            dpi=150,
            bbox_inches="tight")
--->

# Your answer here

7 is most likely the optimal polynomial feature degree, although an argument could be made for 3 depending on the context and the output of the different models. At degree 3 the root mean squared error on the test data is lower than at any other point, but such a simple model may limit overall predictive ability: it has low variance but could have high bias, because the small number of features restricts what it can learn. After 7, the test RMSE rises significantly, which means that adding further polynomial features increases the variance massively. The model is still fitting the training data well at this point; in fact, this is the lowest training error on the graph. Past 7, the model is fitted too closely to the training data, and its error on the test data increases dramatically.

This is overfitting: the model is very good at fitting our training data (low bias) but poor at generalising to outside data (high variance). In general, increasing the polynomial degree lowers bias and raises variance. We usually aim for a balance between the two, so that the model is flexible enough for the data we feed in but still predicts external, unseen data accurately (with low error).

### 3. In general what methods could you use to reduce overfitting and underfitting? Provide an example for both and explain how each technique works to reduce the problems of underfitting and overfitting.

# Your answer here

If our model is underfitting our data, we can add more features to increase its predictive quality, or run gradient descent for longer to find coefficient weights that better minimise the cost function. We can also add interaction terms, multiplying separate features together to create a new feature that may improve the predictive quality of our model. For example, when predicting people's income, years of education and age may both help, but the product of years of education and age may add a term that increases the R-squared of the model too. The two are also likely related in some way (though this depends on the demographics of the population). At its heart, this adds another feature which can help predict our target variable.

If our model is overfitting the data, we have high variance. We can use cross-validation, which means the model is not trained on a single set of training data; instead we build several models using different samples of the training data and see how well they perform on the held-out portions.
We can also reduce the number of features in the model, which lowers its complexity and therefore reduces overfitting.
Finally, we can add regularization, which penalizes large coefficients. Adding polynomial terms would do the opposite, since it makes the model more complex.




### 4. What is the difference between the two types of regularization for linear regression?

# Your answer here

There are two types of regularization, lasso and ridge. Both are regressions which add a penalty to the cost function, so that large coefficients make it harder to minimise. Each penalizes model complexity.
A ridge regression (L2) adds the sum of the squared coefficients as its penalty. All predictive features are retained in the model, but their individual weights are continuously shrunk towards zero; the stronger the penalty, the more reduction.

A lasso regression (L1) adds the sum of the absolute values of the coefficients as its penalty, which can reduce some of them to exactly zero. This drops those predictive features from the model.
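
The different behaviour of the two penalties is easiest to see in a simplified special case. The sketch below assumes orthonormal features, where the penalized solutions have closed forms: ridge (objective $\tfrac{1}{2}\lVert y - X\beta\rVert^2 + \tfrac{\lambda}{2}\lVert\beta\rVert_2^2$) divides each unpenalized coefficient by $1 + \lambda$, while lasso (objective $\tfrac{1}{2}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$) soft-thresholds it. The coefficient values and $\lambda$ are hypothetical, and real data rarely satisfies the orthonormality assumption.

```python
def ridge_shrink(beta, lam):
    """Ridge solution for one coefficient under the orthonormal-feature assumption."""
    return beta / (1.0 + lam)


def lasso_shrink(beta, lam):
    """Lasso solution (soft-thresholding) under the same assumption."""
    magnitude = max(abs(beta) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta > 0 else -magnitude


ols_coefficients = [3.0, -1.5, 0.4, -0.2]  # hypothetical unpenalized coefficients
lam = 0.5                                  # hypothetical penalty strength

ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]

print("ridge:", [round(b, 4) for b in ridge])
print("lasso:", [round(b, 4) for b in lasso])
```

Ridge shrinks every coefficient but leaves all of them non-zero, whereas lasso sets the two small coefficients to exactly zero, which is why lasso can act as a feature selector.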

### 5. Why is scaling input variables a necessary step before regularization?

# Your answer here
Input variables can come in completely different units: grams vs kilograms, or even kilograms vs dollar price. The regularization penalty depends on the size of each coefficient, and the size of a coefficient depends on the units of its feature. If features are fed into a regularized model without scaling, the regression will immediately favour the features with the largest raw values (1000 grams), which need only small coefficients, and shrink or drop those with small values (1 kg), which need large coefficients, even though they could carry the same information or be an even better predictor (e.g. $1).

Scaling places all of these on the same range, meaning that the features which are the strongest predictors of our target variable are kept, and the weakest predictors are reduced or dropped, rather than keeping features simply because of their largest absolute values, which reflect their units, not their predictive effect.
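
The unit problem can be shown with a few lines of arithmetic. In the sketch below, the same hypothetical feature is expressed in kilograms and in grams; the coefficient values are made up, and `standardize` is a small stand-in for what `StandardScaler` does (subtract the mean, divide by the population standard deviation).

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of values using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


weights_kg = [1.0, 2.0, 3.0, 4.0]          # hypothetical feature in kilograms
weights_g = [1000 * w for w in weights_kg]  # the same feature in grams

coef_kg = 2.0             # hypothetical coefficient per kilogram
coef_g = coef_kg / 1000   # equivalent coefficient per gram (same predictions)

# The L2 penalty sees very different coefficient sizes for the same feature.
penalty_kg = coef_kg ** 2
penalty_g = coef_g ** 2
print("L2 penalty (kg):", penalty_kg)
print("L2 penalty (g): ", penalty_g)

# After standardizing, both versions of the feature are the same numbers.
z_kg = standardize(weights_kg)
z_g = standardize(weights_g)
print("standardized kg:", [round(z, 4) for z in z_kg])
print("standardized g: ", [round(z, 4) for z in z_g])
```

Once the feature is standardized, its coefficient (and therefore its penalty) no longer depends on which unit it was recorded in.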

### -------------------------------------------------------------------------------------------------------
                                           OPTIONAL
### -------------------------------------------------------------------------------------------------------

In [12]:
import pandas as pd
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.linear_model import Lasso, Ridge
import pickle
from sklearn.metrics import mean_squared_error, roc_curve, roc_auc_score, accuracy_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

In [None]:
data = pd.read_csv('raw_data/advertising.csv').drop('Unnamed: 0',axis=1)
data.describe()

In [None]:
X = data.drop('sales', axis=1)
y = data['sales']

In [None]:
# split the data into training and testing set. Do not change the random state please!
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

### 1. We'd like to add a bit of complexity to the model created in the example above, and we will do it by adding some polynomial terms. Write a function to calculate train and test error for different polynomial degrees.

This function should:
* take `degree` as a parameter that will be used to create polynomial features to be used in a linear regression model
* create a PolynomialFeatures object for each degree and fit a linear regression model using the transformed data
* calculate the mean squared error for each level of polynomial
* return the `train_error` and `test_error` 


In [None]:
def polynomial_regression(degree):
    """
    Calculate train and test error for a linear regression with polynomial features.
    (Hint: use PolynomialFeatures)
    
    input: Polynomial degree
    output: Mean squared error for train and test set
    """
    # // your code here //
    
    train_error = None
    test_error = None
    return train_error, test_error

#### Try out your new function

In [None]:
polynomial_regression(3)

#### Check your answers

MSE for degree 3:
- Train: 0.2423596735839209
- Test: 0.15281375973923944

MSE for degree 4:
- Train: 0.18179109317368244
- Test: 1.9522597174462015
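
For reference, here is one possible shape for such a function, shown on synthetic data rather than the advertising dataset. It is a sketch under assumptions: unlike the stub above, it takes the train/test splits as explicit arguments so that it is self-contained, the synthetic data is invented for illustration, and its output is not expected to match the check values listed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures


def polynomial_regression(degree, X_train, X_test, y_train, y_test):
    """Return (train MSE, test MSE) for a linear model on polynomial features."""
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)  # fit the expansion on train only
    X_test_poly = poly.transform(X_test)

    model = LinearRegression().fit(X_train_poly, y_train)

    train_error = mean_squared_error(y_train, model.predict(X_train_poly))
    test_error = mean_squared_error(y_test, model.predict(X_test_poly))
    return train_error, test_error


# Synthetic, hypothetical data: a quadratic signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

errors = {
    degree: polynomial_regression(degree, X_train, X_test, y_train, y_test)
    for degree in (1, 2, 3)
}
for degree, (train_error, test_error) in errors.items():
    print(degree, train_error, test_error)
```

Comparing the train and test errors across degrees in this way is what produces a plot like the RMSE graph shown earlier in this section.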