# ML Week 4 - Logistic Regression

---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#ML-Week-3---Cross-Validation) | [Next section](#Part-0:-Quick-review) | [Bottom](#Thank-you)

This notebook has the following sections:

* [Part 0: Quick review!](#Part-0:-Quick-review)
* [Part 1: Linear to Logistic Regression](#Part-1:-Linear-to-Logistic-Regression)
* [Part 2: Assessing the model fit](#Part-2:-Assessing-the-model-fit)
* [Part 3: Receiver Operating Curves (Optional)](#Part-3:-Receiver-Operating-Curves-(Optional))
* [Part 4: Cross-validating logistic regression](#Part-4:-Cross-validating-logistic-regression)
* [Part 5: Another dataset with spam detection](#Part-5:-Another-dataset-with-spam-detection)

## Part 0: Quick review
---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#ML-Week-4---Logistic-Regression) | [Next section](#Part-1:-Linear-to-Logistic-Regression) | [Bottom](#Thank-you)

Today we're going to primarily focus on a section of the [Framingham heart study](https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset) dataset from Kaggle. For those who don't know, the [Framingham heart disease study](https://www.framinghamheartstudy.org/) is a study that began back in 1948 to assess common factors or characteristics that contribute to heart disease. It is a _prospective_ study, meaning it assessed individuals over time to see how current factors contributed to coronary heart disease that developed _in the future_.

The dataset contains the following variables...

| Column | Description |
|--------|-------------|
| age | The age of the patient |
| male | 1 if the patient is male, 0 if female |
| cigsPerDay| Average number of reported cigarettes smoked per day |
| totChol | Total cholesterol level |
| sysBP | Systolic blood pressure reading |
| diaBP | Diastolic blood pressure reading |
| BMI | Body mass index |
| glucose | glucose level in the blood stream |
| heartRate | Heart rate when surveyed |
| TenYearCHD | Did someone develop coronary heart disease (CHD) within the next 10 years? |

Our goal is to use these attributes to predict a patient's 10-year risk of CHD. Once the model is developed, we could assess which features contribute the most to the likelihood of heart disease.

Let's import our dataset and describe it.

In [None]:
# Import pandas
import pandas as pd

# Load data
framingham_data = pd.read_csv('data/framingham.csv')

# Describe
framingham_data.describe()

### Using linear regression to predict coronary heart disease risk

Last week we talked about predicting variables using linear regression, and cross-validating the best model using K-Fold cross validation. Let's import our models to run linear regression and visualise the performance.

In [None]:
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import linear regression and needed modules
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

Here's a similar function from last week that will run linear regression with cross-validation.

In [None]:
def lin_regress_w_full_cross_val(
    data, 
    cols=['temp'],
    max_powers=[1],
    regression_type=None,
    target='cnt',
    cv=10,
):
    """
    Run linear regression with kfold cross validation to analyse error. We will be able to choose
    the type of regression, the columns we want within our model, and the maximum power.
    
    :param data: <pd.DataFrame>, the data for our model
    :param cols: list<str>, a list of columns to use in our model
    :param max_powers: list<int>, the max power to use for a corresponding column
    :param regression_type: <str>, either None (regular), 'lasso', or 'ridge'
    :param target: <str>, the target variable, defaults to 'cnt' assuming the
                   correct preprocessing
    :param cv: <int>, the number of folds
    """
    # Create necessary columns
    model_data = data.copy()
    
    # Create columns dict
    all_cols = dict()
    all_cols[1] = cols[:]
    
    # Go through each column and add the correct powers needed
    for i in range(len(cols)):
        if max_powers[i] > 1:
            for p in range(2, max_powers[i] + 1):
                if p not in all_cols.keys():
                    all_cols[p] = []
                model_data[cols[i] + '_' + str(p)] = model_data[cols[i]] ** p
                # Append to columns list
                all_cols[p].append(cols[i] + '_' + str(p))
                
    # Add on all keys
    for i in range(2, len(all_cols.keys()) + 1):
        all_cols[i] += all_cols[i - 1]
                    
    # Fit data towards each power, and calculate MSE
    powers = []
    mse = []
    
    # Get linear regression
    if regression_type == 'lasso':
        lr = Lasso(alpha=1.0)
    elif regression_type == 'ridge':
        lr = Ridge(alpha=0.1)
    else:
        lr = LinearRegression()
    
    for i in range(1, max(max_powers) + 1):
        # Run linear regression and cross validation
        mse += (cross_val_score(
            lr, model_data[all_cols[i]], y=model_data[target], scoring='neg_mean_squared_error', cv=cv
        ) * -1).tolist()
        # Get metrics
        powers += [i] * cv
        
    # Graph
    mse_df = pd.DataFrame({'Max Power': powers, 'MSE': mse})
    sns.pointplot(x='Max Power', y='MSE', data=mse_df, ci=68)

### Exercise

Like we did last week, use the inputs below to run cross-validation, and decide on a final model to predict heart disease. Reminder....

* `cols` is a list of columns, for example `['age', 'cigsPerDay', 'totChol']`
* `max_powers` is a list of maximum powers to use within the model for each columns, for example `[3, 1, 2]`
* `regression_type` can be `None` for regular regression, `'lasso'`, or `'ridge'`

The MSE will be plotted across the different models used.
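To make `max_powers` concrete, here is a minimal sketch of the kind of power columns the function builds internally. The toy DataFrame and its values are made up for illustration; only the `_2`/`_3` column naming mirrors the function above.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({'age': [40, 50, 60]})

# A max power of 3 for 'age' adds age_2 and age_3 next to the original column
max_power = 3
for p in range(2, max_power + 1):
    toy['age_' + str(p)] = toy['age'] ** p

print(toy.columns.tolist())
print(toy)
```

Each extra power gives the model one more column to fit, which is why higher values of `max_powers` produce more complex models.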

In [None]:
# Adjust columns
cols = ['age', 'cigsPerDay', 'totChol']
max_powers = [3, 1, 2]
regression_type = None

# Run regression
lin_regress_w_full_cross_val(
    framingham_data, 
    cols=cols,
    max_powers=max_powers,
    regression_type=regression_type,
    target='TenYearCHD',
    cv=10,
)

## Part 1: Linear to Logistic Regression

---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#Part-0:-Quick-review) | [Next section](#Part-2:-Assessing-the-model-fit) | [Bottom](#Thank-you)

Let's actually analyse a resulting regression. We'll use...

* regular linear regression
* complexity of order 1 on variables
* we'll use the [`sns.regplot()`](https://seaborn.pydata.org/generated/seaborn.regplot.html) function, which automatically draws a best fit line

What we'll do is graph the relationship between someone's blood glucose level (the column `glucose`) and their CHD risk (`TenYearCHD`). Let's only do this for a subset of cases for convenience.

In [None]:
# Draw regplot on a couple of variables
samp_data = framingham_data.loc[
        (framingham_data['glucose'] > 175) & (framingham_data['glucose'] < 225), :
    ].sample(n=10, random_state=20)

plt.figure(figsize=(15, 5))
sns.regplot(
    x='glucose', 
    y='TenYearCHD', 
    data=samp_data, 
    ci=None)
print("")

The fit here is...not great. Our MSE was low, but that's because the distance between each point and the line happens to be small in magnitude.

The issue is that `TenYearCHD` is not a continuous variable. It's a **categorical variable** that takes on...

* 0, if someone was _not_ diagnosed with CHD within the next 10 years
* 1, if someone was diagnosed with CHD within the next 10 years

### Thought exercise

What would you do to take this line and make a **categorical** prediction? Is there a certain threshold, where _if_ someone's glucose is above a certain level, we would start predicting that `TenYearCHD = 1`?

### Exercise

The following code runs linear regression using the sample points above. Add an extra step to turn your linear regression into a classification based upon your designated threshold. Here's an example of indexing an array, based upon a threshold (`1`) using an arbitrary numpy array called `a`.

```python
a[a > 1] = 22
```

In [None]:
# Run linear regression
lr = LinearRegression()

# Fit data
lr.fit(samp_data[['glucose']], samp_data['TenYearCHD'])

# Predict data
y_pred = lr.predict(samp_data[['glucose']])

# INSERT CODE HERE TO THRESHOLD THE ARRAY


# Graph line
plt.figure(figsize=(15, 5))
sns.regplot(x=samp_data['glucose'], y=samp_data['TenYearCHD'], ci=None, fit_reg=False, color='blue')
sns.regplot(x=samp_data['glucose'], y=y_pred, ci=None, fit_reg=False, color='red', marker='+')

# Add code to turn y_pred into a categorical variable
print(y_pred)

Let's now look at the entire dataset...

In [None]:
# Draw regplot on a couple of variables
plt.figure(figsize=(15, 5))
sns.regplot(x='glucose',y='TenYearCHD', data=framingham_data, ci=None)
print("")

It's not really easy to figure out where the threshold should be...is it? Imagine also if we get glucose values that are higher and lower than our current range...we're going to get predictions that are above 1 and below 0. Let's change things up a little bit. What we're going to do is the following...

* Run **univariate** linear regression, using just **glucose** to predict heart disease
* Use the resulting coefficient to transform the variable (I'll explain the transformation later)
* Graph the new fitted curve

In [None]:
# Import numpy
import numpy as np

# Add coefficients
m = 0.01000811
b = -2.54022054
x = np.array(range(-100, 700))

y_log = 1 / (1 + np.exp(-1 * (m*x + b))) # 1 / (1 + e^(-(mx + b)))

# Graph
plt.figure(figsize=(15, 5))
sns.regplot(x=framingham_data['glucose'], y=framingham_data['TenYearCHD'], fit_reg=False)
sns.regplot(x=x, y=y_log, fit_reg=False, marker='.')

### Thought exercise

* What is the maximum of the orange curve?
* What is the minimum of the orange curve?
* What do you think the orange curve represents?

The function we graphed in orange is called a **logistic function**. It's represented by the following form:

$$ f(z) = \frac{1}{1 + e^{-z}}$$

For those who do not know, $e$ is a special mathematical constant.

$$e = \sum_{n=0}^{\infty}\frac{1}{n!} = 1 + \frac{1}{1} + \frac{1}{2 * 1} + \frac{1}{3 * 2 * 1} + ... \approx 2.71828$$

The number was oddly discovered by Jacob Bernoulli while [studying compound interest](https://www-history.mcs.st-and.ac.uk/HistTopics/e.html).
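As a quick sanity check on the series for $e$, here is a small sketch that adds up the first few terms and compares the result with Python's built-in constant. The choice of 11 terms is arbitrary.

```python
import math

# Partial sum of 1/n! for n = 0, 1, ..., 10
total = 0.0
for n in range(11):
    total += 1 / math.factorial(n)

print('Partial sum:', total)
print('math.e:     ', math.e)
```

The terms shrink very quickly, so a handful of them already gets close to the value of $e$.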

### Exercise

Graph the logistic function. The code initialises $z$ for you.

**HINT:** You can use the `np.exp(z)` function to compute $e^z$.

In [None]:
# Initialise z
z = np.arange(-10, 10, 0.1)

# INSERT CODE: Create f_z
f_z = 0

# INSERT CODE: Graph using sns.scatterplot
sns.scatterplot(x=z, y=f_z)

Note that this special curve (a logistic function where $f(0) = 0.5$) is called a **sigmoid function**. Logistic functions predict a **probability** that an event is going to occur. As you can see:

* as $z$ gets lower, $f(z) \to 0$
* as $z$ gets higher, $f(z) \to 1$

Let's make one other alteration to the function. Reminder that in **univariate linear regression**, we tried to find an $m$ and $b$ within the equation...

$$ g(x) = mx + b $$

Let's pretend that $z = g(x)$ and

$$ f(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-g(x)}} = \frac{1}{1 + e^{-(mx + b)}}$$

$$ f(x) = \frac{1}{1 + e^{-(mx + b)}}$$

What we are doing is taking the **output of linear regression** and **squeezing it into a probability**. Thus, we can interpret the output of the logistic function within this example as...

> What is the **likelihood that someone will develop CHD within the next 10 years**? 

We call the **learning algorithm** that finds the specific $m$ and $b$ that fit the logistic curve **logistic regression**. Here's a picture to help with this interpretation...

---

<img src="img/Linear_to_Logistic_Regression.png" width="700">

---
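The "squeezing" step can be sketched in a few lines. This is a minimal illustration, not the model-fitting itself: `m` and `b` are copied from the earlier glucose cell, and the glucose readings are hypothetical values chosen to span a wide range.

```python
import numpy as np

# Coefficients copied from the earlier glucose example in this notebook
m = 0.01000811
b = -2.54022054

def logistic(z):
    """Map any real number to the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical glucose readings (made up for illustration)
glucose = np.array([50, 100, 254, 400])

linear_output = m * glucose + b        # unbounded: can be negative or above 1
probability = logistic(linear_output)  # always strictly between 0 and 1

print('Linear output:', linear_output)
print('Probability:  ', probability)
```

Notice that the ordering is preserved: a larger linear output always maps to a larger probability.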

#### MATH ALERT: What is a _log odds_?

For those interested, the **log odds** is the logarithm of the ratio between the probability that someone developed CHD and the probability that someone _did not_ develop CHD. It looks something like this...

$$ \log \Bigg (\frac{P(TenYearCHD = 1)}{P(TenYearCHD = 0)} \Bigg ) = mx + b $$

Thus, when $x$ increases by 1, what we are saying is that the _log odds increases by $m$_. We can have some fun and rearrange this equation to make our logistic function.

$$ \Bigg (\frac{P(TenYearCHD = 1)}{P(TenYearCHD = 0)} \Bigg ) = e^{mx + b} $$
$$ \Bigg (\frac{P(TenYearCHD = 1)}{1 - P(TenYearCHD = 1)} \Bigg ) = e^{mx + b} $$

and rearranging...

$$ P(TenYearCHD = 1) = \frac{e^{mx + b}}{1 + e^{mx + b}} = \frac{1}{1 + e^{-(mx + b)}} $$
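The derivation can be checked numerically. The sketch below reuses the `m` and `b` from the earlier glucose cell; the value of `x` is an arbitrary, hypothetical glucose reading. It confirms that the log odds of the logistic output equals $mx + b$, and that a one-unit change in $x$ shifts the log odds by $m$.

```python
import numpy as np

# Coefficients copied from the earlier glucose example in this notebook
m, b = 0.01000811, -2.54022054

def prob_chd(x):
    """Logistic function applied to the linear term mx + b."""
    return 1 / (1 + np.exp(-(m * x + b)))

def log_odds(p):
    """Logarithm of the odds p / (1 - p)."""
    return np.log(p / (1 - p))

x = 100.0  # hypothetical glucose value
lo_x = log_odds(prob_chd(x))
lo_x_plus_1 = log_odds(prob_chd(x + 1))

print('Log odds at x:        ', lo_x)
print('mx + b:               ', m * x + b)
print('Change for x -> x + 1:', lo_x_plus_1 - lo_x)
print('m:                    ', m)
```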


### Classification

We're missing one extra step. Sometimes, we do not want the likelihood that someone will develop a disease, we want a definitive answer: _will I get CHD in the next ten years?_ What we can then do is **threshold our likelihood by a specific probability p**. We then conclude by saying...

* If $f(x) \geq p$, TenYearCHD = 1,
* If $f(x) < p$, TenYearCHD = 0

This provides a final **classification** based upon our likelihood. Thus, despite having the word "regression" in its name, logistic regression is normally treated as a classification algorithm in machine learning.

> A logistic regression that outputs a categorical value is considered a **classification** algorithm, since it predicts a **categorical variable** once thresholded (e.g. will someone get CHD, yes or no?). If you are using a logistic regression algorithm that outputs a likelihood instead of a final classification, it is considered a regression algorithm, since the output is continuously valued. Don't get too hung up on this.

---

<img src="img/Logistic_Regression_Threshold.png" width="700">

---
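Here is a minimal sketch of the thresholding step on its own. The probabilities are hypothetical values, not model output, and the two thresholds are arbitrary choices to show how the classification changes.

```python
import numpy as np

# Hypothetical predicted probabilities of TenYearCHD = 1
probs = np.array([0.05, 0.20, 0.45, 0.50, 0.80])

# Classify as 1 when the probability is at or above the threshold p
p = 0.5
labels = (probs >= p).astype(int)
print(labels)  # [0 0 0 1 1]

# A lower threshold flags more people as positive
labels_low = (probs >= 0.2).astype(int)
print(labels_low)  # [0 1 1 1 1]
```

Lowering the threshold never removes a positive prediction; it can only add more.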

### Exercise

Run logistic regression using the imported `LogisticRegression` class, fitting with just the `glucose` column. You can assess the fit using the `accuracy_score` function (instead of `mean_squared_error`). Remember the four steps for running a learning algorithm...

1. Create a LogisticRegression() object
2. Fit the model using the `framingham_data[['glucose']]` and `framingham_data['TenYearCHD']` columns (note the double brackets, since sklearn expects a 2D input for the features)
3. Predict on the training data
4. Assess the model's performance

You can run these steps just like you did previously with `LinearRegression`, but replacing the `LinearRegression` object with a `LogisticRegression` object.

In [None]:
# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a LogisticRegression object
log_ress = 0

# Fit model

# Predict on the input

# Find the accuracy_score


## Part 2: Assessing the model fit

---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#Part-1:-Linear-to-Logistic-Regression) | [Next section](#Part-3:-Receiver-Operating-Curves-(Optional)) | [Bottom](#Thank-you)

In the past example we used a metric called **accuracy** to assess the model fit. You might be asking: what is accuracy? What other **metrics can we use to assess how well our classification model works**? Let's break this down a little bit further...

### Confusion Matrix

Let's say we're predicting how many people have a `TenYearCHD = 1`. We predict this for 10 people and we have the following data and predictions.

```python

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]

```

We often call a `1` a **positive** score, and `0` a **negative** score.

### Thought exercise

* How many values were positive, and predicted as positive? These values are called **True Positives (TP)**.
* How many values were negative, and predicted as negative? These values are called **True Negatives (TN)**.
* How many values were negative, and predicted as positive? These values are called **False Positives (FP)**.
* How many values were positive, and predicted as negative? These values are called **False Negatives (FN)**.

As you see, there are **four possible** combinations of true values and predicted results. We can summarise these values in a **confusion matrix**, which tells us how well a classification model performs.

In [None]:
# Define vectors
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]

# Import confusion matrix
from sklearn.metrics import confusion_matrix

# Make confusion matrix
conf = confusion_matrix(y_true, y_pred, labels=[1, 0])
sns.heatmap(conf, annot=True, annot_kws={'size': 16})

Let's break down the general version of this confusion matrix.

<img src="img/Confusion_Matrix.png" width="500">

<br>

Ideally, we want to get the **<span style="color:green;">green</span>** areas of the matrix (the diagonal) as high as possible, and the **<span style="color:red;">red</span>** areas of the matrix as low as possible. Let's review some metrics we can calculate using this matrix.

### Exercise

There are metrics we can calculate using a confusion matrix to **assess the efficacy of the model**. Our confusion matrix is currently stored in a variable called `conf`. Let's define each metric, and then we'll run a quick exercise to calculate each metric.

**Reminder:** To index a matrix, we can use the following notation: `conf[row_ind, col_ind]`. For example, `conf[0, 1]` would output the number `0`.

#### Accuracy

Accuracy is the total number of correct predictions divided by the total number of predictions. In other words, it is...

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

Calculate the accuracy in the code cell below.

In [None]:
# Calculate accuracy


#### Precision

Precision is the number of correct positive predictions divided by the total number of positive predictions.

$$ Precision = \frac{TP}{TP + FP} $$

Calculate the precision in the code cell below.

In [None]:
# Calculate precision


#### Recall

Recall is the number of correct positive predictions divided by the total number of actual positives.

$$ Recall = \frac{TP}{TP + FN} $$

Calculate the recall in the code cell below.

In [None]:
# Calculate recall


I forget these terms a lot. But here's something from [Cross Validated](https://stats.stackexchange.com/questions/122225/what-is-the-best-way-to-remember-the-difference-between-sensitivity-specificity/122228) (Stack Exchange) that is helpful for remembering.

* <strong>P</strong>recision: TP / <strong>P</strong>redicted positive
* <strong>R</strong>ecall: TP / <strong>R</strong>eal positive

Sklearn has all of these metrics ready for you that you can import. Let's do that.

In [None]:
# Import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score
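As a quick illustration of these imported functions, the sketch below applies them to the same ten toy labels used for the confusion matrix above. The `toy_` variable names are introduced here only to avoid clashing with names used elsewhere in the notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Same toy vectors as the confusion matrix example
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]

# Here TP = 2, FP = 3, TN = 5, FN = 0
toy_acc = accuracy_score(y_true, y_pred)    # (2 + 5) / 10
toy_prec = precision_score(y_true, y_pred)  # 2 / (2 + 3)
toy_rec = recall_score(y_true, y_pred)      # 2 / (2 + 0)

print('Accuracy: %.2f' % toy_acc)
print('Precision: %.2f' % toy_prec)
print('Recall: %.2f' % toy_rec)
```

You can compare these numbers with the values you calculated by hand from `conf`.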

### Using different scores


#### Accuracy issues

Let's assess the accuracy, precision, and recall of our own logistic regression model on the actual Framingham data. Let's be proper and use K-Fold cross validation with the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score).

In [None]:
# Initialise model
log_ress = LogisticRegression(solver='lbfgs')

# Cross val with accuracy
acc = cross_val_score(
    estimator=log_ress, 
    X=framingham_data[['glucose']], 
    y=framingham_data['TenYearCHD'], 
    scoring='accuracy',
    cv=10
)
print('Mean +/- std of accuracy: %.2f +/- %.2f' % (np.mean(acc), np.std(acc)))

# Cross val with precision
prec = cross_val_score(
    estimator=log_ress, 
    X=framingham_data[['glucose']], 
    y=framingham_data['TenYearCHD'], 
    scoring='precision',
    cv=10
)
print('Mean +/- std of precision: %.2f +/- %.2f' % (np.mean(prec), np.std(prec)))

# Cross val with recall
rec = cross_val_score(
    estimator=log_ress, 
    X=framingham_data[['glucose']], 
    y=framingham_data['TenYearCHD'], 
    scoring='recall',
    cv=10
)
print('Mean +/- std of recall: %.2f +/- %.2f' % (np.mean(rec), np.std(rec)))

### Thought exercise

Why is accuracy _so much higher_ than precision and recall?

The reason is that the classes in our data are _imbalanced_, meaning that there are many more people with `TenYearCHD=0` than with `TenYearCHD=1`. Let's use `series.value_counts()` to print out the percentage of people with each value of `TenYearCHD` in our data.

In [None]:
# Print value counts
framingham_data.TenYearCHD.value_counts(normalize=True)

Imagine if we classified **every single observation in our data** as 0. We would get ~85% accuracy...and our classifier wouldn't be doing very much. So although the classifier looks "good" since the accuracy is high, it's not really effective.

> A **general rule** is that when you are trying to predict a variable with **imbalanced classes**, it's better to use **precision and recall** as evaluation metrics for a model.

Other thoughts...if our model is mainly predicting all 0 values...

* Why is **precision ~0.5?** (Think about few TP, but also few FP)
* Why is **recall ~0.0?** (Think about almost no TP, but many FN)
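The "predict 0 for everyone" scenario can be sketched directly. The labels below are synthetic (85 negatives and 15 positives, chosen to roughly match the split described above), not the Framingham data. The `zero_division=0` argument tells sklearn to report a precision of 0 when there are no positive predictions at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic, imbalanced labels: 85 negatives and 15 positives
y_true_toy = np.array([0] * 85 + [1] * 15)

# A "classifier" that predicts 0 for every observation
y_all_zero = np.zeros_like(y_true_toy)

baseline_acc = accuracy_score(y_true_toy, y_all_zero)
baseline_rec = recall_score(y_true_toy, y_all_zero)
# No positive predictions means TP + FP = 0, so precision is undefined;
# zero_division=0 makes sklearn return 0.0 instead of warning
baseline_prec = precision_score(y_true_toy, y_all_zero, zero_division=0)

print('Accuracy: %.2f' % baseline_acc)
print('Recall: %.2f' % baseline_rec)
print('Precision: %.2f' % baseline_prec)
```

High accuracy with zero recall is the signature of a model that has simply learned the majority class.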

#### Precision or recall?

So, should we use precision or recall? It really depends on whether you care more about keeping the number of FPs low or the number of FNs low...which is often context dependent. Here are a few examples.

### Thought exercise

* If you were developing a classifier for disease diagnosis, and treatment was inexpensive and non-dangerous, what are the trade-offs between a high FP or FN? How about a more circumstantial disease?
* If you were analysing the potential of a person being innocent or guilty, what are the trade-offs between high FP or FN?

If your answer is...I'd rather have both be reasonably low, then there is a metric for you, called the **F1-score**. The F1-score is the harmonic mean of the precision and recall. Mathematically it looks a little funky...

$$ F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = \frac{2* precision * recall}{precision + recall} $$
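To see how the formula behaves, here is a small sketch on the ten toy labels from the confusion matrix section. It compares the hand-computed F1 with sklearn's `f1_score` and with a plain arithmetic average of precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same toy vectors as the confusion matrix example
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 from the formula above
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

# A plain average, for comparison
arithmetic_mean = (precision + recall) / 2

print('F1 (manual):     %.3f' % f1_manual)
print('F1 (sklearn):    %.3f' % f1_sklearn)
print('Arithmetic mean: %.3f' % arithmetic_mean)
```

The F1-score sits below the plain average because it is pulled towards the lower of the two values, so a model cannot score well by doing well on only one of them.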

Let's calculate F1 using our cross validation.

In [None]:
# Cross val with F1
f1 = cross_val_score(
    estimator=log_ress, 
    X=framingham_data[['glucose']], 
    y=framingham_data['TenYearCHD'], 
    scoring='f1',
    cv=10
)
print('Mean +/- std of f1: %.2f +/- %.2f' % (np.mean(f1), np.std(f1)))

## Part 3: Receiver Operating Curves (Optional)

---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#Part-2:-Assessing-the-model-fit) | [Next section](#Part-4:-Cross-validating-logistic-regression) | [Bottom](#Thank-you)

Remember that technically logistic regression does not output a classification, it outputs the **probability that an example is the positive class**. By default, sklearn says, if `probability >= 0.5, predict a True value`. We can use the `predict_proba` method to get the probability, and **choose our own threshold**.
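Here is a minimal sketch of choosing our own threshold with `predict_proba`. The data is synthetic (randomly generated with a fixed seed), and the threshold of 0.3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature, with label 1 more likely for larger values
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=200) > 0.5).astype(int)

toy_model = LogisticRegression(solver='lbfgs').fit(X_toy, y_toy)

# Column 1 holds the probability of the positive class
toy_probs = toy_model.predict_proba(X_toy)[:, 1]

# Thresholding at 0.5 reproduces the default predictions
default_pred = toy_model.predict(X_toy)
pred_05 = (toy_probs >= 0.5).astype(int)

# A lower threshold labels more examples as positive
pred_03 = (toy_probs >= 0.3).astype(int)

print('Positives with predict():     ', default_pred.sum())
print('Positives with threshold 0.5: ', pred_05.sum())
print('Positives with threshold 0.3: ', pred_03.sum())
```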

Let's define a function that runs logistic regression over a specific train/test set. It can take more columns than just `glucose`. The function also graphs two metrics on an x and y axis...

* y-axis: the TP rate, which is $recall = \frac{TP}{TP + FN}$
* x-axis: the FP rate, which is $FPR = \frac{FP}{FP + TN}$

Note how _all_ the sections of the confusion matrix are used within these two metrics.

In [None]:
def logistic_regression_w_thresh(
    data,
    cols=['glucose'],
    target='TenYearCHD',
    thresholds=[0.5]
):
    """
    Run logistic regression for a set of columns and graph ROC curve
    
    :param data: pd.DataFrame, the data
    :param cols: list<str>, the list of columns to use as features
    :param target: str, the target column to predict
    :param thresholds: list<float>, list of thresholds to check against
    """
    # Split into train/test sets
    train, test = train_test_split(data, random_state=42)
    
    # Create the model
    log_ress = LogisticRegression(solver='lbfgs')
    
    # Fit on the training set
    log_ress.fit(train[cols], train[target])
    
    # Predicted probability of the positive class on the test set
    probs = log_ress.predict_proba(test[cols])[:, 1]
    
    # Rates collected for each threshold
    fp_rate = []
    tp_rate = []
    
    # Threshold the probabilities and compute TPR/FPR from the confusion matrix
    for t in thresholds:
        y_pred = (probs >= t).astype(int)
        conf = confusion_matrix(test[target], y_pred, labels=[1, 0])
        tp_rate.append(conf[0, 0] / (conf[0, 0] + conf[0, 1]))
        fp_rate.append(conf[1, 0] / (conf[1, 0] + conf[1, 1]))
        
    # Now graph
    plt.figure(figsize=(15,5))
    plt.plot(fp_rate, tp_rate)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.xlim([0.0, 1.0])

### Exercise

Add thresholds (between 0 and 1) to the curve below. Think about...

* What would this curve look like if we guessed completely random between classes?
* If our guesses were perfect, what would the resulting curve look like?

In [None]:
# Define thresholds (between 0 and 1)
thresh = [0, 0.5, 1.0]
thresh = np.arange(0, 1.1, 0.05)
cols = ['glucose']

# Run function
logistic_regression_w_thresh(
    framingham_data,
    cols=cols,
    target='TenYearCHD',
    thresholds=thresh
)

This type of curve has a special name: a **receiver operating characteristic** curve, or **ROC** curve. It tells us how capable our model is of distinguishing between classes, and it is a robust metric as it covers how our model would respond over different thresholds. Let's look through a few examples of ROC curves with simpler data. The following function will help us input data.

In [None]:
# Import the roc curve metric
from sklearn.metrics import roc_curve, roc_auc_score

# Create roc_curve function
def create_roc_curve(y_true, y_prob):
    """
    Plot an ROC curve given a vector of probabilities and a true set of classification labels.
    
    :param y_true: np.array<float>, the true values
    :param y_prob: np.array<float>, the probability values
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC AUC: %.2f' % roc_auc_score(y_true, y_prob))

We'll plot three examples below, each time initialising a _probability_ vector, and a vector of true classification examples.

In [None]:
# Figure
plt.figure(figsize=(15, 5))

# 0.5 area
plt.subplot(1, 3, 1)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
create_roc_curve(y_true, y_prob)

# 0.88 area
plt.subplot(1, 3, 2)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.9, 0.9]
create_roc_curve(y_true, y_prob)

# 1.0 area
plt.subplot(1, 3, 3)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0, 0, 0, 0, 1, 1, 1, 1]
create_roc_curve(y_true, y_prob)

Let's break down what these pictures mean.

<img src="img/ROC_Curve.png" width="800">

You'll notice that from left to right, the prediction curves get better. One thing I haven't pointed out is the title, which has the words **AUC** or **Area Under the Curve**. The `AUC` ranges between 0 and 1, and measures **how well our model separates the classes** across all thresholds. A model with AUC = 1.0 ranks every positive example above every negative one, while an AUC of 0.5 is no better than chance.
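One way to build intuition for the AUC is that it equals the fraction of (positive, negative) pairs in which the positive example gets the higher probability, with ties counted as half. The sketch below checks this on the middle example from the plots above; the pairwise calculation is an illustration, not how sklearn computes the value internally.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# The middle example from the three plots above
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob_toy = np.array([0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.9, 0.9])

# Probabilities for the positive and negative examples
pos = y_prob_toy[y_true_toy == 1]
neg = y_prob_toy[y_true_toy == 0]

# Compare every positive with every negative (a 4 x 4 grid of differences)
diff = pos[:, None] - neg[None, :]
wins = (diff > 0).sum()
ties = (diff == 0).sum()
pairwise_auc = (wins + 0.5 * ties) / diff.size

print('Pairwise AUC: %.3f' % pairwise_auc)
print('sklearn AUC:  %.3f' % roc_auc_score(y_true_toy, y_prob_toy))
```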

![](https://media.springernature.com/original/springer-static/image/art%3A10.1007%2Fs10115-017-1022-8/MediaObjects/10115_2017_1022_Fig1_HTML.gif)

## Part 4: Cross-validating logistic regression

---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#Part-3:-Receiver-Operating-Curves-(Optional)) | [Next section](#Part-5:-Another-dataset-with-spam-detection) | [Bottom](#Thank-you)

Let's talk about the hyperparameters on the [logistic regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in sklearn that can be fine-tuned.

* The features. We could fine tune what columns we use for the model. Thus far, we've only been using `glucose`
* `penalty`: whether to use `l1` or `l2` regularisation for the model
* `class_weight`: A **reweighting for classes**. This is **extremely useful for unbalanced classes**. The input is either a dictionary mapping each class to a weight or the string `'balanced'`. For example, if 75% of our observations have `TenYearCHD = 0` and 25% have `TenYearCHD = 1`, we could use `class_weight='balanced'` so that the algorithm reweights the observations and both classes contribute equally to the optimisation
* `solver`: the optimisation algorithm. Different solvers work well for different dataset sizes, speed requirements, and penalties.

There are a ton of other hyperparameters that you can check out in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
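To see what `class_weight='balanced'` does with the 75%/25% example, here is a small sketch using sklearn's `compute_class_weight` helper on synthetic labels. The manual formula, `n_samples / (n_classes * count_per_class)`, is the one described in the sklearn documentation for the balanced mode.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 75% class 0, 25% class 1
y_toy = np.array([0] * 75 + [1] * 25)

# Weights sklearn would use for class_weight='balanced'
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy
)

# The same weights by hand: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print('sklearn weights:', weights)
print('manual weights: ', manual)
```

The rarer class gets the larger weight, so each of its observations counts for more during fitting.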

Let's make a function that will cross-validate using K-Fold cross-validation and an inputted set of parameters. It will also graph the results on an ROC.

In [None]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import auc
from numpy import interp  # scipy.interp has been removed from recent SciPy releases

def logistic_regression_w_cross_val(
    data,
    cols=['glucose'],
    target='TenYearCHD',
    penalty='l2',
    class_weight='balanced',
    solver='lbfgs'

):
    """
    Run logistic regression for a set of columns and graph ROC curve
    
    :param data: pd.DataFrame, the data
    :param cols: list<str>, the list of columns to use as features
    :param target: str, the target column to predict
    :param penalty: str, the penalty for the model
    :param class_weight: str or dict<int:float>, the balancing scheme for the training
    :param solver: str, the type of optimisation to use in the model
    
    :return log_ress_final: fully trained final model
    :return coef: the model coefficients
    """
    # Create splits
    cv = KFold(n_splits=10)
    classifier = LogisticRegression(
        penalty=penalty, 
        class_weight=class_weight, 
        solver=solver,
        max_iter=1000
    )
    
    # Run MinMaxScaler for model interpretability
    X = pd.DataFrame(MinMaxScaler().fit_transform(data[cols]), columns=cols)

    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    f1s = []

    i = 0
    plt.figure(figsize=(15, 5))
    
    # Classify on different splits and create curves
    for train, test in cv.split(X[cols], data[target]):
        # Run classifier
        probas_ = classifier.fit(
            X.loc[train, cols], data.loc[train, target]
        ).predict_proba(X.loc[test, cols])
        # Compute ROC curve and area under the curve
        fpr, tpr, thresholds = roc_curve(data.loc[test, target], probas_[:, 1])
        tprs.append(interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        # Plot the specific line with label
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

        i += 1
    # Plot AUC = 0.5 line
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Chance', alpha=.8)

    # Plot mean
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC: Framingham Heart Disease Study')
    plt.legend(loc="lower right")
    plt.show()
    
    # Run final model on all data
    classifier.fit(X[cols], data[target])
    coef = pd.DataFrame({
        'Columns': cols + ['Intercept'], 
        'Coefficients': classifier.coef_[0].tolist() + [classifier.intercept_[0]]
    })

    return classifier, coef[['Columns', 'Coefficients']]

### Exercise

The following code runs logistic regression on our dataset and lets you try different parameters. It grades the resulting fit using the `F1 score` and `ROC AUC`, graphs the `ROC` curves, and then outputs the `feature importance`. The feature importances are the model's weight coefficients: as in linear regression, more impactful features have weights with a larger magnitude (magnitudes are only directly comparable when the features are on similar scales).
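
As a rough illustration of those ideas, here is a minimal sketch on synthetic data. The data, the column names `f0` to `f3`, and the in-sample scoring are assumptions made for the example; this is not the Framingham data or the function above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the Framingham dataset)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical column names

# Standardise so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(solver='lbfgs').fit(X_scaled, y)

# F1 uses hard class predictions; ROC AUC uses predicted probabilities.
# Both are computed in-sample here purely to keep the sketch short.
f1 = f1_score(y, clf.predict(X_scaled))
roc_auc = roc_auc_score(y, clf.predict_proba(X_scaled)[:, 1])

# Rank features by the absolute size of their coefficient;
# exp(coefficient) is the odds ratio for a one-unit increase
importance = pd.DataFrame({
    'Columns': feature_names,
    'Coefficients': clf.coef_[0],
    'OddsRatio': np.exp(clf.coef_[0]),
})
importance['AbsCoefficient'] = importance['Coefficients'].abs()
importance = importance.sort_values('AbsCoefficient', ascending=False)

print('F1: %0.2f, ROC AUC: %0.2f' % (f1, roc_auc))
print(importance)
```

Sorting by the absolute value matters because a large negative weight is just as influential as a large positive one.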

Try to tune the following...

* Change the `penalty` variable between `'l1'` and `'l2'`
* Change the `class_weight` between `None` and `'balanced'`
* Change the `solver` between `'liblinear'` and `'lbfgs'` (note that `'lbfgs'` only supports the `'l2'` penalty, so use `'liblinear'` when trying `'l1'`)
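
To get a feel for what the penalty does before running the exercise, here is a small sketch on synthetic data (an assumption for illustration, as is the choice of `C=0.05`). In scikit-learn, `'liblinear'` accepts both penalties while `'lbfgs'` accepts only `'l2'`. An `'l1'` penalty tends to push some coefficients to exactly zero, whereas `'l2'` only shrinks them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with a few informative features and several noise features
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Strong regularisation (small C) makes the difference easier to see
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='lbfgs', C=0.05).fit(X, y)

# Count coefficients that were driven to exactly zero
n_zero_l1 = int((l1_model.coef_ == 0).sum())
n_zero_l2 = int((l2_model.coef_ == 0).sum())

print('Zero coefficients with l1:', n_zero_l1)
print('Zero coefficients with l2:', n_zero_l2)
```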

In [None]:
# INPUT PARAMETERS
cols = ['age', 'male', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'glucose', 'heartRate']
penalty = 'l2'
class_weight = 'balanced'
solver = 'lbfgs'

# Run logistic regression
model, coef = logistic_regression_w_cross_val(
    framingham_data,
    cols=cols,
    target='TenYearCHD',
    penalty=penalty,
    class_weight=class_weight,
    solver=solver
)

print(coef)

Now, it's possible that the relationship in our data is much more _nonlinear_ than our model assumes. Logistic regression, though not linear regression, is part of a family of models called [**generalised linear models**](https://medium.com/@yongddeng/regression-analysis-generalised-linear-model-2f03c7e4cecb), so its decision boundary is still linear in the features. We'll come back to this in the next lesson.

![](https://sebastianraschka.com/images/blog/2014/kernel_pca/linear_vs_nonlinear.png)
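
To see the limitation in action, here is a minimal sketch using scikit-learn's `make_circles` toy data (an assumption for illustration; it is not one of our datasets). Plain logistic regression cannot separate one ring from the other with a straight line, but adding squared features, one possible workaround, gives it a boundary it can use.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: a classic non-linearly-separable toy problem
X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)

# Plain logistic regression: linear decision boundary
linear_model = LogisticRegression(solver='lbfgs')

# Same model, but fed squared and interaction terms as extra features
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(solver='lbfgs'),
)

linear_auc = cross_val_score(linear_model, X, y, cv=5, scoring='roc_auc').mean()
poly_auc = cross_val_score(poly_model, X, y, cv=5, scoring='roc_auc').mean()

print('Linear features ROC AUC:     %0.2f' % linear_auc)
print('Polynomial features ROC AUC: %0.2f' % poly_auc)
```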

## Part 5: Another dataset with spam detection

---

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#Part-4:-Cross-validating-logistic-regression) | [Next section](#Thank-you) | [Bottom](#Thank-you)

Let's take a look at another dataset for fun. We'll use a [spam detection](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data) dataset made up of SMS messages that were labeled as spam or not spam. We need to transform the data before using logistic regression, because the model needs numeric features rather than raw text.

**Natural language processing** is a complex field concerned with translating human language into a form that computers can read, comprehend and use to make decisions.

Let's load our dataset and transform it using something called a [Term frequency inverse document frequency (or TFIDF) Vectoriser](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3), which turns each message into a vector of word weights. The data has also already been processed using the [nltk](https://www.nltk.org/) library, by doing things like...

* Removing punctuation
* Removing commonly used _stop words_, like "the", "a", etc
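
The cleaning steps above, followed by TFIDF, can be sketched roughly as follows. This is not the preprocessing that was actually run on the dataset: the tiny stop word set, the `clean_text` helper and the three example messages are all hypothetical, and plain Python stands in for nltk to keep the example self-contained.

```python
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, deliberately tiny stop word list (nltk ships a much longer one)
STOP_WORDS = {'the', 'a', 'an', 'to', 'is', 'you', 'for', 'at'}

def clean_text(message):
    """Strip punctuation, lowercase, and drop stop words (hypothetical helper)."""
    no_punct = message.translate(str.maketrans('', '', string.punctuation))
    tokens = [w for w in no_punct.lower().split() if w not in STOP_WORDS]
    return ' '.join(tokens)

# Made-up example messages
messages = [
    'WIN a FREE prize! Call now!!!',
    'Are you free for lunch at noon?',
    'Free entry to win cash, call now.',
]
cleaned = [clean_text(m) for m in messages]

# Each row is a message, each column a word; values are TFIDF weights
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cleaned)

print(cleaned)
print(tfidf.shape)
```

Words that appear in every message (like "free" here) get a lower weight than words that are specific to a few messages.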

In [None]:
# Load spam dataset
spam_data = pd.read_csv('data/spam.csv')
spam_data.dropna(axis=0, inplace=True)
# Reset the index so rows line up with the TFIDF matrix when concatenating
spam_data.reset_index(inplace=True, drop=True)

# Run TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
message_mat = vectorizer.fit_transform(spam_data['Text_Clean'])
category = spam_data[['Class']]
# Build one column per vocabulary term (V0, V1, ...) and append the class label
tfidf_cols = ['V' + str(i) for i in range(message_mat.shape[1])]
new_mat = pd.concat(
    (pd.DataFrame(message_mat.toarray(), columns=tfidf_cols), category),
    axis=1
)
new_mat.dropna(axis=0, inplace=True)
new_mat.reset_index(inplace=True, drop=True)

# Print data head
spam_data.head()

In [None]:
spam_data['Class'].value_counts(normalize=True)

Let's now train the classifier using our function from before.

In [None]:
# INPUT PARAMETERS
cols=['V' + str(i) for i in range(message_mat.shape[1])]
penalty = 'l2'
class_weight = 'balanced'
solver = 'lbfgs'

# Run logistic regression
model, coef = logistic_regression_w_cross_val(
    new_mat,
    cols=cols,
    target='Class',
    penalty=penalty,
    class_weight=class_weight,
    solver=solver
)

### Challenge...until next time

Can you improve the spam model? Take a look at what [other people have done on Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/kernels) and post your results in Slack if you improve the ROC AUC! It might be helpful to do some **exploratory data analysis** and find out...

* Which words appear most often in spam vs. non-spam messages?
* Is TFIDF the best way to create features?
* Are there other algorithms that can be used? Here are a few we won't cover in the course that might be fun to check out. The syntax for creating models and making predictions is the same, but the set of hyperparameters will be slightly different.
    * [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB). This is used often for simple spam problems.
    * [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
    * [SupportVectorMachine](https://scikit-learn.org/stable/modules/svm.html)
    * [MultilayerPerceptron](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier). We will cover this more next lesson, but with a different library!
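
As a starting point for the challenge, here is a minimal sketch of swapping in a different algorithm. It uses `MultinomialNB`, a Naive Bayes variant commonly paired with word-count or TFIDF features, rather than the `GaussianNB` linked above. The twelve messages and their labels are made up for illustration, so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 1 = spam, 0 = not spam
texts = [
    'win free prize call now', 'free cash claim your reward',
    'urgent winner call this number', 'free entry win tickets now',
    'claim free phone today', 'cash prize waiting call now',
    'see you at lunch tomorrow', 'can you pick up milk',
    'running late be there soon', 'happy birthday have a great day',
    'are we still on for dinner', 'thanks for the help today',
]
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Putting the vectoriser inside the pipeline means it is re-fit on each
# training fold, so no information leaks from the test fold
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(nb_model, texts, labels, cv=cv, scoring='roc_auc')

print('ROC AUC per fold:', scores)
print('Mean ROC AUC: %0.2f' % scores.mean())
```

Replacing `MultinomialNB()` with another scikit-learn classifier is a one-line change, which makes it easy to compare algorithms on the same folds.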

## Thank you

[Top](#ML-Week-4---Logistic-Regression) | [Previous section](#Part-5:-Another-dataset-with-spam-detection) | [Next section](#Thank-you) | [Bottom](#Thank-you)

That concludes our week 4 lesson. We hope you enjoyed it :)

### Downloading the notebook

If you would like to keep your work, follow these directions:
* At the top of this screen, in the header menu, click "File", then "Download .ipynb".
* You will need to download [Python 3.7 with Anaconda](https://www.anaconda.com/distribution/#download-section) to use this notebook in the future.