# Logistic Regression Lab
## Preprocessing Data
The dataset you're going to work with is already in pretty good shape, so it shouldn't require much preprocessing. But before any of that, we have to import the data!
### Import The Data
You know the deal! Import the numpy library as `np` and the pandas library as `pd`, import the data from [this GitHub link](https://raw.githubusercontent.com/Endothermic-Dragon/Polygence/master/Jupyter%20Notebooks/Logistic%20Regression/Heart%20Attack%20Possibility.csv) and save it as `dataTable`, then display the data.

In [None]:
# Import numpy as np, and pandas as pd



# Get data from CSV file on GitHub
dataTable = 

# Display data table


<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
# Import numpy as np, and pandas as pd
import numpy as np
import pandas as pd

# Get data from CSV file on GitHub
dataTable = pd.read_csv("https://raw.githubusercontent.com/Endothermic-Dragon/Polygence/master/Jupyter%20Notebooks/Logistic%20Regression/Heart%20Attack%20Possibility.csv")

# Display data table
dataTable
```
</details>

Our overall goal is to predict the value of `target`, which is the diagnosis of heart disease.

If you want to know what each column actually represents, see:
- [The Kaggle dataset I used for this lab](https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility)
- [The original dataset that the Kaggle dataset borrowed from](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)
    - Specifically, look under the 14 attributes listed under the "Attribute Information" section

Otherwise, continue with the lab!

## Analyzing The Data
First, let's analyze relations.

We'll be using `seaborn` to plot the data. `sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, and `thal` are all discrete values, so plotting them won't do us any favors. Create a copy of the dataset called `dataTableCopy` without those columns.

<details>
<summary>Hint</summary>

You can do this using `dataTable.drop(...)`. The documentation for the `drop` function can be accessed [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html). Make sure to specify both the column names and the axis (rows correspond to 0, columns correspond to 1).

</details>
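
As a minimal sketch of how `drop` behaves, the example below uses a small made-up table (the name `toy` and the columns `a`, `b`, and `c` are purely illustrative and not part of the lab's dataset). It shows that `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy table, not the lab's dataset
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=1 tells pandas that the labels are column names
toy_copy = toy.drop(["b", "c"], axis=1)

print(list(toy_copy.columns))  # only "a" remains
print(list(toy.columns))       # the original still has all three columns
```

Because the original is unchanged, you have to assign the result to a new variable to keep it.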

After creating `dataTableCopy`, graph that using `sns.pairplot(...)`. Set the `hue` parameter to `"target"`. This will generate a set of graphs with colored points, where the color is determined by the `target` value. This will help you see any simple relations right away (if there are any).

- If you're able to separate them by a line, then you want to make sure to keep a copy of those two features in degree `1` before classification
- If you're able to separate them by a circle, then you want to make sure to keep a copy of those two features in degree `1` and `2` before classification

After that, graph the correlation matrix between the columns of `dataTable`.

<details>
<summary>Hint</summary>

You can use `.corr()` on `dataTable` to calculate the correlation matrix, and then use `.round(n)` to round each value in the matrix to `n` digits.

</details>
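
To get a feel for what `.corr()` and `.round(n)` return, here is a small sketch on made-up data (the name `toy` and its columns `x`, `y`, and `z` are hypothetical). The columns are deliberately built so the correlations are easy to predict by eye.

```python
import pandas as pd

# Hypothetical toy table, not the lab's dataset
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # rises exactly with x
    "z": [4, 3, 2, 1],   # falls exactly as x rises
})

# Pairwise Pearson correlations, rounded to 2 digits
toy_corr = toy.corr().round(2)
print(toy_corr)
```

The result is itself a DataFrame with one row and one column per original column, which is the kind of object `sns.heatmap` is given in this lab.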

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create and show plot
dataTableCopy = 

plt.show()

fig, ax = plt.subplots(figsize=(15,15))

# Calculate correlation matrix
corr = 

# Display correlation matrix
sns.heatmap(corr, annot=True, ax=ax)
plt.show()

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Create and show plot
dataTableCopy = dataTable.drop(["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"], axis=1)
sns.pairplot(dataTableCopy, hue="target")
plt.show()

fig, ax = plt.subplots(figsize=(15,15))

# Calculate correlation matrix
corr = dataTable.corr().round(2)

# Display correlation matrix
sns.heatmap(corr, annot=True, ax=ax)
plt.show()
```
</details>

Deduce any relations to keep an eye on or to modify, if there are any. You will get a chance to make your own modifications to the features very soon.

### Modify The Data
Make any modifications to the data that you need below. We will apply a Z Score to this in the next step, so don't worry about it for now.

<details>
<summary>What I did</summary>

I didn't do anything! As mentioned before, logistic regression doesn't require too much manipulation of the data. Also, I found that not adding squared terms actually results in a slightly higher accuracy 😝.

</details>

Apply a Z Score to the `age`, `trestbps`, `chol`, and `thalach` columns as they are not binary values and have a large range (they could potentially slow down the gradient descent process). Also apply a Z Score to any columns you may have added in the previous cell.

In [None]:
# Complete the function to calculate the Z Score
def zScore(columnData):
    return

# Implementations may vary, but the basis is set up for you
for i in ["age", "trestbps", "chol", "thalach"]:
    dataTable[i] = zScore(dataTable[i])

<details>
<summary>Stuck or completed? Click here to reveal a possible solution for the <code>zScore</code> function.</summary>

```python
def zScore(columnData):
    return (columnData - np.mean(columnData)) / np.std(columnData)
```
</details>
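
One way to sanity-check a Z Score implementation is to confirm that the scaled values have a mean of roughly 0 and a standard deviation of roughly 1. The sketch below does this on made-up numbers (`z_score`, `values`, and `scaled` are illustrative names, not part of the lab).

```python
import numpy as np

def z_score(column_data):
    # Subtract the mean, then divide by the standard deviation
    return (column_data - np.mean(column_data)) / np.std(column_data)

# Hypothetical measurements, chosen only for illustration
values = np.array([120.0, 130.0, 140.0, 150.0, 160.0])
scaled = z_score(values)

print(np.mean(scaled), np.std(scaled))
```

Note that `np.std` computes the population standard deviation by default (`ddof=0`), while the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`), so the two can give slightly different scalings.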

## Get Ready to Train
Shuffle the data in the cell below. Use `dataTable.sample(...)`, whose documentation you can find [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
- Make sure to specify the `frac` parameter and not the default `n` parameter.
- You can optionally also set a `random_state` parameter to ensure consistency when randomizing - that way, you get the same results every time you rerun all the cells.
- Make sure to set `dataTable` equal to the output.

After sampling, apply `dataTable.reset_index(...)` with the `drop` parameter set to `True`. Make sure to set `dataTable` equal to the output.

Alternatively, you can do both of these steps as a one-liner.

In [None]:
dataTable = 

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
dataTable = dataTable.sample(frac=1, random_state=314).reset_index(drop=True)
```

</details>
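
To see why `reset_index` is needed after `sample`, here is a small sketch on a made-up table (`toy` and `shuffled` are illustrative names, and the seed `0` is arbitrary). Shuffling moves the rows but keeps their old index labels, so the index has to be rebuilt afterwards.

```python
import pandas as pd

# Hypothetical toy table, not the lab's dataset
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# frac=1 samples every row exactly once, which amounts to a shuffle
shuffled = toy.sample(frac=1, random_state=0)
print(list(shuffled.index))   # the original row labels, in shuffled order

# drop=True discards the old labels instead of keeping them as a new column
shuffled = shuffled.reset_index(drop=True)
print(list(shuffled.index))   # back to 0, 1, 2, 3, 4
```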

Next, we have to split the data into a training set and a validation set. Note that the code for this section has been written for you in this lab. This is because it's trivial and somewhat of a menial task.

For now, simply modify `validationSetSize`. Note that we aren't working with a huge dataset (only ~300 data points), so I'd recommend choosing a value near 50 or 100.

In [None]:
# Choose the number of data points in the validation set
validationSetSize = 

splitPoint = dataTable.shape[0]-validationSetSize
del validationSetSize

trainingData = dataTable.drop(labels="target", axis=1).to_numpy()[:splitPoint]
trainingData = np.hstack([
                          np.ones((splitPoint, 1)),
                          trainingData
                         ])
trainingOutputs = dataTable[["target"]].to_numpy()[:splitPoint]

validationData = dataTable.drop(labels="target", axis=1).to_numpy()[splitPoint:]
validationData = np.hstack([
                          np.ones((len(dataTable) - splitPoint, 1)),
                          validationData
                           ])
validationOutputs = dataTable[["target"]].to_numpy()[splitPoint:]
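
If it helps to see what the splitting code above produces, the sketch below repeats the same idea on a tiny made-up array (`features`, `split_point`, `train`, and `valid` are illustrative names). The column of ones that `np.hstack` prepends is what lets the first $\theta$ act as the bias term.

```python
import numpy as np

# Hypothetical feature matrix: 5 rows, 2 features
features = np.arange(10, dtype=float).reshape(5, 2)
split_point = 3

# Prepend a column of ones to each part, then slice at the split point
train = np.hstack([np.ones((split_point, 1)), features[:split_point]])
valid = np.hstack([np.ones((len(features) - split_point, 1)), features[split_point:]])

print(train.shape, valid.shape)  # (3, 3) (2, 3)
```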

## Training
Finally, we're here!

The framework of the code has been set up for you. First, begin with the `sigmoid` function. Note that this function should be able to handle both individual numbers and numpy arrays. Here's the formula, to jog your memory:

<h3>

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

</h3>

This function should be a short one-liner.

Next, fill in the prediction function `f`. This should also be a very short one-liner. Remember that the dot product of two arrays can be taken using `@`. Also, don't forget the `sigmoid`!

After that, fill in the cost function `J`. Note that you can use `np.mean` instead of using `np.sum` and then dividing by the number of data points. You can set the regularization coefficient, $\lambda$, as 0.01. Here's the formula to calculate it, with $n$ as the number of rows and $m$ as the number of features (including the bias term):

<h3>

$$J(\theta) = - \frac{1}{n} \sum_{i=1}^n \left[ y_i\log(f(\theta, x_i)) + (1 - y_i)\log(1 - f(\theta, x_i)) \right] + \lambda \sum_{i=2}^m \theta_i^2$$

</h3>

Note that the regularization of the weights starts from $\theta_2$ - you can account for this fact by [leveraging array indices](https://numpy.org/doc/stable/user/basics.indexing.html) when manipulating `thetas`. Also take note that `log` represents a natural log (mathematically denoted as ln) and not a base-10 log in the context of computer science and machine learning. This fact is also reflected by the `np.log(...)` function in the numpy library.
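
As a small illustration of the indexing trick, the sketch below computes only the regularization term for some made-up weights (`toy_thetas`, `lam`, and `penalty` are illustrative names). Slicing with `[1:]` skips the first row, which plays the role of the bias term.

```python
import numpy as np

# Hypothetical weights in the same vertical 2D layout as `thetas`
toy_thetas = np.array([[5.0], [1.0], [-2.0]])
lam = 0.01

# The bias (first row) is excluded, so only 1.0 and -2.0 are penalized
penalty = lam * np.sum(toy_thetas[1:] ** 2)
print(penalty)
```

Here the large bias value of `5.0` has no effect on the penalty.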

Finally, we reach the final boss - the function to take the gradients. Once again, if you know calculus, I would recommend figuring this part out by yourself. If you don't, simply expand the explanation below.

<details>
<summary>Calculating the Gradients</summary>

First, let's "unwrap" our cost function by ignoring the fact that we're taking the mean and negating the equation. Now, we're left with two terms:

<h3>

$$y_i\log(f(\theta, x_i))$$

</h3>

<h3>

$$(1 - y_i)\log(1 - f(\theta, x_i))$$

</h3>

Before taking the derivative, we also have to plug in the formula for `f`:

<h3>

$$f(\theta, x_i) = \sigma(x_i \theta)$$

</h3>

Take note that $x_i$ is a horizontal matrix and $\theta$ is a vertical matrix, but the output of the function is a numerical value.

The derivative of the first term (once again, assuming `log` is the natural log due to the context of computer science) is:

<h3>

$$\frac{y_i}{\sigma(x_i \theta)} \cdot \sigma(x_i \theta)(1 - \sigma(x_i \theta)) \cdot x_i$$

</h3>

Note that due to the $x_i$ term at the end, the result will be a matrix.

Similarly, the derivative of the second term is:

<h3>

$$-\left(\frac{1-y_i}{1-\sigma(x_i \theta)} \cdot \sigma(x_i \theta)(1 - \sigma(x_i \theta)) \cdot x_i \right)$$

</h3>

For each row in the dataset, calculate these two values and add them together. At this point, you should have a 2D array, where the number of rows is the number of data points, and the number of columns is the number of $\theta$'s.

Note that you can make this computation faster by taking out common terms when adding the two terms. Specifically, you can factor out $\sigma(x_i \theta)(1 - \sigma(x_i \theta)) \cdot x_i$ from the two terms. This makes the computer do fewer individual computations.

The next step is to undo our "unwrapping". We calculate the average across all rows (that is, find the average value in each column), and take the negative of the resulting matrix. At this point, you should be left with a horizontal 1D array. Turn this into a vertical 2D array, as this is the format the $\theta$'s are in. After that, add the derivative of the regularization terms ($2\lambda\theta$) to each array "cell", except the bias term. Since the other array is already in the necessary format, you can do this efficiently by multiplying `thetas` by $2\lambda$, zeroing out the bias term, adding the result to the other array, and returning the sum.

To see how to zero out a row or column efficiently, see [this link from Stack Overflow](https://stackoverflow.com/questions/17482955).

Another optimization you can make to this gradient-computing process is to bring the outermost negative of the non-regularized cost function all the way "inside". Before this optimization, you would add a positive derivative (the first term mentioned above) and a negative derivative (the second term mentioned above), which is the same as subtracting the second term from the first, and then negate the result. By bringing the negative inside, you can instead subtract the first term from the second. This takes no more computational steps than the original subtraction and removes the negation entirely (a closer look shows that it removes *more* than one operation, since the negation would otherwise be performed on each element of an array).

A third optimization is to precompute the predictions from `f` and save them in a variable. Without precomputation, you're essentially calculating the same thing 4 to 6 times (depending on your implementation). Precomputing this array stores the outputs in memory and fetches them when needed, reducing the number of operations performed. Note that you can also do this for the cost function, but it won't have as much of a benefit as here.
</details>
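
If you want to check a gradient derivation like the one above, a common approach is to compare it against finite differences of the cost function. The sketch below is one way to do that on random made-up data. All names in it (`sigmoid_sketch`, `cost_sketch`, `gradient_sketch`, `toy_data`, and so on) are illustrative and separate from the lab's functions. It also relies on the observation that, after factoring and cancelling, each row's contribution reduces to $(\sigma(x_i \theta) - y_i) \cdot x_i$.

```python
import numpy as np

def sigmoid_sketch(x):
    return 1 / (1 + np.exp(-x))

def cost_sketch(thetas, lam, data, outputs):
    predictions = sigmoid_sketch(data @ thetas)
    data_term = -np.mean(
        outputs * np.log(predictions) + (1 - outputs) * np.log(1 - predictions)
    )
    # The bias term (first row) is not regularized
    return data_term + lam * np.sum(thetas[1:] ** 2)

def gradient_sketch(thetas, lam, data, outputs):
    predictions = sigmoid_sketch(data @ thetas)
    # Average of (prediction - y) * x over all rows, as a vertical 2D array
    grad = (data.T @ (predictions - outputs)) / data.shape[0]
    reg = 2 * lam * thetas
    reg[0] = 0  # zero out the bias term
    return grad + reg

# Random toy problem: 20 rows, a bias column plus 3 features
rng = np.random.default_rng(0)
toy_data = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
toy_outputs = rng.integers(0, 2, size=(20, 1)).astype(float)
toy_thetas = rng.normal(scale=0.1, size=(4, 1))
lam = 0.01

# Central finite differences, one theta at a time
eps = 1e-6
numeric = np.zeros_like(toy_thetas)
for k in range(toy_thetas.shape[0]):
    step = np.zeros_like(toy_thetas)
    step[k] = eps
    cost_plus = cost_sketch(toy_thetas + step, lam, toy_data, toy_outputs)
    cost_minus = cost_sketch(toy_thetas - step, lam, toy_data, toy_outputs)
    numeric[k] = (cost_plus - cost_minus) / (2 * eps)

analytic = gradient_sketch(toy_thetas, lam, toy_data, toy_outputs)
print(np.max(np.abs(numeric - analytic)))
```

If the two disagree by more than a tiny amount, the analytic gradient (or the cost function) most likely has a sign or indexing mistake.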

In [None]:
# Initialize thetas
thetas = np.zeros([
                    trainingData.shape[1],
                    1
                  ])

def sigmoid(n):
    return

def f(thetas, data=trainingData):
    return

# Make sure to use the value of lambda
def J(thetas, l, data=trainingData, outputs=trainingOutputs):
    return

# Make sure to use the value of lambda
def getGradients(thetas, l, data=trainingData, outputs=trainingOutputs):
    return

# Set the value of lambda
l = 0
iterations = 1000
cost_history = []
cost_history.append(J(thetas, l))

# Train!
for i in range(iterations):
    thetas = thetas - 0.25 * getGradients(thetas, l)
    cost_history.append(J(thetas, l))

# Print results
print("Initial cost:", cost_history[0])
print("Final cost:", cost_history[-1])
plt.plot(np.arange(iterations+1), cost_history)
plt.show()

<details>
<summary>Stuck or completed? Click here to reveal a possible solution for the <code>sigmoid</code> function.</summary>

```python
def sigmoid(n):
    return 1/(1 + np.e**(-n))
```
</details>
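
A quick way to confirm that a sigmoid implementation handles both individual numbers and numpy arrays is to try it on each. The sketch below uses `np.exp` rather than `np.e**(...)`, which is just a stylistic choice here, and `sigmoid_sketch` is an illustrative name.

```python
import numpy as np

def sigmoid_sketch(x):
    # np.exp works element-wise on arrays and also accepts plain numbers
    return 1 / (1 + np.exp(-x))

single = sigmoid_sketch(0)
several = sigmoid_sketch(np.array([-2.0, 0.0, 2.0]))

print(single)   # 0.5
print(several)
```

The outputs for `-2.0` and `2.0` should add up to 1, since the sigmoid is symmetric around the point (0, 0.5).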

<details>
<summary>Stuck or completed? Click here to reveal a possible solution for the <code>f</code> function.</summary>

```python
def f(thetas, data=trainingData):
    return sigmoid(data @ thetas)
```
</details>

<details>
<summary>Stuck or completed? Click here to reveal a possible solution for the <code>J</code> function.</summary>

```python
def J(thetas, l, data=trainingData, outputs=trainingOutputs):
    # Precompute the predictions
    predictions = f(thetas, data)
    # Outer parentheses keep the regularization term inside the return expression
    return (
        -np.mean(
            # First term inside summation
            outputs * np.log(predictions)
            # Second term inside summation
            + (1-outputs) * np.log(1 - predictions)
        )
        # Regularization term (skips the bias term)
        + l*np.sum(thetas[1:]**2)
    )
</details>

<details>
<summary>Stuck or completed? Click here to reveal a possible solution for the <code>getGradients</code> function.</summary>

```python
def getGradients(thetas, l, data=trainingData, outputs=trainingOutputs):
    # Precompute the predictions
    predictions = f(thetas, data)
    results = np.mean(
        # Second term's derivative minus the first term's derivative
        # Note the order is flipped because of a possible optimization discussed in the instructions
        # Consequently, the "np.mean" above doesn't have a negative before it
        ((1-outputs) / (1 - predictions) - outputs / predictions)
            # Factored out common terms (another optimization)
            * predictions * (1-predictions)
            * data
    , 0)

    # Turn 1D array into vertical 2D array
    results = np.array([results]).T

    # Regularization term: the derivative is 2 * lambda * theta, with the bias term zeroed out
    reg_term = 2*l*thetas
    reg_term[0] = 0
    results += reg_term

    return results
```
</details>

Use the next cell to graph the progress of the model on the validation data. Make sure to pass the value of $\lambda$ (`l`). You can use the previous cell as a reference when writing code for this cell, but bear in mind that you don't have to redefine the same functions (namely, `sigmoid`, `f`, `J`, and `getGradients`).

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
# Initialize thetas
thetas = np.zeros([
                   trainingData.shape[1],
                   1
                  ])

# Set the value of lambda
l = 0.01
iterations = 1000
cost_history = []

# Make sure to calculate cost on the validation dataset
cost_history.append(J(thetas, l, validationData, validationOutputs))

# Train!
for i in range(iterations):
    # Train using the training dataset
    thetas = thetas - 0.25 * getGradients(thetas, l)
    # Make sure to calculate cost on the validation dataset
    cost_history.append(J(thetas, l, validationData, validationOutputs))

# Print results
print("Initial cost:", cost_history[0])
print("Final cost:", cost_history[-1])
plt.plot(np.arange(iterations+1), cost_history)
plt.show()
```

</details>

## Judging Accuracy
First, calculate the accuracy of your model by calculating how often its prediction is correct. You should get a value slightly higher than 80%.

Note that you can perform element-wise comparisons with `>`, `<`, `>=`, `<=`, and `==` among arrays of the same dimensions or individual values. You can also use `a.astype(int)` to convert any array `a` from an array of booleans (`True` and `False`) to an array of integers (`1` and `0`).
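
Here is a small sketch of those element-wise comparisons on made-up values (`toy_probabilities`, `toy_labels`, and `matches` are illustrative names, and the numbers are not from the lab's model).

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, as vertical 2D arrays
toy_probabilities = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_labels = np.array([[1], [0], [0], [0]])

# Threshold at 0.5, then compare each prediction with its label
matches = (toy_probabilities > 0.5) == toy_labels

print(matches.astype(int).flatten())   # [1 1 0 1]
print(np.mean(matches.astype(int)))    # 0.75
```

Three of the four toy predictions match their labels, so the mean of the integer array is the fraction of correct predictions.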

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
a = (
        (f(thetas, validationData) > 0.5) == validationOutputs
    ).astype(int)
print(f"{np.round(np.mean(a) * 100, 2)}%")
```

</details>

In the exercise above, I got an accuracy around 84%. After I repeated the same training process with the regularization coefficient set to 0 and checked the accuracy, I got a value of 82%. While not a huge boost (in this specific case), it demonstrates the importance of regularization, especially when considering how well a model generalizes to unseen data. You can reproduce this result by setting the validation dataset size to 100 and shuffling with the same seed I used.

If you're wondering how to pick the regularization coefficient, well, there's no hard and fast rule - it's more trial and error.

Next, calculate the F1 Score (and its associated parameters). The first step, converting predictions and true values into a 1D array of booleans, has already been done for you.

Note that `tp` stands for true positives, `fp` for false positives, and so on. Using numpy's `&` (and) operator and `~` (not) operator should suffice for calculating these values. Do ***NOT*** use Python's `and` and `not` keywords, as they do not work element-wise on arrays.
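
To make the difference concrete, here is a small sketch on made-up boolean arrays (`a`, `b`, `both`, `only_b`, and `python_and_failed` are illustrative names). The element-wise operators return a new boolean array, while Python's own `and` raises an error on arrays with more than one element.

```python
import numpy as np

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

both = a & b       # element-wise "and"
only_b = ~a & b    # element-wise "not a, and b"
print(both, only_b)

# Python's own `and` tries to reduce each array to a single truth value
try:
    a and b
    python_and_failed = False
except ValueError as err:
    python_and_failed = True
    print("Python's `and` raised:", err)
```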

Here are any formulas and further details you might need:
- True positive means that the output and prediction match, and they are both true.
- True negative means that the output and prediction match, and they are both false.
- False positive means that the output and prediction do not match, and the prediction is true.
- False negative means that the output and prediction do not match, and the prediction is false.
- The formula for precision is:

<h3>

$$\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

</h3>

- The formula for recall is:

<h3>

$$\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

</h3>

- The formula for the F1 Score is:

<h3>

$$2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}$$

</h3>

Your final F1 Score should be slightly higher than 80%.

In [None]:
predictions = (f(thetas, validationData) > 0.5).flatten()
outputs = (validationOutputs == 1).flatten()

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
predictions = (f(thetas, validationData) > 0.5).flatten()
outputs = (validationOutputs == 1).flatten()

tp = np.sum(outputs & predictions)
fp = np.sum(~outputs & predictions)
tn = np.sum(~outputs & ~predictions)
fn = np.sum(outputs & ~predictions)

print(f"True Positives: {tp}")
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")

precision = tp/(tp+fp)
recall = tp/(tp+fn)

print(f"\nPrecision: {np.round(precision, 2)}")
print(f"Recall: {np.round(recall, 2)}")

print(f"\nF1 Score: {np.round(2 * precision*recall / (precision+recall) * 100, 2)}%")
```

</details>
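
As a quick cross-check of the formulas above, the F1 Score can also be written directly in terms of the raw counts as $\frac{2 \cdot \text{TP}}{2 \cdot \text{TP} + \text{FP} + \text{FN}}$. The sketch below confirms that both forms agree, using made-up counts (the numbers are hypothetical and not results from this lab).

```python
# Hypothetical counts, chosen only for illustration
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form in terms of the raw counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1, 4), round(f1_from_counts, 4))
```

Keep in mind that precision is undefined when `tp + fp` is zero (the model never predicts a positive), and recall is undefined when `tp + fn` is zero.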

## Credits
* This lab used a modified version of Naresh Bhat's heart attack possibility dataset from Kaggle. You can find the original dataset [here](https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility).
* Formula to calculate the F1 score from [Wikipedia](https://en.wikipedia.org/wiki/F-score).