# Homework 6

## Problem 0

It is highly recommended that you work with your group to fully complete the recent Discussion assignments related to the Palmer Penguins data set, as these will directly help with your project. 

## Problem 1

Gapminder is a foundation, based in Sweden, that aims to improve public awareness of basic facts about global socioeconomic conditions. As part of their efforts, they collect detailed statistics on life expectancy, population, and GDP, sometimes going back many years. 

In this case study, we'll look at an excerpt of the Gapminder data. This excerpt has been packaged up and made available via Jenny Bryan's [`gapminder` repo](https://github.com/jennybc/gapminder) on GitHub. 

Here's some familiar code to grab the data. 

In this problem, you will create an attractive visualization of the `gapminder` data set using line plots and the `apply` method of `pandas` data frames. 

### Part A

Run the code below to retrieve the data and take a look. As usual, you can also directly download the data by pasting the url into your browser, saving the file, and reading it in locally via `pandas.read_csv`. 

In [None]:
import pandas as pd
        
url = "https://philchodrow.github.io/PIC16A/datasets/gapminder.csv"
gapminder = pd.read_csv(url)
gapminder

Use the `gapminder` data to create the following visualization. Here, each trendline corresponds to a distinct country in the stated continent.  

<figure class="image" style="width:100%">
  <img src="https://philchodrow.github.io/PIC16A/homework/gapminder_p1.png" alt="A five-panel plot in which each panel corresponds to a continent. For each country, there is a trend-line in life expectancy in the panel corresponding to the continent on which the country is located. The trendlines are slightly transparent, and differently colored within each continent. The years on the axis are labeled from 1952 through 2007. The vertical axis is labeled 'Life Expectancy (Years).'" width="800px">
</figure>

You should achieve this result **without for-loops** and also without manually creating the plot on each axis. You may find it useful to define additional data structures such as dictionaries, that assign colors or axis indices to continents. Feel free to modify aesthetic details of the plots, such as the colors. 

Hint: `df.groupby().apply()`. You will need to define an appropriate function to place inside the `apply` call. 
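As a reminder of how the pattern in the hint behaves, here is a minimal sketch on a made-up data frame. The names `toy`, `team`, `score`, and `score_range` are hypothetical and unrelated to `gapminder`. The function passed to `apply` is called once per group and receives that group's data.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "team":  ["red", "red", "blue", "blue", "blue"],
    "score": [1.0, 3.0, 2.0, 4.0, 9.0],
})

def score_range(scores):
    """Return max minus min for one group's scores (a pandas Series)."""
    return scores.max() - scores.min()

# apply() calls score_range once for each distinct value of "team"
ranges = toy.groupby("team")["score"].apply(score_range)
print(ranges)
```

The function is free to do other work with each group besides returning a number, which is what makes this pattern a substitute for an explicit loop over groups.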

In [None]:
# your solution



## Problem 2

In our first lecture on machine learning, we did linear regression "by hand." In this problem, we will similarly perform logistic regression "by hand." This homework problem is closely parallel to the lecture, and so you might want to have the [notes](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/ML/ML_2.ipynb) handy.  

Although logistic regression is a relatively simple model, it is a foundation for many modern deep neural nets. Additionally, logistic regression itself is still often used in applied contexts. 

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">&quot;When we raise money it’s AI, when we hire it&#39;s machine learning, and when we do the work it&#39;s logistic regression.&quot;<br><br>(I&#39;m not sure who came up with this but it&#39;s a gem 💎)</p>&mdash; Dr. Daniela Witten (@daniela_witten) <a href="https://twitter.com/daniela_witten/status/1177294449702928384?ref_src=twsrc%5Etfw">September 26, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Whereas linear regression is often used to measure arbitrary quantities like GDP or happiness scores, logistic regression is primarily used to estimate *probabilities*. For example, we might use logistic regression to estimate the probability of a passenger surviving the Titanic crash, a borrower defaulting on a loan, or an email being spam. In this problem, we'll focus on yes/no outcomes (called *binary outcomes*) like these. Logistic regression can be generalized to *multinomial logistic regression*, which handles multiple different types of outcomes and is one of the requirements for your project. 

For concreteness, let's say that we are considering the last case. Suppose that we wish to model the probability that an email is spam as a function of the proportion of flag words (like "investment", "capital", "bank", "account", etc.) in the email's body text. Call this proportion $x$. $x$ is then a variable between $0$ and $1$. 

In logistic regression, we suppose that the probability $p$ that an email is spam has the form 

$$p = \frac{1}{1+e^{-ax - b}}\;,$$

where $a$ and $b$ are again parameters. Let's see how this looks. 

In [None]:
# run this block

import numpy as np
from matplotlib import pyplot as plt

n_points = 100

a = 10
b = -5

x = np.sort(np.random.rand(n_points))
p = 1/(1+np.exp(-a*x - b))

fig, ax = plt.subplots(1)
ax.plot(x, p, color = "black")

As usual, in practice we don't have access to the true function telling us the probability that an email is spam. Instead, we have access to data telling us whether or not the email really IS spam. We can model this situation by flipping a biased coin for each email, with the probability of heads determined by the modeled probability. 

In [None]:
y = 1.0*(np.random.rand(n_points) < p)

A value of 1 indicates that the email is indeed spam, while a value of 0 indicates that the email is not spam. 

In [None]:
ax.scatter(x, y,  alpha = 0.5)
fig

Notice that there are more spam emails where the model gives a high probability, and fewer where the model gives a lower probability. However, there may be some non-spam emails even where the probability is high -- sometimes we get legitimate emails about bank accounts, investments, etc.  

Of course, we don't have access to the true model, so our practical situation looks more like this: 

In [None]:
fig, ax = plt.subplots(1)
ax.scatter(x, y, alpha = 0.5)

We would like to use logistic regression to try to recover something close to the true model. 

## Part A

Write the model function `f`. The arguments of `f` should be the predictor variables `x` and the parameters `a` and `b`. The output of `f` should be the spam probabilities under the logistic model (see equation above) for these data. Use `numpy` tools, without `for`-loops. If you scan the above code carefully, you'll observe that most of this code is already written for you. 

This is a simple function, but **please add a short docstring indicating** what kinds of input it accepts and how to interpret the output. 

Comments are necessary only if your function body exceeds one line. 

In [None]:
# your solution


## Part B

Plot 10 candidate models against the data, using randomly chosen values of `a` between 5 and 15 and randomly chosen values of `b` between -7.5 and -2.5. Your plot should resemble in certain respects the third plot in [these lecture notes](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/ML/ML_2.ipynb). 
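One way to draw random values from an interval, sketched with the same `np.random.rand` used earlier: scale the draws by the interval width and shift by the lower end. The interval and the names `low`, `high`, and `draws` below are illustrative assumptions, not the values required in this part.

```python
import numpy as np

low, high = 2.0, 6.0   # illustrative interval endpoints

# np.random.rand gives values in [0, 1); scaling and shifting maps them to [low, high)
draws = low + (high - low) * np.random.rand(4)
print(draws)
```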

Comments are not necessary in this part. 

In [None]:
# your solution here


## Part C

The *loss function* most commonly used in logistic regression is the *negative cross-entropy*. The negative cross-entropy of the `i`th observation is 

$$-\left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]$$

where $y_i \in \{0,1\}$ is the `i`th entry of the target data and $\hat{p}_i$ is the model's estimated probability that $y_i = 1$. The negative cross-entropy of the entire data set is the sum of the negative cross-entropies for each individual observation. 
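As a quick sanity check on the formula (a worked illustration with made-up numbers, assuming the natural logarithm): for a single observation with $y_i = 1$, only the first term survives. A confident, correct prediction $\hat{p}_i = 0.9$ gives

$$-\log 0.9 \approx 0.105\;,$$

while a confident, incorrect prediction $\hat{p}_i = 0.1$ gives

$$-\log 0.1 \approx 2.303\;.$$

So the loss is small when the model assigns high probability to what actually happened, and grows quickly when it does not.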

Write a function that computes the negative cross entropy as a function of `x`, `y`, `a`, and `b`. This can be done in no more than two lines using `numpy`, without `for`-loops. Don't forget which logarithm is \#BestLogarithm.  

As in Part A, please write a short docstring describing what your function does and what inputs it accepts. Comments are necessary only if your function body exceeds two lines. 

In [None]:
# your solution here


## Part D

On a single axis, plot 100 distinct models, using the code that you wrote in Part B. Highlight the one with the lowest negative cross entropy in a different color -- say, red. Compare the best values of `a` and `b` that you found to the true values, which were `a = 10` and `b = -5`. Are you close? 

The plot you produce should resemble, in some respects, the fifth plot in [these lecture notes](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/ML/ML_2.ipynb). 

It is not necessary to write comments in this part. 

In [None]:
# your solution here


In [None]:
# compare best_a and best_b to a and b


You are not required to use `scipy.optimize` to more accurately estimate `a` and `b` for this homework assignment, but you are free to do so if you wish. You may then use the optimal estimates in the following part. 

## Part E

In classification tasks, we evaluate not just the standard loss function, but also the *accuracy* -- how often does the model correctly classify the data? Let's say that the model classifies an email as spam according to the following rule: 

1. If $\hat{p}_i$ (the model probability plotted above) is larger than $c$, classify the email as spam. 
2. If $\hat{p}_i$ is less than or equal to $c$, classify the email as not-spam. 
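The rule above amounts to an elementwise comparison. Here is a minimal sketch on made-up probabilities (the arrays `p_hat` and `is_spam` are hypothetical, not computed from the data in this problem). Note that a probability exactly equal to $c$ is classified as not-spam.

```python
import numpy as np

p_hat = np.array([0.05, 0.40, 0.50, 0.51, 0.95])  # made-up model probabilities
c = 0.5                                           # classification threshold

# True where the email would be classified as spam, False otherwise
is_spam = p_hat > c
print(is_spam)  # [False False False  True  True]
```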

Write a function called `positive_rates` which accepts the following arguments: 

1. The data, `x` and `y`. 
2. The best parameters `best_a` and `best_b`. 
3. A threshold `c` between 0 and 1. 

This function should output two numbers. The first of these is *false positive rate*: the proportion of non-spam emails that the model incorrectly labels as spam. The second is the *true positive rate*: the proportion of spam emails that the model correctly labels as spam. 

For example: 

```python 
positive_rates(x, y, best_a, best_b, c = 0.5)
```
```
(0.1454545454545455, 0.8545454545454545)
```

**Note**: due to randomization, your numerical output may be slightly different. 
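To make the two rates concrete, here is a small worked illustration with made-up counts (not taken from the data above). Suppose there are 10 emails, of which 6 are not spam and 4 are spam. If the model labels 2 of the non-spam emails and 3 of the spam emails as spam, then

$$\text{false positive rate} = \frac{2}{6} \approx 0.33\;, \qquad \text{true positive rate} = \frac{3}{4} = 0.75\;.$$

The two rates have different denominators, so in general they need not sum to 1.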

Please write a descriptive docstring for your function. Comments are necessary only if your function body exceeds five lines. 

In [None]:
# your solution here


In [None]:
# demonstrate your function here


## Part F

Plot the *receiver operating characteristic* (ROC) curve for the logistic model with parameters `best_a` and `best_b`. The ROC curve is the plot of the `false_positive` rate (on the horizontal axis) against the `true_positive` rate (on the vertical axis) as the threshold `c` is allowed to vary. Additionally, plot a diagonal line ("the line of equality") between the points (0,0) and (1,1). Your ROC curve should lie noticeably above the line of equality. Plot your curves in different colors and add a legend to help your reader understand the plot. 

It is ok to use `for`-loops and list comprehensions in this part. 

Your result should look something like this, although it won't be exactly the same due to randomness. 

<figure class="image" style="width:60%">
  <img src="https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/homework/roc-example.png" alt="The horizontal and vertical axes are the false positive and true positive rates. A diagonal black line goes from the bottom left to the top right. A red curve above the black line indicates the classifier performance.">
</figure>

In [None]:
# roc plot here


Generally speaking, a "good" classifier is one whose ROC curve comes closest to the top-left corner of the ROC diagram. This is a classifier that can achieve a high rate of true positives, while keeping a low rate of false positives. There are various ways to measure how "good" an ROC curve is, which are beyond our present scope. 