# Week 7 Wednesday

## Announcements

* In discussion tomorrow, Jinghao will introduce some important new material: the MNIST dataset of handwritten digits.
* My mistake at the end of Monday's class was hard to spot: I rounded the numbers too much before plugging in to the sigmoid function.

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns

## Recap of Monday (and a correction)

On Monday we used Logistic Regression with the flipper length and bill length columns in the penguins dataset to predict if a penguin is in the Chinstrap species.

In [2]:
df = sns.load_dataset("penguins").dropna(axis=0)
df["isChinstrap"] = (df["species"] == "Chinstrap")

In [3]:
cols = ["flipper_length_mm", "bill_length_mm"]

In [4]:
alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="species"
)

In [5]:
from sklearn.linear_model import LogisticRegression

In [7]:
clf = LogisticRegression()

In [8]:
clf.fit(df[cols], df["isChinstrap"])

LogisticRegression()

In [9]:
df["pred"] = clf.predict(df[cols])

The Chinstrap penguins all appear in the same part of the dataset, so it's not surprising that none of these three consecutive rows is a Chinstrap penguin.  The most important thing to notice is that the wrong prediction was made for the row with label `19`.

In [10]:
df.loc[[18, 19, 20], cols + ["isChinstrap", "pred"]]

| | flipper_length_mm | bill_length_mm | isChinstrap | pred |
|---|---|---|---|---|
| 18 | 184.0 | 34.4 | False | False |
| 19 | 194.0 | 46.0 | False | True |
| 20 | 174.0 | 37.8 | False | False |


If you look one cell further down, you see that the classes are listed in the order `False` then `True`.  In this NumPy array of predicted probabilities, the left column corresponds to `False` and the right column corresponds to `True`.  Notice how much more confident the model is for row `18` than for row `19`.  That is good, because the model was correct for row `18` and incorrect for row `19`.

In [11]:
# predicted probabilities
clf.predict_proba(df.loc[[18, 19, 20], cols])

array([[9.99771697e-01, 2.28303114e-04],
       [3.29559670e-01, 6.70440330e-01],
       [7.71818762e-01, 2.28181238e-01]])

The order of the classes listed in the `classes_` attribute is the same as the order for the columns above.

In [12]:
clf.classes_

array([False,  True])
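To connect the two outputs above: for a two-class model, the two columns of `predict_proba` sum to 1 in each row, and `predict` returns the class whose column is larger.  Here is a small sketch using the probabilities copied from the output above, so it runs without the fitted model.  (The variable names `proba`, `classes`, and `pred` are just for illustration.)

```python
import numpy as np

# Probabilities copied from the predict_proba output above (rows 18, 19, 20)
proba = np.array([
    [9.99771697e-01, 2.28303114e-04],
    [3.29559670e-01, 6.70440330e-01],
    [7.71818762e-01, 2.28181238e-01],
])
classes = np.array([False, True])  # same order as clf.classes_

# Each row sums to 1 (up to rounding in the printed values)
row_sums = proba.sum(axis=1)

# The predicted class is the one with the larger probability
pred = classes[proba.argmax(axis=1)]
print(row_sums)
print(pred)
```

The middle entry of `pred` is `True`, which is the wrong prediction we saw for row `19`.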

At the end of class Monday, I tried to recover the `0.67044` number for the predicted probability that the penguin in the row with label `19` is a Chinstrap penguin.

In [13]:
clf.coef_

array([[-0.34802208,  1.08405225]])

In [14]:
clf.intercept_

array([18.36005773])

In [15]:
df.loc[19, cols]

flipper_length_mm    194.0
bill_length_mm        46.0
Name: 19, dtype: object

Be sure you see where this formula comes from, in terms of the information above.

In [16]:
# Why not 0.67 or maybe 1-0.67???
1/(1+np.exp(-(18.36+-0.35*194+1.08*46)))

0.5349429451582182

It turned out that I wasn't using enough decimal places when I plugged in the numbers.  Here we see the `0.67` we expected.  All I did was replace `-0.35` with `-0.348` and replace `1.08` with `1.084`.

In [18]:
# One more digit of precision to my input coefficients
1/(1+np.exp(-(18.36+-0.348*194+1.084*46)))

0.670842935992734
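The reason such a small change matters is that the coefficients get multiplied by large measurements.  Changing `-0.35` to `-0.348` shifts the input to the sigmoid by `0.002*194 = 0.388`, and changing `1.08` to `1.084` shifts it by another `0.004*46 = 0.184`.  Here is that computation broken into pieces.  (The helper name `sigmoid` is my own.)

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

flipper, bill = 194, 46

# Linear term with coefficients rounded to two decimal places
rough = 18.36 + -0.35 * flipper + 1.08 * bill
# Linear term with coefficients rounded to three decimal places
better = 18.36 + -0.348 * flipper + 1.084 * bill

print(rough, sigmoid(rough))    # linear term is approximately 0.14
print(better, sigmoid(better))  # linear term is approximately 0.712
```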

Notice above that the `coef_` attribute is two-dimensional.  In this case, it's not clear why that is, but below when we have three target classes (rather than two), it will be more clear why we have this extra dimension.

In [19]:
clf.coef_[0]

array([-0.34802208,  1.08405225])

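Here we use the exact coefficients from `clf.coef_`, but the intercept is still rounded to `18.36`.  That is why the result below, `0.67043`, matches the predicted probability `0.67044` only to about four decimal places.  Using `clf.intercept_[0]` in place of `18.36` should close the remaining gap (up to possible numerical precision issues related to floating point numbers).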

In [20]:
# more precision
1/(1+np.exp(-(18.36+clf.coef_[0][0]*194+clf.coef_[0][1]*46)))

0.6704275750208124
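For comparison, here is the same computation written as a dot product, with the intercept also taken at its full printed precision.  The numbers are copied from the `coef_` and `intercept_` outputs above so that this sketch runs without refitting the model; in the notebook itself you would use `clf.coef_[0]` and `clf.intercept_[0]` directly.

```python
import numpy as np

coef = np.array([-0.34802208, 1.08405225])  # copied from clf.coef_[0]
intercept = 18.36005773                     # copied from clf.intercept_[0]
row = np.array([194.0, 46.0])               # flipper length, bill length for row 19

linear = intercept + coef @ row
prob = 1 / (1 + np.exp(-linear))
print(prob)  # should be close to the 0.67044 reported by predict_proba
```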

Interpretation question:

* As flipper length increases, is the penguin more or less likely to be a Chinstrap penguin, according to our model?  What about bill length?

If you look at the overall formula, this will get pretty confusing.  Better is to notice that $\sigma(x) = 1/(1+e^{-x})$ is an increasing function, and our formula is $\sigma(L(x))$ for some linear term $L(x)$.  This interpretation question is much easier when we focus on the linear part: all we need to do is look at the signs of the coefficients.

Answer: as flipper length increases, the probability of being Chinstrap decreases.  Why?  The flipper length coefficient, about -0.348, is negative.  As bill length increases, the probability increases (the bill length coefficient, about 1.084, is positive).
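A quick numerical check of this answer, using the coefficients and intercept copied from the outputs above: increase one measurement by 1 mm while holding the other fixed, and see which way the probability moves.  (The function name `prob_chinstrap` is hypothetical.)

```python
import math

def prob_chinstrap(flipper, bill):
    # Coefficients and intercept copied from clf.coef_ and clf.intercept_ above
    linear = 18.36005773 + -0.34802208 * flipper + 1.08405225 * bill
    return 1 / (1 + math.exp(-linear))

base = prob_chinstrap(194, 46)
longer_flipper = prob_chinstrap(195, 46)  # negative coefficient: probability drops
longer_bill = prob_chinstrap(194, 47)     # positive coefficient: probability rises
print(longer_flipper, base, longer_bill)
```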

## Decision boundary

* Drop the "pred" column from `df` (we will make a new one below) using the `drop` method and a suitable `axis` keyword argument.

We are using `axis=1` because it is the column labels that are changing (one of the column labels is being removed).  I added `errors="ignore"` so that if we execute this cell twice, no error is raised.

In [24]:
df = df.drop("pred", axis=1, errors="ignore")
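Here is a small sketch of what `errors="ignore"` changes, using a made-up two-column DataFrame (`df_small` is not part of the penguins data).

```python
import pandas as pd

df_small = pd.DataFrame({"a": [1, 2], "pred": [True, False]})

# The first drop removes the "pred" column
df_small = df_small.drop("pred", axis=1, errors="ignore")
# The second drop finds no "pred" column; errors="ignore" keeps it from raising
df_small = df_small.drop("pred", axis=1, errors="ignore")
print(df_small.columns.tolist())

# Without errors="ignore", dropping a missing column raises a KeyError
try:
    df_small.drop("pred", axis=1)
    raised = False
except KeyError:
    raised = True
print(raised)
```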

* Fit a new logistic regression classifier, using the same input features, but this time using the "species" column as our target.  (This is our first time seeing logistic regression with more than two classes.  When we perform classification with two classes, it is called "Binary Classification", and is often easier to explain.)

Even though we have three output classes, the procedure is the same as the one we have been using all along.

In [25]:
clf = LogisticRegression()

The classifier will automatically report outputs using the same names that are in the "species" column.

In [26]:
clf.fit(df[cols], df["species"])

LogisticRegression()

Remember how the order of `False` and `True` above was important.  Here the order of the penguin species is also important.  As far as I know, scikit-learn stores the sorted unique target values in `classes_`, so string labels like these appear in alphabetical order.

In [27]:
clf.classes_

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
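Both class orders we have seen (`False` before `True`, and the species alphabetically) match what `np.unique` returns, namely the distinct values in sorted order.  The following only demonstrates `np.unique` on made-up arrays; it does not inspect scikit-learn itself.

```python
import numpy as np

species = np.array(["Gentoo", "Adelie", "Chinstrap", "Adelie"])
flags = np.array([True, False, True])

# np.unique returns the distinct values in sorted order
print(np.unique(species))
print(np.unique(flags))
```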

* Check the `coef_` attribute.  How does it relate to the `coef_` attribute we found above, where we were only considering Chinstrap penguins?

In [28]:
clf.coef_

array([[-0.10544131, -0.59253116],
       [-0.2674632 ,  0.71305121],
       [ 0.37290451, -0.12052006]])
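One way to approach this question is to compare the Chinstrap row with the earlier two-class coefficients.  There is one row per class (in the order of `clf.classes_`) and one column per input feature.  The Chinstrap row has the same signs as before (negative for flipper length, positive for bill length), though not the same values.  The columns also sum to approximately zero, which is what I would expect from a softmax-style (multinomial) fit; that interpretation is an assumption about scikit-learn's default behavior, not something shown in this notebook.

```python
import numpy as np

# Copied from the clf.coef_ output above; rows follow clf.classes_
coef_multi = np.array([
    [-0.10544131, -0.59253116],  # Adelie
    [-0.2674632,   0.71305121],  # Chinstrap
    [ 0.37290451, -0.12052006],  # Gentoo
])
coef_binary = np.array([-0.34802208, 1.08405225])  # earlier isChinstrap model

print(coef_multi.shape)  # one row per class, one column per feature
same_signs = np.sign(coef_multi[1]) == np.sign(coef_binary)
print(same_signs)
print(coef_multi.sum(axis=0))  # each column sums to roughly zero
```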

Here are the predicted probabilities for that "incorrect" row `19` from above.  The model is actually considerably more confident in this case (and is still wrong): it thinks there is a 91.5% chance the penguin is a Chinstrap penguin, compared to 67% for the two-class model.  (We know the middle number corresponds to Chinstrap, by looking at the order of the values in `clf.classes_`.)

In [29]:
clf.predict_proba(df.loc[[19], cols])

array([[0.08336009, 0.91516209, 0.00147781]])

Here is what it looks like if we evaluate the `predict_proba` method on a 4-row sub-DataFrame.  Notice how the output has four rows also and three columns (one column for each target class).

In [31]:
clf.predict_proba(df.loc[[18, 19, 20, 21], cols])

array([[9.99984693e-01, 1.46859326e-05, 6.21326672e-07],
       [8.33600949e-02, 9.15162092e-01, 1.47781293e-03],
       [9.93753092e-01, 6.24688236e-03, 2.57127059e-08],
       [9.97917055e-01, 2.08251094e-03, 4.34432466e-07]])
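To read off predictions from this array, take the position of the largest entry in each row and look it up in `clf.classes_`.  A sketch with the values copied from the output above (so no fitted model is needed):

```python
import numpy as np

classes = np.array(["Adelie", "Chinstrap", "Gentoo"])  # order of clf.classes_
proba = np.array([
    [9.99984693e-01, 1.46859326e-05, 6.21326672e-07],
    [8.33600949e-02, 9.15162092e-01, 1.47781293e-03],
    [9.93753092e-01, 6.24688236e-03, 2.57127059e-08],
    [9.97917055e-01, 2.08251094e-03, 4.34432466e-07],
])

print(proba.shape)  # (number of rows, number of classes)
pred = classes[proba.argmax(axis=1)]
print(pred)
```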

* Add a column "pred" to `df` containing the predicted values.

This is no different from above.

In [32]:
df["pred"] = clf.predict(df[cols])

Notice how the "pred" column at the far right contains penguin species strings.  So the `predict` method can output strings, not just numbers or Boolean values.

In [33]:
df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | isChinstrap | pred |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | False | Adelie |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | False | Adelie |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | False | Adelie |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | False | Adelie |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | False | Adelie |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female | False | Gentoo |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | False | Gentoo |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | False | Gentoo |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | False | Gentoo |


* Make an Altair scatter plot showing the predicted values.

The portion where the model switches from one prediction to another prediction is called the "decision boundary".  It's hard to recognize the decision boundary in this picture, because there is so much empty space.

You should compare this to the true species values, which were plotted at the very top of this notebook.  The predicted species values are very close to the actual values, but a little more regular.

In [34]:
alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="pred"
)

It is a little hard to see from the above picture how the predictions are made.  It turns out there are a few straight line segments, and on one side of each line segment, one prediction is made, and on the other side, another prediction is made.  We will make a fake dataset from which these "decision boundaries" are more clear.

* Using `np.linspace`, make a NumPy array of 70 equally spaced x-coordinates and 70 equally spaced y-coordinates.  Name these NumPy arrays `x` and `y`.

Notice how the ranges here are chosen to match the approximate ranges of the flipper length and the bill length.

In [35]:
x = np.linspace(170, 235, 70)
y = np.linspace(30, 60, 70)
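A quick check of what `np.linspace` produces here: both endpoints are included, so 70 points leave 69 gaps between them.

```python
import numpy as np

x = np.linspace(170, 235, 70)

print(len(x))       # 70 points
print(x[0], x[-1])  # both endpoints are included
step = x[1] - x[0]
print(step)         # (235 - 170) / 69, since 70 points leave 69 gaps
```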

* Make a DataFrame `df_art` (for "artificial") containing all the possible pairs of coordinates from `x` and `y`.  (We chose `70` above so `df_art` will have `4900` rows, which is a good length for Altair.)

To get these pairs, we can for example use the NumPy function `meshgrid` or we can use `itertools.product`.  (In Worksheet 14, things are a little easier: we just choose random values and don't worry about making them evenly spaced.)

* Add a corresponding "pred" column to `df_art`.

We start out here using the `product` function from the `itertools` library.

In [36]:
from itertools import product

Think of `product` as forming something like a Cartesian product.  That's not clear at all from looking at the object it returns.

In [37]:
product(x,y)

<itertools.product at 0x7f6843e4df00>

If we convert it to a list, it's more clear.

In [38]:
list(product(x,y))

[(170.0, 30.0),
 (170.0, 30.434782608695652),
 (170.0, 30.869565217391305),
 (170.0, 31.304347826086957),
 (170.0, 31.73913043478261),
 (170.0, 32.17391304347826),
 (170.0, 32.608695652173914),
 (170.0, 33.04347826086956),
 (170.0, 33.47826086956522),
 (170.0, 33.91304347826087),
 (170.0, 34.34782608695652),
 (170.0, 34.78260869565217),
 (170.0, 35.21739130434783),
 (170.0, 35.65217391304348),
 (170.0, 36.08695652173913),
 (170.0, 36.52173913043478),
 (170.0, 36.95652173913044),
 (170.0, 37.391304347826086),
 (170.0, 37.82608695652174),
 (170.0, 38.26086956521739),
 (170.0, 38.69565217391305),
 (170.0, 39.130434782608695),
 (170.0, 39.565217391304344),
 (170.0, 40.0),
 (170.0, 40.434782608695656),
 (170.0, 40.869565217391305),
 (170.0, 41.30434782608695),
 (170.0, 41.73913043478261),
 (170.0, 42.17391304347826),
 (170.0, 42.608695652173914),
 (170.0, 43.04347826086956),
 (170.0, 43.47826086956522),
 (170.0, 43.91304347826087),
 (170.0, 44.34782608695652),
 (170.0, 44.78260869565217),
 

There are `4900` tuples in this list ($70 \cdot 70$).

In [39]:
len(list(product(x,y)))

4900
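A smaller example makes the ordering easier to see: the second input cycles fastest, and the number of pairs is the product of the two lengths.  (The inputs here are made up for illustration.)

```python
from itertools import product

pairs = list(product([1, 2], ["a", "b", "c"]))
print(pairs)
print(len(pairs))  # 2 * 3 pairs
```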

We convert this into a DataFrame with 4900 rows and two columns.

In [40]:
df_art = pd.DataFrame(list(product(x,y)))

We give the columns the same names as our input features.

In [41]:
df_art.columns = cols
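The last two steps can also be combined by passing the `columns` keyword argument to the DataFrame constructor.  Here is a sketch on a much smaller grid (the name `small` and the grid values are just for illustration).

```python
from itertools import product
import pandas as pd

cols = ["flipper_length_mm", "bill_length_mm"]

# Build the pairs and name the columns in one step
small = pd.DataFrame(
    list(product([170.0, 235.0], [30.0, 45.0, 60.0])),
    columns=cols,
)
print(small)
```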

Here is what `df_art` looks like.

In [42]:
df_art

| | flipper_length_mm | bill_length_mm |
|---|---|---|
| 0 | 170.0 | 30.000000 |
| 1 | 170.0 | 30.434783 |
| 2 | 170.0 | 30.869565 |
| 3 | 170.0 | 31.304348 |
| 4 | 170.0 | 31.739130 |
| ... | ... | ... |
| 4895 | 235.0 | 58.260870 |
| 4896 | 235.0 | 58.695652 |
| 4897 | 235.0 | 59.130435 |
| 4898 | 235.0 | 59.565217 |


We now add a prediction column.  (A warning would be raised if we didn't have the same column names as when we fit the classifier using `clf.fit`.)

In [43]:
df_art["pred"] = clf.predict(df_art[cols])

Here is the new DataFrame.  For example, our classifier predicts a penguin with flipper length 170 and bill length 30 is an Adelie penguin.  (Notice how there is probably no actual penguin with these measurements.)

In [44]:
df_art

| | flipper_length_mm | bill_length_mm | pred |
|---|---|---|---|
| 0 | 170.0 | 30.000000 | Adelie |
| 1 | 170.0 | 30.434783 | Adelie |
| 2 | 170.0 | 30.869565 | Adelie |
| 3 | 170.0 | 31.304348 | Adelie |
| 4 | 170.0 | 31.739130 | Adelie |
| ... | ... | ... | ... |
| 4895 | 235.0 | 58.260870 | Gentoo |
| 4896 | 235.0 | 58.695652 | Gentoo |
| 4897 | 235.0 | 59.130435 | Gentoo |
| 4898 | 235.0 | 59.565217 | Gentoo |


* Make another Altair scatter plot of the predicted species, this time using `df_art`.

The most important thing to recognize from the following picture is that there are three regions (corresponding to the three classes) and that the regions are separated by linear boundaries.  This partly explains why `LogisticRegression` is defined in the `linear_model` module of scikit-learn.

In [45]:
alt.Chart(df_art).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="pred"
)
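Here is a short sketch of why the boundaries in this picture are straight lines.  It assumes the model assigns each class $k$ a linear score $s_k(x) = w_k \cdot x + b_k$, where $w_k$ is the $k$-th row of `coef_` and $b_k$ is the $k$-th entry of `intercept_`, and predicts the class with the largest score.  (The symbols $s_k$, $w_k$, $b_k$ are my own notation.)  Under that assumption, two classes $i$ and $j$ tie exactly when

$$
s_i(x) = s_j(x) \iff (w_i - w_j) \cdot x + (b_i - b_j) = 0.
$$

This is a single linear equation in the two coordinates (flipper length and bill length), so the points where two classes tie lie on a line.  The two-class case from Monday fits the same pattern: the prediction switches where $\sigma(L(x)) = 0.5$, which happens exactly when $L(x) = 0$.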

We had some extra time, so let's see how to use `np.meshgrid` instead of `itertools.product`.  Here is a reminder of what `x` looks like.

In [46]:
x

array([170.        , 170.94202899, 171.88405797, 172.82608696,
       173.76811594, 174.71014493, 175.65217391, 176.5942029 ,
       177.53623188, 178.47826087, 179.42028986, 180.36231884,
       181.30434783, 182.24637681, 183.1884058 , 184.13043478,
       185.07246377, 186.01449275, 186.95652174, 187.89855072,
       188.84057971, 189.7826087 , 190.72463768, 191.66666667,
       192.60869565, 193.55072464, 194.49275362, 195.43478261,
       196.37681159, 197.31884058, 198.26086957, 199.20289855,
       200.14492754, 201.08695652, 202.02898551, 202.97101449,
       203.91304348, 204.85507246, 205.79710145, 206.73913043,
       207.68115942, 208.62318841, 209.56521739, 210.50724638,
       211.44927536, 212.39130435, 213.33333333, 214.27536232,
       215.2173913 , 216.15942029, 217.10144928, 218.04347826,
       218.98550725, 219.92753623, 220.86956522, 221.8115942 ,
       222.75362319, 223.69565217, 224.63768116, 225.57971014,
       226.52173913, 227.46376812, 228.4057971 , 229.34

The `meshgrid` function returns two NumPy arrays (when we provide two inputs), corresponding to the Cartesian product.

In [47]:
xx, yy = np.meshgrid(x,y)

Notice how the output `xx` is actually two-dimensional.

In [48]:
xx

array([[170.        , 170.94202899, 171.88405797, ..., 233.11594203,
        234.05797101, 235.        ],
       [170.        , 170.94202899, 171.88405797, ..., 233.11594203,
        234.05797101, 235.        ],
       [170.        , 170.94202899, 171.88405797, ..., 233.11594203,
        234.05797101, 235.        ],
       ...,
       [170.        , 170.94202899, 171.88405797, ..., 233.11594203,
        234.05797101, 235.        ],
       [170.        , 170.94202899, 171.88405797, ..., 233.11594203,
        234.05797101, 235.        ],
       [170.        , 170.94202899, 171.88405797, ..., 233.11594203,
        234.05797101, 235.        ]])

Imagine grouping together the `xx` and `yy` entries.  At the top we would get `(170, 30)`, then we would get (approximately) `(170.9, 30)`, and so on.

In [49]:
yy

array([[30.        , 30.        , 30.        , ..., 30.        ,
        30.        , 30.        ],
       [30.43478261, 30.43478261, 30.43478261, ..., 30.43478261,
        30.43478261, 30.43478261],
       [30.86956522, 30.86956522, 30.86956522, ..., 30.86956522,
        30.86956522, 30.86956522],
       ...,
       [59.13043478, 59.13043478, 59.13043478, ..., 59.13043478,
        59.13043478, 59.13043478],
       [59.56521739, 59.56521739, 59.56521739, ..., 59.56521739,
        59.56521739, 59.56521739],
       [60.        , 60.        , 60.        , ..., 60.        ,
        60.        , 60.        ]])
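The pattern in `xx` and `yy` is easier to see on a tiny example.  (The arrays `xs` and `ys` below are made up for illustration.)

```python
import numpy as np

xs = np.array([1, 2, 3])
ys = np.array([10, 20])
xx_small, yy_small = np.meshgrid(xs, ys)

print(xx_small)        # every row is a copy of xs
print(yy_small)        # row i is filled with ys[i]
print(xx_small.shape)  # (len(ys), len(xs))
```

Pairing up the entries in matching positions gives every combination of an `xs` value with a `ys` value.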

We get an error here (to me the error message is not very helpful) because we are using two-dimensional NumPy arrays for the columns.  (Remember that `xx` and `yy` were two-dimensional.)

In [50]:
df_art2 = pd.DataFrame(
    {
        cols[0]: xx,
        cols[1]: yy
    }
)

ValueError: If using all scalar values, you must pass an index

The array `xx` is a 70-by-70 array.

In [51]:
xx.shape

(70, 70)

We can reshape it to a one-dimensional array by calling `xx.reshape(-1)`.  Think of `-1` as like a wild card, which says to make it however long is necessary to include all the data.

In [52]:
xx.reshape(-1).shape

(4900,)
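Here is the `-1` wild card on a small made-up array.  NumPy fills in whatever length makes the total number of entries work out.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)
flat = arr.reshape(-1)     # -1 is inferred as 6, giving a one-dimensional array
tall = arr.reshape(-1, 2)  # -1 is inferred as 3, giving a 3-by-2 array
print(flat)
print(tall.shape)
```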

This code no longer raises an error.

In [53]:
df_art2 = pd.DataFrame(
    {
        cols[0]: xx.reshape(-1),
        cols[1]: yy.reshape(-1)
    }
)

We can add a "pred" column just like above.

In [55]:
df_art2["pred"] = clf.predict(df_art2[cols])

Here is another illustration of the decision boundary.  It should be the same image as above, because `df_art2` contains the same 4900 points as `df_art`, only listed in a different row order.  (Notice we are using `df_art2` instead of `df_art`.)

In [56]:
alt.Chart(df_art2).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="pred"
)
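As a sanity check on the two approaches, here is a sketch on a much smaller grid comparing the pairs from `itertools.product` with the pairs from `np.meshgrid`.  The points are the same but the order differs; passing `indexing="ij"` to `np.meshgrid` reproduces the order used by `product`.  (All variable names below are just for illustration.)

```python
from itertools import product
import numpy as np

xs = np.linspace(170, 235, 4)
ys = np.linspace(30, 60, 3)

pairs_product = [(float(a), float(b)) for a, b in product(xs, ys)]

xx_s, yy_s = np.meshgrid(xs, ys)
pairs_mesh = list(zip(xx_s.reshape(-1).tolist(), yy_s.reshape(-1).tolist()))

# Same points, different order
print(set(pairs_product) == set(pairs_mesh))
print(pairs_product == pairs_mesh)

# With indexing="ij", the flattened order matches itertools.product
xx_ij, yy_ij = np.meshgrid(xs, ys, indexing="ij")
pairs_ij = list(zip(xx_ij.reshape(-1).tolist(), yy_ij.reshape(-1).tolist()))
print(pairs_product == pairs_ij)
```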