In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

import re
import util

plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)


# Lecture 20 – Features

## DSC 80, Spring 2022

### Announcements

- Lab 7 is due on **Monday, May 16th at 11:59PM**.
- 📣 Come to the DSC **Town Hall**, where you can voice your feedback about the DSC program to faculty. 
    - Tuesday, May 16th from 3-5PM in the SDSC Auditorium.
    - [RSVP by **noon on Friday** to secure **free pizza 🍕**!](https://docs.google.com/forms/d/e/1FAIpQLScfP_EFEYt1d5N7dWXGQqQaOik3nY_KTIMYuB1uuEgjH83vRw/viewform)
- Project 4 will be released over the weekend 👀.

### Agenda

- Recap: TF-IDF.
- Features.
- Example: Predicting child heights 📏.

## Recap: TF-IDF

### Term frequency-inverse document frequency

The **term frequency-inverse document frequency (TF-IDF)** of word $t$ in document $d$ is the product:

$$
\begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*}
$$

- If $\text{tfidf}(t, d)$ is large, then $t$ is a good summary of $d$.
    - But to know if $\text{tfidf}(t, d)$ is large, we need to compare it to $\text{tfidf}(t_i, d)$, for several different words $t_i$.

- TF-IDF is a **heuristic** – it has no probabilistic justification.
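
To make the definition concrete, here is a minimal sketch that computes TF-IDF by hand on a toy corpus. The three "documents" and the helper names `tf`, `idf`, and `tfidf_score` are made up for illustration; they are not part of the lecture's SOTU code.

```python
import math

# Toy corpus (hypothetical, not from the lecture): three tiny "documents".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

def tf(term, words):
    # Fraction of the words in the document that are `term`.
    return words.count(term) / len(words)

def idf(term, corpus):
    # log(total number of documents / number of documents containing `term`).
    n_containing = sum(term in words for words in corpus)
    return math.log(len(corpus) / n_containing)

def tfidf_score(term, words, corpus):
    return tf(term, words) * idf(term, corpus)

# TF-IDF of every distinct word in the first document.
scores = {t: tfidf_score(t, tokenized[0], tokenized) for t in set(tokenized[0])}
best = max(scores, key=scores.get)
```

In this toy corpus, `'the'` appears in every document, so its IDF (and therefore its TF-IDF) is 0, while `'mat'` appears only in the first document and ends up as that document's best one-word summary.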

### Example: State of the Union addresses 🎤

Recall, last class, we computed the TF-IDF for every word and every SOTU speech. We used TF-IDFs to **summarize** speeches.

In [None]:
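def extract_struct(speech):
    # The first three lines of each speech are headers; the rest is the body.
    L = speech.strip().split('\n', maxsplit=3)
    # Replace everything except letters, apostrophes, and spaces with a space, then lowercase.
    L[3] = re.sub(r"[^A-Za-z' ]", ' ', L[3]).lower()
    return dict(zip(['speech', 'president', 'date', 'contents'], L))

def five_largest(row):
    # Labels (words) of the five largest values in `row`, in ascending order of value.
    return list(row.index[row.argsort()][-5:])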

sotu = open('data/stateoftheunion1790-2022.txt').read()
speeches = sotu.split('\n***\n')[1:]
speeches_df = pd.DataFrame(list(map(extract_struct, speeches)))
unique_words = pd.Series(speeches_df['contents'].str.split().sum()).value_counts()
unique_words = unique_words.iloc[:500].index

tfidf_dict = {}
tf_denom = speeches_df['contents'].str.split().str.len()
for word in unique_words:
    re_pat = fr' {word} ' # Imperfect pattern for speed
    tf = speeches_df['contents'].str.count(re_pat) / tf_denom
    idf = np.log(len(speeches_df) / speeches_df['contents'].str.contains(re_pat).sum())
    tfidf_dict[word] =  tf * idf
    
tfidf = pd.DataFrame(tfidf_dict)

keywords = tfidf.apply(five_largest, axis=1)
keywords_df = pd.concat([
    speeches_df['president'],
    speeches_df['date'],
    keywords
], axis=1)

In [None]:
tfidf

In [None]:
keywords_df

### Aside: What if we remove the $\log$ from $\text{idf}(t)$?

Let's try it and see what happens.

In [None]:
tfidf_nl_dict = {}
tf_denom = speeches_df['contents'].str.split().str.len()
for word in unique_words:
    re_pat = fr' {word} ' # Imperfect pattern for speed
    tf = speeches_df['contents'].str.count(re_pat) / tf_denom
    idf_nl = len(speeches_df) / speeches_df['contents'].str.contains(re_pat).sum()
    tfidf_nl_dict[word] =  tf * idf_nl
    
tfidf_nl = pd.DataFrame(tfidf_nl_dict)

keywords_nl = tfidf_nl.apply(five_largest, axis=1)
keywords_nl_df = pd.concat([
    speeches_df['president'],
    speeches_df['date'],
    keywords_nl
], axis=1)

In [None]:
tfidf_nl

In [None]:
keywords_nl_df

### The role of $\log$ in $\text{idf}(t)$

$$
\begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{number of occurrences of $t$ in $d$}}{\text{total number of words in $d$}} \cdot \log \left(\frac{\text{total number of documents}}{\text{number of documents in which $t$ appears}} \right) \end{align*}
$$

- Remember, for any positive input $x$, $\log(x)$ is (much) smaller than $x$.
- In $\text{idf}(t)$, the $\log$ "dampens" the impact of the ratio $\frac{\text{# documents}}{\text{# documents with $t$}}$.

- If a word is very common, the ratio will be close to 1. The log of the ratio will be close to 0.

In [None]:
(1000 / 999)

In [None]:
np.log(1000 / 999)

- If a word is very rare, the ratio will be very large. However, for instance, a word being seen in **2 out of 50** documents is not very different from being seen in **2 out of 500** documents (it is very rare in both cases), and so $\text{idf}(t)$ should be similar in both cases.

In [None]:
(50 / 2)

In [None]:
(500 / 2)

In [None]:
np.log(50 / 2)

In [None]:
np.log(500 / 2)

## Features

<center><img src='imgs/DSLC.png' width=50%></center>

### Reflection

So far this quarter, we've learned how to:

- Extract information from tabular data using `pandas` and regular expressions.
- Clean data so that it best represents a data generating process.
    - Missingness analyses and imputation.
- Collect data from the internet through scraping and APIs, and parse it using BeautifulSoup.
- Perform exploratory data analysis through aggregation, visualization, and the computation of summary statistics like TF-IDF.
- Make inferences about the relationships between samples and populations through hypothesis and permutation testing.

- **We haven't** learned how to make predictions.

### Features

* A **feature** is a measurable property or characteristic of a phenomenon being observed.
    * Other words for "feature" include "(explanatory) variable" and "attribute".
* In DataFrames, features typically correspond to **columns**, while rows typically correspond to different individuals.
* There are two types of features:
    * Features that come as part of a dataset, e.g. weight and height.
    * Features that we **create**, e.g. $\text{BMI} = \frac{\text{weight (kg)}}{\text{[height (m)]}^2}$.

**Note:** TF-IDF is a **feature** we've created that summarizes documents!
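
As a small illustration of the difference between features that come with a dataset and features we create, the sketch below builds a BMI column from weight and height. The DataFrame `people` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration.
people = pd.DataFrame({
    'weight_kg': [70.0, 90.0, 55.0],
    'height_m': [1.75, 1.80, 1.60],
})

# Features that came with the data: 'weight_kg' and 'height_m'.
# Feature we create: BMI = weight (kg) / [height (m)]^2.
people['bmi'] = people['weight_kg'] / people['height_m'] ** 2
```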

### Example: San Diego employee salaries

What features are present in `salaries`? What features can we create?

In [None]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')
util.anonymize_names(salaries)

In [None]:
salaries.head()

- Employee salaries.
    - This feature came with the dataset.
- Employee salaries, standardized by job status.
    - We'd need to compute this feature, using information that is already in `salaries`.
- Employee genders.
    - We'd need to merge `salaries` with another data source, like the SSA baby names dataset, to create this feature.
    - How accurate would the resulting feature be?
- Job "category".
    - We could compute this using TF-IDF (which would allow us to find the most important word in each job title).
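
For instance, "salaries, standardized by job status" could be computed with a grouped transformation. The sketch below uses a tiny stand-in DataFrame; the column names `'status'` and `'total_pay'` are assumptions and may not match the actual columns in `salaries`.

```python
import pandas as pd

# Hypothetical stand-in for `salaries`; the column names are assumptions.
toy = pd.DataFrame({
    'status': ['FT', 'FT', 'FT', 'PT', 'PT', 'PT'],
    'total_pay': [90000.0, 100000.0, 110000.0, 20000.0, 30000.0, 40000.0],
})

def standardize(s):
    # z-score within a group, using the population SD (ddof=0).
    return (s - s.mean()) / s.std(ddof=0)

# Each salary is compared only to other salaries with the same job status.
toy['total_pay_z'] = toy.groupby('status')['total_pay'].transform(standardize)
```

After standardizing, a part-time employee earning 40,000 and a full-time employee earning 110,000 receive the same score, because each is equally far above the mean of their own group.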

### What makes a good feature?

- A good feature should be...

    - Faithful to the data generating process.
    - Strongly associated with the phenomenon of interest.
    - Easily used in standard modeling techniques (e.g. quantitative and scaled).
    
- Often, the columns in a dataset aren't good features on their own. In such cases, we may need to "engineer" features that are useful.
    - Useful for what?

## Example: Predicting child heights 📏

### Galton's heights dataset

- When studying missingness, we worked with a dataset containing the heights of children and their parents.
- The dataset was collected by Francis Galton, the founder of eugenics. 
- He was interested in **predicting a child's height**, given various attributes (father's height, mother's height, child gender, etc.).

In [None]:
galton = pd.read_csv('data/galton.csv')
galton.head()

### Exploratory data analysis

The following **scatter matrix** contains a scatter plot of all pairs of quantitative attributes, and a histogram for each quantitative attribute on its own.

In [None]:
pd.plotting.scatter_matrix(galton, figsize=(12, 8));

Is a linear model suitable for prediction? If so, on which attributes?

### Attempt #1: Predict child's height using father's height

We will assume that the relationship between fathers' heights and their children's heights is linear. That is,

$$\text{predicted child's height} = w_0^* + w_1^* \cdot \text{father's height}$$

where $w_0^*$ and $w_1^*$ are carefully chosen **parameters**.

`seaborn`'s `lmplot` function can automatically plot the "line of best fit" on a scatter plot.

In [None]:
sns.lmplot(data=galton, x='father', y='childHeight');

### Recap: Simple linear regression

For any father's height $x_i$, their predicted child's height is given by

$$H(x_i) = w_0 + w_1x_i$$

- **Question:** How do we determine which intercept, $w_0$, and slope, $w_1$, to use?

- **One answer:** Pick the $w_0$ and $w_1$ that minimize **mean squared error**. If $x_i$ and $y_i$ correspond to the $i$th father's height and child's height, respectively, then:

$$\begin{align*}\text{MSE} &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2
\\ &= \frac{1}{n} \sum_{i = 1}^n \big( y_i - w_0 - w_1x_i \big)^2\end{align*}$$

- In DSC 40A, you found the formulas for the best intercept, $w_0^*$, and the best slope, $w_1^*$, through calculus. 
    - The resulting line, $H(x_i) = w_0^* + w_1^* x_i$, is called the **line of best fit**, or the **regression line**.

- Specifically, if $r$ is the correlation coefficient, $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, and $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, then:

$$w_1^* = r \cdot \frac{\sigma_y}{\sigma_x}$$

$$w_0^* = \bar{y} - w_1^* \bar{x}$$

- **Key idea: The lower the MSE is, the "better" the model fits the _training_ data**.
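
The formulas for $w_1^*$ and $w_0^*$ can be checked numerically. The sketch below uses synthetic data (not Galton's measurements) and compares the formulas to `np.polyfit`, which performs a least-squares fit of a degree-1 polynomial.

```python
import numpy as np

# Synthetic data, for illustration only (not Galton's measurements).
rng = np.random.default_rng(0)
x = rng.normal(69, 2.5, size=200)               # stand-in for fathers' heights
y = 40 + 0.4 * x + rng.normal(0, 2, size=200)   # stand-in for children's heights

# Slope and intercept from the formulas above.
r = np.corrcoef(x, y)[0, 1]
w1 = r * np.std(y) / np.std(x)
w0 = np.mean(y) - w1 * np.mean(x)

# Cross-check: np.polyfit returns coefficients from highest degree to lowest.
slope_check, intercept_check = np.polyfit(x, y, deg=1)
```

Both approaches should agree up to floating-point error, since both minimize mean squared error.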

### Finding the regression line programmatically

There are several packages that can perform linear regression; `scipy.stats` is one of them.

In [None]:
from scipy.stats import linregress

In [None]:
lm = linregress(x=galton['father'], y=galton['childHeight'])
lm

The `lm` object has several attributes, most notably `slope` and `intercept`.

In [None]:
lm.intercept

In [None]:
lm.slope

In [None]:
def pred_child(father):
    return lm.intercept + lm.slope * father

`pred_child` works on scalar values:

In [None]:
pred_child(60)

But it also works on arrays/Series:

In [None]:
galton

In [None]:
pred_child(galton['father'])

Recall, a lower MSE means a better fit on the training data. Let's compute the MSE of this simple linear model; it will be useful later.

In [None]:
def mse(actual, pred):
    return np.mean((actual - pred) ** 2)

In [None]:
mse(galton['childHeight'], pred_child(galton['father']))

### Aside: MSE vs. RMSE

An issue with mean squared error is that its units are the **square** of the units of the $y$-values.

$$\text{MSE} = \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2$$

For instance, the number below is 11.892 "inches squared".

In [None]:
mse(galton['childHeight'], pred_child(galton['father']))

To correct the units of mean squared error, we can take the square root. The result, **root mean squared error (RMSE)**, is also a measure of how well a model fits training data.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2}$$

**Important:** The line that minimizes MSE is the same line that minimizes RMSE and SSE (sum of squared errors).
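
The reason is that SSE is MSE multiplied by the positive constant $n$, and RMSE is the square root of MSE; both operations preserve ordering, so all three are smallest at the same parameters. The sketch below illustrates this on synthetic data by trying a grid of candidate slopes. As a simplification, the intercept is held fixed at 25, so only the slope varies.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(60, 80, size=50)
y = 25 + 0.6 * x + rng.normal(0, 2, size=50)

# Candidate slopes; each row of `errors` holds the errors for one candidate.
slopes = np.linspace(0, 1.2, 241)
errors = y[None, :] - (25 + slopes[:, None] * x[None, :])

sse = (errors ** 2).sum(axis=1)
mse_vals = sse / len(x)
rmse_vals = np.sqrt(mse_vals)

# The same candidate slope is best under all three criteria.
best_slopes = (
    slopes[np.argmin(sse)],
    slopes[np.argmin(mse_vals)],
    slopes[np.argmin(rmse_vals)],
)
```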

In [None]:
def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

Let's create a dictionary to keep track of the RMSEs of the various models we create.

In [None]:
rmse_dict = {}
rmse_dict['father only'] = rmse(galton['childHeight'], pred_child(galton['father']))
rmse_dict

### Visualizing our single-feature predictions

- How well does our linear model capture the underlying relationship between the heights of fathers and their children?
- What improvements can we make to our linear model?

In [None]:
sns.scatterplot(data=galton, x='father', y='childHeight', label='actual child heights')
sns.scatterplot(x=galton['father'], 
                y=pred_child(galton['father']), 
                label='predicted child heights'
);

### Attempt #2: Predict child's height using father's and mother's heights

* What if the father is very tall and the mother is very short?
* Adding mother's height as a **feature** should help our predictions.
* When performing linear regression with two features, the result is a **plane of best fit**.

$$\text{predicted child's height} = w_0^* + w_1^* \cdot \text{father's height} + w_2^* \cdot \text{mother's height}$$
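
One way to find such a plane, sketched here with NumPy on synthetic data, is to stack the features into a design matrix with a leading column of ones and solve the least-squares problem. The data and the "true" coefficients below are made up for illustration; this is not necessarily how any particular library computes the fit.

```python
import numpy as np

# Synthetic stand-ins for the Galton columns, for illustration only.
rng = np.random.default_rng(2)
n = 300
father = rng.normal(69, 2.5, size=n)
mother = rng.normal(64, 2.3, size=n)
child = 20 + 0.4 * father + 0.3 * mother + rng.normal(0, 2, size=n)

# Design matrix: a column of ones (for w0), then one column per feature.
X = np.column_stack([np.ones(n), father, mother])

# Least-squares solution of X @ w ≈ child.
w, *_ = np.linalg.lstsq(X, child, rcond=None)
w0, w1, w2 = w

predicted = X @ w
fit_mse = np.mean((child - predicted) ** 2)
```

Each prediction is `w0 + w1 * father + w2 * mother`, which is exactly the equation of a plane over the (father, mother) grid.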

### Multiple regression in `sklearn`

We'll cover `sklearn` in more detail in the coming lectures.

In [None]:
from sklearn.linear_model import LinearRegression

A typical pattern in `sklearn` is instantiate, fit, and predict.

In [None]:
lr = LinearRegression()
lr.fit(X=galton[['father', 'mother']], y=galton['childHeight'])

After calling `fit` on `lr`, we can access the intercept and coefficients of the plane of best fit (i.e. these are $w_0^*$, $w_1^*$, and $w_2^*$).

In [None]:
lr.intercept_, lr.coef_

However, we don't actually need to access these directly. Fitted `LinearRegression` objects have a `predict` method, which we can call instead:

In [None]:
predictions = lr.predict(galton[['father', 'mother']])
predictions[:5]

How well does this model perform?

In [None]:
rmse_dict['father and mother'] = rmse(galton['childHeight'], predictions)
rmse_dict

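The two-feature model has a lower RMSE than the original single-feature model, but only slightly. The decrease itself is expected: adding a feature to a least-squares model can never increase its error on the training data, since the model could always set the new coefficient to 0.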

### Visualizing our two-feature predictions

Here, we must draw a 3D scatter plot and plane, with one axis for father's height, one axis for mother's height, and one axis for child's height. The code below does this.

In [None]:
XX, YY = np.mgrid[60:80:2, 55:75:2]
Z = lr.intercept_ + lr.coef_[0] * XX + lr.coef_[1] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Oranges')

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=galton['father'], 
                           y=galton['mother'], 
                           z=galton['childHeight'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title = 'father',
    yaxis_title = 'mother',
    zaxis_title = 'child'),
    width=1000, height=800)

If we want to visualize in 2D, we must pick a single feature to display on the $x$-axis.

In [None]:
sns.scatterplot(data=galton, x='father', y='childHeight', label='actual child heights')
sns.scatterplot(x=galton['father'], 
                y=predictions, 
                label='predicted child heights using father and mother'
);

In [None]:
sns.scatterplot(data=galton, x='mother', y='childHeight', label='actual child heights')
sns.scatterplot(x=galton['mother'], 
                y=predictions, 
                label='predicted child heights using father and mother'
);

### Attempt #3: Adding gender as a feature

- In Attempt #2, the predicted height of a child depended only on their father's height and mother's height.
- However, we'd expect children of different genders to be of different heights, even for a fixed set of parents' heights.
    - For instance, sisters are usually shorter than brothers.
- Is this theory substantiated by the data?
    - To check, we can start by plotting separate regression lines for each gender.

In [None]:
sns.lmplot(data=galton, x='father', y='childHeight', hue='gender', 
           palette={'male': 'purple', 'female': 'green'});

Observation: It appears that the two lines have similar slopes, but different intercepts.

### Attempt #3: Adding gender as a feature

There's an issue: gender is a categorical feature, but in order to use it as a feature in a regression model, it must be quantitative.

In [None]:
galton.head()

**Solution:** Create a column named `'gender=female'`, that is
- 1 when `'gender'` is `'female'`, and
- 0 otherwise.

In [None]:
galton['gender=female'] = (galton['gender'] == 'female').astype(int)
galton.head()

Now, we can use `'gender=female'` as a feature, just as we used `'father'` and `'mother'` as features.

$$\text{predicted child's height} \\ = w_0^* + w_1^* \cdot \text{father's height} + w_2^* \cdot \text{mother's height} + w_3^* \cdot \text{gender=female}$$

In [None]:
lr_three_features = LinearRegression()
lr_three_features.fit(galton[['father', 'mother', 'gender=female']], galton['childHeight'])

In [None]:
predictions_three_features = lr_three_features.predict(galton[['father', 'mother', 'gender=female']])

In [None]:
rmse_dict['father, mother, and gender'] = rmse(galton['childHeight'], predictions_three_features)
rmse_dict

The RMSE of our new three-feature model is substantially lower than the RMSEs of the earlier models. This indicates that `'gender=female'` is very useful in predicting children's heights.

### Visualizing our three-feature predictions

To visualize our data and linear model, we'd need 4 dimensions:
- One for father's height.
- One for mother's height.
- One for `'gender=female'`.
- One for child's height.

Humans can't visualize in 4D, but there may be a solution.

In [None]:
lr_three_features.intercept_, lr_three_features.coef_

Above, we are given the values of $w_0^*$, $w_1^*$, $w_2^*$, and $w_3^*$. This means our linear model is of the form:

$$\text{predicted child's height} \\ = 21.736 + 0.393 \cdot \text{father's height} + 0.318 \cdot \text{mother's height} - 5.215 \cdot \text{gender=female}$$

But remember, `'gender=female'` is either 1 or 0. Let's look at those two cases separately.

- **For female children:**

$$\text{predicted child's height} = 16.521 + 0.393 \cdot \text{father's height} + 0.318 \cdot \text{mother's height}$$

- **For male children:**

$$\text{predicted child's height} = 21.736 + 0.393 \cdot \text{father's height} + 0.318 \cdot \text{mother's height}$$

- These are really two **parallel planes** in 3D, with different $z$-intercepts!
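
The "parallel planes" idea can be checked numerically: for any fixed pair of parents' heights, switching the indicator from 0 to 1 changes the prediction by exactly $w_3^*$. The sketch below uses synthetic data and a NumPy least-squares fit; the data, the "true" coefficients, and the helper `predict` are hypothetical.

```python
import numpy as np

# Synthetic stand-ins for the Galton columns, for illustration only.
rng = np.random.default_rng(3)
n = 400
father = rng.normal(69, 2.5, size=n)
mother = rng.normal(64, 2.3, size=n)
is_female = rng.integers(0, 2, size=n)  # 0/1 indicator, like 'gender=female'
child = (22 + 0.4 * father + 0.3 * mother
         - 5 * is_female + rng.normal(0, 2, size=n))

# Fit all four parameters by least squares.
X = np.column_stack([np.ones(n), father, mother, is_female])
w0, w1, w2, w3 = np.linalg.lstsq(X, child, rcond=None)[0]

def predict(father_height, mother_height, female):
    return w0 + w1 * father_height + w2 * mother_height + w3 * female

# The gap between the two planes is the same for every pair of parents.
gap_tall = predict(75, 68, 1) - predict(75, 68, 0)
gap_short = predict(62, 60, 1) - predict(62, 60, 0)
```

Because the slopes `w1` and `w2` are shared, the indicator only moves the plane up or down; it cannot tilt it.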

In [None]:
XX, YY = np.mgrid[60:80:2, 55:75:2]
Z_female = (lr_three_features.intercept_ + lr_three_features.coef_[2]) + lr_three_features.coef_[0] * XX + lr_three_features.coef_[1] * YY
Z_male = lr_three_features.intercept_ + lr_three_features.coef_[0] * XX + lr_three_features.coef_[1] * YY

plane_female = go.Surface(x=XX, y=YY, z=Z_female, colorscale ='Greens')
plane_male = go.Surface(x=XX, y=YY, z=Z_male, colorscale='Purples')

fig = go.Figure(data=[plane_female, plane_male])

galton_female = galton[galton['gender'] == 'female']
galton_male = galton[galton['gender'] == 'male']

fig.add_trace(go.Scatter3d(x=galton_female['father'], 
                           y=galton_female['mother'], 
                           z=galton_female['childHeight'], mode='markers', marker = {'color': 'green'}))

fig.add_trace(go.Scatter3d(x=galton_male['father'], 
                           y=galton_male['mother'], 
                           z=galton_male['childHeight'], mode='markers', marker = {'color': 'purple'}))

fig.update_layout(scene = dict(
    xaxis_title = 'father',
    yaxis_title = 'mother',
    zaxis_title = 'child'),
    width=1000, height=800,
    showlegend=False,
    title="Predicted child's heights given parents' heights and gender (purple=male, green=female)")

If we want to visualize in 2D, we must pick a single feature to display on the $x$-axis.

In [None]:
sns.scatterplot(data=galton, x='father', y='childHeight', label='actual child heights')
sns.scatterplot(x=galton['father'], 
                y=predictions_three_features, 
                label='predicted child heights using father, mother, and gender'
);

## Summary, next time

### Summary

- The $\log$ is necessary in computing $\text{idf}(t)$; without it, the inverse document frequency is overemphasized in TF-IDF and the resulting scores are not as meaningful.
- A feature is a measurable property or characteristic of a phenomenon being observed.

### Next time: feature engineering

- Next class, we will learn more about **feature engineering**.
- When we created the `'gender=female'` column in `galton`, we **engineered** a feature that we thought would be useful for our model.
- More generally, **feature engineering** is the act of finding transformations that turn data into effective **quantitative variables**.

- **Question:** How do we decide what features to create?