In [1]:
import re
import pandas as pd
import numpy as np

import util

# Lecture 18 – Text as Data

## DSC 80, Spring 2022

### Announcements

- Lab 6 is due **today at 11:59PM**.
    - **You don't have to do Question 3** ([even though it might work again](https://campuswire.com/c/G325FA25B/feed/1061)).
- Project 3 is due on **Thursday, May 12th at 11:59PM**.
- Look at the [Grade Report](https://www.gradescope.com/courses/379137/assignments/2051129/) on Gradescope, which summarizes your grades on all assessments so far.
    - Project 2 and Lab 5 grades have also been released.

### Agenda

- Example: Log parsing with regular expressions.
- Quantifying text data.
- Bag of words.

Remember to refer to [dsc80.com/resources/#regular-expressions](https://dsc80.com/resources/#regular-expressions).

## Example: Log parsing

Recall the **log string** from a few lectures ago.

In [2]:
s = '''132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string `s`.

In [None]:
exp = '\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)

While the above regex works, it is not very **specific**: it also matches incorrectly formatted log strings.

In [None]:
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)

### The more specific, the better!
* Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
    - `.*` matches every possible string, but we don't use it very often.
    
* A better date extraction regex:
```
\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
```

    * `\d{2}` matches any 2-digit number.
    * `[A-Z]{1}` matches any single occurrence of any uppercase letter.
    * `[a-z]{2}` matches any 2 consecutive occurrences of lowercase letters.
    * Remember, special characters like `[` and `]` need to be escaped with `\`. (In Python, `/` is not special, so escaping it is optional but harmless.)
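As a rough sketch of how the stricter pattern might be used in practice, the snippet below adds named groups so that each captured piece can be looked up by name. The group names (`day`, `month`, and so on) and the helper `parse_timestamp` are made up for illustration and are not part of the lecture's code.

```python
import re

# Same pattern as above, written as a raw string with (hypothetical) named groups.
LOG_TIME = re.compile(
    r'\[(?P<day>\d{2})\/(?P<month>[A-Z]{1}[a-z]{2})\/(?P<year>\d{4})'
    r':(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) -\d{4}\]'
)

def parse_timestamp(log_line):
    """Return the timestamp pieces as a dict, or None if the line doesn't match."""
    match = LOG_TIME.search(log_line)
    return match.groupdict() if match else None

good = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
bad = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

print(parse_timestamp(good))   # dict of day, month, year, hour, minute, second
print(parse_timestamp(bad))    # None, since the string doesn't follow the format
```

Returning `None` for a non-matching line is one possible design; raising an error would be another reasonable choice.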

In [None]:
s

In [None]:
new_exp = '\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)

A benefit of `new_exp` over `exp` is that it doesn't capture anything when the string doesn't follow the format we specified.

In [None]:
other_s

In [None]:
re.findall(new_exp, other_s)

### Another character class

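The `\b` character class refers to "word boundaries". Strictly speaking, it matches a position rather than a character: the spot between a word character (a letter, digit, or underscore) and a non-word character, or the start or end of the string.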

In [None]:
re.findall('\\b\w+\\b', 'hello, my name is billy')

In [None]:
re.findall('\\b\w+\\b', 'hello-my-name-is-bil_ly!!')

Remember, the `\w` character class refers to letters, digits, and underscores, i.e. "word" characters.

**Question:** What's with the `\\`?

### Aside: "raw" strings

- Regular expressions use `\` to escape special characters and to denote character classes (like `\w` and `\b`).
- Python also uses `\` inside ordinary string literals to write escape sequences, such as `\n` for a newline.

In [None]:
print('ho\ney')

Sometimes, the regex meaning and Python meaning of a special string clash.

In [None]:
'hi\billy'

In [None]:
print('hi\billy') # \b means "backspace" in Python

In [None]:
print('hi\\billy')

To prevent Python from interpreting `\b`, `\n`, etc. as its own special strings, and to keep them in their "raw" form, use raw strings.

To create a raw string, add the character `r` right before the quotes.

In [None]:
r'hi\billy'

In [None]:
print(r'hi\billy')

Raw strings can help us avoid misinterpretations like the one below.

In [None]:
re.findall('\b\w+\b', 'hello, my name is billy')

In [None]:
re.findall(r'\b\w+\b', 'hello, my name is billy')

If you don't want to use a raw string, you'd instead have to escape the `\b` with another `\`, as we did on the previous slide:

In [None]:
re.findall('\\b\w+\\b', 'hello, my name is billy')
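To see why the patterns above behave differently, it can help to look at the characters Python actually stores for each spelling of `\b`. The variable names below are only for illustration.

```python
import re

backspace = '\b'    # Python turns this into a single backspace character
escaped = '\\b'     # a backslash followed by the letter b
raw = r'\b'         # raw string: also a backslash followed by the letter b

print(len(backspace), len(escaped), len(raw))
print(escaped == raw)

# Only the two-character version reaches the regex engine as a word boundary.
text = 'hello, my name is billy'
print(re.findall(backspace + r'\w+' + backspace, text))   # looks for literal backspaces
print(re.findall(raw + r'\w+' + raw, text))               # looks for whole words
```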

## Reflection

### Limitations of regexes

Writing a regular expression is like writing a program.
* You need to know the syntax well.
* They can be easier to write than to read.
* They can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:
* Anything involving counting (same number of instances of a and b).
* Anything involving complex structure (palindromes).
* Parsing highly complex text structure ([HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), for instance).
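For the first two kinds of problems, plain Python is often simpler than any regex. The sketch below uses two hypothetical helpers, `same_count` and `is_palindrome`, to illustrate the idea.

```python
def same_count(text, first='a', second='b'):
    """True if `first` and `second` appear the same number of times in `text`."""
    return text.count(first) == text.count(second)

def is_palindrome(text):
    """True if `text` reads the same forwards and backwards."""
    return text == text[::-1]

print(same_count('aabb'), same_count('aab'))
print(is_palindrome('racecar'), is_palindrome('regex'))
```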

Below is a regular expression that validates email addresses in Perl. See [this article](http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html) for more details.



<center><img src="imgs/image_8.png" width=700></center>

Stack Overflow once went down because of a regex! See [this article](https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016) for the details.

<center><img src='imgs/so_regex.png' width=60%></center>

### Advice

- You don't need to force yourself to "memorize" regex syntax – refer to the resources in the [Agenda](#Agenda) section of the lecture and on the [Resources](https://dsc80.com/resources#regular-expressions) tab of the course website.
- Also refer to the three tables of syntax in the lecture:
    - [Regex building blocks](#Regex-building-blocks-🧱).
    - [More regex syntax](#More-regex-syntax).
    - [Even more regex syntax](#Even-more-regex-syntax).
- **Note:** You don't always have to use regular expressions! If Python/`pandas` string methods work for your task, you can still use those.
- **Play [Regex Golf](https://alf.nu/RegexGolf?world=regex&level=r00) to practice!** 🏌️



## Quantifying text data

### Quantifying text data

- How do we **quantify** the similarity of two text documents?
- How do we use a text document as input in a regression or classification model?
- **How do we turn a text document into a vector of numbers?**
    - From 40A: A **design matrix** consists of one row per "data point", and one column per "feature".


### Example: San Diego employee salaries

Recall, in Lectures 1 and 2, we worked with a dataset of San Diego city employee salaries in 2020.

In [None]:
# 2021 data is now actually available, but we will use 2020 data as we did earlier in the quarter
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')
util.anonymize_names(salaries)

In [None]:
salaries.head()

We asked the question, "Does gender influence pay?"

A follow-up question to that was "Do men and women make similar salaries amongst those with similar jobs?" – but what makes two jobs similar?

### Exploring job titles

In [None]:
jobtitles = salaries['Job Title']
jobtitles.head()

How many job titles are there in the dataset? How many **unique** job titles are there?

In [None]:
jobtitles.shape[0], jobtitles.nunique()

What are the most common job titles?

In [None]:
jobtitles.value_counts().iloc[:100]

In [None]:
jobtitles.value_counts().iloc[:25].sort_values().plot(kind='barh', figsize=(8, 6));

### Messiness of job titles

- Are there multiple representations of the same job title (e.g. `'Assistant Fire Chief'` vs. `'Asst. Fire Chief'`)?
- Are there multiple representations of the same word that is used in multiple job titles (e.g. `'Civil Eng.'` vs `'Mechanical engineer'`)?

Run the cell below repeatedly to get a feel for the "messiness" of job titles in their current state.

In [None]:
jobtitles.sample(10)

### Canonicalizing job titles

Let's try to **canonicalize** job titles. To do this, we'll look at:

- Punctuation.
- "Glue" words.
- Abbreviations.

### Punctuation

Are there job titles with unnecessary punctuation that we can remove? 

- To find out, we can write a regular expression that looks for characters other than letters, numbers, and spaces.

- We can use regular expressions with the `.str` methods we learned earlier in the quarter just by using `regex=True`.

In [None]:
jobtitles.str.contains(r'[^A-Za-z0-9 ]', regex=True).sum()

In [None]:
jobtitles[jobtitles.str.contains(r'[^A-Za-z0-9 ]', regex=True)].head()

It seems like we should replace these pieces of punctuation with a single space.

### "Glue" words

Are there job titles with "glue" words in the middle, such as Assistant <u>to the</u> Chief?

To figure out if any titles contain the word `'to'`, we **can't** just do the following, because it will evaluate to `True` for job titles that have `'to'` anywhere in them, even if not as a standalone word.

In [None]:
# Why are we converting to lowercase?
jobtitles.str.lower().str.contains('to').sum()

In [None]:
jobtitles[jobtitles.str.lower().str.contains('to')]

Instead, we need to look for `'to'` separated by word boundaries.

In [None]:
jobtitles.str.lower().str.contains(r'\bto\b', regex=True).sum()

In [None]:
jobtitles[jobtitles.str.lower().str.contains(r'\bto\b', regex=True)]
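The difference between the two searches is easier to see on a tiny example. The three titles below are made up for illustration, and the sketch uses plain `re` on a list rather than `pandas` so that it runs without the salaries data.

```python
import re

# Hypothetical job titles, not taken from the dataset.
titles = ['Assistant to the Chief', 'Motor Sweeper Operator', 'Storekeeper']

# Substring search: any title with the letters "to" next to each other.
substring_hits = [t for t in titles if 'to' in t.lower()]

# Word-boundary search: only titles where "to" is a standalone word.
word_hits = [t for t in titles if re.search(r'\bto\b', t.lower())]

print(substring_hits)
print(word_hits)
```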

We can look for other filler words too, like `'the'` and `'for'`.

In [None]:
jobtitles[jobtitles.str.lower().str.contains(r'\bthe\b', regex=True)]

In [None]:
jobtitles[jobtitles.str.lower().str.contains(r'\bfor\b', regex=True)]

We should probably remove these "glue" words.

### Fixing punctuation and removing "glue" words

To canonicalize job titles, we'll start by:
- converting to lowercase,
- removing each occurrence of `'to'`, `'the'`, and `'for'`,
- replacing each non-letter/digit/space character with a space, and
- replacing each sequence of multiple spaces with a single space.

In [None]:
jobtitles = (
    jobtitles
    .str.lower()
    .str.replace(r'\b(?:to|the|for)\b', '', regex=True)  # \b on both sides, so words like 'tow' or 'forensic' are left intact
    .str.replace('[^A-Za-z0-9 ]', ' ', regex=True)
    .str.replace(' +', ' ', regex=True)               # ' +' matches 1 or more occurrences of a space
    .str.strip()                                      # Removes leading/trailing spaces if present
)
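The same steps can be tried on one string at a time, which makes it easier to check each step by hand. The function name `canonicalize` and the two example titles are hypothetical. Note that this sketch puts `\b` on both sides of each glue word, so that only standalone occurrences are removed.

```python
import re

def canonicalize(title):
    """Apply the four cleaning steps to a single job title string."""
    title = title.lower()
    title = re.sub(r'\b(?:to|the|for)\b', '', title)   # drop standalone glue words
    title = re.sub(r'[^a-z0-9 ]', ' ', title)          # punctuation becomes a space
    title = re.sub(r' +', ' ', title)                  # collapse runs of spaces
    return title.strip()                               # trim leading/trailing spaces

print(canonicalize('Asst. to the Fire Chief'))
print(canonicalize('Tow Truck Operator (Forensics)'))
```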

In [None]:
jobtitles.sample(10)

### Abbreviations 

Which job titles are inconsistently described? Let's look at three categories – librarians, engineers, and directors.

In [None]:
jobtitles[jobtitles.str.contains('libr')].value_counts()

In [None]:
jobtitles[jobtitles.str.contains('eng')].value_counts()

In [None]:
jobtitles[jobtitles.str.contains('dir')].value_counts()

### The limits of canonicalization

- Our current approach requires a lot of manual labor.
    - There may be more abbreviations in use amongst job titles (like `'asst'` for `'assistant'`), but how do we find them?
- Remember, our goal is to quantify how similar two job titles are.
- **Idea:** Two job titles are similar if they contain similar words (regardless of order).

## Bag of words 👜

### A counts matrix

Let's create a "counts" matrix, such that:
- there is 1 row per job title,
- there is 1 column per **unique** word that is used in job titles, and
- the value in row `title` and column `word` is the number of occurrences of `word` in `title`.

Such a matrix might look like:

| | senior | lecturer | teaching | professor | assistant | associate |
| --- | --- | --- | --- | --- | --- | --- |
| **senior lecturer** | 1 | 1 | 0 | 0 | 0 | 0 |
| **assistant teaching professor** | 0 | 0 | 1 | 1 | 1 | 0 | 
| **associate professor** | 0 | 0 | 0 | 1 | 0 | 1 |
| **senior assistant to the assistant professor** | 1 | 0 | 0 | 1 | 2 | 0 |
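As a minimal sketch, the table above can be reproduced with `collections.Counter`. The vocabulary is fixed to the six column words shown in the table, so `'to'` and `'the'` are simply ignored here. That choice is an assumption made to match the table, not a general rule.

```python
from collections import Counter

titles = [
    'senior lecturer',
    'assistant teaching professor',
    'associate professor',
    'senior assistant to the assistant professor',
]
vocab = ['senior', 'lecturer', 'teaching', 'professor', 'assistant', 'associate']

# One row per title, one entry per vocabulary word.
counts = []
for title in titles:
    word_counts = Counter(title.split())               # word -> number of occurrences
    counts.append([word_counts[word] for word in vocab])

for title, row in zip(titles, counts):
    print(row, title)
```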

### Creating a counts matrix

First, we need to determine all words that are used across all job titles.

In [None]:
jobtitles.str.split()

In [None]:
all_words = jobtitles.str.split().sum()
all_words[:10]

Next, we need to find a list of all **unique** words used in titles. (We can do this with `np.unique`, but `value_counts` shows us the distribution, which is interesting.)

In [None]:
unique_words = pd.Series(all_words).value_counts()
unique_words.head(10)

In [None]:
len(unique_words)

For each of the 435 unique words that are used in job titles, we can count the number of occurrences of the word in each job title.
- `'assistant fire chief'` contains the word `'assistant'` once, the word `'fire'` once, and the word `'chief'` once.
- `'assistant managers assistant'` contains the word `'assistant'` twice and the word `'managers'` once.

In [None]:
# Created using a dictionary to avoid a "DataFrame is highly fragmented" warning.
counts_dict = {}
for word in unique_words.index:
    re_pat = fr'\b{word}\b'
    counts_dict[word] = jobtitles.str.count(re_pat).astype(int).tolist()
    
counts_df = pd.DataFrame(counts_dict)

In [None]:
counts_df.head()

`counts_df` has one row for each of the 12605 job titles (employees), and one column for each unique word that is used in a job title.

In [None]:
counts_df.shape

To put into context what the numbers in `counts_df` mean, we can show the actual job title for each row.

In [None]:
counts_df = pd.concat([jobtitles.to_frame(), counts_df], axis=1).set_index('Job Title')
counts_df.head()

The first row tells us that the first job title contains `'police'` once and `'officer'` once. The fifth row tells us that the fifth job title contains `'fire'` once.

### Interpreting the counts matrix

In [None]:
counts_df.head()

The Series below describes the 20 most common words used in job titles, along with the number of times they appeared in all job titles (including repeats). We will call these words "top 20".

In [None]:
counts_df.iloc[:, :20].sum()

The Series below describes the **number of top 20 words** used in each job title.

In [None]:
counts_df.iloc[:, :20].sum(axis=1)

### Question: What job titles are most similar to `'asst fire chief'`?

- Remember, our idea was to treat two job titles as similar if they contain similar words (regardless of order).
- Now that we have `counts_df`, we have a (row) vector for each job title.
- **How do we measure how similar two vectors are?**

To start, let's compare `'asst fire chief'` to `'fire battalion chief'`.

In [None]:
afc = counts_df.loc['asst fire chief'].iloc[0]
afc

In [None]:
fbc = counts_df.loc['fire battalion chief'].iloc[0]
fbc

We can stack these two vectors horizontally.

In [None]:
pair_counts = (
    pd.concat([afc, fbc], axis=1)
    .sort_values(by=['asst fire chief', 'fire battalion chief'], ascending=False)
    .head(10)
    .T
)

pair_counts

One way to measure how similar the above two vectors are is through their **dot product**.

In [None]:
np.sum(pair_counts.iloc[0] * pair_counts.iloc[1])

Here, since both vectors consist only of 1s and 0s, the dot product is equal to the **number of shared words** between the two job titles.

### Aside: dot product

- Recall, if $\vec{a} = \begin{bmatrix} a_1 & a_2 & ... & a_n \end{bmatrix}^T$ and $\vec{b} = \begin{bmatrix} b_1 & b_2 & ... & b_n \end{bmatrix}^T$ are two vectors, then their **dot product** $\vec{a} \cdot \vec{b}$ is defined as:

$$\vec{a} \cdot \vec{b} = a_1b_1 + a_2b_2 + ... + a_nb_n$$

- The dot product also has a **geometric** interpretation. If $|\vec{a}|$ and $|\vec{b}|$ are the $L_2$ norms (lengths) of $\vec{a}$ and $\vec{b}$, and $\theta$ is the angle between $\vec{a}$ and $\vec{b}$, then:

$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$

- $\cos \theta$ is equal to its maximum value (1) when $\theta = 0$, i.e. when $\vec{a}$ and $\vec{b}$ point in the same direction. 

- 🚨 **Key idea: The more similar two vectors are, the larger their dot product is!**
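Both formulas can be checked numerically. The sketch below uses only the standard library and three made-up count vectors; `np.dot` and `np.linalg.norm` would do the same job on real data.

```python
import math

def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """L2 norm (length) of a vector."""
    return math.sqrt(dot(a, a))

a = [1, 1, 1, 0]   # hypothetical count vector
b = [0, 1, 1, 1]   # shares two nonzero positions with a
c = [1, 1, 1, 0]   # points in the same direction as a

for other in (b, c):
    cos_theta = dot(a, other) / (norm(a) * norm(other))
    print(dot(a, other), round(cos_theta, 3))
```

Here `c` has both the larger dot product with `a` and a cosine of 1, since the two vectors point in the same direction.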

### Computing similarities

To find the job title that is most similar to `'asst fire chief'`, we can compute the dot product of the `'asst fire chief'` word vector with all other titles' word vectors, and find the title with the highest dot product.

In [None]:
counts_df.head()

In [None]:
afc

To do so, we can apply `np.dot` to each row that doesn't correspond to `'asst fire chief'`.

In [None]:
dots = (
    counts_df[counts_df.index != 'asst fire chief']
    .apply(lambda s: np.dot(s, afc), axis=1)
    .sort_values(ascending=False)
)

dots

The unique job titles that are **most similar** to `'asst fire chief'` are given below.

In [None]:
np.unique(dots.index[dots == dots.max()])

Note that they all share two words in common with `'asst fire chief'`.

**Note:** To truly use the dot product as a measure of similarity, we should **normalize** by the lengths of the word vectors. More on this soon.

### Bag of words

- The **bag of words** model represents texts (e.g. job titles, sentences, documents) as **vectors of word counts**.
    - The "counts" matrices we have worked with so far were created using the bag of words model.
    - The bag of words model defines a **vector space** in $\mathbb{R}^{\text{number of unique words}}$.
- It is called "bag of words" because it doesn't consider **order**.

<center><img src='imgs/bag-of-words.jpeg' width=45%></center>

<center><a href="https://42f6861cgkip12ijm63i3orf-wpengine.netdna-ssl.com/wp-content/uploads/2020/12/2020-07-bagofwords.jpg">(source)</a></center>

### Cosine similarity and bag of words

To measure the similarity between two word vectors, we compute their **cosine similarity**, which is their dot product divided by the product of their lengths.

$$\cos \theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| | \vec{b}|}$$

If $\cos \theta$ is large, the two word vectors are similar. **It is important to normalize by the lengths of the vectors**, otherwise texts with more words will have artificially high similarities with other texts.

**Note:** Sometimes, you will see the **cosine distance** being used. It is the complement of cosine similarity:
  
  $$\text{dist}(\vec{a}, \vec{b}) = 1 - \cos \theta$$
  
If $\text{dist}(\vec{a}, \vec{b})$ is small, the two word vectors are similar.
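To illustrate why the normalization matters, the sketch below compares a raw dot product with cosine similarity on made-up texts. The helper names and the example strings are assumptions for illustration only.

```python
import math
from collections import Counter

def raw_dot(text_a, text_b):
    """Unnormalized dot product of two bag-of-words vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    return sum(a[w] * b[w] for w in a)

def cosine_similarity(text_a, text_b):
    """Dot product divided by the product of the vectors' lengths."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return raw_dot(text_a, text_b) / (norm_a * norm_b)

query = 'fire chief'
short_text = 'fire chief'
long_text = 'fire chief fire chief police officer lifeguard'

print(raw_dot(query, short_text), raw_dot(query, long_text))
print(cosine_similarity(query, short_text), cosine_similarity(query, long_text))
print(1 - cosine_similarity(query, long_text))   # cosine distance
```

The raw dot product favors the longer text just because it repeats words, while cosine similarity ranks the identical short text highest.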

### A recipe for computing similarities

Given a set of texts, to find the **most similar** text to one text $T$ in particular:
- Use the bag of words model to create a counts matrix. Specifically:
    - Create an index out of **all** distinct words used across all texts.
    - Create a single vector for each text by counting the number of occurrences of each distinct word.
- Compute the cosine similarity between text $T$ and all other texts.
- The other text with the greatest cosine similarity is the most similar, under the bag of words model.
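The recipe above can be sketched in a few lines of standard-library Python. The function name `most_similar`, the lowercasing step, and the example titles are assumptions made for illustration, and the sketch assumes every text contains at least one word.

```python
import math
from collections import Counter

def most_similar(texts, target_index):
    """Return the index of the text most similar to texts[target_index]."""
    bags = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*bags))                           # all distinct words
    vectors = [[bag[word] for word in vocab] for bag in bags]    # one count vector per text

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

    target = vectors[target_index]
    scores = {
        i: cosine(target, vec)
        for i, vec in enumerate(vectors)
        if i != target_index
    }
    return max(scores, key=scores.get)

titles = ['asst fire chief', 'fire battalion chief', 'police officer', 'asst city attorney']
print(titles[most_similar(titles, 0)])
```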

### Example: Global warming 🌎

Consider the following **sentences**.

In [None]:
sentences = pd.Series([
    'I really want global peace',
    'I must love global warming',
    'We must solve climate change'
])

sentences

Let's represent each sentence using the bag of words model.

In [None]:
unique_words = pd.Series(sentences.str.split().sum()).value_counts()
unique_words

In [None]:
counts_dict = {}
for word in unique_words.index:
    re_pat = fr'\b{word}\b'
    counts_dict[word] = sentences.str.count(re_pat).astype(int).tolist()
    
counts_df = pd.DataFrame(counts_dict).set_index(sentences)

In [None]:
counts_df

Let's now find the cosine similarity between each pair of sentences.

In [None]:
# There is an easier way of doing this in sklearn, as we will see soon
def sim_pair(s1, s2):
    return np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))

In [None]:
sim_pair(counts_df.iloc[0], counts_df.iloc[1])

In [None]:
sim_pair(counts_df.iloc[0], counts_df.iloc[2])

In [None]:
sim_pair(counts_df.iloc[1], counts_df.iloc[2])

**Issue:** Bag of words only encodes the **words** that each sentence uses, not their **meanings**.
- Sentence 0 and sentence 2 have similar meanings, but have no shared words.
- Sentence 0 and sentence 1 have very different meanings, but a relatively high cosine similarity.

### Pitfalls of the bag of words model

Remember, the key assumption underlying the bag of words model is that **two texts are similar if they share many words in common**.

- The bag of words model doesn't consider **order**.
    - The job titles `'asst fire chief'` and `'chief fire asst'` are treated as the same.
- The bag of words model treats all words as being equally important.
    - `'asst'` and `'fire'` have the same importance, even though `'fire'` is probably more important in describing someone's job title.
- The bag of words model doesn't consider the **meaning** of words.
    - `'I love data science'` and `'I hate data science'` share 75% of their words, but have very different meanings.
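Two of these pitfalls can be checked directly. The sketch below uses hypothetical helpers `bag` and `cosine` built on `collections.Counter`.

```python
import math
from collections import Counter

def bag(text):
    """Bag-of-words representation: word -> count."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a)
    return dot / math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))

# Order is ignored: both titles produce the same bag.
print(bag('asst fire chief') == bag('chief fire asst'))

# Meaning is ignored: opposite sentiments still look very similar.
print(cosine(bag('I love data science'), bag('I hate data science')))
```

The second value comes out to 0.75, matching the 75% of words the two sentences share.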

## Summary, next time

### Summary

- `pandas` `.str` methods can use regular expressions; just set `regex=True`.
- Canonicalization can be difficult in practice when working with large datasets.
- The bag of words model allows us to turn texts into numerical vectors of word counts.
    - It treats two texts as similar if they share many words in common.
    - It doesn't consider the order, importance, or meaning of words.
- **Next time:** An improvement to bag of words.