# Exercises

Use your new text-wrangling skills to tackle these tasks:

1. The style guide of a publishing company you're writing for wants you to avoid commas before suffixes in names. But there are several names like `Alvarez, Jr.` & `Williams, Sr.` in your database. Which functions can you use to remove the comma? Would a regular expression function help? How would you capture just the suffixes to place them into a separate column?
2. Using any one of the presidents' speeches, count the number of unique words that are five characters or more. (Hint: You can use `regexp_split_to_table()` in a subquery to create a table of words to count.) Bonus: Remove commas & periods at the end of each word.
3. Rewrite the query below using the `ts_rank_cd()` function instead of `ts_rank()`. According to the PostgreSQL documentation, `ts_rank_cd()` computes cover density, which takes into account how close the lexeme search terms are to each other. Does using the `ts_rank_cd()` function significantly change the results?

```
SELECT president,
       speech_date,
       ts_rank(search_speech_text, to_tsquery('english',
           'war & security & threat & enemy')) AS score
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english',
          'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;
```

---

# 1. 

We could use the `replace()` function. Based on the [PostgreSQL documentation](https://www.postgresql.org/docs/current/functions-string.html#FUNCTIONS-STRING-FORMAT), `replace(string, x, y)` replaces all occurrences of substring *x* with substring *y* in *string*. For example, to remove the commas from the names above, we could do something like:

```
SELECT replace('Alvarez, Jr.', ',', '');

SELECT replace('Williams, Sr.', ',', '');
```

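For a regular expression approach, we could use `regexp_replace()`, which, based on the documentation, works similarly to `replace()` but matches a pattern instead of a fixed substring. By default it replaces only the first match (the `'g'` flag makes it replace every match), which is enough here because each name contains a single comma.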

```
SELECT regexp_replace('Alvarez, Jr.', ',', '');

SELECT regexp_replace('Williams, Sr.', ',', '');
```
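
The difference between the two functions only shows up when a string has more than one match. Here is a minimal sketch, assuming a made-up name with two commas (`'Alvarez, Jr., Esq.'` is not from the database; it is only for illustration):

```
-- replace() substitutes every occurrence of the substring.
SELECT replace('Alvarez, Jr., Esq.', ',', '');
-- Alvarez Jr. Esq.

-- regexp_replace() without flags substitutes only the first match.
SELECT regexp_replace('Alvarez, Jr., Esq.', ',', '');
-- Alvarez Jr., Esq.

-- The 'g' (global) flag makes regexp_replace() substitute every match.
SELECT regexp_replace('Alvarez, Jr., Esq.', ',', '', 'g');
-- Alvarez Jr. Esq.
```

For names with a single comma, like the ones in the exercise, all three calls return the same result.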

To capture the suffixes in a new column, we could use `regexp_match()`.

```
CREATE TABLE names_suffix (name text);

INSERT INTO names_suffix
VALUES ('Alvarez, Jr.'),
       ('Williams, Sr.');

-- regexp_match() returns a text array; [1] is the first capture group.
SELECT name,
       (regexp_match(name, ',\s(\w+\.)'))[1] AS suffix
FROM names_suffix;
```

We know that the suffix always comes after a comma & a space, hence the `,\s`. The parentheses enclose what we're truly after, the suffix: one or more word characters followed by a literal period, so we get `(\w+\.)`. The period is escaped because an unescaped `.` matches any character. `regexp_match()` returns an array, so we take its first element with `[1]` & name the resulting column with the `AS` keyword.

<img src = "Exercise Images/Name Suffixes.png" width = "600" style = "margin:auto"/>
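
The query above only displays the suffix. To actually store it in a separate column, one option is to add a column & fill it with an `UPDATE`. This is a sketch, assuming the `names_suffix` table created above; the column name `suffix` & the alias `name_no_comma` are illustrative choices, & the period in the pattern is escaped so that it matches only a literal period:

```
-- Add a column to hold the suffix (column name is an assumption).
ALTER TABLE names_suffix ADD COLUMN suffix text;

-- Fill it with the first capture group: the word characters
-- & literal period that follow the comma & space.
UPDATE names_suffix
SET suffix = (regexp_match(name, ',\s(\w+\.)'))[1];

-- Show the stored suffix next to the name with its comma removed.
SELECT replace(name, ',', '') AS name_no_comma,
       suffix
FROM names_suffix;
```

Rows whose name has no matching suffix would get a `NULL` in the new column, because `regexp_match()` returns `NULL` when the pattern does not match.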

# 2.

```
-- '[,.]+$' matches commas & periods only at the end of a word.
SELECT regexp_replace(words, '[,.]+$', '') AS words,
       count(*)
FROM (
    SELECT regexp_split_to_table(speech_text, '\s+') AS words
    FROM president_speeches
    WHERE president = 'Joseph R. Biden'
) AS split_words
-- Keep words of five characters or more after cleaning.
WHERE char_length(regexp_replace(words, '[,.]+$', '')) >= 5
GROUP BY regexp_replace(words, '[,.]+$', '')
ORDER BY count(*) DESC;
```

The repeated `regexp_replace()` calls make this pretty messy & hard to read, so let's use a CTE instead. We'll clean up the speech text first, then do the grouping & counting.

```
WITH biden_speech (words)
AS (
    -- Split the speech on whitespace, then strip trailing commas & periods.
    SELECT regexp_replace(words, '[,.]+$', '') AS words
    FROM (
        SELECT regexp_split_to_table(speech_text, '\s+') AS words
        FROM president_speeches
        WHERE president = 'Joseph R. Biden'
    ) AS split_words
)
SELECT words,
       count(*)
FROM biden_speech
WHERE char_length(words) >= 5
GROUP BY words
ORDER BY count(*) DESC;
```

<img src = "Exercise Images/Biden Speech Word Count.png" width = "600" style = "margin:auto"/>
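
The queries above list how often each word appears. If the exercise is read as asking for a single number, the count of distinct words, `count(DISTINCT ...)` can give it directly. This is a sketch under a few assumptions: the aliases `word`, `split_words` & `unique_long_words` are illustrative, `'[,.]+$'` strips only trailing commas & periods, & folding case with `lower()` is an extra choice not made in the queries above (drop it to treat `Nation` & `nation` as different words):

```
SELECT count(DISTINCT lower(regexp_replace(word, '[,.]+$', '')))
           AS unique_long_words
FROM (
    -- One row per whitespace-separated word in the speech text.
    SELECT regexp_split_to_table(speech_text, '\s+') AS word
    FROM president_speeches
    WHERE president = 'Joseph R. Biden'
) AS split_words
-- Measure length after removing trailing punctuation.
WHERE char_length(regexp_replace(word, '[,.]+$', '')) >= 5;
```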

# 3. 

Here is our query using `ts_rank()`:

```
SELECT president,
       speech_date,
       ts_rank(search_speech_text, to_tsquery('english',
           'war & security & threat & enemy')) AS score
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english',
          'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;
```

Here is what it returns:

<img src = "Scoring Relevance with ts_rank().png" width = "600" style = "margin:auto"/>

Now let's replace `ts_rank()` with `ts_rank_cd()`:

```
SELECT president,
       speech_date,
       ts_rank_cd(search_speech_text, to_tsquery(
           'english', 'war & security & threat & enemy'))
           AS score
FROM president_speeches
WHERE search_speech_text @@ to_tsquery('english',
          'war & security & threat & enemy')
ORDER BY score DESC
LIMIT 5;
```

<img src = "Exercise Images/Scoring Relevance with ts_rank_cd().png" width = "600" style = "margin:auto"/>

Using `ts_rank_cd()` instead of `ts_rank()` changes the scores, & with them the ordering, of the top five speeches. With `ts_rank_cd()`, George W. Bush's speeches shoot up the top five, which makes sense given the events after September 11, 2001. With `ts_rank()`, George W. Bush's speeches still place in the top five, but William J. Clinton's speech takes the top spot.
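
To compare the two rankings without flipping between result sets, both scores can be computed in one query. This is a sketch, not part of the original exercise: the aliases `query`, `rank_score` & `rank_cd_score` are illustrative, & placing `to_tsquery()` in the `FROM` clause is just a way to avoid repeating the search terms.

```
SELECT president,
       speech_date,
       ts_rank(search_speech_text, query) AS rank_score,
       ts_rank_cd(search_speech_text, query) AS rank_cd_score
FROM president_speeches,
     -- Build the tsquery once & reuse it under the alias "query".
     to_tsquery('english', 'war & security & threat & enemy') AS query
WHERE search_speech_text @@ query
ORDER BY rank_cd_score DESC
LIMIT 5;
```

Changing the `ORDER BY` column to `rank_score` shows the `ts_rank()` ordering with the cover density scores alongside.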

Let's take a look at Clinton's [speech](https://clintonwhitehouse3.archives.gov/WH/SOU97/).

```
SELECT speech_text
FROM president_speeches
WHERE president = 'William J. Clinton'
      AND speech_date = '1997-02-04';
```

In this speech, Clinton talks about the end of the USA-USSR Cold War, the desire for America to stay at the forefront of international influence, agreements with foreign countries to reduce the number of nuclear weapons, the war on gangs, wars around the world, weapon modernisation, & what seems like a desire to mediate or temper a new, incoming Cold War. That many references to war & security may explain why the speech scores so highly. Fascinating.