<title>SQL for AI Projects</title>

# Introduction

**Natural Language Processing Challenge**

In this Jupyter notebook - we'll quickly set up the DuckDB database, get you familiar with this Google Colab environment and then dive into the NLP challenge exercises for the SQL for AI Projects course!

## Challenge Exercises

1. Clean webpage text data
2. Investigate customer review text
3. Implement A/B test framework

## Database Setup

First things first, let's load up our Python libraries and set up access to our database.

Don't worry if you're not familiar with Python - we'll just need to run the very first cell to initialize our SQL instance and there will be clear instructions whenever there are any non-SQL components.

## Getting Started

To execute each cell in this notebook - you can click on the play button on the left of each cell or you could simply hit the `Run all` button on the top of the notebook just below the menu toolbar.

This cell below will help us download and connect to a DuckDB database object within this notebook's temporary environment.

The same cell also produces a few outputs, including the following:

* A count of the rows in each table in the database.
* An interactive entity relationship diagram for our database. This will help us visualize all of the database tables and their relevant primary and foreign keys.

In [None]:
# Initial setup steps
# ====================

# These pip install commands are required for Google Colab notebook environment
!pip install --upgrade --quiet duckdb==1.3.1
!pip install --quiet duckdb-engine==0.17.0
!pip install --quiet jupysql==0.11.1

# We also need to set up Git LFS for large file downloads
# This helps us to download large files stored on GitHub
!apt-get install git-lfs -y
!git lfs install

# Clone GitHub repo into a "data" folder
!git clone https://github.com/LinkedInLearning/real-world-data-and-AI-challenges-with-SQL-3813163.git data

# Need to change directory into "data" to download the database object
%cd data
!git lfs pull

# Then we need to change directory back up so all our paths are correct!
%cd ..

# Time to import all our Python packages
import duckdb
import textwrap
import pandas as pd
from pathlib import Path
from IPython.display import HTML, display

# Load the jupysql extension to enable us to run SQL code in code cells
%load_ext sql

# We can now set some basic Pandas settings for rendering SQL outputs
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# This is a convenience function to print long strings into multiple lines
# You'll see this in action later on in our tutorial!
def wrap_print(text):
    print(textwrap.fill(text, width=80))

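# This is some boilerplate code to help us format printed output with wrapping
# We wrap it in display() so the style is rendered even though it's not the last line of the cell
display(HTML("""
<style>
.output pre {
    white-space: pre-wrap;
    word-break: break-word;
}
</style>
"""))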

# Connecting to DuckDB
# ====================

# Setup the SQL connection
connection = duckdb.connect("data/data.db")
%sql connection

# Run a few test queries using both connections
tables = connection.execute("SHOW TABLES").fetchall()
table_names = [table[0] for table in tables]

preview_counts_list = []
for table_name in table_names:
    try:
        preview_counts_list.append(
            connection.execute(f"""
                SELECT '{table_name}' AS table_name,
                    COUNT(*) AS record_count
                FROM {table_name}""").fetchdf()
        )
    except Exception as e:
        print(f"❌ Could not preview table {table_name}: {e}")
        

print("✅ Database is now ready!")

print("\n📋 Show count of rows from each table in the database:")

# Combine all dataframes in preview_counts_list
preview_counts_df = pd.concat(preview_counts_list, ignore_index=True)

display(preview_counts_df)

display(HTML('''
<iframe width="100%" height="600" src='https://dbdiagram.io/e/685279b3f039ec6d36c0c7e9/68527d19f039ec6d36c1813e'> </iframe>
'''
))

In [None]:
# This is commented out - but it can be used to run this notebook locally!
# In fact - I used this while developing this notebook :)

# import duckdb
# import textwrap
# import pandas as pd
# from pathlib import Path
# from IPython.display import HTML, display
# 
# connection = duckdb.connect("data.db")
# %sql connection

# How to Run SQL Queries

Let's quickly see how we can run SQL code in our Jupyter notebook.

In our Colab environment we can run single or multi-line queries. We can also easily save the output of SQL queries as a local Pandas DataFrame object and even run subsequent SQL queries which can interact with these same DataFrame objects.

## Single Line SQL Query

We can use our notebook magic `%sql` at the start of a notebook cell to run a single line of SQL to query our database.

Let's take a look at the first 5 rows from the `locations` table:

In [None]:
%sql SELECT * FROM locations LIMIT 5;

## Multi-Line SQL Query

We can also run multi-line SQL queries by using a different notebook magic `%%sql` where we now have 2 percentage signs.

We'll apply a filter on our `locations` dataset and return 2 columns.

In [None]:
%%sql
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

## Saving SQL Outputs

By using the `<<` operator, we can assign the result of a SQL query (returned as a Pandas DataFrame) to a named Python variable in the notebook’s scope.

### Single Line Assignment

We can specify the name of the output variable directly after the `%sql` or `%%sql` magic command.

In [None]:
%sql single_magic_df << SELECT * FROM locations LIMIT 5;

We can now reference the Python variable directly as a Pandas DataFrame.

In [None]:
# Python notebook scope
single_magic_df

We can also use this same variable as a table reference within a DuckDB `SELECT` statement.

In [None]:
%sql SELECT * FROM single_magic_df;

### Multi-line Assignment

This assignment using `<<` also works with the `%%sql` (multi-line) magic command.

In [None]:
%%sql multi_magic_df <<
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

In [None]:
# display the dataframe
multi_magic_df

When referencing the Python variable within DuckDB, we can also use it inside a multi-line SQL query using the `%%sql` magic command.

In [None]:
%%sql
SELECT *
FROM multi_magic_df;

# 1. Clean Text Data


In exercise 1 - we'll clean and prepare the `html_data` column from the `locations` table so it's ready for NLP.

Here is an overview of what we will cover in this tutorial:

* Deep dive into using `REGEXP_REPLACE`
* Remove HTML tags
* Clean up newline, whitespace and `&` characters
* Apply advanced find-and-replace using `REGEXP_REPLACE`
* Maintain original document structure

## 1.1 Inspect Raw Data

### 1.1.1 Viewing Raw HTML

Let’s start by looking at a single row — specifically for Yosemite National Park — to see what kind of cleaning is needed.

We'll use the `.loc` accessor in Pandas to inspect the raw HTML. The expression below reads as: "Get the value from the first row of the DataFrame, specifically from the `html_data` column."

```python
yosemite_html_example_df.loc[0, "html_data"]
```

In [None]:
%sql yosemite_html_example_df << SELECT html_data FROM locations WHERE location_id = 1;

# We'll save this variable for use in a later cell
yosemite_raw_html_string = yosemite_html_example_df.loc[0, "html_data"]

print(yosemite_raw_html_string)

### 1.1.2 Inspect Rendered Data

As we can see - there is a lot of cleaning that needs to be done with this!

Let's take a look at how we can print out our HTML and see how it would render on an actual webpage.

In [None]:
display(HTML(yosemite_raw_html_string))

## 1.2 Removing HTML Tags

After inspecting the HTML code and our rendered data above - a simple solution comes to mind: we can remove all the tags and keep only the main text contents that we would see when visiting the actual web page generated by the HTML code.

We can employ some regular expressions - also known as **regex** - and use the `REGEXP_REPLACE` SQL function to help us get this done.

### 1.2.1 Introduction to `REGEXP_REPLACE`

We'll get very familiar with the `REGEXP_REPLACE` function as we'll be using it throughout this tutorial.

Below is the example query we will use to extract the text data for Yosemite - we'll also store the result as a Python variable `yosemite_removed_tags_df` so we can access it again later:

```sql
SELECT
  REGEXP_REPLACE(html_data, '<[^>]+>', '', 'g') AS text_data
FROM locations
WHERE location_id = 1;
```

Below is a simple breakdown of the query components:

| Component        | Purpose                                                    |
| ---------------- | ---------------------------------------------------------- |
| `REGEXP_REPLACE` | Performs regex-based text replacement                      |
| `html_data`      | The column of string data that we want to adjust           |
| `'<[^>]+>'`      | Matches any HTML tag like `<p>`, `<h2>`, `<ul>` etc        |
| `[^>]+`          | Negated character class: one or more characters except `>` |
| `''`             | Replaces matched text with nothing (i.e. deletes it)       |
| `'g'`            | Global flag: apply to all matches, not just the first      |
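
To get a feel for what the `'<[^>]+>'` pattern does before running it against the database, here is a minimal sketch using Python's built-in `re` module instead of DuckDB. The `sample_html` string is made up for illustration, and Python's regex engine is not the same one DuckDB uses, although this simple pattern should behave the same way in both.

```python
import re

# Hypothetical HTML snippet, made up for illustration (not taken from the database)
sample_html = "<h1>Yosemite</h1>\n<p>Granite cliffs &amp; waterfalls</p>"

# Same pattern as the SQL query: '<', one or more characters that are not '>', then '>'
TAG_PATTERN = r"<[^>]+>"

# re.sub replaces every match by default, similar to passing the 'g' flag in DuckDB
text_only = re.sub(TAG_PATTERN, "", sample_html)

# repr() makes the leftover newline visible
print(repr(text_only))
```

Notice that the newline and the `&amp;` are still there after the tags are gone.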

In [None]:
%%sql yosemite_removed_tags_df <<
SELECT
  REGEXP_REPLACE(html_data, '<[^>]+>', '', 'g') AS text_data
FROM locations
WHERE location_id = 1;

In [None]:
print(yosemite_removed_tags_df)

## 1.3 Removing Newline Characters

It looks like a few newline `\n` characters now appear in our transformed `html_data` string.

We can use our `REGEXP_REPLACE` function again to remove those newlines from our already transformed `text_data` column.

This time - notice how I'm using `yosemite_removed_tags_df` as the target for my `SELECT` statement in my SQL query below.

In [None]:
%%sql yosemite_removed_tags_trimmed_df <<
SELECT
    REGEXP_REPLACE(text_data, '[\n]', '', 'g') AS text_data
FROM yosemite_removed_tags_df;

In [None]:
yosemite_removed_tags_trimmed_df

### 1.3.1 Pretty Printing Long Strings

We can't really see the entire string when we just display it like we have above - so I've implemented a neat printing function which we can use to see our string in a slightly nicer format.

The only catch is that we'll need to use our `.loc` notation to extract the text data from our Pandas DataFrame Python variable!
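
For reference, here is a small standalone sketch of how this kind of wrapping works with Python's built-in `textwrap` module. It mirrors the `wrap_print` helper from the setup cell, but the optional `width` argument and the `long_text` sample are my own additions for illustration.

```python
import textwrap

def wrap_print(text, width=80):
    """Print a long string wrapped to at most `width` characters per line."""
    print(textwrap.fill(text, width=width))

# Hypothetical long string, made up for illustration
long_text = "Yosemite National Park " * 10

# textwrap.fill returns a single string with newlines inserted between words
wrapped = textwrap.fill(long_text, width=80)

wrap_print(long_text)
```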

In [None]:
wrap_print(yosemite_removed_tags_trimmed_df.loc[0, "text_data"])

### 1.3.2 Fixing Our Mistakes

Wait a minute...it looks like we have a few more issues!

Our `REGEXP_REPLACE` removed the newline characters - but now a few of our words are squished together in the text data, because nothing is left to separate them.

We can fix this by using our trusty `REGEXP_REPLACE` again - but this time we replace each newline with a single whitespace instead of the empty string.
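
Here is a quick sketch of the difference between the two replacements, using Python's `re` module on a made-up string rather than the real Yosemite text.

```python
import re

# Hypothetical text where newlines are the only thing separating some words
text_with_newlines = "Yosemite National Park\nSummary\nGranite cliffs"

# Replacing newlines with the empty string glues neighbouring words together
squished = re.sub(r"[\n]", "", text_with_newlines)

# Replacing newlines with a single space keeps the words apart
spaced = re.sub(r"[\n]", " ", text_with_newlines)

print(squished)
print(spaced)
```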

Let's apply our changes on the same `yosemite_removed_tags_df` Pandas DataFrame we used for our previous SQL query - but we will assign our output to a new variable called `yosemite_removed_tags_and_newlines_df`.

**Note** - yes, I know the long variable names seem like a pain...but we have a popular saying, "code is usually read many more times than it's written", so you can think of this as the equivalent of "a stitch in time saves nine"!

In [None]:
%%sql yosemite_removed_tags_and_newlines_df <<
SELECT
    -- This time swap out the '' replacement for ' '
    REGEXP_REPLACE(text_data, '[\n]', ' ', 'g') AS text_data
FROM yosemite_removed_tags_df;

In [None]:
wrap_print(yosemite_removed_tags_and_newlines_df.loc[0, "text_data"])

## 1.4 Further Text Cleaning

Now we've got more cleaning to do - we'll want to get rid of those pesky little `&amp;` HTML entities and replace them with a single `&` character.

We've also got a few too many whitespace characters here in our text.

Let's apply our changes one at a time for now - but at some point we will need to think about how we can combine all of these changes in one go from our source `locations` table.

Let's perform the following transformations:
1. Replace `&amp;` with `&`
2. Replace one or more whitespace characters with a single whitespace

### 1.4.1 `REGEXP_REPLACE` With Special Characters

Sometimes when using `REGEXP_REPLACE` we need to be careful with special characters when we are searching for a specific pattern. Try playing around with the `'&amp;'` pattern below and you'll begin to see what I mean! If you want to use a pattern like this with the character class square brackets - we'll need to use the backslash character `\` to escape any special characters.

Note that these days - it's easy enough to ask an AI to assist with your regular expressions - but back in the old days we always needed to look these up on Google or use specific regex checking tools like ["I Hate Regex"](https://ihateregex.io/).
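
As one example of what can go wrong when you play around with the pattern, here is a sketch in Python's `re` module comparing the literal pattern `&amp;` with the same characters wrapped in square brackets. The sample string is made up, and this uses Python's regex engine rather than DuckDB's.

```python
import re

# Hypothetical sample text containing one HTML-escaped ampersand
text = "Parks &amp; Recreation map"

# As a plain sequence, the pattern only matches the exact 5 characters '&amp;'
literal = re.sub(r"&amp;", "&", text)

# Inside square brackets it becomes a character class that matches ANY single
# one of the characters '&', 'a', 'm', 'p' or ';', which is not what we want
char_class = re.sub(r"[&amp;]", "&", text)

print(literal)
print(char_class)
```

The second result mangles ordinary words as well, which is why we keep `'&amp;'` outside of a character class in our query.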

In [None]:
%%sql yosemite_removed_tags_newlines_ampersand_df <<
SELECT
    REGEXP_REPLACE(text_data, '&amp;', '&', 'g') AS text_data
FROM yosemite_removed_tags_and_newlines_df;

In [None]:
wrap_print(yosemite_removed_tags_newlines_ampersand_df.loc[0, "text_data"])

### 1.4.2 Regular Expression Character Class

We can also use a character class with a quantifier, `[ ]+`, to let `REGEXP_REPLACE` find 1 or more whitespace characters in a row and replace them with a single whitespace.
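
Here is the same idea as a small Python sketch on a made-up string. The `\s+` variant at the end is a common alternative in Python's `re` module that also covers tabs and newlines - it is shown for comparison only and is not part of the SQL in this notebook.

```python
import re

# Hypothetical text with runs of repeated spaces
text = "Granite   cliffs  and     waterfalls"

# [ ]+ matches one or more literal spaces in a row
collapsed = re.sub(r"[ ]+", " ", text)

# \s+ matches runs of any whitespace (spaces, tabs, newlines) in Python's re module
collapsed_any = re.sub(r"\s+", " ", "Granite \n\t cliffs")

print(collapsed)
print(collapsed_any)
```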

In [None]:
%%sql yosemite_removed_tags_newlines_ampersand_spaces_df <<
SELECT
    -- We can use [ ]+ to represent 1 or more spaces
    REGEXP_REPLACE(text_data, '[ ]+', ' ', 'g') AS text_data
FROM yosemite_removed_tags_newlines_ampersand_df;

In [None]:
wrap_print(yosemite_removed_tags_newlines_ampersand_spaces_df.loc[0, "text_data"])

## 1.5 Combining Transformations

So let's say we want to apply all of the changes that we've identified so far in one shot from the raw `locations` table within our database.

We have the following transformations to apply:

1. Remove HTML tags
2. Replace multiple newline characters with a single space
3. Replace funny ampersand `&amp;` characters

### 1.5.1 Nested `REGEXP_REPLACE`

We can again complete this task using our trusty `REGEXP_REPLACE` function!

The only catch here is that we'll need to use our function in a "nested" form to apply these changes one after another - and we'll need to think about the order of how we apply our changes.

One approach I use to better understand the "nesting" behaviour is to always work from **inside-out** - the SQL engine will start from the most nested transformation first before applying the outer nested function.
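
To make the inside-out evaluation order concrete, here is a sketch in Python where the same three steps are written once as nested calls and once as a plain loop. The `raw` string and the `steps` list are made up for illustration, and `re.sub` stands in for `REGEXP_REPLACE`.

```python
import re

# Hypothetical raw HTML snippet
raw = "<p>Cliffs &amp;\n\twaterfalls</p>"

# Nested form: the innermost call runs first, the outermost call runs last
nested = re.sub(
    r"&amp;", "&",                      # 3. outermost, runs last
    re.sub(
        r"[\n\r\t ]+", " ",             # 2. runs second
        re.sub(r"<[^>]+>", "", raw),    # 1. innermost, runs first
    ),
)

# Equivalent step-by-step form: the list order is the execution order
steps = [
    (r"<[^>]+>", ""),       # 1. remove tags
    (r"[\n\r\t ]+", " "),   # 2. collapse whitespace
    (r"&amp;", "&"),        # 3. fix the escaped ampersand
]
stepwise = raw
for pattern, replacement in steps:
    stepwise = re.sub(pattern, replacement, stepwise)

print(nested)
```

Both forms give the same result - the loop just reads top to bottom while the nested version reads from the inside out.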

Let's give this a shot below and store our results into a variable called `yosemite_one_shot_df`

In [None]:
%%sql yosemite_one_shot_df <<
SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(
        -- 1. Innermost function for tag cleanup
        REGEXP_REPLACE(html_data, '<[^>]+>', '', 'g'),
        -- 2. Now we can clean up newline and all whitespace in one shot
        -- We also remove any \r returns and \t tab characters
        '[\n\r\t\ ]+', ' ', 'g'
    ),
    -- 3. Now we can apply our & update
    '&amp;', '&', 'g'
  )
   AS text_data
FROM locations
WHERE location_id = 1;

In [None]:
wrap_print(yosemite_one_shot_df.loc[0, "text_data"])

## 1.6 Testing Another Example

We've been performing all our transformations so far on the Yosemite location data - let's take a look at another specific example to challenge our SQL skills and clean our data further!

`location_id = 46` contains the details for LACMA (Los Angeles County Museum of Art).

This will be a good example for us to implement even further data cleansing steps.

First, let's print out our example record to see what we're playing with!

In [None]:
%sql museum_html_example_df << SELECT html_data FROM locations WHERE location_id = 46;

# We'll save this variable for use in a later cell
museum_raw_html_string = museum_html_example_df.loc[0, "html_data"]

print(museum_raw_html_string)

In [None]:
display(HTML(museum_raw_html_string))

### 1.6.1 Removing Brackets

One of the first things I've noticed here is that we'll likely end up with the round brackets or parentheses around the `(Los Angeles County Museum of Art)`.

We can aim to remove these brackets and also apply the exact same transformations we've seen with our Yosemite example.

Let's try this first to see if we need to apply further transforms.

We can use our trusty `REGEXP_REPLACE` to remove both the left and right parenthesis characters - however we'll need to be careful with how we apply the backslash `\` to escape these special regular expression characters in our function call.
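
Here is a small sketch of the escaping in Python's `re` module, using a made-up string. It also shows why the backslash matters: outside of a character class, a bare `(` starts a group, so Python rejects a pattern that contains only an unescaped opening parenthesis.

```python
import re

# Hypothetical text with parentheses, a newline and a tab
text = "LACMA (Los Angeles County Museum of Art)\n\tSummary"

# Escaped parentheses sit alongside the whitespace characters in one character class
cleaned = re.sub(r"[\n\r\t\(\) ]+", " ", text)

# An unescaped '(' on its own is an incomplete group, so compiling it fails
try:
    re.compile(r"(")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

print(cleaned)
print(unescaped_ok)
```

In Python the backslashes are optional inside the square brackets, but keeping them makes the intent obvious to the next reader.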

In [None]:
%%sql museum_transformed_df <<
SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(
        -- 1. Innermost function for tag cleanup
        REGEXP_REPLACE(html_data, '<[^>]+>', '', 'g'),
        -- 2. Now we can clean up newline and all whitespace in one shot
        -- Here we can also add in our \( and \) escaped parentheses characters
        '[\n\r\t\(\) ]+', ' ', 'g'
    ),
    -- 3. Now we can apply our & update
    '&amp;', '&', 'g'
  )
   AS text_data
FROM locations
WHERE location_id = 46;

In [None]:
wrap_print(museum_transformed_df.loc[0, "text_data"])

### 1.6.2 Removing Specific Tags

This is ALMOST there - but we have one more complication that we'll need to fix up!

The very first line seems to have a repeat in the location name - we can see `LACMA Los Angeles County Museum of Art LACMA Los Angeles County Museum of Art` in our first line of the previous output.

We'll need to inspect our raw HTML to see where this comes from - if we inspect the first few lines of the raw HTML that we printed previously - we can see both a `title` tag and a level 1 heading `h1` tag that repeat the location name.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>LACMA (Los Angeles County Museum of Art)</title>
</head>
<body>
<h1>LACMA (Los Angeles County Museum of Art)</h1>
...
```

### 1.6.3 Common Ordering Mistakes

We can again use `REGEXP_REPLACE` in a nested fashion to remove the `title` element and get rid of the duplicate in our transformed text output - however we'll need to be careful with the order in which we apply the `REGEXP_REPLACE` transformations.

If we were to apply this title tag removal **after** we've already removed **all** of our tags - then nothing would happen!

Let's demonstrate this in action before we see how we should fix it - imagine we wanted to apply `<title>` tag removal after all of our original transformations using this regular expression: `(?si)<title>.*?</title>`

The breakdown of what's happening in this regular expression is below:

* `(?si)` - two inline flags
    + `s`: single-line mode, so the dot `.` matches everything, including newlines
    + `i`: case-insensitive, so it will match `<title>`, `<Title>`, `<TITLE>`, etc.
* `<title>` - literally matches the opening tag that we're after
* `.*?` - a non-greedy match of any characters from just after `<title>` to the earliest possible `</title>`
* `</title>` - literally matches the closing `</title>` tag
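
The pieces above can be tried out in isolation with Python's `re` module, which accepts the same `(?si)` inline flags at the start of a pattern. The HTML strings here are made up for illustration, and DuckDB's regex engine may differ from Python's in other details.

```python
import re

# Hypothetical document: upper-case tag name and a newline inside the title
html_doc = "<head>\n<TITLE>LACMA\n(Museum)</TITLE>\n</head>\n<body><h1>LACMA</h1></body>"

# With (?si): '.' also matches newlines and the tag name matches in any case
removed = re.sub(r"(?si)<title>.*?</title>", "", html_doc)

# Without the flags nothing matches, so the document comes back unchanged
no_flags = re.sub(r"<title>.*?</title>", "", html_doc)

# Greedy vs non-greedy on a string with two title elements
two_titles = "<title>A</title><p>keep</p><title>B</title>"
lazy = re.sub(r"(?si)<title>.*?</title>", "", two_titles)
greedy = re.sub(r"(?si)<title>.*</title>", "", two_titles)

print(repr(removed))
print(repr(lazy), repr(greedy))
```

The greedy version swallows everything between the first `<title>` and the last `</title>`, including the paragraph we wanted to keep.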


In [None]:
%%sql museum_transformed_removed_title_df <<
SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        -- 1. Innermost function for tag cleanup
        REGEXP_REPLACE(html_data, '<[^>]+>', '', 'g'),
        -- 2. Now we can clean up newline and all whitespace in one shot
        -- Here we can also add in our \( and \) escaped parentheses characters
        '[\n\r\t\(\) ]+', ' ', 'g'
      ),
      -- 3. Now we can apply our & update
      '&amp;', '&', 'g'
    ),
    -- 4. Let's say we put in our <title> removal here...
    '(?si)<title>.*?</title>', '', 'g'
  ) AS text_data
FROM locations
WHERE location_id = 46;

In [None]:
wrap_print(museum_transformed_removed_title_df.loc[0, "text_data"])

### 1.6.4 Fixing Up The Order

We can still see the repetition right at the beginning of our text!

This is because when we apply the `REGEXP_REPLACE` to remove our HTML tags - we inadvertently also remove the `<title>` tags we are looking for, so our follow-up `REGEXP_REPLACE` call has nothing left to match.
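
Here is a minimal sketch of that ordering effect in Python, on a made-up two-line document. The constant names `TITLE` and `TAGS` are my own, and `re.sub` again stands in for `REGEXP_REPLACE`.

```python
import re

# Hypothetical document where the title and the h1 repeat the same name
html_doc = "<title>LACMA</title>\n<h1>LACMA</h1>"

TITLE = r"(?si)<title>.*?</title>"  # removes the title tags AND their contents
TAGS = r"<[^>]+>"                   # removes only the tags themselves

# Wrong order: the tags are already gone, so the title pattern finds nothing
tags_first = re.sub(TITLE, "", re.sub(TAGS, "", html_doc))

# Right order: remove the whole title element first, then the remaining tags
title_first = re.sub(TAGS, "", re.sub(TITLE, "", html_doc))

print(repr(tags_first))
print(repr(title_first))
```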

Let's try this again - but we'll adjust the order of our replacements a little and store our outputs in `museum_transformed_removed_title_adjusted_order_df` (I know...the names are getting a bit long right?!)


In [None]:
%%sql museum_transformed_removed_title_adjusted_order_df <<
SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        -- 1. Let's say we put in our <title> removal here this time!
        REGEXP_REPLACE(html_data, '(?si)<title>.*?</title>', '', 'g'),
        -- 2. Now we can remove all of our tags
        '<[^>]+>', '', 'g'
      ),
      -- 3. Now we can clean up newline and all whitespace in one shot
      -- Here we can also add in our \( and \) escaped parentheses characters
      '[\n\r\t\(\) ]+', ' ', 'g'
    ),
    -- 4. Finally we can replace the escaped ampersand and we're done!
    '&amp;', '&', 'g'
  ) AS text_data
FROM locations
WHERE location_id = 46;

In [None]:
wrap_print(museum_transformed_removed_title_adjusted_order_df.loc[0, "text_data"])

## 1.7 Retaining Document Structure

Excellent - we've managed to remove the repetition at the beginning of our museum example!

But there's more we can do!

For traditional NLP - this is probably good enough, we have extracted the raw text and cleaned up most of our tags, additional spaces and fixed the ampersand web-escaped characters.

However - for modern LLMs we can take it further and attempt to retain as much of our original document structure as possible. We can do this by further manipulating our raw text into discrete sections.

For this exercise - we will need to really inspect our raw HTML to see how we might apply a good generalized rule and apply it across all our documents.

### 1.7.1 Identifying What to Retain

If we dive into our raw HTML - we'll be able to see how our level 2 headings might be useful to structure our cleaned text output.

These `<h2> ... </h2>` tags can be used with our `REGEXP_REPLACE` to help retain the structure of the document.

```html
<h2>Summary</h2>
<p>The largest art museum in the western United States, with a collection of nearly 150,000 works spanning the history of art from ancient times to the present.</p>
</section>
<section id="best-time">
<h2>Best Time to Visit</h2>
<p>Spring and fall usually offer mild weather and smaller crowds. Always check local conditions, as climate can vary by elevation.</p>
</section>
```

### 1.7.2 Advanced Find and Replace

For our exercise - let's surround the contents of each H2 heading with a pipe character `|` before and after the heading text.

We can accomplish this find and replace task using the same `REGEXP_REPLACE` function, but this time with a slightly different variation using variable substitution, in the form of a capturing group and a backreference!

A simple alternation like `'<h2>|</h2>'` would match any occurrence of `<h2>` or `</h2>` on its own. Here we instead match the whole heading, capture the text between the tags and reference it as `\1` in the replacement:

```sql
REGEXP_REPLACE(html_data, '(?si)<h2>(.*?)</h2>', '| \1 |', 'g')
```

Here is a simple breakdown of this regular expression function:


| Component         | Purpose                                                           |
| ----------------- | ----------------------------------------------------------------- |
| `(?si)`           | Lets `.` match newlines and makes the match case-insensitive      |
| `<h2>`            | Matches the literal opening tag `<h2>`                            |
| `(.*?)`           | Capturing group that contains any characters between the tags     |
| `</h2>`           | Matches the literal closing tag `</h2>`                           |
| `'\| \1 \|'`      | Replacement text: `\1` is a backreference to the first (and only) capturing group, wrapped in pipes |
| `'g'`             | Global flag: apply to all matches, not just the first             |
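
Python's `re.sub` supports the same `\1` backreference in its replacement string, so we can sketch the heading substitution on a made-up snippet before running it in SQL. The `html_doc` string below is a shortened, hypothetical version of the HTML shown earlier.

```python
import re

# Hypothetical snippet with two level 2 headings
html_doc = "<h2>Summary</h2>\n<p>The largest art museum.</p>\n<h2>Best Time to Visit</h2>"

# (.*?) captures the heading text and \1 puts it back between two pipes
marked = re.sub(r"(?si)<h2>(.*?)</h2>", r"| \1 |", html_doc)

print(marked)
```

The heading text survives while its tags are swapped for pipes, and the `<p>` tags are left alone for the later clean up step.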

In [None]:
%%sql museum_further_transformed_df <<
SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        -- 1. remove title tag contents
        REGEXP_REPLACE(html_data, '(?si)<title>.*?</title>', '', 'g'),
        -- 2. surround level 2 contents with | characters
        '(?si)<h2>(.*?)</h2>', '| \1 |', 'g'
      ),
      -- 3. clean up - remove all other tags
      '<[^>]+>', '', 'g'
    ),
    -- 4. further clean up of whitespace and newlines
    '[\n\r\t\(\) ]+', ' ', 'g'
  ) AS text_data
FROM locations
WHERE location_id = 46;

In [None]:
wrap_print(museum_further_transformed_df.loc[0, "text_data"])

### 1.7.3 Remove Arbitrary Text

This is very close to perfect - but I've noticed one more step we can take to further clean up our output!

If we look at the end of our output - we can see that the `text_data` field ends with the following:

```text
| Useful Links | View on Google Maps Wikipedia Article
```

It seems that only the link text is left behind, because our previous regular expression removed all of the HTML tags, including the hyperlinks themselves.

We can apply another `REGEXP_REPLACE` at the end of our series of transformations to remove everything from `Useful Links` to the end of the string.

I've also noticed an additional single whitespace character at the beginning of our text string - so let's also apply a simple `TRIM` function to strip out all leading and trailing whitespace characters as well.
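
Here is a sketch of these last two steps in Python on a made-up string, with `str.strip()` playing the role of `TRIM`. The pipe is escaped because a bare `|` means "or" in a regular expression.

```python
import re

# Hypothetical cleaned text that still ends with the link section
text = " LACMA | Summary | The largest art museum. | Useful Links | View on Google Maps Wikipedia Article"

# \| matches a literal pipe, and .*$ matches everything up to the end of the string
without_links = re.sub(r"\| Useful Links.*$", "", text)

# strip() removes the leading and trailing whitespace
trimmed = without_links.strip()

print(repr(trimmed))
```

This works on a single-line string like ours - by this point the earlier steps have already replaced the newlines with spaces.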

This will be our final transformation - so let's store our outputs as the variable `museum_final_transformed_df`

In [None]:
%%sql museum_final_transformed_df <<
SELECT
  TRIM(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            -- 1. remove title tag contents
            REGEXP_REPLACE(html_data, '(?si)<title>.*?</title>', '', 'g'),
            -- 2. surround level 2 contents with | characters
            '(?si)<h2>(.*?)</h2>', '| \1 |', 'g'
          ),
          -- 3. clean up - remove all other tags
          '<[^>]+>', '', 'g'
        ),
        -- 4. further clean up of whitespace and newlines
        '[\n\r\t\(\) ]+', ' ', 'g'
      ),
      -- 5. remove the final useful links / wiki missing links
      -- We'll need to escape the pipe character as it's special!
      -- The $ denotes the end of the string so we remove everything from | Useful...
      '\| Useful Links.*$', '', 'g'
    )
  ) AS text_data
FROM locations
WHERE location_id = 46;

In [None]:
wrap_print(museum_final_transformed_df.loc[0, "text_data"])

## 1.8 Apply Transformations to Entire Dataset

This is perfect! Let's now remove our `WHERE` filter and apply this to our entire dataset and we're done for exercise 1!

We can store our outputs as `locations_transformed_df` and quickly check that our transformations look alright for the records we were previously checking: `location_id = 1` and `location_id = 46`, our Yosemite and museum examples.
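
Eyeballing two records is a good start, but with a whole table of documents a few automated checks can help too. Below is a rough sketch of a hypothetical helper, `find_cleaning_issues`, that flags common leftovers in a cleaned string - the function name and the list of checks are my own assumptions, not part of the course material.

```python
import re

def find_cleaning_issues(text):
    """Return a list of problems found in a cleaned text string (hypothetical helper)."""
    issues = []
    if re.search(r"<[^>]+>", text):
        issues.append("leftover HTML tag")
    if re.search(r"[\n\r\t]", text):
        issues.append("newline or tab")
    if "  " in text:
        issues.append("repeated spaces")
    if text != text.strip():
        issues.append("leading or trailing whitespace")
    return issues

# Made-up examples: one clean string and one that still needs work
clean_example = "LACMA | Summary | Great museum."
messy_example = " <p>Hi</p>\n"

print(find_cleaning_issues(clean_example))
print(find_cleaning_issues(messy_example))
```

Once the final DataFrame exists, something like `locations_transformed_df["text_data"].map(find_cleaning_issues)` should give one list of issues per row.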

In [None]:
%%sql locations_transformed_df <<
SELECT
  -- we can keep all of our existing locations data here in our final dataset
  locations.*,
  TRIM(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            -- 1. remove title tag contents
            REGEXP_REPLACE(html_data, '(?si)<title>.*?</title>', '', 'g'),
            -- 2. surround level 2 contents with | characters
            '(?si)<h2>(.*?)</h2>', '| \1 |', 'g'
          ),
          -- 3. clean up - remove all other tags
          '<[^>]+>', '', 'g'
        ),
        -- 4. further clean up of whitespace and newlines
        '[\n\r\t\(\) ]+', ' ', 'g'
      ),
      -- 5. remove the final useful links / wiki missing links
      -- We'll need to escape the pipe character as it's special!
      -- The $ denotes the end of the string so we remove everything from | Useful...
      '\| Useful Links.*$', '', 'g'
    )
  ) AS text_data
FROM locations;

In [None]:
# Let's check out our work!
locations_transformed_df.head()

In [None]:
# Check Yosemite which is our first row
# Pandas DataFrames are 0-indexed so the first record is the 0th row
wrap_print(locations_transformed_df.loc[0, "text_data"])

In [None]:
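# Check the museum example by filtering on location_id = 46
# (this doesn't rely on the row order returned by the query)
museum_row = locations_transformed_df[locations_transformed_df["location_id"] == 46]
wrap_print(museum_row["text_data"].iloc[0])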