# SQL for AI Projects

## Introduction

**Natural Language Processing Challenge**

In this Jupyter notebook - we'll quickly set up the DuckDB database, get you familiar with this Google Colab environment, and then dive into the NLP challenge exercises for the SQL for AI Projects course!

### Challenge Exercises

1. Clean webpage text data
2. Investigate customer review text
3. Implement A/B test framework

### Database Setup

First things first, let's load up our Python libraries and set up access to our database.

Don't worry if you're not familiar with Python - we only need to run the very first cell to initialize our SQL instance, and there will be clear instructions whenever there are non-SQL components.


### Getting Started

To execute each cell in this notebook - you can click on the play button on the left of each cell or you could simply hit the `Run all` button on the top of the notebook just below the menu toolbar.

The cell below will help us download and connect to a DuckDB database object within this notebook's temporary environment.

The same cell also produces a few outputs, including the following:

* A count of rows for each table in the database.
* An interactive entity relationship diagram for our database. This will help us visualize all of the database tables and their relevant primary and foreign keys.

In [None]:
# Initial setup steps
# ====================

# These pip install commands are required for Google Colab notebook environment
!pip install --upgrade --quiet duckdb==1.3.1
!pip install --quiet duckdb-engine==0.17.0
!pip install --quiet jupysql==0.11.1

# Also need to set up Git LFS for large file downloads
# This helps us to download large files stored on GitHub
!apt-get install git-lfs -y
!git lfs install

# Clone GitHub repo into a "data" folder
!git clone https://github.com/LinkedInLearning/real-world-data-and-AI-challenges-with-SQL-3813163.git data

# Need to change directory into "data" to download the database object
%cd data
!git lfs pull

# Then we need to change directory back up so all our paths are correct!
%cd ..

# Time to import all our Python packages
import duckdb
import textwrap
import pandas as pd
from IPython.display import HTML, display

# Load the jupysql extension to enable us to run SQL code in code cells
%load_ext sql

# We can now set some basic Pandas settings for rendering SQL outputs
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# This is a convenience function to print long strings into multiple lines
# You'll see this in action later on in our tutorial!
def wrap_print(text):
    print(textwrap.fill(text, width=80))

# This is some boilerplate code to help us format printed output with wrapping
# (display() is needed because a bare HTML(...) only renders as the last line of a cell)
display(HTML("""
<style>
.output pre {
    white-space: pre-wrap;
    word-break: break-word;
}
</style>
"""))

# Connecting to DuckDB
# ====================

# Set up the SQL connection
connection = duckdb.connect("data/data.db")
%sql connection

# Run a few test queries using the DuckDB connection
tables = connection.execute("SHOW TABLES").fetchall()
table_names = [table[0] for table in tables]

preview_counts_list = []
for table_name in table_names:
    try:
        preview_counts_list.append(
            connection.execute(f"""
                SELECT '{table_name}' AS table_name,
                    COUNT(*) AS record_count
                FROM {table_name}""").fetchdf()
        )
    except Exception as e:
        print(f"❌ Could not preview table {table_name}: {e}")
        

print("✅ Database is now ready!")

print("\n📋 Show count of rows from each table in the database:")

# Combine all dataframes in preview_counts_list
preview_counts_df = pd.concat(preview_counts_list, ignore_index=True)

display(preview_counts_df)

display(HTML('''
<iframe width="100%" height="600" src='https://dbdiagram.io/e/685279b3f039ec6d36c0c7e9/68527d19f039ec6d36c1813e'> </iframe>
'''
))

# How to Run SQL Queries

Let's quickly see how we can run SQL code in our Jupyter notebook.

In our Colab environment we can run single or multi-line queries. We can also easily save the output of SQL queries as a local Pandas DataFrame object and even run subsequent SQL queries which can interact with these same DataFrame objects.

## Single Line SQL Query

We can use our notebook magic `%sql` at the start of a notebook cell to run a single line of SQL to query our database.

Let's take a look at the first 5 rows from the `locations` table:

In [None]:
%sql SELECT * FROM locations LIMIT 5;

## Multi-Line SQL Query

We can also run multi-line SQL queries by using a different notebook magic `%%sql` where we now have 2 percentage signs.

We'll apply a filter on our `locations` table and return 2 columns.

In [None]:
%%sql
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

## Saving SQL Outputs

By using the `<<` operator, we can assign the result of a SQL query (returned as a Pandas DataFrame) to a named Python variable in the notebook’s scope.

### Single Line Assignment

We can specify the name of the output variable directly after the `%sql` or `%%sql` magic command.

In [None]:
%sql single_magic_df << SELECT * FROM locations LIMIT 5;

We can now reference the Python variable directly as a Pandas DataFrame.

In [None]:
# Python notebook scope
single_magic_df

We can also use this same variable as a table reference within a DuckDB `SELECT` statement.

In [None]:
%sql SELECT * FROM single_magic_df;

### Multi-line Assignment

This assignment using `<<` also works with the `%%sql` (multi-line) magic command.

In [None]:
%%sql multi_magic_df <<
SELECT
  location_name,
  description
FROM locations
WHERE location_id = 1;

In [None]:
# display the dataframe
multi_magic_df

When referencing the Python variable within DuckDB, we can also use it inside a multi-line SQL query using the `%%sql` magic command.

In [None]:
%%sql
SELECT *
FROM multi_magic_df;

# 1. Clean Text Data

In this first exercise - we’ll clean and prepare the `html_data` column from the locations table so it’s ready for NLP.

Here is an overview of what we will cover!

* Deep dive into using `REGEXP_REPLACE`
* Remove HTML tags
* Clean up newline, whitespace and `&` characters
* Apply advanced find-and-replace using `REGEXP_REPLACE`
* Maintain original document structure

## 1.1 Inspect Raw Data

### 1.1.1 Viewing Raw HTML Data

We'll apply a filter on our locations table to view `location_id = 46` which contains the LACMA Los Angeles County Museum of Art location details.

In [None]:
%sql museum_html_example_df << SELECT html_data FROM locations WHERE location_id = 46;

We’ll use the `.loc` indexer in Pandas to inspect the raw HTML. The expression below reads as: “Get the value from the first row of the DataFrame, specifically from the `html_data` column.”

```python
museum_html_example_df.loc[0, "html_data"]
```

In [None]:
# We'll also save this variable for use in a later cell
museum_raw_html_string = museum_html_example_df.loc[0, "html_data"]

print(museum_raw_html_string)

### 1.1.2 Inspect Rendered Data

As we can see - there is a lot of cleaning that needs to be done with this!

Let's take a look at how we can print out our HTML and see how it would render on an actual webpage.

In [None]:
display(HTML(museum_raw_html_string))

## 1.2 Data Transformations

After inspecting our example HTML code and our rendered data above - we can come up with a list of requirements for our data transformations.

The order of these transformations is important: for example, we won't be able to apply find-and-replace on our HTML tags if we remove them first!

1. Remove duplicated title data
2. Apply transformations to maintain document structure
3. Remove HTML tags
4. Clean up whitespace and newline characters
5. Fix up web ampersand `&` issues
6. Remove redundant information at the end of text

We'll get very familiar with the `REGEXP_REPLACE` function as we'll be using it throughout this tutorial to accomplish these text data transformations.

You can think of it as a flexible find-and-replace tool that we can use in SQL to identify key patterns in our text data which we want to remove or replace with a certain set of characters.

The `REGEXP_REPLACE` function in DuckDB takes 3 required positional parameters and 1 optional one:

1. **source_string** – The text input that we want to search through  
2. **pattern** – The regular expression pattern used to find matches  
3. **replacement** – The string that will replace any text matching the pattern  
4. **options (optional)** – A string of flags that control matching, such as `'g'` (global, replace every match) or `'i'` (case-insensitive)

Note that in DuckDB only the **first** match is replaced by default, so we pass `'g'` as the options parameter to explicitly ask our SQL engine to replace all occurrences. This default differs from one SQL dialect to another, so it's worth being explicit!
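
To see why the `'g'` flag matters, here is a minimal sketch that uses Python's built-in `re` module as a stand-in for DuckDB's regex engine (an assumption made purely for illustration; the two engines differ in details). It contrasts replacing a single match with replacing every match.

```python
import re

text = "a-b-c-d"

# Replace only the first match (what happens in SQL without a global flag)
first_only = re.sub(r"-", "+", text, count=1)

# Replace every match (what the 'g' option asks for)
all_matches = re.sub(r"-", "+", text)

print(first_only)   # a+b-c-d
print(all_matches)  # a+b+c+d
```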

Let's first focus on the regular expression patterns we will use for our data transformation steps.

We will be using 6 regular expressions, one for each step in our text data transformations.

Let's step through these one at a time before implementing the complete SQL query.

With the exception of steps 2 and 5 - all of our matches will be replaced with either an empty string to remove the match or a single whitespace.

| Step | Regex Pattern                 | Purpose                                           |
|------|-------------------------------|---------------------------------------------------|
| 1    | `(?si)<title>.*?</title>`     | Remove `<title>` contents                         |
| 2    | `(?si)<h2>(.*?)</h2>`         | Wrap `<h2>` contents with `|`                     |
| 3    | `<[^>]+>`                     | Remove all remaining HTML tags                    |
| 4    | `[\n\r\t\(\) ]+`              | Normalize whitespace and remove `()`, tabs, etc.  |
| 5    | `&amp;`                       | Replace HTML-encoded `&`                          |
| 6    | `\| Useful Links.*$`          | Remove footer-like links from the end             |

Let's break down each Regex Pattern with a detailed summary below.

### 1.2.1 Remove \<title\> Tags

We have the following regex for this step: `(?si)<title>.*?</title>`

Once we match these rules - we will replace the match with an empty string for removal.
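
The lazy quantifier `.*?` in this pattern is what keeps the match from running past the first closing tag. As a rough illustration, the sketch below uses Python's `re` module (a stand-in for DuckDB's engine, assumed here only for demonstration) on a made-up HTML string to compare the lazy pattern with a greedy `.*` variant.

```python
import re

# Hypothetical input with two <title> elements around some body text
html = "<title>First</title><p>Body</p><title>Second</title>"

# Lazy: each <title>...</title> pair is matched separately
lazy = re.sub(r"(?si)<title>.*?</title>", "", html)

# Greedy: one match stretches from the first <title> to the last </title>
greedy = re.sub(r"(?si)<title>.*</title>", "", html)

print(repr(lazy))    # '<p>Body</p>'
print(repr(greedy))  # ''
```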

| Component             | Description                                               |
|-----------------------|-----------------------------------------------------------|
| `(?si)`               | Enables case-insensitive (`i`) and dot-all (`s`) modes    |
| `<title>`             | Matches the literal opening `<title>` tag                 |
| `.*?`                 | Lazily matches any characters (including newlines)        |
| `</title>`            | Matches the closing `</title>` tag                        |


### 1.2.2 Wrap `<h2>` contents with `|`

We have the following regex for this step: `(?si)<h2>(.*?)</h2>`

The difference with this transformation step is that we have a "capture group" in brackets that we re-use as `\1` in the replacement string (the third positional parameter) of the `REGEXP_REPLACE` function.

This allows us to retain the structure of our original HTML document by surrounding the h2 heading text with pipe characters so our downstream NLP algorithms can capture this context as they parse our text inputs.
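
A small sketch of the capture-group idea, again using Python's `re` module as an assumed stand-in for DuckDB and a made-up HTML snippet: whatever `(.*?)` captures is written back through `\1`, wrapped in pipes.

```python
import re

# Hypothetical HTML fragment
html = "<h2>Opening Hours</h2><p>Open daily.</p>"

# \1 in the replacement refers to the text captured by (.*?)
wrapped = re.sub(r"(?si)<h2>(.*?)</h2>", r"| \1 |", html)

print(wrapped)  # | Opening Hours |<p>Open daily.</p>
```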

| Component             | Description                                               |
|-----------------------|-----------------------------------------------------------|
| `(?si)`               | Case-insensitive, dot-all mode                            |
| `<h2>`                | Matches the opening `<h2>` tag                            |
| `(.*?)`               | Lazily captures content inside the tag                    |
| `</h2>`               | Matches the closing `</h2>` tag                           |
| Replacement: `\| \1 \|` | Replaces match with pipe-wrapped captured content         |

### 1.2.3 Remove Remaining HTML Tags

After we've transformed our title and h2 HTML tags - we can then remove the rest of them from our text inputs using the regex `<[^>]+>` and replacing each match with an empty string `''`.

Note that the square brackets `[ ... ]` denote a "character class". The hat in `^>` inside the square brackets means "any character except `>`", and the following `+` means one or more occurrences of this character class.

| Component     | Description                                                      |
|----------------|-----------------------------------------------------------------|
| `<`           | Matches the opening angle bracket of a tag                       |
| `[^>]+`       | Matches one or more characters that are not `>`                  |
| `>`           | Matches the closing angle bracket of a tag                       |

### 1.2.4 Normalize Whitespace and Extra Characters

Removing the HTML tags usually leaves behind excessive newlines or other unwanted characters in a text string. We will replace these occurrences with a single whitespace to normalize our text outputs.

Here we use the same character class `[ ... ]` but this time it lists several characters. Note that the open and close brackets are escaped with a backslash so it's clear they are literal `(` and `)` characters rather than the "capture group" that we've used previously! Inside a character class most regex engines already treat them literally, but the escape makes the intent explicit.
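
As a quick check of how this character class behaves, the sketch below runs it over a made-up string with Python's `re` module (assumed as a stand-in for DuckDB). It also shows that, in Python at least, the parentheses match literally inside a character class with or without the backslash.

```python
import re

# Hypothetical text with newlines, a tab, parentheses and repeated spaces
messy = "Museum\n\n(LACMA)\t  Los Angeles\r\n"

escaped = re.sub(r"[\n\r\t\(\) ]+", " ", messy)   # pattern as used in this tutorial
unescaped = re.sub(r"[\n\r\t() ]+", " ", messy)   # same class without backslashes

print(repr(escaped))        # 'Museum LACMA Los Angeles '
print(escaped == unescaped)  # True
```

The trailing space that remains is what the final `TRIM` call is for.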

| Component      | Description                                                       |
|----------------|-------------------------------------------------------------------|
| `[\n\r\t\(\) ]`| Character class: matches newlines, carriage returns, tabs, `()`, and spaces |
| `+`            | Matches one or more of the above characters                       |


### 1.2.5 Replace HTML-Encoded Ampersands

This is a straightforward find-and-replace of the HTML-encoded ampersand `&amp;` with a simple `&` character.

| Component | Description                       |
|-----------|-----------------------------------|
| `&amp;`   | Matches the literal string `&amp;` |

### 1.2.6 Remove Trailing Sections

Finally - we can remove the redundant `Useful Links` section, which has no useful content left once our HTML tag step has removed its hyperlinks. The `.*` usage is quite common so you will likely see it in other SQL code where `REGEXP_REPLACE` transformations occur!
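
The sketch below illustrates the effect on a made-up, already-flattened string, using Python's `re` module as an assumed stand-in for DuckDB: everything from the `| Useful Links` marker to the end of the text is dropped.

```python
import re

# Hypothetical text after the earlier cleaning steps
text = "| About | A great museum. | Useful Links Home Tickets Contact"

# Match the literal pipe, the fixed heading text, then everything up to the end
trimmed = re.sub(r"\| Useful Links.*$", "", text)

print(repr(trimmed))  # '| About | A great museum. '
```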

| Component          | Description                                                 |
|--------------------|-------------------------------------------------------------|
| `\|`               | Escaped pipe character (`|`), matched literally             |
| ` Useful Links`    | Fixed string match                                          |
| `.*`               | Matches any characters after “Useful Links”                 |
| `$`                | Anchors match to the **end of the string**                  |


### 1.2.7 SQL Implementation

Now let's perform all of these transformations using a series of "nested" `REGEXP_REPLACE` functions - and we'll wrap the whole nested stack in one more `TRIM` call to remove any leading or trailing whitespace characters.

At first - this query will look quite long and complex - but it's easy to understand if we were to read the transformation logic from "inside-out" where we begin with the inner-most nested `REGEXP_REPLACE` function call and work sequentially outwards.

Also note how each part of the query is indented one level with each function call - this helps us read the code a little bit easier and I've found that it helps me debug any issues as I develop the code from scratch this way!
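
The "inside-out" reading order can also be pictured as a simple sequence of steps. The sketch below is a Python analogue only, not the tutorial's SQL: it applies the same six patterns in order with the `re` module to a made-up HTML sample, where `STEPS`, `clean_html` and `sample` are hypothetical names introduced for illustration.

```python
import re

# (pattern, replacement) pairs in the same order as the six steps above
STEPS = [
    (r"(?si)<title>.*?</title>", ""),    # 1. remove <title> content
    (r"(?si)<h2>(.*?)</h2>", r"| \1 |"),  # 2. wrap <h2> text in pipes
    (r"<[^>]+>", ""),                     # 3. strip remaining tags
    (r"[\n\r\t\(\) ]+", " "),             # 4. normalize whitespace and ()
    (r"&amp;", "&"),                      # 5. decode ampersands
    (r"\| Useful Links.*$", ""),          # 6. drop trailing links section
]

def clean_html(html: str) -> str:
    """Apply each step to the output of the previous one, then trim."""
    text = html
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
    return text.strip()

# Hypothetical HTML sample
sample = (
    "<html><head><title>Art Museum</title></head>\n"
    "<body><h2>About</h2>\n<p>Paintings &amp; sculpture (open daily)</p>\n"
    "<h2>Useful Links</h2><a href='#'>Home</a></body></html>"
)

print(clean_html(sample))  # | About | Paintings & sculpture open daily
```

Each loop iteration corresponds to moving one level outwards in the nested SQL call.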

We can store our outputs as `locations_transformed_df` and quickly check that our transformations look alright for the `location_id = 46` museum record we inspected earlier.

In [None]:
%%sql locations_transformed_df <<
-- ------------------------------------------------------
-- 1. Extract and clean text from raw HTML in locations
-- ------------------------------------------------------
SELECT
  -- Retain all original metadata columns from the locations table
  locations.*,

  -- Apply layered text cleaning using nested REGEXP_REPLACE
  TRIM(
    REGEXP_REPLACE(
      REGEXP_REPLACE(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              -- Step 1: Remove <title> tags and their content
              REGEXP_REPLACE(html_data, '(?si)<title>.*?</title>', '', 'g'),

              -- Step 2: Preserve <h2> tag content and wrap with pipes
              '(?si)<h2>(.*?)</h2>', '| \1 |', 'g'
            ),

            -- Step 3: Strip out all remaining HTML tags
            '<[^>]+>', '', 'g'
          ),

          -- Step 4: Normalize whitespace and remove special characters
          '[\n\r\t\(\) ]+', ' ', 'g'
        ),

        -- Step 5: Replace HTML entities (e.g., &amp;) with literal symbols
        '&amp;', '&', 'g'
      ),

      -- Step 6: Remove trailing sections like "| Useful Links ..." if present
      '\| Useful Links.*$', '', 'g'
    )
  ) AS text_data

FROM locations;


In [None]:
# Let's check out our work!
locations_transformed_df.head()

In [None]:
%sql museum_transformed_df << SELECT text_data FROM locations_transformed_df WHERE location_id = 46;

# We'll use a helper function "wrap_print" that I've implemented at the setup part of our tutorial
wrap_print(museum_transformed_df.loc[0, "text_data"])

# 2. Customer Reviews

In this part of our tutorial we'll implement some basic NLP techniques using only SQL on our customer reviews dataset.

The following NLP techniques will be implemented:

* Remove stopwords
* Term Frequency
* Document Frequency
* TF-IDF

We'll use our TF-IDF outputs to perform some simple search queries and to return most frequent terms in 5 star and 1 star reviews.

## 2.1 Inspect Reviews Data

Let's start by inspecting our `reviews` table to see what data we have available to us.

In [None]:
%sql SELECT * FROM reviews LIMIT 5;

## 2.2 Data Transformations

To prepare our reviews data for further NLP tasks and machine learning - we will implement the following transformations:

1. Stopword removal
2. Tokenization into uni-gram terms
3. Term frequency
4. Document frequency
5. TF-IDF by combining term and document frequency

We'll also apply a filter for just a single tour product `Coastal & Canyon Explorer` so we can more easily see the differences in 1 and 5 star reviews within this sub-corpus or sub-collection of review documents.

We can do this using SQL to mimic what might occur in a production database environment - however please note that most of the time we will use Python packages for their simplicity after extracting the required data from the database!


### 2.2.1 Stopword Removal

Normally in a Python-based workflow, stop-word removal is accomplished using standard NLP library functions. However, we don't have this luxury in most SQL databases, including `DuckDB`.

Luckily we can use our trusty `REGEXP_REPLACE` function to remove this long list of English stop-words from our text data - the stop-words below are from the standard Python NLP library called `nltk`:

```text
a             about         above         after         again         ain  
all           am            an            and           any           are  
aren't        as            at            be            because       been  
before        being         below         between       both          but  
by            can           couldn        couldn't      d             did  
didn't        didn          do            does          doesn't       doesn  
doing         don't         don           down          during        each  
few           for           from          further       had           hadn't  
hadn          has           hasn't        hasn          have          haven't  
haven         having        he            her           here          hers  
herself       him           himself       his           how           i  
if            in            into          is            isn't         isn  
it            it's          its           itself        let's         ll  
m             ma            me            mightn        mightn't      more  
most          mustn         mustn't       my            myself        needn  
needn't       no            nor           not           now           o  
of            off           on            once          only          or  
other         our           ours          ourselves     out           over  
own           re            s             same          shan't        shan  
she           she's         should        should've     shouldn't     shouldn  
so            some          such          t             than          that  
that'll       the           their         theirs        them          themselves  
then          there         these         they          this          those  
through       to            too           under         until         up  
very          was           wasn't        wasn          we            were  
weren't       weren         what          when          where         which  
while         who           whom          why           will          with  
won't         won           wouldn        wouldn't      y             you  
you'd         you'll        you're        you've        your          yours  
yourself      yourselves
```
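
One way to picture how a list like this becomes a single regex is to join the words with `|` inside `\b( ... )\b`. The sketch below does this in Python with the `re` module for a tiny subset of the list; the subset, the sample review and the variable names are all made up for illustration, and the SQL in this tutorial passes an equivalent pattern to `REGEXP_REPLACE` with the `'gi'` options.

```python
import re

# A tiny, hypothetical subset of the stop-word list above
stop_words = ["i", "the", "was", "it's", "wasn't"]

# Escape each word and join with | between word boundaries
pattern = r"\b(" + "|".join(re.escape(word) for word in stop_words) + r")\b"

# Hypothetical review text
review = "The tour was great but it's a shame I wasn't warned"

# Remove stop-words case-insensitively, then collapse the leftover spaces
cleaned = re.sub(pattern, "", review, flags=re.IGNORECASE)
cleaned = re.sub(r"\s+", " ", cleaned).strip()

print(cleaned)  # tour great but a shame warned
```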

### 2.2.2 Tokenization

We'll take on the challenge of using only SQL to implement a basic form of text tokenization.

The steps we'll use are as follows:

1. Use `REGEXP_SPLIT_TO_ARRAY` to convert our text string into an array of "terms"
2. Cross join onto an array from 1 to the `ARRAY_LENGTH` of our document to track "position"
3. Slice our original document array of terms using our "position"

Using this approach - we can also extend our analysis to generate bi-gram terms as a further challenge!
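
The three steps above can be sketched in Python to make the mechanics concrete. This is an analogue under stated assumptions rather than the SQL itself: `re.split` plays the role of `REGEXP_SPLIT_TO_ARRAY`, `range` plays the role of the 1-to-`ARRAY_LENGTH` positions, and the sample text is made up.

```python
import re

# Hypothetical cleaned review text
text = "amazing coastal views amazing guide"

# Step 1: split the string into an array of terms
term_array = re.split(r"\s+", text)

# Step 2: generate positions 1..N (DuckDB lists are 1-indexed)
positions = range(1, len(term_array) + 1)

# Step 3: look up the term at each position (Python lists are 0-indexed)
tokens = [(position, term_array[position - 1]) for position in positions]

print(tokens)
# [(1, 'amazing'), (2, 'coastal'), (3, 'views'), (4, 'amazing'), (5, 'guide')]
```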

### 2.2.3 Frequency Metrics

Using our now tokenized data - we can use a simple `GROUP BY` clause on our `review_id` column with a `COUNT` function to generate the **term frequency** counts within each review. We can think of term frequency as "how many times did this word appear in this review?"

For document frequency - we'll instead perform a `GROUP BY` on our terms and apply `COUNT(DISTINCT review_id)`. This helps us answer the question "how many unique reviews included this specific word?"
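
To make the two counts concrete, here is a minimal Python sketch using `collections.Counter` on two made-up tokenized reviews; the review ids and terms are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized reviews: review_id -> list of terms
tokenized = {
    "r1": ["great", "guide", "great", "views"],
    "r2": ["late", "bus", "great"],
}

# Term frequency: how many times each term appears within each review
term_frequency = {
    review_id: Counter(terms) for review_id, terms in tokenized.items()
}

# Document frequency: how many distinct reviews contain each term
document_frequency = Counter(
    term for terms in tokenized.values() for term in set(terms)
)

print(term_frequency["r1"]["great"])  # 2
print(document_frequency["great"])    # 2
print(document_frequency["bus"])      # 1
```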

Putting these two metrics together, we can generate one of the most common NLP metrics called `TF-IDF` or term frequency inverse document frequency.

This helps us generate a metric for each term within each review so we can evaluate the following:

1. How common is this term in the current review?
2. How rare is this term across all reviews?

We apply a natural log to the inverse document frequency ratio so that the score grows slowly, rather than linearly, as the number of documents gets large.

We'll attempt to implement TF-IDF as it appears in the Python `scikit-learn` library which includes a few smoothing components.


### 2.2.4 Mathematical Notation

**Term Frequency (TF)**

The term frequency of term *t* in document *d*:

$$
TF(t, d) = f_{t,d}
$$

Where:
- **f<sub>t,d</sub>**: the number of times term *t* appears in document *d*


**Inverse Document Frequency (IDF)**

The inverse document frequency of term *t* across the corpus:

$$
IDF(t) = \log\left( \frac{1 + N}{1 + df(t)} \right) + 1
$$

Where:
- **N**: Total number of documents
- **df(t)**: Number of documents that contain term *t*
- Smoothing is applied by adding 1 in a few places to avoid division by zero and ensure numerical stability

**TF-IDF Score**

Combining both:

$$
TFIDF(t, d) = TF(t, d) \times IDF(t)
$$

This score increases when a term is frequent in a specific document but rare across the entire collection.
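
The formulas above translate directly into a few lines of Python. The sketch below is a plain transcription using `math.log`, with hypothetical function names and made-up counts. Note that scikit-learn's `TfidfVectorizer` additionally L2-normalizes each document's scores by default, which neither the formulas above nor this sketch do.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """IDF(t) = ln((1 + N) / (1 + df(t))) + 1"""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(term_freq: int, n_docs: int, doc_freq: int) -> float:
    """TFIDF(t, d) = TF(t, d) * IDF(t)"""
    return term_freq * smoothed_idf(n_docs, doc_freq)

# Made-up example with 3 documents
common = tfidf(term_freq=2, n_docs=3, doc_freq=3)  # term appears in every document
rare = tfidf(term_freq=2, n_docs=3, doc_freq=1)    # term appears in one document

print(common)         # 2.0
print(round(rare, 4))  # 3.3863
```

With the same term frequency, the rarer term receives the higher score.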

### 2.2.5 SQL Implementation

Now let's take a look at how we can implement this end-to-end using SQL for all our transformations.

We'll apply the following steps in the script below:

1. Join `reviews` to `products` and apply filter to keep `Coastal & Canyon Explorer` tour product reviews
2. Apply standard NLP cleaning steps on our `review_text` column in the following order:
   1. Remove stop-words
   2. Collapse multiple spaces into a single space
   3. Remove punctuation using character class
   4. Apply lowercase transformation
   5. Trim leading and trailing whitespace
3. Implement uni-gram terms tokenization
4. Calculate term-frequency and document-frequency
5. Combine both calculations to get TF-IDF

In [None]:
%%sql reviews_unigram_tfidf_df <<
-- ------------------------------------------------------
-- 1. Join reviews and products, apply NLP preprocessing
-- ------------------------------------------------------
WITH cte_reviews AS (
  SELECT
    reviews.review_id,
    products.product_name,
    reviews.sentiment,
    reviews.star_rating,

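    -- Apply regex-based text normalization:
    -- Step 1: Remove stop-words
    -- Step 2: Collapse extra spaces
    -- Step 3: Remove punctuation
    -- Step 4: Lowercase text
    -- Step 5: Trim whitespace
    -- Note that the stop-word list below is abbreviated with "..." - substitute the full list from section 2.2.1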
    TRIM(
      LOWER(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              reviews.review_text,
              '\b(i|me|my|myself|...|wouldn''t)\b', '', 'gi'
            ),
            '[\s]+', ' ', 'g'
          ),
          '[\(\)''&,.:;\\—!]', '', 'g'
        )
      )
    ) AS transformed_review_text

  FROM reviews
  INNER JOIN products ON reviews.product_id = products.product_id
  WHERE products.product_name = 'Coastal & Canyon Explorer'
),

-- ------------------------------------------------------
-- 2. Tokenize cleaned text into unigram terms
-- ------------------------------------------------------
cte_arrays AS (
  SELECT
    review_id,
    REGEXP_SPLIT_TO_ARRAY(transformed_review_text, '\s+') AS term_array
  FROM cte_reviews
),

cte_range AS (
  SELECT
    review_id,
    term_array,
    i.unnest AS position
  FROM cte_arrays
  CROSS JOIN UNNEST(RANGE(1, ARRAY_LENGTH(term_array) + 1)) AS i
),

cte_tokenized AS (
  SELECT
    review_id,
    term_array[position] AS term,
    position
  FROM cte_range
),

-- ------------------------------------------------------
-- 3. Calculate term frequency (TF) and document frequency (DF)
-- ------------------------------------------------------
cte_term_frequency AS (
  SELECT
    review_id,
    term,
    COUNT(*) AS term_frequency
  FROM cte_tokenized
  GROUP BY review_id, term
),

cte_document_frequency AS (
  SELECT
    term,
    COUNT(DISTINCT review_id) AS document_frequency
  FROM cte_tokenized
  GROUP BY term
),

cte_total_document_count AS (
  SELECT
    COUNT(DISTINCT review_id) AS document_count
  FROM cte_tokenized
),

-- ------------------------------------------------------
-- 4. Combine stats and compute TF-IDF score
-- ------------------------------------------------------
cte_combined AS (
  SELECT
    tf.review_id,
    tf.term,
    COALESCE(tf.term_frequency, 0) AS term_frequency,
    COALESCE(df.document_frequency, 0) AS document_frequency,
    docs.document_count
  FROM cte_term_frequency AS tf
  LEFT JOIN cte_document_frequency AS df ON tf.term = df.term
  CROSS JOIN cte_total_document_count AS docs
)

-- ------------------------------------------------------
-- 5. Final output: TF-IDF scores with review metadata
-- ------------------------------------------------------
SELECT
  reviews.review_id,
  reviews.product_name,
  reviews.sentiment,
  reviews.star_rating,
  combined.term,
  combined.term_frequency,
  combined.document_frequency,

  -- TF-IDF formula: tf * log-scaled inverse document frequency
  combined.term_frequency * (
    LN((1 + combined.document_count) / (1 + combined.document_frequency)) + 1
  ) AS tfidf

FROM cte_combined AS combined
LEFT JOIN cte_reviews AS reviews ON combined.review_id = reviews.review_id
ORDER BY reviews.review_id, tfidf DESC;


In [None]:
pd.set_option('display.max_rows', None)

# Let's just take a look at our very first review_id we were inspecting before
%sql SELECT * FROM reviews_unigram_tfidf_df WHERE review_id = '000319b1';

## 2.3 TF-IDF Questions

Let's apply some aggregations to our TF-IDF metrics for uni-gram terms and ask a few simple questions of our Coastal & Canyon Explorer reviews data.

1. Which are the highest TF-IDF terms for all 1 star reviews?
2. Which are the highest TF-IDF terms for all 5 star reviews?

### 2.3.1 1 Star Reviews

> Which are the highest TF-IDF terms for all 1 star reviews?

In [None]:
%%sql
SELECT
  term,
  AVG(tfidf) AS avg_tfidf
FROM reviews_unigram_tfidf_df
WHERE star_rating = 1
GROUP BY term
ORDER BY avg_tfidf DESC
LIMIT 25;

### 2.3.2 5 Star Reviews

> Which are the highest TF-IDF terms for all 5 star reviews?

In [None]:
%%sql
SELECT
  term,
  AVG(tfidf) AS avg_tfidf
FROM reviews_unigram_tfidf_df
WHERE star_rating = 5
GROUP BY term
ORDER BY avg_tfidf DESC
LIMIT 25;

## 2.4 Comparing Bigram Outputs

We can alter our query just slightly to generate the same TF-IDF analysis but this time using bi-gram terms instead of uni-gram terms to provide more context as we analyze the NLP outputs.
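
The key change for bi-grams is pairing each term with the term before it. As a rough Python analogue of that idea (the SQL in this section uses the `LAG` window function; the term list here is made up), the sketch below shifts the list by one position and joins each pair.

```python
# Hypothetical ordered terms from one review
terms = ["amazing", "coastal", "views", "amazing", "guide"]

# Shift the list by one so each term lines up with its predecessor,
# similar to LAG(term) OVER (ORDER BY position)
previous_terms = [None] + terms[:-1]

# Join each pair; the first term has no predecessor and is skipped
bigrams = [
    f"{previous} {current}"
    for previous, current in zip(previous_terms, terms)
    if previous is not None
]

print(bigrams)
# ['amazing coastal', 'coastal views', 'views amazing', 'amazing guide']
```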

### 2.4.1 SQL Implementation

In [None]:
%%sql reviews_bigram_tfidf_df <<
-- ------------------------------------------------------
-- 1. Join reviews and products, apply NLP text cleaning
-- ------------------------------------------------------
WITH cte_reviews AS (
  SELECT
    reviews.review_id,
    products.product_name,
    reviews.sentiment,
    reviews.star_rating,

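    -- Apply stop-word removal, punctuation cleanup, lowercasing, whitespace trim
    -- Note that the stop-word list below is abbreviated with "..." - substitute the full list from section 2.2.1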
    TRIM(
      LOWER(
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            REGEXP_REPLACE(
              reviews.review_text,
              '\b(i|me|my|myself|we|our|...|wouldn''t)\b', '', 'gi'       -- Step 1: Remove stop words
            ),
            '[\s]+', ' ', 'g'                                             -- Step 2: Collapse spaces
          ),
          '[\(\)''&,.:;\\—!]', '', 'g'                                    -- Step 3: Remove punctuation
        )
      )
    ) AS transformed_review_text
  FROM reviews
  INNER JOIN products ON reviews.product_id = products.product_id
  WHERE products.product_name = 'Coastal & Canyon Explorer'
),

-- ------------------------------------------------------
-- 2. Tokenize into arrays for bi-gram generation
-- ------------------------------------------------------
cte_arrays AS (
  SELECT
    review_id,
    REGEXP_SPLIT_TO_ARRAY(transformed_review_text, '\s+') AS term_array
  FROM cte_reviews
),

cte_range AS (
  SELECT
    review_id,
    term_array,
    i.unnest AS position
  FROM cte_arrays
  CROSS JOIN UNNEST(RANGE(1, ARRAY_LENGTH(term_array) + 1)) AS i         -- Map positions 1..N
),

-- ------------------------------------------------------
-- 3. Extract bi-gram terms using LAG
-- ------------------------------------------------------
cte_tokenized_base AS (
  SELECT
    review_id,
    term_array[position] AS unigram_term,
    position
  FROM cte_range
),

cte_tokenized AS (
  SELECT
    review_id,
    LAG(unigram_term) OVER (PARTITION BY review_id ORDER BY position) || ' ' || unigram_term AS term
  FROM cte_tokenized_base
  QUALIFY term IS NOT NULL
),

-- ------------------------------------------------------
-- 4. Compute term frequency (TF) and document frequency (DF)
-- ------------------------------------------------------
cte_term_frequency AS (
  SELECT
    review_id,
    term,
    COUNT(*) AS term_frequency
  FROM cte_tokenized
  GROUP BY review_id, term
),

cte_document_frequency AS (
  SELECT
    term,
    COUNT(DISTINCT review_id) AS document_frequency
  FROM cte_tokenized
  GROUP BY term
),

cte_total_document_count AS (
  SELECT
    COUNT(DISTINCT review_id) AS document_count
  FROM cte_tokenized
),

-- ------------------------------------------------------
-- 5. Combine stats and compute TF-IDF score
-- ------------------------------------------------------
cte_combined AS (
  SELECT
    tf.review_id,
    tf.term,
    COALESCE(tf.term_frequency, 0) AS term_frequency,
    COALESCE(df.document_frequency, 0) AS document_frequency,
    docs.document_count
  FROM cte_term_frequency AS tf
  LEFT JOIN cte_document_frequency AS df ON tf.term = df.term
  CROSS JOIN cte_total_document_count AS docs
)

# ------------------------------------------------------
# 6. Final output with review metadata and sorted TF-IDF
# ------------------------------------------------------
SELECT
  reviews.review_id,
  reviews.product_name,
  reviews.sentiment,
  reviews.star_rating,
  combined.term,
  combined.term_frequency,
  combined.document_frequency,

  # TF-IDF scoring formula
  combined.term_frequency * (
    LN((1 + combined.document_count) / (1 + combined.document_frequency)) + 1
  ) AS tfidf

FROM cte_combined AS combined
LEFT JOIN cte_reviews AS reviews ON combined.review_id = reviews.review_id
ORDER BY reviews.review_id, tfidf DESC;
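
To sanity-check the logic of the bi-gram TF-IDF query above outside of SQL, here is a minimal Python sketch of the same steps: whitespace tokenization, bi-grams built from adjacent tokens, and the smoothed formula `tf * (ln((1 + N) / (1 + df)) + 1)`. The review IDs and texts are made up for illustration and are not taken from the `reviews` table.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned review texts keyed by review_id
docs = {
    "r1": "great tour great guide",
    "r2": "great tour poor lunch",
}

def bigrams(text):
    """Split on whitespace and pair each token with the one that follows it."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Term frequency: bi-gram counts per review
tf = {doc_id: Counter(bigrams(text)) for doc_id, text in docs.items()}

# Document frequency: number of reviews containing each bi-gram
df = Counter(term for counts in tf.values() for term in counts)

# Only reviews with at least one bi-gram count as documents, as in the SQL
n_docs = sum(1 for counts in tf.values() if counts)

def tfidf(doc_id, term):
    """Smoothed TF-IDF using the same formula as the query above."""
    idf = math.log((1 + n_docs) / (1 + df[term])) + 1
    return tf[doc_id][term] * idf

for doc_id, counts in tf.items():
    for term in counts:
        print(doc_id, term, round(tfidf(doc_id, term), 4))
```

With this smoothing, a bi-gram that appears in every review gets a weight of exactly 1 per occurrence, while rarer bi-grams score higher.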


In [None]:
# Take a look at the first review ID to see the difference
%sql SELECT * FROM reviews_bigram_tfidf_df WHERE review_id = '000319b1';

### 2.4.1 1 Star Reviews

> Which are the highest TF-IDF bi-gram terms for all 1 star reviews?

In [None]:
%%sql
SELECT
  term,
  AVG(tfidf) AS avg_tfidf
FROM reviews_bigram_tfidf_df
WHERE star_rating = 1
GROUP BY term
ORDER BY avg_tfidf DESC
LIMIT 25;

### 2.4.2 5 Star Reviews

> Which are the highest TF-IDF bi-gram terms for all 5 star reviews?

In [None]:
%%sql
SELECT
  term,
  AVG(tfidf) AS avg_tfidf
FROM reviews_bigram_tfidf_df
WHERE star_rating = 5
GROUP BY term
ORDER BY avg_tfidf DESC
LIMIT 25;

In [None]:
# Reset Pandas option to only show top 10 rows
pd.set_option('display.max_rows', 10)

# 3. Implement A/B Test Framework

The final part of our NLP challenge is to design a measurement framework to quantify the uplift of an AI experiment.

In our Explore California business example - an NLP search experiment was conducted in the first 3 months of 2026. 50% of the website visitors were exposed to an AI-powered "NLP search" tool while the remaining 50% saw the original experience.

Let's explore the SQL implementation to compare the two groups of website visitors and apply some statistical testing to validate whether the new NLP search led to a significant uplift in product sales for the experiment period.

To help us with our table joins - we can inspect the entity relationship diagram below.

<iframe width="100%" height="600" src='https://dbdiagram.io/e/685279b3f039ec6d36c0c7e9/68527d19f039ec6d36c1813e'> </iframe>

## 3.1 Inspect Raw Data

We can begin by taking a look at the `sales`, `visits` and `features` tables - these will be key to our analysis!

For the `features` table - we'll apply a filter for "Search" to focus on the NLP search feature.

In [None]:
%%sql
SELECT * FROM sales LIMIT 5;

In [None]:
%%sql
SELECT * FROM visits LIMIT 5;

And finally - we've got a `features` table which identifies the `visit_id` values where an AI powered feature was shown.

For our first NLP search feature - we can apply a filter on this table for "Search"

In [None]:
%%sql
SELECT * FROM features
WHERE feature = 'Search'
LIMIT 5;

## 3.2 Experimental Analysis Base

Let's have a go at combining all of these tables to generate a single table where we can apply all of our analysis.

The required columns that we'll need for this table are below:

* visit_timestamp
* visit_id
* user_id
* feature_flag (if it exists)
* sale_flag (if it exists)
* sale_amount (use the `products` table to find the USD price)

Let's also filter this table to only include the first 3 months of 2026 - this is when our theoretical AI experiment is running.

We will store our outputs as `experiment_analysis_df`

In [None]:
%%sql experiment_analysis_df <<
# ------------------------------------------------------
# 1. Join visits with feature flags, sales, and product data
# ------------------------------------------------------
SELECT
  visits.visit_timestamp,
  visits.visit_id,
  visits.user_id,

  # Flag whether the feature was active for this visit
  CASE 
    WHEN features.feature IS NOT NULL THEN 1 
    ELSE 0 
  END AS feature_flag,

  # Flag whether a sale occurred during this visit
  CASE 
    WHEN sales.sale_id IS NOT NULL THEN 1 
    ELSE 0 
  END AS sale_flag,

  # Capture sale amount; default to 0 if no product linked
  COALESCE(products.price_usd, 0) AS sale_amount

FROM visits

# Join feature exposure data (optional per visit)
LEFT JOIN features 
  ON visits.visit_id = features.visit_id

# Join sales data (optional per visit)
LEFT JOIN sales 
  ON visits.visit_id = sales.visit_id

# Join product price info (optional if sale exists)
LEFT JOIN products 
  ON sales.product_id = products.product_id

# ------------------------------------------------------
# 2. Filter to experiment window: Q1 2026
# ------------------------------------------------------
# Half-open range so that visits at any time on March 31 are included
WHERE visits.visit_timestamp >= DATE '2026-01-01'
  AND visits.visit_timestamp < DATE '2026-04-01';


In [None]:
experiment_analysis_df.head()
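
To see how the `LEFT JOIN` and `CASE` pattern behaves on edge cases, here is a small self-contained sketch on toy data. It uses Python's built-in `sqlite3` module rather than DuckDB purely so that it runs without any extra installs, and the table contents are hypothetical. Only the columns referenced in the query above are modelled.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE visits   (visit_id TEXT, user_id TEXT, visit_timestamp TEXT);
CREATE TABLE features (visit_id TEXT, feature TEXT);
CREATE TABLE sales    (sale_id TEXT, visit_id TEXT, product_id TEXT);
CREATE TABLE products (product_id TEXT, price_usd REAL);

-- v1 saw the feature and bought two products, v2 did neither
INSERT INTO visits   VALUES ('v1', 'u1', '2026-01-05 09:00:00'),
                            ('v2', 'u2', '2026-02-10 14:30:00');
INSERT INTO features VALUES ('v1', 'Search');
INSERT INTO sales    VALUES ('s1', 'v1', 'p1'), ('s2', 'v1', 'p2');
INSERT INTO products VALUES ('p1', 100.0), ('p2', 250.0);
""")

rows = con.execute("""
SELECT
  visits.visit_id,
  CASE WHEN features.feature IS NOT NULL THEN 1 ELSE 0 END AS feature_flag,
  CASE WHEN sales.sale_id IS NOT NULL THEN 1 ELSE 0 END AS sale_flag,
  COALESCE(products.price_usd, 0) AS sale_amount
FROM visits
LEFT JOIN features ON visits.visit_id = features.visit_id
LEFT JOIN sales    ON visits.visit_id = sales.visit_id
LEFT JOIN products ON sales.product_id = products.product_id
ORDER BY visits.visit_id, sale_amount
""").fetchall()

for row in rows:
    print(row)
```

Visit `v1` appears twice because it has two sales, while `v2` appears once with both flags set to 0. This row duplication is why the later aggregations count `DISTINCT visit_id` rather than raw rows.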

## 3.3 Split Group Comparison

We can use our table `experiment_analysis_df` to answer a few relevant questions by splitting up our visits based on the `feature_flag`:

1. How many sales are there and what is the total sale amount?
2. What is the total number of visits for the 3 month period?
3. What percentage of visits lead to a sale?

In [None]:
%%sql
# ------------------------------------------------------
# 1. Aggregate conversion and revenue metrics by feature group
# ------------------------------------------------------
SELECT
  feature_flag,  # 0 = control group, 1 = treatment group

  # Count of unique visits that resulted in a sale
  COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS sales_count,

  # Total dollar value of all sales
  SUM(sale_amount) AS sales_amount,

  # Total number of visits in each group
  COUNT(DISTINCT visit_id) AS visit_count,

  # Conversion rate = sales / total visits
  sales_count / visit_count AS conversion_rate

FROM experiment_analysis_df

# ------------------------------------------------------
# 2. Group by control vs treatment and sort by flag
# ------------------------------------------------------
GROUP BY feature_flag
ORDER BY feature_flag;


## 3.4 Hypothesis Testing

In A/B testing scenarios like this, we often want to know whether one group (e.g., a new feature group or "target") performs better than another group (e.g., the "control").

Since we are specifically interested in whether the **target group performs better**, we use a **one-tailed test** — focusing only on **positive uplift**. A one-tailed test checks whether the target group’s conversion rate is significantly **higher** than the control group’s, not just different.

This is different from a **two-tailed test**, which checks for **any difference**, either higher or lower, without considering direction. A one-tailed test has more power to detect an effect in the chosen direction, but it should only be used when that direction is fixed before looking at the data and a change in the opposite direction would not alter the decision.

We set our **alpha level** at `0.05`, which means we accept up to a 5% chance of a false positive — rejecting the null hypothesis when there is no true difference.
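
As a quick check on where the z cut-offs used in the rest of this section come from, the sketch below looks up the critical values for `alpha = 0.05` with Python's standard-library `statistics.NormalDist`. The variable names are illustrative only.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed test: all of alpha sits in the upper tail
one_tailed_critical = standard_normal.inv_cdf(1 - alpha)

# Two-tailed test: alpha is split evenly across both tails
two_tailed_critical = standard_normal.inv_cdf(1 - alpha / 2)

print(round(one_tailed_critical, 3))  # 1.645
print(round(two_tailed_critical, 3))  # 1.96
```

The one-tailed value is the threshold for the significance test, while the two-tailed value is the multiplier used for 95% confidence intervals.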

### 3.4.1 Z-Score Formula

The **z-score** is a statistical measure that tells us how unusual or extreme the observed difference in conversion rates is, assuming the null hypothesis (no difference) is true.

In our case, a sufficiently high z-score lets us **reject the null hypothesis** that the target group performs the same as the control group.

The z-score for the difference in proportions is calculated as:

$$
z = \frac{p_1 - p_2}{\sqrt{p(1 - p) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
$$

Where:

- $p_1 = \dfrac{x_1}{n_1}$: target group conversion rate
- $p_2 = \dfrac{x_2}{n_2}$: control group conversion rate
- $p = \dfrac{x_1 + x_2}{n_1 + n_2}$: overall (pooled) conversion rate
- $x_1$, $x_2$: number of sales in the target and control groups, respectively
- $n_1$, $n_2$: number of visits (observations) in the target and control groups, respectively

A z-score above **1.645** indicates statistical significance at the 95% confidence level for a **one-tailed** test.
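
The pooled formula above can be sketched in a few lines of Python. The function name `pooled_z_score` is just an illustrative choice, and the counts are taken from the example report in section 3.6.

```python
import math

def pooled_z_score(x1, n1, x2, n2):
    """Two-proportion z-score using the pooled conversion rate."""
    p1 = x1 / n1                 # target group conversion rate
    p2 = x2 / n2                 # control group conversion rate
    p = (x1 + x2) / (n1 + n2)    # overall (pooled) conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Counts from the example report in section 3.6
z = pooled_z_score(x1=2549, n1=24355, x2=1700, n2=24336)
print(round(z, 2), z > 1.645)
```

This gives a z-score of roughly 13.6, far above the 1.645 threshold. The SQL in section 3.5 divides by the unpooled standard error from section 3.4.3 instead, so its z-score (13.63 in the example report) differs slightly from the pooled value.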


### 3.4.2 Standard Error of Target Conversion Rate

To measure the uncertainty of the **target group and control group conversion rates**, we use the standard error of a single proportion. It is shown here for the target group; the control group uses $p_2$ and $n_2$ in the same way:

$$
SE_{p_1} = \sqrt{ \frac{p_1(1 - p_1)}{n_1} }
$$

This allows us to construct a confidence interval around the conversion rates independently.

$$
CI_{p_1} = p_1 \pm Z \cdot SE_{p_1}
$$

For example, using $Z = 1.96$ gives a 95% confidence interval.
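
Here is a minimal Python sketch of this interval, using the target group counts from the example report in section 3.6. The helper name `proportion_ci` is hypothetical.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a single proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Target group: 2,549 sales out of 24,355 visits
lower, upper = proportion_ci(successes=2549, n=24355)
print(f"[{lower:.2%}, {upper:.2%}]")
```

The result should line up with the target group interval shown in the example report.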

### 3.4.3 Standard Error of Uplift (Difference in Proportions)

To calculate a confidence interval for the **uplift** — the difference in conversion rates between the target and control groups — we use the following:

$$
SE_{\text{uplift}} = \sqrt{ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2} }
$$

Then the confidence interval for the uplift becomes:

$$
CI_{\text{uplift}} = (p_1 - p_2) \pm 1.96 \cdot SE_{\text{uplift}}
$$

This gives us a plausible range for the **true uplift**, helping us assess not just whether the result is statistically significant, but also how **meaningful** it might be.

By combining z-score testing and confidence intervals, we get a more complete picture of both **statistical significance** and **practical impact**.
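
As a worked example, the sketch below computes the uplift and its 95% confidence interval in Python from the counts in the example report in section 3.6. The function name `uplift_ci` is an assumption made for illustration.

```python
import math

def uplift_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for the difference in conversion rates (target minus control)."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    uplift = p1 - p2
    return uplift, uplift - z * se, uplift + z * se

uplift, lower, upper = uplift_ci(x1=2549, n1=24355, x2=1700, n2=24336)
print(f"uplift={uplift:.4f}, 95% CI=[{lower:.4f}, {upper:.4f}]")
```

Because the whole interval lies above zero, the data are consistent with a genuinely positive uplift rather than noise.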


## 3.5 SQL Implementation

### 📈 What This SQL Code Does

This SQL workflow evaluates the impact of a feature rollout using an A/B test framework — comparing a **target group** (with the feature enabled) against a **control group** (feature off).

---

#### 🧮 Step-by-Step Breakdown

1. **Aggregate Metrics for Control Group**  
   Calculates the number of visits, conversions (sales), total revenue, and overall conversion rate for visits **without** the feature (`feature_flag = 0`).

2. **Aggregate Metrics for Target Group**  
   Does the same calculations for visits **with** the feature (`feature_flag = 1`).

3. **Combine Groups**  
   Joins control and target group results into a single row for side-by-side comparison.

4. **Calculate Uplift & Confidence Intervals**  
   - Computes the **absolute uplift** in conversion rate between the two groups  
   - Calculates 95% confidence intervals for both conversion rates  
   - Computes the **standard error** of the uplift to prepare for statistical testing

5. **Run Significance Test & Estimate Business Impact**  
   - Calculates a **z-score** by dividing the absolute uplift by its standard error (the unpooled form from section 3.4.3)  
   - Flags the result as “Significant” or “Not Significant” using a one-tailed 95% test  
   - Estimates the number of **incremental sales**  
   - Projects **baseline revenue** (what the target group would have earned without the uplift)  
   - Calculates **incremental revenue** as the difference in observed revenue between the target and control groups

6. **Return Final Results**  
   Outputs all the key metrics: uplift, significance, confidence intervals, and business impact — giving a complete picture of how effective the feature was.

---

This analysis helps us make **data-driven decisions** about whether a new product feature led to meaningful improvement in user behavior and business outcomes.
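
The business impact step can be sketched in Python as follows. This is an illustration rather than a copy of the SQL: the function name, the `revenue_above_baseline` key and all input numbers are hypothetical. The SQL reports the raw revenue difference between the two groups, whereas this sketch compares target revenue against the projected baseline.

```python
def business_impact(control_visits, control_sales, control_revenue,
                    target_visits, target_sales, target_revenue):
    """Estimate incremental sales and revenue for the target group."""
    control_rate = control_sales / control_visits
    target_rate = target_sales / target_visits

    # Extra conversions attributable to the uplift in conversion rate
    incremental_sales = target_visits * (target_rate - control_rate)

    # Revenue the target group would have earned at the control group's
    # conversion rate and average sale value (falls back to 0.0 when the
    # control group has no sales, where the SQL's NULLIF would give NULL)
    avg_control_sale = control_revenue / control_sales if control_sales else 0.0
    baseline_revenue = control_rate * target_visits * avg_control_sale

    return {
        "incremental_sales": incremental_sales,
        "baseline_revenue": baseline_revenue,
        "revenue_above_baseline": target_revenue - baseline_revenue,
    }

# Hypothetical inputs for illustration only
result = business_impact(
    control_visits=1000, control_sales=50, control_revenue=5000.0,
    target_visits=1000, target_sales=80, target_revenue=8400.0,
)
print(result)
```

When the two groups are the same size, as in this toy input, the revenue above baseline equals the raw difference in revenue between the groups.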


In [None]:
%%sql experiment_results_df <<
# ------------------------------------------------------
# 1. Aggregate control group metrics
# ------------------------------------------------------
WITH cte_control AS (
  SELECT
    COUNT(DISTINCT visit_id) AS control_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS control_sales_count,
    SUM(sale_amount) AS control_sales_amount,
    control_sales_count / control_visit_count AS control_conversion_rate
  FROM experiment_analysis_df
  WHERE feature_flag = 0
),

# ------------------------------------------------------
# 2. Aggregate target group metrics
# ------------------------------------------------------
cte_target AS (
  SELECT
    COUNT(DISTINCT visit_id) AS target_visit_count,
    COUNT(DISTINCT CASE WHEN sale_flag = 1 THEN visit_id ELSE NULL END) AS target_sales_count,
    SUM(sale_amount) AS target_sales_amount,
    target_sales_count / target_visit_count AS target_conversion_rate
  FROM experiment_analysis_df
  WHERE feature_flag = 1
),

# ------------------------------------------------------
# 3. Combine control and target groups into a single row
# ------------------------------------------------------
cte_combined AS (
  SELECT
    control.*,
    target.*
  FROM cte_target AS target
  CROSS JOIN cte_control AS control
),

# ------------------------------------------------------
# 4. Calculate uplift and confidence intervals
# ------------------------------------------------------
cte_stats AS (
  SELECT
    *,
    
    # Absolute uplift in conversion rate
    target_conversion_rate - control_conversion_rate AS absolute_uplift,

    # 95% Confidence Interval for target group conversion rate
    target_conversion_rate - 1.96 * SQRT((target_conversion_rate * (1 - target_conversion_rate)) / target_visit_count) AS target_ci_lower,
    target_conversion_rate + 1.96 * SQRT((target_conversion_rate * (1 - target_conversion_rate)) / target_visit_count) AS target_ci_upper,

    # 95% Confidence Interval for control group conversion rate
    control_conversion_rate - 1.96 * SQRT((control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count) AS control_ci_lower,
    control_conversion_rate + 1.96 * SQRT((control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count) AS control_ci_upper,

    # Standard error for difference in conversion rates
    SQRT(
      (target_conversion_rate * (1 - target_conversion_rate)) / target_visit_count +
      (control_conversion_rate * (1 - control_conversion_rate)) / control_visit_count
    ) AS uplift_se
  FROM cte_combined
),

# ------------------------------------------------------
# 5. Calculate z-score, test result, and revenue impact
# ------------------------------------------------------
cte_zscore AS (
  SELECT
    *,
    
    # Z-score for difference in proportions
    absolute_uplift / uplift_se AS z_score,

    # Test result: one-tailed 95% significance test
    CASE 
      WHEN absolute_uplift / uplift_se >= 1.645 THEN 'Significant'
      ELSE 'Not Significant'
    END AS test_result,

    # 95% Confidence Interval for absolute uplift
    absolute_uplift - 1.96 * uplift_se AS uplift_ci_lower,
    absolute_uplift + 1.96 * uplift_se AS uplift_ci_upper,

    # Estimated incremental conversions
    target_visit_count * absolute_uplift AS incremental_sales_count,

    # Expected sales amount if no uplift had occurred (baseline projection)
    control_conversion_rate * target_visit_count * 
      (control_sales_amount * 1.0 / NULLIF(control_sales_count, 0)) AS expected_sales_amount_without_uplift,

    # Difference in observed sales revenue between target and control groups
    target_sales_amount - control_sales_amount AS incremental_sales_amount
  FROM cte_stats
)

# ------------------------------------------------------
# 6. Final output: summarized experiment evaluation
# ------------------------------------------------------
SELECT

  # Statistical Test Results
  z_score,
  test_result,

  # Uplift and Impact Metrics
  absolute_uplift,
  uplift_ci_lower,
  uplift_ci_upper,
  incremental_sales_count,
  incremental_sales_amount,

  # Target Group Metrics
  target_visit_count,
  target_sales_count,
  target_conversion_rate,
  target_ci_lower,
  target_ci_upper,

  # Control Group Metrics
  control_visit_count,
  control_sales_count,
  control_conversion_rate,
  control_ci_lower,
  control_ci_upper

FROM cte_zscore;


In [None]:
experiment_results_df

## 3.6 Experimentation Insights

Here is an example report we can generate using our calculated metrics from our A/B test framework.

---

### 📊 Experiment Results Summary

Our A/B test aimed to evaluate whether the new NLP search feature (enabled in the **target** group) led to improved conversion and revenue performance compared to the control group.

#### ✅ Statistical Significance

- **Z-score**: `13.63`  
- **Result**: **Significant** at the 95% confidence level (one-tailed test)

This indicates **strong evidence** that the target group outperformed the control group in conversion rate.

---

#### 🎯 Conversion Performance

| Metric                     | Control Group  | Target Group     |
|----------------------------|----------------|------------------|
| Number of Visits           | 24,336         | 24,355           |
| Number of Conversions      | 1,700          | 2,549            |
| Conversion Rate            | 6.99%          | 10.47%           |
| 95% CI (Conversion Rate)   | [6.67%, 7.31%] | [10.08%, 10.85%] |

- **Absolute uplift in conversion rate**: **+3.48 percentage points**  
- **95% Confidence Interval for uplift**: [2.98, 3.98] percentage points

---

#### 💰 Revenue Impact

- **Estimated incremental conversions**: `~848` additional sales  
- **Incremental sales amount**: **$2,166,053**

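This is the difference in observed sales revenue between the target and control groups. Because it sums actual product prices, it reflects the variation in sale amounts per product. The two groups received almost the same number of visits (24,355 vs 24,336), so it serves as a rough estimate of the additional revenue generated by the feature, without any adjustment for the small difference in group size.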

---

#### 📌 Conclusion

The experiment showed a **statistically significant uplift** in conversion rate, along with higher observed revenue in the target group. Based on the observed metrics, enabling the new NLP search feature is likely to drive meaningful business impact through increased sales and revenue.