# Authors:

- Luca Erbì
- Gabriele Lorenzo


# Lab 5: Text to SQL

In this lab, we'll apply what we learned in the previous labs to build a SQLite database using Ministral 8B. This lab is less guided than the previous ones, so you'll need to refer to your earlier work to complete each part. It also focuses on prompt engineering: you have to find the best prompt and prompting strategy (system prompt? temperature value? dialog prompt style?).

For this lab, we need to use sqlite3 to execute the generated queries.
Check the doc: https://docs.python.org/3/library/sqlite3.html

<font color='red'>BE CAREFUL: you need to generate SQL queries and then automatically execute them with the sqlite3 connector. DO NOT generate Python code. DO NOT copy-paste a generated query into the connector.</font>

<font color='green'>TIPS: sqlite3 creates a file containing your database. Delete it if you need to reset the database.</font>
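
The reset tip above can be sketched as follows. This is a minimal illustration, not part of the lab code: the helper name `reset_db` and the temporary directory are assumptions, and in the lab the file is whichever path you pass to `sqlite3.connect`.

```python
import os
import sqlite3
import tempfile


def reset_db(path):
    """Delete the SQLite file at `path` (if it exists) and return a fresh connection."""
    if os.path.exists(path):
        os.remove(path)
    return sqlite3.connect(path)


# Demo in a temporary directory so the working directory is left untouched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.db")

demo_conn = reset_db(demo_path)
demo_conn.execute("CREATE TABLE t (x INTEGER)")
demo_conn.commit()
demo_conn.close()  # close before deleting, otherwise the file may still be in use

demo_conn = reset_db(demo_path)
tables_after_reset = demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
demo_conn.close()
print(tables_after_reset)
```

Closing the connection before deleting the file matters on platforms that lock open files.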

Lab overview:

0. Modules installation and model loading.
1. Create tables using llm.
2. Populate tables using llm.
3. Explore our tables using llm.
4. More than one table with llm.

IMPORTANT:

- You must work in pairs. You must submit **ONLY ONE NOTEBOOK** for each pair.
- Do not share your work with other pairs.
- You should not use Copilot, ChatGPT or similar tools. At the very least, remove the prompt ...


## 0. Setup


In [1]:
# !pip install -U transformers datasets bitsandbytes accelerate

In [13]:
from transformers import (
    BitsAndBytesConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
)

from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook
import pandas as pd
import sqlite3
import torch
import re

In [3]:
# Put your hugging face token here: https://huggingface.co/docs/hub/en/security-tokens
# You need to fill the access form with your huggingface account on this link: https://huggingface.co/mistralai/Ministral-8B-Instruct-2410
hf_token = ""
llm_name = "mistralai/Ministral-8B-Instruct-2410"

# We want to use 4bit quantization to save memory
quantization_config = BitsAndBytesConfig(load_in_8bit=False, load_in_4bit=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_name, padding_side="left", token=hf_token)
# Prevent some transformers specific issues.
tokenizer.use_default_system_prompt = False
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load LLM.
llm = AutoModelForCausalLM.from_pretrained(
    llm_name,
    quantization_config=quantization_config,
    device_map={"": 0},  # load all the model layers on GPU 0
    torch_dtype=torch.bfloat16,  # float precision
    token=hf_token,
)
# Set LLM on eval mode.
llm.eval()


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(131072, 4096)
    (layers): ModuleList(
      (0-35): 36 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Mis

## 1. Create tables using llms

You need to generate and execute SQL queries to create 3 tables:

- "characters": Id (primary key), Name (str), Age (int), Profession (str).
- "characters20": same as characters.
- "skills": Id (primary key), Name (str), Profession (str).

For example, running `cursor.execute("""PRAGMA table_info(characters);""").fetchall()` should give this result:

```
[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'name', 'TEXT', 1, None, 0),
 (2, 'age', 'INTEGER', 1, None, 0),
 (3, 'profession', 'TEXT', 1, None, 0)]
```

<font color='red'>BE CAREFUL: sqlite3 does not support exactly the same syntax as other SQL dialects. You may need to specify in your prompt that the target is sqlite3.</font>
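
To make the dialect point concrete, here is a hand-written sketch of a `CREATE TABLE` statement that SQLite accepts, checked with `PRAGMA table_info`. It is only a reference for comparing against generated SQL: it runs on an in-memory database and uses lower-case column names chosen to match the expected result above. In SQLite, `AUTOINCREMENT` is only allowed on an `INTEGER PRIMARY KEY` column, and the MySQL spelling `AUTO_INCREMENT` is rejected as a syntax error.

```python
import sqlite3

# In-memory database: nothing is written to disk.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    """
    CREATE TABLE characters (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        age INTEGER NOT NULL,
        profession TEXT NOT NULL
    )
    """
)

# Each row is (cid, name, type, notnull, dflt_value, pk).
schema = demo_conn.execute("PRAGMA table_info(characters);").fetchall()
for column in schema:
    print(column)
demo_conn.close()
```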


In [21]:
generation_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=False,
    # temperature=.7,
    # top_p=.8,
    # top_k=20,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

In [22]:
characters = """characters(Id (primary key autoincrement), Name (str not null), Age (int not null), Profession (str not null))"""
characters20 = """characters20(Id (primary key autoincrement), Name (str not null), Age (int not null), Profession (str not null))"""
skills = """skills(Id (primary key autoincrement), Name (str not null), Profession (str not null))"""

template_create = """
This is the table:
{table}
This is the text:
Write the query to create the table in sqlite3.
```
"""

template_refine = """
This is the table:
{table}
This is the text:
Write the query to create the table in sqlite3.
This is the reference:
```sql
{reference}
```
The reference SQL may be correct or incorrect.
If the reference SQL is correct and written in sqlite3 format, just say 'It is correct. '.
If the reference SQL is incorrect, modify the reference SQL and output the correct SQLite.
"""

tables = [characters, characters20, skills]
# create sqlite cursor
conn = sqlite3.connect("lab5.db")
cursor = conn.cursor()

In [38]:
def generate_table(table):
    prompt_text = template_create.format(table=table)
    input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to("cuda")

    generation_output = llm.generate(
        input_ids=input_ids,
        generation_config=generation_config,
    )
    generation_text = tokenizer.decode(generation_output[0][len(input_ids[0]) :])

    matches = re.findall(
        r"CREATE TABLE .*?\)", generation_text, re.DOTALL | re.IGNORECASE
    )
    if len(matches) == 0:
        print(generation_text)
        print("No matches found")
        return
    generate_table_cmd = matches[0]

    # # last match
    # prompt_text = template_refine.format(table=table, reference=matches[0])
    # input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to("cuda")

    # generation_output = llm.generate(
    #     input_ids=input_ids,
    #     generation_config=generation_config,
    # )
    # generation_text = tokenizer.decode(generation_output[0])
    # print(len(generation_text) - len(prompt_text))
    # print("-" * 80)

    # # extract from the CREATE TABLE to ;
    # matches = re.findall(
    #     r"CREATE TABLE .*?\)", generation_text, re.DOTALL | re.IGNORECASE
    # )
    # print(matches[-1])

    # Execute the generated statement; print SQL errors instead of stopping the loop.
    try:
        cursor.execute(generate_table_cmd)
        conn.commit()
    except Exception as e:
        print(e)


# clear all tables
cursor.execute("DROP TABLE IF EXISTS characters")
cursor.execute("DROP TABLE IF EXISTS characters20")
cursor.execute("DROP TABLE IF EXISTS skills")
conn.commit()

for table in tqdm_notebook(tables):
    generate_table(table)

In [39]:
print(cursor.execute("PRAGMA table_info(characters);").fetchall())
print(cursor.execute("PRAGMA table_info(characters20);").fetchall())
print(cursor.execute("PRAGMA table_info(skills);").fetchall())

[(0, 'Id', 'INTEGER', 0, None, 1), (1, 'Name', 'TEXT', 1, None, 0), (2, 'Age', 'INTEGER', 1, None, 0), (3, 'Profession', 'TEXT', 1, None, 0)]
[(0, 'Id', 'INTEGER', 0, None, 1), (1, 'Name', 'TEXT', 1, None, 0), (2, 'Age', 'INTEGER', 1, None, 0), (3, 'Profession', 'TEXT', 1, None, 0)]
[(0, 'Id', 'INTEGER', 0, None, 1), (1, 'Name', 'TEXT', 1, None, 0), (2, 'Profession', 'TEXT', 1, None, 0)]
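
The pattern `CREATE TABLE .*?\)` used above stops at the first closing parenthesis, so it would truncate a statement whose column types contain parentheses, such as `VARCHAR(10)`. A possible alternative, sketched here under the assumption that the model wraps its answer in a fenced block and leaves a `</s>` marker in the decoded text, is to extract the body of the fence and check it with `sqlite3.complete_statement`. The helper name `extract_sql` and the sample text are hypothetical.

```python
import re
import sqlite3

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def extract_sql(generation_text):
    """Return the SQL in `generation_text`, preferring the first fenced block."""
    # Drop the end-of-sequence marker if the tokenizer left it in the text.
    text = generation_text.replace("</s>", "")
    pattern = re.escape(FENCE) + r"(?:sql)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    sql = fenced.group(1) if fenced else text
    return sql.strip()


# Hypothetical model output with a parenthesised column type.
sample = (
    f"{FENCE}sql\n"
    "CREATE TABLE demo (id INTEGER PRIMARY KEY, size VARCHAR(10));\n"
    f"{FENCE}</s>"
)
statement = extract_sql(sample)
print(statement)

# complete_statement only checks that the text ends a statement with a semicolon;
# it does not validate the SQL itself.
print(sqlite3.complete_statement(statement))

# The extracted statement runs as-is on an in-memory database.
check_conn = sqlite3.connect(":memory:")
check_conn.execute(statement)
created = check_conn.execute("PRAGMA table_info(demo);").fetchall()
check_conn.close()
```

The trade-off is that this relies on the model producing a fence; falling back to the raw text covers the case where it does not.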


## 2. Populate tables using llm

You need to generate and execute SQL queries to fill in "characters" and "characters20":

- For both, the age must be constrained between 18 and 50 (we'll assess whether the constraint is met later).
- For "characters", generate 10 rows using the prompt. Apply the prompt 10 times (you should end up with 100 rows).
- For "characters20", generate 20 rows using the prompt. Apply the prompt 5 times (you should also get 100 rows at the end).

For example, running `cursor.execute("SELECT * FROM characters").fetchall()` should give this result (with 100 rows and perhaps different values ...):

```
[(1, 'Alice', 25, 'Artist'),
 (2, 'Bob', 35, 'Engineer'),
  ...
 (99, 'Ian', 32, 'Architect'),
 (100, 'Jane', 18, 'Dancer')]
```

<font color='red'> BE CAREFUL: If your generation configuration doesn't include sampling, you'll always have the same rows.</font>

<font color='green'> BONUS: In section 3, we'll compare the number of duplicated rows between the two methods. Do you have a better strategy for minimizing the number of duplicated rows? Give it a try! (create another table for this purpose) </font>
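
One possible strategy for the bonus, sketched here as an assumption rather than the expected answer, is to let the database reject duplicates: declare a `UNIQUE` constraint over the non-id columns and run the generated statement as `INSERT OR IGNORE`. The table name `characters_unique` and the sample statement are hypothetical, and the keyword rewrite is done in code so the generated query is still executed automatically.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    """
    CREATE TABLE characters_unique (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        age INTEGER NOT NULL,
        profession TEXT NOT NULL,
        UNIQUE (name, age, profession)
    )
    """
)

# Stand-in for a generated statement that contains a repeated row.
generated_insert = """
INSERT INTO characters_unique (name, age, profession) VALUES
('Alice', 25, 'Engineer'),
('Bob', 30, 'Doctor'),
('Alice', 25, 'Engineer');
"""

# Rewrite only the first keyword so rows violating UNIQUE are skipped
# instead of aborting the whole statement.
safe_insert = generated_insert.replace("INSERT INTO", "INSERT OR IGNORE INTO", 1)
demo_conn.execute(safe_insert)
demo_conn.commit()

unique_rows = demo_conn.execute(
    "SELECT name, age, profession FROM characters_unique ORDER BY id"
).fetchall()
print(unique_rows)
demo_conn.close()
```

With this approach the table may end up with fewer rows than requested, so the prompt would have to be applied until the target count is reached.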


In [65]:
generation_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,  # must be a strictly positive float when do_sample=True; tune this value
    # top_p=.8,
    # top_k=20,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

In [79]:
template_insert = """
You have this SQLite table:
{table}

Write the SQLite3 query to insert {n_lines} populated lines of different values (not ID) into the table.
The age should be between 18 and 50.

Only use sqlite3 syntax, you are not allowed to use any other language syntax.
Just write the query and NOTHING else. Do not execute it.

This is an example of insertion query in sqlite3:
```sql
INSERT INTO example_table (column1, column2) VALUES ('value1', 'value2');
```
"""

template_insert_refine = """
You have this SQLite table:
{table}

You have this SQLite code to insert {n_lines} lines into the table:
```sql
{reference}
```

The SQLite code may be correct or incorrect.
If the SQLite code is correct and written in sqlite3 format, just say 'IT IS CORRECT' and say again the correct SQLite code.
If the SQLite code is incorrect, modify the SQLite code and write the SQLite3 query to insert {n_lines} populated lines of different values (not ID) into the table. Just write the query, do not execute it.
"""


def apply_insert_prompt(table, n_iter, n_lines):
    for _ in tqdm_notebook(range(n_iter)):
        prompt_text = template_insert.format(table=table, n_lines=n_lines)
        input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to("cuda")

        generation_output = llm.generate(
            input_ids=input_ids,
            generation_config=generation_config,
        )
        generation_text = tokenizer.decode(generation_output[0][len(input_ids[0]) :])
        print("FIRST GENERATION")
        print(generation_text)
        print("-" * 80)

        # matches = re.findall(
        #     r"```sql .*?```", generation_text, re.DOTALL | re.IGNORECASE
        # )
        matches = re.findall(
            r"INSERT INTO.*?;", generation_text, re.DOTALL | re.IGNORECASE
        )
        if len(matches) == 0:
            raise Exception("No matches found")
        print("FIRST MATCH")
        print(matches)
        print("-" * 80)

        for match in matches:
            try:
                cursor.execute(match)
                conn.commit()
            except Exception as e:
                print(e)

        # prompt_text = template_insert_refine.format(
        #     table=table, reference=matches[0], n_lines=n_lines
        # )
        # input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to("cuda")

        # generation_output = llm.generate(
        #     input_ids=input_ids,
        #     generation_config=generation_config,
        # )
        # generation_text = tokenizer.decode(generation_output[0][len(input_ids[0]) :])
        # print("SECOND GENERATION")
        # print(generation_text)
        # print("-" * 80)

        # matches = re.findall(
        #     r"INSERT INTO.*?;", generation_text, re.DOTALL | re.IGNORECASE
        # )
        # print("SECOND MATCH")
        # print(matches)
        # print("-" * 80)

        # if len(matches) == 0:
        #     raise Exception("No matches found")

        # for match in matches:
        #     try:
        #         cursor.execute(match)
        #         conn.commit()
        #     except Exception as e:
        #         print(e)


cursor.execute("DELETE FROM characters")
apply_insert_prompt(characters, n_iter=10, n_lines=10)

FIRST GENERATION
```sql
INSERT INTO characters (Name, Age, Profession) VALUES
('Alice', 25, 'Engineer'),
('Bob', 30, 'Doctor'),
('Charlie', 35, 'Teacher'),
('David', 40, 'Artist'),
('Eve', 45, 'Writer'),
('Frank', 50, 'Musician'),
('Grace', 18, 'Chef'),
('Helen', 20, 'Nurse'),
('Ivy', 22, 'Designer'),
('Jack', 28, 'Actor');
```</s>
--------------------------------------------------------------------------------
FIRST MATCH
["INSERT INTO characters (Name, Age, Profession) VALUES\n('Alice', 25, 'Engineer'),\n('Bob', 30, 'Doctor'),\n('Charlie', 35, 'Teacher'),\n('David', 40, 'Artist'),\n('Eve', 45, 'Writer'),\n('Frank', 50, 'Musician'),\n('Grace', 18, 'Chef'),\n('Helen', 20, 'Nurse'),\n('Ivy', 22, 'Designer'),\n('Jack', 28, 'Actor');"]
--------------------------------------------------------------------------------
FIRST GENERATION
```sql
INSERT INTO characters (Name, Age, Profession) VALUES ('Alice', 25, 'Engineer'), ('Bob', 30, 'Doctor'), ('Charlie', 35, 'Artist'), ('David', 40, 'Teache

In [81]:
cursor.execute("DELETE FROM characters20")
apply_insert_prompt(characters20, n_iter=5, n_lines=20)

FIRST GENERATION
```sql
INSERT INTO characters20 (Name, Age, Profession) VALUES ('John Doe', 25, 'Engineer'), ('Jane Doe', 30, 'Doctor'), ('Alice Smith', 45, 'Teacher'), ('Bob Johnson', 20, 'Artist'), ('Charlie Brown', 35, 'Writer'), ('David Wilson', 40, 'Musician'), ('Eve Davis', 50, 'Nurse'), ('Frank Miller', 25, 'Chef'), ('Grace Johnson', 35, 'Artist'), ('Helen Davis', 40, 'Teacher'), ('Ivy Smith', 50, 'Engineer'), ('Jack Brown', 30, 'Writer'), ('Karen Wilson', 45, 'Musician'), ('Linda Davis', 20, 'Nurse'), ('Mason Miller', 35, 'Chef'), ('Nancy Johnson', 40, 'Teacher'), ('Oscar Brown', 50, 'Artist'), ('Peggy Davis', 30, 'Writer'), ('Quincy Smith', 45, 'Musician'), ('Rachel Johnson', 20, 'Nurse'), ('Samuel Miller', 35, 'Chef'), ('Tina Davis', 40, 'Teacher'), ('Ursula Brown', 50, 'Artist');
```</s>
--------------------------------------------------------------------------------
FIRST MATCH
["INSERT INTO characters20 (Name, Age, Profession) VALUES ('John Doe', 25, 'Engineer'), ('Jane D

In [82]:
print(
    f"Number of rows in characters: {cursor.execute('SELECT COUNT(*) FROM characters;').fetchone()[0]}"
)
print(
    f"Number of rows in characters20: {cursor.execute('SELECT COUNT(*) FROM characters20;').fetchone()[0]}"
)

Number of rows in characters: 101
Number of rows in characters20: 104
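
Both counts above exceed the 100-row target, which suggests that some generations contained more rows than requested. A minimal sketch of one way to enforce the count, assuming the default transaction handling of Python's `sqlite3` module: execute the statement, compare `cursor.rowcount` with the expected number, and roll back on a mismatch. The helper name `insert_exactly` and the `people` table are hypothetical.

```python
import sqlite3


def insert_exactly(conn, sql, expected_rows):
    """Run `sql` and keep its effect only if it inserted exactly `expected_rows` rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
    except sqlite3.Error as exc:
        conn.rollback()
        print(f"rejected (SQL error): {exc}")
        return False
    if cur.rowcount != expected_rows:
        conn.rollback()  # undo the rows inserted by this statement
        print(f"rejected: {cur.rowcount} rows instead of {expected_rows}")
        return False
    conn.commit()
    return True


demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE people (name TEXT NOT NULL, age INTEGER NOT NULL)")
demo_conn.commit()

ok_two = insert_exactly(
    demo_conn, "INSERT INTO people VALUES ('Alice', 25), ('Bob', 30);", 2
)
ok_three = insert_exactly(demo_conn, "INSERT INTO people VALUES ('Eve', 45);", 3)
row_total = demo_conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(ok_two, ok_three, row_total)
demo_conn.close()
```

A rejected generation can then simply be retried, which keeps the final row count predictable.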


## 3. Explore our tables using llm.

First, you need to generate and execute SQL queries that indicate the number of duplicate rows (ignoring ids) in each characters table. To make things easier, we only ask for the count of each duplicated row.

Here is an example of the expected result:

```
[(2,), (7,), (5,), (2,), (2,), (3,), (2,), (2,), (2,), (2,), (2,), (2,), (2,)]
```

<font color='green'> BONUS: Generate a query that returns the total count of duplicated rows. You may need to do this in several steps.</font>
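
As a reference for checking generated queries, here is a hand-written sketch of both counts on a small in-memory table with made-up rows. Grouping on the non-id columns and filtering with `HAVING COUNT(*) > 1` keeps only the duplicated groups, and wrapping that query in a `SUM` gives the total.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE characters "
    "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER, profession TEXT)"
)
demo_conn.executemany(
    "INSERT INTO characters (name, age, profession) VALUES (?, ?, ?)",
    [
        ("Alice", 25, "Engineer"),
        ("Bob", 30, "Doctor"),
        ("Alice", 25, "Engineer"),
        ("Eve", 45, "Writer"),
        ("Bob", 30, "Doctor"),
        ("Alice", 25, "Engineer"),
    ],
)

# One count per group of identical (name, age, profession) rows;
# groups with a single row are filtered out by HAVING.
per_group = demo_conn.execute(
    """
    SELECT COUNT(*) FROM characters
    GROUP BY name, age, profession
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC
    """
).fetchall()

# Total number of rows that belong to a duplicated group.
total_duplicated = demo_conn.execute(
    """
    SELECT COALESCE(SUM(n), 0) FROM (
        SELECT COUNT(*) AS n FROM characters
        GROUP BY name, age, profession
        HAVING COUNT(*) > 1
    )
    """
).fetchone()[0]
print(per_group, total_duplicated)
demo_conn.close()
```

Without the `HAVING` clause the result also lists every unique row with a count of 1.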

Secondly, you need to generate and execute SQL queries that remove duplicate rows. To make things easier, you do not have to keep the original of each duplicated row. For example, given the list [a, b, a, c], we ask you to remove every a, which gives [b, c].

<font color='green'> BONUS: Generate a query that deletes the duplicates but keeps the original row: [a, b, a, c] -> [a, b, c] </font>
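
The two deletion variants can be sketched on the [a, b, a, c] example itself. This is a hand-written reference on a hypothetical single-column table `items`; for the characters tables the grouping key would be the name, age and profession columns instead of `value`.

```python
import sqlite3

ROWS = [("a",), ("b",), ("a",), ("c",)]


def fresh_table():
    """Return an in-memory connection holding the rows [a, b, a, c]."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany("INSERT INTO items (value) VALUES (?)", ROWS)
    return conn


# Variant 1: remove every row whose value appears more than once.
conn_all = fresh_table()
conn_all.execute(
    """
    DELETE FROM items
    WHERE value IN (
        SELECT value FROM items GROUP BY value HAVING COUNT(*) > 1
    )
    """
)
without_duplicates = [
    v for (v,) in conn_all.execute("SELECT value FROM items ORDER BY id")
]

# Variant 2: keep the first occurrence (smallest id) of each value.
conn_keep = fresh_table()
conn_keep.execute(
    "DELETE FROM items WHERE id NOT IN (SELECT MIN(id) FROM items GROUP BY value)"
)
keep_first = [v for (v,) in conn_keep.execute("SELECT value FROM items ORDER BY id")]

print(without_duplicates, keep_first)
conn_all.close()
conn_keep.close()
```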

Finally, you need to generate and execute SQL queries that check whether the age constraint is respected.

<font color='red'> BE CAREFUL: Do each step for every characters table you have.</font>
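
For the age check, a simple reference query lists the rows that violate the constraint, so an empty result means the constraint holds. The sketch below uses made-up rows, including one deliberate violation, on an in-memory table; `BETWEEN` is inclusive on both bounds.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE characters "
    "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER, profession TEXT)"
)
demo_conn.executemany(
    "INSERT INTO characters (name, age, profession) VALUES (?, ?, ?)",
    [
        ("Alice", 25, "Engineer"),
        ("Frank", 50, "Musician"),  # upper bound, allowed
        ("Grace", 18, "Chef"),  # lower bound, allowed
        ("Zed", 17, "Intern"),  # made-up row that breaks the constraint
    ],
)

# Rows outside the inclusive range [18, 50].
violations = demo_conn.execute(
    "SELECT name, age FROM characters WHERE age NOT BETWEEN 18 AND 50"
).fetchall()
print(violations)
demo_conn.close()
```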


In [93]:
template_query = """
You have this SQLite table:
{table}

Write the SQLite3 query to satisfy this request: {action}

The query should be written in sqlite3 syntax, you are not allowed to use any other language syntax.
"""

actions = [
    "indicate the number of duplicate rows (without ids)",
    "indicate the total count of duplicate rows (without ids)",
]


def apply_query_prompt(t_name, table, action):
    prompt_text = template_query.format(table=table, action=action)
    input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to("cuda")

    generation_output = llm.generate(
        input_ids=input_ids,
        generation_config=generation_config,
    )
    generation_text = tokenizer.decode(generation_output[0][len(input_ids[0]) :])
    print("GENERATION")
    print(generation_text)
    print("-" * 80)

    # Keep only the body of the first fenced sql block (capture group 1).
    matches = re.search(r"```sql(.*?)```", generation_text, re.DOTALL)
    if matches is None:
        raise Exception("No matches found")
    match = matches.group(1).strip()
    print("MATCHES")
    print(match)
    print("-" * 80)

    try:
        print(f"Table: {t_name}, Action: {action}\n{cursor.execute(match).fetchall()}")
    except Exception as e:
        print(e)


for action in actions:
    for t_name, table in {
        "characters": characters,
        "characters20": characters20,
    }.items():
        apply_query_prompt(t_name, table, action)

GENERATION
```sql
SELECT COUNT(*) FROM characters GROUP BY Name, Age, Profession
```

This query will count the number of duplicate rows based on the combination of Name, Age, and Profession columns. The `GROUP BY` clause groups the rows by these columns, and the `COUNT(*)` function counts the number of rows in each group. If a group has more than one row, it means there are duplicates.</s>
--------------------------------------------------------------------------------
MATCHES
SELECT COUNT(*) FROM characters GROUP BY Name, Age, Profession
--------------------------------------------------------------------------------
Table: characters, Action: indicate the number of duplicate rows (without ids)
[(1,), (2,), (4,), (1,), (1,), (3,), (3,), (1,), (1,), (1,), (2,), (1,), (3,), (1,), (1,), (1,), (1,), (3,), (3,), (1,), (1,), (1,), (1,), (5,), (1,), (1,), (1,), (3,), (3,), (1,), (1,), (1,), (1,), (2,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,),

## 4. More than one table with llm.

First, choose your best characters table (with the largest number of rows).

Second, generate and execute an SQL query that returns the set of unique professions in the table.

Third, generate and execute an SQL query that populates the skills table from this set of unique professions.

Fourth, generate and execute an SQL query that verifies that the professions in the skills table exist in your characters table.

Finally, generate and execute an SQL query that returns the name of the skills associated with a character name (by profession).
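
The query steps above can be sketched by hand on a tiny in-memory database, as a reference for judging what the model generates. All rows are made up, including one skill whose profession is deliberately absent from `characters`; the tables are filled directly here, whereas the lab asks for generated `INSERT` queries.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, profession TEXT);
    CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT, profession TEXT);
    INSERT INTO characters (name, age, profession) VALUES
        ('Alice', 25, 'Engineer'), ('Bob', 30, 'Doctor'), ('Eve', 45, 'Engineer');
    INSERT INTO skills (name, profession) VALUES
        ('Problem solving', 'Engineer'), ('Diagnosis', 'Doctor'), ('Welding', 'Blacksmith');
    """
)

# Set of unique professions in the characters table.
professions = demo_conn.execute(
    "SELECT DISTINCT profession FROM characters ORDER BY profession"
).fetchall()

# Skills whose profession has no match in characters (empty when all exist).
orphan_skills = demo_conn.execute(
    """
    SELECT name, profession FROM skills
    WHERE profession NOT IN (SELECT DISTINCT profession FROM characters)
    """
).fetchall()

# Skills associated with a character name, joined through the profession.
skills_of_alice = demo_conn.execute(
    """
    SELECT s.name FROM skills AS s
    JOIN characters AS c ON c.profession = s.profession
    WHERE c.name = ?
    """,
    ("Alice",),
).fetchall()

print(professions, orphan_skills, skills_of_alice)
demo_conn.close()
```

The join is on the profession text, so it only works if the skills table spells each profession exactly as the characters table does.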
