# Genetic Mutations
In this example we experiment with a dataset of genetic mutations. We register it with SwellDB as an external data source, then build a new table that pairs each mutation with its associated disease and a link to a study describing the mutation and its effects.

In [1]:
import os
import logging

import pandas as pd
import pyarrow as pa
import datafusion

# Data Exploration - Pandas
First, let's take a look at the dataset.

In [2]:
# Load as Pandas Dataframe
df = pd.read_csv("../tests/test_files/mutations.csv")

In [3]:
df

,sample_id,mutation
0,S001,BRCA1 c.68_69delAG
1,S002,CFTR p.F508del
2,S003,HTT CAG expansion
3,S004,LDLR c.1060+1G>A
4,S005,TP53 p.R175H
5,S006,APOE ε4/ε4
6,S007,MYH7 p.R453C
7,S008,EGFR p.L858R
8,S009,KRAS p.G12D
9,S010,TTR V30M


# Analytical Queries - DataFusion
Let's load the dataset into DataFusion and run some queries.

In [4]:
sc = datafusion.SessionContext()

In [5]:
sc.register_csv("mutation", "../tests/test_files/mutations.csv")

In [6]:
sc.sql(
    """
    SELECT *
    FROM mutation
    """
)

sample_id,mutation
S001,BRCA1 c.68_69delAG
S002,CFTR p.F508del
S003,HTT CAG expansion
S004,LDLR c.1060+1G>A
S005,TP53 p.R175H
S006,APOE ε4/ε4
S007,MYH7 p.R453C
S008,EGFR p.L858R
S009,KRAS p.G12D
S010,TTR V30M


## Querying beyond the available data
Let's assume that we would like to run the following query on top of the mutations table.

```sql
SELECT sample_id, mutation, associated_disease, study_link
FROM mutation
```

The last two columns, `associated_disease` and `study_link`, are not present in the dataset. This is where SwellDB comes into play: it can generate these columns using an LLM and a search engine.
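As a quick check of which requested columns the dataset already provides, the sketch below compares the query's column list against the CSV header. It uses only the standard library and a few rows copied inline from the dataset above, so it runs without the CSV file; the variable names are illustrative.

```python
import csv
import io

# A few rows copied from the mutations dataset shown above (inlined so the
# example does not depend on the CSV file being present).
CSV_TEXT = """sample_id,mutation
S001,BRCA1 c.68_69delAG
S002,CFTR p.F508del
S003,HTT CAG expansion
"""

# Columns requested by the query in this section.
requested = ["sample_id", "mutation", "associated_disease", "study_link"]

# Read only the header row to learn which columns the dataset provides.
reader = csv.reader(io.StringIO(CSV_TEXT))
available = next(reader)

# Columns that have to come from somewhere else (here: SwellDB).
missing = [col for col in requested if col not in available]
print(missing)
```

The two names it reports are the ones the dataset cannot supply on its own.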

# SwellDB

In [7]:
from swelldb import SwellDB, OpenAILLM
from swelldb.swelldb import Mode

# Include some operators
from swelldb.table_plan.table.physical.dataset_table import DatasetTable
from swelldb.table_plan.table.physical.search_engine_table import SearchEngineTable

In [8]:
# Both API keys are read from environment variables
swelldb: SwellDB = SwellDB(
    llm=OpenAILLM(api_key=os.environ["OPENAI_API_KEY"]),
    serper_api_key=os.environ["SERPER_API_KEY"],
)

In [11]:
tbl = (
    swelldb.table_builder()
    .set_table_name("mutations")
    .set_content("A table that contains genetic mutations")
    .set_schema("sample_id str, mutation str, associated_disease str, study_link str")
    .set_base_columns(["mutation"])
    # Add external data sources
    .add_csv_file("mutations", "../tests/test_files/mutations.csv")
    .set_table_gen_mode(Mode.OPERATORS)
    .set_operators([DatasetTable, SearchEngineTable])
    .set_chunk_size(20)
).build()

In [12]:
tbl.explain()

SearchEngineTable[schema=['study_link', 'associated_disease', 'mutation']
--DatasetTable[schema=['mutation', 'sample_id']"]
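One way to read this plan: the inner `DatasetTable` supplies `sample_id` and `mutation` from the CSV file, and the outer `SearchEngineTable` adds `study_link` and `associated_disease` for each value of the base column `mutation`. The following is a conceptual sketch of that layering in plain Python, not SwellDB's implementation; `enrich_rows` and `fake_lookup` are hypothetical names, and the lookup is a hard-coded stand-in for the search engine and LLM calls.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def enrich_rows(
    base_rows: List[Row],
    base_column: str,
    lookup: Callable[[str], Row],
) -> List[Row]:
    """Extend each base row with the columns returned by lookup(base value)."""
    enriched = []
    for row in base_rows:
        extra = lookup(row[base_column])
        # Base columns win on a name clash, so dataset values are never overwritten.
        enriched.append({**extra, **row})
    return enriched


def fake_lookup(mutation: str) -> Row:
    """Hypothetical stand-in for a search engine + LLM call; values are hard-coded."""
    known = {
        "CFTR p.F508del": {"associated_disease": "Cystic fibrosis"},
        "HTT CAG expansion": {"associated_disease": "Huntington's disease"},
    }
    return known.get(mutation, {"associated_disease": ""})


# Inner step: rows as the dataset provides them.
dataset_rows = [
    {"sample_id": "S002", "mutation": "CFTR p.F508del"},
    {"sample_id": "S003", "mutation": "HTT CAG expansion"},
]

# Outer step: add generated columns, keyed on the base column.
result = enrich_rows(dataset_rows, "mutation", fake_lookup)
for row in result:
    print(row["sample_id"], row["mutation"], row["associated_disease"])
```

Keying the lookup on the base column means the generated values depend only on `mutation`, while `sample_id` is carried through from the dataset untouched.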


In [13]:
table = tbl.materialize()

In [14]:
sc.register_dataset("mutation_swell", pa.dataset.dataset(table))

In [17]:
sc.sql(""" 
SELECT sample_id, mutation, study_link, associated_disease
FROM mutation_swell
""")

sample_id,mutation,study_link,associated_disease
S001,BRCA1 c.68_69delAG,https://www.ncbi.nlm.nih.gov/clinvar/variation/17662/,Hereditary breast and ovarian cancer
S002,CFTR p.F508del,https://www.ncbi.nlm.nih.gov/clinvar/RCV000007523/,Cystic fibrosis
S003,HTT CAG expansion,https://pmc.ncbi.nlm.nih.gov/articles/PMC2668007/,Huntington's disease
S004,LDLR c.1060+1G>A,https://www.ncbi.nlm.nih.gov/clinvar/RCV000238168/,Familial hypercholesterolemia
S005,TP53 p.R175H,https://www.ncbi.nlm.nih.gov/clinvar/variation/VCV000012374,Various cancers
S006,APOE ε4/ε4,https://www.nih.gov/news-events/nih-research-matters/study-defines-major-genetic-form-alzheimer-s-disease,Alzheimer's disease
S007,MYH7 p.R453C,https://www.ncbi.nlm.nih.gov/clinvar/RCV000230258/,Hypertrophic cardiomyopathy
S008,EGFR p.L858R,https://pmc.ncbi.nlm.nih.gov/articles/PMC11632430/,Lung cancer
S009,KRAS p.G12D,https://pmc.ncbi.nlm.nih.gov/articles/PMC9562007/,Pancreatic cancer
S010,TTR V30M,https://arci.org/about-amyloidosis/hereditary-attr-amyloidosis/,Hereditary ATTR amyloidosis
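
Since `study_link` is generated rather than read from the dataset, a light sanity check on the values can be useful before relying on them. The sketch below uses Python's standard `urllib.parse` to confirm that a few of the links shown above are well-formed HTTPS URLs and to list their hosts. It does not fetch the pages or verify their content, and `is_https_url` is an illustrative helper.

```python
from urllib.parse import urlparse

# A few study_link values copied from the result above.
links = {
    "S002": "https://www.ncbi.nlm.nih.gov/clinvar/RCV000007523/",
    "S003": "https://pmc.ncbi.nlm.nih.gov/articles/PMC2668007/",
    "S010": "https://arci.org/about-amyloidosis/hereditary-attr-amyloidosis/",
}


def is_https_url(value: str) -> bool:
    """Return True if value parses as an https URL with a non-empty host."""
    parsed = urlparse(value)
    return parsed.scheme == "https" and bool(parsed.netloc)


for sample_id, link in links.items():
    # Print the sample, whether the link is well-formed, and the host it points to.
    print(sample_id, is_https_url(link), urlparse(link).netloc)
```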
