# Quickstart: Building a Semantic Layer with Intugle

This notebook provides a quick introduction to this project. You'll learn how to use its key features to automatically build a semantic layer over your data.

**What is a Semantic Layer?**

A semantic layer is a business-friendly representation of your data. It hides the complexity of the underlying data sources and provides a unified view of your data using familiar business terms. This makes it easier for both business users and data teams to understand and query the data, accelerating data-driven insights.

**Who is this for?**

This tool is designed for both **data teams** and **business teams**. 

* **Data teams** can use it to automate data profiling, schema discovery, and documentation, significantly accelerating their workflow.
* **Business teams** can use it to gain a better understanding of their data and to perform self-service analytics without needing to write complex SQL queries.

**In this notebook, you will learn how to:**

1. **Profile your data:** Analyze your data sources to understand their structure, data types, and other characteristics.
2. **Generate a business glossary:** Generate a business glossary for each column, with support for industry-specific domains.
3. **Automatically predict links:** Use a Large Language Model (LLM) to automatically discover relationships (foreign keys) between tables.
4. **Generate a semantic layer:** Create YAML files that define your semantic layer.
5. **Generate SQL queries:** Use the semantic layer to generate SQL queries and retrieve data.

## 1. LLM Configuration

Before running the project, you need to configure a Large Language Model (LLM). This is used for tasks like generating business glossaries and predicting links between tables.

You can configure the LLM by setting the following environment variables:

* `LLM_PROVIDER`: The LLM provider and model to use (e.g., `openai:gpt-3.5-turbo`). The value follows LangChain's format for initializing chat models. Check out how to specify your model [here](https://python.langchain.com/docs/integrations/chat/).
* `OPENAI_API_KEY`: Your API key for the LLM provider (OpenAI in this example).

Here's an example of how to set these variables in your environment:

```bash
export LLM_PROVIDER="openai:gpt-3.5-turbo"
export OPENAI_API_KEY="your-openai-api-key"
```

Alternatively, you can set them in the notebook like this:

In [None]:
import os

os.environ["LLM_PROVIDER"] = "openai:gpt-3.5-turbo"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # Replace with your actual key


> Currently, the LangChain packages for OpenAI, Anthropic, and Gemini are installed by default. For additional models, make sure you have the corresponding integration package installed. For example, you need `langchain-deepseek` installed to use a DeepSeek model. You can find these packages here: [LangChain Chat Models](https://python.langchain.com/docs/integrations/chat/)
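
Before running the pipeline, it can help to confirm that both variables are set and that `LLM_PROVIDER` has the `provider:model` shape shown above. The following is a minimal sketch: `check_llm_config` is a hypothetical helper written for this notebook, not part of Intugle, and it only knows about the OpenAI key used in this example.

```python
import os


def check_llm_config(env=None):
    """Return a list of problems found in the LLM settings (empty if none).

    Hypothetical helper for illustration; not part of Intugle.
    """
    env = os.environ if env is None else env
    problems = []

    provider = env.get("LLM_PROVIDER", "")
    # Expect "<provider>:<model>", e.g. "openai:gpt-3.5-turbo".
    name, sep, model = provider.partition(":")
    if not (name and sep and model):
        problems.append("LLM_PROVIDER should look like 'provider:model'")

    # Only the OpenAI key from this notebook is checked here.
    if name == "openai" and not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    return problems


print(check_llm_config() or "LLM configuration looks complete")
```

An empty list means both settings look plausible. It does not verify that the key is valid.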

## 2. Data Profiling and Glossary Generation

The first step in building a semantic layer is to profile your data. This involves analyzing your data sources to understand their structure, data types, and other characteristics. This tool provides a pipeline for this purpose. It can also generate a business glossary for your data.


In [None]:
from intugle import DataSet

In [None]:
allergies_data = {
    "path": "https://raw.githubusercontent.com/Intugle/data-tools/refs/heads/main/sample_data/healthcare/allergies.csv",
    "type": "csv"
}
# Create a DataSet object and run the profiling pipeline
dataset_allergies = DataSet(allergies_data, "allergies")
dataset_allergies.run(domain="Healthcare")

# Or you can run each step manually:
# dataset_allergies.profile().identify_datatypes().identify_keys().generate_glossary(domain="Healthcare")

The `run()` method performs a series of analysis steps, including:

* **Profiling:** Calculates statistics for each column, such as distinct count, uniqueness, and completeness.
* **Datatype Identification:** Identifies the data type of each column (e.g., integer, string, datetime).
* **Key Identification:** Identifies potential primary keys.
* **Glossary Generation:** Generates a business glossary for each column using an LLM.

> The `domain` parameter helps the LLM generate a more contextual business glossary. It specifies the industry domain that the dataset belongs to (e.g., "Healthcare", "Finance", "E-commerce").
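
To make the profiling statistics listed above more concrete, here is a small standalone sketch that computes them for a toy column in plain Python. The definitions of completeness and uniqueness used here are assumptions chosen for illustration, so Intugle's exact formulas may differ. `profile_column` is a hypothetical helper, not part of the library.

```python
def profile_column(values):
    """Compute simple profiling statistics for one column.

    Assumed definitions (Intugle's exact formulas may differ):
      completeness = non-null values / all values
      uniqueness   = distinct non-null values / non-null values
    """
    non_null = [v for v in values if v is not None and v != ""]
    distinct = set(non_null)
    return {
        "count": len(values),
        "distinct_count": len(distinct),
        "completeness": len(non_null) / len(values) if values else 0.0,
        "uniqueness": len(distinct) / len(non_null) if non_null else 0.0,
    }


# Toy column with one missing value and one repeated value.
codes = ["A1", "B2", "B2", None, "C3"]
print(profile_column(codes))
```

Under these definitions, a column whose completeness and uniqueness are both 1.0 is a natural primary key candidate, which is the intuition behind the key identification step.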


In [None]:
dataset_allergies.result_to_pandas()[["business_glossary"]]

In [None]:
dataset_allergies.to_df()

In [None]:
from intugle.core.settings import settings

settings.PROJECT_BASE

## 3. Automated Link Prediction

Now that we've profiled our data, let's discover the relationships between tables. This tool uses an LLM to predict links (foreign keys) between tables.

First, we'll load a few more tables from the sample dataset.

In [None]:
table_names = ["patients", "claims", "careplans", "claims_transactions", "medications"]
datasets = [dataset_allergies]


def generate_config(table_name: str) -> dict:
    """Build the CSV source config for a table in the sample healthcare dataset."""
    return {
        "path": f"https://raw.githubusercontent.com/Intugle/data-tools/refs/heads/main/sample_data/healthcare/{table_name}.csv",
        "type": "csv"
    }


for table_name in table_names:
    config = generate_config(table_name)
    dataset = DataSet({**config}, table_name)
    datasets.append(dataset)

Now, let's run the link prediction pipeline.

In [None]:
from intugle import LinkPredictor

# Initialize the predictor
predictor = LinkPredictor(datasets)

# Run the prediction
results = predictor.predict(save=True)
results.links

The `results` object contains the predicted links between the tables. You can also visualize the relationships as a graph.


In [None]:
results.show_graph()
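
A predicted link states that the values in one table's column refer to a key in another table. The sketch below is not Intugle's LLM-based method. It is a plain value-containment check on made-up data, included only to illustrate what a foreign-key relationship looks like at the value level. The column contents and the `containment` helper are hypothetical.

```python
def containment(child_values, parent_values):
    """Fraction of distinct non-null child values that also appear in the parent column."""
    child = {v for v in child_values if v is not None}
    parent = set(parent_values)
    if not child:
        return 0.0
    return len(child & parent) / len(child)


# Toy data: every patient referenced by the allergies rows exists in patients.
patients_id = ["p1", "p2", "p3"]
allergies_patient = ["p1", "p1", "p3", None]

print(containment(allergies_patient, patients_id))  # 1.0
```

A containment close to 1.0 is consistent with a foreign key, although on its own it does not prove one. Small integer columns, for instance, often overlap by coincidence.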


## 4. The Semantic Layer

The profiling and link prediction results are used to generate YAML files, which are saved automatically. These files define the semantic layer, including the models (tables) and their relationships.

By default, these files are saved in the current working directory. You can configure this path by setting the `PROJECT_BASE` environment variable. They can also be saved manually as shown below:

In [None]:
for ds in datasets:
    ds.save_yaml()

results.save_yaml("relationships.yml")
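
To see which files were written, you can list the YAML files in the output directory. This is a minimal sketch using only the standard library. It assumes the files use a `.yml` or `.yaml` extension and sit directly inside the directory you pass in. `list_yaml_files` is a hypothetical helper, not part of Intugle.

```python
from pathlib import Path


def list_yaml_files(base_dir="."):
    """Return the names of .yml/.yaml files directly inside base_dir, sorted."""
    base = Path(base_dir)
    return sorted(
        p.name for p in base.iterdir()
        if p.is_file() and p.suffix in {".yml", ".yaml"}
    )


# "." is the default location described above; pass your PROJECT_BASE path if you set one.
print(list_yaml_files("."))
```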

Now, we can load the YAML files and create a manifest.

## 5. SQL Generation

Once you have a semantic layer, you can use the `SqlGenerator` to generate SQL queries. This allows you to query the data using business-friendly terms, without having to write complex SQL.

Let's create an ETL model to define the query we want to generate.

In [None]:

etl = {
    "name": "patient_names",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {"id": "allergies.start", "name": "start_date"},
    ],
    "filter": {
        "selections": [{"id": "claims.departmentid", "values": ["3", "20"]}],
    },
}
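
Note that this model touches three tables: two supply the selected fields and a third is used only in the filter. The sketch below extracts the referenced tables from such a dictionary, assuming (based on the example above) that every `id` has the form `table.column`. `referenced_tables` is a hypothetical helper for illustration, not part of Intugle.

```python
def referenced_tables(etl_model):
    """Collect table names from the ids in fields and filter selections."""
    ids = [field["id"] for field in etl_model.get("fields", [])]
    ids += [sel["id"] for sel in etl_model.get("filter", {}).get("selections", [])]
    # Each id is assumed to have the form "<table>.<column>".
    return sorted({item.split(".", 1)[0] for item in ids})


etl_example = {
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "allergies.start", "name": "start_date"},
    ],
    "filter": {"selections": [{"id": "claims.departmentid", "values": ["3", "20"]}]},
}

print(referenced_tables(etl_example))  # ['allergies', 'claims', 'patients']
```

Because the query spans several tables, the generated SQL has to join them, and this is where the links predicted in the previous section come in.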

Now, let's use the `SqlGenerator` to generate the SQL query.

In [None]:
from intugle.sql_generator import SqlGenerator


In [None]:

# Create a SqlGenerator
sql_generator = SqlGenerator()

# Generate the query
test_etl = sql_generator.generate_product(etl)

# Print the query
test_etl.data

In [None]:
test_etl.to_df()

In [None]:
sql_generator.plot_sources_graph()

## 6. MCP Server: Interacting with Your Semantic Layer

Now that you have a semantic layer, you can serve it as an MCP server and interact with it using natural language. The MCP server exposes your semantic layer as a set of tools that can be used by any MCP client.

### Starting the MCP Server

To start the MCP server, run the following command in your terminal:

```bash
intugle-mcp
```

This will start a server on `localhost:8000`.

### Connecting to the MCP Server

Once the server is running, you can connect to it from any MCP client. The endpoint for the MCP server is:

`http://localhost:8000/semantic_layer/mcp`
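
If a client fails to connect, a quick first check is whether anything is listening on that host and port. The sketch below uses only the Python standard library and does not speak the MCP protocol. It merely attempts a TCP connection to the host and port taken from the endpoint above. `is_listening` is a hypothetical helper.

```python
import socket
from urllib.parse import urlparse

MCP_ENDPOINT = "http://localhost:8000/semantic_layer/mcp"


def is_listening(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


print(is_listening(MCP_ENDPOINT))
```

A result of `True` only shows that some process accepts connections on the port. It does not confirm that the process is the MCP server.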

You can use a variety of MCP clients to connect to the server, such as Claude Desktop or Gemini CLI.

### Use Cases

Once connected, you can interact with your semantic layer using natural language. Here are some exciting applications:

*   **Generate SQL Queries:** Ask questions in natural language and have the MCP server generate the corresponding SQL query.
*   **Data Discovery:** Ask questions about the tables and columns in your semantic layer to better understand your data.

> To execute the generated SQL queries, you can also connect your database instance to your client as an MCP tool.

## Conclusion

You've learned how to:

* Configure your LLM provider.
* Profile your data to understand its characteristics.
* Use an LLM to automatically predict links between tables.
* Generate a semantic layer.
* Use the semantic layer to generate SQL queries.
* Interact with your semantic layer using the MCP server.

This is just a starting point. This project has many other features to explore. We encourage you to try it with your own data and see how it can help you build a powerful semantic layer.
