This notebook is an accompaniment to the blog post at https://www.ockam.io/blog/secure-llm-connections

# Setup Prerequisites

The cells in this section configure the jupyter-ai plugin so that the `%ai` and `%%ai` magics work within cells and default to the Azure OpenAI model and deployment we've set up. They then download the healthcare data we need and make it available locally to simulate our "on-prem" environment.

In [56]:
import os
%load_ext jupyter_ai_magics
%config AiMagics.default_language_model = "azure-chat-openai:%s" % os.environ.get('AZURE_OPENAI_DEPLOYMENT_NAME')

The jupyter_ai_magics extension is already loaded. To reload it, use:
  %reload_ext jupyter_ai_magics
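
The configuration cell builds the model id from `AZURE_OPENAI_DEPLOYMENT_NAME`; if that variable is unset, `os.environ.get` returns `None` and the id silently becomes `azure-chat-openai:None`. A minimal sketch of a pre-flight check follows. The helper name `missing_env_vars` is hypothetical, and the list of required variables is an assumption based only on the two names this notebook reads.

```python
import os

# Assumption: these are the only two variables this notebook reads directly.
REQUIRED_VARS = ("AZURE_OPENAI_DEPLOYMENT_NAME", "AZURE_OPENAI_ENDPOINT")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report anything missing before loading the magics.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this before `%load_ext jupyter_ai_magics` surfaces a misconfigured environment early instead of at the first `%%ai` prompt.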


In [2]:
! curl -L -o ~/notebooks/heart-disease.zip\
  https://www.kaggle.com/api/v1/datasets/download/oktayrdeki/heart-disease && unzip ~/notebooks/heart-disease.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  567k  100  567k    0     0   254k      0  0:00:02  0:00:02 --:--:--  700k
Archive:  /home/jovyan/notebooks/heart-disease.zip
  inflating: heart_disease.csv       
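
The `unzip` call above extracts into the current working directory, so the later `pd.read_csv('~/notebooks/heart_disease.csv')` only finds the file when the notebook runs from `~/notebooks`. As a sketch of an alternative that does not depend on the working directory, the standard library's `zipfile` module can extract into an explicit folder. The helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(Path(zip_path).expanduser()) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For this notebook the call would be `extract_archive("~/notebooks/heart-disease.zip", "~/notebooks")`, assuming the download step above has already completed.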


# Load data

We'll now load the local health data into a dataframe and show a sample so you can see the format.

In [32]:
import pandas as pd
df = pd.read_csv('~/notebooks/heart_disease.csv')
df.head(5)

| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56.0 | Male | 153.0 | 155.0 | High | Yes | Yes | No | 24.991591 | Yes | ... | No | High | Medium | 7.633228 | Medium | 342.0 | NaN | 12.969246 | 12.38725 | No |
| 1 | 69.0 | Female | 146.0 | 286.0 | High | No | Yes | Yes | 25.221799 | No | ... | No | Medium | High | 8.744034 | Medium | 133.0 | 157.0 | 9.355389 | 19.298875 | No |
| 2 | 46.0 | Male | 126.0 | 216.0 | Low | No | No | No | 29.855447 | No | ... | Yes | Low | Low | 4.44044 | Low | 393.0 | 92.0 | 12.709873 | 11.230926 | No |
| 3 | 32.0 | Female | 122.0 | 293.0 | High | Yes | Yes | No | 24.130477 | Yes | ... | Yes | Low | High | 5.249405 | High | 293.0 | 94.0 | 12.509046 | 5.961958 | No |
| 4 | 60.0 | Male | 166.0 | 242.0 | Low | Yes | Yes | Yes | 20.486289 | Yes | ... | No | Low | High | 7.030971 | High | 263.0 | 154.0 | 10.381259 | 8.153887 | No |


Now that the data is in a dataframe, we'll let Jupyter interpolate it into our prompt to the AI service:

In [4]:
%%ai 
The following is a dataset that represents health information for a group of people, refer to it for future prompts:
{df}

```markdown
## Health Information Dataset

This dataset contains health information on a group of people, focusing on various health metrics and lifestyle factors. Below is a summary of the dataset's attributes:

### Attributes

- **Age**: Age of the individual.
- **Gender**: Gender of the individual (Male/Female).
- **Blood Pressure**: Blood pressure measurement.
- **Cholesterol Level**: Cholesterol level measurement.
- **Exercise Habits**: Level of physical activity (High/Medium/Low).
- **Smoking**: Smoking status (Yes/No).
- **Family Heart Disease**: Family history of heart disease (Yes/No).
- **Diabetes**: Diabetes status (Yes/No).
- **BMI**: Body Mass Index.
- **High Blood Pressure**: Whether the individual has high blood pressure (Yes/No).
- **High LDL Cholesterol**: Whether the individual has high LDL cholesterol (Yes/No).
- **Alcohol Consumption**: Level of alcohol consumption (Low/Medium/High).
- **Stress Level**: Stress level (Low/Medium/High).
- **Sleep Hours**: Number of hours slept.
- **Sugar Consumption**: Level of sugar consumption (Low/Medium/High).
- **Triglyceride Level**: Triglyceride level measurement.
- **Fasting Blood Sugar**: Fasting blood sugar level.
- **CRP Level**: C-reactive protein level.
- **Homocysteine Level**: Homocysteine level measurement.
- **Heart Disease Status**: Current heart disease status (Yes/No).

### Sample Data (First 5 Rows)

| Age  | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI        | High Blood Pressure | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
|------|--------|----------------|-------------------|-----------------|--------|---------------------|----------|------------|---------------------|---------------------|---------------------|--------------|-------------|-------------------|---------------------|---------------------|-----------|---------------------|----------------------|
| 56.0 | Male   | 153.0          | 155.0             | High            | Yes    | Yes                 | No       | 24.99      | Yes                 | No                  | High                | Medium       | 7.63        | Medium            | 342.0               | NaN                 | 12.97     | 12.39               | No                   |
| 69.0 | Female | 146.0          | 286.0             | High            | No     | Yes                 | Yes      | 25.22      | No                  | No                  | Medium             | High         | 8.74        | Medium            | 133.0               | 157.0               | 9.36      | 19.30               | No                   |
| 46.0 | Male   | 126.0          | 216.0             | Low             | No     | No                  | No       | 29.86      | No                  | Yes                 | Low                 | Low          | 4.44        | Low               | 393.0               | 92.0                | 12.71     | 11.23               | No                   |
| 32.0 | Female | 122.0          | 293.0             | High            | Yes    | Yes                 | No       | 24.13      | Yes                 | Yes                 | Low                 | High         | 5.25        | High              | 293.0               | 94.0                | 12.51     | 5.96                | No                   |
| 60.0 | Male   | 166.0          | 242.0             | Low             | Yes    | Yes                 | Yes      | 20.49      | Yes                 | No                  | High                | High         | 7.03        | High              | 263.0               | 154.0               | 10.38     | 8.15                | No                   |

### Dataset Size
- **Total Rows**: 10,000
- **Total Columns**: 21
```
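
It helps to know what `{df}` actually expands to. The sketch below assumes the magic fills the placeholder with the frame's string form, the way Python's `str.format` would; that is an assumption about jupyter-ai, though the reply above, which reproduces only the first rows and the overall shape, is consistent with it. The `demo` frame is a hypothetical stand-in for the heart disease data.

```python
import pandas as pd

# Hypothetical stand-in for the heart disease frame: 100 rows, 3 columns.
demo = pd.DataFrame({
    "Age": range(100),
    "BMI": [20.0 + i / 10 for i in range(100)],
    "Smoking": ["Yes" if i % 2 else "No" for i in range(100)],
})

# str.format() falls back to str(demo), the same truncated view a notebook prints.
prompt = "Dataset:\n{df}".format(df=demo)

# The last line of the truncated view reports the full shape of the frame.
print(prompt.splitlines()[-1])

# To send explicit rows instead of the truncated view, serialize a sample as CSV.
sample_csv = demo.head(20).to_csv(index=False)
```

With pandas' default display settings, a large frame is rendered as a few leading and trailing rows plus a shape footer, so the model sees a preview rather than every row. Interpolating something like `sample_csv` instead gives explicit control over which rows are sent.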

# Exploration

In [5]:
%%ai -f html
Explain this dataset

In [6]:
%%ai -f html
What are some interesting observations from the data?

In [7]:
%%ai -f html
What trends, observations, or outliers exist in this data?

# How does it work?

This Jupyter notebook has been configured to send prompt requests to the Azure OpenAI API at the URL set in the `AZURE_OPENAI_ENDPOINT` environment variable. Execute the cell below to see the value of that environment variable:

In [1]:
import os
print(os.environ["AZURE_OPENAI_ENDPOINT"])

https://az-openai.17e4f5cd-3df0-4f28-842c-090e641d201b.ockam.network:443


This address may look like the fully qualified domain name of a remote service, but it actually resolves to `127.0.0.1` (i.e., `localhost`). We use a full domain name only so that a unique TLS certificate can be generated for your TCP inlet to serve.

To prove that's the case, we can use `dig` to return the DNS information for that hostname. The results will have a section that looks like this:

```
;; ANSWER SECTION:
<something>.ockam.network. 377 IN A 127.0.0.1
```

Note the `127.0.0.1`: every request to this domain goes to the local machine. Execute the cell below to observe the result yourself:

In [3]:
!dig {os.environ["AZURE_OPENAI_ENDPOINT"].replace("https://", "").replace(":443","")}


; <<>> DiG 9.18.30-0ubuntu0.24.04.2-Ubuntu <<>> az-openai.17e4f5cd-3df0-4f28-842c-090e641d201b.ockam.network
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42865
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: fe69d0eac298927c (echoed)
;; QUESTION SECTION:
;az-openai.17e4f5cd-3df0-4f28-842c-090e641d201b.ockam.network. IN A

;; ANSWER SECTION:
az-openai.17e4f5cd-3df0-4f28-842c-090e641d201b.ockam.network. 377 IN A 127.0.0.1

;; Query time: 3 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Tue Feb 04 12:12:33 UTC 2025
;; MSG SIZE  rcvd: 117
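
The same check can be sketched in Python for environments where `dig` is not installed. This uses only the standard library; the helper names `endpoint_hostname` and `resolves_to_loopback` are hypothetical, and the result depends on whatever resolver the machine is configured to use.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def endpoint_hostname(endpoint):
    """Return the host part of a URL such as the value of AZURE_OPENAI_ENDPOINT."""
    return urlparse(endpoint).hostname


def resolves_to_loopback(hostname):
    """Return True if every address the resolver returns for hostname is loopback."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = {info[4][0] for info in infos}
    return all(ipaddress.ip_address(addr).is_loopback for addr in addresses)
```

In this notebook, `resolves_to_loopback(endpoint_hostname(os.environ["AZURE_OPENAI_ENDPOINT"]))` should return `True` if the hostname resolves as shown in the `dig` output above. Parsing with `urlparse` also avoids the chained `.replace()` calls, which assume the scheme is `https://` and the port is `443`.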

