# Llama3.1 Dataset Maker for Nvidia NIM

This notebook is intended to run in a Llama3.1 NIM environment.
<br>
To set it up, please watch my video tutorial on the topic.

## Set Up NGC

In [None]:
%%bash

wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-ngc.sh -O setup-ngc
chmod +x setup-ngc
./setup-ngc

## Set Up Llama3.1 NIM

In [None]:
%%bash

export NGC_API_KEY= # paste your NGC API key here

# Log in to NGC
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

# Set up NIM cache directory
mkdir -p $HOME/.nim-cache

docker run -d --rm --name="llama" \
    --network=container:verb-workspace \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v $HOME/.nim-cache:/home/user/.nim-cache \
    -v /home/ubuntu/workspace:/workspace \
    -w /workspace \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0

# Check if NIM is up
echo "Checking if NIM is up..."
while true; do
    if curl -sf http://localhost:8000/v1/health/ready > /dev/null; then
        echo "NIM has been started successfully!"
        break
    else
        echo "NIM is not up yet. Checking again in 10 seconds..."
        sleep 10
    fi
done

## Verify that Llama is Available
Query the health endpoint to confirm that Llama3.1 is listening on port 8000 and ready to serve requests.

In [None]:
!curl localhost:8000/v1/health/ready
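
The same readiness check can also be run from Python, which is useful when later cells should wait for the model instead of failing. This is a minimal sketch: `is_ready` and `wait_until_ready` are hypothetical helper names, and it assumes the health endpoint answers with HTTP 200 once the model is loaded.

```python
import time
import urllib.error
import urllib.request


def is_ready(url="http://localhost:8000/v1/health/ready", timeout=5):
    """Return True if the endpoint answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not ready yet.
        return False


def wait_until_ready(probe=is_ready, max_wait=600, interval=10, sleep=time.sleep):
    """Poll probe() until it returns True or max_wait seconds have elapsed."""
    waited = 0
    while waited <= max_wait:
        if probe():
            return True
        sleep(interval)
        waited += interval
    return False
```

Calling `wait_until_ready()` before the dataset cell gives the loop an upper bound, so a container that never starts does not block the notebook forever.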

## Create Dataset

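The following cell chains prompt outputs to build a dataset automatically: the first prompt returns a list of countries, and each country name is then fed into follow-up prompts that ask for its capital and national food.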

In [None]:
from openai import OpenAI
import pandas as pd

# create empty dataframe
data = pd.DataFrame(columns=["country", "capital", "food"])

# specify model location
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not_used",
)

def ask_question(user_input):
    """Send a single user prompt to the model and return the reply text."""
    # specify model settings
    chat_response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": user_input}],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
        # return output as a single unit of text
        stream=False,
    )

    return chat_response.choices[0].message.content

# fetch names of all world countries
all_countries = ask_question("""
names of all countries separated by commas in an alphabetical order.
names only, with no other output
""")
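# strip whitespace and drop empty entries so stray newlines or spaces don't end up in the names
all_countries = [name.strip() for name in all_countries.split(",") if name.strip()]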

# iterate over all country names
for i, country in enumerate(all_countries):
    # fetch attributes for each country
    # strip surrounding whitespace so the CSV cells stay clean
    capital = ask_question("what is the capital city of " + country + ". just the name").strip()
    food = ask_question("what is the national food of " + country + ". just the name").strip()
    # store country and attributes in the pre-defined dataframe
    data.loc[i] = [country, capital, food]

# save CSV file in the current directory
data.to_csv("data.csv", header=None)
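
The chaining pattern in the cell above can be sketched as a small function that takes the question-asking callable as a parameter, so it can be tried without a running model. This is an illustrative sketch, not the notebook's own method: `build_rows` and `fake_ask` are hypothetical names, and `fake_ask` returns a placeholder string rather than real model output.

```python
def build_rows(countries, ask):
    """Chain prompts: each country name feeds two follow-up questions."""
    rows = []
    for country in countries:
        capital = ask(f"what is the capital city of {country}. just the name")
        food = ask(f"what is the national food of {country}. just the name")
        rows.append({
            "country": country,
            "capital": capital.strip(),
            "food": food.strip(),
        })
    return rows


def fake_ask(prompt):
    # Hypothetical stand-in for ask_question; returns a padded placeholder, not a real answer.
    return "  placeholder answer\n"


rows = build_rows(["Testland"], fake_ask)
print(rows)
```

With the real model, `pd.DataFrame(build_rows(all_countries, ask_question))` would produce the same three columns while building the frame in one step instead of row by row.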

## Workflow: All Countries Output

The next cell was used to tune the `all_countries` output used in the cell above. You don't have to run it, since the same code is already included in the dataset maker cell.

In [None]:
all_countries = ask_question("""
names of all countries separated by commas in an alphabetical order.
names only, with no other output
""")
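# strip whitespace and drop empty entries so stray newlines or spaces don't end up in the names
all_countries = [name.strip() for name in all_countries.split(",") if name.strip()]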
print(all_countries)
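
Model replies do not always follow the requested format exactly, so it can help to normalize the list before looping over it. The sketch below is one possible approach under stated assumptions: `parse_country_list` is a hypothetical helper, it treats newlines as separators, drops a trailing period, and removes duplicates. It would also split any name that itself contains a comma.

```python
def parse_country_list(raw):
    """Split a comma-separated model reply into unique, trimmed names."""
    names = []
    seen = set()
    # Treat newlines like commas in case the model puts names on separate lines.
    for part in raw.replace("\n", ",").split(","):
        name = part.strip().rstrip(".").strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Example with a messy reply: extra newline, duplicate entry, trailing period.
print(parse_country_list("Albania, Algeria,\nAndorra, Algeria."))
```

Printing the length of the parsed list is a quick sanity check before spending model calls on every entry.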