# Prototyping
## Scope
- Sort out completions using aisuite - `done!`
- Create a basic RAG-ready data structure - faiss - `todo`
- Prompt engineering (and tests) - `todo`
- Mapping with Folium - `todo`
- Tying together with streamlit - `todo`

# Completions - `aisuite`
todo:
- pivot to `.env` for secret management? Slightly awkward, as that's the current venv name - https://github.com/andrewyng/aisuite/blob/main/.env.sample
- explore alternate anthropic models - `model = 'anthropic:claude-3-5-sonnet-v2@20241022'` 
- see how the resulting issue shapes up - https://github.com/andrewyng/aisuite/issues/155

In [None]:
import aisuite as ai, toml, os
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
os.environ['ANTHROPIC_API_KEY'] = API_KEY


client = ai.Client()
model = 'anthropic:claude-3-5-haiku@20241022' 
messages = [
    {"role": "system", "content": "Respond in Pirate English."},
    {"role": "user", "content": "Tell me a joke."},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)

TypeError: Client.__init__() got an unexpected keyword argument 'proxies'

In [21]:
import anthropic, sys
print(anthropic.__version__,
      #ai.__version__,
      sys.version)
# aisuite==0.1.6 via requirements.txt

0.30.1 3.13.0 (tags/v3.13.0:60403a5, Oct  7 2024, 09:38:07) [MSC v.1941 64 bit (AMD64)]


In [None]:
# upgrading to anthropic 0.40.0 fixed this issue.
import aisuite as ai, toml, os
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
os.environ['ANTHROPIC_API_KEY'] = API_KEY


client = ai.Client()
model = 'anthropic:claude-3-5-haiku-20241022' 
messages = [
    {"role": "system", "content": "Respond in Pirate English."},
    {"role": "user", "content": "Tell me a joke."},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)

Arrr, here be a jest fer ye, matey!

Why be a pirate's favorite letter? 'R', of course! *hearty pirate laugh*

Yarrr har har! *slaps knee and takes a swig from rum bottle*


## Completions - `anthropic` directly
No longer necessary, though it did help troubleshoot issues with aisuite.


In [2]:
# let's pivot - to the anthropic API directly for now, and we can fix this down the track
# https://pypi.org/project/anthropic/
from anthropic import Anthropic
import toml
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')

client = Anthropic(
    api_key=API_KEY
)

message = client.messages.create(
    max_tokens=1024,
    system="Respond in Pirate English.",
    messages=[
        {"role": "user", "content": "Tell me a joke."},
    ],
    model="claude-3-5-haiku-20241022",
)
print(message.content)

[TextBlock(text="Arrr, here be a jest fer ye, me hearty!\n\nWhy'd the pirate make a terrible teacher? 'Cause he kept usin' his ARRRRRbitrary punishments! *hearty laugh*\n\n*slaps knee and takes a swig from a rum bottle*\n\nYarrr! That be a knee-slappin' chuckle fer ye! *winks*", type='text')]


# Explore RAG and data structures
1. First, we need to transform the data into embeddings (vectors) to search against. These vectors must come from the same embedding model we'll use at runtime. We can use anthropic, and if we do we'll eventually need to explore batching to reduce costs. Or we can introduce another library and a bunch of new models (e.g. ), which may need GPUs etc. For now, it's probably simplest to go with anthropic and a small subset of the data (e.g. 100 programs near my location).
2. We then need to implement search over whatever text the user provides, returning a number of results. I like the idea of the `faiss` library here, though this could maybe also be implemented in numpy (I'm trying to reduce the number of dependencies, and realistically the data isn't big enough to need a heavy-duty library just yet).
3. Finally, we need to add the search results as context, along with the initial user prompt, so that the response provides valid outputs. For this, I suspect some prompt engineering is required.
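
To make step 2 concrete, here is a minimal sketch of brute-force cosine-similarity search in plain numpy. It assumes the embeddings are already available as rows of a 2D array; `top_k_cosine` and the toy vectors are illustrative names and data, not part of the project. With only a few hundred rows, something like this is plausibly enough before reaching for `faiss`.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Normalise so that a dot product equals cosine similarity
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = np.linalg.norm(query_vec)
    docs_unit = doc_matrix / np.clip(doc_norms, 1e-12, None)
    query_unit = query_vec / max(query_norm, 1e-12)

    scores = docs_unit @ query_unit
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # highest similarity first
    return top, scores[top]


# Toy vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sims = top_k_cosine(np.array([1.0, 0.1]), docs, k=2)
print(idx, sims)
```

The returned indices can be used with `.iloc` on the source DataFrame to pull the matching rows as context for step 3.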

In [None]:
# let's get some test data near Wollongong
import pandas as pd, numpy as np
df = pd.read_pickle('../data/transformed_charities.pkl')
user_lat = -34.425072
user_lon = 150.893143

cols_of_interest = [
    'abn',
    'charity name',
    'how purposes were pursued',
    'total full time equivalent staff',
    'staff - volunteers',
    'Program name',
    'Classification',
    'Charity weblink',
    'location_number',
    'operating_location',
    'latitude',
    'longitude',
    'distance' # what we're about to compute
]

# compute distance - this could be too slow for a user session
def haversine_distance_on_df(row,user_lat,user_lon):
    R = 6371  # Earth's radius in kilometers
        
    # Convert to radians
    user_lat_rad = np.radians(user_lat)
    user_lon_rad = np.radians(user_lon)
    charity_lats_rad = np.radians(row['latitude'])
    charity_lons_rad = np.radians(row['longitude'])
    
    # Haversine formula
    dlat = charity_lats_rad - user_lat_rad
    dlon = charity_lons_rad - user_lon_rad
    a = np.sin(dlat/2)**2 + np.cos(user_lat_rad) * np.cos(charity_lats_rad) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# Apply the function to create a new distance column
df['distance'] = df.apply(lambda row: haversine_distance_on_df(row, user_lat, user_lon), axis=1)

# filter data to within 10km
filtered_data = df.loc[df.distance <=10.0,cols_of_interest]
display(filtered_data.head(),filtered_data.shape)

| | abn | charity name | how purposes were pursued | total full time equivalent staff | staff - volunteers | Program name | Classification | Charity weblink | location_number | operating_location | latitude | longitude | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 292 | 11930852906 | Kind Hearts Illawarra | We continued to run outreach programme, even a... | 0.0 | 13 | Outreach in the Park | Soup kitchens | www.kindheartsillawarra.com.au | 1 | MacCabe Park, Wollongong NSW, Australia | -34.427625 | 150.894013 | 0.294943 |
| 293 | 11930852906 | Kind Hearts Illawarra | We continued to run outreach programme, even a... | 0.0 | 13 | Produce Table | Food aid | www.kindheartsillawarra.com.au | 1 | MacCabe Park, Wollongong NSW, Australia | -34.427625 | 150.894013 | 0.294943 |
| 309 | 11981168448 | CORRIMAL RSL SUB-BRANCH LIMITED | Provide support to veterans and their families... | 0.0 | 152 | ANZAC Day Dawn Commemorative service | Unknown or not classified | https://www.rslnsw.org.au/ | 1 | Corrimal NSW, Australia | -34.373193 | 150.896911 | 5.779034 |
| 310 | 11981168448 | CORRIMAL RSL SUB-BRANCH LIMITED | Provide support to veterans and their families... | 0.0 | 152 | Remembrance Day Commemorative Service | Unknown or not classified | http://www.rslnsw.org.au/ | 1 | Corrimal NSW, Australia | -34.373193 | 150.896911 | 5.779034 |
| 311 | 11981168448 | CORRIMAL RSL SUB-BRANCH LIMITED | Provide support to veterans and their families... | 0.0 | 152 | RSL NSW s Charitable Purpose | Welfare | https://www.rsldefencecare.org.au | 1 | Corrimal NSW, Australia | -34.366667 | 150.891667 | 6.495752 |


(175, 13)
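
On the concern that the row-wise `apply` could be too slow for a user session: the same haversine formula can be evaluated on whole columns at once. This is a sketch only; `haversine_km` is a hypothetical name, and the speed-up on the full dataset has not been measured here.

```python
import numpy as np


def haversine_km(lats, lons, user_lat, user_lon, radius_km=6371.0):
    """Great-circle distance in km from one point to arrays of points."""
    lat1, lon1 = np.radians(user_lat), np.radians(user_lon)
    lat2 = np.radians(np.asarray(lats, dtype=float))
    lon2 = np.radians(np.asarray(lons, dtype=float))

    # Same haversine terms as the row-wise version, computed element-wise
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Assumed usage with the DataFrame from the cell above:
# df['distance'] = haversine_km(df['latitude'], df['longitude'], user_lat, user_lon)
```

Because it takes arrays rather than a row, the function avoids one Python-level call per record, which is usually where `df.apply(..., axis=1)` spends its time.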

In [16]:
# create unique identifier - that is also informative to the model
filtered_data['id'] = filtered_data['charity name'].astype(str)+' | '+filtered_data['Program name'].astype(str)+' | '+filtered_data['operating_location'].astype(str)
assert filtered_data['id'].is_unique  # every row must get its own id

In [21]:
# todo - consider adding more metadata
def prepare_text_for_embedding(row):
    """Prepare text by combining identifier and content"""
    return f"ID:{row['id']} - {row['how purposes were pursued']}"

filtered_data['text_to_embed'] = filtered_data.apply(prepare_text_for_embedding, axis=1)

In [22]:
from anthropic import Anthropic
import toml

# Load API key from secrets.toml
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
client = Anthropic(api_key=API_KEY)

def get_embedding(text):
    """Get embedding from Anthropic API"""
    try:
        response = client.beta.embeddings.create(
            model="claude-3-haiku-20241022",
            input=text
        )
        return response.embedding  # Returns the embedding vector
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return None

# Create embeddings for our filtered dataset

result = filtered_data['text_to_embed'].head(1).apply(get_embedding)
result



Error getting embedding: 'Beta' object has no attribute 'embeddings'


292    None
Name: text_to_embed, dtype: object

- This was a bum steer by Claude: Anthropic do not provide embedding models just yet! https://docs.anthropic.com/en/docs/build-with-claude/embeddings
- We could look at their suggested provider...
- We could also just implement TF-IDF, a traditional NLP method, which saves money and can run on a streamlit server (no GPU needed at runtime) - https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents
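
A dependency-free sketch of the TF-IDF option, to show roughly what it involves. All function names and the example texts are made up for illustration; in practice scikit-learn's `TfidfVectorizer` is the more usual route, and this hand-rolled version is only an assumption about how small the runtime footprint could be.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """TF-IDF weights for one token list, L2-normalised; unknown terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """Return (idf, doc_vectors) where each vector is a {term: weight} dict."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms found in every document keep a small weight
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1.0 for t, n in doc_freq.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def search(query, docs, idf, doc_vectors, k=3):
    """Rank docs by cosine similarity to the query; returns (score, doc) pairs."""
    q = vectorize(tokenize(query), idf)
    scored = [(sum(w * dv.get(t, 0.0) for t, w in q.items()), doc)
              for doc, dv in zip(docs, doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up example texts, not rows from the dataset
docs = [
    "Free hot meals served in the park every week",
    "Commemorative services for veterans and their families",
    "Food hampers and fresh produce for people in need",
]
idf, doc_vectors = build_tfidf(docs)
results = search("support for veterans", docs, idf, doc_vectors, k=2)
print(results)
```

Applied to this notebook, `docs` would be the `text_to_embed` column, and the index only needs rebuilding when the underlying data changes.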

In [None]:
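# todo - Save the embeddings and metadata (needs an 'embedding' column, which does not exist yet)
if 'embedding' in filtered_data.columns:
    filtered_data.to_pickle('../data/wollongong_charities_with_embeddings.pkl')
    print(f"Generated embeddings for {len(filtered_data)} records")
    print(f"Sample embedding vector length: {len(filtered_data['embedding'].iloc[0])}")
else:
    print("No 'embedding' column yet - nothing saved")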

# Prompt engineering and tests - `pytest` or `projit`
Tests could include:
- valid json
- json contains keys of interest
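
A sketch of how those two checks might look as `pytest`-style tests. The checker `check_response` and the keys in `REQUIRED_KEYS` are hypothetical placeholders, since the response schema has not been decided yet.

```python
import json


def check_response(raw_text, required_keys):
    """Return (ok, problems) for a model reply that is expected to be a JSON object."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, [f"invalid json: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value is not an object"]
    missing = [k for k in required_keys if k not in data]
    return (not missing), [f"missing key: {k}" for k in missing]


# Hypothetical keys of interest; the real schema is still to be decided
REQUIRED_KEYS = ("charity_name", "program_name", "reason")


def test_valid_json_with_keys():
    raw = '{"charity_name": "A", "program_name": "B", "reason": "C"}'
    assert check_response(raw, REQUIRED_KEYS) == (True, [])


def test_rejects_non_json():
    ok, problems = check_response("Arrr, no JSON here", REQUIRED_KEYS)
    assert not ok and problems[0].startswith("invalid json")


def test_reports_missing_keys():
    ok, problems = check_response('{"charity_name": "A"}', REQUIRED_KEYS)
    assert not ok
    assert problems == ["missing key: program_name", "missing key: reason"]
```

Keeping the checker separate from the model call means the tests run offline against canned replies, with live completions checked by the same function.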

# Mapping - `folium`


# Tying together - `streamlit`
- Mobile experience

In [None]:
# getting user location
# https://github.com/aghasemi/streamlit_js_eval?tab=readme-ov-file
# though note you'll need an error route - i.e. the agent asks for the location if this fails