# Prototyping
## Scope
- Sort out completions using aisuite - `done!`
- Create basic data structure that is RAG ready - faiss - `todo`
- Prompt engineering (and tests) - `todo`
- Mapping with Folium - `todo`
- Tying together with streamlit - `todo`

# Completions - `aisuite`
todo:
- pivot to .env for secret management? - just awkward as that's the present venv name - https://github.com/andrewyng/aisuite/blob/main/.env.sample
- explore alternate anthropic models - `model = 'anthropic:claude-3-5-sonnet-v2@20241022'` 
- see how the resulting issue shapes up - https://github.com/andrewyng/aisuite/issues/155
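
A rough idea of what the `.env` pivot could look like, using only the standard library. `load_env_file` is a hypothetical helper written for this note, not part of aisuite; the `python-dotenv` package does the same job more robustly. The parsing rules (skip comments, strip quotes, never override an existing variable) are assumptions about what we'd want.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        # Variables already set in the shell take precedence
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

Since the cells below already pass the key through `os.environ['ANTHROPIC_API_KEY']`, calling `load_env_file()` before `ai.Client()` would stand in for the `toml` lines. Pointing it at a differently named file would sidestep the clash with the venv directory name.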

In [None]:
import aisuite as ai, toml, os
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
os.environ['ANTHROPIC_API_KEY'] = API_KEY


client = ai.Client()
model = 'anthropic:claude-3-5-haiku@20241022' 
messages = [
    {"role": "system", "content": "Respond in Pirate English."},
    {"role": "user", "content": "Tell me a joke."},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)

TypeError: Client.__init__() got an unexpected keyword argument 'proxies'

In [21]:
import anthropic, sys
print(anthropic.__version__,
      #ai.__version__,
      sys.version)
# aisuite==0.1.6 via requirements.txt

0.30.1 3.13.0 (tags/v3.13.0:60403a5, Oct  7 2024, 09:38:07) [MSC v.1941 64 bit (AMD64)]


In [None]:
import aisuite as ai, toml, os
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
os.environ['ANTHROPIC_API_KEY'] = API_KEY


client = ai.Client()
model = 'anthropic:claude-3-5-haiku-20241022' 
messages = [
    {"role": "system", "content": "Respond in Pirate English."},
    {"role": "user", "content": "Tell me a joke."},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)

Arrr, here be a jest fer ye, matey!

Why be a pirate's favorite letter? 'R', of course! *hearty pirate laugh*

Yarrr har har! *slaps knee and takes a swig from rum bottle*


## Completions - `anthropic` directly
No longer necessary though it did help troubleshoot issues with aisuite


In [2]:
# let's pivot - to the anthropic API directly for now, and we can fix this down the track
# https://pypi.org/project/anthropic/
from anthropic import Anthropic
import toml
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')

client = Anthropic(
    api_key=API_KEY
)

message = client.messages.create(
    max_tokens=1024,
    system="Respond in Pirate English.",
    messages=[
        {"role": "user", "content": "Tell me a joke."},
    ],
    model="claude-3-5-haiku-20241022",
)
print(message.content)

[TextBlock(text="Arrr, here be a jest fer ye, me hearty!\n\nWhy'd the pirate make a terrible teacher? 'Cause he kept usin' his ARRRRRbitrary punishments! *hearty laugh*\n\n*slaps knee and takes a swig from a rum bottle*\n\nYarrr! That be a knee-slappin' chuckle fer ye! *winks*", type='text')]


# Explore RAG and data structures
1. First, we need to transform the data into embeddings (vectors) that we can search against. These vectors must come from the same embedding model we'll use at runtime. We could use Anthropic, in which case we'll eventually need to explore batching to reduce costs. Or we could introduce another library and a bunch of new models (e.g. ), which may need GPUs etc. For now, it's probably simplest to go with Anthropic and a small subset of the data (e.g. 100 programs near my location).
2. We then need to implement search over whatever text the user provides, returning a number of results. I like the idea of the `faiss` library here, though this could probably also be implemented in numpy (I'm trying to keep the number of dependencies down, and realistically the data isn't big enough to need a heavy-duty library just yet).
3. Finally, we need to add the search results as context, along with the initial user prompt, so that the response provides valid outputs. For this, I suspect some prompt engineering is required.
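
To make step 2 concrete, here is a minimal numpy-only sketch of a top-k cosine similarity search. `top_k_cosine` is a hypothetical name, and the toy vectors stand in for whatever embeddings we end up with. It assumes the documents fit comfortably in memory as a dense matrix.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (row indices, scores) of the k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_norm = np.linalg.norm(query_vec)
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    # Leave the score at 0 wherever a norm is zero, to avoid dividing by zero
    denom = doc_norms * query_norm
    scores = np.zeros(len(doc_matrix))
    np.divide(doc_matrix @ query_vec, denom, out=scores, where=denom > 0)
    k = min(k, len(scores))
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()


# Toy example: four 2-d "documents" and a query pointing along the first axis
docs = [[1, 0], [0, 1], [1, 1], [0, 0]]
indices, scores = top_k_cosine([1, 0], docs, k=2)
print(indices, scores)
```

At a few hundred rows, a brute-force scan like this should be plenty. A dedicated index such as `faiss` only becomes interesting once the matrix is much larger.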

In [70]:
# how to get user lat lon?
import requests
secrets = toml.load('../secrets.toml')
LOCATION_KEY = secrets.get('OPENWEATHER_SECRET')
location = input('Provide a location please: City and State (noting that charity information is limited to Australia)')
try:
    res = requests.get(f"http://api.openweathermap.org/geo/1.0/direct?q={location+', Australia'}&limit=1&appid={LOCATION_KEY}")
    assert res.status_code == 200
except AssertionError as error:
    print(f"API request failed - status code {res.status_code}")


location_result = res.json()
user_lat = location_result[0]['lat']
user_lon = location_result[0]['lon']
print(f"location: {location}\nlat: {user_lat}\nlon: {user_lon}")

location: Port Kembla, NSW
lat: -34.4703383
lon: 150.8953674


In [None]:
# let's get some test data near Wollongong
import pandas as pd, numpy as np
df = pd.read_pickle('../data/transformed_charities.pkl')
user_lat = -34.425072
user_lon = 150.893143

cols_of_interest = [
    'abn',
    'charity name',
    'how purposes were pursued',
    'total full time equivalent staff',
    'staff - volunteers',
    'Program name',
    'Classification',
    'Charity weblink',
    'location_number',
    'operating_location',
    'latitude',
    'longitude',
    'distance' # what we're about to compute
]

# compute distance - this could be too slow for a user session
def haversine_distance_on_df(row,user_lat,user_lon):
    R = 6371  # Earth's radius in kilometers
        
    # Convert to radians
    user_lat_rad = np.radians(user_lat)
    user_lon_rad = np.radians(user_lon)
    charity_lats_rad = np.radians(row['latitude'])
    charity_lons_rad = np.radians(row['longitude'])
    
    # Haversine formula
    dlat = charity_lats_rad - user_lat_rad
    dlon = charity_lons_rad - user_lon_rad
    a = np.sin(dlat/2)**2 + np.cos(user_lat_rad) * np.cos(charity_lats_rad) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# Apply the function to create a new distance column
df['distance'] = df.apply(lambda row: haversine_distance_on_df(row, user_lat, user_lon), axis=1)

# filter data to within 10km
filtered_data = df.loc[df.distance <=10.0,cols_of_interest]
display(filtered_data.head(),filtered_data.shape)

Unnamed: 0,abn,charity name,how purposes were pursued,total full time equivalent staff,staff - volunteers,Program name,Classification,Charity weblink,location_number,operating_location,latitude,longitude,distance
292,11930852906,Kind Hearts Illawarra,"We continued to run outreach programme, even a...",0.0,13,Outreach in the Park,Soup kitchens,www.kindheartsillawarra.com.au,1,"MacCabe Park, Wollongong NSW, Australia",-34.427625,150.894013,0.294943
293,11930852906,Kind Hearts Illawarra,"We continued to run outreach programme, even a...",0.0,13,Produce Table,Food aid,www.kindheartsillawarra.com.au,1,"MacCabe Park, Wollongong NSW, Australia",-34.427625,150.894013,0.294943
309,11981168448,CORRIMAL RSL SUB-BRANCH LIMITED,Provide support to veterans and their families...,0.0,152,ANZAC Day Dawn Commemorative service,Unknown or not classified,https://www.rslnsw.org.au/,1,"Corrimal NSW, Australia",-34.373193,150.896911,5.779034
310,11981168448,CORRIMAL RSL SUB-BRANCH LIMITED,Provide support to veterans and their families...,0.0,152,Remembrance Day Commemorative Service,Unknown or not classified,http://www.rslnsw.org.au/,1,"Corrimal NSW, Australia",-34.373193,150.896911,5.779034
311,11981168448,CORRIMAL RSL SUB-BRANCH LIMITED,Provide support to veterans and their families...,0.0,152,RSL NSW s Charitable Purpose,Welfare,https://www.rsldefencecare.org.au,1,"Corrimal NSW, Australia",-34.366667,150.891667,6.495752


(175, 13)
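
The comment in the cell above worries that the row-wise `apply` could be too slow for a user session. A possible alternative, sketched under the assumption that `latitude` and `longitude` are plain numeric columns, is to run the same haversine formula on whole arrays at once. `haversine_km` is a hypothetical name.

```python
import numpy as np


def haversine_km(user_lat, user_lon, lats, lons):
    """Great-circle distance in km from one point to arrays of points."""
    earth_radius_km = 6371
    user_lat_rad, user_lon_rad = np.radians(user_lat), np.radians(user_lon)
    lats_rad = np.radians(np.asarray(lats, dtype=float))
    lons_rad = np.radians(np.asarray(lons, dtype=float))
    dlat = lats_rad - user_lat_rad
    dlon = lons_rad - user_lon_rad
    # Same haversine formula as the row-wise version, applied element-wise
    a = np.sin(dlat / 2) ** 2 + np.cos(user_lat_rad) * np.cos(lats_rad) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))


# Coordinates copied from the MacCabe Park and Corrimal rows in the output above
distances = haversine_km(-34.425072, 150.893143,
                         [-34.427625, -34.373193],
                         [150.894013, 150.896911])
print(distances.round(3))
```

On the DataFrame this would become `df['distance'] = haversine_km(user_lat, user_lon, df['latitude'].to_numpy(), df['longitude'].to_numpy())`, which avoids calling a Python function once per row.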

In [16]:
# create unique identifier - that is also informative to the model
filtered_data['id'] = filtered_data['charity name'].astype(str)+' | '+filtered_data['Program name']+' | '+filtered_data['operating_location'].astype(str)
assert filtered_data['id'].nunique()/filtered_data['id'].count() == 1.0

In [21]:
# todo - consider adding more metadata
def prepare_text_for_embedding(row):
    """Prepare text by combining identifier and content"""
    return f"ID:{row['id']} - {row['how purposes were pursued']}"

filtered_data['text_to_embed'] = filtered_data.apply(prepare_text_for_embedding, axis=1)

In [22]:
from anthropic import Anthropic
import toml

# Load API key from secrets.toml
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
client = Anthropic(api_key=API_KEY)

def get_embedding(text):
    """Get embedding from Anthropic API"""
    try:
        response = client.beta.embeddings.create(
            model="claude-3-haiku-20241022",
            input=text
        )
        return response.embedding  # Returns the embedding vector
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return None

# Create embeddings for our filtered dataset

result = filtered_data['text_to_embed'].head(1).apply(get_embedding)
result



Error getting embedding: 'Beta' object has no attribute 'embeddings'


292    None
Name: text_to_embed, dtype: object

- This was a bum steer by Claude - Anthropic do not provide embedding models just yet! https://docs.anthropic.com/en/docs/build-with-claude/embeddings
- We could look at their suggested provider...
- We could also just implement a traditional NLP method of TF-IDF, which saves money and can be executed by a streamlit server (no GPU needed in runtime) - https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents


### pivot to TF-IDF (or traditional NLP approaches) 

In [56]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
import re
import nltk
from typing import List, Tuple, Optional, Dict, Any
import logging

class TextPreprocessor:
    def __init__(self, debug: bool = False):
        self.debug = debug
        self._setup_logging()
        self._download_nltk_data()
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
    def _setup_logging(self):
        self.logger = logging.getLogger(__name__)
        if self.debug:
            self.logger.setLevel(logging.DEBUG)
        else:
            self.logger.setLevel(logging.WARNING)
            
    def _download_nltk_data(self):
        """Download required NLTK data if not present"""
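        # Each resource sits under its own category in the NLTK data directory
        resources = {
            'punkt': 'tokenizers/punkt',
            'stopwords': 'corpora/stopwords',
            'wordnet': 'corpora/wordnet',
            'omw-1.4': 'corpora/omw-1.4',
        }
        for package, path in resources.items():
            try:
                nltk.data.find(path)
            except LookupError:
                nltk.download(package, quiet=True)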
                
    def preprocess_text(self, text: str) -> str:
        """Preprocess text with optional debugging"""
        text = str(text).lower()
        text = re.sub(r'[^a-zA-Z\s]', ' ', text)
        
        try:
            tokens = word_tokenize(text)
        except LookupError:
            tokens = text.split()
            
        tokens = [self.lemmatizer.lemmatize(token) for token in tokens 
                 if token not in self.stop_words and len(token) > 2]
        
        processed = ' '.join(tokens)
        if self.debug:
            self.logger.debug(f"Preprocessed text: {processed[:100]}...")
        
        return processed

class DocumentSimilarity:
    def __init__(self, max_features: int = 5000, min_df: int = 1, 
                 max_df: float = 1.0, debug: bool = False):
        """
        Initialize document similarity processor
        
        Args:
            max_features: Maximum number of features for TF-IDF
            min_df: Minimum document frequency
            max_df: Maximum document frequency
            debug: Enable debug logging
        """
        self.debug = debug
        self._setup_logging()
        
        self.preprocessor = TextPreprocessor(debug=debug)
        self.vectorizer = TfidfVectorizer(
            max_features=max_features,
            min_df=min_df,
            max_df=max_df,
            strip_accents='unicode',
            analyzer='word',
            token_pattern=r'\b\w+\b',
            ngram_range=(1, 1)
        )
        
    def _setup_logging(self):
        self.logger = logging.getLogger(__name__)
        if self.debug:
            self.logger.setLevel(logging.DEBUG)
        else:
            self.logger.setLevel(logging.WARNING)
            
    def fit_transform_documents(self, df: pd.DataFrame, 
                              text_column: str) -> pd.DataFrame:
        """Process documents and create TF-IDF matrix"""
        self.logger.debug("Processing documents...")
        
        # Preprocess texts
        processed_texts = df[text_column].apply(self.preprocessor.preprocess_text)
        if self.debug:
            self.logger.debug("\nSample processed texts:")
            for text in processed_texts.head(3):
                self.logger.debug(f"{text[:100]}...")
        
        # Fit and transform
        tfidf_matrix = self.vectorizer.fit_transform(processed_texts)
        feature_names = self.vectorizer.get_feature_names_out()
        
        if self.debug:
            self.logger.debug(f"\nVocabulary size: {len(feature_names)}")
            self.logger.debug(f"Sample terms: {feature_names[:10]}")
        
        # Create DataFrame with TF-IDF features
        tfidf_df = pd.DataFrame(
            tfidf_matrix.toarray(),
            index=df.index,
            columns=[f'tfidf_{feat}' for feat in feature_names]
        )
        
        return pd.concat([df, tfidf_df], axis=1)
    
    def transform_query(self, query: str) -> np.ndarray:
        """Transform query text to TF-IDF vector"""
        processed_query = self.preprocessor.preprocess_text(query)
        if self.debug:
            self.logger.debug(f"\nProcessed query: '{processed_query}'")
        
        query_vector = self.vectorizer.transform([processed_query])
        query_array = query_vector.toarray()[0]
        
        if self.debug:
            self._log_query_stats(query_array)
            
        return query_array
    
    def _log_query_stats(self, query_array: np.ndarray):
        """Log query vector statistics if debug is enabled"""
        vocabulary = self.vectorizer.get_feature_names_out()
        non_zero_indices = np.nonzero(query_array)[0]
        
        self.logger.debug(f"\nQuery stats:")
        self.logger.debug(f"Vector norm: {np.linalg.norm(query_array):.6f}")
        self.logger.debug(f"Non-zero terms: {len(non_zero_indices)}")
        
        if len(non_zero_indices) > 0:
            self.logger.debug("Matching terms:")
            for idx in non_zero_indices:
                self.logger.debug(f"  {vocabulary[idx]}: {query_array[idx]:.6f}")
    
    def find_similar_documents(self, query_vector: np.ndarray, 
                             document_vectors: np.ndarray,
                             df_indices: List[Any],
                             top_k: int = 5) -> Tuple[List[Any], List[float]]:
        """
        Find most similar documents using cosine similarity
        
        Returns:
            Tuple of (document indices, similarity scores)
            Empty lists if no similar documents found
        """
        if len(df_indices) != document_vectors.shape[0]:
            raise ValueError("Number of indices must match number of document vectors")
            
        # Compute similarities
        query_norm = np.linalg.norm(query_vector)
        doc_norms = np.linalg.norm(document_vectors, axis=1)
        
        # Handle zero vectors
        if query_norm == 0:
            self.logger.warning("Query vector is zero - no matches possible")
            return [], []
            
        # Compute cosine similarities efficiently
        similarities = np.zeros(len(document_vectors))
        non_zero_docs = doc_norms > 0
        
        if not non_zero_docs.any():
            self.logger.warning("No valid document vectors found")
            return [], []
            
        # Vectorized similarity computation
        similarities[non_zero_docs] = (
            document_vectors[non_zero_docs] @ query_vector / 
            (doc_norms[non_zero_docs] * query_norm)
        )
        
        # Get top k results
        valid_similarities = similarities[~np.isnan(similarities)]
        if len(valid_similarities) == 0:
            return [], []
            
        top_k = min(top_k, len(valid_similarities))
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        return [df_indices[i] for i in top_indices], similarities[top_indices].tolist()
    
    def save_model(self, filepath: str):
        """Save model state"""
        model_data = {
            'vectorizer': self.vectorizer,
            'preprocessor': self.preprocessor
        }
        with open(filepath, 'wb') as f:
            pickle.dump(model_data, f)
    
    @staticmethod
    def load_model(filepath: str, debug: bool = False) -> 'DocumentSimilarity':
        """Load saved model"""
        with open(filepath, 'rb') as f:
            model_data = pickle.load(f)
        
        processor = DocumentSimilarity(
            max_features=model_data['vectorizer'].max_features,
            debug=debug
        )
        processor.vectorizer = model_data['vectorizer']
        processor.preprocessor = model_data['preprocessor']
        return processor

# Example usage:
if __name__ == "__main__":
    # Create processor with debugging disabled
    processor = DocumentSimilarity(max_features=1000, debug=False)
    
    # Example data
    df = pd.DataFrame({
        'text_to_embed': [
            "This is a sample document about machine learning",
            "Another document discussing data science",
            "A third document about artificial intelligence"
        ]
    })
    
    # Process documents
    processed_df = processor.fit_transform_documents(df, 'text_to_embed')
    
    # Example query
    query = "machine learning and data science"
    query_vector = processor.transform_query(query)
    
    # Get document vectors
    tfidf_columns = [col for col in processed_df.columns if col.startswith('tfidf_')]
    document_vectors = processed_df[tfidf_columns].values
    
    # Find similar documents
    similar_indices, similarities = processor.find_similar_documents(
        query_vector, 
        document_vectors,
        df_indices=processed_df.index.tolist()
    )
    
    # Print results
    if similar_indices:
        for idx, sim in zip(similar_indices, similarities):
            print(f"Document: {df.loc[idx, 'text_to_embed']}")
            print(f"Similarity: {sim:.4f}\n")

Document: This is a sample document about machine learning
Similarity: 0.5465

Document: Another document discussing data science
Similarity: 0.4795

Document: A third document about artificial intelligence
Similarity: 0.0000



#### Applying to our data

In [32]:
filtered_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 175 entries, 292 to 204769
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   abn                               175 non-null    int64  
 1   charity name                      175 non-null    object 
 2   how purposes were pursued         172 non-null    object 
 3   total full time equivalent staff  175 non-null    float64
 4   staff - volunteers                175 non-null    int64  
 5   Program name                      175 non-null    object 
 6   Classification                    175 non-null    object 
 7   Charity weblink                   175 non-null    object 
 8   location_number                   175 non-null    int64  
 9   operating_location                175 non-null    object 
 10  latitude                          175 non-null    float64
 11  longitude                         175 non-null    float64
 12  distance

In [57]:
# Initialize with debugging on
processor = DocumentSimilarity(
    max_features=1000,
    min_df=2,
    max_df=0.95,
    debug=True  # Set to False to silence debugging output
)

# Process documents
processed_df = processor.fit_transform_documents(filtered_data, 'text_to_embed')

# Transform query
query = "A non-religious or sectarian soup kitchen for homeless or disadvantaged people"
query_vector = processor.transform_query(query)

# Get document vectors
tfidf_columns = [col for col in processed_df.columns if col.startswith('tfidf_')]
document_vectors = processed_df[tfidf_columns].values

# Find similar documents
similar_indices, similarities = processor.find_similar_documents(
    query_vector, 
    document_vectors,
    df_indices=processed_df.index.tolist()
)

# Print results
if similar_indices:
    for idx, similarity in zip(similar_indices, similarities):
        print(f"Document {filtered_data.loc[idx, 'id']}")
        print(f"Text: {filtered_data.loc[idx, 'text_to_embed'][:200]}...")
        print(f"Similarity: {similarity:.4f}\n")
else:
    print("No similar documents found.")

Document Southern Youth & Family Services Limited | Family Counselling Project | Wollongong NSW, Australia
Text: ID:Southern Youth & Family Services Limited | Family Counselling Project | Wollongong NSW, Australia - We provide support and assistance to children, young people, adults and families who are disadvan...
Similarity: 0.2839

Document Southern Youth & Family Services Limited | Cringila Community Development; Playgroup; Southern Suburbs Preschool | Cringila NSW, Australia
Text: ID:Southern Youth & Family Services Limited | Cringila Community Development; Playgroup; Southern Suburbs Preschool | Cringila NSW, Australia - We provide support and assistance to children, young peo...
Similarity: 0.2732

Document Southern Youth & Family Services Limited | Reconnect Program and Family Mental Health Support Service (CAFS) | Wollongong NSW, Australia
Text: ID:Southern Youth & Family Services Limited | Reconnect Program and Family Mental Health Support Service (CAFS) | Wollongong NSW, Aus

In [58]:
similar_indices

[56904, 22575, 56913, 26164, 16352]

In [59]:
similarities

[0.28385910307126794,
 0.2731521715021081,
 0.27081219353533453,
 0.20546684110403923,
 0.19069065469080873]

In [65]:
# this code was modularised for the Streamlit app and can be accessed as follows:
import sys
from pathlib import Path

# Get the absolute path to the project root (parent of notebooks directory)
project_root = Path().absolute().parent

# Add project root to Python path so src can be found
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))


# Import our modules
from src.similarity_manager import DocumentSimilarityManager
from src.data_processing import load_charity_data, prepare_text_field

# Initialize the similarity manager
manager = DocumentSimilarityManager(
    model_dir="../data/models",
    processed_dir="../data/processed",
    debug=False
)

# Load existing model or train new one
if not manager.load_model():
    print("Training new model...")
    df = load_charity_data()
    manager.train_model(df, text_column='text_to_embed')
    print("Model trained!")


# Hit the model with a query
query = "A soup kitchen for homeless people"
top_k = 10
similar_indices, similarities = manager.find_similar(query, top_k=top_k)

if not similar_indices:
    print("No matches found")

df = load_charity_data()
for idx, sim in zip(similar_indices, similarities):
    print(f"\nCharity: {df.loc[idx, 'charity name']}")
    print(f"Similarity: {sim:.3f}")
    print(f"Program: {df.loc[idx, 'Program name']}")
    print(f"Location: {df.loc[idx, 'operating_location']}")
    print("-" * 80)

  charities_df = pd.read_csv(data_dir / "datadotgov_ais22.csv")



Charity: St Thomas' Anglican Church Port Macquarie
Similarity: 0.668
Program: St Thomas Soup kitchen
Location: St Thomas' Anglican Church, 50 Hay Street, Port Macquarie NSW, Australia
--------------------------------------------------------------------------------

Charity: Street Mission Incorporated
Similarity: 0.648
Program: Soup Kitchen
Location: Dee Why NSW, Australia
--------------------------------------------------------------------------------

Charity: Street Mission Incorporated
Similarity: 0.648
Program: Soup Kitchen
Location: Balgowlah NSW, Australia
--------------------------------------------------------------------------------

Charity: Light House Community Services Inc
Similarity: 0.588
Program: Light House Soup Kitchen
Location: Gold Coast QLD, Australia
--------------------------------------------------------------------------------

Charity: Southcare Community Care Inc
Similarity: 0.568
Program: Soup Kitchen
Location: Frankston North VIC, Australia
--------------

In [None]:
df = load_charity_data()
for idx, sim in zip(similar_indices, similarities):
    print(f"\nCharity: {df.loc[idx, 'charity name']}")
    print(f"Similarity: {sim:.3f}")
    print(f"Program: {df.loc[idx, 'Program name']}")
    print(f"Location: {df.loc[idx, 'operating_location']}")
    print("-" * 80)

In [66]:
similar_indices

[14629, 48179, 13848, 4111, 17739, 7600, 22487, 8502, 83903, 15241]

# Prompt engineering and tests - `pytest` or `projit`
Tests could include:
- valid json
- json contains keys of interest
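
A sketch of how those two checks could look as plain test functions that `pytest` would discover. `validate_response` is a hypothetical helper. The key names and the 10-ID cap follow the two-key shape that the system prompts in this notebook ask for.

```python
import json


def validate_response(raw_text, max_ids=10):
    """Return the parsed payload if it has the expected shape, else raise ValueError."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or set(payload) != {"charity_ids", "message"}:
        raise ValueError("expected exactly the keys 'charity_ids' and 'message'")
    ids = payload["charity_ids"]
    if not isinstance(ids, list) or len(ids) > max_ids:
        raise ValueError(f"charity_ids must be a list of at most {max_ids} items")
    if not isinstance(payload["message"], str):
        raise ValueError("message must be a string")
    return payload


def test_valid_response():
    raw = '{"charity_ids": [26164, 14931], "message": "Two matches."}'
    assert validate_response(raw)["charity_ids"] == [26164, 14931]


def test_rejects_non_json():
    try:
        validate_response("Arrr, no JSON here")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Keeping the validator separate from the tests means the Streamlit app could reuse it at runtime to decide whether to retry a completion.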

## Setup

In [5]:
# 1. user provides location preferences
# debug - this is too slow; maybe we need to filter on state first? maybe we need to store the processed data as well?
import requests, toml, pandas as pd, numpy as np
import sys
from pathlib import Path
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
from src.similarity_manager import DocumentSimilarityManager
from src.data_processing import load_charity_data, prepare_text_field
secrets = toml.load('../secrets.toml')
LOCATION_KEY = secrets.get('OPENWEATHER_SECRET')
df = load_charity_data()
def haversine_distance_on_df(row,user_lat,user_lon):
    R = 6371  # Earth's radius in kilometers
        
    # Convert to radians
    user_lat_rad = np.radians(user_lat)
    user_lon_rad = np.radians(user_lon)
    charity_lats_rad = np.radians(row['latitude'])
    charity_lons_rad = np.radians(row['longitude'])
    
    # Haversine formula
    dlat = charity_lats_rad - user_lat_rad
    dlon = charity_lons_rad - user_lon_rad
    a = np.sin(dlat/2)**2 + np.cos(user_lat_rad) * np.cos(charity_lats_rad) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# get user location
location = input('Provide a location please: City and State (noting that charity information is limited to Australia)')
distance = float(input('Provide a desired travel distance in km'))

try:
    res = requests.get(f"http://api.openweathermap.org/geo/1.0/direct?q={location+', Australia'}&limit=1&appid={LOCATION_KEY}")
    assert res.status_code == 200
except AssertionError as error:
    print(f"API request failed - status code {res.status_code}")

location_result = res.json()
user_lat = location_result[0]['lat']
user_lon = location_result[0]['lon']
print(f"location: {location}\nlat: {user_lat}\nlon: {user_lon}") # debug

# Apply the function to create a new distance column
df['distance'] = df.apply(lambda row: haversine_distance_on_df(row, user_lat, user_lon), axis=1)

# show filtered results based on location and desired distance
filtered_df = df.loc[df.distance <=distance].copy()
num_charities = filtered_df.shape[0]

# give the user some feedback
print(f"For the location: {location}\nand distance: {distance}\nthere are {num_charities} charity programs near you")


  charities_df = pd.read_csv(data_dir / "datadotgov_ais22.csv")


location: Port Kembla, NSW
lat: -34.4703383
lon: 150.8953674
For the location: Port Kembla, NSW
and distance: 15.0
there are 305 charity programs near you


In [73]:
num_charities = filtered_df.shape[0]

In [6]:
# 2. user provides interests for similarity scoring

# Initialize the similarity manager
manager = DocumentSimilarityManager(
    model_dir="../data/models",
    processed_dir="../data/processed",
    debug=False
)

# Load existing model or train new one
if not manager.load_model():
    print("Training new model...")
    df = load_charity_data()
    manager.train_model(df, text_column='text_to_embed')
    print("Model trained!")


# Hit the model with a query
query = input("What volunteering opportunity are you interested in?") # "A soup kitchen for homeless people"
top_k = min(100,num_charities)

location_filtered_indices = set(filtered_df.index)

# Use new filtered search method
similar_indices, similarities = manager.find_similar_filtered(
    query=query,
    valid_indices=location_filtered_indices,
    top_k=top_k
)

if not similar_indices:
    print("No matches found")

for idx, sim in zip(similar_indices, similarities):
    print(f"\nCharity: {df.loc[idx, 'charity name']}")
    print(f"Similarity: {sim:.3f}")
    print(f"Program: {df.loc[idx, 'Program name']}")
    print(f"Location: {df.loc[idx, 'operating_location']}")
    print("-" * 80)




Charity: Southern Youth & Family Services Limited
Similarity: 0.198
Program: Family Counselling Project
Location: Wollongong NSW, Australia
--------------------------------------------------------------------------------

Charity: Southern Youth & Family Services Limited
Similarity: 0.198
Program: Family Counselling Project
Location: Shellharbour NSW, Australia
--------------------------------------------------------------------------------

Charity: Southern Youth & Family Services Limited
Similarity: 0.197
Program: Barnardos Parenting;  Indigenous Network Program
Location: Shellharbour NSW, Australia
--------------------------------------------------------------------------------

Charity: Southern Youth & Family Services Limited
Similarity: 0.197
Program: Barnardos Parenting;  Indigenous Network Program
Location: Illawarra, New South Wales, Australia
--------------------------------------------------------------------------------

Charity: Southern Youth & Family Services Limited
S

## Prompt Design

In [18]:
# INITIAL ATTEMPT
system_prompt = """You are a charity volunteering recommendation assistant for Australian nonprofits. Analyze charity options based on user interests and location.

Your responses must be valid JSON with exactly two keys:
- charity_ids: list of recommended charity IDs (maximum 10)
- message: explanation of recommendations, match quality, and suggested next steps

Example valid response:
{
    "charity_ids": [26164, 14931],
    "message": "I found 2 programs matching your interests within your area. The Wollongong Homeless Hub provides drop-in support and food relief, while Need a Feed Australia offers food distribution services."
}

If no suitable matches are found, return:
{
    "charity_ids": [],
    "message": "I couldn't find exact matches in your area"
}"""

In [24]:
# ATTEMPT 2
system_prompt = """You are a charity volunteering recommendation assistant for Australian nonprofits. Your role is to thoughtfully match volunteers with charities based on their interests and location.

For each recommendation, consider:
1. How well the charity's activities match the user's interests (indicated by similarity score where 1.0 is perfect)
2. Geographic accessibility from the user's location (indicated by distance in km)

Your response must be valid JSON with these two keys:
- charity_ids: list of recommended charity IDs (maximum 10)
- message: A natural, conversational explanation that includes:
    - Why each recommended charity is a good match 
    - What specific activities/roles the user could do there
    - How far away each charity is located
    - Brief next steps for getting involved

Example responses:

Perfect match found:
{
    "charity_ids": [26164, 14931],
    "message": "I found two excellent matches for you! The Wollongong Homeless Hub (2km away, 0.92 similarity) aligns perfectly with your interest in helping the homeless through their food distribution program. They need volunteers for their twice-weekly breakfast service. Just 3km further, Need a Feed Australia (0.87 similarity) also runs food relief programs and needs drivers for their delivery service. Both locations show high activity levels in your area."
}

Good matches with location trade-off:
{
    "charity_ids": [15678, 19234],
    "message": "While the closest match is 15km away, I found two great options: Youth Connect (0.95 similarity) needs mentors for their after-school program, matching your interest in youth education. The Smith Family (0.89 similarity) is slightly closer at 12km and runs similar youth support programs. The map below shows these and other nearby charities you might consider."
}

No ideal matches:
{
    "charity_ids": [],
    "message": "I couldn't find exact matches for your interests in your immediate area. The map below shows all nearby charities - you might want to explore these local options and see if any align with your goals, even if they're not perfect matches."
}"""

In [33]:
# ATTEMPT 3
system_prompt = """You are a charity volunteering recommendation assistant for Australian nonprofits. Your role is to thoughtfully match volunteers with charities based on their interests and location.

For each recommendation, consider:
1. How well the charity's activities match the user's interests (indicated by similarity score where 1.0 is perfect)
2. Geographic accessibility from the user's location (indicated by distance in km)

Your response must be a SINGLE valid JSON object with NO special characters or formatting beyond simple spaces and escaped newlines (\\n). Include these two keys:
- charity_ids: list of recommended charity IDs (maximum 10)
- message: single string with newlines represented as \\n

Example of valid response format:
{
    "charity_ids": [26164, 14931],
    "message": "I found two excellent matches for you!\\n\\nThe Wollongong Homeless Hub (2km away, 0.92 similarity) aligns perfectly with your interest in helping the homeless.\\n\\nNeed a Feed Australia (5km away, 0.87 similarity) also runs food relief programs and needs delivery drivers."
}

If no matches found:
{
    "charity_ids": [],
    "message": "I couldn't find exact matches in your area. The map below shows nearby charities you might consider."
}"""

In [34]:
def prepare_charity_context(similar_indices, similarities, df, max_chars=100000):
    """
    Prepare charity context for LLM prompt while respecting context length limits.
    Returns formatted context string and list of included charity IDs.
    """
    context_parts = []
    included_ids = []
    current_length = 0
    
    # Template for each charity entry - estimate ~200-300 chars per entry
    for idx, sim in zip(similar_indices, similarities):
        charity = df.loc[idx]
        
        entry = (
            f"ID: {idx}\n"
            f"Name: {charity['charity name'][:100]}\n"  # Truncate very long names
            f"Program: {charity['Program name'][:100]}\n"
            f"Location: {charity['operating_location'][:150]}\n"
            f"Distance: {charity['distance']}\n"
            f"Description: {charity['how purposes were pursued'][:2000]}\n" # primary context feature
            f"Similarity: {sim:.3f}\n" # cosine similarity
            f"Website: {charity.get('Charity weblink', 'Not available')}\n"
            "---\n"
        )
        
        # Check if adding this entry would exceed limit
        if current_length + len(entry) > max_chars:
            break
            
        context_parts.append(entry)
        included_ids.append(idx)
        current_length += len(entry)
    
    context = (
        
        f"There are {len(included_ids)} relevant charities that may match this interest.\n"
        f"Similarity scores range from {similarities[0]:.3f} to {similarities[-1]:.3f}.\n"
        "Available charity details:\n\n"
        + "\n".join(context_parts)
    )
    
    return context, included_ids

# Use the function
context_text, valid_ids = prepare_charity_context(similar_indices, similarities, filtered_df)

# Prepare the user prompt
user_prompt = f"""
The user provided the following location: {location}.\n
The user provided the following query: {query}.\n
Use this context to recommend up to 10 charities: {context_text}
Valid id's are as follows: {valid_ids}
"""

# Now you can use this user prompt with Claude
print(f"Prompt length: {len(user_prompt)} characters")

Prompt length: 79329 characters


In [35]:
import aisuite as ai, toml, os, json
secrets = toml.load('../secrets.toml')
API_KEY = secrets.get('ANTHROPIC_SECRET')
os.environ['ANTHROPIC_API_KEY'] = API_KEY


client = ai.Client()
model = 'anthropic:claude-3-5-haiku-20241022' 
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)
try:
    response_json = json.loads(response.choices[0].message.content)
except json.JSONDecodeError as e:
    print(f"invalid JSON object: {e}")

{
    "charity_ids": [26164, 14931, 12013, 56904, 14932],
    "message": "I found several excellent soup kitchen and homeless support options near Port Kembla!\n\nThe Wollongong Homeless Hub (4.7km away, 0.94 similarity) provides drop-in support and emergency accommodation.\n\nNeed a Feed Australia (4.7km away, 0.66 similarity) offers mobile coffee van services and community dinners for vulnerable people.\n\nPort Kembla Baptist Church (1.9km away, 0.20 similarity) runs a community welfare program with a secondhand clothes shop and meal support for disadvantaged individuals.\n\nSouthern Youth & Family Services (4.7km away, 0.198 similarity) supports families and individuals experiencing homelessness.\n\nTheir mobile coffee van (Need a Feed, 4.7km away) also provides fresh meals and take-away food to those in need."
}
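
The parse-and-catch step above is repeated in every query cell, so it may be worth pulling into one helper. The following is a minimal sketch, not part of the notebook: `parse_llm_json` is a hypothetical name, and the fence-stripping step assumes the model might occasionally wrap its reply in a Markdown code fence (not observed in the outputs here).

```python
import json
import re

# Built from single backticks so this snippet contains no literal fence.
FENCE = "`" * 3
# Matches a reply that is entirely wrapped in a Markdown code fence.
FENCED_REPLY = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_llm_json(text):
    """Parse a model reply as JSON.

    Returns (data, None) on success or (None, error_message) on failure,
    so the caller can decide whether to retry or show a fallback message.
    """
    cleaned = text.strip()
    match = FENCED_REPLY.fullmatch(cleaned)
    if match:
        # Assumption: a fenced reply holds the JSON object and nothing else.
        cleaned = match.group(1)
    try:
        return json.loads(cleaned), None
    except json.JSONDecodeError as e:
        return None, f"invalid JSON object: {e}"


data, error = parse_llm_json('{"charity_ids": [26164], "message": "ok"}')
print(data, error)
```

Returning the error instead of printing it keeps `response_json` from silently holding a stale value from an earlier cell when parsing fails.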


In [36]:
response_json['charity_ids']

[26164, 14931, 12013, 56904, 14932]

In [37]:
filtered_df.loc[response_json['charity_ids']]

Unnamed: 0,abn,charity name,registration status,charity website,charity size,basic religious charity,ais due date,date ais received,financial report date received,conducted activities,...,Operating online,Operating overseas,overseas countries,Charity weblink,location_number,operating_location,latitude,longitude,id,distance
26164,80074835053,Wollongong Homeless Hub and Housing Services,Registered,www.whhhs.org.au,Large,n,31/01/2023,31/01/2023,31/01/2023,y,...,N,N,,www.wefh.org.au,1,"Wollongong NSW, Australia",-34.427812,150.893061,Wollongong Homeless Hub and Housing Services |...,4.733426
14931,50410129649,Need a Feed Australia INC,Registered,https://www.needafeed.org/,Small,n,31/01/2023,09/01/2023,09/01/2023,y,...,N,N,,www.needafeed.org,1,"Illawarra, New South Wales, Australia",-34.523225,150.843491,Need a Feed Australia INC | Food parcel distri...,7.561987
12013,42776141776,Port Kembla Baptist Church,Registered,www.pkbc.org.au,Small,y,31/01/2023,25/02/2023,,y,...,N,N,,www.pkbc.org.au,1,"81 Illawarra Street, Port Kembla NSW, Australia",-34.487219,150.89833,Port Kembla Baptist Church | Youth Support Gro...,1.896571
56904,70244601731,Southern Youth & Family Services Limited,Registered,http://www.syfs.org.au,Large,n,31/01/2023,06/12/2022,06/12/2022,y,...,N,N,,http://syfs.org.au,2,"Wollongong NSW, Australia",-34.427812,150.893061,Southern Youth & Family Services Limited | Fam...,4.733426
14932,50410129649,Need a Feed Australia INC,Registered,https://www.needafeed.org/,Small,n,31/01/2023,09/01/2023,09/01/2023,y,...,N,N,,www.needafeed.org,1,"Wollongong NSW, Australia",-34.427812,150.893061,Need a Feed Australia INC | Mobile Coffee van ...,4.733426


In [38]:
# let's try a slightly different query
query = "I want to give blood at the nearest donor centre, like Red Cross Life Blood"

similar_indices, similarities = manager.find_similar_filtered(
    query=query,
    valid_indices=location_filtered_indices,
    top_k=min(100, num_charities)
)

# Use the function
context_text, valid_ids = prepare_charity_context(similar_indices, similarities, filtered_df)

# Prepare the user prompt
user_prompt = f"""
The user provided the following location: {location}.\n
The user provided the following query: {query}.\n
Use this context to recommend up to 10 charities: {context_text}
Valid id's are as follows: {valid_ids}
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)
try:
    response_json = json.loads(response.choices[0].message.content)
except json.JSONDecodeError as e:
    print(f"invalid JSON object: {e}")

{
    "charity_ids": [],
    "message": "I apologize, but I could not find a Red Cross Life Blood donation centre in the provided list of charities. For blood donation, I recommend directly contacting Red Cross Life Blood at 13 14 95 or visiting their website www.lifeblood.com.au to find the nearest donor centre to Port Kembla.\n\nThe closest donation centre is likely to be in Wollongong, which is approximately 5-7 km from Port Kembla."
}


In [39]:
# and again
query = "It's two days before Christmas. I am flexible with charities to volunteer at. Ideally it is aligned to improving the mental health of people in this difficult time - i.e. suicide watch, community building programs, etc"

similar_indices, similarities = manager.find_similar_filtered(
    query=query,
    valid_indices=location_filtered_indices,
    top_k=min(100, num_charities)
)

# Use the function
context_text, valid_ids = prepare_charity_context(similar_indices, similarities, filtered_df)

# Prepare the user prompt
user_prompt = f"""
The user provided the following location: {location}.\n
The user provided the following query: {query}.\n
Use this context to recommend up to 10 charities: {context_text}
Valid id's are as follows: {valid_ids}
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model = model,
    messages = messages,
    temperature=0.75
)

print(response.choices[0].message.content)
try:
    response_json = json.loads(response.choices[0].message.content)
except json.JSONDecodeError as e:
    print(f"invalid JSON object: {e}")

{
    "charity_ids": [86478, 182865, 30602, 163101, 24403, 56913, 34094],
    "message": "Given the proximity to Christmas and your focus on mental health support, I've found some excellent volunteer opportunities:\n\nStride Mental Health Limited (5km away, 0.90 similarity) offers a Safe Haven program focused on mental health recovery and support.\n\nWaves of Wellness Foundation (5km away, 0.90 similarity) provides innovative surf therapy programs to help people manage mental health and build community connections.\n\nRichmondPRA (5km away, 0.85 similarity) offers comprehensive mental health services including psychosocial support.\n\nQuest for Life Foundation (4.7km away, 0.80 similarity) provides workshops on resilience and mental wellbeing during challenging times.\n\nCommunity Restorative Centre (5km away, 0.75 similarity) offers support programs for vulnerable populations, which can be especially important during the holiday season.\n\nSouthern Youth & Family Services (4.7km away,

In [40]:
# Waves of Wellness sounds good
filtered_df.loc[response_json['charity_ids']]

Unnamed: 0,abn,charity name,registration status,charity website,charity size,basic religious charity,ais due date,date ais received,financial report date received,conducted activities,...,Operating online,Operating overseas,overseas countries,Charity weblink,location_number,operating_location,latitude,longitude,id,distance
86478,58000020146,Stride Mental Health Limited,Registered,http://stride.com.au,Large,n,31/01/2023,30/01/2023,30/01/2023,y,...,N,N,,https://stride.com.au/strides-safe-haven-and-s...,3,"Wollongong NSW, Australia",-34.424834,150.893113,Stride Mental Health Limited | Safe Haven | Wo...,5.064112
182865,40614442018,Waves of Wellness Foundation Ltd,Registered,www.foundationwow.org,Medium,n,31/01/2023,28/01/2023,28/01/2023,y,...,N,N,,https://www.foundationwow.org/surf-therapy,6,"Wollongong NSW, Australia",-34.424834,150.893113,Waves of Wellness Foundation Ltd | Surf Experi...,5.064112
30602,91111111267,RichmondPRA_ACNC Group,Voluntarily Revoked No Longer Operating,https://www.flourishaustralia.org.au,Large,n,31/01/2023,31/01/2023,31/01/2023,y,...,N,N,,https://www.health.gov.au/initiatives-and-prog...,1,"Wollongong NSW, Australia",-34.424834,150.893113,RichmondPRA_ACNC Group | Commonwealth Psychoso...,5.064112
163101,79003747153,Quest For Life Foundation,Registered,http://www.questforlife.org.au,Large,n,31/01/2023,06/12/2022,06/12/2022,y,...,Y,N,,https://www.questforlife.com.au/1-day-programs,5,"Wollongong NSW, Australia",-34.427812,150.893061,Quest For Life Foundation | Living Mindfully W...,4.733426
24403,75411263189,Community Restorative Centre Limited,Registered,www.crcnsw.org.au,Large,n,31/01/2023,30/01/2023,30/01/2023,y,...,N,N,,https://www.crcnsw.org.au/services/housing-sup...,1,"Wollongong NSW, Australia",-34.424834,150.893113,Community Restorative Centre Limited | Transit...,5.064112
56913,70244601731,Southern Youth & Family Services Limited,Registered,http://www.syfs.org.au,Large,n,31/01/2023,06/12/2022,06/12/2022,y,...,N,N,,http://syfs.org.au,2,"Wollongong NSW, Australia",-34.427812,150.893061,Southern Youth & Family Services Limited | Rec...,4.733426
34094,99169872244,PTSD Australia New Zealand Limited,Registered,http://www.fearless.org.au,Small,n,31/01/2023,06/02/2023,06/02/2023,y,...,Y,N,,www.fearless.org.au,1,"Illawarra, New South Wales, Australia",-34.523225,150.843491,PTSD Australia New Zealand Limited | National ...,7.561987


## Evaluation
To-do

In [23]:
# indexes look correct
filtered_df.loc[[26164, 14931, 14932]]

Unnamed: 0,abn,charity name,registration status,charity website,charity size,basic religious charity,ais due date,date ais received,financial report date received,conducted activities,...,Operating online,Operating overseas,overseas countries,Charity weblink,location_number,operating_location,latitude,longitude,id,distance
26164,80074835053,Wollongong Homeless Hub and Housing Services,Registered,www.whhhs.org.au,Large,n,31/01/2023,31/01/2023,31/01/2023,y,...,N,N,,www.wefh.org.au,1,"Wollongong NSW, Australia",-34.427812,150.893061,Wollongong Homeless Hub and Housing Services |...,4.733426
14931,50410129649,Need a Feed Australia INC,Registered,https://www.needafeed.org/,Small,n,31/01/2023,09/01/2023,09/01/2023,y,...,N,N,,www.needafeed.org,1,"Illawarra, New South Wales, Australia",-34.523225,150.843491,Need a Feed Australia INC | Food parcel distri...,7.561987
14932,50410129649,Need a Feed Australia INC,Registered,https://www.needafeed.org/,Small,n,31/01/2023,09/01/2023,09/01/2023,y,...,N,N,,www.needafeed.org,1,"Wollongong NSW, Australia",-34.427812,150.893061,Need a Feed Australia INC | Mobile Coffee van ...,4.733426


In [None]:
# WIP for later
def validate_charity_response(response, df):
    try:
        # Parse JSON
        data = json.loads(response)
        
        # Structure tests
        assert isinstance(data, dict), "Response must be a dictionary"
        assert set(data.keys()) == {"charity_ids", "message"}, "Must have exactly charity_ids and message keys"
        assert isinstance(data["charity_ids"], list), "charity_ids must be a list"
        assert isinstance(data["message"], str), "message must be a string"
        
        # Content tests
        assert len(data["charity_ids"]) <= 10, "Maximum 10 recommendations"
        assert all(isinstance(cid, int) for cid in data["charity_ids"]), "All IDs must be integers"
        assert all(cid in df.index for cid in data["charity_ids"]), "All IDs must exist in database"
        
        # Distance test (if location provided)
        if "distance" in df.columns:
            distances = df.loc[data["charity_ids"], "distance"]
            assert all(distances <= 10.0), "All charities must be within 10km"
            
        return True, "Validation passed"
    except Exception as e:
        return False, str(e)
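
The validator above checks IDs against `df.index`, but the user prompt also passes the model an explicit `valid_ids` list. A stricter check could confirm that every returned ID was actually in that list. This is a rough sketch under that assumption; `split_recommended_ids` is a hypothetical helper, not something the notebook defines.

```python
def split_recommended_ids(charity_ids, valid_ids, max_items=10):
    """Separate model-returned IDs into usable and rejected lists.

    An ID is kept only if it appeared in the prompt's valid_ids, has not
    already been kept, and the max_items cap has not been reached.
    """
    allowed = set(valid_ids)
    kept, rejected, seen = [], [], set()
    for cid in charity_ids:
        if cid in allowed and cid not in seen and len(kept) < max_items:
            kept.append(cid)
            seen.add(cid)
        else:
            # Hallucinated, duplicated, or over-the-cap IDs end up here.
            rejected.append(cid)
    return kept, rejected


kept, rejected = split_recommended_ids(
    [26164, 14931, 99999, 26164],   # 99999 is a made-up ID for illustration
    [26164, 14931, 14932],
)
print(kept, rejected)
```

Keeping the rejected IDs, rather than just asserting, makes it possible to log how often the model strays outside the supplied context.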

In [None]:
# choose a charity

# Mapping - `folium`


# Tying together - `streamlit`
- Mobile experience

In [None]:
# getting user location
# https://github.com/aghasemi/streamlit_js_eval?tab=readme-ov-file
# though note you'll need an error route - i.e. agent asks for location if it fails
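
The error route mentioned above could be sketched as a pure function that sits between the geolocation call and the agent. Everything here is an assumption to be checked against the `streamlit_js_eval` README: the payload shape (`{"coords": {"latitude": ..., "longitude": ...}}`, modelled on the browser's geolocation result) and the helper names `extract_coords` and `next_step` are illustrative only.

```python
from typing import Optional, Tuple


def extract_coords(geo: Optional[dict]) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from a geolocation payload, or None."""
    # ASSUMPTION: the payload mirrors the browser's GeolocationPosition shape.
    if not isinstance(geo, dict):
        return None
    coords = geo.get("coords")
    if not isinstance(coords, dict):
        return None
    lat, lon = coords.get("latitude"), coords.get("longitude")
    if isinstance(lat, (int, float)) and isinstance(lon, (int, float)):
        return float(lat), float(lon)
    return None


def next_step(geo: Optional[dict]) -> dict:
    """Decide whether to use browser coordinates or ask the user to type a location."""
    coords = extract_coords(geo)
    if coords is None:
        # Error route: permission denied, timeout, or an unexpected payload.
        return {"action": "ask_location",
                "prompt": "Where are you based? A suburb or postcode is enough."}
    return {"action": "use_coords", "coords": coords}


print(next_step(None))
print(next_step({"coords": {"latitude": -34.48, "longitude": 150.9}}))
```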