# Research question 1

- Embed each user story in the user story dataset, and persist the embeddings in a datastore so they are not recomputed (which also avoids repeated charges from commercial LLM embedders)
- Cluster the embeddings and find the optimal number of clusters k
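
The persistence idea in the first bullet can be sketched with a small on-disk cache. This is only an illustration under stated assumptions: it uses a local JSON file rather than the vector database used later in this notebook, and `cache_key`, `cached_embedding` and `embed_fn` are hypothetical names, with `embed_fn` standing in for any embedding call.

```python
import hashlib
import json
from pathlib import Path


def cache_key(model_name, text):
    """Stable key for one (model, text) pair."""
    return hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()


def cached_embedding(model_name, text, embed_fn, cache_path):
    """Return the stored embedding if present; otherwise compute and persist it."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    key = cache_key(model_name, text)
    if key not in cache:
        # The only place the (possibly paid) embedder is actually called.
        cache[key] = embed_fn(text)
        path.write_text(json.dumps(cache), encoding="utf-8")
    return cache[key]
```

Calling `cached_embedding` twice with the same model name and text runs `embed_fn` only once; the second call is served from the file.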

__Questions I am wrestling with__
- How to get the optimal number of clusters k for each `user stories dataset`.
- How to embed the user stories with several embedding models.

__Methodology__
1. The dataset used is `data/g12-camperplus.txt` (__55 user stories__).
2. Preprocess by tokenizing, lemmatizing and removing stop words.
3. Embed using an LLM embedding model and store the embeddings in a vector database.
4. Cluster with the K-means algorithm, varying the number of clusters k between 2 and the square root of n, where n is the number of user stories.
5. Evaluate using the Silhouette Coefficient (SC) and the Calinski-Harabasz (CH) index.
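
Steps 4 and 5 could look roughly like the sketch below. It assumes scikit-learn and NumPy are installed, rounds the square root of n down (so k runs from 2 to 7 for n = 55), and uses synthetic 2-D points in place of the real embeddings. `sweep_k` is a hypothetical helper name, not part of this notebook.

```python
import math

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def sweep_k(embeddings, random_state=0):
    """Fit K-means for k = 2 .. floor(sqrt(n)) and score each clustering."""
    X = np.asarray(embeddings, dtype=float)
    k_max = math.isqrt(len(X))  # floor(sqrt(n)); assumption: round down
    results = []
    for k in range(2, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        results.append({
            "k": k,
            "silhouette": silhouette_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
        })
    return results


# Synthetic stand-in for the stored embeddings: 55 points in 3 separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
sizes = (19, 18, 18)
demo = np.vstack([c + rng.normal(scale=0.5, size=(n, 2))
                  for c, n in zip(centers, sizes)])

scores = sweep_k(demo)
best = max(scores, key=lambda r: r["silhouette"])
print([r["k"] for r in scores], best["k"])
```

Both indices are "higher is better", so one simple choice is the k that maximizes the silhouette score, with the CH index as a cross-check. Other selection rules are equally defensible.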

#### Check number of user stories

In [1]:
SUPABASE_PROJECT_NAME = "g12-camperplus"

In [2]:
FILE_PATH = f"data/{SUPABASE_PROJECT_NAME}.txt"

In [3]:
number_of_user_stories = 0
with open(FILE_PATH) as file:
    for line in file:
        number_of_user_stories += 1

print(f"There are {number_of_user_stories} stories in {FILE_PATH}")

There are 55 stories in data/g12-camperplus.txt


#### Install packages needed for embedding user stories

In [4]:
#!pip install ollama langchain langchain_community

In [5]:
# !ollama pull llama3
# !ollama pull phi3
# !ollama pull mistral

Embedding a single query takes about 20.25 s with Llama3, 12 s with Mistral and 9 s with Phi-3. This suggests that the larger the model, the slower the embedding process.
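
Per-query timings like these can be measured with a small helper. The sketch below is an assumption about how one might do it, not the method used to obtain the numbers above. `time_embedding` and `fake_embed` are hypothetical names, and `fake_embed` sleeps instead of calling Ollama so the example runs anywhere.

```python
import time


def time_embedding(embed_fn, text, repeats=3):
    """Return the mean wall-clock seconds of embed_fn(text) over `repeats` calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        embed_fn(text)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def fake_embed(text):
    """Stand-in for a model's embed_query; sleeps briefly and returns a dummy vector."""
    time.sleep(0.01)
    return [0.0] * 8


print(f"{time_embedding(fake_embed, 'sample user story'):.3f} s per query")
```

With the models loaded later in this notebook, the same helper could be called as `time_embedding(llama_model.embed_query, text)`.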

__Install packages for performing data preprocessing and storing embeddings__

In [6]:
#!pip install nltk supabase

In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import supabase 

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from langchain_community.embeddings import OllamaEmbeddings

# Load stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Load LLM models
llama_model = OllamaEmbeddings(model="llama3")
mistral_model = OllamaEmbeddings(model="mistral")
phi3_model = OllamaEmbeddings(model="phi3")

In [9]:
# Load user story data
# Connect to supabase vector store
# Preprocess each user story, embed and store in the database

### Load user story data

In [10]:
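# utf-8-sig drops a leading UTF-8 byte-order mark so it cannot leak into the first story
with open(FILE_PATH, 'r', encoding='utf-8-sig') as file:
    user_stories = file.readlines()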

### Connect to supabase

In [11]:
#!pip install python-dotenv

In [12]:
from dotenv import dotenv_values
# Load environment variables from .env file
env_variables = dotenv_values(".env")

In [13]:
from supabase import create_client, Client

url: str = env_variables.get("SUPABASE_URL")
key: str = env_variables.get("SUPABASE_KEY")
supabase_client: Client = create_client(url, key)
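
If either setting is absent from `.env`, `.get` returns `None` and the failure only surfaces when the client is created. A small guard can report the missing names up front. This is a sketch: `require_settings` is a hypothetical helper, and the dictionary below uses placeholder values instead of a real `.env` file.

```python
def require_settings(settings, names):
    """Return the values for `names`, raising if any is missing or empty."""
    missing = [name for name in names if not settings.get(name)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return [settings[name] for name in names]


# Placeholder values for illustration only.
example_settings = {
    "SUPABASE_URL": "https://example.invalid",
    "SUPABASE_KEY": "placeholder-key",
}
url, key = require_settings(example_settings, ["SUPABASE_URL", "SUPABASE_KEY"])
print(url)
```

In this notebook the first argument would be `env_variables`.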

### Creating necessary functions for embedding user stories and storing them in supabase vector store

In [14]:
def preprocess_story(story):
    """
    Preprocesses a user story by tokenizing, lemmatizing, and removing stop words.
    """
    tokens = word_tokenize(story.lower().strip())
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return " ".join(filtered_tokens)

def get_embeddings(story):
    """
    Generates embeddings for a user story using LLAMA, Mistral, and Phi-3 models.
    """
    preprocessed_story = preprocess_story(story)
    llama_embedding = llama_model.embed_query(preprocessed_story)
    mistral_embedding = mistral_model.embed_query(preprocessed_story)
    phi3_embedding = phi3_model.embed_query(preprocessed_story)
    return {
        "llama_embedding": llama_embedding,
        "mistral_embedding": mistral_embedding,
        "phi3_embedding": phi3_embedding
    }

def store_data(index, story, embeddings):
    """
    Stores the preprocessed user story and embeddings in a Supabase vector database.
    """
    inserted_data = {
        "user_story_id": f"US-{index}",
        "story": story,
        "llama_embedding": embeddings["llama_embedding"],
        "mistral_embedding": embeddings["mistral_embedding"],
        "phi3_embedding": embeddings["phi3_embedding"]
    }
    
    # TODO: rename the table below so that it reflects the user stories it holds
    supabase_client.table(SUPABASE_PROJECT_NAME).insert(inserted_data).execute()
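
For the clustering step, the stored vectors have to be read back and turned into a numeric matrix. The sketch below assumes the rows come back as dictionaries shaped like `inserted_data`, and that a vector column may arrive either as a list or as a JSON-style string such as `"[0.1,0.2]"`, which is an assumption about how the datastore serializes vectors. `parse_vector` and `rows_to_matrix` are hypothetical helpers.

```python
import json


def parse_vector(value):
    """Convert a stored embedding to a list of floats.

    Accepts either a list (already decoded) or a string such as "[0.1,0.2]".
    """
    if isinstance(value, str):
        value = json.loads(value)
    return [float(x) for x in value]


def rows_to_matrix(rows, column):
    """Collect one embedding column from row dicts, ordered by story number."""
    ordered = sorted(rows, key=lambda row: int(row["user_story_id"].split("-")[1]))
    return [parse_vector(row[column]) for row in ordered]


# Rows shaped like the records written by store_data (vectors shortened).
rows = [
    {"user_story_id": "US-1", "llama_embedding": "[0.3,0.4]"},
    {"user_story_id": "US-0", "llama_embedding": [0.1, 0.2]},
]
print(rows_to_matrix(rows, "llama_embedding"))
```

The rows themselves could be fetched with `supabase_client.table(SUPABASE_PROJECT_NAME).select("*").execute().data`.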

### Preprocess, embed and store data and respective embeddings in supabase

In [15]:
import time

start_time = time.time()
for index, story in enumerate(user_stories):
    preprocessed_story = preprocess_story(story)
    print(f"Preprocessed story: {preprocessed_story} {index}")
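    # get_embeddings preprocesses its input itself, so pass the raw story
    # to avoid running the preprocessing twice
    embeddings = get_embeddings(story)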
    print(f"Done performing embedding {index}")
    store_data(index, preprocessed_story, embeddings)
    print(f"Done with US-{index}")
end_time = time.time()
elapsed_time = end_time - start_time

print(f"Elapsed time: {elapsed_time} seconds")

Preprocessed story: as camp administrator want able add camper keep track individual camper 0
Done performing embedding 0
Done with US-0
Preprocessed story: camp administrator want able remove camper n't attend camp anymore keep record organized 1
Done performing embedding 1
Done with US-1
Preprocessed story: camp administrator want able keep camper record previous year amount work need lowered 2
Done performing embedding 2
Done with US-2
Preprocessed story: camp administrator want able upload consent form camper parent easily access form 3
Done performing embedding 3
Done with US-3
Preprocessed story: camp administrator want able keep track camper submitted form legal issue avoided 4
Done performing embedding 4
Done with US-4
Preprocessed story: camp administrator want able schedule activity camper camp worker easily keep track time 5
Done performing embedding 5
Done with US-5
Preprocessed story: camp administrator want able suspend camper behavioral problem 6
Done performing emb

__Embedding the 55 user stories and storing them in the database took about 39 minutes__
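
As a rough sanity check, this total is in line with the per-query timings reported earlier (20.25 s, 12 s and 9 s). The estimate below assumes each story is embedded once per model and ignores preprocessing and database writes.

```python
seconds_per_query = {"llama3": 20.25, "mistral": 12.0, "phi3": 9.0}
n_stories = 55

# Three embeddings are computed for every story.
per_story = sum(seconds_per_query.values())
total_minutes = per_story * n_stories / 60

print(per_story, total_minutes)  # 41.25 37.8125
```

That is about 38 minutes for embedding alone, which leaves roughly a minute of the observed 39 for everything else.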