## Step 1: Libraries Installation
The code in this notebook uses the following libraries:
* **datasets**: Part of the Hugging Face ecosystem. It provides access to ready-to-use datasets for training, fine-tuning, and benchmarking machine learning models.

* **pandas**: A data science library that provides robust data structures and methods for data manipulation, processing and analysis.

* **openai**: The official Python client library for OpenAI's models and tools, including the GPT and embedding models.

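* **pymongo**: The official Python driver for MongoDB. It lets the notebook connect to a MongoDB database and read and write documents.

* **python-dotenv**: Loads environment variables from a `.env` file. Step 4 uses it to read the OpenAI API key.

In [1]:
!pip install datasets pandas openai pymongo python-dotenv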

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
Collecting openai
  Downloading openai-1.30.3-py3-none-any.whl.metadata (21 kB)
Collecting pymongo
  Downloading pymongo-4.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting tqdm>=4.62.1 (from datasets)
  Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collect
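After installation, it can be useful to confirm which versions ended up in the environment, since the install log above is truncated. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment
            versions[name] = None
    return versions


# Report the libraries this notebook depends on
for package, version in installed_versions(
    ["datasets", "pandas", "openai", "pymongo"]
).items():
    print(f"{package}: {version or 'NOT INSTALLED'}")
```

A `None` entry means the package needs to be installed before running the later cells.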

## Step 2: Data Loading

Load the dataset ["brianarbuckle/cocktail_recipes"](https://huggingface.co/datasets/brianarbuckle/cocktail_recipes) from the Hugging Face Hub. Each record describes one cocktail recipe with the fields `title`, `ingredients`, `directions`, `misc`, `source` and `ner`.

In [2]:
# 1. Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/brianarbuckle/cocktail_recipes
dataset = load_dataset("brianarbuckle/cocktail_recipes")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)

  from .autonotebook import tqdm as notebook_tqdm
Downloading readme: 100%|██████████| 3.33k/3.33k [00:00<00:00, 18.2MB/s]
Downloading data: 100%|██████████| 96.9k/96.9k [00:00<00:00, 123kB/s]
Generating train split: 100%|██████████| 875/875 [00:00<00:00, 106476.04 examples/s]


Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"[pernod, rum]"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"[cocchi americano, pernod, tequila]"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"[lillet, gin]"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,[]
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"[pernod, absinthe]"


## Step 3: Data Cleaning and Preparation

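This step cleans the data ahead of embedding generation. The `ner` column holds a list of base ingredients for each recipe. The function below joins each list into a comma-separated string and marks empty lists as missing, so that those rows can be dropped before the embeddings are created with the OpenAI embedding model.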

In [3]:
import numpy as np

def process_ingredients(ner):
    if isinstance(ner, list):
        if not ner:
            return np.nan
        else:
            return ", ".join(ner)
    else:
        return ner

dataset_df['ner'] = dataset_df['ner'].apply(process_ingredients)
dataset_df


Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe"
...,...,...,...,...,...,...
870,Yellow Bird,"[ A Caribbean favorite., 1 ounce dark rum, 1 o...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"galliano, triple sec, rum, cointreau"
871,Yellow Fever,"[1 1/2 ounces vodka, 1/2 ounce Galliano, 1/2 o...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"vodka, galliano"
872,Yellow Parrot Cocktail,"[3/4 ounce yellow Chartreuse, 3/4 ounce Pernod...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"brandy, pernod, chartreuse"
873,[The Spirit of the] Algonquin,"[ 2oz. 90 Proof Rye, .75oz. Lemon Juice, .75oz...",[shake on ice and strain],[Suggested glassware is Cocktail Glass],PDT,


In [4]:
print("Columns:", dataset_df.columns)
print("\nNumber of rows and columns:", dataset_df.shape)
print("\nBasic Statistics for numerical data:")
print(dataset_df.describe())
print("\nNumber of missing values in each column:")
print(dataset_df.isnull().sum())

Columns: Index(['title', 'ingredients', 'directions', 'misc', 'source', 'ner'], dtype='object')

Number of rows and columns: (875, 6)

Basic Statistics for numerical data:
             title                                        ingredients  \
count          875                                                875   
unique         646                                                858   
top     Ward Eight  [After Dinner Cocktail, 5 cl Cognac, 2 cl Crme...   
freq             9                                                  6   

       directions misc                 source  ner  
count         875  875                    875  753  
unique        456   63                     83  252  
top            []   []  The Ultimate Bar Book  gin  
freq           83  376                    438   81  

Number of missing values in each column:
title            0
ingredients      0
directions       0
misc             0
source           0
ner            122
dtype: int64


In [4]:
# Drop rows without a base ingredient, then rename `ner` to `base`.
# Reassigning (instead of inplace=True) avoids pandas' SettingWithCopyWarning.
dataset_df = dataset_df.dropna(subset=['ner']).rename(columns={'ner': 'base'})
dataset_df


Unnamed: 0,title,ingredients,directions,misc,source,base
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin"
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe"
5,Acapulco,"[1 ounce gold tequila, 1 ounce gold rum, 2 oun...","[Shake ingredients with ice., Strain into an i...",[Suggested glassware is Highball Glass],The Ultimate Bar Book,"tequila, rum"
...,...,...,...,...,...,...
869,Yellow Bird,"[3 cl White Rum, 1.5 cl Galliano, 1.5 cl Tripl...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],IBA,"galliano, triple sec, rum"
870,Yellow Bird,"[ A Caribbean favorite., 1 ounce dark rum, 1 o...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"galliano, triple sec, rum, cointreau"
871,Yellow Fever,"[1 1/2 ounces vodka, 1/2 ounce Galliano, 1/2 o...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"vodka, galliano"
872,Yellow Parrot Cocktail,"[3/4 ounce yellow Chartreuse, 3/4 ounce Pernod...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"brandy, pernod, chartreuse"


## Step 4: Create embeddings with OpenAI

In [11]:
import os
import openai
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access the API key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""
    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

# Work on an explicit copy so the column assignment cannot trigger SettingWithCopyWarning
dataset_df = dataset_df.copy()
dataset_df["base_embedding_optimised"] = dataset_df['base'].apply(get_embedding)
dataset_df.head()


Unnamed: 0,title,ingredients,directions,misc,source,base,base_embedding_optimised
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum","[-0.004870898090302944, 0.03630739822983742, 0..."
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila","[-0.012892143800854683, -0.006151298992335796,..."
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin","[-0.041257601231336594, -0.008328181691467762,..."
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe","[-0.04415753856301308, 0.030096998438239098, 0..."
5,Acapulco,"[1 ounce gold tequila, 1 ounce gold rum, 2 oun...","[Shake ingredients with ice., Strain into an i...",[Suggested glassware is Highball Glass],The Ultimate Bar Book,"tequila, rum","[-0.015642663463950157, 0.00371152488514781, -..."
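The cell above calls the embeddings API once per row. The embeddings endpoint also accepts a list of inputs, so one possible optimisation is to send the texts in batches. The sketch below shows only the batching logic; `chunked`, `embed_in_batches`, the `embed_batch` callable and the batch size of 100 are hypothetical choices, not part of the original notebook.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Assumed wiring with the openai client and EMBEDDING_MODEL defined above:
# def openai_embed_batch(batch):
#     response = openai.embeddings.create(input=batch, model=EMBEDDING_MODEL)
#     return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]
#
# dataset_df["base_embedding_optimised"] = embed_in_batches(
#     dataset_df["base"].tolist(), openai_embed_batch
# )
```

Passing the API call in as a function keeps the batching logic testable without network access.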


## Step 5: Vector Database Setup and Data Ingestion

In [12]:
import pymongo


def get_mongo_client(mongo_uri):
  """Establish a connection to MongoDB and verify it with a ping."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    # MongoClient connects lazily, so ping the server to confirm it is reachable
    client.admin.command("ping")
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.PyMongoError as e:
    print(f"Connection failed: {e}")
    return None

# Keep credentials out of the notebook: set MONGO_URI in your environment or .env file
mongo_uri = os.getenv("MONGO_URI")
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

# Select the database and collection used for ingestion
db = mongo_client['cocktails']
collection = db['cocktailsDB']

Connection to MongoDB successful
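The `if not mongo_uri` check in the cell above only prints a message, so execution continues even when the setting is missing. One alternative is to fail fast. The sketch below assumes the settings live in environment variables; the helper name `require_env` is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or .env file"
        )
    return value


# Assumed usage in this notebook:
# mongo_uri = require_env("MONGO_URI")
```

Raising an exception stops the cell immediately, which avoids a less obvious failure later when the client is used.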


In [13]:
# Delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 753, 'electionId': ObjectId('7fffffff0000000000000187'), 'opTime': {'ts': Timestamp(1716778214, 755), 't': 391}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1716778214, 757), 'signature': {'hash': b'\xfa\xc3\x8f\x1a\xab\xbf\xbfW\x01ha\xaf\xce\x17\x14/\x0f.a\xd0', 'keyId': 7327356172825001986}}, 'operationTime': Timestamp(1716778214, 755)}, acknowledged=True)

In [14]:
documents = dataset_df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed


## Step 6: Create a Vector Search Index

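Before running any queries, create an Atlas Vector Search index on the `cocktailsDB` collection in MongoDB Atlas. The search code in Step 7 expects the index to be named `vector_index` and to cover the `base_embedding_optimised` field. The `text-embedding-3-small` model produces 1536-dimensional vectors by default, so the index must use the same number of dimensions.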
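As a rough guide, the index definition could look like the sketch below. The index name `vector_index` and the path `base_embedding_optimised` come from the search code in Step 7, and 1536 is the default output size of `text-embedding-3-small`. The `cosine` similarity is an assumption, as are the helper names. Creating the index from Python assumes PyMongo 4.7 or later and an Atlas cluster that allows programmatic search index management; the same JSON can instead be pasted into the Atlas UI.

```python
def build_vector_index_definition(
    path="base_embedding_optimised",
    num_dimensions=1536,
    similarity="cosine",  # Assumption: "euclidean" and "dotProduct" are the other options
):
    """Return an Atlas Vector Search index definition for one embedding field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name="vector_index"):
    """Create the vector index on `collection` (assumes PyMongo >= 4.7)."""
    # Imported lazily so the definition helper works without pymongo installed
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)


print(build_vector_index_definition())
```

Index builds are asynchronous, so queries may return no results until Atlas reports the index as ready.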

## Step 7: Perform Vector Search on User Queries

In [15]:

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        print("Invalid query or embedding generation failed.")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "base_embedding_optimised",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "base": 1,  # Include the base field
                "title": 1,  # Include the title field
                "ingredients": 1, # Include the ingredients field
                "directions": 1,  # Include the directions field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
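
To build intuition for the `score` field returned by the pipeline, the sketch below computes cosine similarity between two vectors in plain Python. It assumes the index was created with `cosine` similarity, in which case Atlas reports a normalised score of `(1 + cosine) / 2`. The function names are hypothetical and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated
        return 0.0
    return dot / (norm_a * norm_b)


def normalised_score(cosine):
    """Map a cosine in [-1, 1] to a score in [0, 1]."""
    return (1 + cosine) / 2


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(normalised_score(same), normalised_score(orthogonal))
```

Identical directions give the highest score and unrelated (orthogonal) directions land in the middle of the range.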

## Step 8: Handling User Query and Result

In [16]:
def handle_user_query(query, collection):

  get_knowledge = vector_search(query, collection)

  search_result = ''
  for result in get_knowledge:
      search_result += f"Title: {result.get('title', 'N/A')}, Base Drink: {result.get('base', 'N/A')}\n"

  completion = openai.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": "You are a cocktail recommendation system."},
          {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
      ]
  )

  return (completion.choices[0].message.content), search_result
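
The prompt assembly in `handle_user_query` can also be split into small helpers, which makes the context formatting easy to inspect without calling the API. This is a sketch, not the notebook's method: `format_context` and `build_messages` are hypothetical names, and the extra instruction in the system message is an assumption about how to keep answers grounded in the retrieved documents.

```python
def format_context(results):
    """Render retrieved documents as one line per cocktail."""
    lines = [
        f"Title: {doc.get('title', 'N/A')}, Base Drink: {doc.get('base', 'N/A')}"
        for doc in results
    ]
    return "\n".join(lines)


def build_messages(query, context):
    """Build the chat messages for the completion request."""
    return [
        {
            "role": "system",
            "content": (
                "You are a cocktail recommendation system. "
                "Recommend only cocktails that appear in the provided context."
            ),
        },
        {"role": "user", "content": f"Query: {query}\n\nContext:\n{context}"},
    ]


sample = [{"title": "Vodka Gimlet", "base": "vodka"}, {"title": "Mystery"}]
print(format_context(sample))
```

The resulting list can be passed as the `messages` argument of `openai.chat.completions.create`.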

In [17]:
query = "What are the best cocktails I can make with vodka?"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"Source Information: \n{source_information}")


Response: Here are some cocktail recommendations with vodka as the base drink:

1. Watermelon Vodka Cocktail - Refreshing and fruity, perfect for summer evenings.
2. Vodka Gimlet - A classic cocktail with a simple yet delicious combination of vodka and lime juice.
3. Tokyo Mary - A twist on the traditional Bloody Mary, with vodka as the base and Asian-inspired flavors.
4. Russian Spring Punch - A bubbly and refreshing cocktail with vodka, elderflower liqueur, and fresh lemon juice.
5. Purple Passion - A colorful and sweet cocktail with vodka, cranberry juice, and a splash of soda water.

Enjoy experimenting with these delicious vodka-based cocktails!
Source Information: 
Title: Watermelon, Base Drink: vodka
Title: Vodka Gimlet, Base Drink: vodka
Title: Tokyo Mary, Base Drink: vodka
Title: Russian Spring Punch, Base Drink: vodka
Title: Purple Passion, Base Drink: vodka



## Frontend Part

In [18]:
!npm install localtunnel
!pip install streamlit

up to date, audited 23 packages in 945ms

3 packages are looking for funding
  run `npm fund` for details

2 moderate severity vulnerabilities

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.
Collecting streamlit
  Downloading streamlit-1.35.0-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting altair<6,>=4.0 (from streamlit)
  Downloading altair-5.3.0-py3-none-any.whl.metadata (9.2 kB)
Collecting blinker<2,>=1.0.0 (from streamlit)
  Downloading blinker-1.8.2-py3-none-any.whl.metadata (1.6 kB)
Collecting cachetools<6,>=4.0 (from streamlit)
  Downloading cachetools-5.3.3-py3-none-any.whl.metadata (5.3 kB)
Collecting click<9,>=7.0 (from streamlit)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting protobuf<5,>=3.20 (from streamlit)
  Downloading protobuf-4.25.3-cp37-

In [20]:
%%writefile app.py
import os

import openai
import pymongo
import streamlit as st

# Read secrets from the environment instead of hardcoding them in the source
openai.api_key = os.getenv("OPENAI_API_KEY")
EMBEDDING_MODEL = "text-embedding-3-small"


def get_mongo_client(mongo_uri):
  """Establish a connection to MongoDB and verify it with a ping."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    # MongoClient connects lazily, so ping the server to confirm it is reachable
    client.admin.command("ping")
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.PyMongoError as e:
    print(f"Connection failed: {e}")
    return None

mongo_uri = os.getenv("MONGO_URI")
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

# Select the database and collection populated in Step 5
db = mongo_client['cocktails']
collection = db['cocktailsDB']

def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        print("Invalid query or embedding generation failed.")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "base_embedding_optimised",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "base": 1,  # Include the base field
                "title": 1,  # Include the title field
                "ingredients": 1, # Include the ingredients field
                "directions": 1,  # Include the directions field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

def handle_user_query(query, collection):

  get_knowledge = vector_search(query, collection)

  search_result = ''
  for result in get_knowledge:
      search_result += f"Title: {result.get('title', 'N/A')}, Base Drink: {result.get('base', 'N/A')}, \n Ingredients: {result.get('ingredients', 'N/A')}, \n Directions: {result.get('directions', 'N/A')}\n"

  completion = openai.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": "You are a cocktail recommendation system."},
          {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
      ]
  )

  return (completion.choices[0].message.content), search_result

# 6. Run the query and retrieve the sources

st.set_page_config(page_title="Cocktail Recommendation")
st.header("Ask For Cocktail Recommendation")
query = st.text_input("Query")
if query:
  response, source_information = handle_user_query(query, collection)
  print(source_information)
  print(response)
  st.write(f"Response: {response}\n\n")






Overwriting app.py
