## Step 1: Libraries Installation
Below are brief explanations of the tools and libraries utilised within the implementation code:
* **datasets**: Part of the Hugging Face ecosystem. It provides access to ready-to-use datasets hosted on the Hugging Face Hub, which can be used for training, fine-tuning, and benchmarking machine learning models.

* **pandas**: A data science library that provides robust data structures and methods for data manipulation, processing and analysis.

* **openai**: This is the official Python client library for accessing OpenAI's suite of AI models and tools, including GPT and embedding models.

* **pymongo**: The official Python driver for MongoDB. It is used here to connect to the database, insert documents, and run queries.

In [1]:
!pip install datasets pandas openai pymongo


## Step 2: Data Loading

Load the dataset titled ["brianarbuckle/cocktail_recipes"](https://huggingface.co/datasets/brianarbuckle/cocktail_recipes). Each record describes one cocktail recipe with the fields title, ingredients, directions, misc, source and ner.

In [2]:
# 1. Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/brianarbuckle/cocktail_recipes
dataset = load_dataset("brianarbuckle/cocktail_recipes")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)


Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"[pernod, rum]"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"[cocchi americano, pernod, tequila]"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"[lillet, gin]"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,[]
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"[pernod, absinthe]"


## Step 3: Data Cleaning and Preparation

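This step cleans the data and prepares it for embedding. The `ner` column holds a list of the spirits and liqueurs extracted from each recipe. The code below joins each list into a comma-separated string and marks empty lists as missing, so that rows without any value can be dropped before embeddings are generated with the OpenAI embedding model.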

In [3]:
import numpy as np

def process_ingredients(ner):
    if isinstance(ner, list):
        if not ner:
            return np.nan
        else:
            return ", ".join(ner)
    else:
        return ner

dataset_df['ner'] = dataset_df['ner'].apply(process_ingredients)
dataset_df


Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe"
...,...,...,...,...,...,...
870,Yellow Bird,"[ A Caribbean favorite., 1 ounce dark rum, 1 o...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"galliano, triple sec, rum, cointreau"
871,Yellow Fever,"[1 1/2 ounces vodka, 1/2 ounce Galliano, 1/2 o...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"vodka, galliano"
872,Yellow Parrot Cocktail,"[3/4 ounce yellow Chartreuse, 3/4 ounce Pernod...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"brandy, pernod, chartreuse"
873,[The Spirit of the] Algonquin,"[ 2oz. 90 Proof Rye, .75oz. Lemon Juice, .75oz...",[shake on ice and strain],[Suggested glassware is Cocktail Glass],PDT,


In [4]:
print("Columns:", dataset_df.columns)
print("\nNumber of rows and columns:", dataset_df.shape)
print("\nBasic Statistics for numerical data:")
print(dataset_df.describe())
print("\nNumber of missing values in each column:")
print(dataset_df.isnull().sum())

Columns: Index(['title', 'ingredients', 'directions', 'misc', 'source', 'ner'], dtype='object')

Number of rows and columns: (875, 6)

Basic Statistics for numerical data:
             title                                        ingredients  \
count          875                                                875   
unique         646                                                858   
top     Ward Eight  [After Dinner Cocktail, 5 cl Cognac, 2 cl Crme...   
freq             9                                                  6   

       directions misc                 source  ner  
count         875  875                    875  753  
unique        456   63                     83  252  
top            []   []  The Ultimate Bar Book  gin  
freq           83  376                    438   81  

Number of missing values in each column:
title            0
ingredients      0
directions       0
misc             0
source           0
ner            122
dtype: int64


In [5]:
# Chain dropna and rename so the result is a new DataFrame, not a modified slice.
# This avoids pandas' SettingWithCopyWarning.
dataset_df = dataset_df.dropna(subset=['ner']).rename(columns={'ner': 'base'})
dataset_df


Unnamed: 0,title,ingredients,directions,misc,source,base
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin"
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe"
5,Acapulco,"[1 ounce gold tequila, 1 ounce gold rum, 2 oun...","[Shake ingredients with ice., Strain into an i...",[Suggested glassware is Highball Glass],The Ultimate Bar Book,"tequila, rum"
...,...,...,...,...,...,...
869,Yellow Bird,"[3 cl White Rum, 1.5 cl Galliano, 1.5 cl Tripl...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],IBA,"galliano, triple sec, rum"
870,Yellow Bird,"[ A Caribbean favorite., 1 ounce dark rum, 1 o...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"galliano, triple sec, rum, cointreau"
871,Yellow Fever,"[1 1/2 ounces vodka, 1/2 ounce Galliano, 1/2 o...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"vodka, galliano"
872,Yellow Parrot Cocktail,"[3/4 ounce yellow Chartreuse, 3/4 ounce Pernod...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"brandy, pernod, chartreuse"


## Step 4: Create embeddings with OpenAI

In [7]:
import os

import openai
# from google.colab import userdata

# Read the API key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY")

EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
dataset_df = dataset_df.assign(base_embedding_optimised=dataset_df['base'].apply(get_embedding))
dataset_df.head()


Unnamed: 0,title,ingredients,directions,misc,source,base,base_embedding_optimised
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum","[-0.00484482292085886, 0.036357153207063675, 0..."
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila","[-0.012937657535076141, -0.006157346069812775,..."
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin","[-0.041257601231336594, -0.008328181691467762,..."
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe","[-0.04415753856301308, 0.030096998438239098, 0..."
5,Acapulco,"[1 ounce gold tequila, 1 ounce gold rum, 2 oun...","[Shake ingredients with ice., Strain into an i...",[Suggested glassware is Highball Glass],The Ultimate Bar Book,"tequila, rum","[-0.016873909160494804, 0.004132091999053955, ..."
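
The cell above calls the embeddings endpoint once per row. Since that endpoint also accepts a list of strings, the same work can be done in fewer requests. The sketch below shows only the batching logic; `embed_in_batches`, `fake_embed_batch`, and the batch size are hypothetical names and values chosen for illustration, and the fake embedder stands in for a real API call so the example runs offline.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # One call per batch instead of one call per text
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real API call: returns one 2-dimensional vector per text."""
    return [[float(len(text)), float(text.count(","))] for text in batch]


vectors = embed_in_batches(
    ["pernod, rum", "lillet, gin", "vodka"], fake_embed_batch, batch_size=2
)
print(len(vectors))
```

To use this with OpenAI, `embed_batch` could wrap `openai.embeddings.create(input=batch, model=EMBEDDING_MODEL)` and return the `embedding` attribute of each item in the response's `data` list.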


## Step 5: Vector Database Setup and Data Ingestion

In [8]:
import os

import pymongo


def get_mongo_client(mongo_uri):
  """Establish connection to the MongoDB."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    # MongoClient connects lazily, so ping the server to confirm the connection works
    client.admin.command("ping")
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.ConnectionFailure as e:
    print(f"Connection failed: {e}")
    return None

# Read the connection string from the environment instead of embedding credentials in the notebook
mongo_uri = os.environ.get("MONGO_URI")
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

# Select the database and collection used for ingestion
db = mongo_client['cocktails']
collection = db['cocktailsDB']

Connection to MongoDB successful


In [9]:
# Delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 753, 'electionId': ObjectId('7fffffff0000000000000187'), 'opTime': {'ts': Timestamp(1716703643, 768), 't': 391}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1716703643, 772), 'signature': {'hash': b'\x82{\xc1\x0b_\xe0\xfb*\xfb\n\xad\x90nB\x06\xb1\xbc+\xfd\xdf', 'keyId': 7327356172825001986}}, 'operationTime': Timestamp(1716703643, 768)}, acknowledged=True)

In [10]:
documents = dataset_df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed


## Step 6: Create a Vector Search Index

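Before running any searches, create an Atlas Vector Search index on the `cocktailsDB` collection of the `cocktails` database in MongoDB Atlas. The search code in the next step expects the index to be named `vector_index` and to cover the `base_embedding_optimised` field. The `text-embedding-3-small` model produces 1536-dimensional vectors by default, so the index should be defined with that dimension.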
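
As a rough guide, an index definition matching the search code could look like the sketch below. The helper name `build_vector_index_definition` is hypothetical, and the choice of cosine similarity is an assumption, since the original does not say which similarity function was used.

```python
ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def build_vector_index_definition(path: str, num_dimensions: int, similarity: str = "cosine") -> dict:
    """Build an Atlas Vector Search index definition for a single vector field."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"Unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                     # field that stores the embedding
                "numDimensions": num_dimensions,  # must match the embedding model's output size
                "similarity": similarity,         # assumption: cosine
            }
        ]
    }


# 1536 is the default output size of text-embedding-3-small
definition = build_vector_index_definition("base_embedding_optimised", 1536)
print(definition)

# With PyMongo 4.7 or later, the index could be created from code rather than the Atlas UI:
#   from pymongo.operations import SearchIndexModel
#   model = SearchIndexModel(definition=definition, name="vector_index", type="vectorSearch")
#   collection.create_search_index(model)
```

The printed dictionary can also be pasted into the JSON editor in the Atlas UI when creating the index by hand.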

## Step 7: Perform Vector Search on User Queries

In [11]:

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so callers can still iterate over the result
        print("Invalid query or embedding generation failed.")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "base_embedding_optimised",
                "numCandidates": 150,  # Number of nearest-neighbour candidates to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "base": 1,  # Include the base field
                "title": 1,  # Include the title field
                "ingredients": 1, # Include the ingredients field
                "directions": 1,  # Include the directions field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
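
To make the ranking step more concrete, the sketch below scores a few toy vectors against a query with cosine similarity and keeps the best matches. It is an exact, brute-force illustration, not how Atlas works internally: `$vectorSearch` uses an approximate nearest-neighbour search over `numCandidates` candidates, and the `score` it reports is a normalised value that need not equal the raw cosine. The vectors, labels, and function names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    documents: Dict[str, Sequence[float]],
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the top matches."""
    scored = [(label, cosine_similarity(query, vector)) for label, vector in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy 2-dimensional "embeddings"; real ones have 1536 dimensions
toy_documents = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
ranked = rank_by_similarity([1.0, 0.0], toy_documents, limit=2)
print(ranked)
```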

## Step 8: Handling User Query and Result

In [12]:
def handle_user_query(query, collection):

  get_knowledge = vector_search(query, collection)

  search_result = ''
  for result in get_knowledge:
      search_result += f"Title: {result.get('title', 'N/A')}, Base Drink: {result.get('base', 'N/A')}\n"

  completion = openai.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": "You are a cocktail recommendation system."},
          {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
      ]
  )

  return (completion.choices[0].message.content), search_result
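
The prompt-building part of this function can be tested without calling the API. The sketch below separates it into two helpers, `format_context` and `build_messages`, which are hypothetical names and not part of the original notebook. The fallback text for an empty result list is also an assumption.

```python
from typing import Any, Dict, List


def format_context(results: List[Dict[str, Any]]) -> str:
    """Turn search results into one line of context per cocktail."""
    lines = []
    for result in results:
        title = result.get("title", "N/A")
        base = result.get("base", "N/A")
        lines.append(f"Title: {title}, Base Drink: {base}")
    return "\n".join(lines)


def build_messages(query: str, results: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Build the chat messages that combine the user query with retrieved context."""
    # Assumption: tell the model explicitly when nothing was retrieved
    context = format_context(results) or "No matching cocktails were found."
    return [
        {"role": "system", "content": "You are a cocktail recommendation system."},
        {
            "role": "user",
            "content": f"Answer this user query: {query} with the following context:\n{context}",
        },
    ]


sample_results = [
    {"title": "Watermelon", "base": "vodka"},
    {"title": "Vodka Gimlet", "base": "vodka"},
]
messages = build_messages("What are the best cocktails I can make with vodka?", sample_results)
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `openai.chat.completions.create`.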

In [13]:
query = "What are the best cocktails I can make with vodka?"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"Source Information: \n{source_information}")


Response: Here are some cocktail recommendations you can make with vodka:

1. **Watermelon** - This refreshing cocktail combines vodka with fresh watermelon juice for a fruity and summery drink.

2. **Walnut Martini** - A sophisticated cocktail that combines vodka with walnut liqueur for a nutty and rich flavor profile.

3. **Vodka Gimlet** - A classic cocktail made with vodka, lime juice, and simple syrup for a tart and refreshing drink.

4. **Russian Bear** - A strong cocktail that combines vodka with coffee liqueur and cream for a rich and indulgent drink.

5. **Purple Passion** - A vibrant cocktail that combines vodka with grape juice and a splash of soda for a fruity and colorful drink.
Source Information: 
Title: Watermelon, Base Drink: vodka
Title: Walnut Martini, Base Drink: vodka
Title: Vodka Gimlet, Base Drink: vodka
Title: Russian Bear, Base Drink: vodka
Title: Purple Passion, Base Drink: vodka



## Frontend Part

In [14]:
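!npm install localtunnel
# app.py also imports streamlit_lottie and requests, so install them alongside streamlit
!pip install streamlit streamlit-lottie requests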



In [17]:
%%writefile app.py
import os

import openai
import pymongo
import streamlit as st
from streamlit_lottie import st_lottie
import requests

# Read the OpenAI API key from the environment rather than hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY")
EMBEDDING_MODEL = "text-embedding-3-small"

# Function to establish connection to MongoDB
def get_mongo_client(mongo_uri):
    try:
        client = pymongo.MongoClient(mongo_uri)
        # MongoClient connects lazily, so ping the server to confirm the connection works
        client.admin.command("ping")
        print("Connection to MongoDB successful")
        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed: {e}")
        return None

# Read the MongoDB connection URI from the environment instead of embedding credentials
mongo_uri = os.environ.get("MONGO_URI")
if not mongo_uri:
    print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

# Select the database and collection that hold the ingested cocktails
db = mongo_client['cocktails']
collection = db['cocktailsDB']

# Function to generate an embedding for the given text
def get_embedding(text):
    if not text or not isinstance(text, str):
        return None

    try:
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

# Function to perform a vector search in the MongoDB collection based on the user query
def vector_search(user_query, collection):
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so callers can still iterate over the result
        print("Invalid query or embedding generation failed.")
        return []

    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "base_embedding_optimised",
                "numCandidates": 150,
                "limit": 5
            }
        },
        {
            "$project": {
                "_id": 0,
                "base": 1,
                "title": 1,
                "ingredients": 1,
                "directions": 1,
                "score": {
                    "$meta": "vectorSearchScore"
                }
            }
        }
    ]

    results = collection.aggregate(pipeline)
    return list(results)

# Function to handle user query and return the response
def handle_user_query(query, collection):
    get_knowledge = vector_search(query, collection)

    search_result = ''
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Base Drink: {result.get('base', 'N/A')}, \n Ingredients: {result.get('ingredients', 'N/A')}, \n Directions: {result.get('directions', 'N/A')}\n"

    # openai>=1.0 removed openai.ChatCompletion; use the chat.completions interface instead
    completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a cocktail recommendation system."},
            {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
        ]
    )

    return (completion.choices[0].message.content), search_result

# Function to load Lottie animations
def load_lottie_url(url):
    r = requests.get(url)
    if r.status_code != 200:
        return None
    return r.json()

# Load Lottie animation
lottie_cocktail = load_lottie_url("https://assets4.lottiefiles.com/private_files/lf30_oGlWy5.json")

# Streamlit app setup
st.set_page_config(page_title="Cocktail Recommendation", page_icon="🍸", layout="centered")
st.title("🍹 Cocktail Recommendation System")
# Only render the animation if it was downloaded successfully
if lottie_cocktail:
    st_lottie(lottie_cocktail, height=300)

st.write("Welcome to the Cocktail Recommendation System! Ask for cocktail recipes and recommendations based on your preferences.")

query = st.text_input("Enter your cocktail query here:")

if query:
    response, source_information = handle_user_query(query, collection)
    
    st.subheader("Search Results:")
    st.write(source_information)
    
    st.subheader("Recommendation:")
    st.write(response)
else:
    st.write("Please enter a query to get cocktail recommendations.")

# Additional features can be added here as needed.


Overwriting app.py
