<a href="https://colab.research.google.com/github/Aditya100300/LLMs_from_scratch/blob/main/Chapter_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Download a pretrained English pipeline for spaCy. The en_core_web family comes in several sizes, including:
# - Large (en_core_web_lg)
# - Medium (en_core_web_md)
# - Small (en_core_web_sm)
# Here we use the small model; the download logged below is about 12.8 MB.

!python -m spacy download en_core_web_sm
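import locale
# Workaround for Colab sessions where the preferred encoding is not reported as UTF-8,
# which can make later shell commands (lines starting with "!") fail.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding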
!pip install openai==0.27.7

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 22.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting openai==0.27.7
  Downloading openai-0.27.7-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.27.7-py3-none-any.whl (71 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.0/72.0 kB 1.9 MB/s eta 0:00:00
Installing collected packages: openai
  Attempting uninstall: openai
   

### Explanation:
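- **spaCy download**: We install the small English model (`en_core_web_sm`) via spaCy’s CLI command.
- **locale fix**: Overrides `locale.getpreferredencoding` so that it always reports UTF-8, a common workaround in Colab when the session's locale is not detected as UTF-8.
- **pip install openai**: Pins version 0.27.7, a pre-1.0 release that still provides `openai.embeddings_utils` (imported below) and the `ChatCompletion` calls used later.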


In [None]:
# drive.mount() mounts your Google Drive at /content/drive so its files can be read and written from this notebook.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.23.1-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6 

In [None]:
!pip install -U sentence-transformers rank_bm25

Collecting sentence-transformers
  Downloading sentence_transformers-4.0.1-py3-none-any.whl.metadata (13 kB)
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->senten

### Explanation:
- **spacy**: The library for advanced NLP. We have `en_core_web_sm` installed.
- **drive.mount()**: Access your Google Drive to store or load data.
- **pip install gradio**: For building a simple interactive UI.
- **pip install -U sentence-transformers rank_bm25**: Installs or upgrades `sentence-transformers` (sentence embeddings) and `rank_bm25` (BM25 keyword ranking).


In [None]:
import json
import pandas as pd
import time
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest
import nltk
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util
# import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity

### Explanation:
- **pandas**: For DataFrame manipulations.
- **tqdm**: For progress bars when encoding text.
- **SentenceTransformer**: Our embedding model of choice.
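- **openai**: `openai.embeddings_utils` supplies `get_embedding` and `cosine_similarity`; the library is also used for completions in a function below.
- **gradio**: For building a web UI (installed above; not imported in this cell).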


Read the data for Paris Hotels

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/hamzafarooq/maven-mlsystem-design-cohort-1/main/data/paris_02_11_23.csv')

In [None]:
df.head()

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."


In [None]:
df.shape

(28694, 18)

In [None]:
# Count how many reviews each hotel has; value_counts() sorts the result from most to least frequent.
df.name.value_counts()

name,count
"Hotel d'Angleterre, Saint Germain des Pres",960
InterContinental Paris - Le Grand,800
Hotel Marceau Champs Elysees,800
Hotel Saint-Marc,800
Hotel Dauphine Saint Germain,800
...,...
Hotel B55,94
Les Jardins du Faubourg,78
Ibis Styles Paris Meteor Avenue de la Porte d'Italie,62
Hotel Maxim Folies,54


### Explanation:
- **pd.read_csv**: Reads a CSV from a GitHub raw URL.
- **df.head()**: Display first 5 rows for a quick preview.
- **df.shape**: Check how many rows × columns are in the dataset.
- **df.name.value_counts()**: See how many times each distinct hotel name appears.
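
As a minimal sketch of what these calls return, the toy DataFrame below (made-up hotel names, not the Paris data) shows the `(rows, columns)` tuple from `shape` and the per-name counts from `value_counts()`, sorted from most to least frequent.

```python
import pandas as pd

# Toy stand-in for the reviews table; the hotel names are made up.
toy = pd.DataFrame({
    "name": ["Hotel A", "Hotel B", "Hotel A", "Hotel C", "Hotel A", "Hotel B"],
    "review_rating": [5, 3, 4, 2, 5, 1],
})

n_rows, n_cols = toy.shape            # (rows, columns)
counts = toy["name"].value_counts()   # reviews per hotel, most frequent first

print(n_rows, n_cols)
print(counts.to_dict())
```

Because the result is sorted by count, `counts.index[0]` is the most-reviewed hotel.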


In [None]:
# Drop rows that are exact duplicates of an earlier row (identical in every column), keeping the first occurrence.

df = df.drop_duplicates()

### Explanation:
- Some rows appear more than once. `drop_duplicates()` removes rows that are identical in every column and keeps the first occurrence; here it reduces the dataset from 28,694 to 11,990 rows.
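
To make the row-level behavior concrete, here is a small sketch on made-up data. It contrasts the default call, which compares every column, with the `subset=` argument, which compares only the listed columns.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are identical in every column,
# row 2 shares the hotel name but has a different review text.
toy = pd.DataFrame({
    "name": ["Hotel A", "Hotel A", "Hotel A", "Hotel B"],
    "text": ["Great stay", "Great stay", "Noisy room", "Great stay"],
})

# Default behavior: only fully identical rows are dropped (the first one is kept).
deduped = toy.drop_duplicates()

# With subset=..., rows count as duplicates when just those columns match.
one_per_hotel = toy.drop_duplicates(subset=["name"])

print(len(toy), len(deduped), len(one_per_hotel))
```

A repeated value in a single column (such as the same hotel name on many reviews) is therefore not enough for the default call to drop a row.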


In [None]:
df.shape

(11990, 18)

Create a new folder in Google Drive called "Semantic_Search".

In [None]:
df.name.value_counts()

name,count
InterContinental Paris - Le Grand,80
Renaissance Paris Vendome Hotel,80
Renaissance Paris Nobel Tour Eiffel Hotel,80
Hotel des Saints-Peres - Esprit de France,80
Hotel Plaza Etoile,80
...,...
Seven Hotel Paris,32
Ibis Styles Paris Meteor Avenue de la Porte d'Italie,31
Hotel Du Sentier,28
Hotel Maxim Folies,27


In [None]:
# Make a folder called "Semantic_Search" in your Drive (-p avoids an error if it already exists).
!mkdir -p /content/drive/MyDrive/Semantic_Search

In [None]:
# Save the dataframe to the folder you just created.
df.to_csv('/content/drive/MyDrive/Semantic_Search/paris_02_11_23.csv',index=False)
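
The folder creation and save steps can also be done from Python. The sketch below is one possible way, assuming a temporary directory as a stand-in for `/content/drive/MyDrive` and a made-up file name `toy.csv`; it can be rerun without errors.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A temporary directory stands in for /content/drive/MyDrive in this sketch.
base = Path(tempfile.mkdtemp())
out_dir = base / "Semantic_Search"

# parents=True creates missing parent folders; exist_ok=True makes reruns safe.
out_dir.mkdir(parents=True, exist_ok=True)
out_dir.mkdir(parents=True, exist_ok=True)  # second call is a no-op

toy = pd.DataFrame({"name": ["Hotel A"], "text": ["Great stay"]})
out_path = out_dir / "toy.csv"
toy.to_csv(out_path, index=False)  # index=False omits the row index column

reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```

Writing with `index=False` keeps the row index out of the file, so no extra unnamed column appears when the CSV is read back.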

In [None]:
df.shape

(11990, 18)

In [None]:
df.name.value_counts()

name,count
InterContinental Paris - Le Grand,80
Renaissance Paris Vendome Hotel,80
Renaissance Paris Nobel Tour Eiffel Hotel,80
Hotel des Saints-Peres - Esprit de France,80
Hotel Plaza Etoile,80
...,...
Seven Hotel Paris,32
Ibis Styles Paris Meteor Avenue de la Porte d'Italie,31
Hotel Du Sentier,28
Hotel Maxim Folies,27


In [None]:
df.head()

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature..."


In [None]:
# Create a 'combined' column that joins each review's title and its text into one string.
df["combined"] = (
    "title: " + df.title.str.strip() + "; Content: " + df.text.str.strip()
    # +"; desc: "+ df.text.str.strip()
)

In [None]:
df.head()

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features,combined
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title: A large impersonal place with an on tim...
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title: Good hotel with rude waiter; Content: W...
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title: Fantastic; Content: Absolutely top-notc...
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...","title: Amidst the chaos of Fashion week, their..."
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title: Not worth the effort or money; Content:...


In [None]:
import re

df_combined = df.copy()

# Remove every character that is not a letter, digit, or whitespace.
# The range must be A-Za-z: "A-z" would also match [ \ ] ^ _ and the backtick.
df_combined['combined'] = df_combined['combined'].apply(lambda x: re.sub(r'[^A-Za-z0-9\s]', '', str(x)))

# Convert the "combined" column to lower case.
df_combined['combined'] = df_combined['combined'].str.lower()


### Explanation:
- **combined**: merges the user’s “title” with “text” for a single field to embed.
- We strip every character other than letters, digits, and whitespace, then convert to lowercase. Removed characters are not replaced with a space, so "top-notch" becomes "topnotch".
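
The cleaning step can be sketched as a standalone function. The helper names `clean_text` and `clean_text_wide_range` are made up for this illustration; the second one uses the character range `A-z`, which spans more than the letters, to show why the pattern needs `A-Za-z`.

```python
import re

def clean_text(text: str) -> str:
    """Keep only ASCII letters, digits, and whitespace, then lowercase."""
    return re.sub(r"[^A-Za-z0-9\s]", "", str(text)).lower()

def clean_text_wide_range(text: str) -> str:
    """Same idea with the A-z range, which also lets [ \\ ] ^ _ ` through."""
    return re.sub(r"[^a-zA-z0-9\s]", "", str(text)).lower()

raw = "title: Top-notch [5_stars]; Content: Loved it!"
print(clean_text(raw))
print(clean_text_wide_range(raw))
```

The first call prints `title topnotch 5stars content loved it`, while the second leaves the brackets and underscore in place.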


In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

embedder = SentenceTransformer('all-mpnet-base-v2')

# Use the GPU if available
if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")
else:
    print("GPU Found!")
    embedder = embedder.to('cuda')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


GPU Found!


### Explanation:
- **all-mpnet-base-v2**: A strong general-purpose sentence embedding model.
- We detect GPU with `torch.cuda.is_available()`. If yes, move the model to the GPU for faster encoding.
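
The device check can be wrapped in a small helper. This is a sketch only: `pick_device` is a made-up name, and the lazy import is an assumption added so the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch sees a GPU, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Using device: {device}")

# With sentence-transformers installed, the device can be passed at load time:
# embedder = SentenceTransformer('all-mpnet-base-v2', device=device)
```

Passing `device=` when constructing the model is an alternative to calling `.to('cuda')` afterwards.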


In [None]:
df_combined.head()

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features,combined
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title a large impersonal place with an on time...
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title good hotel with rude waiter content we w...
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title fantastic content absolutely topnotch ro...
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title amidst the chaos of fashion week their s...
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not worth the effort or money content th...


In [None]:
# Take a sample of the first 10 rows.
sample = df_combined[:10]

In [None]:
sample

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features,combined
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title a large impersonal place with an on time...
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title good hotel with rude waiter content we w...
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title fantastic content absolutely topnotch ro...
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title amidst the chaos of fashion week their s...
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not worth the effort or money content th...
5,862073785,2022-09-26,4,Not quite up to Intercontinental standards,We had a one night stay prior to taking the Eu...,0,/ShowUserReviews-g187147-d207742-r862073785-In...,en,MOBILE,BA9F96090F35C7A562E550825BAB4B32,badgerken2019,badgerken2019,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not quite up to intercontinental standar...
6,861021031,2022-09-20,5,Luxury in the heart of the city,We were very impressed with the intercontinent...,0,/ShowUserReviews-g187147-d207742-r861021031-In...,en,OTHER,CB65D8DF04EEB78702BB03F39BAC6717,BadBenito,BadBenito,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title luxury in the heart of the city content ...
7,860421664,2022-09-16,5,Beautiful,Visited for my birthday and everything was jus...,0,/ShowUserReviews-g187147-d207742-r860421664-In...,en,OTHER,9D8328D86D62AFA560367B6F81565010,Shivers2612,Shivers2612,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title beautiful content visited for my birthda...
8,856124006,2022-08-24,5,Great old fashioned quality service,All I can say is we found good old fashioned f...,1,/ShowUserReviews-g187147-d207742-r856124006-In...,en,OTHER,82C9562F2314BE52F0CD417EC6034CE2,Nouf M,noufm716,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title great old fashioned quality service cont...
9,856016947,2022-08-23,5,Best location and platform to visit Paris,Our family of 4 plus a friend enjoyed the grea...,0,/ShowUserReviews-g187147-d207742-r856016947-In...,en,MOBILE,266247D26801AA5B479974849CAA8210,JanKritz,JanKritz,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title best location and platform to visit pari...


In [None]:
# Switch to CPU
embedder =  embedder.to('cpu')

startTime = time.time()

# Create a column named 'embedding', where the 'combined' column is turned to embeddings by the model.
sample["embedding"] = sample.combined.apply(lambda x: embedder.encode(x))

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))


Execution time in seconds: 6.661051034927368


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample["embedding"] = sample.combined.apply(lambda x: embedder.encode(x))


### Explanation:
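- We store each row’s embedding in the new column `'embedding'`.
- This step can take a while if you have many rows, so we measure how many seconds it took (about 6.7 s for 10 rows on the CPU here).
- The `SettingWithCopyWarning` in the output appears because `sample` is a slice of `df_combined`; taking the slice with `.copy()` avoids it.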
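
The pattern can be sketched without downloading a model. In the snippet below, `fake_encode` is a hypothetical stand-in for `embedder.encode` that returns random vectors, and `df_toy` is made-up data. It shows taking the sample with `.copy()`, encoding all rows in one call rather than one call per row, and timing the step with `time.perf_counter()`.

```python
import time

import numpy as np
import pandas as pd

def fake_encode(texts, dim=8):
    """Hypothetical stand-in for embedder.encode: one vector per input text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), dim)).astype(np.float32)

df_toy = pd.DataFrame({"combined": [f"title review {i} content text" for i in range(20)]})

# .copy() makes `sample` an independent DataFrame, so adding a column
# does not trigger SettingWithCopyWarning.
sample = df_toy[:10].copy()

start = time.perf_counter()
# Encode all rows in one call instead of one call per row.
vectors = fake_encode(sample["combined"].tolist())
sample["embedding"] = list(vectors)  # one 1-D array per row
elapsed = time.perf_counter() - start

print(f"Encoded {len(sample)} rows in {elapsed:.4f} s")
```

With the real model, `embedder.encode` also accepts a list of strings and returns one vector per string, so the same structure should carry over.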


In [None]:
# Switch once more to GPU.

embedder =  embedder.to('cuda')
startTime = time.time()

sample["embedding"] = sample.combined.apply(lambda x: embedder.encode(x))

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

# Notice the difference in time to do this operation between GPU and CPU.


Execution time in seconds: 0.9214999675750732


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample["embedding"] = sample.combined.apply(lambda x: embedder.encode(x))


In [None]:
sample

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features,combined,embedding
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title a large impersonal place with an on time...,"[0.043491535, -0.01282531, 0.0029111477, 0.070..."
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title good hotel with rude waiter content we w...,"[0.05525053, 0.019289287, -0.0056739757, 0.057..."
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title fantastic content absolutely topnotch ro...,"[-0.014940896, 0.022968661, -0.0036183984, 0.0..."
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title amidst the chaos of fashion week their s...,"[-0.0036499605, 0.04220707, -0.006234068, 0.06..."
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not worth the effort or money content th...,"[0.0013607198, -0.0038351824, 0.0018077933, 0...."
5,862073785,2022-09-26,4,Not quite up to Intercontinental standards,We had a one night stay prior to taking the Eu...,0,/ShowUserReviews-g187147-d207742-r862073785-In...,en,MOBILE,BA9F96090F35C7A562E550825BAB4B32,badgerken2019,badgerken2019,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not quite up to intercontinental standar...,"[0.041726097, 0.033815235, 0.0017297651, 0.065..."
6,861021031,2022-09-20,5,Luxury in the heart of the city,We were very impressed with the intercontinent...,0,/ShowUserReviews-g187147-d207742-r861021031-In...,en,OTHER,CB65D8DF04EEB78702BB03F39BAC6717,BadBenito,BadBenito,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title luxury in the heart of the city content ...,"[-0.0469322, -0.017643088, 0.024403226, 0.0763..."
7,860421664,2022-09-16,5,Beautiful,Visited for my birthday and everything was jus...,0,/ShowUserReviews-g187147-d207742-r860421664-In...,en,OTHER,9D8328D86D62AFA560367B6F81565010,Shivers2612,Shivers2612,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title beautiful content visited for my birthda...,"[0.007762895, 0.033725765, 0.006036935, 0.0588..."
8,856124006,2022-08-24,5,Great old fashioned quality service,All I can say is we found good old fashioned f...,1,/ShowUserReviews-g187147-d207742-r856124006-In...,en,OTHER,82C9562F2314BE52F0CD417EC6034CE2,Nouf M,noufm716,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title great old fashioned quality service cont...,"[0.026673278, -0.00047402183, 0.030833986, 0.0..."
9,856016947,2022-08-23,5,Best location and platform to visit Paris,Our family of 4 plus a friend enjoyed the grea...,0,/ShowUserReviews-g187147-d207742-r856016947-In...,en,MOBILE,266247D26801AA5B479974849CAA8210,JanKritz,JanKritz,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title best location and platform to visit pari...,"[-0.043222655, -0.009725487, 0.0045521534, 0.0..."


In [None]:
# from google.colab import drive
# drive.mount('/gdrive')

In [None]:
# Save the dataframe as a pickle file, a binary serialization that preserves its contents (including the embedding arrays) across sessions.
sample.to_pickle('/content/drive/MyDrive/Semantic_Search/df.pkl')  # save the dataframe `sample` to df.pkl

In [None]:
# Load the pickle file.
df_with_embedding = pd.read_pickle('/content/drive/MyDrive/Semantic_Search/df.pkl')  # load df.pkl back into a dataframe

### Explanation:
- **to_pickle**: Saves the DataFrame (including its embedding vectors) as a binary file for quick re-load.
- Next time, you can skip the embedding step by reading from that pickle.
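
A minimal round-trip sketch, assuming a temporary file in place of the Drive path and made-up 3-dimensional vectors in place of real embeddings, shows that the arrays come back unchanged:

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "combined": ["title fantastic content great stay", "title bad content noisy room"],
    # Made-up 3-dimensional vectors standing in for real embeddings.
    "embedding": [np.array([0.1, 0.2, 0.3], dtype=np.float32),
                  np.array([0.4, 0.5, 0.6], dtype=np.float32)],
})

path = os.path.join(tempfile.mkdtemp(), "df.pkl")  # temporary stand-in for the Drive path
toy.to_pickle(path)
restored = pd.read_pickle(path)

# The arrays come back as NumPy arrays with their original dtype and shape.
first = restored["embedding"].iloc[0]
print(type(first).__name__, first.dtype, first.shape)
```

Pickle files can execute code when loaded, so only read pickles that you created yourself or otherwise trust.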


In [None]:
df_with_embedding.head()

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features,combined,embedding
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title a large impersonal place with an on time...,"[0.043491535, -0.01282531, 0.0029111477, 0.070..."
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title good hotel with rude waiter content we w...,"[0.05525053, 0.019289287, -0.0056739757, 0.057..."
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title fantastic content absolutely topnotch ro...,"[-0.014940896, 0.022968661, -0.0036183984, 0.0..."
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title amidst the chaos of fashion week their s...,"[-0.0036499605, 0.04220707, -0.006234068, 0.06..."
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not worth the effort or money content th...,"[0.0013607198, -0.0038351824, 0.0018077933, 0...."


In [None]:
query = 'Not worth the effort or money + This hotel is not worth the effort or the price'

# Embed the previous query.
query_embedding = embedder.encode(query,show_progress_bar=True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
# First element of the 'combined' column.
df_with_embedding.combined[0]

'title a large impersonal place with an on time check in problem and other issues content if you are looking for a huge grand hotel experience this place may be for you but i found it to be impersonal and the staff lacking warmth and sometimes manners they seem to avoid engagement with guests whenever possible add that to the fact that each morning and evening there are 2 or 3 bus loads of tour groups gathering in the lobby and right outside the hotel when i encountered this on my first evening and needed a taxi i asked the doorman whether he could get me a taxi or whether i should order an uber he simply nodded and walked away so i ordered an uber\n\nthe checkin counter must be understaffed as i had to wait in a long line for about 15 minutes only to be told that my room was not ready official checkin is 2pm but my room was not ready until 5pm i should have taken seriously the many previous tripadvisor reviewers who had the same experience they gave me a drink voucher in the bar as co

In [None]:
# Embed every entry of the "combined" column.
corpus_embeddings = embedder.encode(df_with_embedding.combined,show_progress_bar=True)

# Add a 'similarity' column holding the cosine similarity between the query embedding and each row of 'corpus_embeddings'.
df_with_embedding['similarity']=cosine_similarity(corpus_embeddings, query_embedding)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
df_with_embedding

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,...,author_username,name,id,description,rating,rating_count,features,combined,embedding,similarity
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,...,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title a large impersonal place with an on time...,"[0.043491535, -0.01282531, 0.0029111477, 0.070...",0.180471
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,...,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title good hotel with rude waiter content we w...,"[0.05525053, 0.019289287, -0.0056739757, 0.057...",0.144596
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,...,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title fantastic content absolutely topnotch ro...,"[-0.014940896, 0.022968661, -0.0036183984, 0.0...",0.154253
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,...,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title amidst the chaos of fashion week their s...,"[-0.0036499605, 0.04220707, -0.006234068, 0.06...",0.151105
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,...,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not worth the effort or money content th...,"[0.0013607198, -0.0038351824, 0.0018077933, 0....",0.255166
5,862073785,2022-09-26,4,Not quite up to Intercontinental standards,We had a one night stay prior to taking the Eu...,0,/ShowUserReviews-g187147-d207742-r862073785-In...,en,MOBILE,BA9F96090F35C7A562E550825BAB4B32,...,badgerken2019,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not quite up to intercontinental standar...,"[0.041726097, 0.033815235, 0.0017297651, 0.065...",0.151502
6,861021031,2022-09-20,5,Luxury in the heart of the city,We were very impressed with the intercontinent...,0,/ShowUserReviews-g187147-d207742-r861021031-In...,en,OTHER,CB65D8DF04EEB78702BB03F39BAC6717,...,BadBenito,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title luxury in the heart of the city content ...,"[-0.0469322, -0.017643088, 0.024403226, 0.0763...",0.10518
7,860421664,2022-09-16,5,Beautiful,Visited for my birthday and everything was jus...,0,/ShowUserReviews-g187147-d207742-r860421664-In...,en,OTHER,9D8328D86D62AFA560367B6F81565010,...,Shivers2612,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title beautiful content visited for my birthda...,"[0.007762895, 0.033725765, 0.006036935, 0.0588...",0.146341
8,856124006,2022-08-24,5,Great old fashioned quality service,All I can say is we found good old fashioned f...,1,/ShowUserReviews-g187147-d207742-r856124006-In...,en,OTHER,82C9562F2314BE52F0CD417EC6034CE2,...,noufm716,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title great old fashioned quality service cont...,"[0.026673278, -0.00047402183, 0.030833986, 0.0...",0.136536
9,856016947,2022-08-23,5,Best location and platform to visit Paris,Our family of 4 plus a friend enjoyed the grea...,0,/ShowUserReviews-g187147-d207742-r856016947-In...,en,MOBILE,266247D26801AA5B479974849CAA8210,...,JanKritz,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title best location and platform to visit pari...,"[-0.043222655, -0.009725487, 0.0045521534, 0.0...",0.135314
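
The `similarity` values in the table above are low even for row 4, whose text nearly matches the query. If `cosine_similarity` here is the helper from `openai.embeddings_utils` (an assumption; that import only appears later in the notebook), it divides by `np.linalg.norm` of its whole first argument, which for a 2-D matrix is the Frobenius norm rather than a per-row norm. The sketch below uses made-up vectors to compare that behavior with a per-row cosine; `helper_style` and `rowwise_cosine` are hypothetical names.

```python
import numpy as np

def helper_style(a, b):
    # Mirrors dot(a, b) / (norm(a) * norm(b)); norm of a matrix is its Frobenius norm.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rowwise_cosine(matrix, vector):
    # Normalize each row separately so every entry is a true cosine.
    row_norms = np.linalg.norm(matrix, axis=1)
    return matrix @ vector / (row_norms * np.linalg.norm(vector))

# Made-up unit-length "embeddings": row 0 equals the query, row 1 is orthogonal to it.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([1.0, 0.0])

scaled = helper_style(corpus, query)
exact = rowwise_cosine(corpus, query)
print(scaled)
print(exact)
```

With unit-length rows, the helper-style values are the true cosines divided by the square root of the row count, so the ranking is unchanged but the magnitudes shrink as the corpus grows. If that is what happened here, and the embeddings are unit length, the 0.255 for row 4 would correspond to roughly 0.81 after multiplying by √10.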


In [None]:
# cosine_similarity(corpus_embeddings, query_embedding)

# Entire Data

In [None]:
startTime = time.time()

# Create a column named 'embedding', where the 'combined' column is turned to embeddings by the model.
df_combined["embedding"] = df_combined.combined.apply(lambda x: embedder.encode(x))

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))


Execution time in seconds: 197.5309317111969


In [None]:
# Turn the dataframe you have just created to pickle file, for later use.
df_combined.to_pickle('/content/drive/MyDrive/Semantic_Search/entire_data.pkl')

## Embedding upload - new starting point


In [None]:
!pip install openai==0.27.7
!pip install gradio
!pip install -U sentence-transformers



In [None]:
# drive.mount() loads the contents from your Google Drive.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import json
import pandas as pd
import time
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest
import nltk
import numpy as np
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util
# import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

embedder = SentenceTransformer('all-mpnet-base-v2')

# Use the GPU if available
if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")
else:
    print("GPU Found!")
    embedder = embedder.to('cuda')

GPU Found!


In [None]:
# Read the pickle file you saved earlier.
import pandas as pd
df = pd.read_pickle('/content/drive/MyDrive/Semantic_Search/entire_data.pkl')  # load entire_data.pkl into df

In [None]:
df.shape

(11990, 20)

In [None]:
df.head()

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,author_name,author_username,name,id,description,rating,rating_count,features,combined,embedding
0,864290614,2022-10-12,1,A large impersonal place with an on time check...,"If you are looking for a huge, grand hotel exp...",1,/ShowUserReviews-g187147-d207742-r864290614-In...,en,MOBILE,E488EBBA1F82F16BF878FE274C735941,Anna J,AnnaJ250,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title a large impersonal place with an on time...,"[0.043491535, -0.01282531, 0.0029111477, 0.070..."
1,864049819,2022-10-10,4,Good hotel with rude waiter,We went to this hotel just this month\nWe have...,1,/ShowUserReviews-g187147-d207742-r864049819-In...,en,MOBILE,4A830AD8B128F60AC02E83D6B6A530F7,QATAR2007,QATAR2007,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title good hotel with rude waiter content we w...,"[0.05525053, 0.019289287, -0.0056739757, 0.057..."
2,863952022,2022-10-10,5,Fantastic,"Absolutely top-notch. Room, service, bed, pill...",0,/ShowUserReviews-g187147-d207742-r863952022-In...,en,OTHER,AA2958C3E083861E81EEC085671BAA5B,aji1376,aji1376,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title fantastic content absolutely topnotch ro...,"[-0.014940896, 0.022968661, -0.0036183984, 0.0..."
3,863793066,2022-10-09,4,"Amidst the chaos of Fashion week, their servic...",We stayed during the Paris Fashion Week Chaos....,0,/ShowUserReviews-g187147-d207742-r863793066-In...,en,MOBILE,DE4AB96DA3E104846D6D6423C2DAA4C8,jelinc2016,jelinc2016,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title amidst the chaos of fashion week their s...,"[-0.0036499605, 0.04220707, -0.006234068, 0.06..."
4,863631994,2022-10-08,2,Not worth the effort or money,This hotel is not worth the effort or the pric...,0,/ShowUserReviews-g187147-d207742-r863631994-In...,en,MOBILE,DE02D713F209AEC684DC6108509E6912,VikaasK,VikaasK,InterContinental Paris - Le Grand,207742,"The InterContinental Paris Le Grand, opened du...",4.5,3517.0,"['roomFeatures_air conditioning', 'roomFeature...",title not worth the effort or money content th...,"[0.0013607198, -0.0038351824, 0.0018077933, 0...."


In [None]:
# Search the reviews for the entries most similar to a query.
def search_reviews(df, query, n=5, pprint=True):

    # Embed your search query.
    query_embedding = embedder.encode(query,show_progress_bar=True)

    # As before, create a 'similarity' column holding the cosine similarity between your query and each embedded 'combined' entry.
    # Remember: the same model must embed both the combined contents and the query.
    # reshape(768, -1) turns the query into a column vector, so each similarity is stored as a 1-element array.
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, query_embedding.reshape(768,-1))) #similarity against each doc

    # Now sort the values by similarity and keep the n most similar docs.
    results = (
        df.sort_values("similarity", ascending=False) # re-rank
        .head(n))

    return results

### Explanation:
- **query_embedding**: The new text is embedded with the same model.
- **cosine_similarity**: We compare row embeddings with query embedding, store in `'similarity'`.
- Then **sort** by `'similarity'` descending to find the best matches.
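
The three steps above can be sketched without the dataframe at all. The example below is a minimal NumPy version under stated assumptions: `top_n` is a hypothetical helper, and the small 2-D vectors stand in for real 768-dimensional embeddings.

```python
import numpy as np

def top_n(doc_vectors, query_vector, n=5):
    """Return (indices, scores) of the n rows most similar to the query."""
    docs = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity of every row against the query in one vectorized step.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first n positions.
    order = np.argsort(scores)[::-1][:n]
    return order, scores[order]

# Toy vectors, not real embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, sims = top_n(docs, [1.0, 0.0], n=2)
print(idx, sims)
```

Each score is a plain float here, so no `[0]` indexing is needed downstream, unlike the 1-element arrays produced by the `reshape(768, -1)` call in `search_reviews`.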


In [None]:
query = 'hotel close to Louvre and great food nearby but not too expensive'

In [None]:
results = search_reviews(df,query,15)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
results

Unnamed: 0,review_id,date,review_rating,title,text,votes,url,language,platform,author_id,...,author_username,name,id,description,rating,rating_count,features,combined,embedding,similarity
10033,858737654,2022-09-06,4,Great hotel close to the Louvre,This was a great location with easy access to ...,1,/ShowUserReviews-g187147-d228694-r858737654-Ho...,en,MOBILE,107486993911F044C7A86C88BD8CBE6C,...,kahnfeldt,Hotel Malte - Astotel,228694,Located in the 2nd district next to the Stock ...,5.0,2285.0,"['roomFeatures_allergy-free room', 'roomFeatur...",title great hotel close to the louvre content ...,"[0.00507981, 0.006933634, 0.0052260924, 0.0466...",[0.8135017]
267,751494275,2020-03-20,5,Superb location.Lovely hotel. Delicious buffet.,We stayed in this lovely hotel for three night...,1,/ShowUserReviews-g187147-d228728-r751494275-Re...,en,OTHER,94055358518C9E758FA3817BAA2F65D9,...,MEG0963,Renaissance Paris Vendome Hotel,228728,Bask in the lavish lifestyle of our boutique h...,4.5,1618.0,"['roomFeatures_bathrobes', 'roomFeatures_air c...",title superb locationlovely hotel delicious bu...,"[-0.0066311704, 0.027375644, 0.010441846, 0.07...",[0.7794165]
21822,859898297,2022-09-13,5,Really nice hotel close to the Louvre,My wife and I came here to celebrate our 40th ...,0,/ShowUserReviews-g187147-d617625-r859898297-Gr...,en,MOBILE,C34ABC5BCEC1820E642381861D65E5E0,...,GraemeBuck,Grand Hotel du Palais Royal,617625,,,,,title really nice hotel close to the louvre co...,"[0.030774625, 0.029099965, -0.011844366, 0.052...",[0.778882]
4933,815954462,2021-10-25,5,Great location for romantic trip,"Friendly staff. Great location, relaxed atmosp...",0,/ShowUserReviews-g187147-d228737-r815954462-Ho...,en,OTHER,CD7F732341E1CB5D4414C9597B80B9C7,...,jdN6106ZS,Hôtel Trianon Rive Gauche,228737,,,,,title great location for romantic trip content...,"[-0.009349008, 0.06855301, 0.014361295, 0.0507...",[0.7786452]
21898,839455495,2022-05-22,5,Wonderful boutique hotel close to the Louvre,A very nice hotel in a quiet plaza two blocks ...,0,/ShowUserReviews-g187147-d617625-r839455495-Gr...,en,OTHER,F68553597AC1CF5735AA52480B978199,...,johnhR7616AX,Grand Hotel du Palais Royal,617625,,,,,title wonderful boutique hotel close to the lo...,"[0.03218718, -0.018514507, 0.009698471, 0.0526...",[0.7774591]
18243,804234515,2021-08-16,5,Perfect location and really nice hotel,We just came back from a 4 day trip in Paris w...,2,/ShowUserReviews-g187147-d207663-r804234515-Ho...,en,OTHER,4110D3D69BD285509C0C1AA985550AA7,...,Y2171PSsophiet,Hotel Cayre,207663,,,,,title perfect location and really nice hotel c...,"[-0.011272881, -0.015685814, 0.007126427, 0.07...",[0.77657044]
28275,801368976,2021-08-02,5,"Best location, elegant, clean and great rooms.",We spent 5 nights at the lovely Hotel Da Vinci...,1,/ShowUserReviews-g187147-d6675948-r801368976-H...,en,MOBILE,1FEA86D03026BB83FEB43E29EA2B2D0D,...,marydresser,Hotel Da Vinci,6675948,,,,,title best location elegant clean and great ro...,"[-0.04500474, 0.030609895, 0.0011573076, 0.069...",[0.77456754]
21920,834074624,2022-04-10,5,Great Boutique style hotel,Great location close to the Louvre. Lots of s...,0,/ShowUserReviews-g187147-d617625-r834074624-Gr...,en,OTHER,1A544DD87197FCAC75FDA3D6E1AF6E07,...,dtt808,Grand Hotel du Palais Royal,617625,,,,,title great boutique style hotel content great...,"[0.0061422065, 0.01872041, -0.0009768117, 0.04...",[0.77316636]
6866,806790088,2021-08-28,5,Nice small hotel at an excellent location,"Clean, with a nice design and an excellent loc...",0,/ShowUserReviews-g187147-d278169-r806790088-Ho...,en,OTHER,3C6E94BBBCCA399CFE39851FEF397129,...,do8yb,Hôtel Eugène en Ville,278169,,,,,title nice small hotel at an excellent locatio...,"[-0.013086394, -0.02277145, 0.021511065, 0.072...",[0.7728499]
207,834926510,2022-04-17,5,Best Paris Location,You can’t beat the location of this hotel! Rig...,0,/ShowUserReviews-g187147-d228728-r834926510-Re...,en,MOBILE,58EB5EA5E1A2FCD598F7F7F9D988306A,...,Camper46227911888,Renaissance Paris Vendome Hotel,228728,Bask in the lavish lifestyle of our boutique h...,4.5,1618.0,"['roomFeatures_bathrobes', 'roomFeatures_air c...",title best paris location content you cant bea...,"[-0.0070767864, -0.023154905, 0.011793152, 0.0...",[0.7724514]


### Explanation:
- We pick a sample query about a hotel near the Louvre with good food nearby that is not too expensive.
- We fetch the top 15 results. The table above lists the matching rows with all dataframe columns plus `similarity`, highest first.


Take the reviews closest to the query and group them by hotel name.

In [None]:
def search(query):
  # Define a number of results to return, in this case, return only the first 15 results ranked by similarity.
  n = 15

  # Embed the query.
  query_embedding = embedder.encode(query)

  # Generate the similarity column, based on your query.
  df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, query_embedding.reshape(768,-1)))

  # Calculate the top 'n' most similar results by similarity.
  results = (
      df.sort_values("similarity", ascending=False)
      .head(n))

  resultlist = []

  # Group the results by hotel, keeping at most three reviews per hotel.
  hlist = []
  for r in results.index:
      if results.name[r] not in hlist:
          smalldf = results.loc[results.name == results.name[r]]
          # shape[0] is the row count (shape[1] would count columns).
          if smalldf.shape[0] > 3:
            smalldf = smalldf[:3]

          resultlist.append(
          {
            "name":results.name[r],
            "score": smalldf.similarity[r][0],
            "rating": smalldf.rating.max(),
            "relevant_reviews": [ smalldf.text[s] for s in smalldf.index]
          })
          hlist.append(results.name[r])
  return resultlist




### Explanation:
- We embed the query, compute `'similarity'`, sort, and then group the top results by hotel, ordered by the **first mention** of each unique hotel.
- Return a list of dictionaries with `'name'`, `'score'`, `'rating'`, and `'relevant_reviews'`.
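
The grouping step can be illustrated on its own. The sketch below assumes the ranked hits are already available as a list of dictionaries sorted by descending score; `group_by_hotel`, the hotel names, and the review strings are hypothetical stand-ins, not the notebook's data.

```python
def group_by_hotel(ranked_rows, max_reviews=3):
    """Collapse ranked review hits into one entry per hotel, in rank order."""
    grouped = {}
    for row in ranked_rows:
        entry = grouped.get(row["name"])
        if entry is None:
            # First (highest-scoring) mention of this hotel sets its score.
            # The rating is taken from that hit, assuming it is a per-hotel value.
            entry = {"name": row["name"], "score": row["score"],
                     "rating": row["rating"], "relevant_reviews": []}
            grouped[row["name"]] = entry
        if len(entry["relevant_reviews"]) < max_reviews:
            entry["relevant_reviews"].append(row["text"])
    # Dicts preserve insertion order, so hotels stay sorted by best score.
    return list(grouped.values())

hits = [
    {"name": "Hotel A", "score": 0.81, "rating": 5.0, "text": "Close to the museum."},
    {"name": "Hotel B", "score": 0.78, "rating": 4.5, "text": "Great breakfast."},
    {"name": "Hotel A", "score": 0.77, "rating": 5.0, "text": "Quiet rooms."},
]
result = group_by_hotel(hits)
print(result)
```

Using a dictionary keyed by hotel name avoids the repeated `not in hlist` scan and the per-hotel dataframe filter, at the cost of first converting the rows to plain dictionaries.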


In [None]:
search('hotel close to Louvre and great food nearby but not too expensive')

[{'name': 'Hotel Malte - Astotel',
  'score': np.float32(0.8135017),
  'rating': 5.0,
  'relevant_reviews': ['This was a great location with easy access to several Metro stations, attractions, and restaurants.  It was great having snacks and drinks available all afternoon, especially since traveling with kids.  Good breakfast, and plenty of space for the 4 of us in the duplex room.  Having a second bathroom was very nice as well.']},
 {'name': 'Renaissance Paris Vendome Hotel',
  'score': np.float32(0.7794165),
  'rating': 4.5,
  'relevant_reviews': ['We stayed in this lovely hotel for three nights.Our room was on the second floor with the courtyard view,so it was very quiet at night.The room wasn\'t very big but  absolutely  lovely and cozy with comfortable beds and all modern  conveniences. Most of all we were impressed by breakfast.It was really GREAT ! A great variety of cheese,bacon,ham,sausages,fruit,vegetables,dairy,pastry,fresh juice ,omelet(specially cooked for you with any fi

### Explanation:
- We run `search` on the same sample query about being near the Louvre and not expensive.
- The output is one dictionary per hotel, giving its best similarity score, its rating, and up to three relevant reviews.


In [77]:
from google.colab import userdata
import openai

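# Paste your key below, or load it from Colab secrets with userdata.get(...) (the secret name is up to you).
openai.api_key = "Paste the key here"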


In [None]:
def generate_answer(query):
    prompt = f"""
    Based on the following query from a user, please generate a detailed answer based on the context,
    focusing on which are the top three hotels for the query. You should respond as if you are a travel agent conversing with the
    user in a nice, cordial way. Always address the user as Travis. Make sure you inform the user why this is a good answer. The answer should be a paragraph.
    Remove special characters and newline characters; make the output clean and concise.
    Answer only as a poet.


    ###########
    query:
    "{query}"

    ########

    context:
    "{search(query)}"
    #####

    Return in Markdown format with each hotel highlighted.
    """

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        max_tokens=1500,
        n=1,
        stop=None,
        temperature=0.2,  # higher temperature means more creative output, with more risk of hallucination
        messages=messages,
    )

    # Extract the generated response from the API response
    generated_text = response.choices[0].message['content'].strip()

    return generated_text

# # Example usage
# query = "What are the best amenities offered by Hotel XYZ?"
# response = generate_answer(query)
# print(response)

    #return response.choices[0].message.content.strip()

In [None]:
txt=generate_answer('close to Louvre and great food nearby but not too expensive')

RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.

### Explanation:
- **search(query)**: We gather the top relevant hotels as context.
- **prompt**: We instruct the model to answer as a poetic travel agent and recommend the top three hotels.
- **model="gpt-4o-mini"**: Adjust to whichever chat model your account can access.
- We cap the response at `max_tokens=1500` and set `temperature=0.2` to keep the answer focused.
- The function returns a string.
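
Keeping prompt assembly separate from the API call makes it possible to inspect the exact text sent to the model without spending quota. A minimal sketch, assuming the context is a list of dictionaries shaped like the output of `search`; `build_prompt`, its wording, and the sample hotel entry are hypothetical.

```python
import json

def build_prompt(query, hotels):
    """Assemble the user prompt from a query and a list of hotel dicts."""
    # JSON keeps the context readable and avoids Python repr noise.
    context = json.dumps(hotels, indent=2, ensure_ascii=False)
    return (
        "You are a travel agent. Recommend the top three hotels for the query, "
        "using only the context.\n\n"
        f"Query:\n{query}\n\n"
        f"Context:\n{context}\n\n"
        "Return Markdown with each hotel name in bold."
    )

# Made-up context entry for illustration.
sample_hotels = [{"name": "Hotel A", "score": 0.81, "rating": 5.0,
                  "relevant_reviews": ["Close to the museum."]}]
prompt = build_prompt("hotel close to Louvre", sample_hotels)
print(prompt)
```

Scores that arrive as `np.float32`, as in the `search` output above, would need converting with `float()` first, because `json.dumps` does not serialize that type.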


In [None]:
# prompt

In [None]:
import markdown
from IPython.display import display, HTML

def render_markdown(md_text):
    # Convert Markdown to HTML
    html = markdown.markdown(md_text)
    # Display the HTML
    display(HTML(html))

In [None]:
render_markdown(txt)

NameError: name 'txt' is not defined

### Explanation:
- `render_markdown` converts the Markdown answer to HTML and displays it in the notebook.
- The `NameError` occurs because the earlier `generate_answer` call failed with a `RateLimitError`, so `txt` was never assigned.
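
Failures such as the `RateLimitError` raised by the earlier `generate_answer` call are often handled by retrying with exponential backoff. Backoff helps with transient throttling; an exhausted quota, which is what the message above reports, still needs a billing change. The sketch below is a generic helper under stated assumptions: `with_backoff` and `flaky` are hypothetical names, and the failing call is simulated.

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except exceptions:
            if attempt == retries:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)

# Demo with a fake call that fails twice, then succeeds.
# Delays are recorded in a list instead of actually sleeping.
calls = {"count": 0}
delays = []

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

outcome = with_backoff(flaky, sleep=delays.append)
print(outcome, delays)
```

In the notebook this could wrap the call as `with_backoff(lambda: generate_answer(query), exceptions=(openai.error.RateLimitError,))`, assuming the pinned openai 0.27.x release, where that exception class lives in `openai.error`.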


# Building the API

In [None]:

import gradio as gr

# def search(query):
#   n = 15
#   query_embedding = embedder.encode(query)
#   df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, query_embedding.reshape(768,-1)))

#   results = (
#       df.sort_values("similarity", ascending=False)
#       .head(n))

#   resultlist = []

#   hlist = []
#   for r in results.index:
#       if results.name[r] not in hlist:
#           smalldf = results.loc[results.name == results.name[r]]
#           smallarr = smalldf.similarity[r].max()
#           sm =smalldf.rating[r].mean()

#           if smalldf.shape[1] > 3:
#             smalldf = smalldf[:3]

#           resultlist.append(
#           {
#             "name":results.name[r],
#             "description":results.description[r],
#             "relevance score": smallarr.tolist(),
#             "rating": sm.tolist(),
#             "relevant_reviews": [ smalldf.text[s] for s in smalldf.index]
#           })
#           hlist.append(results.name[r])
#   return resultlist

def greet(query):
    # Run the semantic search and the LLM call, then return the generated answer.
    answer = generate_answer(query)
    return answer

# Use the gradio library to display a user interface for your user to interact with.
demo = gr.Interface(fn=greet, inputs="text", outputs="text")

# Launch the user demo - This can be done directly in your colab notebook. On your local notebook, you can also give a personalized localhost:port address.
demo.launch(share=True,debug=False)
# COPY THE URL THAT APPEARS BELOW

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7af54e74dc59a07b16.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




### Explanation:
- **greet(query)**: The function that calls our `generate_answer` logic.  
- We define a Gradio `Interface` with a single text input and a text output.  
- **launch(share=True)**: Makes the app accessible via a public link.


In [None]:
!pip install --upgrade gradio



In [None]:
# Import the Client class from the gradio_client module.
from gradio_client import Client

# Create an instance of the Client class. The URL provided should point to a live Gradio app.

# PASTE THE URL FROM ABOVE HERE
client = Client("https://7142ca2aabef402984.gradio.live/")

# Use the 'predict' method of the Client instance to send a request to the Gradio app.
# The string "Hotel near the Eiffel Tower!" is passed to the 'query' textbox component of the app.
# 'api_name' specifies the endpoint ('/predict') that the Gradio interface exposes for processing this input.
result = client.predict(
    "Hotel near the Eiffel Tower!",  # str in 'query' Textbox component
    api_name="/predict"
)

# Print the result returned from the Gradio app. This output depends on how the Gradio app processes the input.
print(result)



Loaded as API: https://7142ca2aabef402984.gradio.live/ ✔


ValueError: Could not fetch config for https://7142ca2aabef402984.gradio.live/

### Explanation:
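- This uses `gradio_client` to programmatically call the live Gradio app.
- **client.predict(...)**: We pass the user query to the app’s `/predict` endpoint.
- The `ValueError` above most likely arises because the URL passed to `Client` differs from the share link printed by `demo.launch`; paste the current link before running this cell.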


In [None]:
result

# END OF NOTEBOOK

**Summary**:

1. We installed spaCy and `openai`, and set up the environment, using a GPU when available.
2. Loaded a CSV of Paris hotel reviews, created a new “combined” text field, and embedded it with `SentenceTransformer('all-mpnet-base-v2')`.
3. Demonstrated searching with `cosine_similarity`.  
4. Showed how to integrate with an LLM for summarizing or poetically describing results.  
5. Provided a Gradio interface for end-user queries, and an optional client test.  

This completes the demonstration of end-to-end usage: data ingestion, transformation, embedding, searching, summarizing, and UI exposure. Enjoy!  
