## Semantic Embedding and Search on Hospitality KPI Data

This notebook demonstrates how to transform structured hospitality KPI data into natural-language sentences and generate vector embeddings with OpenAI's `text-embedding-ada-002` model. These embeddings enable semantic similarity search, so the dataset can be queried with plain-English questions such as:

> "Which dates had low occupancy but high ADR?"  
> "Find months with performance similar to March 2023"  
> "Show periods with peak RevPAR during shoulder season"

I have created mock data for three years, containing daily values for:
- Occupancy
- Average Daily Rate (ADR)
- Revenue per Available Room (RevPAR)
- Demand and Supply
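
The CSV itself is not included in this notebook, so the following is a minimal sketch of how comparable mock data could be generated. The function name `make_mock_kpis`, the value ranges, the start date, and the seed are all assumptions chosen for illustration, not the method used to build the original file. RevPAR is derived as occupancy × ADR, and demand as occupancy × supply, which is also an assumption.

```python
import numpy as np
import pandas as pd

def make_mock_kpis(start="2023-06-01", days=3 * 365, seed=0):
    """Generate hypothetical daily KPI rows (all ranges are assumptions)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start, periods=days, freq="D")
    occupancy = rng.uniform(0.40, 0.95, size=days).round(2)
    adr = rng.uniform(80.0, 250.0, size=days).round(2)
    supply = rng.integers(450, 601, size=days)  # rooms available per day
    return pd.DataFrame({
        "date": dates.strftime("%Y-%m-%d"),
        "occupancy": occupancy,
        "adr": adr,
        "revpar": (occupancy * adr).round(2),                   # RevPAR = occupancy x ADR
        "demand": (occupancy * supply).round().astype(int),     # rooms sold (assumed)
        "supply": supply,
    })

mock_df = make_mock_kpis()
print(mock_df.head())
```

Saving the frame with `mock_df.to_csv("kpi_data.csv", index=False)` would give a file with the same column names as the one loaded later.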

The core goals of this notebook are:
1. Convert KPI rows into descriptive sentences
2. Generate semantic embeddings using a pre-trained transformer model
3. Perform similarity search based on user queries

This project forms the foundation for a hospitality-focused vector search system, useful in AI-driven reporting, forecasting, or benchmarking applications.


In [22]:
# STEP 1: Create the .env file with the API key
# Never hard-code a real key in a notebook; prompt for it so it is not saved in the file.
from getpass import getpass
openai_key = getpass("OpenAI API key: ")
with open(".env", "w") as f:
    f.write(f"OPENAI_API_KEY={openai_key}\n")
print(".env file created.")

.env file created.


In [24]:
# STEP 2: Install Required Libraries 
!pip install openai python-dotenv pandas scikit-learn




In [26]:
# STEP 3: Import Libraries and Load API Key
import os
import pandas as pd
import openai
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the API key from .env
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")


In [28]:
# Load the KPI data (adjust the path to wherever kpi_data.csv lives on your machine)
df = pd.read_csv("C:/Users/knowl/OneDrive/Desktop/kpi_data.csv")

In [30]:
df.head()

| | date | occupancy | adr | revpar | demand | supply |
|---|---|---|---|---|---|---|
| 0 | 2023-06-01 | 0.61 | 218.48 | 133.27 | 189 | 559 |
| 1 | 2023-06-02 | 0.92 | 193.08 | 177.63 | 319 | 599 |
| 2 | 2023-06-03 | 0.80 | 168.92 | 135.14 | 291 | 558 |
| 3 | 2023-06-04 | 0.73 | 141.00 | 102.93 | 162 | 525 |
| 4 | 2023-06-05 | 0.49 | 229.12 | 112.27 | 172 | 479 |


In [32]:
df.isnull().sum()

date         0
occupancy    0
adr          0
revpar       0
demand       0
supply       0
dtype: int64

In [34]:
# Convert KPI Rows to Descriptive Sentences
df['text_description'] = df.apply(
    lambda row: f"On {row['date']}, the occupancy was {row['occupancy']*100:.0f}%, ADR was ${row['adr']:.2f}, "
                f"RevPAR was ${row['revpar']:.2f}, with {row['demand']} rooms sold out of {row['supply']}.",
    axis=1
)
df[['date', 'text_description']].head()


| | date | text_description |
|---|---|---|
| 0 | 2023-06-01 | On 2023-06-01, the occupancy was 61%, ADR was ... |
| 1 | 2023-06-02 | On 2023-06-02, the occupancy was 92%, ADR was ... |
| 2 | 2023-06-03 | On 2023-06-03, the occupancy was 80%, ADR was ... |
| 3 | 2023-06-04 | On 2023-06-04, the occupancy was 73%, ADR was ... |
| 4 | 2023-06-05 | On 2023-06-05, the occupancy was 49%, ADR was ... |
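
The preview truncates each sentence, so it can help to look at one in full. The sketch below pulls the same f-string into a named function (`describe_kpi_row` is a hypothetical name, not part of the notebook) and applies it to the first row shown by `df.head()`.

```python
def describe_kpi_row(row):
    """Format one KPI record (a dict or a pandas row) as a sentence."""
    return (
        f"On {row['date']}, the occupancy was {row['occupancy'] * 100:.0f}%, "
        f"ADR was ${row['adr']:.2f}, RevPAR was ${row['revpar']:.2f}, "
        f"with {row['demand']} rooms sold out of {row['supply']}."
    )

# Values copied from the first row of df.head()
sample_row = {"date": "2023-06-01", "occupancy": 0.61, "adr": 218.48,
              "revpar": 133.27, "demand": 189, "supply": 559}
sentence = describe_kpi_row(sample_row)
print(sentence)
```

With the notebook's DataFrame, the equivalent call would be `df.apply(describe_kpi_row, axis=1)`. Note that the sentence carries only raw numbers; words such as "low" or "high" never appear in it.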


In [20]:
!pip install --upgrade openai




In [36]:
import openai
print(openai.__version__)


1.86.0


In [38]:
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI()

# Define embedding function
def get_openai_embeddings(texts, model="text-embedding-ada-002"):
    response = client.embeddings.create(input=texts, model=model)
    return [e.embedding for e in response.data]

# Generate embeddings from KPI descriptions
texts = df['text_description'].tolist()
embeddings = get_openai_embeddings(texts)
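
The cell above sends every description in a single request. That worked for this dataset, but a larger one may run into per-request input limits, so batching is one possible safeguard. The sketch below is an assumption-laden illustration: `embed_in_batches` is a hypothetical helper, the batch size of 500 is arbitrary, and `fake_embed` is a stand-in so the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=500):
    """Call embed_fn on successive slices of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors

def fake_embed(batch):
    """Stand-in embedder: one 1-D 'vector' per text, holding its length."""
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo_vectors)
```

In the notebook, passing `get_openai_embeddings` as `embed_fn` would keep the rest of the pipeline unchanged, since the helper returns the same list-of-vectors shape.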


### Semantic Search Function

In this step, we define a function called `semantic_search()` that enables natural language querying of our hospitality KPI data. The function works by:

1. Converting the user’s query into an embedding using OpenAI's `text-embedding-ada-002` model.
2. Calculating **cosine similarity** between the query embedding and each of the previously generated KPI embeddings.
3. Sorting the results based on similarity scores.
4. Returning the top-k most semantically relevant KPI entries.

This allows us to search the KPI dataset using plain English questions like:
- "Find dates with low occupancy and high ADR"
- "Which days had strong RevPAR performance?"
- "Show similar periods to March 2023"

This marks the core functionality of our semantic vector search system.
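
To make steps 2 to 4 concrete, here is a small sketch on made-up 2-D vectors. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction only, not magnitude. The helper names `cosine_scores` and `top_k` and the toy vectors are assumptions for illustration; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_scores(query_vec, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    return matrix @ query_vec / norms

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(scores)[::-1][:k]

# Toy "embeddings" (made up): same direction, orthogonal, 45 degrees, and a 3-4-5 vector
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 4.0]]
toy_scores = cosine_scores([1.0, 0.0], toy_matrix)
toy_top = top_k(toy_scores, 2)
print(toy_scores, toy_top)
```

The first row points the same way as the query and scores 1, the orthogonal row scores 0, and the other two fall in between, so the top two indices are 0 and 2.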


In [43]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_search(query, top_k=5):
    # Get the vector for the user's query
    query_embedding = get_openai_embeddings([query])[0]

    # Compute similarity scores
    similarity_scores = cosine_similarity([query_embedding], embeddings)[0]

    # Get top matching indices
    top_indices = np.argsort(similarity_scores)[::-1][:top_k]

    # Return top results with scores
    results = df.iloc[top_indices][['date', 'text_description']]
    scores = similarity_scores[top_indices]
    return results, scores


### Running a Test Query

In this step, I will run a sample semantic query to retrieve KPI records that match a given intent written in plain English.

For the test query:
> "Find dates with low occupancy and high ADR"

The system:
1. Converts the query into an embedding using OpenAI’s embedding model.
2. Calculates cosine similarity between the query embedding and each row’s embedding.
3. Sorts the results based on similarity score.
4. Displays the top 5 most relevant KPI rows based on semantic meaning, not exact keyword matching.

This allows business analysts or hotel managers to ask natural-language questions and get back the KPI records whose descriptions are closest in meaning to the query.


In [46]:
query = "Find dates with low occupancy and high ADR"
results, scores = semantic_search(query)

# Display results
for i, (text, score) in enumerate(zip(results['text_description'], scores)):
    print(f"{i+1}. ({score:.2f}) {text}")


1. (0.84) On 2024-01-13, the occupancy was 94%, ADR was $87.98, RevPAR was $82.70, with 299 rooms sold out of 558.
2. (0.84) On 2025-01-29, the occupancy was 68%, ADR was $85.21, RevPAR was $57.94, with 302 rooms sold out of 517.
3. (0.84) On 2023-08-14, the occupancy was 79%, ADR was $205.20, RevPAR was $162.11, with 216 rooms sold out of 535.
4. (0.84) On 2024-06-11, the occupancy was 86%, ADR was $208.90, RevPAR was $179.65, with 229 rooms sold out of 596.
5. (0.84) On 2024-05-30, the occupancy was 92%, ADR was $110.01, RevPAR was $101.21, with 191 rooms sold out of 586.
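
Three of the five rows above have occupancy of 86% or higher, and all five scores round to 0.84, so it may be worth cross-checking the semantic results against a plain numeric filter. The sketch below assumes "low" means the bottom quartile of occupancy and "high" means the top quartile of ADR; those thresholds and the function name `low_occupancy_high_adr` are arbitrary choices for illustration. The small frame reuses the five rows from `df.head()` as a stand-in for `df`.

```python
import pandas as pd

def low_occupancy_high_adr(frame, occ_q=0.25, adr_q=0.75):
    """Rows at or below the occ_q occupancy quantile and at or above the adr_q ADR quantile."""
    occ_cut = frame["occupancy"].quantile(occ_q)
    adr_cut = frame["adr"].quantile(adr_q)
    mask = (frame["occupancy"] <= occ_cut) & (frame["adr"] >= adr_cut)
    # Highest ADR first
    return frame.loc[mask].sort_values("adr", ascending=False)

# Stand-in for df: the five rows shown by df.head()
sample_df = pd.DataFrame({
    "date": ["2023-06-01", "2023-06-02", "2023-06-03", "2023-06-04", "2023-06-05"],
    "occupancy": [0.61, 0.92, 0.80, 0.73, 0.49],
    "adr": [218.48, 193.08, 168.92, 141.00, 229.12],
})
matches = low_occupancy_high_adr(sample_df)
print(matches)
```

On the full dataset the call would be `low_occupancy_high_adr(df)`. Comparing its dates with the semantic results gives a quick sense of how well embedding similarity tracks the numeric intent of the query.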


### Query Example: High RevPAR

This example demonstrates how to retrieve dates with high revenue per available room (RevPAR). The model uses semantic understanding to match the intent of the query:

> "Show dates with high RevPAR"

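The top-ranked results are the rows whose descriptions sit closest to the query in embedding space. In the output below, all five scores round to 0.84 and the RevPAR values range from $95.07 to $178.92, so the ranking reflects textual similarity rather than a numeric sort on RevPAR. Revenue managers should treat it as a starting point for exploration, not a definitive ranking.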


In [50]:
query = "Show dates with high RevPAR"
results, scores = semantic_search(query)

# Display results
for i, (text, score) in enumerate(zip(results['text_description'], scores)):
    print(f"{i+1}. ({score:.2f}) {text}")


1. (0.84) On 2024-05-17, the occupancy was 87%, ADR was $150.99, RevPAR was $131.36, with 489 rooms sold out of 465.
2. (0.84) On 2024-02-13, the occupancy was 78%, ADR was $121.89, RevPAR was $95.07, with 154 rooms sold out of 500.
3. (0.84) On 2024-03-01, the occupancy was 68%, ADR was $191.99, RevPAR was $130.55, with 459 rooms sold out of 515.
4. (0.84) On 2025-04-14, the occupancy was 77%, ADR was $170.30, RevPAR was $131.13, with 169 rooms sold out of 491.
5. (0.84) On 2024-05-23, the occupancy was 72%, ADR was $248.50, RevPAR was $178.92, with 178 rooms sold out of 513.


### Conclusion

In this project, I transformed structured hospitality KPI data into descriptive text and generated semantic embeddings using OpenAI's `text-embedding-ada-002` model. By applying cosine similarity between vectors, I built a semantic search system that allows users to query hotel performance data using natural language.

With this system, I can pose questions like *"Which dates had low occupancy but high ADR?"* or *"Show dates with high RevPAR"* and retrieve the rows whose descriptions are most similar in meaning, rather than relying on simple keyword matches.

This project demonstrates my ability to apply modern NLP techniques and vector-based search to real-world business data.
