### An Agentic AI Pipeline for Real Estate Recommendations: Driven by Large Language Models, Meta-Heuristic Optimization, Vector Search, and Explainability

This project implements an **agentic AI-driven pipeline** for real estate recommendations. It uses Ames Housing Dataset and leverages Large Language Models (LLMs), vector-based similarity search, and meta-heuristic optimization techniques. The system intelligently processes user queries, matches them to relevant property descriptions, and provides concise, user-friendly recommendations with explanations.

---

### **What the System Does**
The system aims to assist users in finding their ideal properties by:
1. **Understanding User Queries**: Using an LLM, the system rewrites and refines user queries into a structured format optimized for property matching.
2. **Matching Properties**: Searches a vector database of property descriptions, finding the most relevant matches based on semantic similarity.
3. **Optimizing Search**: Applies Particle Swarm Optimization (PSO) to rewrite and enhance queries dynamically for improved relevance.
4. **Providing Explanations**: Offers clear explanations for why properties were selected, detailing matching features and similarity scores.
5. **Summarizing Results**: Presents the results in a concise summary for easy interpretation by the user.

---

### **How the System Works**
The system operates in a structured, multi-step pipeline, showcasing its **agentic capabilities** by combining multiple AI tools into an interactive workflow:

1. **LLM-Driven Query Rewriting**:
   - The system uses a language model (e.g., `google/flan-t5-large`) to transform ambiguous or informal user inputs into structured, actionable queries tailored to property matching.
   - Example: A query like *"I want a house with a big backyard and a nice kitchen"* is rewritten as *"Looking for properties with a large backyard and a high-quality kitchen."*

2. **Embedding and Vector Search**:
   - Property descriptions are converted into high-dimensional embeddings using the `sentence-transformers/all-mpnet-base-v2` model and stored in a vector database.
   - The system performs similarity searches in this database to retrieve the most relevant properties.

3. **Optimization with PSO**:
   - Particle Swarm Optimization dynamically refines the query, adjusting feature weights (e.g., kitchen quality, living area) to improve the relevance of the results.
   - This adaptive optimization ensures that user preferences are captured effectively.

4. **Similarity Search with Explanations**:
   - Properties are ranked based on their similarity to the optimized query.
   - Explanations highlight the key matching features (e.g., neighborhood, square footage, kitchen quality) and provide similarity scores for transparency.

5. **Result Summarization**:
   - The system generates a concise summary of the top property matches, making it easy for users to understand and compare their options.

---

### **Agentic Aspects**
This project is highly **agentic** due to its ability to:
1. **Dynamic Query Understanding**: The system doesn’t rely on static inputs but actively transforms and optimizes user queries for better results.
2. **Modular Tool Integration**:
   - Combines tools like LLMs for rewriting, PSO for optimization, and vector databases for search.
   - Tools are treated as modular components that can adapt to various inputs and contexts.
3. **Interactive Decision-Making**: The pipeline dynamically adapts the query and search process based on user inputs and optimization results.
4. **Transparent Explanations**: Provides clear reasoning for its recommendations, mimicking human-like decision-making and enhancing user trust.
---

Let's start the code!

- Install the required packages for optimization, vector storage, and text generation.

In [1]:
!pip install pyswarm langchain_community chromadb langgraph transformers -U

Collecting pyswarm
  Downloading pyswarm-0.6.tar.gz (4.3 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting langchain_community
  Downloading langchain_community-0.3.13-py3-none-any.whl.metadata (2.9 kB)
Collecting chromadb
  Downloading chromadb-0.6.0-py3-none-any.whl.metadata (6.8 kB)
Collecting langgraph
  Downloading langgraph-0.2.60-py3-none-any.whl.metadata (15 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.13 (from langchain_community)
  Downloading langchain-0.3.13-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.27 (from langchain_community)
  Downloading langchain_core-0.3.28-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Down

- Import the libraries for data manipulation, embedding creation, vector storage, and optimization.

In [2]:
import os
import pandas as pd
import numpy as np
from transformers import pipeline
from langchain.schema import Document
from pyswarm import pso
from langchain.vectorstores import Chroma
from langchain.prompts import PromptTemplate
import torch
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain, SequentialChain
from langchain.agents import Tool

##Step 1: Data Preparation
- Load the Ames real estate dataset properties and limits it to the first 100 rows for faster processing.

- A full description of the Ames housing dataset can be seen [here](https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset).

In [3]:
df_path = "https://raw.githubusercontent.com/MPAghababa/llms/main/real_estate/ames_real_estate.csv"
df = pd.read_csv(df_path)
df = df[:100]
df.head(3)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500


##Step 2: Data-to-Text Generation
- Initialize a text-to-text generation pipeline using the google/flan-t5-large model.

- Generate descriptions for each property using the language model. This step could be computationally expensive.

In [4]:
LLM_MODEL = "google/flan-t5-large"
generator = pipeline('text2text-generation', model=LLM_MODEL, tokenizer=LLM_MODEL)

description_prompt_template = """
You are given the following property details:
- Neighborhood: {Neighborhood}
- House Style: {HouseStyle}
- Overall Quality (1-10): {OverallQual}
- Year Built: {YearBuilt}
- Above Ground Living Area (sq ft): {GrLivArea}
- Number of Bedrooms Above Ground: {BedroomAbvGr}
- Kitchen Quality rating: {KitchenQual}
- Lot Frontage (feet): {LotFrontage}
- Total Rooms Above Ground: {TotRmsAbvGrd}
- Number of Fireplaces: {Fireplaces}
- Pool Quality: {PoolQC}
- Garage Type: {GarageType}
- Exterior Condition: {ExterCond}
"""

def generate_description(row):
    input_text = description_prompt_template.format(
        Neighborhood=row["Neighborhood"],
        HouseStyle=row["HouseStyle"],
        OverallQual=str(row["OverallQual"]),
        YearBuilt=str(row["YearBuilt"]),
        GrLivArea=str(row["GrLivArea"]),
        BedroomAbvGr=str(row["BedroomAbvGr"]),
        KitchenQual=row["KitchenQual"],
        LotFrontage=str(row.get("LotFrontage", "N/A")),
        TotRmsAbvGrd=str(row.get("TotRmsAbvGrd", "N/A")),
        Fireplaces=str(row.get("Fireplaces", "N/A")),
        PoolQC=row.get("PoolQC", "N/A"),
        GarageType=row.get("GarageType", "N/A"),
        ExterCond=row.get("ExterCond", "N/A")
    )
    description = generator(input_text, max_length=150, num_return_sequences=1)
    return description[0]['generated_text'].strip()

df["Description"] = df.apply(generate_description, axis=1)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


##Step 3: Embedding and Vector Storage

- Initialize an embedding model to transform text into a vector space.

- Creates a list of Document objects for storing property descriptions with metadata.

- Stores the embedded descriptions in a vector database for efficient similarity searches.



In [5]:
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)

embedding = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"}
    )

docs = []
for _, row in df.iterrows():
    description = (
        f"Property in {row['Neighborhood']} built in {row['YearBuilt']}, "
        f"featuring {row['GrLivArea']} sq ft of living space, "
        f"{row['BedroomAbvGr']} bedrooms, and a kitchen rated {row['KitchenQual']}."
    )
    docs.append(
        Document(
            page_content=description,
            metadata=row.to_dict()
        )
    )

persist_directory = "./chromadb"
vectordb = Chroma.from_texts(
    texts=[doc.page_content for doc in docs],
    metadatas=[doc.metadata for doc in docs],
    embedding=embedding,
    persist_directory=persist_directory
    )

vectordb.persist()

  embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

  vectordb.persist()


##Step 4: PSO for Query Rewriting

- Define a function to adjust a query based on feature weights.

- Use Particle Swarm Optimization (PSO) to rewrite queries for better search results.

In [6]:
feature_map = {
    "YearBuilt": "built in the year",
    "GrLivArea": "with approximately square feet",
    "KitchenQual": "with a high-quality kitchen",
    "LotFrontage": "with a frontage of approximately",
    "TotRmsAbvGrd": "with a total of",
    "Fireplaces": "featuring fireplaces",
    "PoolQC": "with a pool of quality",
    "GarageType": "with an attached garage",
    "ExterCond": "in excellent exterior condition",
}

def optimize_query(query, features_weights, feature_map, weight_threshold=0.1):
    rewritten_query = query
    for i, (feature, description) in enumerate(feature_map.items()):
        if features_weights[i] > weight_threshold:
            rewritten_query += f", {description}: {features_weights[i]:.1f}"
    return rewritten_query

def pso_query_rewriting(query, vectorstore):
    def objective_function(weights):
        optimized_query = optimize_query(query, weights, feature_map)
        results_with_scores = vectorstore.similarity_search_with_score(optimized_query, k=3)
        similarity_scores = [score for _, score in results_with_scores]
        return -np.mean(similarity_scores)

    lb = [10] * len(feature_map)
    ub = [20] * len(feature_map)
    best_weights, _ = pso(objective_function, lb, ub, swarmsize=10, maxiter=10)
    return optimize_query(query, best_weights, feature_map)

##Step 5: Similarity Search with Explanation

- Perform a similarity search in the vector database and generate explanations for the matches.

In [7]:
def find_similar_properties_with_explanation(query: str, k: int = 3):

    results = vectordb.similarity_search_with_score(query, k=k)

    explanations = []
    for i, (result, score) in enumerate(results):
        explanation = (
            f"Property {i + 1}:\n"
            f"Description: {result.page_content}\n"
            f"Similarity Score: {score:.2f}\n"
            )

        explanations.append(explanation)

    return explanations

##Step 6: Pipeline Integration

- Combine query processing, optimization, and similarity search into a single pipeline.

- Execute the full pipeline and return a summary of the search results.

- summarize_results: condenses the results of the property search into a user-friendly text summary. While not core to the agent's decision-making, it plays a vital role in creating a clear response.

- llm_pipeline: Initializes an LLM for text-to-text generation. This is used to improve or rewrite the user's query into a form that is more suitable for the downstream tasks.

- prompt_template: Defines a structured template for the LLM to rewrite user queries. This ensures that the rewritten query aligns with the system's capabilities.

- llm_chain: Combines the prompt and LLM into a chain and processes the user's query through a formalized rewriting step.

- find_similar_properties: uses the vectordb vector store to find properties most similar to the query. It also eliminates duplicates and explains why each match is relevant.

- find_similar_tool and summarize_tool: These tools package functionality into reusable modules, allowing the agent to access and combine capabilities interactively.

In [8]:
def summarize_results(results: list):

    summary = "Top Property Matches:\n"
    for i, result in enumerate(results):
        summary += f"- Property {i + 1}: {result}\n"
    return summary

llm_pipeline = pipeline("text2text-generation", model=LLM_MODEL, tokenizer=LLM_MODEL)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

prompt_template = PromptTemplate(
    template=(
        "You are an AI assistant specializing in real estate recommendations. "
        "Take the user's query and rewrite it clearly for property matching. "
        "Example: Given 'large backyard and swimming pool', rewrite as 'Looking for properties with a large backyard and a swimming pool'. "
        "Now, rewrite the query: {query}"
    ),
    input_variables=["query"]
    )

llm_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="processed_query")

query_chain = SequentialChain(
    chains=[llm_chain],
    input_variables=["query"],
    output_variables=["processed_query"]
    )

def find_similar_properties(query: str, k: int = 3):

    results_with_scores = vectordb.similarity_search_with_score(query, k=k * 2)

    seen_ids = set()
    unique_results = []
    for result, score in results_with_scores:
        if result.metadata["Id"] not in seen_ids:
            seen_ids.add(result.metadata["Id"])
            unique_results.append((result, score))
        if len(unique_results) == k:
            break

    explanations = []
    for i, (result, score) in enumerate(unique_results):
        explanation = (
            f"Property {i + 1}:\n"
            f"Description: {result.page_content}\n"
            f"Matching Features: "
            f"Neighborhood: {result.metadata['Neighborhood']}, "
            f"Kitchen Quality: {result.metadata['KitchenQual']}, "
            f"Living Area: {result.metadata['GrLivArea']} sq ft.\n"
            f"Similarity Score: {score:.2f}\n")

        explanations.append(explanation)

    return explanations


find_similar_tool = Tool(
    name="Find Similar Properties",
    func=find_similar_properties_with_explanation,
    description="Find properties that match a given query and provide explanations."
    )

summarize_tool = Tool(
    name="Summarize Results",
    func=summarize_results,
    description="Summarize the top property matches into a concise overview."
    )


def full_pipeline(query):
    # Step 1: Process the query through the LLM
    processed_query = query_chain.run({"query": query})
    print("Processed Query from LLM:", processed_query)

    # Step 2: Optimize the query using PSO
    optimized_query = pso_query_rewriting(processed_query.strip(), vectordb)
    print("Optimized Query with PSO:", optimized_query)

    # Step 3: Find similar properties
    similar_properties = find_similar_properties_with_explanation(optimized_query.strip())

    # Step 4: Summarize the results
    summary = summarize_results(similar_properties)

    return summary


Device set to use cuda:0
  llm = HuggingFacePipeline(pipeline=llm_pipeline)
  llm_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="processed_query")


In [9]:
test_query_1 = "Looking for a property in a great neighborhood with a high-quality kitchen"
response = full_pipeline(test_query_1)
print("Final Results:\n", response)

  processed_query = query_chain.run({"query": query})


Processed Query from LLM: Looking for a property in a great neighborhood with a high-quality kitchen
Stopping search: maximum iterations reached --> 10
Optimized Query with PSO: Looking for a property in a great neighborhood with a high-quality kitchen, built in the year: 19.8, with approximately square feet: 10.4, with a high-quality kitchen: 20.0, with a frontage of approximately: 10.9, with a total of: 20.0, featuring fireplaces: 18.5, with a pool of quality: 16.2, with an attached garage: 20.0, in excellent exterior condition: 20.0
Final Results:
 Top Property Matches:
- Property 1: Property 1:
Description: Property in NAmes built in 1960, featuring 1253 sq ft of living space, 2 bedrooms, and a kitchen rated TA.
Similarity Score: 0.56

- Property 2: Property 2:
Description: Property in NAmes built in 1959, featuring 1225 sq ft of living space, 3 bedrooms, and a kitchen rated TA.
Similarity Score: 0.56

- Property 3: Property 3:
Description: Property in NAmes built in 1959, featurin

In [10]:
test_query_2 = "Give me some properties with at least two bedrooms"
response = full_pipeline(test_query_2)
print("Final Results:\n", response)

Processed Query from LLM: Looking for properties with at least two bedrooms.
Stopping search: maximum iterations reached --> 10
Optimized Query with PSO: Looking for properties with at least two bedrooms., built in the year: 20.0, with approximately square feet: 10.9, with a high-quality kitchen: 10.0, with a frontage of approximately: 10.7, with a total of: 10.9, featuring fireplaces: 17.6, with a pool of quality: 12.2, with an attached garage: 18.3, in excellent exterior condition: 13.4
Final Results:
 Top Property Matches:
- Property 1: Property 1:
Description: Property in ClearCr built in 1953, featuring 2287 sq ft of living space, 3 bedrooms, and a kitchen rated TA.
Similarity Score: 0.69

- Property 2: Property 2:
Description: Property in NAmes built in 1960, featuring 1253 sq ft of living space, 2 bedrooms, and a kitchen rated TA.
Similarity Score: 0.69

- Property 3: Property 3:
Description: Property in NAmes built in 1958, featuring 1339 sq ft of living space, 3 bedrooms, and 

##Conclusions and Remarks

### **Conclusion**
This project exemplifies an **agentic AI system** by integrating multiple advanced technologies into a cohesive, goal-oriented workflow. It not only processes queries intelligently but also adapts dynamically to deliver optimized, transparent, and user-centric property recommendations. This approach is a significant step toward building robust, explainable, and interactive AI agents for real-world applications.

### **Notes and Remarks**

Here are some thoughtful notes and remarks you can include at the end of your project to provide additional context, future directions, and reflections:

---

### **Remarks**

1. **Explainability as a Key Feature**:
   - The inclusion of explanations for property recommendations ensures transparency and builds user trust.
   - Future iterations could enhance this further by visualizing property features or incorporating user feedback into the explanation generation process.

2. **Scalability Considerations**:
   - While the system processes 100 properties efficiently, scaling to larger datasets may require optimizations in embedding generation, vector search, and query rewriting.
   - Leveraging distributed systems or cloud-based vector databases (e.g., Pinecone, Weaviate) could enhance scalability.

3. **Handling Missing or Incomplete Data**:
   - The current implementation relies on the assumption that most property data fields are available.
   - Future enhancements could include data imputation techniques or flexible prompts that adapt to missing fields to maintain robust output.

4. **User Interaction and Feedback**:
   - Adding a user feedback loop could refine recommendations over time, enabling the system to learn from preferences and improve personalization.

5. **Generalizability Across Domains**:
   - While this project focuses on real estate, the agentic AI architecture can be adapted to other recommendation domains (e.g., job matching, travel planning) with minor modifications to the input data and prompts.

6. **Integration with External APIs**:
   - The system could be expanded to integrate live real estate APIs (e.g., Zillow, Realtor.com) to fetch real-time property data and enrich the recommendation process.

7. **Ethical Considerations**:
   - Ensuring unbiased recommendations is critical, especially when relying on data that may inadvertently reflect historical biases.
   - Implementing mechanisms to audit and validate fairness in recommendations would strengthen the system’s ethical foundation.


8. **Potential for Conversational Interfaces**:
   - Extending the system to include a conversational interface (e.g., chatbot) could make the interaction more natural and user-friendly.
   - This would further enhance the agentic nature of the system, making it feel more like a virtual real estate agent.

9. **Performance Metrics**:
    - Future development could include metrics to evaluate system performance, such as precision and recall for recommendations or user satisfaction scores based on feedback.

---



## Let’s connect and let me know if you have any comments.

https://www.linkedin.com/in/mpaghababa/