# Synthetic Data Generator

This notebook generates synthetic real estate listings for properties in Milan, Italy using a Large Language Model (LLM). 
The workflow consists of three main steps:
1. Loading structured property data (area, location, amenities, etc.)
2. Generating natural language descriptions using an LLM
3. Computing embeddings and storing them in a vector database

The generated listings will be used in the `HomeMatch.ipynb` notebook to create a recommendation system
that matches buyer preferences with available properties using semantic search.
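
To make the semantic search idea concrete, here is a minimal sketch that ranks listings by cosine similarity to a buyer-preference vector. The three-dimensional vectors and the listing names are made up for illustration; real embeddings would come from an embedding model, and nothing here reflects how `HomeMatch.ipynb` is actually implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption: real ones have hundreds of dimensions)
listing_vectors = {
    "listing_a": [0.9, 0.1, 0.0],
    "listing_b": [0.1, 0.8, 0.3],
    "listing_c": [0.7, 0.3, 0.1],
}
buyer_vector = [1.0, 0.2, 0.0]

# Rank listings from most to least similar to the buyer's preferences
ranked = sorted(
    listing_vectors,
    key=lambda name: cosine_similarity(buyer_vector, listing_vectors[name]),
    reverse=True,
)
print(ranked)
```

Listings whose vectors point in nearly the same direction as the buyer vector rank first, which is the core of similarity-based matching.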

In [None]:
# Import required libraries and set up environment
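import os
import subprocess
import sys

import pandas as pd

# Get project root directory from git
# check_output raises if this is not a git repository, instead of silently returning ""
source = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"], text=True
).strip()
sys.path.insert(0, source)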

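# Configure OpenAI API credentials
# Note: the API key is intentionally left empty here and should be set securely
# (for example in the shell) before running; setdefault keeps an existing value intact
os.environ.setdefault("OPENAI_API_KEY", "")
os.environ["OPENAI_API_BASE"] = "https://openai.vocareum.com/v1"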

# Import custom modules for data handling and generation
from src.Configurator import Configurator 
from src.SyntheticDataGenerator import SyntheticDataGenerator

%load_ext autoreload
%autoreload 2

In [None]:
# Initialize configuration from YAML file
# This loads parameters for data processing, model settings, and file paths
config = Configurator(
    source=source,
    yaml_file_path=f"{source}/src/config.yaml",
)

In [None]:
# Initialize SyntheticDataGenerator
# n_listings is the number of listings to generate
sdg = SyntheticDataGenerator(
    config=config,
    n_listings=100,
    verbose=1,
)

# Load house data
Load the structured dataset containing Milan property information.
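
As a rough illustration of what structured property data could look like, the sketch below parses a tiny inline CSV with the standard library. The column names (`area_sqm`, `zone`, `rooms`, `dist_metro_m`) and both rows are invented for this example; the real dataset's schema and the internals of `load_house_data()` are not shown in this notebook.

```python
import csv
import io

# Hypothetical sample data; the real file and its columns may differ
SAMPLE_CSV = """area_sqm,zone,rooms,dist_metro_m
75,Navigli,3,350
120,Brera,4,500
"""


def parse_houses(text):
    """Parse CSV text into a list of dicts, converting numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "area_sqm": float(row["area_sqm"]),
                "zone": row["zone"],
                "rooms": int(row["rooms"]),
                "dist_metro_m": float(row["dist_metro_m"]),
            }
        )
    return rows


houses = parse_houses(SAMPLE_CSV)
print(houses[0])
```

Converting types up front means later steps can rely on numbers being numbers rather than strings.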

In [None]:
# Load the house dataset into a pandas DataFrame
# This contains features like area, zone, rooms, and distances to POIs
sdg.load_house_data()

# Generate listings
Create natural language descriptions for properties using an LLM.
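
One common way to do this is to turn each structured record into a text prompt and send it to the model. The sketch below covers only the prompt-building half, with no API call. The `build_prompt` helper, the field names, and the prompt wording are assumptions for illustration and are not taken from `SyntheticDataGenerator`.

```python
def build_prompt(house):
    """Render one property record as an instruction for an LLM."""
    # One bullet per field so the model sees every fact explicitly
    facts = "\n".join(f"- {key}: {value}" for key, value in house.items())
    return (
        "Write a short, realistic real estate listing for a property in Milan, Italy.\n"
        "Use only the facts below and do not invent amenities.\n"
        f"{facts}"
    )


# Hypothetical record; real records would come from the loaded dataset
example_house = {"zone": "Navigli", "area_sqm": 75, "rooms": 3}
prompt = build_prompt(example_house)
print(prompt)
```

Listing the facts explicitly and asking the model not to add others is one way to keep generated text consistent with the structured data.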

In [None]:
# Generate natural language descriptions for properties
# This uses the OpenAI API to create human-like listing texts
sdg.generate_listings()

# Save listings
Store both the raw listings and their vector embeddings

## Save listings' text
Store the raw text descriptions for future reference
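
The on-disk format used by `save_listings()` is not shown here. As one possible approach, the sketch below writes listings as JSON Lines (one JSON object per line) and reads them back. The helper names, the record fields, and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical listings, standing in for generated text
listings = [
    {"id": "listing_0", "text": "Bright three-room flat near the canal."},
    {"id": "listing_1", "text": "Quiet top-floor apartment with a terrace."},
]

# Round-trip through a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "listings.jsonl"
    save_jsonl(listings, out_path)
    restored = load_jsonl(out_path)

print(restored == listings)
```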

In [None]:
# Save the generated listings to disk
sdg.save_listings()

## Create embeddings and save them in ChromaDB
Convert listings to vector embeddings for semantic search
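
To show the shape of the data involved, the sketch below embeds each listing and stores it under an ID next to its text. It is deliberately a stand-in: `toy_embed` is a hashed bag-of-words rather than a learned embedding model, and the plain dictionary is not the ChromaDB API. Both are assumptions chosen so the example runs without extra dependencies.

```python
import hashlib
import math

DIM = 8  # toy dimensionality; real embedding models use far more


def toy_embed(text, dim=DIM):
    """Hash each word into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# In-memory stand-in for a vector-store collection: id -> document + embedding
store = {}


def add_listing(listing_id, text):
    """Store the listing text together with its embedding."""
    store[listing_id] = {"document": text, "embedding": toy_embed(text)}


add_listing("listing_0", "Bright flat near the canal")
add_listing("listing_1", "Quiet apartment with a terrace")
print(len(store), len(store["listing_0"]["embedding"]))
```

Keeping the original text alongside each vector lets a later search return readable listings rather than bare numbers.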

In [None]:
# Generate embeddings for each listing and store them in ChromaDB
# These vectors will be used for semantic similarity search in HomeMatch.ipynb
sdg.save_embeddings_chromadb()