
Okay, let's dive into the first step: the **Directory Watcher** for Windows.

**How the Directory Watcher Can Work on Windows**

Windows offers a few mechanisms for monitoring file system changes, allowing us to implement a robust directory watcher. Here's a breakdown of the main options:

**1. `ReadDirectoryChangesW` API**

   *   **Mechanism:** This is the most powerful and efficient way to monitor directory changes on Windows. It's a native Win32 API function that reports changes within a directory (and optionally its subdirectories), either synchronously or asynchronously through overlapped I/O.
   *   **Pros:**
        *   **Event-Driven:** It operates on an event-driven model. Your watcher doesn't need to constantly poll the directory, making it very resource-efficient.
        *   **Detailed Information:** The API reports each change's type (file added, removed, modified, or renamed) and the affected filename, relative to the watched directory. It does not include timestamps; read those separately if you need them, for example with `os.stat()`.
        *   **Handles Overlapped I/O:** This enables highly efficient, asynchronous operations, minimizing blocking and improving performance.
   *   **Cons:**
        *   **Complexity:** It's a relatively low-level API, requiring you to manage buffers, handle overlapped structures, and deal with Win32 data types.
        *   **Requires Native Code (or Wrapper):** You'll either need to write C/C++ code or use a Python library that wraps this API (more on this below).

**2. `FindFirstChangeNotification` API (Simpler, but Less Powerful)**

   *   **Mechanism:** This is another Win32 API function that's easier to use than `ReadDirectoryChangesW`. It creates a handle that becomes signaled when a specified type of change occurs in a directory or its subtree.
   *   **Pros:**
        *   **Simpler to Use:** Compared to `ReadDirectoryChangesW`, it has a simpler interface.
   *   **Cons:**
        *   **Less Information:** You only get notified *that* a change happened, not *what* changed. You'll need to scan the directory to figure out the details.
        *   **Can Miss Events:** In certain scenarios (especially with rapid changes), it might miss some events.
        *   **Still Requires Polling (Sort Of):** Although the handle is signaled by the system rather than polled, you still need a loop that waits on it with `WaitForSingleObject` or `WaitForMultipleObjects` and then calls `FindNextChangeNotification` to re-arm it for further changes.

**3. Python Libraries (Wrappers around `ReadDirectoryChangesW` or `FindFirstChangeNotification`)**

   *   **Mechanism:** These libraries provide a more Pythonic interface to the underlying Win32 API functions, simplifying the development process.
   *   **Pros:**
        *   **Easier Development:** They handle the low-level complexities of the Win32 API, letting you focus on the logic.
   *   **Cons:**
        *   **Dependency:** You'll need to install an additional library.
        *   **Abstraction Overhead:** There might be a slight performance overhead compared to using the raw API directly, but usually negligible.
   *   **Popular Options:**
        *   **`watchdog`:** A very popular and well-maintained library that provides a cross-platform way to monitor file system events (it uses `ReadDirectoryChangesW` on Windows).
        *   **`pywin32`:**  This library provides access to many Win32 APIs, including `ReadDirectoryChangesW` and `FindFirstChangeNotification`, through a Python interface.

**4. Polling (Not Recommended for Real-Time Monitoring)**

   *   **Mechanism:** This involves repeatedly scanning the directory at regular intervals (e.g., every few seconds) and comparing the current state with the previous state to detect changes.
   *   **Pros:**
        *   **Simple to Implement:** It's relatively easy to write a basic polling loop using Python's `os.listdir()` or `os.scandir()`.
   *   **Cons:**
        *   **Inefficient:** It consumes resources even when no changes occur.
        *   **Latency:** There's a delay between when a change happens and when your watcher detects it, proportional to the polling interval.
        *   **Potential to Miss Events:** If changes happen rapidly between polling intervals, they might be missed.
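For comparison, a polling pass can be sketched as two directory snapshots and a diff between them. This is a minimal illustration, not a recommended design; `snapshot` and `diff_snapshots` are hypothetical names, and modification detection here assumes that a changed file's size or modification time differs between passes.

```python
import os
import tempfile


def snapshot(directory: str) -> dict[str, tuple[int, int]]:
    """Map each regular file name to its (mtime_ns, size)."""
    state = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                state[entry.name] = (info.st_mtime_ns, info.st_size)
    return state


def diff_snapshots(old, new):
    """Return (added, removed, modified) file names between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, modified


with tempfile.TemporaryDirectory() as inbox:
    def write(name, text):
        with open(os.path.join(inbox, name), "w") as fh:
            fh.write(text)

    write("a.txt", "first")
    write("b.txt", "second")
    before = snapshot(inbox)

    write("a.txt", "first, edited")  # size changes, so the diff sees it
    os.remove(os.path.join(inbox, "b.txt"))
    write("c.txt", "third")
    after = snapshot(inbox)

    print(diff_snapshots(before, after))
    # (['c.txt'], ['b.txt'], ['a.txt'])
```

A file that is created and deleted between two snapshots never appears in either one, which is exactly the missed-event risk listed above.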

**Recommendation: `watchdog` (or `ReadDirectoryChangesW` if you need ultimate control)**

For most cases, the **`watchdog`** library is the best choice. It offers a good balance of ease of use, efficiency, and reliability. It's actively maintained and handles the intricacies of `ReadDirectoryChangesW` under the hood.

If you absolutely need the maximum possible performance and are comfortable dealing with low-level APIs, you could consider using `ReadDirectoryChangesW` directly via `pywin32`, but this adds complexity.

**Does it need to run on a schedule?**

No, it **should not** run on a schedule in the traditional sense (like a cron job). It should be a **long-running process** that utilizes event-driven mechanisms (like those provided by `ReadDirectoryChangesW` or `watchdog`) to react to changes in real time.

**Does it need to keep track of existing files?**

*   **Initially, Yes:** When the Directory Watcher starts, it should perform an initial scan of the directory to establish a baseline of existing files. This avoids treating every existing file as "new" when monitoring begins. The scan is a one-time action each time the watcher starts.
*   **After Startup, No (Mostly):** Once the initial scan is complete, the event-driven mechanisms will notify the watcher of new files, modified files, and deleted files. The watcher itself doesn't need to actively maintain a list of files *in most cases*. It relies on the events it receives.
*   **Edge Cases (Where Tracking Might Be Helpful):**
    *   **Renames:** If you need to accurately track file renames, you might need to do some extra work, as the events you receive might be a "delete" followed by a "create." You could potentially use file modification times or other heuristics to infer renames. `watchdog` helps here: it reports a rename within the watched directory as a single moved event (handled in `on_moved`) that carries both the old and the new path.
    *   **Missed Events (Rare):** In extremely rare cases of missed events, a periodic background check (not a full scan) might be a safety net, but this shouldn't be the primary mechanism.
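The safety-net check mentioned above could look something like the following sketch. It assumes the watcher keeps a set of tracked paths (like `existing_files` in the example later in this notebook) and compares it against a cheap listing of names, without opening or re-extracting any file. `reconcile` is a hypothetical helper name.

```python
import os
import tempfile


def reconcile(tracked: set[str], directory: str) -> tuple[set[str], set[str]]:
    """Compare tracked paths with what is on disk.

    Returns (untracked, vanished): files present on disk that the watcher
    never saw, and tracked files that no longer exist.
    """
    with os.scandir(directory) as entries:
        on_disk = {entry.path for entry in entries if entry.is_file()}
    return on_disk - tracked, tracked - on_disk


with tempfile.TemporaryDirectory() as inbox:
    for name in ("kept.txt", "late.txt"):
        with open(os.path.join(inbox, name), "w") as fh:
            fh.write("data")

    # Pretend the watcher saw kept.txt, missed the creation of late.txt,
    # and missed the deletion of gone.txt.
    tracked = {os.path.join(inbox, "kept.txt"), os.path.join(inbox, "gone.txt")}

    untracked, vanished = reconcile(tracked, inbox)
    print([os.path.basename(p) for p in untracked])  # ['late.txt']
    print([os.path.basename(p) for p in vanished])   # ['gone.txt']
```

Run infrequently (say, from a timer inside the long-running process), this feeds any discrepancies back through the same handlers the events use, so it stays a fallback rather than the primary mechanism.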

**How will it treat new files differently from changed ones?**

`ReadDirectoryChangesW` reports an action code with each change, and `watchdog` surfaces the same distinctions through its `on_created`, `on_modified`, `on_moved`, and `on_deleted` handler methods. The underlying action codes are:

*   **`FILE_ACTION_ADDED`:** Indicates a file was added to the directory.
*   **`FILE_ACTION_MODIFIED`:** Indicates an existing file was modified. This can be a content change or a change to its timestamp or attributes.
*   **`FILE_ACTION_RENAMED_OLD_NAME` and `FILE_ACTION_RENAMED_NEW_NAME`:** Reported as a pair for a rename, carrying the old and the new name respectively.
*   **`FILE_ACTION_REMOVED`:** Indicates a file was deleted from the directory.
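As a rough illustration of how these action codes can be folded into higher-level events, the sketch below pairs each `FILE_ACTION_RENAMED_OLD_NAME` with the `FILE_ACTION_RENAMED_NEW_NAME` that follows it. The `Change` class and `to_changes` function are hypothetical; the `kind` values are chosen to echo `watchdog`'s handler names, but this is not `watchdog`'s actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Change:
    kind: str                        # "created", "modified", "deleted", or "moved"
    path: str
    dest_path: Optional[str] = None  # only set when kind == "moved"


_SIMPLE = {
    "FILE_ACTION_ADDED": "created",
    "FILE_ACTION_MODIFIED": "modified",
    "FILE_ACTION_REMOVED": "deleted",
}


def to_changes(actions: Iterable[tuple[str, str]]) -> list[Change]:
    """Translate (action, filename) pairs into high-level changes."""
    changes = []
    pending_old = None  # old name waiting for its matching new name
    for action, name in actions:
        if action == "FILE_ACTION_RENAMED_OLD_NAME":
            if pending_old is not None:
                # Previous old name never got a partner: assume it left the tree.
                changes.append(Change("deleted", pending_old))
            pending_old = name
        elif action == "FILE_ACTION_RENAMED_NEW_NAME":
            if pending_old is not None:
                changes.append(Change("moved", pending_old, name))
                pending_old = None
            else:
                # New name without an old one: assume it arrived from elsewhere.
                changes.append(Change("created", name))
        elif action in _SIMPLE:
            changes.append(Change(_SIMPLE[action], name))
    if pending_old is not None:
        changes.append(Change("deleted", pending_old))
    return changes


raw = [
    ("FILE_ACTION_ADDED", "agenda.pdf"),
    ("FILE_ACTION_RENAMED_OLD_NAME", "draft.docx"),
    ("FILE_ACTION_RENAMED_NEW_NAME", "minutes.docx"),
    ("FILE_ACTION_MODIFIED", "agenda.pdf"),
]
for change in to_changes(raw):
    print(change)
```

How to treat an unpaired rename record is a design choice; mapping it to a delete or a create, as done here, is only one reasonable option.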

Your Directory Watcher's event handler will receive these events and can then take different actions:

*   **New Files (`FILE_ACTION_ADDED`):**
    1.  Create a new `Document` object.
    2.  Populate initial metadata.
    3.  Pass the `Document` to the `ExtractorFactory`.

*   **Changed Files (`FILE_ACTION_MODIFIED`):**
    1.  You might need to check if you already have a `Document` object representing this file (this could involve a lookup mechanism, perhaps using the file path as a key).
    2.  If you have an existing `Document`, update its content and relevant metadata. This means you need to re-extract the content and update the metadata using `Extractor`. This also means the `Extractor` needs to support updating an existing `Document`.
    3.  If you don't have a corresponding `Document` (which should be rare if you're tracking properly), treat it as a new file.
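The create-versus-update decision described above can be sketched with a path-keyed lookup. Everything here is an assumption for illustration: `Document` is a stand-in for the pipeline's real class, `DocumentRegistry` is a hypothetical name, and bumping a `revision` counter stands in for re-extraction.

```python
import os
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for the pipeline's Document class."""
    path: str
    revision: int = 1


class DocumentRegistry:
    """Looks up Documents by file path to choose between create and update."""

    def __init__(self) -> None:
        self._by_path: dict[str, Document] = {}

    @staticmethod
    def _key(path: str) -> str:
        # Normalize so "./inbox/a.pdf" and "inbox/a.pdf" hit the same entry;
        # normcase also folds case on Windows, where paths are case-insensitive.
        return os.path.normcase(os.path.normpath(path))

    def handle(self, action: str, path: str) -> tuple[str, Document]:
        key = self._key(path)
        doc = self._by_path.get(key)
        if action == "FILE_ACTION_MODIFIED" and doc is not None:
            doc.revision += 1  # placeholder for re-extracting content and metadata
            return "updated", doc
        if action in ("FILE_ACTION_ADDED", "FILE_ACTION_MODIFIED"):
            # New file, or a modification for a file we never tracked.
            doc = Document(path)
            self._by_path[key] = doc
            return "created", doc
        raise ValueError(f"unsupported action: {action}")


registry = DocumentRegistry()
for action, path in [
    ("FILE_ACTION_ADDED", "./inbox/agenda.pdf"),
    ("FILE_ACTION_MODIFIED", "inbox/agenda.pdf"),
    ("FILE_ACTION_MODIFIED", "inbox/untracked.pdf"),
]:
    outcome, doc = registry.handle(action, path)
    print(outcome, doc.path, doc.revision)
# created ./inbox/agenda.pdf 1
# updated ./inbox/agenda.pdf 2
# created inbox/untracked.pdf 1
```

An in-memory dictionary is lost on restart; a real system would likely back this lookup with whatever store holds the metadata.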

**Example using `watchdog` (Illustrative)**

In [None]:
import logging
import os  # needed by initial_scan (os.listdir, os.path)
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

# ... (Your other classes: Document, ExtractorFactory, etc.)

class Watcher(FileSystemEventHandler):
    def __init__(self, directory_to_watch, extractor_factory, metadata_manager, storage_manager):
        self.directory_to_watch = directory_to_watch
        self.extractor_factory = extractor_factory
        self.metadata_manager = metadata_manager
        self.storage_manager = storage_manager
        self.existing_files = set()  # Keep track of files seen during initial scan

        # Initial scan to get the baseline
        self.initial_scan()

    def initial_scan(self):
        logging.info("Performing initial scan of directory...")
        for filename in os.listdir(self.directory_to_watch):
            filepath = os.path.join(self.directory_to_watch, filename)
            if os.path.isfile(filepath):
                self.existing_files.add(filepath)
        logging.info(f"Initial scan complete. Found {len(self.existing_files)} existing files.")

    def on_created(self, event):
        if not event.is_directory and event.src_path not in self.existing_files:
            logging.info(f"New file detected: {event.src_path}")
            self.process_file(event.src_path)
            self.existing_files.add(event.src_path)  # Add to existing files after processing

    def on_modified(self, event):
        if not event.is_directory and event.src_path in self.existing_files:
            logging.info(f"Modified file detected: {event.src_path}")
            # In a real system, you'd likely need a mechanism to look up
            # the existing Document object based on the filepath.
            # For simplicity, we'll just reprocess it as if it were new.
            self.process_file(event.src_path)

    def on_deleted(self, event):
        if not event.is_directory:
            logging.info(f"Deleted file detected: {event.src_path}")
            if event.src_path in self.existing_files:
                self.existing_files.remove(event.src_path)


    def process_file(self, filepath):
        try:
            # 1. Create a Document object
            doc = Document(filepath)

            # 2. Get the appropriate Extractor
            extractor = self.extractor_factory.get_extractor(doc)

            # 3. Extract content and metadata
            extractor.extract_content(doc)
            extractor.extract_metadata(doc)

            # 4. Handle metadata
            self.metadata_manager.store_metadata(doc)

            # 5. Store the raw file and extracted content
            self.storage_manager.store_raw_document(doc)
            if doc.content:  # Assuming you have a content attribute in your Document
                self.storage_manager.store_extracted_content(doc)

        except Exception as e:
            logging.error(f"Error processing {filepath}: {e}")

# ...

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    path_to_watch = "./inbound_directory"  # Replace with your directory
    # ... (Initialize your ExtractorFactory, MetadataManager, StorageManager)

    event_handler = Watcher(path_to_watch, extractor_factory, metadata_manager, storage_manager)
    observer = Observer()
    observer.schedule(event_handler, path_to_watch, recursive=False) # recursive=False - only current directory
    observer.start()
    try:
        while True:
            time.sleep(5)
    except KeyboardInterrupt:  # stop cleanly on Ctrl+C instead of swallowing every exception
        observer.stop()
    observer.join()

**In summary, the Directory Watcher on Windows should:**

1.  **Use `watchdog` (preferred) or `ReadDirectoryChangesW` for efficient, event-driven monitoring.**
2.  **Be a long-running process, not a scheduled task.**
3.  **Perform an initial scan on startup to establish a baseline but then rely on events.**
4.  **Clearly distinguish between new and modified files using the event types provided by the underlying mechanism.**
5.  **Handle events appropriately to create or update `Document` objects and trigger the rest of the ELT pipeline.**