# Assignment: Connect Pub/Sub to Invoke Document Processing Pipeline


## Objective
This assignment focuses on building an event-driven document processing pipeline using Google Cloud Pub/Sub. You will learn to publish messages to a Pub/Sub topic and trigger a Cloud Function (or a similar serverless compute service) that simulates a document processing task. This pattern is essential for scalable and decoupled architectures in real-world data pipelines.

## Part 1: Google Cloud Project Setup (20 Marks)

1.  **GCP Project Setup:**
    * Ensure you have an active Google Cloud Platform (GCP) project. If not, create a new one.
    * Make sure billing is enabled for your project.
    * Install and configure the `gcloud` CLI on your local machine if you haven't already.
    * Show the `gcloud auth login` and `gcloud config set project [YOUR_PROJECT_ID]` commands with successful output.

2.  **Enable APIs:**
    * Enable the following APIs in your GCP project:
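        * Cloud Pub/Sub API (`pubsub.googleapis.com`)
        * Cloud Functions API (`cloudfunctions.googleapis.com`)
        * Cloud Build API (`cloudbuild.googleapis.com`, used when deploying Cloud Functions)
        * Cloud Logging API (`logging.googleapis.com`)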
    * Provide `gcloud services enable` commands for each required API.
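
The setup steps above can be sketched from a notebook cell as follows. This is a minimal sketch, not a required solution: `PROJECT_ID` is a placeholder, the helper names `setup_commands` and `run_all` are hypothetical, and `gcloud auth login` is left out because it is interactive and is easier to run directly in a terminal. By default the cell only prints the commands (dry run).

```python
import shlex
import subprocess

PROJECT_ID = "your-project-id"  # placeholder: replace with your own project ID

# Service identifiers for the four APIs listed in this part.
REQUIRED_SERVICES = [
    "pubsub.googleapis.com",
    "cloudfunctions.googleapis.com",
    "cloudbuild.googleapis.com",
    "logging.googleapis.com",
]


def setup_commands(project_id, services=REQUIRED_SERVICES):
    """Return the gcloud invocations as argv lists, in the order to run them."""
    return [
        ["gcloud", "config", "set", "project", project_id],
        ["gcloud", "services", "enable", *services],
        ["gcloud", "services", "list", "--enabled"],  # confirm enablement
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Dry run: prints the commands without calling gcloud.
run_all(setup_commands(PROJECT_ID))
```

Calling `run_all(setup_commands(PROJECT_ID), dry_run=False)` would execute the commands, assuming the `gcloud` CLI is installed and already authenticated.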

In [None]:
# Your GCP CLI commands for project setup and API enablement here.
# Provide screenshots or text output confirming successful API enablement.

## Part 2: Pub/Sub Topic and Publisher (30 Marks)

1.  **Create a Pub/Sub Topic:**
    * Create a new Pub/Sub topic. Name it meaningfully (e.g., `document-processing-requests`).
    * Use the `gcloud pubsub topics create` command.
    * Provide the command and confirm its successful creation.

2.  **Develop a Python Publisher Script:**
    * Write a Python script (`publisher.py`) that:
        * Imports the `pubsub_v1` module from the `google-cloud-pubsub` library (`from google.cloud import pubsub_v1`).
        * Initializes a Pub/Sub Publisher client.
        * Publishes at least **3 distinct messages** to your created topic.
        * Each message should simulate a document processing request. For example, it could be a JSON string containing a `document_id` and a `file_path` (e.g., `gs://my-bucket/documents/doc1.pdf`).
        * Encodes the message data as bytes (e.g., UTF-8) before publishing, since Pub/Sub message payloads must be bytes.
        * Include error handling for publishing messages.
    * Provide the full Python code for `publisher.py`.

3.  **Run the Publisher and Verify (via Console):**
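    * Create a temporary pull subscription on the topic before publishing. By default, Pub/Sub delivers only messages published after a subscription exists, so messages sent to a topic with no subscription cannot be pulled later.
    * Run your `publisher.py` script locally.
    * Go to the Pub/Sub topic in the GCP Console and use the "Messages" tab to pull from that subscription (or run `gcloud pubsub subscriptions pull`) to verify that your messages were published successfully.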
    * Provide a screenshot of the Pub/Sub console showing published messages or pull results.
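
One possible shape for `publisher.py` is sketched below. It is a starting point under stated assumptions, not a reference solution: `PROJECT_ID` is a placeholder, the topic name and bucket path are the examples from the text above, and the helper names (`build_requests`, `encode_request`, `publish_all`) are hypothetical. The client import sits inside `main()` so the helpers can be tried without the `google-cloud-pubsub` package installed.

```python
import json

PROJECT_ID = "your-project-id"             # placeholder: replace with your project ID
TOPIC_ID = "document-processing-requests"  # example topic name from this assignment


def build_requests():
    """Return three distinct simulated document processing requests."""
    return [
        {"document_id": f"doc-{i}", "file_path": f"gs://my-bucket/documents/doc{i}.pdf"}
        for i in range(1, 4)
    ]


def encode_request(request):
    """Serialize a request dict to JSON and encode it as UTF-8 bytes."""
    return json.dumps(request).encode("utf-8")


def publish_all(publisher, topic_path, requests, timeout=30):
    """Publish each request; return (message_ids, errors)."""
    message_ids, errors = [], []
    for request in requests:
        try:
            future = publisher.publish(topic_path, encode_request(request))
            # result() blocks until the server acknowledges the publish
            # and returns the message ID, or raises on failure/timeout.
            message_ids.append(future.result(timeout=timeout))
        except Exception as exc:  # broad on purpose: report and continue
            errors.append((request["document_id"], exc))
    return message_ids, errors


def main():
    # Imported lazily; requires `pip install google-cloud-pubsub`.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    message_ids, errors = publish_all(publisher, topic_path, build_requests())
    for message_id in message_ids:
        print(f"Published message {message_id}")
    for document_id, exc in errors:
        print(f"Failed to publish {document_id}: {exc!r}")
```

To use it as a script, end the file with a call to `main()`. The sketch assumes the topic already exists (for example, created with `gcloud pubsub topics create document-processing-requests`) and that application default credentials are configured locally.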

In [None]:
# Your gcloud command for creating the Pub/Sub topic.
# Full Python code for `publisher.py`.
# Screenshot of GCP Console verifying published messages.

## Part 3: Cloud Function for Document Processing (40 Marks)

1.  **Create Cloud Function Code:**
    * Create a new directory (e.g., `document_processor_function`).
    * Inside this directory, create two files:
        * `main.py`: Contains your Cloud Function logic.
        * `requirements.txt`: Lists any Python dependencies for your function (e.g., `functions-framework`).
    * In `main.py`, write a Python function that:
        * Is triggered by a Pub/Sub message.
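        * Takes `event` and `context` arguments (the standard signature for 1st gen Pub/Sub-triggered Cloud Functions; 2nd gen functions receive a single CloudEvent argument instead).
        * Decodes the incoming Pub/Sub message data, which arrives base64-encoded in `event['data']`.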
        * Parses the `document_id` and `file_path` from the message.
        * Simulates a document processing task (e.g., print a message like "Processing document [ID] from [Path]", simulate a delay with `time.sleep()`, or perform a simple string manipulation).
        * Logs the received data and the simulated processing steps to standard output (which Cloud Logging will capture).
    * Provide the full Python code for `main.py` and `requirements.txt`.

2.  **Deploy the Cloud Function:**
    * Deploy your Cloud Function using the `gcloud functions deploy` command.
    * Configure it to be triggered by your Pub/Sub topic.
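    * Specify a currently supported Python runtime (e.g., `python39` or `python310`; older runtimes are retired over time, so check the list of supported runtimes).
    * You do not need to allow unauthenticated invocations: the `--allow-unauthenticated` flag applies to HTTP-triggered functions, and a Pub/Sub-triggered function is invoked by the Pub/Sub service rather than by public HTTP requests.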
    * Show the `gcloud functions deploy` command and its successful output.

3.  **Test the Pipeline:**
    * Re-run your `publisher.py` script to send new messages to the Pub/Sub topic.
    * Go to the Cloud Functions console and navigate to your deployed function.
    * Check the "Logs" tab of your Cloud Function to verify that it was triggered and processed the messages successfully.
    * Provide a screenshot of the Cloud Function logs showing the processing of your messages.

4.  **Error Handling (Bonus - 5 Marks):**
    * Modify your Cloud Function to handle potential errors in message parsing or processing (e.g., with `try`/`except` blocks).
    * Demonstrate triggering an error (e.g., by sending a malformed message) and show how it appears in the Cloud Function logs.
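
A minimal sketch of `main.py`, covering the decoding, simulated processing, and bonus error handling described above, could look like this. It assumes the 1st gen `(event, context)` signature and a JSON payload with `document_id` and `file_path` keys. The names `process_document`, `parse_request`, and `SIMULATED_DELAY_SECONDS` are illustrative choices, not prescribed by the assignment.

```python
import base64
import json
import time

SIMULATED_DELAY_SECONDS = 0.1  # illustrative stand-in for real processing time


def parse_request(event):
    """Decode the base64 Pub/Sub payload and return (document_id, file_path)."""
    raw = base64.b64decode(event["data"]).decode("utf-8")
    payload = json.loads(raw)
    return payload["document_id"], payload["file_path"]


def process_document(event, context):
    """Entry point for a Pub/Sub-triggered function (1st gen signature)."""
    try:
        document_id, file_path = parse_request(event)
    except (KeyError, ValueError, TypeError) as exc:
        # Bad base64, invalid UTF-8, and invalid JSON all raise ValueError
        # subclasses; missing keys raise KeyError; a non-object payload
        # raises TypeError. The function logs the problem and returns
        # normally instead of raising.
        print(f"Skipping malformed message: {exc!r}")
        return

    # print() output goes to standard output, which Cloud Logging captures.
    print(f"Processing document {document_id} from {file_path}")
    time.sleep(SIMULATED_DELAY_SECONDS)
    print(f"Finished document {document_id}")
```

The sketch uses only the standard library, so `requirements.txt` can stay minimal (listing `functions-framework`, as suggested above, is enough for local testing). A deploy command consistent with this sketch might be `gcloud functions deploy process_document --runtime python310 --trigger-topic document-processing-requests --source document_processor_function`; depending on your `gcloud` version you may also need `--region` and `--no-gen2` to get a 1st gen function. To exercise the function locally, pass a dict such as `{"data": base64.b64encode(b'{"document_id": "doc-1", "file_path": "gs://my-bucket/documents/doc1.pdf"}')}` as `event` and `None` as `context`.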

In [None]:
# Full Python code for Cloud Function `main.py` and `requirements.txt`.
# gcloud command for deploying the Cloud Function.
# Screenshot of Cloud Function logs showing successful processing (and error handling if bonus applied).

## Part 4: Reflection and Clean-up (10 Marks)

1.  **Benefits of Event-Driven Architecture:**
    * Discuss the advantages of using Pub/Sub for invoking a document processing pipeline (e.g., decoupling, scalability, reliability, asynchronous processing).
    * How does this pattern help in building robust data pipelines?

2.  **Challenges Faced:**
    * Describe any challenges you encountered during this assignment (e.g., IAM permissions, message encoding/decoding, debugging Cloud Functions) and how you resolved them.

3.  **Clean Up Resources:**
    * After completing the assignment, delete the Cloud Function, the Pub/Sub topic, and any temporary subscription you created to avoid incurring unnecessary costs.
    * Provide the `gcloud pubsub topics delete` and `gcloud functions delete` commands.
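
The clean-up step can be sketched as a small helper that assembles the delete commands. The helper name `cleanup_commands` and the resource names passed to it are assumptions for illustration; substitute the names you actually used. The cell only prints the commands so that nothing is deleted by accident.

```python
import shlex


def cleanup_commands(topic_id, function_name, subscription_id=None):
    """Return gcloud delete commands as argv lists, function first."""
    commands = [
        # --quiet is gcloud's global flag for skipping confirmation prompts.
        ["gcloud", "functions", "delete", function_name, "--quiet"],
    ]
    if subscription_id:
        commands.append(
            ["gcloud", "pubsub", "subscriptions", "delete", subscription_id]
        )
    commands.append(["gcloud", "pubsub", "topics", "delete", topic_id])
    return commands


# Hypothetical resource names; replace with your own.
for argv in cleanup_commands(
    "document-processing-requests", "process_document", "temp-verify-sub"
):
    print("$", shlex.join(argv))
```

Deleting the function before the topic avoids leaving a function whose trigger topic no longer exists. Depending on how the function was deployed, `gcloud functions delete` may also need a `--region` flag.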

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Include all necessary Python files (`publisher.py`, `document_processor_function/main.py`, `document_processor_function/requirements.txt`).
* Provide `requirements.txt` files for both your local environment and the Cloud Function.
* Ensure your code is well-commented and easy to understand.
* All `gcloud` commands and screenshots should be clearly presented as requested.
* Make sure your pipeline is functional and demonstrable.