## HTML generation prototyping
Notebook for prototyping LLM-driven HTML slide generation from a presentation plan.

In [13]:
import sys
sys.path.append("../")  # make the parent directory importable
from llm.llmwrapper import LLM
from utils.utils import find_text_in_between_tags
from loguru import logger

In [28]:
llm = LLM(provider="gemini", model="gemini-2.5-flash")

INFO:root:LLM initialized
INFO:root:Provider: gemini
INFO:root:Model: gemini-2.5-flash


In [21]:
query = "Make a presentation about building a system which takes an internal company database of contract documents, and based on a user query, helps find the most relevant contract document"

In [22]:
pptx_plan = \
"""
<Slide 1 START>

**Title:** Revolutionizing Contract Management: A Cutting-Edge Search Solution
**Subtitle:** Unlocking the Power of Your Contract Data
**Your Name/Title:** [Your Name], Tech Consultant
**Company Logo/Date:** [Your Company Logo], [Date]

<Slide 1 END>

<Slide 2 START>

**Title:** Agenda

* The Challenge: Current State of Contract Management
* Our Solution: Intelligent Contract Search
* Technical Deep Dive
* Implementation and Timeline
* Human Resources and Expertise
* Cost and ROI
* Security and Compliance
* Future Enhancements and Scalability
* Q&A and Next Steps

<Slide 2 END>

<Slide 3 START>

**Title:** The Current State of Contract Management: Inefficiency and Risk

**Content:**

* **Manual Search:**  Employees spend countless hours manually searching through shared drives and databases for specific contracts or clauses.
* **Keyword Limitations:**  Traditional keyword search often fails to retrieve relevant documents due to variations in terminology and the complexity of legal language.
* **Version Control Issues:** Difficulty tracking different versions of contracts, leading to potential confusion and legal risks.
* **Missed Opportunities:**  Valuable insights buried within contract data remain untapped, hindering strategic decision-making.

**Visual:** A simple bar chart showing the average time spent searching for contracts per employee per week.  The chart should have two bars: "Current Manual Search" (showing a high value, e.g., 5 hours) and "Projected with Intelligent Search" (showing a significantly lower value, e.g., 30 minutes).

<Slide 3 END>

<Slide 4 START>

**Title:** The Cost of Inefficiency

**Content:**

* **Lost Productivity:** Quantify the cost of lost productivity due to inefficient contract management.  Example: "5 hours/week per employee * 100 employees * $50/hour = $25,000 lost per week."
* **Delayed Deal Closures:**  Explain how slow contract retrieval can delay deal closures and impact revenue.
* **Compliance Risks:**  Highlight the potential risks of non-compliance due to difficulty finding relevant clauses.
* **Competitive Disadvantage:**  Explain how inefficient contract management can put the company at a competitive disadvantage.

<Slide 4 END>

<Slide 5 START>

**Title:** Introducing Intelligent Contract Search

**Content:**

* **Overview:**  Introduce the proposed solution: a cutting-edge contract search system powered by AI.
* **Semantic Search:** Explain how semantic search understands the meaning and context of queries, going beyond keyword matching.
* **LLM-Powered Relevance:**  Highlight the use of LLMs to re-rank search results and ensure the most relevant contracts are presented to the user.

<Slide 5 END>

<Slide 6 START>

**Title:** How It Works: A Simplified View

**Visual:** A diagram illustrating the system's workflow:

1. **User Query:** A text box labeled "User Query"
2. **Embedding Generation:** An arrow pointing to a box labeled "Embedding Generation (Sentence Transformers)"
3. **Vector Database:** An arrow pointing to a cylinder labeled "Vector Database (Pinecone)"
4. **Similarity Search:** An arrow pointing to a box labeled "Similarity Search"
5. **LLM Re-ranking:** An arrow pointing to a box labeled "LLM Re-ranking"
6. **Results:** An arrow pointing to a box labeled "Relevant Contracts"

<Slide 6 END>

<Slide 7 START>

**Title:** Key Benefits of Intelligent Contract Search

**Content:**

* **Enhanced Accuracy:** Find the right contract every time with semantic search.
* **Time Savings:** Drastically reduce time spent searching for contracts.
* **Improved Risk Management:** Easily identify critical clauses and potential risks.
* **Scalability:** Handles growing contract volumes efficiently.
* **Competitive Advantage:** Gain valuable insights from contract data.

<Slide 7 END>

<Slide 8 START>

**Title:** System Architecture

**Visual:** A more detailed architecture diagram:

1. **User Interface:** A box labeled "User Interface"
2. **API Gateway:** A box labeled "API Gateway"
3. **Search Service:** A box labeled "Search Service (Lambda)"
4. **Embedding Service:** A box labeled "Embedding Service (SageMaker)"
5. **Vector Database:** A cylinder labeled "Vector Database (Pinecone)"
6. **LLM Service:** A box labeled "LLM Service (Bedrock)"
7. **Document Store:** A cylinder labeled "Document Store (S3)"

Arrows should connect the components in a logical flow, starting from the User Interface and ending at the Document Store.

<Slide 8 END>

<Slide 9 START>

**Title:** Technology Stack

**Content:**

* **Document Preprocessing:** AWS Textract (OCR), Lambda (cleaning and normalization)
* **Embedding Generation:** Sentence Transformers (all-mpnet-base-v2) hosted on AWS SageMaker
* **Vector Database:** Pinecone
* **LLM Re-ranking:** Amazon Bedrock
* **Search Indexing (Hybrid Approach):**  Elasticsearch (keyword search) combined with Pinecone (vector search)
* **Cloud Platform:** AWS (S3, Lambda, API Gateway, etc.)

**Justification:**  Briefly explain the rationale behind choosing these technologies, emphasizing performance, scalability, and cost-effectiveness.

<Slide 9 END>

<Slide 10 START>

**Title:** Implementation Timeline

**Visual:** A Gantt chart illustrating the project timeline, divided into phases:

* **Phase 1: Data Preprocessing and Embedding Generation (2 months)**
* **Phase 2: Vector Database Setup and Indexing (1 month)**
* **Phase 3: Search Service Development and Integration (2 months)**
* **Phase 4: User Interface Development and Testing (1 month)**
* **Phase 5: Deployment and Training (1 month)**

Each phase should have a start and end date, and the dependencies between phases should be clearly visualized.

<Slide 10 END>

<Slide 11 START>

**Title:** Implementation Approach

* **Phased Rollout:**  Explain the phased approach to implementation, starting with a pilot group and gradually expanding to the entire organization.
* **Iterative Development:** Emphasize the iterative nature of the development process, allowing for flexibility and incorporating user feedback.

<Slide 11 END>


<Slide 12 START>

**Title:** Human Resources and Expertise

**Content:**

* **Team:** Data Scientists (2), Software Engineers (3), Cloud Architect (1), Project Manager (1)
* **Existing Expertise:**  Mention any relevant in-house expertise that can be leveraged.  Example: "Our existing cloud infrastructure team has extensive experience with AWS."
* **Training and Support:**  "We will provide comprehensive training to your team on using the new system."

<Slide 12 END>

<Slide 13 START>

**Title:** Cost Breakdown

**Visual:** A pie chart showing the cost breakdown:

* **Software Licenses (e.g., Pinecone):** 20%
* **Cloud Infrastructure (AWS):** 30%
* **Development Costs:** 40%
* **Ongoing Maintenance:** 10%

Provide specific cost estimates for each category.

<Slide 13 END>

<Slide 14 START>

**Title:** Return on Investment (ROI)

**Content:**

* **Time Savings:**  Quantify the expected time savings and translate it into monetary value. Example: "Reducing search time by 4 hours per week per employee can save $20,000 per week."
* **Improved Efficiency:**  Explain how faster contract retrieval can improve overall efficiency and productivity.
* **Reduced Risk:**  Quantify the potential cost savings from reduced compliance risks.
* **Projected ROI:**  "Based on our estimates, we project a 200% ROI within the first year."

<Slide 14 END>

<Slide 15 START>

**Title:** Security and Compliance

**Content:**

* **Data Encryption:**  "All contract data will be encrypted both in transit and at rest."
* **Access Controls:**  "Strict access controls will be implemented to ensure only authorized personnel can access sensitive contract data."
* **Compliance:**  "The system will be compliant with all relevant regulations, including GDPR and CCPA."

<Slide 15 END>

<Slide 16 START>

**Title:** Future Enhancements and Scalability

**Content:**

* **Future Enhancements:**  "Potential future enhancements include integration with other systems, such as CRM and contract lifecycle management software."
* **Scalability:**  "The system is designed to scale seamlessly to handle future growth in contract volume and user demand."

<Slide 16 END>

<Slide 17 START>

**Title:** Q&A and Next Steps

**Content:**

* **Q&A:** Open the floor for questions.
* **Next Steps:**  "We propose a follow-up meeting to discuss the proposal in more detail and answer any remaining questions."

<Slide 17 END>
"""

### Step 1: Extracting slide content

In [23]:
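def extract_flags(text: str, startflag: str = '<', endflag: str = '>') -> list:
    """Return every `startflag...endflag` marker in `text`, in document order."""
    flags = []
    start = 0
    while True:
        start = text.find(startflag, start)
        if start == -1:
            break
        end = text.find(endflag, start)
        if end == -1:
            break

        # Keep the delimiters so each flag can be located in the text again.
        flags.append(text[start:end + len(endflag)])
        start = end + len(endflag)

    # Flags are consumed as (START, END) pairs, so an odd count means one is missing.
    if len(flags) % 2 != 0:
        raise ValueError("Uneven number of separation flags. Start and End flags come in pairs")

    return flags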
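
Because `extract_flags` treats any `<...>` span as a flag, angle brackets inside the slide text itself (an HTML snippet, for instance) would also be collected. A stricter variant is sketched below as an assumption, not as part of the notebook: the name `extract_slide_flags` and the pattern are hypothetical, and it only accepts `<Slide N START>` / `<Slide N END>` markers that alternate correctly.

```python
import re

# Hypothetical stricter matcher: only "<Slide N START>" / "<Slide N END>" count as flags.
SLIDE_FLAG = re.compile(r"<Slide (\d+) (START|END)>")

def extract_slide_flags(text: str) -> list:
    """Return slide markers in order, checking that they alternate START/END."""
    flags = []
    expected = "START"
    for match in SLIDE_FLAG.finditer(text):
        if match.group(2) != expected:
            raise ValueError(f"Expected a {expected} flag but found {match.group(0)!r}")
        flags.append(match.group(0))
        expected = "END" if expected == "START" else "START"
    if expected == "END":
        raise ValueError("The last START flag has no matching END flag")
    return flags

# The <div> in the body is ignored because it is not a slide marker.
sample = "<Slide 1 START>\nUse a <div> here\n<Slide 1 END>"
print(extract_slide_flags(sample))  # ['<Slide 1 START>', '<Slide 1 END>']
```

The returned list has the same shape as the output of `extract_flags`, so it could be used in the pairing loop without further changes.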

In [24]:
flags = extract_flags(pptx_plan)
slide_content = {}
# Flags alternate START, END: walk them in pairs and number the slides from 1.
for slidenum, i in enumerate(range(0, len(flags), 2), start=1):
    slide_content[f"slide_{slidenum}"] = find_text_in_between_tags(pptx_plan, flags[i], flags[i + 1])
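
`find_text_in_between_tags` is imported from `utils.utils` and its source is not shown in this notebook. Judging by the `slide_content` output, it appears to return the text between two markers with surrounding whitespace removed. The sketch below is a stand-in written under that assumption; the name `text_between` is hypothetical and the real helper may behave differently.

```python
def text_between(text: str, start_tag: str, end_tag: str) -> str:
    """Hypothetical stand-in: text between start_tag and the next end_tag, stripped."""
    begin = text.index(start_tag) + len(start_tag)  # raises ValueError if the tag is absent
    end = text.index(end_tag, begin)
    return text[begin:end].strip()

plan = (
    "<Slide 1 START>\n\n**Title:** Agenda\n\n<Slide 1 END>\n"
    "<Slide 2 START>\nBody\n<Slide 2 END>"
)
flags = ["<Slide 1 START>", "<Slide 1 END>", "<Slide 2 START>", "<Slide 2 END>"]

# Pair each START flag with the END flag that follows it.
slides = {
    f"slide_{n}": text_between(plan, start, end)
    for n, (start, end) in enumerate(zip(flags[0::2], flags[1::2]), start=1)
}
print(slides)  # {'slide_1': '**Title:** Agenda', 'slide_2': 'Body'}
```

Slicing the flag list with `[0::2]` and `[1::2]` is one way to express the START/END pairing without index arithmetic.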

In [25]:
slide_content

{'slide_1': '**Title:** Revolutionizing Contract Management: A Cutting-Edge Search Solution\n**Subtitle:** Unlocking the Power of Your Contract Data\n**Your Name/Title:** [Your Name], Tech Consultant\n**Company Logo/Date:** [Your Company Logo], [Date]',
 'slide_2': '**Title:** Agenda\n\n* The Challenge: Current State of Contract Management\n* Our Solution: Intelligent Contract Search\n* Technical Deep Dive\n* Implementation and Timeline\n* Human Resources and Expertise\n* Cost and ROI\n* Security and Compliance\n* Future Enhancements and Scalability\n* Q&A and Next Steps',
 'slide_3': '**Title:** The Current State of Contract Management: Inefficiency and Risk\n\n**Content:**\n\n* **Manual Search:**  Employees spend countless hours manually searching through shared drives and databases for specific contracts or clauses.\n* **Keyword Limitations:**  Traditional keyword search often fails to retrieve relevant documents due to variations in terminology and the complexity of legal language.

### Step 2: Generating the title slide

In [35]:
title_slide_prompt = \
f"""
You are a tech consultant, and you have been given the following request:

"{query}"

You are trying to create a set of slides for a proposal.
The first slide you want to create is the title slide.
Generate the title slide in HTML.

Take into consideration the following points:
- Choose a style that is both visually appealing and functional; befitting of a proposal from a top-tier tech consulting company.
- What colour and design would be appropriate, especially for the background?
- What font type should you use?
- What should the size of the page be, to accurately reflect powerpoint slides?
- What size should the font be, so that it fits on the slide?

This slide will become a template master slide which will define the style of the following slides, so design this slide with great care.
Do not output any other text other than the html itself.
If your slides are visually appealing but also functional, you will be rewarded with a bonus.

The information that should be included on this slide is as follows:
{slide_content['slide_1']}
"""
print(title_slide_prompt)


You are a tech consultant, and you have been given the following request:

"Make a presentation about building a system which takes an internal company database of contract documents, and based on a user query, helps find the most relevant contract document"

You are trying to create a set of slides for a proposal.
The first slide you want to create is the title slide.
Generate the title slide in HTML.

Take into consideration the following points:
- Choose a style that is both visually appealing and functional; befitting of a proposal from a top-tier tech consulting company.
- What colour and design would be appropriate, especially for the background?
- What font type should you use?
- What should the size of the page be, to accurately reflect powerpoint slides?
- What size should the font be, so that it fits on the slide?

This slide will become a template master slide which will define the style of the following slides, so design this slide with great care.
Do not output any other 

In [36]:
title_slide_html_response = llm.call(query=title_slide_prompt)['text']
print(title_slide_html_response)

INFO:google_genai.models:AFC is enabled with max remote calls: 10.
INFO:httpx:HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent "HTTP/1.1 200 OK"
INFO:google_genai.models:AFC remote call 1 is done.


```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Revolutionizing Contract Management - Title Slide</title>
    <!-- Google Fonts - Inter for a modern, professional look -->
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700&display=swap" rel="stylesheet">
    <style>
        /* CSS Variables for easy customization and template master definition */
        :root {
            --primary-bg-start: #0A192F; /* Deep Navy Blue - Professional and authoritative */
            --primary-bg-end: #1A2A40;   /* Slightly lighter dark blue for subtle gradient */
            --text-color-light: #E0E6ED; /* Off-white/light gray for high contrast on dark background */
            --text-color-accent: #66B2FF; /* A professional light blue for accents and highlights */
            --font-family-primary: 'Inter', sans-serif; /* Modern, clean sans-serif font */
  

In [37]:
slide_review_prompt = \
"""
You are a senior front-end software engineer reviewing a junior engineer's work.
They have written some HTML that is supposed to show one slide of a PowerPoint presentation.

You have been provided with the HTML code and also a rendering of the code as an image.

Please check that:
- Components are correctly aligned within the page
- Text does not exceed the height and width of the page
- Text that is wrapped by a component does not escape the boundaries of the component

If there are any changes that need to be made (e.g. reduction in font size etc.), only output the improved HTML code.
If the code meets all of the criteria, output only the word "OK" without the quotation marks.

The HTML code is provided below:
{code}
"""
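
One way the review prompt could be consumed is sketched here, under stated assumptions. The `{code}` placeholder is filled with `str.format`, and the reply is read as either `OK` or replacement HTML. `ask_model` is a hypothetical callable standing in for the LLM call (image input is left out because the wrapper's interface for images is not shown), `REVIEW_TEMPLATE` is a shortened stand-in for `slide_review_prompt`, and `strip_code_fence` is an assumption motivated by the title-slide response, which arrived wrapped in a Markdown `html` code fence.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3  # Markdown code-fence delimiter
REVIEW_TEMPLATE = "Review this slide.\n\nThe HTML code is provided below:\n{code}\n"

def strip_code_fence(reply: str) -> str:
    """Drop a surrounding Markdown code fence, if the model added one."""
    pattern = FENCE + r"[\w-]*\n(.*?)\n?" + FENCE
    match = re.fullmatch(pattern, reply.strip(), flags=re.DOTALL)
    return match.group(1) if match else reply.strip()

def review_slide(html_txt: str, ask_model: Callable[[str], str],
                 template: str = REVIEW_TEMPLATE) -> Optional[str]:
    """Return None when the reviewer answers OK, otherwise the revised HTML."""
    # str.format only parses braces in the template, so CSS braces in html_txt are safe.
    prompt = template.format(code=html_txt)
    reply = strip_code_fence(ask_model(prompt))
    return None if reply == "OK" else reply

# Stub models for illustration only; a real call would go to the LLM.
html = "<style>body { margin: 0; }</style><h1>Title</h1>"
print(review_slide(html, lambda prompt: "OK"))  # None
fenced = FENCE + "html\n<h1>Smaller title</h1>\n" + FENCE
print(review_slide(html, lambda prompt: fenced))  # <h1>Smaller title</h1>
```

Returning `None` for an approved slide lets a caller loop until the reviewer stops proposing changes; a cap on the number of rounds would be a sensible addition.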

In [None]:
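def reviewer(review_prompt, html_txt, html_image):
    """Review one slide's HTML alongside its rendered image (body not written yet)."""
    raise NotImplementedError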