#### 1. Setup: Imports and API Configuration

In [1]:
import os
import io
import re
import json
import time
from collections import defaultdict
from dotenv import load_dotenv
from PIL import Image # For image handling with OCR

# --- Import Google Generative AI ---
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold
from google.api_core import exceptions as google_exceptions

# --- Import PDF to Image Library ---
from pdf2image import convert_from_path
from pdf2image.exceptions import PDFPageCountError, PDFSyntaxError

# --- Configure Logging ---
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')



In [2]:
# --- Load API Key ---
load_dotenv()
google_api_key = os.getenv("GOOGLE_API_KEY")
genai_configured = False

if not google_api_key:
    logging.warning("Google API key not found in .env file. Gemini features will fail.")
else:
    try:
        genai.configure(api_key=google_api_key)
        genai_configured = True
        logging.info("Google Generative AI client configured successfully.")
    except Exception as e:
        logging.error(f"Failed to configure Google Generative AI: {e}")

2025-03-30 18:07:19,596 - INFO - Google Generative AI client configured successfully.


#### 2. Configuration

In [3]:
# --- LLM Models ---
LLM_VISION_MODEL = "gemini-1.5-pro-latest"
LLM_TEXT_MODEL = "gemini-1.5-flash-latest"

In [4]:
# --- Analysis Configuration ---
TARGET_SECTIONS = [
    "Problem",
    "Solution",
    "Market Size",
    "Business Model",
    "Financial Projections", 
    "Team"
]

SECTION_WEIGHTS = {
    "Problem": 20,
    "Solution": 20,
    "Market Size": 13,
    "Business Model": 13,
    "Financial Projections": 14,
    "Team": 20,
}
logging.info(f"Analysis focused on {len(TARGET_SECTIONS)} sections with weights summing to {sum(SECTION_WEIGHTS.values())}.")

2025-03-30 18:07:20,909 - INFO - Analysis focused on 6 sections with weights summing to 100.


In [5]:
SCORING_CRITERIA = {
    "Problem": """
    Evaluate the 'Problem' section based on:
    1. Clarity (0-25): Is the problem statement clear, concise, and easy to understand?
    2. Magnitude & Significance (0-35): Is the problem significant? Is it painful? Does it affect a large enough audience or have substantial impact? Is this backed by credible data/evidence?
    3. Urgency (0-20): Is there a compelling reason why this problem needs to be solved *now*? Market timing?
    4. Target Audience Definition (0-20): Is the specific audience facing this problem clearly identified and profiled?
    Score (0-100): Provide a single integer score.
    Justification: Briefly explain the score (2-3 sentences), mentioning key strengths/weaknesses based on the criteria.
    """,
    "Solution": """
    Evaluate the 'Solution' section based on:
    1. Clarity & Conciseness (0-25): Is the solution clearly explained and easy to grasp?
    2. Problem Fit (0-35): Does the solution directly and effectively address the identified problem?
    3. Value Proposition (0-25): Is the unique value for the customer clearly articulated? What makes it compelling and differentiated?
    4. Feasibility & Scalability (0-15): Does the solution seem technically feasible? Does it have potential to scale? (Acknowledge early stage limitations).
    Score (0-100): Provide a single integer score.
    Justification: Briefly explain the score (2-3 sentences).
    """,
     "Market Size": """
    Evaluate the 'Market Size' section based on:
    1. Market Definition (0-30): Is the target market clearly defined using standard terms (TAM, SAM, SOM)? Is the definition logical?
    2. Data & Sources (0-30): Is the market size supported by credible, recent data and sources? Are sources cited?
    3. Realism & Focus (0-30): Is the estimation, especially for SOM (Serviceable Obtainable Market), realistic and justifiable for the specific target segment the startup can reach initially?
    4. Growth Potential (0-10): Is there an indication of market growth trends?
    Score (0-100): Provide a single integer score.
    Justification: Briefly explain the score (2-3 sentences).
    """,
    "Business Model": """
    Evaluate the 'Business Model' section based on:
    1. Clarity of Revenue Streams (0-30): Is the primary way the company makes money clearly explained (e.g., subscription, freemium, transaction fees, licensing)? Are revenue streams distinct?
    2. Pricing Strategy (0-25): Is the pricing logical, justified, and aligned with the value proposition?
    3. Profitability Path (0-30): Does the model demonstrate a clear potential path to profitability? Are key unit economics (e.g., COGS, Gross Margin, implied LTV/CAC) considered or plausible?
    4. Scalability (0-15): Can the business model support significant growth?
    Score (0-100): Provide a single integer score.
    Justification: Briefly explain the score (2-3 sentences).
    """,
    "Financial Projections": """
    Evaluate the 'Financial Projections' section based on:
    1. Clarity & Key Metrics (0-25): Are the projections presented clearly? Does it include key financial metrics (revenue, key costs, burn rate, funding needs)?
    2. Assumptions & Realism (0-35): Are the underlying assumptions stated or clearly implied? Are they realistic and grounded in the business model, market size, and GTM strategy? Avoid overly optimistic hockey sticks without justification.
    3. Time Horizon & Detail (0-20): Does it cover a reasonable time period (e.g., 3-5 years)? Is the level of detail appropriate for the stage?
    4. Link to Funding Ask (0-20): Are the projections linked to the amount of funding being requested ('Use of Funds') and the milestones it enables?
    Score (0-100): Provide a single integer score.
    Justification: Briefly explain the score (2-3 sentences).
    """,
    "Team": """
    Evaluate the 'Team' section based on:
    1. Founder/Core Team Relevance & Experience (0-40): Do the key team members have highly relevant experience, skills, and domain expertise for this specific venture? Is there founder-market fit?
    2. Completeness & Roles (0-25): Are key roles covered for the current stage? Are critical gaps acknowledged (implicitly or explicitly)?
    3. Execution Ability & Passion (Deduced) (0-20): Does the presentation convey the team's ability to execute and their passion/commitment? (Partially subjective).
    4. Advisors/Board (if applicable) (0-15): Are advisors relevant and do they add significant strategic value? Is their involvement clear?
    Score (0-100): Provide a single integer score.
    Justification: Briefly explain the score (2-3 sentences).
    """,
}

In [6]:
# --- API Call Configuration ---
LLM_TEMPERATURE = 0.3
LLM_MAX_TOKENS_OCR = 4096
LLM_MAX_TOKENS_SECTION_ID = 500
LLM_MAX_TOKENS_SCORING = 350
LLM_MAX_TOKENS_FEEDBACK = 1000 
LLM_REQUEST_TIMEOUT = 180
LLM_RETRY_ATTEMPTS = 3
LLM_RETRY_DELAY = 5

In [7]:
# --- Rate Limit Handling Configuration ---
INTER_PAGE_DELAY = 10 
RATE_LIMIT_BACKOFF_MULTIPLIER = 15

In [8]:
# --- Gemini Safety Settings ---
SAFETY_SETTINGS_TEXT = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}
SAFETY_SETTINGS_VISION = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

#### 3. Core Functions

##### 3.1 Gemini API Call for Vision (OCR)

In [9]:
def call_gemini_vision_ocr(image_bytes, mime_type="image/jpeg"):
    if not genai_configured: return None
    try:
        model = genai.GenerativeModel(LLM_VISION_MODEL)
        generation_config = genai.GenerationConfig(temperature=0.1, max_output_tokens=LLM_MAX_TOKENS_OCR)
        prompt = "Extract all text visible in this image. Provide only the text content, maintaining layout if possible (e.g., using line breaks)."
        payload = [prompt, {"mime_type": mime_type, "data": image_bytes}]
        for attempt in range(LLM_RETRY_ATTEMPTS):
            try:
                logging.debug(f"Calling Gemini Vision (Attempt {attempt + 1}/{LLM_RETRY_ATTEMPTS})")
                response = model.generate_content(contents=payload, generation_config=generation_config, safety_settings=SAFETY_SETTINGS_VISION, request_options={'timeout': LLM_REQUEST_TIMEOUT})
                if not response.parts:
                    block_reason = response.prompt_feedback.block_reason if response.prompt_feedback else "Unknown"
                    logging.warning(f"Gemini Vision response blocked/empty. Reason: {block_reason}")
                    return None
                return response.text.strip()
            except (google_exceptions.InvalidArgument, ValueError) as e:
                # Non-retryable. This clause must come before GoogleAPIError,
                # because InvalidArgument is a subclass of it and would otherwise be retried.
                logging.error(f"Non-retryable Gemini error (InvalidArgument/ValueError): {e}")
                return None
            except google_exceptions.GoogleAPIError as e:
                error_str = str(e).lower()
                is_rate_limit_error = "quota" in error_str or "rate limit" in error_str or "429" in error_str
                suggested_delay_parsed = None
                try:
                    # Best effort: look for a server-suggested retry delay in the error metadata.
                    if hasattr(e, 'metadata'):
                        for item in e.metadata:
                            if isinstance(item, tuple) and 'retry-delay' in item[0]:
                                delay_match = re.search(r'seconds:\s*(\d+)', str(item[1]))
                                if delay_match: suggested_delay_parsed = int(delay_match.group(1))
                except Exception:
                    pass
                if is_rate_limit_error:
                    base_wait = RATE_LIMIT_BACKOFF_MULTIPLIER * (attempt + 1)
                    wait_time = max(suggested_delay_parsed + 2 if suggested_delay_parsed else 0, base_wait)
                    logging.warning(f"Gemini API rate limit/quota error (Attempt {attempt + 1}/{LLM_RETRY_ATTEMPTS}). Waiting {wait_time:.2f}s...")
                else:
                    wait_time = LLM_RETRY_DELAY * (2 ** attempt)
                    logging.warning(f"Gemini API error (Attempt {attempt + 1}/{LLM_RETRY_ATTEMPTS}): {e}. Waiting {wait_time:.2f}s...")
                if attempt == LLM_RETRY_ATTEMPTS - 1:
                    logging.error(f"Gemini Vision call failed after retries (final error: {e}).")
                    return None
                time.sleep(wait_time)
            except Exception as e:
                logging.error(f"Unexpected error during Gemini Vision call (Attempt {attempt + 1}): {e}")
                if attempt < LLM_RETRY_ATTEMPTS - 1: wait_time = LLM_RETRY_DELAY * (2 ** attempt); logging.warning(f"Retrying unexpected error in {wait_time:.2f}s..."); time.sleep(wait_time)
                else: return None
        return None
    except Exception as model_init_error: logging.error(f"Failed to initialize Gemini model '{LLM_VISION_MODEL}': {model_init_error}"); return None
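
The retry loop above chooses its wait time from two schedules: linear backoff for rate-limit errors and exponential backoff for other API errors. The sketch below isolates that arithmetic in a hypothetical helper, `compute_wait_time`, which is not part of the notebook. It reuses the values of `LLM_RETRY_DELAY` and `RATE_LIMIT_BACKOFF_MULTIPLIER` from the configuration cells and is meant only to make the schedule easy to inspect.

```python
# Values copied from the configuration cells of this notebook.
LLM_RETRY_DELAY = 5
RATE_LIMIT_BACKOFF_MULTIPLIER = 15


def compute_wait_time(attempt, is_rate_limit, suggested_delay=None):
    """Hypothetical helper mirroring the wait-time logic of the retry loop.

    attempt is 0-based; suggested_delay is a server hint in seconds, if any.
    """
    if is_rate_limit:
        # Linear backoff, but never shorter than the server hint plus a 2s margin.
        base_wait = RATE_LIMIT_BACKOFF_MULTIPLIER * (attempt + 1)
        hinted = suggested_delay + 2 if suggested_delay else 0
        return max(hinted, base_wait)
    # Any other API error uses plain exponential backoff.
    return LLM_RETRY_DELAY * (2 ** attempt)


generic_schedule = [compute_wait_time(a, False) for a in range(3)]
rate_limit_schedule = [compute_wait_time(a, True) for a in range(3)]
print(generic_schedule)     # [5, 10, 20]
print(rate_limit_schedule)  # [15, 30, 45]
```

With a server hint of 40 seconds on the first attempt, the rate-limit branch would wait 42 seconds rather than 15. Note that the loop returns before sleeping on the final attempt, so the last value in each schedule is only logged.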


##### 3.2 Text Extraction

In [10]:
def extract_text_with_ocr_and_gemini(pdf_path):
    pages_text = []; logging.info(f"Starting PDF processing with Gemini OCR for: {pdf_path}")
    if not genai_configured: return None
    try:
        logging.info("Converting PDF to images..."); start_conv = time.time(); poppler_path_to_use = None
        images = convert_from_path(pdf_path, dpi=200, poppler_path=poppler_path_to_use)
        logging.info(f"Converted {len(images)} pages in {time.time() - start_conv:.2f}s.")
        for i, image in enumerate(images):
            page_num = i + 1; logging.info(f"Processing page {page_num}/{len(images)} with Gemini Vision OCR..."); start_ocr = time.time()
            img_byte_arr = io.BytesIO(); image.save(img_byte_arr, format='JPEG', quality=90)
            extracted_text = call_gemini_vision_ocr(img_byte_arr.getvalue(), "image/jpeg")
            pages_text.append(extracted_text if extracted_text else "")
            if not extracted_text: logging.warning(f"Failed/empty OCR for page {page_num}.")
            logging.info(f"Page {page_num} OCR took {time.time() - start_ocr:.2f}s.")
            if page_num < len(images): logging.info(f"Waiting {INTER_PAGE_DELAY}s before next page..."); time.sleep(INTER_PAGE_DELAY)
        logging.info("Completed Gemini OCR processing."); return pages_text
    except ImportError: logging.error("pdf2image or Pillow not installed."); return None
    except FileNotFoundError: logging.error(f"PDF file not found: {pdf_path}"); return None
    except (PDFPageCountError, PDFSyntaxError) as pdf_err: logging.error(f"PDF conversion error: {pdf_err}. Ensure Poppler is installed/accessible."); return None
    except Exception as e: logging.error(f"Unexpected error during PDF/OCR processing: {e}"); return None


##### 3.3 Gemini API Call for Text Analysis

In [11]:
def call_gemini_text(prompt, max_output_tokens=1000, temperature=LLM_TEMPERATURE):
    if not genai_configured: return None
    try:
        model = genai.GenerativeModel(LLM_TEXT_MODEL)
        generation_config = genai.GenerationConfig(temperature=temperature, max_output_tokens=max_output_tokens)
        for attempt in range(LLM_RETRY_ATTEMPTS):
            try:
                logging.debug(f"Calling Gemini Text (Attempt {attempt + 1}/{LLM_RETRY_ATTEMPTS})")
                response = model.generate_content(contents=prompt, generation_config=generation_config, safety_settings=SAFETY_SETTINGS_TEXT, request_options={'timeout': LLM_REQUEST_TIMEOUT})
                if not response.parts:
                     block_reason = response.prompt_feedback.block_reason if response.prompt_feedback else "Unknown"; logging.warning(f"Gemini Text response blocked/empty. Reason: {block_reason}"); return None
                return response.text.strip()
            except (google_exceptions.InvalidArgument, ValueError) as e:
                # Non-retryable; must precede GoogleAPIError because InvalidArgument subclasses it.
                logging.error(f"Non-retryable Gemini Text error (InvalidArgument/ValueError): {e}"); return None
            except google_exceptions.GoogleAPIError as e:
                wait_time = LLM_RETRY_DELAY * (2 ** attempt)
                logging.warning(f"Gemini Text API error (Attempt {attempt + 1}/{LLM_RETRY_ATTEMPTS}): {e}. Waiting {wait_time:.2f}s...")
                if attempt == LLM_RETRY_ATTEMPTS - 1:
                    logging.error(f"Gemini Text call failed after retries (final error: {e})."); return None
                time.sleep(wait_time)
            except Exception as e:
                logging.error(f"Unexpected error during Gemini Text call (Attempt {attempt + 1}): {e}")
                if attempt < LLM_RETRY_ATTEMPTS - 1: wait_time = LLM_RETRY_DELAY * (2 ** attempt); logging.warning(f"Retrying unexpected error in {wait_time:.2f}s..."); time.sleep(wait_time)
                else: return None
        return None
    except Exception as model_init_error: logging.error(f"Failed to initialize Gemini model '{LLM_TEXT_MODEL}': {model_init_error}"); return None

##### 3.4 Text Preprocessing

In [12]:
def preprocess_text(text):
    """Lowercase the text and collapse every run of whitespace into a single space."""
    if not text: return ""
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text
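
As a quick illustration of what this normalization does to OCR output, the sketch below repeats the same logic so that it runs on its own. The sample string is made up for the example.

```python
import re


def preprocess_text(text):
    # Same logic as the notebook's function, repeated so this example is standalone.
    if not text:
        return ""
    return re.sub(r"\s+", " ", text.lower()).strip()


sample = "  Market   SIZE\n\nTAM: $4.2B  "  # made-up OCR fragment
print(preprocess_text(sample))      # market size tam: $4.2b
print(repr(preprocess_text(None)))  # ''
```

Line breaks and letter case are both discarded, so acronyms such as TAM reach the scoring prompt in lowercase.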


##### 3.5 Section Identification

In [13]:
def identify_sections_llm(pages_text):
    """Uses Gemini Text LLM to identify page numbers ONLY for TARGET_SECTIONS.""" 
    logging.info(f"Starting section identification focused on: {', '.join(TARGET_SECTIONS)}") 
    if not pages_text or not any(pages_text): logging.warning("No text provided for section identification."); return {}

    formatted_pages = []
    for i, text in enumerate(pages_text):
        max_len_per_page = 400
        processed_text_for_prompt = text
        if len(processed_text_for_prompt) > max_len_per_page: processed_text_for_prompt = processed_text_for_prompt[:max_len_per_page//2] + "..." + processed_text_for_prompt[-max_len_per_page//2:]
        formatted_pages.append(f"--- Page {i + 1} ---\n{processed_text_for_prompt}\n")
    full_text_for_prompt = "\n".join(formatted_pages)

    prompt = f"""
    **Your Role:** Expert pitch deck analyst. Map content to specific key sections.
    **Instructions:** Analyze the pitch deck text. Identify page number(s) for ONLY these key sections: {', '.join(TARGET_SECTIONS)}. Output ONLY a valid JSON mapping exact section names (from the list) to lists of page numbers (e.g., [1, 2]). Omit sections not found or not in the target list. No explanations or markdown.
    **Example Output:** {{"Problem": [2, 3], "Solution": [4], "Team": [10]}}
    ---
    **Pitch Deck Text:**
    {full_text_for_prompt}
    ---
    **JSON Output:**
    """
    response = call_gemini_text(prompt, max_output_tokens=LLM_MAX_TOKENS_SECTION_ID)
    if not response: logging.error("Section ID LLM call failed."); return {}

    try: 
        json_match = re.search(r'```json\s*(\{.*?\})\s*```', response, re.DOTALL | re.IGNORECASE)
        if not json_match: json_match = re.search(r'\{.*\}', response, re.DOTALL)
        if not json_match: raise json.JSONDecodeError("No JSON object found", response, 0)
        json_string = json_match.group(1) if len(json_match.groups()) == 1 else json_match.group(0)
        identified_sections_pages = json.loads(json_string)
        logging.info(f"LLM raw identification: {identified_sections_pages}")

        
        validated_sections = {}
        for section, pages in identified_sections_pages.items():
            if section in TARGET_SECTIONS and isinstance(pages, list):
                valid_pages = [p - 1 for p in pages if isinstance(p, int) and 0 < p <= len(pages_text)]
                if valid_pages: validated_sections[section] = sorted(list(set(valid_pages)))
            else:
                logging.warning(f"LLM identified section '{section}' which is not in TARGET_SECTIONS or has invalid format. Skipping.")
        logging.info(f"Validated section page indices (0-based): {validated_sections}")
        return validated_sections
    except json.JSONDecodeError as e: logging.error(f"Failed to parse JSON for section ID: {e}. Response: {response}"); return {}
    except Exception as e: logging.error(f"Error processing section ID response: {e}"); return {}
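
The JSON-extraction pattern used here (look for a fenced `json` block first, then fall back to the outermost braces) also appears in the scoring and feedback steps. One possible way to factor it out is sketched below. `extract_json_object` is a hypothetical helper, not something the notebook defines, and it returns `None` instead of raising so that callers can choose their own fallback.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this listing readable


def extract_json_object(response):
    """Return the first JSON object found in an LLM reply, or None (hypothetical helper)."""
    # Prefer a fenced json block if the model wrapped its answer in one.
    fenced = re.search(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, response,
                       re.DOTALL | re.IGNORECASE)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Otherwise take everything from the first "{" to the last "}".
        bare = re.search(r"\{.*\}", response, re.DOTALL)
        candidate = bare.group(0) if bare else None
    if candidate is None:
        return None
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


plain_reply = 'Here is the mapping: {"Problem": [2, 3], "Team": [10]}'
fenced_reply = FENCE + 'json\n{"score": 70, "justification": "ok"}\n' + FENCE
print(extract_json_object(plain_reply))
print(extract_json_object(fenced_reply))
print(extract_json_object("no JSON here"))
```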

##### 3.6 Aggregate Section Content

In [14]:
def aggregate_section_content(pages_text, section_pages_map):
    section_content = {};
    if not section_pages_map: return {}
    logging.info("Aggregating text content for identified sections...")
    for section, page_indices in section_pages_map.items():
        content = "".join(pages_text[idx] + "\n\n" for idx in page_indices if 0 <= idx < len(pages_text))
        section_content[section] = preprocess_text(content) # Preprocess before scoring
    logging.info(f"Aggregated content for sections: {list(section_content.keys())}")
    return section_content
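
The section map passes through two conventions: the LLM reports 1-based page numbers, while aggregation indexes `pages_text` with 0-based indices. The sketch below walks through that conversion and the join on made-up data. `to_zero_based` is a hypothetical name for the filtering that `identify_sections_llm` performs inline.

```python
def to_zero_based(pages, page_count):
    """Keep valid 1-based page numbers and return them as sorted 0-based indices."""
    return sorted({p - 1 for p in pages if isinstance(p, int) and 0 < p <= page_count})


# Illustrative page texts, not taken from any real deck.
pages_text = ["Cover", "The problem today", "More on the problem", "Our solution"]

# Duplicates, out-of-range pages, and non-integers are dropped.
indices = to_zero_based([3, 2, 2, 9, "4"], len(pages_text))
print(indices)  # [1, 2]

# Join the selected pages the same way aggregate_section_content does.
content = "".join(pages_text[i] + "\n\n" for i in indices)
print(repr(content))
```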

##### 3.7 Score Sections 

In [15]:
def score_sections_llm(section_content):
    logging.info("Starting section scoring...")
    section_scores = {};
    if not section_content: return {}
    scoring_instructions = """**Your Role:** Expert pitch deck analyst. Evaluate section text based *only* on given criteria. **Instructions:** Provide numerical score (0-100) and brief justification (2-3 sentences). Format output ONLY as valid JSON: {"score": integer, "justification": string}. No explanations/markdown."""
    for section, text in section_content.items():
        if section in SCORING_CRITERIA: # Only score sections with defined criteria
            logging.info(f"Scoring section: {section}")
            if not text or len(text.strip()) < 20:
                logging.warning(f"Skipping scoring for '{section}' (insufficient content)."); section_scores[section] = {"score": 0, "justification": "Content missing or too brief."}; continue
            prompt = f"""{scoring_instructions}\n---\n**Section:** {section}\n**Criteria:**\n{SCORING_CRITERIA[section]}\n---\n**Text:**\n{text[:4000]}\n---\n**JSON Output:**"""
            response = call_gemini_text(prompt, max_output_tokens=LLM_MAX_TOKENS_SCORING)
            if response:
                 try:
                    json_match = re.search(r'```json\s*(\{.*?\})\s*```', response, re.DOTALL | re.IGNORECASE);
                    if not json_match: json_match = re.search(r'\{.*\}', response, re.DOTALL)
                    if not json_match: raise json.JSONDecodeError("No JSON object found", response, 0)
                    json_string = json_match.group(1) if len(json_match.groups()) == 1 else json_match.group(0)
                    score_data = json.loads(json_string)
                    if isinstance(score_data.get('score'), int) and 0 <= score_data['score'] <= 100 and isinstance(score_data.get('justification'), str):
                        section_scores[section] = score_data; logging.info(f"Scored '{section}': {score_data['score']}/100")
                    else: logging.warning(f"LLM scoring response invalid format/values for '{section}'. Parsed: {score_data}. Resp: {response}"); section_scores[section] = {"score": 0, "justification": "Failed parsing score (invalid format/values)."}
                 except json.JSONDecodeError as e: logging.warning(f"Failed JSON decode for '{section}' scoring: {e}. Resp: {response}"); section_scores[section] = {"score": 0, "justification": "Failed parsing score (JSON decode error)."}
                 except Exception as e: logging.error(f"Unexpected error processing score for '{section}': {e}"); section_scores[section] = {"score": 0, "justification": f"Internal error scoring."}
            else: logging.error(f"LLM call failed for scoring '{section}'."); section_scores[section] = {"score": 0, "justification": "LLM call failed during scoring."}
    logging.info("Finished section scoring.")
    return section_scores
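
The validity check applied to each parsed score can be read as a small standalone predicate. The sketch below is one way to write it, under the name `validate_score_payload` (hypothetical). It is slightly stricter than the notebook's inline check because it also rejects booleans, which Python treats as integers.

```python
def validate_score_payload(payload):
    """Return a cleaned {"score", "justification"} dict, or None if the payload is invalid."""
    if not isinstance(payload, dict):
        return None
    score = payload.get("score")
    justification = payload.get("justification")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not 0 <= score <= 100 or not isinstance(justification, str):
        return None
    return {"score": score, "justification": justification.strip()}


print(validate_score_payload({"score": 70, "justification": " Clear problem. "}))
print(validate_score_payload({"score": "70", "justification": "String score"}))
print(validate_score_payload({"score": 120, "justification": "Out of range"}))
```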

##### 3.8 Calculate Overall Score

In [16]:
def calculate_overall_score(section_scores, weights):
    total_score_points = 0
    total_weight = 0
    if not section_scores or not weights: return 0
    logging.info("Calculating overall weighted score based on target sections...")
    for section, score_data in section_scores.items():
        if section in weights: # Only consider sections we defined weights for
             weight = weights[section]
             total_weight += weight # Increment total weight considered
             if isinstance(score_data.get('score'), int) and score_data['score'] >= 0: # Include 0 scores in normalization
                  score = score_data['score']
                  total_score_points += score * weight # Weighted score contribution
             else:
                  logging.warning(f"Section '{section}' has weight but invalid score ({score_data.get('score')}). Excluding score contribution.")

    if total_weight > 0:
        normalized_score = total_score_points / total_weight
        logging.info(f"Raw weighted score sum: {total_score_points:.0f}, Total weight considered: {total_weight}")
        logging.info(f"Normalized overall score: {normalized_score:.0f}")
        return round(normalized_score)
    else:
        logging.warning("No target sections with defined weights were found or scored. Overall score is 0.")
        return 0
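
The overall score is a weighted mean taken over the sections that were actually scored, not over all six target sections. The sketch below reproduces that arithmetic in a simplified function, `weighted_overall` (a hypothetical name), which takes plain integer scores. The scores are illustrative and not drawn from any analysis run.

```python
# Weights copied from the configuration cell of this notebook.
SECTION_WEIGHTS = {"Problem": 20, "Solution": 20, "Market Size": 13,
                   "Business Model": 13, "Financial Projections": 14, "Team": 20}


def weighted_overall(scores, weights):
    """Weighted mean over the sections present in both mappings, rounded to an integer."""
    total_points = 0
    total_weight = 0
    for section, score in scores.items():
        if section in weights:
            total_points += score * weights[section]
            total_weight += weights[section]
    return round(total_points / total_weight) if total_weight else 0


# Only two sections scored: (80*20 + 60*20) / (20 + 20) = 70
print(weighted_overall({"Problem": 80, "Team": 60}, SECTION_WEIGHTS))  # 70
# A section scored 0 still counts toward the total weight: 1600 / 40 = 40
print(weighted_overall({"Problem": 80, "Team": 0}, SECTION_WEIGHTS))   # 40
```

Because the denominator includes only the weights of scored sections, a deck is not penalized for sections that were never identified, whereas a section that was identified but scored 0 pulls the mean down.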

##### 3.9 Generate Qualitative Feedback

In [17]:
def generate_feedback_llm(section_scores):
    """Uses Gemini Text LLM to generate feedback focused on TARGET_SECTIONS."""
    logging.info("Generating qualitative feedback focused on target sections...")
    if not section_scores: return {"strengths": [], "weaknesses": []}

    analysis_summary = ""
    valid_scores_found = False
    target_sections_found_count = 0
    for section in TARGET_SECTIONS: 
        if section in section_scores:
            target_sections_found_count += 1
            data = section_scores[section]
            score = data.get('score')
            justification = data.get('justification', 'N/A')
            if isinstance(score, int):
                 analysis_summary += f"Section: {section}\nScore: {score}/100\nJustification: {justification}\n---\n"
                 if score > 0 or "missing" not in justification.lower(): valid_scores_found = True
        else:
             analysis_summary += f"Section: {section}\nScore: Not Found/Identified\n---\n"


    if target_sections_found_count == 0:
         logging.warning("None of the target sections were found in the analysis.")
         return {"strengths": ["Analysis failed: None of the core sections (Problem, Solution, Market, etc.) were identified."], "weaknesses": []}
    if not valid_scores_found:
         logging.warning("No scorable target sections found to generate meaningful feedback.")
         return {"strengths": ["Analysis incomplete: Core sections found but could not be scored."], "weaknesses": []}

    prompt = f"""
    **Your Role:** Constructive startup pitch analyst. Synthesize analysis of core sections into actionable feedback.
    **Analysis Summary (Focusing on Problem, Solution, Market, Business Model, Financials, Team):**
    ---
    {analysis_summary}
    ---
    **Instructions:**
    1. Identify 2-3 key STRENGTHS based *primarily* on the analysis of the core sections listed above.
    2. Identify 2-3 critical WEAKNESSES based *primarily* on the analysis of these core sections.
    3. For each weakness, provide a specific, actionable suggestion for improvement.
    4. Format output ONLY as a valid JSON object: {{"strengths": [list of strings], "weaknesses": [list of strings]}}. No explanations or markdown.
    **Example Output:** {{"strengths": ["Strong problem validation (Problem: 90).", "Clear value prop (Solution: 80)."], "weaknesses": ["Market Size lacks sources (Market: 40). Suggestion: Cite specific market reports.", "Financial assumptions unclear (Financials: 55). Suggestion: Detail key drivers."]}}
    ---
    **JSON Output:**
    """
    response = call_gemini_text(prompt, max_output_tokens=LLM_MAX_TOKENS_FEEDBACK)
    default_error = {"strengths": ["Feedback generation failed."], "weaknesses": []}
    if not response: logging.error("Feedback LLM call failed."); return default_error

    try: 
        json_match = re.search(r'```json\s*(\{.*?\})\s*```', response, re.DOTALL | re.IGNORECASE)
        if not json_match: json_match = re.search(r'\{.*\}', response, re.DOTALL)
        if not json_match: raise json.JSONDecodeError("No JSON object found", response, 0)
        json_string = json_match.group(1) if len(json_match.groups()) == 1 else json_match.group(0)
        feedback_data = json.loads(json_string)
        if isinstance(feedback_data.get('strengths'), list) and isinstance(feedback_data.get('weaknesses'), list):
            logging.info("Successfully generated qualitative feedback.")
            return feedback_data
        else: logging.error(f"Feedback LLM response invalid format. Parsed: {feedback_data}. Resp: {response}"); return {"strengths": ["Failed parsing feedback (invalid format)."],"weaknesses": []}
    except json.JSONDecodeError as e: logging.error(f"Failed JSON decode for feedback: {e}. Resp: {response}"); return {"strengths": ["Failed parsing feedback (JSON decode error)."],"weaknesses": []}
    except Exception as e: logging.error(f"Unexpected error processing feedback response: {e}"); return default_error


#### 4. Main Workflow Function

In [18]:
def analyze_pitch_deck(pdf_path):
    """Orchestrates the focused pitch deck analysis process."""
    results = {
        'overall_score': 0, 'section_scores': {}, 'feedback': {'strengths': [], 'weaknesses': []},
        'raw_text_pages': [], 'section_mapping': {}, 'error': None
    }
    start_time_analysis = time.time()
    logging.info(f"--- Starting Focused Pitch Deck Analysis (6 Sections) for: {pdf_path} ---")

    # Text Extraction
    logging.info("Step 1: Extracting Text using Gemini OCR...")
    raw_pages_text = extract_text_with_ocr_and_gemini(pdf_path)
    if raw_pages_text is None: results['error'] = "Critical error during PDF/OCR."; logging.error(results['error']); return results
    if not any(page_text.strip() for page_text in raw_pages_text): results['error'] = "Failed to extract any text content."; logging.error(results['error']); return results
    successful_extractions = sum(1 for t in raw_pages_text if t.strip()); logging.info(f"Extracted text from {successful_extractions}/{len(raw_pages_text)} pages via OCR.")
    results['raw_text_pages'] = raw_pages_text

    # Section Identification 
    logging.info("Step 2: Identifying Target Sections...")
    section_pages_map = identify_sections_llm(raw_pages_text)
    if not section_pages_map:
        # Continue anyway; scoring and feedback will be minimal.
        logging.warning("Could not identify any of the target sections. Analysis will be limited.")
    results['section_mapping'] = section_pages_map

    # Aggregate & Preprocess Section Content
    logging.info("Step 3: Aggregating and Preprocessing Section Content...")
    section_content = aggregate_section_content(raw_pages_text, section_pages_map)
    if not section_content and section_pages_map: logging.warning("Section mapping found, but failed to aggregate content.")

    # Scoring Sections 
    logging.info("Step 4: Scoring Sections...")
    section_scores = score_sections_llm(section_content)
    results['section_scores'] = section_scores
    if not section_scores: logging.warning("Section scoring produced no results.")

    # Calculate Overall Score 
    logging.info("Step 5: Calculating Overall Score...")
    overall_score = calculate_overall_score(section_scores, SECTION_WEIGHTS)
    results['overall_score'] = overall_score

    # Generate Qualitative Feedback 
    logging.info("Step 6: Generating Feedback...")
    feedback = generate_feedback_llm(section_scores)
    results['feedback'] = feedback

    total_time = time.time() - start_time_analysis
    logging.info(f"--- Focused Analysis Completed in {total_time:.2f} seconds ---")
    return results

#### 5. Execution

In [21]:
if __name__ == "__main__":
    pdf_file_path = '../data/Uber-Pitch-Deck.pdf' # Example

    print("="*60); print(" AI Pitch Deck Analyzer (Focused on 6 Key Sections)"); print("="*60)
    print(f"Target Sections: {', '.join(TARGET_SECTIONS)}")
    print(f"Processing file: {pdf_file_path}")
    print(f"Text Model: {LLM_TEXT_MODEL}, Vision Model: {LLM_VISION_MODEL}")
    print(f"Inter-Page OCR Delay: {INTER_PAGE_DELAY}s")
    print("-" * 60)

    if not os.path.exists(pdf_file_path): print(f"\nERROR: PDF file not found: '{pdf_file_path}'.")
    elif not genai_configured: print("\nERROR: Google Generative AI client not configured.")
    else:
        analysis_results = analyze_pitch_deck(pdf_file_path)

        # --- Display Results ---
        print("\n" + "="*30 + " ANALYSIS RESULTS " + "="*30)

        if analysis_results.get('error'): print(f"\nAnalysis Failed: {analysis_results['error']}\n")
        else:
            
            print(f"\nOverall Pitch Score (Based on Weighted Sections): {analysis_results['overall_score']}/100\n")

            print("-" * 40); print("Section Scores & Justifications:"); print("-" * 40)
            if analysis_results['section_scores']:
                 
                 target_sections_found_in_scores = False
                 for section in TARGET_SECTIONS:
                     if section in analysis_results['section_scores']:
                         target_sections_found_in_scores = True
                         data = analysis_results['section_scores'][section]
                         score = data.get('score', 'N/A')
                         justification = data.get('justification', 'N/A')
                         weight_info = f"(Weight: {SECTION_WEIGHTS.get(section, 0)})" # Get weight safely
                         print(f"  - {section} {weight_info}: {score}/100")
                         print(f"    Justification: {justification}")
                     else:
                         print(f"  - {section} (Weight: {SECTION_WEIGHTS.get(section, 0)}): Not Found/Scored")
                 if not target_sections_found_in_scores:
                      print("  None of the target sections were identified or scored.")
            else: print("  No sections were scored.")
            print("\n")

            print("-" * 40); print("Feedback (Focused on Core Sections):"); print("-" * 40)
            feedback = analysis_results.get('feedback', {})
            strengths = feedback.get('strengths')
            weaknesses = feedback.get('weaknesses')
            print("Strengths:")
            if strengths:
                for s in strengths: print(f"  + {s}")
            else: print("  No specific strengths identified.")
            print("\nWeaknesses & Suggestions:")
            if weaknesses:
                for w in weaknesses: print(f"  - {w}")
            else: print("  No specific weaknesses identified.")
            print("\n" + "="*78 + "\n")

    print("Analysis complete.")

2025-03-30 18:07:53,209 - INFO - --- Starting Focused Pitch Deck Analysis (6 Sections) for: ../data/Uber-Pitch-Deck.pdf ---
2025-03-30 18:07:53,210 - INFO - Step 1: Extracting Text using Gemini OCR...
2025-03-30 18:07:53,210 - INFO - Starting PDF processing with Gemini OCR for: ../data/Uber-Pitch-Deck.pdf
2025-03-30 18:07:53,212 - INFO - Converting PDF to images...


 AI Pitch Deck Analyzer (Focused on 6 Key Sections)
Target Sections: Problem, Solution, Market Size, Business Model, Financial Projections, Team
Processing file: ../data/Uber-Pitch-Deck.pdf
Text Model: gemini-1.5-flash-latest, Vision Model: gemini-1.5-pro-latest
Inter-Page OCR Delay: 10s
------------------------------------------------------------


2025-03-30 18:08:02,321 - INFO - Converted 24 pages in 9.11s.
2025-03-30 18:08:02,323 - INFO - Processing page 1/24 with Gemini Vision OCR...
2025-03-30 18:08:05,442 - INFO - Page 1 OCR took 3.12s.
2025-03-30 18:08:05,442 - INFO - Waiting 10s before next page...
2025-03-30 18:08:15,456 - INFO - Processing page 2/24 with Gemini Vision OCR...
2025-03-30 18:08:19,326 - INFO - Page 2 OCR took 3.87s.
2025-03-30 18:08:19,326 - INFO - Waiting 10s before next page...
2025-03-30 18:08:29,340 - INFO - Processing page 3/24 with Gemini Vision OCR...
2025-03-30 18:08:33,289 - INFO - Page 3 OCR took 3.95s.
2025-03-30 18:08:33,289 - INFO - Waiting 10s before next page...
2025-03-30 18:08:43,290 - INFO - Processing page 4/24 with Gemini Vision OCR...
2025-03-30 18:08:47,368 - INFO - Page 4 OCR took 4.08s.
2025-03-30 18:08:47,370 - INFO - Waiting 10s before next page...
2025-03-30 18:08:57,371 - INFO - Processing page 5/24 with Gemini Vision OCR...
2025-03-30 18:09:17,206 - INFO - Page 5 OCR took 19.83



Overall Pitch Score (Based on Weighted Sections): 53/100

----------------------------------------
Section Scores & Justifications:
----------------------------------------
  - Problem (Weight: 20): 70/100
    Justification: The problem of inefficient taxi technology in 2008 is clearly stated and its magnitude is supported by data points like fuel consumption and medallion costs.  However, the urgency aspect is less emphasized, and the target audience (passengers and drivers) could be more precisely defined.
  - Solution (Weight: 20): 70/100
    Justification: The solution is clearly explained, directly addresses the problem of inefficient taxi services, and offers a compelling value proposition of convenience and luxury.  However, feasibility and scalability aspects lack detail, and the text is somewhat disorganized.
  - Market Size (Weight: 13): 45/100
    Justification: The text partially defines the market (TAM is mentioned), but lacks clear definitions of SAM and SOM.  Data is p