In [17]:
# Cell 1: Installing Libraries
print("Installing necessary libraries for logic testing and Streamlit...")

# Installing libraries needed for processing and potentially Streamlit later
# Using -q to make the output less noisy
!pip install streamlit pandas openai "transformers[torch]" torch pdfplumber python-dotenv nest_asyncio evaluate accelerate sentencepiece -q

# Installing specific httpx version compatible with openai v1.10.0
print("\\nForce installing compatible httpx version (0.27.2)...")
# Uninstalling any existing version first to avoid conflicts
!pip uninstall -y httpx
# Installing the required version
!pip install httpx==0.27.2 --quiet

print("\\n--- Installation Complete ---")
print(" Libraries installed.")
print(" IMPORTANT: Please restart the runtime now before running the next cell!")
print("   Go to 'Runtime' -> 'Restart Runtime' in the menu above.")

Installing necessary libraries for logic testing and Streamlit...
\nForce installing compatible httpx version (0.27.2)...
Found existing installation: httpx 0.27.2
Uninstalling httpx-0.27.2:
  Successfully uninstalled httpx-0.27.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-genai 1.10.0 requires httpx<1.0.0,>=0.28.1, but you have httpx 0.27.2 which is incompatible.[0m[31m
[0m\n--- Installation Complete ---
 Libraries installed.
 IMPORTANT: Please restart the runtime now before running the next cell!
   Go to 'Runtime' -> 'Restart Runtime' in the menu above.


This first step is like getting all the tools and ingredients ready before cooking. We need to install several Python packages (libraries) that provide specific functionalities like reading PDFs (pdfplumber), handling AI models (transformers, torch), interacting with OpenAI (openai), and running asynchronous code smoothly in Colab (nest_asyncio). We also install streamlit itself, even though we won't run the web app here, just to ensure all dependencies are present. We also install a specific older version of httpx because the openai library we use needs it.

In [1]:
# Cell 2: Mount Drive, Load API Key Directly, and Init OpenAI Client

# Import necessary libraries
from google.colab import drive
import os
# from dotenv import load_dotenv # No longer needed for direct reading
import nest_asyncio
import asyncio
import traceback
from openai import AsyncOpenAI

# --- Apply nest_asyncio patch ---
try:
    nest_asyncio.apply()
    print("Applied nest_asyncio patch.")
except RuntimeError:
    print("nest_asyncio patch might already be applied or is not needed.")
except Exception as e:
    print(f"Could not apply nest_asyncio: {e}")

# --- Mount Google Drive ---
try:
    drive.mount('/content/drive')
    print("----- Google Drive mounted successfully.")
except Exception as e:
    print(f"----- Failed to mount Google Drive: {e}")
    raise

# --- Load OpenAI API Key DIRECTLY from openai_key.txt ---
# ** Define the CORRECT path to your key file **
api_key_file_path = "/content/drive/MyDrive/RadioBrief/openai_key.txt"

openai_api_key = None # Initialize variable
print(f"\nAttempting to load API key directly from: {api_key_file_path}")
if os.path.exists(api_key_file_path):
    try:
        # Open the file and read the key (assuming it's the only content)
        with open(api_key_file_path, "r") as f:
            openai_api_key = f.read().strip() # Read the whole file and remove whitespace

        if openai_api_key:
            print(f"----- OpenAI API key loaded successfully from {os.path.basename(api_key_file_path)}.")
        else:
            print(f"----- Found file {os.path.basename(api_key_file_path)}, but it appears empty.")
            openai_api_key = None # Ensure it's None if empty
    except Exception as e:
         print(f"----- Error reading API key file {os.path.basename(api_key_file_path)}: {e}")
         openai_api_key = None
else:
     print(f"----- Error: API key file not found at the specified path: {api_key_file_path}")
     print("   Please ensure the file exists in your Drive and the path is correct.")

# --- Initialize OpenAI Client ---
async_client = None # Initialize variable
if openai_api_key:
    print("\nInitializing OpenAI Async Client...")
    try:
        # Create the client instance using the key we loaded
        async_client = AsyncOpenAI(api_key=openai_api_key)
        print("----- OpenAI Async Client Initialized successfully.")
    except Exception as e:
        print(f"----- Failed to initialize OpenAI client: {e}")
        print(traceback.format_exc())
else:
    print("\n----- Skipping OpenAI client initialization because the API key could not be loaded.")

print("\n--- Setup Complete for Cell 2 ---")

Applied nest_asyncio patch.
Mounted at /content/drive
✅ Google Drive mounted successfully.

Attempting to load API key directly from: /content/drive/MyDrive/RadioBrief/openai_key.txt
✅ OpenAI API key loaded successfully from openai_key.txt.

Initializing OpenAI Async Client...
✅ OpenAI Async Client Initialized successfully.

--- Setup Complete for Cell 2 ---


Setup Environment (Drive, Key, Async)

Explanation: Now that the tools are installed (and the runtime restarted), we need to set up our workspace. This cell connects your Colab notebook to your Google Drive so we can access files (like your API key and saved model). It also loads your secret OpenAI API key from a .env file (which you should have in your Google Drive) so we don't put the key directly in the code. Finally, it prepares the environment to handle the special type of code (asynchronous) needed for OpenAI calls.

##Changes Made:

Removed the import load_dotenv.
Changed env_path to api_key_file_path and set it to your correct file: /content/drive/MyDrive/RadioBrief/openai_key.txt.
Replaced the load_dotenv(...) and os.getenv(...) logic with a simple with open(...) as f: f.read().strip() to read the key directly from the specified text file.

In [2]:
# Cell 3: Load Fine-Tuned Classification Model

# Import necessary libraries from transformers
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
import os
import traceback

# --- Configuration ---
# Define the path where your fine-tuned model is saved in Google Drive
# ** IMPORTANT: Double-check this path is correct! **
fine_tuned_model_path = "/content/drive/MyDrive/RadioBrief/finetune-results/final_model"

# Define the expected labels and mappings from your fine-tuning notebook
# (These MUST match the labels the model was trained on - MasakhaNEWS 'fra')
id2label_map = {0: 'business', 1: 'entertainment', 2: 'health', 3: 'politics', 4: 'religion', 5: 'sports', 6: 'technology'}
label2id_map = {v: k for k, v in id2label_map.items()}
num_model_labels = len(id2label_map)

# --- Load Model ---
classifier_finetuned = None # Initialize variable
print(f"Attempting to load fine-tuned classification model from: {fine_tuned_model_path}")

# Check if the specified directory exists
if not os.path.isdir(fine_tuned_model_path):
     print(f"----- Error: Fine-tuned model directory not found at: {fine_tuned_model_path}")
     print("   Please check the path. Did you run the fine-tuning notebook and save the model?")
else:
    try:
        # Determine if GPU is available, otherwise use CPU
        device_id = 0 if torch.cuda.is_available() else -1
        device_name = 'GPU 0' if device_id == 0 else 'CPU'
        print(f"Attempting to load model onto device: {device_name}")

        # Load the model configuration and weights
        model = AutoModelForSequenceClassification.from_pretrained(
            fine_tuned_model_path,
            num_labels=num_model_labels,
            id2label=id2label_map,
            label2id=label2id_map
        )
        # Load the tokenizer associated with the model
        tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)

        # Create a text-classification pipeline using the loaded model and tokenizer
        classifier_finetuned = pipeline(
            "text-classification",
            model=model,
            tokenizer=tokenizer,
            device=device_id # Tell the pipeline to use GPU if available
        )
        print(f"----- Fine-tuned classifier pipeline loaded successfully on {device_name}.")

    except Exception as e:
        # Catch any errors during loading
        print(f"----- Failed to load fine-tuned classification model: {e}")
        print(traceback.format_exc()) # Show detailed error

# Final check
if not classifier_finetuned:
    print("\n----- Classifier pipeline could not be loaded. Classification steps later might fail or be skipped.")

print("\n--- Setup Complete for Cell 3 ---")

Attempting to load fine-tuned classification model from: /content/drive/MyDrive/RadioBrief/finetune-results/final_model
Attempting to load model onto device: CPU


Device set to use cpu


✅ Fine-tuned classifier pipeline loaded successfully on CPU.

--- Setup Complete for Cell 3 ---


Load Fine-Tuned Model

Explanation: This cell loads the classification model you trained in the FineTune_Topic_Classifier (2).ipynb notebook. It finds the saved model files in your Google Drive and prepares a pipeline object (like a ready-to-use tool) that we can give text to and get a topic prediction back. It uses the specific labels the model was trained on (business, politics, etc.).

In [8]:
# Cell 4: Define Helper Functions and Variables

# Import necessary libraries (some might be redundant but safe)
import re
import pdfplumber
import logging
import traceback
import datetime
import asyncio
from openai import AsyncOpenAI # For type hint
from transformers import pipeline # For type hint
import os # Needed for basename in extract_text_from_pdf

print("Defining helper functions and variables...")

# --- Configuration Variables (from Perfect_results_Rafael_WithReport.ipynb) ---
# === ACTION REQUIRED: Paste your 'political_keywords' list below ===
# (Copied from Cell 6 of Perfect_results_Rafael_WithReport.ipynb)
political_keywords = [
    # --- General Politics & Concepts ---
    "Politique", "Géopolitique", "Économie politique", "Relations internationales",
    "Souveraineté", "Frontières", "Sécurité", "Défense", "Libéralisme",
    "Conflit", "Guerre", "Crise", "Instabilité", "Tensions géopolitiques", "Cessez-le-feu", "Trêve",
    "Élections", "Scrutin", "Parité",
    "Diplomatie", "Négociations", "Sommet", "Traité", "Accord",
    "Sanctions", "Embargo", "Coercition économique",
    "Aide humanitaire", "Crise humanitaire", "Droits de l'homme",
    "Terrorisme", "Extrémisme", "Ingérence étrangère",
    # --- French Politics & Foreign Policy ---
    "Assemblée Nationale", "Sénat", "Matignon", "Bercy", "Élysée",
    "Quai d'Orsay", "Ministère des Affaires étrangères", "Ministre des Affaires étrangères", "France",
    # --- International Institutions & Law ---
    "ONU", "Nations Unies", "Conseil de sécurité", "Résolution", "Casques bleus",
    "OTAN", "NATO",
    "Union Européenne", "UE", "Bruxelles", "Commission européenne", "Parlement européen",
    "G7", "G20", "OMC", "FMI", "Banque mondiale",
    "Union Africaine", "UA",
    "Cour pénale internationale", "CPI", "Cour internationale de Justice", "CIJ",
    # --- Regions & Countries ---
    "Moyen-Orient", "Proche-Orient",
    "Gaza", "Palestinien", "Palestine", "Cisjordanie", "Jérusalem", "Autorité palestinienne",
    "Israël", "Israélien", "Tel Aviv",
    "Liban", "Libanais", "Beyrouth",
    "Syrie", "Syrien", "Damas",
    "Jordanie", "Jordanien", "Amman",
    "Égypte", "Égyptien", "Le Caire",
    "Irak", "Irakien", "Bagdad",
    "Iran", "Iranien", "Téhéran",
    "Arabie Saoudite", "Saoudien", "Riyad",
    "Yémen", "Yéménite", "Sanaa",
    "Qatar", "Qatari", "Doha",
    "Émirats arabes unis", "EAU", "Émirati", "Abou Dhabi", "Dubaï",
    "Turquie", "Turc", "Ankara",
    "Afrique du Nord", "Maghreb",
    "Algérie", "Algérien", "Alger",
    "Tunisie", "Tunisien", "Tunis",
    "Maroc", "Marocain", "Rabat",
    "Libye", "Libyen", "Tripoli",
    "États-Unis", "USA", "Américain", "Washington", "Maison Blanche", "Pentagone", "Département d'État",
    "Chine", "Chinois", "Pékin", "Taïwan",
    "Russie", "Russe", "Moscou", "Kremlin",
    "Ukraine", "Ukrainien", "Kiev",
    # --- Specific Topics ---
    "Migration", "Migrants", "Réfugiés", "Asile", "Immigration", "Immigrés", "Flux migratoires", "Frontière",
    "Guerre commerciale", "Commerce international", "Droit de douane", "Accords commerciaux", "Protectionnisme", "Multilateralisme",
    "Nucléaire", "Prolifération", "AIEA",
    "Énergie", "Pétrole", "Gaz", "OPEP",
    # --- Groups & Specific Entities ---
    "Hamas", "Jihad islamique",
    "Hezbollah",
    "Houthis", "Ansar Allah",
    "Talibans", "Afghanistan",
    "État islamique", "Daech", "EI", "ISIS",
    "Al-Qaïda",
    # --- Political Ideologies ---
    "Extrême droite", "Populisme",
    "Autoritarisme", "Souverainisme", "Nationalisme", "Islamisme"
]
# === END PASTE keywords list ===
print(f"Loaded {len(political_keywords)} political keywords.")

# === ACTION REQUIRED: Paste your 'min_keyword_hits' assignment below ===
# (Copied from Cell 12 of Perfect_results_Rafael_WithReport.ipynb)
min_keyword_hits = 3
# === END PASTE min_keyword_hits ===
print(f"Minimum keyword hits set to: {min_keyword_hits}")


# --- FUNCTION DEFINITIONS ---

# 1. PDF Extraction Function
# === ACTION REQUIRED: Paste your 'extract_text_from_pdf' function code below ===
# (Copied from Cell 10 of Perfect_results_Rafael_WithReport.ipynb)
# (Make sure it takes pdf_path as input for Colab testing)
def extract_text_from_pdf(pdf_path):
    """Extracts all the text from a PDF file using pdfplumber."""
    full_text = ""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    full_text += text + "\\n"
    except FileNotFoundError:
        logging.error(f"PDF file not found: {pdf_path}")
        print(f"------ Error: PDF file not found at {pdf_path}")
        return None
    except pdfplumber.PDFSyntaxError:
        logging.error(f"PDF syntax error in file: {pdf_path}")
        print(f"------ Error: PDF syntax error in file: {pdf_path}")
        return None
    except Exception as e:
        logging.error(f"Error extracting text from PDF: {e}\\n{traceback.format_exc()}")
        print(f"------ Error extracting text from PDF: {e}")
        return None
    print(f"--- Extracted text from PDF: {os.path.basename(pdf_path)}")
    return full_text
# === END PASTE extract_text_from_pdf ===


# 2. Article Splitting Function (Split by 'ftp\\n' pattern)
def smart_split_articles(full_text):
    """Splits text into potential articles based on the 'ftp\\n' separator
       and applies basic cleaning and filtering."""
    if not full_text: return []
    print("--- Running smart_split_articles (Splitting by 'ftp\\n') ---")

    # Basic cleaning (remove form feed, maybe consolidate multiple spaces later if needed)
    cleaned_text = full_text.replace('\\x0c', '\\n')

    # Split the text wherever 'ftp\n' occurs
    # This pattern likely separates major sections or pages
    split_pattern = r'ftp\\n'
    possible_articles = re.split(split_pattern, cleaned_text, flags=re.IGNORECASE) # Use re.split
    print(f"Number of chunks after splitting by '{split_pattern}': {len(possible_articles)}")

    # Filter out empty strings and apply length/content checks
    final_articles = []
    min_article_length = 150 # Minimum characters
    print(f"Filtering chunks shorter than {min_article_length} characters...")
    for i, article_chunk in enumerate(possible_articles):
        if article_chunk:
            # Further clean each chunk: remove leading/trailing whitespace
            # and potentially remove lines that look like page headers/footers if needed
            trimmed_chunk = article_chunk.strip()

            # Remove potential leftover page/section headers at the start of a chunk
            # Example: remove lines like 'VI VOTRE FAIT DU JOUR' or 'POLITIQUE 11' if they appear alone at the start
            lines = trimmed_chunk.split('\\n')
            if lines and re.match(r'^[A-Z\s]+\s*\d*$', lines[0].strip()): # Check if first line looks like a header
                 trimmed_chunk = '\\n'.join(lines[1:]).strip()

            # Check length and if it contains at least some letters
            if len(trimmed_chunk) > min_article_length and any(c.isalpha() for c in trimmed_chunk):
                 final_articles.append(trimmed_chunk)
            # else:
                 # print(f"  Chunk {i} filtered out (length {len(trimmed_chunk)} <= {min_article_length} or no letters/only header).")


    print(f"  (Splitting resulted in {len(final_articles)} potential articles after filtering)")
    print("--- Finished smart_split_articles ---")
    return final_articles

# 3. Summarization Function (Improved with Input Truncation)
async def async_summarize_article(
    aclient: AsyncOpenAI,
    text: str,
    max_lines: int = 4,
    style: str = "journalistique radio",
    tone: str = "neutre et informatif",
    focus: str = "faits politiques",
    # Add a safety limit slightly below the typical model max context
    # (e.g., gpt-3.5-turbo often has 16k token limit, roughly 3-4 chars/token)
    max_input_chars: int = 12000 # You can adjust this limit if needed
) -> str:
    """Asynchronously summarizes an article using the provided AsyncOpenAI client,
       truncating input text if it's too long."""

    # Basic check for valid input text
    if not text or not isinstance(text, str) or len(text.strip()) < 50:
        logging.warning(f"Skipping summary for short/invalid text: {text[:50]}...")
        return "Résumé non disponible (Texte d'entrée invalide)."

    # --- ADDED: Input Truncation ---
    text_to_summarize = text # Start with original text
    if len(text) > max_input_chars:
        logging.warning(f"Input text length ({len(text)}) exceeds limit ({max_input_chars}). Truncating.")
        print(f"⚠️ Input text too long ({len(text)} chars), truncating to {max_input_chars} chars for summary.")
        text_to_summarize = text[:max_input_chars] # Use only the first part
    # --- End Added ---

    # Use the potentially truncated text in the prompt
    prompt = f"""
Résume cet article en {max_lines} lignes maximum, dans un style {style},
en mettant l'accent sur les {focus}. Utilise un ton {tone}.

Texte de l'article :
{text_to_summarize}
"""
    print(f"--> Preparing async summary request for article snippet: {text_to_summarize[:50]}...")
    if not aclient: return "Résumé non disponible (Erreur Client OpenAI)."
    try:
        response = await aclient.chat.completions.create(
            model="gpt-3.5-turbo", # Ensure this model supports the context length
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=200 # Limit the output summary length too
        )
        summary = response.choices[0].message.content.strip()
        print(f"--> Received async summary for article snippet: {text_to_summarize[:50]}...")
        # Add note if text was truncated
        if len(text) > max_input_chars:
             summary += "\\n*(Note: Résumé basé sur le début de l'article car le texte original était trop long.)*"
        return summary
    except Exception as e:
        # Log the specific error, including context length errors
        logging.error(f"Error summarizing article asynchronously: {e}\\nText Snippet: {text_to_summarize[:200]}")
        # Check if it's a context length error specifically
        # The error object structure might vary, check common attributes
        error_code = getattr(e, 'code', None) or getattr(getattr(e, 'body', {}), 'code', None)
        if error_code == 'context_length_exceeded':
            print(f"----- Error during async summary: Input text still too long for the model even after truncation attempt.")
            return "Résumé non disponible (Erreur: Texte trop long)."
        else:
            print(f"----- Error during async summary request for article snippet: {text_to_summarize[:50]}... Error: {e}")
            return "Résumé non disponible (Erreur API)."

# 4. Translation Function
# === ACTION REQUIRED: Paste your 'async_translate_to_arabic' function code below ===
# (Copied from Cell 8 of Perfect_results_Rafael_WithReport.ipynb)
async def async_translate_to_arabic(
    aclient: AsyncOpenAI,
    text: str,
    style: str = "journalistique",
    formality: str = "formel",
    target_audience: str = "public général",
    context: str = "actualités"
) -> str:
    """Asynchronously translates text into Modern Standard Arabic."""
    prompt = f"""
Traduis ce texte en arabe standard moderne (MSA),
avec une précision élevée et un style {style},
adapté à un public {target_audience} dans un contexte de {context}.
Utilise un registre {formality}.
Conserve la structure et le sens du texte original.

Texte à traduire :
{text}
""" # Ensure no characters after the closing triple quotes
    print(f"--> Preparing async translation request for text snippet: {text[:50]}...")
    if not aclient: return "Traduction non disponible (Erreur Client OpenAI)." # Added client check
    # Handling the case where the input text might be the error message from summarization
    if text == "Résumé non disponible." or "Erreur API" in text or "Résumé Non Généré/Vide" in text:
         print(f"--> Skipping translation because input text indicates summary error: '{text[:50]}...'")
         return "Traduction non disponible car le résumé initial n'était pas disponible ou contenait une erreur."
    try:
        response = await aclient.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
        )
        translation = response.choices[0].message.content.strip()
        print(f"--> Received async translation for text snippet: {text[:50]}...")
        return translation
    except Exception as e:
        logging.error(f"Error translating text asynchronously: {e}\\n{traceback.format_exc()}\\nText: {text[:200]}")
        print(f"------ Error during async translation request for text snippet: {text[:50]}... Error: {e}")
        return "Traduction non disponible (Erreur API)."
# === END PASTE async_translate_to_arabic ===


# 5. Classification Function (using the fine-tuned model loaded in Cell 3)
# === ACTION REQUIRED: Paste the 'classify_topic_finetuned' function definition below ===
# (Copied from the previous response's app.py code)
def classify_topic_finetuned(pipeline_obj, text_to_classify):
    """Classifies text using the fine-tuned model pipeline."""
    # Check if the pipeline object (loaded in Cell 3) exists
    if not pipeline_obj:
        print("----- Fine-tuned classifier not loaded/available. Skipping classification.")
        # Return a dictionary indicating the classifier is missing
        return {"label": "Classifier Unavailable", "score": 0.0}

    # Check if the text input is valid (not empty, is a string, has some length)
    if not text_to_classify or not isinstance(text_to_classify, str) or len(text_to_classify.strip()) < 20:
        print(f"(!) Skipping fine-tuned classification for invalid/short input text: '{str(text_to_classify)[:50]}...'")
        # Return a dictionary indicating bad input
        return {"label": "Input Too Short/Invalid", "score": 0.0}

    try:
        # Print a message showing the start of classification for this text
        print(f"  (Classifying with fine-tuned model: {text_to_classify[:50]}...)")

        # Run the pipeline! Give it the text (limit length for stability)
        # The pipeline handles tokenization and prediction internally
        result = pipeline_obj(text_to_classify[:512]) # Apply length limit (e.g., 512 tokens)

        # The pipeline returns a list (usually with one item for single input)
        # Each item is a dictionary like [{'label': 'politics', 'score': 0.9...}]
        if result and isinstance(result, list):
            top_prediction = result[0] # Get the first (and likely only) prediction dictionary
            label = top_prediction.get('label', 'Error') # Get the predicted label name
            score = top_prediction.get('score', 0.0)   # Get the confidence score
            print(f"  (Fine-tuned classification: Label='{label}', Score={score:.4f})")
            # Return the result in a consistent dictionary format
            return {"label": label, "score": score}
        else:
             # Handle unexpected output format from the pipeline
             print(f"----- Unexpected result format from fine-tuned pipeline: {result}")
             return {"label": "Classification Error (Format)", "score": 0.0}

    except Exception as e:
        # Handle any error during the classification process
        print(f"----- Error during fine-tuned classification: {e}")
        print(traceback.format_exc()) # Print detailed error stack
        # Return a dictionary indicating a runtime error
        return {"label": "Classification Error (Runtime)", "score": 0.0}
# === END PASTE classify_topic_finetuned ===


print("\n --- Helper functions and variables defined.")

Defining helper functions and variables...
Loaded 182 political keywords.
Minimum keyword hits set to: 3

 --- Helper functions and variables defined.


Define Helper Functions & Variables

Explanation: This is a crucial cell where you bring together all the individual processing steps. You need to copy the exact Python code for your helper functions from the Perfect_results_Rafael_WithReport.ipynb notebook. This includes extracting text from PDFs, splitting text into articles, summarizing articles (using OpenAI), translating summaries (using OpenAI), and the new function we define here to use your fine-tuned model for classification. We also copy the political_keywords and min_keyword_hits variables.

In [13]:
# Cell 5: Test the Combined Logic

# Import necessary libraries (some might be redundant)
import asyncio
import os
import re
import traceback

# --- Define Input Source ---
# Change this to "pdf" to test PDF processing
input_source = "pdf"  # Options: "paste" or "pdf"

# --- Define Test Input ---
# Option 1: Pasted Text (if input_source = "paste")
test_pasted_text = """
Philippe-Attal, la bataille qui vient dans le camp présidentiel  Tout en se ménageant, les deux ex-premiers ministres tiennent à se différencier, sans primaire, pour que l’un d’eux s’impose comme candidat unique du « bloc central ».   10 min • Louis Hausalter Loris Boichot Tristan Quinault-Maupoil É douard Philippe et Gabriel Attal se sont donné rendez-vous chez un spécialiste des rouleaux de printemps et des raviolis de crevettes. Ce 19 février, les deux anciens premiers ministres se retrouvent autour de la table de Lily Wang, un chic restaurant asiatique du 7e arrondissement de Paris. Le patron du parti présidentiel, Renaissance, invite son homologue d’Horizons à son rassemblement militant du 6 avril, prévu à Saint-Denis (Seine-Saint-Denis). Là où il prévoit d’assumer un « premier pas » vers l’élection présidentielle de 2027, à son tour. Son interlocuteur, pour sa part déjà officiellement candidat, s’est lancé dans une série de meetings régionaux.  Les temps sont à la préparation de l’après-Macron, sous la surveillance de figures du « bloc central » méfiantes devant ce duel d’ambitieux anciens premiers ministres. « Avant de se demander qui porte un projet, il faudrait qu’il y ait un projet, pointe la ministre de l’Éducation nationale, Élisabeth Borne, face à ce qu’elle appelle la « bande des garçons ». Il faudra envoyer des messages de ce qu’on veut faire, et non pas juste dire : “Je suis là.” » C’est maintenant en dehors du pouvoir exécutif, tournés vers l’avenir, que s’organisent l’ex-Républicain (LR) et l’ancien socialiste, tous deux devenus personnalités politiques préférées des Français à leur sortie du gouvernement. Pour la présidentielle, « ce sont les deux plus attendus, les plus crédibles dans le bloc central », remarque un cadre de Renaissance. Et d’ajouter, beau joueur : « Quoi qu’il arrive, le futur du pays se fera avec Édouard Philippe et Gabriel Attal. »  Entre le maire du Havre (Seine-Maritime), 54 ans, et son rival, de dix-huit ans son cadet, il y aurait de la cordialité, « sans tensions », selon l’entourage du premier, ni « guerre larvée » ou « animosité », selon les proches du second. Tous deux membres du même gouvernement de 2018 à 2020, ils partagent des orientations semblables - proeuropéennes, libérales en économie et fermes en matière régalienne. Mais les divergences ne manquent pas. La polémique récente sur le port du voile lors des compétitions sportives en a apporté l’illustration. Le président d’Horizons a exprimé ses réticences devant la proposition de loi de LR visant à l’interdire ; Gabriel Attal en a profité pour s’y engouffrer et marquer sa différence. Une manière de rassurer sa droite, alors qu’il cherche à apparaître comme le garant du « dépassement » des clivages gauche-droite, d’une forme de « en même temps », davantage remis en cause par Édouard Philippe, autoproclamé « homme de droite ». À l’égard de François Bayrou, le trentenaire a fait le choix de la bienveillance, alors qu’Édouard Philippe répète qu’aucune réforme majeure ne pourra voir le jour d’ici la prochaine présidentielle.
""" # Add more text if needed

# Option 2: PDF Path (if input_source = "pdf")
# ** IMPORTANT: Make sure this PDF exists on your Google Drive! **
test_pdf_path = "/content/drive/MyDrive/RadioBrief/Le Parisien du Mercredi 09 Avril 2025.pdf"

# --- Main Test Logic (Async Function) ---
async def run_test():
    print("\\n--- Starting Logic Test ---")
    full_text = None
    source_name = ""
    processing_error = False

    # 1. Get Input Text
    print(f"Input source selected: {input_source}")
    if input_source == "pdf":
        if os.path.exists(test_pdf_path):
            print(f"Extracting text from PDF: {test_pdf_path}")
            full_text = extract_text_from_pdf(test_pdf_path) # Use function from Cell 4
            source_name = os.path.basename(test_pdf_path)
            if full_text is None:
                print("----- PDF extraction failed.")
                processing_error = True
        else:
            print(f"----- Test PDF not found at: {test_pdf_path}")
            processing_error = True
    elif input_source == "paste":
        if test_pasted_text and test_pasted_text.strip():
            full_text = test_pasted_text
            source_name = "Pasted Text"
            print("Using pasted text.")
        else:
            print("----- Test pasted text is empty.")
            processing_error = True
    else:
        print("----- Invalid input_source selected.")
        processing_error = True

    # Stop if input failed
    if processing_error or not full_text:
        print("--- Test Aborted due to input error ---")
        return

    # 2. Split Articles
    print(f"\\nProcessing content from: {source_name}")
    # Ensure smart_split_articles is defined (should be in Cell 4)
    if 'smart_split_articles' not in globals():
        print("----- Error: smart_split_articles function not defined.")
        return
    articles = smart_split_articles(full_text)

    if not articles:
        print("----- Could not split the text into articles using the defined rules.")
        processing_error = True
    else:
        print(f"Found {len(articles)} potential articles. Identifying the first political one...")

        first_political_article = None
        first_political_topic = "Non Politique / Non Trouvé"
        predicted_topic_label = "N/A"
        predicted_topic_score = 0.0

        # Compile keyword regex (ensure 're' and 'political_keywords' are available)
        if 're' not in globals() or 'political_keywords' not in globals():
             print("----- Error: 're' module or 'political_keywords' not defined.")
             return
        keyword_pattern = r'(?i)\\b(?:' + '|'.join(re.escape(kw) for kw in political_keywords) + r')\\b'
        keyword_regex = re.compile(keyword_pattern)

        # Define which MasakhaNEWS labels count as "political" for filtering
        # ** Adjust this list based on your needs and the model's labels **
        relevant_masakha_labels = ['politics']
        # relevant_masakha_labels = ['politics', 'business']

        print(f"Considering these fine-tuned labels as 'political': {relevant_masakha_labels}")

        # --- 3. Find and Process First Political Article (IMPROVED LOGIC) ---
        first_political_article_found = False
        # Compile keyword regex once
        if 're' not in globals() or 'political_keywords' not in globals() or 'min_keyword_hits' not in globals():
             print("----- Error: 're' module, 'political_keywords', or 'min_keyword_hits' not defined."); return
        keyword_pattern = r'(?i)\\b(?:' + '|'.join(re.escape(kw) for kw in political_keywords) + r')\\b'
        keyword_regex = re.compile(keyword_pattern)
        # relevant_masakha_labels = ['politics', 'business']
        relevant_masakha_labels = ['politics'] # Labels considered "political"
        print(f"Considering these fine-tuned labels as 'political': {relevant_masakha_labels}")

        # Loop through the *filtered* articles from smart_split_articles
        for i, article_text in enumerate(articles):
            # Note: article_text here is already stripped and length-filtered by smart_split_articles
            print(f"\n--- Checking Article Chunk {i+1} ---")
            print(f"Snippet: {article_text[:100]}...") # Show snippet of current article

            is_political = False
            assigned_topic_source = "N/A"
            predicted_topic_label = "N/A"
            predicted_topic_score = 0.0

            # --- Classification ---
            if 'classifier_finetuned' in globals() and classifier_finetuned and 'classify_topic_finetuned' in globals():
                classification_result = classify_topic_finetuned(classifier_finetuned, article_text) # Classify THIS article
                predicted_topic_label = classification_result.get('label', 'Error')
                predicted_topic_score = classification_result.get('score', 0.0)
                if predicted_topic_label in relevant_masakha_labels:
                    is_political = True
                    assigned_topic_source = f"Classifier ({predicted_topic_label})"
                    print(f"  -> Identified as '{predicted_topic_label}' by classifier (Score: {predicted_topic_score:.2f}).")
            else:
                predicted_topic_label = "Classifier Unavailable"
                print("  -> Skipping classification (model/function missing). Checking keywords...")

            # --- Keyword Check ---
            keyword_matches = keyword_regex.findall(article_text)
            number_of_hits = len(keyword_matches)
            if number_of_hits >= min_keyword_hits:
                print(f"  -> Met keyword threshold ({number_of_hits} hits).")
                if not is_political: # Mark as political if not already done by classifier
                    is_political = True
                    assigned_topic_source = f"Keywords ({number_of_hits} hits)"

            # --- Process if Political ---
            if is_political:
                print(f"✅ Political article found (Index {i+1}) identified via {assigned_topic_source}.")
                print("\\n--- Processing This Article ---")

                # Check OpenAI client
                if 'async_client' not in globals() or not async_client:
                     print("----- OpenAI client not initialized, cannot summarize/translate.")
                     break # Stop processing if client fails

                # Check functions exist
                if 'async_summarize_article' not in globals() or 'async_translate_to_arabic' not in globals():
                     print("----- Error: Summarization or translation function not defined.")
                     break

                try:
                    print("Running Summary and Translation...")
                    # Pass THIS specific article_text to the functions
                    summary_task = asyncio.create_task(async_summarize_article(async_client, article_text))
                    summary_result = await summary_task

                    translation_result = "Translation skipped (summary error)."
                    if summary_result and \
                       "Erreur API" not in summary_result and \
                       "non disponible" not in summary_result and \
                       "Texte d'entrée invalide" not in summary_result and \
                       "Texte trop long" not in summary_result:
                        translation_task = asyncio.create_task(async_translate_to_arabic(async_client, summary_result))
                        translation_result = await translation_task
                    else:
                        translation_result = f"Traduction non effectuée ({summary_result})."

                    # --- Display Results for THIS article ---
                    print("\\n--- FINAL TEST RESULTS (for first political article found) ---")
                    print(f"----- Detected Topic (Fine-Tuned Model): {predicted_topic_label} (Score: {predicted_topic_score:.2f})") # Use the label found for this article
                    print(f"\\n----- Summary (French):\\n{summary_result}")
                    print(f"\\n----- Translation (Arabic):\\n{translation_result}")
                    print(f"\\n----- Original Snippet (first 500 chars):\\n{article_text[:500]}...")

                    first_political_article_found = True # Mark that we found and processed one
                    break # IMPORTANT: Stop after processing the first political article

                except Exception as e:
                    print(f"----- An error occurred during API processing for article {i+1}: {e}")
                    print(traceback.format_exc())
                    # Optionally decide if you want to break or continue to next article on API error
                    break # Stop for now if API fails

        # --- After the loop ---
        if not first_political_article_found and not processing_error:
            print("\\n----- No political articles meeting the criteria were found in the input after checking all chunks.")

    print("\\n--- Logic Test Finished ---")

# --- Run the Test ---
# Using asyncio.run() to execute the async test function
# Check necessary components exist before running
if __name__ == '__main__' and \
   'async_client' in globals() and async_client and \
   'classifier_finetuned' in globals(): # Check classifier too, even if it might be None
     try:
         # nest_asyncio allows running within an existing loop (like Colab's)
         print("\nAttempting to run the async test logic...")
         asyncio.run(run_test())
         print("Async test logic execution complete.")
     except Exception as e:
         print(f"----- An error occurred trying to run the async test: {e}")
         print(traceback.format_exc())
else:
    print("\\nSkipping test run because OpenAI client ('async_client') or Classifier ('classifier_finetuned') is not available or not loaded.")



Attempting to run the async test logic...
\n--- Starting Logic Test ---
Input source selected: pdf
Extracting text from PDF: /content/drive/MyDrive/RadioBrief/Le Parisien du Mercredi 09 Avril 2025.pdf
--- Extracted text from PDF: Le Parisien du Mercredi 09 Avril 2025.pdf
\nProcessing content from: Le Parisien du Mercredi 09 Avril 2025.pdf
--- Running smart_split_articles (Splitting by 'ftp\n') ---
Number of chunks after splitting by 'ftp\\n': 45
Filtering chunks shorter than 150 characters...
  (Splitting resulted in 44 potential articles after filtering)
--- Finished smart_split_articles ---
Found 44 potential articles. Identifying the first political one...
Considering these fine-tuned labels as 'political': ['politics']
Considering these fine-tuned labels as 'political': ['politics']

--- Checking Article Chunk 1 ---
Snippet: 75
Un incendie Gare aux fermetures cet été
Centre de tri à Paris RER C
sans danger pour la santé ? a...
  (Classifying with fine-tuned model: 75
Un incendie G

Test Section

Explanation: This final cell puts everything together. It simulates the core logic: getting text (either from a test PDF path on your Drive or from pasted text you define here), splitting it, finding the first political article (using the fine-tuned classifier OR keywords), and then summarizing/translating that article. The results are printed at the end.


Report: Streamlit_App_Logic_Test NotebookThis report details the origins of the code used in the Streamlit_App_Logic_Test (1).ipynb notebook and analyzes the output from its Cell 5 execution using PDF input.

Part 1: Code Component OriginsThe Streamlit_App_Logic_Test (1).ipynb notebook was constructed by combining and adapting components from two primary source notebooks:

Perfect_results_Rafael_WithReport.ipynb (Main Processing Pipeline):Setup: Core library imports (pandas, pdfplumber, re, asyncio, openai, datetime, logging, traceback, os).Variables (Cell 4):political_keywords: Directly copied from Cell 6.min_keyword_hits: Directly copied from Cell 12.Functions (Cell 4):extract_text_from_pdf: Directly copied from Cell 10 (using the version taking a file path).smart_split_articles: Directly copied from Cell 11 (version without secondary ALL CAPS split), with debug print statements added for testing. (Note: This is the function currently causing issues).async_summarize_article: Based on Cell 7, but modified in Cell 4 of the test notebook to include input text truncation (max_input_chars check) to resolve the context_length_exceeded error encountered during testing.async_translate_to_arabic: Directly copied from Cell 8, with minor improvements to error message handling.Core Logic Structure (Cell 5): The overall flow (get text -> split -> loop/identify -> process -> display) is adapted from the logic in Cells 12 and 14, but modified to run as a single test sequence and incorporate the fine-tuned classifier.

FineTune_Topic_Classifier (2).ipynb (Fine-Tuning Notebook):Model Loading (Cell 3):The logic to load the saved fine-tuned model (/content/drive/MyDrive/RadioBrief/finetune-results/final_model) using AutoModelForSequenceClassification.from_pretrained and AutoTokenizer.from_pretrained is taken directly from the inference/loading examples in the fine-tuning notebook (e.g., Cell 13).The id2label_map, label2id_map, and num_model_labels variables are copied from the data preparation stage (Cell 4) of the fine-tuning notebook to ensure compatibility with the loaded model.The creation of the pipeline("text-classification", ...) object (classifier_finetuned) uses the loaded model and tokenizer.Classification Function (Cell 4):The classify_topic_finetuned function was newly written for the test notebook (Cell 4). It is designed specifically to use the classifier_finetuned pipeline loaded in Cell 3. It replaces the original classify_topic function from Perfect_results_Rafael_WithReport.ipynb (Cell 9), which used the different zero-shot model.

Other Components:Library Installation (Cell 1): Includes all libraries needed by both source notebooks, plus streamlit and nest_asyncio.Drive Mount & API Key Loading (Cell 2): Standard Colab practices, adapted to read the key directly from openai_key.txt per user instruction.Debug Prints: Added throughout smart_split_articles and the main test logic in Cell 5 to aid troubleshooting.

Conclusion on Code Origins: The test notebook accurately reflects the intended combination: core processing from the main notebook, classification using the fine-tuned model artifact, and necessary adaptations (like summary truncation) identified during testing.

Part 2: Analysis of Cell 5 Output (PDF Input)The execution of Cell 5 with the PDF input yielded the following key results and indicated one primary remaining issue:

Successful Steps:The notebook ran without Python errors.PDF text extraction was successful.The fine-tuned classification model (classifier_finetuned) loaded and made a prediction ('business') on the input it received.The input truncation logic added to async_summarize_article worked correctly, preventing the previous context_length_exceeded error from OpenAI by shortening the very long input text.Summarization and Translation API calls completed successfully on the truncated text.

Core Problem: Article Segmentation FailureThe debug output from the smart_split_articles function clearly shows:Number of chunks after re.split: 1
(Splitting attempt resulted in 1 potential articles after filtering)
This confirms that the function, using the regular expression based on section keywords (FRANCE, INTERNATIONAL, etc.), failed to split the text extracted from this specific PDF. It treated the entire document (332,797 characters) as a single article chunk.

Consequences:Because the text wasn't split, subsequent steps operated on the entire document instead of individual articles.The classification ('business') was performed on this large, mixed chunk, making the result likely inaccurate for any specific article within it.The summarization, while successful after truncation, only summarized the beginning of the document, not a targeted political article.The final displayed results (Topic: business, Summary: about PSG) are therefore misleading and do not reflect the successful processing of distinct political articles as intended.

Diagnosis: The failure lies specifically within the smart_split_articles function (defined in Cell 4). The regular expression used (split_pattern_flexible) is not matching the section keywords as they appear in the text extracted by pdfplumber in this execution context. This could be due to subtle differences in formatting (spaces, newlines like \n vs \r\n, hidden characters) compared to previous successful runs or the original notebook environment.

Required Action: The immediate next step is to debug and refine the split_pattern_flexible regular expression within the smart_split_articles function in Cell 4. This requires examining the repr(cleaned_text[:1000]) debug output from Cell 5 to see the exact formatting of the extracted text and adjusting the regex pattern to correctly identify the section keywords as they appear in that specific text. Once the splitting function reliably produces multiple article chunks from the PDF, the rest of the pipeline should yield meaningful results.


Summary of Key Modifications for PDF Processing LogicTo achieve the final successful result where the PDF was correctly processed in the Streamlit_App_Logic_Test.ipynb notebook, several crucial modifications were made iteratively:

Corrected API Key Loading: The initial method using .env failed. The code was modified to read the OpenAI API key directly from the specified openai_key.txt file, ensuring the OpenAI client could be initialized.

Implemented Input Truncation for Summarization: The initial runs failed with an OpenAI context_length_exceeded error because the entire (unsplit) PDF text was sent for summarization.Fix: The async_summarize_article function was modified to include a character limit (max_input_chars). If the input text exceeded this limit, it was truncated before being sent to the API. This successfully resolved the API error, although it initially resulted in summaries of the document's beginning rather than a specific article.

Revised Article Splitting Strategy: The core problem was that neither the original keyword-based regex nor splitting by double newlines (\n\n) effectively segmented the text extracted from the test PDF.
Analysis: Examination of the fully extracted text revealed a consistent ftp\n pattern, likely indicating page or major section breaks.Fix: The smart_split_articles function was entirely rewritten to use re.split(r'ftp\\n', ...) . This method successfully split the document into numerous (45) distinct chunks.

Refined Article Selection Loop: After successful splitting, the logic still processed the incorrect (large, initial) chunk.Fix: The for loop in the main test logic (Cell 5) was adjusted to correctly iterate through the list of chunks produced by the new smart_split_articles function (potentially skipping the first chunk if deemed necessary) and, crucially, to pass the text of the specific chunk identified as political to the summarization and translation functions.

Outcome: These combined modifications, particularly the change in splitting strategy (ftp\n) and ensuring the correct article chunk text was processed, led to the final successful output where the PDF was correctly segmented, a relevant article chunk was identified and processed without context length errors, yielding a meaningful summary and translation.