<a href="https://colab.research.google.com/github/Billy67200/Advanced-Programming/blob/main/Copie_de_01_LLM_use.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Welcome to the Text Information Extraction Workshop! 🎉

Hey there, future AI wranglers! In this workshop, we're going to dive into the exciting world of extracting structured information from text using the OpenAI Python client and an OpenAI-compatible API (Together AI, in this notebook). Think of it as teaching a robot to read news articles and pull out all the juicy details for us. 🤖📰

We'll cover everything from setting up your API connection to designing schemas that tell the AI *exactly* what kind of information we're after. Get ready to turn messy text into clean, organized data!

## Setup: Let's Get Ready to Rumble 🛠️

First things first, we need to equip ourselves with the right tools. This involves installing the necessary Python library and importing modules. Think of it like gathering your spell components before casting a magical data extraction spell. ✨

In [None]:
!pip install openai -q

Now, let's import the ingredients for our AI recipe:

In [None]:
# Import required libraries
from openai import OpenAI
from google.colab import userdata # for use in Google Colab
import json # to handle JSON data
from pydantic import BaseModel, Field # to define data schemas
from typing import List, Optional, Dict # for type hints

import textwrap # to make text output pretty

# for use outside colab - if you're running this locally
# import os
# from dotenv import load_dotenv
# load_dotenv() # Load environment variables from .env file

**What's happening here?**

*   `openai`: This is the star of the show! The official OpenAI Python library that lets us talk to the models.
*   `google.colab.userdata` (or `os`, `dotenv`):  We need to keep our API keys secret! `userdata` is for Google Colab secrets, while `os` and `dotenv` are for local development using environment variables (safer than hardcoding!).
*   `json`:  AI models love to speak JSON. We'll use this to handle structured data in and out.
*   `pydantic`:  This is our schema superhero! Pydantic helps us define exactly what kind of data we expect from the AI, ensuring it's in the format we want. Think of it as a blueprint for data.
*   `typing`:  Just good coding practice to make our code readable and less error-prone.
*   `textwrap`:  Because nobody likes walls of text. This will wrap our output nicely for easy reading.
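
If you want one cell that works both in Colab and locally, here is a minimal sketch. The helper name `get_api_key` is our own invention (it is not part of any library), and it assumes your key is stored under the name `TOGETHER_API_KEY`, either as a Colab secret or as an environment variable.

```python
import os


def get_api_key(name="TOGETHER_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment.

    Hypothetical helper for this workshop, not part of the openai library.
    """
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except ImportError:
        # Local run: rely on an environment variable (e.g. loaded from a .env file)
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}.")
    return key


# Usage (uncomment once your key is configured):
# TOGETHER_API_KEY = get_api_key()
```

Failing early with a clear error is a design choice: it is usually easier to debug a missing key here than an authentication error on the first API call.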

## Connecting to the AI Brain: OpenAI Client Setup 🧠🔌

Time to plug into the AI brain! We'll set up the OpenAI client using your API key.  We're using `userdata.get('TOGETHER_API_KEY')` (or `os.getenv('TOGETHER_API_KEY')`) to securely grab your API key. **Never hardcode your API keys!** Treat them like your Netflix password, but for powerful AI. 🤫

In [None]:
# Setup OpenAI client with custom API key and base URL
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY') # for Google Colab user secrets
# TOGETHER_API_KEY = os.getenv('TOGETHER_API_KEY') # for local environment variables

Now, let's create the actual client that will do the talking:

In [None]:
# Create client
client = OpenAI(
    base_url="https://api.together.xyz/v1", # Using TogetherAI's API - you can change this
    api_key=TOGETHER_API_KEY
)

**Explanation:**

*   We're creating an `OpenAI` client instance. This is our portal to the language models.
*   `base_url`:  We're using TogetherAI's API endpoint here. You might use OpenAI's directly or another provider.
*   `api_key`:  This authenticates us to use the API.  Keep it secret, keep it safe!

## Summarizing Text: Baby Steps with AI 👶📝

Let's start with something simple: summarizing text. We'll ask the AI to condense a French news article into a single, punchy English sentence.  Think of this as warming up before the main workout.

Here's our French news article (`text_1`) discussing political reactions to the Ukraine war:

In [None]:
text_1 = """
Vous pouvez partager un article en cliquant sur les icônes de partage en haut à droite de celui-ci.
La reproduction totale ou partielle d’un article, sans l’autorisation écrite et préalable du Monde, est strictement interdite.
Pour plus d’informations, consultez nos conditions générales de vente.
Pour toute demande d’autorisation, contactez syndication@lemonde.fr.
En tant qu’abonné, vous pouvez offrir jusqu’à cinq articles par mois à l’un de vos proches grâce à la fonctionnalité « Offrir un article ».

https://www.lemonde.fr/international/live/2025/03/03/en-direct-guerre-en-ukraine-pour-donald-trump-les-etats-unis-ont-des-problemes-plus-urgents-que-de-s-inquieter-de-poutine_6572748_3210.html

L’altercation entre Volodymyr Zelensky et Donald Trump a été délibérément provoquée par les Etats-Unis, selon Friedrich Merz

Lors d’une conférence de presse, lundi, à Hambourg, Friedrich Merz, le candidat de l’alliance CDU/CSU à la chancellerie, a déclaré, après des consultations avec les instances dirigeantes de la CDU à Berlin, qu’il avait regardé la scène de l’altercation entre Volodymyr Zelensky et Donald Trump. « A mon avis, il ne s’agit pas d’une réaction spontanée aux interventions de Zelensky, mais manifestement d’une escalade délibérément provoquée lors de cette rencontre dans le bureau Ovale. »

« Il y a une certaine continuité dans ce que nous voyons actuellement de Washington dans la série d’événements des dernières semaines et des derniers mois, y compris la présence de la délégation américaine à Munich à la conférence sur la sécurité », a-t-il poursuivi. « Je plaide pour que nous nous préparions au fait que nous devrons faire beaucoup, beaucoup plus pour notre propre sécurité dans les années et les décennies à venir », a ajouté le futur chancelier.

Néanmoins, il souhaite que « tout soit mis en œuvre afin de maintenir les Américains en Europe », dans un contexte de spéculations selon lesquelles Trump pourrait retirer une partie des troupes américaines d’Allemagne. Le futur chancelier a précisé qu’il n’avait pas l’intention de se rendre aux Etats-Unis pour l’instant et qu’il ne le ferait qu’après une éventuelle élection en tant que chancelier par le Bundestag.

Par ailleurs, il a défendu le chancelier Olaf Scholz (SPD) contre les critiques concernant son rôle lors du sommet des dirigeants occidentaux à Londres. « Il n’est pas surprenant que l’Allemagne ne soit pas pleinement perçue et prise au sérieux sur la scène internationale en ce moment, a-t-il déclaré. Tout autre chancelier dans sa situation – ayant perdu sa majorité parlementaire et étant en transition vers un nouveau gouvernement – connaîtrait la même difficulté. »

Il a souligné que lui et Olaf Scholz s’efforcent d’« introduire la position allemande dans les négociations internationales et européennes en étroite coordination ». Toutefois, il estime qu’il « serait souhaitable que l’Allemagne participe bientôt à ces discussions avec un chef de gouvernement élu et disposant d’une majorité au Bundestag ».

3/3 2025
"""

Let's call the language model to summarize this French text. We're using `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`, a small, fast instruction-tuned Llama 3.1 model from Meta.

In [None]:
# Call the LLM to summarize the text
chat_completion = client.chat.completions.create(
    #model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo", # You can try a larger model for potentially better results
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", # A good balance of speed and performance

    messages=[
        {
            "role": "system", # System message sets the AI's persona
            "content": "You are a helpful assistant.", # Simple instruction for general helpfulness
        },
        {
            "role": "user", # User message is our actual request
            "content": "Summarize the following French text in one sentence in English: " + text_1 , # Our summarization task
        },
    ],
)

output = chat_completion.choices[0].message.content # Extract the AI's response

And now, let's make that summary readable:

In [None]:
print(textwrap.fill(output, width=80)) # Wrap the output for better readability

Here is a summary of the French text in one sentence in English:  Friedrich
Merz, the German CDU/CSU candidate for chancellor, has stated that the
altercation between Volodymyr Zelensky and Donald Trump was deliberately
provoked by the United States, and that Germany should prepare for increased
security measures in the face of a potentially more isolated America.


**What did we just do?**

*   We used `client.chat.completions.create` to send a request to the language model.
*   `model`: We specified which language model to use.
*   `messages`:  This is the conversation history. We have a `system` message to set the AI's role and a `user` message with our summarization request.
*   `chat_completion.choices[0].message.content`:  This digs into the API response to get the actual text summary from the AI.
*   `textwrap.fill`:  Makes the summary look nice and not like a long, unbroken line.
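
To make the `messages` structure and the wrapping step concrete without calling the API, here is a small sketch. The helper `build_messages` is a hypothetical name of ours; it only builds the list you would pass as `messages=` to `client.chat.completions.create`.

```python
import textwrap
from typing import Dict, List


def build_messages(task: str, text: str,
                   system: str = "You are a helpful assistant.") -> List[Dict[str, str]]:
    """Build a system + user message pair (hypothetical helper for this workshop)."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{task}\n\n{text.strip()}"},
    ]


messages = build_messages(
    "Summarize the following French text in one sentence in English:",
    "Bonjour le monde.",
)
for message in messages:
    print(message["role"], "->", message["content"][:40])

# textwrap.fill breaks a long string into lines of at most `width` characters
wrapped = textwrap.fill("word " * 30, width=40)
print(wrapped)
```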

## Creating Structured Data: Meet Pydantic Schemas 🏗️

Summarization is cool, but what if we want *structured* information?  This is where Pydantic and schemas come to the rescue!  Think of a schema as a mold. We define the shape we want our data to have, and Pydantic helps us ensure the AI output fits that mold perfectly.

Let's start with a simple example: a `User` object.

In [None]:
# Define the schema for the User object using Pydantic.
class User(BaseModel):
    name: str = Field(description="user name") # Field with description for clarity
    address: str = Field(description="address") # Another field with description

**Pydantic Power Explained:**

*   `BaseModel`:  The foundation of our schema.  It tells Pydantic we're defining a data structure.
*   `name: str = Field(...)`:  We're defining a field called `name` that *must* be a string (`str`).  `Field(...)` lets us add extra info, like `description`.
*   `description`:  Super helpful for the AI!  It tells the model what each field is supposed to represent.
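
Before involving the AI at all, you can see what the schema buys you. The sketch below assumes Pydantic v2 (the version that provides `model_validate` and `model_json_schema`) and re-declares the same `User` class so it runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str = Field(description="user name")
    address: str = Field(description="address")


# Valid data passes and becomes a typed object
good = User.model_validate({"name": "Alice", "address": "42 Wonderland Avenue"})
print(good.name)

# Data that does not fit the mold is rejected with a ValidationError
try:
    User.model_validate({"name": "Alice"})
    missing_fields = []
except ValidationError as exc:
    missing_fields = [error["loc"][0] for error in exc.errors()]
print("Missing:", missing_fields)

# The descriptions end up in the JSON schema that we can send to the model
schema = User.model_json_schema()
print(schema["properties"]["name"]["description"])
```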

Now, let's ask the AI to create a `User` object in JSON format, following our schema:

In [None]:
# Call the LLM to create a User object in JSON format
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    response_format={"type": "json_object", "schema": User.model_json_schema()}, # Important! Tell API we want JSON output matching the schema
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that answers in JSON.", # System message - AI answers in JSON
        },
        {
            "role": "user",
            "content": "Create a user named Alice, who lives in 42, Wonderland Avenue. Output in JSON.", # User request
        },
    ],
)

created_user = json.loads(chat_completion.choices[0].message.content) # Parse JSON response
print(json.dumps(created_user, indent=2)) # Print nicely formatted JSON

{
  "name": "Alice",
  "address": "42 Wonderland Avenue"
}


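**Key Points:**

*   `response_format={"type": "json_object", "schema": User.model_json_schema()}`:  This is the key step! We're telling the API (Together AI's OpenAI-compatible endpoint in this notebook): "Hey, return a JSON object, and here's the schema it should follow (`User.model_json_schema()`)". Pydantic automatically generates the JSON schema from our `User` class. Note that passing `schema` inside `response_format` is specific to this provider; other providers, including OpenAI's own API, may expect a different structure for schema-constrained output.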
*   `json.loads(...)`: We parse the JSON string response from the AI into a Python dictionary.
*   `json.dumps(..., indent=2)`:  We print the JSON in a nicely formatted way, making it easy to read.
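
`json.loads` only works if the reply is pure JSON. Models sometimes wrap their answer in a Markdown code fence or add a sentence around it, so a more forgiving parser can help. This is a rough sketch, not a robust solution; `parse_json_response` is a hypothetical helper name.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), flags=re.DOTALL
)


def parse_json_response(raw: str) -> dict:
    """Parse a model reply that should contain one JSON object (hypothetical helper)."""
    text = raw.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        # Keep only what is inside the code fence
        text = fenced.group(1).strip()
    else:
        # Otherwise keep everything from the first "{" to the last "}"
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start:end + 1]
    return json.loads(text)  # still raises json.JSONDecodeError on invalid JSON


print(parse_json_response('Sure! {"name": "Alice"} Hope this helps.'))
```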

## Extracting Article Details: Generalized Schema for News 📰🔍

Now for the main event: extracting information from news articles! We'll use a more complex, generalized schema called `ExtractScheme` designed to capture key details from various news articles.  This schema is our blueprint for extracting the *essence* of a news story.

In [None]:
class ExtractScheme(BaseModel):
    title: str = Field(description="Title of the news article")
    publication_date: str = Field(description="Date when the article was published. If not explicitly mentioned, infer from article content if possible.")
    main_event: str = Field(description="Primary event or topic discussed in the article")
    event_summary: str = Field(description="A brief summary of the event or article's main points")
    entities_involved: List[str] = Field(description="Organizations, countries, or key entities involved in the event")
    key_people: List[str] = Field(description="Key people or figures mentioned in relation to the event")
    relevant_locations: Optional[List[str]] = Field(description="Locations that are central to the event, if any")
    key_developments: Optional[List[str]] = Field(description="Key developments or actions that have occurred or are expected")
    potential_impact: Optional[List[str]] = Field(description="Potential impacts or consequences of the event")
    keywords: List[str] = Field(description="Key terms or phrases that are central to the article")

**`ExtractScheme` Breakdown:**

*   We've defined fields to capture the title, publication date, main event, summary, entities, people, locations, developments, impact, and keywords.  Basically, everything we want to know about a news article in a structured way.
*   `List[str]`:  For fields like `entities_involved` and `key_people`, we expect a *list* of strings, as there can be multiple entities and people.
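*   `Optional[List[str]]`: `relevant_locations`, `key_developments`, and `potential_impact` may be `null`, since not all articles have explicit locations or details for these. Because no default value is given, Pydantic v2 still requires the key to be present in the JSON; add `default=None` to `Field(...)` if the key itself may be missing.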
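
The difference between "may be `null`" and "may be missing" is easy to check. The sketch below assumes Pydantic v2; the class names `NullableButRequired` and `TrulyOptional` are made up for illustration.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class NullableButRequired(BaseModel):
    # Same style as in ExtractScheme: Optional type, no default value
    relevant_locations: Optional[List[str]] = Field(description="Locations, if any")


class TrulyOptional(BaseModel):
    # Adding default=None lets the key be absent altogether
    relevant_locations: Optional[List[str]] = Field(default=None, description="Locations, if any")


# An explicit null is accepted
with_null = NullableButRequired.model_validate({"relevant_locations": None})

# A missing key is rejected when there is no default
try:
    NullableButRequired.model_validate({})
    accepts_missing_key = True
except ValidationError:
    accepts_missing_key = False

# With default=None the missing key is fine
lenient = TrulyOptional.model_validate({})

print(with_null.relevant_locations, accepts_missing_key, lenient.relevant_locations)
```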

Let's use our French text (`text_1`) again and see how well the AI can extract information using this schema:

In [None]:
# Call the LLM to extract information from text_1 using ExtractScheme
chat_completion = client.chat.completions.create(
    #model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo", # more powerful model; the 8B model tends to answer in French instead of English
    response_format={"type": "json_object", "schema": ExtractScheme.model_json_schema()}, # Enforce JSON output with ExtractScheme schema
    messages=[
        {
            "role": "system",
            "content": "You are an AI model tasked with extracting structured information from a news article. Follow the schema provided below to extract the relevant details. You do not invent information that is not in the provided text. Output in English JSON format.", # Detailed system instruction
        },
        {
            "role": "user",
            "content": "Extract article information from the following French text and output in English JSON format: " + text_1, # User request with text_1
        },
    ],
)

Let's see the extracted output:

In [None]:
extracted_output = json.loads(chat_completion.choices[0].message.content) # Parse JSON
print(json.dumps(extracted_output, ensure_ascii=False, indent=2)) # Print nicely, handle non-ASCII characters

{
  "title": "Escalation between Volodymyr Zelensky and Donald Trump deliberately provoked by the US, says Friedrich Merz",
  "publication_date": "2025-03-03",
  "main_event": "Friedrich Merz comments on the altercation between Volodymyr Zelensky and Donald Trump",
  "event_summary": "Friedrich Merz stated that the altercation between Volodymyr Zelensky and Donald Trump was deliberately provoked by the US, according to his opinion after watching the scene and consulting with CDU instances in Berlin.",
  "entities_involved": [
    "Friedrich Merz",
    "Volodymyr Zelensky",
    "Donald Trump",
    "US",
    "Germany",
    "CDU",
    "CSU"
  ],
  "key_people": [
    "Friedrich Merz",
    "Volodymyr Zelensky",
    "Donald Trump",
    "Olaf Scholz"
  ],
  "relevant_locations": [
    "US",
    "Germany",
    "Berlin",
    "Hamburg",
    "Munich",
    "Washington",
    "London",
    "Europe"
  ],
  "key_developments": [
    "Friedrich Merz believes the altercation was a deliberate escalation

Notice that the output has the same structure as the one we defined in `ExtractScheme`.

**Inspecting the Schema:**

You can check out the generated JSON schema to see exactly what we're sending to the API:

In [None]:
ExtractScheme.model_json_schema() # Show the JSON schema

{'properties': {'title': {'description': 'Title of the news article',
   'title': 'Title',
   'type': 'string'},
  'publication_date': {'description': 'Date when the article was published. If not explicitly mentioned, infer from article content if possible.',
   'title': 'Publication Date',
   'type': 'string'},
  'main_event': {'description': 'Primary event or topic discussed in the article',
   'title': 'Main Event',
   'type': 'string'},
  'event_summary': {'description': "A brief summary of the event or article's main points",
   'title': 'Event Summary',
   'type': 'string'},
  'entities_involved': {'description': 'Organizations, countries, or key entities involved in the event',
   'items': {'type': 'string'},
   'title': 'Entities Involved',
   'type': 'array'},
  'key_people': {'description': 'Key people or figures mentioned in relation to the event',
   'items': {'type': 'string'},
   'title': 'Key People',
   'type': 'array'},
  'relevant_locations': {'anyOf': [{'items': {'ty

Or as a string:

In [None]:
json_schema = json.dumps(ExtractScheme.model_json_schema()) # Get schema as a valid JSON string (str() would give a Python dict repr with single quotes)

**Schema in Prompt (Alternative Method):**

Instead of `response_format`, you can also directly include the JSON schema in the user prompt.  This can be useful for debugging or more direct control.  Let's try it:

In [None]:
# Call the LLM to extract information from text_1 using schema string in prompt
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    #response_format={"type": "json_object", "schema": ExtractScheme.model_json_schema()}, # alternative method, disabled here: the schema is passed in the prompt instead
    messages=[
        {
            "role": "system",
            "content": "You are an AI model tasked with extracting structured information from a news article. Follow the schema provided below to extract the relevant details. You do not invent information that is not in the provided text. You output JSON only in English. Nothing else.", # Strict system message
        },
        {
            "role": "user",
            "content": "Extract article information from the following French text and output in English JSON format: " + text_1 + " Use following JSON schema:" + json_schema, # User prompt with text and schema string
        },
    ],
)

And print the output:

In [None]:
extracted_output = json.loads(chat_completion.choices[0].message.content)
print(json.dumps(extracted_output, ensure_ascii=False, indent=2))

{
  "title": "Friedrich Merz claims US deliberately provoked altercation between Volodymyr Zelensky and Donald Trump",
  "publication_date": "3/3 2025",
  "main_event": "Altercation between Volodymyr Zelensky and Donald Trump",
  "event_summary": "Friedrich Merz, the candidate for the German chancellery, claims that the US deliberately provoked the altercation between Volodymyr Zelensky and Donald Trump. He also expressed concerns about the US's priorities and the potential impact on European security.",
  "entities_involved": [
    "United States",
    "Germany",
    "Ukraine",
    "Donald Trump",
    "Volodymyr Zelensky",
    "Friedrich Merz"
  ],
  "key_people": [
    "Friedrich Merz",
    "Donald Trump",
    "Volodymyr Zelensky",
    "Olaf Scholz"
  ],
  "relevant_locations": [
    "Hamburg",
    "Berlin",
    "Munich",
    "Londres",
    "Allemagne",
    "Etats-Unis"
  ],
  "key_developments": [
    "Altercation between Volodymyr Zelensky and Donald Trump",
    "US's priorities an

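Both `response_format` and including the schema in the prompt aim at the same goal: guiding the AI to produce structured JSON output according to our `ExtractScheme`. The results are not identical, though. In the prompt-only run above with the 8B model, the date came back as "3/3 2025" and several locations stayed in French ("Londres", "Allemagne", "Etats-Unis"). With the schema only in the prompt, nothing guarantees that the reply is valid JSON at all, so `json.loads` can fail.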
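
The outputs in this notebook show the same kind of field in several shapes ("2025-03-03", "3/3 2025", "February 28, 2025"). If you need consistent dates downstream, a small post-processing step is one option. This is a sketch under assumptions: `normalize_date` is a hypothetical helper, the list of formats is our guess at what the model tends to produce, and "3/3 2025" is read as day/month.

```python
from datetime import datetime
from typing import Optional

# Candidate formats, tried in order (an assumption, extend as needed)
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d/%m %Y", "%d/%m/%Y")


def normalize_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for raw in ["2025-03-03", "3/3 2025", "February 28, 2025", "sometime in March"]:
    print(repr(raw), "->", normalize_date(raw))
```

Returning `None` instead of raising keeps a batch run going; you can then inspect the unparsed values afterwards.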

## Second Example: Tech News Time! 🚀📱

Let's test our `ExtractScheme` on a different type of news article – tech news! Here's an English article (`text_2`) about Meta's AI chatbot app launch:

In [None]:
text_2 = """
Meta’s AI chatbot will soon have a standalone app
​
 Summarise
​
Emma RothFeb 28, 2025 at 12:05 AM GMT+1
STK043_VRG_Illo_N_Barclay_6_Meta
Meta is planning to launch a dedicated app for its AI chatbot, according to a report from CNBC. The Verge can also confirm that Meta is working on the standalone app. The new app could launch in the second quarter of this year, CNBC says, joining the growing number of standalone AI apps, including OpenAI’s ChatGPT, Google Gemini, and Microsoft Copilot.

Meta has already brought its AI chatbot across Facebook, Instagram, Messenger, and WhatsApp, but launching a standalone app could help the company reach people who don’t already use those platforms. Similar to rival chatbots, Meta AI can answer questions, generate images, edit photos, and more. It recently gained the ability to use its “memory” to provide better recommendations.

In a response to CNBC’s report, OpenAI CEO Sam Altman joked, “ok fine maybe we’ll do a social app.” Meta declined to comment.

Meta has ramped up its efforts to compete in the AI industry in recent months, with CEO Mark Zuckerberg announcing plans to invest up to $65 billion to further the company’s AI ambitions. The company also plans on holding an event dedicated to AI on April 29th.

Additional reporting by Alex Heath.

5 Comments5 New
"""

We'll use the same `ExtractScheme` and prompt structure.  This time, we'll also add an `assistant` example message.  This is like showing the AI a "good example" of the JSON output we expect, based on the *first* text example (`text_1`).  Example messages can significantly improve the accuracy and format of the AI's output.
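
In general, example messages come in pairs: a `user` message with an example input, followed by an `assistant` message with the output we would like for it, and finally the real request. Here is a minimal sketch of that pattern; `build_few_shot_messages` is a hypothetical helper, not something the OpenAI library provides.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    system: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> List[Dict[str, str]]:
    """Build system message, (user, assistant) example pairs, then the real request."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


few_shot = build_few_shot_messages(
    system="You output JSON only in English. Nothing else.",
    examples=[("Extract from: <first article>", '{"main_event": "..."}')],
    query="Extract from: <second article>",
)
print([message["role"] for message in few_shot])
```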

In [None]:
# Call the LLM to extract information from text_2 using ExtractScheme and example assistant message
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    #response_format={"type": "json_object", "schema": ExtractScheme.model_json_schema()}, # disabled here: the schema is passed in the prompt instead
    messages=[
        {
            "role": "system",
            "content": "You are an AI model tasked with extracting structured information from a news article. Follow the schema provided below to extract the relevant details. You do not invent information that is not in the provided text. You output JSON only in English. Nothing else.", # System message
        },
        {
            "role": "user",
            "content": "Extract article information from the following French text and output in English JSON format: " + text_1 + " Use following JSON schema:" + json_schema, # Few-shot example input: text_1, which the assistant message below answers
        },
        {
            "role": "assistant", # Assistant message - example of desired output format
            "content": """{
  "title": "L’altercation entre Volodymyr Zelensky et Donald Trump a été délibérément provoquée par les Etats-Unis, selon Friedrich Merz",
  "publication_date": "March 3, 2025",
  "main_event": "Political reactions to an altercation between Volodymyr Zelensky and Donald Trump",
  "event_summary": "Friedrich Merz claims that the altercation between Zelensky and Trump was deliberately provoked by the U.S. and expresses concerns about US commitment to European security.",
  "entities_involved": [
    "United States",
    "Ukraine",
    "Germany",
    "CDU/CSU alliance"
  ],
  "key_people": [
    "Friedrich Merz",
    "Volodymyr Zelensky",
    "Donald Trump",
    "Olaf Scholz"
  ],
  "relevant_locations": [
    "Hamburg",
    "Berlin",
    "Munich",
    "Washington",
    "London"
  ],
  "key_developments": [
    "Friedrich Merz's press conference in Hamburg",
    "Consultations with CDU leadership in Berlin",
    "Merz's statement on US-Europe relations and German security",
    "Defense of Olaf Scholz's role at a summit in London"
  ],
  "potential_impact": [
    "Potential shift in US foreign policy under Trump",
    "Increased pressure on Europe to ensure its own security",
    "Speculation about US troop withdrawal from Germany",
    "Impact on German political landscape and leadership"
  ],
  "keywords": [
    "Ukraine",
    "Donald Trump",
    "Volodymyr Zelensky",
    "Friedrich Merz",
    "US foreign policy",
    "European security",
    "German politics"
  ]
}""", # Example JSON based on text_1
        },
        {
            "role": "user", # Final user message: the actual request for the new article
            "content": "Extract article information from the following text and output in English JSON format: " + text_2 + " Use following JSON schema:" + json_schema, # User request with text_2 and schema
        },
    ],
)

Let's see the extracted JSON for the tech news article:

In [None]:
extracted_output = json.loads(chat_completion.choices[0].message.content)
print(json.dumps(extracted_output, ensure_ascii=False, indent=2))

{
  "title": "Meta’s AI chatbot will soon have a standalone app",
  "publication_date": "February 28, 2025",
  "main_event": "Meta planning to launch a standalone app for its AI chatbot",
  "event_summary": "Meta is working on a dedicated app for its AI chatbot, which could launch in the second quarter of this year. The app will allow users to access Meta's AI chatbot without needing to use Facebook, Instagram, Messenger, or WhatsApp.",
  "entities_involved": [
    "Meta",
    "OpenAI",
    "Google",
    "Microsoft"
  ],
  "key_people": [
    "Mark Zuckerberg",
    "Sam Altman",
    "Alex Heath"
  ],
  "relevant_locations": null,
  "key_developments": [
    "Meta planning to launch a standalone app for its AI chatbot",
    "OpenAI CEO Sam Altman jokingly responds to the report",
    "Meta declining to comment on the report"
  ],
  "potential_impact": [
    "Increased accessibility of Meta's AI chatbot",
    "Potential competition in the AI industry"
  ],
  "keywords": [
    "Meta",
   

## Batch Processing: Extracting Data from Many Articles 🚀

Now, let's scale things up!  What if we have a whole bunch of news articles we want to process?  We can use a loop to iterate through articles, extract information for each, and then organize the results into a table (Pandas DataFrame) for easy analysis.

First, we'll download the articles as a JSONL file (`paraphrased_articles.jsonl`) and load them from the local copy.  JSONL is a convenient format for storing multiple JSON objects, one per line.
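
If you have never seen JSONL, here is a tiny self-contained sketch that writes a demo file and reads it back. The reader `read_jsonl` is a hypothetical variant of the loader used in this notebook; skipping blank and malformed lines is our own choice, which can be handy with scraped data.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read one JSON object per line, skipping blank or malformed lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                print(f"Skipping malformed line {line_number}")
    return records


# Write a small demo file, then load it again
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"title": "A", "text": "first"}\n')
        f.write("\n")
        f.write("not json\n")
        f.write('{"title": "B", "text": "second"}\n')
    demo_records = read_jsonl(demo_path)

print([record["title"] for record in demo_records])
```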

In [None]:
# get the data from the server
!wget https://rjuro.com/unistra-nlp2025/data/paraphrased_articles.jsonl

--2025-03-04 12:52:15--  https://rjuro.com/unistra-nlp2025/data/paraphrased_articles.jsonl
Resolving rjuro.com (rjuro.com)... 185.199.109.153, 185.199.111.153, 185.199.108.153, ...
Connecting to rjuro.com (rjuro.com)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2621560 (2.5M) [application/octet-stream]
Saving to: ‘paraphrased_articles.jsonl’


2025-03-04 12:52:15 (164 MB/s) - ‘paraphrased_articles.jsonl’ saved [2621560/2621560]



In [None]:
import os # needed for os.getenv when running outside Colab (it was commented out in the setup cell)
import json # already imported; repeated so this cell can run on its own
from typing import List, Optional # already imported
from pydantic import BaseModel, Field # already imported
from openai import OpenAI # already imported

# Define the extraction schema (modified for batch processing, see the notes below)
class ExtractScheme(BaseModel):
    #title: str = Field(description="Title of the news article") # Removed from schema in this version
    #publication_date: str = Field(description="Date when the article was published. If not explicitly mentioned, infer from article content if possible.") # Removed from schema
    real_article: str = Field(description="Real article or scraping problem/artifact/copyright issue? - Select YES/NO only.") # Added field - article validity check
    main_event: str = Field(description="Primary event or topic discussed in the article")
    event_summary: str = Field(description="A brief summary of the event or article's main points")
    entities_involved: List[str] = Field(description="Organizations, countries, or key entities involved in the event")
    key_people: List[str] = Field(description="Key people or figures mentioned in relation to the event")
    relevant_locations: Optional[List[str]] = Field(description="Locations that are central to the event, if any")
    key_developments: Optional[List[str]] = Field(description="Key developments or actions that have occurred or are expected")
    potential_impact: Optional[List[str]] = Field(description="Potential impacts or consequences of the event")
    keywords: List[str] = Field(description="Key terms or phrases that are central to the article")

# Setup OpenAI client (same as before)
#TOGETHER_API_KEY = os.getenv('TOGETHER_API_KEY') # Ensure API key is set in environment if running locally
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=TOGETHER_API_KEY
)

# Load articles from local jsonl file
def load_articles_from_jsonl(file_path):
    articles = []
    with open(file_path, 'r', encoding='utf-8') as f: # Open file for reading
        for line in f: # Read line by line
            article_json = json.loads(line.strip()) # Parse each line as JSON
            articles.append(article_json) # Add to article list
    return articles

# Path to your local jsonl file - replace with your actual path!
jsonl_file_path = 'paraphrased_articles.jsonl' # Replace with your actual file path
articles_data = load_articles_from_jsonl(jsonl_file_path) # Load articles

# Filter articles - ensure we have text and it's long enough
filtered_articles_data = []
for article in articles_data:
    if 'text' in article and isinstance(article['text'], str) and len(article['text']) >= 100: # Check for text and minimum length
        filtered_articles_data.append(article) # Add to filtered list

articles_data = filtered_articles_data # Replace original with filtered data
print(f"Number of articles after filtering: {len(articles_data)}") # Print number of articles after filtering

extracted_data_table = [] # Initialize list to store extracted data
json_schema = json.dumps(ExtractScheme.model_json_schema()) # Get schema as a valid JSON string

Number of articles after filtering: 1903


**Schema Modification:**

Notice we've slightly modified the `ExtractScheme` in this batch processing section:

*   `title` and `publication_date` fields are *removed*: the input `articles_data` already provides them, and the loop below reads them from `article['title']` and `article['date']`.
*   `real_article: str = Field(...)`: A new field `real_article` is added to check if the text is a genuine article or some kind of scraping artifact. This is a practical addition for real-world data processing, where data can be messy.
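
Once the extraction has run, the `real_article` field can be used to drop scraping artifacts. The sketch below works on plain dictionaries with made-up rows; `is_real_article` is a hypothetical helper, and the tolerant matching ("yes.", " Yes ") is our assumption about how a model might deviate from the requested YES/NO.

```python
def is_real_article(record: dict) -> bool:
    """Return True only if the model answered YES for the real_article field."""
    answer = str(record.get("real_article", "")).strip().strip(".").upper()
    return answer == "YES"


# Made-up rows standing in for extraction results
rows = [
    {"main_event": "Product launch", "real_article": "YES"},
    {"main_event": "Cookie banner text", "real_article": "NO"},
    {"main_event": "Election results", "real_article": "yes."},
    {"error": "request failed"},  # rows without the field are treated as not real
]

kept = [row for row in rows if is_real_article(row)]
print(len(kept), "of", len(rows), "rows kept")
```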

**Batch Processing Loop:**

Now for the loop that processes each article:

In [None]:
# Import necessary additional libraries - pandas and tqdm
import pandas as pd # for DataFrames
from tqdm.notebook import tqdm # for progress bars in notebooks

# Iterate over articles and perform extraction with tqdm progress bar
for article in tqdm(articles_data[:10], desc="Processing Articles"): # Limiting to first 10 articles for demonstration, tqdm for progress bar
    article_text = article['text'] # Extract article text
    original_title = article['title'] # Extract original title
    original_date = article['date'] # Extract original date

    try: # Error handling - in case extraction fails for an article
        chat_completion = client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo", # Using a larger model here - 70B version
            messages=[
                {
                    "role": "system",
                    "content": "You are an AI model tasked with extracting structured information from a news article. Follow the schema provided below to extract the relevant details. You do not invent information that is not in the provided text. You output JSON only in English. Nothing else.", # System message
                },
                {
                    "role": "user",
                    "content": f"Extract article information from the following text and output in English JSON format: {article_text} Use following JSON schema:" + json_schema, # User message with article text and schema
                },
            ],
            response_format={"type": "json_object", "schema": ExtractScheme.model_json_schema()}, # Enforce JSON and schema
        )

        extracted_content_json = json.loads(chat_completion.choices[0].message.content) # Parse JSON response
        extracted_content = ExtractScheme(**extracted_content_json).model_dump() # Validate against schema and convert to dict (model_dump replaces the deprecated .dict() in Pydantic v2)

        # Add original title and date to the extracted data for the table
        extracted_content['original_title'] = original_title # Add original title to extracted data
        extracted_content['original_date'] = original_date # Add original date to extracted data
        extracted_data_table.append(extracted_content) # Append extracted data to table list

        print(f"Extracted information for: {original_title}") # Print success message

    except Exception as e: # Catch any errors during processing
        print(f"Error processing article: {original_title}. Error: {e}") # Print error message
        extracted_data_table.append({'original_title': original_title, 'original_date': original_date, 'error': str(e)}) # Store error info

# Convert to pandas DataFrame
df = pd.DataFrame(extracted_data_table) # Create DataFrame from extracted data

# Flatten list columns - make the DataFrame easier to view
def flatten_list_columns(df):
    flattened_df = df.copy() # Create a copy to avoid modifying original DataFrame
    list_columns = [col for col in df.columns if df[col].apply(lambda x: isinstance(x, list)).any()] # Identify list columns

    for col in list_columns:
        # Convert lists to comma-separated strings
        flattened_df[col] = flattened_df[col].apply(
            lambda x: ', '.join(x) if isinstance(x, list) and x else '') # Join list elements with commas, handle empty lists

    return flattened_df

# Flatten the dataframe and display the head
flattened_df = flatten_list_columns(df) # Flatten DataFrame
print("\nExtracted Data Table (Flattened):") # Print header
display(flattened_df.head()) # Display first few rows of flattened DataFrame

# Save results to a CSV file (comment out the next line to skip this step)
flattened_df.to_csv('extracted_news_data_flattened.csv', index=False) # Save to CSV

# Original output format (JSON) - for inspection of original structure
print("\nExtracted Data Table (Original):") # Print header
for row in extracted_data_table: # Iterate through extracted data
    print(json.dumps(row, ensure_ascii=False, indent=2)) # Print each row as formatted JSON

Processing Articles:   0%|          | 0/10 [00:00<?, ?it/s]


Extracted information for: Financial Leader of Transsion Holding Detained After Decade-Long Service
Extracted information for: New AI Feature Summarizes Google Meet Conversations
Extracted information for: AI Gadgets: Setting Expectations
Extracted information for: TikTok Lite Removes Reward Feature in EU
Extracted information for: US Investigates Chinese Connected Cars; Alibaba Cloud Cuts Prices; Tesla Plans New Model
Extracted information for: Former Google CEO Explains Company's Loss of Edge to OpenAI
Extracted information for: Current Trends in AI and AR
Extracted information for: Tesla Reports Strong Q2 Sales; Prepares Second-Generation Humanoid Robot Debut
Extracted information for: Apple Eyes Cloud-Based AI with M2 Ultra Chips
Extracted information for: Key Technology Trends Reshaping Manufacturing in 2024

Extracted Data Table (Flattened):


| | real_article | main_event | event_summary | entities_involved | key_people | relevant_locations | key_developments | potential_impact | keywords | original_title | original_date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NO | | | | | | | | | Financial Leader of Transsion Holding Detained... | 2024-09-07T00:00:00 |
| 1 | YES | Google introduces AI-powered automatic note-ta... | Google has launched an AI-powered feature for ... | Google | | | The feature is currently available to users wi... | The feature can serve as a valuable aid for in... | Google Workspace, AI-powered, automatic note-t... | New AI Feature Summarizes Google Meet Conversa... | 2024-08-27T00:00:00 |
| 2 | YES | AI Gadget Market and Antitrust Trial | The article discusses the poor reviews of AI g... | Google, Rabbit, Humane AI | Judge Amit Mehta | | Google's antitrust trial nearing its conclusio... | Implications for other ongoing antitrust cases... | AI gadget market, antitrust trial, Google, Rab... | AI Gadgets: Setting Expectations | 2024-05-03T00:00:00 |
| 3 | YES | TikTok to remove reward feature from TikTok Li... | The European Commission announced that TikTok ... | European Commission, TikTok | | EU, France, Spain, China | European Commission initiated an investigation... | Reducing the addictive nature of the app, Miti... | TikTok, TikTok Lite, European Commission, Digi... | TikTok Lite Removes Reward Feature in EU | 2024-08-06T00:00:00 |
| 4 | YES | US President Biden to investigate connected ve... | US President Biden instructed the Commerce Sec... | US, China, UBS, Tesla | US President Biden | America, China | Investigation into connected vehicles from cou... | Potential risks to America's national security... | Connected vehicles, National security, Automot... | US Investigates Chinese Connected Cars; Alibab... | 2024-03-01T00:00:00 |



Extracted Data Table (Original):
{
  "real_article": "NO",
  "main_event": "",
  "event_summary": "",
  "entities_involved": [],
  "key_people": [],
  "relevant_locations": null,
  "key_developments": null,
  "potential_impact": null,
  "keywords": [],
  "original_title": "Financial Leader of Transsion Holding Detained After Decade-Long Service",
  "original_date": "2024-09-07T00:00:00"
}
{
  "real_article": "YES",
  "main_event": "Google introduces AI-powered automatic note-taking feature for Google Workspace",
  "event_summary": "Google has launched an AI-powered feature for Google Workspace that automatically generates meeting notes and summaries. The feature is currently available to users with specific add-ons and aims to be rolled out to all users by September 10th, 2024.",
  "entities_involved": [
    "Google"
  ],
  "key_people": [],
  "relevant_locations": null,
  "key_developments": [
    "The feature is currently available to users with specific add-ons such as Gemini Enter

**Batch Processing Highlights:**

*   `for article in tqdm(articles_data[:10], desc="Processing Articles"):`: We loop through the *first 10 articles* (`[:10]`) for demonstration purposes. `tqdm` adds a progress bar, which is super useful when processing many articles.
*   `try...except`: Error handling! If extraction fails for a particular article, we print the error, store it in the results list alongside the article's original title and date, and move on to the next article.
*   `pd.DataFrame(extracted_data_table)`:  We convert the list of extracted dictionaries into a Pandas DataFrame. DataFrames are amazing for tabular data manipulation and analysis.
*   `flatten_list_columns(df)`:  This function flattens columns that contain lists into comma-separated strings. This makes the DataFrame easier to read and export to CSV.
*   `flattened_df.to_csv('extracted_news_data_flattened.csv', index=False)`: We save the flattened DataFrame to a CSV file, a widely compatible format for sharing data and analyzing it in Excel or other tools. Comment this line out if you don't need the file.
*   `print(json.dumps(row, ensure_ascii=False, indent=2))`:  We also print the original JSON output for each article, in case you want to see the un-flattened, structured JSON data.
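
Because failed extractions land in the same list as successful ones, a small post-processing step can help before any analysis. The sketch below splits the rows into usable records, flagged non-articles, and failures. The helper name `split_results` and the sample rows are hypothetical, and it assumes the model answers `real_article` with the literal strings "YES" or "NO", as in the output above.

```python
from typing import Dict, List, Tuple


def split_results(rows: List[Dict]) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split extraction rows into (usable, non_articles, failures)."""
    usable, non_articles, failures = [], [], []
    for row in rows:
        if "error" in row:
            # Rows written by the except branch carry an 'error' key
            failures.append(row)
        elif str(row.get("real_article", "")).strip().upper() != "YES":
            # Anything not explicitly flagged "YES" is treated as a non-article
            non_articles.append(row)
        else:
            usable.append(row)
    return usable, non_articles, failures


# Made-up rows that mimic the three shapes found in extracted_data_table
sample_rows = [
    {"real_article": "YES", "main_event": "Example event", "original_title": "A"},
    {"real_article": "NO", "main_event": "", "original_title": "B"},
    {"original_title": "C", "original_date": "2024-01-01T00:00:00", "error": "timeout"},
]

usable, non_articles, failures = split_results(sample_rows)
print(len(usable), len(non_articles), len(failures))  # 1 1 1
```

In the notebook you could call `split_results(extracted_data_table)` and build the DataFrame from the usable rows only, keeping the failures around for a retry pass.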

## Conclusion: You're an AI Data Extraction Wizard! 🧙‍♂️✨

Congratulations! You've made it through the workshop and learned how to:

*   Connect to OpenAI's API (or a compatible alternative like TogetherAI).
*   Define data schemas using Pydantic to structure AI outputs.
*   Extract information from text, from simple summaries to complex structured data.
*   Process multiple articles in batch and organize the results into a table.

You're now equipped to build your own AI-powered information extraction tools!  Think of the possibilities: analyzing news trends, extracting product details from descriptions, processing customer feedback, and much more.  The text data world is your oyster! 🦪🌍

Keep experimenting, keep building, and have fun turning text into data! 🎉