<a href="https://colab.research.google.com/github/MohammedNasserAhmed/AINARABIC/blob/main/Extract_Arabic_Data_Fom_Firecrawl_Markdown_Using_Groq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Arabic Information Extraction Using Groq API** ⚓

This script extracts specific fields from Arabic documents provided in JSON format using the Groq API. The script loads the data, sends it to the Groq model, and processes the response to extract relevant information.

### **Prerequisites** ⛳

Make sure the following is available:

- Required library: `groq` (the `json` module ships with Python's standard library and needs no installation)

### **Installation** ⏬

Install the required libraries:


In [1]:
!pip install groq

Collecting groq
  Downloading groq-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting httpx<1,>=0.23.0 (from groq)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->groq)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->groq)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading groq-0.10.0-py3-none-any.whl (106 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Downloading h11-0.14.0-py3-none-any.whl (58 kB


### **Usage** ☕

#### *1. Load the Markdown Data*

The script begins by loading data from a `markdown.json` file:


In [6]:
import json

with open('markdown.json', 'r', encoding='utf-8') as f:
    markdown = json.load(f)
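
Before sending anything to the model, it can help to confirm that the file actually loaded and is not empty. The following is a minimal sketch, not part of the original notebook; the helper name `load_markdown` is hypothetical, and it assumes the file holds any non-empty JSON value:

```python
import json
from pathlib import Path


def load_markdown(path="markdown.json"):
    """Load the JSON file and fail early if it is missing or empty."""
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"{path} not found; export it first")
    # UTF-8 is required so Arabic text is decoded correctly.
    with file_path.open("r", encoding="utf-8") as f:
        data = json.load(f)
    if not data:
        raise ValueError(f"{path} was loaded but contains no data")
    return data
```

Calling `load_markdown()` with no arguments reads the same `markdown.json` file used above.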

#### *2. Initialize the Groq Client*

Set the `GROQ_API_KEY` environment variable to your Groq API key; the client reads the key from there instead of hard-coding it:

In [3]:
from groq import Groq
import os

# Read the key from the environment so it is never stored in the notebook
groq_api_key = os.getenv("GROQ_API_KEY") or ""
client = Groq(
    api_key=groq_api_key,
)
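
An unset key otherwise only surfaces when the first request fails. As a hedged sketch, a small helper can raise a clear error up front; the name `get_groq_api_key` is hypothetical, and the `env` parameter exists only to make the function easy to test:

```python
import os


def get_groq_api_key(env=None):
    """Return the Groq API key from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return key
```

The result can then be passed to the client, for example `Groq(api_key=get_groq_api_key())`.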

#### *3. Define the Fields to Extract*

Specify the fields you want to extract from the document:

In [12]:
fields_to_extract = ["article title", "author", "date", "site name", "summary"]
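
The field list is later interpolated into the prompt as a Python list literal. One possible alternative, sketched here under the assumption that a bulleted list is easier for the model to follow, is to render the prompt with a small helper. `build_user_prompt` is a hypothetical name, not something the notebook defines:

```python
import json


def build_user_prompt(document, fields):
    """Build the user message from the document and the field list."""
    # Serialize non-string documents as JSON; ensure_ascii=False keeps Arabic readable.
    if not isinstance(document, str):
        document = json.dumps(document, ensure_ascii=False)
    field_lines = "\n".join(f"- {name}" for name in fields)
    return (
        "Please extract the following details from the provided document:\n\n"
        f"Document Content:\n\n{document}\n\n"
        f"Required Information:\n\n{field_lines}"
    )
```

The returned string can be used as the `content` of the `user` message in the completion request.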

#### *4. Create the Completion Request*

Send a request to the Groq model to extract the information:

In [13]:
response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
            "role": "system",
            "content": "You are an expert legal analyst specializing in extracting key Arabic information from documents. Your task is to accurately parse and output this information in a well-structured JSON format"
        },
        {
            "role": "user",
            "content": f"Please extract the following details from the provided document:\n\nDocument Content:\n\n{markdown}\n\nRequired Information:\n\n{fields_to_extract}"
        }
    ],
    temperature=0,
    max_tokens=512,
    top_p=1,
    stream=False,
    stop=None,
    response_format={"type": "json_object"}
)
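
The model name `llama3-8b-8192` suggests a context window of 8,192 tokens, so a long scraped page may not fit in a single request. The sketch below trims the document text before it is sent. It is only a rough guard: the helper name `truncate_document` and the default budget of 12,000 characters are assumptions, and a character count is not a token count.

```python
def truncate_document(text, max_chars=12000):
    """Trim text to at most max_chars characters, cutting at a line break when possible."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n", 0, max_chars)
    # Fall back to a hard cut if no line break is found in the allowed range.
    if cut <= 0:
        cut = max_chars
    return text[:cut]
```

Cutting at a line break avoids splitting a sentence in the middle, at the cost of dropping everything after the cut.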


#### *5. Print the Response*

Convert the JSON string response into a Python dictionary and print the extracted information:



In [14]:
json_string = response.choices[0].message.content

# Convert JSON string to a Python dictionary
data_extracted = json.loads(json_string)

# Print the extracted information
for key, value in data_extracted.items():
    print(f"{key}: {value}")

article_title: الذكاء الاصطناعي والتكنولوجيا الرقمية المستقبلية
author: د. خالد وليد محمود
date: 9/3/2024
site_name: الجزيرة نت
summary: الذكاء الاصطناعي الشغل الشاغل لغالبية حكومات الدول المتقدّمة؛ لإدراكها أن العالم يقف عند فجر حقبة جديدة، ستغيّر حياة البشرية والطريقة التي تعيش وتعمل بها في عدد كبير من المجالات والقطاعات المختلفة
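
Note that the keys in the output above use underscores (`article_title`, `site_name`) although the request asked for `article title` and `site name`, so the model may rename or omit fields. A minimal sketch of a check that normalizes key names and reports requested fields that are missing or empty; `normalize_key` and `check_extracted` are hypothetical helpers:

```python
def normalize_key(name):
    """Lowercase a field name and replace spaces or hyphens with underscores."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def check_extracted(data, fields):
    """Return (normalized data, list of requested fields missing or empty)."""
    normalized = {normalize_key(k): v for k, v in data.items()}
    missing = [f for f in fields if not normalized.get(normalize_key(f))]
    return normalized, missing
```

For example, `check_extracted(data_extracted, fields_to_extract)` returns the dictionary with normalized keys together with the fields that still need attention.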


### ***Notes*** ♒

- Ensure that `markdown.json`, saved by the previous notebook, exists and contains valid JSON.
- This script is designed to extract information from Arabic documents, but it can be adapted for other languages by modifying the prompts.