<a href="https://colab.research.google.com/github/MohammedNasserAhmed/AINARABIC/blob/main/Arabic_Web_Scraping_with_Anthropic_and_Firecrawl_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Arabic-Web Scraping with Anthropic and Firecrawl APIs 🔃**

This notebook demonstrates how to scrape and process data from a specified webpage using the Anthropic and Firecrawl APIs. The example URL used is an Al Jazeera blog post.

## *Prerequisites*

You need the following:

- An [Anthropic](https://www.anthropic.com) API key
- A [Firecrawl](https://www.firecrawl.com) API key
- Third-party Python libraries: `firecrawl`, `anthropic`, `beautifulsoup4`, `python-dotenv`, and `IPython` (preinstalled in Colab). The `json`, `re`, and `textwrap` modules ship with Python.

Install the third-party libraries with:

In [None]:
!pip install firecrawl anthropic beautifulsoup4 python-dotenv

*Restart the session after the installation finishes.*
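After the restart, a quick check can confirm that the libraries are importable before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the list `REQUIRED_MODULES` are illustrative and not part of the original notebook.

```python
from importlib.util import find_spec

# Import names (not pip package names) used by this notebook.
REQUIRED_MODULES = ["firecrawl", "anthropic", "bs4", "dotenv", "IPython"]

def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, rerun the install cell and restart the session again.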

In [None]:
import os
import time
from firecrawl import FirecrawlApp
import json
import anthropic
from bs4 import BeautifulSoup
from datetime import datetime
from typing import Optional, List
import re
import textwrap
from IPython.display import Markdown
import sys
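# Force UTF-8 output where the stream supports it (plain Python raises on assigning to .encoding)
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")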
from dotenv import load_dotenv
load_dotenv()

## **Methods Pool**

In [None]:
def extract_main_content_from_html(html: str) -> str:
    """Return the article's paragraph text, one paragraph per line."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Find the main content container
    main_content = soup.select_one('main#main-content-area')
    if not main_content:
        return ""

    # Remove unwanted elements (adjust as needed)
    for unwanted in main_content.select('.article-info-block, .disclaimer-text, .article-author, .article-dates'):
        unwanted.decompose()

    # Collect the text of every paragraph, collapsing whitespace inside each one.
    # Cleaning per paragraph keeps the newline separators intact; a global
    # re.sub(r'\s+', ' ', ...) would also swallow them.
    content = []
    for paragraph in main_content.find_all('p'):
        text = re.sub(r'\s+', ' ', paragraph.get_text()).strip()
        if text:
            content.append(text)

    return '\n'.join(content)

def extract_author_from_html(html: str) -> Optional[str]:
    soup = BeautifulSoup(html, 'html.parser')
    author_element = soup.select_one('.article-author__name .author-link')
    if author_element:
        return author_element.text.strip()
    return None

# `datetime` is imported above as the class, so `datetime.date` would be a method, not a type
from datetime import date

def extract_date_from_html(html: str) -> Optional[date]:
    soup = BeautifulSoup(html, 'html.parser')
    date_element = soup.select_one('.article-dates__published')
    if date_element:
        date_str = date_element.text.strip()
        try:
            # Assuming the date format is always DD/MM/YYYY
            return datetime.strptime(date_str, '%d/%m/%Y').date()
        except ValueError:
            print(f"Error parsing date: {date_str}")
            return None
    return None


def save(data):
    """Write the markdown string to output.md and to output.json."""
    with open('output.md', 'w', encoding='utf-8') as file:
        file.write(data)

    # Store the same markdown string as a single JSON string value
    with open('output.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

def format_data_dict_to_markdown(data_dict):
    """Render each non-empty field as a bold key followed by its value."""
    markdown_text = ""
    for key, value in data_dict.items():
        if value:
            markdown_text += f"**{key}:** {value}\n"
    return markdown_text

def to_markdown(text):
    """Display text as a markdown block quote, turning bullet characters into list items."""
    text = text.replace('•', '  *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

def extract_author(markdown: str):
    # Pattern to match Arabic names, potentially preceded by titles like "د."
    author_pattern = r'\*\s*(د\.\s*)?[\u0600-\u06FF\s]+\n'
    author_match = re.search(author_pattern, markdown)
    if author_match:
        author = author_match.group().strip('* \n')
        return author.strip()
    return None

def extract_date(text):
    # This is a simple date extraction. You might need to adjust it based on the actual date format in your content.
    date_match = re.search(r'\d{4}-\d{2}-\d{2}', text)
    if date_match:
        return datetime.strptime(date_match.group(), '%Y-%m-%d').date()
    return None


### *Key Functions*

- **`extract_main_content_from_html(html)`**: Extracts and cleans the main content from the HTML of the webpage.
- **`extract_date_from_html(html)`**: Extracts the publication date of the article.
- **`extract_author_from_html(html)`**: Extracts the author’s name from the article.
- **`format_data_dict_to_markdown(data_dict)`**: Formats the extracted data into markdown.
- **`save(data)`**: Writes the formatted data to `output.md` and `output.json`.
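The cleaning and date-parsing steps described above can be tried without any network access. The following is a minimal sketch using only the standard library; `normalize_whitespace`, `parse_published_date`, and the sample strings are hypothetical and exist only for illustration. It assumes, as the notebook does, that the published date is written as DD/MM/YYYY.

```python
import re
from datetime import datetime

def normalize_whitespace(text: str) -> str:
    """Collapse spaces and tabs inside each line and drop empty lines."""
    lines = (re.sub(r"[^\S\n]+", " ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)

def parse_published_date(date_str: str):
    """Parse a DD/MM/YYYY string; return None if it does not match."""
    try:
        return datetime.strptime(date_str.strip(), "%d/%m/%Y").date()
    except ValueError:
        return None

# Hypothetical inputs, not taken from the scraped page
sample = "  First   paragraph. \n\n\n Second\tparagraph.  "
print(normalize_whitespace(sample))
print(parse_published_date("09/03/2024"))
print(parse_published_date("2024-03-09"))  # wrong format, so None
```

Matching `[^\S\n]+` rather than `\s+` keeps the newline between paragraphs, so the cleaned text still has one paragraph per line.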


## *Setup*

1. **API Keys**: Obtain your API keys from [Anthropic](https://www.anthropic.com) and [Firecrawl](https://www.firecrawl.com).

2. **Environment Variables**: Create a `.env` file in your project directory and add your API keys as `ANTHROPIC_API_KEY` and `FIRECRAWL_API_KEY`. The cell below reads them:

In [None]:
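# Alternative to a .env file: set the keys here first (never commit real keys)
# os.environ["ANTHROPIC_API_KEY"] = '<anthropic_api_key>'
# os.environ["FIRECRAWL_API_KEY"] = '<firecrawl_api_key>'

# Read the keys from the environment (populated by load_dotenv() or the lines above)
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY") or ""
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY") or ""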

3. **Code Overview**: The code loads the API keys, initializes the clients, and scrapes the content from the specified URL. The data extracted includes the title, description, main content, author, and other metadata, which is then formatted and saved.
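Because a missing key only surfaces later as an authentication error, it can help to fail early when a variable is unset. The following is a minimal sketch; the helper `require_env` is hypothetical and not part of the notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file.")
    return value

# Usage (uncomment once the keys are configured):
# anthropic_api_key = require_env("ANTHROPIC_API_KEY")
# firecrawl_api_key = require_env("FIRECRAWL_API_KEY")
```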

## Example Usage

To scrape data from the provided Al Jazeera blog post:

In [None]:
url=u"https://www.aljazeera.net/opinions/2024/3/9/%D8%A7%D9%84%D8%B0%D9%83%D8%A7%D8%A1-%D8%A7%D9%84%D8%A7%D8%B5%D8%B7%D9%86%D8%A7%D8%B9%D9%8A-%D8%A3%D8%A8%D8%B9%D8%AF-%D9%85%D9%86-%D9%83%D9%88%D9%86%D9%87-%D9%85%D8%AC%D8%B1%D8%AF"
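The path of this URL is percent-encoded UTF-8, which hides the Arabic title. As a small sketch, the standard library's `urllib.parse` can decode it for logging or display and encode it back; the variable names here are illustrative.

```python
from urllib.parse import quote, unquote

encoded_url = (
    "https://www.aljazeera.net/opinions/2024/3/9/"
    "%D8%A7%D9%84%D8%B0%D9%83%D8%A7%D8%A1-"
    "%D8%A7%D9%84%D8%A7%D8%B5%D8%B7%D9%86%D8%A7%D8%B9%D9%8A-"
    "%D8%A3%D8%A8%D8%B9%D8%AF-%D9%85%D9%86-"
    "%D9%83%D9%88%D9%86%D9%87-%D9%85%D8%AC%D8%B1%D8%AF"
)

# Decode the percent-escapes into readable Arabic text
readable_url = unquote(encoded_url)
print(readable_url)

# Encode again, leaving ':' and '/' untouched so the URL structure survives
round_trip = quote(readable_url, safe=":/")
print(round_trip == encoded_url)
```

Keep passing the encoded form to the scraper; the decoded form is only for human-readable output.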

▶ Run the following cells:

In [8]:
client = anthropic.Anthropic(api_key=anthropic_api_key)
app = FirecrawlApp(api_key=firecrawl_api_key)
scrape_result = app.scrape_url(url, params={'formats': ['markdown', 'html']})

In [None]:
clean_content = ""  # stays empty if the scrape returned nothing
if scrape_result:
    print('Collecting data from crawl results:\n')
    html = scrape_result.get('html') or ""
    markdown = scrape_result.get('markdown')
    metadata = scrape_result.get('metadata') or {}
    srcurl = metadata.get('ogUrl')
    title = metadata.get('ogTitle')
    description = metadata.get('ogDescription')
    site = metadata.get('ogSiteName')
    clean_content = extract_main_content_from_html(html)
    date = extract_date_from_html(html)
    author = extract_author_from_html(html)
    data_dict = {
        'SourceURL': srcurl,
        'Title': title,
        'Description': description,
        # Keep None (not the string "None") so a missing date is skipped in the markdown
        'Date': str(date) if date else None,
        'Author': author,
        'Site': site,
        'MainContent': clean_content
    }
    data = format_data_dict_to_markdown(data_dict)
    save(data)
    print(f'Source URL: {data_dict["SourceURL"]}')
    print(f'Title: {data_dict["Title"]}')
    print(f'Description: {data_dict["Description"]}')
    print(f'Date: {data_dict["Date"]}')
    print(f'Author: {data_dict["Author"]}')
    print(f'Site: {data_dict["Site"]}')
    print('Main Content: \n')
# Last expression of the cell, so the notebook renders the cleaned article text
to_markdown(clean_content)

In [None]:
to_markdown(scrape_result.get('markdown'))

## *Additional Notes*

- **Error Handling**: Wrap the scrape request and the extraction steps in error handling. The request can fail, and a page may lack the selectors that the extraction functions expect.
- **Ethical Considerations**: Always respect the terms of service of the websites you scrape and seek permission where necessary.
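For the error-handling note, one common option is to retry a failing request with exponential backoff. The following is a minimal sketch, not the notebook's own method; the helper `retry` and its parameters are hypothetical, and the `sleep` argument exists only so the wait can be replaced in tests.

```python
import time

def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n and try again.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Usage with the notebook's scrape call (commented out because it needs a valid key):
# scrape_result = retry(lambda: app.scrape_url(url, params={'formats': ['markdown', 'html']}))
```

Narrowing `exceptions` to the errors you expect from the network layer avoids retrying on programming mistakes.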
