<a href="https://colab.research.google.com/github/Tamsir123/API-RESTfull-Spring-Java/blob/main/Analyse_concurrentielle_Gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated competitive analysis with Gemini

This notebook shows how to scrape competitor websites, estimate the cost of analyzing the scraped content, and extract key information (pricing, plans) with a Gemini AI agent.

**Pipeline:**
1. Ethical scraping of competitor sites (BeautifulSoup, Firecrawl, Jina AI)
2. Analysis cost estimation
3. Extraction of strategic information with Gemini
4. Display of the results

## 1. Define the competitor sites to analyze

In [38]:
competitor_sites = [
    {"name": "Articulate 360 by Adobe", "url": "https://www.articulate.com/360/pricing/freelancers"},
    {"name": "7taps", "url": "https://www.7taps.com/pricing"},
    {"name": "Mindsmith AI", "url": "https://www.mindsmith.ai/pricing"},
    {"name": "Cards-microlearning", "url": "https://www.cards-microlearning.com/en/tarifs"},
]

## 2. Scraping the competitor sites
We use several methods to retrieve the page content:
- BeautifulSoup (classic scraping)
- Firecrawl (Mendable API)
- Jina AI (Reader API)
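
The pipeline names ethical scraping as its first step. One common ingredient is checking a site's `robots.txt` before fetching a page. The sketch below uses the standard library's `urllib.robotparser` and is not part of the original notebook: the helper `is_allowed` and the sample rules are hypothetical, and a real run would first download `robots.txt` from each competitor site.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules, for illustration only.
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/pricing"))
print(is_allowed(sample_rules, "https://example.com/private/data"))
```

A scraper function could call such a check first and skip any URL the rules disallow.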

In [39]:
# Install the required dependencies
!pip install requests beautifulsoup4 firecrawl-py tqdm prettytable --quiet
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
from prettytable import PrettyTable, ALL
import firecrawl
import getpass

In [40]:
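def beautiful_soup_scrape_url(url: str) -> str:
    # Fetch the page and return the parsed HTML as a string.
    # HTTP error pages (e.g. 403) are returned as content, not raised.
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)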
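
Returning `str(soup)` keeps the whole HTML document, including `<script>` and `<style>` blocks, which inflates the token count sent to the model. The sketch below shows one way to keep only the visible text. It uses the standard library's `html.parser` as a dependency-free stand-in, so it is not the notebook's own method; the names `VisibleTextParser` and `html_to_visible_text` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def html_to_visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample_html = (
    "<html><head><title>Pricing</title><script>var x = 1;</script></head>"
    "<body><h1>Plans</h1><p>Pro: $20/month</p></body></html>"
)
print(html_to_visible_text(sample_html))  # Pricing Plans Pro: $20/month
```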


In [41]:
def scrape_jina_ai(url: str) -> str:
    # Prefix the target URL with the Jina AI Reader endpoint and return the response text.
    response = requests.get("https://r.jina.ai/" + url, timeout=30)
    return response.text

In [42]:
import firecrawl

FIRECRAWL_API_KEY = getpass.getpass("Mendable API Key")

def scrape_firecrawl(url: str):
    # Scrape the page with Firecrawl and keep only its "markdown" field.
    app = firecrawl.FirecrawlApp(api_key=FIRECRAWL_API_KEY)
    scraped_data = app.scrape_url(url)["markdown"]
    return scraped_data



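## 3. Analysis cost estimation (tokens)
We estimate the analysis cost of each piece of scraped content from its token count. The count uses tiktoken's `cl100k_base` encoding, an OpenAI tokenizer, so it only approximates the number of tokens Gemini would count.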

In [43]:
!pip install tiktoken --quiet
import tiktoken

def count_tokens(input_string: str) -> int:
    # Tokenize with the cl100k_base encoding and return the number of tokens.
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(input_string)
    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 5) -> float:
    # Cost in dollars: (tokens / 1,000,000) * price per million tokens.
    num_tokens = count_tokens(input_string)
    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens
    return total_cost
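
To make the formula concrete, here is a small worked example of the same arithmetic, separated from the tokenizer. The helper `cost_from_tokens` is hypothetical and the token counts are made up for illustration; the default price of 5 per million tokens mirrors the default in `calculate_cost`.

```python
def cost_from_tokens(num_tokens: int, cost_per_million_tokens: float = 5.0) -> float:
    """Apply the same formula as calculate_cost to a known token count."""
    return (num_tokens / 1_000_000) * cost_per_million_tokens

# Illustrative token counts, not measurements from the scraped pages.
for num_tokens in (1_000, 12_000, 250_000):
    print(f"{num_tokens:>7} tokens -> ${cost_from_tokens(num_tokens):.6f}")
```

For example, 12,000 tokens at 5 per million gives 12,000 / 1,000,000 × 5 = 0.06.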

## 4. Scrape all sites and display a comparison table

In [44]:
list_of_scraper_functions = [
    {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
    # {"name": "Firecrawl", "function": scrape_firecrawl},
    {"name": "Jina AI", "function": scrape_jina_ai}
]

def view_scraped_content(scrape_url_functions, sites_list, characters_to_display=500, table_max_width=50):
    content_table_headers = ["Site Name"] + [f'{func["name"]} content' for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f'{func["name"]} cost' for func in scrape_url_functions]
    content_table = PrettyTable()
    content_table.field_names = content_table_headers
    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers
    scraped_data = []
    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}
        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            try:
                content = scrape_function['function'](site['url'])
                content_snippet = content[:characters_to_display]
                content_row.append(content_snippet)
                cost = calculate_cost(content)
                cost_row.append(f'${cost:.6f}')
                site_data["sites"].append({"name": function_name, "content": content})
            except Exception as e:
                error_message = f'Error: {str(e)}'
                content_row.append(error_message)
                cost_row.append('Error')
                site_data["sites"].append({"name": function_name, "content": error_message})
        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)
    content_table.max_width = table_max_width
    content_table.hrules = ALL
    cost_table.max_width = table_max_width
    cost_table.hrules = ALL
    print("Content Table:")
    print(content_table)
    print("Cost Table:")
    print(cost_table)
    return scraped_data

all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 700, 20)

Content Table:
+----------------------+------------------------+----------------------+
|      Site Name       | Beautiful Soup content |   Jina AI content    |
+----------------------+------------------------+----------------------+
|  Articulate 360 by   |         <html>         |  Title: Pricing for  |
|        Adobe         | <head><title>403 Forbi |   Articulate 360 |   |
|                      |  dden</title></head>   |      Articulate      |
|                      |         <body>         |                      |
|                      | <center><h1>403 Forbid | URL Source: https:// |
|                      |   den</h1></center>    | www.articulate.com/3 |
|                      | <hr/><center>nginx</ce | 60/pricing/freelance |
|                      |         nter>          |          rs          |
|                      |        </body>         |                      |
|                      |        </html>         |  Markdown Content:   |
|                      |            
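
The content table shows that a scraper can return an error page (here a 403 from nginx) as if it were real content. The sketch below filters such entries out of the structure returned by `view_scraped_content` before they are sent to the model. It is a heuristic, not part of the original notebook: the helper names and the `ERROR_MARKERS` list are assumptions, with markers taken from the error pages visible in this notebook's output.

```python
from typing import Any, Dict, List

# Substrings seen in this notebook's error pages. Heuristic assumption,
# not an exhaustive error detector.
ERROR_MARKERS = ("403 Forbidden", "504: Gateway time-out")

def is_usable(content: str) -> bool:
    """Return False for scraper exceptions and for known error pages."""
    if content.startswith("Error:"):  # prefix written by view_scraped_content
        return False
    head = content[:500]  # only inspect the start of the content
    return not any(marker in head for marker in ERROR_MARKERS)

def filter_scraped_data(scraped_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only usable entries and drop providers left with no content."""
    filtered = []
    for site_data in scraped_data:
        usable = [s for s in site_data["sites"] if is_usable(s["content"])]
        if usable:
            filtered.append({"provider": site_data["provider"], "sites": usable})
    return filtered

# Made-up sample in the same shape as the scraped_data list.
sample = [
    {"provider": "Example A", "sites": [
        {"name": "Beautiful Soup",
         "content": "<html><head><title>403 Forbidden</title></head></html>"},
        {"name": "Jina AI",
         "content": "Title: Pricing\n\nMarkdown Content:\nPro plan: $20"},
    ]},
    {"provider": "Example B", "sites": [
        {"name": "Jina AI", "content": "Error: connection refused"},
    ]},
]
print(filter_scraped_data(sample))
```

Filtering before extraction avoids paying for model calls whose only possible answer is that no pricing information is present.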

## 5. Extracting key information with Gemini
We use Gemini to extract strategic information (e.g. pricing) from the scraped content.

In [45]:
# Use the official Gemini SDK (Google Generative AI)
!pip install google-generativeai --quiet
import google.generativeai as genai
import getpass
GEMINI_API_KEY = getpass.getpass('Enter your Gemini API key: ')
genai.configure(api_key=GEMINI_API_KEY)

# Create the Gemini model
model = genai.GenerativeModel('gemini-2.0-flash')  # or 'gemini-1.5-pro', depending on your access

def extract_with_gemini_agent(user_input: str):
    # The prompt is written in French. In English: "Analyze the following content and
    # extract the pricing information as JSON: {...}. Content:"
    prompt = (
        "Analyse le contenu suivant et extrait les informations de tarification sous forme de JSON : "
        "{cheapest: {name: str, price: float}, middle: {name: str, price: float}, most_expensive: {name: str, price: float}}.\n"
        "Contenu :\n" + user_input
    )
    response = model.generate_content(prompt)
    return response.text
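
`extract_with_gemini_agent` returns the reply as raw text, which may wrap the JSON in a Markdown code fence or surround it with prose. The sketch below shows one way to recover a dictionary from such a reply using only the standard library. The helper `parse_json_response` is hypothetical, and the sample reply is made up for illustration rather than taken from a real model response.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # Markdown code-fence delimiter

def parse_json_response(text: str) -> Optional[dict]:
    """Return the first JSON object found in a model reply, or None."""
    # Prefer the body of a fenced block if the reply contains one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the outermost braces in the candidate text.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Made-up reply, for illustration only.
sample_reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"cheapest": {"name": "Free", "price": 0.0}}\n'
    + FENCE
)
print(parse_json_response(sample_reply))
print(parse_json_response("No pricing information in the content."))
```

Returning `None` instead of raising lets the caller decide how to report sites for which no structured pricing could be extracted.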



## 6. Displaying the results extracted by Gemini

In [46]:
from typing import Any, Dict, List
from prettytable import PrettyTable, ALL
from tqdm import tqdm

def display_gemini_extracted_content(results: List[Dict[str, Any]], num_objects: int):
    table = PrettyTable()
    table.field_names = ["Site", "Provider Name", "Extracted Content"]

    # Ensure num_objects does not exceed the length of the results list
    num_objects = min(num_objects, len(results))

    # Process the specified number of items from the results list with a progress bar
    for result in tqdm(results[:num_objects], desc="Processing results"):
        provider_name = result["provider"]

        for site in result["sites"]:
            function_name = site["name"]
            content = site["content"]

            # Print the content being passed to the Gemini agent for inspection
            print(f"Content being passed to Gemini agent for {provider_name} ({function_name}):\n{content[:500]}...") # Print first 500 characters

            # Progress bar for each function
            for _ in tqdm(range(1), desc=f"Extracting content with {provider_name} for {function_name}"):
                extracted_content = extract_with_gemini_agent(content)
                table.add_row([provider_name, function_name, extracted_content])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = ALL

    print("Extracted Content Table:")
    print(table)

In [47]:
display_gemini_extracted_content(all_content, num_objects=9)

Processing results:   0%|          | 0/4 [00:00<?, ?it/s]

Content being passed to Gemini agent for Articulate 360 by Adobe (Beautiful Soup):
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr/><center>nginx</center>
</body>
</html>
...



Extracting content with Articulate 360 by Adobe for Beautiful Soup: 100%|██████████| 1/1 [00:04<00:00,  4.62s/it]


Content being passed to Gemini agent for Articulate 360 by Adobe (Jina AI):
Title: Pricing for Articulate 360 | Articulate

URL Source: https://www.articulate.com/360/pricing/freelancers

Markdown Content:
Pricing for Articulate 360 | Articulate


[Skip to main content](https://www.articulate.com/360/pricing/freelancers#content)

[![Image 4: Articulate](https://www.articulate.com/wp-content/uploads/2023/06/articulate-logo.svg)](https://www.articulate.com/)

*   [Articulate 360 Platform](https://www.articulate.com/360/pricing/freelancers#)
    *   Our Pla...



Extracting content with Articulate 360 by Adobe for Jina AI: 100%|██████████| 1/1 [00:02<00:00,  2.37s/it]
Processing results:  25%|██▌       | 1/4 [00:07<00:21,  7.01s/it]

Content being passed to Gemini agent for 7taps (Beautiful Soup):
<!DOCTYPE html>
<!-- Last Published: Mon Oct 20 2025 13:54:59 GMT+0000 (Coordinated Universal Time) --><html data-wf-domain="www.7taps.com" data-wf-page="6210eab977ab3b4dedf4d5f1" data-wf-site="6040dae83112172e5cc9b16f" lang="en-US"><head><meta charset="utf-8"/><title>See our pricing plans | 7taps microlearning app</title><meta content="See detailed pricing plans for 7taps Microlearning app here. Enterprise plan free trial available, big savings in yearly plans. Get started today!" name="descrip...



Extracting content with 7taps for Beautiful Soup: 100%|██████████| 1/1 [00:01<00:00,  1.69s/it]


Content being passed to Gemini agent for 7taps (Jina AI):
Title: See our pricing plans | 7taps microlearning app

URL Source: https://www.7taps.com/pricing

Published Time: Mon, 20 Oct 2025 14:04:59 GMT

Markdown Content:
7taps Free

Create high-impact training that delivers 84% better skill retention.

World's simplest course creator

info

AI microlearning designer

info

Instant content converter

info

Share courses with passwordless links

info

7taps Enterprise

Transform L&D into a strategic powerhouse driving measurable business results. Truste...



Extracting content with 7taps for Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.91s/it]
Processing results:  50%|█████     | 2/4 [00:10<00:10,  5.01s/it]

Content being passed to Gemini agent for Mindsmith AI (Beautiful Soup):
<!DOCTYPE html>

<!-- ✨ Built with Framer • https://www.framer.com/ -->
<html lang="en">
<head>
<meta charset="utf-8"/>
<script>try{if(localStorage.get("__framer_force_showing_editorbar_since")){const n=document.createElement("link");n.rel = "modulepreload";n.href="https://framer.com/edit/init.mjs";document.head.appendChild(n)}}catch(e){}</script>
<!-- Start of headStart -->
<!-- End of headStart -->
<meta content="width=device-width" name="viewport"/>
<meta content="Framer 98b71dc" name="genera...



Extracting content with Mindsmith AI for Beautiful Soup: 100%|██████████| 1/1 [00:17<00:00, 17.39s/it]


Content being passed to Gemini agent for Mindsmith AI (Jina AI):
Title: Pricing

URL Source: https://www.mindsmith.ai/pricing

Published Time: Fri, 17 Oct 2025 02:02:06 GMT

Markdown Content:
Pricing


[Featured in Markus Bernhardt’s 2025 L&D Report (Case 2)](https://www.endeavorintel.com/endeavor-report?utm_source=mindsmith_site&utm_medium=referral&utm_campaign=endeavor_report)

[![Image 1](https://framerusercontent.com/images/NkEnv4jOiY2IelZ5PWfEKWF9mHI.svg)](https://www.mindsmith.ai/)

[Pricing](https://www.mindsmith.ai/pricing)

[About](ht...



Extracting content with Mindsmith AI for Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.48s/it]
Processing results:  75%|███████▌  | 3/4 [00:29<00:11, 11.34s/it]

Content being passed to Gemini agent for Cards-microlearning (Beautiful Soup):
<!DOCTYPE html>
<!-- Last Published: Thu Oct 16 2025 07:31:00 GMT+0000 (Coordinated Universal Time) --><html data-wf-domain="www.cards-microlearning.com" data-wf-page="67766b5c3ee9c1f1faa2d8b7" data-wf-site="67766b5c3ee9c1f1faa2d715" lang="en-US"><head><meta charset="utf-8"/><title>Tarifs Cards micro-learning | IA intégrée + Moteur d'ancrage</title><link href="https://www.cards-microlearning.com/tarifs" hreflang="x-default" rel="alternate"/><link href="https://www.cards-microlearning.com/tarifs"...



Extracting content with Cards-microlearning for Beautiful Soup: 100%|██████████| 1/1 [01:43<00:00, 103.91s/it]


Content being passed to Gemini agent for Cards-microlearning (Jina AI):
Title: microlearning.com | 504: Gateway time-out

URL Source: https://www.cards-microlearning.com/en/tarifs


Markdown Content:
Gateway time-out Error code 504
-------------------------------

Visit [cloudflare.com](https://www.cloudflare.com/5xx-error-landing?utm_source=errorcode_504&utm_campaign=www.cards-microlearning.com) for more information.

2025-10-20 16:05:03 UTC

You

### Browser

Working

Ashburn

### [Cloudflare](https://www.clo...



Extracting content with Cards-microlearning for Jina AI: 100%|██████████| 1/1 [00:01<00:00,  1.78s/it]
Processing results: 100%|██████████| 4/4 [02:15<00:00, 33.80s/it]

Extracted Content Table:
+-------------------------+----------------+----------------------------------------------------+
|           Site          | Provider Name  |                 Extracted Content                  |
+-------------------------+----------------+----------------------------------------------------+
| Articulate 360 by Adobe | Beautiful Soup |  Il n'y a aucune information de tarification dans  |
|                         |                |   le contenu fourni.  Le contenu HTML montre une   |
|                         |                |  erreur 403 Forbidden, indiquant que l'accès à la  |
|                         |                |  ressource est interdit.  Je ne peux pas extraire  |
|                         |                |              d'informations de prix.               |
|                         |                |                                                    |
|                         |                | Par conséquent, je ne peux pas créer un JSON ave




---
**Note:** Adapt the `extract_with_gemini_agent` function to the official Gemini API documentation.

This notebook is ready to use for automated competitive analysis with Gemini.