# Website Report Generator with Gemini AI

An intelligent tool that automatically generates professional company reports by scraping websites and using AI to identify relevant content.

## What This Does

This notebook:
1. Scrapes website content using BeautifulSoup
2. Uses Gemini AI to intelligently select relevant pages
3. Aggregates content from multiple pages
4. Generates professional reports automatically
5. Supports streaming output for real-time display

**Model:** Gemini Flash (free tier compatible)


## Setup and Imports

Install required packages:
```bash
pip install python-dotenv requests beautifulsoup4 google-generativeai
```


In [1]:
# imports
# If these fail, please check you're running from an 'activated' environment

import os                # Environment variables access
import requests          # HTTP requests for web scraping
import json              # JSON parsing and formatting
from typing import List  # Type hints
from dotenv import load_dotenv  # Load API keys from .env file
from bs4 import BeautifulSoup   # HTML parsing and content extraction
from IPython.display import Markdown, display, update_display  # Notebook display utilities
import google.generativeai as genai  # Google Gemini AI

## Model Configuration

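The next cell sets a default Gemini Flash model name (free tier compatible). The initialization cell further down overrides it with `gemini-2.0-flash` and wraps model creation in a `try`/`except` so that an alternative name can be used if creation fails.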
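
A fallback across several model names could be sketched as below. This is a minimal illustration, not the notebook's own code: `first_working_model` and the `probe` callable are hypothetical names. In practice `probe` would probably need to send a small request, since creating a model object may not contact the API at all.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def first_working_model(
    candidates: Iterable[str],
    probe: Callable[[str], T],
) -> Tuple[str, T]:
    """Return (name, probe_result) for the first candidate whose probe succeeds.

    `probe` is a hypothetical callable supplied by the caller; it should raise
    an exception when the given model name cannot be used.
    """
    last_error: Optional[Exception] = None
    for name in candidates:
        try:
            return name, probe(name)
        except Exception as exc:  # remember the failure and try the next name
            last_error = exc
            print(f"Model {name} failed: {exc}")
    raise RuntimeError("No candidate model worked") from last_error
```

With the names used in this notebook, the call might look like `first_working_model(["gemini-2.0-flash", "gemini-flash-latest"], probe)`.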


In [2]:
# Default model name (free tier compatible); the initialization cell below overrides it
MODEL = 'gemini-flash-latest'
print(f"Note: Using free tier model: {MODEL}")


Note: Using free tier model: gemini-flash-latest


## Website Scraper Class

The `Website` class scrapes a webpage and extracts:
- Page title
- Clean text content (removes scripts, styles, images)
- All links found on the page


In [3]:
# Initialize and constants

load_dotenv(override=True)
api_key = os.getenv('GEMINI_API_KEY')  
if api_key and len(api_key) > 10:
    print("API key looks good so far")
    genai.configure(api_key=api_key)
else:
    print("There might be a problem with your API key? Please check your GEMINI_API_KEY environment variable!")
    
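# Pick a model, falling back to an alternative name if creation fails.
# Note: the constructor may not contact the API, so a bad name can also
# surface later, on the first generate_content call.
MODEL = 'gemini-2.0-flash'
try:
    model = genai.GenerativeModel(MODEL)
    print(f"Using model: {MODEL}")
except Exception as e:
    print(f"Error with {MODEL}: {e}")
    # Fall back to the alias set in the Model Configuration cell
    MODEL = 'gemini-flash-latest'
    model = genai.GenerativeModel(MODEL)
    print(f"Using fallback model: {MODEL}")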

API key looks good so far
Using model: gemini-2.0-flash


## Test the Scraper

Let's test the scraper on a sample website to see what links we can extract.


In [4]:
# A class to represent a Webpage


headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

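    def __init__(self, url):
        self.url = url
        # A timeout keeps a slow or unresponsive site from hanging the notebook
        response = requests.get(url, headers=headers, timeout=30)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        # soup.title.string can be None (e.g. an empty <title>), so check both
        self.title = soup.title.string if soup.title and soup.title.string else "No title found"
        if soup.body:
            # Remove elements that contribute no readable text
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        # Keep only anchors that actually carry an href
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]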

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

## Link Selection Prompt

This prompt tells Gemini how to identify which links are relevant for a company report (About pages, Careers, etc.).
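
Because the prompt only asks for JSON by example, the reply is worth checking before use. The sketch below assumes the reply follows the `{"links": [{"type": ..., "url": ...}]}` shape from the prompt; `parse_links_response` is a hypothetical helper, not part of the notebook.

```python
import json
from typing import Dict, List

FENCE = "`" * 3  # Markdown code-fence marker


def parse_links_response(text: str) -> List[Dict[str, str]]:
    """Parse a model reply and return only well-formed {"type", "url"} entries."""
    cleaned = text.strip()
    # Some replies arrive wrapped in a Markdown code fence; drop the fence lines.
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # skip the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # skip the closing fence line
        cleaned = "\n".join(lines)
    payload = json.loads(cleaned)
    links = payload.get("links") if isinstance(payload, dict) else None
    if not isinstance(links, list):
        raise ValueError('expected a JSON object with a "links" list')
    valid = []
    for item in links:
        if (
            isinstance(item, dict)
            and isinstance(item.get("type"), str)
            and isinstance(item.get("url"), str)
        ):
            valid.append({"type": item["type"], "url": item["url"]})
    return valid
```

Entries missing either key are skipped rather than raising, so one malformed item does not discard the rest of the selection.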


In [5]:
atmos = Website("https://atmostech.ma")
atmos.links

['https://atmostech.ma/contact',
 'https://atmostech.ma/index.php?route=account/account',
 'https://atmostech.ma/register',
 'https://atmostech.ma/login',
 'https://atmostech.ma/index.php?route=account/wishlist',
 'https://atmostech.ma/cart',
 'https://atmostech.ma/index.php?route=checkout/checkout',
 'https://atmostech.ma/',
 'https://atmostech.ma/cartes-developpement',
 'https://atmostech.ma/cartes-developpement/arduino',
 'https://atmostech.ma/cartes-developpement/attiny',
 'https://atmostech.ma/cartes-developpement/esp',
 'https://atmostech.ma/cartes-developpement/fpga',
 'https://atmostech.ma/cartes-developpement/raspberry-maroc',
 'https://atmostech.ma/cartes-developpement/STM',
 'https://atmostech.ma/cartes-developpement',
 'https://atmostech.ma/kit-atmostech',
 'https://atmostech.ma/kit-atmostech/kit-capteur',
 'https://atmostech.ma/kit-atmostech/kit-developpement',
 'https://atmostech.ma/kit-atmostech/kit-raspberry',
 'https://atmostech.ma/kit-atmostech/kit-robot',
 'https://a

In [6]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a report about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

## Get Relevant Links Function

This function uses Gemini to select the most relevant links from the list found on a website's page. The accompanying user prompt asks it to leave out Terms of Service, Privacy, and email links so that the selection stays focused on company-relevant content.
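
The notebook leaves link filtering to the model. A local pre-filter could complement that by resolving relative links and dropping obvious non-candidates before the prompt is built. The sketch below is an assumption, not the notebook's method: `prefilter_links` and the `SKIP_WORDS` set are hypothetical and would need tuning per site.

```python
import re
from typing import Iterable, List
from urllib.parse import urljoin, urlparse

# Hypothetical word list; whole-word matching avoids dropping e.g. "cartes" for "cart".
SKIP_WORDS = {"privacy", "terms", "login", "register", "account", "cart", "checkout"}


def prefilter_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve relative links, drop non-web and account/legal links, and dedupe."""
    seen = set()
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # mailto:, tel:, javascript:, ...
        # Split path and query into lowercase words and compare whole words only.
        words = set(re.split(r"[^a-z0-9]+", f"{parsed.path} {parsed.query}".lower()))
        if words & SKIP_WORDS:
            continue
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

Sending fewer links also shortens the prompt, which may matter on pages with very large navigation menus.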


In [7]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a report about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}



## Collect All Page Details

This function scrapes the landing page and all relevant linked pages to gather comprehensive information about the company.
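
The sample outputs in this notebook show the model sometimes returning the homepage itself, and in one run the contact page twice. Removing such repeats before scraping avoids fetching the same page more than once. A minimal sketch follows; `unique_links` is a hypothetical helper.

```python
from typing import Dict, List


def unique_links(landing_url: str, links: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop repeated URLs and any entry that points back at the landing page."""

    def key(url: str) -> str:
        # Treat "https://site/" and "https://site" as the same page.
        return url.rstrip("/")

    seen = {key(landing_url)}
    result = []
    for link in links:
        k = key(link["url"])
        if k not in seen:
            seen.add(k)
            result.append(link)
    return result
```

The comparison is deliberately simple: it ignores trailing slashes but keeps case and query strings, since those can identify different pages.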


In [8]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a report about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

## Report Generation System

This is the main system prompt that guides Gemini in creating company reports. It can be modified to change the tone (professional, humorous, etc.).
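
One way to make the tone adjustable without editing the prompt by hand is to build it from a parameter. The helper below is a hypothetical sketch (`make_system_prompt` is not defined in the notebook); the wording mirrors the notebook's system prompt.

```python
def make_system_prompt(tone: str = "") -> str:
    """Build the report system prompt, optionally inserting a tone such as 'humorous'."""
    style = f"short {tone} report" if tone else "short report"
    return (
        "You are an assistant that analyzes the contents of several relevant pages "
        f"from a company website and creates a {style} about the company for "
        "prospective customers, investors and recruits. Respond in markdown. "
        "Include details of company culture, customers and careers/jobs "
        "if you have the information."
    )
```

For example, `make_system_prompt("humorous")` yields the jokey variant, while `make_system_prompt()` gives the neutral one.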


In [9]:
print(get_links_user_prompt(atmos))

Here is the list of links on the website of https://atmostech.ma - please decide which of these are relevant web links for a report about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://atmostech.ma/contact
https://atmostech.ma/index.php?route=account/account
https://atmostech.ma/register
https://atmostech.ma/login
https://atmostech.ma/index.php?route=account/wishlist
https://atmostech.ma/cart
https://atmostech.ma/index.php?route=checkout/checkout
https://atmostech.ma/
https://atmostech.ma/cartes-developpement
https://atmostech.ma/cartes-developpement/arduino
https://atmostech.ma/cartes-developpement/attiny
https://atmostech.ma/cartes-developpement/esp
https://atmostech.ma/cartes-developpement/fpga
https://atmostech.ma/cartes-developpement/raspberry-maroc
https://atmostech.ma/cartes-developpement/STM
https://atmostech.ma/cartes-developpement
https://atmostech.ma/kit-atmoste

## Generate Company Report

This function creates a complete report by:
1. Scraping the website for relevant information
2. Feeding it to Gemini with instructions to create a report
3. Displaying the result as formatted markdown

Try it with your own company website!
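
The three steps above can be sketched with the scraping and model calls passed in as plain functions, which makes the flow easy to try without network access. `build_report`, `gather`, and `generate` are hypothetical names. In the notebook, `gather` would correspond to `get_all_details` and `generate` to a thin wrapper around `model.generate_content(...).text`.

```python
from typing import Callable


def build_report(
    company_name: str,
    url: str,
    gather: Callable[[str], str],
    generate: Callable[[str], str],
    max_chars: int = 50_000,
) -> str:
    """Gather page text, cap the prompt size, and ask the model for a report."""
    prompt = (
        f"You are looking at a company called: {company_name}\n"
        "Use the following page contents to build a short report in markdown.\n"
        + gather(url)
    )
    # Truncate by characters so very large sites cannot produce an oversized prompt.
    return generate(prompt[:max_chars])
```

The returned string can then be shown with `display(Markdown(...))`, as the notebook does.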


In [10]:
def get_links(url):
    website = Website(url)
    
    # Combine system prompt and user prompt for Gemini
    full_prompt = link_system_prompt + "\n\n" + get_links_user_prompt(website)
    
    # Configure generation for JSON response
    try:
        # Try with response_mime_type for newer models
        generation_config = genai.types.GenerationConfig(
            temperature=0.1,
            response_mime_type="application/json"
        )
        response = model.generate_content(
            full_prompt,
            generation_config=generation_config
        )
    except Exception:
        # Fall back to a plain request and ask for JSON in the prompt instead
        generation_config = genai.types.GenerationConfig(
            temperature=0.1
        )
        full_prompt_with_json = full_prompt + "\n\nRespond ONLY with valid JSON, no other text."
        response = model.generate_content(
            full_prompt_with_json,
            generation_config=generation_config
        )
    
    result = response.text
    return json.loads(result)

## Streamed Report Generation

This version streams the report output chunk by chunk as the model produces it, updating the display after each chunk for a more dynamic experience.
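
The accumulate-and-redisplay pattern behind streaming can be shown without calling the API. In the sketch below, `accumulate_stream` is a hypothetical helper that takes any iterable of text chunks. In the notebook the chunks would come from `chunk.text` on the streamed response.

```python
from typing import Iterable, Iterator

FENCE = "`" * 3  # Markdown code-fence marker


def accumulate_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the cleaned report text so far after each non-empty chunk."""
    full = ""
    for piece in chunks:
        if not piece:
            continue  # skip empty chunks
        full += piece
        # Remove fence markers only, leaving the word "markdown" in the body intact.
        yield full.replace(FENCE + "markdown", "").replace(FENCE, "")
```

Each yielded value is the whole text so far, which is what `update_display` needs in order to redraw the Markdown output in place.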


In [11]:
get_links("https://www.atmostech.ma/")

{'links': [{'type': 'contact page', 'url': 'https://atmostech.ma/contact'},
  {'type': 'about us page',
   'url': 'https://atmostech.ma/%C3%A0-propos-de-nous'}]}

In [12]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [13]:
print(get_all_details("https://atmostech.ma/"))

Found links: {'links': [{'type': 'contact page', 'url': 'https://atmostech.ma/contact'}, {'type': 'homepage', 'url': 'https://atmostech.ma/'}, {'type': 'about page', 'url': 'https://atmostech.ma/%C3%A0-propos-de-nous'}]}
Landing page:
Webpage Title:
Arduino Raspberry-pi & Composants électroniques | Maroc | Atmostech
Webpage Contents:
+212674744471 / +212664166235
Compte
Inscription
Connexion
Liste de souhaits (0)
Panier
Commander
0 article(s) - 0,00DH
Votre panier est vide !
Catégories
Cartes de développement
Arduino
Attiny
ESP
FPGA
Raspberry
STM
Show All Cartes de développement
Kit Atmostech
Kit Capteur
Kit De développement
Kit Raspberry
Kit Robot
Show All Kit Atmostech
Modules , capteurs , Shields
CNC et Impression-3D
Composants / Circuits intégrés
Robotique , Outils
Matériel de prototypage
Matériel de soudage
Moteur , Drivers
Robotique
Show All Robotique , Outils
Connecteurs ,  Adaptateurs
OPto-Electronique
Afficheurs
LED
Show All OPto-Electronique
Outillage , Accessoires
Audio , So

In [14]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short report about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous report - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey report about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."


In [15]:
def get_report_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short report of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:50_000]  # Truncate to 50,000 characters (Gemini's large context window allows a generous limit)
    return user_prompt

In [16]:
get_report_user_prompt('atmostech', 'https://atmostech.ma/')

Found links: {'links': [{'type': 'contact page', 'url': 'https://atmostech.ma/contact'}, {'type': 'homepage', 'url': 'https://atmostech.ma/'}, {'type': 'about page', 'url': 'https://atmostech.ma/%C3%A0-propos-de-nous'}]}


'You are looking at a company called: atmostech\nHere are the contents of its landing page and other relevant pages; use this information to build a short report of the company in markdown.\nLanding page:\nWebpage Title:\nArduino Raspberry-pi & Composants électroniques | Maroc | Atmostech\nWebpage Contents:\n+212674744471 / +212664166235\nCompte\nInscription\nConnexion\nListe de souhaits (0)\nPanier\nCommander\n0 article(s) - 0,00DH\nVotre panier est vide !\nCatégories\nCartes de développement\nArduino\nAttiny\nESP\nFPGA\nRaspberry\nSTM\nShow All Cartes de développement\nKit Atmostech\nKit Capteur\nKit De développement\nKit Raspberry\nKit Robot\nShow All Kit Atmostech\nModules , capteurs , Shields\nCNC et Impression-3D\nComposants / Circuits intégrés\nRobotique , Outils\nMatériel de prototypage\nMatériel de soudage\nMoteur , Drivers\nRobotique\nShow All Robotique , Outils\nConnecteurs ,  Adaptateurs\nOPto-Electronique\nAfficheurs\nLED\nShow All OPto-Electronique\nOutillage , Accessoire

In [17]:
def create_report(company_name, url):
    # Combine system prompt and user prompt for Gemini
    full_prompt = system_prompt + "\n\n" + get_report_user_prompt(company_name, url)
    
    response = model.generate_content(full_prompt)
    result = response.text
    display(Markdown(result))

In [18]:
create_report('atmostech','https://atmostech.ma/')

Found links: {'links': [{'type': 'contact page', 'url': 'https://atmostech.ma/contact'}, {'type': 'homepage', 'url': 'https://atmostech.ma/'}, {'type': 'about page', 'url': 'https://atmostech.ma/%C3%A0-propos-de-nous'}]}


# Atmostech: Company Report

Atmostech is a Moroccan company founded in 2019 that specializes in the distribution, sale, installation, and creation of equipment for industry, education, and training. They also offer engineering services in industrial automation and industrial computer systems, industrial maintenance (electrical, mechanical, hydraulic, pneumatic, and electronic), and import didactic and industrial equipment. They are located in Casablanca.

**What Atmostech Offers:**

*   **Products:** A wide range of electronic components, development boards (Arduino, Raspberry Pi, etc.), modules, sensors, CNC and 3D printing supplies, robotics tools, prototyping materials, and more.
*   **Services:**
    *   Engineering in industrial automation and industrial computer systems.
    *   Industrial maintenance (electrical, mechanical, hydraulic, pneumatic, and electronic).
    *   Import of didactic and industrial equipment.

**Customers:**

Atmostech serves a diverse customer base, including:

*   Individuals (hobbyists, students)
*   Educational institutions
*   Industrial clients

**Company Values & Philosophy:**

Atmostech is committed to:

*   Delivering high-quality services.
*   Ensuring customer satisfaction.
*   Maintaining values of:
    *   Expertise and Quality
    *   Commitment
    *   Availability
    *   Responsibility
    *   Trust
    *   Proximity

**Contact Information:**

*   **Phone:** +212674744471 / +212664166235
*   **Address:** N° 5, Rue Hassan Elbasri -ex César Franck ,quartier: Belvédère - Casablanca
*   **Opening Hours:** 8:00 AM - 6:00 PM

**Note:** No information about company culture or job openings was found in the provided documents.


In [19]:
def stream_report(company_name, url):
    # Combine system prompt and user prompt for Gemini
    full_prompt = system_prompt + "\n\n" + get_report_user_prompt(company_name, url)
    
    # Generate streaming response
    response = model.generate_content(full_prompt, stream=True)
    
    full_response = ""
    display_handle = display(Markdown(""), display_id=True)
    
    for chunk in response:
        if chunk.text:
            full_response += chunk.text
            # Strip code-fence markers without deleting the word "markdown" from the report body
            cleaned_response = full_response.replace("```markdown", "").replace("```", "")
            update_display(Markdown(cleaned_response), display_id=display_handle.display_id)

In [20]:
stream_report('atmostech','https://atmostech.ma/')

Found links: {'links': [{'type': 'contact page', 'url': 'https://atmostech.ma/contact'}, {'type': 'homepage', 'url': 'https://atmostech.ma/'}, {'type': 'about page', 'url': 'https://atmostech.ma/%C3%A0-propos-de-nous'}, {'type': 'contact page', 'url': 'https://atmostech.ma/contact'}]}


# Atmostech: A Summary for Prospective Customers, Investors, and Recruits

Atmostech Engineering and Solutions, founded in 2019 and based in Casablanca, Morocco, is a dynamic company born from a passion for design and building test benches. They offer a range of products and services within the industrial automation, electronics, and robotics sectors.

## What Atmostech Offers:

*   **Products:** Specializing in the sale and distribution of electronic components, development boards (Arduino, Raspberry Pi, ESP), modules, sensors, CNC and 3D printing equipment, robotics tools, and prototyping materials. They also offer various kits, including Atmostech kits, sensor kits, and robot kits.
*   **Engineering Services:** Providing expertise in industrial automation, industrial IT systems, and general engineering work in areas like agriculture and apiculture.
*   **Maintenance Services:** Offering industrial maintenance services spanning electrical, mechanical, hydraulic, pneumatic, and electronic systems.

## Who are Atmostech's Customers?

Atmostech serves a diverse customer base, including:

*   Individuals and hobbyists interested in electronics and DIY projects.
*   Educational institutions looking for didactic and industrial equipment.
*   Industries requiring automation solutions, maintenance, and specialized equipment.

## Atmostech's Company Culture and Values:

Atmostech is committed to:

*   Delivering high-quality services with a focus on customer satisfaction.
*   Providing expertise and quality.
*   Maintaining engagement, availability, responsibility, trust, and proximity with clients.

## Contact Information

*   **Address:** N° 5, Rue Hassan Elbasri -ex César Franck ,quartier: Belvédère - Casablanca
*   **Phone:** +212674744471 / +212664166235, +2120674744471
*   **Hours:** 8:00 - 18:00

**In Summary:**

Atmostech is a versatile company with a strong focus on quality and customer satisfaction. They serve as a supplier of electronic components and a provider of engineering and maintenance services.

