# BeautifulSoup - Scrape emails from URL

Largely inspired by https://github.com/jupyter-naas/awesome-notebooks/blob/master/BeautifulSoup/BeautifulSoup_Scrape_emails_from_URL.ipynb

**Description:** This notebook will show how to scrape emails stored in HTML webpage using BeautifulSoup.

<u>References:</u>
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Regular Expression Documentation](https://docs.python.org/3/library/re.html)

## Import libraries

In [17]:
import re
from collections import deque
from urllib.parse import urlsplit

import requests
from bs4 import BeautifulSoup

### Setup Variables
- `url`: URL of the webpage to scrape
- `limit`: number of emails found to stop scraping

In [18]:
url = "https://www.brentozar.com/"
limit = 3

## Scrape emails from URL

We will use the `requests` library to get the HTML content of the webpage and the `BeautifulSoup` library to parse the HTML content. We will use a regular expression to extract the emails from the HTML content.

In [19]:
unscraped = deque([url])

scraped = set()

emails = set()

while len(unscraped):
    url = unscraped.popleft()
    scraped.add(url)

    parts = urlsplit(url)

    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/') + 1]
    else:
        path = url

    print("Crawling URL: %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue

    exclude = ["google.com", "gmail.com", "example.com"]
    # Get emails from URL
    new_emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.+[a-z]{1,3}", url)
    for email in new_emails:
        for e in exclude:
            if not email.endswith(e):
                emails.update([email])

    # Get emails from content
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.+[a-z]{1,3}", response.text, re.I))
    for email in new_emails:
        for e in exclude:
            if not email.endswith(e):
                emails.update([email])

    if len(emails) >= limit:
        break

    soup = BeautifulSoup(response.text, 'html.parser')
    for anchor in soup.find_all("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''

        if link.startswith('/'):
            link = base_url + link

        elif not link.startswith('http'):
            link = path + link

        if not link.endswith(".gz"):
            if not link in unscraped and not link in scraped:
                unscraped.append(link)

print(emails)

Crawling URL: https://www.brentozar.com/
Crawling URL: https://www.brentozar.com/#
Crawling URL: https://www.brentozar.com/log-in/
Crawling URL: https://www.brentozar.com/contact/
Crawling URL: https://www.brentozar.com/cart/
Crawling URL: https://www.youtube.com/c/BrentOzarUnlimited
Crawling URL: https://www.linkedin.com/in/brentozar/
Crawling URL: https://www.facebook.com/brentozar
Crawling URL: https://www.twitch.tv/brentozar
Crawling URL: https://github.com/BrentOzarULTD
Crawling URL: https://www.brentozar.com/sql-critical-care/
Crawling URL: https://www.brentozar.com/sql/sql-server-performance-tuning/
Crawling URL: https://www.brentozar.com/remote-dba-services-for-microsoft-sql-server/
Crawling URL: https://www.brentozar.com/remote-dba-services-for-microsoft-sql-server/sql-server-upgrades-and-migrations/
Crawling URL: https://training.brentozar.com/p/the-consultant-toolkit
Crawling URL: https://www.brentozar.com/training/
Crawling URL: https://www.brentozar.com/training/my-videos/

## Display result

In [20]:
print(f"🚀 {len(emails)} founded on {url}")
print(emails)

🚀 3 founded on https://support.google.com/youtube/contact/de_cancellation?hl=fr
{'8149a85a83fa4ec69640c43ddd69017d@sentry.io', 'asxvmprobertest@gmail.com', 'Help@BrentOzar.com'}


 ### Exercice 1
 
Développer un script qui permet de faire d'extraire les données sur les dividendes de l'action TOTAL SE en utilisant leur site web https://totalenergies.com/fr/actionnaires/action-et-dividende/dividende pour les années 2020 à 2024.
Le script doit retourner un tableau de données avec les colonnes suivantes :
- Nom du coupon
- Montant, un `float` avec 2 décimales
- Date de détachement, un objet `datetime`, parsé avec le module distant `dateparser`
- Date de paiement, un objet `datetime`, parsé avec le module distant `dateparser`