This cell uses the !pip install command to install the Python libraries the project needs: beautifulsoup4, requests, and lxml for web scraping; pandas for data manipulation; SQLAlchemy for the SQL connection that stores the scraped words; streamlit and streamlit_jupyter for the interactive web application; and nbdev for exporting the notebook as a Python module.


In [156]:
!pip install beautifulsoup4 requests nbdev streamlit streamlit_jupyter pandas lxml SQLAlchemy



The # |exporti comment is an nbdev directive that marks the cell for export to the generated module (as internal code that is not added to the module's __all__).
This cell imports the BeautifulSoup class from bs4 for parsing HTML and XML documents, requests for making HTTP requests, pandas (aliased as pd) for data manipulation and analysis, and streamlit (aliased as st) for building web applications. It also limits pandas display output to 20 columns, 20 rows, and 80 characters per column. This setup prepares the necessary tools for web scraping and data handling.

In [157]:

# |exporti
from bs4 import BeautifulSoup
import requests
import pandas as pd
import streamlit as st

pd.options.display.max_columns = 20
pd.options.display.max_rows = 20
pd.options.display.max_colwidth = 80


In [158]:
# Create the SQL connection to words_db as specified in your secrets file.
conn = st.connection('words_db', type='sql')

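This cell sets up the base URL for the CNIL's AI glossary page. It then initializes an empty list named pages.
A for loop runs 13 times, appending the full URL of each page (page indices 0 through 12) to the pages list. The count of 13 is hard-coded, so it presumably matches the number of glossary pages at the time of writing. This is the preparatory step for iterating over multiple pages of the glossary. The next cell wraps the same logic in a function, pages_links, that returns the list.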

In [159]:
# |exporti
base_link = "https://www.cnil.fr/fr/intelligence-artificielle/glossaire-ia?page="
pages = []
for i in range(13):
    pages.append(base_link + str(i))


In [160]:
def pages_links():
    base_link = "https://www.cnil.fr/fr/intelligence-artificielle/glossaire-ia?page="
    pages = []
    for i in range(13):
        pages.append(base_link + str(i))

    return pages
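
The loop above can also be written as a list comprehension. The sketch below is one possible equivalent, not the notebook's own code: the names BASE_LINK, build_page_urls, and the page_count parameter are hypothetical, introduced only so that the page count is not hard-coded inside the function.

```python
BASE_LINK = "https://www.cnil.fr/fr/intelligence-artificielle/glossaire-ia?page="


def build_page_urls(page_count=13):
    """Return one glossary URL per page index, starting at page 0."""
    return [f"{BASE_LINK}{i}" for i in range(page_count)]


page_urls = build_page_urls()
print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

Passing a different page_count would be enough to adapt the list if the glossary grows or shrinks.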

This cell uses Python's map and chr functions along with range(97, 123) to create a list of lowercase English alphabet characters. The ASCII values from 97 to 122 correspond to the lowercase letters 'a' through 'z'. This list, named alphabet_list, is likely used to iterate through the alphabetically sorted entries in the CNIL's AI glossary.

In [161]:
# |exporti
alphabet_list = list(map(chr, range(97, 123)))
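
As a side note, the standard library already exposes the lowercase ASCII letters as string.ascii_lowercase. The short sketch below (variable names are illustrative) checks that both approaches produce the same list.

```python
import string

# Approach used in the notebook: code points 97..122 converted to characters.
alphabet_from_chr = list(map(chr, range(97, 123)))

# Alternative: the constant provided by the standard library.
alphabet_from_string = list(string.ascii_lowercase)

print(alphabet_from_chr == alphabet_from_string)
print(len(alphabet_from_string))
```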

This cell initializes an empty dictionary named alphabet_obj.
It then iterates over each character in the previously created alphabet_list, creating a key for each letter in the dictionary with an empty list as its value. This structure (alphabet_obj) is likely designed to store the scraped data categorized by the initial letter of each entry in the glossary.

In [162]:
# |exporti
alphabet_obj = {}

for char in alphabet_list:
    alphabet_obj[char]= []
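
The same structure can be built with a dict comprehension. The sketch below also shows a pitfall worth knowing when adapting this code: dict.fromkeys with a list default reuses a single list object for every key, so it is not a safe shortcut here. The variable shared is introduced only for this demonstration.

```python
import string

alphabet_list = list(string.ascii_lowercase)

# Each key gets its own, independent empty list.
alphabet_obj = {char: [] for char in alphabet_list}

# Pitfall: dict.fromkeys reuses ONE list object for all keys.
shared = dict.fromkeys(alphabet_list, [])
shared["a"].append("x")

print(alphabet_obj["a"] is alphabet_obj["b"])
print(shared["a"] is shared["b"])
```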

The cell defines a function word_format which takes an HTML element representing a word (or glossary entry) as its argument.
Within the function, it creates an empty dictionary named obj.
It extracts the title of the entry (the word itself) from an HTML element with class definition-liste-titre and assigns it to the key title in the obj dictionary.
The definition of the word is extracted from a div with class definition-liste-body, and it's assigned to the key definition in obj.
The entry's first letter is stored under the key entry, which is likely used for alphabet-based categorization.
Additionally, it extracts the hyperlink associated with the word (for more information) and assigns it to the key link in obj.
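Before returning, the function also opens a session on the conn SQL connection, creates the words table if it does not exist yet, and inserts the entry as a row (columns entry, name, definition, and url).
Finally, the function returns the obj dictionary containing the structured data of the word.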


In [163]:
# |exporti
from sqlalchemy import text
def word_format(word):
    obj = {}
    title = word.find("h3", class_="definition-liste-titre").a.text
    obj["title"] = title
    definition = word.find("div", class_= "definition-liste-body").text.strip()
    obj["definition"] = definition
    obj["entry"] = title[0].lower()
    link = word.find("h3", class_="definition-liste-titre").a.get("href")
    obj["link"] = link
    with conn.session as s:
        # Create the table on first use, then insert the entry using bound parameters.
        s.execute(text('CREATE TABLE IF NOT EXISTS words (entry TEXT, name TEXT, definition TEXT, url TEXT);'))
        s.execute(
            text('INSERT INTO words (entry, name, definition, url) VALUES (:entry, :name, :definition, :url);'),
            params=dict(entry=title[0].lower(), name=title, definition=definition, url=link)
        )
        s.commit()
    
    return obj
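
The database step of word_format can be tried in isolation. The following is a minimal sketch that uses the standard-library sqlite3 module with an in-memory database instead of the Streamlit connection. The helper name save_word and the sample entry are made up for illustration, and the sketch creates the table once up front rather than once per word.

```python
import sqlite3


def save_word(db, obj):
    """Insert one structured glossary entry into the words table."""
    db.execute(
        "INSERT INTO words (entry, name, definition, url) "
        "VALUES (:entry, :name, :definition, :url)",
        {
            "entry": obj["entry"],
            "name": obj["title"],
            "definition": obj["definition"],
            "url": obj["link"],
        },
    )


# In-memory database: nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE IF NOT EXISTS words "
    "(entry TEXT, name TEXT, definition TEXT, url TEXT)"
)

# Placeholder entry shaped like the dictionary returned by word_format.
sample = {
    "title": "Exemple",
    "definition": "Placeholder definition.",
    "entry": "e",
    "link": "/placeholder",
}
save_word(db, sample)
db.commit()

rows = db.execute("SELECT entry, name, url FROM words").fetchall()
print(rows)
```

Named parameters keep the scraped text out of the SQL string itself, which is the same idea as the params dictionary passed to s.execute in the cell above.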


This cell defines web_scraper, a function decorated with st.cache_data so that Streamlit caches its result and does not re-scrape the site on every rerun. It takes the letter-keyed dictionary as its argument, works on a copy of it, and iterates through each URL stored in the pages list created earlier.
For each page, it uses requests.get(page).text to fetch the HTML content and then parses it with BeautifulSoup using the lxml parser.
It finds all div elements with the class list-inner, which likely contain individual word entries.
Within a nested loop, each word entry is passed to the word_format function (defined in a previous cell) to structure the data.
The returned object (obj) is then added to the appropriate list in the alphabet_obj dictionary based on the first letter of the entry. If the key does not exist (for example, a title starting with an accented letter or a digit), it's created and initialized with an empty list before appending obj.
This process categorizes and stores all scraped glossary entries by their starting letter, and the function returns the filled dictionary.

In [164]:
# |exporti
@st.cache_data
def web_scraper(alphabet):
    alphabet_obj = alphabet.copy()
    for page in pages:
        html_text = requests.get(page).text
        soup = BeautifulSoup(html_text, 'lxml')
        words = soup.find_all("div",class_="list-inner")
        for word in words:
            obj = word_format(word)
            # Append to the letter's list, creating the list first if the letter is new.
            alphabet_obj.setdefault(obj["entry"], []).append(obj)
            
    return alphabet_obj
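
Two details of web_scraper can be illustrated without any network access: grouping entries by their first letter, and the fact that dict.copy() is a shallow copy, so the lists inside the copy are the same objects as in the original. The sketch below uses invented sample entries, and the helper name group_by_letter is hypothetical.

```python
import copy

template = {"a": [], "b": []}
entries = [
    {"title": "Alpha", "entry": "a"},
    {"title": "Beta", "entry": "b"},
    {"title": "Zeta", "entry": "z"},  # letter missing from the template
]


def group_by_letter(template, entries):
    """Group entries by their 'entry' key without mutating the template."""
    grouped = copy.deepcopy(template)  # new lists, unlike dict.copy()
    for obj in entries:
        grouped.setdefault(obj["entry"], []).append(obj)
    return grouped


grouped = group_by_letter(template, entries)
print(sorted(grouped))
print(template)  # still holds empty lists at this point

# With a shallow copy, appending through the copy changes the original too.
shallow = template.copy()
shallow["a"].append("leak")
print(template["a"])
```

Whether the shared lists matter in the notebook depends on how alphabet_obj is reused afterwards; the deep copy is simply the safer default if the original dictionary must stay empty.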
    

2024-01-16 12:07:24.909 No runtime found, using MemoryCacheStorageManager


In [165]:
# |exporti
alphabet_obj = web_scraper(alphabet_obj)

This cell seems to be setting up the interface for a Streamlit application.
It imports StreamlitPatcher and tqdm from streamlit_jupyter. StreamlitPatcher is likely used to enable Streamlit functionality in a Jupyter environment.
The StreamlitPatcher().jupyter() call initializes Streamlit's integration with the Jupyter notebook.
It also imports pandas (aliased as pd), which is a crucial library for data manipulation and is possibly used later in the Streamlit app for handling the scraped data.


# Streamlit interface

from streamlit_jupyter import StreamlitPatcher, tqdm
StreamlitPatcher().jupyter()
import pandas as pd



This cell extracts the keys from the alphabet_obj dictionary and stores them in a variable named entries.
These keys, which are the letters of the alphabet, are likely used in the Streamlit interface to allow users to select glossary entries based on their starting letter.

In [166]:
# |exporti
entries = alphabet_obj.keys()


**Cell 10 - Streamlit Title and Markdown:**
This cell uses Streamlit functions to create the title and a markdown description for the Streamlit application.
st.title("Glossaire IA") sets the title of the web application.
st.markdown is used to provide a description of the application, mentioning that the data was extracted from the CNIL website. It includes a hyperlink to the CNIL AI glossary page.
A dropdown menu is created using st.selectbox, allowing users to select a letter from the entries (which are the keys of the alphabet_obj dictionary). This dropdown is used to filter terms in the glossary based on their starting letter.

In [167]:
# |exporti
st.title("Glossaire IA")
st.markdown("Les données ont été extraites du site de la [CNIL](%s) afin de créer une application qui présente les entrées de façon plus intuitive et de pouvoir créer des questionnaires." % pages[0])

selected_letter = st.selectbox(
    'Par quelle lettre commence le terme que vous cherchez',
    entries)



**Cell 11 - Displaying Dataframe in Streamlit:**
This cell creates a Pandas DataFrame from the list of entries in alphabet_obj corresponding to the selected letter from the dropdown menu.
st.dataframe(df) is then used to display this DataFrame in the Streamlit application. It shows the glossary terms that start with the selected letter along with their definitions, entry letters, and hyperlinks.

In [168]:
# |exporti
df = pd.DataFrame(alphabet_obj[selected_letter])
st.dataframe(df)

DeltaGenerator()

**Cell 12 - Creating a List of Titles for Entries:**
This cell defines a function print_ that returns the title of a glossary entry.
It then creates a list named word_list by mapping the print_ function over the list of entries in alphabet_obj corresponding to the selected letter. This process extracts just the titles of the entries, likely for display or further processing in the Streamlit application.

In [169]:
# |exporti
def print_(e):
    return e["title"]
word_list= list(map(print_, alphabet_obj[selected_letter]))
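
The map call with a helper function is equivalent to a list comprehension, which avoids the extra function. A small sketch with placeholder entries (the titles below are invented, not taken from the glossary):

```python
# Placeholder entries shaped like the dictionaries built by word_format.
sample_entries = [
    {"title": "Terme A", "entry": "t"},
    {"title": "Terme B", "entry": "t"},
]

# Same result as list(map(print_, sample_entries)).
word_list = [e["title"] for e in sample_entries]
print(word_list)
```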



This cell adds another dropdown menu to the Streamlit interface. This menu allows the user to select a specific word from the list of words (word_list) that start with the previously selected letter. If there are no words in the list, it displays a message indicating that there are no corresponding entries.

In [170]:
# |exporti
selected_entry = None
isLen = len(word_list) > 0
if isLen:
    selected_entry = st.selectbox(
        'Quel mot cherchez-vous?',
        word_list)
else:
    st.write("Il n'y a pas de mots correspondant à cette entrée.")

This cell defines a filtering function (filter_cb) to find the glossary entry that matches the user's selected word. It then displays the definition of the selected word in the Streamlit interface. Additionally, it provides a hyperlink to the CNIL page for more information about the selected term.



In [171]:
# |exporti
def filter_cb(x):
    # True when the entry's title matches the word picked in the dropdown.
    return x["title"] == selected_entry
                
if isLen and selected_entry is not None:
    selected_word = list(filter(filter_cb, alphabet_obj[selected_letter]))
    st.markdown(selected_word[0]["definition"])
    link = "https://www.cnil.fr/"+ selected_word[0]["link"]
    # Use a single f-string: mixing an f-string with % formatting fails if the title contains "%".
    st.markdown(f"[{selected_entry}]({link})")
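
The lookup and link construction can be sketched outside Streamlit. In the example below, next with a default value returns the first matching entry or None, and urllib.parse.urljoin builds the absolute URL whether or not the scraped href starts with a slash. The helper name find_entry, the sample entries, and their paths are all hypothetical.

```python
from urllib.parse import urljoin


def find_entry(entries, title):
    """Return the first entry whose title matches, or None if there is none."""
    return next((e for e in entries if e["title"] == title), None)


# Placeholder entries; the paths are invented for this example.
sample_entries = [
    {"title": "Terme A", "definition": "Definition A", "link": "/chemin/terme-a"},
    {"title": "Terme B", "definition": "Definition B", "link": "chemin/terme-b"},
]

found = find_entry(sample_entries, "Terme B")
missing = find_entry(sample_entries, "Absent")

url_a = urljoin("https://www.cnil.fr/", sample_entries[0]["link"])
url_b = urljoin("https://www.cnil.fr/", found["link"])
print(url_a)
print(url_b)
print(missing)
```

Returning None for a missing title avoids the IndexError that indexing an empty filtered list with [0] would raise.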

This cell exports the notebook as a Python module using nbdev. It first checks whether a file named 'example.py' exists in the working directory and removes it if so, presumably as housekeeping left over from an earlier export. The nb_export function then converts the cells marked for export in ai_vocabulary_web_scraping.ipynb into a Python file named app.py in the current directory.

In [172]:
from nbdev.export import nb_export
import os
path = 'example.py'
isFile = os.path.isfile(path)
if isFile:
    os.remove(path)
nb_export("ai_vocabulary_web_scraping.ipynb", lib_path="./", name="app")