<a href="https://colab.research.google.com/github/MaelaGLG/Policy-In-Action-ARCEP/blob/main/Python%20Script/Collection_of_articles_scholarly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collection of articles - Scholarly
### Author : Maela Guillaume-Le Gall
### Date : 13/02/2025

This notebook systematically collects academic articles on the environmental impacts of AI in Europe. It uses the `scholarly` package to run searches on Google Scholar, then tries to retrieve each article's full abstract from the publisher's page.

In [1]:
# Install the scholarly package (%pip installs into the notebook's own kernel)
%pip install scholarly


Collecting scholarly
  Downloading scholarly-1.7.11-py3-none-any.whl.metadata (7.4 kB)
Collecting arrow (from scholarly)
  Downloading arrow-1.3.0-py3-none-any.whl.metadata (7.5 kB)
Collecting bibtexparser (from scholarly)
  Downloading bibtexparser-1.4.3.tar.gz (55 kB)
  Preparing metadata (setup.py) ... done
Collecting fake-useragent (from scholarly)
  Downloading fake_useragent-2.0.3-py3-none-any.whl.metadata (17 kB)
Collecting free-proxy (from scholarly)
  Downloading free_proxy-1.1.3.tar.gz (5.6 kB)
  Preparing metadata (setup.py) ... done
Collecting python-dotenv (from scholarly)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting selenium (from scholarly)
  Downloading selenium-4.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting sphinx-rtd-theme (from scholarly)
  Downloading sphinx_rtd_theme-3.0.2-py2.p

In [5]:
from itertools import islice

import requests
from bs4 import BeautifulSoup
from scholarly import scholarly

# Search query
query = "Environmental Impacts of AI, Europe"

# Search on Google Scholar (returns a lazy iterator of results)
search_query = scholarly.search_pubs(query)

# Function to fetch the full abstract from the article's URL (if available)
def get_full_abstract(url):
    if not url:
        return 'Full abstract not available'
    try:
        # Send a GET request to the article URL; the timeout avoids hanging on slow sites
        page = requests.get(url, timeout=10)
        soup = BeautifulSoup(page.content, 'html.parser')

        # Look for a <div class="abstract">; the markup varies between websites
        abstract_section = soup.find('div', {'class': 'abstract'})
        if abstract_section:
            return abstract_section.get_text(strip=True)
        return 'Full abstract not available'
    except Exception as e:
        return f"Error fetching abstract: {e}"

# Retrieve the first 5 results (fewer if the search returns fewer)
articles = list(islice(search_query, 5))  # Adjust the number of results as needed

# Display the results with full abstracts
for i, article in enumerate(articles, start=1):
    bib = article.get('bib', {})
    pub_url = article.get('pub_url')  # Not every result carries a publisher URL

    print(f"Article {i}:")
    print(f"Title: {bib.get('title', 'n/a')}")
    print(f"Author(s): {bib.get('author', 'n/a')}")
    print(f"Year: {bib.get('pub_year', 'n/a')}")
    print(f"Citations: {article.get('num_citations', 'n/a')}")

    # Try to fetch the full abstract from the article's URL
    full_abstract = get_full_abstract(pub_url)
    print(f"Full Abstract: {full_abstract}")

    print(f"URL: {pub_url}")
    print("-" * 40)


Article 1:
Title: The environmental challenges of AI in EU law: lessons learned from the Artificial Intelligence Act (AIA) with its drawbacks
Author(s): ['U Pagallo', 'J Ciani Sciolla', 'M Durante']
Year: 2022
Citations: 42
Full Abstract: Full abstract not available
URL: https://www.emerald.com/insight/content/doi/10.1108/tg-07-2021-0121/full/html
----------------------------------------
Article 2:
Title: Societal and ethical impacts of artificial intelligence: Critical notes on European policy frameworks
Author(s): ['L Vesnic-Alujevic', 'S Nascimento', 'A Polvora']
Year: 2020
Citations: 175
Full Abstract: Full abstract not available
URL: https://www.sciencedirect.com/science/article/pii/S0308596120300537
----------------------------------------
Article 3:
Title: Digitalization and AI in European agriculture: a strategy for achieving climate and biodiversity targets?
Author(s): ['B Garske', 'A Bau', 'F Ekardt']
Year: 2021
Citations: 125
Full Abstract: Full abstract not available
URL: h
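
Every result above reports `Full abstract not available`, meaning the HTML returned to `requests` contained no `<div class="abstract">`. Some publisher pages also carry a summary in `<meta>` tags in the page head. The sketch below, offered as one possible fallback rather than the notebook's method, reads such tags with the standard-library `html.parser`. The tag names it checks (`citation_abstract`, `dc.description`, `description`, `og:description`), the helper names, and the sample HTML are assumptions for illustration; a given site may expose none of them.

```python
from html.parser import HTMLParser

# Assumed candidate tag names, in order of preference; real sites may use others.
CANDIDATE_META_NAMES = ("citation_abstract", "dc.description", "description", "og:description")


class MetaAbstractParser(HTMLParser):
    """Collect the content of <meta> tags whose name/property is a candidate."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = (attributes.get("name") or attributes.get("property") or "").lower()
        content = (attributes.get("content") or "").strip()
        # Keep only the first non-empty value seen for each candidate name.
        if key in CANDIDATE_META_NAMES and content and key not in self.found:
            self.found[key] = content


def abstract_from_meta(html):
    """Return the most preferred candidate meta description, or None."""
    parser = MetaAbstractParser()
    parser.feed(html)
    for name in CANDIDATE_META_NAMES:
        if name in parser.found:
            return parser.found[name]
    return None


# Hypothetical page head, made up for this example.
sample_html = """
<html><head>
<meta property="og:description" content="Short teaser text.">
<meta name="citation_abstract" content="This paper examines a sample topic.">
</head><body><p>Body text</p></body></html>
"""

print(abstract_from_meta(sample_html))
```

In the notebook, `abstract_from_meta(page.text)` could be tried before or after the `div.abstract` lookup; which order works better depends on the publisher.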

In [15]:
# Install necessary dependencies for running Selenium with Chromium in Colab
!apt-get update -qq
!apt-get install -y wget curl unzip
!apt-get install -y chromium-browser
!apt-get install -y chromium-chromedriver
!pip install selenium
!pip install chromedriver-autoinstaller

# Import necessary libraries
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

# Set up Chrome options for headless mode in Google Colab
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run Chrome in headless mode
chrome_options.add_argument('--no-sandbox')  # Disable sandboxing (required in Colab)
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid shared memory errors
chrome_options.add_argument('--disable-gpu')  # Disable GPU usage (for headless mode)
chrome_options.add_argument('--remote-debugging-port=9222')  # Enable remote debugging
chrome_options.binary_location = '/usr/bin/chromium-browser'  # Path to the Chromium binary in Colab

# Automatically install and match the correct version of ChromeDriver
chromedriver_autoinstaller.install()

# Set up ChromeDriver with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Function to fetch full abstract from the article's URL (if available)
def get_full_abstract(url):
    try:
        driver.get(url)  # Open the article URL
        time.sleep(2)  # Wait for the page to load

        # Extract abstract (adjust the CSS selector if necessary).
        # find_elements returns an empty list when nothing matches, whereas
        # find_element raises, so the "not available" branch can actually be reached.
        abstract_elements = driver.find_elements(By.CSS_SELECTOR, 'div.abstract')
        full_abstract = abstract_elements[0].text if abstract_elements else "Abstract not available"
        return full_abstract
    except Exception as e:
        return f"Error fetching abstract: {str(e)}"

# Search query on Google Scholar
from scholarly import scholarly
query = "Environmental Impacts of AI, Europe"
search_query = scholarly.search_pubs(query)

# Retrieve the first 5 results (fewer if the search returns fewer)
from itertools import islice
articles = list(islice(search_query, 5))  # Adjust the number of results as needed

# Display the results with full abstracts
for i, article in enumerate(articles, start=1):
    print(f"Article {i}:")
    print(f"Title: {article['bib']['title']}")
    print(f"Author(s): {article['bib']['author']}")
    print(f"Year: {article['bib']['pub_year']}")
    print(f"Citations: {article['num_citations']}")

    # Try to fetch the full abstract from the article's URL
    full_abstract = get_full_abstract(article['pub_url'])
    print(f"Full Abstract: {full_abstract}")

    print(f"URL: {article['pub_url']}")
    print("-" * 40)

# Close the Selenium WebDriver after scraping
driver.quit()


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (7.81.0-1ubuntu1.20).
unzip is already the newest version (6.0-26ubuntu3.2).
wget is already the newest version (1.21.2-2ubuntu1.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-browser is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-chromedriver is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not u

In [16]:
# Install necessary dependencies for running Selenium with Chromium in Colab
!apt-get update -qq
!apt-get install -y wget curl unzip
!apt-get install -y chromium-browser
!apt-get install -y chromium-chromedriver
!pip install selenium
!pip install chromedriver-autoinstaller

# Import necessary libraries
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

# Set up Chrome options for headless mode in Google Colab
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run Chrome in headless mode
chrome_options.add_argument('--no-sandbox')  # Disable sandboxing (required in Colab)
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid shared memory errors
chrome_options.add_argument('--disable-gpu')  # Disable GPU usage (for headless mode)
chrome_options.add_argument('--remote-debugging-port=9222')  # Enable remote debugging
chrome_options.binary_location = '/usr/bin/chromium-browser'  # Path to the Chromium binary in Colab

# Automatically install and match the correct version of ChromeDriver
chromedriver_autoinstaller.install()

# Set up ChromeDriver with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Function to fetch full abstract from the article's URL (if available)
def get_full_abstract(url):
    try:
        driver.get(url)  # Open the article URL
        time.sleep(3)  # Wait for the page to load

        # Try to find the abstract based on common patterns (search for paragraphs)
        paragraphs = driver.find_elements(By.TAG_NAME, 'p')  # Get all paragraphs

        abstract = ''
        # Check if any paragraph seems to contain the abstract (you may need to adjust this logic)
        for para in paragraphs:
            text = para.text.strip()
            if text:  # If there's some text in the paragraph
                if 'abstract' in text.lower():  # Look for the word 'abstract' in the text
                    abstract += text + '\n'

        if abstract:
            return abstract.strip()
        else:
            return "No abstract found"
    except Exception as e:
        return f"Error fetching abstract: {str(e)}"

# Search query on Google Scholar
from scholarly import scholarly
query = "Environmental Impacts of AI, Europe"
search_query = scholarly.search_pubs(query)

# Retrieve the first 5 results (fewer if the search returns fewer)
from itertools import islice
articles = list(islice(search_query, 5))  # Adjust the number of results as needed

# Display the results with full abstracts
for i, article in enumerate(articles, start=1):
    print(f"Article {i}:")
    print(f"Title: {article['bib']['title']}")
    print(f"Author(s): {article['bib']['author']}")
    print(f"Year: {article['bib']['pub_year']}")
    print(f"Citations: {article['num_citations']}")

    # Try to fetch the full abstract from the article's URL
    full_abstract = get_full_abstract(article['pub_url'])
    print(f"Full Abstract: {full_abstract}")

    print(f"URL: {article['pub_url']}")
    print("-" * 40)

# Close the Selenium WebDriver after scraping
driver.quit()


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (7.81.0-1ubuntu1.20).
unzip is already the newest version (6.0-26ubuntu3.2).
wget is already the newest version (1.21.2-2ubuntu1.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-browser is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-chromedriver is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not u

In [17]:


# Set up Chrome options for headless mode in Google Colab
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run Chrome in headless mode
chrome_options.add_argument('--no-sandbox')  # Disable sandboxing (required in Colab)
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid shared memory errors
chrome_options.add_argument('--disable-gpu')  # Disable GPU usage (for headless mode)
chrome_options.add_argument('--remote-debugging-port=9222')  # Enable remote debugging
chrome_options.binary_location = '/usr/bin/chromium-browser'  # Path to the Chromium binary in Colab

# Automatically install and match the correct version of ChromeDriver
chromedriver_autoinstaller.install()

# Set up ChromeDriver with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Function to fetch full abstract from the article's URL (if available)
def get_full_abstract(url):
    try:
        driver.get(url)  # Open the article URL
        time.sleep(3)  # Wait for the page to load

        # Try to find the abstract based on common patterns (search for paragraphs or relevant sections)
        abstract = ""

        # Look for specific abstract section
        try:
            # Look for the paragraph that follows an <h2> heading containing 'Abstract'
            abstract_section = driver.find_element(By.XPATH, "//h2[contains(text(),'Abstract')]/following-sibling::p")
            abstract = abstract_section.text
        except Exception:  # Typically NoSuchElementException when no such heading exists
            # If the abstract is not found in the standard place, fall back to every
            # paragraph longer than 5 words (this also picks up banners and page metadata)
            paragraphs = driver.find_elements(By.TAG_NAME, 'p')  # Get all paragraphs
            for para in paragraphs:
                text = para.text.strip()
                if text and len(text.split()) > 5:  # Skip very short paragraphs
                    abstract += text + '\n'

        # Return the found abstract or a message if not found
        return abstract if abstract else "No abstract found"

    except Exception as e:
        return f"Error fetching abstract: {str(e)}"

# Search query on Google Scholar
from scholarly import scholarly
query = "Environmental Impacts of AI, Europe"
search_query = scholarly.search_pubs(query)

# Retrieve the first 5 results (fewer if the search returns fewer)
from itertools import islice
articles = list(islice(search_query, 5))  # Adjust the number of results as needed

# Display the results with full abstracts
for i, article in enumerate(articles, start=1):
    print(f"Article {i}:")
    print(f"Title: {article['bib']['title']}")
    print(f"Author(s): {article['bib']['author']}")
    print(f"Year: {article['bib']['pub_year']}")
    print(f"Citations: {article['num_citations']}")

    # Try to fetch the full abstract from the article's URL
    full_abstract = get_full_abstract(article['pub_url'])
    print(f"Full Abstract: {full_abstract}")

    print(f"URL: {article['pub_url']}")
    print("-" * 40)

# Close the Selenium WebDriver after scraping
driver.quit()


Article 1:
Title: The environmental challenges of AI in EU law: lessons learned from the Artificial Intelligence Act (AIA) with its drawbacks
Author(s): ['U Pagallo', 'J Ciani Sciolla', 'M Durante']
Year: 2022
Citations: 42
Full Abstract: We are using cookies to give you the best experience on our website, but you are free to manage these at any time. To continue with our standard settings click "Accept". To find out more and manage your cookies, click "Manage cookies".
Important note for authors: phishing scams.
Transforming Government: People, Process and Policy
Article publication date: 22 June 2022 Permissions
Issue publication date: 12 July 2022
The paper aims to examine the environmental challenges of artificial intelligence (AI) in EU law that regard both illicit uses of the technology, i.e. overuse or misuse of AI and its possible underuses. The aim of the paper is to show how such regulatory efforts of legislators should be understood as a critical component of the Green Deal 
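
The output above begins with the cookie banner and journal metadata because the fallback keeps every paragraph longer than five words. One way to narrow this, sketched below with only the standard library, is to drop paragraphs that contain boilerplate phrases and to require a higher word count. The phrase list, the 20-word threshold, and the function name `filter_abstract_paragraphs` are illustrative assumptions, not values taken from the notebook; the sample strings are shortened from the output above.

```python
# Assumed boilerplate markers; extend or replace them for the sites you scrape.
BOILERPLATE_PHRASES = ("cookies", "phishing", "publication date")
MIN_WORDS = 20  # assumed threshold; the notebook's fallback uses 5


def filter_abstract_paragraphs(paragraphs, min_words=MIN_WORDS):
    """Keep paragraphs that are long enough and contain no boilerplate phrase."""
    kept = []
    for text in paragraphs:
        text = text.strip()
        lowered = text.lower()
        if len(text.split()) < min_words:
            continue  # too short to be an abstract paragraph
        if any(phrase in lowered for phrase in BOILERPLATE_PHRASES):
            continue  # looks like a banner or page metadata
        kept.append(text)
    return kept


sample_paragraphs = [
    "We are using cookies to give you the best experience on our website, "
    "but you are free to manage these at any time.",
    "Important note for authors: phishing scams.",
    "Transforming Government: People, Process and Policy",
    "Article publication date: 22 June 2022 Permissions",
    "The paper aims to examine the environmental challenges of artificial "
    "intelligence (AI) in EU law that regard both illicit uses of the "
    "technology, i.e. overuse or misuse of AI and its possible underuses.",
]

for paragraph in filter_abstract_paragraphs(sample_paragraphs):
    print(paragraph)
```

With these sample strings only the last paragraph survives. A keyword filter like this is brittle: a genuine abstract that mentions cookies or phishing would be dropped too, so the phrase list needs checking against real pages.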

In [18]:

# Set up Chrome options for headless mode in Google Colab
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run Chrome in headless mode
chrome_options.add_argument('--no-sandbox')  # Disable sandboxing (required in Colab)
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid shared memory errors
chrome_options.add_argument('--disable-gpu')  # Disable GPU usage (for headless mode)
chrome_options.add_argument('--remote-debugging-port=9222')  # Enable remote debugging
chrome_options.binary_location = '/usr/bin/chromium-browser'  # Path to the Chromium binary in Colab

# Automatically install and match the correct version of ChromeDriver
chromedriver_autoinstaller.install()

# Set up ChromeDriver with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Function to handle cookies and fetch full abstract from the article's URL (if available)
def get_full_abstract(url):
    try:
        driver.get(url)  # Open the article URL
        time.sleep(3)  # Wait for the page to load

        # Dismiss any cookie consent pop-ups (may vary by site)
        try:
            cookie_button = driver.find_element(By.XPATH, "//button[contains(text(),'Accept') or contains(text(),'Close')]")
            cookie_button.click()
            time.sleep(2)  # Allow time for the cookie banner to be dismissed
        except Exception:
            pass  # No cookie banner found (or it could not be clicked); continue

        # Try to find the abstract based on common patterns (search for paragraphs or relevant sections)
        abstract = ""

        # Look for specific abstract section
        try:
            # Look for the paragraph that follows an <h2> heading containing 'Abstract'
            abstract_section = driver.find_element(By.XPATH, "//h2[contains(text(),'Abstract')]/following-sibling::p")
            abstract = abstract_section.text
        except Exception:  # Typically NoSuchElementException when no such heading exists
            # If the abstract is not found in the standard place, fall back to every
            # paragraph longer than 5 words (this also picks up page metadata)
            paragraphs = driver.find_elements(By.TAG_NAME, 'p')  # Get all paragraphs
            for para in paragraphs:
                text = para.text.strip()
                if text and len(text.split()) > 5:  # Skip very short paragraphs
                    abstract += text + '\n'

        # Return the found abstract or a message if not found
        return abstract if abstract else "No abstract found"

    except Exception as e:
        return f"Error fetching abstract: {str(e)}"

# Search query on Google Scholar
from scholarly import scholarly
query = "Environmental Impacts of AI, Europe"
search_query = scholarly.search_pubs(query)

# Retrieve the first 5 results (fewer if the search returns fewer)
from itertools import islice
articles = list(islice(search_query, 5))  # Adjust the number of results as needed

# Display the results with full abstracts
for i, article in enumerate(articles, start=1):
    print(f"Article {i}:")
    print(f"Title: {article['bib']['title']}")
    print(f"Author(s): {article['bib']['author']}")
    print(f"Year: {article['bib']['pub_year']}")
    print(f"Citations: {article['num_citations']}")

    # Try to fetch the full abstract from the article's URL
    full_abstract = get_full_abstract(article['pub_url'])
    print(f"Full Abstract: {full_abstract}")

    print(f"URL: {article['pub_url']}")
    print("-" * 40)

# Close the Selenium WebDriver after scraping
driver.quit()


Article 1:
Title: The environmental challenges of AI in EU law: lessons learned from the Artificial Intelligence Act (AIA) with its drawbacks
Author(s): ['U Pagallo', 'J Ciani Sciolla', 'M Durante']
Year: 2022
Citations: 42
Full Abstract: Important note for authors: phishing scams.
Transforming Government: People, Process and Policy
Article publication date: 22 June 2022 Permissions
Issue publication date: 12 July 2022
The paper aims to examine the environmental challenges of artificial intelligence (AI) in EU law that regard both illicit uses of the technology, i.e. overuse or misuse of AI and its possible underuses. The aim of the paper is to show how such regulatory efforts of legislators should be understood as a critical component of the Green Deal of the EU institutions, that is, to save our planet from impoverishment, plunder and destruction.
To illustrate the different ways in which AI can represent a game-changer for our environmental challenges, attention is drawn to a multid
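
Even with the cookie banner dismissed, the text above still starts with site notices and journal metadata, since one generic rule is applied to every publisher. A possible refinement is to choose a selector per publisher from the URL's hostname. The sketch below shows only that lookup. The selector strings in `SELECTORS_BY_HOST` are hypothetical placeholders that have not been checked against the real pages, and `selector_for` is a name introduced here.

```python
from urllib.parse import urlparse

# Hypothetical selectors: placeholders to be replaced after inspecting each site.
SELECTORS_BY_HOST = {
    "emerald.com": "section.abstract",
    "sciencedirect.com": "div.abstract",
}
DEFAULT_SELECTOR = "div.abstract"  # the selector used earlier in this notebook


def selector_for(url):
    """Return the CSS selector registered for the URL's host, or the default."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return SELECTORS_BY_HOST.get(host, DEFAULT_SELECTOR)


print(selector_for("https://www.emerald.com/insight/content/doi/10.1108/tg-07-2021-0121/full/html"))
print(selector_for("https://example.org/some-article"))
```

Inside `get_full_abstract`, the returned string could be passed to `driver.find_elements(By.CSS_SELECTOR, selector_for(url))`, keeping the paragraph fallback for hosts that are not in the table.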

In [19]:

# Set up Chrome options for headless mode in Google Colab
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run Chrome in headless mode
chrome_options.add_argument('--no-sandbox')  # Disable sandboxing (required in Colab)
chrome_options.add_argument('--disable-dev-shm-usage')  # Avoid shared memory errors
chrome_options.add_argument('--disable-gpu')  # Disable GPU usage (for headless mode)
chrome_options.add_argument('--remote-debugging-port=9222')  # Enable remote debugging
chrome_options.binary_location = '/usr/bin/chromium-browser'  # Path to the Chromium binary in Colab

# Automatically install and match the correct version of ChromeDriver
chromedriver_autoinstaller.install()

# Set up ChromeDriver with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Function to handle cookies and fetch full abstract from the article's URL (if available)
def get_full_abstract(url):
    try:
        driver.get(url)  # Open the article URL
        time.sleep(3)  # Wait for the page to load

        # Dismiss any cookie consent pop-ups (may vary by site)
        try:
            cookie_button = driver.find_element(By.XPATH, "//button[contains(text(),'Accept') or contains(text(),'Close')]")
            cookie_button.click()
            time.sleep(2)  # Allow time for the cookie banner to be dismissed
        except Exception:
            pass  # No cookie banner found (or it could not be clicked); continue

        # Try to find the abstract section using a more specific XPath
        abstract = ""

        # Check for known container elements
        try:
            # Take the first paragraph inside a <div> whose class contains 'abstract' or 'section'
            abstract_section = driver.find_element(By.XPATH, "//div[contains(@class,'abstract') or contains(@class,'section')]/p")
            abstract = abstract_section.text
        except Exception:  # Typically NoSuchElementException when no such <div> exists
            # If no abstract is found, fall back to every paragraph longer than 5 words
            paragraphs = driver.find_elements(By.TAG_NAME, 'p')  # Get all paragraphs
            for para in paragraphs:
                text = para.text.strip()
                if text and len(text.split()) > 5:  # Skip very short paragraphs
                    abstract += text + '\n'

        # Return the found abstract or a message if not found
        return abstract if abstract else "No abstract found"

    except Exception as e:
        return f"Error fetching abstract: {str(e)}"

# Search query on Google Scholar
from scholarly import scholarly
query = "Environmental Impacts of AI, Europe"
search_query = scholarly.search_pubs(query)

# Retrieve the first 5 results (fewer if the search returns fewer)
from itertools import islice
articles = list(islice(search_query, 5))  # Adjust the number of results as needed

# Display the results with full abstracts
for i, article in enumerate(articles, start=1):
    print(f"Article {i}:")
    print(f"Title: {article['bib']['title']}")
    print(f"Author(s): {article['bib']['author']}")
    print(f"Year: {article['bib']['pub_year']}")
    print(f"Citations: {article['num_citations']}")

    # Try to fetch the full abstract from the article's URL
    full_abstract = get_full_abstract(article['pub_url'])
    print(f"Full Abstract: {full_abstract}")

    print(f"URL: {article['pub_url']}")
    print("-" * 40)

# Close the Selenium WebDriver after scraping
driver.quit()


Article 1:
Title: The environmental challenges of AI in EU law: lessons learned from the Artificial Intelligence Act (AIA) with its drawbacks
Author(s): ['U Pagallo', 'J Ciani Sciolla', 'M Durante']
Year: 2022
Citations: 42
Full Abstract: Important note for authors: phishing scams.
Transforming Government: People, Process and Policy
Article publication date: 22 June 2022 Permissions
Issue publication date: 12 July 2022
The paper aims to examine the environmental challenges of artificial intelligence (AI) in EU law that regard both illicit uses of the technology, i.e. overuse or misuse of AI and its possible underuses. The aim of the paper is to show how such regulatory efforts of legislators should be understood as a critical component of the Green Deal of the EU institutions, that is, to save our planet from impoverishment, plunder and destruction.
To illustrate the different ways in which AI can represent a game-changer for our environmental challenges, attention is drawn to a multid
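
The cells above only print their results. To keep the collected articles for later screening, each result can be flattened into a row and written to CSV. This is a minimal sketch using the standard `csv` module; the `to_row` helper and the column names are assumptions, and the first sample record copies the fields printed earlier for Article 1 in place of a live `scholarly` result.

```python
import csv
import io

FIELDNAMES = ["title", "authors", "year", "citations", "url"]  # assumed column names


def to_row(article):
    """Flatten one scholarly-style result dict, tolerating missing keys."""
    bib = article.get("bib", {})
    return {
        "title": bib.get("title", ""),
        "authors": "; ".join(bib.get("author", [])),
        "year": bib.get("pub_year", ""),
        "citations": article.get("num_citations", ""),
        "url": article.get("pub_url", ""),
    }


sample_articles = [
    # Built from the output printed earlier, standing in for a live result.
    {
        "bib": {
            "title": "The environmental challenges of AI in EU law: lessons learned "
                     "from the Artificial Intelligence Act (AIA) with its drawbacks",
            "author": ["U Pagallo", "J Ciani Sciolla", "M Durante"],
            "pub_year": "2022",
        },
        "num_citations": 42,
        "pub_url": "https://www.emerald.com/insight/content/doi/10.1108/tg-07-2021-0121/full/html",
    },
    # Hypothetical record, included to show how missing fields are handled.
    {"bib": {"title": "Record with missing fields"}},
]

buffer = io.StringIO()  # swap for open("articles.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(to_row(article) for article in sample_articles)
print(buffer.getvalue())
```

Writing rows through `.get` with defaults means a result without a year or publisher URL yields empty cells rather than a `KeyError`.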