
# Collection of Articles using Google Scholar

### Author: Maela Guillaume-Le Gall
 
------------------------------------------------------------------------------------------------------------------------------


This notebook runs an automated Google Scholar search using the parameters defined in the first section, **"Configuration"**. Each article receives a relevance score equal to the number of distinct words its title shares with the query. Articles published before the minimum year are discarded, and the rest are sorted in descending order by relevance score, then by publication year, then by citation count. For each selected article, the notebook opens the publisher page with Selenium, scrapes the abstract, and cleans the text. Finally, **the results are exported as a table to an Excel file that can be downloaded at the end of the document.**
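
The ranking described above can be sketched on its own, without calling Google Scholar. The records in `sample_articles` are hypothetical and invented purely for illustration; the scoring and the sort key mirror the ones used later in the notebook.

```python
import re

def relevance_score(title, query):
    """Count the distinct words shared by the title and the query (case-insensitive)."""
    title_words = set(re.findall(r"\w+", title.lower()))
    query_words = set(re.findall(r"\w+", query.lower()))
    return len(title_words & query_words)

QUERY = "Environmental Impacts Artificial Intelligence"
MIN_YEAR = 2020

# Hypothetical records, invented for illustration only.
sample_articles = [
    {"title": "Artificial intelligence in healthcare", "year": 2022, "citations": 50},
    {"title": "Environmental impacts of artificial intelligence", "year": 2021, "citations": 10},
    {"title": "Environmental impacts of artificial intelligence systems", "year": 2021, "citations": 40},
    {"title": "Artificial intelligence and its environmental impacts", "year": 2018, "citations": 900},
]

for article in sample_articles:
    article["relevance"] = relevance_score(article["title"], QUERY)

# Drop articles older than MIN_YEAR, then sort by the tuple
# (relevance, year, citations) in descending order.
ranked = sorted(
    (a for a in sample_articles if a["year"] >= MIN_YEAR),
    key=lambda a: (a["relevance"], a["year"], a["citations"]),
    reverse=True,
)

for a in ranked:
    print(a["relevance"], a["year"], a["citations"], a["title"])
```

Because the key is a tuple, year and citation count only break ties between articles with the same relevance score. The match is on exact words, so a title containing "impact" does not score against the query word "impacts".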

The **Configuration** parameters can be easily modified in the first code cell:
- query = **QUERY**
- number of articles to fetch = **NUM_FETCH**
- minimum publication year = **MIN_YEAR**
- number of articles to export = **NUM_SELECT**

*NUM_FETCH should always be greater than NUM_SELECT, because some articles are excluded when they were published before MIN_YEAR or when anti-bot mechanisms block access to their page. Setting NUM_FETCH at least one third higher than NUM_SELECT is recommended.*
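
As a rough illustration of the one-third rule, the helper below computes the smallest NUM_FETCH that respects it and rejects smaller values. The names `recommended_num_fetch` and `check_fetch_size` are hypothetical and are not part of the notebook.

```python
def recommended_num_fetch(num_select):
    """Smallest fetch size that is at least one third larger than num_select."""
    # (num_select + 2) // 3 is the integer ceiling of num_select / 3.
    return num_select + (num_select + 2) // 3

def check_fetch_size(num_fetch, num_select):
    """Return num_fetch unchanged, or raise if it is below the recommended margin."""
    minimum = recommended_num_fetch(num_select)
    if num_fetch < minimum:
        raise ValueError(
            f"NUM_FETCH={num_fetch} is too small for NUM_SELECT={num_select}; "
            f"use at least {minimum}."
        )
    return num_fetch

# With the default configuration (NUM_FETCH = 15, NUM_SELECT = 10) the check passes.
print(recommended_num_fetch(10))
print(check_fetch_size(15, 10))
```

The margin is only a starting point: if many articles are filtered out by year or blocked by publishers, a larger NUM_FETCH may still be needed to reach NUM_SELECT rows.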





In [3]:
# ==============================================================================================================
# CONFIGURATION 
# ==============================================================================================================

QUERY = "Environmental Impacts Artificial Intelligence"
# Number of articles to fetch from Google Scholar before filtering
NUM_FETCH = 15
# Minimum publication year to consider an article valid 
MIN_YEAR = 2020
# Number of valid articles to select for export
NUM_SELECT = 10


# Warning: NUM_FETCH should always be greater than NUM_SELECT, because some
# articles may be excluded if they are older than MIN_YEAR or blocked by
# anti-bot mechanisms. Setting NUM_FETCH at least one third higher than
# NUM_SELECT is recommended.


In [4]:
# ==============================================================================================================
# INSTALL PACKAGES AND IMPORT LIBRARIES
# ==============================================================================================================

# Install required packages if not already installed
import importlib.util
import sys
import subprocess

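# pandas needs an Excel engine such as openpyxl to write .xlsx files with DataFrame.to_excel
required_packages = ["pandas", "openpyxl", "scholarly", "selenium", "chromedriver_autoinstaller"]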
for package in required_packages:
    if importlib.util.find_spec(package) is None:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])


# Import libraries
import re
import time
from urllib.parse import urlparse
import pandas as pd

from scholarly import scholarly
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


In [5]:
# ==============================================================================================================
# UTILITY FUNCTIONS
# ==============================================================================================================

def clean_abstract(text):
    """
    Cleans the abstract text by removing unwanted lines.
    - If a line starts with "Highlights", skips it and the subsequent lines until a line starting with "abstract" is encountered (which is kept).
    - Removes lines starting with "Keywords:", "Download", "Graphical abstract", "Fig.", "Table", "Cookies", "Cookie Settings", "©" or "All content on this site".
    - Stops processing at the first line that contains "Cite this article" or starts with "Please note,", "Your institution has not purchased" or "You are not authenticated".
    - Strips leading and trailing whitespace from every kept line.
    """
    # Process the text line by line
    lines = text.split('\n')
    cleaned_lines = []
    skip_until_abstract = False

    for line in lines:
        if skip_until_abstract:
            if re.match(r'^\s*abstract', line, re.IGNORECASE):
                skip_until_abstract = False
                cleaned_lines.append(line.strip())
            continue

        if re.match(r'^\s*Highlights', line, re.IGNORECASE):
            skip_until_abstract = True
            continue

        if re.match(r'^\s*(Keywords:|Download|Graphical abstract)', line, re.IGNORECASE):
            continue

        if re.match(r'^\s*(Fig\.|Table|Cookies|Cookie Settings|©|All content on this site)', line, re.IGNORECASE):
            continue
        
        if re.search(r'(^\s*Please note,)|(^Your institution has not purchased)|(^You are not authenticated)|Cite this article', line, re.IGNORECASE):
            break

        cleaned_lines.append(line.strip())

    return "\n".join(cleaned_lines).strip()


def get_full_abstract(url, driver):
    """
    Retrieves the full abstract using a dictionary of generic selectors for different websites.
    If no specific selector works, a generic method (iterating over paragraphs) is used.
    """
    try:
        driver.get(url)
        time.sleep(3)

        # Try to close the cookie banner if present
        try:
            cookie_button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable((By.XPATH, 
                    "//button[contains(text(),'Accept') or contains(text(),'Close') or contains(text(),'I agree')]"))
            )
            cookie_button.click()
            time.sleep(2)
        except Exception:
            # No clickable cookie banner appeared within the timeout; continue without it
            pass

        domain = urlparse(url).netloc.lower()
        abstract_text = ""

        # Dictionary of generic selectors by domain
        selectors = {
            "mdpi.com": [
                ("css", "div#art-abstract"),
                ("css", "section.abstract"),
                ("css", "div.art-abstract")
            ],
            "sciencedirect.com": [
                ("css", "section#abstract"),
                ("css", "section#abstracts"),
                ("css", "div.Abstracts")
            ],
            "springer.com": [
                ("xpath", "//section[contains(@class,'Abstract')]"),
                ("xpath", "//div[contains(@class,'c-article-section__content')]")
            ],
            "arxiv.org": [
                ("css", "blockquote.abstract"),
                ("css", "blockquote.abstract.mathjax")
            ],
            "nature.com": [
                ("xpath", "//div[contains(@class,'Abstract')]"),
                ("xpath", "//section[contains(@class,'abstract')]")
            ]
        }

        found = False
        for key, sel_list in selectors.items():
            if key in domain:
                for sel_type, selector in sel_list:
                    try:
                        if sel_type == "css":
                            elem = WebDriverWait(driver, 5).until(
                                EC.visibility_of_element_located((By.CSS_SELECTOR, selector))
                            )
                        else:
                            elem = WebDriverWait(driver, 5).until(
                                EC.visibility_of_element_located((By.XPATH, selector))
                            )
                        abstract_text = elem.text
                        found = True
                        break
                    except Exception:
                        continue
                if found:
                    break

        if not abstract_text:
            paragraphs = driver.find_elements(By.TAG_NAME, 'p')
            for para in paragraphs:
                text_para = para.text.strip()
                if text_para and len(text_para.split()) > 5:
                    abstract_text += text_para + "\n"
            
            # >>> Remove duplicate lines <<<
            lines = abstract_text.splitlines()
            unique_lines = []
            for line in lines:
                line_stripped = line.strip()
                if line_stripped and line_stripped not in unique_lines:
                    unique_lines.append(line_stripped)
            abstract_text = "\n".join(unique_lines)
        
        # Final cleaning of the abstract text
        abstract_text = clean_abstract(abstract_text)
        
        if not abstract_text.strip():
            return "No abstract found"
        return abstract_text
        
    except Exception as e:
        return f"Error fetching abstract: {str(e)}"


def relevance_score(title, query):
    """
    Computes a relevance score by comparing the words in the title with those in the query.
    """
    title_words = re.findall(r'\w+', title.lower())
    query_words = re.findall(r'\w+', query.lower())
    return len(set(title_words).intersection(query_words))
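
`get_full_abstract` picks its selector list with a substring test (`key in domain`), which also matches unrelated hosts whose name merely contains the key. A stricter variant, sketched below, compares the host name with the site or its subdomains. The helper name `host_matches` and the example URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

def host_matches(url, site):
    """Return True if the URL's host is `site` itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == site or host.endswith("." + site)

# Illustrative URLs, not taken from the notebook's results.
print(host_matches("https://link.springer.com/article/example", "springer.com"))
print(host_matches("https://arxiv.org/abs/example", "arxiv.org"))
print(host_matches("https://notarxiv.org/abs/example", "arxiv.org"))
```

With the substring test, the third URL would be treated as an arXiv page; with the suffix test it falls through to the generic paragraph-based extraction.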


In [8]:
# ==============================================================================================================
# SELECTION AND EXPORTATION OF THE TABLE
# ==============================================================================================================

if __name__ == "__main__":
    # Initialize the Selenium WebDriver (using default settings)
    driver = webdriver.Chrome()

    # Start the search on Google Scholar using the configured query
    search_query = scholarly.search_pubs(QUERY)
    articles = []
    for i in range(NUM_FETCH):
        try:
            article = next(search_query)
            articles.append(article)
        except StopIteration:
            break

    # Process each article: compute relevance score, convert publication year and citation count to integers
    for article in articles:
        title = article.get('bib', {}).get('title', '')
        article['relevance'] = relevance_score(title, QUERY)
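        try:
            article['year'] = int(article.get('bib', {}).get('pub_year', 0))
        except (TypeError, ValueError):
            # Missing or non-numeric year: treat it as unknown
            article['year'] = 0
        try:
            article['citations'] = int(article.get('num_citations', 0))
        except (TypeError, ValueError):
            # Missing or non-numeric citation count: treat it as zero
            article['citations'] = 0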

    # Filter articles with a publication year >= MIN_YEAR and sort them by relevance, year, and citation count
    sorted_articles = sorted(
        [a for a in articles if a.get('year', 0) >= MIN_YEAR],
        key=lambda a: (a['relevance'], a['year'], a['citations']),
        reverse=True
    )

    data = []
    valid_count = 0

    # Loop through the sorted articles until NUM_SELECT valid articles are collected
    for article in sorted_articles:
        title = article['bib'].get('title', 'N/A')
        authors = article['bib'].get('author', 'N/A')
        if isinstance(authors, list):
            authors = ", ".join(authors)
        year = article.get('year', 'N/A')
        citations = article.get('citations', 'N/A')
        relevance = article.get('relevance', 'N/A')
        pub_url = article.get('pub_url', '')
        
        if pub_url:
            full_abstract = get_full_abstract(pub_url, driver)
        else:
            full_abstract = "No URL provided"
        
        # Remove the word "Abstract" at the beginning of the abstract if present
        full_abstract = re.sub(r'^\s*Abstract[:]*\s*', '', full_abstract, flags=re.IGNORECASE)
        
        # Skip the article if the full abstract is not available or contains a blocking message
        if full_abstract.strip() in [
            "No abstract found", 
            "Confirmez que vous êtes un humain en effectuant l’action ci-dessous."
        ]:
            continue
        
        data.append({
            "Article": f"Article {valid_count + 1}",
            "Title": title,
            "Author(s)": authors,
            "Year": year,
            "Citations": citations,
            "Relevance Score": relevance,
            "Full Abstract": full_abstract,
            "URL": pub_url
        })
        
        valid_count += 1
        if valid_count >= NUM_SELECT:
            break

    # Create a DataFrame from the collected data and export it to an Excel file.
    # The "Article" label is not listed in `columns`, so it is left out of the exported table.
    df = pd.DataFrame(data, columns=["Title", "Author(s)", "Year", "Citations", "Relevance Score", "Full Abstract", "URL"])
    df.to_excel("Relevant Articles IA.xlsx", index=False)
    print("Here is the list of relevant articles on your subject, download it by clicking")

    # Generate a download link (for Jupyter Notebook)
    from IPython.display import FileLink, display
    display(FileLink("Relevant Articles IA.xlsx"))

    # Close the Selenium WebDriver after export
    driver.quit()


Here is the list of relevant articles on your subject, download it by clicking
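
The export loop skips an article only when its abstract equals one of two exact placeholder strings, so rows whose abstract is "No URL provided" or an "Error fetching abstract" message still reach the spreadsheet. If those rows are unwanted, a broader check could look like the sketch below. The helper `is_usable_abstract` is hypothetical, and treating these extra cases as unusable is an assumption, not the notebook's current behavior.

```python
# Placeholder texts produced by the notebook, plus the start of the
# French "confirm you are human" message matched in the export loop.
BLOCKED_MARKERS = (
    "No abstract found",
    "No URL provided",
    "Confirmez que vous êtes un humain",
)
ERROR_PREFIX = "Error fetching abstract:"

def is_usable_abstract(text):
    """Return False for empty text, fetch errors, and known placeholder messages."""
    stripped = text.strip()
    if not stripped or stripped.startswith(ERROR_PREFIX):
        return False
    return not any(stripped.startswith(marker) for marker in BLOCKED_MARKERS)

print(is_usable_abstract("This study measures the energy use of model training."))
print(is_usable_abstract("Error fetching abstract: timeout"))
print(is_usable_abstract("No URL provided"))
```

In the loop, the membership test against the two fixed strings could then be replaced by `if not is_usable_abstract(full_abstract): continue`. Prefix matching on the anti-bot message also avoids depending on the exact apostrophe character used by the page.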
