
### Scraping Soil Monitoring Data

This notebook scrapes Google search results related to soil monitoring, reporting, and verification (MRV).
It collects the result URLs and domains and identifies the sources that appear most frequently.

#### Steps:
1. Install and import required libraries.
2. Define a web scraping function to collect search results.
3. Process and clean the extracted data.
4. Analyze the results (top domains, keyword presence, etc.).


In [None]:
# Install required libraries
!pip install selenium
!pip install tldextract

In [12]:
# Import necessary libraries
import pandas as pd
import random
from datetime import datetime, date
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import tldextract
from urllib.parse import urlparse

In [None]:
# Set the browser to use: 'chrome', 'firefox', or 'edge'
browser = 'chrome'

#### Function: `scraping_soil`

The `scraping_soil` function is designed to scrape Google search results for a given set of search terms.  
It collects and processes search results from multiple pages, extracting key information such as:  
- Titles  
- URLs  
- Search result text snippets  
- Page position  
- Result type (SEO for organic results, SEA for paid ads)  

**Usage:**  
Call the function with a list of search terms and the number of pages to scrape. The function returns a pandas DataFrame containing the extracted data.  

**Example:**  
```python
search_terms = ["soil monitoring", "carbon sequestration"]
df_results = scraping_soil(search_terms, 3)
print(df_results.head())
```

In [14]:
def scraping_soil(search_terms, page_number):

    """
    Scrapes Google SERP (Search Engine Results Pages) for given search terms.

    Parameters:
    search_terms (list): List of queries to search for.
    page_number (int): Number of pages to scrape per search term.

    Returns:
    pd.DataFrame: A DataFrame containing extracted URLs, titles, and metadata.
    """

    start_time = datetime.now()
    print(f'Program started at: {start_time.strftime("%Y-%m-%d %H:%M:%S")}\n')

    df = pd.DataFrame()
    if browser == 'chrome':
        options = webdriver.ChromeOptions()
        driver = webdriver.Chrome(options=options)
    elif browser == 'firefox':
        options = webdriver.FirefoxOptions()
        driver = webdriver.Firefox(options=options)
    elif browser == 'edge':
        options = webdriver.EdgeOptions()
        driver = webdriver.Edge(options=options)
    else:
        raise ValueError("Unsupported browser. Choose 'chrome', 'firefox', or 'edge'.")

    # options = webdriver.ChromeOptions()
    # driver = webdriver.Chrome(options = options)

    print('***** Scraping in progress *****')

    for term in search_terms:
        print(f'\n**** Searching for term: {term} ****\n')
        driver.get('https://google.com')

        # Accept cookies if prompted
        try:
            WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, 'button#L2AGLb.tHlp8d'))
            ).click()
        except Exception:  # no consent button became clickable in time
            pass

        time.sleep(round(random.uniform(1, 10), 1))

        # Locate and use search bar
        search_box = driver.find_element(By.NAME, 'q')
        search_box.clear()
        search_box.send_keys(term)
        time.sleep(1)
        search_box.send_keys(Keys.RETURN)

        time.sleep(round(random.uniform(1, 10), 1))

        all_data = []  # Store extracted data

        for page in range(1, page_number + 1):
            print(f'*** Scraping page {page} ***')

            # Fixed 100-second pause: lets the page load and leaves time to solve a CAPTCHA manually
            time.sleep(100)

            articles_SEO = driver.find_elements(By.CLASS_NAME, 'MjjYud')
            articles_SEA = driver.find_elements(By.CLASS_NAME, 'uEierd')

            # Extract SEO articles
            for position, article in enumerate(articles_SEO, start=1):
                try:
                    title = article.find_element(By.TAG_NAME, 'h3').text
                    link = article.find_element(By.TAG_NAME, 'a').get_attribute('href')
                    content = article.text
                    has_image = 'Yes' if article.find_elements(By.CLASS_NAME, 'Z26q7c.UK95Uc.Sth6v') else 'No'
                    all_data.append([link, title, content, term, page, position, 'SEO', has_image])
                except Exception:
                    continue  # Skip if element not found

            # Extract SEA (paid ads) articles
            for position, ad in enumerate(articles_SEA, start=1):
                try:
                    title = ad.find_element(By.CLASS_NAME, 'CCgQ5.vCa9Yd.QfkTvb.N8QANc.MUxGbd.v0nnCb').text
                    link = ad.find_element(By.TAG_NAME, 'a').get_attribute('href')
                    content = ad.text
                    has_image = 'Yes' if ad.find_elements(By.CLASS_NAME, 'g-img.ZGomKf') else 'No'
                    all_data.append([link, title, content, term, page, position, 'SEA', has_image])
                except Exception:
                    continue  # Skip if element not found

            # Navigate to the next page
            try:
                next_button = driver.find_element(By.XPATH, "//*[contains(text(),'Next')]")  # 'Next' must match the Google interface language
                driver.execute_script("arguments[0].scrollIntoView();", next_button)
                next_button.click()
                time.sleep(round(random.uniform(1, 10), 1))
            except Exception:
                print("No more pages available.")
                break

        # Convert to DataFrame
        df_term = pd.DataFrame(all_data, columns = ['url', 'title', 'content', 'search_term', 'page', 'position', 'type', 'image'])
        df_term.dropna(subset=['title'], inplace = True)
        df_term['company'] = df_term['url'].apply(lambda x: tldextract.extract(x).domain)
        df_term['domain'] = df_term['url'].apply(lambda x: urlparse(x).netloc)

        df = pd.concat([df, df_term], ignore_index = True)

    df['Date'] = date.today().strftime('%Y-%m-%d')

    driver.quit()
    end_time = datetime.now()
    print(f'Program completed at: {end_time.strftime("%Y-%m-%d %H:%M:%S")}')
    print(f'Duration: {end_time - start_time}')

    return df
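
The `domain` column computed at the end of `scraping_soil` comes from `urlparse(...).netloc`, which keeps subdomains, whereas `company` keeps only the registered name returned by `tldextract` (for example `europa` for `esdac.jrc.ec.europa.eu` in the results shown later). The short sketch below illustrates the `netloc` part using only the standard library; `host_of` is a hypothetical helper name, and the second URL is just the host root, used for illustration.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the network location (host, including subdomains) of a URL."""
    return urlparse(url).netloc


for url in ["https://www.bsag.fi/en/projects/marvic/", "https://esdac.jrc.ec.europa.eu/"]:
    print(host_of(url))
# www.bsag.fi
# esdac.jrc.ec.europa.eu
```

Because subdomains are preserved, counting by `domain` treats `cordis.europa.eu` and `esdac.jrc.ec.europa.eu` as separate sources, while counting by `company` would group them.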

In [15]:
# List of search queries related to soil monitoring and verification
queries = [
    'Monitor* AND Report* AND Verif* AND MRV AND soil*',
    # 'soil monitoring reporting verification',
    ]

### Manual Intervention During Scraping  

This script uses Selenium to scrape Google search results, which involves opening a web browser.  
During execution, manual intervention may be required in the following cases:  

- **Google CAPTCHA:** If Google detects unusual activity, it may prompt a CAPTCHA challenge that needs to be solved manually.  
- **Consent pop-ups:** The script attempts to close the Google cookie consent pop-up, but in some cases, manual confirmation may be needed.  
- **Page navigation issues:** If the script gets stuck on a page, refreshing or clicking manually might be necessary.  

⚠ **Recommendation:** Run the script in a visible browser session and monitor its progress to handle any manual actions when required.  
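
The function pauses for a fixed time on every page so that a CAPTCHA can be solved by hand. One possible alternative, sketched here under the assumption that a simple check can tell when the results are ready, is to poll for that condition and continue as soon as it holds. `wait_until` is a hypothetical helper, not part of the notebook, and it uses only the standard library.

```python
import time


def wait_until(condition, timeout=100.0, poll_interval=1.0):
    """Poll `condition` until it returns a truthy value or `timeout` seconds elapse.

    Returns True if the condition was met, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Inside `scraping_soil`, a call such as `wait_until(lambda: driver.find_elements(By.CLASS_NAME, 'MjjYud'))` could stand in for the fixed pause. This assumes that no result blocks are present while a CAPTCHA is displayed, which should be checked against the live page.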


In [16]:
# Run scraping
df = scraping_soil(
    search_terms = queries,
    page_number = 3  # number of result pages to scrape per search term
    )


Program started at: 2025-05-13 16:09:51

***** Scraping in progress *****

**** Searching for term: Monitor* AND Report* AND Verif* AND MRV AND soil* ****

*** Scraping page 1 ***
*** Scraping page 2 ***
*** Scraping page 3 ***
Program completed at: 2025-05-13 16:15:21
Duration: 0:05:29.594164


In [17]:
df['Country'] = 'Germany'  # change according to your country

In [18]:
# Remove placeholder rows (URL equal to 'vide', French for 'empty') and reset the index
df = df.loc[df['url'] != 'vide']
df = df.reset_index(drop = True)

In [19]:
df

,url,title,content,search_term,page,position,type,image,company,domain,Date,Country
0,https://cordis.europa.eu/programme/id/HORIZON_...,"Monitoring, reporting and verification of soil...","Monitoring, reporting and verification of soil...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,1,SEO,No,europa,cordis.europa.eu,2025-05-13,Germany
1,https://scholar.google.de/scholar?q=Monitor*+A...,Scholarly articles for Monitor* AND Report* AN...,Scholarly articles for Monitor* AND Report* AN...,Monitor* AND Report* AND Verif* AND MRV AND soil*,1,3,SEO,No,google,scholar.google.de,2025-05-13,Germany
2,https://unece.org/sustainable-energy/monitorin...,,"People also ask\nWhat is mrv monitoring, repor...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,4,SEO,No,unece,unece.org,2025-05-13,Germany
3,https://www.bsag.fi/en/projects/marvic/,"Monitoring, reporting and verifying carbon bal...","Monitoring, reporting and verifying carbon bal...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,5,SEO,No,bsag,www.bsag.fi,2025-05-13,Germany
4,https://unece.org/sustainable-energy/monitorin...,"Monitoring, Reporting, and Verification (MRV)","Monitoring, Reporting, and Verification (MRV)\...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,6,SEO,No,unece,unece.org,2025-05-13,Germany
5,https://www.isric.org/news/new-paper-towards-m...,"Towards a modular, multi-ecosystem MRV framewo...","Towards a modular, multi-ecosystem MRV framewo...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,7,SEO,No,isric,www.isric.org,2025-05-13,Germany
6,https://www.sciencedirect.com/science/article/...,Solutions and insights for agricultural monito...,Solutions and insights for agricultural monito...,Monitor* AND Report* AND Verif* AND MRV AND soil*,1,8,SEO,No,sciencedirect,www.sciencedirect.com,2025-05-13,Germany
7,https://www.tandfonline.com/doi/full/10.1080/1...,"Towards a modular, multi-ecosystem monitoring,...","Towards a modular, multi-ecosystem monitoring,...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,9,SEO,No,tandfonline,www.tandfonline.com,2025-05-13,Germany
8,https://esdac.jrc.ec.europa.eu/content/towards...,"Towards a modular, multi-ecosystem Monitoring,...","Towards a modular, multi-ecosystem Monitoring,...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,10,SEO,No,europa,esdac.jrc.ec.europa.eu,2025-05-13,Germany
9,https://www.causeartist.com/monitoring-reporti...,"Explained: Monitoring, Reporting, and Verifica...","Explained: Monitoring, Reporting, and Verifica...",Monitor* AND Report* AND Verif* AND MRV AND soil*,1,11,SEO,No,causeartist,www.causeartist.com,2025-05-13,Germany


In [20]:
# Display the column names of the DataFrame
df.columns

Index(['url', 'title', 'content', 'search_term', 'page', 'position', 'type',
       'image', 'company', 'domain', 'Date', 'Country'],
      dtype='object')

In [21]:
# Show the 20 most frequently appearing domains in the dataset
df['domain'].value_counts()[:20]

domain
unece.org                      2
www.isric.org                  2
www.sciencedirect.com          2
openknowledge.fao.org          2
cordis.europa.eu               1
www.deloitte.com               1
research.wur.nl                1
pureportal.ilvo.be             1
www.idhsustainabletrade.com    1
climatesciences.lbl.gov        1
490c.uni-hohenheim.de          1
ilvo.vlaanderen.be             1
pubmed.ncbi.nlm.nih.gov        1
mission-innovation.net         1
www.lse.ac.uk                  1
www.carbon-drawdown.de         1
irc-orcasa.eu                  1
4p1000.org                     1
scholar.google.de              1
www.icos-cp.eu                 1
Name: count, dtype: int64
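
The steps listed at the top also mention keyword presence. A minimal sketch of such a check is given below, assuming a case-insensitive substring match over the `title` and `content` fields is sufficient. `keyword_presence` and the sample rows are hypothetical; with the notebook's DataFrame, the rows could be obtained with `df.to_dict("records")`.

```python
def keyword_presence(rows, keywords, fields=("title", "content")):
    """Count how many rows mention each keyword (case-insensitive substring match)."""
    counts = {kw: 0 for kw in keywords}
    for row in rows:
        # Join the chosen fields into one lowercase string; missing values become ""
        text = " ".join(str(row.get(f) or "") for f in fields).lower()
        for kw in keywords:
            if kw.lower() in text:
                counts[kw] += 1
    return counts


# Hypothetical sample rows shaped like the scraped records
sample_rows = [
    {"title": "Monitoring, Reporting, and Verification (MRV)", "content": "MRV for soil carbon"},
    {"title": None, "content": "People also ask: What is MRV monitoring?"},
    {"title": "Carbon farming", "content": "Soil organic carbon stocks"},
]
print(keyword_presence(sample_rows, ["MRV", "soil", "verification"]))
# {'MRV': 2, 'soil': 2, 'verification': 1}
```

Substring matching is deliberately crude: `soil` also matches `soils`, which mirrors the wildcard style of the search query but may overcount for short keywords.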

#### Exporting Results

After scraping and processing the search results, the data is saved as a CSV file.  
The filename includes the current date (`YYYY-MM-DD`) to keep track of different scraping sessions.  
This allows easy comparison of results across multiple executions.  

In [22]:
# Export the results to a CSV file
# Adapt the filename or add a directory path as needed.
df.to_csv(f'scraping_results_{date.today().strftime("%Y-%m-%d")}.csv', index = False)
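
Since each export carries its date in the filename, two sessions can be compared by loading both files and looking at which domains appeared or disappeared. The sketch below uses only the standard library and assumes the exported files contain the `domain` column produced above; `domains_in` and `compare_sessions` are hypothetical helper names.

```python
import csv


def domains_in(csv_path):
    """Return the set of values in the 'domain' column of an exported CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["domain"] for row in csv.DictReader(f)}


def compare_sessions(old_path, new_path):
    """Return (domains that appeared, domains that disappeared) between two exports."""
    old, new = domains_in(old_path), domains_in(new_path)
    return sorted(new - old), sorted(old - new)
```

A call would look like `appeared, disappeared = compare_sessions(old_file, new_file)`, where both arguments are paths to CSV files written by the export cell on different days.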