In this solution, I used AutoScraper and Selenium to automate web scraping of plant information. AutoScraper simplified the extraction of structured data from static web pages, while Selenium handled dynamic content by automating browser interactions. With this setup I could collect plant-related data such as care instructions, optimal growing conditions, and gardening tips from various websites. The scraped data was then processed for use in a project.

#Installation and Setup of AutoScraper

In [1]:
!pip install autoscraper

Collecting autoscraper
  Downloading autoscraper-1.1.14-py3-none-any.whl.metadata (5.3 kB)
Collecting bs4 (from autoscraper)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading autoscraper-1.1.14-py3-none-any.whl (10 kB)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4, autoscraper
Successfully installed autoscraper-1.1.14 bs4-0.0.2


In [2]:
from autoscraper import AutoScraper

#Setup for Selenium WebDriver with Chromium

In [4]:
!pip install selenium



In [5]:
!apt-get update

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Ign:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Hit:7 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,346 kB]
Hit:10 https://ppa

In [6]:
!apt-get install -y chromium chromium-browser

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package chromium is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  chromium-bsu

E: Package 'chromium' has no installation candidate


In [7]:
!apt install chromium-chromedriver

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  apparmor chromium-browser libfuse3-3 liblzo2-2 libudev1 snapd squashfs-tools systemd-hwe-hwdb
  udev
Suggested packages:
  apparmor-profiles-extra apparmor-utils fuse3 zenity | kdialog
The following NEW packages will be installed:
  apparmor chromium-browser chromium-chromedriver libfuse3-3 liblzo2-2 snapd squashfs-tools
  systemd-hwe-hwdb udev
The following packages will be upgraded:
  libudev1
1 upgraded, 9 newly installed, 0 to remove and 54 not upgraded.
Need to get 28.5 MB of archives.
After this operation, 118 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 apparmor amd64 3.0.4-2ubuntu2.4 [598 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 liblzo2-2 amd64 2.10-2build3 [53.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 squashfs-tools amd64 1:4.5-3

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By # Import the By class

#Function to Initialize Headless Chromium WebDriver

In [9]:
def web_driver():
    """Create a headless Chromium WebDriver."""
    options = webdriver.ChromeOptions()
    options.add_argument("--verbose")
    options.add_argument("--no-sandbox")
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1200")  # no space after the comma
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=options)
    return driver

The function web_driver() initializes a Selenium WebDriver for Chromium in headless mode, so the browser runs without a graphical user interface. This is particularly useful for automating web scraping on servers or in other environments where displaying the browser window is unnecessary.

Here’s a breakdown of the options used:

*   --verbose: Enables detailed logging, useful for debugging.
*   --no-sandbox: Runs the browser in a non-sandboxed mode, often needed in environments such as cloud services.
*   --headless: Runs the browser without a UI, which makes the scraping process faster and more resource-efficient.
*   --disable-gpu: Disables GPU usage, since it's not required for headless operation.
*   --window-size=1920,1200: Sets the browser's window size so that content is properly rendered and visible in headless mode. The value is written without a space after the comma.
*   --disable-dev-shm-usage: Prevents issues related to limited shared memory on some Linux environments.
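The flag list above can also be assembled by a small helper, which keeps the window size free of stray spaces. This is only a minimal sketch: `chrome_arguments` and its parameters are hypothetical names, not part of Selenium or of the original notebook.

```python
def chrome_arguments(width=1920, height=1200, headless=True):
    """Return Chromium command-line flags as a list of strings (hypothetical helper)."""
    arguments = [
        "--no-sandbox",
        "--disable-gpu",
        "--disable-dev-shm-usage",
        # Built with an f-string so no space can slip in after the comma
        f"--window-size={width},{height}",
    ]
    if headless:
        arguments.append("--headless")
    return arguments


if __name__ == "__main__":
    for argument in chrome_arguments():
        print(argument)
```

Inside web_driver(), each returned string could then be passed to options.add_argument(...) in a loop.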



In [10]:
driver = web_driver()

#plant_data(url):

**Purpose:**
The function extracts the text of the page's main element and drops every line that contains a dollar or pound sign, which filters out price-related information.

In [16]:
def plant_data(url):
  """Return the text lines of the page's <main> element(s), skipping lines that mention a price."""
  driver.get(url)  # load the page in the shared headless driver
  # Collect the visible text of every <main> element and drop duplicates
  wanted_list = list(set(element.text for element in driver.find_elements(By.XPATH, '//main')))
  f_res = []
  for text in wanted_list:
    for line in text.split('\n'):
      # A dollar or pound sign marks the line as price information
      if '$' not in line and '£' not in line:
        f_res.append(line)
  return f_res
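The price filter can be tried without a browser. The sketch below isolates that step in a hypothetical helper, `drop_price_lines`, and runs it on a made-up sample string; neither the helper name nor the sample text comes from the original notebook.

```python
def drop_price_lines(text, currency_symbols=("$", "£")):
    """Split text into lines and keep only those without a currency symbol."""
    kept = []
    for line in text.split("\n"):
        if not any(symbol in line for symbol in currency_symbols):
            kept.append(line)
    return kept


# Illustrative input only, not scraped data
sample = "Monarda didyma\n£9.99\nFull sun\nSave $5 today"
print(drop_price_lines(sample))  # ['Monarda didyma', 'Full sun']
```

A symbol-based filter is deliberately crude: it also drops any descriptive line that happens to mention a price, and it keeps prices written in other currencies.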

#all_plants_urls(url, initial_url):

**Purpose:**
This function collects every link inside the page's main element with Selenium, then removes the links that AutoScraper returns for two reference URLs (examples of header or footer links). The final result is a list of the remaining, plant-related URLs.

In [15]:
import time

def all_plants_urls(url, initial_url):
  """Return the links inside <main> on `url`, minus the links AutoScraper matches from the references."""
  driver.get(url)
  time.sleep(5)  # give dynamically loaded content time to render
  all_links = driver.find_elements(By.XPATH, '//main//a')
  all_links = [link.get_attribute('href') for link in all_links]
  # Reference examples for AutoScraper: one link from the footer and one from the header (navbar).
  # Here they are simply url and initial_url.
  scraper = AutoScraper()
  ref = [url, initial_url]
  result = scraper.build(url=url, wanted_list=ref)
  list_of_links = list(set(all_links) - set(result))
  # get_attribute('href') returns None for anchors without an href, so drop those
  return [link for link in list_of_links if link is not None]
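The filtering step at the end of all_plants_urls() subtracts one set from another, so the returned links come back in arbitrary order. If the page order matters, an order-preserving variant could look like the sketch below. `remove_reference_links` is a hypothetical helper and the example.com URLs are placeholders.

```python
def remove_reference_links(all_links, reference_links):
    """Return links not in reference_links, without None or duplicates, in first-seen order."""
    excluded = set(reference_links)
    seen = set()
    kept = []
    for link in all_links:
        if link is None or link in excluded or link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept


# Placeholder data standing in for hrefs read from a page
links = [
    "https://www.example.com/plants/a/",
    None,
    "https://www.example.com/",
    "https://www.example.com/plants/b/",
    "https://www.example.com/plants/a/",
]
print(remove_reference_links(links, ["https://www.example.com/"]))
```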

#Input Handling and Scraping Logic for Gardening Website

**Purpose:**
This part of the code takes user input to select between scraping data for a single plant or a collection of plants from a gardening website. It validates the URL structure, handles pagination for multi-page results, and gathers plant information accordingly.








In [None]:
url = input("Enter the url of the gardening website:  ")
# Re-prompt until the URL starts with 'https://www.' and mentions 'plant' or 'garden'
while not (url.startswith('https://www.') and ('plant' in url or 'garden' in url)):
  url = input("Enter the url of the gardening website:  ")
# 'https://host/path'.split('/') gives ['https:', '', 'host', ...], so index 2 is the host
if len(url.split('/')) >= 3:
  initial_url = 'https://' + url.split('/')[2]
else:
  initial_url = url
specification_of_scraping = input("Single plant(1) or collection of plants(2):  ")
while specification_of_scraping not in ('1', '2'):
  specification_of_scraping = input("Single plant(1) or collection of plants(2):  ")
if specification_of_scraping == '2':
  # Plant URLs found on the first page
  all_plants = all_plants_urls(url, initial_url)
  print(all_plants)
  print(len(all_plants))
  f_all_plants = list(all_plants)
  # Request page 2, 3, ... until a page yields no links
  j = 1
  while len(all_plants) > 0:
    j = j + 1
    if '?' in url:
      new_url = url + '&page=' + str(j)
    else:
      new_url = url + '?page=' + str(j)
    all_plants = all_plants_urls(new_url, initial_url)
    f_all_plants.extend(all_plants)
  # Scrape every collected page, keeping each page's lines instead of overwriting them
  data = [plant_data(plant_url) for plant_url in f_all_plants]
elif specification_of_scraping == '1':
  data = plant_data(url)  # raw, uncleaned text lines of the single plant page
  print(data)
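The pagination loop above stops only when a page returns no links, so it can run indefinitely on a site that serves the last page again for out-of-range page numbers. A more defensive variant, sketched below under that assumption, caps the number of pages and stops as soon as a page adds nothing new. `collect_paginated_links`, `max_pages`, and the fake site used for the demonstration are all hypothetical.

```python
def collect_paginated_links(url, fetch_links, max_pages=50):
    """Collect links from url and its page=N successors until a page adds nothing new."""
    separator = "&" if "?" in url else "?"
    collected = []
    seen = set()
    for page in range(1, max_pages + 1):
        page_url = url if page == 1 else f"{url}{separator}page={page}"
        added = 0
        for link in fetch_links(page_url):
            if link not in seen:
                seen.add(link)
                collected.append(link)
                added += 1
        if added == 0:
            break  # empty page, or a page that only repeats earlier links
    return collected


# Stand-in for a real site: page 3 repeats page 2, which would trap a naive loop
fake_site = {
    "https://www.example.com/plants": ["/plants/a/", "/plants/b/"],
    "https://www.example.com/plants?page=2": ["/plants/c/"],
    "https://www.example.com/plants?page=3": ["/plants/c/"],
}
print(collect_paginated_links("https://www.example.com/plants",
                              lambda page_url: fake_site.get(page_url, [])))
```

In the notebook, fetch_links could be something like `lambda page_url: all_plants_urls(page_url, initial_url)`.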

This block handles user input for scraping a gardening website, checks the URL, and then scrapes either a single plant page or a collection of plants. Here's a breakdown:

**URL Input and Validation:**

*   url = input("Enter the url of the gardening website: "): The user is prompted to input the URL of the gardening website.
*   The while loop checks that the entered URL starts with 'https://www.' and contains a keyword such as 'plant' or 'garden', so that the user enters a relevant URL. If these conditions aren't met, the user is prompted again.

**Initial URL Setup:**

*   If splitting the URL on '/' yields three or more parts (e.g., https://example.com/plant), the code rebuilds the base URL (https://example.com) from the host part. Otherwise, it treats the entered URL as the base URL (initial_url).

**Scraping Mode Selection:**

*   specification_of_scraping = input("Single plant(1) or collection of plants(2): "): The user is asked whether to scrape a single plant's data (1) or multiple plants (2).
*   The while loop repeats the prompt until the input is 1 or 2.

**Collection of Plants (Option 2):**

*   all_plants = all_plants_urls(url, initial_url): Calls all_plants_urls() to retrieve the plant-related URLs from the given page.
*   The extracted URLs are stored in a list (f_all_plants).
*   Pagination is handled by appending &page=N to a URL that already has a query string, or ?page=N otherwise, starting at page 2. The loop keeps requesting the next page until a page returns no links.
*   Once all URLs are collected, the code loops through them and calls plant_data() for each plant page.

**Single Plant (Option 1):**

*   data = plant_data(url): Calls plant_data() to scrape the text of the single plant URL.

**Data Output:**

*   In single-plant mode, the scraped lines are printed. In collection mode, the code prints the URLs found on the first page and their count, and stores the scraped text in data.
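The URL check and the base-URL step described above can also be written with the standard library's urllib.parse instead of splitting on '/'. This is a sketch under the same rules as the notebook; `is_gardening_url` and `base_url` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def is_gardening_url(url):
    """Mirror the notebook's rule: 'https://www.' prefix plus 'plant' or 'garden' in the URL."""
    return url.startswith("https://www.") and ("plant" in url or "garden" in url)


def base_url(url):
    """Return scheme://host for a URL, i.e. the value used as initial_url."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


search_url = "https://www.gardenersworld.com/search?q=plants&tab=plants"
print(is_gardening_url(search_url))  # True
print(base_url(search_url))          # https://www.gardenersworld.com
```

Unlike the notebook's version, base_url() keeps whatever scheme the URL has rather than always writing 'https://'. After validation the two agree, because only https URLs pass the check.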



#User Input, URL Validation, and Plant Data Scraping with Pagination

This section manages user input, checks that the URL belongs to a gardening website, and scrapes data for either a single plant or multiple plants, following pagination when the results span several pages. The cell below runs the link-collection step on its own against a Gardeners' World search page and prints the resulting URLs and their count.

In [12]:
import time

search_url = "https://www.gardenersworld.com/search?q=plants&tab=plants"
driver.get(search_url)
time.sleep(5)  # give dynamically loaded content time to render
all_links = driver.find_elements(By.XPATH, '//main//a')
all_links = [link.get_attribute('href') for link in all_links]
# Reference examples for AutoScraper: one link from the footer and one from the header (navbar)
scraper = AutoScraper()
ref = ["https://www.gardenersworld.com", "https://www.gardenersworld.com/plants/"]
result = scraper.build(url=search_url, wanted_list=ref)
list_of_links = list(set(all_links) - set(result))
# get_attribute('href') returns None for anchors without an href, so drop those
f_list_of_links = [link for link in list_of_links if link is not None]
print(f_list_of_links)
print(len(f_list_of_links))

['https://www.gardenersworld.com/plants/actinidia-kolomikta/', 'https://www.gardenersworld.com/plants/gunnera-tinctoria/', 'https://www.gardenersworld.com/plants/cobaea-scandens/', 'https://www.gardenersworld.com/plants/impatiens-new-guinea-group/', 'https://www.gardenersworld.com/plants/callistemon-citrinus-splendens/', 'https://www.gardenersworld.com/plants/phyllostachys-nigra/', 'https://www.gardenersworld.com/plants/buddleja-globosa/', 'https://www.gardenersworld.com/plants/gaultheria-procumbens/', 'https://www.gardenersworld.com/plants/lysimachia-vulgaris/', 'https://www.gardenersworld.com/plants/monarda-didyma/', 'https://www.gardenersworld.com/plants/anemone-blanda/', 'https://www.gardenersworld.com/how-to/grow-plants/peperomia-argyreia/', 'https://www.gardenersworld.com/how-to/grow-plants/clusia-rosea/', 'https://www.gardenersworld.com/plants/mahonia-japonica/', 'https://www.gardenersworld.com/how-to/grow-plants/ulex-europaeus/', 'https://www.gardenersworld.com/how-to/grow-plan