# Web Scraping for Mauritanian License Plates on Facebook Marketplace

## Introduction

This project is part of a competition to assemble a dataset of 1,000 unique images of Mauritanian license plates. The competition has two main parts: data collection and sophisticated analysis. The data collection phase takes a hybrid approach that combines web scraping with practical photography.

This guide covers the web scraping part: we automate the collection of vehicle listing URLs from Facebook Marketplace, then download the images attached to each listing. The goal is to collect images of vehicles with Mauritanian license plates.

## Steps Involved

### 1. Set Up WebDriver
Create a Selenium Chrome WebDriver with the `--disable-notifications` option so that browser notification prompts are suppressed during the session.

### 2. Log in to Facebook
Use Selenium to open the Facebook login page and input your credentials to log in.
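
As one way to keep real credentials out of the notebook, the login values can be read from environment variables instead of being typed into a cell. The sketch below is only an illustration: the variable names `FB_EMAIL` and `FB_PASSWORD` and the helper `load_credentials` are hypothetical and do not appear in the original code.

```python
import os


def load_credentials(env=os.environ):
    """Return (email, password) read from environment variables.

    FB_EMAIL and FB_PASSWORD are hypothetical names chosen for this sketch.
    """
    try:
        return env["FB_EMAIL"], env["FB_PASSWORD"]
    except KeyError as missing:
        # Report which variable is absent without echoing any secret value
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Typical use in the notebook (requires both variables to be set):
# username, password = load_credentials()
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.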

### 3. Navigate to Facebook Marketplace Vehicles Section
Once logged in, navigate to the Facebook Marketplace vehicles section, sorted by newest listings first.
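
The Marketplace address used in this notebook carries its sorting choice in the query string. A minimal sketch of assembling that URL from its parts is shown here; the helper name `build_marketplace_url` is hypothetical, and only the parameter values that appear in the notebook are known to work.

```python
from urllib.parse import urlencode

MARKETPLACE_ROOT = "https://www.facebook.com/marketplace/category"


def build_marketplace_url(category="vehicles", sort_by="creation_time_descend", exact=False):
    """Build a Marketplace category URL with the query parameters used in this notebook."""
    # Booleans are written in lowercase ("false"), matching the URL in the notebook
    query = urlencode({"sortBy": sort_by, "exact": str(exact).lower()})
    return f"{MARKETPLACE_ROOT}/{category}/?{query}"


print(build_marketplace_url())
# https://www.facebook.com/marketplace/category/vehicles/?sortBy=creation_time_descend&exact=false
```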

### 4. Handle Pop-ups
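Close any pop-up that appears over the Marketplace page by clicking its Close button. If no pop-up shows up within the wait timeout, continue without it.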

### 5. Scrape URLs
Define a function that extracts listing links (those whose `href` contains `/marketplace/item/`) from the parsed page, then scroll repeatedly so that more listings load and can be collected.
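
To illustrate the link-filtering part of this step in isolation, the sketch below turns a raw `href` into a canonical listing URL using only the standard library. It assumes, as the notebook does, that listing links contain `/marketplace/item/`; the helper name `normalize_listing_url` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

FACEBOOK_ROOT = "https://www.facebook.com"


def normalize_listing_url(href):
    """Return an absolute listing URL without its query string, or None for other links."""
    # urljoin leaves absolute URLs unchanged and resolves site-relative ones
    parts = urlsplit(urljoin(FACEBOOK_ROOT, href))
    if "/marketplace/item/" not in parts.path:
        return None
    # Drop the query string and fragment so the same listing always maps to one URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(normalize_listing_url("/marketplace/item/123/?ref=search"))
# https://www.facebook.com/marketplace/item/123/
```

Using `urljoin` rather than string concatenation means an `href` that is already absolute is not prefixed a second time.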

### 6. Print and Save URLs
Print the collected URLs and optionally save them to a file.
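
The saving part of this step could look like the following sketch, which writes one URL per line to a text file and reads the list back. The helper names `save_urls` and `load_urls` and the file name `listing_urls.txt` are assumptions made for illustration.

```python
from pathlib import Path


def save_urls(urls, path):
    """Write one URL per line to a UTF-8 text file."""
    Path(path).write_text("\n".join(urls) + "\n", encoding="utf-8")


def load_urls(path):
    """Read the URLs back, skipping empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line]


# Typical use in the notebook:
# save_urls(listing_urls, "listing_urls.txt")
# listing_urls = load_urls("listing_urls.txt")
```

Keeping the list on disk means the download cells can be rerun later without repeating the scrolling step.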


In [3]:
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from bs4 import BeautifulSoup as soup

# Set up the webdriver
options = webdriver.ChromeOptions()
options.add_argument('--disable-notifications')
driver = webdriver.Chrome(options=options)

# Define your login credentials (placeholders: replace before running, and keep real values out of shared notebooks)
username = 'your_email'
password = 'your_facebook_code'

# Log in to Facebook
print("Opening Facebook login page...")
driver.get('https://www.facebook.com/')
email_element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "email")))
email_element.send_keys(username)
password_element = driver.find_element(By.ID, "pass")
password_element.send_keys(password)
password_element.send_keys(Keys.RETURN)

# Wait for login to complete
print("Waiting for login to complete...")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@aria-label='Facebook']")))
print("Login successful.")

# Navigate to the Facebook Marketplace vehicles section sorted by newest first
base_url = "https://www.facebook.com/marketplace/category/vehicles/?sortBy=creation_time_descend&exact=false"
print(f"Navigating to {base_url}...")
driver.get(base_url)

# Close pop-up if it appears
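try:
    close_popup = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@aria-label='Close']")))
    close_popup.click()
    print("Closed pop-up.")
except Exception:  # typically a timeout when no pop-up is shown; unlike a bare except, this does not swallow KeyboardInterrupt
    print("No pop-up to close or failed to close pop-up.")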

# Function to scrape URLs from the marketplace page
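def scrape_urls_from_page(page_soup):
    """Return the listing URLs (query string removed) found in the parsed page."""
    urls = []
    for a in page_soup.find_all('a', href=True):
        href = a['href']
        if '/marketplace/item/' in href:
            path = href.split('?')[0]
            # Prefix site-relative links only; leave absolute links unchanged
            full_url = path if path.startswith('http') else "https://www.facebook.com" + path
            urls.append(full_url)
    return urls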

# Scroll and collect URLs
listing_urls = []
scroll_count = 25  # Adjust this as needed
scroll_delay = 2

for i in range(scroll_count):
    print(f"Scroll iteration {i+1}/{scroll_count}")
    # Parse the HTML
    html = driver.page_source
    market_soup = soup(html, 'html.parser')

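    # Scrape URLs from the current page. page_source holds every listing loaded
    # so far, so links seen in earlier iterations may be appended again.
    urls = scrape_urls_from_page(market_soup)
    listing_urls.extend(urls)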
    
    # Scroll down to load more results
    print("Scrolling down to load more results...")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)

# Print the list of URLs
print("Collected URLs:")
for url in listing_urls:
    print(url)

# End the automated browsing session
print("Quitting the browser...")
driver.quit()

Opening Facebook login page...
Waiting for login to complete...
Login successful.
Navigating to https://www.facebook.com/marketplace/category/vehicles/?sortBy=creation_time_descend&exact=false...
No pop-up to close or failed to close pop-up.
Scroll iteration 1/25
Scrolling down to load more results...
Scroll iteration 2/25
Scrolling down to load more results...
Scroll iteration 3/25
Scrolling down to load more results...
Scroll iteration 4/25
Scrolling down to load more results...
Scroll iteration 5/25
Scrolling down to load more results...
Scroll iteration 6/25
Scrolling down to load more results...
Scroll iteration 7/25
Scrolling down to load more results...
Scroll iteration 8/25
Scrolling down to load more results...
Scroll iteration 9/25
Scrolling down to load more results...
Scroll iteration 10/25
Scrolling down to load more results...
Scroll iteration 11/25
Scrolling down to load more results...
Scroll iteration 12/25
Scrolling down to load more results...
Scroll iteration 13/25


In [4]:
listing_count = len(listing_urls)
print("Number of elements in listing_urls:", listing_count)


Number of elements in listing_urls: 3600
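
Because each scroll iteration re-parses the whole page and appends every matching link, the count above may include the same listing more than once. A minimal sketch of removing repeats while keeping the first-seen order follows; `unique_in_order` and `unique_listing_urls` are hypothetical names, and the result is kept separate from `listing_urls` because the later cells index into the original list.

```python
def unique_in_order(urls):
    """Return the URLs with duplicates removed, preserving first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(urls))


sample = [
    "https://www.facebook.com/marketplace/item/1/",
    "https://www.facebook.com/marketplace/item/2/",
    "https://www.facebook.com/marketplace/item/1/",
]
print(unique_in_order(sample))
# ['https://www.facebook.com/marketplace/item/1/', 'https://www.facebook.com/marketplace/item/2/']

# Typical use in the notebook:
# unique_listing_urls = unique_in_order(listing_urls)
```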


In [10]:
import os
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup

# Set up the webdriver options
options = webdriver.ChromeOptions()
options.add_argument('--disable-notifications')

# Define your login credentials
username = 'your_email'
password = 'your_facebook_code'
# Create the main directory if it doesn't exist
output_dir = "ws_data_fb"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# Function to log in to Facebook
def login_to_facebook(driver, username, password):
    print("Opening Facebook login page...")
    driver.get('https://www.facebook.com/')
    email_element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "email")))
    email_element.send_keys(username)
    password_element = driver.find_element(By.ID, "pass")
    password_element.send_keys(password)
    password_element.send_keys(Keys.RETURN)
    print("Waiting for login to complete...")
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@aria-label='Facebook']")))
    print("Login successful.")

# Function to download images from a listing URL and save them in a specific folder
def download_images_from_listing(driver, url, output_dir):
    print(f"Opening URL: {url}")
    driver.get(url)
    
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//img[@referrerpolicy='origin-when-cross-origin']")))
        html = driver.page_source
        listing_soup = soup(html, 'html.parser')
        images = listing_soup.find_all('img', {"referrerpolicy": "origin-when-cross-origin"})
        
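        image_count = 1
        for img in images:
            image_url = img.get('src')
            if not image_url:
                continue  # skip <img> tags without a usable source
            image_name = f"web{str(image_count).zfill(3)}.jpg"
            image_path = os.path.join(output_dir, image_name)
            print(f"Downloading image: {image_url}")
            
            # Stream the image to disk; the timeout prevents a stalled connection from hanging the run
            with requests.get(image_url, stream=True, timeout=30) as r:
                r.raise_for_status()  # surface HTTP errors instead of saving an error page as .jpg
                with open(image_path, 'wb') as f:
                    for chunk in r.iter_content(chunk_size=8192):
                        f.write(chunk)
            print(f"Saved image as: {image_name}")
            image_count += 1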
    except Exception as e:
        print(f"Error occurred while downloading images: {e}")

# Initialize the WebDriver and log in to Facebook
driver = webdriver.Chrome(options=options)
login_to_facebook(driver, username, password)

  

# Download images from each URL into its respective folder inside ws_data_fb
start_index = 3594
end_index = 3600

for i, url in enumerate(listing_urls[start_index:end_index]):
    # Create a directory for the current URL inside the main output directory
    url_output_dir = os.path.join(output_dir, f"ws_data_fb_url_{start_index + i + 1}")
    if not os.path.exists(url_output_dir):
        os.makedirs(url_output_dir)
        print(f"Created directory: {url_output_dir}")
    
    # Download images for the current URL
    download_images_from_listing(driver, url, url_output_dir)

# End the automated browsing session
print("Quitting the browser...")
driver.quit()
print("Image download process completed.")


Opening Facebook login page...
Waiting for login to complete...
Login successful.
Created directory: ws_data_fb\ws_data_fb_url_3595
Opening URL: https://www.facebook.com/marketplace/item/1452315212060571/
Downloading image: https://scontent.fnkc1-1.fna.fbcdn.net/v/t45.5328-4/427345134_3827438840833059_3906291650081782094_n.jpg?_nc_cat=111&ccb=1-7&_nc_sid=247b10&_nc_ohc=kJS-N5t1zB8Q7kNvgFktigr&_nc_ht=scontent.fnkc1-1.fna&oh=00_AYBF4Q8zY6hMUKm6HrRJQRtoeHGb9gaJ4RbHWNqaGU8FUg&oe=666B4787
Saved image as: web001.jpg
Downloading image: https://scontent.fnkc1-1.fna.fbcdn.net/v/t45.5328-4/427345134_3827438840833059_3906291650081782094_n.jpg?_nc_cat=111&ccb=1-7&_nc_sid=247b10&_nc_ohc=kJS-N5t1zB8Q7kNvgFktigr&_nc_ht=scontent.fnkc1-1.fna&oh=00_AYBF4Q8zY6hMUKm6HrRJQRtoeHGb9gaJ4RbHWNqaGU8FUg&oe=666B4787
Saved image as: web002.jpg
Downloading image: https://scontent.fnkc1-1.fna.fbcdn.net/v/t45.5328-4/427345134_3827438840833059_3906291650081782094_n.jpg?_nc_cat=111&ccb=1-7&_nc_sid=247b10&_nc_ohc=kJS-N5
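
The log above shows the same image URL being saved under consecutive file names. One possible way to avoid this is to filter the image URLs before downloading. The sketch below assumes that two URLs with the same path refer to the same picture even when their query strings differ; that assumption, and the helper name `unique_image_urls`, are not part of the original notebook.

```python
from urllib.parse import urlsplit


def unique_image_urls(urls):
    """Keep the first URL seen for each distinct image path."""
    seen_paths = set()
    unique = []
    for url in urls:
        path = urlsplit(url).path  # ignore query parameters such as size hints or signatures
        if path not in seen_paths:
            seen_paths.add(path)
            unique.append(url)
    return unique


sample = [
    "https://example.com/v/a_n.jpg?stp=dst-jpg_p720x720",
    "https://example.com/v/a_n.jpg?stp=dst-jpg_p180x540",
    "https://example.com/v/b_n.jpg",
]
print(unique_image_urls(sample))
# ['https://example.com/v/a_n.jpg?stp=dst-jpg_p720x720', 'https://example.com/v/b_n.jpg']
```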

In [22]:
# Imports, options, credentials, output_dir, and the helper functions
# login_to_facebook and download_images_from_listing are reused from the first download cell above.

# Initialize the WebDriver and log in to Facebook
driver = webdriver.Chrome(options=options)
login_to_facebook(driver, username, password)

  

# Download images from each URL into its respective folder inside ws_data_fb
start_index = 500
end_index = 600
for i, url in enumerate(listing_urls[start_index:end_index]):
    # Create a directory for the current URL inside the main output directory
    url_output_dir = os.path.join(output_dir, f"ws_data_fb_url_{start_index + i + 1}")
    if not os.path.exists(url_output_dir):
        os.makedirs(url_output_dir)
        print(f"Created directory: {url_output_dir}")
    
    # Download images for the current URL
    download_images_from_listing(driver, url, url_output_dir)

# End the automated browsing session
print("Quitting the browser...")
driver.quit()
print("Image download process completed.")


Opening Facebook login page...
Waiting for login to complete...
Login successful.
Opening URL: https://www.facebook.com/marketplace/item/439682065479890/
Downloading image: https://scontent.fsvq4-1.fna.fbcdn.net/v/t45.5328-4/446005370_3067565393374456_9076624001993343733_n.jpg?stp=dst-jpg_p180x540&_nc_cat=110&ccb=1-7&_nc_sid=247b10&_nc_ohc=Dru0L1DKHNwQ7kNvgGmdHMQ&_nc_ht=scontent.fsvq4-1.fna&oh=00_AYDDE3RgxMmt_8y_BRwt3X4VfIzFE8uA6dTUIlHrEXd_fA&oe=6661EC42
Saved image as: web001.jpg
Downloading image: https://scontent.fsvq4-1.fna.fbcdn.net/v/t45.5328-4/446005370_3067565393374456_9076624001993343733_n.jpg?stp=dst-jpg_p180x540&_nc_cat=110&ccb=1-7&_nc_sid=247b10&_nc_ohc=Dru0L1DKHNwQ7kNvgGmdHMQ&_nc_ht=scontent.fsvq4-1.fna&oh=00_AYDDE3RgxMmt_8y_BRwt3X4VfIzFE8uA6dTUIlHrEXd_fA&oe=6661EC42
Saved image as: web002.jpg
Downloading image: https://scontent.fsvq4-1.fna.fbcdn.net/v/t45.5328-4/427357606_1701460740709098_5654227361108380680_n.jpg?stp=c43.0.260.260a_dst-jpg_p261x260&_nc_cat=105&ccb=1-7&_n

In [11]:
# Imports, options, credentials, output_dir, and the helper functions
# login_to_facebook and download_images_from_listing are reused from the first download cell above.

# Initialize the WebDriver and log in to Facebook
driver = webdriver.Chrome(options=options)
login_to_facebook(driver, username, password)

  

# Download images from each URL into its respective folder inside ws_data_fb
start_index = 300
end_index = 500
for i, url in enumerate(listing_urls[start_index:end_index]):
    # Create a directory for the current URL inside the main output directory
    url_output_dir = os.path.join(output_dir, f"ws_data_fb_url_{start_index + i + 1}")
    if not os.path.exists(url_output_dir):
        os.makedirs(url_output_dir)
        print(f"Created directory: {url_output_dir}")
    
    # Download images for the current URL
    download_images_from_listing(driver, url, url_output_dir)

# End the automated browsing session
print("Quitting the browser...")
driver.quit()
print("Image download process completed.")


Opening Facebook login page...
Waiting for login to complete...
Login successful.
Created directory: ws_data_fb\ws_data_fb_url_301
Opening URL: https://www.facebook.com/marketplace/item/385766604486668/
Downloading image: https://scontent.fnkc1-1.fna.fbcdn.net/v/t45.5328-4/438232257_984518779817119_7347557751758857392_n.jpg?stp=dst-jpg_p720x720&_nc_cat=104&ccb=1-7&_nc_sid=247b10&_nc_ohc=0I3Pr3xEpk4Q7kNvgF1M_us&_nc_ht=scontent.fnkc1-1.fna&oh=00_AYBeEQFG7RA8IyQOmo_st0oOJBFpHqwJEDqBLHz7-LrKbA&oe=666B493D
Saved image as: web001.jpg
Downloading image: https://scontent.fnkc1-1.fna.fbcdn.net/v/t45.5328-4/438232257_984518779817119_7347557751758857392_n.jpg?stp=dst-jpg_p720x720&_nc_cat=104&ccb=1-7&_nc_sid=247b10&_nc_ohc=0I3Pr3xEpk4Q7kNvgF1M_us&_nc_ht=scontent.fnkc1-1.fna&oh=00_AYBeEQFG7RA8IyQOmo_st0oOJBFpHqwJEDqBLHz7-LrKbA&oe=666B493D
Saved image as: web002.jpg
Downloading image: https://scontent.fnkc1-1.fna.fbcdn.net/v/t45.5328-4/438232257_984518779817119_7347557751758857392_n.jpg?stp=dst-jpg_
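
The download cells above are rerun with hand-edited `start_index` and `end_index` values. As a sketch of generating those ranges instead, the hypothetical helper below yields consecutive `(start, end)` pairs that can be used to slice `listing_urls`; the batch size of 100 is an arbitrary choice for illustration.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in consecutive batches."""
    for start in range(0, total, batch_size):
        # The last batch is shortened so that end never exceeds total
        yield start, min(start + batch_size, total)


print(list(batch_ranges(250, 100)))
# [(0, 100), (100, 200), (200, 250)]

# Typical use in the notebook:
# for start_index, end_index in batch_ranges(len(listing_urls), 100):
#     for i, url in enumerate(listing_urls[start_index:end_index]):
#         ...
```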