# Scraping and Saving eBay Products Data

## Project Overview

This Jupyter Notebook scrapes data about sold graphics cards from eBay. Selenium loads each search results page so that dynamically rendered content is present, and BeautifulSoup parses the resulting HTML. For each listing, the notebook extracts the title, selling price, additional details, product link, and the date the item was sold.

This is the first of four notebooks that walk through the entire data pipeline. After collecting the data here, the later notebooks clean, organize, transform, analyze, and visualize it.

This notebook covers the scraping and data extraction phase only. It defines functions that load each page, parse its HTML, extract the fields listed above, and fill any missing field with `'N/A'`. The scraped data is then saved in two formats, a CSV file and an SQLite database, so that later stages can read it either way.

---

## Imports

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import time
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import sqlite3

## Define the parse function

In [2]:
def parse(soup):
    products = []
    results = soup.find_all('li', {'class': 's-item s-item__pl-on-bottom'})

    for item in results:
        title = item.find('span', {'role': 'heading'})
        price = item.find('span', {'class': 's-item__price'})
        subtitle = item.find('div', {'class': 's-item__subtitle'})
        link = item.find('a', {'class': 's-item__link'})
        sold_date = item.find('span', {'class': 'POSITIVE'})

        product = {
            'title': title.text.strip() if title else 'N/A',
            'price': price.text.strip() if price else 'N/A',
            'info': subtitle.text.strip() if subtitle else 'N/A',
            'link': link['href'] if link else 'N/A',
            'sold_date': parse_date(sold_date.text.strip()) if sold_date else 'N/A',
        }

        products.append(product)

    return products

## Define the parse_date function

In [3]:
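def parse_date(date_str):
    # The format string below expects labels such as "Sold Jun 24, 2023"
    if not date_str.startswith('Sold'):
        return 'N/A'
    date_str = date_str[len('Sold'):].strip()
    try:
        return datetime.strptime(date_str, '%b %d, %Y').strftime('%Y-%m-%d')
    except ValueError:
        # The label starts with "Sold" but the date is not in the expected format
        return 'N/A'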
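
The conversion inside `parse_date` can be traced step by step on a single label. The following is a minimal sketch, assuming a label of the form `Sold Jun 24, 2023` (the sample strings are made up for illustration). It also shows one reason the `YYYY-MM-DD` output format is convenient: such strings sort chronologically even when stored as plain text. Note that `%b` matches month abbreviations in the current locale, so the sketch assumes English month names.

```python
from datetime import datetime

label = 'Sold Jun 24, 2023'   # assumed sample label, not taken from a real listing

# Step 1: drop the leading "Sold " and surrounding whitespace
date_part = label[5:].strip()

# Step 2: parse the remaining text, then re-format it as an ISO-style date string
parsed = datetime.strptime(date_part, '%b %d, %Y')
iso_date = parsed.strftime('%Y-%m-%d')
print(date_part, '->', iso_date)

# ISO-style strings sort chronologically with a plain string sort
iso_dates = ['2023-06-24', '2023-01-05', '2022-12-31']
print(sorted(iso_dates))
```

Because of this ordering property, the `sold_date` values can be sorted or range-filtered later without first converting them back to date objects.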

## Define the get_data function

In [4]:
def get_data(url, user_agent):
    options = Options()
    options.add_argument(f'user-agent={user_agent}')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(3)  # Fixed delay to give dynamic content time to load
        html = driver.page_source
    finally:
        driver.quit()  # Close the browser even if loading the page fails
    return BeautifulSoup(html, 'html.parser')

## Set URL and user agent

In [None]:
url = "https://www.ebay.com/sch/i.html?_fsrp=1&_from=R40&_nkw=graphics+card&_sacat=0&LH_Complete=1&LH_Sold=1&_fss=1&LH_SellerWithStore=1&LH_PrefLoc=1&_udlo=50&_udhi=3500&LH_ItemCondition=1000%7C1500%7C2000%7C2020%7C2500%7C3000&Brand=ASUS%7CGIGABYTE%7CEVGA%7CPowerColor%7CNVIDIA%7CPNY%7CSAPPHIRE%7CZOTAC%7CXFX%7CMSI&Chipset%2520Manufacturer=NVIDIA%7CAMD&Memory%2520Type=GDDR5%7CGDDR5X%7CGDDR6%7CGDDR6X&Chipset%252FGPU%2520Model=AMD%2520Radeon%2520R9%2520390%7CAMD%2520Radeon%2520RX%2520470%7CAMD%2520Radeon%2520RX%2520480%7CAMD%2520Radeon%2520RX%2520460%7CAMD%2520Radeon%2520R9%2520390X%7CAMD%2520Radeon%2520RX%25205500%2520XT%7CAMD%2520Radeon%2520RX%2520560%7CAMD%2520Radeon%2520RX%2520570%7CAMD%2520Radeon%2520RX%25205700%7CAMD%2520Radeon%2520RX%25205700%2520XT%7CAMD%2520Radeon%2520RX%2520580%7CAMD%2520Radeon%2520RX%25206800%7CAMD%2520Radeon%2520RX%25206800%2520XT%7CAMD%2520Radeon%2520RX%25206900%2520XT%7CNVIDIA%2520GeForce%2520GTX%25201080%7CNVIDIA%2520GeForce%2520GTX%25201080%2520Ti%7CNVIDIA%2520GeForce%2520GTX%25201660%7CNVIDIA%2520GeForce%2520GTX%2520970%7CNVIDIA%2520GeForce%2520GTX%2520980%7CNVIDIA%2520GeForce%2520GTX%2520980%2520Ti%7CNVIDIA%2520GeForce%2520GTX%2520TITAN%7CNVIDIA%2520GeForce%2520GTX%2520TITAN%2520X%7CNVIDIA%2520GeForce%2520GTX%2520TITAN%2520Xp%7CNVIDIA%2520GeForce%2520RTX%25202060%7CNVIDIA%2520GeForce%2520RTX%25202070%7CNVIDIA%2520GeForce%2520RTX%25202070%2520Founders%2520Edition%7CNVIDIA%2520GeForce%2520RTX%25202080%7CNVIDIA%2520GeForce%2520RTX%25202080%2520Founders%2520Edition%7CNVIDIA%2520GeForce%2520RTX%25202080%2520Ti%7CNVIDIA%2520GeForce%2520RTX%25202080%2520Ti%2520Founders%2520Edition%7CNVIDIA%2520GeForce%2520RTX%25203060%7CNVIDIA%2520GeForce%2520RTX%25203060%2520Ti%7CNVIDIA%2520GeForce%2520RTX%25203070%7CNVIDIA%2520GeForce%2520RTX%25203080%7CNVIDIA%2520Quadro%25204000%7CNVIDIA%2520GeForce%2520GTX%25201060%7CNVIDIA%2520GeForce%2520GTX%25201050%2520Ti%7CNVIDIA%2520GeForce%2520GTX%25201050%7CNVIDIA%2520GeForce%2520GTX%25201070%2520Ti%7CNVIDIA%2520GeForce%2520GTX%25201070%7CNVIDIA%2520GeForce%2520GT%25201030&_ipg=240&rt=nc&Memory%2520Size=11%2520GB%7C4%2520GB%7C8%2520GB%7C12%2520GB%7C6%2520GB%7C16%2520GB%7C2%2520GB%7C10%2520GB&_dcat=27386"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.58"

## Set display options for pandas DataFrame

In [None]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## Set the maximum number of pages to scrape

In [None]:
max_pages = 20

## Create an empty list to store all the products

In [5]:
all_products = []

## Scrape data from each page

In [None]:
for page in range(1, max_pages+1):
    print(f"Scraping page {page}...")
    current_url = url + f"&_pgn={page}"
    soup = get_data(current_url, user_agent)
    products = parse(soup)
    all_products.extend(products)
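
The loop above builds each page URL by appending `&_pgn={page}` to the base URL, which works as long as the base URL has no `_pgn` parameter of its own. As an alternative, the sketch below sets the parameter through `urllib.parse`, replacing any existing value. The helper name `with_page` and the shortened `base_url` are assumptions made for illustration, and re-encoding the query this way may normalize how special characters are escaped.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url, page):
    """Return url with its _pgn query parameter set to page (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous _pgn value
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != '_pgn']
    params.append(('_pgn', str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))

# Shortened sample URL, standing in for the long search URL defined earlier
base_url = 'https://www.ebay.com/sch/i.html?_nkw=graphics+card&_ipg=240'
page_urls = [with_page(base_url, page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

Calling `with_page` on a URL that already carries `_pgn` yields a single, updated parameter rather than a duplicate.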

## Create a DataFrame from the collected products

In [None]:
df = pd.DataFrame(all_products)

## Save DataFrame as CSV

In [None]:
filename = "ebay_products.csv"
df.to_csv(filename, index=False)
print(f"CSV file '{filename}' has been saved.")

## Save DataFrame as SQLite database

In [None]:
database_filename = "ebay_products.db"
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()

# Create the table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        title TEXT,
        price TEXT,
        info TEXT,
        link TEXT,
        sold_date TEXT
    )
''')

# Insert data into the table
for _, row in df.iterrows():
    cursor.execute('''
        INSERT INTO products (title, price, info, link, sold_date)
        VALUES (?, ?, ?, ?, ?)
    ''', (row['title'], row['price'], row['info'], row['link'], row['sold_date']))

# Commit the changes and close the connection
conn.commit()
conn.close()

print(f"SQLite database file '{database_filename}' has been created.")
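
Running the cell above a second time inserts every row again, because nothing marks a listing as already stored. One possible way to make the save step repeatable is sketched below, under assumptions that are not part of this notebook: a `UNIQUE` constraint on `link`, `INSERT OR IGNORE`, a hypothetical `save_products` helper, and made-up sample rows in an in-memory database. It also batches the inserts with `executemany` instead of looping over rows. A caveat of this design is that rows whose link is `'N/A'` would collapse into one.

```python
import sqlite3

# Hypothetical sample rows standing in for the scraped products
rows = [
    {'title': 'Sample GPU A', 'price': '$250.00', 'info': 'Pre-Owned',
     'link': 'https://example.com/itm/1', 'sold_date': '2023-06-24'},
    {'title': 'Sample GPU B', 'price': '$410.00', 'info': 'Brand New',
     'link': 'https://example.com/itm/2', 'sold_date': '2023-06-25'},
]

def save_products(conn, rows):
    # "with conn" commits on success and rolls back on error;
    # it does not close the connection
    with conn:
        conn.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY,
                title TEXT,
                price TEXT,
                info TEXT,
                link TEXT UNIQUE,
                sold_date TEXT
            )
        ''')
        # Named placeholders are filled from each dictionary in rows
        conn.executemany('''
            INSERT OR IGNORE INTO products (title, price, info, link, sold_date)
            VALUES (:title, :price, :info, :link, :sold_date)
        ''', rows)

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
save_products(conn, rows)
save_products(conn, rows)  # the second run adds nothing: the links already exist

row_count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
print(row_count)
conn.close()
```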


---

## Conclusion

This notebook used Selenium and BeautifulSoup to scrape data about sold graphics cards from eBay. The parsed listings are stored in two formats: a CSV file and an SQLite database. These files are the starting point for the data analysis and application development in the subsequent stages of the project.