## **eBay Product Data Scraping:**



**Project Overview:**
This notebook demonstrates the process of scraping product information from eBay. The objective is to collect data about various items listed on eBay to analyze their prices, discounts, shipping costs, locations, and other relevant attributes

**Objectives**

1.   **Scrape Product Information:**
Product Name,
Price,
Discount,
Shipping Cost,
Location,
Additional Information (e.g., Toy Type)
2.   **Data Cleaning and Preparation:** Process the raw data to ensure consistency and prepare it for analysis.
3. **Data Analysis:** Create a DataFrame to organize and analyze the collected data.


In [83]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

In [84]:
# Lists to hold all scraped data
prices = []
names = []
discounts = []
shipping_costs = []
locations = []
toy_types = []

# Loop through pages
for i in range(1, 21):
    url = f"https://www.ebay.com/sch/i.html?_from=R40&_nkw=airplanes&_sacat=0&LH_TitleDesc=0&_sop=12&LH_PrefLoc=2&imm=1&_pgn={i}"
    response = requests.get(url).text
    html = bs(response, 'html.parser')

    # Extract product information
    for item in html.find_all('div', class_='s-item__info'):
        # Extract price
        price = item.find('span', {'class': 's-item__price'})
        if price:
            price_text = price.get_text(strip=True)
            if price_text.startswith('$') and price_text[1:].replace('.', '', 1).isdigit():
                prices.append(price_text)
            else:
                prices.append(None)

        # Extract name
        name = item.find('div', class_='s-item__title')
        if name:
            names.append(name.get_text(strip=True))
        else:
            names.append(None)

        # Extract discount
        discount = item.find('span', {'class': 'BOLD'})
        if discount and discount.get_text(strip=True).endswith('off'):
            disc_text = discount.get_text(strip=True).replace('Extra ', '')
            discounts.append(disc_text)
        else:
            discounts.append(None)

        # Extract shipping cost
        ship = item.find('span', {'class': 's-item__shipping s-item__logisticsCost'})
        if ship:
            ship_text = ship.get_text(strip=True)
            if 'Free shipping' in ship_text:
                shipping_costs.append('0')
            elif ship_text.startswith('$') or ship_text.startswith('+'):
                ship_text = ship_text.replace(' shipping', '').replace(' shipping estimate', '')
                shipping_costs.append(ship_text)
            else:
                shipping_costs.append(None)
        else:
            shipping_costs.append(None)

        # Extract location
        location = item.find('span', {'class': 's-item__location s-item__itemLocation'})
        if location:
            locations.append(location.get_text(strip=True).replace('From ', ''))
        else:
            locations.append(None)

        # Extract toy type
        toy_type = item.find('span', {'class': 'SECONDARY_INFO'})
        if toy_type:
            toy_types.append(toy_type.get_text(strip=True))
        else:
            toy_types.append(None)

# Ensure all lists are of the same length
max_length = max(len(prices), len(names), len(discounts), len(shipping_costs), len(locations), len(toy_types))

prices.extend([None] * (max_length - len(prices)))
names.extend([None] * (max_length - len(names)))
discounts.extend([None] * (max_length - len(discounts)))
shipping_costs.extend([None] * (max_length - len(shipping_costs)))
locations.extend([None] * (max_length - len(locations)))
toy_types.extend([None] * (max_length - len(toy_types)))


In [85]:
# Create DataFrame
data = {
    'Price': prices,
    'Name': names,
    'Discount': discounts,
    'Shipping Cost': shipping_costs,
    'Location': locations,
    'Toy Type': toy_types
}

ebay_df = pd.DataFrame(data)

# Display the DataFrame
ebay_df.head()

Unnamed: 0,Price,Name,Discount,Shipping Cost,Location,Toy Type
0,$20.00,Shop on eBay,,,,Brand New
1,$20.00,Shop on eBay,,,,Brand New
2,$199.99,E-flite UMX Me 262 EDF BNF Basic EFLU31050,,0,,Brand New
3,$77.12,Rc plane p40 new from EFLITE,,+$41.20 estimate,from United Kingdom,Pre-Owned
4,$129.99,E-flite RC Airplane UMX P-51 Voodoo BNF EFLU...,,0,,Brand New


## **Performing Necesssary EDA**

In [86]:
# Information of Data
ebay_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1529 entries, 0 to 1528
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Price          1441 non-null   object
 1   Name           1529 non-null   object
 2   Discount       214 non-null    object
 3   Shipping Cost  1473 non-null   object
 4   Location       387 non-null    object
 5   Toy Type       1497 non-null   object
dtypes: object(6)
memory usage: 71.8+ KB


In [87]:
# Checking Null Values
ebay_df.isnull().sum()

Price              88
Name                0
Discount         1315
Shipping Cost      56
Location         1142
Toy Type           32
dtype: int64

In [88]:
# Chceking Duplicates
ebay_df.duplicated().sum()

102

## **Cleaning The Data**

In [89]:
# Fill missing values or drop rows/columns with missing values
ebay_df['Price'].fillna('Unknown', inplace=True)
ebay_df['Discount'].fillna('No Discount', inplace=True)
ebay_df['Shipping Cost'].fillna('Unknown', inplace=True)
ebay_df['Location'].fillna('Unknown', inplace=True)
ebay_df['Toy Type'].fillna('Unknown', inplace=True)

# Drop duplicates
ebay_df.drop_duplicates(inplace=True)


In [90]:
import numpy as np

# Replace non-numeric values with NaN
ebay_df['Price'] = ebay_df['Price'].replace('Free shipping', '0')
ebay_df['Price'] = ebay_df['Price'].replace('[\$,]', '', regex=True)

# Handle any non-numeric entries by converting to NaN
ebay_df['Price'] = pd.to_numeric(ebay_df['Price'], errors='coerce')

# Similarly for Shipping Cost
ebay_df['Shipping Cost'] = ebay_df['Shipping Cost'].replace('Free shipping', '0')
ebay_df['Shipping Cost'] = ebay_df['Shipping Cost'].replace('[\$,]', '', regex=True)
ebay_df['Shipping Cost'] = pd.to_numeric(ebay_df['Shipping Cost'], errors='coerce')


In [91]:
# Fill NaN values with a specific value (e.g., 0) or drop them
ebay_df['Price'].fillna(0, inplace=True)
ebay_df['Shipping Cost'].fillna(0, inplace=True)


In [92]:
# Verify the data types
print(ebay_df.dtypes)

# Check for any remaining non-numeric values
print(ebay_df[['Price', 'Shipping Cost']].head())


Price            float64
Name              object
Discount          object
Shipping Cost    float64
Location          object
Toy Type          object
dtype: object
    Price  Shipping Cost
0   20.00            0.0
2  199.99            0.0
3   77.12            0.0
4  129.99            0.0
5   59.91            0.0


In [93]:
# Droping unnecessary row
ebay_df = ebay_df[ebay_df['Name'] != 'Shop on eBay']
cols = ['Name'] + [col for col in ebay_df.columns if col != 'Name']
ebay_df = ebay_df[cols]

# Resetting the Index
ebay_df.to_csv('reordered_ebay_data.csv', index=False)
ebay_df = ebay_df.reset_index(drop=True)
ebay_df.head()

Unnamed: 0,Name,Price,Discount,Shipping Cost,Location,Toy Type
0,E-flite UMX Me 262 EDF BNF Basic EFLU31050,199.99,No Discount,0.0,Unknown,Brand New
1,Rc plane p40 new from EFLITE,77.12,No Discount,0.0,from United Kingdom,Pre-Owned
2,E-flite RC Airplane UMX P-51 Voodoo BNF EFLU...,129.99,No Discount,0.0,Unknown,Brand New
3,4-CH Spitfire One Key Remote Control Airplane ...,59.91,63% off,0.0,Unknown,Pre-Owned
4,SIG SUPERCOAT FUEL PROOF HIGH GLOSS DOPE MODE...,29.99,No Discount,14.86,Unknown,New (Other)


## Data Analysis

In [94]:
# Importing necessary libraries for visualizations
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Used Plotly to create a series of visualizations to explore different aspects of the data

fig = px.histogram(ebay_df, x='Price', title='Distribution of Prices',
                   labels={'Price': 'Price ($)'}, nbins=30)
fig.update_layout(bargap=0.1)
fig.show()



In [95]:
# Shipping Cost Distributions
fig = px.box(ebay_df, y='Shipping Cost', title='Shipping Cost Distribution',
             labels={'Shipping Cost': 'Shipping Cost ($)'})
fig.update_layout(yaxis_title='Shipping Cost ($)')
fig.show()


In [96]:
# Average Shipping Cost By Locations
avg_shipping_by_location = ebay_df.groupby('Location')['Shipping Cost'].mean().reset_index()
fig = px.bar(avg_shipping_by_location, x='Location', y='Shipping Cost', title='Average Shipping Cost by Location',
             labels={'Shipping Cost': 'Average Shipping Cost ($)'})
fig.update_layout(xaxis_title='Location', yaxis_title='Average Shipping Cost ($)')
fig.show()


In [97]:
# Extract numeric part of discount and clean data
ebay_df['Discount'] = ebay_df['Discount'].str.extract('(\d+)').astype(float)
discount_counts = ebay_df['Discount'].value_counts().reset_index()
discount_counts.columns = ['Discount (%)', 'Count']

fig = px.bar(discount_counts, x='Discount (%)', y='Count', title='Frequency of Discounts',
             labels={'Discount (%)': 'Discount Percentage', 'Count': 'Number of Items'})
fig.update_layout(xaxis_title='Discount Percentage', yaxis_title='Number of Items')
fig.show()


In [98]:
# Plotting necessary Subplots

fig = make_subplots(rows=2, cols=2, subplot_titles=('Price Distribution', 'Shipping Cost Distribution',
                                                    'Average Shipping Cost by Location', 'Discount Frequencies'))

# Price Distribution
fig.add_trace(go.Histogram(x=ebay_df['Price'], nbinsx=30, name='Price'), row=1, col=1)

# Shipping Cost Distribution
fig.add_trace(go.Box(y=ebay_df['Shipping Cost'], name='Shipping Cost'), row=1, col=2)

# Average Shipping Cost by Location
fig.add_trace(go.Bar(x=avg_shipping_by_location['Location'], y=avg_shipping_by_location['Shipping Cost'],
                     name='Average Shipping Cost'), row=2, col=1)

# Discount Frequencies
fig.add_trace(go.Bar(x=discount_counts['Discount (%)'], y=discount_counts['Count'], name='Discount Frequency'),
              row=2, col=2)

fig.update_layout(title_text='eBay Data Analysis', showlegend=False)
fig.show()


### Conclusion

In this project, we extracted, cleaned, and visualized eBay product data to gain insights into pricing, shipping costs, and discounts. The visualizations provided a comprehensive overview of the data, helping us understand the distribution and relationships of various features