<a href="https://colab.research.google.com/github/BillySiaga/Project2025/blob/main/Python_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Python Project: Extracting, Cleaning, and Analyzing Product Data
### What You're Aiming For

Extract, clean, and analyze product data from an online retailer's platform to identify pricing trends, product availability, and promotional patterns across various categories.

### Instructions

Steps:

1. Web Scraping
   - Use Python libraries such as BeautifulSoup to scrape product information from a website.
   - Collect data attributes including product names, categories, prices, availability status, and promotional details.
2. Data Cleaning
   - Address missing or inconsistent data entries, such as absent prices or ambiguous product descriptions.
   - Standardize text fields to ensure uniformity in product names and categories.
3. Data Transformation
   - Convert price data into numerical formats for analysis.
   - Categorize products into hierarchical groups (e.g., Electronics > Mobile Phones > Smartphones).
4. Data Analysis
   - Conduct exploratory data analysis (EDA) to uncover insights:
     - Identify average pricing within each product category.
     - Detect seasonal or promotional pricing patterns.
     - Assess product availability trends over time.
5. Data Visualization
   - Use the Plotly visualization library to create charts.

## Step 1: Web Scraping
Utilize Python libraries such as BeautifulSoup to scrape product information from an online website.
Collect data attributes including product names, categories, prices, availability status, and promotional details.

In [None]:
# Web scraping: import the libraries used throughout the notebook

from bs4 import BeautifulSoup  # HTML parsing
import requests                # HTTP requests
import pandas as pd            # tabular data handling
import numpy as np

In [6]:
# Fetch the Books to Scrape sandbox site (the source of the HTML printed in the next cell)
url = 'http://books.toscrape.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
page.status_code

200

In [26]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [29]:
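# Scrape the first catalogue page of Books to Scrape
url = "http://books.toscrape.com/"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, "html.parser")

books = []

# Each book on the page sits in an <article class="product_pod"> element
for item in soup.select("article.product_pod"):
    title = item.h3.a['title']  # the full title is stored in the link's title attribute
    price = item.select_one(".price_color").text  # printed below as "Â£51.77" because the response text is mis-decoded
    availability = item.select_one(".availability").text.strip()

    books.append({
        "name": title,
        "price": price,
        "availability": availability,
        "category": "Books",     # fixed value: every item scraped here is a book
        "promotion": "No Promo"  # placeholder: no promotion data is scraped
    })

df = pd.DataFrame(books)
print(df.head(30))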


                                                 name    price availability  \
0                                A Light in the Attic  Â£51.77     In stock   
1                                  Tipping the Velvet  Â£53.74     In stock   
2                                          Soumission  Â£50.10     In stock   
3                                       Sharp Objects  Â£47.82     In stock   
4               Sapiens: A Brief History of Humankind  Â£54.23     In stock   
5                                     The Requiem Red  Â£22.65     In stock   
6   The Dirty Little Secrets of Getting Your Dream...  Â£33.34     In stock   
7   The Coming Woman: A Novel Based on the Life of...  Â£17.93     In stock   
8   The Boys in the Boat: Nine Americans and Their...  Â£22.60     In stock   
9                                     The Black Maria  Â£52.15     In stock   
10     Starving Hearts (Triangular Trade Trilogy, #1)  Â£13.99     In stock   
11                              Shakespeare's Sonnet
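The `Â£` in the price column looks like a decoding problem, not bad data. The page declares `charset=utf-8` in its `<meta>` tag, and `requests` falls back to ISO-8859-1 for `response.text` when the HTTP header names no charset. Assuming that is what happened here, the minimal sketch below reproduces the artifact with the standard library only, without touching the network.

```python
# Reproduce the "Â£" artifact without any network access.
raw_bytes = "£51.77".encode("utf-8")      # the bytes a UTF-8 page would send

garbled = raw_bytes.decode("iso-8859-1")  # wrong codec: each byte becomes one character
correct = raw_bytes.decode("utf-8")       # right codec

print(garbled)
print(correct)

# Text that is already garbled can be repaired by reversing the wrong decode.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

In the scraping cell, the equivalent fix would likely be to set `response.encoding = "utf-8"` before reading `response.text`, or to pass `response.content` to BeautifulSoup so it can detect the charset from the page itself.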

## Data Cleaning
Address missing or inconsistent data entries, such as absent prices or ambiguous product descriptions.
Standardize text fields to ensure uniformity in product names and categories.


In [30]:
#checking missing values
df.isnull().sum()

name            0
price           0
availability    0
category        0
promotion       0
dtype: int64


In [33]:
# standardizing item fields to ensure consistency
df['name'] = df['name'].str.lower()
df['category'] = df['category'].str.lower()
df.columns.tolist()


['name', 'price', 'availability', 'category', 'promotion']
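The scraped page has no missing values, so the cleaning step is not really exercised. As a rough illustration of what handling missing and inconsistent entries could look like, the sketch below works on a small hypothetical DataFrame (the rows and the "10% off" promotion are made up for the example). Dropping rows without a name or price, and treating a missing promotion as "No Promo", are assumptions rather than the only reasonable policy.

```python
import pandas as pd

# Hypothetical rows with the kinds of problems the cleaning step describes.
raw = pd.DataFrame({
    "name": ["  A Light in the Attic ", "Soumission", None, "Sharp Objects"],
    "price": ["£51.77", None, "£20.66", "£47.82"],
    "category": ["Books", "books ", "BOOKS", " Books"],
    "promotion": [None, "No Promo", None, "10% off"],
})

clean = raw.copy()

# Trim whitespace and lower-case the text fields.
for col in ["name", "category"]:
    clean[col] = clean[col].str.strip().str.lower()

# A row without a name or a price cannot be analysed, so drop it.
clean = clean.dropna(subset=["name", "price"]).reset_index(drop=True)

# Treat a missing promotion as "No Promo" (an assumption about the data).
clean["promotion"] = clean["promotion"].fillna("No Promo")

print(clean)
```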

## Data Transformation

In [39]:
# Convert price data into a numeric format for analysis.

df['price'] = df['price'].astype(str)  # ensure string values before cleaning
# Strip everything except digits and the decimal point (removes the 'Â' artifact and the '£' symbol)
df['price'] = df['price'].str.replace(r'[^\d.]', '', regex=True)
df['price'] = df['price'].astype(float)
print(df['price'].describe())

count    20.000000
mean     38.048500
std      15.135231
min      13.990000
25%      22.637500
50%      41.380000
75%      51.865000
max      57.250000
Name: price, dtype: float64
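Price strings from other sites may carry different currency symbols, thousands separators, or no number at all. The helper below is a hypothetical sketch (`parse_price` is not part of the notebook) that pulls the first decimal number out of a string and returns `None` when there is none. It assumes "." is the decimal separator and "," is a thousands separator.

```python
import re
from typing import Optional


def parse_price(text: object) -> Optional[float]:
    """Return the first decimal number found in `text`, or None if there is none."""
    if text is None:
        return None
    # Drop thousands separators, then look for digits with an optional decimal part.
    match = re.search(r"\d+(?:\.\d+)?", str(text).replace(",", ""))
    return float(match.group()) if match else None


for sample in ["Â£51.77", "£13.99", "$1,299.00", "N/A", None]:
    print(repr(sample), "->", parse_price(sample))
```

Applied to the notebook's data, this could be used as `df['price'].map(parse_price)`, which leaves unparseable entries as missing values instead of raising an error.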


In [40]:
# Categorize products into hierarchical groups.
# Create a new column holding the hierarchical category path.
# Every row has the category 'books', so each path reads "Books > books".
df['hierarchical_category'] = "Books > " + df['category']
print(df.head(30))


                                                 name  price availability  \
0                                a light in the attic  51.77     In stock   
1                                  tipping the velvet  53.74     In stock   
2                                          soumission  50.10     In stock   
3                                       sharp objects  47.82     In stock   
4               sapiens: a brief history of humankind  54.23     In stock   
5                                     the requiem red  22.65     In stock   
6   the dirty little secrets of getting your dream...  33.34     In stock   
7   the coming woman: a novel based on the life of...  17.93     In stock   
8   the boys in the boat: nine americans and their...  22.60     In stock   
9                                     the black maria  52.15     In stock   
10     starving hearts (triangular trade trilogy, #1)  13.99     In stock   
11                              shakespeare's sonnets  20.66     In stock   

In [42]:
df.columns.tolist()

['name',
 'price',
 'availability',
 'category',
 'promotion',
 'hierarchical_category']
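With only one category, the hierarchy here has a single branch. For data with deeper paths, such as the "Electronics > Mobile Phones > Smartphones" example from the instructions, one possible approach is to split the path into one column per level. The sketch below uses hypothetical rows and hypothetical column names (`level_1`, `level_2`, `level_3`).

```python
import pandas as pd

# Hypothetical category paths, using the " > " separator from the project brief.
products = pd.DataFrame({
    "name": ["phone a", "novel b"],
    "hierarchical_category": [
        "Electronics > Mobile Phones > Smartphones",
        "Books > Fiction",
    ],
})

# Split each path into one column per level; shorter paths are padded with missing values.
levels = products["hierarchical_category"].str.split(" > ", expand=True)
levels.columns = [f"level_{i + 1}" for i in range(levels.shape[1])]

products = products.join(levels)
print(products[["name", "level_1", "level_2", "level_3"]])
```

Separate level columns make it possible to group at any depth, for example `products.groupby("level_1")`.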

## Data Analysis
Conduct exploratory data analysis (EDA) to uncover insights.

In [47]:
# data analysis
#Identify average pricing within each product category.
#
print(df.groupby('category')['price'].mean())

category
books    38.0485
Name: price, dtype: float64
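Because the scraped data contains a single category, one availability state, and one promotion value, the remaining EDA questions cannot be answered from it directly. The sketch below shows how they might be approached on a hypothetical sample with more variety; the values are invented for illustration. Seasonal trends would additionally need repeated scrapes with a timestamp column, which this notebook does not collect.

```python
import pandas as pd

# Hypothetical sample with more than one category, availability state, and promotion.
sample = pd.DataFrame({
    "category": ["books", "books", "music", "music"],
    "price": [10.0, 20.0, 30.0, 50.0],
    "availability": ["In stock", "Out of stock", "In stock", "In stock"],
    "promotion": ["No Promo", "Sale", "No Promo", "Sale"],
})

# Price summary per category.
price_summary = sample.groupby("category")["price"].agg(["mean", "min", "max"])
print(price_summary)

# Share of listings that are in stock, per category.
in_stock_share = sample["availability"].eq("In stock").groupby(sample["category"]).mean()
print(in_stock_share)

# Average price with and without a promotion.
promo_price = sample.groupby("promotion")["price"].mean()
print(promo_price)
```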


## Data Visualization

Use the Plotly visualization library to create charts.

In [48]:
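# Data visualization with Plotly
import plotly.express as px

# Aggregate first so the bar height is the mean price, not the sum of all prices
avg_price = df.groupby('category', as_index=False)['price'].mean()
fig = px.bar(avg_price, x='category', y='price', color='category', title='Average Price by Category')
fig.show()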