**Instruction:**

Extract, clean, and analyze product data from an online retailer's platform to identify pricing trends, product availability, and promotional patterns across various categories.

In [1]:
# To silence warnings for a cleaner output

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing the necessary libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import plotly.express as px
import plotly.graph_objects as go

In [3]:
# Web Scraping

url = "https://aziza.tn/fr/home"

def scrape_product_data(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers, timeout=30)  # timeout so a stalled connection cannot hang the notebook

    if page.status_code != 200:
        print("Failed to retrieve the website")
        return []

    soup = BeautifulSoup(page.content, 'html.parser')
    data = []

    for product in soup.find_all('li', class_='product-item'):

        # Extract product name
        name_tag = product.find('a', class_='product-item-link')
        name = name_tag.text.strip() if name_tag else 'Unknown'    # Missing "name_tag" will be scraped as "Unknown"

        # Extract category
        category_tag = product.find('span', class_='brand')
        category = category_tag.text.strip() if category_tag else 'General'    # Missing "category_tag" will be scraped as "General"

        # Extract product price
        price_tag = product.find('span', class_='price')
        price = price_tag.text.strip() if price_tag else '0'    # Missing "price_tag" will be scraped as "0"

        # Extract availability status (assuming it's in the 'tocart' button)
        availability_tag = product.find('button', class_='tocart')
        if availability_tag:
            button_text = availability_tag.text.strip().lower()
            if 'ajouter' in button_text:
                availability = 'Available'
            else:
                availability = 'Unavailable'
        else:
            availability = 'Unknown'   # Missing "availability_tag" will be scraped as "Unknown"

        # Detect a promotion
        promo = 'Yes' if product.find('div', class_='super') else 'No'

        data.append({
            'Product_name': name,
            'Category': category,
            'Price': price,
            'Availability': availability,
            'Promotion': promo
        })

    return data

scraped_data = scrape_product_data(url)
df = pd.DataFrame(scraped_data)

**Some Data Cleaning Incorporated into the Web Scraping Script:**

- Missing "name_tag" (**Product_name**) is scraped as "Unknown".
- Missing "category_tag" (**Category**) is scraped as "General".
- Since **Price** is a numerical feature, missing "price_tag" is scraped as "0".
- For the **Availability** feature, products that can be added to the cart (the button text contains "**Ajouter**", as in "**Ajouter Au Panier**") are scraped as "**Available**". Products whose button shows any other text are scraped as "**Unavailable**", and a missing "availability_tag" is scraped as "**Unknown**".

In [4]:
df['Availability'].value_counts()

Availability
Unknown        171
Unavailable     96
Available       40
Name: count, dtype: int64

In [5]:
# Data Overview

print(df.info(), '\n')

df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Product_name  307 non-null    object
 1   Category      307 non-null    object
 2   Price         307 non-null    object
 3   Availability  307 non-null    object
 4   Promotion     307 non-null    object
dtypes: object(5)
memory usage: 12.1+ KB
None 



,Product_name,Category,Price,Availability,Promotion
0,Unknown,General,0,Unknown,No
1,Unknown,General,0,Unknown,No
2,Unknown,General,0,Unknown,No
3,Unknown,General,0,Unknown,No
4,Passoire ronde en plastique,,1100,Unavailable,No
5,Couvercle alimentaire,,1200,Unavailable,No
6,Tasse imprimé,,1990,Unavailable,No
7,Pichet,,2000,Unavailable,No
8,Pelle et balayette,,2000,Unavailable,No
9,Balai en plastique,,2590,Unavailable,No


In [6]:
# Removing non-digit characters from the price column and converting it to a float
df['Price'] = df['Price'].str.replace('[^\\d]', '', regex=True)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce').astype(float)

# Standardize text fields
df['Product_name'] = df['Product_name'].str.title().str.strip()
df['Category'] = df['Category'].str.title().str.strip()

**More Data Cleaning Actions Taken:**

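- To convert the **Price** column from an object to a float, all non-digit characters (currency symbols, spaces, and commas) were removed first. This also strips any decimal separator, so each value is the concatenated digits of the displayed price.
- **Product_name** and **Category** columns were standardized to title case with surrounding whitespace stripped.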
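
Since the regex above drops every non-digit character, a price displayed with a decimal part loses its separator. The following is a minimal sketch of a parser that keeps the decimal part instead. It assumes, without confirmation from the site, that the last comma or dot in the string is the decimal separator. The helper name `parse_price` and the sample strings are hypothetical.

```python
import re

def parse_price(text):
    """Parse a price string into a float, keeping the decimal part.

    Assumption (not verified against the site): the last comma or dot
    in the string is the decimal separator.
    """
    # Keep only digits, commas, and dots
    cleaned = re.sub(r"[^\d,.]", "", text)
    if not cleaned:
        return float("nan")
    # Normalize commas to dots, then keep only the last dot as the decimal point
    cleaned = cleaned.replace(",", ".")
    head, sep, tail = cleaned.rpartition(".")
    if sep:
        cleaned = head.replace(".", "") + "." + tail
    try:
        return float(cleaned)
    except ValueError:
        return float("nan")

# Hypothetical inputs, for illustration only
print(parse_price("1,100 DT"))
print(parse_price("2590"))
```

If the site's format differs from this assumption, the separator handling would need to change accordingly. Returning NaN for an empty string also keeps missing prices out of later averages, unlike the `0` placeholder.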

In [7]:
# Confirming Price column type conversion

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Product_name  307 non-null    object 
 1   Category      307 non-null    object 
 2   Price         307 non-null    float64
 3   Availability  307 non-null    object 
 4   Promotion     307 non-null    object 
dtypes: float64(1), object(4)
memory usage: 12.1+ KB


In [8]:
# Saving the cleaned data

df.to_csv("cleaned_aziza_data.csv", index=False)

In [9]:
# Checking unique products to understand the data for hierarchical grouping

df['Product_name'].unique()

array(['Unknown', 'Passoire Ronde En Plastique', 'Couvercle Alimentaire',
       'Tasse Imprimé', 'Pichet', 'Pelle Et Balayette',
       'Balai En Plastique', 'Set Savon',
       'Tasse En Verre Coloré Avec Paille', 'Sous Plats En Bois',
       'Ciseaux Multi-Usages', 'Lot De 4 Boites De Conservation',
       'Plateau De Présentation', 'Lot De 3 Serviette Arc En Ciel',
       'Pichet En Verre', 'Lot De 5 Boites De Conservation',
       'Maida En Plastique Ronde Decorée',
       'Poubelle En Plastique Avec Couvercle', 'Corbeille',
       'Bocale En Verre Avec Couvercle', 'Mule Pour Femme',
       'Distributeur À Eau', 'Spatule À Poisson',
       'Plateau À Servir Rectangulaire', 'Pompe À Eau Rechargeable',
       'Hachoir Manuel En Plastique', 'Support Essuie Tout',
       'Distributeur À Eau Avec 3 Tasses', 'Bassine En Plastique',
       'Couteau À Désosser', 'Set 6 Pots À Épices Avec Cuillères',
       'Poussoir À Saucisse Manuel', 'Hachoir Électrique',
       'Lot De 4 Chips Paprika'

In [10]:
# Categorizing products into hierarchical groups

def categorize_hierarchy(name):
    name = name.lower()

    # Electronics
    if any(word in name for word in ['tv', 'casque', 'grille pain', 'blender', 'fer à repasser', 'lampe', 'gaufrier']):
        return 'Electronics > Appliances'
    elif any(word in name for word in ['voiture', 'poupée', 'jouet', 'dinosaure']):
        return 'Toys & Games > Vehicles or Dolls'

    # Food
    elif any(word in name for word in ['yaourt', 'fromage', 'cake', 'gaufrette', 'muffin', 'tarte', 'pain', 'crème', 'biscuit']):
        return 'Food > Dairy & Bakery'
    elif any(word in name for word in ['boisson', 'jus', 'eau', 'nectar', 'café']):
        return 'Food > Beverages'
    elif any(word in name for word in ['sauce', 'harissa', 'huile', 'thon', 'salami', 'jambon', 'riz']):
        return 'Food > Condiments & Staples'

    # Household Items
    elif any(word in name for word in ['liquide vaisselle', 'eau de javel', 'lessive', 'nettoyant']):
        return 'Household > Cleaning Supplies'
    elif any(word in name for word in ['verre', 'tasse', 'mug', 'assiette', 'boite', 'bouteille', 'pot', 'plateau']):
        return 'Household > Kitchenware'
    elif any(word in name for word in ['presse-agrumes', 'faitout', 'moule', 'sauteuse', 'plat']):
        return 'Household > Cookware'

    # Personal Care
    elif any(word in name for word in ['shampooing', 'déodorant', 'dentifrice', 'coton-tige']):
        return 'Personal Care > Hygiene'

    # Textiles & Accessories
    elif any(word in name for word in ['chaussette', 'serviette']):
        return 'Textiles > Apparel & Linens'

    # Seasonal/Other
    elif 'encensoir' in name:
        return 'Seasonal > Religious Items'
    elif 'hlou' in name or 'gateau' in name:
        return 'Food > Sweets & Desserts'

    else:
        return 'General > Other'
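
Because `word in name` is a substring test, short keywords can match inside unrelated words. For example, 'eau' is contained in 'couteau', so "Couteau À Désosser" reaches the beverages branch of the function above before any other match. A minimal sketch of a whole-word alternative follows. The helper `contains_keyword` is hypothetical, and allowing an optional trailing "s" for plurals is an assumption about how product names are written.

```python
import re

def contains_keyword(name, keywords):
    """Return True if any keyword appears in name as a whole word or phrase."""
    name = name.lower()
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape protects
        # characters such as hyphens. The optional "s" is an assumed
        # plural form, not something taken from the original notebook.
        pattern = r"\b" + re.escape(keyword) + r"s?\b"
        if re.search(pattern, name):
            return True
    return False

print(contains_keyword("Couteau À Désosser", ["eau"]))               # False
print(contains_keyword("Distributeur À Eau", ["eau"]))               # True
print(contains_keyword("Lot De 4 Boites De Conservation", ["boite"]))  # True
```

Swapping `any(word in name for word in [...])` for a call like this would change which branch some products fall into, so the category counts would need to be regenerated afterwards.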

In [11]:
# Applying hierarchical grouping to the DataFrame
df['Hierarchy'] = df['Product_name'].apply(categorize_hierarchy)

df.head(20)

,Product_name,Category,Price,Availability,Promotion,Hierarchy
0,Unknown,General,0.0,Unknown,No,General > Other
1,Unknown,General,0.0,Unknown,No,General > Other
2,Unknown,General,0.0,Unknown,No,General > Other
3,Unknown,General,0.0,Unknown,No,General > Other
4,Passoire Ronde En Plastique,,1100.0,Unavailable,No,General > Other
5,Couvercle Alimentaire,,1200.0,Unavailable,No,General > Other
6,Tasse Imprimé,,1990.0,Unavailable,No,Household > Kitchenware
7,Pichet,,2000.0,Unavailable,No,General > Other
8,Pelle Et Balayette,,2000.0,Unavailable,No,General > Other
9,Balai En Plastique,,2590.0,Unavailable,No,General > Other


In [12]:
# Question 1: What is the Price Variability per Product Category (Which categories show wide pricing ranges)?

fig = px.box(
    df[df['Price'] > 0],
    x='Hierarchy',
    y='Price',
    title='Price Distribution by Product Category',
    labels={'Price': 'Price (TND)', 'Hierarchy': 'Category'},
    points='all',  # To show all individual data points
)

# Updating layout for better readability
fig.update_layout(
    xaxis_tickangle=45,
    height=700,  # To increase height
    width=1100,  # To increase width
    margin=dict(l=60, r=40, t=60, b=200),  # To add bottom margin for long labels
)

fig.show()

**Insight**

- There is wide price variation in **Electronics > Appliances** (up to 100,000 TND!), **Textiles > Apparel & Linens**, and **Toys & Games > Vehicles or Dolls**.

- Most food-related categories and household products are clustered at the low-price end (less than 10,000 TND, many below 5,000).

- Outliers exist in almost every category, but particularly in **Electronics**, **Household > Cookware**, and **Toys & Games**.

Electronics have the highest price range.

In [13]:
# Question 2: What is the Average Pricing Within Each Product Category (Which product categories are the most/least expensive)?

fig = px.bar(
    df.groupby('Hierarchy', as_index=False)['Price'].mean().sort_values(by='Price', ascending=False),
    x='Hierarchy',
    y='Price',
    title='Average Price per Product Category',
    labels={'Price': 'Average Price (TND)', 'Hierarchy': 'Product Category'},
    color='Price',
    color_continuous_scale='Viridis'
)
fig.update_layout(xaxis_tickangle=45)
fig.show()


**Insight**

- **Electronics > Appliances** has the highest average price, while **Food > Dairy & Bakery** has the lowest.
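
The box plot for Question 1 filters on `Price > 0`, but the average above is computed over all rows, including those where a missing price was scraped as `0`. Those placeholder rows pull a category's mean down. The sketch below shows the same aggregation with placeholders excluded. The `sample` frame is made-up data standing in for `df`.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped DataFrame
sample = pd.DataFrame({
    "Hierarchy": [
        "General > Other", "General > Other",
        "Household > Kitchenware", "Household > Kitchenware",
    ],
    "Price": [0.0, 2000.0, 1990.0, 0.0],
})

# Drop placeholder prices (0 means the price tag was missing) before averaging
priced = sample[sample["Price"] > 0]
avg_price = (
    priced.groupby("Hierarchy", as_index=False)["Price"]
    .mean()
    .sort_values(by="Price", ascending=False)
)
print(avg_price)
```

With the placeholders included, both categories in this sample would average roughly half of these values, which is why the filter matters when comparing categories.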

In [14]:
# Question 3: Which Categories Have the Highest Share of Products on Promotion?

promo_df = df.groupby(['Hierarchy', 'Promotion']).size().reset_index(name='Count')
fig = px.bar(
    promo_df,
    x='Hierarchy',
    y='Count',
    color='Promotion',
    barmode='group',
    title='Product Promotion Distribution by Category',
    labels={'Hierarchy': 'Product Category'}
)
fig.update_layout(xaxis_tickangle=45)
fig.show()

**Insight**

- Promotions are extremely rare across the board.

- In almost every category, nearly all products are not on promotion. Only a handful of products are marked as promoted, in **Food > Condiments & Staples** and **General > Other**.

- Either promotions are infrequent on the site, or they're not consistently labeled in the source HTML.
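
Question 3 asks about the share of promoted products, while the chart above plots raw counts, which favor categories with more products. One way to get a share is to average a boolean flag per category, as in this sketch. The `sample` frame and the column names `On_promo` and `Promo_share` are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped DataFrame
sample = pd.DataFrame({
    "Hierarchy": [
        "Food > Beverages", "Food > Beverages",
        "General > Other", "General > Other",
        "General > Other", "General > Other",
    ],
    "Promotion": ["Yes", "No", "No", "No", "No", "Yes"],
})

# The mean of a True/False flag is the fraction of True values
promo_share = (
    sample.assign(On_promo=sample["Promotion"].eq("Yes"))
    .groupby("Hierarchy", as_index=False)["On_promo"]
    .mean()
    .rename(columns={"On_promo": "Promo_share"})
    .sort_values(by="Promo_share", ascending=False)
)
print(promo_share)
```

In this sample both categories have one promoted product, yet the shares differ (0.5 versus 0.25), which a count-based chart would not show.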

In [15]:
# Question 4: What is the Availability Status Across All Categories (Are there availability gaps in certain categories)?

avail_df = df.groupby(['Hierarchy', 'Availability']).size().reset_index(name='Count')
fig = px.bar(
    avail_df,
    x='Hierarchy',
    y='Count',
    color='Availability',
    barmode='group',
    title='Product Availability by Category',
    labels={'Hierarchy': 'Product Category'}
)
fig.update_layout(xaxis_tickangle=45)
fig.show()

**Insight**

- Most products have an “**Unknown**” availability status, especially in **Food > Dairy & Bakery**, **General > Other**, and **Personal Care > Hygiene**.

- “**Unavailable**” counts are high in **Household > Cookware** (very prominent), **Toys & Games > Vehicles or Dolls**, and **Electronics > Appliances**.

- “**Available**” items are relatively low overall, with a few noticeable in **Household > Cookware** and **Food > Beverages**.

Many scraped items lacked availability labels, possibly because of inconsistent HTML structure.

**Household > Cookware** shows both available and unavailable entries, which hints at turnover and suggests the category is popular or regularly restocked.
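
Raw counts make large categories look dominant in every status, so availability gaps are easier to compare as within-category proportions. The sketch below builds a row-normalized table with `pd.crosstab`. The `sample` frame is made-up data standing in for `df`.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped DataFrame
sample = pd.DataFrame({
    "Hierarchy": ["Household > Cookware"] * 4 + ["Food > Beverages"] * 2,
    "Availability": [
        "Unavailable", "Unavailable", "Available", "Unknown",
        "Available", "Unknown",
    ],
})

# normalize="index" divides each row by its total, so every row sums to 1
avail_share = pd.crosstab(
    sample["Hierarchy"], sample["Availability"], normalize="index"
)
print(avail_share.round(2))
```

A table like this could be passed to a stacked bar chart to show the mix of statuses per category rather than absolute counts.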