**Instruction:**

Extract, clean, and analyze product data from an online retailer's platform to identify pricing trends, product availability, and promotional patterns across various categories.

In [None]:
# To silence warnings for a cleaner output

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Importing the necessary libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# Web Scraping

url = "https://aziza.tn/fr/home"

def scrape_product_data(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers, timeout=10)  # Timeout avoids hanging on a slow server

    if page.status_code != 200:
        print(f"Failed to retrieve the website (status code {page.status_code})")
        return []

    soup = BeautifulSoup(page.content, 'html.parser')
    data = []

    for product in soup.find_all('li', class_='product-item'):

        # Extract product name
        name_tag = product.find('a', class_='product-item-link')
        name = name_tag.text.strip() if name_tag else 'Unknown'    # Missing "name_tag" will be scraped as "Unknown"

        # Extract category
        category_tag = product.find('span', class_='brand')
        category = category_tag.text.strip() if category_tag else 'General'    # Missing "category_tag" will be scraped as "General"

        # Extract product price
        price_tag = product.find('span', class_='price')
        price = price_tag.text.strip() if price_tag else '0'    # Missing "price_tag" will be scraped as "0"

        # Extract availability status (assuming it's in the 'tocart' button)
        availability_tag = product.find('button', class_='tocart')
        if availability_tag:
            button_text = availability_tag.text.strip().lower()
            if 'ajouter' in button_text:
                availability = 'Available'
            else:
                availability = 'Unavailable'
        else:
            availability = 'Unknown'   # Missing "availability_tag" will be scraped as "Unknown"

        # Detect a promotion
        promo = 'Yes' if product.find('div', class_='super') else 'No'

        data.append({
            'Product_name': name,
            'Category': category,
            'Price': price,
            'Availability': availability,
            'Promotion': promo
        })

    return data

scraped_data = scrape_product_data(url)
df = pd.DataFrame(scraped_data)

**Some Data Cleaning Incorporated into the Web Scraping Script:**

- Missing "name_tag" (**Product_name**) is scraped as "Unknown".
- Missing "category_tag" (**Category**) is scraped as "General".
- Since **Price** is a numerical feature, missing "price_tag" is scraped as "0".
- For the **Availability** feature, products whose cart button text contains "ajouter" (as in **"Ajouter Au Panier"**) are scraped as "**Available**". Products whose button shows any other text are scraped as "**Unavailable**". A missing "availability_tag" is scraped as "**Unknown**".
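
The availability rule above can be isolated into a small function, which makes it easier to check on its own. This is a minimal sketch: `classify_availability` is a hypothetical helper that is not part of the scraper, and the button texts other than "Ajouter Au Panier" are made up for illustration.

```python
def classify_availability(button_text):
    """Map the text of the cart button to an availability label.

    `button_text` is None when no button was found for the product.
    """
    if button_text is None:
        return 'Unknown'
    # Same substring rule as the scraper: 'ajouter' marks an add-to-cart button
    if 'ajouter' in button_text.strip().lower():
        return 'Available'
    return 'Unavailable'


# Hypothetical button texts, for illustration only
for text in ['Ajouter Au Panier', 'Rupture de stock', None]:
    print(repr(text), '->', classify_availability(text))
```

Note that a button with empty text is labelled "Unavailable" under this rule, exactly as in the scraping loop.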

In [None]:
df['Availability'].value_counts()

Availability,count
Unknown,142
Unavailable,108
Available,39


In [None]:
# Data Overview

print(df.info(), '\n')

df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Product_name  289 non-null    object
 1   Category      289 non-null    object
 2   Price         289 non-null    object
 3   Availability  289 non-null    object
 4   Promotion     289 non-null    object
dtypes: object(5)
memory usage: 11.4+ KB
None 



,Product_name,Category,Price,Availability,Promotion
0,Unknown,General,0,Unknown,No
1,Unknown,General,0,Unknown,No
2,Unknown,General,0,Unknown,No
3,Unknown,General,0,Unknown,No
4,Lot 2 tomates double concentrés + harissa,PRODUITS DU CAP BON,7990,Unavailable,Yes
5,Téléphone portable,Sonoro,28980,Unavailable,Yes
6,Téléphone portable,Sonoro,28980,Unavailable,Yes
7,Lot de 2 distributeurs sauce,,1100,Unavailable,No
8,Boite Alaska,,1100,Unavailable,No
9,Gobelet à café en plastique avec couvercle,,2000,Unavailable,No


In [None]:
# Removing non-digit characters from the price column and converting it to a float
df['Price'] = df['Price'].str.replace(r'\D', '', regex=True)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce').astype(float)

# Standardize text fields
df['Product_name'] = df['Product_name'].str.title().str.strip()
df['Category'] = df['Category'].str.title().str.strip()

**More Data Cleaning Actions Taken:**

- To convert the **Price** column from an object to a float, all non-digit characters (currency symbols, spaces, and separators such as commas) were removed first.
- **Product_name** and **Category** were standardized to title case, with surrounding whitespace stripped.
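
To make the effect of the price cleaning concrete, here is a minimal sketch of the same "strip every non-digit" rule using only the standard `re` module. `clean_price` is a hypothetical helper, and the raw strings are invented for illustration, since the site's actual price formatting is not shown here.

```python
import re

def clean_price(raw):
    """Strip every non-digit character and return a float (NaN if nothing is left)."""
    digits = re.sub(r'\D', '', raw)
    return float(digits) if digits else float('nan')

# Hypothetical raw strings; the real site's formatting may differ
samples = ['7990', '1 100 DT', '7,990 DT', 'N/A']
for s in samples:
    print(repr(s), '->', clean_price(s))
```

One consequence worth keeping in mind: the rule also removes any decimal separator, so `'7,990 DT'` and `'7990'` both become `7990.0`. If the raw strings do carry a decimal separator, the cleaned values are expressed in the smaller sub-unit rather than in the displayed currency unit.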

In [None]:
# Confirming Price column type conversion

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Product_name  289 non-null    object 
 1   Category      289 non-null    object 
 2   Price         289 non-null    float64
 3   Availability  289 non-null    object 
 4   Promotion     289 non-null    object 
dtypes: float64(1), object(4)
memory usage: 11.4+ KB


In [None]:
# Saving the cleaned data

df.to_csv("cleaned_aziza_data.csv", index=False)

In [None]:
# Checking unique products to understand the data for hierarchical grouping

df['Product_name'].unique()

array(['Unknown', 'Lot 2 Tomates Double Concentrés + Harissa',
       'Téléphone Portable', 'Lot De 2 Distributeurs Sauce',
       'Boite Alaska', 'Gobelet À Café En Plastique Avec Couvercle',
       'Panier 3 Compartiments', 'Bouchon Évier À Main',
       'Porte Savon Transparent', 'Verre À Thé', 'Porte Éponge',
       'Lot De 3 Boites De Conservation En Plastique',
       'Boite À Épice Avec Couvercle', 'Set 3 Pots',
       'Pot De Conservation Avec Couvercle Rouge',
       'Boite De Conservation Ronde', 'Seau À Pop-Corn',
       'Tasse En Verre Pour Enfant Avec Paille', 'Verre À Eau',
       'Bac À Glace', 'Boite En Verre', 'Boite De Conservation',
       'Bouteille En Verre', 'Ensemble 2 Boites De Conservation',
       'Set 2 Verres', 'Tasse En Verre Avec Paille', 'Mug En Céramique',
       'Ensemble Pique Nique Pour 6 Personnes', 'Mug Céramique',
       'Moule À Muffin', 'Rame Papier', 'Pot À Lait', 'Plat  Rond',
       'Moule À Gateau', 'Crepière', 'Tasse À Café',
       'Cake Bl

In [None]:
# Categorizing products into hierarchical groups

def categorize_hierarchy(name):
    name = name.lower()

    # Electronics
    if any(word in name for word in ['tv', 'casque', 'grille pain', 'blender', 'fer à repasser', 'lampe', 'gaufrier']):
        return 'Electronics > Appliances'

    # Toys & Games
    elif any(word in name for word in ['voiture', 'poupée', 'jouet', 'dinosaure']):
        return 'Toys & Games > Vehicles or Dolls'

    # Food
    elif any(word in name for word in ['yaourt', 'fromage', 'cake', 'gaufrette', 'muffin', 'tarte', 'pain', 'crème', 'biscuit']):
        return 'Food > Dairy & Bakery'
    elif any(word in name for word in ['boisson', 'jus', 'eau', 'nectar', 'café']):
        return 'Food > Beverages'
    elif any(word in name for word in ['sauce', 'harissa', 'huile', 'thon', 'salami', 'jambon', 'riz']):
        return 'Food > Condiments & Staples'

    # Household Items
    elif any(word in name for word in ['liquide vaisselle', 'eau de javel', 'lessive', 'nettoyant']):
        return 'Household > Cleaning Supplies'
    elif any(word in name for word in ['verre', 'tasse', 'mug', 'assiette', 'boite', 'bouteille', 'pot', 'plateau']):
        return 'Household > Kitchenware'
    elif any(word in name for word in ['presse-agrumes', 'faitout', 'moule', 'sauteuse', 'plat']):
        return 'Household > Cookware'

    # Personal Care
    elif any(word in name for word in ['shampooing', 'déodorant', 'dentifrice', 'coton-tige']):
        return 'Personal Care > Hygiene'

    # Textiles & Accessories
    elif any(word in name for word in ['chaussette', 'serviette']):
        return 'Textiles > Apparel & Linens'

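    # Seasonal/Other
    elif 'encensoir' in name:
        return 'Seasonal > Religious Items'
    # Note: any name containing 'gateau' also contains 'eau', so it is already
    # caught by an earlier rule (at the latest, the Beverages rule)
    elif 'hlou' in name or 'gateau' in name:
        return 'Food > Sweets & Desserts'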

    else:
        return 'General > Other'

In [None]:
# Applying hierarchical grouping to the DataFrame
df['Hierarchy'] = df['Product_name'].apply(categorize_hierarchy)

df.head(20)

,Product_name,Category,Price,Availability,Promotion,Hierarchy
0,Unknown,General,0.0,Unknown,No,General > Other
1,Unknown,General,0.0,Unknown,No,General > Other
2,Unknown,General,0.0,Unknown,No,General > Other
3,Unknown,General,0.0,Unknown,No,General > Other
4,Lot 2 Tomates Double Concentrés + Harissa,Produits Du Cap Bon,7990.0,Unavailable,Yes,Food > Condiments & Staples
5,Téléphone Portable,Sonoro,28980.0,Unavailable,Yes,General > Other
6,Téléphone Portable,Sonoro,28980.0,Unavailable,Yes,General > Other
7,Lot De 2 Distributeurs Sauce,,1100.0,Unavailable,No,Food > Condiments & Staples
8,Boite Alaska,,1100.0,Unavailable,No,Household > Kitchenware
9,Gobelet À Café En Plastique Avec Couvercle,,2000.0,Unavailable,No,Food > Beverages
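
The rules in `categorize_hierarchy` use plain substring checks, which is why "Gobelet À Café En Plastique Avec Couvercle" lands in **Food > Beverages**, and why a short keyword such as 'eau' also matches inside 'gateau'. One possible alternative, shown here only as a sketch, is to match keywords as whole words with a regular expression. `has_keyword` is a hypothetical helper and not part of the analysis above.

```python
import re

def has_keyword(name, keywords):
    """Return True if any keyword appears in `name` as a whole word or phrase."""
    name = name.lower()
    return any(re.search(r'\b' + re.escape(k) + r'\b', name) for k in keywords)

beverages = ['boisson', 'jus', 'eau', 'nectar', 'café']

print('eau' in 'moule à gateau')                 # True: substring check
print(has_keyword('Moule À Gateau', beverages))  # False: 'eau' is not a whole word here
print(has_keyword('Verre À Eau', beverages))     # True: 'eau' is a whole word
```

Whole-word matching has its own trade-off: it no longer matches plurals (for example 'verres' for the keyword 'verre'), so keyword lists would need to include those forms. It also does not resolve cases like the coffee cup, where the keyword genuinely appears as a word but describes the container rather than the content.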


In [None]:
# Question 1: What is the Price Variability per Product Category (Which categories show wide pricing ranges)?

fig = px.box(
    df[df['Price'] > 0],
    x='Hierarchy',
    y='Price',
    title='Price Distribution by Product Category',
    labels={'Price': 'Price (TND)', 'Hierarchy': 'Category'},
    points='all',  # To show all individual data points
)

# Updating layout for better readability
fig.update_layout(
    xaxis_tickangle=45,
    height=700,  # To increase height
    width=1100,  # To increase width
    margin=dict(l=60, r=40, t=60, b=200),  # To add bottom margin for long labels
)

fig.show()

**Insight**

- There is wide price variation in **Electronics > Appliances** (up to 100,000 TND!), **Textiles > Apparel & Linens**, and **Toys & Games > Vehicles or Dolls**.

- Most food-related categories and household products are clustered at the low-price end (less than 10,000 TND, many below 5,000).

- Outliers exist in almost every category, but particularly in **Electronics**, **Household > Cookware**, and **Toys & Games**.

Overall, **Electronics > Appliances** has the widest price range.
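
The visual impression from the box plot can be backed by numbers. The following is a minimal sketch that computes the minimum, median, maximum, and range per category. It uses a small made-up DataFrame (`sample`) with the same column names so that it runs on its own; on the real data, `df` would take its place.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped DataFrame
sample = pd.DataFrame({
    'Hierarchy': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Price': [1000.0, 2000.0, 9000.0, 1500.0, 1600.0, 0.0],
})

priced = sample[sample['Price'] > 0]  # Same filter as the box plot
spread = priced.groupby('Hierarchy')['Price'].agg(['min', 'median', 'max'])
spread['range'] = spread['max'] - spread['min']

# Categories with the widest range come first
print(spread.sort_values('range', ascending=False))
```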

In [None]:
# Question 2: What is the Average Pricing Within Each Product Category (Which product categories are the most/least expensive)?

fig = px.bar(
    df.groupby('Hierarchy', as_index=False)['Price'].mean().sort_values(by='Price', ascending=False),
    x='Hierarchy',
    y='Price',
    title='Average Price per Product Category',
    labels={'Price': 'Average Price (TND)', 'Hierarchy': 'Product Category'},
    color='Price',
    color_continuous_scale='Viridis'
)
fig.update_layout(xaxis_tickangle=45)
fig.show()


**Insight**

- **Electronics > Appliances** has the highest average price, while **Food > Dairy & Bakery** has the lowest.
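
The average in Question 2 is computed over all rows, including those where a missing price was recorded as 0, whereas the box plot in Question 1 filters on `Price > 0`. The sketch below shows how much the placeholder rows can pull an average down. The DataFrame `sample` is made up for illustration and only reuses the column names from the analysis.

```python
import pandas as pd

# Hypothetical rows: one placeholder price of 0 in 'General > Other'
sample = pd.DataFrame({
    'Hierarchy': ['General > Other', 'General > Other', 'Food > Beverages'],
    'Price': [0.0, 28980.0, 2000.0],
})

mean_all = sample.groupby('Hierarchy')['Price'].mean()
mean_priced = sample[sample['Price'] > 0].groupby('Hierarchy')['Price'].mean()

print(pd.DataFrame({'mean_all': mean_all, 'mean_priced': mean_priced}))
```

In this toy example the average for 'General > Other' doubles once the placeholder row is excluded, so category rankings may shift depending on which version is used.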

In [None]:
# Question 3: Which Categories Have the Highest Share of Products on Promotion?

promo_df = df.groupby(['Hierarchy', 'Promotion']).size().reset_index(name='Count')
fig = px.bar(
    promo_df,
    x='Hierarchy',
    y='Count',
    color='Promotion',
    barmode='group',
    title='Product Promotion Distribution by Category',
    labels={'Hierarchy': 'Product Category'}
)
fig.update_layout(xaxis_tickangle=45)
fig.show()

**Insight**

- Promotions are extremely rare across the board.

- In almost every category, all products are listed without a promotion. Only a handful of products are marked as promoted, in **Food > Condiments & Staples** and **General > Other**.

- Either promotions are infrequent on the site, or they're not consistently labeled in the source HTML.
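
Question 3 asks about the share of promoted products, while the chart shows raw counts. A share per category can be computed directly, as in this minimal sketch. The DataFrame `sample` is hypothetical and only mirrors the `Hierarchy` and `Promotion` columns.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped DataFrame
sample = pd.DataFrame({
    'Hierarchy': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'Promotion': ['Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

# Fraction of products flagged 'Yes' within each category
promo_share = (
    sample['Promotion'].eq('Yes')
    .groupby(sample['Hierarchy'])
    .mean()
    .sort_values(ascending=False)
)
print(promo_share)
```

Shares make categories of different sizes comparable: here 'X' and 'Y' each have one promoted product, but 'Y' has the higher share because it contains fewer products.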

In [None]:
# Question 4: What is the Availability Status Across All Categories (Are there availability gaps in certain categories)?

avail_df = df.groupby(['Hierarchy', 'Availability']).size().reset_index(name='Count')
fig = px.bar(
    avail_df,
    x='Hierarchy',
    y='Count',
    color='Availability',
    barmode='group',
    title='Product Availability by Category',
    labels={'Hierarchy': 'Product Category'}
)
fig.update_layout(xaxis_tickangle=45)
fig.show()

**Insight**

- “**Unknown**” is the most common availability status (142 of 289 products), especially in **Food > Dairy & Bakery**, **General > Other**, and **Personal Care > Hygiene**.

- “**Unavailable**” counts are high in **Household > Cookware** (very prominent), **Toys & Games > Vehicles or Dolls**, and **Electronics > Appliances**.

- “**Available**” items are relatively low overall, with a few noticeable in **Household > Cookware** and **Food > Beverages**.

Many scraped items lacked availability labels, perhaps because of inconsistent HTML structure.

Categories like **Household > Cookware** show both available and unavailable entries, which may hint at regular restocking and turnover.
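
To compare availability gaps across categories of different sizes, the counts can be converted to row-wise proportions. This is a minimal sketch using `pd.crosstab` with `normalize='index'`. The DataFrame `sample` is made up and only reuses the `Hierarchy` and `Availability` column names.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped DataFrame
sample = pd.DataFrame({
    'Hierarchy': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'Availability': ['Available', 'Unavailable', 'Unknown', 'Unknown',
                     'Unavailable', 'Unavailable'],
})

# Each row sums to 1: the share of each status within a category
avail_share = pd.crosstab(sample['Hierarchy'], sample['Availability'], normalize='index')
print(avail_share.round(2))
```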