
What You're Aiming For

Extract, clean, and analyze product data from an online retailer's platform to identify pricing trends, product availability, and promotional patterns across product categories.


Instructions

Steps:

Web Scraping:

Use a Python library such as BeautifulSoup to scrape product information from an online retail website.
Collect data attributes including product names, categories, prices, availability status, and promotional details.

Data Cleaning:

Address missing or inconsistent data entries, such as absent prices or ambiguous product descriptions.
Standardize text fields to ensure uniformity in product names and categories.

Data Transformation:


Convert price data into numerical formats for analysis.
Categorize products into hierarchical groups (e.g., Electronics > Mobile Phones > Smartphones).

Data Analysis:


Conduct exploratory data analysis (EDA) to uncover insights:
Identify average pricing within each product category.
Detect seasonal or promotional pricing patterns.
Assess product availability trends over time.

Data Visualization:


Use the Plotly visualization library to create charts.

# WEB SCRAPING

In [19]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the Aziza online store
base_url = 'https://aziza.tn/fr/home'

# Headers to mimic a browser visit
headers = {'User-Agent': 'Mozilla/5.0'}

# Send a GET request to the website; stop early on network or HTTP errors
response = requests.get(base_url, headers=headers, timeout=30)
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Initialize a list to store product data
products = []

# Find all product containers (update the selector based on actual HTML structure)
product_containers = soup.find_all('div', class_='product-container')

for container in product_containers:
    # Extract product name
    name_tag = container.find('a', class_='product-name')
    name = name_tag.text.strip() if name_tag else 'N/A'

    # Extract category (if available)
    category_tag = container.find('span', class_='category')
    category = category_tag.text.strip() if category_tag else 'N/A'

    # Extract price
    price_tag = container.find('span', class_='price')
    price = price_tag.text.strip() if price_tag else 'N/A'

    # Extract availability status
    availability_tag = container.find('span', class_='availability')
    availability = availability_tag.text.strip() if availability_tag else 'N/A'

    # Extract promotional details
    promo_tag = container.find('span', class_='promo')
    promotion = promo_tag.text.strip() if promo_tag else 'None'

    # Append the product data to the list
    products.append({
        'Product Name': name,
        'Category': category,
        'Price': price,
        'Availability': availability,
        'Promotion': promotion
    })

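# Create a DataFrame from the product list
df = pd.DataFrame(products)

# An empty result means no element matched the placeholder selectors above
if df.empty:
    print('No products found: check the selectors against the actual HTML structure')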
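
Because the class names above are placeholders, it can help to check the extraction logic against a small inline HTML snippet before pointing it at the live site. The sketch below assumes hypothetical markup that happens to match those selectors. The sample products and the helper names `text_or_default` and `extract_products` are illustrative and are not taken from aziza.tn.

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the placeholder selectors; the real site's
# HTML will most likely differ.
SAMPLE_HTML = """
<div class="product-container">
  <a class="product-name">  Huile d'olive 1L </a>
  <span class="category">Alimentation</span>
  <span class="price">12.500 DT</span>
  <span class="availability">En stock</span>
</div>
<div class="product-container">
  <a class="product-name">Savon</a>
</div>
"""

# Output field -> (tag, CSS class, default when the tag is missing)
FIELDS = {
    'Product Name': ('a', 'product-name', 'N/A'),
    'Category': ('span', 'category', 'N/A'),
    'Price': ('span', 'price', 'N/A'),
    'Availability': ('span', 'availability', 'N/A'),
    'Promotion': ('span', 'promo', 'None'),
}


def text_or_default(container, tag, css_class, default):
    """Return the stripped text of the first matching tag, or the default."""
    element = container.find(tag, class_=css_class)
    return element.get_text().strip() if element else default


def extract_products(html):
    """Extract one dict per product container found in the HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for container in soup.find_all('div', class_='product-container'):
        rows.append({
            field: text_or_default(container, tag, css_class, default)
            for field, (tag, css_class, default) in FIELDS.items()
        })
    return rows


sample_products = extract_products(SAMPLE_HTML)
print(sample_products)
```

If `extract_products(response.text)` returns an empty list for the live page while the sample works, the selectors (or the way the page is rendered) are the likely cause rather than the loop itself.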


# DATA CLEANING

In [20]:
# Guarantee the expected columns exist, even if the scrape returned no rows
expected_columns = ['Product Name', 'Category', 'Price', 'Availability', 'Promotion']
df = df.reindex(columns=expected_columns)

# Remove entries with missing prices
df = df[df['Price'].notna() & (df['Price'] != 'N/A')].copy()

# Standardize text fields: trim whitespace, then apply title case
for column in ['Product Name', 'Category', 'Availability', 'Promotion']:
    df[column] = df[column].astype('string').str.strip().str.title()
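
To see what the cleaning step does without a live scrape, the following sketch runs a similar cleanup on a few hypothetical rows. It swaps the `'N/A'` placeholder for a real missing value so pandas can count and drop it, which is a variation on the filter used in the cell above. All sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical scraped rows, using the same 'N/A' placeholder as the scraper
raw = pd.DataFrame({
    'Product Name': ['  huile   olive ', 'SAVON', 'café moulu'],
    'Price': ['12.500 DT', 'N/A', '7.900 DT'],
})

# Turn the placeholder into a real missing value so pandas can count it
cleaned = raw.replace('N/A', pd.NA)
missing_per_column = cleaned.isna().sum()

# Drop rows without a price
cleaned = cleaned.dropna(subset=['Price']).copy()

# Collapse repeated whitespace, trim, then title-case the names
cleaned['Product Name'] = (
    cleaned['Product Name']
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
    .str.title()
)

print(missing_per_column)
print(cleaned)
```

Counting missing values before dropping them gives a quick sense of how much of the scrape is unusable.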

# DATA TRANSFORMATION

In [21]:
# Convert price to float: strip currency symbols, commas, and other
# non-numeric characters; values that still fail to parse become NaN
cleaned_price = df['Price'].astype(str).str.replace(r'[^\d.]', '', regex=True)
df['Price'] = pd.to_numeric(cleaned_price, errors='coerce').astype(float)

# Split category into hierarchical groups if applicable
# Assuming categories are in the format 'Main > Sub > Sub-Sub'
category_split = df['Category'].str.split('>', expand=True)
# Always expose three levels, even when fewer were found
category_split = category_split.reindex(columns=range(3))
for level, column in enumerate(['Main Category', 'Subcategory', 'Sub-Subcategory']):
    df[column] = category_split[level].astype('string').str.strip().fillna('Uncategorized')
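
Stripping everything except digits and dots treats a comma as a thousands separator, so a price written with a decimal comma such as `12,500 DT` would become `12500.0`. Whether the target site uses decimal commas has not been verified here. The sketch below shows one possible parser under an assumed rule: when both separators appear, the right-most one is the decimal mark, and a lone comma is a decimal comma. The function name `parse_price` is hypothetical.

```python
import re


def parse_price(text):
    """Parse a price string into a float, or return None if parsing fails.

    Assumed rule (not verified against any site): when both '.' and ','
    appear, the right-most one is the decimal mark; a lone ',' is treated
    as a decimal comma.
    """
    match = re.search(r'\d[\d.,\s]*', text)
    if match is None:
        return None
    # Drop whitespace inside the number and any trailing separator
    number = re.sub(r'\s', '', match.group()).rstrip('.,')
    if ',' in number and '.' in number:
        if number.rfind(',') > number.rfind('.'):
            # Format like 1.299,90: dots group thousands, comma is decimal
            number = number.replace('.', '').replace(',', '.')
        else:
            # Format like 1,299.90: commas group thousands
            number = number.replace(',', '')
    elif ',' in number:
        number = number.replace(',', '.')
    try:
        return float(number)
    except ValueError:
        return None


for sample in ['12,500 DT', '1.299,90 DT', '1,299.90', '7.900 DT', 'N/A']:
    print(repr(sample), '->', parse_price(sample))
```

Applied to the scraped column, this could look like `df['Price'] = df['Price'].map(parse_price)`, after confirming the rule against a handful of real prices.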

# DATA ANALYSIS

In [22]:
# Average pricing within each subcategory, highest first
avg_price = (
    df.groupby('Subcategory')['Price']
    .mean()
    .reset_index()
    .sort_values('Price', ascending=False)
)

# Product availability distribution, as a percentage of all products
availability_distribution = df['Availability'].value_counts(normalize=True) * 100

# Promotion counts ('None' marks products without a promotion)
promotion_counts = df['Promotion'].value_counts()
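
A single scrape cannot show availability trends over time or promotional price changes, since it holds only one observation per product. One way to approach those goals is to keep a timestamped snapshot from each scraping run and compare them. The sketch below assumes a `Scraped At` column that the scraper above does not record, and every value in it is hypothetical.

```python
import pandas as pd

# Hypothetical snapshots from two scraping runs; 'Scraped At' is an assumed
# column, not something the scraper above produces.
snapshots = pd.DataFrame({
    'Scraped At': pd.to_datetime(
        ['2024-01-01', '2024-01-01', '2024-01-08', '2024-01-08']
    ),
    'Main Category': ['Food', 'Home', 'Food', 'Home'],
    'Price': [12.5, 4.0, 11.0, 4.0],
    'Availability': ['In Stock', 'Out Of Stock', 'In Stock', 'In Stock'],
})

# Share of products in stock at each scrape date, as a percentage
in_stock = snapshots['Availability'].eq('In Stock')
in_stock_share = in_stock.groupby(snapshots['Scraped At']).mean() * 100

# Average price per category and date
price_by_date = snapshots.pivot_table(
    index='Scraped At', columns='Main Category', values='Price', aggfunc='mean'
)

# Change between the last two scrape dates; a drop may hint at a promotion
price_change = price_by_date.diff().iloc[-1]

print(in_stock_share)
print(price_change)
```

With real data, more than two snapshots would be needed before reading anything seasonal into the changes.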

# DATA VISUALIZATION

In [23]:
import plotly.express as px

# Bar chart: Average price by subcategory
fig1 = px.bar(avg_price, x='Subcategory', y='Price', title='Average Price by Subcategory')
fig1.show()

# Pie chart: Product availability distribution
fig2 = px.pie(names=availability_distribution.index, values=availability_distribution.values, title='Product Availability Distribution')
fig2.show()

# Bar chart: Promotion counts, with explicit axis labels
fig3 = px.bar(x=promotion_counts.index, y=promotion_counts.values, labels={'x': 'Promotion', 'y': 'Number of Products'}, title='Promotion Counts')
fig3.show()