# Final Project - E-Commerce Analysis (Invoices Dataset)

**Dataset**: invoices.csv (10,000 transactions)  
**Objective**: Extract 5 actionable business indicators to optimize sales

**Team**:
- Chadi ALKERDI
- Bilal GOUGIS
- Mantra OUTTANDY
- Minji PARK
- Jeffrey SAINT ANDRE

## Configuration and Imports

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import dash
from dash import dcc, html

print('Libraries loaded')

Libraries loaded


## Loading data and preparing Functions

In [2]:
def load_data(path: str) -> pd.DataFrame:
    """
    Load and prepare the dataset.

    Args:
        path: Path to the CSV file.

    Returns:
        DataFrame with computed columns.
    """
    df = pd.read_csv(path)

    # Date conversion
    df['invoice_date'] = pd.to_datetime(df['invoice_date'], format='%d/%m/%Y', errors='coerce')

    # Revenue calculation
    df['total_revenue'] = df['qty'] * df['amount']

    return df


The function `load_data` reads the CSV file into a pandas DataFrame and applies two transformations.
First, it converts the `invoice_date` column from text to a datetime type using the day/month/year format; values that cannot be parsed become `NaT`.
Second, it creates a `total_revenue` column equal to `qty` × `amount`.
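
As a quick illustration of these two transformations, the sketch below applies them to a small in-memory table instead of `invoices.csv`. The sample rows are made up for the example; only the column names `invoice_date`, `qty` and `amount` come from the notebook.

```python
import pandas as pd

# Hypothetical sample rows (not taken from invoices.csv)
sample = pd.DataFrame({
    'invoice_date': ['25/12/2020', '01/02/1999', 'not a date'],
    'qty': [2, 5, 1],
    'amount': [10.0, 3.5, 7.0],
})

# Same two steps as load_data: parse day/month/year dates, then compute revenue
sample['invoice_date'] = pd.to_datetime(
    sample['invoice_date'], format='%d/%m/%Y', errors='coerce'
)
sample['total_revenue'] = sample['qty'] * sample['amount']

# errors='coerce' turns unparseable dates into NaT instead of raising an error
n_bad_dates = int(sample['invoice_date'].isna().sum())
print(sample)
print("Unparseable dates:", n_bad_dates)
```

Counting the `NaT` values after loading is a cheap way to check that the date format matches the file.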



## Data Exploration

Before building the indicators, we briefly explore the dataset to understand its structure (number of rows/columns, types, missing values, basic statistics).


In [3]:
df_explore = load_data('invoices.csv')

print("Shape:", df_explore.shape)
print("\nColumn data types:")
print(df_explore.dtypes)

print("\nMissing values per column:")
print(df_explore.isna().sum())

print("\nMain descriptive statistics:")
print(df_explore.describe())


Shape: (10000, 12)

Column data types:
first_name               object
last_name                object
email                    object
product_id                int64
qty                       int64
amount                  float64
invoice_date     datetime64[ns]
address                  object
city                     object
stock_code                int64
job                      object
total_revenue           float64
dtype: object

Missing values per column:
first_name       0
last_name        0
email            0
product_id       0
qty              0
amount           0
invoice_date     0
address          0
city             0
stock_code       0
job              0
total_revenue    0
dtype: int64

Main descriptive statistics:
         product_id           qty        amount                invoice_date  \
count  10000.000000  10000.000000  10000.000000                       10000   
mean     149.746700      5.005900     52.918236  1995-06-11 13:32:18.240000   
min      100.000000      1.

In [4]:
print("Number of unique values per column:")
print(df_explore.nunique())


Number of unique values per column:
first_name        9393
last_name         9435
email             9769
product_id         100
qty                  9
amount            6176
invoice_date      7774
address          10000
city              7773
stock_code        9996
job                639
total_revenue     8889
dtype: int64


In [5]:
email_counts = df_explore['email'].value_counts()

print("\nDistribution of the number of orders per email:")
print(email_counts.value_counts().sort_index())



Distribution of the number of orders per email:
count
1    9554
2     199
3      16
Name: count, dtype: int64


In [6]:
job_counts = df_explore['job'].value_counts()

print("\nStatistics on job frequency:")
print(job_counts.describe())



Statistics on job frequency:
count    639.000000
mean      15.649452
std        4.038505
min        4.000000
25%       13.000000
50%       16.000000
75%       18.000000
max       29.000000
Name: count, dtype: float64


In [7]:
print("\nDistribution of sales per product_id:")
print(df_explore['product_id'].value_counts().describe())

print("\nNumber of transactions per year:")
print(df_explore['invoice_date'].dt.year.value_counts().sort_index())


Distribution of sales per product_id:
count    100.000000
mean     100.000000
std       10.396192
min       74.000000
25%       94.000000
50%      100.000000
75%      107.000000
max      131.000000
Name: count, dtype: float64

Number of transactions per year:
invoice_date
1970    204
1971    190
1972    198
1973    198
1974    190
1975    216
1976    185
1977    191
1978    202
1979    202
1980    206
1981    197
1982    211
1983    199
1984    189
1985    182
1986    207
1987    196
1988    205
1989    224
1990    200
1991    196
1992    177
1993    232
1994    205
1995    177
1996    203
1997    187
1998    192
1999    185
2000    189
2001    198
2002    172
2003    187
2004    205
2005    204
2006    188
2007    170
2008    191
2009    194
2010    181
2011    188
2012    180
2013    189
2014    177
2015    193
2016    173
2017    170
2018    184
2019    162
2020    179
2021    169
2022     11
Name: count, dtype: int64


#### Nature of the dataset and limits of interpretation

The file `invoices.csv` is a synthetic dataset created to simulate the activity of an e-commerce website. By looking at the data, we can clearly see that it does not fully represent a real company's behavior.

First, there are no missing values in any column, which is very uncommon in real invoice databases. In addition, there are 9,769 unique email values among 10,000 rows: 9,554 emails appear only once, so the vast majority of customers placed a single order. This makes it impossible to properly study customer loyalty or long-term purchasing behavior.

We also observe that the distribution of professions is very homogeneous: each of the 639 jobs appears between 4 and 29 times (about 16 on average), with no clearly dominant category. Similarly, product sales are very evenly distributed: each `product_id` appears in a number of transactions close to the average of 100 (between 74 and 131), with no clear best-sellers or poorly performing products.

Finally, when analyzing the number of transactions per year, we do not observe any strong trend or significant peaks over time. Each complete year has between roughly 160 and 230 transactions (2022 is partial, with only 11). The activity remains fairly stable across more than 50 years, which is not realistic for a real business.

For all these reasons, this dataset should mainly be considered as a pedagogical dataset designed to practice data analysis methods (such as BCG, ABC classification, segmentation, etc.), rather than as a reliable representation of a real company. The indicators computed in the rest of this notebook should therefore be interpreted as methodological examples, and not as direct operational recommendations.


In [8]:
# Histogram of transactions per year
df_explore_year = load_data('invoices.csv')
df_explore_year['year'] = df_explore_year['invoice_date'].dt.year

transactions_per_year = (
    df_explore_year
    .groupby('year')
    .size()
    .reset_index(name='nb_transactions')
)

fig_years = px.bar(
    transactions_per_year,
    x='year',
    y='nb_transactions',
    title="Number of transactions per year",
    labels={'year': 'Year', 'nb_transactions': "Number of transactions"}
)
fig_years.show()


**Observation**  
The bar chart shows that the number of transactions per year is fairly homogeneous over more than 50 years, 
with only moderate variations (2022 is incomplete).  
There is no strong trend across years. Yearly totals cannot reveal seasonality within a year (Christmas peaks, sales periods, etc.), 
but such a flat long-term profile is consistent with a synthetic dataset rather than a realistic e-commerce history.
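
Yearly totals cannot show peaks within a year, so one way to look for seasonality would be to count transactions per calendar month. The sketch below is only an illustration on generated dates (one hypothetical transaction per day), not a result computed on `invoices.csv`; the `peak_ratio` name is introduced for the example.

```python
import pandas as pd

# Hypothetical data: one transaction per day over two years (not invoices.csv)
toy = pd.DataFrame({
    'invoice_date': pd.date_range('2020-01-01', '2021-12-31', freq='D')
})

# Count transactions per calendar month, pooling all years together
monthly = (
    toy.groupby(toy['invoice_date'].dt.month)
    .size()
    .rename('nb_transactions')
)
monthly.index.name = 'month'

# Ratio of the busiest to the quietest month; a value near 1 means no seasonal peak
peak_ratio = monthly.max() / monthly.min()
print(monthly)
print(f"Peak/trough ratio: {peak_ratio:.2f}")
```

On real e-commerce data, a December or sales-period peak would typically push this ratio well above 1.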

In [9]:
# Top 10 most frequent emails
freq_email = df_explore_year['email'].value_counts().reset_index()
freq_email.columns = ['email', 'nb_orders']

top10_email = freq_email.head(10)

fig_email = px.bar(
    top10_email,
    x='email',
    y='nb_orders',
    title="Top 10 emails by number of orders",
    labels={'email': 'Email', 'nb_orders': 'Number of orders'}
)
fig_email.update_xaxes(tickangle=45)
fig_email.show()

print("Distribution of orders per email:")
print(freq_email['nb_orders'].value_counts())


Distribution of orders per email:
nb_orders
1    9554
2     199
3      16
Name: count, dtype: int64


**Observation**  
The chart shows that even the 10 most frequent emails have very few orders: none has more than 3. 
The distribution above confirms that **the vast majority of emails appear only once**.  

As a consequence, we cannot really study customer loyalty, repeat purchases, or customer cohorts.  
This reinforces the idea that `invoices.csv` is a pedagogical dataset rather than a real e-commerce history.
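
The share of repeat customers can be derived directly from the distribution printed above. The short sketch below only re-uses those three printed counts; the variable names are chosen for the example.

```python
# Distribution printed above: number of orders -> number of emails
orders_per_email = {1: 9554, 2: 199, 3: 16}

n_emails = sum(orders_per_email.values())
n_orders = sum(k * v for k, v in orders_per_email.items())
n_repeat = sum(v for k, v in orders_per_email.items() if k > 1)

# Percentage of emails that ordered more than once
repeat_share = 100 * n_repeat / n_emails
print(f"{n_emails} emails, {n_orders} orders, {repeat_share:.1f}% repeat customers")
```

With only about 2% of emails ordering more than once, cohort or retention metrics would rest on roughly 200 customers.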


## INDICATOR 1: BCG Product Matrix  

**Description**  
This indicator positions each product along two dimensions: sales volume (the sum of `qty`) and total revenue (the sum of `total_revenue`).  

The goal is to distinguish the role of products in the catalog:  
-  **Star**: high sales volume and high revenue,  
-  **Premium**: low volume but high revenue (expensive products),  
-  **Volume**: high volume but lower revenue,  
-  **Standard**: lower volume and lower revenue.

**Method**  
We aggregate sales by `product_id`, then set thresholds at the 80th percentile of volume and of revenue, so that the top 20% of products on each dimension count as "high".  
Combining these thresholds allows us to assign each product to one of the 4 BCG categories, which are then visualized in a scatter plot (revenue vs. volume).

**Interpretation on this dataset**  
On this synthetic dataset, the distribution is quite homogeneous.  
Standard products still represent a significant share of total revenue.  
The matrix is therefore mainly useful to visualize theoretical roles (Stars, Premium, Volume, Standard), but it does not justify “removing Standard products” as one might sometimes do in a strong Pareto context.

**Business action (in a real case)** 

In a real case, this analysis could help prioritize investments (stock, marketing, promotions) on Stars and Premium products and optimize Volume products (logistics, margin). It could also help monitor Standard products: some could be simplified or removed, but only if their real contribution to revenue is low (which is not the case here due to the synthetic nature of the dataset).


In [10]:
def analyze_bcg_matrix(df: pd.DataFrame, top_pct: float = 0.2) -> pd.DataFrame:
    """
    Build a BCG-style product matrix with a top X% threshold (Pareto-like).

    Args:
        df: Input invoices DataFrame.
        top_pct: Proportion of products considered as "top" for volume/revenue thresholds.

    Returns:
        DataFrame with BCG category and market share per product.
    """
    analysis = df.groupby('product_id').agg({
        'total_revenue': 'sum',
        'qty': 'sum',
        'email': 'nunique'
    }).rename(columns={'email': 'nb_customers'})

    # Top X% thresholds
    revenue_threshold = analysis['total_revenue'].quantile(1 - top_pct)
    volume_threshold = analysis['qty'].quantile(1 - top_pct)

    def classify(row):
        high_rev = row['total_revenue'] > revenue_threshold
        high_vol = row['qty'] > volume_threshold

        if high_rev and high_vol:
            return 'Star'
        elif high_rev:
            return 'Premium'
        elif high_vol:
            return 'Volume'
        else:
            return 'Standard'

    analysis['category'] = analysis.apply(classify, axis=1)
    analysis['market_share'] = (analysis['total_revenue'] / analysis['total_revenue'].sum()) * 100

    return analysis.reset_index()



The function `analyze_bcg_matrix` groups sales by product, calculates total revenue, total quantity sold and number of customers, then uses the top 20% thresholds for revenue and volume to classify each product into one of the four BCG categories: Star, Premium, Volume, or Standard. It also computes the market share of each product.
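
To make the threshold logic concrete, here is a minimal sketch on a hypothetical aggregated table of ten products. It uses `np.select` as a vectorized alternative to the row-wise `classify` helper; this is a stylistic variant for illustration, not the notebook's implementation, and the figures are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated table (one row per product), not real results
toy = pd.DataFrame({
    'product_id': list(range(1, 11)),
    'total_revenue': [1000.0, 900.0, 300.0, 400.0, 500.0,
                      200.0, 100.0, 600.0, 350.0, 450.0],
    'qty': [100, 10, 90, 40, 50, 20, 30, 60, 35, 45],
})

top_pct = 0.2
rev_thr = toy['total_revenue'].quantile(1 - top_pct)  # 80th percentile of revenue
vol_thr = toy['qty'].quantile(1 - top_pct)            # 80th percentile of volume

high_rev = toy['total_revenue'] > rev_thr
high_vol = toy['qty'] > vol_thr

# Conditions are evaluated in order, so "Star" takes priority over the others
toy['category'] = np.select(
    [high_rev & high_vol, high_rev, high_vol],
    ['Star', 'Premium', 'Volume'],
    default='Standard',
)
print(toy[['product_id', 'total_revenue', 'qty', 'category']])
```

With ten products and `top_pct = 0.2`, exactly two products exceed each threshold, so at most four products can leave the Standard category.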

In [11]:
def viz_bcg(bcg: pd.DataFrame) -> go.Figure:
    """
    Scatter plot for the BCG product matrix.
    """
    fig = px.scatter(
        bcg,
        x='qty',
        y='total_revenue',
        color='category',
        size='nb_customers',
        hover_data=['product_id', 'market_share'],
        color_discrete_map={
            'Star':'#2ecc71',
            'Premium': '#3498db',
            'Volume': '#f39c12',
            'Standard': '#95a5a6'
        },
        title='Strategic Product Matrix (BCG)',
        labels={'qty': 'Sales volume (units)', 'total_revenue': 'Total revenue ($)'}
    )

    # Threshold lines (approximate top 20% on this aggregated table)
    volume_threshold = bcg['qty'].quantile(0.8)
    revenue_threshold = bcg['total_revenue'].quantile(0.8)

    fig.add_vline(
        x=volume_threshold,
        line_dash='dash',
        line_color='gray',
        annotation_text='Top 20% volume'
    )
    fig.add_hline(
        y=revenue_threshold,
        line_dash='dash',
        line_color='gray',
        annotation_text='Top 20% revenue'
    )

    fig.update_layout(template='plotly_white', height=500)

    return fig

The function `viz_bcg` takes the BCG analysis table and creates a scatter plot where each product is positioned according to its sales volume and total revenue. Colors represent the BCG category, point size represents the number of customers, and threshold lines highlight the top 20% in both dimensions.

## INDICATOR 2: ABC Analysis (Pareto)

**Description**  
This indicator ranks products according to their cumulative contribution to total revenue (ABC classification):

- A: up to 80% of cumulative revenue  
- B: from 80% to 95% of cumulative revenue  
- C: above 95% of cumulative revenue  

**Method**  
Products are sorted by decreasing total revenue, then we compute the percentage contribution of each product to total revenue, the cumulative contribution, and we assign each product to class A, B or C based on its cumulative percentage.

**What this dataset shows**  
In this synthetic dataset, most products fall into class A to reach 80% of total revenue.  
The Pareto effect is therefore weak: we do not have “20% of products = 80% of revenue”, but rather many products that each contribute a small part to the overall revenue.

**Business action**  
On this synthetic dataset, the indicator mainly highlights the homogeneous distribution of revenue across products.  
In a real e-commerce catalog, ABC analysis would be used to focus efforts (stock, marketing, logistics) on a small number of truly dominant class A products.


In [12]:
def analyze_abc(df: pd.DataFrame) -> pd.DataFrame:
    """
    ABC analysis by product based on total revenue.

    Args:
        df: Input invoices DataFrame.

    Returns:
        DataFrame with ABC class and rank per product.
    """
    products = (
        df.groupby('product_id')
        .agg({'total_revenue': 'sum'})
        .sort_values('total_revenue', ascending=False)
    )

    products['contrib_pct'] = (products['total_revenue'] / products['total_revenue'].sum()) * 100
    products['contrib_cumul'] = products['contrib_pct'].cumsum()

    def abc_class(cumulative_pct):
        if cumulative_pct <= 80:
            return 'A (80% revenue)'
        elif cumulative_pct <= 95:
            return 'B (15% revenue)'
        else:
            return 'C (5% revenue)'

    products['class'] = products['contrib_cumul'].apply(abc_class)

    products = products.reset_index()
    products['rank'] = range(1, len(products) + 1)

    return products


The function `analyze_abc` ranks products by total revenue, computes their individual and cumulative contribution to total sales, and assigns each product to class A, B or C based on the Pareto rule.
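
The classification step can be followed on a small hypothetical example. The sketch below uses `pd.cut` with right-closed bins as one possible alternative to an `if/elif` helper; the revenue figures are invented and the short labels `A`, `B`, `C` are a simplification.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue per product (not taken from invoices.csv)
toy = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105],
    'total_revenue': [20.0, 55.0, 4.0, 15.0, 6.0],
}).sort_values('total_revenue', ascending=False)

toy['contrib_pct'] = toy['total_revenue'] / toy['total_revenue'].sum() * 100
toy['contrib_cumul'] = toy['contrib_pct'].cumsum()

# Right-closed bins: (0, 80] -> A, (80, 95] -> B, (95, inf) -> C
toy['class'] = pd.cut(
    toy['contrib_cumul'],
    bins=[0, 80, 95, np.inf],
    labels=['A', 'B', 'C'],
)
print(toy)
```

Here the cumulative contributions are roughly 55, 75, 90, 96 and 100%, so the first two products fall in class A, the third in B and the last two in C. Using `np.inf` as the last bin edge avoids losing the final product if floating-point rounding pushes the cumulative sum slightly above 100.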

## INDICATOR 3: Geographic Performance  

**Description**  
This indicator analyzes performance by **city** using three dimensions:  
- **Total revenue** (`total_revenue`),  
- **Average basket value** per transaction,  
- **Number of unique customers** (`email`) in each city.

The goal is to identify strong cities (high revenue, many customers) and high-potential cities (fewer customers but high average basket).

**Method**  
We aggregate the data by `city` using a `groupby` to compute total revenue, average basket value, and number of unique customers.  

We then focus on the cities that contribute most to revenue (for example, the top 15) and display them in a bubble chart: the X-axis shows the average basket value, the Y-axis shows total revenue, bubble size represents the number of customers, and bubble color shows the composite potential score.

**Interpretation on this dataset**  
In this synthetic dataset, there are many different cities with relatively similar volumes:  
no single city dominates total revenue, and most cities fall into a “middle” zone.  
The chart therefore mainly illustrates the geographical analysis method, rather than revealing a few clearly dominant markets, as one might expect from a real e-commerce history.

**Business action (in a real case)**  
In a real-case scenario, this analysis could help strengthen presence in cities with high revenue or a high average basket, run targeted campaigns in “high-potential” cities (few customers but high basket value), or adapt commercial priorities by zone (local promotions, specific offers, logistics).


In [13]:
def analyze_geography(df: pd.DataFrame, top_n: int = 15) -> pd.DataFrame:
    """
    Geographic performance by city (top N cities).

    Args:
        df: Input invoices DataFrame.
        top_n: Number of top cities to keep based on total revenue.

    Returns:
        DataFrame with geographic metrics per city.
    """
    geo = df.groupby('city').agg({
        'total_revenue': 'sum',
        'email': 'nunique',
        'product_id': 'count'
    }).rename(columns={
        'total_revenue': 'total_revenue_city',
        'email': 'nb_customers',
        'product_id': 'nb_transactions'
    })

    geo['avg_basket'] = geo['total_revenue_city'] / geo['nb_transactions']
    geo['revenue_per_customer'] = geo['total_revenue_city'] / geo['nb_customers']

    # Composite potential score (normalized)
    geo['score'] = (
        (geo['total_revenue_city'] / geo['total_revenue_city'].max()) * 0.4 +
        (geo['avg_basket'] / geo['avg_basket'].max()) * 0.3 +
        (geo['nb_customers'] / geo['nb_customers'].max()) * 0.3
    ) * 10

    return geo.nlargest(top_n, 'total_revenue_city').reset_index()

The function `analyze_geography` aggregates performance by city, computing total revenue, number of customers, number of transactions, average basket value, revenue per customer and a composite potential score. It then keeps only the top cities based on total revenue.
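
The composite score is easier to read on a worked example. The sketch below applies the same weighting (0.4 for revenue, 0.3 for average basket, 0.3 for customers) to two hypothetical cities; the city names and figures are invented.

```python
import pandas as pd

# Hypothetical city-level aggregates (not real results)
geo = pd.DataFrame({
    'city': ['Alpha', 'Beta'],
    'total_revenue_city': [1000.0, 500.0],
    'nb_transactions': [10, 2],
    'nb_customers': [10, 2],
})
geo['avg_basket'] = geo['total_revenue_city'] / geo['nb_transactions']

# Each metric is divided by its maximum, weighted, then scaled to a 0-10 range
geo['score'] = (
    geo['total_revenue_city'] / geo['total_revenue_city'].max() * 0.4
    + geo['avg_basket'] / geo['avg_basket'].max() * 0.3
    + geo['nb_customers'] / geo['nb_customers'].max() * 0.3
) * 10
print(geo[['city', 'avg_basket', 'score']])
```

Alpha scores (1 × 0.4 + 0.4 × 0.3 + 1 × 0.3) × 10 = 8.2, while Beta scores (0.5 × 0.4 + 1 × 0.3 + 0.2 × 0.3) × 10 = 5.6. Because each metric is divided by the maximum over all cities, a city's score depends on which other cities are in the table.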

In [14]:
def viz_geo(geo: pd.DataFrame) -> go.Figure:
    """
    Geographic bubble chart for top cities.
    """
    fig = px.scatter(
        geo,
        x='avg_basket',
        y='total_revenue_city',
        size='nb_customers',
        color='score',
        hover_data=['city'],
        text='city',
        color_continuous_scale='Viridis',
        title='Geographic Performance (Top 15 Cities)',
        labels={
            'avg_basket': 'Average basket ($)',
            'total_revenue_city': 'Total revenue ($)',
            'score': 'Potential score'
        }
    )

    fig.update_traces(textposition='top center')
    fig.update_layout(template='plotly_white', height=500)

    return fig

The function `viz_geo` creates a geographic bubble chart where each city is positioned according to its average basket value and total revenue. The bubble size represents the number of customers and the color represents the overall potential score.

## INDICATOR 4: Customer Profiles by Profession  

**Description**  
This indicator studies the distribution of revenue and orders by **profession** (`job`).  
The idea is to identify **B2B customer profiles** that buy more frequently or with a higher basket value than others.

**Method**  
We apply a `groupby(job)` to compute, for each profession:  
- **Total revenue**,  
- **Number of orders**,  
- **Number of unique customers**,  
- and optionally an **average spend** per transaction or per customer.  

We then visualize the most contributive professions (by total revenue or volume) using a horizontal bar chart.

**Interpretation on this dataset**  
Here, the `job` variable is generated in a very uniform way (many professions, each with few occurrences).  
We do not see any “super-dominant” profession segment, unlike what we might expect in a real B2B database (e.g., a few sectors driving most of the revenue).  
This indicator mainly shows how we would structure a profile-based analysis, rather than identifying truly strategic segments in this synthetic dataset.

**Business action (in a real case)**  
In a real case, this analysis could help target the professions that generate the highest total revenue or the highest average basket, adapt the offer (services, product ranges, contracts) to key sectors, or build marketing campaigns segmented by profession (e.g., technical jobs, healthcare, education).


In [15]:
def analyze_profiles(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """
    Top professions by total revenue.

    Args:
        df: Input invoices DataFrame.
        top_n: Number of top professions to keep based on total revenue.

    Returns:
        DataFrame with metrics per profession.
    """
    profiles = df.groupby('job').agg({
        'total_revenue': ['sum', 'mean'],
        'email': 'nunique'
    })

    profiles.columns = ['total_revenue', 'avg_spend', 'nb_customers']
    profiles = profiles.nlargest(top_n, 'total_revenue').reset_index()

    return profiles


The function `analyze_profiles` groups customers by profession, computes total revenue, average spend per transaction and number of customers, and keeps only the top professions based on total revenue.
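
As an illustration, the same aggregation can be written with pandas named aggregation, which avoids renaming the columns afterwards. The sketch below runs on a few hypothetical transactions; the jobs, emails and amounts are invented.

```python
import pandas as pd

# Hypothetical transactions (not taken from invoices.csv)
toy = pd.DataFrame({
    'job': ['Nurse', 'Nurse', 'Engineer', 'Engineer', 'Engineer', 'Teacher'],
    'email': ['a@x.io', 'a@x.io', 'b@x.io', 'c@x.io', 'd@x.io', 'e@x.io'],
    'total_revenue': [100.0, 300.0, 50.0, 150.0, 100.0, 90.0],
})

# Named aggregation: output column = (input column, aggregation function)
profiles = (
    toy.groupby('job')
    .agg(
        total_revenue=('total_revenue', 'sum'),
        avg_spend=('total_revenue', 'mean'),
        nb_customers=('email', 'nunique'),
    )
    .nlargest(2, 'total_revenue')
    .reset_index()
)
print(profiles)
```

Note that `avg_spend` is an average per transaction, not per customer: in this toy table the single Nurse customer spent 400 in total, but `avg_spend` is 200.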

In [16]:
def viz_profiles(profiles: pd.DataFrame) -> go.Figure:
    """
    Horizontal bar chart for professions.
    """
    fig = px.bar(
        profiles,
        y='job',
        x='total_revenue',
        orientation='h',
        color='avg_spend',
        color_continuous_scale='Blues',
        title='Top 10 Professions by Revenue',
        labels={
            'total_revenue': 'Total revenue ($)',
            'job': 'Profession',
            'avg_spend': 'Average spend ($)'
        },
        text='total_revenue'
    )

    fig.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')
    fig.update_layout(template='plotly_white', height=500)

    return fig

The function `viz_profiles` creates a horizontal bar chart showing the top professions by total revenue. The color of each bar represents the average spending per transaction. 

In [17]:
def viz_abc(abc: pd.DataFrame) -> go.Figure:
    """
    Cumulative revenue plot for ABC analysis.
    """
    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x=abc['rank'],
        y=abc['contrib_cumul'],
        mode='lines+markers',
        name='Cumulative revenue (%)'
    ))

    fig.add_hline(
        y=80,
        line_dash='dash',
        line_color='green',
        annotation_text='80% (end of class A)',
        annotation_position='top left'
    )
    fig.add_hline(
        y=95,
        line_dash='dash',
        line_color='orange',
        annotation_text='95% (end of class B)',
        annotation_position='bottom left'
    )

    fig.update_layout(
        title='ABC Analysis - Cumulative Revenue',
        xaxis_title='Products (sorted by decreasing revenue)',
        yaxis_title='Cumulative revenue (%)',
        yaxis=dict(range=[0, 105]),
        template='plotly_white',
        height=500
    )

    return fig

The function `viz_abc` plots the cumulative contribution of products to total revenue, sorted by decreasing revenue, and highlights the thresholds that define classes A and B in the ABC analysis.

## INDICATOR 5: Amount Distribution (Normalization)

**Description**  
This indicator analyzes the distribution of purchase amounts (`total_revenue`) to understand the structure of order baskets.  

**Method**  
We use a histogram, a boxplot and descriptive statistics (min, max, quartiles, mean, median).  
In addition to the analysis on raw amounts, a Min–Max normalization is applied to `total_revenue` to bring values into the [0, 1] range.  
This normalization follows the method used in the practical session and would make it possible, if needed, to compare this indicator to other variables on a common scale.

On this dataset, we observe that around 27% of transactions have an amount < 100 $, 49% are between 100 $ and 400 $, and 24% are > 400 $.  

We can therefore define three basket segments:  
- **Small basket**: < 100 $  
- **Medium basket**: 100–400 $  
- **Large basket**: > 400 $  
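
The three segments above could be assigned as in the following sketch. It runs on a handful of hypothetical amounts, and it assumes that the boundary values 100 and 400 both belong to the medium segment, which the text does not state explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical order amounts in $ (not taken from invoices.csv)
amounts = pd.Series([20.0, 99.99, 100.0, 250.0, 400.0, 820.0], name='total_revenue')

# Assumed boundaries: < 100 small, 100-400 inclusive medium, > 400 large
segment = pd.Series(
    np.select(
        [amounts < 100, amounts <= 400],
        ['Small', 'Medium'],
        default='Large',
    ),
    index=amounts.index,
    name='segment',
)

# Share of transactions in each segment, in percent
shares = segment.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Applied to the `total_revenue` column of the real table, the same `value_counts(normalize=True)` call would give the segment shares quoted above.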

**Business action**  
Possible actions: target small baskets with offers that increase the order amount (cross-sell, free-shipping thresholds), treat medium baskets as the core target, and offer specific benefits for large baskets (VIP program, personalized discounts).


In [18]:
def analyze_distribution(df: pd.DataFrame) -> dict:
    """
    Distribution statistics for order amounts.

    Args:
        df: Input invoices DataFrame.

    Returns:
        Dict with descriptive stats and normalized data.
    """
    stats = {
        'mean': df['total_revenue'].mean(),
        'median': df['total_revenue'].median(),
        'std': df['total_revenue'].std(),
        'min': df['total_revenue'].min(),
        'max': df['total_revenue'].max(),
        'q25': df['total_revenue'].quantile(0.25),
        'q75': df['total_revenue'].quantile(0.75)
    }

    # Min–Max normalization
    df_norm = df.copy()
    df_norm['revenue_norm'] = (
        (df['total_revenue'] - stats['min']) / (stats['max'] - stats['min'])
    )

    return {'stats': stats, 'data': df_norm}


The function `analyze_distribution` computes basic descriptive statistics for order amounts, then works on a copy of the DataFrame to add a `revenue_norm` column holding the Min–Max normalized values (scaled between 0 and 1).
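
Min–Max normalization maps each value x to (x − min) / (max − min). The sketch below wraps this formula in a hypothetical helper, `min_max`, which is not part of the notebook; it also guards against the case where all values are equal, where the formula would divide by zero.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; return zeros if all values are equal."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Constant column: avoid dividing by zero
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - lo) / (hi - lo)


# Hypothetical amounts (not taken from invoices.csv)
amounts = pd.Series([20.0, 100.0, 250.0, 420.0, 820.0])
normalized = min_max(amounts)
print(normalized.tolist())
```

The smallest amount maps to 0 and the largest to 1, so a single very large order compresses all other values toward 0.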

In [19]:
def viz_distribution(result: dict) -> go.Figure:
    """
    Histogram + boxplot for order amounts.
    """
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Amount distribution', 'Box plot'),
        column_widths=[0.7, 0.3]
    )

    # Histogram
    fig.add_trace(
        go.Histogram(
            x=result['data']['total_revenue'],
            nbinsx=50,
            name='Frequency'
        ),
        row=1, col=1
    )

    # Box plot
    fig.add_trace(
        go.Box(
            y=result['data']['total_revenue'],
            name='Distribution'
        ),
        row=1, col=2
    )

    fig.update_xaxes(title_text='Amount ($)', row=1, col=1)
    fig.update_yaxes(title_text='Frequency', row=1, col=1)
    fig.update_yaxes(title_text='Amount ($)', row=1, col=2)

    fig.update_layout(
        title_text='Order amount distribution analysis',
        template='plotly_white',
        showlegend=False,
        height=500
    )
    
    return fig

The function `viz_distribution` visualizes the distribution of order amounts using a histogram and a box plot to show both the frequency and the main descriptive characteristics of the data.

## Dash Dashboard

In [20]:
def create_dashboard(bcg, abc, geo, profiles, distrib):
    """
    Dash dashboard with 5 indicators.
    """
    app = dash.Dash(__name__)

    # Reusable style for indicator boxes
    style_box = {
        'padding': '20px',
        'backgroundColor': 'white',
        'margin': '10px',
        'borderRadius': '10px',
        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)'
    }

    app.layout = html.Div([
        # Header
        html.Div([
            html.H1(
                "E-Commerce Analysis Dashboard",
                style={'textAlign': 'center', 'color': '#2c3e50'}
            ),
            html.H3(
                "Dataset: invoices.csv (10,000 transactions)",
                style={'textAlign': 'center', 'color': '#7f8c8d'}
            ),
            html.P(
                "Team: Chadi ALKERDI, Bilal GOUGIS, Mantra OUTTANDY, Minji PARK, Jeffrey SAINT ANDRE",
                style={'textAlign': 'center', 'color': '#95a5a6'}
            )
        ], style={'padding': '20px', 'backgroundColor': '#ecf0f1', 'marginBottom': '20px'}),

        # Indicator 1: BCG
        html.Div([
            html.H3("Indicator 1: BCG Product Matrix", style={'color': '#2980b9'}),
            html.P("Action: Focus on Stars/Premium (top 20%), monitor Standard products."),
            dcc.Graph(figure=viz_bcg(bcg))
        ], style=style_box),

        # Indicator 2: ABC
        html.Div([
            html.H3("Indicator 2: ABC Analysis (Pareto)", style={'color': '#27ae60'}),
            html.P("Action: Prioritize management of Class A products (up to 80% of revenue)."),
            dcc.Graph(figure=viz_abc(abc))
        ], style=style_box),

        # Indicator 3: Geography
        html.Div([
            html.H3("Indicator 3: Geographic Performance", style={'color': '#8e44ad'}),
            html.P("Action: Invest in cities with high potential scores."),
            dcc.Graph(figure=viz_geo(geo))
        ], style=style_box),

        # Indicator 4: Profiles
        html.Div([
            html.H3("Indicator 4: Top Professions by Revenue", style={'color': '#d35400'}),
            html.P("Action: B2B targeting by profession and corporate partnerships."),
            dcc.Graph(figure=viz_profiles(profiles))
        ], style=style_box),

        # Indicator 5: Distribution
        html.Div([
            html.H3("Indicator 5: Amount Distribution", style={'color': '#c0392b'}),
            html.P("Action: Segment customers by basket size (small / medium / large)."),
            dcc.Graph(figure=viz_distribution(distrib))
        ], style=style_box)
    ], style={'backgroundColor': '#f5f6fa', 'padding': '10px'})

    return app


This function creates and returns a fully structured Dash web application that displays five key e-commerce performance indicators using pre-computed data and dedicated visualization functions. It focuses on clear presentation and business interpretation by organizing each indicator into styled sections with titles, actionable insights, and interactive graphs, without performing any data processing itself.

## Complete Execution

In [21]:
# Loading
print("[1/3] Loading data...")
df = load_data('invoices.csv')
print(f"      ✓ {len(df)} rows loaded\n")

# Computing indicators
print("[2/3] Computing 5 indicators...")
bcg = analyze_bcg_matrix(df)
print("      ✓ Indicator 1: BCG Product Matrix")

abc = analyze_abc(df)
print("      ✓ Indicator 2: ABC Analysis")

# Quick check: ABC class distribution (optional)
abc['class'].value_counts(normalize=True).mul(100).round(1)

geo = analyze_geography(df)
print("      ✓ Indicator 3: Geographic Performance")

profiles = analyze_profiles(df)
print("      ✓ Indicator 4: Customer Profiles by Profession")

distrib = analyze_distribution(df)
print("      ✓ Indicator 5: Amount Distribution\n")

# Dashboard
print("[3/3] Creating dashboard...")
app = create_dashboard(bcg, abc, geo, profiles, distrib)
print("      ✓ Dashboard ready\n")

print("=" * 70)
print("PROJECT COMPLETED")
print("=" * 70)
print("To launch locally: app.run(debug=True, port=8051)")
print("Then open: http://127.0.0.1:8051/\n")


[1/3] Loading data...
      ✓ 10000 rows loaded

[2/3] Computing 5 indicators...
      ✓ Indicator 1: BCG Product Matrix
      ✓ Indicator 2: ABC Analysis
      ✓ Indicator 3: Geographic Performance
      ✓ Indicator 4: Customer Profiles by Profession
      ✓ Indicator 5: Amount Distribution

[3/3] Creating dashboard...
      ✓ Dashboard ready

PROJECT COMPLETED
To launch locally: app.run(debug=True, port=8051)
Then open: http://127.0.0.1:8051/



## Be careful 
The whole notebook must be run first for the Dash link to work.

In [None]:

if __name__ == "__main__":
    app.run(debug=True, port=8051, use_reloader=False)


## Indicator Summary

| # | Indicator | Practical Session (TP) Used | Business Action |
|---|------------|------------|------------------|
| 1 | BCG Matrix | TP2 (groupby) | Identify product roles and prioritize Stars/Premium without removing Standard |
| 2 | ABC Pareto | TP2 (query) | Priority management for Class A |
| 3 | Geography | TP2 (merge) | Invest in high-potential cities |
| 4 | Profiles | TP2 (agg) | B2B targeting by profession |
| 5 | Distribution | TP6 (normalization) | Price segmentation |

**Dashboard**: TP5 (Dash + Plotly)

## Conclusion & Perspectives

This project applies several techniques covered in the practical sessions (groupby, aggregations, normalization, visualization, Dash) 
to build an e-commerce dashboard around five indicators:

1. **BCG Matrix**: product positioning based on volume and revenue.
2. **ABC Pareto**: analysis of cumulative product contribution to revenue.
3. **Geography**: performance by city (revenue, average basket, number of customers).
4. **Customer Profiles (profession)**: sales distribution by job.
5. **Amount Distribution**: basket structure and price segmentation.

The data exploration showed that `invoices.csv` is a **synthetic** dataset 
(very homogeneous distribution, no seasonality, few repeat customers). 
The indicators should therefore be interpreted as **examples of analytical methodology**, 
rather than as operational recommendations for a real company.

In a real context, this work could be extended by adding margin and logistics cost data to reason in terms of profit, analyzing customer loyalty (RFM, churn) from real email histories, studying seasonality and the effects of marketing campaigns, and further enriching the dashboard (interactive filters, period comparisons, etc.).
