# Programming in Data Science - Final Project
## Invoices Dataset Analysis
**Team Members: Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN, Alvaro SERERO**

Dataset: Invoices (Kaggle)

Source: https://www.kaggle.com/datasets/cankatsrc/invoices/data

This dataset includes multiple fields such as customer details (first name, last name, email), transaction information (product ID, quantity, amount, invoice date), and additional attributes like address, city, and stock code.

### Import all needed libraries for the project:
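- Pandas and NumPy for data manipulation
- Plotly Express for visualizations
- Dash for creating a visual and interactive dashboard interface
- Prophet for revenue forecasting
- scikit-learn for normalization and K-Means clustering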

In [41]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
import plotly.express as px
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output
from dash import callback
from prophet import Prophet
from prophet.plot import plot_plotly
import logging

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

## 1) Data collection and exploration

### Function to safely load CSV data from a file path

In [13]:
def load_data(file_path: str) -> pd.DataFrame:
    """
    Function to load CSV data from a given file path safely.

    Input: 
    ------
    file_path => String, path to the CSV file

    Output: 
    ------
    dataset => pd.DataFrame containing the loaded data
    """
    try:
        df = pd.read_csv(file_path)
        print("Data loaded successfully.")
        return df
    
    # in case an error is generated when we try to read the file
    except Exception as e:
        print(f"Error loading data: {e}")
        return pd.DataFrame()

### Function to process invalid first_name and last_name columns.
- The initial dataset has "first_name" and "last_name" columns, but each one contains a full name (first name plus last name). This does not make sense: a transaction is made by a single person, and a column should contain exactly what its name describes.
- For example, the first row is structured as follows, which is a mistake and needs to be corrected.

first_name | last_name  
Carmen Nixon | Todd Anderson

In [14]:
def name_treatment(dataset: pd.DataFrame, options: str = "first") -> pd.DataFrame:
    """
    Function to treat first_name and last_name columns in a dataset.

    Only use this function if both the first_name and last_name columns already
    contain a full name (first + last name).
    This function does not merge the first_name and last_name columns into a single name column.

    
    Input:
    ---------
    - dataset => Pandas DataFrame, dataset must have first_name and last_name columns
    - options => String, options for treatment between "separate", "first" and "last"
        - "separate": create two rows for each input row, one per name, 
        - "first" (default): keep only the first_name renamed as name, 
        - "last" : keep only the last_name renamed as name
    
    Output:
    ---------
    - dataset => Pandas DataFrame after treating first_name and last_name columns
    """
    # We first check that both columns 'first_name' and 'last_name' are present in the dataset
    if "first_name" in dataset and "last_name" in dataset:
        if options == "separate":
            value = dataset.columns.difference(['first_name','last_name']).tolist()
            # Create a new dataset that duplicate every line in dataset, and create one line with
            # a column named 'name' with the value  of 'first_name' and the second line with
            # a column named 'name' with the value  of 'last_name'
            new_dataset = pd.melt(dataset, id_vars=value,              
                              value_vars=['first_name', 'last_name'],
                              value_name='name')
            
            # Delete variable column and put the newly created 'name' column as the first column of the dataset
            column_to_keep = [col for col in new_dataset.columns if col not in ["name", "variable"]]
            new_order = ["name"] + column_to_keep
            new_dataset = new_dataset[new_order]

        elif options == "first":
            # Delete the 'last_name' column
            new_dataset = dataset.drop(columns=['last_name'])
            # Rename 'first_name' column into 'name'
            new_dataset.rename(columns={'first_name': 'name'}, inplace=True)

        elif options == "last":
            # Delete the 'first_name' column
            new_dataset = dataset.drop(columns=['first_name'])
            # Rename 'last_name' column into 'name'
            new_dataset.rename(columns={'last_name': 'name'}, inplace=True)
        else:
            print(f"'{options}' is not a valid value for options; please use 'separate', 'first' or 'last'.")
            return dataset
        return new_dataset
    else:
        return dataset
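
To make the "separate" option concrete, here is a minimal sketch of the same `pd.melt` reshaping on a toy frame. The first row mirrors the example above; the second row and the `qty` values are invented for illustration.

```python
import pandas as pd

# Toy data: the first row mirrors the example above, the second row is invented.
toy = pd.DataFrame({
    "first_name": ["Carmen Nixon", "Alice Example"],
    "last_name": ["Todd Anderson", "Bob Example"],
    "qty": [9, 5],
})

# Every column except the two name columns identifies a transaction.
id_cols = toy.columns.difference(["first_name", "last_name"]).tolist()

# melt stacks first_name and last_name into a single 'name' column,
# producing one output row per (transaction, name) pair.
long = pd.melt(
    toy,
    id_vars=id_cols,
    value_vars=["first_name", "last_name"],
    value_name="name",
)

# Drop the helper 'variable' column and put 'name' first.
long = long[["name"] + id_cols]
print(long)
```

Each input row yields two output rows, so the transaction values (here `qty`) are duplicated. Any revenue total computed after this option counts each transaction twice.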

### Function to parse invoice dates:
- Converts the "invoice_date" column to datetime for later temporal manipulations.
- Extracts year, month, day, and day of week features.

In [15]:
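def parse_dates(df: pd.DataFrame, date_column: str, date_format: str = "%d/%m/%Y") -> pd.DataFrame:
    """
    Converts the date column from string to datetime.
    Extracts year, month, day, and day of week features.

    Input:
    ------
    - df (DataFrame) - dataset containing the date column
    - date_column (str) - name of the column holding the dates (e.g. "invoice_date")
    - [Optional] date_format (str) - format of the date strings; one of
      '%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d', '%d/%m/%Y' (default)

    Output:
    ------
    - df (DataFrame) - dataset with parsed datetime features
    """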
    # We first check that the name of the column in the variable <date_column> is in the dataset
    if date_column not in df.columns:
        print(f"Column '{date_column}' not found in DataFrame.")
        return df
    
    format_date= ['%Y-%m-%d','%d-%m-%Y','%Y/%m/%d','%d/%m/%Y']
    if date_format in format_date:
        df = df.copy()
        # Put the time column into a DatetimeFormat
        df[date_column] = pd.to_datetime(df[date_column], format=date_format, errors='coerce')
        # If the input format is not '%d/%m/%Y', round-trip the dates through the '%d/%m/%Y' format
        if (date_format != "%d/%m/%Y"):
            df[date_column] = df[date_column].dt.strftime('%d/%m/%Y')
            df[date_column] = pd.to_datetime(df[date_column], format='%d/%m/%Y', errors='coerce')
        # Create int column with year, month, day, dayofweek from the real date in column date_column
        df['year'] = df[date_column].dt.year
        df['month'] = df[date_column].dt.month
        df['day'] = df[date_column].dt.day
        df['dayofweek'] = df[date_column].dt.dayofweek
    else:
        print("Wrong date format type")
    return df
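
The behaviour of `errors='coerce'` and of the `.dt` accessors used above can be checked on a few values. This is a small sketch: the two valid dates are taken from the dataset preview and the third string is deliberately invalid.

```python
import pandas as pd

dates = pd.Series(["10/09/1982", "03/10/2012", "not a date"])

# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "dayofweek": parsed.dt.dayofweek,  # Monday = 0, Sunday = 6
})
print(features)
print("Unparseable rows:", parsed.isna().sum())
```

Because invalid dates become NaT silently, it is worth checking the count of missing values in the date column after parsing.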

### Function to strip whitespace from all string columns of the dataset.

In [16]:
def convert_string_columns(df: pd.DataFrame) -> None:
    """
    String manipulation: Strip whitespace from object columns

    Input:
    -------
    - df => Pandas DataFrame to be processed
    Output:
    ------- 
    None (the function modifies the DataFrame in place)
    """
    # Put every 'object' column in string_cols variable
    string_cols = df.select_dtypes(include=['object']).columns
    # Strip whitespace in every column in string_col
    for col in string_cols:
        df[col] = df[col].str.strip()

### Function to preprocess the initial loaded dataset: combines all the previous functions and returns a clean dataset.

In [17]:
def preprocess_data(df: pd.DataFrame, date_column: str, date_format: str = "%d/%m/%Y", name_options: str = "first") -> pd.DataFrame:
    """
    Function to preprocess the initial loaded dataset:
    - Strips whitespace from strings using the convert_string_columns function. 
    - Converts the date column to datetime for later temporal manipulations.
    - Adds "revenue" column derived from "qty" and "amount" columns.
    - Creates a "name" column using the name_treatment function to correct the first_name and last_name columns.

    Input:
    ---------
    - df => Pandas DataFrame to be preprocessed
    - date_column => Name of the column that contains the date
    - [Optional] date_format (String) => format of the dates in date_column (default "%d/%m/%Y")
    - [Optional] name_options (String) => options for the name_treatment function. Possible choices: "none", "separate", "first" (default) and "last"

    Output:
    ---------
    - df => Preprocessed Pandas DataFrame
    """
    if 'qty' in df.columns and 'amount' in df.columns:
        # Create 'revenue' column as product of 'qty' and 'amount'
        df['revenue'] = df['qty'] * df['amount']

    if name_options != "none":
        df = name_treatment(df,name_options)
    df = parse_dates(df,date_column,date_format)
    convert_string_columns(df)

    return df

### Function for data exploration: displaying basic information on our dataset.
We can see that there is no missing or NaN data since all columns have 10000 non-null rows.

In [18]:
def explore_data(df: pd.DataFrame) -> None:
    """
    Prints key exploratory information: 
    - dataset shape (rows, columns)
    - column data types
    - missing values per column
    - description of columns
    - correlation matrix between numerical columns

    Input:
    ---------
    - df => Pandas DataFrame to be explored

    Output:
    ---------
    None (prints information to console)
    """
    print("Shape (rows, columns):", df.shape)

    print("\nColumn dtypes:")
    print(df.dtypes)

    print("\nMissing values per column:")
    print(df.isna().sum())

    print("\nBasic description of numerical columns:")
    print(df.describe())

    # Correlation matrix for numeric variables
    if 'qty' in df.columns and 'amount' in df.columns and 'revenue' in df.columns:
        print("\nCorrelation matrix (numeric columns):")
        print(df[['qty', 'amount', 'revenue']].corr())

### Testing data collection, preprocessing and exploration on the Invoices dataset.

In [19]:
df = load_data('invoices.csv')
df

Error loading data: [Errno 2] No such file or directory: 'invoices.csv'


In [21]:
df = load_data('invoices.csv')
df = preprocess_data(df, date_column="invoice_date", name_options="first")
explore_data(df)
df

Data loaded successfully.
Shape (rows, columns): (10000, 15)

Column dtypes:
name                    object
email                   object
product_id               int64
qty                      int64
amount                 float64
invoice_date    datetime64[ns]
address                 object
city                    object
stock_code               int64
job                     object
revenue                float64
year                     int32
month                    int32
day                      int32
dayofweek                int32
dtype: object

Missing values per column:
name            0
email           0
product_id      0
qty             0
amount          0
invoice_date    0
address         0
city            0
stock_code      0
job             0
revenue         0
year            0
month           0
day             0
dayofweek       0
dtype: int64

Basic description of numerical columns:
         product_id           qty        amount                invoice_date  \
count  10000.

Unnamed: 0,name,email,product_id,qty,amount,invoice_date,address,city,stock_code,job,revenue,year,month,day,dayofweek
0,Carmen Nixon,marvinjackson@example.com,133,9,14.57,1982-09-10,283 Wendy Common,West Alexander,36239634,Logistics and distribution manager,131.13,1982,9,10,4
1,Mrs. Heather Miller,jeffrey84@example.net,155,5,65.48,2012-10-03,13567 Patricia Circles Apt. 751,Andreamouth,2820163,Osteopath,327.40,2012,10,3,2
2,Crystal May,ugoodman@example.com,151,9,24.66,1976-03-23,6389 Debbie Island Suite 470,Coxbury,27006726,Economist,221.94,1976,3,23,1
3,Bobby Weber,ssanchez@example.com,143,4,21.34,1986-08-17,6362 Ashley Plaza Apt. 994,Ninaland,83036521,Sports administrator,85.36,1986,8,17,6
4,Kristen Welch,cynthia66@example.net,168,2,83.90,1996-06-11,463 Steven Cliffs Suite 757,Isaiahview,80142652,Chief Marketing Officer,167.80,1996,6,11,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Daniel Chapman,davidrice@example.com,133,1,39.82,2004-05-14,8220 Stewart Isle Apt. 382,New Willieberg,53513212,"Surveyor, insurance",39.82,2004,5,14,4
9996,Jonathan Cabrera,jennifer66@example.com,133,1,21.94,1994-10-07,48045 Harris Mountain Apt. 857,Jameston,61754737,Writer,21.94,1994,10,7,4
9997,David Thomas,moraleskimberly@example.org,173,2,62.05,2010-06-27,872 Tonya Drive,West Eric,17037907,"Surveyor, quantity",124.10,2010,6,27,6
9998,Rose Bond,vleon@example.net,146,5,11.35,2011-01-16,635 Saunders Creek Suite 967,Port Catherine,32764659,Architectural technologist,56.75,2011,1,16,6


## 2) Querying the dataset

### Indicator 1: Grouping query (top cities by total revenue)
- Revenue = Quantity * Amount. This is the total amount of a single transaction.
- Identifies the most profitable geographic locations by aggregating total revenue by city.
- Could potentially be used for business (targeted marketing, logistics, etc.).

In [22]:
def indicator_top_group(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """
    Compute top N group by total revenue.

    This function groups invoices by 'city', sums the 'revenue' values and returns the top n cities ordered by total revenue descending.
    
    Input:
    --------
    - df: invoices DataFrame with 'city' and 'revenue' columns
    - n: number of top cities to return
    
    Output:
    --------
    - DataFrame with columns ['city', 'total_revenue', 'transaction_count']
    """
    # Check if revenue column exists, if not, create it from qty and amount
    if 'revenue' not in df.columns:
        if 'qty' in df.columns and 'amount' in df.columns:
            df['revenue'] = df['qty'] * df['amount']
        else:
            raise ValueError("Cannot calculate revenue: missing 'revenue' or 'qty'/'amount' columns")
    
    # Group by city and sum the revenues
    revenue_by_city = df.groupby('city').agg({
        'revenue': 'sum',
        'product_id': 'count'
    }).reset_index()

    revenue_by_city.columns = ['city', 'total_revenue', 'transaction_count']

    # Sort by revenue descending
    revenue_by_city = revenue_by_city.sort_values('total_revenue', ascending=False)

    # Return the top n cities by total revenue
    return revenue_by_city.head(n)
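
The group, sum, sort and head pattern used by `indicator_top_group` can be verified by hand on a small frame. The sketch below uses invented cities and values, and pandas named aggregation instead of the dictionary form, which is only a stylistic choice.

```python
import pandas as pd

# Invented transactions for three cities.
toy = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "A"],
    "qty": [1, 2, 3, 1, 1, 2],
    "amount": [10.0, 5.0, 2.0, 50.0, 4.0, 1.0],
})
toy["revenue"] = toy["qty"] * toy["amount"]

top = (
    toy.groupby("city")
       .agg(total_revenue=("revenue", "sum"),
            transaction_count=("revenue", "count"))
       .sort_values("total_revenue", ascending=False)
       .head(2)
       .reset_index()
)
print(top)
```

City C ranks first with a single large transaction (50.0), ahead of city A with three smaller ones (18.0 in total). Total revenue and transaction count can therefore rank cities differently.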

In [23]:
indicator_top_group(df, n=10)

| | city | total_revenue | transaction_count |
|---|---|---|---|
| 2864 | Lake James | 4417.17 | 12 |
| 5395 | Port Kimberly | 3144.1 | 6 |
| 4741 | North Michael | 2902.27 | 8 |
| 6105 | Smithmouth | 2644.53 | 8 |
| 6351 | South Jennifer | 2570.33 | 9 |
| 5362 | Port Joshua | 2566.84 | 7 |
| 6334 | South James | 2504.3 | 10 |
| 2218 | Jamesstad | 2381.87 | 5 |
| 3019 | Lake Michael | 2317.09 | 9 |
| 6257 | South David | 2309.46 | 9 |


### Indicator 2: Data transformation (revenue normalization by city)
- Apply min‑max normalization or z‑score to city revenue to compare cities independently of absolute scale.
- Min-Max normalization using the formula:
    $$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
    where x is the original revenue and min(x), max(x) are the minimum and maximum revenues across cities.

- Z-Score normalization using the formula:
    $$z = \frac{x - \mu}{\sigma}$$
    where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$.
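
As a quick worked example of both formulas, the sketch below normalizes three invented city revenues. It assumes the population standard deviation (`ddof=0`) for the z-score.

```python
import pandas as pd

# Invented total revenues for three cities.
revenue = pd.Series([100.0, 200.0, 400.0], index=["A", "B", "C"])

# Min-max: rescales the values to the [0, 1] interval.
minmax = (revenue - revenue.min()) / (revenue.max() - revenue.min())

# Z-score with the population standard deviation (ddof=0).
zscore = (revenue - revenue.mean()) / revenue.std(ddof=0)

print(minmax.round(3).to_dict())
print(zscore.round(3).to_dict())
```

With min-max, city B lands at (200 - 100) / (400 - 100) = 1/3. With the z-score, the transformed values have mean 0 and standard deviation 1, so the sign shows whether a city is above or below the average.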

In [24]:
def normalize_city_revenue(city_rev: pd.DataFrame, method: str = 'min-max') -> pd.DataFrame:
    """
    Normalize total_revenue column using specified method.
    
    Input: 
    -------
    - city_rev: DataFrame with 'city' and 'total_revenue'
    - method: normalization method, either 'min-max', 'z-score' or 'both' (default 'min-max').
      The spellings 'minmax' and 'zscore' are also accepted.
    
    Output: 
    ------
    - copy of the DataFrame with extra column 'revenue_minmax' and/or 'revenue_zscore'
    """
    city_rev = city_rev.copy()

    # Accept both 'min-max'/'z-score' and 'minmax'/'zscore' spellings
    method = method.lower().replace("-", "")
    if method not in ("minmax", "zscore", "both"):
        raise ValueError("method must be 'min-max', 'z-score' or 'both'")

    # Normalize the dataset with min-max normalization
    if method in ("minmax", "both"):
        min_val = city_rev["total_revenue"].min()
        max_val = city_rev["total_revenue"].max()
        if max_val != min_val:
            city_rev["revenue_minmax"] = (
                (city_rev["total_revenue"] - min_val) / (max_val - min_val)
            )
        else:
            # All cities have the same revenue: avoid a division by zero
            city_rev["revenue_minmax"] = 0.0

    # Normalize the dataset with z-score normalization
    if method in ("zscore", "both"):
        mean_val = city_rev["total_revenue"].mean()
        std_val = city_rev["total_revenue"].std(ddof=0)
        if std_val != 0:
            city_rev["revenue_zscore"] = (
                (city_rev["total_revenue"] - mean_val) / std_val
            )
        else:
            city_rev["revenue_zscore"] = 0

    return city_rev

### Function to discretize city revenue, assigning each city to a quantile-based revenue segment (for example low, medium, high when q = 3).

In [25]:
def discretize_city_revenue(city_rev: pd.DataFrame, q: int = 3) -> pd.DataFrame:
    """
    Discretize city revenue into q quantile-based categories.

    Input:
    --------
    - city_rev: DataFrame with 'total_revenue'
    - q: number of bins
    
    Output:
    -------
    - DataFrame with extra column 'revenue_segment'
    """
    city_rev['revenue_segment'] = pd.qcut(
        city_rev['total_revenue'],
        q=q,
        labels=[f"Segment_{i+1}" for i in range(q)]
    )
    return city_rev
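
The effect of `pd.qcut` is easiest to see on evenly spread values. This sketch uses six invented cities and the same label scheme as the function above.

```python
import pandas as pd

# Six invented cities with evenly spread revenues.
city_rev = pd.DataFrame({
    "city": list("ABCDEF"),
    "total_revenue": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

q = 3
# qcut picks bin edges at the quantiles, so each segment holds
# roughly the same number of cities.
city_rev["revenue_segment"] = pd.qcut(
    city_rev["total_revenue"],
    q=q,
    labels=[f"Segment_{i + 1}" for i in range(q)],
)
print(city_rev)
```

Each segment receives two cities here. On real data with many tied revenues, the quantile edges may coincide and `pd.qcut` then raises a `ValueError`, so a check for duplicate edges may be needed.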

### Indicator 2 (Version 2): Data Transformation - Customer Segmentation
This indicator applies MinMax Normalization to standardize features and uses K-Means clustering to segment customers.

This helps identify high-value customers for targeted marketing and retention strategies.
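
The core idea (scale, cluster, then name the clusters by their mean spend) can be sketched on a handful of invented totals. The values are chosen to form three clearly separated groups, which real data will usually not do.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Invented spending totals forming three well-separated groups.
profile = pd.DataFrame({
    "total_spent": [10.0, 12.0, 15.0, 300.0, 310.0, 320.0, 900.0, 910.0, 920.0],
})

# Scale to [0, 1] so that no feature dominates the distance computation.
scaled = MinMaxScaler().fit_transform(profile[["total_spent"]])
profile["segment"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)

# Cluster ids are arbitrary, so order them by mean spend before naming them.
order = profile.groupby("segment")["total_spent"].mean().sort_values().index
mapping = dict(zip(order, ["Low Value", "Medium Value", "High Value"]))
profile["segment_label"] = profile["segment"].map(mapping)
print(profile)
```

Sorting the cluster means before assigning names is what keeps "High Value" attached to the highest-spending group, whatever integer id K-Means happens to give it.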

In [26]:
def customer_segmentation(df: pd.DataFrame) -> pd.DataFrame:
    """
    - Applies MinMax normalization and K-Means clustering for segmentation.
    - Segments customers into Low, Medium, and High value groups based on spending patterns.

    Input:
    --------
    - df: invoices DataFrame
    
    Output:
    --------
    - DataFrame with customer revenue segmentation and detailed value ranges
    """
    # Check if revenue column exists, if not, create it from qty and amount
    if 'revenue' not in df.columns:
        if 'qty' in df.columns and 'amount' in df.columns:
            df['revenue'] = df['qty'] * df['amount']
        else:
            raise ValueError("Cannot calculate revenue: missing 'revenue' or 'qty'/'amount' columns")

    # Aggregate customer-level metrics
    customer_profile = df.groupby(['name', 'email']).agg({
        'revenue': ['sum', 'mean', 'count'],
        'qty': 'sum'
    }).reset_index()

    customer_profile.columns = ['name', 'email',
                                 'total_spent', 'avg_transaction',
                                 'num_transactions', 'total_quantity']
    
    # Apply MinMax Normalization to features
    features = ['total_spent', 'avg_transaction', 'num_transactions', 'total_quantity']
    scaler = MinMaxScaler()
    customer_profile[[f"{f}_norm" for f in features]] = scaler.fit_transform(customer_profile[features])

    # Basic safety checks before clustering
    if customer_profile.shape[0] < 3:
        # Not enough samples for 3 clusters: we skip clustering and return profile
        customer_profile['segment'] = 0
        customer_profile['segment_label'] = 'Single/Small'
        return customer_profile

    # K-Means Clustering (3 clusters)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    customer_profile['segment'] = kmeans.fit_predict(customer_profile[['total_spent_norm', 'num_transactions_norm']])

    # Label segments based on spending levels
    segment_means = customer_profile.groupby('segment')['total_spent'].mean().sort_values()
    segment_mapping = {
        segment_means.index[0]: 'Low Value',
        segment_means.index[1]: 'Medium Value',
        segment_means.index[2]: 'High Value'
    }
    customer_profile['segment_label'] = customer_profile['segment'].map(segment_mapping)

    print("Segment Distribution:")
    print(customer_profile['segment_label'].value_counts())

    print("\nSegment Revenue Ranges:")
    segment_stats = customer_profile.groupby('segment_label')['total_spent'].agg([
        ('Count', 'count'),
        ('Min Revenue', 'min'),
        ('Max Revenue', 'max'),
        ('Mean Revenue', 'mean'),
        ('Median Revenue', 'median')
    ]).round(2)
    print(segment_stats)

    print("\nSegment Characteristics:")
    detailed_stats = customer_profile.groupby('segment_label').agg({
        'total_spent': ['min', 'max', 'mean'],
        'num_transactions': ['mean'],
        'total_quantity': ['mean']
    }).round(2)
    print(detailed_stats)

    # Add value range to each customer record
    segment_ranges = customer_profile.groupby('segment_label')['total_spent'].agg(['min', 'max'])
    customer_profile = customer_profile.merge(
        segment_ranges, 
        left_on='segment_label', 
        right_index=True, 
        suffixes=('', '_segment')
    )
    customer_profile.rename(columns={'min': 'segment_min', 'max': 'segment_max'}, inplace=True)

    # Create readable range label
    customer_profile['value_range'] = customer_profile.apply(
        lambda row: f"${row['segment_min']:.2f} - ${row['segment_max']:.2f}", 
        axis=1
    )

    return customer_profile

In [27]:
customer_segmentation(df)

Segment Distribution:
segment_label
Low Value       5305
Medium Value    3085
High Value      1610
Name: count, dtype: int64

Segment Revenue Ranges:
               Count  Min Revenue  Max Revenue  Mean Revenue  Median Revenue
segment_label                                                               
High Value      1610       491.50       898.92        640.53          623.20
Low Value       5305         5.07       224.82        107.20           98.44
Medium Value    3085       224.88       491.25        342.61          337.12

Segment Characteristics:
              total_spent                 num_transactions total_quantity
                      min     max    mean             mean           mean
segment_label                                                            
High Value         491.50  898.92  640.53              1.0           7.83
Low Value            5.07  224.82  107.20              1.0           3.65
Medium Value       224.88  491.25  342.61              1.0           

Unnamed: 0,name,email,total_spent,avg_transaction,num_transactions,total_quantity,total_spent_norm,avg_transaction_norm,num_transactions_norm,total_quantity_norm,segment,segment_label,segment_min,segment_max,value_range
0,Aaron Allen,ihogan@example.com,74.39,74.39,1,1,0.077552,0.077552,0.0,0.000,0,Low Value,5.07,224.82,$5.07 - $224.82
1,Aaron Alvarez,anthony45@example.net,202.26,202.26,1,3,0.220607,0.220607,0.0,0.250,0,Low Value,5.07,224.82,$5.07 - $224.82
2,Aaron Bennett,miranda15@example.net,98.16,98.16,1,3,0.104145,0.104145,0.0,0.250,0,Low Value,5.07,224.82,$5.07 - $224.82
3,Aaron Brown,sergiozamora@example.org,19.14,19.14,1,2,0.015741,0.015741,0.0,0.125,0,Low Value,5.07,224.82,$5.07 - $224.82
4,Aaron Burns,josephanderson@example.org,187.68,187.68,1,2,0.204296,0.204296,0.0,0.125,0,Low Value,5.07,224.82,$5.07 - $224.82
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Zachary Thomas,annecunningham@example.net,379.08,379.08,1,4,0.418426,0.418426,0.0,0.375,2,Medium Value,224.88,491.25,$224.88 - $491.25
9996,Zachary Thomas,jessicahenderson@example.net,592.34,592.34,1,7,0.657012,0.657012,0.0,0.750,1,High Value,491.50,898.92,$491.50 - $898.92
9997,Zachary Weaver,catherine88@example.org,351.24,351.24,1,4,0.387280,0.387280,0.0,0.375,2,Medium Value,224.88,491.25,$224.88 - $491.25
9998,Zoe Klein,joseph00@example.org,123.62,123.62,1,2,0.132629,0.132629,0.0,0.125,0,Low Value,5.07,224.82,$5.07 - $224.82


### Indicator 3: Temporal Analysis - Revenue Forecasting
Function for temporal prediction of the revenue.

In [28]:
def temporal_prediction(df: pd.DataFrame, time: str="year",periods: int = 10):
    """
    Use Prophet model to make a temporal prediction of the revenue.

    Inputs:
    ---------
    - df (DataFrame): Input dataset
    - [Optional] time (str): options for the prediction between "year", "month" and "day".  
        - "year" (default): use the year column of the dataset to make the prediction.  
        - "month": use the month column of the dataset to make the prediction. 
        - "day": use the invoice_date column containing the full date to make the prediction
    - [Optional] periods (int): Number of future periods (years, months or days) to forecast. Default 10.

    Outputs: 
    --------
    - new_df (DataFrame) - Contains 'time', 'original_revenue' and 'predicted_revenue'.
    - model (Prophet) - Prophet model trained on the dataset and used for the prediction
    - prediction (DataFrame) - Future prediction made by the model
    """
    dataset = df.copy()
    if time == "year":
        # Should the datetime conversion be moved elsewhere?
        dataset['year'] = pd.to_datetime(dataset['year'], format='%Y')

        # Put the future date to year and group the revenue by year
        freq = 'YE'
        dataset = dataset.groupby('year')['revenue'].sum().reset_index()
        # Rename the column to the name that the Prophet model will need
        dataset.rename(columns={'year': 'ds','revenue': 'y'}, inplace=True)

    elif time == "month":
        # Should the datetime conversion be moved elsewhere?
        dataset['month'] = dataset['year'].astype(str) + '-' + dataset['month'].astype(str).str.zfill(2)
        dataset['month'] = pd.to_datetime(dataset['month'], format='%Y-%m')

        # Put the future date to month and group the revenue by month
        freq = 'ME'
        dataset = dataset.groupby('month')['revenue'].sum().reset_index()
        # Rename the column to the name that the Prophet model will need
        dataset.rename(columns={'month': 'ds','revenue': 'y'}, inplace=True)
    elif time == "day":

        # Put the future date to day and group the revenue by day
        freq = 'D'
        dataset = dataset.groupby('invoice_date')['revenue'].sum().reset_index()
        # Rename the column to the name that the Prophet model will need
        dataset.rename(columns={'invoice_date': 'ds','revenue': 'y'}, inplace=True)
    else:
        print("Error: the 'time' option must be 'year', 'month' or 'day'.")
        return df,None,None

    # Set the cmdstanpy log level to ERROR to hide Prophet's output when it runs without trouble
    logging.getLogger('cmdstanpy').setLevel(logging.ERROR)

    # Create a Prophet model to make prediction
    model = Prophet()
    model.fit(dataset)

    # get the date to predict and then use the predict function on these date to obtain the prediction
    future_dates = model.make_future_dataframe(periods=periods, freq=freq)
    prediction = model.predict(future_dates)

    # Create a dataset with both original value and predicted value
    # Then rename the new dataset columns with original_revenue and predicted_revenue
    new_df = pd.merge(dataset[['ds', 'y']], prediction[['ds', 'yhat']], on='ds', how='outer')
    new_df.rename(columns={'y': 'original_revenue','yhat': 'predicted_revenue', 'ds': 'time'}, inplace=True)
    
    return new_df, model,prediction  
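
Prophet expects its training frame to have a datetime column named `ds` and a numeric column named `y`. The sketch below shows only that reshaping step for a monthly aggregation, on invented invoices and without fitting a model. It uses `to_period("M")` as an alternative to the string concatenation in the function above.

```python
import pandas as pd

# Invented invoices spread over three months.
toy = pd.DataFrame({
    "invoice_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-03", "2021-03-15"]
    ),
    "revenue": [100.0, 50.0, 80.0, 20.0],
})

# Snap every date to the first day of its month, sum revenue per month,
# and rename the columns to the 'ds' / 'y' names Prophet requires.
monthly = (
    toy.assign(ds=toy["invoice_date"].dt.to_period("M").dt.to_timestamp())
       .groupby("ds", as_index=False)["revenue"].sum()
       .rename(columns={"revenue": "y"})
)
print(monthly)
```

The resulting frame has one row per month, stamped at the start of the month, and could be passed to `Prophet().fit(...)` in the same way as `dataset` in the function above.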

### Function to create a visualization based on a temporal prediction

In [29]:
def display_temporal_prediction(df: pd.DataFrame, model, prediction, options: str = "prophet"):
    """
    Create a visualization of a dataset with temporal predictions,
    either from the dataset itself or from the prediction model.

    Inputs:
    ---------  
    - df (DataFrame): Input dataset with 'original_revenue' and 'predicted_revenue' columns.
    - model (Prophet): Prophet model trained on the dataset and used for the prediction.
    - prediction (DataFrame): Future prediction made by the model.
    - [Optional] options (str): options for the visualization between "ploty" and "prophet". 
        - "prophet" (default): use Prophet's plot_plotly function to plot the prediction.   
        - "ploty" (or "plotly"): use Plotly to plot the prediction.  

    Outputs: 
    --------
    - fig (Figure) - A figure containing the temporal visualization, or None on error.
    """
    # Default return value when the figure cannot be built
    fig = None

    # Create a visualization using the Plotly Express library
    if options in ("ploty", "plotly"):
        if 'predicted_revenue' in df.columns and 'original_revenue' in df.columns:
            # Use the 'time' column produced by temporal_prediction when it is available
            x_values = df["time"] if "time" in df.columns else df.index
            fig = px.area()
            fig.add_scatter(x=x_values, y=df["original_revenue"], mode='lines', line=dict(color='blue'), name="original")
            fig.add_scatter(x=x_values, y=df["predicted_revenue"], mode='lines', line=dict(color='green'), name="prediction")
            fig.update_layout(title="Prediction", xaxis_title="Date", yaxis_title="Revenue")
        else:
            print("Error, the prediction was not found in the dataset")

    # Use plot_plotly from Prophet to get a visualization
    elif options == "prophet":
        if model is not None:
            fig = plot_plotly(model, prediction)
        else:
            print("Error, the prediction model was not found")
    else:
        print(f"Error, '{options}' is not a valid option; use 'ploty' or 'prophet'")
    return fig

### Indicator 4: Spatial Analysis - Geographic Clustering
This indicator applies K-Means clustering to group cities into activity levels based on revenue patterns (revenue = qty × amount).

Function to analyze geographic distribution. 

In [30]:
def analyze_geographic_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyzes spatial distribution of transactions across cities.
    Calculates revenue metrics and transaction patterns by location.

    Input:
    ---------
    - df => Pandas DataFrame with city and revenue information

    Output:
    ---------
    - city_stats => DataFrame with city-level statistics
    """
    # Check if revenue column exists
    if 'revenue' not in df.columns:
        if 'qty' in df.columns and 'amount' in df.columns:
            df['revenue'] = df['qty'] * df['amount']
            print("Revenue column not found, creating it now...")
        else:
            raise ValueError("Cannot calculate revenue: missing 'revenue' or 'qty'/'amount' columns")

    city_stats = df.groupby('city').agg({
        'revenue': ['sum', 'mean', 'std'],
        'qty': 'sum',
        'product_id': 'count'
    }).reset_index()

    city_stats.columns = ['city', 'total_revenue', 'avg_revenue', 'std_revenue',
                          'total_quantity', 'transaction_count']

    city_stats['revenue_per_transaction'] = (
        city_stats['total_revenue'] / city_stats['transaction_count'])

    city_stats = city_stats.sort_values('total_revenue', ascending=False)

    return city_stats
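
Two properties of these city-level statistics are worth checking on a small case. The sketch below uses invented revenues for two cities, one of which has a single transaction.

```python
import pandas as pd

# City A has two transactions, city B only one (invented values).
toy = pd.DataFrame({
    "city": ["A", "A", "B"],
    "revenue": [10.0, 30.0, 5.0],
})

stats = toy.groupby("city")["revenue"].agg(["sum", "mean", "std", "count"])
stats["revenue_per_transaction"] = stats["sum"] / stats["count"]
print(stats)
```

The sample standard deviation of a single observation is undefined, so city B gets `NaN` in the `std` column. The same will happen for every city with one transaction in the real data. When no revenue values are missing, `revenue_per_transaction` equals the mean, so the two columns carry the same information.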

Function for spatial clustering, grouping cities by revenue patterns using KMeans clustering.

In [31]:
def spatial_clustering(city_stats: pd.DataFrame, n_clusters: int = 4) -> pd.DataFrame:
    """
    Applies K-Means clustering to group cities by revenue patterns.
    Uses normalized features: total revenue, transaction count, and
    average order value. Identifies geographic market segments.

    Input:
    ---------
    - city_stats => DataFrame with city-level statistics
    - n_clusters => Integer, number of clusters (default: 4)

    Output:
    ---------
    - city_stats => DataFrame with cluster labels added
    """
    features = ['total_revenue', 'transaction_count', 'revenue_per_transaction']

    # Normalize features
    scaler = MinMaxScaler()
    city_stats_norm = city_stats.copy()
    city_stats_norm[features] = scaler.fit_transform(city_stats[features])

    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    city_stats['cluster'] = kmeans.fit_predict(city_stats_norm[features])

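    # Label clusters by activity level, ordered by mean total revenue
    cluster_means = city_stats.groupby('cluster')['total_revenue'].mean().sort_values()
    default_names = ['Low Activity', 'Medium-Low Activity',
                     'Medium-High Activity', 'High Activity']
    if len(cluster_means) == len(default_names):
        names = default_names
    else:
        # Generic labels when the number of clusters differs from the default of 4
        names = [f"Activity Level {i + 1}" for i in range(len(cluster_means))]
    cluster_labels = dict(zip(cluster_means.index, names))
    city_stats['cluster_label'] = city_stats['cluster'].map(cluster_labels)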
    
    print("Cluster Distribution:")
    print(city_stats['cluster_label'].value_counts())
    print("\nCluster Characteristics:")
    print(city_stats.groupby('cluster_label')[['total_revenue', 'transaction_count']].mean())

    return city_stats

**Bonus**

In [59]:
def generate_fake_extra_dataset(n_rows: int = 500) -> pd.DataFrame:
    """
    We created this function to simulate a rich external dataset.
    It generates invoices for multiple space colonies with different spending habits
    to make the clustering and revenue analysis interesting.
    """
    np.random.seed(42)
    random.seed(42)

    start_date = datetime(2021, 1, 1)
    dates = [start_date + timedelta(days=np.random.randint(0, 730)) for _ in range(n_rows)]
    
    # Base profiles
    # Titan: Industrial/Rich
    # Mars: Upper Class
    # Europa: Middle Class
    # Moon: Low Cost
    base_locations = ['Titan Outpost', 'Mars Colony', 'Europa Station', 'Moon Base']
    weights = [0.1, 0.3, 0.2, 0.4] 
    
    chosen_bases = np.random.choice(base_locations, n_rows, p=weights)
    
    # Append a random number (1-100) to each base name so the data is spread
    # over up to 400 distinct cities (4 bases x 100 suffixes) instead of just 4.
    # This keeps the per-city transaction volume comparable to the Earth cities.
    cities = [f"{base} {np.random.randint(1, 101)}" for base in chosen_bases]
    
    # Draw each invoice balance from the revenue profile of its base
    balances = []
    for base in chosen_bases: # Check base name, not full city name
        if base == 'Titan Outpost':
            balances.append(np.round(np.random.uniform(5000, 12000), 2)) # Huge revenue
        elif base == 'Mars Colony':
            balances.append(np.round(np.random.uniform(1000, 3000), 2))  # High revenue
        elif base == 'Moon Base':
            balances.append(np.round(np.random.uniform(20, 150), 2))     # Low revenue
        else:
            balances.append(np.round(np.random.uniform(300, 900), 2))    # Medium
            
    data = {
        'issuedDate': [d.strftime('%Y-%m-%d') for d in dates],
        'client': [f"Alien_{i}" for i in range(n_rows)],
        'service': np.random.choice(['Mining', 'Tech', 'Transport', 'Food'], n_rows),
        'balance': balances,
        'city': cities, # Using the diversified city names
        'id_invoice': range(1000, 1000 + n_rows),
        'total': np.zeros(n_rows),
        'discount': np.zeros(n_rows),
        'tax': np.zeros(n_rows),
        'invoiceStatus': 'Paid',
        'dueDate': [d.strftime('%Y-%m-%d') for d in dates]
    }
    
    df_fake = pd.DataFrame(data)
    print(f"Generated {n_rows} transactions across {len(set(cities))} distinct space locations.")
    return df_fake
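
One way to sanity-check the generated profiles is to strip the numeric suffix from each city name and summarise the balances per base. The sketch below uses a few hand-written rows with illustrative values rather than the real generator output; the `base` column and the `profile` table are names introduced here for the example.

```python
import pandas as pd

# Hand-written rows in the same shape as the generated data (illustrative values)
sample = pd.DataFrame({
    'city': ['Titan Outpost 7', 'Mars Colony 42', 'Moon Base 3',
             'Europa Station 15', 'Moon Base 88', 'Titan Outpost 7'],
    'balance': [8000.0, 1500.0, 60.0, 500.0, 120.0, 11000.0],
})

# Recover the base name by dropping the numeric suffix after the last space
sample['base'] = sample['city'].str.rsplit(pat=' ', n=1).str[0]

# Per-base balance range and number of distinct city names
profile = sample.groupby('base').agg(
    min_balance=('balance', 'min'),
    max_balance=('balance', 'max'),
    n_cities=('city', 'nunique'),
)
print(profile)

# Upper bound on distinct cities: 4 bases x 100 possible suffixes
max_distinct_cities = 4 * 100
```

On real generator output, each base's minimum and maximum should fall inside the range passed to `np.random.uniform` for that base.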

In [60]:
def fusion_dataset(df: pd.DataFrame, df_extra: pd.DataFrame) -> pd.DataFrame:
    '''
    Merges the main dataset with an external dataset.

    CONTEXT:
    Since the original dataset ('invoices.csv') is synthetic (it contains only fictional cities
    like 'Ninaland' or 'Noahstad'), we cannot simply download real-world external data
    (like GDP or weather) that would match these locations.

    Therefore, this function fuses the main data with a generated synthetic external dataset
    that mimics the structure of a real supplementary source
    (based on the schema of: https://www.kaggle.com/datasets/ghassenkhaled/invoices-data).

    Input:
    -------
    - df: pd.DataFrame, the main synthetic dataset
    - df_extra: pd.DataFrame, the generated external dataset to be merged
    
    Output:
    -------
    - pd.DataFrame, the combined and cleaned dataset
    '''
    df_base = df.copy()         # To make sure to not modify the base dataset
    df_base['country'] = 'Earth' 
    
    # No date filter is applied here: we keep all historical Earth data
    # plus the new Space data.

    # --- Dashboard compatibility ---
    # Rename columns to match our main dataset schema
    df_extra = df_extra.rename(columns={
        'issuedDate': 'invoice_date',
        'service': 'job',
        'balance': 'revenue',
        'client': 'name',
        'id_invoice': 'product_id' # We map invoice ID to product_id so counts work
    })

    # We need to create these columns because the dashboard needs them
    # Note: 'city' is already in df_extra (Mars, Titan, etc.), no need to overwrite it
    df_extra['qty'] = 1              # Default quantity
    df_extra['email'] = df_extra['name'].str.replace(' ', '.').str.lower() + '@space.net'
    df_extra['amount'] = df_extra['revenue'] 
    
    # Ensure date columns are present in extra data
    df_extra['invoice_date'] = pd.to_datetime(df_extra['invoice_date'])
    df_extra['year'] = df_extra['invoice_date'].dt.year
    df_extra['month'] = df_extra['invoice_date'].dt.month
    df_extra['day'] = df_extra['invoice_date'].dt.day
    df_extra['dayofweek'] = df_extra['invoice_date'].dt.dayofweek
    # ---------------------------------------

    # Drop the external columns the dashboard does not use
    cols_to_drop = ['total', 'discount', 'tax', 'invoiceStatus', 'dueDate']
    df_extra = df_extra.drop(columns=[c for c in cols_to_drop if c in df_extra.columns])
    
    # Drop unnecessary columns from the base dataset
    df_base = df_base.drop(columns=['address', 'stock_code']) 
    
    # Merge both datasets
    df_new = pd.concat([df_base, df_extra], join='outer', ignore_index=True)
    df_new = df_new.drop_duplicates()
    
    return df_new
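
Note that `pd.concat(..., join='outer')` keeps the union of the columns and fills any column that exists in only one frame with NaN. Since `country` is set on the base dataset only, the rows coming from the external dataset end up without a country. The sketch below reproduces the effect on toy frames and shows one possible way to avoid it; the `'Space'` label is an assumption made for the example, not a value used by the project.

```python
import pandas as pd

base = pd.DataFrame({'city': ['Ninaland', 'Noahstad'], 'revenue': [10.0, 20.0]})
base['country'] = 'Earth'

extra = pd.DataFrame({'city': ['Mars Colony 1'], 'revenue': [1500.0]})

# Outer concat keeps the union of columns; 'country' is NaN for the extra row
merged = pd.concat([base, extra], join='outer', ignore_index=True)
missing_before = int(merged['country'].isna().sum())

# One possible fix: tag the extra rows before concatenating
# ('Space' is an assumed label)
extra_tagged = extra.assign(country='Space')
merged_tagged = pd.concat([base, extra_tagged], join='outer', ignore_index=True)
missing_after = int(merged_tagged['country'].isna().sum())

print(missing_before, missing_after)
```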

## 3) Dash visualization

In [56]:
def create_indicator(df: pd.DataFrame):
    """
    Create the visualizations for each indicator computed on the dataset.

    Input:
    ---------
    - df => Pandas DataFrame containing our cleaned dataset

    Output:
    ---------
    - fig_top_cities_revenue => Bar chart of the N (=10) cities with the highest revenue
    - fig_customer_segmentation => Pie chart of the customer segmentation into 3 classes
    - figure_pred_year => List of revenue forecast figures based on yearly revenue
    - figure_pred_month => List of revenue forecast figures based on monthly revenue
    - figure_pred_day => List of revenue forecast figures based on daily revenue
    - fig_spatial_clustering => Scatter plot of the city clusters by revenue and activity

    """
                                        ### Indicator 1

    # Creating the indicator
    city_stats = indicator_top_group(df, n=10)

    # Creating the visualization
    fig_top_cities_revenue = px.bar(
        city_stats,
        x='city',
        y='total_revenue',
        hover_data={'transaction_count': True, 'total_revenue': ':.2f'},
        title='Top Cities by Revenue and Transactions',
        labels={'city': 'City', 'total_revenue': 'Total Revenue ($)', 'transaction_count': 'Number of Transactions'},
        color='transaction_count',
        color_continuous_scale='Blues'
    )
    fig_top_cities_revenue.update_layout(
        xaxis_tickangle=-45,
        height=400,
        showlegend=False,
        hovermode='closest'
    )

                                        ### Indicator 2
    
    # Creating the indicator
    customer_segments = customer_segmentation(df)

    # Value ranges summary
    segment_summary = customer_segments.groupby('segment_label').agg({
        'total_spent': ['count', 'min', 'max', 'mean']
    }).round(2)
    segment_summary.columns = ['Count', 'Min', 'Max', 'Mean']
    segment_summary = segment_summary.reset_index()

    # Create custom labels with value ranges
    segment_summary['label_with_range'] = segment_summary.apply(
        lambda row: f"{row['segment_label']}<br>${row['Min']:.2f} - ${row['Max']:.2f}<br>Avg: ${row['Mean']:.2f}",
        axis=1
    )

    # Creating the visualization
    fig_customer_segmentation = go.Figure(data=[go.Pie(
        labels=segment_summary['label_with_range'],
        values=segment_summary['Count'],
        hole=0.4,
        marker=dict(
            colors=['#e74c3c', '#f39c12', '#27ae60'],
            line=dict(color='white', width=2)
        ),
        textinfo='percent+value',
        textposition='outside',
        hovertemplate='<b>%{label}</b><br>Customers: %{value}<br>Percentage: %{percent}<extra></extra>'
    )])
    
    fig_customer_segmentation.update_layout(
        title={
            'text': 'Customer Segmentation by Value<br><sub>With Revenue Ranges</sub>',
            'x': 0.5,
            'xanchor': 'center'
        },
        height=400,
        showlegend=True,
        legend=dict(
            orientation="v",
            yanchor="middle",
            y=0.5,
            xanchor="left",
            x=1.05,
            font=dict(size=10)
        ),
        annotations=[dict(
            text='Customer<br>Segments',
            x=0.5, y=0.5,
            font_size=14,
            showarrow=False,
            font=dict(color='#2c3e50', weight='bold')
        )]
    )

                                        ### Indicator 3

    ## time_pred controls which data is used for the predictions: each entry is a
    ## cutoff year, and only rows with a later year are kept (1900 is used for the
    ## "All" options). If you change time_pred and use the create_dashboard function,
    ## also update the years in year_to_index and the labels in the Dropdown.
    time_pred = [1900, 2020, 2019, 2017, 2015, 2010, 2000, 1990, 1980]

    ## The prediction_lenght_* variables control how far ahead the model predicts.
    ## Each one must be a positive integer.
    prediction_lenght_year = 10
    prediction_lenght_month = 24
    prediction_lenght_day = 120

    figure_pred_year = []
    figure_pred_month = []
    figure_pred_day = []

    # Make the predictions using yearly, monthly and daily data for every cutoff year in time_pred
    for i in time_pred:
        dataset = df.copy()
        dataset = dataset[dataset["year"]>i]

        data_y, model_y,predictions_y = temporal_prediction(
            dataset,time="year", periods=prediction_lenght_year
        )
        figure_pred_year.append(
            display_temporal_prediction(data_y,model_y,predictions_y)
        )
        data_m,model_m,predictions_m = temporal_prediction(
            dataset,time="month",periods=prediction_lenght_month
        )
        figure_pred_month.append(
            display_temporal_prediction(data_m,model_m,predictions_m)
        )
        data_d, model_d, predictions_d = temporal_prediction(
            dataset, time="day", periods=prediction_lenght_day
        )
        figure_pred_day.append(
            display_temporal_prediction(data_d, model_d, predictions_d)
        )

                                        ### Indicator 4

    # Creating the indicator
    city_clusters = spatial_clustering(analyze_geographic_distribution(df))
    
    # Creating the visualization
    fig_spatial_clustering = px.scatter(
        city_clusters.head(3000),
        x='transaction_count',
        y='total_revenue',
        color='cluster_label',
        size='revenue_per_transaction',
        hover_data=['city'],
        labels={
            'transaction_count': 'Number of Transactions',
            'total_revenue': 'Total Revenue ($)',
            'cluster_label': 'Activity Level'
        },
        color_discrete_map={
            'Low Activity': 'red',
            'Medium-Low Activity': 'orange',
            'Medium-High Activity': 'blue',
            'High Activity': 'green'
        }
    )
    fig_spatial_clustering.update_layout(
        title=dict(
            text='City Clustering by Revenue and Activity',
            x=0.5,
            xanchor='center',
            font=dict(size=16)
        ),
        margin=dict(l=60, r=40, t=80, b=60),
        height=500
    )

    return (fig_top_cities_revenue, fig_customer_segmentation, figure_pred_year,
            figure_pred_month, figure_pred_day, fig_spatial_clustering)
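
The segment summary for Indicator 2 renames the aggregated columns by hand after a dict-style `agg`. As an alternative sketch, pandas named aggregation produces the same flat columns in one step. The toy `customer_segments` frame and its segment names below are made up for the example; the real frame comes from `customer_segmentation(df)`.

```python
import pandas as pd

# Toy stand-in for the output of customer_segmentation(df)
customer_segments = pd.DataFrame({
    'segment_label': ['Low Value', 'Low Value', 'High Value'],
    'total_spent': [10.0, 30.0, 500.0],
})

# Named aggregation: each keyword becomes a flat output column
segment_summary = (
    customer_segments.groupby('segment_label')
    .agg(
        Count=('total_spent', 'count'),
        Min=('total_spent', 'min'),
        Max=('total_spent', 'max'),
        Mean=('total_spent', 'mean'),
    )
    .round(2)
    .reset_index()
)

# Same label format as the pie chart: name, value range, average
segment_summary['label_with_range'] = segment_summary.apply(
    lambda row: f"{row['segment_label']}<br>${row['Min']:.2f} - ${row['Max']:.2f}"
                f"<br>Avg: ${row['Mean']:.2f}",
    axis=1,
)
print(segment_summary)
```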

In [61]:
def create_dashboard(df: pd.DataFrame, df_augmented: pd.DataFrame | None = None) -> Dash:
    """Create a Dash dashboard to visualize key business indicators from invoice data."""
    
    # Generate visualizations
    (fig_top_cities_revenue, fig_customer_segmentation, figure_pred_year,
     figure_pred_month, figure_pred_day, fig_spatial_clustering) = create_indicator(df)

    # Display static figures for the HTML report
    fig_top_cities_revenue.show()
    fig_customer_segmentation.show()
    fig_spatial_clustering.show()
    if figure_pred_year:
        figure_pred_year[0].show()

    # Fix Prophet chart layouts
    for fig_list in [figure_pred_year, figure_pred_month, figure_pred_day]:
        for fig in fig_list:
            fig.update_xaxes(title_text="Date")
            fig.update_yaxes(title_text="Revenue ($)")
            fig.update_layout(margin=dict(l=50, r=50, t=50, b=50))

    # Dropdown mapping
    year_to_index = {
        'YA': 0, 'MA': 0, 'DA': 0,
        'Y2020': 1, 'M2020': 1, 'D2020': 1,
        'Y2019': 2, 'M2019': 2, 'D2019': 2,
        'Y2017': 3, 'M2017': 3, 'D2017': 3,
        'Y2015': 4, 'M2015': 4, 'D2015': 4,
        'Y2010': 5, 'M2010': 5, 'D2010': 5,
        'Y2000': 6, 'M2000': 6, 'D2000': 6,
        'Y1990': 7, 'M1990': 7, 'D1990': 7,
        'Y1980': 8, 'M1980': 8, 'D1980': 8,
    }

    # Initialize app
    app = Dash(__name__, external_stylesheets=[
        'https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap'
    ])

    # Colors
    C = {
        'primary': '#2c3e50', 'accent': '#3498db', 'success': '#27ae60',
        'warning': '#f39c12', 'info': '#8e44ad', 'light': '#ecf0f1',
        'white': '#fff', 'text': '#2c3e50', 'text_light': '#7f8c8d',
        'border': '#bdc3c7', 'bg': '#fafafa'
    }

    # Card style
    card = {
        'backgroundColor': C['white'], 'borderRadius': '12px',
        'boxShadow': '0 4px 12px rgba(0,0,0,0.1)', 'border': f'1px solid {C["border"]}'
    }

    app.layout = html.Div([
        # Header
        html.Div([
            html.H1('Invoices Data Analysis Dashboard', 
                   style={'margin': '0 0 10px', 'color': C['primary'], 'fontSize': '2.5rem', 'fontWeight': '700'}),
            html.H3('Programming for Data Science - Final Project',
                   style={'margin': '0 0 15px', 'color': C['primary'], 'fontSize': '1.3rem', 'fontWeight': '500'}),
            html.Div([
                html.Span('Team: ', style={'fontWeight': '600'}),
                html.Span('Alvaro SERERO, Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN')
            ], style={'fontSize': '0.95rem', 'color': C['text_light'], 'marginBottom': '5px'}),
            html.Div([
                html.Span('Dataset: ', style={'fontWeight': '600'}),
                html.Span('Invoices (Kaggle) - 10,000 transactions')
            ], style={'fontSize': '0.95rem', 'color': C['text_light']}),
        ], style={'textAlign': 'center', 'padding': '40px 20px', 'backgroundColor': C['light'], 
                  'borderBottom': f'3px solid {C["accent"]}', 'marginBottom': '40px'}),

        html.Div([
            # Row 1: Indicators 1 & 2
            html.Div([
                # Indicator 1
                html.Div([
                    html.H2('Indicator 1: Top Cities by Revenue',
                           style={'color': C['accent'], 'fontSize': '1.5rem', 'fontWeight': '600', 
                                  'margin': '0 0 10px', 'padding': '20px 20px 0'}),
                    html.P('Grouping Query - Revenue = Quantity x Price',
                          style={'color': C['text_light'], 'fontSize': '0.9rem', 'margin': '0', 'padding': '0 20px 10px'}),
                    dcc.Graph(figure=fig_top_cities_revenue, style={'height': '400px'}, config={'displayModeBar': False})
                ], style={**card, 'overflow': 'hidden'}),

                # Indicator 2
                html.Div([
                    html.H2('Indicator 2: Customer Segmentation',
                           style={'color': C['success'], 'fontSize': '1.5rem', 'fontWeight': '600',
                                  'margin': '0 0 10px', 'padding': '20px 20px 0'}),
                    html.P('MinMax Normalization + K-Means (k=3) with value ranges',
                          style={'color': C['text_light'], 'fontSize': '0.9rem', 'margin': '0', 'padding': '0 20px 10px'}),
                    dcc.Graph(figure=fig_customer_segmentation, style={'height': '400px'}, config={'displayModeBar': False})
                ], style={**card, 'overflow': 'hidden'})
            ], style={'display': 'grid', 'gridTemplateColumns': 'repeat(auto-fit, minmax(500px, 1fr))', 
                     'gap': '30px', 'marginBottom': '30px'}),

            # Row 2: Indicator 3
            html.Div([
                html.H2('Indicator 3: Revenue Forecasting',
                       style={'color': C['info'], 'fontSize': '1.5rem', 'fontWeight': '600', 'marginBottom': '10px'}),
                html.P('Prophet Model - Multiple time scales',
                      style={'color': C['text_light'], 'fontSize': '0.9rem', 'marginBottom': '15px'}),
                html.Div([
                    html.Label('Select Time Period:', style={'fontWeight': '600', 'marginRight': '15px'}),
                    dcc.Dropdown(
                        id='dropdown',
                        options=[
                            {'label': 'All Years', 'value': 'YA'},
                            {'label': 'Yearly: after 2020', 'value': 'Y2020'}, {'label': 'Yearly: after 2019', 'value': 'Y2019'},
                            {'label': 'Yearly: after 2017', 'value': 'Y2017'}, {'label': 'Yearly: after 2015', 'value': 'Y2015'},
                            {'label': 'Yearly: after 2010', 'value': 'Y2010'}, {'label': 'Yearly: after 2000', 'value': 'Y2000'},
                            {'label': 'Yearly: after 1990', 'value': 'Y1990'}, {'label': 'Yearly: after 1980', 'value': 'Y1980'},
                            {'label': 'All Months', 'value': 'MA'},
                            {'label': 'Monthly: after 2020', 'value': 'M2020'}, {'label': 'Monthly: after 2019', 'value': 'M2019'},
                            {'label': 'Monthly: after 2017', 'value': 'M2017'}, {'label': 'Monthly: after 2015', 'value': 'M2015'},
                            {'label': 'Monthly: after 2010', 'value': 'M2010'}, {'label': 'Monthly: after 2000', 'value': 'M2000'},
                            {'label': 'Monthly: after 1990', 'value': 'M1990'}, {'label': 'Monthly: after 1980', 'value': 'M1980'},
                            {'label': 'All Days', 'value': 'DA'},
                            {'label': 'Daily: after 2020', 'value': 'D2020'}, {'label': 'Daily: after 2019', 'value': 'D2019'},
                            {'label': 'Daily: after 2017', 'value': 'D2017'}, {'label': 'Daily: after 2015', 'value': 'D2015'},
                            {'label': 'Daily: after 2010', 'value': 'D2010'},
                        ],
                        value='YA', clearable=False, style={'width': '250px'}
                    )
                ], style={'display': 'flex', 'alignItems': 'center', 'marginBottom': '20px'}),
                dcc.Graph(id='graph', style={'height': '450px'})
            ], style={**card, 'padding': '20px', 'marginBottom': '30px'}),

            # Row 3: Indicator 4
            html.Div([
                html.H2('Indicator 4: Geographic Clustering',
                       style={'color': C['warning'], 'fontSize': '1.5rem', 'fontWeight': '600', 'marginBottom': '10px'}),
                html.P('K-Means Spatial Clustering - 4 activity levels',
                      style={'color': C['text_light'], 'fontSize': '0.9rem', 'marginBottom': '15px'}),
                dcc.Graph(figure=fig_spatial_clustering, style={'height': '500px'}, config={'displayModeBar': False})
            ], style={**card, 'padding': '20px 20px 10px', 'marginBottom': '30px'})

        ], style={'maxWidth': '1400px', 'margin': '0 auto', 'padding': '0 20px 40px'}),

        # Footer
        html.Div([
            html.P('© 2025 Data Science Team | Built with Python, Dash, Plotly & Prophet',
                  style={'textAlign': 'center', 'color': C['text_light'], 'fontSize': '0.85rem', 'margin': '0'})
        ], style={'padding': '20px', 'backgroundColor': C['light'], 'borderTop': f'1px solid {C["border"]}'})

    ], style={'fontFamily': "'Inter', sans-serif", 'backgroundColor': C['bg'], 'minHeight': '100vh', 'margin': '0'})

    @callback(
        Output('graph', 'figure'),
        Input('dropdown', 'value')
    )
    def update_temporal_graph(selected_value):
        index = year_to_index[selected_value]
        if selected_value.startswith("Y"):
            fig_dash = figure_pred_year[index]
        elif selected_value.startswith("M"):
            fig_dash = figure_pred_month[index]
        elif selected_value.startswith("D"):
            fig_dash = figure_pred_day[index]
        
        # Fix the chart layout
        fig_dash.update_xaxes(title_text="")
        fig_dash.update_layout(
            margin=dict(l=60, r=40, t=60, b=40),
            xaxis_title="",
            height=450
        )
        
        return fig_dash

    return app
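
The `year_to_index` dictionary has to be kept in sync with `time_pred` by hand. As a possible alternative, the index can be derived from the dropdown value itself. This is only a sketch: `parse_dropdown_value`, `CUTOFF_YEARS` and `SCALES` are hypothetical names, and the list of cutoff years is assumed to be the same as `time_pred` in `create_indicator`.

```python
# Cutoff years in the same order as time_pred; index 0 stands for "all data"
CUTOFF_YEARS = [1900, 2020, 2019, 2017, 2015, 2010, 2000, 1990, 1980]
SCALES = {'Y': 'year', 'M': 'month', 'D': 'day'}


def parse_dropdown_value(value: str) -> tuple[str, int]:
    """Split a value such as 'M2019' into ('month', 2) (hypothetical helper)."""
    scale = SCALES[value[0]]          # raises KeyError on an unknown prefix
    suffix = value[1:]
    if suffix == 'A':                 # 'YA', 'MA', 'DA' select the full history
        return scale, 0
    return scale, CUTOFF_YEARS.index(int(suffix))


print(parse_dropdown_value('YA'))     # ('year', 0)
print(parse_dropdown_value('M2019'))  # ('month', 2)
print(parse_dropdown_value('D2010'))  # ('day', 5)
```

With such a helper, the callback could pick the figure list from the returned scale and the figure from the returned index, so adding a cutoff year would only require editing one list.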

In [62]:
def main():
    file_path = "invoices.csv"
    
    # Load and clean main data
    df = load_data(file_path)
    df = preprocess_data(df, date_column="invoice_date")
    
    # Attempt to merge with external data
    try:
        # Generate synthetic external data
        df_extra = generate_fake_extra_dataset(n_rows=500)
        
        if not df_extra.empty:
            df = fusion_dataset(df, df_extra)
            print("External dataset merged successfully.")
            print(f"Total rows: {df.shape[0]}")
            
    except Exception as e:
        print(f"Skipping external data merge: {e}")

    # Launch exploration and dashboard
    explore_data(df)
    app = create_dashboard(df) 
    app.run(debug=False)

if __name__ == "__main__":
    main()

Data loaded successfully.
Generated 500 transactions across 269 distinct space locations.
External dataset merged successfully.
Total rows: 10500
Shape (rows, columns): (10500, 14)

Column dtypes:
name                    object
email                   object
product_id               int64
qty                      int64
amount                 float64
invoice_date    datetime64[ns]
city                    object
job                     object
revenue                float64
year                     int32
month                    int32
day                      int32
dayofweek                int32
country                 object
dtype: object

Missing values per column:
name              0
email             0
product_id        0
qty               0
amount            0
invoice_date      0
city              0
job               0
revenue           0
year              0
month             0
day               0
dayofweek         0
country         500
dtype: int64

Basic description of numerical co