# Programming in Data Science - Final Project
## Invoices Dataset Analysis
**Team Members: Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN, Alvaro SERERO**

Dataset: Invoices (Kaggle)

Source: https://www.kaggle.com/datasets/cankatsrc/invoices/data

This dataset includes multiple fields such as customer details (first name, last name, email), transaction information (product ID, quantity, amount, invoice date), and additional attributes like address, city, and stock code.

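### Import all needed libraries for the project:
- Pandas for data manipulation
- Plotly (Express and Graph Objects) for visualizations
- Dash for creating a visual and interactive dashboard interface
- Prophet for revenue forecasting
- scikit-learn for normalization (MinMaxScaler) and clustering (KMeans)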

In [28]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from dash import Dash, dcc, html,Input, Output
from dash import callback
from prophet import Prophet
from prophet.plot import plot_plotly
import logging

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

## 1) Data collection and exploration

### Function to safely load CSV data from a file path

In [29]:
def load_data(file_path: str) -> pd.DataFrame:
    """
    Function to load CSV data from a given file path safely.

    Input: 
    ------
    file_path => String, path to the CSV file

    Output: 
    ------
    dataset => pd.DataFrame containing the loaded data
    """
    try:
        df = pd.read_csv(file_path)
        print("Data loaded successfully.")
        return df
    
    # in case an error is generated when we try to read the file
    except Exception as e:
        print(f"Error loading data: {e}")
        return pd.DataFrame()

### Function to process invalid first_name and last_name columns.
- In the initial dataset, the "first_name" and "last_name" columns each contain a full name (first name plus last name). This does not make sense: a transaction is made by a single person, and each column should contain exactly what its name describes.
- For example, the first row is structured as follows, which is a mistake that needs to be corrected.

first_name | last_name  
Carmen Nixon | Todd Anderson

In [30]:
def name_treatment(dataset: pd.DataFrame, options: str="separate") -> pd.DataFrame:
    """
    Function to treat first_name and last_name columns in a dataset.

    Only use this function if both the first_name and last_name columns already
    contain a full name (first + last name).
    This function does not merge the first_name and last_name columns into a single name column.

    
    Input:
    ---------
    - dataset => Pandas DataFrame, dataset must have first_name and last_name columns
    - options => String, options for treatment between "separate", "first" and "last"
        - "separate" (default): create two rows per original row, one for each name, 
        - "first": keep only the first_name renamed as name, 
        - "last" : keep only the last_name renamed as name
    
    Output:
    ---------
    - dataset => Pandas DataFrame after treating first_name and last_name columns
    """
    # We first check that both columns 'first_name' and 'last_name' are present in the dataset
    if "first_name" in dataset and "last_name" in dataset:
        if options == "separate":
            value = dataset.columns.difference(['first_name','last_name']).tolist()
            # Create a new dataset that duplicate every line in dataset, and create one line with
            # a column named 'name' with the value  of 'first_name' and the second line with
            # a column named 'name' with the value  of 'last_name'
            new_dataset = pd.melt(dataset, id_vars=value,              
                              value_vars=['first_name', 'last_name'],
                              value_name='name')
            
            # Delete variable column and put the newly created 'name' column as the first column of the dataset
            column_to_keep = [col for col in new_dataset.columns if col not in ["name", "variable"]]
            new_order = ["name"] + column_to_keep
            new_dataset = new_dataset[new_order]

        elif options == "first":
            # Delete the 'last_name' column
            new_dataset = dataset.drop(columns=['last_name'])
            # Rename 'first_name' column into 'name'
            new_dataset.rename(columns={'first_name': 'name'}, inplace=True)

        elif options == "last":
            # Delete the 'first_name' column
            new_dataset = dataset.drop(columns=['first_name'])
            # Rename 'last_name' column into 'name'
            new_dataset.rename(columns={'last_name': 'name'}, inplace=True)
        else:
            print(f"'{options}' is not a valid value for options; please use 'separate', 'first' or 'last'.")
            return dataset
        return new_dataset
    else:
        return dataset
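
To make the "separate" option easier to picture, here is a minimal sketch of what `pd.melt` does on a toy frame. The second row and the `city` values are made up for illustration; only the first pair of names comes from the dataset.

```python
import pandas as pd

# Toy frame mimicking the malformed columns (second row and cities are hypothetical)
toy = pd.DataFrame({
    "first_name": ["Carmen Nixon", "Ana Lopez"],
    "last_name": ["Todd Anderson", "Ben Carter"],
    "city": ["Paris", "Lyon"],
})

# Every column except the two name columns is carried over unchanged
id_cols = toy.columns.difference(["first_name", "last_name"]).tolist()

# melt stacks first_name and last_name into a single 'name' column,
# so each original row appears twice (once per name)
melted = pd.melt(
    toy,
    id_vars=id_cols,
    value_vars=["first_name", "last_name"],
    value_name="name",
)

# Drop the helper 'variable' column and put 'name' first
melted = melted[["name"] + id_cols]
print(melted)
```

Note that this doubles the number of rows, so any later sum of revenue over the melted frame counts each transaction twice.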

### Function to parse invoice dates:
- Convert "invoice_date" column to datetime for future temporal manipulations.
- Extracts year, month, day, and day of week features.

In [48]:
def parse_dates(df: pd.DataFrame,date_column: str, date_format: str="%d/%m/%Y") -> pd.DataFrame:
    """
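    Converts the column named by date_column from string to datetime.
    Extracts year, month, day, and day of week features.

    Input:
    ------
    - df (DataFrame) - dataset containing the date column
    - date_column (str) - name of the column holding the dates
    - date_format (str) - format of the date strings, one of '%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d', '%d/%m/%Y' (default '%d/%m/%Y')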

    Output:
    ------
    - df (DataFrame) - dataset with parsed datetime features
    """
    # We first check that the name of the column in the variable <date_column> is in the dataset
    if date_column not in df.columns:
        print(f"Column '{date_column}' not found in DataFrame.")
        return df
    
    format_date= ['%Y-%m-%d','%d-%m-%Y','%Y/%m/%d','%d/%m/%Y']
    if date_format in format_date:
        df = df.copy()
        # Convert the date column to datetime; values that cannot be parsed become NaT
        df[date_column] = pd.to_datetime(df[date_column], format=date_format, errors='coerce')
        # A datetime64 column has no display format, so no re-formatting step is needed here
        # Create the year, month, day and dayofweek columns from the parsed date
        df['year'] = df[date_column].dt.year
        df['month'] = df[date_column].dt.month
        df['day'] = df[date_column].dt.day
        df['dayofweek'] = df[date_column].dt.dayofweek
    else:
        print("Wrong date format type")
    return df
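
As a small illustration of the parsing step, the sketch below shows how `errors='coerce'` turns an impossible date into `NaT` instead of raising, and how the `.dt` accessor yields the calendar features. The three sample dates are invented for the example.

```python
import pandas as pd

# Hypothetical sample: the second value is not a real calendar date
dates = pd.Series(["25/12/2021", "31/02/2021", "01/03/2022"])

# Invalid dates become NaT rather than raising an exception
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

features = pd.DataFrame({
    "date": parsed,
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "dayofweek": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(features)
```

When a column contains `NaT`, the extracted features are stored as floats with `NaN`, so it can be worth checking `parsed.isna().sum()` before relying on integer years.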

### Convert all string columns of the dataset by stripping whitespace.

In [32]:
def convert_string_columns(df: pd.DataFrame) -> None:
    """
    String manipulation: Strip whitespace from object columns

    Input:
    -------
    - df => Pandas DataFrame to be processed
    Output:
    ------- 
    None (the function modifies the DataFrame in place)
    """
    # Put every 'object' column in string_cols variable
    string_cols = df.select_dtypes(include=['object']).columns
    # Strip whitespace in every column in string_col
    for col in string_cols:
        df[col] = df[col].str.strip()

### Function to preprocess the initial loaded dataset: combines all the previous functions and returns a clean dataset.

In [33]:
def preprocess_data(df: pd.DataFrame,date_column: str,date_format: str="%d/%m/%Y", name_options: str ="none") -> pd.DataFrame:
    """
    Function to preprocess the initial loaded dataset:
    - Adds a "revenue" column derived from the "qty" and "amount" columns.
    - Optionally creates a "name" column using the name_treatment function, correcting the first_name and last_name columns.
    - Converts the date column to datetime for future temporal manipulations.
    - Strips whitespace from strings using the convert_string_columns function. 

    Input:
    ---------
    - df => Pandas DataFrame to be preprocessed
    - date_column => Name of the column containing the date
    - [Optional] date_format (String) => format of the date strings, passed to parse_dates (default "%d/%m/%Y")
    - [Optional] name_options (String) => options for the name_treatment function. Possible choices: "none" (default), "separate", "first" and "last"

    Output:
    ---------
    - df => Preprocessed Pandas DataFrame
    """
    if 'qty' in df.columns and 'amount' in df.columns:
        # Create 'revenue' column as product of 'qty' and 'amount'
        df['revenue'] = df['qty'] * df['amount']

    if name_options != "none":
        df = name_treatment(df,name_options)
    df = parse_dates(df,date_column,date_format)
    convert_string_columns(df)

    return df

### Function for data exploration: displaying basic information on our dataset.
We can see that there is no missing or NaN data since all columns have 10000 non-null rows.

In [34]:
def explore_data(df: pd.DataFrame) -> None:
    """
    Prints key exploratory information: 
    - dataset shape (rows, columns)
    - column data types
    - missing values per column
    - description of columns
    - correlation matrix between numerical columns

    Input:
    ---------
    - df => Pandas DataFrame to be explored

    Output:
    ---------
    None (prints information to console)
    """
    print("Shape (rows, columns):", df.shape)

    print("\nColumn dtypes:")
    print(df.dtypes)

    print("\nMissing values per column:")
    print(df.isna().sum())

    print("\nBasic description of numerical columns:")
    print(df.describe())

    # Correlation matrix for numeric variables
    if 'qty' in df.columns and 'amount' in df.columns and 'revenue' in df.columns:
        print("\nCorrelation matrix (numeric columns):")
        print(df[['qty', 'amount', 'revenue']].corr())

### Testing data collection, preprocessing and exploration on the Invoices dataset.

In [49]:
df = load_data('invoices.csv')
df = preprocess_data(df,date_column="invoice_date")
explore_data(df)

Data loaded successfully.
Shape (rows, columns): (10000, 16)

Column dtypes:
first_name              object
last_name               object
email                   object
product_id               int64
qty                      int64
amount                 float64
invoice_date    datetime64[ns]
address                 object
city                    object
stock_code               int64
job                     object
revenue                float64
year                     int32
month                    int32
day                      int32
dayofweek                int32
dtype: object

Missing values per column:
first_name      0
last_name       0
email           0
product_id      0
qty             0
amount          0
invoice_date    0
address         0
city            0
stock_code      0
job             0
revenue         0
year            0
month           0
day             0
dayofweek       0
dtype: int64

Basic description of numerical columns:
         product_id           qty        am

## 2) Querying the dataset

### Indicator 1: Grouping query (top cities by total revenue)
- Revenue = Quantity * Amount --> this is the total amount of a single transaction.
- Identifies the most profitable geographic locations by aggregating total revenue by city.
- Could potentially be used for business (targeted marketing, logistics, etc.).

In [36]:
def indicator_top_group(df: pd.DataFrame,revenue_column: str,groupBy_column: str, n: int = 10) -> pd.DataFrame:
    """
    Compute the top N groups by total revenue.

    This function groups rows by 'groupBy_column', sums the 'revenue_column' values and returns the top `n` groups ordered by total revenue, descending.
    
    Input:
    --------
    - df: invoices dataframe
    - revenue_column: name of the column in the dataset containing the revenue
    - groupBy_column : name of the column in the dataset to group by
    - n: number of groups to return (default 10)
    
    Output:
    --------
    - DataFrame with columns [groupBy_column, 'total_revenue', 'transactions_count']
    """
    # Ensure revenue column exists; otherwise derive it from 'qty' and 'amount'
    if revenue_column not in df.columns:
        # Both 'qty' and 'amount' are required to compute the revenue
        if 'qty' not in df.columns or 'amount' not in df.columns:
            return df
        df = df.copy()
        df['revenue'] = df['qty'] * df['amount']
        revenue_column = 'revenue'
    
    # Return the total revenue and the number of transactions for each 'groupBy_column' value
    if groupBy_column in df.columns:
        city_rev = (
            df.groupby(groupBy_column)
            .agg(
                total_revenue=(revenue_column, 'sum'),
                transactions_count=(groupBy_column, 'count')
            )
            .reset_index()
            .sort_values('total_revenue', ascending=False)
            .head(n)
        )
        return city_rev
    return df
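
The grouping logic can be tried on a small frame first. This is a minimal sketch with invented cities and revenues; it counts transactions on the `revenue` column rather than on the grouping column, which is an assumption made to keep the example simple.

```python
import pandas as pd

# Hypothetical transactions
sales = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
    "revenue": [100.0, 50.0, 25.0, 10.0, 70.0, 5.0],
})

top = (
    sales.groupby("city")
    .agg(
        total_revenue=("revenue", "sum"),        # sum of revenue per city
        transactions_count=("revenue", "count"),  # number of rows per city
    )
    .reset_index()
    .sort_values("total_revenue", ascending=False)
    .head(2)                                      # keep the top 2 cities
)
print(top)
```

Named aggregation keeps the output column names explicit, which makes the later Plotly calls (`y='total_revenue'`, `color='transactions_count'`) easier to follow.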

### Indicator 2: Data transformation (revenue normalization by city)
- Apply min‑max normalization or z‑score to city revenue to compare cities independently of absolute scale.
- Min-Max normalization using the formula:
    $$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
    where x is the original revenue and min(x), max(x) are the minimum and maximum revenues across cities.

- Z-Score normalization using the formula:
    $$z = \frac{x - \mu}{\sigma}$$
    where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$.
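
As a quick worked example of both formulas, the sketch below normalizes three made-up city revenues using only the standard library. It uses the population standard deviation, which corresponds to `ddof=0` in pandas.

```python
from statistics import mean, pstdev

# Hypothetical total revenues for three cities
revenues = [100.0, 200.0, 400.0]

# Min-max: map the smallest value to 0 and the largest to 1
lo, hi = min(revenues), max(revenues)
minmax = [(x - lo) / (hi - lo) for x in revenues]

# Z-score: center on the mean and scale by the population standard deviation
mu, sigma = mean(revenues), pstdev(revenues)
zscores = [(x - mu) / sigma for x in revenues]

print(minmax)
print(zscores)
```

Min-max values always fall in [0, 1], while z-scores are centered on 0 and are not bounded, so the two columns answer slightly different questions about how a city compares to the others.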

In [37]:
def normalize_city_revenue(city_rev: pd.DataFrame, method: str = 'min-max') -> pd.DataFrame:
    """
    Normalize total_revenue column using specified method.
    
    Input: 
    -------
    - city_rev: DataFrame with 'city' and 'total_revenue'
    - method: normalization method, either 'min-max', 'z-score' or 'both' (default 'min-max');
      the spellings 'minmax' and 'zscore' are also accepted
    
    Output: 
    ------
    - copy of the DataFrame with extra column 'revenue_minmax' and/or 'revenue_zscore'
    """
    city_rev = city_rev.copy()
    # Accept both hyphenated and non-hyphenated spellings of the method name
    method = method.replace("-", "").lower()

    # Normalize the dataset with min-max normalization ( min_max = (x - min(x)) / (max(x)- min(x)))
    if method in ("minmax", "both"):
        min_val = city_rev["total_revenue"].min()
        max_val = city_rev["total_revenue"].max()
        if max_val != min_val:
            city_rev["revenue_minmax"] = (
                (city_rev["total_revenue"] - min_val) / (max_val - min_val)
            )
        else:
            # All revenues are equal: avoid dividing by zero
            city_rev["revenue_minmax"] = 0.0

    # Normalize the dataset with z-score normalization ( z = (x - mean(x)) / std(x))
    if method in ("zscore", "both"):
        mean_val = city_rev["total_revenue"].mean()
        std_val = city_rev["total_revenue"].std(ddof=0)
        if std_val != 0:
            city_rev["revenue_zscore"] = (
                (city_rev["total_revenue"] - mean_val) / std_val
            )
        else:
            city_rev["revenue_zscore"] = 0

    return city_rev

### Function to discretize city revenue into quantile-based revenue segments (for example low, medium, high when q = 3).

In [38]:
def discretize_city_revenue(city_rev: pd.DataFrame, q: int = 3) -> pd.DataFrame:
    """
    Discretize city revenue into q quantile-based categories.

    Input:
    --------
    - city_rev: DataFrame with 'total_revenue'
    - q: number of bins
    
    Output:
    -------
    - DataFrame with extra column 'revenue_segment'
    """
    city_rev['revenue_segment'] = pd.qcut(
        city_rev['total_revenue'],
        q=q,
        labels=[f"Segment_{i+1}" for i in range(q)]
    )
    return city_rev
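
The quantile-based binning can be checked on a tiny example. The sketch below uses six invented cities with evenly spaced revenues, so each of the three segments receives exactly two cities.

```python
import pandas as pd

# Hypothetical city revenues
city_rev = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F"],
    "total_revenue": [10, 20, 30, 40, 50, 60],
})

q = 3
# qcut picks bin edges at the quantiles, so bins hold roughly equal counts
city_rev["revenue_segment"] = pd.qcut(
    city_rev["total_revenue"],
    q=q,
    labels=[f"Segment_{i+1}" for i in range(q)],
)
print(city_rev)
```

If many cities share the same revenue, the quantile edges can coincide and `pd.qcut` raises a `ValueError` by default; passing `duplicates='drop'` avoids the error but then yields fewer bins than labels, so the labels would need to be adjusted as well.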

### Indicator 2 (Version 2): Data Transformation - Customer Segmentation
This indicator applies MinMax Normalization to standardize features, then uses K-Means clustering to segment customers.

This helps identify high-value customers for targeted marketing and retention strategies.

In [39]:
def customer_segmentation(df: pd.DataFrame) -> pd.DataFrame:
    """
    Applies MinMax normalization and K-Means clustering for segmentation.
    Segments customers into Low, Medium, and High value groups based on
    spending patterns and transaction frequency.

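    Input:
    --------
    - df: invoices dataframe; must contain a 'name' column (created by name_treatment) and an 'email' column
    Output:
    --------
    - DataFrame with one row per customer, including normalized features, 'segment' and 'segment_label'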
    """
    # Ensure revenue column exists
    df = df.copy()
    if 'revenue' not in df.columns:
        df['revenue'] = df.get('qty', 0) * df.get('amount', 0)

    # Aggregate customer-level metrics
    customer_profile = df.groupby(['name', 'email']).agg({
        'revenue': ['sum', 'mean', 'count'],
        'qty': 'sum'
    }).reset_index()

    customer_profile.columns = ['name', 'email',
                                 'total_spent', 'avg_transaction',
                                 'num_transactions', 'total_quantity']
    
    # Apply MinMax Normalization to features
    features = ['total_spent', 'avg_transaction', 'num_transactions', 'total_quantity']
    scaler = MinMaxScaler()
    customer_profile[[f'{f}_norm' for f in features]] = scaler.fit_transform(customer_profile[features])

    # Basic safety checks before clustering
    if customer_profile.shape[0] < 3:
        # Not enough samples for 3 clusters: we skip clustering and return profile
        customer_profile['segment'] = 0
        customer_profile['segment_label'] = 'Single/Small'
        return customer_profile

    # K-Means Clustering (3 clusters)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    customer_profile['segment'] = kmeans.fit_predict(customer_profile[['total_spent_norm', 'num_transactions_norm']])

    # Label segments based on spending levels
    segment_means = customer_profile.groupby('segment')['total_spent'].mean().sort_values()
    segment_mapping = {
        segment_means.index[0]: 'Low Value',
        segment_means.index[1]: 'Medium Value',
        segment_means.index[2]: 'High Value'
    }
    customer_profile['segment_label'] = customer_profile['segment'].map(segment_mapping)

    print("Segment Distribution:")
    print(customer_profile['segment_label'].value_counts())
    print("\nSegment Characteristics:")
    print(customer_profile.groupby('segment_label')[['total_spent', 'num_transactions']].mean())

    return customer_profile

### Indicator 3: Temporal Analysis - Revenue Forecasting
Function for temporal prediction of the revenue.

In [40]:
def temporal_prediction(df: pd.DataFrame, time: str="year",periods: int = 10):
    """
    Use Prophet model to make a temporal prediction of the revenue.

    Inputs:
    ---------
    - df (DataFrame): Input dataset
    - [Optional] time (str): options for the prediction between "year", "month" and "day".  
        - "year" (default): use the year column of the dataset to make the prediction.  
        - "month": use the month column of the dataset to make the prediction. 
        - "day": use the invoice_date column containing the full date to make the prediction
    - [Optional] periods (int): Number of future periods (in units of `time`) to forecast. Default 10.

    Outputs: 
    --------
    - new_df (DataFrame) - Contains 'time', 'original_revenue' and 'predicted_revenue'.
    - model (Prophet) - Prophet model trained on the dataset and used for the prediction
    - prediction (DataFrame) - Future prediction made by the model
    """
    dataset = df.copy()
    if time == "year":
        # Should the conversion to datetime be moved elsewhere?
        dataset['year'] = pd.to_datetime(dataset['year'], format='%Y')

        # Set the frequency of future dates to yearly and group the revenue by year
        freq = 'YE'
        dataset = dataset.groupby('year')['revenue'].sum().reset_index()
        # Rename the column to the name that the Prophet model will need
        dataset.rename(columns={'year': 'ds','revenue': 'y'}, inplace=True)

    elif time == "month":
        # Should the conversion to datetime be moved elsewhere?
        dataset['month'] = dataset['year'].astype(str) + '-' + dataset['month'].astype(str).str.zfill(2)
        dataset['month'] = pd.to_datetime(dataset['month'], format='%Y-%m')

        # Set the frequency of future dates to monthly and group the revenue by month
        freq = 'ME'
        dataset = dataset.groupby('month')['revenue'].sum().reset_index()
        # Rename the column to the name that the Prophet model will need
        dataset.rename(columns={'month': 'ds','revenue': 'y'}, inplace=True)
    elif time == "day":

        # Set the frequency of future dates to daily and group the revenue by day
        freq = 'D'
        dataset = dataset.groupby('invoice_date')['revenue'].sum().reset_index()
        # Rename the column to the name that the Prophet model will need
        dataset.rename(columns={'invoice_date': 'ds','revenue': 'y'}, inplace=True)
    else:
        print("Error: the 'time' option must be 'year', 'month' or 'day'.")
        return df,None,None

    # Set the cmdstanpy log level to ERROR to silence Prophet's output when it runs without trouble
    logging.getLogger('cmdstanpy').setLevel(logging.ERROR)

    # Create a Prophet model to make prediction
    model = Prophet()
    model.fit(dataset)

    # get the date to predict and then use the predict function on these date to obtain the prediction
    future_dates = model.make_future_dataframe(periods=periods, freq=freq)
    prediction = model.predict(future_dates)

    # Create a dataset with both original value and predicted value
    # Then rename the new dataset columns with original_revenue and predicted_revenue
    new_df = pd.merge(dataset[['ds', 'y']], prediction[['ds', 'yhat']], on='ds', how='outer')
    new_df.rename(columns={'y': 'original_revenue','yhat': 'predicted_revenue', 'ds': 'time'}, inplace=True)
    
    return new_df, model,prediction
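
The final merge step can be illustrated without running Prophet. In this minimal sketch, `observed` and `forecast` are hand-written stand-ins for the aggregated history and the model output (the numbers are invented); the outer merge keeps the future dates, for which no observed revenue exists.

```python
import pandas as pd

# Hypothetical aggregated history (Prophet naming: ds = date, y = value)
observed = pd.DataFrame({
    "ds": pd.to_datetime(["2021-01-31", "2021-02-28"]),
    "y": [10.0, 12.0],
})

# Hypothetical forecast covering the history plus one future period
forecast = pd.DataFrame({
    "ds": pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"]),
    "yhat": [9.5, 12.5, 13.0],
})

# An outer merge keeps every date from both frames
merged = pd.merge(observed, forecast, on="ds", how="outer").rename(
    columns={"y": "original_revenue", "yhat": "predicted_revenue", "ds": "time"}
)
print(merged)
```

Rows that exist only in the forecast have `NaN` in `original_revenue`, which is what lets a plot show the observed line stopping where the prediction continues.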

    

Function to create a visualization based on a temporal prediction

In [41]:
def display_temporal_prediction(df: pd.DataFrame, model,prediction,options: str = "prophet"):
    """
    Create a visualization of a dataset with a temporal prediction,
    either from the dataset or from the prediction model.

    Inputs:
    ---------  
    - df (DataFrame): Input dataset.
    - model (Prophet): Prophet model trained on the dataset and used for the prediction.
    - prediction (DataFrame): Future prediction made by the model.
    - [Optional] options (str): options for the visualization between "ploty" and "prophet". 
        - "prophet" (default): use Prophet's plot_plotly function to plot the prediction.   
        - "ploty": use Plotly to plot the prediction.  

    Outputs: 
    --------
    - fig (Figure) - A figure containing the temporal visualization.
    """

    # Default result, returned when the requested figure cannot be built
    fig = None

    # Create a visualization using the Plotly Express library
    if options=="ploty":
        if 'predicted_revenue' in df.columns and 'original_revenue' in df.columns:
            # Use the 'time' column for the x-axis when available, otherwise fall back to the index
            x_values = df["time"] if "time" in df.columns else df.index
            fig = px.area()
            fig.add_scatter(x=x_values, y=df["original_revenue"], mode='lines', line=dict(color='blue'), name="original")
            fig.add_scatter(x=x_values, y=df["predicted_revenue"], mode='lines', line=dict(color='green'), name="prediction")
            fig.update_layout(title="Prediction", xaxis_title="Date", yaxis_title="Revenue")
        else:
            print("Error, the prediction was not found in the dataset")

    # Use plot_plotly from Prophet to get a visualization
    elif options == "prophet":
        if model is not None:
            fig = plot_plotly(model, prediction)
        else:
            print("Error, the prediction model was not found")
    else:
        print(f"'{options}' is not a valid option, please use 'ploty' or 'prophet'")
    return fig

### Indicator 4: Spatial Analysis - Geographic Clustering
This indicator applies K-Means clustering to group cities into activity levels based on revenue patterns (revenue = qty × amount).

Function to analyze geographic distribution. 

In [42]:
def analyze_geographic_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyzes spatial distribution of transactions across cities.
    Calculates revenue metrics and transaction patterns by location.

    Input:
    ---------
    - df => Pandas DataFrame with city and revenue information

    Output:
    ---------
    - city_stats => DataFrame with city-level statistics
    """
    # Check if revenue column exists
    if 'revenue' not in df.columns:
        if 'qty' in df.columns and 'amount' in df.columns:
            df['revenue'] = df['qty'] * df['amount']
            print("Revenue column not found, creating it now...")
        else:
            raise ValueError("Cannot calculate revenue: missing 'revenue' or 'qty'/'amount' columns")

    city_stats = df.groupby('city').agg({
        'revenue': ['sum', 'mean', 'std'],
        'qty': 'sum',
        'product_id': 'count'
    }).reset_index()

    city_stats.columns = ['city', 'total_revenue', 'avg_revenue', 'std_revenue',
                          'total_quantity', 'transaction_count']

    city_stats['revenue_per_transaction'] = (
        city_stats['total_revenue'] / city_stats['transaction_count'])

    city_stats = city_stats.sort_values('total_revenue', ascending=False)

    return city_stats

Function for spatial clustering, grouping cities by revenue patterns using KMeans clustering.

In [43]:
def spatial_clustering(city_stats: pd.DataFrame, n_clusters: int = 4) -> pd.DataFrame:
    """
    Applies K-Means clustering to group cities by revenue patterns.
    Uses normalized features: total revenue, transaction count, and
    average order value. Identifies geographic market segments.

    Input:
    ---------
    - city_stats => DataFrame with city-level statistics
    - n_clusters => Integer, number of clusters (default: 4); the activity labels assigned below assume exactly 4 clusters

    Output:
    ---------
    - city_stats => DataFrame with cluster labels added
    """
    features = ['total_revenue', 'transaction_count', 'revenue_per_transaction']

    # Normalize features
    scaler = MinMaxScaler()
    city_stats_norm = city_stats.copy()
    city_stats_norm[features] = scaler.fit_transform(city_stats[features])

    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    city_stats['cluster'] = kmeans.fit_predict(city_stats_norm[features])

    # Label clusters by activity level
    cluster_means = city_stats.groupby('cluster')['total_revenue'].mean().sort_values()
    cluster_labels = {
        cluster_means.index[0]: 'Low Activity',
        cluster_means.index[1]: 'Medium-Low Activity',
        cluster_means.index[2]: 'Medium-High Activity',
        cluster_means.index[3]: 'High Activity'
    }
    city_stats['cluster_label'] = city_stats['cluster'].map(cluster_labels)
    
    print("Cluster Distribution:")
    print(city_stats['cluster_label'].value_counts())
    print("\nCluster Characteristics:")
    print(city_stats.groupby('cluster_label')[['total_revenue', 'transaction_count']].mean())

    return city_stats
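
Because the label dictionary above lists exactly four entries, other values of `n_clusters` would either fail or leave clusters unlabeled. One possible way to generalize, sketched here under the assumption that generic rank-based labels are acceptable, is to derive the labels from the sorted cluster means. The frame, the label wording, and the cluster ids are all made up, and no K-Means run is involved.

```python
import pandas as pd

# Hypothetical city statistics with cluster ids already assigned
stats = pd.DataFrame({
    "cluster": [2, 0, 1, 0, 2],
    "total_revenue": [500.0, 10.0, 100.0, 30.0, 700.0],
})

# Order clusters from lowest to highest mean revenue
means = stats.groupby("cluster")["total_revenue"].mean().sort_values()

# Build one label per cluster from its rank, whatever the number of clusters
rank_labels = {
    cluster_id: f"Activity level {rank + 1} of {len(means)}"
    for rank, cluster_id in enumerate(means.index)
}
stats["cluster_label"] = stats["cluster"].map(rank_labels)
print(stats)
```

With this approach the color mapping used in the scatter plot would also have to be built from the generated labels rather than from the four fixed names.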

In [44]:
def fusion_dataset(df: pd.DataFrame, df_extra: pd.DataFrame)->pd.DataFrame:
    '''
    Create a new dataset combining information from both datasets.
    This function was written for two datasets:
    - https://www.kaggle.com/datasets/cankatsrc/invoices
    - https://www.kaggle.com/datasets/ghassenkhaled/invoices-data

    Input:
    -------
    - df: pd.DataFrame, first dataset
    - df_extra: pd.DataFrame, second dataset

    Output:
    -------
    - pd.DataFrame, new dataset
    '''
    df_base = df.copy()         # Make sure the base dataset is not modified
    df_base['country'] = 'mars' # fictional country because the dataset is randomly generated
    df_base = df_base[df_base['year']>2021] # keep only recent years because the extra dataset contains recent data

    df_extra = df_extra.drop(columns=['id_invoice','total','discount','tax','invoiceStatus','dueDate'])
    df_base = df_base.drop(columns=['address','amount','email','product_id','qty','stock_code'])
    df_extra.rename(columns={'issuedDate': 'invoice_date', 'service': 'job',
                             'balance': 'revenue', 'client': 'name'}, inplace=True)
    
    
    df_new = pd.concat([df_base, df_extra], ignore_index=True)
    df_new = df_new.drop_duplicates()
    return df_new

## 3) Dash visualization

In [45]:
def create_indicator(df: pd.DataFrame):
    """
    Create visualizations of the indicators computed on the dataset.

    Input:
    ---------
    - df => Pandas DataFrame containing our cleaned dataset

    Output:
    ---------
    - fig_top_cities_revenue => Visualization of the N (= 10) cities with the highest revenue
    - fig_customer_segmentation => Visualization of the customer segmentation into 3 classes
    - figure_pred_year => List of visualizations of the revenue prediction based on yearly revenue (one per entry in time_pred)
    - figure_pred_month => List of visualizations of the revenue prediction based on monthly revenue (one per entry in time_pred)
    - figure_pred_day => List of visualizations of the revenue prediction based on daily revenue (one per entry in time_pred)
    - fig_spatial_clustering => Visualization of the city clusters by revenue and activity

    """
                                        ### Indicator 1

    # Creating the indicator
    city_stats = indicator_top_group(df,revenue_column="revenue",groupBy_column="city", n=10)

    # Creating the visualization
    fig_top_cities_revenue = px.bar(
        city_stats,
        x='city',
        y='total_revenue',
        hover_data={'transactions_count': True, 'total_revenue': ':.2f'},
        title='Top Cities by Revenue and Transactions',
        labels={'total_revenue': 'Total Revenue ($)', 'city': 'City Name'},
        color='transactions_count',
        color_continuous_scale='Blues'
    )
    fig_top_cities_revenue.update_layout(
        xaxis_tickangle=-45,
        height=400
    )

                                        ### Indicator 2
    
    # Creating the indicator
    customer_segments = customer_segmentation(df)

    # Creating the visualization
    fig_customer_segmentation = go.Figure([
        go.Pie(
            labels=customer_segments['segment_label'].value_counts().index,
            values=customer_segments['segment_label'].value_counts().values,
            hole=0.4,
            marker=dict(colors=['red', 'orange', 'green'])
        )
    ]).update_layout(
        title='Customer Distribution by Segment (Low, Medium or High Value)',
        height=400
    )

                                        ### Indicator 3

    ## To change the data used for the prediction, edit time_pred:
    ## each value is a year, and only rows with a year strictly greater than it are kept.
    ## If you change time_pred and use the create_dashboard function,
    ## also update the years in year_to_index and the labels in the Dropdown
    time_pred = [1900,2020,2019,2017,2015,2010,2000,1990,1980]

    ## To change how far ahead the model predicts, change the prediction_lenght_* variables;
    ## each must be a positive integer
    prediction_lenght_year = 10
    prediction_lenght_month = 24
    prediction_lenght_day = 120

    figure_pred_year = []
    figure_pred_month = []
    figure_pred_day = []

    # Make the prediction using yearly, monthly and Daily data for every date in time_pred
    for i in time_pred:
        dataset = df.copy()
        dataset = dataset[dataset["year"]>i]
        data_y, model_y,predictions_y = temporal_prediction(dataset,time="year", periods=prediction_lenght_year)
        figure_pred_year.append(display_temporal_prediction(data_y,model_y,predictions_y))
        data_m,model_m,predictions_m = temporal_prediction(dataset,time="month",periods=prediction_lenght_month)
        figure_pred_month.append(display_temporal_prediction(data_m,model_m,predictions_m))
        data_d,model_d,predictions_d = temporal_prediction(dataset,time="day",periods=prediction_lenght_day)
        figure_pred_day.append(display_temporal_prediction(data_d,model_d,predictions_d))

                                        ### Indicator 4

    # Creating the indicator
    city_clusters = spatial_clustering(analyze_geographic_distribution(df))
    
    # Creating the visualization
    fig_spatial_clustering = px.scatter(
        city_clusters.head(3000),
        x='transaction_count',
        y='total_revenue',
        color='cluster_label',
        size='revenue_per_transaction',
        hover_data=['city'],
        title='City Clustering by Revenue and Activity',
        labels={
            'transaction_count': 'Number of Transactions',
            'total_revenue': 'Total Revenue ($)',
            'cluster_label': 'Activity Level'
        },
        color_discrete_map={
            'Low Activity': 'red',
            'Medium-Low Activity': 'orange',
            'Medium-High Activity': 'blue',
            'High Activity': 'green'
        }
    ).update_layout(height=400)

    return fig_top_cities_revenue,fig_customer_segmentation,figure_pred_year,figure_pred_month,figure_pred_day,fig_spatial_clustering

In [46]:
def create_dashboard(df: pd.DataFrame, df_augmented: pd.DataFrame | None = None) -> Dash:
    
    # Getting the visualization for base dataset
    fig_top_cities_revenue,fig_customer_segmentation,figure_pred_year,figure_pred_month,figure_pred_day,fig_spatial_clustering = create_indicator(df)

    # NOTE: df_augmented is accepted but not used by the dashboard yet; this branch is a placeholder
    if df_augmented is not None:
        pass
    # Dictionary mapping the dropdown value of indicator 3
    # to an index that can be used in the update_temporal_graph function.
    ## It must stay in sync with the Dropdown options in the dashboard and with time_pred in create_indicator
    year_to_index = {
    "YA": 0, "MA": 0, "DA": 0,    # 1900 (All)
    "Y2020": 1, "M2020": 1, "D2020": 1,
    "Y2019": 2, "M2019": 2, "D2019": 2,
    "Y2017": 3, "M2017": 3, "D2017": 3,
    "Y2015": 4, "M2015": 4, "D2015": 4,
    "Y2010": 5, "M2010": 5, "D2010": 5,
    "Y2000": 6, "M2000": 6, "D2000": 6,
    "Y1990": 7, "M1990": 7, "D1990": 7,
    "Y1980": 8, "M1980": 8, "D1980": 8}

    # Initialize the Dash app
    app = Dash(__name__)

    app.layout = html.Div([
        # Header
        html.Div([
            html.H1('Invoices Data Analysis Dashboard', 
                    style={'textAlign': 'center', 'color': '#2c3e50', 'marginBottom': 10}),
            html.H3('Programming for Data Science - Final Project', 
                    style={'textAlign': 'center', 'color': '#34495e', 'marginBottom': 5}),
            html.P('Team Members: Alvaro SERERO, Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN',
                style={'textAlign': 'center', 'color': '#7f8c8d', 'fontSize': 14}),
            html.P('Dataset: Invoices (Kaggle) - 10,000 online store transactions',
                style={'textAlign': 'center', 'color': '#7f8c8d', 'fontSize': 14, 'marginBottom': 20}),
            html.Div([
                html.P('Project Objective: Extract and visualize four key business indicators from invoice data ' +
                    'to support data-driven decision-making in e-commerce operations.',
                    style={'textAlign': 'center', 'color': '#2c3e50', 'fontSize': 15, 
                            'maxWidth': '900px', 'margin': '0 auto', 'padding': '15px',
                            'backgroundColor': '#ecf0f1', 'borderRadius': '8px'})
            ])
        ], style={'backgroundColor': '#f8f9fa', 'padding': '20px', 'marginBottom': '30px',
                'borderRadius': '10px', 'boxShadow': '0 2px 4px rgba(0,0,0,0.1)'}),

        # Dashboard content
        html.Div([
            # Row 1: Indicators 1 and 2
            html.Div([
                # Indicator 1: Top Cities by Revenue and Transactions
                html.Div([
                    html.H3('Indicator 1: Top Cities by Total Revenue', 
                            style={'color': '#2980b9', 'marginBottom': 15}),
                    html.P('Grouping Query - Aggregates total revenue and transaction count per city',
                        style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 15}),
                    dcc.Graph(
                        figure=fig_top_cities_revenue
                    )
                ], style={'width': '48%', 'display': 'inline-block', 'verticalAlign': 'top',
                        'padding': '20px', 'backgroundColor': 'white', 'borderRadius': '8px',
                        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)', 'marginRight': '2%'}),
                
                # Indicator 2: Customer Segmentation
                html.Div([
                    html.H3('Indicator 2: Customer Segmentation', 
                            style={'color': '#27ae60', 'marginBottom': 15}),
                    html.P('Data Transformation - MinMax Normalization + K-Means Clustering',
                        style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 15}),
                    dcc.Graph(
                        figure=fig_customer_segmentation
                    )
                ], style={'width': '48%', 'display': 'inline-block', 'verticalAlign': 'top',
                        'padding': '20px', 'backgroundColor': 'white', 'borderRadius': '8px',
                        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)'})
            ], style={'marginBottom': '30px'}),  # End of Row 1

            # Row 2: Indicators 3 and 4
            html.Div([
                # Indicator 3: Revenue Prediction
                html.Div([
                    html.H3('Indicator 3: Prediction of future revenue', 
                        style={'color': '#2980b9', 'marginBottom': 15}),
                    html.P('Temporal Prediction - Forecast of future revenue from past invoices (yearly, monthly or daily)',
                        style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 15}),
                    dcc.Graph(id='graph'),
                    dcc.Dropdown(options=[{"label": "All Year", "value": "YA"},{"label": "2020 Year", "value": "Y2020"},
                                        {"label": "2019 Year", "value": "Y2019"},{"label": "2017 Year", "value": "Y2017"},
                                        {"label": "2015 Year", "value": "Y2015"},{"label": "2010 Year", "value": "Y2010"},
                                        {"label": "2000 Year", "value": "Y2000"},{"label": "1990 Year", "value": "Y1990"},
                                        {"label": "1980 Year", "value": "Y1980"},
                                        {"label": "All Month", "value": "MA"},{"label": "2020 Month", "value": "M2020"},
                                        {"label": "2019 Month", "value": "M2019"},{"label": "2017 Month", "value": "M2017"},
                                        {"label": "2015 Month", "value": "M2015"},{"label": "2010 Month", "value": "M2010"},
                                        {"label": "2000 Month", "value": "M2000"},{"label": "1990 Month", "value": "M1990"},
                                        {"label": "1980 Month", "value": "M1980"},
                                        {"label": "All Day", "value": "DA"},{"label": "2020 Day", "value": "D2020"},
                                        {"label": "2019 Day", "value": "D2019"},{"label": "2017 Day", "value": "D2017"},
                                        {"label": "2015 Day", "value": "D2015"},{"label": "2010 Day", "value": "D2010"},
                                        {"label": "2000 Day", "value": "D2000"},{"label": "1990 Day", "value": "D1990"},
                                        {"label": "1980 Day", "value": "D1980"}],
                                            value="YA", id='dropdown')
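                ], style={'width': '48%', 'display': 'inline-block', 'verticalAlign': 'top',
                        'padding': '20px', 'backgroundColor': 'white', 'borderRadius': '8px',
                        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)', 'marginRight': '2%'}),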

                # Indicator 4: Spatial Clustering
                html.Div([
                    html.H3('Indicator 4: Geographic Clustering (Spatial)', 
                            style={'color': '#e67e22', 'marginBottom': 15}),
                    html.P('Spatial Analysis - K-Means Clustering of cities by activity level',
                        style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 10}),
                    dcc.Graph(
                        figure=fig_spatial_clustering
                    )
                ], style={'width': '48%', 'display': 'inline-block', 'verticalAlign': 'top',
                        'padding': '20px', 'backgroundColor': 'white', 'borderRadius': '8px',
                        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)'})
            ]),
        ], style={'padding': '20px', 'maxWidth': '1400px', 'margin': '0 auto'}),
    ])

    # Callback connecting the indicator 3 dropdown to its graph
    @callback(
    Output('graph', 'figure'),
    Input('dropdown', 'value'))
    def update_temporal_graph(selected_value):
        """
        Return the pre-computed revenue prediction figure for the dropdown selection.

        Input : the "value" property of the dropdown with id='dropdown'
        Output: the visualization containing the revenue prediction
        """
        # The dropdown can be cleared, in which case the value is None: fall back to the default
        if selected_value not in year_to_index:
            selected_value = "YA"
        index = year_to_index[selected_value]
        if selected_value.startswith("Y"):
            fig_dash = figure_pred_year[index]
        elif selected_value.startswith("M"):
            fig_dash = figure_pred_month[index]
        else:
            fig_dash = figure_pred_day[index]
        return fig_dash
    return app
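
The comments in `create_indicator` ask for `year_to_index` and the Dropdown options to be edited by hand whenever `time_pred` changes. A minimal sketch of how both could instead be derived from the list, assuming the same value scheme as above ("YA", "Y2020", "M2020", ...) and that 1900 stands for "all years". The helper name `build_dropdown_config` and its `all_marker` parameter are hypothetical and not part of the notebook.

```python
def build_dropdown_config(time_pred, all_marker=1900):
    """Hypothetical helper: build the dropdown options and the value -> index map.

    Assumption: all_marker (1900 in the notebook) means "use all years".
    """
    granularities = [("Y", "Year"), ("M", "Month"), ("D", "Day")]
    options = []
    value_to_index = {}
    for prefix, name in granularities:
        for index, cutoff in enumerate(time_pred):
            is_all = cutoff == all_marker
            value = f"{prefix}{'A' if is_all else cutoff}"
            label = f"{'All' if is_all else cutoff} {name}"
            options.append({"label": label, "value": value})
            # The index is the position of the cutoff year in time_pred,
            # which is also the position of its figure in figure_pred_*
            value_to_index[value] = index
    return options, value_to_index


time_pred = [1900, 2020, 2019, 2017, 2015, 2010, 2000, 1990, 1980]
options, year_to_index = build_dropdown_config(time_pred)
print(options[0])              # {'label': 'All Year', 'value': 'YA'}
print(year_to_index["M2017"])  # 3
```

With such a helper, `options` could be passed to `dcc.Dropdown(options=...)` and the callback would keep using `year_to_index` unchanged, so a new cutoff year would only need to be added in one place.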

In [37]:
def main():
    file_path = "invoices.csv"
    # Second dataset used for augmentation: https://www.kaggle.com/datasets/ghassenkhaled/invoices-data
    file_path_extra = "invoices_extra.csv" 
    df = load_data(file_path)
    df_extra = load_data(file_path_extra)
    df = preprocess_data(df,date_column="invoice_date", name_options="separate")
    df_extra = preprocess_data(df_extra,date_column="issuedDate",date_format='%Y-%m-%d')

    df_augmented = fusion_dataset(df,df_extra)
    explore_data(df)

    app = create_dashboard(df,df_augmented)
    app.run()

if __name__ == "__main__":
    main()

Data loaded successfully.
Data loaded successfully.
Shape (rows, columns): (20000, 15)

Column dtypes:
name                    object
address                 object
amount                 float64
city                    object
email                   object
invoice_date    datetime64[ns]
job                     object
product_id               int64
qty                      int64
revenue                float64
stock_code               int64
year                     int32
month                    int32
day                      int32
dayofweek                int32
dtype: object

Missing values per column:
name            0
address         0
amount          0
city            0
email           0
invoice_date    0
job             0
product_id      0
qty             0
revenue         0
stock_code      0
year            0
month           0
day             0
dayofweek       0
dtype: int64

Basic description of numerical columns:
             amount                invoice_date    product_id     

13:41:27 - cmdstanpy - INFO - Chain [1] start processing
13:41:28 - cmdstanpy - INFO - Chain [1] done processing


Cluster Distribution:
cluster_label
Low Activity            3809
Medium-Low Activity     2377
Medium-High Activity    1182
High Activity            405
Name: count, dtype: int64

Cluster Characteristics:
                      total_revenue  transaction_count
cluster_label                                         
High Activity           2172.758765           7.664198
Low Activity             252.502331           2.268837
Medium-High Activity    1338.599915           2.159052
Medium-Low Activity      795.021868           2.398822


In [50]:
file_path_extra = "invoices_extra.csv" 
df_extra = load_data(file_path_extra)
df_extra = preprocess_data(df_extra,date_column="issuedDate",date_format='%Y-%m-%d')
file_path = "invoices.csv"
df = load_data(file_path)
df = preprocess_data(df,date_column="invoice_date", name_options="separate")

df_talta = fusion_dataset(df,df_extra)
df_talta.head()

Data loaded successfully.
Data loaded successfully.


|   | name | city | invoice_date | job | revenue | year | month | day | dayofweek | country |
|---|------|------|--------------|-----|---------|------|-------|-----|-----------|---------|
| 0 | Phyllis Bailey | Kyleborough | 2022-01-16 | Engineer, water | 36.12 | 2022 | 1 | 16 | 6 | mars |
| 1 | Roy Jackson | Danielmouth | 2022-01-04 | Cartographer | 203.76 | 2022 | 1 | 4 | 1 | mars |
| 2 | Christopher Davis | North Vanessa | 2022-01-17 | Designer, jewellery | 583.52 | 2022 | 1 | 17 | 0 | mars |
| 3 | Mark Thomas | Jeremymouth | 2022-01-05 | Information systems manager | 419.75 | 2022 | 1 | 5 | 2 | mars |
| 4 | John Morrison | Larrychester | 2022-01-13 | Engineer, mining | 654.75 | 2022 | 1 | 13 | 3 | mars |


In [58]:

city_stats = indicator_top_group(df_talta,revenue_column="revenue",groupBy_column="city", n=20)

# Creating the visualization
fig_top_cities_revenue = px.bar(
        city_stats,
        x='city',
        y='total_revenue',
        hover_data={'transactions_count': True, 'total_revenue': ':.2f'},
        title='Top Cities by Revenue and Transactions',
        labels={'total_revenue': 'Total Revenue ($)', 'city': 'City Name'},
        color='transactions_count',
        color_continuous_scale='Blues'
)
fig_top_cities_revenue.update_layout(
        xaxis_tickangle=-45,
        height=400
)


ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
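
The `ValueError` above says that rendering the figure inline requires `nbformat>=4.2.0`, which is not installed in this environment. A minimal sketch of how the requirement could be checked before displaying a figure, using only the standard library. The function name `nbformat_ok` is hypothetical, and the version parsing is deliberately simple (leading digits of the first three components only).

```python
from importlib import metadata
from itertools import takewhile


def nbformat_ok(minimum=(4, 2, 0)):
    """Return True if nbformat is installed with at least the given version."""
    try:
        installed = metadata.version("nbformat")
    except metadata.PackageNotFoundError:
        return False
    # Keep the leading digits of each component, e.g. "5.10.4" -> [5, 10, 4]
    parts = []
    for piece in installed.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "4.2" so the tuple comparison is fair
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= minimum


if nbformat_ok():
    print("nbformat is recent enough for inline rendering")
else:
    print("Install or upgrade it, e.g.: pip install --upgrade nbformat")
```

After installing the package, the notebook kernel usually has to be restarted before the figure can be displayed.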