# Programming in Data Science - Final Project
## Invoices Dataset Analysis
**Team Members: Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN, Alvaro SERERO**

Dataset: Invoices (Kaggle)

Source: https://www.kaggle.com/datasets/cankatsrc/invoices/data

This dataset includes multiple fields such as customer details (first name, last name, email), transaction information (product ID, quantity, amount, invoice date), and additional attributes like address, city, and stock code.

### Import all needed libraries for the project:
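- Pandas for data manipulation
- Plotly for visualizations
- Dash for creating a visual and interactive dashboard interface
- Prophet for time-series forecasting
- scikit-learn for normalization (MinMaxScaler) and clustering (KMeans)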

In [20]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go  # used by the segmentation pie chart in the dashboard
from dash import Dash, dcc, html, Input, Output, callback
from prophet import Prophet
from prophet.plot import plot_plotly
import logging

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

## 1) Data collection and exploration

### Function to safely load CSV data from a file path

In [3]:
def load_data(file_path: str) -> pd.DataFrame:
    """
    Function to load CSV data from a given file path safely.

    Input: 
    ------
    file_path => String, path to the CSV file

    Output: 
    ------
    dataset => pd.DataFrame containing the loaded data
    """
    try:
        df = pd.read_csv(file_path)
        print("Data loaded successfully.")
        return df
    except Exception as e:
        print(f"Error loading data: {e}")
        return pd.DataFrame()

### Function to process invalid first_name and last_name columns.
- In the initial dataset, the "first_name" and "last_name" columns each contain a full name (a first name plus a last name). This does not make sense: a transaction is made by a single person, and each column should contain exactly what its name describes.
- For example, the first row is structured as follows, which is a mistake and needs to be corrected.

| first_name | last_name |
|---|---|
| Carmen Nixon | Todd Anderson |
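
As a minimal sketch of what the "separate" option does, the snippet below melts two name columns into a single `name` column on a tiny hypothetical sample (the `sample` frame and its `city` values are invented for illustration, not taken from the Invoices dataset):

```python
import pandas as pd

# Hypothetical two-row sample mimicking the structure described above
sample = pd.DataFrame({
    "first_name": ["Carmen Nixon", "Ann Lee"],
    "last_name": ["Todd Anderson", "Bob Ray"],
    "city": ["Paris", "Lyon"],
})

# Unpivot both name columns into one "name" column: one row per person
melted = pd.melt(
    sample,
    id_vars=["city"],
    value_vars=["first_name", "last_name"],
    value_name="name",
).drop(columns="variable")  # "variable" only records the source column

print(melted)
```

Each original row yields two rows, so the row count doubles while the other columns are repeated for both names.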

In [4]:
def name_treatment(dataset: pd.DataFrame, options: str="separate") -> pd.DataFrame:
    """
    Function to treat first_name and last_name columns in a dataset.
    
    Input:
    ---------
    - dataset => Pandas DataFrame, dataset must have first_name and last_name columns
    - options => String, options for treatment between "separate", "first" and "last"
        - "separate" (default): create one row per name, so each original row yields two rows, 
        - "first": keep only the first_name renamed as name, 
        - "last" : keep only the last_name renamed as name
    
    Output:
    ---------
    - dataset => Pandas DataFrame after treating first_name and last_name columns
    """

    if "first_name" in dataset and "last_name" in dataset:
        if options == "separate":
            value = dataset.columns.difference(['first_name','last_name']).tolist()
            new_dataset = pd.melt(dataset, id_vars=value,              
                              value_vars=['first_name', 'last_name'],
                              value_name='name')
            
            autres_colonnes = [col for col in new_dataset.columns if col not in ["name", "variable"]]
            nouvel_ordre = ["name"] + autres_colonnes
            new_dataset = new_dataset[nouvel_ordre]

        elif options == "first":
            new_dataset = dataset.drop(columns=['last_name'])
            new_dataset.rename(columns={'first_name': 'name'}, inplace=True)

        elif options == "last":
            new_dataset = dataset.drop(columns=['first_name'])
            new_dataset.rename(columns={'last_name': 'name'}, inplace=True)
        else:
            print(f"'{options}' is not a valid value for options, please use 'separate', 'first' or 'last'")
            return dataset
        return new_dataset
    else:
        return dataset

### Function to parse invoice dates:
- Converts the "invoice_date" column to datetime for future temporal manipulations.
- Extracts year, month, day, and day of week features.
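
The two steps above can be sketched on a few hypothetical date strings (the values in `raw` are invented for illustration). With `errors="coerce"`, a string that does not match the `%d/%m/%Y` format becomes `NaT` instead of raising an exception:

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately invalid
raw = pd.Series(["25/12/2021", "01/02/2020", "not a date"])

# Day-first format; unparseable strings are turned into NaT
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Temporal features derived through the .dt accessor
features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "dayofweek": parsed.dt.dayofweek,  # Monday = 0, Sunday = 6
})

print(features)
```

Rows that fail to parse carry missing values in every derived feature, so counting `NaT` entries after parsing is a quick way to spot malformed dates.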

In [5]:
def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    """
    Converts "invoice_date" column from string to datetime.
    Extracts year, month, day, and day of week features.

    Input:
    ------
    - df (DataFrame) - dataset with invoice_date column

    Output:
    ------
    - df (DataFrame) - dataset with parsed datetime features
    """
    if 'invoice_date' not in df.columns:
        print("Column 'invoice_date' not found in DataFrame.")
        return df
    
    df = df.copy()
    df['invoice_date'] = pd.to_datetime(df['invoice_date'], format='%d/%m/%Y', errors='coerce')
    df['year'] = df['invoice_date'].dt.year
    df['month'] = df['invoice_date'].dt.month
    df['day'] = df['invoice_date'].dt.day
    df['dayofweek'] = df['invoice_date'].dt.dayofweek
    print("Dates parsed and temporal features extracted")
    return df

### Strip whitespace from all string columns of the dataset.

In [6]:
def convert_string_columns(df: pd.DataFrame) -> None:
    """
    String manipulation: Strip whitespace from object columns

    Input:
    -------
    - df => Pandas DataFrame to be processed
    Output:
    ------- 
    None (the function modifies the DataFrame in place)
    """
    string_cols = df.select_dtypes(include=['object']).columns
    for col in string_cols:
        df[col] = df[col].str.strip()

### Function to preprocess the initial loaded dataset: combines all the previous functions and returns a clean dataset.

In [7]:
def preprocess_data(df: pd.DataFrame, name_options: str ="separate") -> pd.DataFrame:
    """
    Function to preprocess the initial loaded dataset:
    - Adds a "revenue" column derived from the "qty" and "amount" columns.
    - Creates a "name" column using the name_treatment function, correcting the first_name and last_name columns.
    - Converts the "invoice_date" column to datetime for future temporal manipulations.
    - Strips whitespace from strings using the convert_string_columns function.

    Input:
    ---------
    - df => Pandas DataFrame to be preprocessed
    - [Optional] name_options (String) => option for the name_treatment function. Possible choices: "separate", "first" and "last"
    Output:
    ---------
    - df => Preprocessed Pandas DataFrame
    """
    if 'qty' in df.columns and 'amount' in df.columns:
        # Create 'revenue' column as product of 'qty' and 'amount'
        df['revenue'] = df['qty'] * df['amount']

    df = name_treatment(df, options=name_options)
    df = parse_dates(df)
    convert_string_columns(df)

    return df

### Function for data exploration: displaying basic information on our dataset.
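The output below shows no missing or NaN values: every column reports 0 missing entries. The preprocessed dataset has 20,000 rows because the default "separate" option produces one row per name.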

In [8]:
def explore_data(df: pd.DataFrame) -> None:
    """
    Prints key exploratory information: 
    - dataset shape (rows, columns)
    - column data types
    - missing values per column
    - description of columns
    - correlation matrix between numerical columns

    Input:
    ---------
    - df => Pandas DataFrame to be explored

    Output:
    ---------
    None (prints information to console)
    """
    print("Shape (rows, columns):", df.shape)

    print("\nColumn dtypes:")
    print(df.dtypes)

    print("\nMissing values per column:")
    print(df.isna().sum())

    print("\nBasic description of numerical columns:")
    print(df.describe())

    # Correlation matrix for numeric variables
    print("\nCorrelation matrix (numeric columns):")
    print(df[['qty', 'amount', 'revenue']].corr())

### Testing data collection, preprocessing and exploration on the Invoices dataset.

In [9]:
df = load_data('invoices.csv')
df = preprocess_data(df)
explore_data(df)

Data loaded successfully.
Dates parsed and temporal features extracted
Shape (rows, columns): (20000, 15)

Column dtypes:
name                    object
address                 object
amount                 float64
city                    object
email                   object
invoice_date    datetime64[ns]
job                     object
product_id               int64
qty                      int64
revenue                float64
stock_code               int64
year                     int32
month                    int32
day                      int32
dayofweek                int32
dtype: object

Missing values per column:
name            0
address         0
amount          0
city            0
email           0
invoice_date    0
job             0
product_id      0
qty             0
revenue         0
stock_code      0
year            0
month           0
day             0
dayofweek       0
dtype: int64

Basic description of numerical columns:
             amount                invoice_date

## 2) Querying the dataset

### Indicator 1: Grouping query (top cities by total revenue)
- Revenue = Quantity * Amount --> this is the total amount of a single transaction.
- Identifies the most profitable geographic locations by aggregating total revenue by city.
- Could potentially be used for business (targeted marketing, logistics, etc.).
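
The grouping query described above can be sketched on a small hypothetical order table (the `orders` frame is invented for illustration). It uses pandas named aggregation to compute both the total revenue and the number of transactions per city:

```python
import pandas as pd

# Hypothetical orders, not taken from the Invoices dataset
orders = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice"],
    "qty": [2, 1, 3, 4],
    "amount": [10.0, 50.0, 20.0, 5.0],
})
orders["revenue"] = orders["qty"] * orders["amount"]

# Aggregate per city, then keep the 2 cities with the highest revenue
top = (
    orders.groupby("city")
    .agg(
        total_revenue=("revenue", "sum"),
        transactions_count=("revenue", "count"),
    )
    .reset_index()
    .sort_values("total_revenue", ascending=False)
    .head(2)
)

print(top)
```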

In [16]:
def indicator_top_cities(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """
    Compute top N cities by total revenue.

    This function groups rows by city, sums the 'revenue' values and returns the top `n` cities ordered by total revenue descending.
    
    Input:
        - df: invoices dataframe
        - n: number of cities
    
    Output:
        - DataFrame with columns ['city', 'total_revenue', 'transactions_count']
    """
    # Ensure revenue column exists
    if 'revenue' not in df.columns:
        df = df.copy()
        df['revenue'] = df['qty'] * df['amount']
    
    city_rev = (
        df.groupby('city')
        .agg(
            total_revenue=('revenue', 'sum'),
            transactions_count=('city', 'count')
        )
        .reset_index()
        .sort_values('total_revenue', ascending=False)
        .head(n)
    )
    return city_rev

### Indicator 2: Data transformation (revenue normalization by city)
- Apply min‑max normalization or z‑score to city revenue to compare cities independently of absolute scale.
- Min-Max normalization using the formula:
    $$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
    where x is the original revenue and min(x), max(x) are the minimum and maximum revenues across cities.

- Z-Score normalization using the formula:
    $$z = \frac{x - \mu}{\sigma}$$
    where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$.
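
As a small worked example of both formulas, the sketch below normalizes three hypothetical revenue values (10, 20 and 30, chosen only for easy arithmetic). The z-score uses the population standard deviation (`ddof=0`):

```python
import pandas as pd

# Hypothetical city revenues
revenue = pd.Series([10.0, 20.0, 30.0])

# Min-max: (x - min) / (max - min), result lies in [0, 1]
minmax = (revenue - revenue.min()) / (revenue.max() - revenue.min())

# Z-score: (x - mean) / std, result has mean 0 and unit standard deviation
zscore = (revenue - revenue.mean()) / revenue.std(ddof=0)

print(minmax.tolist())
print(zscore.round(4).tolist())
```

Min-max keeps the values in a fixed range, while the z-score expresses each city as a number of standard deviations away from the average city.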

In [17]:
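def normalize_city_revenue(city_rev: pd.DataFrame, method: str = 'min-max') -> pd.DataFrame:
    """
    Normalize total_revenue column using specified method.
    
    Input: 
        - city_rev: DataFrame with 'city' and 'total_revenue'
        - method: normalization method, either 'min-max', 'z-score' or 'both' (default 'min-max')
    
    Output: 
        - copy of the DataFrame with extra column 'revenue_minmax' and/or 'revenue_zscore'
    """
    city_rev = city_rev.copy()

    # Accept both 'min-max'/'minmax' and 'z-score'/'zscore' spellings
    method = method.lower().replace("-", "").replace("_", "")
    if method not in ("minmax", "zscore", "both"):
        print(f"'{method}' is not a valid method, please use 'min-max', 'z-score' or 'both'")
        return city_rev

    if method in ("minmax", "both"):
        min_val = city_rev["total_revenue"].min()
        max_val = city_rev["total_revenue"].max()
        value_range = max_val - min_val
        if value_range != 0:
            city_rev["revenue_minmax"] = (
                (city_rev["total_revenue"] - min_val) / value_range
            )
        else:
            # All cities have the same revenue: avoid division by zero
            city_rev["revenue_minmax"] = 0.0

    if method in ("zscore", "both"):
        mean_val = city_rev["total_revenue"].mean()
        std_val = city_rev["total_revenue"].std(ddof=0)
        if std_val != 0:
            city_rev["revenue_zscore"] = (
                (city_rev["total_revenue"] - mean_val) / std_val
            )
        else:
            city_rev["revenue_zscore"] = 0.0

    return city_rev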

### Function to discretize city revenue into quantile-based revenue classes (for example low, medium, high).
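
As a minimal sketch of quantile-based discretization, the snippet below splits six hypothetical city revenues into three equally sized classes with `pd.qcut` (the `city_rev` values are invented for illustration):

```python
import pandas as pd

# Hypothetical per-city revenues
city_rev = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F"],
    "total_revenue": [10, 20, 30, 40, 50, 60],
})

# qcut places the bin edges at quantiles, so each class holds the same number of cities
city_rev["revenue_segment"] = pd.qcut(
    city_rev["total_revenue"],
    q=3,
    labels=["Segment_1", "Segment_2", "Segment_3"],
)

print(city_rev)
```

Unlike fixed-width bins, quantile bins stay balanced even when a few cities have very large revenues.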

In [18]:
def discretize_city_revenue(city_rev: pd.DataFrame, q: int = 3) -> pd.DataFrame:
    """
    Discretize city revenue into q quantile-based categories.

    Input:
        - city_rev: DataFrame with 'total_revenue'
        - q: number of bins
    
    Output:
        - DataFrame with extra column 'revenue_segment'
    """
    city_rev = city_rev.copy()  # avoid mutating the caller's DataFrame
    city_rev['revenue_segment'] = pd.qcut(
        city_rev['total_revenue'],
        q=q,
        labels=[f"Segment_{i+1}" for i in range(q)]
    )
    return city_rev

### Indicator 2 (Version 2): Data Transformation - Customer Segmentation
This indicator applies MinMax Normalization to standardize features, then uses K-Means clustering to segment customers.

This helps identify high-value customers for targeted marketing and retention strategies.
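
The approach can be sketched on a handful of hypothetical customer profiles (the numbers in `profiles` are invented and deliberately well separated, so this is an illustration rather than a result from the Invoices dataset). Features are rescaled to [0, 1], clustered with K-Means, and the arbitrary cluster ids are then ranked by mean spending to obtain readable labels:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer profiles
profiles = pd.DataFrame({
    "total_spent": [10.0, 12.0, 11.0, 100.0, 105.0, 300.0, 310.0],
    "num_transactions": [1, 1, 1, 2, 2, 5, 5],
})

# Rescale every feature to [0, 1] so that no feature dominates the distances
scaled = MinMaxScaler().fit_transform(profiles)

# Cluster the scaled features into three groups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
profiles["segment"] = kmeans.fit_predict(scaled)

# Cluster ids are arbitrary: rank clusters by mean spending to name them
order = profiles.groupby("segment")["total_spent"].mean().sort_values().index
names = dict(zip(order, ["Low Value", "Medium Value", "High Value"]))
profiles["segment_label"] = profiles["segment"].map(names)

print(profiles)
```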

In [None]:
def customer_segmentation(df: pd.DataFrame) -> pd.DataFrame:
    """
    Applies MinMax normalization and K-Means clustering for segmentation.
    Segments customers into Low, Medium, and High value groups based on
    spending patterns and transaction frequency.

    Input:
        - df: invoices dataframe
    Output:
        - DataFrame with one row per customer, containing the aggregated metrics, their normalized versions and the segment labels
    """
    # Ensure revenue column exists
    df = df.copy()
    if 'revenue' not in df.columns:
        df['revenue'] = df.get('qty', 0) * df.get('amount', 0)

    # Aggregate customer-level metrics
    customer_profile = df.groupby(['name', 'email']).agg({
        'revenue': ['sum', 'mean', 'count'],
        'qty': 'sum'
    }).reset_index()

    customer_profile.columns = ['name', 'email',
                                 'total_spent', 'avg_transaction',
                                 'num_transactions', 'total_quantity']
    
    # Apply MinMax Normalization to features
    features = ['total_spent', 'avg_transaction', 'num_transactions', 'total_quantity']
    scaler = MinMaxScaler()
    customer_profile[[f'{f}_norm' for f in features]] = scaler.fit_transform(customer_profile[features])

    # Basic safety checks before clustering
    if customer_profile.shape[0] < 3:
        # Not enough samples for 3 clusters: we skip clustering and return profile
        customer_profile['segment'] = 0
        customer_profile['segment_label'] = 'Single/Small'
        return customer_profile

    # K-Means Clustering (3 clusters)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    customer_profile['segment'] = kmeans.fit_predict(customer_profile[['total_spent_norm', 'num_transactions_norm']])

    # Label segments based on spending levels
    segment_means = customer_profile.groupby('segment')['total_spent'].mean().sort_values()
    segment_mapping = {
        segment_means.index[0]: 'Low Value',
        segment_means.index[1]: 'Medium Value',
        segment_means.index[2]: 'High Value'
    }
    customer_profile['segment_label'] = customer_profile['segment'].map(segment_mapping)

    print("Segment Distribution:")
    print(customer_profile['segment_label'].value_counts())
    print("\nSegment Characteristics:")
    print(customer_profile.groupby('segment_label')[['total_spent', 'num_transactions']].mean())

    return customer_profile

customer_segmentation(df)


Segment Distribution:
segment_label
Low Value       10574
Medium Value     6188
High Value       3238
Name: count, dtype: int64

Segment Characteristics:
               total_spent  num_transactions
segment_label                               
High Value      639.697912               1.0
Low Value       106.796832               1.0
Medium Value    341.488239               1.0


| | name | email | total_spent | avg_transaction | num_transactions | total_quantity | total_spent_norm | avg_transaction_norm | num_transactions_norm | total_quantity_norm | segment | segment_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Allen | ihogan@example.com | 74.39 | 74.39 | 1 | 1 | 0.077552 | 0.077552 | 0.0 | 0.000 | 0 | Low Value |
| 1 | Aaron Alvarez | anthony45@example.net | 202.26 | 202.26 | 1 | 3 | 0.220607 | 0.220607 | 0.0 | 0.250 | 0 | Low Value |
| 2 | Aaron Andersen | jacqueline56@example.com | 320.40 | 320.40 | 1 | 5 | 0.352777 | 0.352777 | 0.0 | 0.500 | 2 | Medium Value |
| 3 | Aaron Bennett | miranda15@example.net | 98.16 | 98.16 | 1 | 3 | 0.104145 | 0.104145 | 0.0 | 0.250 | 0 | Low Value |
| 4 | Aaron Booth | markmccall@example.com | 321.54 | 321.54 | 1 | 6 | 0.354053 | 0.354053 | 0.0 | 0.625 | 2 | Medium Value |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | Zoe Brown | sparker@example.com | 125.64 | 125.64 | 1 | 4 | 0.134888 | 0.134888 | 0.0 | 0.375 | 0 | Low Value |
| 19996 | Zoe Cole | gabrielle56@example.org | 44.36 | 44.36 | 1 | 2 | 0.043956 | 0.043956 | 0.0 | 0.125 | 0 | Low Value |
| 19997 | Zoe Cowan | moorelisa@example.org | 143.50 | 143.50 | 1 | 5 | 0.154869 | 0.154869 | 0.0 | 0.500 | 0 | Low Value |
| 19998 | Zoe Klein | joseph00@example.org | 123.62 | 123.62 | 1 | 2 | 0.132629 | 0.132629 | 0.0 | 0.125 | 0 | Low Value |


### Function to make temporal prediction on the dataset

In [16]:
def temporal_prediction(df: pd.DataFrame, time: str="year",periods: int = 10):
    """
    Use Prophet model to make a temporal prediction of the revenue.

    Inputs:
    ---------
    - df (DataFrame): Input dataset
    - [Optional] time (str): granularity of the prediction, either "year" or "month".  
        - "year" (default): aggregate revenue by the year column of the dataset.  
        - "month": aggregate revenue by the year and month columns of the dataset.  
    - [Optional] periods (int): number of future periods to forecast. Default 10.

    Outputs: 
    --------
    - new_df (DataFrame) - Contains 'time', 'original_revenue' and 'predicted_revenue'.
    - model (Prophet) - Prophet model trained on the dataset and used for the prediction
    - prediction (DataFrame) - Future prediction made by the model
    """
    dataset = df.copy()
    if time == "year":
        # TODO: should the datetime conversion be moved elsewhere?
        dataset['year'] = pd.to_datetime(dataset['year'], format='%Y')

        freq = 'YE'
        dataset = dataset.groupby('year')['revenue'].sum().reset_index()
        dataset.rename(columns={'year': 'ds','revenue': 'y'}, inplace=True)

    elif time == "month":
        # TODO: should the datetime conversion be moved elsewhere?
        dataset['month'] = dataset['year'].astype(str) + '-' + dataset['month'].astype(str).str.zfill(2)
        dataset['month'] = pd.to_datetime(dataset['month'], format='%Y-%m')

        freq = 'ME'
        dataset = dataset.groupby('month')['revenue'].sum().reset_index()
        dataset.rename(columns={'month': 'ds','revenue': 'y'}, inplace=True)
    # elif time == "day":

    else:
        print("Error: the 'time' option must be 'year' or 'month'.")
        return df, None, None

    # Set the cmdstanpy logger to ERROR so its output is hidden when the function is used
    logging.getLogger('cmdstanpy').setLevel(logging.ERROR)

    # Create a Prophet model to make prediction
    model = Prophet()
    model.fit(dataset)

    future_dates = model.make_future_dataframe(periods=periods, freq=freq)
    prediction = model.predict(future_dates)

    new_df = pd.merge(dataset[['ds', 'y']], prediction[['ds', 'yhat']], on='ds', how='outer')
    new_df.rename(columns={'y': 'original_revenue','yhat': 'predicted_revenue', 'ds': 'time'}, inplace=True)
    
    return new_df, model,prediction
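
Prophet expects a training frame with a `ds` (date) column and a `y` (value) column. The monthly aggregation that prepares this frame can be sketched without fitting any model, here on a few hypothetical rows (the `invoices` values are invented for illustration):

```python
import pandas as pd

# Hypothetical invoices with the year/month features produced by parse_dates
invoices = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021],
    "month": [1, 1, 2, 1],
    "revenue": [100.0, 50.0, 30.0, 70.0],
})

# Build a first-of-month timestamp from the year and the zero-padded month
month_start = pd.to_datetime(
    invoices["year"].astype(str) + "-" + invoices["month"].astype(str).str.zfill(2),
    format="%Y-%m",
)

# Sum revenue per month and use the column names Prophet expects
monthly = (
    invoices.assign(ds=month_start)
    .groupby("ds")["revenue"].sum()
    .reset_index()
    .rename(columns={"revenue": "y"})
)

print(monthly)
```

Combining the year with the month matters: grouping on the month number alone would merge January 2020 with January 2021.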

    

### Function to create a visualization based on a temporal prediction

In [17]:
def display_temporal_prediction(df: pd.DataFrame, model, prediction, options: str = "prophet"):
    """
    Create a visualization of a temporal prediction,
    either from the dataset or from the prediction model.

    Inputs:
    ---------  
    - df (DataFrame): Input dataset.
    - model (Prophet): Prophet model trained on the dataset and used for the prediction.
    - prediction (DataFrame): Future prediction made by the model.
    - [Optional] options (str): options for the visualization between "ploty" and "prophet". 
        - "prophet" (default): use Prophet's plot_plotly on the model and its prediction.   
        - "ploty": plot the 'original_revenue' and 'predicted_revenue' columns of df with Plotly.  

    Outputs: 
    --------
    - fig (Figure) - A figure containing the temporal visualization, or None on error.
    """
    fig = None  # returned as-is when the figure cannot be built

    if options == "ploty":
        if 'predicted_revenue' in df.columns and 'original_revenue' in df.columns:
            # Use the 'time' column on the x-axis when available, otherwise the index
            x_values = df["time"] if "time" in df.columns else df.index
            fig = px.area()
            fig.add_scatter(x=x_values, y=df["original_revenue"], mode='lines', line=dict(color='blue'), name="original")
            fig.add_scatter(x=x_values, y=df["predicted_revenue"], mode='lines', line=dict(color='green'), name="prediction")
            fig.update_layout(title="Prediction", xaxis_title="Date", yaxis_title="Revenue")
        else:
            print("Error: the prediction was not found in the dataset")

    elif options == "prophet":
        if model is not None:
            fig = plot_plotly(model, prediction)
        else:
            print("Error: the prediction model was not found")

    else:
        print(f"Error: unknown option '{options}', expected 'ploty' or 'prophet'")

    return fig

## 3) Dash visualization

In [27]:
def create_dashboard(df: pd.DataFrame) -> Dash:
    # Indicator 1
    city_stats = indicator_top_cities(df, n=10)

    fig_top_cities_revenue = px.bar(
        city_stats,
        x='city',
        y='total_revenue',
        hover_data={'transactions_count': True, 'total_revenue': ':.2f'},
        title='Top Cities by Revenue and Transactions',
        labels={'total_revenue': 'Total Revenue ($)', 'city': 'City Name'},
        color='transactions_count',
        color_continuous_scale='Blues'
    )
    fig_top_cities_revenue.update_layout(
        xaxis_tickangle=-45,
        height=400
    )

    fig_top_cities_transactions = px.scatter(
        city_stats,
        x='transactions_count',
        y='total_revenue',
        size='transactions_count',
        color='total_revenue',
        color_continuous_scale='Viridis',
        hover_data={'city': True, 'transactions_count': True, 'total_revenue': ':.2f'},
        labels={'transactions_count': 'Transactions', 'total_revenue': 'Total Revenue ($)'},
        title='Total Revenue vs Number of Transactions (per City)'
    )

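    # Indicator 2: customer segmentation (used by the pie chart in the layout below)
    customer_segments = customer_segmentation(df)

    # Indicator 3: temporal prediction of revenue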
    figure_pred_year = []
    figure_pred_month = []
    time_pred = [1900,2015,2010,2000,1990,1980]
    for i in time_pred:
        dataset = df.copy()
        dataset = dataset[dataset["year"]>i]
        data_y, model_y,predictions_y = temporal_prediction(dataset,time="year")
        figure_pred_year.append(display_temporal_prediction(data_y,model_y,predictions_y))
        data_m,model_m,predictions_m = temporal_prediction(dataset,time="month")
        figure_pred_month.append(display_temporal_prediction(data_m,model_m,predictions_m))
    # Map each dropdown value to the index of its start year in time_pred
    year_to_index = {
        "YA": 0, "MA": 0,    # 1900 (All)
        "Y2015": 1, "M2015": 1,
        "Y2010": 2, "M2010": 2,
        "Y2000": 3, "M2000": 3,
        "Y1990": 4, "M1990": 4,
        "Y1980": 5, "M1980": 5,
    }
    
    # Initialize the Dash app
    app = Dash(__name__)

    app.layout = html.Div([
        # Header
        html.Div([
            html.H1('Invoices Data Analysis Dashboard', 
                    style={'textAlign': 'center', 'color': '#2c3e50', 'marginBottom': 10}),
            html.H3('Programming for Data Science - Final Project', 
                    style={'textAlign': 'center', 'color': '#34495e', 'marginBottom': 5}),
            html.P('Team Members: Alvaro SERERO, Leo WINTER, Yoann SUBLET, Kellian VERVAELE KLEIN',
                style={'textAlign': 'center', 'color': '#7f8c8d', 'fontSize': 14}),
            html.P('Dataset: Invoices (Kaggle) - 10,000 online store transactions',
                style={'textAlign': 'center', 'color': '#7f8c8d', 'fontSize': 14, 'marginBottom': 20}),
            html.Div([
                html.P('Project Objective: Extract and visualize four key business indicators from invoice data ' +
                    'to support data-driven decision-making in e-commerce operations.',
                    style={'textAlign': 'center', 'color': '#2c3e50', 'fontSize': 15, 
                            'maxWidth': '900px', 'margin': '0 auto', 'padding': '15px',
                            'backgroundColor': '#ecf0f1', 'borderRadius': '8px'})
            ])
        ], style={'backgroundColor': '#f8f9fa', 'padding': '20px', 'marginBottom': '30px',
                'borderRadius': '10px', 'boxShadow': '0 2px 4px rgba(0,0,0,0.1)'}),

        # Dashboard content
        html.Div([
            # Row 1: Indicators 1 and 2
            html.Div([
                # Indicator 1: Top Cities by Revenue and Transactions
                html.Div([
                    html.H3('Indicator 1: Top Cities by Total Revenue', 
                            style={'color': '#2980b9', 'marginBottom': 15}),
                    html.P('Grouping Query - Aggregates total revenue and transaction count per city',
                        style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 15}),
                    dcc.Graph(
                        figure=fig_top_cities_revenue
                    )
                ], style={'width': '48%', 'display': 'inline-block', 'verticalAlign': 'top',
                        'padding': '20px', 'backgroundColor': 'white', 'borderRadius': '8px',
                        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)', 'marginRight': '2%'}),
                
                # Indicator 2: Customer Segmentation
                html.Div([
                    html.H3('Indicator 2: Customer Segmentation', 
                            style={'color': '#27ae60', 'marginBottom': 15}),
                    html.P('Data Transformation - MinMax Normalization + K-Means Clustering',
                        style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 15}),
                    dcc.Graph(
                        figure=go.Figure([
                            go.Pie(
                                labels=customer_segments['segment_label'].value_counts().index,
                                values=customer_segments['segment_label'].value_counts().values,
                                hole=0.4,
                                marker=dict(colors=['#e74c3c', '#f39c12', '#27ae60'])
                            )
                        ]).update_layout(
                            title='Customer Distribution by Segment',
                            height=400
                        )
                    )
                ], style={'width': '48%', 'display': 'inline-block', 'verticalAlign': 'top',
                        'padding': '20px', 'backgroundColor': 'white', 'borderRadius': '8px',
                        'boxShadow': '0 2px 4px rgba(0,0,0,0.1)'})
            ], style={'marginBottom': '30px'}),  # End of Row 1

            # Row 2: Indicators 3 and 4
            html.Div([]),

        ]),

        html.Div([
                html.H3('Indicator 3: Prediction of future revenue', 
                    style={'color': '#2980b9', 'marginBottom': 15}),
                html.P('Temporal Prediction - Prophet forecast of future revenue based on historical revenue',
                    style={'fontSize': 13, 'color': '#7f8c8d', 'marginBottom': 15}),
                dcc.Graph(id='graph'),
                dcc.Dropdown(options=[{"label": "Yearly, all years", "value": "YA"}, {"label": "Yearly, after 2015", "value": "Y2015"},
                                      {"label": "Yearly, after 2010", "value": "Y2010"}, {"label": "Yearly, after 2000", "value": "Y2000"},
                                      {"label": "Yearly, after 1990", "value": "Y1990"}, {"label": "Yearly, after 1980", "value": "Y1980"},
                                      {"label": "Monthly, all years", "value": "MA"}, {"label": "Monthly, after 2015", "value": "M2015"},
                                      {"label": "Monthly, after 2010", "value": "M2010"}, {"label": "Monthly, after 2000", "value": "M2000"},
                                      {"label": "Monthly, after 1990", "value": "M1990"}, {"label": "Monthly, after 1980", "value": "M1980"}],
                                        value="YA", id='dropdown')
        ])
    ])

    @callback(
        Output('graph', 'figure'),
        Input('dropdown', 'value'))
    def update_temporal_graph(selected_value):
        # Values starting with "Y" select a yearly forecast, "M" a monthly one
        index = year_to_index[selected_value]
        if selected_value.startswith("Y"):
            fig_dash = figure_pred_year[index]
        else:
            fig_dash = figure_pred_month[index]
        return fig_dash

    return app

In [19]:
def main():
    file_path = "invoices.csv"
    df = load_data(file_path)
    df = preprocess_data(df)
    explore_data(df)

    app = create_dashboard(df)
    app.run()

if __name__ == "__main__":
    main()

10:13:33 - cmdstanpy - INFO - Chain [1] start processing


10:13:34 - cmdstanpy - INFO - Chain [1] done processing


In [20]:
file_path = "invoices.csv"
dataset = load_data(file_path)
dataset = preprocess_data(dataset)
dataset = dataset[dataset["year"]>2010]
dataset, model, pred = temporal_prediction(dataset,time="month")
dataset.head()



Data loaded successfully.
Dates parsed and temporal features extracted


| | time | original_revenue | predicted_revenue |
|---|---|---|---|
| 0 | 2011-01-01 | 8330.52 | 7832.206586 |
| 1 | 2011-02-01 | 8489.4 | 8624.580339 |
| 2 | 2011-03-01 | 8415.04 | 8617.682624 |
| 3 | 2011-04-01 | 12473.16 | 7736.676639 |
| 4 | 2011-05-01 | 9097.32 | 7972.814474 |


In [21]:
display_temporal_prediction(dataset, model, pred)