<p align="center">
  <img src="header.jpg" width="100%">
</p>


<div style="text-align: center;">
    <strong style="display: block; margin-bottom: 10px;">Group P</strong> 
    <table style="margin: 0 auto; border-collapse: collapse; border: 1px solid black;">
        <tr>
            <th style="border: 1px solid white; padding: 8px;">Name</th>
            <th style="border: 1px solid white; padding: 8px;">Student ID</th>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Beatriz Monteiro</td>
            <td style="border: 1px solid white; padding: 8px;">20240591</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Catarina Nunes</td>
            <td style="border: 1px solid white; padding: 8px;">20230083</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Margarida Raposo</td>
            <td style="border: 1px solid white; padding: 8px;">20241020</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Teresa Menezes</td>
            <td style="border: 1px solid white; padding: 8px;">20240333</td>
        </tr>
    </table>
</div>

### 🔗 Table of Contents <a id='table-of-contents'></a>
1. [Introduction](#introduction)  
2. [Business Understanding](#business-understanding)  
3. [Data Understanding](#data-understanding)  
4. [Data Preparation](#data-preparation)  
5. [Modeling](#modeling)  
6. [Evaluation](#evaluation)  
7. [Conclusion](#conclusion)  

---

### <span style="background-color:#235987; padding:5px; border-radius:5px;"> 📌 Introduction</span> <a id='introduction'></a>

This project follows the **CRISP-DM** methodology to forecast monthly sales for the Smart Infrastructure business unit of Siemens.


#### Monthly Sales Forecast

<p style="margin-bottom: 50px;"> This case study focuses on ... </p>

**Market Data** 

| Features                                      | Feature Description |
|-----------------------------------------------|---------------------|
| *date*                                       |  |
| *China_Production*                            |  |
| *China_Shipment*                              |  |
| *France_Production*                           |  |
| *France_Shipment*                             |  |
| *Germany_Production*                          |  |
| *Germany_Shipment*                            |  |
| *Italy_Production*                            |  |
| *Italy_Shipment*                              |  |
| *Japan_Production*                            |  |
| *Japan_Shipment*                              |  |
| *Switzerland_Production*                      |  |
| *Switzerland_Shipment*                        |  |
| *UK_Production*                               |  |
| *UK_Shipment*                                 |  |
| *US_Production*                               |  |
| *US_Shipment*                                 |  |
| *Europe_Production*                           |  |
| *Europe_Shipment*                             |  |
| *Base_Metal_Price*                            |  |
| *Energy_Price*                                |  |
| *Metal_and_Minerals_Price*                    |  |
| *Natural_Gas_Price*                           |  |
| *Crude_oil_avg_Price*                         |  |
| *Copper_price*                                |  |
| *LCU_EUR*                                     |  |
| *US_Producer_Price_Electrical_equip*          |  |
| *UK_Producer_Price_Electrical_equip*          |  |
| *Italy_Producer_Price_Electrical_equip*       |  |
| *France_Producer_Price_Electrical_equip*      |  |
| *Germany_Producer_Price_Electrical_equip*     |  |
| *China_Producer_Price_Electrical_equip*       |  |
| *US_Production_Index_Machinery_and_equip*    |  |
| *World_Production_Index_Machinery_and_equip*  |  |
| *Switzerland_Production_Index_Machinery_and_equip* |  |
| *UK_Production_Index_Machinery_and_equip*     |  |
| *Italy_Production_Index_Machinery_and_equip*  |  |
| *Japan_Production_Index_Machinery_and_equip*  |  |
| *France_Production_Index_Machinery_and_equip* |  |
| *Germany_Production_Index_Machinery_and_equip* |  |
| *US_Production_Index_Electrical_equip*        |  |
| *World_Production_Index_Electrical_equip*     |  |
| *Switzerland_Production_Index_Electrical_equip* |  |
| *UK_Production_Index_Electrical_equip*        |  |
| *Italy_Production_Index_Electrical_equip*     |  |
| *Japan_Production_Index_Electrical_equip*     |  |
| *France_Production_Index_Electrical_equip*    |  |
| *Germany_Production_Index_Electrical_equip*   |  |


**Sales Data**

| Features                                      | Feature Description |
|-----------------------------------------------|---------------------|
| *DATE*                                       |   |
| *Mapped_GCK*                                 |   |
| *Sales_EUR*                                  |   |

<b style="background-color:#A9A9A9; padding:5px; border-radius:5px; display: inline-block; margin-top: 50px;">CRISP-DM</b>

<ul style="margin-bottom: 30px;">
    <li><u>Business Understanding</u>: Defining objectives, assessing resources, and project planning.</li>
    <li><u>Data Understanding</u>: Collecting, exploring, and verifying data quality.</li>
    <li><u>Data Preparation</u>: Selecting, cleaning, constructing, integrating, and formatting data to ensure it is ready for analysis.</li>
    <li><u>Modeling</u>: Selecting and applying various modeling techniques while calibrating their parameters to optimal values.</li>
    <li><u>Evaluation</u>: Selecting the best-performing models and thoroughly evaluating whether they align with the business objectives.</li>
    <li><u>Deployment</u>: Bridging the data mining goals and the business application of the finalized model.</li>
</ul>

<hr style="margin-top: 30px;">


In [75]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px

from statsmodels.tsa.stattools import adfuller

In [34]:
pd.set_option('display.max_columns', None)

In [35]:
#compose a palette to use in the visualizations
pal_novaims = ['#003B5C','#003B5C','#003B5C','#003B5C','#003B5C']
pastel_color = sns.utils.set_hls_values(pal_novaims[1], l=0.4, s=0.3)

### <span style="background-color:#235987; padding:5px; border-radius:5px;"> 📌 Business Understanding</span> <a id='business-understanding'></a>

##### Click [here](#table-of-contents) ⬆️ to return to the Index.
---

The **Business Understanding** phase of the project entails the comprehension of the background leading to the project, as well as the business goals and requirements to be achieved. 

<b style="background-color:#A9A9A9; padding:5px; border-radius:5px;">Primary Business Objective</b> : 


<b style="background-color:#A9A9A9; padding:5px; border-radius:5px;">Plan</b> : 

### <span style="background-color:#235987; padding:5px; border-radius:5px;"> 📌 Data Understanding</span> <a id='data-understanding'></a>

- **[Data Loading and Description](#data-loading-and-description)**  
- **[Data Types](#Data-TypesDU)**
- **[Univariate EDA: Descriptive Summary](#Descriptive-Summary)**
- **[Univariate EDA: Missing values](#missing-valuesDU)**  
- **[Inconsistencies](#inconsistenciesDU)**  
- **[Feature Engineering](#feature-engineeringDU)**  
- **[Univariate EDA: Data Visualization](#univariate-vizualization)**  
    - **Numerical Variables:**  
        - [Numeric variables: Histograms](#hist)
        - [Outliers Analysis: Box-Plots](#box)
    - **Categorical Variables**  
        - [Categorical variables: Bar Plots](#bar)
        - [Categorical variables: Geographic Map](#GeographicMap)
- **[Bivariate EDA: Data Visualization](#Bivariate-Vizualization)**  
   - [Numeric-Numeric: Correlations](#NNCorrelations)
   - [Numeric-Categorical: Correlations](#NCCorrelations)
   - [Categorical-Categorical: Cross-tabulations](#CCCross-tabulations)
- **[Multivariate EDA: Duplicates](#Multivariate)**


##### Click [here](#table-of-contents) ⬆️ to return to the Index.
---

#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Data Loading and Description**</span> <a id='data-loading-and-description'></a>  
_This section provides an overview of the datasets, including their structure, size, and general characteristics._  

##### Click [here](#table-of-contents) ⬆️ to return to the Index.


#### df_market

In [36]:
df_market = pd.read_excel('Data/Case2_Market data.xlsx', sheet_name="Adapted Features")
df_market.head()

Unnamed: 0,date,China_Production,China_Shipment,France_Production,France_Shipment,Germany_Production,Germany_Shipment,Italy_Production,Italy_Shipment,Japan_Production,Japan_Shipment,Switzerland_Production,Switzerland_Shipment,UK_Production,UK_Shipment,US_Production,US_Shipment,Europe_Production,Europe_Shipment,Base_Metal_Price,Energy_Price,Metal_and_Minerals_Price,Natural_Gas_Price,Crude_oil_avg_Price,Copper_price,LCU_EUR,US_Producer_Price_Electrical_equip,UK_Producer_Price_Electrical_equip,Italy_Producer_Price_Electrical_equip,France_Producer_Price_Electrical_equip,Germany_Producer_Price_Electrical_equip,China_Producer_Price_Electrical_equip,US_Production_Index_Machinery_and_equip,World_Production_Index_Machinery_and_equip,Switzerland_Production_Index_Machinery_and_equip,UK_Production_Index_Machinery_and_equip,Italy_Production_Index_Machinery_and_equip,Japan_Production_Index_Machinery_and_equip,France_Production_Index_Machinery_and_equip,Germany_Production_Index_Machinery_and_equip,US_Production_Index_Electrical_equip,World_Production_Index_Electrical_equip,Switzerland_Production_Index_Electrical_equip,UK_Production_Index_Electrical_equip,Italy_Production_Index_Electrical_equip,Japan_Production_Index_Electrical_equip,France_Production_Index_Electrical_equip,Germany_Production_Index_Electrical_equip
0,2004m2,16.940704,16.940704,112.091273,83.458866,82.623037,79.452532,124.289603,86.560493,109.33401,110.495272,91.221862,89.987275,111.353812,73.601265,107.6014,79.24023,97.122911,80.09853,54.039811,44.123338,48.747945,87.076974,39.639458,36.623832,1.2646,78.969864,80.757423,93.020027,,93.230453,,102.491722,97.597374,97.1,106.191977,116.790276,110.890034,118.274109,80.82901,117.723991,,81.1,120.706516,141.510864,106.161262,102.077057,85.9132
1,2004m3,23.711852,23.711852,136.327976,106.168192,100.556582,97.012918,143.411662,106.344544,140.884616,144.686166,85.866287,79.883583,127.558608,84.047595,110.187364,98.619024,113.783904,96.015929,54.666162,47.588957,49.256157,87.192705,42.592034,39.931055,1.2262,79.673569,80.962135,93.540268,,93.335678,,105.62748,113.224892,91.195116,121.625075,139.288391,141.176853,148.121841,102.130104,119.220779,,76.690307,138.30955,152.880234,140.288741,117.225685,97.670815
2,2004m4,24.435235,24.435235,117.791806,92.007646,89.653203,84.932358,129.083828,95.579673,105.853579,102.655769,85.622508,79.740802,108.732297,73.026027,108.166564,89.774031,101.715199,85.167236,54.872715,47.779013,49.423751,91.379923,42.650637,39.134854,1.1985,80.337639,80.757423,93.852425,,93.440903,,103.484955,100.16909,93.793535,104.965505,125.289566,105.648765,125.482231,90.961426,117.441124,,71.552403,115.55733,137.796875,106.271197,105.335777,87.253983
3,2004m5,23.708115,23.708115,109.002541,85.696486,86.880571,82.372794,135.590391,100.087039,101.864777,100.305285,85.378729,79.598021,110.6452,74.591883,108.425887,87.463813,101.275727,84.485767,51.230356,53.590898,46.468392,99.04452,47.517121,36.278433,1.2007,80.798828,80.757423,93.852425,,93.546127,,103.643944,99.581436,96.391954,105.885359,131.988998,101.990361,116.64975,88.082901,117.899216,,66.4145,119.269534,143.860535,101.60871,96.616508,84.675552
4,2004m6,27.009138,27.009138,133.785737,106.641482,99.010814,95.10874,136.424935,110.889719,120.33292,119.61638,85.13495,79.455239,122.02096,82.343346,110.569933,97.364496,112.057197,96.963294,52.876331,50.799575,47.803913,98.636267,44.967605,35.65738,1.2138,80.91349,80.552711,93.956467,,93.440903,,106.062668,109.27771,98.990373,118.252278,132.988922,122.136575,143.248734,100.978699,119.499107,,61.276596,128.849416,144.315308,116.655248,118.45871,95.401802


Next, we set the `date` column as the index of `df_market`, since each value should identify a single time period. First, to make sure no rows are lost or mixed up because of duplicated identifiers, we check that the column contains only unique values:

In [37]:
unique_dates = df_market['date'].nunique()
total_rows = len(df_market)

print(f"The date column has {unique_dates}"
      f" unique values,\nand the dataframe df_market has {total_rows} rows.")
if unique_dates == total_rows:
    print("All date values are unique.")
else:
    print("There are duplicate date values.")

The date column has 219 unique values,
and the dataframe df_market has 219 rows.
All date values are unique.


In [38]:
df_market.set_index('date', inplace = True)

In [39]:
df_market.shape

(219, 47)
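
To align the market data with monthly sales later on, the `date` labels (for example `2004m2`) can be converted into monthly periods. The following is a minimal sketch, assuming every label follows the `<year>m<month>` pattern seen in the preview; `parse_market_date` is a hypothetical helper and the sample labels are made up for illustration.

```python
import pandas as pd


def parse_market_date(label: str) -> pd.Period:
    """Turn a label such as '2004m2' into a monthly pandas Period."""
    year, month = label.lower().split('m')
    return pd.Period(year=int(year), month=int(month), freq='M')


# Illustrative labels in the same format as the df_market index
sample_labels = ['2004m2', '2004m3', '2004m12']
periods = pd.PeriodIndex([parse_market_date(s) for s in sample_labels], freq='M')
print(periods)
```

Applied to the real data, the same helper could be mapped over `df_market.index`, provided no label deviates from the assumed pattern.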

#### df_sales

In [40]:
df_sales = pd.read_csv('Data/Case2_Sales data.csv', delimiter=';')
df_sales.head()

Unnamed: 0,DATE,Mapped_GCK,Sales_EUR
0,01.10.2018,#1,0
1,02.10.2018,#1,0
2,03.10.2018,#1,0
3,04.10.2018,#1,0
4,05.10.2018,#1,0


In [41]:
df_sales.shape

(9802, 3)

#### df_test

In [42]:
df_test = pd.read_csv('Data/Case2_Test Set Template.csv', delimiter=';')
df_test.head()

Unnamed: 0,Month Year,Mapped_GCK,Sales_EUR
0,Mai 22,#3,
1,Jun 22,#3,
2,Jul 22,#3,
3,Aug 22,#3,
4,Sep 22,#3,


In [43]:
df_test.shape

(140, 3)

# <span style="font-size: 38px; background-color:#235987; padding:5px; border-radius:5px;">**Sales**</span> <a id='Data-TypesDU'></a>

---
#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Data Types**</span> <a id='Data-TypesDU'></a>  
_Inspecting the data types shows which columns must be converted before analysis; appropriate types also reduce memory consumption and speed up the later **feature engineering and modeling** steps._   

##### Click [here](#table-of-contents) ⬆️ to return to the Index.

In [44]:
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9802 entries, 0 to 9801
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DATE        9802 non-null   object
 1   Mapped_GCK  9802 non-null   object
 2   Sales_EUR   9802 non-null   object
dtypes: object(3)
memory usage: 229.9+ KB


**Note:** The column data types are not optimized. In data preparation it is important to convert:
- `DATE` to datetime (the values are formatted as dd.mm.yyyy)
- `Sales_EUR` to float
- `Mapped_GCK` to int (removing the character #)

Right now we have a memory usage of 229.9+ KB, but we can clearly see that these three columns can be optimized.
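
The effect of these conversions on memory can be checked with `memory_usage(deep=True)`. The sketch below uses a small made-up frame with the same three columns; the values and the choice of `int16` for `Mapped_GCK` are assumptions for illustration, not the notebook's actual preparation code.

```python
import pandas as pd

# Toy rows in the same raw format as the sales file (all strings)
raw = pd.DataFrame({
    'DATE': ['01.10.2018', '02.10.2018', '03.10.2018'],
    'Mapped_GCK': ['#1', '#1', '#11'],
    'Sales_EUR': ['0', '1234,5', '-20,75'],
})

typed = pd.DataFrame({
    # dd.mm.yyyy strings -> datetime
    'DATE': pd.to_datetime(raw['DATE'], format='%d.%m.%Y'),
    # '#11' -> 11 (int16 is an assumption; product codes are small)
    'Mapped_GCK': raw['Mapped_GCK'].str.replace('#', '', regex=False).astype('int16'),
    # decimal comma -> decimal point -> float
    'Sales_EUR': raw['Sales_EUR'].str.replace(',', '.', regex=False).astype(float),
})

bytes_before = raw.memory_usage(deep=True).sum()
bytes_after = typed.memory_usage(deep=True).sum()
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```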

---
#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Descriptive Summary**</span> <a id='Descriptive-Summary'></a>  
_A detailed summary of the variables, including their central tendency, dispersion, and distribution._  

##### Click [here](#table-of-contents) ⬆️ to return to the Index.

In [45]:
# Summary statistics (all columns are still of object dtype, so count/unique/top/freq are shown)
df_sales.describe().T

Unnamed: 0,count,unique,top,freq
DATE,9802,1216,16.04.2021,14
Mapped_GCK,9802,14,#1,1179
Sales_EUR,9802,2609,0,7134


In [46]:
# Check unique products
print('Unique products:', list(df_sales['Mapped_GCK'].unique()))

Unique products: ['#1', '#11', '#6', '#8', '#12', '#16', '#4', '#5', '#3', '#9', '#14', '#13', '#20', '#36']


The columns of this dataframe don't have missing values.

---
#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Inconsistencies**</span> <a id='inconsistenciesDU'></a>  
_Checking for inconsistencies in data, such as incorrect formats, out-of-range values, or logical errors._ 

##### Click [here](#table-of-contents) ⬆️ to return to the Index.

- `Sales_EUR` == 0:

In [48]:
(df_sales['Sales_EUR'] == '0').sum()

7134

Of the 9802 rows, 7134 have `Sales_EUR` equal to 0, which matches the frequency of the top value in the descriptive summary. These rows represent 'no sales', so we delete them.
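
When counting rows that match a condition, note that `len` applied to a boolean mask returns the total number of rows, whereas `.sum()` counts only the `True` entries. Because `Sales_EUR` is still stored as text at this stage, the comparison also has to be made against the string `'0'`. A minimal sketch on made-up values (the toy frame is an assumption, not the project data):

```python
import pandas as pd

# Toy column stored as strings, like Sales_EUR before conversion
toy = pd.DataFrame({'Sales_EUR': ['0', '12,5', '0', '-3,0']})

mask = toy['Sales_EUR'] == '0'   # boolean Series, one entry per row

rows_in_mask = len(mask)         # counts every row, True or False
zero_rows = int(mask.sum())      # counts only the rows where the mask is True

print(rows_in_mask, zero_rows)
```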

In [49]:
df_sales.drop(df_sales[df_sales['Sales_EUR'] == '0'].index, inplace=True)

- `Sales_EUR` < 0:

In [50]:
negative_values = df_sales[pd.to_numeric(df_sales['Sales_EUR'], errors='coerce') < 0]

These negative values may be errors, but they may also represent returns or discounts, so we keep them.

# <span style="font-size: 27px; background-color:#235987; padding:5px; border-radius:5px;">**Data Preparation**</span> <a id='Data-TypesDU'></a>

In [51]:
# parse DATE as datetime and keep the month as a monthly period (month_year)
df_sales['month_year'] = pd.to_datetime(df_sales['DATE'], format='%d.%m.%Y').dt.to_period('M')

In [52]:
# convert Sales_EUR to float
df_sales['Sales_EUR'] = df_sales['Sales_EUR'].str.replace(',', '.').astype(float)

In [53]:
# convert Mapped_GCK
df_sales['Mapped_GCK'] = df_sales['Mapped_GCK'].str.replace("#", "").astype(int)

In [54]:
# Get unique values of 'Mapped_GCK'
unique_values = df_sales['Mapped_GCK'].unique()

# Create individual dataframes for each unique value of 'Mapped_GCK'
for val in unique_values:
    globals()['df_' + str(val)] = df_sales[df_sales['Mapped_GCK'] == val].reset_index(drop=True)

Next, for each `Mapped_GCK` we aggregate the sales by month: for every `month_year` value, the `Sales_EUR` values are summed into a single monthly total.

In [55]:
# Aggregate each product's sales to monthly totals: df_<n> becomes a
# one-column dataframe named P<n>, indexed by month_year
# (month_year already exists, since it was created before the split)
for val in unique_values:
    monthly_sales = globals()[f'df_{val}'].groupby('month_year')['Sales_EUR'].sum()
    globals()[f'df_{val}'] = monthly_sales.to_frame(name=f'P{val}')

In [57]:
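#Assign the values of each dataframe to the main dataframe
#(copy df_1 so that adding columns does not modify it; the rows are the months present for product 1)
df_sales = df_1.copy()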
df_sales['P3'] = df_3['P3']
df_sales['P4'] = df_4['P4']
df_sales['P5'] = df_5['P5']
df_sales['P6'] = df_6['P6']
df_sales['P8'] = df_8['P8']
df_sales['P9'] = df_9['P9']
df_sales['P11'] = df_11['P11']
df_sales['P12'] = df_12['P12']
df_sales['P13'] = df_13['P13']
df_sales['P14'] = df_14['P14']
df_sales['P16'] = df_16['P16']
df_sales['P20'] = df_20['P20']
df_sales['P36'] = df_36['P36']

# Calculate the sum of each row in df_sales and assign it to a new column 'Sales_EUR'
df_sales['Sales_EUR'] = df_sales.sum(axis=1)

In [58]:
df_sales.head()

month_year,P1,P3,P4,P5,P6,P8,P9,P11,P12,P13,P14,P16,P20,P36,Sales_EUR
2018-10,36098918.79,8089465.96,397760.69,2499061.19,369231.6,586052.74,3219.32,1021303.5,28686.33,27666.1,5770.0,333196.87,4563.14,6474.6,49471370.83
2018-11,5140760.0,11863001.51,371322.42,8993944.04,473046.96,526292.77,1875.9,1898844.8,1070.0,68180.0,17130.0,1377694.32,5798.14,21617.61,30760578.47
2018-12,37889612.12,8736859.39,430100.96,6947507.31,999472.69,271490.71,,1226122.0,17880.6,15655.18,,4762524.66,918.65,13924.52,61312068.79
2019-01,27728148.35,10705300.63,484173.88,8233205.07,598874.1,381400.15,1487.0,2216391.74,21484.0,27198.29,1686.4,942957.19,2398.04,15444.39,51360149.23
2019-02,34793163.53,10167796.86,620031.8,6879250.99,542037.52,368475.57,3234.28,610456.6,34214.74,32638.63,19196.3,257765.04,620.66,8051.15,54336933.67
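
The same wide monthly table can also be built in one step with `groupby` and `unstack`, without creating one dataframe per product. This is a sketch of an alternative on made-up rows, not the approach used in the cells above; the toy values are assumptions.

```python
import pandas as pd

# Toy daily sales for two products (already converted to numeric types)
toy = pd.DataFrame({
    'DATE': ['01.10.2018', '15.10.2018', '02.11.2018', '03.10.2018', '20.11.2018'],
    'Mapped_GCK': [1, 1, 1, 3, 3],
    'Sales_EUR': [100.0, 50.0, 30.0, 10.0, -5.0],
})
toy['month_year'] = pd.to_datetime(toy['DATE'], format='%d.%m.%Y').dt.to_period('M')

# Sum per (month, product), then move the product level into the columns
wide = toy.groupby(['month_year', 'Mapped_GCK'])['Sales_EUR'].sum().unstack('Mapped_GCK')
wide.columns = [f'P{gck}' for gck in wide.columns]

# Row-wise total across products (missing values are skipped by sum)
wide['Sales_EUR'] = wide.sum(axis=1)
print(wide)
```

Unlike assigning columns onto the first product's dataframe, this variant keeps every month that appears for any product.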


Now we want to calculate the percentage change for each column and create new columns with that percentage:

In [59]:
# Temporarily replace all 0 values in df_sales with 1, so that no percentage
# change is computed relative to a zero base (division by zero)
df_sales = df_sales.replace(0, 1)

# Store the names of the nominal (absolute) sales columns of df_sales
nominal_prod_list = list(df_sales.columns)

# Initialize an empty list to store the names of the columns containing percent changes
percent_change_prod_list = []

# For each nominal column in df_sales, calculate the percent change and add a new column with the suffix '%Change'
for col in nominal_prod_list:
    df_sales[col + '%Change'] = df_sales[col].pct_change()
    
    # Append the name of the new column to the percent_change_prod_list
    percent_change_prod_list.append(col + '%Change')

# Restore the zeros: replace the placeholder 1 values in the columns of 'nominal_prod_list' with 0
df_sales[nominal_prod_list] = df_sales[nominal_prod_list].replace(1, 0)


The default fill_method='pad' in Series.pct_change is deprecated and will be removed in a future version. Either fill in any non-leading NA values prior to calling pct_change or specify 'fill_method=None' to not fill NA values.
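
The warning concerns how `pct_change` treats missing months. The percentage change is computed as $(x_t - x_{t-1}) / x_{t-1}$, and the current default forward-fills gaps before applying it. The sketch below, on a made-up series, contrasts an explicit forward fill (which mirrors that default) with `fill_method=None`, which leaves gaps as missing; which behaviour is preferable for the sales data is a modeling decision not settled here.

```python
import pandas as pd

# Made-up monthly values with one missing month
s = pd.Series([100.0, 110.0, None, 121.0])

# Explicit forward fill, then percent change without any implicit filling
filled_change = s.ffill().pct_change(fill_method=None)

# No filling at all: any change involving the missing month stays NaN
strict_change = s.pct_change(fill_method=None)

print(pd.DataFrame({'value': s, 'filled': filled_change, 'strict': strict_change}))
```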





In [None]:
#df_sales.to_csv('df_sales.csv')

---
#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Sales Visualizations**</span> <a id='VisualizationsDU'></a>  

##### Click [here](#table-of-contents) ⬆️ to return to the Index.

In [73]:
# Creating a lineplot to visualize the monthly sales development of the different products
fig = px.line(df_sales, x=df_sales.index.to_timestamp(), y=nominal_prod_list)
# fig.add_shape(
#     dict(
#         type='line',
#         x0='2021-05',  
#         y0=0,
#         x1='2021-05',
#         y1=1,
#         xref='x',
#         yref='paper',
#         line=dict(color='red', width=2, dash='dash')
#     )
# )
fig.show()

In [74]:
# Creating a lineplot to visualize the monthly percentage change in sales of the different products
fig = px.line(df_sales, x=df_sales.index.to_timestamp(), y=percent_change_prod_list)
# fig.add_shape(
#             dict(type='line', x0=31, y0=0, x1=31, y1=1, xref='x', yref='paper',
#                  line=dict(color='red', width=2, dash='dash'))
#         )
fig.show()

---
#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Augmented Dickey-Fuller (ADF) test**</span> <a id='Augmented Dickey-Fuller (ADF) testDU'></a>  

##### Click [here](#table-of-contents) ⬆️ to return to the Index.

Now we will perform an **Augmented Dickey-Fuller (ADF) test** to check if the `Sales_EUR` time series is stationary or non-stationary.

- Null Hypothesis (H₀): The series is non-stationary
- Alternative Hypothesis (H₁): The series is stationary

**result** stores multiple outputs, where:
- result[0]: ADF test statistic.
- result[1]: p-value (used to determine stationarity).

In [76]:
result = adfuller(df_sales['Sales_EUR'])

if result[1] <= 0.05:
    print("Sales_EUR is stationary (reject the null hypothesis)")
else:
    print("Sales_EUR is not stationary (fail to reject the null hypothesis)")

# printing the test statistic
print('ADF Statistic:', result[0])

Sales_EUR is stationary (reject the null hypothesis)
ADF Statistic: -5.329516385633666


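Since the p-value is at most 0.05 (ADF statistic ≈ -5.33), we reject the null hypothesis and conclude that **Sales_EUR is stationary**: the test finds no evidence of a unit root, so its statistical properties (mean, variance, autocorrelation) can be treated as constant over time.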