<p align="center">
  <img src="header.jpg" width="100%">
</p>


<div style="text-align: center;">
    <strong style="display: block; margin-bottom: 10px;">Group P</strong> 
    <table style="margin: 0 auto; border-collapse: collapse; border: 1px solid black;">
        <tr>
            <th style="border: 1px solid white; padding: 8px;">Name</th>
            <th style="border: 1px solid white; padding: 8px;">Student ID</th>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Beatriz Monteiro</td>
            <td style="border: 1px solid white; padding: 8px;">20240591</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Catarina Nunes</td>
            <td style="border: 1px solid white; padding: 8px;">20230083</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Margarida Raposo</td>
            <td style="border: 1px solid white; padding: 8px;">20241020</td>
        </tr>
        <tr>
            <td style="border: 1px solid white; padding: 8px;">Teresa Menezes</td>
            <td style="border: 1px solid white; padding: 8px;">20240333</td>
        </tr>
    </table>
</div>

### 🔗 Table of Contents <a id='table-of-contents'></a>
1. [Introduction](#introduction)  
2. [Business Understanding](#business-understanding)  
3. [Data Understanding](#data-understanding)  
4. [Data Preparation](#data-preparation)  
5. [Modeling](#modeling)  
6. [Evaluation](#evaluation)  
7. [Conclusion](#conclusion)  

---

### <span style="background-color:#235987; padding:5px; border-radius:5px;"> 📌 Introduction</span> <a id='introduction'></a>

This project follows the **CRISP-DM** methodology to produce a monthly sales forecast for the Smart Infrastructure business unit of Siemens. 


#### Monthly Sales Forecast

<p style="margin-bottom: 50px;"> This case study focuses on ... </p>

**Market Data** 
| Features                                      | Feature Description |
|-----------------------------------------------|---------------------|
| *MAB_ELE_PRO156*                              |  |
| *MAB_ELE_SHP156*                              |  |
| *MAB_ELE_PRO250*                              |  |
| *MAB_ELE_SHP250*                              |  |
| *MAB_ELE_PRO276*                              |  |
| *MAB_ELE_SHP276*                              |  |
| *MAB_ELE_PRO380*                              |  |
| *MAB_ELE_SHP380*                              |  |
| *MAB_ELE_PRO392*                              |  |
| *MAB_ELE_SHP392*                              |  |
| *MAB_ELE_PRO756*                              |  |
| *MAB_ELE_SHP756*                              |  |
| *MAB_ELE_PRO826*                              |  |
| *MAB_ELE_SHP826*                              |  |
| *MAB_ELE_PRO840*                              |  |
| *MAB_ELE_SHP840*                              |  |
| *MAB_ELE_PRO1100*                             |  |
| *MAB_ELE_SHP1100*                             |  |
| *RohiBASEMET1000_org*                         |  |
| *RohiENERGY1000_org*                          |  |
| *RohiMETMIN1000_org*                          |  |
| *RohiNATGAS1000_org*                          |  |
| *RohCRUDE_PETRO1000_org*                      |  |
| *RohCOPPER1000_org*                           |  |
| *WKLWEUR840_org*                              |  |
| *PRI27840_org*                                |  |
| *PRI27826_org*                                |  |
| *PRI27380_org*                                |  |
| *PRI27250_org*                                |  |
| *PRI27276_org*                                |  |
| *PRI27156_org*                                |  |
| *PRO28840_org*                                |  |
| *PRO281000_org*                               |  |
| *PRO28756_org*                                |  |
| *PRO28826_org*                                |  |
| *PRO28380_org*                                |  |
| *PRO28392_org*                                |  |
| *PRO28250_org*                                |  |
| *PRO28276_org*                                |  |
| *PRO27840_org*                                |  |
| *PRO271000_org*                               |  |
| *PRO27756_org*                                |  |
| *PRO27826_org*                                |  |
| *PRO27380_org*                                |  |
| *PRO27392_org*                                |  |
| *PRO27250_org*                                |  |
| *PRO27276_org*                                |  |

- **Production index** is an economic indicator that measures the output of a country's industrial sector relative to a base period. In this case the base year is 2010 (index = 100), meaning that **a production index above 100 indicates that production increased compared to 2010**, while a value below 100 indicates a decrease. 
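
To make the reading of an index value concrete, the sketch below converts index values into a percentage change relative to the base level of 100. The helper name `pct_vs_base` and the sample values are illustrative assumptions; they are not taken from the market data.

```python
import pandas as pd

def pct_vs_base(index_values, base=100.0):
    """Percentage change of each index value relative to the base level."""
    return (pd.Series(index_values, dtype=float) - base) / base * 100.0

# Illustrative values only (not taken from the dataset)
sample = pd.Series(
    [112.0, 100.0, 85.5],
    index=["above base", "at base", "below base"],
)
change = pct_vs_base(sample)
print(change)
```

A positive result means output is above the base-year level, and a negative result means it is below.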


**Sales Data**
| Features                                      | Feature Description |
|-----------------------------------------------|---------------------|
| *DATE*                                       |   |
| *Mapped_GCK*                                 |   |
| *Sales_EUR*                                  |   |

<b style="background-color:#A9A9A9; padding:5px; border-radius:5px; display: inline-block; margin-top: 50px;">CRISP-DM</b>

<ul style="margin-bottom: 30px;">
    <li><u>Business Understanding</u>: Defining objectives, assessing resources, and project planning.</li>
    <li><u>Data Understanding</u>: Collecting, exploring, and verifying data quality.</li>
    <li><u>Data Preparation</u>: Selecting, cleaning, constructing, integrating, and formatting data to ensure it is ready for analysis.</li>
    <li><u>Modeling</u>: Selecting and applying various modeling techniques while calibrating their parameters to optimal values.</li>
    <li><u>Evaluation</u>: Selecting the best-performing models and thoroughly evaluating whether they align with the business objectives.</li>
    <li><u>Deployment</u>: Bridging the data mining goals and the business application of the finalized model.</li>
</ul>

<hr style="margin-top: 30px;">


In [51]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [52]:
pd.set_option('display.max_columns', None)

In [53]:
# Compose a palette to use in the visualizations
pal_novaims = ['#003B5C','#003B5C','#003B5C','#003B5C','#003B5C']
pastel_color = sns.utils.set_hls_values(pal_novaims[1], l=0.4, s=0.3)

### <span style="background-color:#235987; padding:5px; border-radius:5px;"> 📌 Business Understanding</span> <a id='business-understanding'></a>

##### Click [here](#table-of-contents) ⬆️ to return to the Index.
---

The **Business Understanding** phase covers the background that led to the project, as well as the business goals and requirements it has to meet. 

<b style="background-color:#A9A9A9; padding:5px; border-radius:5px;">Primary Business Objective</b> : 


<b style="background-color:#A9A9A9; padding:5px; border-radius:5px;">Plan</b> : 

### <span style="background-color:#235987; padding:5px; border-radius:5px;"> 📌 Data Understanding</span> <a id='data-understanding'></a>

- **[Data Loading and Description](#data-loading-and-description)**  
- **[Data Types](#Data-TypesDU)**
- **[Univariate EDA: Descriptive Summary](#Descriptive-Summary)**
- **[Univariate EDA: Missing values](#missing-valuesDU)**  
- **[Inconsistencies](#inconsistenciesDU)**  
- **[Feature Engineering](#feature-engineeringDU)**  
- **[Univariate EDA: Data Visualization](#univariate-vizualization)**  
    - **Numerical Variables:**  
        - [Numeric variables: Histograms](#hist)
        - [Outliers Analysis: Box-Plots](#box)
    - **Categorical Variables**  
        - [Categorical variables: Bar Plots](#bar)
        - [Categorical variables: Geographic Map](#GeographicMap)
- **[Bivariate EDA: Data Visualization](#Bivariate-Vizualization)**  
   - [Numeric-Numeric: Correlations](#NNCorrelations)
   - [Numeric-Categorical: Correlations](#NCCorrelations)
   - [Categorical-Categorical: Cross-tabulations](#CCCross-tabulations)
- **[Multivariate EDA: Duplicates](#Multivariate)**
   - [Old Segmentation Vs. All](#old-segmentation)
   - [Duplicates](#duplicatesdu)  
- **[Market Basket Analysis](#MBA)**


##### Click [here](#table-of-contents) ⬆️ to return to the Index.
---

#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Data Loading and Description**</span> <a id='data-loading-and-description'></a>  
_This section provides an overview of the dataset, including its structure, size, and general characteristics._  

##### Click [here](#table-of-contents) ⬆️ to return to the Index.


In [54]:
df_market = pd.read_excel('Data/Case2_Market data.xlsx', sheet_name="Original Values", index_col=0, header=2)
df_market.head()

Unnamed: 0_level_0,MAB_ELE_PRO156,MAB_ELE_SHP156,MAB_ELE_PRO250,MAB_ELE_SHP250,MAB_ELE_PRO276,MAB_ELE_SHP276,MAB_ELE_PRO380,MAB_ELE_SHP380,MAB_ELE_PRO392,MAB_ELE_SHP392,MAB_ELE_PRO756,MAB_ELE_SHP756,MAB_ELE_PRO826,MAB_ELE_SHP826,MAB_ELE_PRO840,MAB_ELE_SHP840,MAB_ELE_PRO1100,MAB_ELE_SHP1100,RohiBASEMET1000_org,RohiENERGY1000_org,RohiMETMIN1000_org,RohiNATGAS1000_org,RohCRUDE_PETRO1000_org,RohCOPPER1000_org,WKLWEUR840_org,PRI27840_org,PRI27826_org,PRI27380_org,PRI27250_org,PRI27276_org,PRI27156_org,PRO28840_org,PRO281000_org,PRO28756_org,PRO28826_org,PRO28380_org,PRO28392_org,PRO28250_org,PRO28276_org,PRO27840_org,PRO271000_org,PRO27756_org,PRO27826_org,PRO27380_org,PRO27392_org,PRO27250_org,PRO27276_org
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1
2004m2,16.940704,16.940704,112.091273,83.458866,82.623037,79.452532,124.289603,86.560493,109.33401,110.495272,91.221862,89.987275,111.353812,73.601265,107.6014,79.24023,97.122911,80.09853,54.039811,44.123338,48.747945,87.076974,39.639458,36.623832,1.2646,78.969864,80.757423,93.020027,,93.230453,,102.491722,97.597374,97.1,106.191977,116.790276,110.890034,118.274109,80.82901,117.723991,,81.1,120.706516,141.510864,106.161262,102.077057,85.9132
2004m3,23.711852,23.711852,136.327976,106.168192,100.556582,97.012918,143.411662,106.344544,140.884616,144.686166,85.866287,79.883583,127.558608,84.047595,110.187364,98.619024,113.783904,96.015929,54.666162,47.588957,49.256157,87.192705,42.592034,39.931055,1.2262,79.673569,80.962135,93.540268,,93.335678,,105.62748,113.224892,91.195116,121.625075,139.288391,141.176853,148.121841,102.130104,119.220779,,76.690307,138.30955,152.880234,140.288741,117.225685,97.670815
2004m4,24.435235,24.435235,117.791806,92.007646,89.653203,84.932358,129.083828,95.579673,105.853579,102.655769,85.622508,79.740802,108.732297,73.026027,108.166564,89.774031,101.715199,85.167236,54.872715,47.779013,49.423751,91.379923,42.650637,39.134854,1.1985,80.337639,80.757423,93.852425,,93.440903,,103.484955,100.16909,93.793535,104.965505,125.289566,105.648765,125.482231,90.961426,117.441124,,71.552403,115.55733,137.796875,106.271197,105.335777,87.253983
2004m5,23.708115,23.708115,109.002541,85.696486,86.880571,82.372794,135.590391,100.087039,101.864777,100.305285,85.378729,79.598021,110.6452,74.591883,108.425887,87.463813,101.275727,84.485767,51.230356,53.590898,46.468392,99.04452,47.517121,36.278433,1.2007,80.798828,80.757423,93.852425,,93.546127,,103.643944,99.581436,96.391954,105.885359,131.988998,101.990361,116.64975,88.082901,117.899216,,66.4145,119.269534,143.860535,101.60871,96.616508,84.675552
2004m6,27.009138,27.009138,133.785737,106.641482,99.010814,95.10874,136.424935,110.889719,120.33292,119.61638,85.13495,79.455239,122.02096,82.343346,110.569933,97.364496,112.057197,96.963294,52.876331,50.799575,47.803913,98.636267,44.967605,35.65738,1.2138,80.91349,80.552711,93.956467,,93.440903,,106.062668,109.27771,98.990373,118.252278,132.988922,122.136575,143.248734,100.978699,119.499107,,61.276596,128.849416,144.315308,116.655248,118.45871,95.401802


In [55]:
df_market.tail()

Unnamed: 0_level_0,MAB_ELE_PRO156,MAB_ELE_SHP156,MAB_ELE_PRO250,MAB_ELE_SHP250,MAB_ELE_PRO276,MAB_ELE_SHP276,MAB_ELE_PRO380,MAB_ELE_SHP380,MAB_ELE_PRO392,MAB_ELE_SHP392,MAB_ELE_PRO756,MAB_ELE_SHP756,MAB_ELE_PRO826,MAB_ELE_SHP826,MAB_ELE_PRO840,MAB_ELE_SHP840,MAB_ELE_PRO1100,MAB_ELE_SHP1100,RohiBASEMET1000_org,RohiENERGY1000_org,RohiMETMIN1000_org,RohiNATGAS1000_org,RohCRUDE_PETRO1000_org,RohCOPPER1000_org,WKLWEUR840_org,PRI27840_org,PRI27826_org,PRI27380_org,PRI27250_org,PRI27276_org,PRI27156_org,PRO28840_org,PRO281000_org,PRO28756_org,PRO28826_org,PRO28380_org,PRO28392_org,PRO28250_org,PRO28276_org,PRO27840_org,PRO271000_org,PRO27756_org,PRO27826_org,PRO27380_org,PRO27392_org,PRO27250_org,PRO27276_org
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1
2021m12,310.763183,310.763183,100.565744,134.589504,118.103281,149.364286,94.006826,150.482735,127.771735,131.029703,106.704029,104.819189,101.273544,,107.040766,148.590371,123.076659,150.046922,125.20703,112.372958,116.715183,236.488368,92.188708,126.76124,1.1304,128.511261,,113.309631,108.18251,115.748863,98.1062,105.736748,134.598755,102.27753,90.350055,103.191399,136.975506,112.791885,129.188248,109.624107,132.281006,114.326241,121.065762,72.915611,109.005151,80.763306,97.773956
2022m1,235.956129,235.956129,85.743503,108.15632,94.55061,120.353403,86.851008,101.258277,110.460181,110.823532,103.49926,101.70157,95.003541,,111.052133,129.565798,103.199827,120.338095,133.219393,121.309886,125.229641,196.91114,106.173052,129.829146,1.1314,131.62851,,115.390617,111.037476,117.853386,98.280171,110.894371,117.489883,100.305236,85.44417,92.292313,117.861377,90.558372,92.343117,111.36467,122.236023,108.999212,112.324119,74.355736,95.369065,77.944954,98.599052
2022m2,235.956129,235.956129,90.60354,117.71577,103.987916,129.383676,106.583758,120.956538,117.879631,118.300232,100.294492,98.583952,98.458412,,116.336327,138.56033,113.500635,131.500126,138.905572,131.273215,131.176501,197.523679,118.348203,131.963648,1.1342,133.342178,,116.431107,112.057098,118.905647,98.714158,117.168167,124.627762,98.332942,89.021378,113.290565,124.710859,97.766502,102.820961,114.6884,127.373421,103.672183,115.55733,91.182419,103.950687,79.001831,106.128059
2022m3,329.413367,329.413367,107.843548,136.85872,121.308119,151.201314,124.637966,153.645142,152.000561,156.400634,97.089723,95.466333,121.993915,,117.654038,165.926217,133.13301,158.055622,149.890871,163.186834,141.283339,271.079906,142.200872,135.782207,1.1019,136.153778,,117.471596,112.362991,119.852684,99.021554,118.910912,149.375229,96.360648,109.155949,134.288818,160.954233,114.72081,122.049515,115.164093,152.452942,98.345154,145.254965,102.475998,133.743932,96.704582,119.948433
2022m4,267.373145,267.373145,87.69811,116.528738,99.522205,127.022869,103.55669,128.733305,114.262328,115.012049,,,95.266502,,116.961047,,112.902215,134.93551,146.090998,153.188945,138.094143,243.43603,130.83543,134.859685,1.0819,137.531616,,118.408043,113.280655,121.220627,98.857087,119.385483,128.285706,,84.728728,111.090744,120.09881,91.979698,98.675873,112.158089,134.843353,,114.359844,86.255684,102.36168,80.763306,101.074341


In [56]:
df_market.shape

(219, 47)
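
The index labels of `df_market` are strings such as `2004m2`, which sort alphabetically rather than chronologically and cannot be aligned with the sales dates directly. One possible way to convert them into monthly periods is sketched below. `parse_year_month` is a hypothetical helper, and the `strip()` call assumes the labels may carry stray whitespace.

```python
import pandas as pd

def parse_year_month(label):
    """Turn a label like '2004m2' into a monthly pandas Period."""
    year, month = str(label).strip().split("m")
    return pd.Period(year=int(year), month=int(month), freq="M")

# Sample labels in the same format as the market data index
labels = ["2004m2", " 2004m3", "2021m12"]
monthly_index = pd.PeriodIndex([parse_year_month(x) for x in labels], freq="M")
print(monthly_index)
```

Under these assumptions, the same helper could be applied to `df_market.index` before joining the market data with monthly sales.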

In [57]:
df_market.info()

<class 'pandas.core.frame.DataFrame'>
Index: 219 entries,  2004m2 to  2022m4
Data columns (total 47 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   MAB_ELE_PRO156          219 non-null    float64
 1   MAB_ELE_SHP156          219 non-null    float64
 2   MAB_ELE_PRO250          219 non-null    float64
 3   MAB_ELE_SHP250          219 non-null    float64
 4   MAB_ELE_PRO276          219 non-null    float64
 5   MAB_ELE_SHP276          219 non-null    float64
 6   MAB_ELE_PRO380          219 non-null    float64
 7   MAB_ELE_SHP380          219 non-null    float64
 8   MAB_ELE_PRO392          219 non-null    float64
 9   MAB_ELE_SHP392          219 non-null    float64
 10  MAB_ELE_PRO756          218 non-null    float64
 11  MAB_ELE_SHP756          218 non-null    float64
 12  MAB_ELE_PRO826          219 non-null    float64
 13  MAB_ELE_SHP826          201 non-null    float64
 14  MAB_ELE_PRO840          219 non-null 

In [58]:
missing_values = df_market.isnull().sum()

features_with_missing_values = missing_values[missing_values > 0]

features_with_missing_values

MAB_ELE_PRO756     1
MAB_ELE_SHP756     1
MAB_ELE_SHP826    18
MAB_ELE_SHP840     1
PRI27826_org      18
PRI27250_org      35
PRI27156_org      23
PRO28756_org       1
PRO271000_org     11
PRO27756_org       1
dtype: int64

In [59]:
df_market.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MAB_ELE_PRO156,219.0,138.303637,78.883209,16.940704,68.47774,133.50769,198.473934,329.413367
MAB_ELE_SHP156,219.0,138.303637,78.883209,16.940704,68.47774,133.50769,198.473934,329.413367
MAB_ELE_PRO250,219.0,104.431918,18.918529,50.75668,93.613505,102.736556,114.090851,152.743402
MAB_ELE_SHP250,219.0,105.316814,12.762209,64.420676,97.452819,106.012166,115.030479,136.85872
MAB_ELE_PRO276,219.0,107.499126,11.861942,74.332913,100.560897,108.99229,115.735786,130.869962
MAB_ELE_SHP276,219.0,114.898377,17.091571,71.787161,103.149778,117.428836,127.11222,151.297092
MAB_ELE_PRO380,219.0,105.228363,23.509638,34.213427,94.335162,105.088474,117.031701,153.940791
MAB_ELE_SHP380,219.0,105.735378,19.948183,45.19171,95.985839,107.695805,119.83636,153.645142
MAB_ELE_PRO392,219.0,111.948146,15.489336,67.53194,103.740049,111.683015,121.402653,153.898678
MAB_ELE_SHP392,219.0,112.670602,16.891947,64.372344,103.453182,112.597293,121.498141,159.495942


In [60]:
df_sales = pd.read_csv('Data/Case2_Sales data.csv', delimiter=';')
df_sales.head()

Unnamed: 0,DATE,Mapped_GCK,Sales_EUR
0,01.10.2018,#1,0
1,02.10.2018,#1,0
2,03.10.2018,#1,0
3,04.10.2018,#1,0
4,05.10.2018,#1,0


In [61]:
df_sales.tail()

Unnamed: 0,DATE,Mapped_GCK,Sales_EUR
9797,23.08.2019,#12,0
9798,23.08.2019,#36,1015
9799,12.08.2019,#12,0
9800,28.08.2019,#8,4376391
9801,27.08.2019,#8,0


In [62]:
df_sales.shape

(9802, 3)

In [63]:
df_sales.describe()

Unnamed: 0,DATE,Mapped_GCK,Sales_EUR
count,9802,9802,9802
unique,1216,14,2609
top,16.04.2021,#1,0
freq,14,1179,7134


In [64]:
df_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9802 entries, 0 to 9801
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   DATE        9802 non-null   object
 1   Mapped_GCK  9802 non-null   object
 2   Sales_EUR   9802 non-null   object
dtypes: object(3)
memory usage: 229.9+ KB


#### <span style="background-color:#235987; padding:5px; border-radius:5px;">**Sales Data**</span> <a id='sales-data'></a>  
_This section provides a comprehensive understanding of the Sales data and its preprocessing steps._

##### Click [here](#table-of-contents) ⬆️ to return to the Index.

> Functions

In [86]:
#plot the sales data over a period of time
def plot_sales(dataframe, x_column, y_column, title, x_label, y_label):
    # Create the line plot
    fig = px.line(dataframe, 
                  x=x_column, 
                  y=y_column,
                  title=title, 
                  labels={x_column: x_label, y_column: y_label})

    fig.show()

In [84]:
def plot_sales_by_product(dataframe, x_column, y_column, title, x_label, y_label):
    fig = px.bar(dataframe, 
                 x=x_column, 
                 y=y_column, 
                 title=title, 
                 labels={x_column: x_label, y_column: y_label},
                 text=y_column)
    
    fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
    fig.update_layout(xaxis=dict(type='category', tickmode='linear'))
    fig.show()

In [88]:
def plot_sales_by_product_month(dataframe, x_column, y_column, category_column, title, x_label, y_label):
    fig = px.line(dataframe, 
                  x=x_column, 
                  y=y_column, 
                  color=category_column,  # Different lines for each product
                  title=title, 
                  labels={x_column: x_label, y_column: y_label, category_column: "Product"})
    
    fig.update_layout(xaxis=dict(tickangle=-45))  # Rotate x-axis labels for better readability
    fig.show()

> Dtypes

In [66]:
# Fixing dtypes of df_sales
df_sales['DATE'] = pd.to_datetime(df_sales['DATE'], dayfirst=True)
df_sales['Sales_EUR'] = df_sales['Sales_EUR'].str.replace(',', '.').astype(float)

# Dropping the '#' prefix from the values of the column 'Mapped_GCK'
df_sales['Mapped_GCK'] = df_sales['Mapped_GCK'].str.replace('#', '')
df_sales['Mapped_GCK'] = df_sales['Mapped_GCK'].astype(int)

In [67]:
# Mapped_GCK categories by frequency
df_sales['Mapped_GCK'].value_counts()

Mapped_GCK
1     1179
3     1017
5      959
8      944
4      877
12     803
6      794
16     765
11     732
13     441
36     434
9      333
20     293
14     231
Name: count, dtype: int64

In [68]:
# Drop rows in df_sales with Sales_EUR = 0
df_sales = df_sales[df_sales['Sales_EUR'] != 0]

In [69]:
# Rows with values for Sales
df_sales.shape[0]

2668

In [70]:
# Max and min values of Sales
df_sales['Sales_EUR'].max(), df_sales['Sales_EUR'].min()

(41127988.02, -506381.17)

In [71]:
# Take an explicit copy so that columns can be added later without a SettingWithCopyWarning
negative_sales = df_sales[df_sales['Sales_EUR'] < 0].copy()
negative_sales 

Unnamed: 0,DATE,Mapped_GCK,Sales_EUR
90,2022-02-02,6,-54.00
96,2022-04-04,11,-587.59
102,2021-12-02,3,-183.50
130,2022-03-02,6,-93.88
155,2021-05-03,5,-166.24
...,...,...,...
9333,2019-09-03,20,-6.60
9466,2018-11-05,4,-2029.91
9487,2019-04-30,13,-26.31
9681,2020-07-31,9,-35.25


In [87]:
negative_sales['Year_Month'] = negative_sales['DATE'].dt.to_period('M')
grouped_negative_sales = negative_sales.groupby('Year_Month')['Sales_EUR'].sum().reset_index()

grouped_negative_sales['Year_Month'] = grouped_negative_sales['Year_Month'].astype(str)

plot_sales(grouped_negative_sales, 'Year_Month', 'Sales_EUR', 'Negative Sales by Month', 'Year-Month', 'Total Sales (EUR)')






In [76]:
negative_sales['Day'] = negative_sales['DATE'].dt.day
grouped_by_day = negative_sales.groupby('Day')['Sales_EUR'].sum().reset_index()

grouped_by_day['Day'] = grouped_by_day['Day'].astype(str)

plot_sales(grouped_by_day, 'Day', 'Sales_EUR', 'Negative Sales by Day', 'Day', 'Total Sales (EUR)')






In [82]:
grouped_by_products= negative_sales.groupby('Mapped_GCK')['Sales_EUR'].sum().reset_index()

plot_sales_by_product(grouped_by_products, 'Mapped_GCK', 'Sales_EUR', 'Negative Sales by Product', 'Product Group', 'Total Negative Sales (EUR)')

**Negative values for Sales** 

At first glance, the negative sales values show no evident pattern. They might stem from:
- **Dataset errors**: errors that occurred while the sales records were being entered.
- **Production errors**: errors that occurred in the production of a certain product or product group.
- **Recording rules**: companies define their own rules for recording transactions, some of which may produce negative values, e.g. booking discounts as negative income.
- **Returns**: customer returns come with a refund, creating a negative cash flow for the company that offsets the positive cash flow generated at the time of purchase.

At first sight, the **Negative Sales by Month** plot shows no evident pattern. However, January 2020 stands out as a significant outlier, with a much lower value than any other month, which may indicate production errors in a specific product group. Excluding this outlier, a subtle pattern can be observed: the monthly totals tend to drop noticeably twice a year, once in the first half and once in the second, though the exact months vary.
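
A visual impression of an outlier month can be cross-checked with a simple rule. The sketch below applies Tukey's IQR rule (lower fence only) to monthly totals; it is not the method used in this notebook, `flag_low_outliers` is a hypothetical helper, and the monthly values are invented for illustration.

```python
import pandas as pd

def flag_low_outliers(series, k=1.5):
    """Flag values below Q1 - k * IQR (Tukey's rule, lower fence only)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    lower_fence = q1 - k * (q3 - q1)
    return series < lower_fence

# Made-up monthly totals of negative sales (EUR), for illustration only
monthly = pd.Series(
    [-1200.0, -900.0, -1500.0, -50000.0, -1100.0, -1300.0],
    index=["2019-10", "2019-11", "2019-12", "2020-01", "2020-02", "2020-03"],
)
outliers = flag_low_outliers(monthly)
print(monthly[outliers])
```

The same function could be applied to the `Sales_EUR` column of `grouped_negative_sales`, with `k` adjusted to make the rule more or less strict.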

In the **Negative Sales by Day** plot, the most negative totals tend to fall in the first days of the month and become less pronounced toward the last days. Apart from the first days, there is a second, less evident peak in negative sales around the 20th. This would coincide with the last few days before workers receive their salary, which usually happens around the 25th in Germany, and may be an indication of returns. 

Product groups with the **most negative sales**, starting from the lowest value: Group **1**, Group **5**, Group **3**, Group **11**, Group **6**, Group **16**.
- Are these also the products with the most sales? Is the negative sales value proportional to the positive values?
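
One way to approach the question above is to compare, per product group, the total of negative sales with the total of positive sales. The sketch below is a minimal version under stated assumptions: `negative_share_by_group` is a hypothetical helper, the toy values are invented, and only the column names `Mapped_GCK` and `Sales_EUR` come from `df_sales`.

```python
import pandas as pd

def negative_share_by_group(df, group_col="Mapped_GCK", value_col="Sales_EUR"):
    """Per group: positive total, negative total, and |negative| / positive."""
    # clip() zeroes out the values of the opposite sign before summing
    positive = df[value_col].clip(lower=0).groupby(df[group_col]).sum()
    negative = df[value_col].clip(upper=0).groupby(df[group_col]).sum()
    summary = pd.DataFrame({"positive": positive, "negative": negative})
    summary["negative_share"] = summary["negative"].abs() / summary["positive"]
    return summary.sort_values("negative_share", ascending=False)

# Toy data in the same shape as df_sales (values are invented)
toy = pd.DataFrame({
    "Mapped_GCK": [1, 1, 1, 5, 5, 3],
    "Sales_EUR": [1000.0, -100.0, 500.0, 200.0, -50.0, 300.0],
})
summary = negative_share_by_group(toy)
print(summary)
```

A roughly constant `negative_share` across groups would suggest that negative sales scale with sales volume, while a group with an unusually high share would deserve a closer look. A group with no positive sales yields a division by zero (`inf` or `NaN`) and would need separate handling.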

In [90]:
negative_sales['Year_Month'] = negative_sales['DATE'].dt.to_period('M').astype(str)
grouped_negative_sales_prod_month = negative_sales.groupby(['Year_Month', 'Mapped_GCK'])['Sales_EUR'].sum().reset_index()

plot_sales_by_product_month(grouped_negative_sales_prod_month, 'Year_Month', 'Sales_EUR', 'Mapped_GCK', 
                            'Negative Sales by Product and Month', 'Year-Month', 'Total Sales (EUR)')




