<a href="https://colab.research.google.com/github/Anello92/BusinessAnalytics/blob/master/Rossmann.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **Rossmann Pharmacies Sales Forecast Project**
This project aims to develop a sales forecast system for the Rossmann company, a chain of over 3,000 pharmacies operating in 7 European countries. The need arose after a results meeting, in which Rossmann store managers were tasked with providing daily sales forecasts for the next six weeks. With over a thousand managers requesting their own forecasts, the challenge is to create an accurate and efficient solution that considers the peculiarities of each store, including factors such as promotions, competition, schools, holidays, seasonality, and locality.

### **Demand Motivation**
Before starting to devise the solution, it's essential to understand the motivation behind this demand. Instead of merely implementing an off-the-shelf solution, it's necessary to question and understand the reasons that led to the need for this forecast. The focus should be on the root cause of the problem to ensure the solution adequately meets the managers' expectations and needs.

### **Solution Format**
Furthermore, it's crucial to define the solution's format. For this, it's important to know whether managers need daily, weekly, or monthly forecasts and whether they want a specific granularity, such as by store, product, or category. It's also relevant to determine the type of problem (for example, classification, regression, or time-series forecasting) to identify appropriate methods to address it.

Once these aspects have been understood, the data team can seek the data available in the Rossmann system and explore other relevant sources, ensuring that all necessary information is available to develop an accurate and reliable solution.

### **Project's Goal**
This project aims to provide a tailored forecast for each store, considering its specific circumstances. By creating a system accessible through mobile devices, such as cell phones, managers will have easy access to forecasts, enabling better-informed decision-making regarding promotions, investments, and inventory planning.

A combination of data analysis techniques, such as regression and neural networks, will be explored to find the best solution to the problem. The solution's development will be done with the involvement of the managers, ensuring that their needs are met and that the final solution is effectively adopted.

Based on this context and the available information, the team will seek to improve the efficiency of Rossmann's sales forecasts, offering valuable insights to optimize the performance of its stores and boost business results.


In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

---
### **0.0 Import Libraries and Load Data**

In [None]:
# Installing required packages silently
# (scikit-learn provides the 'sklearn' module; the separate 'sklearn' package on PyPI is deprecated and is not needed)
!pip install scikit-learn -q
!pip install inflection -q
!pip install tabulate -q
!pip install --upgrade seaborn -q
!pip install ipython -q
!pip install boruta -q
!pip install scipy -q

# Importing libraries
import pandas               as pd
import numpy                as np
import seaborn              as sns
import matplotlib.pyplot    as plt
import matplotlib.gridspec  as gridspec

import inflection
import math
import datetime
import warnings

# Suppressing warnings
warnings.filterwarnings('ignore')

# Importing specific resources
from scipy.stats            import chi2_contingency
from IPython.display        import Image, HTML
from boruta                 import BorutaPy

# Importing models and evaluation metrics
from sklearn.ensemble       import RandomForestRegressor
from sklearn.metrics        import mean_absolute_error, mean_squared_log_error, mean_squared_error
from sklearn.linear_model   import LinearRegression, Lasso
from sklearn.preprocessing  import RobustScaler, MinMaxScaler, LabelEncoder

---
### **0.2 Loading Data**

In [None]:
# Loading the 'train.csv' file as a DataFrame df_sales_raw; low_memory=False lets pandas infer each column's type from the whole file, avoiding mixed-type columns
df_sales_raw = pd.read_csv('train.csv', low_memory=False)

# Loading the 'store.csv' file as a DataFrame df_store_raw, with the same low_memory=False setting
df_store_raw = pd.read_csv('store.csv', low_memory=False)

In this section, we use the **read_csv** function from **pandas**, which is a powerful data analysis and manipulation library in Python. The **read_csv** function is used to **read tabular data**, such as a **CSV file**, and create a **pandas DataFrame**.

The **first argument** for the **read_csv** function is the **path** to the **file** we want to read. In our case, the files are named **'train.csv'** and **'store.csv'**.

The **second argument** is **low_memory**. With the default value (**True**), pandas parses the file in internal chunks to use less memory while parsing, which can result in columns with mixed inferred types. Setting **low_memory** to **False** makes pandas process the file in a single pass, so the type of each column is inferred from all of its values. In both cases, the entire file ends up in a single DataFrame.

The decision to **set low_memory** as **True** or **False** depends on the **computer's memory capacity**. With **low_memory=False**, loading a very large file on a computer with limited memory might fail with a memory error; with **low_memory=True**, we might instead receive a warning about mixed types.
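As a hedged illustration of the type-inference point, the sketch below reads a tiny in-memory CSV (made-up rows, not the Rossmann files) and pins a column's type with the `dtype` argument, which is another way to rule out mixed types. The names `csv_text` and `df_demo` are hypothetical and exist only for this example.

```python
import io
import pandas as pd

# Made-up sample: 'StateHoliday' mixes digits and letters
csv_text = "Store,StateHoliday\n1,0\n2,a\n3,0\n"

# Declaring the dtype up front removes the need for type inference on this column
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'StateHoliday': str},
    low_memory=False,
)

print(df_demo['StateHoliday'].tolist())
```

Every value in `StateHoliday` is read as text, so `0` and `a` end up with the same type.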

In [None]:
# Merge of the DataFrames df_sales_raw and df_store_raw based on the 'Store' column
df_raw = pd.merge(df_sales_raw, df_store_raw, how='left', on='Store')

# Random sample from the resulting DataFrame
df_raw.sample()

After loading the files, we carried out a **'merge'** operation on two datasets using the **merge** function from **pandas**. This operation is similar to a **'JOIN' in SQL**, where data from two (or more) DataFrames are combined based on a common column (or multiple columns).

To do this, we use the **merge** function from **pandas**, which accepts several arguments:

- The **first argument** is the DataFrame that will serve as the reference (the left side) for the **'merge'** operation.

- The **second argument** is the DataFrame whose columns will be joined to the first.

- The **'how'** argument specifies the type of **'merge'** to be performed. In our case, the value is **'left'**, which means we keep **all rows from the first DataFrame** (`df_sales_raw`) and attach the matching store attributes from `df_store_raw`. A sales row without a matching store would receive `NaN` in the store columns.

- The **'on'** argument specifies the column(s) that will be used as the key for the **'merge'**. In our case, the column is **'Store'**, which is present in both DataFrames.

The result of the **'merge'** operation is stored in a new variable named **'df_raw'**. In summary, **merge** is a pandas function (also available as a DataFrame method) used to combine two DataFrames based on common columns.
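To make the behavior of a left merge concrete, here is a minimal sketch on made-up data. The frames `sales` and `stores` are hypothetical stand-ins for `df_sales_raw` and `df_store_raw`, and the `validate` argument is an optional safety check that is not used in the notebook.

```python
import pandas as pd

# Toy stand-ins: several sales rows per store, one attribute row per store
sales = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [100, 120, 90, 70]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['a', 'b']})

# how='left' keeps every sales row; validate raises an error if 'Store'
# is duplicated on the right side, which would silently multiply rows
merged = pd.merge(sales, stores, how='left', on='Store', validate='many_to_one')

print(merged)
```

The merged frame has the same number of rows as `sales`, and store 3, which has no entry in `stores`, gets `NaN` in `StoreType`.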

---
## **1.0 Data Description**
A good practice is to make copies of the DataFrame whenever you transition to a new analysis session in a notebook. This precautionary measure prevents the loss of original data during the manipulation of DataFrames in subsequent sessions.

By creating copies, you preserve the original data and can safely work on analyses and transformations without altering the original DataFrame. This is particularly useful when dealing with large volumes of data, as it avoids the need to rerun the notebook from the beginning, saving processing time.

In [None]:
# Create a copy of the DataFrame df_raw and store it in a new variable named df1
df1 = df_raw.copy()

Here, we created a copy of the DataFrame **df_raw** and stored it in **df1**. Now, we can conduct analyses and make modifications to **df1** without affecting the original data in **df_raw**, ensuring the integrity of the data during the manipulation process.

---
### **1.1 Rename Columns**

In [None]:
# Get names of the columns present in df_raw.
df_raw.columns

**It's advisable to rename the columns to more intuitive and easily memorable names**. This can help speed up subsequent development since **column names will frequently be used to explore the data, apply algorithms, create plots, among other things**.

While the column names in the current example are quite organized and in 'CamelCase' format (each word starts with an uppercase letter and there are no separators), **this may not be the case in a real-world environment, where column names can be much less intuitive**.

Therefore, **it's a good practice to review and, if necessary, rename columns at the beginning of the data analysis process**.

In [None]:
cols_old = [
'Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo',
'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment',
'CompetitionDistance', 'CompetitionOpenSinceMonth',
'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
'Promo2SinceYear', 'PromoInterval']

In this code, we are creating a list called **cols_old**, which contains the **original column names of the DataFrame**. These columns represent various attributes related to the stores and sales.

- **`Store`**: Identifier number for the store.
- **`DayOfWeek`**: Day of the week (1 to 7).
- **`Date`**: Sale date.
- **`Sales`**: Total sales value for that day and store.
- **`Customers`**: Total number of customers in that store on that day.
- **`Open`**: Indicator whether the store was open (1) or closed (0) on that day.
- **`Promo`**: Indicator whether the store was having a promotion on that day.
- **`StateHoliday`**: Indicator of a state holiday (a, b, c) or none (0).
- **`SchoolHoliday`**: Indicator of a school holiday (1) or none (0).
- **`StoreType`**: Type of store (a, b, c, d).
- **`Assortment`**: Assortment level of the store (a, b, c).
- **`CompetitionDistance`**: Distance in meters to the nearest competitor store.
- **`CompetitionOpenSinceMonth`**: Month when the nearest competitor opened.
- **`CompetitionOpenSinceYear`**: Year when the nearest competitor opened.
- **`Promo2`**: Indicator of participation in an ongoing promotion (0 or 1).
- **`Promo2SinceWeek`**: Week when the ongoing promotion started.
- **`Promo2SinceYear`**: Year when the ongoing promotion started.
- **`PromoInterval`**: Frequency interval of the ongoing promotion.

This list is useful for quick reference and for easier access to specific attributes of the DataFrame. It's a good practice to have an organized list of original columns, especially in projects with many attributes, as it can speed up development and data analysis.

In [None]:
# Defining the lambda function 'snakecase' that uses the 'inflection' library to convert the names to snake_case
snakecase = lambda x: inflection.underscore(x)

# Applying the 'snakecase' function to all elements of the 'cols_old' list using the 'map' function
cols_new = list(map(snakecase, cols_old))

We created the **snakecase** function. It is defined as a lambda function that uses the **inflection** library to convert the column names to the **snake_case** style (lowercase words separated by underscores). The **map** function is then used to apply **snakecase** to all elements of the **cols_old** list. The result of this operation is stored in the **cols_new** list, which contains the column names in the new naming style.
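For readers curious about what such a conversion involves, the sketch below approximates it with the standard library only. The helper `to_snake_case` is a hypothetical name, and it is an illustration of the idea rather than the `inflection` library's own code.

```python
import re

def to_snake_case(name):
    """Approximate CamelCase -> snake_case conversion (illustrative only)."""
    # Split an uppercase run from a following capitalized word: 'HTTPServer' -> 'HTTP_Server'
    name = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', name)
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'([a-z\d])([A-Z])', r'\1_\2', name)
    return name.lower()

cols_demo = ['DayOfWeek', 'StateHoliday', 'Promo2SinceWeek', 'CompetitionOpenSinceMonth']
print([to_snake_case(c) for c in cols_demo])
```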

In [None]:
# Renaming the columns of the DataFrame df1 with the names from the cols_new list
df1.columns = cols_new

# Displaying the column names after the renaming process
df1.columns

---
### **1.2 Data Dimensions**
A crucial step in describing our data is determining the dimensions of our DataFrame - the number of rows and columns. For this, we use the **shape** attribute, which returns a tuple in the form (rows, columns).

In [None]:
# Print the number of rows (records) of the DataFrame df1
print('Number of Rows: {}'.format(df1.shape[0]))

# Print the number of columns (attributes) of the DataFrame df1
print('Number of Columns: {}'.format(df1.shape[1]))

After executing the instructions, we find that the DataFrame has 1,017,209 rows and 18 columns. The amount of data is considerable, but manageable with our current resources. For larger datasets, options like AWS servers, Google Cloud, or Kaggle offer robust computational resources, either free or paid.

---
### **1.3 Data Types**

The next step in data description is to examine the data types present in the DataFrame. For this, we use **dtypes**, which lists each column with its corresponding data type. Because it is an attribute rather than a method, it is accessed without parentheses.

In [None]:
# Check the data types (dtypes) of each column in the 'df1' DataFrame
df1.dtypes

This allows us to identify that columns like `store` are of type int64 and `date` is of type object.

For the **`date`** column, we want to change the data format so that we can treat it as a date. This will make it easier to perform time-based operations and analyses on our dataset.

In [None]:
# Converting the 'date' column to the datetime data type and forcing errors to become NaT
df1['date'] = pd.to_datetime(df1['date'], errors='coerce')

# Displaying the data types of the columns in the df1 DataFrame
df1.dtypes

We used **pd.to_datetime** in **pandas** to convert the `date` column to the correct date format (datetime64).
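Because `errors='coerce'` silently turns unparseable entries into `NaT`, it can be worth counting them after the conversion. The sketch below shows the idea on a made-up Series; `raw_dates`, `parsed`, and `n_invalid` are hypothetical names, and the explicit `format` is an assumption about how the dates are written.

```python
import pandas as pd

# Made-up sample with one entry that is not a valid date
raw_dates = pd.Series(['2015-07-31', '2015-07-30', 'not a date'])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Count how many values could not be parsed
n_invalid = int(parsed.isna().sum())
print(parsed.dtype, n_invalid)
```

A non-zero count is a signal to inspect the original strings before continuing.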

---
### **1.4 Check NA**
The next step is to check for null data (NaN) to ensure data quality. We use the isna() method to identify rows with missing values. We then use the sum() method to get the sum of missing values by column.

In [None]:
# Check the number of missing values (NaN) in each column of the DataFrame 'df1'.
df1.isna().sum()

After executing the commands, we noticed that some columns have no missing values, while others have a few. Now, we need to address these missing values. There are three common methods to handle this issue:

1. **Dropping Rows**: Simply removing the rows that contain missing values. It's quick and easy, but it can result in the loss of important information for the algorithm.

2. **Using Machine Learning Algorithms**: Some algorithms can fill in missing values based on the overall behavior of the column, such as mean, median, or mode. Additionally, there are more advanced algorithms that can estimate missing values more accurately.

3. **Using Business Knowledge**: If we understand the business logic behind the missing values, we can fill them in accordingly and recover the data appropriately.
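The three options above can be sketched on a toy column as follows. The data is made up, and the sentinel of 200,000 is just one possible business rule (a distance so large that it stands for "no nearby competitor").

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'competition_distance': [500.0, np.nan, 1500.0, 2500.0]})

# 1. Dropping rows: simple, but the row with the missing value is lost
dropped = toy.dropna()

# 2. Statistical fill: replace the missing value with the column median
median_filled = toy['competition_distance'].fillna(toy['competition_distance'].median())

# 3. Business rule: a very large distance stands for "no nearby competitor"
sentinel_filled = toy['competition_distance'].fillna(200000.0)

print(len(dropped), median_filled.tolist(), sentinel_filled.tolist())
```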

---
### **1.5 Fillout NA**
We will start by handling the **`competition_distance`** column. According to the column description, it represents the distance, in meters, to the nearest competitor.

In [None]:
# competition_distance
# competition_open_since_month
# competition_open_since_year
# promo2_since_week
# promo2_since_year
# promo_interval

There are three main approaches to dealing with missing data (`NaN`):

1. **Data Deletion**: The first way is to simply delete all rows that contain missing data. This is fast, but it has the disadvantage of losing a significant amount of information.

2. **Filling Using Algorithms**: The second approach is to use machine learning algorithms to fill in the missing data based on column behavior. For example, we can calculate the median or mean and replace the missing values with those numbers. More advanced algorithms can perform clustering or even predict the missing values.

3. **Consider Business Context**: The third approach, which will be adopted in this tutorial, involves thinking from a business perspective. Even if we're not domain experts, this exercise can be useful. Reflecting on why the data is missing can provide valuable insights for properly handling these cases.

---
### **1.5.1 `Competition Distance`**
Let's analyze the **`competition_distance`** column. One plausible interpretation for the **missing data** is that the **distance to the nearest competitor is so large that, in practice, there is no nearby competition**.

In [None]:
# Find the maximum value in the 'competition_distance' column
df1['competition_distance'].max()

Where '`competition_distance`' has **no value**, we will **fill it with an extremely large number**, which we will call '`max_value`'. To choose this '`max_value`', we first look for the **maximum value** existing in the '`competition_distance`' column. In our case, the farthest competitor is **75,860 meters** away.

In [None]:
# Replacing NaN values in the 'competition_distance' column with 200000.0
df1['competition_distance'] = df1['competition_distance'].apply(lambda x: 200000.0 if math.isnan(x) else x)

We chose `max_value` as 200,000 meters, a value that is **greater** than the maximum distance in the `competition_distance` column. **Missing values** are replaced with `max_value`, indicating the absence of nearby competition. We applied this logic using the `lambda` function, which operates only on the `competition_distance` column.

The result **replaces** the original column. After the operation, we verify that there are **no more missing values** in this column, demonstrating that we successfully handled the missing data.

In [None]:
# Checking the count of missing values (NaN) in each column of df1
df1.isna().sum()

The **`competition_distance` column no longer has missing values**, and the **maximum value** is **200,000 meters**, as defined earlier.

In [None]:
# Getting the maximum value in the 'competition_distance' column of the DataFrame df1
df1['competition_distance'].max().astype(int)

---
### **1.5.2 `Competition Open Since Month`**
Moving forward, let's analyze the **`competition_open_since_month`** and **`competition_open_since_year`** columns. Together, they indicate the approximate `month` and `year` in which the **nearest competitor** was **opened**.

We can assume that the **missing values** in these columns might be due to two reasons: **first**, the store might not have a **nearby competitor**, hence there is no **opening date** for such a competitor. **Second**, the store might have a **nearby competitor**, but we are **unaware** of the **opening date**, either because the competitor opened **before** the store or because it opened **afterwards** and the date was not recorded.

To **replace the missing values**, we will **copy** the month and year of the **corresponding sale date** in the relevant row into these columns.

In [None]:
# Importing the function
from IPython.display import display

# Selecting store 238 on the date '20/05/2014'
store_238_20_05_2014 = df1.loc[(df1['store'] == 238) & (df1['date'] == '2014-05-20')]

# Displaying the dataframe
display(store_238_20_05_2014)

As an example, consider the row with a **sale** made by **store 238** on **20/05/2014**. If this row presents a **missing value** in **`competition_open_since_month`**, we will replace this value with a value taken from the **sale date**.

Only the **`month`** is extracted from the **sale date** to fill in the **missing values** in the **`competition_open_since_month`** column.

The **logic** is that a **store without competition** has a **stable level of sales**, which will likely **drop when a competitor opens**. The **time since the opening of a competitor** can influence sales fluctuations.

We acknowledge the **incongruity** of using the **sale date** as the **competitor's opening date**, but we will proceed with this **approach** to evaluate its **impact on the model**. If necessary, we will **adjust** this approach in **future iterations**. The way we will implement this **replacement** will be similar to what we did for **`competition_distance`**.


In [None]:
# Filling missing values in 'competition_open_since_month' with the month from the corresponding 'date' column
df1['competition_open_since_month'] = df1.apply(lambda x: x['date'].month
                                                if np.isnan(x['competition_open_since_month'])
                                                else x['competition_open_since_month'], axis=1)

Therefore, if the **`competition_open_since_month`** column is empty, we will fill it with the **month from the date column**.

If this condition is not true, we will simply **return the original value**, as we already have the month in which the competition was opened.

To apply this logic, we will use a **lambda** function again. Inside the function, **x** refers to the current row, so we write **x['date']** instead of **df1['date']**.

To apply this to our data, we call the **apply** function on the whole DataFrame with **axis=1**, which passes each row to the function. We do this because the logic needs more than one column - **`competition_open_since_month`** and **`date`**. When a function depends on several columns of the same row, **axis=1** must be set explicitly; the default would pass one column at a time.

Finally, the result of this function will be used to replace the original **`competition_open_since_month`** column.
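Row-wise `apply` is easy to read but can be slow on a DataFrame with about a million rows. As a sketch of an alternative, under the assumption that the `date` column has already been converted to datetime, the same fill can be written with `fillna`, which aligns the two Series by index. The toy frame below is made up for the comparison.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-05-20', '2015-01-03', '2013-11-11']),
    'competition_open_since_month': [np.nan, 9.0, np.nan],
})

# Row-wise version, as used in the notebook
row_wise = toy.apply(
    lambda x: x['date'].month
    if np.isnan(x['competition_open_since_month'])
    else x['competition_open_since_month'],
    axis=1,
)

# Vectorized version: missing months are taken from the sale date
vectorized = toy['competition_open_since_month'].fillna(toy['date'].dt.month)

print(vectorized.tolist())
```

Both versions produce the same values; only the missing entries are replaced.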

---
### **1.5.3 `Competition Open Since Year`**
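We apply the same logic as in the previous section, **replacing '`month`' with '`year`'**: missing values in **`competition_open_since_year`** are filled with the year of the sale date.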

In [None]:
# Filling missing values in 'competition_open_since_year' with the year from the corresponding 'date' column
df1['competition_open_since_year'] = df1.apply(lambda x: x['date'].year
                                               if np.isnan(x['competition_open_since_year'])
                                               else x['competition_open_since_year'], axis=1)

---
### **1.5.4 `Promo 2 Since Week`**
The **`promo2`** column indicates a store's participation in a continuing promotion. If the value is **zero**, the store is not participating; if it's **one**, the store is participating. When **`promo2_since_week`** is missing (**`NaN`**), it means the store has not opted to participate in the continuing promotion, therefore there is no start week for that promotion.

To handle these missing values, we will use the same approach we used for the **`competition_open_since_month`** column: **replace the NaNs with a value taken from the date in that row**. This way, if we later compute the distance in time between the sale and the start of the promotion, it will come out as zero for these rows.

To implement this replacement, we will use code similar to what we used for **`competition_open_since_month`**, but with **small modifications**: instead of the **`competition`** column, we use the **`promo2`** column, and instead of extracting the month from the date, we extract the week.

In [None]:
# Filling missing values in the 'promo2_since_week' column with the week number from the 'date' column.
df1['promo2_since_week'] = df1.apply(lambda x: x['date'].week
                                     if np.isnan(x['promo2_since_week'])
                                     else x['promo2_since_week'], axis=1)
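One detail to keep in mind is that the `week` attribute of a pandas `Timestamp` follows the ISO 8601 week numbering, while `year` is the calendar year. Around New Year the two can disagree, which matters here because `promo2_since_week` and `promo2_since_year` are filled separately. The sketch below uses an arbitrary date chosen for illustration.

```python
import pandas as pd

# 29 December 2014 is a Monday that already belongs to ISO week 1 of 2015
ts = pd.Timestamp('2014-12-29')

calendar_year = ts.year   # calendar year of the date
iso_week = ts.week        # ISO 8601 week number
iso_year = ts.isocalendar()[0]  # ISO year that the week belongs to

print(calendar_year, iso_week, iso_year)
```

If consistency between week and year is important for a later feature, the ISO year from `isocalendar()` could be used instead of `year`; whether this affects the model is something to check in a later iteration.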

---
### **1.5.5 `Promo 2 Since Year`**
To implement this replacement, we duplicate the code used for the **`promo2_since_week`** column, making slight modifications to extract the **`year`** from the **date** column.

In [None]:
# Extracting the year from the 'date' column for rows where the value of 'promo2_since_year' is NaN.
df1['promo2_since_year'] = df1.apply(lambda x: x['date'].year
                                     if np.isnan(x['promo2_since_year'])
                                     else x['promo2_since_year'], axis=1)

This code performs the filling of **missing values** in the **`promo2_since_year`** column. It iterates through each row of the dataframe and checks if the value in that column is **`NaN`** (missing). If it is, it replaces the missing value with the **year from the date** in that row. If the value is not NaN, it keeps the **original value of the column**. This is done using the **`apply`** function and a **lambda function** that performs these checks and replacements on a row-by-row basis.

---
### **1.5.6 `Promo Interval`**
The **Promo Interval** column lists the **months** in which a new round of the promotion called **promo2** starts. For example, the value **"Feb,May,Aug,Nov"** indicates that the **promotion** is started again in **February, May, August, and November**.

In [None]:
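# Creating a month_map dictionary to map month numbers (1 to 12) to their respective English abbreviations.
# The abbreviations must match the spelling used in 'promo_interval' exactly; this can be checked with
# df1['promo_interval'].unique() (for example, some copies of the data may spell September as 'Sept').
month_map = {1: 'Jan', 2: 'Feb', 3: 'Mar',  4: 'Apr',  5: 'May',  6: 'Jun',
             7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

# Filling missing values in 'promo_interval' with 0, if any.
# Assigning the result back (instead of inplace=True on a selected column) avoids chained-assignment problems.
df1['promo_interval'] = df1['promo_interval'].fillna(0)

# Creating the 'month_map' column with the month abbreviation of each sale date.
df1['month_map'] = df1['date'].dt.month.map(month_map)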

# Creating the 'is_promo' column to indicate if a store is participating in a promotion in that specific month.
# If 'promo_interval' is not equal to 0 and the month abbreviation is present in 'promo_interval', the store is participating in the promotion (value 1), otherwise, it's not (value 0).
df1['is_promo'] = df1[['promo_interval', 'month_map']].apply(
    lambda x: 1 if x['promo_interval'] != 0 and x['month_map'] in x['promo_interval'].split(',') else 0, axis=1)

In [None]:
df1.sample(5).T

The first step **fills the missing values** in the "`promo_interval`" column with `0`, which marks the stores that do not take part in "`promo2`". Next, the auxiliary column "month_map" is created from the **sale date**: the month number is extracted and **converted to its text abbreviation** using the `month_map` dictionary. For example, January is represented as "Jan".

A **function is then applied row by row**. It **splits** "`promo_interval`" on commas to obtain the list of **promotion months** and checks whether the abbreviation in "month_map" is in that list. If it is, the value `1` is returned, indicating that a round of `promo2` started in the month of that sale. If the month is not in the list or if the store did not participate in `promo2`, the value `0` is returned. The result is stored in the new "`is_promo`" column.
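The logic can be traced on three made-up rows, as in the sketch below. The frame `toy` is hypothetical, and the list comprehension is just a compact way to express the same row-by-row check.

```python
import pandas as pd

month_map = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
             7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-02-10', '2015-03-10', '2015-02-10']),
    # 0 marks a store that does not take part in promo2
    'promo_interval': ['Feb,May,Aug,Nov', 'Feb,May,Aug,Nov', 0],
})

# Month abbreviation of each sale date
toy['month_map'] = toy['date'].dt.month.map(month_map)

# 1 only when the store has an interval and the sale month is part of it
toy['is_promo'] = [
    1 if interval != 0 and month in interval.split(',') else 0
    for interval, month in zip(toy['promo_interval'], toy['month_map'])
]

print(toy[['month_map', 'is_promo']])
```

Only the first row is flagged: the second sale falls in March, which is not in the interval, and the third store has no interval at all.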

In [None]:
# Calculate the number of missing values (NaN) in each column of the DataFrame df1
df1.isna().sum()

When working with columns in a dataset, it's **important to check the data types of these columns after performing operations**. This is because operations can inadvertently change the original data types.

In some cases, we might need to **change the data type of a column to another type**. For instance, the fills above left the `competition_open_since_month` column as `float`, although months are whole numbers. To convert it to `int` (an integer), we can use the `astype()` method.

In [None]:
# Changing the data type of the 'competition_open_since_month' column to integer
df1['competition_open_since_month'] = df1['competition_open_since_month'].astype(int)

# Changing the data type of the 'competition_open_since_year' column to integer
df1['competition_open_since_year'] = df1['competition_open_since_year'].astype(int)

# Changing the data type of the 'promo2_since_week' column to integer
df1['promo2_since_week'] = df1['promo2_since_week'].astype(int)

# Changing the data type of the 'promo2_since_year' column to integer
df1['promo2_since_year'] = df1['promo2_since_year'].astype(int)
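These conversions work only because the missing values were filled first: a float column that still contains `NaN` cannot be cast to a plain integer type. The sketch below demonstrates this on a made-up Series and shows pandas' nullable `Int64` type as one possible alternative when missing values have to be kept.

```python
import numpy as np
import pandas as pd

s = pd.Series([9.0, np.nan, 11.0])

# Casting NaN to a plain integer raises an error
try:
    s.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: fill first, then cast (the route taken in this notebook)
filled = s.fillna(0).astype(int)

# Option 2: the nullable integer type keeps the missing value as <NA>
nullable = s.astype('Int64')

print(cast_failed, filled.tolist(), nullable.isna().sum())
```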

---
## **1.7 Descriptive Statistics**
Descriptive statistics act like a magnifying glass for your data. They help you see things you wouldn't be able to see with the naked eye. They serve two major purposes: **getting to know your business better** and **identifying potential issues in the data**.

Descriptive statistics have two main **"tools"**: measures of central tendency and measures of dispersion.

**Measures of central tendency**, such as mean and median, are like a snapshot of your data. They provide a general idea of what your data looks like. But just like a snapshot, they don't reveal everything. To get the complete picture, we need measures of dispersion.

**Measures of dispersion**, including variance, standard deviation, and minimum and maximum values, are like a map of your data. They show where the data points are located, whether they cluster around the mean or spread out. Additionally, there's **skewness** and **kurtosis**, which describe the shape of this map. Skewness tells us if the data is skewed to one side or the other. Kurtosis tells us how heavy the tails of the distribution are compared with a normal distribution, that is, how prone the data is to extreme values.
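A small sketch with made-up numbers can make skewness and kurtosis more tangible. The two Series below are hypothetical: one is symmetric, the other has a single large value on the right.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 10])

# Skewness: close to zero for symmetric data, positive for a long right tail
skew_symmetric = symmetric.skew()
skew_right = right_tailed.skew()

# Kurtosis (pandas reports excess kurtosis): higher when extreme values are present
kurt_symmetric = symmetric.kurt()
kurt_right = right_tailed.kurt()

print(skew_symmetric, skew_right, kurt_symmetric, kurt_right)
```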

When using descriptive statistics, it's important to treat numerical variables and categorical variables differently. To separate them, we can use the **select_dtypes()** method of the pandas DataFrame.

In [None]:
# Select the columns in DataFrame df1 with data type 'int64' or 'float64' and store them in the variable num_attributes
num_attributes = df1.select_dtypes(include=['int64', 'float64'])

# Select the columns in DataFrame df1 that do not have data type 'int64', 'float64', or 'datetime64[ns]' and store them in the variable cat_attributes
cat_attributes = df1.select_dtypes(exclude=['int64', 'float64', 'datetime64[ns]'])

The first line is selecting all columns that have data types **int64** or **float64**. These are typically considered numeric attributes, as **int64** represents an integer number and **float64** represents a floating-point number.

The second line is selecting all columns that are not of type **int64**, **float64**, or **datetime64[ns]**. These are typically considered categorical attributes, as they exclude numeric and datetime types.

The **`.select_dtypes()`** method is used to select columns from a DataFrame based on their data types. The **include** parameter is used to specify the data types you want to include, and the **exclude** parameter is used to specify the data types you want to exclude.
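As a sketch of a slightly more general selection, `select_dtypes` also accepts the generic `'number'` and `'datetime'` labels, which cover every integer and float width rather than only `int64` and `float64`. The toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'store': [1, 2],
    'sales': [10.5, 20.0],
    'state_holiday': ['0', 'a'],
    'date': pd.to_datetime(['2015-07-31', '2015-07-30']),
})

# All numeric columns, regardless of bit width
num_demo = toy.select_dtypes(include='number')

# Everything that is neither numeric nor datetime
cat_demo = toy.select_dtypes(exclude=['number', 'datetime'])

print(list(num_demo.columns), list(cat_demo.columns))
```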

In [None]:
# Display two random rows from num_attributes
# (display() is used because a notebook cell only shows its last expression automatically)
display(num_attributes.sample(2))

# Display two random rows from cat_attributes
display(cat_attributes.sample(2))

After dividing the data into **numerical** and **categorical** types, we can delve deeper into each type to uncover valuable insights. For numerical data, we can calculate the **mean**, which is the sum of all numbers divided by the total quantity, the **median**, which is the value in the middle of an ordered list of numbers, and the **standard deviation**, which indicates how much these numbers disperse from the mean.

For categorical data, we can **count how many times each category appears**, allowing us to identify the most common or frequent category.

---
### **1.7.1 Numerical Attributes**
Let's delve into calculating central tendency metrics for numerical variables. Central tendency metrics summarize the dataset into a single representative value, providing a 'center' around which the data is distributed. The most common central tendency metrics are the mean and the median.

### **Central Tendency Measures**

In [None]:
# Create a new DataFrame with the mean of each column
ct1 = pd.DataFrame(num_attributes.apply(np.mean)).T

# Create a new DataFrame with the median of each column
ct2 = pd.DataFrame(num_attributes.apply(np.median)).T

### **Dispersion Measures**

In [None]:
# Standard deviation gives us an idea of how spread out our data is
d1 = pd.DataFrame(num_attributes.apply(np.std)).T

# Minimum and Maximum show us the lowest and highest value in our data, respectively
d2 = pd.DataFrame(num_attributes.apply(min)).T
d3 = pd.DataFrame(num_attributes.apply(max)).T

# Range is the difference between the highest and lowest value in our data
d4 = pd.DataFrame(num_attributes.apply(lambda x: x.max() - x.min())).T

# Skewness gives us an idea of how symmetric our data is. If it's zero, the data is perfectly symmetric.
d5 = pd.DataFrame(num_attributes.apply(lambda x: x.skew())).T

# Kurtosis informs us if the data is light-tailed or heavy-tailed compared to a normal distribution
d6 = pd.DataFrame(num_attributes.apply(lambda x: x.kurtosis())).T

After calculating all of this important information, the next step is to gather all of it in one place to make analysis easier. To accomplish this, we will combine all of these datasets we've created into a single dataset, specifically, a single DataFrame.

The pandas library, which we are using to handle our data, has a function called **`concat()`**. This function allows us to merge multiple DataFrames into one, and that's exactly what we're going to do next:

In [None]:
# We concatenate all DataFrames into a single DataFrame 'measures'
measures = pd.concat([d2, d3, d4, ct1, ct2, d1, d5, d6]).T.reset_index()

# We rename the columns for better understanding
measures.columns = ['Attributes', 'Minimum', 'Maximum', 'Range', 'Mean', 'Median', 'Standard Deviation', 'Skewness', 'Kurtosis']

# We configure the display option to show all values without truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# We display the final DataFrame
print(measures)

At this stage, we have the `measures` DataFrame, which combines all the metrics we calculated for each numeric attribute in our dataset. Because the concatenated result is transposed, each row corresponds to one numeric attribute and each column to one metric. This gives us a clear and concise view of the key characteristics of each of our numeric attributes.
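
The same kind of table can probably be built more compactly with `DataFrame.agg`. The sketch below is an alternative, not the notebook's method, and uses a toy frame (`toy_num`, with invented values) in place of `num_attributes`. Note that the `'std'` aggregation in pandas is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Toy stand-in for num_attributes (hypothetical values)
toy_num = pd.DataFrame({
    "sales": [0, 10, 20, 30, 40],
    "customers": [0, 2, 4, 6, 8],
})

# One call computes every statistic; .T puts the attributes on the rows
summary = toy_num.agg(["min", "max", "mean", "median", "std", "skew", "kurt"]).T

# Range is not a built-in aggregation, so derive it from max and min
summary["range"] = summary["max"] - summary["min"]

print(summary)
```

Each row of `summary` is one attribute and each column one metric, which is the same orientation as `measures`.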

Take the `sales` column as an example. The statistics show us that the minimum value is zero, indicating days with no sales, possibly due to the store being closed. In contrast, the maximum value is **41,000**, representing the highest volume of sales achieved in a single day. The difference between these two values, namely the 'range', is therefore also **41,000**.

When the mean (the average value of sales) and the median (the middle value when sales are ordered) are similar, as observed here, it suggests that the distribution is roughly symmetric, without a long tail of extremely high or low sales days pulling the mean away from the median.

Additionally, we have two other indicators that provide us with a more detailed view of the distribution of sales: the `skewness` and the `kurtosis`. The `skewness` measures how asymmetric the distribution is: positive values indicate a longer right tail, negative values a longer left tail. The `kurtosis` indicates how heavy the tails are compared with a normal distribution (pandas reports excess kurtosis, so a normal distribution scores about zero). In our case, both values are close to zero, which supports the notion that sales are roughly symmetric and do not exhibit significant surprises.
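
To make the link between mean, median, and skewness concrete, here is a small sketch with two invented series (not Rossmann data): one symmetric and one with a long right tail.

```python
import pandas as pd

# Symmetric data: mean and median coincide, skewness is zero
symmetric = pd.Series([1, 2, 3, 4, 5])

# One large value creates a long right tail and pulls the mean upward
right_skewed = pd.Series([1, 1, 1, 2, 10])

print(symmetric.mean(), symmetric.median(), symmetric.skew())
print(right_skewed.mean(), right_skewed.median(), right_skewed.skew())
```

In the second series the mean sits well above the median and the skewness is positive, which is the pattern to look for in the `measures` table.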

---
### **1.7.2 Categorical Attributes**

To perform a descriptive analysis of categorical data, it's useful to employ graphs for visualizing the information. One particularly valuable type of graph for this purpose is the box plot. This graph allows you to visualize various statistical measures, such as measures of central tendency and dispersion, all in one place, making it easy to compare between categories.

However, before constructing the box plot, it can be helpful to check the number of levels or unique values that each categorical variable possesses. This can be done with the `nunique` method in pandas, which returns the count of distinct elements in each column of a DataFrame. The cell below counts with `x.unique().shape[0]` inside `apply`, which gives the same result for columns without missing values.

In [None]:
# Counting the number of unique categories in each categorical attribute.
cat_attributes.apply(lambda x: x.unique().shape[0])

In the example we are examining, we have several categorical variables, such as `state_holiday`, `store_type`, `assortment`, `promo_interval`, and `month_map`. The number of levels differs between them: as the plots below show, `state_holiday` takes the values `0`, `a`, `b`, and `c`, `store_type` has four types (`a` to `d`), and `assortment` has three (`a` to `c`).
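
One detail worth checking is how missing values are counted. The sketch below uses a toy frame with invented column names and values to compare the two ways of counting levels: `unique()` keeps `NaN` as a level, while `nunique()` skips it by default.

```python
import numpy as np
import pandas as pd

# Toy categorical data; 'interval' contains missing values (hypothetical column)
toy_cat = pd.DataFrame({
    "store_type": ["a", "b", "c", "d"],
    "interval": ["x", np.nan, "x", np.nan],
})

# Counting via unique(): NaN is counted as one extra level
with_nan = toy_cat.apply(lambda x: x.unique().shape[0])

# Counting via nunique(): NaN is ignored by default (dropna=True)
without_nan = toy_cat.nunique()

print(with_nan)
print(without_nan)
```

If the two counts differ for a column, that column still contains missing values.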

In [None]:
# Setting the size of the plot, increasing to occupy more space
plt.figure(figsize=(17, 10))

# Creating the boxplot
sns.boxplot(x='state_holiday', y='sales', data=df1)

# Adding a title to the plot with a larger font
plt.title('Comparison of Sales in Different State Holidays', fontsize=18)

# Removing margins
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0)

# Displaying the plot
plt.show()

We are interested in exploring the dispersion of **`sales`** at each level of the categorical variable **`state_holiday`**. Thus, in the graph, we set **`state_holiday`** as the x-axis (representing the categories) and **`sales`** as the y-axis (representing the value we are analyzing). The data is sourced from our DataFrame **`df1`**, which contains the relevant data.

However, when viewing the boxplot, we might encounter a challenging or distorted interpretation of the data. A common cause for this is the presence of **outliers** that don't contribute to our analysis. An example could be the presence of sales equal to zero, corresponding to days when the stores were closed.

To address this issue, it's possible to **filter the data** before plotting the graph. Here we keep **only the days with sales greater than zero that fall on a state holiday** (`state_holiday != '0'`), which removes both closed days and regular days from the comparison.

In [None]:
# Filtering the DataFrame to keep only state holidays (state_holiday != '0') with positive sales
aux1 = df1[(df1['state_holiday'] != '0') & (df1['sales'] > 0)]

# Defining the order for each variable
state_holiday_order = ['a', 'b', 'c']
store_type_order    = ['a', 'b', 'c', 'd']
assortment_order    = ['a', 'b', 'c']

# Initializing the figure for the plots with a larger size
fig, axes = plt.subplots(1, 3, figsize=(24, 6))

# Adding a global title to the figure with a larger font
fig.suptitle('Sales Distribution by State Holiday, Store Type, and Assortment', fontsize=18, y=1.02)

# Creating the first subplot to show the sales distribution by state holiday
sns.boxplot(ax=axes[0], x='state_holiday', y='sales', data=aux1, order=state_holiday_order)
axes[0].set_title('State Holiday', fontsize=16)

# Creating the second subplot to show the sales distribution by store type
sns.boxplot(ax=axes[1], x='store_type', y='sales', data=aux1, order=store_type_order)
axes[1].set_title('Store Type', fontsize=16)

# Creating the third subplot to show the sales distribution by assortment
sns.boxplot(ax=axes[2], x='assortment', y='sales', data=aux1, order=assortment_order)
axes[2].set_title('Assortment', fontsize=16)

# Adjusting the layout to add spacing and prevent overlap
fig.tight_layout(pad=3.0)

# Displaying the plots
plt.show()

In the above code, we used **Seaborn** on top of **Matplotlib** to explore the relationship between sales and different **categorical variables** such as **`state_holiday`** (type of holiday), **`store_type`** (store type), and **`assortment`**. The **`plt.subplots()`** function creates a grid of axes, which lets us draw multiple boxplots side by side and compare the variables directly.

The **boxplot** is a graphical tool that provides a visual view of the **median**, **quartiles**, and **outliers** in a data distribution. The line dividing the 'box' in the middle represents the median, the central value of a distribution. The lower and upper edges of the 'box' indicate the first quartile (25th percentile) and the third quartile (75th percentile), respectively.
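
The numbers behind a boxplot can be reproduced by hand. The sketch below uses an invented series and the 1.5 × IQR rule, which is the default whisker setting in Matplotlib and Seaborn, to flag outliers.

```python
import pandas as pd

# Invented values with one obvious extreme point
sales = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = sales.quantile(0.25)      # lower edge of the box
median = sales.quantile(0.50)  # line inside the box
q3 = sales.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                  # interquartile range (height of the box)

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```

Here the box spans 3 to 7, the median is 5, and only the value 100 lies beyond the upper fence of 13.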

Upon analyzing the boxplots, we observe that sales tend to be higher when **`state_holiday`** is of type `b`. In the **`store_type`** variable, we notice that stores of type `b` not only have higher median sales but also a larger number of **outliers**.

---
## **Rossmann Project Status**

In the context of our **Rossmann sales prediction project** for stores over the next six weeks, the approach begins by identifying the root cause of the problem, which manifests in the CEO's difficulty in determining the amount of investment needed for store renovations. Thus, conducting **sales forecasting** emerges as crucial to assist in the decision-making process.

The subsequent phase involves **data collection**, which in this project involves downloading data available on the **Kaggle** platform. However, in practical situations, this collection can encompass various data sources relevant to constructing the final table used in analysis.

Following that, **data cleaning** occurs, ensuring the correctness and consistency of data types, and replacing missing values based on established criteria. It is also important to provide a summary of the data to gain an initial understanding of its size and characteristics.

Currently, we are in the **data exploration phase**, which involves three main tasks. The first task is the creation of **derived variables**, using existing ones, to enhance data representation and capture relevant information.

Next, it becomes necessary to formulate a list of **hypotheses** that will be tested during data exploration. These hypotheses may relate to behavioral patterns, correlations between variables, or any other assumptions relevant to the problem.

Lastly, it is crucial to **validate the hypotheses** raised during data exploration, using statistical and visual techniques to confirm or refute each one. This validation contributes to understanding the dataset and guiding the next stages of the project.

Upon concluding the data exploration phase, there will still be other stages to be executed to complete the first cycle of the proposed solution. However, with the progress achieved so far, it will be possible to deliver the **initial version of this solution**, providing valuable insights for the decision-making process in the retail sector.

---
### **Hypotheses Mind Map**

The **hypotheses mind map** plays a vital role in the **exploratory data analysis** phase, acting as a clear and specific guide to efficiently gain valuable insights.

Essentially, this map functions as a roadmap that directs which analyses need to be conducted to validate the established hypotheses. It also indicates which variables need to be derived to perform these analyses. Simplistically, the mind map provides an explicit structure of the required analyses and the involved variables, facilitating effective data exploration.

With the **hypotheses mind map** as our guide, we gain the ability to determine the level of detail required for each project cycle. This implies that we can identify how deep the analysis needs to go during the initial phase. With this clear direction, it becomes possible to conduct the analysis more swiftly and directly, generating valuable insights for each stage of the development cycle.

Before we move on to creating the **hypotheses mind map**, it's important to highlight the three elements that compose this map. These components are essential to understanding and effectively applying it, ensuring that we can extract the maximum value from this tool in our **Rossmann** sales forecasting project.

---
### **Elements of the Hypotheses Mind Map**

The mind map comprises three key elements: the **phenomenon**, the **agents**, and the **attributes of agents**. Each of these elements plays a crucial role in defining the hypotheses to be explored and validated during data analysis.

The first element, the phenomenon, refers to **what we are trying to measure or model**. In the context of our project, this phenomenon is **sales forecasting**. We can consider other examples, such as object detection in an image, image classification between cat and dog, or customer clustering for persona creation. **The phenomenon is the central aspect we aim to understand and model to teach machine learning algorithms**.

The second element of the mind map is the agents, i.e., the **entities that impact the phenomenon** in some way. In the case of **sales**, the agents can be **customers**, **stores**, and **products**. It's important to recognize that these agents directly influence sales, potentially contributing to their increase or decrease. For instance, **a higher number of customers tends to increase sales**, while an increase in **product price may result in lower sales**. Hence, it's crucial to identify and understand all relevant agents in this context.

The third and final element is the attributes of the agents. Each agent can be described by a set of attributes. For example, in the case of customers, attributes like **age**, **education level**, **marital status**, **number of children**, **store visit frequency**, **salary**, and **profession** can be considered. These attributes help describe the characteristics and peculiarities of each agent, providing valuable information for analysis.

The primary goal of the hypotheses mind map is to derive a list of hypotheses to be tested and validated. Based on this list, it becomes possible to prioritize hypotheses and conduct data analysis in a targeted manner.

During this analysis, relevant insights can be generated for the project. Insights can be generated in two ways: through surprises, where previously unknown information is discovered, or through belief opposition, where a hypothesis is challenged, and the obtained results defy initial expectations.

---
### **How to Write the Hypotheses?**

In the data science project, our goal is to understand and predict the daily sales of Rossmann stores, which is the **central phenomenon** in our investigation. Several **agents** influence these sales, such as **customers**, **stores**, and **products**.

Customers are individuals who make purchases in the stores, while stores have their own characteristics, such as **location** and **size**. Products are also fundamental, with attributes like **price**, **stock**, **promotions**, and **store layout**.

We also consider **temporal aspects**, such as **year**, **month**, **day**, **hour**, **week of the year**, **holidays**, and **special promotions**, to understand the seasonality and temporal patterns of sales. The **store's location** is relevant, taking into account proximity to other points of interest.

With these elements in mind, we can create a hypotheses mind map to be tested and validated. Each branch of the map symbolizes a hypothesis to be investigated. For instance, we might hypothesize that customers with high purchase volume have a positive impact on sales, while customers with high salaries might not have as significant an impact.

This hypotheses mind map can be built based on our prior knowledge or through brainstorming sessions with individuals from different areas. Each participant brings their insights into which elements affect sales.

The ultimate goal of this mind map is to generate a list of hypotheses to be tested during data analysis. During this analysis, we seek relevant insights that can confirm or refute the hypotheses. These insights can arise from surprising discoveries or when results contradict initial assumptions.

The hypotheses mind map guides our data science project, enabling a structured and focused approach to understand and predict the sales of Rossmann stores.

---
## **2.0 Feature Engineering**

When formulating hypotheses, it's crucial to understand that they function as assumptions or conjectures about the **phenomenon** we intend to model. For instance, a hypothesis could be: "**Larger stores tend to sell more**." Let's break down this conjecture. First and foremost, we should recognize that this is an assumption; we don't know if it's true yet. This hypothesis needs to be **validated based on the available data**.

Examining the hypothesis, we can identify key elements. In the mentioned example, the attribute is **store size**. "Sell" is the phenomenon we're trying to model, in this case, **store sales**. The words "Larger" and "More" characterize the assumption. In other words, we're positing that **larger stores should sell more**.

A useful approach involves **relating each attribute in the dataset to the phenomenon** we wish to model, making an assumption based on our intuition. For example, one might assume that if a certain attribute increases, the phenomenon also increases, or if the attribute decreases, the phenomenon decreases as well. These hypotheses will guide our investigation, allowing us to make assumptions about attributes and their relationship to the studied phenomenon.

However, it's important to stress that this relationship between attributes and the studied phenomenon **is not a cause-and-effect relationship**. For instance, we cannot definitively claim that larger stores sell more solely due to their size. Sales might increase due to various other factors, such as a **higher number of customers**, which could be influenced by store size. Hence, we must understand that these hypotheses establish a **correlation between attributes and the phenomenon**.

Now, let's present more examples of hypotheses derived from the mind map:

1. "**Stores with a wider assortment tend to sell more**" - Here, "Assortment" is an attribute of the 'store' agent, "sell" is the phenomenon, and "wider" and "more" represent the assumption.
2. "**Stores with more nearby competitors tend to sell less**" - In this case, "Competitors" is an attribute of the 'store' agent, "sell" is the phenomenon, and "more" and "less" represent the assumption.

---
## **2.1 Hypotheses Mind Map**

### **2.1.1 Store Hypotheses**
This subsection lists the hypotheses related to stores, starting with the number of employees. Based on the mind map, we can write them as a numbered list for easier reading:
1. Stores with a **larger number of employees** sell more.
2. Stores with a **higher stock capacity** sell more.
3. Larger stores sell more.
4. Smaller stores sell less.
5. Stores with a **wider assortment** sell more.
6. Stores with **closer competitors** sell less.
7. Stores with **longer-standing competitors** sell more.

It's important to highlight that these hypotheses will be validated during the exploratory data analysis phase. Each of them will be analyzed based on the available data, allowing us to confirm or refute their influence on sales.

### **2.1.2 Product Hypotheses**
1. Stores that invest more in **Marketing** sell more.
2. Stores with more products **displayed in the showcases** sell more.
3. Stores with **lower prices** on products sell more.
4. Stores that maintain **lower prices for a longer period** on products sell more.
5. Stores with **more aggressive promotions** sell more.
6. Stores with **promotions active for a longer duration** sell more.
7. Stores with more **promotion days** sell more.
8. Stores with more **consecutive promotions** sell more.

### **2.1.3 Time Hypotheses**

1. Stores open during the Christmas holiday **tend to sell more**.
2. Store sales **should increase over the years**.
3. Store sales **should be higher in the second half of the year**.
4. Store sales **should increase after the 10th day of each month**.
5. Store sales **should be lower on weekends**.
6. Store sales **should be lower during school holidays**.

---
## **2.2 Final List of Hypotheses**

An important step is **prioritizing hypotheses** that we will use during the data analysis. To do this, we use a simple and effective criterion: the **availability of the necessary data at the moment**.

Some hypotheses can be confirmed or refuted using the data already available, while others require time to collect, organize, and prepare the data for analysis. Therefore, we **prioritize hypotheses that can be validated immediately**, provided they are relevant to the model in question:

1. Stores with a **greater variety of products** tend to sell more.
2. The **proximity of competitors** negatively impacts store sales.
3. Stores with **established competitors for a longer time** have higher sales.
4. **Promotions active for a longer time** tend to boost store sales.
5. Stores with **longer promotion duration** experience increased sales.
6. The **number of promotion days** is positively related to store sales.
7. Stores with **consecutive promotions** see an increase in sales.
8. Stores **open during the Christmas holiday** experience higher sales.
9. Store sales **tend to increase over the years**.
10. Store sales **are higher in the second half of the year**.
11. After the **10th day of each month**, store sales have an increase.
12. Store sales **are lower on weekends**.
13. Store sales **are lower during school holidays**.

Among the **store-related hypotheses**, the idea that **stores with a larger number of employees tend to have higher sales** depends on specific data about the number of employees, which is **currently not available**. Similarly, the hypothesis that **stores with a larger stock capacity sell more** requires information about stock, which we don't have at the moment. The hypotheses about **store size** are also left out of the final list above, since they cannot be checked with the data at hand. What we can test are the hypotheses about **assortment** and **competitors**, because those attributes are already present in our dataset.

Regarding **products**, the hypothesis that **stores that invest more in marketing have higher sales** is plausible, as effective marketing can attract more customers. However, we lack specific information about marketing investments at the moment. The same applies to the hypothesis that **stores that display more products in their showcases have higher sales**: visible displays might catch customers' attention, but we have no data on showcases.

Regarding **product prices**, we can consider the hypothesis that **stores with lower prices have higher sales**, reflecting consumers' preference for affordable prices. However, we lack specific information about product prices currently. Similarly, the hypothesis that **stores with more aggressive promotions and larger discounts have higher sales** requires data about discount sizes, which is not available right now. What the dataset does record is when promotions were active, which is why the promotion-duration hypotheses remain in the list.

Finally, in the time-related hypotheses, we can consider the hypothesis that **stores have higher sales during the Christmas holiday**, utilizing the holiday records present in our dataset. Additionally, we can explore the hypotheses that **sales increase over the years, are higher in the second half of the year and after the 10th day of each month, and are lower on weekends and during school holidays**, based on the temporal information available, such as date, month, and year.

These are the hypotheses that we can evaluate at the moment, considering the availability of data. Each of them will be analyzed and tested during the data analysis process, aiming to understand the correlation and strength of these relationships with the sales phenomenon.
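
The prioritization rule above (keep a hypothesis only if its data is available now) can be sketched as a simple set check. The hypothesis labels and the column names `n_employees` and `school_holiday` below are illustrative assumptions, not the notebook's actual schema.

```python
# Columns assumed to be available right now (illustrative subset)
available_columns = {"sales", "date", "assortment", "state_holiday", "school_holiday"}

# Each hypothesis is mapped to the columns needed to test it
requirements = {
    "Wider assortment sells more": {"assortment", "sales"},
    "More employees sell more": {"n_employees", "sales"},  # hypothetical column
    "Lower sales during school holidays": {"school_holiday", "sales"},
}

# A hypothesis is testable now if all of its columns are available
testable_now = [h for h, cols in requirements.items() if cols <= available_columns]

# For the others, record which columns are still missing
postponed = {
    h: sorted(cols - available_columns)
    for h, cols in requirements.items()
    if not cols <= available_columns
}

print(testable_now)
print(postponed)
```

Hypotheses in `postponed` are not discarded; they can return in a later project cycle once the missing data has been collected.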


## **2.3 Feature Engineering**

After finalizing the list of **hypotheses**, we proceed to create the necessary **variables** for the sales prediction model. The variables include **temporal** information such as **year**, **month**, **day**, **week of the year**, **week of the month**, and **day of the week**, which will be used for temporal analyses.

We will also derive **important variables** such as `promo_since`, the **date on which the promotion started**, and `competition_since`, the **date on which the competition started**. From them we compute how long each has been in place, which will help us understand the impact of **promotions** and **competition** on sales.

Another **relevant variable** is **`state_holiday`**, which records whether a day is a state **holiday** and of which type, allowing us to evaluate how holidays influence sales.

This step is simple but **fundamental** to enrich the data and enable more accurate analyses.

In [None]:
df2 = df1.copy()


### **2.3.1 Derivation of Dates**

The **`date`** column must be in **datetime** format (converted with the **to_datetime** method if necessary) for the "dt" accessor to work. From it we extract the **`year`**, **`month`**, **`day`** (day of the month), and **`week_of_year`**, storing each in a column of the same name.

We also create the **`year_week`** column, a string in `YYYY-WW` format produced with the **strftime** method.

These **new variables** will be useful for **future analyses** in the sales prediction project.

In [None]:
# Create a new column 'year' in the dataframe 'df2' with the values of the year extracted from the 'date' column.
df2['year'] = df2['date'].dt.year

# Create a new column 'month' in the dataframe 'df2' with the values of the months extracted from the 'date' column.
df2['month'] = df2['date'].dt.month

# Create a new column 'day' in the dataframe 'df2' with the values of the days extracted from the 'date' column.
df2['day'] = df2['date'].dt.day

# Create a new column 'week_of_year' in the dataframe 'df2' with the ISO week numbers extracted from the 'date' column.
# (Series.dt.weekofyear was removed in pandas 2.0; isocalendar() replaces it.)
df2['week_of_year'] = df2['date'].dt.isocalendar().week.astype(int)

# Create a new column 'year_week' in the dataframe 'df2' by concatenating the year and week number in the 'YYYY-WW' format, obtained from the 'date' column.
df2['year_week'] = df2['date'].dt.strftime('%Y-%W')
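
One caveat, shown here as a small sketch on two arbitrary dates: ISO week numbers (what `week_of_year` holds) and the `%W` directive used for `year_week` follow different conventions, so the two columns do not always agree.

```python
import pandas as pd

# Two example dates chosen to expose the difference
dates = pd.Series(pd.to_datetime(["2015-07-31", "2016-01-01"]))

# ISO 8601 week: week 1 is the week containing the first Thursday of the year
iso_week = dates.dt.isocalendar().week.astype(int)

# %W: weeks start on Monday, and days before the first Monday fall in week 00
year_week = dates.dt.strftime("%Y-%W")

print(iso_week.tolist())   # [31, 53]
print(year_week.tolist())  # ['2015-30', '2016-00']
```

Neither convention is wrong, but a join or group-by that mixes the two columns could misalign weeks, so it may be safer to pick one and use it consistently.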


### **2.3.2 Competition Since**

To calculate the time between two dates in the context of our sales prediction project, both dates must be available. We already have the sale date in the "**`date`**" column. The competition start, however, is split across two columns, "**`competition_open_since_year`**" and "**`competition_open_since_month`**". To calculate the elapsed time, we first **combine this information into a single date** and then perform the **subtraction between the two dates**.

For this purpose, we use the "**datetime**" class from the module of the same name. We pass the **year and month** of each row, fix the day at 1 because no day is recorded, and store the result in a new column called "**`competition_since`**". The "**apply**" method in combination with a **lambda function** and `axis=1` performs this operation row by row.

The result of this operation is **a new date column that represents the month in which the competition started**.

In [None]:
# Creates 'competition_since', a combined date from the 'competition_open_since_year' (year) and 'competition_open_since_month' (month) columns, setting the day as 1.
import datetime  # int() below guards against the two columns being stored as floats

df2['competition_since'] = df2.apply(lambda x: datetime.datetime(
    year=int(x['competition_open_since_year']), month=int(x['competition_open_since_month']), day=1), axis=1)
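
A vectorized alternative to the row-wise `apply` is sketched below on a toy frame with invented values. It relies on `pd.to_datetime` accepting a DataFrame whose columns are named `year`, `month`, and `day`; the helper name `parts` is introduced here only for illustration.

```python
import pandas as pd

# Toy data with the same column names as the notebook (values are invented)
toy = pd.DataFrame({
    "competition_open_since_year": [2008, 2015],
    "competition_open_since_month": [9, 1],
})

# Assemble the date components; the day is fixed at 1
parts = pd.DataFrame({
    "year": toy["competition_open_since_year"],
    "month": toy["competition_open_since_month"],
    "day": 1,
})

# pd.to_datetime builds one timestamp per row from the year/month/day columns
toy["competition_since"] = pd.to_datetime(parts)

print(toy)
```

On a dataset with around a million rows this is likely to be noticeably faster than calling `datetime.datetime` once per row.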

Next, we subtract this date from the sale date and divide the number of days by 30 to express the result in months. A fixed 30-day month is an approximation, but it is enough to keep the monthly granularity we want here.

The result of this operation is an integer giving the approximate competition time in months. It is negative when the competition starts after the sale date.

In [None]:
# Calculates the competition time in months for each record in 'df2' by subtracting 'competition_since' from 'date'
competition_months = (df2['date'] - df2['competition_since']).dt.days / 30

# Fills NaN values with 0 (or another appropriate value for your context) before casting,
# because NaN cannot be converted to an integer
df2['competition_time_month'] = competition_months.fillna(0).astype(int)

The resulting value will be stored in a **new column** named **`competition_time_month`**. This column will represent the **time elapsed in months** since the beginning of the competition. To ensure that the values are in **numeric format**, we will use the **astype(int)** method.

This is the procedure used to **calculate the time in months since the start of the competition**, using the information available in our dataset.
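
As a small worked example of this calculation, the sketch below uses three invented rows: a competitor that opened earlier in the year, one that opens after the sale date, and one with no recorded date.

```python
import pandas as pd

# Invented dates: past competitor, future competitor, missing competitor
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-07-31"] * 3),
    "competition_since": pd.to_datetime(["2015-01-01", "2015-09-01", None]),
})

# Difference in days divided by 30 approximates the number of months
months = (toy["date"] - toy["competition_since"]).dt.days / 30

# Missing dates yield NaN, so fill them before casting to int
toy["competition_time_month"] = months.fillna(0).astype(int)

print(toy["competition_time_month"].tolist())  # [7, -1, 0]
```

The 211 days between 2015-01-01 and 2015-07-31 become 7 months, the future opening gives a negative value, and the missing date is replaced by 0.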


### **2.3.3 `Promo Since`**
Firstly, we create a new column called **`promo_since`** in the DataFrame **`df2`**. This column is created by concatenating the columns **`promo2_since_year`** and **`promo2_since_week`**, which are converted to string format. The result is a representation of the start date of the promotion.

In [None]:
# Importing the necessary library
import datetime

# Creating the 'promo_since' column
df2['promo_since'] = pd.to_datetime(df2['promo2_since_year'].astype(str) + '-W' +
                                    df2['promo2_since_week'].astype(str) + '-1', format='%Y-W%W-%w')

# Adjusting 'promo_since' by one week
df2['promo_since'] -= pd.DateOffset(weeks=1)

# Creating 'promo_time_week'
df2['promo_time_week'] = ((df2['date'] - df2['promo_since']) / np.timedelta64(1, 'W')).astype(int)

The concatenated string (for example `2015-W31-1`) is parsed by **`pd.to_datetime`** with the format `'%Y-W%W-%w'`. The **`+ '-1'`** part is added to specify the first day (Monday) of the indicated week of the year. We then subtract one week with **`pd.DateOffset(weeks=1)`** to adjust the date.

Finally, we create a new column called **`promo_time_week`** in the **`df2`** DataFrame. This column is obtained by calculating the difference in weeks between the **`date`** column (sales date) and the **`promo_since`** column (promotion start date).

To get it, we subtract the two dates and divide the resulting timedelta by **`np.timedelta64(1, 'W')`**, which expresses the difference in weeks. Converting the result to the integer (**int**) type drops the fractional part before it is stored in the **`promo_time_week`** column.
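
To see what the `%W` and `%w` directives produce, the sketch below repeats the calculation for a single invented year/week pair and sale date using only the standard library's `strptime`. It is an independent check on one value, not the notebook's vectorized code.

```python
import datetime

# Invented promotion start: week 31 of 2015
year, week = 2015, 31

# '%W' is the week number (weeks start on Monday), '%w' = 1 selects Monday
monday = datetime.datetime.strptime(f"{year}-W{week}-1", "%Y-W%W-%w")

# Shift back by one week, as in the notebook
promo_since = monday - datetime.timedelta(weeks=1)

# Whole weeks elapsed between the promotion start and an invented sale date
sale_date = datetime.datetime(2015, 8, 20)
promo_time_week = int((sale_date - promo_since) / datetime.timedelta(weeks=1))

print(monday.date(), promo_since.date(), promo_time_week)
```

Week 31 resolves to Monday 2015-08-03, the adjusted start is 2015-07-27, and the sale on 2015-08-20 falls 3 full weeks later.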

In [None]:
df2.head().T

### **2.3.4 `Assortment`**
We build a dictionary that maps each code in the **`assortment`** column to a descriptive label and apply it with the **map** method: **a** becomes **`basic`**, **b** becomes **`extra`**, and **c** becomes **`extended`**.

By doing this, we are effectively mapping the different values present in the **`assortment`** column to more descriptive and understandable categories: **`basic`**, **`extra`**, and **`extended`**.

In [None]:
# Creates a mapping dictionary for 'assortment'
assortment_dict = {'a': 'basic', 'b': 'extra', 'c': 'extended'}

# Updates the 'assortment' column using the dictionary
df2['assortment'] = df2['assortment'].map(assortment_dict)
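
One behavior of `map` with a dictionary is worth keeping in mind: any value missing from the dictionary becomes `NaN` rather than raising an error. The sketch below shows this on a toy series; the code `'x'` is made up for illustration.

```python
import pandas as pd

assortment_dict = {"a": "basic", "b": "extra", "c": "extended"}

# 'x' is an invented code that the dictionary does not cover
codes = pd.Series(["a", "b", "c", "x"])
mapped = codes.map(assortment_dict)

# Codes that could not be mapped show up as NaN
unmapped = codes[mapped.isna()]

print(mapped.tolist())
print(unmapped.tolist())
```

Checking `mapped.isna().sum()` after the mapping is a cheap way to confirm that every code was covered.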


## **Rossmann Project Status**

In the first step, we performed **data description**, summarizing the information through measures of **descriptive analysis**. Additionally, we **handled missing values** and worked with different **data types**.

Moving on to the second step, we worked on **identifying relevant features**, creating a **mind map of hypotheses**. We defined the **phenomenon we want to model**, identified the **involved agents and their attributes**, and formulated a **list of hypotheses**. We **prioritized these hypotheses** based on the data available at the moment.

Now, we are entering the **third step**, which deals with **variable filtering**. It's important to understand the difference between **filtering** and **selecting variables**.

**Variable filtering** involves **excluding or retaining variables** based on specific criteria, such as relevance, data quality, correlations, among others. On the other hand, **variable selection** refers to choosing the most important variables for analysis or predictive modeling, aiming to reduce dimensionality and increase computational efficiency.

In the previous steps, we conducted **data collection**, downloading the data from the Kaggle platform. We then proceeded with the **data cleaning step**, where we described the data and addressed potential quality issues. Moving forward, we reached the **data exploration step**, which we started in **step 2**, and we will continue through the stages of **Data Modeling, Machine Learning Algorithms, Algorithm Evaluation, and finally, deploying the Model into Production**.

---
## **3.0 Variable Filtering**

### **Filtering**
The motivation behind **variable filtering** is to address **constraints imposed by the business context**, ensuring that the **developed model can be successfully implemented** and meet the **company's needs**.

Often, when starting a data science project, all steps are carried out, but in the end, it is discovered that the **model cannot be put into production**. This happens **mainly when business constraints have not been considered from the beginning of the project**.

One **solution to avoid this situation is to think about business constraints early in the project**, even before starting to explore the data. That's why **we have included this step in our process**.

### **Selecting**
**Variable selection** is related to **identifying the most relevant variables for the model**. In this process, the algorithm examines the **correlations between the input variables and the target variable**, as well as the **relationships among the input variables themselves**. Based on this analysis, the algorithm decides which variables are more relevant for the model.

However, it's important to highlight that **variable selection does not consider business constraints**. This responsibility falls on the **data scientist**, who needs to understand the processes and constraints faced by the business teams. It's crucial to identify which constraints and data issues are relevant to the specific context and incorporate them into the model.

Therefore, the **variable filtering** in this **Rossmann sales forecasting project** takes into account both the **selection of variables relevant to the model** and the **specific constraints and issues faced by the business teams**. This approach allows for the development of a **more accurate model aligned with the company's needs**.

In [None]:
df3 = df2.copy()

### **3.1 Filtering Rows**

The column **`customers`** indicates the number of customers present in the store on the day of the recorded sales. However, for our project, this information is not available for the next six weeks as we don't know the quantity of customers that will be present during that period. Therefore, we cannot **use this column as an input variable** for sales forecasting.

Another relevant column is **`open`**, which indicates whether the store was open or closed on the corresponding day. When the store is closed, sales are recorded as zero. In this case, there is no meaningful learning, as sales are expected to be zero when the store is closed. Therefore, **we have chosen to exclude the rows where the `open` column is equal to zero**, indicating that the store was closed. By doing this, we are **keeping only the sales that occurred when the stores were open**.

With these considerations, **we have selected the relevant columns for our sales forecasting model**, excluding information not available at the prediction time and **filtering only the sales that occurred when the stores were open**. This approach ensures that we are using the available data appropriately and aligning with the business constraints of the project.

In [None]:
# Define conditions for data selection
stores_open = df3['open'] != 0
positive_sales = df3['sales'] > 0

# Selects the data that meet the conditions
df3 = df3[stores_open & positive_sales]

Here, we are **filtering the DataFrame `df3`**. We select only the rows where the **`open`** column is not equal to zero (the store was open) and the **`sales`** column is greater than zero (sales were recorded). This way, we are **removing the rows where the store was closed or sales were equal to zero**.

### **3.2 Filtering Columns**

We have created a list called **cols_drop** containing the names of the columns we want to remove, such as **`customers`**, **`open`**, **`promo_interval`**, and **`month_map`**.

Next, we use the **drop** method from pandas on the DataFrame **`df3`**, passing the list of columns to be excluded (**cols_drop**) as parameters and the argument **`axis=1`** to indicate that we are dropping columns.

In [None]:
# List of columns to remove
cols_drop = ['customers', 'open', 'promo_interval', 'month_map']

# Removing the listed columns (axis=1 indicates columns)
df3 = df3.drop(cols_drop, axis=1)

# Displaying the remaining columns
print(df3.columns)

## **4.0 Exploratory Data Analysis**

### **3 Objectives of EDA**

The objectives of Exploratory Data Analysis (EDA) can be summarized in three main points. Firstly, the goal is to **gain business understanding**, comprehending its operations and behavior through data. This is important to acquire skills that enable communication and information exchange with the business team, understanding their metrics and measures.

Secondly, EDA aims to **validate the business hypotheses created earlier**. Using the hypothesis map, the objective is to **generate insights and surprises** from the analyzed data. This step involves providing information that people didn't know yet, causing surprise, or **challenging existing beliefs**. By presenting results that counter expectations or break beliefs, an environment conducive to generating insights is created.

Finally, the third purpose of EDA is to **identify the relevant variables for the analysis model**. During the analysis, a sensitivity is developed to understand which variables impact the phenomenon under study. Later in the project, there will be a step where an algorithm is used to identify the most relevant variables. However, it's important to emphasize that this algorithm is not sufficient by itself, and **prior knowledge gained during the analysis is necessary to complement** the algorithm's suggestions.

In summary, EDA seeks to provide an in-depth understanding of the business through data, validate hypotheses, generate surprising insights, and identify the most relevant variables for the analysis model. This approach is essential to support strategic decisions and achieve more accurate and meaningful results.

### **3 Categories of Analysis**

**Exploratory analysis consists of three types of analysis**: `univariate, bivariate, and multivariate`.

**In `univariate analysis`**, we focus on **a single variable**, aiming to understand its characteristics such as minimum and maximum values, distribution, and variation. It's a way to study **each variable in isolation**, comprehending its behavior and traits.

**`Bivariate analysis`** explores **the impact of one variable on another**. In this case, we're interested in understanding **the relationship between two variables** and assessing whether there's correlation or hypotheses validation possible. We use **graphs and correlation measures** to describe the impact and evaluate its strength.

**Lastly, in `multivariate analysis`**, we consider **the relationship among multiple variables with the target variable**. Here, we're interested in comprehending how **different variables relate to each other and the target variable**. In some situations, **the combination of variables can result in a larger impact** than when analyzed individually. Thus, in multivariate analysis, we seek to understand **these interactions and their effects**.

In summary, **EDA covers univariate analysis** to understand each variable individually, **bivariate analysis** to assess the impact of one variable on another, and **multivariate analysis** to investigate relationships among multiple variables with the target variable. These analyses are essential to gain valuable insights from the data and assist informed decision-making in a data science project.
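As a rough illustration of the three categories, the sketch below applies one simple summary per category to a toy DataFrame. The values are made up, and the correlation-based summaries are just one possible choice; the notebook itself relies mainly on plots.

```python
import pandas as pd

# Toy data (hypothetical values), standing in for a few columns of the sales table
toy = pd.DataFrame({
    'sales': [5000, 6200, 4800, 7100, 5300, 6900],
    'promo': [0, 1, 0, 1, 0, 1],
    'competition_distance': [300, 1200, 250, 5000, 400, 2500],
})

# Univariate: summary of a single variable in isolation
print(toy['sales'].describe())

# Bivariate: strength of the linear relationship between one feature and the target
print(toy['sales'].corr(toy['promo']))

# Multivariate: pairwise correlations among all variables at once
print(toy.corr())
```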

---
## **4.1 Univariate Analysis**

In [None]:
df4 = df3.copy()

### **4.1.1 Response Variable**

In our univariate analysis, the **focus is on the sales variable**. We will use the **histplot** function from the **Seaborn library** (imported as `sns`) to **visualize the distribution** of this variable. It replaces the older **distplot** function, which is deprecated in recent Seaborn releases.

In the code, we set the graph style to **white**, define the figure size as 24x6, and create a **histogram** of the **sales** variable with 50 blue bins and without the kernel density curve. Then, **we add a title to the graph and labels to the axes**. Finally, we display the graph on the screen.

In [None]:
# Set the graph style
sns.set_style('white')

# Define the size of the graph
plt.figure(figsize=(24, 6))

# Generate the histogram of the 'sales' variable with 50 bins in blue color
sns.histplot(df4['sales'], kde=False, bins=50, color='blue')

# Add title and axis labels
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')

# Display the graph
plt.show()

When we execute this method, the expected result is a **distribution plot** that reveals the structure of our data. This graph allows us to assess crucial characteristics like **kurtosis and skewness of the distribution**. In this specific example, the distribution has positive skewness, indicated by the longer tail on the right side of the distribution. Even though the distribution isn't perfectly symmetric or normal, it's close enough to be compatible with most machine learning techniques.

How closely the data follow a **normal distribution** matters because **many machine learning algorithms, linear models in particular, tend to work better when the response variable is roughly symmetric**. The closer our response variable is to a normal distribution, the **better such models tend to perform**. When the distribution is strongly skewed, it might be necessary to apply a transformation, such as the **logarithmic transformation**, to bring the response variable closer to a normal distribution.

In [None]:
# Set the graph style
sns.set_style('white')

# Create a figure to contain the subplots
plt.figure(figsize=(24, 6))

# Create the first subplot
plt.subplot(1, 2, 1)
sns.histplot(df4['sales'], kde=False, bins=50, color='blue')
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')

# Create the second subplot
plt.subplot(1, 2, 2)
sns.histplot(np.log1p(df4['sales']), kde=False, bins=50, color='blue')
plt.title('Sales Distribution (log1p transformed)')
plt.xlabel('log1p(Sales)')
plt.ylabel('Frequency')

# Display the graphs
plt.show()

### **Plotly**

In [None]:
```python
# Import necessary libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create a figure with 2 subplots (1 row, 2 columns)
fig = make_subplots(rows=1, cols=2)

# Add the first histogram to the figure
fig.add_trace(
    go.Histogram(x=df4['sales'], nbinsx=50, name='Sales'),
    row=1, col=1
)

# Add the second histogram to the figure
fig.add_trace(
    go.Histogram(x=np.log1p(df4['sales']), nbinsx=50, name='Sales (log1p transformed)'),
    row=1, col=2
)

# Update the layout of the figure
fig.update_layout(
    title_text="Sales Distribution",
    xaxis_title_text='Sales',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.1
)

# Show the figure
fig.show()
```

We are plotting the **distribution of the sales variable** after applying the **logarithmic transformation `(np.log1p)`**, which helps to correct skewness. This transformation aims to bring the distribution closer to a **normal distribution**, benefiting the performance of many data analysis and machine learning models. The resulting graph facilitates understanding the shape of the transformed distribution, **aiding in the selection of appropriate modeling techniques**.
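The effect of the transformation can also be checked numerically. The sketch below uses a small made-up sample rather than the project data and compares the sample skewness before and after `np.log1p`; it also shows that `np.expm1` reverses the transformation, which matters when predictions made on the log scale need to be converted back to sales.

```python
import numpy as np
import pandas as pd

# Toy right-skewed sample (hypothetical values): many moderate sales, a few very large ones
sales = pd.Series([3000, 3500, 4000, 4200, 4500, 5000, 5500, 6000, 9000, 20000])

# Sample skewness before and after the transformation; values closer to 0 are more symmetric
skew_raw = sales.skew()
skew_log = np.log1p(sales).skew()
print(skew_raw, skew_log)

# log1p is inverted by expm1, so values on the log scale can be mapped back
recovered = np.expm1(np.log1p(sales))
print(recovered)
```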

---
### **4.1.2 Numerical Variable**
In this section, we analyze the numerical variables in the dataset. The numerical columns are held in `num_attributes`, and its `hist` method plots one histogram per variable.

A histogram is a graphical representation that shows the distribution of values for a numerical variable. It is divided into intervals (bins), and within each interval, the count of occurrences of the variable's values is tallied. This allows us to observe the concentration or dispersion of values, as well as to identify patterns or trends in the distribution.

By plotting histograms of numerical variables, we can obtain an overall view of the characteristics of these variables, such as the presence of outliers, the symmetry of the distribution, and the concentration of values in certain intervals. This analysis helps in understanding the nature of the data and identifying possible patterns or trends that could be relevant for data analysis or modeling.
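To make the binning concrete, here is a minimal sketch using `np.histogram` on made-up values (they are not taken from the dataset). It prints the bin edges and the number of values that fall into each bin, which is what the bars of a histogram represent.

```python
import numpy as np

# Toy values (hypothetical), e.g. distances to the nearest competitor in meters
values = np.array([50, 120, 300, 310, 450, 900, 1500, 4000, 8000, 20000])

# Split the observed range into 4 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=4)

# Each bin includes its left edge; the last bin also includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f'{left:9.1f} to {right:9.1f}: {count}')
```

Most values land in the first bin, which is the kind of concentration a histogram makes visible at a glance.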

In [None]:
# Plot the histograms with the specified settings (DataFrame.hist creates its own figure)
ax = num_attributes.hist(bins=25, grid=False, color='blue', edgecolor='black', alpha=0.7,
                         layout=(5, 3), figsize=(20, 15))

# Retrieve the figure that holds the histograms
fig = ax.flatten()[0].get_figure()

# Add a title to the graph and adjust font size
fig.suptitle('Histograms of Numerical Attributes', fontsize=20, y=0.92)

# Iterate through each subplot to adjust the font size of x and y axis labels
for subplot in ax.flatten():
    subplot.set_xlabel('Values', fontsize=12)
    subplot.set_ylabel('Frequency', fontsize=12)

# Adjust the spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the graph
plt.show()

The histogram of the **`competition_distance`** variable is concentrated at small values: most records belong to stores whose nearest competitor is close by, and only a few have competitors far away.

Keep in mind that each histogram counts sales records (rows), not sales amounts. The temporal variable indicating when competitors started operating shows an uneven pattern: the count peaks at the value four, drops, and rises again at seven. This **fluctuation is valuable information**, as a variable that changes across records gives the model more to learn from.

In contrast, the **`day_of_week`** variable has almost negligible variation. In other words, **the number of records is nearly the same for every day of the week**. On its own, this variable doesn't provide much relevant information for model learning unless analyzed in relation to other variables.

Another insight comes from the **`is_promo`** variable. Here, there are far more records without a promotion (**0**) than with one (**1**). How this imbalance relates to sales requires further exploration.

When observing the **`promo2_since_year`** variable, it can be seen that the count **peaks in 2013**, followed by a subsequent decline.

In summary, the analysis of these variables provides valuable insights into sales behavior. These patterns and trends can be useful for enhancing the predictability of our sales forecasting model.

---
### **4.1.3 Categorical Variable**
In this section, we analyze the categorical variables **`state_holiday`**, **`store_type`**, and **`assortment`**. For each one, we list its levels, count how often each level occurs, and compare the distribution of sales across levels.

In [None]:
# Create a DataFrame with the unique values of 'state_holiday'
state_holiday_unique = pd.DataFrame(df4['state_holiday'].drop_duplicates())

# Reset the DataFrame index
state_holiday_unique.reset_index(drop=True, inplace=True)

# Show the DataFrame
print(state_holiday_unique)

In this code, we are working with the **`state_holiday`** column. The **`drop_duplicates()`** function is applied to this column, which aims to identify and **remove any duplicate values, leaving only the unique values**.
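For comparison, the sketch below shows three pandas ways of inspecting the levels of a categorical column, using a made-up Series in place of `df4['state_holiday']`.

```python
import pandas as pd

# Toy column (hypothetical values) mimicking the levels of 'state_holiday'
state_holiday = pd.Series(
    ['regular_day', 'public_holiday', 'regular_day', 'christmas', 'regular_day'],
    name='state_holiday',
)

# drop_duplicates keeps the first occurrence of each value, along with its original index
print(state_holiday.drop_duplicates())

# unique returns the same levels as a plain array
print(state_holiday.unique())

# value_counts additionally reports how often each level occurs
print(state_holiday.value_counts())
```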

In [None]:
# Levels of the 'state_holiday' variable
df4['state_holiday'].drop_duplicates()

In [None]:
# Set the style of the graph to white background
sns.set_style('white')

# Function to generate the KDE plot
def generate_kdeplot(df, label, ax):
    sns.kdeplot(data=df['sales'], label=label, fill=True, ax=ax)

# Create a figure to contain the subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(24, 6))

# Calculate dataframes for reuse
df_holiday = df4[df4['state_holiday'] != 'regular_day']
df_public_holiday = df4[df4['state_holiday'] == 'public_holiday']
df_easter_holiday = df4[df4['state_holiday'] == 'easter_holiday']
df_christmas = df4[df4['state_holiday'] == 'christmas']

# Create the first subplot which is a bar plot of the 'state_holiday' column excluding 'regular_day'
ax1 = sns.countplot(x='state_holiday', data=df_holiday, ax=axes[0])
ax1.set_title('Count of Days by Holiday Type', fontsize=16)
ax1.set_xlabel('Holiday Type', fontsize=14)
ax1.set_ylabel('Count', fontsize=14)

# Add labels on each bar
for p in ax1.patches:
    ax1.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points', fontsize=12)

# Create the second subplot which is a density plot of 'sales' for each type of 'state_holiday'
ax2 = axes[1]

# Using the function to generate KDE plots
generate_kdeplot(df_public_holiday, 'Public Holiday', ax2)
generate_kdeplot(df_easter_holiday, 'Easter Holiday', ax2)
generate_kdeplot(df_christmas, 'Christmas', ax2)

ax2.set_title('Sales Distribution by Holiday Type', fontsize=16)
ax2.set_xlabel('Sales', fontsize=14)
ax2.set_ylabel('Density', fontsize=14)
ax2.legend(title='Holiday Type', fontsize=12)

# Adjust the layout for better visualization of the plots
plt.tight_layout(pad=3.0)

# Display the plots
plt.show()

In this code, we analyze the impact of different types of holidays - **`public_holiday`**, **`easter_holiday`**, and **`christmas`** - on sales.

In the first graph, we use the **`countplot`** from Seaborn to visualize the frequency of each type of holiday, **excluding `regular_day`**. This graph shows us the occurrence count of each type of holiday in our dataset.

In the second graph, we use the **`kdeplot`** function from Seaborn to plot the distribution of sales for each specific type of holiday, one curve per type. These curves illustrate how sales behave during different types of holidays.

We can observe that **`public_holiday`** has a significantly higher volume of sales, followed by **`christmas`** and **`easter_holiday`**. This provides us with an interesting insight - when trying to predict with the model, if the time period falls within a **christmas** or **easter** holiday, the model will be able to appropriately adjust for this increased volume.

The **`state_holiday`** variable will play a crucial role in our analyses and in the learning of our model.

---
### **`Store Type`**

In [None]:
# Levels of the 'store_type' variable
df4['store_type'].drop_duplicates()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Function to generate bar plots
def generate_countplot(df, column, ax, title, xlabel, ylabel, order=None):
    ax = sns.countplot(x=column, data=df, palette='viridis', order=order, ax=ax)
    ax.set_title(title, fontsize=16)
    ax.set_xlabel(xlabel, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)
    ax.grid(False)

    # Text annotation for each bar in the plot
    for p in ax.patches:
        ax.annotate(format(p.get_height()),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha = 'center',
                    va = 'center',
                    xytext = (0, 10),
                    textcoords = 'offset points',
                    fontsize=10)

# Function to generate kde plots
def generate_kdeplot(df, column, value, label, ax, xlabel):
    sns.kdeplot(data=df[df[column] == value]['sales'], label=label, fill=True, ax=ax)
    ax.set_title('Sales Distribution by ' + xlabel, fontsize=16)
    ax.set_xlabel('Sales', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.legend(title=xlabel)
    ax.grid(False)

# Define the style for the plots
sns.set_style('white')

# Define the figure size
fig, axes = plt.subplots(2, 2, figsize=(24, 6))

# Plot 1 - state_holiday
df_holiday = df4[df4['state_holiday'] != 'regular_day']
generate_countplot(df_holiday, 'state_holiday', axes[0, 0], 'Day Count by Holiday Type', 'Holiday Type', 'Count')

# Plot 2 - state_holiday
generate_kdeplot(df4, 'state_holiday', 'public_holiday', 'Public Holiday', axes[0, 1], 'Holiday Type')
generate_kdeplot(df4, 'state_holiday', 'easter_holiday', 'Easter Holiday', axes[0, 1], 'Holiday Type')
generate_kdeplot(df4, 'state_holiday', 'christmas', 'Christmas', axes[0, 1], 'Holiday Type')

# Plot 3 - store_type
generate_countplot(df4, 'store_type', axes[1, 0], 'Store Count by Type', 'Store Type', 'Count', ['a', 'b', 'c', 'd'])

# Plot 4 - store_type
generate_kdeplot(df4, 'store_type', 'a', 'Store a', axes[1, 1], 'Store Type')
generate_kdeplot(df4, 'store_type', 'b', 'Store b', axes[1, 1], 'Store Type')
generate_kdeplot(df4, 'store_type', 'c', 'Store c', axes[1, 1], 'Store Type')
generate_kdeplot(df4, 'store_type', 'd', 'Store d', axes[1, 1], 'Store Type')

plt.tight_layout(pad=3.0)  # Adjust layout for better plot visualization

plt.show()  # Display the plots

In this visualization, we are analyzing the sales volume associated with different store categories. It can be observed that stores of **type a** display a considerably high sales volume, although they do not exhibit as prominent a sales peak as stores of **type d**. In contrast, stores of **type b**, despite recording a substantial sales volume, do not reach such a pronounced sales peak.
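Because each density curve is normalized on its own, the height of a peak shows how concentrated sales are around a value, not how much a store type sells in total. A numeric summary can complement the plots. The sketch below is one way to do that with `groupby`, shown on made-up data rather than `df4`.

```python
import pandas as pd

# Toy data (hypothetical values) with a store type and a daily sales figure
toy = pd.DataFrame({
    'store_type': ['a', 'a', 'a', 'b', 'd', 'd'],
    'sales': [5000, 5200, 5100, 9000, 6000, 6100],
})

# Number of records, total sales, and median daily sales per store type
summary = toy.groupby('store_type')['sales'].agg(['count', 'sum', 'median'])
print(summary)
```

In this toy example, type a has the largest total simply because it has the most records, while type b has the highest median.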

---
### **`Assortment`**

In [None]:
# Levels of the 'assortment' variable
df4['assortment'].drop_duplicates()

In [None]:
# Function to generate bar plots
def plot_bar(df, column, title, xlabel, ylabel, subplot, order=None):
    plt.subplot(3, 2, subplot)
    sns.countplot(data=df, x=column, order=order)
    plt.title(title, fontsize=16)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)

# Function to generate probability density plots
def plot_kde(df, column, values, labels, title, xlabel, ylabel, subplot):
    plt.subplot(3, 2, subplot)
    for value, label in zip(values, labels):
        sns.kdeplot(data=df[df[column] == value]['sales'], label=label, fill=True)
    plt.title(title, fontsize=16)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    plt.legend()

plt.figure(figsize=(24, 6))

# Bar plot and probability density plot for 'state_holiday'
plot_bar(df4[df4['state_holiday'] != 'regular_day'], 'state_holiday', 'Holiday Count', 'Holiday Type', 'Count', 1)
plot_kde(df4, 'state_holiday', ['public_holiday', 'easter_holiday', 'christmas'], ['Public Holiday', 'Easter Holiday', 'Christmas'],
         'Sales Distribution by Holiday Type', 'Sales', 'Density', 2)

# Bar plot and probability density plot for 'store_type'
plot_bar(df4, 'store_type', 'Store Type Count', 'Store Type', 'Count', 3, ['a', 'b', 'c', 'd'])
plot_kde(df4, 'store_type', ['a', 'b', 'c', 'd'], ['Store Type A', 'Store Type B', 'Store Type C', 'Store Type D'],
         'Sales Distribution by Store Type', 'Sales', 'Density', 4)

# Bar plot and probability density plot for 'assortment'
plot_bar(df4, 'assortment', 'Assortment Count', 'Assortment Type', 'Count', 5)
plot_kde(df4, 'assortment', ['extended', 'basic', 'extra'], ['Extended Assortment', 'Basic Assortment', 'Extra Assortment'],
         'Sales Distribution by Assortment Type', 'Sales', 'Density', 6)

plt.tight_layout()
plt.show()


The **basic** and **extended** assortments show the most pronounced peaks in the sales distribution. Although the **extra** assortment exhibits a smaller peak, it stands out due to its broad and consistent sales distribution.

## **4.2 Bivariate Analysis**

In this section, we will conduct a bivariate analysis to **validate the hypotheses** established during the **feature engineering** phase, specifically in Step 2. We will review the selected hypotheses using the data available up to this point and proceed with a detailed evaluation of them.

### **H1. Stores with Larger Assortments Sell More**
The first hypothesis to be validated is that stores with larger assortments tend to have higher sales volumes. To explore this relationship, we will investigate the response variable in relation to the size of the store assortments.

In [None]:
def plot_total_sales(df, group_column, x_label, y_label, title, color_palette="viridis"):
    # Grouping and resetting the index
    aux = df[[group_column, 'sales']].groupby(group_column).sum().reset_index()

    # Setting a softer color palette
    sns.set_palette('Set2')

    # Configuring the plot size to fit the notebook area
    fig, ax = plt.subplots(figsize=(24, 6))  # Adjust the figure dimensions as needed

    # Creating the bar plot using Seaborn
    sns.barplot(x=group_column, y='sales', data=aux, palette=color_palette, ax=ax)

    # Setting the title and axis labels
    ax.set_title(title, fontsize=16)
    ax.set_xlabel(x_label, fontsize=14)
    ax.set_ylabel(y_label, fontsize=14)

    # Removing the plot's borders
    sns.despine()

    # Adjusting the layout to optimize space usage
    plt.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)

    # Displaying the plot
    plt.show()

# Calling the function
plot_total_sales(df4, 'assortment', 'Assortment Type', 'Total Sales', 'Total Sales by Assortment Type')

In our quest to understand the impact of product assortment on store sales, we embarked on a detailed analysis centered around two columns in our dataset: **`assortment`** and **`sales`**.

### **The step-by-step of our analysis**
- **Grouping by Assortment**: The data was grouped based on the assortment category of each store. This is crucial to understand how each type of product mix behaves.
- **Summing up Sales**: For each assortment type, we calculated the total sales. This calculation provides us with a clear view of the sales performance for each group.

### **What did we discover?**

Upon observing the total sales per assortment type, we noticed something intriguing: stores with the **`Extra`** assortment, despite seemingly having a larger product mix, **exhibit lower sales compared to other categories**. This contradicts our initial hypothesis that a larger assortment would result in higher sales.

### **Possible explanations and next steps**
An important question arises: has this scenario **always been this way or has there been a change in behavior over time?** Perhaps the **Extra** assortment was more popular in the past but is now declining. To answer this, we need to analyze how total sales evolve over time, not just the total per assortment category.

From here, we can investigate whether there have been significant changes in sales over time and validate our hypothesis.

Our next step is to expand our analysis: we will make a copy of the data, including the **`year_week`** variable. This way, we will perform a new sum of sales, considering both the assortment category and the week of the year.

After these steps, we will illustrate the identified trends through graphs. By doing so, we will continue to validate our hypotheses and gain valuable insights for more informed decision-making.

This analysis helps us understand the relationship between store product assortments and sales, providing a more comprehensive view of business performance over time.

In [None]:
# Setting a more subtle color palette.
sns.set_palette('Set2')

# Adjusting the figure size to occupy a larger area.
fig, ax = plt.subplots(figsize=(24, 6))

# Creating a DataFrame grouped by 'assortment' and 'year_week', and summing the 'sales'.
aux2 = df4[['year_week','assortment', 'sales']].groupby(['assortment', 'year_week']).sum().reset_index()

# Creating a line plot from the pivot DataFrame.
pivot_df = aux2.pivot(index = 'year_week', columns = 'assortment', values = 'sales')
pivot_df.plot(ax=ax)

# Removing the grids.
ax.grid(False)

# Adding title to the graph and axes.
ax.set_title('Weekly Sales Segmented by Assortment', fontsize=20, pad=20)
ax.set_xlabel('Week of Year', fontsize=16)
ax.set_ylabel('Sales', fontsize=16)

# Adjusting the legend.
ax.legend(title='Assortment Type', title_fontsize='15', fontsize='14')

# Displaying the graph.
plt.show()

**Parallel Behavior**: The **`basic`** and **`extended`** categories exhibit similar sales trajectories. Even though the **`basic`** category shows a slightly higher sales volume, both categories follow a fairly similar sales trend over time.

**Contrast**: In contrast, the **`extra`** category stands out due to its significantly lower sales volume. That is, the sales in the **`basic`** and **`extended`** categories are so substantial that, when presented on the same scale of the graph, they make the sales in the **`extra`** category appear insignificant in comparison.

In [None]:
# Increasing the size of the graph
plt.figure(figsize=(24, 6))

# Setting a softer color palette
sns.set_palette('Set2')

# Selecting only the data for the 'extra' category
aux3 = aux2[aux2['assortment'] == 'extra']

# Creating the graph
ax = sns.lineplot(x='year_week', y='sales', data=aux3, color='royalblue')

# Adjusting the titles and axis labels
ax.set_title('Weekly Sales of the Extra Category Over Time', fontsize=16)
ax.set_xlabel('Week of the Year', fontsize=14)
ax.set_ylabel('Sales', fontsize=14)

# Rotating x-axis labels and displaying only one label every 20 labels
for ind, label in enumerate(ax.get_xticklabels()):
    if ind % 20 == 0:  # adjust this number as needed
        label.set_visible(True)
        label.set_rotation(45)
    else:
        label.set_visible(False)

# Removing the graph's spine
sns.despine()

plt.show()

We initiated our analysis with a specific focus on the **Extra** assortment category. Within this context, we aimed to comprehend the unique characteristics and specific sales patterns of this category. This allowed us to isolate and examine the sales performance of the **Extra** assortment in a detailed and meticulous manner.

Our investigation unveiled a **non-linear sales pattern** for the **Extra** category, as depicted in the graph. It's worth emphasizing the importance of paying special attention to the Y-axis of the graph. The variation in sales can be substantial, and in situations where some values are much more prevalent than others, the visual representation needs to be adapted to ensure clarity and accuracy of the displayed information.
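One possible way to adapt the representation, not used in the notebook itself, is to rescale each series relative to its own starting point so that categories of very different size can share one chart. The sketch below illustrates the idea on made-up weekly totals; the column names and values are hypothetical.

```python
import pandas as pd

# Toy weekly totals (hypothetical values), one column per assortment
weekly = pd.DataFrame(
    {'basic': [1000000, 1100000, 1050000], 'extra': [10000, 14000, 9000]},
    index=['2013-00', '2013-01', '2013-02'],
)

# Express each series relative to its own first week (first week = 100)
indexed = weekly / weekly.iloc[0] * 100
print(indexed)
```

After rescaling, the relative swings of the small category become visible next to the large one. Plotting the original totals with a logarithmic y-axis (`weekly.plot(logy=True)`) is another option.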

The analysis of the data led us to **surprising conclusions**, challenging the initial assumption. We discovered that stores with a broader variety of products, represented here by the **Extra** category, actually exhibit lower sales. This intriguing finding points to a sales behavior in the **Extra** category that is distinctly different from the other categories.

Ultimately, these observations confirm the conclusion that the range of products available in a store doesn't necessarily translate into **higher sales**. Contrary to the common assumption that a wider assortment would attract a more diverse customer base and subsequently lead to increased sales, we realize that the **Extra** category, with the broadest variety of products, underperforms in terms of sales compared to other categories.

### **H2. Stores with closer competitors have lower sales**
In order to validate the hypothesis, we will conduct an analysis of sales in relation to the **competition distance**.

For this analysis, we have selected the relevant columns, namely **`competition_distance`** and **`sales`**, as they are essential for our study.

Next, we will perform **data grouping** based on the **competition distance**, aggregating the corresponding sales for each distance group.

This approach will provide us with insights into how sales are related to the proximity of competitors, enabling a deeper understanding of the market behavior concerning this specific aspect.

In [None]:
# Figure size configuration
plt.figure(figsize=(24, 6))

# Set the style and color palette of the graph
sns.set_style('white')
sns.set_palette('Set2')

# Grouping the data by competition distance and summing the corresponding sales
aux1 = df4[['competition_distance', 'sales']].groupby('competition_distance').sum().reset_index()

# Defining contrast colors
bar_color = 'steelblue'
bg_color = 'white'
label_color = 'black'

# Bar plot
sns.barplot(x='competition_distance', y='sales', data=aux1, color=bar_color)

# Setting background and label colors
sns.set(style='whitegrid', rc={'axes.facecolor': bg_color, 'grid.color': label_color})
plt.rcParams['text.color'] = label_color

# Setting labels and titles
plt.title('Sales by Competition Distance', fontsize=16, color=label_color)
plt.xlabel('Competition Distance (in meters)', fontsize=14, color=label_color)
plt.ylabel('Sales', fontsize=14, color=label_color)

# Reducing the frequency of labels on the X axis
ticks = plt.xticks()[0]
n = len(ticks)
step = max(1, n // 10)  # Show approximately 10 labels on the X axis; avoids a zero step when there are fewer than 10 ticks
plt.xticks(ticks[::step], aux1['competition_distance'].iloc[::step], rotation=45, color=label_color, ha='right')

plt.yticks(color=label_color)

# Adjust layout to avoid overlaps
plt.tight_layout()

# Display the graph
plt.show()

With one bar per distinct distance value, the data is too granular to reveal a pattern. To improve the visualization, we group the distances into intervals, for instance of 1,000 meters each, and aggregate the sales within each interval. This gives a clearer view of the trend and makes the relationship between competitor distance and sales easier to analyze.

In [None]:
# Select relevant columns for analysis
aux1 = df4[['competition_distance', 'sales']].groupby('competition_distance').sum().reset_index()

# Define intervals for grouping competitor distance
bins = list(np.arange(0, 20000, 1000))

# Group data by competitor distance into intervals
aux1['competition_distance_binned'] = pd.cut(aux1['competition_distance'], bins=bins)

# Display a sample of 5 rows from the table
aux1.sample(5)

Sometimes, we need to categorize continuous variables into distinct intervals, a process known as **binning**. In the mentioned example, binning will be applied to a variable called **`competition_distance`**.

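First, it's necessary to define the intervals or **`bins`** that will be used. In this case, the bin edges come from `np.arange(0, 20000, 1000)`, which runs from 0 to 19,000 in **steps of 1,000** (the stop value 20,000 is excluded). In other words, the data will be divided into groups of 1,000 meters, and distances above 19,000 fall outside every bin.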

The result of this process is a new variable indicating which group (or bin) each original value belongs to. In the cited example, the new variable will be called **`competition_distance_binned`**. So, instead of having the exact distance in **`competition_distance`**, we now have the group of 1,000 that this distance falls into in the new column **`competition_distance_binned`**.

For example, if the original value of **`competition_distance`** is 20, it's categorized into the group of 0 to 1,000 in the new variable **`competition_distance_binned`**. This repeats for other values. If we have an original competition value of 14,600, it's mapped to the group of 14,000 to 15,000 in the new variable. Similarly, an original value of 18,010 is mapped to the group of 18,000 to 19,000, and so on.
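The mapping described above can be checked on a handful of values. The sketch below reuses the notebook's bin edges; the `distances` series is illustrative, and the values 0 and 200000 are added only to show what happens at the edges of the range.

```python
import numpy as np
import pandas as pd

# Bin edges 0, 1000, ..., 19000 (np.arange excludes the stop value 20000)
bins = list(np.arange(0, 20000, 1000))

# Illustrative distances; 0 and 200000 are edge cases, not values from the text
distances = pd.Series([20, 14600, 18010, 0, 200000])

# pd.cut builds right-closed intervals by default: (0, 1000], (1000, 2000], ...
binned = pd.cut(distances, bins=bins)

result = pd.DataFrame({
    'competition_distance': distances,
    'competition_distance_binned': binned,
})
print(result)
```

Because the intervals are open on the left, a distance of exactly 0 is not assigned to any bin, and anything above the last edge (19,000) also becomes `NaN`. Rows with a missing bin are dropped by a later `groupby` on the binned column.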

In [None]:
# Configure the style and color palette of the plot
sns.set_style('white')
sns.set_palette('Set2')

# Define the bin limits for grouping the distances
bins = list(np.arange(0, 20000, 1000))

# Group sales by competitor distance and create a column with grouped distance
aux = (df4[['competition_distance', 'sales']]
       .groupby('competition_distance')
       .sum()
       .reset_index())
aux['competition_distance_binned'] = pd.cut(aux['competition_distance'], bins=bins)

# Regroup sales by binned distance
aux = (aux[['competition_distance_binned', 'sales']]
       .groupby('competition_distance_binned')
       .sum()
       .reset_index())

# Plot the bar graph with the auxiliary DataFrame
plt.figure(figsize=(24, 6))
sns.barplot(x=aux['competition_distance_binned'].astype(str), y='sales', data=aux, palette='viridis')

# Configure titles and labels
plt.title('Total Sales in Relation to Competitor Distance', fontsize=16)
plt.xlabel('Binned Competitor Distance (in units of 1000)', fontsize=14)
plt.ylabel('Total Sales', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=12)

# Display the plot
plt.tight_layout()
plt.show()

When evaluating the post-binning data, we identified that the **highest volume of sales** for the **`competition_distance`** variable falls within the **range of 0 to 1,000 meters**.

This observation challenges a widely accepted hypothesis: that stores located close to their competitors would have a lower volume of sales as they would face more intense competition. However, the obtained information indicates the opposite reality. **Stores with nearby competitors appear to record a higher volume of sales.**

This trend may be explained by several local market factors, none of which is tested here:

- **Price Competition**: In areas where competitors are clustered, a **more intense price competition** is common. This can lead to **lower prices**, attracting more consumers and consequently increasing the volume of sales.

- **Product Variety**: The proximity of multiple competitors can result in a greater variety of products offered to consumers. This can captivate a broader audience, thereby increasing sales.

- **Convenience**: With a large number of stores in a limited area, consumers have the advantage of being able to compare products and prices without the need to travel long distances. This convenience can stimulate more purchases and thus boost sales.

- **Customer Service**: The presence of nearby competitors can encourage each store to improve its customer service in order to stand out. Superior customer service can drive sales.

- **Marketing Strategies**: In scenarios with a high concentration of competitors, it is common for stores to invest more in marketing strategies to attract customers. This can also lead to an increase in sales volume.

- **Agglomeration Effect**: In certain regions, the concentration of multiple stores in the same area can create a "shopping hub," attracting more consumers and enhancing the sales of all stores in the area.
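Since the bar chart sums sales per bin, a bin that simply contains more stores will also show a larger total. A minimal sketch of how to put the total, the mean, and the store count side by side is shown below. The `toy` DataFrame and its `store` column are hypothetical stand-ins, not the Rossmann data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per store (not the Rossmann data)
toy = pd.DataFrame({
    'store': [1, 2, 3, 4, 5],
    'competition_distance': [100, 400, 900, 2500, 7000],
    'sales': [5000, 5200, 4800, 6000, 6100],
})

# Same bin edges as in the notebook
bins = list(np.arange(0, 20000, 1000))
toy['competition_distance_binned'] = pd.cut(toy['competition_distance'], bins=bins)

# Total sales, mean sales, and number of stores per distance bin
summary = (toy.groupby('competition_distance_binned', observed=True)
              .agg(total_sales=('sales', 'sum'),
                   mean_sales=('sales', 'mean'),
                   n_stores=('store', 'nunique'))
              .reset_index())
print(summary)
```

In this toy example the nearest bin has the largest total only because it holds three of the five stores; its mean is the lowest. Running the same aggregation on `df4` would show whether the pattern in the real data is driven by store counts.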

In [None]:
# Configure the style and color palette of the plot
sns.set_style('white')
sns.set_palette('Set2')

# Group sales by competition distance
aux = df4[['competition_distance', 'sales']].groupby('competition_distance').sum().reset_index()

# Plot settings
plt.figure(figsize=(24, 6))
# No palette here: seaborn only applies a palette when a hue variable is given
sns.scatterplot(x='competition_distance', y='sales', data=aux)

# Configure titles and labels
plt.title('Total Sales in Relation to Competition Distance', fontsize=16)
plt.xlabel('Competition Distance', fontsize=14)
plt.ylabel('Total Sales', fontsize=14)

# Configure axis ticks
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Configure axis limits, adding a margin of 1000 to the maximum values
plt.xlim(0, aux['competition_distance'].max() + 1000)
plt.ylim(0, aux['sales'].max() + 1000)

# Display the plot
plt.tight_layout()
plt.show()

The scatter plot shows an interesting trend: stores with **closer competitors sell more**. One point sits at a competition distance of 200,000 meters; this is the replacement value we applied in our descriptive analysis. Most of the sales, however, are concentrated in stores with competitors nearby.

To further analyze this relationship, let's create a grid with a trio of plots. In the first subplot, we'll place our scatter plot. In the second subplot, we'll use a **`barplot`** using the data from the scatter plot.

Finally, in the last subplot, we'll create a heatmap that shows the strength of the relationship between the variables. To do this, we apply the **`corr`** method to our **`aux`** dataframe, which calculates the correlation, and pass the result to seaborn's **`heatmap`** function. Since we want to see the correlation values, we add the **`annot=True`** option to the **`heatmap`** call.

With these plots, we can delve deeper into the relationship between the proximity of competition and the sales volume of the stores.

In [None]:
# Style, color palette, and font size settings
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Grouping sales by competition distance
aux = df4[['competition_distance', 'sales']].groupby('competition_distance').sum().reset_index()

# Creating bins for competition distance
bins = list(np.arange(0, 20000, 1000))
aux['competition_distance_binned'] = pd.cut(aux['competition_distance'], bins = bins)

# Grouping data according to the bins
aux_binned = aux[['competition_distance_binned', 'sales']].groupby('competition_distance_binned').sum().reset_index()

# Figure settings
plt.figure(figsize=(24, 6))

# Subplot 1: Scatterplot
plt.subplot(2, 2, 1)
sns.scatterplot(x='competition_distance', y='sales', data=aux)
plt.title('Scatterplot: Competition Distance vs Sales', fontsize=16)
plt.xlabel('Competition Distance', fontsize=14)
plt.ylabel('Sales', fontsize=14)

# Subplot 2: Barplot
plt.subplot(2, 2, 2)
sns.barplot(x='competition_distance_binned', y='sales', data=aux_binned, palette='viridis')
plt.title('Barplot: Binned Competition Distance vs Sales', fontsize=16)
plt.xlabel('Competition Distance Bins', fontsize=14)
plt.ylabel('Sales', fontsize=14)
plt.xticks(rotation=45, ha='right')

# Subplot 3: Heatmap
plt.subplot(2, 2, 3)
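# Correlate only the numeric columns; the binned (interval) column cannot be correlated
sns.heatmap(aux[['competition_distance', 'sales']].corr(method='pearson'), annot=True, cmap='coolwarm', fmt=".2f")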
plt.title('Heatmap: Pearson Correlation', fontsize=16)

# Adjust subplots to avoid overlap
plt.tight_layout()
plt.show()

Our investigation revealed a **correlation of -0.23**. This is a weak inverse relationship: **sales tend to decrease as the distance to the nearest competitor increases**, which is the opposite of what the hypothesis predicts.

This finding might seem counterintuitive at first glance. Generally, it's common to think that proximity to competitors would lead to a decrease in sales, as customers would have more options available. However, our data suggests the opposite, reinforcing that the market reality can be more complex than initial assumptions.
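For readers who want to see what the Pearson coefficient in the heatmap measures, here is a minimal sketch that computes it by hand. The function name `pearson_r` and the sample numbers are illustrative and not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: sales that tend to fall as distance grows
distances = [100, 500, 1000, 5000, 10000]
sales = [90, 80, 85, 60, 70]

r = pearson_r(distances, sales)
print(round(r, 2))
```

A value near -1 or +1 means the points lie close to a straight line, while a value near 0 means a straight line describes them poorly. The coefficient says nothing about causes and can miss relationships that are not linear.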

### **H3. Stores with longer-standing competitors experience higher sales volume**
Our objective is to understand the correlation between the duration a store has been facing competition and the impact it has on sales. To conduct this analysis, we are using two columns from our dataset: **`competition_open_since_month`** (competition start month) and **`sales`**.

Initially, we categorize all sales based on the competition start date, summing up all the corresponding values for each specific period.

In [None]:
# Style, color palette, and font size settings for the graph
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Create a DataFrame with the sum of sales per competition start month
sales_per_month = df4[['competition_open_since_month', 'sales']].groupby('competition_open_since_month').sum().reset_index()

# Create a figure to contain the graph
plt.figure(figsize=(24, 6))

# Plot the bar graph with the sales_per_month DataFrame and store the graph in a variable
# Note: Sales values are converted to millions for easier visualization
barplot = sns.barplot(x='competition_open_since_month', y=sales_per_month['sales'] / 1_000_000, data=sales_per_month, palette='viridis')

# Add labels to the x and y axes
plt.xlabel('Competition Start Month', fontsize=14)
plt.ylabel('Total Sales (in millions)', fontsize=14)

# Add a title to the graph
plt.title('Total Sales in Relation to Competition Start Month', fontsize=16)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45, ha='right', fontsize=12)

# Add labels to each bar, converted to millions and rounded to three decimal places
for p in barplot.patches:
    height = p.get_height()
    barplot.text(p.get_x()+p.get_width()/2., height, f'{height:.3f}', ha="center", va='bottom')

# Display the graph
plt.tight_layout()
plt.show()

For instance, stores where the **competition started in month 9** exhibit the highest sales volume. However, this initial view doesn't provide the insights we're seeking, as our goal is to **understand the influence of the time period** in which the competition has been active on sales.

Therefore, we need to consider the **duration of competition**, which refers to the number of months the competition has been active. For example, if the competition started in month 9 and we're analyzing sales in month 10, this implies that the competition has been active for only one month.

In [None]:
# Style, color palette, and font size configurations
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjust the figure size
plt.figure(figsize=(24, 6))

# Group sales by 'competition_time_month', then keep values below 120 and different from 0
aux1 = df4[['competition_time_month', 'sales']].groupby('competition_time_month').sum().reset_index()
aux2 = aux1[(aux1['competition_time_month'] < 120) & (aux1['competition_time_month'] != 0)]

# Plot the bar chart
sns.barplot(x='competition_time_month', y='sales', data=aux2, palette='viridis')

# Add title to the chart and axes
plt.title('Total Sales in Relation to Competition Time (in months)', fontsize=16)
plt.xlabel('Competition Time (in months)', fontsize=14)
plt.ylabel('Total Sales', fontsize=14)

# Rotate x-axis labels for better visualization and adjust to show every 10 intervals
plt.xticks(np.arange(0, len(aux2), 10), aux2['competition_time_month'].to_list()[::10], rotation=45, ha='right', fontsize=12)

# Display the chart
plt.tight_layout()
plt.show()

In this analysis, we **kept only the rows where `competition_time_month` is below 120 and different from zero**. Negative values remain in the selection.

When creating the variable **`competition_time_month`**, we essentially constructed a timeline of sales, calculating the difference between the date of each sale and the date of the competitor's opening. If the sale occurred after the competitor's opening, the result was positive; if it was before, the result was negative.
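The feature itself is built in an earlier step of the notebook. As a rough sketch of the idea, the function below counts calendar months between the competitor's opening and the sale date. The calendar-month convention and the function name are assumptions made for illustration; the notebook's feature-engineering cell may compute the difference differently.

```python
from datetime import date

def competition_time_month(sale_date, open_year, open_month):
    """Calendar months between the competitor's opening and the sale.

    Positive if the sale happened after the opening, negative if before.
    Assumption: whole calendar months, ignoring the day of the month.
    """
    return (sale_date.year - open_year) * 12 + (sale_date.month - open_month)

# Competitor opened in month 9, sale recorded in month 10 of the same year
months_active = competition_time_month(date(2015, 10, 15), 2015, 9)

# The same filter as in the plotting cell, applied to a few sample values
values = [-3, 0, 1, 60, 119, 120, 200]
kept = [m for m in values if m < 120 and m != 0]
print(months_active, kept)
```

Note that the filter keeps negative values, so the chart also covers sales made before the competitor opened.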

A certain trend became evident: **the closer the values of `competition_time_month` are to zero, meaning the more recent the competition, the higher the volume of sales appears to be**. Common sense might suggest that the arrival of new competitors would lead to a disruption in the market, with customers trying out new stores, which would tend to lower sales.

The general expectation might be for an initial drop in sales when new competitors enter the market, followed by stabilization as customers become accustomed to the new store options. However, our data points to a different reality. According to the data, **the newer the competitor, the more sales tend to rise**.

In [None]:
# Styling, color palette, and font size configurations
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the figure size
plt.figure(figsize=(24, 6))

# Filtering data and grouping by 'competition_time_month'
aux = df4.loc[(df4['competition_time_month'] < 120) & (df4['competition_time_month'] != 0)]
aux = aux[['competition_time_month', 'sales']].groupby('competition_time_month').sum().reset_index()

# Bar plot showing sales per competition month
plt.subplot(1, 2, 1)
sns.barplot(x='competition_time_month', y='sales', data=aux)
plt.xlabel('Competition Months', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.title('Sales vs. Competition Months', fontsize=14)
plt.xticks(range(0, aux.shape[0], 10), aux['competition_time_month'].tolist()[::10], rotation=45)

# Linear regression between competition months and sales
plt.subplot(1, 2, 2)
sns.regplot(x='competition_time_month', y='sales', data=aux, scatter_kws={'s': 5})
plt.xlabel('Competition Months', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.title('Linear Regression: Sales vs. Competition Months', fontsize=14)

# Adjust subplots to avoid overlap
plt.tight_layout()

# Show the plot
plt.show()

We utilized the **`regplot`** function to plot a trend line, providing a clearer perspective on the relationship between the variables.

The **`regplot`** not only displays each individual point but also draws a **trend line to highlight the overall pattern** in the data.
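By default, the line drawn by `regplot` is an ordinary least-squares fit. A minimal sketch of the same kind of fit with `np.polyfit` is shown below; the points are hypothetical and only illustrate how the slope and intercept are obtained.

```python
import numpy as np

# Hypothetical aggregated points (x: competition months, y: total sales)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 9.0, 8.5, 7.0, 6.5])

# Degree-1 polynomial fit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Values of the fitted trend line at each x
trend = slope * x + intercept
print(slope, intercept)
```

A negative slope means sales fall as `x` grows. The slope gives the direction and steepness of the trend, while the correlation coefficient tells how tightly the points follow it.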

To conclude our analysis, we constructed a **correlation heatmap**. This step is crucial for understanding the strength and direction of the association between variables. This overview enables us to draw more informed and accurate conclusions about our data.

In [None]:
# Style, color palette, and font size settings
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the figure size
plt.figure(figsize=(24, 6))

# Filtering the data and grouping by 'competition_time_month'
aux = df4.loc[(df4['competition_time_month'] < 120) & (df4['competition_time_month'] != 0)]
aux = aux[['competition_time_month', 'sales']].groupby('competition_time_month').sum().reset_index()

# Bar plot showing sales per competition month
plt.subplot(1, 3, 1)
sns.barplot(x='competition_time_month', y='sales', data=aux)
plt.xlabel('Competition Months', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.title('Sales vs. Competition Months', fontsize=14)
plt.xticks(range(0, aux.shape[0], 10), aux['competition_time_month'].tolist()[::10], rotation=45)

# Linear regression between competition months and sales
plt.subplot(1, 3, 2)
sns.regplot(x='competition_time_month', y='sales', data=aux, scatter_kws={'s': 5})
plt.xlabel('Competition Months', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.title('Linear Regression: Sales vs. Competition Months', fontsize=14)

# Correlation heatmap
plt.subplot(1, 3, 3)
sns.heatmap(aux.corr(), annot=True, cmap='coolwarm')
plt.xlabel('Variables', fontsize=12)
plt.ylabel('Variables', fontsize=12)
plt.title('Correlation Heatmap', fontsize=14)

# Adjust subplots to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

The correlation between the **competition time (in months)** and the response variable is **-0.1**. This is a weak negative correlation, far from the -1 that would indicate a perfect negative relationship, but its sign is the opposite of what the hypothesis predicts.

Thus, the hypothesis that longer-lasting competition results in more sales is, in fact, false. **Stores that have been facing competition for a longer time actually show, as indicated by our data, lower sales volumes**.

---
### **H4. Stores with Long-Duration Promotions Exhibit Higher Sales Volume**
We aim to validate the **hypothesis that stores maintaining promotions for extended periods experience an increase in sales volume**. The analysis seeks to explore the impact of promotion duration – a variable derived from the difference between the sale date and the start of the promotional period – on sales.
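As a rough illustration of how such a duration variable can be derived, the sketch below counts whole weeks between the start of the promotional period and the sale date. The function name and the floor-division convention are assumptions; the notebook builds `promo_time_week` in an earlier feature-engineering step that may differ in detail.

```python
from datetime import date

def promo_time_week(sale_date, promo_since):
    """Whole weeks between the promotion start and the sale.

    Positive once the promotion has started, negative before it starts.
    Assumption: floor division of the day difference by 7.
    """
    return (sale_date - promo_since).days // 7

# Sale four weeks after the promotion started
weeks_after = promo_time_week(date(2015, 7, 31), date(2015, 7, 3))

# Sale two weeks before the promotion started
weeks_before = promo_time_week(date(2015, 6, 19), date(2015, 7, 3))
print(weeks_after, weeks_before)
```

The sign is what the plots below rely on: positive values are treated as the extended promotion period and negative values as the regular period before it.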

In [None]:
# Configurations for style, color palette, and font size of the graph
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjust the figure size
plt.figure(figsize=(24, 8))

# Data preparation
aux = df4[['promo_time_week', 'sales']].groupby('promo_time_week').sum().reset_index()

# First bar plot for sales per extended promotion week
plt.subplot(2, 1, 1)
sns.barplot(x='promo_time_week', y='sales', data=aux[aux['promo_time_week'] > 0])
plt.title('Sales vs. Extended Promotion Weeks', fontsize=14)
plt.xlabel('Extended Promotion Weeks', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.xticks(range(0, aux[aux['promo_time_week'] > 0].shape[0], 10), aux['promo_time_week'][aux['promo_time_week'] > 0].tolist()[::10], rotation=45)

# Second bar plot for sales per regular promotion week
plt.subplot(2, 1, 2)
sns.barplot(x='promo_time_week', y='sales', data=aux[aux['promo_time_week'] < 0])
plt.title('Sales vs. Regular Promotion Weeks', fontsize=14)
plt.xlabel('Regular Promotion Weeks', fontsize=12)
plt.ylabel('Sales', fontsize=12)
plt.xticks(range(0, aux[aux['promo_time_week'] < 0].shape[0], 10), aux['promo_time_week'][aux['promo_time_week'] < 0].tolist()[::10], rotation=45)

# Adjust subplots to avoid overlap
plt.tight_layout()

# Show the graph
plt.show()

The goal here is to understand **how sales behave during regular and extended promotions**. Plotting every week before and after the start of the extended promotion in a single chart is hard to read, so the data is split into two graphs.

The upper graph shows **sales during extended promotions**. Here, we can observe that **sales remain stable for a while and then start to decline**. This indicates that the effect of an extended promotion does not last indefinitely.

The lower graph illustrates **sales during regular promotions**. When the promotion is still distant, sales are low. However, **as the promotion approaches, sales begin to increase**. This suggests that **consumers might be anticipating their purchases due to the announced promotions**.

**Our initial hypothesis was that stores with active promotions for longer periods should sell more**.

This proved to be partially true. Sales are influenced by the duration of the promotion, but not in a linear manner. Sales remain steady and then decline during extended promotions, and they increase as the promotion draws near in regular promotions. Therefore, **the duration of the promotion affects sales, but in a more complex way than we initially thought**.

In [None]:
# Style, color palette, and font size configurations for the plot
sns.set(style='white', palette='Set2', font_scale=1.4)

# Data preparation
aux = df4[['promo_time_week', 'sales']].groupby('promo_time_week').sum().reset_index()
promo_extended = aux[aux['promo_time_week'] > 0]  # extended promotion
promo_regular = aux[aux['promo_time_week'] < 0]  # regular promotion

fig, axs = plt.subplots(2, 2, figsize=(24, 10))

# First plot: barplot for sales by weeks of extended promotions
sns.barplot(x='promo_time_week', y='sales', data=promo_extended, ax=axs[0, 0])
axs[0, 0].set_title('Sales vs. Extended Promotion Weeks')
axs[0, 0].set_xlabel('Extended Promotion Weeks')
axs[0, 0].set_ylabel('Sales')
axs[0, 0].set_xticks(range(0, promo_extended.shape[0], 10))
axs[0, 0].set_xticklabels(promo_extended['promo_time_week'].tolist()[::10], rotation=90)

# Second plot: linear regression for sales by weeks of extended promotions
sns.regplot(x='promo_time_week', y='sales', data=promo_extended, scatter_kws={'s': 5}, ax=axs[0, 1])
axs[0, 1].set_title('Linear Regression: Sales vs. Extended Promotion Weeks')
axs[0, 1].set_xlabel('Extended Promotion Weeks')
axs[0, 1].set_ylabel('Sales')

# Third plot: barplot for sales by weeks of regular promotions
sns.barplot(x='promo_time_week', y='sales', data=promo_regular, ax=axs[1, 0])
axs[1, 0].set_title('Sales vs. Regular Promotion Weeks')
axs[1, 0].set_xlabel('Regular Promotion Weeks')
axs[1, 0].set_ylabel('Sales')
axs[1, 0].set_xticks(range(0, promo_regular.shape[0], 10))
axs[1, 0].set_xticklabels(promo_regular['promo_time_week'].tolist()[::10], rotation=90)

# Fourth plot: linear regression for sales by weeks of regular promotions
sns.regplot(x='promo_time_week', y='sales', data=promo_regular, scatter_kws={'s': 5}, ax=axs[1, 1])
axs[1, 1].set_title('Linear Regression: Sales vs. Regular Promotion Weeks')
axs[1, 1].set_xlabel('Regular Promotion Weeks')
axs[1, 1].set_ylabel('Sales')

# Adjust subplots to avoid overlap
plt.tight_layout()

# Show the plot
plt.show()

The regression plots reveal two opposite tendencies: a clear downward trend during extended promotions, possibly driven by a period of significant decline, and an upward trend with well-defined peaks during regular promotions.

To further deepen our understanding of the relationship between these trends, in the next step, we will use a `Heatmap`. This data visualization method will allow us to explore correlations in a clear way, providing a quantitative view of the strength of these relationships.

In [None]:
from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec

# Defining the data
aux1 = df4[['promo_time_week', 'sales']].groupby('promo_time_week').sum().reset_index()

# Configuring style, color palette, and font size of the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 10

# Adjusting the figure size
plt.figure(figsize=(24, 10))

# Defining the grid for subplots
gs = gridspec.GridSpec(2, 3)

# Configuring the spacing between plots
plt.subplots_adjust(hspace=0.5, wspace=0.4)

# First plot: sales during extended promotions
plt.subplot(gs[0, 0])
aux2 = aux1[aux1['promo_time_week'] > 0]  # extended promotion
bp = sns.barplot(x='promo_time_week', y='sales', data=aux2)
bp.xaxis.set_major_locator(MaxNLocator(nbins=5))  # Adjust the number of visible ticks on the x-axis
plt.xlabel('Extended Promo Time (weeks)', fontsize=16)
plt.ylabel('Sales', fontsize=16)
plt.title('Sales during Extended Promotions', fontsize=18)

# Second plot: linear regression for extended promotions
plt.subplot(gs[0, 1])
sns.regplot(x='promo_time_week', y='sales', data=aux2, scatter_kws={'s': 5})
plt.xlabel('Extended Promo Time (weeks)', fontsize=16)
plt.ylabel('Sales', fontsize=16)
plt.title('Linear Regression: Extended Promotions', fontsize=18)

# Third plot: sales during regular promotions
plt.subplot(gs[1, 0])
aux3 = aux1[aux1['promo_time_week'] < 0]  # regular promotion
bp = sns.barplot(x='promo_time_week', y='sales', data=aux3)
bp.xaxis.set_major_locator(MaxNLocator(nbins=5))  # Adjust the number of visible ticks on the x-axis
plt.xlabel('Time to Next Regular Promo (weeks)', fontsize=16)
plt.ylabel('Sales', fontsize=16)
plt.title('Sales during Regular Promotions', fontsize=18)

# Fourth plot: linear regression for regular promotions
plt.subplot(gs[1, 1])
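sns.regplot(x='promo_time_week', y='sales', data=aux3, scatter_kws={'s': 5})
plt.xlabel('Time to Next Regular Promo (weeks)', fontsize=16)
plt.ylabel('Sales', fontsize=16)
plt.title('Linear Regression: Regular Promotions', fontsize=18)

# Fifth plot: correlation heatmap (placement in the right-hand column is an assumption)
plt.subplot(gs[:, 2])
sns.heatmap(aux1.corr(method='pearson'), annot=True)
plt.title('Correlation Heatmap', fontsize=18)

plt.show()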

The heatmap shows a **correlation of 0.02** between `promo_time_week` and `sales`, which is extremely low, likely because the period of stability is much longer than the period of decline. The **linear relationship is therefore very weak**, with no clear upward behavior, so this feature **may carry little weight in our prediction model and might be left out**.

However, it's important to note that **this factor could have relevance when combined with others**. In data analysis, we don't just look at variables in isolation, but also how they interact with each other.

When we train machine learning models, we typically don't divide variables as done with `aux`. Instead, **the model sees the variable as a whole, `promo_time_week`**. The division was done in this case **solely to better visualize the data** and identify specific trends or patterns.
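A single Pearson coefficient over the whole range can hide trends that are visible once the variable is split. The sketch below uses synthetic numbers, not the Rossmann data, in which sales rise until week 0 and fall afterwards: each half is perfectly linear, yet the overall coefficient is zero.

```python
import numpy as np

# Synthetic example: weeks relative to the promotion start (hypothetical values)
weeks = np.arange(-10, 11)

# Sales rise until week 0 and fall afterwards (an inverted V shape)
sales = -np.abs(weeks).astype(float)

# Correlation over the whole range
r_all = np.corrcoef(weeks, sales)[0, 1]

# Correlation on each side of week 0
before = weeks < 0
after = weeks > 0
r_before = np.corrcoef(weeks[before], sales[before])[0, 1]
r_after = np.corrcoef(weeks[after], sales[after])[0, 1]
print(r_all, r_before, r_after)
```

This is only one way a near-zero coefficient can arise. It is a reason to look at the plots and not only at the heatmap before discarding a feature.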

Therefore, the initial hypothesis that **stores with active promotions for a longer time sell more** has been invalidated. In fact, the data suggests the opposite: **after a certain period of promotion, sales tend to decrease**. Hence, this hypothesis is considered false.

---
### **H6. Consecutive Promotions Generate More Sales**
We define a **consecutive promotion** as a period during which the store conducts **continuous promotions**.

For this analysis, we utilize two variables: **`promo`**, which indicates whether the store is in a **regular promotion period**, and **`promo2`**, which signals whether the store is participating in an **extended promotion**. Our objective is to compare the sales of stores that have consecutive promotions with those that do not.

First, we **group our data by `promo` and `promo2`** and sum the sales, which lets us compare the total sales for each combination of promotion states. We then pivot the result into a table with `promo` as rows and `promo2` as columns.

In [None]:
# Group the data and reset the index
df_grouped = df4[['promo', 'promo2', 'sales']].groupby(['promo', 'promo2']).sum().reset_index()

# Reshape the data into the appropriate form
# Keyword arguments are required by recent pandas versions
df_pivot = df_grouped.pivot(index='promo', columns='promo2', values='sales')
print(df_pivot)

Stores that participated only in the extended promotion `promo2` recorded the lowest volume of sales. Surprisingly, the stores that took part in both the traditional and extended promotions did not achieve significantly better performance. However, the ones that fared the best were the stores that participated solely in the regular promotion `promo`.

These results suggest that participating in the extended promotion does not provide a substantial advantage in terms of sales.
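The four promotion combinations may not contain the same number of rows, so totals alone can favor the larger groups. A minimal sketch that reports the mean alongside the sum is shown below. The `toy` DataFrame is hypothetical and only illustrates the group-then-pivot pattern.

```python
import pandas as pd

# Hypothetical toy data: one row per store-day (not the Rossmann data)
toy = pd.DataFrame({
    'promo':  [0, 0, 1, 1, 0, 1],
    'promo2': [0, 1, 0, 1, 1, 0],
    'sales':  [100, 80, 150, 120, 70, 160],
})

# Sum and mean of sales for each (promo, promo2) combination
grouped = (toy.groupby(['promo', 'promo2'])['sales']
              .agg(['sum', 'mean'])
              .reset_index())

# One table per statistic: promo as rows, promo2 as columns
totals = grouped.pivot(index='promo', columns='promo2', values='sum')
means = grouped.pivot(index='promo', columns='promo2', values='mean')
print(totals)
print(means)
```

If the ranking of the combinations is the same in both tables, the conclusion does not depend on group size.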

Did this pattern remain consistent over time? In other words, did any changes in sales behavior occur for the stores that exclusively participated in the regular promotion when they began participating in the extended promotion? To explore this, let's create a graph to visualize this data over time.

In [None]:
# Selecting data where 'promo' and 'promo2' are 1 and grouping by 'year_week'
aux1 = df4[(df4['promo'] == 1) & (df4['promo2'] == 1)][['year_week', 'sales']].groupby('year_week').sum().reset_index()

# Setting up the figure
plt.figure(figsize=(24, 6))

# Configuring style, color palette, and font size of the graph
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 10

# Creating the line plot
sns.lineplot(x='year_week', y='sales', data=aux1)

# Adding title and labels to the axes
plt.title('Total Sales per Week of the Year', fontsize=18)
plt.xlabel('Week of the Year', fontsize=14)
plt.ylabel('Total Sales', fontsize=14)

# Configuring x-axis limits to occupy the entire graph area
plt.xlim(min(aux1['year_week']), max(aux1['year_week']))

# Rotating and adjusting x-axis labels for better visibility
plt.xticks(np.arange(0, len(aux1['year_week']), step=10), rotation=45)

# Displaying the graph
plt.show()

To recap the table above: stores participating exclusively in the **extended promotion `promo2`** had the lowest sales, stores engaged in both promotions did not perform significantly better, and stores that only took part in the **traditional promotion `promo`** registered the highest sales volumes. This suggests that **adhering to the extended promotion may not bring a substantial advantage in terms of sales**.

The line plot shows the weekly sales of the stores running both promotions. The remaining question is whether the pattern held over time, so in the next step **we explore the data from a temporal perspective** and compare the series in a single graph.

In [None]:
# Graph Settings
plt.figure(figsize=(24, 6))

# Style, color palette, and font size configurations
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 10

# Defining the data
aux1 = df4[['promo_time_week', 'sales', 'year_week']].groupby(['promo_time_week', 'year_week']).sum().reset_index()

# Filtering data for extended and regular promotions
aux2 = aux1[aux1['promo_time_week'] > 0]  # Extended promotion
aux3 = aux1[aux1['promo_time_week'] <= 0]  # Regular promotion

# Creating the first line plot
ax = sns.lineplot(x='year_week', y='sales', data=aux2, label='Extended', linewidth=2.5)

# Adding the second line plot
sns.lineplot(x='year_week', y='sales', data=aux3, label='Regular', linewidth=2.5, ax=ax)

# Additional configurations
plt.title('Sales Over Time: Regular and Extended Promotions', fontsize=20)
plt.xlabel('Week of the Year', fontsize=15)
plt.ylabel('Sales', fontsize=15)
plt.legend(fontsize=12)

# Adjusting visibility and rotation of x-axis labels
ax.set_xticks(ax.get_xticks()[::5])  # Show a label every 5 ticks
plt.xticks(rotation=45, fontsize=12)  # Rotate labels and adjust font size
plt.yticks(fontsize=12)

# Removing the grid
ax.grid(False)

# Making the plot fill the entire area
plt.tight_layout()
plt.show()

The analysis of the graph, comparing the sales of stores with extended promotions to those that participated in both traditional and extended promotions, brings us relevant conclusions. **The sales behavior lines are similar, but with distinct volumes**. A temporary decline is noticeable for stores that conducted both promotions, but even so, the sales volume still exceeds that of stores with only the extended promotion.

Thus, our initial hypothesis, which suggested that stores with consecutive promotions would have higher sales volume, is not confirmed. The observed decline is circumstantial and does not alter the overall picture: **stores with only the extended promotion have lower sales volume**. Therefore, **this variable may not have a significant impact on our forecasting model**.

### **7. Stores Open During Christmas Holiday Generate More Sales**

In [None]:
# Defining the data
aux1 = df4[['state_holiday', 'sales']].groupby('state_holiday').sum().reset_index()

# Setting style, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the figure size
plt.figure(figsize=(24, 6))

# Creating the bar plot
sns.barplot(x='state_holiday', y='sales', data=aux1)

plt.xlabel('Holidays')  # Set the x-axis label
plt.ylabel('Sales')  # Set the y-axis label
plt.title('Sales During Holidays')  # Set the main title

# Remove all four spines: top and right (defaults) plus bottom and left
sns.despine(bottom=True, left=True)

# Automatically adjusts subplot parameters to fill the figure
plt.tight_layout()

# Display the plot
plt.show()

In the pursuit of verifying the hypothesis that sales significantly increase during the Christmas period, we focused our analysis on the variables **`state_holiday`** and **`sales`**. These variables allow us to identify holiday occurrences and their respective sales performance.

However, our initial analysis revealed that **the volume of sales on normal days greatly exceeds that of holiday days**. The explanation is simple: **the number of normal days in a year is substantially greater than the number of holidays**.

To refine our analysis, we build a filtered dataset, **`aux`**, which **excludes normal days** and keeps only holidays, and then aggregate it into **`aux1`**. Recalculating and visualizing the data this way gives a more accurate view of the impact of holidays on sales.

In [None]:
# Configuring the style, color palette, and font size of the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 10

# Adjusting the figure size
plt.figure(figsize=(24, 6))

# Filtering the data to exclude regular days
aux = df4[df4['state_holiday'] != 'regular_day']

# Grouping the data by holidays and calculating the sum of sales
aux1 = aux[['state_holiday', 'sales']].groupby('state_holiday').sum().reset_index()

# Creating the plot
sns.barplot(x='state_holiday', y='sales', data=aux1)

# Setting titles and labels
plt.title('Sales During Holidays', fontsize=20)
plt.xlabel('Holiday Type', fontsize=15)
plt.ylabel('Sales', fontsize=15)

# Removing plot borders
sns.despine()

# Displaying the plot
plt.show()

In the visualization, it becomes evident that **sales during Christmas do not surpass those of other holidays**. We observe distinct sales patterns for Christmas, Easter, and public holidays. Therefore, initially, we can conclude that the original hypothesis is false: **stores do not sell more during Christmas**.

Now, the question arises of whether there has ever been a time when Christmas sales exceeded those of other holidays. Answering it requires **analyzing how sales change over time**, that is, checking whether the pattern shifts from one year to the next. To do this, we break the holiday sales down by year.

In [None]:
# Style settings, color palette, and font size of the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Filter out regular days
aux = df4[df4['state_holiday'] != 'regular_day']

# Create a single figure with two side-by-side subplots
# (a separate plt.figure() call here would only produce an extra empty figure)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20, 7))

# Subplot 1
aux1 = aux[['state_holiday', 'sales']].groupby('state_holiday').sum().reset_index()
sns.barplot(x='state_holiday', y='sales', data=aux1, ax=axs[0])
axs[0].set_title('Total Sales by Holiday')

# Subplot 2
aux2 = aux[['year', 'state_holiday', 'sales']].groupby(['year', 'state_holiday']).sum().reset_index()
sns.barplot(x='year', y='sales', hue='state_holiday', data=aux2, ax=axs[1])
axs[1].set_title('Total Sales by Year and Holiday')

# Show the plot
plt.tight_layout()
plt.show()

The three major holidays analyzed were Christmas, Easter, and public holidays. Throughout the studied years (2013-2015), Christmas sales consistently remained lower compared to the other holidays, contradicting our initial hypothesis of higher sales volume during that period.

It's important to note that for the year 2015, the data is only available until August, hence the sales performance for Christmas is not fully accounted for.

Thus, our study has effectively debunked the initial hypothesis: **Christmas, contrary to expectations, is not the period of highest sales**. This critical insight can guide future strategies and more accurately manage sales expectations during holiday periods.
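
The totals above depend on how many rows fall into each holiday category, as already noted for normal days. A complementary check is to look at the average sale per record next to the total. The sketch below uses made-up numbers, and the labels `christmas` and `easter_holiday` are assumed for illustration (only `regular_day` appears in the code above). It shows the mechanics only and says nothing about what the real data would show.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from the Rossmann dataset)
toy = pd.DataFrame({
    'state_holiday': ['regular_day'] * 6 + ['christmas'] * 2 + ['easter_holiday'] * 2,
    'sales': [100, 120, 110, 90, 105, 115, 300, 260, 150, 170],
})

# Total, number of records, and mean sales per category
summary = (
    toy.groupby('state_holiday')['sales']
       .agg(total='sum', days='count', mean='mean')
)
print(summary)
```

In this toy example the category with the most records has the largest total even though its average is the smallest, which is why the two columns are worth reading together.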

### **8. Sales in Stores Increase Over the Years**
Our current focus is to comprehend the sales evolution over the years. For this purpose, we are working with the **`year`** and **`sales`** columns from our dataset. By grouping the sales data by year and summing up all the corresponding values, **we obtain the total annual sales**. This approach will enable us to analyze the sales performance for each year.

In [None]:
# Creation of the DataFrame 'aux1', grouping data by 'year' and summing 'sales', with the index reset.
aux1 = df4[['year', 'sales']].groupby('year').sum().reset_index()

# Configuration of style, color palette, and font size for the graph
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjust the figure size
plt.figure(figsize=(24, 6))

# Create the bar plot visualization with the default palette
ax = sns.barplot(x='year', y='sales', data=aux1)

# Remove the graph's borders
sns.despine()

# Set the x-axis label
ax.set_xlabel('Year', fontsize=13, weight='bold')

# Set the y-axis label
ax.set_ylabel('Total Sales', fontsize=13, weight='bold')

# Add a title to the graph
ax.set_title('Evolution of Total Sales Over the Years', fontsize=16, weight='bold')

# Adjust the layout and display the graph
plt.tight_layout()
plt.show()

To decipher the trend, we replaced the bar plot with a **regression plot**, a visual tool that allows us to intuitively identify whether there is a positive or negative slope in sales over the years.

Additionally, we used the **`heatmap`** method, an effective technique for representing **correlations between variables**. By calculating the **Pearson correlation coefficient**, we are able to visualize the interdependence and the direct or inverse relationship between the variables.
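
To make the number in the heatmap less of a black box, here is a minimal sketch of the Pearson coefficient computed by hand and compared with `DataFrame.corr`. The three yearly totals are invented for illustration. With only three points the coefficient is very sensitive to each value, so it is best read as a rough indication.

```python
import numpy as np
import pandas as pd

# Toy yearly totals (hypothetical values)
toy = pd.DataFrame({'year': [2013, 2014, 2015], 'sales': [300.0, 290.0, 160.0]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = toy['year'] - toy['year'].mean()
dy = toy['sales'] - toy['sales'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value as computed by pandas (this is what the heatmap displays)
r_pandas = toy.corr(method='pearson').loc['year', 'sales']
print(r_manual, r_pandas)
```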

In [None]:
# Create the 'aux1' DataFrame by grouping data by 'year' and summing 'sales', resetting the index.
aux1 = df4[['year', 'sales']].groupby('year').sum().reset_index()

# Set the style, color palette, and font size of the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjust the figure size
plt.figure(figsize=(24, 6))

# Bar plot of total sales per year
plt.subplot(1, 3, 1)
sns.barplot(x='year', y='sales', data=aux1, palette="viridis")
plt.title('Total Sales by Year')
plt.xlabel('Year')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Linear regression plot of sales per year
plt.subplot(1, 3, 2)
sns.regplot(x='year', y='sales', data=aux1, color='green')
plt.title('Sales Trend Over Years')
plt.xlabel('Year')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Heatmap of correlation between variables
plt.subplot(1, 3, 3)
sns.heatmap(aux1.corr(method='pearson'), annot=True, cmap='coolwarm')
plt.title('Correlation between Year and Total Sales')
sns.despine(bottom=True, left=True)

plt.tight_layout()
plt.show()

The analysis reveals a **decreasing trend in sales over the years**, as indicated by the regression plot. A strong correlation of **-0.92** confirms this downward trend: **as the years progress, sales show an inclination to decrease**.

It's crucial to remember that the data for 2015 might not provide a complete picture as it only covers up to mid-year. In analyses like this, it's essential to consider closed periods to minimize distortions.

Even with such a strong correlation, the **`year`** variable is unlikely to be a surprising finding: monitoring annual growth is common practice in most businesses. The discovery is therefore relevant, but not a new insight.

With this analysis, we can invalidate the initial hypothesis that "stores should sell more over the years." The data demonstrate an opposite trend - **sales tend to decrease over the years**.
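
Because 2015 is only partially covered, one way to respect the "closed periods" caveat is to compare years over the months they all share. The sketch below assumes `year`, `month` and `sales` columns like those used above and runs on invented numbers. It illustrates the idea and does not re-analyse the real data.

```python
import pandas as pd

# Toy data: 2015 only has the first two months (hypothetical values)
toy = pd.DataFrame({
    'year':  [2013] * 4 + [2014] * 4 + [2015] * 2,
    'month': [1, 2, 11, 12, 1, 2, 11, 12, 1, 2],
    'sales': [100, 110, 130, 160, 105, 115, 135, 165, 120, 125],
})

# Raw yearly totals: the incomplete year looks much weaker
raw_totals = toy.groupby('year')['sales'].sum()

# Months that are present in every year
months_by_year = {year: set(group['month']) for year, group in toy.groupby('year')}
common_months = sorted(set.intersection(*months_by_year.values()))

# Like-for-like totals restricted to the shared months
like_for_like = toy[toy['month'].isin(common_months)].groupby('year')['sales'].sum()
print(raw_totals, like_for_like, sep='\n\n')
```

In the toy data the raw totals suggest a drop in 2015, while the like-for-like totals do not. Whether the same happens with the real dataset would need to be checked.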

---
### **9. The Highest Volume of Sales is Observed in the 2nd Half of the Year**
Upon evaluating the monthly data, we observe that sales exhibit high values from month 1 (January) to month 6 (June). However, starting from month 7 (July), there is a significant decline in the volume of sales.

In [None]:
# Creating the 'aux1' DataFrame by grouping data by 'month' and summing 'sales', with reset index.
aux1 = df4[['month', 'sales']].groupby('month').sum().reset_index()

# Setting the style, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the figure size
plt.figure(figsize=(24, 6))

# Creating the bar plot visualization using the 'viridis' palette
plt.subplot(1, 3, 1)
ax1 = sns.barplot(x='month', y='sales', data=aux1, palette="viridis")
ax1.set_title('Total Sales by Month', fontsize=16, weight='bold')
ax1.set_xlabel('Month', fontsize=13, weight='bold')
ax1.set_ylabel('Total Sales', fontsize=13, weight='bold')
sns.despine(bottom=True, left=True)

# Creating the regression plot visualization
plt.subplot(1, 3, 2)
ax2 = sns.regplot(x='month', y='sales', data=aux1, color='green')
ax2.set_title('Sales Trend Across Months', fontsize=16, weight='bold')
ax2.set_xlabel('Month', fontsize=13, weight='bold')
ax2.set_ylabel('Total Sales', fontsize=13, weight='bold')
sns.despine(bottom=True, left=True)

# Creating the heatmap visualization for correlation analysis
plt.subplot(1, 3, 3)
ax3 = sns.heatmap(aux1.corr(method='pearson'), annot=True, cmap='coolwarm')
ax3.set_title('Correlation between Month and Total Sales', fontsize=16, weight='bold')
sns.despine(bottom=True, left=True)

# Optimize the layout and display the figure
plt.tight_layout()
plt.show()

The observation of a decreasing trend in sales throughout the months is supported by a more robust quantitative analysis. The regression plot exhibits a downward slope, confirming the declining trend. Furthermore, the correlation coefficient of `-0.75`, obtained from the correlation analysis, indicates a strong negative correlation between the month of the year and sales.

This strong indication suggests that the month in which the sale occurs is a significant factor in determining the volume of sales. This implies that our forecasting model should take into account that sales tend to be lower in the second half of the year.

Based on these findings, the hypothesis that "stores should sell more in the second half of the year" is contradicted. On the contrary, our analysis demonstrates that stores tend to experience lower sales volumes in the second half of the year. Thus, we are ready to move forward and test the next hypothesis.

---
### **10. Sales Volume Increases After the 10th Day of Each Month**
In an attempt to verify the hypothesis that **stores tend to experience higher sales after the 10th day of each month**, we will adopt a similar analytical approach as used in the previous sections for years and months.

To do this, we will make a substitution in our codebase, **replacing the `year` variable with `day`**. This will allow us to create a new triptych set of visualizations, aiming to examine whether there is an increase in sales after the tenth day of each month. Please follow along for the results of this analysis.

In [None]:
# Grouping the data by day and summing the sales
aux1 = df4[['day', 'sales']].groupby('day').sum().reset_index()

# Configuration of style, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the figure size
plt.figure(figsize=(24, 6))

# Bar plot of sales per day
plt.subplot(1, 3, 1)
bp = sns.barplot(x='day', y='sales', data=aux1, palette='viridis')
bp.set_xticklabels(bp.get_xticklabels(), rotation=90, horizontalalignment='right', fontsize=10) # Rotate labels 90 degrees and reduce font size
plt.title('Total Sales per Day')
plt.xlabel('Day')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Linear regression plot of sales per day
plt.subplot(1, 3, 2)
sns.regplot(x='day', y='sales', data=aux1, color='green')
plt.title('Sales Trend Over Days')
plt.xlabel('Day')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Heatmap of correlation between variables
plt.subplot(1, 3, 3)
sns.heatmap(aux1.corr(method='pearson'), annot=True, cmap='coolwarm')
plt.title('Correlation between Day and Total Sales')
sns.despine(bottom=True, left=True)

plt.tight_layout()
plt.show()

To thoroughly explore the hypothesis, we need to subdivide our data into two distinct categories: **days before and after the 10th day of each month**. To perform this classification, we use a list comprehension over the `day` column. As a result, we create a new column called `before_after`, which categorizes the days as either `before_10_days` (days 1 through 10) or `after_10_days` (day 11 onward).

In [None]:
# Adding a new column 'before_after' that checks if the day is before or after the 10th
aux1['before_after'] = ['before_10_days' if day <= 10 else 'after_10_days' for day in aux1['day']]

**By doing so**, we are stratifying our data according to the premise we want to examine. Now, we can proceed with a detailed analysis of sales performance within these two distinct sets of days.


In [None]:
# Grouping the data by 'before_after' and summing the 'sales', with reset index.
aux2 = aux1[['before_after', 'sales']].groupby('before_after').sum().reset_index()

# Configuring the style, color palette, and font size of the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the figure size
plt.figure(figsize=(24, 8))

# Bar plot of total sales per day
plt.subplot(2, 2, 1)
sns.barplot(x='day', y='sales', data=aux1)
plt.title('Total Sales per Day')
plt.xlabel('Day')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Linear regression plot of sales per day
plt.subplot(2, 2, 2)
sns.regplot(x='day', y='sales', data=aux1, color='green')
plt.title('Sales Trend Over Days')
plt.xlabel('Day')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Heatmap of correlation between variables
plt.subplot(2, 2, 3)
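# Restrict to the numeric columns: 'before_after' holds strings and cannot be correlated
sns.heatmap(aux1[['day', 'sales']].corr(method='pearson'), annot=True, cmap='coolwarm')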
plt.title('Correlation between Day and Total Sales')
sns.despine(bottom=True, left=True)

# Bar plot of sales Before and After the 10th day
plt.subplot(2, 2, 4)
sns.barplot(x='before_after', y='sales', data=aux2)
plt.title('Total Sales Before and After the 10th Day')
plt.xlabel('Period')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Adjusting the layout for better visualization
plt.tight_layout()
plt.show()

The data analysis reveals an intriguing pattern that challenges the conventional notion that consumers tend to spend more immediately after receiving their salaries. Our data indicates that **the volume of sales is higher after the tenth day of each month**, suggesting a distinct buying behavior of consumers.

This insight could have significant strategic implications for marketing and sales actions. Companies could, for example, **schedule their promotions and advertising campaigns to coincide with this period of higher buying activity, potentially enhancing their effectiveness and impact on sales**.
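
Since the `after_10_days` group covers roughly twice as many days as `before_10_days`, it can help to place the average per day next to the total. The following sketch uses a deliberately flat toy series (every day sells the same amount) to show how much the group sizes alone drive the totals. The values are hypothetical and the result says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy per-day totals: identical sales on each of 30 days (hypothetical values)
toy = pd.DataFrame({'day': np.arange(1, 31), 'sales': 1000})

# Same split as above, written with np.where instead of a list comprehension
toy['before_after'] = np.where(toy['day'] <= 10, 'before_10_days', 'after_10_days')

# Total, number of days, and mean per day in each group
summary = toy.groupby('before_after')['sales'].agg(total='sum', days='count', mean='mean')
print(summary)
```

Here the totals differ by a factor of two while the daily means are identical, so comparing means on the real data is a useful sanity check before acting on the totals.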

### **11. Decrease in Sales Volume on Weekends**
We are examining the hypothesis that proposes a reduction in sales volume during weekends. To verify this assumption, we will analyze the sales distribution in relation to the days of the week.

In [None]:
# Creating DataFrame 'aux1', grouping data by 'day_of_week' and summing 'sales', with reset index.
aux1 = df4[['day_of_week', 'sales']].groupby('day_of_week').sum().reset_index()

# Style settings, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjusting the size of the figure
plt.figure(figsize=(24, 8))

# Creating the bar plot
plt.subplot(1, 3, 1)
sns.barplot(x='day_of_week', y='sales', data=aux1, palette='viridis')
plt.title('Total Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Creating the regression plot
plt.subplot(1, 3, 2)
sns.regplot(x='day_of_week', y='sales', data=aux1, color='green')
plt.title('Sales Trend Throughout the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Total Sales')
sns.despine(bottom=True, left=True)

# Creating the heatmap for correlation analysis
plt.subplot(1, 3, 3)
sns.heatmap(aux1.corr(method='pearson'), annot=True, cmap='coolwarm')
plt.title('Correlation between Day of the Week and Total Sales')
sns.despine(bottom=True, left=True)

# Display the figure with all subplots
plt.tight_layout()
plt.show()

The analysis results confirm our hypothesis. There is a significant decline in sales volumes during weekends, with Sunday being the day with the lowest store activity. This insight is crucial as it points to the need for differentiated strategies to boost sales on these days.

Furthermore, we have identified a **negative correlation in the distribution of sales throughout the week**. This indicates that as the days progress, there is a tendency for sales to decrease. This standard behavior provides us with an overview of consumer purchasing habits and can be strategically used to optimize promotional and marketing actions.

In summary, this information provides an understanding of customer buying behavior and can be key to more effective and personalized commercial strategies for each day of the week.
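
For modelling purposes, the weekday pattern can also be condensed into a simple weekend flag. The sketch below assumes that `day_of_week` is coded 1 to 7 with 6 = Saturday and 7 = Sunday, which is an assumption not verified here, and uses invented sales figures.

```python
import pandas as pd

# Toy data covering two weeks (hypothetical values)
toy = pd.DataFrame({
    'day_of_week': [1, 2, 3, 4, 5, 6, 7] * 2,
    'sales': [900, 800, 780, 760, 820, 600, 100,
              950, 810, 790, 770, 830, 620, 120],
})

# Assumption: 6 and 7 are Saturday and Sunday
toy['day_type'] = toy['day_of_week'].isin([6, 7]).map({True: 'weekend', False: 'weekday'})

# Mean sales per record for each day type
mean_sales = toy.groupby('day_type')['sales'].mean()
print(mean_sales)
```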

### **12. Stores Experience Reduced Sales Volume During School Holidays**
Next, we will focus our analysis on investigating the potential impact of school holidays on sales volume. To explore this perspective, we will utilize the `school_holiday` feature present in our dataset. By doing so, we aim to uncover evidence that correlates **school holiday periods with significant changes in sales**.

In [None]:
# Grouping the data by 'school_holiday' and summing the 'sales'
aux1 = df4[['school_holiday', 'sales']].groupby('school_holiday').sum().reset_index()

# Setting the style
sns.set_style('white')

# Configuring the subplot area (plt.subplots already creates the figure)
fig, axs = plt.subplots(ncols=3, figsize=(22, 7))

# Bar plot of sales during school holidays
sns.barplot(x='school_holiday', y='sales', data=aux1, ax=axs[0]).set_title('Total Sales During School Holidays')

# Regression plot of sales during school holidays
sns.regplot(x='school_holiday', y='sales', data=aux1, ax=axs[1]).set_title('Sales Trend During School Holidays')

# Heatmap of correlation between school holidays and sales
sns.heatmap(aux1.corr(method='pearson'), annot=True, cmap='Blues', ax=axs[2]).set_title('Correlation Between School Holidays and Sales')

plt.tight_layout()
plt.show()

The **sales volume is notably higher on days that do not coincide with school holidays** (`school_holiday = 0`), compared to days that are school holidays (`school_holiday = 1`). While this observation might seem intuitive, given that the number of days unrelated to school holidays is larger, the significant contrast between the two sets reinforces the relevance of this distinction for sales forecasting.

For a deeper understanding, we introduce `month` as a second variable in our analysis. This new approach allows us to **identify whether there are specific months where sales during school holidays rival or even surpass sales on regular school days**. This additional perspective could unveil hidden patterns and valuable insights to enhance sales and promotional strategies.

In [None]:
# Preparing the data for visualization, grouping sales by school holidays
aux1 = df4[['school_holiday', 'sales']].groupby('school_holiday').sum().reset_index()

# Renaming values 0 and 1 to 'School Days' and 'School Holidays' in 'school_holiday'
aux1['school_holiday'] = aux1['school_holiday'].map({0: 'School Days', 1: 'School Holidays'})

# Setting style, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 10

# Adjusting the figure size
plt.figure(figsize=(24, 8))

# Creating the bar plot for school holidays
plt.subplot(2, 1, 1)
sns.set_style("white")
sns.barplot(x='school_holiday', y='sales', data=aux1)
plt.title('Sales During School Holidays')

# Preparing the data for visualization, grouping sales by month and school holidays
aux2 = df4[['month', 'school_holiday', 'sales']].groupby(['month', 'school_holiday']).sum().reset_index()

# Renaming values 0 and 1 to 'School Days' and 'School Holidays' in 'school_holiday'
aux2['school_holiday'] = aux2['school_holiday'].map({0: 'School Days', 1: 'School Holidays'})

# Creating the bar plot for sales by month, segmented by school holidays
plt.subplot(2, 1, 2)
sns.set_style("white")
g = sns.barplot(x='month', y='sales', hue='school_holiday', data=aux2)
plt.title('Sales per Month During School Holidays')

# Moving the legend outside the plot
g.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)

# Display the figure with all subplots
plt.tight_layout()
plt.show()

Upon delving deeper into our analysis of the impact of school holidays on sales volume, we discern an intriguing pattern throughout the year. Initially, **school days exhibit significantly higher sales volume**, which remains **consistent until we reach the month of July**.

However, in July, traditionally a period of school holidays, sales on holiday days approach those on school days. This shift indicates the **influence of school holidays on purchasing behavior**, likely due to a substantial portion of sales occurring during this period.

The most surprising discovery emerges in August. During this month, contrary to all expectations, **sales during school holidays surpass those on school days**. This behavior deviates from the observed pattern in other months, indicating a different sales dynamic in August.

After this peak, the situation normalizes, and sales on school days once again exceed those on school holidays.
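
The months in which holiday sales overtake school-day sales can also be read off a small pivot table instead of the chart. This is a minimal sketch on invented numbers, assuming the same `month`, `school_holiday` and `sales` columns used above. The column names `school_days`, `school_holidays` and `holiday_minus_school` are introduced only for this example.

```python
import pandas as pd

# Toy monthly totals split by school_holiday (hypothetical values)
toy = pd.DataFrame({
    'month': [6, 6, 7, 7, 8, 8, 9, 9],
    'school_holiday': [0, 1, 0, 1, 0, 1, 0, 1],
    'sales': [500, 50, 300, 280, 200, 320, 480, 40],
})

# One row per month, one column per school_holiday value
wide = (
    toy.pivot_table(index='month', columns='school_holiday', values='sales', aggfunc='sum')
       .rename(columns={0: 'school_days', 1: 'school_holidays'})
)

# Positive difference means holiday sales exceed school-day sales in that month
wide['holiday_minus_school'] = wide['school_holidays'] - wide['school_days']
months_holiday_ahead = wide.index[wide['holiday_minus_school'] > 0].tolist()
print(wide)
print(months_holiday_ahead)
```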

---
## **Summary of Evaluated Hypotheses**

1. **Stores with larger assortment should sell more**: Contrary to expectations, our analysis showed that stores with a larger assortment tend to have lower sales volume, challenging common assumptions.

2. **Stores with closer competitors should sell less**: Reality proved to be the opposite of what was expected. We discovered that stores with closer competitors tend to sell more, indicating the influence of the competitive environment on consumer behavior.

3. **Stores with longer-standing competitors should sell more**: Here, the hypothesis was refuted. In fact, stores with longer-standing competitors tend to sell less, suggesting that market maturity may negatively impact sales.

4. **Stores with longer-running promotions should sell more**: We found that the duration of promotions doesn't necessarily boost sales. Stores with longer-running promotions actually exhibited lower sales volume.

5. **Stores with more consecutive promotions should sell more**: Another counterintuitive result. Stores that ran more consecutive promotions showed reduced sales, potentially indicating consumer saturation with frequent promotions.

6. **Stores open during Christmas should sell more**: Our analysis refuted this hypothesis. Stores that remained open during Christmas showed lower sales.

7. **Stores should sell more over the years**: This expectation was also contradicted. A downward trend in sales over the years was observed.

8. **Stores should sell more in the second half of the year**: Again, the hypothesis was refuted. Stores demonstrated lower sales volume in the second half of the year.

9. **After the tenth day of each month, stores should sell more**: In this case, the hypothesis was validated. Stores indeed sell more after the tenth day of each month.

10. **Stores should sell less on weekends**: The hypothesis proved to be true. Sales are, on average, lower during weekends.

11. **Stores should sell less during school holidays**: We confirmed this hypothesis, with an interesting exception in July, when sales during school holidays approach those on school days, and in August, when they surpass them.

The analysis of these hypotheses unveiled various counterintuitive patterns that can be valuable in formulating business strategies.

In [None]:
import pandas as pd

# Creating the DataFrame
df = pd.DataFrame(
    [
        ['H1', 'False', 'Low'],
        ['H2', 'False', 'Medium'],
        ['H3', 'False', 'Medium'],
        ['H4', 'False', 'Low'],
        ['H5', '-', '-'],
        ['H6', 'False', 'Low'],
        ['H7', 'False', 'Medium'],
        ['H8', 'False', 'High'],
        ['H9', 'False', 'High'],
        ['H10', 'True', 'High'],
        ['H11', 'True', 'High'],
        ['H12', 'True', 'Low'],
    ],
    columns=['Hypothesis', 'Conclusion', 'Relevance'])

# Displaying the DataFrame
print(df)


---
## **4.3 Multivariate Analysis**
In Multivariate Analysis, we explore mutual relationships among multiple variables, beyond those between individual variables and the target variable. It enables us to identify correlations and assists in deciding which variables can be eliminated to simplify Machine Learning models without compromising the quality of information.

This process follows the `Principle of Parsimony` or `Occam's Razor`, which favors simpler models for better generalization of learning. Removing redundant variables or those carrying similar information contributes to reducing data dimensionality, thus decreasing model complexity.

Hence, Multivariate Analysis is an essential tool to enhance the efficiency of Machine Learning models and gain deeper insights from the data.
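
As an illustration of how a correlation matrix can flag redundant variables, the sketch below scans the upper triangle of the absolute correlation matrix and lists columns that are almost duplicates of an earlier column. The toy columns `a`, `a_copy` and `b` and the 0.95 threshold are arbitrary choices for this example, not values used in the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)

# Toy data: 'a_copy' is almost a rescaled copy of 'a', 'b' is unrelated noise
toy = pd.DataFrame({
    'a': a,
    'a_copy': 2 * a + rng.normal(scale=0.01, size=n),
    'b': rng.normal(size=n),
})

# Absolute Pearson correlations
corr = toy.corr(method='pearson').abs()

# Keep only the upper triangle (above the diagonal) so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # arbitrary cut-off for this sketch
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=redundant)
print(redundant, list(reduced.columns))
```

A high correlation is only a hint: which variable of a pair to drop, if any, still depends on business meaning and on how the model behaves.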

---
### **4.3.1 Numerical Attributes**

In [None]:
list(num_attributes)

In [None]:
# Calculation of the correlation matrix
correlation_num = num_attributes.corr(method='pearson')

# Style settings, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 10

# Adjusting the figure size
plt.figure(figsize=(30, 15))

# Settings to enhance visualization
sns.heatmap(correlation_num, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5, square=True, cbar_kws={"shrink": 0.75}, annot_kws={"size": 10})
plt.title('Correlation Matrix of Numeric Attributes', fontsize=14)
plt.xticks(rotation=45, fontsize=9)
plt.yticks(rotation=0, fontsize=9)
plt.show()

The correlation matrix is represented by a spectrum of colors: deep blue symbolizes a strong negative correlation, while red indicates a strong positive correlation. Intermediate shades suggest less significant correlations.

For instance, if we observe the correlation between the variables `store` and `is_promo`, we notice a value of `0.0046`. The same value can be found symmetrically in the matrix when comparing `is_promo` and `store`. The main diagonal is marked by red squares, indicating the correlation of a variable with itself, which always has a value of 1.

---
### **4.3.2 Categorical Attributes**

In [None]:
df_cat = df4.select_dtypes(include='object')
df_cat.head()

In [None]:
# Import the required library
from scipy import stats

# Define the function to calculate the Cramér's V coefficient
def cramer_v(x, y):
    # Create a contingency matrix from the two categorical variables
    contingency_matrix = pd.crosstab(x, y).values

    # Calculate the total number of observations
    total_observations = contingency_matrix.sum()

    # Identify the number of rows and columns in the contingency matrix
    num_rows, num_columns = contingency_matrix.shape

    # Calculate the chi-squared statistic from the contingency matrix
    chi_squared = stats.chi2_contingency(contingency_matrix)[0]

    # Calculate and return the Cramér's V coefficient
    return np.sqrt((chi_squared/total_observations) / (min(num_columns-1, num_rows-1)))

Cramér's V (sometimes written φc) is a standardized measure of the **association between two categorical variables**. This metric is derived from the chi-squared (χ²) statistic, but takes into account the sample size and the number of categories involved. The value of Cramér's V ranges from 0 to 1, where:

- A value of 0 represents the absence of any association between the categorical variables.
- A value of 1 indicates a perfect association between the variables.

Therefore, this coefficient is an important tool in data analysis, capable of indicating the strength of the relationship between categorical variables.

In [None]:
result_cramer_v = cramer_v(cat_attributes['state_holiday'], cat_attributes['store_type'])
print(result_cramer_v)

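The plain Cramér's V computed above tends to overestimate the strength of association, particularly for small samples or variables with many categories. A bias-corrected form of the coefficient adjusts both χ²/n and the number of rows and columns before taking the ratio. Note that the heatmap below is still computed with the `cramer_v` function defined above.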
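
A minimal sketch of such a bias-corrected coefficient, following the correction proposed by Bergsma (2013), is shown below. With n observations and an r × k contingency table, it uses φ²corr = max(0, χ²/n − (k−1)(r−1)/(n−1)), kcorr = k − (k−1)²/(n−1) and rcorr = r − (r−1)²/(n−1), and returns sqrt(φ²corr / min(kcorr−1, rcorr−1)). The function name `cramer_v_corrected` is hypothetical and not part of the notebook. The sketch computes χ² directly with NumPy and without a continuity correction, so for 2×2 tables its χ² can differ from the default of `stats.chi2_contingency`.

```python
import numpy as np
import pandas as pd

def cramer_v_corrected(x, y):
    """Bias-corrected Cramér's V for two categorical Series (illustrative sketch)."""
    cm = pd.crosstab(x, y).to_numpy()
    n = cm.sum()
    r, k = cm.shape
    if n < 2 or min(r, k) < 2:
        return 0.0  # association is undefined here; this sketch returns 0

    # Pearson chi-squared statistic, no continuity correction
    expected = np.outer(cm.sum(axis=1), cm.sum(axis=0)) / n
    chi2 = ((cm - expected) ** 2 / expected).sum()

    # Bias-corrected phi^2 and table dimensions
    phi2_corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    k_corr = k - (k - 1) ** 2 / (n - 1)
    r_corr = r - (r - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return float(np.sqrt(phi2_corr / denom)) if denom > 0 else 0.0

# Toy check: identical variables versus perfectly balanced, unrelated ones
x = pd.Series(['a', 'a', 'b', 'b'] * 25)
y_same = x.copy()
y_indep = pd.Series(['u', 'v'] * 50)
print(cramer_v_corrected(x, y_same), cramer_v_corrected(x, y_indep))
```

If this variant were preferred, it could be swapped into the pairwise computation in place of `cramer_v`. That substitution is a suggestion and is not something the notebook does.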



In [None]:
# Select only the categorical data
a = df4.select_dtypes(include='object')

# Calculate the Cramer V for each pair of variables
cramer_v_values = {col1: {col2: cramer_v(a[col1], a[col2]) for col2 in a.columns} for col1 in a.columns}

# Create the final dataframe
d = pd.DataFrame(cramer_v_values)

# Configure style, color palette, and font size for the plot
sns.set_style('white')
sns.set_palette('Set2')
plt.rcParams['font.size'] = 14

# Adjust the figure size
plt.figure(figsize=(30, 6))

# Create the heatmap
sns.heatmap(d, annot=True, cmap='coolwarm')

# Add titles
plt.title("Cramer V Correlation Among Categorical Variables")
plt.xlabel("Variables")
plt.ylabel("Variables")

plt.show()

---
# **5.0 Data Wrangling**
In this stage of the **data science project**, we are focusing on **data modeling** for the training of **machine learning algorithms**. However, before we delve into this topic, it's crucial to update the progress of our project, which will be shared with stakeholders.

So far, in the **Rossmann sales prediction project**, the initial demand came from store managers who were seeking a **sales forecast for the next six weeks**. This request was strategically handled, leading us to the next step: **understanding the business**.

We identified that the motivation behind this forecast was the need for **store renovation planning**. To achieve this, it's essential to know the expected revenue for each store, enabling an **accurate calculation of the required investment**. Hence, the sales forecast for the upcoming six weeks became relevant.

The next step in the process was **data collection**. In this case, data was obtained from an online platform. In a real-world scenario, this would be the time to extract all the necessary data from the company's database.

After data collection, **data cleaning** was performed. During this process, we also dealt with missing data, applied replacements, enforced business constraints, and filtered the data to retain only what's relevant for our analysis.

Finally, we conducted **exploratory data analysis**, which has been the most extensive phase of the project so far. Here, we formulated several hypotheses and learned how to prepare data for algorithm training.

While there are still a few steps left to complete the first cycle of the project and present the first version of our solution, we are already providing value to the business through the progress we've made so far. Now, we will proceed to the next phase of the project: **data modeling**.

In [None]:
df5 = df4.copy()

### **Motivation for Data Preparation**





In [None]:
m = pd.concat([d2, d3, d4, ct1, ct2, d1, d5, d6]).T.reset_index()
m.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
m

In [None]:
# Concatenate the DataFrames
m = pd.concat([d2, d3, d4, ct1, ct2, d1, d5, d6]).T.reset_index()

# Rename the columns
m.columns = ['Attributes', 'Minimum', 'Maximum', 'Range', 'Mean', 'Median', 'Standard Deviation', 'Skewness', 'Kurtosis']

# Adjust cell width and text alignment for better visualization
m_styled = m.style.set_properties(**{'text-align': 'left', 'width':'10em'})

# Apply conditional formatting to highlight the cell with the highest value in each column
m_styled = m_styled.highlight_max(color='green')

# Display the DataFrame
m_styled


The data preparation phase is a crucial element in the machine learning process. If the importance of data preparation were to be condensed into a single sentence, it would be: most machine learning algorithms are **designed to work with numerical data on a uniform scale**. Therefore, data preparation primarily involves two tasks: **converting categorical data into numerical data** and **normalizing the data scale**.

Such algorithms use optimization methods that often rely on mathematical operations, such as additions and multiplications. These **optimization methods seek to find the most effective parameters for the dataset**, and for this purpose, they rely on derivatives, which are defined exclusively for numerical variables. Hence, it's necessary to **convert categorical variables into numerical ones** without sacrificing vital information.

Another crucial point is the normalization of data scale. For example, when analyzing two variables, `day_of_week` ranging from 1 to 7 and `competition_distance` ranging from 20 meters to 200,000 meters, we notice a vastly different scale. This discrepancy can impact the model's learning, as optimization methods like gradient descent tend to prioritize variables with greater range. This could lead **the model to assign excessive importance to variables with larger scales**. To address this issue, we need to normalize the data, ensuring that all variables are on the same scale.

---
### **Types of Data Preparation**
There are essentially three types of data preparation: **Normalization, Scaling, and Numeric Encoding**.

- **`Normalization`**: This technique rescales the data so that they have a **mean of zero and a standard deviation of one**. For example, consider a variable `competition_distance` with a mean of 100 meters and a standard deviation of 20 meters. Normalization subtracts the mean from each value and divides by the standard deviation, so a distance of 120 meters becomes 1 and a distance of 60 meters becomes -2. The result is expressed in standard deviations from the mean and is not confined to a fixed interval. Normalization is particularly effective when the data follows a normal distribution.

- **`Scaling`**: This technique rescales values to a fixed interval, typically 0 to 1. It is more suitable than normalization for data that does not follow a normal distribution, because it makes no assumption about the shape of the distribution.

- **`Encoding`**: This technique is used to convert categorical variables into numerical ones, as mentioned earlier.

In addition to these three techniques, other transformations may be necessary depending on the nature of the data. For example, cyclic variables like months of the year, which repeat in a fixed cycle, might need to be transformed in a way that reflects this cyclic nature.
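
One common way to reflect this cyclic nature is to map each value onto a circle using sine and cosine. The snippet below is a sketch under that assumption; the `df_demo` frame and the `month_sin` / `month_cos` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a month column (1 to 12)
df_demo = pd.DataFrame({'month': [1, 4, 7, 10, 12]})

# Map each month to an angle on a circle with a period of 12
period = 12
angle = 2 * np.pi * df_demo['month'] / period
df_demo['month_sin'] = np.sin(angle)
df_demo['month_cos'] = np.cos(angle)

# Distances on the circle: December sits next to January, July is opposite
points = df_demo.set_index('month')[['month_sin', 'month_cos']]
dist_dec_jan = float(np.linalg.norm(points.loc[12] - points.loc[1]))
dist_jan_jul = float(np.linalg.norm(points.loc[1] - points.loc[7]))
print(dist_dec_jan, dist_jan_jul)
```

On the raw 1 to 12 scale, December and January look eleven units apart. After the transformation they are neighbors, which is what a model should see.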

## **5.2 Data Normalization**
**Normalization** ensures that all numbers in a dataset **share a similar scale**, preventing skewed comparisons. For instance, comparing people's heights in meters to weights in kilograms might be misleading due to the differing units. Normalization is akin to **converting to a common currency**, simplifying number comparison. In our method, we adjust numbers to have a **mean of 0 and a standard deviation of 1**, **rescaling** while maintaining the relationships between values.

This is achieved by subtracting the mean and dividing by the standard deviation, and tools like Scikit-Learn streamline the calculation. Note that **normalization changes the location and spread of a distribution, not its shape**: very different distributions, such as a **Beta**, an **exponential**, and **two normal distributions**, all end up with **mean 0** and **standard deviation 1**, yet a skewed distribution remains skewed.
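
The calculation can be sketched as follows, comparing the manual formula with scikit-learn's `StandardScaler`. The sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical distances in meters, shaped as a single feature column
values = np.array([[60.0], [80.0], [100.0], [120.0], [140.0]])

# Manual z-score: subtract the mean, divide by the (population) standard deviation
manual = (values - values.mean()) / values.std()

# Same transformation with scikit-learn
scaled = StandardScaler().fit_transform(values)

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

The result has mean 0 and standard deviation 1, and its largest value exceeds 1, which shows that normalized values are not restricted to a fixed interval.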

Before starting, we need to identify which variables call for normalization based on their distribution, as observed during **exploratory data analysis**. Our focus is on **Numeric Variables**, so we revisit the subsection "Univariate Analysis - Numeric Variables." The crucial step is to **recognize variables with a normal distribution**, since **normalization has the most impact when applied to variables that already exhibit this distribution**. The challenge is to identify which variables have this profile.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Initial settings for enhancing plot aesthetics
sns.set_style("white")
sns.set_palette("Set2")

# Plotting histograms
fig = plt.figure(figsize=(25, 20))

for i, column in enumerate(num_attributes.columns, start=1):
    ax = fig.add_subplot(4, 4, i)  # Assuming we have up to 16 columns in the DataFrame
    sns.histplot(num_attributes[column], bins=25, kde=True, ax=ax)

    # Adjusting titles and axes
    ax.set_title(f'Distribution of {column}', fontsize=15)
    ax.set_xlabel(column, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.tick_params(axis='x', labelsize=10)
    ax.tick_params(axis='y', labelsize=10)

# Adjusting the layout to prevent overlapping and maximize clarity
plt.tight_layout()
plt.show()

Although it can be observed that certain variables, such as `sales`, tend towards a distribution closer to normal, they still do not conform to a genuinely normal distribution. Due to the peculiarities and quality of our data, it would be more prudent to **refrain from applying normalization**, thus avoiding imposing a format that might not be the most appropriate.

Therefore, our next step will be the **application of the Rescaling technique** to all variables. This strategy aims to standardize the variables to the same range, optimizing the performance of machine learning algorithms in the subsequent phases of the project.

---
## **5.3 Rescaling**
Before we proceed, it's vital to understand the **motivation behind data preprocessing** in the context of data science.

1. **Categorical Variables:** Our dataset contains three of these variables that need to be transformed into numerical formats. This is because machine learning algorithms primarily operate with numerical inputs, making this conversion essential.

2. **Numeric Variables:** We've observed a wide range of intervals in our dataset. Algorithms tend to assign greater importance to variables with higher absolute values, which can be misleading. For instance, the variable `competition_distance` has more significant values than `day_of_week`, but this doesn't necessarily indicate that it's inherently more relevant. Thus, it's necessary to standardize these ranges, ensuring that the algorithm recognizes the true significance of each variable.

In this way, the **Rescaling technique** emerges as a tool to **standardize all variables within a common range**, optimizing the performance of machine learning algorithms.

---
### **5.3.1 Min-Max Scaler**
The **scaling** technique known as **Min-Max Scaler** is useful when working with variables that do not follow a normal distribution. It subtracts the minimum from each value and divides by the range (maximum minus minimum), so that the values fall within a defined scale, **typically between 0 and 1**. The **relative spacing between values is preserved; only the scale changes**. This standardization facilitates subsequent analysis, especially when using algorithms that are sensitive to the magnitude of values.
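
A minimal sketch of the min-max calculation in plain pandas is shown below. The helper name `min_max_scale` and the sample values are assumptions for illustration; scikit-learn's `MinMaxScaler` applies the same formula per column by default.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric series to the interval [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# Hypothetical promotion durations in weeks
promo_weeks = pd.Series([0.0, 10.0, 25.0, 50.0, 100.0])
scaled = min_max_scale(promo_weeks)
print(scaled.tolist())

# The same values plus one extreme observation
with_outlier = min_max_scale(
    pd.concat([promo_weeks, pd.Series([1000.0])], ignore_index=True)
)
print(with_outlier.tolist())
```

The second call illustrates the sensitivity to extreme values: a single large observation pushes all the other values into the lowest tenth of the interval.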

---
### **5.3.2 Robust Scaler**
The `Robust Scaler` is a technique used to adjust the scale of variables, especially when **outliers** are present. While the Min-Max Scaler method can be influenced by these outliers (as it considers minimum and maximum values), the **Robust Scaler is more stable** in this regard.

Instead of using extreme values, the Robust Scaler **relies on the quartiles of the data distribution**. Essentially, it subtracts the median and divides by the interquartile range, the distance between the first quartile (`25th percentile`) and the third quartile (`75th percentile`), which are points in the distribution less affected by outliers.

This approach is particularly useful when we want to **preserve the original data structure**, avoiding distortions caused by highly disparate values. However, it's important to note that the Robust Scaler **does not remove outliers or confine the result to a fixed interval**: extreme values remain extreme after scaling.
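
As a sketch of the calculation, the snippet below reproduces `RobustScaler` by hand on made-up distances (the values are assumptions, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical distances in meters, with one extreme value
distances = np.array([[100.0], [200.0], [300.0], [400.0], [500.0], [200000.0]])

# Manual version: center on the median, divide by the interquartile range
median = np.median(distances)
q1, q3 = np.percentile(distances, [25, 75])
manual = (distances - median) / (q3 - q1)

# scikit-learn version with its default quantile range (25th to 75th percentile)
scaled = RobustScaler().fit_transform(distances)

print(scaled.ravel())
```

The typical values land close to 0, while the extreme value remains far away from them. The scaler limits the influence of the outlier on the scale; it does not clip it.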

---
### **Application**
The first step in applying scaling techniques is to **identify the variables that will be scaled**. We do this by selecting the numeric variables in our data.

We can use the `select_dtypes` function, which allows us to select columns from a DataFrame based on their data types.

In [None]:
# Selecting only the numeric columns (int64 and float64) from the DataFrame df5
numeric_columns = df5.select_dtypes(include=['int64', 'float64'])

# Displaying the first 5 rows of the selected numeric columns
numeric_columns.head()

In the code above, we ask pandas for the numeric columns, meaning those of type `int64` or `float64`. Inspecting the first rows confirms that the selection is correct.

**Scaling Decision:** The nature of your data guides whether to apply MinMaxScaler or RobustScaler. In our case, we chose to use these techniques on variables like `day_of_week`, `competition_distance`, and others, totaling eight specific variables.

**Cyclic Variables:** Some variables exhibit a repeating pattern over time, such as `month` or `day_of_week`. We call these cyclic variables. In preparation for modeling, these variables benefit from a special transformation that considers their **repetitive nature**.

**Non-Cyclic Variables:** On the other hand, we have variables that do not follow this repeating cycle, for example, `competition_distance` and `year`. The key question is to decide which scaling technique is most appropriate. Here, **the decision often hinges on the presence or absence of outliers**, which are extreme values that can distort analyses.

**Outlier Identification:** A visual tool like the `boxplot` assists us in this process. By visualizing, for instance, the variable `competition_distance` using a boxplot, we can quickly identify the presence of outliers and thus make more informed decisions about the scaling technique to be applied.
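
A numeric counterpart to the boxplot is the 1.5 × IQR rule, which is the default criterion boxplots use to draw points beyond the whiskers. The helper below is a sketch; the name `iqr_outlier_mask` and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical distances in meters, with one extreme value
demo = pd.Series([100, 200, 300, 400, 500, 200000])
mask = iqr_outlier_mask(demo)
print(demo[mask].tolist())
```

A column with a meaningful share of flagged values is a candidate for the `RobustScaler`; otherwise the `MinMaxScaler` is usually sufficient.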

---
### 1. **Robust Scaler on `competition_distance`**

In [None]:
# Figure size configuration
plt.figure(figsize=(24, 6))

# First Plot - Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['competition_distance'], color='skyblue')
plt.title('Competition Distance Boxplot', fontsize=15)
plt.xlabel('Competition Distance', fontsize=12)
plt.ylabel('Distribution', fontsize=12)

# Second Plot - Distribution
plt.subplot(1, 2, 2)
# histplot replaces the deprecated distplot; stat='density' keeps a density axis
sns.histplot(df4['competition_distance'], color='salmon', bins=20, kde=True, stat='density')
plt.title('Competition Distance Distribution', fontsize=15)
plt.xlabel('Competition Distance', fontsize=12)
plt.ylabel('Density', fontsize=12)

# Adjust layout to prevent overlap and enhance presentation
plt.tight_layout()

# Display the figure
plt.show()

In the context of our data, we have detected significant **outliers** in the `competition_distance` variable. Given this, the most suitable choice is the `RobustScaler`, which stands out precisely for its lower sensitivity to outliers. This characteristic makes it particularly well-suited for our scenario.

To start using the `RobustScaler`, the first step is to incorporate it into our code. This is done by importing it from the renowned `Scikit-learn` library in Python.

In [None]:
# Importing the necessary tool
from sklearn.preprocessing import RobustScaler

# Instantiating the RobustScaler
robust_scaler = RobustScaler()

# Applying the transformation to the desired column
df5['competition_distance'] = robust_scaler.fit_transform(df5[['competition_distance']])

When we talk about the `RobustScaler`, we're essentially discussing a tool that rescales our data without being thrown off by outlier values that deviate significantly from the rest. Imagine you're surveying a piece of land that has a few very large rocks. The RobustScaler sets its measuring scale from the typical terrain (the median and the interquartile range) rather than from the rocks, so they don't distort the measurement of everything else.

The magic happens with the `fit_transform()` method. The first part, `fit`, is like surveying the land: it computes the median and the interquartile range of the column. The second part, `transform`, applies the rescaling using those values.

So, after using the `RobustScaler`, the `competition_distance` column holds rescaled values that keep the original ordering and relative spacing. The "big rocks" are still there, but they no longer determine the scale of the other values.

In [None]:
# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['competition_distance'], color='skyblue')
plt.title('Competition Distance Boxplot with Robust Scaler', fontsize=16)
plt.xlabel('Competition Distance', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot
plt.subplot(1, 2, 2)
# histplot replaces the deprecated distplot; stat='density' keeps a density axis
sns.histplot(df5['competition_distance'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Competition Distance Distribution with Robust Scaler', fontsize=16)
plt.xlabel('Competition Distance', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()

When we observe the `competition_distance` graph, we notice a change. **Previously, the values ranged up to roughly 200,000**. After applying the `Robust Scaler`, the values are centered on the median, which becomes 0, and expressed in units of the interquartile range: most observations now fall in a narrow band around 0, while the outliers take larger values.

But here's the key: **the shape of the distribution doesn't change**, only the range of values. If you imagine the data as a rubber band, we're stretching or compressing that rubber band uniformly. It still **retains the same shape**, just **in a different size**.

Why do this? Well, by bringing all variables to comparable scales, **we ensure that none of them is more "important" than another just because they have larger numbers**. It's like leveling the playing field for all the data. So, when we build a model, each variable contributes fairly.

This is what makes the `Robust Scaler` so valuable. It allows variables to communicate on equal terms, without altering the essence of how each one is distributed.

---
### 2. **Robust Scaler on `competition_time_month`**
We will follow the same procedure for the variable `competition_time_month`.

In [None]:
# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['competition_time_month'], color='skyblue')
plt.title('Competition Time Months Boxplot', fontsize=16)
plt.xlabel('Competition Time Months', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot (the scaler has not been applied yet at this point)
plt.subplot(1, 2, 2)
sns.histplot(df4['competition_time_month'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Competition Time Months Distribution', fontsize=16)
plt.xlabel('Competition Time Months', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()

In [None]:
from sklearn.preprocessing import RobustScaler

# Applying RobustScaler to the column
rs = RobustScaler()
df5['competition_time_month'] = rs.fit_transform(df5[['competition_time_month']])

# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['competition_time_month'], color='skyblue')
plt.title('Competition Time Months Boxplot with Robust Scaler', fontsize=16)
plt.xlabel('Competition Time Months (Scaled)', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot
plt.subplot(1, 2, 2)
sns.histplot(df5['competition_time_month'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Competition Time Months Distribution with Robust Scaler', fontsize=16)
plt.xlabel('Competition Time Months', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()


---
### 3. **MinMaxScaler on `promo_time_week`**

In [None]:
# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['promo_time_week'], color='skyblue')
plt.title('Promotion Weeks Boxplot', fontsize=16)
plt.xlabel('Promotion Weeks', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot
plt.subplot(1, 2, 2)
# histplot replaces the deprecated distplot; stat='density' keeps a density axis
sns.histplot(df4['promo_time_week'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Promotion Weeks Distribution', fontsize=16)
plt.xlabel('Promotion Weeks', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Applying MinMaxScaler transformation
mms = MinMaxScaler()
df5['promo_time_week'] = mms.fit_transform(df5[['promo_time_week']])

# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['promo_time_week'], color='skyblue')
plt.title('Promotion Weeks Boxplot with MinMax Scaler', fontsize=16)
plt.xlabel('Promotion Weeks', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot (df5 holds the scaled values; df4 is still unscaled)
plt.subplot(1, 2, 2)
sns.histplot(df5['promo_time_week'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Promotion Weeks Distribution with MinMax Scaler', fontsize=16)
plt.xlabel('Promotion Weeks', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()

---
### 4. **MinMaxScaler on `year`**

In [None]:
# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['year'], color='skyblue')
plt.title('Year Boxplot', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot
plt.subplot(1, 2, 2)
# histplot replaces the deprecated distplot; stat='density' keeps a density axis
sns.histplot(df4['year'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Year Distribution', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Normalizing the 'year' column with MinMaxScaler
mms = MinMaxScaler()
df5['year'] = mms.fit_transform(df5[['year']])

# Configuring the overall figure size
plt.figure(figsize=(24, 6))

# Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(x=df5['year'], color='skyblue')
plt.title('Year Boxplot with MinMaxScaler', fontsize=16)
plt.xlabel('Normalized Year', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Distribution Plot (df5 holds the scaled values; df4 is still unscaled)
plt.subplot(1, 2, 2)
sns.histplot(df5['year'], color='salmon', bins=30, kde=True, stat='density')
plt.title('Year Distribution with MinMaxScaler', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjusting plots to prevent overlap
plt.tight_layout()
plt.show()

## **5.4 `Encoding`**
When preparing data for machine learning models, we need to convert categorical variables into numerical ones. This process is known as **encoding**. For example, transforming a variable like "color" with values such as "red", "blue", and "green" into the numbers 1, 2, and 3.

Why do we need to do this? Most models only understand numbers. They don't know how to handle words or categories directly.

There are various methods for performing this encoding. Each method has its own utility, and the choice of the best method depends on the nature of your data. Here are two recommendations:

1. **Study your data:** Examine the different values that your categorical variable can take and how they relate to what you want to predict.
2. **Test different approaches:** The best way to determine what works for your data is to experiment. Apply different encoding methods and see which one provides the best results.

Remember: What works well for one dataset may not be ideal for another. So, understanding and experimenting are the keys to success.

### **Types of Encoding**
There are various ways to transform words or categories into numbers so that our machine learning models can understand them. Here are six common methods for achieving this:

- **`One-Hot Encoding`:** Creates a column for each category. If the category is present, the value is 1; if not, it's 0.

- **`Label Encoding`:** Assigns a different number to each category. Useful when dealing with many categories.

- **`Ordinal Encoding`:** Similar to Label Encoding, but respects the order of categories. For instance, "low", "medium", and "high" might become 1, 2, and 3.

- **`Target Encoding`:** Uses the average target value of each category as its number.

- **`Frequency Encoding`:** Instead of the average, this method uses the frequency of category occurrences.

- **`Binary Encoding`:** Converts categories into binary codes. Good for handling many categories.

Want to dive deeper? There's an [excellent article](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02) on this topic. And if you're programming in Python, the [category_encoders](https://contrib.scikit-learn.org/category_encoders/) package could be helpful.

### **5.4.1 `One-Hot Encoding`**
One-Hot Encoding is like creating a "checkbox" for each category in a variable.

Imagine we have a variable called 'Temperature' that can be 'Cold', 'Warm', 'Hot', or 'Very Hot'. With One-Hot Encoding, we would create a column for each of these options.

Now, consider a row where 'Temperature' is 'Hot'. In this case, you would "check" the 'Hot' checkbox (or column) with a '1' and leave all the others (Cold, Warm, Very Hot) as '0'.

It's as if each option has its own checkbox, and we mark the one that applies in each case.

In [None]:
import pandas as pd

data = {'Temperature': ['Hot'], 'Cold': [0], 'Warm': [0], 'Hot': [1], 'Very Hot': [0]}
df = pd.DataFrame(data)
print(df)

Imagine you have a set of colored pencils. Using the One-Hot Encoding technique is like placing each pencil of a different color into separate boxes.

**Advantages:**
- It's as if you have a box for each color, making it easy to identify which pencil you have (i.e., it's straightforward to apply).
- Your initial set of colored pencils (categorical variables) can now be represented numerically.

**Disadvantages:**
- If you have many colors of pencils, you'll end up with many boxes! This can take up a lot of space (increases data dimensionality).
- Having many boxes can complicate things when you want to draw something (i.e., make the model learning more complex).

**Practical Example:**
Think about regular days and holidays. Holidays are like special colors in your set. By using One-Hot Encoding, you can separate holidays into their own boxes, highlighting their significance. This helps to perceive the behavioral differences between a regular day and a holiday, which can be valuable for models predicting events on these dates.

---
### **5.4.2 `Label Encoding`**
Imagine you have a collection of animal stickers. Using the Label Encoding technique is like assigning a unique identification number to each animal, as if you were cataloging your collection.

**What happens?**
- Each animal (category) is assigned a unique number, like a code.
- This number doesn't convey any information about the animal. For example, assigning the number 1 to a cat and 2 to a dog doesn't mean that one is "better" or "larger" than the other. It's just a code!

**Practical Example:**
If we consider a collection of temperature stickers like 'Cold', 'Warm', 'Hot', and 'Very Hot', the Label Encoding technique could catalog them like this:

- Cold: 1
- Warm: 2
- Hot: 3
- Very Hot: 4

Remember: The numbers are merely codes. In our example, it doesn't mean that "Very Hot" is four times hotter than "Cold". They are just identifiers.

In [None]:
# Codes follow the mapping listed above: Cold=1, Warm=2, Hot=3, Very Hot=4
data = {'Temperature': ['Cold', 'Warm', 'Hot', 'Very Hot'], 'Encoding (Label Encoding)': [1, 2, 3, 4]}
df = pd.DataFrame(data)
df

When using Label Encoding, we assign a unique number to each category of the 'Temperature' variable. These numbers do not reflect any order or hierarchy; they are merely numeric representations of the categories.

This technique works well when the categories have no inherent sequence or relationship between them, such as city names or product brands, and the model does not treat the codes as quantities. Caution is required in two directions. A model still receives ordinary integers, so algorithms that rely on numeric magnitude (linear models, for example) can read an order into codes that were assigned arbitrarily. Conversely, if the categorical variable has a logical sequence (like 'low', 'medium', 'high'), Label Encoding may assign codes that do not follow it, because the numbers are just labels and do not reflect the actual order of the categories.
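
As a sketch of how this looks in practice, scikit-learn's `LabelEncoder` assigns codes starting at 0 in sorted (here alphabetical) order of the category names. The `df_demo` frame and the `Temperature_code` column are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical temperature labels
df_demo = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']})

# Fit the encoder and replace each label with its integer code
le = LabelEncoder()
df_demo['Temperature_code'] = le.fit_transform(df_demo['Temperature'])

# classes_ lists the categories in the order of their codes (0, 1, 2, ...)
print(list(le.classes_))
print(df_demo)
```

Here 'Hot' receives a lower code than 'Warm' purely because of alphabetical sorting, which illustrates why the codes should be read as identifiers and not as a ranking.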


### **5.4.3 Ordinal Encoding**
Ordinal Encoding is a special encoding technique that takes into account the order or hierarchy of categories. This is essential when categories have a logical sequence, such as in rankings or intensities.

For instance, consider the 'Temperature' variable with categories 'Cold', 'Warm', 'Hot', and 'Very Hot'. Unlike Label Encoding, which merely labels categories, Ordinal Encoding respects the natural order. Therefore, during encoding, 'Cold' would have a lower value than 'Warm', which would have a lower value than 'Hot', and ultimately 'Hot' would have a lower value than 'Very Hot'.

In summary, Ordinal Encoding is akin to assigning grades on a scale to each category while respecting its position or intensity in the sequence.


In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Initial data
data = {'Temperature': ['Cold', 'Warm', 'Hot', 'Very Hot']}
df = pd.DataFrame(data)

# Encoding the 'Temperature' column respecting the order of categories
encoder = OrdinalEncoder(categories=[['Cold', 'Warm', 'Hot', 'Very Hot']])
df['Encoded_Temperature'] = encoder.fit_transform(df[['Temperature']])
print(df)

Ordinal Encoding carries the order of the categories into their numerical representation, making the data more informative for certain machine learning algorithms.

However, it is crucial to use it only when there is a clear order among the categories. Otherwise, we could induce a nonexistent hierarchy, compromising the accuracy of the learning model.

### **5.4.4 Target Encoding**

Target Encoding transforms categorical variables by using the average of the target value associated with each category.

Imagine you have a categorical variable "Temperature" with categories such as 'Hot', 'Cold', 'Very Hot', and 'Warm'. Additionally, you have a target variable, let's say "Sales".

When applying target encoding to the "Temperature" variable, each category gets replaced with the average "Sales" value corresponding to that category.

In [None]:
import pandas as pd

data = {
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target_Value': [0.7, 0.5, 0.2, 0.3, 0.7, 0.3, 0.3, 0.7, 0.7, 0.5]
}

df = pd.DataFrame(data)
df

When dealing with a categorical variable that has numerous different categories, using target encoding can be really helpful! Why? Imagine if each category turned into a new column, similar to what happens with the one-hot method. We would end up with a multitude of columns!

However, with target encoding, we avoid this "explosion" of columns, making the analysis simpler and more effective.
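
A minimal sketch of the calculation with pandas: group by the category, take the mean of the target, and map it back onto the rows. The `Sales` figures and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical feature and a numeric target
df_demo = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Hot', 'Warm', 'Cold', 'Warm'],
    'Sales': [100.0, 40.0, 120.0, 70.0, 60.0, 90.0],
})

# Average target value per category
category_means = df_demo.groupby('Temperature')['Sales'].mean()

# Replace each category with its average target value
df_demo['Temperature_encoded'] = df_demo['Temperature'].map(category_means)
print(df_demo)
```

In a real project the averages should be computed on the training data only and then applied to validation and test data; otherwise information about the target leaks into the features.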

---
### **5.4.5 Frequency Encoding**
Let's discuss frequency encoding, a very practical way to handle variables with many different categories. The idea is simple: instead of focusing on the category names, we'll look at how frequently each category appears.

Imagine we're talking about temperatures: 'Hot', 'Cold', 'Very Hot', and 'Warm'. Instead of using the actual category names, we would replace each "Hot," for instance, with the number of times "Hot" appears or its proportion in the entire dataset.

In [None]:
import pandas as pd

# Data about temperatures
data = {
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Corresponding_Value': [0.4, 0.2, 0.1, 0.3, 0.4, 0.3, 0.3, 0.4, 0.4, 0.2]
}

# Converting data into a DataFrame
df_temperatures = pd.DataFrame(data)
print(df_temperatures)

The technique of frequency encoding is quite practical when dealing with categorical variables that represent, for instance, "names" or "labels" without a clear order or relationship between them. Imagine you're working on a project related to car brands. In this scenario, names like 'Chevrolet', 'Ford', and others don't have a specific hierarchy or order among them.

In such a case, frequency encoding can be effective. With this method, instead of working directly with the brand names, you represent each brand by the number of times it appears in your dataset. Thus, 'Chevrolet' could be replaced by the frequency of times that brand appears, and the same would apply to 'Ford' and so on.

This way, we manage to provide a numerical treatment to our information, preserving the data's relevance and making it easier for algorithms to process.
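
The car-brand scenario described above can be sketched as follows. This is only a minimal illustration: the `df_cars` DataFrame and its values are hypothetical, and the column names `brand_count` and `brand_freq` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical car-brand data, used only for illustration
df_cars = pd.DataFrame({
    'brand': ['Chevrolet', 'Ford', 'Chevrolet', 'Fiat', 'Ford', 'Chevrolet']
})

# Absolute frequency: how many times each brand appears
brand_counts = df_cars['brand'].value_counts()
df_cars['brand_count'] = df_cars['brand'].map(brand_counts)

# Relative frequency: share of rows that belong to each brand
brand_share = df_cars['brand'].value_counts(normalize=True)
df_cars['brand_freq'] = df_cars['brand'].map(brand_share)

print(df_cars)
```

One caveat worth keeping in mind: two brands that appear the same number of times receive the same encoded value, so the model can no longer tell them apart.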

### **5.4.6 `Embedding Encoding`**
The technique of "Embedding Encoding" is like a magic wand in the world of Deep Learning, especially when we're talking about Natural Language Processing (NLP). Its primary goal is to transform words or categories into small vectors filled with numbers. Words or categories with similar meanings or uses end up getting closer to each other in this vector space.

Imagine we have a "dictionary" of words: "hot," "cold," "very hot," and "warm." If we choose to represent each word using two numbers (in other words, in two dimensions), we could have something like this:

- "hot": [0.1, 0.3]
- "cold": [-0.2, -0.1]
- "very hot": [0.2, 0.4]
- "warm": [0, 0.2]

What we need to know is that the numbers in these vectors are not chosen randomly! During the training of a model, the algorithm adjusts these numbers based on how the words are used together in the dataset. Therefore, in the end, words with similar uses or meanings will have vectors that are closer to each other.

This technique is incredibly powerful as it manages to capture the "meaning" of words and intelligently transform it into numbers.
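
To make the idea of "closeness" concrete, the sketch below stores the illustrative vectors from the example above in a lookup table and measures the distance between words. The helper names `embed` and `distance` are hypothetical; in a real model the vectors would be learned during training rather than typed by hand.

```python
import numpy as np

# Illustrative 2-dimensional vectors taken from the example above;
# in a real model these values would be learned during training
vocabulary = ['hot', 'cold', 'very hot', 'warm']
embedding_matrix = np.array([
    [0.1, 0.3],    # hot
    [-0.2, -0.1],  # cold
    [0.2, 0.4],    # very hot
    [0.0, 0.2],    # warm
])

# Map each word to its row in the embedding matrix
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def embed(word):
    """Look up the vector for a word (hypothetical helper)."""
    return embedding_matrix[word_to_index[word]]

def distance(word_a, word_b):
    """Euclidean distance between two word vectors."""
    return np.linalg.norm(embed(word_a) - embed(word_b))

print(distance('hot', 'very hot'))
print(distance('hot', 'cold'))
```

With these vectors, "hot" sits closer to "very hot" than to "cold", which is the behavior an embedding is expected to learn.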

## **5.5 `Encoding`**
Imagine that we have a data table called **`df5`**. This table has a column named **`state_holiday`**, which indicates the type of holiday on a given day. In some records, the value is **regular days**, indicating a day without a holiday. In others, it can be 'Easter', 'Christmas', or other types of holidays.

The One-hot encoding technique aims to create new columns for each unique category in the original column. Thus, for each record, we assign the value '1' in the column corresponding to its holiday type and '0' in the other columns.

To facilitate this process, we can use the Pandas library. This library offers a function called **`get_dummies()`** that allows us to perform this transformation efficiently. By using this function, we can quickly convert our **`state_holiday`** column into multiple binary columns.

In [None]:
df6 = df5.copy()

---
### **5.5.1 `state_holiday`**

In [None]:
df6.columns

In [None]:
# state_holiday - One-Hot Encoding
df6 = pd.get_dummies(df6, columns=['state_holiday'], prefix='state_holiday')

In this example, the argument `columns` specifies the column to be encoded, while the argument `prefix` indicates the prefix to be added to the name of each new column.

---
### **5.5.2 `store_type`**
Using the **LabelEncoder** from sklearn, we can convert categories into numbers. For instance, the variable `store_type` has categories such as 'A', 'B', and 'C'. The **LabelEncoder** assigns an integer to each category in sorted order (0, 1, 2, ...). Because integers imply an order, this encoding suits categories with a clear sequence best, although tree-based models usually tolerate it for unordered categories as well.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initializing the LabelEncoder
le = LabelEncoder()

# Transforming the 'store_type' column into numerical values
df6['store_type'] = le.fit_transform(df6['store_type'])

In the code, `fit_transform` applies the **LabelEncoder** to the `store_type` column, updating it with the encoded values. If we use this on the training data, the mapping is learned correctly. However, if we repeat it on the test set, it might learn a different mapping, leading to inconsistent numerical representation between training and testing. This can negatively impact the model's performance.
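
One way to avoid the inconsistency described above is to learn the mapping once on the training data and reuse it on the test data. The sketch below assumes two small hypothetical DataFrames, `train` and `test`; it is an illustration, not the exact flow used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test split of a categorical column
train = pd.DataFrame({'store_type': ['a', 'b', 'c', 'a', 'd']})
test = pd.DataFrame({'store_type': ['d', 'a', 'c']})

le = LabelEncoder()

# Learn the category-to-integer mapping on the training data only
train['store_type_enc'] = le.fit_transform(train['store_type'])

# Reuse the same mapping on the test data (transform, not fit_transform)
test['store_type_enc'] = le.transform(test['store_type'])

# Inspect the learned mapping
mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(mapping)
```

Note that `transform` raises a `ValueError` if the test set contains a category that was not present during `fit`, so unseen categories need to be handled beforehand.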

### **5.5.3 `assortment`**
We have the categorical variable `assortment`, which contains three categories: `basic`, `extra`, and `extended`. These categories have an inherent order, where `basic` is lower than `extra`, which in turn is lower than `extended`.

In such cases, it's useful to map these categories to numbers that reflect this order. For this purpose, we use the **ordinal encoding** method.

To apply ordinal encoding, we first define a dictionary that maps each category to a number that reflects its position on the scale.

In [None]:
# Creating a dictionary to map 'assortment' types to numerical values
assortment_dict = {'basic': 1, 'extra': 2, 'extended': 3}

# Applying the mapping to the 'assortment' column of the DataFrame df6
df6['assortment'] = df6['assortment'].map(assortment_dict)

# Displaying a random sample from the updated DataFrame
df6.sample()

## **5.6 Data Transformation**
Data transformation is essential in data science to optimize the performance of models. There are two main types of transformations:

- **Magnitude Transformation:** Aims to make the distribution of the variable closer to normal. This benefits algorithms that assume a normal distribution in the data, enhancing their accuracy.

- **Nature Transformation:** Focuses on preserving inherent data characteristics. As in the example of months: December (12) and January (1) are sequential, and this transformation preserves such a relationship despite the numerical difference.

---
### **5.6.1 Types of Transformation**
Data transformations are techniques used to **change the scale or distribution of data**, and there are various approaches to perform them. The most common transformations include logarithmic transformation, Box-Cox transformation, cube root extraction, and square root extraction. These transformations are particularly useful for dealing with skewed data.

- **Logarithmic:** Applies the logarithm to each value of the variable. Useful to make skewed distributions more "normal".

- **Box-Cox:** Adjusts the variable to approach a normal distribution. This transformation encompasses others, such as logarithmic and square root.

- **Cube Root:** Applies the cube root to each value. Recommended for data with positive skewness.

- **Square Root:** Similar to the cube root, but uses the square root.

These techniques enhance the treatment of data with atypical or asymmetric distributions.
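
As a rough illustration of how these transformations affect skewed data, the sketch below compares the skewness of a synthetic right-skewed sample before and after each transformation. The data are randomly generated for this example only, and Box-Cox is left out to keep the sketch limited to NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), used only for illustration
rng = np.random.default_rng(42)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# Skewness close to 0 indicates a roughly symmetric distribution
skewness = {
    'original': values.skew(),
    'square root': np.sqrt(values).skew(),
    'cube root': np.cbrt(values).skew(),
    'logarithm': np.log1p(values).skew(),
}

for name, value in skewness.items():
    print(f'{name}: skewness = {value:.2f}')
```

Each transformation should pull the skewness of this sample toward zero; which one works best depends on the data at hand.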

---
### **Logarithmic Transformation of the Response Variable `sales`**
The process to apply a logarithmic transformation to the response variable is straightforward. In our example, we will use the **log1p** function from the **NumPy** package, which computes the natural logarithm of one plus the value, log(1 + x), and therefore remains defined when the value is zero. We apply it to the response variable, referred to as `sales` in this case.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Set the figure size
plt.figure(figsize=(24, 6))

# First subplot: Original distribution of sales
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
plt.subplot(1, 2, 1)
sns.histplot(df6['sales'], kde=True, stat='density')
plt.title('Original Sales Distribution')  # Title of the plot
plt.xlabel('Sales')  # x-axis label
plt.ylabel('Density')  # y-axis label

# Apply log1p transformation to the sales column
df6['sales'] = np.log1p(df6['sales'])

# Second subplot: Sales distribution after logarithmic transformation
plt.subplot(1, 2, 2)
sns.histplot(df6['sales'], kde=True, stat='density')
plt.title('Sales Distribution after Logarithmic Transformation')
plt.xlabel('Sales (log)')
plt.ylabel('Density')

plt.tight_layout()  # Adjust layout to avoid overlap
plt.show()          # Display the plot

In the histogram on the right, we observe a distribution with a shape closer to that of a bell curve, which is typical of a normal distribution. This suggests that the logarithmic transformation was effective in bringing the data closer to a normal distribution.
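
Because the model will be trained on the transformed `sales`, its predictions come out on the logarithmic scale and have to be converted back before they are reported. A minimal sketch of the round trip, using hypothetical sales values:

```python
import numpy as np

# Hypothetical daily sales values, including a zero-sales day
sales = np.array([0.0, 1500.0, 5263.0, 12000.0])

# Forward transformation: log(1 + x), defined even for x = 0
sales_log = np.log1p(sales)

# Inverse transformation: exp(y) - 1 recovers the original scale
sales_restored = np.expm1(sales_log)

print(sales_log)
print(sales_restored)
```

`np.expm1` is the exact inverse of `np.log1p`, so the restored values match the originals up to floating-point precision.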

### **Cyclic Temporal Transformation**
Cyclic transformation aims to represent the recurring nature of temporal data. For example, in a monthly sequence from January to December, we observe a repetition every year.

Think of the months as points on a circle. The closeness between December and January is similar to that between January and February, as the cycle restarts from January after December. A linear representation (numbers from 1 to 12) doesn't capture this cyclic relationship, as it places December and January at opposite ends.

The solution is **cyclic encoding** using the **trigonometric circle**. Here, each month is assigned two coordinates: the sine and cosine of its corresponding angle on the circle. For instance, with the circle divided into 12 parts, January (1) sits at an angle of 30°, with a sine of 0.5 and a cosine of about 0.87, while December (12) completes the circle with a sine of 0 and a cosine of 1.

This technique represents each month with two coordinates, capturing the cyclic nature and preserving the proximity between consecutive months, such as December and January.

In [None]:
import numpy as np

# Define a constant for cyclic calculation based on a year with 12 months
CYCLIC_FACTOR_MONTH = 2. * np.pi / 12

# Use sine to transform the 'month' column into a cyclic coordinate
df6['month_sin'] = df6['month'].apply(lambda x: np.sin(x * CYCLIC_FACTOR_MONTH))

# Use cosine to transform the 'month' column into another cyclic coordinate
df6['month_cos'] = df6['month'].apply(lambda x: np.cos(x * CYCLIC_FACTOR_MONTH))

Let's explain this step in a technical and straightforward manner:

1. **Selection of the Month Column**: First, we choose the 'month' column in the DataFrame df6. This column will undergo the cyclic transformation.

2. **Definition of Transformation Function**: Next, we create a transformation function that will be applied to each value in the month column. The function uses the sine function to transform each value. The value is multiplied by **2 * π / cycle**, where cycle represents the duration of the cycle (in this case, 12 months).

3. **Application of Transformation Function**: The transformation function is applied to each value in the month column using the apply method, which executes a function for each element in a series or dataframe. The result is stored in a new column named 'month_sin'.

4. **Repeat with Cosine Function**: The process is repeated, but this time using the cosine function. The result is stored in another new column called 'month_cos'.
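
To check that the encoding behaves as intended, the sketch below computes the distance between months on the circle. It is a standalone illustration that does not depend on `df6`; the helper `cyclic_distance` is a hypothetical name introduced for this example.

```python
import numpy as np

# Sine and cosine coordinates for the 12 months
months = np.arange(1, 13)
angle = months * 2.0 * np.pi / 12
month_sin = np.sin(angle)
month_cos = np.cos(angle)

def cyclic_distance(month_a, month_b):
    """Euclidean distance between two months on the unit circle."""
    a, b = month_a - 1, month_b - 1  # convert month number to array index
    return np.hypot(month_sin[a] - month_sin[b], month_cos[a] - month_cos[b])

print(cyclic_distance(12, 1))  # December to January
print(cyclic_distance(1, 2))   # January to February
print(cyclic_distance(1, 7))   # January to July (opposite sides of the circle)
```

December and January end up exactly as close as January and February, while months half a year apart are the farthest from each other.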

Now, we will follow the same procedure for the other cyclic variables and their respective periods.

In [None]:
# Define a constant for cyclic calculation based on a week with 7 days
CYCLIC_FACTOR_WEEK = 2. * np.pi / 7

# Use sine to transform 'day_of_week' into a cyclic coordinate
df6['day_of_week_sin'] = df6['day_of_week'].apply(lambda x: np.sin(x * CYCLIC_FACTOR_WEEK))

# Use cosine to transform 'day_of_week' into another cyclic coordinate
df6['day_of_week_cos'] = df6['day_of_week'].apply(lambda x: np.cos(x * CYCLIC_FACTOR_WEEK))

In [None]:
# Define a constant for cyclic calculation based on a month with approximately 30 days
CYCLIC_FACTOR_DAY = 2. * np.pi / 30

# Use sine to transform the 'day' column into a cyclic coordinate
df6['day_sin'] = df6['day'].apply(lambda x: np.sin(x * CYCLIC_FACTOR_DAY))

# Use cosine to transform the 'day' column into another cyclic coordinate
df6['day_cos'] = df6['day'].apply(lambda x: np.cos(x * CYCLIC_FACTOR_DAY))

In [None]:
# Define a constant for cyclic calculation based on 52 weeks in a year
CYCLIC_FACTOR_YEAR_WEEK = 2. * np.pi / 52

# Use sine to transform the 'week_of_year' column into a cyclic coordinate
df6['week_of_year_sin'] = df6['week_of_year'].apply(lambda x: np.sin(x * CYCLIC_FACTOR_YEAR_WEEK))

# Use cosine to transform the 'week_of_year' column into another cyclic coordinate
df6['week_of_year_cos'] = df6['week_of_year'].apply(lambda x: np.cos(x * CYCLIC_FACTOR_YEAR_WEEK))


- **Cyclic Transformation for Days of the Week:** A week contains exactly seven days, so when applying the sine and cosine functions to `day_of_week`, we use seven as the cycle's period.

- **Cyclic Transformation for Day of the Month:** Next, we consider that an average month lasts about 30 days. Hence, when performing the cyclic transformation for the `day` variable, we use 30 as the cycle's period. (The `month` variable itself, shown earlier, uses a period of 12.)

- **Cyclic Transformation for Week of the Year:** Finally, a year is composed of approximately 52 weeks. Thus, when applying the cyclic transformation for the week of the year variable, we will use the number 52 as the cycle's period.

However, it's important to note that this cyclic transformation should not be applied to the year variable, as it is considered linear and not cyclic, unless there's a mechanism for time reversal.


## **Rossmann Project Status**
In this report on the sales projection project we conducted using the **`CRISP-DM`** methodology, we present the main steps:

1. **Business Problem Definition**: The primary focus was to understand the needs of store managers, who were seeking a sales forecast for the upcoming six weeks. This definition is crucial to establish the project's objectives and parameters.

2. **Business Understanding**: We investigated the motivation behind the request and identified that the goal was to set a budget for store renovations. The allocated amount depends directly on the projected revenue, making the sales projection a cornerstone in the renovation planning.

3. **Data Collection**: With the challenge clarified, we collected data from the `Kaggle` platform. However, data acquisition can stem from multiple sources, such as databases or `APIs`.

4. **Data Cleaning**: After collection, we conducted a thorough review of the data. This involved gaining an overall perspective, generating new `features` from raw data, and filtering variables based on business rules.

5. **Data Exploration**: With the data prepared, we focused on analyzing correlations and forming hypotheses to understand how the collected information could assist in sales forecasting.

6. **Data Modeling**: We prepared the data for a machine learning model, selecting the most impactful `variables` for sales forecasting.

---
# **6.0 Feature Selection**

## **6.1 Why Select Features?**
Feature selection is a crucial phase in data science. This practice dates back to the **Occam's Razor principle**, which suggests that the simplest version of a phenomenon is often the most appropriate.

To illustrate: when describing a car, we might mention elements like four wheels, two headlights, and a license plate. This set of features gives us a basic representation of the car.

Some might add attributes like Bluetooth, Wi-Fi, automatic transmission, and more. While more detailed, this model doesn't cover all cars, making it less universal.

According to Occam's Razor, **the simplest model is often the most effective**, as it generalizes the studied phenomenon better. This philosophy guides variable selection in data science.

In the context of machine learning, **the goal is to enhance learning by adopting leaner models**. The simplicity of the dataset is reflected in the number of `variables` or `features` present. Some of these might be repetitive or collinear, meaning they portray the same aspect of the phenomenon.

**Eliminating collinear variables makes the model more agile**. Using our car example, "four wheels" and "four tires" depict the same thing. Similarly, "taillights" and "headlights" are equivalent in essence.

Thus, feature selection focuses on detecting and discarding collinear variables, creating a more concise model and thereby facilitating its assimilation by algorithms. Various techniques for variable selection will be explored in subsequent phases of this project.

### **6.1.2 Feature Selection Methods**

The act of choosing the right attributes is central to data modeling. Proper selection can enhance a model's accuracy and efficiency. Here are three main selection methods:

- **`Univariate Selection (Filter Methods)`:** Evaluates the relationship between each individual `variable` and the target variable. Those with **strong correlation are retained**, while those with low correlation are discarded.

- **`Feature Importance (Embedded Methods)`:** Ranks `variables` by their statistical importance in the dataset. Essentially, the more a `variable` influences the outcome, the more important it is.

- **`Subset Selection (Wrapper Methods)`:** Examines **combinations of variables to determine sets that optimize performance**. Useful when variables are interrelated.

### **Understanding Correlation**
**Correlation quantifies the relationship** between two variables. With values between -1 and 1, it indicates the strength and direction of the relationship. Close to 1 suggests positive correlation; -1 implies negative correlation, and values close to zero denote weak relationship.

For numeric variables, we use the **correlation coefficient**. For categorical variables, we employ other methods:

- **Continuous variables:** Pearson correlation.
- **Categorical variables:** `Cramér's V` or `Chi-Square`.
- **Categorical predictor and continuous target:** Analysis of Variance `(ANOVA)`.
- **Continuous predictor and categorical target:** Linear Discriminant Analysis `(LDA)`.
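
As an illustration of the first two cases, the sketch below computes a Pearson correlation with pandas and a Cramér's V from a contingency table. The toy data are hypothetical, and `cramers_v` is a helper written for this example (without bias correction), not a function from a library.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V between two categorical series (no bias correction)."""
    observed = pd.crosstab(x, y).to_numpy()
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Hypothetical toy data
df = pd.DataFrame({
    'price': [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    'sales': [200.0, 190.0, 170.0, 160.0, 150.0, 120.0],
    'store_type': ['a', 'a', 'b', 'b', 'c', 'c'],
    'assortment': ['basic', 'basic', 'extra', 'extra', 'extended', 'extended'],
})

# Continuous vs continuous: Pearson correlation
pearson = df['price'].corr(df['sales'])

# Categorical vs categorical: Cramér's V (ranges from 0 to 1)
v = cramers_v(df['store_type'], df['assortment'])

print(pearson, v)
```

In this toy data the price and sales move in opposite directions, so the Pearson coefficient is strongly negative, and each store type maps to exactly one assortment, so Cramér's V reaches its maximum of 1.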

### **Univariate Selection Method**
In this method, each `variable` is individually evaluated against the target variable. It's straightforward and quick. However, its drawback is not considering interactions between `variables`. Hence, there's a risk of discarding `variables` that could be relevant when combined with others.

When choosing selection methods, it's essential to weigh the dataset's characteristics and project objectives. The balance between simplicity and accuracy is key to a robust model.

---
### **6.1.3 Variable Selection by Importance Method**  

**Variable selection** is fundamental in data science. The importance method, also known as the *embedded method*, is distinct as it doesn't solely focus on the correlation between variables. For example, the `Random Forest` algorithm selects variables during its learning process, making this selection an integral part of the process. Similarly, regularized regression algorithms like `Lasso` and `Ridge` assign weights to variables, indicating their relevance.

When we talk about variable selection using the importance method, we're referring to a process where we identify which variables are most relevant for a model. This approach differs from univariate selection, which analyzes the correlation between variables in isolation.

Imagine you're building a model like Random Forest. During construction, the model is already naturally selecting the most important variables to arrive at a result. Thus, at the end, we can know which variables the model considered most relevant. Therefore, this technique is like the model "telling us" which variables are most important to it.

---
### **6.1.4 Variable Selection with Gini Impurity**  

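The `Random Forest` algorithm can rank variables using Gini impurity, a measure of how mixed a subset of data is: the lower the impurity, the more homogeneous the subset. Variables whose splits produce homogeneous subsets are considered important. For regression trees, the analogous criterion is the reduction in variance (squared error).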

The `Random Forest` has a specific way of choosing variables based on Gini impurity. This criterion assesses how homogeneous a dataset is when divided by a variable.

Let's take an example: imagine we're analyzing sales and divide the data based on `price`. If, after the division, the high-price and low-price sales groups are very similar internally, the `price` is an important variable. The key here is that the Random Forest looks for variables that result in the most homogeneous subsets possible.
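
For a classification setting, Gini impurity is one minus the sum of the squared class proportions. The sketch below applies that formula to two candidate splits; the labels and the helper names `gini_impurity` and `weighted_gini` are hypothetical, introduced only for this illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split, weighted by the size of each side."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Hypothetical labels: whether a sale was 'high' or 'low'
parent = ['high', 'high', 'high', 'low', 'low', 'low']

# A split that separates the classes perfectly
good_split = (['high', 'high', 'high'], ['low', 'low', 'low'])

# A split that leaves both sides mixed
poor_split = (['high', 'low', 'high'], ['low', 'high', 'low'])

print(gini_impurity(parent))
print(weighted_gini(*good_split))
print(weighted_gini(*poor_split))
```

The variable behind the first split would be rated as more important, because it reduces the impurity of the parent group far more than the second one does.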

---
### **6.1.5 Variable Selection with Lasso Regression**  

Lasso regression penalizes model coefficients, causing some to become zero. This simplifies the model, highlighting variables with coefficients significantly different from zero as more relevant.

Lasso regression is like a "filter" for variables. When creating a model, this method applies a penalty to variable coefficients. If the penalty is strong enough, some coefficients can become zero, indicating that these variables are not important.

In practice, when applying Lasso regression, we end up with a list of variables with their respective weights. The variables with weights far from zero are the most relevant.
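
The effect can be seen on a small synthetic example. In the sketch below only the first feature drives the target; the data are randomly generated and the penalty strength `alpha=1.0` is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first feature actually drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# alpha controls the strength of the L1 penalty (value chosen for illustration)
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Coefficients of irrelevant features are shrunk to exactly zero
for name, coef in zip(['x0', 'x1', 'x2'], lasso.coef_):
    print(f'{name}: {coef:.3f}')
```

Note that the penalty also shrinks the coefficient of the relevant feature, so the weights are best read as a ranking of relevance rather than as unbiased effect sizes.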

---
### **6.1.6 Subset Selection (Wrapper Method)**  

Different from methods relying on correlation or model-assigned importance, **subset selection** evaluates combinations of variables in relation to model performance. The process involves adding variables, evaluating performance, and removing non-informative ones until the ideal subset is obtained.

Imagine having to test all possible combinations of variables to find the best set for your model. That's what subset selection does. It evaluates each combination, trains the model, and checks performance.

The standout feature of this method is that it's not solely based on correlation between variables or on the importance given by a model. It directly focuses on performance.
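
Testing every possible combination quickly becomes too expensive, so a common compromise is a greedy search. The sketch below shows forward selection with a plain least-squares model and a hold-out set. It is a simplified illustration on synthetic data: the function names and the `min_gain` threshold are assumptions made for this example.

```python
import numpy as np

def validation_error(X_train, y_train, X_val, y_val, columns):
    """Fit ordinary least squares on the chosen columns; return validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, columns]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, columns]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.mean((A_val @ coef - y_val) ** 2)

def forward_selection(X_train, y_train, X_val, y_val, min_gain=0.01):
    """Greedily add the feature that most reduces the validation error."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    best_error = np.mean((y_val - y_train.mean()) ** 2)  # baseline: predict the mean
    while remaining:
        errors = {c: validation_error(X_train, y_train, X_val, y_val, selected + [c])
                  for c in remaining}
        candidate = min(errors, key=errors.get)
        if best_error - errors[candidate] < min_gain:
            break  # no remaining feature improves the model enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_error = errors[candidate]
    return selected

# Synthetic data: features 0 and 2 are informative, 1 and 3 are noise
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=400)

selected = forward_selection(X[:300], y[:300], X[300:], y[300:])
print(selected)
```

The search stops once adding another feature no longer improves the validation error by at least `min_gain`, which is how the noise features end up excluded.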

---
### **6.2 Variable Selection with Boruta Algorithm**  

`Boruta` stands out for using `shadow variables`, copies of original variables with shuffled data. The idea is to compare the importance of original variables with the shadows, identifying the most relevant ones through statistical analysis.

`Boruta` is like a detective searching for important variables. It uses Random Forest and creates copies (or shadows) of the original variables but with shuffled data. Then, it compares the importance of the original variables with these shuffled copies.

If an original variable is consistently more important than its shuffled copies, it's considered relevant. Thus, `Boruta` provides us with a list of the most significant variables.
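
The shadow-variable idea can be sketched in a single round, as below. This is a simplified illustration on synthetic data, not the `BorutaPy` implementation: Boruta repeats the comparison over many iterations and applies a statistical test to the accumulated hits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: feature 0 is informative, feature 1 is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

# Shadow features: each column shuffled independently, breaking any link to y
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

# Train a forest on the real and shadow features together
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_extended, y)

n_real = X.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_threshold = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_importance > shadow_threshold
print(hits)
```

A noise feature may occasionally beat the shadows by chance in a single round, which is why the repeated comparison matters.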

In [None]:
# Defining a list of columns to be removed
cols_drop = ['week_of_year', 'day', 'month', 'day_of_week', 'promo_since', 'competition_since', 'year_week']

# Removing the specified columns from the DataFrame df6
df6.drop(columns=cols_drop, inplace=True)

By passing the list to the `columns` argument of the `drop` method (equivalent to using `axis=1`), we specify the removal of entire **columns**. This means that the complete column, along with all its records, will be removed.

The main focus in preparing the DataFrame is to eliminate **original variables** that generated new features. This ensures **optimization and data integrity** when applying the **Boruta** algorithm.

### **Dataset Splitting**
This step is crucial. Here, we split our dataset into **training and test sets**. The purpose? To train the model with a portion of the data and then use the test set to evaluate its accuracy.

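Since we are dealing with a time series problem, we have to pay special attention to the **temporal order**. It's not appropriate to randomly select rows for training and testing. Mixing future data with past data causes **data leakage**: the model sees information from the future during training, so its test performance looks better than it would be in production. The effect resembles **overfitting**, when the model memorizes the data instead of actually learning from it.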

Given our context of predicting six weeks of daily sales per `store`, it's ideal for the **last six weeks** of sales to be the test set, and the rest to comprise the training set.

A useful practice is to **group the data** by `store` and identify the oldest and newest sales dates. This clarifies the time period we are working with.

In [None]:
df6[['store', 'date']].groupby('store').max().reset_index()['date'][0] - datetime.timedelta(days = 6*7)

In [None]:
# Group the DataFrame by store and get the maximum date for each
max_dates_by_store = df6.groupby('store')['date'].max()

# Get the maximum date from the first store
max_date_first_store = max_dates_by_store.iloc[0]

# Subtract 42 days (6 weeks) from the maximum date of the first store
result_date = max_date_first_store - datetime.timedelta(days=6*7)

### **Data Splitting Strategy**

To split the data into **training and test sets**, we identify the **cutoff date** for training. This is done by subtracting six weeks from the maximum sales date using the `timedelta` class from the `datetime` module.

For example, considering the maximum date as 07/31/2015: when we subtract 42 days (six weeks), we arrive at the date 06/19/2015. Thus, the data from the beginning up to 06/18/2015 is designated for **training**, and from 06/19/2015 to the maximum date is designated for **testing**.
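
The date arithmetic above can be verified with a small standalone sketch. The `df_dates` DataFrame is hypothetical, and the cutoff here is taken from the most recent date in the whole dataset rather than from the first store, which is an assumption made for this example.

```python
import datetime
import pandas as pd

# Hypothetical sales dates for two stores
df_dates = pd.DataFrame({
    'store': [1, 1, 2, 2],
    'date': pd.to_datetime(['2013-01-01', '2015-07-31', '2013-01-02', '2015-07-31']),
})

# Oldest and newest sales date per store
date_range_by_store = df_dates.groupby('store')['date'].agg(['min', 'max'])
print(date_range_by_store)

# Cutoff: six weeks before the most recent date in the whole dataset
cutoff_date = df_dates['date'].max() - datetime.timedelta(weeks=6)
print(cutoff_date)

# Rows before the cutoff go to training; the cutoff onwards goes to testing
train = df_dates[df_dates['date'] < cutoff_date]
test = df_dates[df_dates['date'] >= cutoff_date]
```

Deriving the cutoff from the data, instead of typing the date by hand, keeps the split correct if the dataset is later refreshed with newer sales.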

In [None]:
# Defining the cutoff date to split the training and test sets
cutoff_date = '2015-06-19'

# Creating the training set from entries prior to the cutoff date
X_train = df6[df6['date'] < cutoff_date]
y_train = X_train['sales']

# Creating the test set from entries starting from the cutoff date (inclusive)
X_test = df6[df6['date'] >=  cutoff_date]
y_test = X_test['sales']

### **Data Validation**

After splitting the data, it's crucial to **validate** the division. We print the minimum and maximum dates of the `training` and `test` sets. The maximum date of the `training` set should be exactly one day before the minimum date of the `test` set. Additionally, the last sales date should match the maximum date of the `test` set. These checks ensure a **continuous sequence of dates**, respecting the temporal nature of the data.

In [None]:
# Displaying the minimum date of the training set
print('Minimum date of the training set: {}'.format(X_train['date'].min()))

# Displaying the maximum date of the training set
print('Maximum date of the training set: {}'.format(X_train['date'].max()))

# Displaying the minimum date of the test set
print('\nMinimum date of the test set: {}'.format(X_test['date'].min()))

# Displaying the maximum date of the test set
print('Maximum date of the test set: {}'.format(X_test['date'].max()))

---

## **6.3 Boruta as a Feature Selector**

Following the data split for training and testing, the focus now shifts towards optimizing our model by selecting the most relevant features using **Boruta**.

The `BorutaPy` class in Python is designed for this purpose. Here are its key aspects:

1. **Model Choice**: One of its crucial arguments is the model to be used. Here, we opt for the `RandomForestRegressor`.

2. **Number of Estimators**: To achieve the best performance, we allow Boruta to choose the ideal number of trees by setting this to `'auto'`.

3. **Verbosity Log**: We set this log to 2. This provides us with the opportunity to monitor the progress of the algorithm, which can be a time-consuming process.

4. **Random State**: Consistency is crucial. Hence, we set the random state to **42**. This ensures that the selection process is reproducible across different runs.

Before using the `'fit'` method, it's essential to format the data correctly. Boruta **doesn't work with dataframes**; it requires a numerical array.

Here's a brief overview of the process:

- We begin with the necessary imports.
- We instantiate `RandomForestRegressor` and `BorutaPy`.
- We prepare the data: by removing the `'date'` and `'sales'` columns from `X_train` and transforming the remaining into a numerical array. The target array, `y_train`, undergoes a similar process.
- With the data ready, we proceed to train Boruta.



In [None]:
# Importing the necessary libraries for feature selection and modeling
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# Initializing the RandomForestRegressor. Parallelization (n_jobs=-1) allows the algorithm to use all available CPU cores.
rf = RandomForestRegressor(n_jobs=-1)

# Initializing Boruta. This is a feature selection algorithm that works with RandomForest.
# The 'auto' parameter for n_estimators allows Boruta to automatically decide the number of trees to be used.
boruta = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42)

# Preparing the data to be used with Boruta. We are removing columns that are not needed for feature selection.
X_train_n = X_train.drop(['date', 'sales'], axis=1).values
y_train_n = y_train.values.ravel()

# Applying Boruta to the data for feature selection.
boruta.fit(X_train_n, y_train_n)

### **Understanding Boruta's Results**
Analyzing the results of the Boruta algorithm for feature selection, we observe that the tool generates various pieces of information during its execution. These are organized into different categories, namely `Iteration`, `Confirmed`, `Tentative`, and `Rejected`.

- `Iteration` shows the current iteration out of a maximum of 100, which is Boruta's default limit. In our project, Boruta needed only nine iterations to decide on every variable. If it reaches the 100th iteration with variables still undecided, the algorithm stops.

- `Confirmed` indicates the number of variables that Boruta confirmed as relevant to the model. In this case, the algorithm confirmed `18 variables as important` or relevant.

- `Tentative` refers to the variables about which Boruta has doubts. At this stage, the algorithm couldn't classify these variables as irrelevant or relevant.

- `Rejected` are the variables that Boruta rejected and deemed not relevant to the model. The algorithm rejected nine variables in our project.

Therefore, by using the Boruta algorithm, we were able to identify and select the most relevant variables for our model. We were also able to discard variables that do not significantly contribute to data modeling, thus improving the efficiency and accuracy of our model.

In [None]:
from numpy import setdiff1d

# Identify columns approved by Boruta
cols_selected = boruta.support_.tolist()

# Remove the 'date' and 'sales' columns and store in X_train_fs
X_train_fs = X_train.drop(['date', 'sales'], axis=1)

# Determine the names of columns approved by Boruta
cols_selected_boruta = X_train_fs.columns[cols_selected].to_list()

# Identify the columns that were not approved by Boruta
cols_not_selected_boruta = list(setdiff1d(X_train_fs.columns, cols_selected_boruta))

### **Variable Selection with Boruta**

The Boruta algorithm assesses and ranks variables based on their **relevance**. We utilize Boruta's `support_` attribute to determine which variables were considered important, as it returns a boolean array for this purpose.

For a more straightforward analysis, we create the `cols_selected` list. This list stores one boolean flag per column (`True` for the columns validated by Boruta), and for practical visualization, we use it as a mask to retrieve the respective column names. It's worth noting that initially, our dataset contains the `date` and `sales` columns, which need to be excluded. Therefore, we generate a subset named `X_train_fs`, derived from the original `X_train` but lacking these two columns.

By doing this, we establish a new list named `cols_selected_boruta`, containing only the variables confirmed by Boruta from `X_train_fs`. Simultaneously, it's insightful to identify the columns that did not receive approval from Boruta. To achieve this goal, we compare the columns of `X_train_fs` with `cols_selected_boruta` using the `setdiff1d` function from numpy. The difference between them is then assigned to the `cols_not_selected_boruta` variable.

In summary, through Boruta, we determine **which variables** the algorithm regards as pivotal.


In [None]:
cols_selected_boruta

### **Columns Discarded by Boruta**

The Boruta algorithm assesses the relevance of each variable and can discard those that do not contribute significantly to the modeling. In this context, **some variables have been classified as irrelevant** for the model and subsequently discarded. By being aware of the discarded columns, we can focus our attention on variables that indeed have higher predictive potential.


In [None]:
cols_not_selected_boruta

### **Concluding Feature Selection with Boruta**

After the feature selection step using the Boruta algorithm, it's essential to reflect on the insights gained. At this point, it's beneficial to revisit our previous **exploratory data analysis**, where we addressed various hypotheses and built a framework to differentiate which were validated, which were discarded, and the significance of each in the model.

The goal now is to align these observations with what Boruta has identified as relevant. This alignment helps us solidify the results and corroborate them with our initial investigations.

For example, upon reviewing high-relevance hypotheses, we might observe that each is tied to a specific attribute. Hypothesis 9, suggesting that store sales would increase over the years, interestingly, wasn't seen as a key factor by Boruta. This is a moment to weigh and decide whether to accept the algorithm's indication or not. A common approach is to initially follow Boruta's recommendations and, in later phases, reintroduce features that might optimize the model.

Similarly, hypothesis 10 related to `Month` posits that sales would be more substantial in the second half of the year. While Boruta assessed `month_cos` as important and `month_sin` as secondary, it's up to us to determine if both attributes should be considered in the analysis.

Hypothesis 11, regarding the `Day`, suggests more robust sales after the tenth day of each month. Here, Boruta validated both `day_sin` and `day_cos`, reinforcing our initial assumption.
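
To illustrate why the sine and cosine components are usually kept as a pair, the sketch below assumes the common cyclical encoding `sin(2π·month/12)` and `cos(2π·month/12)`; the exact encoding used earlier in the notebook is not shown in this section, so treat the formula as an assumption.

```python
import numpy as np

months = np.arange(1, 13)

# Assumed encoding: angle of each month on a 12-step circle
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

# February (2) and April (4) share the same sine value...
feb, apr = 1, 3  # zero-based positions of months 2 and 4
print(np.isclose(month_sin[feb], month_sin[apr]))  # True
# ...but their cosine values differ, so only the pair identifies the month
print(np.isclose(month_cos[feb], month_cos[apr]))  # False

# All 12 (sin, cos) pairs are distinct
pairs = {(round(s, 6), round(c, 6)) for s, c in zip(month_sin, month_cos)}
print(len(pairs))  # 12
```

Under this encoding, dropping one of the two components makes some months indistinguishable, which is worth keeping in mind when Boruta ranks `month_cos` and `month_sin` differently.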

In conclusion, integrating and reviewing the guidance from Boruta with the hypotheses from our exploratory analysis is crucial to ensure a robust and well-founded modeling approach.


---
## **6.4 Manual Feature Selection**

In [None]:
# List of columns selected by the Boruta method
cols_selected_boruta = [
    'store',
    'promo',
    'store_type',
    'competition_distance',
    'competition_open_since_month',
    'competition_open_since_year',
    'promo2',
    'promo2_since_week',
    'promo2_since_year',
    'competition_time_month',
    'promo_time_week',
    'day_of_week_sin',
    'day_of_week_cos',
    'month_sin',
    'month_cos',
    'day_sin',
    'day_cos',
    'week_of_year_sin',
    'week_of_year_cos'
]

# Additional columns to be included
feat_to_add = ['date', 'sales']

# Create a copy of the original list of columns and add the new columns to it
cols_selected_boruta_full = cols_selected_boruta.copy()
cols_selected_boruta_full.extend(feat_to_add)

# Definition of columns for the training and test sets using the columns selected by Boruta
x_train = X_train[cols_selected_boruta]
x_test = X_test[cols_selected_boruta]

# Data preparation for time series using all columns (including additional ones)
x_training = X_train[cols_selected_boruta_full]

---
# **7. Machine Learning**
Data science is heavily propelled by **machine learning algorithms**. These algorithms empower machines with the ability to learn from existing data, akin to how humans make associations.

However, the sophistication of machine learning doesn't overshadow the relevance of basic methods, such as calculating the mean, an essential tool in various business scenarios that provides an overview of data distribution.

Within this realm of learning, there are specific tasks. **Classification** is one of them, teaching algorithms to differentiate, for instance, between images of a car, a van, and a bus based on specific features. For optimal performance, a diverse range of examples in training is crucial.

Yet, it's not limited to classification. There are tasks like **regression**, focused on numerical predictions, and **time series analysis**, which forecasts future trends based on past records.

On the opposite end of the spectrum from these supervised tasks, we encounter **unsupervised learning**, where the algorithm operates without pre-established labels, identifying patterns and relationships on its own. And nestled between these extremes is **semi-supervised learning**, which combines features of both. An everyday example is how Netflix selects thumbnails: they are initially tested with users, and based on feedback, the most effective ones are prioritized.

In summary, machine learning is a versatile and adaptive tool, bringing immense value to the field of data science, regardless of the method employed.

---
## **Machine Learning Algorithms**
**Average Model**: In the context of machine learning, we often underestimate simplicity, but it can be incredibly powerful. Consider the **average model**, for instance. It predicts, as the name suggests, the average of the input data. Imagine you want to predict future sales in your store. This model would project the average of past sales as your forecast. It might seem basic, but its real value lies in serving as a benchmark for other models. If another model outperforms the average model, it signifies genuine learning. Otherwise, the simple average proves effective.

**Linear Regression and Regularized Linear Regression**: Moving forward, we have the **linear regression models** and their regularized counterparts. Both take a more sophisticated approach than the average model, attempting to establish a direct relationship between input and output data. The regularized version often performs better due to its ability to penalize less significant features, promoting generalization. I opt for these models initially due to the principle of **Occam's Razor**, which advocates simpler models whenever they are appropriate.

**Random Forest and Gradient Boosting**: Next, we have two tree-based approaches: **Random Forest** and **Gradient Boosting**. Both have gained popularity due to their robust performance, especially in competitions like Kaggle. Being adaptable and often superior in terms of performance, they are an essential choice in our toolkit.

Now, you might be curious about more modern approaches like **neural networks**, **Deep Learning**, **CNN**, and **LSTM**. While extremely powerful, these models are more complex and time-consuming to implement. We're in an agile development cycle, prioritizing quick and efficient results. In a real-world scenario, we'd already be several weeks into this project, and the pressure for tangible results would be mounting.

Certainly, as the project evolves, these advanced models will be considered and possibly incorporated to further optimize our predictions.

---
## **Selecting Relevant Variables**

In the process of building a machine learning model, **selecting the right variables** is a crucial step. Using **Boruta** as our selection tool, we've identified a set of relevant variables that have the potential to improve the accuracy of our model.

We'll start our implementation by focusing on the training set. If you recall, we had `X_train` and `X_test` already separated. The strategy now is to extract the columns that Boruta highlighted as most important, identified as `cols_selected_boruta`. This selection will be stored in a variable called `x_train`, intentionally using lowercase "x" to differentiate it from `X_train`.

Similarly, for the test set, we'll filter `X_test` using the columns selected by Boruta. The result will be stored in `x_test`.

One thing to note: initially, we decided to exclude the "date" and "sales" columns at the training stage. However, there's a plan to reintroduce these variables later, ensuring a broader analysis and potentially better insights.

Our dataset now reflects a combination of Boruta's selection power with our experience and intuition. With this step completed, we're ready to move forward.

The cell below applies a manual adjustment to Boruta's list, removing the month and week-of-year encodings before rebuilding `x_train` and `x_test`:

In [None]:
cols_to_remove = ['month_sin', 'month_cos', 'week_of_year_sin', 'week_of_year_cos']
cols_selected_boruta = [col for col in cols_selected_boruta if col not in cols_to_remove]

# Select the columns identified as relevant by Boruta for training
x_train = X_train[cols_selected_boruta]
x_test = X_test[cols_selected_boruta]

# For time series cross-validation, keep a version that also includes 'date' and 'sales'.
# Note: cols_selected_boruta_full was built before the removal above, so it still
# contains the four columns dropped from cols_selected_boruta.
x_training = X_train[cols_selected_boruta_full]

---
## **7.1 Average Model**

The **Average Model** is a fundamental technique when dealing with predictions, as it serves as a starting point to evaluate the performance of more complex models.

To develop our Average Model, the prediction for each store is simply that store's average sales. First, we create a copy of the test set `x_test` and store it in `aux1`. In this dataset, we introduce a column named `sales` that represents the actual sales, i.e., our target variable.

The next step involves calculating the average sales for each store. For this task, we group the data by `store` and apply a mean operation on the `sales` column. This operation yields a result, stored in `aux2`.

The integration of the calculated averages back into the original dataset `aux1` is achieved through the pandas merging function, using 'store' as the primary key. This step is crucial as it allows us to directly compare the projected average sales to the actual sales for each store.

With the actual sales and their corresponding averages in hand, we are ready to evaluate the model's performance. To do so, we employ a function called `ml_error`. This function, when given the model name, actual sales, and predictions (calculated average sales), provides three error metrics:
- **Mean Absolute Error (MAE)**: Represents the average of the absolute differences between the projections and the actual values.
- **Mean Absolute Percentage Error (MAPE)**: Represents the average of the absolute percentage differences between the predictions and the actual values.
- **Root Mean Squared Error (RMSE)**: Corresponds to the square root of the average of the squared errors.
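
As a worked example of the three metrics above, the following sketch uses toy values chosen for easy arithmetic; they are not taken from the Rossmann data.

```python
import numpy as np

# Toy values chosen for easy arithmetic, not taken from the Rossmann data
y = np.array([100.0, 200.0, 400.0])
yhat = np.array([110.0, 180.0, 400.0])

mae = np.mean(np.abs(y - yhat))           # (10 + 20 + 0) / 3 = 10.0
mape = np.mean(np.abs((y - yhat) / y))    # (0.10 + 0.10 + 0) / 3, about 0.067
rmse = np.sqrt(np.mean((y - yhat) ** 2))  # sqrt((100 + 400 + 0) / 3), about 12.91

print(mae, round(mape, 3), round(rmse, 2))
```

Here MAPE is expressed as a fraction (0.067 corresponds to 6.7%), and RMSE is larger than MAE because squaring gives more weight to the largest error.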

An additional and crucial step is to reverse the logarithmic transformation previously applied to the response variable, so that the errors are expressed in the original sales scale. We use `np.expm1` for this purpose, which computes `exp(x) - 1`.
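
A minimal sketch of this round trip, assuming the response was transformed with `np.log1p` (which the notebook's use of `np.expm1` suggests) and using illustrative sales values:

```python
import numpy as np

sales = np.array([0.0, 1500.0, 6000.0])  # illustrative values only

log_sales = np.log1p(sales)      # log(1 + x), defined even when sales are 0
recovered = np.expm1(log_sales)  # exp(x) - 1, the inverse of log1p

print(np.allclose(recovered, sales))  # True
```

Computing the metrics after this reversal keeps MAE and RMSE in sales units rather than in log units.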

In the end, `ml_error` generates a pandas DataFrame that consolidates the model name and the obtained error metrics.

In [None]:
# Importing necessary libraries
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
import pandas as pd

# Defining the mean absolute percentage error (MAPE) function
def mean_absolute_percentage_error(y, yhat):
    """Calculates the mean absolute percentage error between actual and predicted values."""
    return np.mean(np.abs((y - yhat) / y))

def ml_error(model_name, y, yhat):
    """Calculates and returns error metrics for model evaluation."""
    mae = mean_absolute_error(y, yhat)
    mape = mean_absolute_percentage_error(y, yhat)
    rmse = np.sqrt(mean_squared_error(y, yhat))

    return pd.DataFrame({
        'Model Name': model_name,
        'MAE': mae,
        'MAPE': mape,
        'RMSE': rmse
    }, index=[0])

# Data preparation for evaluation

# Creating a copy of the test data
aux1 = x_test.copy()
# Inserting actual sales into the auxiliary dataset
aux1['sales'] = y_test.copy()

# Calculating the average sales per store and storing in aux2
aux2 = aux1[['store', 'sales']].groupby('store').mean().reset_index().rename(columns={'sales': 'predictions'})

# Integrating the sales predictions (averages) into the auxiliary dataset
aux1 = pd.merge(aux1, aux2, how='left', on='store')
yhat_baseline = aux1['predictions']

# Evaluating model performance
# Reversing the logarithmic transformation to obtain original values
baseline_result = ml_error('Average Model', np.expm1(y_test), np.expm1(yhat_baseline))
baseline_result


### **Results Analysis**

Upon evaluating the model, we have identified some key metrics for its performance. The `MAE` metric indicated an average absolute error of **R$ 1,354** in our predictions. Additionally, the **Mean Absolute Percentage Error** (`MAPE`) was **45%**, suggesting that, on average, the predictions deviate by 45% from the actual values. Lastly, the `RMSE`, which gives us an idea of the magnitude of errors, stood at **1,835**. This metric is particularly useful to understand the dispersion of errors in our predictions.

## **7.2 Linear Regression Model**

In the scope of our project, we aim to experiment with various machine learning models to determine which one best suits our dataset. Among them, we introduce the Linear Regression Model. This supervised algorithm aims to predict an output based on independent variables. It stands out in situations that require quantitative predictions, such as sales forecasting or estimating progress trends.

The implementation of this model unfolds in three critical phases:

**1. Training**: Here, we use the `fit` method to instruct the model with our training data (`x_train`, `y_train`). The code snippet associated with this step is:

In [None]:
# Importing the Linear Regression model from scikit-learn
from sklearn.linear_model import LinearRegression

# Creating an instance of the Linear Regression model and training it with the training set (x_train and y_train)
lr = LinearRegression().fit(x_train, y_train)

After this step, the trained model is stored in the variable `lr`.

**2. Prediction**: With the model properly trained, we move on to making predictions using the test data (`x_test`), through the `predict` method. The projections are then stored in the variable `yhat_lr`:


In [None]:
# Making predictions using the trained Linear Regression model
yhat_lr = lr.predict(x_test)

**3. Performance Evaluation**: Next, we assess the effectiveness of our model using the `ml_error` function, which compares the actual values `y_test` with the predictions `yhat_lr`. If the data has undergone transformation, it's crucial to revert them to their original scale using the `np.expm1` function:

In [None]:
# Calculate the error of the Linear Regression model by reversing the logarithmic transformation.
lr_result = ml_error('Linear Regression', np.expm1(y_test), np.expm1(yhat_lr))
lr_result

The presented results showed that the `MAE` metric was 1867, the Mean Absolute Percentage Error `MAPE` reached 29%, and the `RMSE` was established at 2671.

### **Model Comparison**

When we compare the `Average Model` and the `Linear Regression Model` side by side, we notice that the `RMSE` of the former was 1835, while the latter achieved 2671. This discrepancy reveals that the Linear Regression Model made more errors in its projections.

**These observations lead us to two vital deductions:**

- **1. Efficiency of the Average Model**: Its lower `RMSE` indicates a more accurate performance in predicting the data compared to the Linear Regression Model.
- **2. Intricate Data Behavior**: The Linear Regression Model, by its nature, seeks linear patterns. If our data follows a nonlinear pattern, this could explain the lower accuracy of this model. Therefore, the data seems to be more multifaceted than simple linear relationships can capture.

In conclusion, it is essential for us to explore nonlinear models in future analyses, as these models might be more suitable to understand and replicate the intrinsic complexity of our dataset.

### **7.2.1 Linear Regression Model - Cross Validation**

In [None]:
# Performing Cross-Validation for Linear Regression with 5 folds
lr_result_cv = cross_validation(x_training, 5, 'Linear Regression', lr, verbose=False)
lr_result_cv

## **7.3 Regularized Linear Regression Model - Lasso**

The pursuit of improved performance in regression models often leads us to adopt advanced techniques, with **regularization** being one of the fundamental methods in this context. This technique aims to **prevent overfitting**, a phenomenon where the model overly adapts to the training data, compromising its generalization ability. This is achieved by constraining the weights associated with each feature.

Within regularization, we encounter the **Lasso** (Least Absolute Shrinkage and Selection Operator) method. This method has the unique feature of not only adjusting but also zeroing certain weights, selectively excluding some features that are not crucial for the model.
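
The zeroing behavior can be sketched on synthetic data. Everything below is an illustration under stated assumptions: the data is randomly generated, and `alpha=0.5` is deliberately larger than the notebook's 0.01 to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Only the first two features drive the target; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.round(ols.coef_, 3))    # small but nonzero weights on the noise features
print(np.round(lasso.coef_, 3))  # noise features shrunk to exactly zero
```

The informative coefficients are also shrunk toward zero by the penalty, which is the price paid for the built-in feature selection.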

To employ Lasso, we instantiate a corresponding object and set an `alpha` parameter with a value of 0.01. This parameter is of significant importance as it determines the level of regularization we want to impose on our dataset.

In [None]:
# Importing classes from the 'linear_model' module of 'sklearn'
from sklearn.linear_model import LinearRegression, Lasso

# Initializing and training a Lasso model with an 'alpha' of 0.01 using the training data
lrr = Lasso(alpha=0.01).fit(x_train, y_train)

After being trained, we use this model to make predictions on the test set:

In [None]:
# Making predictions using the trained Lasso model on the test data:
yhat_lrr = lrr.predict(x_test)

The next step involves evaluating the performance of this model. For this purpose, we rely on the `ml_error` function, a custom function that generates essential error metrics for an objective performance analysis.

The distinctive feature of Lasso lies in its ability to adjust the model's complexity through the `alpha` parameter. A larger `alpha` strengthens the penalty and drives more coefficients to zero; a smaller `alpha` weakens it, bringing the model closer to ordinary linear regression and possibly increasing training time. Because a lower value does not guarantee better accuracy, `alpha` is best chosen through validation.

In [None]:
# Calculating the error of the Lasso Regression model using the ml_error function and reversing the data transformation
lrr_result = ml_error('Linear Regression - Lasso', np.expm1(y_test), np.expm1(yhat_lrr))
lrr_result

### **Model Performance Evaluation**

After the analysis, we have arrived at certain key indicators that reflect the performance of our model. The **Mean Absolute Error (MAE)** showed a value of `1891`. In percentage terms, the **Mean Absolute Percentage Error (MAPE)** indicated a discrepancy of 29%. Additionally, the model exhibited a **Root Mean Squared Error (RMSE)** of `2744`. These results are crucial to understand the accuracy and effectiveness of our modeling in relation to the analyzed data.

### **Initial Models Evaluation**

We examined three models in this study: the `average` model, `linear regression`, and `regularized linear regression (Lasso)`. When evaluating the models based on root mean squared error (`RMSE`), we observed that the average model, despite its simplicity, had the lowest `RMSE` with a value of 1,835. In comparison, linear regression scored 2,671, while Lasso, which aims to prevent overfitting by penalizing coefficients, had an `RMSE` of 2,744. Despite the prominence of the average model, it's undeniable that considering non-linear models is important given the intrinsic complexity of the analyzed data. The conclusion here is clear: **simpler approaches are not always the most suitable for intricate contexts**. Therefore, our next step will be to investigate non-linear models.

### **Exploration of Non-Linear Algorithms**

Previously, we focused on linear models. Now, our attention turns to non-linear models, specifically the `Random Forest`. The reason for this choice lies in the fact that while linear models are straightforward, they often fail to capture the full patterns in the data. The `Random Forest`, being a more flexible model, is capable of recognizing and modeling these nuances. We are following the `CRISP-DM` methodology, starting with simpler models and increasing complexity as the project progresses.

---
## **7.4 Random Forest Regressor**

The `Random Forest` algorithm operates through multiple decision trees. During training, a "forest" of trees is formed that operate independently. When making predictions, especially in regression contexts, the output is determined by the average of predictions from all these trees. We implement the `Random Forest` using the `Scikit-learn` library, a robust Python tool for data modeling, and an instance of this model is established as `rf`.
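
The averaging step can be checked directly on a small synthetic problem. The data and the reduced number of trees below are assumptions made for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2  # synthetic non-linear target

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Predict with each fitted tree, then average across trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X)))  # True
```

Each tree is trained on a bootstrap sample, so the individual predictions differ; averaging them is what reduces the variance of the final estimate.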

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Instantiating and training a Random Forest model with 100 trees, using a single core and setting a seed for reproducibility.
rf = RandomForestRegressor(n_estimators=100,
                           n_jobs=1,
                           random_state=42).fit(x_train, y_train)

**Random Forest Model Configuration**

When instantiating the Random Forest model, we set key parameters. `n_estimators=100` determines that **100 decision trees** will be created in the forest. With `n_jobs=1`, training runs on **a single core**; setting `n_jobs=-1` instead would use **all available processors** to speed up training. To ensure **result consistency**, we use `random_state=42`.

After configuring the model, we train it with the `x_train` and `y_train` data using the `fit` method. When we need to make predictions on new data, the model is applied to the `x_test` dataset, resulting in estimates stored in `yhat_rf`.

In [None]:
# Performing prediction with the Random Forest model on the test set
yhat_rf = rf.predict(x_test)

### **Model Performance Evaluation**

To gauge the performance of our model, we compare the **actual sales** represented by `y_test` with the **predictions generated** by the model, `yhat_rf`. We utilize the `ml_error` function for this evaluation, and the obtained insights are stored in the variable `rf_result`.

In [None]:
# Evaluating the Performance of the Random Forest Model, reverting the logarithmic data transformation.
rf_result = ml_error('Random Forest Regressor', np.expm1(y_test), np.expm1(yhat_rf))

### **Conclusion of Random Forest Regressor Application**

We have concluded the implementation phase of the Random Forest Regressor, a pivotal milestone in our modeling journey. Throughout this phase, we have conducted **model training**, generated **predictions**, and meticulously evaluated **the algorithm's performance**.


## **7.5 XGBoost Regressor**

We now move forward with the implementation of the `XGBoost Regressor`, our fifth machine learning algorithm. The choice of **XGBoost** - short for "Extreme Gradient Boosting" - is motivated by its **speed and efficiency**, qualities that make it highly regarded in the realm of Data Science.

In [None]:
# Import the xgboost library
import xgboost as xgb

# Create and train an XGBRegressor model
model_xgb = xgb.XGBRegressor(
    objective='reg:squarederror',  # Optimization objective: squared error
    n_estimators=100,              # Number of boosting steps
    eta=0.001,                     # Learning rate
    max_depth=10,                  # Maximum depth of a tree
    subsample=0.7,                 # Fraction of samples to train each tree
    colsample_bytree=0.9           # Fraction of columns selected for each tree
).fit(x_train, y_train)            # Train the model using the x_train and y_train data

### **XGBoost Model Configuration**

When creating the XGBoost model, we define several parameters that guide its behavior:
- `objective='reg:squarederror'`: Sets the loss function to be minimized, in this case squared error, which is common in regression tasks.
- `n_estimators=100`: Specifies that 100 trees will be built in the boosting process.
- `eta=0.001`: Sets the learning rate, which scales the contribution of each tree.
- `max_depth=10`: Limits the depth of each tree, which helps control overfitting.
- `subsample=0.7`: Uses a random 70% of the training rows to build each tree, a form of randomization that helps reduce overfitting.
- `colsample_bytree=0.9`: Samples 90% of the features once for each tree, so every tree is built from a random subset of columns.
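
To get a feel for how `eta=0.001` interacts with `n_estimators=100`, here is an idealized back-of-the-envelope sketch. It assumes that every tree fits the current residuals perfectly and ignores regularization and subsampling, so it is a simplification rather than a description of what XGBoost actually computes.

```python
eta = 0.001
n_estimators = 100

# Idealized assumption: each round removes a fraction `eta` of the remaining residual
remaining = (1 - eta) ** n_estimators
print(round(remaining, 3))  # 0.905
```

Under this simplification, roughly 90% of the initial residual would still be unexplained after 100 rounds. A learning rate this small therefore tends to need many more trees; for reference, XGBoost's default learning rate is 0.3.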

With the model defined, we proceed to training using `x_train` and `y_train`. For future predictions, this model will be applied to the test data `x_test`, and the results stored in the variable `yhat_xgb`.

In [None]:
# prediction
yhat_xgb = model_xgb.predict(x_test)

### **XGBoost Model Evaluation**

To gauge the effectiveness of our XGBoost model, it's crucial to measure the discrepancy between the expected and predicted outcomes. Thus, we compare the **actual sales** represented by `y_test` with the **predictions** generated by the model, `yhat_xgb`. We use the `ml_error` function to perform this comparison, and the insights obtained are stored in the variable `xgb_result`.


In [None]:
# performance
xgb_result = ml_error('XGBoost Regressor', np.expm1(y_test), np.expm1(yhat_xgb))

### **Implementation of XGBoost Regressor**

In this phase, we've tackled the **XGBoost Regressor**, a renowned model in machine learning. We've undertaken three crucial steps: **model creation**, **prediction generation**, and, finally, **performance evaluation**. Each step is vital to ensure the efficiency and accuracy of the model in predictive tasks.

## **7.6 Model Performance Comparison**

Our main intention in this section is to analyze and contrast the performances of the different machine learning models we have implemented. The ultimate goal is to **identify the model with the lowest prediction error**.

To consolidate and compare these results, we use the `concat` function from Pandas, which combines multiple DataFrames into one, facilitating our comparative analysis.

In [None]:
# Using the Pandas 'concat' function to aggregate the results of different models
modelling_result = pd.concat([baseline_result, lr_result, lrr_result, rf_result, xgb_result])

### **Model Performance Comparison**

In the provided line of code, we have integrated the performances of various machine learning algorithms. These include the **baseline model**, **linear regression**, **Lasso regularized regression**, **Random Forest**, and **XGBoost**. This aggregation of results is stored in the variable `modelling_result`.

Subsequently, our objective is to rank the models based on their accuracy. We use the **Root Mean Squared Error (RMSE)** as the indicator. A lower RMSE indicates a more accurate model, as the predictions are closer to the actual values. To arrange the models by RMSE, we utilize the `sort_values` function.
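
The combination and ranking can be sketched with three one-row frames shaped like the output of `ml_error`. The RMSE values are the ones reported in this notebook; the variable names are illustrative.

```python
import pandas as pd

# One-row frames shaped like the output of ml_error; RMSE values as reported in this notebook
average = pd.DataFrame({'Model Name': 'Average Model', 'RMSE': 1835}, index=[0])
linear = pd.DataFrame({'Model Name': 'Linear Regression', 'RMSE': 2671}, index=[0])
forest = pd.DataFrame({'Model Name': 'Random Forest Regressor', 'RMSE': 1011}, index=[0])

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating index 0 for every model
combined = pd.concat([average, linear, forest], ignore_index=True)
ranked = combined.sort_values('RMSE').reset_index(drop=True)

print(ranked['Model Name'].tolist())
# ['Random Forest Regressor', 'Average Model', 'Linear Regression']
```

Without `ignore_index=True`, every row keeps index 0, which is harmless for sorting but can be confusing when selecting rows by label.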

In [None]:
# Sorts the model results based on Root Mean Squared Error (RMSE).
# A lower RMSE value indicates better model performance.
modelling_result.sort_values('RMSE')

### **Model Performance Evaluation**

In our analysis, we ranked the models by the **Root Mean Squared Error (RMSE)**. Because RMSE is an error metric, lower values signify **better performance**. Among the tested models, encompassing both linear and non-linear approaches, the `Random Forest` stood out with an RMSE of `1011`. Following that, we have the **XGBoost Regressor** with `1250` and the average model with `1835`. The analysis reveals that linear models, with RMSEs of `2671` and `2744`, might not be the best choices for complex tasks like sales prediction.

This suggests that the nature of the phenomenon under study - sales prediction - is inherently complex and non-linear. Therefore, models like `Random Forest` and `XGBoost` might be more suitable. However, it's important to keep in mind that the efficiency of these models could come at the cost of execution time. For example, by increasing the number of trees to `2500` or `3000`, we might face considerable time challenges in a production environment.

Lastly, a crucial point is that the obtained `RMSE` might not be an absolute representation of a model's actual performance. Due to data partitioning and the inherent volatility in sales, there's a possibility that the `RMSE` could offer an optimistic or pessimistic view of the model's effectiveness.

---
## **7.7 Cross-Validation Method (Measuring Actual Model Performance)**

In the previous phase, we **evaluated five models** using data from the last six weeks. However, the performance observed in this evaluation might not reflect the models' true capability. This is because performance can vary depending on the time period selected for testing.

For a more robust evaluation, we apply the **cross-validation** method. This procedure divides the dataset into subsets and repeatedly assesses the model's performance using different combinations of these subsets for training and validation. You can visualize this process as having a deck of 100 cards representing the dataset. In each step, we use 90 cards for training and 10 for validation, ensuring that all cards are used in both contexts. In the end, the average of the results determines the overall model performance. It's crucial to understand that, in cross-validation, we refer to the test data as "validation data."
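
The deck-of-cards picture can be written out as a short sketch. It illustrates standard shuffled K-fold splitting with 10 folds, not the time-aware variant used later in this notebook.

```python
import numpy as np

cards = np.arange(100)  # the "deck": 100 sample indices
rng = np.random.default_rng(0)
shuffled = rng.permutation(cards)

folds = np.array_split(shuffled, 10)  # 10 folds of 10 cards each

validated = []
for i, validation in enumerate(folds):
    # Every card outside the current fold is used for training
    training = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(training) == 90 and len(validation) == 10
    validated.extend(validation.tolist())

# Each card was used for validation exactly once
print(sorted(validated) == cards.tolist())  # True
```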

---
### **7.7.1 Time Series Cross-Validation**

When dealing with **time series** data, cross-validation requires a slightly different approach. The temporal sequence of the data should not be disregarded. Instead of randomly selecting data for training and validation, we maintain their chronological order. In each iteration, we expand the training set to incorporate data from the previous step and select a new period for validation.

For example, when predicting weekly sales over a 6-week period, the approach would be:

1. Train with data from the 1st week and validate with the 2nd week.
2. Train with data from the 1st and 2nd weeks and validate with the 3rd week.

And so on, always maintaining the time sequence and expanding the training set. This way, the model is continually trained with more information and evaluated under conditions that closely resemble real-world scenarios.
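
The expanding-window scheme above can be sketched as a small generator. The function name `expanding_window_splits` is hypothetical and only serves this illustration.

```python
def expanding_window_splits(n_periods, min_train=1):
    """Yield (train, validation) period lists with a growing training window."""
    periods = list(range(1, n_periods + 1))
    for end in range(min_train, n_periods):
        # Train on everything up to `end`, validate on the period right after it
        yield periods[:end], [periods[end]]

splits = list(expanding_window_splits(6))
for train_weeks, validation_week in splits:
    print(f"train on weeks {train_weeks} -> validate on week {validation_week}")
```

With six weeks this produces five splits, and in every split the validation week comes strictly after all training weeks, so no future information leaks into training.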

---
### **Preparing Data for Time Series Analysis**

When dealing with time series analyses, proper data structuring is a key aspect. The feature list `cols_selected_boruta` contains all the relevant predictors but lacks the `date` and `sales` columns, which are essential for splitting the data by time and for providing the target.

In this context, the first step is to copy that list into a new one, `cols_selected_boruta_full`, and extend it with `date` and `sales`. We then filter `X_train` with the extended list to obtain `x_training`, ensuring that the dataset's structure is complete and ready for the analyses that will follow.
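
The `.copy()` call in the cell matters because extending a list through a second name would also change the original. A minimal sketch with a shortened, illustrative list:

```python
selected = ['store', 'promo']  # shortened, illustrative list

alias = selected  # same list object under a second name
alias.extend(['date'])
print(selected)  # ['store', 'promo', 'date'] (the original changed too)

selected = ['store', 'promo']
full = selected.copy()  # independent copy
full.extend(['date', 'sales'])
print(selected)  # ['store', 'promo']
print(full)      # ['store', 'promo', 'date', 'sales']
```

Keeping the feature list untouched means it can still be used to build `x_train` and `x_test` without accidentally including the target.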

In [None]:
# Defining the columns to be added
feat_to_add = ['date', 'sales']

# Copying the list of columns selected by Boruta
cols_selected_boruta_full = cols_selected_boruta.copy()

# Extending the list with the columns 'date' and 'sales'
cols_selected_boruta_full.extend(feat_to_add)

# Filtering the training dataset with the selected columns
x_training = X_train[cols_selected_boruta_full]

# Displaying the first few rows of the updated training dataset
x_training.head()

---
### **K-Fold Cross Validation**

We have developed a function called `cross_validation` to perform K-Fold cross validation. This function takes five arguments: the training data (`x_training`), the number of folds (`kfold`), the model name (`model_name`), the model instance to be trained (`model`), and an optional `verbose` flag.

Within this function, we initialize lists to store the values of `MAE`, `MAPE`, and `RMSE`. The choice of these metrics is motivated by the need to robustly evaluate the performance of our model.

The mechanism is simple: we iterate in reverse, from `kfold` down to 1. For each iteration, a six-week validation window is defined from the dates, and all data before that window forms the training set, so the model learns from progressively longer histories. After training the model with the training data of each fold, we make predictions using the validation data.

These results, i.e., the performance in each cycle, are stored in the respective lists of `MAE`, `MAPE`, and `RMSE`. At the end of the function, we consolidate these results and return a DataFrame that presents the model name alongside the means and standard deviations of the mentioned metrics.

This approach provides us with a broader view of the model's performance, rather than relying on a single train-test split, ensuring a more reliable evaluation.

In [None]:
import datetime

# Defining the cross-validation function
def cross_validation(x_training, kfold, model_name, model, verbose=False):
    # Lists to store errors for each K-Fold cycle
    mae_list  = []
    mape_list = []
    rmse_list = []

    # Starting the loop in reverse order from kfold value to 1
    for k in reversed(range(1, kfold+1)):
        # Determining the date range for validation
        validation_start_date = x_training['date'].max() - datetime.timedelta(k*7*6)
        validation_end_date   = x_training['date'].max() - datetime.timedelta((k-1)*7*6)

        # Segmenting the data into training and validation sets based on dates
        training   = x_training[x_training['date'] < validation_start_date]
        validation = x_training[(x_training['date'] >= validation_start_date) & (x_training['date'] <= validation_end_date)]

        # Defining the training and validation sets for features and targets
        # Training
        xtraining = training.drop(['date', 'sales'], axis=1)
        ytraining = training['sales']

        # Validation
        xvalidation = validation.drop(['date', 'sales'], axis=1)
        yvalidation = validation['sales']

        # Training the model with the training data
        # (`model` is an estimator instance such as `lr`, so it is fitted directly rather than called)
        m = model.fit(xtraining, ytraining)

        # Making predictions with the validation data
        yhat = m.predict(xvalidation)

        # Calculating the error between predictions and actual values
        m_result = ml_error(model_name, np.expm1(yvalidation), np.expm1(yhat))

        # Storing the error of each metric for this iteration
        mae_list.append( m_result['MAE'])
        mape_list.append(m_result['MAPE'])
        rmse_list.append(m_result['RMSE'])

    # Calculating and returning the means and standard deviations of error metrics after all K-Fold cycles
    # index=[0] is required because every value in the dictionary is a scalar
    return pd.DataFrame({'Model Name': model_name,
                         'MAE CV'    : np.round(np.mean(mae_list),  2).astype(str) + '+/-' + np.round(np.std(mae_list),  2).astype(str),
                         'MAPE CV'   : np.round(np.mean(mape_list), 2).astype(str) + '+/-' + np.round(np.std(mape_list), 2).astype(str),
                         'RMSE CV'   : np.round(np.mean(rmse_list), 2).astype(str) + '+/-' + np.round(np.std(rmse_list), 2).astype(str)}, index=[0])
```

### **Explaining the Cross Validation Function**

When working with predictive models, effective data segmentation is crucial. The process of cross-validation assists in this segmentation, ensuring an accurate evaluation of model performance.

In this function, we adopt the strategy of **temporal validation**, where we define two essential dates: `validation_start_date` and `validation_end_date`. These determine the time interval that will be used for testing the model. The choice of these dates is influenced by the most recent period of the dataset. For example, if our dataset ends at date 'D' and we decide to use the last 6 weeks as the test period, `validation_end_date` would be 'D', while `validation_start_date` would be 'D' minus those 6 weeks.

Through this approach, we simulate a scenario where the model is trained on historical data and evaluated on its ability to predict events in the near future. This ensures that we assess the true predictive capability of the model, preparing it for real-world application scenarios.

In [None]:
'''
# start date for validation
validation_start_date = x_training['date'].max() - datetime.timedelta(k*7*6)

# end date for validation
validation_end_date = x_training['date'].max() - datetime.timedelta((k-1)*7*6)
'''

### **Data Segmentation into Training and Validation**

After establishing the essential dates, `validation_start_date` and `validation_end_date`, we proceed to divide our dataset. The training set will encompass all entries prior to `validation_start_date`, while the validation set will incorporate the information within the interval between `validation_start_date` and `validation_end_date`.

This organization is crucial to train our model on a solid foundation of historical data and subsequently assess its predictive capability in a specific and more recent period.

In [None]:
'''
# filtering dataset
training = x_training[x_training['date'] < validation_start_date]
validation = x_training[(x_training['date'] >= validation_start_date) & (x_training['date'] <= validation_end_date)]
'''

### **Confirming Data Segregation by Date Ranges**

After subdividing our data into training and validation sets, it is crucial to ensure that this segmentation has been done correctly. To do so, a recommended practice is to assess the extreme dates - minimum and maximum - present in each subset. This allows us to confirm whether the intervals of `validation_start_date` and `validation_end_date` have been respected and if the data is structured as expected.

In [None]:
'''
training['date'].min()
training['date'].max()
validation['date'].min()
validation['date'].max()
'''
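
The check described above can also be run on a small synthetic frame. The sketch below is only an illustration: the DataFrame `df` stands in for `x_training`, and its dates and sales values are made up. It applies the same date arithmetic with `k = 1` and asserts that training ends strictly before validation begins.

```python
import datetime
import pandas as pd

# Synthetic daily data standing in for x_training (illustrative only)
df = pd.DataFrame({'date': pd.date_range('2015-01-01', periods=120, freq='D')})
df['sales'] = range(len(df))

k = 1  # most recent fold
validation_start_date = df['date'].max() - datetime.timedelta(days=k * 7 * 6)
validation_end_date = df['date'].max() - datetime.timedelta(days=(k - 1) * 7 * 6)

training = df[df['date'] < validation_start_date]
validation = df[(df['date'] >= validation_start_date) & (df['date'] <= validation_end_date)]

# The training data must end strictly before the validation window starts
assert training['date'].max() < validation['date'].min()
assert validation['date'].min() == validation_start_date
assert validation['date'].max() == validation_end_date

print('training  :', training['date'].min().date(), '->', training['date'].max().date())
print('validation:', validation['date'].min().date(), '->', validation['date'].max().date())
```

Because both ends of the validation filter are inclusive, the window contains 43 calendar days rather than 42.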

### **Flexibility in Temporal Analysis**

When determining the start and end dates for validation, we use the variable `k`, which is the fold index. Each validation window spans six weeks (`7*6` days), and `k` controls how many windows back from the last date that window lies. By changing the value of `k`, we shift the validation period further into the past, and by changing the `7*6` term we could adjust its length. This keeps the temporal analysis robust and adaptable to our specific needs.
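
To make the role of `k` concrete, the sketch below lists the validation window of each fold. It assumes a hypothetical last date, and the helper name `fold_windows` is not part of the notebook. Only the date arithmetic mirrors `cross_validation`.

```python
import datetime

def fold_windows(last_date, kfold, weeks_per_fold=6):
    """Return (k, validation_start, validation_end) for each fold, oldest first."""
    window = datetime.timedelta(days=7 * weeks_per_fold)
    windows = []
    for k in reversed(range(1, kfold + 1)):
        start = last_date - k * window
        end = last_date - (k - 1) * window
        windows.append((k, start, end))
    return windows

# Hypothetical last date of the training data, used only for illustration
last_date = datetime.date(2015, 6, 18)

for k, start, end in fold_windows(last_date, kfold=5):
    print(f"fold k={k}: train on dates < {start}, validate on {start} .. {end}")
```

The training set grows with each fold, since it always contains everything before the current window.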

### **Cross-Validation Details**

As we prepare for cross-validation, proper **data handling** is crucial. In our dataset, the `date` column is needed to build the folds and the `sales` column is the target, but neither should be passed to the model as a feature. Therefore, before training, we drop both columns from the feature sets and keep `sales` separately as the target (`ytraining` and `yvalidation`). This ensures that the model learns only from the selected features.

In [None]:
'''
xtraining = training.drop(['date', 'sales'], axis=1)
ytraining = training['sales']
xvalidation = validation.drop(['date', 'sales'], axis=1)
yvalidation = validation['sales']
'''

### **Model Creation and Training**

After meticulous data preparation, we proceed to the crucial step: **building and training** our model. We have opted for the `Linear Regression` model due to its effectiveness and simplicity in situations with linear relationships between variables. Using the training set, we teach this model to recognize patterns and establish a relationship between the features and the target variable. This phase is essential as the quality of training will dictate the model's performance in future predictions.

In [None]:
'''
lr = LinearRegression().fit(xtraining, ytraining)
'''

### **Prediction and Assessment**

Once our model is trained, we enter a crucial stage: **prediction**. Using the validation set, the model applies the acquired knowledge to make predictions about unseen data. But how do we know if these predictions are accurate? This is where **assessment** comes into play. We measure the model's effectiveness using the `ml_error` function, a custom helper that reports error metrics (`MAE`, `MAPE`, and `RMSE`) quantifying the difference between predictions and actual values. Before the comparison, both are converted back to the original sales scale with `np.expm1`. This evaluation provides us with a clear picture of the model's ability to generalize its learning to new data.

In [None]:
'''
yhat_lr = lr.predict(xvalidation)
lr_result = ml_error('Linear Regression', np.expm1(yvalidation), np.expm1(yhat_lr))
'''
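
The `ml_error` helper is defined earlier in the notebook and is not repeated here. As a rough illustration of what such a helper computes, the sketch below uses a hypothetical stand-in named `ml_error_sketch`. Its exact form is an assumption, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def ml_error_sketch(model_name, y, yhat):
    """Hypothetical stand-in for ml_error: MAE, MAPE and RMSE in one row."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    mae = np.mean(np.abs(y - yhat))
    mape = np.mean(np.abs((y - yhat) / y))   # assumes no zero sales in y
    rmse = np.sqrt(np.mean((y - yhat) ** 2))
    return pd.DataFrame({'Model Name': model_name,
                         'MAE': mae, 'MAPE': mape, 'RMSE': rmse}, index=[0])

# Toy values on the log1p scale, mapped back with expm1 before scoring
y_log = np.log1p([100.0, 200.0, 400.0])
yhat_log = np.log1p([110.0, 180.0, 400.0])
result = ml_error_sketch('Toy model', np.expm1(y_log), np.expm1(yhat_log))
print(result)
```

Scoring on the original scale keeps the errors in sales units, which is easier to interpret than errors on the log scale.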

### **Invocation and Finalization**

As we approach the conclusion, we encounter the invocation of the `cross_validation` function. In this step, we provide the function with four pieces of information: the **training data**, the **number of folds (k-folds)**, the **model name**, and, of course, the chosen **model**, in this case the `Linear Regression`. Additionally, there's a touch of customization in the process: the `verbose` parameter. This option determines whether we want to see details of each iteration or not. If we're interested in more detailed monitoring, we activate this parameter; otherwise, the process proceeds more discreetly.

### **7.2.1 Linear Regression Model with Cross-Validation**

The **Linear Regression Model** is one of the fundamental tools in a data scientist's toolbox. In our context, cross-validation makes the evaluation of this model more robust. Our function is a time-based variant of **K-Fold Cross-Validation**: the model is trained and tested on several successive periods of the dataset, so we can check how well it generalizes. This way, we can detect whether the model fits a single portion of the data too closely (`overfitting`) and obtain a more comprehensive evaluation of its performance.

In [None]:
lr_result

In [None]:
# Initialize the Linear Regression model
model = LinearRegression()

# Perform cross-validation on the training dataset using the Linear Regression model
lr_result_cv = cross_validation(x_training,           # Training dataset
                                5,                    # Number of folds for cross-validation
                                'Linear Regression',  # Description/model name
                                model)                # Linear Regression model to be validated

# Display the cross-validation results
lr_result_cv

The `cross_validation` function plays a crucial role in assessing the performance of a Linear Regression model. In the presented example, this function is invoked with four arguments, while the fifth parameter keeps its default:

1. `x_training`, which represents the training dataset.
2. The value `5`, indicating the number of folds to be used in cross-validation.
3. The string `'Linear Regression'`, designating the name of the model in question.
4. `model`, which is the model instance; the function fits it on each fold.
5. Lastly, the `verbose` flag, which is not passed here and therefore stays at its default of `False`.

**Understanding the Process:** Unlike a classic shuffled K-Fold, this function builds a specified number of time-ordered folds (in this case, five). For each fold, the model is trained on all data before the validation window and tested on the six-week window that follows. The window moves forward in time with each iteration, so the model is always validated on data more recent than anything it was trained on.

After the iterations, we have five values for each performance metric. These are used to calculate the mean and standard deviation, providing a comprehensive view of the model's performance. In one of the iterations, the value of the **Root Mean Squared Error (RMSE)** was `2671`. However, the calculated average RMSE was `2900`, suggesting variations in performance during different iterations.

For a robust assessment, the average ± the standard deviation is considered. In this example, the actual performance of the model varies roughly between `2400` and `3400`. This approach offers a holistic view of performance, being more informative than an isolated evaluation.
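
The mean ± standard deviation summary can be reproduced by hand. The sketch below uses hypothetical per-fold RMSE values, not results from this project, and builds the same `mean+/-std` string that `cross_validation` returns.

```python
import statistics

# Hypothetical RMSE values for five folds (not results from this project)
rmse_per_fold = [2400.0, 2700.0, 2900.0, 3100.0, 3400.0]

mean_rmse = statistics.mean(rmse_per_fold)
# Population standard deviation, the same convention as np.std's default (ddof=0)
std_rmse = statistics.pstdev(rmse_per_fold)

summary = f"{round(mean_rmse, 2)}+/-{round(std_rmse, 2)}"
low, high = mean_rmse - std_rmse, mean_rmse + std_rmse
print(summary)
print(f"typical range: {low:.0f} to {high:.0f}")
```

The range of mean minus and plus one standard deviation is narrower than the spread between the best and worst fold, so it is worth looking at both when judging stability.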

### **7.3.1 Lasso - Cross Validation**

Cross-validation is a vital technique in evaluating the effectiveness of regression models. In the context of our sales forecasting project, we turn our attention to the **Lasso** model, a form of linear regression that incorporates regularization.

Invoking the `cross_validation` function for the Lasso model is similar to the call used for the linear regression model. It receives four arguments, while the fifth parameter keeps its default:

- The `x_training` dataset designated for training.
- The number `5`, which defines the total number of divisions or "folds" in cross-validation.
- The designation `'Lasso'` that specifies the name of the model.
- `model`, which is the `Lasso` instance; the function fits it on each fold.
- The `verbose` flag, left at its default of `False`.

This process provides insights into the performance of the Lasso model, allowing us to evaluate its accuracy and reliability in sales forecasting.

In [None]:
lrr_result

In [None]:
# Initialize the Lasso model with a specific alpha value
model = Lasso(alpha=0.01)

# Perform cross-validation on the training dataset using the Lasso model
lrr_result_cv = cross_validation(x_training,   # Training dataset
                                 5,            # Number of divisions/folds for cross-validation
                                 'Lasso',      # Description/name of the model
                                 model)        # Lasso model to be validated

# Display the results of cross-validation
lrr_result_cv

When addressing the **Lasso model**, we turned to cross-validation to assess its performance. This technique builds five time-ordered validation windows from the `x_training` dataset. For each window, the model is trained on the data that precedes it and tested on the window itself. This repetitive strategy provides us with one set of performance metrics per fold.

The obtained results are enlightening. While the `RMSE` on the most recent period was recorded at `2952`, the average of RMSEs obtained through cross-validation was `3057`. This value, coupled with a standard deviation of `504`, reveals that the model's performance fluctuates roughly within a range of `2500` to `3500`.

It's important to emphasize that the most reliable metric for evaluating the Lasso model's performance is `3057 ± 504`. This extended analysis, facilitated by cross-validation, allows us to identify atypical periods, such as a week of high performance, and provides a broader view of the model's behavior in different scenarios. Such an approach ensures that our conclusions are not based on outliers or unrepresentative weeks.

### **7.4.1 Random Forest Regressor - Cross Validation**

Continuing our investigation into predictive models, our focus now shifts to the **Random Forest Regressor**. This approach distinguishes itself by harnessing the collective power of multiple decision trees. Rather than relying on a single tree, it trains an ensemble of trees and combines their predictions. This process of integration aims to enhance prediction accuracy and robustness.

Similar to the previous models, we employ cross-validation to assess the efficacy of the Random Forest Regressor. We once again invoke the `cross_validation` function, this time passing the model stored in the variable `rf`, which the function refits on each fold. Additionally, we choose to activate the `verbose` parameter, enabling detailed monitoring of the operation's progress.


In [None]:
rf_result

In [None]:
# Perform cross-validation on the training dataset using the Random Forest model
rf_result_cv = cross_validation(x_training,                 # Training dataset
                                5,                          # Number of folds for cross-validation
                                'Random Forest Regressor',  # Model description/name
                                rf,                         # Random Forest model to be validated
                                verbose=True)               # Report the progress of each fold

# Display the cross-validation results
rf_result_cv

By executing the command `rf_result_cv = cross_validation(x_training, 5, 'Random Forest Regressor', rf, verbose=True)`, we run the same time-based cross-validation for the **Random Forest Regressor**. The resulting means and standard deviations of `MAE`, `MAPE`, and `RMSE` can be compared directly with those of the previous models, providing valuable insights into the accuracy of this model in predicting sales.

### **7.5.1 XGBoost Regressor - Cross Validation**

In our continuous exploration of predictive models, we have now reached the **XGBoost Regressor**. This advanced algorithm employs the boosting technique, where subsequent models are trained based on the errors of the previous model. The result is a series of progressively refined predictions.

What distinguishes XGBoost is its efficiency. It has a well-established reputation in machine learning competitions, where it is frequently among the top-performing methods on tabular data.
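
The boosting idea can be sketched with a toy example in plain Python. This is not XGBoost's actual algorithm, which also uses regularization and second-order gradient information. It is a minimal illustration under simplified assumptions, with made-up data and hypothetical helper names (`fit_stump`, `predict_stump`, `mse`). Each round fits a one-split tree to the current errors and adds a fraction `eta` of its prediction.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

# Toy data: one feature and a small target series (illustrative only)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 2.5, 4.0, 4.5, 6.5, 7.0, 9.0]

eta = 0.3           # learning rate: shrinks each tree's contribution
n_estimators = 20   # number of boosting rounds

pred = [sum(y) / len(y)] * len(y)   # start from the mean prediction
mse_start = mse(y, pred)
for _ in range(n_estimators):
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)   # fit the next learner to the current errors
    pred = [pi + eta * predict_stump(stump, xi) for xi, pi in zip(x, pred)]
mse_end = mse(y, pred)

print(f"training MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

The two settings in this sketch, `eta` and `n_estimators`, correspond to hyperparameters of the same names that are tuned later in the project.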

To evaluate the performance of the XGBoost Regressor, we will apply cross-validation, just as we did with the previous models. By using the `cross_validation` function, we provide the essential parameters: the training dataset, the number of folds, the model name, and the model instance (in this case, `model_xgb`), which the function refits on each fold. We set the `verbose` parameter to `True` to follow the progress fold by fold during the process.

In [None]:
xgb_result

In [None]:
# Running cross-validation on the training dataset using the XGBoost model
xgb_result_cv = cross_validation(x_training,            # Training dataset
                                 5,                     # Number of divisions/folds for cross-validation
                                 'XGBoost Regressor',   # Model description/name
                                 model_xgb,             # XGBoost model to be validated
                                 verbose=True)          # Report the progress of each fold

# Displaying the cross-validation results
xgb_result_cv

### **Performance Analysis: XGBoost Regressor**

With the model-specific performance metrics in hand, we are equipped to assess the efficiency of the **XGBoost Regressor** in comparison to the other models already studied. The core objective is to discern which algorithm offers the most accurate sales prediction. This analysis is crucial as it guides a well-informed decision on the optimal approach to be pursued in our future projections.

## **Sales Forecasting Algorithm: Comparative Analysis**

We conclude our exploratory journey with the application of sales forecasting algorithms, where we delved into models such as **Linear Regression**, **Lasso Regressor**, **Random Forest Regressor**, and **XGBoost Regressor**. The next crucial step is to analyze and contrast the performance of each model, aiming to identify the most accurate one among them.

For a clear and direct analysis, we have established two evaluation categories:

- The first, presented in subsection **7.6.1 - Single Performance**, consolidates the performance of each model based on a single fold.

- The second, detailed in subsection **7.6.2 - Real Performance - Cross Validation**, offers us a more in-depth view, revealing the models' performance when subjected to the cross-validation process.

### **7.6.1 - Single Performance**

In this phase, our goal is to unify the results of all models into a single `DataFrame`, named `modeling_result`, and sort it by **Root Mean Squared Error** (RMSE) so that the models can be compared directly. With this organized structure, we are prepared to discern which model best meets our forecasting needs.

In [None]:
# Import the Pandas library, which provides the 'concat' function
import pandas as pd

# Concatenating the results of different models into a single DataFrame
modeling_result = pd.concat([baseline_result, lr_result, lrr_result, rf_result, xgb_result])

# Sorting the resulting DataFrame based on the RMSE (Root Mean Squared Error) value
# A lower RMSE indicates a model with better performance
modeling_result.sort_values('RMSE')

### **7.6.2 Real Performance: Cross Validation**

We shift our focus to the **actual performance** of the models using the **cross-validation** technique. This approach is essential as it provides a more accurate and realistic assessment of a model's behavior by testing it on various segments of the dataset.

By executing dedicated code, we consolidate the results of this validation, enabling a comparative analysis across different models. This is crucial in determining which one is most suitable for our projections.

It's worth noting that our cross-validation segments the dataset into successive time windows. In each cycle, the model is trained on the data preceding a window and tested on that window. This cycle repeats for every window, ensuring the model is always evaluated on periods it has not seen, in a **comprehensive and unbiased** manner.

In [None]:
# Concatenating the cross-validation results of different models into a single DataFrame
modeling_result_cv = pd.concat([lr_result_cv, lrr_result_cv, rf_result_cv, xgb_result_cv])

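# Sorting the resulting DataFrame based on the mean RMSE (Root Mean Squared Error) across folds
# Again, a lower RMSE indicates a model with better performance
# The 'RMSE CV' column holds 'mean+/-std' strings, so the mean is parsed before sorting
modeling_result_cv.sort_values('RMSE CV', key=lambda s: s.str.partition('+/-')[0].astype(float))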

### **Performance Analysis of the Models**

By executing the provided code, we will obtain two essential tables for our analysis. The first table showcases the **performance** of each model based on a single data split, while the second table highlights the effective performance of the models considering comprehensive **cross-validation**.

These tables are crucial as they not only illustrate the individual capability of each model but also allow for an objective comparison among them. Through this comparative analysis, we can identify and select the model that most efficiently predicts sales in our specific context.

### **Conclusion on Forecasting Models**
We have concluded the training and evaluation phase of the forecasting models. Throughout this process, we analyzed four models: `Linear Regression`, `Lasso Regression`, `Random Forest`, and `XGBoost`. The **Random Forest** stood out, exhibiting an average `RMSE` of 1256, the lowest error among the models evaluated.

However, we have decided to proceed with the **XGBoost** model. This choice is grounded in two primary reasons: its growing popularity across diverse scenarios and our previous experience where, despite the efficacy of Random Forest, its size and subsequent maintenance costs in production were high. Nonetheless, it's vital to understand that model selection is adaptable to the context and specific goals of each project. The procedures discussed here can be tailored to any Machine Learning model.

The next step is to fine-tune the hyperparameters of the `XGBoost`, a crucial process to refine the algorithm's performance, focusing on optimizing the parameters that guide the model's learning.

---

### **Project Update: Rossmann Sales Forecast**

At this juncture, we have provided an overview of the current stage of our sales forecasting project for Rossmann, elucidating each phase for all stakeholders involved.

Our project is based on a 10-step cycle, inspired by CRISP-DM, a renowned methodology in data science. We commenced by defining Rossmann's need: predicting sales to establish renovation budgets. We then delved into comprehending this demand, engaging with stakeholders, and determining that predicting sales for the next six weeks would be ideal.

In the data collection phase, while methods can vary in different contexts, we obtained data from the Kaggle platform. Subsequently, the data underwent preprocessing, including description, feature engineering, and filtering according to business needs. Following that, we immersed ourselves in exploratory analysis, validating hypotheses and conducting tests.

After data preparation, we selected the most impactful variables using the Boruta algorithm. In the modeling phase, five models were evaluated: a baseline plus four machine learning algorithms.

Currently, our focus is on fine-tuning the `XGBoost`, even though it exhibited slightly lower performance than the Random Forest. The upcoming steps will address the evaluation of this model and the decision of its application in production, culminating in the first solution to Rossmann's sales forecasting demand.

---
# **8.0 Fine Tuning**
In the sales forecasting project, **hyperparameter tuning** is crucial to optimize the performance of our Machine Learning models. These hyperparameters are adjustable settings that determine how models learn from data. To illustrate, they can be likened to the buttons on a radio, adjusted for optimal tuning.

Hyperparameters vary depending on the model. In neural networks, they encompass the number of neurons, hidden layers, and activation function. In "Random Forest" models, they involve the number of estimators and tree depth. Selecting their values correctly is vital, as well-tuned hyperparameters enhance prediction accuracy and mitigate overfitting. For this purpose, we rely on systematic search strategies rather than manual trial and error.

The three main approaches to tuning are: `Random Search`, `Grid Search`, and `Bayesian Optimization`.

### **Random Search**

The `Random Search` strategy randomly selects hyperparameter values from a predefined set. For instance, in an XGBoost model, we consider hyperparameters such as the number of estimators, subsample rate, maximum depth, learning rate, and minimum child weight. In each iteration, values are randomly chosen, the model is trained/tested, and performance is recorded. At the end, the set of hyperparameters with the best performance is selected. Its strength lies in its implementation speed. However, due to its random approach, it might not discover the optimal set.

### **Grid Search**

The `Grid Search` technique examines all possible combinations of hyperparameters. The model is trained/tested for each combination using cross-validation, and performance is recorded. The combination with the best performance is selected. However, its weakness is the time it consumes, making it infeasible in certain scenarios. While hyperparameter tuning is crucial, other factors such as data quality and algorithm choice also influence performance.
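
To get a feel for the cost of an exhaustive search, the sketch below counts the combinations in the search space defined in section 8.1. The value `kfold = 5` is an assumption that matches the cross-validation calls made earlier. Nothing is trained here; the code only enumerates settings.

```python
import itertools

# Same search space as the random-search grid defined in section 8.1
param = {
    'n_estimators': [15, 17, 25, 30, 35],
    'eta': [0.01, 0.03],
    'max_depth': [3, 5, 9],
    'subsample': [0.1, 0.5, 0.7],
    'colsample_bytree': [0.3, 0.7, 0.9],
    'min_child_weight': [3, 8, 15],
}

# Every combination a grid search would have to train and cross-validate
grid = [dict(zip(param, values)) for values in itertools.product(*param.values())]

kfold = 5  # assumed number of folds per combination
print(f"combinations: {len(grid)}, model fits with {kfold} folds: {len(grid) * kfold}")
```

With 5 × 2 × 3 × 3 × 3 × 3 = 810 combinations, a full grid search with five folds would require 4,050 model fits, which is why a cheaper strategy is attractive.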

### **Bayesian Search**

Unlike the previous techniques, `Bayesian Search` is grounded in Bayesian theory to choose hyperparameter values. It builds a probabilistic model of how the observed error depends on the hyperparameters, updates that model after each trial, and uses it to pick the most promising values to try next, making the search more targeted and efficient. It is typically faster than Grid Search and often finds better settings than Random Search within the same number of trials, but it can be more complex to set up. In our project, we prefer Random Search due to its cost-effectiveness and ease of application. However, it's essential to be familiar with all strategies to enrich expertise in Machine Learning.

---
## **8.1 Hyperparameter Tuning with Random Search**
While developing our sales forecasting project, we opted for the `XGBoost` algorithm to structure our predictive model. A crucial step in machine learning modeling is hyperparameter tuning. Hyperparameters are specific settings that guide the training process. In the context of `XGBoost`, some of the key hyperparameters are:

- **`n_estimators`**: Refers to the number of trees that the model will build.
- **`eta`**: Is the learning rate, which scales the contribution of each new tree.
- **`max_depth`**: Indicates the maximum depth each tree can reach.
- **`subsample`**: Defines the fraction of samples used to fit each individual tree.
- **`colsample_bytree`**: Specifies the fraction of columns considered when constructing each tree.
- **`min_child_weight`**: Represents the minimum sum of instance weights required in a child node.

To optimize the model's performance, we need to choose appropriate values for these hyperparameters. The **Random Search** strategy is a technique that randomly selects combinations of these values and tests the model's performance. This way, we can identify the combination that yields the best results for our specific dataset.

In [None]:
# Importing the necessary module
import warnings

# Disabling warnings to make the code output cleaner
warnings.filterwarnings('ignore')

# Defining the hyperparameters for Random Search
param = {
    # The number of boosted trees (boosting rounds)
    'n_estimators': [15, 17, 25, 30, 35],

    # The learning rate
    'eta': [0.01, 0.03],

    # The maximum depth of the trees
    'max_depth': [3, 5, 9],

    # Fraction of the training examples to be used in each tree
    'subsample': [0.1, 0.5, 0.7],

    # Fraction of columns to be used in each tree
    'colsample_bytree': [0.3, 0.7, 0.9],

    # Minimum sum of instance weights required in a child
    'min_child_weight': [3, 8, 15]
}

### **Random Search: Hyperparameter Optimization Technique**

**Random search** is an effective approach for optimizing `hyperparameters` in machine learning models. Unlike exhaustive methods that test all combinations, this technique focuses on selecting random sets of hyperparameters for a defined number of iterations, controlled by the `MAX_EVAL` variable.

In each iteration, a specific set of hyperparameters is used to train a model, in this context, an `XGBoost` model. The model's performance is then evaluated through cross-validation, ensuring a comprehensive and reliable analysis. The code that executes this search technique was built following precisely this logic.

In [None]:
import random

import pandas as pd
import xgboost as xgb

# Maximum number of evaluations
MAX_EVAL = 2

# DataFrame to store the results of all iterations
final_results = pd.DataFrame()

for _ in range(MAX_EVAL):
    # Random selection of hyperparameters
    hp = {k: random.choice(v) for k, v in param.items()}
    print(f"Selected Hyperparameters: {hp}")

    # Initialize the XGBoost model with the selected hyperparameters
    model_xgb = xgb.XGBRegressor(
        objective='reg:squarederror',
        n_estimators=hp['n_estimators'],
        eta=hp['eta'],
        max_depth=hp['max_depth'],
        subsample=hp['subsample'],
        colsample_bytree=hp['colsample_bytree'],
        min_child_weight=hp['min_child_weight']
    )

    # Evaluate the model's performance using cross-validation
    result = cross_validation(x_training, 2, 'XGBoost Regressor', model_xgb, verbose=False)
    final_results = pd.concat([final_results, result])

# Display the final results
final_results


### **Hyperparameter Optimization Analysis**

Upon completing the optimization iterations, it becomes imperative to carefully assess the results to determine the optimal combination of hyperparameters. Each iteration reflects a distinct, randomly chosen combination, leading to varying performances.

In our study, we observed different combinations in each iteration. The model's performance, measured by the `Root Mean Square Error (RMSE)` metric, serves as an indicator of prediction quality. This value quantifies the model's error, which is the discrepancy between predictions and actual data.

In the evaluated context, the second iteration recorded a lower `RMSE` value than the first, indicating that the hyperparameters of the second iteration were more effective. Hence, we keep the values from this second iteration, as they gave the best performance according to the RMSE metric. With `MAX_EVAL = 2` only two combinations are compared, so a larger number of iterations would explore the search space more thoroughly.

This procedure is vital in forecasting projects as it guides towards hyperparameters that enhance model accuracy, elevating the reliability of sales projections.
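
When comparing iterations, it helps to store each hyperparameter set next to its score, so the best combination can be read off directly. The sketch below illustrates this under stated assumptions: `evaluate` is a hypothetical stand-in for the cross-validation call, and its formula is invented only so the example runs.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

param = {'eta': [0.01, 0.03], 'max_depth': [3, 5, 9], 'subsample': [0.1, 0.5, 0.7]}
MAX_EVAL = 5

def evaluate(hp):
    """Hypothetical stand-in for cross_validation: returns a made-up RMSE."""
    return 1000 + 50 * hp['max_depth'] - 2000 * hp['eta'] + 100 * hp['subsample']

trials = []
for _ in range(MAX_EVAL):
    hp = {name: random.choice(values) for name, values in param.items()}
    trials.append({**hp, 'rmse': evaluate(hp)})   # keep settings and score together

best = min(trials, key=lambda t: t['rmse'])
print('best trial:', best)
```

In the real project, `evaluate` would be replaced by the mean RMSE obtained from cross-validation for the sampled hyperparameters.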

## **8.2 Final Model**

Upon completing the hyperparameter optimization for our sales forecasting model, we determined that the **parameter set from the second iteration** provided the best performance based on the `RMSE` metric.

Thus, for the construction of the final model, we utilize the optimized values of these hyperparameters. These values will serve as a reference when implementing the model in practical situations, ensuring a more accurate and reliable prediction.

In [None]:
# Defining a dictionary to store the optimized hyperparameters of the model.
param_tuned = {
    'n_estimators': 30,             # Number of trees to be used in boosting
    'eta': 0.03,                    # Learning rate
    'max_depth': 9,                 # Maximum depth of each tree
    'subsample': 0.1,               # Proportion of samples used to train each tree
    'colsample_bytree': 0.7,        # Proportion of columns used by each tree
    'min_child_weight': 15          # Minimum sum of instance weights required in a child
}

### **Optimized Model Training**
Subsequently, the **XGBoost model** was trained with the `optimized hyperparameters` on the designated training data. With this tuned model, predictions were made on the `test dataset`.

In [None]:
# Initialization and training of the XGBoost model with optimized hyperparameters
model_xgb_tuned = xgb.XGBRegressor(
    objective        = 'reg:squarederror',
    n_estimators     = param_tuned['n_estimators'],
    eta              = param_tuned['eta'],
    max_depth        = param_tuned['max_depth'],
    subsample        = param_tuned['subsample'],
    colsample_bytree = param_tuned['colsample_bytree'],
    min_child_weight = param_tuned['min_child_weight']
).fit(x_train, y_train)

# Prediction using the tuned model
yhat_xgb_tuned = model_xgb_tuned.predict(x_test)


### **Model Performance Evaluation**

Upon completing the training of the optimized model, its effectiveness was evaluated using the `ml_error` helper. Essentially, it measures the difference between the actual sales values represented by `y_test` and the estimates generated by the model, denoted as `yhat_xgb_tuned`. It is crucial to note that both sets are first mapped back from the logarithmic scale to the original sales scale, reversing the transformation performed earlier in the project.


In [None]:
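# Optimized Model Performance Evaluation
# np.expm1 is used for consistency with cross_validation, which reverses the log transform the same way
xgb_result_tuned = ml_error('XGBoost Regressor', np.expm1(y_test), np.expm1(yhat_xgb_tuned))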
xgb_result_tuned
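
The choice of inverse transform matters. Assuming the target was transformed with `log1p` earlier in the project (an assumption suggested by the `np.expm1` calls inside `cross_validation`), the small sketch below shows that `expm1` recovers the original value, while a plain exponential leaves it shifted by one unit.

```python
import math

sales = 100.0
log_sales = math.log1p(sales)       # forward transform: log(1 + x)

recovered = math.expm1(log_sales)   # exact inverse: exp(z) - 1
shifted = math.exp(log_sales)       # plain exp leaves the "+ 1" in place

print(recovered, shifted)
```

The offset is small for large sales values, but using the matching inverse keeps the metrics comparable with the cross-validation results.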

### **Optimizing Sales Forecasting**

With the optimization process completed, our model now incorporates **adjusted hyperparameter values** to enhance sales forecasting. The significance of each `hyperparameter` should not be underestimated, as each adjustment directly impacts the model's effectiveness. This refinement is essential to ensure the robustness of a sales forecasting project.

Following this calibration, we proceed to **model training and validation**. This phase is focused on increasing prediction accuracy and solidifying the reliability of the generated estimates.

## **Rossmann Project Status**

Currently, we are in **phase 9** of our sales forecasting project. In this stage, our focus is on **interpreting and contextualizing the model's error**. The goal is to understand how the model's performance translates into tangible business outcomes. We aim to decipher, for example, what a 70% accuracy implies in terms of profitability and potential customer engagement.

It's worth revisiting our journey thus far:
We began by understanding the business question at hand, followed by data collection aided by SQL. The next step was ensuring data integrity through cleaning and preprocessing. Once the data was in order, we delved into exploratory analysis, seeking insights into the current business landscape.

In the modeling phase, we transformed the data to fit the specific requirements of machine learning algorithms. Various models were trained, ranging from linear to non-linear, always compared against a baseline model.

The heart of this project has always been tackling a real business challenge. While data science and machine learning techniques are fundamental, the ultimate goal has been practical application to optimize Rossmann's outcomes. Thus, our current stage is to evaluate the model's performance in relation to its real impact on the business.

---

## **9.0 Error Translation and Interpretation**

In the sales forecasting project, it's imperative to appropriately evaluate and interpret the error, as well as adjust the scale of the data. In this section, we'll perform both activities transparently and didactically.

To assess the error, we begin by selecting the relevant columns from `X_test`, identified during the feature selection step using the **Boruta** technique. These are retained in the variable `cols_selected_boruta_full`. Additionally, columns like `date` and `sales` are crucial, given their fundamental role in cross-validation.


In [None]:
# Work on a copy so that the columns added later do not modify a view of X_test
df9 = X_test[cols_selected_boruta_full].copy()

### **Scale Transformation and Model Accuracy**

To ensure the accuracy of our model, it's crucial to consider the scale of the data we're working with. **The data was originally transformed to a logarithmic scale**, optimizing the forecasting model. However, when calculating the error, we need these data to be in their original scale.

This is where the importance of exponential transformation comes in, acting as the counterpart to the logarithmic transformation. By using the `np.expm1()` function from the NumPy library, we can revert the values of both sales and predictions to their original scale. This process is essential to ensure that error assessment is conducted accurately and aligned with the context of the original data.
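
As a small illustration of why the inverse must match the forward transformation, the sketch below uses the standard-library scalar functions `math.log1p` and `math.expm1`, which compute the same quantities as their NumPy namesakes. It assumes the target was transformed with `log1p` earlier in the project. If a plain logarithm had been used instead, plain `exp` would be the matching inverse. The sales values are made up.

```python
import math

# Hypothetical daily sales values (illustrative only, not from the dataset).
sales = [0.0, 1250.0, 5263.0]

# Forward transform: log1p(x) = log(1 + x), defined even when x == 0.
log_sales = [math.log1p(v) for v in sales]

# Matching inverse: expm1(y) = exp(y) - 1 undoes log1p.
recovered = [math.expm1(v) for v in log_sales]

# Mismatched inverse: plain exp() leaves every value too high by exactly 1.
off_by_one = [math.exp(v) for v in log_sales]

for original, back, wrong in zip(sales, recovered, off_by_one):
    print(f"{original:>8.1f} -> expm1: {back:>8.1f} | exp: {wrong:>8.1f}")
```

The one-unit shift is small relative to daily sales, but mixing the two inverses in different cells makes reported errors slightly inconsistent.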

In [None]:
# Reverting the logarithmic transformation with NumPy's `expm1`
# `sales` is a column of df9, while the predictions come from the array `yhat_xgb_tuned`
df9['sales'] = np.expm1(df9['sales'])
df9['predictions'] = np.expm1(yhat_xgb_tuned)

When dealing with the forecasting process, it's essential to understand the source of the predictions. The forecasted sales, labeled as `predictions`, originate from `yhat_xgb_tuned`. These predictions, in turn, are generated from an optimized machine learning model referred to as "tuned".

The accuracy of these predictions is paramount. Therefore, rigorously evaluating the associated error of the forecasting model is crucial. By ensuring that the data is interpreted on the correct scale, a solid foundation is established for making informed business decisions.

---
## **9.1 Business Performance**
When delving into data analysis, the primary focus should be on **effectively communicating insights** to the business team. Analytical results need to have a tangible application that highlights the financial impact on company operations.

Rather than solely concentrating on model accuracy metrics, it's crucial to present the data in a way that emphasizes the real value it represents for the business. An essential step in this process is consolidating the sales forecasts for each store, as what truly matters is the total predicted revenue. To achieve this objective, we've established a variable named `df91`, which aggregates the sales forecasts for each establishment.


In [None]:
# Grouping the data by store and summing the sales forecasts for each one, then resetting the index.
df91 = df9.groupby('store')['predictions'].sum().reset_index()

Next, we calculate the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE) for each store. These metrics are important for understanding the model's performance in business terms. MAE is the average absolute error of the forecasts, expressed in sales units, while MAPE expresses this error as a percentage of the actual sales.
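
To make the two definitions concrete, here is a dependency-free sketch that computes both metrics per store on a handful of made-up records. The helper names `mae` and `mape` and the sample figures are assumptions for illustration. The notebook itself relies on `mean_absolute_error` and `mean_absolute_percentage_error` with pandas.

```python
from collections import defaultdict

# Hypothetical (store, actual, predicted) daily records, for illustration only.
records = [
    (1, 100.0, 110.0),
    (1, 200.0, 180.0),
    (2, 400.0, 300.0),
    (2, 500.0, 550.0),
]

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Group the records by store, mirroring a groupby('store').
by_store = defaultdict(lambda: ([], []))
for store, actual, predicted in records:
    by_store[store][0].append(actual)
    by_store[store][1].append(predicted)

errors = {store: (mae(a, p), mape(a, p)) for store, (a, p) in by_store.items()}
for store, (store_mae, store_mape) in errors.items():
    print(f"store {store}: MAE={store_mae:.1f}, MAPE={store_mape:.1%}")
```

Note that MAPE divides by the actual value, so it is undefined for days with zero sales. Such rows need to be excluded before computing it.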

In [None]:
def calculate_errors(x):
    mae = mean_absolute_error(x['sales'], x['predictions'])
    mape = mean_absolute_percentage_error(x['sales'], x['predictions'])
    return pd.Series((mae, mape), index=['MAE', 'MAPE'])

df9_aux = df9[['store', 'sales', 'predictions']].groupby('store').apply(calculate_errors).reset_index()

Next, we combine these metrics into a single dataframe to facilitate the analysis.

In [None]:
# Join the per-store forecast totals (df91) with the per-store error metrics (df9_aux).
# The join is based on the 'store' column and uses the 'inner' method,
# which means only the stores that exist in both DataFrames are retained.
# The resulting df92 contains the total prediction, MAE, and MAPE for each store.
df92 = pd.merge(df91, df9_aux, how='inner', on='store')

### **Analyzing Sales Projections and Their Accuracy**

When providing sales forecasts to the business team, it is imperative to also communicate the associated margin of error with these forecasts. The **`Mean Absolute Error (MAE)`** and **`Mean Absolute Percentage Error (MAPE)`** are essential tools in this context, enabling a comprehensive understanding of forecast accuracy.

When projecting revenue for the next six weeks for each store, we are basing our decisions on past data and observed trends. However, it's crucial to understand that every predictive model carries uncertainty. Using Store 1 as an example: we identified an error, measured by the `MAE`, of 306 units. This corresponds to a deviation of 7% from the store's typical sales.

Taking Store 4 as another case: we forecast sales of 345,000 dollars over the six-week period. However, the forecasts for this store carry an average error of 855 units (`MAE`), or 8% in relative terms (`MAPE`). The `MAE` is crucial because it is the average of the absolute errors between the forecasted and observed values.

Our intention, in sharing these forecasts, is more than just providing a number: we aim to present a **spectrum of possible outcomes**. Thus, when considering a sales forecast of 162,000 dollars, for example, it's vital for the manager to understand the associated margin of error. A 7% margin provides two scenarios: the optimistic and the pessimistic. Both are essential for effective management and well-informed decisions.

In [None]:
# Calculate the most pessimistic scenario (worst_scenario) for each store by subtracting the Mean Absolute Error (MAE) from the prediction.
# This provides us with a lower estimate of the forecasted sales, considering the possible model error.
df92['worst_scenario'] = df92['predictions'] - df92['MAE']

# Calculate the most optimistic scenario (best_scenario) for each store by adding the MAE to the prediction.
# This provides us with an upper estimate of the forecasted sales, considering the possible model error.
df92['best_scenario'] = df92['predictions'] + df92['MAE']

# Rearrange the columns of the DataFrame for better visualization and understanding of the data.
# Here, the columns are ordered by relevance and context, starting with the store, prediction, and derived scenarios.
df92 = df92[['store', 'predictions', 'worst_scenario', 'best_scenario', 'MAE', 'MAPE']]

In [None]:
df92.sample()

---

### **Comprehending Sales Predictions**

When assessing our forecasts, it is essential to understand the numbers to make informed decisions. Let's take store `638` as an example. Our predictions indicate an expected revenue of 232,000 dollars over the next six weeks. However, each prediction comes with a margin of error, and in this case the `MAPE` is **6%**. In practical terms, when predicting a sale of 100 dollars, the actual value typically falls between 94 and 106 dollars. For this store, the corresponding `MAE` is 437 dollars, so the scenarios range from roughly 232,000 dollars to 233,000 dollars.
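
The arithmetic behind these ranges can be sketched as follows. The function names `scenario_range` and `percentage_band` are hypothetical, and the 200,000 forecast with an MAE of 500 is a made-up example. Only the 6% band around a 100-dollar forecast comes from the text.

```python
def scenario_range(prediction, mae):
    """Return (worst, best) by subtracting and adding the MAE to the prediction."""
    return prediction - mae, prediction + mae

def percentage_band(value, mape):
    """Return the (low, high) band implied by a MAPE given as a fraction."""
    return value * (1 - mape), value * (1 + mape)

# Hypothetical store: forecast of 200,000 with an MAE of 500.
worst, best = scenario_range(200_000, 500)
print(f"worst={worst:,}, best={best:,}")

# The 6% example from the text: a 100-dollar forecast.
low, high = percentage_band(100, 0.06)
print(f"band: {low:.0f} to {high:.0f} dollars")
```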

With this understanding, a manager can align their expectations. With a small error, like 7%, they can be more optimistic. However, with a larger error, such as the 18% observed in store `642`, it is wise to adopt a more cautious perspective.

The challenge between data experts and management is to **translate technical metrics into business insights**. Instead of focusing on the Mean Absolute Error (MAE), it is more effective to discuss in terms of revenue and potential scenarios, taking into account the margin of error. It is crucial to understand that in regression contexts, accuracy is not the ideal metric. Unlike binary classifications (such as identifying cats or dogs in photos), we are estimating continuous values. What matters to us is how closely our estimates align with the actual value.

In summary, the crux of the matter is the proximity of the forecast to the actual value and the reliability of that estimate. **An error close to zero indicates a reliable forecast; a larger error signals the opposite**. Translating model performance metrics into business language is vital for effective communication between data scientists and managers.

Throughout the project, we noticed variations in predictions among stores. For instance, we have a MAPE of 7% for store `932` and 18% for store `642`, indicating that some stores have more challenging predictions than others. To visualize these variations clearly, the data can be sorted by MAPE using a sorting command in the `df92` dataframe, focusing on stores with larger deviations.

In [None]:
# Sorts the DataFrame df92 based on Mean Absolute Percentage Error (MAPE) in descending order.
top_mape_stores = df92.sort_values('MAPE', ascending=False).head()

### **Analyzing Store Prediction Accuracy**

While analyzing our data, we've identified that store `292` has a **Mean Absolute Percentage Error (MAPE)** of 56%. This means the predictions for this establishment deviate, on average, by more than half of the actual value, either overestimating or underestimating it. An error of this magnitude is concerning.

Clearly, we need to adopt different approaches to enhance the accuracy of our predictions for stores like this one. Some alternatives include developing specific models for stores with atypical performance or enriching our current models with more variables that capture the nuances of these stores.
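
One simple way to shortlist such stores is to flag those whose MAPE exceeds a chosen cut-off. The sketch below is illustrative only: the 30% threshold is an assumption rather than a value used in the project, and the dictionary merely echoes the per-store figures quoted in the text, written as fractions.

```python
# Per-store MAPE values as fractions, echoing figures quoted in the text.
store_mape = {292: 0.56, 642: 0.18, 932: 0.07, 638: 0.06}

# Illustrative cut-off (an assumption, not a project setting).
MAPE_THRESHOLD = 0.30

# Keep the stores above the threshold, worst first.
flagged = sorted(
    (store for store, value in store_mape.items() if value > MAPE_THRESHOLD),
    key=lambda store: store_mape[store],
    reverse=True,
)
print(flagged)
```

The flagged list is a natural starting point for store-specific models or additional feature work.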

An effective way to visualize the discrepancy in predictions among stores is through a **scatterplot**. In this plot, the x-axis represents the stores (`store`) while the y-axis shows the value of `MAPE`. Hence, each point on the graph corresponds to a store, and its vertical position indicates the associated `MAPE`.

In [None]:
# Importing necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

class StoreAnalysis:

    def __init__(self, data):
        self.data = data

    def plot_mape_distribution(self):
        """Visualizes the distribution of Mean Absolute Percentage Error (MAPE) per store."""
        plt.figure(figsize=(12, 6))
        sns.scatterplot(x='store', y='MAPE', data=self.data)
        plt.title('MAPE Distribution per Store')
        plt.xlabel('Store ID')
        plt.ylabel('MAPE (%)')
        plt.grid(True, which='both', linestyle='--', linewidth=0.5)
        plt.tight_layout()
        plt.show()

# Instantiating the class and plotting the graph
analysis = StoreAnalysis(df92)
analysis.plot_mape_distribution()


### **Store Performance Visualization**

Upon analyzing the graph, we observe that most stores have a `MAPE` close to 10%. However, specific stores, such as those identified by the numbers `292` and `900`, stand out with a `MAPE` exceeding 50%.

There are also establishments with a `MAPE` around 30%. This graph signals to us which stores require more focused interventions to optimize our sales predictions.

---
## **9.2 Total Forecast Performance**
To provide a comprehensive perspective on **sales forecasts**, it is essential to highlight the most and least optimistic scenarios for all stores. Such a summary is valuable, particularly for delivering a concise overview to project stakeholders.

We initiate the analysis by extracting three essential columns from `df92`: `predictions`, `worst_scenario`, and `best_scenario`. These represent, respectively, our predictions and estimates of more and less favorable sales scenarios.

The subsequent step calls `sum()` on these columns. This operation totals the values along the vertical axis, aggregating each column across all stores.

The outcome of this operation is `df93`, where each row represents a specific scenario and its accumulated value. To enhance data clarity, we reset the index of `df93` and rename the columns to 'Scenario' and 'Values'. Additionally, we use `apply` with a `lambda` function to format the 'Values' column in Brazilian Real (R$) currency format, ensuring a more intuitive and understandable presentation of financial projections.

In [None]:
# Calculating the sum of columns 'predictions', 'worst_scenario', and 'best_scenario' and reformatting the resulting dataframe
df93 = df92[['predictions', 'worst_scenario', 'best_scenario']].sum().reset_index()
df93.columns = ['Scenario', 'Values']

# Converting values in the 'Values' column to Brazilian Real (R$) format
df93['Values'] = df93['Values'].apply(lambda x: f'R${x:,.2f}')

# Displaying the dataframe df93
df93

### **Forecasting Sales Overview**

The analysis of our `df93` now provides an overview of the **sales forecasts** for the next six weeks across all stores. Based on the data, we anticipate total sales of about R$285 million. In a more conservative scenario, sales would reach about R$285.1 million. In contrast, in an optimistic scenario, we expect sales to reach about R$286.6 million.

It's crucial to understand that, while it might seem intuitive to subtract one scenario from another to assess accuracy, this method is not suitable here. The term "accuracy" is more aligned with `classification` problems, not `regression`.

With these analyses completed, the **visualization of data** becomes the next crucial step. In data science projects, visuals play a pivotal role not only in interpreting information but also in conveying it clearly, especially to individuals without technical familiarity in the field.

## **9.3 Machine Learning Performance**

In this phase, we prioritize clarity in presenting the results by introducing two crucial columns: `error` and `error_rate`. These are meticulously calculated to provide a solid foundation for constructing analytical graphics. These graphics, in turn, offer significant insights into the performance of the adopted algorithm.

In [None]:
# Calculating the difference between actual sales and predictions to determine the error
df9['error'] = df9['sales'] - df9['predictions']

# Calculating the error rate as the ratio between predictions and actual sales
df9['error_rate'] = df9['predictions'] / df9['sales']

### **Error Analysis in Sales Predictions**

When assessing the accuracy of predictions, two crucial parameters are **`error`** and **`error_rate`**. The first parameter, **`error`**, measures the discrepancy between actual sales and estimated values. On the other hand, the **`error_rate`** represents the proportion of these predictions in relation to the true values, providing a percentage perspective of accuracy.

For a comprehensive and visual interpretation of these data, we have implemented four graphs. The following code snippet details the methodology for creating these visualizations.

In [None]:
# Preparing a 2x2 subplot matrix to display 4 graphs
fig, axs = plt.subplots(2, 2, figsize=(14,10))

# First graph: Comparison between actual sales and predictions over time
axs[0,0].set_title("Comparison between Sales and Predictions")
sns.lineplot(x='date', y='sales', data=df9, label='SALES', ax=axs[0,0])
sns.lineplot(x='date', y='predictions', data=df9, label='PREDICTIONS', ax=axs[0,0])

# Second graph: Visualization of error rate over time
axs[0,1].set_title("Error Rate over Time")
sns.lineplot(x='date', y='error_rate', data=df9, ax=axs[0,1])

# Third graph: Distribution of errors
axs[1,0].set_title("Error Distribution")
sns.histplot(df9['error'], kde=True, ax=axs[1,0])  # histplot replaces the deprecated distplot

# Fourth graph: Scatter plot between predictions and errors
axs[1,1].set_title("Relationship between Predictions and Error")
sns.scatterplot(x=df9['predictions'], y=df9['error'], ax=axs[1,1])

plt.tight_layout()  # Improves spacing between graphs
plt.show()  # Displays the graphs

### **Analysis of Sales Predictions**

In the first graph, we depict the relationship between **actual sales and predictions** using lines over time. The noticeable convergence between the two over a six-week period indicates the accuracy of the predictions in relation to realized sales.

In the subsequent graph, the focus shifts to the `error_rate` depicted over time. The visualization serves to assess the model's efficiency: an `error_rate` of one denotes impeccable accuracy. Values above one indicate an overestimation of sales, while values below suggest underestimation.
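
The reading of `error_rate` described above can be sketched as a small helper. The function name `describe_error_rate`, its `tolerance` parameter, and the sample values are assumptions for illustration. The ratio itself follows the notebook's definition, predictions divided by actual sales.

```python
def describe_error_rate(prediction, actual, tolerance=0.0):
    """Label a forecast using error_rate = prediction / actual."""
    rate = prediction / actual
    if rate > 1 + tolerance:
        label = "overestimate"
    elif rate < 1 - tolerance:
        label = "underestimate"
    else:
        label = "on target"
    return rate, label

# Hypothetical (prediction, actual) pairs, for illustration only.
for prediction, actual in [(110.0, 100.0), (90.0, 100.0), (100.0, 100.0)]:
    rate, label = describe_error_rate(prediction, actual)
    print(f"error_rate={rate:.2f} -> {label}")
```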

The third graph shows the **distribution of errors** as a histogram with a density curve. When this distribution is roughly bell-shaped and centered near zero, the errors show no strong systematic bias, a promising indicator of the model's reliability.

Finally, the fourth graph, a `scatterplot`, juxtaposes **predictions and errors**. This visualization is crucial for residual analysis, a technique that highlights areas where the model may be susceptible to flaws, providing insights for continuous improvement.
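
A numeric complement to the residual scatterplot is to average the residuals within ranges of the predicted value. The sketch below is one possible way to do this, not the project's method. The data, the bucket width of 4,000, and the `bucket` helper are all assumptions.

```python
from statistics import mean

# Hypothetical (prediction, actual) pairs, for illustration only.
pairs = [(1000, 1100), (1200, 1150), (5000, 5600), (5200, 6100), (9000, 8900)]

# Residuals follow the notebook's convention: error = actual - prediction.
residuals = [(pred, actual - pred) for pred, actual in pairs]

def bucket(pred, width=4000):
    """Map a prediction to the lower edge of its bucket."""
    return (pred // width) * width

grouped = {}
for pred, err in residuals:
    grouped.setdefault(bucket(pred), []).append(err)

# A bucket whose mean residual is far from zero points to systematic bias there.
bias_by_bucket = {low: mean(errs) for low, errs in sorted(grouped.items())}
print(bias_by_bucket)
```

In this made-up data the middle bucket has a clearly positive mean residual, meaning the model underestimates sales in that range.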

These visual representations and metrics not only convey the model's technical performance but also facilitate its interpretation in terms of business impact, making the results more accessible to professionals in the field.

---
## **Rossmann Project Status**
The Rossmann project, focused on sales forecasting, adopted ten essential steps inspired by the CRISP-DM methodology. These steps outline a cohesive trajectory, starting with a deep understanding of the business and culminating in the deployment of the model in question. Let's briefly outline each one:

**Business Question**: Central to the project was identifying the need of store managers to forecast sales for the next six weeks, aiming to plan renovations. This initial alignment with the business strategy is crucial.

**Business Understanding**: Before any technical approach, it was essential to understand the core of the challenge. This involved dialogues with stakeholders to capture the real magnitude of the problem.

**Data Collection**: We used a dataset from the Kaggle platform. In various contexts, the source of data can be multiple, but here, we opted for a dataset that reflects an authentic professional environment.

**Data Cleaning**: After an initial descriptive analysis, we engaged in feature engineering, creating more informative variables from the original ones and filtering the dataset to keep only the essential.

**Data Exploration**: This stage focused on exploratory analysis to understand the data in light of the business and identify variables with potential impact on the final model.

**Data Preparation**: After exploration, the data was shaped for modeling, including encoding categorical variables and selecting variables using the `Boruta` algorithm.

**Modeling**: We implemented five machine learning algorithms, evaluating the effectiveness of each one.

**Model Evaluation**: We used cross-validation for a robust evaluation of the developed models.

**Interpretation of Results**: In addition to assessing technical effectiveness, we sought to understand the real value that the model could bring to the business, whether financially, in customer acquisition, or in cost savings.

**Deployment**: The final model was hosted in a production environment in the cloud, making it available to end users.

It's worth highlighting that the goal in the first cycle of CRISP-DM is to quickly deliver a pilot version of the solution. This initial release allows stakeholders to perceive the real value of Machine Learning, and the feedback collected guides improvements in subsequent iterations.

---
## **10.0 Model Deployment in Production**
In the context of sales forecasting projects, deploying the model in a production environment is imperative, allowing its predictions to be accessed by various consumers, whether through a mobile app, website, or other online software.

An `API` (Application Programming Interface) can be understood as a contract that defines how different software systems communicate and interact. In essence, the API establishes which functions and methods are available for one application to communicate with another.

When we talk about a `Request`, we're referring to a call made to the API. This is how one application "calls" another to interact. The `EndPoint` is essentially the address to which this request is directed, usually a URL. The term `Deploy` refers to the process of moving software from the development environment to the production environment, often in the cloud, making it accessible to devices connected to the internet.
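
To make these terms concrete, the sketch below builds (but does not send) a request with Python's standard library. The endpoint URL and the payload fields are hypothetical placeholders. The real address and schema depend on how the API is deployed.

```python
import json
from urllib.request import Request

# Hypothetical endpoint URL; the real address depends on the deployment.
ENDPOINT = "http://localhost:5000/rossmann/predict"

# Hypothetical payload: one record with a few store attributes.
payload = [{"Store": 22, "Date": "2015-09-17", "Promo": 1}]

# The Request object bundles the endpoint, the JSON body, and the headers.
request = Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Nothing is sent here; urllib.request.urlopen(request) would perform the call.
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```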

Regarding the **Development Cycle**:
- `Local Environment`: Represents the initial development location, usually the programmer's device.
- `Development Environment`: Functions as a replica of the production environment, but with more controls and without public access. Here, developers test changes before migrating them to production.
- `Production Environment`: Encompasses servers, often cloud-based, that offer services or information to end users.

Analyzing the **Production Architecture** for our model:
1. `Handler`: Acts as an intermediary, receiving user requests and forwarding them to the model.
2. `Data Preparation`: Prepares the data to be processed by the model, performing all the steps of cleaning, transformation, and variable creation. This step is crucial to align the received data with the format used during model training.
3. `Trained Model`: Represents the core of the entire system. Once the data is prepared, the model makes predictions and sends them back to the Handler, which then forwards them to the requesting user.
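
The three components above can be sketched in plain Python as follows. Everything here is a stand-in: `DataPreparation`, `DummyModel`, and `handler` are hypothetical names, and the "model" is a fixed rule rather than the trained XGBoost regressor. The point is only the flow of a request through the architecture.

```python
class DataPreparation:
    """Stand-in for the cleaning and transformation step (hypothetical logic)."""

    def transform(self, record):
        # Example transformation: lowercase the field names.
        return {key.lower(): value for key, value in record.items()}


class DummyModel:
    """Stand-in for the trained model; returns a fixed rule-based value."""

    def predict(self, features):
        return 5000.0 + 1000.0 * features.get("promo", 0)


def handler(records, preparation, model):
    """Receive raw records, prepare them, predict, and return the responses."""
    responses = []
    for record in records:
        features = preparation.transform(record)
        responses.append({**record, "prediction": model.predict(features)})
    return responses


result = handler([{"Store": 22, "Promo": 1}], DataPreparation(), DummyModel())
print(result)
```

Keeping preparation and prediction behind separate objects means the handler never needs to know how the features are built, which makes each piece easier to test and replace.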

The development of this project follows a defined flow: it starts in the local environment, undergoes testing in the development environment, and finally gets deployed to the production environment for user access.

---
## **10.1 Rossmann Class Overview**
Central to our project is the **Rossmann class**, a key component that encompasses all the cleaning, transformation, and encoding operations performed during model training. This class is essential to ensure **consistency** and accuracy when applying the model to unseen datasets.

Upon completing the training, the next critical step is preserving that model. We use Python's `pickle` library for this purpose. Through it, the model is serialized into a single file, which can be stored and later deserialized when needed to put it into operation.

In [None]:
import pickle
from google.colab import files

# Save the trained model to a .pkl file
filename = 'model_rossmann.pkl'
with open(filename, 'wb') as file:
    pickle.dump(model_xgb_tuned, file)

# Download the model file directly to your local machine
files.download(filename)
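
The reverse step, loading the file back, can be sketched as below. To keep the example self-contained, a dictionary of made-up parameters stands in for the trained model and a temporary directory stands in for the real storage location. Both are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a dict of made-up parameters.
model_params = {"n_estimators": 100, "max_depth": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_example.pkl")

    # Serialize: write the object to a binary file.
    with open(path, "wb") as file:
        pickle.dump(model_params, file)

    # Deserialize: read the object back, as the production code would do.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == model_params)
```

Because unpickling can execute arbitrary code, only files from a trusted source should be loaded this way.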


### **Saving Transformation Parameters**

In addition to safeguarding the model itself, it is essential to ensure the persistence of the **transformation parameters** applied to the data during training. This step guarantees that when we receive new data, we can apply the exact same transformations, maintaining **consistency** and ensuring the integrity of the prediction process.

---
## **5.1 Normalization Techniques (Return)**

**Normalization** is a crucial step in data preparation to ensure that all attributes are on a common scale, optimizing the performance of machine learning models. In our implementation, particularly in the `Rossmann` class, we have utilized two notable approaches: the `RobustScaler` and the `MinMaxScaler`.

The choice of the **RobustScaler** for the `competition_distance` variable is strategic, as this technique is robust to outliers. After fitting and applying this transformation, we save the fitted scaler with the help of the `pickle` library, so the same scaling can be reused on new data.

In [None]:
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from google.colab import files
import pickle

# Initializing scaling instances
rs = RobustScaler()
mms = MinMaxScaler()  # not applied in this cell

# Scaling the 'competition_distance' column using RobustScaler
df5['competition_distance'] = rs.fit_transform(df5[['competition_distance']].values)

# Saving the trained scaler to a file using pickle
filename = 'encoding_competition_distance.pkl'
with open(filename, 'wb') as file:
    pickle.dump(rs, file)

# Downloading the scaler directly to the local machine from Google Colab
files.download(filename)


### **Variable Transformation: Competition and Promotion Times (Return)**

When dealing with the `competition_time_month` and `promo_time_week` variables, we applied the **RobustScaler** transformation. This method is particularly valuable when the data contains many outliers, as it scales the data using the median and quartiles, reducing the influence of these extreme values. After the transformation, we persisted the fitted scalers using `pickle`, so they can be loaded and reused on new data.
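
The idea of "fit once, persist, reuse" can be sketched without scikit-learn. The helpers `fit_robust` and `transform_robust` are hypothetical and implement the usual robust-scaling formula, (x - median) / IQR with the 25th and 75th percentiles, which is also `RobustScaler`'s default behaviour. The distances are made up.

```python
import pickle
from statistics import median, quantiles

def fit_robust(values):
    """Learn the median and interquartile range used for robust scaling."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"center": median(values), "scale": q3 - q1}

def transform_robust(values, params):
    """Apply (x - median) / IQR with previously learned parameters."""
    return [(v - params["center"]) / params["scale"] for v in values]

# Hypothetical competition distances, including one outlier.
train = [1.0, 2.0, 3.0, 4.0, 100.0]
params = fit_robust(train)

# Persist the fitted parameters and restore them, as done for the real scaler.
restored = pickle.loads(pickle.dumps(params))

# New data is transformed with the training-time parameters, never refitted.
scaled_new = transform_robust([3.0, 5.0], restored)
print(params, scaled_new)
```

The outlier of 100 barely affects the learned median and IQR, which is why this scaling suits variables such as `competition_distance`.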