# TIME SERIES ANALYSIS (CORPORATION FAVORITA)

#### Business Understanding
Sales forecasting plays a pivotal role in strategic decision-making for corporations like Favorita. By accurately predicting sales trends, Favorita can optimize inventory management, staffing, and promotional strategies, ultimately enhancing operational efficiency and profitability. As a leading retail chain, Favorita faces challenges such as fluctuating customer demand, seasonality, and external market dynamics, making precise sales predictions crucial for maintaining competitive advantage and meeting customer expectations.

#### Problem Statement
Favorita Corporation, a leading retail chain, is facing significant challenges in accurately forecasting sales across its various stores. The inability to predict sales trends precisely leads to suboptimal inventory management, increased holding costs, missed sales opportunities, and customer dissatisfaction. Seasonal variations, promotional activities, economic factors, and regional differences further complicate the forecasting process.

#### Project Goal
In this project, I will develop a robust and scalable machine learning model to accurately forecast sales for Favorita, enabling the company to optimize inventory management, improve resource allocation, and enhance overall customer satisfaction. The model should be capable of:

- Handling seasonality and trends effectively.

- Incorporating the impact of promotional activities.

- Integrating external factors that influence sales.

- Scaling across different stores and product categories.

By achieving these objectives, Favorita aims to transform its sales forecasting process, driving efficiency, reducing costs, and fostering sustainable growth.

##### Stakeholders

- Management
- Marketing 
- Data Team

##### Features

1. train.csv

- The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.

- store_nbr identifies the store at which the products are sold.

- family identifies the type of product sold.

- sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).

- onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.


2. test.csv

- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.

- The dates in the test data are for the 15 days after the last date in the training data.



3. transactions.csv

- Contains the date, store_nbr, and the number of transactions made at that store on that date.

4. sample_submission.csv

- A sample submission file in the correct format.

5. stores.csv

- Store metadata, including city, state, type, and cluster.

- cluster is a grouping of similar stores.

6. oil.csv

- Daily oil price, which includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

7. holidays_events.csv

- Holidays and Events, with metadata

### Hypothesis

Null Hypothesis

H0: Promotions and discounts do not have a significant impact on sales.

Alternative Hypothesis

H1: Promotions and discounts have a significant positive impact on sales.
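
One way to test H0 against H1 is a one-sided permutation test on the difference in mean sales between promoted and non-promoted observations. The sketch below is an illustration under stated assumptions, not the method used later in this notebook: the function name `permutation_test_mean_diff` is hypothetical, the data are synthetic, and the test treats observations as exchangeable, which daily sales from the same store are unlikely to be in the strict sense.

```python
import numpy as np


def permutation_test_mean_diff(promoted, not_promoted, n_permutations=5000, seed=0):
    """One-sided permutation test for H1: mean(promoted) > mean(not_promoted)."""
    rng = np.random.default_rng(seed)
    promoted = np.asarray(promoted, dtype=float)
    not_promoted = np.asarray(not_promoted, dtype=float)

    observed = promoted.mean() - not_promoted.mean()
    pooled = np.concatenate([promoted, not_promoted])
    n_promoted = promoted.size

    # Count how often a random relabelling gives a difference at least as large
    at_least_as_large = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_promoted].mean() - shuffled[n_promoted:].mean()
        if diff >= observed:
            at_least_as_large += 1

    # Add-one correction keeps the p-value strictly positive
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic illustration only (NOT Favorita data)
rng = np.random.default_rng(42)
sales_promoted = rng.gamma(shape=2.0, scale=300.0, size=200)
sales_not_promoted = rng.gamma(shape=2.0, scale=200.0, size=200)

observed_diff, p_value = permutation_test_mean_diff(sales_promoted, sales_not_promoted)
print(f"Observed mean difference: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

On the real data, the two groups could be formed by splitting rows on `onpromotion > 0` versus `onpromotion == 0`. A small p-value would then favour H1.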




#### Business Questions
1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
4. Did the earthquake impact sales?
5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
6. Are sales affected by promotions, oil prices and holidays?
7. What analysis can we get from the date and its extractable features?
8. Which product families and stores did the promotions affect?
9. What is the difference between RMSLE, RMSE, and MSE (and why is the MAE greater than all of them)?
10. Does the payment of wages in the public sector on the 15th and last days of the month influence store sales?
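
For question 9, it can help to compute the four metrics side by side on the same predictions. The sketch below uses made-up numbers and hypothetical helper names (`mse`, `rmse`, `mae`, `rmsle`). On identical predictions MAE can never exceed RMSE, but it can exceed RMSLE because RMSLE is measured on a log scale, and it can exceed MSE when the absolute errors are below 1.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: penalises large errors quadratically."""
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the units of the target."""
    return float(np.sqrt(mse(y_true, y_pred)))


def mae(y_true, y_pred):
    """Mean absolute error: average error magnitude in target units."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmsle(y_true, y_pred):
    """Root mean squared log error: compares log1p values, so it is relative in nature."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Illustrative values only (not model output)
y_true = np.array([0.0, 10.0, 100.0, 1000.0])
y_pred = np.array([1.0, 12.0, 90.0, 1300.0])

for name, metric in [("MSE", mse), ("RMSE", rmse), ("MAE", mae), ("RMSLE", rmsle)]:
    print(f"{name:>5}: {metric(y_true, y_pred):.4f}")
```

Because sales range from 0 to over 100,000 per row, RMSLE keeps a few very large values from dominating the score, which is one reason it suits this kind of target.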

# Data Understanding

Importing all relevant libraries

In [1]:
# Environment variables management
from dotenv import dotenv_values

In [2]:
%pip install statsmodels
%pip install fancyimpute
%pip install catboost
%pip install lightgbm
%pip install xgboost
%pip install tensorflow
%pip install prophet
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Database connection
import pyodbc
import MySQLdb
import mysql.connector
import pymysql


In [4]:
# Visualization

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, display_html
from scipy import stats


In [5]:
# Data handling and utilities
import pandas as pd
import numpy as np
import re
import calendar
import warnings
import os
import pickle

import joblib

# Data fetching
import requests
import zipfile

#Statistical Analysis
from scipy.stats import skew, kurtosis, chi2_contingency

# Time Series Analysis
import statsmodels.api as sm
import statsmodels.tsa.api as tsa
import prophet

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.stattools import adfuller, kpss
from tqdm.notebook import tqdm

# Suppressing warnings to avoid cluttering the output
warnings.filterwarnings("ignore")



In [6]:
# Data imputation
from fancyimpute import IterativeImputer
from sklearn.experimental import enable_iterative_imputer, enable_halving_search_cv

from sklearn.impute import IterativeImputer, SimpleImputer

# Feature processing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, LabelEncoder, 
    FunctionTransformer, OneHotEncoder, RobustScaler, PowerTransformer, quantile_transform
)
from sklearn.model_selection import (
    train_test_split, StratifiedShuffleSplit, 
    GridSearchCV, RandomizedSearchCV, cross_val_score, 
    cross_val_predict, TimeSeriesSplit
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.utils import resample, estimator_html_repr
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline


# Feature selection
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, 
    mutual_info_classif, RFE, SelectFromModel, SelectPercentile
)

# Machine Learning models
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
import catboost as cb
import lightgbm as lgb
import xgboost as xgb
from xgboost import XGBClassifier



# Neural Networks
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

# Model evaluation
from sklearn.metrics import (
    confusion_matrix, classification_report, make_scorer, 
    accuracy_score, roc_auc_score, precision_score, recall_score, 
    f1_score, log_loss, roc_curve, mean_squared_error, mean_absolute_error
)

# Suppressing warnings to avoid cluttering the output
warnings.filterwarnings("ignore")

# Set display options for Pandas DataFrame
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

# Set theme for plots
sns.set_theme(style="white", palette="pastel", font="sans-serif", font_scale=1.5)
plt.style.use("dark_background")
custom_palette = ["cyan", "magenta", "yellow"]

## Accessing Datasets from their sources.

The data was accessed from different sources: a database, OneDrive and a GitHub repository.

##### Accessing data from the database using the credentials

In [7]:
# Load environment variables from the .env file into a dictionary
environment_variables = dotenv_values(".env")
 
# get the values for the environment variables
server = environment_variables.get("server")
login = environment_variables.get("login")
password = environment_variables.get("password")
database = environment_variables.get("database")
 
# Create a database connection string using pyodbc
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={login};PWD={password}"
# Establish a connection to the database
try:
    connection = pyodbc.connect(connection_string)
    # Do not print the connection string: it contains the password
    print("Connection successful")
except Exception as e:
    print("Connection failed:", e)

Connection successful
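
Because the connection string embeds the password, it is safer to validate the `.env` entries up front and to redact the password before anything is logged. The following is a minimal sketch: the helper names `build_connection_string` and `redact_password` are hypothetical, and the dictionary stands in for the result of `dotenv_values(".env")` with placeholder values.

```python
import re

REQUIRED_KEYS = ("server", "login", "password", "database")


def build_connection_string(env):
    """Build the ODBC connection string, failing early if an entry is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing .env entries: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['server']};DATABASE={env['database']};"
        f"UID={env['login']};PWD={env['password']}"
    )


def redact_password(connection_string):
    """Replace the PWD value with asterisks so the string is safe to print."""
    return re.sub(r"(PWD=)[^;]*", r"\1***", connection_string)


# Placeholder values only; real values would come from the .env file
example_env = {
    "server": "example-server",
    "login": "example-user",
    "password": "not-a-real-password",
    "database": "example-db",
}

connection_string = build_connection_string(example_env)
print("Connecting with:", redact_password(connection_string))
```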


In [8]:
# Define the SQL query to list the tables in the dbo schema
db_query = """
        SELECT *
        FROM INFORMATION_SCHEMA.TABLES
        WHERE TABLE_SCHEMA = 'dbo'
        """

# Read the query result into a DataFrame; a successful read also confirms the connection works
try:
    schema_df = pd.read_sql(db_query, connection)
    print("Data retrieved successfully")
    print()
    print(schema_df)
except Exception as e:
    print("Failed to retrieve data:", e)

Data retrieved successfully

  TABLE_CATALOG TABLE_SCHEMA       TABLE_NAME  TABLE_TYPE
0         dapDB          dbo  holidays_events  BASE TABLE
1         dapDB          dbo              oil  BASE TABLE
2         dapDB          dbo           stores  BASE TABLE


In [9]:
# Define the SQL query to select all rows from the stores table
db_query = """
        SELECT *
        FROM stores        
        """
# Read data from the SQL query result into a DataFrame using the established database connection
df_stores = pd.read_sql(db_query, connection)
 
# Display the DataFrame
df_stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [10]:
df_stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


In [11]:
df_stores.isnull().sum()

store_nbr    0
city         0
state        0
type         0
cluster      0
dtype: int64

In [12]:
df_stores.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
store_nbr,54.0,27.5,15.732133,1.0,14.25,27.5,40.75,54.0
cluster,54.0,8.481481,4.693395,1.0,4.0,8.5,13.0,17.0


In [13]:
# Define the SQL query to select all rows from the oil table
db_query = """
        SELECT *
        FROM oil        
        """
# Read data from the SQL query result into a DataFrame using the established database connection
df_oil = pd.read_sql(db_query, connection)
 
# Display the DataFrame
df_oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.139999
2,2013-01-03,92.970001
3,2013-01-04,93.120003
4,2013-01-07,93.199997


In [14]:
df_oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


In [15]:
df_oil.isnull().sum()

date           0
dcoilwtico    43
dtype: int64

In [16]:
df_oil.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
dcoilwtico,1175.0,67.714366,25.630476,26.190001,46.405001,53.189999,95.66,110.620003


In [17]:
# Define the SQL query to select all rows from the holidays_events table
db_query = """
        SELECT *
        FROM holidays_events        
        """
# Read data from the SQL query result into a DataFrame using the established database connection
df_holidays = pd.read_sql(db_query, connection)
 
# Display the DataFrame
df_holidays.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [18]:
df_holidays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         350 non-null    object
 1   type         350 non-null    object
 2   locale       350 non-null    object
 3   locale_name  350 non-null    object
 4   description  350 non-null    object
 5   transferred  350 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 14.1+ KB


In [19]:
df_holidays.isnull().sum()

date           0
type           0
locale         0
locale_name    0
description    0
transferred    0
dtype: int64

In [20]:
df_holidays.describe().T

Unnamed: 0,count,unique,top,freq
date,350,312,2014-06-25,4
type,350,6,Holiday,221
locale,350,3,National,174
locale_name,350,24,Ecuador,174
description,350,103,Carnaval,10
transferred,350,2,False,338


#### Accessing data from the GitHub repository

In [21]:
# URL of the file to download
url = "https://github.com/EfyaDufie2020/Career_Accelerator_LP3-Regression/raw/main/store-sales-forecasting.zip"
 
# Local file path where the file will be saved
local_file_path = '../Data/store-sales-forecasting.zip'
 
# Create the directory if it doesn't exist
os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
 
# Send a GET request to the URL
response = requests.get(url)
 
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Write the content of the response to the specified file path
    with open(local_file_path, "wb") as f:
        f.write(response.content)
    print("File downloaded successfully")
   
    # Extract the ZIP file
    with zipfile.ZipFile(local_file_path, 'r') as zip_ref:
        zip_ref.extractall(os.path.dirname(local_file_path))
    print("File extracted successfully")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully
File extracted successfully
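
The download cell extracts every member of the archive. A slightly more defensive variant checks the archive for corruption and extracts only the CSV files. This is a sketch under assumptions: `extract_csv_members` is a hypothetical helper, and the demo builds a throwaway archive with dummy contents rather than using the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_csv_members(zip_path, dest_dir):
    """Extract only the .csv members of an archive and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the first corrupt member name, or None if all CRCs match
        corrupt = archive.testzip()
        if corrupt is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {corrupt}")
        csv_members = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        archive.extractall(dest, members=csv_members)
    return sorted(csv_members)


# Demo on a throwaway archive (file names mirror the project, contents are dummies)
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("train.csv", "id,date\n0,2013-01-01\n")
    archive.writestr("transactions.csv", "date,store_nbr,transactions\n")
    archive.writestr("notes.txt", "not a csv")

extracted = extract_csv_members(demo_zip, work_dir / "Data")
print(extracted)
```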


In [22]:
# Read the downloaded CSV file into a DataFrame
df_train = pd.read_csv('../Data/train.csv')
 
# Display the DataFrame
df_train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [23]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


In [24]:
df_train.isnull().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

In [25]:
df_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,3000888.0,1500444.0,866281.891642,0.0,750221.75,1500443.5,2250665.0,3000887.0
store_nbr,3000888.0,27.5,15.585787,1.0,14.0,27.5,41.0,54.0
sales,3000888.0,357.7757,1101.997721,0.0,0.0,11.0,195.8473,124717.0
onpromotion,3000888.0,2.60277,12.218882,0.0,0.0,0.0,0.0,741.0


In [26]:
# Read the downloaded CSV file into a DataFrame
df_transactions = pd.read_csv('../Data/transactions.csv')
 
# Display the DataFrame
df_transactions.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [27]:
df_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


In [28]:
df_transactions.isnull().sum()

date            0
store_nbr       0
transactions    0
dtype: int64

In [29]:
df_transactions.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
store_nbr,83488.0,26.939237,15.608204,1.0,13.0,27.0,40.0,54.0
transactions,83488.0,1694.602158,963.286644,5.0,1046.0,1393.0,2079.0,8359.0


#### Accessing data from OneDrive

In [30]:
# Read the downloaded CSV file into a DataFrame
df_test = pd.read_csv('../Data/test.csv')
 
# Display the DataFrame
df_test.head()


Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [31]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           28512 non-null  int64 
 1   date         28512 non-null  object
 2   store_nbr    28512 non-null  int64 
 3   family       28512 non-null  object
 4   onpromotion  28512 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


In [32]:
df_test.isnull().sum()

id             0
date           0
store_nbr      0
family         0
onpromotion    0
dtype: int64

In [33]:
df_test.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,28512.0,3015144.0,8230.849774,3000888.0,3008015.75,3015143.5,3022271.25,3029399.0
store_nbr,28512.0,27.5,15.586057,1.0,14.0,27.5,41.0,54.0
onpromotion,28512.0,6.965383,20.683952,0.0,0.0,0.0,6.0,646.0


In [34]:
# Read the downloaded CSV file into a DataFrame
df_sample = pd.read_csv('../Data/sample_submission.csv')
 
# Display the DataFrame
df_sample.head()

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


In [35]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      28512 non-null  int64  
 1   sales   28512 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 445.6 KB


In [36]:
df_sample.isnull().sum()

id       0
sales    0
dtype: int64

#### Comments

- The Stores dataset has 54 rows and 5 columns: store_nbr, city, state, type, and cluster. It has no null values.
- The Oil dataset has 1,218 rows and 2 columns: date and the daily oil price (dcoilwtico). The daily oil price has 43 null values.
- The Holidays Events dataset has 350 rows and 6 columns: date, type, locale, locale_name, description, and transferred. It has no null values.
- The Train dataset has 3,000,888 rows and 6 columns: id, date, store_nbr, family, sales, and onpromotion. It has no null values.
- The Test dataset has 28,512 rows and 5 columns and no null values. The Train dataset is roughly 105 times larger than the Test dataset, so there is ample history for model training.
- The Transactions dataset has 83,488 rows and 3 columns: date, store_nbr, and transactions. It has no null values.
- The Sample Submission dataset has 28,512 rows and 2 columns: id and sales. It has no null values.

The Oil, Holidays Events, Transactions, Train, and Test datasets each have a date column stored as object (string) dtype.
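
The object dtype arises because dates are read as plain strings. As an alternative to converting afterwards, `pd.read_csv` can parse them at load time through its `parse_dates` argument (`pd.read_sql` accepts a `parse_dates` argument as well). A minimal sketch, using an in-memory CSV built from the first two transactions rows shown above:

```python
import io

import pandas as pd

# In-memory stand-in for transactions.csv (first two rows from the output above)
csv_text = (
    "date,store_nbr,transactions\n"
    "2013-01-01,25,770\n"
    "2013-01-02,1,2111\n"
)

# parse_dates converts the listed columns to datetime64 while reading
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df_demo.dtypes)
```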





# Exploratory Data Analysis (EDA)

#### Data Quality
The Oil, Holidays Events, Transactions, Train, and Test datasets each have a date column stored as object dtype. These will be converted to datetime.
Only the Oil dataset has null values: 43 in the daily oil price column.
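
Besides the 43 explicit nulls, the oil table has no rows for some calendar days (the first rows shown earlier jump from 2013-01-04 to 2013-01-07), so a left merge on date will create further gaps. One possible treatment, offered as an assumption rather than the approach this notebook settles on, is to reindex to a full daily calendar, interpolate linearly, and back-fill the leading gap:

```python
import numpy as np
import pandas as pd

# First five oil rows as shown in the df_oil.head() output above
oil = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"]
    ),
    "dcoilwtico": [np.nan, 93.139999, 92.970001, 93.120003, 93.199997],
})

# Build a complete daily calendar and align the prices to it
full_range = pd.date_range(oil["date"].min(), oil["date"].max(), freq="D")
oil_daily = oil.set_index("date").reindex(full_range).rename_axis("date")

# Interpolate interior gaps, then back-fill the leading NaN that has no earlier value
oil_daily["dcoilwtico"] = oil_daily["dcoilwtico"].interpolate(method="linear").bfill()
print(oil_daily)
```

Forward-filling is a common alternative when interpolated weekend prices are considered unrealistic.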

In [37]:
# Converting date columns to datetime

df_oil['date'] = pd.to_datetime(df_oil['date'])
df_train['date'] = pd.to_datetime(df_train['date'])
df_transactions['date'] = pd.to_datetime(df_transactions['date'])
df_holidays['date'] = pd.to_datetime(df_holidays['date'])
df_test['date'] = pd.to_datetime(df_test['date'])

In [38]:
# Confirm if the changes have been made
df_holidays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         350 non-null    datetime64[ns]
 1   type         350 non-null    object        
 2   locale       350 non-null    object        
 3   locale_name  350 non-null    object        
 4   description  350 non-null    object        
 5   transferred  350 non-null    bool          
dtypes: bool(1), datetime64[ns](1), object(4)
memory usage: 14.1+ KB
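
Once the date columns are datetime, business question 1 (whether the train dataset contains every required date) can be checked by comparing the observed dates against a full daily calendar. The sketch below uses a hypothetical helper, `missing_dates`, and illustrative dates rather than the real train data:

```python
import pandas as pd


def missing_dates(dates):
    """Return the calendar days between the first and last date that never appear."""
    dates = pd.to_datetime(pd.Series(dates)).dt.normalize()
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    observed = pd.DatetimeIndex(dates.unique())
    return expected.difference(observed)


# Illustrative dates only; on the real data this would be df_train['date']
sample_dates = ["2016-12-23", "2016-12-24", "2016-12-26", "2016-12-27"]
gaps = missing_dates(sample_dates)
print(gaps)
```

An empty result would mean the series is complete. Any dates returned would need to be added back (for example with zero sales) before fitting models that expect a regular daily frequency.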


#### Merging All the Datasets

In [39]:
# Merge the datasets on their shared keys ('store_nbr' and 'date')
# Left-merge df_train with df_stores on 'store_nbr'
columnmerged_df1 = pd.merge(df_train,df_stores, on='store_nbr', how='left')
columnmerged_df1.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13


In [40]:
# Inner-merge columnmerged_df1 with df_transactions on 'date' and 'store_nbr'
columnmerged_df2 = columnmerged_df1.merge(df_transactions, on=['date', 'store_nbr'], how='inner')
columnmerged_df2.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,transactions
0,561,2013-01-01,25,AUTOMOTIVE,0.0,0,Salinas,Santa Elena,D,1,770
1,562,2013-01-01,25,BABY CARE,0.0,0,Salinas,Santa Elena,D,1,770
2,563,2013-01-01,25,BEAUTY,2.0,0,Salinas,Santa Elena,D,1,770
3,564,2013-01-01,25,BEVERAGES,810.0,0,Salinas,Santa Elena,D,1,770
4,565,2013-01-01,25,BOOKS,0.0,0,Salinas,Santa Elena,D,1,770


In [41]:
# Left-merge columnmerged_df2 with df_oil on 'date'
columnmerged_df3 = columnmerged_df2.merge(df_oil, on='date', how='left')
columnmerged_df3.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,transactions,dcoilwtico
0,561,2013-01-01,25,AUTOMOTIVE,0.0,0,Salinas,Santa Elena,D,1,770,
1,562,2013-01-01,25,BABY CARE,0.0,0,Salinas,Santa Elena,D,1,770,
2,563,2013-01-01,25,BEAUTY,2.0,0,Salinas,Santa Elena,D,1,770,
3,564,2013-01-01,25,BEVERAGES,810.0,0,Salinas,Santa Elena,D,1,770,
4,565,2013-01-01,25,BOOKS,0.0,0,Salinas,Santa Elena,D,1,770,


In [42]:
# Inner-merge columnmerged_df3 with df_holidays on 'date' (keeps only dates present in df_holidays)
originaldata = columnmerged_df3.merge(df_holidays, on='date', how='inner')




In [43]:
originaldata.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type_x,cluster,transactions,dcoilwtico,type_y,locale,locale_name,description,transferred
0,561,2013-01-01,25,AUTOMOTIVE,0.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
1,562,2013-01-01,25,BABY CARE,0.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
2,563,2013-01-01,25,BEAUTY,2.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
3,564,2013-01-01,25,BEVERAGES,810.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
4,565,2013-01-01,25,BOOKS,0.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False


In [44]:
originaldata.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 459063 entries, 0 to 459062
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   id            459063 non-null  int64         
 1   date          459063 non-null  datetime64[ns]
 2   store_nbr     459063 non-null  int64         
 3   family        459063 non-null  object        
 4   sales         459063 non-null  float64       
 5   onpromotion   459063 non-null  int64         
 6   city          459063 non-null  object        
 7   state         459063 non-null  object        
 8   type_x        459063 non-null  object        
 9   cluster       459063 non-null  int64         
 10  transactions  459063 non-null  int64         
 11  dcoilwtico    300003 non-null  float64       
 12  type_y        459063 non-null  object        
 13  locale        459063 non-null  object        
 14  locale_name   459063 non-null  object        
 15  description   459

In [45]:
originaldata.isnull().sum()

id                   0
date                 0
store_nbr            0
family               0
sales                0
onpromotion          0
city                 0
state                0
type_x               0
cluster              0
transactions         0
dcoilwtico      159060
type_y               0
locale               0
locale_name          0
description          0
transferred          0
dtype: int64

In [50]:
originaldata['date'].unique()

<DatetimeArray>
['2013-01-01 00:00:00', '2013-01-05 00:00:00', '2013-01-12 00:00:00',
 '2013-02-11 00:00:00', '2013-02-12 00:00:00', '2013-03-02 00:00:00',
 '2013-04-01 00:00:00', '2013-04-12 00:00:00', '2013-04-14 00:00:00',
 '2013-04-21 00:00:00',
 ...
 '2017-06-23 00:00:00', '2017-06-25 00:00:00', '2017-07-03 00:00:00',
 '2017-07-23 00:00:00', '2017-07-24 00:00:00', '2017-07-25 00:00:00',
 '2017-08-05 00:00:00', '2017-08-10 00:00:00', '2017-08-11 00:00:00',
 '2017-08-15 00:00:00']
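
Only 251 unique dates remain because the inner merge with the holidays table keeps just the dates that appear in it, which is also why the merged frame has 459,063 rows against 3,000,888 in the train data. If the full daily history is wanted for forecasting, a left merge with a holiday flag is one alternative. The sketch below uses toy frames with partly made-up values, and the `is_holiday` column is an assumption for illustration:

```python
import pandas as pd

# Toy frames; the sales values are illustrative, not taken from the real data
sales = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
    "store_nbr": [25, 25, 25],
    "sales": [810.0, 500.0, 450.0],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01"]),
    "type": ["Holiday"],
    "description": ["Primer dia del ano"],
})

# Inner merge: non-holiday dates are dropped
inner = sales.merge(holidays, on="date", how="inner")

# Left merge: every sales row is kept, and indicator=True records where a match was found
left = sales.merge(holidays, on="date", how="left", indicator=True)
left["is_holiday"] = left["_merge"].eq("both")
left = left.drop(columns="_merge")

print(len(inner), len(left))
print(left[["date", "sales", "is_holiday"]])
```

Note that the holidays table has 350 rows but only 312 unique dates, so a merge on date alone can duplicate sales rows on days with several events. Deduplicating or aggregating the holidays by date first would avoid that.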
Length: 251, dtype: datetime64[ns]