# **Time Series Regression Analysis**

## **<span style="color: blue;">Business Understanding</span>**

The main objective is to build a model that accurately predicts unit sales for thousands of products across different Favorita store locations. This will help ensure optimal inventory management, reduce waste, and improve operational efficiency.

We have access to several datasets:

1. Training data: Contains dates, store and product information, promotions, and sales figures.
2. Store metadata: Includes cluster, type, city, and state information.
3. Oil price data: Daily oil prices for the relevant time period.
4. Transactions data: Daily transaction counts for each store.
5. Holidays and events data: Holiday and event records, used to study their effect on sales.

Key features include:

1. store_nbr: Identifies the store location
2. family: Product category
3. sales: Total sales for a product family at a specific store on a given date
4. onpromotion: Number of items in a product family on promotion

### **Hypothesis**

**Null Hypothesis (H0):**

Promotional activities (as measured by the 'onpromotion' variable) have no significant effect on the daily sales of products across Favorita stores.

Mathematically, we can express this as:
        
H0: β = 0

where β is the coefficient of the 'onpromotion' variable in our predictive model.

**Alternative Hypothesis (H1):**

Promotional activities have a significant effect on the daily sales of products across Favorita stores.

Mathematically, we can express this as:

H1: β ≠ 0
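A minimal sketch of how this hypothesis could be tested, assuming a simple one-variable linear regression of sales on 'onpromotion'. The data is synthetic and purely illustrative, and `slope_t_statistic` is a hypothetical helper; the model built later in this notebook may differ, and a statistics library would normally report the p-value directly.

```python
import math
import random


def slope_t_statistic(x, y):
    """Return (beta, t) for the least-squares fit y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    # Residual sum of squares; the variance estimate uses n - 2 degrees of freedom
    rss = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta / se_beta


# Synthetic, illustrative data: sales rise by about 2 units per promoted item.
random.seed(0)
onpromotion = [random.randint(0, 10) for _ in range(200)]
sales = [5.0 + 2.0 * p + random.gauss(0, 3) for p in onpromotion]

beta, t_stat = slope_t_statistic(onpromotion, sales)
# With about 200 observations, |t| > 1.96 rejects H0 at roughly the 5% level.
reject_h0 = abs(t_stat) > 1.96
print(f"beta={beta:.2f}, t={t_stat:.1f}, reject H0: {reject_h0}")
```

A large absolute t-statistic means the estimated β is many standard errors away from zero, which is evidence against H0.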

### **Analytical Questions**

1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year (excluding days on which stores were closed)?
3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
4. Did the earthquake of April 16, 2016 impact sales?
5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
6. Are sales affected by promotions, oil prices and holidays?
7. What insights can we derive from the date and the features that can be extracted from it?
8. Which product families and stores did the promotions affect?
9. What is the difference between RMSLE, RMSE, and MSE (and why might the MAE be greater than all of them)?
10. Does the payment of public-sector wages on the 15th and last day of each month influence store sales?
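For question 9, the four metrics can be compared directly on a small example. This is a sketch using the standard definitions; the numbers are made up for illustration and `regression_metrics` is a hypothetical helper. On the same data MAE never exceeds RMSE, but it can exceed MSE when the errors are smaller than 1, and it can exceed RMSLE because RMSLE is computed on a log scale.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and RMSLE for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # RMSLE compares log(1 + value), so it requires non-negative inputs
    log_errors = [math.log1p(p) - math.log1p(t) for t, p in zip(y_true, y_pred)]
    rmsle = math.sqrt(sum(e ** 2 for e in log_errors) / n)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RMSLE": rmsle}


# Illustrative values only; every error here is smaller than 1 in magnitude.
actual = [3.0, 0.0, 10.0, 7.0]
predicted = [2.5, 0.2, 10.8, 6.9]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because squaring shrinks errors below 1, MSE falls under MAE in this example, while RMSE stays at or above MAE.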

## **<span style="color: blue;">Data Understanding</span>**

### **<span style="color: skyblue;">Importation</span>**

**Importation of all necessary packages**

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Dataset Importation & Loading**

In [42]:
# Connect to the database using pyodbc
import warnings

import pyodbc
from dotenv import dotenv_values  # reads key/value pairs from a .env file

print("pyodbc is installed and imported successfully")

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load environment variables from the .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the credentials set in the .env file (a missing key yields None)
server = environment_variables.get("SERVER_NAME")
database = environment_variables.get("DATABASE_NAME")
login = environment_variables.get("LOGIN")
password = environment_variables.get("PASSWORD")

pyodbc is installed and imported successfully
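
Because `.get` returns `None` for a missing key, a misnamed entry in `.env` would only surface later as a confusing connection error. The sketch below shows one way to fail early. `require_settings` is a hypothetical helper, not part of python-dotenv, and the dictionary is a placeholder standing in for the result of `dotenv_values('.env')`.

```python
def require_settings(settings, keys):
    """Return the requested values, raising if any is missing or empty."""
    missing = [key for key in keys if not settings.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return [settings[key] for key in keys]


# Placeholder values standing in for dotenv_values('.env').
example_env = {
    "SERVER_NAME": "example-server",
    "DATABASE_NAME": "example-db",
    "LOGIN": "example-user",
    "PASSWORD": "example-password",
}

checked = require_settings(
    example_env, ["SERVER_NAME", "DATABASE_NAME", "LOGIN", "PASSWORD"]
)
print(f"{len(checked)} settings found")
```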


In [43]:
# Create a connection string
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={login};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
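
ODBC connection strings use `;` as a separator, so a credential that contains `;` could be cut short when interpolated directly. A minimal sketch of the usual ODBC convention, which wraps the value in braces and doubles any closing brace. `odbc_quote` is a hypothetical helper and all values below are placeholders.

```python
def odbc_quote(value):
    """Wrap a value in braces so ';' and similar characters are kept literal."""
    return "{" + value.replace("}", "}}") + "}"


# Placeholder credentials for illustration only.
parts = {
    "DRIVER": "{SQL Server}",
    "SERVER": "example-server",
    "DATABASE": "example-db",
    "UID": "example-user",
    "PWD": odbc_quote("p;ss}word"),
}
example_connection_string = ";".join(f"{k}={v}" for k, v in parts.items()) + ";"
print(example_connection_string)
```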


In [44]:
# Connect to the database
connection = pyodbc.connect(connection_string)
print("Connection successful")

Connection successful


In [45]:
# Query the oil table in the SQL database
query = "SELECT * FROM dbo.oil"

# Execute the query and read the results into a DataFrame
oil = pd.read_sql(query, connection)

In [46]:
# Query the holidays_events table in the SQL database
query = "SELECT * FROM dbo.holidays_events"

# Execute the query and read the results into a DataFrame
holidays_events = pd.read_sql(query, connection)

In [47]:
# Query the stores table in the SQL database
query = "SELECT * FROM dbo.stores"

# Execute the query and read the results into a DataFrame
stores = pd.read_sql(query, connection)

In [30]:
sample_submission = pd.read_csv('../Data/sample_submission.csv')

In [32]:
test = pd.read_csv('../Data/test.csv')

In [35]:
train = pd.read_csv('../Data/train.csv')

In [37]:
transactions = pd.read_csv('../Data/transactions.csv')
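
Once `train` is loaded, the first analytical question (whether every date is present) can be checked by comparing the observed dates against a full daily range. A minimal sketch, assuming the date column is named `date`. The small frame below is synthetic and stands in for the real `train` data, and `missing_dates` is a hypothetical helper.

```python
import pandas as pd


def missing_dates(frame, date_column="date"):
    """Return the calendar days absent between the first and last date."""
    dates = pd.to_datetime(frame[date_column])
    observed = pd.DatetimeIndex(dates.unique())
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    return full_range.difference(observed)


# Synthetic stand-in for train: 2013-01-03 is deliberately left out.
example = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-02", "2013-01-04", "2013-01-04"],
        "sales": [0.0, 2.0, 3.5, 1.0],
    }
)
gaps = missing_dates(example)
print(gaps)
```

An empty result would mean the date range is complete; any dates returned are candidates for closed days or missing records.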