# Business Understanding

## Project Overview

This project aims to improve sales forecasting for Corporation Favorita, a large Ecuador-based grocery retailer. The focus is on developing a robust time series forecasting model that accurately predicts unit sales for thousands of items across Favorita stores.

### Objectives

- Build a model that predicts unit sales based on historical data and various features.
- Identify factors influencing sales, including promotions, oil prices, holidays, and store-specific attributes.
- Analyze the impact of external events, such as the earthquake and public-sector wage payments, on supermarket sales.

### Business Context

Accurate sales forecasting is crucial for optimizing inventory management, ensuring adequate stock levels, and ultimately improving overall business efficiency. By understanding the key drivers of sales and developing a reliable forecasting model, Favorita can make informed decisions to meet customer demand, manage resources efficiently, and adapt to external factors affecting sales.

# Project Introduction

## Problem Statement

Corporation Favorita faces the challenge of predicting unit sales across its stores accurately. The current forecasting process may not fully capture the complexity of factors influencing sales, leading to suboptimal inventory management and resource allocation.

## Scope

- Time series forecasting for unit sales using historical data.
- Analysis of sales drivers such as promotions, oil prices, holidays, the earthquake, and public-sector wage payments.

## Data Description

The dataset includes training and test data with features like store_nbr, family, onpromotion, and sales. Additional files provide information on transactions, store metadata, oil prices, and holidays.
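
As a rough sketch of how these files might be combined, the snippet below joins a toy sales frame to store metadata on `store_nbr` and to oil prices on `date`. The three frames and their values are made up for illustration; only the column names come from the tables in this notebook, and the choice of left joins is an assumption rather than the project's confirmed approach.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative, not from the dataset)
train = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-02", "2013-01-03"]),
    "store_nbr": [1, 2, 1],
    "family": ["GROCERY I", "GROCERY I", "GROCERY I"],
    "sales": [10.0, 7.0, 12.0],
    "onpromotion": [0, 1, 0],
})
stores = pd.DataFrame({
    "store_nbr": [1, 2],
    "city": ["Quito", "Quito"],
    "cluster": [13, 13],
})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03"]),
    "dcoilwtico": [93.14, 92.97],
})

# Left joins keep every sales row even when a lookup table has no match
merged = (
    train
    .merge(stores, on="store_nbr", how="left")
    .merge(oil, on="date", how="left")
)
print(merged.shape)
```

A left join is used here so that sales rows on days without an oil quote survive the merge with a missing `dcoilwtico` value instead of being dropped.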

## Hypotheses & Questions

- Is the training dataset complete with all required dates?
- Which dates have the lowest and highest sales for each year?
- Did the earthquake impact sales?
- Are certain groups of stores selling more products?
- How do promotions, oil prices, and holidays affect sales?
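
The first question, whether every calendar day is present, can be checked by comparing the observed dates with a full daily range. The helper below is a minimal sketch: `missing_dates` is a hypothetical name, and the sample dates are invented to show one gap.

```python
import pandas as pd

def missing_dates(dates):
    """Return the calendar days absent between the first and last date."""
    observed = pd.DatetimeIndex(pd.to_datetime(dates)).normalize().unique()
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return expected.difference(observed)

# Toy example: 2013-01-03 is deliberately left out
sample = pd.Series(["2013-01-01", "2013-01-02", "2013-01-04"])
gaps = missing_dates(sample)
print(list(gaps.strftime("%Y-%m-%d")))
```

On the real data, the same function could be called with the `date` column of the training table, assuming that column parses cleanly as dates.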

## Evaluation Metrics

The model's performance will be evaluated using Root Mean Squared Logarithmic Error (RMSLE), where lower values are better. The target is an RMSLE of 0.2 or below, which would indicate excellent performance.
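
For reference, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The function below is a small NumPy sketch of that definition; the name `rmsle` and the decision to raise on negative inputs are choices made for this example.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    # log1p(x) computes log(1 + x), which keeps zero sales well defined
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

print(rmsle([10, 0, 5], [10, 0, 5]))  # identical inputs give 0.0
```

Because regression models can produce negative predictions, one option is to clip predictions at zero before scoring them with this metric.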

## Documentation Expectations

Comprehensive documentation will be crucial, covering data cleaning, analysis, hypothesis testing, and the model building process. Clear explanations and insights will be rewarded.

This project aligns with Favorita's strategic goals of improving sales forecasting accuracy, optimizing inventory management, and adapting to external market dynamics.


## Importing the necessary modules

In [23]:
# Standard library
import datetime as dt
import logging
import warnings

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and statistics
import statsmodels.api as sm
from scipy.stats import shapiro, ttest_ind
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Data access
import gdown
import pyodbc
from dotenv import dotenv_values

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


In [24]:
environment_variables = dotenv_values('.env')

In [25]:
# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("ServerName")
database = environment_variables.get("DataBase")
username = environment_variables.get("Username")
password = environment_variables.get("Password")


connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
connection = pyodbc.connect(connection_string)
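
If a key is absent from the `.env` file, `.get()` returns `None` and the connection string silently contains the text "None". The sketch below shows one way to fail early instead. `build_connection_string` and `REQUIRED_KEYS` are hypothetical names, and the credentials in the example call are placeholders.

```python
REQUIRED_KEYS = ("ServerName", "DataBase", "Username", "Password")

def build_connection_string(env):
    """Build the ODBC connection string, failing early on missing credentials."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing values in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['ServerName']};DATABASE={env['DataBase']};"
        f"UID={env['Username']};PWD={env['Password']};"
        "MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
    )

# Placeholder values for illustration only
example = build_connection_string(
    {"ServerName": "host", "DataBase": "db", "Username": "user", "Password": "secret"}
)
print(example.startswith("DRIVER={SQL Server};SERVER=host;"))
```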

In [27]:
query = "SELECT * FROM [dbo].[holidays_events]"
Table1 = pd.read_sql(query, connection)

In [28]:
Table1.head()

| | date | type | locale | locale_name | description | transferred |
|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False |


In [29]:
query = "SELECT * FROM [dbo].[oil]"
Table2 = pd.read_sql(query, connection)
Table2.head()


| | date | dcoilwtico |
|---|---|---|
| 0 | 2013-01-01 | NaN |
| 1 | 2013-01-02 | 93.139999 |
| 2 | 2013-01-03 | 92.970001 |
| 3 | 2013-01-04 | 93.120003 |
| 4 | 2013-01-07 | 93.199997 |


In [30]:
query = "SELECT * FROM [dbo].[stores]"
Table3 = pd.read_sql(query, connection)
Table3.head()

| | store_nbr | city | state | type | cluster |
|---|---|---|---|---|---|
| 0 | 1 | Quito | Pichincha | D | 13 |
| 1 | 2 | Quito | Pichincha | D | 13 |
| 2 | 3 | Quito | Pichincha | D | 8 |
| 3 | 4 | Quito | Pichincha | D | 9 |
| 4 | 5 | Santo Domingo | Santo Domingo de los Tsachilas | D | 4 |


In [31]:
df = pd.read_csv('C:\\Users\\Brian Bassey\\Desktop\\Project_Portfolio\\LP3\\LP3 Datasets.csv')
df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 10, saw 4
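
The message above reports that the parser settled on one field per row from the first line and then met four fields on line 10. That pattern often appears when a file starts with non-tabular lines or uses an unexpected delimiter, although the actual cause here is not known from this output. A possible first step, sketched below, is to look at the raw lines before parsing. `peek_lines` is a hypothetical helper, and the temporary file is an invented stand-in, since the contents of `LP3 Datasets.csv` are not shown.

```python
import os
import tempfile
from itertools import islice

import pandas as pd

def peek_lines(path, n=12):
    """Return the first n raw lines of a text file, without parsing."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]

# Hypothetical file: nine single-field preamble lines, then a 4-column table
content = "\n".join(
    [f"note {i}" for i in range(1, 10)] + ["a,b,c,d", "1,2,3,4", "5,6,7,8"]
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(content + "\n")
    demo_path = tmp.name

# Print numbered raw lines to spot where the real header starts
for number, line in enumerate(peek_lines(demo_path), start=1):
    print(number, line)

# Once the header row is identified (row 10 here), skip what precedes it
demo = pd.read_csv(demo_path, skiprows=9)
os.remove(demo_path)
print(demo.shape)
```

If the raw lines instead reveal a different separator, passing it through the `sep` argument of `pd.read_csv` would be the corresponding fix.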
