# Overview Solution Workflow
There are 7 stages to complete when approaching an ML problem:
1. Question / Problem definition
2. Acquire training and testing data
3. Wrangle, prepare and cleanse data
4. Analyze, identify patterns and explore the data
5. Model, predict and solve the problem
6. Visualize, report and present the problem-solving steps and final solution
7. Supply or submit the results



# Stage 1: Question / Problem Definition

The objective of this project is to predict sales for the thousands of product families sold at Favorita stores in Ecuador. The data includes dates, store and product information, whether an item was being promoted, and the sales numbers. Additional files contain supplementary information that may be useful in building the models.

# Stage 2: Acquire training and testing data
There are 6 .csv files provided:

1. train.csv, which includes:
    - store_nbr
    - family
    - onpromotion
    - (target) sales

2. test.csv, which includes the same features as the training data

3. sample_submission.csv, which is a sample submission file in the correct format

4. stores.csv, which includes:
    - Store metadata, including city, state, type, and cluster.
        - Cluster is a grouping of similar stores.

5. oil.csv

    - Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

6. holidays_events.csv

    - Holidays and Events, with metadata
    - NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12.
    - Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by a day of type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.
    - Additional holidays are days added to a regular calendar holiday, as typically happens around Christmas (making Christmas Eve a holiday).
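The transfer rule described above can be sketched in pandas. This is only an illustration under stated assumptions: the rows are typed in by hand (the Guayaquil pair comes from the note, the Work Day row is hypothetical), and the column names `date` and `description` are assumed rather than confirmed by the text.

```python
import pandas as pd

# Hand-written illustrative rows, not read from holidays_events.csv.
# `date` and `description` are assumed column names; the Work Day row is hypothetical.
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2013-01-05"]),
    "type": ["Holiday", "Transfer", "Work Day"],
    "description": ["Independencia de Guayaquil",
                    "Independencia de Guayaquil (moved)",
                    "Make-up working day"],
    "transferred": [True, False, False],
})

# A transferred holiday behaves like a normal day, and a Work Day is a
# make-up working day, so neither is treated as a day off here.
is_day_off = ~holidays["transferred"] & (holidays["type"] != "Work Day")
celebrated = holidays.loc[is_day_off, ["date", "type", "description"]]
print(celebrated)
```

With these rows, only the Transfer row on 2012-10-12 remains, which matches the example in the note.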

Additional Notes

    - Wages in the public sector are paid every two weeks on the 15th and on the last day of the month. Supermarket sales could be affected by this.
    - A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts donating water and other first need products which greatly
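One possible way to expose the payday pattern to a model is a boolean flag per date. The sketch below assumes a `date` column of datetime dtype and uses a hypothetical column name `is_payday`; the four dates are hand-picked for illustration.

```python
import pandas as pd

# Hand-picked dates for illustration only.
dates = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-14", "2016-04-15", "2016-04-29", "2016-04-30"])
})

# Payday is the 15th or the last day of the month.
# `is_payday` is a hypothetical feature name.
dates["is_payday"] = (dates["date"].dt.day == 15) | dates["date"].dt.is_month_end
print(dates)
```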

# Initializing the required libraries and Analyzing the data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [9]:
training_data = pd.read_csv('store_sales_data/train.csv')
test_data = pd.read_csv('store_sales_data/test.csv')
oil_data = pd.read_csv('store_sales_data/oil.csv')
holidays_events_data = pd.read_csv('store_sales_data/holidays_events.csv')
sample_submission_data = pd.read_csv('store_sales_data/sample_submission.csv')
transactions_data = pd.read_csv('store_sales_data/transactions.csv')
training_data.head()

   id        date  store_nbr      family  sales  onpromotion
0   0  2013-01-01          1  AUTOMOTIVE    0.0            0
1   1  2013-01-01          1   BABY CARE    0.0            0
2   2  2013-01-01          1      BEAUTY    0.0            0
3   3  2013-01-01          1   BEVERAGES    0.0            0
4   4  2013-01-01          1       BOOKS    0.0            0


# Analyzing dtypes.

This is done to better understand how the data can be manipulated later on.

In [4]:
training_data.dtypes

id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object

# Observation and next step

It appears that date and family are of dtype object. Let's take a look at their attributes.

In [6]:
# Summarize only the object-dtype columns (date and family)
training_data.describe(include=['O'])


              date      family
count      3000888     3000888
unique        1684          33
top     2013-01-01  AUTOMOTIVE
freq          1782       90936
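The summary above also allows a quick consistency check using only the printed numbers. Since `freq` is the count of the most common value, `unique * freq == count` can only hold if every value occurs equally often. The variable names below are hypothetical, and the reading of 54 as a store count assumes each store and family pair appears once per date.

```python
# Numbers copied from the describe() output above.
count = 3000888
unique_dates, top_date_freq = 1684, 1782
unique_families, top_family_freq = 33, 90936

# If the most frequent value times the number of unique values equals the
# row count, every date (and every family) has the same number of rows.
assert unique_dates * top_date_freq == count
assert unique_families * top_family_freq == count

# Rows per date divided by the number of families; under the assumption
# stated above this would be the number of stores.
rows_per_date_per_family = top_date_freq // unique_families
print(rows_per_date_per_family)  # 54
```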


In [13]:
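# Convert the training dates from strings to datetime64
training_data['date'] = pd.to_datetime(training_data['date'])
training_data['date']
# Only the last expression in a cell is displayed; test_data has not been
# converted, so its date column still has dtype object
test_data['date']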



0        2017-08-16
1        2017-08-16
2        2017-08-16
3        2017-08-16
4        2017-08-16
            ...    
28507    2017-08-31
28508    2017-08-31
28509    2017-08-31
28510    2017-08-31
28511    2017-08-31
Name: date, Length: 28512, dtype: object
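The output above still reports dtype object for `test_data['date']`, because only the training frame was converted. A minimal sketch of applying the same conversion to both frames follows; the small `train_demo` and `test_demo` frames are hypothetical stand-ins for `training_data` and `test_data`, so the block runs without the CSV files.

```python
import pandas as pd

# Stand-in frames with string dates, mimicking the freshly loaded CSVs.
train_demo = pd.DataFrame({"date": ["2013-01-01", "2013-01-02"]})
test_demo = pd.DataFrame({"date": ["2017-08-16", "2017-08-31"]})

# Apply the same conversion to every frame so the dtypes stay in sync.
for frame in (train_demo, test_demo):
    frame["date"] = pd.to_datetime(frame["date"], format="%Y-%m-%d")

print(train_demo.dtypes)
print(test_demo.dtypes)
```

Keeping both frames on the same dtype avoids surprises later, for example when comparing or merging on the date column.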

In [14]:
training_data.dtypes

id                      int64
date           datetime64[ns]
store_nbr               int64
family                 object
sales                 float64
onpromotion             int64
dtype: object
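With the column stored as datetime64, the `.dt` accessor becomes available. The sketch below shows one possible use, pulling calendar parts out of the date; the frame `demo` and the new column names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Two dates taken from the outputs above, used as a tiny stand-in frame.
demo = pd.DataFrame({"date": pd.to_datetime(["2013-01-01", "2017-08-16"])})

# Calendar parts derived through the .dt accessor.
demo["year"] = demo["date"].dt.year
demo["month"] = demo["date"].dt.month
demo["day_of_week"] = demo["date"].dt.dayofweek  # Monday=0 ... Sunday=6
print(demo)
```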