I will start with a general overview of the datasets provided and some of their features, mainly the Transactions and Items tables.

Later, I will focus on the Train data, filtered down to store number 25 for the year 2016 only.

In [None]:
import pandas as pd
import numpy as np

In [None]:
holidays_events_df = pd.read_csv('../input/holidays_events.csv', low_memory=False)
items_df = pd.read_csv('../input/items.csv', low_memory=False)
oil_df = pd.read_csv('../input/oil.csv', low_memory=False)
stores_df = pd.read_csv('../input/stores.csv', low_memory=False)
transactions_df = pd.read_csv('../input/transactions.csv', low_memory=False)

# Favorita Transactions 

I will start by adding a few columns to this dataset: month, year, and *day of the week*, so we can see which days of the week are the most popular for shopping.

In [None]:
# Extract year and month from the ISO date string (YYYY-MM-DD)
transactions_df["year"] = transactions_df["date"].astype(str).str[:4].astype(np.int64)
transactions_df["month"] = transactions_df["date"].astype(str).str[5:7].astype(np.int64)
transactions_df['date'] = pd.to_datetime(transactions_df['date'], errors='coerce')
# Series.dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
transactions_df['day_of_week'] = transactions_df['date'].dt.day_name()

# Keep year as a string label
transactions_df["year"] = transactions_df["year"].astype(str)
transactions_df.head()
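The same columns can also be derived from the parsed datetime through the `.dt` accessor instead of string slicing. The following is a minimal sketch on a made-up three-row frame; the `sample` name and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for transactions.csv
sample = pd.DataFrame({
    "date": ["2013-01-01", "2015-06-13", "2017-08-15"],
    "store_nbr": [25, 25, 25],
    "transactions": [770, 1200, 950],
})

# Parse once, then read the calendar parts off the datetime column
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["day_of_week"] = sample["date"].dt.day_name()

print(sample[["date", "year", "month", "day_of_week"]])
```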

This heatmap plots the total transactions for each month and year. Since we don't have data beyond September 2017, those squares appear blank.

December appears to have the most transactions in every year considered.
Also, as we advance through the years, the squares get lighter, which indicates that the number of transactions is increasing each year. This may be because new Favorita stores open each year.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

x = transactions_df.groupby(['month', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="month", columns="year", values="transactions")  # keyword arguments are required in pandas 2.0+
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(y);
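One way to probe whether yearly growth comes from new stores is to count the stores reporting in each year and divide total transactions by that count. This is only a sketch on invented numbers; the `sample` frame and the `per_year` name are assumptions for illustration, and the real check would run on `transactions_df`.

```python
import pandas as pd

# Hypothetical sample: store 2 only starts reporting in 2014
sample = pd.DataFrame({
    "year": ["2013", "2013", "2014", "2014", "2014"],
    "store_nbr": [1, 1, 1, 2, 2],
    "transactions": [100, 120, 110, 90, 95],
})

# Number of distinct stores and total transactions per year
per_year = sample.groupby("year").agg(
    stores=("store_nbr", "nunique"),
    transactions=("transactions", "sum"),
)
per_year["transactions_per_store"] = per_year["transactions"] / per_year["stores"]

print(per_year)
```

If the total rises while `transactions_per_store` stays flat or falls, store openings are a plausible driver of the trend.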

Now, let's see what happens on each day of the week.

As one might expect, Saturdays and Sundays seem to be the preferred days to shop at this supermarket.
Thursdays and Mondays are somewhat quieter.
The year 2017 looks weaker, but only because the heatmap sums transactions over each year and 2017 is missing its last four months of data, so its totals are lower than those of the complete previous years.

In [None]:
x = transactions_df.groupby(['day_of_week', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="day_of_week", columns="year", values="transactions")  # keyword arguments are required in pandas 2.0+
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(y);
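Two adjustments may make this kind of table easier to read: putting the weekdays in calendar order (the pivot sorts them alphabetically) and averaging per day instead of summing, so that a partial year is not penalised. The sketch below uses invented values; `sample`, `day_order` and `table` are illustrative names.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical sample with only two weekdays present
sample = pd.DataFrame({
    "day_of_week": ["Sunday", "Monday", "Sunday", "Monday", "Sunday"],
    "year": ["2016", "2016", "2016", "2017", "2017"],
    "transactions": [300, 100, 500, 120, 420],
})

# Mean transactions per weekday and year
table = sample.pivot_table(index="day_of_week", columns="year",
                           values="transactions", aggfunc="mean")

# Reorder rows Monday..Sunday; weekdays absent from the sample become NaN rows
table = table.reindex(day_order)

print(table)
```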

# Stores

Let's have a look at another dataset, Stores.

I will add another column, the region column. This column will have three values/regions: Sierra, Costa and Oriente.

In [None]:
regions_data = {
    'region': ['Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra',
               'Costa', 'Costa', 'Costa', 'Costa', 'Costa', 'Costa', 'Costa', 'Oriente'],
    'state': ['Imbabura', 'Tungurahua', 'Pichincha', 'Azuay', 'Bolivar', 'Chimborazo',
              'Loja', 'Cotopaxi', 'Esmeraldas', 'Manabi', 'Santo Domingo de los Tsachilas',
              'Santa Elena', 'Guayas', 'El Oro', 'Los Rios', 'Pastaza']}

df_regions = pd.DataFrame(regions_data, columns = ['region', 'state'])

df_regions_cities = pd.merge(df_regions, stores_df, on='state')


transactions_regions = pd.merge(transactions_df, df_regions_cities, on='store_nbr')
transactions_regions.head()
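Because the merge on `state` is an inner join, any store whose state is missing from the hand-written mapping would be dropped silently. A left join with `indicator=True` is one way to check for that. The frames below are invented for illustration; `Unmapped State` is a deliberately fake value.

```python
import pandas as pd

# Hypothetical, shortened versions of the region mapping and the stores table
regions = pd.DataFrame({"region": ["Sierra", "Costa"],
                        "state": ["Pichincha", "Guayas"]})
stores = pd.DataFrame({"store_nbr": [1, 2, 3],
                       "state": ["Pichincha", "Guayas", "Unmapped State"]})

# Keep every store and record whether a matching region row was found
checked = pd.merge(stores, regions, on="state", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "state"].unique()

print(missing)
```

An empty `missing` array would indicate that every store received a region.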

Pichincha stands out with the most transactions in every year considered, followed by Guayas.
The states with the fewest transactions seem to be Bolivar, Chimborazo, Esmeraldas, Manabi and Santa Elena.

In [None]:
x = transactions_regions.groupby(['state', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="state", columns="year", values="transactions")  # keyword arguments are required in pandas 2.0+
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(y);

Looking at individual stores, the lighter rows in the heatmap below are the stores with the most transactions.
Stores such as 3, 9, 11, 43, 45 and 47 are among those with the most transactions.

In [None]:
x = transactions_regions.groupby(['store_nbr', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="store_nbr", columns="year", values="transactions")  # keyword arguments are required in pandas 2.0+
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(y);

# Items

In [None]:
items_df.head()

In [None]:
items_df.family.unique()

There are 33 different food/supplies categories (families). I want to know what share of the items in the catalogue each family accounts for. Note that this counts items per family, not transactions. So I will add a *percentage* column and then plot the results.

In [None]:
items_df_family = items_df.groupby(['family']).size().to_frame(name = 'count').reset_index()
items_df_family['percentage']= items_df_family['count']/items_df_family['count'].sum() * 100
items_df_family.head()
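The same share can be obtained in one step with `value_counts(normalize=True)`. This is a sketch on a four-row toy table; the `items` and `share` names and the rows are illustrative only.

```python
import pandas as pd

# Hypothetical miniature items table
items = pd.DataFrame({
    "item_nbr": [1, 2, 3, 4],
    "family": ["GROCERY I", "GROCERY I", "BEVERAGES", "CLEANING"],
})

# Fraction of items per family, scaled to a percentage
share = items["family"].value_counts(normalize=True).mul(100).rename("percentage")

print(share)
```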

In [None]:
sns.set_style("white")
fig, ax = plt.subplots(figsize=(14,10))
sns.barplot(x="percentage", y="family", data=items_df_family, palette="BuGn_r", ax=ax);

It appears that Grocery I has the most items, with more than 30% of the catalogue. It is followed by Beverages, Cleaning and Produce.

Beauty Care, Hardware, Seafood and Magazines are at the bottom, each accounting for only a small share of the items.

# Items, Transactions and Train Data

Since there are more than 125 million rows in the Train data, I will filter it to show only what happened in store number 25 during 2016.

In [None]:
dtypes = {'store_nbr': np.dtype('int64'),
          'item_nbr': np.dtype('int64'),
          'unit_sales': np.dtype('float64'),
          'onpromotion': np.dtype('O')}


train = pd.read_csv('../input/train.csv', index_col='id', parse_dates=['date'], dtype=dtypes)
date_mask = (train['date'] >= '2016-01-01') & (train['date'] <= '2016-12-31') & (train['store_nbr'] == 25)
train = train[date_mask]
train.head()
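If loading the whole file at once is too heavy for the available memory, one option is to read it in chunks and keep only the matching rows from each chunk. The sketch below demonstrates the pattern on a tiny in-memory CSV; the rows are invented, and the chunk size of 2 is only there to force several chunks (a real run would use a much larger value).

```python
import io
import pandas as pd

# Hypothetical five-row stand-in for train.csv
csv_text = """id,date,store_nbr,item_nbr,unit_sales
0,2015-12-31,25,103665,7.0
1,2016-01-02,25,105574,1.0
2,2016-01-02,1,105574,3.0
3,2016-06-15,25,105575,2.0
4,2017-01-01,25,105575,4.0
"""

pieces = []
reader = pd.read_csv(io.StringIO(csv_text), index_col="id",
                     parse_dates=["date"], chunksize=2)
for chunk in reader:
    # Keep only store 25 rows dated in 2016
    mask = (chunk["date"].dt.year == 2016) & (chunk["store_nbr"] == 25)
    pieces.append(chunk[mask])

subset = pd.concat(pieces)
print(subset)
```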

It would be interesting to merge the Train data with the Items table so we have more information to work with.
I can merge both tables on *item_nbr* and then sort the result by date.

In [None]:
df_train_item = pd.merge(train, items_df, on='item_nbr').sort_values(by='date')
df_train_item["year"] = df_train_item["date"].astype(str).str[:4].astype(np.int64)
df_train_item["month"] = df_train_item["date"].astype(str).str[5:7].astype(np.int64)
df_train_item.head()

For store number 25, during 2016, I will plot the top 7 food/supplies categories and, for each one, the number of sales records in each month.

In [None]:
sns.set_style("white")
fig, ax = plt.subplots(figsize=(13, 9))  # plt.subplots returns (figure, axes)
sns.countplot(x="family", hue="month", data=df_train_item, palette="Greens_d",
              order=df_train_item.family.value_counts().iloc[:7].index, ax=ax);
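A count plot counts rows, that is, one record per item and day, rather than units sold. To compare the two views as tables, a cross-tabulation gives the row counts and a pivot table gives the summed `unit_sales`. The sketch below uses invented rows; `sample`, `row_counts` and `units` are illustrative names.

```python
import pandas as pd

# Hypothetical sample of merged train/items rows
sample = pd.DataFrame({
    "family": ["GROCERY I", "GROCERY I", "BEVERAGES", "GROCERY I"],
    "month": [1, 1, 1, 2],
    "unit_sales": [3.0, 2.0, 10.0, 1.0],
})

# Number of sales records per family and month (what a count plot shows)
row_counts = pd.crosstab(sample["family"], sample["month"])

# Units sold per family and month
units = sample.pivot_table(index="family", columns="month",
                           values="unit_sales", aggfunc="sum", fill_value=0)

print(row_counts)
print(units)
```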

It appears that the top 7 categories are Grocery I, Beverages, Cleaning, Produce, Dairy, Bread and Personal Care.

We don't have data for September, so this month does not appear on the graph. For October we only have a few days of data, so its bar is far lower than those of the other months.

Let us see which are the top 30 products for this store.
Below is the list, along with the number of times each product appears in this store's sales records.

In [None]:
df_train_item['item_nbr'].value_counts().nlargest(30)
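Item numbers alone are hard to interpret, so it can help to attach each top item's family to the counts. This is a sketch on toy data; `sample`, `items` and `top` are illustrative names, and the real version would use `df_train_item` and `items_df`.

```python
import pandas as pd

# Hypothetical sales records and items lookup
sample = pd.DataFrame({"item_nbr": [10, 10, 10, 20, 20, 30]})
items = pd.DataFrame({
    "item_nbr": [10, 20, 30],
    "family": ["GROCERY I", "BEVERAGES", "CLEANING"],
})

# Count records per item, keep the two most frequent, then look up the family
top = (sample["item_nbr"].value_counts()
       .nlargest(2)
       .rename("n_records")
       .rename_axis("item_nbr")
       .reset_index())
top = top.merge(items[["item_nbr", "family"]], on="item_nbr", how="left")

print(top)
```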