# Transactions from a bakery

The data belongs to a bakery called "The Bread Basket", located in the historic center of Edinburgh. The bakery offers a refreshing selection of Argentine and Spanish products.


The data set contains 15 010 observations and more than 6 000 transactions from a bakery. It has the following columns:

- **Date**. Categorical variable that tells us the date of the transactions (YYYY-MM-DD format). The column covers dates from 2016-10-30 to 2017-04-09.

- **Time**. Categorical variable that tells us the time of the transactions (HH:MM:SS format).

- **Transaction**. Quantitative variable that allows us to differentiate the transactions. Rows that share the same value in this field belong to the same transaction, which is why the data set has fewer transactions than observations.

- **Item**. Categorical variable that holds the name of the product sold.
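
The difference between observations and transactions can be sketched with a handful of rows. The values are the first rows of the data set as displayed further down in this notebook; the variable names are only illustrative.

```python
import pandas as pd

# A few rows in the same layout as the bakery data set.
sample = pd.DataFrame({
    'Date': ['2016-10-30'] * 5,
    'Time': ['09:58:11', '10:05:34', '10:05:34', '10:07:57', '10:07:57'],
    'Transaction': [1, 2, 2, 3, 3],
    'Item': ['Bread', 'Scandinavian', 'Scandinavian', 'Hot chocolate', 'Jam'],
})

n_observations = len(sample)                      # one row per item sold
n_transactions = sample['Transaction'].nunique()  # distinct transaction ids

print(n_observations, n_transactions)
```

Because two of the transactions contain two items each, the sample has more rows than transactions.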

You can find the original dataset [here](https://www.kaggle.com/aboliveira/bakery-market-basket-analysis/data).

![](./dataset-cover.jpg)

## Getting Started

We will use the Python programming language to look at the data in more detail. Before we continue, we should check which version of Python is installed.

In [1]:
!python --version

Python 2.7.15
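
The same check can also be made from inside the interpreter, which is handy when the notebook kernel and the shell might point at different installations. This is a small sketch using only the standard library; `python_version_string` is a hypothetical helper name.

```python
import sys

def python_version_string():
    """Return the running interpreter's version as 'major.minor.micro'."""
    return '{}.{}.{}'.format(*sys.version_info[:3])

print(python_version_string())
```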


These are some of the libraries that we will use along the way. 

- pandas
- NumPy
- SciPy
- Bokeh

In [2]:
import pandas as pd
import numpy as np
import scipy as sp
import bokeh

Now that we have imported all the tools that we need, let's start by loading the dataset.

In [3]:
raw_df = pd.read_csv('BreadBasket_DMS.csv')
raw_df.dtypes

Date           object
Time           object
Transaction     int64
Item           object
dtype: object
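
Both `Date` and `Time` are loaded as `object`, that is, as plain strings. If real timestamps are needed, one option is to combine the two columns and parse them. The sketch below uses an inline stand-in for the CSV file, and `Timestamp` is a hypothetical column name that is not part of the original data.

```python
import io
import pandas as pd

# Inline stand-in for BreadBasket_DMS.csv (a few rows from the data set).
csv_text = """Date,Time,Transaction,Item
2016-10-30,09:58:11,1,Bread
2016-10-30,10:05:34,2,Scandinavian
2016-10-30,10:07:57,3,Hot chocolate
"""

sample = pd.read_csv(io.StringIO(csv_text))

# 'Timestamp' is a hypothetical column name, not part of the original file.
sample['Timestamp'] = pd.to_datetime(sample['Date'] + ' ' + sample['Time'])

print(sample.dtypes)
```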

Here is a glimpse of what the dataset looks like:

In [4]:
raw_df.head()

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
2,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam


## Cleaning the data

Before we continue, it is a good idea to get the data into a workable format. In this dataset, some rows have no item recorded: the Item column holds the placeholder `NONE` instead. To make our lives easier, we can discard these data points.

In [36]:
def cleanup_dataset(df):
    # Return a new DataFrame without the rows whose Item is 'NONE'
    return df.loc[df['Item'] != 'NONE', :]

dataset = cleanup_dataset(raw_df)
dataset.head()

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
2,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam
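
The first five rows look the same before and after cleaning, so it can be worth checking how many rows were actually removed. The following sketch does this on a small made-up frame; the rows are invented for illustration and do not come from the data set.

```python
import pandas as pd

sample = pd.DataFrame({
    'Transaction': [1, 2, 3, 3],
    'Item': ['Bread', 'NONE', 'Jam', 'NONE'],  # made-up rows for illustration
})

# Boolean mask: True for rows that carry a real item.
has_item = sample['Item'] != 'NONE'
cleaned = sample[has_item]

print(len(sample) - len(cleaned), 'rows removed')
```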


Each row in the table above represents an item in a transaction. We might want to focus on each of these aspects separately. So let's make a list of transactions and a separate list of items!

In [59]:
list_of_transactions = dataset[['Transaction', 'Date', 'Time']].drop_duplicates()
list_of_transactions.head()

Unnamed: 0,Transaction,Date,Time
0,1,2016-10-30,09:58:11
1,2,2016-10-30,10:05:34
3,3,2016-10-30,10:07:57
6,4,2016-10-30,10:08:41
7,5,2016-10-30,10:13:03
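
Dropping duplicates over `Transaction`, `Date` and `Time` only gives one row per transaction if every transaction id maps to a single date and time. A quick way to test that assumption, sketched here on the first rows of the data set, is to check that the ids are unique afterwards.

```python
import pandas as pd

sample = pd.DataFrame({
    'Date': ['2016-10-30'] * 5,
    'Time': ['09:58:11', '10:05:34', '10:05:34', '10:07:57', '10:07:57'],
    'Transaction': [1, 2, 2, 3, 3],
    'Item': ['Bread', 'Scandinavian', 'Scandinavian', 'Hot chocolate', 'Jam'],
})

transactions = sample[['Transaction', 'Date', 'Time']].drop_duplicates()

# If an id were recorded with two different timestamps, it would survive
# drop_duplicates twice and this check would be False.
one_row_per_transaction = transactions['Transaction'].is_unique

print(len(transactions), one_row_per_transaction)
```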


In [60]:
bakery_items = dataset[['Item']].drop_duplicates()
bakery_items.head()

Unnamed: 0,Item
0,Bread
1,Scandinavian
3,Hot chocolate
4,Jam
5,Cookies


## Setting up some auxiliary functions

Here are a few functions that will make it easier for us to interact with our dataframes:

The **Date** and **Time** columns each encode several pieces of information. Splitting this information up will make it easier for us to group data points.

In [41]:
def split_date_field(df_orig):
    """
    Splits the Date column into separate Year, Month, Day and Weekday columns
    (YYYY-MM-DD) -> (YYYY, MM, DD, weekday with Monday=0 ... Sunday=6)
    """
    df = df_orig.copy()
    df['date'] = pd.to_datetime(df['Date'])
    df['Year'] = df['date'].dt.year
    df['Month'] = df['date'].dt.month
    df['Day'] = df['date'].dt.day
    df['Weekday'] = df['date'].dt.weekday
    return df.drop(['date'], axis=1)

split_date_field(dataset).head()

Unnamed: 0,Date,Time,Transaction,Item,Year,Month,Day,Weekday
0,2016-10-30,09:58:11,1,Bread,2016,10,30,6
1,2016-10-30,10:05:34,2,Scandinavian,2016,10,30,6
2,2016-10-30,10:05:34,2,Scandinavian,2016,10,30,6
3,2016-10-30,10:07:57,3,Hot chocolate,2016,10,30,6
4,2016-10-30,10:07:57,3,Jam,2016,10,30,6
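
The `Weekday` column counts from Monday (0) to Sunday (6), so the 6 above means that 2016-10-30 was a Sunday. If names are easier to read than numbers, they can be mapped as in this sketch; `DAY_NAMES` is a hypothetical lookup list, not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup list, indexed like pandas' weekday (Monday=0).
DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

dates = pd.to_datetime(pd.Series(['2016-10-30', '2016-10-31']))
weekday_numbers = dates.dt.weekday

weekday_names = weekday_numbers.map(lambda n: DAY_NAMES[n])

print(list(weekday_names))
```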


In [43]:
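def split_time_field(df_orig):
    """
    Splits the Time column into three separate columns

    (HH:MM:SS) -> (HH, MM, SS)
    """
    df = df_orig.copy()
    # expand=True returns one column per part; iterating over the .str
    # accessor directly is not supported in recent pandas versions
    parts = df['Time'].str.split(':', expand=True)
    df['Hours'], df['Mins'], df['Secs'] = parts[0], parts[1], parts[2]
    return df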

split_time_field(dataset).head()

Unnamed: 0,Date,Time,Transaction,Item,Hours,Mins,Secs
0,2016-10-30,09:58:11,1,Bread,9,58,11
1,2016-10-30,10:05:34,2,Scandinavian,10,5,34
2,2016-10-30,10:05:34,2,Scandinavian,10,5,34
3,2016-10-30,10:07:57,3,Hot chocolate,10,7,57
4,2016-10-30,10:07:57,3,Jam,10,7,57
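
Splitting the string leaves the parts as text. When the hour is needed as a number, for example to count transactions per hour, parsing the time is an alternative. A minimal sketch, where `Hour` is a hypothetical column name:

```python
import pandas as pd

sample = pd.DataFrame({
    'Time': ['09:58:11', '10:05:34', '10:07:57'],
    'Transaction': [1, 2, 3],
})

# Parse the HH:MM:SS strings and keep the hour as an integer.
sample['Hour'] = pd.to_datetime(sample['Time'], format='%H:%M:%S').dt.hour

# Count distinct transactions in each hour of the day.
transactions_per_hour = sample.groupby('Hour')['Transaction'].nunique()

print(transactions_per_hour)
```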


The exact time of day is sometimes too specific to an individual transaction. If we want to look for trends, it might make more sense to look at the approximate time of day.

In [37]:
def extract_time_of_day(df_orig):
    df = df_orig.copy()
    
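    time = df['Time']
    
    # Zero-padded HH:MM:SS strings sort chronologically, so plain string
    # comparison is enough here
    df.loc[(time <'12:00:00'),'Daytime']='Morning'
    df.loc[(time>='12:00:00')&(time <'17:00:00'),'Daytime']='Afternoon'
    df.loc[(time>='17:00:00')&(time <'21:00:00'),'Daytime']='Evening'
    # No upper bound, so times just before midnight are labelled too
    df.loc[(time>='21:00:00'),'Daytime']='Night'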
    
    return df

extract_time_of_day(dataset).head()

Unnamed: 0,Date,Time,Transaction,Item,Daytime
0,2016-10-30,09:58:11,1,Bread,Morning
1,2016-10-30,10:05:34,2,Scandinavian,Morning
2,2016-10-30,10:05:34,2,Scandinavian,Morning
3,2016-10-30,10:07:57,3,Hot chocolate,Morning
4,2016-10-30,10:07:57,3,Jam,Morning
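
The same bucketing can also be written with `pd.cut` on the hour of the day. This is an alternative sketch rather than the approach used above; it assumes the boundaries fall on whole hours and labels everything from 21:00 up to midnight as Night. The times are made up for illustration.

```python
import pandas as pd

times = pd.Series(['09:58:11', '12:00:00', '17:30:00', '23:55:00'])  # made-up
hours = pd.to_datetime(times, format='%H:%M:%S').dt.hour

# right=False makes each bin include its left edge: [0, 12), [12, 17), ...
daytime = pd.cut(
    hours,
    bins=[0, 12, 17, 21, 24],
    right=False,
    labels=['Morning', 'Afternoon', 'Evening', 'Night'],
)

print(list(daytime))
```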


In [51]:
# @TODO: Season
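
One possible way to fill in the season helper noted above is to map the month to a season. The sketch assumes meteorological seasons for the northern hemisphere; both `extract_season` and the `Season` column are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Assumed mapping: meteorological seasons, northern hemisphere.
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

def extract_season(df_orig):
    """Return a copy of df_orig with a hypothetical 'Season' column."""
    df = df_orig.copy()
    months = pd.to_datetime(df['Date']).dt.month
    df['Season'] = months.map(SEASON_BY_MONTH)
    return df

# The middle date is made up; the other two are the first and last dates.
sample = pd.DataFrame({'Date': ['2016-10-30', '2017-01-15', '2017-04-09']})
with_season = extract_season(sample)

print(with_season)
```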

In this dataset, each transaction is identified by a date and time. We can now look more closely at all the transactions that were made during this period.

## Understanding the data

Now we can use the data to answer some real questions!

### How many transactions took place?

In [56]:
list_of_transactions['Transaction'].count()

9465

In [58]:
dataset['Item'].nunique()

94

In [61]:
bakery_items.count()

Item    94
dtype: int64
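
`nunique()` and `drop_duplicates().count()` both report how many different items exist, not how often each one was sold. A sketch of the difference on the first rows of the data set, using `value_counts()` for the per-item counts:

```python
import pandas as pd

items = pd.Series(
    ['Bread', 'Scandinavian', 'Scandinavian', 'Hot chocolate', 'Jam'],
    name='Item',
)

n_distinct = items.nunique()     # how many different items appear
per_item = items.value_counts()  # how many rows mention each item

print(n_distinct)
print(per_item)
```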