In [2]:
import sys
import pandas as pd
import seaborn as sns

# Check whether we are running in Colab
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted")
else:
    print("💻 Running locally (Jupyter), Drive is not needed")

## Sales at The Bread Basket bakery



In this notebook we will be visualizing the 2016-2017 sales data at *The Bread Basket* bakery in Edinburgh.

The data comes from the set published on [Kaggle](https://www.kaggle.com/akashdeepkuila/bakery) under the *CC0* license.

Let's get familiar with the content of the notebook and follow the instructions to prepare the data that we are going to need during classes.

**Note!** When you come back to the document later, remember to re-run the code cells.

### Dataset contents

The **bakery_sales.csv** file imported below contains 20 507 items assigned to 9 684 client transactions, with the following information:


*   **TransactionNo** - transaction number
*   **Items** - purchased items
*   **DateTime** - time of transaction
*   **Daypart** - time of the day
*   **DayType** - weekday or weekend.

The necessary data is provided with the document: the code below imports the files we need.

In [None]:
drive.mount('/content/drive')

The data is imported into a pandas DataFrame, which lets us visualize it efficiently.

In [None]:
bakery_data = pd.read_csv('/content/drive/My Drive/Vis/Bakery Data/bakery_sales.csv')
bakery_data

## Notebook preparation

We start by making sure that the data types were identified correctly and by making the necessary conversions.

Based on the data overview, we expect the first column to contain consecutive integers, the second the names of sold products, the third time-based data, and the last two text-based information.

### Checking data types

Run the instructions below to build the DataFrames used during classes.

First let's check how the data was identified on import.

In [None]:
bakery_data.dtypes

Let's check whether the data has any records with missing information in any of the columns.

In [None]:
"complete records: " + str(len(bakery_data.dropna(how="any"))) + "; total records: " + str(len(bakery_data))

Let's also take a look at what data is really hidden under the **object** type for each of the columns.

In [None]:
for column in bakery_data.columns:
  check_types = bakery_data[column].apply(lambda x: type(x))
  print(check_types.value_counts())
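
The completeness check above only compares totals. To see which columns any missing values come from, a per-column count can help. This is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are made up for illustration and are not taken from **bakery_sales.csv**):

```python
import pandas as pd

# Hypothetical toy frame; None stands in for a missing value
toy = pd.DataFrame({
    "TransactionNo": [1, 1, 2, 3],
    "Items": ["Bread", None, "Coffee", "Tea"],
    "Daypart": ["Morning", "Morning", None, "Evening"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that have at least one missing value
incomplete_rows = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(incomplete_rows)
```

The same two expressions should work on **bakery_data** by replacing `toy` with `bakery_data`.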

#### Date conversion

By default, the transaction time is read in as a *string*.

Let's convert the **DateTime** column to *timestamp*.

In [None]:
bakery_data["DateTime"] = pd.to_datetime(bakery_data["DateTime"])
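
When date strings are ambiguous (is `01-11-2016` the 1st of November or the 11th of January?), it is safer to state the layout explicitly. The sketch below assumes a day-first layout purely for illustration; the sample strings are hypothetical, and the actual layout of the **DateTime** column should be checked in the file before choosing a `format`.

```python
import pandas as pd

# Hypothetical day-first strings; the real file may use a different layout
raw = pd.Series(["01-11-2016 09:58", "30-10-2016 10:05", "not a date"])

# An explicit format removes the day/month ambiguity;
# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M", errors="coerce")

print(parsed)
print("unparseable values:", parsed.isna().sum())
```

Counting the `NaT` values after the conversion is a quick way to confirm that every record was parsed.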

We'll add a new column with the transaction date, called **Date**, and validate the conversion.

In [None]:
bakery_data["Date"] = bakery_data["DateTime"].dt.date

In [None]:
bakery_data["Date"].value_counts()
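
Note that `.dt.date` produces plain Python `date` objects, so the new column has the generic *object* type. If a datetime-typed column is preferred, `.dt.normalize()` is one alternative. The sketch below compares the two on a few hypothetical timestamps; it is not part of the original preparation steps.

```python
import pandas as pd

# Hypothetical timestamps used only for illustration
stamps = pd.Series(pd.to_datetime([
    "2016-10-30 09:58:11",
    "2016-10-30 10:05:34",
    "2016-10-31 08:47:00",
]))

as_objects = stamps.dt.date          # Python datetime.date values, dtype "object"
as_midnight = stamps.dt.normalize()  # still a datetime dtype, time set to 00:00:00

print(as_objects.dtype, as_midnight.dtype)
print(as_objects.value_counts())
```

Both variants group transactions by calendar day; they differ only in the type of the resulting column.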

Because we are not going to use the information about time, in **bakery_data** we can leave just the column with the date.

In [None]:
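# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
bakery_data = bakery_data[["TransactionNo", "Items", "Date", "Daypart", "DayType"]].copy()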
bakery_data

#### Category assignment based on the number of sold products

Let's take a closer look at the contents of the **Items** column.

In [None]:
bakery_data["Items"].value_counts()

We see that many different products were sold in the analyzed period, with widely varying frequency.

We'll add an **Item Categories** column that highlights the top 5 products and assigns the "Other" category to the remaining ones.

In [None]:
# Top 5 best-selling products, plus a catch-all "Other" category
product_categories = list(bakery_data["Items"].value_counts().index)[0:5]
product_categories.append("Other")
# Items outside the category list become NaN and are then filled with "Other"
bakery_data["Item Categories"] = pd.Series(pd.Categorical(bakery_data["Items"], categories=product_categories)).fillna("Other")
bakery_data
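
The "top N plus Other" idea can be tried out on a small scale. The sketch below uses a hypothetical list of items and keeps only the top 2 so the result is easy to verify by eye. It relabels the rare items with `Series.where` before building the categorical, which is a slightly different route from the cell above but should give the same kind of column.

```python
import pandas as pd

# Hypothetical purchases used only for illustration
items = pd.Series(["Coffee", "Bread", "Coffee", "Tea", "Cake", "Coffee", "Bread", "Scone"])

# The two most frequent products
top = list(items.value_counts().index[:2])
categories = top + ["Other"]

# Keep the name when it is in the top list, otherwise use "Other"
labelled = items.where(items.isin(top), "Other")
item_categories = pd.Series(pd.Categorical(labelled, categories=categories))

print(item_categories.value_counts())
```

Because the category list is given explicitly, "Other" always comes last, which keeps legends and axes in a predictable order.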

#### Converting times of day to categories

Let's take a closer look at the contents of the **Daypart** column.

In [None]:
bakery_data["Daypart"].value_counts()

This column has only a few categories; we just want the times of day to appear in their natural order in the visualizations.

We'll define a new **Day Part** column, set the correct category order and use it to replace the current **Daypart** column.

In [None]:
# Listing the categories explicitly fixes their order (morning first, night last)
bakery_data["Day Part"] = pd.Series(pd.Categorical(bakery_data["Daypart"], categories=["Morning", "Afternoon", "Evening", "Night"]))
bakery_data = bakery_data[["TransactionNo", "Items", "Date", "Day Part", "DayType", "Item Categories"]].copy()
bakery_data
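
Two properties of categoricals are worth keeping in mind here: values that are not in the category list silently become missing, and sorting follows the category order rather than the alphabet. The sketch below illustrates both on hypothetical values. The `ordered=True` flag is an addition that the cell above does not use; it additionally allows comparisons such as "earlier than Evening".

```python
import pandas as pd

order = ["Morning", "Afternoon", "Evening", "Night"]

# Hypothetical values; "Noon" is deliberately not in the category list
parts = pd.Series(["Evening", "Morning", "Noon", "Afternoon", "Morning"])

day_part = pd.Series(pd.Categorical(parts, categories=order, ordered=True))

# Values outside the category list become NaN
n_unmatched = day_part.isna().sum()
print("unmatched values:", n_unmatched)

# Sorting follows the category order, not the alphabet
sorted_parts = day_part.dropna().sort_values().tolist()
print(sorted_parts)
```

Checking for unexpected missing values right after the conversion helps catch spelling differences between the data and the category list.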

#### Converting day types to categories

Let's take a closer look at the **DayType** column contents.

In [None]:
bakery_data["DayType"].value_counts()

Similarly to the time of day, the list of categories is short. We'll prepare a new **Day Type** column just like before and remove the unnecessary column.

In [None]:
bakery_data["Day Type"] = pd.Series(pd.Categorical(bakery_data["DayType"], categories=["Weekday", "Weekend"]))
bakery_data = bakery_data[["TransactionNo", "Items", "Date", "Day Part","Day Type", "Item Categories"]]
bakery_data

### Creating dataframes used in the visualization

Besides the **bakery_data** set, in class we are going to need several other views of the data to base our visualizations on.

#### Daily statistics

Below, we calculate how many products were purchased each day, and in how many transactions, broken down by type of day.

In [None]:
items_daily = bakery_data[["Date","Day Type", "Items"]].groupby(["Date", "Day Type"]).count()
transactions_daily = bakery_data[["Date","Day Type", "TransactionNo"]].groupby(["Date", "Day Type"]).nunique()
daytype_statistics_daily = pd.merge(items_daily, transactions_daily, on=["Date", "Day Type"])
daytype_statistics_daily
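
The two group-bys and the merge can also be written as a single step with named aggregation. The sketch below is an alternative formulation on hypothetical toy data, not the method used in this notebook. The output column names `Items` and `Transactions` are chosen here for illustration. The `observed` argument is passed explicitly because the grouping column is categorical and pandas' default for this argument has differed between versions.

```python
from datetime import date

import pandas as pd

# Hypothetical toy data with the same column names as bakery_data
monday, saturday = date(2016, 10, 31), date(2016, 11, 5)
toy = pd.DataFrame({
    "TransactionNo": [1, 1, 2, 3, 3, 3],
    "Items": ["Bread", "Coffee", "Coffee", "Tea", "Bread", "Cake"],
    "Date": [monday, monday, monday, saturday, saturday, saturday],
    "Day Type": pd.Categorical(
        ["Weekday"] * 3 + ["Weekend"] * 3, categories=["Weekday", "Weekend"]
    ),
})

# One pass: count items and distinct transactions per (Date, Day Type);
# observed=True keeps only the combinations that actually occur in the data
daily = toy.groupby(["Date", "Day Type"], observed=True).agg(
    Items=("Items", "count"),
    Transactions=("TransactionNo", "nunique"),
)

print(daily)
```

Named aggregation also gives the result columns descriptive names, whereas the merged version keeps the original column names **Items** and **TransactionNo**.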

Below, we calculate how many products were purchased each day, and in how many transactions, broken down by time of day.

In [None]:
items_daily = bakery_data[["Date","Day Part", "Items"]].groupby(["Date", "Day Part"]).count()
transactions_daily = bakery_data[["Date","Day Part", "TransactionNo"]].groupby(["Date", "Day Part"]).nunique()
daypart_statistics_daily = pd.merge(items_daily, transactions_daily, on=["Date", "Day Part"])
daypart_statistics_daily

#### Category statistics

Finally, we compute the number of products purchased in each transaction, together with the day part and day type of that transaction.

In [None]:
items_count = bakery_data[["TransactionNo", "Items"]].groupby(["TransactionNo"]).count()
transactions_data = pd.merge(pd.DataFrame(bakery_data[["TransactionNo", "Day Type", "Day Part"]].drop_duplicates()), items_count, on="TransactionNo")
transactions_data
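
This merge implicitly assumes that every transaction has exactly one day type and one day part. One way to make pandas verify that assumption is the `validate` argument of `pd.merge`. The sketch below shows it on hypothetical toy data; the frame and variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data with the same column names as bakery_data
toy = pd.DataFrame({
    "TransactionNo": [1, 1, 2, 3, 3, 3],
    "Items": ["Bread", "Coffee", "Coffee", "Tea", "Bread", "Cake"],
    "Day Type": ["Weekday", "Weekday", "Weekday", "Weekend", "Weekend", "Weekend"],
    "Day Part": ["Morning", "Morning", "Afternoon", "Evening", "Evening", "Evening"],
})

# Number of items per transaction, with TransactionNo as a regular column
items_per_transaction = toy.groupby("TransactionNo")["Items"].count().reset_index()

# One row per transaction with its day type and day part
transaction_info = toy[["TransactionNo", "Day Type", "Day Part"]].drop_duplicates()

# validate="one_to_one" raises MergeError if a transaction appears more than once
# on either side, for example if it were recorded under two different day parts
toy_transactions = pd.merge(
    transaction_info, items_per_transaction, on="TransactionNo", validate="one_to_one"
)

print(toy_transactions)
```

If the check fails on real data, the duplicated transaction numbers in the left-hand frame show which records carry conflicting day information.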

## Exercises