In [1]:
# filter warnings to not confuse readers
import warnings
warnings.filterwarnings("ignore")

# Python for Excel Users - Part 1

#### Is it for me?
This tutorial notebook is for everyone who wants to start working in Python instead of Excel, e.g. to automate tasks or to improve speed and scalability. It is also for the merely curious.

You're in the right place if you use Excel to combine various sheets or tables (using formulas like Index, Match, VLookup), use simple mathematical operations like sum and mean, or even use conditional aggregation functions (e.g. sum of transactions per category) or pivot tables.

#### Goal:
We will explore and combine different sales data tables like we would in Excel. The tables are:
- Orders
- Order_lines
- Products
More details about the data below.

#### Requirements:
While some basic experience with Python helps, you don't have to be a programmer or data scientist to follow this tutorial. Ideally, you should have heard of pandas and Jupyter Notebook (which is where you're reading this tutorial right now) and spent some time with Python yourself. 

But don't get discouraged. Just try to follow along and read up on things we did not cover. You can always come back here. And if you need some personal guidance, we at HOSD Mentoring offer free one-to-one data mentoring at www.hosd-mentoring.com


# 1. The Situation

Imagine you just started in the sales department of a huge online retailer. It's your first day. You set up your computer, got a great double espresso from your company's barista, a.k.a. the old coffee machine in the break room, and you're ready to get started! 

Unfortunately, it's summer time and everyone in your team is on vacation. This leaves you with a lot of questions and time to **explore the company data yourself**. Great!

#### A few of the questions in your head are:
- 🏬 How big is the company actually, measured by number of transactions? 
- 💵 Company size in terms of revenue per year? 
- 👗 How many products do we offer per Category? 
- 📦 How successful are our deliveries?

You know that you can answer these questions simply by using and combining the three tables Orders, Order_lines, and Products in Excel using **standard formulas and maybe a pivot table**. 

But since no one's around and starting a new job marks a new era in your career, you want to do it differently this time. Why not try out this **cool thing called Python** that everyone on LinkedIn talks about? Yeah, let's do that!

# 2. Getting started - the Data

The data used in this tutorial is an adapted and simplified version of the [Brazilian E-Commerce Public Dataset by Olist](https://www.kaggle.com/olistbr/brazilian-ecommerce). As mentioned above, we will work with three tables.

In Excel, you would open a new workbook, go to Data > Get Data > From Text, search for the .csv file, and click through the Text Import Wizard (choice of delimiter, file encoding, data format, etc.). Or you would just double-click the .csv file and hope Excel figures everything out correctly by itself (ouch!).

![Excel csv example](img/img1.png)

But now we're in python land and things are a bit different (but not harder!). 

To work with a .csv file, we read it as a dataframe (a fancy name for a table) using the package [pandas](https://pandas.pydata.org/docs/). We save the tables in variables with appropriate names. A common convention is to end dataframe variable names with df.

In [2]:
# setup notebook and load tables
import pandas as pd

orders_df = pd.read_csv("data/Orders.csv")
order_lines_df = pd.read_csv("data/Order_lines.csv")
products_df = pd.read_csv("data/Products.csv")
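
The choices you would make in Excel's Text Import Wizard (delimiter, encoding, which column holds row numbers) map to optional arguments of read_csv(). Here is a minimal sketch, assuming a semicolon-separated file with an unnamed first column. The text in `raw_csv` and the name `example_df` are made up for illustration and are not part of the tutorial data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk. It uses ";" as the
# delimiter and has an unnamed first column holding the old row numbers.
raw_csv = ";order_id;order_status\n0;a1;delivered\n1;b2;shipped\n"

# sep: the delimiter (the wizard's "Delimited" step)
# index_col=0: use the first column as row labels instead of as data
# For a real file you could also pass encoding="utf-8" or encoding="latin-1".
example_df = pd.read_csv(io.StringIO(raw_csv), sep=";", index_col=0)

print(example_df)
```

With index_col=0, the "Unnamed: 0" column you see in the previews would become the row index instead of a regular column.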

Great! We've just loaded the three tables into our Python program. Right?
Unlike in Excel, the imported data is not displayed immediately. Python stays quiet unless you explicitly tell it what you want to see.

After you've imported a csv file using pandas' read_csv() function and saved it as something_df, you can display the first 5 rows using **something_df.head(5)**. 

That way we can briefly see what the table looks like and which columns we are dealing with. Let's do that for our three tables.

**Orders** will state all transactions, when they happened and whether the order has been fulfilled successfully. Each order has an order_id (unique per order), customer_id (unique per customer), an order_status, and purchase date.

In [3]:
# Preview of the Orders table
orders_df.head(5)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,02.10.17 10:56
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,24.07.18 20:41
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,08.08.18 08:38
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,18.11.17 19:28
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,13.02.18 21:18


However, orders_df doesn't show the contents of the individual orders. For that we need to take a look at the **Order_lines** table and connect the two tables using the order id. In order_lines_df, each row refers to one transaction item, including quantity, price, and freight value. 

In [4]:
# Preview of Order_lines table
order_lines_df.head(5)

Unnamed: 0,order_id,order_item_id,product_id,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,199.9,18.14


The product category is not part of this table yet, so we need to combine the **Products** table (products_df) with order_lines_df later in case we want to analyze category-level data.

In [5]:
# Preview of Products table
products_df.head(5)

Unnamed: 0,product_id,product_category_name
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria
1,3aa071139cb16b67ca9e5dea641aaa2f,artes
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer
3,cef67bcfe19066a932b7673e239eb23d,bebes
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas


# 3. Answering our Questions with Data

Now that we have everything loaded into our notebook, we can start answering our questions with data.

## 3.1 🏬 How big is the company actually, measured by number of transactions? 

Since we are working in the sales department, when we talk about company size here we mean **number of transactions**. We can start by simply counting the number of rows in orders_df using Python's built-in len() function. This is similar to clicking on the column in Excel and reading the count at the bottom of the screen.

In [6]:
len(orders_df)

99441

Alternatively, we can use pandas' pd.unique() function on the column order_id, which returns the unique values of order_id. In this case it shouldn't make a difference, since the order ids are unique by nature. 

However, if we wanted to count the distinct values of a column with duplicates (like customer_id), len() on the whole table would count every row and the result would be too high. So let's do it properly and use pd.unique() to get the unique order ids and then use len() on that.

Note: You can access a column of a table using table_df["column_name"]

In [7]:
len(pd.unique(orders_df["order_id"]))

99441
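
To see the difference between counting rows and counting unique values, here is a small sketch on a made-up table (`toy_orders_df` and its values are invented for illustration). It also shows the Series method nunique(), which is a shortcut for the same count. One difference worth knowing: nunique() ignores missing values by default, while pd.unique() keeps them.

```python
import pandas as pd

# Toy table: four orders placed by three different customers
toy_orders_df = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c1", "c2", "c3"],
})

n_rows = len(toy_orders_df)                                  # counts every row
n_customers = len(pd.unique(toy_orders_df["customer_id"]))   # counts distinct values
n_customers_alt = toy_orders_df["customer_id"].nunique()     # same count, shorter

print(n_rows, n_customers, n_customers_alt)
```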

We are surprised by the large number of transactions at first, but then we notice two things:
- Not all orders have been delivered. We should only take a look at successfully delivered orders using order_status.
- These transactions span three years, 2016 to 2018. We should take a look at the number of transactions per year.

Let's start by filtering out all orders that have not been delivered successfully. To find out how many order_status options there are, we would click on the filter of that column in Excel.

![filter in excel](img/img2.png)

It's convenient that Excel shows us only unique values but as we just learned, we can achieve the same by using pd.unique() on the order_status column.

In [8]:
pd.unique(orders_df["order_status"])

array(['delivered', 'invoiced', 'shipped', 'processing', 'unavailable',
       'canceled', 'created', 'approved'], dtype=object)
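
If you also want to know how many rows belong to each status, the Series method value_counts() returns the unique values together with their counts, sorted from most to least frequent. A minimal sketch on invented data (`toy_status` is not part of the tutorial tables):

```python
import pandas as pd

# Toy status column with made-up values
toy_status = pd.Series(
    ["delivered", "delivered", "shipped", "canceled", "delivered"],
    name="order_status",
)

# Unique values plus the number of rows for each of them
status_counts = toy_status.value_counts()

print(status_counts)
```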

In Excel we would simply de-select all options except 'delivered' to filter our table. Excel will check each row and only display the ones where order_status equals 'delivered'. 

We can tell Python to do the same. If our computer was a person, we would tell her: Please take the orders_df table and show me only the rows where the column "order_status" of the orders_df table is equal to "delivered". Luckily, our computer is not a person and needs way less text to do the same. 

Translated to python it's:

In [9]:
# order table where order table column order_status is equal to delivered.
orders_df[orders_df["order_status"]=="delivered"]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,02.10.17 10:56
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,24.07.18 20:41
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,08.08.18 08:38
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,18.11.17 19:28
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,13.02.18 21:18
...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,09.03.17 09:54
99437,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,06.02.18 12:58
99438,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,27.08.17 14:46
99439,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,08.01.18 21:28
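
To see what happens inside the square brackets, it helps to split the filter into two steps. The comparison produces one True/False value per row, and the table then keeps only the rows marked True. This is a sketch on a made-up three-row table (`toy_orders_df` is invented for illustration):

```python
import pandas as pd

toy_orders_df = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "canceled", "delivered"],
})

# Step 1: compare every row with "delivered" -> a column of True/False
is_delivered = toy_orders_df["order_status"] == "delivered"
print(is_delivered.tolist())

# Step 2: keep only the rows where the mask is True
toy_delivered_df = toy_orders_df[is_delivered]
print(toy_delivered_df)
```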


We should save the filtering result to another dataframe to further work with this table.

In [10]:
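# .copy() gives us an independent table, so adding columns later does not trigger a SettingWithCopyWarning
delivered_orders_df = orders_df[orders_df["order_status"]=="delivered"].copy()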

Next, we want to break our transactions down per year. This might be the most difficult part of this tutorial, since a few further data transformation steps are necessary that I won't explain fully here. Working with datetime can be challenging sometimes, but there are great resources online.

We will convert the order_purchase_timestamp column into a **datetime column**. Then we will **extract the year** from it, and save it to a **new column called "year"**. 

This is similar to using Excel's Year() formula.

![excels year() formula](img/img3.png)

In [11]:
# the data type of this column is not yet datetime
delivered_orders_df["order_purchase_timestamp"].dtypes

dtype('O')

In [12]:
# convert to datetime
delivered_orders_df["order_purchase_timestamp"] = pd.to_datetime(delivered_orders_df["order_purchase_timestamp"])
delivered_orders_df["order_purchase_timestamp"].dtypes

dtype('<M8[ns]')
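
One thing to watch out for: the raw timestamps in the Orders preview appear to be written day first (for example 24.07.18 20:41). Without further hints pandas has to guess, and a string like 02.10.17 can be read as either 2 October or 10 February. The year is the same either way, so the yearly counts in this tutorial are not affected. If you need months or days, you can state the pattern explicitly. A minimal sketch, assuming every value follows the day.month.year hour:minute pattern:

```python
import pandas as pd

# Two raw strings copied from the Orders preview
raw_timestamps = pd.Series(["02.10.17 10:56", "24.07.18 20:41"])

# %d = day, %m = month, %y = two-digit year, %H:%M = hour and minute
parsed = pd.to_datetime(raw_timestamps, format="%d.%m.%y %H:%M")

print(parsed)
print(parsed.dt.month.tolist())
```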

In [13]:
# extract the year from datetime and save to new column
delivered_orders_df["order_year"] = delivered_orders_df["order_purchase_timestamp"].dt.year

In [14]:
# show the new table
delivered_orders_df.head(5)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_year
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-02-10 10:56:00,2017
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:00,2018
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:00,2018
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:00,2017
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:00,2018


Great! Now we have everything we need to count the delivered orders per year. In Excel we would filter by year, count each row, and write it to a new summary table. 

More advanced Excel users would skip this manual task by using a pivot table with Year as a column and count values of order_id.

![excel pivot of orders per year](img/img5.png)

We can achieve the same by using pandas' groupby() and agg() methods. The syntax might look intimidating for such a simple task, but the combination is quite versatile and powerful in ways that Excel's pivot can only dream of.

First, we select the table to aggregate, then we define the column to aggregate by, followed by defining the aggregation function.

In [15]:
delivered_orders_df.groupby(by=["order_year"], as_index=False).agg({'order_id':'count'})

Unnamed: 0,order_year,order_id
0,2016,267
1,2017,43428
2,2018,52783
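
If the word "pivot" feels more familiar, pandas also offers pd.pivot_table(), which gives the same counts with the year as row labels. The sketch below compares both approaches on a made-up table (`toy_df` and its values are invented for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "order_year": [2016, 2017, 2017, 2018, 2018],
})

# The groupby approach used in this tutorial
by_groupby = toy_df.groupby(by=["order_year"], as_index=False).agg({"order_id": "count"})

# The same count written as a pivot table: rows = year, values = count of order_id
by_pivot = pd.pivot_table(toy_df, index="order_year", values="order_id", aggfunc="count")

print(by_groupby)
print(by_pivot)
```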


## 3.2 💵 Company size in terms of revenue per year? 
Besides the number of transactions, we also want to find out the revenue. As stated above, we need to combine delivered_orders_df with order_lines_df through order_id as the key. Furthermore, we notice that the order_lines table has information on the number of order items and price, but not revenue yet.

Hence, this is what we need to do:

- create a column "line_revenue" in order_lines_df with values for order_item_id * price
- create an aggregated table that sums up line_revenue for each order as order_revenue
- combine the aggregated table with delivered_orders_df and sum up order_revenue per year

In Excel, we would create a new column, use a pivot table, combine the pivot table with the other filtered table using something like VLOOKUP, build yet another pivot on that resulting table, ... 

Let's not do that. It's not a complex task but we can already see why we should switch over to Python for doing tasks like that. Once you get used to the syntax and process, everything will be easier and more scalable to do.

In [16]:
order_lines_df.head()

Unnamed: 0,order_id,order_item_id,product_id,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,199.9,18.14


In [17]:
# create new column and save the result of order_item_id * price
order_lines_df["line_revenue"] = order_lines_df["order_item_id"] * order_lines_df["price"]
order_lines_df.head()

Unnamed: 0,order_id,order_item_id,product_id,price,freight_value,line_revenue
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,58.9,13.29,58.9
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,239.9,19.93,239.9
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,199.0,17.87,199.0
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,12.99,12.79,12.99
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,199.9,18.14,199.9


In [18]:
# aggregate table and sum up line_revenue to get the total order value per order id
order_lines_agg_df = order_lines_df.groupby(by=["order_id"], as_index=False).agg({"line_revenue":"sum"})
order_lines_agg_df.head(5)

Unnamed: 0,order_id,line_revenue
0,00010242fe8c5a6d1ba2dd792cb16214,58.9
1,00018f77f2f0320c557190d7a144bdd3,239.9
2,000229ec398224ef6ca0657da4fc703e,199.0
3,00024acbcdf0a6daa1e931b038114c75,12.99
4,00042b26cf59d7ce69dfabb4e55b4fd9,199.9


In [19]:
delivered_orders_df.head(5)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_year
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-02-10 10:56:00,2017
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:00,2018
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:00,2018
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:00,2017
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:00,2018


Now we want to combine the two tables above, order_lines_agg_df and delivered_orders_df. What we do is similar to Excel's VLOOKUP() formula in terms of result, but here we **merge the two tables using order_id as a key**. The Python code is straightforward and reads almost like prose.

After that we repeat what we did above when we calculated the number of transactions per year, but this time we sum up line_revenue (which should be called order_revenue in this table, since it's the sum of line_revenue per order).

In [20]:
# merge the two tables to get revenue per order
delivered_orders_merged_df = pd.merge(left=delivered_orders_df, right=order_lines_agg_df, how="left", on="order_id")
delivered_orders_merged_df.head(5)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_year,line_revenue
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-02-10 10:56:00,2017,29.99
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:00,2018,118.7
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:00,2018,159.9
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:00,2017,45.0
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:00,2018,19.9
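
In Excel, a VLOOKUP() that finds no match shows #N/A. With a left merge, unmatched rows get an empty value (NaN) instead. If you want to check for such rows, the indicator=True argument of pd.merge() adds a column called _merge that tells you where each row came from. A small sketch on made-up tables (`left_df` and `right_df` are invented for illustration):

```python
import pandas as pd

left_df = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_year": [2017, 2017, 2018],
})
# Note: there is no revenue row for order "o2"
right_df = pd.DataFrame({
    "order_id": ["o1", "o3"],
    "line_revenue": [10.0, 25.5],
})

# indicator=True adds the column "_merge" with "both" or "left_only"
merged = pd.merge(left=left_df, right=right_df, how="left", on="order_id", indicator=True)

# Rows of the left table that found no partner in the right table
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched)
```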


In [21]:
delivered_orders_merged_df.groupby(by=["order_year"], as_index=False).agg({"line_revenue":"sum"})

Unnamed: 0,order_year,line_revenue
0,2016,46490.29
1,2017,6761452.0
2,2018,8173995.0


Depending on your settings, pandas may display these sums in a form like 4.649029e+04. This looks wrong, right? Not really. This is called **scientific notation**, where very big or very small numbers are displayed as a calculation. 

4.649029e+04 means 4.649029 * 10^4, which is 4.649029 * 10,000 = 46,490.29. You can change the way Jupyter Notebook displays numbers using the following code snippet.

In [22]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [23]:
# display float numbers up to second decimal point
delivered_orders_merged_df.groupby(by=["order_year"], as_index=False).agg({"line_revenue":"sum"})

Unnamed: 0,order_year,line_revenue
0,2016,46490.29
1,2017,6761451.8
2,2018,8173995.22
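
If you miss Excel's thousands separators, the same option accepts any function that turns a number into text. Here is a sketch that adds commas. The helper name `format_amount` and the table `toy_revenue_df` are made up for illustration, and the last line simply restores the pandas default.

```python
import pandas as pd

def format_amount(x):
    # "," adds thousands separators, ".2f" keeps two decimal places
    return f"{x:,.2f}"

pd.set_option("display.float_format", format_amount)

# Yearly sums taken from the output above
toy_revenue_df = pd.DataFrame({
    "order_year": [2016, 2018],
    "line_revenue": [46490.29, 8173995.22],
})
print(toy_revenue_df)

# Optional: go back to the default display
pd.reset_option("display.float_format")
```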


Great! It seems like our company is quite alive. Using this approach we can calculate all sorts of things, like the number of customers per year and the average order value. The ratio of transactions to customers might be of interest too, since a company can be at risk when its revenue depends on a small number of highly valuable customers. What if they switch to our competitor? 

Just play around with the data to get more familiar with how it works in Python.
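
As a starting point, here is one possible sketch of those KPIs on a made-up table (`toy_merged_df` and its values are invented, and the column names mirror delivered_orders_merged_df). It uses a second way of calling agg(), where each new column is written as new_name=("source_column", "function").

```python
import pandas as pd

toy_merged_df = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c1", "c2", "c3"],
    "order_year": [2017, 2017, 2018, 2018],
    "line_revenue": [10.0, 30.0, 50.0, 70.0],
})

# One row per year with three aggregated columns
kpis_df = toy_merged_df.groupby(by=["order_year"], as_index=False).agg(
    n_orders=("order_id", "count"),
    n_customers=("customer_id", "nunique"),
    avg_order_value=("line_revenue", "mean"),
)

# Ratio of transactions to customers
kpis_df["orders_per_customer"] = kpis_df["n_orders"] / kpis_df["n_customers"]

print(kpis_df)
```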

## 3.3 👗 How many products do we offer per Category? 

Since you already know how to count unique values of a column, this will be quite easy now. Above, we counted the number of unique order_ids in orders_df. Now we have to count the product_id in products_df to get a general idea of the scope of our offering, i.e. **how many different products** we offer.

In [24]:
products_df.head()

Unnamed: 0,product_id,product_category_name
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria
1,3aa071139cb16b67ca9e5dea641aaa2f,artes
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer
3,cef67bcfe19066a932b7673e239eb23d,bebes
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas


In [25]:
# count unique products in our database
len(pd.unique(products_df["product_id"]))

32951

Since we want to also know the **number of products per category**, we replicate the approach from above where we calculated the number of transactions per year. Instead of "order_year" we will now use "product_category_name" and instead of "order_id" to count we use "product_id".

In [26]:
# count products per category
products_per_category_df = products_df.groupby(by=["product_category_name"], as_index=False).agg({"product_id":"count"})
products_per_category_df.columns = ["product_category_name", "product_id_count"]
products_per_category_df.head(5)

Unnamed: 0,product_category_name,product_id_count
0,agro_industria_e_comercio,74
1,alimentos,82
2,alimentos_bebidas,104
3,artes,55
4,artes_e_artesanato,19


In [27]:
# number of categories
len(products_per_category_df)

73

Wow, we offer almost 33k products in 73 categories? What are the top 5 categories in terms of number of products?

Like in Excel, we simply have to sort the table by product_id_count.

In [28]:
# sort by product_id_count descending
products_per_category_df.sort_values(by=["product_id_count"], ascending=False).head(5)

Unnamed: 0,product_category_name,product_id_count
13,cama_mesa_banho,3029
32,esporte_lazer,2867
54,moveis_decoracao,2657
11,beleza_saude,2444
72,utilidades_domesticas,2335
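
Sorting and then taking the first rows is common enough that pandas has a shortcut for it: nlargest(). A minimal sketch on a made-up table (the category names and counts in `toy_categories_df` are invented):

```python
import pandas as pd

toy_categories_df = pd.DataFrame({
    "product_category_name": ["cat_a", "cat_b", "cat_c", "cat_d"],
    "product_id_count": [50, 900, 20, 300],
})

# Same result as sort_values(..., ascending=False).head(2)
top2_df = toy_categories_df.nlargest(2, "product_id_count")

print(top2_df)
```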


That is probably the point where you find out that knowing Portuguese is quite helpful when working for a Brazilian company...

## 3.4 📦 How successful are our deliveries?

In 3.1 we created a new table with only successfully delivered orders. If we count the unique order ids of that table and divide that number by the count of unique order ids in the big unfiltered orders_df table, then we get the percentage of successful deliveries.

In [29]:
unique_orderids_total = len(pd.unique(orders_df["order_id"]))
unique_orderids_delivered = len(pd.unique(delivered_orders_df["order_id"]))

# calculate ratio and round to second decimal point
round(unique_orderids_delivered/unique_orderids_total,2)

0.97
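
Since the order ids are unique, there is a shorter route to the same kind of ratio. A True/False column behaves like 1 and 0 in calculations, so its mean is the share of True values. The sketch below uses a made-up status column (`toy_status` is invented) and also shows value_counts(normalize=True), which returns the share of every status at once.

```python
import pandas as pd

toy_status = pd.Series(["delivered", "delivered", "delivered", "canceled"])

# Mean of a True/False column = share of rows that are True
delivered_share = (toy_status == "delivered").mean()

# Share of each status in one go
status_shares = toy_status.value_counts(normalize=True)

print(delivered_share)
print(status_shares)
```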

97% delivery success is something our company can be really proud of.

# 4. Summary

Now that we learned that, thanks to pandas, working in Python is a real alternative to Excel, we got a bit carried away and calculated all sorts of KPIs based on aggregated data from the three original tables. A quick look at the watch tells us what our stomach already knew: it's lunch time!

But before we head down to the company canteen, a.k.a. the vending machine, we want to save what we did and send it to our colleagues. How do we do that?

We can simply **export dataframes as csv files** and send them to others who prefer to work in Excel.

In [32]:
# Export dataframe as csv for Excel
products_per_category_df.to_csv("output/products_per_category_df.csv", index=False)
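
Two practical details can trip you up here: the target folder has to exist before you write to it, and Excel in some regional settings expects semicolons as delimiters and may need help recognizing UTF-8 text. The sketch below shows one way to handle both. It writes to a temporary folder so it can run anywhere, and the table `toy_export_df` and the file name are made up for illustration.

```python
import os
import tempfile
import pandas as pd

toy_export_df = pd.DataFrame({
    "product_category_name": ["cat_a", "cat_b"],
    "product_id_count": [50, 900],
})

# Make sure the output folder exists (here: inside a temporary directory)
out_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "toy_export.csv")

# sep=";" suits Excel versions that expect semicolons,
# encoding="utf-8-sig" adds a marker that helps Excel detect UTF-8
toy_export_df.to_csv(out_path, index=False, sep=";", encoding="utf-8-sig")

# Read the file back to check that nothing got lost
round_trip_df = pd.read_csv(out_path, sep=";", encoding="utf-8-sig")
print(round_trip_df)
```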

Today we learned a few basic but powerful functions that allow us to work in Python instead of Excel. Here's a summary.

- *pd.read_csv()* can be used to import .csv files as dataframes

- *table_df.head()* shows us a preview of that table as output

- to access a column of table_df, we type *table_df["column_name"]*

- *len(pd.unique(table_df["column_name"]))* gives us the number of unique values in that column

- we can multiply two columns and save the result in a new column using *table_df["new"] = table_df["col1"] * table_df["col2"]*

- Instead of Excel's pivot table, we can aggregate values per category, e.g. count the number of transactions per year or sum up the revenue per order using *table_df.groupby(by=["category"], as_index=False).agg({"col3":"sum"})*

- Instead of using VLOOKUP() in Excel we can merge two tables using *pd.merge(left=table_1_df, right=table_2_df, how="left", on="key_column")*

I suggest you keep going and play around with this dataset or your own a bit further. Whenever you don't know how to do something, it's best to check the [pandas documentation](https://pandas.pydata.org/docs/) or to google your specific question ("python datetime from string" or "python pivot table") and see how others do it. Just don't get discouraged and keep on learning! 

The beginning is always tough but stick with it and you will be way more effective and faster than any Excel pro.

!["hosd data mentoring"](img/img6.png)


And in case you need some general guidance on whether Data Analytics and Data Science could be a career path for you, we at [HOSD Mentoring offer free one-to-one mentoring](http://www.hosd-mentoring.com). You can book a session with one of our mentors with just a click on our website. Don't be shy :)