# Initial exploration
As a first step, we want to get to know the data we are dealing with.
We have four CSV files describing the products and orders of the company Eniac.
Let's take a look at them:

In [1]:
import pandas as pd

First we load the CSV files:

In [2]:
brands = pd.read_csv('data/brands.csv')

orders = pd.read_csv('data/orders.csv')

orderlines = pd.read_csv('data/orderlines.csv')

products = pd.read_csv('data/products.csv')

We want some initial information about these four tables to get a feeling for the data. We start with brands:

In [3]:
brands.head()
#brands.tail(20)
#brands.loc[brands.short == 'SEV']

Unnamed: 0,short,long
0,8MO,8Mobility
1,ACM,Acme
2,ADN,Adonit
3,AII,Aiino
4,AKI,Akitio


In [4]:
brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   short   187 non-null    object
 1   long    187 non-null    object
dtypes: object(2)
memory usage: 3.0+ KB


We have 187 different brands, each represented by a three-letter short name and a full name. The data seems to be complete: neither column contains null values. Judging by the head and the tail of the dataframe, there also seem to be no parsing errors.
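
These observations can also be checked in code rather than by eye. The following is a minimal sketch, not part of the original analysis: `brands_sample` is a stand-in built from the five rows shown above, and the helper `check_brands` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Stand-in for the real brands table, built from the five rows printed above.
brands_sample = pd.DataFrame({
    "short": ["8MO", "ACM", "ADN", "AII", "AKI"],
    "long": ["8Mobility", "Acme", "Adonit", "Aiino", "Akitio"],
})

def check_brands(df: pd.DataFrame) -> dict:
    """Summarise basic sanity checks for a brands-like table (hypothetical helper)."""
    return {
        # Total number of missing cells across all columns.
        "n_missing": int(df.isna().sum().sum()),
        # True if every short name has exactly three characters.
        "all_short_len_3": bool(df["short"].str.len().eq(3).all()),
        # True if no short name occurs twice.
        "short_is_unique": bool(df["short"].is_unique),
    }

brands_check = check_brands(brands_sample)
print(brands_check)
```

Running the same helper on the full `brands` dataframe would confirm or refute the "three letters" claim for all 187 rows.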

In [5]:
#orders.head()
orders.info()
orders.state.unique()
#orders[orders.total_paid.isna()]
#orders[orders.state == 'Pending']

orders.state.value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


Shopping Basket    117809
Completed           46605
Place Order         40883
Pending             14379
Cancelled            7233
Name: state, dtype: int64

The orders CSV consists of 4 columns:
- **order_id**: an integer,
- **created_date**: an object (string) of the form yyyy-mm-dd hh:mm:ss (will convert it with pd.to_datetime),
- **total_paid**: a float,
- **state**: an object taking one of the values 'Cancelled', 'Completed', 'Pending', 'Shopping Basket', 'Place Order'

We see that some of the states contain whitespace. This should be changed.
The total_paid column has 5 null values. The state of those entries is Pending, so everything seems to be fine.

The dataframe seems to be ordered by order_id.

Looking at the counts per state, we see that about half of the orders (52%) are still in the shopping basket, and another quarter (Place Order and Pending, 24%) is neither completed nor cancelled.

From the look of it, most of the orders were created in 2017 and 2018.
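
The clean-up steps mentioned above could look roughly like this. It is only a sketch under assumptions: `orders_sample` is a small hand-made stand-in (its values are illustrative, not taken from the file), and turning the states into lower-case names with underscores is one possible convention, not necessarily the one used later.

```python
import pandas as pd

# Hand-made stand-in for a few rows of orders; the values are illustrative only.
orders_sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "created_date": ["2017-11-20 18:54:39", "2017-01-02 10:00:00",
                     "2018-02-03 12:30:00", "2017-05-06 08:15:00"],
    "total_paid": [float("nan"), 19.99, 55.99, 7.49],
    "state": ["Pending", "Shopping Basket", "Place Order", "Completed"],
})

orders_clean = orders_sample.copy()

# Parse the timestamp strings into datetime64 values.
orders_clean["created_date"] = pd.to_datetime(orders_clean["created_date"])

# ASSUMPTION: states become lower-case with underscores,
# e.g. 'Shopping Basket' -> 'shopping_basket'.
orders_clean["state"] = (
    orders_clean["state"].str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# Relative share of each state and number of orders per year.
state_share = orders_clean["state"].value_counts(normalize=True)
orders_per_year = orders_clean["created_date"].dt.year.value_counts().sort_index()

# States of the rows that have no total_paid value.
states_of_missing = (
    orders_clean.loc[orders_clean["total_paid"].isna(), "state"].unique().tolist()
)
```

On the real data, `state_share` and `orders_per_year` would back up the "about half" and "mostly 2017 and 2018" statements with numbers.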

In [6]:
orderlines.head()
orderlines.info()

orderlines.sku.value_counts().head(10)

orderlines.describe()

orderlines.sort_values('product_quantity', ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
68712,1254032,358747,0,999,SEV0028,19.99,2017-05-24 14:51:58
53860,1228150,346221,0,999,APP1190,55.99,2017-04-14 21:50:52
57796,1234924,349475,0,800,KIN0137,7.49,2017-04-25 09:59:00
57306,1234111,349133,0,555,APP0665,70.99,2017-04-24 10:20:13
40813,1204788,335057,0,201,THU0029,80.99,2017-03-14 15:25:53
...,...,...,...,...,...,...,...
101945,1313523,387305,0,1,APP1972,360.33,2017-08-08 00:49:11
101946,1313527,387168,0,1,PAC2140,2.962.59,2017-08-08 00:55:17
101948,1313529,387307,0,1,SPE0159,39.99,2017-08-08 00:57:50
101949,1313531,387307,0,1,SPE0173,39.99,2017-08-08 00:58:57


orderlines consists of 7 columns:
- id: an integer
- id_order: an integer, corresponds to orders.order_id
- product_id: can be dropped later since every entry is 0
- product_quantity: an integer between 1 and 999
- sku: an object containing information about the brand and the unique product. The first 3 letters seem to match the short name of the brand
- unit_price: price of one unit of that product. dtype is object, probably because some values contain more than one dot (e.g. 2.962.59) --> want to change it to float
- date: an object of the form yyyy-mm-dd hh:mm:ss --> change it to datetime later

The data seems to be complete. The dataframe has more rows than the orders dataframe, which makes sense since one order can consist of several orderlines.
Looking at the value_counts for sku, we can see that the brand Apple seems to do quite well in that market.
From describe() we can tell that most items are ordered only once per order, but there are some outliers (999): namely an Apple product (APP) and a service product (SEV).

Right now we cannot sort by unit price, but there are some products in the 3000€ range.
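
One way the unit prices could be made sortable is sketched below. It rests on an assumption that still has to be verified against the data: in values with more than one dot, every dot except the last is treated as a thousands separator. The sample prices are copied from the output above.

```python
import pandas as pd

# Prices copied from the output above, stored as strings like in the raw file.
prices = pd.Series(["19.99", "55.99", "360.33", "2.962.59"])

# Flag values that contain more than one dot.
n_dots = prices.str.count(r"\.")
has_multiple_dots = n_dots > 1

# ASSUMPTION: every dot except the last one is a thousands separator.
# The regex removes each dot that is followed by another dot later in the string.
prices_fixed = prices.str.replace(r"\.(?=.*\.)", "", regex=True)

# Convert to float; anything that still cannot be parsed becomes NaN.
unit_price = pd.to_numeric(prices_fixed, errors="coerce")

print(unit_price.sort_values(ascending=False))
```

Counting `has_multiple_dots.sum()` on the full column first would show how many rows are affected before deciding whether to repair or drop them.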



In [7]:
products.head()
products.info()

products.describe()

#products.tail(20)

products.loc[products.price.isna()]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19326 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19326 non-null  object
 1   name         19326 non-null  object
 2   desc         19319 non-null  object
 3   price        19280 non-null  object
 4   promo_price  19326 non-null  object
 5   in_stock     19326 non-null  int64 
 6   type         19276 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.0+ MB


Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
34,TWS0019,Twelve South MagicWand support Apple Magic Tra...,MagicWand for wireless keyboard and Magic Trac...,,299.899,0,8696
1900,AII0008,Aiino Case MacBook Air 11 '' Transparent,MacBook Air 11-inch casing with matte finish.,,22.99,0,13835403
2039,CEL0020,Celly Ambo Luxury Leather Case + iPhone 6 Case...,Cover and housing together with magnet for iPh...,,399.905,0,11865403
2042,CEL0007,Celly Wallet Case with removable cover Black i...,Case Book for iPhone 6 card case type.,,128.998,0,11865403
2043,CEL0012,Celly Silicone Hard Shell iPhone 6 Blue,Hard Shell Silicone iPhone 6.,,4.99,0,11865403
2044,CEL0014,Celly Silicone Hard Shell iPhone 6 Amarillo,Hard Shell Silicone iPhone 6.,,59.895,0,11865403
2049,CEL0015,Celly fur-lined Powerbank battery 4000mAh Black,Leather-wrapped External Battery 4000mAh for i...,,239.895,0,1515
2051,CEL0018,Celly Wallet Leather Case cover Black iPhone 6,Card case with transparent protective cover fo...,,294.877,0,11865403
2052,CEL0023,Celly Ambo Luxury Leather Case + Case Gold iPh...,Cover and housing together with magnet for iPh...,,329.894,0,11865403
2053,CEL0025,Celly Ambo Luxury Leather Case + Case iPhone 6...,Cover and housing together with magnet for iPh...,,449.878,0,11865403


The products dataframe has 19326 entries, consisting of
- sku: connection to orderlines
- name
- a description
- a price
- a promo price, which sometimes seems to be significantly greater than the actual price
- in_stock, which tells whether or not the product was in stock at the moment of data extraction. Not sure if this is needed later on.
- a type number: seems like an internal code. Maybe this helps when categorizing items later

Only 'in_stock' is an int (either 0 or 1), the rest are objects --> change the prices to float and process the sku (so we can extract the brand).
Looking at the tail of the dataframe, we see that some prices (also promo prices) contain multiple decimal points --> look out for that when converting.

Note that not every entry has a
- price (46 entries, almost all of them belong to the brand CEL), but there is always a promo price --> either drop these rows or use the promo price in that column
- description (7 entries)
- type number (50 entries)
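
Extracting the brand from the sku and counting missing prices per brand could be sketched as follows. The sample rows are taken from the output above; the column name `brand_short` is introduced here for illustration, and the approach assumes that the first three characters of every sku are the brand's short name.

```python
import pandas as pd

# Stand-ins built from rows shown above; the real tables are much larger.
brands_sample = pd.DataFrame({"short": ["AII", "AKI"], "long": ["Aiino", "Akitio"]})
products_sample = pd.DataFrame({
    "sku": ["AII0008", "CEL0020", "CEL0007", "TWS0019"],
    "price": [None, None, None, None],
    "promo_price": ["22.99", "399.905", "128.998", "299.899"],
})

# ASSUMPTION: the first three characters of the sku are the brand's short name.
products_sample["brand_short"] = products_sample["sku"].str[:3]

# A left merge keeps every product; codes without a match in brands get NaN in 'long'.
# In this tiny sample CEL and TWS are unmatched only because brands_sample is truncated.
products_branded = products_sample.merge(
    brands_sample, how="left", left_on="brand_short", right_on="short"
)

# Number of rows without a price, per brand code.
missing_by_brand = (
    products_branded.loc[products_branded["price"].isna(), "brand_short"].value_counts()
)
print(missing_by_brand)
```

On the full data, `missing_by_brand` would show directly whether the 46 missing prices are concentrated in CEL.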


In [8]:
#products.loc
orders.loc[orders.total_paid.isna()]
#orderlines.head()
#brands.head()

Unnamed: 0,order_id,created_date,total_paid,state
127701,427314,2017-11-20 18:54:39,,Pending
132013,431655,2017-11-22 12:15:24,,Pending
147316,447411,2017-11-27 10:32:37,,Pending
148833,448966,2017-11-27 18:54:15,,Pending
149434,449596,2017-11-27 21:52:08,,Pending


In [11]:
#products.drop_duplicates(inplace=True)
products.duplicated().sum()

8746

In [12]:
orders.duplicated().sum()

0

In [13]:
orderlines.duplicated().sum()

0

In [14]:
brands.duplicated().sum()

0

There are 8746 duplicate rows in the products table, which we will get rid of shortly.
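
Removing the duplicates could look like the sketch below. `products_sample` is an illustrative stand-in with one fully duplicated row; the extra check on `sku` is an addition not found in the cells above, included because rows can share a sku while differing in other columns.

```python
import pandas as pd

# Illustrative stand-in with one fully duplicated row.
products_sample = pd.DataFrame({
    "sku": ["AII0008", "AII0008", "CEL0012"],
    "promo_price": ["22.99", "22.99", "4.99"],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(products_sample.duplicated().sum())

# Keep the first occurrence of each identical row and rebuild the index.
products_dedup = products_sample.drop_duplicates().reset_index(drop=True)

# Check whether any sku still occurs more than once after de-duplication.
sku_still_duplicated = bool(products_dedup["sku"].duplicated().any())
```

Assigning the result to a new variable instead of using `inplace=True` keeps the raw table available for comparison.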