# Data consistency
So far we have looked at each DataFrame individually. Now it's time to look at all DataFrames together and check whether the information they contain is consistent. This includes questions like:

- Are huge price differences explainable with discounts etc. when comparing the prices in the products df and the orders df?
- Is every product ordered present in the products table? 
- Are there significant datetime differences?

If we do not have a solution to inconsistencies that might arise, we'll have to get rid of inconsistent data. At the end of this notebook we want to have merged our tables into one big dataframe, which has all the information we need for our data analysis and visualization. 

In [1]:
import pandas as pd

In [2]:
brands = pd.read_csv('data/brands.csv')
orders = pd.read_csv('data/orders_clean.csv', parse_dates=['created_date'])
orderlines = pd.read_csv('data/orderlines_clean.csv', parse_dates=['date'])
products = pd.read_csv('data/products_clean.csv')

Note that we have to tell pandas which columns should be treated as datetimes. Otherwise the dtype of our columns would have been `object`.
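
To illustrate the effect of `parse_dates`, here is a minimal sketch that reads a small in-memory CSV instead of the project files. The two sample rows and the names `csv_text`, `raw` and `parsed` are made up for this example.

```python
import io
import pandas as pd

# Two made-up rows standing in for a file like orders_clean.csv
csv_text = "order_id,created_date\n1,2017-01-01 00:07:19\n2,2017-01-02 13:45:00\n"

# Without parse_dates the timestamp column is read as plain strings
raw = pd.read_csv(io.StringIO(csv_text))
print(raw['created_date'].dtype)

# With parse_dates pandas converts the column to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['created_date'])
print(parsed['created_date'].dtype)
```

Only the parsed column supports datetime operations such as `parsed['created_date'].dt.year`.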

## Inconsistencies regarding the sku

We want to exclude orders which contain products that do not appear in the products list.
Since orderlines also has a sku column, we can compare the orderlines and products df.

In [3]:
orderlines['check_products'] = ~orderlines.sku.isin(products.sku)

Now we should remove every id_order that contains a product that is not in the products table.
Note that every row with such an id_order should be removed, not only the rows with a non-existing sku.
To do this, we create a new table, orderlines_id_remove, which is grouped by id_order and sums up check_products.
To make this easier, check_products was defined above with a negation (`~`): it is True if the sku is NOT in the products table.
Summing it per order therefore counts the unknown skus. If that count is > 0, we remove the id_order.

In [4]:
orderlines_id_remove = orderlines.groupby('id_order').agg({'check_products': 'sum'})

We keep only those id_orders where orderlines_id_remove.check_products == 0.
We also reset the index, so that id_order becomes a regular column again. Otherwise accessing it in the next step would fail.

In [5]:
orderlines_id_remove = orderlines_id_remove.loc[orderlines_id_remove.check_products == 0].reset_index()

Now we keep only those rows of orderlines whose id_order appears in orderlines_id_remove.

In [6]:
orderlines = orderlines.loc[orderlines.id_order.isin(orderlines_id_remove.id_order)]
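
The filter built in the cells above can also be written without a helper table. The following is only a sketch on made-up data (`products_toy`, `orderlines_toy` and the sku values are hypothetical), not the code used in this notebook: `groupby(...).transform('any')` flags every row of an order as soon as one of its rows has an unknown sku.

```python
import pandas as pd

# Made-up miniature versions of the products and orderlines tables
products_toy = pd.DataFrame({'sku': ['AAA0001', 'BBB0002']})
orderlines_toy = pd.DataFrame({
    'id_order': [1, 1, 2, 3],
    'sku': ['AAA0001', 'XXX9999', 'BBB0002', 'AAA0001'],
})

# True where the sku is NOT in the products table
unknown_sku = ~orderlines_toy['sku'].isin(products_toy['sku'])

# Broadcast the flag to every row that belongs to the same order
order_has_unknown = unknown_sku.groupby(orderlines_toy['id_order']).transform('any')

# Keep only orders in which every sku is known
orderlines_toy_clean = orderlines_toy.loc[~order_has_unknown]
print(orderlines_toy_clean)
```

In this toy data, order 1 contains one unknown sku, so both of its rows are dropped, including the row with a valid sku.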

Since orderlines.id_order should match orders.order_id, we also remove rows from orders and orderlines whose id does not appear in both tables.

In [7]:
orders = orders.loc[orders.order_id.isin(orderlines.id_order)]

In [8]:
orderlines = orderlines.loc[orderlines.id_order.isin(orders.order_id)]
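
A quick way to convince ourselves that filtering in both directions leaves the same ids in both tables is to compare the two id sets afterwards. This is a sketch on made-up data (`orders_toy` and `orderlines_toy` are hypothetical stand-ins), not a cell from the original notebook.

```python
import pandas as pd

# Made-up ids: order 1 has no orderlines, order 4 has no entry in orders
orders_toy = pd.DataFrame({'order_id': [1, 2, 3]})
orderlines_toy = pd.DataFrame({'id_order': [2, 3, 3, 4]})

# Same two filter steps as above, applied to the toy tables
orders_toy = orders_toy.loc[orders_toy['order_id'].isin(orderlines_toy['id_order'])]
orderlines_toy = orderlines_toy.loc[orderlines_toy['id_order'].isin(orders_toy['order_id'])]

# After both steps the two tables should refer to exactly the same orders
ids_match = set(orders_toy['order_id']) == set(orderlines_toy['id_order'])
print(ids_match)
```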

We don't need the column 'check_products' anymore, so it can be dropped.

In [9]:
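# Reassign instead of using inplace=True, since orderlines is a filtered slice of the original table
orderlines = orderlines.drop(columns=['check_products'])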

## Merging tables

We'll now merge the tables and investigate further inconsistencies afterwards.
Since orderlines and products share the sku column, and the order ids in orders and orderlines match, joining these tables doesn't pose any difficulties. To merge the brands table, we need to create a column with the first 3 characters of the sku.

In [10]:
orderlines['short'] = orderlines.sku.str[:3]
orderlines.head()

|   | id | id_order | product_id | product_quantity | sku | unit_price | date | short |
|---|---|---|---|---|---|---|---|---|
| 0 | 1119109 | 299539 | 0 | 1 | OTT0133 | 18.99 | 2017-01-01 00:07:19 | OTT |
| 1 | 1119110 | 299540 | 0 | 1 | LGE0043 | 399.0 | 2017-01-01 00:19:45 | LGE |
| 2 | 1119111 | 299541 | 0 | 1 | PAR0071 | 474.05 | 2017-01-01 00:20:57 | PAR |
| 3 | 1119112 | 299542 | 0 | 1 | WDT0315 | 68.39 | 2017-01-01 00:51:40 | WDT |
| 4 | 1119113 | 299543 | 0 | 1 | JBL0104 | 23.74 | 2017-01-01 01:06:38 | JBL |


Now we can create our big dataframe, which we will simply call df.
All the merging could have been done in one step. We split it here for simplicity.

In [11]:
df = pd.merge(orderlines, products, how = 'left', on='sku')
df = df.merge(orders, how='left', left_on='id_order', right_on='order_id')
df = df.merge(brands, how='left', on='short')
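
A left merge silently multiplies rows if the right-hand table contains duplicate keys. As a hedged sketch (the toy tables and their values are made up, and the notebook itself does not use this option), the `validate` argument of `pd.merge` can guard against that by raising a `MergeError` when the keys on the right are not unique.

```python
import pandas as pd

# Made-up toy tables: two orderlines referring to the same sku
orderlines_toy = pd.DataFrame({'id_order': [1, 2], 'sku': ['AAA0001', 'AAA0001']})
products_ok = pd.DataFrame({'sku': ['AAA0001'], 'price': [10.0]})
products_dup = pd.DataFrame({'sku': ['AAA0001', 'AAA0001'], 'price': [10.0, 12.0]})

# With unique skus on the right, the row count of the left table is preserved
merged = pd.merge(orderlines_toy, products_ok, how='left', on='sku',
                  validate='many_to_one')
print(len(merged) == len(orderlines_toy))

# With duplicated skus on the right, validate raises instead of duplicating rows
try:
    pd.merge(orderlines_toy, products_dup, how='left', on='sku',
             validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```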

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 290588 entries, 0 to 290587
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                290588 non-null  int64         
 1   id_order          290588 non-null  int64         
 2   product_id        290588 non-null  int64         
 3   product_quantity  290588 non-null  int64         
 4   sku               290588 non-null  object        
 5   unit_price        290588 non-null  float64       
 6   date              290588 non-null  datetime64[ns]
 7   short             290588 non-null  object        
 8   name              290588 non-null  object        
 9   desc              290588 non-null  object        
 10  price             290588 non-null  float64       
 11  promo_price       290588 non-null  float64       
 12  in_stock          290588 non-null  int64         
 13  type              290588 non-null  object        
 14  orde

One thing we notice is that the `long` column, which stores the brand name, has missing values.
With some research we could impute them by hand using the `short` column. This isn't needed for our later analysis though, since the 288 affected rows (about 0.1% of the 290,588 rows) hardly have an impact.
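
If we did want to impute the brand names, a first step would be to list which `short` codes have no match. The snippet below is a sketch on made-up data (`df_toy`, the codes and the brand names are hypothetical) and only shows the pattern.

```python
import pandas as pd

# Made-up stand-in for the merged table: 'ZZZ' has no brand name
df_toy = pd.DataFrame({
    'short': ['AAA', 'BBB', 'ZZZ', 'ZZZ'],
    'long': ['Brand A', 'Brand B', None, None],
})

# Boolean mask of rows without a brand name
missing_brand = df_toy['long'].isna()

# Number and share of affected rows
print(missing_brand.sum(), missing_brand.mean())

# Which short codes are affected, and how often
unmatched_codes = df_toy.loc[missing_brand, 'short'].value_counts()
print(unmatched_codes)
```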

Next we'll rename the column `long` to `brand` and drop duplicate columns as well as columns we don't need. Afterwards we will reorder the columns so that the important information comes first.

In [13]:
df.rename(columns={'long': 'brand'}, inplace=True)

In [14]:
df.drop(columns=['order_id', 'promo_price', 'short', 'sku', 'product_id', 'id'], inplace=True)

In [15]:
df.columns

Index(['id_order', 'product_quantity', 'unit_price', 'date', 'name', 'desc',
       'price', 'in_stock', 'type', 'created_date', 'total_paid', 'state',
       'brand'],
      dtype='object')

In [16]:
cols= ['id_order', 'name', 'desc', 'brand',
       'product_quantity', 'unit_price', 'total_paid', 'price', 
       'date', 'created_date',
       'state',
       'in_stock', 'type']
df = df[cols]

## Looking at dates
