# Data overview

## orders.csv 
Every row in this file represents an order.

* **order_id** – a unique identifier for each order
* **created_date** – a timestamp for when the order was created
* **total_paid** – the total amount paid by the customer for this order, in euros
* **state** – the current status of the order, one of:
    * “Shopping basket” – products have been placed in the shopping basket
    * “Place Order” – the order has been placed but is awaiting shipment details
    * “Pending” – the order is awaiting payment confirmation
    * “Completed” – the order has been placed and paid for, and the transaction is complete
    * “Cancelled” – the order has been cancelled and the payment returned to the customer
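
As a rough illustration of how these columns might be used, the sketch below parses a few orders and keeps only the “Completed” ones. The sample rows are made up, the timestamp format `%Y-%m-%d %H:%M:%S` is an assumption, and `load_orders` is a hypothetical helper rather than part of the project's `data_utils` module. It uses only the standard library so that it runs without the real files.

```python
import csv
import io
from datetime import datetime

# Made-up sample rows that follow the column layout described above.
SAMPLE_ORDERS = """order_id,created_date,total_paid,state
1,2017-01-01 00:07:19,18.99,Completed
2,2017-01-01 00:19:45,399.00,Shopping basket
3,2017-01-01 00:20:57,474.05,Cancelled
4,2017-01-01 00:51:40,68.39,Completed
"""


def load_orders(text):
    """Parse orders CSV text into a list of dicts with typed values (hypothetical helper)."""
    orders = []
    for row in csv.DictReader(io.StringIO(text)):
        orders.append({
            "order_id": int(row["order_id"]),
            # Assumed timestamp format; adjust if the real file differs.
            "created_date": datetime.strptime(row["created_date"], "%Y-%m-%d %H:%M:%S"),
            "total_paid": float(row["total_paid"]),
            "state": row["state"],
        })
    return orders


orders = load_orders(SAMPLE_ORDERS)

# Keep only orders whose transaction actually went through.
completed = [o for o in orders if o["state"] == "Completed"]
completed_revenue = round(sum(o["total_paid"] for o in completed), 2)
print(len(completed), completed_revenue)
```

Filtering on `state` first matters because the other states describe orders that were never paid for or were refunded.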

## orderlines.csv 
Every row represents one product within an order, so an order containing several different products spans several rows.

* **id** – a unique identifier for each row in this file
* **id_order** – corresponds to orders.order_id
* **product_id** – a legacy product identifier that is no longer in use
* **product_quantity** – how many units of that product were purchased in that order
* **sku** – stock keeping unit: a unique identifier for each product
* **unit_price** – the unit price (in euros) of each product at the moment of placing that order
* **date** – a timestamp for when that product line was processed
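
Because `id_order` corresponds to `orders.order_id`, the lines of an order can be summed and set against `total_paid`. The sketch below does this on made-up rows using plain dictionaries. The source does not say whether the two amounts always agree, so the code only reports the difference instead of assuming it is zero.

```python
from collections import defaultdict

# Made-up order lines following the columns described above.
orderlines = [
    {"id": 1, "id_order": 10, "sku": "JBL0104", "product_quantity": 2, "unit_price": 23.74},
    {"id": 2, "id_order": 10, "sku": "APP0698", "product_quantity": 1, "unit_price": 9.99},
    {"id": 3, "id_order": 11, "sku": "OTT0133", "product_quantity": 1, "unit_price": 18.99},
]

# Made-up orders, keyed by order_id.
orders = {
    10: {"state": "Completed", "total_paid": 57.47},
    11: {"state": "Cancelled", "total_paid": 18.99},
}

# Sum quantity * unit price per order.
order_totals = defaultdict(float)
for line in orderlines:
    order_totals[line["id_order"]] += line["product_quantity"] * line["unit_price"]

# Compare the summed lines with what the customer paid.
for order_id, lines_total in sorted(order_totals.items()):
    difference = orders[order_id]["total_paid"] - lines_total
    print(order_id, round(lines_total, 2), round(difference, 2))
```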

## products.csv

* **sku** – stock keeping unit: a unique identifier for each product
* **name** – product name
* **desc** – product description
* **price** – base price of the product, in euros
* **promo_price** – promotional price, in euros
* **in_stock** – whether or not the product was in stock at the moment of the data extraction
* **type** – a numerical code for product type

## brands.csv

* **short** – the 3-character brand code, which matches the first 3 characters of products.sku
* **long** – brand name
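
The link between `brands.short` and `products.sku` can be sketched as a prefix lookup. The brand codes, brand names and SKUs below are invented for illustration, and `brand_for` is a hypothetical helper, not a function from the project.

```python
# Made-up brands table: short code -> long brand name.
brands = {
    "AAA": "Example Brand A",
    "BBB": "Example Brand B",
}


def brand_for(sku, brands):
    """Return the brand name for a SKU, or None if the prefix is unknown."""
    # The first 3 characters of the SKU are the brand's short code.
    return brands.get(sku[:3])


print(brand_for("AAA0001", brands))
print(brand_for("ZZZ0001", brands))
```

Returning `None` for an unknown prefix makes SKUs without a matching brand easy to spot, rather than failing on the first one.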

# Data cleaning

```python
import data_utils

# Import the data
orderlines_clean = data_utils.clean_orderlines()
orders_clean = data_utils.clean_orders()
brands_clean = data_utils.clean_brands()
products_clean = data_utils.clean_products()

# Merge the data
completed_sales = data_utils.merge_data(orders_clean, orderlines_clean, products_clean, brands_clean)
```

| | id | id_order | product_id | product_quantity | sku | unit_price | date |
|---|---|---|---|---|---|---|---|
| 0 | 1119109 | 299539 | 0 | 1 | OTT0133 | 18.99 | 2017-01-01 00:07:19 |
| 1 | 1119110 | 299540 | 0 | 1 | LGE0043 | 399.00 | 2017-01-01 00:19:45 |
| 2 | 1119111 | 299541 | 0 | 1 | PAR0071 | 474.05 | 2017-01-01 00:20:57 |
| 3 | 1119112 | 299542 | 0 | 1 | WDT0315 | 68.39 | 2017-01-01 00:51:40 |
| 4 | 1119113 | 299543 | 0 | 1 | JBL0104 | 23.74 | 2017-01-01 01:06:38 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 293978 | 1650199 | 527398 | 0 | 1 | JBL0122 | 42.99 | 2018-03-14 13:57:25 |
| 293979 | 1650200 | 527399 | 0 | 1 | PAC0653 | 141.58 | 2018-03-14 13:57:34 |
| 293980 | 1650201 | 527400 | 0 | 2 | APP0698 | 9.99 | 2018-03-14 13:57:41 |
| 293981 | 1650202 | 527388 | 0 | 1 | BEZ0204 | 19.99 | 2018-03-14 13:58:01 |
