# Data overview

## orders.csv 
Every row in this file represents an order.

* **order_id** – a unique identifier for each order
* **created_date** – a timestamp for when the order was created
* **total_paid** – the total amount paid by the customer for this order, in euros
* **state** – the current status of the order, one of:
    * “Shopping basket” – products have been placed in the shopping basket
    * “Place Order” – the order has been placed but is awaiting shipment details
    * “Pending” – the order is awaiting payment confirmation
    * “Completed” – the order has been placed and paid for, and the transaction is complete
    * “Cancelled” – the order has been cancelled and the payment returned to the customer

## orderlines.csv 
Every row represents one product within an order.

* **id** – a unique identifier for each row in this file
* **id_order** – corresponds to orders.order_id
* **product_id** – a legacy product identifier, no longer in use
* **product_quantity** – how many units of the product were purchased in the order
* **sku** – stock keeping unit: a unique identifier for each product
* **unit_price** – the unit price (in euros) of the product at the time the order was placed
* **date** – timestamp for the processing of that product

## products.csv

* **sku** – stock keeping unit: a unique identifier for each product
* **name** – product name
* **desc** – product description
* **price** – base price of the product, in euros
* **promo_price** – promotional price, in euros
* **in_stock** – whether or not the product was in stock at the moment of the data extraction
* **type** – a numerical code for product type

## brands.csv

* **short** – the 3-character code by which the brand can be identified in the first 3 characters of products.sku
* **long** – brand name
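
The column descriptions above imply how the four files link together: `orderlines.id_order` points at `orders.order_id`, `orderlines.sku` matches `products.sku`, and the first three characters of a `sku` match `brands.short`. A minimal sketch of those joins follows, using tiny made-up frames; every value in it is invented for illustration and not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the four files; all values are invented for illustration.
orders = pd.DataFrame({"order_id": [1, 2], "state": ["Completed", "Pending"]})
orderlines = pd.DataFrame({
    "id": [10, 11, 12],
    "id_order": [1, 1, 2],
    "sku": ["APP0023", "KIN0007", "APP0023"],
    "product_quantity": [1, 2, 1],
})
products = pd.DataFrame({
    "sku": ["APP0023", "KIN0007"],
    "name": ["Keyboard", "Memory"],
})
brands = pd.DataFrame({"short": ["APP", "KIN"], "long": ["Apple", "Kingston"]})

joined = (
    orderlines
    # orderlines.id_order corresponds to orders.order_id
    .merge(orders, left_on="id_order", right_on="order_id", how="left")
    # sku is the shared product key
    .merge(products, on="sku", how="left")
    # the brand code is the first 3 characters of the sku
    .assign(short=lambda d: d["sku"].str[:3])
    .merge(brands, on="short", how="left")
)

print(joined[["id", "order_id", "sku", "name", "long", "state"]])
```

Left joins keep every order line even when a lookup fails, so unmatched keys show up as missing values rather than as silently dropped rows.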

# Data cleaning
## Import the data

In [7]:
import pandas as pd
import numpy as np
import re

path = '../data/'
orderlines = pd.read_csv(path + 'orderlines.csv', 
                                  dtype={'id': int, 
                                         'id_order': int, 
                                         'product_id': int,
                                         'product_quantity': int,
                                         'sku': str, 
                                         'unit_price': str}, 
                                  parse_dates=['date'])

orders = pd.read_csv(path + 'orders.csv', 
                              dtype={'order_id': int, 
                                     'total_paid': float, 
                                     'state': str}, 
                              parse_dates=['created_date'])

brands = pd.read_csv(path + 'brands.csv',
                     dtype={'short': str, 
                            'long': str})

products = pd.read_csv(path + 'products.csv',
                       dtype={'sku': str, 
                              'name': str, 
                              'desc': str,
                              'price': str,
                              'promo_price': str, 
                              'in_stock': int,
                              'type': str})
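
Because each `dtype` mapping is keyed by column name, a key that does not match any column header is easy to overlook. The sketch below reads only the header row and reports such keys. The helper `unknown_dtype_keys` is hypothetical (it is not part of pandas or of this notebook), and the sample CSV is invented for illustration.

```python
import io
import pandas as pd

def unknown_dtype_keys(csv_source, dtype):
    """Return the dtype keys that do not match any column in the CSV header."""
    # nrows=0 parses the header only, so this stays cheap for large files.
    header = pd.read_csv(csv_source, nrows=0).columns
    return sorted(set(dtype) - set(header))

# Toy CSV standing in for a real file (contents invented for illustration).
sample = io.StringIO("id,sku,unit_price\n1,APP0023,59.99\n")

# 'price' is not a column of the sample, so it is reported.
print(unknown_dtype_keys(sample, {"id": int, "sku": str, "price": str}))
```

In the notebook this could be called with a file path and the same dictionary that is passed to `pd.read_csv`, before the full import runs.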

## Initial cleaning process
### Clean orders

In [34]:
def start_pipeline(df):
    '''Make a copy of the DataFrame so the original data is not modified'''
    return df.copy()

def remove_missing_data(df, col):
    '''Drop the rows in which `col` is missing'''
    return df[~df[col].isna()]

    
orders_clean = (orders
                .pipe(start_pipeline)
                .pipe(remove_missing_data, col='total_paid')
                )

print(f"{orders.shape[0]-orders_clean.shape[0]} missing values were removed from orders.")
print(f"This represents {(orders.shape[0]-orders_clean.shape[0])/orders.shape[0] * 100:.2f}% of the data.")
print(f"orders.shape: {orders.shape}")
print(f"orders_clean.shape: {orders_clean.shape}")

# Save the data
orders_clean.to_csv(path + 'orders_clean.csv', index=False)

5 missing values were removed from orders.
This represents 0.00% of the data.
orders.shape: (226909, 4)
orders_clean.shape: (226904, 4)
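
Before dropping rows, it can help to see which columns actually contain missing values. The sketch below shows that check on a toy frame; `toy_orders` and its values are assumptions made up for illustration, not rows from orders.csv.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; two rows lack total_paid.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [10.0, np.nan, 25.5, np.nan],
    "state": ["Completed", "Pending", "Completed", "Cancelled"],
})

# Count missing values per column.
missing_per_column = toy_orders.isna().sum()
print(missing_per_column)

# Keep only the rows where total_paid is present.
cleaned = toy_orders[toy_orders["total_paid"].notna()]
print(f"{len(toy_orders) - len(cleaned)} rows removed")
```

`toy_orders.dropna(subset=["total_paid"])` selects the same rows and is a common alternative spelling of the boolean filter.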


### Clean orderlines

In [35]:
def drop_deprecated_columns(df, col_list):
    return (df
            .drop(col_list, axis=1)
           )

def rename_columns(df, col_dict):
    return (df
            .rename(columns=col_dict)
           )

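# Transform the unit_price column to floats
def transform_unit_price_to_floats(df):
    # Values with two dots (e.g. '1.299.99') are treated as having a thousands
    # separator first, so that dot is dropped; values with one dot are kept as is.
    return (
        df.assign(unit_price = df.unit_price.str.split('.')
                  .apply(lambda x : x[0]+x[1]+'.'+x[2] if len(x)==3 else x[0]+'.'+ x[1])
                  .astype(float)
        )
    )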

def create_short_col(df):
    '''Add the brand code: the first 3 characters of the sku'''
    return df.assign(short = lambda frame: frame['sku'].str[:3])

orderlines_clean = (orderlines
                    .pipe(start_pipeline)
                    .pipe(drop_deprecated_columns, col_list=['product_id'])
                    .pipe(rename_columns, {'id_order': 'order_id'})
                    .pipe(transform_unit_price_to_floats)
                    .pipe(create_short_col)
                    )

print(f"{orderlines.shape[0]-orderlines_clean.shape[0]} missing values were removed from orderlines.")
print(f"This represents {(orderlines.shape[0]-orderlines_clean.shape[0])/orderlines.shape[0] * 100:.2f}% of the data.")
print(f"orderlines.shape: {orderlines.shape}")
print(f"orderlines_clean.shape: {orderlines_clean.shape}")

# Save the data
orderlines_clean.to_csv(path + 'orderlines_clean.csv', index=False)

0 missing values were removed from orderlines.
This represents 0.00% of the data.
orderlines.shape: (293983, 7)
orderlines_clean.shape: (293983, 7)
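
The price conversion in this cell expects every `unit_price` string to contain one or two dots. Below is a more defensive variant, offered as a sketch and not as the notebook's method. It assumes that the last dot is always the decimal point and that any earlier dots are thousands separators. The function name `repair_price` and the sample strings are hypothetical.

```python
import pandas as pd

def repair_price(text):
    """Treat the last dot as the decimal point and drop any earlier dots."""
    if pd.isna(text):
        return float("nan")
    head, dot, tail = str(text).rpartition(".")
    if not dot:  # no dot at all, e.g. "59"
        return float(tail)
    return float(head.replace(".", "") + "." + tail)

# Invented sample strings covering two dots, one dot, and no dot.
raw = pd.Series(["1.299.99", "59.99", "59"])
print(raw.map(repair_price).tolist())  # [1299.99, 59.99, 59.0]
```

Whether the "last dot is the decimal point" assumption holds for every row is something to verify against the data before relying on the converted values.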


### Clean products

In [37]:
# Check for products without descriptions
names_of_products_without_descriptions = products[products.desc.isna()].name.tolist()

# Add missing descriptions
missing_product_descriptions = [
    '2TB Mac hard drive and Nas',
    'Apple keyboard for iPad 9.7',
    'NAS server with 10GB RAM',
    'Ethernet adapter for Macbook 12',
    'Luxury power bank combined with powder, 2 mirrors - normal and 3x magnification, Illuminated under mirror with LED, Low weight and compact dimensions',
    'Battery capacity: 20,000 mAh; ultra-stable: outer shell made of durable synthetic rubber (military standard, withstands drops from up to 2 metres) ; protection: dust and splash proof: military standard iP54; battery level indicator and super fast charging; USB port can be connected to charger and other devices',
    'Smart thermostat designed to provide automatic time and temperature control of heating systems in homes and apartments. '
]

def add_missing_product_descriptions(df):
    # Relies on the two lists above being in the same order
    for name, desc in zip(names_of_products_without_descriptions, missing_product_descriptions):
        df.loc[df['name'] == name, 'desc'] = desc
    return df

def drop_duplicate_rows_by_column(df, col):
    return df.drop_duplicates(subset=col)


products_clean = (products
        .pipe(start_pipeline)
        .pipe(drop_deprecated_columns, col_list=['type', 'in_stock']) 
        .pipe(add_missing_product_descriptions)
        .pipe(remove_missing_data, col='price')
        .pipe(drop_duplicate_rows_by_column, 'sku')
)

print(f"{products.shape[0]-products_clean.shape[0]} missing values were removed from products")
print(f"This represents {(products.shape[0]-products_clean.shape[0])/products.shape[0] * 100:.2f}% of the data.")
print(f"products.shape: {products.shape}")
print(f"products_clean.shape: {products_clean.shape}")

# Save the data
products_clean.to_csv(path + 'products_clean.csv', index=False)

8792 missing values were removed from products
This represents 45.49% of the data.
products.shape: (19326, 7)
products_clean.shape: (10534, 5)
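
Pairing product names and descriptions by list position works only while both lists stay in the same order. A dictionary keyed by product name makes the pairing explicit. The sketch below shows that idea together with the duplicate removal, on a toy frame; `toy_products`, `descriptions_by_name`, and all values are assumptions invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values: one missing description, one duplicated sku.
toy_products = pd.DataFrame({
    "sku": ["AAA0001", "BBB0002", "BBB0002"],
    "name": ["Hard drive", "Keyboard", "Keyboard"],
    "desc": [np.nan, "USB keyboard", "USB keyboard"],
})

# Hypothetical lookup from product name to replacement description.
descriptions_by_name = {"Hard drive": "2TB external hard drive"}

# Fill only the missing descriptions, looking each one up by product name.
filled = toy_products.assign(
    desc=lambda d: d["desc"].fillna(d["name"].map(descriptions_by_name))
)

# drop_duplicates keeps the first row for each sku by default.
deduplicated = filled.drop_duplicates(subset="sku")
print(deduplicated)
```

Existing descriptions are left untouched because `fillna` only writes where the value is missing.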


In [27]:
products_clean

| | sku | name | desc | price | promo_price |
|---|---|---|---|---|---|
| 0 | RAI0007 | Silver Rain Design mStand Support | Aluminum support compatible with all MacBook | 59.99 | 499.899 |
| 1 | APP0023 | Apple Mac Keyboard Keypad Spanish | USB ultrathin keyboard Apple Mac Spanish. | 59 | 589.996 |
| 2 | APP0025 | Mighty Mouse Apple Mouse for Mac | mouse Apple USB cable. | 59 | 569.898 |
| 3 | APP0072 | Apple Dock to USB Cable iPhone and iPod white | IPhone dock and USB Cable Apple iPod. | 25 | 229.997 |
| 4 | KIN0007 | Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM | 2GB RAM Mac mini and iMac (2006/07) MacBook Pr... | 34.99 | 31.99 |
| ... | ... | ... | ... | ... | ... |
| 19321 | BEL0376 | Belkin Travel Support Apple Watch Black | compact and portable stand vertically or horiz... | 29.99 | 269.903 |
| 19322 | THU0060 | Enroute Thule 14L Backpack MacBook 13 "Black | Backpack with capacity of 14 liter compartment... | 69.95 | 649.903 |
| 19323 | THU0061 | Enroute Thule 14L Backpack MacBook 13 "Blue | Backpack with capacity of 14 liter compartment... | 69.95 | 649.903 |
| 19324 | THU0062 | Enroute Thule 14L Backpack MacBook 13 "Red | Backpack with capacity of 14 liter compartment... | 69.95 | 649.903 |


In [32]:
pd.set_option('display.max_colwidth', None)

print(products_clean.loc[products_clean.name == names_of_products_without_descriptions[4], ['name', 'desc']])

pd.reset_option('display.max_colwidth')

                                                              name  \
18490  Hyper Pearl 1600mAh battery Mini USB Mirror and Comic Blond   

                                                                                                                                                        desc  
18490  Luxury power bank combined with powder, 2 mirrors - normal and 3x magnification, Illuminated under mirror with LED, Low weight and compact dimensions  
