# Eniac

## Importing Data

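Loading the four cleaned CSV files from GitHub into DataFrames. The cell below keeps one variable per table rather than a dictionary of dataframes.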

In [65]:
import pandas as pd

# URLs for raw content of the CSV files on GitHub
orders_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/orders_clean.csv"
orderlines_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/orderlines_clean.csv"
products_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/products_clean.csv"
brands_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/brands_clean.csv"


# Loading dataframes directly from GitHub
orders = pd.read_csv(orders_url)
orderlines = pd.read_csv(orderlines_url)
products = pd.read_csv(products_url)
brands = pd.read_csv(brands_url)
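
If a dictionary of dataframes is preferred, the loading step could be sketched roughly as below. The helper name `load_tables` and the tiny in-memory CSV strings are made up for illustration, so that the sketch runs without network access. They are not part of the original notebook.

```python
import io
import pandas as pd


def load_tables(sources):
    """Read each CSV source (URL, path or file-like object) into a DataFrame, keyed by table name."""
    return {name: pd.read_csv(src) for name, src in sources.items()}


# Tiny in-memory stand-ins for the real files (made-up rows, illustration only)
demo_sources = {
    "orders": io.StringIO("order_id,total_paid\n1,10.0\n2,25.5\n"),
    "brands": io.StringIO("short,long\nAPP,Apple\nJBL,JBL\n"),
}

tables = load_tables(demo_sources)
for name, df in tables.items():
    print(name, df.shape)
```

With the real data, the same helper would presumably be called with a dictionary such as `{"orders": orders_url, "orderlines": orderlines_url, ...}`, and each table accessed as `tables["orders"]`.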

Keys of tables:

order_id: orders, orderlines (called id_order in orderlines until it is renamed below)

sku: products, orderlines. The first three letters of sku are the short brand name, which links to the short column in brands.

Questions:

What do id and unit_price mean in orderlines?

What is total_paid in orders? What is created_date in orders?

In [66]:
# Rename column to match with orders (better readability)
orderlines.rename(columns={'id_order': 'order_id'}, inplace=True)
orderlines

Unnamed: 0,id,order_id,product_quantity,sku,unit_price,date
0,1119109,299539,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,1,LGE0043,399.00,2017-01-01 00:19:45
2,1119111,299541,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,1,JBL0104,23.74,2017-01-01 01:06:38
...,...,...,...,...,...,...
293978,1650199,527398,1,JBL0122,42.99,2018-03-14 13:57:25
293979,1650200,527399,1,PAC0653,141.58,2018-03-14 13:57:34
293980,1650201,527400,2,APP0698,9.99,2018-03-14 13:57:41
293981,1650202,527388,1,BEZ0204,19.99,2018-03-14 13:58:01


In [67]:
orderlines['short_brand'] = orderlines['sku'].str[:3]
orderlines

Unnamed: 0,id,order_id,product_quantity,sku,unit_price,date,short_brand
0,1119109,299539,1,OTT0133,18.99,2017-01-01 00:07:19,OTT
1,1119110,299540,1,LGE0043,399.00,2017-01-01 00:19:45,LGE
2,1119111,299541,1,PAR0071,474.05,2017-01-01 00:20:57,PAR
3,1119112,299542,1,WDT0315,68.39,2017-01-01 00:51:40,WDT
4,1119113,299543,1,JBL0104,23.74,2017-01-01 01:06:38,JBL
...,...,...,...,...,...,...,...
293978,1650199,527398,1,JBL0122,42.99,2018-03-14 13:57:25,JBL
293979,1650200,527399,1,PAC0653,141.58,2018-03-14 13:57:34,PAC
293980,1650201,527400,2,APP0698,9.99,2018-03-14 13:57:41,APP
293981,1650202,527388,1,BEZ0204,19.99,2018-03-14 13:58:01,BEZ
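
Before merging on the prefix, it may be worth checking whether every extracted prefix actually exists in brands. A minimal sketch on made-up rows; the sku `XYZ0001` is hypothetical and only there to show an unmatched prefix.

```python
import pandas as pd

# Made-up sample rows; column names follow the tables above
orderlines_demo = pd.DataFrame({"sku": ["OTT0133", "JBL0104", "XYZ0001"]})
brands_demo = pd.DataFrame({"short": ["OTT", "JBL"], "long": ["Otterbox", "JBL"]})

# Same extraction as in the cell above
orderlines_demo["short_brand"] = orderlines_demo["sku"].str[:3]

# Prefixes that have no counterpart in the brands table
known = orderlines_demo["short_brand"].isin(brands_demo["short"])
unknown_prefixes = orderlines_demo.loc[~known, "short_brand"].unique().tolist()
print(unknown_prefixes)
```

An empty list on the real data would suggest that the left merge on the prefix leaves no row without a brand name.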


In [68]:
# Left-merge orderlines with brands on the short brand code (both tables are small enough for this to be cheap)
orderlines_brands = orderlines.merge(brands, left_on='short_brand', right_on='short', how='left')
orderlines_brands

Unnamed: 0,id,order_id,product_quantity,sku,unit_price,date,short_brand,short,long
0,1119109,299539,1,OTT0133,18.99,2017-01-01 00:07:19,OTT,OTT,Otterbox
1,1119110,299540,1,LGE0043,399.00,2017-01-01 00:19:45,LGE,LGE,LG
2,1119111,299541,1,PAR0071,474.05,2017-01-01 00:20:57,PAR,PAR,Parrot
3,1119112,299542,1,WDT0315,68.39,2017-01-01 00:51:40,WDT,WDT,Western Digital
4,1119113,299543,1,JBL0104,23.74,2017-01-01 01:06:38,JBL,JBL,JBL
...,...,...,...,...,...,...,...,...,...
293978,1650199,527398,1,JBL0122,42.99,2018-03-14 13:57:25,JBL,JBL,JBL
293979,1650200,527399,1,PAC0653,141.58,2018-03-14 13:57:34,PAC,PAC,Pack
293980,1650201,527400,2,APP0698,9.99,2018-03-14 13:57:41,APP,APP,Apple
293981,1650202,527388,1,BEZ0204,19.99,2018-03-14 13:58:01,BEZ,BEZ,Be.ez
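
A left merge like this only keeps the row count of orderlines if each brand code appears once in brands. One way to guard against that, sketched on made-up rows, is the `validate` argument of `merge`. Dropping the redundant `short` column afterwards is an optional tidy-up, not something the notebook does.

```python
import pandas as pd

# Made-up sample rows; column names follow the tables above
orderlines_demo = pd.DataFrame({
    "sku": ["OTT0133", "JBL0104", "JBL0122"],
    "short_brand": ["OTT", "JBL", "JBL"],
})
brands_demo = pd.DataFrame({"short": ["OTT", "JBL"], "long": ["Otterbox", "JBL"]})

merged = orderlines_demo.merge(
    brands_demo,
    left_on="short_brand",
    right_on="short",
    how="left",
    validate="many_to_one",  # raises pandas.errors.MergeError if a brand code repeats in brands
)

# 'short' duplicates 'short_brand' after the merge, so it can be dropped
merged = merged.drop(columns="short")

# A many-to-one left merge must not change the number of rows
assert len(merged) == len(orderlines_demo)
print(merged)
```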


- DataFrame **.describe()** gives basic numerical aggregations. It can be applied to a single column as well.
- DataFrame **.isna().any()** highlights which columns contain missing data.
- DataFrame **.shape** gives the number of rows and columns.
- DataFrame **.columns** gives the column names. Note that a list with new names can be assigned to this attribute to rename the columns.
- DataFrame **.columnName.isna().sum()** is a quick way to check the number of missing values in a column.
- DataFrame **.columnName.value_counts()** is a great way to summarise a categorical column. You can use it to discover how many orders are completed, cancelled, pending…
- DataFrame **.columnName.hist()** is an easy way to plot a histogram of a numerical column. Play with the bins argument to change the granularity of the graph.
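
The checks listed above can be tried on a small made-up table. The column name `state` and all values here are assumptions for illustration; the real orders table may use different names.

```python
import pandas as pd

# Made-up rows; 'state' is an assumed column name
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [10.0, None, 25.5, 99.0],
    "state": ["Completed", "Cancelled", "Completed", "Pending"],
})

print(orders_demo.shape)             # (rows, columns)
print(orders_demo.columns.tolist())  # column names
print(orders_demo.describe())        # numerical aggregations
print(orders_demo.isna().any())      # which columns contain missing data

# Number of missing values in one column
missing_total_paid = orders_demo["total_paid"].isna().sum()

# Frequency of each category
state_counts = orders_demo["state"].value_counts()
print(missing_total_paid)
print(state_counts)

# Histogram of a numerical column (needs matplotlib installed):
# orders_demo["total_paid"].hist(bins=20)
```

Bracket access (`df["total_paid"]`) is used instead of attribute access (`df.total_paid`) because it also works for column names that contain spaces or clash with DataFrame methods.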

In [69]:
# Try to calculate the promo price / promo percentage
# Compare the unit_price of orderlines with the price of products
products_orderlines = products.merge(orderlines, on='sku', how='left')
products_orderlines[['sku', 'price', 'unit_price', 'promo_price']]

Unnamed: 0,sku,price,unit_price,promo_price
0,RAI0007,59.99,54.99,49.99
1,RAI0007,59.99,49.99,49.99
2,RAI0007,59.99,49.99,49.99
3,RAI0007,59.99,49.99,49.99
4,RAI0007,59.99,49.99,49.99
...,...,...,...,...
290593,BEL0376,29.99,,26.99
290594,THU0060,69.95,,64.99
290595,THU0061,69.95,,64.99
290596,THU0062,69.95,,64.99
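
Building on the merged table, a discount percentage could be sketched as follows. This assumes that `price` is the regular list price and `unit_price` the price actually charged on the orderline, which the notes above still list as an open question. The sample values are taken from the output above.

```python
import pandas as pd

# A few rows copied from the merged output above
demo = pd.DataFrame({
    "sku": ["RAI0007", "RAI0007", "BEL0376"],
    "price": [59.99, 59.99, 29.99],
    "unit_price": [54.99, 49.99, None],
})

# Absolute and relative difference between list price and charged price
demo["discount"] = demo["price"] - demo["unit_price"]
demo["discount_pct"] = (demo["discount"] / demo["price"] * 100).round(2)

print(demo)
```

Rows with a missing unit_price come from products that never appear in orderlines (a consequence of the left merge), so their discount stays NaN.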
