# Eniac

## Importing Data

Load the four cleaned CSV files straight from GitHub, one dataframe per table.

In [1]:
import pandas as pd
import numpy as np

# URLs for raw content of the CSV files on GitHub
orders_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/orders_clean.csv"
orderlines_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/orderlines_clean.csv"
products_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/products_clean.csv"
brands_url = "https://raw.githubusercontent.com/MerleSt/Eniac/main/Data-Eniac/Data_Cleaned/brands_clean.csv"


# Loading dataframes directly from GitHub
orders = pd.read_csv(orders_url)
orderlines = pd.read_csv(orderlines_url)
products = pd.read_csv(products_url)
brands = pd.read_csv(brands_url)
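
The four tables could also be collected in a dictionary of dataframes keyed by table name, which makes it easy to loop over them. The sketch below is one possible way to do that; `load_frames` and `demo_sources` are hypothetical names, and the in-memory CSV strings are stand-ins so the example runs without network access.

```python
import io
import pandas as pd

def load_frames(sources):
    """Read every CSV source into a dataframe, keyed by table name."""
    return {name: pd.read_csv(src) for name, src in sources.items()}

# Stand-in data (assumption) so the sketch runs offline. In the notebook the
# values would be orders_url, orderlines_url, products_url and brands_url.
demo_sources = {
    "orderlines": io.StringIO("id_order,sku,unit_price\n299539,OTT0133,18.99\n"),
    "brands": io.StringIO("short,long\nOTT,Otterbox\n"),
}

frames = load_frames(demo_sources)

# Looping over the dictionary gives a quick overview of every table
for name, df in frames.items():
    print(name, df.shape)
```

With the real URLs, `frames["orders"]` would then play the role of the `orders` variable above.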

Keys of tables:

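order_id: orders, orderlines (the column is called id_order in orderlines until it is renamed below)

sku: products, orderlines. The first three letters of sku are the short brand name, which links to the short column in brands.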

Questions:

What do id and unit_price in orderlines represent?

What does total_paid in orders represent? What does created_date in orders represent?

In [2]:
# Rename column to match with orders (better readability)
orderlines.rename(columns={'id_order': 'order_id'}, inplace=True)

In [3]:
orderlines['short_brand'] = orderlines['sku'].str[:3]

In [4]:
# Left-merge orderlines with brands on the three-letter brand code; both tables are small enough to merge directly
orderlines_brands = orderlines.merge(brands, left_on='short_brand', right_on='short', how='left')
orderlines_brands

Unnamed: 0,order_id,product_quantity,sku,unit_price,date,short_brand,short,long
0,299539,1,OTT0133,18.99,2017-01-01 00:07:19,OTT,OTT,Otterbox
1,299540,1,LGE0043,399.00,2017-01-01 00:19:45,LGE,LGE,LG
2,299541,1,PAR0071,474.05,2017-01-01 00:20:57,PAR,PAR,Parrot
3,299542,1,WDT0315,68.39,2017-01-01 00:51:40,WDT,WDT,Western Digital
4,299543,1,JBL0104,23.74,2017-01-01 01:06:38,JBL,JBL,JBL
...,...,...,...,...,...,...,...,...
293978,527398,1,JBL0122,42.99,2018-03-14 13:57:25,JBL,JBL,JBL
293979,527399,1,PAC0653,141.58,2018-03-14 13:57:34,PAC,PAC,Pack
293980,527400,2,APP0698,9.99,2018-03-14 13:57:41,APP,APP,Apple
293981,527388,1,BEZ0204,19.99,2018-03-14 13:58:01,BEZ,BEZ,Be.ez
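
A left merge silently keeps order lines whose brand code has no entry in brands. The sketch below shows one way to check for that, using the `validate` and `indicator` parameters of `merge`. The toy frames and the `XYZ` prefix are made up for illustration and are not taken from the real data.

```python
import pandas as pd

orderlines_demo = pd.DataFrame({
    "sku": ["OTT0133", "LGE0043", "XYZ0001"],  # XYZ0001 is a made-up sku
    "unit_price": [18.99, 399.00, 5.00],
})
brands_demo = pd.DataFrame({
    "short": ["OTT", "LGE"],
    "long": ["Otterbox", "LG"],
})

orderlines_demo["short_brand"] = orderlines_demo["sku"].str[:3]

merged = orderlines_demo.merge(
    brands_demo,
    left_on="short_brand",
    right_on="short",
    how="left",
    validate="many_to_one",  # raises if a short code appears twice in brands
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Brand codes that found no partner in brands
unmatched = merged.loc[merged["_merge"] == "left_only", "short_brand"].unique()
print(unmatched)
```

An empty result on the real tables would mean every sku prefix is covered by brands.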


- DataFrame **.describe()** gives basic numerical aggregations. It can be applied to a single column as well.
- DataFrame **.isna().any()** highlights which columns contain missing data.
- DataFrame **.shape** gives the number of rows and columns.
- DataFrame **.columns** gives the column names. Note that a list with new names can be passed to this attribute to rename the columns.
- DataFrame **.columnName.isna().sum()** is a quick way to check the number of missing values in a column
- DataFrame **.columnName.value_counts()** is a great way to summarise a categorical column. You can use it to discover how many orders are completed, cancelled, pending…
- DataFrame **.columnName.hist()** is an easy way to plot a histogram in a numerical column. Play with the bins argument to change the granularity of the graph.
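
The methods listed above can be tried on a small toy dataframe. This is only a sketch: `orders_demo` and its `state` column are assumptions made up for the example, not the real orders table. `hist()` is left out because it needs matplotlib installed.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): the column names only imitate an orders table
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [18.99, np.nan, 474.05, 68.39],
    "state": ["Completed", "Cancelled", "Completed", "Pending"],
})

print(orders_demo.shape)                         # (rows, columns)
print(orders_demo.columns.tolist())              # column names
print(orders_demo.describe())                    # numerical aggregations
print(orders_demo.isna().any())                  # which columns have missing data
print(orders_demo["total_paid"].isna().sum())    # missing values in one column
print(orders_demo["state"].value_counts())       # summary of a categorical column
```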

## Trying to create a new column with the Promo Price

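My logic for computing this column: an order line whose unit_price is below the product's list price is treated as sold on promotion, and its unit_price is taken as the estimated promo price. Lines sold at or above the list price get no estimate (NaN).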

In [5]:
products_orderlines = products.merge(orderlines, on='sku', how='inner')
# products_orderlines.groupby('sku').agg({'price':'mean', 'unit_price':'mean'}).round(2)

products_orderlines.isna().sum()

sku                 0
name                0
desc                0
price               0
promo_price         0
in_stock            0
type                0
promo_percentage    0
order_id            0
product_quantity    0
unit_price          0
date                0
short_brand         0
dtype: int64

In [6]:
products_orderlines['is_promo'] = products_orderlines['unit_price'] < products_orderlines['price']
# Keep unit_price where the line is a promo sale, NaN otherwise (vectorised instead of a row-wise apply)
products_orderlines['estimated_promo_price'] = products_orderlines['unit_price'].where(products_orderlines['is_promo'])
products_orderlines

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,promo_percentage,order_id,product_quantity,unit_price,date,short_brand,is_promo,estimated_promo_price
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,49.99,True,8696,0.1667,300551,1,54.99,2017-01-02 13:34:30,RAI,True,54.99
1,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,49.99,True,8696,0.1667,304067,1,49.99,2017-01-07 09:02:08,RAI,True,49.99
2,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,49.99,True,8696,0.1667,304484,1,49.99,2017-01-07 21:17:55,RAI,True,49.99
3,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,49.99,True,8696,0.1667,305406,1,49.99,2017-01-09 07:45:12,RAI,True,49.99
4,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,49.99,True,8696,0.1667,305590,1,49.99,2017-01-09 11:53:15,RAI,True,49.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285044,REP0076,repair Full screen iPad (1st generation),Repair service including parts and labor for iPad,149.99,149.98,False,"1,44E+11",0.0001,355016,1,149.99,2017-05-11 18:30:11,REP,False,
285045,REP0076,repair Full screen iPad (1st generation),Repair service including parts and labor for iPad,149.99,149.98,False,"1,44E+11",0.0001,365406,1,149.99,2017-06-13 22:37:38,REP,False,
285046,REP0076,repair Full screen iPad (1st generation),Repair service including parts and labor for iPad,149.99,149.98,False,"1,44E+11",0.0001,409162,1,149.99,2017-10-06 00:32:29,REP,False,
285047,REP0076,repair Full screen iPad (1st generation),Repair service including parts and labor for iPad,149.99,149.98,False,"1,44E+11",0.0001,512821,1,149.98,2018-02-16 12:11:38,REP,True,149.98
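
Once each line is flagged, the size of the discount can be estimated as well. The sketch below assumes the discount is `(price - unit_price) / price`. That formula is an inference: the promo_percentage of 0.1667 shown for RAI0007 is consistent with (59.99 - 49.99) / 59.99, but the source does not state how that column was built. The toy rows reuse prices from the output above, and `estimated_discount` and `promo_share` are hypothetical names.

```python
import pandas as pd

lines = pd.DataFrame({
    "sku": ["RAI0007", "RAI0007", "REP0076"],
    "price": [59.99, 59.99, 149.99],
    "unit_price": [54.99, 49.99, 149.99],
})

# Same rule as in the notebook: sold below list price means promo
lines["is_promo"] = lines["unit_price"] < lines["price"]

# Assumed formula: relative discount against the list price, NaN for non-promo lines
lines["estimated_discount"] = (
    ((lines["price"] - lines["unit_price"]) / lines["price"])
    .where(lines["is_promo"])
    .round(4)
)

# Share of promo lines per product
promo_share = lines.groupby("sku")["is_promo"].mean()

print(lines)
print(promo_share)
```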


In [7]:
products_orderlines['promo_price']

0          49.99
1          49.99
2          49.99
3          49.99
4          49.99
           ...  
285044    149.98
285045    149.98
285046    149.98
285047    149.98
285048    149.98
Name: promo_price, Length: 285049, dtype: float64

In [8]:

# Round to 2 decimal places
products_orderlines['rounded_unit_price'] = np.round(products_orderlines['unit_price'], 2)

# Check if rounded_unit_price equals promo_price (promo_price itself is not rounded here)
products_orderlines['is_match'] = products_orderlines['rounded_unit_price'] == products_orderlines['promo_price']

# Count the matches and mismatches
match_count = products_orderlines['is_match'].sum()
mismatch_count = len(products_orderlines) - match_count

print(f"Number of matches: {match_count}")
print(f"Number of mismatches: {mismatch_count}")

Number of matches: 71542
Number of mismatches: 213507
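
Exact `==` on floats can miss values that differ only by representation error, and a plain mismatch count does not say in which direction the prices differ. The sketch below shows one possible refinement on toy data: `np.isclose` with an absolute tolerance of half a cent (an arbitrary choice for this example), followed by a breakdown of the mismatches. The `comparison` column and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "unit_price": [49.99, 54.99, 149.99, 149.98],
    "promo_price": [49.99, 49.99, 149.98, 149.98],
})

# Treat prices as equal when they differ by less than half a cent
is_match = np.isclose(demo["unit_price"], demo["promo_price"], rtol=0, atol=0.005)

# Label every line: match, sold above the promo price, or sold below it
above = (demo["unit_price"] > demo["promo_price"]).to_numpy()
demo["comparison"] = np.select(
    [is_match, above],
    ["match", "above_promo"],
    default="below_promo",
)

print(demo["comparison"].value_counts())
```

Running the same breakdown on products_orderlines would show whether the mismatches are mostly sales above or below the listed promo price.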
