# Cleaning data and setting data types
In our initial exploration we already saw some missing values, duplicated rows, and wrong data types. In this notebook we will clean our data and save those cleaned tables as csv-files.

We will start with orders and orderlines.
## Orders

In [1]:
import pandas as pd

In [2]:
orders = pd.read_csv('data/orders.csv')
orders.head()

Unnamed: 0,order_id,created_date,total_paid,state
0,241319,2017-01-02 13:35:40,44.99,Cancelled
1,241423,2017-11-06 13:10:02,136.15,Completed
2,242832,2017-12-31 17:40:03,15.76,Completed
3,243330,2017-02-16 10:59:38,84.98,Completed
4,243784,2017-11-24 13:35:19,157.86,Cancelled


In [3]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


We saw in our initial exploration that the state of the orders with missing *total_paid* values is pending. We therefore do not have to worry about that.

But since we are interested in the dates, we'd prefer that column to be of data type datetime.

In [4]:
orders['created_date'] = pd.to_datetime(orders['created_date'])

In [5]:
min(orders.created_date)

Timestamp('2017-01-01 00:07:19')

In [6]:
max(orders.created_date)

Timestamp('2018-03-14 13:58:36')

In [7]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


We succesfully changed the date to datetime. It is easy to see now that the orders were created between Jan 2017 and Mar 2018.

Next up we'll look at the orderlines table:
## Orderlines

In [8]:
orderlines = pd.read_csv('data/orderlines.csv')

In [9]:
orderlines.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [10]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


In [11]:
# orderlines['unit_price'] = pd.to_numeric(orderlines['unit_price'])

When trying to convert the _unit_price_ to a numeric value, we get an error message. This pops up because some of these values have multiple points in it. This was probably done for readability of big numbers, but for our purposes we have to get rid of it. What we want to do is to remove every dot except for the last one using regular expressions

In [12]:
orderlines.loc[orderlines.unit_price.str.count('\.') >2]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date


We see that these numbers have at most 2 dots. So if we remove the first dot of those fields, we'll be fine.

In [13]:
orderlines.loc[orderlines.unit_price.str.count('\.') >1, 'unit_price'] = orderlines.loc[orderlines.unit_price.str.count('\.') >1, 'unit_price'].str.replace('\.', '', n=1, regex=True)

In [14]:
orderlines.head(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38
5,1119114,295310,0,10,WDT0249,231.79,2017-01-01 01:14:27
6,1119115,299544,0,1,APP1582,1137.99,2017-01-01 01:17:21
7,1119116,299545,0,1,OWC0100,47.49,2017-01-01 01:46:16
8,1119119,299546,0,1,IOT0014,18.99,2017-01-01 01:50:34
9,1119120,295347,0,1,APP0700,72.19,2017-01-01 01:54:11


In [15]:
orderlines['unit_price'] = pd.to_numeric(orderlines['unit_price'])

Looks like it worked! Note that the string method `count` uses regular expressions by default, but for the `replace` method we have to say that we are dealing with regex.

Now to the datetime:

In [16]:
orderlines['date'] = pd.to_datetime(orderlines['date'])

In [17]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                293983 non-null  int64         
 1   id_order          293983 non-null  int64         
 2   product_id        293983 non-null  int64         
 3   product_quantity  293983 non-null  int64         
 4   sku               293983 non-null  object        
 5   unit_price        293983 non-null  float64       
 6   date              293983 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 15.7+ MB


## Products

Next we'll tackle the table with the most issues, which is the `products` table. From the initial exploration we know that we have to deal with missing prices, missing descriptions, duplicate rows, and some problems with the actual prices (most of them are way too high), which might be due to an error of wrong data imputation.



In [18]:
products = pd.read_csv('data/products.csv')

In [19]:
products.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


#### Duplicates

In [20]:
products.duplicated().sum()

8746

In [21]:
products.drop_duplicates(inplace=True)

In [22]:
products.duplicated(subset = ['sku']).sum()

1

Looking at the duplicates: There are 8746 entries that match some other entry in every column.
If we look at duplicates in the sku-column only, we see that there is one more duplicated entry. Let's look at that one and see which one we want to keep.

In [23]:
products.loc[products.duplicated(subset = ['sku']) == True]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
8000,APP1197,"Apple iMac 21.5 ""Core i5 31 GHz Retina display...",Desktop Apple iMac 21.5 inch i5 31 GHz Retina ...,,1305.59,0,1282


In [24]:
products.loc[products.sku == 'APP1197']

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
7992,APP1197,"Apple iMac 21.5 ""Core i5 31 GHz Retina display...",Desktop Apple iMac 21.5 inch i5 31 GHz Retina ...,1729.0,1305.59,0,1282
8000,APP1197,"Apple iMac 21.5 ""Core i5 31 GHz Retina display...",Desktop Apple iMac 21.5 inch i5 31 GHz Retina ...,,1305.59,0,1282


We found the last duplicate sku APP1197. the second does not have a price, everything else is the same. So we can drop it.

In [25]:
products.drop_duplicates(subset=['sku'], inplace=True)
#products.loc[products.sku == 'APP1197']

#### Data types and missing values

In [26]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10579 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          10579 non-null  object
 1   name         10579 non-null  object
 2   desc         10572 non-null  object
 3   price        10534 non-null  object
 4   promo_price  10579 non-null  object
 5   in_stock     10579 non-null  int64 
 6   type         10529 non-null  object
dtypes: int64(1), object(6)
memory usage: 661.2+ KB


In [27]:
# products['promo_price'] = pd.to_numeric(products['promo_price'])

As it was the case with the orderlines dataframe, there are some issues with multiple dots for some `price`'s and `promo_price`'s.
Fortunately, we already know how to deal with that.

In [28]:
#products.loc[products.promo_price.str.count('\.') >2]
products.loc[products.price.str.count('\.') >2]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type


In [29]:
products.loc[products.price.str.count('\.') >1, 'price'] = (
    products.loc[products.price.str.count('\.') >1, 
                   'price'].str.replace('\.', '', n=1, regex=True)
)

products.loc[products.promo_price.str.count('\.') >1, 'promo_price'] = (
    products.loc[products.promo_price.str.count('\.') >1, 
                   'promo_price'].str.replace('\.', '', n=1, regex=True)
)

In [30]:
products['price'] = pd.to_numeric(products['price'])
products['promo_price'] = pd.to_numeric(products['promo_price'])

In [31]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10579 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sku          10579 non-null  object 
 1   name         10579 non-null  object 
 2   desc         10572 non-null  object 
 3   price        10534 non-null  float64
 4   promo_price  10579 non-null  float64
 5   in_stock     10579 non-null  int64  
 6   type         10529 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 661.2+ KB


For the purpose of later categorization, we'll copy-paste the name of the product into the description for those products that are missing descriptions.
This avoids having to deal with nan-values.

In [32]:
products.loc[products.desc.isna(), 'desc'] = products.loc[products.desc.isna(), 'name']

Looking at the type, we see that all these contain some positive numbers. Since we will not look into the details of it right now (these are likely some company internal numbers), and since not too many products are missing type numbers, we'll simply impute 0.
As it does not make sense to do calculations with these numbers, it is okay for now to keep it as data type `object`.

We might come back to it for categorization later, so we won't drop the column.

In [33]:
products.type.unique()

array(['8696', '13855401', '1387', '1230', '1364', '1325', '5384', '1334',
       '13005399', '13835403', '12995397', '11865403', '13955395', '1216',
       '12355400', '5,49E+11', '1276', '11905404', '12635403', '12755395',
       '1296', '12285400', '1229', '11935397', '12655397', '1404',
       '101781405', '4259', '14035403', '12085400', '1282', '12175397',
       '1424', '9094', '1405', '57445397', nan, '14305406', '10142',
       '12645406', '10230', '12215397', '11821715', '13555403',
       '1,44E+11', '14365395', '5405', '5395', '5398', '21485407',
       '20642062', '1,46E+11', '1280', '12575403', '1433', '1515', '5720',
       '1298', '13615399', '12585395', '1392', '1231', '15435404', '1375',
       '42945397', '12141714', '42931714', '54025401', '21535407', '1416',
       '2,17E+11', '24215399', '11434', '2425', '2434', '2449',
       '13621714', '24861714', '24821716', '5,74E+15', '54864259',
       '51861714', '1714', '51601716', '51871714', '5403', '54085407',
       '2

In [34]:
products.loc[products.type.isna(), 'type'] = 0

In [35]:
products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10579 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sku          10579 non-null  object 
 1   name         10579 non-null  object 
 2   desc         10579 non-null  object 
 3   price        10534 non-null  float64
 4   promo_price  10579 non-null  float64
 5   in_stock     10579 non-null  int64  
 6   type         10579 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 661.2+ KB


#### The price issue
Now we want to get rid of our biggest problem: the weird prices.
As mentioned before there is something wrong with a lot of prices and promo_prices given in the table. 
Unfortunately we have no access to the person managing that dataframe and no internal price catalog.
So to fix the prices we assumed two things: 

- The problem with the pricing is caused by some floating point error, therefore the difference of the actual price and the price given in the table can be fixed with multiplication by some order of 10
- The actual price are comparable to product prices that can be found in other online stores

So through extensive online research and using the difference between `price` and `promo_price` as an auxiliary column the prices got fixed manually.

One thing to note: Only the `price` column gets fixed! Mainly because of the following reasons:
- When calculating the total price of an order later, we'll use only the price from the price-column. The promo_price is only used as a reference to find the actual price of the products.
- The promo_price given reflects a discount given by the respective company, probably for special promotional events. The actual vendors might give more or less discount, depending on various circumstances such as: Special dates (Black Friday, Christmas), getting rid of remaining stock, etc.
- One can basically see the discount when comparing the payment of the orders and the total price of the orders based on the prices given in the products dataframe. It's is not 100% accurate because of shipping costs etc., but it's good enough for our puposes.

Another thing to note: It is possible to lose some accuracy of the prices and spend less time on the cleaning process is exchange. This can be done via the use of functions and automizing the process. It is not done here though, since the amount of work was manageable in the timeframe given.

In [36]:
# code here
products.loc[products.price.isna() == True]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
34,TWS0019,Twelve South MagicWand support Apple Magic Tra...,MagicWand for wireless keyboard and Magic Trac...,,299.899,0,8696
1900,AII0008,Aiino Case MacBook Air 11 '' Transparent,MacBook Air 11-inch casing with matte finish.,,22.99,0,13835403
2039,CEL0020,Celly Ambo Luxury Leather Case + iPhone 6 Case...,Cover and housing together with magnet for iPh...,,399.905,0,11865403
2042,CEL0007,Celly Wallet Case with removable cover Black i...,Case Book for iPhone 6 card case type.,,128.998,0,11865403
2043,CEL0012,Celly Silicone Hard Shell iPhone 6 Blue,Hard Shell Silicone iPhone 6.,,4.99,0,11865403
2044,CEL0014,Celly Silicone Hard Shell iPhone 6 Amarillo,Hard Shell Silicone iPhone 6.,,59.895,0,11865403
2049,CEL0015,Celly fur-lined Powerbank battery 4000mAh Black,Leather-wrapped External Battery 4000mAh for i...,,239.895,0,1515
2051,CEL0018,Celly Wallet Leather Case cover Black iPhone 6,Card case with transparent protective cover fo...,,294.877,0,11865403
2052,CEL0023,Celly Ambo Luxury Leather Case + Case Gold iPh...,Cover and housing together with magnet for iPh...,,329.894,0,11865403
2053,CEL0025,Celly Ambo Luxury Leather Case + Case iPhone 6...,Cover and housing together with magnet for iPh...,,449.878,0,11865403


we have several entries with missing price. there is a promo price, but it seems that it is 1 order higher in most cases. I would say CEL0012, AII0008 and TWS0019 are the only correct ones there. 

In [37]:
products['no_price'] = products.price.isna()
products.loc[products.no_price]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price
34,TWS0019,Twelve South MagicWand support Apple Magic Tra...,MagicWand for wireless keyboard and Magic Trac...,,299.899,0,8696,True
1900,AII0008,Aiino Case MacBook Air 11 '' Transparent,MacBook Air 11-inch casing with matte finish.,,22.99,0,13835403,True
2039,CEL0020,Celly Ambo Luxury Leather Case + iPhone 6 Case...,Cover and housing together with magnet for iPh...,,399.905,0,11865403,True
2042,CEL0007,Celly Wallet Case with removable cover Black i...,Case Book for iPhone 6 card case type.,,128.998,0,11865403,True
2043,CEL0012,Celly Silicone Hard Shell iPhone 6 Blue,Hard Shell Silicone iPhone 6.,,4.99,0,11865403,True
2044,CEL0014,Celly Silicone Hard Shell iPhone 6 Amarillo,Hard Shell Silicone iPhone 6.,,59.895,0,11865403,True
2049,CEL0015,Celly fur-lined Powerbank battery 4000mAh Black,Leather-wrapped External Battery 4000mAh for i...,,239.895,0,1515,True
2051,CEL0018,Celly Wallet Leather Case cover Black iPhone 6,Card case with transparent protective cover fo...,,294.877,0,11865403,True
2052,CEL0023,Celly Ambo Luxury Leather Case + Case Gold iPh...,Cover and housing together with magnet for iPh...,,329.894,0,11865403,True
2053,CEL0025,Celly Ambo Luxury Leather Case + Case iPhone 6...,Cover and housing together with magnet for iPh...,,449.878,0,11865403,True


In [38]:
products.loc[products.no_price, 'price'] = products.loc[products.no_price].promo_price

In [39]:
products['price'] = pd.to_numeric(products.price)
products.loc[products.no_price]

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price
34,TWS0019,Twelve South MagicWand support Apple Magic Tra...,MagicWand for wireless keyboard and Magic Trac...,299.899,299.899,0,8696,True
1900,AII0008,Aiino Case MacBook Air 11 '' Transparent,MacBook Air 11-inch casing with matte finish.,22.99,22.99,0,13835403,True
2039,CEL0020,Celly Ambo Luxury Leather Case + iPhone 6 Case...,Cover and housing together with magnet for iPh...,399.905,399.905,0,11865403,True
2042,CEL0007,Celly Wallet Case with removable cover Black i...,Case Book for iPhone 6 card case type.,128.998,128.998,0,11865403,True
2043,CEL0012,Celly Silicone Hard Shell iPhone 6 Blue,Hard Shell Silicone iPhone 6.,4.99,4.99,0,11865403,True
2044,CEL0014,Celly Silicone Hard Shell iPhone 6 Amarillo,Hard Shell Silicone iPhone 6.,59.895,59.895,0,11865403,True
2049,CEL0015,Celly fur-lined Powerbank battery 4000mAh Black,Leather-wrapped External Battery 4000mAh for i...,239.895,239.895,0,1515,True
2051,CEL0018,Celly Wallet Leather Case cover Black iPhone 6,Card case with transparent protective cover fo...,294.877,294.877,0,11865403,True
2052,CEL0023,Celly Ambo Luxury Leather Case + Case Gold iPh...,Cover and housing together with magnet for iPh...,329.894,329.894,0,11865403,True
2053,CEL0025,Celly Ambo Luxury Leather Case + Case iPhone 6...,Cover and housing together with magnet for iPh...,449.878,449.878,0,11865403,True


In [40]:
products.loc[(products.promo_price > products.price), 'promo_price'] /= 10

In [41]:
#products['promo_price'] = round(products.promo_price,2)
#products['price'] = round(products.price,2)
#products.loc[products.promo_price > products.price]

In [42]:
products.loc[(products.no_price), 'price'] /= 10
products.loc[(products.no_price), 'promo_price'] /= 10

In [43]:
products.loc[products.sku.isin(['TWS0019', 'AII0008', 'CEL0012']), 'price'] *= 10
products.loc[products.sku.isin(['TWS0019', 'AII0008', 'CEL0012']), 'promo_price'] *= 10

In [44]:
products['price_promo_difference'] = products.price - products.promo_price
products.loc[products.price < products.promo_price].sort_values('price_promo_difference').head(50)

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
14718,QNA0189,QNAP TVS-1271U-RP NAS with redundant source,Expansion unit rack bays 12 format and 8GB of ...,2346.19,2538.9902,0,12175397,False,-192.8002
14717,QNA0188,QNAP TVS-1271U-RP NAS server with redundant po...,Expansion unit rack bays 12 and format memory ...,2781.79,2962.9899,0,12175397,False,-181.1999
14716,QNA0187,QNAP TVS-1271U-RP NAS server with redundant po...,Expansion unit rack format with 12 bays and 32 GB,3386.79,3567.9899,0,12175397,False,-181.1999
15111,EIZ0022,"CS2730 Monitor Eizo ColorEdge 27 ""QHD AdobeRGB...",Professional monitor for color management Adob...,1034.99,1209.9879,1,1296,False,-174.9979
11060,PAC1388,Pack QNAP TS-251 + | 2GB RAM | WD 16TB Network,Nas QNAP TS-251 + 2GB RAM memory + 16TB (2x8TB...,673.702,815.1794,1,12175397,False,-141.4774
14719,QNA0190,QNAP TVS-871U-RP NAS Server | Core i5 | 8GB RA...,NAS system format rack 8 bays and 8GB RAM,2418.79,2538.9902,0,12175397,False,-120.2002
1749,PAC0607,Synology DS215j Pack | 8TB Seagate Desktop,Synology DS215j NAS Server Pack + 8TB Seagate ...,366.934,443.9901,0,12175397,False,-77.0561
2879,APP1084,Apple iPod Touch 64GB Silver,New 6th generation iPod Touch 64GB with 8 mega...,292.81,342.8112,0,11821715,False,-50.0012
2880,APP1087,Apple iPod Touch 64GB Gold,New 6th generation iPod Touch 64GB with 8 mega...,292.81,342.8112,0,11821715,False,-50.0012
2881,APP1090,Apple iPod Touch 64GB Space Gray,New 6th generation iPod Touch 64GB with 8 mega...,292.81,342.8112,0,11821715,False,-50.0012


Looking at the price_promo_difference, we see that this is enormous. Prices and (mostly) promo_prices are way too high. Sometimes we see a reasonable promo_price for a macbook, so many of these entries have to be divided by 100. In a first step I go with a price diff of 40000

In [45]:
products.sort_values('price_promo_difference', ascending=False).head(50)

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
16608,APP2333,"Apple MacBook Pro 15 ""Core i7 Touch Bar 31GHz ...",New MacBook Pro 15-inch Core i7 Touch Bar 31Gh...,352559.394,35255.939,0,"1,02E+12",False,317303.455
16824,APP2299,"Apple MacBook Pro 13 ""Core i7 Touch Bar 35GHz ...",New MacBook Pro 13 inch Touch Bar 35 GHz Core ...,336559.443,33655.944,0,"1,02E+12",False,302903.499
16836,APP2310,"Apple MacBook Pro 13 ""Core i7 Touch Bar 35GHz ...",New MacBook Pro 13 inch Touch Bar 35 GHz Core ...,336559.443,33655.944,0,"1,02E+12",False,302903.499
16834,APP2307,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13-inch Core i5 Touch Bar 33 G...,336559.443,33655.944,0,"1,02E+12",False,302903.499
16812,APP2296,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13-inch Core i5 Touch Bar 33 G...,336559.443,33655.944,0,"1,02E+12",False,302903.499
16838,APP2312,"Apple MacBook Pro 13 ""Core i5 Touch Bar 31GHz ...",New MacBook Pro 13 inch Touch Bar 31 GHz Core ...,300559.402,3005.594,0,"1,02E+12",False,297553.808
16828,APP2301,"Apple MacBook Pro 13 ""Core i5 Touch Bar 31GHz ...",New MacBook Pro 13 inch Touch Bar 31 GHz Core ...,300559.402,3005.594,0,"1,02E+12",False,297553.808
16832,APP2303,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13 inch Touch Bar 33 GHz Core ...,237559.421,23755.942,0,"1,02E+12",False,213803.479
16840,APP2314,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13 inch Touch Bar 33 GHz Core ...,237559.421,23755.942,0,"1,02E+12",False,213803.479
17487,APP2490,Apple iPhone 64GB X Silver,New Apple iPhone 64GB Free Silver X,115900.092,11590.009,1,113271716,False,104310.083


In [46]:
products.loc[products.price_promo_difference > 40000, 'price'] /= 100

Now we take a closer look to the macbooks

In [47]:
products.loc[products.name.str.contains('MacBook')].head(5)

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
13,APP0156,45W MagSafe charger Apple MacBook Air,Charger MacBook Air MagSafe 45 W (MC747Z / A).,89.0,84.9904,0,13005399,False,4.0096
15,MOS0021,Clearguard Moshi MacBook Pro and Air,Keyboard Protector MacBook Pro 13-inch Retina ...,24.95,24.9889,0,13835403,False,-0.0389
16,APP0206,60W MagSafe charger Apple MacBook and MacBook ...,Charger MacBook and 13 inch MacBook MagSafe 60...,89.0,84.9904,1,13005399,False,4.0096
17,HGD0007,"Henge Docks Docking Station MacBook Pro 13 """,Dock and support with multiple connections for...,69.99,69.9864,0,12995397,False,0.0036
20,APP0247,"85W MagSafe charger Apple MacBook Pro 15 ""/ 17""",Charger MacBook Pro MagSafe 85 W (MC556Z / A).,89.0,84.9904,1,13005399,False,4.0096


macbook prices seem to be mostly reasonable now. some are still way over 20000, which is to much, so I divide by 10 again.
also the ram expansions seem to be too expensive.

In [48]:
products.loc[(products.name.str.contains('MacBook')) & (products.price > 10000), 'price'] /= 10

from what i could tell, one of the highest selling synology products online seems to be around 5.500€. So I will divide every synology product with a higher price by 10.
If you look at the promo prices, you sometimes see a promoprice that is not even a 10th of the price column. This also indicates that prices are corrupted

In [49]:
products.loc[products.name.str.contains('Synology')].sort_values('price', ascending=False).head(100)

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
16001,PAC2309,Synology DS1817 + | 16GB RAM | 96TB (8x12TB) S...,NAS with 16GB of RAM and 96TB for Mac and PC,51577.892,41217.4640,0,12175397,False,10360.4280
16000,PAC2307,Synology DS1817 + | 8GB RAM | 96TB (8x12TB) Se...,NAS with 8GB RAM and 96TB for Mac and PC,50917.885,40557.4580,0,12175397,False,10360.4270
16004,PAC2310,Synology DS1817 + | 16GB RAM | 80TB (8x10TB) W...,NAS server with 16GB of RAM and 80TB for Mac a...,46698.692,38577.4620,0,12175397,False,8121.2300
16003,PAC2308,Synology DS1817 + | 8GB RAM | 80TB (8x10TB) WD...,NAS server with 8GB RAM and 80TB for Mac and PC,46038.685,37907.4610,0,12175397,False,8131.2240
15993,PAC1984,Synology DS1817 + | 16GB RAM | 80TB (8x10TB) S...,NAS server with 16GB of RAM and 80TB for Mac a...,43577.892,33377.4630,0,12175397,False,10200.4290
...,...,...,...,...,...,...,...,...,...
15999,PAC2305,Synology DS1817 + | 2GB RAM | 96TB (8x12TB) Se...,NAS with 2GB of RAM and 96TB for Mac and PC,4978.810,3942.7463,0,12175397,False,1036.0637
1764,PAC1810,Synology DS1815 Pack + | 16GB RAM | WD 64TB Ne...,Pack Synology DS1815 + with 16GB RAM + 64TB (8...,4534.490,3455.7455,0,12175397,False,1078.7445
16002,PAC2306,Synology DS1817 + | 2GB RAM | 80TB (8x10TB) WD...,NAS server with 2GB of RAM and 80TB for Mac an...,4490.890,3678.7461,0,12175397,False,812.1439
1654,PAC1799,Synology DS1815 Pack + | 8GB RAM | WD 64TB Net...,Pack Synology DS1815 + with 8GB RAM + 64TB (8x...,4476.890,3402.7463,0,12175397,False,1074.1437


In [50]:
products.loc[(products.name.str.contains('Synology')) & (products.price > 5500), 'price'] /= 10

we see that there are still a lot inconsistencies with the prices and promoprices, but we will keep it for now and come back later. 
Next we'll look at expansions. fyi: I'll ad the .head() so I don't have to scroll so far.

In [51]:
products.loc[(products.name.str.contains('Synology'))].sort_values('price',ascending=False).head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
15168,SYN0165,Synology RackStation RS18017xs + NAS 12 bays S...,Nas Format rack 12 bays with Intel Xeon 6-core...,5499.0,5498.9999,0,12175397,False,0.0001
15166,SYN0164,Synology RackStation RS4017xs + Nas Server,Nas Format rack 16 bays with Intel Xeon proces...,5389.0,5389.0,0,12175397,False,0.0
12863,PAC1481,Synology DS216 + II | WD 4TB Network for Mac a...,Synology DS216 + II + 4TB (2x2TB) WD Red hard ...,5309.896,5081.794,0,12175397,False,228.102
16001,PAC2309,Synology DS1817 + | 16GB RAM | 96TB (8x12TB) S...,NAS with 16GB of RAM and 96TB for Mac and PC,5157.7892,41217.464,0,12175397,False,10360.428
16000,PAC2307,Synology DS1817 + | 8GB RAM | 96TB (8x12TB) Se...,NAS with 8GB RAM and 96TB for Mac and PC,5091.7885,40557.458,0,12175397,False,10360.427


these expansion kits' price seems to be in the 100s, not in the 1000s based on a google search. In my case, the ones that are too pricey have a price_promo_difference of at least 200€, so I divide those by 10.

In [52]:
products.loc[products.name.str.contains('expansion')].sort_values('price_promo_difference', ascending=False).head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
14929,PAC1907,Crucial MX300 expansion kit 2TB SSD + 16GB 160...,expansion kit 2TB SSD + 16GB 1600Mhz RAM for M...,7995.192,682.584,0,1433,False,7312.608
11501,PAC1251,Samsung SSD 850 expansion kit PRO 1TB + 16GB 1...,SSD upgrade kit 1TB + 16GB 1600Mhz RAM for Mac...,6989.592,626.584,0,1433,False,6363.008
2010,PAC0746,Kit PRO 512GB Samsung SSD expansion + 16GB 160...,SSD upgrade kit 512GB + 16GB 1600Mhz RAM for M...,4689.592,418.585,1,1433,False,4271.007
1955,PAC0784,Samsung SSD 850 expansion kit EVO 500GB + 16GB...,Expansion kit SSD 500GB + 16GB 1600Mhz RAM + D...,4519.592,353.585,1,1433,False,4166.007
14923,PAC1902,SSD Crucial MX300 expansion kit 1TB + 8GB 1600...,SSD expansion 1TB + 8GB 1600Mhz RAM for Mac mi...,4463.085,384.584,1,1433,False,4078.501


In [53]:
products.loc[products.name.str.contains('expansion') & (products.price_promo_difference > 200), 'price'] /= 10

Next we'll look at SanDisk and Wi-Fi, also sku REP seems suspicious (sorting by price) --> SanDisk with a price over 1000€ -> divide by 10
REP: anything above 2000 -> divide by 100, anything above 300 -> divide by 10
Wi-Fi: all above 1500 -> divide by 10

In [54]:
products.loc[products.name.str.contains('SanDisk') & products.price > 1000, 'price'] /= 10
products.loc[products.sku.str.contains('REP') & products.price > 2000, 'price'] /= 100
products.loc[products.sku.str.contains('REP') & products.price > 3000, 'price'] /= 10
products.loc[products.name.str.contains('Wi-Fi') & products.price > 1500, 'price'] /= 10

In [55]:
products.loc[products.sku.str.contains('REP')].sort_values('price', ascending=False).head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
15401,REP0350,Full Screen Repair iPad Air,Repair service including parts and labor for i...,19999.001,199.99,0,"1,44E+11",False,19799.011
15429,REP0339,Repair Touch Screen iPad 4,Repair service including parts and labor for i...,9990.002,99.9,0,"1,44E+11",False,9890.102
15423,REP0348,Wi-Fi antenna repair iPad 4,Repair service including parts and labor for i...,6999.003,69.99,0,"1,44E+11",False,6929.013
15421,REP0380,iPad Mini 3 battery repair,Repair service including parts and labor for i...,6999.003,69.99,0,"1,44E+11",False,6929.013
15397,REP0371,Cases mini repair posterior chamber 2,Repair service including parts and labor for i...,6999.003,69.99,0,"1,44E+11",False,6929.013


In [56]:
products.loc[products.sku.str.contains('SEV')].sort_values('price', ascending=False)

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type,no_price,price_promo_difference
525,SEV0024,Service installation RAM + HDD + SSD Mac mini,installation RAM HDD + SSD + on your Mac mini ...,599.918,599.918,0,20642062,False,0.0
526,SEV0025,Service installation RAM + HDD + SSD iMac,installation RAM HDD + SSD + on your iMac + Da...,599.918,599.918,0,20642062,False,0.0
518,SEV0026,Service installation RAM + HDD + SSD MacBook /...,RAM + HDD installation + SSD in your MacBook /...,599.918,599.918,0,20642062,False,0.0
540,SEV0032,Budget Request repair Apple Mac Pro,Diagnosis to repair Mac Pro.,199.892,199.892,0,"1,46E+11",False,0.0
539,SEV0031,Budget Request repair Apple Mac mini,diagnosis to repair Mac mini.,199.892,199.892,0,"1,46E+11",False,0.0
538,SEV0030,Apple iMac repair Budget Request,diagnosis to repair iMac.,199.892,199.892,0,"1,46E+11",False,0.0
537,SEV0029,Apple MacBook repair Budget Request,MacBook repair diagnosis.,199.892,199.892,0,"1,46E+11",False,0.0
536,SEV0028,Budget Request repair Apple MacBook Pro,diagnosis to repair MacBook Pro.,199.892,199.892,0,"1,46E+11",False,0.0
528,SEV0027,Budget Request repair Apple MacBook Air,diagnosis to repair MacBook Air.,199.892,199.892,0,"1,46E+11",False,0.0
11667,SEV0041,Current 2012-Service installation RAM + HDD + ...,installation RAM HDD + SSD + on your iMac + Da...,99.99,99.9896,0,20642062,False,0.0004


In [57]:
products.loc[products.promo_price > products.price, 'promo_price'] = products.price

In [58]:
products.drop(columns = ['no_price', 'price_promo_difference'], inplace=True)

## Brands

As mentioned in the initial_exploration, the brands csv looks fine. 

## Saving the dataframes
Now we save the cleaned csv-files

In [61]:
orders.to_csv('data/orders_clean.csv', index=False)
orderlines.to_csv('data/orderlines_clean.csv', index=False)
products.to_csv('data/products_clean.csv', index=False)