# products.csv

* **sku** – stock keeping unit: a unique identifier for each product
* **name** – product name
* **desc** – product description
* **price** – base price of the product, in euros
* **promo_price** – promotional price, in euros
* **in_stock** – whether or not the product was in stock at the moment of the data extraction
* **type** – a numerical code for product type

## Import the data

In [1]:
import pandas as pd
import numpy as np
import re

pd.options.display.max_rows = 500

products_original = pd.read_csv('../../data/products.csv')
products_original.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


## Data exploration

In [6]:
products = products_original.copy()
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19326 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19326 non-null  object
 1   name         19326 non-null  object
 2   desc         19319 non-null  object
 3   price        19280 non-null  object
 4   promo_price  19326 non-null  object
 5   in_stock     19326 non-null  int64 
 6   type         19276 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.0+ MB


<div class="alert alert-block alert-info">
    <b>price</b> and <b>promo_price</b> must be converted to float; <b>type</b> and <b>in_stock</b> can be removed.
</div>
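
The conversion mentioned above could be sketched as follows. This is a minimal illustration on made-up toy data that mimics the string formats in this file, not the cleaning step used later in the notebook. It assumes `pd.to_numeric` with `errors="coerce"` is acceptable, which silently turns unparseable strings into `NaN`. A value that parses successfully (such as `499.899`) can still be wrong, so this only detects values with more than one dot.

```python
import pandas as pd

# Toy data mimicking the string formats seen in the price columns (illustrative only)
sample = pd.DataFrame({
    "price": ["59.99", "59", "1.639.792", None],
    "promo_price": ["499.899", "589.996", "1.629.894", "22.99"],
})

converted = sample.copy()
for col in ["price", "promo_price"]:
    # errors="coerce" replaces anything that is not a valid number with NaN
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

# Number of values per column that are missing or could not be parsed
unparsed = converted.isna().sum()
print(unparsed)
```

Comparing `unparsed` with the result of `isna().sum()` on the raw columns shows how many values were lost to parsing rather than missing from the start.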

In [66]:
products.drop(['type', 'in_stock'], axis=1, inplace=True)

In [67]:
products.nunique()

sku            10579
name           10373
desc            7098
price           2690
promo_price     4614
dtype: int64

In [68]:
products[products.sku.duplicated()]

Unnamed: 0,sku,name,desc,price,promo_price
101,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
102,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
103,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
104,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
105,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
...,...,...,...,...,...
16831,APP2302,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13-inch Core i5 Touch Bar 33 G...,26.155.941,26.155.941
16833,APP2303,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13 inch Touch Bar 33 GHz Core ...,237.559.421,23.755.942
18190,PAR0077,Parrot Bebop Drone 2 Power,Drone cuadricóptero quality camera integrated ...,699.9,6.733.892
18308,NKI0010,Nokia Wireless sphygmomanometer Plata,Sphygmomanometer for iPhone iPad and iPod App.,129.99,1.149.899


In [69]:
products[products.sku=='APP0390']

Unnamed: 0,sku,name,desc,price,promo_price
100,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
101,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
102,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
103,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
104,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
105,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
106,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
107,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
108,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917


<div class="alert alert-block alert-danger">
    We will need to drop the rows with duplicated <b>sku</b> values.
</div>
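
Before dropping, it can be worth checking whether the repeated rows are exact copies, since dropping on `sku` alone would otherwise discard differing data. The sketch below uses a small made-up frame (the column values are illustrative) and compares the number of rows that repeat an earlier `sku` with the number of rows that repeat an earlier row in every column.

```python
import pandas as pd

# Toy frame: one sku repeated three times with identical content (illustrative values)
toy = pd.DataFrame({
    "sku": ["APP0390", "APP0390", "APP0390", "KIN0007"],
    "name": ["MacBook Pro", "MacBook Pro", "MacBook Pro", "Kingston 2GB"],
    "price": ["1199", "1199", "1199", "34.99"],
})

# Rows whose sku already appeared earlier
dup_by_sku = int(toy.duplicated(subset="sku").sum())
# Rows that are identical to an earlier row in every column
dup_full = int(toy.duplicated().sum())

# Equal counts mean every repeated sku is an exact copy, so nothing is lost by dropping
lossless = dup_by_sku == dup_full

deduped = toy.drop_duplicates(subset="sku", keep="first")
print(dup_by_sku, dup_full, lossless, len(deduped))
```

If the two counts differ, some `sku` values carry conflicting rows and the choice of `keep` starts to matter.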

#### Missing values

In [70]:
products.isna().sum()

sku             0
name            0
desc            7
price          46
promo_price     0
dtype: int64

<div class="alert alert-block alert-danger">
    We need to remove any rows with missing <b>price</b> values.
</div>

In [71]:
before = products.shape[0]
products = products[~products.price.isna()]
after = products.shape[0]
print(f"{before-after} rows removed.")

46 rows removed.


#### Missing descriptions

In [72]:
names_of_products_without_descriptions = products[products.desc.isna()].name.tolist()
names_of_products_without_descriptions

['Open - Purple 2TB WD 35 "PC Security Mac hard drive and NAS',
 'Open - Apple Smart Keyboard Pro Keyboard Folio iPad 9.7 "',
 'Synology DS718 + NAS Server | 10GB RAM',
 'Open - Kanex USB-C Gigabit Ethernet Adapter MacBook 12 "',
 'Hyper Pearl 1600mAh battery Mini USB Mirror and Comic Blond',
 'OtterBox External Battery Power Pack 20000 mAHr',
 'Open - Honeywell thermostat Lyric zonificador T6 Intelligent Wireless (cable)']

In [73]:
products[products.desc.isna()]

Unnamed: 0,sku,name,desc,price,promo_price
16126,WDT0211-A,"Open - Purple 2TB WD 35 ""PC Security Mac hard ...",,107,814.659
16128,APP1622-A,Open - Apple Smart Keyboard Pro Keyboard Folio...,,1.568.206,1.568.206
17843,PAC2334,Synology DS718 + NAS Server | 10GB RAM,,566.35,5.659.896
18152,KAN0034-A,Open - Kanex USB-C Gigabit Ethernet Adapter Ma...,,29.99,237.925
18490,HTE0025,Hyper Pearl 1600mAh battery Mini USB Mirror an...,,24.99,22.99
18612,OTT0200,OtterBox External Battery Power Pack 20000 mAHr,,79.99,56.99
18690,HOW0001-A,Open - Honeywell thermostat Lyric zonificador ...,,199.99,1.441.174


<div class="alert alert-block alert-info">
    There are only 7 missing descriptions. We can simply look up the products online and add them.
</div>
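
One way to add the looked-up descriptions is to key them by `sku`, so the fill does not depend on row order or on names being unique. The sketch below is an assumption about how this could be done, not the approach used in the cleaning cell. The frame is a toy example and the description texts in `desc_by_sku` (a hypothetical name) are placeholders.

```python
import pandas as pd

# Toy frame with two missing descriptions (names shortened, illustrative only)
toy = pd.DataFrame({
    "sku": ["HTE0025", "OTT0200", "KIN0007"],
    "name": ["Hyper Pearl 1600mAh battery", "OtterBox External Battery Power Pack", "Mac Memory Kingston 2GB"],
    "desc": [None, None, "2GB RAM Mac mini and iMac"],
})

# Hypothetical lookup table: sku -> description found online (placeholder texts)
desc_by_sku = {
    "HTE0025": "Compact power bank with mirror",
    "OTT0200": "20000 mAh external battery",
}

# Fill only the missing descriptions; existing ones are left untouched
toy["desc"] = toy["desc"].fillna(toy["sku"].map(desc_by_sku))
print(toy[["sku", "desc"]])
```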

#### Explore price values

In [74]:
# Create a copy of the dataset
price_temp = products.copy()

# create a new column with the number of dots in the price column
price_temp['num_decimals'] = price_temp['price'].str.count(r'\.')

# show the rows with more than one dot
price_temp.query('num_decimals > 1').shape[0]

431

In [75]:
# show the rows with more than two dots
price_temp.query('num_decimals > 2').shape[0]

0

In [45]:
price_temp.query('num_decimals > 1').head(5)

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,num_decimals
665,CRU0015-2,Crucial memory Mac 16GB (2x8GB) SO-DIMM DDR3 1...,RAM 16GB (2x8GB) 135V MacBook Pro iMac (2012/2...,1.639.792,1.629.894,1,2
792,APP0672,Apple iPhone 5S 16GB Space Gray,New iPhone 5S 16G Libre (ME432Y / AB).,4.694.994,4.694.994,0,2
797,APP0673,Apple iPhone 5S 16GB Silver,New Free iPhone 5S 16GB (ME433Y / A).,4.090.042,4.090.042,0,2
827,PAC0339,NewerTech miniStack 4TB Hard Drive Mac,External Box Hard Drive Mac + 4TB.,2.199.791,2.199.901,0,2
885,PAC0376,OWC Mercury Elite Pro Dual Thunderbolt + 8TB,RAID outer box 35 inch SATA connection Thunder...,5.609.698,5.549.895,0,2


In [77]:
# count the rows with no dots at all
price_temp.query('num_decimals < 1').shape[0]

11528

In [78]:
price_temp.query('num_decimals < 1').head(5)

Unnamed: 0,sku,name,desc,price,promo_price,num_decimals
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59,589.996,0
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59,569.898,0
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25,229.997,0
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45,420.003,0
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74,669.904,0


#### Explore promo price values

In [47]:
# Create a copy of the dataset
promo_price_temp = products.copy()

# create a new column with the number of dots in the promo_price column
promo_price_temp['num_decimals'] = promo_price_temp['promo_price'].str.count(r'\.')

# show the rows with more than one dot
promo_price_temp.query('num_decimals > 1').shape[0]

12902

In [48]:
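# count the rows with more than two dots
# (this queries price_temp, so the result is for price; query promo_price_temp to check promo_price)
price_temp.query('num_decimals > 2').shape[0]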

0

#### Compare both

<div class="alert alert-block alert-info">
    There are far more corrupted values in <b>promo_price</b>, so let's look at the rows where the number of dots differs between <b>price</b> and <b>promo_price</b> and compare the values.
</div>

In [51]:
# Create a copy of the dataset
products_temp = products.copy()

# create new columns with the number of dots in the price and promo_price columns
products_temp['num_decimals_price'] = products_temp['price'].str.count(r'\.')
products_temp['num_decimals_promo_price'] = products_temp['promo_price'].str.count(r'\.')

# show the rows where the number of dots differs between the two columns
products_temp.query('num_decimals_price != num_decimals_promo_price')

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,num_decimals_price,num_decimals_promo_price
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59,589.996,0,0,1
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59,569.898,0,0,1
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25,229.997,0,0,1
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45,420.003,0,0,1
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74,669.904,0,0,1
...,...,...,...,...,...,...,...,...
19316,REP0407,iPad battery repair,Repair service including parts and labor for iPad,6.999.003,69.99,0,2,1
19317,REP0403,iPad LCD screen repair,Repair service including parts and labor for iPad,159.99,1.599.898,0,1,2
19318,REP0402,iPad touch screen repair,Repair service including parts and labor for iPad,139.99,1.399.897,0,1,2
19319,KNO0032,"Knomo MacBook Pro Beauchamp Backpack 14 ""Black",Backpack thin nylon mesh internal compartment ...,179,1.699.905,1,0,2


<div class="alert alert-block alert-danger">
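    In the first five rows above (indices 1 to 8) we see that even with a single dot, the <b>promo_price</b> has been corrupted: a <b>price</b> of 59, for example, is paired with a <b>promo_price</b> of 589.996.<br>
    In row 19316 we see that the <b>price</b> has been corrupted (6.999.003) and the <b>promo_price</b> is correct (69.99).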
</div>
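
Counting dots misses values such as `589.996`, which have a single dot but three digits after it. A possible way to flag both kinds of corruption is a pattern check. The sketch below assumes that a clean price consists of digits with at most one dot followed by one or two decimals. That assumption matches the clean values shown above (`59`, `59.99`, `699.9`) but is not confirmed for the whole file.

```python
import pandas as pd

# Sample values taken from the outputs above
values = pd.Series(["59.99", "59", "699.9", "589.996", "1.639.792", "6.999.003"])

# Assumed shape of a clean price: digits, optionally one dot and one or two decimals
clean_pattern = r"^\d+(?:\.\d{1,2})?$"

# na=False treats missing values as not clean instead of propagating NaN
looks_clean = values.str.match(clean_pattern, na=False)
suspect = values[~looks_clean]
print(suspect.tolist())
```

Applied to both price columns, a mask like `looks_clean` would give the share of rows that need repair or removal.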

## Clean products

In [17]:
products = products_original.copy()

products_before = products.shape[0]

print("Products.shape - Initial:", products.shape[0])
# Drop the type and in_stock columns
products.drop(columns=['type', 'in_stock'], inplace=True, errors='ignore')

print("Products.shape - After dropping deprecated columns:", products.shape[0])

# Drop duplicate rows based on sku value
products.drop_duplicates(subset='sku', inplace=True)

print("Products.shape - After dropping duplicate sku values:", products.shape[0])

# Check for products without descriptions
names_of_products_without_descriptions = products[products.desc.isna()].name.tolist()
# Add missing descriptions, listed in the same order as names_of_products_without_descriptions
missing_descriptions = [
    '2TB Mac hard drive and Nas',
    'Apple keyboard for iPad 9.7',
    'NAS server with 10GB RAM',
    'Ethernet adapter for Macbook 12',
    'Luxury power bank combined with powder, 2 mirrors - normal and 3x magnification, Illuminated under mirror with LED, Low weight and compact dimensions',
    'Battery capacity: 20,000 mAh; ultra-stable: outer shell made of durable synthetic rubber (military standard, withstands drops from up to 2 metres) ; protection: dust and splash proof: military standard iP54; battery level indicator and super fast charging; USB port can be connected to charger and other devices',
    'Smart thermostat designed to provide automatic time and temperature control of heating systems in homes and apartments. '
]
# Pair each product name with its description and fill in the missing value
for name, description in zip(names_of_products_without_descriptions, missing_descriptions):
    products.loc[products.name == name, 'desc'] = description

def start_pipeline(df):
    '''Make a copy of the DataFrame to prevent corrupting the original data'''
    return df.copy()

def remove_missing_prices(df, col):
    return df[~df[col].isna()]


products = (products
        .pipe(start_pipeline)
        .pipe(remove_missing_prices,col='price')
)


print("Products.shape - After removing rows with missing price values:", products.shape[0])

products_after = products.shape[0]

print('\n')
print(f"{products_before-products_after} rows were removed from products")
print(f"This represents {(products_before-products_after)/products_before * 100:.2f}% of the data.")
print("\n\n")

products

Products.shape - Initial: 19326
Products.shape - After dropping deprecated columns: 19326
Products.shape - After dropping duplicate sku values: 10579
Products.shape - After removing rows with missing price values: 10534


8792 rows were removed from products
This represents 45.49% of the data.





Unnamed: 0,sku,name,desc,price,promo_price
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59,589.996
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59,569.898
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25,229.997
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99
...,...,...,...,...,...
19321,BEL0376,Belkin Travel Support Apple Watch Black,compact and portable stand vertically or horiz...,29.99,269.903
19322,THU0060,"Enroute Thule 14L Backpack MacBook 13 ""Black",Backpack with capacity of 14 liter compartment...,69.95,649.903
19323,THU0061,"Enroute Thule 14L Backpack MacBook 13 ""Blue",Backpack with capacity of 14 liter compartment...,69.95,649.903
19324,THU0062,"Enroute Thule 14L Backpack MacBook 13 ""Red",Backpack with capacity of 14 liter compartment...,69.95,649.903
