# products.csv

* **sku** – stock keeping unit: a unique identifier for each product
* **name** – product name
* **desc** – product description
* **price** – base price of the product, in euros
* **promo_price** – promotional price, in euros
* **in_stock** – whether or not the product was in stock at the moment of the data extraction
* **type** – a numerical code for product type

## Import the data

In [18]:
import pandas as pd
import numpy as np
import re

pd.options.display.max_rows = 500

products_original = pd.read_csv('../../data/products.csv')
products_original.head()

Unnamed: 0,sku,name,desc,price,promo_price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99,1,1364


## Data exploration

In [19]:
products = products_original.copy()
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19326 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sku          19326 non-null  object
 1   name         19326 non-null  object
 2   desc         19319 non-null  object
 3   price        19280 non-null  object
 4   promo_price  19326 non-null  object
 5   in_stock     19326 non-null  int64 
 6   type         19276 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.0+ MB


<div class="alert alert-block alert-info">
    <b>price</b> and <b>promo_price</b> must be converted to float; <b>type</b> and <b>in_stock</b> can be removed.
</div>

In [20]:
products.drop(['type', 'in_stock'], axis=1, inplace=True)

In [21]:
products.nunique()

sku            10579
name           10373
desc            7098
price           2690
promo_price     4614
dtype: int64

In [22]:
products[products.sku.duplicated()]

Unnamed: 0,sku,name,desc,price,promo_price
101,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
102,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
103,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
104,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
105,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
...,...,...,...,...,...
16831,APP2302,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13-inch Core i5 Touch Bar 33 G...,26.155.941,26.155.941
16833,APP2303,"Apple MacBook Pro 13 ""Core i5 Touch Bar 33GHz ...",New MacBook Pro 13 inch Touch Bar 33 GHz Core ...,237.559.421,23.755.942
18190,PAR0077,Parrot Bebop Drone 2 Power,Drone cuadricóptero quality camera integrated ...,699.9,6.733.892
18308,NKI0010,Nokia Wireless sphygmomanometer Plata,Sphygmomanometer for iPhone iPad and iPod App.,129.99,1.149.899


In [23]:
products[products.sku=='APP0390']

Unnamed: 0,sku,name,desc,price,promo_price
100,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
101,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
102,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
103,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
104,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
105,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
106,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
107,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917
108,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917


<div class="alert alert-block alert-danger">
    We will need to drop these duplicated values. 
</div>
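
Before dropping rows on `sku` alone, it may be worth confirming that rows sharing a `sku` really agree in their other columns, since `drop_duplicates(subset='sku')` silently keeps only the first occurrence. A minimal sketch on made-up data (the `demo` frame and the `conflicting_skus` name are illustrative, not part of the dataset):

```python
import pandas as pd

# Made-up example: A1 is an exact duplicate, B2 has two different prices
demo = pd.DataFrame({
    'sku':   ['A1', 'A1', 'B2', 'B2', 'C3'],
    'price': ['10', '10', '5', '6', '7'],
})

# All rows whose sku appears more than once (keep=False marks every copy)
dupes = demo[demo.sku.duplicated(keep=False)]

# A sku is "conflicting" if any column holds more than one distinct value
has_conflict = dupes.groupby('sku').nunique().gt(1).any(axis=1)
conflicting_skus = has_conflict[has_conflict].index.tolist()
print(conflicting_skus)
```

If `conflicting_skus` comes back empty on the real data, keeping the first row per `sku` loses no information.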

#### Missing values

In [24]:
products.isna().sum()

sku             0
name            0
desc            7
price          46
promo_price     0
dtype: int64

<div class="alert alert-block alert-danger">
    We need to remove any rows with missing <b>price</b> values.
</div>

In [25]:
before = products.shape[0]
products = products[~products.price.isna()]
after = products.shape[0]
print(f"{before-after} rows removed.")

46 rows removed.


#### Missing descriptions

In [26]:
names_of_products_without_descriptions = products[products.desc.isna()].name.tolist()
names_of_products_without_descriptions

['Open - Purple 2TB WD 35 "PC Security Mac hard drive and NAS',
 'Open - Apple Smart Keyboard Pro Keyboard Folio iPad 9.7 "',
 'Synology DS718 + NAS Server | 10GB RAM',
 'Open - Kanex USB-C Gigabit Ethernet Adapter MacBook 12 "',
 'Hyper Pearl 1600mAh battery Mini USB Mirror and Comic Blond',
 'OtterBox External Battery Power Pack 20000 mAHr',
 'Open - Honeywell thermostat Lyric zonificador T6 Intelligent Wireless (cable)']

In [27]:
products[products.desc.isna()]

Unnamed: 0,sku,name,desc,price,promo_price
16126,WDT0211-A,"Open - Purple 2TB WD 35 ""PC Security Mac hard ...",,107,814.659
16128,APP1622-A,Open - Apple Smart Keyboard Pro Keyboard Folio...,,1.568.206,1.568.206
17843,PAC2334,Synology DS718 + NAS Server | 10GB RAM,,566.35,5.659.896
18152,KAN0034-A,Open - Kanex USB-C Gigabit Ethernet Adapter Ma...,,29.99,237.925
18490,HTE0025,Hyper Pearl 1600mAh battery Mini USB Mirror an...,,24.99,22.99
18612,OTT0200,OtterBox External Battery Power Pack 20000 mAHr,,79.99,56.99
18690,HOW0001-A,Open - Honeywell thermostat Lyric zonificador ...,,199.99,1.441.174


<div class="alert alert-block alert-info">
    There are only 7 missing descriptions. We can simply look up the products online and add them.
</div>
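
One way to add the looked-up descriptions is to key them by `sku` rather than relying on list order. The sketch below assumes a hypothetical `looked_up` dictionary; the description strings are placeholders, and only the two `sku` values are taken from the table above.

```python
import pandas as pd

# Small stand-in for the products table (descriptions shortened)
demo = pd.DataFrame({
    'sku':  ['HTE0025', 'OTT0200', 'RAI0007'],
    'desc': [None, None, 'Aluminum support compatible with all MacBook'],
})

# Hypothetical lookup table: sku -> description found online (placeholders)
looked_up = {
    'HTE0025': 'placeholder description for HTE0025',
    'OTT0200': 'placeholder description for OTT0200',
}

# Fill only the missing descriptions; existing ones are left untouched
demo['desc'] = demo['desc'].fillna(demo['sku'].map(looked_up))
print(demo)
```

Because `sku` is unique after deduplication, this avoids any mismatch if two products happen to share a name.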

#### Explore price values

In [28]:
# Create a copy of the dataset
price_temp = products.copy()

# create a new column with the number of dots in the price column
price_temp['num_decimals'] = price_temp['price'].str.count(r'\.')

# show the rows with more than one dot
price_temp.query('num_decimals > 1').shape[0]

431

In [29]:
# show the rows with more than two dots
price_temp.query('num_decimals > 2').shape[0]

0

In [30]:
price_temp.query('num_decimals > 1').head(5)

Unnamed: 0,sku,name,desc,price,promo_price,num_decimals
665,CRU0015-2,Crucial memory Mac 16GB (2x8GB) SO-DIMM DDR3 1...,RAM 16GB (2x8GB) 135V MacBook Pro iMac (2012/2...,1.639.792,1.629.894,2
792,APP0672,Apple iPhone 5S 16GB Space Gray,New iPhone 5S 16G Libre (ME432Y / AB).,4.694.994,4.694.994,2
797,APP0673,Apple iPhone 5S 16GB Silver,New Free iPhone 5S 16GB (ME433Y / A).,4.090.042,4.090.042,2
827,PAC0339,NewerTech miniStack 4TB Hard Drive Mac,External Box Hard Drive Mac + 4TB.,2.199.791,2.199.901,2
885,PAC0376,OWC Mercury Elite Pro Dual Thunderbolt + 8TB,RAID outer box 35 inch SATA connection Thunder...,5.609.698,5.549.895,2


In [31]:
# count the rows with no dots
price_temp.query('num_decimals < 1').shape[0]

11528

In [32]:
price_temp.query('num_decimals < 1').head(5)

Unnamed: 0,sku,name,desc,price,promo_price,num_decimals
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59,589.996,0
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59,569.898,0
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25,229.997,0
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45,420.003,0
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74,669.904,0


#### Explore promo price values

In [33]:
# Create a copy of the dataset
promo_price_temp = products.copy()

# create a new column with the number of dots in the promo_price column
promo_price_temp['num_decimals'] = promo_price_temp['promo_price'].str.count(r'\.')

# show the rows with more than one dot
promo_price_temp.query('num_decimals > 1').shape[0]

12902

In [34]:
# count the rows with more than two dots
promo_price_temp.query('num_decimals > 2').shape[0]

0

#### Compare both

<div class="alert alert-block alert-info">
    There are far more corrupted values in <b>promo_price</b> so let's check the values where only <b>promo_price</b> is corrupted and compare the values
</div>

In [35]:
# Create a copy of the dataset
products_temp = products.copy()

# create new columns with the number of dots in the price and promo_price columns
products_temp['num_decimals_price'] = products_temp['price'].str.count(r'\.')
products_temp['num_decimals_promo_price'] = products_temp['promo_price'].str.count(r'\.')

# show the rows where price and promo_price have a different number of dots
products_temp.query('num_decimals_price != num_decimals_promo_price')

Unnamed: 0,sku,name,desc,price,promo_price,num_decimals_price,num_decimals_promo_price
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59,589.996,0,1
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59,569.898,0,1
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25,229.997,0,1
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45,420.003,0,1
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74,669.904,0,1
...,...,...,...,...,...,...,...
19316,REP0407,iPad battery repair,Repair service including parts and labor for iPad,6.999.003,69.99,2,1
19317,REP0403,iPad LCD screen repair,Repair service including parts and labor for iPad,159.99,1.599.898,1,2
19318,REP0402,iPad touch screen repair,Repair service including parts and labor for iPad,139.99,1.399.897,1,2
19319,KNO0032,"Knomo MacBook Pro Beauchamp Backpack 14 ""Black",Backpack thin nylon mesh internal compartment ...,179,1.699.905,0,2


<div class="alert alert-block alert-danger">
    In the first five rows above we see that, even with a single decimal point, the <b>promo_price</b> has been corrupted.<br>
    In row 19316 we see that the <b>price</b> has been corrupted and the <b>promo_price</b> is correct.
</div>

## Clean products

In [49]:
products = products_original.copy()

products_before = products.shape[0]

print("Products.shape - Initial:", products.shape[0])
# Drop the type and in_stock columns 
products.drop('type', axis=1, inplace=True, errors='ignore')
products.drop('in_stock', axis=1, inplace=True, errors='ignore')

print("Products.shape - After dropping deprecated columns:", products.shape[0])

#  Drop duplicate rows based on sku value
products.drop_duplicates(subset='sku', inplace=True)

print("Products.shape - After dropping duplicate sku values:", products.shape[0])

# Check for products without descriptions
names_of_products_without_descriptions = products[products.desc.isna()].name.tolist()
# Add missing descriptions
missing_descriptions = [
    '2TB Mac hard drive and Nas',
    'Apple keyboard for iPad 9.7',
    'NAS server with 10GB RAM',
    'Ethernet adapter for Macbook 12',
    'Luxury power bank combined with powder, 2 mirrors - normal and 3x magnification, Illuminated under mirror with LED, Low weight and compact dimensions',
    'Battery capacity: 20,000 mAh; ultra-stable: outer shell made of durable synthetic rubber (military standard, withstands drops from up to 2 metres) ; protection: dust and splash proof: military standard iP54; battery level indicator and super fast charging; USB port can be connected to charger and other devices',
    'Smart thermostat designed to provide automatic time and temperature control of heating systems in homes and apartments. '
]
for name, description in zip(names_of_products_without_descriptions, missing_descriptions):
    products.loc[products.name == name, 'desc'] = description

def start_pipeline(df):
    '''Make a copy of the DataFrame to prevent corrupting the original data'''
    return df.copy()

def remove_missing_prices(df, col):
    return df[~df[col].isna()]


products = (products
        .pipe(start_pipeline)
        .pipe(remove_missing_prices,col='price')
)


print("Products.shape - After removing rows with missing price values:", products.shape[0])

products_after = products.shape[0]

print('\n')
print(f"{products_before-products_after} missing and duplicated values were removed from products")
print(f"This represents {(products_before-products_after)/products_before * 100:.2f}% of the data.")
print("\n\n")

products

Products.shape - Initial: 19326
Products.shape - After dropping deprecated columns: 19326
Products.shape - After dropping duplicate sku values: 10579
Products.shape - After removing rows with missing price values: 10534


8792 missing and duplicated values were removed from products
This represents 45.49% of the data.





Unnamed: 0,sku,name,desc,price,promo_price
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,499.899
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59,589.996
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59,569.898
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25,229.997
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,31.99
...,...,...,...,...,...
19321,BEL0376,Belkin Travel Support Apple Watch Black,compact and portable stand vertically or horiz...,29.99,269.903
19322,THU0060,"Enroute Thule 14L Backpack MacBook 13 ""Black",Backpack with capacity of 14 liter compartment...,69.95,649.903
19323,THU0061,"Enroute Thule 14L Backpack MacBook 13 ""Blue",Backpack with capacity of 14 liter compartment...,69.95,649.903
19324,THU0062,"Enroute Thule 14L Backpack MacBook 13 ""Red",Backpack with capacity of 14 liter compartment...,69.95,649.903


# Debug prices

In [74]:
def test_col_vals_are_greater_or_equal_to_other(df, greater_col, lesser_col):
    num_incorrect_vals = df[df[greater_col] < df[lesser_col]].shape[0]
    if num_incorrect_vals == 0:
        print(f"All of the {greater_col} values are greater or equal to the corresponding {lesser_col} values.\n")
        return pd.DataFrame()
    else:
        print(f"There are corrupted values in {greater_col} which are less than their corresponding {lesser_col} values.")
        print(f"This represents {num_incorrect_vals/df.shape[0]*100:.2f}% of the data.\n")
        corrupted_price_skus = df[df[greater_col] < df[lesser_col]].sku
        return corrupted_price_skus

def test_price_greater_or_equal_to_promo_price(df):
    incorrect_val_ids = test_col_vals_are_greater_or_equal_to_other(df, 'price', 'promo_price')
    return incorrect_val_ids

def add_discount_percentage_col(df):
    df['discount_percentage'] = (df.price - df.promo_price)/df.price*100
    return df

def check_for_outliers(df):
    '''
    Check for outliers to see if the decimal point was in the wrong spot.

    We define an outlier as a row whose discount percentage (between price and
    promo_price) is more than 3 standard deviations above the mean discount percentage.
    '''

    mean_diff = df['discount_percentage'].mean()
    std_diff = df['discount_percentage'].std()
    threshold = mean_diff + 3 * std_diff
    
    outliers = df[df['discount_percentage'] > threshold]
    
    return outliers

In [50]:
# Create a copy of the dataset
products_temp = products.copy()

# create new columns with the number of dots in the price and promo_price columns
products_temp['price_decimal_count'] = products_temp['price'].str.count(r'\.')
products_temp['promo_price_decimal_count'] = products_temp['promo_price'].str.count(r'\.')

# Group by the num decimal columns and get the count & percentage for each group
grouped_counts = products_temp.groupby(
    ['price_decimal_count', 'promo_price_decimal_count']
).size().reset_index(name='count')

# Calculate the percentage of each group relative to the total number of rows
total_rows = len(products_temp)
grouped_counts['percentage'] = (grouped_counts['count'] / total_rows) * 100

# Display the resulting dataframe
grouped_counts.sort_values('percentage', ascending=False)

Unnamed: 0,price_decimal_count,promo_price_decimal_count,count,percentage
4,1,1,4610,43.763053
5,1,2,2312,21.947978
2,0,2,2009,19.071578
1,0,1,1104,10.480349
8,2,2,276,2.620087
0,0,0,102,0.968293
7,2,1,98,0.930321
3,1,0,20,0.189861
6,2,0,3,0.028479


## price_decimal_count = promo_price_decimal_count = 0

In [72]:
skus = products_temp[
    (products_temp.price_decimal_count == 0) &
    (products_temp.promo_price_decimal_count == 0)
].sku

zero_zero_decimal_data = products_temp[products_temp.sku.isin(skus)].copy()
zero_zero_decimal_data[['price', 'promo_price']] = zero_zero_decimal_data[['price', 'promo_price']].astype('float')

In [73]:
test_price_greater_or_equal_to_promo_price(zero_zero_decimal_data)

zero_zero_decimal_data = add_discount_percentage_col(zero_zero_decimal_data)

outliers = check_for_outliers(zero_zero_decimal_data)
outliers

All of the price values are greater or equal to the corresponding promo_price values.



Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count,discount_percentage
15936,PAC2059,"Second hand - Apple iMac 20 ""Core 2 Duo 266GHz...",Refurbished iMac 20 inch Core 2 Duo 266GHz | 4...,1199.0,409.0,0,0,65.88824
15954,PAC2065,"Second hand - Apple iMac 24 ""Core 2 Duo 28 GHz...",Computer Refurbished iMac 24 inch Core 2 Duo 2...,1799.0,499.0,0,0,72.262368
17706,PAC2199,"Second hand - Apple iMac 20 ""Core 2 Duo 226GHz...",Refurbished iMac 20 inch Core 2 Duo | 2GB RAM ...,1199.0,259.0,0,0,78.398666


<div class="alert alert-block alert-success">
    The test passes and the outliers do not have the decimal point in the wrong place.
    <br><br>
    The data is clean.
    <br><br>
    For now we will assume that values with no decimal point are whole numbers which were saved as ints.
</div>
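
To illustrate why the three-standard-deviation rule in `check_for_outliers` can expose a misplaced decimal point, here is a minimal sketch on synthetic data (all values are made up for illustration): 40 ordinary discounts between 5% and 14%, plus one `promo_price` that is ten times too small.

```python
import pandas as pd

# Synthetic data (assumption): ordinary discounts of 5-14 % ...
discounts = [5 + (i % 10) for i in range(40)]
demo = pd.DataFrame({
    'sku': [f'OK{i:02d}' for i in range(40)] + ['BAD'],
    'price': [100.0] * 41,
    # ... and one promo_price with the decimal point one place too far left
    'promo_price': [100.0 - d for d in discounts] + [10.0],
})
demo['discount_percentage'] = (demo.price - demo.promo_price) / demo.price * 100

mean = demo.discount_percentage.mean()
std = demo.discount_percentage.std()   # sample standard deviation (ddof=1)
threshold = mean + 3 * std             # one-sided: only unusually large discounts

flagged = demo.loc[demo.discount_percentage > threshold, 'sku'].tolist()
print(round(threshold, 2), flagged)
```

Two caveats: the rule is one-sided, so a `promo_price` that is too large is never flagged by it, and in a subset of ten or fewer rows no value can exceed the mean by three sample standard deviations, so an empty result there says little.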

## price_decimal_count = 1,  promo_price_decimal_count = 0

In [78]:
skus = products_temp[
    (products_temp.price_decimal_count == 1) &
    (products_temp.promo_price_decimal_count == 0)
].sku

one_zero_decimal_data = products_temp[products_temp.sku.isin(skus)].copy()
one_zero_decimal_data[['price', 'promo_price']] = one_zero_decimal_data[['price', 'promo_price']].astype('float')

corrupted_price_skus = test_price_greater_or_equal_to_promo_price(one_zero_decimal_data)
one_zero_decimal_data[one_zero_decimal_data.sku.isin(corrupted_price_skus)]

There are corrupted values in price which are less than their corresponding promo_price values.
This represents 15.00% of the data.



Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count
10966,SAN0092,SanDisk Ultra USB 3.0 128GB pendrive,Pendrive USB 3.0 Flash Drive 128G for Mac and PC.,44.99,47.0,1,0
11038,SAN0110,SanDisk Ultra Flair Flash Drive 16GB USB 3.0,USB 3.0 flash drive 16GB USB Flash Drive Mac a...,8.99,9.0,1,0
13487,SAN0158,SanDisk Cruzer Dial Flash Drive 16GB USB 2.0,Ultra compact flash drive with built-dial for ...,7.99,9.0,1,0


In [79]:
one_zero_decimal_data = add_discount_percentage_col(one_zero_decimal_data)

outliers = check_for_outliers(one_zero_decimal_data)
outliers

Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count,discount_percentage


<div class="alert alert-block alert-danger">
    Some of the price or promo_price values here are incorrect.
</div>

## price_decimal_count = 0,  promo_price_decimal_count = 1

In [96]:
skus = products_temp[
    (products_temp.price_decimal_count == 0) &
    (products_temp.promo_price_decimal_count == 1)
].sku

zero_one_decimal_data = products_temp[products_temp.sku.isin(skus)].copy()
zero_one_decimal_data[['price', 'promo_price']] = zero_one_decimal_data[['price', 'promo_price']].astype('float')

corrupted_price_skus = test_price_greater_or_equal_to_promo_price(zero_one_decimal_data)
zero_one_decimal_data[zero_one_decimal_data.sku.isin(corrupted_price_skus)].head()

There are corrupted values in price which are less than their corresponding promo_price values.
This represents 80.43% of the data.



Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,589.996,0,1
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,569.898,0,1
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,229.997,0,1
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45.0,420.003,0,1
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74.0,669.904,0,1


<div class="alert alert-block alert-danger">
    It appears that 80% of the promo_price values in our subset have the decimal point one place too far to the right (for example, 589.996 where the price is 59).
    <br>
    <br>
    Let's try to fix this and see if the test passes.
</div>

In [97]:
''' 
First extract the skus of the promo_prices which passed the test to see if there is a 
difference between the format of their promo_price values and the values that didn't pass. 
'''
passing_skus = zero_one_decimal_data[zero_one_decimal_data.price >= zero_one_decimal_data.promo_price].sku

In [103]:
# Copy the data again to view the prices in string format
skus = products_temp[
    (products_temp.price_decimal_count == 0) &
    (products_temp.promo_price_decimal_count == 1)
].sku
zero_one_decimal_data = products_temp[products_temp.sku.isin(skus)].copy()

zero_one_decimal_data[zero_one_decimal_data.sku.isin(passing_skus)].head()

Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count
262,PAC0183,"Apple MacBook Pro 133 ""25GHz | 8GB RAM | 500GB...",Apple MacBook Pro 133 inches (MD101Y / A) with...,1319,1279.99,0,1
297,WDT0139,"Red 1TB WD 35 ""Mac PC hard drive and NAS",Western Digital hard drive designed for NAS 1T...,77,64.493,0,1
306,SAN0014,SanDisk Cruzer Blade 16GB USB 2.0 Flash Drive,Compact flash drive 16GB USB 2.0 Mac and PC.,6,4.99,0,1
341,APP0416,Apple iPod Touch 32GB Silver,New generation iPod touch 5 32GB white / silver.,279,264.99,0,1
342,APP0417,Apple iPod Touch 32GB yellow,New 5 generation iPod touch 32GB yellow.,279,264.99,0,1


In [102]:
'''
There is no distinguishing pattern.
The passing promo_price values also have 1, 2 or 3 digits after the decimal point,
just like the corrupted ones.
'''
zero_one_decimal_data[zero_one_decimal_data.sku.isin(['GTE0012-A', 'SAN0014', 'WDT0139'])]

Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count
297,WDT0139,"Red 1TB WD 35 ""Mac PC hard drive and NAS",Western Digital hard drive designed for NAS 1T...,77,64.493,0,1
306,SAN0014,SanDisk Cruzer Blade 16GB USB 2.0 Flash Drive,Compact flash drive 16GB USB 2.0 Mac and PC.,6,4.99,0,1
1892,GTE0012-A,(Open) G-Tech G-DRIVE USB 3.0 Hard Disk 2TB,external hard drive G-Technology for Mac or PC...,235,169.9,0,1


In [154]:
'''
We're going to have to move the decimal point until the promo_price is less than or equal to the price
'''
# Copy the data again so the prices are back in str form
skus = products_temp[
    (products_temp.price_decimal_count == 0) &
    (products_temp.promo_price_decimal_count == 1)
].sku
zero_one_decimal_data = products_temp[products_temp.sku.isin(skus)].copy()

# Transform the price col to float
zero_one_decimal_data.price = zero_one_decimal_data.price.astype('float')

In [155]:
import math

def _insert_decimal_at_string_position(s, pos):
    '''Insert a decimal point at a given position in a string'''
    s = s.split('.')
    s = s[0] + s[1]
    s = s[:pos]+'.'+s[pos:]
    return s 

def _insert_decimal_in_promo_price(row, decimal_position = -2):
    '''
    If the euro value of the price is equal to the euro value of the promo_price, 
    set the promo_price equal to price and return it.
    This is because some promo_prices are slightly larger than their equivalent prices, such as: 

        price  - promo_price
        12.95  - 12.99
        59.00  - 59.99
        89.00  - 89.79
        549.00 - 550.33
        55.0   - 5.88
        60.0   - 6.86
        529.0  - 53.18

    Otherwise, keep moving the decimal point towards the start of the string until the
    promo_price is no longer greater than the price.
    Then transform the string to a float and round it to two decimal places.
    '''

    while float(row.promo_price) > row.price:
        if math.floor(float(row.promo_price)) == math.floor(row.price):
            # Same euro value: treat the promo_price as equal to the price
            return row.price
        # Otherwise re-insert the decimal point one position closer to the start
        row.promo_price = _insert_decimal_at_string_position(row.promo_price, decimal_position)
        decimal_position -= 1
    return round(float(row.promo_price), 2)


def transform_promo_price_to_float_wrt_price(df):
    df.promo_price = [_insert_decimal_in_promo_price(row) for index, row in df.iterrows()]
    return df

test_data = zero_one_decimal_data.copy() #[zero_one_decimal_data.sku.isin(['APP1482'])].copy()
test_data = transform_promo_price_to_float_wrt_price(test_data)
test_data

Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,59.00,0,1
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,56.99,0,1
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,23.00,0,1
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45.0,42.00,0,1
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74.0,66.99,0,1
...,...,...,...,...,...,...,...
19283,AP20468,Like new - Apple iPhone Black Lightning Dock,Support base and refitted with dock connector ...,59.0,44.00,0,1
19285,AP20470,Like new - Apple Thunderbolt to Gigabit Ethern...,Refurbished Mac adapter Thunderbolt to Gigabit...,35.0,28.00,0,1
19288,AP20649,Like new - Apple Leather Case iPhone Case 8/7 ...,Reconditioned sleeve leather and microfiber Ap...,55.0,42.00,0,1
19295,AP20471,Apple Thunderbolt to FireWire 800 adapter,Reconditioned connection adapter Thunderbolt t...,35.0,28.00,0,1


In [159]:
corrupted_price_skus = test_price_greater_or_equal_to_promo_price(test_data)

All of the price values are greater or equal to the corresponding promo_price values.
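
To make the decimal-shifting loop more concrete, the sketch below traces the intermediate strings it tries for a single value. `trace_decimal_shift` is a hypothetical helper written only for this illustration; it mirrors the logic of `_insert_decimal_in_promo_price` but records each step instead of returning the final float.

```python
import math

def trace_decimal_shift(raw_promo, price, start_position=-2):
    """Return every intermediate string tried while shifting the decimal point."""
    steps = []
    promo, position = raw_promo, start_position
    while float(promo) > price:
        if math.floor(float(promo)) == math.floor(price):
            break  # same euro value: the original function returns the price here
        digits = promo.replace('.', '')               # drop the existing dot
        promo = digits[:position] + '.' + digits[position:]
        steps.append(promo)
        position -= 1                                 # next try: one place further left
    return steps

print(trace_decimal_shift('589.996', 59.0))
print(trace_decimal_shift('569.898', 59.0))
```

Note that the first attempt (position `-2`) can move the point to the right before later attempts walk it back towards the start of the string; the loop stops at the first candidate that is not greater than the price.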



In [156]:
test_data = add_discount_percentage_col(test_data)

outliers = check_for_outliers(test_data)
outliers

Unnamed: 0,sku,name,desc,price,promo_price,price_decimal_count,promo_price_decimal_count,discount_percentage
560,KIN0078,Kingston V300 SSD Disk 120GB,SSD 120GB SATA Hard Drive Mac and PC III.,60.0,6.86,0,1,88.566667
1301,LAC0134,LaCie 6TB 2big Thunderbolt 2,Dual-drive RAID box for Mac 6TB Thunderbolt 2.,529.0,53.18,0,1,89.94707
2884,APP1104,Apple iPod Shuffle 2GB Gray Space,Music player iPod Shuffle 2GB and small size.,55.0,5.88,0,1,89.309091
2888,APP1105,Apple iPod Shuffle 2GB Pink,Music player iPod Shuffle 2GB and small size.,55.0,5.88,0,1,89.309091
2889,APP1106,Apple iPod Shuffle 2GB Blue,Music player iPod Shuffle 2GB and small size.,55.0,5.88,0,1,89.309091
2890,APP1102,Apple iPod Shuffle 2GB Silver,Music player iPod Shuffle 2GB and small size.,55.0,5.88,0,1,89.309091
2891,APP1103,Apple iPod Shuffle 2GB Gold,Music player iPod Shuffle 2GB and small size.,55.0,5.88,0,1,89.309091
11770,APP1482,Apple iPhone 64GB Silver,New iPhone SE with 4-inch screen 64 free GB.,549.0,55.03,0,1,89.976321
11771,APP1483,Apple iPhone 64GB Gold,New iPhone SE with 4-inch screen 64 free GB.,549.0,55.03,0,1,89.976321
11772,APP1484,Apple iPhone 64GB Rose Gold,New iPhone SE with 4-inch screen 64 free GB.,549.0,55.03,0,1,89.976321


In [160]:
''' Let's check some of the values which do not seem to match. '''

orderlines_clean = pd.read_csv('../../data/orderlines_clean.csv')
orderlines_clean[orderlines_clean.sku=="APP1749"]

Unnamed: 0,id,order_id,product_quantity,sku,unit_price,date,short
110,1119301,299631,1,APP1749,429.0,2017-01-01 11:43:38,APP
627,1120315,300140,1,APP1749,429.0,2017-01-02 00:13:43,APP
656,1120359,300162,1,APP1749,429.0,2017-01-02 00:30:48,APP
667,1120375,300172,1,APP1749,429.0,2017-01-02 00:47:02,APP
678,1120398,300181,1,APP1749,429.0,2017-01-02 01:01:13,APP
1755,1122711,301185,1,APP1749,429.0,2017-01-02 23:56:13,APP
2077,1123350,301487,1,APP1749,429.0,2017-01-03 12:01:11,APP
3675,1130717,302898,1,APP1749,429.0,2017-01-05 07:23:18,APP
4550,1132509,303757,1,APP1749,411.99,2017-01-06 19:10:26,APP
4590,1132579,303802,1,APP1749,429.0,2017-01-06 20:43:48,APP


<div class="alert alert-block alert-danger">
    sku=="APP1749"
    <br><br>
    Clearly the promo_price values are incorrect. 
    <br><br>
    products.price == 429.00 is correct.
    <br>
    products.promo_price == 43.28 is probably supposed to be 432.8, but either way it is incorrect.
    <br><br>
    As we can see above, the unit_price varies over time. Clearly varying discounts have been applied at different times. This is exactly what we want to analyse.
    <br><br>
    We will have to ignore the products.promo_price values. 
    <br>
    The products.price and orderlines.unit_price values will have to be merged and compared. 
    <br>
    The price cleaning functions above can be reused but it might make more sense to use the discount percentage to compare the values when adjusting the decimal point position.
</div>
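
As a rough sketch of the discount-percentage idea, one could generate every possible decimal position for a raw value and keep the candidate whose discount relative to a trusted reference price is plausible. The function names and the 60% ceiling below are assumptions made for illustration, not values derived from the data.

```python
from typing import List, Optional

def candidate_values(raw: str) -> List[float]:
    """All floats obtainable by placing a single decimal point in the digits."""
    digits = raw.replace('.', '')
    candidates = [float(digits)]                      # no decimal point at all
    for i in range(1, len(digits)):
        candidates.append(float(digits[:i] + '.' + digits[i:]))
    return candidates

def best_promo_price(raw: str, price: float,
                     max_discount: float = 60.0) -> Optional[float]:
    """Pick the candidate with a discount between 0 and max_discount percent.

    max_discount=60 is an assumed plausibility ceiling, not taken from the data.
    Returns None when no candidate is plausible.
    """
    for value in candidate_values(raw):
        discount = (price - value) / price * 100
        if 0 <= discount <= max_discount:
            return round(value, 2)
    return None

print(best_promo_price('589.996', 59.0))
print(best_promo_price('43.28', 429.0))
```

Because successive candidates differ by a factor of ten, at most one of them can fall inside a window this narrow. A `None` result marks a value that cannot be repaired from the price alone, which is where a reference such as `orderlines.unit_price` would be needed.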