# Products.csv (pro_)
products.csv
- sku – stock keeping unit: a unique identifier for each product
- name – product name
- desc – product description
- price – product price
- promo_price – promotional price
- in_stock – whether or not the product was in stock at the moment of the data extraction
- type – a numerical code for product type

    ==============================================================================================

## Importing the data
- ``` glob.glob("file_pat") ``` --> list the files matching a pattern, to read multiple files
- ``` pd.concat(dfs_list, ignore_index=True)```  --> create 1 df from multi dfs
- ``` pd.read_csv(path)```  --> create 1 df from a csv file
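
The three calls above can be combined as in the following minimal sketch. The temporary directory, the `part_*.csv` file names and the sample rows are assumptions made up for illustration; they are not part of the products data.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV files so the example is runnable.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"sku": ["A1", "A2"], "price": [1.0, 2.0]}).to_csv(
    os.path.join(tmp_dir, "part_1.csv"), index=False)
pd.DataFrame({"sku": ["B1"], "price": [3.0]}).to_csv(
    os.path.join(tmp_dir, "part_2.csv"), index=False)

# glob.glob returns the paths matching the pattern; sort them for a stable order.
paths = sorted(glob.glob(os.path.join(tmp_dir, "part_*.csv")))

# Read each file into its own DataFrame, then stack them into a single one.
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)
```

With `ignore_index=True` the result gets a fresh 0..n-1 index instead of repeating each file's own row numbers.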

In [162]:
import pandas as pd
import numpy as np

pd.set_option("display.min_rows", 0) 
pd.set_option("display.max_rows", 30) 
pd.__version__

In [163]:
url = "https://drive.google.com/file/d/1afxwDXfl-7cQ_qLwyDitfcCx3u7WMvkU/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
pro = pd.read_csv(path)

      ===============================================

## Rename Columns , Set Index
- Rules: 
    - add a 2-3 character prefix to all columns
    - change the column name to CamelCase
    - shorten it as much as possible
    - the unique (key) column shall have an ALL_CAPS name
- ``` df.columns```   , ``` df.index``` 
- ``` df=df.rename(columns={"A": "a", "B": "c"})``` 
- ``` df.columns = ["a", "b", "x"]``` 
    - take care: this assigns the names by position, so the list must match the number and the order of the existing columns
- ``` df=df.set_index("col")```  , ``` df=df.reset_index()``` 
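
A small sketch of the renaming and indexing calls listed above, run on a made-up two-column frame (the sample values are assumptions, not rows from the products table):

```python
import pandas as pd

df = pd.DataFrame({"sku": ["A1", "A2"], "in_stock": [1, 0]})

# Mapping-based rename: only the listed columns change, their order is irrelevant.
renamed = df.rename(columns={"sku": "PRO_SKU", "in_stock": "pro_InStock"})

# Positional assignment: the list must match the number and order of the columns.
positional = df.copy()
positional.columns = ["PRO_SKU", "pro_InStock"]

# set_index moves a column into the index; reset_index moves it back out.
indexed = renamed.set_index("PRO_SKU")
restored = indexed.reset_index()
```

The mapping form is usually the safer choice, because a typo in a key leaves the column untouched instead of silently mislabeling another one.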

In [164]:
pro.columns

Index(['sku', 'name', 'desc', 'price', 'promo_price', 'in_stock', 'type'], dtype='object')

In [165]:
pro=pro.rename(columns={"sku": "PRO_SKU"
                        , "name": "pro_Name"
                        , "desc": "pro_Desc"
                        , "price": "pro_Price"
                        , "promo_price": "pro_PromoPrice"
                        , "in_stock": "pro_InStock"
                        , "type": "pro_Type"})

In [166]:
pro.columns

Index(['PRO_SKU', 'pro_Name', 'pro_Desc', 'pro_Price', 'pro_PromoPrice', 'pro_InStock', 'pro_Type'], dtype='object')

In [167]:
pro.index

RangeIndex(start=0, stop=19326, step=1)

    ==============================================================================================

## Explore the data
- ``` df.shape``` , ``` df.size``` , ``` df.ndim``` 
- ``` df.sample(5)``` , ``` df.info()``` 
- Numerical : ``` df.describe()``` , ``` df.col.nlargest()``` , ``` df.col.nsmallest()``` 
- Category : ``` df.nunique()``` , ``` df.col.unique() ``` 

In [168]:
pro.shape

(19326, 7)

In [169]:
pro.sample(10)
## price and promo_price require clean up TODO: fix

Unnamed: 0,PRO_SKU,pro_Name,pro_Desc,pro_Price,pro_PromoPrice,pro_InStock,pro_Type
1351,APP0863,Apple iPad Air 2 Wi-Fi 16GB Space Gray,New iPad Air 2 Wi-Fi 16GB (MGL12TY / A).,429.0,4.129.851,0,12141714
5610,PAC1054,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 32GB...",IMac desktop computer 27 inch 5K Retina i5 3.3...,3649.0,30.809.903,0,"5,74E+15"
8824,APP1226-A,"(Open) Apple iMac 215 ""Core i5 16GHz | 8GB | 2...",IMac desktop computer 215 inch 8GB RAM 256GB F...,1579.0,14.444.449,0,1282
14005,APP1871,"Apple MacBook Pro 15 ""Core i7 Touch Bar 29GHz ...",New MacBook Pro 15-inch Core i7 Touch Bar 29Gh...,3319.0,31.595.847,0,2158
5561,PAC1054,"Apple iMac 27 ""Core i5 3.3GHz Retina 5K | 32GB...",IMac desktop computer 27 inch 5K Retina i5 3.3...,3649.0,30.809.903,0,"5,74E+15"
12107,LAC0180,LaCie Porsche Design Mobile Hard Drive 1TB USB...,External Hard Drive 1TB USB-C and USB 3.0 conn...,104.99,789.949,1,11935397
15094,PRO0023,Pegasus2 Promise R6 Thunderbolt RAID 24TB Hard 2,Massive external storage system of 24TB (6x4TB...,3502.95,28.468.166,0,11935397
2080,APP0938,"Apple MacBook Air 11 ""i5 16 Ghz | 4GB RAM | 25...",laptop MacBook Air 11 inch i5 16GHz 4GB 256GB ...,1249.0,11.715.849,0,1282
12557,IFX0060,"Battery iFixit MacBook Air 13 ""(Late 2008 / Mi...",Replacement Battery for MacBook Air 13-inch La...,99.95,899.901,1,13005399
9868,PAC0974,"Apple iMac 27 ""Core i5 3.2GHz Retina 5K | 32GB...",IMac desktop computer 27 inch 5K Retina i5 3.2...,3169.0,26.309.901,0,"5,74E+15"


In [170]:
pro.info()  
## hint: pro_Desc (7), pro_Price (46) and pro_Type (50) contain nulls
## hint: pro_Price is of type object, has to be float - TODO: fix required
## hint: pro_PromoPrice is of type object, has to be float - TODO: fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19326 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   PRO_SKU         19326 non-null  object
 1   pro_Name        19326 non-null  object
 2   pro_Desc        19319 non-null  object
 3   pro_Price       19280 non-null  object
 4   pro_PromoPrice  19326 non-null  object
 5   pro_InStock     19326 non-null  int64 
 6   pro_Type        19276 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.0+ MB


In [171]:
pro.describe()
# pro_InStock only holds 0 and 1, so it looks like a bool - TODO: convert to bool
# at least 75% of the rows are 0; the mean shows that only about 11% of the rows are in stock

Unnamed: 0,pro_InStock
count,19326.0
mean,0.109593
std,0.31239
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0
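
The bool conversion noted in the TODO could look like the sketch below. It uses a made-up four-row frame, and the check that only 0 and 1 occur is an added safety assumption rather than a step from the notebook.

```python
import pandas as pd

df = pd.DataFrame({"pro_InStock": [0, 1, 0, 0]})

# Confirm that only 0 and 1 occur before casting, so no information is lost.
assert set(df["pro_InStock"].unique()) <= {0, 1}

df["pro_InStock"] = df["pro_InStock"].astype(bool)

# The mean of a boolean column is the share of True values.
share_in_stock = df["pro_InStock"].mean()
```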


In [172]:
pro.nunique() # hint: PRO_SKU has 10579 unique values in 19326 rows, so it is not unique per row yet
# there seem to be a lot of duplicates

PRO_SKU           10579
pro_Name          10373
pro_Desc           7098
pro_Price          2690
pro_PromoPrice     4614
pro_InStock           2
pro_Type            126
dtype: int64

In [173]:
pro.pro_Price.sort_values(ascending=False).unique()

array(['999.99', '999.944', '999.896', ..., '1.099.043', '1.090.004', nan],
      dtype=object)

In [174]:
pro.pro_PromoPrice.sort_values(ascending=False)

2206        999.99
11809       999.99
2649       999.944
2670       999.944
12108      999.944
10793      999.944
11456      999.944
10785      999.944
2404       999.944
2403       999.944
11764      999.944
14609      999.944
769        999.944
76         999.944
77         999.944
           ...    
2949     1.027.943
16678    1.027.002
16676    1.027.002
16677    1.027.002
12962    1.026.775
14653    1.021.777
299      1.019.933
1861     1.019.909
2644     1.017.937
18640    1.013.179
17125    1.010.414
11237    1.008.368
11236    1.008.368
2064     1.007.942
17320    1.003.978
Name: pro_PromoPrice, Length: 19326, dtype: object
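
Because both price columns are of dtype object, `sort_values` orders them as strings, which is why '999.99' appears above values such as '1.027.943'. One possible way to flag the values that cannot be read as plain floats is to count the dots, as in this sketch. The sample Series is an assumption that mirrors the strings shown above; how the flagged values should be repaired is left open here.

```python
import pandas as pd

# Hypothetical sample mirroring the string prices seen in the outputs above.
prices = pd.Series(["999.99", "1.027.943", "104.99", "4.129.851", None])

# Count the dots in each string; missing values stay missing.
dot_count = prices.str.count(r"\.")

# More than one dot means the value cannot be parsed as a plain float.
malformed = dot_count > 1

# errors="coerce" turns unparseable strings into NaN instead of raising.
as_float = pd.to_numeric(prices, errors="coerce")
```

Comparing `malformed.sum()` with the number of new NaN values in `as_float` gives a quick estimate of how many rows need a manual fix.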

    ==============================================================================================

## Initial Clean
- Remember to create a copy of the df using ``` df.copy()``` 

In [175]:
pro_original = pro.copy()

      ===============================================

### Strip whitespaces
- ``` df.applymap(lambda x: x.strip() if isinstance(x, str) else x)```

In [176]:
pro = pro.applymap(lambda x: x.strip() if isinstance(x, str) else x)
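
`DataFrame.applymap` is deprecated since pandas 2.1 in favour of `DataFrame.map`. A sketch that should behave the same on older and newer versions applies `Series.map` column by column; the helper name `strip_if_str` and the sample frame are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["  iPad ", "iMac"], "in_stock": [1, 0]})


def strip_if_str(value):
    """Strip surrounding whitespace from strings, leave other values alone."""
    return value.strip() if isinstance(value, str) else value


# Apply the helper to every cell, one column (Series) at a time.
df = df.apply(lambda col: col.map(strip_if_str))
```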

      ===============================================

### Remove Duplicate Rows
- ``` df.duplicated().sum()``` 
- ``` df.loc[df.duplicated()]``` 
- ``` df=df.drop_duplicates() ``` 
- ``` df=df.drop_duplicates(subset=["col"])```  --> remove rows that are duplicated in a specific column

In [177]:
pro.duplicated().sum() # hint: 8746 rows are exact duplicates of an earlier row

8746

In [178]:
pro.loc[pro.duplicated()]

Unnamed: 0,PRO_SKU,pro_Name,pro_Desc,pro_Price,pro_PromoPrice,pro_InStock,pro_Type
101,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
102,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
103,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
104,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
105,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
106,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
107,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
108,APP0390,"Apple MacBook Pro 133 ""Core i5 25GHz | 4GB RAM...",MacBook Pro laptop 133 inches (MD101Y / A).,1199,11.455.917,0,1282
110,PAC0508,Apple MacBook Pro 133 '' 25GHz | 16GB RAM | 1T...,Apple MacBook Pro Fusion Drive 16GB 2 internal...,1919,16.999.895,0,1282
111,PAC0508,Apple MacBook Pro 133 '' 25GHz | 16GB RAM | 1T...,Apple MacBook Pro Fusion Drive 16GB 2 internal...,1919,16.999.895,0,1282


In [179]:
pro=pro.drop_duplicates()

In [180]:
pro.shape

(10580, 7)

In [181]:
pro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10580 entries, 0 to 19325
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   PRO_SKU         10580 non-null  object
 1   pro_Name        10580 non-null  object
 2   pro_Desc        10573 non-null  object
 3   pro_Price       10534 non-null  object
 4   pro_PromoPrice  10580 non-null  object
 5   pro_InStock     10580 non-null  int64 
 6   pro_Type        10530 non-null  object
dtypes: int64(1), object(6)
memory usage: 661.2+ KB


      =======================

#### Remove duplicated rows related to Unique columns
- find possible duplicates ``` len(df.col.unique()) == df.shape[0] ``` 
- get the exact rows that repeat a value in the column ``` df.loc[df.duplicated(subset="col")]``` 
- find all rows with the same value ``` df[df.col=="val"]``` 

Check whether PRO_SKU is unique

In [182]:
len(pro.PRO_SKU.unique()) == pro.shape[0]

False

In [183]:
pro.loc[pro.duplicated(subset="PRO_SKU")]

Unnamed: 0,PRO_SKU,pro_Name,pro_Desc,pro_Price,pro_PromoPrice,pro_InStock,pro_Type
8000,APP1197,"Apple iMac 21.5 ""Core i5 31 GHz Retina display...",Desktop Apple iMac 21.5 inch i5 31 GHz Retina ...,,1305.59,0,1282


In [184]:
pro[pro.PRO_SKU=="APP1197"]
# drop_duplicates keeps the first occurrence by default; the second row only differs by its missing price, so dropping it is okay

Unnamed: 0,PRO_SKU,pro_Name,pro_Desc,pro_Price,pro_PromoPrice,pro_InStock,pro_Type
7992,APP1197,"Apple iMac 21.5 ""Core i5 31 GHz Retina display...",Desktop Apple iMac 21.5 inch i5 31 GHz Retina ...,1729.0,1305.59,0,1282
8000,APP1197,"Apple iMac 21.5 ""Core i5 31 GHz Retina display...",Desktop Apple iMac 21.5 inch i5 31 GHz Retina ...,,1305.59,0,1282


In [185]:
pro = pro.drop_duplicates(subset="PRO_SKU")
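
Here the complete row happens to come first. When that is not guaranteed, one option is to sort by the number of missing values before dropping, as in this sketch. The frame is a made-up sample in which the incomplete row is deliberately placed first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PRO_SKU": ["APP1197", "APP1197", "LAC0180"],
    "pro_Price": [np.nan, "1729.0", "104.99"],
})

# Count the missing values per row and sort so the most complete rows come first.
# mergesort is stable, so rows with equal counts keep their original order.
null_count = df.isna().sum(axis=1)
order = null_count.sort_values(kind="mergesort").index

# keep="first" (the default) now retains the most complete row per SKU.
deduped = df.loc[order].drop_duplicates(subset="PRO_SKU").sort_index()
```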

      ===============================================

### Drop duplicate/un-necessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [186]:
pro.describe() ## hint: all columns contain needed data. nothing to drop

Unnamed: 0,pro_InStock
count,10579.0
mean,0.194158
std,0.39557
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


    ==============================================================================================

## Compare to original DataFrame
``` df.compare(df2)```  --> works only if both dfs have identical row and column labels

In [None]:
#pro.compare(pro_original)
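
`DataFrame.compare` raises a ValueError when the two frames are not identically labeled, which is the case once rows have been dropped. A possible workaround, sketched on made-up frames, is to restrict both sides to the labels they share:

```python
import pandas as pd

original = pd.DataFrame({"name": [" iPad ", "iMac", "iMac"]}, index=[0, 1, 2])
cleaned = pd.DataFrame({"name": ["iPad", "iMac"]}, index=[0, 1])

# Restrict both frames to the rows and columns they share before comparing.
rows = original.index.intersection(cleaned.index)
cols = original.columns.intersection(cleaned.columns)
diff = original.loc[rows, cols].compare(cleaned.loc[rows, cols])
```

The result only lists the cells that differ, with a `self` and an `other` sub-column per compared column.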

    ==============================================================================================

## Export the cleaned DataFrame

``` df.to_pickle("file_name.pkl")``` 

In [187]:
pro.to_pickle("clean_Tables/Products_c.pkl")  # forward slash avoids the invalid "\P" escape and works on every OS
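
A slightly more defensive export, sketched with `pathlib`, creates the target folder first and reads the file back as a check. The temporary base directory and the one-row frame are assumptions for illustration; only the folder and file names follow the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"PRO_SKU": ["A1"], "pro_InStock": [True]})

# Hypothetical output location; a temporary directory keeps the example tidy.
out_dir = Path(tempfile.mkdtemp()) / "clean_Tables"
out_dir.mkdir(parents=True, exist_ok=True)

# Path objects join with "/" regardless of the operating system.
out_path = out_dir / "Products_c.pkl"
df.to_pickle(out_path)

# Reading the file back should reproduce the DataFrame, dtypes included.
restored = pd.read_pickle(out_path)
```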