# orderlines.csv (ol_)
Each row represents one product within an order.
- id – a unique identifier for each row in this file
- id_order – corresponds to orders.order_id
- product_id – an old identifier for each product, nowadays not in use
- product_quantity – how many units of that product were purchased on that order
- sku – stock keeping unit: a unique identifier for each product
- unit_price – the unitary price (in euros) of each product at the moment of placing that order
- date – timestamp for the processing of that product

    ==============================================================================================

## Importing the data
- ``` glob.glob("file_pat") ``` --> list the paths of all files matching a pattern
- ``` pd.concat(dfs_list, ignore_index=True)```  --> create one df from multiple dfs
- ``` pd.read_csv(path)```  --> create one df from a csv file
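
The three calls above combine naturally when a table is split over several files. A minimal sketch, assuming hypothetical files named `part_1.csv` and `part_2.csv` that are created in a temporary folder just for the demonstration:

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with two small CSV files (hypothetical data).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "sku": ["A", "B"]}).to_csv(
    os.path.join(tmp_dir, "part_1.csv"), index=False
)
pd.DataFrame({"id": [3], "sku": ["C"]}).to_csv(
    os.path.join(tmp_dir, "part_2.csv"), index=False
)

# glob.glob only returns matching paths (in arbitrary order), so sort them.
paths = sorted(glob.glob(os.path.join(tmp_dir, "part_*.csv")))

# Read each file, then stack the pieces into one DataFrame with a fresh index.
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)
print(combined)
```

`ignore_index=True` discards the per-file row labels, so the combined frame gets a single continuous index.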

In [114]:
import pandas as pd
import numpy as np

pd.set_option("display.min_rows", 0) 
pd.set_option("display.max_rows", 30) 
pd.__version__

'1.4.4'

In [115]:
url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
ol = pd.read_csv(path)
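
The cell above turns a Google Drive share link into a direct-download link by taking the file id from the second-to-last path segment. A small sketch of the same string manipulation, wrapped in a helper whose name (`drive_download_url`) is made up for this illustration; it assumes the link has the `/file/d/<id>/view` shape used here and makes no network call:

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Drive share link of the form .../file/d/<id>/view?... into a download link."""
    # The file id is the second-to-last segment of the URL path.
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing"
direct = drive_download_url(url)
print(direct)
```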

      ===============================================

## Rename Columns , Set Index
- Rules: 
    - add a 2-3 character prefix to all columns
    - change the column names to CamelCase
    - shorten them as much as possible
    - the unique column shall have an ALL_CAPS name
 - ``` df.columns```   , ``` df.index``` 
 - ``` df=df.rename(columns={"A": "a", "B": "c"})``` 
 - ``` df.columns = ["a", "b", "x"]``` 
     - take care, renaming the columns like that will convert the NAN to some value!!
 - ``` df=df.set_index("col")```  , ``` df=df.reset_index()``` 
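
A short sketch of the calls listed above on a toy frame (the data is invented for illustration). It shows that `rename` returns a new DataFrame and leaves the original untouched, and that `set_index` followed by `reset_index` round-trips:

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 11], "unit_price": ["1.99", "2.49"]})

# rename() maps old names to new ones and returns a new DataFrame.
renamed = df.rename(columns={"id": "OL_ID", "unit_price": "ol_ProdUntPr"})

# set_index() moves a column into the index; reset_index() moves it back.
indexed = renamed.set_index("OL_ID")
restored = indexed.reset_index()

print(list(df.columns))       # original names are unchanged
print(list(renamed.columns))
print(indexed.index.name)
```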

In [116]:
ol.columns

Index(['id', 'id_order', 'product_id', 'product_quantity', 'sku', 'unit_price',
       'date'],
      dtype='object')

In [117]:
ol=ol.rename(columns={"id": "OL_ID"
                        , "id_order": "ol_ORD_ID"
                        , "product_id": "ol_ProdId"
                        , "product_quantity": "ol_ProdQnty"
                        , "sku": "ol_Sku"
                        , "unit_price": "ol_ProdUntPr"
                        , "date": "ol_ProcessDate"})

In [118]:
ol.columns

Index(['OL_ID', 'ol_ORD_ID', 'ol_ProdId', 'ol_ProdQnty', 'ol_Sku',
       'ol_ProdUntPr', 'ol_ProcessDate'],
      dtype='object')

In [119]:
ol.index #hint: no need to change index

RangeIndex(start=0, stop=293983, step=1)

    ==============================================================================================

## Explore the data
- ``` df.shape``` , ``` df.size``` , ``` df.ndim``` 
- ``` df.sample(5)``` , ``` df.info()``` 
- Numerical : ``` df.describe()``` , ``` df.col.nlargest()``` , ``` df.col.nsmallest()``` 
- Category : ``` df.nunique()``` , ``` df.col.unique() ``` 

In [120]:
ol.shape

(293983, 7)

In [121]:
ol.sample(10)
## unit_price has numbers like 3.121.59 TODO: fix it to proper float

Unnamed: 0,OL_ID,ol_ORD_ID,ol_ProdId,ol_ProdQnty,ol_Sku,ol_ProdUntPr,ol_ProcessDate
264372,1602949,506997,0,1,TUC0350,11.99,2018-02-06 18:46:23
180793,1467161,451532,0,1,APP1216,130.0,2017-11-29 08:24:10
286437,1639313,522848,0,1,PAC2482,498.18,2018-03-07 16:42:42
249188,1579011,497916,0,1,APP1644,626.0,2018-01-24 20:48:37
178521,1464727,450810,0,1,APP0666,319.99,2017-11-28 18:58:42
218423,1527150,477303,0,1,APP1215,106.82,2017-12-30 13:01:59
249111,1578882,467391,0,1,ELA0038,12.99,2018-01-24 19:11:58
113858,1346972,397399,0,4,OWC0085-2,52.99,2017-09-07 12:55:47
185334,1473572,454066,0,1,BEL0371,34.99,2017-12-02 23:25:56
29611,1183583,325434,0,1,TRK0011,49.99,2017-02-16 20:52:13


In [122]:
ol.info()  
## hint: there are no nulls 
## hint: date  is of type object, has to be datetime - TODO: fix required
## hint: unit_price is of type object, has to be float - TODO: fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   OL_ID           293983 non-null  int64 
 1   ol_ORD_ID       293983 non-null  int64 
 2   ol_ProdId       293983 non-null  int64 
 3   ol_ProdQnty     293983 non-null  int64 
 4   ol_Sku          293983 non-null  object
 5   ol_ProdUntPr    293983 non-null  object
 6   ol_ProcessDate  293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB
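
One possible way to approach the two dtype TODOs noted in this cell, sketched on invented values. It ASSUMES that in a price such as `3.121.59` every dot except the last is a thousands separator; that assumption is not confirmed by the data, so the sketch first flags the affected rows for review. The helper name `drop_extra_dots` and the new column names are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "ol_ProdUntPr": ["11.99", "3.121.59", "130.0", "1.002.90"],
    "ol_ProcessDate": ["2018-02-06 18:46:23", "2017-11-29 08:24:10",
                       "2018-03-07 16:42:42", "2018-01-24 20:48:37"],
})

# Flag prices with more than one dot so they can be inspected before fixing.
multi_dot = raw["ol_ProdUntPr"].str.count(r"\.") > 1


def drop_extra_dots(text):
    # ASSUMPTION: only the last dot is the decimal point; earlier dots are removed.
    head, sep, tail = text.rpartition(".")
    return head.replace(".", "") + sep + tail


# Convert the cleaned strings to floats; unparseable values become NaN.
raw["price_float"] = pd.to_numeric(
    raw["ol_ProdUntPr"].map(drop_extra_dots), errors="coerce"
)

# Parse the timestamps; unparseable values become NaT.
raw["date_parsed"] = pd.to_datetime(raw["ol_ProcessDate"], errors="coerce")

print(raw.dtypes)
```

If the thousands-separator assumption turns out to be wrong for this dataset, the `multi_dot` mask can instead be used to drop or quarantine those rows.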


In [123]:
ol.describe()
# hint: ol_ProdId is always 0 and should be removed - TODO: fix required
# hint: ol_ProdQnty: min = 3rd quartile = 1, but max = 999!
    # orders with unusually large quantities probably need separate handling - TODO: fix required

Unnamed: 0,OL_ID,ol_ORD_ID,ol_ProdId,ol_ProdQnty
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0


In [124]:
ol.nunique() # hint: id is unique per row 
# hint: product_id is useless, shall be removed - TODO: fix required

OL_ID             293983
ol_ORD_ID         204855
ol_ProdId              1
ol_ProdQnty           67
ol_Sku              7951
ol_ProdUntPr       11329
ol_ProcessDate    251631
dtype: int64

In [125]:
ol.ol_ProdId.unique()
#hint: useless data


array([0], dtype=int64)

In [126]:
ol.ol_ProdUntPr.sort_values(ascending=False).tail(40) 
# There is a negative value (-119.00) in ol_ProdUntPr - TODO: fix required
# Note: the column is still of type object, so this sort compares strings, not numbers

225400       0.00
225401       0.00
276547       0.00
276548       0.00
67456        0.00
199139       0.00
280903       0.00
67563        0.00
280907       0.00
67854        0.00
265161       0.00
256505       0.00
280906       0.00
246443       0.00
256504       0.00
           ...   
67585        0.00
67626        0.00
212228       0.00
280904       0.00
67644        0.00
67661        0.00
67682        0.00
67701        0.00
212227       0.00
67717        0.00
265154       0.00
67746        0.00
265156       0.00
261851       0.00
77008     -119.00
Name: ol_ProdUntPr, Length: 40, dtype: object
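
Because the price column is still of type object at this point, `sort_values` compares the values as strings. A small sketch on invented prices showing how the text order differs from the numeric order, and how a numeric conversion makes negative values easy to find:

```python
import pandas as pd

prices_as_text = pd.Series(["9.99", "100.00", "-119.00", "0.00"])

# String sort: compared character by character, so "9.99" ranks above "100.00".
text_order = prices_as_text.sort_values(ascending=False).tolist()

# Numeric sort: convert first, then sort.
prices_as_number = pd.to_numeric(prices_as_text)
numeric_order = prices_as_number.sort_values(ascending=False).tolist()

# With numbers, negative prices can be selected directly.
negative_prices = prices_as_number[prices_as_number < 0].tolist()

print(text_order)
print(numeric_order)
print(negative_prices)
```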

In [127]:
ol.loc[(ol.ol_ProdUntPr == "0.00")].ol_ProdQnty.sum()
# 946 units were ordered at price 0.00 (946/293983 = 0.3% when compared to the number of rows) - TODO: check more about it

946
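
The figure above is a sum of quantities, while 293983 is a row count. A sketch on invented rows that keeps the two measures apart, computing the share of zero-price rows and the share of zero-price units separately:

```python
import pandas as pd

lines = pd.DataFrame({
    "ol_ProdQnty": [1, 2, 1, 4],
    "ol_ProdUntPr": ["0.00", "11.99", "0.00", "52.99"],
})

# The price column is text here, so compare against the string "0.00".
is_free = lines["ol_ProdUntPr"] == "0.00"

free_rows = int(is_free.sum())
free_units = int(lines.loc[is_free, "ol_ProdQnty"].sum())

row_share = free_rows / len(lines)                              # share of rows
unit_share = free_units / int(lines["ol_ProdQnty"].sum())       # share of units

print(free_rows, free_units, row_share, unit_share)
```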

    ==============================================================================================

## Initial Clean
- Remember to create a copy of the df using ``` df.copy()``` 

In [128]:
ol_original = ol.copy()

      ===============================================

### Strip whitespaces
- ``` df.applymap(lambda x: x.strip() if isinstance(x, str) else x)```

In [129]:
ol = ol.applymap(lambda x: x.strip() if isinstance(x, str) else x)
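
`DataFrame.applymap` works in the pandas version used here (1.4.4) but was deprecated in pandas 2.1 in favor of `DataFrame.map`. A sketch of an equivalent that relies only on `Series.map`, which exists in both old and new versions; the helper name `strip_if_str` and the sample data are made up:

```python
import pandas as pd


def strip_if_str(value):
    # Only strings have surrounding whitespace to remove; leave other values alone.
    return value.strip() if isinstance(value, str) else value


df = pd.DataFrame({"sku": [" APP1216", "TUC0350 "], "qty": [1, 4]})

# Apply the helper element-wise, one column at a time.
stripped = df.apply(lambda column: column.map(strip_if_str))

print(stripped["sku"].tolist())
```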

      ===============================================
### Remove Duplicate Rows
- ``` df.duplicated().sum()``` 
- ``` df.loc[df.duplicated()]``` 
- ``` df=df.drop_duplicates() ``` 
- ``` df=df.drop_duplicates(subset=["col"])```  --> remove rows that are duplicated in a specific column
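
A sketch of the duplicate-handling calls on a toy frame (invented data), showing the difference between full-row duplicates and duplicates within one column:

```python
import pandas as pd

df = pd.DataFrame({"order": [1, 1, 2, 2], "sku": ["A", "A", "B", "C"]})

# duplicated() marks every repeat of an earlier, fully identical row.
n_full_dupes = int(df.duplicated().sum())
dupe_rows = df.loc[df.duplicated()]

# Drop fully identical rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Keep only the first row per value of "order".
one_per_order = df.drop_duplicates(subset=["order"])

print(n_full_dupes)
print(deduped)
print(one_per_order)
```

Note that `subset` deduplication discards rows that differ in other columns, so it is only appropriate when the chosen column is meant to be unique.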

In [130]:
ol.duplicated().sum() #hint : No duplicates

0

      =======================

#### Remove duplicated rows related to Unique columns
- find possible duplicates ``` len(df.col.unique()) == df.shape[0] ``` 
- get the exact rows that repeat a value in the unique column ``` df.loc[df.duplicated(subset="col")]``` 
- find all rows with the same value ``` df[df.col=="val"]``` 
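
A sketch of the uniqueness check on invented data. It uses `Series.is_unique`, which gives the same answer as comparing the number of unique values with the row count, and `keep=False` so that every row sharing a repeated id is listed, not only the later ones:

```python
import pandas as pd

df = pd.DataFrame({"OL_ID": [100, 101, 101, 102], "ol_Sku": ["A", "B", "C", "D"]})

# Two equivalent ways to test whether the id column is unique.
same_length = len(df["OL_ID"].unique()) == df.shape[0]
id_is_unique = df["OL_ID"].is_unique

# keep=False flags all rows that share an id, including the first occurrence.
repeated = df.loc[df.duplicated(subset="OL_ID", keep=False)]

print(id_is_unique)
print(repeated)
```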

Check if OL_ID is unique

In [131]:
len(ol.OL_ID.unique()) == ol.shape[0]

True

      ===============================================

### Drop duplicate/un-necessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [132]:
ol.describe() ## hint: ol_ProdId is always 0 and carries no information; all other columns contain needed data

Unnamed: 0,OL_ID,ol_ORD_ID,ol_ProdId,ol_ProdQnty
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0


Drop ol_ProdId

In [133]:
ol = ol.drop(columns=["ol_ProdId"])
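
Columns like this one, which hold a single value in every row, can also be found programmatically instead of by reading the `nunique()` output. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "OL_ID": [1, 2, 3],
    "ol_ProdId": [0, 0, 0],
    "ol_Sku": ["A", "B", "A"],
})

# A column with exactly one distinct value is constant.
distinct_counts = df.nunique()
constant_cols = distinct_counts[distinct_counts == 1].index.tolist()

trimmed = df.drop(columns=constant_cols)

print(constant_cols)
print(list(trimmed.columns))
```

`nunique()` ignores missing values by default, so a column with one value plus NaNs would also be reported; whether to drop such a column is a separate decision.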

    ==============================================================================================

## Compare to original DataFrame
``` df.compare(df2)``` 

In [134]:
#ol.compare(ol_original)
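
`df.compare` only accepts two DataFrames with identical row and column labels, so it raises an error once a column has been dropped from one of them. One way around this, sketched on invented data, is to compare only the columns both frames still share:

```python
import pandas as pd

original = pd.DataFrame({
    "ol_Sku": [" APP1216", "TUC0350"],
    "ol_ProdQnty": [1, 4],
    "ol_ProdId": [0, 0],
})

# Simulate the cleaning steps: drop a column and strip whitespace.
cleaned = original.drop(columns=["ol_ProdId"])
cleaned["ol_Sku"] = cleaned["ol_Sku"].str.strip()

# Restrict both frames to the shared columns, in the same order.
shared = [col for col in original.columns if col in cleaned.columns]
diff = cleaned[shared].compare(original[shared])

# "self" is the cleaned value, "other" is the original value.
print(diff)
```

Only cells that differ appear in the result, so an empty `diff` means the shared columns are unchanged.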

    ==============================================================================================

## Export the cleaned DataFrame

``` df.to_pickle("file_name.pkl")``` 

In [135]:
ol.to_pickle("clean_Tables/OrderLines_c.pkl")  # forward slash avoids the invalid "\O" escape and works on every OS
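
`to_pickle` does not create missing folders, so the export fails if the target directory does not exist yet. A sketch that builds the path with `pathlib`, creates the folder, and reads the file back to confirm the round trip; it writes to a temporary directory and uses invented data so it can run anywhere:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"OL_ID": [1, 2], "ol_Sku": ["A", "B"]})

# Build the output path in an OS-independent way and make sure the folder exists.
out_dir = Path(tempfile.mkdtemp()) / "clean_Tables"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "OrderLines_c.pkl"

df.to_pickle(out_file)

# Reading the pickle back restores values and dtypes.
reloaded = pd.read_pickle(out_file)
print(reloaded.equals(df))
```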