# Cleaning Up orders.csv (ord_)
orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state – the status of the order, one of:
  - “Shopping basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

## Importing the data
- ``` glob.glob("file_pat") ``` --> list the paths of all files that match a pattern
- ``` pd.concat(dfs_list, ignore_index=True)```  --> create one df from multiple dfs
- ``` pd.read_csv(path)```  --> create one df from a csv file
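
The three calls above can be combined as in the following minimal sketch. It assumes a made-up folder with two small CSV files; the file names and the `combined` variable are illustrative and are not part of the orders data.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV files into a temporary folder
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False)

# glob.glob only returns the matching paths; sorted() makes the order deterministic
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# read each file into its own DataFrame, then stack them into one
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)
```

With `ignore_index=True` the combined frame gets a fresh 0..n-1 index instead of repeating the index of each file.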

In [36]:
import pandas as pd

url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
df = pd.read_csv(path)



In [37]:
#pd.options.display.max_rows = 999
pd.set_option("display.min_rows", 0) 
pd.get_option("display.min_rows")

0

In [38]:
pd.set_option("display.max_rows", 200) 
pd.get_option("display.max_rows")

200

## Explore the data
- ``` df.shape``` , ``` df.size``` , ``` df.ndim``` 
- ``` df.sample(5)``` , ``` df.info()``` 
- Numerical : ``` df.describe()``` , ``` df.nlargest(n, "col")``` , ``` df.nsmallest(n, "col")``` 
- Category : ``` df.nunique()``` , ``` df.col.unique() ``` 
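
A short sketch of the numerical and categorical helpers listed above, run on a made-up frame (`demo` and its values are assumptions for illustration only). `value_counts()` is added as one more option for categorical columns.

```python
import pandas as pd

# Made-up rows that mimic the orders columns
demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [10.0, 250.0, 0.0, 99.9],
    "state": ["Completed", "Pending", "Completed", "Cancelled"],
})

top2 = demo.nlargest(2, "total_paid")      # rows with the 2 highest totals
low1 = demo.nsmallest(1, "total_paid")     # row with the lowest total
n_unique = demo.nunique()                  # distinct values per column (DataFrame method)
states = demo.state.unique()               # distinct values of one column (Series method)
state_counts = demo.state.value_counts()   # how often each category occurs
```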

In [39]:
df.shape

(226909, 4)

In [40]:
df.sample(20)

Unnamed: 0,order_id,created_date,total_paid,state
110482,410068,2017-10-08 20:06:27,0.0,Cancelled
125939,425552,2017-11-11 22:42:25,87.33,Shopping Basket
192573,493055,2018-01-17 16:42:15,0.0,Place Order
135405,435077,2017-11-23 16:36:19,629.99,Shopping Basket
28879,328359,2017-02-24 10:02:36,114.98,Shopping Basket
98546,398130,2017-09-09 12:24:56,70.78,Completed
115026,414612,2017-10-19 19:03:26,74.99,Shopping Basket
10089,309562,2017-01-14 23:03:56,55.99,Shopping Basket
67312,366858,2017-06-18 22:23:52,75.52,Shopping Basket
99446,399030,2017-09-13 13:41:25,2133.59,Cancelled


In [41]:
df.info()  
## hint: there are 5 nulls in total_paid - fix required
## hint: created_date is of type object, has to be datetime - fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [42]:
df.describe()
## hint: total_paid has a huge std and min = 0 for some orders; we should probably exclude rows with total_paid = 0 - fix required

Unnamed: 0,order_id,total_paid
count,226909.0,226904.0
mean,413296.48248,569.225818
std,65919.250331,1761.778002
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.98
max,527401.0,214747.53


In [43]:
df.nunique() # hint: order_id is unique per row 
# hint: state is category data

order_id        226909
created_date    224828
total_paid       31236
state                5
dtype: int64

In [69]:
(df
 .state
 .unique())


# hint: state has the values "Shopping Basket"/"Pending", which ideally should not be part of the analysis - fix required
# (if we care about actually sold products)


array(['Cancelled', 'Completed', 'Pending', 'Shopping Basket',
       'Place Order'], dtype=object)

## Clean the data per csv
- Remember to create a copy of the df using ``` df.copy()``` 

In [45]:
df_c = df.copy()

### Rename Columns , Set Index
 - ``` df.columns```   , ``` df.index``` 
 - ``` df=df.rename(columns={"A": "a", "B": "c"})``` 
 - ``` df.columns = ["a","b","x"]``` 
     - take care: names are assigned by position, so the list needs exactly one name per column, in the existing column order (the data itself, including NaN, is not changed)
 - ``` df=df.set_index("col")```  , ``` df=df.reset_index()``` 
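
The renaming and indexing options above, sketched on a made-up two-column frame (the `demo` frame and the `ord_` names used here are illustrative assumptions):

```python
import pandas as pd

demo = pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, None]})

# rename by mapping: only the listed columns change, their order does not matter
renamed = demo.rename(columns={"order_id": "ord_id", "total_paid": "ord_total_paid"})

# positional assignment: the list must hold one name per column, in column order
positional = demo.copy()
positional.columns = ["ord_id", "ord_total_paid"]

# move a column into the index, then turn it back into a regular column
indexed = renamed.set_index("ord_id")
restored = indexed.reset_index()
```

In this toy frame the missing value is still NaN after either way of renaming.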

In [46]:
df_c.columns # hint: column names should have the prefix ord_

Index(['order_id', 'created_date', 'total_paid', 'state'], dtype='object')

In [47]:
#df_c.columns=['ord_id', 'ord_created_date', 'ord_total_paid', 'ord_state'] 
## take care: this assigns names by position, so the list must match the number and order of the columns


In [70]:
df_c=(
    df_c
    .rename(
        columns={"order_id": "ord_ID"
                , "created_date": "ord_CreatDate"
                , "total_paid": "ord_TotalPaid"
                , "state": "ord_State"})
)

In [71]:
df_c.index #hint: no need to change index

RangeIndex(start=0, stop=226909, step=1)

### Remove Duplicate Rows
- ``` df.duplicated().sum()``` 
- ``` df.loc[df.duplicated()==True]``` 
- ``` df=df.drop_duplicates() ``` 
- ``` df=df.drop_duplicates(subset=["col"])```  --> remove rows based on duplicates in a specific column
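
A minimal sketch of these duplicate checks on made-up rows (the `demo` frame is an assumption for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "state": ["Completed", "Completed", "Pending", "Pending"],
})

n_dupes = demo.duplicated().sum()                  # identical rows after their first occurrence
dupe_rows = demo.loc[demo.duplicated()]            # show those rows
deduped = demo.drop_duplicates()                   # keep the first of each identical row
by_state = demo.drop_duplicates(subset=["state"])  # keep the first row per state value
```

Note that `subset=` can drop rows that differ in other columns, so it is worth checking which rows survive before overwriting the frame.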

In [50]:
df_c.duplicated().sum() #hint : No duplicates

0

In [51]:
df_c.loc[df_c.duplicated()==True]

Unnamed: 0,ord_ID,ord_CreatDate,ord_TotalPaid,ord_State


### Clean NaN and empty cells
- ``` df.isna().sum()``` 
- ``` df = df.replace(r'^\s*$', np.nan, regex=True)```  --> replace empty cells and cells with only whitespace with NaN (without ``` regex=True``` the pattern is matched as a literal string)
- ``` df.col=df.col.fillna(value)```  or ``` df.col=df.col.ffill(limit=n)```  / ``` df.col=df.col.bfill(limit=n)``` 

- Extra: 
  - ``` (df.values == '').sum()```  --> count the cells that are empty strings
  - ``` df.col.str.isspace().sum()```  --> count the cells that contain only whitespace
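
The checks and fills above can be sketched as follows, assuming a made-up frame with a text column `note` (hypothetical, not in orders.csv) and a numeric column with gaps:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "note": ["ok", "", "   ", "x"],
    "total_paid": [10.0, np.nan, 30.0, np.nan],
})

n_empty = (demo.note == "").sum()         # exactly empty strings
n_blank = demo.note.str.isspace().sum()   # non-empty strings made only of whitespace

# regex=True makes pandas treat the pattern as a regular expression
cleaned = demo.replace(r"^\s*$", np.nan, regex=True)

filled_zero = cleaned.total_paid.fillna(0.0)    # fill gaps with a constant
filled_fwd = cleaned.total_paid.ffill(limit=1)  # carry the previous value forward, at most 1 row
```

Which fill is appropriate depends on the column: a constant 0 changes the mean of `total_paid`, while a forward fill only makes sense when neighbouring rows are related.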

In [52]:
df_c.sample(5)

Unnamed: 0,ord_ID,ord_CreatDate,ord_TotalPaid,ord_State
202786,503272,2018-01-31 21:36:27,1948.59,Shopping Basket
211215,511706,2018-02-14 19:42:16,0.0,Shopping Basket
115416,415002,2017-10-20 11:21:09,19.99,Shopping Basket
81538,381112,2017-07-23 23:10:58,34.0,Shopping Basket
107432,407017,2017-10-01 22:33:45,399.0,Shopping Basket


In [53]:
df_c.isna().sum()

ord_ID           0
ord_CreatDate    0
ord_TotalPaid    5
ord_State        0
dtype: int64

In [54]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ord_ID         226909 non-null  int64  
 1   ord_CreatDate  226909 non-null  object 
 2   ord_TotalPaid  226904 non-null  float64
 3   ord_State      226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [55]:
import numpy as np
# regex=True is required, otherwise the pattern is matched as a literal string
df_c = df_c.replace(r'^\s*$', np.nan, regex=True) # Question: is it safe to do that?


In [56]:
df_c.isna().sum()

ord_ID           0
ord_CreatDate    0
ord_TotalPaid    5
ord_State        0
dtype: int64

In [57]:
df_c.ord_TotalPaid.dtype

dtype('float64')

In [58]:
df_c.ord_TotalPaid=df_c.ord_TotalPaid.fillna(0.0)

In [59]:
df_c.ord_TotalPaid.dtype

dtype('float64')

In [60]:
df_c.isna().sum()

ord_ID           0
ord_CreatDate    0
ord_TotalPaid    0
ord_State        0
dtype: int64

### Fix DataTypes
TODO: update to use assign
- ``` df.col.astype(type,errors="raise")``` 
  - type = "int","float","bool","category","object","datetime64[ns]","timedelta64[ns]"
- for mixed data
  - ``` pd.to_numeric(df.col, downcast=x,errors="raise") ``` 
  x = "integer", "signed", "unsigned" or "float"
  - ``` pd.to_datetime(df.col, errors="raise") ``` 
  - ``` pd.to_timedelta(df.col, errors="raise") ``` 
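
One possible way to apply these conversions in a single step with `assign`, sketched on made-up rows. The `delay` column is hypothetical and only there to show `pd.to_timedelta`.

```python
import pandas as pd

demo = pd.DataFrame({
    "created_date": ["2017-01-02 13:02:55", "2017-12-24 00:59:46"],
    "total_paid": ["300.94", "120.99"],
    "state": ["Completed", "Completed"],
    "delay": ["1 days", "2 days"],   # hypothetical column
})

# assign returns a new frame, so the original demo stays untouched
typed = demo.assign(
    created_date=pd.to_datetime(demo.created_date, errors="raise"),
    total_paid=pd.to_numeric(demo.total_paid, errors="raise"),
    state=demo.state.astype("category"),
    delay=pd.to_timedelta(demo.delay, errors="raise"),
)
```

With `errors="raise"` a value that cannot be parsed stops the conversion with an exception, which is useful for spotting corrupted cells early.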

In [61]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ord_ID         226909 non-null  int64  
 1   ord_CreatDate  226909 non-null  object 
 2   ord_TotalPaid  226909 non-null  float64
 3   ord_State      226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [62]:
df_c.ord_CreatDate=pd.to_datetime(df_c.ord_CreatDate,errors="raise")

In [63]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   ord_ID         226909 non-null  int64         
 1   ord_CreatDate  226909 non-null  datetime64[ns]
 2   ord_TotalPaid  226909 non-null  float64       
 3   ord_State      226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [64]:
df_c.ord_State = df_c.ord_State.astype("category",errors="raise")

In [65]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   ord_ID         226909 non-null  int64         
 1   ord_CreatDate  226909 non-null  datetime64[ns]
 2   ord_TotalPaid  226909 non-null  float64       
 3   ord_State      226909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 5.4 MB


### Remove Outliers ??
TODO: add details
Remove unnecessary values of specific columns, together with the rows that contain them.
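
The notebook does not settle on an outlier rule yet. As a placeholder, here is a sketch of one common option, the 1.5 × IQR rule, on made-up values. The rule and the multiplier are assumptions, not a decision taken for this dataset.

```python
import pandas as pd

demo = pd.DataFrame({"total_paid": [10.0, 12.0, 11.0, 13.0, 12.5, 5000.0]})

q1 = demo.total_paid.quantile(0.25)
q3 = demo.total_paid.quantile(0.75)
iqr = q3 - q1

# keep rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR]; 1.5 is the conventional multiplier
mask = demo.total_paid.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = demo.loc[mask]
```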

### Drop Duplicate/Unnecessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [66]:
df_c.describe() ## hint: all columns contain needed data. nothing to drop

Unnamed: 0,ord_ID,ord_TotalPaid
count,226909.0,226909.0
mean,413296.48248,569.213275
std,65919.250331,1761.760618
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.97
max,527401.0,214747.53


## Re-Explore the data
Draw a ``` df.col.hist()```  and a ``` df.boxplot(column="col")```  per column and take notes.

In [67]:
df_c.sample(10)  # Question: how can state be "Shopping Basket" while ord_TotalPaid has a value?

Unnamed: 0,ord_ID,ord_CreatDate,ord_TotalPaid,ord_State
213132,513623,2018-02-18 07:28:32,620.0,Shopping Basket
148787,448920,2017-11-27 17:40:19,5.09,Shopping Basket
127949,427563,2017-11-16 12:30:26,415.44,Shopping Basket
170181,470553,2017-12-24 00:59:46,120.99,Completed
1103,300495,2017-01-02 13:02:55,300.94,Completed
28723,328203,2017-02-23 22:31:40,96.98,Place Order
138901,438696,2017-11-24 12:10:58,31.42,Shopping Basket
112239,411825,2017-10-12 16:56:39,3.99,Shopping Basket
72270,371816,2017-07-03 12:25:52,56.0,Shopping Basket
119939,419527,2017-10-30 01:13:02,399.0,Shopping Basket


In [68]:
df_c.loc[df_c.ord_TotalPaid == 0.0].ord_State.unique()

['Completed', 'Shopping Basket', 'Place Order', 'Pending', 'Cancelled']
Categories (5, object): ['Cancelled', 'Completed', 'Pending', 'Place Order', 'Shopping Basket']
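
Since zero-value orders show up in every state, a possible next step is to quantify them per state. The sketch below uses made-up rows with the renamed columns; `is_zero`, `zero_share` and `zero_counts` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({
    "ord_TotalPaid": [0.0, 87.33, 0.0, 70.78, 0.0],
    "ord_State": ["Cancelled", "Shopping Basket", "Place Order", "Completed", "Completed"],
})

# share of zero-value orders within each state
zero_share = (
    demo
    .assign(is_zero=demo.ord_TotalPaid == 0.0)
    .groupby("ord_State")["is_zero"]
    .mean()
)

# absolute number of zero-value orders per state
zero_counts = demo.loc[demo.ord_TotalPaid == 0.0, "ord_State"].value_counts()
```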