# Cleaning Up orders.csv (ord_)
orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state – the current status of the order, one of:
  - “Shopping basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

    ==============================================================================================
    
## Importing the data
- ``` glob.glob("file_pattern") ``` --> list the paths of all files matching a pattern (to read multiple files)
- ``` pd.concat(dfs_list, ignore_index=True)```  --> create one df from multiple dfs
- ``` pd.read_csv(path)```  --> create one df from a csv file
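
The three calls above can be combined as in the minimal sketch below. The temporary folder, the file names `orders_part1.csv` / `orders_part2.csv` and the toy rows are assumptions made only for illustration; they are not part of the real data set.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small CSV files into a temporary folder (illustrative data only).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False)

# glob.glob returns every path matching the pattern; sort for a stable order.
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# Read each file into its own DataFrame, then stack them into one.
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)
print(combined)
```

`ignore_index=True` discards the per-file row labels, so the combined frame gets a fresh 0..n-1 index.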

In [None]:
import pandas as pd
import numpy as np

url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
df = pd.read_csv(path)



In [None]:
pd.set_option("display.min_rows", 0) 
pd.set_option("display.max_rows", 30) 

    ==============================================================================================
    
## Explore the data
- ``` df.shape``` , ``` df.size``` , ``` df.ndim``` 
- ``` df.sample(5)``` , ``` df.info()``` 
- Numerical : ``` df.describe()``` , ``` df.nlargest(n, "col")``` , ``` df.nsmallest(n, "col")``` 
- Category : ``` df.nunique()``` , ``` df.col.unique() ``` 
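
A small sketch of the numerical and categorical helpers, run on a made-up frame (`toy` and its values are hypothetical, chosen only to mimic the orders columns). Note that `unique()` is a Series method, so it is called on a single column.

```python
import pandas as pd

# Hypothetical toy data shaped like orders.csv.
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [15.0, 0.0, 230.5, 15.0],
    "state": ["Completed", "Pending", "Completed", "Cancelled"],
})

n_unique = toy.nunique()              # number of distinct values per column
states = toy.state.unique()           # distinct values of one column
top2 = toy.nlargest(2, "total_paid")  # the two rows with the highest total_paid

print(n_unique)
print(states)
print(top2)
```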

In [None]:
df.shape

In [None]:
df.sample(20)

In [None]:
df.info()  
## hint: there are 5 nulls in total_paid - fix required
## hint: created_date is of type object, has to be datetime - fix required

In [None]:
df.describe()
## hint: total_paid has a huge std and min = 0 for some rows; we should probably exclude rows with total_paid = 0 - fix required

In [None]:
df.nunique() # hint: order_id is unique per row 
# hint: state is category data

In [None]:
(df
 .state
 .unique())


# hint: state has the values "Shopping Basket"/"Pending", which ideally should not be part of the analysis - fix required
# (if we care about products that were actually sold)


    ==============================================================================================
    
## Clean the data per csv
- Remember to create a copy of the df using ``` df.copy()``` 

In [None]:
df_c = df.copy()

### Rename Columns , Set Index
 - ``` df.columns```   , ``` df.index``` 
 - ``` df=df.rename(columns={"A": "a", "B": "c"})``` 
 - ``` df.columns = ["a","b","x"]``` 
     - take care: the names are assigned by position, so the list needs one name per existing column, in the existing order (the cell values, including NaN, are not changed)
 - ``` df=df.set_index("col")```  , ``` df=df.reset_index()``` 
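
A minimal sketch of renaming and re-indexing, using a hypothetical two-column frame (`toy`, `A`, `B` are placeholder names, not columns of orders.csv):

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [3.0, None]})

# rename returns a new DataFrame; the original keeps its column names.
renamed = toy.rename(columns={"A": "a", "B": "c"})

# set_index moves column "a" into the index; reset_index moves it back.
indexed = renamed.set_index("a")
restored = indexed.reset_index()

print(renamed.columns.tolist())
print(indexed.index.name)
print(restored.columns.tolist())
```

Because none of these methods modify the frame in place by default, the result has to be assigned back, as in `df = df.rename(...)`.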

In [None]:
df_c.columns # hint: column names should carry the prefix ord_

In [None]:
#df_c.columns=['ord_id', 'ord_created_date', 'ord_total_paid', 'ord_state'] 
## take care: this assigns names by position, so the list must match the existing column order exactly


In [None]:
df_c=(
    df_c
    .rename(
        columns={"order_id": "ord_ID"
                , "created_date": "ord_CreatDate"
                , "total_paid": "ord_TotalPaid"
                , "state": "ord_State"})
)

In [None]:
df_c.index #hint: no need to change index

### Remove Duplicates Rows
- ``` df.duplicated().sum()``` 
- ``` df.loc[df.duplicated()==True]``` 
- ``` df=df.drop_duplicates() ``` 
- ``` df=df.drop_duplicates(subset=["col"])```  --> remove rows based on duplicates in a specific column
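
The difference between dropping fully duplicated rows and dropping on a column subset can be sketched as follows; the frame `toy` is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "state": ["Pending", "Pending", "Pending", "Completed"],
})

# Rows that are identical to an earlier row in every column.
n_full_dupes = toy.duplicated().sum()

# Drops only the exact repeat of order 1.
no_full_dupes = toy.drop_duplicates()

# Keeps the first row seen for each order_id, whatever the other columns hold.
one_per_order = toy.drop_duplicates(subset=["order_id"])

print(n_full_dupes)
print(no_full_dupes)
print(one_per_order)
```

With `subset`, rows that differ in other columns are still removed, so it is worth checking which row should survive (`keep="first"` is the default).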

In [None]:
df_c.duplicated().sum() #hint : No duplicates

In [None]:
df_c.loc[df_c.duplicated()==True]

### Clean NAN and empty cells
- ``` df.isna().sum()``` 
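- ``` df = df.replace(r'^\s*$', np.nan, regex=True)```  --> replace empty cells and cells with only whitespace with NaN (without ``` regex=True```  the pattern is treated as a literal string and nothing matches)
- ``` df.col=df.col.fillna(value)```  --> fill NaN with a fixed value; to propagate neighbouring values instead, use ``` df.col.ffill(limit=n)```  or ``` df.col.bfill(limit=n)```  (``` fillna(method=...)```  is deprecated in recent pandas versions)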

- Extra: 
  - ``` (df.values == '').sum()```  --> count the cells that are empty strings
  - ``` df.col.str.isspace().sum()```  --> count the cells of a column that contain only whitespace
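
A short sketch of blank-cell cleanup and NaN filling on a hypothetical frame (`toy`, `name` and `paid` are placeholder names). It also shows why `regex=True` matters: without it, `replace` looks for a cell whose text is literally `^\s*$`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "", "   ", "d"],
    "paid": [1.0, np.nan, 3.0, np.nan],
})

# No regex=True: the pattern is compared as a literal string, so nothing changes.
literal = toy.replace(r"^\s*$", np.nan)

# regex=True: empty and whitespace-only strings become NaN.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Two ways to fill the numeric gaps.
filled_const = cleaned.paid.fillna(0.0)   # constant value
filled_fwd = cleaned.paid.ffill(limit=1)  # previous value, at most one step forward

print(literal.name.isna().sum(), cleaned.name.isna().sum())
print(filled_const.tolist())
print(filled_fwd.tolist())
```

The regex only touches string cells, so the numeric `paid` column is left as it is by the `replace` call.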

In [None]:
df_c.sample(5)

In [None]:
df_c.isna().sum()

In [None]:
df_c.info()

In [None]:
import numpy as np
df_c = df_c.replace(r'^\s*$', np.nan, regex=True) # Question: is it safe to do that?


In [None]:
df_c.isna().sum()

In [None]:
df_c.ord_TotalPaid.dtype

In [None]:
df_c.ord_TotalPaid=df_c.ord_TotalPaid.fillna(0.0)

In [None]:
df_c.ord_TotalPaid.dtype

In [None]:
df_c.isna().sum()

### Fix DataTypes
TODO: update to use assign
- ``` df.col.astype(type,errors="raise")``` 
  - type = "int","float","bool","category","object","datetime64[ns]","timedelta64[ns]"
- for mixed data
  - ``` pd.to_numeric(df.col, downcast=x,errors="raise") ``` , x = "integer" or "float"
  - ``` pd.to_datetime(df.col, errors="raise") ``` 
  - ``` pd.to_timedelta(df.col, errors="raise") ``` 
  - ``` errors="coerce"```  turns unparseable values into NaN/NaT instead of raising
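
The converters above can be tried on a made-up frame with deliberately dirty values. The column names and values are illustrative assumptions; `errors="coerce"` is used here so that bad cells become NaN/NaT rather than stopping the conversion.

```python
import pandas as pd

toy = pd.DataFrame({
    "paid": ["10.5", "n/a", "3"],
    "created": ["2017-01-02 13:35:40", "not a date", "2017-03-04 00:00:00"],
    "state": ["Completed", "Pending", "Completed"],
})

paid = pd.to_numeric(toy.paid, errors="coerce")          # "n/a" becomes NaN
created = pd.to_datetime(toy.created, errors="coerce")   # "not a date" becomes NaT
state = toy.state.astype("category")                     # repeated labels stored once

print(paid)
print(created)
print(state.cat.categories.tolist())
```

With `errors="raise"` (the default) the same calls would fail on the bad cells, which is useful when silent data loss is not acceptable.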

In [None]:
df_c.info()

In [None]:
df_c.ord_CreatDate=pd.to_datetime(df_c.ord_CreatDate,errors="raise")

In [None]:
df_c.info()

In [None]:
df_c.ord_State = df_c.ord_State.astype("category",errors="raise")

In [None]:
df_c.info()

### Analyze Outliers

Remove outlying values of specific columns (that is, drop the rows that contain them):
- ``` Q1   = df.col.quantile(0.25) ``` 
- ``` Q3   = df.col.quantile(0.75) ``` 
- ``` IQR = Q3 - Q1 ``` 
- ``` df   = df.loc[(df.col >= (Q1 - 1.5*IQR)) & (df.col <= (Q3 + 1.5*IQR))] ``` 
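
The IQR rule above, worked through on a tiny made-up column (the values are hypothetical and chosen so that one of them is an obvious outlier):

```python
import pandas as pd

toy = pd.DataFrame({"total_paid": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})

q1 = toy.total_paid.quantile(0.25)
q3 = toy.total_paid.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends, matching the >= / <= filter.
inliers = toy.loc[toy.total_paid.between(lower, upper)]

print(q1, q3, iqr, lower, upper)
print(inliers)
```

Whether 1.5 is the right multiplier depends on the data; for heavily skewed amounts such as order totals it can remove many legitimate large orders, so it is worth inspecting what gets dropped.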

In [None]:
df_c.describe()

### Drop duplicate/unnecessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [None]:
df_c.describe() ## hint: all columns contain needed data. nothing to drop

    ==============================================================================================
    
## Re-Explore the data
Draw a ``` df.col.hist()```  and a ``` df.boxplot(column="col")```  per column, and take notes.

In [None]:
df_c.sample(10)  # Question: how can state be "Shopping Basket" while ord_TotalPaid has a value?

In [None]:
df_c.loc[df_c.ord_TotalPaid == 0.0].ord_State.unique()