# Cleaning Up orders.csv (ord_)
orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state – the current status of the order:
  - “Shopping Basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

## Importing the data
- ``` glob.glob("file_pat") ``` --> list the paths of all files matching a pattern
- ``` pd.concat(dfs_list, ignore_index=True) ``` --> create 1 df from multiple dfs
- ``` pd.read_csv(path) ``` --> create 1 df from a csv file
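The three calls above are typically combined when the data is split across several files. The following is a minimal sketch of that pattern, assuming two small hypothetical CSV files (`orders_part1.csv`, `orders_part2.csv`) that are written to a temporary folder only so the example can run on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two small CSV files in a temporary folder (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False
)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False
)

# glob.glob only returns the matching paths; sorted() makes the order deterministic.
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# Read each file into its own DataFrame, then stack them into one.
dfs_list = [pd.read_csv(p) for p in paths]
orders_all = pd.concat(dfs_list, ignore_index=True)

print(orders_all)
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.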

In [54]:
import pandas as pd

url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
df = pd.read_csv(path)



In [55]:
#pd.options.display.max_rows = 999
pd.set_option("display.min_rows", 0) 
pd.get_option("display.min_rows")


0

In [56]:
pd.set_option("display.max_rows", 200) 
pd.get_option("display.max_rows")

200

## Explore the data
- ``` df.shape ``` , ``` df.size ``` , ``` df.ndim ```
- ``` df.sample(5) ``` , ``` df.info() ```
- Numerical : ``` df.describe() ``` , ``` df.nlargest(n, "col") ``` , ``` df.nsmallest(n, "col") ```
- Category : ``` df.nunique() ``` , ``` df.col.unique() ```
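As a quick illustration of the calls listed above, here is a small sketch on a hypothetical four-row frame (`toy` and its values are made up for the example, not taken from orders.csv).

```python
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column.
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [10.0, 250.0, 0.0, 99.5],
    "state": ["Completed", "Pending", "Completed", "Cancelled"],
})

n_rows, n_cols = toy.shape              # (rows, columns)
n_cells = toy.size                      # rows * columns

top2 = toy.nlargest(2, "total_paid")    # needs n and a column name
low1 = toy.nsmallest(1, "total_paid")

distinct_per_col = toy.nunique()        # works on the whole DataFrame
states = toy["state"].unique()          # unique() is a Series method

print(n_rows, n_cols, n_cells)
print(top2)
print(distinct_per_col)
```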

In [3]:
df.shape

(226909, 4)

In [52]:
df.sample(200)

Unnamed: 0,order_id,created_date,total_paid,state
215882,516373,2018-02-23 10:03:16,1190.78,Shopping Basket
178475,478867,2018-01-01 22:23:34,24.98,Completed
197516,498000,2018-01-24 22:38:52,2403.58,Shopping Basket
75564,375110,2017-07-10 07:03:16,229.00,Shopping Basket
133719,433380,2017-11-23 07:07:21,2531.59,Shopping Basket
104748,404333,2017-09-25 12:36:54,154.98,Place Order
65022,364567,2017-06-11 22:58:03,144.97,Shopping Basket
47318,346811,2017-04-17 21:32:28,22.49,Shopping Basket
205941,506429,2018-02-05 19:04:05,1008.98,Place Order
6623,306092,2017-01-09 22:00:10,22.98,Completed


In [5]:
df.info()  
## hint: there are 5 nulls in total_paid - fix required
## hint: created_date is of type object, has to be datetime - fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [6]:
df.describe()
## hint: total_paid has a huge std and a min of 0; rows with total_paid = 0 should probably be excluded - fix required

Unnamed: 0,order_id,total_paid
count,226909.0,226904.0
mean,413296.48248,569.225818
std,65919.250331,1761.778002
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.98
max,527401.0,214747.53


In [7]:
df.nunique() # hint: order_id is unique per row 
# hint: state is category data

order_id        226909
created_date    224828
total_paid       31236
state                5
dtype: int64

In [8]:
df.state.unique()
# hint: state has the values "Shopping Basket"/"Pending", which ideally should not be part of the analysis - fix required
# (if we only care about products that were actually sold)


array(['Cancelled', 'Completed', 'Pending', 'Shopping Basket',
       'Place Order'], dtype=object)
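One way to act on the hint about unsold orders is to filter on the state column with `isin`. This is only a sketch on hypothetical toy data, and it assumes that "Completed" is the only state that counts as sold; the notebook itself does not decide this.

```python
import pandas as pd

# Hypothetical toy data, not taken from orders.csv.
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "state": ["Completed", "Shopping Basket", "Pending", "Completed"],
})

# Assumption: only "Completed" counts as sold; adjust the list as needed.
sold_states = ["Completed"]
sold = toy.loc[toy["state"].isin(sold_states)]

print(sold)
```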

## Clean the data per csv
- Remember to create a copy of the df using ``` df.copy()``` 
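A short sketch of why the copy matters, using a hypothetical two-row frame: plain assignment only creates a second name for the same object, while `copy()` creates an independent frame that can be cleaned without touching the raw data.

```python
import pandas as pd

# Hypothetical raw data.
raw = pd.DataFrame({"total_paid": [10.0, 20.0]})

alias = raw            # same object, just a second name
working = raw.copy()   # independent copy (deep by default)

working.loc[0, "total_paid"] = 0.0   # edit the copy only

print(raw.loc[0, "total_paid"])      # the raw frame is unchanged
```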

In [9]:
df_c = df.copy()

### Rename Columns, Set Index
 - ``` df.columns ``` , ``` df.index ```
 - ``` df=df.rename(columns={"A": "a", "B": "c"}) ```
 - ``` df.columns = ["a", "b", "x"] ```
     - take care: this assigns names by position, so the list must name every column, in the right order; cell values (including NaN) are not changed
 - ``` df=df.set_index("col") ``` , ``` df=df.reset_index() ```
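The two renaming styles can be compared side by side. The sketch below uses a hypothetical two-column frame with one NaN to show that both give the same column names and that neither touches the cell values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({"order_id": [1, 2], "total_paid": [9.99, np.nan]})

# Option 1: rename by name; columns not mentioned keep their names.
by_name = toy.rename(columns={"order_id": "ord_ID", "total_paid": "ord_TotalPaid"})

# Option 2: positional assignment; the list must cover every column, in order.
by_position = toy.copy()
by_position.columns = ["ord_ID", "ord_TotalPaid"]

# set_index moves a column into the index; reset_index moves it back.
indexed = by_name.set_index("ord_ID")
restored = indexed.reset_index()

print(by_position.isna().sum())   # the NaN is still there
```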

In [10]:
df_c.columns # hint: column names should carry the prefix ord_

Index(['order_id', 'created_date', 'total_paid', 'state'], dtype='object')

In [11]:
#df_c.columns=['ord_id', 'ord_created_date', 'ord_total_paid', 'ord_state'] 
## take care: this assigns names by position, so the list must match the column order exactly


In [12]:
df_c=df_c.rename(columns={"order_id": "ord_ID"
                        , "created_date": "ord_CreatDate"
                        , "total_paid": "ord_TotalPaid"
                        , "state": "ord_State"})

In [13]:
df_c.index #hint: no need to change index

RangeIndex(start=0, stop=226909, step=1)

### Remove Duplicate Rows
- ``` df.duplicated().sum() ```
- ``` df.loc[df.duplicated()] ```
- ``` df=df.drop_duplicates() ```
- ``` df=df.drop_duplicates(subset=["col"]) ``` --> remove rows that are duplicated in a specific column
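The difference between full-row duplicates and duplicates in a single column is easy to see on a small example. This sketch uses a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, rows 2 and 3 share an order_id.
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "state": ["Pending", "Pending", "Pending", "Completed"],
})

n_full_dupes = toy.duplicated().sum()      # rows identical in every column
dupe_rows = toy.loc[toy.duplicated()]      # the mask is already boolean
no_full_dupes = toy.drop_duplicates()      # keeps the first row of each identical group
one_per_order = toy.drop_duplicates(subset=["order_id"])  # first row per order_id

print(n_full_dupes)
print(one_per_order)
```

By default both calls keep the first occurrence; `keep="last"` or `keep=False` change that.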

In [14]:
df_c.duplicated().sum() #hint : No duplicates

0

In [15]:
df_c.loc[df_c.duplicated()]

Unnamed: 0,ord_ID,ord_CreatDate,ord_TotalPaid,ord_State


### Clean NaN and empty cells
- ``` df.isna().sum() ```
- ``` df = df.replace(r'^\s*$', np.nan, regex=True) ``` --> replace empty cells and cells containing only whitespace with NaN (without ``` regex=True ``` the pattern is treated as a literal string)
- ``` df.col = df.col.fillna(value) ``` , ``` df.col = df.col.ffill(limit=n) ``` , ``` df.col = df.col.bfill(limit=n) ```

- Extra: 
  - ``` (df.values == '').sum() ``` --> count the cells that are empty strings
  - ``` df.col.str.isspace().sum() ``` --> count the cells in a column that contain only whitespace
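The checks and fixes above can be tried on a small frame. This is a sketch on hypothetical toy data with one empty string, one whitespace-only string and two NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "state": ["Completed", "", "   ", "Pending"],
    "total_paid": [10.0, np.nan, 5.0, np.nan],
})

n_empty = (toy["state"] == "").sum()          # exactly empty strings
n_blank = toy["state"].str.isspace().sum()    # whitespace-only (non-empty) strings

# regex=True is what makes the pattern a pattern rather than a literal string.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

filled_const = cleaned["total_paid"].fillna(0.0)    # constant fill
filled_fwd = cleaned["total_paid"].ffill(limit=1)   # carry the last valid value forward

print(cleaned.isna().sum())
```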

In [16]:
df_c.sample(5)

Unnamed: 0,ord_ID,ord_CreatDate,ord_TotalPaid,ord_State
79135,378683,2017-07-17 19:02:53,44.97,Completed
112764,412350,2017-10-14 08:14:01,277.79,Pending
179547,479940,2018-01-02 17:00:43,139.99,Shopping Basket
216888,517379,2018-02-25 23:12:31,68.07,Completed
70623,370169,2017-06-29 00:52:29,0.0,Shopping Basket


In [17]:
df_c.isna().sum()

ord_ID           0
ord_CreatDate    0
ord_TotalPaid    5
ord_State        0
dtype: int64

In [18]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ord_ID         226909 non-null  int64  
 1   ord_CreatDate  226909 non-null  object 
 2   ord_TotalPaid  226904 non-null  float64
 3   ord_State      226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [19]:
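import numpy as np
# regex=True is required; without it the pattern is matched as a literal string and nothing is replaced
df_c = df_c.replace(r'^\s*$', np.nan, regex=True) # Question: is it safe to do that?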


In [20]:
df_c.isna().sum()

ord_ID           0
ord_CreatDate    0
ord_TotalPaid    5
ord_State        0
dtype: int64

In [21]:
df_c.ord_TotalPaid.dtype

dtype('float64')

In [22]:
df_c.ord_TotalPaid=df_c.ord_TotalPaid.fillna(0.0)

In [23]:
df_c.ord_TotalPaid.dtype

dtype('float64')

In [24]:
df_c.isna().sum()

ord_ID           0
ord_CreatDate    0
ord_TotalPaid    0
ord_State        0
dtype: int64

### Fix DataTypes
- ``` df.col.astype(type, errors="raise") ```
  - type = "int", "float", "bool", "category", "object", "datetime64[ns]", "timedelta64[ns]"
- for mixed data
  - ``` pd.to_numeric(df.col, downcast=x, errors="raise") ``` where x = "integer", "signed", "unsigned" or "float"
  - ``` pd.to_datetime(df.col, format=None, errors="raise") ```
  - ``` pd.to_timedelta(df.col, unit=None, errors="raise") ```
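To show how the converters behave on messy input, here is a sketch on hypothetical toy data where every column starts out as text. The `"n/a"` value and the explicit `format` string are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: all three columns are read as strings.
toy = pd.DataFrame({
    "total_paid": ["10.5", "n/a", "7"],
    "created_date": ["2017-01-09 22:00:10", "2018-02-23 10:03:16", "2017-07-10 07:03:16"],
    "state": ["Completed", "Pending", "Completed"],
})

# errors="coerce" turns unparseable values into NaN instead of raising.
toy["total_paid"] = pd.to_numeric(toy["total_paid"], errors="coerce")

# An explicit format avoids format guessing; errors="raise" fails loudly on bad values.
toy["created_date"] = pd.to_datetime(
    toy["created_date"], format="%Y-%m-%d %H:%M:%S", errors="raise"
)

# A column with few distinct values is a good fit for the category dtype.
toy["state"] = toy["state"].astype("category")

print(toy.dtypes)
```

`errors="raise"` is the safer default while exploring, because it surfaces bad values instead of silently turning them into NaN.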

In [25]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ord_ID         226909 non-null  int64  
 1   ord_CreatDate  226909 non-null  object 
 2   ord_TotalPaid  226909 non-null  float64
 3   ord_State      226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [26]:
df_c.ord_CreatDate=pd.to_datetime(df_c.ord_CreatDate,errors="raise")

In [27]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   ord_ID         226909 non-null  int64         
 1   ord_CreatDate  226909 non-null  datetime64[ns]
 2   ord_TotalPaid  226909 non-null  float64       
 3   ord_State      226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [28]:
df_c.ord_State = df_c.ord_State.astype("category",errors="raise")

In [29]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   ord_ID         226909 non-null  int64         
 1   ord_CreatDate  226909 non-null  datetime64[ns]
 2   ord_TotalPaid  226909 non-null  float64       
 3   ord_State      226909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 5.4 MB


### Drop duplicate/unnecessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [30]:
df_c.describe() ## hint: all columns contain needed data. nothing to drop

Unnamed: 0,ord_ID,ord_TotalPaid
count,226909.0,226909.0
mean,413296.48248,569.213275
std,65919.250331,1761.760618
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.97
max,527401.0,214747.53


## Re-Explore the data
Draw a ``` df.col.hist() ``` and a ``` df.boxplot(column="col") ``` per numerical column and take notes.

In [31]:
df_c.sample(10)  # Question: how can ord_State be "Shopping Basket" while ord_TotalPaid has a value?

Unnamed: 0,ord_ID,ord_CreatDate,ord_TotalPaid,ord_State
192721,493203,2018-01-17 20:58:39,171.97,Place Order
11087,310560,2017-01-16 17:12:37,3856.48,Place Order
113375,412961,2017-10-16 10:34:44,23.98,Completed
112339,411925,2017-10-13 00:06:38,2872.59,Cancelled
223360,523852,2018-03-09 17:12:54,958.97,Pending
41627,341116,2017-03-31 10:02:53,1598.99,Cancelled
50162,349656,2017-04-25 16:44:48,0.0,Shopping Basket
189009,489488,2018-01-12 13:23:15,34.98,Completed
82037,381611,2017-07-25 00:07:05,355.22,Shopping Basket
125942,425555,2017-11-11 22:57:36,266.93,Completed


In [32]:
df_c.loc[df_c.ord_TotalPaid == 0.0].ord_State.unique()

['Completed', 'Shopping Basket', 'Place Order', 'Pending', 'Cancelled']
Categories (5, object): ['Cancelled', 'Completed', 'Pending', 'Place Order', 'Shopping Basket']
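Since zero totals occur in every state, a natural follow-up is to count them per state. This is a sketch on hypothetical toy data that reuses the renamed column names; it makes no claim about the real distribution in orders.csv.

```python
import pandas as pd

# Hypothetical toy data using the renamed columns.
toy = pd.DataFrame({
    "ord_TotalPaid": [0.0, 25.0, 0.0, 99.0, 0.0],
    "ord_State": ["Completed", "Completed", "Shopping Basket", "Pending", "Shopping Basket"],
})

is_zero = toy["ord_TotalPaid"] == 0.0

# Number of zero-total orders in each state.
zero_counts = toy.loc[is_zero, "ord_State"].value_counts()

# Share of zero-total orders within each state (mean of a 0/1 flag).
zero_share = is_zero.astype(float).groupby(toy["ord_State"]).mean()

print(zero_counts)
print(zero_share)
```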