
# Cleaning Up orders.csv (ord_)
orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state – the current status of the order, one of:
  - “Shopping basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

## Importing the data
- ``` glob.glob("file_pat") ``` --> list the paths of all files matching a pattern
- ``` pd.concat(dfs_list, ignore_index=True)```  --> create one df from multiple dfs
- ``` pd.read_csv(path)```  --> create one df from a csv file
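
The three calls above combine naturally when the data is split over several files. A minimal sketch, assuming made-up file names (`orders_part1.csv`, `orders_part2.csv`) and toy data written to a temporary folder:

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small CSV files into a temporary folder (hypothetical data).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False
)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False
)

# glob.glob returns every path matching the pattern; sort for a stable order.
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# Read each file into its own DataFrame, then stack them into one.
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)
print(combined)
```

`ignore_index=True` discards the per-file row labels and gives the combined df a fresh 0..n-1 index.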

In [445]:
import pandas as pd

url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
orders = pd.read_csv(path)
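
The cell above turns a Google Drive share link into a direct-download link by taking the second-to-last piece of the URL as the file id. The same step as a small helper, where the function name `drive_download_url` and the placeholder id are made up for illustration:

```python
def drive_download_url(share_url: str) -> str:
    """Build a direct-download URL from a '.../file/d/<id>/view' share link."""
    # In a share link the file id is the path segment right before "view".
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


# Placeholder id, not a real file.
share_url = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
print(drive_download_url(share_url))
```

This only works for links of that exact shape; a link in another format would need different parsing.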



In [446]:
#pd.options.display.max_rows = 999
pd.set_option("display.max_rows", 50) #Question : doesn't work!
pd.get_option("display.max_rows")


50
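
A possible alternative to setting the option globally is `pd.option_context`, which changes a display option only inside a `with` block. A minimal sketch on a toy df:

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Temporarily cap the display at 10 rows; the old setting is restored on exit.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(df)

after = pd.get_option("display.max_rows")
print(before, inside, after)
```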

## Explore the data
- ``` df.shape``` , ``` df.size``` , ``` df.ndim``` 
- ``` df.sample(5)``` , ``` df.info()``` 
- Numerical : ``` df.describe()``` , ``` df.nlargest()``` , ``` df.nsmallest()``` 
- Category : ``` df.nunique()``` , ``` df.col.unique() ``` (``` unique()``` is a Series method, so it needs a column)
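
A quick sketch of these exploration calls on a toy df (values invented, columns named after orders.csv):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "total_paid": [16.99, 68.22, 213.98, 0.0],
        "state": ["Shopping Basket", "Completed", "Shopping Basket", "Pending"],
    }
)

print(df.shape)                       # (number of rows, number of columns)
print(df.nunique())                   # count of distinct values per column
print(df["state"].unique())           # distinct values of one column
print(df.nlargest(2, "total_paid"))   # the 2 rows with the highest total_paid
print(df.nsmallest(1, "total_paid"))  # the row with the lowest total_paid
```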

In [447]:
orders.shape

(226909, 4)

In [448]:
orders.sample(5)

Unnamed: 0,order_id,created_date,total_paid,state
14657,314136,2017-01-24 09:42:39,16.99,Shopping Basket
133261,432915,2017-11-23 00:34:46,68.22,Completed
187471,487883,2018-01-10 17:23:07,213.98,Shopping Basket
12003,311478,2017-01-18 10:49:20,18.98,Place Order
41967,341456,2017-04-01 00:19:26,0.0,Shopping Basket


In [449]:
orders.info()  
## hint: there are 5 nulls in total_paid - TODO:fix required
## hint: created_date is of type object, has to be datetime - TODO:fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [450]:
orders.describe()
## hint: total_paid has a huge std and min = 0 at some points! We should probably exclude rows with total_paid = 0 - TODO: fix required

Unnamed: 0,order_id,total_paid
count,226909.0,226904.0
mean,413296.48248,569.225818
std,65919.250331,1761.778002
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.98
max,527401.0,214747.53


In [451]:
orders.nunique() # hint: order_id is unique per row 
# hint: state is category data

order_id        226909
created_date    224828
total_paid       31236
state                5
dtype: int64

In [452]:
orders.state.unique()
# hint: state has the values "Shopping Basket"/"Pending", which ideally should not be part of the analysis - TODO: fix required
# (if we care about actually sold products)


array(['Cancelled', 'Completed', 'Pending', 'Shopping Basket',
       'Place Order'], dtype=object)

## Clean the data per csv
- Remember to create a copy of the df using ``` df.copy()``` 

In [453]:
orders_c = orders.copy()

### Rename Columns, Set Index
 - ``` df.columns```   , ``` df.index``` 
 - ``` df=df.rename(columns={"A": "a", "B": "c"})``` 
 - ``` df.columns = ["a","b","x"]``` 
     - take care: the list is matched by position, so it must name every column in the exact existing order; it only changes labels and leaves the values (including NaN) untouched
 - ``` df=df.set_index("col")```  , ``` df=df.reset_index()``` 
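
A quick check of the two renaming styles on a toy df with one missing value (data invented). Both only relabel the columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "total_paid": [9.5, np.nan]})

# Option 1: rename by name; columns not listed in the dict are left alone.
renamed = df.rename(columns={"order_id": "ord_id", "total_paid": "ord_total_paid"})

# Option 2: assign a full list; it is matched by position, so order and length matter.
assigned = df.copy()
assigned.columns = ["ord_id", "ord_total_paid"]

# Count the missing values after each approach.
print(renamed.isna().sum())
print(assigned.isna().sum())
```

`rename` is usually the safer choice because a typo in the dict simply leaves that column unchanged, while a wrongly ordered list silently mislabels columns.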

In [454]:
orders_c.columns #hint: column names should have the prefix ord_

Index(['order_id', 'created_date', 'total_paid', 'state'], dtype='object')

In [455]:
#orders_c.columns=['ord_id', 'ord_created_date', 'ord_total_paid', 'ord_state'] 
## take care: assigning a list like this is positional, the names must follow the exact column order


In [456]:
orders_c=orders_c.rename(columns={"order_id": "ord_id", "created_date": "ord_created_date"
, "total_paid": "ord_total_paid", "state": "ord_state"})

In [457]:
orders_c.index #hint: no need to change index

RangeIndex(start=0, stop=226909, step=1)

### Remove Duplicate Rows
- ``` df.duplicated().sum()``` 
- ``` df.loc[df.duplicated()]``` 
- ``` df=df.drop_duplicates() ``` 
- ``` df=df.drop_duplicates(subset=["col"])```  --> remove rows based on duplicates in a specific column
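
A minimal sketch of the duplicate checks on a toy df (values invented), showing the difference between comparing whole rows and comparing one column:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 2],
        "state": ["Pending", "Pending", "Completed", "Cancelled"],
    }
)

# duplicated() flags rows that repeat an earlier row in every column.
print(df.duplicated().sum())
print(df.loc[df.duplicated()])

# Drop fully identical rows (the first occurrence is kept).
no_dupes = df.drop_duplicates()

# Keep only the first row per order_id, whatever the other columns contain.
one_per_order = df.drop_duplicates(subset=["order_id"])
print(one_per_order)
```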

In [458]:
orders_c.duplicated().sum() #hint : No duplicates

0

In [459]:
orders_c.loc[orders_c.duplicated()==True]

Unnamed: 0,ord_id,ord_created_date,ord_total_paid,ord_state


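### Clean NaN and empty cells
- ``` df.isna().sum()``` 
- ``` df = df.replace(r'^\s*$', np.nan, regex=True)```  --> replace empty cells and cells with only whitespace with NaN (without ``` regex=True``` the pattern is treated as a literal string)
- ``` df.col = df.col.fillna(value)```  --> fill NaN with a fixed value; ``` df.col.ffill(limit=n)``` / ``` df.col.bfill(limit=n)```  --> fill from the previous/next valid value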

- Extra: 
  - ``` (df.values == '').sum()```  --> count cells that are empty strings
  - ``` df.col.str.isspace().sum()```  --> count cells in a column that contain only whitespace
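
A small sketch of the checks and the replacement on a toy df (values invented). The `regex=True` flag is what makes pandas treat `^\s*$` as a pattern:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["Completed", "", "   ", "Pending"],
        "total_paid": [10.0, np.nan, 5.0, np.nan],
    }
)

# Count empty strings and whitespace-only strings separately.
n_empty = (df["state"] == "").sum()
n_space = df["state"].str.isspace().sum()
print(n_empty, n_space)

# Turn both kinds of "blank" cells into NaN, then fill the numeric column.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
cleaned["total_paid"] = cleaned["total_paid"].fillna(0.0)
print(cleaned.isna().sum())
```

Filling missing amounts with 0.0 mirrors what this notebook does; whether that is appropriate depends on the analysis, since a zero is then indistinguishable from a genuine zero payment.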

In [460]:
orders_c.sample(5)

Unnamed: 0,ord_id,ord_created_date,ord_total_paid,ord_state
225234,525726,2018-03-12 02:37:50,0.0,Shopping Basket
88401,387980,2017-08-09 19:37:28,177.8,Pending
195199,495682,2018-01-21 22:28:24,167.97,Completed
24106,323586,2017-02-13 19:01:12,53.98,Pending
175008,475396,2017-12-28 20:03:01,12.34,Pending


In [461]:
orders_c.isna().sum()

ord_id              0
ord_created_date    0
ord_total_paid      5
ord_state           0
dtype: int64

In [462]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ord_id            226909 non-null  int64  
 1   ord_created_date  226909 non-null  object 
 2   ord_total_paid    226904 non-null  float64
 3   ord_state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [463]:
import numpy as np
orders_c = orders_c.replace(r'^\s*$', np.nan, regex=True) # regex=True is needed, otherwise the pattern is matched as a literal string. Question: is it safe to do that?


In [464]:
orders_c.isna().sum()

ord_id              0
ord_created_date    0
ord_total_paid      5
ord_state           0
dtype: int64

In [465]:
orders_c.ord_total_paid.dtype

dtype('float64')

In [466]:
orders_c.ord_total_paid=orders_c.ord_total_paid.fillna(0.0)

In [467]:
orders_c.ord_total_paid.dtype

dtype('float64')

In [468]:
orders_c.isna().sum()

ord_id              0
ord_created_date    0
ord_total_paid      0
ord_state           0
dtype: int64

### Fix DataTypes
- ``` df.col.astype(type,errors="raise")``` 
  - type = "int","float","bool","category","object","datetime64[ns]","timedelta64[ns]"
- for mixed data
  - ``` pd.to_numeric(df.col, downcast=x, errors="raise") ```, where x = "integer" or "float"
  - ``` pd.to_datetime(df.col, errors="raise") ``` 
  - ``` pd.to_timedelta(df.col, errors="raise") ``` 
  - ``` errors="coerce"``` turns unparseable values into NaN/NaT instead of raising
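
A minimal sketch of these conversions on a toy df with deliberately dirty values (all data invented). It uses `errors="coerce"` so that bad entries become NaN/NaT rather than stopping the conversion:

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "total_paid": ["16.99", "68.22", "not a number"],
        "created_date": ["2017-01-24 09:42:39", "2017-11-23 00:34:46", "unknown"],
        "state": ["Completed", "Pending", "Completed"],
    }
)

# Unparseable entries become NaN (numbers) or NaT (dates) instead of raising.
raw["total_paid"] = pd.to_numeric(raw["total_paid"], errors="coerce")
raw["created_date"] = pd.to_datetime(raw["created_date"], errors="coerce")

# A column with few distinct values can be stored as a category.
raw["state"] = raw["state"].astype("category")

print(raw.dtypes)
print(raw.isna().sum())
```

With `errors="raise"` (the default, and what this notebook uses) the same calls would fail on the bad rows, which is useful when silent data loss is not acceptable.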

In [469]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ord_id            226909 non-null  int64  
 1   ord_created_date  226909 non-null  object 
 2   ord_total_paid    226909 non-null  float64
 3   ord_state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [470]:
orders_c.ord_created_date=pd.to_datetime(orders_c.ord_created_date,errors="raise")

In [471]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   ord_id            226909 non-null  int64         
 1   ord_created_date  226909 non-null  datetime64[ns]
 2   ord_total_paid    226909 non-null  float64       
 3   ord_state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [472]:
orders_c.ord_state=orders_c.ord_state.astype("category",errors="raise")

In [473]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   ord_id            226909 non-null  int64         
 1   ord_created_date  226909 non-null  datetime64[ns]
 2   ord_total_paid    226909 non-null  float64       
 3   ord_state         226909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 5.4 MB


### Drop duplicate/unnecessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [474]:
orders_c.describe() ## hint: all columns contain needed data. nothing to drop

Unnamed: 0,ord_id,ord_total_paid
count,226909.0,226909.0
mean,413296.48248,569.213275
std,65919.250331,1761.760618
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.97
max,527401.0,214747.53


## Re-Explore the data
draw some ``` df.col.hist()```  , ``` df.col.plot.box()```  per column (or ``` df.boxplot(column="col")``` on the whole df) and take notes

In [475]:
orders_c.sample(10)  # Question: how can state be "Shopping Basket" while ord_total_paid has a value?

Unnamed: 0,ord_id,ord_created_date,ord_total_paid,ord_state
8023,307493,2017-01-11 19:06:05,337.98,Place Order
119477,419065,2017-10-28 20:00:50,399.0,Place Order
97812,397396,2017-09-07 12:51:37,24.99,Shopping Basket
22095,321575,2017-02-09 00:10:53,52.99,Shopping Basket
125342,424955,2017-11-10 22:29:52,854.68,Shopping Basket
82102,381676,2017-07-25 09:16:38,48.97,Completed
67567,367113,2017-06-19 13:35:10,46.98,Place Order
73941,373487,2017-07-06 03:11:04,128.99,Shopping Basket
6942,306411,2017-01-10 12:01:44,9.0,Place Order
74159,373705,2017-07-06 15:00:43,0.0,Shopping Basket


In [476]:
orders_c.loc[orders_c.ord_total_paid == 0.0].ord_state.unique()

['Completed', 'Shopping Basket', 'Place Order', 'Pending', 'Cancelled']
Categories (5, object): ['Cancelled', 'Completed', 'Pending', 'Place Order', 'Shopping Basket']
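
The `unique()` call above only shows which states contain zero-value orders. A possible next step is to count them per state; a sketch on a toy df (values invented, column names as in `orders_c`):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ord_total_paid": [0.0, 0.0, 12.5, 0.0, 99.0],
        "ord_state": [
            "Shopping Basket",
            "Completed",
            "Completed",
            "Shopping Basket",
            "Pending",
        ],
    }
)

# Select the zero-value orders, then count how many fall into each state.
zero_counts = toy.loc[toy["ord_total_paid"] == 0.0, "ord_state"].value_counts()
print(zero_counts)
```

On the real data, a large count for "Completed" would be worth investigating, while zeros in "Shopping Basket" are less surprising.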