<a href="https://colab.research.google.com/github/JacquelineBashta/Pandas_Eniac/blob/main/Project_2_Eniac.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cleaning Up orders.csv
#### orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state – the current status of the order:
  - “Shopping basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

## Importing the data
- glob.glob("file_pattern") --> list the files matching a pattern, to read multiple files
- pd.concat(dfs_list, ignore_index=True) --> create 1 df from multi dfs
- pd.read_csv(path) --> create 1 df from a csv file
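
The three calls above can be combined as in the following minimal sketch. The temporary folder, the file names `orders_part1.csv` / `orders_part2.csv` and the toy values are assumptions made only for illustration; they are not part of the Eniac data.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV files into a temporary folder
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False
)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False
)

# glob.glob returns the paths matching the pattern; sort them for a stable order
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# Read each file into its own DataFrame, then stack them into one
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)

print(combined.shape)
```

With `ignore_index=True` the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.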

In [192]:
import pandas as pd

url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
orders = pd.read_csv(path)



In [193]:
pd.set_option("display.max_rows", None) #Question : doesn't work!
pd.get_option("display.max_rows")
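
A possible explanation for the question above, offered as an assumption rather than a diagnosis: once the option is set to `None`, `pd.get_option` returns `None`, and a notebook shows no output for a bare `None`, so the cell looks as if nothing happened. The sketch below prints the value explicitly and also shows `pd.option_context` for a temporary setting.

```python
import pandas as pd

# Remember the value before changing anything
default_rows = pd.get_option("display.max_rows")

# None means "no limit": every row is displayed
pd.set_option("display.max_rows", None)
unlimited = pd.get_option("display.max_rows")
print(unlimited)  # explicit print, because a bare None produces no cell output

# Temporary override that only applies inside the with-block
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")

# Go back to the pandas default
pd.reset_option("display.max_rows")
restored = pd.get_option("display.max_rows")
```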


## Explore the data
- .shape, .size, .ndim
- .sample(5), .info()
- Numerical : .describe(), .nlargest(), .nsmallest()
- Category: .nunique(), .unique() 
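
As a quick illustration of the methods listed above, here is a minimal sketch on a made-up four-row frame (the `toy` data is hypothetical and only mimics the columns of orders.csv):

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as orders.csv
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "total_paid": [0.0, 42.99, 469.0, 65.99],
    "state": ["Shopping Basket", "Completed", "Completed", "Pending"],
})

# Size of the table: (rows, columns), number of cells, number of dimensions
dims = (toy.shape, toy.size, toy.ndim)

# Numerical column: summary statistics and the extreme rows
summary = toy["total_paid"].describe()
top2 = toy.nlargest(2, "total_paid")
low1 = toy.nsmallest(1, "total_paid")

# Categorical column: how many distinct values, and which ones
n_states = toy["state"].nunique()
states = toy["state"].unique()

print(dims, n_states, list(states))
```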

In [163]:
orders.shape

(226909, 4)

In [164]:
orders.sample(5)

        order_id         created_date  total_paid            state
179226    479619  2018-01-02 13:33:21       42.99  Shopping Basket
39200     338686  2017-03-24 12:53:02       469.0      Place Order
146907    447002  2017-11-27 08:03:38         0.0  Shopping Basket
35641     335125  2017-03-14 18:34:17       65.99  Shopping Basket
180272    480675  2018-01-03 10:42:06       80.99      Place Order


In [165]:
orders.info()  
## hint: there are 5 nulls in total_paid - fix required
## hint: created_date is of type object, has to be datetime - fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [166]:
orders.describe()
## hint: total_paid has a huge std, and min = 0 at some points! We should probably exclude rows with total_paid = 0 - fix required

           order_id   total_paid
count      226909.0     226904.0
mean   413296.48248   569.225818
std    65919.250331  1761.778002
min        241319.0          0.0
25%        356263.0        34.19
50%        413040.0       112.99
75%        470553.0       525.98
max        527401.0    214747.53


In [167]:
orders.nunique() # hint: order_id is unique per row 
# hint: state is category data

order_id        226909
created_date    224828
total_paid       31236
state                5
dtype: int64

In [168]:
orders.state.unique()
# hint: state has the values "Shopping Basket"/"Pending", which ideally should not be part of the analysis - fix required
# (if we care about actually sold products)


array(['Cancelled', 'Completed', 'Pending', 'Shopping Basket',
       'Place Order'], dtype=object)
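
If the analysis should only cover sold products, one way to apply the hint above is to filter on `state` with `.isin()`. This is a minimal sketch on hypothetical toy data; treating only "Completed" as sold is an assumption, not a decision taken in this notebook.

```python
import pandas as pd

# Hypothetical toy data using the five state values found above
toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "state": ["Cancelled", "Completed", "Pending", "Shopping Basket", "Place Order"],
})

# Assumption: only "Completed" orders count as actually sold
sold_states = ["Completed"]

# Keep the matching rows; .copy() avoids modifying a view of the original
sold = toy.loc[toy["state"].isin(sold_states)].copy()

print(sold)
```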

## Clean the data per csv
- Remember to create a copy of the df using .copy()

In [169]:
orders_c = orders.copy()

### Rename Columns, Set Index
 - df.columns  , df.index
 - df=df.rename(columns={"A": "a", "B": "c"})
 - df.columns = ["a", "b", "x"] --> replace all column names at once (one name per column)
 - df=df.set_index("col") , df=df.reset_index()
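
A minimal sketch of these calls on a hypothetical two-column frame (the column names `A`, `B`, `x`, `y` are made up for the example):

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# rename returns a new frame; only the listed columns change
renamed = toy.rename(columns={"A": "a", "B": "c"})

# Assigning to .columns replaces every name, so the list length must match
relabeled = toy.copy()
relabeled.columns = ["x", "y"]

# Move a column into the index, then bring it back as a regular column
indexed = renamed.set_index("a")
restored = indexed.reset_index()

print(list(renamed.columns), indexed.index.name, list(restored.columns))
```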

In [170]:
orders_c.columns #hint: column names look good

Index(['order_id', 'created_date', 'total_paid', 'state'], dtype='object')

In [171]:
orders_c.index #hint: no need to change index

RangeIndex(start=0, stop=226909, step=1)

### Remove Duplicate Rows
- df.duplicated().sum()
- df.loc[df.duplicated()==True]
- df=df.drop_duplicates()
- df=df.drop_duplicates(subset=["col"]) --> remove rows based on duplicates in a specific column
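
The difference between full-row duplicates and duplicates in one column can be seen in this minimal sketch (the toy data is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "state": ["Completed", "Completed", "Pending", "Cancelled"],
})

# A row counts as duplicated only if ALL its values repeat an earlier row
n_dupes = int(toy.duplicated().sum())
dupe_rows = toy.loc[toy.duplicated()]

# Drop full-row duplicates: the second (1, "Completed") row disappears
no_full_dupes = toy.drop_duplicates()

# Drop based on order_id only: the first row per order_id is kept
one_per_order = toy.drop_duplicates(subset=["order_id"])

print(n_dupes, len(no_full_dupes), len(one_per_order))
```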

In [172]:
orders_c.duplicated().sum() #hint : No duplicates

0

In [173]:
orders_c.loc[orders_c.duplicated()==True]

Empty DataFrame
Columns: [order_id, created_date, total_paid, state]
Index: []


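### Clean NaN and empty cells
- df.isna().sum()
- df = df.replace(r'^\s*$', np.nan, regex=True) --> replace empty cells and cells with only whitespace with NaN (without regex=True the pattern is treated as a literal string)
- df.col = df.col.fillna(value) --> fill NaN with a fixed value; df.col.ffill(limit=n) / df.col.bfill(limit=n) --> fill from the previous / next valid row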

- Extra: 
  - (df.values == '').sum() --> check if any cell is empty
  - df.col.str.isspace().sum() --> count the cells that contain only whitespace
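
The checks and fixes above, put together in a minimal sketch on hypothetical data. Note that the regex replacement only works as a pattern when `regex=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with an empty string, a whitespace-only string and NaNs
toy = pd.DataFrame({
    "name": ["a", "", "   ", "d"],
    "total_paid": [1.0, np.nan, 3.0, np.nan],
})

# Count empty cells and whitespace-only cells in the text column
empty_cells = int((toy["name"] == "").sum())
space_cells = int(toy["name"].str.isspace().sum())

# Turn empty / whitespace-only cells into NaN
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
nan_per_col = cleaned.isna().sum()

# Two ways to fill the numeric gaps: a fixed value, or the previous valid row
filled_zero = cleaned["total_paid"].fillna(0.0)
filled_prev = cleaned["total_paid"].ffill(limit=1)

print(nan_per_col)
```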

In [174]:
orders_c.isna().sum()

order_id        0
created_date    0
total_paid      5
state           0
dtype: int64

In [175]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [176]:
import numpy as np
# regex=True is required, otherwise the pattern is matched as a literal string
orders_c = orders_c.replace(r'^\s*$', np.nan, regex=True)


In [177]:
orders_c.isna().sum()

order_id        0
created_date    0
total_paid      5
state           0
dtype: int64

In [178]:
orders_c.total_paid.dtype

dtype('float64')

In [179]:
orders_c.total_paid=orders_c.total_paid.fillna(0.0)

In [180]:
orders_c.total_paid.dtype

dtype('float64')

In [181]:
orders_c.isna().sum()

order_id        0
created_date    0
total_paid      0
state           0
dtype: int64

### Fix DataTypes
- df.col.astype(type, errors="raise")
  - type = "int", "float", "bool", "category", "object", "datetime64[ns]", "timedelta64[ns]"
- for mixed data
  - pd.to_numeric(df.col, downcast=x, errors="raise")
    x = "integer", "signed", "unsigned" or "float"
  - pd.to_datetime(df.col, errors="raise")
  - pd.to_timedelta(df.col, errors="raise")
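
A minimal sketch of these conversions on hypothetical data. The `qty` column and its bad value `"x"` are made up to show the difference between `errors="raise"` and `errors="coerce"`.

```python
import pandas as pd

# Hypothetical raw data where everything was read as text
raw = pd.DataFrame({
    "qty": ["1", "2", "x"],
    "created_date": ["2017-03-24 12:53:02", "2018-01-02 13:33:21", "2018-01-03 10:42:06"],
    "state": ["Completed", "Pending", "Completed"],
})

# errors="raise" (the default) stops at the first value that cannot be parsed
try:
    pd.to_numeric(raw["qty"], errors="raise")
    raised = False
except ValueError:
    raised = True

# errors="coerce" turns unparseable values into NaN instead
qty = pd.to_numeric(raw["qty"], errors="coerce")

# Text timestamps to datetime64, repeated labels to category
created = pd.to_datetime(raw["created_date"], errors="raise")
state = raw["state"].astype("category")

# downcast picks the smallest integer type that can hold the values
small = pd.to_numeric(pd.Series([1, 2, 3]), downcast="integer")

print(raised, qty.isna().sum(), created.dtype, state.dtype, small.dtype)
```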

In [182]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226909 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [183]:
orders_c.created_date=pd.to_datetime(orders_c.created_date,errors="raise")

In [184]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226909 non-null  float64       
 3   state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [185]:
orders_c.state=orders_c.state.astype("category",errors="raise")

In [186]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226909 non-null  float64       
 3   state         226909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 5.4 MB


### Drop duplicate/unnecessary Columns
- df=df.drop(columns=["col1","col2"])
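
For completeness, a minimal sketch of dropping a column. The `note` column is hypothetical, since orders.csv itself has nothing to drop.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [1, 2],
    "total_paid": [10.0, 20.0],
    "note": ["x", "y"],  # hypothetical column we do not need
})

# drop returns a new frame; the original keeps all its columns
trimmed = toy.drop(columns=["note"])

# errors="ignore" skips names that are not present instead of raising KeyError
also_trimmed = toy.drop(columns=["note", "not_a_column"], errors="ignore")

print(list(trimmed.columns))
```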

In [187]:
orders_c.describe() ## hint: all columns contain needed data. nothing to drop

           order_id   total_paid
count      226909.0     226909.0
mean   413296.48248   569.213275
std    65919.250331  1761.760618
min        241319.0          0.0
25%        356263.0        34.19
50%        413040.0       112.99
75%        470553.0       525.97
max        527401.0    214747.53


## Re-Explore the data
Draw some hist() and boxplot() charts per column and take notes.
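
Plotting with hist() and boxplot() needs matplotlib. As a text-only stand-in, the sketch below computes the numbers those plots would show: the quartiles and 1.5 × IQR fences of a boxplot, and the bin counts of a histogram. The values are hypothetical and the 1.5 × IQR rule is the usual boxplot convention, not something decided in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical total_paid values with one very large order
values = pd.Series([0.0, 34.19, 112.99, 525.98, 80.99, 42.99, 5000.0], name="total_paid")

# Boxplot numbers: quartiles, interquartile range and the whisker fences
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are what a boxplot draws as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Histogram numbers: how many values fall into each of 5 equal-width bins
counts, bin_edges = np.histogram(values, bins=5)

print(outliers.tolist(), counts.tolist())
```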

In [159]:
orders_c.sample(10)  # Question: how can state be "Shopping Basket" while total_paid has a value?

        order_id         created_date  total_paid            state
153054    453267  2017-12-01 15:52:25         0.0      Place Order
207636    508127  2018-02-08 13:13:21     2511.59        Cancelled
22219     321699  2017-02-09 10:41:43       80.99  Shopping Basket
158350    458574  2017-12-09 21:38:25     2559.57  Shopping Basket
107130    406715  2017-09-30 17:03:58       57.79  Shopping Basket
56577     356113  2017-05-15 22:45:24       57.98      Place Order
74786     374332  2017-07-07 18:33:43       99.98        Completed
173457    473832  2017-12-27 19:26:21      270.04        Completed
190772    491252  2018-01-15 16:10:44      506.97        Completed
137866    437654  2017-11-24 09:33:37       83.19        Completed


In [196]:
orders_c.loc[orders_c.total_paid == 0.0].state.unique()

['Completed', 'Shopping Basket', 'Place Order', 'Pending', 'Cancelled']
Categories (5, object): ['Cancelled', 'Completed', 'Pending', 'Place Order', 'Shopping Basket']
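
Since zero totals show up in every state, a natural next step is to count them per state. This is a minimal sketch on hypothetical toy data, not a result from orders.csv.

```python
import pandas as pd

# Hypothetical toy data: some orders with total_paid == 0 in different states
toy = pd.DataFrame({
    "total_paid": [0.0, 0.0, 99.98, 0.0, 57.98],
    "state": ["Completed", "Shopping Basket", "Completed", "Completed", "Place Order"],
})

is_zero = toy["total_paid"] == 0.0

# How many zero-total orders each state has
zero_counts = toy.loc[is_zero, "state"].value_counts()

# Share of zero-total orders within each state (mean of a boolean = fraction True)
zero_share = is_zero.groupby(toy["state"]).mean()

print(zero_counts)
print(zero_share)
```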