<a href="https://colab.research.google.com/github/JacquelineBashta/Pandas_Eniac/blob/main/Project_2_Eniac.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cleaning up orders.csv (ord_)
#### orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state 
  - “Shopping basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

## Importing the data
- glob.glob("file_pat") --> list the files matching a pattern, to read multiple files
- pd.concat(dfs_list, ignore_index=True) --> create 1 df from multi dfs
- pd.read_csv(path) --> create 1 df from a csv file
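
The three calls above can be combined as in the following sketch. The file names and the temporary folder are hypothetical stand-ins, created here only so the example runs on its own; the real project reads a single file from Google Drive.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny CSV files in a temporary folder (hypothetical stand-ins for real exports).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False
)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False
)

# glob.glob returns the paths matching the pattern; sort them for a stable order.
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# Read each file into its own DataFrame, then stack them into one.
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)
print(combined)
```

With ignore_index=True the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.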

In [258]:
import pandas as pd

url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
orders = pd.read_csv(path)



In [296]:
#pd.options.display.max_rows = 999
pd.set_option("display.max_rows", 50) #Question : doesn't work!
pd.get_option("display.max_rows")


50
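
A possible explanation for the question above: the option is set (get_option returns 50), but display.max_rows is only the threshold for truncation. As far as I know, once a frame has more rows than that, pandas prints only display.min_rows rows (10 by default), so the output can look unchanged. A minimal sketch on a made-up 100-row frame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# option_context changes the options only inside the with-block.
with pd.option_context("display.max_rows", 50, "display.min_rows", 20):
    inside = pd.get_option("display.max_rows")
    text = repr(df)  # 100 rows > max_rows, so the repr is truncated to min_rows rows

n_lines = len(text.splitlines())
print(inside, n_lines)
```

Raising display.min_rows together with display.max_rows is one way to see more rows of a long frame.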

## Explore the data
- .shape, .size, .ndim
- .sample(5), .info()
- Numerical : .describe(), .nlargest(), .nsmallest()
- Category: .nunique(), .unique() 

In [260]:
orders.shape

(226909, 4)

In [261]:
orders.sample(5)

Unnamed: 0,order_id,created_date,total_paid,state
20571,320051,2017-02-05 10:00:51,73.98,Shopping Basket
177353,477745,2017-12-30 23:46:53,98.49,Pending
25696,325176,2017-02-16 12:26:40,0.0,Shopping Basket
67473,367019,2017-06-19 11:17:42,655.18,Pending
7740,307209,2017-01-11 12:19:08,321.98,Completed


In [262]:
orders.info()  
## hint: there are 5 nulls in total_paid - fix required
## hint: created_date is of type object, has to be datetime - fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [263]:
orders.describe()
## hint: total_paid has a huge std and min = 0 at some points; we should probably exclude rows with total_paid = 0 - fix required

Unnamed: 0,order_id,total_paid
count,226909.0,226904.0
mean,413296.48248,569.225818
std,65919.250331,1761.778002
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.98
max,527401.0,214747.53


In [264]:
orders.nunique() # hint: order_id is unique per row 
# hint: state is category data

order_id        226909
created_date    224828
total_paid       31236
state                5
dtype: int64

In [265]:
orders.state.unique()
#hint: state has the values "Shopping Basket"/"Pending", which ideally should not be part of the analysis - fix required
# (if we care about actually sold products)


array(['Cancelled', 'Completed', 'Pending', 'Shopping Basket',
       'Place Order'], dtype=object)

## Clean the data per csv
- Remember to create a copy of the df using .copy()

In [266]:
orders_c = orders.copy()

### Rename columns, set index
 - df.columns  , df.index
 - df=df.rename(columns={"A": "a", "B": "c"})
 - df.columns = ["a","b","x"]
   - take care: this assigns names by position and needs exactly one name per column; it only changes labels, not cell values such as NaN
 - df=df.set_index("col") , df=df.reset_index()
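
The two renaming styles can be compared on a toy frame (made-up values, not the project data). In this sketch both only change the labels, and the NaN stays where it was:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "total_paid": [9.99, np.nan]})

# Option 1: map old names to new names; columns not listed keep their names.
renamed = df.rename(columns={"order_id": "ord_id", "total_paid": "ord_total_paid"})

# Option 2: assign a full list of names; this is positional, one name per column.
relabelled = df.copy()
relabelled.columns = ["ord_id", "ord_total_paid"]

# Count missing values after each approach.
print(renamed.isna().sum())
print(relabelled.isna().sum())
```

rename() with a mapping is the more defensive choice, because it does not depend on the column order.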

In [267]:
orders_c.columns #hint: column names should have the prefix ord_

Index(['order_id', 'created_date', 'total_paid', 'state'], dtype='object')

In [268]:
#orders_c.columns=['ord_id', 'ord_created_date', 'ord_total_paid', 'ord_state'] 
## take care: assigning a list is positional and needs one name per column; rename() with a mapping is safer


In [269]:
orders_c=orders_c.rename(columns={"order_id": "ord_id", "created_date": "ord_created_date"
, "total_paid": "ord_total_paid", "state": "ord_state"})

In [270]:
orders_c.index #hint: no need to change index

RangeIndex(start=0, stop=226909, step=1)

### Remove duplicate rows
- df.duplicated().sum()
- df.loc[df.duplicated()==True]
- df=df.drop_duplicates() 
- df=df.drop_duplicates(subset=["col"]) --> remove rows based on duplicates in a specific column
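
A minimal sketch of these calls on a toy frame (the values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ord_id": [1, 1, 2, 2],
        "ord_state": ["Pending", "Pending", "Pending", "Completed"],
    }
)

# Rows that repeat an earlier row in every column.
n_full_duplicates = df.duplicated().sum()
full_duplicates = df.loc[df.duplicated()]

# Drop exact duplicates (keeps the first occurrence).
deduped = df.drop_duplicates()

# Drop rows that repeat an earlier ord_id, whatever the other columns hold.
deduped_by_id = df.drop_duplicates(subset=["ord_id"])

print(n_full_duplicates, len(deduped), len(deduped_by_id))
```

Note that subset=["ord_id"] silently discards the "Completed" row here, so it is worth checking which row should survive before using it.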

In [271]:
orders_c.duplicated().sum() #hint : No duplicates

0

In [272]:
orders_c.loc[orders_c.duplicated()==True]

Unnamed: 0,ord_id,ord_created_date,ord_total_paid,ord_state


### Clean NaN and empty cells
- df.isna().sum()
- df = df.replace(r'^\s*$', np.nan, regex=True) --> replace empty cells and cells with only whitespace with NaN (without regex=True the pattern is treated as a literal string)
- df.col = df.col.fillna(value) , or df.col.ffill(limit=n) / df.col.bfill(limit=n) to fill from neighbouring rows

- Extra: 
  - (df.values == '').sum() --> count the cells that are empty strings
  - df.col.str.isspace().sum() --> count the cells that contain only whitespace
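
The checks above can be tried on a small frame with one empty string, one whitespace-only string and one NaN (toy values). This sketch assumes that blanks should become NaN and that a missing price may be filled with 0.0, as done for total_paid in this notebook:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sku": ["APP0663", "", "   "], "price": [219.0, np.nan, 7.89]})

# Count empty strings and whitespace-only strings in the text column.
blank_sku = (df["sku"] == "").sum()
whitespace_sku = df["sku"].str.isspace().sum()  # "" is not counted: "".isspace() is False

# regex=True is needed for the pattern to be interpreted as a regular expression.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

# Fill the missing price with a fixed value.
cleaned["price"] = cleaned["price"].fillna(0.0)

print(cleaned.isna().sum())
```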

In [273]:
orders_c.sample(5)

Unnamed: 0,ord_id,ord_created_date,ord_total_paid,ord_state
170289,470661,2017-12-24 11:49:08,1159.0,Shopping Basket
83538,383113,2017-07-28 03:22:26,19.99,Shopping Basket
140872,440798,2017-11-24 18:50:32,77.28,Place Order
143024,443117,2017-11-25 12:32:50,21.23,Shopping Basket
130975,430617,2017-11-21 20:50:11,1842.59,Place Order


In [274]:
orders_c.isna().sum()

ord_id              0
ord_created_date    0
ord_total_paid      5
ord_state           0
dtype: int64

In [275]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ord_id            226909 non-null  int64  
 1   ord_created_date  226909 non-null  object 
 2   ord_total_paid    226904 non-null  float64
 3   ord_state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [276]:
import numpy as np
orders_c = orders_c.replace(r'^\s*$', np.nan, regex=True) #Question: is it safe to do that?


In [277]:
orders_c.isna().sum()

ord_id              0
ord_created_date    0
ord_total_paid      5
ord_state           0
dtype: int64

In [278]:
orders_c.ord_total_paid.dtype

dtype('float64')

In [279]:
orders_c.ord_total_paid=orders_c.ord_total_paid.fillna(0.0)

In [280]:
orders_c.ord_total_paid.dtype

dtype('float64')

In [281]:
orders_c.isna().sum()

ord_id              0
ord_created_date    0
ord_total_paid      0
ord_state           0
dtype: int64

### Fix data types
- df.col.astype(type,errors="raise")
  - type = "int","float","bool","category","object","datetime64[ns]","timedelta64[ns]"
- for mixed data
  - pd.to_numeric(df.col, downcast=x,errors="raise") 
  x = "integer" or "float"
  - pd.to_datetime(df.col, errors="raise") 
  - pd.to_timedelta(df.col, errors="raise") 
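
A small sketch of these conversions on invented values. It uses errors="coerce" instead of "raise" to show the alternative behaviour: entries that cannot be parsed become NaT/NaN rather than stopping the cell with an exception. The explicit format string is an assumption matching the timestamps seen in orders.csv.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "created_date": ["2017-02-05 10:00:51", "2017-12-30 23:46:53", "not a date"],
        "total_paid": ["73.98", "98.49", "n/a"],
        "state": ["Pending", "Completed", "Pending"],
    }
)

# Unparseable entries become NaT / NaN with errors="coerce".
df["created_date"] = pd.to_datetime(
    df["created_date"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)
df["total_paid"] = pd.to_numeric(df["total_paid"], errors="coerce")

# A column with few distinct values can be stored as a category.
df["state"] = df["state"].astype("category")

print(df.dtypes)
print(df.isna().sum())
```

errors="raise" is the stricter choice when bad values should be noticed immediately; "coerce" is handy for counting how many values fail to parse.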

In [282]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ord_id            226909 non-null  int64  
 1   ord_created_date  226909 non-null  object 
 2   ord_total_paid    226909 non-null  float64
 3   ord_state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [283]:
orders_c.ord_created_date=pd.to_datetime(orders_c.ord_created_date,errors="raise")

In [284]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   ord_id            226909 non-null  int64         
 1   ord_created_date  226909 non-null  datetime64[ns]
 2   ord_total_paid    226909 non-null  float64       
 3   ord_state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [285]:
orders_c.ord_state=orders_c.ord_state.astype("category",errors="raise")

In [286]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   ord_id            226909 non-null  int64         
 1   ord_created_date  226909 non-null  datetime64[ns]
 2   ord_total_paid    226909 non-null  float64       
 3   ord_state         226909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 5.4 MB


### Drop duplicate/unnecessary columns
- df=df.drop(columns=["col1","col2"])

In [287]:
orders_c.describe() ## hint: all columns contain needed data. nothing to drop

Unnamed: 0,ord_id,ord_total_paid
count,226909.0,226909.0
mean,413296.48248,569.213275
std,65919.250331,1761.760618
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.97
max,527401.0,214747.53


## Re-Explore the data
Draw hist() and boxplot() per column and take notes.
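
As a complement to the plots, the numbers a boxplot is built from can be computed directly. This is only a sketch: the five values are the min, quartiles and max borrowed from the describe() output above, not the full column, and the 1.5 × IQR whisker rule is the usual boxplot convention.

```python
import pandas as pd

total_paid = pd.Series(
    [0.0, 34.19, 112.99, 525.97, 214747.53], name="ord_total_paid"
)

# Quartiles and interquartile range.
q1, q3 = total_paid.quantile([0.25, 0.75])
iqr = q3 - q1

# Points above this limit would be drawn as outliers in a boxplot.
upper_whisker_limit = q3 + 1.5 * iqr
outliers = total_paid[total_paid > upper_whisker_limit]

print(upper_whisker_limit)
print(outliers)

# With matplotlib installed, the plots themselves would be e.g.:
# total_paid.plot.hist(bins=50)
# total_paid.plot.box()
```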

In [288]:
orders_c.sample(10)  # Question: how can state be "Shopping Basket" while total_paid has a value?

Unnamed: 0,ord_id,ord_created_date,ord_total_paid,ord_state
204083,504571,2018-02-02 17:50:44,3.99,Place Order
137994,437782,2017-11-24 14:28:30,466.16,Completed
33778,333262,2017-03-08 13:19:37,70.99,Shopping Basket
55430,354966,2017-05-11 16:48:30,50.98,Completed
168472,468817,2017-12-21 14:13:03,29.99,Shopping Basket
140317,440216,2017-11-24 17:02:36,316.27,Shopping Basket
124214,423811,2017-11-09 13:36:54,299.0,Shopping Basket
40556,340044,2017-03-28 13:20:40,107.04,Shopping Basket
225570,526062,2018-03-12 17:36:21,48.99,Shopping Basket
68336,367882,2017-06-21 20:26:33,42.98,Pending


In [289]:
orders_c.loc[orders_c.ord_total_paid == 0.0].ord_state.unique()

['Completed', 'Shopping Basket', 'Place Order', 'Pending', 'Cancelled']
Categories (5, object): ['Cancelled', 'Completed', 'Pending', 'Place Order', 'Shopping Basket']
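
Since zero totals appear in every state, counting them per state may show where they concentrate. A sketch on invented rows, using the renamed column names from above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ord_total_paid": [0.0, 0.0, 73.98, 0.0, 321.98],
        "ord_state": [
            "Shopping Basket",
            "Shopping Basket",
            "Shopping Basket",
            "Completed",
            "Completed",
        ],
    }
)

# Keep only the rows with a zero total, then count them per state.
zero_counts = df.loc[df["ord_total_paid"] == 0.0, "ord_state"].value_counts()
print(zero_counts)
```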

# Cleaning up orderlines.csv (ol_)
#### orderlines.csv – Every row represents each one of the different products involved in an order.
- id – a unique identifier for each row in this file
- id_order – corresponds to orders.order_id
- product_id – an old identifier for each product, nowadays not in use
- product_quantity – how many units of that product were purchased on that order
- sku – stock keeping unit: a unique identifier for each product
- unit_price – the unitary price (in euros) of each product at the moment of placing that order
- date – timestamp for the processing of that product

## Importing the data
- glob.glob("file_pat") --> list the files matching a pattern, to read multiple files
- pd.concat(dfs_list, ignore_index=True) --> create 1 df from multi dfs
- pd.read_csv(path) --> create 1 df from a csv file

In [306]:
import pandas as pd

url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
ol = pd.read_csv(path)



In [307]:
pd.set_option("display.max_rows", 50) #Question : doesn't work!
pd.get_option("display.max_rows")


50

## Explore the data
- .shape, .size, .ndim
- .sample(5), .info()
- Numerical : .describe(), .nlargest(), .nsmallest()
- Category: .nunique(), .unique() 

In [308]:
ol.shape

(293983, 7)

In [309]:
ol.sample(5)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
284407,1635975,521356,0,2,APP0663,219.00,2018-03-05 10:36:05
162859,1439017,439981,0,1,PAC2074,2.307.59,2017-11-24 16:24:06
123933,1365355,406312,0,1,APP2205,2.874.60,2017-09-29 14:37:29
214482,1519508,473644,0,1,SAN0102,7.89,2017-12-27 17:15:35
281364,1631131,519262,0,1,APP2135,1.252.00,2018-03-01 08:44:34


In [310]:
ol.info()  
## hint: there are no nulls 
## hint: date  is of type object, has to be datetime - fix required
## hint: unit_price is of type object, has to be float - fix required

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB
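
One reason unit_price is read as object is visible in the sample above: values such as 2.307.59 contain two dots. A possible way to detect and repair them is sketched below on the five sampled values. The repair rests on an assumption that is not confirmed by the data description: that the first dot in such values is a thousands separator. Dropping the affected rows would be the more cautious alternative.

```python
import pandas as pd

prices = pd.Series(
    ["219.00", "2.307.59", "2.874.60", "7.89", "1.252.00"], name="unit_price"
)

# Flag values that contain more than one dot.
two_dots = prices.str.count(r"\.") > 1

# Assumption: in flagged values the first dot is a thousands separator, so remove it.
fixed = prices.where(~two_dots, prices.str.replace(".", "", n=1, regex=False))

# After the repair every value should parse; errors="raise" would stop otherwise.
unit_price = pd.to_numeric(fixed, errors="raise")

print(two_dots.sum())
print(unit_price)
```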


In [311]:
ol.describe()
# hint: product_id seems like empty and shall be removed - Fix required
# hint: product_quantity 3rd quartile = min = 1, max=999! we should probably treat orders with different quantities differently - fix required

Unnamed: 0,id,id_order,product_id,product_quantity
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0


In [303]:
ol.nunique() # hint: id is unique per row 
# hint: product_id is useless, shall be removed - fix required

id                  293983
id_order            204855
product_id               1
product_quantity        67
sku                   7951
unit_price           11329
date                251631
dtype: int64

In [312]:
ol.product_id.unique()
#hint: useless data


array([0])

In [325]:
ol.unit_price.sort_values(ascending=False).tail(40) 
# There is a negative value in unit_price - fix required (note: the column is still text, so this sort is alphabetical, not numeric)

225400       0.00
225401       0.00
276547       0.00
276548       0.00
67456        0.00
199139       0.00
280903       0.00
67563        0.00
280907       0.00
67854        0.00
265161       0.00
256505       0.00
280906       0.00
246443       0.00
256504       0.00
280905       0.00
67848        0.00
265159       0.00
265158       0.00
67812        0.00
246444       0.00
67802        0.00
67773        0.00
265157       0.00
67768        0.00
67585        0.00
67626        0.00
212228       0.00
280904       0.00
67644        0.00
67661        0.00
67682        0.00
67701        0.00
212227       0.00
67717        0.00
265154       0.00
67746        0.00
265156       0.00
261851       0.00
77008     -119.00
Name: unit_price, dtype: object
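
Because unit_price is still of dtype object, sort_values compares the values as text. The sketch below, on a few values of the kind seen above, contrasts that with a numeric view obtained through pd.to_numeric(errors="coerce"), which also makes it possible to count negative and unparseable prices:

```python
import pandas as pd

unit_price = pd.Series(["7.89", "219.00", "0.00", "-119.00", "1.252.00"])

# Text sort: compares character by character, so "219.00" comes before "7.89".
as_text = unit_price.sort_values().tolist()

# Numeric view: values that do not parse (e.g. two dots) become NaN.
numeric = pd.to_numeric(unit_price, errors="coerce")
n_negative = (numeric < 0).sum()
n_unparseable = numeric.isna().sum()

print(as_text)
print(n_negative, n_unparseable)
```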

In [326]:
ol.loc[ol.unit_price == "0.00"]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
53515,1227566,345934,0,1,KIN0153-2,0.00,2017-04-13 13:47:21
53530,1227590,345957,0,1,WDT0347,0.00,2017-04-13 14:44:05
56529,1232832,348502,0,1,LIBRO,0.00,2017-04-21 18:14:54
56562,1232888,348531,0,1,LIBRO,0.00,2017-04-21 19:46:54
56576,1232909,348542,0,1,LIBRO,0.00,2017-04-21 20:48:06
...,...,...,...,...,...,...,...
291880,1646431,525595,0,1,SYN0150,0.00,2018-03-11 20:50:46
292216,1647014,525883,0,1,APP0699,0.00,2018-03-12 12:26:57
292388,1647334,526025,0,1,WAC0251,0.00,2018-03-12 16:31:39
292905,1648288,526430,0,1,OWC0110,0.00,2018-03-13 13:27:05


In [330]:
(ol
 .query("unit_price == '0.00'")
 [["id","sku"]]
)

Unnamed: 0,id,sku
53515,1227566,KIN0153-2
53530,1227590,WDT0347
56529,1232832,LIBRO
56562,1232888,LIBRO
56576,1232909,LIBRO
...,...,...
291880,1646431,SYN0150
292216,1647014,APP0699
292388,1647334,WAC0251
292905,1648288,OWC0110


##Clean the data per csv
- Remember to create a copy of the df using .copy()

In [255]:
orderlines_c = ol.copy()  # the orderlines frame was loaded as `ol` above

### Rename columns, set index
 - df.columns  , df.index
 - df=df.rename(columns={"A": "a", "B": "c"})
 - df.columns = ["a","b","x"]
   - take care: this assigns names by position and needs exactly one name per column; it only changes labels, not cell values such as NaN
 - df=df.set_index("col") , df=df.reset_index()

In [256]:
orderlines_c.columns #hint: column names look good

Index(['id', 'id_order', 'product_id', 'product_quantity', 'sku', 'unit_price',
       'date'],
      dtype='object')

In [None]:
orderlines_c=orderlines_c.rename(columns={"id": "ordli_id", "id_order": "ordli_ord_id"
, "product_id": "ordli_product_id", "product_quantity": "ordli_product_quantity", "sku": "ordli_sku"
, "unit_price": "ordli_product_unit_price"})

In [257]:
orders_c.index #hint: no need to change index

RangeIndex(start=0, stop=226909, step=1)

### Remove duplicate rows
- df.duplicated().sum()
- df.loc[df.duplicated()==True]
- df=df.drop_duplicates() 
- df=df.drop_duplicates(subset=["col"]) --> remove rows based on duplicates in a specific column

In [None]:
orders_c.duplicated().sum() #hint : No duplicates

0

In [None]:
orders_c.loc[orders_c.duplicated()==True]

Unnamed: 0,order_id,created_date,total_paid,state


### Clean NaN and empty cells
- df.isna().sum()
- df = df.replace(r'^\s*$', np.nan, regex=True) --> replace empty cells and cells with only whitespace with NaN (without regex=True the pattern is treated as a literal string)
- df.col = df.col.fillna(value) , or df.col.ffill(limit=n) / df.col.bfill(limit=n) to fill from neighbouring rows

- Extra: 
  - (df.values == '').sum() --> count the cells that are empty strings
  - df.col.str.isspace().sum() --> count the cells that contain only whitespace

In [None]:
orders_c.isna().sum()

order_id        0
created_date    0
total_paid      5
state           0
dtype: int64

In [None]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [None]:
import numpy as np
orders_c = orders_c.replace(r'^\s*$', np.nan, regex=True)


In [None]:
orders_c.isna().sum()

order_id        0
created_date    0
total_paid      5
state           0
dtype: int64

In [None]:
orders_c.total_paid.dtype

dtype('float64')

In [None]:
orders_c.total_paid=orders_c.total_paid.fillna(0.0)

In [None]:
orders_c.total_paid.dtype

dtype('float64')

In [None]:
orders_c.isna().sum()

order_id        0
created_date    0
total_paid      0
state           0
dtype: int64

### Fix data types
- df.col.astype(type,errors="raise")
  - type = "int","float","bool","category","object","datetime64[ns]","timedelta64[ns]"
- for mixed data
  - pd.to_numeric(df.col, downcast=x,errors="raise") 
  x = "integer" or "float"
  - pd.to_datetime(df.col, errors="raise") 
  - pd.to_timedelta(df.col, errors="raise") 

In [None]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226909 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [None]:
orders_c.created_date=pd.to_datetime(orders_c.created_date,errors="raise")

In [None]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226909 non-null  float64       
 3   state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [None]:
orders_c.state=orders_c.state.astype("category",errors="raise")

In [None]:
orders_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226909 non-null  float64       
 3   state         226909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1)
memory usage: 5.4 MB


### Drop duplicate/unnecessary columns
- df=df.drop(columns=["col1","col2"])

In [None]:
orders_c.describe() ## hint: all columns contain needed data. nothing to drop

Unnamed: 0,order_id,total_paid
count,226909.0,226909.0
mean,413296.48248,569.213275
std,65919.250331,1761.760618
min,241319.0,0.0
25%,356263.0,34.19
50%,413040.0,112.99
75%,470553.0,525.97
max,527401.0,214747.53


## Re-Explore the data
Draw hist() and boxplot() per column and take notes.

In [None]:
orders_c.sample(10)  # Question: how can state be "Shopping Basket" while total_paid has a value?

Unnamed: 0,order_id,created_date,total_paid,state
153054,453267,2017-12-01 15:52:25,0.0,Place Order
207636,508127,2018-02-08 13:13:21,2511.59,Cancelled
22219,321699,2017-02-09 10:41:43,80.99,Shopping Basket
158350,458574,2017-12-09 21:38:25,2559.57,Shopping Basket
107130,406715,2017-09-30 17:03:58,57.79,Shopping Basket
56577,356113,2017-05-15 22:45:24,57.98,Place Order
74786,374332,2017-07-07 18:33:43,99.98,Completed
173457,473832,2017-12-27 19:26:21,270.04,Completed
190772,491252,2018-01-15 16:10:44,506.97,Completed
137866,437654,2017-11-24 09:33:37,83.19,Completed


In [None]:
orders_c.loc[orders_c.total_paid == 0.0].state.unique()

['Completed', 'Shopping Basket', 'Place Order', 'Pending', 'Cancelled']
Categories (5, object): ['Cancelled', 'Completed', 'Pending', 'Place Order', 'Shopping Basket']