<a href="https://colab.research.google.com/github/JacquelineBashta/Pandas_Eniac/blob/main/Project_2_Eniac.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# orders.csv (ord_)
orders.csv – Every row in this file represents an order.
- order_id – a unique identifier for each order
- created_date – a timestamp for when the order was created
- total_paid – the total amount paid by the customer for this order, in euros
- state – the current status of the order, one of:
  - “Shopping basket” – products have been placed in the shopping basket, but the order has not been processed yet.
  - “Pending” – the shopping basket has been processed, but payment confirmation is pending.
  - “Completed” – the order has been placed and paid, and the transaction is completed.
  - “Cancelled” – the order has been cancelled and the payment returned to the customer.

    ==============================================================================================

## Importing the data
- ``` glob.glob("file_pat") ``` --> list all files matching a pattern (to read multiple files)
- ``` pd.concat(dfs_list, ignore_index=True)```  --> create one df from multiple dfs
- ``` pd.read_csv(path)```  --> create one df from a csv file
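
The three calls above can be combined as in the following minimal sketch. The temporary folder and the `orders_part*.csv` file names are made up for illustration; they are not part of the project data.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV parts into a temporary folder
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"order_id": [1, 2], "total_paid": [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part1.csv"), index=False)
pd.DataFrame({"order_id": [3], "total_paid": [30.0]}).to_csv(
    os.path.join(tmp_dir, "orders_part2.csv"), index=False)

# glob.glob returns every path matching the pattern (sorted here for a stable order)
paths = sorted(glob.glob(os.path.join(tmp_dir, "orders_part*.csv")))

# Read each file into its own DataFrame, then stack them into one
dfs_list = [pd.read_csv(p) for p in paths]
combined = pd.concat(dfs_list, ignore_index=True)

print(combined.shape)
```

With `ignore_index=True` the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.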

In [104]:
import pandas as pd
import numpy as np

pd.set_option("display.min_rows", 0) 
pd.set_option("display.max_rows", 30) 
pd.__version__

'1.4.4'

In [105]:
url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
ord = pd.read_csv(path)
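
The URL handling in the cell above can be wrapped in a small helper, sketched below. The function name `drive_download_url` is hypothetical, and the sketch assumes the share link has the `.../file/d/<id>/view?...` shape used in this notebook.

```python
def drive_download_url(share_url: str) -> str:
    """Build a direct-download URL from a Google Drive share link (hypothetical helper)."""
    # In a ".../file/d/<id>/view?..." link, the file id is the second-to-last segment
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


url = "https://drive.google.com/file/d/1Vu0q91qZw6lqhIqbjoXYvYAQTmVHh6uZ/view?usp=sharing"
path = drive_download_url(url)
print(path)
```

The resulting `path` can be passed to `pd.read_csv` exactly as in the cell above.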

      ===============================================

## Rename Columns , Set Index
- Rules: 
    - add a 2-3 character prefix to all columns
    - change column names to CamelCase
    - shorten names as much as possible
    - the unique (key) column gets an ALL_CAPS name
 - ``` df.columns```   , ``` df.index``` 
 - ``` df=df.rename(columns={"A": "a", "B": "c"})``` 
 - ``` df.columns = ["a", "b", "x"]``` 
     - take care: assigning a list replaces every name by position, so it must contain exactly one name per column, in the original column order
 - ``` df=df.set_index("col")```  , ``` df=df.reset_index()``` 
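
A minimal sketch of the renaming and index calls listed above, on a made-up three-row frame (the data is illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "state": ["Completed", "Cancelled", "Pending"]})

# rename maps old names to new ones; columns not listed keep their names
renamed = df.rename(columns={"order_id": "ORD_ID", "state": "ord_State"})

# Assigning a list replaces all names by position, so it needs one name per column
relabeled = df.copy()
relabeled.columns = ["ORD_ID", "ord_State"]

# set_index moves a column into the index; reset_index moves it back
indexed = renamed.set_index("ORD_ID")
restored = indexed.reset_index()

print(renamed.columns.tolist())
print(indexed.index.name)
```

`rename` is usually the safer choice, because a typo in the mapping leaves a column unchanged instead of silently mislabeling it.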

In [106]:
ord.columns 

Index(['order_id', 'created_date', 'total_paid', 'state'], dtype='object')

In [107]:
ord=ord.rename(columns={"order_id": "ORD_ID"
                            , "created_date": "ordreatDate"
                            , "total_paid": "ord_TotalPaid"
                            , "state": "ord_State"})

In [108]:
ord.index 

RangeIndex(start=0, stop=226909, step=1)

    ==============================================================================================

## Explore the data
- ``` df.shape``` , ``` df.size``` , ``` df.ndim``` 
- ``` df.sample(5)``` , ``` df.info()``` 
- Numerical : ``` df.describe()``` , ``` df.col.nlargest()``` , ``` df.col.nsmallest()``` 
- Category : ``` df.nunique()``` , ``` df.col.unique() ``` 
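
The exploration calls above can be tried on a tiny made-up frame first. This is a sketch with illustrative values, not the project data.

```python
import pandas as pd

df = pd.DataFrame({
    "ORD_ID": [1, 2, 3, 4],
    "ord_TotalPaid": [10.0, 250.5, 0.0, 99.9],
    "ord_State": ["Completed", "Pending", "Completed", "Cancelled"],
})

print(df.shape, df.size, df.ndim)    # (rows, columns), total cells, number of axes
print(df.describe())                 # summary statistics for the numeric columns
print(df.ord_TotalPaid.nlargest(2))  # the two largest totals, with their row labels
print(df.nunique())                  # number of distinct values per column
print(df.ord_State.unique())         # unique is a Series method: call it on one column
```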

In [109]:
ord.shape

(226909, 4)

In [110]:
ord.sample(5)

        ORD_ID          ordreatDate  ord_TotalPaid        ord_State
99989   399573  2017-09-13 12:48:54         404.89      Place Order
211423  511914  2018-02-15 10:18:06         1129.0  Shopping Basket
186743  487155  2018-01-09 23:26:38         1139.0  Shopping Basket
4539    304008  2017-01-07 01:09:18         221.99  Shopping Basket
85739   385314  2017-08-02 10:12:05            0.0  Shopping Basket


In [111]:
ord.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ORD_ID         226909 non-null  int64  
 1   ordreatDate    226909 non-null  object 
 2   ord_TotalPaid  226904 non-null  float64
 3   ord_State      226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [112]:
ord.describe()

             ORD_ID  ord_TotalPaid
count      226909.0       226904.0
mean   413296.48248     569.225818
std    65919.250331    1761.778002
min        241319.0            0.0
25%        356263.0          34.19
50%        413040.0         112.99
75%        470553.0         525.98
max        527401.0      214747.53


In [113]:
ord.nunique() # hint: order_id is unique per row 
# hint: state is category data

ORD_ID           226909
ordreatDate      224828
ord_TotalPaid     31236
ord_State             5
dtype: int64

In [114]:
ord.ord_State.unique()
# hint: state includes "Shopping Basket" and "Pending", which ideally should not be part of the analysis
# (if we only care about products that were actually sold) - TODO: fix required


array(['Cancelled', 'Completed', 'Pending', 'Shopping Basket',
       'Place Order'], dtype=object)
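
One possible way to handle the TODO above is to keep only the states that count as a sale. The sketch below assumes that only "Completed" qualifies, which is an assumption made for illustration; the sample frame is made up.

```python
import pandas as pd

orders = pd.DataFrame({
    "ORD_ID": [1, 2, 3, 4, 5],
    "ord_State": ["Cancelled", "Completed", "Pending", "Shopping Basket", "Place Order"],
})

# Assumption: only "Completed" orders represent actual sales
sold_states = ["Completed"]
sold = orders[orders["ord_State"].isin(sold_states)]

# value_counts shows how many rows each state contributes before filtering
print(orders["ord_State"].value_counts())
print(sold)
```

Keeping the allowed states in a list makes it easy to widen the filter later if another state turns out to be relevant.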

    ==============================================================================================

## Initial Clean
- Remember to create a copy of the df using ``` df.copy()``` 

In [115]:
#keep original
ord_original = ord.copy()

      ===============================================

### Strip whitespaces
- ``` df.applymap(lambda x: x.strip() if isinstance(x, str) else x)```

In [116]:
ord = ord.applymap(lambda x: x.strip() if isinstance(x, str) else x)
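
The cell above runs on pandas 1.4.4. Newer releases (2.1 and later) deprecate `applymap` in favor of `DataFrame.map`, so a version-independent variant may be useful. The sketch below applies the same strip logic column by column with `Series.map`; the helper name `strip_if_str` and the sample data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "ord_State": [" Completed", "Pending  ", None],
    "ord_TotalPaid": [1.5, 2.0, 3.0],
})


def strip_if_str(value):
    # Leave numbers and missing values untouched
    return value.strip() if isinstance(value, str) else value


# Apply the helper column by column; Series.map works element-wise
stripped = df.apply(lambda col: col.map(strip_if_str))
print(stripped)
```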

      ===============================================
### Remove Duplicate Rows
- ``` df.duplicated().sum()``` 
- ``` df.loc[df.duplicated()]``` 
- ``` df=df.drop_duplicates() ``` 
- ``` df=df.drop_duplicates(subset=["col"])```  --> remove rows that are duplicated in a specific column
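
A minimal sketch of these calls on a made-up frame containing one fully duplicated row and one repeated `ORD_ID`:

```python
import pandas as pd

df = pd.DataFrame({
    "ORD_ID": [1, 1, 2, 2],
    "ord_State": ["Completed", "Completed", "Pending", "Cancelled"],
})

n_dupes = df.duplicated().sum()        # fully identical rows (first occurrence not counted)
dupe_rows = df.loc[df.duplicated()]    # the repeated rows themselves
no_full_dupes = df.drop_duplicates()   # drops the second (1, "Completed") row
one_per_id = df.drop_duplicates(subset=["ORD_ID"])  # keeps the first row for each ORD_ID

print(n_dupes)
print(one_per_id)
```

Note that `subset` can discard rows that differ in other columns (here the "Cancelled" row), so it is worth inspecting them before dropping.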

In [117]:
ord.duplicated().sum() #hint : No duplicates

0

      =======================

#### Remove duplicated rows related to Unique columns
- check for possible duplicates: ``` len(df.col.unique()) == df.shape[0] ``` (``` True``` means every value is unique)
- get the rows whose value is duplicated: ``` df.loc[df.duplicated(subset="col")]``` 
- find all rows with the same value: ``` df[df.col=="val"]``` 
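
The checks above can be sketched on a small made-up frame in which one key value appears twice. `Series.is_unique` and the `keep=False` argument are shown as optional shortcuts; the notebook itself does not use them.

```python
import pandas as pd

df = pd.DataFrame({"ORD_ID": [10, 11, 11, 12], "ord_TotalPaid": [5.0, 7.5, 7.5, 9.0]})

# is_unique is a shortcut for comparing the number of distinct values to the row count
print(df.ORD_ID.is_unique)

# keep=False flags every row of a duplicated key, not only the later occurrences
clashes = df.loc[df.duplicated(subset="ORD_ID", keep=False)]

# Look at all rows sharing one specific key value
rows_for_11 = df[df.ORD_ID == 11]

print(clashes)
```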

In [118]:
len(ord.ORD_ID.unique()) == ord.shape[0]

True

      ===============================================

### Drop duplicate/unnecessary Columns
- ``` df=df.drop(columns=["col1","col2"])``` 

In [119]:
ord.sample(5) ## hint: all columns contain needed data. nothing to drop

        ORD_ID          ordreatDate  ord_TotalPaid        ord_State
68382   367928  2017-06-21 20:19:59          79.99          Pending
176606  476998  2017-12-30 00:15:05         212.99        Completed
39858   339344  2017-03-26 19:32:56         849.99          Pending
28406   327886  2017-02-23 03:13:39        1024.99  Shopping Basket
183617  484024  2018-01-06 23:51:55          193.0          Pending


    ==============================================================================================

## Compare to original DataFrame
``` df.compare(df2)```

In [120]:
ord.compare(ord_original)
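
To see what `compare` reports when something did change, here is a minimal sketch on made-up data where one value carries a leading space:

```python
import pandas as pd

original = pd.DataFrame({
    "ord_State": [" Completed", "Pending"],
    "ord_TotalPaid": [10.0, 20.0],
})
cleaned = original.copy()
cleaned["ord_State"] = cleaned["ord_State"].str.strip()

# compare returns only the cells that differ: "self" is the caller, "other" the argument
diff = cleaned.compare(original)
print(diff)
```

An empty result would mean that the two frames hold identical values. Both frames need the same row and column labels for `compare` to work.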

    ==============================================================================================

## Export the cleaned DataFrame

``` df.to_pickle("file_name.pkl")``` 

In [121]:
ord.to_pickle("clean_Tables/Orders_c.pkl")  # forward slash works on every OS; the folder must already exist
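
A round-trip sketch for the export step, assuming a temporary output folder (the path is hypothetical; `to_pickle` does not create missing directories, so the folder is created first):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"ORD_ID": [1, 2], "ord_State": ["Completed", "Pending"]})

# Hypothetical output folder inside a temporary directory
out_dir = os.path.join(tempfile.mkdtemp(), "clean_Tables")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "Orders_c.pkl")

df.to_pickle(out_path)

# read_pickle restores the frame, including dtypes and index
restored = pd.read_pickle(out_path)
print(restored.equals(df))
```

Unlike CSV, a pickle keeps column dtypes, so no re-parsing is needed when the cleaned table is loaded in a later notebook.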