# Cleaning a Dirty Dataset
**14/01/2025**

**In this practice exercise, we'll clean a dirty retail store dataset.**

https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning?resource=download

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('retail_store_sales.csv')
df.head()

,Transaction ID,Customer ID,Category,Item,Price Per Unit,Quantity,Total Spent,Payment Method,Location,Transaction Date,Discount Applied
0,TXN_6867343,CUST_09,Patisserie,Item_10_PAT,18.5,10.0,185.0,Digital Wallet,Online,2024-04-08,True
1,TXN_3731986,CUST_22,Milk Products,Item_17_MILK,29.0,9.0,261.0,Digital Wallet,Online,2023-07-23,True
2,TXN_9303719,CUST_02,Butchers,Item_12_BUT,21.5,2.0,43.0,Credit Card,Online,2022-10-05,False
3,TXN_9458126,CUST_06,Beverages,Item_16_BEV,27.5,9.0,247.5,Credit Card,Online,2022-05-07,
4,TXN_4575373,CUST_05,Food,Item_6_FOOD,12.5,7.0,87.5,Digital Wallet,Online,2022-10-02,False


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12575 entries, 0 to 12574
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    12575 non-null  object 
 1   Customer ID       12575 non-null  object 
 2   Category          12575 non-null  object 
 3   Item              11362 non-null  object 
 4   Price Per Unit    11966 non-null  float64
 5   Quantity          11971 non-null  float64
 6   Total Spent       11971 non-null  float64
 7   Payment Method    12575 non-null  object 
 8   Location          12575 non-null  object 
 9   Transaction Date  12575 non-null  object 
 10  Discount Applied  8376 non-null   object 
dtypes: float64(3), object(8)
memory usage: 1.1+ MB


**Let's keep it as simple as possible and start by counting the missing values per column.**

In [4]:
print(df.isna().sum())

Transaction ID         0
Customer ID            0
Category               0
Item                1213
Price Per Unit       609
Quantity             604
Total Spent          604
Payment Method         0
Location               0
Transaction Date       0
Discount Applied    4199
dtype: int64
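Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get that share; the `toy` frame and its values are made up for illustration and are not taken from the retail data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the retail data (illustrative values only)
toy = pd.DataFrame({
    "Item": ["Item_1", None, "Item_3", None],
    "Quantity": [1.0, 2.0, np.nan, 4.0],
    "Location": ["Online", "In-store", "Online", "Online"],
})

# isna().mean() is the fraction of missing values per column; scale to percent
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share.sort_values(ascending=False))
```

On the real data the same expression would be `df.isna().mean().mul(100)`.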


In [5]:
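# Rows are dropped only for columns where at most 5% of the values are missing
threshold = len(df) * 0.05

# Despite the name, these are the columns whose missing ROWS will be dropped,
# not columns that get removed from the DataFrame
cols_to_drop = df.columns[df.isna().sum() <= threshold]
print(cols_to_drop)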

Index(['Transaction ID', 'Customer ID', 'Category', 'Price Per Unit',
       'Quantity', 'Total Spent', 'Payment Method', 'Location',
       'Transaction Date'],
      dtype='object')


In [6]:
df.dropna(subset=cols_to_drop, inplace=True)
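The threshold rule above can be tried on a small frame to see exactly which rows go. This is only a sketch: the `toy` data and the name `cols_row_drop` are hypothetical, and the threshold is set to 20% instead of 5% because the frame has just 10 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' has 1 missing value out of 10, 'b' has 5
toy = pd.DataFrame({
    "a": [1.0, np.nan] + [3.0] * 8,
    "b": [np.nan] * 5 + [1.0] * 5,
})

threshold = len(toy) * 0.2          # 20% here only because the toy frame is tiny
na_counts = toy.isna().sum()

# Columns at or below the threshold: fix them by dropping the affected rows
cols_row_drop = toy.columns[na_counts <= threshold]

# dropna(subset=...) removes rows, never columns; 'b' keeps its remaining gaps
cleaned = toy.dropna(subset=cols_row_drop)

print(list(cols_row_drop))
print(len(toy), "->", len(cleaned))
print(cleaned.isna().sum())
```

Columns above the threshold (here `b`) are left for a later imputation step, which is how `Discount Applied` is handled in this notebook.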

In [7]:
print(df.isna().sum())

Transaction ID         0
Customer ID            0
Category               0
Item                   0
Price Per Unit         0
Quantity               0
Total Spent            0
Payment Method         0
Location               0
Transaction Date       0
Discount Applied    3783
dtype: int64
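`Item` dropped to 0 missing values even though it was not in the subset. That suggests the rows missing `Item` are the same rows missing `Price Per Unit` or `Quantity` (609 + 604 = 1213). A check like the one below, run before the `dropna`, could confirm it; here it is sketched on made-up toy data.

```python
import numpy as np
import pandas as pd

# Illustrative data where 'Item' is missing exactly when a numeric field is missing
toy = pd.DataFrame({
    "Item": ["Item_1", None, None, "Item_4"],
    "Price Per Unit": [5.0, np.nan, 7.0, 8.0],
    "Quantity": [1.0, 2.0, np.nan, 4.0],
})

item_missing = toy["Item"].isna()
numeric_missing = toy[["Price Per Unit", "Quantity"]].isna().any(axis=1)

# Rows flagged by one mask but not the other; both zero means the masks coincide
only_item = int((item_missing & ~numeric_missing).sum())
only_numeric = int((~item_missing & numeric_missing).sum())
print(only_item, only_numeric)
```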


**Change data types, plus a little imputation**

In [8]:
# Check the number of unique values in each object column
print(df.select_dtypes('object').nunique())


Transaction ID      11362
Customer ID            25
Category                8
Item                  200
Payment Method          3
Location                2
Transaction Date     1114
Discount Applied        2
dtype: int64


In [9]:

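# Low-cardinality text columns are stored as categoricals
cols_to_category = ['Customer ID', 'Category', 'Item', 'Payment Method', 'Location']
for col in cols_to_category:
    df[col] = df[col].astype('category')

# Parse the date strings into datetime64
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])

# Convert to the nullable boolean dtype, then treat missing discount flags as False
df['Discount Applied'] = df['Discount Applied'].astype('boolean').fillna(False)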
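Filling `Discount Applied` with `False` assumes that "unknown" means "no discount". The sketch below shows the conversion step by step on a made-up Series, and records which rows were imputed before filling them. The `was_missing` flag is an optional extra, not part of the cell above.

```python
import numpy as np
import pandas as pd

# Made-up values mimicking what read_csv yields for a True/False column with gaps
raw = pd.Series([True, False, np.nan, True], dtype="object", name="Discount Applied")

as_bool = raw.astype("boolean")     # nullable boolean dtype: NaN becomes <NA>
was_missing = as_bool.isna()        # optional flag, recorded before imputing
filled = as_bool.fillna(False)      # assumption: unknown is treated as "no discount"

print(as_bool.tolist())
print(filled.tolist())
print(int(was_missing.sum()))
```

Keeping a flag like `was_missing` makes it possible to tell real `False` values from imputed ones later on.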



In [10]:
print(df.isna().sum())

Transaction ID      0
Customer ID         0
Category            0
Item                0
Price Per Unit      0
Quantity            0
Total Spent         0
Payment Method      0
Location            0
Transaction Date    0
Discount Applied    0
dtype: int64


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11362 entries, 0 to 12574
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Transaction ID    11362 non-null  object        
 1   Customer ID       11362 non-null  category      
 2   Category          11362 non-null  category      
 3   Item              11362 non-null  category      
 4   Price Per Unit    11362 non-null  float64       
 5   Quantity          11362 non-null  float64       
 6   Total Spent       11362 non-null  float64       
 7   Payment Method    11362 non-null  category      
 8   Location          11362 non-null  category      
 9   Transaction Date  11362 non-null  datetime64[ns]
 10  Discount Applied  11362 non-null  boolean       
dtypes: boolean(1), category(5), datetime64[ns](1), float64(3), object(1)
memory usage: 632.4+ KB
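Reported memory usage falls from 1.1+ MB to 632.4+ KB. Part of that comes from the dropped rows, and part most likely from the `category` conversion. A rough way to see the effect of the dtype alone, using a made-up low-cardinality column:

```python
import pandas as pd

# Made-up column with 3 distinct values, similar in spirit to 'Payment Method'
s_obj = pd.Series(["Cash", "Credit Card", "Digital Wallet"] * 1000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

The `+` in the `df.info()` output means the figure is a lower bound; `df.info(memory_usage='deep')` gives the full count.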


In [12]:
print(df.head())

  Transaction ID Customer ID       Category          Item  Price Per Unit  \
0    TXN_6867343     CUST_09     Patisserie   Item_10_PAT            18.5   
1    TXN_3731986     CUST_22  Milk Products  Item_17_MILK            29.0   
2    TXN_9303719     CUST_02       Butchers   Item_12_BUT            21.5   
3    TXN_9458126     CUST_06      Beverages   Item_16_BEV            27.5   
4    TXN_4575373     CUST_05           Food   Item_6_FOOD            12.5   

   Quantity  Total Spent  Payment Method Location Transaction Date  \
0      10.0        185.0  Digital Wallet   Online       2024-04-08   
1       9.0        261.0  Digital Wallet   Online       2023-07-23   
2       2.0         43.0     Credit Card   Online       2022-10-05   
3       9.0        247.5     Credit Card   Online       2022-05-07   
4       7.0         87.5  Digital Wallet   Online       2022-10-02   

   Discount Applied  
0              True  
1              True  
2             False  
3             False  
4             False  

---