# Preprocessing of Transaction Data 

## Module 1
### Task 1: Polishing the Dataset for Insights
Alex, a data analyst at an e-commerce company, cleaned "transaction_dataset.csv" so that it could be used reliably for analysis. He dropped the columns "product_class" and "product_size", which were not needed, and renamed the abbreviated columns to make them easier to read.

The objective of this task was simple but important: to give the organization clean, well-labeled data that supports informed, data-driven decisions about the e-commerce platform.

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%load_ext lab_black

In [17]:
# load the dataset
data = pd.read_csv("transaction_dataset.csv")
data.head()

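|   | tr_id | p_id | c_id | tr_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date |
|---|-------|------|------|---------|--------------|--------------|-------|--------------|---------------|--------------|------------|---------------|-------------------------|
| 0 | 1 | 2 | 2950 | 25-02-2017 00:00 | False | Approved | Solex | Standard | medium | medium | 71.49 | 53.62 | 41245.0 |
| 1 | 2 | 3 | 3120 | 21-05-2017 00:00 | True | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
| 2 | 3 | 37 | 402 | 16-10-2017 00:00 | False | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
| 3 | 4 | 88 | 3135 | 31-08-2017 00:00 | False | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.1 | 36145.0 |
| 4 | 5 | 78 | 787 | 01-10-2017 00:00 | True | Approved | Giant Bicycles | Standard | medium | large | 1765.3 | 709.48 | 42226.0 |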


In [18]:
# summary
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tr_id                    20000 non-null  int64  
 1   p_id                     20000 non-null  int64  
 2   c_id                     20000 non-null  int64  
 3   tr_date                  20000 non-null  object 
 4   online_order             20000 non-null  bool   
 5   order_status             20000 non-null  object 
 6   brand                    20000 non-null  object 
 7   product_line             20000 non-null  object 
 8   product_class            19803 non-null  object 
 9   product_size             19803 non-null  object 
 10  list_price               20000 non-null  float64
 11  standard_cost            20000 non-null  float64
 12  product_first_sold_date  20000 non-null  float64
dtypes: bool(1), float64(3), int64(3), object(6)
memory usage: 1.9+ MB
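
The summary lists `tr_date` as `object`, meaning the dates are stored as strings. As a minimal sketch, and assuming the day-first `DD-MM-YYYY HH:MM` layout seen in the preview rows holds for the whole column, the values could be parsed into proper timestamps like this. The `sample` frame is a hypothetical stand-in built from three preview values, not the full dataset.

```python
import pandas as pd

# Three tr_date values copied from the preview rows; `sample` is a
# hypothetical stand-in for the full `data` frame.
sample = pd.DataFrame(
    {"tr_date": ["25-02-2017 00:00", "21-05-2017 00:00", "16-10-2017 00:00"]}
)

# Assumption: every value follows the day-first "DD-MM-YYYY HH:MM" layout.
# errors="coerce" turns any value that does not match into NaT instead of raising.
sample["tr_date"] = pd.to_datetime(
    sample["tr_date"], format="%d-%m-%Y %H:%M", errors="coerce"
)

# Count values that failed to parse; a non-zero count would mean the
# layout assumption does not hold for every row.
unparsed = sample["tr_date"].isna().sum()
```

Giving an explicit `format` avoids pandas guessing between day-first and month-first dates, which matters for values such as `01-10-2017`.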


In [19]:
# null values
data.isna().sum()

tr_id                        0
p_id                         0
c_id                         0
tr_date                      0
online_order                 0
order_status                 0
brand                        0
product_line                 0
product_class              197
product_size               197
list_price                   0
standard_cost                0
product_first_sold_date      0
dtype: int64
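
Both `product_class` and `product_size` report 197 missing values, which suggests, but does not prove, that the gaps fall on the same rows. A minimal sketch of how that could be checked is shown here; the rows in `sample` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only (not from the dataset); None marks a missing value.
sample = pd.DataFrame(
    {
        "product_class": ["medium", None, "low", None],
        "product_size": ["medium", None, "medium", "large"],
    }
)

cols = ["product_class", "product_size"]

# Missing values per column, as in the cell above.
null_counts = sample[cols].isna().sum()

# Rows where both columns are missing at the same time.
both_missing = sample[cols].isna().all(axis=1).sum()

# Rows where at least one of the two columns is missing.
any_missing = sample[cols].isna().any(axis=1).sum()
```

If `both_missing` equals `any_missing` on the real data, the missing values in the two columns occur on exactly the same rows.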

In [20]:
# remove the columns 'product_class' and 'product_size', as these columns are not required
data = data.drop(columns=["product_class", "product_size"])

In [21]:
data.columns

Index(['tr_id', 'p_id', 'c_id', 'tr_date', 'online_order', 'order_status',
       'brand', 'product_line', 'list_price', 'standard_cost',
       'product_first_sold_date'],
      dtype='object')

In [23]:
# rename the abbreviated columns in 'data' to 'transaction_id', 'product_id', 'customer_id', and 'transaction_date' for improved readability

data = data.rename(
    columns={
        "tr_id": "transaction_id",
        "p_id": "product_id",
        "c_id": "customer_id",
        "tr_date": "transaction_date",
    }
)
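
By default, `rename` silently skips any key that does not match an existing column, so a typo in the mapping would go unnoticed. The sketch below shows one possible safeguard: chaining the drop and rename steps and passing `errors="raise"`. The `sample` frame is a hypothetical one-row stand-in built from the first preview row.

```python
import pandas as pd

# One row copied from the preview; `sample` is a hypothetical stand-in
# for the full `data` frame and holds only the columns touched here.
sample = pd.DataFrame(
    {
        "tr_id": [1],
        "p_id": [2],
        "c_id": [2950],
        "tr_date": ["25-02-2017 00:00"],
        "product_class": ["medium"],
        "product_size": ["medium"],
    }
)

renames = {
    "tr_id": "transaction_id",
    "p_id": "product_id",
    "c_id": "customer_id",
    "tr_date": "transaction_date",
}

# errors="raise" makes rename fail with a KeyError if any key in `renames`
# is not an existing column, instead of silently ignoring it.
cleaned = sample.drop(columns=["product_class", "product_size"]).rename(
    columns=renames, errors="raise"
)
```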

In [27]:
data.to_csv("cleaned_dataset.csv", index=False)
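
As a quick sanity check, the saved file can be read back to confirm that the column names survived and that `index=False` kept the row index out of the file. This is a minimal sketch that uses a temporary directory and a small hypothetical frame rather than the real `cleaned_dataset.csv`.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame using two of the cleaned column names.
sample = pd.DataFrame({"transaction_id": [1, 2], "list_price": [71.49, 2091.47]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_dataset.csv")

    # index=False omits the row index, so no extra unnamed column
    # appears when the file is read back.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare the column names before and after the round trip.
same_columns = list(reloaded.columns) == list(sample.columns)
```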