# Assignment 3
## Student Information
### Name: Tanzim Nawaz
### ID: 11834685

## Problem Statement
The task, based on my student ID, is to perform Cleansing 6 - Replace missing or generic values. I will use Python with the pandas library for this task. The task involves demonstrating that I can use different pandas functions to handle missing and generic values in large datasets. The primary objective is to show the implementation of the following pandas functions:

- replace(): For substituting specific values or patterns
- fillna(): For filling missing values with specified values
- ffill(): For forward filling missing values
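
To make the difference between the three functions concrete, here is a minimal sketch on a tiny, hypothetical three-row frame (the `toy` frame and its values are illustrative assumptions, not the assignment dataset). The fill values are hard-coded here only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "payment_method": ["Cash", "UNKNOWN", "PayPal"],
    "customer_region": ["North", np.nan, "East"],
    "customer_rating": [4.0, np.nan, np.nan],
})

# replace(): swap a specific (generic) value for another value
toy["payment_method"] = toy["payment_method"].replace("UNKNOWN", "Cash")

# fillna(): fill NaN entries with a chosen value
toy["customer_region"] = toy["customer_region"].fillna("North")

# ffill(): propagate the last valid value forward into the NaN entries
toy["customer_rating"] = toy["customer_rating"].ffill()

print(toy)
```

Note that replace() targets values that are present but uninformative, while fillna() and ffill() only act on entries pandas already treats as missing.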

## Input Data Description 

I have opted to create and use a synthetic e-commerce dataset for this assignment to keep the plagiarism percentage low. For this purpose, I will create a dataset containing 10,000 records with different types of missing and erroneous data that may be found in real-world conditions.

The dataset will have the following features:
- transaction_date: Date of transaction (time series data)
- transaction_id: Unique transaction identifier (TN_000001 to TN_010000)
- customer_id: Customer identifier (CI_00001 to CI_10000)
- product_category: Product categories (Electronics, Clothing, Books, etc.)
- purchase_amount: Transaction amount in USD ($10 - $2000)
- customer_region: Geographic regions (North, South, East, West, Central)
- payment_method: Payment methods (Credit Card, Debit Card, PayPal, etc.)
- customer_rating: Customer satisfaction rating (1-5 stars)

## Methodology

The very first step is to create the synthetic data and introduce missing data to it. After that, the cleansing process can start.

First, I will show how to use the replace() function to replace the generic placeholder value 'UNKNOWN' in the payment_method column with the most common payment method.

Next, I will show how to use the fillna() function to replace NaN values in the customer_region column with the most common region.

Finally, I will show how to use the ffill() function to fill a run of consecutive missing values in the customer_rating column with the last valid rating before the gap.
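
Forward filling depends on row order, because each gap is filled from the row directly above it. The dataset in this assignment is built from an ordered date range, so it is already chronological. The sketch below, on a small hypothetical `events` frame, illustrates what could happen if the rows were not sorted by date first.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately unsorted data
events = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2023-01-03", "2023-01-01", "2023-01-02"]),
    "customer_rating": [np.nan, 4.0, 2.0],
})

# Without sorting, the NaN sits in the first row and has nothing to fill from
unsorted_fill = events["customer_rating"].ffill()

# After sorting by date, the NaN is the latest row and takes the previous rating
ordered = events.sort_values("transaction_date")
ordered_fill = ordered["customer_rating"].ffill()

print(unsorted_fill.isna().sum())  # the gap is still there
print(ordered_fill.tolist())
```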

## Code Implementation

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)

In the following section, I am creating the base dataset.

In [2]:
# Prepare evenly spaced timestamps for 10,000 records
dataset_size = 10000
date_range = pd.date_range(start='2023-01-01', end='2024-12-31', periods=dataset_size).round('s')

# Prepare the transaction ids, customer ids and the other possible categorical data
transaction_id = [f"TN_{i:06d}" for i in range(1, dataset_size + 1)]
customer_ids = [f"CI_{i:05d}" for i in range(1, dataset_size + 1)]
product_categories = ['Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports', 'Beauty', 'Toys']
regions = ['North', 'South', 'East', 'West', 'Central']
payment_methods = ['Credit Card', 'Debit Card', 'PayPal', 'Bank Transfer', 'Cash']

# Generate the main dataset
sales_data = pd.DataFrame({
    'transaction_date': date_range,
    'transaction_id': transaction_id,
    'customer_id': np.random.choice(customer_ids, dataset_size),
    'product_category': np.random.choice(product_categories, dataset_size),
    'purchase_amount': np.round(np.random.uniform(10, 2000, dataset_size), 2),
    'customer_region': np.random.choice(regions, dataset_size),
    'payment_method': np.random.choice(payment_methods, dataset_size),
    'customer_rating': np.random.randint(1, 6, dataset_size)
})

In the following section, I am introducing different types of missing and problematic values to simulate real-world scenarios.

In [3]:
# Missing customer regions
missing_regions = np.random.choice(sales_data.index, size=1000, replace=False)
sales_data.loc[missing_regions, 'customer_region'] = np.nan

# Generic values that need replacement
placeholder_indices = np.random.choice(sales_data.index, size=500, replace=False)
sales_data.loc[placeholder_indices, 'payment_method'] = 'UNKNOWN'

# Consecutive missing values (simulating system downtime)
# .loc slicing is label-based and inclusive, so rows 9749 to 9998 (250 rows) are blanked
sales_data.loc[9749:9998, 'customer_rating'] = np.nan

sales_data

Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
0,2023-01-01 00:00:00,TN_000001,CI_07271,Electronics,1926.60,Central,Debit Card,4.0
1,2023-01-01 01:45:08,TN_000002,CI_00861,Toys,254.52,Central,Cash,1.0
2,2023-01-01 03:30:16,TN_000003,CI_05391,Electronics,517.22,,PayPal,4.0
3,2023-01-01 05:15:23,TN_000004,CI_05192,Electronics,1629.12,North,PayPal,4.0
4,2023-01-01 07:00:31,TN_000005,CI_05735,Electronics,1241.01,West,Bank Transfer,4.0
...,...,...,...,...,...,...,...,...
9995,2024-12-30 16:59:29,TN_009996,CI_04678,Electronics,623.43,,Cash,
9996,2024-12-30 18:44:37,TN_009997,CI_02219,Books,426.81,Central,Debit Card,
9997,2024-12-30 20:29:44,TN_009998,CI_01390,Toys,782.22,,UNKNOWN,
9998,2024-12-30 22:14:52,TN_009999,CI_04277,Books,1217.55,East,Debit Card,
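
Before cleansing, it can help to confirm how many problem values each column contains. The sketch below is one possible way to do that; `audit_missing` is a hypothetical helper written for illustration, and the `demo` frame stands in for `sales_data`.

```python
import numpy as np
import pandas as pd

def audit_missing(df, placeholders=("UNKNOWN",)):
    """Count NaN entries and placeholder entries per column (hypothetical helper)."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "n_placeholder": df.isin(list(placeholders)).sum(),
    })

# Small stand-in for sales_data
demo = pd.DataFrame({
    "customer_region": ["North", np.nan, "East", np.nan],
    "payment_method": ["Cash", "UNKNOWN", "PayPal", "Cash"],
    "customer_rating": [5.0, np.nan, np.nan, 2.0],
})

report = audit_missing(demo)
print(report)
```

Applied to `sales_data` at this point, such a report would be expected to show the 1000 missing regions, 500 placeholder payment methods and 250 missing ratings introduced above.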


The following code section demonstrates how to use replace() to substitute the 'UNKNOWN' placeholder in the payment_method column with the most common payment method.

In [4]:
print('5 data points showing missing Payment Method')
indices = sales_data[sales_data['payment_method'] == 'UNKNOWN'].head().index
display(sales_data.loc[indices])

print('Missing Payment Method before cleansing:',sales_data[sales_data['payment_method'] == 'UNKNOWN'].shape[0])

most_common_payment = sales_data['payment_method'].mode()[0]
sales_data['payment_method'] = sales_data['payment_method'].replace('UNKNOWN', most_common_payment)

print('Missing Payment Method after cleansing:',sales_data[sales_data['payment_method'] == 'UNKNOWN'].shape[0])
print('Most common Payment Method:', most_common_payment)

print('\nSame 5 data points showing imputed Payment Method')
display(sales_data.loc[indices])

5 data points showing missing Payment Method


Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
38,2023-01-03 18:34:58,TN_000039,CI_03891,Sports,1361.35,South,UNKNOWN,3.0
41,2023-01-03 23:50:21,TN_000042,CI_08793,Home & Garden,1724.65,West,UNKNOWN,4.0
61,2023-01-05 10:52:58,TN_000062,CI_09693,Sports,707.93,Central,UNKNOWN,1.0
116,2023-01-09 11:15:08,TN_000117,CI_06911,Beauty,538.29,South,UNKNOWN,3.0
133,2023-01-10 17:02:21,TN_000134,CI_01758,Books,1631.44,Central,UNKNOWN,4.0


Missing Payment Method before cleansing: 500
Missing Payment Method after cleansing: 0
Most common Payment Method: Debit Card

Same 5 data points showing imputed Payment Method


Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
38,2023-01-03 18:34:58,TN_000039,CI_03891,Sports,1361.35,South,Debit Card,3.0
41,2023-01-03 23:50:21,TN_000042,CI_08793,Home & Garden,1724.65,West,Debit Card,4.0
61,2023-01-05 10:52:58,TN_000062,CI_09693,Sports,707.93,Central,Debit Card,1.0
116,2023-01-09 11:15:08,TN_000117,CI_06911,Beauty,538.29,South,Debit Card,3.0
133,2023-01-10 17:02:21,TN_000134,CI_01758,Books,1631.44,Central,Debit Card,4.0
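
One caveat when imputing with the mode: if the placeholder were the most frequent value in a column, a mode taken over the whole column would be the placeholder itself. This does not happen in this dataset, where the mode is Debit Card, but a slightly more defensive variant is sketched below. The `payments` series and the extra `"N/A"` placeholder are hypothetical.

```python
import pandas as pd

# Hypothetical column where the placeholder is the most frequent value
payments = pd.Series(["UNKNOWN", "Cash", "UNKNOWN", "PayPal", "N/A", "UNKNOWN", "Cash"])
placeholders = ["UNKNOWN", "N/A"]  # assumed list of generic values

# Compute the mode over valid values only
valid_mode = payments[~payments.isin(placeholders)].mode()[0]

# replace() accepts a list of values to substitute with a single value
cleaned = payments.replace(placeholders, valid_mode)

print(payments.mode()[0])  # the placeholder wins if it is not excluded
print(valid_mode)
print(cleaned.tolist())
```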


The following code section demonstrates how to use fillna() to fill the missing values in the customer_region column with the most common region.

In [5]:
print('5 data points showing missing Region')
indices = sales_data[sales_data['customer_region'].isnull()].head().index
display(sales_data.loc[indices])

print('Missing Region before cleansing',sales_data['customer_region'].isnull().sum())

region_mode = sales_data['customer_region'].mode()[0]
sales_data['customer_region'] = sales_data['customer_region'].fillna(region_mode)

print('Missing Region after cleansing',sales_data['customer_region'].isnull().sum())
print('Most common Region:', region_mode)

print('\nSame 5 data points showing imputed Region')
display(sales_data.loc[indices])

5 data points showing missing Region


Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
2,2023-01-01 03:30:16,TN_000003,CI_05391,Electronics,517.22,,PayPal,4.0
21,2023-01-02 12:47:44,TN_000022,CI_08667,Books,899.2,,Bank Transfer,4.0
31,2023-01-03 06:19:03,TN_000032,CI_03006,Sports,1186.99,,Debit Card,4.0
36,2023-01-03 15:04:42,TN_000037,CI_01529,Books,1236.42,,PayPal,5.0
40,2023-01-03 22:05:13,TN_000041,CI_05394,Clothing,864.39,,PayPal,3.0


Missing Region before cleansing 1000
Missing Region after cleansing 0
Most common Region: South

Same 5 data points showing imputed Region


Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
2,2023-01-01 03:30:16,TN_000003,CI_05391,Electronics,517.22,South,PayPal,4.0
21,2023-01-02 12:47:44,TN_000022,CI_08667,Books,899.2,South,Bank Transfer,4.0
31,2023-01-03 06:19:03,TN_000032,CI_03006,Sports,1186.99,South,Debit Card,4.0
36,2023-01-03 15:04:42,TN_000037,CI_01529,Books,1236.42,South,PayPal,5.0
40,2023-01-03 22:05:13,TN_000041,CI_05394,Clothing,864.39,South,PayPal,3.0
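
fillna() can also be called on a whole DataFrame with a dictionary that maps each column to its own fill value. The sketch below assumes a hypothetical frame in which a numeric column has gaps as well; in this assignment only customer_region is filled this way, and the choice of the median for the numeric column is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a categorical and a numeric column
demo = pd.DataFrame({
    "customer_region": ["South", np.nan, "South", "West"],
    "purchase_amount": [100.0, 250.0, np.nan, 50.0],
})

# One fill value per column: mode for the category, median for the amount
fill_values = {
    "customer_region": demo["customer_region"].mode()[0],
    "purchase_amount": demo["purchase_amount"].median(),
}

filled = demo.fillna(value=fill_values)
print(filled)
```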


The following code section demonstrates how to use ffill() to fill the run of consecutive missing values in the customer_rating column.

In [6]:
print('5 data points showing missing Customer Rating')
indices = sales_data[sales_data['customer_rating'].isnull()].head().index
display(sales_data.loc[indices])

print('Missing Customer Rating before cleansing',sales_data['customer_rating'].isnull().sum())

sales_data['customer_rating'] = sales_data['customer_rating'].ffill()

print('Missing Customer Rating after cleansing',sales_data['customer_rating'].isnull().sum())

print('\nSame 5 data points showing imputed Customer Rating')
display(sales_data.loc[indices])

5 data points showing missing Customer Rating


Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
9749,2024-12-12 17:57:22,TN_009750,CI_06855,Beauty,246.82,North,Cash,
9750,2024-12-12 19:42:30,TN_009751,CI_00756,Electronics,951.47,North,Debit Card,
9751,2024-12-12 21:27:38,TN_009752,CI_09639,Books,579.58,North,Bank Transfer,
9752,2024-12-12 23:12:46,TN_009753,CI_04932,Clothing,1789.25,South,Debit Card,
9753,2024-12-13 00:57:54,TN_009754,CI_09856,Clothing,370.77,North,Debit Card,


Missing Customer Rating before cleansing 250
Missing Customer Rating after cleansing 0

Same 5 data points showing imputed Customer Rating


Unnamed: 0,transaction_date,transaction_id,customer_id,product_category,purchase_amount,customer_region,payment_method,customer_rating
9749,2024-12-12 17:57:22,TN_009750,CI_06855,Beauty,246.82,North,Cash,3.0
9750,2024-12-12 19:42:30,TN_009751,CI_00756,Electronics,951.47,North,Debit Card,3.0
9751,2024-12-12 21:27:38,TN_009752,CI_09639,Books,579.58,North,Bank Transfer,3.0
9752,2024-12-12 23:12:46,TN_009753,CI_04932,Clothing,1789.25,South,Debit Card,3.0
9753,2024-12-13 00:57:54,TN_009754,CI_09856,Clothing,370.77,North,Debit Card,3.0
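
Because the gap is one long consecutive run, ffill() copies the same rating into every row of the gap. If that is undesirable, the `limit` parameter of ffill() caps how many consecutive NaN entries are filled. A minimal sketch on a hypothetical `ratings` series:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with a run of three missing values
ratings = pd.Series([3.0, np.nan, np.nan, np.nan, 5.0])

# Unlimited forward fill: every gap row receives the last valid rating
full = ratings.ffill()

# Capped forward fill: at most two consecutive NaN entries are filled
capped = ratings.ffill(limit=2)

print(full.tolist())
print(capped.isna().sum())  # rows left unfilled
```

Rows left unfilled by a capped forward fill would still need another strategy, such as fillna() with a default value.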


## Output and Result Interpretation

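In this assignment, I used three pandas functions to replace or impute missing and generic values in the synthetic dataset:

- replace(): All 500 'UNKNOWN' payment methods were replaced with the most common payment method (Debit Card).
- fillna(): All 1000 missing customer regions were filled with the most common region (South).
- ffill(): All 250 consecutive missing customer ratings were filled with the last valid rating before the gap (3.0).

After cleansing, the counts of missing and placeholder values in these three columns are all 0.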