**Step 1: Loading the Dataset**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("/content/Data/customer_transactions.csv")

# Display the first few rows
df.head()


   customer_id_legacy  transaction_id  purchase_amount  purchase_date  product_category  customer_rating
0                 151            1001              408     2024-01-01            Sports              2.3
1                 192            1002              332     2024-01-02       Electronics              4.2
2                 114            1003              442     2024-01-03       Electronics              2.1
3                 171            1004              256     2024-01-04          Clothing              2.8
4                 160            1005               64     2024-01-05          Clothing              1.3
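By default `pd.read_csv` reads `purchase_date` as plain strings. The sketch below shows one way to parse it as a datetime at load time. It uses an inline sample that mirrors the first rows shown above instead of the real file, and the names `sample_csv` and `sample_df` are illustrative only.

```python
import io
import pandas as pd

# Inline sample mirroring the first rows shown above (a stand-in for the real CSV file).
sample_csv = io.StringIO(
    "customer_id_legacy,transaction_id,purchase_amount,purchase_date,product_category,customer_rating\n"
    "151,1001,408,2024-01-01,Sports,2.3\n"
    "192,1002,332,2024-01-02,Electronics,4.2\n"
    "114,1003,442,2024-01-03,Electronics,2.1\n"
)

# parse_dates converts the named column to datetime64 while loading.
sample_df = pd.read_csv(sample_csv, parse_dates=["purchase_date"])

print(sample_df.dtypes)
```

With a datetime column, date-based operations such as sorting or resampling by month become available without a separate conversion step.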


**Step 2: Data Cleaning & Handling Missing Values**

In [2]:
# Check missing values
df.isnull().sum()


customer_id_legacy     0
transaction_id         0
purchase_amount        0
purchase_date          0
product_category       0
customer_rating       10


From the results above, the only column with missing values is customer_rating (10 missing values). To handle this, I will fill the missing values with the column mean.

Handling the Missing Values

In [4]:
# Filling missing values in 'customer_rating' with the mean
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].mean())

# Verifying that there are no more missing values
print(df.isnull().sum())

customer_id_legacy    0
transaction_id        0
purchase_amount       0
purchase_date         0
product_category      0
customer_rating       0
dtype: int64


The output confirms that no missing values remain.
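Mean imputation keeps the column mean unchanged but reduces its variance, and it hides which ratings were originally missing. A minimal sketch of one way to keep that information, using a toy frame with made-up values (the `rating_was_missing` column is a hypothetical name, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy ratings with two gaps (illustrative values, not taken from the dataset).
toy = pd.DataFrame({"customer_rating": [2.0, 4.0, np.nan, 3.0, np.nan]})

# Record which rows were missing before filling them.
toy["rating_was_missing"] = toy["customer_rating"].isna()

# Series.mean() skips NaN values by default.
mean_before = toy["customer_rating"].mean()
toy["customer_rating"] = toy["customer_rating"].fillna(mean_before)

print(toy)
```

The indicator column lets a later analysis exclude or separately model the imputed ratings.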

**Step 3: Data Augmentation Strategies**

Synthetic Data Generation

In [8]:
print(df[['purchase_amount']].head())

   purchase_amount
0       409.219845
1       330.732267
2       433.780366
3       255.158850
4        63.318647


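Next, apply Gaussian noise (mean 0, standard deviation 5) to the purchase amounts. Note that the cell below overwrites purchase_amount in place and uses no random seed, so each rerun adds another layer of noise. This is why the values printed above already differ slightly from the integers in the raw file.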

In [9]:
# Adding random noise to transaction amounts
df['purchase_amount'] = df['purchase_amount'] + np.random.normal(loc=0, scale=5, size=len(df))

# Verifying the changes
print(df[['purchase_amount']].head())

   purchase_amount
0       402.289113
1       329.927391
2       435.177691
3       252.946658
4        60.485287
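Because the noise is unseeded and written back into the same column, the result changes on every run. The sketch below is one possible reproducible variant: it uses a seeded generator and returns a new Series instead of overwriting the input. The helper name `add_noise`, the seed value, and the clipping at zero are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_noise(amounts: pd.Series, scale: float = 5.0, seed: int = 42) -> pd.Series:
    """Return a noisy copy of `amounts`; the input Series is left unchanged."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=scale, size=len(amounts))
    # Clip at zero so a small amount cannot turn negative (an assumption about the data).
    return (amounts + noise).clip(lower=0)

amounts = pd.Series([408.0, 332.0, 442.0, 256.0, 64.0], name="purchase_amount")
noisy_a = add_noise(amounts)
noisy_b = add_noise(amounts)

# Same seed, same input: the two results are identical.
print(noisy_a.equals(noisy_b))
```

Storing the result in a separate column (for example a hypothetical `purchase_amount_noisy`) would keep the original amounts available for comparison.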


Balancing the Dataset Using SMOTE

In [12]:
from imblearn.over_sampling import SMOTE

# Defining features (X) and target (y)
X = df[['purchase_amount']]  # Numerical feature
y = df['product_category']  # Categorical target

# Applying SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Creating new balanced DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=['purchase_amount'])
df_resampled['product_category'] = y_resampled

# Verifying class distribution
print(df_resampled['product_category'].value_counts())


product_category
Sports         35
Electronics    35
Clothing       35
Groceries      35
Books          35
Name: count, dtype: int64
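SMOTE balances the classes by creating new minority samples on the line segment between an existing sample and one of its nearest neighbors in the same class. The following is a simplified one-dimensional illustration of that interpolation step, not the imbalanced-learn implementation; the function name `interpolate_minority` and the sample amounts are made up for the example.

```python
import numpy as np

def interpolate_minority(values, n_new: int, k: int = 2, seed: int = 42) -> np.ndarray:
    """Simplified 1-D SMOTE-style sampling: x_new = x_i + u * (x_neighbor - x_i), u in [0, 1)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    synthetic = np.empty(n_new)
    for j in range(n_new):
        i = rng.integers(len(values))
        # Distances from the chosen sample to every sample in the same class.
        dist = np.abs(values - values[i])
        dist[i] = np.inf  # exclude the sample itself
        neighbors = np.argsort(dist)[:k]
        neighbor = values[rng.choice(neighbors)]
        # Place the new point somewhere between the sample and its neighbor.
        synthetic[j] = values[i] + rng.random() * (neighbor - values[i])
    return synthetic

minority = np.array([60.0, 75.0, 90.0, 120.0])  # illustrative amounts for one category
new_points = interpolate_minority(minority, n_new=6)
print(new_points)
```

Since every synthetic value is interpolated between two existing ones, the new points never fall outside the observed range of that class. With a single feature such as purchase_amount, this means SMOTE adds rows but no new information about how categories differ.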


Applying Log Transformation (Feature Value Transformation)

In [13]:
# Applying log transformation to 'purchase_amount' to reduce skewness
df['log_purchase_amount'] = np.log1p(df['purchase_amount'])

# Verifying transformation
df[['purchase_amount', 'log_purchase_amount']].head()


   purchase_amount  log_purchase_amount
0       402.289113             5.999654
1       329.927391             5.801899
2       435.177691             6.078050
3       252.946658             5.537124
4        60.485287             4.118798
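`np.log1p(x)` computes log(1 + x), so it maps 0 to 0 and can be inverted with `np.expm1`. A short sketch of the round trip, using a few of the amounts shown above plus a zero:

```python
import numpy as np

amounts = np.array([0.0, 60.485287, 252.946658, 402.289113])

logged = np.log1p(amounts)    # log(1 + x), defined for x > -1
restored = np.expm1(logged)   # exp(y) - 1, the inverse of log1p

print(np.allclose(restored, amounts))
```

One caveat: `np.log1p` returns NaN for inputs below -1, so if the noise step ever pushed an amount that low, clipping the amounts at zero before the transform would be one way to guard against it.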


Generating New Synthetic Transactions to Expand the Dataset

In [14]:
# Generating synthetic transactions by resampling rows (with replacement) and adding noise
num_synthetic = 500
synthetic_data = df.sample(num_synthetic, replace=True)

# Applying random noise
synthetic_data['purchase_amount'] += np.random.normal(0, 5, size=num_synthetic)

# Recomputing the log feature so it matches the perturbed amounts
synthetic_data['log_purchase_amount'] = np.log1p(synthetic_data['purchase_amount'])

# Appending to original DataFrame
df_augmented = pd.concat([df, synthetic_data], ignore_index=True)

# Verifying new dataset size
print("Original dataset size:", len(df))
print("New dataset size after augmentation:", len(df_augmented))


Original dataset size: 150
New dataset size after augmentation: 650
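The resampled rows keep the transaction_id of the row they were copied from, and nothing in the augmented frame distinguishes them from real transactions. The sketch below shows one way to address both points on a toy frame; the `is_synthetic` column and the ID scheme (continuing after the largest original ID) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the cleaned dataset.
base = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "purchase_amount": [408.0, 332.0, 442.0],
})

num_new = 5  # illustrative; the cell above uses 500
synthetic = base.sample(num_new, replace=True, random_state=42).reset_index(drop=True)
synthetic["purchase_amount"] += rng.normal(0, 5, size=num_new)

# Give synthetic rows fresh IDs that continue after the largest original ID.
start = base["transaction_id"].max() + 1
synthetic["transaction_id"] = np.arange(start, start + num_new)

# Flag the origin of each row so synthetic records can be filtered out later.
base["is_synthetic"] = False
synthetic["is_synthetic"] = True

augmented = pd.concat([base, synthetic], ignore_index=True)
print(augmented)
```

Filtering on `is_synthetic` then recovers the original rows, which is useful when evaluation should run on real transactions only.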


Now that the dataset has been expanded from 150 to 650 rows, let's save it!

In [15]:
# Saving cleaned & augmented dataset
df_augmented.to_csv("customer_transactions_augmented.csv", index=False)
