<a href="https://colab.research.google.com/github/Joh-Ishimwe/Data-Preprocessing/blob/master/Formative_2_Data_Preprocessing_Assignment_for_Machine_Learning_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Formative 2: Data Preprocessing Assignment for Machine Learning Pipeline
**Team Members**:  

*   Liliane Kayitesi
*   Ines Ikirezi
*   Josiane Ishimwe


**Group Number**: 7  

---
## Part 1: Data Augmentation on CSV Files


# Step 1: Load the dataset

In [68]:
import numpy as np
import pandas as pd

# Load the dataset
path = '/content/customer_transactions.csv'
df = pd.read_csv(path)

# Step 2: Data Cleaning & Handling Missing Values

In [69]:
# Check for missing values
print(df.isnull().sum())

customer_id_legacy     0
transaction_id         0
purchase_amount        0
purchase_date          0
product_category       0
customer_rating       10
dtype: int64


In [70]:
# Fill missing customer_rating values with the column median.
# Assign the result back rather than calling fillna(..., inplace=True) on a column
# selection, which pandas warns will stop updating the original DataFrame in 3.0.
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].median())


In [72]:
# Check for missing values again
print(df.isnull().sum())

customer_id_legacy    0
transaction_id        0
purchase_amount       0
purchase_date         0
product_category      0
customer_rating       0
dtype: int64
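
The imputation step can be checked on a small frame. This is a minimal sketch, assuming made-up ratings (the `toy` values are illustrative and not taken from `customer_transactions.csv`); it shows the median being computed over the observed values only and then assigned back to the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative, not from the real dataset
toy = pd.DataFrame({"customer_rating": [4.0, np.nan, 2.0, 5.0, np.nan]})

# median() skips NaN by default, so only the observed ratings [4, 2, 5] are used
median_rating = toy["customer_rating"].median()

# Assign the filled column back instead of using inplace=True on a selection
toy["customer_rating"] = toy["customer_rating"].fillna(median_rating)

print(median_rating)                          # 4.0
print(toy["customer_rating"].isnull().sum())  # 0
```

The median is less sensitive to extreme ratings than the mean, which is presumably why it was chosen here.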


# Step 3: Data Augmentation Strategies

In [73]:
# 1. Add random noise to purchase_amount
noise = np.random.normal(0, 10, size=len(df))
df['purchase_amount'] = df['purchase_amount'] + noise

In [74]:
# 2. Apply log transformation to purchase_amount
df['log_purchase_amount'] = np.log1p(df['purchase_amount'])
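
One thing to watch in the two cells above: Gaussian noise with a standard deviation of 10 can push small purchase amounts below zero, and `np.log1p` is undefined at or below -1. The sketch below is one possible guard, not something the notebook does: it assumes that clipping at zero is acceptable for this data, and it uses a seeded generator (seed chosen arbitrarily) so the noise is reproducible. The `toy` amounts are hypothetical.

```python
import numpy as np
import pandas as pd

# Seeded generator so the noise is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)

# Hypothetical amounts, including small ones that noise could push below zero
toy = pd.DataFrame({"purchase_amount": [5.0, 12.0, 150.0, 480.0]})

noise = rng.normal(0, 10, size=len(toy))

# Assumption: clipping at zero is acceptable, so log1p always gets a valid input
toy["purchase_amount"] = (toy["purchase_amount"] + noise).clip(lower=0)
toy["log_purchase_amount"] = np.log1p(toy["purchase_amount"])

print(toy)
```

If negative amounts are meaningful in the real data (refunds, for example), clipping would be the wrong choice and a different transform would be needed.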

In [75]:
# Create a binary target column (e.g., high vs. low purchase amount)
df['target'] = (df['purchase_amount'] > df['purchase_amount'].median()).astype(int)

# Check the distribution of the target variable
print(df['target'].value_counts())

target
1    75
0    75
Name: count, dtype: int64


# Synthetic Data Generation

In [76]:
# 3. Synthetic Data Generation
def generate_synthetic_data(df, num_samples=100):
    synthetic_data = df.sample(n=num_samples, replace=True)
    # Add variations (adjust as needed)
    synthetic_data['purchase_amount'] *= np.random.uniform(0.9, 1.1, size=num_samples)
    synthetic_data['product_category'] = np.random.choice(df['product_category'].unique(), size=num_samples)
    # Ensure 'purchase_date' is datetime before adding timedelta
    synthetic_data['purchase_date'] = pd.to_datetime(synthetic_data['purchase_date'], errors='coerce')
    # Shift each date by -3 to +3 days; randint's upper bound is exclusive, hence 4
    synthetic_data['purchase_date'] = synthetic_data['purchase_date'] + pd.to_timedelta(np.random.randint(-3, 4, size=num_samples), unit='days')
    synthetic_data['customer_rating'] += np.random.uniform(-0.2, 0.2, size=num_samples)

    # Clip customer_rating to be within the original range
    synthetic_data['customer_rating'] = synthetic_data['customer_rating'].clip(lower=df['customer_rating'].min(), upper=df['customer_rating'].max())
    return synthetic_data

synthetic_data = generate_synthetic_data(df)
df_augmented = pd.concat([df, synthetic_data], ignore_index=True)
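
Because the sampled rows have `purchase_amount` rescaled after `log_purchase_amount` and `target` were computed, those two columns no longer match the new amounts in the synthetic rows. A minimal sketch of one way to bring them back in sync is shown below. The helper name `refresh_derived_columns` is hypothetical, and recomputing the median threshold on the frame passed in (rather than reusing the original median) is an assumption.

```python
import numpy as np
import pandas as pd

def refresh_derived_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Recompute columns derived from purchase_amount (hypothetical helper)."""
    out = frame.copy()
    out["log_purchase_amount"] = np.log1p(out["purchase_amount"])
    # Assumption: the high/low threshold is the median of the frame passed in
    threshold = out["purchase_amount"].median()
    out["target"] = (out["purchase_amount"] > threshold).astype(int)
    return out

# Hypothetical rows whose derived columns are stale relative to purchase_amount
toy = pd.DataFrame({
    "purchase_amount": [10.0, 20.0, 30.0, 40.0],
    "log_purchase_amount": [0.0, 0.0, 0.0, 0.0],
    "target": [1, 1, 1, 1],
})

refreshed = refresh_derived_columns(toy)
print(refreshed["target"].tolist())  # [0, 0, 1, 1]
```

In the notebook this would be applied to `df_augmented` after the concatenation, if keeping the derived columns consistent is the intent.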

# Feature Engineering

In [77]:
# 1. Extract purchase_month
df_augmented['purchase_month'] = pd.to_datetime(df_augmented['purchase_date']).dt.month

# 2. Calculate avg_purchase_amount per customer
avg_purchase = df_augmented.groupby('customer_id_legacy')['purchase_amount'].mean().reset_index()
avg_purchase.columns = ['customer_id_legacy', 'avg_purchase_amount']
df_augmented = pd.merge(df_augmented, avg_purchase, on='customer_id_legacy', how='left')
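
The group-then-merge pattern above can also be written with `groupby(...).transform`, which returns a result aligned to the original rows and so needs no merge. This is an equivalent sketch on hypothetical toy data, not a change the notebook requires.

```python
import pandas as pd

# Hypothetical toy data with two purchases for one customer and one for another
toy = pd.DataFrame({
    "customer_id_legacy": [101, 101, 102],
    "purchase_amount": [100.0, 200.0, 50.0],
})

# transform("mean") broadcasts each customer's mean back onto that customer's rows
toy["avg_purchase_amount"] = (
    toy.groupby("customer_id_legacy")["purchase_amount"].transform("mean")
)

print(toy["avg_purchase_amount"].tolist())  # [150.0, 150.0, 50.0]
```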

In [78]:
# 3. Calculate days_since_last_purchase
# Convert to datetime before sorting so dates are ordered chronologically
# rather than as a mix of strings and timestamps
df_augmented['purchase_date'] = pd.to_datetime(df_augmented['purchase_date'], errors='coerce')
df_augmented.dropna(subset=['purchase_date'], inplace=True)
df_augmented.sort_values(by=['customer_id_legacy', 'purchase_date'], inplace=True)
df_augmented['days_since_last_purchase'] = df_augmented.groupby('customer_id_legacy')['purchase_date'].diff().dt.days
# A customer's first purchase has no predecessor, so its gap is set to 0.
# Assign back instead of calling fillna(..., inplace=True) on a column selection.
df_augmented['days_since_last_purchase'] = df_augmented['days_since_last_purchase'].fillna(0)

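To make the `days_since_last_purchase` logic concrete, here is a small worked sketch on hypothetical dates. It follows the same sequence as the cell above: parse the dates, sort within each customer, take the per-customer difference, and fill the first purchase of each customer with 0.

```python
import pandas as pd

# Hypothetical toy data: three purchases for customer 101, one for customer 102
toy = pd.DataFrame({
    "customer_id_legacy": [101, 102, 101, 101],
    "purchase_date": ["2024-01-10", "2024-01-05", "2024-01-01", "2024-01-04"],
})

# Parse first, then sort, so ordering is chronological within each customer
toy["purchase_date"] = pd.to_datetime(toy["purchase_date"], errors="coerce")
toy = toy.sort_values(by=["customer_id_legacy", "purchase_date"])

# diff() within each group gives the gap to that customer's previous purchase
toy["days_since_last_purchase"] = (
    toy.groupby("customer_id_legacy")["purchase_date"].diff().dt.days
)
toy["days_since_last_purchase"] = toy["days_since_last_purchase"].fillna(0)

print(toy["days_since_last_purchase"].tolist())  # [0.0, 3.0, 6.0, 0.0]
```

Customer 101 bought on January 1, 4, and 10, giving gaps of 0, 3, and 6 days; customer 102 has a single purchase, so its gap is 0.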

# Export the Augmented Dataset

In [79]:
# Save the augmented dataset
df_augmented.to_csv('customer_transactions_augmented.csv', index=False)
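
When the exported file is read back later, date columns come back as plain strings unless they are parsed explicitly. The sketch below illustrates this round trip on hypothetical toy data, writing to an in-memory buffer instead of a file so that nothing on disk is touched.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the augmented dataset
toy = pd.DataFrame({
    "customer_id_legacy": [101, 102],
    "purchase_date": pd.to_datetime(["2024-01-01", "2024-01-05"]),
})

# Write to an in-memory buffer; index=False keeps the row index out of the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV text cannot carry by itself
restored = pd.read_csv(buffer, parse_dates=["purchase_date"])

print(restored.dtypes)
```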