# Week 9: Data Cleaning - Guided Practice

This notebook provides a hands-on opportunity to practice the data cleaning techniques we learned in the main workshop. You will work with a new dataset and prepare it for both descriptive and predictive analysis.

## Your Mission: Analyse Customer Transactions

You are a Data Analyst at an e-commerce company. You have been given a dataset of customer transactions and asked to prepare it for analysis.

**Your Goal:**
- Create a `df_desc` version for descriptive analysis (e.g., for a sales dashboard).
- Create a `df_pred` version for predictive modeling (e.g., to predict fraudulent transactions).

## Step 1: Load and Explore the Data

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df_raw = pd.read_csv("customer_transactions.csv")

print("Practice dataset loaded successfully!")
df_raw.head()

**Task:** Get a quick overview of the data types and missing values.

In [None]:
# Your code here
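One possible approach is sketched below. It assumes a small, made-up `df_demo` frame standing in for `df_raw` (the values are hypothetical and not taken from `customer_transactions.csv`); the same two calls should work unchanged on `df_raw`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_raw; values are invented for illustration
df_demo = pd.DataFrame({
    "item_price": [19.99, 5.50, np.nan, 120.00],
    "payment_method": ["Credit Card", None, "PayPal", "Credit Card"],
})

# Column dtypes and non-null counts in a single view
df_demo.info()

# Number of missing values per column
missing_counts = df_demo.isnull().sum()
print(missing_counts)
```

`info()` shows dtypes and non-null counts together, while `isnull().sum()` makes the number of gaps per column explicit.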


**Task:** Get a statistical summary of the numerical columns.

In [None]:
# Your code here
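A minimal sketch of the summary step, again using a hypothetical `df_demo` frame in place of `df_raw`. By default `describe()` reports only numeric columns, so the text column is left out of the result.

```python
import pandas as pd

# Hypothetical stand-in for df_raw; values are invented for illustration
df_demo = pd.DataFrame({
    "item_price": [10.0, 20.0, 30.0, 40.0],
    "customer_age": [25, 35, 45, 55],
    "payment_method": ["Credit Card", "PayPal", "PayPal", "Cash"],
})

# Count, mean, std, min, quartiles and max for each numeric column
summary = df_demo.describe()
print(summary)
```

Comparing `mean` against the `50%` row and `max` against the `75%` row is a quick way to spot skew or extreme values before any cleaning.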


## Part 1: Cleaning for Descriptive Analytics

In [None]:
df_desc = df_raw.copy()

### 1.1 Clean Categorical Columns

**Task:** Check the `product_category` and `payment_method` columns for typos or inconsistencies and clean them.

In [None]:
# Check product_category
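The sketch below shows one way to inspect and tidy a categorical column. The category labels and the typo `Elecronics` are invented for illustration; the real inconsistencies in `product_category` and `payment_method` have to be read off the `value_counts()` output first.

```python
import pandas as pd

# Hypothetical messy categories; the real values must be inspected in df_desc
df_demo = pd.DataFrame({
    "product_category": ["Electronics", "electronics ", "Elecronics", "Books", " books"],
})

# Step 1: list every distinct label, including missing ones
print(df_demo["product_category"].value_counts(dropna=False))

# Step 2: normalise whitespace and capitalisation
cleaned = df_demo["product_category"].str.strip().str.title()

# Step 3: map known typos to the intended label (mapping is an assumption)
cleaned = cleaned.replace({"Elecronics": "Electronics"})

df_demo["product_category"] = cleaned
print(df_demo["product_category"].value_counts(dropna=False))
```

The same three steps can be repeated for `payment_method` with its own typo mapping.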


### 1.2 Handle Missing Values

**Task:** For descriptive analysis, fill missing `payment_method` values with an appropriate placeholder. (We will leave the missing `customer_age` values as NaN: pandas skips NaN by default when computing statistics such as the mean, so the results are based only on the ages that were actually recorded.)

In [None]:
# Fill missing payment_method
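A minimal sketch of this step on a hypothetical `df_demo` frame. The placeholder label `"Unknown"` is an assumption; any label that cannot be confused with a real payment method would do.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_desc; values are invented for illustration
df_demo = pd.DataFrame({
    "payment_method": ["Credit Card", None, "PayPal", None],
    "customer_age": [34.0, np.nan, 51.0, 28.0],
})

# Replace missing payment methods with an explicit placeholder label
df_demo["payment_method"] = df_demo["payment_method"].fillna("Unknown")

# customer_age keeps its NaN; mean() skips it by default (skipna=True)
mean_age = df_demo["customer_age"].mean()
print(df_demo)
print(f"Mean age over recorded values: {mean_age:.2f}")
```

Keeping the placeholder as its own category means a dashboard can show how many transactions lack a payment method instead of silently dropping them.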


### 1.3 Identify Outliers

**Task:** Identify outliers in the `item_price` column for reporting purposes.

In [None]:
# Identify item_price outliers
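One common way to flag outliers is the 1.5 × IQR rule, sketched below. This is an assumption about the method; the workshop may have used a different rule (for example z-scores). The prices and the `price_outlier` column name are hypothetical.

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value
df_demo = pd.DataFrame({"item_price": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0]})

# Interquartile range and the conventional 1.5 * IQR fences
q1 = df_demo["item_price"].quantile(0.25)
q3 = df_demo["item_price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rows outside the fences; rows are only marked, not removed
price = df_demo["item_price"]
df_demo["price_outlier"] = (price < lower) | (price > upper)

print(f"Fences: [{lower}, {upper}]")
print(df_demo[df_demo["price_outlier"]])
```

For descriptive work the rows are flagged and kept, so a report can show them separately without distorting totals.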


## Part 2: Cleaning for Predictive Modeling

In [None]:
df_pred = df_raw.copy()

### 2.1 Impute Missing Values

**Task:** Impute missing values in `customer_age` and `payment_method` for predictive modeling.

In [None]:
# Your code here
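A sketch of one simple imputation strategy, assuming the median for the numeric column and the most frequent value for the categorical one. Both choices are assumptions rather than the workshop's prescribed method, and the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_pred; values are invented for illustration
df_demo = pd.DataFrame({
    "customer_age": [22.0, np.nan, 40.0, 31.0, np.nan],
    "payment_method": ["PayPal", "Credit Card", None, "Credit Card", "Credit Card"],
})

# Numeric column: the median is less sensitive to extreme ages than the mean
age_median = df_demo["customer_age"].median()
df_demo["customer_age"] = df_demo["customer_age"].fillna(age_median)

# Categorical column: fill with the most frequent category (the mode)
most_common = df_demo["payment_method"].mode()[0]
df_demo["payment_method"] = df_demo["payment_method"].fillna(most_common)

print(df_demo)
```

Unlike the descriptive version, no NaN can remain here, because most modeling libraries reject missing values.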


### 2.2 Handle Outliers

**Task:** Cap the outliers in the `item_price` column at the 99th percentile.

In [None]:
# Your code here
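Capping at a percentile can be sketched with `quantile` and `clip`, as below. The prices are hypothetical; on the real data the same two lines would be applied to `df_pred["item_price"]`.

```python
import pandas as pd

# Hypothetical prices: 1..99 plus one extreme value
df_demo = pd.DataFrame({"item_price": [float(p) for p in range(1, 100)] + [10_000.0]})

# Value below which 99% of the observations fall
cap = df_demo["item_price"].quantile(0.99)

# Replace anything above the cap with the cap itself; lower values are untouched
df_demo["item_price"] = df_demo["item_price"].clip(upper=cap)

print(f"Cap applied at: {cap}")
print(df_demo["item_price"].describe())
```

Capping keeps every row in the dataset while limiting the influence that a few extreme prices have on a model.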


### 2.3 Transform Skewed Data

**Task:** Apply a log transformation to the `item_price` column.

In [None]:
# Your code here
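A minimal sketch of the transformation, assuming `np.log1p` (which computes log(1 + x) and therefore tolerates a price of 0) rather than a plain `np.log`. The new column name `item_price_log` is hypothetical; the notebook does not say what the log column should be called.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices, including a zero
df_demo = pd.DataFrame({"item_price": [0.0, 9.0, 99.0, 999.0]})

# log1p(x) = log(1 + x): defined at 0 and compresses large values
df_demo["item_price_log"] = np.log1p(df_demo["item_price"])

print(df_demo)
```

The transform is reversible with `np.expm1`, which is useful if the logged prices ever need to be converted back to the original scale.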

### 2.4 Encode Categorical Variables

**Task:** One-hot encode the `product_category` and `customer_location` columns.

In [None]:
# Your code here
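One-hot encoding can be sketched with `pd.get_dummies`, as below. The categories and locations are hypothetical, and `dtype=int` is a choice made here to get 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical stand-in for df_pred; values are invented for illustration
df_demo = pd.DataFrame({
    "product_category": ["Books", "Electronics", "Books"],
    "customer_location": ["London", "Leeds", "London"],
    "item_price": [12.0, 250.0, 8.5],
})

# Each listed column is replaced by one 0/1 column per distinct value
df_encoded = pd.get_dummies(
    df_demo,
    columns=["product_category", "customer_location"],
    dtype=int,
)

print(df_encoded)
```

Note that `get_dummies` removes the original text columns from the result, so they no longer exist in the encoded frame.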

## Step 3: Create Final Datasets

**Task:** Create the final `df_desc` and `df_pred` datasets, ready for their respective analyses.

In [None]:
# Final descriptive dataset is already df_desc
print("Descriptive dataset ready:")
print(df_desc.info())

# Final predictive dataset
columns_to_drop = [
    "transaction_id", "customer_id", "transaction_date", 
    "product_category", "customer_location", "payment_method",
    "item_price" # Drop original price, keep log version
]
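# errors="ignore" skips any listed column that an earlier step already removed
# (for example, text columns replaced by their one-hot encoded versions)
df_model_ready = df_pred.drop(columns=columns_to_drop, errors="ignore")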

print("Predictive dataset ready:")
print(df_model_ready.info())

print("Data Quality Validation for Predictive Dataset:")
print(f"No missing values? {df_model_ready.isnull().sum().sum() == 0}")
# Check each column's dtype (iterating over .columns would test the column names instead)
print(f"All columns are numeric? {all(pd.api.types.is_numeric_dtype(dtype) for dtype in df_model_ready.dtypes)}")