# Clinical Data Cleaning Pipeline - Demo Notebook

This notebook demonstrates how to use the clinical data cleaning pipeline with a small synthetic dataset.

## 1. Setup and Data Import

First, we'll load the required libraries and import our synthetic clinical data.

In [None]:
# Load required libraries
library(tidyverse)
library(data.table)
library(janitor)

# Import the raw clinical data
raw_data <- read.csv('data/raw/patient_data.csv')

# Display the first few rows
head(raw_data)

## 2. Data Visualization: Before Cleaning

Let's examine the data quality issues before applying the cleaning pipeline.

In [None]:
# Visualize missing values per column (requires the naniar package)
library(naniar)
vis_miss(raw_data)

# Summary statistics
summary(raw_data)

# Histogram of blood pressure values (showing potential outliers)
ggplot(raw_data, aes(x = blood_pressure)) +
  geom_histogram(bins = 10, fill = 'steelblue', color = 'black') +
  labs(title = 'Blood Pressure Distribution (Raw Data)',
       x = 'Blood Pressure', y = 'Count') +
  theme_minimal()
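
The histogram shows extreme values only visually. As a complement, here is a minimal sketch that flags potential outliers numerically using Tukey's 1.5 × IQR rule. The helper `flag_iqr_outliers` and the small `bp` vector are hypothetical illustrations, not part of the pipeline, and the sketch assumes the column is numeric.

```r
# Hypothetical helper: flag values outside Tukey's fences (k * IQR beyond the quartiles).
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  # Missing values are never flagged as outliers
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Made-up values standing in for raw_data$blood_pressure
bp <- c(118, 122, 125, 130, NA, 119, 400)
is_outlier <- flag_iqr_outliers(bp)
print(bp[is_outlier])

# Count of missing values per column, as a numeric companion to vis_miss()
print(colSums(is.na(data.frame(blood_pressure = bp))))
```

On the demo data, the same helper could be applied as `raw_data[flag_iqr_outliers(raw_data$blood_pressure), ]`, assuming `blood_pressure` was read in as a numeric column.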

## 3. Running the Pipeline

Now we'll execute the main cleaning pipeline script.

In [None]:
# Run the complete data cleaning pipeline
source('main_pipeline.R')

# Alternative: Run individual steps for more control
# source('scripts/01_import_data.R')
# source('scripts/02_validate_data.R')
# source('scripts/03_clean_data.R')
# source('scripts/04_quality_checks.R')
# source('scripts/05_export_data.R')
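
If you run the individual steps, order matters and a missing script should stop the run early. The following is a minimal sketch of one way to do that. The helper `run_pipeline_steps` is hypothetical and is not defined by the pipeline itself. The script paths are the ones listed above.

```r
# Hypothetical helper: source each step script in order, stopping at the first problem.
run_pipeline_steps <- function(scripts) {
  for (script in scripts) {
    if (!file.exists(script)) {
      stop("Pipeline script not found: ", script)
    }
    message("Running ", script)
    source(script)  # evaluated in the global environment, like a plain source() call
  }
  invisible(scripts)
}

steps <- c(
  'scripts/01_import_data.R',
  'scripts/02_validate_data.R',
  'scripts/03_clean_data.R',
  'scripts/04_quality_checks.R',
  'scripts/05_export_data.R'
)

# Only run when all step scripts are present in the working directory
if (all(file.exists(steps))) {
  run_pipeline_steps(steps)
} else {
  message("Step scripts not found; check the working directory and the paths in `steps`.")
}
```

Because `source()` raises an error when a script fails, the loop stops at the failing step and later steps do not run on partially processed data.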

## 4. Results: After Cleaning

Let's examine the cleaned data and compare it with the original.

In [None]:
# Load the cleaned data
cleaned_data <- read.csv('data/processed/patient_data_cleaned.csv')

# Display cleaned data
cat('Cleaned Data:\n')
print(cleaned_data)

# Compare data quality metrics (share of rows with no missing values)
cat('\nData Quality Comparison:\n')
cat('Raw data completeness:', round(mean(complete.cases(raw_data)) * 100, 1), '%\n')
cat('Cleaned data completeness:', round(mean(complete.cases(cleaned_data)) * 100, 1), '%\n')

# Visualize cleaned blood pressure distribution
ggplot(cleaned_data, aes(x = blood_pressure)) +
  geom_histogram(bins = 10, fill = 'forestgreen', color = 'black') +
  labs(title = 'Blood Pressure Distribution (Cleaned Data)',
       x = 'Blood Pressure', y = 'Count') +
  theme_minimal()
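
The row-level completeness figures above do not show which variables improved. As a rough sketch, a per-column comparison could look like the following. The helper `compare_missingness` and the two small example data frames are hypothetical and use made-up values. In the notebook you would pass `raw_data` and `cleaned_data` instead.

```r
# Hypothetical helper: percentage of missing values per shared column, before vs. after.
compare_missingness <- function(raw, cleaned) {
  shared <- intersect(names(raw), names(cleaned))
  data.frame(
    column = shared,
    raw_pct_missing = unname(round(100 * colMeans(is.na(raw[shared])), 1)),
    cleaned_pct_missing = unname(round(100 * colMeans(is.na(cleaned[shared])), 1)),
    stringsAsFactors = FALSE
  )
}

# Made-up example data (not taken from the demo dataset)
raw_example <- data.frame(
  age = c(34, NA, 51, 47),
  blood_pressure = c(120, 135, NA, NA)
)
cleaned_example <- data.frame(
  age = c(34, 51, 47),
  blood_pressure = c(120, 128, 131)
)

missing_comparison <- compare_missingness(raw_example, cleaned_example)
print(missing_comparison)
```

Only columns present in both data frames are compared, so columns the pipeline adds or drops are ignored by this sketch.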

## 5. Quality Report Output

The pipeline generates comprehensive quality reports in the `reports/` directory:

### Generated Reports:

1. **Data Quality Report** (`reports/data_quality/quality_report.html`)
   - Summary statistics for all variables
   - Missing data patterns and visualizations
   - Outlier detection results
   - Validation rule violations

2. **Audit Trail** (`reports/audit_trails/audit_log.csv`)
   - Timestamped log of every operation
   - Detailed record of changes made to each data point
   - User/script information for each transformation

3. **Statistical Summary** (`reports/summary_statistics/summary.html`)
   - Descriptive statistics (mean, median, SD, etc.)
   - Distribution plots for continuous variables
   - Frequency tables for categorical variables
   - Data completeness metrics

### Viewing the Reports:

You can open the HTML reports directly in your browser, or use the code below to open the quality report from R and load the audit trail for inspection:

In [None]:
# View the quality report in your browser
browseURL('reports/data_quality/quality_report.html')

# Load and inspect the audit trail
audit_log <- read.csv('reports/audit_trails/audit_log.csv')
head(audit_log, 10)

# Display summary of changes made
cat('\nTotal number of transformations:', nrow(audit_log), '\n')
cat('Number of records affected:', length(unique(audit_log$record_id)), '\n')
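
To make the idea of a cell-level audit trail concrete, here is a minimal sketch of how such rows could be derived by comparing a data frame before and after cleaning. This is an illustration under stated assumptions, not the pipeline's actual implementation: the helper `build_audit_rows`, the columns `variable`, `old_value`, `new_value`, and `timestamp`, and the example values are all hypothetical. Only `record_id` appears in the code above. The sketch also assumes record IDs are unique.

```r
# Hypothetical helper: one audit row per changed cell, matched on the ID column.
build_audit_rows <- function(before, after, id_col = "record_id") {
  # Compare only records present in both versions, aligned by ID
  ids <- intersect(before[[id_col]], after[[id_col]])
  before <- before[match(ids, before[[id_col]]), , drop = FALSE]
  after <- after[match(ids, after[[id_col]]), , drop = FALSE]
  value_cols <- setdiff(intersect(names(before), names(after)), id_col)

  rows <- lapply(value_cols, function(col) {
    old <- as.character(before[[col]])
    new <- as.character(after[[col]])
    # A change is a differing value, or a switch between missing and non-missing
    changed <- xor(is.na(old), is.na(new)) |
      (!is.na(old) & !is.na(new) & old != new)
    data.frame(
      record_id = ids[changed],
      variable = rep(col, sum(changed)),
      old_value = old[changed],
      new_value = new[changed],
      stringsAsFactors = FALSE
    )
  })

  audit <- do.call(rbind, rows)
  audit$timestamp <- rep(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), nrow(audit))
  audit
}

# Made-up example data
before_example <- data.frame(
  record_id = c(1, 2, 3),
  blood_pressure = c(120, 400, NA),
  sex = c("M", "f", "F"),
  stringsAsFactors = FALSE
)
after_example <- data.frame(
  record_id = c(1, 2, 3),
  blood_pressure = c(120, NA, 128),
  sex = c("M", "F", "F"),
  stringsAsFactors = FALSE
)

example_audit <- build_audit_rows(before_example, after_example)
print(example_audit)
```

Values are compared as character strings so that numeric and text columns can share one code path. A real audit trail would typically also record who or what made each change and why.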

---

**Note:** This demo uses synthetic data for educational purposes. For production use with real clinical trial data, ensure compliance with all regulatory requirements (GCP, 21 CFR Part 11, ALCOA+ principles) and consult with your data management team.