# Oncology Clinical Trials Data Exploration

This notebook explores the oncology clinical trials dataset using visualization functions from the `src/visualization` module.

The analysis focuses on understanding:
- Trial status distribution
- Trial phases
- Enrollment sizes
- Trial durations
- Completion rates by sponsor type
- Relationships between enrollment and duration
- Temporal trends in completion rates

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

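# Add the project root to sys.path so project modules can be imported.
# This assumes the notebook runs from a directory directly under the project root.
project_root = Path.cwd().resolve().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))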

# Import visualization functions
from src.visualization.visualize import (
    set_plotting_style,
    plot_trial_status_distribution,
    plot_trial_phases,
    plot_enrollment_distribution,
    plot_duration_by_phase,
    plot_completion_rate_by_sponsor,
    plot_enrollment_vs_duration,
    plot_completion_by_year,
    plot_correlation_heatmap
)

# Define project directories
PROJECT_DIR = project_root
PROCESSED_DATA_DIR = PROJECT_DIR / 'data' / 'processed'

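# Display plots inline (run before styling, since activating the inline
# backend may reset some rcParams in older IPython versions)
%matplotlib inline

# Set plotting style
set_plotting_style()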

## Loading the Processed Data

First, we load the most recently modified file matching `processed_oncology_trials_*.csv` in the processed data directory.

In [None]:
# Find the most recently modified processed data file
csv_files = list(PROCESSED_DATA_DIR.glob("processed_oncology_trials_*.csv"))

if not csv_files:
    raise FileNotFoundError(
        f"No processed data files found in {PROCESSED_DATA_DIR}"
    )

# Sort by modification time (most recent first)
csv_files.sort(key=lambda x: x.stat().st_mtime, reverse=True)
latest_data_path = csv_files[0]

print(f"Loading data from {latest_data_path}")
df = pd.read_csv(latest_data_path)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
df.head()
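
The loading cell selects a file by modification time. As a minimal sketch of that selection on throwaway files, the hypothetical helper `latest_file` below (not part of the project) uses `max` rather than a full sort. The file names and timestamps are made up for the demonstration. Note that modification time can differ from any timestamp embedded in a file name, for example after files are copied.

```python
import os
import tempfile
from pathlib import Path


def latest_file(directory, pattern):
    """Return the most recently modified file matching pattern (hypothetical helper)."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"No files matching {pattern!r} in {directory}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demonstration on temporary files; names and timestamps are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "processed_oncology_trials_a.csv"
    newer = Path(tmp) / "processed_oncology_trials_b.csv"
    older.write_text("x\n")
    newer.write_text("x\n")
    os.utime(older, (1_000, 1_000))  # force an old modification time
    os.utime(newer, (2_000, 2_000))  # force a newer modification time
    picked = latest_file(tmp, "processed_oncology_trials_*.csv").name

print(picked)
```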

## Data Overview

Let's examine the basic statistics and data types of our dataset.

In [None]:
# Display column information
df.info()

In [None]:
# Display summary statistics
df.describe(include='all').T
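
Alongside `info()` and `describe()`, a per-column share of missing values is often a useful quick check. The sketch below runs on a small made-up frame rather than the real dataset. The column names are borrowed from the feature list used later in this notebook, and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "EnrollmentCount": [120, np.nan, 45, 300],
    "trial_duration_days": [400, 800, np.nan, np.nan],
    "completion_status": [1, 0, 1, 1],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be applied to `df` instead of `toy`.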

## Trial Status Distribution

Let's examine the distribution of trial statuses to understand the current state of oncology trials.

In [None]:
# Plot trial status distribution
plot_trial_status_distribution(df)

## Trial Phases

Now let's look at the distribution of trials across different clinical phases.

In [None]:
# Plot trial phases distribution
plot_trial_phases(df)

## Enrollment Distribution

Let's analyze the distribution of enrollment sizes across trials.

In [None]:
# Plot enrollment distribution
plot_enrollment_distribution(df)

## Trial Duration by Phase

Let's examine how trial duration varies across different clinical phases.

In [None]:
# Plot trial duration by phase
plot_duration_by_phase(df)

## Completion Rate by Sponsor Type

Let's analyze how trial completion rates vary by sponsor type.

In [None]:
# Plot completion rate by sponsor type
plot_completion_rate_by_sponsor(df)
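
The internals of `plot_completion_rate_by_sponsor` are not shown here. As a rough sketch of the underlying calculation, the cell below computes a completion rate per sponsor group, assuming `completion_status` and `is_industry_sponsored` are 0/1 indicators (an assumption, not confirmed by this notebook). The data is made up.

```python
import pandas as pd

# Toy data; both columns are assumed to be 0/1 indicators.
toy = pd.DataFrame({
    "is_industry_sponsored": [1, 1, 1, 0, 0, 0, 0],
    "completion_status":     [1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of completed trials in each group;
# the count of non-missing rows shows how many trials back each rate.
rates = (
    toy.groupby("is_industry_sponsored")["completion_status"]
       .agg(completion_rate="mean", n_trials="count")
)
print(rates)
```

Reporting the group size next to each rate helps avoid over-reading rates that rest on only a few trials.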

## Relationship Between Enrollment Size and Trial Duration

Let's examine if there's a relationship between enrollment size and trial duration.

In [None]:
# Plot enrollment vs duration
plot_enrollment_vs_duration(df)
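
A scatter plot can be complemented with a numeric summary. The sketch below compares Pearson correlation with a rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks. The rank-based version tends to be more robust when enrollment is heavily skewed. The data is made up and does not reflect the real dataset.

```python
import pandas as pd

# Toy data: duration increases with enrollment, but not linearly.
toy = pd.DataFrame({
    "EnrollmentCount":     [10, 20, 40, 80, 160],
    "trial_duration_days": [100, 150, 160, 400, 1000],
})

x = toy["EnrollmentCount"]
y = toy["trial_duration_days"]

pearson = x.corr(y)                 # linear association
spearman = x.rank().corr(y.rank())  # monotonic association (Pearson on ranks)

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```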

## Temporal Trends in Completion Rates

Let's analyze how trial completion rates have changed over time.

In [None]:
# Plot completion rate by year
plot_completion_by_year(df)

## Feature Correlation Analysis

Let's examine correlations between key features to identify potential relationships.

In [None]:
# Define key features for correlation analysis
key_features = [
    'trial_duration_days', 'EnrollmentCount', 'completion_status',
    'is_industry_sponsored', 'is_multi_country', 'has_us_sites',
    'is_phase_2', 'is_phase_3', 'is_randomized', 'is_double_blind'
]

# Plot correlation heatmap
plot_correlation_heatmap(df, key_features)
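
If any name in `key_features` is absent from the dataset, selecting those columns directly would raise a `KeyError`. How `plot_correlation_heatmap` handles this is not shown here. The sketch below shows one way to check which features are present before computing the correlation matrix, using a made-up frame and a shortened feature list.

```python
import pandas as pd

# Shortened feature list and toy data, for illustration only.
key_features = ["trial_duration_days", "EnrollmentCount", "is_randomized"]
toy = pd.DataFrame({
    "trial_duration_days": [100, 200, 300, 400],
    "EnrollmentCount": [10, 20, 30, 40],
    "Phase": ["1", "2", "2", "3"],
})

# Split the requested features into those available and those missing.
present = [c for c in key_features if c in toy.columns]
missing = [c for c in key_features if c not in toy.columns]

# Pairwise Pearson correlations over the available features.
corr = toy[present].corr()

print("Missing features:", missing)
print(corr)
```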

## Summary of Findings

Based on the exploratory data analysis, we can draw the following insights:

1. **Trial Status Distribution**: [Add insights after running the notebook]
2. **Trial Phases**: [Add insights after running the notebook]
3. **Enrollment Sizes**: [Add insights after running the notebook]
4. **Trial Duration by Phase**: [Add insights after running the notebook]
5. **Completion Rate by Sponsor**: [Add insights after running the notebook]
6. **Enrollment vs. Duration**: [Add insights after running the notebook]
7. **Temporal Trends**: [Add insights after running the notebook]
8. **Feature Correlations**: [Add insights after running the notebook]

These insights will inform the feature engineering and modeling approach in subsequent notebooks.