# Exploratory Analysis — Clinical Trials CSV

This notebook performs an initial exploration of `../raw/clin_trials.csv` to understand the data structure, quantify missing values, and identify potential categorical columns.
```
import pandas as pd

df = pd.read_csv('../raw/clin_trials.csv')
```

## Basic Exploration

Check the shape and column data types to understand the dataset size and structure.

```
print("_" * 60)
print(f"Shape: {df.shape}")
print("\nData types:")
print(df.dtypes)
```
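For wide tables the full `dtypes` listing can get long. As a minimal sketch, using a small made-up frame rather than the trial data (all column names here are hypothetical), the dtypes can be condensed into a count of columns per type:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "trial_id": [1, 2, 3],
    "dose_mg": [10.0, 20.0, 15.5],
    "phase": ["I", "II", "II"],
    "site": ["A", "B", "A"],
})

# Convert each dtype to its string name, then count columns per dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

In the notebook itself, the same expression would presumably be applied to `df` instead of `toy`.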

## Missing Values Analysis

Compute missing value counts and percentages for each column. Only columns with missing values are displayed.
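To make the count and percentage calculation concrete, here is a small sketch on a made-up frame (the column names and values are hypothetical, not taken from the trial data). It relies on the fact that the mean of a boolean mask is the fraction of `True` values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the trial data.
toy = pd.DataFrame({
    "age": [34, np.nan, 51, 47],
    "arm": ["A", "B", None, None],
    "site": ["X", "Y", "X", "Z"],
})

missing_count = toy.isna().sum()
missing_percent = toy.isna().mean() * 100  # mean of booleans = fraction missing

summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percent": missing_percent,
})
# Keep only columns with at least one missing value, worst first.
summary = summary[summary["missing_count"] > 0].sort_values(
    by="missing_percent", ascending=False
)
print(summary)
```

Here `arm` (2 of 4 values missing, 50%) is listed before `age` (1 of 4, 25%), and `site` is dropped because it has no missing values.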
```
print("_" * 60)
print("Missing values")
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'missing_count': missing,
    'missing_percent': missing_percent
})
print(missing_df[missing_df['missing_count'] > 0].sort_values(by='missing_percent', ascending=False))
```

## Sample Data Preview

Display the first 100 rows to inspect actual data values.

```
print("_" * 60)
print("Sample data")
display(df.head(100))
```
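`head` only shows the top of the file, which may not be representative if the rows happen to be sorted. One possible complement, which the original notebook does not use, is a seeded random sample. A minimal sketch on a made-up frame (column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; the notebook would use `df` instead.
toy = pd.DataFrame({
    "patient": range(1000),
    "visit": [i % 4 for i in range(1000)],
})

# A fixed seed makes the same rows appear on every run.
preview = toy.sample(n=5, random_state=0)
print(preview)
```

The seed value `0` is arbitrary; any fixed integer gives a reproducible preview.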

## Key Insights: Categorical Variable Detection

Identify likely categorical columns with a simple heuristic:
- **Dynamic Threshold**: The cutoff is the maximum of an absolute value (20) and a percentage-based value (1% of total rows).
- The absolute value sets the cutoff for datasets of up to about 2,000 rows; beyond that, the cutoff grows with the row count.
- Columns whose number of unique non-null values is at or below the cutoff are likely categorical and are candidates for separate lookup tables.

**Note**: This helped distinguish categorical from continuous variables during initial exploration.
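As a rough illustration of the threshold rule, here is a minimal sketch on a made-up frame. The helper `compute_cutoff` and the column names are hypothetical and are not part of the notebook:

```python
import pandas as pd

def compute_cutoff(n_rows, absolute_cutoff=20, percent_cutoff=0.01):
    """Return the larger of the absolute cutoff and the row-based cutoff."""
    return max(absolute_cutoff, int(n_rows * percent_cutoff))

print(compute_cutoff(500))     # 1% of 500 is 5, so the absolute cutoff wins: 20
print(compute_cutoff(50_000))  # 1% of 50,000 is 500, so the row-based cutoff wins: 500

# Hypothetical toy frame: one identifier-like column, one low-cardinality column.
toy = pd.DataFrame({
    "patient_id": range(100),                # 100 unique values
    "phase": ["I", "II", "III", "IV"] * 25,  # 4 unique values
})

cutoff = compute_cutoff(len(toy))  # 20 for a 100-row frame
likely_categorical = [
    col for col in toy.columns
    if toy[col].nunique(dropna=True) <= cutoff
]
print(likely_categorical)  # ['phase']
```

With 100 rows the cutoff is 20, so `phase` (4 unique values) is flagged while `patient_id` (100 unique values) is not.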
```
print("_" * 60)
print("Key Insights:")

absolute_cutoff = 20   # FIXME: I don't like this being hardcoded, but it works for now
percent_cutoff = 0.01  # 1% of rows; this is probably too high for large datasets
n_rows = df.shape[0]
dynamic_cutoff = max(absolute_cutoff, int(n_rows * percent_cutoff))

for col in df.columns:
    unique_vals = df[col].nunique(dropna=True)
    if unique_vals <= dynamic_cutoff:
        uniques_sample = df[col].dropna().unique()[:50]
        print(f"Column: {col}")
        print(f"  - Number of uniques: {unique_vals}")
        print(f"  - Sample uniques: {list(uniques_sample)}\n")
```
