# BioInsight Hackathon - Exploratory Data Analysis

**Goal:** Understand the bioactivity dataset before building models

**Dataset:** `data/sample_bioactivity.csv` (~100K compound-target pairs)

**Key Questions:**
1. What does the data look like?
2. Are there missing values?
3. How are active/inactive compounds distributed?
4. Which features correlate with bioactivity?
5. Are there outliers or data quality issues?

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

print("✅ Libraries imported successfully!")

## 1. Load Data

In [None]:
# Load sample data
df = pd.read_csv('../data/sample_bioactivity.csv')

print(f"Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

df.head()

In [None]:
# Data types and info
df.info()

## 2. Missing Values Analysis

In [None]:
# Calculate missing values
missing = df.isnull().sum()
missing_pct = 100 * missing / len(df)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)

# Show only columns with missing values
missing_df[missing_df['Missing Count'] > 0]

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Columns')
plt.tight_layout()
plt.show()
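A possible follow-up to the heatmap is to check whether missingness differs between active and inactive compounds, since class-dependent gaps can leak label information into an imputed feature. The sketch below uses a small made-up frame (`toy`) in place of `df`; the column names mirror the ones used later in this notebook, but the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are made up for illustration only.
toy = pd.DataFrame({
    'psa': [50.1, np.nan, 72.3, np.nan, 90.0, 65.2],
    'alogp': [2.1, 3.4, np.nan, 1.8, 4.0, 2.9],
    'is_active': [0, 0, 0, 1, 1, 1],
})

# Percentage of missing values per feature, split by activity class
feature_cols = [c for c in toy.columns if c != 'is_active']
missing_by_class = (
    toy[feature_cols].isnull()
    .groupby(toy['is_active'])
    .mean()
    .mul(100)
    .round(1)
)
print(missing_by_class)
```

On the real data, replacing `toy` with `df` should give one row per class; large gaps between the rows for a column would be worth noting under Data Quality.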

## 3. Target Variable Distribution

In [None]:
# Active vs Inactive distribution
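# sort_index() orders the counts as 0, 1 so the hard-coded labels and colors below match
target_counts = df['is_active'].value_counts().sort_index()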
target_pct = 100 * target_counts / len(df)

print("Target Distribution:")
print(f"  Inactive (0): {target_counts[0]:,} ({target_pct[0]:.1f}%)")
print(f"  Active (1): {target_counts[1]:,} ({target_pct[1]:.1f}%)")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
target_counts.plot(kind='bar', ax=ax1, color=['#e74c3c', '#2ecc71'])
ax1.set_title('Active vs Inactive Compounds', fontsize=14, fontweight='bold')
ax1.set_xlabel('Activity')
ax1.set_ylabel('Count')
ax1.set_xticklabels(['Inactive', 'Active'], rotation=0)

# Pie chart
ax2.pie(target_counts, labels=['Inactive', 'Active'], autopct='%1.1f%%',
        colors=['#e74c3c', '#2ecc71'], startangle=90)
ax2.set_title('Activity Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
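To put a number on the class balance seen in the charts, one option is to compute the majority-to-minority ratio and the accuracy a trivial majority-class predictor would reach. This is a minimal sketch on made-up labels (`labels` stands in for `df['is_active']`); the variable names are illustrative, not part of the dataset.

```python
import pandas as pd

# Toy labels (made up); in the notebook this would be df['is_active'].
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name='is_active')

counts = labels.value_counts().sort_index()
imbalance_ratio = counts.max() / counts.min()
minority_share = counts.min() / counts.sum()

# Accuracy of a classifier that always predicts the majority class
majority_baseline = counts.max() / counts.sum()

print(f"Imbalance ratio (majority:minority): {imbalance_ratio:.2f}:1")
print(f"Minority share: {minority_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

The baseline figure is a useful reference point later: a model whose accuracy does not beat it has learned little beyond the class prior.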

## 4. Feature Distributions

In [None]:
# Key molecular properties
features = ['mw_freebase', 'alogp', 'hba', 'hbd', 'psa', 'aromatic_rings']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for idx, feature in enumerate(features):
    if feature in df.columns:
        df[feature].hist(bins=50, ax=axes[idx], edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'{feature} Distribution', fontsize=12, fontweight='bold')
        axes[idx].set_xlabel(feature)
        axes[idx].set_ylabel('Frequency')
        axes[idx].axvline(df[feature].median(), color='red', linestyle='--', 
                         label=f'Median: {df[feature].median():.1f}')
        axes[idx].legend()

plt.tight_layout()
plt.show()

In [None]:
# Activity value distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Original scale
df['standard_value'].hist(bins=100, ax=ax1, edgecolor='black', alpha=0.7)
ax1.set_title('Activity Value Distribution (nM)', fontsize=12, fontweight='bold')
ax1.set_xlabel('Standard Value (nM)')
ax1.set_ylabel('Frequency')
ax1.axvline(10000, color='red', linestyle='--', label='Threshold (10 μM)')
ax1.legend()

# Log scale
np.log10(df['standard_value'] + 1).hist(bins=100, ax=ax2, edgecolor='black', alpha=0.7)
ax2.set_title('Activity Value Distribution (log10)', fontsize=12, fontweight='bold')
ax2.set_xlabel('log10(Standard Value + 1)')
ax2.set_ylabel('Frequency')
ax2.axvline(np.log10(10000), color='red', linestyle='--', label='Threshold')
ax2.legend()

plt.tight_layout()
plt.show()
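An alternative to the `log10(x + 1)` view is the p-scale commonly used for potency values: the negative log10 of the molar concentration, which for values in nM equals `9 - log10(value)`. The sketch below assumes `standard_value` is a concentration-type measurement in nM (as the axis labels above suggest); `standard_value_nm` and `p_activity` are illustrative names, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy values in nM (made up); in the notebook this would be df['standard_value'].
standard_value_nm = pd.Series([1.0, 100.0, 10_000.0, 0.0, np.nan])

# Non-positive values have no logarithm, so mask them as NaN first
positive = standard_value_nm.where(standard_value_nm > 0)

# p-scale: -log10(molar concentration) = 9 - log10(concentration in nM)
p_activity = 9 - np.log10(positive)
print(p_activity.tolist())
```

On this scale the 10 μM threshold corresponds to a value of 5, and larger values mean more potent compounds. Unlike the `+ 1` shift, zeros are left as NaN rather than mapped to 0, which keeps them visible as a data quality question.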

## 5. Correlation Analysis

In [None]:
# Select numeric features for correlation
numeric_features = ['mw_freebase', 'alogp', 'hba', 'hbd', 'psa', 'rtb', 
                   'aromatic_rings', 'heavy_atoms', 'qed_weighted', 
                   'lipinski_violations', 'is_active']

# Filter existing columns
numeric_features = [f for f in numeric_features if f in df.columns]

# Calculate correlation matrix
corr_matrix = df[numeric_features].corr()

# Visualize
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation with target variable (drop the target's self-correlation of 1.0)
target_corr = corr_matrix['is_active'].drop('is_active').sort_values(ascending=False)

plt.figure(figsize=(10, 6))
target_corr.plot(kind='barh', color='steelblue')
plt.title('Feature Correlation with Bioactivity', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Features')
plt.axvline(0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

# Filter by sign so each list only contains features that actually match its heading
print("\nTop 5 Positively Correlated Features:")
print(target_corr[target_corr > 0].head(5))
print("\nTop 5 Negatively Correlated Features:")
print(target_corr[target_corr < 0].sort_values().head(5))
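`DataFrame.corr()` defaults to Pearson correlation, which measures linear association and is sensitive to extreme values. As a cross-check, a rank-based (Spearman) correlation can be computed alongside it. The sketch below uses a made-up frame with one extreme molecular weight to show how the two can differ; `toy` and `comparison` are illustrative names.

```python
import pandas as pd

# Toy data (made up): activity rises with weight, and one weight is extreme
toy = pd.DataFrame({
    'mw_freebase': [150, 220, 310, 400, 480, 900],
    'is_active':   [0, 0, 0, 1, 1, 1],
})

# Correlation of each feature with the target, under both methods
pearson = toy.corr(method='pearson')['is_active'].drop('is_active')
spearman = toy.corr(method='spearman')['is_active'].drop('is_active')

comparison = pd.DataFrame({'pearson': pearson, 'spearman': spearman})
print(comparison.round(3))
```

Features whose Spearman value is noticeably larger in magnitude than their Pearson value may have a monotonic but non-linear relationship with activity, or be affected by outliers.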

## 6. Outlier Detection

In [None]:
# Box plots for key features
features_to_plot = ['mw_freebase', 'alogp', 'psa', 'hba', 'hbd', 'aromatic_rings']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for idx, feature in enumerate(features_to_plot):
    if feature in df.columns:
        df.boxplot(column=feature, by='is_active', ax=axes[idx])
        axes[idx].set_title(f'{feature} by Activity', fontsize=12, fontweight='bold')
        axes[idx].set_xlabel('Activity (0=Inactive, 1=Active)')
        axes[idx].set_ylabel(feature)

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()
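The points drawn beyond the whiskers in the box plots follow matplotlib's default rule of 1.5 × IQR past the quartiles. To count them per feature rather than eyeball them, the same rule can be applied directly. This is a minimal sketch; `iqr_outlier_mask` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy stand-in for df (made up): one extreme molecular weight, no extreme alogp
toy = pd.DataFrame({
    'mw_freebase': [300, 320, 310, 330, 305, 315, 1200],
    'alogp': [2.0, 2.5, 3.0, 2.2, 2.8, 2.4, 2.6],
})

outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum()) for col in toy.columns}
print(outlier_counts)
```

Note that the box plots above compute quartiles within each activity class, so applying the helper per class (for example via `groupby('is_active')`) would be needed to reproduce their whiskers exactly.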

## 7. Summary Statistics

In [None]:
# Summary statistics by activity
print("Summary Statistics by Activity:\n")
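# Exclude the grouping column by name instead of relying on its position in the list
summary_features = [f for f in numeric_features if f != 'is_active']
print(df.groupby('is_active')[summary_features].describe().T)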

## 8. Key Findings & Insights

**Document your observations here:**

### Data Quality:
- Missing values: [List columns with missing data]
- Outliers: [Note any extreme values]
- Data types: [Any issues with data types?]

### Target Distribution:
- Active/Inactive ratio: [X% active, Y% inactive]
- Class imbalance: [Balanced or imbalanced?]

### Feature Insights:
- Most correlated features: [List top 3]
- Least correlated features: [List bottom 3]
- Interesting patterns: [Any surprising findings?]

### Recommendations for Modeling:
1. [e.g., Handle class imbalance with SMOTE]
2. [e.g., Impute missing PSA values with median]
3. [e.g., Focus on MW, LogP, PSA as key features]
4. [e.g., Remove outliers beyond 3 standard deviations]
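If recommendations like the median-imputation and 3-standard-deviation examples above end up on your list, they might look roughly like this. The sketch runs on a made-up `psa` series; the variable names are illustrative, and whether either step is appropriate depends on what the EDA actually shows.

```python
import numpy as np
import pandas as pd

# Toy PSA column (made up): 19 typical values, one extreme value, one gap
psa = pd.Series(list(np.linspace(50, 90, 19)) + [1000.0, np.nan], name='psa')

# Median imputation: the median is barely affected by the extreme value
psa_imputed = psa.fillna(psa.median())

# 3-sigma filter: keep values within 3 standard deviations of the mean
z_scores = (psa_imputed - psa_imputed.mean()) / psa_imputed.std()
psa_clean = psa_imputed[z_scores.abs() <= 3]

print(f"Before: {len(psa_imputed)} rows, after: {len(psa_clean)} rows")
```

When this moves into a modeling pipeline, the median, mean, and standard deviation should be computed on the training split only and then reused on validation and test data, so that no information leaks across splits.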

### Next Steps:
- [ ] Feature engineering (create interaction terms, polynomial features)
- [ ] Feature selection (remove low-correlation features)
- [ ] Data preprocessing (scaling, encoding)
- [ ] Model training (XGBoost, Logistic Regression)