# Cancer Data Exploration Notebook

This notebook provides exploratory data analysis of the Global Cancer Patients dataset used in the OncoPredictAI project. We'll analyze key features, distributions, and relationships to gain insights for model development.

## 1. Setup and Configuration

First, let's set up the environment and import necessary libraries.

In [None]:
import os
import sys
import yaml
from pathlib import Path

# Add the project root directory to the path
project_root = Path('..', '..').resolve()
sys.path.append(str(project_root))

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load project configuration
config_path = os.path.join(project_root, 'config', 'default.yaml')
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Set visualization style
from src.visualization.visualize import set_visualization_style
viz_config = config.get('visualization', {})
set_visualization_style(
    style=viz_config.get('style', 'whitegrid'),
    context=viz_config.get('context', 'notebook'),
    palette=viz_config.get('palette', 'viridis')
)

print(f"OncoPredictAI Project directory: {project_root}")

## 2. Load and Examine the Dataset

In [None]:
# Define the path to the cancer dataset
data_path = os.path.join(project_root, 'data', 'raw', 'cancer_patients', 'global_cancer_patients_2015_2024.csv')

# Fall back to the original location if the file is not under data/raw
if not os.path.exists(data_path):
    data_path = os.path.join(project_root, 'data', 'global_cancer_patients_2015_2024.csv')

print(f"Loading data from: {data_path}")
df = pd.read_csv(data_path)

# Display basic information
print(f"\nDataset shape: {df.shape} (rows, columns)")
print("\nFirst 5 rows:")
df.head()
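
If neither location contains the file, `pd.read_csv` raises an error that mentions only the last path tried. A minimal sketch of a more explicit lookup is shown below. `first_existing_path` is a hypothetical helper, not part of the project, and the temporary file only stands in for the real CSV.

```python
from pathlib import Path
import tempfile


def first_existing_path(candidates):
    """Return the first candidate that is an existing file (hypothetical helper)."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    # Report every location that was tried, not just the last one
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"None of the candidate paths exist: {tried}")


# Demonstration with a throwaway directory; the file contents are placeholders
with tempfile.TemporaryDirectory() as tmp_dir:
    present = Path(tmp_dir) / "present.csv"
    present.write_text("col_a,col_b\n1,2\n")
    missing_path = Path(tmp_dir) / "raw" / "missing.csv"

    resolved = first_existing_path([missing_path, present])
    resolved_name = resolved.name
    print(f"Resolved data file: {resolved_name}")
```

In the notebook, the two `os.path.join(...)` locations above could be passed as the candidate list.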

In [None]:
# Check data types and missing values
print("Data types:")
print(df.dtypes)

print("\nMissing values:")
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Values': missing,
    'Percent': missing_percent
}).sort_values('Missing Values', ascending=False)

print(missing_df[missing_df['Missing Values'] > 0])

## 3. Statistical Summary

Let's examine the statistical distribution of the numerical features.

In [None]:
# Statistical summary of numerical columns
df.describe()

## 4. Categorical Data Analysis

Now let's examine the categorical variables and their distributions.

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Categorical columns: {categorical_cols}")

# Display value counts for each categorical column
for col in categorical_cols[:5]:  # Limit to first 5 to avoid too much output
    print(f"\nDistribution of {col}:")
    print(df[col].value_counts().sort_values(ascending=False).head(10))
    print(f"Number of unique values: {df[col].nunique()}")

## 5. Target Variable Analysis

Let's explore the target variable (severity score) distribution.

In [None]:
target_col = 'Severity_Score'  # Adjust based on actual column name

if target_col in df.columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[target_col], kde=True)
    plt.title(f'Distribution of {target_col}')
    plt.xlabel(target_col)
    plt.ylabel('Count')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Basic statistics
    print(f"\n{target_col} statistics:")
    print(df[target_col].describe())
else:
    print(f"Target column '{target_col}' not found in dataset.")
    print(f"Available columns: {df.columns.tolist()}")

## 6. Feature Correlations

Let's examine correlations between numerical features.

In [None]:
# Select numeric columns (all integer and float widths, not only 64-bit)
numeric_cols = df.select_dtypes(include='number').columns

# Calculate correlation matrix
corr = df[numeric_cols].corr()

# Create heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            linewidths=0.5, vmin=-1, vmax=1)
plt.title('Feature Correlation Matrix', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
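
A large heatmap can be hard to scan, so it may help to list the strongest pairwise correlations as a table. The sketch below does this on synthetic data. `top_correlated_pairs` is a hypothetical helper and the columns `a`, `b`, `c` are made up for illustration; they are not columns of the cancer dataset.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr_matrix = frame.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once
    # and self-correlations on the diagonal are excluded
    upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    pairs = corr_matrix.where(upper_mask).stack().dropna()
    # Rank by magnitude but keep the sign in the output
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Synthetic example: b is a noisy negative copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

top_pairs = top_correlated_pairs(demo, n=2)
print(top_pairs)
```

In the notebook, the same helper could be called as `top_correlated_pairs(df[numeric_cols])`.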

## 7. Feature Distributions

Let's visualize distributions of key features.

In [None]:
# Select the first 6 numeric features by column order (excluding target if present)
features_to_plot = [col for col in numeric_cols if col != target_col][:6]

# Create plots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.histplot(df[feature], kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature}')
    axes[i].set_xlabel(feature)
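    axes[i].grid(True, alpha=0.3)

# Hide any unused subplots when fewer than six features are available
for ax in axes[len(features_to_plot):]:
    ax.set_visible(False)
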
plt.tight_layout()
plt.show()

## 8. Relationship Between Key Features and Target

In [None]:
if target_col in df.columns:
    # Select the three numeric features most strongly correlated with the target
    if target_col in corr.columns and len(numeric_cols) > 1:
        # Drop the target's self-correlation before ranking
        corr_with_target = corr[target_col].drop(target_col).abs().sort_values(ascending=False)
        top_features = corr_with_target.index[:3]
        
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        for i, feature in enumerate(top_features):
            sns.scatterplot(x=feature, y=target_col, data=df, alpha=0.6, ax=axes[i])
            axes[i].set_title(f'{feature} vs {target_col}')
            axes[i].grid(True, alpha=0.3)
            
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough numeric columns for correlation analysis")
else:
    print(f"Target column '{target_col}' not found in dataset.")

## 9. Initial Clustering Analysis

Let's apply K-means clustering to identify potential patterns in the data.

In [None]:
from models.clustering.kmeans import KMeans
from sklearn.preprocessing import StandardScaler

# Select features for clustering (excluding target)
features = [col for col in numeric_cols if col != target_col]
features = features[:10]  # Limit to the first 10 features (by column order) for performance

# Prepare data
X = df[features].copy()

# Handle missing values if any
X = X.fillna(X.median())

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-means with different cluster numbers to determine optimal k
k_values = range(2, 8)
inertias = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertias, 'o-', linewidth=2)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
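
Reading the elbow off the plot is subjective. As a rough cross-check, one common heuristic picks the k where the inertia curve bends most sharply, that is, where the discrete second difference is largest. The sketch below assumes evenly spaced k values. `elbow_by_second_difference` and the example inertias are hypothetical; they are not project code and not results from this dataset.

```python
import numpy as np


def elbow_by_second_difference(k_values, inertias):
    """Return the k at which the inertia curve bends most sharply (heuristic)."""
    k_values = list(k_values)
    inertias = np.asarray(inertias, dtype=float)
    if len(k_values) != len(inertias) or len(inertias) < 3:
        raise ValueError("Need at least three (k, inertia) pairs of equal length")
    # Second difference: inertia[i-1] - 2 * inertia[i] + inertia[i+1],
    # defined only for interior points of the curve
    second_diff = np.diff(inertias, n=2)
    # second_diff[j] belongs to k_values[j + 1]
    return k_values[int(np.argmax(second_diff)) + 1]


# Made-up inertia values with a clear bend at k = 3
example_k = range(2, 8)
example_inertias = [1000.0, 400.0, 350.0, 320.0, 300.0, 290.0]

suggested_k = elbow_by_second_difference(example_k, example_inertias)
print(f"Suggested k: {suggested_k}")
```

The heuristic can only return interior values of the tested range, so it should be treated as a hint to compare against the plot rather than a final answer.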

## 10. Dimensionality Reduction with PCA

Apply PCA to reduce dimensionality and visualize the data.

In [None]:
from models.dimensionality_reduction.pca import PCA

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot explained variance
pca.plot_explained_variance()
plt.show()

# Cluster the scaled features (not the PCA projection); PCA is used only for plotting
optimal_k = 3  # Placeholder: set this from the elbow curve above
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Plot the clusters in 2D PCA space
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.7, s=50)
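# Project the centroids into PCA space; plt.scatter needs separate x and y arrays
centroids_pca = pca.transform(kmeans.centroids)
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], marker='X', s=200, c='red', label='Centroids')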
plt.colorbar(scatter, label='Cluster')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Clusters visualized in PCA space')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
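
The explained variance plot indicates how much of the total variance each principal component captures. A minimal NumPy sketch of that calculation follows, assuming the standard SVD formulation and using synthetic data. It is independent of the project's `PCA` class, whose internals are not shown here, and `explained_variance_ratio` is a hypothetical helper.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    # PCA operates on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Singular values are returned in descending order
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances / variances.sum()


# Synthetic data: the second column is nearly a multiple of the first
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

ratios = explained_variance_ratio(X_demo)
cumulative = np.cumsum(ratios)
# Smallest number of components whose cumulative ratio reaches 90%
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("Explained variance ratios:", np.round(ratios, 3))
print("Components needed for 90% variance:", n_components_90)
```

If the first two components capture only a small share of the variance, the 2D cluster plot above may hide structure that exists in the remaining dimensions.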

## 11. Summary of Findings

In this exploratory analysis, we've discovered several key insights from the Global Cancer Patients dataset:

1. **Dataset Overview**: [Summarize the size and scope of the dataset]
2. **Missing Values**: [Summarize findings about missing data]
3. **Feature Correlations**: [Highlight key correlations discovered]
4. **Target Variable**: [Describe distribution of target variable]
5. **Clustering Results**: [Describe what the clustering analysis revealed]
6. **PCA Results**: [Summarize dimensionality reduction findings]

These insights will inform our feature engineering and model selection process for the OncoPredictAI system.

## 12. Next Steps

Based on this exploratory analysis, we'll proceed with:

1. Feature engineering to enhance predictive power
2. Advanced preprocessing techniques to handle identified data quality issues
3. Model development focusing on [specific algorithms identified as promising]
4. Detailed evaluation of model performance for cancer prediction tasks