# Medical AI - Dataset Download & Setup

This notebook downloads all required datasets directly to your Google Drive using the Kaggle API.

**Important:** Run this ONCE to download all datasets. They will be stored in your Google Drive for reuse.

**Estimated Download Time:** 30-60 minutes (depending on internet speed)

**Required Storage:** ~40-50 GB in Google Drive

## Step 1: Mount Google Drive & Check Storage

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive')
print("✓ Google Drive mounted successfully")

In [None]:
# Check available storage
import shutil

total, used, free = shutil.disk_usage("/content/drive/MyDrive")

print("Google Drive Storage:")
print(f"Total: {total // (2**30)} GB")
print(f"Used: {used // (2**30)} GB")
print(f"Free: {free // (2**30)} GB")
print()

if free < 50 * (2**30):  # 50 GB
    print("⚠️ WARNING: You may not have enough space for all datasets.")
    print("Recommended: At least 50 GB free space")
    print("Consider downloading datasets one at a time.")
else:
    print("✓ Sufficient storage available for all datasets")
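
The same check can be wrapped in a small helper and called before each individual download, which helps if you fetch the datasets one at a time. This is a minimal sketch; the function name `has_free_space` and the example threshold are illustrative assumptions, not part of the original notebook.

```python
import shutil

GIB = 2**30  # bytes per GiB, matching the 2**30 divisor used above


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GIB


# Example: check the current directory against a hypothetical 8 GB requirement
if has_free_space(".", 8):
    print("Enough space for this dataset")
else:
    print("Not enough space; free up storage before downloading")
```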

## Step 2: Setup Kaggle API

You'll need your Kaggle API credentials. Get them from: https://www.kaggle.com/settings/account

In [None]:
# Install Kaggle API
!pip install -q kaggle

In [None]:
# Setup Kaggle credentials
import os
import json
from getpass import getpass

# Create .kaggle directory
!mkdir -p ~/.kaggle

# Enter your Kaggle credentials securely
# Get these from: https://www.kaggle.com/settings/account -> Create New API Token
print("Enter your Kaggle credentials:")
kaggle_username = input("Kaggle Username: ")
kaggle_key = getpass("Kaggle API Key (hidden): ")

# Create kaggle.json
kaggle_credentials = {
    "username": kaggle_username,
    "key": kaggle_key
}

with open('/root/.kaggle/kaggle.json', 'w') as f:
    json.dump(kaggle_credentials, f)

# Set permissions
!chmod 600 ~/.kaggle/kaggle.json

print("✓ Kaggle API configured successfully")
print("\nTesting Kaggle connection...")
!kaggle datasets list --max-size 100
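
The cell above creates `kaggle.json` with default permissions and only then restricts them with `chmod`. A sketch of an alternative that creates the file with owner-only permissions from the start is shown here; the helper name `write_kaggle_credentials` and its `config_dir` parameter are hypothetical. The Kaggle client can also read credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables instead of a file.

```python
import json
import os


def write_kaggle_credentials(username, key, config_dir):
    """Write kaggle.json into config_dir, readable and writable by the owner only."""
    os.makedirs(config_dir, exist_ok=True)
    cred_path = os.path.join(config_dir, "kaggle.json")
    # O_CREAT with mode 0o600 sets owner-only permissions when the file is created
    fd = os.open(cred_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"username": username, "key": key}, f)
    # If the file already existed, its old mode is kept, so enforce it explicitly
    os.chmod(cred_path, 0o600)
    return cred_path
```

In this notebook it could be called as `write_kaggle_credentials(kaggle_username, kaggle_key, os.path.expanduser("~/.kaggle"))` after the credentials have been entered.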

## Step 3: Create Project Directories

In [None]:
# Create project structure in Google Drive
import os

base_dir = '/content/drive/MyDrive/medical-ai-project'
os.makedirs(base_dir, exist_ok=True)

# Create data directories
data_dirs = [
    f'{base_dir}/chest-xray-classification/data',
    f'{base_dir}/skin-lesion-detection/data',
    f'{base_dir}/drug-discovery/data',
]

for dir_path in data_dirs:
    os.makedirs(dir_path, exist_ok=True)
    print(f"✓ Created: {dir_path}")

print("\n✓ All directories created successfully")

## Step 4: Download Chest X-Ray Dataset

**Dataset:** COVID-19 Radiography Database

**Size:** ~3-5 GB

**Classes:** COVID, Lung Opacity, Normal, Viral Pneumonia

In [None]:
import os
os.chdir(f'{base_dir}/chest-xray-classification/data')

print("Downloading COVID-19 Radiography Database...")
print("This may take 10-15 minutes...\n")

!kaggle datasets download -d tawsifurrahman/covid19-radiography-database

print("\n✓ Download complete!")
print("\nExtracting files...")

!unzip -q covid19-radiography-database.zip
!rm covid19-radiography-database.zip

print("✓ Chest X-Ray dataset ready!")
!ls -lh
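
Because this notebook is meant to run once, it can help to guard each download so that re-running a cell does not fetch several gigabytes again. The sketch below assumes a dataset counts as present when its data directory contains at least one file other than a leftover `.zip` archive; the helper name `dataset_present` is hypothetical.

```python
from pathlib import Path


def dataset_present(data_dir):
    """Return True if data_dir holds at least one file that is not a .zip archive."""
    root = Path(data_dir)
    if not root.is_dir():
        return False
    return any(p.is_file() and p.suffix.lower() != ".zip" for p in root.rglob("*"))
```

The download and extraction commands can then be placed inside an `if not dataset_present(...)` block for each dataset directory.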

In [None]:
# Verify chest X-ray data
import os
from pathlib import Path

data_path = Path(f'{base_dir}/chest-xray-classification/data')
print("Chest X-Ray Dataset Structure:")
print("=" * 50)

for item in data_path.iterdir():
    if item.is_dir():
        num_files = len(list(item.glob('*.png'))) + len(list(item.glob('*.jpg')))
        print(f"📁 {item.name}: {num_files} images")

print("\n✓ Chest X-Ray dataset verification complete")
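
The loop above only counts images that sit directly inside each top-level folder. If the extracted archive nests its class folders more deeply, those counts will read as zero. A sketch of a recursive count grouped by each image's parent folder follows; `count_images_by_folder` and the extension set are illustrative assumptions.

```python
from collections import Counter
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed set of image suffixes


def count_images_by_folder(root):
    """Count image files under root, keyed by each file's parent folder relative to root."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            counts[str(path.parent.relative_to(root))] += 1
    return dict(counts)
```

Calling `count_images_by_folder(data_path)` returns a dictionary that can be printed or compared against the class list given for the dataset.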

## Step 5: Download Skin Lesion Dataset

**Dataset:** HAM10000 (Human Against Machine with 10000 training images)

**Size:** ~5-8 GB

**Classes:** 7 types of skin lesions

In [None]:
os.chdir(f'{base_dir}/skin-lesion-detection/data')

print("Downloading HAM10000 Skin Lesion Dataset...")
print("This may take 15-20 minutes...\n")

!kaggle datasets download -d kmader/skin-cancer-mnist-ham10000

print("\n✓ Download complete!")
print("\nExtracting files...")

!unzip -q skin-cancer-mnist-ham10000.zip
!rm skin-cancer-mnist-ham10000.zip

print("✓ Skin Lesion dataset ready!")
!ls -lh
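
In the cell above, `rm` runs even if `unzip` fails, which would discard the archive without leaving usable data. The following sketch uses Python's standard `zipfile` module and deletes the archive only after an integrity check and extraction have succeeded; the helper name `extract_and_remove` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, dest_dir):
    """Extract zip_path into dest_dir and delete the archive only if extraction succeeds."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are intact
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in {zip_path.name}: {bad_member}")
        archive.extractall(dest_dir)
    zip_path.unlink()  # reached only when no exception was raised above
```

Note that `testzip()` reads every member of the archive, so on multi-gigabyte downloads the check adds noticeable time.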

In [None]:
# Verify skin lesion data
import pandas as pd
from pathlib import Path

data_path = Path(f'{base_dir}/skin-lesion-detection/data')
print("Skin Lesion Dataset Structure:")
print("=" * 50)

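# Check for metadata CSV (prefer a file with "metadata" in its name,
# since the archive may also contain other CSV files)
csv_files = sorted(data_path.glob('*.csv'))
if csv_files:
    metadata_csv = next((p for p in csv_files if 'metadata' in p.name.lower()), csv_files[0])
    df = pd.read_csv(metadata_csv)
    print(f"\n📊 Metadata file: {metadata_csv.name}")
    print(f"Total samples: {len(df)}")
    if 'dx' in df.columns:
        print("\nClass distribution:")
        print(df['dx'].value_counts())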

# Count image files
for item in data_path.iterdir():
    if item.is_dir():
        num_files = len(list(item.glob('*.jpg'))) + len(list(item.glob('*.png')))
        print(f"\n📁 {item.name}: {num_files} images")

print("\n✓ Skin Lesion dataset verification complete")
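
Beyond counting, it can be useful to confirm that every row in the metadata file has a matching image on disk. The sketch below assumes the metadata CSV has an `image_id` column holding file names without extension and that the images are `.jpg` files somewhere under the data directory; both are assumptions to check against the extracted files. It uses the standard `csv` module so that it runs without extra dependencies.

```python
import csv
from pathlib import Path


def find_missing_images(metadata_csv, image_root, id_column="image_id"):
    """Return the sorted IDs listed in metadata_csv that have no .jpg file under image_root."""
    on_disk = {p.stem for p in Path(image_root).rglob("*.jpg")}
    with open(metadata_csv, newline="") as f:
        listed = {row[id_column] for row in csv.DictReader(f)}
    return sorted(listed - on_disk)
```

An empty list means every listed image was found; any returned IDs point to an incomplete extraction.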

## Step 6: Download Drug Discovery Dataset

**Dataset:** ESOL (Aqueous Solubility) or QM9

**Size:** ~500 MB - 2 GB

**Type:** SMILES strings with molecular properties

In [None]:
# Option 1: Download from Kaggle
os.chdir(f'{base_dir}/drug-discovery/data')

print("Downloading molecular dataset...")

# QM9 dataset
!kaggle datasets download -d burakhmmtgl/qm9-dataset
!unzip -q qm9-dataset.zip
!rm qm9-dataset.zip

print("✓ QM9 dataset ready!")
!ls -lh

In [None]:
# Option 2: Download ESOL using DeepChem (alternative)
# Uncomment if you prefer ESOL dataset

# !pip install -q deepchem
# import deepchem as dc

# print("Downloading ESOL dataset via DeepChem...")
# ESOL is the Delaney solubility dataset; DeepChem's MoleculeNet loader for it is load_delaney
# tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='random')
# train_dataset, valid_dataset, test_dataset = datasets

# # Save to CSV
# import pandas as pd
# train_df = pd.DataFrame({
#     'smiles': train_dataset.ids,
#     'solubility': train_dataset.y.flatten()
# })
# train_df.to_csv('esol_train.csv', index=False)

# print("✓ ESOL dataset ready!")
# print(f"Training samples: {len(train_dataset)}")
# print(f"Validation samples: {len(valid_dataset)}")
# print(f"Test samples: {len(test_dataset)}")

In [None]:
# Verify drug discovery data
import pandas as pd
from pathlib import Path
data_path = Path(f'{base_dir}/drug-discovery/data')
print("Drug Discovery Dataset Structure:")
print("=" * 50)

csv_files = list(data_path.glob('*.csv'))
if csv_files:
    for csv_file in csv_files:
        df = pd.read_csv(csv_file, nrows=5)
        print(f"\n📊 File: {csv_file.name}")
        print(f"Columns: {list(df.columns)}")
        print("Sample data:")
        print(df.head())

print("\n✓ Drug Discovery dataset verification complete")

## Step 7: Final Verification & Summary

In [None]:
# Final summary
import os
from pathlib import Path

def get_dir_size(path):
    """Calculate directory size in GB"""
    total = 0
    for entry in Path(path).rglob('*'):
        if entry.is_file():
            total += entry.stat().st_size
    return total / (1024**3)  # Convert to GB

print("="*60)
print("          DATASET DOWNLOAD SUMMARY")
print("="*60)

datasets = [
    ('Chest X-Ray Classification', f'{base_dir}/chest-xray-classification/data'),
    ('Skin Lesion Detection', f'{base_dir}/skin-lesion-detection/data'),
    ('Drug Discovery', f'{base_dir}/drug-discovery/data'),
]

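total_size = 0
missing = []
for name, path in datasets:
    # Step 3 creates these directories up front, so an empty one also counts as missing
    num_files = sum(1 for p in Path(path).rglob('*') if p.is_file()) if os.path.exists(path) else 0
    if num_files:
        size = get_dir_size(path)
        total_size += size
        print(f"\n✓ {name}")
        print(f"  Location: {path}")
        print(f"  Size: {size:.2f} GB")
        print(f"  Files: {num_files:,}")
    else:
        missing.append(name)
        print(f"\n✗ {name} - NOT FOUND")

print("\n" + "="*60)
print(f"TOTAL SIZE: {total_size:.2f} GB")
print("="*60)

# Only report success when every dataset directory actually contains files
if missing:
    print(f"\n⚠️ Missing datasets: {', '.join(missing)}")
else:
    print("\n🎉 All datasets downloaded successfully!")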
print("\nNext steps:")
print("1. Open chest_xray_classification.ipynb to start training")
print("2. Or open skin_lesion_detection.ipynb")
print("3. Or open drug_discovery.ipynb")
print("\n💡 Datasets are saved in your Google Drive and will persist across sessions")

## Optional: Download Additional Datasets

In [None]:
# Alternative chest X-ray dataset (Pneumonia)
# Uncomment to download

# os.chdir(f'{base_dir}/chest-xray-classification/data')
# !kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
# !unzip -q chest-xray-pneumonia.zip -d pneumonia_dataset
# !rm chest-xray-pneumonia.zip
# print("✓ Pneumonia dataset downloaded")

In [None]:
# ISIC 2019 Skin Lesion Dataset (larger, more comprehensive)
# Uncomment to download (WARNING: ~25 GB)

# os.chdir(f'{base_dir}/skin-lesion-detection/data')
# !kaggle datasets download -d nodoubttome/skin-cancer9-classesisic
# !unzip -q skin-cancer9-classesisic.zip -d isic2019
# !rm skin-cancer9-classesisic.zip
# print("✓ ISIC 2019 dataset downloaded")

## Cleanup & Tips

In [None]:
print("💡 IMPORTANT TIPS:")
print("\n1. Datasets are now in your Google Drive - they won't disappear!")
print("2. You can disconnect from Colab and come back later")
print("3. Each training notebook will mount Drive and access these datasets")
print("4. No need to re-download unless you delete the data")
print("\n5. Recommended training order:")
print("   a) Chest X-Ray (4-6 hours) - Start here")
print("   b) Skin Lesion (4-6 hours)")
print("   c) Drug Discovery (4-8 hours)")
print("\n6. You can train multiple models across different sessions")
print("\n7. Your Colab Pro+ compute units: 1131.86 units")
print("   This is MORE than enough for all three projects!")