In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
import sys

# --- Optional: For reproducibility ---
np.random.seed(42)
tf.random.set_seed(42)

print(sys.executable)

# --- GitHub Repository Link ---
# My GitHub Repo: [Paste Your GitHub Repository URL Here]

/usr/bin/python3


# Histopathologic Cancer Detection

## 1. Problem and Data Description

In this project, we tackle the Histopathologic Cancer Detection challenge from Kaggle. The goal is to build a binary classification model that identifies metastatic cancer in 96x96 pixel image patches taken from larger digital pathology scans. Automating this step accurately can help pathologists diagnose cancer and reduce their workload.

The dataset consists of:
- A `train_labels.csv` file containing the ID of each training image and its corresponding label (1 for positive, 0 for negative).
- A `train/` folder with approximately 220,000 training images.
- A `test/` folder with approximately 57,000 test images.

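Each image is a 96x96 pixel color image with 3 RGB channels.

The Kaggle data is derived from the PatchCamelyon (PCam) benchmark. Note that the data-loading cell in this notebook does not read the Kaggle folders; it reads the PCam HDF5 training split, which holds 262,144 patches of the same size.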
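
To get a feel for the scale involved, the raw pixel volume can be estimated from the figures above. This is a back-of-the-envelope sketch: the count of 220,000 is the approximate number quoted in the list, and the helper name `raw_size_bytes` is hypothetical.

```python
def raw_size_bytes(n_images, height=96, width=96, channels=3, bytes_per_value=1):
    """Uncompressed size of an image array, in bytes."""
    return n_images * height * width * channels * bytes_per_value


GIB = 1024 ** 3

uint8_bytes = raw_size_bytes(220_000)                       # 8-bit pixels as stored
float32_bytes = raw_size_bytes(220_000, bytes_per_value=4)  # after casting to float32

print(f"uint8:   {uint8_bytes / GIB:.2f} GiB")
print(f"float32: {float32_bytes / GIB:.2f} GiB")
```

The training images alone come to several GiB as 8-bit pixels and four times that as float32, which is why loading everything into memory at once is unattractive on a typical notebook instance.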

In [2]:
# --- Final optimized version: memory-efficient data loading ---
import os
import h5py
import gzip
import numpy as np
import time

# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)  # force_remount=True remounts the drive on every run

# 2. Define the source file and the local paths
source_path_x = '/content/drive/MyDrive/pcamv1/camelyonpatch_level_2_split_train_x.h5.gz'
local_path_x_gz = '/content/camelyonpatch_level_2_split_train_x.h5.gz'
local_path_x_h5 = '/content/camelyonpatch_level_2_split_train_x.h5'  # File name after decompression

# 3. Copy from Drive to local Colab storage (this step is still required)
print("Copying the training data from Google Drive to local Colab storage...")
!cp "{source_path_x}" "{local_path_x_gz}"
print("Copy complete!")

# 4. Decompress the file locally (this is a new step)
# Instead of decompressing in memory, decompress on disk to produce a .h5 file
print(f"Decompressing file: {local_path_x_gz}...")
start_time = time.time()
!gunzip -k "{local_path_x_gz}"  # -k keeps the original .gz file, just in case
end_time = time.time()
print(f"Decompression complete! Elapsed: {end_time - start_time:.2f} s")

# 5. [Core step] Lazy loading with h5py
print("\nOpening the data with h5py lazy loading...")
# Open the decompressed .h5 file directly
h5f = h5py.File(local_path_x_h5, 'r')

# This line completes almost instantly!
# X_train_h5 is not a NumPy array; it is an HDF5 dataset object that points to the data on disk.
X_train_h5 = h5f['x']

print("Lazy loading complete! The data has not been fully read into memory.")
print(f"Type of the dataset object: {type(X_train_h5)}")
print(f"Shape of the dataset object (same as before): {X_train_h5.shape}")

# 6. Verify that lazy loading works
# The object can be indexed like a NumPy array, but data is read from disk only when needed
print("\nReading the first image from disk into memory...")
first_image = X_train_h5[0]  # Reads only the first image
print("Shape of the first image:", first_image.shape)
print("Success!")

# Do the same for the label file
source_path_y = '/content/drive/MyDrive/pcamv1/camelyonpatch_level_2_split_train_y.h5.gz'
local_path_y_gz = '/content/camelyonpatch_level_2_split_train_y.h5.gz'
local_path_y_h5 = '/content/camelyonpatch_level_2_split_train_y.h5'
!cp "{source_path_y}" "{local_path_y_gz}"
!gunzip -k "{local_path_y_gz}"
h5f_y = h5py.File(local_path_y_h5, 'r')
y_train_h5 = h5f_y['y']

Mounted at /content/drive
Copying the training data from Google Drive to local Colab storage...
Copy complete!
Decompressing file: /content/camelyonpatch_level_2_split_train_x.h5.gz...
Decompression complete! Elapsed: 127.71 s

Opening the data with h5py lazy loading...
Lazy loading complete! The data has not been fully read into memory.
Type of the dataset object: <class 'h5py._hl.dataset.Dataset'>
Shape of the dataset object (same as before): (262144, 96, 96, 3)

Reading the first image from disk into memory...
Shape of the first image: (96, 96, 3)
Success!
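
Because the HDF5 dataset objects only read what is indexed, training data can be pulled from disk one slice at a time. The following is a minimal sketch of that idea, not the loading pipeline used later in this notebook; `iter_batches` is a hypothetical helper, and it relies only on the inputs supporting `len()` and slicing, which holds for h5py datasets and NumPy arrays.

```python
def iter_batches(images, labels, batch_size=32):
    """Yield (image_batch, label_batch) slices without loading the whole array.

    Works with anything that supports len() and slicing, e.g. h5py datasets
    or NumPy arrays. Each slice is read only when it is requested.
    """
    n = len(images)
    if len(labels) != n:
        raise ValueError("images and labels must have the same length")
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        yield images[start:stop], labels[start:stop]


# Tiny in-memory stand-in so the sketch runs without the HDF5 files.
demo_images = list(range(10))
demo_labels = [i % 2 for i in range(10)]
batch_sizes = [len(x) for x, _ in iter_batches(demo_images, demo_labels, batch_size=4)]
print(batch_sizes)  # → [4, 4, 2]
```

With the objects opened above, `iter_batches(X_train_h5, y_train_h5)` would yield NumPy arrays of up to 32 patches each, and scaling to [0, 1] could then be applied per batch.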


## 2. Exploratory Data Analysis (EDA)

Here, we will inspect the data to understand its structure and distribution. This will help inform our modeling strategy.

First, let's examine the distribution of labels in the training set.

In [None]:
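# Visualize label distribution
# NOTE: assumes df_labels has been loaded from train_labels.csv (columns: id, label);
# it is not defined earlier in this notebook.
label_counts = df_labels['label'].value_counts()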
print(f"Negative (0) samples: {label_counts[0]}")
print(f"Positive (1) samples: {label_counts[1]}")

label_counts.plot(kind='bar', title='Label Distribution')
plt.xlabel('Label (0: No Cancer, 1: Cancer)')
plt.ylabel('Count')
plt.show()

# Based on the plot, we can see the dataset is fairly well-balanced.
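
If the labels come from the HDF5 file rather than from a dataframe, the same balance check can be done on a NumPy array. The sketch below is illustrative only: `label_summary` is a hypothetical helper, the demo labels are synthetic, and the `(N, 1, 1, 1)` label shape is an assumption about how the PCam label file is stored.

```python
import numpy as np


def label_summary(labels):
    """Return (counts, fractions, class_weights) for a binary 0/1 label array."""
    flat = np.asarray(labels).astype(int).ravel()  # also flattens shapes like (N, 1, 1, 1)
    counts = np.bincount(flat, minlength=2)
    fractions = counts / counts.sum()
    # "Balanced" weighting: n_samples / (n_classes * count_per_class)
    class_weights = counts.sum() / (2 * counts)
    return counts, fractions, class_weights


# Synthetic labels for illustration only; not the real dataset.
demo = np.array([0, 0, 0, 1, 1]).reshape(-1, 1, 1, 1)
counts, fractions, weights = label_summary(demo)
print(counts, fractions, weights)
```

If the classes turn out to be noticeably uneven, weights like these could be passed to Keras through the `class_weight` argument of `fit` as `{0: weights[0], 1: weights[1]}`.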

Now, let's visualize a few sample images from each class.

In [None]:
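# Display sample images
# NOTE: assumes TRAIN_DIR points at the folder of training images;
# it is not defined earlier in this notebook.
# We will use a subset for this demonstration for speed.
# It's recommended to use a more robust data loading pipeline for actual training.
sample_df = df_labels.sample(n=10000, random_state=42).copy()

# Kaggle's train_labels.csv lists ids without the '.tif' extension used by the image
# files, so append it where it is missing before building file paths.
ids = sample_df['id'].astype(str)
sample_df['id'] = ids.where(ids.str.endswith('.tif'), ids + '.tif')

# Split into a smaller training and validation set for faster iteration
train_df, valid_df = train_test_split(sample_df, test_size=0.2, random_state=42, stratify=sample_df['label'])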

# --- Function to display images (for EDA) ---
def display_samples(df, n_samples=5):
    fig, axes = plt.subplots(2, n_samples, figsize=(15, 6))

    # Positive samples
    positive_samples = df[df['label'] == 1].sample(n=n_samples)
    for i, row in enumerate(positive_samples.itertuples()):
        img = plt.imread(os.path.join(TRAIN_DIR, row.id))
        axes[0, i].imshow(img)
        axes[0, i].set_title("Label: 1 (Cancer)")
        axes[0, i].axis('off')

    # Negative samples
    negative_samples = df[df['label'] == 0].sample(n=n_samples)
    for i, row in enumerate(negative_samples.itertuples()):
        img = plt.imread(os.path.join(TRAIN_DIR, row.id))
        axes[1, i].imshow(img)
        axes[1, i].set_title("Label: 0 (No Cancer)")
        axes[1, i].axis('off')

    plt.tight_layout()
    plt.show()

display_samples(train_df)

**EDA Conclusion and Plan:**
The data consists of 96x96 color images and is well-balanced between the two classes. The images appear clean and consistently sized. My plan is to use a data generator to efficiently load images from the directory and feed them into a Convolutional Neural Network (CNN) for classification. I will start with a simple CNN architecture as a baseline and then explore a more complex model using transfer learning to compare performance.

## 3. Model Architecture

For this problem, I will start with a baseline CNN and then implement a model using transfer learning.

### Baseline CNN
My baseline model will be a simple sequential CNN with a few convolutional and pooling layers, followed by dense layers for classification. This architecture is a standard starting point for image classification tasks.

In [None]:
# Keras Data Generator - A more efficient way to handle large image datasets
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale images
datagen = ImageDataGenerator(rescale=1./255.)

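# NOTE: For a real submission, you'd use the full df_labels. We use train_df/valid_df for demonstration.
# flow_from_dataframe with class_mode='binary' expects string labels. assign() returns
# new dataframes, which avoids pandas' SettingWithCopyWarning on the split frames.
train_df = train_df.assign(label=train_df['label'].astype(str))
valid_df = valid_df.assign(label=valid_df['label'].astype(str))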

# Create generators
train_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=TRAIN_DIR,
    x_col='id',
    y_col='label',
    target_size=(96, 96),
    class_mode='binary',
    batch_size=32
)

valid_generator = datagen.flow_from_dataframe(
    dataframe=valid_df,
    directory=TRAIN_DIR,
    x_col='id',
    y_col='label',
    target_size=(96, 96),
    class_mode='binary',
    batch_size=32
)

# --- Define Baseline CNN Model ---
def build_baseline_model():
    model = Sequential([
        Conv2D(32, (3,3), activation='relu', input_shape=(96, 96, 3)),
        MaxPooling2D(2,2),
        Conv2D(64, (3,3), activation='relu'),
        MaxPooling2D(2,2),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5), # Add dropout for regularization
        Dense(1, activation='sigmoid') # Sigmoid for binary classification
    ])

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

baseline_model = build_baseline_model()
baseline_model.summary()
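
The shapes and parameter counts that `baseline_model.summary()` reports can be checked by hand. The sketch below redoes that arithmetic in plain Python, assuming the Keras defaults used in `build_baseline_model` (stride-1 convolutions with `'valid'` padding and non-overlapping 2x2 pooling); the helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def conv_params(kernel, in_ch, out_ch):
    return kernel * kernel * in_ch * out_ch + out_ch  # weights + biases


def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases


size = 96
size = pool_out(conv_out(size))  # Conv2D(32) -> MaxPooling2D: 96 -> 94 -> 47
size = pool_out(conv_out(size))  # Conv2D(64) -> MaxPooling2D: 47 -> 45 -> 22
flat = size * size * 64          # Flatten()

total = (conv_params(3, 3, 32) + conv_params(3, 32, 64)
         + dense_params(flat, 128) + dense_params(128, 1))
print(size, flat, total)  # → 22 30976 3984577
```

Almost all of the roughly 4 million parameters sit in the first `Dense` layer, which is one reason the dropout is placed right after it.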

## 4. Results and Analysis

Now, we will train our baseline model and analyze its performance. We will then try other techniques to see if we can improve the results.

In [None]:
# Train the baseline model
# Note: training for more epochs will yield better results.
# This is a demonstration.
history_baseline = baseline_model.fit(
    train_generator,
    validation_data=valid_generator,
    epochs=5, # Increase epochs for better performance
    verbose=1
)

# --- Function to plot training history ---
def plot_history(history, title):
    plt.figure(figsize=(12, 4))

    # Plot training & validation accuracy values
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title(f'{title} - Model Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='upper left')

    # Plot training & validation loss values
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title(f'{title} - Model Loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='upper left')

    plt.show()

plot_history(history_baseline, "Baseline CNN")

**Analysis of Baseline:**
*[Here, you would write your analysis. For example: "The baseline model achieved a validation accuracy of X%. The loss curves show signs of overfitting, as the training loss continues to decrease while the validation loss flattens. To improve this, I will try a model with transfer learning."]*
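
When writing up an experiment, the numbers can be read straight from the `history.history` dictionary that `fit` returns. Below is a small sketch of such a summary; `summarize_history` is a hypothetical helper, and the values in `fake` are made up purely to exercise it.

```python
def summarize_history(history_dict):
    """Find the best validation epoch and the train/validation gap at that epoch."""
    val_acc = history_dict["val_accuracy"]
    best = max(range(len(val_acc)), key=val_acc.__getitem__)
    val_loss = history_dict["val_loss"]
    return {
        "best_epoch": best + 1,  # 1-based, as Keras prints it
        "val_accuracy": val_acc[best],
        "accuracy_gap": history_dict["accuracy"][best] - val_acc[best],
        # Validation loss ending above its minimum is a common overfitting signal.
        "val_loss_rose_after_min": val_loss[-1] > min(val_loss),
    }


# Made-up numbers purely to exercise the function; not results from this notebook.
fake = {
    "accuracy":     [0.70, 0.80, 0.88, 0.93],
    "val_accuracy": [0.72, 0.79, 0.81, 0.80],
    "loss":         [0.60, 0.45, 0.30, 0.20],
    "val_loss":     [0.58, 0.47, 0.44, 0.49],
}
summary = summarize_history(fake)
print(summary)
```

Calling `summarize_history(history_baseline.history)` after training would give the figures needed for the analysis paragraph.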

### Transfer Learning Model

Next, I will use a pre-trained model (VGG16) as a feature extractor. This is a powerful technique that leverages knowledge from a model trained on a much larger dataset (ImageNet).

In [None]:
# --- Build Transfer Learning Model ---
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

def build_transfer_model():
    # Load VGG16 base, pre-trained on ImageNet, without the top classification layer
    base_model = VGG16(weights='imagenet', include_top=False, input_shape=(96, 96, 3))

    # Freeze the base model layers
    base_model.trainable = False

    # Add our custom classifier on top
    x = base_model.output
    x = Flatten()(x)
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.5)(x)
    predictions = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=base_model.input, outputs=predictions)

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), # Lower learning rate; the VGG16 base is frozen, so only the new classifier head is trained
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

transfer_model = build_transfer_model()
transfer_model.summary()

# Train the transfer learning model
history_transfer = transfer_model.fit(
    train_generator,
    validation_data=valid_generator,
    epochs=5, # Increase epochs for better performance
    verbose=1
)

plot_history(history_transfer, "Transfer Learning (VGG16)")
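
One detail worth checking: the generators above rescale pixels to [0, 1], whereas the ImageNet weights for VGG16 were trained on inputs prepared by `tf.keras.applications.vgg16.preprocess_input`, which reorders channels from RGB to BGR and subtracts per-channel means without any scaling. The NumPy sketch below mimics that transformation to show how different the two input ranges are. `vgg16_style_preprocess` is a hypothetical stand-in written for illustration; it is not the Keras function itself.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by Keras' "caffe"-style preprocessing.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def vgg16_style_preprocess(rgb_batch):
    """RGB -> BGR, then zero-centre each channel (no division)."""
    x = np.asarray(rgb_batch, dtype=np.float32)
    bgr = x[..., ::-1]
    return bgr - IMAGENET_BGR_MEAN


batch = np.full((2, 96, 96, 3), 255, dtype=np.uint8)  # all-white dummy patches
rescaled = batch / 255.0
centred = vgg16_style_preprocess(batch)

print("rescale=1/255 range:", rescaled.min(), rescaled.max())
print("VGG16-style range:  ", centred.min(), centred.max())
```

In practice the Keras function could be passed as `preprocessing_function` to `ImageDataGenerator` in place of `rescale`, using a separate generator for the VGG16 model. Whether this helps on histology patches is something to verify experimentally.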

**Analysis of Transfer Learning Model:**
*[Here, you would compare the results. For example: "The transfer learning model significantly outperformed the baseline, achieving a validation accuracy of Y%. This demonstrates the power of using pre-trained features. The convergence was also faster."]*

## 5. Conclusion

In this project, I explored building a CNN to detect histopathologic cancer.

**Key Findings:**
- The baseline CNN provided a reasonable starting point but showed signs of overfitting.
- The transfer learning approach using a pre-trained VGG16 model yielded substantially better results, achieving a higher validation accuracy more quickly.
- Regularization techniques like Dropout were crucial in controlling overfitting, especially in the custom classifier built on top of the VGG16 base.

**Future Improvements:**
If I had more time, I would:
- Implement data augmentation (e.g., random flips, rotations) to further reduce overfitting and improve model generalization.
- Experiment with fine-tuning more layers of the pre-trained model instead of just training the top classifier.
- Try other pre-trained architectures like ResNet or InceptionV3 to see if they perform better.
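
As an illustration of the first item, tissue patches have no canonical orientation, so flips and quarter-turn rotations are natural augmentations that keep the centre of the patch in place. The following is a minimal NumPy sketch under that assumption; `random_flip_rot90` is a hypothetical helper, and the patch is random dummy data.

```python
import numpy as np


def random_flip_rot90(image, rng):
    """Apply a random horizontal flip, vertical flip and 0/90/180/270-degree rotation."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :, :]  # vertical flip
    k = int(rng.integers(0, 4))    # number of quarter turns
    return np.rot90(image, k=k, axes=(0, 1))


rng = np.random.default_rng(42)
patch = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)  # dummy patch
augmented = random_flip_rot90(patch, rng)
print(augmented.shape)  # → (96, 96, 3)
```

With the generator-based pipeline used in this notebook, similar flips are available through the `horizontal_flip` and `vertical_flip` arguments of `ImageDataGenerator`.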

## 6. Kaggle Submission

Finally, I will use my best-performing model (the transfer learning model) to make predictions on the test set and generate a submission file.

In [None]:
# --- Generate Submission File ---
# NOTE: assumes TEST_DIR points at the folder of test images; it is not defined earlier in this notebook.
# The test directory has no label file, so the generator is built from a plain list of file names.

# Create a dataframe for test images
test_files = os.listdir(TEST_DIR)
test_df = pd.DataFrame({'id': test_files})

test_datagen = ImageDataGenerator(rescale=1./255.)
test_generator = test_datagen.flow_from_dataframe(
    dataframe=test_df,
    directory=TEST_DIR,
    x_col='id',
    y_col=None, # No labels for test data
    class_mode=None, # No labels
    target_size=(96, 96),
    shuffle=False, # Important: do not shuffle test data
    batch_size=32
)

# Make predictions
predictions = transfer_model.predict(test_generator)

# Format for submission. The competition is scored on ROC AUC, so submit the
# predicted probabilities rather than labels thresholded at 0.5.
predicted_probs = predictions.flatten()
submission_df = pd.DataFrame({
    'id': [os.path.splitext(f)[0] for f in test_generator.filenames],
    'label': predicted_probs
})

submission_df.to_csv('submission.csv', index=False)
print("Submission file created successfully!")
print(submission_df.head())
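
The competition is scored on area under the ROC curve, which depends on how the predictions rank the test images rather than on a fixed threshold. The toy sketch below shows how thresholding can lose ranking information; `roc_auc` is a hypothetical pure-Python helper, and the labels and scores are invented for illustration.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half. O(n_pos * n_neg), which is fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.60, 0.55, 0.90, 0.30, 0.70]
hard = [int(p > 0.5) for p in probs]

auc_probs = roc_auc(y_true, probs)
auc_hard = roc_auc(y_true, hard)
print(f"AUC with probabilities: {auc_probs:.3f}")
print(f"AUC with 0/1 labels:    {auc_hard:.3f}")
```

In this toy case the probabilities score higher than the thresholded labels, because rounding to 0/1 turns many distinct scores into ties.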

### Kaggle Leaderboard Screenshot

*[Insert your Kaggle leaderboard screenshot here]*