# Project Focus
This project focuses on identifying the type of breast tumor present in a given tissue sample. Some tumor types are benign, such as adenosis, fibroadenoma, phyllodes tumor, and tubular adenoma, while others are malignant and must be treated promptly, including ductal carcinoma, lobular carcinoma, mucinous carcinoma, and papillary carcinoma. We utilized an ____ model to identify tumor types both accurately and quickly.

# BreakHis Dataset
The BreakHis dataset contains 7,909 images of breast tumor tissue collected from 82 patients. Each image was captured at one of four magnification factors: 40x, 100x, 200x, or 400x. Each image is a 700x460-pixel, 3-channel RGB PNG with 8-bit depth per channel. The images are divided into benign and malignant tumors, with 2,480 benign and 5,429 malignant images. The benign tumor types are adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA), and the malignant tumor types are ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC). The tissue samples were collected through partial mastectomy or excisional biopsy.
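The class codes and image counts above can be written down as a small lookup table. This is only a sketch: the names `BENIGN`, `MALIGNANT`, `CLASS_INDEX`, and `is_malignant` are hypothetical, and the index order (benign types first, then malignant) is an assumption chosen to match the numeric labels used in the dataset loader later in this notebook.

```python
# Class codes as listed in the dataset description.
BENIGN = {
    "A": "adenosis",
    "F": "fibroadenoma",
    "PT": "phyllodes tumor",
    "TA": "tubular adenoma",
}
MALIGNANT = {
    "DC": "ductal carcinoma",
    "LC": "lobular carcinoma",
    "MC": "mucinous carcinoma",
    "PC": "papillary carcinoma",
}

# Hypothetical index assignment: benign codes get 0-3, malignant codes get 4-7.
CLASS_INDEX = {code: i for i, code in enumerate([*BENIGN, *MALIGNANT])}


def is_malignant(code: str) -> bool:
    """Return True if the class code belongs to a malignant tumor type."""
    return code in MALIGNANT


# Image counts stated in the dataset description.
N_BENIGN, N_MALIGNANT = 2480, 5429
malignant_fraction = N_MALIGNANT / (N_BENIGN + N_MALIGNANT)
print(f"{malignant_fraction:.1%} of images are malignant")  # 68.6%
```

Because roughly two thirds of the images are malignant, a model that always predicts "malignant" would already reach about 69% binary accuracy, which is worth keeping in mind when reading accuracy numbers.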

# Methodology
(This is an example, not sure how we want to do all of this)
ViT model: The dataset was already split into training and testing groups, so I used that split to train and evaluate the model. I originally used a binary cross-entropy loss function to distinguish only between malignant and benign tumors, then switched to a cross-entropy loss function to attempt to identify all 8 tumor types. To reduce overfitting, I applied transformations to the training images.
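To make the change of loss function concrete, here is a minimal sketch of the two setups on random stand-in logits. It assumes `nn.BCEWithLogitsLoss` for the binary case (the text only says "binary cross-entropy") and assumes class indices 0-3 are benign and 4-7 are malignant; the batch below is made up for illustration and no real model is involved.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_classes = 4, 8

# Made-up 8-class labels; indices 0-3 are assumed benign, 4-7 malignant.
labels = torch.tensor([0, 3, 4, 7])

# Binary setup: one logit per image, float target of 1.0 for malignant.
binary_targets = (labels >= 4).float()
binary_logits = torch.randn(batch_size)  # stand-in for model output
bce = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)

# Multi-class setup: eight logits per image, integer class targets.
multi_logits = torch.randn(batch_size, num_classes)  # stand-in for model output
ce = nn.CrossEntropyLoss()(multi_logits, labels)

print(f"binary loss: {bce.item():.4f}, 8-class loss: {ce.item():.4f}")
```

The main practical difference is the shape of the model head and targets: the binary version needs a single output and float targets, while the 8-class version needs eight outputs and integer targets.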


In [None]:
import os
import numpy as np
import pandas as pd
from PIL import Image
import time

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
import torchvision.transforms as T
import torch.backends.cudnn as cudnn
from torch.optim.lr_scheduler import OneCycleLR

# data visualization (TODO: decide how to visualize the data)

In [None]:
# Example dataset loading (unsure if this is the same for all of us)

class BreakHisDataset(Dataset):
  """BreakHis images listed in a CSV file with 'filename' and 'grp' columns."""

  def __init__(self, csv_file, root_dir, train=True, transform=None):
    self.root_dir = root_dir
    self.transform = transform

    # Keep only the rows whose 'grp' column matches the requested split.
    data_frame = pd.read_csv(csv_file)
    split = "train" if train else "test"
    mask = data_frame['grp'].str.lower() == split
    # Reassign rather than resetting the index in place on a filtered slice.
    self.data_frame = data_frame[mask].reset_index(drop=True)

  def __len__(self):
    return len(self.data_frame)

  def __getitem__(self, idx):
    row = self.data_frame.iloc[idx]
    filename = row['filename']
    
    img_path = os.path.join(self.root_dir, filename)
    image = Image.open(img_path).convert('RGB')
  
    # The label is the index of the first class name found in the file path:
    # 0-3 are benign types, 4-7 are malignant types.
    class_names = (
      "adenosis", "fibroadenoma", "phyllodes_tumor", "tubular_adenoma",
      "ductal_carcinoma", "lobular_carcinoma", "mucinous_carcinoma",
      "papillary_carcinoma",
    )
    lower_filename = filename.lower()
    label = next(
      (i for i, name in enumerate(class_names) if name in lower_filename),
      None,
    )
    if label is None:
      raise ValueError(f"Cannot determine label from filename: {filename}")

    if self.transform:
      image = self.transform(image)
    return image, label