# Exploratory Data Analysis (EDA)

## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Handling Missing Values](#handling-missing-values)
3. [Feature Distributions](#feature-distributions)
4. [Possible Biases](#possible-biases)
5. [Correlations](#correlations)


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Overview

We used the MHIST dataset, which contains 3,152 fixed-size (224 × 224 pixel) images of hematoxylin and eosin (H&E)-stained, formalin-fixed paraffin-embedded (FFPE) colorectal polyp tissue from the Department of Pathology and Laboratory Medicine at Dartmouth-Hitchcock Medical Center (DHMC). Each image is labeled as either a hyperplastic polyp (HP) or a sessile serrated adenoma (SSA).


In [None]:
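import os
import shutil

import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumption: `datagen` is not defined anywhere in this notebook; a plain
# rescaling generator stands in here for the preprocessing actually used.
datagen = ImageDataGenerator(rescale=1.0 / 255)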

# paths to the dataset
excel_path = "C:/Users/user/Desktop/Tensor-FLow Project/Filik.xlsx"
image_src_dir = "C:/Users/user/Desktop/Tensor-FLow Project/images"  # directory with all .png files
target_base_dir = "C:/Users/user/Desktop/Tensor-FLow Project/images_by_class"  # images sorted into <partition>/<class> folders

# read Excel
df = pd.read_excel(excel_path)
df.columns = ['filename', 'label_str', 'partition']

# copy each image into its <partition>/<label> subfolder
for _, row in df.iterrows():
    label = row['label_str']   # HP or SSA
    part = row['partition']    # train or test
    fname = row['filename']

    src_path = os.path.join(image_src_dir, fname)
    dst_dir = os.path.join(target_base_dir, part, label)
    dst_path = os.path.join(dst_dir, fname)

    os.makedirs(dst_dir, exist_ok=True)

    if os.path.exists(src_path):
        shutil.copy(src_path, dst_path)
    else:
        print(f"⚠️ File not found: {src_path}")

# collect the test-partition images into a dataframe (file path + class)
test_dir = "C:/Users/user/Desktop/Tensor-FLow Project/images_by_class/test"

data = []
for label in ['HP', 'SSA']:
    class_dir = os.path.join(test_dir, label)
    for fname in os.listdir(class_dir):
        data.append({
            'filename': os.path.join(class_dir, fname),
            'class': label
        })

df_test = pd.DataFrame(data)

# split the original test partition into validation (80%) and final test (20%) sets
df_val, df_final_test = train_test_split(
    df_test, test_size=0.2, stratify=df_test['class'], random_state=42
)

train_gen = datagen.flow_from_directory(
    "C:/Users/user/Desktop/Tensor-FLow Project/images_by_class/train",
    target_size=(224, 224),
    class_mode='binary',
    batch_size=16,
    shuffle=True
)

val_gen = datagen.flow_from_dataframe(
    df_val,
    x_col='filename',
    y_col='class',
    target_size=(224, 224),
    class_mode='binary',
    batch_size=16,
    shuffle=False
)

test_gen = datagen.flow_from_dataframe(
    df_final_test,
    x_col='filename',
    y_col='class',
    target_size=(224, 224),
    class_mode='binary',
    batch_size=16,
    shuffle=False
)

# total number of images used
total_images = train_gen.samples + val_gen.samples + test_gen.samples
print(f"Total number of images: {total_images}")
# Output: Total number of images: 2628

# Display the first few rows of the dataframe to show the structure
print("Example data:")
print(df_val.head())
# Output (columns: filename, class):
#370  C:/Users/user/Desktop/Tensor-FLow Project/imag...    HP
#188  C:/Users/user/Desktop/Tensor-FLow Project/imag...    HP
#673  C:/Users/user/Desktop/Tensor-FLow Project/imag...   SSA
#240  C:/Users/user/Desktop/Tensor-FLow Project/imag...    HP
#23   C:/Users/user/Desktop/Tensor-FLow Project/imag...    HP


## Handling Missing Values

There are no missing values.
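
A claim like this can be checked directly on the annotation table. The snippet below is a minimal sketch, assuming the three column names assigned earlier (`filename`, `label_str`, `partition`); the rows are made-up placeholders rather than real MHIST entries, and `summarize_missing` is a hypothetical helper name.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing (NaN/None) entries in each column."""
    return df.isna().sum()


# Toy stand-in for the annotation table (placeholder rows, not real data).
example = pd.DataFrame({
    "filename": ["img_001.png", "img_002.png", "img_003.png"],
    "label_str": ["HP", "SSA", "HP"],
    "partition": ["train", "train", "test"],
})

missing_per_column = summarize_missing(example)
print(missing_per_column)
```

On the real data, the same function could be applied to `df` right after `pd.read_excel`; a result of all zeros would confirm the statement above.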


## Feature Distributions

We removed all images for which 3 or 4 of the 7 pathologists voted for a given class (HP or SSA), because such a near-even split means either label is roughly equally likely. We deleted the corresponding rows from the Excel file, so the code above only copies the remaining images.
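
The filtering was done by hand in Excel, but the same rule could be expressed in pandas. The following is only a sketch: it assumes the annotation table still carries a per-image vote count, here stored in a hypothetical column named `ssa_votes` (the number of the 7 pathologists who chose SSA), and `drop_ambiguous` is an illustrative helper, not part of the original code.

```python
import pandas as pd


def drop_ambiguous(df: pd.DataFrame, votes_col: str = "ssa_votes",
                   n_annotators: int = 7) -> pd.DataFrame:
    """Drop rows whose vote split is as close to a tie as possible.

    With 7 annotators, this removes rows with 3 or 4 votes for SSA.
    """
    low = n_annotators // 2       # 3 of 7
    high = n_annotators - low     # 4 of 7
    keep = ~df[votes_col].isin([low, high])
    return df[keep].reset_index(drop=True)


# Toy table with one image per possible vote count (hypothetical data).
toy = pd.DataFrame({
    "filename": [f"img_{i}.png" for i in range(8)],
    "ssa_votes": list(range(8)),
})

filtered = drop_ambiguous(toy)
print(filtered["ssa_votes"].tolist())
```

Doing the filter in code rather than in the spreadsheet keeps the original annotation file intact and makes the exclusion rule reproducible.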


## Possible Biases

The dataset is class-imbalanced: HP images outnumber SSA images.
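
The size of the imbalance can be quantified by counting labels. This is a minimal sketch on made-up labels (the 7:3 ratio below is illustrative, not the real MHIST ratio); `class_balance` is a hypothetical helper name.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and the fraction of the total for each class."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})


# Toy labels (illustrative ratio only).
toy_labels = pd.Series(["HP"] * 7 + ["SSA"] * 3)

balance = class_balance(toy_labels)
print(balance)
```

On the real data, the same function could be applied to the `label_str` column of `df`, optionally grouped by `partition`.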


In [None]:
# The class imbalance is addressed with a focal loss function at model compilation (see the later folders).
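
To illustrate why focal loss helps with imbalance, here is a NumPy sketch of the binary focal loss. It is not the implementation used in the model code; the function name and the default values `gamma=2.0` and `alpha=0.25` are assumptions taken from common practice, and the settings actually used may differ.

```python
import numpy as np


def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss.

    y_true: array of 0/1 labels.
    y_prob: predicted probability of class 1 for each sample.
    gamma:  focusing parameter; larger values down-weight easy examples more.
    alpha:  weight for the positive class (1 - alpha for the negative class).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


# A confidently correct prediction contributes far less than a wrong one.
easy = binary_focal_loss([1], [0.95])
hard = binary_focal_loss([1], [0.30])
print(easy, hard)
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified samples, so the abundant, easy HP images dominate the gradient less than they would under plain cross-entropy.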

## Correlations

There are no correlations to be described.
