## Exploratory Data Analysis

In [None]:
from pathlib import Path
from collections import Counter
import pandas as pd

The code cell below checks that all images share the same file format by counting the file extensions found under the data directory.

In [17]:
# Notebook is in notebooks/
# Data is in ../data/interim/
# Make sure to adjust the path accordingly
DATA_DIR = Path("../data/interim")

# Collect the extension of every file to confirm all images share one format
extensions = []

for path in DATA_DIR.rglob("*"):
    if path.is_file():
        extensions.append(path.suffix.lower())

extension_counts = Counter(extensions)
extension_counts

Counter({'.jpg': 25553, '.yaml': 1, '': 1})
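
The counts show two files that are not `.jpg`: one `.yaml` file and one file with no extension. As a minimal sketch, they could be listed as below to confirm they are not stray images. `find_non_jpg` is a hypothetical helper that is not part of the original notebook, and `DATA_DIR` is assumed to be the same path as in the cell above.

```python
from pathlib import Path


def find_non_jpg(data_dir: Path) -> list[Path]:
    """Return all files under data_dir whose suffix is not .jpg (case-insensitive)."""
    return sorted(
        p
        for p in data_dir.rglob("*")
        if p.is_file() and p.suffix.lower() != ".jpg"
    )


# Assumed to match the notebook's data location; adjust if needed.
DATA_DIR = Path("../data/interim")

if DATA_DIR.is_dir():
    for path in find_non_jpg(DATA_DIR):
        # Print paths relative to the data directory for readability
        print(path.relative_to(DATA_DIR))
```

If the listed files turn out to be metadata rather than images, they do not affect the per-class counts, as long as they sit outside the class folders.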

Next, count the number of images per split and per class.

In [18]:
# Counting images per split and class
splits = ["train", "val", "test"]
classes = ["normal", "pneumonia", "tuberculosis"]

records = []

for split in splits:
    for cls in classes:
        class_dir = DATA_DIR / split / cls
        count = len(list(class_dir.glob("*.*")))
        
        records.append({
            "split": split,
            "class": cls,
            "num_images": count
        })

counts_df = pd.DataFrame(records)
counts_df.to_csv("../data/class_counts.csv", index=False)
counts_df


| | split | class | num_images |
|---|---|---|---|
| 0 | train | normal | 7263 |
| 1 | train | pneumonia | 4674 |
| 2 | train | tuberculosis | 8513 |
| 3 | val | normal | 900 |
| 4 | val | pneumonia | 570 |
| 5 | val | tuberculosis | 1064 |
| 6 | test | normal | 925 |
| 7 | test | pneumonia | 580 |
| 8 | test | tuberculosis | 1064 |


Given the table above, I will keep the current split. The reasons are: (a) every split contains all three classes, (b) the class proportions are similar across the splits, (c) it keeps results comparable with existing work, and (d) there is no patient metadata that could inform a different split.
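
The second reason (similar class proportions across splits) can be checked directly from the counts. The sketch below rebuilds the table by hand so that it runs on its own; in the notebook, the existing `counts_df` could be used instead. The names `pct_of_split` and `proportions` are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the table above (split, class, num_images).
counts_df = pd.DataFrame(
    [
        ("train", "normal", 7263),
        ("train", "pneumonia", 4674),
        ("train", "tuberculosis", 8513),
        ("val", "normal", 900),
        ("val", "pneumonia", 570),
        ("val", "tuberculosis", 1064),
        ("test", "normal", 925),
        ("test", "pneumonia", 580),
        ("test", "tuberculosis", 1064),
    ],
    columns=["split", "class", "num_images"],
)

# Share of each class within its own split, in percent.
split_totals = counts_df.groupby("split")["num_images"].transform("sum")
counts_df["pct_of_split"] = (100 * counts_df["num_images"] / split_totals).round(1)

# One row per split, one column per class, in the usual split order.
proportions = counts_df.pivot(index="split", columns="class", values="pct_of_split")
proportions = proportions.loc[["train", "val", "test"]]
print(proportions)
```

With these counts, each class's share differs by less than one percentage point between splits, which is consistent with keeping the split as it is.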