# Data Exploration - Notebook 1
## Understanding the face mask dataset


- Checking image sizes and quality
- Identifying missing or corrupted images
- Analyzing label distribution
- Summarizing key findings


In [9]:
import os
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from tqdm import tqdm
from pathlib import Path

In [3]:
projectRoot = Path().resolve().parent
datasetPath = projectRoot / "CV2024_CW_Dataset"

# Defining dataset paths dynamically
trainImagePath = datasetPath / "train" / "images"
trainLabelPath = datasetPath / "train" / "labels"
testImagePath = datasetPath / "test" / "images"
testLabelPath = datasetPath / "test" / "labels"

# ensuring paths exist first
for path in [trainImagePath, trainLabelPath, testImagePath, testLabelPath]:
    if not path.exists():
        raise FileNotFoundError(f"Path not found: {path}")

# Convert paths to strings for OpenCV compatibility
trainImagePath = str(trainImagePath)
trainLabelPath = str(trainLabelPath)
testImagePath = str(testImagePath)
testLabelPath = str(testLabelPath)

# counting files
trainImages = sorted(os.listdir(trainImagePath))
trainLabels = sorted(os.listdir(trainLabelPath))
testImages = sorted(os.listdir(testImagePath))
testLabels = sorted(os.listdir(testLabelPath))

print(f"Dataset located at: {datasetPath}")
print(f"Total Training Images: {len(trainImages)}")
print(f"Total Training Labels: {len(trainLabels)}")
print(f"Total Testing Images: {len(testImages)}")
print(f"Total Testing Labels: {len(testLabels)}")


Dataset located at: C:\3rd year uni\IN1 Computer Vision\MaskDetection\CV2024_CW_Dataset
Total Training Images: 2394
Total Training Labels: 2394
Total Testing Images: 458
Total Testing Labels: 458
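Matching counts do not by themselves show that every image has its own label file. A minimal sketch of a pairing check follows, assuming each label file shares its file stem with the image it describes (an assumption about this dataset's naming, not something confirmed above). The helper name `findUnpairedFiles` and the example file names are hypothetical.

```python
from pathlib import Path

def findUnpairedFiles(imageNames, labelNames):
    """Return (imagesWithoutLabel, labelsWithoutImage), compared by file stem."""
    imageStems = {Path(name).stem for name in imageNames}
    labelStems = {Path(name).stem for name in labelNames}
    return sorted(imageStems - labelStems), sorted(labelStems - imageStems)

# Hypothetical file names, used only to illustrate the check
exampleImages = ["image_0001.jpeg", "image_0002.jpeg", "image_0003.jpeg"]
exampleLabels = ["image_0001.txt", "image_0003.txt", "image_0004.txt"]

missingLabels, missingImages = findUnpairedFiles(exampleImages, exampleLabels)
print("Images without a label:", missingLabels)
print("Labels without an image:", missingImages)
```

In this notebook the same call could be made with `trainImages` and `trainLabels`, then again with `testImages` and `testLabels`; two empty lists would indicate a complete pairing.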


## Checking images 

In [8]:
corruptImages = []

#checking if images can be opened
for imageName in tqdm(trainImages, desc="Checking training images"):
    imagePath = os.path.join(trainImagePath, imageName)
    image = cv2.imread(imagePath)
    
    if image is None:
        corruptImages.append(imageName)

for imageName in tqdm(testImages, desc="Checking test images"):
    imagePath = os.path.join(testImagePath, imageName)
    image = cv2.imread(imagePath)
    
    if image is None:
        corruptImages.append(imageName)

# display results
if corruptImages:
    print(f"Corrupt images found: {len(corruptImages)}")
    print("List of corrupt images:", corruptImages)
else:
    print("All images readable no corruption")

Checking training images: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2394/2394 [00:00<00:00, 9452.30it/s]
Checking test images: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 458/458 [00:00<00:00, 9586.50it/s]

All images readable no corruption
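The readability check above does not yet cover image sizes. One possible way to summarize them is sketched below, assuming the shapes have already been collected as `(height, width)` tuples. The helper `summarizeShapes` and the example shapes are hypothetical and are not measurements from this dataset.

```python
from collections import Counter

def summarizeShapes(shapes):
    """Summarize an iterable of (height, width) tuples."""
    shapes = list(shapes)
    if not shapes:
        raise ValueError("No shapes to summarize")
    heights = [height for height, _ in shapes]
    widths = [width for _, width in shapes]
    return {
        "count": len(shapes),
        "minHeight": min(heights),
        "maxHeight": max(heights),
        "minWidth": min(widths),
        "maxWidth": max(widths),
        # The three most frequent (height, width) pairs with their counts
        "mostCommon": Counter(shapes).most_common(3),
    }

# Hypothetical shapes, used only to illustrate the summary
exampleShapes = [(64, 48), (64, 48), (120, 90), (32, 32)]
summary = summarizeShapes(exampleShapes)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The shapes could be gathered inside the loops above by appending `image.shape[:2]` for every `image` that `cv2.imread` loads successfully.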





## Analysing label distributions
- **0** → No Mask
- **1** → Mask
- **2** → Improper Mask

In [11]:
# Read all label files and count occurrences
def countLabels(labelPath, labelFiles):
    labelCounts = Counter()
    
    for labelFile in tqdm(labelFiles, desc=f"Reading labels from {labelPath}"):
        filePath = os.path.join(labelPath, labelFile)
        with open(filePath, 'r') as file:
            label = int(file.read().strip())  # Convert label to int
            labelCounts[label] += 1

    return labelCounts

# Count labels for Train and Test sets
trainLabelCounts = countLabels(trainLabelPath, trainLabels)
testLabelCounts = countLabels(testLabelPath, testLabels)

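# Convert counts to DataFrame
def createLabelDataFrame(labelCounts, datasetSize):
    # Map label IDs to names explicitly: a Counter keeps first-seen order,
    # so assigning names by row position could attach a name to the wrong class.
    labelNames = {0: "No Mask (0)", 1: "Mask (1)", 2: "Improper Mask (2)"}
    df = pd.DataFrame.from_dict(labelCounts, orient='index', columns=['Count']).sort_index()
    df['Percentage'] = (df['Count'] / datasetSize) * 100  # Calculate percentage
    df.index = df.index.map(labelNames)
    return df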

trainLabelDistribution = createLabelDataFrame(trainLabelCounts, len(trainImages))
testLabelDistribution = createLabelDataFrame(testLabelCounts, len(testImages))

print("Label Distribution in Training Set:")
print(trainLabelDistribution)

print("\nLabel Distribution in Test Set:")
print(testLabelDistribution)

Reading labels from C:\3rd year uni\IN1 Computer Vision\MaskDetection\CV2024_CW_Dataset\train\labels: 100%|█████████████████████████████████████| 2394/2394 [00:00<00:00, 16658.45it/s]
Reading labels from C:\3rd year uni\IN1 Computer Vision\MaskDetection\CV2024_CW_Dataset\test\labels: 100%|█████████████████████████████████████████| 458/458 [00:00<00:00, 9810.84it/s]

Label Distribution in Training Set:
                   Count  Percentage
No Mask (0)          376   15.705931
Mask (1)            1940   81.035923
Improper Mask (2)     78    3.258145

Label Distribution in Test Set:
                   Count  Percentage
No Mask (0)          388   84.716157
Mask (1)              51   11.135371
Improper Mask (2)     19    4.148472





## Summary of Findings

### Dataset Quality
- **Image and label counts match** in both splits (2394 train, 458 test), so no files appear to be missing
- **No corrupt images** found

### Class Distribution Observations
#### Training Set
- **Mask (1)** dominates with **81%** of images
- **Improper Mask (2)** accounts for only **3%** of images
- **No Mask (0)** is underrepresented at about **16%**

#### Testing Set
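- According to the table above, **No Mask (0) dominates at 85%**, unlike the training set. Because this inverts the training distribution, it is worth confirming against the raw label counts per class ID.
- **Mask (1) drops to 11%**, which is inconsistent with the training distribution
- **Improper Mask (2) remains low (4%)**, so evaluation of this class rests on only 19 images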
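
The gap between the two splits can be expressed as a single number. The sketch below computes the total variation distance (half the sum of absolute differences between class proportions) from the counts exactly as printed in the tables above. The helper name `totalVariationDistance` is hypothetical, and the result is only as reliable as the row labels in those tables.

```python
def totalVariationDistance(countsA, countsB):
    """Total variation distance between two label-count dictionaries (0 = identical, 1 = disjoint)."""
    labels = set(countsA) | set(countsB)
    totalA = sum(countsA.values())
    totalB = sum(countsB.values())
    return 0.5 * sum(
        abs(countsA.get(label, 0) / totalA - countsB.get(label, 0) / totalB)
        for label in labels
    )

# Counts copied from the printed tables, keyed by the class ID shown in each row name
trainCounts = {0: 376, 1: 1940, 2: 78}
testCounts = {0: 388, 1: 51, 2: 19}

tvd = totalVariationDistance(trainCounts, testCounts)
print(f"Total variation distance between train and test: {tvd:.3f}")
```

A value near 0 would mean the splits share a similar class mix, while a value near 1 would mean they barely overlap.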

### Impact on Model Performance
- **Severe class imbalance** is likely to **bias predictions** toward the majority class
- **Test set distribution differs from the training distribution**, which is likely to hurt **generalization** as measured on the test set
- **Underrepresentation of Improper Mask (2)** means the model will probably **struggle to predict this class correctly**
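
One common way to counter this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below is an illustration under that assumption, not a method this notebook has adopted. It uses the heuristic `total / (numClasses * count)` on the training counts printed earlier, and the helper name `inverseFrequencyWeights` is hypothetical.

```python
def inverseFrequencyWeights(labelCounts):
    """Return a weight per class: total / (numClasses * count). Counts must be non-zero."""
    total = sum(labelCounts.values())
    numClasses = len(labelCounts)
    return {label: total / (numClasses * count) for label, count in labelCounts.items()}

# Training counts copied from the printed table, keyed by class ID
trainCounts = {0: 376, 1: 1940, 2: 78}

classWeights = inverseFrequencyWeights(trainCounts)
for label, weight in sorted(classWeights.items()):
    print(f"Class {label}: weight {weight:.2f}")
```

Rare classes receive weights above 1 and the majority class a weight below 1, so a loss function that accepts per-class weights would penalize mistakes on Improper Mask (2) most heavily.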

