# Information Retrieval Project
## COVID-19 Search Engine (P12)

Content-based image retrieval (CBIR) is a computer vision technique which addresses the problem of searching for digital images in large databases. A content-based approach exploits the contents of an image, such as colors, shapes and textures, differing from its concept-based counterpart, which instead focuses on keywords and tags associated with the image itself.

Image retrieval has become increasingly relevant in the medical field because hospitals accumulate extensive collections of scans. These images are stored in DICOM format and must be manually annotated, which can take physicians considerable time. This project aims to address this problem by considering different approaches to building a content-based medical image retrieval system and comparing their results in terms of classification metrics and computational time.

# Installation of Requirements

This part of the code installs the dependencies listed in the `requirements.txt` file by running `!pip install -r requirements.txt`. This ensures that all required packages are installed before the rest of the notebook is executed.


In [None]:
!pip install -r requirements.txt

# Importing Required Libraries and Modules Summary

This part of the code imports necessary libraries and modules for further execution, providing functionalities for dataset management, visualization, and image processing.

## Imported Libraries and Modules
- `os`: A standard library for interacting with the operating system.
- `kaggle`: A library for accessing the Kaggle API to download datasets.
- `platform`: A standard library for accessing platform-specific information.
- `imagehash`: A library for computing perceptual image hashes, useful for identifying duplicate images.
- `matplotlib.pyplot`: A module from the Matplotlib library used for visualizing images and plotting graphs.
- `PIL.Image`: A module from Pillow, the maintained fork of the Python Imaging Library (PIL), used for opening, manipulating, and saving many different image file formats.
- `UnidentifiedImageError`: An exception raised when an image file cannot be identified or opened by PIL.

In [31]:
import os
import PIL
import kaggle
import platform
import imagehash
import matplotlib

from matplotlib import pyplot as plt
from PIL import Image, UnidentifiedImageError

# Python and Library Version Checking

This part of the code prints the versions of Python, Matplotlib, and Pillow installed in the environment.


In [33]:
# Print the Python and library versions
print("- Python version: {}".format(platform.python_version()))
print("- Matplotlib version is: {}".format(matplotlib.__version__))
print("- Pillow version is: {}".format(PIL.__version__))

- Python version: 3.12.0
- Matplotlib version is: 3.8.3
- Pillow version is: 10.2.0


# Dataset Paths and Constants Summary

This part of the code defines useful constants and paths.

## 1. Defined Constants and Paths
- `DATASET_PATH`: Specifies the path to the dataset folder named "archives".
- `DATASET_ID`: Provides the unique identifier necessary for downloading the dataset using the Kaggle API.
- `COVID_PATH`: Defines the path to the directory containing COVID images within the dataset folder.
- `NON_COVID_PATH`: Defines the path to the directory containing non-COVID images within the dataset folder.


In [34]:
# Paths and dataset identifier
DATASET_PATH = "archives"
DATASET_ID = "plameneduardo/sarscov2-ctscan-dataset"
COVID_PATH = os.path.join(DATASET_PATH, "COVID")
NON_COVID_PATH = os.path.join(DATASET_PATH, "non-COVID")

# Dataset Download from Kaggle

This part of the code checks if the dataset exists in the workspace and downloads it from Kaggle if it does not already exist.

## 1. Checking Dataset Existence
- The code uses the `os.path.exists()` function to check if the dataset is already present in the workspace.

## 2. Downloading Dataset from Kaggle
- If the dataset does not exist, the code proceeds to download it from Kaggle using the `kaggle.api.dataset_download_files()` function.
- The `DATASET_ID` and `DATASET_PATH` variables specify the dataset to download and the location to save it, respectively.
- The `unzip=True` parameter ensures that the downloaded dataset is unzipped after download.
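
Note that `os.path.exists(DATASET_PATH)` only confirms that the folder exists, which would also be true after an interrupted download. A stricter check, sketched below under the assumption that the dataset unpacks into the `COVID` and `non-COVID` subfolders used in this notebook, could verify that each expected subfolder is present and non-empty. The helper name `dataset_is_ready` is hypothetical and not part of the project's code.

```python
import os


def dataset_is_ready(dataset_path, subfolders=("COVID", "non-COVID")):
    """Return True if every expected subfolder exists and holds at least one entry."""
    for name in subfolders:
        folder = os.path.join(dataset_path, name)
        if not os.path.isdir(folder):
            return False
        # os.scandir is lazy, so this stops at the first entry found
        with os.scandir(folder) as entries:
            if next(entries, None) is None:
                return False
    return True
```

Such a helper could replace the plain existence test in the download cell, so that an empty or partial `archives` folder triggers a fresh download.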


In [35]:
# Download the dataset if it does not exist in the workspace
if not os.path.exists(DATASET_PATH):
    print("\n> Download the dataset from Kaggle...")
    # Download dataset and unzip it
    kaggle.api.dataset_download_files(dataset=DATASET_ID, path=DATASET_PATH, quiet=False, unzip=True)
else:
    print("\n> Dataset already downloaded.")


> Dataset already downloaded.


# Dataset Preprocessing Summary

This code performs preprocessing tasks on the dataset: counting files, checking for corrupted files, detecting duplicates, and plotting each duplicate image alongside its original.

## 1. Counting Files
The `count_files` function counts the number of files with specified extensions in a specified directory. It takes the directory path and file extensions as inputs and returns the count of files.
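
As a usage illustration, the sketch below builds a throwaway directory and counts its files by extension. It uses a stand-in helper, `count_by_extension` (a hypothetical name), that mirrors the behaviour described above. A single extension passed as a string is wrapped in a tuple first, because iterating over a bare string would compare file names against individual characters.

```python
import os
import tempfile


def count_by_extension(directory, extensions=("jpg",)):
    """Count regular files in `directory` whose names end with one of `extensions`."""
    if extensions is None:
        extensions = ("",)  # every name ends with the empty string
    elif isinstance(extensions, str):
        extensions = (extensions,)
    suffixes = tuple(ext.lower() for ext in extensions)
    with os.scandir(directory) as entries:
        return sum(1 for e in entries if e.is_file() and e.name.lower().endswith(suffixes))


with tempfile.TemporaryDirectory() as tmp:
    # Create four empty files with mixed extensions and letter case
    for name in ("a.jpg", "b.JPG", "c.png", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    jpg_count = count_by_extension(tmp, "jpg")
    image_count = count_by_extension(tmp, ["jpg", "png"])
    all_count = count_by_extension(tmp, None)

print(jpg_count, image_count, all_count)  # 2 3 4
```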

## 2. Checking for Corrupted Files
The `corruption_filter` function checks the dataset for corrupted image files and provides an option to delete them. It walks the dataset's subfolders, verifies each `jpg` file with PIL's `Image.verify()`, collects the files that fail, and asks the user whether to delete them.
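
A full integrity check can be complemented by a cheap structural pre-check. The sketch below is not what `corruption_filter` does. It is a heuristic that assumes the files are JPEGs and only tests for the start-of-image (`FF D8`) and end-of-image (`FF D9`) markers, which catches empty and truncated files but not damaged pixel data. The function name `looks_like_complete_jpeg` is hypothetical.

```python
import os

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic: True if the file starts with SOI and ends with EOI."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as handle:
        head = handle.read(2)
        handle.seek(-2, os.SEEK_END)  # jump to the last two bytes
        tail = handle.read(2)
    return head == JPEG_SOI and tail == JPEG_EOI
```

Files flagged by this check would still need confirmation with `Image.verify()`, since some valid JPEGs carry trailing bytes after the end-of-image marker.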

## 3. Finding and Handling Duplicates
The `find_out_duplicate` function identifies duplicate images within a dataset folder. It computes an average hash of each image with `imagehash.average_hash` and looks it up among the hashes seen so far; a match marks the image as a duplicate of the first image with that hash. If duplicates are found, it plots each pair of original and duplicate images side by side using Matplotlib, then prompts the user to decide whether to delete the duplicates.
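
To make the hashing step concrete, the sketch below reimplements the idea behind an average hash in plain Python on a tiny grayscale grid. It is a simplification: `imagehash.average_hash` also converts the image to grayscale and resizes it to `hash_size` x `hash_size` before thresholding, and those steps are omitted here. The names `average_hash_bits` and `find_duplicates`, and the sample pixel values, are hypothetical.

```python
def average_hash_bits(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def find_duplicates(images):
    """Map each duplicate name to the first name seen with the same hash."""
    seen = {}        # hash -> first image name with that hash
    duplicates = {}  # duplicate name -> original name
    for name, pixels in images.items():
        key = average_hash_bits(pixels)
        if key in seen:
            duplicates[name] = seen[key]
        else:
            seen[key] = name
    return duplicates


images = {
    "scan_a": [[10, 200], [220, 30]],
    "scan_b": [[12, 198], [221, 28]],  # near-identical to scan_a
    "scan_c": [[200, 10], [30, 220]],  # inverted pattern
}
result = find_duplicates(images)
print(result)  # {'scan_b': 'scan_a'}
```

Because the hash only records which cells lie above the mean, images that differ slightly in brightness or noise collapse to the same key. Plotting each pair before deletion, as the function does, is therefore a useful safeguard against false positives.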

## Usage
1. The code begins by printing the dataset's file count before any preprocessing task.
2. It then checks for corrupted files and provides an option to delete them.
3. Next, it identifies and handles duplicates within the COVID and non-COVID dataset subdirectories.
4. Finally, it confirms the total file count after preprocessing.

This code ensures the integrity and cleanliness of the dataset for further analysis or model training.

In [36]:
# Count the number of files
def count_files(file_path, extensions=("jpg",)):
    """
    Count the number of files with specified extensions in the specified directory.

    Example: count_files("/path/to/directory", extensions=["jpg", "png"]) -> 12

    :param file_path: (str) The path to the directory for which file count is required.
    :param extensions: (str, list, tuple or None) File extension(s) to count. If None, count all files.

    :return: (int) The number of files with specified extensions in the specified directory.
    """

    if extensions is None:
        extensions = ['']
    elif isinstance(extensions, str):
        # Wrap a single extension so it is not iterated character by character
        extensions = [extensions]

    counter = 0
    with os.scandir(file_path) as entries:
        for entry in entries:
            if entry.is_file() and any(entry.name.lower().endswith(ext) for ext in extensions):
                counter += 1

    return counter


# Just a helper function
def print_file_counts():
    """
    A helper function that prints the number of files inside the COVID and non-COVID directories.
    """

    count_covid = count_files(file_path=COVID_PATH)
    count_non_covid = count_files(file_path=NON_COVID_PATH)

    tot_number_file = count_covid + count_non_covid
    print("- Total Number of file: {}\n".format(tot_number_file) +
          "- Number of file in COVID: {}\n".format(count_covid) +
          "- Number of file in non-Covid: {}\n".format(count_non_covid))

# Check dataset: filter out possible corrupted files.
def corruption_filter(dataset_path):
    """
    Check dataset for corrupted files and delete them if requested.

    :param dataset_path: The path to the dataset.
    """

    # Initialize
    bad_files = []  # to store corrupted file

    # Loop through all dataset subfolders
    for dirpath, _, filenames in os.walk(dataset_path):

        # Ensure we're processing a sub-folder level (compare by value, not identity)
        if dirpath != dataset_path:

            # Loop through all files
            for filename in filenames:
                # Check the file extension (case-insensitive)
                if filename.lower().endswith("jpg"):
                    # Get the file path
                    file_path = os.path.join(dirpath, filename)
                    try:
                        with Image.open(file_path) as image:
                            image.verify()
                    except (UnidentifiedImageError, OSError, SyntaxError):
                        # verify() may raise exceptions other than
                        # UnidentifiedImageError for damaged files
                        bad_files.append(file_path)

    if len(bad_files) != 0:
        # Report the corrupted files once, after the whole dataset has been scanned
        print("\n> There are {} corrupted files: {}".format(len(bad_files), bad_files))
        doc_message = input("\n> Do you want to delete these {} file(s)? [Y/N]: ".format(len(bad_files)))
        if doc_message.upper() == "Y":
            for bad_file in bad_files:
                # Delete the corrupted file
                os.remove(bad_file)
                print("- {} Corrupted File Deleted Successfully!".format(bad_file))

            # Print count
            print("\n> Checking the Number of file after the application of the corruption filter:")
            print_file_counts()
    else:
        print("> No Corrupted File Found")
    
    
# Check dataset: look for duplicate images inside a dataset folder
def find_out_duplicate(dataset_path, hash_size):
    """
    Find duplicate images inside a dataset folder and delete them if requested.

    :param dataset_path: the path to the folder to scan.
    :param hash_size: side length of the grid each image is reduced to before hashing;
                      the resulting hash has hash_size ** 2 bits.
    """

    # Initialize
    hashes = {}
    originals = {}
    duplicates = []

    # loop through file
    for file in os.listdir(dataset_path):
        file_path = os.path.join(dataset_path, file)

        with Image.open(file_path) as image:
            tmp_hash = imagehash.average_hash(image, hash_size)

            if tmp_hash in hashes:
                print("- Duplicate [{}] found for Og. Image [{}]".format(file, hashes[tmp_hash]))
                duplicates.append(file)  # duplicate files 
                originals[file] = hashes[tmp_hash]  # original files
            else:
                hashes[tmp_hash] = file

    if len(duplicates) != 0:

        # squeeze=False keeps axs two-dimensional even when there is a single duplicate
        fig, axs = plt.subplots(len(duplicates), 2, figsize=(10, 5 * len(duplicates)), squeeze=False)

        for idx, duplicate in enumerate(duplicates):
            duplicate_path = os.path.join(dataset_path, duplicate)
            original_path = os.path.join(dataset_path, originals[duplicate])

            # Load images
            duplicate_image = plt.imread(duplicate_path)
            original_image = plt.imread(original_path)

            # Plot side by side
            axs[idx, 0].imshow(original_image)
            axs[idx, 0].set_title('Original')
            axs[idx, 0].axis('off')

            axs[idx, 1].imshow(duplicate_image)
            axs[idx, 1].set_title('Duplicate')
            axs[idx, 1].axis('off')

        plt.tight_layout()
        plt.show()
        
        doc_message = input("\n> Do you want to delete these {} duplicate images? [Y/N]: ".format(len(duplicates)))

        if doc_message.upper() == "Y":
            for duplicate in duplicates:
                # Delete duplicate
                os.remove(os.path.join(dataset_path, duplicate))
                print("- {} Deleted Successfully!".format(duplicate))
    else:
        print("> No duplicate images found.")


print("\n> CHECK THE DATASET")
print("\n> Checking the Number of file before performing Pre-processing Task...")

# Print count
print_file_counts()

# Check for corrupted file
print("> Checking for corrupted files...")
corruption_filter(dataset_path=DATASET_PATH)

# Check for duplicates in the dataset: COVID/ and non-COVID/
print("\n> Checking duplicates in COVID/...[current num. of file: {}]"
      .format(count_files(file_path=COVID_PATH)))
find_out_duplicate(dataset_path=COVID_PATH, hash_size=8)

print("\n> Checking duplicates in non-COVID/...[current num. of file: {}]"
      .format(count_files(file_path=NON_COVID_PATH)))
find_out_duplicate(dataset_path=NON_COVID_PATH, hash_size=8)

print("\n> Final check to confirm the total file count:")
print_file_counts()

print("> DATASET CHECK COMPLETE!")


> CHECK THE DATASET

> Checking the Number of file before performing Pre-processing Task...
- Total Number of file: 2430
- Number of file in COVID: 1215
- Number of file in non-Covid: 1215

> Checking for corrupted files...
> No Corrupted File Found

> Checking duplicates in COVID/...[current num. of file: 1215]
> No duplicate images found.

> Checking duplicates in non-COVID/...[current num. of file: 1215]
> No duplicate images found.

> Final check to confirm the total file count:
- Total Number of file: 2430
- Number of file in COVID: 1215
- Number of file in non-Covid: 1215

> DATASET CHECK COMPLETE!
