# **Data collection**

### **Objectives**

- Collect images from Kaggle<br>
- Search for non-images<br>
- Visualize distribution of images

### **Inputs**

- Kaggle JSON file (token for authentication)

### **Outputs**

- Generate dataset into: inputs/dataset/raw/flower_photos<br>
- Image distribution through all labels<br>
- Pickle file with all labels


## **Install Requirements and Prepare Workspace**

### Workspace setup

First, check that we are working from the correct directory, which should be "flowers_CNN".<br>
By default, the working directory is "..../flowers_CNN/jupyter_notebook".

In [None]:
import os
working_dir = os.getcwd()
print(f"You are now working in {working_dir}")
print("If you need to change to the parent directory, run the cell below")

Running the cell below changes the working directory to the parent of the directory printed above.

In [None]:
os.chdir(os.path.dirname(working_dir))
new_working_dir = os.getcwd()
print(f"You have now changed your working directory to {new_working_dir}")

### Set output destination

In [None]:
version = 'v8'
file_path = f'outputs/{version}'
current_working_dir = os.getcwd()

# Create the output folder for this version unless it already exists
if 'outputs' in os.listdir(current_working_dir) and version in os.listdir(current_working_dir + '/outputs'):
    print("This version already exists. Change `version` if you are starting a new version.")
else:
    os.makedirs(name=file_path, exist_ok=True)

### Install and import packages

In [None]:
%pip install -r requirements.txt

In [None]:
import zipfile
import glob
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

Place your kaggle.json file in the project's main folder so that the cell below can find your authentication token.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
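
Before downloading, it can help to confirm that the token file is actually where `KAGGLE_CONFIG_DIR` points. The sketch below is a minimal check, not part of the original notebook: the helper name `kaggle_token_present` is hypothetical, and it only assumes that kaggle.json is a JSON object with the `username` and `key` fields. The demo runs against temporary folders so it does not touch your real token.

```python
import json
import os
import tempfile


def kaggle_token_present(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key'.

    Hypothetical helper for illustration; it does not contact Kaggle.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path) as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(token, dict) and all(k in token for k in ("username", "key"))


# Demo with throwaway folders and a placeholder token
with tempfile.TemporaryDirectory() as empty_dir:
    missing_result = kaggle_token_present(empty_dir)

with tempfile.TemporaryDirectory() as config_dir:
    with open(os.path.join(config_dir, "kaggle.json"), "w") as f:
        json.dump({"username": "demo-user", "key": "demo-key"}, f)
    present_result = kaggle_token_present(config_dir)

print(missing_result, present_result)
```

In the notebook itself, the call would be `kaggle_token_present(os.getcwd())`, run before the download cell.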

Download the dataset and set the destination folder for it

In [None]:
KaggleDatasetPath = "kurito/flower-photos"
DestinationFolder = "inputs/dataset/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unpack the zip file and delete kaggle.json

In [None]:
KaggleJsonPath = "kaggle.json" # path to JSON-file

# This will find the zip-file in the DestinationFolder and unzip
for zip_path in glob.glob(f"{DestinationFolder}/*.zip"):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_path)  # Removes zip-file after unpacking
    
# Removes JSON-file when the zip-file has been unpacked
if os.path.exists(KaggleJsonPath):
    os.remove(KaggleJsonPath)

### Search for non-image files


In [None]:
def search_non_image_files(raw_dir):
    """
    Search the dataset directory for files that do not have one of the
    image file extensions listed in `image_extension`, and delete them.

    The function iterates through every folder in `raw_dir` and sorts the
    files it finds into:
    - `image_files`: files with an extension listed in `image_extension`.
    - `non_image_files`: files without one of these extensions. These are
      deleted from disk.

    Any entry directly inside `raw_dir` that is not a folder is deleted too.

    Finally, it prints a summary of the search results and returns two lists:
    one for image files and one for non-image files.
    """
    
    # Tuple of file extensions considered as image files
    image_extension = ('.png', '.jpg', '.jpeg')
    
    # List to store paths of non-image files
    non_image_files = []
    # List to store paths of image files
    image_files = []
    
    # Get a list of all objects in the raw dataset directory
    folders = os.listdir(raw_dir)
    
    for folder in folders:
        folder_path = os.path.join(raw_dir, folder)
        
        if not os.path.isdir(folder_path):
            print(f"Removing non-directory file: {folder_path}")
            os.remove(folder_path) # deletes the file if not a folder
            continue
        
        files = os.listdir(folder_path)
        
        for file in files:
            file_location = os.path.join(folder_path, file)
            
            # Check if the file does not have an image extension
            if not file.lower().endswith(image_extension):
                non_image_files.append(file_location)
                print(f"Removing non-image file: {file_location}")
                os.remove(file_location) # Removes non-image files
            else:
                image_files.append(file_location)
    
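    # Print a summary of the results (the first count includes any
    # non-folder entries that were found in raw_dir and removed)
    print("Total number of top-level entries searched:", len(folders))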
    print("Total image files found:", len(image_files))
    print("Total non-image files found:", len(non_image_files))
    
    return image_files, non_image_files


Run the function to find and remove non-image files

In [None]:
search_non_image_files(raw_dir='inputs/dataset/raw/flower_photos')
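
Because `search_non_image_files` deletes files permanently, a dry run can be useful first. The sketch below is a non-destructive variant written for illustration only: the name `preview_non_image_files` and its return values are hypothetical, and it treats stray top-level files as removal candidates, mirroring the behaviour of the function above. The demo uses a temporary folder with made-up file names.

```python
import os
import tempfile


def preview_non_image_files(raw_dir, image_extensions=('.png', '.jpg', '.jpeg')):
    """Return (keep, would_remove) path lists without deleting anything."""
    keep, would_remove = [], []
    for entry in sorted(os.listdir(raw_dir)):
        entry_path = os.path.join(raw_dir, entry)
        if not os.path.isdir(entry_path):
            # Stray file next to the label folders: the real function removes it
            would_remove.append(entry_path)
            continue
        for name in sorted(os.listdir(entry_path)):
            target = os.path.join(entry_path, name)
            if name.lower().endswith(image_extensions):
                keep.append(target)
            else:
                would_remove.append(target)
    return keep, would_remove


# Demo on a throwaway folder structure
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "roses"))
    for name in ("a.jpg", "b.PNG", "notes.txt"):
        open(os.path.join(tmp, "roses", name), "w").close()
    open(os.path.join(tmp, "LICENSE.txt"), "w").close()

    keep, would_remove = preview_non_image_files(tmp)
    kept_names = sorted(os.path.basename(p) for p in keep)
    removed_names = sorted(os.path.basename(p) for p in would_remove)

print("Would keep:", kept_names)
print("Would remove:", removed_names)
```

Note that both this sketch and the function above judge files by extension only. Neither opens the file to confirm it is a valid image.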

### Image distribution through all labels

In [None]:
def label_distribution(image_dirs):
    """
    Analyze the distribution of images in a dataset.

    The function processes a directory containing subfolders, where each
    subfolder represents a label (e.g., flower categories). It counts the
    files in each subfolder, creates a bar plot to visualize the
    distribution, and saves the plot to the output folder.

    Steps:
    1. Loop through the subdirectories in the dataset directory.
    2. Count the number of files in each subdirectory.
    3. Print the frequency of images for each label.
    4. Store the data in a Pandas DataFrame.
    5. Generate a bar plot of the image distribution by label.
    6. Save the plot as an image file.

    Parameters:
    - image_dirs (str): Path to the dataset directory containing subfolders of images.

    Note: the plot is saved under `file_path`, which is not a parameter but the
    notebook-level variable defined in the "Set output destination" cell.

    Outputs:
    - Prints the frequency of images for each label.
    - Displays a bar plot of image distribution.
    - Saves the bar plot as a PNG file in the output directory.
    """

    labels = os.listdir(image_dirs)
    data = []


    for label in labels:
        label_path = os.path.join(image_dirs, label)
        
        if os.path.isdir(label_path):
            frequency = len(os.listdir(label_path))
            
            data.append({
                'Label': label,
                'Frequency': frequency
            })
            
            print(f"{label}: {frequency} images")
        
        
    df_freq = pd.DataFrame(data)

    print("\n")
    sns.set_style("whitegrid")
    plt.figure(figsize=(10, 6))
    sns.barplot(data=df_freq, x='Label', y='Frequency', hue='Label')
    plt.xticks(rotation=45, ha='right')
    plt.title("Image distribution by flower")
    plt.savefig(f'{file_path}/labels_distribution_raw.png', bbox_inches='tight', dpi=150)
    plt.show()

Run the function to see the image distribution

In [None]:
label_distribution(image_dirs = 'inputs/dataset/raw/flower_photos')
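
The bar plot shows the distribution visually. As a rough numeric companion, the sketch below counts image files per label and reports the ratio between the largest and smallest class. It is an illustration only: `count_images_per_label` and `imbalance_ratio` are hypothetical names, the extension list is assumed to match the search step above, and the demo folders and counts are invented.

```python
import os
import tempfile

# Assumed to match the extensions used in the search step
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_images_per_label(image_dir):
    """Return {label: number of image files} for each subfolder of image_dir."""
    counts = {}
    for label in sorted(os.listdir(image_dir)):
        label_path = os.path.join(image_dir, label)
        if os.path.isdir(label_path):
            counts[label] = sum(
                1 for name in os.listdir(label_path)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts


# Demo on a throwaway dataset with two made-up labels
with tempfile.TemporaryDirectory() as tmp:
    for label, n_files in {"daisy": 2, "tulips": 4}.items():
        os.makedirs(os.path.join(tmp, label))
        for i in range(n_files):
            open(os.path.join(tmp, label, f"{i}.jpg"), "w").close()
    demo_counts = count_images_per_label(tmp)

# Largest class divided by smallest class; 1.0 means perfectly balanced
imbalance_ratio = max(demo_counts.values()) / min(demo_counts.values())
print(demo_counts, imbalance_ratio)
```

On the real dataset, the call would be `count_images_per_label('inputs/dataset/raw/flower_photos')`.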

Save labels as a pickle file for the dashboard

In [None]:
image_dirs = 'inputs/dataset/raw/flower_photos'
labels = os.listdir(image_dirs)

print(f"Flower labels: {labels}")

with open(f"{file_path}/labels.pkl", "wb") as file:
    pickle.dump(labels, file)
    
print("Labels saved as labels.pkl")
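
To confirm that the dashboard will be able to read the file, the labels can be loaded back with `pickle.load`. The sketch below shows the round trip on a temporary file. The label names are placeholders, not the actual contents of the dataset.

```python
import os
import pickle
import tempfile

# Placeholder labels for illustration; the real list comes from os.listdir above
demo_labels = ["label_a", "label_b", "label_c"]

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "labels.pkl")

    # Write the list the same way the cell above does
    with open(pkl_path, "wb") as file:
        pickle.dump(demo_labels, file)

    # Read it back, as the dashboard would
    with open(pkl_path, "rb") as file:
        loaded_labels = pickle.load(file)

print(loaded_labels == demo_labels)
```

In the notebook, the same check would open `f"{file_path}/labels.pkl"` in `"rb"` mode and compare the result with `labels`.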