In [1]:
!curl -O "https://www.seanoe.org/data/00829/94052/data/101141.zip"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7159M  100 7159M    0     0  1649k      0  1:14:04  1:14:04 --:--:-- 3286k


In [8]:
!curl -O "https://www.seanoe.org/data/00828/94040/data/101095.zip"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1299M  100 1299M    0     0  5627k      0  0:03:56  0:03:56 --:--:-- 10.1M


In [9]:
import zipfile
from tqdm import tqdm

zip_path = "/content/101095.zip"       # path to your zip file
extract_path = "/content/101095_folder" # destination folder

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    files = zip_ref.namelist()
    for file in tqdm(files, desc="Unzipping", unit="file"):
        zip_ref.extract(file, extract_path)

Unzipping: 100%|██████████| 702239/702239 [02:08<00:00, 5485.61file/s]
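
The analysis below reads from `/content/101141_folder`, but the cell that extracts that archive is not shown. A minimal sketch of a reusable helper, assuming the first archive was unpacked the same way as the second. The name `extract_zip` and the 101141 paths in the usage comment are assumptions, and the progress bar is left out to keep the sketch dependency-free:

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, extract_path):
    """Extract every member of zip_path into extract_path and return the member count."""
    Path(extract_path).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        members = zip_ref.namelist()
        for member in members:
            zip_ref.extract(member, extract_path)
    return len(members)


# Hypothetical usage; both paths are assumed, mirroring the cell above:
# extract_zip("/content/101141.zip", "/content/101141_folder")
```

Wrapping `members` in `tqdm(...)`, as in the cell above, would restore the progress bar.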


# Task
Analyze the directory "/content/101141_folder/individual_images" to count the number of images in each subdirectory (class), calculate the size of each subdirectory in MB, and present the results in a pandas DataFrame sorted by image count in ascending order. Finally, save the DataFrame to a CSV file and display the DataFrame.

## List directories and count images

### Subtask:
Get a list of all subdirectories within `/content/101141_folder/individual_images` and count the number of image files in each.


**Reasoning**:
The subtask requires iterating over the subdirectories and counting the image files in each one. The `os` module is sufficient for this: `os.listdir` lists the entries and `os.path.isdir` filters out anything that is not a directory.



In [10]:
import os

main_dir = "/content/101141_folder/individual_images"
image_extensions = ('.jpg', '.jpeg', '.png')  # matched case-insensitively
image_counts = {}

# Each subdirectory of main_dir is one class; count its top-level image files.
for entry in os.listdir(main_dir):
    entry_path = os.path.join(main_dir, entry)
    if os.path.isdir(entry_path):
        image_files = [f for f in os.listdir(entry_path) if f.lower().endswith(image_extensions)]
        image_counts[entry] = len(image_files)

print(image_counts)

{'Nannosquillidae__45061': 2, 'Paguridae__83500': 4, 'Aetideidae__61995': 75, 'dead_Copepoda__84964': 17151, 'part_Ctenophora__85187': 319, 'Penaeoidea__92080': 7, 'pluteus_Echinoidea__85000': 1441, 'zoea_Galatheidae__84989': 660, 'nauplii_Crustacea__85116': 10747, 'larvae_Crustacea__85114': 114, 'part_Cnidaria__84970': 692, 'Rhincalanidae__61977': 127, 'Oithonidae__62005': 110510, 'Obelia__72344': 1016, 'egg_other__85078': 2281, 'Foraminifera__11758': 384, 'Euchaetidae__61986': 12957, 'Branchiostoma__25925': 210, 'badfocus_Copepoda__92042': 11656, 'Cephalopoda__12906': 3, 'eudoxie_Diphyidae__84977': 69, 'Neoceratium__18758': 4830, 'ephyra_Scyphozoa__85188': 64, 'Salpida__25942': 470, 'tail_Appendicularia__85004': 11349, 'part_Thaliacea__92235': 44, 'Oncaeidae__78418': 34651, 'siphonula__85127': 20, 'tornaria larvae__92754': 83, 'Heterorhabdidae__61985': 205, 'badfocus_artefact__85061': 34507, 'phyllosoma__84988': 8, 'artefact__85008': 60718, 'Thecosomata__91704': 59, 'Atlanta__92139':
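
Printing the whole dictionary produces one very long line for this many classes. A short sketch of a more readable summary, using a small sample of the counts above in place of the full `image_counts` dictionary (the name `sample_counts` is illustrative):

```python
from collections import Counter

# Sample copied from the output above; in the notebook, pass the full image_counts dict.
sample_counts = {
    "Nannosquillidae__45061": 2,
    "Paguridae__83500": 4,
    "Oithonidae__62005": 110510,
    "artefact__85008": 60718,
    "Cephalopoda__12906": 3,
}

counter = Counter(sample_counts)
total_images = sum(counter.values())
largest = counter.most_common(2)  # classes with the most images
smallest = sorted(sample_counts.items(), key=lambda kv: kv[1])[:2]  # fewest images

print("Total images:", total_images)
print("Largest classes:", largest)
print("Smallest classes:", smallest)
```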

## Calculate directory sizes

### Subtask:
Calculate the size of each subdirectory in MB.


**Reasoning**:
Walk each class directory with `os.walk`, sum the file sizes in bytes with `os.path.getsize`, and divide by 1024 * 1024 to convert the total to MB.



In [11]:
dir_sizes = {}

# Sum the size of every file under each class directory (all file types, recursively).
for dir_name in image_counts.keys():
    dir_path = os.path.join(main_dir, dir_name)
    total_size = 0
    for dirpath, _, filenames in os.walk(dir_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    dir_sizes[dir_name] = total_size / (1024 * 1024)  # bytes -> MB

print(dir_sizes)

{'Nannosquillidae__45061': 0.031023025512695312, 'Paguridae__83500': 0.3491649627685547, 'Aetideidae__61995': 1.1797285079956055, 'dead_Copepoda__84964': 92.84973812103271, 'part_Ctenophora__85187': 6.643007278442383, 'Penaeoidea__92080': 0.20598125457763672, 'pluteus_Echinoidea__85000': 4.805417060852051, 'zoea_Galatheidae__84989': 11.644847869873047, 'nauplii_Crustacea__85116': 32.97245788574219, 'larvae_Crustacea__85114': 1.4346723556518555, 'part_Cnidaria__84970': 17.069294929504395, 'Rhincalanidae__61977': 4.8950347900390625, 'Oithonidae__62005': 353.0365047454834, 'Obelia__72344': 11.794878005981445, 'egg_other__85078': 7.068790435791016, 'Foraminifera__11758': 1.0463447570800781, 'Euchaetidae__61986': 242.81782913208008, 'Branchiostoma__25925': 3.097750663757324, 'badfocus_Copepoda__92042': 42.343411445617676, 'Cephalopoda__12906': 0.0647125244140625, 'eudoxie_Diphyidae__84977': 0.6808338165283203, 'Neoceratium__18758': 16.070223808288574, 'ephyra_Scyphozoa__85188': 0.4753284454
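
Note that the counting cell considers only image files directly inside each class directory, while the size cell sums every file found recursively, so the two figures may not describe exactly the same set of files. A minimal sketch that collects both in a single pass over the top-level image files; the function name `summarize_class_dir` is hypothetical:

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def summarize_class_dir(dir_path):
    """Return (image_count, size_mb) for image files directly inside dir_path."""
    count = 0
    total_bytes = 0
    with os.scandir(dir_path) as entries:
        for entry in entries:
            # Same case-insensitive extension filter as the counting cell.
            if entry.is_file() and entry.name.lower().endswith(IMAGE_EXTENSIONS):
                count += 1
                total_bytes += entry.stat().st_size
    return count, total_bytes / (1024 * 1024)
```

`os.scandir` yields directory entries that carry file-type information, which can avoid a separate system call per file on large directories such as these.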

## Create and sort dataframe

### Subtask:
Create a pandas DataFrame with the class names, image counts, and sizes, then sort it by image count in ascending order.


**Reasoning**:
Create a pandas DataFrame from the image counts and sizes dictionaries and sort it by image count.



In [12]:
import pandas as pd

df_image_analysis = pd.DataFrame.from_dict(image_counts, orient='index', columns=['Image Count'])
df_image_analysis['Size (MB)'] = pd.Series(dir_sizes)
df_image_analysis = df_image_analysis.reset_index().rename(columns={'index': 'Class'})
df_image_analysis = df_image_analysis.sort_values(by='Image Count', ascending=True)
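
This cell produces no output, so here is a small sketch of the same steps on three classes copied from the dictionaries above (the names `sample_counts`, `sample_sizes`, and `df_sample` are illustrative). It also shows why the row labels in the sorted frame are not sequential: they keep their pre-sort positions.

```python
import pandas as pd

# Values copied from the printed dictionaries above.
sample_counts = {
    "Nannosquillidae__45061": 2,
    "Paguridae__83500": 4,
    "Cephalopoda__12906": 3,
}
sample_sizes = {
    "Nannosquillidae__45061": 0.031023025512695312,
    "Paguridae__83500": 0.3491649627685547,
    "Cephalopoda__12906": 0.0647125244140625,
}

df_sample = pd.DataFrame.from_dict(sample_counts, orient="index", columns=["Image Count"])
df_sample["Size (MB)"] = pd.Series(sample_sizes)  # aligned on the class-name index
df_sample = df_sample.reset_index().rename(columns={"index": "Class"})
df_sample = df_sample.sort_values(by="Image Count", ascending=True)

print(df_sample)
```

Calling `reset_index(drop=True)` after sorting would renumber the rows from 0 if sequential labels are preferred.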

## Save to csv

### Subtask:
Save the DataFrame to a CSV file.


**Reasoning**:
Save the dataframe to a CSV file without including the index.



In [13]:
df_image_analysis.to_csv("image_analysis.csv", index=False)

## Display dataframe

### Subtask:
Display the sorted DataFrame.


**Reasoning**:
Display the sorted DataFrame as requested by the subtask.



In [14]:
display(df_image_analysis)

|     | Class | Image Count | Size (MB) |
| --- | --- | --- | --- |
| 49 | Lubbockia__93061 | 1 | 0.002988 |
| 111 | Monstrilloida__45069 | 1 | 0.011627 |
| 0 | Nannosquillidae__45061 | 2 | 0.031023 |
| 19 | Cephalopoda__12906 | 3 | 0.064713 |
| 92 | Cymbulia peroni__92132 | 3 | 0.041640 |
| ... | ... | ... | ... |
| 63 | Acartiidae__61996 | 66353 | 319.555913 |
| 119 | Calanidae__61993 | 91513 | 1552.032633 |
| 12 | Oithonidae__62005 | 110510 | 353.036505 |
| 84 | Calanoida__45074 | 149956 | 671.509879 |


## Summary:

### Data Analysis Key Findings

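*   Every subdirectory of `/content/101141_folder/individual_images` was treated as a class, and its `.jpg`, `.jpeg`, and `.png` files were counted.
*   The classes are highly imbalanced: the smallest (`Lubbockia__93061`, `Monstrilloida__45069`) contain a single image each, while the largest (`Calanoida__45074`) contains 149,956.
*   The size of each subdirectory was calculated in MB. Size does not track image count closely: `Calanidae__61993` holds 91,513 images in 1552.03 MB, whereas `Oithonidae__62005` holds 110,510 images in 353.04 MB.
*   A pandas DataFrame with the class names, image counts, and sizes was sorted by image count in ascending order and saved, without the index, to `image_analysis.csv`.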

### Insights or Next Steps

*   The sorted DataFrame allows for easy identification of classes with the fewest and most images, which could be useful for balancing datasets in machine learning tasks.
*   Further analysis could involve calculating the average image size per class or visualizing the distribution of image counts and sizes.
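
The first follow-up step can be sketched as below, assuming the column names used above. The rows are copied from the displayed DataFrame; in the notebook the full `df_image_analysis` would be used instead of the illustrative `df`. Because the size column sums every file in a directory, the per-image mean is approximate if a directory holds anything other than images.

```python
import pandas as pd

# Rows copied from the displayed DataFrame.
df = pd.DataFrame({
    "Class": ["Lubbockia__93061", "Calanidae__61993", "Oithonidae__62005", "Calanoida__45074"],
    "Image Count": [1, 91513, 110510, 149956],
    "Size (MB)": [0.002988, 1552.032633, 353.036505, 671.509879],
})

# Mean size per image in KB (1 MB = 1024 KB, matching the 1024 * 1024 divisor used earlier).
df["Mean Size (KB)"] = df["Size (MB)"] * 1024 / df["Image Count"]

# Ratio between the largest and smallest class, a rough measure of imbalance.
imbalance_ratio = df["Image Count"].max() / df["Image Count"].min()

print(df.sort_values("Mean Size (KB)", ascending=False))
print("Imbalance ratio:", imbalance_ratio)
```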
