# Basic Libraries I

## Exercises


Given a zip file with a subfolder containing multiple annotations, where each file name follows this naming convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME expressed as HHMMSS (hours, minutes and seconds), e.g. 185527
- SATELLITE_NUMBER an integer that represents the satellite number.
- VERSION provides the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION provides a unique location in the form of a string, e.g. SATL-2KM-10N_552_4164



### Find out the following things about your data:
Some tips:
- The str class has a method called split; you can use it to get each field of an annotation name.
- You can use sort from NumPy on strings.
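
As a minimal sketch of the first tip, the cell below splits one file name into its fields. The helper name `parse_annotation_name` is hypothetical, and the slicing assumes that VERSION always has three numeric components, as in the examples above.

```python
def parse_annotation_name(file_name):
    """Split an annotation file name into its fields (hypothetical helper)."""
    stem = file_name.rsplit(".", 1)[0]          # drop the ".txt" extension
    parts = stem.split("_")
    return {
        "date": parts[0],                       # YYYYMMDD
        "time": parts[1],                       # HHMMSS
        "satellite_number": int(parts[2][2:]),  # strip the "SN" prefix
        "version": "_".join(parts[5:8]),        # assumes three version components
        "unique_region": "_".join(parts[8:]),   # everything after the version
    }


example = "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt"
fields = parse_annotation_name(example)
print(fields)
```

Indices 3 and 4 hold the fixed words QUICKVIEW and VISUAL, which is why the version starts at index 5.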



1. How many files does the annotations folder have?

In [1]:
import os 
os.getcwd()  # First we use the os module to get the current working directory

'c:\\Users\\ochix\\Desktop\\MSc BA Esade\\Term 1\\Python for Data Science\\Assigments_PDS\\Session 4'

In [2]:
import zipfile

zip_path = "session_4.zip"  # The .zip file sits next to this notebook, so a relative path is enough.

# We read the archive listing directly with zipfile instead of extracting it to disk.
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # endswith('/') filters out directory entries, so only files are counted
    file_list = [f for f in zip_ref.namelist() if f.startswith('session_4/annotations/') and not f.endswith('/')]
    num_files = len(file_list)

print(f"The annotations folder has {num_files} files.")


The annotations folder has 206 files.


2. How many of them follow the naming convention expressed above?

In [3]:
import re

zip_path = r"session_4.zip" 
folder_inside_zip = "session_4/annotations/"

# We define the pattern that we want to match
pattern = re.compile(r'\d{8}_\d{6}_SN\d+_QUICKVIEW_VISUAL_\d+_\d+_\d+_.+\.txt')

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_list = [f for f in zip_ref.namelist() if f.startswith(folder_inside_zip) and not f.endswith('/')]
    
    # Count how many files match the naming convention
    matching_files = [f for f in file_list if pattern.match(f.split('/')[-1])]
    num_matching_files = len(matching_files)

print(f"Number of files that follow the naming convention: {num_matching_files}")


Number of files that follow the naming convention: 194
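
A possible stricter variant, shown only as a sketch: `re.fullmatch` requires the whole name to match, whereas `match` only anchors the start. The named groups and the helper `follows_convention` are our own additions, and the second and third test names are made up for illustration. We have not re-run this against the archive, so it makes no claim about the count above.

```python
import re

# Named groups make each field of the convention easy to read back.
NAME_PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+_\d+_\d+)_(?P<region>.+)\.txt"
)


def follows_convention(file_name):
    """Return True only if the entire file name matches the convention."""
    return NAME_PATTERN.fullmatch(file_name) is not None


names = [
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt.bak",  # made-up name
    "notes.txt",  # made-up name
]
results = [follows_convention(name) for name in names]
print(results)
```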


3. How many annotations do you have per month and year? Which month has the most annotation files?

In [4]:
file_list[0]  # We inspect the first entry to see how file names are stored in the list
# This shows that only the part after the last '/' is needed

'session_4/annotations/20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt'

In [5]:
zip_path = r"session_4.zip" 
folder_inside_zip = "session_4/annotations/"
month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

annotations_count = {}

# We extract date from the file names and count annotations per month and year
for file_name in file_list:
    file_name_only = file_name.split('/')[-1]
    date_str = file_name_only.split('_')[0]  # Extract the date part that we want
    year = date_str[:4]
    month = date_str[4:6]

    month_name = month_names[int(month)-1]
    year_month = f"{month_name} {year}"

    if year_month in annotations_count:
        annotations_count[year_month] += 1
    else:
        annotations_count[year_month] = 1

# We find the month with the most annotations
max_month = max(annotations_count, key=annotations_count.get)
max_count = annotations_count[max_month]

print("Annotations per month and year:")
for year_month, count in sorted(annotations_count.items()):
    print(f"{year_month}: {count} annotations")

print(f"\nThe month with the most annotation files is {max_month} with {max_count}.")


Annotations per month and year:
April 2024: 37 annotations
February 2024: 45 annotations
January 2024: 27 annotations
June 2024: 52 annotations
March 2024: 17 annotations
May 2024: 28 annotations

The month with the most annotation files is June 2024 with 52.
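
The listing above is sorted alphabetically by month name. As a sketch of one way to get chronological order, we can count with `collections.Counter` keyed by a `(year, month)` tuple. The short file names below are made up for illustration and are not taken from the archive.

```python
from collections import Counter

# Made-up names; only the leading YYYYMMDD part matters here.
names = [
    "20240102_185527_SN27_demo.txt",
    "20240215_101010_SN24_demo.txt",
    "20240220_111111_SN24_demo.txt",
    "20240623_215120_SN29_demo.txt",
]

# (year, month) tuples of zero-padded strings sort chronologically.
counts = Counter((name[:4], name[4:6]) for name in names)

for (year, month), n in sorted(counts.items()):
    print(f"{year}-{month}: {n} annotations")

busiest, busiest_count = counts.most_common(1)[0]
print(f"Busiest month: {busiest[0]}-{busiest[1]} with {busiest_count}")
```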


4. Create a new annotations folder containing one subfolder per month.

In [6]:
from shutil import copyfileobj
zip_path = r"session_4.zip" 
folder_inside_zip = "session_4/annotations/"
output_folder = 'sorted_annotations'

# We create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_list = [f for f in zip_ref.namelist() if f.startswith(folder_inside_zip) and not f.endswith('/')]

    for file_name in file_list:
        file_name_only = file_name.split('/')[-1]
        date_str = file_name_only[:8]
        year = date_str[:4]
        month = date_str[4:6]

        month_name = month_names[int(month)-1]

        # We create a new folder per month inside the output folder
        new_folder_path = os.path.join(output_folder, f"{month_name}_{year}")
        os.makedirs(new_folder_path, exist_ok=True)

        # Now we read the file from the zip archive and copy it to the new folder
        extracted_path = os.path.join(new_folder_path, file_name_only)
        with zip_ref.open(file_name) as src, open(extracted_path, 'wb') as dst:
            copyfileobj(src, dst)

print(f"Files have been sorted and saved to {output_folder} in each corresponding month folder.")


Files have been sorted and saved to sorted_annotations in each corresponding month folder.
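
For comparison, here is a sketch of the same idea using `pathlib`, with folders named `YYYY-MM` so that they sort chronologically in a file browser. The helper `sort_zip_by_month` is hypothetical, and the cell builds a tiny throwaway archive with made-up file names so that it runs without session_4.zip.

```python
import tempfile
import zipfile
from pathlib import Path


def sort_zip_by_month(zip_path, prefix, output_folder):
    """Copy every file under `prefix` into output_folder/YYYY-MM/ (hypothetical helper)."""
    output_folder = Path(output_folder)
    with zipfile.ZipFile(zip_path) as zip_ref:
        for member in zip_ref.namelist():
            if not member.startswith(prefix) or member.endswith("/"):
                continue  # skip other folders and directory entries
            name = member.split("/")[-1]
            target_dir = output_folder / f"{name[:4]}-{name[4:6]}"
            target_dir.mkdir(parents=True, exist_ok=True)
            (target_dir / name).write_bytes(zip_ref.read(member))


# Build a small demo archive with made-up file names.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("demo/annotations/20240102_185527_SN27_demo.txt", "a")
    zf.writestr("demo/annotations/20240623_215120_SN29_demo.txt", "b")

sort_zip_by_month(demo_zip, "demo/annotations/", workdir / "sorted")
created = sorted(
    p.relative_to(workdir / "sorted").as_posix()
    for p in (workdir / "sorted").rglob("*.txt")
)
print(created)
```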


5. Print all the annotations from the most recent to the oldest one.

In [7]:
# file_list was already defined in the previous cell, so we reuse it here.
annotations = []
for file_name in file_list:
    file_name_only = file_name.split('/')[-1]
    date_str = file_name_only[:8]
    time_str = file_name_only[9:15]

    datetime_str = date_str + time_str    # We concatenate the date and time to get the specific datetimes.
    annotations.append((datetime_str, file_name_only))

# Now we sort in descending order
annotations_sorted = sorted(annotations, reverse=True)

print("These are the files from the most recent to the oldest:")
for datetime_str, file_name in annotations_sorted:
    print(file_name)

These are the files from the most recent to the oldest:
20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt
20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt
20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt
20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt
20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt
20240618_193146_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_530_3682.txt
20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt
20240617_184443_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_702_3566.txt
20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt
20240616_213053_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_460_3792.txt
20240616_213047_SN30_QUICKVIEW_VISU
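
Sorting the concatenated strings works because the date and time are zero-padded. An alternative sketch parses the first 15 characters into a `datetime`, which also raises an error on an invalid date or time. The helper name `capture_time` is our own; the three file names appear earlier in this notebook.

```python
from datetime import datetime


def capture_time(file_name):
    """Parse the leading 'YYYYMMDD_HHMMSS' of a file name (hypothetical helper)."""
    return datetime.strptime(file_name[:15], "%Y%m%d_%H%M%S")


names = [
    "20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
]

newest_first = sorted(names, key=capture_time, reverse=True)
for name in newest_first:
    print(name)
```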

6. How many different satellites are there, how many annotations do we have per satellite number, and which one was used in the most recent annotation file?

In [8]:
annotations_count_satellite = {}

pattern = re.compile(r'\d{8}_\d{6}_SN(\d+)_')

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_list = [f for f in zip_ref.namelist() if f.startswith(folder_inside_zip) and not f.endswith('/')]
    for file_name in file_list:
        file_name_only = file_name.split('/')[-1]
        match = pattern.search(file_name_only)
        if match:
            satellite_number = match.group(1)
            if satellite_number in annotations_count_satellite:
                annotations_count_satellite[satellite_number] += 1
            else:
                annotations_count_satellite[satellite_number] = 1
            # annotations_sorted[0] is the most recent file (computed in the previous cell)
            if file_name_only == annotations_sorted[0][1]:
                last_satellite = 'SN' + satellite_number


different_satellites = len(annotations_count_satellite)

print(f'There are {different_satellites} different satellites')
print('\nThe number of annotations per satellite is:')
for satellite_number, count in sorted(annotations_count_satellite.items()):
    print(f'Satellite SN{satellite_number}: {count} annotations')
print(f'\nAnd the most recent satellite to be used was: {last_satellite}')

There are 9 different satellites

The number of annotations per satellite is:
Satellite SN24: 26 annotations
Satellite SN26: 37 annotations
Satellite SN27: 29 annotations
Satellite SN28: 16 annotations
Satellite SN29: 22 annotations
Satellite SN30: 18 annotations
Satellite SN31: 19 annotations
Satellite SN33: 16 annotations
Satellite SN43: 11 annotations

And the most recent satellite to be used was: SN29
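
The same tally can be sketched more compactly with `collections.Counter`. Converting the satellite number to `int` is our own choice, so that a one-digit number would sort before a two-digit one. The four file names come from the listing in question 5.

```python
import re
from collections import Counter

names = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt",
    "20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt",
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt",
]

sat_pattern = re.compile(r"\d{8}_\d{6}_SN(\d+)_")

per_satellite = Counter()
for name in names:
    match = sat_pattern.match(name)
    if match:  # names that do not follow the convention are skipped
        per_satellite[int(match.group(1))] += 1

# The zero-padded 'YYYYMMDD_HHMMSS' prefix orders names chronologically.
most_recent = max(names, key=lambda name: name[:15])
latest_satellite = int(sat_pattern.match(most_recent).group(1))

for number, count in sorted(per_satellite.items()):
    print(f"Satellite SN{number}: {count} annotations")
print(f"Most recent satellite: SN{latest_satellite}")
```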


7. How many unique regions are there?

In [9]:
unique_regions = set()

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    file_list = [f for f in zip_ref.namelist() if f.startswith(folder_inside_zip) and not f.endswith('/')]
    for file_name in file_list:
        file_name_only = file_name.split('/')[-1]
        regions = '_'.join(file_name_only.split('_')[-3:]) # Extract the region part of the name
        unique_regions.add(regions)

print(f'There are {len(unique_regions)} unique regions.')

There are 146 unique regions.
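
The cell above keeps the ".txt" suffix as part of each region string. A small sketch that strips the extension first is shown below. It keeps the same assumption as the cell above, namely that a region is always the last three underscore-separated fields. The helper `region_of` is hypothetical, and the third file name is made up so that one region repeats.

```python
def region_of(file_name):
    """Return the last three underscore-separated fields without '.txt' (hypothetical helper)."""
    stem = file_name[:-4] if file_name.endswith(".txt") else file_name
    return "_".join(stem.split("_")[-3:])


names = [
    "20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt",
    "20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt",
    "20240620_101500_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt",  # made-up name
]

unique_regions = {region_of(name) for name in names}
print(f"There are {len(unique_regions)} unique regions.")
```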
