## Basic Libraries I

Let's jump into today's exercise.

### Exercise


Given a zip file containing a subfolder with multiple annotation files, where each file name follows this convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE is expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME is expressed as HHMMSS (hours, minutes and seconds), e.g. 213430
- SATELLITE_NUMBER is an integer that identifies the satellite.
- VERSION is the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION is a unique location in the form of a string, e.g. SATL-2KM-10N_552_4164
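The DATE and TIME fields can be checked beyond their digit count. The sketch below is a minimal illustration, assuming a hypothetical helper named `parse_timestamp` (it is not part of the exercise): it relies on `datetime.strptime` to reject impossible values such as month 13, and checks the field widths first.

```python
from datetime import datetime

def parse_timestamp(date_str, time_str):
    """Return a datetime for DATE (YYYYMMDD) and TIME (HHMMSS), or None if invalid."""
    # strptime is lenient about zero padding, so enforce the exact widths first
    if len(date_str) != 8 or len(time_str) != 6:
        return None
    if not (date_str.isdigit() and time_str.isdigit()):
        return None
    try:
        return datetime.strptime(date_str + time_str, "%Y%m%d%H%M%S")
    except ValueError:
        # Raised for impossible values such as month 13 or hour 25
        return None

valid = parse_timestamp("20241201", "213430")
bad_month = parse_timestamp("20241301", "213430")
too_long = parse_timestamp("20241201", "2134307")
print(valid, bad_month, too_long)
```

A regular expression such as `\d{8}` only checks the shape of the field, so a check like this one is a possible complement when the actual date matters.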

Find out the following things about your data:

1. How many files the annotations folder contains.
2. How many of them follow the naming convention described above.
3. How many annotations you have per month and year, and which month has the most annotation files.
4. Create a new annotations folder with one subfolder per month.
5. Print all the annotations from the most recent to the oldest.
6. How many different satellites there are, how many annotations there are per satellite number, and which satellite was used in the most recent annotation file.
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to get each field of an annotation file name.
- You can use sort from numpy on strings.
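As a rough illustration of the first tip, the sketch below splits one file name on underscores. The helper name `split_fields` is hypothetical, and the code assumes that VERSION always has exactly three numeric components (as in "1_1_10"); VERSION and UNIQUE_REGION both contain underscores, so the split alone cannot tell where one ends and the other begins.

```python
def split_fields(filename):
    """Split an annotation file name into its fields using only str methods."""
    stem = filename.rsplit(".", 1)[0]  # Drop the .txt extension
    parts = stem.split("_")
    # Assumption: VERSION has three components, so it occupies parts[5:8]
    # and everything after it belongs to UNIQUE_REGION.
    return {
        "date": parts[0],
        "time": parts[1],
        "satellite": parts[2][2:],  # Strip the leading "SN"
        "version": "_".join(parts[5:8]),
        "region": "_".join(parts[8:]),
    }

fields = split_fields(
    "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt"
)
print(fields)
```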

In [17]:
import os 

# Define ANSI codes for formatting
RED = "\033[91m"
BLUE = "\033[94m"
GREEN = "\033[92m"
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
RESET = "\033[0m"
YELLOW = "\033[93m"
MAGENTA = "\033[95m"
CYAN = "\033[96m"

annotations_folder = r"./session_4/annotations" # Define the path to the folder containing annotation files

files = os.listdir(annotations_folder) # List all files in the specified folder and store them in the 'files' variable

total_files = len(files) # Count the total number of files in the folder

print(f"{BOLD}{UNDERLINE}Total files in folder:{RESET} {GREEN}{total_files}{RESET}") # Print the total number of files


[1m[4mTotal files in folder:[0m [92m206[0m


In [18]:
import re  

pattern = re.compile(r"(\d{8})_(\d{6})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_(.+)\.txt$") # This pattern captures:
        # - Eight digits for the date (YYYYMMDD)
        # - Six digits for the time (HHMMSS)
        # - SN followed by one or more digits for the satellite number
        # - The literal text "QUICKVIEW_VISUAL", followed by a version number that may include underscores
        # - The unique region (everything up to the .txt extension)
        # The trailing $ rejects names that have extra characters after .txt

valid_files = [f for f in files if pattern.match(f)] # Filter the list of files to find those that match the defined naming pattern

invalid_files = [f for f in files if not pattern.match(f)] # Create a list of files that do not match the naming pattern

valid_files_count = len(valid_files) # Count the number of valid files that matched the pattern

print(f"{BOLD}{UNDERLINE}Files following the naming convention:{RESET} {GREEN}{valid_files_count}{RESET}") # Print the count of valid files that follow the naming convention

# Print the valid files 
print(f"{BOLD}{UNDERLINE}Valid files:{RESET}")
for file in valid_files[:10]:  # Limit the output of valid files to the first 10 entries
    print(file)

print(f"\n{BOLD}{UNDERLINE}Not valid files:{RESET}") # Print a header for the invalid files section

for file in invalid_files: # Print all the files that do not match the naming convention
    print(file)


[1m[4mFiles following the naming convention:[0m [92m194[0m
[1m[4mValid files:[0m
20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt
20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3772.txt
20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4162.txt
20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_552_4164.txt
20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-10N_554_4162.txt
20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3740.txt
20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_392_3742.txt
20240101_213601_SN31_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_396_3752.txt
20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt
20240102_185605_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_690_3572.txt

[1m[4mNot valid files:[0m
20240405_183824_409694_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_736_3716.txt
20240407_190149_742846_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_258_4028.txt
20240408_21
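To see why the names listed as not valid are rejected, the sketch below applies a variant of the pattern with named groups and `fullmatch` to one valid and one invalid name taken from the output above. The name `NAME_RE` and the group names are illustrative choices, not part of the original cell.

```python
import re

# Variant of the naming pattern with named groups (names are illustrative)
NAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+(?:_\d+)*)_(?P<region>.+)\.txt"
)

good = "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt"
bad = "20240405_183824_409694_MS_NS24_QUICKVIEW_VISUAL_1_3_0_SATL-2KM-11N_736_3716.txt"

good_match = NAME_RE.fullmatch(good)  # fullmatch requires the whole name to fit
bad_match = NAME_RE.fullmatch(bad)    # Fails: the third field is not SN<number>

print(good_match.groupdict())
print(bad_match)
```

In the invalid name, the time is followed by an extra numeric field and `MS_NS24` instead of `SN<number>`, so the match fails at the third field.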

In [19]:
from collections import Counter  
from datetime import datetime

# Mapping of month numbers to names
month_names = {
    '01': 'January',
    '02': 'February',
    '03': 'March',
    '04': 'April',
    '05': 'May',
    '06': 'June',
    '07': 'July',
    '08': 'August',
    '09': 'September',
    '10': 'October',
    '11': 'November',
    '12': 'December',
}

annotations_per_month = Counter()  # Initialize a Counter to track the number of annotations per month
satellite_count = Counter()  # Initialize a Counter to track the count of annotations per satellite
unique_regions = set()  # Initialize a set to store unique regions found in the filenames

# Loop through each valid file to extract and count information
for file in valid_files:
    match = pattern.match(file)  # Match the filename against the defined pattern
    
    # Check if the filename matched the expected pattern
    if match:
        date_str, time_str, satellite, version, region = match.groups()
        
        date_obj = datetime.strptime(date_str, "%Y%m%d")  # Convert the date string from format YYYYMMDD to a datetime object
        
        month_year_str = date_obj.strftime("%Y-%m")  # Format the datetime object to a string representing year and month (YYYY-MM)
        
        annotations_per_month[month_year_str] += 1  # Increment the count for that specific month in the annotations counter
        
        satellite_count[satellite] += 1  # Increment the count for the respective satellite
        
        unique_regions.add(region)  # Add the region to the set of unique regions

# Print the number of annotations per month in the format "06 (June) - 52 files"
print(f"{BOLD}{UNDERLINE}Annotations per month and year:{RESET}")
for month_year, count in annotations_per_month.items():
    month_num = month_year.split('-')[1]  # Extract the month part (MM) from the string formatted as YYYY-MM
    month_name = month_names[month_num]  # Get the full name of the month using the mapping dictionary
    print(f"{GREEN}{month_num} ({month_name}){RESET} - {BLUE}{count} files{RESET}")

# Find the month with the most annotations, if there are any recorded
if annotations_per_month:
    most_common_month = annotations_per_month.most_common(1)[0]  # Get the most common month and its count
    month_num = most_common_month[0].split('-')[1]  # Extract the month part (MM) from the string formatted as YYYY-MM
    month_name = month_names[month_num]  # Get the full name of the month using the mapping dictionary
    
    print(f"\n{BOLD}{UNDERLINE}Month with the most annotations:{RESET} {GREEN}{month_num} ({month_name}){RESET} - {BLUE}{most_common_month[1]} files{RESET}")
else:
    print("No annotations found.")


[1m[4mAnnotations per month and year:[0m
[92m01 (January)[0m - [94m27 files[0m
[92m02 (February)[0m - [94m45 files[0m
[92m03 (March)[0m - [94m17 files[0m
[92m04 (April)[0m - [94m25 files[0m
[92m05 (May)[0m - [94m28 files[0m
[92m06 (June)[0m - [94m52 files[0m

[1m[4mMonth with the most annotations:[0m [92m06 (June)[0m - [94m52 files[0m


In [20]:
import shutil  

# Move each valid file into a subfolder named after its own year and month (YYYY-MM)
for file in valid_files:

    match = pattern.match(file) # Match the file name to extract its date field
    if match:

        date_str = match.group(1) # Extract the date string (YYYYMMDD) from the filename

        month_year_str = f"{date_str[:4]}-{date_str[4:6]}" # Build this file's own YYYY-MM folder name

        month_folder = os.path.join(annotations_folder, month_year_str) # Create a folder path for that month-year

        os.makedirs(month_folder, exist_ok = True) # Create the folder if it doesn't already exist (exist_ok=True prevents an error if it does)

        shutil.move(os.path.join(annotations_folder, file), os.path.join(month_folder, file)) # Move the file from the original folder to its month folder

print(f"{BOLD}{UNDERLINE}Files organized into monthly folders.{RESET}") # Print a success message indicating that files have been organized


[1m[4mFiles organized into monthly folders.[0m


In [21]:
sorted_files = sorted(valid_files, key=lambda f: (f[:8], f[9:15]), reverse = True) # Sort the valid files based on their date and time components:
        # The key used for sorting is a tuple consisting of:
        # - The first 8 characters (YYYYMMDD) for the date
        # - Characters 9 to 14 (HHMMSS) for the time, i.e. the 6 characters after the underscore
        # The 'reverse = True' argument sorts the files in descending order, so the most recent files appear first.

print(f"{BOLD}{UNDERLINE}Annotations from most recent to oldest:{RESET}")

for file in sorted_files: # Iterate through sorted files and print them with colored components
    
    parts = file.split('_') # Split the filename components
    
    # Extract the date and time components
    date = parts[0]  # YYYYMMDD
    time = parts[1]  # HHMMSS

    # Format the components
    year = date[:4]       # YYYY
    month = date[4:6]     # MM
    day = date[6:8]       # DD
    hour = time[:2]       # HH
    minutes = time[2:4]    # MM
    seconds = time[4:6]    # SS

    # Create a colored output for each component
    colored_filename = (
        f"{RED}{year}{RESET}."
        f"{BLUE}{month}{RESET}."
        f"{GREEN}{day}{RESET}//"
        f"{YELLOW}{hour}{RESET}:"
        f"{MAGENTA}{minutes}{RESET}:"
        f"{CYAN}{seconds}{RESET}_" +
        "_".join(parts[2:])  # Add remaining parts without colors
    )

    print(colored_filename)

[1m[4mAnnotations from most recent to oldest:[0m
[91m2024[0m.[94m06[0m.[92m23[0m//[93m21[0m:[95m51[0m:[96m20[0m_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
[91m2024[0m.[94m06[0m.[92m23[0m//[93m21[0m:[95m51[0m:[96m02[0m_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt
[91m2024[0m.[94m06[0m.[92m23[0m//[93m19[0m:[95m37[0m:[96m04[0m_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt
[91m2024[0m.[94m06[0m.[92m19[0m//[93m21[0m:[95m55[0m:[96m56[0m_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt
[91m2024[0m.[94m06[0m.[92m19[0m//[93m18[0m:[95m57[0m:[96m57[0m_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt
[91m2024[0m.[94m06[0m.[92m19[0m//[93m05[0m:[95m24[0m:[96m01[0m_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt
[91m2024[0m.[94m06[0m.[92m18[0m//[93m21[0m:[95m55[0m:[96m39[0m_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt
[91m2024[0m.[94m06[0m.[92m18[0m//[93m2

In [22]:
satellite_count = Counter()  # Initialize a Counter object to track the number of annotations per satellite

# Loop through all valid files to extract and count the satellite numbers
for file in valid_files:
    match = pattern.match(file)  # Match each filename against the defined pattern
    if match:  # If the file matches the pattern
        satellite = match.group(3)  # Extract the satellite number (third group)
        satellite_count[satellite] += 1  # Increment the count for that satellite in the Counter

num_different_satellites = len(satellite_count)  # Count the number of unique satellites by getting the length of the Counter
print(f"{BOLD}{UNDERLINE}Total number of different satellites:{RESET} {GREEN}{num_different_satellites}{RESET}")

print() # Add space for clarity

print(f"{BOLD}{UNDERLINE}Annotations per satellite:{RESET}")  # Print a heading for annotations per satellite
for satellite, count in satellite_count.items(): # Loop through each satellite and its count in the Counter to display the annotations per satellite
    print(f"{CYAN}SN{satellite}:{RESET}{GREEN} {count} annotations{RESET}")  # Print each satellite number and its annotation count 

print() # Add space for clarity

most_recent_annotation = sorted_files[0]  # Get the most recent file from the sorted list
most_recent_satellite = pattern.match(most_recent_annotation).group(3) # Extract the satellite information from the most recent annotation  
print(f"{BOLD}{UNDERLINE}Most recent annotation satellite:{RESET} {GREEN}SN{most_recent_satellite}{RESET}")

[1m[4mTotal number of different satellites:[0m [92m9[0m

[1m[4mAnnotations per satellite:[0m
[96mSN33:[0m[92m 16 annotations[0m
[96mSN24:[0m[92m 26 annotations[0m
[96mSN31:[0m[92m 19 annotations[0m
[96mSN27:[0m[92m 29 annotations[0m
[96mSN28:[0m[92m 16 annotations[0m
[96mSN29:[0m[92m 22 annotations[0m
[96mSN26:[0m[92m 37 annotations[0m
[96mSN30:[0m[92m 18 annotations[0m
[96mSN43:[0m[92m 11 annotations[0m

[1m[4mMost recent annotation satellite:[0m [92mSN29[0m


In [23]:
# The number of unique regions comes from the unique_regions set built in the loop for exercise 3

num_unique_regions = len(unique_regions)

print(f"{BOLD}{UNDERLINE}Number of unique regions:{RESET} {GREEN}{num_unique_regions}{RESET}")


[1m[4mNumber of unique regions:[0m [92m137[0m
