### Exercise


You are given a zip file containing a subfolder with multiple annotation files. Each file name follows this convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME expressed as HHMMSS (hour, minutes and seconds), e.g. 213430
- SATELLITE_NUMBER an integer that represents the satellite number.
- VERSION provides the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION provides a unique location in the form of a string, e.g. SATL-2KM-10N_552_4164
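
The convention above maps naturally onto a regular expression with one named group per field. The following is a minimal sketch, not the reference solution: `NAME_RE` and `parse_name` are hypothetical names, TIME is assumed to be exactly six digits, and VERSION is assumed to consist only of digits separated by underscores.

```python
import re
from datetime import datetime

# Hypothetical pattern: one named group per field of the convention.
NAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+(?:_\d+)*)_(?P<region>[\w-]+)\.txt"
)


def parse_name(filename):
    """Return the fields of a conforming file name as a dict, else None."""
    m = NAME_RE.fullmatch(filename)  # fullmatch rejects trailing garbage
    if m is None:
        return None
    fields = m.groupdict()
    d, t = fields["date"], fields["time"]
    try:
        # datetime() raises ValueError for impossible dates such as month 13
        fields["timestamp"] = datetime(
            int(d[:4]), int(d[4:6]), int(d[6:]),
            int(t[:2]), int(t[2:4]), int(t[4:]),
        )
    except ValueError:
        return None
    fields["satellite"] = int(fields["satellite"])
    return fields


example = "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"
print(parse_name(example))
```

Returning `None` for non-conforming names makes it easy to count how many files follow the convention: count the results that are not `None`.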

Find out the following things about your data:

1. How many files the annotations folder has.
2. How many of them follow the name convention expressed above.
3. How many annotations you have per month and year, and which month has the most annotation files.
4. Create a new annotations folder with one subfolder per month, each holding that month's annotations.
5. Print all the annotations from the most recent to the oldest one. 
6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file. 
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to get each field of an annotation file name.
- You can use sort from numpy on strings.
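
The two tips can be tried out on a handful of names before touching the real data. This is only a sketch: `split_fields` is a hypothetical helper, two of the three sample names use made-up regions, and the split of VERSION from UNIQUE_REGION assumes that the region's first underscore-separated piece is never purely numeric.

```python
import numpy as np


def split_fields(filename):
    """Split a conforming name into (date, time, satellite, version, region)."""
    stem = filename.rsplit(".", 1)[0]   # drop the .txt extension
    parts = stem.split("_")
    date, time = parts[0], parts[1]
    satellite = parts[2][2:]            # strip the leading "SN"
    rest = parts[5:]                    # skip the fixed words QUICKVIEW, VISUAL
    # VERSION is the leading run of purely numeric pieces (assumption);
    # everything after it belongs to UNIQUE_REGION.
    n_version = 0
    while n_version < len(rest) and rest[n_version].isdigit():
        n_version += 1
    version = "_".join(rest[:n_version])
    region = "_".join(rest[n_version:])
    return date, time, satellite, version, region


# Sample names; the REGION-A and REGION-B entries are invented for illustration.
names = [
    "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_REGION-A_1_2.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_REGION-B_3_4.txt",
]

# Names start with YYYYMMDD_HHMMSS, so lexicographic order is chronological.
newest_first = np.sort(names)[::-1]
for name in newest_first:
    print(split_fields(name))
```

Because both VERSION and UNIQUE_REGION contain underscores, only the first three pieces sit at fixed positions after `split("_")`; the remaining fields have to be reassembled as shown.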

In [1]:
# code

import zipfile
import os
import re
from collections import Counter
import pandas as pd
from datetime import datetime
import numpy as np

# Define relative file path and extraction path
zip_file_path = './session_4.zip'  # Assuming the ZIP file is in the same directory as this script
extract_to_folder = './extracted_annotations'  # Folder to extract files

# Extract the contents of the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_folder)

# Regex for the naming convention; TIME is matched leniently as 6 or 7 digits
pattern = re.compile(
    r'(\d{8})_(\d{6,7})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([\w-]+)\.txt')

# Data to store results
annotations_data = []
annotations_per_month_year = Counter()
satellite_counts = Counter()
unique_regions = set()

# Walk through extracted folder and subfolders
for root, dirs, files in os.walk(extract_to_folder):
    for file in files:
        if file.endswith('.txt'):  # only .txt files are counted as annotations
            annotations_data.append(file)
            match = pattern.match(file)
            if match:
                # Split the filename to get the required parts
                date, time, satellite_num, version, unique_region = match.groups()
                
                # Parse date to extract month and year
                date_obj = datetime.strptime(date, '%Y%m%d')
                month_year = date_obj.strftime('%Y-%m')
                annotations_per_month_year[month_year] += 1
                
                # Satellite info
                satellite_counts[satellite_num] += 1
                
                # Unique regions
                unique_regions.add(unique_region)

# Sort annotations using numpy. Names start with YYYYMMDD_HHMMSS, so a
# lexicographic sort is chronological; reversing it gives most recent first.
annotations_sorted = np.sort(annotations_data)[::-1]

# Create folders by month (if needed)
for month_year in annotations_per_month_year:
    month_folder = os.path.join(extract_to_folder, month_year)
    os.makedirs(month_folder, exist_ok=True)
    # Optionally move files into respective month folders (code can be added here if needed)

# Results
num_files = len(annotations_data)
num_matching_files = sum(1 for file in annotations_data if pattern.match(file))
most_annotations_month = annotations_per_month_year.most_common(1)[0] if annotations_per_month_year else None
most_recent_annotation = annotations_sorted[0] if len(annotations_sorted) > 0 else None
most_recent_satellite = None

# Extract the satellite number of the most recent file with the same regex,
# so a non-conforming file name cannot produce a bogus value
if most_recent_annotation is not None:
    recent_match = pattern.match(str(most_recent_annotation))
    if recent_match:
        most_recent_satellite = recent_match.group(3)  # digits after 'SN'

# Summary
summary = {
    "Total files": num_files,
    "Files matching pattern": num_matching_files,
    "Most annotations in month": most_annotations_month,
    "Most recent annotation": most_recent_annotation,
    "Most recent satellite": most_recent_satellite,
    "Total unique satellites": len(satellite_counts),
    "Total unique regions": len(unique_regions)
}

# Display sorted annotations and the summary
annotations_sorted_df = pd.DataFrame({"Annotations": annotations_sorted})
print(annotations_sorted_df)
print(summary)


                                           Annotations
0    20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SA...
1    20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SA...
2    20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SA...
3    20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SA...
4    20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SA...
..                                                 ...
201  20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_S...
202  20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_S...
203  20240101_192856_SN24_QUICKVIEW_VISUAL_1_1_10_S...
204  20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_S...
205  20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_S...

[206 rows x 1 columns]
{'Total files': 206, 'Files matching pattern': 194, 'Most annotations in month': ('2024-06', 52), 'Most recent annotation': '20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt', 'Most recent satellite': '29', 'Total unique satellites': 9, 'Total unique regions': 137}
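
The cell above creates the month folders but leaves them empty, and it does not print the per-satellite counts. One possible way to finish steps 4 and 6 is sketched below. `organise_by_month` and the destination folder are hypothetical, the regex is the same one used in the cell above, and files are copied rather than moved so the extracted data stays untouched.

```python
import re
import shutil
from collections import Counter
from pathlib import Path

# Same pattern as in the cell above.
PATTERN = re.compile(
    r"(\d{8})_(\d{6,7})_SN(\d+)_QUICKVIEW_VISUAL_([\d_]+)_([\w-]+)\.txt"
)


def organise_by_month(src_root, dst_root):
    """Copy conforming files into dst_root/YYYY-MM/ and return summary data.

    Returns (per_month, per_satellite, newest), where newest is the
    (date, time, satellite) triple of the most recent conforming file.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    per_month, per_satellite = Counter(), Counter()
    newest = None
    # sorted() materialises the file list before anything is copied,
    # so freshly copied files are never picked up by the same walk.
    for path in sorted(src_root.rglob("*.txt")):
        m = PATTERN.fullmatch(path.name)
        if m is None:
            continue  # skip names that break the convention
        date, time, satellite = m.group(1), m.group(2), m.group(3)
        month = f"{date[:4]}-{date[4:6]}"
        target_dir = dst_root / month
        target_dir.mkdir(parents=True, exist_ok=True)
        # Note: identically named files from different subfolders would
        # overwrite each other in the destination.
        shutil.copy2(path, target_dir / path.name)
        per_month[month] += 1
        per_satellite[satellite] += 1
        if newest is None or (date, time) > newest[:2]:
            newest = (date, time, satellite)
    return per_month, per_satellite, newest
```

A call such as `organise_by_month('./extracted_annotations', './annotations_by_month')` would then return the annotations per month, the annotations per satellite, and the satellite of the most recent file in one pass. Keeping the destination outside the extraction folder avoids double counting if the notebook is rerun.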
