# Python for Data Science

## Assignment 4

### Basic Libraries I

## README.md

### Overview: Satellite Annotations Analysis
This assignment processes a collection of satellite annotation files, analyzing and organizing the data based on specific criteria such as date, satellite number, and unique region. The file naming convention includes metadata like date, time, satellite number, and region, which we leverage to extract insights.

### Exercises


Given a zip file containing a subfolder of annotation files, each named according to the following convention:

{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt

where:

- DATE expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME expressed as HHMMSS (hour, minutes and seconds), e.g. 185527
- SATELLITE_NUMBER an integer that represents the satellite number.
- VERSION provides the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION provides a unique location in the form of a string, e.g. SATL-2KM-10N_552_4164
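
The fields listed above can be pulled out of a filename with a regular expression that uses one named group per field. This is a minimal sketch, not part of the assignment: `FILENAME_RE` and `parse_annotation_name` are hypothetical names, and the pattern assumes the version always has three numeric parts and that everything after it (up to `.txt`) is the region.

```python
import re

# Hypothetical pattern: one named group per field of the naming convention.
FILENAME_RE = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+_\d+_\d+)_(?P<region>.+)\.txt"
)


def parse_annotation_name(name):
    """Return a dict of fields, or None if the name does not follow the convention."""
    match = FILENAME_RE.fullmatch(name)
    return match.groupdict() if match else None


fields = parse_annotation_name(
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt"
)
print(fields)
```

Using `fullmatch` means a name with extra text before or after the expected fields is rejected rather than partially matched.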

Find out the following things about your data:

1. How many files the annotations folder has.
2. How many of them follow the name convention expressed above.
3. How many annotations you have per month and year, and which month has the most annotation files.
4. Create a new annotations folder containing one subfolder per month.
5. Print all the annotations from the most recent to the oldest one. 
6. How many different satellites there are, how many annotations we have per satellite number, and which one was used in the most recent annotation file. 
7. How many unique regions there are.

Some tips:
- The str class has a method called split; you can use it to get each field of an annotation filename.
- You can use sort from numpy on strings.
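
Both tips can be illustrated in a few lines. This sketch uses the built-in `sorted` to stay dependency-free; `numpy.sort` applies the same lexicographic ordering to an array of strings. The variable names are illustrative only.

```python
# Three filenames following the convention, in arbitrary order.
names = [
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt",
]

# str.split cuts a name at every underscore; the first three pieces are
# the date, the time and the satellite ID.
date, time, satellite = names[0].split("_")[:3]
print(date, time, satellite)

# Plain string ordering compares characters left to right. Because each name
# starts with YYYYMMDD_HHMMSS, alphabetical order is also chronological order.
oldest_first = sorted(names)
newest_first = oldest_first[::-1]
print(newest_first[0])
```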

## Exercise 1

Description: Counts all files in the annotations folder.

In [22]:
# Importing the 'os' module to interact with the file system
import os

# Specifying the path where all my annotation files are stored. 
annotations_path = '/Users/noor/Desktop/Python for Data Science/Week 4/session_4/annotations'

# Counting how many entries are in the folder I specified above (os.listdir also returns subfolders and hidden files).
total_files = len(os.listdir(annotations_path))

# Printing out the total number of files.
print("Total files:", total_files)

Total files: 206
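
`os.listdir` returns every entry in a folder, including subfolders and hidden files such as `.DS_Store` on macOS. If only regular files should be counted, something along these lines could be used. `count_regular_files` is a hypothetical helper, and the demo runs on a throwaway temporary folder rather than on the real annotations path.

```python
import tempfile
from pathlib import Path


def count_regular_files(folder, include_hidden=False):
    """Count regular files directly inside folder, skipping subfolders."""
    count = 0
    for entry in Path(folder).iterdir():
        if not entry.is_file():
            continue  # skip directories and other non-file entries
        if entry.name.startswith(".") and not include_hidden:
            continue  # skip hidden files such as .DS_Store
        count += 1
    return count


# Demo on a temporary folder with two files, one hidden file and one subfolder.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "a.txt").write_text("x")
    (tmp_path / "b.txt").write_text("y")
    (tmp_path / ".DS_Store").write_text("")
    (tmp_path / "subfolder").mkdir()
    visible = count_regular_files(tmp_path)
    with_hidden = count_regular_files(tmp_path, include_hidden=True)

print(visible, with_hidden)
```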


## Exercise 2

Description: Uses regular expressions to match files to the expected naming convention.

In [21]:
import re   # This helps the program recognize patterns in file names.

# Defining a pattern describing what a file name should look like to match my criteria.
pattern = r"^\d{8}_\d{6}_SN\d+_QUICKVIEW_VISUAL_\d+_\d+_\d+_[A-Z]+-\d+KM-\d+[NS]_\d+_\d+\.txt$" 

# Checking all the files in the folder I specified earlier.
matching_files = [file for file in os.listdir(annotations_path) if re.match(pattern, file)]

# Printing how many files matched the naming convention.
print("Files matching naming convention:", len(matching_files))

Files matching naming convention: 194
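
Since 206 files were found but only 194 match, it can help to look at the names that were rejected. The sketch below reuses the pattern from the cell above to split a list of names into two groups. `split_by_convention` is a hypothetical helper, and the two rejected names are made up for illustration.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
PATTERN = re.compile(
    r"^\d{8}_\d{6}_SN\d+_QUICKVIEW_VISUAL_\d+_\d+_\d+"
    r"_[A-Z]+-\d+KM-\d+[NS]_\d+_\d+\.txt$"
)


def split_by_convention(names):
    """Return (matching, non_matching) lists, preserving input order."""
    matching, non_matching = [], []
    for name in names:
        (matching if PATTERN.match(name) else non_matching).append(name)
    return matching, non_matching


names = [
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10.txt",  # made up: region missing
    ".DS_Store",  # made up: hidden system file
]
good, bad = split_by_convention(names)
print(len(good), "match; rejected:", bad)
```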


## Exercise 3

Description: Extracts and counts annotations per month and year, identifying the month with the most annotations.

In [23]:
import pandas as pd

# Creating an empty list where I’ll store each file's name along with its year and month.
file_data = []

# Splitting the file name to isolate the first part (the date) and extracting the year and month from it.
for file in matching_files:
    date_part = file.split("_")[0]  # Get the date part of the filename
    year = date_part[:4]           # The first 4 characters represent the year
    month = date_part[4:6]         # The next 2 characters represent the month
    file_data.append((file, year, month))  # Add the filename, year, and month as a tuple to the list

# Organizing the data into a DataFrame.
df = pd.DataFrame(file_data, columns=["filename", "year", "month"])

# Counting how many annotations (files) exist for each combination of year and month.
annotations_per_month = df.groupby(["year", "month"]).size().reset_index(name="count")

# Identifying which month has the highest number of annotations.
max_month = annotations_per_month.loc[annotations_per_month["count"].idxmax()]

# Printing the annotations count for each year and month in a neat format.
print("Annotations per Month and Year:")
print(annotations_per_month.to_string(index=False))

# Displaying the month that had the most annotations.
print(f"\nMonth with the most annotations: {max_month['year']}-{max_month['month']} with {max_month['count']} annotations")

Annotations per Month and Year:
year month  count
2024    01     27
2024    02     45
2024    03     17
2024    04     25
2024    05     28
2024    06     52

Month with the most annotations: 2024-06 with 52 annotations
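
The same tally can be made with the standard library alone, which is handy for cross-checking the pandas result. A minimal sketch, assuming every name starts with a YYYYMMDD date; `count_per_month` is a hypothetical helper and the three names are a small sample, not the full data set.

```python
from collections import Counter


def count_per_month(names):
    """Count names per (year, month), read from the leading YYYYMMDD date."""
    return Counter((name[:4], name[4:6]) for name in names)


names = [
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
    "20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
]
per_month = count_per_month(names)

# most_common(1) returns a one-element list holding the largest (key, count) pair.
(busiest_year, busiest_month), busiest_count = per_month.most_common(1)[0]
print(dict(per_month))
print(f"Busiest month: {busiest_year}-{busiest_month} ({busiest_count} files)")
```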


## Exercise 4

Description: Creates a new folder structure based on months and organizes the files accordingly.

In [24]:
import shutil
import os
import pandas as pd
from prettytable import PrettyTable     # To display the results in a nice table format

# Setting up a folder called 'organized_annotations' where I’ll neatly organize my files by month.
output_path = 'organized_annotations'

# Creating a table to show details of how files are organized.
table = PrettyTable(['Year', 'Month', 'Filename', 'New Path'])

for _, row in annotations_per_month.iterrows():
    # Creating a folder name based on the year and month.
    month_folder = f"{row['year']}_{row['month']}"
    # Generating the full path for this folder inside the 'organized_annotations' folder.
    month_path = os.path.join(output_path, month_folder)
    # Making sure the folder exists. If it doesn’t, this line will create it.
    os.makedirs(month_path, exist_ok=True)
    
    # For each file that belongs to this year and month:
    for file in df[(df['year'] == row['year']) & (df['month'] == row['month'])]['filename']:
        # Getting the original file path in the 'annotations' folder.
        original_path = os.path.join(annotations_path, file)
        # Creating the new path where the file will be copied to.
        new_path = os.path.join(month_path, file)
        # Copying the file from its original location to the new organized folder.
        shutil.copy2(original_path, new_path)
        
        # Adding a row to my table with details about this file's organization.
        table.add_row([row['year'], row['month'], file, new_path])

# Finally, printing the table so I can see how my files have been organized.
print(table)

+------+-------+------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Year | Month |                                Filename                                |                                               New Path                                               |
+------+-------+------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| 2024 |   01  | 20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt | organized_annotations/2024_01/20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt |
| 2024 |   01  | 20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt | organized_annotations/2024_01/20240101_174301_SN33_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_404_3770.txt |
| 2024 |   01  | 20240101_192856_SN
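
Before copying anything, the destination of each file can be computed as a dry run and inspected. This is a sketch under the assumption that folders are named `YYYY_MM` as in the table above; `plan_month_folders` is a hypothetical helper that never touches the disk.

```python
from pathlib import PurePosixPath


def plan_month_folders(names, output_root="organized_annotations"):
    """Map each filename to its YYYY_MM destination without touching the disk."""
    plan = {}
    for name in names:
        month_folder = f"{name[:4]}_{name[4:6]}"
        plan[name] = PurePosixPath(output_root) / month_folder / name
    return plan


names = [
    "20240102_185527_SN27_QUICKVIEW_VISUAL_1_1_10_SATL-2KM-11N_740_3850.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
]
plan = plan_month_folders(names)
for name, destination in plan.items():
    print(name, "->", destination)
```

`PurePosixPath` is used so the printed paths always use forward slashes; for real copying, `pathlib.Path` or `os.path.join` would follow the conventions of the operating system.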

## Exercise 5

Description: Sorts files from the most recent to the oldest based on date extracted from filenames.

In [25]:
from prettytable import PrettyTable     # To display the results in a nice table format

# Creating a PrettyTable object to hold and display the sorted file information.
table = PrettyTable()

# Setting the column names for my table.
table.field_names = ["Date (YYYYMMDD_HHMMSS)", "Filename"]

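# Pairing each file with its leading YYYYMMDD_HHMMSS timestamp (the first 15 characters)
# and sorting the pairs from the most recent to the oldest.
file_date = sorted(((file[:15], file) for file in matching_files), reverse=True)

# Going through the sorted list of dates and files.
for date, filename in file_date: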
    # For each file, I add a row to the table with the date and filename.
    table.add_row([date, filename])

# Finally, printing the table to show the sorted files from the most recent to the oldest.
print("Sorted Files from Most Recent to Oldest:")
print(table)

Sorted Files from Most Recent to Oldest:
+------------------------+------------------------------------------------------------------------+
| Date (YYYYMMDD_HHMMSS) |                                Filename                                |
+------------------------+------------------------------------------------------------------------+
|    20240623_215120     | 20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt  |
|    20240623_215102     | 20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt  |
|    20240623_193704     | 20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt  |
|    20240619_215556     | 20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt  |
|    20240619_185757     | 20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt  |
|    20240619_052401     | 20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt  |
|    20240618_215539     | 20240618_215539_SN31_QUICKVIEW_V
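
Sorting on a parsed timestamp rather than on raw text gives the same order and also validates each date, because `datetime.strptime` raises `ValueError` for impossible values (month 13, hour 25) that a digits-only pattern would accept. A minimal sketch; `capture_time` is a hypothetical helper.

```python
from datetime import datetime


def capture_time(name):
    """Parse the leading YYYYMMDD_HHMMSS (15 characters) into a datetime."""
    return datetime.strptime(name[:15], "%Y%m%d_%H%M%S")


names = [
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt",
]

# reverse=True puts the most recent capture first.
newest_first = sorted(names, key=capture_time, reverse=True)
for name in newest_first:
    print(capture_time(name).isoformat(sep=" "), name)
```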

## Exercise 6

Description: Counts unique satellites and annotations per satellite and identifies the satellite in the most recent annotation.

In [27]:
# Creating an empty dictionary to keep track of how many annotations belong to each satellite.
satellite_dict = {}

# Looping through all the matching files to identify the satellite number in each filename.
for file in matching_files:
    satellite_num = file.split("_")[2]  # The third underscore-separated field is the satellite ID, e.g. "SN27"
    # If the satellite is already in the dictionary, increase its count; otherwise, add it with a count of 1.
    satellite_dict[satellite_num] = satellite_dict.get(satellite_num, 0) + 1

# Printing how many unique satellites there are.
print(f"Unique satellites: {len(satellite_dict)}")

# Printing the number of annotations for each satellite in a readable format.
for sat, count in satellite_dict.items():
    print(f"Satellite {sat}: {count} annotations")

# Finally, checking which satellite was used most recently.
# file_date is sorted from newest to oldest, so its first entry holds the most recent filename.
print("Most recent satellite:", file_date[0][1].split("_")[2])

Unique satellites: 9
Satellite SN27: 29 annotations
Satellite SN24: 26 annotations
Satellite SN26: 37 annotations
Satellite SN33: 16 annotations
Satellite SN29: 22 annotations
Satellite SN28: 16 annotations
Satellite SN31: 19 annotations
Satellite SN30: 18 annotations
Satellite SN43: 11 annotations
Most recent satellite: SN29
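
The per-satellite counts can also be built with `collections.Counter`, which removes the manual dictionary bookkeeping. In this sketch the satellite ID is read as the third underscore-separated field, so it works for satellite numbers of any length. `satellite_id` is a hypothetical helper and the names are a small sample.

```python
from collections import Counter


def satellite_id(name):
    """Return the third underscore-separated field, e.g. 'SN27'."""
    return name.split("_")[2]


names = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt",
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt",
]
per_satellite = Counter(satellite_id(name) for name in names)

# Names start with YYYYMMDD_HHMMSS, so the largest string is the newest file.
most_recent = max(names)

print(len(per_satellite), dict(per_satellite))
print("Most recent satellite:", satellite_id(most_recent))
```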


## Exercise 7

Description: Counts unique regions from filenames.

In [28]:
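# Creating a new column in my DataFrame to store the region data.
# Note: split("_")[-1] keeps only the text after the last underscore (e.g. "4134"), not the full region string.
df['unique_region'] = df['filename'].apply(lambda x: x.split("_")[-1].replace('.txt', ''))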

# Knowing how many unique regions are present in the data.
unique_regions = df['unique_region'].nunique()      # The nunique() function counts the number of distinct values in the 'unique_region' column.

# Finally, printing the total number of unique regions.
print("Unique regions:", unique_regions)

Unique regions: 87
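
The example region in the naming convention, SATL-2KM-10N_552_4164, itself contains underscores, so the text after the last underscore is only its final number. The sketch below keeps the whole region, assuming it always spans the last three underscore-separated fields. `full_region` is a hypothetical helper, and the third filename is made up to show two regions that share a final number.

```python
import os


def full_region(name):
    """Return the last three underscore-separated fields without the extension."""
    stem = os.path.splitext(name)[0]  # drop ".txt"
    return "_".join(stem.split("_")[-3:])


names = [
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt",
    # Made-up name: different region, same final number as the first one.
    "20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_596_4134.txt",
]

last_field_only = {os.path.splitext(n)[0].split("_")[-1] for n in names}
full_regions = {full_region(n) for n in names}

print("Unique by last field:", len(last_field_only))
print("Unique by full region:", len(full_regions))
```

On real data the two counts may or may not differ; the full-region count can only be equal to or larger than the last-field count.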
