# Methodology for Washing an Annotation Database

This document outlines the process used to clean and validate annotations in a database of captions for remote sensing images. The process ensures the quality and consistency of the annotations by applying specific rules and validations. 

## Databases Involved

Three databases are used in this process:

1. **`annotation.db`**: The original annotation database containing the captions.
2. **`metadata.db`**: The original metadata database associated with the annotations.
3. **`rejected_annotation.db`** (optional): A database to record annotations that fail the validation rules for further analysis.

## Validation Rules

The cleaning process applies the following rules to validate the annotations:

1. **Invalid Characters**:
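   - Only a defined set of characters is allowed: ASCII letters, digits, whitespace, the punctuation and symbols `.,!?;:'"-%/()&#`, and the typographic quotes `’“”`. The pattern below lists `ʻ` (U+02BB) where the left single quote `‘` (U+2018) would be expected, so `‘` is rejected as written.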
   - Any annotation containing characters outside this set is marked as invalid.

2. **Unbalanced Brackets**:
   - Annotations must have balanced brackets, including `()`, `[]`, and `{}`.
   - If brackets are unbalanced, the annotation is marked as invalid.

3. **Abnormal Length**:
   - Annotations with fewer than 10 or more than 120 whitespace-separated words are marked as invalid.

4. **Special Case for Separators**:
   - The em dash (`—`) is accepted as a separator only when it occurs an even, non-zero number of times in the annotation (that is, in pairs). An odd number of em dashes marks the annotation as invalid.

## Implementation

The implementation involves Python code that validates each annotation based on the defined rules. Below is an explanation of the key components of the code:

### Regular Expression for Invalid Characters

```python
PATTERN = r"[^a-zA-Z0-9\s.,!?;:'\"\-%\/()&#\ʻ\’\“\”]"
```

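This regular expression matches any single character outside the allowed set. Square and curly brackets are not in that set, so an annotation containing `[`, `]`, `{`, or `}` is flagged by this rule even when its brackets are balanced. The em dash (`—`) is also matched; `check_validity` handles it as a special case.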
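To make the character rule concrete, here is a small sketch that lists which characters the pattern flags in a few captions. The pattern is copied from above; the helper name `disallowed_characters` and the example strings are illustrative and not part of the original notebook.

```python
import re

# Pattern copied verbatim from the section above.
PATTERN = r"[^a-zA-Z0-9\s.,!?;:'\"\-%\/()&#\ʻ\’\“\”]"


def disallowed_characters(text):
    """Hypothetical helper: distinct flagged characters, in order of first appearance."""
    seen = []
    for ch in re.findall(PATTERN, text):
        if ch not in seen:
            seen.append(ch)
    return seen


print(disallowed_characters("A two-lane road (65 mph)."))    # nothing flagged
print(disallowed_characters("edge to the top_right edge"))   # underscore is flagged
print(disallowed_characters("known as Kaniá:taro in Mohawk"))  # accented letter is flagged
print(disallowed_characters("tagged [residential] {zone}"))  # square and curly brackets are flagged
```

Listing the offending characters, rather than only returning a boolean, can make it easier to see why a given caption was rejected.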

### Function to Check Balanced Brackets

```python
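def find_unbalanced_brackets(text):
    """Return True if the brackets in `text` are unbalanced."""
    # Number of currently open brackets, tracked per bracket type
    stack = {'(': 0, '[': 0, '{': 0}
    for ch in text:
        if ch in stack:
            stack[ch] += 1
        elif ch in ')]}':
            if ch == ')' and stack['('] > 0:
                stack['('] -= 1
            elif ch == ']' and stack['['] > 0:
                stack['['] -= 1
            elif ch == '}' and stack['{'] > 0:
                stack['{'] -= 1
            else:
                # Closing bracket with no opener of the same type before it
                return True
    # Any opener left unclosed also counts as unbalanced
    return not all(v == 0 for v in stack.values())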

This function returns `True` when a closing bracket appears without a preceding opener of the same type, or when an opener is never closed.
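The sketch below runs the counting approach on a few inputs. Because it keeps one counter per bracket type, it does not check nesting order across types, so a crossed sequence such as `([)]` is reported as balanced. `has_misnested_brackets` is a hypothetical stricter variant shown for comparison only; it is not part of the original pipeline.

```python
def find_unbalanced_brackets(text):
    # Compact equivalent of the counting logic above, so this block runs on its own.
    stack = {'(': 0, '[': 0, '{': 0}
    closers = {')': '(', ']': '[', '}': '{'}
    for ch in text:
        if ch in stack:
            stack[ch] += 1
        elif ch in closers:
            if stack[closers[ch]] == 0:
                return True
            stack[closers[ch]] -= 1
    return any(stack.values())


def has_misnested_brackets(text):
    """Hypothetical stricter variant: also rejects crossed pairs such as "([)]"."""
    closers = {')': '(', ']': '[', '}': '{'}
    open_stack = []
    for ch in text:
        if ch in '([{':
            open_stack.append(ch)
        elif ch in closers:
            # The most recent opener must match this closer
            if not open_stack or open_stack.pop() != closers[ch]:
                return True
    return bool(open_stack)


for caption in ["a lake (partly frozen)", "a lake (partly frozen", "a lake ) (", "([)]"]:
    print(repr(caption), find_unbalanced_brackets(caption), has_misnested_brackets(caption))
```

For captions that pass the character rule, only `(` and `)` can occur, so the two checks agree in practice; the difference matters only if square or curly brackets are later added to the allowed set.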

### Function to Validate Annotations

```python
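import re


def check_validity(pattern, text):
    """Return True if `text` passes every validation rule."""
    is_valid = False

    # Rules 1 and 4: disallowed characters, with the em dash as a special case
    matches = re.findall(pattern, text)
    if matches:
        # Count the em dashes among the matched characters
        cnt = matches.count('—')
        # NOTE: an even, non-zero em-dash count marks the text valid here even if
        # other disallowed characters were matched as well
        if cnt // 2 > 0 and cnt % 2 == 0:
            is_valid = True
    else:
        is_valid = True

    # Rule 2: unbalanced brackets
    is_valid = is_valid and not find_unbalanced_brackets(text)

    # Rule 3: abnormal length, counted in whitespace-separated words
    num_splits = len(text.split())
    is_valid = is_valid and (10 <= num_splits <= 120)

    return is_valid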

This function evaluates each annotation against the validation rules.
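When inspecting rejected captions, it can help to know which rule failed rather than only the final boolean. The following is a minimal sketch under that assumption: `rejection_reasons` and its reason labels are hypothetical names, but the three checks mirror the logic of `check_validity` and `find_unbalanced_brackets` above.

```python
import re

# Pattern copied verbatim from the section above.
PATTERN = r"[^a-zA-Z0-9\s.,!?;:'\"\-%\/()&#\ʻ\’\“\”]"


def has_unbalanced_brackets(text):
    # Per-type counters, equivalent to find_unbalanced_brackets above
    open_counts = {'(': 0, '[': 0, '{': 0}
    closers = {')': '(', ']': '[', '}': '{'}
    for ch in text:
        if ch in open_counts:
            open_counts[ch] += 1
        elif ch in closers:
            if open_counts[closers[ch]] == 0:
                return True
            open_counts[closers[ch]] -= 1
    return any(open_counts.values())


def rejection_reasons(text, pattern=PATTERN, min_words=10, max_words=120):
    """Hypothetical helper: list the rules a caption violates (empty list = valid)."""
    reasons = []
    matches = re.findall(pattern, text)
    dash_count = matches.count('—')
    # Same condition as check_validity: matches are tolerated only when the
    # em-dash count is even and non-zero
    if matches and not (dash_count > 0 and dash_count % 2 == 0):
        reasons.append("invalid_characters")
    if has_unbalanced_brackets(text):
        reasons.append("unbalanced_brackets")
    if not (min_words <= len(text.split()) <= max_words):
        reasons.append("abnormal_length")
    return reasons


print(rejection_reasons("A winding stream meanders from the top-left corner down to the bottom-right corner."))
print(rejection_reasons("Too short (and unbalanced"))
print(rejection_reasons("The road runs from the top-center edge to the top_right edge of the image."))
print(rejection_reasons("The plant—a large rectangular facility—stands out against the surrounding landscape near the shore."))
```

A caption is valid under `check_validity` exactly when this helper returns an empty list.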

### Application to Database

The validation is applied to the `annotation` table loaded from `annotation.db` as follows:

```python
annotation_df['is_valid'] = annotation_df['ANNOTATION'].apply(lambda x: check_validity(PATTERN, x))
```

Annotations marked as valid (`True`) are retained in the cleaned database, while those marked as invalid (`False`) can be optionally recorded in `rejected_annotation.db`.

In [1]:
from tqdm import tqdm
import os
import pandas as pd
import sqlite3
import re

tqdm.pandas()

PATTERN = r"[^a-zA-Z0-9\s.,!?;:'\"\-%\/()&#\ʻ\’\“\”]"

ANNOTATION_DB_PATH = '../database/annotation.db'
METADATA_DB_PATH = '../database/metadata.db'
REJECTED_ANNOTATION_DB_PATH = '../database/rejected_annotation.db'


def find_unbalanced_brackets(text):
    stack = {'(': 0, '[': 0, '{': 0}
    for ch in text:
        if ch in stack:
            stack[ch] += 1
        elif ch in ')]}':
            if ch == ')' and stack['('] > 0:
                stack['('] -= 1
            elif ch == ']' and stack['['] > 0:
                stack['['] -= 1
            elif ch == '}' and stack['{'] > 0:
                stack['{'] -= 1
            else:
                return True
    
    # Any opener left unclosed counts as unbalanced
    return not all(v == 0 for v in stack.values())

def check_validity(pattern, text):
    """Return True if `text` passes every validation rule."""
    is_valid = False

    # invalid characters
    matches = re.findall(pattern, text)
    # only 2n (n >= 1) occurrences of '—' are recognized as a valid separator
    if matches:
        cnt = matches.count('—')
        # NOTE: an even, non-zero count of '—' marks the text valid here even if
        # other disallowed characters were matched as well
        if cnt // 2 > 0 and cnt % 2 == 0:
            is_valid = True
    else:
        is_valid = True

    # unbalanced brackets
    is_valid = is_valid and not find_unbalanced_brackets(text)

    # abnormal length (whitespace-separated words)
    is_valid = is_valid and (10 <= len(text.split()) <= 120)

    return is_valid

# Connect to annotation and metadata databases
conn_annotation = sqlite3.connect(ANNOTATION_DB_PATH)
conn_metadata = sqlite3.connect(METADATA_DB_PATH)

# Connect to the rejected-annotation database only if the file already exists
conn_rejected_annotation = (
    sqlite3.connect(REJECTED_ANNOTATION_DB_PATH)
    if os.path.exists(REJECTED_ANNOTATION_DB_PATH)
    else None
)

# load the annotation table into a pandas dataframe
annotation_df = pd.read_sql_query("SELECT * FROM annotation", conn_annotation)

# check the validity of the annotations
annotation_df['is_valid'] = annotation_df['ANNOTATION'].progress_apply(lambda x: check_validity(PATTERN, x))

100%|██████████| 618437/618437 [00:12<00:00, 48550.55it/s]


In [16]:
# Set display options to show all content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)


samples = annotation_df[annotation_df['is_valid']==False].sample(10)

for i, sample in samples.iterrows():
    print(f"Sample {i+1}:")
    print(f"Annotation: {sample['ANNOTATION']}")

samples

Sample 51253:
Annotation: Centered in the scene, a large rectangular zone occupies over half the view, known as Sherwood Oaks, and tagged for residential land use. This area features homes, perhaps densities of apartments or houses,えっ seeing suggestions of orderly streets or suburban layout.
Sample 49683:
Annotation: On the left-central side of the scene, an irregularly shaped forested area занимает nearly one-fifth of the view. Its boundary extends slightly beyond the frame, suggesting a continuous stretch of deciduous broadleaved woodland, perhaps featuring a lush canopy of mature trees.
Sample 477057:
Annotation: At the base of the scene, a curiously shaped complex, known as Hilton High School, commands approximately a fifth of the view, its-mdash; perhaps featuring single-family homes—encircling the property.”
Sample 135674:
Annotation: A winding stream, intermittent but present in the scene, meanders from the top-left corner down to the bottom-right corner. Following a northwest-t

Unnamed: 0,ID,PATCH,ANNOTATION,NUM_ELEMS,ANNOTATOR,PROMPT,CREATED_AT,is_valid
51252,66429,4643828,"Centered in the scene, a large rectangular zone occupies over half the view, known as Sherwood Oaks, and tagged for residential land use. This area features homes, perhaps densities of apartments or houses,えっ seeing suggestions of orderly streets or suburban layout.",1,3,1,2024-12-22 02:58:55,False
49682,64859,10074095,"On the left-central side of the scene, an irregularly shaped forested area занимает nearly one-fifth of the view. Its boundary extends slightly beyond the frame, suggesting a continuous stretch of deciduous broadleaved woodland, perhaps featuring a lush canopy of mature trees.",1,3,1,2024-12-22 02:45:12,False
477056,492233,14633359,"At the base of the scene, a curiously shaped complex, known as Hilton High School, commands approximately a fifth of the view, its-mdash; perhaps featuring single-family homes—encircling the property.”",1,3,1,2024-12-24 18:05:47,False
135673,150850,7602889,"A winding stream, intermittent but present in the scene, meanders from the top-left corner down to the bottom-right corner. Following a northwest-to-southeast course, the streamjąc about 313 meters across the image, likely carving through fields and rural landscapes.",1,3,5,2024-12-22 15:43:25,False
55054,70231,12433206,"At the center of the scene, a vast, rectangular expanse occupies over three-quarters of the view, identified as El Paso International Airport. This major aerodrome, at an elevation of 1202 meters, serves El Paso, Texas, with the IATA code ELP and ICAO code KELP. Operated by the City of El Paso, it sprawls across Convair Road, featuring two runways and possibly several terminals or hangars. The airport's name is prominently displayed in English and simplified Chinese: 埃尔帕索国际机场.",1,3,1,2024-12-22 03:34:00,False
481442,496619,521411,"At the left-center of the scene, the irregularly shaped Lake George spans across roughly one-sixth of the view. Renowned as Lac du Saint-Sacrement in French and Kaniá:taro’kte in Mohawk, this expansive water body is the focus, with its boundary suggesting parts extending beyond the frame. Its natural state and likely tranquil setting are indicated, evoking images of a scenic and culturally significant locale.",1,4,1,2024-12-24 18:44:26,False
268810,283987,10703241,"At the right-center of the scene, a rectangular area occupying roughly one-fourth of the view appears distinctly—the South Shore Wastewater Treatment Plant. Managed by the Milwaukee Metropolitan Sewerage District, this man-made facility is designed for wastewater treatment, standing out against the surrounding landscape due to its precise rectangular shape and expansive footprint.",1,4,1,2024-12-23 11:28:37,False
68549,83726,12784050,"At the base of the scene, an irregularly shaped expanse of tracked railway stretches across nearly a fifth of the view, específicosy defined by its meandering border that extends beyond the frame, suggesting a complex network of train tracks, stations, and supporting structures.",1,3,1,2024-12-22 05:34:06,False
131683,146860,1966388,"A two-lane motorway, designated as the Dr. William Wood Highway and part of the National Highway System, runs eastward across the top of the image. Stretching from the top-center edge to the right-top_right edge, this modern highway spans approximately 165 meters and is equipped with a maximum speed limit of 65 mph. The road is designated for heavy goods vehicle traffic, functioning as part of the national network for trucks in the United States.",1,4,5,2024-12-22 15:09:26,False
205388,220565,6960227,"Stretching diagonally across more than half the scene, an expansive, irregular park landscape dominates the right-center view. Known internationally as Central Park, or variations like 中央公园 (in Chinese), this recreational sanctuary is designed by renowned architects Frederick Law Olmstead and Calvert Vaux. Marked as a US National Historic Site since 1966, the park is famed for its smoke-free regulations and managed by the Central Park Conservancy. Open hours extend from 6:00 AM to 1:00 AM, inviting visitors to enjoy its historic grounds and tranquil settings.",1,4,1,2024-12-23 02:09:04,False


In [4]:
print(f"Total invalid annotations: {len(annotation_df[annotation_df['is_valid']==False])}")

Total invalid annotations: 15312


## Update Databases

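**Cautions**:
- The update deletes rows in place and is not reversible; back up `annotation.db` and `metadata.db` before running it.
- SQLite allows only one writer per database file at a time, so the update can fail with a `database is locked` error if another process is writing to the same file.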
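One way to take a backup before the destructive step is the online backup API in Python's `sqlite3` module, sketched below on a throwaway database. The helper name `backup_database`, the backup file name, and the 30-second `timeout` are assumptions for illustration; the notebook itself does not include a backup step.

```python
import os
import sqlite3
import tempfile


def backup_database(src_path, dst_path):
    """Hypothetical helper: copy an SQLite database with Connection.backup."""
    # timeout: how long to wait for a lock held by another connection to clear
    src = sqlite3.connect(src_path, timeout=30)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


# Demonstration on a temporary database with made-up contents.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "annotation.db")
dst_path = os.path.join(workdir, "annotation.backup.db")

conn = sqlite3.connect(src_path)
conn.execute("CREATE TABLE annotation (ID INTEGER PRIMARY KEY, ANNOTATION TEXT)")
conn.execute("INSERT INTO annotation (ANNOTATION) VALUES (?)", ("example caption",))
conn.commit()
conn.close()

backup_database(src_path, dst_path)

check = sqlite3.connect(dst_path)
backup_rows = check.execute("SELECT ID, ANNOTATION FROM annotation").fetchall()
check.close()
print(backup_rows)
```

Passing a `timeout` to `sqlite3.connect` makes a connection wait for a competing lock instead of failing immediately, which may reduce `database is locked` errors during the update as well.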

In [5]:
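# Load the annotation_osm table, which links annotations to OSM elements
annotation_osm_table = pd.read_sql("SELECT * FROM annotation_osm", conn_annotation)
# Join invalid annotations (annotation.ID) to their link rows (annotation_osm.ANNOTATION);
# overlapping column names get the suffixes _x (annotation) and _y (annotation_osm)
merged_anno_df = annotation_df[annotation_df['is_valid']==False].merge(annotation_osm_table[['ANNOTATION', 'ID', 'OSM_ID']], left_on='ID', right_on='ANNOTATION')
merged_anno_df.head(1)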

In [8]:
c_rejected = conn_rejected_annotation.cursor() if conn_rejected_annotation else None
c_metadata = conn_metadata.cursor()
c_annotation = conn_annotation.cursor()

for k, row in tqdm(merged_anno_df.iterrows(), total=len(merged_anno_df), desc='Washing annotations'):
    # Plain ints, so the values can be bound as SQL parameters
    patch_id = int(row['PATCH'])
    anno_id = int(row['ID_x'])
    anno_osm_id = int(row['ID_y'])

    if c_rejected:
        annotation, num_elements, annotator, prompt, created_at = \
            row[['ANNOTATION_x', 'NUM_ELEMS', 'ANNOTATOR', 'PROMPT', 'CREATED_AT']].values.tolist()
        # copy the rejected annotation to the rejected annotation database (table: annotation)
        c_rejected.execute("INSERT INTO annotation (PATCH, ANNOTATION, NUM_ELEMS, ANNOTATOR, PROMPT, CREATED_AT) VALUES (?,?,?,?,?,?)", (patch_id, annotation, num_elements, annotator, prompt, created_at))
        # ID of the inserted copy; kept separate from anno_id, which the DELETE below still needs
        rejected_anno_id = c_rejected.lastrowid

        # copy the rejected annotation_osm to the rejected annotation_osm database (table: annotation_osm)
        osm_id = row['OSM_ID']
        c_rejected.execute("INSERT INTO annotation_osm (ANNOTATION, OSM_ID) VALUES (?, ?)", (rejected_anno_id, osm_id))

    # delete the records from annotation database (table: annotation, annotation_osm)
    c_annotation.execute("DELETE FROM annotation WHERE ID = ?", (anno_id,))
    c_annotation.execute("DELETE FROM annotation_osm WHERE ID = ?", (anno_osm_id,))
    # NUM_ANNOTATIONS - 1 in metadata database (table: patch)
    c_metadata.execute("UPDATE patch SET NUM_ANNOTATIONS = NUM_ANNOTATIONS - 1 WHERE ID = ?", (patch_id,))

    # commit in batches of 1000 rows
    if k % 1000 == 0:
        conn_annotation.commit()
        conn_metadata.commit()
        if c_rejected:
            conn_rejected_annotation.commit()

# commit whatever remains after the last batch
conn_annotation.commit()
conn_metadata.commit()
if c_rejected:
    conn_rejected_annotation.commit()

Washing annotations: 100%|██████████| 584/584 [00:00<00:00, 2485.05it/s]
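
After the update, a consistency check between the two databases can catch counters that drifted. The sketch below assumes that `NUM_ANNOTATIONS` in the `patch` table is meant to equal the number of rows in `annotation` for that patch, which the notebook implies but does not state. `find_count_mismatches` is a hypothetical helper, and the in-memory tables contain made-up rows with only the columns needed here.

```python
import sqlite3


def find_count_mismatches(conn_annotation, conn_metadata):
    """Hypothetical check: {patch_id: (stored_count, actual_count)} for inconsistent patches."""
    # Actual number of annotations per patch
    actual = dict(conn_annotation.execute(
        "SELECT PATCH, COUNT(*) FROM annotation GROUP BY PATCH"))
    mismatches = {}
    for patch_id, stored in conn_metadata.execute(
            "SELECT ID, NUM_ANNOTATIONS FROM patch"):
        found = actual.get(patch_id, 0)
        if stored != found:
            mismatches[patch_id] = (stored, found)
    return mismatches


# Demonstration with in-memory databases (assumed minimal schema).
demo_annotation = sqlite3.connect(":memory:")
demo_annotation.execute("CREATE TABLE annotation (ID INTEGER PRIMARY KEY, PATCH INTEGER)")
demo_annotation.executemany("INSERT INTO annotation (PATCH) VALUES (?)", [(1,), (1,), (2,)])

demo_metadata = sqlite3.connect(":memory:")
demo_metadata.execute("CREATE TABLE patch (ID INTEGER PRIMARY KEY, NUM_ANNOTATIONS INTEGER)")
demo_metadata.executemany("INSERT INTO patch (ID, NUM_ANNOTATIONS) VALUES (?, ?)",
                          [(1, 2), (2, 3), (3, 0)])

mismatches = find_count_mismatches(demo_annotation, demo_metadata)
print(mismatches)
```

An empty result would suggest that the decrement step and the deletions stayed in sync; any entry points to a patch whose counter should be reviewed.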
