# Create Artificial Pretraining Data for Testing Purposes

* This notebook runs the following pipeline:
    1. Read the largest of the labeled California data samples (`/geosatlearn_app/ml_models/sits-bert/datafiles/california-labeled/Test.csv`).
    2. Remove the labels from this sample so that it can be used as a pretraining dataset.
    3. Save the processed data in a new folder (`/geosatlearn_app/ml_models/sits-bert/datafiles/california-excluded-labels/Test.csv`) for use in pretraining experiments.
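
The three steps above can also be sketched as a single streaming function, which avoids holding every line in memory. This is only a minimal sketch: `strip_labels` is a hypothetical helper name that does not appear in the notebook, and it assumes, as the cells below do, that the class label is the last comma-separated field of each line.

```python
import os


def strip_labels(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without the last comma-separated field.

    Hypothetical helper (not part of the notebook). Returns the number of
    lines written.
    """
    # Create the destination folder if needed (no-op when it already exists).
    dst_dir = os.path.dirname(dst_path)
    if dst_dir:
        os.makedirs(dst_dir, exist_ok=True)

    count = 0
    with open(src_path, "r") as ifile, open(dst_path, "w") as ofile:
        for line in ifile:
            # Same transformation as the notebook: strip, split, drop last field.
            fields = line.strip().split(",")[:-1]
            ofile.write(",".join(fields) + "\n")
            count += 1
    return count
```

Calling `strip_labels(file_path, output_file_path)` with the two paths listed above should produce the same output file as the cells that follow, one line at a time.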

## Initial Setup

In [9]:
import os

from tqdm.auto import tqdm

## Read from Labeled Dataset

In [10]:
file_path = "/geosatlearn_app/ml_models/sits-bert/datafiles/california-labeled/Test.csv" 

# Read into memory.
with open(file_path, "r") as ifile:
    data: list[str] = ifile.readlines()
    ts_num: int = len(data)
    print(f">>> Loading data successful ... {ts_num} lines read.")

>>> Loading data successful ... 318588 lines read.


## Drop Labels From Dataset

In [11]:
# All processed lines will be stored here.
processed_data: list[str] = []

# Loop over all the instances in the data.
for line in tqdm(data, desc="Processing data ..."):
    
    # Remove surrounding whitespace, including the trailing `\n`.
    line_processed: str = line.strip()

    # Eliminate the class label (the last comma-separated field) from the line.
    line_processed_list: list[str] = line_processed.split(",")[:-1]
    
    # Join the line back together.
    line_processed = ",".join(line_processed_list)

    # Fill the processed data list.
    processed_data.append(line_processed)

Processing data ...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 318588/318588 [00:09<00:00, 32668.30it/s]
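
To make the effect of the loop concrete, here is the same transformation applied to a single line. The sample line is made up for illustration and is much shorter than a real row of `Test.csv`; the only assumption carried over from the notebook is that the label is the last field.

```python
# Made-up line for illustration; real rows in Test.csv are much longer.
sample_line = "0.12,0.34,0.56,3\n"

# Same steps as the loop above: strip, split, drop the last field, re-join.
fields = sample_line.strip().split(",")
unlabeled = ",".join(fields[:-1])
label = fields[-1]

print(unlabeled)  # 0.12,0.34,0.56
print(label)      # 3

# For lines containing at least one comma, a single split from the right
# gives the same result without splitting every field.
assert sample_line.strip().rsplit(",", 1)[0] == unlabeled
```

The two forms differ only for a line with no comma at all: the slice yields an empty string, while `rsplit(",", 1)[0]` returns the line unchanged.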


## Save Final Results

In [12]:
# Output file path.
output_file_path = "/geosatlearn_app/ml_models/sits-bert/datafiles/california-excluded-labels/Test.csv"

# Create the output directory if it does not exist.
output_dir = os.path.dirname(output_file_path)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f">>> Created output directory: {output_dir}")

# Save results to a new csv file.
with open(output_file_path, "w") as ofile:
    for line in tqdm(processed_data, desc="Saving processed data ..."):
        ofile.write(f"{line}\n")

# Show the output file path.
print(f">>> Processed data saved to {output_file_path}.")

# Print the number of processed lines.
print(f">>> Total number of processed lines: {len(processed_data)}")

Saving processed data ...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 318588/318588 [00:01<00:00, 281225.41it/s]


>>> Processed data saved to /geosatlearn_app/ml_models/sits-bert/datafiles/california-excluded-labels/Test.csv.
>>> Total number of processed lines: 318588
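
As an optional sanity check, the saved file can be compared against the labeled source line by line. The sketch below is not part of the notebook; `check_labels_removed` is a hypothetical name, and the check assumes the output was written exactly as in the cell above (each processed line followed by `\n`).

```python
from itertools import zip_longest


def check_labels_removed(labeled_path: str, unlabeled_path: str) -> int:
    """Verify that each output line equals the input line minus its last field.

    Hypothetical helper. Returns the number of verified lines and raises
    ValueError on the first mismatch.
    """
    checked = 0
    with open(labeled_path, "r") as lfile, open(unlabeled_path, "r") as ufile:
        for labeled, unlabeled in zip_longest(lfile, ufile):
            # zip_longest pads the shorter file with None.
            if labeled is None or unlabeled is None:
                raise ValueError("Files differ in number of lines.")
            expected = ",".join(labeled.strip().split(",")[:-1])
            if unlabeled.rstrip("\n") != expected:
                raise ValueError(f"Mismatch at line {checked + 1}.")
            checked += 1
    return checked
```

Run as `check_labels_removed(file_path, output_file_path)`, it should return the same line count that the notebook prints if the two files are consistent.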
