## LZV .tfrecord data preparation

This Jupyter notebook prepares training.tfrecord and validation.tfrecord from landmark files that were extracted from video files and saved in .npy format. The notebook also normalizes the landmarks.

To run this notebook, you need:

* Provided in the GitHub repo:
    * char_map.json

* Prepared by you:
    * landmark arrays saved in .npy format, located in `data\processed_landmarks\training_landmarks` (the path the configuration cell reads as `TRAINING_LANDMARKS_DIR`)

The MediaPipe documentation describes how to extract MediaPipe Holistic landmarks from video files.

Prepare the data in a structure where the subfolder and file names match the signs shown in the original video. The index in the file name is not used during processing; it is only there to make the files easier to browse.

Example structure:

* `data\processed_landmarks\training_landmarks\d-z-i-m-š-a-n-a-s- -d-i-e-n-a\d-z-i-m-š-a-n-a-s- -d-i-e-n-a_0001.npy`

* `data\processed_landmarks\training_landmarks\k-ā- -t-e-v-i- -s-a-u-c\k-ā- -t-e-v-i- -s-a-u-c_0001.npy`

* `data\processed_landmarks\training_landmarks\g-u-n-t-a\g-u-n-t-a_0001.npy`

* `data\processed_landmarks\training_landmarks\g-u-n-t-a\g-u-n-t-a_0002.npy`

The example file `g-u-n-t-a_0001.npy` is provided.
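
To make the naming convention concrete, here is a minimal sketch of how a file name breaks down into sign tokens and an index. `parse_landmark_filename` is a hypothetical helper written for illustration, not part of the notebook; it assumes the index follows the last underscore and that tokens are separated by hyphens, with a single space standing for a word gap.

```python
import os

def parse_landmark_filename(path):
    """Split '<t1>-<t2>-..._<index>.npy' into (tokens, index)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Assumption: the index follows the last underscore;
    # everything before it is the hyphen-separated phrase.
    sequence, _, index = stem.rpartition("_")
    tokens = sequence.split("-")
    return tokens, index

tokens, index = parse_landmark_filename("g-u-n-t-a_0001.npy")
print(tokens, index)

# A space between two hyphens becomes its own token.
phrase_tokens, _ = parse_landmark_filename("k-ā- -t-e-v-i- -s-a-u-c_0001.npy")
print(phrase_tokens)
```

Every token produced this way has to exist as a key in char_map.json, otherwise the file is skipped when the metadata is built.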

Each landmark has 3 dimensions (x, y, z). The global landmark indexing is:
* 33 pose landmarks (0-32)
* 468 face landmarks (33-500)
* 21 left-hand landmarks (501-521)
* 21 right-hand landmarks (522-542)
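
The index layout above can be checked on a dummy array. This is only a sketch: the clip is filled with zeros, and the `(frames, 543, 3)` layout is the shape the notebook's own checks expect.

```python
import numpy as np

# Global index ranges as listed above (the end index is exclusive in Python slices).
PARTS = {
    "pose": slice(0, 33),
    "face": slice(33, 501),
    "lhand": slice(501, 522),
    "rhand": slice(522, 543),
}

# Dummy clip: 10 frames, 543 landmarks, 3 coordinates (x, y, z).
clip = np.zeros((10, 543, 3), dtype=np.float32)

# Slice the landmark axis to get each body part separately.
part_shapes = {name: clip[:, idx, :].shape for name, idx in PARTS.items()}
for name, shape in part_shapes.items():
    print(name, shape)
```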

The result is training.tfrecord and validation.tfrecord, plus a metadata file (which should be overwritten with the information about your own prepared data).

---

### 1. Configuration

Required files and parameters. Set your preferred training/validation split here; it is currently 80 training : 20 validation.
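
To see what a given split fraction means for a small dataset, the sample counts can be estimated ahead of time. This is a rough sketch: `split_sizes` is a hypothetical helper, and it assumes the validation count is rounded up, which is how scikit-learn is understood to treat a fractional `test_size`.

```python
import math

def split_sizes(n_samples, validation_fraction=0.20):
    """Return (n_train, n_val) for a fractional validation split."""
    # Assumption: the validation count is rounded up, the rest goes to training.
    n_val = math.ceil(n_samples * validation_fraction)
    return n_samples - n_val, n_val

print(split_sizes(100))
print(split_sizes(7))
```

With very few samples the validation share can differ noticeably from the requested fraction, which is worth keeping in mind when choosing `VALIDATION_SPLIT_SIZE`.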

In [18]:
import os
import sys
import json
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm
from sklearn.model_selection import train_test_split

try:
    BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), "../"))

    DATA_DIR = os.path.join(BASE_DIR, "data")
    CHAR_MAP_FILE = os.path.join(DATA_DIR, "processed_landmarks", "char_map.json")
    DATASET_INFO_FILE = os.path.join(DATA_DIR, "processed_landmarks", "dataset_info.json")
    TRAINING_LANDMARKS_DIR = os.path.join(DATA_DIR, "processed_landmarks", "training_landmarks")

    # Landmark configuration
    LANDMARK_DIMS = 3
    POSE_IDXS = list(range(0, 33))
    FACE_IDXS = list(range(33, 501))
    LHAND_IDXS = list(range(501, 522))
    RHAND_IDXS = list(range(522, 543))
    ALL_LANDMARKS = POSE_IDXS + FACE_IDXS + LHAND_IDXS + RHAND_IDXS
    NUM_NODES = len(ALL_LANDMARKS)
    print("Landmark count:", NUM_NODES)

    VALIDATION_SPLIT_SIZE = 0.20 # Choose your own validation share; currently 80 training : 20 validation
    RANDOM_STATE = 42

    # Char map
    with open(CHAR_MAP_FILE, 'r', encoding='utf-8') as f:
        char_map = json.load(f)
        id_to_char = {int(v): k for k, v in char_map.items()}

    inverse_char_map = {}
    try:
        inverse_char_map = {v: k for k, v in char_map.items()}
    except Exception as e:
        print(f"Error: {e}")
    
    print("Char map size:", len(char_map))

except Exception as e:
    print(f"Error: {e}")
    sys.exit(1)

Landmark count: 543
Char map size: 45


### 2. Helper functions

Helper functions for preprocessing: metadata collection, per-part statistics, normalization, and TFRecord writing.

In [25]:
# Metadata creation function
def create_metadata(DATA_DIR):
    samples = []

    for root, _, files in os.walk(DATA_DIR):
        for fname in files:
            if fname.endswith(".npy"):
                file_id = fname[:-4] # Expected name structure, e.g. "5- -f-ū-r-e-s_0001"
                parts = file_id.split('_', 1)
                sequence_str = parts[0]
                signs = sequence_str.split("-")

                valid_sequence = []
                valid = True
                for sign in signs:
                    if sign in char_map:
                        valid_sequence.append(sign)
                    else:
                        print(f"    Warning: Sign '{sign}' from file '{fname}' (sequence: '{sequence_str}') not found in char_map.json. Skipping file.")
                        valid = False
                        break

                if valid:
                    samples.append({
                        "file_id": file_id,
                        "path": os.path.join(root, fname),
                        "phrase": valid_sequence # Store the signs as a list
                    })
    if not samples:
        return pd.DataFrame()
    return pd.DataFrame(samples)

# Computes per-part landmark statistics (mean and standard deviation)
def calculate_stats(df):
    pose_landmarks_all, face_landmarks_all, lhand_landmarks_all, rhand_landmarks_all = [], [], [], []

    for _, row in tqdm(df.iterrows(), total=len(df), desc="Loading landmarks for stats"):
        try:
            landmarks = np.load(row['path'])
            if landmarks.ndim != 3 or landmarks.shape[1] != NUM_NODES or landmarks.shape[2] != LANDMARK_DIMS:
                print(f"\nWarning: Skipping file {row['path']} due to unexpected shape: {landmarks.shape}. Expected: (frames, {NUM_NODES}, {LANDMARK_DIMS})")
                continue

            pose = landmarks[:, POSE_IDXS, :]
            face = landmarks[:, FACE_IDXS, :]
            lhand = landmarks[:, LHAND_IDXS, :]
            rhand = landmarks[:, RHAND_IDXS, :]

            def filter_and_clean(data):
                if data.shape[0] == 0: return data
                valid_frames_mask = (~np.isnan(data).all(axis=(1,2))) & ((data != 0).any(axis=(1,2)))
                filtered_data = data[valid_frames_mask]
                cleaned_data = np.nan_to_num(filtered_data)
                return cleaned_data

            pose_clean = filter_and_clean(pose)
            face_clean = filter_and_clean(face)
            lhand_clean = filter_and_clean(lhand)
            rhand_clean = filter_and_clean(rhand)

            if pose_clean.shape[0] > 0: pose_landmarks_all.append(pose_clean)
            if face_clean.shape[0] > 0: face_landmarks_all.append(face_clean)
            if lhand_clean.shape[0] > 0: lhand_landmarks_all.append(lhand_clean)
            if rhand_clean.shape[0] > 0: rhand_landmarks_all.append(rhand_clean)

        except Exception as e:
            print(f"\nError: {e}")

    stats = {}
    EPSILON = 1e-8 # Avoids division-by-zero errors
    calculated_something = False

    for part_name, part_data_list, part_indices in [
        ('pose', pose_landmarks_all, POSE_IDXS),
        ('face', face_landmarks_all, FACE_IDXS),
        ('lhand', lhand_landmarks_all, LHAND_IDXS),
        ('rhand', rhand_landmarks_all, RHAND_IDXS)
    ]:
        expected_shape = (len(part_indices), LANDMARK_DIMS)
        if part_data_list:
            all_part_data = np.concatenate(part_data_list, axis=0)
            mean_val = all_part_data.mean(axis=0)
            std_val = all_part_data.std(axis=0)
            
            if mean_val.shape != expected_shape or std_val.shape != expected_shape:
                print(f"   Warning: Shape mismatch for {part_name}. Mean: {mean_val.shape}, Std: {std_val.shape}, Expected: {expected_shape}")
            stats[f'{part_name}_mean'] = mean_val
            stats[f'{part_name}_std'] = std_val + EPSILON
            calculated_something = True
        else:
            print(f"   Warning: No valid data found for {part_name}. Using default stats (mean=0, std=1).")
            stats[f'{part_name}_mean'] = np.zeros(expected_shape)
            stats[f'{part_name}_std'] = np.ones(expected_shape)

    return stats

def normalize_landmarks(landmarks, stats_dict):
    normalized = np.zeros_like(landmarks, dtype=np.float32)
    if not stats_dict:
        print("Warning: Stats dictionary is empty, returning unnormalized data.")
        return landmarks.astype(np.float32)

    try:
        normalized[:, POSE_IDXS, :] = (landmarks[:, POSE_IDXS, :] - stats_dict['pose_mean'][np.newaxis, :, :]) / stats_dict['pose_std'][np.newaxis, :, :]
        normalized[:, FACE_IDXS, :] = (landmarks[:, FACE_IDXS, :] - stats_dict['face_mean'][np.newaxis, :, :]) / stats_dict['face_std'][np.newaxis, :, :]
        normalized[:, LHAND_IDXS, :] = (landmarks[:, LHAND_IDXS, :] - stats_dict['lhand_mean'][np.newaxis, :, :]) / stats_dict['lhand_std'][np.newaxis, :, :]
        normalized[:, RHAND_IDXS, :] = (landmarks[:, RHAND_IDXS, :] - stats_dict['rhand_mean'][np.newaxis, :, :]) / stats_dict['rhand_std'][np.newaxis, :, :]
    except Exception as e:
        print(f"Error: {e}.")
        return landmarks.astype(np.float32)

    normalized = np.nan_to_num(normalized)
    normalized = np.clip(normalized, np.finfo(np.float32).min / 100, np.finfo(np.float32).max / 100) 
    return normalized.astype(np.float32)

# Helper functions for TFRecord entries
def float_feature(value):
    # Stores the float32 array as raw bytes (a BytesList), so readers must decode and reshape it
    if not isinstance(value, np.ndarray): value = np.array(value, dtype=np.float32)
    if value.dtype != np.float32: value = value.astype(np.float32)
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.tobytes()]))

def int_feature(value):
    if not isinstance(value, list): value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


# Creates the TFRecord files
def create_tfrecords(df, output_path, stats_dict, normalize=True):
    writer = tf.io.TFRecordWriter(output_path)
    print(f"\nCreating TFRecord file: {output_path}")
    skipped_count = 0
    processed_count = 0
    for _, row in tqdm(df.iterrows(), total=len(df), desc=f"Writing {os.path.basename(output_path)}"):
        try:
            landmarks = np.load(row['path'])
            if landmarks.ndim != 3 or landmarks.shape[1] != NUM_NODES or landmarks.shape[2] != LANDMARK_DIMS:
                print(f"\nWarning: Skipping file {row['path']} due to unexpected shape: {landmarks.shape}. Expected (frames, {NUM_NODES}, {LANDMARK_DIMS})")
                skipped_count += 1
                continue

            landmarks_to_save = landmarks.astype(np.float32)

            if normalize:
                landmarks_to_save = normalize_landmarks(landmarks_to_save, stats_dict)

            if np.isnan(landmarks_to_save).any(): # Check for NaN values
                print(f"\nWarning: NaNs found in data for {row['path']} after processing! Skipping record.")
                skipped_count += 1
                continue

            phrase_list = row['phrase']
            phrase_indices = [char_map[token] for token in phrase_list]

            # Record structure
            feature = {
                'landmarks': float_feature(landmarks_to_save),
                'phrase': int_feature(phrase_indices),
                'length': int_feature(landmarks_to_save.shape[0]),
                'phrase_length': int_feature(len(phrase_indices)),
            }

            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
            processed_count += 1

        except Exception as e:
            print(f"\nError: {type(e).__name__} - {e}")
            skipped_count += 1

    writer.close()
    print(f"Finished creating TFRecord: {output_path}")
    print(f"  Processed: {processed_count} records")
    print(f"  Skipped: {skipped_count} records")
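
The normalization that `normalize_landmarks` applies per body part is a z-score along the frame axis. The toy example below is a sketch of that idea on a made-up array of 4 frames and 2 landmarks; the values are illustrative only and unrelated to real landmark data.

```python
import numpy as np

EPSILON = 1e-8  # same kind of guard against division by zero as in calculate_stats

# Toy data: 4 frames, 2 landmarks, 3 coordinates.
frames = np.array(
    [[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
     [[2.0, 0.0, 0.0], [1.0, 4.0, 3.0]],
     [[4.0, 0.0, 0.0], [1.0, 6.0, 3.0]],
     [[6.0, 0.0, 0.0], [1.0, 8.0, 3.0]]],
    dtype=np.float32,
)

# Statistics are taken over the frame axis: one value per landmark coordinate.
mean = frames.mean(axis=0)          # shape (2, 3)
std = frames.std(axis=0) + EPSILON  # shape (2, 3)

# Broadcasting over the frame axis normalizes every frame with the same statistics.
normalized = (frames - mean[np.newaxis]) / std[np.newaxis]

print(normalized.mean(axis=0).round(6))
print(normalized.std(axis=0).round(6))
```

Coordinates that never change have a standard deviation of zero; thanks to the epsilon they normalize to 0 instead of producing NaN or infinity.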

### 3. Main processing

Runs the data processing and saves training.tfrecord and validation.tfrecord.

In [None]:
initial_train_df = create_metadata(TRAINING_LANDMARKS_DIR)
print(initial_train_df)
training_stats_for_test = {}

if not initial_train_df.empty:
    print(f"Initial training data: {len(initial_train_df)} samples.")
    
    actual_train_df = pd.DataFrame()
    val_df = pd.DataFrame()

    if len(initial_train_df) * VALIDATION_SPLIT_SIZE < 1 or len(initial_train_df) < 2: # Fewer than 2 samples, or too few to form a validation split
        print("Warning: Dataset too small for validation split or less than 2 samples. Using all data for training.")
        actual_train_df = initial_train_df.copy()
        # In this case there will be no validation data
    else:
        actual_train_df, val_df = train_test_split(
            initial_train_df,
            test_size=VALIDATION_SPLIT_SIZE,
            random_state=RANDOM_STATE,
            stratify=None
        )
        print(f"Split complete: {len(actual_train_df)} training samples, {len(val_df)} validation samples.")

    # Compute the statistics on the training data only
    if not actual_train_df.empty:
        training_stats = calculate_stats(actual_train_df)
        training_stats_for_test = training_stats

        create_tfrecords(actual_train_df, os.path.join(DATA_DIR, "processed_landmarks", "training.tfrecord"), training_stats, normalize=True)
        
        if not val_df.empty:
            create_tfrecords(val_df, os.path.join(DATA_DIR, "processed_landmarks", "validation.tfrecord"), training_stats, normalize=True)
        else:
            print("No validation data to process into TFRecord.")
    else:
        print("No training samples after split (or initial dataset was too small). Skipping stats calculation and TFRecord creation.")

else:
    print("No training data found or processed. Skipping training and validation TFRecord creation.")

### 4. Creating the metadata file

Creates or overwrites the metadata file `dataset_info.json`.

In [None]:
import pandas as pd
import json

# Save a new char_map (UTF-8 to preserve diacritics)
# with open(CHAR_MAP_FILE, "w", encoding='utf-8') as f:
#     json.dump(char_map, f, ensure_ascii=False, indent=4)

# Metadata to be saved
dataset_info = {
    "dataset_name": "LZV",
    "data files": {
        "train_tfrecord": "training.tfrecord",
        "validation_tfrecord": "validation.tfrecord",
        "char_map": "char_map.json",
        "training_stats_dir": "training_stats"
    },
    "num_classes": len(char_map),
    "num_classes_with_blank": len(char_map)+1,
    "dataset_stats": {
        "train_samples": len(actual_train_df) if not actual_train_df.empty else 0,
        "val_samples": len(val_df) if not val_df.empty else 0,
        "total_samples": (len(actual_train_df) if not actual_train_df.empty else 0) + (len(val_df) if not val_df.empty else 0)
    },
    "landmark_info": {
        "source_tool": "MediaPipe Holistic",
        "num_landmarks": NUM_NODES,
        "num_coordinates": LANDMARK_DIMS,
        "input_shape": [None, NUM_NODES, LANDMARK_DIMS],
        "landmark_components": {
            "pose": [POSE_IDXS[0], POSE_IDXS[-1]],
            "face": [FACE_IDXS[0], FACE_IDXS[-1]],
            "lhand": [LHAND_IDXS[0], LHAND_IDXS[-1]],
            "rhand": [RHAND_IDXS[0], RHAND_IDXS[-1]]
        }
    }
}

# Save (UTF-8 to preserve diacritics)
with open(DATASET_INFO_FILE, "w", encoding='utf-8') as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=4)
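
The metadata names a `training_stats` directory, but the cells shown here do not write the statistics to disk. One possible way to persist them is sketched below. `save_stats` and `load_stats` are hypothetical helpers, the one-file-per-key layout is an assumption, and the stand-in dictionary only imitates the shapes `calculate_stats` returns for the hands.

```python
import os
import tempfile
import numpy as np

def save_stats(stats, directory):
    """Write each statistics array to '<directory>/<key>.npy'."""
    os.makedirs(directory, exist_ok=True)
    for key, value in stats.items():
        np.save(os.path.join(directory, f"{key}.npy"), value)

def load_stats(directory):
    """Read every '.npy' file in the directory back into a dict of arrays."""
    return {
        os.path.splitext(name)[0]: np.load(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.endswith(".npy")
    }

# Stand-in statistics (assumed shapes: 21 hand landmarks x 3 coordinates).
example_stats = {
    "lhand_mean": np.zeros((21, 3)),
    "lhand_std": np.ones((21, 3)),
}

# A temporary directory keeps the sketch from touching real project files.
with tempfile.TemporaryDirectory() as tmp:
    stats_dir = os.path.join(tmp, "training_stats")
    save_stats(example_stats, stats_dir)
    restored = load_stats(stats_dir)

print(sorted(restored))
```

Keeping the training statistics next to the TFRecord files lets new data be normalized later with the same mean and standard deviation.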

### 5. Verification

An additional data check that prints the structure of a few records. The number of records printed can be changed.
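
The check relies on the fact that landmarks are stored as raw float32 bytes, so the frame count saved in `length` is needed to restore the array shape. The sketch below reproduces that round trip with NumPy alone; `np.frombuffer` stands in for `tf.io.decode_raw`, and the clip is made-up data.

```python
import numpy as np

NUM_NODES, LANDMARK_DIMS = 543, 3

# Stand-in clip with 5 frames.
original = np.arange(5 * NUM_NODES * LANDMARK_DIMS, dtype=np.float32)
original = original.reshape(5, NUM_NODES, LANDMARK_DIMS)

# Writing: the array is flattened to raw float32 bytes, so its shape is lost.
payload = original.tobytes()
length = original.shape[0]  # stored separately in the record

# Reading: decode the bytes, check the element count, then restore the shape.
flat = np.frombuffer(payload, dtype=np.float32)
size_matches = flat.size == length * NUM_NODES * LANDMARK_DIMS
restored = flat.reshape(length, NUM_NODES, LANDMARK_DIMS)

print(size_matches, restored.shape, np.array_equal(original, restored))
```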

In [27]:
import tensorflow as tf

TFRECORD_PATH = "../data/processed_landmarks/training.tfrecord" # Expected path to the TFRecord file
NUM_EXAMPLES = 30 # Number of records to print
print(f"Verifying: {TFRECORD_PATH}")
try:
    raw_dataset = tf.data.TFRecordDataset(TFRECORD_PATH)

    print(f"\n--- Verifying first {NUM_EXAMPLES} examples ---")
    for i, raw_record in enumerate(raw_dataset.take(NUM_EXAMPLES)):
        print(f"\n--- Example {i+1} ---")
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())

        landmarks_bytes = example.features.feature['landmarks'].bytes_list.value[0]
        landmarks_flat = tf.io.decode_raw(landmarks_bytes, tf.float32)
        length = example.features.feature['length'].int64_list.value[0]
        phrase = example.features.feature['phrase'].int64_list.value
        phrase_length = example.features.feature['phrase_length'].int64_list.value[0]

        expected_size = length * NUM_NODES * LANDMARK_DIMS

        if tf.size(landmarks_flat).numpy() != expected_size:
            print("ERROR: Decoded size mismatch!")
            print(f"  Decoded flat tensor size: {tf.size(landmarks_flat).numpy()}")
            print(f"  Expected size (length * NUM_NODES * LANDMARK_DIMS): {length} * {NUM_NODES} * {LANDMARK_DIMS} = {expected_size}")
            landmarks_shape = "Error - Size Mismatch"
        else:
            landmarks = tf.reshape(landmarks_flat, (length, NUM_NODES, LANDMARK_DIMS))
            landmarks_shape = landmarks.shape
            if np.isnan(landmarks.numpy()).any():
                print("Warning: NaNs detected in decoded landmarks!")
            if np.isinf(landmarks.numpy()).any():
                print("Warning: Infs detected in decoded landmarks!")


        print("Landmarks shape:", landmarks_shape)
        decoded_phrase = [id_to_char.get(idx, f"Unknown({idx})") for idx in phrase]
        print("Phrase:", decoded_phrase)
        print("Landmark Sequence Length (Frames):", length)
        print("Phrase length:", phrase_length)

    print("\n--- Verification Complete ---")

except Exception as e:
    print(f"\nAn error occurred during TFRecord verification: {e}")

Verifying: ../data/processed_landmarks/training.tfrecord

--- Verifying first 30 examples ---

--- Example 1 ---
Landmarks shape: (39, 543, 3)
Phrase: ['o']
Landmark Sequence Length (Frames): 39
Phrase length: 1

--- Example 2 ---
Landmarks shape: (147, 543, 3)
Phrase: ['9', '8', '9', '8']
Landmark Sequence Length (Frames): 147
Phrase length: 4

--- Example 3 ---
Landmarks shape: (211, 543, 3)
Phrase: ['b', 'r', 'ū', 'c', 'e']
Landmark Sequence Length (Frames): 211
Phrase length: 5

--- Example 4 ---
Landmarks shape: (130, 543, 3)
Phrase: ['o', 'z', 'o', 'n', 's']
Landmark Sequence Length (Frames): 130
Phrase length: 5

--- Example 5 ---
Landmarks shape: (247, 543, 3)
Phrase: ['š', 'o', 'k', 'ē', 'j', 'o', 't']
Landmark Sequence Length (Frames): 247
Phrase length: 7

--- Example 6 ---
Landmarks shape: (7, 543, 3)
Phrase: ['o']
Landmark Sequence Length (Frames): 7
Phrase length: 1

--- Example 7 ---
Landmarks shape: (230, 543, 3)
Phrase: ['ū', 'd', 'e', 'n', 's']
Landmark Sequence Length