# Wavelet Transform + CNN Feature Extraction

This notebook implements wavelet-based feature extraction for seismic signals combined with CNN processing. The main components are:

## Key Components

1. **Wavelet Decomposition**
   - Uses db4 wavelet with 4 levels
   - Extracts statistical features from coefficients
   - Handles multiple signal scales

2. **Feature Extraction**
   - Statistical measures (mean, std, skewness, etc.)
   - Signal characteristics (entropy, norms)
   - Wavelet coefficient analysis

3. **Data Processing**
   - Feature extraction for original and augmented training data
   - Validation split handling
   - Test set preparation
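
As a rough illustration of the decomposition in item 1, the sketch below estimates how many coefficients each level holds for a signal of a given length. It assumes PyWavelets' default `symmetric` padding and the 8-tap `db4` filter; the helper names `dwt_coeff_len` and `wavedec_lengths` and the 1000-sample length are hypothetical, not part of the notebook.

```python
def dwt_coeff_len(n, filter_len):
    """Length of one DWT output for an input of length n (symmetric padding assumed)."""
    return (n + filter_len - 1) // 2


def wavedec_lengths(n, filter_len=8, level=4):
    """Lengths of [cA_level, cD_level, ..., cD_1], the order used by pywt.wavedec."""
    detail_lengths = []
    for _ in range(level):
        n = dwt_coeff_len(n, filter_len)
        detail_lengths.append(n)
    # The approximation array has the same length as the deepest detail array.
    return [detail_lengths[-1]] + detail_lengths[::-1]


lengths = wavedec_lengths(1000)  # 1000 samples is a made-up signal length
print(lengths)
```

Each level roughly halves the number of coefficients, which is why the per-level statistics summarize progressively coarser views of the signal.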

## Feature Organization

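Twelve features are computed for each of the five coefficient arrays (one approximation and four detail levels), giving 60 features per signal:
- Mean and standard deviation
- Skewness and kurtosis
- 25th and 75th percentiles
- Maximum and minimum
- L1 and L2 norms
- Entropy of the normalized absolute coefficients
- Median of the absolute coefficients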
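
To make the layout of the feature vector easier to inspect, one option is to generate a name for every position. This is only a sketch: the names in `STAT_NAMES` and the `feature_names` helper are hypothetical labels, chosen to follow the order in which `extract_wavelet_features` appends its statistics and the `[cA4, cD4, cD3, cD2, cD1]` order returned by `pywt.wavedec`.

```python
# Hypothetical labels, in the order the statistics are appended per coefficient array.
STAT_NAMES = [
    "mean", "std", "skew", "kurtosis", "p75", "p25",
    "max", "min", "l1_norm", "l2_norm", "entropy", "median_abs",
]


def feature_names(level=4):
    """Return one name per feature: approximation first, then details from coarse to fine."""
    bands = [f"cA{level}"] + [f"cD{k}" for k in range(level, 0, -1)]
    return [f"{band}_{stat}" for band in bands for stat in STAT_NAMES]


names = feature_names()
print(len(names), names[0], names[-1])
```

With such a list, `pd.DataFrame(X, columns=names)` would give a labeled view of the feature matrix.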


## Setup and Imports

Import required libraries and configure paths:

- **Signal Processing**: numpy, scipy, pywt
- **Data Handling**: pandas, obspy
- **Visualization**: matplotlib
- **File Management and Progress Bars**: os, tqdm

Define paths for:
- Augmented data storage
- Feature extraction outputs
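
The paths share a common data root, so one possible arrangement is to derive them from a single directory. The sketch below is an assumption, not the notebook's setup: `build_paths` is a hypothetical helper and a temporary directory stands in for the real data root.

```python
import tempfile
from pathlib import Path


def build_paths(root):
    """Derive the data and feature directories from one root and create the output folder."""
    root = Path(root)
    paths = {
        "augmented": root / "procesed" / "used_data" / "training_augmented",
        "features": root / "procesed" / "features",
    }
    # exist_ok=True makes the call safe to repeat when the cell is re-run.
    paths["features"].mkdir(parents=True, exist_ok=True)
    return paths


data_root = tempfile.mkdtemp()  # stand-in for the real data directory
paths = build_paths(data_root)
print(paths["features"])
```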

In [None]:
import os
import pandas as pd
import numpy as np
from obspy import read
from tqdm import tqdm
import scipy.stats as stats
import matplotlib.pyplot as plt
import pywt

# Paths
augmented_data_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented'
features_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features'

# Create features directory if it doesn't exist
os.makedirs(features_path, exist_ok=True)

## Feature Extraction Functions

Core functions for processing seismic signals:

1. **extract_wavelet_features**:
   - Performs wavelet decomposition
   - Calculates statistical measures
   - Generates feature vectors

2. **process_seismic_files**:
   - Batch processes MSEED files
   - Extracts features for each signal
   - Maintains arrival time mapping

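3. **match_arrival_times** (defined in a later cell):
   - Matches catalog rows to the MSEED files present in a directory
   - Converts absolute P-wave arrival times to times relative to the trace start
   - Saves the matched file names and arrival times to a CSV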
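
The absolute-to-relative conversion in item 3 is a subtraction of two epoch timestamps. A minimal sketch with made-up values (the start time and the 30.6 s offset are hypothetical, and `relative_arrival` is not a function from the notebook):

```python
from datetime import datetime, timezone

# Hypothetical trace start and absolute P-wave pick, both as epoch seconds.
trace_start = datetime(2020, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
absolute_p_time = trace_start.timestamp() + 30.6


def relative_arrival(absolute_epoch, start_epoch):
    """Seconds between the start of the trace and the P-wave arrival."""
    return absolute_epoch - start_epoch


rel = relative_arrival(absolute_p_time, trace_start.timestamp())
print(f"{rel:.2f}")
```

In the notebook the start epoch comes from `st[0].stats.starttime.timestamp` and the absolute pick from the `lec_p` column.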

In [2]:
def extract_wavelet_features(signal, wavelet='db4', level=4):
    """Extract statistical features from wavelet decomposition of a signal.
    Args:
        signal: Input signal array
        wavelet: Wavelet type to use
        level: Decomposition level
    Returns:
        array: Feature vector containing statistical measures"""
    # Perform wavelet decomposition
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    
    # Initialize feature list
    features = []
    
    # Extract features from each coefficient level
    for coef in coeffs:
        # Statistical features
        features.extend([
            np.mean(coef),           # Mean
            np.std(coef),            # Standard deviation
            stats.skew(coef),        # Skewness
            stats.kurtosis(coef),    # Kurtosis
            np.percentile(coef, 75), # 75th percentile
            np.percentile(coef, 25), # 25th percentile
            np.max(coef),            # Maximum
            np.min(coef),            # Minimum
            np.sum(np.abs(coef)),    # L1 norm
            np.sqrt(np.sum(coef**2)),# L2 norm
            stats.entropy(np.abs(coef)), # Shannon entropy of normalized |coef|
            np.median(np.abs(coef))  # Median of absolute values
        ])
        
    return np.array(features)

def process_seismic_files(data_path, arrival_times_csv):
    """Process all seismic files and extract wavelet features.
    Args:
        data_path: Path to directory containing MSEED files
        arrival_times_csv: Path to CSV with arrival times
    Returns:
        tuple: (features array, arrival times array, file names)"""
    # Read arrival times
    arrivals_df = pd.read_csv(arrival_times_csv)
    
    features_list = []
    arrival_times = []
    file_names = []
    
    print('Extracting wavelet features...')
    for _, row in tqdm(arrivals_df.iterrows(), total=len(arrivals_df)):
        file_path = os.path.join(data_path, row['file'])
        
        try:
            # Read seismic signal
            st = read(file_path)
            signal = st[0].data
            
            # Extract features
            features = extract_wavelet_features(signal)
            features_list.append(features)
            arrival_times.append(row['arrival_time'])
            file_names.append(row['file'])
            
        except Exception as e:
            print(f'Error processing {file_path}: {str(e)}')
            continue
    
    return np.array(features_list), np.array(arrival_times), file_names
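
Two of the twelve statistics are easy to misread, so the sketch below spells them out with the standard library only. It assumes that `stats.entropy` normalizes its input to sum to one and uses the natural logarithm, and it contrasts the median of absolute values (what the extraction code computes) with the median absolute deviation. The helper names and the sample values are made up for illustration.

```python
import math
from statistics import median


def median_abs(values):
    """Median of |x|, the quantity computed by np.median(np.abs(coef))."""
    return median(abs(v) for v in values)


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def shannon_entropy(weights):
    """Entropy (natural log) of non-negative weights after normalizing them to sum to 1."""
    total = sum(weights)
    if total == 0:
        return float("nan")  # an all-zero input has no defined distribution
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


coef = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up coefficients
print(median_abs(coef), mad(coef))
print(shannon_entropy([1, 1, 1, 1]))  # uniform weights give log(4)
```

The two medians differ whenever the coefficients are not centered on zero, and an all-zero coefficient array would yield an undefined entropy, which is worth checking for before training.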




In [None]:

# Process all files
augmented_path = os.path.join(augmented_data_path, 'augmented')
arrival_times_csv = os.path.join(augmented_data_path, 'arrival_times.csv')

X, y, files = process_seismic_files(augmented_path, arrival_times_csv)

# Save features and metadata
np.save(os.path.join(features_path, 'wavelet_features.npy'), X)
np.save(os.path.join(features_path, 'arrival_times.npy'), y)
pd.DataFrame({'file': files}).to_csv(
    os.path.join(features_path, 'feature_files.csv'), index=False)

print(f'Extracted features shape: {X.shape}')
print(f'Number of samples: {len(files)}')

# Update the feature extraction info
print('Features extracted per coefficient level:', 12)
print('Total features for 4 levels:', 12 * (4 + 1))  # 4 detail + 1 approximation

Extracting wavelet features...


  0%|          | 0/4989 [00:00<?, ?it/s]

100%|██████████| 4989/4989 [03:34<00:00, 23.21it/s] 


Extracted features shape: (4989, 60)
Number of samples: 4989
Features extracted per coefficient level: 12
Total features for 4 levels: 60


In [None]:
# process validation files
val_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/val'
val_arrival_times_csv = os.path.join(features_path, 'val_arrival_times.csv')

X, y, files = process_seismic_files(val_path, val_arrival_times_csv)
# Save validation features and metadata
np.save(os.path.join(features_path, 'val_wavelet_features.npy'), X)
np.save(os.path.join(features_path, 'val_arrival_times.npy'), y)
pd.DataFrame({'file': files}).to_csv(
    os.path.join(features_path, 'val_feature_files.csv'), index=False)
print(f'Validation features shape: {X.shape}')
print(f'Number of validation samples: {len(files)}')
print('Validation features extracted per coefficient level:', 12)
print('Total validation features for 4 levels:', 12 * (4 + 1))  # 4 detail + 1 approximation

Extracting wavelet features...


  0%|          | 0/317 [00:00<?, ?it/s]

100%|██████████| 317/317 [00:14<00:00, 22.02it/s]

Validation features shape: (317, 60)
Number of validation samples: 317
Validation features extracted per coefficient level: 12
Total validation features for 4 levels: 60





In [8]:
# process test files
test_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/testing'
test_arrival_times_csv = os.path.join(features_path, 'test_arrival_times.csv')

X, y, files = process_seismic_files(test_path, test_arrival_times_csv)
# Save test features and metadata
np.save(os.path.join(features_path, 'test_wavelet_features.npy'), X)
np.save(os.path.join(features_path, 'test_arrival_times.npy'), y)
pd.DataFrame({'file': files}).to_csv(
    os.path.join(features_path, 'test_feature_files.csv'), index=False)
print(f'Test features shape: {X.shape}')
print(f'Number of test samples: {len(files)}')
print('Test features extracted per coefficient level:', 12)
print('Total test features for 4 levels:', 12 * (4 + 1))  # 4 detail + 1 approximation


Extracting wavelet features...


100%|██████████| 496/496 [00:23<00:00, 21.09it/s]

Test features shape: (496, 60)
Number of test samples: 496
Test features extracted per coefficient level: 12
Total test features for 4 levels: 60





In [None]:
# Process the original (non-augmented) training files
train_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/train'
train_arrival_times_csv = os.path.join(features_path, 'train_arrival_times.csv')
X, y, files = process_seismic_files(train_path, train_arrival_times_csv)

np.save(os.path.join(features_path, 'train_wavelet_features.npy'), X)
np.save(os.path.join(features_path, 'train_arrival_times.npy'), y)
pd.DataFrame({'file': files}).to_csv(
    os.path.join(features_path, 'train_feature_files.csv'), index=False)
print(f'Train features shape: {X.shape}')
print(f'Number of train samples: {len(files)}')
print('Train features extracted per coefficient level:', 12)
print('Total train features for 4 levels:', 12 * (4 + 1))  # 4 detail + 1 approximation

Extracting wavelet features...


100%|██████████| 1663/1663 [01:13<00:00, 22.65it/s] 

Train features shape: (1663, 60)
Number of train samples: 1663
Train features extracted per coefficient level: 12
Total train features for 4 levels: 60





In [None]:
# Process the range-augmented training files
augmented_range_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/augmented_ranges'
train_range_arrival_times_csv = os.path.join(augmented_data_path, 'arrival_times_all.csv')
X, y, files = process_seismic_files(augmented_range_path, train_range_arrival_times_csv)

np.save(os.path.join(features_path, 'train_range_wavelet_features.npy'), X)
np.save(os.path.join(features_path, 'train_range_arrival_times.npy'), y)
pd.DataFrame({'file': files}).to_csv(
    os.path.join(features_path, 'train_range_feature_files.csv'), index=False)
print(f'Train features shape: {X.shape}')
print(f'Number of train samples: {len(files)}')
print('Train features extracted per coefficient level:', 12)
print('Total train features for 4 levels:', 12 * (4 + 1))  # 4 detail + 1 approximation

## Data Processing Pipeline

Process different data splits:

1. **Training Data**
   - Original signals
   - Augmented versions
   - Range-based augmentations

2. **Validation Data**
   - Held-out validation set
   - Feature extraction
   - Arrival time matching

3. **Test Data**
   - Separate test set
   - Arrival time matching
   - Feature extraction
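
Every split is saved under the same three file names with a different prefix, so the naming can be captured in one place. The sketch below is a hypothetical helper (`output_files` is not defined in the notebook) that reproduces the file names used by the processing cells.

```python
import os


def output_files(features_dir, prefix=""):
    """Return the three output paths for a split; an empty prefix is the augmented set."""
    stem = f"{prefix}_" if prefix else ""
    return {
        "features": os.path.join(features_dir, f"{stem}wavelet_features.npy"),
        "targets": os.path.join(features_dir, f"{stem}arrival_times.npy"),
        "files": os.path.join(features_dir, f"{stem}feature_files.csv"),
    }


# "features" is a placeholder directory name for this demonstration.
for prefix in ["", "val", "test", "train", "train_range"]:
    print(output_files("features", prefix)["features"])
```

A loop over such a mapping could replace the near-identical cells for each split, at the cost of less granular re-runs.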

In [3]:
def match_arrival_times(directory_path, csv_path, output_path):
    """
    Extract the arrival times from the CSV that match MSEED files in the directory.
    
    Args:
        directory_path: Path to the directory with MSEED files
        csv_path: Path to the CSV with arrival times
        output_path: Path where the new CSV is saved
    """
    # Read the original CSV
    df = pd.read_csv(csv_path)
    
    # Collect the MSEED files in the directory
    mseed_files = []
    for file in os.listdir(directory_path):
        if file.endswith('.mseed'):
            # Extract the numeric file ID without the extension
            file_number = int(file.replace('.mseed', ''))
            mseed_files.append(file_number)
    
    # Keep only the rows that match files in the directory
    matched_df = df[df['archivo'].isin(mseed_files)].copy()
    
    # Compute relative times
    arrival_times = []
    filenames = []
    
    for _, row in tqdm(matched_df.iterrows(), desc='Procesando archivos'):
        file_id = row['archivo']
        mseed_file = f"{file_id:08d}.mseed"
        file_path = os.path.join(directory_path, mseed_file)
        
        try:
            # Read the signal and compute the arrival time relative to the trace start
            st = read(file_path)
            absolute_p_time = row['lec_p']
            relative_p_time = absolute_p_time - st[0].stats.starttime.timestamp
            
            arrival_times.append(relative_p_time)
            filenames.append(mseed_file)
        except Exception as e:
            print(f"Error procesando {mseed_file}: {str(e)}")
            continue
    
    # Build a new DataFrame with the results
    result_df = pd.DataFrame({
        'file': filenames,
        'arrival_time': arrival_times
    })
    
    # Save the results
    result_df.to_csv(output_path, index=False)
    print(f"Se guardaron {len(result_df)} coincidencias en {output_path}")
    
    return result_df
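
The matching relies on a round trip between file names and integer IDs: the leading zero is lost when the name is parsed and restored by the `:08d` format. A small sketch of that round trip, using a file name that appears in the test output; the two helper names are hypothetical, and the `int()` cast is an added safeguard for IDs that pandas may have read as floats.

```python
def file_id_from_name(name):
    """'04031646.mseed' -> 4031646 (the leading zero is dropped)."""
    return int(name[: -len(".mseed")])


def name_from_file_id(file_id):
    """4031646 -> '04031646.mseed'; int() guards against float IDs such as 4031646.0."""
    return f"{int(file_id):08d}.mseed"


name = "04031646.mseed"
file_id = file_id_from_name(name)
print(file_id, name_from_file_id(file_id))
```

Without the cast, formatting a float with `:08d` raises a `ValueError`, which the `try`/`except` in the loop would report as a processing error for that file.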


In [None]:
# Match arrival times for the test set.
# NOTE: csv_path is not defined in any cell shown here; it must point to the
# catalog CSV that contains the 'archivo' and 'lec_p' columns.
test_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/testing'
output_test_path = os.path.join(features_path, 'test_arrival_times.csv')
matched_test_times = match_arrival_times(test_path, csv_path, output_test_path)
# Show the first rows of the result
print("\nPrimeras 5 coincidencias de test:")
print(matched_test_times.head())

Procesando archivos: 496it [00:17, 28.26it/s]

Se guardaron 496 coincidencias en /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features/test_arrival_times.csv

Primeras 5 coincidencias de test:
             file  arrival_time
0  04031646.mseed          29.6
1  04100031.mseed          30.6
2  04110253.mseed          30.5
3  04150010.mseed          30.4
4  04152328.mseed          30.6





In [5]:
train_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/train'
output_training_path = os.path.join(features_path, 'train_arrival_times.csv')
matched_train_times = match_arrival_times(train_path, csv_path, output_training_path)


Procesando archivos: 1663it [01:21, 20.35it/s]

Se guardaron 1663 coincidencias en /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features/train_arrival_times.csv





In [11]:
# Show example of features for one file
example_idx = 0
print(f'Features for file {files[example_idx]}:')
print('Feature vector length:', len(X[example_idx]))
print('\nFirst 10 features:')
print(X[example_idx][:10])
print(f'\nArrival time: {y[example_idx]:.2f}s')

Features for file 01010056.mseed:
Feature vector length: 60

First 10 features:
[-2.26887150e-03  1.36387755e-01  2.11445763e+00  9.02215459e+01
  1.74969649e-02 -1.97850397e-02  1.76067501e+00 -1.31033293e+00
  1.88985862e+01  2.85151844e+00]

Arrival time: 30.60s
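
The ten values printed above are the first ten statistics of the approximation coefficients, in the order they are appended by `extract_wavelet_features`. The sketch below pairs them with labels; the label strings are hypothetical names, and the numbers are copied from the output above.

```python
# Values copied from the printed output for file 01010056.mseed.
first_ten = [
    -2.26887150e-03, 1.36387755e-01, 2.11445763e+00, 9.02215459e+01,
    1.74969649e-02, -1.97850397e-02, 1.76067501e+00, -1.31033293e+00,
    1.88985862e+01, 2.85151844e+00,
]

# Hypothetical labels for the twelve per-level statistics, in extraction order.
STAT_NAMES = [
    "mean", "std", "skew", "kurtosis", "p75", "p25",
    "max", "min", "l1_norm", "l2_norm", "entropy", "median_abs",
]

# zip stops at the shorter sequence, so only the ten printed values are labeled.
labeled = dict(zip(STAT_NAMES, first_ten))
for stat, value in labeled.items():
    print(f"{stat:>10}: {value: .4f}")
```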


## Test Set Processing

We process the test set using the same feature extraction functions used for the training set.

In [None]:
# Define paths
testing_data_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/testing'
test_csv_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/raw/VT_P_training.csv'

def process_arrival_times(data_path, data_name):
    """Compute relative P-wave arrival times for the MSEED files in data_path.

    Catalog rows whose file cannot be read from data_path are skipped.
    """
    # Read the catalog CSV with absolute arrival times
    test_df = pd.read_csv(test_csv_path)
    
    # Collect rows in a list and build the DataFrame once, instead of
    # concatenating onto an empty DataFrame inside the loop
    rows = []
    
    print('Procesando tiempos de llegada del conjunto de prueba...')
    for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
        file_id = row['archivo']
        mseed_file = f"{file_id:08d}.mseed"
        file_path = os.path.join(data_path, mseed_file)
        
        try:
            # Read the signal and compute the arrival time relative to the trace start
            st = read(file_path)
            absolute_p_time = row['lec_p']
            relative_p_time = absolute_p_time - st[0].stats.starttime.timestamp
            
            rows.append({'file': mseed_file, 'arrival_time': relative_p_time})
            
        except Exception:
            # Skip rows whose file is missing from data_path or cannot be read
            continue
    
    test_times_df = pd.DataFrame(rows, columns=['file', 'arrival_time'])
    
    # Save arrival times (np.save appends the .npy extension)
    test_times_df.to_csv(os.path.join(features_path, f'{data_name}.csv'), index=False)
    np.save(os.path.join(features_path, data_name), test_times_df['arrival_time'].values)
    
    # Report the locations that were actually written
    print(f'Tiempos de llegada guardados en {features_path}/{data_name}.csv')
    print(f'y {features_path}/{data_name}.npy')
    
    return test_times_df

# Process arrival times
test_times_df = process_arrival_times(testing_data_path, 'test_arrival_times')

# Show some examples
#print('\nPrimeros 5 tiempos de llegada:')
# print(test_times_df.head())


val_data_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/val'
val_times_df = process_arrival_times(val_data_path, 'val_arrival_times')
val_times_df.head()

Procesando tiempos de llegada del conjunto de prueba...


  0%|          | 0/2500 [00:00<?, ?it/s]

100%|██████████| 2500/2500 [00:21<00:00, 115.43it/s]

Tiempos de llegada guardados en /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/val/val_arrival_times.csv
y /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features/test_arrival_times.npy





             file  arrival_time
0  04010919.mseed         14.84
1  04020130.mseed         11.45
2  04021826.mseed         30.93
3  04031203.mseed         30.62
4  04040354.mseed         30.26


In [21]:
# Save the names of the MSEED files in a directory to a CSV
def save_file_names(data_path, output_csv):
    """Save the names of the MSEED files under data_path to a CSV."""
    file_names = []
    
    for root, _, files in os.walk(data_path):
        for file in files:
            if file.endswith('.mseed'):
                file_names.append(file)
    
    # Create the DataFrame and save it
    df = pd.DataFrame(file_names, columns=['file_name'])
    df.to_csv(output_csv, index=False)
    print(f'Nombres de archivos guardados en {output_csv}')

# Save the test file names
testing_data_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/testing'

output_csv = os.path.join(features_path, 'testing_file_names.csv')
save_file_names(testing_data_path, output_csv)



Nombres de archivos guardados en /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features/testing_file_names.csv
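
`os.walk` does not guarantee any particular file order, so the saved CSV may differ between machines. If a reproducible listing matters, sorting the names is one option. The sketch below is a hypothetical variant (`list_mseed_files` is not part of the notebook) exercised on a temporary directory with made-up empty files.

```python
import os
import tempfile


def list_mseed_files(data_path):
    """Return the sorted names of all .mseed files found under data_path."""
    names = []
    for _root, _dirs, files in os.walk(data_path):
        names.extend(f for f in files if f.endswith(".mseed"))
    return sorted(names)


with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files; only the .mseed ones should be listed.
    for name in ["04100031.mseed", "04031646.mseed", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_mseed_files(tmp)

print(found)
```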


In [22]:
val_data_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/val'
save_file_names(val_data_path, os.path.join(features_path, 'val_file_names.csv'))

Nombres de archivos guardados en /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features/val_file_names.csv


In [23]:
train_data_path = '/mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/used_data/training_augmented/train'
save_file_names(train_data_path, os.path.join(features_path, 'train_file_names.csv'))

Nombres de archivos guardados en /mnt/c/Users/Usuario/Documents/Studies/GicoProject/SeismicWaves/data/procesed/features/train_file_names.csv


In [14]:
val_times_df.info()
test_times_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317 entries, 0 to 316
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   file          317 non-null    object 
 1   arrival_time  317 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   file          496 non-null    object 
 1   arrival_time  496 non-null    float64
dtypes: float64(1), object(1)
memory usage: 7.9+ KB
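
One possible use of the saved file-name CSVs is to confirm that no file appears in more than one split. The sketch below shows the idea on made-up lists; `find_overlaps` is a hypothetical helper, and in practice the lists would come from `train_file_names.csv`, `val_file_names.csv`, and `testing_file_names.csv`.

```python
def find_overlaps(splits):
    """Return {(split_a, split_b): [shared file names]} for every overlapping pair."""
    split_names = list(splits)
    overlaps = {}
    for i, a in enumerate(split_names):
        for b in split_names[i + 1:]:
            common = set(splits[a]) & set(splits[b])
            if common:
                overlaps[(a, b)] = sorted(common)
    return overlaps


# Made-up file lists; the shared name is inserted on purpose to show a detected leak.
splits = {
    "train": ["01010056.mseed", "01020304.mseed"],
    "val": ["04010919.mseed", "01020304.mseed"],
    "test": ["04031646.mseed"],
}
print(find_overlaps(splits))
```

An empty result means the splits are disjoint by file name.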
