# Debugging NaN Issues in Spatial-Temporal Model Training

This notebook walks through identifying and resolving NaN (Not a Number) values in a spatial-temporal dataset that may be causing training failures.

## Import Required Libraries

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
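# Silence only deprecation noise; keep RuntimeWarnings, which often point at NaN sources
warnings.filterwarnings('ignore', category=FutureWarning)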

# Check PyTorch version and availability of CUDA
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.6.0+cpu
CUDA available: False


## Load and Inspect Data

First, let's load the dataset and examine its structure to better understand what we're working with.

In [13]:
# Replace with your actual data path
data_path = "d:/cyg/统计建模04192223bycyg/data/spatial_temporal_data.csv"

try:
    # Load the dataset
    df = pd.read_csv(data_path)
    print("Data loaded successfully!")
except FileNotFoundError:
    print("File not found. Please check the path and try again.")
except Exception as e:
    print(f"An error occurred: {e}")

File not found. Please check the path and try again.


In [3]:
# Inspect the data structure
if 'df' in locals():
    print(f"Dataset shape: {df.shape}")
    print("\nFirst 5 rows:")
    display(df.head())
    
    print("\nColumn information:")
    df.info()
    
    print("\nSummary statistics:")
    display(df.describe())

## Check for NaN Values

Now, let's check whether the dataset contains any NaN values and, if so, which columns are affected.

In [4]:
if 'df' in locals():
    # Check for NaN values in each column
    nan_counts = df.isnull().sum()
    print("NaN count in each column:")
    display(pd.DataFrame({
        'Column': nan_counts.index,
        'NaN Count': nan_counts.values,
        'Percentage': (nan_counts / len(df) * 100).round(2)
    }))
    
    # Only show columns with NaN values
    columns_with_nans = nan_counts[nan_counts > 0]
    if len(columns_with_nans) > 0:
        print("\nColumns with NaN values:")
        display(columns_with_nans)
    else:
        print("\nNo NaN values found in the dataset.")
        
    # Visualize missing values
    plt.figure(figsize=(12, 6))
    sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
    plt.title('Missing Values in Dataset')
    plt.tight_layout()
    plt.show()
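
Note that `isnull()` does not flag infinite values by default, yet `inf` in a feature can turn the loss into NaN just as a missing value does. The sketch below counts NaN and infinite entries per numeric column. The helper name `count_non_finite` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def count_non_finite(frame: pd.DataFrame) -> pd.DataFrame:
    """Count NaN, +inf and -inf entries for every numeric column."""
    numeric = frame.select_dtypes(include="number")
    values = numeric.to_numpy(dtype="float64")
    return pd.DataFrame(
        {
            "nan": np.isnan(values).sum(axis=0),
            "pos_inf": np.isposinf(values).sum(axis=0),
            "neg_inf": np.isneginf(values).sum(axis=0),
        },
        index=numeric.columns,
    )


# Hypothetical toy data; replace with the real DataFrame (df) when it is loaded
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.inf],
        "b": [0.0, -np.inf, 2.0],
        "name": ["x", "y", "z"],  # non-numeric columns are ignored
    }
)
report = count_non_finite(toy)
print(report)
```

If any infinite values show up, one option is to convert them to NaN with `df.replace([np.inf, -np.inf], np.nan)` so that the cleaning steps later in the notebook handle them too.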

Next, let's check whether the NaN values are scattered at random or clustered in particular rows or time periods.

In [5]:
if 'df' in locals() and 'columns_with_nans' in locals() and len(columns_with_nans) > 0:
    # Check if NaN values are present in the same rows
    rows_with_nans = df[df.isnull().any(axis=1)]
    print(f"Number of rows with at least one NaN value: {len(rows_with_nans)}")
    
    # If the data has a time column, check whether NaNs cluster in time
    if 'timestamp' in df.columns or 'date' in df.columns:
        time_col = 'timestamp' if 'timestamp' in df.columns else 'date'
        print("\nNaN distribution over time:")
        plt.figure(figsize=(12, 6))
        df.groupby(time_col).size().plot(kind='line', label='Total Records')
        df[df.isnull().any(axis=1)].groupby(time_col).size().plot(kind='line', label='Records with NaNs')
        plt.legend()
        plt.title('Records with NaNs over time')
        plt.show()
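
One way to quantify clustering is the longest run of consecutive NaNs in a column: short gaps are usually safe to forward-fill, while long runs may call for a different strategy. The following is a minimal sketch; the helper name `longest_nan_run` and the sample series are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def longest_nan_run(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaNs."""
    is_nan = series.isna()
    if not is_nan.any():
        return 0
    # Every non-NaN value starts a new block, so NaNs that follow it
    # share its block id; summing is_nan per block gives the run length.
    block_id = (~is_nan).cumsum()
    return int(is_nan.groupby(block_id).sum().max())


# Hypothetical series with a gap of two and a gap of one
sample = pd.Series([1.0, np.nan, np.nan, 3.0, np.nan])
run_length = longest_nan_run(sample)
print(run_length)
```

The series should be sorted by time (and restricted to a single location) before calling the helper, otherwise "consecutive" has no temporal meaning.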

## Handle Missing Values

Now that we've identified the NaN values, let's implement strategies to handle them. The approach depends on the data and context:

1. Fill with mean/median/mode (for numerical features)
2. Fill with a constant value
3. Use forward/backward fill for time series data
4. Drop rows or columns with NaNs (if small percentage)
5. Use more advanced imputation techniques
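
For spatial-temporal data, forward/backward fill (item 3) is usually safer when applied within each location, so that values from one site never leak into another. The sketch below assumes hypothetical column names `location_id`, `timestamp`, and `value`; the real dataset's columns may differ.

```python
import numpy as np
import pandas as pd


def fill_within_location(frame, location_col, time_col, value_cols):
    """Forward-fill, then backward-fill, separately for each location."""
    ordered = frame.sort_values([location_col, time_col]).copy()
    ordered[value_cols] = ordered.groupby(location_col)[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    return ordered


# Hypothetical toy data with two locations
toy = pd.DataFrame(
    {
        "location_id": ["A", "A", "A", "B", "B", "B"],
        "timestamp": [1, 2, 3, 1, 2, 3],
        "value": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    }
)
filled = fill_within_location(toy, "location_id", "timestamp", ["value"])
print(filled)
```

In the toy data, the leading NaN at location B is filled from B's own later observation rather than from the last value at location A, which is what a plain `ffill()` over the whole frame would have used.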

In [6]:
if 'df' in locals():
    # Create a copy of the dataframe to work with
    df_cleaned = df.copy()
    
    numerical_cols = df_cleaned.select_dtypes(include=['float64', 'int64']).columns
    
    # Strategy 3 (applied first): for time series data, use forward/backward fill.
    # This must run before the mean/median fill below, which would otherwise
    # leave no NaNs in these columns for it to handle.
    time_series_cols = [col for col in numerical_cols if any(ts in col.lower() for ts in ['time', 'date', 'day', 'month', 'year'])]
    for col in time_series_cols:
        if df_cleaned[col].isnull().sum() > 0:
            print(f"Filling time series column '{col}' with forward fill, then backward fill")
            # fillna(method=...) is deprecated in recent pandas versions; use ffill()/bfill()
            df_cleaned[col] = df_cleaned[col].ffill().bfill()
    
    # Strategy 1: Fill the remaining numerical columns with mean/median
    for col in numerical_cols:
        if df_cleaned[col].isnull().sum() > 0:
            # Check for outliers to decide between mean or median
            q1 = df_cleaned[col].quantile(0.25)
            q3 = df_cleaned[col].quantile(0.75)
            iqr = q3 - q1
            
            # If more than 5% of rows are outliers (on either side), use median; otherwise use mean
            outliers = (df_cleaned[col] < q1 - 1.5*iqr) | (df_cleaned[col] > q3 + 1.5*iqr)
            if outliers.sum() > len(df_cleaned) * 0.05:
                print(f"Filling column '{col}' with median due to outliers")
                df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())
            else:
                print(f"Filling column '{col}' with mean")
                df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mean())
    
    # Strategy 2: Fill categorical columns with mode
    categorical_cols = df_cleaned.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        if df_cleaned[col].isnull().sum() > 0:
            print(f"Filling column '{col}' with mode")
            df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode()[0])
    
    # For remaining NaNs, if any
    remaining_nans = df_cleaned.isnull().sum().sum()
    if remaining_nans > 0:
        print(f"\nRemaining NaNs after initial strategies: {remaining_nans}")
        
        # If less than 1% of the data has NaNs in rows, drop those rows
        nan_row_percentage = (df_cleaned.isnull().any(axis=1).sum() / len(df_cleaned)) * 100
        if nan_row_percentage < 1:
            before_len = len(df_cleaned)
            df_cleaned = df_cleaned.dropna(axis=0)
            print(f"Dropped {before_len - len(df_cleaned)} rows with NaNs (less than 1% of data)")
        else:
            # For any remaining columns with high NaN percentage, consider dropping the column
            nan_cols = df_cleaned.columns[df_cleaned.isnull().mean() > 0.3]
            if len(nan_cols) > 0:
                print(f"Columns with more than 30% NaNs: {list(nan_cols)}")
                print("Consider dropping these columns or using more advanced imputation techniques.")
    
    print(f"\nFinal NaN count after cleaning: {df_cleaned.isnull().sum().sum()}")

## Verify Data After Cleaning

Let's check if our cleaning process was successful by verifying there are no remaining NaN values in the dataset.

In [7]:
if 'df_cleaned' in locals():
    # Check for any remaining NaN values
    remaining_nans = df_cleaned.isnull().sum()
    
    if remaining_nans.sum() > 0:
        print("WARNING: There are still NaN values in the dataset")
        display(remaining_nans[remaining_nans > 0])
        
        # Additional strategy: drop any remaining rows with NaNs
        print("Dropping any rows with remaining NaNs...")
        df_cleaned = df_cleaned.dropna()
        print(f"Rows after final cleaning: {len(df_cleaned)}")
    else:
        print("SUCCESS: No NaN values remain in the dataset")
    
    # Check if we have enough data left
    print(f"\nOriginal data shape: {df.shape}")
    print(f"Cleaned data shape: {df_cleaned.shape}")
    print(f"Data retention: {len(df_cleaned)/len(df)*100:.2f}%")
    
    # Save the cleaned dataset
    clean_data_path = "d:/cyg/统计建模04192223bycyg/data/spatial_temporal_data_cleaned.csv"
    df_cleaned.to_csv(clean_data_path, index=False)
    print(f"Cleaned data saved to: {clean_data_path}")
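
Even when no NaNs remain, feature scaling can reintroduce them: a constant column has zero standard deviation, and dividing by it yields NaN. The sketch below standardizes features using statistics from the training split only and guards against zero variance. The function name `standardize_with_train_stats` and the `eps` threshold are assumptions for illustration.

```python
import numpy as np


def standardize_with_train_stats(train, other, eps=1e-8):
    """Scale both arrays with the mean and std of the training array.

    Columns whose std is (near) zero are only centred, not divided,
    which avoids 0/0 = NaN for constant features.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    safe_std = np.where(std < eps, 1.0, std)
    return (train - mean) / safe_std, (other - mean) / safe_std


# Hypothetical 2D feature arrays; the second column is constant in training
train = np.array([[1.0, 5.0], [3.0, 5.0]])
other = np.array([[5.0, 5.0]])
train_scaled, other_scaled = standardize_with_train_stats(train, other)
print(train_scaled)
print(other_scaled)
```

Using only training statistics also avoids leaking information from the validation portion into the features.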

## Reinitialize and Train the Model

Now that we have cleaned the data, let's reinitialize the spatial-temporal model and train it using the cleaned dataset.

In [8]:
# Define a simple spatial-temporal model using PyTorch
class SpatialTemporalModel(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.linear = torch.nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        # Take the output from the last time step
        y_pred = self.linear(lstm_out[:, -1, :])
        return y_pred

# Function to prepare data for the model
def prepare_data_for_training(df, sequence_length, target_col, feature_cols):
    """
    Prepare sequences for LSTM training
    
    Args:
        df: DataFrame with the cleaned data
        sequence_length: Length of input sequences
        target_col: Name of the target column
        feature_cols: List of feature column names
    
    Returns:
        X_tensor, y_tensor: PyTorch tensors ready for training
    """
    sequences = []
    targets = []
    
    # Create sequences
    for i in range(len(df) - sequence_length):
        seq = df[feature_cols].iloc[i:i+sequence_length].values
        target = df[target_col].iloc[i+sequence_length]
        sequences.append(seq)
        targets.append(target)
    
    # Convert to tensors
    X_tensor = torch.tensor(np.array(sequences), dtype=torch.float32)
    y_tensor = torch.tensor(np.array(targets), dtype=torch.float32).view(-1, 1)
    
    return X_tensor, y_tensor

# Function to train the model
def train_model(model, X, y, epochs=100, lr=0.01):
    """Train the model"""
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    # Hold out the last 20% of the sequences for validation (no shuffling, so order is preserved)
    split_idx = int(len(X) * 0.8)
    X_train, X_val = X[:split_idx], X[split_idx:]
    y_train, y_val = y[:split_idx], y[split_idx:]
    
    train_losses = []
    val_losses = []
    
    for epoch in range(epochs):
        # Training
        model.train()
        optimizer.zero_grad()
        y_pred = model(X_train)
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        
        # Validation
        model.eval()
        with torch.no_grad():
            y_val_pred = model(X_val)
            val_loss = criterion(y_val_pred, y_val)
            val_losses.append(val_loss.item())
        
        if (epoch+1) % 10 == 0:
            print(f'Epoch {epoch+1}/{epochs}, Train Loss: {loss.item():.6f}, Val Loss: {val_loss.item():.6f}')
    
    return train_losses, val_losses
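
Clean input data does not rule out NaNs that arise during optimization, for example from exploding gradients. As a hedged sketch, a training step can fail fast on non-finite inputs or loss and clip the gradient norm before the update. The function name `checked_train_step` and the `max_grad_norm` default are assumptions; the toy linear model below stands in for the LSTM model.

```python
import torch


def checked_train_step(model, optimizer, criterion, X, y, max_grad_norm=1.0):
    """Run one optimization step, raising if inputs or loss are not finite."""
    if not torch.isfinite(X).all() or not torch.isfinite(y).all():
        raise FloatingPointError("Non-finite values found in the input batch")

    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    if not torch.isfinite(loss):
        raise FloatingPointError(f"Non-finite loss: {loss.item()}")

    loss.backward()
    # Rescale gradients so their total norm does not exceed max_grad_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()


# Hypothetical toy setup, for illustration only
torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()
X = torch.randn(8, 3)
y = torch.randn(8, 1)
loss_value = checked_train_step(model, optimizer, criterion, X, y)
print(f"Loss after one checked step: {loss_value:.6f}")
```

If the check fires mid-training, `torch.autograd.set_detect_anomaly(True)` can help locate the operation that produced the NaN, at the cost of slower execution.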

In [9]:
# Let's train the model with our cleaned data
if 'df_cleaned' in locals() and df_cleaned.isnull().sum().sum() == 0:
    try:
        # For demonstration, let's assume some columns are features and one is target
        # Replace with actual column names from your dataset
        target_col = "target"  # Replace with your actual target column
        
        # Identify numerical columns for features (excluding the target)
        feature_cols = df_cleaned.select_dtypes(include=['float64', 'int64']).columns.tolist()
        if target_col in feature_cols:
            feature_cols.remove(target_col)
        
        print(f"Target column: {target_col}")
        print(f"Feature columns: {feature_cols}")
        
        # Parameters
        sequence_length = 10
        hidden_dim = 64
        input_dim = len(feature_cols)
        output_dim = 1  # Assuming regression task
        
        # Prepare data
        X, y = prepare_data_for_training(df_cleaned, sequence_length, target_col, feature_cols)
        print(f"Input shape: {X.shape}, Target shape: {y.shape}")
        
        # Initialize and train model
        model = SpatialTemporalModel(input_dim, hidden_dim, output_dim)
        print("Model initialized:")
        print(model)
        
        # Train model
        train_losses, val_losses = train_model(model, X, y, epochs=50)
        
        # Plot training progress
        plt.figure(figsize=(10, 5))
        plt.plot(train_losses, label='Training Loss')
        plt.plot(val_losses, label='Validation Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training and Validation Loss')
        plt.legend()
        plt.grid(True)
        plt.show()
        
        # Save the model
        model_path = "d:/cyg/统计建模04192223bycyg/models/spatial_temporal_model.pt"
        torch.save(model.state_dict(), model_path)
        print(f"Model saved to: {model_path}")
        
    except Exception as e:
        print(f"Error during model training: {e}")
else:
    print("Cannot train model because data cleaning was unsuccessful or the target dataset is not available")
    print("Please ensure all NaN values are handled before training")

Cannot train model because data cleaning was unsuccessful or the target dataset is not available
Please ensure all NaN values are handled before training


## Evaluate Model Performance

Let's evaluate how well our model performs after resolving the NaN issues.

In [10]:
def evaluate_model(model, X, y):
    """Evaluate the model performance"""
    model.eval()
    with torch.no_grad():
        y_pred = model(X).numpy()
        y_true = y.numpy()
        
        # Calculate metrics
        mae = mean_absolute_error(y_true, y_pred)
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        r2 = r2_score(y_true, y_pred)
        
        print(f"Mean Absolute Error (MAE): {mae:.4f}")
        print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
        print(f"R² Score: {r2:.4f}")
        
        # Plot actual vs predicted
        plt.figure(figsize=(12, 6))
        plt.scatter(y_true, y_pred, alpha=0.5)
        plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')
        plt.xlabel('Actual')
        plt.ylabel('Predicted')
        plt.title('Actual vs Predicted Values')
        plt.grid(True)
        plt.show()
        
        # Plot prediction errors
        errors = y_pred.flatten() - y_true.flatten()
        plt.figure(figsize=(12, 6))
        plt.hist(errors, bins=50)
        plt.xlabel('Prediction Error')
        plt.ylabel('Count')
        plt.title('Distribution of Prediction Errors')
        plt.grid(True)
        plt.show()
        
        return mae, rmse, r2
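
For reference, the three metrics reported above can be written out directly in NumPy. This is only a sketch of the standard definitions, not a replacement for the scikit-learn functions; the name `regression_metrics` and the sample values are assumptions for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_pred - y_true

    mae = np.abs(residuals).mean()
    rmse = np.sqrt((residuals ** 2).mean())

    ss_res = (residuals ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    # R² is undefined when y_true is constant; this sketch returns NaN there
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, rmse, r2


# Hypothetical values: two exact predictions and one that is off by 1
mae_demo, rmse_demo, r2_demo = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(mae_demo, rmse_demo, r2_demo)
```

A NaN in any prediction propagates into all three metrics, so a NaN score is a quick signal that the model output itself needs checking.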

In [11]:
# Evaluate our model
if 'model' in locals() and 'X' in locals() and 'y' in locals():
    try:
        # Evaluate on the last 20% of the data (the same split used for validation
        # during training, so this is not an independent test set)
        split_idx = int(len(X) * 0.8)
        X_test, y_test = X[split_idx:], y[split_idx:]
        
        print("Evaluating model performance...")
        mae, rmse, r2 = evaluate_model(model, X_test, y_test)
        
        # Compare with previous model metrics if available
        # This would require loading previous metrics from somewhere
        print("\nModel evaluation complete!")
        print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")
        
        # Save metrics for future comparison
        metrics = {
            'mae': mae,
            'rmse': rmse,
            'r2': r2,
            'timestamp': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
        }
        
        metrics_df = pd.DataFrame([metrics])
        metrics_path = "d:/cyg/统计建模04192223bycyg/metrics/model_metrics.csv"
        
        # Append or create metrics file
        try:
            existing_metrics = pd.read_csv(metrics_path)
            combined_metrics = pd.concat([existing_metrics, metrics_df], ignore_index=True)
            combined_metrics.to_csv(metrics_path, index=False)
        except (FileNotFoundError, pd.errors.EmptyDataError):
            metrics_df.to_csv(metrics_path, index=False)
        
        print(f"Metrics saved to: {metrics_path}")
        
    except Exception as e:
        print(f"Error during model evaluation: {e}")
else:
    print("Cannot evaluate model because it hasn't been trained yet")

Cannot evaluate model because it hasn't been trained yet


## Conclusion

In this notebook, we have:

1. Identified and visualized NaN values in our spatial-temporal dataset
2. Applied several strategies to handle these missing values
3. Verified that no NaN values remain after preprocessing
4. Set up training of a spatial-temporal model on the cleaned data
5. Set up evaluation of the model's performance

Note that in the run recorded above the dataset file was not found, so the training and evaluation cells were skipped. Once `data_path` points to a valid file, addressing the NaN issues should prevent training failures caused by missing values, and the resulting performance metrics can serve as a baseline for future improvements.

### Next Steps

- Further tune hyperparameters to improve model performance
- Experiment with more advanced imputation techniques if needed
- Implement cross-validation for more reliable performance estimates
- Consider ensemble methods for improved predictions
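
For the cross-validation item above, ordinary shuffled k-fold would let the model train on observations that come after the ones it is validated on. A common alternative for time-ordered data is an expanding-window split, sketched below in plain Python. The function name `expanding_window_splits` is an assumption for illustration; scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, val_indices) pairs in chronological order.

    Each fold trains on everything before a cut-off and validates on the
    block that immediately follows it, so no future data leaks backwards.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold_size * k
        # The last fold absorbs any leftover samples
        val_end = fold_size * (k + 1) if k < n_splits else n_samples
        yield range(0, train_end), range(train_end, val_end)


# Hypothetical example: 10 time-ordered samples, 3 folds
splits = list(expanding_window_splits(10, 3))
for train_idx, val_idx in splits:
    print(f"train={list(train_idx)} val={list(val_idx)}")
```

The indices refer to positions in the time-ordered sequence tensors, so they can be used to slice `X` and `y` directly.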