# Neural Network - Advanced Data Preprocessing
## Forest Cover Type Dataset

This notebook engineers additional features for the Forest Cover Type dataset, then splits, scales, and saves the data as arrays ready for neural network training.

### Feature Engineering Techniques:
1. **Domain-specific interactions** (hydrology, elevation)
2. **Coordinate rotations** (pairwise sums and differences of key features)
3. **Cyclical encoding** (sine and cosine of Aspect angles)
4. **Categorical indices for embeddings** (Wilderness Areas, Soil Types)
5. **Robust scaling** for features with extreme values



## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
import os
import itertools

print("Libraries imported successfully")

Libraries imported successfully


## 2. Configuration

In [2]:
print("=" * 80)
print("NEURAL NETWORK PREPROCESSING (ADVANCED FEATURE ENGINEERING)")
print("=" * 80)

NEURAL NETWORK PREPROCESSING (ADVANCED FEATURE ENGINEERING)


## 3. Locate Dataset

In [3]:
script_dir = os.path.abspath('../..')

possible_paths = [
    os.path.join(script_dir, 'covtype.csv'),
    os.path.join(script_dir, '../covtype.csv'),
    os.path.join(script_dir, '../../covtype.csv'),
    'covtype.csv',
    '../covtype.csv'
]

csv_path = None
for path in possible_paths:
    if os.path.exists(path):
        csv_path = path
        break

if csv_path is None:
    print("Error: covtype.csv not found!")
    raise FileNotFoundError("covtype.csv not found")

print(f"✓ Found dataset at: {csv_path}")

✓ Found dataset at: C:\PYTHON\AIT511 Course Project 2\archive\covtype.csv


## 4. Load Dataset

In [4]:
print("\nLoading dataset...")
df = pd.read_csv(csv_path)
print(f"✓ Dataset loaded: {df.shape}")
print(f"  - Rows: {df.shape[0]:,}")
print(f"  - Columns: {df.shape[1]}")

df.head()


Loading dataset...
✓ Dataset loaded: (581012, 55)
  - Rows: 581,012
  - Columns: 55


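|   | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |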


## 5. Core Feature Interactions

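Create domain-specific features from elevation and the distance columns: elevation adjusted by the vertical distance to water, the Euclidean and Manhattan distances to water, and pairwise sums, absolute differences, and a mean of the horizontal distances to hydrology, roadways, and fire points.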

In [5]:
print("\n[1/3] Generating Domain-Specific Interactions...")

df['Hydro_Elevation'] = df['Elevation'] - df['Vertical_Distance_To_Hydrology']
print("  ✓ Created: Hydro_Elevation")

df['Hydro_Euclidean'] = np.sqrt(
    df['Horizontal_Distance_To_Hydrology']**2 + 
    df['Vertical_Distance_To_Hydrology']**2
)
print("  ✓ Created: Hydro_Euclidean")

df['Hydro_Manhattan'] = (
    abs(df['Horizontal_Distance_To_Hydrology']) + 
    abs(df['Vertical_Distance_To_Hydrology'])
)
print("  ✓ Created: Hydro_Manhattan")

df['Dist_Hydro_Fire'] = (
    df['Horizontal_Distance_To_Hydrology'] + 
    df['Horizontal_Distance_To_Fire_Points']
)
df['Dist_Hydro_Road'] = (
    df['Horizontal_Distance_To_Hydrology'] + 
    df['Horizontal_Distance_To_Roadways']
)
df['Dist_Fire_Road'] = (
    df['Horizontal_Distance_To_Fire_Points'] + 
    df['Horizontal_Distance_To_Roadways']
)
print("  ✓ Created: Distance sum features")

df['Abs_Dist_Hydro_Fire'] = abs(
    df['Horizontal_Distance_To_Hydrology'] - 
    df['Horizontal_Distance_To_Fire_Points']
)
df['Abs_Dist_Hydro_Road'] = abs(
    df['Horizontal_Distance_To_Hydrology'] - 
    df['Horizontal_Distance_To_Roadways']
)
print("  ✓ Created: Absolute difference features")

df['Mean_Dist_Amenities'] = (
    df['Horizontal_Distance_To_Hydrology'] + 
    df['Horizontal_Distance_To_Roadways'] + 
    df['Horizontal_Distance_To_Fire_Points']
) / 3
print("  ✓ Created: Mean_Dist_Amenities")

# 3 hydrology features + 3 distance sums + 2 absolute differences + 1 mean = 9
print(f"\nTotal engineered features so far: 9")


[1/3] Generating Domain-Specific Interactions...
  ✓ Created: Hydro_Elevation
  ✓ Created: Hydro_Euclidean
  ✓ Created: Hydro_Manhattan
  ✓ Created: Distance sum features
  ✓ Created: Absolute difference features
  ✓ Created: Mean_Dist_Amenities

Total engineered features so far: 9
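As a quick sanity check, the three hydrology features can be worked by hand for a single row. The sketch below uses the values shown for row 2 of the `df.head()` preview (Elevation 2804, horizontal distance 268, vertical distance 65); the variable names are illustrative and not part of the notebook.

```python
import math

# Values taken from row 2 of the df.head() preview
elevation = 2804
horizontal_dist = 268   # Horizontal_Distance_To_Hydrology
vertical_dist = 65      # Vertical_Distance_To_Hydrology

hydro_elevation = elevation - vertical_dist
hydro_euclidean = math.sqrt(horizontal_dist**2 + vertical_dist**2)
hydro_manhattan = abs(horizontal_dist) + abs(vertical_dist)

print(hydro_elevation)             # 2739
print(round(hydro_euclidean, 2))   # 275.77
print(hydro_manhattan)             # 333
```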


## 6. Linear Rotations

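Create pairwise sums and differences of four key features. Up to a constant factor, each sum/difference pair is a 45° rotation of the original two axes, which can help a neural network capture diagonal decision boundaries. Note that the three sums among the horizontal distance columns repeat `Dist_Hydro_Road`, `Dist_Hydro_Fire`, and `Dist_Fire_Road` from the previous section under new names.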

In [6]:
print("\n[2/3] Applying Linear Rotations...")

cols_to_rotate = [
    'Horizontal_Distance_To_Hydrology',
    'Horizontal_Distance_To_Roadways',
    'Horizontal_Distance_To_Fire_Points',
    'Elevation'
]

rotation_count = 0
for col1, col2 in itertools.combinations(cols_to_rotate, 2):
    df[f'{col1}_plus_{col2}'] = df[col1] + df[col2]
    df[f'{col1}_minus_{col2}'] = df[col1] - df[col2]
    rotation_count += 2

print(f"  ✓ Created {rotation_count} rotation features from {len(cols_to_rotate)} base features")
print(f"  ✓ Total pairwise combinations: {len(list(itertools.combinations(cols_to_rotate, 2)))}")


[2/3] Applying Linear Rotations...
  ✓ Created 12 rotation features from 4 base features
  ✓ Total pairwise combinations: 6
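The link between sums/differences and rotations can be checked numerically: rotating a point by 45° gives coordinates equal to the difference and the sum divided by √2. This is a minimal sketch on one arbitrary point; the helper `rotate_45` is hypothetical and not used elsewhere in the notebook.

```python
import math

def rotate_45(x, y):
    """Rotate the point (x, y) by 45 degrees counter-clockwise."""
    c = math.cos(math.pi / 4)
    s = math.sin(math.pi / 4)
    return (c * x - s * y, s * x + c * y)

# Example pair: distance to hydrology and distance to roadways for one row
x, y = 268.0, 3180.0
rx, ry = rotate_45(x, y)

# The rotated coordinates equal (x - y) / sqrt(2) and (x + y) / sqrt(2)
assert math.isclose(rx * math.sqrt(2), x - y)
assert math.isclose(ry * math.sqrt(2), x + y)
```

Because each feature is rescaled later in the notebook, the missing factor of √2 should not matter in practice.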


## 7. Cyclical Aspect Encoding

Aspect is a circular variable (0° = 360°), so we use sine/cosine encoding.

In [7]:
print("\nEncoding cyclical Aspect feature...")

df['Aspect_Sin'] = np.sin(np.radians(df['Aspect']))
df['Aspect_Cos'] = np.cos(np.radians(df['Aspect']))

print("  ✓ Created: Aspect_Sin, Aspect_Cos")
print("  Note: This preserves the circular nature of compass directions")


Encoding cyclical Aspect feature...
  ✓ Created: Aspect_Sin, Aspect_Cos
  Note: This preserves the circular nature of compass directions
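To see why the encoding helps, compare two nearly identical compass directions, 359° and 1°. Their raw values differ by 358, while their encoded points sit next to each other on the unit circle. The helpers below are illustrative only.

```python
import math

def encode_aspect(degrees):
    """Map a compass angle in degrees to a (sin, cos) pair on the unit circle."""
    radians = math.radians(degrees)
    return (math.sin(radians), math.cos(radians))

def encoded_distance(a, b):
    """Euclidean distance between two encoded angles."""
    return math.dist(encode_aspect(a), encode_aspect(b))

print(abs(359 - 1))                        # raw difference: 358
print(round(encoded_distance(359, 1), 4))  # 0.0349
print(round(encoded_distance(0, 180), 4))  # opposite directions: 2.0
```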


## 8. Feature Summary

In [8]:
print("\nFeature Engineering Summary:")
print("=" * 60)
n_original = 54                    # raw feature columns, excluding Cover_Type
n_features = len(df.columns) - 1   # exclude the Cover_Type target column
print(f"Original features: {n_original}")
print(f"Current total features: {n_features}")
print(f"Engineered features: {n_features - n_original}")
print("\nNew feature categories:")
print("  - Hydrological interactions: 9")
print(f"  - Linear rotations: {rotation_count}")
print("  - Cyclical encoding: 2")


Feature Engineering Summary:
Original features: 54
Current total features: 77
Engineered features: 23

New feature categories:
  - Hydrological interactions: 9
  - Linear rotations: 12
  - Cyclical encoding: 2


## 9. Prepare Arrays

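Collapse the one-hot Wilderness Area and Soil Type columns into integer indices for embedding layers, then list the continuous feature columns. The raw `Aspect` column is excluded because its sine and cosine encodings replace it.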

In [9]:
print("\n[3/3] Preparing arrays...")

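# Skip the *_Index columns so the cell can be re-run without picking them up
wild_cols = [c for c in df.columns if 'Wilderness_Area' in c and 'Index' not in c]
soil_cols = [c for c in df.columns if 'Soil_Type' in c and 'Index' not in c]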

print(f"  - Wilderness Area columns: {len(wild_cols)}")
print(f"  - Soil Type columns: {len(soil_cols)}")

df['Wilderness_Area_Index'] = np.argmax(df[wild_cols].values, axis=1)
df['Soil_Type_Index'] = np.argmax(df[soil_cols].values, axis=1)

print("  ✓ Created categorical indices for embeddings")

exclude_cols = ['Cover_Type', 'Aspect'] + wild_cols + soil_cols
cont_features = [c for c in df.columns if c not in exclude_cols and 'Index' not in c]

print(f"  ✓ Total continuous features: {len(cont_features)}")


[3/3] Preparing arrays...
  - Wilderness Area columns: 4
  - Soil Type columns: 40
  ✓ Created categorical indices for embeddings
  ✓ Total continuous features: 32
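The index step relies on each row having exactly one active column in a one-hot block, so `np.argmax` returns the position of that column. A small sketch on made-up rows (not real dataset rows) shows the mapping and one caveat:

```python
import numpy as np

# Three rows of a hypothetical 4-column one-hot block
one_hot = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

indices = np.argmax(one_hot, axis=1)
print(indices)  # [0 2 3]

# Caveat: a row with no active column also maps to index 0
empty_row_index = np.argmax(np.array([[0, 0, 0, 0]]), axis=1)
print(empty_row_index)  # [0]
```

If the data could contain rows with no active column, it may be worth checking `df[wild_cols].sum(axis=1)` before collapsing.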


## 10. Extract Arrays

In [10]:
X_cont = df[cont_features].values.astype(np.float32)
X_wild = df['Wilderness_Area_Index'].values
X_soil = df['Soil_Type_Index'].values
y = df['Cover_Type'].values

print(f"\nArray shapes:")
print(f"  - Continuous features: {X_cont.shape}")
print(f"  - Wilderness indices: {X_wild.shape}")
print(f"  - Soil indices: {X_soil.shape}")
print(f"  - Target: {y.shape}")


Array shapes:
  - Continuous features: (581012, 32)
  - Wilderness indices: (581012,)
  - Soil indices: (581012,)
  - Target: (581012,)


## 11. Train-Test Split

In [11]:
print("\nSplitting data (85% train, 15% test)...")

train_idx, test_idx, y_train, y_test = train_test_split(
    np.arange(len(df)), 
    y, 
    test_size=0.15, 
    random_state=42, 
    stratify=y
)

X_cont_train = X_cont[train_idx]
X_cont_test = X_cont[test_idx]
X_wild_train = X_wild[train_idx]
X_wild_test = X_wild[test_idx]
X_soil_train = X_soil[train_idx]
X_soil_test = X_soil[test_idx]

print(f"✓ Data split complete")
print(f"  - Training samples: {len(train_idx):,}")
print(f"  - Test samples: {len(test_idx):,}")


Splitting data (85% train, 15% test)...
✓ Data split complete
  - Training samples: 493,860
  - Test samples: 87,152
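Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both splits, which matters when some cover types are rare. The sketch below illustrates this on synthetic, imbalanced labels; the label counts are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 1, 15% class 2, 5% class 3
labels = np.array([1] * 800 + [2] * 150 + [3] * 50)

train_idx, test_idx = train_test_split(
    np.arange(len(labels)),
    test_size=0.15,
    random_state=42,
    stratify=labels,
)

# Class shares should be nearly identical in both splits
for name, idx in [("train", train_idx), ("test", test_idx)]:
    _, counts = np.unique(labels[idx], return_counts=True)
    print(name, np.round(counts / counts.sum(), 3))
```

Splitting index positions rather than the arrays themselves, as the notebook does, lets one split be applied to several aligned arrays.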


## 12. Robust Scaling

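Use RobustScaler instead of StandardScaler: it centers each feature on its median and scales by its interquartile range, so extreme values, especially in the hydrology features, have less influence on the result. The scaler is fit on the training split only and then applied to the test split.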

In [12]:
print("\nApplying Robust Scaling...")
print("  Note: RobustScaler handles outliers better than StandardScaler")

scaler = RobustScaler()
X_cont_train = scaler.fit_transform(X_cont_train)
X_cont_test = scaler.transform(X_cont_test)

print("✓ Scaling complete")

print("\nScaling verification (training set):")
print(f"  - Median: {np.median(X_cont_train):.6f} (should be ~0)")
print(f"  - IQR: ~1 for most features")
print(f"  - Min: {X_cont_train.min():.6f}")
print(f"  - Max: {X_cont_train.max():.6f}")


Applying Robust Scaling...
  Note: RobustScaler handles outliers better than StandardScaler
✓ Scaling complete

Scaling verification (training set):
  - Median: 0.000000 (should be ~0)
  - IQR: ~1 for most features
  - Min: -9.416667
  - Max: 9.209678
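The effect of robust scaling can be illustrated on a single synthetic feature with a few extreme values. The sketch below reproduces by hand what `RobustScaler` and `StandardScaler` compute with their default settings; the data is randomly generated for the example and is not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature: 995 ordinary values plus 5 extreme outliers
feature = np.concatenate([
    rng.normal(100, 10, size=995),
    [5000, 6000, 7000, 8000, 9000],
])

# Robust scaling: subtract the median, divide by the interquartile range
median = np.median(feature)
q1, q3 = np.percentile(feature, [25, 75])
robust = (feature - median) / (q3 - q1)

# Standard scaling: subtract the mean, divide by the standard deviation
standard = (feature - feature.mean()) / feature.std()

# After robust scaling the median is 0 and the interquartile range is 1
print(np.median(robust))
print(np.percentile(robust, 75) - np.percentile(robust, 25))

# Spread of the 995 ordinary values under each scaling
print(robust[:995].std(), standard[:995].std())
```

Under standard scaling the outliers inflate the standard deviation and squash the ordinary values into a narrow band, while robust scaling keeps their spread close to 1. The same median and IQR check can be run per feature with `axis=0` on the scaled training array.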


## 13. Encode Target Variable

Shift `Cover_Type` from 1-7 to 0-6 and convert it to one-hot vectors for neural network training.

In [13]:
print("\nEncoding target variable...")

y_train_cat = pd.get_dummies(y_train - 1).values
y_test_cat = pd.get_dummies(y_test - 1).values

print(f"✓ Target encoded")
print(f"  - Training shape: {y_train_cat.shape}")
print(f"  - Test shape: {y_test_cat.shape}")
print(f"  - Number of classes: {y_train_cat.shape[1]}")


Encoding target variable...
✓ Target encoded
  - Training shape: (493860, 7)
  - Test shape: (87152, 7)
  - Number of classes: 7
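One caveat of encoding each split separately with `pd.get_dummies` is that the number of columns depends on which classes appear in that split, and the output dtype can vary with the pandas version. The stratified split makes a missing class unlikely here, but a fixed-width encoding removes the dependency. A minimal NumPy sketch on made-up labels, offered as an alternative rather than as the notebook's method:

```python
import numpy as np

n_classes = 7                       # Cover_Type takes values 1..7
labels = np.array([5, 2, 7, 1])     # hypothetical 1-based labels

# Row k of the identity matrix is the one-hot vector for class index k
one_hot = np.eye(n_classes, dtype=np.float32)[labels - 1]

print(one_hot.shape)            # (4, 7)
print(one_hot.argmax(axis=1))   # [4 1 6 0]
```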


## 14. Save Processed Data

In [14]:
print("\nSaving processed data...")

output_dir = os.path.join(script_dir, 'data_95_v2')
os.makedirs(output_dir, exist_ok=True)

output_file = os.path.join(output_dir, 'advanced_data_v2.npz')
np.savez_compressed(
    output_file,
    X_cont_train=X_cont_train,
    X_cont_test=X_cont_test,
    X_wild_train=X_wild_train,
    X_wild_test=X_wild_test,
    X_soil_train=X_soil_train,
    X_soil_test=X_soil_test,
    y_train_cat=y_train_cat,
    y_test_cat=y_test_cat
)

print(f"✓ Saved to: {output_file}")


Saving processed data...
✓ Saved to: C:\PYTHON\AIT511 Course Project 2\archive\data_95_v2\advanced_data_v2.npz
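A training script can read the arrays back with `np.load`, which returns a dict-like object keyed by the names passed to `np.savez_compressed`. The round trip below is a sketch using small stand-in arrays and a temporary file, so it does not touch the real output file.

```python
import os
import tempfile

import numpy as np

# Small stand-in arrays; the real file holds the eight arrays saved above
arrays = {
    "X_cont_train": np.zeros((4, 3), dtype=np.float32),
    "y_train_cat": np.eye(4, 7, dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npz")
    np.savez_compressed(path, **arrays)

    # np.load reads arrays lazily; the context manager closes the file
    with np.load(path) as data:
        loaded = {name: data[name] for name in data.files}

print(sorted(loaded))                 # ['X_cont_train', 'y_train_cat']
print(loaded["X_cont_train"].shape)   # (4, 3)
```

Note that the fitted scaler and the list of feature names are not part of the saved file, so they would need to be stored separately if new data must be transformed later.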


## 15. Summary

In [15]:
print("\n" + "=" * 80)
print("NEURAL NETWORK PREPROCESSING COMPLETE")
print("=" * 80)
print(f"✓ Original features: 54")
print(f"✓ Continuous features: {X_cont_train.shape[1]}")
print(f"✓ Categorical features: 2 (Wilderness, Soil)")
print(f"✓ Training samples: {len(y_train):,}")
print(f"✓ Test samples: {len(y_test):,}")
print(f"✓ Classes: {y_train_cat.shape[1]}")
print(f"\nFeature Engineering Applied:")
print(f"  - Domain-specific interactions")
print(f"  - Linear rotations (45° combinations)")
print(f"  - Cyclical encoding (Aspect)")
print(f"  - Categorical embeddings (Wilderness, Soil)")
print(f"  - Robust scaling (outlier-resistant)")
print(f"\nData ready for Neural Network training!")


NEURAL NETWORK PREPROCESSING COMPLETE
✓ Original features: 54
✓ Continuous features: 32
✓ Categorical features: 2 (Wilderness, Soil)
✓ Training samples: 493,860
✓ Test samples: 87,152
✓ Classes: 7

Feature Engineering Applied:
  - Domain-specific interactions
  - Linear rotations (45° combinations)
  - Cyclical encoding (Aspect)
  - Categorical embeddings (Wilderness, Soil)
  - Robust scaling (outlier-resistant)

Data ready for Neural Network training!
