# Support Vector Machine (SVM) - Data Preprocessing
## Forest Cover Type Dataset

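This notebook prepares the Forest Cover Type data for SVM modeling. An SVM fits a maximum-margin boundary from dot products (or kernel distances) between samples, so features on larger numeric scales dominate the solution unless the inputs are scaled first.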

### Preprocessing Steps:
1. Load raw dataset
2. Split into training and test sets (stratified)
3. Apply standard scaling (essential for SVM)
4. Save processed data and scaler

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import os
import joblib

print("Libraries imported successfully")

Libraries imported successfully


## 2. Configuration

In [2]:
print("=" * 80)
print("SVM PREPROCESSING")
print("=" * 80)

SVM PREPROCESSING


## 3. Locate Dataset

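Check several candidate locations for `covtype.csv` and use the first one that exists. If none exists, list the paths that were tried and raise `FileNotFoundError`.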

In [3]:
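# Despite its name, script_dir is the directory two levels above the
# current working directory, not the folder containing this notebook.
script_dir = os.path.abspath('../..')
parent_dir = os.path.dirname(script_dir)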

possible_paths = [
    os.path.join(script_dir, 'covtype.csv'),
    os.path.join(parent_dir, 'covtype.csv'),
    'covtype.csv',
    '../covtype.csv',
    '../../covtype.csv'
]

csv_path = None
for path in possible_paths:
    if os.path.exists(path):
        csv_path = path
        break

if csv_path is None:
    print("Error: covtype.csv not found!")
    print("Checked paths:")
    for path in possible_paths:
        print(f"  - {path}")
    raise FileNotFoundError("covtype.csv not found")

print(f"✓ Found dataset at: {csv_path}")

✓ Found dataset at: C:\PYTHON\AIT511 Course Project 2\archive\covtype.csv
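
The same first-match search can be wrapped in a small reusable helper. The sketch below uses `pathlib`; the function name `find_first_existing` and the shortened candidate list are illustrative assumptions, not part of the notebook.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_first_existing(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that is an existing file (as an absolute path), else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path.resolve()
    return None


# Hypothetical candidate list mirroring the relative paths tried above.
candidates = ["covtype.csv", "../covtype.csv", "../../covtype.csv"]
found = find_first_existing(candidates)
print(found if found is not None else "covtype.csv not found in any candidate location")
```

Returning `None` instead of raising leaves the caller free to decide whether a missing file is fatal.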


## 4. Load Dataset

In [4]:
print("\n[1/5] Loading dataset...")
df = pd.read_csv(csv_path)
print(f"✓ Dataset loaded: {df.shape}")
print(f"  - Rows: {df.shape[0]:,}")
print(f"  - Columns: {df.shape[1]}")

print("\nDataset Info:")
df.info()


[1/5] Loading dataset...


✓ Dataset loaded: (581012, 55)
  - Rows: 581,012
  - Columns: 55

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           581012 non-null  int64
 1   Aspect                              581012 non-null  int64
 2   Slope                               581012 non-null  int64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  int64
 4   Vertical_Distance_To_Hydrology      581012 non-null  int64
 5   Horizontal_Distance_To_Roadways     581012 non-null  int64
 6   Hillshade_9am                       581012 non-null  int64
 7   Hillshade_Noon                      581012 non-null  int64
 8   Hillshade_3pm                       581012 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  int64
 10  Wilderness_Area1                    

## 5. Prepare Features and Target

### SVM Preprocessing Strategy:
1. **Feature Selection**: Use all 54 features initially (a manageable dimensionality)
2. **Scaling**: Standardize every feature (SVM margins are sensitive to feature magnitudes)
3. **Dimensionality Reduction**: Optional - LinearSVC handles 54 dims efficiently

In [5]:
# The target is assumed to be the last column; all other columns are features.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget classes: {np.unique(y)}")
print(f"Number of classes: {len(np.unique(y))}")

unique, counts = np.unique(y, return_counts=True)
print("\nClass distribution:")
for cls, count in zip(unique, counts):
    percentage = (count / len(y)) * 100
    print(f"  Class {cls}: {count:6,} ({percentage:5.2f}%)")

Features shape: (581012, 54)
Target shape: (581012,)

Target classes: [1 2 3 4 5 6 7]
Number of classes: 7

Class distribution:
  Class 1: 211,840 (36.46%)
  Class 2: 283,301 (48.76%)
  Class 3: 35,754 ( 6.15%)
  Class 4:  2,747 ( 0.47%)
  Class 5:  9,493 ( 1.63%)
  Class 6: 17,367 ( 2.99%)
  Class 7: 20,510 ( 3.53%)


## 6. Split Data

Using stratified split to maintain class distribution in both sets.

In [6]:
print("\n[2/5] Splitting data (80% train, 20% test)...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print("✓ Data split complete")
print(f"  - Training samples: {X_train.shape[0]:,}")
print(f"  - Test samples: {X_test.shape[0]:,}")
print(f"  - Features: {X_train.shape[1]}")


[2/5] Splitting data (80% train, 20% test)...


✓ Data split complete
  - Training samples: 464,809
  - Test samples: 116,203
  - Features: 54
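
To see what `stratify=y` does, the sketch below splits a small synthetic, imbalanced label vector and compares the minority-class share in each part. The sample counts and the helper `class_share` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 900 samples of class 1, 100 of class 2 (illustrative only).
y_demo = np.array([1] * 900 + [2] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def class_share(labels, cls):
    """Fraction of `labels` equal to `cls`."""
    return float(np.mean(labels == cls))


print(f"Class 2 share, full:  {class_share(y_demo, 2):.3f}")
print(f"Class 2 share, train: {class_share(y_tr, 2):.3f}")
print(f"Class 2 share, test:  {class_share(y_te, 2):.3f}")
```

Without stratification, a rare class such as class 4 in this dataset (0.47% of rows) could end up noticeably over- or under-represented in the test set by chance.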


## 7. Feature Scaling

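**Critical for SVM**: StandardScaler transforms each feature to have mean=0 and variance=1. The scaler is fitted on the training set only, and the test set is transformed with the same training statistics so that no test information leaks into preprocessing.

This ensures that:
- All features contribute on a comparable scale to the margin computation
- The optimizer typically converges faster
- Accuracy is usually better than with unscaled features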
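
As a minimal illustration of what the scaler computes, the sketch below standardizes a tiny made-up matrix by hand with `z = (x - mean) / std`, using the training mean and population standard deviation, and checks the result against `StandardScaler`. The toy values and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (values are made up).
X_train_demo = np.array([[2500.0, 10.0], [3000.0, 20.0], [3500.0, 30.0]])
X_test_demo = np.array([[2750.0, 40.0]])

# Statistics come from the training set only.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

manual_train = (X_train_demo - mu) / sigma
manual_test = (X_test_demo - mu) / sigma  # test rows reuse the training statistics

scaler_demo = StandardScaler().fit(X_train_demo)
print(np.allclose(manual_train, scaler_demo.transform(X_train_demo)))
print(np.allclose(manual_test, scaler_demo.transform(X_test_demo)))
```

Both checks should print `True`, since the manual formula and the fitted scaler apply the same per-feature shift and scale.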

In [7]:
print("\n[3/5] Applying Standard Scaling...")
print("(Essential for SVM - distance-based algorithm)")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✓ Data scaled (mean=0, variance=1)")

print("\nScaling verification (training set):")
print(f"  - Mean: {X_train_scaled.mean():.6f} (should be ~0)")
print(f"  - Std: {X_train_scaled.std():.6f} (should be ~1)")
print(f"  - Min: {X_train_scaled.min():.6f}")
print(f"  - Max: {X_train_scaled.max():.6f}")


[3/5] Applying Standard Scaling...
(Essential for SVM - distance-based algorithm)


✓ Data scaled (mean=0, variance=1)

Scaling verification (training set):
  - Mean: -0.000000 (should be ~0)
  - Std: 1.000000 (should be ~1)
  - Min: -11.296430
  - Max: 393.618258
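
The maximum of about 393.6 is far larger than a typical z-score. One plausible explanation, which the notebook does not check, is a binary indicator column that equals 1 in only a handful of training rows: if a fraction `p` of rows are 1, the standardized value of a 1 is `sqrt((1 - p) / p)`. The sketch below builds such a column with 3 ones among 464,809 rows (the count of 3 is an assumption chosen for illustration) and standardizes it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

n_rows = 464_809  # training-set size reported by the split above
n_ones = 3        # assumed number of rows where the indicator is 1 (hypothetical)

col = np.zeros((n_rows, 1))
col[:n_ones] = 1.0

scaled = StandardScaler().fit_transform(col)
closed_form = np.sqrt((n_rows - n_ones) / n_ones)  # sqrt((1 - p) / p) with p = n_ones / n_rows

print(f"Scaled value of a 1: {scaled.max():.6f}")
print(f"sqrt((1 - p) / p):   {closed_form:.6f}")
```

Under this assumption the scaled value lands very close to the maximum reported above, which suggests, but does not prove, that a very rare indicator column is responsible. Such extreme values are worth keeping in mind when choosing SVM hyperparameters.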


## 8. Feature Statistics Comparison

Compare feature statistics before and after scaling.

In [8]:
print("\nFeature statistics comparison (first 5 features):")
print("\nBefore scaling:")
print(f"  Mean: {X_train[:, :5].mean(axis=0)}")
print(f"  Std:  {X_train[:, :5].std(axis=0)}")

print("\nAfter scaling:")
print(f"  Mean: {X_train_scaled[:, :5].mean(axis=0)}")
print(f"  Std:  {X_train_scaled[:, :5].std(axis=0)}")


Feature statistics comparison (first 5 features):

Before scaling:
  Mean: [2959.51065922  155.82856399   14.10456123  269.34836029   46.42075347]
  Std:  [280.02477598 111.97966415   7.48716583 212.38921881  58.23262633]

After scaling:
  Mean: [-3.73055182e-16  5.45403220e-17 -1.37429717e-15  1.11315952e-15
 -7.12673902e-18]
  Std:  [1. 1. 1. 1. 1.]


## 9. Dimensionality Note

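With only 54 features, PCA is not needed for dimensionality reduction. The training cost of LinearSVC grows roughly linearly with the number of features, so this dimensionality is not a bottleneck. The `PCA` import in Section 1 is therefore unused in this notebook.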

In [9]:
print(f"\nDimensionality: {X_train_scaled.shape[1]} features")
print("Note: LinearSVC handles 54 dimensions efficiently.")
print("PCA not required for this dataset size.")


Dimensionality: 54 features
Note: LinearSVC handles 54 dimensions efficiently.
PCA not required for this dataset size.
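
If dimensionality reduction were wanted later, a sketch like the following could be used. It is not part of this notebook's pipeline; the synthetic data, the `_demo` names, and the 95% variance threshold are assumptions for illustration. PCA is fitted on the scaled training data only and then applied to the test data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))
X_tr, X_te = X_demo[:400], X_demo[400:]

scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches at least that share.
pca = PCA(n_components=0.95).fit(X_tr_s)
X_tr_p, X_te_p = pca.transform(X_tr_s), pca.transform(X_te_s)

print(f"Components kept: {pca.n_components_}")
print(f"Reduced shapes: {X_tr_p.shape}, {X_te_p.shape}")
```

Whether this helps on the real data would need to be measured; rare indicator columns carry little variance and may be discarded by PCA even if they are informative.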


## 10. Save Processed Data

Save both the processed data and the fitted scaler for future use.

In [10]:
print("\n[4/5] Saving processed data...")

output_dir = os.path.join(script_dir, 'svm_implementation', 'data')
os.makedirs(output_dir, exist_ok=True)

data_file = os.path.join(output_dir, 'svm_data.npz')
np.savez_compressed(
    data_file,
    X_train=X_train_scaled,
    X_test=X_test_scaled,
    y_train=y_train,
    y_test=y_test
)

scaler_file = os.path.join(output_dir, 'svm_scaler.joblib')
joblib.dump(scaler, scaler_file)

print(f"✓ Saved processed data to: {data_file}")
print(f"✓ Saved scaler to: {scaler_file}")


[4/5] Saving processed data...


✓ Saved processed data to: C:\PYTHON\AIT511 Course Project 2\archive\svm_implementation\data\svm_data.npz
✓ Saved scaler to: C:\PYTHON\AIT511 Course Project 2\archive\svm_implementation\data\svm_scaler.joblib
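
A downstream training notebook would presumably reload these artifacts. The sketch below round-trips a tiny synthetic example through the same `np.savez_compressed` and `joblib.dump` calls inside a temporary directory; the array contents are placeholders, and only the file names and array keys follow the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the real arrays.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_demo = np.array([1, 2, 1])
scaler_demo = StandardScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "svm_data.npz")
    scaler_path = os.path.join(tmp_dir, "svm_scaler.joblib")

    np.savez_compressed(data_path, X_train=scaler_demo.transform(X_demo), y_train=y_demo)
    joblib.dump(scaler_demo, scaler_path)

    # np.load returns a lazy NpzFile; the context manager closes the file handle.
    with np.load(data_path) as data:
        X_loaded = data["X_train"]
        y_loaded = data["y_train"]
    scaler_loaded = joblib.load(scaler_path)

# The reloaded scaler standardizes new raw rows with the stored training statistics.
new_row = np.array([[2.0, 20.0]])
print(X_loaded.shape, y_loaded.shape)
print(scaler_loaded.transform(new_row))
```

Saving the fitted scaler alongside the arrays matters because any new raw sample must be transformed with the training mean and standard deviation before it is passed to the trained SVM.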


## 11. Summary

In [11]:
print("\n" + "=" * 80)
print("SVM PREPROCESSING COMPLETE")
print("=" * 80)
print(f"✓ Total samples: {len(y):,}")
print(f"✓ Training samples: {len(y_train):,}")
print(f"✓ Test samples: {len(y_test):,}")
print(f"✓ Features: {X_train_scaled.shape[1]}")
print(f"✓ Classes: {len(np.unique(y))}")
print("✓ Scaling: StandardScaler (mean=0, std=1)")
print("\nData ready for SVM training!")


SVM PREPROCESSING COMPLETE
✓ Total samples: 581,012
✓ Training samples: 464,809
✓ Test samples: 116,203
✓ Features: 54
✓ Classes: 7
✓ Scaling: StandardScaler (mean=0, std=1)

Data ready for SVM training!
