# Logistic Regression - Data Preprocessing
## Forest Cover Type Dataset

This notebook prepares data for Logistic Regression modeling by:
- Loading the raw dataset
- Splitting into training and test sets
- Applying standard scaling to all features
- Saving processed data for model training

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import os
import joblib

print("Libraries imported successfully")

Libraries imported successfully


## 2. Configuration

In [2]:
print("=" * 80)
print("LOGISTIC REGRESSION PREPROCESSING")
print("=" * 80)

LOGISTIC REGRESSION PREPROCESSING


## 3. Locate Dataset

The notebook checks several candidate locations for `covtype.csv` and uses the first one that exists. If none exists, it prints the paths it tried and raises `FileNotFoundError`.

In [3]:
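# A notebook has no __file__, so these paths are resolved against the
# current working directory: two levels up, and the level above that.
script_dir = os.path.abspath('../..')
parent_dir = os.path.dirname(script_dir)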

possible_paths = [
    os.path.join(script_dir, 'covtype.csv'),
    os.path.join(parent_dir, 'covtype.csv'),
    'covtype.csv',
    '../covtype.csv',
    '../../covtype.csv'
]

csv_path = None
for path in possible_paths:
    if os.path.exists(path):
        csv_path = path
        break

if csv_path is None:
    print("Error: covtype.csv not found!")
    print("Checked paths:")
    for path in possible_paths:
        print(f"  - {path}")
    raise FileNotFoundError("covtype.csv not found in any expected location")

print(f"✓ Found dataset at: {csv_path}")

✓ Found dataset at: C:\PYTHON\AIT511 Course Project 2\archive\covtype.csv
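
The first-match search above can also be packaged as a small helper, which makes it easier to reuse and test. This is only a sketch: `find_first_existing` is a hypothetical name, not part of the notebook, and the temporary directory stands in for the real project layout.

```python
import os
import tempfile


def find_first_existing(candidates):
    """Return the first candidate path that exists, or None if none do."""
    return next((p for p in candidates if os.path.exists(p)), None)


# Demonstration in a throwaway directory (not the real project layout).
with tempfile.TemporaryDirectory() as tmp_dir:
    real_file = os.path.join(tmp_dir, "covtype.csv")
    open(real_file, "w").close()  # create an empty placeholder file

    candidates = [
        os.path.join(tmp_dir, "missing", "covtype.csv"),  # does not exist
        real_file,                                        # exists
    ]
    found = find_first_existing(candidates)
    found_is_real = found == real_file
    nothing_found = find_first_existing([os.path.join(tmp_dir, "nope.csv")])

print("first existing candidate matched:", found_is_real)
print("result when nothing exists:", nothing_found)
```

Returning `None` instead of raising leaves the choice of error message to the caller, as the cell above already does.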


## 4. Load Dataset

In [4]:
print("\n[1/5] Loading dataset...")
df = pd.read_csv(csv_path)
print(f"✓ Dataset loaded: {df.shape}")
print(f"  - Rows: {df.shape[0]:,}")
print(f"  - Columns: {df.shape[1]}")

print("\nFirst few rows:")
df.head()


[1/5] Loading dataset...


✓ Dataset loaded: (581012, 55)
  - Rows: 581,012
  - Columns: 55

First few rows:


,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,2596,51,3,258,0,510,221,232,148,6279,...,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,...,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,...,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,...,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,...,0,0,0,0,0,0,0,0,0,5


## 5. Prepare Features and Target

The last column, `Cover_Type`, is the target variable; the remaining 54 columns are features.

In [5]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget classes: {np.unique(y)}")
print(f"Number of classes: {len(np.unique(y))}")

Features shape: (581012, 54)
Target shape: (581012,)

Target classes: [1 2 3 4 5 6 7]
Number of classes: 7
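
The positional slicing above depends on `Cover_Type` being the last column. A minimal sketch of selecting the target by name instead, which would keep working if the column order changed. The three-row frame is made up for illustration, and `toy_df`, `X_toy`, and `y_toy` are hypothetical names.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative).
toy_df = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Slope": [3, 2, 9],
    "Cover_Type": [5, 5, 2],
})

target_col = "Cover_Type"

# Drop the target by name to get the features; read the target by name.
X_toy = toy_df.drop(columns=[target_col]).to_numpy()
y_toy = toy_df[target_col].to_numpy()

print(f"Features shape: {X_toy.shape}")
print(f"Target shape: {y_toy.shape}")
```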


## 6. Split Data

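We use an 80/20 split, stratified on the target so that both sets keep the same class proportions. This matters here because the classes are heavily imbalanced: the rarest class makes up less than 0.5% of the samples.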

In [6]:
print("\n[2/5] Splitting data (80% train, 20% test)...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print("✓ Data split complete")
print(f"  - Training samples: {X_train.shape[0]:,}")
print(f"  - Test samples: {X_test.shape[0]:,}")
print(f"  - Features: {X_train.shape[1]}")

print("\nClass distribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
for cls, count in zip(unique, counts):
    percentage = (count / len(y_train)) * 100
    print(f"  Class {cls}: {count:6,} ({percentage:5.2f}%)")


[2/5] Splitting data (80% train, 20% test)...


✓ Data split complete
  - Training samples: 464,809
  - Test samples: 116,203
  - Features: 54

Class distribution in training set:
  Class 1: 169,472 (36.46%)
  Class 2: 226,640 (48.76%)
  Class 3: 28,603 ( 6.15%)
  Class 4:  2,198 ( 0.47%)
  Class 5:  7,594 ( 1.63%)
  Class 6: 13,894 ( 2.99%)
  Class 7: 16,408 ( 3.53%)
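
The cell above prints class shares for the training set only. One way to confirm that stratification did what was intended is to compare the shares in both sets. The sketch below does this on synthetic imbalanced labels, not the Forest Cover data; `class_shares`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced labels: 70% / 25% / 5% (illustrative only).
y_demo = np.repeat([1, 2, 3], [700, 250, 50])
X_demo = rng.normal(size=(len(y_demo), 4))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)


def class_shares(labels):
    """Map each class label to its fraction of the given label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))


train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)

# Largest absolute difference in class share between the two sets.
max_gap = max(abs(train_shares[c] - test_shares[c]) for c in train_shares)
print(f"Largest train/test share gap: {max_gap:.4f}")
```

In the notebook itself, the same function could be applied to `y_train` and `y_test`; a gap close to zero indicates the split preserved the class distribution.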


## 7. Feature Scaling

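Logistic Regression benefits from standardized features (each feature rescaled to mean 0 and standard deviation 1): gradient-based solvers tend to converge faster, and regularization penalizes all coefficients on a comparable scale.
The scaler is fit on the training data only and then applied to both the training and test sets, so no test-set statistics leak into preprocessing.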

In [7]:
print("\n[3/5] Applying Standard Scaling...")

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✓ Data scaled (mean=0, variance=1)")

print("\nScaling verification (training set):")
print(f"  - Mean: {X_train_scaled.mean():.6f} (should be ~0)")
print(f"  - Std: {X_train_scaled.std():.6f} (should be ~1)")
print(f"  - Min: {X_train_scaled.min():.6f}")
print(f"  - Max: {X_train_scaled.max():.6f}")


[3/5] Applying Standard Scaling...


✓ Data scaled (mean=0, variance=1)

Scaling verification (training set):
  - Mean: -0.000000 (should be ~0)
  - Std: 1.000000 (should be ~1)
  - Min: -11.296430
  - Max: 393.618258
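
The verification above averages over the whole scaled matrix, which can hide a single badly scaled column. The sketch below, on made-up data with hypothetical names (`X_demo`, `rare_flag`), checks each column separately and shows why a rarely set 0/1 indicator produces a very large scaled value. `StandardScaler` uses the population standard deviation, so a 0/1 column with k ones among n rows maps its ones to sqrt((n - k) / k). With n = 464,809 and an assumed k = 3, this gives about 393.618, which is consistent with the Max reported above, although the notebook does not itself confirm which column produces it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

n_rows, n_ones = 1000, 4  # illustrative sizes, not the real data
rng = np.random.default_rng(0)

# One continuous column and one rarely set binary indicator.
continuous = rng.normal(loc=2500.0, scale=300.0, size=n_rows)
rare_flag = np.zeros(n_rows)
rare_flag[:n_ones] = 1.0
X_demo = np.column_stack([continuous, rare_flag])

X_scaled = StandardScaler().fit_transform(X_demo)

# Per-column statistics (axis=0) rather than one global number.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print("per-column mean:", np.round(col_means, 6))
print("per-column std: ", np.round(col_stds, 6))

# A 0/1 column with k ones among n rows maps its ones to sqrt((n - k) / k).
expected_max = np.sqrt((n_rows - n_ones) / n_ones)
observed_max = X_scaled[:, 1].max()
print(f"scaled rare flag: {observed_max:.6f} (formula: {expected_max:.6f})")

# Same formula with the training-set size from this notebook; k = 3 is an assumption.
implied_max = np.sqrt((464_809 - 3) / 3)
print(f"n = 464,809, k = 3 gives {implied_max:.6f}")
```

Such extreme values are a side effect of scaling sparse binary columns. Whether to scale them at all is a modeling decision outside the scope of this notebook.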


## 8. Save Processed Data

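The scaled arrays are written to a compressed `.npz` archive and the fitted scaler to a `joblib` file, so that model training can load both without repeating the preprocessing.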

In [8]:
print("\n[4/5] Saving processed data...")

output_dir = os.path.join(script_dir, 'data_logistic')
os.makedirs(output_dir, exist_ok=True)

data_file = os.path.join(output_dir, 'logistic_data.npz')
np.savez_compressed(
    data_file,
    X_train=X_train_scaled,
    X_test=X_test_scaled,
    y_train=y_train,
    y_test=y_test
)

scaler_file = os.path.join(output_dir, 'scaler.joblib')
joblib.dump(scaler, scaler_file)

print(f"✓ Saved processed data to: {data_file}")
print(f"✓ Saved scaler to: {scaler_file}")


[4/5] Saving processed data...


✓ Saved processed data to: C:\PYTHON\AIT511 Course Project 2\archive\data_logistic\logistic_data.npz
✓ Saved scaler to: C:\PYTHON\AIT511 Course Project 2\archive\data_logistic\scaler.joblib
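
For reference, a minimal sketch of how the saved artifacts might be read back before training. It is a round trip on small random data in a temporary directory, so it does not touch the real files. The file names and array keys mirror those used above; everything else (`X_demo`, `scaler_demo`, and so on) is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))        # stand-in features
y_demo = rng.integers(1, 8, size=20)     # stand-in labels in 1..7
scaler_demo = StandardScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "logistic_data.npz")
    scaler_path = os.path.join(tmp_dir, "scaler.joblib")

    # Save in the same way as the cell above.
    np.savez_compressed(
        data_path, X_train=scaler_demo.transform(X_demo), y_train=y_demo
    )
    joblib.dump(scaler_demo, scaler_path)

    # Load back: arrays are looked up by the keyword names used when saving.
    with np.load(data_path) as data:
        X_loaded = data["X_train"]
        y_loaded = data["y_train"]
    scaler_loaded = joblib.load(scaler_path)

# The reloaded scaler should reproduce the saved scaled features.
roundtrip_ok = np.allclose(scaler_loaded.transform(X_demo), X_loaded)
print(X_loaded.shape, y_loaded.shape, roundtrip_ok)
```

Keeping the scaler alongside the data means new, unscaled samples can be transformed with the same training-set statistics at prediction time.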


## 9. Summary

In [9]:
print("\n" + "=" * 80)
print("PREPROCESSING COMPLETE")
print("=" * 80)
print(f"✓ Total samples: {len(y):,}")
print(f"✓ Training samples: {len(y_train):,}")
print(f"✓ Test samples: {len(y_test):,}")
print(f"✓ Features: {X_train.shape[1]}")
print(f"✓ Classes: {len(np.unique(y))}")
print("\nData ready for Logistic Regression training!")


PREPROCESSING COMPLETE
✓ Total samples: 581,012
✓ Training samples: 464,809
✓ Test samples: 116,203
✓ Features: 54
✓ Classes: 7

Data ready for Logistic Regression training!
