# Forest Cover Type — EDA & Preprocessing
**Filename:** `Forest_Cover_Type_MT2025065.ipynb` (created programmatically)

This notebook performs exploratory data analysis (EDA) and preprocessing on the Forest Cover dataset found in the provided training files. The goal is to prepare clean, reproducible preprocessing steps so training models (including neural networks) is straightforward.

**What this notebook contains (high level):**
- Data loading and initial inspection
- Missing values & dtype checks
- Target distribution and class balance
- Numeric summaries, feature correlations (sampled)
- Suggested preprocessing pipelines (imputation, scaling, encoding)
- Example: create training/validation splits and save preprocessed arrays for model training

---


## 1) Dataset loading
We loaded the file `covtype.csv` from the provided data. The raw dataframe shape is **(581012, 55)**.

**Guessed target column:** `Cover_Type`

> If this guess is incorrect, edit the cell where `target_col` is set to the correct column name.


In [None]:
# Inspect dataset (head, dtypes, missing values, basic stats)
import pandas as pd
import numpy as np
from pathlib import Path
csv_path = r"/mnt/data/archive/covtype.csv"
df = pd.read_csv(csv_path)
print('Shape:', df.shape)
# show head
print(df.head())
# dtypes
print('\nDtypes:')
print(df.dtypes)
# missing
print('\nMissing counts:')
missing = df.isnull().sum()
print(missing[missing > 0])
# basic stats
print('\nDescribe:')
print(df.describe(include='all').T.head(20))


## 2) EDA highlights & plots
We will:
- Inspect target distribution
- Visualize distributions of key numeric variables (histograms)
- Check correlations for numeric features (sampled to keep compute reasonable)

Domain note: the original UCI Covertype dataset has 10 quantitative features plus 44 binary indicator columns (4 wilderness areas and 40 soil types), which together with the target give the 55 columns seen above. Most versions already encode these indicators as 0/1. If your file uses a different encoding, convert the categorical indicators accordingly.
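If a file stores an indicator group as a single categorical column instead of 0/1 columns, the conversion could look like the sketch below. The column name `Wilderness_Area` and the toy values are hypothetical and are not taken from `covtype.csv`.

```python
import pandas as pd

# Toy frame with a hypothetical single categorical column (not from covtype.csv).
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785],
    "Wilderness_Area": ["area_1", "area_2", "area_1", "area_3"],
})

# One-hot encode the categorical column into 0/1 integer indicators.
encoded = pd.get_dummies(toy, columns=["Wilderness_Area"], dtype=int)

# Sanity check: each row should have exactly one active indicator.
indicator_cols = [c for c in encoded.columns if c.startswith("Wilderness_Area_")]
rows_ok = bool((encoded[indicator_cols].sum(axis=1) == 1).all())

print(encoded)
print("Exactly one indicator per row:", rows_ok)
```

The same row-sum check can be run on the real indicator groups to confirm that they are mutually exclusive before treating them as one-hot features.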


In [None]:
# EDA: target distribution, histograms, correlations (sampled)
import matplotlib.pyplot as plt
import seaborn as sns
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
# Guess the target column by name (expects something like 'Cover_Type')
target_col = None
for c in df.columns:
    if 'cover' in c.lower() and 'type' in c.lower():
        target_col = c
        break
print('Target column guess:', target_col)
if target_col is not None:
    print('\nClass distribution:')
    print(df[target_col].value_counts().sort_index())

# Histograms for first 8 numeric columns
cols_to_plot = numeric_cols[:8]
for c in cols_to_plot:
    plt.figure(figsize=(6,2.5))
    plt.hist(df[c].dropna(), bins=40)
    plt.title(c)
    plt.tight_layout()
    plt.show()

# Correlation heatmap (sample up to 5000 rows)
sample = df[numeric_cols].sample(n=min(5000, len(df)), random_state=42)
corr = sample.corr()
plt.figure(figsize=(10,8))
plt.imshow(corr, aspect='auto')
plt.colorbar()
plt.title('Correlation matrix (numeric features, sampled)')
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.index)), corr.index)
plt.tight_layout()
plt.show()


## 3) Preprocessing strategy (templates)
We provide two common preprocessing pipelines — **Pipeline A (Standard scaling)** and **Pipeline B (MinMax scaling)** — and show how to apply them.

Justifications:
- **StandardScaler** rescales each feature to zero mean and unit variance. It suits roughly Gaussian features and models that are sensitive to feature scale or benefit from centered inputs (e.g., regularized linear models, SVMs, neural networks).
- **MinMaxScaler** rescales each feature to a fixed range ([0, 1] by default). It suits bounded features and models that expect inputs in a fixed range, but it is sensitive to outliers because the range is set by the observed minimum and maximum.
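To make the difference between the two scalers concrete, here is a minimal sketch on a toy column. The numbers are illustrative only and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature column with one large value (illustrative numbers only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)  # zero mean, unit variance
x_mm = MinMaxScaler().fit_transform(x)     # rescaled to the range [0, 1]

print("standardized:", x_std.ravel().round(3))
print("min-max:     ", x_mm.ravel().round(3))
```

With a single large value in the column, min-max scaling compresses the remaining values into a narrow band near 0, which is one reason to inspect the histograms before choosing a scaler.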

We also provide:
- Imputer (median) for numeric features if any missing values exist.
- Pass-through for binary indicator columns (soil/wilderness), which are already 0/1 and usually need no scaling.
- Option to use `ColumnTransformer` and `Pipeline` so you can plug into scikit-learn models directly.
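As a rough illustration of how these pieces fit together, the sketch below chains a `ColumnTransformer` with a classifier in a single `Pipeline`. It runs on synthetic data, and the column names `cont_a`, `cont_b`, and `flag` are hypothetical stand-ins for continuous and indicator features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data; column names are hypothetical.
toy = pd.DataFrame({
    "cont_a": rng.normal(1000.0, 200.0, n),
    "cont_b": rng.normal(50.0, 10.0, n),
    "flag": rng.integers(0, 2, n),
})
toy.loc[::25, "cont_b"] = np.nan  # inject a few missing values
labels = (toy["cont_a"] > 1000.0).astype(int)

# Impute and scale continuous columns; pass the 0/1 column through unchanged.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["cont_a", "cont_b"]),
    ("bin", "passthrough", ["flag"]),
])

# Preprocessing and model in one estimator, so fit/predict apply both steps.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(toy, labels)
train_acc = model.score(toy, labels)
print("Training accuracy on toy data:", round(train_acc, 3))
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are fitted only on whatever data is passed to `fit`, which helps avoid leakage when the pipeline is later used with cross-validation.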


In [None]:
# Preprocessing pipelines: create ColumnTransformer examples and show transformed shape
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Identify numeric and binary (0/1) columns heuristically
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
# If target exists, remove it
if target_col in numeric_cols:
    numeric_cols.remove(target_col)
# Heuristic: binary cols are those with unique values subset of {0,1}
binary_cols = [c for c in numeric_cols if set(df[c].dropna().unique()).issubset({0,1})]
other_numeric = [c for c in numeric_cols if c not in binary_cols]

print('Continuous numeric columns:', len(other_numeric), 'Binary columns (likely indicators):', len(binary_cols))

num_pipeline_std = Pipeline([('imputer', SimpleImputer(strategy='median')),('scaler', StandardScaler())])
num_pipeline_mm = Pipeline([('imputer', SimpleImputer(strategy='median')),('scaler', MinMaxScaler())])

# ColumnTransformer that scales numeric features but leaves binary indicator columns as-is
ct_std = ColumnTransformer([
    ('num', num_pipeline_std, other_numeric),
    ('bin', 'passthrough', binary_cols)
])
ct_mm = ColumnTransformer([
    ('num', num_pipeline_mm, other_numeric),
    ('bin', 'passthrough', binary_cols)
])

# Create X,y and train/test split
if target_col is None:
    raise ValueError('Target column not found automatically; please set target_col to the correct column name in the notebook')
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print('Train shape:', X_train.shape, 'Val shape:', X_val.shape)

# Fit transformers on the training data only, then transform the full train and validation sets
ct_std.fit(X_train)
X_train_std = ct_std.transform(X_train)
X_val_std = ct_std.transform(X_val)
print('After Standard pipeline transform — train shape:', X_train_std.shape, 'val shape:', X_val_std.shape)

ct_mm.fit(X_train)
X_train_mm = ct_mm.transform(X_train)
print('After MinMax pipeline transform — train shape:', X_train_mm.shape)

# Save preprocessed numpy arrays for later modelling convenience (numpy is already imported as np above)
np.save('/mnt/data/X_train_std.npy', X_train_std)
np.save('/mnt/data/X_val_std.npy', X_val_std)
np.save('/mnt/data/y_train.npy', y_train.to_numpy())
np.save('/mnt/data/y_val.npy', y_val.to_numpy())
print('\nSaved preprocessed arrays to /mnt/data: X_train_std.npy, X_val_std.npy, y_train.npy, y_val.npy')


## 4) Next steps (Modeling)
With the preprocessed arrays saved in `/mnt/data`, models can be trained in separate cells or notebooks:

- Load `X_train_std.npy` and `y_train.npy` with `np.load` and pass them to scikit-learn models (LogisticRegression, DecisionTree, RandomForest, SVM, MLPClassifier), all covered in class.
- When training neural networks, use the `StandardScaler`-based arrays, or re-fit `StandardScaler` inside a pipeline so that cross-validation fits the scaler on each training fold only.
- Evaluate using stratified cross-validation and metrics appropriate for multiclass (accuracy, macro-F1, confusion matrix).
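
The evaluation step could be sketched as follows. This example uses a synthetic, imbalanced 3-class problem from `make_classification` as a stand-in, so `X_demo` and `y_demo` are hypothetical names. On the real data they would be replaced by the arrays loaded from `X_train_std.npy` and `y_train.npy`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced 3-class stand-in for the saved arrays.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42,
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
y_pred = cross_val_predict(clf, X_demo, y_demo, cv=cv)

accuracy = float((y_pred == y_demo).mean())
macro_f1 = f1_score(y_demo, y_pred, average="macro")
cm = confusion_matrix(y_demo, y_pred)

print("accuracy:", round(accuracy, 3), "macro-F1:", round(macro_f1, 3))
print(cm)
```

Macro-F1 weights every class equally, so it tends to expose weak performance on rare classes that overall accuracy can hide.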

A good practice is to compare performance for both Standard and MinMax scaled versions to justify choice of scaler for the final models.
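
One way to run that comparison is sketched below, again on synthetic data with hypothetical names (`X_demo`, `y_demo`). A k-nearest-neighbours classifier is used here only because it is sensitive to feature scale. It is an assumption for illustration, not a model prescribed by this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic 3-class stand-in for the raw (unscaled) training features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    # The scaler lives inside the pipeline, so it is re-fitted on each training fold.
    pipe = make_pipeline(scaler, KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1_macro")
    results[name] = float(scores.mean())
    print(f"{name}: macro-F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Using the same folds and the same model for both scalers keeps the comparison fair, since the scaler is then the only thing that changes.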
