# Forest Cover Type Prediction
## Notebook 1: Data Preprocessing

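This notebook preprocesses the forest cover type dataset in the following order:
- Infinite value handling
- Missing value handling (median imputation)
- Skewness detection and log transformation
- Feature engineering
- Feature selection (low-variance filter)
- Feature scaling
- Class imbalance handling using SMOTE

The categorical features (`Wilderness_Area_*`, `Soil_Type_*`) arrive already one-hot encoded, so no separate encoding step is applied.

Final output: a cleaned, scaled, and class-balanced dataset saved as `final_preprocessed_data.csv`, ready for EDA and model building.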


In [47]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE


## 1. Load Dataset


In [49]:
df = pd.read_csv("data/cover_type.csv")
df.head()


Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39,Soil_Type_40,Cover_Type
0,2596,51,3,258,0,510,221,232,148,6279,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Aspen
1,2590,56,2,212,-6,390,220,235,151,6225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Aspen
2,2804,139,9,268,65,3180,234,238,135,6121,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lodgepole Pine
3,2785,155,18,242,118,3090,238,238,122,6211,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lodgepole Pine
4,2595,45,2,153,-1,391,220,234,150,6172,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Aspen


## 2. Basic Inspection


In [50]:
print(df.shape)
df.info()  # prints its report directly; wrapping it in print() would also print "None"
df.isnull().sum()


(145890, 55)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145890 entries, 0 to 145889
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Elevation                           145890 non-null  int64  
 1   Aspect                              145890 non-null  int64  
 2   Slope                               145890 non-null  int64  
 3   Horizontal_Distance_To_Hydrology    145890 non-null  int64  
 4   Vertical_Distance_To_Hydrology      145890 non-null  int64  
 5   Horizontal_Distance_To_Roadways     145890 non-null  int64  
 6   Hillshade_9am                       145890 non-null  int64  
 7   Hillshade_Noon                      145890 non-null  int64  
 8   Hillshade_3pm                       145890 non-null  int64  
 9   Horizontal_Distance_To_Fire_Points  145890 non-null  int64  
 10  Wilderness_Area_1                   145890 non-null  float64
 11  Wilderness_Ar

Elevation                             0
Aspect                                0
Slope                                 0
Horizontal_Distance_To_Hydrology      0
Vertical_Distance_To_Hydrology        0
Horizontal_Distance_To_Roadways       0
Hillshade_9am                         0
Hillshade_Noon                        0
Hillshade_3pm                         0
Horizontal_Distance_To_Fire_Points    0
Wilderness_Area_1                     0
Wilderness_Area_2                     0
Wilderness_Area_3                     0
Wilderness_Area_4                     0
Soil_Type_1                           0
Soil_Type_2                           0
Soil_Type_3                           0
Soil_Type_4                           0
Soil_Type_5                           0
Soil_Type_6                           0
Soil_Type_7                           0
Soil_Type_8                           0
Soil_Type_9                           0
Soil_Type_10                          0
Soil_Type_11                          0


## 3. Separate Features and Target
Target column: Cover_Type


In [51]:
X = df.drop("Cover_Type", axis=1)
y = df["Cover_Type"]


## 4. Handle Infinite Values


In [52]:
X.replace([np.inf, -np.inf], np.nan, inplace=True)
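
As a small illustration of the step above, the following sketch uses a made-up two-column frame (the column names and values are hypothetical, not taken from the dataset) to show how `replace` turns both infinities into `NaN` so that a later imputation step can fill them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [-np.inf, 2.0, 4.0],
})

# Replace +inf and -inf with NaN (the non-inplace form returns a new frame).
cleaned = toy.replace([np.inf, -np.inf], np.nan)

n_inf_before = int(np.isinf(toy.to_numpy()).sum())
n_nan_after = int(cleaned.isna().sum().sum())
print("inf before:", n_inf_before, "NaN after:", n_nan_after)
```

Every infinite entry becomes exactly one missing entry, so the two counts should match.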


## 5. Handle Missing Values (Numerical Only)



In [55]:
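# Fill NaNs (including former infinities) with each column's median.
# fit_transform returns a NumPy array, so rebuild the DataFrame with the original column names.
imputer = SimpleImputer(strategy="median")
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)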
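
The behaviour of median imputation can be checked on a tiny example. The frame below is hypothetical; it only shows that each missing cell receives the median of the observed values in its own column, and that the median is not pulled towards an extreme value such as 100.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 100.0],
    "y": [10.0, 20.0, np.nan, 40.0],
})

imputer = SimpleImputer(strategy="median")
# fit_transform returns a float NumPy array, so the DataFrame is rebuilt by hand.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(imputer.statistics_)  # per-column medians learned during fit
print(filled)
```

Passing `index=toy.index` is optional here; it matters only when the input frame does not have a default `RangeIndex`.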


## 6. Check Remaining NaN or Infinite Values


In [56]:
print("NaN count:", X.isnull().sum().sum())
print("Inf count:", np.isinf(X).sum().sum())


NaN count: 0
Inf count: 0


## 7. Skewness Detection


In [57]:
skewness = X.skew().sort_values(ascending=False)
skewness


Soil_Type_25                          381.955495
Soil_Type_28                          127.308025
Soil_Type_36                          120.773755
Soil_Type_27                           98.606288
Soil_Type_21                           95.474146
Soil_Type_34                           81.415601
Soil_Type_37                           65.482599
Soil_Type_26                           51.949230
Soil_Type_35                           37.779944
Soil_Type_7                            37.235169
Soil_Type_5                            29.685031
Soil_Type_14                           29.330416
Soil_Type_8                            28.496418
Soil_Type_31                           20.891122
Soil_Type_1                            20.198232
Soil_Type_11                           18.877105
Soil_Type_13                           17.421283
Wilderness_Area_2                      17.011000
Soil_Type_17                           15.342471
Soil_Type_33                           15.291941
Soil_Type_2         
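
The largest values in this list belong to one-hot indicator columns. For a 0/1 column in which a fraction `p` of the rows equal 1, the population skewness is `(1 - 2p) / sqrt(p(1 - p))`, which grows without bound as `p` approaches 0. The sketch below checks this on a synthetic indicator; the row count and the number of ones are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rare indicator: 10 ones among 100,000 rows.
n_rows = 100_000
n_ones = 10
rare = pd.Series(np.r_[np.ones(n_ones), np.zeros(n_rows - n_ones)])

p = rare.mean()
theoretical = (1 - 2 * p) / np.sqrt(p * (1 - p))  # skewness of a Bernoulli(p) variable
sample = rare.skew()                               # pandas' bias-adjusted sample skewness

print("theoretical:", theoretical)
print("sample:", sample)
```

A very large skewness on a `Soil_Type_*` column therefore indicates that the soil type is rare, not that the column has a long-tailed distribution that a transform could fix.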

## 8. Skewness Treatment (Log Transform)


In [60]:
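# log1p only reduces right (positive) skew, so select columns with skewness > 1.
# Binary indicators are skipped: log1p maps 0/1 to 0/ln 2, a pure rescaling
# that leaves their skewness unchanged.
binary_cols = [col for col in X.columns if X[col].nunique() <= 2]
skewed_cols = skewness[skewness > 1].index.difference(binary_cols)

for col in skewed_cols:
    if (X[col] >= 0).all():  # log1p requires non-negative input here
        X[col] = np.log1p(X[col])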
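
To see what the log transform does to different kinds of columns, the following sketch compares a synthetic right-skewed continuous column with a synthetic 0/1 indicator. Both series are made up for illustration (a log-normal sample and 50 ones among 10,000 rows).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns: one continuous with a long right tail, one binary indicator.
continuous = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))
binary = pd.Series(np.r_[np.ones(50), np.zeros(9_950)])

skew_cont_before = continuous.skew()
skew_cont_after = np.log1p(continuous).skew()
skew_bin_before = binary.skew()
skew_bin_after = np.log1p(binary).skew()

print("continuous:", skew_cont_before, "->", skew_cont_after)
print("binary:    ", skew_bin_before, "->", skew_bin_after)
```

The continuous column becomes less skewed, while the indicator keeps the same skewness because `log1p` only rescales its two distinct values.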



## 9. Feature Engineering



In [61]:
X["Hydrology_Ratio"] = X["Horizontal_Distance_To_Hydrology"] / (
    X["Vertical_Distance_To_Hydrology"].abs() + 1
)
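
The `+ 1` in the denominator keeps the ratio finite when the vertical distance is zero. The sketch below applies the same formula to the raw values of the first three rows shown in the `df.head()` output; in the notebook itself the inputs may already have been altered by the earlier transformation steps, so the numbers are illustrative only.

```python
import pandas as pd

# Raw values copied from the first three rows of the df.head() output.
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [258.0, 212.0, 268.0],
    "Vertical_Distance_To_Hydrology": [0.0, -6.0, 65.0],
})

# Same formula as the notebook cell: horizontal / (|vertical| + 1).
toy["Hydrology_Ratio"] = toy["Horizontal_Distance_To_Hydrology"] / (
    toy["Vertical_Distance_To_Hydrology"].abs() + 1
)

print(toy)
```

Because the denominator is always at least 1, the ratio cannot be infinite as long as both inputs are finite.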



## 10. Handle New Infinite / NaN Values (Post Feature Engineering)


In [62]:
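# Safeguard: the "+ 1" in the denominator already prevents division by zero,
# but re-run the clean-up in case the new feature introduced non-finite values.
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)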

print("NaN:", X.isnull().sum().sum())
print("Inf:", np.isinf(X).sum().sum())


NaN: 0
Inf: 0


## 11. Feature Selection (Low Variance Filter)


In [64]:
from sklearn.feature_selection import VarianceThreshold

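# Keep only features whose variance exceeds 0.01.
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
# get_support() returns a boolean mask of the retained columns.
X = pd.DataFrame(X_selected, columns=X.columns[selector.get_support()])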
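
A minimal sketch of how the low-variance filter behaves, using a hypothetical three-column frame. For an untransformed 0/1 column the variance is `p(1 - p)`, so a threshold of 0.01 removes indicators in which roughly fewer than 1% of rows equal 1.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy features with very different variances.
toy = pd.DataFrame({
    "varied": [0.0, 1.0, 2.0, 3.0],
    "nearly_constant": [1.0, 1.0, 1.0, 1.01],
    "constant": [5.0, 5.0, 5.0, 5.0],
})

selector = VarianceThreshold(threshold=0.01)
kept_values = selector.fit_transform(toy)

# Map the boolean support mask back to column names.
kept_columns = toy.columns[selector.get_support()]
kept = pd.DataFrame(kept_values, columns=kept_columns)

print(list(kept_columns))
```

The filter has to run before standardization, as it does in this notebook, because every standardized column has unit variance and none would fall below the threshold.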


## 12. Feature Scaling


In [65]:
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
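
The cell above fits the scaler on all rows. When a held-out test set is needed later, a common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transform. The sketch below shows that variant on synthetic data; it is an assumption about how one might restructure the step, not what this notebook does, and the feature names and labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features and imbalanced labels.
X_toy = pd.DataFrame(
    rng.normal(loc=50.0, scale=10.0, size=(200, 3)), columns=["f1", "f2", "f3"]
)
y_toy = pd.Series(np.r_[np.zeros(150), np.ones(50)].astype(int))

# Split first, keeping class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Fit on the training rows only, then reuse the same statistics for the test rows.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.mean().round(6).tolist())
```

With this ordering, any oversampling would also be applied to the training part only.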


## 13. Class Imbalance Handling (SMOTE)


In [67]:
print("Before SMOTE:\n", y.value_counts())

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("After SMOTE:\n", pd.Series(y_resampled).value_counts())


Before SMOTE:
 Cover_Type
Lodgepole Pine       103071
Spruce/Fir            31110
Aspen                  3069
Krummholz              2160
Ponderosa Pine         2160
Douglas-fir            2160
Cottonwood/Willow      2160
Name: count, dtype: int64
After SMOTE:
 Cover_Type
Aspen                103071
Lodgepole Pine       103071
Spruce/Fir           103071
Krummholz            103071
Ponderosa Pine       103071
Douglas-fir          103071
Cottonwood/Willow    103071
Name: count, dtype: int64
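
In the output above, each of the seven classes ends at 103,071 rows, so the classes that started with 2,160 rows now consist almost entirely of synthetic points. SMOTE creates such points by interpolating between a minority sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates only that interpolation idea; it is a simplified stand-in, not the `imblearn` implementation, and the function name `smote_like_samples` is hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority points
    and one of their k nearest minority neighbours (simplified illustration)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample.
    base_idx = rng.integers(0, len(minority), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between them.
    lam = rng.random((n_new, 1))
    return minority[base_idx] + lam * (minority[nb_idx] - minority[base_idx])

# Hypothetical minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6)
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies.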


## 14. Save Final Preprocessed Dataset


In [69]:
final_df = pd.concat([X_resampled, pd.Series(y_resampled, name="Cover_Type")], axis=1)
final_df.to_csv("final_preprocessed_data.csv", index=False)

print("Saved: final_preprocessed_data.csv")


Saved: final_preprocessed_data.csv
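
The `index=False` argument matters when the file is read back. The sketch below writes a hypothetical two-row frame to in-memory buffers with and without the index; when the index is written, `read_csv` loads it as an extra column named `Unnamed: 0`.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the final dataset.
toy = pd.DataFrame({"f1": [0.1, 0.2], "Cover_Type": ["Aspen", "Krummholz"]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If a file was saved with its index, passing `index_col=0` to `read_csv` restores it as the index instead of a feature column.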
