First, we load the Forest Fire dataset into a Pandas DataFrame:


In [2]:
import pandas as pd

# Define column names
columns = [
    'coord_x', 'coord_y', 'month', 'day', 'ffmc', 'dmc', 'dc', 'isi', 'temp', 'rh', 'wind', 'rain', 'area'
]

# Load the dataset
fires_dt = pd.read_csv(r'C:\Users\anya8\05_src\data\forest+fires\forestfires.csv', header=0, names=columns)

# Display dataset info to verify successful loading
fires_dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   coord_x  517 non-null    int64  
 1   coord_y  517 non-null    int64  
 2   month    517 non-null    object 
 3   day      517 non-null    object 
 4   ffmc     517 non-null    float64
 5   dmc      517 non-null    float64
 6   dc       517 non-null    float64
 7   isi      517 non-null    float64
 8   temp     517 non-null    float64
 9   rh       517 non-null    int64  
 10  wind     517 non-null    float64
 11  rain     517 non-null    float64
 12  area     517 non-null    float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB


The info() output confirms that the file loaded correctly: 517 rows, 13 columns, and no missing values. The month and day columns are stored as strings (object dtype); all other columns are numeric.

2. Create X and Y

Next, we separate the features and target:

In [3]:
# Features (X) and target (y)
X = fires_dt.drop('area', axis=1)
y = fires_dt['area']

3. Preprocessing the Data

ColumnTransformer 1: Simple Processing (Scaling and One-Hot Encoding)

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

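# Define columns for scaling and one-hot encoding.
# Note: coord_x and coord_y are not listed here, so ColumnTransformer
# drops them (its default is remainder='drop').
numeric_features = ['ffmc', 'dmc', 'dc', 'isi', 'temp', 'rh', 'wind', 'rain']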
categorical_features = ['month', 'day']

# ColumnTransformer for simple processing
preproc1 = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  # Ignore categories not seen during fitting
    ])

ColumnTransformer 2: Transformation (Scaling and Non-Linear Transformation)


In [5]:
from sklearn.preprocessing import PolynomialFeatures

# ColumnTransformer for transformation
preproc2 = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2, include_bias=False))
        ]), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  # Ignore categories not seen during fitting
    ])
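
To make the difference between the two preprocessors concrete, the sketch below counts the output columns each one produces. It is only an illustration: the DataFrame demo holds random stand-in values rather than the real forest fire data, and the names demo, simple, and poly are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

numeric_features = ['ffmc', 'dmc', 'dc', 'isi', 'temp', 'rh', 'wind', 'rain']
categorical_features = ['month', 'day']

# Synthetic stand-in data (random values, NOT the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(20, 8)), columns=numeric_features)
demo['month'] = ['aug', 'sep'] * 10                # 2 distinct months
demo['day'] = ['fri', 'sat', 'sun', 'mon'] * 5     # 4 distinct days

# Same layout as the simple preprocessor: scale + one-hot encode
simple = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

# Same layout as the second preprocessor: scale, then degree-2 polynomial terms
poly = ColumnTransformer([
    ('num', Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

n_simple = simple.fit_transform(demo).shape[1]
n_poly = poly.fit_transform(demo).shape[1]
print(f'simple: {n_simple} columns, polynomial: {n_poly} columns')
```

With 8 numeric inputs, the degree-2 expansion produces 8 linear terms, 8 squared terms, and 28 pairwise products (44 numeric columns in total), so the second preprocessor feeds the model far more features than the first. The one-hot column count depends on how many distinct months and days appear in the data.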

4. Creating Models and Pipelines

In [12]:

# Categorical features (columns with 'object' type)
categorical_features = ['month', 'day']

# Numerical features (columns with 'int64' or 'float64' type)
numerical_features = ['coord_x', 'coord_y', 'ffmc', 'dmc', 'dc', 'isi', 'temp', 'rh', 'wind', 'rain']

# 1. Column Transformer with OneHotEncoder for categorical features and StandardScaler for numerical features
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Apply preprocessing to both categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),  # Handle unknown categories in test set
        ('num', StandardScaler(), numerical_features)  # Standard scaling for numerical features
    ]
)

# 2. Create a pipeline for the model
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Define the model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing steps (transform data)
    ('regressor', LinearRegression())  # Apply the regression model (LinearRegression in this case)
])

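# Features and target (same definitions as in step 2)
X = fires_dt.drop('area', axis=1)  # Features (exclude 'area' as it's the target)
y = fires_dt['area']  # Target variable (the 'area' column)

# Fit once on the full dataset to check that the pipeline runs end to end;
# the model is refit on the training split in the next step
model_pipeline.fit(X, y)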

Explanation:

Categorical Features: month and day are categorical variables (months of the year and days of the week). They are one-hot encoded, and handle_unknown='ignore' prevents an error if a category appears at prediction time that was not present during fitting.

Numerical Features: The remaining ten columns (coord_x, coord_y, ffmc, dmc, dc, isi, temp, rh, wind, rain) are numerical and are standardized with StandardScaler.

Pipeline: The pipeline first runs the ColumnTransformer, which processes the categorical and numerical features separately, and then fits a LinearRegression on the transformed output.

5. Training the Model and Evaluating Its Performance

In [13]:

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training data
model_pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = model_pipeline.predict(X_test)

# Evaluate the model's performance using RMSE
from sklearn.metrics import mean_squared_error
import numpy as np

# Calculate RMSE (Root Mean Squared Error)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the RMSE
print(f'Root Mean Squared Error (RMSE): {rmse}')


Root Mean Squared Error (RMSE): 107.7639392141188


Explanation:

Data Splitting: We split the data into a training set (80%) and a test set (20%), so the model is evaluated on rows it did not see during training. Setting random_state=42 makes the split reproducible.

Model Training: The model_pipeline.fit(X_train, y_train) command fits both the preprocessing steps and the regression model on the training data only.

Model Prediction: The model_pipeline.predict(X_test) command predicts the target values (area) on the test data.

RMSE Calculation: mean_squared_error returns the mean squared error; taking its square root gives the RMSE, which is expressed in the same units as area and so can be compared directly with the scale of the target.
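
An RMSE value is easier to judge against a reference point. One common choice, sketched below, is a baseline that always predicts the training-set mean. This is an illustration on synthetic, right-skewed data, not a result for the forest fire dataset; X_demo, y_demo, and baseline are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a right-skewed target (NOT the real dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.exponential(scale=10.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always predict the training mean
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
print(f'Baseline RMSE: {baseline_rmse:.3f}')
```

If a fitted model's test RMSE is not clearly below the baseline's, the features are contributing little predictive information under that model.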

In [14]:
# Check statistics of the target variable (area)
print(y.describe())

count     517.000000
mean       12.847292
std        63.655818
min         0.000000
25%         0.000000
50%         0.520000
75%         6.570000
max      1090.840000
Name: area, dtype: float64


6. Evaluation of the Best Model

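This step reports the evaluation metrics. RMSE was already computed on the test set above; here we add R² for completeness. Note that the cell below predicts on the full dataset (X, y), which includes the training rows, so this R² is not a held-out estimate and is not directly comparable to the test-set RMSE.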

In [15]:
from sklearn.metrics import r2_score

# Predict on the full dataset (training and test rows combined)
y_pred = model_pipeline.predict(X)

# Calculate R²
r2 = r2_score(y, y_pred)
print(f"R² Score: {r2}")


R² Score: 0.03203364342222037


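Conclusion:

1. I built a linear regression model inside a scikit-learn Pipeline that one-hot encodes month and day and standardizes the numerical features.

2. I computed RMSE and R² as the evaluation metrics. The RMSE on the 20% test set is about 107.76, and the R² on the full dataset is about 0.032, meaning the model explains only about 3% of the variance in area.

3. The model went through all stages, including data transformation, pipeline creation, and error calculation. The target is strongly right-skewed (median 0.52, maximum 1090.84), and the test RMSE is larger than the target's overall standard deviation (63.66), so a plain linear model on the raw area values predicts poorly here.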