# Algerian Forest Fires Dataset
Data Set Information:

The dataset contains 244 instances covering two regions of Algeria: the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest.

122 instances for each region.

The data covers the period from June 2012 to September 2012. The dataset includes 11 attributes and 1 output attribute (class). The 244 instances are classified into fire (138 instances) and not fire (106 instances).

Attribute Information:

Date : (DD/MM/YYYY) Day, month ('june' to 'september'), year (2012)

Weather data observations:

Temp : noon temperature (daily maximum) in degrees Celsius: 22 to 42

RH : Relative Humidity in %: 21 to 90

Ws : Wind speed in km/h: 6 to 29

Rain : daily total in mm: 0 to 16.8

FWI Components:

Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5

Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9

Drought Code (DC) index from the FWI system: 7 to 220.4

Initial Spread Index (ISI) index from the FWI system: 0 to 18.5

Buildup Index (BUI) index from the FWI system: 1.1 to 68

Fire Weather Index (FWI): 0 to 31.1

Classes: two classes, namely Fire and not Fire
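
The documented ranges above can double as a quick sanity check once the data is loaded. The following is a minimal sketch under stated assumptions: the column names (`Temperature`, `RH`, `Ws`, `Rain`) are a guess at how the CSV labels these attributes, `out_of_range` is a hypothetical helper, and the rows are toy values rather than real observations.

```python
import pandas as pd

# Documented (min, max) ranges copied from the attribute list above.
# The column names are assumptions about how the CSV labels each attribute.
documented_ranges = {
    "Temperature": (22, 42),
    "RH": (21, 90),
    "Ws": (6, 29),
    "Rain": (0, 16.8),
}

def out_of_range(frame, ranges):
    """Count, per column, the values that fall outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            # between() is inclusive on both ends by default
            counts[col] = int((~frame[col].between(low, high)).sum())
    return counts

# Toy rows, not real data; the last Temperature value is deliberately invalid.
toy = pd.DataFrame({
    "Temperature": [29, 35, 50],
    "RH": [57, 61, 82],
    "Ws": [18, 13, 22],
    "Rain": [0.0, 1.3, 13.1],
})
print(out_of_range(toy, documented_ranges))
```

A non-zero count would point either to a data-entry problem or to a column that has not yet been converted to a numeric type.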

In [127]:
## importing all the necessary libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
## reading the original data
df = pd.read_csv('Algerian_forest_fires_dataset.csv',header=1) # header=1 skips the first line of the file (the "Bejaia Region Dataset" title) and uses the second line as column names
df.head()

In [None]:
## information about the dataset
df.info()

## Data cleaning

In [None]:
## missing values 
df.isnull().sum()

The file stacks the two regional tables one after the other, with the second region starting at index 122, so we add a new column that records the region:

0 : "Bejaia Region Dataset"

1 : "Sidi-Bel Abbes Region Dataset"

Add the new Region column

In [6]:
df.loc[:121,"Region"]=0   # rows 0-121: Bejaia (.loc slices include the end label)
df.loc[122:,"Region"]=1   # rows 122 onward: Sidi-Bel Abbes
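
The hard-coded index 122 works for this file, but the split point can also be located from the data. The sketch below is one possible approach, not the notebook's method: it assumes the title row of the second region parses as a row with missing values, and it uses a toy frame in place of the real CSV.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw file: two data rows, a title row that parses as
# mostly NaN, a repeated header row, then two more data rows.
raw = pd.DataFrame({
    "day": ["01", "02", "Sidi-Bel Abbes Region Dataset", "day", "01", "02"],
    "Temperature": ["29", "31", np.nan, "Temperature", "32", "30"],
})

# The first row with a missing Temperature marks where the second region starts.
split_at = raw["Temperature"].isna().idxmax()

# Rows before the split get 0 (first region); rows from the split on get 1.
raw["Region"] = np.where(raw.index < split_at, 0, 1)
print(split_at, raw["Region"].tolist())
```

This relies on the default integer index; after any filtering, `reset_index(drop=True)` would be needed first.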

In [None]:
df.info()

In [8]:
## data type conversion to int 
df['Region']=df['Region'].astype('int')

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
## show the rows that contain nulls
df[df.isnull().any(axis=1)]

In [None]:
df.head()


In [16]:
## drop the rows that contain nulls
df=df.dropna().reset_index(drop=True)

In [None]:
df.head()

In [None]:
## confirm that no nulls remain
df.isnull().sum()

In [None]:
## one row in the dataset is just a repeat of the column names
df.iloc[[122]]

In [None]:
## remove the row at index 122 (the repeated header row)
df = df.drop(122).reset_index(drop=True)
df.iloc[[122]]
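
A repeated header row can also be found by value instead of by position. This is a minimal sketch on a toy frame, assuming the repeated header puts the literal string `day` in the `day` column; it is an alternative to dropping index 122, not what the cell above does.

```python
import pandas as pd

# Toy frame with a repeated header row in the middle.
toy = pd.DataFrame({
    "day": ["01", "02", "day", "01"],
    "month": ["06", "06", "month", "06"],
})

# A repeated header row carries the column name itself as its value.
is_header_row = toy["day"].astype(str).str.strip().eq("day")

cleaned = toy[~is_header_row].reset_index(drop=True)
print(cleaned)
```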

In [25]:
## strip stray whitespace from the column names
df.columns=df.columns.str.strip()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.head()

In [None]:
## convert the required columns data type from object to int
df[['day', 'month', 'year', 'Temperature', 'RH']]=df[['day', 'month', 'year', 'Temperature', 'RH']].astype('int')
df.info()

In [None]:
## convert the remaining object columns to float, except Classes
others = [col for col in df.columns if df[col].dtype == 'O']
print('features having object datatype are', others)

for col in others:
    if col != 'Classes':
        df[col] = df[col].astype(float)
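
`astype(float)` raises an error on the first value it cannot parse. When the input may still contain malformed entries, `pd.to_numeric` with `errors="coerce"` turns them into NaN so the affected rows can be inspected. The sketch below uses hypothetical toy values, including two deliberately malformed cells.

```python
import pandas as pd

# Toy values; the middle row is deliberately malformed.
toy = pd.DataFrame({
    "DC": ["7.6", "14.6 9", "22.2"],
    "FWI": ["0.5", "fire", "2.5"],
    "Classes": ["not fire", "fire", "fire"],
})

numeric_cols = [c for c in toy.columns if c != "Classes"]

converted = toy.copy()
for col in numeric_cols:
    # Unparseable strings become NaN instead of raising an exception.
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

# Rows where at least one numeric column failed to parse.
bad_rows = converted[converted[numeric_cols].isna().any(axis=1)]
print(bad_rows)
```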

In [None]:
df.info()

In [33]:
## lets save this cleaned data into csv 
df.to_csv('Algerian_forest_fires_cleaned_dataset.csv', index=False)

# Exploratory data analysis

In [None]:
## drop day , month , year 
dff = df.drop(['day','month','year'], axis=1)
dff.head()

In [None]:
dff.head()

In [None]:
dff['Classes']=df['Classes']
dff.head()

In [None]:
dff['Classes'].value_counts()

In [None]:
## binary-encode the "Classes" feature (not fire = 0, fire = 1)
dff['Classes']= np.where(dff['Classes'].str.contains('not fire'),0,1)
dff['Classes'].value_counts()
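
The `str.contains('not fire')` test tolerates stray whitespace in the labels, but it also maps any unexpected label to 1. A stricter variant is sketched below, assuming the only valid labels are `fire` and `not fire` once whitespace is stripped; the series is toy data.

```python
import pandas as pd

# Toy labels with trailing whitespace, similar to what a raw CSV may contain.
labels = pd.Series(["not fire   ", "fire   ", "fire", "not fire"])

cleaned = labels.str.strip()
encoded = cleaned.map({"not fire": 0, "fire": 1})

# Any label outside the mapping becomes NaN, so check before casting to int.
assert encoded.notna().all(), "unexpected class label found"
encoded = encoded.astype(int)
print(encoded.tolist())
```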

In [None]:
## histograms of all features
plt.style.use('classic')  ## other built-in styles include 'ggplot'
dff.hist(bins=50,figsize=(20,15))
plt.show()

In [None]:
## percentage of each class label
percent = dff['Classes'].value_counts(normalize=True)*100
percent

In [None]:
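## pie chart of the class percentages
labels = percent.index.map({1:'fire',0:'not fire'})  # take the labels from the data, not from an assumed order
plt.figure(figsize=(15,7))
plt.pie(percent,labels=labels,autopct='%1.1f%%')
plt.show()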

In [None]:
## correlation 
dff.corr()

In [None]:
## heatmap
plt.figure(figsize=(12, 12))  # Adjust the figure size
sns.heatmap(dff.corr(), annot=True, cmap='coolwarm')  # Add a colormap for better visualization
plt.title("Correlation Heatmap")  # Optional: Add a title
plt.show()

In [None]:
## boxplot of FWI
sns.boxplot(x=df['FWI'],color='blue')

In [None]:
df.head()

In [None]:
## monthly analysis of fires, Sidi-Bel Abbes region
df['Classes']=np.where(df['Classes'].str.contains('not fire'),'not fire','fire')
dftemp=df[df['Region']==1]
plt.figure(figsize=(13,7))  # plt.figure sets the size; plt.plot does not accept figsize
sns.set_style('whitegrid')
sns.countplot(data=dftemp,x='month',hue='Classes')
plt.ylabel('Number of Fires',weight='bold')
plt.xlabel('Months',weight='bold')
plt.title("Fire Analysis of Sidi-Bel Abbes Region",weight='bold')

In [None]:
## monthly analysis of fires, Bejaia region
dftemp=df.loc[df['Region']==0]
plt.figure(figsize=(13,6))  # plt.figure sets the size; plt.plot does not accept figsize
sns.set_style('whitegrid')
sns.countplot(x='month',hue='Classes',data=dftemp)
plt.ylabel('Number of Fires',weight='bold')
plt.xlabel('Months',weight='bold')
plt.title("Fire Analysis of Bejaia Region",weight='bold')

It is observed that August and September had the highest number of forest fires for both regions. From the monthly plots above, we can note a few things:

Most of the fires happened in August, and a very high number of fires occurred in only three months: June, July and August.

Fewer fires occurred in September.
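
Observations read off a count plot are easier to check against a table of the same counts. The sketch below shows one way to tabulate fires per region and month with `groupby`; the rows are toy values chosen for illustration and say nothing about the real distribution.

```python
import pandas as pd

# Toy rows only; these are not the real monthly counts.
toy = pd.DataFrame({
    "month": [6, 6, 7, 7, 8, 8, 8, 9],
    "Region": [0, 1, 0, 1, 0, 1, 1, 0],
    "Classes": ["not fire", "fire", "fire", "fire",
                "fire", "fire", "fire", "not fire"],
})

# Keep only the fire rows, then count them per region and month.
fires = toy[toy["Classes"] == "fire"]
per_month = fires.groupby(["Region", "month"]).size().unstack(fill_value=0)
print(per_month)
```

Run on the cleaned `df`, the same two lines would give the exact numbers behind both count plots.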

# Feature Selection

In [None]:
dff.head()

In [None]:
## independent and dependent features 
X= dff.loc[: , dff.columns != 'FWI']
y = dff['FWI']
y.head()

In [84]:
## train test split 
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y,random_state=42,test_size=0.25)


In [None]:
## feature selection based on correlation 
X_train.corr()

In [None]:
## check for multicollinearity 
corr = X_train.corr()
plt.figure(figsize=(15,10))  # plt.figure sets the size; plt.plot does not accept figsize
sns.heatmap(corr , annot=True)

In [88]:
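def correlation(dataset, threshold):
    """Return the set of columns whose absolute correlation with an earlier column exceeds threshold."""
    corr_cols = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # only pairs below the diagonal, so each pair is checked once
            if abs(corr_matrix.iloc[i, j]) > threshold:
                # flag the later column of the pair for removal
                corr_cols.add(corr_matrix.columns[i])
    return corr_cols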
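
To see what this selection rule does, it helps to run it on data where the answer is known. The sketch below re-implements the same rule with a lower-triangle mask under the hypothetical name `high_corr_columns` and applies it to a toy frame in which `b` is an exact linear function of `a`.

```python
import numpy as np
import pandas as pd

def high_corr_columns(frame, threshold):
    """Flag each column that correlates above threshold with an earlier column."""
    corr = frame.corr().abs()
    # Keep only the strict lower triangle so each pair is checked once;
    # everything else becomes NaN, and NaN > threshold is False.
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    lower = corr.where(mask)
    return {col for col in lower.index if (lower.loc[col] > threshold).any()}

a = np.arange(10.0)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,                # perfectly correlated with a
    "c": np.tile([1.0, -1.0], 5),  # alternating values, weakly correlated with a
})
print(high_corr_columns(toy, 0.85))
```

Only the later column of a correlated pair is flagged, so `a` is kept and `b` is the one proposed for removal.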


In [None]:
## the threshold is normally chosen with domain knowledge; 0.85 is used here
dropping_features=correlation(X_train,0.85)
dropping_features

In [None]:
## drop the features whose correlation exceeds 0.85
X_train = X_train.drop(columns=list(dropping_features))  # reassign instead of inplace to avoid modifying a slice
X_test = X_test.drop(columns=list(dropping_features))
X_train.shape , X_test.shape


In [96]:
## standardisation of features / Z score 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled= scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
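
`StandardScaler` computes z = (x - mean) / std per feature, with the mean and standard deviation taken from the training data only and then reused for the test data. The sketch below checks this by hand on toy arrays; the values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays with two features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[4.0, 30.0]])

scaler = StandardScaler().fit(X_train)

# Manual z-score using the TRAINING mean and population std (np.std uses ddof=0).
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

assert np.allclose(scaler.transform(X_test), manual)
print(manual)
```

Fitting on the training split only, as the cell above does, keeps test-set statistics out of the preprocessing step.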


In [None]:
## box plots to see the effect of the standard scaler
plt.figure(figsize=(25,10))
plt.subplot(1,2,1)
sns.boxplot(data=X_train)
plt.title('data before scaling')
plt.subplot(1,2,2)
sns.boxplot(data=X_train_scaled)
plt.title('data after scaling')
plt.show()

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error , r2_score
LinReg = LinearRegression()
LinReg.fit(X_train_scaled,y_train)
y_pred = LinReg.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
plt.scatter(y_test,y_pred)

# Lasso Regression 

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error , r2_score
LassoReg = Lasso()
LassoReg.fit(X_train_scaled,y_train)
y_pred = LassoReg.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
plt.scatter(y_test,y_pred)

# Ridge regression


In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error , r2_score
ridge = Ridge()
ridge.fit(X_train_scaled,y_train)
y_pred =ridge.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
plt.scatter(y_test,y_pred)

# ElasticNet

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error , r2_score
elastic = ElasticNet()
elastic.fit(X_train_scaled,y_train)
y_pred = elastic.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
plt.scatter(y_test,y_pred)
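
The Linear, Lasso, Ridge and ElasticNet cells repeat the same fit, predict and score steps. A small helper makes the comparison easier to read. This is a sketch only: `evaluate` is a hypothetical name, and the data is synthetic rather than the forest fire features.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (MAE, R2) on the held-out data."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return mean_absolute_error(y_te, pred), r2_score(y_te, pred)

# Synthetic regression data standing in for the scaled features.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)
X_tr, X_te, y_tr, y_te = X[:90], X[90:], y[:90], y[90:]

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elasticnet": ElasticNet(),
}
results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
for name, (mae, r2) in results.items():
    print(f"{name}: MAE={mae:.3f}, R2={r2:.3f}")
```

All regularized models here use their default `alpha`, which is why the cross-validated variants are worth trying next.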

# Hyperparameter Tuning

## LassoCV

In [None]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error , r2_score
lassocv = LassoCV(cv=5)
lassocv.fit(X_train_scaled,y_train)
y_pred = lassocv.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
print('selected alpha value ', lassocv.alpha_)
msepath = lassocv.mse_path_
alphas = lassocv.alphas_
print(msepath.shape)
print(alphas.shape)
plt.scatter(y_test,y_pred)
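
`mse_path_` holds one row per candidate alpha and one column per fold, so averaging across folds gives the cross-validated error curve that `LassoCV` minimizes. The sketch below illustrates this on synthetic data; it assumes the default alpha grid and no sample weights.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data with two informative and two uninformative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=100)

lassocv = LassoCV(cv=5).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); average over the folds.
mean_mse = lassocv.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print(lassocv.mse_path_.shape, lassocv.alphas_.shape)
print(lassocv.alphas_[best_idx], lassocv.alpha_)
```

Plotting `mean_mse` against `lassocv.alphas_` on a log-scaled x-axis is a common way to see how sharply the error rises away from the selected alpha.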


## RidgeCV

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error , r2_score
ridgecv = RidgeCV(cv=5)
ridgecv.fit(X_train_scaled,y_train)
y_pred = ridgecv.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
print(ridgecv.alpha_)
plt.scatter(y_test,y_pred)


## ElasticNetCV

In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_absolute_error , r2_score
elastic_cv = ElasticNetCV(cv=5)
elastic_cv.fit(X_train_scaled,y_train)
y_pred = elastic_cv.predict(X_test_scaled)
mae = mean_absolute_error(y_test,y_pred)
score = r2_score(y_test,y_pred)
print('mean absolute error is ',mae)
print('r2 score is ',score)
print(elastic_cv.alpha_)
plt.scatter(y_test,y_pred)


Pickle is a Python module used for serializing and deserializing objects. Serialization is the process of converting a Python object (e.g., a trained machine learning model) into a byte stream that can be saved to a file or transferred over a network. Deserialization is the reverse process: loading the object back into memory.

In [133]:
## pickle the ML model and the preprocessing scaler
import pickle
with open('scaler.pkl','wb') as f:  # the with block closes the file after writing
    pickle.dump(scaler,f)
with open('ridge.pkl','wb') as f:
    pickle.dump(ridge,f)
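
Loading the objects back is the mirror image of saving them. The sketch below round-trips a toy scaler through a temporary file and checks that the restored object transforms data identically. The path and data are illustrative only; pickle files should be loaded only from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy scaler standing in for the one fitted in the notebook.
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")
    with open(path, "wb") as fh:
        pickle.dump(scaler, fh)       # serialize to disk
    with open(path, "rb") as fh:
        restored = pickle.load(fh)    # deserialize back into memory

sample = np.array([[4.0]])
assert np.allclose(restored.transform(sample), scaler.transform(sample))
print(restored.transform(sample))
```

At prediction time the scaler has to be applied before the model, in the same order as during training.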