## Algerian Forest Fires Dataset 
Data Set Information:

The dataset includes 244 instances that combine data from two regions of Algeria: the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest.

Each region contributes 122 instances.

The data covers the period from June 2012 to September 2012.
The dataset includes 11 attributes and 1 output attribute (class).
The 244 instances are classified into two classes: fire (138 instances) and not fire (106 instances).

Attribute Information:

1. Date : (DD/MM/YYYY) Day, month ('june' to 'september'), year (2012)
Weather data observations
2. Temp : temperature at noon (maximum temperature) in degrees Celsius: 22 to 42
3. RH : relative humidity in %: 21 to 90
4. Ws : wind speed in km/h: 6 to 29
5. Rain : total rainfall for the day in mm: 0 to 16.8
FWI Components
6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
8. Drought Code (DC) index from the FWI system: 7 to 220.4
9. Initial Spread Index (ISI) index from the FWI system: 0 to 18.5
10. Buildup Index (BUI) index from the FWI system: 1.1 to 68
11. Fire Weather Index (FWI) Index: 0 to 31.1
12. Classes: two classes, namely Fire and not Fire

## Project Information:
This dataset is used to predict the occurrence of forest fires based on weather conditions and FWI components. The goal is to build a model that can classify instances into "Fire" or "Not Fire" based on the attributes provided.
## Dataset Access:
The dataset can be accessed from the UCI Machine Learning Repository at the following link: [Algerian Forest Fires Dataset](https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset)
## Usage
This dataset can be used for various machine learning tasks, including classification, regression, and clustering. It is suitable for training models to predict forest fire occurrences based on environmental factors.


## Project Phases:
## Phase 01: Data Cleaning and EDA:
1. **Data Cleaning and EDA**: Initial exploration and cleaning of the dataset to prepare it for modeling.
    1. **Data Cleaning**: Handling missing values, correcting data types, and ensuring the dataset is ready for analysis.
    2. **Exploratory Data Analysis (EDA)**: Visualizing the data to understand patterns, distributions, and relationships between features.
    3. **Feature Engineering**: Creating new features or modifying existing ones to improve model performance.

## Phase 02: Model Training:
2. **Model Training**: Building and training machine learning models using the cleaned dataset.
    1. Regression Models: Training models to predict continuous outcomes with regression techniques (Linear Regression, Ridge Regression, Lasso Regression, ElasticNet Regression).
    2. Cross-Validation: Implementing cross-validation to evaluate model performance and detect overfitting.
    3. Hyperparameter Tuning: Optimizing model parameters to improve accuracy and generalization.
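
The Phase 02 steps listed above can be sketched roughly as follows. This is only an illustration on synthetic data, not the project's training code: the feature matrix, the target, and the `alpha` grid are all assumptions made for the example, and Ridge stands in for any of the listed regression models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features and a continuous target (assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled
# using only the training part of that fold
pipeline = make_pipeline(StandardScaler(), Ridge())

# Hypothetical grid; make_pipeline names the Ridge step "ridge"
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation over the grid, scored with R^2
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_r2 = search.score(X_test, y_test)
print(f"best alpha: {best_alpha}, test R^2: {test_r2:.3f}")
```

Keeping the scaler inside the pipeline is a design choice that avoids leaking statistics from the validation folds into training.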

# Phase 01 Start : Data Cleaning and EDA

In [None]:
# Import necessary libraries for data analysis and visualization
import pandas as pd      # For data manipulation and analysis
import numpy as np       # For numerical operations
import matplotlib.pyplot as plt  # For plotting graphs
import seaborn as sns    # For advanced data visualization

# Enable inline plotting for Jupyter Notebook
%matplotlib inline

In [None]:
# Load the Algerian Forest Fires dataset, skipping the first row (header=1) as per dataset format
dataset = pd.read_csv('data/Algerian_forest_fires_dataset.csv',header=1)
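
To see what `header=1` does, here is a small sketch on an in-memory file. The layout (a title line followed by the real header) is an assumption made for the example, and the column names are shortened.

```python
import io
import pandas as pd

# Toy file: line 0 is a title, line 1 holds the column names (assumed layout)
raw = "Region A title\nday,month,Temperature\n1,6,29\n2,6,31\n"

# header=1 tells pandas to take the column names from line 1
# and to discard everything before it
toy = pd.read_csv(io.StringIO(raw), header=1)

print(list(toy.columns))
print(len(toy))
```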

In [None]:
dataset.head()

In [None]:
dataset.info()

## Data Cleaning

In [None]:
## missing values
dataset[dataset.isnull().any(axis=1)]

The file stacks the two regional datasets one after the other, with the second region starting at index 122, so we can add a new column that records the region:

0 : "Bejaia Region Dataset"

1 : "Sidi-Bel Abbes Region Dataset"

Add a new column with the region

In [None]:
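# .loc slicing is label-based and inclusive on both ends, so row 122 is
# first set to 0 and then overwritten with 1 by the second assignment
dataset.loc[:122,"Region"]=0
dataset.loc[122:,"Region"]=1
df=dataset  # df is another name for the same DataFrame, not a copy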

In [None]:
df.info()

In [None]:
df[['Region']]=df[['Region']].astype(int)

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
## Removing the null values
df=df.dropna().reset_index(drop=True)


In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.iloc[[122]]

In [None]:
## Remove the row at index 122 (inspected in the previous cell)
df=df.drop(122).reset_index(drop=True)
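
Dropping a hard-coded index works only while the file layout stays the same. As a hedged alternative, the sketch below filters leftover rows by value instead. The toy frame and its layout (a title row that is mostly NaN, followed by a repeated header row) are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with an assumed layout: two data blocks separated by a title row
# (contains NaN) and a repeated header row (strings only, no NaN)
toy = pd.DataFrame({
    "day": ["01", "02", "Region B title", "day", "01", "02"],
    "Temperature": ["29", "31", np.nan, "Temperature", "33", "30"],
})

# dropna removes the title row because it contains NaN
cleaned = toy.dropna().reset_index(drop=True)

# The repeated header row has no NaN, so filter it out by value
# rather than by position
cleaned = cleaned[cleaned["day"] != "day"].reset_index(drop=True)

print(cleaned)
```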

In [None]:
df.iloc[[122]]

In [None]:
df.columns

In [None]:
## fix spaces in columns names
df.columns=df.columns.str.strip()
df.columns
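
A quick way to check which column names carried stray whitespace before stripping them is sketched below. The column names here are made up for the example.

```python
import pandas as pd

# Hypothetical column names, some padded with spaces
toy = pd.DataFrame(columns=[" RH", "Rain ", "Classes  ", "FWI"])

# Record the names that differ from their stripped form
padded = [col for col in toy.columns if col != col.strip()]

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(padded)
print(list(toy.columns))
```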

In [None]:
df.info()

#### Change the required columns to integer data type

In [None]:
df.columns

In [None]:
df[['month','day','year','Temperature','RH','Ws']]=df[['month','day','year','Temperature','RH','Ws']].astype(int)

In [None]:
df.info()

#### Change the other columns to float data type


In [None]:
objects=[features for features in df.columns if df[features].dtypes=='O']

In [None]:
for i in objects:
    if i!='Classes':
        df[i]=df[i].astype(float)
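
`astype(float)` raises an error as soon as one value cannot be parsed. A more forgiving variant, sketched here on toy data with a deliberately malformed entry (an assumption for the example), uses `pd.to_numeric` with `errors="coerce"` and then lists the rows that failed to convert.

```python
import pandas as pd

toy = pd.DataFrame({
    "Rain": ["0.0", "1.3", "0.4"],
    "DC": ["7.6", "14.6 9", "22.2"],  # hypothetical malformed entry
    "Classes": ["not fire", "fire", "fire"],
})

# Convert every column except the label column
numeric_cols = [col for col in toy.columns if col != "Classes"]
for col in numeric_cols:
    # Unparseable strings become NaN instead of raising an error
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Rows where at least one value could not be converted
bad_rows = toy[toy[numeric_cols].isna().any(axis=1)]
print(bad_rows)
```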

In [None]:
df.info()

In [None]:
objects

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
## Save the cleaned dataset
df.to_csv('data/Algerian_forest_fires_cleaned_dataset.csv',index=False)

##  Exploratory Data Analysis

In [None]:
## drop day,month and year
df_copy=df.drop(['day','month','year'],axis=1)

In [None]:
df_copy.head()

In [None]:
## categories in classes
df_copy['Classes'].value_counts()

In [None]:
## Encoding of the categories in classes
df_copy['Classes']=np.where(df_copy['Classes'].str.contains('not fire'),0,1)
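
The substring match above tolerates padded labels. The sketch below compares it with an explicit strip-and-map encoding on made-up labels. With the mapping, any unexpected label becomes NaN rather than silently being counted as fire.

```python
import numpy as np
import pandas as pd

# Hypothetical labels with inconsistent trailing whitespace
labels = pd.Series(["not fire   ", "fire   ", "fire", "not fire ", "fire "])

# Substring match: everything that does not contain "not fire" becomes 1
encoded_contains = np.where(labels.str.contains("not fire"), 0, 1)

# Alternative: normalize first, then map each known label explicitly
encoded_map = labels.str.strip().map({"not fire": 0, "fire": 1})

print(encoded_contains.tolist())
print(encoded_map.tolist())
```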

In [None]:
df_copy.head()

In [None]:
df_copy.tail()

In [None]:
df_copy['Classes'].value_counts()

In [None]:
## Plot histograms for all features
plt.style.use('seaborn-v0_8')
df_copy.hist(bins=50, figsize=(20, 15))
fig = plt.gcf()
fig.suptitle('Histograms of All Features', fontsize=20)
# Figure-level labels; plt.xlabel/plt.ylabel would only label the last subplot
fig.supxlabel('Feature Values', fontsize=15)
fig.supylabel('Count', fontsize=15)
plt.tight_layout()
plt.show()

In [None]:
## Percentage for Pie Chart
percentage=df_copy['Classes'].value_counts(normalize=True)*100

In [None]:
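# Plot the pie chart
# Derive the labels from the index so they always match the order of value_counts()
classlabels=percentage.index.map({1:"Fire",0:"Not Fire"})
plt.figure(figsize=(12,7))
plt.pie(percentage,labels=classlabels,autopct='%1.1f%%')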
plt.title("Pie Chart of Classes")
plt.axis('equal')  # Equal aspect ratio ensures that pie chart is circular

plt.show()

In [None]:
## Correlation

In [None]:
df_copy.corr()

In [None]:
sns.heatmap(df_copy.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap', fontsize=20)
plt.tight_layout()
plt.show()
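
The heatmap can be complemented by a small helper that lists features which are highly correlated with another feature. The sketch below runs on synthetic data. The helper name `correlated_features` and the 0.85 threshold are assumptions chosen for the example, not values from this project.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is nearly a copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=100),
    "c": rng.normal(size=100),
})

def correlated_features(frame, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    flagged = set()
    for i in range(len(corr.columns)):
        for j in range(i):  # only look at the lower triangle
            if corr.iloc[i, j] > threshold:
                flagged.add(corr.columns[i])
    return flagged

print(correlated_features(toy, threshold=0.85))
```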

In [None]:
## Box Plots
# Pass the series as x so the box is drawn along the labelled x-axis
sns.boxplot(x=df['FWI'],color='green')
plt.title('Box Plot of FWI', fontsize=20)
plt.xlabel('FWI', fontsize=15)
plt.show()
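
The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default. A minimal sketch of that rule, applied to made-up values rather than the real FWI column, looks like this:

```python
import pandas as pd

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical FWI-like values with one extreme reading
toy_fwi = pd.Series([0.5, 1.2, 3.4, 4.0, 5.1, 6.3, 7.0, 8.2, 30.0])
print(iqr_outliers(toy_fwi))
```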

In [None]:
df.head()

In [None]:
## Normalize the class labels to exactly 'fire' or 'not fire'
df['Classes']=np.where(df['Classes'].str.contains('not fire'),'not fire','fire')

In [None]:
## Monthly Fire Analysis for the Sidi-Bel Abbes region (Region == 1)
dftemp=df.loc[df['Region']==1]
plt.subplots(figsize=(13,6))
sns.set_style('whitegrid')
# Plot the filtered frame, not the full df
sns.countplot(x='month',hue='Classes',data=dftemp)
plt.ylabel('Number of Fires',weight='bold')
plt.xlabel('Months',weight='bold')
plt.title("Fire Analysis of Sidi-Bel Abbes Region",weight='bold')

In [None]:
## Monthly Fire Analysis for the Bejaia region (Region == 0)
dftemp=df.loc[df['Region']==0]
plt.subplots(figsize=(13,6))
sns.set_style('whitegrid')
# Plot the filtered frame, not the full df
sns.countplot(x='month',hue='Classes',data=dftemp)
plt.ylabel('Number of Fires',weight='bold')
plt.xlabel('Months',weight='bold')
plt.title("Fire Analysis of Bejaia Region",weight='bold')

It is observed that August and September had the highest number of forest fires for both regions. The monthly plots above also suggest a few things:

Most of the fires happened in August, and high fire counts occurred in only 3 months: June, July and August.

Fewer fires occurred in September.
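
Observations like these are easier to check against a table of counts than against a plot. The sketch below builds such a table with `pd.crosstab`. The rows are made up for the example and are not the real fire counts.

```python
import pandas as pd

# Hypothetical rows, not the real data
toy = pd.DataFrame({
    "Region": [0, 0, 0, 1, 1, 1, 1],
    "month": [6, 7, 8, 6, 8, 8, 9],
    "Classes": ["fire", "fire", "not fire", "not fire", "fire", "fire", "not fire"],
})

# Keep only the fire days, then count them per region and month
fires = toy[toy["Classes"] == "fire"]
fire_counts = pd.crosstab(fires["Region"], fires["month"])
print(fire_counts)
```

On the cleaned dataset, the same two lines applied to `df` would give the per-region monthly fire counts behind the plots.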