# [Algerian Forest Fires](https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++)

---

- **Dataset Information**
  - The dataset contains 244 instances covering two regions of Algeria: the Bejaia region in the northeast and the Sidi Bel-Abbes region in the northwest.
  - 122 instances for each region.
  - The data cover the period from June 2012 to September 2012.
  - The dataset includes 11 attributes and 1 output attribute (class).
  - The 244 instances are classified as fire (138 instances) or not fire (106 instances).

---

- **Attribute Information**

  1. Date : (DD/MM/YYYY) Day, month ('june' to 'september'), year (2012)

  Weather data observations

  2. Temp: noon temperature (daily maximum) in degrees Celsius: 22 to 42
  3. RH: relative humidity in %: 21 to 90
  4. Ws: wind speed in km/h: 6 to 29
  5. Rain: daily total in mm: 0 to 16.8

  FWI Components

  6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
  7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
  8. Drought Code (DC) index from the FWI system: 7 to 220.4
  9. Initial Spread Index (ISI) from the FWI system: 0 to 18.5
  10. Buildup Index (BUI) from the FWI system: 1.1 to 68
  11. Fire Weather Index (FWI): 0 to 31.1
  12. Classes: two classes, namely fire and not fire


# Data Cleaning


## Importing Libraries


In [1]:
import pandas as pd
import numpy as np


## Importing Data and cleaning


In [2]:
dataset=pd.read_csv("./../data/Algerian_forest_fires_dataset_UPDATE.csv")
dataset.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Bejaia Region Dataset
day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
01,06,2012,29,57,18,0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
02,06,2012,29,61,13,1.3,64.4,4.1,7.6,1,3.9,0.4,not fire
03,06,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
04,06,2012,25,89,13,2.5,28.6,1.3,6.9,0,1.7,0,not fire


OBSERVATIONS
- The 1st row is not the actual header; it is a banner naming the region. A similar banner row for the second region appears further down and will need to be removed.
- We can add the region as a feature of the dataset.
- The 2nd row is the actual header, so we will use it as the header.


TODO
- Read the dataset with the correct header and add a `Region` feature.

In [3]:
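# Re-read with the second line as the header (the already-loaded frame could also be fixed by resetting the index and promoting row `0` to the header)
df = pd.read_csv("./../data/Algerian_forest_fires_dataset_UPDATE.csv", header=1)
# Provisional label: rows belonging to the second region are relabelled once they are located
df['Region']='Bejaia'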
df.head()


Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,Bejaia
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,Bejaia
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,Bejaia
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,Bejaia
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,Bejaia
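
The effect of `header=1` can be reproduced on a tiny inline sample. This is only a sketch: the `raw` string below is a made-up stand-in for the real file (assumed layout: a region banner, the header, the data, then a second banner and header), not an excerpt from it.

```python
import io
import pandas as pd

# Made-up stand-in for the raw file layout (assumption, not the real data).
raw = (
    "Bejaia Region Dataset\n"
    "day,month,year,Temperature\n"
    "01,06,2012,29\n"
    "02,06,2012,29\n"
    "Sidi-Bel Abbes Region Dataset\n"
    "day,month,year,Temperature\n"
    "01,06,2012,32\n"
    "02,06,2012,30\n"
)

# header=1 uses the second line as column names and discards the first banner.
demo = pd.read_csv(io.StringIO(raw), header=1)

# The second banner and the repeated header are NOT removed: they survive as
# ordinary data rows, which keeps every column from being parsed as numeric.
print(demo)
```

In this sample the second banner fills only the `day` column and leaves the remaining columns missing, which is one way a column can end up with one fewer non-null value than `day`.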


## Checking column names

In [4]:
df.columns # need to strip extra spaces

Index(['day', 'month', 'year', 'Temperature', ' RH', ' Ws', 'Rain ', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes  ', 'Region'],
      dtype='object')

OBSERVATION
- Some column names have extra space padding.

TODO
- Strip the column names.

In [5]:
df.columns=df.columns.str.strip()
df.columns

Index(['day', 'month', 'year', 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes', 'Region'],
      dtype='object')
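
As a small illustration of why the padding matters, the sketch below builds a throwaway frame (the names `padded` and `stripped` are hypothetical) and shows that an exact-match lookup misses a padded name. It uses `rename(columns=str.strip)`, an equivalent alternative to assigning `df.columns.str.strip()`.

```python
import pandas as pd

# Throwaway frame with padded column names, mimicking the raw header.
padded = pd.DataFrame({' RH': [57], 'Rain ': [0.0], 'Classes  ': ['not fire']})

# Exact-match lookups miss padded names.
has_rh_before = 'RH' in padded.columns

# rename applies str.strip to every label and returns a new frame.
stripped = padded.rename(columns=str.strip)
has_rh_after = 'RH' in stripped.columns

print(has_rh_before, has_rh_after)  # False True
```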

## Dataset Summary

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246 entries, 0 to 245
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          246 non-null    object
 1   month        245 non-null    object
 2   year         245 non-null    object
 3   Temperature  245 non-null    object
 4   RH           245 non-null    object
 5   Ws           245 non-null    object
 6   Rain         245 non-null    object
 7   FFMC         245 non-null    object
 8   DMC          245 non-null    object
 9   DC           245 non-null    object
 10  ISI          245 non-null    object
 11  BUI          245 non-null    object
 12  FWI          245 non-null    object
 13  Classes      244 non-null    object
 14  Region       246 non-null    object
dtypes: object(15)
memory usage: 29.0+ KB


In [7]:
df.describe()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
count,246,245,245,245,245,245,245,245.0,245.0,245,245.0,245,245.0,244,246
unique,33,5,2,20,63,19,40,174.0,167.0,199,107.0,175,128.0,9,1
top,1,7,2012,35,64,14,0,88.9,7.9,8,1.1,3,0.4,fire,Bejaia
freq,8,62,244,29,10,43,133,8.0,5.0,5,8.0,5,12.0,131,246


OBSERVATION
- According to the dataset information there should be only 2 classes, **fire** and **not fire**, but `Classes` reports 9 unique values.
- `day` can have at most 31 unique values, but 33 are reported.
- There are 2 extra rows: 246 instead of the expected 244.
- The dtype of every feature is object.

TODO
- Correct the `Classes` values.
- Correct the `day` feature.
- Convert the features to the correct dtypes once the data is cleaned.


## Checking unique Classes and cleaning

In [8]:
df['Classes'].unique()

array(['not fire   ', 'fire   ', 'fire', 'fire ', 'not fire', 'not fire ',
       nan, 'Classes  ', 'not fire     ', 'not fire    '], dtype=object)

OBSERVATIONS
- There are more than 2 classes, and the values have inconsistent whitespace padding.

TODO
- Correct the `Classes` feature.

In [9]:
df['Classes']=df['Classes'].str.strip()
df['Classes'].unique()

array(['not fire', 'fire', nan, 'Classes'], dtype=object)
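
The cell above strips only `Classes`. If other columns might carry padded strings as well (an assumption; the outputs here do not show it), a sketch like the following strips every string cell in one pass. The helper name `strip_string_cells` and the sample frame are hypothetical.

```python
import pandas as pd

def strip_string_cells(frame):
    """Return a copy of `frame` with whitespace stripped from every string cell.

    Non-string cells (numbers, missing values) are passed through unchanged.
    """
    return frame.apply(
        lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)
    )

# Made-up sample with padded labels and padded numeric strings.
sample = pd.DataFrame({
    'Classes': ['fire   ', 'not fire ', None],
    'FWI': [' 0.5', '10.4 ', '0.1'],
})
clean = strip_string_cells(sample)
print(clean)
```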

In [10]:
# Show every row whose label is not one of the two valid classes (this includes NaN)
df[~df['Classes'].isin(['fire', 'not fire'])]

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
122,Sidi-Bel Abbes Region Dataset,,,,,,,,,,,,,,Bejaia
123,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Bejaia
167,14,07,2012,37,37,18,0.2,88.9,12.9,14.6 9,12.5,10.4,fire,,Bejaia


OBSERVATIONS
- From index 122 onwards, the data belongs to the Sidi-Bel Abbes region.
- Index 123 is a repeated header row.
- Index 167 is shifted one column to the left: the `DC` cell holds two values, `14.6` and `9`, which appears to be the cause of the shift.

TODO
- Drop index 122 and 123.
- Set `Region` to Sidi-Bel Abbes from index 124 onwards.
- Shift the data at index 167 one column to the right.
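
Instead of hard-coding the index labels, the banner and repeated-header rows can also be located programmatically. The following is a minimal sketch on a made-up two-block sample (the `raw` string and the names `is_banner` and `is_header` are assumptions for illustration): a banner row is taken to be one where every column except the first is missing, and a repeated header is one whose `day` cell equals the literal string `day`.

```python
import io
import pandas as pd

# Made-up stand-in for the raw file layout (assumption, not the real data).
raw = (
    "Bejaia Region Dataset\n"
    "day,month,year,Temperature,Classes\n"
    "01,06,2012,29,not fire\n"
    "02,06,2012,29,fire\n"
    "Sidi-Bel Abbes Region Dataset\n"
    "day,month,year,Temperature,Classes\n"
    "01,06,2012,32,not fire\n"
    "02,06,2012,30,fire\n"
)
demo = pd.read_csv(io.StringIO(raw), header=1)

# Banner row: only the first column is filled.
is_banner = demo.iloc[:, 1:].isna().all(axis=1)
# Repeated header row: the cells repeat the column names.
is_header = demo['day'].eq('day')

# Rows before the banner get 0, rows from the banner onwards get 1.
demo['Region'] = is_banner.cumsum().map({0: 'Bejaia', 1: 'Sidi-Bel Abbes'})

# Drop the two structural rows; the original index labels are kept.
demo = demo[~(is_banner | is_header)]
print(demo)
```

A row that is only partly missing, such as the shifted row at index 167, is not flagged by `is_banner` and still needs separate handling.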

In [11]:
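df=df.drop([122,123])
# Assign through a single .loc call; chained indexing (df['Region'].loc[...]) may write to a temporary copy
df.loc[124:,'Region']='Sidi-Bel Abbes'
df[120:125]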

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
120,29,9,2012,26,80,16,1.8,47.4,2.9,7.7,0.3,3.0,0.1,not fire,Bejaia
121,30,9,2012,25,78,14,1.4,45.0,1.9,7.5,0.2,2.4,0.1,not fire,Bejaia
124,1,6,2012,32,71,12,0.7,57.1,2.5,8.2,0.6,2.8,0.2,not fire,Sidi-Bel Abbes
125,2,6,2012,30,73,13,4.0,55.7,2.7,7.8,0.6,2.9,0.2,not fire,Sidi-Bel Abbes
126,3,6,2012,29,80,14,2.0,48.7,2.2,7.6,0.3,2.6,0.1,not fire,Sidi-Bel Abbes


In [12]:
df.loc[[167],'DC':]

Unnamed: 0,DC,ISI,BUI,FWI,Classes,Region
167,14.6 9,12.5,10.4,fire,,Sidi-Bel Abbes


In [13]:
df.loc[[167],'ISI':'Classes']

Unnamed: 0,ISI,BUI,FWI,Classes
167,12.5,10.4,fire,


In [14]:
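# Shift the four values one column to the right, then split the fused DC cell into DC and ISI
df.loc[[167],'ISI':'Classes']=df.loc[[167],'DC':'FWI'].values
# Use a single .loc call; chained indexing (df.loc[167][...]) may assign to a temporary copy
df.loc[167,'DC':'ISI']=df.loc[167,'DC'].split()
df.loc[[167]]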

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
167,14,7,2012,37,37,18,0.2,88.9,12.9,14.6,9,12.5,10.4,fire,Sidi-Bel Abbes
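
The same repair can be tried in isolation on a two-row frame. This is a sketch only: row 167 copies the values shown above, while the padding on `'fire   '` and all of row 166 are made up for illustration.

```python
import pandas as pd

# Row 167 mirrors the broken row; row 166 is a made-up healthy row.
demo = pd.DataFrame(
    {
        'DMC': ['3.0', '12.9'],
        'DC': ['14.2', '14.6 9'],
        'ISI': ['1.2', '12.5'],
        'BUI': ['3.9', '10.4'],
        'FWI': ['0.5', 'fire   '],
        'Classes': ['not fire', None],
    },
    index=[166, 167],
)

# Move ISI..FWI one column to the right (into BUI..Classes).
# tolist() takes a copy first, so source and target cannot alias.
demo.loc[167, 'BUI':'Classes'] = demo.loc[167, 'ISI':'FWI'].tolist()

# Split the fused DC cell into its two values.
demo.loc[167, ['DC', 'ISI']] = demo.loc[167, 'DC'].split()

# The label arrived from an unstripped column, so strip it again.
demo['Classes'] = demo['Classes'].str.strip()
print(demo)
```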


In [15]:
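# Strip again: the label moved into `Classes` at index 167 came from `FWI`, which was never stripped and may still carry padding
df['Classes']=df["Classes"].str.strip()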


## Rechecking Summary

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 245
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          244 non-null    object
 1   month        244 non-null    object
 2   year         244 non-null    object
 3   Temperature  244 non-null    object
 4   RH           244 non-null    object
 5   Ws           244 non-null    object
 6   Rain         244 non-null    object
 7   FFMC         244 non-null    object
 8   DMC          244 non-null    object
 9   DC           244 non-null    object
 10  ISI          244 non-null    object
 11  BUI          244 non-null    object
 12  FWI          244 non-null    object
 13  Classes      244 non-null    object
 14  Region       244 non-null    object
dtypes: object(15)
memory usage: 38.6+ KB


In [17]:
df.describe()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
count,244,244,244,244,244,244,244,244.0,244.0,244,244.0,244,244.0,244,244
unique,31,4,1,19,62,18,39,173.0,166.0,198,106.0,173,127.0,2,2
top,1,7,2012,35,64,14,0,88.9,7.9,8,1.1,3,0.4,fire,Bejaia
freq,8,62,244,29,10,43,133,8.0,5.0,5,8.0,5,12.0,138,122


In [18]:
df['Region'].value_counts()

Bejaia            122
Sidi-Bel Abbes    122
Name: Region, dtype: int64

OBSERVATION
- The dataset now appears to be clean: 244 rows, 2 classes, and 122 rows per region.

TODO
- Convert dtypes.

## Converting dtypes

In [19]:
# Every feature except Classes and Region appears to be numeric
num_feature=df.columns[:-2]
df[num_feature]=df[num_feature].apply(pd.to_numeric)
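
`pd.to_numeric` raises on the first value it cannot parse. When the data may still contain stray strings, a sketch like the one below (on a made-up frame; `coerced` and `bad_cells` are hypothetical names) uses `errors='coerce'` to locate the offending cells before committing to the conversion.

```python
import pandas as pd

# Made-up frame with two cells that are not valid numbers.
demo = pd.DataFrame({
    'DC': ['7.6', '14.6 9', '8.2'],
    'FWI': ['0.5', 'fire', '0.2'],
})

# errors='coerce' turns unparseable values into NaN instead of raising.
coerced = demo.apply(pd.to_numeric, errors='coerce')

# A cell is "bad" if it held a value before coercion but is NaN afterwards.
bad_cells = coerced.isna() & demo.notna()
print(demo[bad_cells.any(axis=1)])
```

If `bad_cells` is empty for the real data, the stricter default conversion used in the cell above is safe.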

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 245
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          244 non-null    int64  
 1   month        244 non-null    int64  
 2   year         244 non-null    int64  
 3   Temperature  244 non-null    int64  
 4   RH           244 non-null    int64  
 5   Ws           244 non-null    int64  
 6   Rain         244 non-null    float64
 7   FFMC         244 non-null    float64
 8   DMC          244 non-null    float64
 9   DC           244 non-null    float64
 10  ISI          244 non-null    float64
 11  BUI          244 non-null    float64
 12  FWI          244 non-null    float64
 13  Classes      244 non-null    object 
 14  Region       244 non-null    object 
dtypes: float64(7), int64(6), object(2)
memory usage: 38.6+ KB


In [21]:
df.describe()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI
count,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0,244.0
mean,15.754098,7.5,2012.0,32.172131,61.938525,15.504098,0.760656,77.887705,14.673361,49.288115,4.759836,16.673361,7.04918
std,8.825059,1.112961,0.0,3.633843,14.8842,2.810178,1.999406,14.337571,12.368039,47.619662,4.154628,14.201648,7.428366
min,1.0,6.0,2012.0,22.0,21.0,6.0,0.0,28.6,0.7,6.9,0.0,1.1,0.0
25%,8.0,7.0,2012.0,30.0,52.0,14.0,0.0,72.075,5.8,13.275,1.4,6.0,0.7
50%,16.0,7.5,2012.0,32.0,63.0,15.0,0.0,83.5,11.3,33.1,3.5,12.45,4.45
75%,23.0,8.0,2012.0,35.0,73.25,17.0,0.5,88.3,20.75,68.15,7.3,22.525,11.375
max,31.0,9.0,2012.0,42.0,90.0,29.0,16.8,96.0,65.9,220.4,19.0,68.0,31.1


CONCLUSION
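- The cleaned data matches the Dataset Information in size and classes (244 instances, 2 classes, 122 per region). A few observed ranges differ slightly from the documented ones (for example, FFMC reaches 96.0 against a documented maximum of 92.5, and ISI reaches 19.0 against 18.5).
- We can export this clean dataset.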
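
One way to compare the observed ranges with those listed under Attribute Information is sketched below. The helper `out_of_range` and the `documented` dictionary are hypothetical; the documented bounds are copied from the attribute list, and the sample frame holds only the `min` and `max` values reported by `df.describe()` above.

```python
import pandas as pd

# Documented (min, max) per attribute, copied from the Attribute Information list.
documented = {
    'Temperature': (22, 42),
    'FFMC': (28.6, 92.5),
    'ISI': (0, 18.5),
}

def out_of_range(frame, ranges):
    """Return {column: (observed_min, observed_max)} for columns that
    fall outside their documented range."""
    report = {}
    for col, (low, high) in ranges.items():
        observed = (float(frame[col].min()), float(frame[col].max()))
        if observed[0] < low or observed[1] > high:
            report[col] = observed
    return report

# Only the observed extremes are needed for this check.
extremes = pd.DataFrame({
    'Temperature': [22, 42],
    'FFMC': [28.6, 96.0],
    'ISI': [0.0, 19.0],
})
report = out_of_range(extremes, documented)
print(report)
```

Running the same function on the full cleaned frame would flag any column whose values fall outside the published bounds.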


In [23]:
df.to_csv('./../data/Cleaned_Dataset.csv',index=False)