<ul style = "font-size : 25px;">
<li ><a href = "#Loading the data">Loading the data</a></li>
<br/>    
<li><a href = "#Data preprocessing">Data preprocessing</a></li>
<br/>    
<li><a href = "#Feature Extraction">Feature Extraction</a></li>
<br/>    
<li><a href = "#Model Building">Model Building</a></li>       
</ul>   

# Loading the Data
<p id = "Loading the data"></p>

In [1]:
import numpy as np      # to deal with arrays
import pandas as pd     # to deal with dataframes
from sklearn.preprocessing import StandardScaler      # to standardize the data
from sklearn.model_selection import train_test_split  # to split the data into train and test sets
# these imports are responsible for building the ANN
from tensorflow.keras.layers import Dense       # builds hidden layers or output layers
from tensorflow.keras.layers import Input       # builds the input layer
from tensorflow.keras.models import Model       # initializes the model
from tensorflow.keras.layers import Dropout     # dropout regularization
from tensorflow.keras.optimizers import Adam    # Adam optimizer object to control the learning rate
from tensorflow.keras.callbacks import EarlyStopping     # early stopping callback
from tensorflow.keras.models import load_model        # to load a saved model

In [2]:
# read the data into a dataframe
df = pd.read_csv("forestfires.csv")

In [3]:
# preview the first 5 rows of it
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [4]:
# see all the columns datatypes and the size of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       517 non-null    int64  
 1   Y       517 non-null    int64  
 2   month   517 non-null    object 
 3   day     517 non-null    object 
 4   FFMC    517 non-null    float64
 5   DMC     517 non-null    float64
 6   DC      517 non-null    float64
 7   ISI     517 non-null    float64
 8   temp    517 non-null    float64
 9   RH      517 non-null    int64  
 10  wind    517 non-null    float64
 11  rain    517 non-null    float64
 12  area    517 non-null    float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB


In [5]:
# view the unique values of the day feature in the dataframe
df["day"].unique()

array(['fri', 'tue', 'sat', 'sun', 'mon', 'wed', 'thu'], dtype=object)

In [6]:
# view the unique values of the month feature in the dataframe
df["month"].unique()

array(['mar', 'oct', 'aug', 'sep', 'apr', 'jun', 'jul', 'feb', 'jan',
       'dec', 'may', 'nov'], dtype=object)

# Data preprocessing
<p id = "Data preprocessing"></p>

## <ul><li style = "list-style: circle">Dealing with missing values</li><li style = "list-style: circle">Dealing with duplicates</li><li style = "list-style: circle">Dealing with Outliers</li></ul>
<hr>

<h2><u>Dealing with missing values</u></h2>

In [7]:
# see if there are nulls in each column
df.isnull().sum()

X        0
Y        0
month    0
day      0
FFMC     0
DMC      0
DC       0
ISI      0
temp     0
RH       0
wind     0
rain     0
area     0
dtype: int64

**This dataset doesn't contain any null values**

<h2><u>Dealing with duplicates</u></h2>

In [8]:
# get the duplicates as boolean array then sum them to get number of duplicates
df.duplicated().sum()

4

**Duplicates in this dataset mean that a fire happened more than once in the same region, under the same weather conditions, and with the same recorded burned area, which can legitimately happen. So I do not remove them.**
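
Before deciding to keep duplicates it can help to look at the repeated rows themselves. The sketch below uses a small hypothetical frame (`demo` and its values are made up for illustration, not taken from the dataset) to show the difference between counting repeats and listing every copy.

```python
import pandas as pd

# Hypothetical three-row frame; the first two rows are identical.
demo = pd.DataFrame({
    "X": [7, 7, 8],
    "temp": [18.0, 18.0, 8.3],
    "area": [0.0, 0.0, 0.0],
})

# duplicated() marks only the later repeats of a row, so this counts repeats.
n_repeats = int(demo.duplicated().sum())

# keep=False marks every copy, which makes the duplicates easy to inspect.
all_copies = demo[demo.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```

Applied to `df`, the same `keep=False` filter would list every row involved in the duplicates counted above.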


<h2><u>Dealing with Outliers</u></h2>

In [9]:
# see the summary of the dataframe to check the distribution of each column
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X,517.0,4.669246,2.313778,1.0,3.0,4.0,7.0,9.0
Y,517.0,4.299807,1.2299,2.0,4.0,4.0,5.0,9.0
FFMC,517.0,90.644681,5.520111,18.7,90.2,91.6,92.9,96.2
DMC,517.0,110.87234,64.046482,1.1,68.6,108.3,142.4,291.3
DC,517.0,547.940039,248.066192,7.9,437.7,664.2,713.9,860.6
ISI,517.0,9.021663,4.559477,0.0,6.5,8.4,10.8,56.1
temp,517.0,18.889168,5.806625,2.2,15.5,19.3,22.8,33.3
RH,517.0,44.288201,16.317469,15.0,33.0,42.0,53.0,100.0
wind,517.0,4.017602,1.791653,0.4,2.7,4.0,4.9,9.4
rain,517.0,0.021663,0.295959,0.0,0.0,0.0,0.0,6.4
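
One common way to read a summary like this is Tukey's IQR rule, which flags values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below applies that rule to a few of the quartiles printed above. The rule and the factor 1.5 are an assumption of this example, not something the notebook itself uses, and `iqr_fences` is a hypothetical helper.

```python
# Minimum, quartiles and maximum copied from the df.describe() output above.
summary = {
    "FFMC": {"min": 18.7, "q1": 90.2, "q3": 92.9, "max": 96.2},
    "ISI":  {"min": 0.0,  "q1": 6.5,  "q3": 10.8, "max": 56.1},
    "temp": {"min": 2.2,  "q1": 15.5, "q3": 22.8, "max": 33.3},
    "rain": {"min": 0.0,  "q1": 0.0,  "q3": 0.0,  "max": 6.4},
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


flags = {}
for name, s in summary.items():
    low, high = iqr_fences(s["q1"], s["q3"])
    # A column has potential outliers if its extremes fall outside the fences.
    flags[name] = {"below": s["min"] < low, "above": s["max"] > high}
    print(name, round(low, 2), round(high, 2), flags[name])
```

Under this rule the minimum of FFMC and the maxima of ISI and rain fall outside the fences, which matches the long tails visible in the table.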


# Feature Extraction
<p id = "Feature Extraction"></p>

> __Steps :__
<br>
1- One-hot-encode the categorical variables day and month
<br>
2- Scale the features by standardization (StandardScaler)

In [10]:
# use one hot encoding with day and month
df = pd.get_dummies(df)
df.head()

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,month_nov,month_oct,month_sep,day_fri,day_mon,day_sat,day_sun,day_thu,day_tue,day_wed
0,7,5,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0,0,0,1,0,0,0,0,0,0
1,7,4,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0,1,0,0,0,0,0,0,1,0
2,7,4,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0,1,0,0,0,1,0,0,0,0
3,8,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0,0,0,1,0,0,0,0,0,0
4,8,6,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0,0,0,0,0,0,1,0,0,0
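
Depending on the pandas version, `get_dummies` may return the indicator columns as booleans rather than the 0/1 integers shown above. A minimal sketch, using a hypothetical three-row frame named `demo`, of asking for integer indicators and naming the columns to encode explicitly:

```python
import pandas as pd

# Hypothetical frame with the same kind of columns as the dataset.
demo = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6],
    "month": ["mar", "oct", "oct"],
    "day": ["fri", "tue", "sat"],
})

# Encode only the named columns and request integer 0/1 indicators.
encoded = pd.get_dummies(demo, columns=["month", "day"], dtype=int)

print(encoded.columns.tolist())
print(encoded["month_oct"].tolist())
```

Naming the columns keeps the encoding unaffected if another text column is added to the frame later.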


In [11]:
# shuffle the rows of the dataframe
df = df.sample(frac = 1)
# we want to select all the features except the target
features = list(df.columns)     # store all column names in a list
features.remove("area")         # remove the target "area"
X = df.loc[:,features].values   # store only the features inside the X variable
# apply ln(1+y) as a transformation to the target
Y = df.loc[:,"area"].apply(lambda y:np.log(1+y)).values.reshape(-1,1)
# to make the learning process faster I am going to standardize the features
scaler = StandardScaler() 
X = scaler.fit_transform(X)
# split the data into 20% test and 80% train; train_test_split shuffles by default
# random_state makes the split reproducible for a given row order, but df.sample above
# has no random_state, so the rows themselves are ordered differently on every run
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,random_state=100000)
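
The cell above fits the scaler on all rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A minimal NumPy sketch of the alternative, computing the statistics on the training rows only, is given here. The data is synthetic and every name ending in `_demo` is hypothetical. The formula matches what `StandardScaler` computes (population standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 100 rows, 3 features.
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Shuffle the row indices and hold out 20% of them for testing.
idx = rng.permutation(X_demo.shape[0])
n_test = int(0.2 * X_demo.shape[0])
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]

# Compute the scaling statistics from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

# Apply the same training statistics to both splits.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

With scikit-learn the equivalent is to call `fit_transform` on the training split and `transform` on the test split.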

><u style = "color : red">__Note :__</u>
We can reverse the transformation to retrieve the area Y from Y* 
<br>
<p> Y<sup>*</sup> = ln(Y + 1)</p>
<br>
<p> e<sup>Y<sup>*</sup></sup> = e<sup>ln(Y + 1)</sup></p>
<br>
<p>Y = e<sup>Y<sup>*</sup></sup> - 1</p>
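
The round trip in the note can be checked numerically. This sketch uses NumPy's `log1p` and `expm1`, which compute ln(1 + y) and e^y - 1 directly. The area values are hypothetical.

```python
import numpy as np

# Hypothetical burned areas, including the common value 0.
area = np.array([0.0, 0.5, 10.0, 250.0])

# Forward transformation: Y* = ln(Y + 1)
y_star = np.log1p(area)

# Inverse transformation: Y = e^(Y*) - 1
recovered = np.expm1(y_star)

print(y_star)
print(recovered)
```

An area of 0 maps to exactly 0 and back, which is why this transformation suits a target with many zeros.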


# Model Building
<p id = "Model Building"></p>

> __Model Implementation Details__: 
1) We are going to use the Keras functional API to build the model architecture
<br>
2) The dataset is very small (only 517 instances), so some underfitting is expected at first glance
<br>
3) I started with a small architecture and gradually increased its complexity until I had the architecture that minimizes the loss function without overfitting or underfitting
<br>
4) I have used early stopping to stop training when the validation loss is no longer decreasing
<br>
5) I have used dropout regularization to reduce overfitting 


In [12]:
# define the architecture of the model
input_feature = Input((29,),name = "features that controls the forestfires") # input layer
hidden_layer_1 = Dense(20, activation = "relu")(input_feature)               # first hidden layer
dropout_layer = Dropout(0.5,seed = 10000)(hidden_layer_1)                    # Dropout the first layer
hidden_layer_2 = Dense(10, activation = "relu")(dropout_layer)                # second hidden layer
output_layer = Dense(1, name = "transformed_area")(hidden_layer_2)           # output layer

In [13]:
stop_early = EarlyStopping(patience = 4)
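
By default `EarlyStopping` monitors `val_loss`, and `patience = 4` allows four consecutive epochs without improvement before training stops. The plain-Python sketch below is a simplified illustration of that patience rule, not Keras's implementation. `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=4):
    """Return the 1-based epoch at which a patience rule would stop training."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once the counter reaches the patience.
            wait += 1
            if wait >= patience:
                return epoch
    # Never ran out of patience, so training ends at the last epoch.
    return len(val_losses)


# Hypothetical validation losses: the best value is reached at epoch 8.
losses = [1.5, 1.3, 1.2, 1.1, 1.0, 0.98, 0.97, 0.96, 0.97, 0.99, 0.98, 1.0]
print(stopping_epoch(losses, patience=4))
```

In this made-up run the rule stops at epoch 12, four epochs after the best validation loss.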

In [14]:
model = Model(input_feature, output_layer)
# mean absolute error is used because many values of Y* are close to zero;
# unlike mean squared error, its gradient does not shrink as the error approaches zero
model.compile(optimizer = Adam(learning_rate = 0.1), loss = "mae")
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 features that controls the   [(None, 29)]             0         
 forestfires (InputLayer)                                        
                                                                 
 dense (Dense)               (None, 20)                600       
                                                                 
 dropout (Dropout)           (None, 20)                0         
                                                                 
 dense_1 (Dense)             (None, 10)                210       
                                                                 
 transformed_area (Dense)    (None, 1)                 11        
                                                                 
Total params: 821
Trainable params: 821
Non-trainable params: 0
_______________________________________________________________
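
The parameter counts in the summary can be checked by hand: a Dense layer with n inputs and m units has n * m weights plus m biases, and the Dropout layer has no parameters. A short sketch, where `dense_params` is a hypothetical helper:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the units of each Dense layer in the model above.
layer_sizes = [29, 20, 10, 1]

# Pair each layer's input width with its unit count.
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts, total)
```

The result agrees with the 600, 210 and 11 parameters per layer and the total of 821 reported by `model.summary()`.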

In [15]:
# fit the model to the data
model.fit(X_train, y_train,epochs = 200, batch_size =X_train.shape[0],verbose=True,\
          validation_data = (X_test,y_test), callbacks = [stop_early])

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200


<keras.callbacks.History at 0x167e0a9e290>

In [16]:
# see the losses
print("The training loss is : ",round(model.history.history["loss"][-1],1),\
     "\n The test loss is : ", round(model.history.history["val_loss"][-1],1))


The training loss is :  1.1 
 The test loss is :  1.0


**The training loss is close to the test loss, and both are fairly small. However, the loss needs to be much closer to zero for accurate predictions, and the dataset is too small for a neural network to learn from, so better performance would require another ML model or more data. I saved the best model after shuffling the data multiple times.**
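
Because the loss is measured on Y* = ln(Y + 1), an absolute error of d on Y* means the predicted (area + 1) differs from the true (area + 1) by a factor of e^d. The sketch below shows what an error of 1.0, roughly the reported loss, would mean for a single instance. The true area of 10.0 is hypothetical.

```python
import math

log_error = 1.0    # absolute error on the ln(area + 1) scale
true_area = 10.0   # hypothetical true burned area

# An error of d in log space is a multiplicative factor of e^d on (area + 1).
factor = math.exp(log_error)

# Predictions that are exactly log_error above or below the true Y*.
upper = (true_area + 1) * factor - 1
lower = (true_area + 1) / factor - 1

print(round(factor, 2), round(lower, 2), round(upper, 2))
```

The loss is a mean over instances, so individual predictions can be off by more or less than this factor.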

>In the next few cells I'm just going to show how to predict the burned area of the forest for a given instance

In [17]:
model = load_model('forestfires_model.h5')

In [22]:
# pick a random instance from the scaled features
r = np.random.randint(0,X.shape[0])
x = X[r,:].reshape(1,-1)
# use the model to predict Y*
y_star = model.predict(x, verbose = False)[0][0]
# transform it back to Y
y = np.exp(y_star) - 1
print(y)
print(Y[r,:][0])

0.027894139289855957
0.0
