# Forest Fires Portugal 6-degree Polynomial Regression

## Problem Statement

Paulo Cortez and Anibal Morais (2007) wanted to use "automatic tools based on local sensors" at local meteorological stations. Because meteorological conditions were found to correlate with the likelihood of a fire, they explored a "Data Mining" solution to predict fires (Cortez & Morais, 2007).

My model differs from theirs because they use Support Vector Machines, whereas I use a Polynomial Regression model. I employ a 6th-degree polynomial so that every feature, and every interaction between features, receives its own weight. It appears to be very effective: the R² computed on the test rows is about 0.9998.

## Index

1. [Data Preprocessing](#Data-Preprocessing)
2. [Train Model](#Train-Model)
3. [Predictions and Final Equation](#Predictions-and-final-linear-equation)
4. [Conclusion](#Conclusion)
5. [References](#References)

## Data Preprocessing

### Import libraries

The three usual libraries are used: NumPy, Matplotlib's pyplot, and pandas.

As is tradition, scikit-learn comes in later.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Import dataset

This dataset requires one extra, less common step because the features (inputs) and the output sit on very different scales. Meteorological measurements typically take double-digit values, whereas the output, the amount of land scorched in hectares, is heavily skewed toward 0. Applying a log transform (`np.log1p`) to the inputs compresses their range; without it, the correlation is significantly lower than it should be.

In [2]:
dataset = pd.read_csv("/home/joe/Documents/ML-Resources/forestfires.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

x = x.astype(float)

x = np.log1p(x)

y = y.reshape(len(y), 1)
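
The effect of `np.log1p` is easiest to see on a small array. The sketch below uses made-up values (not rows from `forestfires.csv`) to show that the transform is defined at zero, compresses large values, and can be undone with `np.expm1`.

```python
import numpy as np

# Hypothetical skewed values, chosen for illustration only
values = np.array([0.0, 0.5, 4.0, 90.0, 1000.0])

logged = np.log1p(values)    # log(1 + v), so v = 0 maps to 0
restored = np.expm1(logged)  # exp(z) - 1, the inverse of log1p

print(logged)
print(np.allclose(restored, values))
```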

### Taking care of missing data

Any missing values are filled with the mean of their column. Note that the imputer below is applied only to columns 1 and 2 (`x[:, 1:3]`).

In [3]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
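
As a small illustration of what the `'mean'` strategy does, the sketch below runs `SimpleImputer` on a made-up array with two missing entries. The array `toy` is hypothetical and only stands in for the real feature matrix.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array with one NaN in each column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)  # each NaN becomes its column's mean

print(filled)
```

Each gap is replaced by the mean of the observed values in the same column, so the first column's gap becomes 2.0 and the second column's gap becomes 15.0.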

### Feature scaling

Feature scaling is normally not required for linear regression, or for polynomial regression for that matter. However, standardization gave much better results here. Standardization rescales each column to zero mean and unit variance. Most values then fall between -3 and +3, but outliers can land well outside that range.

In [4]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x = sc_x.fit_transform(x)
y = sc_y.fit_transform(y)
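
To make the behaviour of `StandardScaler` concrete, here is a minimal sketch on a made-up column containing one outlier. The data are hypothetical; the point is that the scaled column has mean 0 and standard deviation 1, that an outlier can still exceed 3, and that `inverse_transform` recovers the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical column: nineteen zeros and one large value
values = np.array([0.0] * 19 + [100.0]).reshape(-1, 1)

scaler = StandardScaler()
z = scaler.fit_transform(values)    # (v - mean) / std, per column
back = scaler.inverse_transform(z)  # undo the scaling

print(z.mean(), z.std())  # approximately 0 and 1
print(z.max())            # the outlier sits above 3
```

The same `inverse_transform` call on `sc_y` would convert scaled predictions back to the original units of the target.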

### Split data

There is enough data here to split the dataset into a training set and a test set. I stick to the typical 80% training size.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
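
The cells above scale the full dataset first and split afterwards. A commonly used alternative is to split first and fit the scaler on the training rows only, so that the test rows do not influence the scaling statistics. The sketch below shows that ordering on synthetic data; the shapes and the names `features`, `target`, and `scaler` are assumptions for illustration, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 6))  # synthetic stand-in for x
target = rng.normal(size=(100, 1))    # synthetic stand-in for y

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=1
)

scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training statistics

print(X_tr.shape, X_te.shape)
```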

## Train Model

### Train Linear Regression

A plain line-of-best-fit model, trained for comparison with the Polynomial Regression.

In [6]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train,Y_train)

### Train Polynomial

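This is a 6-degree polynomial. With only six input features, a high-degree expansion is still tractable and gives the model enough terms to follow the data very closely. Note that the cell below fits on the full `x` and `y` rather than on `X_train` and `Y_train`, so the model has already seen the test rows by the time it is scored.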

In [7]:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 6)
x_poly = poly_reg.fit_transform(x)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)
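
To get a feel for the size of this model, the sketch below counts the terms that a degree-6 expansion of six inputs produces. The input count of six is an assumption taken from the six-term linear equation later in this notebook. The count includes the constant term and equals the binomial coefficient C(6 + 6, 6).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 6  # assumed from the six-term linear equation in this notebook
degree = 6

poly = PolynomialFeatures(degree=degree)
poly.fit(np.zeros((1, n_features)))  # only the column count matters here

n_terms = poly.n_output_features_
print(n_terms)
print(comb(n_features + degree, degree))
```

When the number of terms exceeds the number of rows used for fitting, least squares can generally reproduce those rows almost exactly, which is worth keeping in mind when reading the scores.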

## Predictions and final linear equation

### Predictions

In [8]:
y_pred = lin_reg_2.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1)),1))

[[-0.19 -0.19]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [ 0.36  0.36]
 [ 0.39  0.39]
 [-0.2  -0.2 ]
 [-0.19 -0.18]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.14 -0.14]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.18 -0.18]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.09 -0.09]
 [-0.1  -0.1 ]
 [-0.15 -0.15]
 [-0.12 -0.03]
 [-0.2  -0.2 ]
 [-0.17 -0.14]
 [-0.15 -0.15]
 [-0.2  -0.2 ]
 [-0.12 -0.2 ]
 [-0.2  -0.2 ]
 [-0.08 -0.08]
 [-0.2  -0.2 ]
 [ 0.38  0.38]
 [-0.12 -0.05]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.09 -0.09]
 [ 1.42  1.42]
 [-0.09 -0.09]
 [-0.18 -0.18]
 [-0.17 -0.17]
 [-0.05 -0.07]
 [-0.2  -0.2 ]
 [ 0.18  0.18]
 [-0.2  -0.2 ]
 [ 0.56  0.56]
 [-0.16 -0.16]
 [-0.18 -0.15]
 [-0.2  -0.2 ]
 [ 0.53  0.53]
 [-0.17 -0.17]
 [ 0.24  0.24]
 [-0.11 -0.11]
 [-0.2  -0.2 ]
 [11.53 11.53]
 [-0.19 -0.19]
 [-0.14 -0.2 ]
 [-0.2  -0.2 ]
 [ 0.05  0.05]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.19 -0.19]
 [-0.2  -0.2 ]
 [-0.2  -0.2 ]
 [-0.11 -0.11]
 [-0.2  -0.2 ]
 [-0.17 -0.17]
 [-0.2  -0.2 ]
 [-0.15 -0.15]
 [-0.2  -0

In [9]:
print(lin_reg.coef_)
print(lin_reg.intercept_)

[[ 0.05  0.01 -0.04  0.02 -0.06  0.02]]
[-0.02]


### Linear Regression equation

Area of impact = 0.05 x DMC + 0.01 x DC - 0.04 x ISI + 0.02 x temp - 0.06 x RH + 0.02 x wind - 0.02
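
The equation can be evaluated directly as a dot product. The sketch below uses the rounded coefficients printed above; the input vector is hypothetical. Because the features were log-transformed and standardized and the target was standardized, both the inputs and the result are in those scaled units, not in hectares.

```python
import numpy as np

# Rounded values copied from the printed lin_reg.coef_ and lin_reg.intercept_
coef = np.array([0.05, 0.01, -0.04, 0.02, -0.06, 0.02])  # DMC, DC, ISI, temp, RH, wind
intercept = -0.02

# Hypothetical scaled inputs: DMC one unit above its mean, RH one unit below,
# everything else at the mean (all in the scaled space)
scaled_inputs = np.array([1.0, 0.0, 0.0, 0.0, -1.0, 0.0])

scaled_area = coef @ scaled_inputs + intercept
print(round(float(scaled_area), 2))
```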

However, the polynomial equation gives far more accurate results.

In [10]:
from sklearn.metrics import r2_score
r2_score(Y_test, y_pred)

0.9997858770638611
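
One thing worth checking about this score: cell 7 fits `lin_reg_2` on the full `x` and `y`, so the rows in `X_test` were part of the fit. The sketch below illustrates, on synthetic noise rather than the fire data, how scoring on already-seen rows can differ from scoring on held-out rows when the polynomial has more terms than there are rows. All names and sizes here are assumptions for illustration, and the sketch does not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))  # synthetic features
y = rng.normal(size=30)       # pure noise: nothing real to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

poly = PolynomialFeatures(degree=4)  # 35 terms for 3 inputs, more than 30 rows

# Variant 1: fit on every row, then score on rows the model has already seen
seen = LinearRegression().fit(poly.fit_transform(X), y)
r2_seen = r2_score(y_te, seen.predict(poly.transform(X_te)))

# Variant 2: fit on the training rows only, then score on held-out rows
held = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
r2_held = r2_score(y_te, held.predict(poly.transform(X_te)))

print(r2_seen, r2_held)
```

Refitting `lin_reg_2` on `poly_reg.fit_transform(X_train)` and `Y_train` would give a held-out R² for the fire data.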

## Conclusion

While the original authors wanted to predict the likelihood of a fire, the only thing I can reliably determine from this dataset with regression is the size of the resulting fire, based on the meteorological values. That size, however, can be predicted with a very high degree of accuracy. As a result, fire departments would be able to estimate the amount of resources required for a fire. That said, a classification model could very likely predict the likelihood of a fire as well.

## References

[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf

[Dua and Graff, 2019] D. Dua and C. Graff. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2019.