# Forest Fires Portugal 5-degree Polynomial Regression

References

[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 

## Import libraries

In [166]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Import dataset

The output of this dataset (the fire area) is heavily skewed toward 0, so the inputs are transformed with log(1 + x) (`np.log1p`) to produce a higher correlation.

In [189]:
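dataset = pd.read_csv("/home/joe/Documents/ML-Resources/forestfires.csv")
# Features: every column except the last; target: the last column
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Cast to float, then apply log(1 + x) to the features
x = x.astype(float)
x = np.log1p(x)

# Reshape the target to a column vector, since StandardScaler expects 2-D input
y = y.reshape(-1, 1)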
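
To illustrate what a log transform does to values piled up near zero, here is a minimal sketch on made-up numbers. The `area` array is hypothetical and not taken from the dataset. `np.log1p` computes log(1 + v), so zeros stay at zero, and `np.expm1` undoes the transform.

```python
import numpy as np

# Hypothetical right-skewed values: several zeros and one large outlier.
area = np.array([0.0, 0.0, 0.5, 2.0, 10.0, 1000.0])

# log1p(v) = log(1 + v): defined at 0 and compresses large values.
area_log = np.log1p(area)

# expm1 is the inverse, so transformed values can be mapped back.
area_back = np.expm1(area_log)

print(area_log.round(2))
print(np.allclose(area_back, area))
```

The same pair of calls would apply whether the transform is used on inputs, as in this notebook, or on the target.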

## Taking care of missing data

It is unclear whether this dataset has missing values, so they are imputed with the column mean as a precaution.

In [190]:
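from sklearn.impute import SimpleImputer
# Replace NaN entries with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Only columns 1 and 2 are imputed here; any other column is left untouched
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])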
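
Rather than imputing just in case, missing values can be counted directly. The sketch below uses a small hypothetical DataFrame (`df` and its values are invented for illustration, with one gap placed in `RH`); on the real data the same call would be made on `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the forest fires table, with one gap in RH.
df = pd.DataFrame({
    "temp": [18.0, 21.5, 14.3],
    "RH": [33.0, np.nan, 51.0],
    "area": [0.0, 0.36, 0.0],
})

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = df.isna().sum()
print(missing_per_column)

# True if at least one column contains a missing value.
has_missing = bool(missing_per_column.any())
```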

## Feature scaling

Feature scaling is not normally required for linear regression, but with it this model produced the best R² score I have gotten, likely because of the difference in scale between the inputs and the output.

In [191]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x = sc_x.fit_transform(x)
y = sc_y.fit_transform(y)
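
A common variant, sketched here on synthetic data, fits the scalers on the training rows only so that the test rows do not influence the scaling statistics, and uses `inverse_transform` to return scaled values to their original units. All names and numbers below (`X`, `sc_X`, `sc_Y`, the random data) are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features
y = rng.exponential(scale=5.0, size=(100, 1))        # synthetic skewed target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Scaling statistics come from the training rows only.
sc_X = StandardScaler().fit(X_tr)
sc_Y = StandardScaler().fit(y_tr)

X_tr_s, X_te_s = sc_X.transform(X_tr), sc_X.transform(X_te)
y_tr_s = sc_Y.transform(y_tr)

# Scaled values map back to the original units with inverse_transform.
y_tr_back = sc_Y.inverse_transform(y_tr_s)
```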

## Split data

In [192]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=1)

## Train Linear Regression

In [193]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

## Train Polynomial

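This is a 5-degree polynomial. When the number of independent variables is small (4-5), a polynomial of degree n (where n is the number of independent variables) has seemed to work best in my experience. Note that the linear model in this notebook reports six coefficients, so the degree here is one less than the number of features.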

In [204]:
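from sklearn.preprocessing import PolynomialFeatures
# Expand the features into all polynomial terms up to degree 5
poly_reg = PolynomialFeatures(degree=5)
x_poly = poly_reg.fit_transform(x)
# Note: this fits on the full dataset (x, y), not only on X_train / Y_train
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)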
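
To get a feel for how large a degree-5 expansion is, the sketch below counts the terms `PolynomialFeatures` generates for six features and then shows one way to chain the expansion and the regression in a pipeline that is fit on training rows only. The data is synthetic, and the pipeline uses degree 2 (an assumption chosen to keep this small demo well-conditioned), so it is not a reproduction of the model above.

```python
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))                      # synthetic: 6 features
y_demo = X_demo[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# Degree-5 expansion of 6 features, bias column included.
poly = PolynomialFeatures(degree=5).fit(X_demo)
n_terms = poly.n_output_features_
print(n_terms, comb(6 + 5, 5))

# Pipeline: expansion and regression are fit together on training rows only.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_demo[:150], y_demo[:150])
held_out_r2 = model.score(X_demo[150:], y_demo[150:])
```

The term count follows the binomial formula C(n + d, d) for n features and degree d, which grows quickly as either number increases.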

## Predictions

In [205]:
y_pred = lin_reg_2.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1)),1))

[[-0.15 -0.19]
 [-0.5  -0.2 ]
 [-0.16 -0.2 ]
 [ 0.78  0.36]
 [ 0.4   0.39]
 [-0.2  -0.2 ]
 [ 0.25 -0.18]
 [-0.24 -0.2 ]
 [-0.2  -0.2 ]
 [-0.16 -0.14]
 [-0.24 -0.2 ]
 [-0.23 -0.2 ]
 [-0.19 -0.18]
 [-0.24 -0.2 ]
 [-0.06 -0.2 ]
 [-0.06 -0.2 ]
 [-0.41 -0.2 ]
 [-0.25 -0.09]
 [-0.06 -0.1 ]
 [-0.14 -0.15]
 [-0.13 -0.03]
 [-0.13 -0.2 ]
 [-0.16 -0.14]
 [ 0.08 -0.15]
 [ 0.15 -0.2 ]
 [-0.13 -0.2 ]
 [-0.15 -0.2 ]
 [-0.21 -0.08]
 [ 1.56 -0.2 ]
 [ 0.38  0.38]
 [-0.13 -0.05]
 [-0.55 -0.2 ]
 [-0.19 -0.2 ]
 [ 0.26 -0.09]
 [ 1.07  1.42]
 [ 0.39 -0.09]
 [-0.6  -0.18]
 [-0.13 -0.17]
 [-0.04 -0.07]
 [-0.96 -0.2 ]
 [ 0.13  0.18]
 [-0.2  -0.2 ]
 [ 0.81  0.56]
 [-0.17 -0.16]
 [-0.28 -0.15]
 [-0.28 -0.2 ]
 [ 0.49  0.53]
 [-0.17 -0.17]
 [ 0.33  0.24]
 [ 1.26 -0.11]
 [-0.09 -0.2 ]
 [11.38 11.53]
 [-0.21 -0.19]
 [-0.14 -0.2 ]
 [-0.52 -0.2 ]
 [-0.27  0.05]
 [-0.64 -0.2 ]
 [-0.47 -0.2 ]
 [-0.13 -0.19]
 [-0.19 -0.2 ]
 [-0.22 -0.2 ]
 [-0.1  -0.11]
 [ 0.36 -0.2 ]
 [-0.17 -0.17]
 [-0.2  -0.2 ]
 [-0.14 -0.15]
 [-0.24 -0

In [208]:
print(lin_reg.coef_)
print(lin_reg.intercept_)

[[ 0.05  0.01 -0.04  0.02 -0.06  0.02]]
[-0.02]


## Linear regression equation

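Using the coefficients and intercept printed above, with all variables in their scaled form (inputs log-transformed and standardized, area standardized):

Area of impact = 0.05 x DMC + 0.01 x DC - 0.04 x ISI + 0.02 x temp - 0.06 x RH + 0.02 x wind - 0.02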
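
An equation of this form is just the dot product of the coefficients with the features, plus the intercept. The sketch below checks that on synthetic data; `X_demo`, `y_demo`, and the weights are invented for the example and are not the notebook's values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))                       # synthetic, 6 features
y_demo = X_demo @ np.array([0.5, 0.1, -0.4, 0.2, -0.6, 0.2]) - 0.2

reg = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: sum of coefficient * feature, plus the intercept.
manual = X_demo @ reg.coef_ + reg.intercept_
print(np.allclose(manual, reg.predict(X_demo)))
```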

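However, the polynomial model gives far more accurate results on the test split. Keep in mind that `lin_reg_2` was fit on the full dataset, test rows included, so the R² score below is an optimistic estimate.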

In [207]:
from sklearn.metrics import r2_score
r2_score(Y_test, y_pred)

0.9221518054892045
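
One way to compare a linear and a polynomial model on equal footing is cross-validation, where each fold is scored on rows left out of fitting. The sketch below does this on synthetic data with an interaction term; the data, the degree-2 pipeline, and the variable names are assumptions for illustration, not results for the forest fires dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo[:, 0] * X_demo[:, 1] + 0.1 * rng.normal(size=300)

linear = LinearRegression()
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Each fold is scored on rows the model did not see during fitting.
linear_r2 = cross_val_score(linear, X_demo, y_demo, cv=5, scoring="r2").mean()
quadratic_r2 = cross_val_score(quadratic, X_demo, y_demo, cv=5, scoring="r2").mean()
print(round(linear_r2, 3), round(quadratic_r2, 3))
```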