# Predicting the burned area of forest fires in the northeast region of Portugal

#### This dataset consists of 517 instances, each with 12 input attributes and one output attribute. The attributes are:
   
   1. **X** - x-axis spatial coordinate within the Montesinho park map: 1 to 9
   2. **Y** - y-axis spatial coordinate within the Montesinho park map: 2 to 9
   3. **month** - month of the year: "jan" to "dec" 
   4. **day** - day of the week: "mon" to "sun"
   5. **The Fine Fuel Moisture Code (FFMC)** - FFMC index from the FWI system: 18.7 to 96.20
   6. **The Duff Moisture Code (DMC)** - DMC index from the FWI system: 1.1 to 291.3 
   7. **The Drought Code (DC)** - DC index from the FWI system: 7.9 to 860.6 
   8. **The Initial Spread Index (ISI)** - ISI index from the FWI system: 0.0 to 56.10
   9. **temp** - temperature in Celsius degrees: 2.2 to 33.30
   10. **RH** - relative humidity in %: 15.0 to 100
   11. **wind** - wind speed in km/h: 0.40 to 9.40 
   12. **rain** - outside rain in mm/m2 : 0.0 to 6.4 
   13. **area** - the burned area of the forest (in ha): 0.00 to 1090.84 
   (this output variable is very skewed towards 0.0, thus it may make
    sense to model with the logarithm transform).
    
More information about the FWI system indices (FFMC, DMC, DC, ISI) is available from the [Canadian Wildland Fire Information System](https://cwfis.cfs.nrcan.gc.ca/background/summary/fwi).
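
The log transform suggested for `area` can be sketched as follows. This is a minimal illustration on a few made-up values (not rows from the dataset), assuming `np.log1p`, i.e. log(1 + x), is used so that the many zero-area fires stay defined; `np.expm1` is its inverse and maps values back to hectares.

```python
import numpy as np

# Hypothetical burned areas in ha (illustrative values, not taken from the dataset)
area = np.array([0.0, 0.0, 0.5, 6.4, 1090.84])

# log(1 + x) is defined at 0, unlike a plain logarithm
log_area = np.log1p(area)

# invert the transform to get back to hectares
area_back = np.expm1(log_area)

print(log_area.round(3))
print(np.allclose(area, area_back))
```

If a model is trained on the transformed target, its predictions need the same inverse transform before metrics such as MAE are reported in hectares.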

The [sklearn](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) package is a widely used standard for ML algorithms that work out of the box, so we recommend using it here.

Citation: 

* This dataset is taken from the UCI Machine Learning Repository: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


* P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.

---

#### This is a regression problem with the target being the `area` column.

In [None]:
# Feel free to import more packages (e.g., numpy, sklearn packages) as required.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import mlflow
import mlflow.sklearn
import tempfile
import os

%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
path_ = 'https://s3-eu-west-1.amazonaws.com/fellowship-teaching-materials/data-practical/forestfires.csv'

In [None]:
df = pd.read_csv(path_)
print(len(df))
df.head()

## Check data
- The data are clean: no null values were found.
- The `area` ranges from 0 to approximately 1090 ha, with very few large values. We limit the y-axis of the plots to 400 so the bulk of the data is easier to see.


In [None]:
print(f'Null values: {df.isnull().values.any()}')

In [None]:
print(f'NaN values: {df.isna().values.any()}' )

In [None]:
features = df.columns[:-1]  # every column except the target `area`

fig = plt.figure('Features summary', (16, 14))
Ax = fig.subplots(4, 3, sharex=False, sharey=True, squeeze=False)

for ii, feature in enumerate(features):
    rr, cc = divmod(ii, 3)  # row and column in the 4x3 subplot grid

    ax = Ax[rr, cc]
    ax.set_ylabel("area")
    ax.set_title(feature)
    ax.plot(df[feature].values, df.area.values, ls='', marker='.')
    # clip the y-axis so the few very large fires do not dominate the plot
    ax.set_ylim(0, 400)


## Simple model - Linear regression
- Perform any necessary feature engineering (`month` and `day`)
- Choose features
- Split the dataset
---

#### Turn `month` and `day` into features

 * Map the `day` feature to integer codes (Monday = 1 through Sunday = 7)
 * One hot encode the `month` feature

In [None]:
df.day.unique()

In [None]:
day_map = {'fri':5, 'mon':1, 'tue':2, 'sat':6, 'sun':7,  'wed':3, 'thu':4}
df.day = df.day.map(day_map)

OH_month = pd.get_dummies(df.month)

df = df.drop("month", axis=1)
df = df.merge(OH_month, left_index=True, right_index=True)

In [None]:
df.head()

In [None]:
X,y = df.drop("area", axis=1), df["area"]
X

---
Split the data for model training and testing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1935)

### Use MLflow and the platform feature 'Experiments' to keep track of our experiment setup, model, and results.

* This allows you to track multiple experiments and compare the performance of different parameter choices.

In [None]:
## Please change this to 'forest-fires-yourName'
experiment_name = "forest-fires-default"

In [None]:
mlflow.set_experiment(experiment_name) 

In [None]:
# setup model and hyper_parameters
hyper_parameters = {'fit_intercept': True}
model = LinearRegression(fit_intercept=hyper_parameters['fit_intercept'])
model_name = 'LinearRegression'

In [None]:
with mlflow.start_run():
    
    # set useful tags about experiment setup
    mlflow.set_tag("model", model_name)
    mlflow.set_tag("features", 'all')
    
    # track your model parameters
    for name, val in hyper_parameters.items():
        mlflow.log_param(name, val)

    # train model
    model.fit(X_train, y_train)
    # log trained model 
    mlflow.sklearn.log_model(model, model_name)

    # evaluate model
    predictions = model.predict(X_test)
    MAE = mean_absolute_error(y_test, predictions)
    RMSE = np.sqrt(mean_squared_error(y_test, predictions))
    RSQ = r2_score(y_test, predictions)
    # log performance metrics
    mlflow.log_metric('MAE', MAE)
    mlflow.log_metric('RMSE', RMSE)
    mlflow.log_metric('RSQ', RSQ)
    
    # plot and log residuals
    with tempfile.TemporaryDirectory() as temp_dir:
        image_path = os.path.join(temp_dir, "residuals.png")
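        # plot model residuals directly; sns.residplot would instead fit its own regression of y on x
        residuals = y_test - predictions
        fig, ax = plt.subplots(figsize=(16,8))
        ax.scatter(predictions, residuals, color="g")
        ax.axhline(0, color="k", ls="--")
        ax.set_xlabel('predicted area')
        ax.set_ylabel('residual (actual - predicted)')
        ax.set_title('model residuals')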
        plt.savefig(image_path)
        plt.show()
        plt.close()
        mlflow.log_artifact(image_path)

mlflow.end_run()  
print('Experiment Finished')
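
The three metrics logged above (MAE, RMSE and R²) can be reproduced by hand, which may help when interpreting them. The sketch below uses a handful of made-up targets and predictions (the names `y_true` and `y_pred` and their values are illustrative assumptions, not results from this notebook) and compares the manual formulas with the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions (illustrative values only)
y_true = np.array([0.0, 1.5, 10.0, 50.0])
y_pred = np.array([2.0, 3.0, 8.0, 20.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
# R^2 = 1 - SS_res / SS_tot; 0 means "no better than always predicting the mean"
rsq = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mean_absolute_error(y_true, y_pred))
print(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print(rsq, r2_score(y_true, y_pred))
```

Because RMSE squares each error, a single badly predicted large fire influences it far more than it influences MAE, which is worth keeping in mind with a target as skewed as `area`.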

---
## Challenge

Can you improve on the above performance? Below are some suggestions for things to look into, but you're free to try anything that comes to mind. At the end of the session we will see who managed to train the best-performing model, and what experimental design they used.
 * Does it make sense to represent the `month` and `day` features in the above way?
 * Should we decrease the number of variables that are being used?
 * Is there another model we can try?
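
As one possible starting point for the first question, the sketch below shows a cyclical (sine/cosine) encoding of `month`. This is only one option among several and is not the method used above; the small `demo` frame and the `month_sin` / `month_cos` column names are hypothetical. The idea is that December and January end up close together, which neither an integer code nor one-hot encoding expresses.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame using the same month abbreviations as the dataset
demo = pd.DataFrame({"month": ["jan", "jun", "dec"]})

months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
month_num = demo["month"].map({m: i for i, m in enumerate(months)})

# place each month on the unit circle (period of 12 months)
angle = 2 * np.pi * month_num / 12
demo["month_sin"] = np.sin(angle)
demo["month_cos"] = np.cos(angle)

print(demo)
```

The same pattern would apply to `day` with a period of 7. Whether it actually helps on this dataset is something to check experimentally, for example by logging a separate MLflow run for each encoding.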
 