# Optimizing Model Predictions

## Goal

Optimize the model's predictions through different strategies, where applicable:
- Handling outliers
- K-fold cross validation
- Regularization
- Non-linear models

We will start with a base model and iterate to find one that performs better.

The goal is to predict the extent of fire damage to a forest.
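The iteration described above can be sketched as follows: score a baseline linear model with K-fold cross-validation, then score a regularized alternative the same way and compare. This is only an illustration on synthetic data; the feature matrix, the coefficients, the `cv_mse` helper, and the `alpha=1.0` setting are all assumptions and are not taken from the Forest Fires data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 5 features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Use the same folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_mse(model, X, y):
    """Return the mean cross-validated MSE (hypothetical helper)."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()  # scikit-learn reports negated MSE, so flip the sign


results = {
    "linear": cv_mse(LinearRegression(), X, y),  # base model
    "ridge": cv_mse(Ridge(alpha=1.0), X, y),     # L2-regularized candidate
}
for name, mse in results.items():
    print(f"{name}: CV MSE = {mse:.3f}")
```

Keeping the fold definition fixed means any difference in score comes from the model rather than from a different split of the data.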

## Data

We will use the [Forest Fires](https://archive.ics.uci.edu/ml/datasets/Forest+Fires) data set from the UCI Machine Learning Repository.

- **X**: x-axis spatial coordinate within the Montesinho park map: 1 to 9
- **Y**: y-axis spatial coordinate within the Montesinho park map: 2 to 9
- **month**: month of the year: 'jan' to 'dec'
- **day**: day of the week: 'mon' to 'sun'
- **FFMC**: FFMC index from the FWI system: 18.7 to 96.20
- **DMC**: DMC index from the FWI system: 1.1 to 291.3
- **DC**: DC index from the FWI system: 7.9 to 860.6
- **ISI**: ISI index from the FWI system: 0.0 to 56.10
- **temp**: temperature in degrees Celsius: 2.2 to 33.30
- **RH**: relative humidity in %: 15.0 to 100
- **wind**: wind speed in km/h: 0.40 to 9.40
- **rain**: outside rain in mm/m2: 0.0 to 6.4
- **area**: the burned area of the forest (in ha): 0.00 to 1090.84
  (this output variable is heavily skewed towards 0.0, so it may make
  sense to model it with a logarithm transform).
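
The logarithm transform suggested for **area** could look like the sketch below. Because the variable contains exact zeros, the sketch uses `np.log1p` (that is, log(1 + x)) instead of a plain logarithm; that choice is an assumption, not something the data set description prescribes. The sample values are made up for illustration, apart from the two endpoints of the documented range.

```python
import numpy as np
import pandas as pd

# Illustrative values only (assumption); 0.0 and 1090.84 are the documented range limits.
area = pd.Series([0.0, 0.0, 0.36, 2.14, 10.73, 1090.84], name="area")

# log1p maps 0 to 0 and compresses the long right tail.
log_area = np.log1p(area)

# expm1 inverts the transform, so predictions can be mapped back to hectares.
recovered = np.expm1(log_area)

print("skew before:", area.skew())
print("skew after: ", log_area.skew())
```

A model trained on the transformed target predicts on the log scale, so its predictions need `np.expm1` applied before they are read as hectares.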

## Libraries

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

## Model Development

### Load data

In [2]:
ff = pd.read_csv('data/forestfires.csv')

### EDA