## Activity 1.02: Forest Fire Size and Temperature Analysis

In this activity, we will use pandas features to derive some insights from a forest fire dataset. We will find the mean size of the forest fires, the largest fire recorded in our dataset, and whether the number of forest fires grows proportionally to the temperature in each month.

#### Loading the dataset

In [5]:
# importing the necessary dependencies
import pandas as pd

In [6]:
# loading the Dataset
dataset = pd.read_csv('../../Datasets/forestfires.csv')

In [7]:
# looking at the first two rows of the dataset
dataset[0:2]

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0


The dataset contains: 
- X - x-axis spatial coordinate within the Montesinho park map: 1 to 9 
- Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9 
- **month - month of the year: 'jan' to 'dec'**
- day - day of the week: 'mon' to 'sun' 
- FFMC - FFMC index from the FWI system: 18.7 to 96.20 
- DMC - DMC index from the FWI system: 1.1 to 291.3 
- DC - DC index from the FWI system: 7.9 to 860.6 
- ISI - ISI index from the FWI system: 0.0 to 56.10 
- **temp - temperature in Celsius degrees: 2.2 to 33.30**
- RH - relative humidity in %: 15.0 to 100 
- wind - wind speed in km/h: 0.40 to 9.40 
- rain - outside rain in mm/m2 : 0.0 to 6.4 
- **area - the burned area of the forest (in ha): 0.00 to 1090.84**


**Note:**   
The fields that we'll be working with are highlighted in the listing.

---

#### Getting insights into the sizes of forest fires

Looking at the first two rows of our dataset, we can already see that it contains entries in which the area is 0.   
For this first task, we only care about fires with an area greater than 0.   

Create a new dataset that only contains the entries with an area value of > 0.

In [8]:
# filter the dataset for rows that have an area > 0
area_dataset = dataset[dataset["area"] > 0]

area_dataset

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
138,9,9,jul,tue,85.8,48.3,313.4,3.9,18.0,42,2.7,0.0,0.36
139,1,4,sep,tue,91.0,129.5,692.6,7.0,21.7,38,2.2,0.0,0.43
140,2,5,sep,mon,90.9,126.5,686.5,7.0,21.9,39,1.8,0.0,0.47
141,1,2,aug,wed,95.5,99.9,513.3,13.2,23.3,31,4.5,0.0,0.55
142,8,6,aug,fri,90.1,108.0,529.8,12.5,21.2,51,8.9,0.0,0.61
...,...,...,...,...,...,...,...,...,...,...,...,...,...
509,5,4,aug,fri,91.0,166.9,752.6,7.1,21.1,71,7.6,1.4,2.17
510,6,5,aug,fri,91.0,166.9,752.6,7.1,18.2,62,5.4,0.0,0.43
512,4,3,aug,sun,81.6,56.7,665.6,1.9,27.8,32,2.7,0.0,6.44
513,2,4,aug,sun,81.6,56.7,665.6,1.9,21.9,71,5.8,0.0,54.29
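
The filtering step above relies on boolean indexing. The following is a minimal sketch of how that works, assuming a small made-up DataFrame (`toy`, `mask`, and `burned` are illustrative names, and the values are not taken from `forestfires.csv`):

```python
import pandas as pd

# Toy data, made up for illustration (not rows from forestfires.csv)
toy = pd.DataFrame({
    "month": ["mar", "aug", "aug", "sep"],
    "area": [0.0, 0.36, 0.0, 6.44],
})

# Comparing a column to a scalar yields a boolean Series, one entry per row
mask = toy["area"] > 0

# Indexing with the mask keeps only the rows where it is True;
# the surviving rows keep their original index labels
burned = toy[mask]

print(mask.tolist())
print(burned.index.tolist())
```

Because the original index labels are kept, the filtered output above starts at index 138 rather than 0. If a fresh 0-based index is preferred, `reset_index(drop=True)` can be chained onto the filtered result.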


After filtering out the zero-area entries, we can use the pandas `mean` method to get the mean burned area of the remaining forest fires.

Get the mean value for the `area` column of our filtered dataset.

In [9]:
# get the mean value for the area column
area_dataset["area"].mean()

24.600185185185182

In addition to that, looking at the largest and smallest non-zero area can help us understand the range of possible area sizes.
Let's get more insights into that.

- Use the `min` and `max` methods to see the smallest and largest area that has been affected by a forest fire.
- Use the `std` method to get insights into how much variation there is in our dataset.

In [10]:
# get the smallest area value from our dataset
area_dataset["area"].min()

0.09

In [11]:
# get the largest area value from our dataset
area_dataset["area"].max()

1090.84

In [12]:
# get the standard deviation of values in our dataset
area_dataset["area"].std()

86.50163460412126
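
As a side note on the `std` method used above, here is a small sketch on made-up values (the Series `values` is purely illustrative). It shows that pandas computes the sample standard deviation by default (`ddof=1`) and that `describe` gathers the same statistics in a single call:

```python
import pandas as pd

# Made-up values for illustration (not taken from forestfires.csv)
values = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas defaults to the sample standard deviation (ddof=1)
sample_std = values.std()

# Passing ddof=0 gives the population standard deviation instead
population_std = values.std(ddof=0)

# describe() reports count, mean, std, min, quartiles, and max together
summary = values.describe()
print(summary)
```

On the real data, `area_dataset["area"].describe()` should give the same mean, minimum, maximum, and standard deviation as the individual cells above.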

The largest value is much larger than our mean.   
The standard deviation is also quite large compared to the mean, which suggests that a few very large values skew the data and that the mean and the "middle value" (the median) are likely to differ considerably.

Let's look at the last 20 values of our sorted dataset to see if we have more than one very large value.   
Sort the filtered dataset by the `area` column and output the last 20 entries from it.

In [13]:
# sorting the filtered dataset and printing the last 20 elements 
area_dataset.sort_values(by=["area"]).tail(20)

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
469,6,3,apr,sun,91.0,14.6,25.6,12.3,13.7,33,9.4,0.0,61.13
228,4,6,sep,sun,93.5,149.3,728.6,8.1,28.3,26,3.1,0.0,64.1
473,9,4,jun,sat,90.5,61.1,252.6,9.4,24.5,50,3.1,0.0,70.32
392,1,3,sep,sun,91.0,276.3,825.1,7.1,21.9,43,4.0,0.0,70.76
229,8,6,aug,sat,92.2,81.8,480.8,11.9,16.4,43,4.0,0.0,71.3
457,1,4,aug,wed,91.7,191.4,635.9,7.8,19.9,50,4.0,0.0,82.75
293,7,6,jul,tue,93.1,180.4,430.8,11.0,26.9,28,5.4,0.0,86.45
230,4,4,sep,wed,92.9,133.3,699.6,9.2,26.4,21,4.5,0.0,88.49
231,1,5,sep,sun,93.5,149.3,728.6,8.1,27.8,27,3.1,0.0,95.18
232,6,4,sep,tue,91.0,129.5,692.6,7.0,18.7,43,2.7,0.0,103.39


As we can see here, only 11 out of the 270 rows contain values that are larger than 100.   
The 20th-largest value is already down to an area of about 60.

Let's imagine our dataset contained only one or two values that were much higher than the others, e.g. an area value of 10254.91. Just by looking at the data, such a value would look like an error made when the entry was added to the dataset.   
In a smaller dataset, the mean would be heavily distorted by this one entry. A more stable value to use in such a case is the median of the dataset.

Get the median value for the `area` column.

In [14]:
# calculate the median value for the area column
area_dataset["area"].median()

6.37

**Note:**   
Remember that the median is not the same as the mean of your dataset. While the median is simply the "value in the middle" of the sorted data, the mean is much more prone to distortion by outliers.
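
To make the effect of a single outlier concrete, here is a minimal sketch on made-up numbers. It uses Python's built-in `statistics` module instead of pandas so that it runs without the dataset; the list `areas` is invented, and the extra value is the hypothetical erroneous entry from the text above:

```python
from statistics import mean, median

# Made-up burned areas in ha (illustrative only, not rows from the dataset)
areas = [0.5, 2.0, 6.0, 9.5, 12.0]

# The same values plus the hypothetical erroneous entry
areas_with_outlier = areas + [10254.91]

# Without the outlier, mean and median agree
print(mean(areas), median(areas))

# With the outlier, the mean jumps by orders of magnitude
# while the median barely moves
print(mean(areas_with_outlier), median(areas_with_outlier))
```

The pandas `mean` and `median` methods behave the same way on a Series.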

---

#### Finding the month with the most forest fires

In this second task we want to quickly see which months have the most forest fires and whether or not the temperature has a direct connection to it.

Get a list of month values that are present in our dataset.

In [15]:
# get a list of month values from the dataset
months = dataset["month"].unique()

months

array(['mar', 'oct', 'aug', 'sep', 'apr', 'jun', 'jul', 'feb', 'jan',
       'dec', 'may', 'nov'], dtype=object)

In addition to the unique values, we also want to use the `shape` attribute of our dataset to determine how many rows it has.

Filter the dataset for only rows that contain the month `mar` and print the number of rows using `shape`.

In [16]:
# get the number of forest fires for the month of March
dataset[dataset["month"] == "mar"].shape[0]

54
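
As an aside, pandas can also tally the rows for every month in one call. The sketch below compares the filter-and-`shape` approach with `value_counts` on a made-up month column (`toy`, `mar_rows`, and `counts` are illustrative names, and the data is not from `forestfires.csv`):

```python
import pandas as pd

# Made-up month column for illustration
toy = pd.DataFrame({"month": ["mar", "aug", "aug", "sep", "aug", "mar"]})

# Filtering and reading shape[0], as in the cell above
mar_rows = toy[toy["month"] == "mar"].shape[0]

# value_counts() counts every distinct value in a single call
counts = toy["month"].value_counts()

print(mar_rows)
print(counts)
```

On the real data, `dataset["month"].value_counts()` should report the same count for `mar` as the cell above.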

The last step to fulfil the task is to iterate over all months, filtering our dataset for the rows containing the given month and calculating the mean temperature.

- Iterate over the months from the unique list we created
- Filter our dataset for the rows containing the given month
- Get the number of rows from `shape`
- Get the mean temperature for the given month
- Print a statement with the number of fires, mean temperature and the month

In [17]:
# iterate over the months list
# get number of forest fires for each month
# get mean temperature for each month
# print out number of fires and mean temperature
for month in months:
    # keep only the rows recorded in the current month
    month_dataset = dataset[dataset["month"] == month]
    fires_in_month = month_dataset.shape[0]
    # int() truncates the mean, hence the "~" in the printed statement
    avg_tmp_in_month = int(month_dataset["temp"].mean())

    print(f"{fires_in_month} fires in {month} with a mean temperature of ~{avg_tmp_in_month}°C")

54 fires in mar with a mean temperature of ~13°C
15 fires in oct with a mean temperature of ~17°C
184 fires in aug with a mean temperature of ~21°C
172 fires in sep with a mean temperature of ~19°C
9 fires in apr with a mean temperature of ~12°C
17 fires in jun with a mean temperature of ~20°C
32 fires in jul with a mean temperature of ~22°C
20 fires in feb with a mean temperature of ~9°C
2 fires in jan with a mean temperature of ~5°C
9 fires in dec with a mean temperature of ~4°C
2 fires in may with a mean temperature of ~14°C
1 fires in nov with a mean temperature of ~11°C
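
The loop above can also be expressed with `groupby`. This is only a sketch of an alternative, not the method used in this activity, and it runs on made-up data (`toy` and `summary` are illustrative names):

```python
import pandas as pd

# Made-up data for illustration (not rows from forestfires.csv)
toy = pd.DataFrame({
    "month": ["mar", "aug", "aug", "sep", "aug", "mar"],
    "temp": [8.0, 22.0, 24.0, 19.0, 20.0, 12.0],
})

# Group the rows by month, then compute the row count
# and the mean temperature for each group
summary = toy.groupby("month")["temp"].agg(["count", "mean"])

# Sort by the number of rows so the busiest month comes first
summary = summary.sort_values("count", ascending=False)
print(summary)
```

Having both values side by side in one DataFrame makes it easier to compare the number of fires with the mean temperature per month than scanning the printed lines.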
