Abstract

During the last three decades, Brazil has suffered from big fires that destroy large areas of forest every year. In this kernel, I want to take a journey with you, analyzing reported forest fires in Brazil between 1998 and 2017. Basically, we want to answer the following questions:
- In which months, seasons and years are fires most active?
- Is there any correlation between the months with big fires?
- Which states suffered the most from forest fires?


So, let's start our journey...

Load libraries:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
sns.set_palette('husl')
import missingno as msn
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Import and read our data:

In [None]:
data = pd.read_csv('../input/forest-fires-in-brazil/amazon.csv', encoding='latin1')
data.head()


Checking for missing values.

In [None]:
msn.matrix(data)

That's fine, there are no missing values. Let us check the data types in our data.

In [None]:
data.info()
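
The matrix plot and `info()` can be backed up with plain numbers. Here is a minimal sketch that counts missing values per column; the small `sample` frame is a hypothetical stand-in for `data`, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for `data`; in the notebook, use the real DataFrame instead.
sample = pd.DataFrame({
    'year': [1998, 1998, 1999],
    'state': ['Acre', 'Acre', 'Bahia'],
    'month': ['Janeiro', 'Fevereiro', 'Janeiro'],
    'number': [0.0, 10.0, 25.0],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# True only if no column contains any missing value.
has_no_missing = not sample.isna().any().any()
print(has_no_missing)
```

On the real data, `data.isna().sum()` should give the same answer as the matrix plot, just in table form.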

The date feature is not important for this analysis, so we are going to drop it and rename the other features.

In [None]:
data['month'].unique()

In [None]:
# drop() with inplace=True returns None, so build the new frame without it
df = data.drop('date', axis=1)
df = df.rename({'year':'Year', 'state':'State', 'number':'Fires', 'month':'Month'}, axis=1)
df.head()

Translate the months into English:

In [None]:
df['Month'].unique()

In [None]:
english_months = {'Janeiro':'January', 'Fevereiro':'February', 'Março':'March', 'Abril':'April', 'Maio':'May', 'Junho':'June', 'Julho':'July', 'Agosto':'August', 'Setembro':'September', 'Outubro':'October', 'Novembro':'November', 'Dezembro':'December'}
df['Month'] = df['Month'].map(english_months)
df['Month'].unique()
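
One thing worth checking after a dictionary translation: `Series.map` returns NaN for any value that is not a key in the dictionary, for example a month name that was read with a different spelling or encoding. A small sketch of that check follows; the `sample` frame and its deliberately misspelled `'Marco'` are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'Marco' is deliberately absent from the mapping below.
sample = pd.DataFrame({'Month': ['Janeiro', 'Fevereiro', 'Marco', 'Janeiro']})
english_months = {'Janeiro': 'January', 'Fevereiro': 'February', 'Março': 'March'}

translated = sample['Month'].map(english_months)

# Series.map yields NaN for values missing from the dictionary,
# so the NaN positions point at the original names that were not translated.
unmapped = sorted(sample.loc[translated.isna(), 'Month'].unique())
print(unmapped)
```

If the list is empty on the real data, every month was translated.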

In [None]:
fires_per_year = df.groupby(['Year'], as_index=False)['Fires'].agg('sum').round(0)
fires_per_year.head()

In [None]:
fires_per_year['Fires'].describe()

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7), dpi=72)
sns.regplot(data=fires_per_year, x='Year', y='Fires', ax=ax)
plt.xticks(np.arange(1998, 2018, 1))
plt.show()

Based on the plot above, the yearly totals between 2009 and 2015 almost all fall within the shaded band around the regression line, i.e. close to what the linear trend predicts.
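
To see what "close to the trend" means in numbers, here is a minimal sketch of a straight-line fit like the one a regression plot draws, using NumPy's `polyfit`. The yearly totals are made up, not taken from the dataset, and centering the years is my own choice to keep the fit numerically well behaved.

```python
import numpy as np

# Hypothetical yearly totals, not taken from the dataset.
years = np.array([2000, 2001, 2002, 2003, 2004], dtype=float)
fires = np.array([100.0, 120.0, 135.0, 160.0, 185.0])

# Center the years so the intercept is the fitted value at the mean year.
centered = years - years.mean()

# Least-squares straight line: fires ≈ slope * centered_year + intercept.
slope, intercept = np.polyfit(centered, fires, deg=1)

# Residuals show how far each year sits from the fitted line.
residuals = fires - (slope * centered + intercept)
for year, res in zip(years, residuals):
    print(int(year), round(float(res), 1))
```

Years with small residuals are the ones that sit near the line; large positive residuals mark years with more fires than the trend suggests.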

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.distplot(fires_per_year['Fires'], ax=ax, color='purple')

In [None]:

fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(data=fires_per_year, x='Year', y='Fires', color='red', ax=ax)
sns.lineplot(data=fires_per_year, x='Year', y='Fires', color='red', ax=ax)
plt.xticks(np.arange(1998, 2018, 1))




Clearly, the number of reported fires almost tripled between 1998 and 2016, with sharp increases in 1999, 2002, and 2009. We will therefore pay close attention to the months of those years.
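
The jumps can also be measured instead of read off the chart. Below is a small sketch using `pct_change` to compute the year-over-year change; the `yearly` frame is a hypothetical stand-in for `fires_per_year` with made-up totals.

```python
import pandas as pd

# Hypothetical yearly totals standing in for `fires_per_year`.
yearly = pd.DataFrame({'Year': [1998, 1999, 2000, 2001],
                       'Fires': [100.0, 150.0, 135.0, 162.0]})

# Percentage change relative to the previous year; the first year has no predecessor.
yearly['Change_pct'] = yearly['Fires'].pct_change() * 100

# Drop the first year (NaN change) and list the largest jumps first.
largest_jumps = yearly.dropna().sort_values('Change_pct', ascending=False)
print(largest_jumps[['Year', 'Change_pct']])
```

Applied to the real `fires_per_year`, the top rows of such a table would show which years had the steepest increases.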

For now, let's explore months versus the number of reported fires.

In [None]:
months = list(df['Month'].unique())
months_fires = []
for i in range(12):
    # yearly total of fires for one month, summed over all states
    month_fire = df[df['Month'] == months[i]].groupby(['Year'], as_index=False)['Fires'].sum().round(0)
    month_fire = month_fire.rename({'Fires':months[i]}, axis=1)
    if i == 0:
        months_fires.append(month_fire)
    else:
        months_fires.append(pd.merge(months_fires[i-1], month_fire, on='Year'))
fires_per_month = months_fires[-1]
fires_per_month.head()
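
As an alternative to the loop of merges, the same kind of year-by-month table can be sketched with `pivot_table`. The `sample` frame below is hypothetical and only mimics the column names of `df`. One difference to be aware of: the merges keep only years present for every month, while the pivot keeps all years and leaves NaN where a month has no rows.

```python
import pandas as pd

# Hypothetical long-format sample with the same column names as `df`.
sample = pd.DataFrame({
    'Year':  [1998, 1998, 1998, 1999, 1999, 1999],
    'Month': ['January', 'January', 'February', 'January', 'February', 'February'],
    'Fires': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One row per year, one column per month, summing fires over all rows (states).
wide = (sample.pivot_table(index='Year', columns='Month', values='Fires', aggfunc='sum')
              .reset_index())
wide.columns.name = None  # drop the leftover 'Month' axis label
print(wide)
```

The pivot sorts the month columns alphabetically, so selecting them with a calendar-ordered list such as `months` would restore the usual order.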

First, let's see what happened in the first half of every year in our data.

In [None]:

fires_per_month.plot.barh(x='Year', y=months[:6], figsize=(18, 28))

Remember 1999, 2002, and 2009, the years in which fires increased sharply?
 First, fires were reported as 0 during the first five months of 1998 (maybe reporting only started in June).
 Second, the 1999 increase clearly did not happen in the first six months; fires in those months almost doubled from 2002 on.
 Third, fires in January increased so much from 2002 on that they played a key role in the yearly totals.
 Let's explore the other months :)

In [None]:
fires_per_month.plot.barh(x='Year', y=months[6:], figsize=(18, 28))

From the plot above, we can conclude that July, August and November are the months with the largest increases in fires in 2002 and 2009.

The following heatmap shows whether there is any correlation between different months.

In [None]:
fig, ax = plt.subplots(1, 1)
corr = fires_per_month[months].corr()
sns.heatmap(round(corr,2), annot=True, ax=ax, cmap="coolwarm",fmt='.2f',linewidths=.05)
fig.subplots_adjust(top=0.93)
fig.suptitle('Months correlation heatmap', fontsize=14)

The most important result we can draw from the correlation matrix above is:
Fires that occurred in April are correlated with those that occurred in November and December.
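
Reading the strongest pairs off a 12x12 heatmap by eye is easy to get wrong, so here is a minimal sketch that ranks month pairs by the absolute value of their correlation. The `sample` frame holds made-up monthly totals for three months only; it is not the real `fires_per_month`.

```python
from itertools import combinations

import pandas as pd

# Hypothetical monthly totals per year, not real data.
sample = pd.DataFrame({
    'April':    [1.0, 2.0, 3.0, 4.0],
    'November': [2.0, 4.0, 6.0, 8.5],
    'June':     [4.0, 1.0, 3.0, 2.0],
})

corr = sample.corr()

# Visit each unordered pair of columns once and sort by |correlation|, strongest first.
pairs = sorted(
    ((a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)
for a, b, value in pairs:
    print(a, b, round(value, 2))
```

With only about twenty yearly observations per month in the real data, even fairly high coefficients should be read with some caution.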

Let's look at the states and their relationship with fires.

In [None]:
states = list(df['State'].unique())
states_fires = []
for i in range(len(states)):
    # yearly total of fires for one state
    state_fire = df[df['State'] == states[i]].groupby(['Year'], as_index=False)['Fires'].sum().round(0)
    state_fire = state_fire.rename({'Fires':states[i]}, axis=1)
    if i == 0:
        states_fires.append(state_fire)
    else:
        states_fires.append(pd.merge(states_fires[i-1], state_fire, on='Year'))
fires_per_state = states_fires[-1]
fires_per_state.head()

In [None]:
fires_per_state[states].sum().sort_values(ascending=False)

We will focus on states that had more than 30000 reported fires.

In [None]:
hot_states = fires_per_state[list(fires_per_state.sum().nlargest(12).index)]
hot_states = pd.merge(fires_per_year, hot_states)
hot_states
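
The 30000 threshold mentioned above can also be applied directly rather than through a fixed top-12 cut. Note that `fires_per_state.sum()` sums the `Year` column as well, so `Year` may take one of the twelve `nlargest` slots. The sketch below drops `Year` before summing; the `sample` frame and its state names are hypothetical.

```python
import pandas as pd

# Hypothetical wide table shaped like `fires_per_state`:
# a 'Year' column plus one column per state.
sample = pd.DataFrame({
    'Year': [1998, 1999],
    'State A': [20000.0, 25000.0],
    'State B': [5000.0, 6000.0],
    'State C': [15000.0, 16000.0],
})
threshold = 30000

# Drop 'Year' before summing so its total is not mistaken for a fire count.
totals = sample.drop(columns='Year').sum()

# Keep only states whose total exceeds the threshold, largest first.
hot = totals[totals > threshold].sort_values(ascending=False)
print(hot)
```

The resulting index could then be used to select state columns, with `Year` added back explicitly before merging or plotting.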

In [None]:
interesting_states = list(hot_states.columns)
fig, ax = plt.subplots(1, 1, figsize=(18, 28))
hot_states.plot.barh(x='Year', y=interesting_states[2:], ax=ax)

Mato Grosso clearly plays a key role in the rising number of fires. It has ranked first among the states from 2001 on, with a huge increase from 2003. This state therefore deserves closer analysis and, of course, more data, which I hope will become available on Kaggle.

We have finally reached the end of this analysis of forest fires in Brazil. If you have any suggestions or questions, please comment below, and if you like this kernel, please upvote.

Thank you for reading.
