# Analysis of the Brazil Forest Fires

This analysis is based on the Kaggle data set https://www.kaggle.com/man72331/forest-fires-data-analysis 

The goal of this analysis is to gain insight into the pattern of fires in Brazil's forests:

1. How has the number of fires evolved over the last years?
2. Which state has seen the largest increase / decrease in forest fires?
3. Which month has the highest chance of a fire per region?
4. Does the chance of fire per month per region show any dissimilarities?
5. Which year had the most / least fires?
6. Which state on average has the most / least fires?
7. Which year / state has had the largest deviation in forest fires?


By: <b>Bram van Schaik</b> 
<br>Date: <b> October 2019</b>

## Part 1: Import libraries

In [1]:
import pandas as pd
import numpy as np

from scipy import stats


#### Visuals ####
import plotly.express as px

import plotly as py
import plotly.graph_objs as go

import seaborn as sns


alpha = 0.05

## Part 2: Import the dataset

In [191]:
#### Read in the data set ####
df = pd.read_csv("C:\\Users\\Big Boss\\Documents\\GitHub\\DEMO_PROJECTS\\Brazil Forest Fires\\amazon.csv", 
                 sep=",", 
                 engine='python',
                 thousands=r','
                )

## Part 3: Exploratory Analysis
This part is about getting to know our dataset

In [192]:
#### Looking at the first and last few rows to gain an understanding of the content ####
print(df.head(3))
print("\n")
print(df.tail(3))

   year state    month  number        date
0  1998  Acre  Janeiro     0.0  1998-01-01
1  1999  Acre  Janeiro     0.0  1999-01-01
2  2000  Acre  Janeiro     0.0  2000-01-01


      year      state     month  number        date
6451  2014  Tocantins  Dezembro   223.0  2014-01-01
6452  2015  Tocantins  Dezembro   373.0  2015-01-01
6453  2016  Tocantins  Dezembro   119.0  2016-01-01


### Conclusions:
- The month column has Portuguese month names instead of the common English names


In [193]:
df['month'] = df['month'].replace(['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho', 'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro'], 
                    ['01 January', '02 February', '03 March', '04 April', '05 May', '06 June', '07 July', '08 August', '09 September', '10 October', '11 November', '12 December'])

df.head()

Unnamed: 0,year,state,month,number,date
0,1998,Acre,01 January,0.0,1998-01-01
1,1999,Acre,01 January,0.0,1999-01-01
2,2000,Acre,01 January,0.0,2000-01-01
3,2001,Acre,01 January,0.0,2001-01-01
4,2002,Acre,01 January,0.0,2002-01-01
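The same translation can also be written as a dictionary lookup, which keeps each Portuguese name next to its English label. The sketch below is only an illustration on a small hypothetical frame (`sample` and `MONTH_LABELS` are names introduced here, not part of the notebook). It uses `Series.map`, which returns `NaN` for any value missing from the dictionary, so unexpected spellings are easy to detect.

```python
import pandas as pd

# Portuguese month names mapped to the zero-padded English labels used above.
MONTH_LABELS = {
    "Janeiro": "01 January",
    "Fevereiro": "02 February",
    "Março": "03 March",
    "Abril": "04 April",
    "Maio": "05 May",
    "Junho": "06 June",
    "Julho": "07 July",
    "Agosto": "08 August",
    "Setembro": "09 September",
    "Outubro": "10 October",
    "Novembro": "11 November",
    "Dezembro": "12 December",
}

# Small hypothetical frame standing in for the real data set.
sample = pd.DataFrame({"month": ["Janeiro", "Março", "Dezembro"]})

# map() yields NaN for names that are not in the dictionary,
# so counting NaN values reveals any month that was not translated.
sample["month"] = sample["month"].map(MONTH_LABELS)
unmapped = sample["month"].isna().sum()

print(sample["month"].tolist(), "unmapped:", unmapped)
```

The numeric prefix in each label is what makes an alphabetical sort of the month column follow calendar order.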


In [198]:
#### Look at the datatypes, shape and missing/unique values for each column ####
def resumetable(df):
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['Min'] = df.min().values
    summary['Max'] = df.max().values
    return summary

resumetable(df)

Dataset Shape: (6454, 5)


Unnamed: 0,Name,dtypes,Missing,Uniques,Min,Max
0,year,int64,0,20,1998,2017
1,state,object,0,23,Acre,Tocantins
2,month,object,0,12,01 January,12 December
3,number,int64,0,724,0,998
4,date,object,0,20,1998-01-01,2017-01-01


### Conclusions:
- There are 20 unique years in the data set starting from 1998 till 2017
- There are 23 unique states in the data set
- State, Month and date are objects
- There are no missing values
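The summary also shows that the date column has only 20 unique values, all on the first of January, so it carries no month information. If a real monthly timestamp is needed, one possible approach is to build it from the year and the numeric prefix of the month label. This is a sketch on a hypothetical frame; the column name `period` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical rows using the same year / month label format as the notebook.
sample = pd.DataFrame({
    "year": [1998, 1998, 2017],
    "month": ["01 January", "12 December", "07 July"],
})

# The first two characters of the label are the zero-padded month number.
month_number = sample["month"].str[:2]

# Build a first-of-month timestamp from year and month number.
sample["period"] = pd.to_datetime(
    sample["year"].astype(str) + "-" + month_number + "-01",
    format="%Y-%m-%d",
)

print(sample.dtypes)
```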

### Question 1: How has the number of fires developed over the last years

In [197]:
df['number'] = df['number'].apply(int)

In [199]:
#### Count the fires by year and visualize ####

YEAR = df.groupby(df['year'])['number'].sum().reset_index()

YEAR['MEAN'] = YEAR['number'].mean()

In [200]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=YEAR['year'], y= YEAR['number'],
                    mode='lines',
                    name='Number of Fires'))
fig.add_trace(go.Scatter(x=YEAR['year'], y= YEAR['MEAN'],
                    mode='lines+markers',
                    name='Mean'))

fig.show()

In [201]:
print("There are on average: {} amount of forest fires in Brazil each year".format(int(YEAR.number.mean())))

YEAR['MEAN'] = YEAR.number.mean()

There are on average: 34927 amount of forest fires in Brazil each year


In [202]:
print("The standard deviation is : {} on the amount of forest fires in Brazil for year".format(int(YEAR.number.std())))

YEAR['SD'] = YEAR.number.std()

The standard deviation is : 5893 on the amount of forest fires in Brazil for year
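The mean and standard deviation can be combined into a z-score per year, which shows how far each year sits from the average in units of standard deviation. The sketch below uses made-up yearly totals, not the real data, and the one-standard-deviation threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical yearly totals, not the real data.
yearly = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013, 2014],
    "number": [100, 120, 115, 140, 150],
})

mean = yearly["number"].mean()
sd = yearly["number"].std()  # pandas uses the sample standard deviation (ddof=1)

# z-score: distance from the mean expressed in standard deviations.
yearly["z"] = (yearly["number"] - mean) / sd

# Years more than one standard deviation away from the mean (assumed threshold).
outliers = yearly[yearly["z"].abs() > 1]

print(outliers)
```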


In [203]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

model = LinearRegression()
x = YEAR[['year']]
y = YEAR[['number']]

model.fit(x, y)

r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
print('\n')
print('intercept:', model.intercept_)
print('\n')
print('slope:', model.coef_)
print('\n')


X2 = sm.add_constant(x)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

coefficient of determination: 0.43000195061809926


intercept: [-1276451.4481203]


slope: [[653.23984962]]


                            OLS Regression Results                            
Dep. Variable:                 number   R-squared:                       0.430
Model:                            OLS   Adj. R-squared:                  0.398
Method:                 Least Squares   F-statistic:                     13.58
Date:                Thu, 24 Oct 2019   Prob (F-statistic):            0.00169
Time:                        21:52:10   Log-Likelihood:                -195.88
No. Observations:                  20   AIC:                             395.8
Df Residuals:                      18   BIC:                             397.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------





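### Conclusions:

- The linear regression on the yearly totals explains only about 43% of the variance (R-squared = 0.430)
- The fitted slope is roughly 653 additional fires per year, and the F-statistic p-value of 0.00169 indicates that this upward trend is statistically significant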
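For a single predictor, the slope, R-squared and p-value can also be obtained in one call with `scipy.stats.linregress`. This is a sketch on hypothetical yearly totals, not a replacement for the scikit-learn and statsmodels fit above.

```python
from scipy import stats

# Hypothetical yearly totals, not the real data.
years = [2010, 2011, 2012, 2013, 2014]
fires = [100, 120, 115, 140, 150]

# Ordinary least squares fit of fires against year.
result = stats.linregress(years, fires)

# rvalue is the correlation coefficient; squaring it gives R-squared.
r_squared = result.rvalue ** 2

print(f"slope={result.slope:.2f}, R^2={r_squared:.3f}, p={result.pvalue:.4f}")
```

The reported p-value tests the null hypothesis that the slope is zero, which for simple regression matches the `Prob (F-statistic)` line in a statsmodels summary.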

In [204]:
from scipy.stats import shapiro

stat, p = shapiro(YEAR['number'])
print('Statistics=%.3f, p=%.3f' % (stat, p))


if p > alpha:
	print('DISTRIBUTION = Sample looks Gaussian (fail to reject H0)')
else:
	print('DISTRIBUTION = Sample does not look Gaussian (reject H0)')
    
stats.f_oneway(YEAR['number'], YEAR['year'])

if p > alpha:
	print('F_OneWay TEST = There is no significance in the relation between Years and Number of Fires')
else:
	print('F_OneWay TEST = There is significance in the relation between Years and Number of Fires')
    
stats.kruskal(YEAR['number'], YEAR['year'])
if p > alpha:
	print('KRUSKAL TEST = There is no significance in the relation between Years and Number of Fires')
else:
	print('KRUSKAL TEST = There is significance in the relation between Years and Number of Fires')

fig = px.histogram(YEAR, x="number",
                   marginal="box", 
                   hover_data=YEAR.columns)
fig.show()


Statistics=0.935, p=0.190
DISTRIBUTION = Sample looks Gaussian (fail to reject H0)
F_OneWay TEST = There is no significance in the relation between Years and Number of Fires
KRUSKAL TEST = There is no significance in the relation between Years and Number of Fires
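Note that in the cell above the variable `p` is only assigned by `shapiro`. The results of `stats.f_oneway` and `stats.kruskal` are not stored, so all three messages are decided by the Shapiro-Wilk p-value of 0.190. A minimal sketch of keeping a separate p-value per test is shown below on hypothetical numbers. Spearman's rank correlation is used here as an assumed alternative for checking a monotonic trend over the years; it is not the method used in this notebook.

```python
from scipy import stats

alpha = 0.05

# Hypothetical yearly totals, not the real data.
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]
fires = [310, 295, 340, 360, 352, 390, 410, 405]

# Normality check: keep its p-value under its own name.
shapiro_stat, shapiro_p = stats.shapiro(fires)

# Trend check: rank correlation between year and yearly total.
rho, trend_p = stats.spearmanr(years, fires)


def verdict(p_value, threshold=alpha):
    """Return a short label for a p-value against the significance level."""
    return "significant" if p_value <= threshold else "not significant"


print(f"Shapiro-Wilk: p={shapiro_p:.3f} ({verdict(shapiro_p)})")
print(f"Spearman:     rho={rho:.3f}, p={trend_p:.4f} ({verdict(trend_p)})")
```

Giving every test its own variable makes it impossible for a later `if` branch to read the result of an earlier test by accident.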


In [205]:
#### Sum the fires by year and month ####
YEAR_MONTH = df.groupby(['year', 'month'])['number'].sum().reset_index()

#### Keep the years after 2000 and order by month for plotting ####
YEAR_MONTH_5 = YEAR_MONTH[YEAR_MONTH['year'] > 2000]

YEAR_MONTH_5 = YEAR_MONTH_5.sort_values(by= ['month', 'year'])

fig = px.line(YEAR_MONTH_5, x="month", y="number", title='Brazil Forest Fires by Year and Month', color= 'year')
fig.show()
    

<b>Question 1:</b> 
<br>How has the number of fires developed over the last years?
<br><b>Answer 1:</b> 
<br>The number of fires has been increasing over the last few years, after 4 points above average

### Question 2: Which state has seen the largest increase / decrease in forest fires

In [206]:
#### Count the fires by year and state ####
YEAR_STATE = df.groupby(['year', 'state'])['number'].sum().reset_index()

YEAR_STATE = YEAR_STATE.sort_values(by= ['state', 'year'])

YEAR_STATE.head()



#fig = px.line(YEAR_STATE, x="year", y="number", title='Brazil Forest Fires by Year and state', color= 'state')
#fig.update_layout(title='Amount of forest fires in Brazil by Year')
                   
#fig.show()

Unnamed: 0,year,state,number
0,1998,Acre,730
23,1999,Acre,333
46,2000,Acre,434
69,2001,Acre,828
92,2002,Acre,1543


In [207]:
YEAR_STATE['DIFF'] = YEAR_STATE['number'].diff()

YEAR_STATE = YEAR_STATE[YEAR_STATE['year'] > 1998]
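The plain `diff()` above runs across state boundaries, so the first row of each state is compared with the last year of the previous state. Dropping 1998 removes those rows as long as every state starts in 1998. A sketch of a variant that does not rely on that assumption is to take the difference within each state. The frame below is illustrative: the Acre values come from the table above, the Bahia values are made up.

```python
import pandas as pd

# Illustrative data: Acre values from the table above, Bahia values are made up.
sample = pd.DataFrame({
    "year": [1998, 1999, 2000, 1998, 1999, 2000],
    "state": ["Acre", "Acre", "Acre", "Bahia", "Bahia", "Bahia"],
    "number": [730, 333, 434, 100, 250, 200],
})

sample = sample.sort_values(by=["state", "year"])

# diff() inside each state: the first year of every state becomes NaN
# instead of being compared with another state's last year.
sample["DIFF"] = sample.groupby("state")["number"].diff()

# Drop the first year of each state, which has no previous year to compare with.
sample = sample.dropna(subset=["DIFF"])

print(sample)
```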

In [242]:
DOWN5 = YEAR_STATE.sort_values(by= ['DIFF']).head().reset_index(drop = True)
DOWN5['DOWN'] = DOWN5['year'].map(str) + '-' + DOWN5['state'].map(str)

In [244]:
fig = px.bar(DOWN5, x='DOWN', y='DIFF')
fig.show()

In [239]:
UP5 = YEAR_STATE.sort_values(by= ['DIFF'], ascending = False).head().reset_index(drop = True)
UP5['UP'] = UP5['year'].map(str) + '-' + UP5['state'].map(str)

In [241]:
fig = px.bar(UP5, x='UP', y='DIFF')
fig.show()

In [223]:
Mato = YEAR_STATE[YEAR_STATE['state'] == 'Mato Grosso']

fig = go.Figure()
fig.add_trace(go.Scatter(x=Mato['year'], y= Mato['DIFF'],
                    mode='lines',
                    name='Difference of previous year'))
fig.add_trace(go.Scatter(x=Mato['year'], y= (Mato['DIFF'] - Mato['DIFF']),
                    mode='lines+markers',
                    name='Zero Line'))

fig.show()

<b>Question 2:</b> 
<br>Which state has seen the largest increase / decrease in forest fires?
<br><b>Answer 2:</b> 
<br>The largest year-over-year decrease was in the state of Mato Grosso, in 2016 compared to 2015
<br>The largest year-over-year increase was also in Mato Grosso, in 2009 compared to 2008