# Analysis of the Brazil Forest Fires

This analysis is based on the Kaggle data set https://www.kaggle.com/man72331/forest-fires-data-analysis 

The goal of this analysis is to gain insight into the pattern of forest fires in Brazil by answering the following questions:

1. How has the number of fires evolved over the last years?
2. Which state has seen the highest/lowest increase/decrease in forest fires?
3. Which month has the highest chance of a fire per region?
4. Does the chance of fire per month differ between regions?
5. Which year had the most/least fires?
6. Which state has the most/least fires on average?
7. Which year/state has had the largest deviation in forest fires?


By: <b>Bram van Schaik</b> 
<br>Date: <b> October 2019</b>

## Part 1: Import libraries

In [12]:
import pandas as pd
import numpy as np

from scipy import stats


#### Visuals ####
import plotly.express as px

import plotly as py
import plotly.graph_objs as go

import seaborn as sns


# Significance level used for the hypothesis tests below
alpha = 0.05

## Part 2: Import the dataset

In [72]:
#### Read in the data set ####
df = pd.read_csv("C:\\Users\\Big Boss\\Documents\\GitHub\\DEMO_PROJECTS\\Brazil Forest Fires\\amazon.csv", 
                 sep=",", 
                 engine='python',
                 thousands=r'.'
                )
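The `thousands` argument controls how a value such as `1.131` is read. The sketch below is only an illustration on a made-up two-row sample (not the real `amazon.csv`); it sets `decimal=","` explicitly, which the call above does not, so that the dot is unambiguously a thousands separator. Whether the dots in the real file are thousands separators is the notebook's assumption and worth checking against the source.

```python
import io

import pandas as pd

# Hypothetical sample in the same spirit as amazon.csv (values are made up)
sample = io.StringIO("state;number\nAcre;1.131\nBahia;25\n")

# With thousands="." the text "1.131" is read as the integer 1131
parsed = pd.read_csv(sample, sep=";", thousands=".", decimal=",")

print(parsed["number"].tolist())
print(parsed["number"].dtype)
```

If the dots were decimal points instead, the same call would inflate those values by a factor of 1000, so a quick look at the raw file is cheap insurance.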

## Part 3: Exploratory Analysis
This part is about getting to know the data set.

In [73]:
#### Looking at the first and last few rows to gain an understanding of the content ####
print(df.head(3))
print("\n")
print(df.tail(3))

   year state    month  number        date
0  1998  Acre  Janeiro       0  1998-01-01
1  1999  Acre  Janeiro       0  1999-01-01
2  2000  Acre  Janeiro       0  2000-01-01


      year      state     month  number        date
6451  2014  Tocantins  Dezembro     223  2014-01-01
6452  2015  Tocantins  Dezembro     373  2015-01-01
6453  2016  Tocantins  Dezembro     119  2016-01-01


### Conclusions:
- The month column contains the Portuguese month names instead of the English names


In [75]:
df['month'] = df['month'].replace(['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho', 'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro'], 
                    ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])

df.head()

   year state    month  number        date
0  1998  Acre  January       0  1998-01-01
1  1999  Acre  January       0  1999-01-01
2  2000  Acre  January       0  2000-01-01
3  2001  Acre  January       0  2001-01-01
4  2002  Acre  January       0  2002-01-01


In [76]:
#### Look at the data types, shape, and missing/unique values for each column ####
def resumetable(df):
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['Min'] = df.min().values
    summary['Max'] = df.max().values
    return summary

resumetable(df)

Dataset Shape: (6454, 5)


     Name  dtypes  Missing  Uniques         Min         Max
0    year   int64        0       20        1998        2017
1   state  object        0       23        Acre   Tocantins
2   month  object        0       12       April   September
3  number   int64        0     1416           0       25963
4    date  object        0       20  1998-01-01  2017-01-01


### Conclusions:
- There are 20 unique years in the data set, from 1998 to 2017
- There are 23 unique states in the data set
- The state, month and date columns are stored as objects (strings)
- There are no missing values
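Because `month` is stored as a plain string, its minimum and maximum in the summary are alphabetical (April, September), and any sort or plot by month follows that order too. A minimal sketch of one way to get calendar order, assuming the English month names from the replacement above; `MONTH_ORDER`, `order_months` and the demo frame are illustrative names, not part of the original notebook:

```python
import pandas as pd

MONTH_ORDER = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']


def order_months(frame, column='month'):
    """Return a copy whose month column is an ordered categorical."""
    out = frame.copy()
    out[column] = pd.Categorical(out[column], categories=MONTH_ORDER, ordered=True)
    return out


# Small made-up frame standing in for df
demo = pd.DataFrame({'month': ['September', 'April', 'January'],
                     'number': [3, 2, 1]})
demo = order_months(demo).sort_values('month')

print(demo['month'].tolist())
```

In the notebook this would be applied as `df = order_months(df)` after the month names have been translated.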

### Question 1: How has the number of fires developed over the last years

In [78]:
#### Count the fires by year ####

years = list(df.year.unique())

fires_per_year = []

for year in years:
    # Sum the fire counts of all rows that belong to this year
    fire = df.loc[df['year'] == year].number.sum()
    fires_per_year.append(fire)

fire_year_dic = {'Year': years, 'Total_Fires': fires_per_year}

time_plot_1_df = pd.DataFrame(fire_year_dic)
# checking the dataframe
time_plot_1_df.head(5)
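The cells that follow use a data frame called `YEAR` with the columns `year` and `number`, but its definition is not among the cells shown. A minimal sketch of how it could be built, assuming it holds the total number of fires per year; `yearly_totals` and the demo frame are hypothetical names used only for illustration:

```python
import pandas as pd


def yearly_totals(frame):
    """Sum the `number` column per year; returns columns ['year', 'number']."""
    return frame.groupby('year', as_index=False)['number'].sum()


# Small made-up frame standing in for df
demo = pd.DataFrame({'year': [1998, 1998, 1999],
                     'state': ['Acre', 'Bahia', 'Acre'],
                     'number': [10, 5, 7]})
YEAR_demo = yearly_totals(demo)

print(YEAR_demo)
```

In the notebook this would read `YEAR = yearly_totals(df)`. Selecting `['number']` before calling `sum()` keeps the text columns (`state`, `month`, `date`) out of the aggregation.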

In [51]:
# The overall mean is drawn as a constant reference line
YEAR['MEAN'] = YEAR.number.mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=YEAR['year'], y=YEAR['number'],
                    mode='lines',
                    name='Number of Fires'))
fig.add_trace(go.Scatter(x=YEAR['year'], y=YEAR['MEAN'],
                    mode='lines+markers',
                    name='Mean'))

fig.show()

In [7]:
print("On average there are {} forest fires in Brazil each year".format(int(YEAR.number.mean())))

YEAR['MEAN'] = YEAR.number.mean()

On average there are 168674 forest fires in Brazil each year


In [43]:
print("The standard deviation of the yearly number of forest fires in Brazil is {}".format(int(YEAR.number.std())))

YEAR['SD'] = YEAR.number.std()

The standard deviation of the yearly number of forest fires in Brazil is 50486
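The mean and standard deviation can be combined into a z-score per year, which is one way to approach the question of which year deviates most. This is a sketch under the assumption that "deviation" means distance from the overall mean in units of the sample standard deviation; `add_z_scores` and the demo values are illustrative, not taken from the notebook:

```python
import pandas as pd


def add_z_scores(yearly):
    """Add a column `z`: (number - mean) / sample standard deviation."""
    out = yearly.copy()
    mean = out['number'].mean()
    sd = out['number'].std()  # pandas default ddof=1, as in the cell above
    out['z'] = (out['number'] - mean) / sd
    return out


# Made-up yearly totals standing in for YEAR
demo = pd.DataFrame({'year': [2015, 2016, 2017],
                     'number': [10, 20, 30]})
scored = add_z_scores(demo)

# Year with the largest absolute deviation from the mean
most_extreme = scored.loc[scored['z'].abs().idxmax(), 'year']
print(scored)
```

Applied to `YEAR`, the rows with the largest absolute `z` are the years that deviate most from the average.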


In [48]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
x = YEAR[['year']]
y = YEAR[['number']]

model.fit(x, y)

r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)
print('\n')
print('intercept:', model.intercept_)
print('\n')
print('slope:', model.coef_)
print('\n')

# scikit-learn does not report a p-value, so take the p-value of the slope
# from scipy.stats.linregress on the same data
slope, intercept, r_value, p, std_err = stats.linregress(YEAR['year'], YEAR['number'])

if p > alpha:
    print('Linear Regression = There is no significance in the relation between Years and Number of Fires')
else:
    print('Linear Regression = There is significance in the relation between Years and Number of Fires summed per year')

coefficient of determination: 0.04704140726683937


intercept: [-3546994.06541353]


slope: [[1850.89323308]]


Linear Regression = There is no significance in the relation between Years and Number of Fires


Conclusion:

- The regression on the yearly totals has an R-squared of only 0.047, so a linear trend over the years explains less than 5% of the variation in the number of fires

In [40]:
from scipy.stats import shapiro

#### Shapiro-Wilk test for normality of the yearly totals ####
stat, p = shapiro(YEAR['number'])
print('Statistics=%.3f, p=%.3f' % (stat, p))

alpha = 0.05
if p > alpha:
    print('DISTRIBUTION = Sample looks Gaussian (fail to reject H0)')
else:
    print('DISTRIBUTION = Sample does not look Gaussian (reject H0)')

#### Store each test's own p-value before comparing it with alpha ####
# Note: both tests compare the yearly totals with the year labels themselves,
# so they check whether the two columns share a mean/distribution, not whether
# the number of fires changes over time.
f_stat, f_p = stats.f_oneway(YEAR['number'], YEAR['year'])
if f_p > alpha:
    print('F_OneWay TEST = There is no significance in the relation between Years and Number of Fires')
else:
    print('F_OneWay TEST = There is significance in the relation between Years and Number of Fires')

h_stat, h_p = stats.kruskal(YEAR['number'], YEAR['year'])
if h_p > alpha:
    print('KRUSKAL TEST = There is no significance in the relation between Years and Number of Fires')
else:
    print('KRUSKAL TEST = There is significance in the relation between Years and Number of Fires')

fig = px.histogram(YEAR, x="number",
                   marginal="box",
                   hover_data=YEAR.columns)
fig.show()


Statistics=0.910, p=0.064
DISTRIBUTION = Sample looks Gaussian (fail to reject H0)
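To test whether the number of fires actually moves with time, a rank correlation between the year and the yearly total is one option. The sketch below swaps in Spearman's rank correlation (`scipy.stats.spearmanr`), which is not used in the original notebook; `trend_test` and the demo values are illustrative assumptions:

```python
import pandas as pd
from scipy import stats


def trend_test(yearly, alpha=0.05):
    """Spearman rank correlation between year and number of fires."""
    rho, p_value = stats.spearmanr(yearly['year'], yearly['number'])
    return rho, p_value, p_value <= alpha


# Made-up yearly totals with a clear upward trend, standing in for YEAR
demo = pd.DataFrame({'year': list(range(2000, 2010)),
                     'number': [5, 7, 6, 9, 12, 11, 15, 14, 18, 20]})
rho, p_value, significant = trend_test(demo)

print('rho=%.3f, p=%.4f, significant=%s' % (rho, p_value, significant))
```

Being rank-based, this does not rely on the yearly totals being normally distributed, and a positive `rho` with a small p-value would point to an increasing trend.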


In [67]:
# Total number of fires per year and month (only the numeric column is summed)
YEAR_MONTH = df.groupby(['year', 'month'])['number'].sum()
YEAR_MONTH = YEAR_MONTH.reset_index()

# Keep the last five years (2013-2017)
YEAR_MONTH_5 = YEAR_MONTH[YEAR_MONTH['year'] > 2012]

fig = px.line(YEAR_MONTH_5, x="month", y="number", title='Brazil Forest Fires by Month and Year', color='year')
fig.show()
    

In [64]:
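# Assumed definition (not shown in the original cells), mirroring YEAR_MONTH_5
YEAR_5 = YEAR[YEAR['year'] > 2012]
YEAR_5.head()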

    year  number      MEAN            SD
15  2013  105572  168674.1  50486.496866
16  2014  170259  168674.1  50486.496866
17  2015  209296  168674.1  50486.496866
18  2016  171132  168674.1  50486.496866
19  2017  246289  168674.1  50486.496866


   year      month  number
0  1998      Abril       0
1  1998     Agosto   35549
2  1998   Dezembro    4448
3  1998  Fevereiro       0
4  1998    Janeiro       0


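<b>Question 1:</b> 
<br>How has the number of fires developed over the last years?
<br><b>Answer 1:</b> 
<br>The number of fires has risen in the last few years, from 105,572 in 2013 to 246,289 in 2017 (with a dip in 2016), after some years of decrease. Over the full 1998-2017 period the linear regression shows no significant trend.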

### Question 2: Which state has seen the highest/lowest increase/decrease in forest fires

In [53]:
#### Count the fires by year and state and visualize ####
YEAR_STATE = df.groupby(['year', 'state'])['number'].sum()
YEAR_STATE = YEAR_STATE.reset_index()

fig = px.line(YEAR_STATE, x="year", y="number", color='state',
              title='Number of forest fires in Brazil by year and state')

fig.show()

In [62]:
# Store the test result so the comparison uses this test's own p-value
f_stat, p = stats.f_oneway(YEAR_STATE['number'], YEAR_STATE['year'])

if p > alpha:
    print('greater')
else:
    print('lower')

lower


In [61]:
# Store the test result so the comparison uses this test's own p-value
h_stat, p = stats.kruskal(YEAR_STATE['number'], YEAR_STATE['year'])
if p > alpha:
    print('greater than 0.05')
else:
    print('lower than 0.05')

lower than 0.05
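The tests above do not yet say which state changed the most. A minimal sketch of one way to rank the states, assuming "increase/decrease" is measured as the difference between the totals of the last and the first year in the data; `state_change` and the demo frame are hypothetical and not part of the original notebook:

```python
import pandas as pd


def state_change(frame):
    """Difference in total fires between the last and first year, per state."""
    totals = (frame.groupby(['state', 'year'])['number']
                   .sum()
                   .unstack('year'))
    first_year = totals.columns.min()
    last_year = totals.columns.max()
    return (totals[last_year] - totals[first_year]).sort_values()


# Small made-up frame standing in for df
demo = pd.DataFrame({'year': [1998, 2017, 1998, 2017],
                     'state': ['Acre', 'Acre', 'Bahia', 'Bahia'],
                     'number': [10, 40, 50, 20]})
change = state_change(demo)

print('Largest increase:', change.idxmax())
print('Largest decrease:', change.idxmin())
```

Comparing only two years is sensitive to an unusual first or last year; a per-state regression slope over all years would be a more robust alternative.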
