# Forest Fires A/B Test Analysis
*Bonus Assignment — DS: Statistics and Probability*

*College: Cornerstone International Community College of Canada - CICCC*

*Student: Amir Lima Oliveira*

*Date: May 20th, 2025*


## Objective

*To conduct a complete A/B (two-sample) hypothesis test using a real-world environmental dataset. We aim to determine whether the Fine Fuel Moisture Code (FFMC) is associated with a significant difference in the burned forest area.*


## 1. Dataset Information

- **Dataset title**: Forest Fires  
- **Source**: UCI Machine Learning Repository  
- **Link**: [https://archive.ics.uci.edu/dataset/162/forest+fires](https://archive.ics.uci.edu/dataset/162/forest+fires)  
- **Description**: *The dataset contains meteorological data and fire area measurements from Montesinho Natural Park in Portugal.*

In [59]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from meteostat import Point, Daily
from datetime import datetime
from scipy import stats
from scipy.stats import ttest_ind

#### 1.1 - API for FFMC
 - `%pip install meteostat`
 - Install `ipykernel` first; it is needed to install the Python `meteostat` package from the notebook.
 - To do later: fetch real-world weather data with the meteostat API and match it with the variables in the CSV file.

#### 1.2 Importing the dataset

In [35]:
df = pd.read_csv('forestfires.csv')
df.head()

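| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |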


## 4. Exploratory Data Analysis (EDA)

*Compute basic descriptive statistics (mean, median, std), visualize distributions, and assess group differences in burned area.*

In [None]:
df.info()
mode = df.mode()
print("\n\nThe mode is:\n\n",mode)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       517 non-null    int64  
 1   Y       517 non-null    int64  
 2   month   517 non-null    object 
 3   day     517 non-null    object 
 4   FFMC    517 non-null    float64
 5   DMC     517 non-null    float64
 6   DC      517 non-null    float64
 7   ISI     517 non-null    float64
 8   temp    517 non-null    float64
 9   RH      517 non-null    int64  
 10  wind    517 non-null    float64
 11  rain    517 non-null    float64
 12  area    517 non-null    float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB


The mode is:

      X    Y month  day  FFMC   DMC     DC  ISI  temp    RH  wind  rain  area
0  4.0  4.0   aug  sun  91.6  99.0  745.3  9.6  17.4  27.0   2.2   0.0   0.0
1  NaN  NaN   NaN  NaN  92.1   NaN    NaN  NaN  19.6   NaN   3.1   NaN   NaN


In [42]:
df.describe()

| | X | Y | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 | 517.0 |
| mean | 4.669246 | 4.299807 | 90.644681 | 110.87234 | 547.940039 | 9.021663 | 18.889168 | 44.288201 | 4.017602 | 0.021663 | 12.847292 |
| std | 2.313778 | 1.2299 | 5.520111 | 64.046482 | 248.066192 | 4.559477 | 5.806625 | 16.317469 | 1.791653 | 0.295959 | 63.655818 |
| min | 1.0 | 2.0 | 18.7 | 1.1 | 7.9 | 0.0 | 2.2 | 15.0 | 0.4 | 0.0 | 0.0 |
| 25% | 3.0 | 4.0 | 90.2 | 68.6 | 437.7 | 6.5 | 15.5 | 33.0 | 2.7 | 0.0 | 0.0 |
| 50% | 4.0 | 4.0 | 91.6 | 108.3 | 664.2 | 8.4 | 19.3 | 42.0 | 4.0 | 0.0 | 0.52 |
| 75% | 7.0 | 5.0 | 92.9 | 142.4 | 713.9 | 10.8 | 22.8 | 53.0 | 4.9 | 0.0 | 6.57 |
| max | 9.0 | 9.0 | 96.2 | 291.3 | 860.6 | 56.1 | 33.3 | 100.0 | 9.4 | 6.4 | 1090.84 |


In [56]:
median = df[['FFMC','DMC','DC','ISI','temp','RH','wind','rain','area']].median()
print(median)



FFMC     91.60
DMC     108.30
DC      664.20
ISI       8.40
temp     19.30
RH       42.00
wind      4.00
rain      0.00
area      0.52
dtype: float64


## 5. *Data Cleaning & Preprocessing*

#### Checking Null values

In [37]:
df.isnull().sum()

X        0
Y        0
month    0
day      0
FFMC     0
DMC      0
DC       0
ISI      0
temp     0
RH       0
wind     0
rain     0
area     0
dtype: int64

#### Checking unique values on month and day columns

In [None]:
df['month'] = df['month'].str.lower().str.strip()
df['day'] = df['day'].str.lower().str.strip()
print(df['month'].unique())
print(df['day'].unique())

#### Checking duplicates

In [None]:
print(f"Number of duplicate rows: {df.duplicated().sum()}")
print("df.duplicated():\n",df[df.duplicated()])
print("df.duplicated(subset=['X', 'Y']):\n",df[df.duplicated(subset=['X', 'Y'])])


## 6. Hypothesis Formulation

- H₀ (Null Hypothesis): There is no significant difference in the burned area between days with high and low FFMC.

- H₁ (Alternative Hypothesis): There is a significant difference in the burned area between days with high and low FFMC.
- **Significance Level (α)**: *0.05*

## 6. Hypothesis Test Execution

Group A: Days with low FFMC values.

Group B: Days with high FFMC values.

In [57]:
ffmc_median = df['FFMC'].median()

# Label each row 'Low' (FFMC below the median) or 'High' (FFMC at or above the median)
df['FFMC_Group'] = np.where(df['FFMC'] < ffmc_median, 'Low', 'High')

# Display how many samples in each group
print(df['FFMC_Group'].value_counts())

FFMC_Group
High    283
Low     234
Name: count, dtype: int64


In [None]:
# Defining the samples for the A/B test
burned_area_high = df[df['FFMC_Group'] == 'High']['area']
burned_area_low = df[df['FFMC_Group'] == 'Low']['area']

#### Statistical test choice based on the problem (two-sample t-test)
 - A two-sample t-test is used because, even though each group has more than 30 data points, the data cover only one specific region, so the population parameters are unknown.

##### Arguments:
 - Continuous data (burned area size)
 - Two independent groups (High FFMC vs Low FFMC) - Created above
 - The population standard deviation is unknown
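
For reference, the test cell below passes `equal_var=False`, which selects Welch's version of the two-sample t-test. A sketch of the statistic and its approximate degrees of freedom follows; the symbols $\bar{x}$, $s$, and $n$ (sample mean, sample standard deviation, and sample size of each group) are notation introduced here for illustration and do not appear in the original notebook.

```latex
t = \frac{\bar{x}_{\text{high}} - \bar{x}_{\text{low}}}
         {\sqrt{\dfrac{s_{\text{high}}^{2}}{n_{\text{high}}} + \dfrac{s_{\text{low}}^{2}}{n_{\text{low}}}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_{\text{high}}^{2}}{n_{\text{high}}} + \dfrac{s_{\text{low}}^{2}}{n_{\text{low}}}\right)^{2}}
     {\dfrac{\left(s_{\text{high}}^{2}/n_{\text{high}}\right)^{2}}{n_{\text{high}} - 1}
      + \dfrac{\left(s_{\text{low}}^{2}/n_{\text{low}}\right)^{2}}{n_{\text{low}} - 1}}
```

The denominator of $t$ is the standard error of the difference, and $\nu$ is the Welch-Satterthwaite approximation used later for the confidence interval.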

In [60]:

# 1. Separate the groups
burned_area_high = df[df['FFMC_Group'] == 'High']['area']
burned_area_low = df[df['FFMC_Group'] == 'Low']['area']

# 2. Check for equal variances
# Levene's test is good practice before a t-test because it indicates whether
# the two groups can be assumed to share the same variance.
# The result is stored here but not printed.
levene_stat, levene_p = stats.levene(burned_area_high, burned_area_low)

# 3. Perform the independent two-sample t-test
# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(burned_area_high, burned_area_low, equal_var=False)

# 4. Show results
print("T-statistic:", t_stat)
print("P-value:", p_value)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There IS a significant difference in burned area between FFMC groups.")
else:
    print("Fail to reject the null hypothesis: There is NO significant difference in burned area between FFMC groups.")


T-statistic: 1.1340003431780752
P-value: 0.2575357105414661
Fail to reject the null hypothesis: There is NO significant difference in burned area between FFMC groups.
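
The Levene result in the cell above is computed but never displayed. The following is a minimal sketch of how it could be inspected, assuming synthetic right-skewed samples in place of the real burned-area groups; the names `sample_a`, `sample_b`, and `equal_var` are hypothetical and the printed values say nothing about the Forest Fires data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-ins for the two burned-area samples (NOT the real data)
sample_a = rng.exponential(scale=10.0, size=283)
sample_b = rng.exponential(scale=10.0, size=234)

# Levene's test: the null hypothesis is that both groups have equal variances.
# SciPy centers on the median by default, which is more robust for skewed data.
levene_stat, levene_p = stats.levene(sample_a, sample_b)
print(f"Levene statistic: {levene_stat:.3f}, p-value: {levene_p:.3f}")

# A small p-value would suggest unequal variances; Welch's t-test is valid either way
equal_var = levene_p >= 0.05
print("Treat variances as equal:", equal_var)
```

Because Welch's t-test does not rely on equal variances, this check is informative rather than decisive for the test used in this notebook.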


#### Confidence interval

In [63]:
n_high = len(burned_area_high) # High FFMC group sample size
n_low = len(burned_area_low) # Low FFMC group sample size
mean_high = burned_area_high.mean() # High FFMC group mean
mean_low = burned_area_low.mean() # Low FFMC group mean
std_high = burned_area_high.std() # High FFMC group standard deviation
std_low = burned_area_low.std() # Low FFMC group standard deviation

se_diff = np.sqrt((std_high**2/n_high) + (std_low**2/n_low)) # Standard error of the difference

# Welch-Satterthwaite effective degrees of freedom
df_effective = ((std_high**2 / n_high) + (std_low**2 / n_low))**2 / \
               (((std_high**2 / n_high)**2) / (n_high - 1) + ((std_low**2 / n_low)**2) / (n_low - 1))

t_crit = stats.t.ppf(0.975, df_effective) # Critical t-value for 95% confidence level

margin_of_error = t_crit * se_diff # Margin of error

ci_low = (mean_high - mean_low) - margin_of_error # Lower bound of the confidence interval
ci_high = (mean_high - mean_low) + margin_of_error # Upper bound of the confidence interval
print(f"95% Confidence Interval for (mean_high - mean_low): (lower: {ci_low}, upper: {ci_high})")

95% Confidence Interval for (mean_high - mean_low): (lower: -4.350797663018458, upper: 16.20437593005963)
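
The interval above is for the difference in means (High minus Low), and its two numbers are the lower and upper bounds of that difference. As a sketch, the same steps can be wrapped in a helper and cross-checked against `stats.ttest_ind`; the function name `welch_ci` and the synthetic samples are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
from scipy import stats

def welch_ci(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error contributed by each group (sample variance / n)
    var_a = a.var(ddof=1) / a.size
    var_b = b.var(ddof=1) / b.size
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite effective degrees of freedom
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (a.size - 1) + var_b ** 2 / (b.size - 1))
    # Two-sided critical value for the requested confidence level
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, dof)
    diff = a.mean() - b.mean()
    return diff - t_crit * se, diff + t_crit * se

# Synthetic skewed samples standing in for the two FFMC groups (NOT the real data)
rng = np.random.default_rng(42)
group_a = rng.exponential(scale=12.0, size=283)
group_b = rng.exponential(scale=10.0, size=234)

low, high = welch_ci(group_a, group_b)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"95% CI for the difference in means: ({low:.3f}, {high:.3f})")
print(f"Welch p-value: {p_value:.4f}")
```

The interval and the test use the same standard error and degrees of freedom, so the 95% interval excludes zero exactly when the Welch p-value falls below 0.05.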


## 7. Results & Conclusion

- *Test statistic = 1.134*  
- *P-value = 0.2575*  
- *95% confidence interval for the difference in mean burned area (High FFMC minus Low FFMC):* lower bound -4.3508, upper bound 16.2044

- *Decision: Fail to reject the null hypothesis*  
- *Practical significance: Although FFMC is an important and validated index to consider in a forest fire situation, in this test it was not statistically significant in determining whether a fire spreads over a larger area.*

## 8. Brief Discussion

*Briefly discuss dataset limitations (e.g., small sample with rain, skewed data), test assumptions, and propose improvements or future analyses.*

This A/B test project on forest fires in the northeast region of Portugal (Montesinho Natural Park) evaluated whether high-FFMC days are associated with significantly larger burned areas than low-FFMC days. The result did not provide enough evidence to reject the null hypothesis: the p-value (0.2575) is above the 0.05 significance level, and the confidence interval for the difference in means includes zero.

* A larger dataset could change this result, for example one covering forest fires across all of Portugal from 2000 to 2003.
* The burned areas are not symmetrically distributed: there are many small fires and a few very large ones, so the `area` column is strongly right-skewed (mean 12.85 versus median 0.52).
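
One possible future analysis, given the skew noted above, is to measure the skewness and try a `log1p` transform before testing. The sketch below assumes synthetic data (many zeros plus a few large values) as a stand-in for the `area` column; the variable names and the choice of transform are illustrative assumptions, not something the original notebook does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for a burned-area column: about half zeros, the rest lognormal
# (NOT the real data)
is_zero = rng.random(517) < 0.5
area = np.where(is_zero, 0.0, rng.lognormal(mean=1.5, sigma=1.5, size=517))

# Sample skewness of the raw values (0 would indicate a symmetric distribution)
skew_raw = stats.skew(area)

# log1p(x) = log(1 + x) is defined at zero and compresses the long right tail
log_area = np.log1p(area)
skew_log = stats.skew(log_area)

print(f"Skewness before transform: {skew_raw:.2f}")
print(f"Skewness after log1p:      {skew_log:.2f}")
```

A t-test on the transformed values compares means on the log scale, so its conclusion would need to be interpreted on that scale rather than in the original area units.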