# DA WildFires Project - Modeling Fire Prediction at the County Level
__________________________________________________________________

> By P. NGUYEN

## 0- Purpose of this notebook

In this notebook, we adapt the modeling already carried out for state-level fire prediction in order to predict fires at the county level.

## 1- Loading the libraries

In [1]:
import pandas as pd
import os
import numpy as np
from datetime import datetime
from sklearn.ensemble import HistGradientBoostingClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import ensemble
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc, f1_score
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import joblib

## 2 - Building the datasets
### A - "Global Summary of the Day" data provided by NOAA

The data made available by NOAA are recorded by thousands of weather stations spread across the United States. The problem is that each record can only be identified by a station ID, the state the station belongs to, and its latitude/longitude coordinates, and this information alone does not give the county directly.

To carry out this study, we therefore had to:
1. Extract the required NOAA data
2. Build a geopandas object describing all the counties of the United States, with the identifiers specific to each county (linked to the FIPS code)
3. Merge the NOAA data with the geopandas object based on the latitude/longitude coordinates
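
The idea behind step 3 can be illustrated without geopandas. The sketch below is not the code used for this project: it swaps the geopandas merge for a plain ray-casting point-in-polygon test, and the county outlines, station IDs and function names are all made up for the example.

```python
# Hypothetical county outlines as lists of (longitude, latitude) vertices.
# Real outlines would come from a county boundary file; these squares are made up.
COUNTIES = {
    "001": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "003": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count the edges crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def county_of(lon, lat, counties=COUNTIES):
    """Return the code of the first county containing the point, else None."""
    for code, polygon in counties.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None

# Hypothetical stations given as (longitude, latitude).
stations = {"ST_A": (1.0, 1.0), "ST_B": (3.5, 0.5), "ST_C": (9.0, 9.0)}
station_county = {sid: county_of(lon, lat) for sid, (lon, lat) in stations.items()}
print(station_county)
```

A station that falls outside every outline gets `None`, which is the case to watch for when stations sit on the coast or just across a state border.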

Since this is a fairly long process given the large number of available observations, we limited the exercise to the counties of California over the period 1992-2015. The resulting dataset is shown below:

In [2]:
df_gsod = pd.read_csv("Phuc/df_gsod.csv", index_col=0)
df_gsod.sample(10)

Unnamed: 0,DATE,ELEVATION,TEMP,TEMP_ATTRIBUTES,DEWP,DEWP_ATTRIBUTES,SLP,SLP_ATTRIBUTES,STP,STP_ATTRIBUTES,...,GUST,MAX,MAX_ATTRIBUTES,MIN,MIN_ATTRIBUTES,PRCP,PRCP_ATTRIBUTES,SNDP,FRSHTT,COUNTYFP
88,2010-03-30,18.0,55.5,23,48.4,23,1011.1,23,999.9,0,...,28.0,61.2,,49.5,,0.03,B,999.9,10000,95
50,2002-03-19,56.0,58.2,9,42.8,9,1022.2,9,999.9,0,...,999.9,62.6,*,48.2,*,0.0,I,999.9,0,37
26,1993-01-27,151.5,43.0,24,35.1,24,1018.1,24,999.7,24,...,999.9,66.9,,28.9,,0.0,D,999.9,0,89
8,2011-01-09,23.5,48.5,24,42.3,24,1016.3,23,13.0,24,...,999.9,59.0,,37.0,,0.0,G,999.9,0,111
31,1997-02-01,1.8,54.9,21,53.0,21,9999.9,0,999.9,0,...,999.9,60.8,*,53.6,*,0.0,D,999.9,100000,1
138,2001-05-19,1644.4,65.5,24,35.6,24,1013.5,23,999.9,0,...,19.0,81.0,*,50.0,*,0.01,G,999.9,0,23
139,2001-05-20,65.0,60.5,24,54.9,24,1014.6,22,999.9,0,...,999.9,71.6,*,53.6,*,0.0,I,999.9,0,79
251,2008-09-10,1309.4,68.3,24,22.1,24,1007.0,24,863.4,24,...,21.0,89.1,,46.9,,0.0,G,999.9,0,13
324,2008-11-20,664.5,60.7,24,22.5,24,1017.6,24,941.2,24,...,999.9,73.9,,50.0,,0.0,G,999.9,0,3
44,2010-02-14,879.3,52.7,6,33.8,6,9999.9,0,919.4,6,...,999.9,66.2,*,39.2,*,0.0,I,999.9,0,71


We clean the data by:
1. Dropping the unneeded variables
2. Converting the missing values encoded as numeric sentinels into NaN

In [3]:
column2drop = ['TEMP_ATTRIBUTES', 'DEWP_ATTRIBUTES', 'SLP_ATTRIBUTES', 'STP_ATTRIBUTES', 'VISIB_ATTRIBUTES', 'WDSP_ATTRIBUTES']
df_gsod.drop(column2drop, axis = 1, inplace=True) # dropping columns that are not relevant for the merge

In [4]:
# setting the missing data to nan
df_gsod.loc[df_gsod['TEMP'] == 9999.9, 'TEMP'] = np.nan
df_gsod.loc[df_gsod['DEWP'] == 9999.9, 'DEWP'] = np.nan
df_gsod.loc[df_gsod['SLP'] == 9999.9, 'SLP'] = np.nan
df_gsod.loc[df_gsod['STP'] == 9999.9, 'STP'] = np.nan
df_gsod.loc[df_gsod['VISIB'] == 999.9, 'VISIB'] = np.nan
df_gsod.loc[df_gsod['WDSP'] == 999.9, 'WDSP'] = np.nan
df_gsod.loc[df_gsod['MXSPD'] == 999.9, 'MXSPD'] = np.nan
df_gsod.loc[df_gsod['GUST'] == 999.9, 'GUST'] = np.nan
df_gsod.loc[df_gsod['MAX'] == 9999.9, 'MAX'] = np.nan
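
The cell above tests each column one by one. A more compact variant, sketched here on toy data, keeps the sentinel values in a dictionary. The names `SENTINELS` and `mask_sentinels` are hypothetical, and the `MIN` and `SNDP` entries are assumptions that the notebook itself does not apply. The tables displayed later still contain 999.9 in the STP column, so the sentinel really used for that column may be worth checking.

```python
import numpy as np
import pandas as pd

# Sentinel per column. Most entries mirror the values tested in the cell above;
# the MIN and SNDP entries are assumptions added for illustration only.
SENTINELS = {
    "TEMP": 9999.9, "DEWP": 9999.9, "SLP": 9999.9, "STP": 9999.9, "MAX": 9999.9,
    "VISIB": 999.9, "WDSP": 999.9, "MXSPD": 999.9, "GUST": 999.9,
    "MIN": 9999.9, "SNDP": 999.9,  # assumed, not handled in the notebook
}

def mask_sentinels(frame, sentinels=SENTINELS):
    """Return a copy where each column's sentinel value is replaced by NaN."""
    out = frame.copy()
    for column, sentinel in sentinels.items():
        if column in out.columns:
            out[column] = out[column].mask(out[column] == sentinel)
    return out

# Toy frame with one sentinel in TEMP, one in GUST, and only sentinels in SNDP.
toy = pd.DataFrame({
    "TEMP": [55.5, 9999.9],
    "GUST": [999.9, 28.0],
    "SNDP": [999.9, 999.9],
})
clean = mask_sentinels(toy)
print(clean.isna().sum())
```

Working on a copy leaves the raw frame untouched, which makes it easy to compare the number of missing values before and after cleaning.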

In [5]:
df = df_gsod
df

Unnamed: 0,DATE,ELEVATION,TEMP,DEWP,SLP,STP,VISIB,WDSP,MXSPD,GUST,MAX,MAX_ATTRIBUTES,MIN,MIN_ATTRIBUTES,PRCP,PRCP_ATTRIBUTES,SNDP,FRSHTT,COUNTYFP
0,1992-01-02,317.0,49.1,35.4,1017.7,999.9,14.9,2.0,5.1,,62.1,,35.1,,0.00,A,999.9,0,53
1,1992-01-03,317.0,50.0,43.8,1012.4,999.9,12.8,1.5,4.1,,60.1,,39.9,,0.04,G,999.9,10000,53
2,1992-01-06,317.0,47.0,43.3,1014.0,999.9,6.2,1.6,4.1,,54.0,*,39.0,*,0.00,G,999.9,110000,53
3,1992-01-07,317.0,48.1,42.3,1012.4,999.9,9.6,3.3,6.0,,54.0,,37.0,,0.12,G,999.9,110000,53
4,1992-01-08,317.0,46.6,40.7,1021.9,999.9,14.9,2.2,5.1,,57.0,*,30.9,,0.24,G,999.9,0,53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,2015-12-27,10.0,45.5,,1025.4,999.9,,0.0,,,51.6,*,40.8,*,0.00,I,999.9,0,75
348,2015-12-28,10.0,44.1,,1018.2,999.9,,0.0,,,47.7,*,41.4,*,0.00,I,999.9,0,75
349,2015-12-29,10.0,47.3,,1016.9,999.9,,0.0,,,51.1,*,45.5,*,0.00,I,999.9,0,75
350,2015-12-30,10.0,46.4,,1021.9,999.9,,0.0,,,51.6,*,43.9,*,0.00,I,999.9,0,75


### B - Wildfires data

In [6]:
df_fire = pd.read_csv('Phuc/Fires_df.csv', index_col=0) #importing fire data
df_fire

Unnamed: 0,FIRE_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTYFP,COUNTY_NAME
0,FOUNTAIN,2005,2005-02-02,33,1300.0,9.0,Miscellaneous,2005-02-02,33.0,1730.0,0.10,A,40.036944,-121.005833,CA,63,Plumas
1,MAGNOLIA,2005,2005-06-02,153,1600.0,5.0,Debris Burning,2005-06-02,153.0,2000.0,2.30,B,39.817222,-120.451111,CA,63,Plumas
2,SPENCER,2005,2005-04-17,107,1200.0,9.0,Miscellaneous,2005-05-09,129.0,1800.0,0.10,A,39.689722,-120.717222,CA,63,Plumas
3,MILLER,2005,2005-08-22,234,1458.0,9.0,Miscellaneous,2005-08-22,234.0,2015.0,0.50,B,39.872222,-121.147500,CA,63,Plumas
4,ENGLAND,2005,2005-08-29,241,1812.0,9.0,Miscellaneous,2005-08-29,241.0,1840.0,0.10,A,39.962222,-120.903889,CA,63,Plumas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1848285,5,2015,2015-02-14,45,1023.0,7.0,Arson,2015-02-14,45.0,1118.0,7.00,B,37.710000,-100.330000,KS,69,Gray
1848286,31,2015,2015-11-12,316,,9.0,Miscellaneous,2015-11-12,316.0,1040.0,0.01,A,37.740000,-100.600000,KS,69,Gray
1848287,13,2015,2015-03-16,75,1227.0,9.0,Miscellaneous,2015-03-16,75.0,1424.0,100.00,D,37.440000,-100.990000,KS,81,Haskell
1848288,14,2015,2015-03-16,75,1457.0,9.0,Miscellaneous,2015-03-16,75.0,1709.0,30.00,C,37.440000,-100.990000,KS,81,Haskell


As mentioned earlier, we restrict the modeling to the counties of California.

In [7]:
df_fire = df_fire[df_fire['STATE'] == 'CA'] # keep California fires only
df_fire

Unnamed: 0,FIRE_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTYFP,COUNTY_NAME
0,FOUNTAIN,2005,2005-02-02,33,1300.0,9.0,Miscellaneous,2005-02-02,33.0,1730.0,0.10,A,40.036944,-121.005833,CA,63,Plumas
1,MAGNOLIA,2005,2005-06-02,153,1600.0,5.0,Debris Burning,2005-06-02,153.0,2000.0,2.30,B,39.817222,-120.451111,CA,63,Plumas
2,SPENCER,2005,2005-04-17,107,1200.0,9.0,Miscellaneous,2005-05-09,129.0,1800.0,0.10,A,39.689722,-120.717222,CA,63,Plumas
3,MILLER,2005,2005-08-22,234,1458.0,9.0,Miscellaneous,2005-08-22,234.0,2015.0,0.50,B,39.872222,-121.147500,CA,63,Plumas
4,ENGLAND,2005,2005-08-29,241,1812.0,9.0,Miscellaneous,2005-08-29,241.0,1840.0,0.10,A,39.962222,-120.903889,CA,63,Plumas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1303030,WAVERLY,2013,2013-08-30,242,1854.0,13.0,Missing/Undefined,2013-08-30,242.0,1921.0,0.25,A,37.942974,-120.943633,CA,77,San Joaquin
1303031,W BERKELEY RD / MOUNTAIN,2009,2009-07-20,201,1130.0,13.0,Missing/Undefined,,,,1.00,B,37.728938,-121.531399,CA,77,San Joaquin
1303032,WAVERLY 2,2010,2010-10-15,288,1212.0,13.0,Missing/Undefined,2010-10-15,288.0,1216.0,0.01,A,38.017350,-120.955867,CA,77,San Joaquin
1303033,HWY 12 /N CORD RD,2010,2010-09-17,260,1510.0,13.0,Missing/Undefined,,,,0.10,A,38.203120,-121.018805,CA,77,San Joaquin


In [36]:
df_fire.columns

Index(['FIRE_NAME', 'FIRE_YEAR', 'DISCOVERY_DATE', 'DISCOVERY_DOY',
       'DISCOVERY_TIME', 'STAT_CAUSE_CODE', 'STAT_CAUSE_DESCR', 'CONT_DATE',
       'CONT_DOY', 'CONT_TIME', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'LATITUDE',
       'LONGITUDE', 'STATE', 'COUNTYFP', 'COUNTY_NAME'],
      dtype='object')

In [8]:
df_fire = df_fire[['DISCOVERY_DATE', 'COUNTYFP']].copy() # keep only the discovery date and county code; copy() avoids SettingWithCopyWarning
df_fire

Unnamed: 0,DISCOVERY_DATE,COUNTYFP
0,2005-02-02,63
1,2005-06-02,63
2,2005-04-17,63
3,2005-08-22,63
4,2005-08-29,63
...,...,...
1303030,2013-08-30,77
1303031,2009-07-20,77
1303032,2010-10-15,77
1303033,2010-09-17,77


We then build two datasets:
1. **"df_state"**, obtained by grouping on the date and the county identifier and averaging the weather measurements. A county can have several readings for the same day, depending on how many stations it contains, hence the averaging. Despite its name, this DataFrame is at the county level.
2. **"df_fire_grouped"**, the counterpart of **"df_state"** for the fire data

In [11]:
df_state = df.groupby(by = ['DATE', 'COUNTYFP']).mean(numeric_only=True) # group the weather data by day and county
df_fire = df_fire.rename(columns={'DISCOVERY_DATE' : 'DATE'}).assign(fire=1) # flag every recorded fire; new objects avoid SettingWithCopyWarning
df_fire_grouped = df_fire.groupby(by = ['DATE', 'COUNTYFP']).mean() # group the fires by day and county

In [12]:
df_state

DATE,COUNTYFP,ELEVATION,TEMP,DEWP,SLP,STP,VISIB,WDSP,MXSPD,GUST,MAX,MIN,PRCP,SNDP,FRSHTT
1992-01-01,1,306.000000,44.15,38.925000,1025.650000,727.525,8.625,3.725,7.775,,54.925000,37.075,0.00,999.9,75000.000000
1992-01-01,3,664.500000,42.80,33.700000,1026.900000,948.300,38.900,3.600,7.000,,53.100000,33.100,0.00,999.9,0.000000
1992-01-01,7,1533.100000,22.30,15.700000,1029.600000,852.000,24.400,4.000,7.000,,43.000000,8.100,0.00,999.9,0.000000
1992-01-01,11,1993.000000,25.60,16.300000,,999.900,,4.900,9.900,,37.000000,19.200,0.00,999.9,0.000000
1992-01-01,13,657.450000,39.45,34.650000,1028.500000,938.150,13.650,5.000,10.950,,45.950000,32.000,0.00,999.9,50000.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-31,107,112.300000,38.65,33.150000,1024.900000,10.600,7.200,2.250,6.500,,53.500000,28.000,0.00,999.9,50000.000000
2015-12-31,109,1332.000000,31.60,29.800000,,972.900,9.400,1.500,5.100,,41.650000,24.550,0.00,999.9,0.000000
2015-12-31,111,12.833333,53.20,26.866667,1020.733333,18.700,10.000,10.200,22.700,32.7,65.566667,37.700,0.00,999.9,3333.333333
2015-12-31,115,28.500000,38.45,31.050000,1024.400000,20.950,9.250,4.200,8.450,,52.800000,26.450,0.00,999.9,50000.000000
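
One side effect of averaging every column is visible in the table above: FRSHTT, which packs six yes/no indicators (fog, rain, snow, hail, thunder, tornado) into a single number, turns into values such as 75000.0 that no longer mean anything. A possible alternative, sketched here on toy data, is to decode the indicators before grouping and keep their maximum per county and day. The names `FLAGS` and `decode_frshtt` are hypothetical, and this step is not part of the notebook.

```python
import pandas as pd

# FRSHTT packs six 0/1 indicators into one number, read left to right as
# fog, rain, snow, hail, thunder, tornado once zero-padded to six digits.
FLAGS = ["FOG", "RAIN", "SNOW", "HAIL", "THUNDER", "TORNADO"]

def decode_frshtt(series):
    """Split FRSHTT codes into one 0/1 column per indicator."""
    padded = series.astype(int).astype(str).str.zfill(6)
    return pd.DataFrame(
        {name: padded.str[i].astype(int) for i, name in enumerate(FLAGS)},
        index=series.index,
    )

# Toy readings: two stations in county 1 on the same day, one in county 3.
toy = pd.DataFrame({
    "DATE": ["1992-01-01", "1992-01-01", "1992-01-01"],
    "COUNTYFP": [1, 1, 3],
    "TEMP": [44.0, 46.0, 42.8],
    "FRSHTT": [100000, 10000, 0],
})
flags = decode_frshtt(toy["FRSHTT"])
toy = pd.concat([toy.drop(columns="FRSHTT"), flags], axis=1)

# Average the continuous reading, but take the max of each indicator so that a
# county-day is flagged as soon as one station reports the event.
agg = {"TEMP": "mean", **{name: "max" for name in FLAGS}}
county_day = toy.groupby(["DATE", "COUNTYFP"]).agg(agg)
print(county_day)
```

Whether the decoded indicators actually help the model is an open question; the sketch only shows how to keep them interpretable.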


In [13]:
df_fire

Unnamed: 0,DATE,COUNTYFP,fire
0,2005-02-02,63,1
1,2005-06-02,63,1
2,2005-04-17,63,1
3,2005-08-22,63,1
4,2005-08-29,63,1
...,...,...,...
1303030,2013-08-30,77,1
1303031,2009-07-20,77,1
1303032,2010-10-15,77,1
1303033,2010-09-17,77,1


We finish the data preparation by merging the weather data with the fire data. This gives, for each county and each date, the weather readings together with a flag indicating whether a fire was recorded on that date (0 or 1):

In [14]:
df_combined = df_state.merge(right = df_fire_grouped, right_index=True, left_index=True, how='left') #last merge on their index

In [15]:
df_combined['fire'] = df_combined['fire'].fillna(0) # no matching fire record means no fire (0)

In [16]:
df_combined['fire'] = df_combined['fire'].astype(int) #setting it to int
df_combined.to_csv('combined-data-county.csv') #export for the team
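
To make the behavior of this left merge concrete, here is a small sketch on made-up rows (the dates, county codes and temperatures are invented). It shows that every weather row is kept, that rows without a matching fire get 0, and that a fire with no weather record for its county and day is silently dropped.

```python
import pandas as pd

idx = ["DATE", "COUNTYFP"]

# Toy weather table: one row per (day, county).
weather = pd.DataFrame({
    "DATE": ["2005-02-02", "2005-02-02", "2005-02-03"],
    "COUNTYFP": [63, 77, 63],
    "TEMP": [41.0, 52.0, 43.5],
}).set_index(idx)

# Toy fire table: the second fire has no matching weather row.
fires = pd.DataFrame({
    "DATE": ["2005-02-02", "2005-02-04"],
    "COUNTYFP": [63, 63],
    "fire": [1, 1],
}).set_index(idx)

# Left join on the shared (DATE, COUNTYFP) index: every weather row is kept,
# and the fire on 2005-02-04 is dropped because no weather row matches it.
combined = weather.merge(fires, left_index=True, right_index=True, how="left")
combined["fire"] = combined["fire"].fillna(0).astype(int)

print(combined)
print("share of fire days:", combined["fire"].mean())
```

Comparing the number of fires before and after the merge is a quick way to see how many fire days are lost for lack of weather data.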

## 3 - Fire prediction


Let's start the modeling:

In [17]:
df = pd.read_csv('combined-data-county.csv')
target = df['fire']
data = df.drop(['DATE', 'COUNTYFP', 'fire'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=66)

In [18]:
target

0         0
1         0
2         0
3         0
4         0
         ..
370143    0
370144    0
370145    0
370146    0
370147    0
Name: fire, Length: 370148, dtype: int64

In [19]:
data

Unnamed: 0,ELEVATION,TEMP,DEWP,SLP,STP,VISIB,WDSP,MXSPD,GUST,MAX,MIN,PRCP,SNDP,FRSHTT
0,306.000000,44.15,38.925000,1025.650000,727.525,8.625,3.725,7.775,,54.925000,37.075,0.00,999.9,75000.000000
1,664.500000,42.80,33.700000,1026.900000,948.300,38.900,3.600,7.000,,53.100000,33.100,0.00,999.9,0.000000
2,1533.100000,22.30,15.700000,1029.600000,852.000,24.400,4.000,7.000,,43.000000,8.100,0.00,999.9,0.000000
3,1993.000000,25.60,16.300000,,999.900,,4.900,9.900,,37.000000,19.200,0.00,999.9,0.000000
4,657.450000,39.45,34.650000,1028.500000,938.150,13.650,5.000,10.950,,45.950000,32.000,0.00,999.9,50000.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370143,112.300000,38.65,33.150000,1024.900000,10.600,7.200,2.250,6.500,,53.500000,28.000,0.00,999.9,50000.000000
370144,1332.000000,31.60,29.800000,,972.900,9.400,1.500,5.100,,41.650000,24.550,0.00,999.9,0.000000
370145,12.833333,53.20,26.866667,1020.733333,18.700,10.000,10.200,22.700,32.7,65.566667,37.700,0.00,999.9,3333.333333
370146,28.500000,38.45,31.050000,1024.400000,20.950,9.250,4.200,8.450,,52.800000,26.450,0.00,999.9,50000.000000


We reuse GridSearchCV to determine the best hyperparameters for this dataset:

In [20]:
HGBC = HistGradientBoostingClassifier()
params = {'max_leaf_nodes': [20, 31, 40, 50],
          'max_depth': [20, 40, 60, 80, 100, None],
          'min_samples_leaf': [10, 20, 40]}

gridcv = GridSearchCV(HGBC, param_grid=params, scoring='accuracy', cv=3, n_jobs=8)

gridcv.fit(X_train, y_train)
result = pd.DataFrame(gridcv.cv_results_)[['params', 'mean_test_score', 'std_test_score']]
result.to_csv('HGBC_county_result.csv')
result

Unnamed: 0,params,mean_test_score,std_test_score
0,"{'max_depth': 20, 'max_leaf_nodes': 20, 'min_s...",0.826549,0.000772
1,"{'max_depth': 20, 'max_leaf_nodes': 20, 'min_s...",0.827012,0.000098
2,"{'max_depth': 20, 'max_leaf_nodes': 20, 'min_s...",0.827039,0.000684
3,"{'max_depth': 20, 'max_leaf_nodes': 31, 'min_s...",0.828085,0.000928
4,"{'max_depth': 20, 'max_leaf_nodes': 31, 'min_s...",0.827812,0.000748
...,...,...,...
67,"{'max_depth': None, 'max_leaf_nodes': 40, 'min...",0.828494,0.000872
68,"{'max_depth': None, 'max_leaf_nodes': 40, 'min...",0.828622,0.000698
69,"{'max_depth': None, 'max_leaf_nodes': 50, 'min...",0.829257,0.000816
70,"{'max_depth': None, 'max_leaf_nodes': 50, 'min...",0.829406,0.001049


We will therefore use the following hyperparameters:

In [21]:
gridcv.best_params_

{'max_depth': 20, 'max_leaf_nodes': 50, 'min_samples_leaf': 40}

Let's train the model and make the predictions:

In [22]:
HGBC = HistGradientBoostingClassifier(max_depth=20, max_leaf_nodes=50, min_samples_leaf=40)
HGBC.fit(X_train, y_train)

In [24]:
y_pred = HGBC.predict(X_test)
pd.crosstab(y_test, y_pred, rownames=['Actual class'], colnames=['Predicted class'])

Actual class / Predicted class,0,1
0,53842,3648
1,8873,7667


In [25]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.94      0.90     57490
           1       0.68      0.46      0.55     16540

    accuracy                           0.83     74030
   macro avg       0.77      0.70      0.72     74030
weighted avg       0.82      0.83      0.82     74030



In [27]:
f1_score(y_test, y_pred)

0.5504936277149525
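
As a cross-check, the scores reported above can be recomputed by hand from the four counts of the confusion matrix. The short sketch below does so and adds the accuracy of a trivial model that always predicts "no fire", which puts the 0.83 accuracy in perspective. The variable names are chosen for the example.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 53842, 3648
fn, tp = 8873, 7667

total = tn + fp + fn + tp
precision = tp / (tp + fp)   # share of predicted fires that were real fires
recall = tp / (tp + fn)      # share of real fires that were predicted
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a trivial model that always predicts "no fire".
baseline_accuracy = (tn + fp) / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f}")
print(f"accuracy={accuracy:.3f} baseline={baseline_accuracy:.3f}")
```

The model therefore misses a bit more than half of the actual fire days, and its accuracy is only a few points above the always-negative baseline.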

**Conclusion:** The F1 score is markedly lower than the one obtained for state-level prediction. For the state of California that score was 0.97, whereas with this model adapted to California's counties it falls to 0.55. This is not really surprising: the smaller the geographic area the prediction must cover, the more precise the model has to be. Yet we mechanically have less data per county than per state, which partly explains the drop in predictive performance. In addition, our model relies solely on weather data, which can capture conditions favorable to the spread of fires but are not sufficient to predict the occurrence of a fire, and the smaller the geographic area being predicted, the more this limitation weighs on the model.