<span style="color:#a61c00;font-size:3em">Wildfires in USA - PREPROCESSING</span> 

<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>Import des librairies</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>

In [171]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import datetime
import calendar

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, RobustScaler, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RepeatedStratifiedKFold
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif, mutual_info_classif,\
    f_regression, mutual_info_regression, RFE, RFECV

from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier

from imblearn.under_sampling import RandomUnderSampler, OneSidedSelection
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None
pd.options.display.max_info_columns = 100
np.set_printoptions(threshold=10000)

%matplotlib inline
%config Completer.use_jedi = False

<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>Import du dataset Kaggle</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>

In [172]:
fires_orig = pd.read_csv('FPA_FOD_20170508.Fires_IMPORT.csv', sep=';')

  fires_orig = pd.read_csv('FPA_FOD_20170508.Fires_IMPORT.csv', sep=';')


In [173]:
# autre méthode avec parse_dates
# fires_orig = pd.read_csv('FPA_FOD_20170508.Fires_IMPORT.csv', sep=';', parse_dates=[['FIRE_YEAR','DISCOVERY_DOY']], date_format='%Y %j', keep_date_col=True)

In [174]:
# Création d'une copie du dataset
fires = fires_orig.copy()

# Quelques statistiques

## Type de variables et nombre de valeurs non nulles

In [175]:
fires.info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 38 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   OBJECTID                    1880465 non-null  int64  
 1   FOD_ID                      1880465 non-null  int64  
 2   FPA_ID                      1880465 non-null  object 
 3   SOURCE_SYSTEM_TYPE          1880465 non-null  object 
 4   SOURCE_SYSTEM               1880465 non-null  object 
 5   NWCG_REPORTING_AGENCY       1880465 non-null  object 
 6   NWCG_REPORTING_UNIT_ID      1880465 non-null  object 
 7   NWCG_REPORTING_UNIT_NAME    1880465 non-null  object 
 8   SOURCE_REPORTING_UNIT       1880465 non-null  object 
 9   SOURCE_REPORTING_UNIT_NAME  1880465 non-null  object 
 10  LOCAL_FIRE_REPORT_ID        421179 non-null   object 
 11  LOCAL_INCIDENT_ID           1059644 non-null  object 
 12  FIRE_CODE                   324724 non-null   object 
 1

## Statistiques des colonnes numériques

In [176]:
fires.describe()

Unnamed: 0,OBJECTID,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,LATITUDE,LONGITUDE,OWNER_CODE,FIPS_CODE
count,1880465.0,1880465.0,1880465.0,1880465.0,1880465.0,997827.0,1880465.0,988934.0,988934.0,907912.0,1880465.0,1880465.0,1880465.0,1880465.0,1202317.0
mean,940233.0,54840200.0,2003.71,2453064.0,164.7191,1453.014326,5.979037,2453238.0,172.656766,1534.83208,74.52016,36.78121,-95.70494,10.59658,95.7835
std,542843.6,101196300.0,6.663099,2434.573,90.03891,405.960963,3.48386,2687.548,84.320348,432.737694,2497.598,6.139031,16.71694,4.404662,98.61505
min,1.0,1.0,1992.0,2448622.0,1.0,0.0,1.0,2448622.0,1.0,0.0,1e-05,17.93972,-178.8026,0.0,1.0
25%,470117.0,505500.0,1998.0,2451084.0,89.0,1240.0,3.0,2450701.0,102.0,1310.0,0.1,32.8186,-110.3635,8.0,29.0
50%,940233.0,1067761.0,2004.0,2453178.0,164.0,1457.0,5.0,2453466.0,181.0,1600.0,1.0,35.4525,-92.04304,14.0,67.0
75%,1410349.0,19106390.0,2009.0,2455036.0,230.0,1708.0,9.0,2455754.0,232.0,1810.0,3.3,40.8272,-82.2976,14.0,121.0
max,1880465.0,300348400.0,2015.0,2457388.0,366.0,2359.0,13.0,2457392.0,366.0,2359.0,606945.0,70.3306,-65.25694,15.0,810.0


In [177]:
# Mise en évidence de l'incohérence de la date de fin de certains feux, dans le dataset sans traitment : 
fires[['FIRE_YEAR', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_TIME', 'CONT_DATE', 'CONT_DOY', 'CONT_TIME']].iloc[362576]

FIRE_YEAR            1999.0
DISCOVERY_DATE    2451345.5
DISCOVERY_DOY         167.0
DISCOVERY_TIME       1300.0
CONT_DATE         2454998.5
CONT_DOY              167.0
CONT_TIME            1430.0
Name: 362576, dtype: float64

On constate que : 
- les jours de l'année sont identiques : 167
- les horaires plutôt cohérents avec un début à 13h et une fin à 14h30
- un compteur de jour totalement aberrant pour la date de fin, avec 3653 jours, soit 10 années pile, dont 3 années bissextiles (2000, 2004, 2008).

On en conclut à un problème de traitement de données.

# Suppression des colonnes

## Colonnes majoritairement vides

Les colonnes suivantes ont un taux de valeurs manquantes élevé ( > 40 %) et ne sont pas nécessairement pertinentes pour répondre à la problématique.

In [178]:
cols_empty = [
    'ICS_209_INCIDENT_NUMBER', 
    'ICS_209_NAME', 
    'MTBS_ID', 
    'MTBS_FIRE_NAME', 
    'COMPLEX_NAME',
    'LOCAL_FIRE_REPORT_ID', 
    'LOCAL_INCIDENT_ID', 
    'FIRE_CODE', 
    'LOCAL_INCIDENT_ID', 
    'FIRE_NAME'
]

In [179]:
fires = fires.drop(cols_empty, axis=1)

## Colonnes non pertinentes

Les colonnes suivantes n'ont pas d'intérêt quant à la problématique ou présentent trop de valeurs impropres à l'utilisation. 

In [180]:
cols_to_drop = [
    'OBJECTID',
    'SOURCE_SYSTEM_TYPE',
    'SOURCE_SYSTEM',
    'NWCG_REPORTING_AGENCY',
    'NWCG_REPORTING_UNIT_ID',
    'NWCG_REPORTING_UNIT_NAME',
    'SOURCE_REPORTING_UNIT',
    'SOURCE_REPORTING_UNIT_NAME',
    'COUNTY',
    'FIPS_CODE',
    'FIPS_NAME'
]

In [181]:
fires = fires.drop(cols_to_drop, axis=1)

In [182]:
# # fires.info(verbose=True, memory_usage=True, show_counts=True)

# Colonnes d'ID : doublons, nettoyage

## Lignes entières

In [183]:
fires.duplicated().sum()

0

Il n'y a pas de lignes entières en doublon dans le jeu de données. 

## Identifiant fonctionnel FPA_ID

In [184]:
fires.loc[fires['FPA_ID'].duplicated(keep=False)].sort_values(by='FPA_ID')

Unnamed: 0,FOD_ID,FPA_ID,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE
21986,22093,FS-1452833,2007,2454299.5,199,1030.0,1.0,Lightning,2454307.5,207.0,1500.0,4.25,B,35.312778,-107.593056,5.0,USFS,NM
1565829,201432072,FS-1452833,2012,2456111.5,185,1500.0,1.0,Lightning,2456112.5,186.0,1400.0,0.1,A,35.337222,-107.779444,5.0,USFS,NM
1065673,1300088,ICS209_2009_KS-DDQ-128,2009,2454881.5,50,1400.0,13.0,Missing/Undefined,2454881.5,50.0,1930.0,2490.0,F,39.234444,-96.830278,6.0,OTHER FEDERAL,KS
1634979,201750002,ICS209_2009_KS-DDQ-128,2012,2455990.5,64,1300.0,13.0,Missing/Undefined,2456020.5,94.0,1500.0,2200.0,F,39.22,-96.94,6.0,OTHER FEDERAL,KS
1825692,300245030,SFO-2015CACDFLNU003791,2015,2457154.5,132,1031.0,5.0,Debris Burning,2457154.5,132.0,1050.0,0.52,B,38.715883,-122.994933,15.0,UNDEFINED FEDERAL,CA
1870332,300306586,SFO-2015CACDFLNU003791,2015,2457204.5,182,1751.0,13.0,Missing/Undefined,,,,0.01,A,38.342004,-121.958596,14.0,MISSING/NOT SPECIFIED,CA


Il y a des doublons d'ID fonctionnels.  
On remarque qu'on pourrait utiliser le FPA_ID pour déduire l'année du feu, puisque l'ID comporte parfois l'année. Toutefois, comme on ne sait pas où se trouve l'erreur, on décide de supprimer ces lignes, vu leur petit nombre.

In [185]:
# Avant suppression des doublons FPA_ID
print('Avant suppression :')
print(fires['FPA_ID'].info(verbose=True, memory_usage=True, show_counts=True), '\n\n============================\n')

# Suppression
fires = fires.drop_duplicates('FPA_ID')

# Après suppression des doublons FPA_ID
print('Après suppression :')
print(fires['FPA_ID'].info(verbose=True, memory_usage=True, show_counts=True))

Avant suppression :
<class 'pandas.core.series.Series'>
RangeIndex: 1880465 entries, 0 to 1880464
Series name: FPA_ID
Non-Null Count    Dtype 
--------------    ----- 
1880465 non-null  object
dtypes: object(1)
memory usage: 14.3+ MB
None 


Après suppression :
<class 'pandas.core.series.Series'>
Index: 1880462 entries, 0 to 1880464
Series name: FPA_ID
Non-Null Count    Dtype 
--------------    ----- 
1880462 non-null  object
dtypes: object(1)
memory usage: 28.7+ MB
None


Les doublons ont bien été supprimés.

## Identifiant générique FOD_ID

In [186]:
fires.loc[fires['FOD_ID'].duplicated(keep=False)].sort_values(by='FOD_ID')

Unnamed: 0,FOD_ID,FPA_ID,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE


Il n'y a pas de doublons d'ID techniques.

## Suppression des espaces en tête et fin d'ID

In [187]:
set(fires['FPA_ID'])

{'SFO-SC0402213116923',
 'SFO-OK01410606-30367_03291424',
 'STATE_MS_93763',
 'W-394491',
 'SFO-TX01430696-10738698',
 'ODF-63278',
 'W-456482',
 'SFO-FL062006-06-0934',
 'SFO-2015NY2401NY2401-2015-0872345',
 'NM98-40950734X',
 'TFS-TX2009-75377',
 '2011TDA10319',
 'SWRA_OK_12256',
 'HIWMO-MA1509',
 'SWRA_SC_55686',
 'ALS-HSV-20030325-002',
 'SFO-2013SCSCS14FF0186',
 'SFO-GA00060503-37-207-0008-10',
 '2011MTNWS000415',
 'W-331655',
 'IA-IITF-26637',
 'SFO-2013MSMFCMS04520131028009',
 'W-585444',
 'SFO-GA-WIL-27-5/16/1995-1312',
 'ODF-75841',
 'SWRA_AL_46148',
 'FS-350983',
 'SFO-MS-2008-MS3952813145',
 'SFO-MN0349-8073',
 'TFS_NC_175691',
 'NCST-086-20100013',
 'FS-327714',
 'ODF-76159',
 'SFO-2013MNDNR2013-234-042',
 '2011SCSCS11FF0770',
 'FS-275365',
 'W-363392',
 'W-572664',
 'W-339345',
 'SFO-GA00770404-42-163-0001-07',
 'SFO-2015FLFLS2015120316',
 'W-125237',
 'HIWMO-MA859',
 'SFO-NC0457-NCST-018-20090034',
 'SWRA_GA_52634',
 'FS-349570',
 'TFS-TXFD2010-266638',
 'SFO-NY-NY4201-20

On constate qu'il y a des espaces en fin d'ID.

In [188]:
# Suppression des espaces en tête et fin de chaîne
fires['FPA_ID'] = fires['FPA_ID'].str.strip()

In [189]:
set(fires['FPA_ID'])

{'SFO-SC0402213116923',
 'SFO-OK01410606-30367_03291424',
 'STATE_MS_93763',
 'W-394491',
 'SFO-TX01430696-10738698',
 'ODF-63278',
 'CDF_1997_56_2229_200',
 'W-456482',
 'SFO-FL062006-06-0934',
 'SFO-2015NY2401NY2401-2015-0872345',
 'NM98-40950734X',
 'TFS-TX2009-75377',
 '2011TDA10319',
 'SWRA_OK_12256',
 'HIWMO-MA1509',
 'SWRA_SC_55686',
 'ALS-HSV-20030325-002',
 'SFO-2013SCSCS14FF0186',
 'SFO-GA00060503-37-207-0008-10',
 '2011MTNWS000415',
 'W-331655',
 'CDF_1993_54_2235_516',
 'IA-IITF-26637',
 'SFO-2013MSMFCMS04520131028009',
 'W-585444',
 'SFO-GA-WIL-27-5/16/1995-1312',
 'ODF-75841',
 'SWRA_AL_46148',
 'FS-350983',
 'SFO-MS-2008-MS3952813145',
 'SFO-MN0349-8073',
 'TFS_NC_175691',
 'NCST-086-20100013',
 'FS-327714',
 'ODF-76159',
 'SFO-2013MNDNR2013-234-042',
 '2011SCSCS11FF0770',
 'FS-275365',
 'W-363392',
 'W-572664',
 'W-339345',
 'SFO-GA00770404-42-163-0001-07',
 'SFO-2015FLFLS2015120316',
 'W-125237',
 'HIWMO-MA859',
 'SFO-NC0457-NCST-018-20090034',
 'SWRA_GA_52634',
 'FS-3

In [190]:
fires_fpa_set = set(fires['FPA_ID'])
print(f"Nombres d'ID fonctionnels uniques dans le dataset 'fires' : {len(fires_fpa_set)}")

Nombres d'ID fonctionnels uniques dans le dataset 'fires' : 1880462


# Changement de type

Comme les valeurs sont en nombre restreint dans certaines colonnes, on modifie le type de certaines colonnes d'object à category afin de gagner de l'espace mémoire.  
De même, on transforme le type de certaines colonnes numériques en un type plus léger.

In [191]:
# fires.columns

In [192]:
# Colonnes catégorielles
# si besoin, jeter un coup d'oeil à la documentation pd.Categorical()
fires[['STAT_CAUSE_DESCR', 'FIRE_SIZE_CLASS', 'OWNER_DESCR', 'STATE']] = \
    fires[['STAT_CAUSE_DESCR', 'FIRE_SIZE_CLASS', 'OWNER_DESCR', 'STATE']].astype('category')

In [193]:
# Colonnes catégorielles 
fires[['STAT_CAUSE_CODE', 'OWNER_CODE']] = fires[['STAT_CAUSE_CODE', 'OWNER_CODE']].astype('uint8')

# Colonnes numériques
fires[['FIRE_YEAR', 'DISCOVERY_DOY']] = fires[['FIRE_YEAR', 'DISCOVERY_DOY']].astype('uint16')

In [194]:
# fires.info()

# Renommage de colonnes  
Par souci de praticité et de temps, on raccourcit certains noms de colonne.

In [195]:
fires.rename(
    {
        'FIRE_YEAR':'DISC_YEAR',
        'DISCOVERY_DATE':'DISC_DATE',
        'DISCOVERY_DOY':'DISC_DOY',
        'DISCOVERY_TIME':'DISC_TIME',
        'STAT_CAUSE_CODE':'CAUSE_CODE',
        'STAT_CAUSE_DESCR':'CAUSE_DESCR',
        'FIRE_SIZE':'SIZE',
        'FIRE_SIZE_CLASS':'CLASS',
        'LATITUDE':'LAT',
        'LONGITUDE':'LON'
    }, 
    axis=1, inplace=True)

In [196]:
fires.columns

Index(['FOD_ID', 'FPA_ID', 'DISC_YEAR', 'DISC_DATE', 'DISC_DOY', 'DISC_TIME',
       'CAUSE_CODE', 'CAUSE_DESCR', 'CONT_DATE', 'CONT_DOY', 'CONT_TIME',
       'SIZE', 'CLASS', 'LAT', 'LON', 'OWNER_CODE', 'OWNER_DESCR', 'STATE'],
      dtype='object')

# Recalage et renommage des colonnes "XX_DATE"

Les deux colonnes DISC_DATE et CONT_DATE sont en fait des sortes de compteurs de jour, dont la plage correspond à la période temporelle étudiée en jours.

In [197]:
# calcul de la durée temporelle entre les dates de début de feu
fires[['DISC_DATE']].max() - fires[['DISC_DATE']].min()

DISC_DATE    8765.0
dtype: float64

In [198]:
# calcul de la durée temporelle entre les dates de fin de feu
fires[['CONT_DATE']].max() - fires[['CONT_DATE']].min()

CONT_DATE    8769.0
dtype: float64

On renomme les deux colonnes DISC_DATE et CONT_DATE pour mettre en évidence l'aspect "compteur de jours".

In [199]:
fires = fires.rename({'DISC_DATE':'DISC_DAYS', 'CONT_DATE':'CONT_DAYS'}, axis=1)

In [200]:
# fires.columns

In [201]:
fires[['DISC_DAYS', 'CONT_DAYS']].agg(['min', 'max'])

Unnamed: 0,DISC_DAYS,CONT_DAYS
min,2448622.5,2448622.5
max,2457387.5,2457391.5


On recale ces compteurs à 0 : la référence est alors le premier jour du dataset, à savoir le 01/01/1992.

In [202]:
min_days = fires['DISC_DAYS'].min()

# Recalage de la colonne de compteur de la date de début de feu
fires['DISC_DAYS'] = fires['DISC_DAYS'] - min_days

# Recalage de la colonne de compteur de la date de fin de feu
# Attention : utiliser le même minimum, même s'ils sont identiques
fires['CONT_DAYS'] = fires['CONT_DAYS'] - min_days

In [203]:
min_days

2448622.5

In [204]:
# Vérification du recalage
fires[['DISC_DAYS', 'CONT_DAYS']].agg(['min', 'max'])

Unnamed: 0,DISC_DAYS,CONT_DAYS
min,0.0,0.0
max,8765.0,8769.0


# Enrichissement du dataset

## Colonne "DUR_DAYS" de durée de feu

On crée une colonne de durée de feu en jours. Malheureusement, il manque près de la moitié des valeurs dans la colonne "CONT_DATE", compteur de jours qui marque la fin du feu. 
Cela nous permet notamment de créer la colonne "YEAR" pour la date de maîtrise du feu.

<span style="color:red">ATTENTION : il s'agit de la partie entière de la durée en jours. Cela signifie qu'un feu de 2 h aura une durée en jours de 0 ou encore qu'un feu de 26 h aura une durée de 1 jour.</span> 

In [205]:
# Durée de feu en jour
fires['DUR_DAYS'] = fires['CONT_DAYS'] - fires['DISC_DAYS']
# fires[['CONT_DAYS','DISC_DAYS','DUR_DAYS']].head()

In [206]:
# fires.loc[fires['CONT_DAYS'].isna(),['CONT_DAYS','DISC_DAYS','DUR_DAYS']].head()

## Nouvelles colonnes "DISC_DATE" et "CONT_DATE"

On crée une colonne de date de début du feu et une colonne de date de fin de feu.

In [207]:
# Création de la date de début du feu
fires['DISC_DATE'] = \
    pd.to_datetime(fires['DISC_YEAR'].astype('str') + fires['DISC_DOY'].astype('str')
                   , format='%Y%j'
                   , errors='coerce')

In [208]:
fires.head()

Unnamed: 0,FOD_ID,FPA_ID,DISC_YEAR,DISC_DAYS,DISC_DOY,DISC_TIME,CAUSE_CODE,CAUSE_DESCR,CONT_DAYS,CONT_DOY,CONT_TIME,SIZE,CLASS,LAT,LON,OWNER_CODE,OWNER_DESCR,STATE,DUR_DAYS,DISC_DATE
0,1,FS-1418826,2005,4781.0,33,1300.0,9,Miscellaneous,4781.0,33.0,1730.0,0.1,A,40.036944,-121.005833,5,USFS,CA,0.0,2005-02-02
1,2,FS-1418827,2004,4515.0,133,845.0,1,Lightning,4515.0,133.0,1530.0,0.25,A,38.933056,-120.404444,5,USFS,CA,0.0,2004-05-12
2,3,FS-1418835,2004,4534.0,152,1921.0,5,Debris Burning,4534.0,152.0,2024.0,0.1,A,38.984167,-120.735556,13,STATE OR PRIVATE,CA,0.0,2004-05-31
3,4,FS-1418845,2004,4562.0,180,1600.0,1,Lightning,4567.0,185.0,1400.0,0.1,A,38.559167,-119.913333,5,USFS,CA,5.0,2004-06-28
4,5,FS-1418847,2004,4562.0,180,1600.0,1,Lightning,4567.0,185.0,1200.0,0.1,A,38.559167,-119.933056,5,USFS,CA,5.0,2004-06-28


In [209]:
fires.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1880462 entries, 0 to 1880464
Data columns (total 20 columns):
 #   Column       Dtype         
---  ------       -----         
 0   FOD_ID       int64         
 1   FPA_ID       object        
 2   DISC_YEAR    uint16        
 3   DISC_DAYS    float64       
 4   DISC_DOY     uint16        
 5   DISC_TIME    float64       
 6   CAUSE_CODE   uint8         
 7   CAUSE_DESCR  category      
 8   CONT_DAYS    float64       
 9   CONT_DOY     float64       
 10  CONT_TIME    float64       
 11  SIZE         float64       
 12  CLASS        category      
 13  LAT          float64       
 14  LON          float64       
 15  OWNER_CODE   uint8         
 16  OWNER_DESCR  category      
 17  STATE        category      
 18  DUR_DAYS     float64       
 19  DISC_DATE    datetime64[ns]
dtypes: category(4), datetime64[ns](1), float64(9), int64(1), object(1), uint16(2), uint8(2)
memory usage: 204.4+ MB


In [210]:
365*20

7300

In [211]:
fires['DUR_DAYS'].max()

4018.0

In [212]:
# Création de la date de fin du feu
fires['CONT_DATE'] = \
    fires.loc[fires['DUR_DAYS'].notna()]['DISC_DATE'] + pd.to_timedelta(fires['DUR_DAYS'], unit='D')

In [213]:
# fires.head()

In [214]:
# fires.loc[fires['CONT_DATE'].isna()].head()

## Colonne "CONT_YEAR" pour l'année de maîtrise du feu

Par souci d'homogénéité, on crée une colonne "CONT_YEAR" afin d'avoir l'année de fin du feu, pour les lignes disposant de l'information de la date de feu. 

In [215]:
# Création de la colonne de l'année de fin de feu
fires['CONT_YEAR'] = fires['CONT_DATE'].dt.year
# fires[['CONT_YEAR']].head()

In [216]:
# fires[fires['CONT_YEAR'].isna()].head()

### Colonnes "HOUR" et "MINUTE" pour les horaires de départ et de fin de feu

On crée une colonne pour l'heure et une pour les minutes des horaires de départ et de fin de feu pour une utilisation potentielle plus tard dans l'imputing ou les analyses.  
On supprime les deux colonnes de départ.

In [217]:
# Récupération de l'heure de début de feu
fires['DISC_HOUR'] = fires['DISC_TIME'] // 100
# Récupération des minutes de début de feu
fires['DISC_MIN'] = fires['DISC_TIME'] % 100

# Récupération de l'heure de fin de feu
fires['CONT_HOUR'] = fires['CONT_TIME'] // 100
# Récupération des minutes de fin de feu
fires['CONT_MIN'] = fires['CONT_TIME'] % 100

In [218]:
fires[['DISC_TIME', 'DISC_HOUR', 'DISC_MIN', 'CONT_TIME', 'CONT_HOUR', 'CONT_MIN']].head()

Unnamed: 0,DISC_TIME,DISC_HOUR,DISC_MIN,CONT_TIME,CONT_HOUR,CONT_MIN
0,1300.0,13.0,0.0,1730.0,17.0,30.0
1,845.0,8.0,45.0,1530.0,15.0,30.0
2,1921.0,19.0,21.0,2024.0,20.0,24.0
3,1600.0,16.0,0.0,1400.0,14.0,0.0
4,1600.0,16.0,0.0,1200.0,12.0,0.0


In [219]:
fires[['DISC_TIME', 'DISC_HOUR', 'DISC_MIN', 'CONT_TIME', 'CONT_HOUR', 'CONT_MIN']].info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 1880462 entries, 0 to 1880464
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   DISC_TIME  997824 non-null  float64
 1   DISC_HOUR  997824 non-null  float64
 2   DISC_MIN   997824 non-null  float64
 3   CONT_TIME  907910 non-null  float64
 4   CONT_HOUR  907910 non-null  float64
 5   CONT_MIN   907910 non-null  float64
dtypes: float64(6)
memory usage: 164.9 MB


Bien qu'il y ait un peu plus d'horaires de début de feu que de fin de feu, on remarque qu'ils manquent tout de même près de la moitié des horaires. Ceci est plutôt logique car il ne doit pas être toujours aisé de donner avec précision l'heure de départ ou de fin d'un feu. 

In [220]:
# fires.loc[fires['CONT_TIME'].isna(),['CONT_TIME','CONT_HOUR','CONT_MIN']].head()

In [221]:
# Après séparation des heures et minutes, suppression des colonnes initiales
fires.drop(['DISC_TIME','CONT_TIME'], axis=1, inplace=True)

## Nouvelles colonnes "DISC_DATETIME" et "CONT_DATETIME"

On crée deux colonnes datetime pour les dates et horaires de départ de feu et de fin de feu. Cela permettra d'affiner, pour les lignes complètes, la durée du feu.

In [222]:
# Création d'une colonne de datetime de début de feu
fires['DISC_DATETIME'] = \
    fires.loc[fires['DISC_HOUR'].notna()]['DISC_DATE'] + \
    pd.to_timedelta(fires['DISC_HOUR'], unit='h') + pd.to_timedelta(fires['DISC_MIN'], unit='m')

In [223]:
# fires[['DISC_DATETIME', 'DISC_DATE', 'DISC_HOUR', 'DISC_MIN']].head()

In [224]:
# fires.loc[fires['DISC_DATETIME'].isna()].head()

In [225]:
# Création d'une colonne de datetime de début de feu
fires['CONT_DATETIME'] = \
    fires.loc[fires['CONT_HOUR'].notna()]['CONT_DATE'] + \
    pd.to_timedelta(fires['CONT_HOUR'], unit='h') + pd.to_timedelta(fires['CONT_MIN'], unit='m')

In [226]:
# fires[['CONT_DATETIME', 'CONT_DATE', 'CONT_HOUR', 'CONT_MIN']].head()

In [227]:
# fires.loc[fires['CONT_DATETIME'].isna()].head()

## Nouvelle colonne "DUR_MIN"

On crée une colonne de durée de feu en minutes, ce qui enrichira le dataset d'une nouvelle variable.  
Cette variable servira aussi pour l'impute sur les durées manquantes dans une partie complémentaire à l'étude initiale.

In [228]:
fires['DUR_MIN'] = (fires['CONT_DATETIME'] - fires['DISC_DATETIME']) / pd.Timedelta(minutes=1)

In [229]:
fires[['DUR_MIN', 'CONT_DATETIME', 'DISC_DATETIME']].head()

Unnamed: 0,DUR_MIN,CONT_DATETIME,DISC_DATETIME
0,270.0,2005-02-02 17:30:00,2005-02-02 13:00:00
1,405.0,2004-05-12 15:30:00,2004-05-12 08:45:00
2,63.0,2004-05-31 20:24:00,2004-05-31 19:21:00
3,7080.0,2004-07-03 14:00:00,2004-06-28 16:00:00
4,6960.0,2004-07-03 12:00:00,2004-06-28 16:00:00


In [230]:
fires[['DUR_MIN']].info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 1880462 entries, 0 to 1880464
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   DUR_MIN  892005 non-null  float64
dtypes: float64(1)
memory usage: 93.2 MB


Il aurait fallu à cet endroit tracer un boxplot de la durée en fonction de la classe de feu pour se rendre compte qu'il y avait un problème pour certains feux qui dureraient, prétendument, 10 ans pour certains.  
Malheureusement, erreur de novice : Thibault est parti du principe que le nettoyage était bon au-dessus, d'autant plus qu'il était précisé sur Kaggle que le dataset était plutôt clean. Il a réalisé trop tard que même un simple calcul comme une différence pouvait mettre au jour une nouvelle incohérence des données.   

Petit aperçu du problème ci-dessous :  certains feux de classe "petite", A ou B, durent 10 ans. On suppose une erreur dans le fameux compteur de jour, d'autant plus que l'horaire de fin semble lui cohérent avec l'horaire de début (exemple : première ligne, avec un classe A qui aurait duré 1h30, si l'année de fin était bien identique à l'année de début).

In [231]:
fires.loc[fires['DUR_MIN'] > 1E6, [
    'FPA_ID', 'DUR_MIN', 'SIZE', 'CLASS', 
    'DISC_YEAR', 'DISC_DATETIME', 'CONT_YEAR', 'CONT_DATETIME', 
    'CAUSE_DESCR', 'STATE']].sort_values('DUR_MIN', ascending=False)

Unnamed: 0,FPA_ID,DUR_MIN,SIZE,CLASS,DISC_YEAR,DISC_DATETIME,CONT_YEAR,CONT_DATETIME,CAUSE_DESCR,STATE
362576,FWS-1999CAGRRY269,5260410.0,0.5,B,1999,1999-06-16 13:00:00,2009.0,2009-06-16 14:30:00,Debris Burning,CA
362492,FWS-1999CAGRRX345,5260380.0,0.5,B,1999,1999-07-19 11:00:00,2009.0,2009-07-19 12:00:00,Debris Burning,CA
362655,FWS-1999CAPLRY974,5260335.0,0.5,B,1999,1999-07-11 18:15:00,2009.0,2009-07-11 18:30:00,Miscellaneous,CA
362642,FWS-1999CAPLRW160,5260335.0,0.1,A,1999,1999-08-11 14:00:00,2009.0,2009-08-11 14:15:00,Miscellaneous,CA
1317621,SFO-WV-2001-20554,4733280.0,4.0,B,2001,2001-05-14 22:00:00,2010.0,2010-05-14 22:00:00,Equipment Use,WV
1351259,SFO-NY-NY0822-2000-030003,4733280.0,0.1,A,2000,2000-03-21 11:22:00,2009.0,2009-03-21 11:22:00,Miscellaneous,NY
356156,W-513441,2708760.0,120.0,D,2000,2000-08-07 16:00:00,2005.0,2005-10-01 18:00:00,Lightning,CA
365708,FWS-2002CAGRRY287,2629470.0,0.1,A,2002,2002-09-12 06:45:00,2007.0,2007-09-12 07:15:00,Smoking,CA
325491,W-507314,2108054.0,0.1,A,2005,2005-08-25 20:45:00,2009.0,2009-08-28 18:59:00,Lightning,CO
368029,FWS-2004CATNREP7B,2103885.0,0.4,B,2004,2004-12-04 14:30:00,2008.0,2008-12-04 15:15:00,Debris Burning,CA


In [232]:
fires.iloc[362576]

FOD_ID                          373245
FPA_ID               FWS-1999CAGRRY269
DISC_YEAR                         1999
DISC_DAYS                       2723.0
DISC_DOY                           167
CAUSE_CODE                           5
CAUSE_DESCR             Debris Burning
CONT_DAYS                       6376.0
CONT_DOY                         167.0
SIZE                               0.5
CLASS                                B
LAT                          37.078344
LON                         -120.93495
OWNER_CODE                          14
OWNER_DESCR      MISSING/NOT SPECIFIED
STATE                               CA
DUR_DAYS                        3653.0
DISC_DATE          1999-06-16 00:00:00
CONT_DATE          2009-06-16 00:00:00
CONT_YEAR                       2009.0
DISC_HOUR                         13.0
DISC_MIN                           0.0
CONT_HOUR                         14.0
CONT_MIN                          30.0
DISC_DATETIME      1999-06-16 13:00:00
CONT_DATETIME      2009-0

In [233]:
# fires_duration = fires[['CLASS', 'DUR_MIN']]

# fig = px.box(fires_duration, x='CLASS', y = 'DUR_MIN', color='CLASS',
#              category_orders={"CLASS": ["A", "B", "C", "D", "E", "F", "G"]},
#              width=1000, height=600
#             )

# fig.update_layout(
#     title={
#         'text': "Distribution de la durée (min) par classe de feu",
#         'y':0.95,
#         'x':0.5,
#         'xanchor': 'center',
#         'yanchor': 'top'}, 
#     xaxis={'title':"Classe de feu",'categoryorder':'category ascending'},    
#     yaxis_title="Durée (min)"
# )

On constate des durées de feu complètement aberrants de plusieurs années. Finalement, les outliers pour les classes F et G ont l'air moins aberrants que ceux des classes A, B, etc...

In [234]:
# fires_duration = fires.loc[fires['DUR_MIN'] < 2E5, ['CLASS', 'DUR_MIN']]

# fig = px.box(fires_duration, x='CLASS', y = 'DUR_MIN', color='CLASS',
#              category_orders={"CLASS": ["A", "B", "C", "D", "E", "F", "G"]},
#              width=1000, height=600
#             )

# fig.update_layout(
#     title={
#         'text': "Distribution de la durée (min) par classe de feu (outliers supérieurs à 2E5 min supprimés)",
#         'y':0.95,
#         'x':0.5,
#         'xanchor': 'center',
#         'yanchor': 'top'}, 
#     xaxis={'title':"Classe de feu",'categoryorder':'category ascending'},    
#     yaxis_title="Durée (min)"
# )

En éliminant les plus grands outliers, on commence à voir mieux apparaître les boxplots pour les classes les plus grandes. On constate toujours la présence d'outliers pour les classes de petits feux, qui sont pourtant plus élevés que ceux de plus grande classe : ceci confirme le problème.

In [235]:
fires.groupby('CLASS')['DUR_MIN'].agg(['min', 'median', 'mean', 'max'])

Unnamed: 0_level_0,min,median,mean,max
CLASS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0.0,69.0,1471.368696,5260335.0
B,0.0,74.0,1194.246266,5260410.0
C,0.0,180.0,1948.238535,1578240.0
D,0.0,791.0,6114.313158,2708760.0
E,0.0,1785.0,10765.76264,1058535.0
F,0.0,4945.0,20580.790793,298260.0
G,0.0,19147.0,46598.507775,535095.0


In [236]:
fires.loc[fires['DUR_MIN'] < 2E5].groupby('CLASS')['DUR_MIN'].agg(['min', 'median', 'mean', 'max'])

Unnamed: 0_level_0,min,median,mean,max
CLASS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0.0,68.0,1159.479301,199870.0
B,0.0,74.0,908.07551,199843.0
C,0.0,180.0,1847.929426,197296.0
D,0.0,780.0,5358.384826,198627.0
E,0.0,1783.0,10216.503223,197160.0
F,0.0,4860.0,19834.639763,198141.0
G,0.0,18777.0,44430.310013,198749.0


## Nouvelle colonne de cause humaine : "CAUSE_DESCR_HUMAN"
Il s'agit de créer un booléen qui indique si le feu est d'origine humaine. On exclut donc la foudre ou le cas "indéfini".

In [237]:
fires['CAUSE_DESCR_HUMAN'] = fires['CAUSE_DESCR'].apply(lambda x: 1 if x not in ['Lightning', 'Missing/Undefined'] else 0)

In [238]:
fires.columns

Index(['FOD_ID', 'FPA_ID', 'DISC_YEAR', 'DISC_DAYS', 'DISC_DOY', 'CAUSE_CODE',
       'CAUSE_DESCR', 'CONT_DAYS', 'CONT_DOY', 'SIZE', 'CLASS', 'LAT', 'LON',
       'OWNER_CODE', 'OWNER_DESCR', 'STATE', 'DUR_DAYS', 'DISC_DATE',
       'CONT_DATE', 'CONT_YEAR', 'DISC_HOUR', 'DISC_MIN', 'CONT_HOUR',
       'CONT_MIN', 'DISC_DATETIME', 'CONT_DATETIME', 'DUR_MIN',
       'CAUSE_DESCR_HUMAN'],
      dtype='object')

On constate bien que la colonne a été créée en fin de dataframe.

=====================================================================================================

<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>Import du dataset "végétation" et "météo"</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>

In [239]:
fires_veg_orig = pd.read_csv('all_fires.csv', sep=',')

  fires_veg_orig = pd.read_csv('all_fires.csv', sep=',')


In [240]:
# Création d'une copie du dataset
fires_veg = fires_veg_orig.copy()
fires_veg.head()

Unnamed: 0,clean_id,Wind,FPA_ID,LATITUDE,LONGITUDE,ICS_209_INCIDENT_NUMBER,ICS_209_NAME,MTBS_ID,MTBS_FIRE_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,FIRE_SIZE,STATE,IGNITION,DISCOVERY_DAY,DISCOVERY_MONTH,DISCOVERY_YEAR,is_id_duplicated,fm,NBCD_countrywide_biomass_mosaic,us_130bps,GROUPVEG,NA_L3CODE,NA_L3NAME,NA_L1CODE,NA_L1NAME,EcoArea_km2,FIRE_SIZE_m2,FIRE_SIZE_ha
0,181642,4.166682,FS-1418918,35.000278,-83.351111,,,,,2005,2005-01-27,27,Arson,50.3,NC,Human,27,1,2005,False,16.743015,864.900146,1822,Conifer,8.4.4,Blue Ridge,8,EASTERN TEMPERATE FORESTS,40883.224113,203557.058,20.355706
1,181717,4.651072,FS-1419081,44.012778,-103.3825,,,,,2005,2005-01-02,2,Campfire,0.1,SD,Human,2,1,2005,False,14.041939,241.470001,1269,Grassland,6.2.10,Middle Rockies,6,NORTHWESTERN FORESTED MOUNTAINS,13954.952392,404.686,0.040469
2,181933,4.268664,FS-1419493,33.786111,-96.15,,,,,2005,2005-01-24,24,Arson,3.0,TX,Human,24,1,2005,False,16.593532,137.160004,1713,Grassland,8.3.7,South Central Plains,8,EASTERN TEMPERATE FORESTS,151719.535367,12140.58,1.214058
3,181934,3.978921,FS-1419494,31.3125,-94.270833,,,,,2005,2005-01-25,25,Debris Burning,55.0,TX,Human,25,1,2005,False,17.156658,788.759949,1438,Hardwood-Conifer,8.3.7,South Central Plains,8,EASTERN TEMPERATE FORESTS,151719.535367,222577.3,22.25773
4,182236,4.329414,FS-1420148,30.953889,-93.071667,,,,,2005,2005-01-23,23,Arson,4.0,LA,Human,23,1,2005,False,16.912801,612.179993,1444,Riparian,8.3.7,South Central Plains,8,EASTERN TEMPERATE FORESTS,151719.535367,16187.44,1.618744


# Quelques statistiques

## Type de variables et nombre de valeurs non nulles

In [241]:
fires_veg.info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1832837 entries, 0 to 1832836
Data columns (total 31 columns):
 #   Column                           Non-Null Count    Dtype  
---  ------                           --------------    -----  
 0   clean_id                         1832837 non-null  object 
 1   Wind                             1828970 non-null  float64
 2   FPA_ID                           1832837 non-null  object 
 3   LATITUDE                         1832837 non-null  float64
 4   LONGITUDE                        1832837 non-null  float64
 5   ICS_209_INCIDENT_NUMBER          24777 non-null    object 
 6   ICS_209_NAME                     24776 non-null    object 
 7   MTBS_ID                          10070 non-null    object 
 8   MTBS_FIRE_NAME                   10070 non-null    object 
 9   FIRE_YEAR                        1832837 non-null  int64  
 10  DISCOVERY_DATE                   1832837 non-null  object 
 11  DISCOVERY_DOY                    1832837 non-null 

## Statistiques des colonnes numériques

In [242]:
fires_veg.describe()

Unnamed: 0,Wind,LATITUDE,LONGITUDE,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,DISCOVERY_DAY,DISCOVERY_MONTH,DISCOVERY_YEAR,fm,NBCD_countrywide_biomass_mosaic,us_130bps,NA_L1CODE,EcoArea_km2,FIRE_SIZE_m2,FIRE_SIZE_ha
count,1828970.0,1832837.0,1832837.0,1832837.0,1832837.0,1832837.0,1832837.0,1832837.0,1832837.0,1828970.0,1832837.0,1832837.0,1832837.0,1832837.0,1832837.0,1832837.0
mean,3.859689,36.90394,-95.35999,2003.664,165.4501,58.51081,15.51915,5.953888,2003.664,14.05234,280.8798,1416.13,8.227097,133909.7,236785.1,23.67851
std,0.7362714,5.317256,15.31792,6.695004,90.03963,1913.249,8.798111,2.955097,6.695004,3.302363,318.7876,502.2841,1.625398,103705.6,7742649.0,774.2649
min,1.001655,24.58167,-124.7186,1992.0,1.0,0.01,1.0,1.0,1992.0,2.597198,0.0,-9999.0,5.0,0.0641084,40.4686,0.00404686
25%,3.317903,32.95309,-109.7835,1998.0,90.0,0.1,8.0,3.0,1998.0,12.4472,12.42,1135.0,8.0,53086.11,404.686,0.0404686
50%,3.849594,35.5169,-92.0669,2004.0,165.0,1.0,15.0,6.0,2004.0,14.87856,193.68,1455.0,8.0,116792.7,4046.86,0.404686
75%,4.353605,40.81225,-82.4241,2009.0,231.0,3.4,23.0,8.0,2009.0,16.46496,445.23,1811.0,9.0,166115.5,13759.32,1.375932
max,9.227564,49.34336,-66.98756,2015.0,366.0,558198.3,31.0,12.0,2015.0,26.46518,3837.331,2160.0,15.0,357667.9,2258950000.0,225895.0


## Résumé des principales caractéristiques de chaque colonne

In [243]:
# summary(fires_veg)

# Gestion des doublons d'ID

## Lignes entières

In [244]:
fires_veg.duplicated().sum()

0

Il n'y a pas de lignes entières en doublon dans le jeu de données. 

## Identifiant fonctionnel FPA_ID

On utilise tout d'abord la colonne "is_id_duplicated" qui indique les doublons d'ID FPA.

In [245]:
fires_veg_duplicated = fires_veg.loc[fires_veg['is_id_duplicated'] == True, ['FPA_ID']].sort_values(by='FPA_ID')['FPA_ID'].values
fires_veg_duplicated

array(['FS-1452833', 'ICS209_2009_KS-DDQ-128', 'SFO-2015CACDFLNU003791'],
      dtype=object)

Il y a 3 doublons repérés dans le dataset. On les supprime par précaution.

In [246]:
print('Avant suppression :')
print(fires_veg['FPA_ID'].info(verbose=True, memory_usage=True, show_counts=True), '\n\n============================\n')

fires_veg = fires_veg.loc[fires_veg['is_id_duplicated'] == False]

print('Avant suppression :')
print(fires_veg['FPA_ID'].info(verbose=True, memory_usage=True, show_counts=True))

Avant suppression :
<class 'pandas.core.series.Series'>
RangeIndex: 1832837 entries, 0 to 1832836
Series name: FPA_ID
Non-Null Count    Dtype 
--------------    ----- 
1832837 non-null  object
dtypes: object(1)
memory usage: 14.0+ MB
None 


Avant suppression :
<class 'pandas.core.series.Series'>
Index: 1832834 entries, 0 to 1832836
Series name: FPA_ID
Non-Null Count    Dtype 
--------------    ----- 
1832834 non-null  object
dtypes: object(1)
memory usage: 28.0+ MB
None


In [247]:
# Présence de doublons d'identifiants fonctionnels
fires_veg_duplicated_id = fires_veg.loc[fires_veg['FPA_ID'].duplicated(keep=False)]
fires_veg_duplicated_id.shape

(3302, 31)

In [248]:
fires_veg_duplicated_id.sort_values(by='FPA_ID').head(10)

Unnamed: 0,clean_id,Wind,FPA_ID,LATITUDE,LONGITUDE,ICS_209_INCIDENT_NUMBER,ICS_209_NAME,MTBS_ID,MTBS_FIRE_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,FIRE_SIZE,STATE,IGNITION,DISCOVERY_DAY,DISCOVERY_MONTH,DISCOVERY_YEAR,is_id_duplicated,fm,NBCD_countrywide_biomass_mosaic,us_130bps,GROUPVEG,NA_L3CODE,NA_L3NAME,NA_L1CODE,NA_L1NAME,EcoArea_km2,FIRE_SIZE_m2,FIRE_SIZE_ha
1242426,1004,2.480832,2009CAIRS12297050,39.66688,-121.7329,,,,,2009,2009-07-06,187,Miscellaneous,0.1,CA,Human,6,7,2009,False,9.217258,0.0,921,Hardwood,11.1.2,Central California Valley,11,MEDITERRANEAN CALIFORNIA,46559.6,404.686,0.040469
1242425,1004,2.480832,2009CAIRS12297050,39.66688,-121.7329,,,,,2009,2009-07-06,187,Miscellaneous,0.1,CA,Human,6,7,2009,False,9.217258,0.0,921,Hardwood,11.1.1,"California Coastal Sage, Chaparral, and Oak Woodlands",11,MEDITERRANEAN CALIFORNIA,76656.05,404.686,0.040469
1775349,2606,4.13545,2009CAIRS12900729,38.3805,-122.9206,,,,,2009,2009-11-08,312,Debris Burning,1.0,CA,Human,8,11,2009,False,20.27072,0.0,548,Conifer,7.1.8,Coast Range,7,MARINE WEST COAST FOREST,52191.55,4046.86,0.404686
1775350,2606,4.13545,2009CAIRS12900729,38.3805,-122.9206,,,,,2009,2009-11-08,312,Debris Burning,1.0,CA,Human,8,11,2009,False,20.27072,0.0,548,Conifer,11.1.1,"California Coastal Sage, Chaparral, and Oak Woodlands",11,MEDITERRANEAN CALIFORNIA,76656.05,4046.86,0.404686
695423,3029,4.340678,2010CAIRS14140327,41.38405,-122.414083,,,,,2010,2010-04-26,116,Debris Burning,0.1,CA,Human,26,4,2010,False,15.528349,397.439972,838,Riparian,6.2.8,Eastern Cascades Slopes and Foothills,6,NORTHWESTERN FORESTED MOUNTAINS,53258.453397,404.686,0.040469
695422,3029,4.340678,2010CAIRS14140327,41.38405,-122.414083,,,,,2010,2010-04-26,116,Debris Burning,0.1,CA,Human,26,4,2010,False,15.528349,397.439972,838,Riparian,6.2.7,Cascades,6,NORTHWESTERN FORESTED MOUNTAINS,13882.893312,404.686,0.040469
1011065,3431,3.428518,2010CAIRS14408433,41.38912,-122.4149,,,,,2010,2010-06-21,172,Debris Burning,0.1,CA,Human,21,6,2010,False,10.54943,543.5101,797,Conifer,6.2.7,Cascades,6,NORTHWESTERN FORESTED MOUNTAINS,13882.89,404.686,0.040469
1011066,3431,3.428518,2010CAIRS14408433,41.38912,-122.4149,,,,,2010,2010-06-21,172,Debris Burning,0.1,CA,Human,21,6,2010,False,10.54943,543.5101,797,Conifer,6.2.8,Eastern Cascades Slopes and Foothills,6,NORTHWESTERN FORESTED MOUNTAINS,53258.45,404.686,0.040469
1011544,3460,3.439111,2010CAIRS14415300,41.3331,-122.3618,,,,,2010,2010-06-25,176,Debris Burning,0.01,CA,Human,25,6,2010,False,10.84388,815.2199,798,Conifer,6.2.11,Klamath Mountains,6,NORTHWESTERN FORESTED MOUNTAINS,48311.65,40.4686,0.004047
1011543,3460,3.439111,2010CAIRS14415300,41.3331,-122.3618,,,,,2010,2010-06-25,176,Debris Burning,0.01,CA,Human,25,6,2010,False,10.84388,815.2199,798,Conifer,6.2.7,Cascades,6,NORTHWESTERN FORESTED MOUNTAINS,13882.89,40.4686,0.004047


Il y a encore des doublons au niveau de l'identifiant FPA_ID. Sur quelques exemples, on constate que les lignes sont identiques à part un élément : l'écorégion de niveau 3 (nom, code et surface).  
Par mesure de précaution et souci de rapidité, on décide de supprimer l'intégralité de ces lignes, vu leur petit nombre. 

In [249]:
print('Avant suppression :')
print(fires_veg['FPA_ID'].info(verbose=True, memory_usage=True, show_counts=True), '\n\n============================\n')

fires_veg_duplicated_id_list = fires_veg_duplicated_id['FPA_ID'].values
fires_veg = fires_veg.loc[~fires_veg['FPA_ID'].isin(fires_veg_duplicated_id_list)]

print('Avant suppression :')
print(fires_veg['FPA_ID'].info(verbose=True, memory_usage=True, show_counts=True))

Avant suppression :
<class 'pandas.core.series.Series'>
Index: 1832834 entries, 0 to 1832836
Series name: FPA_ID
Non-Null Count    Dtype 
--------------    ----- 
1832834 non-null  object
dtypes: object(1)
memory usage: 28.0+ MB
None 


Avant suppression :
<class 'pandas.core.series.Series'>
Index: 1829532 entries, 0 to 1832836
Series name: FPA_ID
Non-Null Count    Dtype 
--------------    ----- 
1829532 non-null  object
dtypes: object(1)
memory usage: 27.9+ MB
None


## Identifiant générique clean_id

In [250]:
fires_veg.loc[fires_veg['clean_id'].duplicated(keep=False)].sort_values(by='clean_id')

Unnamed: 0,clean_id,Wind,FPA_ID,LATITUDE,LONGITUDE,ICS_209_INCIDENT_NUMBER,ICS_209_NAME,MTBS_ID,MTBS_FIRE_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,FIRE_SIZE,STATE,IGNITION,DISCOVERY_DAY,DISCOVERY_MONTH,DISCOVERY_YEAR,is_id_duplicated,fm,NBCD_countrywide_biomass_mosaic,us_130bps,GROUPVEG,NA_L3CODE,NA_L3NAME,NA_L1CODE,NA_L1NAME,EcoArea_km2,FIRE_SIZE_m2,FIRE_SIZE_ha


Il n'y a pas de doublons d'ID techniques.

# Suppression des colonnes

In [251]:
# Pour une prochaine interrogation, plus loin dans le notebook
# Etats présents dans le dataset "végétation et météo"
fires_veg_set = set(fires_veg['STATE'].unique())

In [252]:
# fires_veg.columns

## Colonnes majoritairement vides  
Les colonnes suivantes ont un taux de valeurs manquantes élevé ( > 40 %) et ne sont pas nécessairement pertinentes pour répondre à la problématique.

In [253]:
cols_empty_veg = [
    'ICS_209_INCIDENT_NUMBER', 
    'ICS_209_NAME', 
    'MTBS_ID', 
    'MTBS_FIRE_NAME'
]

In [254]:
# Suppression des colonnes majoritairement vides
fires_veg = fires_veg.drop(cols_empty_veg, axis=1)

In [255]:
# fires_veg.info(verbose=True, memory_usage=True, show_counts=True)

## Colonnes non pertinentes  
Les colonnes suivantes n'ont pas d'intérêt quant à la problématique : identifiants, colonnes en doublon...

In [256]:
cols_to_drop_veg = [
    'clean_id', 
    #'FPA_ID', # conservé pour la jointure des deux datasets
    'is_id_duplicated', # tous les doublons sont déjà supprimés
    #'Wind', 'fm', # conservés car données explicatives
    'STATE', 'LATITUDE', 'LONGITUDE', # colonnes en doublon
    #'DISCOVERY_YEAR', 'DISCOVERY_DOY', 'DISCOVERY_DATE', # conservé pour la vérification de la jointure des deux datasets
    'FIRE_YEAR', 'DISCOVERY_DAY', 'DISCOVERY_MONTH',  # colonnes en doublon
    'STAT_CAUSE_DESCR', 'IGNITION', # colonnes en doublon
    'FIRE_SIZE', 'FIRE_SIZE_m2', 'FIRE_SIZE_ha', # colonnes en doublon
    'us_130bps', # code
    #'NBCD_countrywide_biomass_mosaic', 'GROUPVEG', # conservés car données explicatives
    'NA_L3CODE', 'NA_L1CODE' # code
    #'NA_L3NAME', 'NA_L1NAME', 'EcoArea_km2' # conservés car données explicatives
]

In [257]:
# Suppression de colonnes
fires_veg = fires_veg.drop(cols_to_drop_veg, axis=1)

In [258]:
fires_veg.info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 1829532 entries, 0 to 1832836
Data columns (total 11 columns):
 #   Column                           Non-Null Count    Dtype  
---  ------                           --------------    -----  
 0   Wind                             1825671 non-null  float64
 1   FPA_ID                           1829532 non-null  object 
 2   DISCOVERY_DATE                   1829532 non-null  object 
 3   DISCOVERY_DOY                    1829532 non-null  int64  
 4   DISCOVERY_YEAR                   1829532 non-null  int64  
 5   fm                               1825671 non-null  float64
 6   NBCD_countrywide_biomass_mosaic  1829532 non-null  float64
 7   GROUPVEG                         1829532 non-null  object 
 8   NA_L3NAME                        1829532 non-null  object 
 9   NA_L1NAME                        1829532 non-null  object 
 10  EcoArea_km2                      1829532 non-null  float64
dtypes: float64(4), int64(2), object(5)
memory usage: 167.5+

# Suppression des espaces en tête et fin d'ID

De la même manière que le dataset initial, on constate qu'il y a des espaces en fin d'ID.

In [259]:
set(fires_veg['FPA_ID'])

{'SFO-SC0402213116923',
 'SFO-OK01410606-30367_03291424',
 'STATE_MS_93763',
 'SFO-TX01430696-10738698',
 'ODF-63278',
 'W-394491',
 'SFO-FL062006-06-0934',
 'W-456482',
 'SFO-2015NY2401NY2401-2015-0872345',
 'NM98-40950734X',
 'TFS-TX2009-75377',
 '2011TDA10319',
 'SWRA_OK_12256',
 'SWRA_SC_55686',
 'ALS-HSV-20030325-002',
 'SFO-2013SCSCS14FF0186',
 'SFO-GA00060503-37-207-0008-10',
 '2011MTNWS000415',
 'W-331655',
 'SFO-2013MSMFCMS04520131028009',
 'W-585444',
 'SFO-GA-WIL-27-5/16/1995-1312',
 'SWRA_AL_46148',
 'ODF-75841',
 'FS-350983',
 'SFO-MS-2008-MS3952813145',
 'NCST-086-20100013',
 'TFS_NC_175691',
 'SFO-MN0349-8073',
 'FS-327714',
 'ODF-76159',
 'SFO-2013MNDNR2013-234-042',
 '2011SCSCS11FF0770',
 'FS-275365',
 'W-363392',
 'W-572664',
 'SFO-GA00770404-42-163-0001-07',
 'W-339345',
 'SFO-2015FLFLS2015120316',
 'W-125237',
 'SFO-NC0457-NCST-018-20090034',
 'SWRA_GA_52634',
 'FS-349570',
 'TFS-TXFD2010-266638',
 'SFO-NY-NY4201-2004-034000',
 'SFO-NY-NY5277-2005-0001360',
 'TFS-TX

In [260]:
fires_veg['FPA_ID'] = fires_veg['FPA_ID'].str.strip()

In [261]:
set(fires_veg['FPA_ID'])

{'SFO-SC0402213116923',
 'SFO-OK01410606-30367_03291424',
 'STATE_MS_93763',
 'SFO-TX01430696-10738698',
 'ODF-63278',
 'CDF_1997_56_2229_200',
 'W-394491',
 'SFO-FL062006-06-0934',
 'W-456482',
 'SFO-2015NY2401NY2401-2015-0872345',
 'NM98-40950734X',
 'TFS-TX2009-75377',
 '2011TDA10319',
 'SWRA_OK_12256',
 'SWRA_SC_55686',
 'ALS-HSV-20030325-002',
 'SFO-2013SCSCS14FF0186',
 'SFO-GA00060503-37-207-0008-10',
 '2011MTNWS000415',
 'W-331655',
 'CDF_1993_54_2235_516',
 'SFO-2013MSMFCMS04520131028009',
 'W-585444',
 'SFO-GA-WIL-27-5/16/1995-1312',
 'SWRA_AL_46148',
 'ODF-75841',
 'FS-350983',
 'SFO-MS-2008-MS3952813145',
 'NCST-086-20100013',
 'TFS_NC_175691',
 'SFO-MN0349-8073',
 'FS-327714',
 'ODF-76159',
 'SFO-2013MNDNR2013-234-042',
 '2011SCSCS11FF0770',
 'FS-275365',
 'W-363392',
 'W-572664',
 'SFO-GA00770404-42-163-0001-07',
 'W-339345',
 'SFO-2015FLFLS2015120316',
 'W-125237',
 'SFO-NC0457-NCST-018-20090034',
 'SWRA_GA_52634',
 'FS-349570',
 'TFS-TXFD2010-266638',
 'SFO-NY-NY4201-200

# Colonne "ECO_AREA_KM2" : correction
L'aire de la région de niveau 3 semble avoir une corrélation non négligeable avec la classe de feu, d'après le KBest (mutual_info_classif). On décide de la retravailler afin de corriger des erreurs.

In [262]:
# Colonnes présentes
fires_veg.columns

Index(['Wind', 'FPA_ID', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_YEAR',
       'fm', 'NBCD_countrywide_biomass_mosaic', 'GROUPVEG', 'NA_L3NAME',
       'NA_L1NAME', 'EcoArea_km2'],
      dtype='object')

In [263]:
print(f"Il y a {fires_veg['NA_L3NAME'].nunique()} écorégions de niveau 3.")

Il y a 85 écorégions de niveau 3.


In [264]:
# #commenté car lourd en taille de notebook
# for eco in df['ECO_REG_LVL3'].unique():
#     fig = go.Figure(data=[
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='A'), 
#                                       'ECO_AREA_1000KM2'], name='A'),
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='B'), 
#                                       'ECO_AREA_1000KM2'], name='B'),
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='C'), 
#                                       'ECO_AREA_1000KM2'], name='C'),
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='D'), 
#                                       'ECO_AREA_1000KM2'], name='D'),
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='E'), 
#                                       'ECO_AREA_1000KM2'], name='E'),
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='F'), 
#                                       'ECO_AREA_1000KM2'], name='F'),
#         go.Histogram(x=df.loc[(fires_merge['ECO_REG_LVL3']==eco) & (fires_merge['CLASS']=='G'), 
#                                       'ECO_AREA_1000KM2'], name='G')
#     ])

#     # The two histograms are drawn on top of another
#     fig.update_layout(barmode='stack', 
#                       title = f"Distribution de l'aire de l'écorégion {eco}",
#                       xaxis_title_text='Aire (1000 km²)',
#                       yaxis_title_text='Nombre'
#                       )
#     fig.show()

On constate qu'il y a des erreurs dans les surfaces des écorégions, avec des valeurs ici et là ridiculement faibles... Problèmes de conversion ? De saisie des données ?  
On décide d'appliquer le maximum de la surface d'une écorégion à tous ces enregistrements associés, sachant que cette valeur est normalement constante (ou tout du moins évolue très peu).

In [265]:
# Récupération du maximum de l'aire de chacune des écorégions de niveau 3 - tentative n°1
eco_areas_dict = fires_veg.groupby(['NA_L3NAME'])['EcoArea_km2'].max().round(0).to_dict()
eco_areas_dict

{'Acadian Plains and Hills': 44417.0,
 'Arizona/New Mexico Mountains': 82548.0,
 'Arizona/New Mexico Plateau': 146788.0,
 'Arkansas Valley': 28423.0,
 'Aspen Parkland/Northern Glaciated Plains': 134938.0,
 'Atlantic Coastal Pine Barrens': 9944.0,
 'Blue Mountains': 70907.0,
 'Blue Ridge': 40883.0,
 'Boston Mountains': 14165.0,
 'California Coastal Sage, Chaparral, and Oak Woodlands': 76656.0,
 'Canadian Rockies': 18832.0,
 'Cascades': 44848.0,
 'Central Appalachians': 61337.0,
 'Central Basin and Range': 308870.0,
 'Central California Valley': 46560.0,
 'Central Corn Belt Plains': 76575.0,
 'Central Great Plains': 275121.0,
 'Central Irregular Plains': 59824.0,
 'Chihuahuan Desert': 161400.0,
 'Chihuahuan Deserts': 2569.0,
 'Coast Range': 52192.0,
 'Colorado Plateaus': 136644.0,
 'Columbia Mountains/Northern Rockies': 82062.0,
 'Columbia Plateau': 81223.0,
 'Cross Timbers': 88165.0,
 'Driftless Area': 47376.0,
 'East Central Texas Plains': 55733.0,
 'Eastern Cascades Slopes and Foothil

En observant les clés du dictionnaire, on se rend compte qu'il y a une erreur de saisie pour le Chihuahuan (Desert, Deserts). On corrige.

In [266]:
# Correction du doublon de noms de l'écorégion Chihuahuan
fires_veg.loc[fires_veg['NA_L3NAME']=='Chihuahuan Deserts'] =\
    fires_veg.loc[fires_veg['NA_L3NAME']=='Chihuahuan Deserts']\
        .replace('Chihuahuan Deserts', 'Chihuahuan Desert') 

In [267]:
# Récupération du maximum de l'aire de chacune des écorégions de niveau 3 - tentative n°2
eco_areas_dict = fires_veg.groupby(['NA_L3NAME'])['EcoArea_km2'].max().round(0).to_dict()
eco_areas_dict

{'Acadian Plains and Hills': 44417.0,
 'Arizona/New Mexico Mountains': 82548.0,
 'Arizona/New Mexico Plateau': 146788.0,
 'Arkansas Valley': 28423.0,
 'Aspen Parkland/Northern Glaciated Plains': 134938.0,
 'Atlantic Coastal Pine Barrens': 9944.0,
 'Blue Mountains': 70907.0,
 'Blue Ridge': 40883.0,
 'Boston Mountains': 14165.0,
 'California Coastal Sage, Chaparral, and Oak Woodlands': 76656.0,
 'Canadian Rockies': 18832.0,
 'Cascades': 44848.0,
 'Central Appalachians': 61337.0,
 'Central Basin and Range': 308870.0,
 'Central California Valley': 46560.0,
 'Central Corn Belt Plains': 76575.0,
 'Central Great Plains': 275121.0,
 'Central Irregular Plains': 59824.0,
 'Chihuahuan Desert': 161400.0,
 'Coast Range': 52192.0,
 'Colorado Plateaus': 136644.0,
 'Columbia Mountains/Northern Rockies': 82062.0,
 'Columbia Plateau': 81223.0,
 'Cross Timbers': 88165.0,
 'Driftless Area': 47376.0,
 'East Central Texas Plains': 55733.0,
 'Eastern Cascades Slopes and Foothills': 53258.0,
 'Eastern Corn Be

In [268]:
fires_veg.columns

Index(['Wind', 'FPA_ID', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_YEAR',
       'fm', 'NBCD_countrywide_biomass_mosaic', 'GROUPVEG', 'NA_L3NAME',
       'NA_L1NAME', 'EcoArea_km2'],
      dtype='object')

In [269]:
# Avant correction
fires_veg[['NA_L3NAME','EcoArea_km2']].head(20)

Unnamed: 0,NA_L3NAME,EcoArea_km2
0,Blue Ridge,40883.224113
1,Middle Rockies,13954.952392
2,South Central Plains,151719.535367
3,South Central Plains,151719.535367
4,South Central Plains,151719.535367
5,South Central Plains,151719.535367
6,Southern Coastal Plain,139344.003107
7,Southern Coastal Plain,139344.003107
8,Southern Coastal Plain,139344.003107
9,Southern Coastal Plain,139344.003107


In [270]:
# Application de la correction
fires_veg['EcoArea_km2'] = fires_veg['NA_L3NAME'].map(eco_areas_dict)
fires_veg[['NA_L3NAME','EcoArea_km2']].head(20)

Unnamed: 0,NA_L3NAME,EcoArea_km2
0,Blue Ridge,40883.0
1,Middle Rockies,134151.0
2,South Central Plains,151720.0
3,South Central Plains,151720.0
4,South Central Plains,151720.0
5,South Central Plains,151720.0
6,Southern Coastal Plain,139344.0
7,Southern Coastal Plain,139344.0
8,Southern Coastal Plain,139344.0
9,Southern Coastal Plain,139344.0


La correction est faite.

# Colonne "NBCD_FIA_BIOMASS_MOSAIC"
L'indice de biomasse indique "la vie" sur une parcelle normalisée de terre. On analyse rapidement la distribution de cette colonne.

In [271]:
# # commenté car gourmand à l'affichage
# fig = px.box(fires_veg,
#              x='NBCD_countrywide_biomass_mosaic',
#              y='NA_L1NAME',
#              title = f"Distribution de l'indice de biomasse")
# fig.show()

In [272]:
ratio_null_values_NBCD = fires_veg.loc[fires_veg['NBCD_countrywide_biomass_mosaic'] == 0, 'NBCD_countrywide_biomass_mosaic'].count() / fires_veg.shape[0]
print(f"Ratio toutes classes confondues de valeurs nulles : {np.round(ratio_null_values_NBCD, 2) * 100} %")

Ratio toutes classes confondues de valeurs nulles : 22.0 %


Un quart des valeurs de l'indice de biomasse sont nulles. Deux solutions :   
1) on décide de changer toutes les valeurs nulles par une valeur de leur plus proche voisin non nulle ou bien par la médiane
2) on n'utilise pas cette colonne.

Par mesure de sécurité et souci de temps, on décide de ne pas utiliser cette colonne.

# Changement de type

On modifie le type de certaines colonnes d'object à category afin de gagner de l'espace mémoire.

In [273]:
# fires_merge.columns

In [274]:
fires_veg[['GROUPVEG', 'NA_L3NAME', 'NA_L1NAME']] = \
        fires_veg[['GROUPVEG', 'NA_L3NAME', 'NA_L1NAME']].astype('category')

In [275]:
# fires_merge.info(verbose=True, memory_usage=True, show_counts=True)

# Renommage de colonnes

In [276]:
# Colonnes présentes
fires_veg.columns

Index(['Wind', 'FPA_ID', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_YEAR',
       'fm', 'NBCD_countrywide_biomass_mosaic', 'GROUPVEG', 'NA_L3NAME',
       'NA_L1NAME', 'EcoArea_km2'],
      dtype='object')

In [277]:
# Renommage des colonnes
fires_veg= fires_veg.rename(
    {
        'fm':'FUEL_MOISTURE',
        'Wind':'WIND',
        'NBCD_countrywide_biomass_mosaic':'NBCD_FIA_BIOMASS_MOSAIC',
        'GROUPVEG':'VEGETATION',
        'NA_L3NAME':'ECO_REG_LVL3',
        'NA_L1NAME':'ECO_REG_LVL1',
        'EcoArea_km2':'ECO_AREA_KM2'
    }, 
    axis=1)

In [278]:
# Colonnes renommées
fires_veg.columns

Index(['WIND', 'FPA_ID', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_YEAR',
       'FUEL_MOISTURE', 'NBCD_FIA_BIOMASS_MOSAIC', 'VEGETATION',
       'ECO_REG_LVL3', 'ECO_REG_LVL1', 'ECO_AREA_KM2'],
      dtype='object')

# Comparaison rapide des FPA_ID dans les deux datasets avant jointure
Comme la variable FPA_ID va servir de clé de jointure, on analyse les différences de cette variable dans les deux datasets.

In [279]:
fires_fpa_set = set(fires['FPA_ID'])
fires_veg_fpa_set = set(fires_veg['FPA_ID'])

print(f"Nombres d'ID fonctionnels uniques dans le dataset fires : {len(fires_fpa_set)}")
print(f"Nombres d'ID fonctionnels au total dans le dataset fires : {fires.shape[0]}")
print(f"Nombres d'ID fonctionnels uniques dans le dataset fires_veg : {len(fires_veg_fpa_set)}")
print(f"Nombres d'ID fonctionnels au total dans le dataset fires_veg : {fires_veg.shape[0]}")
print(f"On constate qu'il y a {len(fires_fpa_set - fires_veg_fpa_set)} 'FPA_ID' uniques dans le dataset Kaggle qui ne sont pas présentes dans le dataset complémentaire.")
print(f"On constate qu'il y a {len(fires_veg_fpa_set - fires_fpa_set)} 'FPA_ID' uniques dans le dataset complémentaire qui ne sont pas présentes dans le dataset Kaggle.")

Nombres d'ID fonctionnels uniques dans le dataset fires : 1880462
Nombres d'ID fonctionnels au total dans le dataset fires : 1880462
Nombres d'ID fonctionnels uniques dans le dataset fires_veg : 1829532
Nombres d'ID fonctionnels au total dans le dataset fires_veg : 1829532
On constate qu'il y a 50930 'FPA_ID' uniques dans le dataset Kaggle qui ne sont pas présentes dans le dataset complémentaire.
On constate qu'il y a 0 'FPA_ID' uniques dans le dataset complémentaire qui ne sont pas présentes dans le dataset Kaggle.


# Des Etats manquants...
On pousse l'analyse un peu plus loin en comparant les Etats présents dans les deux datasets.

In [280]:
# Etats présents dans le dataset initial
fires_states_set = set(fires['STATE'].unique())

# Etats absents dans le dataset "végétation et météo"
print("Les Etats absents dans le dataset 'végétation et météo' :",fires_states_set - fires_veg_set)

Les Etats absents dans le dataset 'végétation et météo' : {'AK', 'PR', 'HI'}


In [281]:
# Distribution des classes de feu pour l'entièreté du dataset : 
print("Distribution des classes de feu :\n", fires['CLASS'].value_counts(), '\n')

# Distribution des classes de feu pour l'Alaska : 
print("Distribution des classes de feu pour l'Alaska :\n", fires.loc[fires['STATE'] == 'AK', 'CLASS'].value_counts())

Distribution des classes de feu :
 CLASS
B    939376
A    666917
C    220077
D     28427
E     14107
F      7785
G      3773
Name: count, dtype: int64 

Distribution des classes de feu pour l'Alaska :
 CLASS
A    6622
B    3386
C    1045
G     650
F     413
E     378
D     349
Name: count, dtype: int64


Cela pose question : prend-on le risque à cause de la jointure de ne pas considérer les records de l'Alaska, sachant que c'est un des Etats les plus touchés par les feux de grande classe (10 % de la classe G et 5 % de la classe E), ou se passe-t-on des variables "vent", "humidité de la végétation", "écorégion" en ne faisant pas la jointure ?  

On décide tenter la jointure des deux datasets. 

<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>Jointure des deux datasets</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>

In [282]:
print(f"Dimensions du dataset 'fire' : {fires.shape}")
print(f"Dimensions du dataset 'fire_veg' : {fires_veg.shape}")

Dimensions du dataset 'fire' : (1880462, 28)
Dimensions du dataset 'fire_veg' : (1829532, 11)


# Jointure

In [283]:
# Fusionner les deux datasets
fires_merge = fires.merge(fires_veg, how='inner', 
                          left_on=['FPA_ID', 'DISC_YEAR', 'DISC_DOY'], 
                          right_on=['FPA_ID', 'DISCOVERY_YEAR', 'DISCOVERY_DOY'], 
                          suffixes=('_kag', '_veg'), validate='1:1')

In [284]:
fires_merge.head()

Unnamed: 0,FOD_ID,FPA_ID,DISC_YEAR,DISC_DAYS,DISC_DOY,CAUSE_CODE,CAUSE_DESCR,CONT_DAYS,CONT_DOY,SIZE,CLASS,LAT,LON,OWNER_CODE,OWNER_DESCR,STATE,DUR_DAYS,DISC_DATE,CONT_DATE,CONT_YEAR,DISC_HOUR,DISC_MIN,CONT_HOUR,CONT_MIN,DISC_DATETIME,CONT_DATETIME,DUR_MIN,CAUSE_DESCR_HUMAN,WIND,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_YEAR,FUEL_MOISTURE,NBCD_FIA_BIOMASS_MOSAIC,VEGETATION,ECO_REG_LVL3,ECO_REG_LVL1,ECO_AREA_KM2
0,1,FS-1418826,2005,4781.0,33,9,Miscellaneous,4781.0,33.0,0.1,A,40.036944,-121.005833,5,USFS,CA,0.0,2005-02-02,2005-02-02,2005.0,13.0,0.0,17.0,30.0,2005-02-02 13:00:00,2005-02-02 17:30:00,270.0,1,5.009992,2005-02-02,33,2005,18.19767,775.890015,Hardwood,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS,53086.0
1,2,FS-1418827,2004,4515.0,133,1,Lightning,4515.0,133.0,0.25,A,38.933056,-120.404444,5,USFS,CA,0.0,2004-05-12,2004-05-12,2004.0,8.0,45.0,15.0,30.0,2004-05-12 08:45:00,2004-05-12 15:30:00,405.0,0,3.072036,2004-05-12,133,2004,11.998703,1147.319946,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS,53086.0
2,3,FS-1418835,2004,4534.0,152,5,Debris Burning,4534.0,152.0,0.1,A,38.984167,-120.735556,13,STATE OR PRIVATE,CA,0.0,2004-05-31,2004-05-31,2004.0,19.0,21.0,20.0,24.0,2004-05-31 19:21:00,2004-05-31 20:24:00,63.0,1,2.770343,2004-05-31,152,2004,11.299702,576.090027,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS,53086.0
3,4,FS-1418845,2004,4562.0,180,1,Lightning,4567.0,185.0,0.1,A,38.559167,-119.913333,5,USFS,CA,5.0,2004-06-28,2004-07-03,2004.0,16.0,0.0,14.0,0.0,2004-06-28 16:00:00,2004-07-03 14:00:00,7080.0,0,3.520761,2004-06-28,180,2004,9.437581,996.65979,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS,53086.0
4,5,FS-1418847,2004,4562.0,180,1,Lightning,4567.0,185.0,0.1,A,38.559167,-119.933056,5,USFS,CA,5.0,2004-06-28,2004-07-03,2004.0,16.0,0.0,12.0,0.0,2004-06-28 16:00:00,2004-07-03 12:00:00,6960.0,0,3.520761,2004-06-28,180,2004,9.437581,468.719971,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS,53086.0


In [285]:
fires_merge.info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1829532 entries, 0 to 1829531
Data columns (total 38 columns):
 #   Column                   Non-Null Count    Dtype         
---  ------                   --------------    -----         
 0   FOD_ID                   1829532 non-null  int64         
 1   FPA_ID                   1829532 non-null  object        
 2   DISC_YEAR                1829532 non-null  uint16        
 3   DISC_DAYS                1829532 non-null  float64       
 4   DISC_DOY                 1829532 non-null  uint16        
 5   CAUSE_CODE               1829532 non-null  uint8         
 6   CAUSE_DESCR              1829532 non-null  category      
 7   CONT_DAYS                976060 non-null   float64       
 8   CONT_DOY                 976060 non-null   float64       
 9   SIZE                     1829532 non-null  float64       
 10  CLASS                    1829532 non-null  category      
 11  LAT                      1829532 non-null  float64       
 12  

# Vérification de la jointure   
On contrôle la fusion des deux datasets.  
On vérifie déjà les dimensions du dataset : on s'attend à avoir le même nombre de lignes que le dataset fires_veg, soit 1829532.

In [286]:
# Vérification des dimensions
print(f"Shape du dataset fusionné : {fires_merge.shape}")

Shape du dataset fusionné : (1829532, 38)


On vérifie par une deuxième voie avec les dates de début de feu.

In [287]:
# Conversion de la date en datetime
fires_merge['DISCOVERY_DATE'] = pd.to_datetime(fires_merge['DISCOVERY_DATE'], format='%Y-%m-%d')
fires_merge['DISCOVERY_DATE'].info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.series.Series'>
RangeIndex: 1829532 entries, 0 to 1829531
Series name: DISCOVERY_DATE
Non-Null Count    Dtype         
--------------    -----         
1829532 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 14.0 MB


On constate qu'il y a 4 incohérences.

In [288]:
# Nombre de mismatches entre la date de début de feu venant du premier dataset et celle venant du deuxième dataset
(fires_merge['DISCOVERY_DATE'] != fires_merge['DISC_DATE']).sum()

4

In [289]:
# Constatation des erreurs
fires_merge_correc = fires_merge.loc[(fires_merge['DISCOVERY_DATE'] != fires_merge['DISC_DATE'])]\
                            [['DISC_DATE', 'DISC_YEAR', 'DISC_DOY', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_YEAR']]
fires_merge_correc.head()

Unnamed: 0,DISC_DATE,DISC_YEAR,DISC_DOY,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_YEAR
1133706,1993-07-19,1993,200,1994-07-19,200,1993
1133709,1993-07-24,1993,205,1994-07-24,205,1993
1133718,1993-03-26,1993,85,1994-03-26,85,1993
1134661,2010-10-10,2010,283,2009-10-10,283,2010


In [290]:
# Création des valeurs corrigées
fires_merge_correc['DISCOVERY_DATE_corr'] = fires_merge_correc.apply(\
        lambda row: str(row['DISCOVERY_YEAR']) + ' ' + str(row['DISCOVERY_DOY']), axis=1)

fires_merge_correc['DISCOVERY_DATE_corr'] = pd.to_datetime(fires_merge_correc['DISCOVERY_DATE_corr'], format='%Y %j')
fires_merge_correc.head()

Unnamed: 0,DISC_DATE,DISC_YEAR,DISC_DOY,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_YEAR,DISCOVERY_DATE_corr
1133706,1993-07-19,1993,200,1994-07-19,200,1993,1993-07-19
1133709,1993-07-24,1993,205,1994-07-24,205,1993,1993-07-24
1133718,1993-03-26,1993,85,1994-03-26,85,1993,1993-03-26
1134661,2010-10-10,2010,283,2009-10-10,283,2010,2010-10-10


In [291]:
# Application des valeurs corrigées
fires_merge.loc[(fires_merge['DISCOVERY_DATE'] != fires_merge['DISC_DATE']),'DISCOVERY_DATE'] = \
                                                                            fires_merge_correc.loc[:,'DISCOVERY_DATE_corr']

In [292]:
# Constatation de la correction
fires_merge.loc[(fires_merge['DISCOVERY_DATE'] != fires_merge['DISC_DATE'])]['DISCOVERY_DATE']

Series([], Name: DISCOVERY_DATE, dtype: datetime64[ns])

On constate qu'il n'y a plus d'incohérences.

In [293]:
fires_merge.iloc[[1133706, 1133709, 1133718, 1134661]][['DISC_DATE','DISCOVERY_DATE']]

Unnamed: 0,DISC_DATE,DISCOVERY_DATE
1133706,1993-07-19,1993-07-19
1133709,1993-07-24,1993-07-24
1133718,1993-03-26,1993-03-26
1134661,2010-10-10,2010-10-10


La correction est faite. 

In [294]:
# Réordonnancement "fonctionnel" des colonnes
fires_merge = fires_merge[[
    'CLASS', 'SIZE',
    'DISC_YEAR', 'DISC_DOY', 'DUR_MIN', 
    'CAUSE_DESCR', 'CAUSE_DESCR_HUMAN',
    'OWNER_DESCR',
    'LAT', 'LON', 
    'STATE',
    'FUEL_MOISTURE', 'WIND', 'NBCD_FIA_BIOMASS_MOSAIC', 'ECO_AREA_KM2',
    'VEGETATION', 'ECO_REG_LVL3', 'ECO_REG_LVL1'
]]

# Colonne indice de biomasse : deuxième analyse  
Maintenant que l'a fusionné les deux datasets et donc accès à la classe, on peut réétudier rapidement la colonne d'indice de biomasse.

In [295]:
ratio_null_values_NBCD = fires_merge.loc[fires_merge['NBCD_FIA_BIOMASS_MOSAIC'] == 0, 'CLASS'].value_counts() / fires_merge.shape[0] * 100
# print(f"Ratio toutes classes confondues de valeurs nulles : {np.round(ratio_null_values_NBCD, 2) * 100} %")
print(f"Pourcentage d'indice de biomasse égal à '0' : {ratio_null_values_NBCD.round(2)}")

Pourcentage d'indice de biomasse égal à '0' : CLASS
B    10.00
A     7.76
C     2.68
D     0.58
E     0.34
F     0.21
G     0.09
Name: count, dtype: float64


La majeure partie des valeurs "0" de l'indice de biomasse sont sur les petites classes. On peut supputer que vu le nombre de petits feux ainsi que leur faible durée, il n'est pas toujours possible de mesurer la valeur de l'indice de biomasse pour ce type de feux.
On a conservé cette colonne dans le premier run des modèles mais on décidé de l'éliminer ensuite du dataset sans outliers.

# Sélection des colonnes

In [296]:
# fires_merge.info(verbose=True, memory_usage=True, show_counts=True)

On décide de ne pas sélectionner certaines variables corrélées avec la variable cible (la taille du feu) ou avec d'autres variables :
- la surface de feu "SIZE"
- le compteur de jours "DISC_DAYS"
- la durée en jours "DUR_DAYS"
- les variables associées au containment du feu "CONT_XXX"  
- les colonnes ID : "FPA_ID" et "FOD_ID"
- les colonnes en doublon avec le deuxième dataset

In [297]:
# fires_merge.columns

In [298]:
# Sélection des colonnes
fires_merge = fires_merge[[
    #'FOD_ID', 'FPA_ID', 
    'DISC_YEAR', 'DISC_DOY', 
    #'DISC_MONTH', 'DISC_DATE', 'DISC_DAYS', 'DISC_HOUR', 'DISC_MIN', 
    #'DISC_DATETIME',
    #'CONT_YEAR', 'CONT_DOY', 'CONT_DATE', 'CONT_DAYS', 'CONT_HOUR', 'CONT_MIN', 
    #'CONT_DATETIME',    
    #'DUR_DAYS', 
    'DUR_MIN',
    #'CAUSE_CODE', 
    'CAUSE_DESCR', 'CAUSE_DESCR_HUMAN',
    #'SIZE', 
    'CLASS', 
    'LAT', 'LON', 'STATE', 
    #'OWNER_CODE', 
    'OWNER_DESCR', 
    'FUEL_MOISTURE', 'WIND', 
    #'NBCD_FIA_BIOMASS_MOSAIC', # écarté du fait du nombre non négligeable de valeurs '0' 
    'ECO_AREA_KM2',
    'VEGETATION', 'ECO_REG_LVL3', 'ECO_REG_LVL1'
    #'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_YEAR'
]]

In [299]:
# fires_merge.info(verbose=True, memory_usage=True, show_counts=True)

<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>Réduction du dataset</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>

# Gestion des valeurs manquantes
Dnas un premier temps, on décide de ne garder que les records avec les données de durée, vent et humidité du combustible non nulles.

In [300]:
fires_merge.isnull().sum()

DISC_YEAR                 0
DISC_DOY                  0
DUR_MIN              946531
CAUSE_DESCR               0
CAUSE_DESCR_HUMAN         0
CLASS                     0
LAT                       0
LON                       0
STATE                     0
OWNER_DESCR               0
FUEL_MOISTURE          3861
WIND                   3861
ECO_AREA_KM2              0
VEGETATION                0
ECO_REG_LVL3              0
ECO_REG_LVL1              0
dtype: int64

In [301]:
# Récupération des index des lignes avec au moins une valeur nulle
index_with_nan = fires_merge.index[fires_merge.isnull().any(axis=1)]
index_with_nan.shape

(948710,)

In [302]:
# Réduction du dataset aux lignes complètes
fires_merge_reduced = fires_merge.drop(index_with_nan).reset_index(drop=True)
fires_merge_reduced.shape

(880822, 16)

In [303]:
# Vérification de la présence de valeurs manquantes
fires_merge_reduced.isnull().sum()

DISC_YEAR            0
DISC_DOY             0
DUR_MIN              0
CAUSE_DESCR          0
CAUSE_DESCR_HUMAN    0
CLASS                0
LAT                  0
LON                  0
STATE                0
OWNER_DESCR          0
FUEL_MOISTURE        0
WIND                 0
ECO_AREA_KM2         0
VEGETATION           0
ECO_REG_LVL3         0
ECO_REG_LVL1         0
dtype: int64

# Gestion des outliers de durée
Vu les incohérences sur la durée, notamment des classes de petits feux, on décide de supprimer dans chaque classe les outliers relatifs à la durée.

Note : cette partie est commentée car nous nous sommes rendus compte du problème des outliers de durée après avoir fait tourner une première fois les modèles. Nous avons donc conservé cette erreur en l'état mais nous avons ensuite généré un nouveau dataset sans ces outliers afin de refaire tourner certains modèles et constaté une amélioration certaine dans les résultats.

In [304]:
# Médiane de la colonne durée pour chaque classe
fires_no_oaut_dur_median = fires_merge_reduced.groupby('CLASS', observed=False)['DUR_MIN'].median()

# Premier quartile de la colonne durée pour chaque classe
fires_no_out_dur_q1 = fires_merge_reduced.groupby('CLASS', observed=False)['DUR_MIN'].quantile(0.25)

# Troisième quartile de la colonne durée pour chaque classe
fires_no_out_dur_q3 = fires_merge_reduced.groupby('CLASS', observed=False)['DUR_MIN'].quantile(0.75)

# Ecart interquartile de la colonne durée pour chaque classe
fires_no_out_dur_iqr = fires_no_out_dur_q3 - fires_no_out_dur_q1

# Limite haute pour chaque classe
fires_no_out_dur_up = fires_no_out_dur_q3 + fires_no_out_dur_iqr * 1.5

# Limite basse pour chaque classe
fires_no_out_dur_bottom = fires_no_out_dur_q1 - fires_no_out_dur_iqr * 1.5

In [305]:
# Nouveau dataframe sans outliers dans chaque classe
fires_no_out_dur = pd.DataFrame()

for class_ in fires_merge_reduced['CLASS'].unique():
    data = fires_merge_reduced.loc[\
        (fires_merge_reduced['CLASS'] == class_) &\
        (fires_merge_reduced['DUR_MIN'] <= fires_no_out_dur_up[class_]) &\
        (fires_merge_reduced['DUR_MIN'] >= fires_no_out_dur_bottom[class_])]
    fires_no_out_dur = pd.concat([fires_no_out_dur, data])

print("Dataset initial", fires_merge_reduced.shape)
print("Dataset filtré", fires_no_out_dur.shape)

Dataset initial (880822, 16)
Dataset filtré (731659, 16)


In [306]:
print("Dataset avec outliers de 'durée' :")
fires_merge_reduced.groupby('CLASS', observed=False)['DUR_MIN'].agg(['min', 'median', 'mean', 'max']).style.format(precision=0)

Dataset avec outliers de 'durée' :


Unnamed: 0_level_0,min,median,mean,max
CLASS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0,69,1460,5260335
B,0,73,1168,5260410
C,0,179,1815,1578240
D,0,696,5639,2708760
E,0,1710,9769,1058535
F,0,4535,18341,298260
G,0,12995,35831,535095


In [307]:
print("Dataset sans outliers de 'durée' :")
fires_no_out_dur.groupby('CLASS', observed=False)['DUR_MIN'].agg(['min', 'median', 'mean', 'max']).style.format(precision=0)

Dataset sans outliers de 'durée' :


Unnamed: 0_level_0,min,median,mean,max
CLASS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0,45,92,645
B,0,60,79,395
C,0,135,179,1028
D,0,460,1206,6833
E,0,1440,2551,13680
F,0,3240,5885,32449
G,0,10564,23342,114170


On constate la diminution attendue du maximum et aussi et surtout de la moyenne.  
On remarque tout de même que la moyenne et le maximum de la classe A sont supérieurs à ceux de la classe B, ce qui est assez étrange. On penche pour l'hypothèse suivante : bien qu'on ait enlevé une grande partie des outliers, ceux-ci étaient suffisamment nombreux pour autoriser un écart interquartile suffisamment grand pour conserver encore des outliers "aberrants".  

ATTENTION : du fait de la boucle de construction, les classes sont regroupées dans ce dataset. Le shuffle est donc impératif dans le train_test_split !

In [308]:
print("Ordre de rangement des records de chaque classe :", fires_merge_reduced['CLASS'].unique())

Ordre de rangement des records de chaque classe : ['A', 'B', 'G', 'C', 'D', 'F', 'E']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']


In [309]:
# # Création de csv pour Power BI
# fires_no_out_dur.to_csv('data_Power_BI_reduced_no_outlier.csv', sep=';', encoding='utf-8', index_label='index')

In [310]:
# Réassignation du dataset pour la continuité du notebook ci-après
fires_merge_reduced = fires_no_out_dur

<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>Preprocessing</strong></span>
<span style="color:#a61c00;font-size:2em"><strong>============================================================================</strong></span>

# Encodage
On encode :
- les variables cycliques comme le mois de l'année
- les variables catégorielles comme la cause ou l'Etat  
  
L'encodage n'étant pas une transformation statistique, cela ne pose pas de problème de le faire avant la séparation du dataset en jeu d'entraînement et de test.

In [311]:
# fires.head()

In [312]:
# fires.info(verbose=True, memory_usage=True, show_counts=True)

In [313]:
# copie du dataset utilisé pour l'exploration avec réordonnancement
fires_model = fires_merge_reduced.copy()
fires_model.head()

Unnamed: 0,DISC_YEAR,DISC_DOY,DUR_MIN,CAUSE_DESCR,CAUSE_DESCR_HUMAN,CLASS,LAT,LON,STATE,OWNER_DESCR,FUEL_MOISTURE,WIND,ECO_AREA_KM2,VEGETATION,ECO_REG_LVL3,ECO_REG_LVL1
0,2005,33,270.0,Miscellaneous,1,A,40.036944,-121.005833,CA,USFS,18.19767,5.009992,53086.0,Hardwood,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS
1,2004,133,405.0,Lightning,0,A,38.933056,-120.404444,CA,USFS,11.998703,3.072036,53086.0,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS
2,2004,152,63.0,Debris Burning,1,A,38.984167,-120.735556,CA,STATE OR PRIVATE,11.299702,2.770343,53086.0,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS
12,2004,247,30.0,Miscellaneous,1,A,38.786667,-120.193333,CA,USFS,8.4216,2.639286,53086.0,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS
14,2004,277,510.0,Lightning,0,A,38.675833,-120.279722,CA,USFS,11.59191,2.565227,53086.0,Conifer,Sierra Nevada,NORTHWESTERN FORESTED MOUNTAINS


In [314]:
fires_model.describe()

Unnamed: 0,DISC_YEAR,DISC_DOY,DUR_MIN,CAUSE_DESCR_HUMAN,LAT,LON,FUEL_MOISTURE,WIND,ECO_AREA_KM2
count,731659.0,731659.0,731659.0,731659.0,731659.0,731659.0,731659.0,731659.0,731659.0
mean,2004.201803,166.700854,246.458998,0.79428,37.840666,-96.731306,13.581445,3.85989,140317.553055
std,7.301798,86.837498,2202.359836,0.404227,5.319465,15.487097,3.58936,0.754268,105187.981548
min,1992.0,1.0,0.0,0.0,25.1215,-124.71048,2.597198,1.001655,9944.0
25%,1998.0,95.0,21.0,1.0,33.503309,-111.497274,11.074395,3.301242,53258.0
50%,2005.0,168.0,60.0,1.0,36.651667,-94.238454,14.543201,3.839501,118406.0
75%,2011.0,229.0,138.0,1.0,42.399721,-83.249896,16.309674,4.368423,166116.0
max,2015.0,366.0,114170.0,1.0,48.9955,-67.0661,26.11673,9.227564,357668.0


In [315]:
fires_model.info(verbose=True, memory_usage=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 731659 entries, 0 to 880729
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   DISC_YEAR          731659 non-null  uint16  
 1   DISC_DOY           731659 non-null  uint16  
 2   DUR_MIN            731659 non-null  float64 
 3   CAUSE_DESCR        731659 non-null  category
 4   CAUSE_DESCR_HUMAN  731659 non-null  int64   
 5   CLASS              731659 non-null  category
 6   LAT                731659 non-null  float64 
 7   LON                731659 non-null  float64 
 8   STATE              731659 non-null  category
 9   OWNER_DESCR        731659 non-null  category
 10  FUEL_MOISTURE      731659 non-null  float64 
 11  WIND               731659 non-null  float64 
 12  ECO_AREA_KM2       731659 non-null  float64 
 13  VEGETATION         731659 non-null  category
 14  ECO_REG_LVL3       731659 non-null  category
 15  ECO_REG_LVL1       731659 non-null  cat

## Encodage des valeurs cycliques : mois, jour de l'année, latitude, longitude

Certaines variables sont périodiques : le 365ème jour de l'année est juste avant le 1er de l'année d'après. Pour respecter ce cycle, on encode ce type de variable en se basant sur un cercle trigonométrique, ce qui conduit à la création de deux colonnes (COS et SIN).

### Jour de l'année

In [316]:
fires_model.loc[:, 'DISC_DOY_SIN'] = fires_model.loc[:, ['DISC_DOY', 'DISC_YEAR']].apply(\
        lambda row : np.sin(2 * np.pi * row['DISC_DOY'] / 366) if calendar.isleap(row['DISC_YEAR']) \
                    else np.sin(2 * np.pi * row['DISC_DOY'] / 365), axis=1)
fires_model.loc[:, 'DISC_DOY_COS'] = fires_model.loc[:, ['DISC_DOY', 'DISC_YEAR']].apply(\
        lambda row : np.cos(2 * np.pi * row['DISC_DOY'] / 366) if calendar.isleap(row['DISC_YEAR']) \
                    else np.cos(2 * np.pi * row['DISC_DOY'] / 365), axis=1)

### Longitude et latitude

In [317]:
fires_model.loc[:, 'LAT_SIN'] = fires_model.loc[:, 'LAT'].apply(lambda h : np.sin(2 * np.pi * h / 360))
fires_model.loc[:, 'LAT_COS'] = fires_model.loc[:, 'LAT'].apply(lambda h : np.cos(2 * np.pi * h / 360))
fires_model.loc[:, 'LON_SIN'] = fires_model.loc[:, 'LON'].apply(lambda h : np.sin(2 * np.pi * h / 360))
fires_model.loc[:, 'LON_COS'] = fires_model.loc[:, 'LON'].apply(lambda h : np.cos(2 * np.pi * h / 360))

### Suppression des colonnes encodées

In [318]:
fires_model.drop(['DISC_DOY', 'LAT', 'LON'], axis=1, inplace=True)

In [319]:
# fires_model.info(verbose=True, memory_usage=True, show_counts=True)

In [320]:
# Création d'une copie intermédiaire
fires_model_orig = fires_model.copy()

In [321]:
# Réordonnancement des colonnes
# fires_model = fires_model_orig
fires_model = fires_model[[
    'CLASS',
    'DISC_YEAR', 'DISC_DOY_COS', 'DISC_DOY_SIN', 'DUR_MIN', 
    'CAUSE_DESCR', 'CAUSE_DESCR_HUMAN',
    'OWNER_DESCR',
    'LAT_COS', 'LAT_SIN', 'LON_COS','LON_SIN',
    'STATE',
    'FUEL_MOISTURE', 'WIND', 
    'ECO_AREA_KM2',
    'VEGETATION', 
    'ECO_REG_LVL1', 
    'ECO_REG_LVL3'
]]

## Encodage des variables catégorielles

In [322]:
fires_model.columns

Index(['CLASS', 'DISC_YEAR', 'DISC_DOY_COS', 'DISC_DOY_SIN', 'DUR_MIN',
       'CAUSE_DESCR', 'CAUSE_DESCR_HUMAN', 'OWNER_DESCR', 'LAT_COS', 'LAT_SIN',
       'LON_COS', 'LON_SIN', 'STATE', 'FUEL_MOISTURE', 'WIND', 'ECO_AREA_KM2',
       'VEGETATION', 'ECO_REG_LVL1', 'ECO_REG_LVL3'],
      dtype='object')

### Variables indépendantes

In [323]:
fires_model[['CAUSE_DESCR','OWNER_DESCR','STATE','VEGETATION','ECO_REG_LVL3','ECO_REG_LVL1']].nunique()

CAUSE_DESCR     13
OWNER_DESCR     16
STATE           49
VEGETATION      12
ECO_REG_LVL3    84
ECO_REG_LVL1    10
dtype: int64

In [324]:
# Colonnes à encoder
cat = [
    'CAUSE_DESCR',
    'OWNER_DESCR', 
    'STATE', 
    'VEGETATION', 
    'ECO_REG_LVL3',
    'ECO_REG_LVL1'
]

# Instanciation du One Hot Encoder 
# Note : notre mentor nous a demandé de ne pas activer le drop 'First'
ohe = OneHotEncoder(sparse_output=True, handle_unknown='ignore')

# Encodage
ohe_arr = ohe.fit_transform(fires_model[cat]).toarray()

# Repérer les classes les moins représentatives avec le describe plus bas
ohe_df = pd.DataFrame(ohe_arr, columns=ohe.get_feature_names_out(cat))

In [325]:
# Attention au merge : les index doivent correspondre, donc nécessité de reset_index() lors de la suppression des lignes sans durée, plus haut
fires_model_enc = pd.merge(fires_model, ohe_df, left_index=True, right_index=True).drop(cat, axis=1)
fires_model_enc.head(10)

Unnamed: 0,CLASS,DISC_YEAR,DISC_DOY_COS,DISC_DOY_SIN,DUR_MIN,CAUSE_DESCR_HUMAN,LAT_COS,LAT_SIN,LON_COS,LON_SIN,FUEL_MOISTURE,WIND,ECO_AREA_KM2,CAUSE_DESCR_Arson,CAUSE_DESCR_Campfire,CAUSE_DESCR_Children,CAUSE_DESCR_Debris Burning,CAUSE_DESCR_Equipment Use,CAUSE_DESCR_Fireworks,CAUSE_DESCR_Lightning,CAUSE_DESCR_Miscellaneous,CAUSE_DESCR_Missing/Undefined,CAUSE_DESCR_Powerline,CAUSE_DESCR_Railroad,CAUSE_DESCR_Smoking,CAUSE_DESCR_Structure,OWNER_DESCR_BIA,OWNER_DESCR_BLM,OWNER_DESCR_BOR,OWNER_DESCR_COUNTY,OWNER_DESCR_FOREIGN,OWNER_DESCR_FWS,OWNER_DESCR_MISSING/NOT SPECIFIED,OWNER_DESCR_MUNICIPAL/LOCAL,OWNER_DESCR_NPS,OWNER_DESCR_OTHER FEDERAL,OWNER_DESCR_PRIVATE,OWNER_DESCR_STATE,OWNER_DESCR_STATE OR PRIVATE,OWNER_DESCR_TRIBAL,OWNER_DESCR_UNDEFINED FEDERAL,OWNER_DESCR_USFS,STATE_AL,STATE_AR,STATE_AZ,STATE_CA,STATE_CO,STATE_CT,STATE_DC,STATE_DE,STATE_FL,STATE_GA,STATE_IA,STATE_ID,STATE_IL,STATE_IN,STATE_KS,STATE_KY,STATE_LA,STATE_MA,STATE_MD,STATE_ME,STATE_MI,STATE_MN,STATE_MO,STATE_MS,STATE_MT,STATE_NC,STATE_ND,STATE_NE,STATE_NH,STATE_NJ,STATE_NM,STATE_NV,STATE_NY,STATE_OH,STATE_OK,STATE_OR,STATE_PA,STATE_RI,STATE_SC,STATE_SD,STATE_TN,STATE_TX,STATE_UT,STATE_VA,STATE_VT,STATE_WA,STATE_WI,STATE_WV,STATE_WY,VEGETATION_Barren-Rock/Sand/Clay,VEGETATION_Conifer,VEGETATION_Grassland,VEGETATION_Hardwood,VEGETATION_Hardwood-Conifer,VEGETATION_No Data,VEGETATION_Open Water,VEGETATION_PerennialIce/Snow,VEGETATION_Riparian,VEGETATION_Savanna,VEGETATION_Shrubland,VEGETATION_Sparse,ECO_REG_LVL3_Acadian Plains and Hills,ECO_REG_LVL3_Arizona/New Mexico Mountains,ECO_REG_LVL3_Arizona/New Mexico Plateau,ECO_REG_LVL3_Arkansas Valley,ECO_REG_LVL3_Aspen Parkland/Northern Glaciated Plains,ECO_REG_LVL3_Atlantic Coastal Pine Barrens,ECO_REG_LVL3_Blue Mountains,ECO_REG_LVL3_Blue Ridge,ECO_REG_LVL3_Boston Mountains,"ECO_REG_LVL3_California Coastal Sage, Chaparral, and Oak Woodlands",ECO_REG_LVL3_Canadian Rockies,ECO_REG_LVL3_Cascades,ECO_REG_LVL3_Central Appalachians,ECO_REG_LVL3_Central Basin and Range,ECO_REG_LVL3_Central California Valley,ECO_REG_LVL3_Central Corn Belt Plains,ECO_REG_LVL3_Central Great Plains,ECO_REG_LVL3_Central Irregular Plains,ECO_REG_LVL3_Chihuahuan Desert,ECO_REG_LVL3_Coast Range,ECO_REG_LVL3_Colorado Plateaus,ECO_REG_LVL3_Columbia Mountains/Northern Rockies,ECO_REG_LVL3_Columbia Plateau,ECO_REG_LVL3_Cross Timbers,ECO_REG_LVL3_Driftless Area,ECO_REG_LVL3_East Central Texas Plains,ECO_REG_LVL3_Eastern Cascades Slopes and Foothills,ECO_REG_LVL3_Eastern Corn Belt Plains,ECO_REG_LVL3_Eastern Great Lakes Lowlands,ECO_REG_LVL3_Edwards Plateau,ECO_REG_LVL3_Erie Drift Plain,ECO_REG_LVL3_Flint Hills,ECO_REG_LVL3_High Plains,ECO_REG_LVL3_Huron/Erie Lake Plains,ECO_REG_LVL3_Idaho Batholith,ECO_REG_LVL3_Interior Plateau,ECO_REG_LVL3_Interior River Valleys and Hills,ECO_REG_LVL3_Klamath Mountains,ECO_REG_LVL3_Lake Manitoba and Lake Agassiz Plain,ECO_REG_LVL3_Madrean Archipelago,ECO_REG_LVL3_Middle Atlantic Coastal Plain,ECO_REG_LVL3_Middle Rockies,ECO_REG_LVL3_Mississippi Alluvial Plain,ECO_REG_LVL3_Mississippi Valley Loess Plains,ECO_REG_LVL3_Mojave Basin and Range,ECO_REG_LVL3_Nebraska Sand Hills,ECO_REG_LVL3_North Cascades,ECO_REG_LVL3_North Central Appalachians,ECO_REG_LVL3_North Central Hardwood Forests,ECO_REG_LVL3_Northeastern Coastal Zone,ECO_REG_LVL3_Northern Allegheny Plateau,ECO_REG_LVL3_Northern Appalachian and Atlantic Maritime Highlands,ECO_REG_LVL3_Northern Basin and Range,ECO_REG_LVL3_Northern Lakes and Forests,ECO_REG_LVL3_Northern Minnesota Wetlands,ECO_REG_LVL3_Northern Piedmont,ECO_REG_LVL3_Northwestern Glaciated Plains,ECO_REG_LVL3_Northwestern Great Plains,ECO_REG_LVL3_Ouachita Mountains,ECO_REG_LVL3_Ozark Highlands,ECO_REG_LVL3_Piedmont,ECO_REG_LVL3_Ridge and Valley,ECO_REG_LVL3_Sierra Nevada,ECO_REG_LVL3_Snake River Plain,ECO_REG_LVL3_Sonoran Desert,ECO_REG_LVL3_South Central Plains,ECO_REG_LVL3_Southeastern Plains,ECO_REG_LVL3_Southeastern Wisconsin Till Plains,ECO_REG_LVL3_Southern Coastal Plain,ECO_REG_LVL3_Southern Florida Coastal Plain,ECO_REG_LVL3_Southern Michigan/Northern Indiana Drift Plains,ECO_REG_LVL3_Southern Rockies,ECO_REG_LVL3_Southern Texas Plains/Interior Plains and Hills with Xerophytic Shrub and Oak Forest,ECO_REG_LVL3_Southern and Baja California Pine-Oak Mountains,ECO_REG_LVL3_Southwestern Appalachians,ECO_REG_LVL3_Southwestern Tablelands,ECO_REG_LVL3_Strait of Georgia/Puget Lowland,ECO_REG_LVL3_Texas Blackland Prairies,ECO_REG_LVL3_Wasatch and Uinta Mountains,ECO_REG_LVL3_Western Allegheny Plateau,ECO_REG_LVL3_Western Corn Belt Plains,ECO_REG_LVL3_Western Gulf Coastal Plain,ECO_REG_LVL3_Willamette Valley,ECO_REG_LVL3_Wyoming Basin,ECO_REG_LVL1_EASTERN TEMPERATE FORESTS,ECO_REG_LVL1_GREAT PLAINS,ECO_REG_LVL1_MARINE WEST COAST FOREST,ECO_REG_LVL1_MEDITERRANEAN CALIFORNIA,ECO_REG_LVL1_NORTH AMERICAN DESERTS,ECO_REG_LVL1_NORTHERN FORESTS,ECO_REG_LVL1_NORTHWESTERN FORESTED MOUNTAINS,ECO_REG_LVL1_SOUTHERN SEMI-ARID HIGHLANDS,ECO_REG_LVL1_TEMPERATE SIERRAS,ECO_REG_LVL1_TROPICAL WET FORESTS
0,A,2005,0.842942,0.538005,270.0,1,0.76563,0.643281,-0.515125,-0.857115,18.19767,5.009992,53086.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,A,2004,-0.65368,0.756771,405.0,0,0.777881,0.628412,-0.506101,-0.862474,11.998703,3.072036,53086.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,A,2004,-0.861702,0.507415,63.0,1,0.77732,0.629106,-0.511076,-0.859535,11.299702,2.770343,53086.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
12,A,2004,-0.454755,-0.890617,30.0,1,0.779484,0.626422,-0.502919,-0.864333,8.4216,2.639286,53086.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
14,A,2004,0.042905,-0.999079,510.0,0,0.780694,0.624913,-0.504222,-0.863574,11.59191,2.565227,53086.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
15,A,2004,0.042905,-0.999079,270.0,0,0.78191,0.623391,-0.508173,-0.861255,11.79862,2.425683,53086.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,A,2004,-0.894487,0.447094,210.0,0,0.834455,0.551076,-0.270946,-0.962595,8.193272,4.656804,82548.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
20,A,2004,-0.978856,0.204552,250.0,0,0.835738,0.549128,-0.269401,-0.963028,7.975851,4.28688,82548.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
27,A,2004,-0.824855,-0.565345,510.0,0,0.835155,0.550015,-0.26736,-0.963597,14.55405,3.188987,82548.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
28,A,2004,0.229688,0.973264,275.0,1,0.834989,0.550266,-0.269569,-0.962981,10.097295,5.658385,82548.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [326]:
# fires_model_enc.info(verbose=True, memory_usage=True, show_counts=True)

In [327]:
# Statistiques
fires_model_enc.describe()

Unnamed: 0,DISC_YEAR,DISC_DOY_COS,DISC_DOY_SIN,DUR_MIN,CAUSE_DESCR_HUMAN,LAT_COS,LAT_SIN,LON_COS,LON_SIN,FUEL_MOISTURE,WIND,ECO_AREA_KM2,CAUSE_DESCR_Arson,CAUSE_DESCR_Campfire,CAUSE_DESCR_Children,CAUSE_DESCR_Debris Burning,CAUSE_DESCR_Equipment Use,CAUSE_DESCR_Fireworks,CAUSE_DESCR_Lightning,CAUSE_DESCR_Miscellaneous,CAUSE_DESCR_Missing/Undefined,CAUSE_DESCR_Powerline,CAUSE_DESCR_Railroad,CAUSE_DESCR_Smoking,CAUSE_DESCR_Structure,OWNER_DESCR_BIA,OWNER_DESCR_BLM,OWNER_DESCR_BOR,OWNER_DESCR_COUNTY,OWNER_DESCR_FOREIGN,OWNER_DESCR_FWS,OWNER_DESCR_MISSING/NOT SPECIFIED,OWNER_DESCR_MUNICIPAL/LOCAL,OWNER_DESCR_NPS,OWNER_DESCR_OTHER FEDERAL,OWNER_DESCR_PRIVATE,OWNER_DESCR_STATE,OWNER_DESCR_STATE OR PRIVATE,OWNER_DESCR_TRIBAL,OWNER_DESCR_UNDEFINED FEDERAL,OWNER_DESCR_USFS,STATE_AL,STATE_AR,STATE_AZ,STATE_CA,STATE_CO,STATE_CT,STATE_DC,STATE_DE,STATE_FL,STATE_GA,STATE_IA,STATE_ID,STATE_IL,STATE_IN,STATE_KS,STATE_KY,STATE_LA,STATE_MA,STATE_MD,STATE_ME,STATE_MI,STATE_MN,STATE_MO,STATE_MS,STATE_MT,STATE_NC,STATE_ND,STATE_NE,STATE_NH,STATE_NJ,STATE_NM,STATE_NV,STATE_NY,STATE_OH,STATE_OK,STATE_OR,STATE_PA,STATE_RI,STATE_SC,STATE_SD,STATE_TN,STATE_TX,STATE_UT,STATE_VA,STATE_VT,STATE_WA,STATE_WI,STATE_WV,STATE_WY,VEGETATION_Barren-Rock/Sand/Clay,VEGETATION_Conifer,VEGETATION_Grassland,VEGETATION_Hardwood,VEGETATION_Hardwood-Conifer,VEGETATION_No Data,VEGETATION_Open Water,VEGETATION_PerennialIce/Snow,VEGETATION_Riparian,VEGETATION_Savanna,VEGETATION_Shrubland,VEGETATION_Sparse,ECO_REG_LVL3_Acadian Plains and Hills,ECO_REG_LVL3_Arizona/New Mexico Mountains,ECO_REG_LVL3_Arizona/New Mexico Plateau,ECO_REG_LVL3_Arkansas Valley,ECO_REG_LVL3_Aspen Parkland/Northern Glaciated Plains,ECO_REG_LVL3_Atlantic Coastal Pine Barrens,ECO_REG_LVL3_Blue Mountains,ECO_REG_LVL3_Blue Ridge,ECO_REG_LVL3_Boston Mountains,"ECO_REG_LVL3_California Coastal Sage, Chaparral, and Oak Woodlands",ECO_REG_LVL3_Canadian Rockies,ECO_REG_LVL3_Cascades,ECO_REG_LVL3_Central Appalachians,ECO_REG_LVL3_Central Basin and Range,ECO_REG_LVL3_Central California Valley,ECO_REG_LVL3_Central Corn Belt Plains,ECO_REG_LVL3_Central Great Plains,ECO_REG_LVL3_Central Irregular Plains,ECO_REG_LVL3_Chihuahuan Desert,ECO_REG_LVL3_Coast Range,ECO_REG_LVL3_Colorado Plateaus,ECO_REG_LVL3_Columbia Mountains/Northern Rockies,ECO_REG_LVL3_Columbia Plateau,ECO_REG_LVL3_Cross Timbers,ECO_REG_LVL3_Driftless Area,ECO_REG_LVL3_East Central Texas Plains,ECO_REG_LVL3_Eastern Cascades Slopes and Foothills,ECO_REG_LVL3_Eastern Corn Belt Plains,ECO_REG_LVL3_Eastern Great Lakes Lowlands,ECO_REG_LVL3_Edwards Plateau,ECO_REG_LVL3_Erie Drift Plain,ECO_REG_LVL3_Flint Hills,ECO_REG_LVL3_High Plains,ECO_REG_LVL3_Huron/Erie Lake Plains,ECO_REG_LVL3_Idaho Batholith,ECO_REG_LVL3_Interior Plateau,ECO_REG_LVL3_Interior River Valleys and Hills,ECO_REG_LVL3_Klamath Mountains,ECO_REG_LVL3_Lake Manitoba and Lake Agassiz Plain,ECO_REG_LVL3_Madrean Archipelago,ECO_REG_LVL3_Middle Atlantic Coastal Plain,ECO_REG_LVL3_Middle Rockies,ECO_REG_LVL3_Mississippi Alluvial Plain,ECO_REG_LVL3_Mississippi Valley Loess Plains,ECO_REG_LVL3_Mojave Basin and Range,ECO_REG_LVL3_Nebraska Sand Hills,ECO_REG_LVL3_North Cascades,ECO_REG_LVL3_North Central Appalachians,ECO_REG_LVL3_North Central Hardwood Forests,ECO_REG_LVL3_Northeastern Coastal Zone,ECO_REG_LVL3_Northern Allegheny Plateau,ECO_REG_LVL3_Northern Appalachian and Atlantic Maritime Highlands,ECO_REG_LVL3_Northern Basin and Range,ECO_REG_LVL3_Northern Lakes and Forests,ECO_REG_LVL3_Northern Minnesota Wetlands,ECO_REG_LVL3_Northern Piedmont,ECO_REG_LVL3_Northwestern Glaciated Plains,ECO_REG_LVL3_Northwestern Great Plains,ECO_REG_LVL3_Ouachita Mountains,ECO_REG_LVL3_Ozark Highlands,ECO_REG_LVL3_Piedmont,ECO_REG_LVL3_Ridge and Valley,ECO_REG_LVL3_Sierra Nevada,ECO_REG_LVL3_Snake River Plain,ECO_REG_LVL3_Sonoran Desert,ECO_REG_LVL3_South Central Plains,ECO_REG_LVL3_Southeastern Plains,ECO_REG_LVL3_Southeastern Wisconsin Till Plains,ECO_REG_LVL3_Southern Coastal Plain,ECO_REG_LVL3_Southern Florida Coastal Plain,ECO_REG_LVL3_Southern Michigan/Northern Indiana Drift Plains,ECO_REG_LVL3_Southern Rockies,ECO_REG_LVL3_Southern Texas Plains/Interior Plains and Hills with Xerophytic Shrub and Oak Forest,ECO_REG_LVL3_Southern and Baja California Pine-Oak Mountains,ECO_REG_LVL3_Southwestern Appalachians,ECO_REG_LVL3_Southwestern Tablelands,ECO_REG_LVL3_Strait of Georgia/Puget Lowland,ECO_REG_LVL3_Texas Blackland Prairies,ECO_REG_LVL3_Wasatch and Uinta Mountains,ECO_REG_LVL3_Western Allegheny Plateau,ECO_REG_LVL3_Western Corn Belt Plains,ECO_REG_LVL3_Western Gulf Coastal Plain,ECO_REG_LVL3_Willamette Valley,ECO_REG_LVL3_Wyoming Basin,ECO_REG_LVL1_EASTERN TEMPERATE FORESTS,ECO_REG_LVL1_GREAT PLAINS,ECO_REG_LVL1_MARINE WEST COAST FOREST,ECO_REG_LVL1_MEDITERRANEAN CALIFORNIA,ECO_REG_LVL1_NORTH AMERICAN DESERTS,ECO_REG_LVL1_NORTHERN FORESTS,ECO_REG_LVL1_NORTHWESTERN FORESTED MOUNTAINS,ECO_REG_LVL1_SOUTHERN SEMI-ARID HIGHLANDS,ECO_REG_LVL1_TEMPERATE SIERRAS,ECO_REG_LVL1_TROPICAL WET FORESTS
count,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0,606664.0
mean,2002.19569,-0.24069,0.104833,254.693593,0.787518,0.78594,0.611129,-0.114179,-0.957658,13.509992,3.854147,141516.61349,0.179315,0.052998,0.034095,0.23451,0.061637,0.012938,0.147706,0.181377,0.04438,0.010086,0.009999,0.027795,0.003163,0.125358,0.050746,0.000257,0.00074,1.3e-05,0.008893,0.307951,0.002771,0.010823,0.00584,0.290114,0.023766,0.033867,0.010446,0.001637,0.126777,0.030903,0.011977,0.058355,0.092537,0.021747,0.0003,7.9e-05,4.9e-05,0.025546,0.136285,0.000486,0.020448,0.002192,0.001586,0.007724,0.015488,0.00373,0.000391,0.001213,0.005146,0.007208,0.024505,0.008481,0.061367,0.030229,0.035731,0.019515,0.007314,7.6e-05,0.00057,0.022291,0.016363,0.092898,0.001729,0.039819,0.031408,0.006602,0.000466,0.015991,0.024607,0.013543,0.017123,0.016947,0.001045,4.3e-05,0.013505,0.019574,0.024043,0.010823,0.00436,0.257996,0.091553,0.252158,0.103121,5e-06,0.012422,1.2e-05,0.149656,0.002553,0.122912,0.003252,0.003775,0.031057,0.01032,0.005988,0.014784,0.010306,0.011499,0.014321,0.005102,0.027102,0.001162,0.010759,0.020965,0.01918,0.008171,0.001055,0.008004,0.004647,0.003588,0.003832,0.015979,0.01347,0.00583,0.005232,0.002159,0.001002,0.011389,4.8e-05,0.026694,0.00049,0.001363,0.001355,0.009354,0.000213,0.005871,0.008895,0.002034,0.014852,0.000953,0.004719,0.012755,0.010695,0.001025,0.0132,0.008243,0.001022,0.002359,0.002527,0.011456,0.024633,0.013327,0.010368,0.007065,0.028676,0.006058,0.006499,0.009036,0.032952,0.013821,0.018185,0.05286,0.025469,0.018193,0.006783,0.025696,0.019551,0.149753,0.001343,0.045674,0.003214,0.000849,0.013367,0.000468,0.009633,0.009188,0.003654,0.000402,0.000336,0.004472,0.01397,0.00402,0.002801,0.000839,0.00604,0.537479,0.099109,0.005074,0.044906,0.108726,0.047629,0.118087,0.004719,0.031057,0.003214
std,6.36457,0.643129,0.71935,2213.956468,0.409064,0.058921,0.07313,0.259945,0.047773,3.645129,0.750199,106142.76677,0.383616,0.22403,0.181472,0.423693,0.240495,0.113007,0.354809,0.385331,0.205939,0.099923,0.099494,0.164384,0.056153,0.331124,0.21948,0.016034,0.027195,0.003631,0.093882,0.461647,0.052566,0.10347,0.076197,0.453815,0.15232,0.180887,0.101669,0.040425,0.332723,0.173056,0.108782,0.234414,0.289783,0.145856,0.017318,0.008895,0.007032,0.157777,0.343091,0.022046,0.141527,0.046771,0.03979,0.087547,0.123483,0.060962,0.019761,0.03481,0.071552,0.084595,0.154609,0.0917,0.240002,0.171218,0.18562,0.138326,0.085207,0.008707,0.023875,0.147628,0.126868,0.29029,0.041547,0.195535,0.174417,0.080982,0.021593,0.125439,0.154923,0.115583,0.12973,0.129072,0.032311,0.006546,0.115424,0.138532,0.153183,0.10347,0.065886,0.437532,0.288394,0.434252,0.304118,0.002224,0.11076,0.003397,0.356734,0.050466,0.328336,0.056935,0.061323,0.173471,0.101064,0.077153,0.120688,0.100992,0.106615,0.11881,0.071244,0.162382,0.03407,0.103165,0.143269,0.137158,0.090023,0.032463,0.089109,0.068008,0.059796,0.061788,0.125395,0.115278,0.076133,0.072142,0.046419,0.031642,0.106108,0.006914,0.161186,0.022121,0.036896,0.036785,0.096265,0.014581,0.0764,0.093891,0.045055,0.120959,0.030852,0.068535,0.112216,0.10286,0.032004,0.114131,0.090418,0.031952,0.04851,0.050205,0.106418,0.155004,0.114671,0.101295,0.083755,0.166896,0.077595,0.080357,0.094629,0.178512,0.11675,0.133619,0.223753,0.157544,0.133649,0.082079,0.158228,0.138452,0.35683,0.036628,0.208778,0.056604,0.029124,0.114839,0.021631,0.097674,0.095413,0.060341,0.020051,0.018334,0.066723,0.117366,0.063279,0.052846,0.028954,0.07748,0.498594,0.298809,0.071049,0.207098,0.311295,0.212981,0.322711,0.068535,0.173471,0.056604
min,1992.0,-1.0,-0.999991,0.0,0.0,0.656118,0.424539,-0.56943,-1.0,2.597198,1.001655,9944.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1996.0,-0.852864,-0.607058,20.0,1.0,0.735078,0.550481,-0.365696,-0.995237,10.82384,3.296341,53258.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2002.0,-0.365723,0.136906,61.0,1.0,0.804549,0.593887,-0.078062,-0.98122,14.534752,3.829672,118406.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,2008.0,0.312281,0.857315,145.0,1.0,0.834848,0.677983,0.114104,-0.930524,16.308207,4.354783,166116.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2012.0,1.0,0.999991,114170.0,1.0,0.90541,0.754658,0.389669,-0.82204,26.11673,8.631218,357668.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [328]:
# # Les catégories vues par le One Hot Encoder
# print("Les catégories vues par le One Hot Encoder :\n", ohe.categories_)

# # Les catégories retenues par le One Hot Encoder
# print("\nLes catégories retenues par le One Hot Encoder :\n", ohe.get_feature_names_out(cat))

# # Les index des catégories droppées par le One Hot Encoder
# print("\nLes index des catégories droppées par le One Hot Encoder :\n",  ohe.drop_idx_)

In [329]:
fires_model_enc['BIG_CLASSES'] = fires_model_enc['CLASS'].apply(lambda x: 0 if x in ['A','B'] else 1 if x in ['C','D','E'] else 2)

### Variable cible  
Certaines méthodes de feature selection ou de sampling nécessitent une variable cible numérique. On décide donc d'encoder de manière ordinale la classe de feu, puisqu'elle comporte un ordre naturel, du plus petit (A) au plus grand (G).

In [330]:
# Instanciation du Ordinal Encoder
ore = OrdinalEncoder()

# Encodage
fires_model_enc["CLASS"] = ore.fit_transform(fires_model_enc[["CLASS"]])
fires_model_enc["CLASS"].head(20)

0     0.0
1     0.0
2     0.0
12    0.0
14    0.0
15    0.0
19    0.0
20    0.0
27    0.0
28    0.0
29    0.0
30    0.0
33    0.0
34    0.0
35    0.0
38    0.0
43    0.0
44    0.0
48    0.0
49    0.0
Name: CLASS, dtype: float64

In [331]:
# Pour retrouver les étiquettes initiales
# ore.inverse_transform(fires_model_enc[["CLASS"]].head(20))

### Déséquilibre important pour la variable cible

In [332]:
np.round(fires_model_enc["CLASS"].value_counts(normalize=True),4)*100

CLASS
1.0    44.01
0.0    42.09
2.0    10.05
3.0     1.80
4.0     1.05
5.0     0.66
6.0     0.34
Name: proportion, dtype: float64

Le dataset est très déséquilibré pour la variable cible : on a un ratio de 250 entre la classe majoritaire et la classe minoritaire.  
Plusieurs possibilités pour la modélisation :
- utiliser le paramètre de pénalisation "class_weight" dans les algorithmes
- oversampler les classes minoritaires et/ou undersampler les classes majoritaires. On effectuera cette opération après le scaling et la sélection des features, de manière à éviter des temps de traitement trop longs.

# Séparation des variables

In [333]:
# Création de la variable cible
y = fires_model_enc['BIG_CLASSES']

# Création du dataset des variables explicatives
X = fires_model_enc.drop(['CLASS', 'BIG_CLASSES'], axis=1)

# Séparation du dataset en jeux d'entraînement et de test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Vérification des dimensions des datasets
print("X Train Set:", X_train.shape)
print("y Train Set:", y_train.shape)
print("X Test Set:", X_test.shape)
print("y Test Set:", y_test.shape)

X Train Set: (485331, 196)
y Train Set: (485331,)
X Test Set: (121333, 196)
y Test Set: (121333,)


# Scaling  
La colonne "DUR_MIN" comporte de très grands outliers. On choisit donc d'utiliser un RobustScaler qui fait intervenir la médiane et l'écart interquartile, ce qui empêche les outliers d'avoir une influence sur la transformation des valeurs "moyennes", comme cela pourrait être le cas avec un StandardScaler ou un MinMaxScaler.

In [334]:
# Instanciation du Robust Scaler
rs = RobustScaler()

# Ajustement au dataset d'entrainement
X_train_scaled = rs.fit_transform(X_train)

# Application au dataset de test
X_test_scaled = rs.transform(X_test)

In [335]:
# Création de dataframe, par souci de lecture et de vérification dans l'équipe
X_train_df = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_df = pd.DataFrame(X_test_scaled, columns=X.columns)

In [336]:
X_train_df.describe()

Unnamed: 0,DISC_YEAR,DISC_DOY_COS,DISC_DOY_SIN,DUR_MIN,CAUSE_DESCR_HUMAN,LAT_COS,LAT_SIN,LON_COS,LON_SIN,FUEL_MOISTURE,WIND,ECO_AREA_KM2,CAUSE_DESCR_Arson,CAUSE_DESCR_Campfire,CAUSE_DESCR_Children,CAUSE_DESCR_Debris Burning,CAUSE_DESCR_Equipment Use,CAUSE_DESCR_Fireworks,CAUSE_DESCR_Lightning,CAUSE_DESCR_Miscellaneous,CAUSE_DESCR_Missing/Undefined,CAUSE_DESCR_Powerline,CAUSE_DESCR_Railroad,CAUSE_DESCR_Smoking,CAUSE_DESCR_Structure,OWNER_DESCR_BIA,OWNER_DESCR_BLM,OWNER_DESCR_BOR,OWNER_DESCR_COUNTY,OWNER_DESCR_FOREIGN,OWNER_DESCR_FWS,OWNER_DESCR_MISSING/NOT SPECIFIED,OWNER_DESCR_MUNICIPAL/LOCAL,OWNER_DESCR_NPS,OWNER_DESCR_OTHER FEDERAL,OWNER_DESCR_PRIVATE,OWNER_DESCR_STATE,OWNER_DESCR_STATE OR PRIVATE,OWNER_DESCR_TRIBAL,OWNER_DESCR_UNDEFINED FEDERAL,OWNER_DESCR_USFS,STATE_AL,STATE_AR,STATE_AZ,STATE_CA,STATE_CO,STATE_CT,STATE_DC,STATE_DE,STATE_FL,STATE_GA,STATE_IA,STATE_ID,STATE_IL,STATE_IN,STATE_KS,STATE_KY,STATE_LA,STATE_MA,STATE_MD,STATE_ME,STATE_MI,STATE_MN,STATE_MO,STATE_MS,STATE_MT,STATE_NC,STATE_ND,STATE_NE,STATE_NH,STATE_NJ,STATE_NM,STATE_NV,STATE_NY,STATE_OH,STATE_OK,STATE_OR,STATE_PA,STATE_RI,STATE_SC,STATE_SD,STATE_TN,STATE_TX,STATE_UT,STATE_VA,STATE_VT,STATE_WA,STATE_WI,STATE_WV,STATE_WY,VEGETATION_Barren-Rock/Sand/Clay,VEGETATION_Conifer,VEGETATION_Grassland,VEGETATION_Hardwood,VEGETATION_Hardwood-Conifer,VEGETATION_No Data,VEGETATION_Open Water,VEGETATION_PerennialIce/Snow,VEGETATION_Riparian,VEGETATION_Savanna,VEGETATION_Shrubland,VEGETATION_Sparse,ECO_REG_LVL3_Acadian Plains and Hills,ECO_REG_LVL3_Arizona/New Mexico Mountains,ECO_REG_LVL3_Arizona/New Mexico Plateau,ECO_REG_LVL3_Arkansas Valley,ECO_REG_LVL3_Aspen Parkland/Northern Glaciated Plains,ECO_REG_LVL3_Atlantic Coastal Pine Barrens,ECO_REG_LVL3_Blue Mountains,ECO_REG_LVL3_Blue Ridge,ECO_REG_LVL3_Boston Mountains,"ECO_REG_LVL3_California Coastal Sage, Chaparral, and Oak Woodlands",ECO_REG_LVL3_Canadian Rockies,ECO_REG_LVL3_Cascades,ECO_REG_LVL3_Central Appalachians,ECO_REG_LVL3_Central Basin and Range,ECO_REG_LVL3_Central California Valley,ECO_REG_LVL3_Central Corn Belt Plains,ECO_REG_LVL3_Central Great Plains,ECO_REG_LVL3_Central Irregular Plains,ECO_REG_LVL3_Chihuahuan Desert,ECO_REG_LVL3_Coast Range,ECO_REG_LVL3_Colorado Plateaus,ECO_REG_LVL3_Columbia Mountains/Northern Rockies,ECO_REG_LVL3_Columbia Plateau,ECO_REG_LVL3_Cross Timbers,ECO_REG_LVL3_Driftless Area,ECO_REG_LVL3_East Central Texas Plains,ECO_REG_LVL3_Eastern Cascades Slopes and Foothills,ECO_REG_LVL3_Eastern Corn Belt Plains,ECO_REG_LVL3_Eastern Great Lakes Lowlands,ECO_REG_LVL3_Edwards Plateau,ECO_REG_LVL3_Erie Drift Plain,ECO_REG_LVL3_Flint Hills,ECO_REG_LVL3_High Plains,ECO_REG_LVL3_Huron/Erie Lake Plains,ECO_REG_LVL3_Idaho Batholith,ECO_REG_LVL3_Interior Plateau,ECO_REG_LVL3_Interior River Valleys and Hills,ECO_REG_LVL3_Klamath Mountains,ECO_REG_LVL3_Lake Manitoba and Lake Agassiz Plain,ECO_REG_LVL3_Madrean Archipelago,ECO_REG_LVL3_Middle Atlantic Coastal Plain,ECO_REG_LVL3_Middle Rockies,ECO_REG_LVL3_Mississippi Alluvial Plain,ECO_REG_LVL3_Mississippi Valley Loess Plains,ECO_REG_LVL3_Mojave Basin and Range,ECO_REG_LVL3_Nebraska Sand Hills,ECO_REG_LVL3_North Cascades,ECO_REG_LVL3_North Central Appalachians,ECO_REG_LVL3_North Central Hardwood Forests,ECO_REG_LVL3_Northeastern Coastal Zone,ECO_REG_LVL3_Northern Allegheny Plateau,ECO_REG_LVL3_Northern Appalachian and Atlantic Maritime Highlands,ECO_REG_LVL3_Northern Basin and Range,ECO_REG_LVL3_Northern Lakes and Forests,ECO_REG_LVL3_Northern Minnesota Wetlands,ECO_REG_LVL3_Northern Piedmont,ECO_REG_LVL3_Northwestern Glaciated Plains,ECO_REG_LVL3_Northwestern Great Plains,ECO_REG_LVL3_Ouachita Mountains,ECO_REG_LVL3_Ozark Highlands,ECO_REG_LVL3_Piedmont,ECO_REG_LVL3_Ridge and Valley,ECO_REG_LVL3_Sierra Nevada,ECO_REG_LVL3_Snake River Plain,ECO_REG_LVL3_Sonoran Desert,ECO_REG_LVL3_South Central Plains,ECO_REG_LVL3_Southeastern Plains,ECO_REG_LVL3_Southeastern Wisconsin Till Plains,ECO_REG_LVL3_Southern Coastal Plain,ECO_REG_LVL3_Southern Florida Coastal Plain,ECO_REG_LVL3_Southern Michigan/Northern Indiana Drift Plains,ECO_REG_LVL3_Southern Rockies,ECO_REG_LVL3_Southern Texas Plains/Interior Plains and Hills with Xerophytic Shrub and Oak Forest,ECO_REG_LVL3_Southern and Baja California Pine-Oak Mountains,ECO_REG_LVL3_Southwestern Appalachians,ECO_REG_LVL3_Southwestern Tablelands,ECO_REG_LVL3_Strait of Georgia/Puget Lowland,ECO_REG_LVL3_Texas Blackland Prairies,ECO_REG_LVL3_Wasatch and Uinta Mountains,ECO_REG_LVL3_Western Allegheny Plateau,ECO_REG_LVL3_Western Corn Belt Plains,ECO_REG_LVL3_Western Gulf Coastal Plain,ECO_REG_LVL3_Willamette Valley,ECO_REG_LVL3_Wyoming Basin,ECO_REG_LVL1_EASTERN TEMPERATE FORESTS,ECO_REG_LVL1_GREAT PLAINS,ECO_REG_LVL1_MARINE WEST COAST FOREST,ECO_REG_LVL1_MEDITERRANEAN CALIFORNIA,ECO_REG_LVL1_NORTH AMERICAN DESERTS,ECO_REG_LVL1_NORTHERN FORESTS,ECO_REG_LVL1_NORTHWESTERN FORESTED MOUNTAINS,ECO_REG_LVL1_SOUTHERN SEMI-ARID HIGHLANDS,ECO_REG_LVL1_TEMPERATE SIERRAS,ECO_REG_LVL1_TROPICAL WET FORESTS
count,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0,485331.0
mean,0.016426,0.106615,-0.022282,1.565728,-0.212552,-0.185786,0.134445,-0.074772,0.363523,-0.186398,0.02259,0.204447,0.179057,0.053085,0.03408,0.234215,0.062071,0.012933,0.147454,0.181672,0.044417,0.010123,0.00994,0.027748,0.003204,0.124935,0.050784,0.000249,0.000746,1.2e-05,0.008916,0.30826,0.002763,0.010836,0.005848,0.290179,0.023786,0.033981,0.010416,0.001669,0.126621,0.031026,0.011911,0.058257,0.092601,0.021756,0.000309,8e-05,4.9e-05,0.025613,0.136367,0.000478,0.020353,0.002141,0.00158,0.007747,0.015455,0.003775,0.000408,0.001212,0.005242,0.007218,0.024499,0.008497,0.061222,0.030179,0.035701,0.019531,0.007304,8.4e-05,0.000581,0.022121,0.016311,0.092733,0.001723,0.040092,0.031374,0.006666,0.000464,0.016016,0.02448,0.013469,0.017215,0.016982,0.001059,4.3e-05,0.013523,0.01978,0.023969,0.010801,0.004302,0.257841,0.09122,0.252432,0.103389,4e-06,0.012445,1.2e-05,0.149753,0.002501,0.122825,0.003274,0.003857,0.03103,0.010255,0.006082,0.014761,0.010348,0.011425,0.014176,0.005131,0.027202,0.001131,0.010797,0.020971,0.019164,0.008246,0.001032,0.008017,0.004743,0.003624,0.003857,0.015806,0.013498,0.005858,0.005238,0.002194,0.000999,0.011378,4.3e-05,0.026549,0.000484,0.001315,0.001364,0.009299,0.00021,0.005782,0.008932,0.001999,0.014775,0.000956,0.004675,0.012795,0.010764,0.001045,0.013111,0.008178,0.001038,0.002357,0.002532,0.011576,0.024723,0.013311,0.010319,0.00694,0.028723,0.005988,0.006536,0.009109,0.032708,0.013799,0.018243,0.052908,0.025572,0.01815,0.006822,0.025634,0.019642,0.149566,0.001374,0.04589,0.003223,0.000872,0.013385,0.00047,0.009624,0.009247,0.003653,0.000404,0.000354,0.004585,0.013879,0.004053,0.00279,0.000859,0.006045,-0.462072,0.099038,0.00512,0.045072,0.108326,0.047561,0.118027,0.004675,0.03103,0.003223
std,0.530241,0.551819,0.492214,17.994554,0.409114,0.590551,0.573489,0.541809,0.738071,0.664525,0.708327,0.940024,0.383401,0.224204,0.181434,0.423508,0.241285,0.112988,0.354558,0.385574,0.20602,0.100103,0.099201,0.16425,0.056513,0.330646,0.219556,0.015788,0.027301,0.003516,0.094001,0.461775,0.052492,0.10353,0.076245,0.453845,0.152381,0.18118,0.101524,0.040819,0.332548,0.173389,0.108488,0.234229,0.289872,0.145887,0.017578,0.008964,0.007032,0.157979,0.343178,0.021859,0.141205,0.046219,0.039722,0.087677,0.123356,0.061323,0.020194,0.034786,0.07221,0.08465,0.154592,0.091788,0.239738,0.171081,0.185545,0.138382,0.085153,0.009191,0.024098,0.147077,0.126667,0.290058,0.041468,0.196176,0.174328,0.08137,0.021526,0.125536,0.154535,0.115273,0.130072,0.129205,0.032526,0.006578,0.115498,0.139245,0.152953,0.103365,0.06545,0.437446,0.287922,0.434408,0.304467,0.00203,0.110861,0.003516,0.35683,0.049951,0.328237,0.057126,0.061986,0.1734,0.100746,0.077753,0.120595,0.101195,0.106276,0.118216,0.071444,0.162672,0.033614,0.103345,0.143288,0.137102,0.090432,0.032113,0.089179,0.068707,0.060093,0.061986,0.124723,0.115394,0.076312,0.072182,0.046793,0.031596,0.106058,0.006578,0.160761,0.021999,0.036233,0.036907,0.095981,0.014496,0.075817,0.094087,0.044661,0.120653,0.030905,0.068215,0.112391,0.103189,0.032304,0.113749,0.090061,0.032209,0.048493,0.050258,0.106966,0.155281,0.114601,0.101056,0.083015,0.167026,0.077148,0.080579,0.095007,0.17787,0.116655,0.13383,0.22385,0.157855,0.133496,0.082314,0.158041,0.138768,0.356646,0.037046,0.209247,0.056676,0.02951,0.114915,0.021669,0.097631,0.095717,0.060331,0.020092,0.018822,0.067554,0.11699,0.063533,0.052745,0.0293,0.077517,0.49856,0.298713,0.071372,0.207463,0.310792,0.212836,0.32264,0.068215,0.1734,0.056676
min,-0.833333,-0.544376,-0.777961,-0.488,-1.0,-1.486383,-1.329574,-1.023449,-0.290987,-2.1552,-2.670643,-0.961048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.5,-0.418095,-0.509083,-0.328,0.0,-0.695159,-0.341653,-0.598856,-0.217293,-0.67628,-0.504383,-0.577256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.5,0.581905,0.490917,0.672,0.0,0.304841,0.658347,0.401144,0.782707,0.32372,0.495617,0.422744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.833333,1.172148,0.590596,912.872,0.0,1.012184,1.259582,0.975675,2.459176,2.111222,4.533945,2.120027,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [337]:
# X_train_df.info(verbose=True, memory_usage=True, show_counts=True)

In [338]:
# X_test_df.info(verbose=True, memory_usage=True, show_counts=True)

In [339]:
# y_train.info(verbose=True, memory_usage=True, show_counts=True)

In [340]:
# y_test.info(verbose=True, memory_usage=True, show_counts=True)

# Sélection spécifique : catégorie globale humaine VS toutes les causes humaines

On souhaite étudier si la conservation des causes humaines séparées ou bien leur regroupement en une seule classe a un impact. On prépare donc deux datasets.

In [341]:
# X_train_df.columns

## Version : toutes les causes
Après analyse, il semblerait que la conservation des colonnes telles quelles n'apporte pas plus d'informations. On décide donc de laisser la partie suviante commentée et donc d'utiliser le dataset des causes humaines regroupées. Cela permet un gain en termes de nombre de features et aussi de taille en mémoire.

In [342]:
# # avec toutes les causes humaines
# X_train_df_all_causes = X_train_df.drop(['CAUSE_DESCR_HUMAN'], axis=1)
# X_test_df_all_causes = X_test_df.drop(['CAUSE_DESCR_HUMAN'], axis=1)

In [343]:
# # Utilisation de variables "génériques"
# X_train_df = X_train_df_all_causes
# X_test_df = X_test_df_all_causes

In [344]:
# # Vérification des dimensions des datasets
# print(f"Variable cible d'entraînement : {X_train_df.shape}")
# print(f"Variable cible d'entraînement : {X_test_df.shape}")
# print(f"Variable cible d'entraînement : {y_train.shape}")
# print(f"Variable cible d'entraînement : {y_test.shape}")

In [345]:
# # Export vers csv
# X_train_df.to_csv('X_train_df_all_causes.csv', sep=';', encoding='utf-8', index_label='index')
# X_test_df.to_csv('X_test_df_all_causes.csv', sep=';', encoding='utf-8', index_label='index')

## Version : réduction à 3 catégories

In [346]:
# avec 3 catégories : humaine, foudre, indéfini
X_train_df_3_causes = X_train_df.drop([
    'CAUSE_DESCR_Arson', 'CAUSE_DESCR_Campfire', 'CAUSE_DESCR_Children', 'CAUSE_DESCR_Debris Burning', 
    'CAUSE_DESCR_Equipment Use', 'CAUSE_DESCR_Fireworks', 'CAUSE_DESCR_Miscellaneous', 'CAUSE_DESCR_Powerline', 
    'CAUSE_DESCR_Railroad', 'CAUSE_DESCR_Smoking', 'CAUSE_DESCR_Structure'
], axis=1)
X_test_df_3_causes = X_test_df.drop([
    'CAUSE_DESCR_Arson', 'CAUSE_DESCR_Campfire', 'CAUSE_DESCR_Children', 'CAUSE_DESCR_Debris Burning', 
    'CAUSE_DESCR_Equipment Use', 'CAUSE_DESCR_Fireworks', 'CAUSE_DESCR_Miscellaneous', 'CAUSE_DESCR_Powerline', 
    'CAUSE_DESCR_Railroad', 'CAUSE_DESCR_Smoking', 'CAUSE_DESCR_Structure'
], axis=1)

In [347]:
X_train_df = X_train_df_3_causes
X_test_df = X_test_df_3_causes

In [348]:
# Vérification des dimensions des datasets
print(f"Variable cible d'entraînement : {X_train_df.shape}")
print(f"Variable cible d'entraînement : {X_test_df.shape}")
print(f"Variable cible d'entraînement : {y_train.shape}")
print(f"Variable cible d'entraînement : {y_test.shape}")

Variable cible d'entraînement : (485331, 185)
Variable cible d'entraînement : (121333, 185)
Variable cible d'entraînement : (485331,)
Variable cible d'entraînement : (121333,)


In [349]:
X_train_df.columns.values

array(['DISC_YEAR', 'DISC_DOY_COS', 'DISC_DOY_SIN', 'DUR_MIN',
       'CAUSE_DESCR_HUMAN', 'LAT_COS', 'LAT_SIN', 'LON_COS', 'LON_SIN',
       'FUEL_MOISTURE', 'WIND', 'ECO_AREA_KM2', 'CAUSE_DESCR_Lightning',
       'CAUSE_DESCR_Missing/Undefined', 'OWNER_DESCR_BIA',
       'OWNER_DESCR_BLM', 'OWNER_DESCR_BOR', 'OWNER_DESCR_COUNTY',
       'OWNER_DESCR_FOREIGN', 'OWNER_DESCR_FWS',
       'OWNER_DESCR_MISSING/NOT SPECIFIED', 'OWNER_DESCR_MUNICIPAL/LOCAL',
       'OWNER_DESCR_NPS', 'OWNER_DESCR_OTHER FEDERAL',
       'OWNER_DESCR_PRIVATE', 'OWNER_DESCR_STATE',
       'OWNER_DESCR_STATE OR PRIVATE', 'OWNER_DESCR_TRIBAL',
       'OWNER_DESCR_UNDEFINED FEDERAL', 'OWNER_DESCR_USFS', 'STATE_AL',
       'STATE_AR', 'STATE_AZ', 'STATE_CA', 'STATE_CO', 'STATE_CT',
       'STATE_DC', 'STATE_DE', 'STATE_FL', 'STATE_GA', 'STATE_IA',
       'STATE_ID', 'STATE_IL', 'STATE_IN', 'STATE_KS', 'STATE_KY',
       'STATE_LA', 'STATE_MA', 'STATE_MD', 'STATE_ME', 'STATE_MI',
       'STATE_MN', 'STATE_MO', 'STA

In [350]:
# Export vers csv
X_train_df.to_csv('corr_X_train_df_3_classes.csv', sep=';', encoding='utf-8', index_label='index')
X_test_df.to_csv('corr_X_test_df_3_classes.csv', sep=';', encoding='utf-8', index_label='index')

## Variable cible

In [351]:
# Export vers csv
y_test.to_csv('corr_y_test_3_classes.csv', sep=';', encoding='utf-8', index_label='index')
y_train.to_csv('corr_y_train_3_classes.csv', sep=';', encoding='utf-8', index_label='index')