# Data Preprocessing

After reviewing the EDA, we need to make a few modifications to the dataset to prepare it for modeling.

In [1612]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import zscore
import seaborn as sns

datafile = pd.read_csv("../data/processed/fires.csv")

# Reviewing Columns

First, we review the columns to recall how each one can be transformed.

In [1613]:
datafile.head()

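| Unnamed: 0 | X | Y | month | FFMC | DMC | DC | ISI | temp | RH | wind | ... | month_dec | month_feb | month_jan | month_jul | month_jun | month_mar | month_may | month_nov | month_oct | month_sep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | ... | False | False | False | False | False | True | False | False | False | False |
| 1 | 7 | 4 | oct | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | 7 | 4 | oct | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | ... | False | False | False | False | False | False | False | False | True | False |
| 3 | 8 | 6 | mar | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | ... | False | False | False | False | False | True | False | False | False | False |
| 4 | 8 | 6 | mar | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | ... | False | False | False | False | False | True | False | False | False | False |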


# Filtering Out Extreme Outliers

As we observed during data exploration, several columns contain many outliers that could make the dataset unsuitable for our linear regression model. In this step we check each column for outliers using the z-score and filter them out of the data frame. The cutoffs are chosen per column and are not necessarily symmetric, and each z-score is computed on the rows that remain after the previous filter.
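The per-column filtering described above follows one repeated pattern, which could be factored into a small helper. The following is only a sketch: the function name `drop_zscore_outliers` and the toy data are hypothetical and not part of the notebook. It computes the z-score by hand with the population standard deviation (`ddof=0`), which is the default used by `scipy.stats.zscore`.

```python
import pandas as pd


def drop_zscore_outliers(df, column, upper, lower):
    """Return (filtered_df, removed_values) for one column.

    Rows are dropped when the column's z-score is above `upper`
    or below `lower`. Hypothetical helper, not from the notebook.
    """
    values = df[column].astype(float)
    # Population standard deviation (ddof=0), as scipy.stats.zscore uses by default
    z = (values - values.mean()) / values.std(ddof=0)
    mask = (z > upper) | (z < lower)
    return df[~mask], df.loc[mask, column]


# Toy data for illustration only, not taken from fires.csv
toy = pd.DataFrame({"wind": [2.0, 3.0, 4.0, 3.0, 2.0, 4.0, 3.0, 20.0]})
filtered, removed = drop_zscore_outliers(toy, "wind", upper=2.0, lower=-3.0)
print("Removed:", removed.tolist())
print("Rows kept:", len(filtered))
```

Returning the removed values alongside the filtered frame keeps the same review step the cells below perform with `print("Outliers:", ...)`.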

In [1614]:
z_scores = stats.zscore(datafile['area'])

outliers = (z_scores > 0.5) | (z_scores < -2.9)

datafile = datafile[~outliers]

In [1615]:
z_scores = stats.zscore(datafile['RH'])

outliers_rh = (z_scores > 2.5) | (z_scores < -3)

print("Outliers:", datafile['RH'][outliers_rh])

datafile = datafile[~outliers_rh]

Outliers: 3       97
4       99
7       86
98      87
211     96
286     90
299     90
304     94
335     86
379    100
451     88
Name: RH, dtype: int64


In [1616]:
z_scores = stats.zscore(datafile['ISI'])

outliers_isi = (z_scores > 1.5) | (z_scores < -1.8)

print("Outliers:", datafile['ISI'][outliers_isi])

datafile = datafile[~outliers_isi]

Outliers: 11     22.6
12      0.8
22     56.1
24     20.3
30     15.9
42     17.0
45     15.9
71     15.9
82     17.0
97      0.7
102    17.0
124    15.9
130     0.8
133    17.9
135    20.3
148    17.0
149    17.9
153    15.9
155    17.0
167    16.5
192    17.0
194    17.0
199     0.8
206    20.3
209    17.9
212    15.9
266    22.7
312     0.4
382    18.0
421    18.0
443    16.8
450    18.0
455    16.7
475    18.0
485    21.3
486    17.7
487    17.7
489    17.7
490    17.7
495    16.8
496    16.8
503    20.0
Name: ISI, dtype: float64


In [1617]:
z_scores = stats.zscore(datafile['temp'])

outliers_temp = (z_scores > 2.3) | (z_scores < -2.2)

print("Outliers:", datafile['temp'][outliers_temp])

datafile = datafile[~outliers_temp]

Outliers: 61      5.5
104     5.3
165     5.3
176     5.8
196     5.8
273     4.8
274     5.1
275     5.1
276     4.6
277     4.6
278     4.6
279     4.6
280     2.2
281     5.1
282     4.2
394     5.3
463     4.6
464     5.1
465     4.6
484    33.1
491    32.4
492    32.4
497    32.3
498    33.3
Name: temp, dtype: float64


In [1618]:
z_scores = stats.zscore(datafile['wind'])

outliers_w = (z_scores > 2.6) | (z_scores < -3)

print("Outliers:", datafile['wind'][outliers_w])

datafile = datafile[~outliers_w]

Outliers: 142    8.9
162    8.5
168    9.4
411    9.4
506    8.5
Name: wind, dtype: float64


In [1619]:
z_scores = stats.zscore(datafile['DMC'])

outliers_dmc = (z_scores > 1.8) | (z_scores < -3)

print("Outliers:", datafile['DMC'][outliers_dmc])

datafile = datafile[~outliers_dmc]

Outliers: 369    276.3
370    276.3
374    290.0
383    248.4
384    273.8
398    231.1
406    291.3
408    290.0
413    231.1
414    235.1
422    263.1
424    231.1
425    248.4
426    248.4
430    287.2
433    235.1
434    269.8
437    253.6
438    231.1
440    290.0
444    290.0
448    284.9
452    238.2
453    266.2
454    248.4
456    248.4
458    231.1
459    273.8
460    231.1
461    231.1
462    276.3
Name: DMC, dtype: float64


In [1620]:
z_scores = stats.zscore(datafile['DC'])

outliers_dc = (z_scores > 2.5) | (z_scores < -1.9)

print("Outliers:", datafile['DC'][outliers_dc])

datafile = datafile[~outliers_dc]

Outliers: 39     67.6
48     64.7
58     34.0
59     43.0
75     26.6
76     43.0
96     30.2
105    57.3
110    57.3
114    67.6
115    67.6
131    64.7
134    67.6
182    48.3
202    32.1
239     7.9
240    43.5
283    18.7
284    15.8
378    30.6
387    30.6
390    58.3
393    25.6
407    55.0
410    52.8
417    28.3
442    41.6
447    28.3
466    36.9
467    41.1
468    43.5
470    25.6
Name: DC, dtype: float64


In [1621]:
z_scores = stats.zscore(datafile['rain'])

outliers_r = (z_scores > 2.1) | (z_scores < -1.9)

print("Outliers:", datafile['rain'][outliers_r])

datafile = datafile[~outliers_r]

Outliers: 243    1.0
499    6.4
500    0.8
501    0.8
509    1.4
Name: rain, dtype: float64


In [1622]:
z_scores = stats.zscore(datafile['FFMC'])

outliers_ffmc = (z_scores > 1.1) | (z_scores < -1.45)

print("Outliers:", datafile['FFMC'][outliers_ffmc])

datafile = datafile[~outliers_ffmc]

Outliers: 0      86.2
17     84.9
19     86.3
40     79.5
47     94.2
49     87.6
77     87.6
123    84.4
126    87.6
138    85.8
141    95.5
144    95.5
145    95.2
147    84.4
161    95.2
169    95.2
171    85.6
181    84.9
191    95.2
213    87.6
222    87.6
241    83.0
242    94.2
255    87.5
256    94.2
257    94.2
264    94.3
324    88.1
373    94.8
388    94.8
389    94.8
402    94.8
404    87.9
405    94.6
428    94.8
432    94.8
445    94.0
482    94.9
483    94.9
493    95.9
494    96.0
502    96.1
511    81.6
512    81.6
514    81.6
515    94.4
516    79.5
Name: FFMC, dtype: float64


# Filtering Out Columns

Because the "month" column holds strings, we need to drop it before fitting the linear regression: the model requires numeric input, and month names cannot be converted to floats. The month information remains available through the one-hot encoded month_* columns. The heatmap in the data exploration also revealed multicollinearity, which can affect how well the data frame fits a linear regression model, so we also drop columns that are strongly correlated with other variables.

In [1623]:
df_processed = datafile.drop(columns=['month', 'X', 'Y', 'month_numerical'])
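One way to double-check which columns are strongly correlated before dropping them is to list the column pairs whose absolute Pearson correlation exceeds a cutoff. This is a minimal sketch under stated assumptions: the helper name `highly_correlated_pairs`, the 0.8 cutoff, and the toy data are all hypothetical and are not taken from the notebook or its heatmap. Note that `select_dtypes(include="number")` skips string columns, and boolean one-hot columns would likely need to be cast to integers to be included.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold.

    Hypothetical helper; the threshold is an illustrative choice.
    """
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Toy data for illustration only, not taken from fires.csv
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],     # exactly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "label": ["x", "y", "x", "y", "x"],  # string column, skipped
})
pairs = highly_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal; which one to keep remains a modeling decision.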


In [1624]:
# save data for later modeling
df_processed.to_csv("../data/processed/processed_fires.csv", index=False)