# Data Preprocessing

After reviewing our EDA, we need to make a few modifications to our data set to prepare it for modeling.

In [1022]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import zscore

from sklearn.preprocessing import MinMaxScaler


datafile = pd.read_csv("../data/processed/fires.csv")

# Filtering Out Extreme Outliers

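As we observed during the data exploration, the data set contains many outliers that can make it unfit for our linear regression model. In this step we check the area, RH, ISI, and temp columns for outliers and filter them out of the data frame using the z-score. Rows with |z| > 3 are removed for area, RH, and ISI, while a tighter cutoff of |z| > 2.1 is used for temp. Each z-score is computed on the data frame as already filtered by the previous cell.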
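To make the filtering rule concrete, here is a minimal sketch on made-up values (the `toy` frame is hypothetical and is not taken from fires.csv). It shows that `stats.zscore` matches the manual formula z = (x - mean) / std with the population standard deviation, and how a boolean mask removes the flagged rows.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy values standing in for a column such as 'area'
toy = pd.DataFrame(
    {"area": [0.0, 1.2, 0.5, 2.1, 0.9, 1.7, 0.3, 1.1, 0.8, 1.5, 0.6, 250.0]}
)

# z = (x - mean) / std; scipy.stats.zscore uses the population std (ddof=0) by default
manual_z = (toy["area"] - toy["area"].mean()) / toy["area"].std(ddof=0)
scipy_z = stats.zscore(toy["area"])

# Flag rows whose absolute z-score exceeds 3, then keep everything else
mask = np.abs(scipy_z) > 3
filtered = toy[~mask]
```

Note that a single extreme value also inflates the mean and standard deviation it is measured against, so on small samples the largest attainable z-score is limited.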

In [1023]:
z_scores = stats.zscore(datafile['area'])

outliers = (z_scores > 3) | (z_scores < -3)

print("Outliers:", datafile['area'][outliers])

datafile = datafile[~outliers]

Outliers: 237     212.88
238    1090.84
415     746.28
479     278.53
Name: area, dtype: float64


In [1024]:
z_scores = stats.zscore(datafile['RH'])

outliers_rh = (z_scores > 3) | (z_scores < -3)

print("Outliers:", datafile['RH'][outliers_rh])

datafile = datafile[~outliers_rh]

Outliers: 3       97
4       99
211     96
304     94
379    100
Name: RH, dtype: int64


In [1025]:
z_scores = stats.zscore(datafile['ISI'])

outliers_isi = (z_scores > 3) | (z_scores < -3)

print("Outliers:", datafile['ISI'][outliers_isi])

datafile = datafile[~outliers_isi]

Outliers: 22     56.1
266    22.7
Name: ISI, dtype: float64


In [1026]:
z_scores = stats.zscore(datafile['temp'])

# Note: temp uses a tighter cutoff (|z| > 2.1) than the |z| > 3 used for the other columns
outliers_temp = (z_scores > 2.1) | (z_scores < -2.1)

print("Outliers:", datafile['temp'][outliers_temp])

datafile = datafile[~outliers_temp]

Outliers: 61      5.5
75      6.7
104     5.3
165     5.3
176     5.8
196     5.8
273     4.8
274     5.1
275     5.1
276     4.6
277     4.6
278     4.6
279     4.6
280     2.2
281     5.1
282     4.2
394     5.3
463     4.6
464     5.1
465     4.6
484    33.1
491    32.4
492    32.4
496    32.6
497    32.3
498    33.3
Name: temp, dtype: float64
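The four cells above repeat the same pattern, so they could be factored into one helper. The sketch below is only an illustration: `drop_zscore_outliers` is a hypothetical name, and the `toy` frame with its injected extreme values is made up rather than drawn from fires.csv.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_zscore_outliers(df, column, threshold=3.0):
    """Return (filtered_df, removed_values) for rows where |z| > threshold."""
    z = np.abs(stats.zscore(df[column]))
    mask = np.asarray(z > threshold)
    return df[~mask], df.loc[mask, column]


# Hypothetical data: two roughly normal columns with one injected extreme value each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.normal(10, 1, 30),
    "RH": rng.normal(45, 5, 30),
})
toy.loc[5, "area"] = 500.0
toy.loc[20, "RH"] = 400.0

# Apply the filters one after another, as the cells above do;
# each z-score is computed on the frame left by the previous step
cleaned = toy
for column, threshold in {"area": 3.0, "RH": 3.0}.items():
    cleaned, removed = drop_zscore_outliers(cleaned, column, threshold)
    print("Outliers:", removed)
```

With the notebook's own data, the mapping passed to the loop would presumably be `{"area": 3.0, "RH": 3.0, "ISI": 3.0, "temp": 2.1}` to mirror the cutoffs used above.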


# Filtering Out Columns

Because the "month" column is a string, we need to drop it for our linear regression: the model requires numeric (float) inputs, and the month names cannot be converted to floats directly. The heatmap from the data exploration also revealed multicollinearity, which may affect how well the data frame fits the linear regression model, so we also drop columns that are strongly correlated with other variables. In total, the columns dropped below are month, X, Y, FFMC, and month_numerical.
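As a numeric complement to reading the heatmap, strongly correlated column pairs can be listed directly. This is a minimal sketch, not the procedure used in this notebook: `correlated_pairs`, the 0.8 cutoff, and the toy columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    # Non-numeric columns (such as a string 'month') are excluded first
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Hypothetical data: feat_b is an exact linear function of feat_a, feat_c is unrelated
toy = pd.DataFrame({
    "feat_a": np.arange(50.0),
    "feat_b": 2 * np.arange(50.0) + 1,
    "feat_c": np.tile([1.0, -1.0], 25),
    "month": ["aug"] * 50,
})

pairs = correlated_pairs(toy)
print(pairs)
```

One column from each reported pair is then a candidate for dropping; which one to keep remains a modeling decision.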

In [1027]:
df_processed = datafile.drop(columns=['month', 'X', 'Y', 'FFMC', 'month_numerical'])


In [1028]:
scaler = MinMaxScaler()

# Scale the numeric columns (including the target, 'area') to the [0, 1] range
scaled_columns = ['DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain', 'area']
datafile[scaled_columns] = scaler.fit_transform(datafile[scaled_columns])

# Rebuild df_processed so that it contains the scaled values
df_processed = datafile.drop(columns=['month', 'X', 'Y', 'FFMC', 'month_numerical'])
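Since the target column area is scaled together with the features, values predicted by a model trained on this data will lie on the scaled range rather than in the original units. The sketch below uses a made-up `toy` frame to show what `MinMaxScaler` computes by default, (x - min) / (max - min) per column, and how `inverse_transform` on the same fitted scaler recovers the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, not taken from fires.csv
toy = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "area": [0.0, 5.0, 20.0]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)  # NumPy array with each column mapped to [0, 1]

# The same result computed by hand, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Map the scaled values back to the original units with the fitted scaler
restored = scaler.inverse_transform(scaled)
```

Keeping the fitted `scaler` object available at modeling time is one way to report predictions in the original units.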



In [1029]:
# save data for later modeling
df_processed.to_csv("../data/processed/processed_fires.csv", index=False)