# Wildfires Caused by the Weather
## Part 4: Machine Learning

In this part, we use machine learning to test the findings of our research.<br>
The type of machine learning we use is <b>Supervised Learning</b>.<br>
The task is a <b>Classification Problem</b>: predicting whether a fire was caused by the weather.<br>
The algorithm we use to make the predictions is <b>Logistic Regression</b>.<br>
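
As a rough illustration of what logistic regression does at prediction time, the sketch below passes a weighted sum of the features through the logistic (sigmoid) function and thresholds the resulting probability. The weights, bias, and feature values are hypothetical and are not fitted to the wildfire data; `sigmoid` and `predict_label` are illustrative helper names, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return int(sigmoid(score) >= threshold)

# Hypothetical weights for two features (not fitted to the wildfire data).
example_weights = [0.8, -0.5]
example_bias = -0.2

print(predict_label([1.0, 0.5], example_weights, example_bias))
print(predict_label([0.0, 2.0], example_weights, example_bias))
```

Training a logistic regression model amounts to choosing the weights and bias so that these probabilities agree with the known labels as well as possible.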

#### Preliminary Step - Import Modules (Packages)
This step makes the external packages used in this notebook available.

In [1]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn import metrics

#### Global variables and constants
Here we define the global variables and constants used throughout this notebook.

In [2]:
FIRE_HISTORY_CSV = "Wildfire_history_final.csv"
NUMERIC_COLS = ['InitialLatitude','InitialLongitude','FireDuration','Temperature', 'MaxTemperature', 'MinTemperature','WindSpeed','WindDirection','Humidity','CausedByWeather']
TO_PREDICT_COL = 'CausedByWeather'

### Data Preparation
In this part we will load our final CSV into a data frame. <br>
From that data frame, we will create a new data frame containing only the relevant numeric columns. <br>
We will then define TRAINING_FEATURES and TARGET_FEATURE, and build X (the features) and y (the target) for the data-splitting step.

In [3]:
df = pd.read_csv(FIRE_HISTORY_CSV)
df.head()

Unnamed: 0,UniqueFireIdentifier,FireDiscoveryDateTime,FireOutDateTime,POOCounty,InitialLatitude,InitialLongitude,FireCause,FireDuration,CausedByWeather,Temperature,MaxTemperature,MinTemperature,WindSpeed,WindDirection,Humidity
0,2020-MTLG42-000224,2020-08-06T18:58:00,2020-08-12T14:00:00,Carter,45.78496,-104.4958,2,6,0,25.9,34.0,16.5,21.9,206.63,46.67
1,2017-MTNWS-000878,2017-10-17T20:20:24,2017-11-09T21:59:59,Flathead,48.07167,-114.8303,2,23,0,6.1,12.9,-3.3,13.9,119.55,64.2
2,2020-MSMNF-000308,2020-11-23T19:17:00,2020-11-30T14:29:59,Perry,31.06819,-89.06972,2,7,0,10.3,22.7,0.6,11.2,104.88,78.07
3,2019-AZA5S-001664,2019-09-05T19:17:00,2019-09-09T17:00:00,Yavapai,34.40333,-112.4394,1,4,1,24.6,33.1,16.8,11.2,197.58,45.06
4,2020-IDNCF-000071,2020-04-20T21:33:59,2020-04-21T03:00:00,Idaho,45.41833,-116.1661,2,1,0,8.6,15.7,0.7,8.9,195.88,53.81


Now we will create a new data frame with only numeric and relevant columns:

In [4]:
df_copy=df[NUMERIC_COLS].copy()
df_copy.head()

Unnamed: 0,InitialLatitude,InitialLongitude,FireDuration,Temperature,MaxTemperature,MinTemperature,WindSpeed,WindDirection,Humidity,CausedByWeather
0,45.78496,-104.4958,6,25.9,34.0,16.5,21.9,206.63,46.67,0
1,48.07167,-114.8303,23,6.1,12.9,-3.3,13.9,119.55,64.2,0
2,31.06819,-89.06972,7,10.3,22.7,0.6,11.2,104.88,78.07,0
3,34.40333,-112.4394,4,24.6,33.1,16.8,11.2,197.58,45.06,1
4,45.41833,-116.1661,1,8.6,15.7,0.7,8.9,195.88,53.81,0


In [5]:
TRAINING_FEATURES = df_copy.columns[df_copy.columns != TO_PREDICT_COL]
TARGET_FEATURE    = TO_PREDICT_COL

X = df_copy[TRAINING_FEATURES]
y = df_copy[TARGET_FEATURE]

In [6]:
X.head()

Unnamed: 0,InitialLatitude,InitialLongitude,FireDuration,Temperature,MaxTemperature,MinTemperature,WindSpeed,WindDirection,Humidity
0,45.78496,-104.4958,6,25.9,34.0,16.5,21.9,206.63,46.67
1,48.07167,-114.8303,23,6.1,12.9,-3.3,13.9,119.55,64.2
2,31.06819,-89.06972,7,10.3,22.7,0.6,11.2,104.88,78.07
3,34.40333,-112.4394,4,24.6,33.1,16.8,11.2,197.58,45.06
4,45.41833,-116.1661,1,8.6,15.7,0.7,8.9,195.88,53.81


In [7]:
y.head()

0    0
1    0
2    0
3    1
4    0
Name: CausedByWeather, dtype: int64

### Splitting the Data
In this part we will split our data into train and test sets. <br>
We will hold out 10% of the data for testing and use the remaining 90% for training, which matches the sample counts printed below.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

print("Initial amount of samples: #{}".format(X.shape[0]))
print("Number of training samples: #{}".format(X_train.shape[0]))
print("Number of test samples: #{}".format(X_test.shape[0]))

print("\nTarget distribution in original dataset:\n{}".format(y.value_counts()))
print("\nTarget distribution in the training set:\n{}\n".format(y_train.value_counts()))
print("Target distribution in the test set:\n{}".format(y_test.value_counts()))

Initial amount of samples: #42759
Number of training samples: #38483
Number of test samples: #4276

Target distribution in original dataset:
0    28073
1    14686
Name: CausedByWeather, dtype: int64

Target distribution in the training set:
0    25233
1    13250
Name: CausedByWeather, dtype: int64

Target distribution in the test set:
0    2840
1    1436
Name: CausedByWeather, dtype: int64
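
To check that the split kept the classes in similar proportions, the sketch below recomputes the share of weather-caused fires in each set from the counts printed above. The `counts` dictionary and the `positive_share` helper are illustrative names introduced here.

```python
# Class counts copied from the output above, as
# (not caused by weather, caused by weather).
counts = {
    "original": (28073, 14686),
    "train": (25233, 13250),
    "test": (2840, 1436),
}

def positive_share(negatives, positives):
    """Fraction of samples labelled 1 (CausedByWeather)."""
    return positives / (negatives + positives)

shares = {name: positive_share(*pair) for name, pair in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%} caused by weather")
```

The three shares are close but not identical, because the split is random. If identical proportions were required, `train_test_split` accepts a `stratify=y` argument for that purpose.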


### Scaling (MinMax and Standard Scalers)
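Now that we have our train and test sets, it is time to pre-process the data. Each scaler is fit on the training set only and then applied to the test set, so no information from the test set leaks into the scaling.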

#### Standard Scale

In [9]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Training set (`X_train_scaled`):

In [10]:
print("Mean: ", X_train_scaled.mean(axis=0))
print("Standard Deviation: ", X_train_scaled.std(axis=0))
X_train_scaled

Mean:  [ 1.39660246e-15 -1.07385509e-15  5.05908348e-17 -3.32348550e-18
  7.38552333e-18 -1.82422426e-16 -4.65287970e-17  1.99409130e-17
 -3.51366272e-16]
Standard Deviation:  [1. 1. 1. 1. 1. 1. 1. 1. 1.]


array([[-0.65412676,  1.27057203,  0.79295891, ...,  0.44565202,
        -0.278077  ,  1.37580426],
       [-1.2226919 ,  0.03321462, -0.14683357, ...,  1.95270513,
         1.37712237,  0.21305596],
       [-0.63490448,  0.22058588, -0.52275056, ...,  0.38616308,
        -0.29730694,  0.39430629],
       ...,
       [ 1.24998695, -0.36655074,  1.1688759 , ..., -1.04157146,
         1.14493864, -0.59158563],
       [ 3.7763151 , -2.66182938, -0.89866755, ..., -1.75543872,
        -2.27674448,  1.59220616],
       [-1.18843908, -0.01485013, -0.89866755, ..., -0.58548959,
         0.77689893,  0.75570845]])

Test set (`X_test_scaled`):

In [11]:
print("Mean: ", X_test_scaled.mean(axis=0))
print("Standard Deviation: ", X_test_scaled.std(axis=0))

Mean:  [ 0.01150959 -0.02728041 -0.01597472  0.01586534  0.01968206  0.00436603
  0.01133361  0.02840789 -0.02417686]
Standard Deviation:  [0.97164131 0.91619072 0.99468412 0.99319884 0.99658128 0.99251806
 1.00634473 1.00946418 0.9905386 ]
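
The test-set means and standard deviations above are close to, but not exactly, 0 and 1. That is expected, because the scaler reuses the statistics learned from the training set. The sketch below reproduces the idea by hand for a single feature; the temperature values are hypothetical and `fit_standard` / `apply_standard` are illustrative helpers, not scikit-learn functions.

```python
from statistics import fmean, pstdev

def fit_standard(values):
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)

def apply_standard(values, mean, std):
    """Standardise values using statistics learned from another sample."""
    return [(v - mean) / std for v in values]

# Hypothetical temperatures, not taken from the wildfire dataset.
train_temps = [10.0, 20.0, 30.0]
test_temps = [15.0, 35.0]

mean, std = fit_standard(train_temps)
train_scaled = apply_standard(train_temps, mean, std)
test_scaled = apply_standard(test_temps, mean, std)

print("train mean:", fmean(train_scaled), "train std:", pstdev(train_scaled))
print("test mean:", fmean(test_scaled), "test std:", pstdev(test_scaled))
```

Like `StandardScaler`, this sketch uses the population standard deviation (dividing by the number of samples rather than by one less).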


#### MinMax scale

In [12]:
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled_in_range = min_max_scaler.fit_transform(X_train)
X_test_scaled_in_range = min_max_scaler.transform(X_test)

Training set (`X_train_scaled_in_range`):

In [13]:
print("Min Value: ", X_train_scaled_in_range.min(axis=0))
print("Max Value: ", X_train_scaled_in_range.max(axis=0))

Min Value:  [0. 0. 0. 0. 0. 0. 0. 0. 0.]
Max Value:  [1. 1. 1. 1. 1. 1. 1. 1. 1.]


Test set (`X_test_scaled_in_range`):

In [14]:
print("Min Value: ", X_test_scaled_in_range.min(axis=0))
print("Max Value: ", X_test_scaled_in_range.max(axis=0))

Min Value:  [0.0025663  0.0268728  0.         0.00258398 0.         0.
 0.09407666 0.00525126 0.01317578]
Max Value:  [0.98406411 0.93585686 1.         1.00258398 0.99054374 0.99726776
 1.         0.97823052 0.96538743]
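
One test-set maximum above is slightly greater than 1 (1.00258398). This happens when a test value exceeds the largest value seen during training, because the minimum and maximum come from the training set alone. The sketch below shows the same effect for one feature, using hypothetical humidity values and illustrative helper names.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical humidity readings, not taken from the wildfire dataset.
train_humidity = [20.0, 50.0, 80.0]
test_humidity = [35.0, 86.0]

lo, hi = fit_min_max(train_humidity)
train_scaled = apply_min_max(train_humidity, lo, hi)
test_scaled = apply_min_max(test_humidity, lo, hi)

print(train_scaled)  # stays inside [0, 1]
print(test_scaled)   # 86.0 is above the training maximum, so it maps above 1
```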


### Apply Machine Learning Algorithm - Train
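In this part, we fit a logistic regression classifier on the training set. Note that the model is fit on the unscaled `X_train`; the scaled arrays computed in the previous section are not used in this step.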

In [15]:
clf_model=LogisticRegression(max_iter=3000)
clf_model.fit(X_train,y_train)

LogisticRegression(max_iter=3000)
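
The model above is fit on the unscaled features. If scaling were to be part of training, one possible arrangement is a scikit-learn pipeline that chains the scaler and the classifier, so the scaler is always fit on the training data only. The sketch below is a minimal example on synthetic stand-in data from `make_classification`, not on the wildfire data; the names `X_demo`, `y_demo`, and `scaled_model` are introduced here for illustration, and its accuracy says nothing about the wildfire results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 9 features; in the notebook this would be X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data, then the classifier
# on the scaled training data.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=3000))
scaled_model.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test data before predicting.
demo_accuracy = scaled_model.score(X_te, y_te)
print(demo_accuracy)
```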

### Apply Machine Learning Algorithm - Predict
Now we will use the classifier model (`clf_model`) and apply it to new data (`X_test`) in order to predict its labels.

In [16]:
y_pred=clf_model.predict(X_test)

Results:

In [17]:
resDF=pd.DataFrame({"Actual":y_test,"Predicted":y_pred})

# Mark each prediction as correct (1) or incorrect (0).
resDF["correct"]=(resDF["Actual"]==resDF["Predicted"]).astype(int)
resDF

Unnamed: 0,Actual,Predicted,correct
27891,0,0,1
21167,0,0,1
7616,1,1,1
231,0,0,1
4159,1,0,0
...,...,...,...
1426,0,0,1
928,0,0,1
2754,1,0,0
13399,0,0,1
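
The `correct` column can be broken down further into the four cells of a confusion matrix. The sketch below counts them for the nine rows visible in the table above only; the full test set is much larger, so these counts are illustrative and are not the model's actual confusion matrix. For the full data, `metrics.confusion_matrix(y_test, y_pred)` gives the same breakdown.

```python
from collections import Counter

# The nine (actual, predicted) pairs visible in the table above.
visible_rows = [
    (0, 0), (0, 0), (1, 1), (0, 0), (1, 0),
    (0, 0), (0, 0), (1, 0), (0, 0),
]

# Name each (actual, predicted) combination.
cell_names = {
    (1, 1): "true_positive",
    (0, 0): "true_negative",
    (0, 1): "false_positive",
    (1, 0): "false_negative",
}

confusion = Counter(cell_names[pair] for pair in visible_rows)
print(dict(confusion))
```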


In [18]:
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy_percentage = 100 * accuracy
print(accuracy_percentage)

67.6800748362956
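
To put this accuracy in context, it helps to compare it with a baseline that always predicts the majority class (0, not caused by weather). The sketch below computes that baseline from the test-set class counts printed in the splitting section; the variable names are introduced here for illustration.

```python
# Test-set class counts from the splitting section's output.
test_negatives, test_positives = 2840, 1436

# Accuracy printed above, converted from a percentage to a fraction.
reported_accuracy = 67.6800748362956 / 100

# A classifier that always predicts the majority class gets this accuracy.
baseline_accuracy = max(test_negatives, test_positives) / (
    test_negatives + test_positives
)
improvement_points = 100 * (reported_accuracy - baseline_accuracy)

print(f"baseline: {baseline_accuracy:.2%}")
print(f"model:    {reported_accuracy:.2%}")
print(f"gain:     {improvement_points:.2f} percentage points")
```

The gain over the baseline is small, so accuracy alone gives limited evidence that the model has learned much from the weather features.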
