# Wildfires Caused by the Weather
## Part 4: Machine Learning

In this part, we use machine learning to test the findings of our research.<br>
The type of machine learning we use is <b>supervised learning</b>.<br>
The task is a <b>classification problem</b>: the label <code>CausedByWeather</code> is either 0 or 1.<br>
The algorithm we use to make predictions is <b>KNeighborsClassifier</b>.<br>
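
As a rough illustration of how a k-nearest-neighbors classifier works, the sketch below fits one on a tiny made-up data set. The names `X_toy`, `y_toy` and `toy_knn` and all values are hypothetical and unrelated to the wildfire data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): two features per sample and a binary label.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]

# Each prediction is a majority vote among the 3 closest training points.
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

# One query point near each cluster.
toy_predictions = toy_knn.predict([[0.05, 0.05], [1.05, 1.0]])
print(toy_predictions)
```

Because the vote depends on distances between points, features measured on larger numeric scales weigh more heavily than the others unless the data is scaled first.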

#### Preliminary step: import modules (packages)
This step is necessary in order to use external packages. 

In [1]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

#### Global variables and constants
Here we define the global variables and constants used throughout this notebook.

In [2]:
FIRE_HISTORY_CSV = "Wildfire_history_final.csv"
NUMERIC_COLS = ['InitialLatitude','InitialLongitude','FireDuration','Temperature', 'MaxTemperature', 'MinTemperature','WindSpeed','WindDirection','Humidity','CausedByWeather']
TO_PREDICT_COL = 'CausedByWeather'

### Data Preparation
In this part we load our final CSV into a data frame. <br>
From it, we create a new data frame containing only the relevant numeric columns. <br>
We then prepare TRAINING_FEATURES, TARGET_FEATURE, X and y for the data-splitting step. 

In [3]:
df = pd.read_csv(FIRE_HISTORY_CSV)
df.head()

Unnamed: 0,UniqueFireIdentifier,FireDiscoveryDateTime,FireOutDateTime,POOCounty,InitialLatitude,InitialLongitude,FireCause,FireDuration,CausedByWeather,Temperature,MaxTemperature,MinTemperature,WindSpeed,WindDirection,Humidity
0,2020-MTLG42-000224,2020-08-06T18:58:00,2020-08-12T14:00:00,Carter,45.78496,-104.4958,2,6,0,25.9,34.0,16.5,21.9,206.63,46.67
1,2017-MTNWS-000878,2017-10-17T20:20:24,2017-11-09T21:59:59,Flathead,48.07167,-114.8303,2,23,0,6.1,12.9,-3.3,13.9,119.55,64.2
2,2020-MSMNF-000308,2020-11-23T19:17:00,2020-11-30T14:29:59,Perry,31.06819,-89.06972,2,7,0,10.3,22.7,0.6,11.2,104.88,78.07
3,2019-AZA5S-001664,2019-09-05T19:17:00,2019-09-09T17:00:00,Yavapai,34.40333,-112.4394,1,4,1,24.6,33.1,16.8,11.2,197.58,45.06
4,2020-IDNCF-000071,2020-04-20T21:33:59,2020-04-21T03:00:00,Idaho,45.41833,-116.1661,2,1,0,8.6,15.7,0.7,8.9,195.88,53.81


Now we will create a new data frame with only numeric and relevant columns:

In [4]:
df_copy=df[NUMERIC_COLS].copy()
df_copy.head()

Unnamed: 0,InitialLatitude,InitialLongitude,FireDuration,Temperature,MaxTemperature,MinTemperature,WindSpeed,WindDirection,Humidity,CausedByWeather
0,45.78496,-104.4958,6,25.9,34.0,16.5,21.9,206.63,46.67,0
1,48.07167,-114.8303,23,6.1,12.9,-3.3,13.9,119.55,64.2,0
2,31.06819,-89.06972,7,10.3,22.7,0.6,11.2,104.88,78.07,0
3,34.40333,-112.4394,4,24.6,33.1,16.8,11.2,197.58,45.06,1
4,45.41833,-116.1661,1,8.6,15.7,0.7,8.9,195.88,53.81,0


In [5]:
TRAINING_FEATURES = df_copy.columns[df_copy.columns != TO_PREDICT_COL]
TARGET_FEATURE    = TO_PREDICT_COL

X = df_copy[TRAINING_FEATURES]
y = df_copy[TARGET_FEATURE]

In [6]:
X.head()

Unnamed: 0,InitialLatitude,InitialLongitude,FireDuration,Temperature,MaxTemperature,MinTemperature,WindSpeed,WindDirection,Humidity
0,45.78496,-104.4958,6,25.9,34.0,16.5,21.9,206.63,46.67
1,48.07167,-114.8303,23,6.1,12.9,-3.3,13.9,119.55,64.2
2,31.06819,-89.06972,7,10.3,22.7,0.6,11.2,104.88,78.07
3,34.40333,-112.4394,4,24.6,33.1,16.8,11.2,197.58,45.06
4,45.41833,-116.1661,1,8.6,15.7,0.7,8.9,195.88,53.81


In [7]:
y.head()

0    0
1    0
2    0
3    1
4    0
Name: CausedByWeather, dtype: int64

### Splitting the Data
In this part we split our data into a training set and a test set. <br>
We hold out 20% of the data for testing and keep the remaining 80% for training.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print("Initial amount of samples: #{}".format(X.shape[0]))
print("Number of training samples: #{}".format(X_train.shape[0]))
print("Number of test samples: #{}".format(X_test.shape[0]))

print("\nTarget distribution in original dataset:\n{}".format(y.value_counts()))
print("\nTarget distribution in the training set:\n{}\n".format(y_train.value_counts()))
print("Target distribution in the test set:\n{}".format(y_test.value_counts()))

Initial amount of samples: #42759
Number of training samples: #34207
Number of test samples: #8552

Target distribution in original dataset:
0    28073
1    14686
Name: CausedByWeather, dtype: int64

Target distribution in the training set:
0    22470
1    11737
Name: CausedByWeather, dtype: int64

Target distribution in the test set:
0    5603
1    2949
Name: CausedByWeather, dtype: int64
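
The random split above happened to give similar class shares in both sets (roughly 66% of class 0 in each). If that balance needs to be guaranteed rather than left to chance, `train_test_split` accepts a `stratify` argument. The sketch below shows it on synthetic data; `X_demo`, `y_demo` and the other names are stand-ins for illustration, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y, with roughly one third positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 3))
y_demo = (rng.random(3000) < 0.34).astype(int)

# stratify=y_demo keeps the class proportions (nearly) identical in both parts.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

train_ratio = yd_train.mean()
test_ratio = yd_test.mean()
print("positive share in train:", train_ratio)
print("positive share in test: ", test_ratio)
```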


### Scaling (MinMax and Standard Scalers)
Now that we have our training and test sets, it is time to preprocess the data. Each scaler is fit on the training set only and then applied to the test set.

#### Standard Scale

In [9]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled

In [10]:
print("Mean: ", X_train_scaled.mean(axis=0))
print("Standard Deviation: ", X_train_scaled.std(axis=0))
X_train_scaled

Mean:  [-8.46764540e-16 -1.72406370e-16 -1.06975037e-17 -3.30895600e-16
  4.11698103e-16 -1.95878563e-16  8.60993259e-17 -7.06658399e-16
 -1.22865504e-16]
Standard Deviation:  [1. 1. 1. 1. 1. 1. 1. 1. 1.]


array([[ 0.50579337, -0.3906906 , -0.89536798, ..., -0.62636099,
        -0.37510522,  0.3837452 ],
       [ 0.2600988 , -0.82346572,  0.42614889, ...,  0.34605114,
         1.41398213, -0.33779933],
       [ 0.42248198,  1.82007075, -0.89536798, ...,  0.64372832,
         1.88926116,  0.76783735],
       ...,
       [ 1.03345124, -0.33212965, -0.51779173, ..., -0.5866707 ,
        -0.00226773, -1.01544762],
       [-0.14788378,  0.3735255 , -0.51779173, ...,  0.08806425,
        -0.04452265,  0.01117582],
       [-1.12310081, -0.28023449,  0.61493701, ..., -1.47970224,
        -0.28118568, -1.1559156 ]])

X_test_scaled

In [11]:
print("Mean: ", X_test_scaled.mean(axis=0))
print("Standard Deviation: ", X_test_scaled.std(axis=0))

Mean:  [ 3.68666806e-05 -4.77232707e-03  2.83074972e-02  4.98362875e-03
  1.99951018e-03  9.38913701e-03  2.05392009e-03 -8.93930570e-03
 -2.58525684e-03]
Standard Deviation:  [0.99941715 1.0141077  1.01900024 1.00590762 1.01339049 0.99587841
 1.00709897 0.9904357  0.99046764]
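
The test-set means above are close to 0 but not exactly 0, and the standard deviations are close to 1 but not exactly 1. This is expected, because the scaler uses the mean and standard deviation learned from the training set. The sketch below reproduces the transformation by hand on synthetic data (`train_demo` and `test_demo` are made-up arrays) to show what `StandardScaler` computes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a training and a test matrix.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=20.0, scale=5.0, size=(500, 2))
test_demo = rng.normal(loc=20.0, scale=5.0, size=(100, 2))

demo_scaler = StandardScaler().fit(train_demo)

# Manual z-score: subtract the TRAINING mean, divide by the TRAINING std.
manual_test = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

same = np.allclose(manual_test, demo_scaler.transform(test_demo))
print("manual z-score matches StandardScaler:", same)
print("test mean after scaling:", demo_scaler.transform(test_demo).mean(axis=0))
```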


#### MinMax scale

In [12]:
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled_in_range = min_max_scaler.fit_transform(X_train)
X_test_scaled_in_range = min_max_scaler.transform(X_test)

X_train_scaled_in_range

In [13]:
print("Min Value: ", X_train_scaled_in_range.min(axis=0))
print("Max Value: ", X_train_scaled_in_range.max(axis=0))

Min Value:  [0. 0. 0. 0. 0. 0. 0. 0. 0.]
Max Value:  [1. 1. 1. 1. 1. 1. 1. 1. 1.]


X_test_scaled_in_range

In [14]:
print("Min Value: ", X_test_scaled_in_range.min(axis=0))
print("Max Value: ", X_test_scaled_in_range.max(axis=0))

Min Value:  [0.0050433  0.02298946 0.         0.00257732 0.         0.
 0.05923345 0.00179922 0.01317578]
Max Value:  [0.99736627 0.9434093  1.         0.98195876 0.99763593 1.
 1.         1.00249892 0.99247098]
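
One of the test-set maxima above is slightly greater than 1 (1.00249892). This can happen because the scaler maps the training minimum and maximum to 0 and 1, and a test value outside the training range lands outside [0, 1]. A minimal sketch with made-up values (`train_col` and `test_col` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature: the training range is 0..360.
train_col = np.array([[0.0], [90.0], [180.0], [360.0]])
# The second test value lies just above the training maximum.
test_col = np.array([[45.0], [361.0]])

mm = MinMaxScaler(feature_range=(0, 1)).fit(train_col)
scaled_test = mm.transform(test_col)

# Values inside the training range map into [0, 1]; the other exceeds 1.
print(scaled_test.ravel())
```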


### Finding the best K
In this part, we use cross-validation to find the best K for our algorithm. <br>
We keep K below the square root of the number of training samples, which is about 184 (the square root of 34207 is 184.95...). The search below tries the odd values of K from 1 to 149.

In [17]:
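# Candidate values: odd K from 1 to 149 (an odd K avoids tied votes between two classes).
parameters = {'n_neighbors': range(1, 150, 2)}
knn = KNeighborsClassifier()
# Score every candidate K by its cross-validated accuracy.
clf = GridSearchCV(knn, parameters, scoring=make_scorer(metrics.accuracy_score, greater_is_better=True))
clf.fit(X_train, y_train)

print("best parameter set is:", clf.best_params_, " and its score was", clf.best_score_)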

best parameter set is: {'n_neighbors': 15}  and its score was 0.7651357743331096
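
The search above is fit on the raw features. One possible variation, not what this notebook ran, is to put a scaler and the classifier into a `Pipeline` so that scaling is re-fit inside every cross-validation fold. The sketch below uses synthetic data from `make_classification`; the step names `"scale"` and `"knn"` are arbitrary choices, and the result says nothing about the wildfire data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features, like the notebook's feature matrix.
X_syn, y_syn = make_classification(n_samples=400, n_features=9, random_state=1)

# The scaler is fit only on the training part of each fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": range(1, 20, 2)}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

With this layout, `search.predict` applies the same scaling to new data automatically, so there is no separate scaled copy of the test set to keep track of.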


### Apply the Algorithm
In this part we predict on the test set using the K value found above. Because `clf` was fit on the unscaled `X_train`, we pass the unscaled `X_test` as well.

In [21]:
y_pred=clf.predict(X_test)
print(metrics.confusion_matrix(y_true = y_test, y_pred = y_pred))
print('Accuracy = ', metrics.accuracy_score(y_true = y_test, y_pred = y_pred))

[[4676  927]
 [1046 1903]]
Accuracy =  0.7692937324602432
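
To read the confusion matrix in more detail, the sketch below recomputes accuracy from the four printed counts and adds precision and recall for class 1 (fires caused by weather). It also computes the accuracy of always predicting class 0, as a baseline. Rows are true classes and columns are predicted classes, which is the convention of `metrics.confusion_matrix`.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[4676, 927],
               [1046, 1903]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # predicted weather-caused fires that truly are
recall = tp / (tp + fn)              # true weather-caused fires that were found
baseline = cm[0].sum() / cm.sum()    # accuracy of always predicting class 0

print("accuracy :", round(accuracy, 4))
print("precision:", round(precision, 4))
print("recall   :", round(recall, 4))
print("baseline :", round(baseline, 4))
```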
