### Classify size_category using SVM
##### month	month of the year: 'jan' to 'dec'
##### day	day of the week: 'mon' to 'sun'
##### FFMC	FFMC index from the FWI system: 18.7 to 96.20
##### DMC	DMC index from the FWI system: 1.1 to 291.3
##### DC	DC index from the FWI system: 7.9 to 860.6
##### ISI	ISI index from the FWI system: 0.0 to 56.10
##### temp	temperature in Celsius degrees: 2.2 to 33.30
##### RH	relative humidity in %: 15.0 to 100
##### wind	wind speed in km/h: 0.40 to 9.40
##### rain	outside rain in mm/m2 : 0.0 to 6.4
##### size_category	the burned-area class of the forest fire ('small' or 'large')

In [1]:
# Importing the libraries
import pandas as pd
from sklearn.svm import SVC
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
# Reading the dataset
data = pd.read_csv('forestfires.csv')
data.sample()
data.columns

Index(['month', 'day', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind',
       'rain', 'area', 'dayfri', 'daymon', 'daysat', 'daysun', 'daythu',
       'daytue', 'daywed', 'monthapr', 'monthaug', 'monthdec', 'monthfeb',
       'monthjan', 'monthjul', 'monthjun', 'monthmar', 'monthmay', 'monthnov',
       'monthoct', 'monthsep', 'size_category'],
      dtype='object')

In [3]:
data.head()

Unnamed: 0,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,monthfeb,monthjan,monthjul,monthjun,monthmar,monthmay,monthnov,monthoct,monthsep,size_category
0,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0,0,0,0,1,0,0,0,0,small
1,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0,0,0,0,0,0,0,1,0,small
2,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0,0,0,0,0,0,0,1,0,small
3,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0,0,0,0,1,0,0,0,0,small
4,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0,0,0,0,1,0,0,0,0,small


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 31 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   month          517 non-null    object 
 1   day            517 non-null    object 
 2   FFMC           517 non-null    float64
 3   DMC            517 non-null    float64
 4   DC             517 non-null    float64
 5   ISI            517 non-null    float64
 6   temp           517 non-null    float64
 7   RH             517 non-null    int64  
 8   wind           517 non-null    float64
 9   rain           517 non-null    float64
 10  area           517 non-null    float64
 11  dayfri         517 non-null    int64  
 12  daymon         517 non-null    int64  
 13  daysat         517 non-null    int64  
 14  daysun         517 non-null    int64  
 15  daythu         517 non-null    int64  
 16  daytue         517 non-null    int64  
 17  daywed         517 non-null    int64  
 18  monthapr  

In [5]:
# Listing the rows where 'dayfri' is non-zero to confirm that 'day' has already been one-hot encoded
data.iloc[np.flatnonzero(data['dayfri'])][['day','dayfri']]

Unnamed: 0,day,dayfri
0,fri,1
3,fri,1
12,fri,1
15,fri,1
26,fri,1
...,...,...
506,fri,1
507,fri,1
508,fri,1
509,fri,1


In [6]:
# Listing the rows where 'monthapr' is non-zero to confirm that 'month' has already been one-hot encoded
data.iloc[np.flatnonzero(data['monthapr'])][['month','monthapr']]

Unnamed: 0,month,monthapr
19,apr,1
176,apr,1
196,apr,1
239,apr,1
240,apr,1
241,apr,1
442,apr,1
469,apr,1
470,apr,1
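
The two spot checks above look at one indicator column each. A minimal sketch of checking every indicator column in one pass is shown below. It runs on a small toy frame rather than `forestfires.csv`, and the helper name `encoding_is_consistent` is hypothetical; the only thing taken from the notebook is the `day<value>` column naming.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({"day": ["fri", "mon", "fri", "sun"]})

# Build indicator columns using the same 'day<value>' naming as the dataset.
dummies = pd.get_dummies(toy["day"], prefix="day", prefix_sep="").astype(int)
toy = pd.concat([toy, dummies], axis=1)


def encoding_is_consistent(df, source_col):
    """Return True if every '<source_col><value>' column is 1 exactly where
    source_col equals that value, and 0 elsewhere (hypothetical helper)."""
    for value in df[source_col].unique():
        expected = (df[source_col] == value).astype(int)
        actual = df[f"{source_col}{value}"].astype(int)
        if not (expected == actual).all():
            return False
    return True


print(encoding_is_consistent(toy, "day"))
```

On the real data the same helper could be called once with `'day'` and once with `'month'` before the original columns are dropped.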


In [7]:
# Dropping the categorical columns 'month' and 'day', since their one-hot encoded versions are already present
data.drop(['month','day'], axis=1, inplace=True)
data.head()

Unnamed: 0,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,dayfri,...,monthfeb,monthjan,monthjul,monthjun,monthmar,monthmay,monthnov,monthoct,monthsep,size_category
0,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,1,...,0,0,0,0,1,0,0,0,0,small
1,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,0,...,0,0,0,0,0,0,0,1,0,small
2,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,0,...,0,0,0,0,0,0,0,1,0,small
3,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,1,...,0,0,0,0,1,0,0,0,0,small
4,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0,...,0,0,0,0,1,0,0,0,0,small


### Separating the data set into X and y variables

In [8]:
# Independent variables: every column except the last one (size_category)
X = data.iloc[:,:-1]
X.head()

Unnamed: 0,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,dayfri,...,monthdec,monthfeb,monthjan,monthjul,monthjun,monthmar,monthmay,monthnov,monthoct,monthsep
0,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,1,...,0,0,0,0,0,1,0,0,0,0
1,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,0,...,0,0,0,0,0,0,0,0,1,0
2,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,0,...,0,0,0,0,0,0,0,0,1,0
3,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,1,...,0,0,0,0,0,1,0,0,0,0
4,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0,...,0,0,0,0,0,1,0,0,0,0


In [9]:
# Dependent variable: the last column (size_category)
y = data.iloc[:, -1]
y.head()

0    small
1    small
2    small
3    small
4    small
Name: size_category, dtype: object
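
Positional slicing keeps every remaining column in `X`, including `area`. The sketch below selects the columns by name on a toy frame and optionally excludes `area`. Treating `area` as leaking the label is an assumption: the notebook does not say how `size_category` was produced, but the column description suggests it summarises the burned area. The toy values and the names `target` and `leaky` are illustrative only.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 22.1],
    "area": [0.0, 0.0, 12.5, 30.1],
    "size_category": ["small", "small", "large", "large"],
})

target = "size_category"
# Assumption: 'area' may directly determine the label, so it is left out here.
leaky = ["area"]

X_toy = toy.drop(columns=[target] + leaky)
y_toy = toy[target]

print(list(X_toy.columns))
print(y_toy.value_counts().to_dict())
```

Checking `y.value_counts()` on the real data is also a quick way to see how imbalanced the two classes are before reading an accuracy score.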

### Hyperparameter tuning with RandomizedSearchCV

In [10]:
# Tuning hyperparameters using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Note: 'gamma' only affects the 'rbf' kernel; SVC ignores it when kernel='linear'
params = {'kernel':['linear','rbf'],'gamma':[0.5,0.1,0.01],'C':[10,0.1,0.01,0.001]}
model = SVC()
# n_iter=10 and cv=5 are the defaults, so 10 candidates x 5 folds = 50 fits are run.
randomCV = RandomizedSearchCV(model, params, verbose = 300, random_state=45)
randomCV.fit(X,y)
print(randomCV.best_params_)
print(randomCV.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5; 1/10] START C=0.01, gamma=0.01, kernel=linear..........................
[CV 1/5; 1/10] END C=0.01, gamma=0.01, kernel=linear;, score=0.990 total time=   0.0s
[CV 2/5; 1/10] START C=0.01, gamma=0.01, kernel=linear..........................
[CV 2/5; 1/10] END C=0.01, gamma=0.01, kernel=linear;, score=1.000 total time=   0.0s
[CV 3/5; 1/10] START C=0.01, gamma=0.01, kernel=linear..........................
[CV 3/5; 1/10] END C=0.01, gamma=0.01, kernel=linear;, score=1.000 total time=   0.0s
[CV 4/5; 1/10] START C=0.01, gamma=0.01, kernel=linear..........................
[CV 4/5; 1/10] END C=0.01, gamma=0.01, kernel=linear;, score=0.942 total time=   0.0s
[CV 5/5; 1/10] START C=0.01, gamma=0.01, kernel=linear..........................
[CV 5/5; 1/10] END C=0.01, gamma=0.01, kernel=linear;, score=0.981 total time=   0.0s
[CV 1/5; 2/10] START C=0.001, gamma=0.1, kernel=linear..........................
[CV 1/5; 2/10] END C=0.

{'kernel': 'linear', 'gamma': 0.1, 'C': 10}
0.9922330097087378
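
The search above is fitted on all of `X` and `y`, uses unscaled features, and samples `gamma` even for the linear kernel, where it has no effect. The sketch below shows one possible variant, not the notebook's method: a held-out test split is made first, a `StandardScaler` is placed in a `Pipeline` with the `SVC`, and the search space is given as a list of dicts so that `gamma` is only sampled for `rbf`. It runs on synthetic data from `make_classification`; the names `pipe`, `search`, `X_syn` and `y_syn` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the forest-fire features and labels.
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out a test set before tuning so the final score is not tuned against.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own
# training part only.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Same candidate values as the notebook, but gamma is only listed for 'rbf'.
param_distributions = [
    {"svc__kernel": ["linear"], "svc__C": [10, 0.1, 0.01, 0.001]},
    {
        "svc__kernel": ["rbf"],
        "svc__C": [10, 0.1, 0.01, 0.001],
        "svc__gamma": [0.5, 0.1, 0.01],
    },
]

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, cv=5, random_state=45
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy on the held-out split
```

Because `refit=True` is the default, `search` can be used directly as the final model, which avoids retyping the best parameters by hand.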


In [25]:
# Building the model with the best parameters found by RandomizedSearchCV
# (gamma is passed for completeness; it has no effect with kernel='linear')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=42, stratify=y )
final_model = SVC(kernel='linear', gamma=0.1, C=10)
final_model.fit(X_train,y_train)
y_predict = final_model.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print('Accuracy of SVM model: ', accuracy)
print('Confusion_Matrix:')
confusion_matrix(y_test, y_predict)

Accuracy of SVM model:  0.9807692307692307
Confusion_Matrix:


array([[26,  2],
       [ 0, 76]], dtype=int64)
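
The raw array does not show which row and column belong to which class. The sketch below rebuilds the same counts with explicit labels and adds a per-class report. It assumes the two class names are `'large'` and `'small'` and that the rows of the matrix above follow scikit-learn's default sorted label order; the `y_true` and `y_pred` lists are reconstructed from the printed counts, not taken from the actual test split.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assumed class names, in the sorted order scikit-learn uses by default.
labels = ["large", "small"]

# Reconstructed from the printed matrix: 26 large predicted large,
# 2 large predicted small, 76 small predicted small.
y_true = ["large"] * 28 + ["small"] * 76
y_pred = ["large"] * 26 + ["small"] * 2 + ["small"] * 76

cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)

print(cm_df)
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, labels=labels))
```

Under these assumptions both misclassified rows are large fires predicted as small, which the per-class recall in the report makes visible.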

In [29]:
# Predicting a single row after tweaking a few of its values, then comparing the result with the actual label.
unseen_data = X_test.iloc[10]

In [33]:
unseen_class = y_test.iloc[10]
unseen_class

'small'

In [30]:
unseen_data

FFMC         93.0
DMC          75.3
DC          466.6
ISI           7.7
temp         18.8
RH           35.0
wind          4.9
rain          0.0
area          0.0
dayfri        0.0
daymon        0.0
daysat        0.0
daysun        0.0
daythu        1.0
daytue        0.0
daywed        0.0
monthapr      0.0
monthaug      1.0
monthdec      0.0
monthfeb      0.0
monthjan      0.0
monthjul      0.0
monthjun      0.0
monthmar      0.0
monthmay      0.0
monthnov      0.0
monthoct      0.0
monthsep      0.0
Name: 62, dtype: float64

In [31]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
unseen_data = unseen_data.copy()
unseen_data['temp'] = 20.0
unseen_data['FFMC'] = 95.0
unseen_data['ISI'] = 4.0
unseen_data['DC'] = 500.0
unseen_data

FFMC         95.0
DMC          75.3
DC          500.0
ISI           4.0
temp         20.0
RH           35.0
wind          4.9
rain          0.0
area          0.0
dayfri        0.0
daymon        0.0
daysat        0.0
daysun        0.0
daythu        1.0
daytue        0.0
daywed        0.0
monthapr      0.0
monthaug      1.0
monthdec      0.0
monthfeb      0.0
monthjan      0.0
monthjul      0.0
monthjun      0.0
monthmar      0.0
monthmay      0.0
monthnov      0.0
monthoct      0.0
monthsep      0.0
Name: 62, dtype: float64

In [39]:
# Pass the row as a one-row DataFrame so the feature names match those seen during fit
unseen_data_class = final_model.predict(unseen_data.to_frame().T)
unseen_data_class



array(['small'], dtype=object)
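
The single-row experiment can also be written so that the row stays a one-row DataFrame from the start, which keeps the column names and avoids any reshaping. The sketch below is a minimal version on toy data: the two features, their values, the labels and the names `train`, `clf` and `row` are all made up for illustration and say nothing about the real relationship between weather and fire size.

```python
import pandas as pd
from sklearn.svm import SVC

# Toy training data with two well-separated groups (illustrative values only).
train = pd.DataFrame({"temp": [5.0, 8.0, 25.0, 30.0], "RH": [90, 80, 30, 20]})
labels = ["small", "small", "large", "large"]

clf = SVC(kernel="linear").fit(train, labels)

# iloc[[0]] (double brackets) returns a one-row DataFrame instead of a Series;
# .copy() makes it safe to modify.
row = train.iloc[[0]].copy()
row.loc[:, "temp"] = 6.0

pred = clf.predict(row)
print(pred)
```

With the notebook's objects, the equivalent starting point would be `X_test.iloc[[10]].copy()`.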