EE338 Group 24
#### This is the classification code for the Seismic Bumps dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/seismic-bumps). We compare three classification methods and keep the one that works best for this scenario. The dataset was first converted to a CSV file so that it can be loaded easily with pandas.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

We will explore KNN, Random Forest and Logistic Regression for our project.

#### Attribute Information:
1. **seismic**: result of shift seismic hazard assessment in the mine working obtained by the seismic
method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
2. **seismoacoustic**: result of shift seismic hazard assessment in the mine working obtained by the
seismoacoustic method;
3. **shift**: information about the type of shift (W - coal-getting, N - preparation shift);
4. **genergy**: seismic energy recorded within previous shift by the most active geophone (GMax) out of
geophones monitoring the longwall;
5. **gpuls**: a number of pulses recorded within previous shift by GMax;
6. **gdenergy**: a deviation of energy recorded within previous shift by GMax from average energy recorded
during eight previous shifts;
7. **gdpuls**: a deviation of a number of pulses recorded within previous shift by GMax from average number
of pulses recorded during eight previous shifts;
8. **ghazard**: result of shift seismic hazard assessment in the mine working obtained by the
seismoacoustic method based on registration coming from GMax only;
9. **nbumps**: the number of seismic bumps recorded within previous shift;
10. **nbumps2**: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
11. **nbumps3**: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
12. **nbumps4**: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
13. **nbumps5**: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;
14. **nbumps6**: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
15. **nbumps7**: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
16. **nbumps89**: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
17. **energy**: total energy of seismic bumps registered within previous shift;
18. **maxenergy**: the maximum energy of the seismic bumps registered within previous shift;
19. **class**: the decision attribute - '1' means that high energy seismic bump occurred in the next shift
('hazardous state'), '0' means that no high energy seismic bumps occurred in the next shift
('non-hazardous state').
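
The attribute list above can be attached to the data as column names. The following is a minimal sketch: the `COLUMN_NAMES` constant is our own name, not part of the original notebook, and the two sample rows are copied from the dataset preview printed by the loading cell. With the real file, the same list could be passed as `names=COLUMN_NAMES` to `pd.read_csv` alongside `header=None`.

```python
import pandas as pd

# Column names in the order of the attribute list (hypothetical constant name).
COLUMN_NAMES = [
    "seismic", "seismoacoustic", "shift", "genergy", "gpuls", "gdenergy",
    "gdpuls", "ghazard", "nbumps", "nbumps2", "nbumps3", "nbumps4",
    "nbumps5", "nbumps6", "nbumps7", "nbumps89", "energy", "maxenergy",
    "class",
]

# Two sample rows taken from the dataset preview, used instead of the full file.
sample_rows = [
    ["a", "a", "N", 15180, 48, -72, -72, "a", 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ["a", "a", "N", 14720, 33, -70, -79, "a", 1, 0, 1, 0, 0, 0, 0, 0, 2000, 2000, 0],
]
named_df = pd.DataFrame(sample_rows, columns=COLUMN_NAMES)

# Columns can now be selected by name rather than by position.
categorical_cols = ["seismic", "seismoacoustic", "shift", "ghazard"]
print(named_df[categorical_cols])
print(named_df["class"].tolist())
```

The four names in `categorical_cols` correspond to column positions 0, 1, 2 and 7, which are the columns the notebook label-encodes.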

In [2]:
model_accuracies = {'LR': 0, 'RF': 0, 'KNN': 0}
df = pd.read_csv('https://drive.google.com/uc?export=download&id=1EWQI6RC1a_QjNgH3_MBB88cxCIhpQ641', header = None)
display(df)

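|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | a | N | 15180 | 48 | -72 | -72 | a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | a | a | N | 14720 | 33 | -70 | -79 | a | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2000 | 2000 | 0 |
| 2 | a | a | N | 8050 | 30 | -81 | -78 | a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | a | a | N | 28820 | 171 | -23 | 40 | a | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3000 | 3000 | 0 |
| 4 | a | a | N | 12640 | 57 | -63 | -52 | a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2579 | b | a | W | 81410 | 785 | 432 | 151 | b | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2580 | b | a | W | 42110 | 555 | 213 | 118 | a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2581 | b | a | W | 26960 | 540 | 101 | 112 | a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2582 | a | a | W | 16130 | 322 | 2 | 2 | a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |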


In [3]:
# Features are the first 18 columns; the label (class) is the last column.
X = df.iloc[:, 0:18].values
y = df.iloc[:, 18].values

# Encode the label as integers.
le_y = LabelEncoder()
y = le_y.fit_transform(y)

# Encode the four categorical feature columns
# (seismic, seismoacoustic, shift, ghazard) as integers.
le_X = LabelEncoder()
for col in (0, 1, 2, 7):
    X[:, col] = le_X.fit_transform(X[:, col])
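
To make the encoding step above easier to inspect, here is a small sketch of what `LabelEncoder` does to a categorical column. It assumes toy values in place of the real columns, and the `encoders` dictionary is a hypothetical helper. Keeping one fitted encoder per column preserves each mapping, whereas reusing a single `le_X` keeps only the mapping of the last column it was fitted on.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the four categorical columns (indices 0, 1, 2 and 7).
toy = np.array([
    ["a", "a", "N", "a"],
    ["b", "a", "W", "b"],
    ["a", "b", "W", "a"],
], dtype=object)

encoders = {}  # hypothetical helper: column index -> fitted LabelEncoder
for col in range(toy.shape[1]):
    enc = LabelEncoder()
    toy[:, col] = enc.fit_transform(toy[:, col])
    encoders[col] = enc

# classes_ lists the original labels in sorted order; the position is the integer code.
for col, enc in encoders.items():
    print(col, list(enc.classes_))

# A stored encoder can map integer codes back to the original labels.
print(encoders[2].inverse_transform([0, 1]))
```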

In [4]:
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
print(X)

[[-0.73230209 -0.77142023 -1.34374329 ...  0.         -0.24332671
  -0.22108685]
 [-0.73230209 -0.77142023 -1.34374329 ...  0.         -0.14551225
  -0.11774749]
 [-0.73230209 -0.77142023 -1.34374329 ...  0.         -0.24332671
  -0.22108685]
 ...
 [ 1.36555667 -0.77142023  0.74418976 ...  0.         -0.24332671
  -0.22108685]
 [-0.73230209 -0.77142023  0.74418976 ...  0.         -0.24332671
  -0.22108685]
 [-0.73230209 -0.77142023  0.74418976 ...  0.         -0.24332671
  -0.22108685]]


In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
display(X_train, X_train.shape)
display(X_test, X_test.shape)

array([[-0.73230209, -0.77142023, -1.34374329, ...,  0.        ,
        -0.24332671, -0.22108685],
       [-0.73230209, -0.77142023, -1.34374329, ...,  0.        ,
        -0.24332671, -0.22108685],
       [ 1.36555667, -0.77142023,  0.74418976, ...,  0.        ,
         1.08205916,  0.8123068 ],
       ...,
       [-0.73230209,  1.12339905, -1.34374329, ...,  0.        ,
        -0.24332671, -0.22108685],
       [-0.73230209,  1.12339905, -1.34374329, ...,  0.        ,
         0.05011666,  0.08893124],
       [-0.73230209, -0.77142023, -1.34374329, ...,  0.        ,
        -0.24332671, -0.22108685]])

(2067, 18)

array([[ 1.36555667,  1.12339905,  0.74418976, ...,  0.        ,
        -0.24332671, -0.22108685],
       [-0.73230209, -0.77142023, -1.34374329, ...,  0.        ,
        -0.22376382, -0.20041898],
       [-0.73230209, -0.77142023,  0.74418976, ...,  0.        ,
        -0.24332671, -0.22108685],
       ...,
       [-0.73230209, -0.77142023,  0.74418976, ...,  0.        ,
        -0.24332671, -0.22108685],
       [-0.73230209,  1.12339905,  0.74418976, ...,  0.        ,
        -0.16507514, -0.16941717],
       [ 1.36555667,  1.12339905,  0.74418976, ...,  0.        ,
        -0.24332671, -0.22108685]])

(517, 18)
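
Cell 4 fits `StandardScaler` on all rows before the split in cell 5, so the test rows influence the scaling statistics. One common alternative, sketched below, is to split first and let a pipeline fit the scaler on the training rows only. The data here is a synthetic, imbalanced stand-in generated with `make_classification` (an assumption made to keep the example runnable, with the same number of rows and features), not the seismic-bumps data, so the printed score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 2584 rows, 18 features, a small positive class.
X_demo, y_demo = make_classification(
    n_samples=2584, n_features=18, weights=[0.93], random_state=0
)

# Split first, so the test rows stay unseen while the scaler is fitted.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
print(X_tr.shape, X_te.shape)
print(pipe.score(X_te, y_te))
```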

### Random Forest Classifier

In [6]:
rf = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))
model_accuracies['RF'] = accuracy_score(Y_test, Y_pred)
print(model_accuracies['RF'])

[[482   6]
 [ 28   1]]
0.9342359767891683


### Logistic Regression

In [7]:
lr = LogisticRegression()
lr.fit(X_train, Y_train)
Y_pred = lr.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))
model_accuracies['LR'] = accuracy_score(Y_test, Y_pred)
print(model_accuracies['LR'])

[[485   3]
 [ 28   1]]
0.9400386847195358


### K Nearest Neighbours

In [8]:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))
model_accuracies['KNN'] = accuracy_score(Y_test, Y_pred)
print(model_accuracies['KNN'])


[[477  11]
 [ 29   0]]
0.9226305609284333


**Logistic Regression**, although the simplest of the three models, gives the highest accuracy (94.0%, against 93.4% for Random Forest and 92.3% for KNN). A different classifier may be preferable when the goal is different, for example minimizing false negatives or false positives.  
All three models score above 90% accuracy, although each correctly flags at most one of the 29 hazardous shifts in the test set.
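
Since the closing remark mentions false negatives and false positives, the sketch below derives those metrics directly from the three confusion matrices printed in cells 6 to 8. The matrices are copied from the notebook output; the helper name `summarize` is ours. Rows are true classes and columns are predicted classes, following scikit-learn's `confusion_matrix` convention.

```python
import numpy as np

# Confusion matrices copied from the outputs of cells 6-8.
matrices = {
    "RF": np.array([[482, 6], [28, 1]]),
    "LR": np.array([[485, 3], [28, 1]]),
    "KNN": np.array([[477, 11], [29, 0]]),
}

def summarize(cm):
    """Return accuracy, recall and precision for the hazardous class (label 1)."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, recall, precision

results = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, rec, prec) in results.items():
    print(f"{name}: accuracy={acc:.4f} recall={rec:.4f} precision={prec:.4f}")

# Reference point: accuracy of always predicting the majority class (0).
cm = matrices["LR"]
baseline = cm[0].sum() / cm.sum()
print(f"majority-class baseline: {baseline:.4f}")
```

On this test split, 488 of the 517 shifts are non-hazardous, so always predicting class 0 would already reach about 94.4% accuracy. Recall on the hazardous class is therefore a useful companion to accuracy when comparing the three models.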