<a href="https://colab.research.google.com/github/JohnMichaelCourville/Forest-Cover-Type-Classification/blob/main/CoverType_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import  InputLayer
from tensorflow.keras.layers import  Dense
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import statistics

Read data and get value counts and percentages of each of the seven classifications

In [3]:
data = pd.read_csv('cover_data.csv')

print("Value counts: \n")
print(data['class'].value_counts());
print("\n")
print('Percentages\n')
percentages = data['class'].value_counts()
perc = percentages.to_frame()

# print(percentages)
perc['class'] = perc['class']/len(data)
print(perc)


Value counts: 

2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: class, dtype: int64


Percentages

      class
2  0.487599
1  0.364605
3  0.061537
7  0.035300
6  0.029891
5  0.016339
4  0.004728


Since class 1 and 2 make up over 80% of the sample, we will run some models using the original sample and some using a sample with a large portion of 1 and 2 removed. 

Below we remove the samples with 1 and 2 classifcations from the data and then remove a large portion of the sample from each set then create the new "data_evened" data frame. 

Classification 1 and 2 only account for ~16% of the new evened data frame.  We will run the models on this data frame at the end to compare with the results of running models on the entire data frame. 

In [4]:
data_1 = data.loc[data['class']==1]
data_2 = data.loc[data['class']==2]

data_1 = data_1.iloc[:-180000,:]
data_2 = data_2.iloc[:-220000,:]

data_evened = data.loc[(data['class']!= 1)&(data['class']!= 2)]
data_evened['class'].value_counts()

data_evened = pd.concat([data_evened, data_1, data_2])

print("Value counts: \n")
print(data_evened['class'].value_counts());
print("\n")
print('Percentages\n')
percentages = data_evened['class'].value_counts()
perc = percentages.to_frame()

# print(percentages)
perc['class'] = perc['class']/len(data)
print(perc)

Value counts: 

2    63301
3    35754
1    31840
7    20510
6    17367
5     9493
4     2747
Name: class, dtype: int64


Percentages

      class
2  0.108950
3  0.061537
1  0.054801
7  0.035300
6  0.029891
5  0.016339
4  0.004728


Creating feature and label set using the entire data set and standardizing the features set using z-score standardization

In [22]:

#remove classification column from feature set
X = data.loc[:,:'Soil_Type40']
print(X)
#create labels set of classification column
y = data.loc[:,'class']

#convert labels to one-hot endoded labels in order to use categoricalcrossentropy

#Z score standardization
ct = StandardScaler()





        Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
0            2596      51      3                               258   
1            2590      56      2                               212   
2            2804     139      9                               268   
3            2785     155     18                               242   
4            2595      45      2                               153   
...           ...     ...    ...                               ...   
581007       2396     153     20                                85   
581008       2391     152     19                                67   
581009       2386     159     17                                60   
581010       2384     170     15                                60   
581011       2383     165     13                                60   

        Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
0                                    0                              510   
1        

Performing Random Forest Classification produced impressive results.  Because of this, we will run it again using 5 fold stratification. 

In [6]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify = y)

# one-hot endoding the labels
y_train = tf.keras.utils.to_categorical(y_train, dtype = 'int64')
y_test = tf.keras.utils.to_categorical(y_test, dtype = 'int64')
x_train.head()

#Z score standardization
ct = StandardScaler()

x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)
#print(x_train)

model_forest = RandomForestClassifier()
model_forest.fit(x_train, y_train)
model_forest.score(x_test, y_test)

y_predict = model_forest.predict(x_test)

print(classification_report(y_test, y_predict, zero_division = 1))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         0
           1       0.97      0.93      0.95     63552
           2       0.95      0.97      0.96     84991
           3       0.95      0.94      0.95     10726
           4       0.93      0.79      0.86       824
           5       0.95      0.70      0.81      2848
           6       0.95      0.86      0.90      5210
           7       0.98      0.94      0.96      6153

   micro avg       0.96      0.95      0.95    174304
   macro avg       0.96      0.89      0.92    174304
weighted avg       0.96      0.95      0.95    174304
 samples avg       0.96      0.95      0.95    174304



Running the random forrest with the 5 fold stratification produced results that were very poor in comparison. At the moment, I am unsure as to why the performance was so much worst. 

In [31]:
model_forest = RandomForestClassifier()

skf = StratifiedKFold(n_splits=5, shuffle = True)
lst_accu_stratified = []
X_array = X.to_numpy()

# X_array.type

for train_index, test_index in skf.split(X_array, y):
    x_train_fold, x_test_fold = X_array[train_index], X_array[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    model_forest.fit(ct.fit_transform(x_train_fold), y_train_fold)
    lst_accu_stratified.append(model_forest.score(ct.transform(x_test_fold), y_test_fold))
    y_estimate = model.predict(ct.transform(x_test_fold))
    print('y estimate: ' + str(y_estimate.shape))
    y_estimate = np.argmax(y_estimate, axis = 1)
    print('y estimate: ' + str(y_estimate.shape))
    y_true = np.argmax(y_test_fold, axis = 0)
    print("Report: " + str(train_index))

    print(classification_report(y_test_fold, y_estimate))
    
    
print('List of possible accuracy:', lst_accu_stratified)
print('\nMaximum Accuracy That can be obtained from this model is:',
      max(lst_accu_stratified)*100, '%')
print('\nMinimum Accuracy:',
      min(lst_accu_stratified)*100, '%')
print('\nOverall Accuracy:',
      statistics.mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', statistics.stdev(lst_accu_stratified))

y estimate: (116203, 8)
y estimate: (116203,)
Report: [     2      3      4 ... 581008 581010 581011]
              precision    recall  f1-score   support

           1       0.96      0.98      0.97     42368
           2       0.97      0.97      0.97     56660
           3       0.97      0.96      0.96      7151
           4       0.92      0.84      0.88       550
           5       0.91      0.87      0.89      1899
           6       0.94      0.94      0.94      3473
           7       0.99      0.91      0.95      4102

    accuracy                           0.97    116203
   macro avg       0.95      0.92      0.94    116203
weighted avg       0.97      0.97      0.97    116203

y estimate: (116203, 8)
y estimate: (116203,)
Report: [     0      1      2 ... 581008 581009 581010]
              precision    recall  f1-score   support

           1       0.96      0.98      0.97     42368
           2       0.98      0.97      0.97     56660
           3       0.97      0.96   

Now we will run a neural network model to see if it performs better or worse than the random forest. I chose a neural net here because of the large number of features. 

In [14]:
model = Sequential()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify = y)

y_train = tf.keras.utils.to_categorical(y_train, dtype = 'int64')
y_test = tf.keras.utils.to_categorical(y_test, dtype = 'int64')

x_train_scaled = ct.fit_transform(x_train)
x_test_scaled = ct.fit_transform(x_test)

model.add(InputLayer(x_train_scaled.shape[1],))

model.add(Dense(512, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(8, activation = 'relu'))
model.add(Dense(8, activation = 'softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 512)               28160     
                                                                 
 dense_1 (Dense)             (None, 128)               65664     
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dense_3 (Dense)             (None, 16)                1040      
                                                                 
 dense_4 (Dense)             (None, 8)                 136       
                                                                 
 dense_5 (Dense)             (None, 8)                 72        
                                                                 
Total params: 103,328
Trainable params: 103,328
Non-trai

In [16]:
EPOCHS = 5
BATCH_SIZE = 128


model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ["categorical_accuracy"])

model.fit(x_train_scaled, y_train, epochs = EPOCHS, batch_size = BATCH_SIZE)
y_estimate = model.predict(x_test_scaled)
y_estimate = np.argmax(y_estimate, axis = 1)

y_true = np.argmax(y_test, axis = 1)

print(classification_report(y_true, y_estimate))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
              precision    recall  f1-score   support

           1       0.92      0.89      0.91     63552
           2       0.91      0.94      0.93     84991
           3       0.90      0.91      0.90     10726
           4       0.87      0.74      0.80       824
           5       0.81      0.65      0.72      2848
           6       0.82      0.80      0.81      5210
           7       0.91      0.92      0.92      6153

    accuracy                           0.91    174304
   macro avg       0.88      0.84      0.85    174304
weighted avg       0.91      0.91      0.91    174304



In [26]:
EPOCHS = 5
BATCH_SIZE = 128

model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ["sparse_categorical_accuracy"])

skf = StratifiedKFold(n_splits=5, shuffle = True)
lst_accu_stratified = []
  
for train_index, test_index in skf.split(X_array, y):
    x_train_fold, x_test_fold = X_array[train_index],X_array[test_index]
    print('x train shape:' + str(x_train_fold.shape))
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    print('y train shape:' + str(y_train_fold.shape))
    model.fit(ct.fit_transform(x_train_fold), y_train_fold, epochs = EPOCHS, batch_size = BATCH_SIZE)
    y_estimate = model.predict(ct.transform(x_test_fold))
    print('y estimate: ' + str(y_estimate.shape))
    y_estimate = np.argmax(y_estimate, axis = 1)
    print('y estimate: ' + str(y_estimate.shape))
    y_true = np.argmax(y_test_fold, axis = 0)
    print("Report: " + str(train_index))

    print(classification_report(y_test_fold, y_estimate))



x train shape:(464809, 54)
y train shape:(464809,)
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
y estimate: (116203, 8)
y estimate: (116203,)
Report: [     0      1      2 ... 581009 581010 581011]
              precision    recall  f1-score   support

           1       0.96      0.96      0.96     42368
           2       0.97      0.97      0.97     56660
           3       0.96      0.96      0.96      7151
           4       0.87      0.87      0.87       550
           5       0.89      0.85      0.87      1899
           6       0.94      0.92      0.93      3473
           7       0.95      0.96      0.95      4102

    accuracy                           0.96    116203
   macro avg       0.93      0.93      0.93    116203
weighted avg       0.96      0.96      0.96    116203

x train shape:(464809, 54)
y train shape:(464809,)
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
y estimate: (116203, 8)
y estimate: (116203,)
Report: [     0      2      3 ... 581009 581010 58101