6. Forest Cover Type Dataset: Forest.csv

Forest.csv (75.8MB): 581012 samples

The last column is the label, which has 7 classes:
(1:211840, 2:283301, 3:35754, 4:2747, 5:9493, 6:17367, 7:20510)


Context

This dataset contains tree observations from four areas of the Roosevelt National Forest in Colorado. All observations are cartographic variables (no remote sensing) from 30 meter x 30 meter sections of forest. There are over half a million measurements in total.

Content

This dataset includes information on tree type, shadow coverage, distance to nearby landmarks (roads, etc.), soil type, and local topography.

Acknowledgement

This dataset is part of the UCI Machine Learning Repository. The original database owners are Jock A. Blackard, Dr. Denis J. Dean, and Dr. Charles W. Anderson of the Remote Sensing and GIS Program at Colorado State University.

Inspiration

    Can you build a model that predicts what types of trees grow in an area based on the surrounding characteristics? A past Kaggle competition addressed this topic.
    What kinds of trees are most common in the Roosevelt National Forest?
    Which tree types can grow in more diverse environments? Are there certain tree types that are sensitive to an environmental factor, such as elevation or soil type?

https://www.kaggle.com/uciml/forest-cover-type-dataset


In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the dataset (described above as Forest.csv; read here as covtype.csv)
data = pd.read_csv('./covtype.csv')
data

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,2596,51,3,258,0,510,221,232,148,6279,...,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,...,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,...,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,...,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,...,0,0,0,0,0,0,0,0,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581007,2396,153,20,85,17,108,240,237,118,837,...,0,0,0,0,0,0,0,0,0,3
581008,2391,152,19,67,12,95,240,237,119,845,...,0,0,0,0,0,0,0,0,0,3
581009,2386,159,17,60,7,90,236,241,130,854,...,0,0,0,0,0,0,0,0,0,3
581010,2384,170,15,60,5,90,230,245,143,864,...,0,0,0,0,0,0,0,0,0,3
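One of the questions above asks whether some tree types are sensitive to elevation. A minimal sketch of how one might start on that with a group-by, using only the nine rows visible in the output above rather than the full file. The numbers are therefore illustrative, not conclusions, and the names `rows` and `summary` are introduced here for the example.

```python
import pandas as pd

# Elevation and Cover_Type copied from the nine rows displayed above
rows = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785, 2595, 2396, 2391, 2386, 2384],
    "Cover_Type": [5, 5, 2, 2, 5, 3, 3, 3, 3],
})

# Summarise elevation per cover type
summary = rows.groupby("Cover_Type")["Elevation"].agg(["count", "mean", "min", "max"])
print(summary)
```

On the full DataFrame the same call would be applied to `data` instead of `rows`.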


In [3]:
# Separate the features from the label column
X = data.drop('Cover_Type', axis=1)
y = data['Cover_Type']

In [4]:
# Use PCA to reduce the features to 32 dimensions
# (fitted on the full, unscaled feature matrix, before the train/test split)
from sklearn.decomposition import PCA

n_components = 32
X_pca = PCA(n_components=n_components).fit_transform(X)

print('X_pca.shape', X_pca.shape)

X_pca.shape (581012, 32)


In [5]:
X_pca

array([[ 6.74821965e+02,  4.63459937e+03, -2.44289792e+02, ...,
        -9.30392972e-02,  3.45066120e-04, -4.17138904e-02],
       [ 5.43787831e+02,  4.65172408e+03, -2.63848362e+02, ...,
        -9.37542314e-02,  2.34147036e-03, -4.61723348e-02],
       [ 2.87025270e+03,  3.09256260e+03, -2.16452213e+02, ...,
         1.67305344e-03, -2.19143255e-03, -1.61547842e-02],
       ...,
       [-2.54590708e+03,  2.44302744e+02, -4.55505134e+02, ...,
         4.92428614e-01,  1.31450700e-01,  2.31487071e-02],
       [-2.54076502e+03,  2.52666167e+02, -4.57305806e+02, ...,
         4.90048185e-01,  1.26019462e-01,  2.27858336e-02],
       [-2.55455967e+03,  2.74192750e+02, -4.56952123e+02, ...,
         4.90453697e-01,  1.22057360e-01,  2.09465544e-02]])
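In the array above, the leading components are in the thousands while the trailing ones are below 1. That pattern is consistent with PCA on unscaled inputs, where features with large numeric ranges dominate the variance. The sketch below illustrates the effect on synthetic data (one large-range feature plus three 0/1 indicators, all invented for the example) and compares it with standardising first. It is not a claim about what scaling would do to the results in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins: one feature with a large range, three binary indicators
big = rng.normal(3000, 300, size=n)
flags = rng.integers(0, 2, size=(n, 3)).astype(float)
X_demo = np.column_stack([big, flags])

# PCA on raw values versus PCA on standardised values
raw = PCA(n_components=2).fit(X_demo)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print("raw   :", raw.explained_variance_ratio_)
print("scaled:", scaled.explained_variance_ratio_)
```

Without scaling, the first component captures almost all of the variance because it simply tracks the large-range feature. After standardising, the variance is spread across components.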

In [6]:
# Shift the labels from 1-7 to 0-6 so they can be one-hot encoded below
y = y - 1
y.value_counts()

1    283301
0    211840
2     35754
6     20510
5     17367
4      9493
3      2747
Name: Cover_Type, dtype: int64
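The counts above are strongly imbalanced, which matters when reading an accuracy figure. A small sketch, using only the printed counts, that computes each class's share, the accuracy of always predicting the majority class, and "balanced" class weights via the common heuristic `n_samples / (n_classes * count)`. Whether weighting would help this model is not tested here. The counts also cover the whole dataset, not just the training split.

```python
# Counts copied from the value_counts() output above (labels already shifted to 0-6)
counts = {1: 283301, 0: 211840, 2: 35754, 6: 20510, 5: 17367, 4: 9493, 3: 2747}

total = sum(counts.values())
shares = {k: v / total for k, v in sorted(counts.items())}

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(shares.values())

# "Balanced" weights: rare classes receive proportionally larger weights
class_weight = {k: total / (len(counts) * v) for k, v in sorted(counts.items())}

print(f"total samples: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
for k in sorted(counts):
    print(f"class {k}: share={shares[k]:.4f}, weight={class_weight[k]:.3f}")
```

A dictionary of this shape is what the `class_weight` argument of Keras `model.fit` accepts, should one want to experiment with it.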

In [7]:
# Split into training and test sets (80/20, stratified by class)
from sklearn.model_selection import train_test_split

x_pca_train, x_pca_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=11, stratify=y)
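In this notebook the PCA is fitted on all rows before the split, so the projection has already seen the test rows. A common alternative is to split first and fit the projection on the training rows only. The sketch below shows that ordering on random stand-in data. The array sizes, `n_components=4` and the `*_demo` names are assumptions made for the example, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X_demo = rng.normal(size=(500, 10))      # stand-in for the feature matrix
y_demo = rng.integers(0, 7, size=500)    # stand-in for the 0-6 labels

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo)

# 2) Fit the projection on the training rows only
pca = PCA(n_components=4).fit(X_tr)

# 3) Apply the same fitted projection to both splits
X_tr_pca = pca.transform(X_tr)
X_te_pca = pca.transform(X_te)
print(X_tr_pca.shape, X_te_pca.shape)
```

A scaler could be chained in front of the PCA in the same fit-on-train, transform-both fashion.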

In [8]:
# One-hot encode the labels (7 classes)
import keras

y_train = keras.utils.to_categorical(y_train, 7)
y_test = keras.utils.to_categorical(y_test, 7)
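For readers unfamiliar with one-hot encoding, the NumPy sketch below reproduces the idea behind `to_categorical` on a handful of labels: each integer label becomes a length-7 row with a single 1. It is an illustration of the encoding, not a replacement for the Keras call used above.

```python
import numpy as np

# Example labels in the 0-6 range (Cover_Type 5, 5, 2, 2, 5, 3 after subtracting 1)
labels = np.array([4, 4, 1, 1, 4, 2])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(7, dtype="float32")[labels]
print(one_hot)

# argmax recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The `argmax` step is also how integer predictions are typically recovered from the softmax output of the network.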

Using TensorFlow backend.


In [9]:
# Build a fully connected neural network
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(32,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               16896     
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 7)                 3591      
=================================================================
Total params: 283,143
Trainable params: 283,143
Non-trainable params: 0
_________________________________________________________________
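The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. The helper `dense_params` below is introduced only for this check.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the widths of the three Dense layers
layer_sizes = [32, 512, 512, 7]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # prints [16896, 262656, 3591] 283143
```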


In [10]:
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

# Train for 10 epochs; note that the test split is also used as validation data here
history = model.fit(x_pca_train, y_train,
                    batch_size=32,
                    epochs=10,
                    verbose=1,
                    validation_data=(x_pca_test, y_test))

Train on 464809 samples, validate on 116203 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
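The `fit` call above passes the test split as `validation_data`, so the per-epoch validation figures and the final test score come from the same rows. If a separate validation set is wanted, one option is to carve it out of the training rows. The sketch below shows the index bookkeeping on stand-in labels. The 10% validation fraction and the `*_demo`, `fit_idx` and `val_idx` names are assumptions for the example. It also computes how many batches one epoch takes with the sample count and batch size reported above.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

# Batches per epoch for the run above: 464809 training samples, batch size 32
steps_per_epoch = math.ceil(464809 / 32)
print(steps_per_epoch)

# Stand-in integer labels: 7 classes, 100 samples each
labels_demo = np.repeat(np.arange(7), 100)
idx = np.arange(labels_demo.size)

# Stratified split of the *training* indices into fit and validation parts
fit_idx, val_idx = train_test_split(
    idx, test_size=0.1, random_state=11, stratify=labels_demo)
print(len(fit_idx), len(val_idx))
```

In the notebook, the stratification would use the integer training labels (before one-hot encoding), and the resulting indices would select rows of `x_pca_train` and `y_train` for `validation_data`.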


In [11]:
score = model.evaluate(x_pca_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.8287166669166791
Test accuracy: 0.7011006474494934
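With imbalanced classes, a single accuracy number can hide poor performance on rare classes. The sketch below computes a confusion matrix and per-class recall with NumPy on a tiny made-up example. `per_class_recall` and the `*_demo` arrays are hypothetical names, and the toy values are not results from this model.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Return the confusion matrix (rows = true class) and recall per class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # count each (true, predicted) pair
    support = cm.sum(axis=1)             # number of true samples per class
    recall = np.divide(cm.diagonal(), support,
                       out=np.full(n_classes, np.nan), where=support > 0)
    return cm, recall

# Toy labels: class 2 is rare and never predicted correctly
y_true_demo = np.array([0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 1, 1, 1, 1, 1])

cm, recall = per_class_recall(y_true_demo, y_pred_demo, 3)
print(cm)
print(recall)
```

For the model above, the inputs would be `y_test.argmax(axis=1)` and `model.predict(x_pca_test).argmax(axis=1)` with `n_classes=7`.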
