<a href="https://colab.research.google.com/github/GrzegorzMeller/AlgorithmsForMassiveData/blob/master/FOREST_COVER_PREDICTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises 30/04

In the 30/04 lab lecture we will focus on data scaling.

Data scaling is a common preprocessing that is performed on datasets where data is represented with different scales.

Several scaling methods are available, where the most common are:

- Normalization. It is basically a rescaling of the data so that the values are within the range $[0,1]$.
- Standardization. It consists of rescaling the distribution of the observed values to zero mean and unit standard deviation. In the literature, this process is sometimes referred to as *whitening*.

In the lab lecture we will see how to perform data scaling and why data scaling is important for neural networks.

Meanwhile, address the following exercises.

# Neural network models

Build three feed-forward neural network models with one or more layers as follows:
- first network: use the original raw data,
- second network: use normalized data,
- third network: use standardized data.

Do not use convolutional layers. Exploit the methods we saw in the last lab lecture to properly train the networks (for instance, techniques to avoid overfitting). use TensorBoard to assess the performance.

Finally, address the following questions:
- Which network reaches the best performance?
- Do you notice the difference in performance when scaling the data?
- Which scaling method is the best? Can you guess why?

# Forest Cover Type Prediction dataset

Download the Forest Cover Type Prediction dataset.

The dataset contains tree observations from four areas of the Roosevelt National Forest in Colorado. All observations are cartographic variables (no remote sensing) from 30 meter x 30 meter sections of forest. The task is to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). More info related to the dataset are available [here](https://www.kaggle.com/uciml/forest-cover-type-dataset).

Download the dataset in the Google Colab environment using ``curl`` as follows:


In [43]:
!curl http://bodini.di.unimi.it/teaching/ADM_files/covtype.csv --output covtype.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 71.6M  100 71.6M    0     0  9337k      0  0:00:07  0:00:07 --:--:-- 17.6M


Then, read the .csv file containing the dataset with Pandas as follows:

In [0]:
import pandas as pd

df = pd.read_csv('covtype.csv')

The last column contains the labels, while the other columns contain the data.

In [45]:
x = df[df.columns[:-1]]
y = df.Cover_Type
#print(df)
print(x)
print(y.drop_duplicates())
print(x.shape[1])

        Elevation  Aspect  Slope  ...  Soil_Type38  Soil_Type39  Soil_Type40
0            2596      51      3  ...            0            0            0
1            2590      56      2  ...            0            0            0
2            2804     139      9  ...            0            0            0
3            2785     155     18  ...            0            0            0
4            2595      45      2  ...            0            0            0
...           ...     ...    ...  ...          ...          ...          ...
581007       2396     153     20  ...            0            0            0
581008       2391     152     19  ...            0            0            0
581009       2386     159     17  ...            0            0            0
581010       2384     170     15  ...            0            0            0
581011       2383     165     13  ...            0            0            0

[581012 rows x 54 columns]
0       5
2       2
40      1
1654    7
1818    

Divide the dataset in train and test sets. To compare with my results, set 0.7 as the training set ratio and ``random_state = 90``. 

In [56]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y , train_size = 0.7, random_state = 90)
print(x_train.shape,y_train.shape, x_test.shape, y_test.shape)

(406708, 54) (406708,) (174304, 54) (174304,)


In [57]:
import tensorflow as tf
from tensorflow import keras

#one-hot encoding for outputs
y_train = tf.reshape(tf.one_hot(y_train, 7),[406708, 7])
y_test = tf.reshape(tf.one_hot(y_test, 7),[174304, 7])

print(y_test, y_train)
print(y_train.shape, y_test.shape)

tf.Tensor(
[[0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]], shape=(174304, 7), dtype=float32) tf.Tensor(
[[0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]], shape=(406708, 7), dtype=float32)
(406708, 7) (174304, 7)


In [54]:
#neural network implementation
model = keras.Sequential([
    keras.layers.Dense(200, activation="relu", input_shape=(x_train.shape[1],)), 

    keras.layers.Dense(60, activation="relu"), # 2nd hidden layer
    keras.layers.Dropout(0.4),

    keras.layers.Dense(7, activation="softmax")]) #  output layer with 7 categories
  

model.compile(optimizer=tf.keras.optimizers.SGD(lr=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_33 (Dense)             (None, 200)               11000     
_________________________________________________________________
dense_34 (Dense)             (None, 60)                12060     
_________________________________________________________________
dropout_5 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_35 (Dense)             (None, 7)                 427       
Total params: 23,487
Trainable params: 23,487
Non-trainable params: 0
_________________________________________________________________


In [59]:
history_1 = model.fit(x_train,
                    y_train,
                    batch_size=100,
                    epochs=30,
                    validation_data=(x_test,y_test),
                    callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss')],
                    )

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [60]:
#TASK 2: data normalization
x_train_norm = (x_train - x_train.min())/(x_train.max() - x_train.min())
x_test_norm = (x_test - x_test.min())/(x_test.max() - x_test.min())
print(x_train_norm)
print(x_test_norm)

        Elevation    Aspect     Slope  ...  Soil_Type38  Soil_Type39  Soil_Type40
152044   0.581791  0.361111  0.166667  ...          0.0          0.0          0.0
363373   0.827914  0.286111  0.242424  ...          0.0          0.0          1.0
372733   0.399200  0.516667  0.257576  ...          0.0          0.0          0.0
572846   0.387694  0.383333  0.181818  ...          0.0          0.0          0.0
114145   0.543272  0.700000  0.242424  ...          0.0          0.0          0.0
...           ...       ...       ...  ...          ...          ...          ...
286827   0.481741  0.966667  0.318182  ...          0.0          0.0          0.0
564298   0.351176  0.002778  0.181818  ...          0.0          0.0          0.0
402834   0.546773  0.763889  0.196970  ...          0.0          0.0          0.0
185125   0.589295  0.069444  0.106061  ...          0.0          0.0          0.0
158375   0.531766  0.050000  0.166667  ...          0.0          0.0          0.0

[406708 rows x 

In [61]:
history_2 = model.fit(x_train_norm,
                    y_train,
                    batch_size=100,
                    epochs=30,
                    validation_data=(x_test_norm,y_test),
                    callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss')],
                    )

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30


In [49]:
#TASK 3: data rescaling
from sklearn import preprocessing

#Select numerical columns which needs to be normalized
train_norm = x_train[x_train.columns[0:10]]
test_norm = x_test[x_test.columns[0:10]]

# Normalize Training Data 
std_scale = preprocessing.StandardScaler().fit(train_norm)
x_train_norm = std_scale.transform(train_norm)

#Converting numpy array to dataframe
training_norm_col = pd.DataFrame(x_train_norm, index=train_norm.index, columns=train_norm.columns) 
x_train.update(training_norm_col)
print (x_train.head())

# Normalize Testing Data by using mean and SD of training set
x_test_norm = std_scale.transform(test_norm)
testing_norm_col = pd.DataFrame(x_test_norm, index=test_norm.index, columns=test_norm.columns) 
x_test.update(testing_norm_col)
print (x_test.head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[col] = expressions.where(mask, this, that)


        Elevation    Aspect     Slope  ...  Soil_Type38  Soil_Type39  Soil_Type40
152044   0.222366 -0.228639 -0.412503  ...            0            0            0
363373   1.980490 -0.469989  0.255453  ...            0            0            1
372733  -1.081933  0.271939  0.389044  ...            0            0            0
572846  -1.164122 -0.157128 -0.278912  ...            0            0            0
114145  -0.052787  0.861906  0.255453  ...            0            0            0

[5 rows x 54 columns]
        Elevation    Aspect     Slope  ...  Soil_Type38  Soil_Type39  Soil_Type40
204886   0.783394 -1.310245 -0.946867  ...            0            0            0
116027  -0.903262 -1.006323 -0.679685  ...            0            0            0
328145  -0.270766 -1.095711 -0.278912  ...            0            0            0
579670  -1.139108 -0.961628 -0.412503  ...            0            0            0
41341    0.265247  0.736762 -1.347641  ...            0            0       

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[col] = expressions.where(mask, this, that)


In [55]:
history1 = model.fit(x_train,
                    y_train,
                    batch_size=100,
                    epochs=30,
                    validation_data=(x_test,y_test),
                    callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss')],
                    )

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
