# My application of a simple neural net on Tabular Playground Series, December 2021
### Please let me know of any improvements; I'm here to learn

### Ideas for improvement
* Feature engineering: Cover_Type = 5 has only 1 sample, so remove it? DONE
* Encode using sklearn LabelEncoder (need to use encoder.inverse_transform for test preds later) DONE
* Scale data using sklearn RobustScaler DONE
* Plot model using tf.keras.utils.plot_model
* Use some tool to do feature importance
* Can run on TPU DONE

Used https://www.kaggle.com/gulshanmishra/tps-dec-21-tensorflow-nn-feature-engineering as inspiration, please go give that notebook a thumbs up


In [1]:
import pandas as pd
import numpy as np
import datatable as dt

from sklearn.model_selection import train_test_split, StratifiedKFold 
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, RobustScaler

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.utils import plot_model
import tensorflow as tf

plot = False # True: plot the model graph, False: print the model summary

## Function to reduce memory of dataframes

In [2]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
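
To see the downcasting idea in isolation, here is a minimal sketch on a toy frame. The helper name `smallest_int_dtype` and the toy columns are made up for illustration. It mirrors the strict bounds checks used above but handles integer columns only.

```python
import numpy as np
import pandas as pd

# Toy frame: both columns start as int64
toy = pd.DataFrame({
    "small": np.arange(100, dtype=np.int64),          # values 0..99
    "large": np.arange(100, dtype=np.int64) * 1000,   # values 0..99000
})

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype whose range strictly contains the series."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min < series.min() and series.max() < info.max:
            return dtype
    return np.int64

before = toy.memory_usage(index=False).sum()
for col in toy.columns:
    toy[col] = toy[col].astype(smallest_int_dtype(toy[col]))
after = toy.memory_usage(index=False).sum()

print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The same check on float columns is more delicate, because fitting inside the range of `float16` says nothing about how much precision is lost.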

# Importing training and testing data
Reading using datatable and converting to pandas is often faster than reading directly using pandas

In [3]:
train_df = dt.fread("../input/tabular-playground-series-dec-2021/train.csv")
test_df = dt.fread("../input/tabular-playground-series-dec-2021/test.csv")
test_df = reduce_memory_usage(test_df.to_pandas())
train_df = reduce_memory_usage(train_df.to_pandas())

INPUT_SHAPE = test_df.shape[1:] # Used to decide first layer of nn
# NUM_CLASSES is set after label encoding below, once rare classes are dropped

# Remove sample with cover_type = 5
idx_to_drop5 = train_df[train_df["Cover_Type"] == 5].index
print(f"Nr of cover_type = 5: {len(idx_to_drop5)}")
train_df.drop(idx_to_drop5,
              axis=0,
              inplace=True)

# Class 4 also has very few samples. Dropping it is optional; the recorded
# output below includes its count, so that run appears to have had this enabled.
# idx_to_drop4 = train_df[train_df["Cover_Type"] == 4].index
# print(f"Nr of cover_type = 4: {len(idx_to_drop4)}")
# train_df.drop(idx_to_drop4,
#               axis=0,
#               inplace=True)

encoder = LabelEncoder()
train_df["Cover_Type"] = encoder.fit_transform(train_df["Cover_Type"])
NUM_CLASSES = len(encoder.classes_) # For output layer of nn, counted after dropping rare classes

bool_features = [i for i in train_df.columns if "area" in i.lower() or "soil" in i.lower()]
test_df[bool_features] = test_df[bool_features].astype(np.int8)
train_df[bool_features] = train_df[bool_features].astype(np.int8)


Mem. usage decreased to 63.90 Mb (23.9% reduction)
Mem. usage decreased to 259.40 Mb (26.1% reduction)
Nr of cover_type = 5: 1
Nr of cover_type = 4: 377
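
Dropping a class leaves a gap in the label values, which is why the labels are re-encoded to consecutive integers and have to be mapped back before submitting. A small sketch of that round trip, using made-up labels rather than the competition data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels with class 5 missing, as after the drop above
labels = np.array([1, 2, 3, 4, 6, 7, 2, 7])

enc = LabelEncoder()
encoded = enc.fit_transform(labels)      # consecutive integers starting at 0
decoded = enc.inverse_transform(encoded) # back to the original label values

print(enc.classes_)  # sorted unique labels seen during fit
print(encoded)
print(decoded)
```

The encoded values are what `sparse_categorical_crossentropy` expects: integers from 0 up to the number of classes minus one.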


### Scale unscaled data
Great article on interesting ways to select pandas columns:
https://towardsdatascience.com/interesting-ways-to-select-pandas-dataframe-columns-b29b82bbfb33

In [4]:
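# Columns with any value above 7 are the non-binary ones. Note that this also
# selects `Id`, which is then scaled and kept as a model input.
cols_to_scale = train_df.loc[:,[(train_df[col] > 7).any() for col in train_df.columns]].columns
print(f"Scaled Columns: {cols_to_scale}\n\n  \
Number of scaled Columns: {len(cols_to_scale)}")

scaler = RobustScaler()
train_df[cols_to_scale] = scaler.fit_transform(train_df[cols_to_scale])
# Reuse the statistics fitted on the training data; refitting on the test set
# would put the two sets on different scales
test_df[cols_to_scale] = scaler.transform(test_df[cols_to_scale])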

y = train_df.pop("Cover_Type").values
X = train_df.values

Scaled Columns: Index(['Id', 'Elevation', 'Aspect', 'Slope',
       'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
       'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon',
       'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points'],
      dtype='object')

  Number of scaled Columns: 11
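
`RobustScaler` centers each column on its median and divides by its interquartile range, which makes it less sensitive to outliers than mean and standard deviation scaling. A minimal sketch on toy numbers (not the competition data), fitting on the training rows and reusing those statistics for the test rows:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[3.0], [7.0]])

scaler = RobustScaler()
train_scaled = scaler.fit_transform(train)  # learns median and IQR from train
test_scaled = scaler.transform(test)        # applies the same median and IQR

print(scaler.center_, scaler.scale_)  # median 3.0, IQR 4.0 - 2.0 = 2.0
print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value equal to the training median maps to 0, and values outside the training range simply land outside the interval the training data occupies.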


## Callbacks to use when training later
Reduce the learning rate when the validation loss plateaus, and stop early when the validation accuracy stops improving

In [5]:
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=5,
    verbose=False
)
early_stop = EarlyStopping(
    monitor="val_accuracy",
    patience=10,
    restore_best_weights=True,
    verbose=True
)
callbacks = [reduce_lr, early_stop]
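
The patience mechanism behind early stopping can be illustrated without Keras. The function below is a simplified pure-Python sketch, not the Keras implementation: `early_stop_epoch` and the accuracy values are made up, and it ignores options such as `min_delta` and weight restoring.

```python
def early_stop_epoch(history, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(history, start=1):
        if value > best:
            # Strict improvement resets the counter
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies; the best value is first reached at epoch 3
val_accuracy = [0.90, 0.93, 0.95, 0.94, 0.95, 0.949, 0.948]
stopped_at = early_stop_epoch(val_accuracy, patience=3)
print(stopped_at)  # 6
```

With `patience=10` as in the cell above, a fold that stops at epoch 33 would, under this logic, have seen its best validation accuracy around epoch 23.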

## Define the model and compile it

In [6]:
def build_model():
    # Use a TPU strategy when a TPU is available, otherwise the default strategy
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
        print(f"Running on TPU: {tpu.master()}")
        # Informational only: training uses the global BATCH_SIZE from the training cell
        print(f"Batch Size on TPU: {strategy.num_replicas_in_sync * 64}")
    except ValueError:
        print("Not running on TPU")
        strategy = tf.distribute.get_strategy()

    # The default strategy's scope is a no-op, so one definition serves both cases
    with strategy.scope():
        model = Sequential([
            Dense(units=512, kernel_initializer='random_normal', activation='gelu',
                  input_shape=INPUT_SHAPE),
            BatchNormalization(),
            Dense(units=256, kernel_initializer='random_normal', activation='gelu'),
            BatchNormalization(),
            Dense(units=128, kernel_initializer='random_normal', activation='gelu'),
            BatchNormalization(),
            Dense(units=64, kernel_initializer='random_normal', activation='gelu'),
            BatchNormalization(),
            Dense(units=32, kernel_initializer='random_normal', activation='gelu'),
            BatchNormalization(),
            # One output per encoded class instead of a hard-coded 5
            Dense(units=NUM_CLASSES, activation="softmax")
        ])
        model.compile(
            optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=["accuracy"]
        )

    return model

if plot:
    plot_model(
        build_model(),
        show_shapes=True,
        show_layer_names=True
    )
else:
    build_model().summary()

Not running on TPU


2021-12-13 11:44:48.932815: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-13 11:44:49.043767: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-13 11:44:49.044844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               28672     
_________________________________________________________________
batch_normalization (BatchNo (None, 512)               2048      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
batch_normalization_2 (Batch (None, 128)               512       
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8

2021-12-13 11:44:49.047149: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-13 11:44:49.049075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-13 11:44:49.050063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-13 11:44:49.051083: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA 
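
The parameter counts in the summary above can be checked by hand. A dense layer has one weight per input-output pair plus one bias per unit, and a batch normalization layer stores four values per unit. The sketch below assumes 55 input features, a number inferred from the first row of the summary (28672 / 512 - 1) rather than stated anywhere in the notebook.

```python
def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit
    return (n_in + 1) * n_out

def batchnorm_params(n_units):
    # gamma, beta, moving mean and moving variance for each unit
    return 4 * n_units

n_features = 55  # assumption: inferred from the summary, not read from the data

print(dense_params(n_features, 512))  # 28672
print(batchnorm_params(512))          # 2048
print(dense_params(512, 256))         # 131328
print(batchnorm_params(256))          # 1024
print(dense_params(256, 128))         # 32896
print(batchnorm_params(128))          # 512
```

The 55 features are consistent with `Id` being kept as an input alongside the 54 predictors, as the scaled-column list earlier suggests.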

In [7]:
print("Num GPUs available: ", len(tf.config.list_physical_devices('GPU')))

FOLDS = 10
EPOCHS = 100
BATCH_SIZE = 1024
STEPS_PER_EPOCH = 2*BATCH_SIZE # Not used; pass as steps_per_epoch for faster epochs
test_preds = np.zeros((1,1)) # Broadcasts against the (n_test, n_classes) predictions added per fold
scores = []

cv = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(cv.split(X,y), start=1):
    # The held-out part of each fold is used for validation, not as the test set
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model = build_model()
    model.fit(
        X_train,
        y_train,
        validation_data=(X_val, y_val),
        # steps_per_epoch=4*BATCH_SIZE,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        callbacks=callbacks,
        verbose=False
    )

    y_pred = np.argmax(model.predict(X_val), axis=1)

    score = accuracy_score(y_val, y_pred)
    print(f"Fold {fold} Validation Accuracy: {score}")
    scores.append(score)

    # Sum class probabilities across folds; the argmax is taken when building the submission
    test_preds = test_preds + model.predict(test_df)
    
    # del model, y_pred, score

print(f"\n\nMean accuracy over all folds: {np.mean(scores)}")

Num GPUs available:  1
Not running on TPU


2021-12-13 11:44:53.618138: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 791924980 exceeds 10% of free system memory.
2021-12-13 11:44:54.483354: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 791924980 exceeds 10% of free system memory.
2021-12-13 11:44:55.100053: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Restoring model weights from the end of the best epoch.
Epoch 00033: early stopping
Fold 1 Validation Accuracy: 0.9612689173748572
Not running on TPU


2021-12-13 11:56:30.919307: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 791924980 exceeds 10% of free system memory.
2021-12-13 11:56:31.750051: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 791924980 exceeds 10% of free system memory.


Restoring model weights from the end of the best epoch.
Epoch 00027: early stopping
Fold 2 Validation Accuracy: 0.961961481437033
Not running on TPU


2021-12-13 12:06:17.241386: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 791925200 exceeds 10% of free system memory.


Restoring model weights from the end of the best epoch.
Epoch 00025: early stopping
Fold 3 Validation Accuracy: 0.9614638390647111
Not running on TPU
Restoring model weights from the end of the best epoch.
Epoch 00029: early stopping
Fold 4 Validation Accuracy: 0.9614288357393952
Not running on TPU
Restoring model weights from the end of the best epoch.
Epoch 00030: early stopping
Fold 5 Validation Accuracy: 0.9622814167345898
Not running on TPU
Restoring model weights from the end of the best epoch.
Epoch 00030: early stopping
Fold 6 Validation Accuracy: 0.9622314119841385
Not running on TPU
Restoring model weights from the end of the best epoch.
Epoch 00024: early stopping
Fold 7 Validation Accuracy: 0.9616938609167871
Not running on TPU
Restoring model weights from the end of the best epoch.
Epoch 00025: early stopping
Fold 8 Validation Accuracy: 0.9619663868067466
Not running on TPU
Restoring model weights from the end of the best epoch.
Epoch 00030: early stopping
Fold 9 Validatio
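
The loop above sums the softmax outputs of every fold's model, so the final class is the one with the highest total probability rather than a majority vote. A minimal numpy sketch of that accumulation, with made-up probabilities for two test rows and three classes:

```python
import numpy as np

# Made-up softmax outputs from three fold models, shape (n_rows, n_classes)
fold_probs = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]),
    np.array([[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]),
]

total = np.zeros((1, 1))  # broadcasts to (n_rows, n_classes) on the first addition
for probs in fold_probs:
    total = total + probs

ensemble = np.argmax(total, axis=1)
print(total.shape)  # (2, 3)
print(ensemble)     # [0 2]
```

For the second row, a per-model vote would be split between classes 1 and 2, while the summed probabilities favor class 2. Dividing the sum by the number of folds would not change the argmax.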

In [8]:
sample = pd.read_csv("../input/tabular-playground-series-dec-2021/sample_submission.csv")
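# Pick the class with the highest summed probability across folds
preds = np.argmax(test_preds, axis=1)
# Map encoded class indices back to the original Cover_Type labels
preds = encoder.inverse_transform(preds)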

sample.Cover_Type = preds
sample.to_csv("Submission.csv", index=False)
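
The feature importance idea from the list at the top is still open. One model-agnostic option is permutation importance: shuffle one column at a time and measure how much accuracy drops. The sketch below is an assumption about how this could be done, not something the notebook ran; `permutation_importance` here is a hand-written helper, and `toy_predict` is a stand-in for a trained model.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop per column when that column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[col] = np.mean(drops)
    return importances

# Toy data: the label depends only on column 0
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    # Stand-in for a trained model that has learned the rule exactly
    return (X[:, 0] > 0).astype(int)

importances = permutation_importance(toy_predict, X_toy, y_toy)
print(importances)
```

For the network in this notebook, the `predict` argument could be something like `lambda X: np.argmax(model.predict(X), axis=1)` applied to a validation fold. On the full training set this would be slow, so a subsample is probably the practical choice.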