# Tabular Playground Series 12-2021


*In this notebook we work on the **Tabular Playground Series 12-2021** dataset from Kaggle, which involves predicting forest cover type from a set of features, and we build a neural network model with the **tensorflow** module.*


In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [13]:
train = pd.read_csv(r'C:\Users\adity\Github\datasets\TPS_12_2021\train.csv')
test = pd.read_csv(r'C:\Users\adity\Github\datasets\TPS_12_2021\test.csv')

In [14]:
train.shape

(4000000, 56)

*The shape above shows that the dataset is large (4,000,000 rows and 56 columns). To make computation lighter, we downcast each column to a smaller dtype using a function borrowed from Kaggle.*

In [15]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

In [16]:
train_df = reduce_memory_usage(train, verbose=True)
test_df = reduce_memory_usage(test, verbose=True)
print('Memory reduced')

Mem. usage decreased to 259.40 Mb (84.8% reduction)
Mem. usage decreased to 63.90 Mb (84.8% reduction)
Memory reduced


*The memory usage of both the training and test frames drops by 84.8%.*
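
*The dtype selection behind this reduction can be sketched in isolation. The helper **smallest_int_dtype** and the toy column below are hypothetical and not part of the notebook. The sketch also assumes inclusive bounds, which differs slightly from the strict comparisons in the function above.*

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype whose range contains [c_min, c_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        # Inclusive bounds: a column whose maximum is exactly 127 still fits in int8.
        if info.min <= c_min and c_max <= info.max:
            return dtype
    raise ValueError("range does not fit in a signed 64-bit integer")

# Hypothetical column: a binary indicator stored as int64 (8 bytes per value).
column = np.array([0, 1, 1, 0, 1], dtype=np.int64)
target = smallest_int_dtype(column.min(), column.max())
compact = column.astype(target)

print(column.nbytes, "->", compact.nbytes)  # 40 -> 5
```

*A column of 0/1 indicators shrinks by a factor of eight this way, which is the kind of saving that would add up to a large overall reduction on a frame dominated by small integers.*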

*As a feature engineering step, we standardize the features of both datasets so that they can be passed to the neural network.*

In [17]:
def feature_engineering(df):
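    # NOTE: this call acts on the 'Id' Series, not on df, so it is a no-op and
    # 'Id' stays in the frame as a feature (hence the 55 columns reported below).
    # Dropping it would require df.drop(columns=['Id'], inplace=True).
    df['Id'].drop(columns=['Id'],inplace=True)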
    
    if 'Cover_Type' in df.columns:
        df.drop(columns=['Cover_Type'],inplace=True)
    scaler = StandardScaler()
    df_scaled = scaler.fit_transform(df)
    
    return df_scaled

In [18]:
y = train_df['Cover_Type']
test_ids = test_df['Id']
X = feature_engineering(train_df)
test = feature_engineering(test_df)
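
*In the cell above, **feature_engineering()** fits a new **StandardScaler** on each frame, so the test set is scaled with its own statistics rather than those of the training set. A common alternative is to compute the statistics on the training data only and reuse them for the test data. The sketch below shows that idea with plain numpy, using the same arithmetic as **StandardScaler** (per-column mean and population standard deviation). The helper names and toy arrays are hypothetical.*

```python
import numpy as np

def fit_standardizer(X):
    """Compute per-column mean and population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero spread; divide by 1 instead to avoid NaNs.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale X with statistics that were computed elsewhere."""
    return (X - mean) / std

# Toy stand-ins for the training and test feature matrices (hypothetical values).
X_train_toy = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
X_test_toy = np.array([[2.0, 10.0], [4.0, 12.0]])

mean, std = fit_standardizer(X_train_toy)          # statistics from training data only
train_scaled = apply_standardizer(X_train_toy, mean, std)
test_scaled = apply_standardizer(X_test_toy, mean, std)  # reuse the same statistics

print(train_scaled.mean(axis=0))  # close to zero in every column
```

*With scikit-learn, the equivalent pattern would be calling **fit_transform()** on the training features and **transform()** on the test features with the same scaler object.*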

*We split the data with **train_test_split()** into a training set (80%) and a validation set (20%).*

In [19]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [20]:
X_train.shape

(3200000, 55)
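
*The shape above follows from the 80/20 split: 4,000,000 × 0.8 = 3,200,000 rows. As a rough illustration of what a shuffled hold-out split does, here is a minimal numpy sketch. The helper **shuffle_split** is hypothetical and is not how scikit-learn implements **train_test_split()**.*

```python
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=42):
    """Return disjoint (train, validation) index arrays from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_val = int(round(n_samples * test_size))
    # The first n_val shuffled indices form the validation set, the rest the training set.
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = shuffle_split(1000, test_size=0.2)
print(len(train_idx), len(val_idx))  # 800 200
```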

*We build the tensorflow model with **keras.Sequential()**.*

In [28]:
from tensorflow import keras
from tensorflow.keras import layers#, callbacks

model = keras.Sequential([
    layers.Dense(500, activation='relu', input_shape=[X_train.shape[1]]),
    layers.Dropout(0.2),
    layers.Dense(300, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(8, activation='softmax'),
])

*The model has three Dense layers, with a Dropout layer (rate 0.2) after each of the first two.*

In [27]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 500)               28000     
_________________________________________________________________
dropout (Dropout)            (None, 500)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               150300    
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 2408      
Total params: 180,708
Trainable params: 180,708
Non-trainable params: 0
_________________________________________________________________
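
*The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper **dense_params** below is a hypothetical name used only for this check.*

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width (55 features) followed by the widths of the three Dense layers.
layer_sizes = [55, 500, 300, 8]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [28000, 150300, 2408]
print(sum(counts))  # 180708
```

*The Dropout layers contribute no parameters, which matches the zeros in the summary.*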


*We compile the model with the **adam** optimizer, the **sparse_categorical_crossentropy** loss function, and the **accuracy** metric.*

In [22]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
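
*The "sparse" variant of the loss takes integer class labels directly instead of one-hot vectors. A minimal numpy sketch of the quantity it measures, the mean negative log-probability of the true class, is shown below. This is an illustration under simplified assumptions, not the Keras implementation, and the sample values are hypothetical.*

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class of each row."""
    probs = np.clip(probs, eps, 1.0)  # guard against log(0)
    picked = probs[np.arange(len(y_true)), y_true]  # probability of the true class per row
    return float(-np.log(picked).mean())

# Two hypothetical samples, three classes; each row sums to 1 like a softmax output.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
y_true = np.array([0, 2])  # integer labels, not one-hot vectors

loss = sparse_categorical_crossentropy(y_true, probs)
print(round(loss, 4))  # 0.2899
```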

In [23]:
model.fit(X_train,y_train,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1ad29385d90>

In [24]:
model.evaluate(X_test,y_test)



[0.10546988993883133, 0.9550300240516663]

In [25]:
y_predict = model.predict(test)
pred_y = np.argmax(y_predict, axis=1)
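
*The softmax layer returns one probability per class for every row, and **np.argmax** with **axis=1** picks the index of the most probable class. A small sketch with hypothetical probabilities:*

```python
import numpy as np

# Three hypothetical rows of softmax output over four classes.
probs = np.array([[0.05, 0.80, 0.10, 0.05],
                  [0.10, 0.20, 0.60, 0.10],
                  [0.02, 0.03, 0.05, 0.90]])

# axis=1 takes the argmax across the class dimension, giving one label per row.
labels = np.argmax(probs, axis=1)
print(labels)  # [1 2 3]
```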

In [26]:
a = np.column_stack([test_ids.astype(np.int32),pred_y])
df_csv = pd.DataFrame(a,columns=['Id','Cover_Type'])
df_csv["Id"] = df_csv["Id"].astype(int)
df_csv.to_csv('tps_12_2021.csv',index=False)
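
*The same submission frame could also be assembled from a dictionary of columns, which keeps each column's dtype without stacking and recasting. The sketch below uses hypothetical ids and predictions and writes to an in-memory buffer rather than to a file.*

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_ids and pred_y.
ids = pd.Series([4000000, 4000001, 4000002], name="Id")
preds = np.array([2, 1, 3])

# One column per key; to_numpy() avoids any index alignment surprises.
submission = pd.DataFrame({"Id": ids.to_numpy(), "Cover_Type": preds})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])  # Id,Cover_Type
```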

*After submitting the predictions to Kaggle, we get a score of 0.93819.*