# 1-1 Example: Modeling Procedure for Structured Data

Welcome to Day 1 of the "Eat That TensorFlow2.0 in 30 Days" series on Kaggle!

Today, we'll dive into the Modeling Procedure of TensorFlow, laying the foundation for our TensorFlow journey. In this chapter, we'll start with a practical example of modeling structured data. Get ready to learn the essentials of TensorFlow modeling in just a few lines of code! 🚀🤩

### 1. Data Preparation


The Titanic dataset is used to predict, from each passenger's personal information, whether that passenger survived after the Titanic hit the iceberg.

We usually use DataFrame from the pandas library to pre-process the structured data.

In [1]:
!pip install tensorflow -q

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import tensorflow as tf 
import plotly.express as px
import plotly.graph_objs as go
from tensorflow.keras import models,layers
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore',category=UserWarning)



In [3]:
dftrain_raw = pd.read_csv('./Data/titanic_data/train.csv')
dftest_raw = pd.read_csv('./Data/titanic_data/test.csv')
dftrain_raw.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Description of each field:

* Survived: 0 for death and 1 for survival [y labels]
* Pclass: Ticket class, with three possible values (1, 2, 3) [converted to one-hot encoding]
* Name: Name of each passenger [discarded]
* Sex: Gender of each passenger [converted to 0/1 indicator columns]
* Age: Age of each passenger (partly missing) [numerical feature; an auxiliary "whether age is missing" feature is added]
* SibSp: Number of siblings and spouses of each passenger (integer) [numerical feature]
* Parch: Number of parents/children of each passenger (integer) [numerical feature]
* Ticket: Ticket number (string) [discarded]
* Fare: Ticket price of each passenger (float, between 0 and 500) [numerical feature]
* Cabin: Cabin of each passenger (partly missing) [an auxiliary "whether cabin is missing" feature is added]
* Embarked: Port where each passenger embarked; possible values are S, C, Q (partly missing) [converted to one-hot encoding with four dimensions: S, C, Q, nan]
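
As a small illustration of the encodings listed above, the sketch below applies one-hot encoding and a missing-value indicator to a toy DataFrame. The three rows of data are made up for the example and are not taken from the Titanic files.

```python
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the encodings.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Age": [22.0, None, 35.0],
    "Embarked": ["S", None, "C"],
})

# One-hot encode Pclass: one 0/1 column per distinct class value.
pclass = pd.get_dummies(toy["Pclass"], prefix="Pclass").astype("int32")

# Fill missing ages with 0 and keep the missingness as an auxiliary flag.
age = toy["Age"].fillna(0).rename("Age")
age_null = toy["Age"].isna().astype("int32").rename("Age_null")

# dummy_na=True adds one extra column that marks missing Embarked values.
embarked = pd.get_dummies(toy["Embarked"], prefix="Embarked",
                          dummy_na=True).astype("int32")

encoded = pd.concat([pclass, age, age_null, embarked], axis=1)
print(encoded)
```

The missing age becomes 0 in the Age column, and the Age_null flag lets the model tell a filled-in 0 apart from a real value.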

Use data visualization for initial EDA (Exploratory Data Analysis).

Survival label distribution:

In [4]:
%matplotlib inline
# Count the values of 'Survived' column
counts = dftrain_raw['Survived'].value_counts()

# Create the bar chart using Plotly Express
fig = px.bar(x=counts.index, y=counts.values,
             labels={'x': 'Survived', 'y': 'Counts'},
             title='Survived Counts',
             width=800, height=500)

# Show the plot
fig.show()

Age distribution:

In [5]:
# Create the histogram using Plotly Express
fig = px.histogram(dftrain_raw, x='Age', nbins=20, color_discrete_sequence=['purple'],
                   labels={'Age': 'Age', 'count': 'Frequency'},
                   title='Age Distribution',
                   width=800, height=500)

# Show the plot
fig.show()

Correlation between age and survival label:

In [6]:
# Create histogram traces for 'Survived==0' and 'Survived==1',
# normalized so the y-axis shows a probability density instead of raw counts
trace_survived_0 = go.Histogram(x=dftrain_raw.query('Survived == 0')['Age'], opacity=0.7,
                                histnorm='probability density',
                                name='Survived==0', marker=dict(color='red'))
trace_survived_1 = go.Histogram(x=dftrain_raw.query('Survived == 1')['Age'], opacity=0.7,
                                histnorm='probability density',
                                name='Survived==1', marker=dict(color='green'))

# Create the figure holding both traces
fig = go.Figure(data=[trace_survived_0, trace_survived_1])

# Overlay the two histograms and add axis labels and a title
fig.update_layout(title='Age Density by Survival', barmode='overlay',
                  xaxis_title='Age', yaxis_title='Density')

# Show the plot
fig.show()

Below is the code for the formal data preprocessing:

In [7]:
def preprocessing(dfdata):
    dfresult= pd.DataFrame()

    #Pclass
    dfPclass = pd.get_dummies(dfdata['Pclass'])
    dfPclass.columns = ['Pclass_' +str(x) for x in dfPclass.columns ]
    dfresult = pd.concat([dfresult,dfPclass],axis = 1)

    #Sex
    dfSex = pd.get_dummies(dfdata['Sex'])
    dfresult = pd.concat([dfresult,dfSex],axis = 1)

    #Age
    dfresult['Age'] = dfdata['Age'].fillna(0)
    dfresult['Age_null'] = pd.isna(dfdata['Age']).astype('int32')

    #SibSp,Parch,Fare
    dfresult['SibSp'] = dfdata['SibSp']
    dfresult['Parch'] = dfdata['Parch']
    dfresult['Fare'] = dfdata['Fare']

    #Cabin
    dfresult['Cabin_null'] =  pd.isna(dfdata['Cabin']).astype('int32')

    #Embarked
    dfEmbarked = pd.get_dummies(dfdata['Embarked'],dummy_na=True)
    dfEmbarked.columns = ['Embarked_' + str(x) for x in dfEmbarked.columns]
    dfresult = pd.concat([dfresult,dfEmbarked],axis = 1)

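    # Cast to float32 so every column shares one numeric dtype
    # (recent pandas versions return bool columns from get_dummies)
    return dfresult.astype('float32')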


In [8]:
# Preprocess the data
x_train = preprocessing(dftrain_raw)
y_train = dftrain_raw['Survived'].values

# Split the data into training and testing sets
train_size = 0.8  # You can adjust the train_size as needed
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, train_size=train_size, random_state=42)

print("x_train.shape =", x_train.shape)

x_train.shape = (712, 15)


### 2. Model Definition

Keras offers three common ways to build a model: sequential modeling with the Sequential() function, arbitrary architectures with the functional API, and customized modeling by subclassing the base class Model.

Here we take the simplest approach: sequential modeling with Sequential().
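
For comparison, the same 15-20-10-1 network can be sketched with the functional API, the second of the three approaches. This is only an illustration; the name functional_model is made up for the example, and the rest of this chapter keeps using the Sequential version.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Same architecture as the Sequential model, written with the functional API:
# each layer is called on the output tensor of the previous one.
inputs = tf.keras.Input(shape=(15,))
x = layers.Dense(20, activation="relu")(inputs)
x = layers.Dense(10, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

# functional_model is a hypothetical name used only in this sketch.
functional_model = tf.keras.Model(inputs=inputs, outputs=outputs)
print(functional_model.count_params())
```

The functional form becomes useful once a model needs several inputs, several outputs, or shared layers, which Sequential() cannot express.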

In [9]:
tf.keras.backend.clear_session()

model = models.Sequential()
model.add(layers.Dense(20,activation = 'relu',input_shape=(15,)))
model.add(layers.Dense(10,activation = 'relu' ))
model.add(layers.Dense(1,activation = 'sigmoid' ))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 20)                320       
                                                                 
 dense_1 (Dense)             (None, 10)                210       
                                                                 
 dense_2 (Dense)             (None, 1)                 11        
                                                                 
Total params: 541
Trainable params: 541
Non-trainable params: 0
_________________________________________________________________
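
The parameter counts in the summary follow directly from the layer sizes: a Dense layer with n inputs and m units has n*m weights plus m biases. The short check below reproduces the numbers; the helper dense_params is made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the number of units in each Dense layer.
layer_sizes = [15, 20, 10, 1]

counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)  # [320, 210, 11] 541
```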


### 3. Model Training

There are three common ways to train a model: the built-in fit method, the built-in train_on_batch method, and a customized training loop. Here we use the simplest one: the built-in fit method.

In [10]:
# Use binary cross entropy loss function for binary classification
model.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['AUC'])

history = model.fit(x_train,y_train,
                    batch_size= 64,
                    epochs= 30,
                    validation_split=0.2 #Split part of the training data for validation
            )

Epoch 1/30
...
Epoch 30/30
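
For comparison, the customized training loop mentioned above could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the method used in this chapter; x_demo, y_demo, demo_model, the batch size of 16 and the 3 epochs are all assumptions made up for the example.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Synthetic stand-in data (hypothetical): 64 rows, 15 features, binary labels.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(64, 15)).astype("float32")
y_demo = rng.integers(0, 2, size=(64, 1)).astype("float32")

demo_model = models.Sequential([
    tf.keras.Input(shape=(15,)),
    layers.Dense(20, activation="relu"),
    layers.Dense(10, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices((x_demo, y_demo)).batch(16)

epoch_losses = []
for epoch in range(3):
    batch_losses = []
    for x_batch, y_batch in dataset:
        # Record the forward pass so gradients can be computed afterwards.
        with tf.GradientTape() as tape:
            preds = demo_model(x_batch, training=True)
            loss = loss_fn(y_batch, preds)
        # Backpropagate and apply one optimizer step.
        grads = tape.gradient(loss, demo_model.trainable_variables)
        optimizer.apply_gradients(zip(grads, demo_model.trainable_variables))
        batch_losses.append(float(loss))
    epoch_losses.append(sum(batch_losses) / len(batch_losses))
    print(f"epoch {epoch + 1}: mean loss = {epoch_losses[-1]:.4f}")
```

A loop like this takes more code than fit, but it gives full control over what happens in each step.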


### 4. Model Evaluation

First, we evaluate the model performance on the training and validation datasets.

In [11]:
def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)

    # Convert 'epochs' to a list
    epochs_list = list(epochs)

    # Create traces for train and validation metrics
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=epochs_list, y=train_metrics, mode='lines+markers',
                             line=dict(dash='dash'), name='train_'+metric))
    fig.add_trace(go.Scatter(x=epochs_list, y=val_metrics, mode='lines+markers',
                             line=dict(dash='solid'), name='val_'+metric))

    # Update the layout and add the legend
    fig.update_layout(title='Training and Validation ' + metric,
                      xaxis_title='Epochs', yaxis_title=metric,
                      legend_title_text='Metrics', legend=dict(font=dict(size=12)))

    # Show the plot
    fig.show()



In [12]:
plot_metric(history,"loss")

In [13]:
plot_metric(history,"auc")

Let's take a look at the performance on the testing dataset.

In [14]:
model.evaluate(x = x_test,y = y_test)



[0.5149889588356018, 0.8519304990768433]
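
The two numbers returned by evaluate are, in order, the loss (binary cross-entropy) and the AUC metric set in compile. To show what they measure, the sketch below computes both by hand with NumPy on four made-up predictions. Note that the Keras AUC metric approximates the curve with a fixed set of thresholds, so it may differ slightly from an exact pairwise calculation like this one.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)].
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# AUC as the fraction of (positive, negative) pairs in which the positive
# example gets the higher score; ties count as half.
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"bce = {bce:.4f}, auc = {auc:.2f}")
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.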

### 5. Model Application

In [15]:
# Predict the probabilities
model.predict(x_test[0:10])
#model(tf.constant(x_test[0:10].values,dtype = tf.float32)) # Equivalent call



array([[0.27141356],
       [0.30842742],
       [0.2038404 ],
       [0.6310907 ],
       [0.41614658],
       [0.63005185],
       [0.32723108],
       [0.25263572],
       [0.4102702 ],
       [0.6455285 ]], dtype=float32)

In [16]:
# Predict the classes
predictions = model.predict(x_test[0:10])
# Round off the predictions
np.round(predictions)



array([[0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.]], dtype=float32)
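
Rounding works here because the sigmoid output is a probability between 0 and 1. An explicit threshold is an alternative that makes the cut-off visible and returns integer labels. The sketch below uses made-up probabilities; the 0.5 threshold is an assumption that matches the rounding above, not a tuned value.

```python
import numpy as np

# Hypothetical predicted probabilities, shaped like the output of model.predict.
probs = np.array([[0.27], [0.63], [0.5], [0.41]], dtype=np.float32)

# np.round rounds halves to the nearest even value, so 0.5 becomes 0.0.
rounded = np.round(probs)

# Explicit threshold: strictly above 0.5 counts as class 1.
labels = (probs > 0.5).astype("int32")

print(rounded.ravel(), labels.ravel())
```

Changing the threshold trades off false positives against false negatives, which plain rounding does not allow.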

### 6. Model Saving


The trained model can be saved either in the Keras format or in the native TensorFlow format. The former can only be loaded back from Python, while the latter allows cross-platform deployment.

The native TensorFlow format is the recommended way to save the model.

**(1) Model Saving with Keras**

In [17]:
# Saving model structure and parameters

model.save('./data/keras_model.h5')  

del model  #Deleting current model

# The reloaded model is identical to the one that was saved
model = models.load_model('./data/keras_model.h5')
model.evaluate(x_test,y_test)



[0.5149889588356018, 0.8519304990768433]

In [18]:
# Saving the model structure
json_str = model.to_json()

# Retrieving the model structure
model_json = models.model_from_json(json_str)

In [19]:
# Saving the weights of the model
model.save_weights('./data/keras_model_weight.h5')

# Retrieving the model structure
model_json = models.model_from_json(json_str)
model_json.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['AUC']
    )

# Load the weights
model_json.load_weights('./data/keras_model_weight.h5')
model_json.evaluate(x_test,y_test)



[0.5149889588356018, 0.8519304990768433]

**(2) Model Saving with the Native TensorFlow Format**

In [20]:
# Saving the weights; this way only saves the weight tensors, not the model structure
model.save_weights('./data/tf_model_weights.ckpt',save_format = "tf")

In [21]:
# Saving the model structure and parameters to disk, so the model can be deployed across platforms

model.save('./data/tf_model_savedmodel', save_format="tf")
print('export saved model.')

model_loaded = tf.keras.models.load_model('./data/tf_model_savedmodel')
model_loaded.evaluate(x_test,y_test)

export saved model.


[0.5149889588356018, 0.8519304990768433]

Thank you for joining the "Eat That TensorFlow2.0 in 30 Days" series - happy learning and keep exploring the world of TensorFlow! 🙏🚀