# Logistic Regression with TensorFlow

I know TensorFlow is overkill for this task, but I have just learned it and I want to apply what I've learned. In this notebook I build a logistic regression model with TensorFlow. So, let's begin!

First, I import the libraries and load the data.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf


In [2]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

## Preprocessing the Data
Let's take a quick look at the data.

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [6]:
test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


The goal of this project is to predict whether a passenger survived. I assume that *Name*, *Ticket*, *Fare*, and *Embarked* are not related to survival, so I delete those columns from the table. There are also several *NaN* values in the table; I replace them with 0.

In [7]:
del train['Name']
del train['Ticket']
del train['Fare']
del train['Embarked']

In [8]:
train = train.fillna(value=0.0)
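The dropping and filling steps above can also be written as one chain. This is only a minimal sketch on a made-up three-row frame (the `toy` frame and its values are hypothetical, not taken from the Titanic data); it also counts the missing values first, so it is visible what the 0 will overwrite.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for train.csv.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Ticket': ['t1', 't2', 't3'],
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', 'C', 'S'],
    'Age': [22.0, np.nan, 35.0],
    'Pclass': [3, 1, 3],
})

# Count missing values per column before overwriting them.
missing = toy.isna().sum()

# Drop the unused columns, then replace every remaining NaN with 0.
cleaned = toy.drop(columns=['Name', 'Ticket', 'Fare', 'Embarked']).fillna(0.0)
print(int(missing['Age']), list(cleaned.columns))
```

Note that a missing *Age* filled with 0 is afterwards indistinguishable from a real age of 0, which matters for the grouping in the next step.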

1. First, let's preprocess the *Sex*. Just replace it with 0 (Female) or 1 (Male).
2. Then, let's handle the *Age*. I turn the continuous age into a categorical feature by grouping it into 8 groups: group 0 holds the missing ages (already replaced with 0) together with ages up to 10, and groups 1 to 7 cover 10-20, 20-30, ..., 70-80. The `describe` output above shows that the maximum age is 80.
3. *Cabin* is quite interesting. It is stored as a string, and the format appears to be *Cabin Section + Cabin Number*. I'm only interested in the *Cabin Section*.

In [9]:
# 1 for male, 0 for female
train['Sex'] = (train['Sex'] == 'male').astype(int)

In [10]:
train['Age_group'] = 0
for i in range(train.shape[0]):
    for j in range(70, 0, -10):
        if train.at[i, 'Age'] > j:
            train.at[i, 'Age_group'] = int(j/10)
            break
del train['Age'] # no longer needed
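The nested loop above assigns each age the number of thresholds (10, 20, ..., 70) that lie strictly below it. As a sketch, the same rule can be expressed with `np.searchsorted`; the helper name `age_to_group` is made up for this example.

```python
import numpy as np

# Thresholds tested by the loop: 10, 20, ..., 70.
thresholds = np.arange(10, 80, 10)

def age_to_group(ages):
    """Return, for each age, how many thresholds lie strictly below it."""
    ages = np.asarray(ages, dtype=float)
    # side='left' counts the thresholds that are strictly smaller than the age.
    return np.searchsorted(thresholds, ages, side='left')

sample_ages = [0.0, 10.0, 22.0, 35.0, 70.5, 80.0]
print(age_to_group(sample_ages).tolist())  # [0, 0, 2, 3, 7, 7]
```

An age of exactly 10 stays in group 0 because the loop uses `>` rather than `>=`.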

In [11]:
print(list(set(train['Cabin'].values))[:10]) # sample of 'Cabin' values
train['Cabin_section'] = '0'
for i in range(train.shape[0]):
    if train.at[i, 'Cabin'] != 0:
        train.at[i, 'Cabin_section'] = train.at[i, 'Cabin'][0]
CABIN_SECTION = list(set(train['Cabin_section'].values)) # will be reused for test data
print(CABIN_SECTION) # 'Cabin_Section' values
for i in range(train.shape[0]):
    train.at[i, 'Cabin_section'] = CABIN_SECTION.index(train.at[i, 'Cabin_section'])
del train['Cabin'] # no longer needed

[0.0, 'C99', 'E68', 'C54', 'E49', 'D10 D12', 'D35', 'B3', 'C87', 'C93']
['B', 'C', '0', 'A', 'D', 'E', 'T', 'F', 'G']
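The cabin handling boils down to "take the first letter, then look up its position in a list of known sections". Here is a small sketch of that idea in plain Python; `cabin_section` and `vocabulary` are names invented for this example, and the sample cabins are picked from the values printed above.

```python
def cabin_section(cabin):
    """Return the leading letter of a cabin string, or '0' for a missing cabin (stored as 0)."""
    return '0' if cabin == 0 else str(cabin)[0]

cabins = [0.0, 'C85', 'D10 D12', 'B3', 0.0, 'C123']
sections = [cabin_section(c) for c in cabins]

# sorted() gives a stable order; list(set(...)) may order the letters
# differently from one Python process to the next.
vocabulary = sorted(set(sections))
codes = [vocabulary.index(s) for s in sections]
print(sections, vocabulary, codes)
```

Using `sorted` is a deliberate deviation from the cell above: it keeps the integer codes the same across runs, which helps if the encoding ever has to be reproduced.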


I'm done with the preprocessing. Here is the result.

In [12]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Age_group,Cabin_section
0,1,0,3,1,1,0,2,2
1,2,1,1,0,1,0,3,1
2,3,1,3,0,0,0,2,2
3,4,1,1,0,1,0,3,1
4,5,0,3,1,0,0,3,2


Next, I prepare the NumPy arrays that serve as the input to TensorFlow. I need to convert the categorical data (*Pclass*, *Age_group*, and *Cabin_section*) into *one-hot* arrays using `np.eye`. Then, I divide the data into a training set and a dev set.

In [13]:
pclass = np.eye(train['Pclass'].values.max()+1)[train['Pclass'].values]
age_group = np.eye(train['Age_group'].values.max()+1)[train['Age_group'].values]
cabin_section = np.eye(train['Cabin_section'].values.max()+1) \
                    [train['Cabin_section'].values.astype(int)] # prevent IndexError
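The `np.eye(n)[indices]` trick works because row k of the identity matrix is exactly the one-hot vector for class k. A minimal sketch, with a hypothetical helper `one_hot` that takes the number of classes explicitly:

```python
import numpy as np

def one_hot(indices, num_classes):
    """Pick rows of the identity matrix: row k is the one-hot vector for class k."""
    indices = np.asarray(indices, dtype=int)
    return np.eye(num_classes)[indices]

pclass_values = [3, 1, 3, 2]
# num_classes=4 mirrors max()+1 above; column 0 is never set because Pclass starts at 1.
encoded = one_hot(pclass_values, num_classes=4)
print(encoded.shape)  # (4, 4)
```

Passing `num_classes` explicitly, instead of deriving it from `max()+1`, would keep the width identical for the training and test data even if one of them happened to lack the largest category.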

In [14]:
X = train[['Sex', 'SibSp', 'Parch']].values
X = np.concatenate([X, age_group], axis=1)
X = np.concatenate([X, pclass], axis=1)
X = np.concatenate([X, cabin_section], axis=1)
X = X.astype(float)

y = train['Survived'].values
y = y.astype(float).reshape(-1, 1)

In [15]:
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.1, random_state=0)

In [16]:
print(X_train.shape, y_train.shape)

(801, 24) (801, 1)
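To make the split less of a black box, here is a rough NumPy sketch of a shuffled hold-out split. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; the function name `holdout_split` and the rounding rule are assumptions chosen so that the shapes match the ones printed above.

```python
import numpy as np

def holdout_split(X, y, dev_fraction=0.1, seed=0):
    """Shuffle the row indices once and use the first rows of that order as the dev set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_dev = int(np.ceil(dev_fraction * len(X)))
    dev_idx, train_idx = order[:n_dev], order[n_dev:]
    return X[train_idx], X[dev_idx], y[train_idx], y[dev_idx]

# Placeholder arrays with the same shapes as X and y in this notebook.
X_toy = np.zeros((891, 24))
y_toy = np.zeros((891, 1))
X_tr, X_dv, y_tr, y_dv = holdout_split(X_toy, y_toy)
print(X_tr.shape, X_dv.shape)  # (801, 24) (90, 24)
```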


Repeat the preprocessing for the test data as well.

In [17]:
del test['Name']
del test['Ticket']
del test['Fare']
del test['Embarked']

test = test.fillna(value=0.0)

test['Age_group'] = 0
test['Cabin_section'] = '0'
for i in range(test.shape[0]):
    if test.at[i, 'Sex'] == 'male':
        test.at[i, 'Sex'] = 1
    else:
        test.at[i, 'Sex'] = 0

    for j in range(70, 0, -10):
        if test.at[i, 'Age'] > j:
            test.at[i, 'Age_group'] = int(j/10)
            break

    if test.at[i, 'Cabin'] != 0:
        test.at[i, 'Cabin_section'] = test.at[i, 'Cabin'][0]
    test.at[i, 'Cabin_section'] = CABIN_SECTION.index(test.at[i, 'Cabin_section'])

del test['Cabin'] # no longer needed
del test['Age'] # no longer needed

In [18]:
test.head()

Unnamed: 0,PassengerId,Pclass,Sex,SibSp,Parch,Age_group,Cabin_section
0,892,3,1,0,0,3,2
1,893,3,0,1,0,4,2
2,894,2,1,0,0,6,2
3,895,3,1,0,0,2,2
4,896,3,0,1,1,2,2


In [19]:
pclass_test = np.eye(test['Pclass'].values.max()+1)[test['Pclass'].values]
age_group_test = np.eye(test['Age_group'].values.max()+1)[test['Age_group'].values]
cabin_section_test = np.eye(test['Cabin_section'].values.max()+1) \
                    [test['Cabin_section'].values.astype(int)] # prevent IndexError

X_test = test[['Sex', 'SibSp', 'Parch']].values
X_test = np.concatenate([X_test, age_group_test], axis=1)
X_test = np.concatenate([X_test, pclass_test], axis=1)
X_test = np.concatenate([X_test, cabin_section_test], axis=1)
X_test = X_test.astype(float)

id_test = test['PassengerId'].values
id_test = id_test.reshape(-1, 1)

In [20]:
print(X_test.shape, id_test.shape)

(418, 24) (418, 1)


## Building the Neural Network
Let's start by defining the hyperparameters.

In [21]:
seed = 7 # for reproducibility
input_size = X_train.shape[1] # number of features
learning_rate = 0.001 # default learning rate of tf.train.AdamOptimizer
epochs = 8500 # number of full-batch training steps; chosen from earlier runs to avoid overfitting

The logistic regression model looks like this: pred = sigmoid(X\*W1 + b1), where \* is matrix multiplication and the sigmoid is the activation function at the output layer. *Cross entropy* and the *Adam optimizer* are used as the loss function and optimizer.
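Written out, the model and the loss look as follows. The symbols $m$ (number of rows in the batch) and $\hat{y}$ (predicted probability) are notation introduced here for illustration, not names from the code.

$$
\hat{y} = \sigma(X W_1 + b_1), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
$$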

In [22]:
graph = tf.Graph()
with graph.as_default():
    tf.set_random_seed(seed)
    np.random.seed(seed)

    X_input = tf.placeholder(dtype=tf.float32, shape=[None, input_size], name='X_input')
    y_input = tf.placeholder(dtype=tf.float32, shape=[None, 1], name='y_input')
    
    W1 = tf.Variable(tf.random_normal(shape=[input_size, 1], seed=seed), name='W1')
    b1 = tf.Variable(tf.random_normal(shape=[1], seed=seed), name='b1')
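    logits = tf.add(tf.matmul(X_input, W1), b1, name='logits')
    sigm = tf.nn.sigmoid(logits, name='sigm') # predicted probability
    
    # the loss op applies the sigmoid itself, so it takes the raw logits
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_input,
                                                                  logits=logits, name='loss'))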
    train_steps = tf.train.AdamOptimizer(learning_rate).minimize(loss)

    pred = tf.cast(tf.greater_equal(sigm, 0.5), tf.float32, name='pred') # 1 if >= 0.5
    acc = tf.reduce_mean(tf.cast(tf.equal(pred, y_input), tf.float32), name='acc')
    
    init_var = tf.global_variables_initializer()
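`tf.nn.sigmoid_cross_entropy_with_logits` expects the values before the sigmoid as `logits`. The NumPy sketch below re-implements the numerically stable formula given in the TensorFlow documentation, `max(x, 0) - x*z + log(1 + exp(-abs(x)))`, to show what happens when already-squashed probabilities are passed instead. The function names and the sample numbers are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_from_logits(labels, logits):
    """Mean of max(z, 0) - z*y + log(1 + exp(-|z|)), a stable form of the sigmoid cross entropy."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    return float(np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))))

logits = np.array([-4.0, -1.0, 0.5, 3.0])
labels = np.array([0.0, 0.0, 1.0, 1.0])

correct = cross_entropy_from_logits(labels, logits)
# Feeding probabilities where logits are expected applies the sigmoid twice.
double_sigmoid = cross_entropy_from_logits(labels, sigmoid(logits))
print(correct, double_sigmoid)
```

With inputs restricted to the range (0, 1), the per-example loss cannot drop below log(1 + e^-1) ≈ 0.31 for a positive label or log 2 ≈ 0.69 for a negative one, so the reported loss stays high no matter how well the weights fit.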

In [23]:
train_feed_dict = {X_input: X_train, y_input: y_train}
dev_feed_dict = {X_input: X_dev, y_input: y_dev}
test_feed_dict = {X_input: X_test} # no y_input since the goal is to predict it

## Training the Network
Let's start the training. I first create the session and initialize the variables, then run the training loop. During training, the loss and accuracy are printed every 100 steps.

In [24]:
sess = tf.Session(graph=graph)
sess.run(init_var)

In [25]:
# note: 'test_acc' is the accuracy on the dev set
cur_loss, train_acc = sess.run([loss, acc], feed_dict=train_feed_dict)
test_acc = sess.run(acc, feed_dict=dev_feed_dict)
print('step 0: loss {0:.5f}, train_acc {1:.2f}%, test_acc {2:.2f}%'.format(
                       cur_loss, 100*train_acc, 100*test_acc))
for step in range(1, epochs+1):
    sess.run(train_steps, feed_dict=train_feed_dict)
    if step%100 != 0: # evaluate and print only every 100 steps
        continue
    cur_loss, train_acc = sess.run([loss, acc], feed_dict=train_feed_dict)
    test_acc = sess.run(acc, feed_dict=dev_feed_dict)
    print('step {3}: loss {0:.5f}, train_acc {1:.2f}%, test_acc {2:.2f}%'.format(
                       cur_loss, 100*train_acc, 100*test_acc, step))

step 0: loss 0.72769, train_acc 65.17%, test_acc 67.78%
step 100: loss 0.71315, train_acc 65.04%, test_acc 60.00%
step 200: loss 0.70363, train_acc 64.17%, test_acc 58.89%
step 300: loss 0.69718, train_acc 63.80%, test_acc 61.11%
step 400: loss 0.69236, train_acc 64.42%, test_acc 62.22%
step 500: loss 0.68835, train_acc 65.04%, test_acc 63.33%
step 600: loss 0.68477, train_acc 66.04%, test_acc 67.78%
step 700: loss 0.68153, train_acc 67.42%, test_acc 68.89%
step 800: loss 0.67860, train_acc 68.79%, test_acc 70.00%
step 900: loss 0.67600, train_acc 69.79%, test_acc 70.00%
step 1000: loss 0.67370, train_acc 70.41%, test_acc 71.11%
step 1100: loss 0.67168, train_acc 70.79%, test_acc 71.11%
step 1200: loss 0.66988, train_acc 71.91%, test_acc 72.22%
step 1300: loss 0.66826, train_acc 72.16%, test_acc 74.44%
step 1400: loss 0.66677, train_acc 72.41%, test_acc 75.56%
step 1500: loss 0.66539, train_acc 73.41%, test_acc 75.56%
step 1600: loss 0.66411, train_acc 74.28%, test_acc 75.56%
step 1700
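To see what one training step does under the hood, here is a small NumPy sketch of logistic regression trained with plain full-batch gradient descent. It swaps Adam for the simplest possible update rule and uses synthetic data, so it illustrates the idea only and does not reproduce the numbers above. All names (`train_logreg`, `X_toy`, `y_toy`) are invented for this example.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, steps=500):
    """Full-batch gradient descent on the mean sigmoid cross entropy."""
    W = np.zeros((X.shape[1], 1))
    b = 0.0
    losses = []
    for _ in range(steps):
        z = X @ W + b                    # logits
        p = 1.0 / (1.0 + np.exp(-z))     # predicted probabilities
        # Stable cross entropy computed from the logits.
        loss = np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))
        losses.append(float(loss))
        grad_z = (p - y) / len(X)        # gradient of the mean loss w.r.t. the logits
        W -= lr * (X.T @ grad_z)
        b -= lr * float(grad_z.sum())
    return W, b, losses

# Synthetic, linearly separable data: the label is 1 when the first two features sum to > 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, :1] + X_toy[:, 1:2] > 0).astype(float)

W, b, losses = train_logreg(X_toy, y_toy)
# A logit >= 0 corresponds to a probability >= 0.5.
accuracy = float(np.mean(((X_toy @ W + b) >= 0) == (y_toy == 1)))
print(losses[0], losses[-1], accuracy)
```

The first recorded loss is log 2 because all weights start at zero, which makes every prediction 0.5.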

## Evaluating the Network
The network does not perform very well (only around 80% accuracy). Finally, I prepare the predictions for the test set.

In [26]:
y_pred = sess.run(pred, feed_dict=test_feed_dict).astype(int)
prediction = pd.DataFrame(np.concatenate([id_test, y_pred], axis=1),
                          columns=['PassengerId', 'Survived'])

In [27]:
prediction.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
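The prediction step is just "threshold the probability at 0.5 and put it next to the passenger id". A minimal NumPy sketch of that pairing, with a hypothetical helper `to_submission_rows` and made-up probabilities:

```python
import numpy as np

def to_submission_rows(passenger_ids, probabilities, threshold=0.5):
    """Pair each id with 1 if its probability is >= threshold, else 0."""
    ids = np.asarray(passenger_ids, dtype=int).reshape(-1, 1)
    probs = np.asarray(probabilities, dtype=float).reshape(-1, 1)
    labels = (probs >= threshold).astype(int)
    return np.concatenate([ids, labels], axis=1)

rows = to_submission_rows([892, 893, 894], [0.12, 0.50, 0.81])  # made-up probabilities
print(rows.tolist())  # [[892, 0], [893, 1], [894, 1]]
```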


## Takeaways
1. I don't think I did enough Exploratory Data Analysis, which I think is crucial at the beginning of a project.
2. 80% accuracy on the train and dev sets is actually not very good. I think other models such as Random Forest may produce better accuracy.
3. Even if Logistic Regression should be used, using TensorFlow is not very efficient. There are many built-in implementations of Logistic Regression in other libraries (e.g. Scikit-Learn).
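For comparison with the third point, this is roughly what the same kind of model looks like with Scikit-Learn. It is a sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up), and it uses the library defaults, which include L2 regularization, so it is not an exact equivalent of the TensorFlow graph above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label is 1 when the first two features sum to > 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

X_tr, X_dv, y_tr, y_dv = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
train_accuracy = model.score(X_tr, y_tr)
dev_accuracy = model.score(X_dv, y_dv)
print(train_accuracy, dev_accuracy)
```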

Any feedback is very welcome!