# Kaggle Titanic
## Logistic Regression with Python


For this lecture we will be working with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic). This is a very famous dataset.


# Step - 1 : Frame The Problem

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.



# Step - 2 : Obtain the Data

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Pandas provides two core data types, Series and DataFrame, each with built-in methods for handling data.

Pandas can read data from many sources through functions such as `read_csv`, `read_excel`, and `read_html`. The data is loaded into DataFrames.
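As a minimal sketch of the two data types and of `read_csv`, the snippet below builds a small Series and reads a DataFrame from an in-memory CSV string. The sample values and the names `ages`, `csv_text`, and `df` are illustrative assumptions, not part of the Titanic files.

```python
import io
import pandas as pd

# A Series is a one-dimensional labelled array.
ages = pd.Series([22.0, 38.0, 26.0], index=[1, 2, 3], name="Age")

# A DataFrame is a two-dimensional table whose columns are Series.
# The CSV text below is made-up sample data, not the real Titanic file.
csv_text = "PassengerId,Survived,Age\n1,0,22.0\n2,1,38.0\n3,1,26.0\n"
df = pd.read_csv(io.StringIO(csv_text))

print(type(ages).__name__)       # Series
print(type(df).__name__)         # DataFrame
print(type(df["Age"]).__name__)  # Series
print(df.shape)                  # (3, 3)
```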

In [2]:
# FOR GOOGLE COLAB, UNCOMMENT LINES BELOW

# !wget -q https://www.dropbox.com/s/8grgwn4b6y25frw/titanic.csv
# !ls -l
# data = pd.read_csv('titanic.csv')

In [3]:
# FOR JUPYTER NOTEBOOK, UNCOMMENT LINES BELOW

data = pd.read_csv('Titanic Data-Train.csv')

In [4]:
data.set_index('PassengerId', inplace=True)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [6]:
data.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


# Step - 3 : Analyse the Data

#### What do you observe from the above charts?

# Step - 4 : Feature Engineering

## Feature Engineering

We want to fill the missing `Age` values with the average age of each passenger class (`Pclass`). This is called data imputation.

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [8]:
# BL NOTE - SKIP THIS AND USE THE PANDAS CODE I CREATED BELOW
# def impute_age(cols):
#     Age = cols[0]
#     Pclass = cols[1]
    
#     if pd.isnull(Age):
#         # Class-1
#         if Pclass == 1:
#             return 37
#         # Class-2 
#         elif Pclass == 2:
#             return 29
#         # Class-3
#         else:
#             return 24

#     else:
#         return Age

Applying the function.

In [9]:
# data['Age'] = data[['Age','Pclass']].apply(impute_age,axis=1)

In [10]:
# BL NOTE - USE MY CODE TO GET THE MEAN VALUE FOR EACH CLASS INSTEAD OF HARD CODED NUMBERS
mean_age_Pclass1 = int(data.loc[data['Pclass']==1]['Age'].mean())
mean_age_Pclass2 = int(data.loc[data['Pclass']==2]['Age'].mean())
mean_age_Pclass3 = int(data.loc[data['Pclass']==3]['Age'].mean())

data.loc[(data['Age'].isnull() ) & (data['Pclass']==1), 'Age'] = mean_age_Pclass1
data.loc[(data['Age'].isnull() ) & (data['Pclass']==2), 'Age'] = mean_age_Pclass2
data.loc[(data['Age'].isnull() ) & (data['Pclass']==3), 'Age'] = mean_age_Pclass3
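The per-class imputation can also be sketched with `groupby` and `transform`, which avoids one line per class. This is an alternative under stated assumptions, not the notebook's method: the rows in `toy` are made up, and unlike the cell above the class means are not truncated with `int()`.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names match the Titanic data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# Mean age per passenger class, broadcast back to every row of that class.
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
toy["Age"] = toy["Age"].fillna(class_mean_age)

print(toy)
```

Here the missing ages become 38.0, 30.0, and 24.0 for classes 1, 2, and 3, the means of the known ages in each class.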

Now let's check for remaining missing values.

The Age column is imputed successfully.

Let's drop the Cabin column and the rows where Embarked is NaN.

In [11]:
data.drop('Cabin', axis = 1,inplace=True)

In [12]:
data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [13]:
data.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    2
dtype: int64

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 76.6+ KB


In [15]:
data.dropna(inplace = True)

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 10 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Name        889 non-null object
Sex         889 non-null object
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Ticket      889 non-null object
Fare        889 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 76.4+ KB


## Converting Categorical Features 

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
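A minimal sketch of what dummy variables look like, and of what `drop_first` does, is shown below. The four sample values in `toy` are made up; only the category labels match the `Embarked` column.

```python
import pandas as pd

# Made-up sample values using the same category labels as the Titanic data.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category.
full = pd.get_dummies(toy["Embarked"])

# drop_first=True drops the first category (alphabetically, "C"),
# which is then implied when both Q and S are 0.
reduced = pd.get_dummies(toy["Embarked"], drop_first=True)

print(list(full.columns))     # ['C', 'Q', 'S']
print(list(reduced.columns))  # ['Q', 'S']
```

Dropping one column per feature removes redundant information: with k categories, k - 1 indicator columns are enough to identify each one.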

In [17]:
sex_dummies = pd.get_dummies(data['Sex'], drop_first=True)
embark_dummies = pd.get_dummies(data['Embarked'], drop_first=True)
sex_dummies.head()

PassengerId,male
1,1
2,0
3,0
4,0
5,1


In [18]:
embark_dummies.head()

PassengerId,Q,S
1,0,1
2,0,0
3,0,1
4,0,1
5,0,1


In [19]:
data.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
data.head()

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925
4,1,1,35.0,1,0,53.1
5,0,3,35.0,0,0,8.05


In [20]:
data = pd.concat([data,sex_dummies,embark_dummies],axis=1)

In [21]:
data.head()

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,male,Q,S
1,0,3,22.0,1,0,7.25,1,0,1
2,1,1,38.0,1,0,71.2833,0,0,0
3,1,3,26.0,0,0,7.925,0,0,1
4,1,1,35.0,1,0,53.1,0,0,1
5,0,3,35.0,0,0,8.05,1,0,1


In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 9 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Fare        889 non-null float64
male        889 non-null uint8
Q           889 non-null uint8
S           889 non-null uint8
dtypes: float64(2), int64(4), uint8(3)
memory usage: 51.2 KB


In [23]:
data.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare,male,Q,S
count,889.0,889.0,889.0,889.0,889.0,889.0,889.0,889.0,889.0
mean,0.382452,2.311586,29.20604,0.524184,0.382452,32.096681,0.649044,0.086614,0.724409
std,0.48626,0.8347,13.177747,1.103705,0.806761,49.697504,0.477538,0.281427,0.447063
min,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,22.0,0.0,0.0,7.8958,0.0,0.0,0.0
50%,0.0,3.0,26.0,0.0,0.0,14.4542,1.0,0.0,1.0
75%,1.0,3.0,36.5,1.0,0.0,31.0,1.0,0.0,1.0
max,1.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0


# Step - 5 : Model Selection

In [24]:
Target = 'Survived'
X = data.drop(Target,axis=1)
y = data[Target]

In [25]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score,accuracy_score, f1_score

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
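To see what `test_size = 0.2` means for a dataset of this size, the sketch below splits placeholder arrays with the same shape as the cleaned data (889 rows, 8 features). `X_demo` and `y_demo` are assumptions filled with zeros, used only to check the split sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (889) and features (8)
# as the cleaned Titanic data; the values themselves are meaningless.
X_demo = np.zeros((889, 8))
y_demo = np.zeros(889, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (711, 8) (178, 8)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the training and test sets on every run.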


In [26]:
X_train.describe()

,Pclass,Age,SibSp,Parch,Fare,male,Q,S
count,711.0,711.0,711.0,711.0,711.0,711.0,711.0,711.0
mean,2.320675,29.040211,0.527426,0.379747,32.102818,0.649789,0.085795,0.720113
std,0.829802,13.350864,1.093653,0.820055,51.872981,0.477372,0.280258,0.44926
min,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,22.0,0.0,0.0,7.9104,0.0,0.0,0.0
50%,3.0,25.0,0.0,0.0,14.4542,1.0,0.0,1.0
75%,3.0,36.0,1.0,0.0,30.5,1.0,0.0,1.0
max,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0


**Standard Scaling**: Standardize features by removing the mean and scaling to unit variance.  
Each feature is rescaled to mean 0 and standard deviation 1. The shape of its distribution is unchanged, so skewed data stays skewed.  
[Standard Scaling SK Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
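As a small worked example of the transformation z = (x - mean) / std, the sketch below scales a made-up single-feature array and compares the result with a manual calculation. The arrays `train` and `test` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[10.0], [20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

scaler = StandardScaler()
scaler.fit(train)                      # learns mean and std from train only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

# The same result by hand: z = (x - mean) / std, with the population std.
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))  # True
print(test_scaled.ravel())
```

Fitting on the training data only keeps information about the test data out of the preprocessing step.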

In [27]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)

In [28]:
X_train = pd.DataFrame(X_train)
X_train.describe()

,0,1,2,3,4,5,6,7
count,711.0,711.0,711.0,711.0,711.0,711.0,711.0,711.0
mean,-6.870578e-17,8.978597e-17,4.591576e-16,1.826949e-16,5.664324e-17,1.052448e-16,2.00574e-16,-2.492146e-16
std,1.000704,1.000704,1.000704,1.000704,1.000704,1.000704,1.000704,1.000704
min,-1.592674,-2.145206,-0.4826003,-0.4634007,-0.6193093,-1.362139,-0.3063432,-1.604015
25%,-0.3867196,-0.5276937,-0.4826003,-0.4634007,-0.4667063,-1.362139,-0.3063432,-1.604015
50%,0.819235,-0.302831,-0.4826003,-0.4634007,-0.3404671,0.7341397,-0.3063432,0.6234355
75%,0.819235,0.5216658,0.4324099,-0.4634007,-0.03092065,0.7341397,-0.3063432,0.6234355
max,0.819235,3.819653,6.837481,6.85833,9.264254,0.7341397,3.264313,0.6234355


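Notice that each column now has mean ≈ 0 and std ≈ 1. The std shows 1.000704 rather than exactly 1 because `describe()` reports the sample standard deviation (n - 1 in the denominator), while `StandardScaler` divides by the population standard deviation (n in the denominator).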
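The value 1.000704 in the `std` row can be checked with a short calculation. This sketch assumes only the row count n = 711 from the table above.

```python
import math

n = 711  # number of training rows

# StandardScaler divides by the population std (ddof=0), while
# DataFrame.describe() reports the sample std (ddof=1).
# For a column with population std 1, the sample std is sqrt(n / (n - 1)).
sample_std = math.sqrt(n / (n - 1))

print(round(sample_std, 6))  # 1.000704
```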

In [29]:
X_test = sc.transform(X_test) # reuse the mean and std learned from X_train: transform only, do not fit again

In [30]:
type(X_train)

pandas.core.frame.DataFrame

# Now let's build the ANN

#### For Jupyter Notebook, make sure Keras and TensorFlow are installed first
- [Keras Installation Instructions](https://keras.io/#installation)  
- [Tensorflow Installation Instructions](https://www.tensorflow.org/install)

In [31]:
import numpy as np
np.__version__

'1.16.3'

In [32]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [33]:
# Initializing the ANN
classifier = Sequential()

In [34]:
X_train.shape

(711, 8)

In [35]:
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 9, kernel_initializer = 'uniform', 
                     activation = 'relu', input_dim = X_train.shape[1]))

Instructions for updating:
Colocations handled automatically by placer.


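**ACTIVATION FUNCTIONS**

- relu = outputs max(0, x): negative values become 0 and positive values pass through unchanged (usually a very effective activation function for hidden layers)
- sigmoid = maps any value to a number between 0 and 1. In the output layer it gives the predicted probability of survival for a Titanic passenger
- accuracy = not an activation function; it is the metric passed to `compile`, the fraction of predictions that match the true labels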
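The two activation functions can be sketched in a few lines of NumPy. These are illustrative stand-alone definitions, not the Keras implementations; the helper names `relu` and `sigmoid` and the input `z` are assumptions.

```python
import numpy as np

def relu(x):
    # Negative inputs become 0; non-negative inputs pass through unchanged.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(sigmoid(z))
```

An input of 0 gives a sigmoid output of exactly 0.5, the point at which the model is undecided between the two classes.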

In [36]:
# Adding the second hidden layer
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))

In [37]:
# Adding the third hidden layer
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))

In [38]:
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

In [39]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [40]:
classifier.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 9)                 81        
_________________________________________________________________
dense_2 (Dense)              (None, 10)                100       
_________________________________________________________________
dense_3 (Dense)              (None, 8)                 88        
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 9         
Total params: 278
Trainable params: 278
Non-trainable params: 0
_________________________________________________________________
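The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used for this check only.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

# Layer sizes taken from the summary: 8 inputs -> 9 -> 10 -> 8 -> 1.
layer_sizes = [8, 9, 10, 8, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [81, 100, 88, 9]
print(sum(counts))  # 278
```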


In [41]:
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, verbose=1, batch_size = 500, epochs = 200)

Instructions for updating:
Use tf.cast instead.
Epoch 1/200
...
Epoch 74/200

<keras.callbacks.History at 0x1a2ae4ca90>

In [42]:
classifier.weights

[<tf.Variable 'dense_1/kernel:0' shape=(8, 9) dtype=float32_ref>,
 <tf.Variable 'dense_1/bias:0' shape=(9,) dtype=float32_ref>,
 <tf.Variable 'dense_2/kernel:0' shape=(9, 10) dtype=float32_ref>,
 <tf.Variable 'dense_2/bias:0' shape=(10,) dtype=float32_ref>,
 <tf.Variable 'dense_3/kernel:0' shape=(10, 8) dtype=float32_ref>,
 <tf.Variable 'dense_3/bias:0' shape=(8,) dtype=float32_ref>,
 <tf.Variable 'dense_4/kernel:0' shape=(8, 1) dtype=float32_ref>,
 <tf.Variable 'dense_4/bias:0' shape=(1,) dtype=float32_ref>]

In [43]:
#Part 3 - Making predictions and evaluating the model

In [44]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.3)

In [45]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[79, 26],
       [21, 52]])
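The confusion matrix can be turned into summary metrics by hand. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the four cells are true negatives, false positives, false negatives, and true positives. The sketch below copies the counts from the output above; the variable names are assumptions.

```python
# Counts copied from the confusion matrix
# (rows = actual class, columns = predicted class).
tn, fp = 79, 26
fn, tp = 21, 52

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.736 0.667 0.712 0.689
```

The `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` functions imported earlier compute the same quantities directly from `y_test` and `y_pred`.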

In [46]:
y_pred[:,0]

array([ True, False,  True,  True, False,  True, False, False,  True,
        True,  True,  True,  True, False,  True, False,  True,  True,
       False,  True,  True,  True,  True, False, False, False,  True,
        True,  True, False,  True, False, False,  True, False, False,
       False, False, False, False,  True, False,  True, False,  True,
       False, False, False, False, False,  True,  True, False,  True,
       False,  True,  True,  True, False, False,  True, False, False,
       False, False, False,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True, False, False,  True, False, False,
        True,  True,  True,  True,  True, False, False, False, False,
       False,  True,  True, False, False,  True,  True,  True, False,
        True, False,  True, False,  True, False,  True, False, False,
       False,  True,  True,  True, False,  True, False,  True, False,
       False, False, False, False, False, False,  True, False, False,
       False, False,

# Step - 7 : Predict on New Cases
## Prediction on Test Data From Kaggle

Create an account on www.kaggle.com to download the test data and submit results.

TODO - LOAD ACTUAL TEST DATA FROM KAGGLE, CLEAN DATA, RUN NEURAL NET MODEL ON DATA, SUBMIT RESULTS

In [47]:
# FOR GOOGLE COLAB - UNCOMMENT LINES BELOW

# !wget -q https://www.dropbox.com/s/t9i5j6ki0qsf989/Titanic%20Data-Test.csv?dl=0
# !ls -l
# test_data = pd.read_csv('Titanic Data-Test.csv')

In [48]:
# FOR JUPYTER NOTEBOOK - UNCOMMENT LINES BELOW

test_data = pd.read_csv('Titanic Data-Test.csv')

In [49]:
test_data.set_index('PassengerId', inplace=True)
test_data.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [50]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB


In [51]:
test_data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test_data.head()

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
892,3,male,34.5,0,0,7.8292,Q
893,3,female,47.0,1,0,7.0,S
894,2,male,62.0,0,0,9.6875,Q
895,3,male,27.0,0,0,8.6625,S
896,3,female,22.0,1,1,12.2875,S


In [52]:
test_data['Embarked'].value_counts()

S    270
C    102
Q     46
Name: Embarked, dtype: int64

In [53]:
embarked_dummies = pd.get_dummies(test_data['Embarked'], drop_first=True)
embarked_dummies.head()

PassengerId,Q,S
892,1,0
893,0,1
894,1,0
895,0,1
896,0,1


In [54]:
test_data['Sex'].value_counts()

male      266
female    152
Name: Sex, dtype: int64

In [55]:
male = pd.get_dummies(test_data['Sex'], drop_first=True)
male.head()

PassengerId,male
892,1
893,0
894,1
895,1
896,0


In [56]:
test_data = pd.concat([test_data, embarked_dummies, male], axis=1)
test_data.head()

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Q,S,male
892,3,male,34.5,0,0,7.8292,Q,1,0,1
893,3,female,47.0,1,0,7.0,S,0,1,0
894,2,male,62.0,0,0,9.6875,Q,1,0,1
895,3,male,27.0,0,0,8.6625,S,0,1,1
896,3,female,22.0,1,1,12.2875,S,0,1,0


In [57]:
test_data.drop(['Sex', 'Embarked'], axis=1, inplace=True)
test_data.head()

PassengerId,Pclass,Age,SibSp,Parch,Fare,Q,S,male
892,3,34.5,0,0,7.8292,1,0,1
893,3,47.0,1,0,7.0,0,1,0
894,2,62.0,0,0,9.6875,1,0,1
895,3,27.0,0,0,8.6625,0,1,1
896,3,22.0,1,1,12.2875,0,1,0


In [58]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 8 columns):
Pclass    418 non-null int64
Age       332 non-null float64
SibSp     418 non-null int64
Parch     418 non-null int64
Fare      417 non-null float64
Q         418 non-null uint8
S         418 non-null uint8
male      418 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 20.8 KB


In [59]:
test_data.isnull().sum()

Pclass     0
Age       86
SibSp      0
Parch      0
Fare       1
Q          0
S          0
male       0
dtype: int64

In [60]:
t_mean_age_Pclass1 = int(test_data.loc[test_data['Pclass']==1]['Age'].mean())
t_mean_age_Pclass2 = int(test_data.loc[test_data['Pclass']==2]['Age'].mean())
t_mean_age_Pclass3 = int(test_data.loc[test_data['Pclass']==3]['Age'].mean())

test_data.loc[(test_data['Age'].isnull() ) & (test_data['Pclass']==1), 'Age'] = t_mean_age_Pclass1
test_data.loc[(test_data['Age'].isnull() ) & (test_data['Pclass']==2), 'Age'] = t_mean_age_Pclass2
test_data.loc[(test_data['Age'].isnull() ) & (test_data['Pclass']==3), 'Age'] = t_mean_age_Pclass3

In [61]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 8 columns):
Pclass    418 non-null int64
Age       418 non-null float64
SibSp     418 non-null int64
Parch     418 non-null int64
Fare      417 non-null float64
Q         418 non-null uint8
S         418 non-null uint8
male      418 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 20.8 KB


In [62]:
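# Drops the one passenger whose Fare is missing (418 -> 417 rows).
# Note: a Kaggle submission needs a prediction for all 418 test passengers.
test_data.dropna(inplace=True)
test_data.info()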

<class 'pandas.core.frame.DataFrame'>
Int64Index: 417 entries, 892 to 1309
Data columns (total 8 columns):
Pclass    417 non-null int64
Age       417 non-null float64
SibSp     417 non-null int64
Parch     417 non-null int64
Fare      417 non-null float64
Q         417 non-null uint8
S         417 non-null uint8
male      417 non-null uint8
dtypes: float64(2), int64(3), uint8(3)
memory usage: 20.8 KB


In [64]:
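from sklearn.preprocessing import StandardScaler
# NOTE: this creates and fits a new scaler on the Kaggle test data, replacing the
# one fitted on X_train, so the model sees differently scaled inputs.
# NOTE: test_data columns end in Q, S, male, while the model was trained on
# columns ending in male, Q, S.
sc = StandardScaler()
sc.fit(test_data)
X_test = sc.transform(test_data)
X_test = pd.DataFrame(X_test)
X_test.describe()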

,0,1,2,3,4,5,6,7
count,417.0,417.0,417.0,417.0,417.0,417.0,417.0,417.0
mean,-1.064962e-17,-6.709261e-17,-4.259849e-17,-6.789134e-17,-5.258251e-17,6.656013e-18,1.895633e-16,-6.549517e-17
std,1.001201,1.001201,1.001201,1.001201,1.001201,1.001201,1.001201,1.001201
min,-1.502602,-2.265063,-0.5002182,-0.4008043,-0.638017,-0.352121,-1.348172,-1.320387
25%,-1.502602,-0.4898548,-0.5002182,-0.4008043,-0.4966178,-0.352121,-1.348172,-1.320387
50%,0.8753298,-0.3343394,-0.5002182,-0.4008043,-0.379169,-0.352121,0.7417452,0.7573539
75%,0.8753298,0.5209952,0.6152416,-0.4008043,-0.07391031,-0.352121,0.7417452,0.7573539
max,0.8753298,3.631303,8.42346,8.77126,8.536851,2.839933,0.7417452,0.7573539
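A model expects new data to have the same column order and the same scaling as its training data. The sketch below shows one way to guarantee both: reorder the new frame by the training column list and reuse the scaler fitted on the training set. The frames `train_X` and `kaggle_X` contain made-up rows, and `train_cols`, `train_scaler`, and `kaggle_aligned` are hypothetical names, not variables from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; column names follow the notebook's training features.
train_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare", "male", "Q", "S"]
train_X = pd.DataFrame(
    [[3, 22.0, 1, 0, 7.25, 1, 0, 1],
     [1, 38.0, 1, 0, 71.28, 0, 0, 0],
     [3, 26.0, 0, 0, 7.92, 0, 0, 1]],
    columns=train_cols,
)
# Kaggle-style rows arrive with the dummy columns in a different order.
kaggle_X = pd.DataFrame(
    [[3, 34.5, 0, 0, 7.83, 1, 0, 1],
     [3, 47.0, 1, 0, 7.00, 0, 1, 0]],
    columns=["Pclass", "Age", "SibSp", "Parch", "Fare", "Q", "S", "male"],
)

train_scaler = StandardScaler().fit(train_X)   # fit on training data only

kaggle_aligned = kaggle_X[train_cols]          # same column order as training
kaggle_scaled = train_scaler.transform(kaggle_aligned)

print(list(kaggle_aligned.columns))
print(kaggle_scaled.shape)  # (2, 8)
```

Selecting columns by name rather than by position means the alignment still holds if the dummy columns are concatenated in a different order.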


In [78]:
# Predicting the Test set results
test_y_pred = classifier.predict(X_test)
test_y_pred = (test_y_pred > 0.3)
test_y_pred = pd.DataFrame(test_y_pred)
test_y_pred.tail()

,0
412,False
413,True
414,False
415,False
416,False


In [81]:
results = pd.DataFrame()
results['PassengerId'] = test_data.index
results['Survived'] = test_y_pred
results.tail()

,PassengerId,Survived
412,1305,False
413,1306,True
414,1307,False
415,1308,False
416,1309,False


In [85]:
results['Survived'] = results['Survived'].astype(int)  # True/False -> 1/0
results.tail()

,PassengerId,Survived
412,1305,0
413,1306,1
414,1307,0
415,1308,0
416,1309,0
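The submission table can also be built in one step from the passenger IDs and the thresholded predictions. This is a sketch with made-up values: `passenger_ids` and `predicted` stand in for `test_data.index` and the boolean output of the prediction step, and `submission` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the index and the thresholded predictions.
passenger_ids = pd.Index([1305, 1306, 1307], name="PassengerId")
predicted = np.array([[False], [True], [False]])  # shape (n, 1), like predict() > 0.3

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predicted.ravel().astype(int),   # flatten, then True/False -> 1/0
})

print(submission)
```

Flattening with `ravel()` turns the (n, 1) prediction array into a plain column, and `astype(int)` produces the 0/1 labels without a row-by-row `apply`.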


In [None]:
# FOR GOOGLE COLAB - UNCOMMENT LINES BELOW

# filename = "Titanic result ANN 2019-05-04.csv"
# results.to_csv(filename, index=False)
# from google.colab import files
# files.download(filename) 

In [86]:
# FOR JUPYTER NOTEBOOK - UNCOMMENT LINES BELOW AND UPDATE FILENAME

filename = "Titanic Results/Titanic result ANN 2019-05-04.csv"
results.to_csv(filename, index=False)