# Titanic Survival Classification - Data Cleansing (Part 1)

This is my first ever IPython notebook, so it includes some simple experiments with Python to get to grips with the environment.



All of the below is based on inputting and cleaning up data for the Titanic: Machine Learning from Disaster Kaggle competition.  

https://www.kaggle.com/c/titanic

This is part 1 of a series in which I build a solution to this problem.  It is my first attempt at creating a machine learning model on my own now that I've finished training.  All previous models I built were courtesy of Andrew Ng's fantastic Coursera courses - 

https://www.coursera.org/learn/machine-learning <br>
https://www.coursera.org/specializations/deep-learning

The notes below are my first impressions as someone with a SAS/SQL background.

Please note that everything here is simply testing to get to grips with moving to Python.  As such, I will sometimes use suboptimal coding methodology in order to explore different ways of doing things.

In [1]:
#First importing some relevant packages

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now to import and view the data and see some useful functions below.

Let's also look at all of the different variables in our input and make some notes on how we're going to pre-process them.

### Objectives
The overall objective is to get a matrix of values to feed into our eventual ML algorithm.  

Hence we want to get rid of all of our text values and replace them with numbers and if appropriate, convert those to One-Hot (OH) encoded matrices to feed into our final algorithm.

Then we will want to normalize our numerical fields to speed up convergence later.

The below cell contains a compilation of useful functions for personal reference.

In [2]:
#read in data from raw csv file
full_set = pd.read_csv('D:/Datasets/Titanic/train.csv')

#use below to get metadata
#full_set.info()

#below is the top X rows for pandas
#full_set.head(10)

#below gives us the group by syntax, note that by default count() returns a count for every column so differs slightly from SQL.
#full_set.groupby('Embarked').count()

#Creating a new column as a substring of original
#full_set['Deck'] = full_set.Cabin.str[0]

#Getting null/not null values as a pseudo where clause
#this is equiv of where cabin is not null in SQL
#full_set[full_set['Cabin'].notnull()]


#some data cleansing functions - they can be chained as below
#full_set.Cabin.str.strip().str.lower().str.replace(' ', '')


#Below lines produce a dataset of all the observations where cabin is present
#cabin_set = full_set[full_set['Cabin'].notnull()]
#This is getting where the second value is a string to verify
#cabin_set[cabin_set['Cabin'].str[1].str.isalpha() == True]

#Dropping columns
#full_set = full_set.drop(['Name', 'Ticket'], axis=1)

#Some code from another implementation to convert to One-Hot arrays - the get_dummies function can handle this for us
#pd.get_dummies(model_data,columns=['Sex','Embarked'],drop_first=True)

#Case when condition example from redundant code I later removed (note .loc uses square brackets)
#full_set.loc[full_set.Deckstr == 'A', 'Deck'] = 1

#Other more efficient case whenning
#dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

#Joining example
#pd.merge(full_set, cabin_set, on='PassengerId', how='left')


full_set.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


# Features
So the first thing to do is to look at our data and evaluate how to treat each feature in turn.  The below data was obtained by using a series of group by statements on the input dataframe.
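As a rough sketch of the kind of group-by used to get the counts below, here is the idea on a small made-up frame (the `toy` dataframe and its values are assumptions for illustration, not the real training data).  `groupby(...).size()` is the closest match to a SQL `COUNT(*) ... GROUP BY`, and `value_counts` is a shorthand for the same thing on a single column.

```python
import pandas as pd

# Toy stand-in for the training frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

# SQL equivalent: SELECT Pclass, COUNT(*) FROM toy GROUP BY Pclass
pclass_counts = toy.groupby("Pclass").size()

# value_counts gives the same kind of count for one column;
# dropna=False also counts the missing values.
embarked_counts = toy["Embarked"].value_counts(dropna=False)

print(pclass_counts)
print(embarked_counts)
```

Note that `count()` and `size()` differ: `count()` skips nulls column by column, while `size()` counts rows.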

### PassengerId - 
We will not be feeding this column into the ML algorithm.  Instead we will retain the row order along the m-axis of our data and use this as a join key when we come to merge the predictions back together to produce a final submission file.

### Survived - 

This will be lifted and used as our y vector with which to compute our cost function.

### Pclass - 
1 - 216<br>
2 - 184<br>
3 - 491<br>

As this is a numerical band with 3 categories, we can simply convert it to an OH vector.

### Name - 
While the name may yield some information about the social class of the person, this is a small project so we shall ignore it for now.  

This could be an avenue for improvement later, for example by using name embeddings to derive social class and thus likelihood of survival.  However, that information is likely to be picked up from gender/age/fare by our algorithm anyway.

Other similar algorithms have used the length of the name as a feature so maybe this could be an avenue to explore.
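For reference, a minimal sketch of what such name-based features might look like.  The name length is the feature mentioned above; pulling out the title (Mr, Miss, Master, ...) is my own assumption about how social information could be extracted and is not used anywhere in this notebook.

```python
import pandas as pd

# Three names taken from the head of the training data shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Length of the full name string, including spaces and punctuation.
name_length = names.str.len()

# Assumption: the title is the text between the comma and the first period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(name_length.tolist())
print(title.tolist())
```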

### Gender -
female - 314<br>
male - 577<br>

We can apply an OH encoding here.

### Age - 
There is no need to band/categorize here, instead we will implement normalization later.

### SibSp -
0 - 608<br>
1 - 209<br>
2 - 28<br>
3 - 16<br>
4 - 18<br>
5 - 5<br>
8 - 7<br>

A simple count of siblings/spouses aboard.  As this has few values that are close together, we can just import this column with no manipulation needed.

### Parch -
0 - 678<br>
1 - 118<br>
2 - 80<br>
3 - 5<br>
4 - 4<br>
5 - 5<br>
6 - 1<br>

We will treat this exactly the same as siblings/spouses, as no manipulation is needed.

### Ticket - 
While this may have some codes indicating the type of ticket bought, for simplicity we're going to assume the majority of the ticket is just a random integer used as an identifier. 

As such this column can be dropped, since whatever it tells us is likely to be captured more accurately by the fare.

### Fare - 
As with Age, there is no need to band/categorize here; we will normalize it later.<br>

### Cabin - 
On first inspection this data seemed to be a deck/class letter followed by a room number, however it appears some people have multiple rooms (the rooms are concatenated into one string).

We will have to do some data cleansing to extract the decks so they can be converted to OH vectors.


### Embarked - 
C - 168<br>
Q - 77<br>
S - 644<br>

Another simple OH vector transform can take care of this for us.

# Cleansing Cabin

As we identified above, cabin will require some more data cleansing, so that is going to be the focus of the cell below.  The notes below give the initial intuitions.

The process will be to split this out into two features - 

## Room

This will be an integer.  We will have a look at the range of values to decide whether or not to apply a linear transform to it.

For simplicity's sake we will only take the first room in the case of passengers with multiple rooms (later on we could take number of rooms as an additional feature perhaps).

## Deck

Here we will figure out the range of values the characters can take and then assign an integer value to each to feed in as an ML feature.
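The split described above can be sketched on a handful of cabin strings written in the same format as the data.  The example values and the variable names `deck`, `first_room` and `n_rooms` are assumptions for illustration; the notebook itself only ends up keeping the deck.

```python
import pandas as pd

# Illustrative cabin strings: single room, several rooms,
# a deck letter with no number in front of another cabin, missing, deck only.
cabins = pd.Series(["C85", "C23 C25 C27", "F G73", None, "D"])

# Deck: the first character of the string.
deck = cabins.str[0]

# Room: the first run of digits, converted to a number (NaN where there is none).
first_room = pd.to_numeric(cabins.str.extract(r"(\d+)", expand=False), errors="coerce")

# Possible extra feature: number of whitespace-separated cabins.
n_rooms = cabins.str.split().str.len()

print(pd.DataFrame({"Cabin": cabins, "Deck": deck, "Room": first_room, "Rooms": n_rooms}))
```

Missing cabins simply stay missing in all three derived columns, so they can be handled in one place later.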


In [3]:
#Below lines produce a dataset of all the observations where cabin is present
cabin_set = full_set[full_set['Cabin'].notnull()]
#Keeping only the columns we care about (taking a copy so we can safely add columns)
cabin_set = cabin_set[['PassengerId','Cabin']].copy()

#Removing all the spaces from our original field
cabin_set['canc'] = cabin_set['Cabin'].str.replace(' ', '')



#cabin_set.head(100)

#DEALING WITH DECK FIRST

#Below is a validation step for 2nd character
#cabin_set[cabin_set['canc'].str[1].str.isalpha() == True]

#It appears that 4 records have a letter as the second character, the rest have room numbers next to the deck

#It is unclear what is going on with these 4, however I suspect it's simply a missing room number and they in fact have 2 rooms.

#So for these 4 we will take the first char as the deck and the next integer value as the room
#This may not be accurate but it should not throw off our algorithm too badly as it's only 4 values

#We unfortunately cannot give bespoke treatment to them as we have to have a general treatment
#This is because the same transform has to be applied to our submission set.

#This leaves Deck as simply taking the first character, which is always a letter
cabin_set['Deckstr'] = cabin_set['canc'].str[0]

#left join our deck classification back onto our original data frame
deck_set = pd.merge(full_set, cabin_set, on='PassengerId', how='left')

#The above may have been dealt with more easily in place however I decided to split it out to better 
#visualize the data when working

#Plus it gives a frame of reference for how to merge data in pandas
deck_set.head(10)


#DEALING WITH ROOM
#Having manually looked at room, this will likely be 50+ unique categories since we cannot treat it as numeric
#This will add a lot more features to our model but may not actually give much in the way of classification information
#Thus I have decided to omit rooms for now



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin_x,Embarked,Cabin_y,canc,Deckstr
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,,,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C85,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,,,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C123,C123,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,,,
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,,
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,E46,E46,E
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,,,
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,,,
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,,,


# Age and Fare normalization

Now that the cabin field is ready for converting to OH vectors, it is time to take care of our numerical fields - Age and Fare.

For these we are simply going to normalize them using the following formula - 

$x_i^{\text{new}} = \dfrac{x_i - \mu}{\max(x) - \min(x)}$

Null values are skipped by pandas when computing the mean, max and min, so they do not affect the normalization parameters.

We can also choose to divide by the standard deviation instead, should dividing by the range turn out to be a bad choice.
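To make the two options concrete, here is a small sketch on five ages taken from the head of the data.  The variable names `mean_range` and `z_score` are mine, and the second variant (dividing by the standard deviation) is the alternative mentioned above rather than what this notebook uses.

```python
import pandas as pd

# Five ages from the first rows of the training data.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0])

# Mean normalization as in the formula above: result has mean 0 and spans a range of 1.
mean_range = (ages - ages.mean()) / (ages.max() - ages.min())

# Alternative: divide by the (sample) standard deviation, giving mean 0 and std 1.
z_score = (ages - ages.mean()) / ages.std()

print(mean_range.round(4).tolist())
print(z_score.round(4).tolist())
```

Either way the feature ends up centred on zero; the difference is only in how strongly outliers (such as very high fares) compress the rest of the values.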

##  Dealing with NaN values

As NaN values can invalidate any summation over matrices, we will also need to account for them.  

Fortunately, as we have normalized about our mean, we can simply use 0 as a replacement for NaN.  This is equivalent to filling the missing value with the column mean before normalizing.
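A quick check of that reasoning on a toy column (the values are illustrative): replacing NaN with 0 after normalizing gives the same result as filling NaN with the mean first and normalizing afterwards.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, 35.0, 54.0])

# pandas skips NaN by default when computing these statistics.
mean = ages.mean()
value_range = ages.max() - ages.min()

# Option 1: normalize, then replace NaN with 0 (what this notebook does).
norm_then_fill = ((ages - mean) / value_range).fillna(0)

# Option 2: fill NaN with the mean, then normalize with the same statistics.
fill_then_norm = (ages.fillna(mean) - mean) / value_range

print(np.allclose(norm_then_fill, fill_then_norm))
```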

In [4]:
#First get our means
age_mean = deck_set['Age'].mean()
fare_mean = deck_set['Fare'].mean()

#Next get our denominator
age_range = deck_set['Age'].max() - deck_set['Age'].min()
fare_range = deck_set['Fare'].max() - deck_set['Fare'].min()

#Finally create new fields in our dataframe which are normalized versions
deck_set['Norm_age'] = (deck_set['Age'] - age_mean) / age_range
deck_set['Norm_fare'] = (deck_set['Fare'] - fare_mean) / fare_range

#Replace NaN with 0 (assigning back rather than using inplace on a column selection)
deck_set['Norm_age'] = deck_set['Norm_age'].fillna(0)
deck_set['Norm_fare'] = deck_set['Norm_fare'].fillna(0)

deck_set.head(10)

#Dividing by the range seems okay for now, we can revisit this later if needed.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin_x,Embarked,Cabin_y,canc,Deckstr,Norm_age,Norm_fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,,,,-0.096747,-0.048707
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C85,C85,C,0.104309,0.076277
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,,,,-0.046483,-0.04739
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C123,C123,C,0.066611,0.040786
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,,,,0.066611,-0.047146
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,,,0.0,-0.046349
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,E46,E46,E,0.305364,0.03837
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,,,,-0.348066,-0.021723
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,,,,-0.033917,-0.041128
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,,,,-0.197275,-0.004164


# Getting our One-Hot Vectors

Now to deal with all of our categories by converting them all into one-hot vectors.

This should be a relatively straightforward process iterating over all of our categorical features.

After that we can finally put it all together in a final dataframe for exporting into numpy arrays to feed into our models.
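As a small illustration of what `pd.get_dummies` does with the `dummy_na` option, here is a sketch on a made-up Embarked column.  The `dtype=int` argument is my addition so the output is 0/1 regardless of pandas version; newer versions return booleans by default.

```python
import pandas as pd

# Made-up column with one missing value.
embarked = pd.Series(["S", "C", None, "Q", "S"])

# One column per category, plus an extra column flagging missing values.
emb_oh = pd.get_dummies(embarked, prefix="Emb", dummy_na=True, dtype=int)

print(emb_oh)
```

Each row has exactly one 1, and the row with the missing value is flagged in the extra NaN column instead of being all zeros.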

In [5]:
#Generate one hot vector dataframes for each of the categorical fields
p_set = pd.get_dummies(deck_set.Pclass, prefix='pclass', dummy_na = True)
gen_set = pd.get_dummies(deck_set.Sex, prefix='Gen', dummy_na = True)
emb_set = pd.get_dummies(deck_set.Embarked, prefix='Emb', dummy_na = True)                        
dek_set = pd.get_dummies(deck_set.Deckstr, prefix='Deck', dummy_na = True)

#Define our final output dataframe
out_set = pd.concat([deck_set[['PassengerId', 'Survived', 'Norm_age', 'Norm_fare', 'SibSp', 'Parch']], 
                     p_set, 
                     gen_set, 
                     emb_set, 
                     dek_set], axis=1)

out_set.head(10)

Unnamed: 0,PassengerId,Survived,Norm_age,Norm_fare,SibSp,Parch,pclass_1.0,pclass_2.0,pclass_3.0,pclass_nan,...,Emb_nan,Deck_A,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_T,Deck_nan
0,1,0,-0.096747,-0.048707,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,0.104309,0.076277,1,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,3,1,-0.046483,-0.04739,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,0.066611,0.040786,1,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,5,0,0.066611,-0.047146,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
5,6,0,0.0,-0.046349,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
6,7,0,0.305364,0.03837,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
7,8,0,-0.348066,-0.021723,3,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
8,9,1,-0.033917,-0.041128,0,2,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
9,10,1,-0.197275,-0.004164,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


# Splitting our data
Now we have our data all in one place and looking good!

The final step is to split it out into the relevant matrices and convert to numpy arrays.

Our final outputs should be - <br>
*X* an $(m \times n_x)$ matrix where $n_x$ denotes the number of features. <br>
*Y* an $(m \times 1)$ vector of our binary classification ground truth.

However this would only give us our full training set!

As we have only been provided labels for our training set and not the submission set we should also split out a cross validation set to evaluate our outputs.

So given our out_set shape is $(891, 26)$  we can take 100 observations out for our cross validation set and have the remaining 791 for training.

We should really take this as a random sample based on our data distribution, however given the small dataset size it's probably okay to just take a cutoff.
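For comparison, a random split could look something like the sketch below.  It uses a made-up ten-row frame and a hypothetical hold-out size of 3; the notebook itself sticks with the simple head/tail cutoff.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for out_set.
toy = pd.DataFrame({
    "Survived": [0, 1] * 5,
    "Norm_age": np.linspace(-0.5, 0.5, 10),
})

# Shuffle all rows reproducibly, then cut off the first n_cv rows for validation.
shuffled = toy.sample(frac=1.0, random_state=0)
n_cv = 3  # hypothetical hold-out size for this toy example
cv_df = shuffled.iloc[:n_cv]
train_df = shuffled.iloc[n_cv:]

# Split into feature matrix and label vector as numpy arrays.
X_train = train_df.drop(columns=["Survived"]).values
Y_train = train_df["Survived"].values
X_cv = cv_df.drop(columns=["Survived"]).values
Y_cv = cv_df["Survived"].values

print(X_train.shape, Y_train.shape, X_cv.shape, Y_cv.shape)
```

Fixing `random_state` keeps the split the same between runs, which makes model comparisons fair.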


In [6]:
#Segmenting data
X_Train_df = out_set.head(791)
X_CV_df = out_set.tail(100)

#Getting our Y vectors
Y_Train = X_Train_df['Survived'].values
Y_CV = X_CV_df['Survived'].values

#Dropping columns we don't want to feed into our ML algorithm
X_Train_df = X_Train_df.drop(['PassengerId', 'Survived'], axis=1)
X_CV_df = X_CV_df.drop(['PassengerId', 'Survived'], axis=1)

#Getting our X vectors
X_Train = X_Train_df.values
X_CV = X_CV_df.values

# Putting it all together

Now we have our training data cleansed and ready to use!

However there are two final things remaining: we will want to put all of this into functions we can call easily from our later notebooks, and we have to apply all of this manipulation to our submission data as well.

Not forgetting that we also have to create our submission file, we will want three functions - 
* Cleanse_Training_Data(dataframe)
* Cleanse_Submission_Data(dataframe)
* Create_output_frame(id_set, y_hat)


In [7]:
#Creating our Training Set
def Cleanse_Training_Data(train_df):
    #Work on a copy so the caller's dataframe is not modified
    train_df = train_df.copy()

    #Getting our Deck
    train_df['canc'] = train_df['Cabin'].str.replace(' ', '')
    train_df['Deckstr'] = train_df['canc'].str[0]
    
    #Normalizing age/fare
    age_mean = train_df['Age'].mean()
    fare_mean = train_df['Fare'].mean()
    age_range = train_df['Age'].max() - train_df['Age'].min()
    fare_range = train_df['Fare'].max() - train_df['Fare'].min()

    train_df['Norm_age'] = (train_df['Age'] - age_mean) / age_range
    train_df['Norm_fare'] = (train_df['Fare'] - fare_mean) / fare_range
    
    #Replace NaN with 0 (assigning back rather than using inplace on a column selection)
    train_df['Norm_age'] = train_df['Norm_age'].fillna(0)
    train_df['Norm_fare'] = train_df['Norm_fare'].fillna(0)
    
    #Generate one hot vector dataframes for each of the categorical fields
    p_set = pd.get_dummies(train_df.Pclass, prefix='pclass', dummy_na = True)
    gen_set = pd.get_dummies(train_df.Sex, prefix='Gen', dummy_na = True)
    emb_set = pd.get_dummies(train_df.Embarked, prefix='Emb', dummy_na = True)                        
    dek_set = pd.get_dummies(train_df.Deckstr, prefix='Deck', dummy_na = True)

    #Define our final output dataframe
    out_set = pd.concat([train_df[['Survived', 'Norm_age', 'Norm_fare', 'SibSp', 'Parch']], 
                     p_set, 
                     gen_set, 
                     emb_set, 
                     dek_set], axis=1)
    
    
    #Segmenting data
    X_Train_df = out_set.head(791)
    X_CV_df = out_set.tail(100)

    #Getting our Y vectors
    Y_Train = X_Train_df['Survived'].values
    Y_CV = X_CV_df['Survived'].values

    #Dropping columns we don't want to feed into our ML algorithm
    X_Train_df = X_Train_df.drop(['Survived'], axis=1)
    X_CV_df = X_CV_df.drop(['Survived'], axis=1)

    #Getting our X vectors
    X_Train = X_Train_df.values
    X_CV = X_CV_df.values
    
    return X_Train, X_CV, Y_Train, Y_CV


In [8]:
#Testing our function
X_Train_t, X_CV_t, Y_Train_t, Y_CV_t = Cleanse_Training_Data(full_set)
    
print(X_CV_t.shape)

(100, 24)


# Submission File

Now to create our submission file functions.  First let's load our submission dataset and have a look.

In [9]:
sub_set = pd.read_csv('D:/Datasets/Titanic/test.csv')

sub_set.head(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


So it looks like a very similar format but without Survived, for obvious reasons.  Thus in order to load this we can use a replica of the above function but omit any references to Survived.

However we will require one new output - a PassengerId vector to concatenate on to our eventual $\hat{y}$ so we can easily merge and create our submission file.
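One thing to be aware of when replicating the function: it recomputes the mean and range from the submission data, so the two datasets end up scaled slightly differently.  A common alternative, sketched below, is to compute the statistics once on the training column and reuse them.  The helpers `fit_scaler` and `apply_scaler` are hypothetical names for illustration and the ages are toy values; this is not what the functions in this notebook do.

```python
import numpy as np
import pandas as pd

def fit_scaler(series):
    """Return (mean, range) of a training column; NaN values are skipped."""
    return series.mean(), series.max() - series.min()

def apply_scaler(series, mean, value_range):
    """Normalize with previously fitted statistics and replace NaN with 0."""
    return ((series - mean) / value_range).fillna(0)

# Toy columns standing in for the training and submission ages.
train_age = pd.Series([22.0, 38.0, np.nan, 54.0, 2.0])
sub_age = pd.Series([34.5, np.nan, 62.0])

# Fit on the training data only, then apply to both sets.
age_mean, age_range = fit_scaler(train_age)
train_norm = apply_scaler(train_age, age_mean, age_range)
sub_norm = apply_scaler(sub_age, age_mean, age_range)

print(sub_norm.tolist())
```

With this approach a given age maps to the same normalized value in both datasets.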


In [10]:
#Creating our Submission Set
def Cleanse_Submission_Data(sub_df):
    #Work on a copy so the caller's dataframe is not modified
    sub_df = sub_df.copy()

    #Getting our Deck
    sub_df['canc'] = sub_df['Cabin'].str.replace(' ', '')
    sub_df['Deckstr'] = sub_df['canc'].str[0]
    
    #Normalizing age/fare
    age_mean = sub_df['Age'].mean()
    fare_mean = sub_df['Fare'].mean()
    age_range = sub_df['Age'].max() - sub_df['Age'].min()
    fare_range = sub_df['Fare'].max() - sub_df['Fare'].min()

    sub_df['Norm_age'] = (sub_df['Age'] - age_mean) / age_range
    sub_df['Norm_fare'] = (sub_df['Fare'] - fare_mean) / fare_range
    
    #Replace NaN with 0 (assigning back rather than using inplace on a column selection)
    sub_df['Norm_age'] = sub_df['Norm_age'].fillna(0)
    sub_df['Norm_fare'] = sub_df['Norm_fare'].fillna(0)
    
    #Generate one hot vector dataframes for each of the categorical fields
    p_set = pd.get_dummies(sub_df.Pclass, prefix='pclass', dummy_na = True)
    gen_set = pd.get_dummies(sub_df.Sex, prefix='Gen', dummy_na = True)
    emb_set = pd.get_dummies(sub_df.Embarked, prefix='Emb', dummy_na = True)                        
    dek_set = pd.get_dummies(sub_df.Deckstr, prefix='Deck', dummy_na = True)

    #Define our final output dataframe
    out_set = pd.concat([sub_df[['PassengerId', 'Norm_age', 'Norm_fare', 'SibSp', 'Parch']], 
                     p_set, 
                     gen_set, 
                     emb_set, 
                     dek_set], axis=1)
    
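    #The submission data has no T deck, so insert an all-zero Deck_T column
    #just before Deck_nan so the column order matches the training set
    out_set.insert(23, 'Deck_T', 0) 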
    
    sub_ID = out_set['PassengerId'].values
    
    #Dropping columns we don't want to feed into our ML algorithm
    out_set = out_set.drop(['PassengerId'], axis=1)

    #Getting our X vectors
    Submis_set = out_set.values
    
    return Submis_set, sub_ID


In [11]:
#Testing submission to make sure it matches
sub_out, id_set = Cleanse_Submission_Data(sub_set)

print(sub_out.shape)
print(id_set.shape)

(418, 24)
(418,)
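The hard-coded `insert(23, 'Deck_T', 0)` works here but would silently break if the column layout changed.  A possibly more robust option, sketched below on made-up deck values, is to reindex the submission columns against the training columns.  This is an alternative I have not wired into the functions above.

```python
import pandas as pd

# Made-up deck letters: the training set contains a T deck, the submission set does not.
train_deck = pd.get_dummies(pd.Series(["A", "C", "T", None]),
                            prefix="Deck", dummy_na=True, dtype=int)
sub_deck = pd.get_dummies(pd.Series(["C", "A", None]),
                          prefix="Deck", dummy_na=True, dtype=int)

# Align the submission columns to the training columns, in the same order;
# any column missing from the submission data is created and filled with 0.
sub_aligned = sub_deck.reindex(columns=train_deck.columns, fill_value=0)

print(sub_aligned)
```

The same call would also drop any category that appears only in the submission data, so the feature count always matches what the model was trained on.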


## Create Submission File

The final function is to generate the file output in the correct format to submit to Kaggle.

In [12]:
#First let's have a look at the output format
fin_set = pd.read_csv('D:/Datasets/Titanic/gender_submission.csv')

fin_set.head(10)


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


It appears there are only 2 columns, the PassengerId field and our prediction $\hat{y}$, so we can simply concatenate our 2 vectors together and we have our submission file!

In [13]:
#Let's create our function to put our vectors into a dataframe we can export to Kaggle.
def Create_output_frame(id_set, y_hat):
    #After much frustration I decided on this method because it actually worked.
    #There is likely a better method that I will figure out at a later date.
    out_df = pd.DataFrame(id_set, columns=['PassengerId'])
    out_df['Survived'] = y_hat
    
    return out_df

In [14]:
#First we will need to define a y_hat to test.  Let's just take a vector of all zeros for simplicity
pred_test = np.zeros(418, dtype='int64')

test_out = Create_output_frame(id_set, pred_test)

test_out.head(10)


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
5,897,0
6,898,0
7,899,0
8,900,0
9,901,0
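To actually produce the file, the frame still needs writing out as a CSV without the index column, so that it matches the two-column format shown for `gender_submission.csv`.  The sketch below builds the frame from a dictionary and writes to an in-memory buffer so it is self-contained; the three IDs and the all-zero predictions are placeholders, and in practice you would pass a file path to `to_csv` instead of the buffer.

```python
import io

import numpy as np
import pandas as pd

# Placeholder IDs and predictions for illustration.
ids = np.array([892, 893, 894])
y_hat = np.zeros(3, dtype="int64")

# Building the frame from a dict sets both column names in one step.
out_df = pd.DataFrame({"PassengerId": ids, "Survived": y_hat})

# index=False stops pandas writing the row index as an extra column.
buffer = io.StringIO()
out_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```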


Now that our data is imported and cleansed, it is time to build some models.  See part 2 for the first small model.