In [117]:
# import libraries
import pandas as pd

In [118]:
# import data
datamaster = pd.read_csv("data/train.csv")
data = datamaster.copy()
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


---


# **NA Values**

Below we will see what parts of the data set are NA and then we will decide how to deal with them.

In [119]:
# view % of na values for each category
data.isna().mean()*100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

The data shows that there are three factors with missing values.
1. Age
2. Cabin
3. Embarked

Let's see how we can deal with this missing data.

**Age** has 19.87% missing values; there are too many missing values to try to impute the data. For now, I will add a variable that indicates whether the age value is missing and simply replace NA with 0. We can try other techniques later if necessary.
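
One of the "other techniques" to try later could be median imputation, kept alongside the same missing-value indicator. The block below is only a rough sketch on a made-up toy frame, not something run against train.csv; the names `toy` and `median_age` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up ages, standing in for the Age column
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# Record missingness first, so the information survives the fill
toy["AgeNA"] = toy["Age"].isna().astype(int)

# Assumed alternative to filling with 0: use the median of the observed ages
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)

print(toy)
```

Because AgeNA is created before the fill, a model can still tell imputed ages from observed ones.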

In [120]:
# Add a column to the data set that indicates whether the age value is missing (1 = missing, 0 = present)
data['AgeNA'] = data.Age.isna().astype(int)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeNA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [121]:
# replace all na in age column with 0
data['Age'] = data['Age'].fillna(0)
data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
AgeNA            0
dtype: int64

Now we have 0 NA values for Age.  
Let's check to make sure the AgeNA column retained its values:

In [122]:
# just wanna check AgeNA column to make sure we still have the NAs tallied.
data['AgeNA'].sum()

177

And it's still showing 177 NA values, which is what we started with. Great! Let's move on to Cabin.

**Cabin** has 77.1% missing values. Looking further into this, we find that only one partial list was recovered, known as "The Cave List", which was found on the body of Herbert Cave. The list contains only some of the passenger accommodations on the Titanic, and it is known to have errors and missing passengers, as it was printed before some passengers had booked and before others had cancelled. Considering the questionable accuracy of the document and how little of the passenger data it contains, I will drop the Cabin column entirely and save it for an exercise at a later date.

This information comes from the following sources, which can be explored for more detail:

[A Thorough Analysis of the Cave List by Daniel Klistorner](https://www.encyclopedia-titanica.org/the-cave-list.html)  
[The Cave List](https://www.encyclopedia-titanica.org/cave-list.html)

A possible approach would be to create a separate model for when cabin data is present, which might increase the accuracy of the predictions in those special cases. But for now we will proceed without the cabin data.
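
As a rough sketch of that idea (not used in the rest of this notebook), the rows could be split on whether Cabin is present, so that each subset could later get its own model. The toy frame and the names `has_cabin`, `with_cabin`, and `without_cabin` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; Cabin is missing for some passengers
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Survived": [0, 1, 1, 1],
})

# Boolean mask: True where a cabin value was recorded
has_cabin = toy["Cabin"].notna()

# Rows with cabin data keep the column; the rest drop it, since it is all NA there
with_cabin = toy[has_cabin]
without_cabin = toy[~has_cabin].drop(columns="Cabin")

print(len(with_cabin), len(without_cabin))
```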



In [123]:
# drop the cabin column from the dataset
data = data.drop('Cabin',axis=1)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,AgeNA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0


**Embarked** has only 0.22% of its values missing, so we can either impute them or simply drop the rows, as they are not a large part of the dataset. Let's look at the rows with missing data.

In [124]:
data[data['Embarked'].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,AgeNA
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,,0
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,,0


Only two rows are missing the port of embarkation. They have the same ticket number and paid the same fare, so maybe there is a way to match them to other passengers. But for now, since it is only two rows, I will just drop them from the dataset and maybe revisit this later.
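
If this is revisited, the matching idea might look something like the sketch below: look up a known port for each ticket and use it to fill the gaps. This runs on a made-up toy frame, and whether any matching ticket actually exists in train.csv is not checked here; `port_by_ticket` and `filled` are names assumed for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up tickets; ticket "C3" has no known port at all
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "B2", "C3"],
    "Embarked": ["S", np.nan, "C", "C", np.nan],
})

# First known port per ticket, taken from rows where Embarked is present
known = toy.dropna(subset=["Embarked"])
port_by_ticket = known.drop_duplicates("Ticket").set_index("Ticket")["Embarked"]

# Fill missing ports from passengers sharing the same ticket;
# tickets without any known port stay NA
filled = toy["Embarked"].fillna(toy["Ticket"].map(port_by_ticket))

print(filled)
```

Rows that share a ticket with no known port remain missing, so a fallback (dropping or another imputation) would still be needed for them.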

In [125]:
data.dropna(subset=['Embarked'], inplace=True)
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,AgeNA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,0.0,1,2,W./C. 6607,23.4500,S,1
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C,0


Now let's see if there are any NA values left in the data.

In [126]:
data.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
AgeNA          0
dtype: int64

Great! No NAs left in the dataset. Let's move on to see if we can change some other factors.

---

# **Factor Engineering**

We've removed all of the NA values from the dataset. Let's see what else we can do before we start training models. 

In [127]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,AgeNA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0


Let's go through the factors we haven't touched and ideas we could implement to make them more modeling-friendly. 
1. **PassengerId, Survived, Pclass:** these are already good for modeling. We will leave them as is. 
  
2. **Name:** There might be something interesting we could extract from Name, but I feel that things such as titles will only repeat what Pclass and Sex tell us, so it might be best to drop this for now. 
  
3. **SibSp, Parch:** these stand for the number of siblings and spouses, and the number of parents and children, aboard with each passenger. They could be used to determine family size, but we'll leave them as is for now.  

4. **Ticket:** Each passenger has a specific ticket code, and the ticket number may have something to do with survivability, but I will also drop this for now.
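
For reference, two of the ideas above (family size from SibSp and Parch, and titles from Name) could be sketched roughly as follows. Neither is applied in this notebook; the column names `FamilySize` and `Title` and the regular expression are assumptions for illustration, and the three names are taken from the rows shown earlier.

```python
import pandas as pd

# Toy frame using a few names from the dataset preview
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Montvila, Rev. Juozas",
    ],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
})

# Family size: siblings/spouses + parents/children + the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Title: the text between the comma and the first period in Name
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(toy[["FamilySize", "Title"]])
```

Cross-tabulating the extracted titles against Pclass and Sex would be one way to check whether they really only repeat what those columns say.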



In [128]:
# drop name and ticket
data.drop(['Name','Ticket'],axis=1,inplace=True)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,AgeNA
0,1,0,3,male,22.0,1,0,7.25,S,0
1,2,1,1,female,38.0,1,0,71.2833,C,0
2,3,1,3,female,26.0,0,0,7.925,S,0
3,4,1,1,female,35.0,1,0,53.1,S,0
4,5,0,3,male,35.0,0,0,8.05,S,0


Now that the data is prepared for some modeling, let's create a function to help us convert the test dataset given to us. 


In [129]:
def prepare_data(dataset):
    data = dataset.copy()
    # add AgeNA column (1 = Age missing, 0 = present), then replace NA ages with 0
    data['AgeNA'] = data['Age'].isna().astype(int)
    data['Age'] = data['Age'].fillna(0)

    # drop Cabin, Name, and Ticket columns
    data = data.drop(['Cabin','Name','Ticket'],axis=1)

    return data

Let's import and run the test.csv dataset given to us to see if the function works.

In [131]:
testdatamaster = pd.read_csv("data/test.csv")
testdata = testdatamaster.copy()

testdata = prepare_data(testdata)
testdata.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,AgeNA
0,892,3,male,34.5,0,0,7.8292,Q,0
1,893,3,female,47.0,1,0,7.0,S,0
2,894,2,male,62.0,0,0,9.6875,Q,0
3,895,3,male,27.0,0,0,8.6625,S,0
4,896,3,female,22.0,1,1,12.2875,S,0


Looks good! Let's move on.
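
Since `prepare_data` only fills NAs in Age, a quick check for leftover NAs and mismatched columns might be worth running before modeling. The helper below is a hypothetical sketch (`check_prepared` is not part of this notebook), shown on made-up toy frames.

```python
import numpy as np
import pandas as pd

def check_prepared(train_df, test_df, target="Survived"):
    """Hypothetical helper: return (mismatched column names, test columns still holding NA)."""
    # The test set should have every training column except the target
    expected = [c for c in train_df.columns if c != target]
    mismatched = sorted(set(expected) ^ set(test_df.columns))
    # Columns in the test set with at least one missing value
    na_counts = test_df.isna().sum()
    remaining_na = na_counts[na_counts > 0].index.tolist()
    return mismatched, remaining_na

# Toy frames with made-up values; the toy test frame has one missing Fare
toy_train = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0],
                          "Fare": [7.25, 71.28], "AgeNA": [0, 0]})
toy_test = pd.DataFrame({"Age": [34.5, 0.0], "Fare": [7.83, np.nan],
                         "AgeNA": [0, 1]})

mismatched, remaining_na = check_prepared(toy_train, toy_test)
print(mismatched, remaining_na)
```

Two empty lists would mean the prepared test set lines up with the training columns and has no NAs left.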




---

# **Model Selection**


I'm just going to run a simple model to see what the prediction is like.

In [None]:
# import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

First, split the data into train and test sets:

In [None]:
train_data, test_data = train_test_split(data,
                                         test_size=0.2)  # assumed 20% hold-out; adjust as needed
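
LogisticRegression expects numeric inputs, so the string columns Sex and Embarked would need to be encoded before fitting. A minimal sketch, assuming one-hot encoding with `pd.get_dummies` on a made-up toy frame (the encoding choice is an assumption, not something decided in this notebook):

```python
import pandas as pd

# Toy frame with made-up rows, mirroring the remaining string columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One-hot encode the string columns into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(encoded.columns.tolist())
```

The same encoding would have to be applied to the prepared test set so that both frames end up with identical columns.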