# Titanic data analysis

<img src=https://www.smbc-comics.com/comics/1516896042-20180125.png style="height:450px">
Source: SMBC

In [None]:
import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns 
import numpy as np  

custom_palette = sns.color_palette('viridis', 3)
sns.set_palette(custom_palette)

## RMS Titanic

The RMS Titanic was a British passenger liner that was considered "unsinkable". In the early morning hours of 15 April 1912, after colliding with an iceberg during its maiden voyage from Southampton to New York City, it sank. 

There were an estimated 2,224 passengers and crew aboard the ship, of which only 722 survived, making it one of the deadliest commercial peacetime maritime disasters in modern history. 

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, small children, and upper class people, were more likely to survive than others.

Here, we have a sample of the passenger data, and we are interested in exploring it, making inferences based on it, and using it to predict who was likely to survive the disaster. 


In [None]:
titanic_df = pd.read_csv('titanic.csv')  

display(titanic_df.head())
display(titanic_df.describe())
display(titanic_df.describe(include=['O']))
titanic_df.info()

- Our sample includes 891 records and has 11 features
- There are 7 numeric variables and 4 categorical variables. 
- Some numeric variables, like *ticket_class* and *survived*, are numeric only in how they are stored; they actually represent categorical variables.
- We can see that *age*, *deck*, and *embarkation_port* have missing values.

### Do the data come from a simple random sample?

In our sample, 38.4% survived. The estimated survival rate in the population was 722/2224 = 32.5%. It is possible that this sample oversamples survivors. Before we start the analysis, let us check whether this is the case, or whether the gap is just a result of sampling variation, in which case we cannot reject the hypothesis that the data come from a simple random sample.

$H_{0}$: The proportion of survivors in our sample is the same as the proportion of survivors in a simple random sample from the Titanic passenger population.  <br>
$H_{1}$: No, it is too high: The proportion of survivors in our sample is higher than the proportion of survivors in a simple random sample from the Titanic passenger population.

- Our test statistic will be the proportion of survivors in a simple random sample from the Titanic passenger population. 
- If 0.384 (the observed value of this statistic in our sample) turns out to be high relative to the values of this test statistic simulated under the null hypothesis, we will tend to reject the null.

**Our approach:** We will draw many random samples from the population and compute the proportion of survivors in each sample. Each sample will have the same size as our original sample (891) so we maintain the variation properties of the original sample. We will then compare the distribution of the proportion of survivors to the observed value and make our conclusions.

In [None]:
# To make random draws from the population, we first need to have that population. 
# We do not have these data, but since we know how many passengers it included and how many survived, 
# it is easy to create it
titanic_population = pd.DataFrame({'serial number': np.arange(1, 2225),
                                  'survived': np.concatenate((np.ones(722), np.zeros(2224-722)))})
display(titanic_population)
titanic_population['survived'].mean()

In [None]:
# Sample size as in the original sample
sample_size = 891

# simulate one value. 
# Note here, we need to sample WITHOUT REPLACEMENT, since the same passenger cannot be in the same sample more than once
def simulate_titanic_sample():
    random_sample = titanic_population.sample(n=sample_size, replace=False)
    p_survival = random_sample['survived'].mean()
    return p_survival

# run multiple simulations
num_sims = 5000
many_p_survival = np.empty(num_sims)
for i in range(num_sims):
    many_p_survival[i] = simulate_titanic_sample()

In [None]:
# Compute p-value
observed_p_survival =  titanic_df['survived'].mean() #get value of test statistic from data
num_samples_with_stat_we_got_or_more = np.count_nonzero(many_p_survival >= observed_p_survival)
print('The p-value is', num_samples_with_stat_we_got_or_more/num_sims)

# visualizing the results
ax = sns.displot(many_p_survival)
ax.set(xlabel='proportion survived', ylabel='number of simulations')
plt.scatter(observed_p_survival, 0, marker='.', s=250, color='red', clip_on=False) # shows red dot where the value of the test statistic is
plt.show()

The empirical p-value we got is zero (none of the 5,000 simulated samples had a survival proportion as high as the observed one), so at any reasonable significance level we can reject the null hypothesis. We therefore conclude that our sample is not a simple random sample from the Titanic population. 
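
As a rough analytic cross-check of the simulation, the sketch below uses a normal approximation for the proportion in a sample drawn without replacement, including a finite population correction. It is only an approximation, the function name is ours, and 0.384 is the rounded observed proportion quoted above.

```python
import math

def one_sided_p_value(observed_p, pop_size, pop_successes, sample_size):
    """Normal approximation for a sample proportion drawn WITHOUT replacement."""
    p = pop_successes / pop_size
    # finite population correction: the sample is a large share of the population
    fpc = (pop_size - sample_size) / (pop_size - 1)
    se = math.sqrt(p * (1 - p) / sample_size * fpc)
    z = (observed_p - p) / se
    # upper-tail probability of a standard normal
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

z, p_value = one_sided_p_value(0.384, pop_size=2224, pop_successes=722, sample_size=891)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.1e}")
```

The observed proportion lies nearly five standard errors above the population rate, which is consistent with no simulated sample reaching it.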

It seems that indeed our data oversamples survivors. This is not too surprising since the records for those on-board the Titanic are not very accurate. Naturally, we know those who survived much better than those who didn't, and when assembling the data sample, it was probably not very easy to take a random sample from the whole population.

In the following, although our data have this problem, our analysis will assume a simple random sample, and we should qualify all our analysis and conclusions by noting our sample overrepresents survivors.

## Exploratory Data Analysis

### Visualizing distributions

Let's visualize some variables in our data. To decide on the best visualizations, we need to know the type of variables we visualize:

In [None]:
titanic_df.info()

In [None]:
# Distributions of categorical variables

# sex
ax = sns.catplot(data=titanic_df, x='sex', kind='count')
ax.set(title='Distribution of gender')
plt.show()

# embarkation_port
ax = sns.catplot(data=titanic_df, x='embarkation_port', kind='count', height=5, aspect=1.3)
ax.set(xlabel="port of embarkation", title='Distribution of port of embarkation')
plt.show()

# survived
ax = sns.catplot(data=titanic_df, x='survived', kind='count')
ax.set(title='Distribution of survival')
plt.show()

# ticket_class
# Although ticket_class is integer, it is clearly an ordinal categorical variable
# Note the use of the "palette" parameter for better use of color in ordinal variables
ax = sns.catplot(data=titanic_df, x='ticket_class', kind='count', palette='mako')
ax.set(xlabel="ticket class", title='Distribution of ticket class')
plt.show()

# deck
# ax = sns.catplot(data=titanic_df, x='deck', kind='count', palette='Blues')
# ax.set(xlabel="deck", title='Distribution of deck')
# plt.show()

# It is nicer if we get the categories to be sorted alphabetically
# (drop missing values in deck only, so no deck is lost because of gaps in other columns)
decks = sorted(titanic_df['deck'].dropna().unique())
ax = sns.catplot(data=titanic_df, x='deck', kind='count', palette='mako', order=decks)
ax.set(xlabel="deck", title='Distribution of deck')
plt.show()

In [None]:
# Distributions of numeric variables (counts and continuous)

# num_siblings_spouse
ax = sns.displot(titanic_df, x='num_siblings_spouse', bins=np.arange(0,10,1))
ax.set(xlabel='number of siblings and/or spouses on board', ylabel='frequency', title='Distribution of num_siblings_spouse')
plt.show()

# num_parents_children
ax = sns.displot(titanic_df, x='num_parents_children', bins=np.arange(0,8,1))
ax.set(xlabel='number of parents and/or children on board', ylabel='frequency', title='Distribution of num_parents_children')
plt.show()

# fare_paid
ax = sns.displot(titanic_df, x='fare_paid', bins=np.arange(0,530,10))
ax.set(xlabel='fare paid', ylabel='frequency', title='Distribution of fare_paid')
plt.show()

# age
# note age has missing values, so this distribution shows only non-missing observations
ax = sns.displot(titanic_df, x='age', bins=np.arange(0,81,1))
ax.set(xlabel='Age', ylabel='frequency', title="Distribution of age, without missing values");


### Visualizing relationships + Feature engineering

Now that we have a sense of how our variables are distributed, we can think of interesting relationships to examine. Of course, there are many different relationships to plot. Here, we will show a few that make sense. 

It probably makes sense to think of survival as the focal variable of interest, so let's visualize relationships of _survived_ with other variables that we think may impact the survival rate.


In [None]:
ax = sns.catplot(kind='bar', x='ticket_class', y='survived', hue='sex', data=titanic_df)
ax.set(xlabel='class', ylabel='survival rate', title='Survival rate by class and sex');

- Females are much more likely to have survived than males in all classes. 
- Among females, first and second class passengers are more likely to have survived.
- Among males, only first class passengers are more likely to have survived.

#### Port of embarkation

In [None]:
ax = sns.catplot(kind='bar', x='embarkation_port', y='survived', data=titanic_df)
ax.set(xlabel='port of embarkation', ylabel='survival rate', title='Survival rate by port');

There seems to be an association between survival rate and port of embarkation, with Cherbourg passengers most likely to survive and Southampton passengers least likely.

#### Number of family members on board

In [None]:
ax = sns.catplot(kind='bar', x='num_siblings_spouse', y='survived', hue='sex', data=titanic_df)
ax.set(xlabel='number of siblings and/or spouses on-board', ylabel='survival rate', title='Survival rate by #siblings or spouse and sex')
plt.show()

ax = sns.catplot(kind='bar', x='num_parents_children', y='survived', hue='sex', data=titanic_df)
ax.set(xlabel='number of parents and/or children on-board', ylabel='survival rate', title='Survival rate by #parents or children and sex');

(We should note the very wide error bars in several groups, which make drawing conclusions difficult.)

- Men without family on board seem least likely to have survived. 
- It is possible that married men (1 spouse/sibling) are more likely to have survived. 
- It is possible that females with larger families are somewhat less likely to have survived.

Let's make use of our first observation here and create a new variable *alone_on_board*:



In [None]:
titanic_df['alone_on_board'] = False
titanic_df.loc[(titanic_df['num_siblings_spouse']==0) &
               (titanic_df['num_parents_children']==0), 'alone_on_board'] = True
titanic_df.head()

In [None]:
ax = sns.catplot(kind='bar', x='alone_on_board', y='survived', data=titanic_df)
ax.set(xlabel='alone on board?', ylabel='survival rate', title='Survival rate for being alone on board');

#### Deck vs. survival rate

In [None]:
ax = sns.catplot(data=titanic_df, x='deck', y='survived', kind='bar', palette='mako', order=decks)
ax.set(xlabel="deck", ylabel='survival rate', title='Survival rate by deck');

Deck seems related to survival rate, with some decks having higher survival rates than others. 

More interestingly, we note that the survival rates in **all** decks are higher than 40%. That is, in observations _in which we have deck data_, the survival rate is higher than in the sample overall! 
Let's check this observation by creating a new variable:

In [None]:
titanic_df['deck_info'] = True
titanic_df.loc[titanic_df['deck'].isnull(), 'deck_info'] = False
titanic_df.head()

In [None]:
ax = sns.catplot(kind='bar', x='deck_info', y='survived', data=titanic_df)
ax.set(xlabel='Is there deck information?', ylabel='survival rate', title='Survival rate by information on deck');

Indeed, there is a large and clear difference in survival rates (if we want, we can of course check this formally). One possible reason is censoring: those who survived are much more likely to have provided information on their decks, whereas information on non-survivors depended on written records that may not have been complete.
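
One way such a formal check could look is a permutation test on the difference in survival rates. The sketch below is illustrative only: the function name is ours, and the arrays are toy values, not the Titanic sample.

```python
import numpy as np

def permutation_p_value(outcomes, group, num_perms=5000, seed=0):
    """One-sided permutation test for mean(outcomes | group) - mean(outcomes | not group)."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    group = np.asarray(group, dtype=bool)
    observed = outcomes[group].mean() - outcomes[~group].mean()
    count = 0
    for _ in range(num_perms):
        # shuffling the group labels breaks any link between group and outcome
        shuffled = rng.permutation(group)
        diff = outcomes[shuffled].mean() - outcomes[~shuffled].mean()
        if diff >= observed:
            count += 1
    return observed, count / num_perms

# Toy data (hypothetical): 20 rows with deck info, 40 rows without
toy_survived = np.array([1] * 14 + [0] * 6 + [1] * 12 + [0] * 28)
toy_deck_info = np.array([True] * 20 + [False] * 40)

observed_diff, p_value = permutation_p_value(toy_survived, toy_deck_info)
print('observed difference:', observed_diff, 'p-value:', p_value)
```

In the notebook, the same function could be called with `titanic_df['survived']` and `titanic_df['deck_info']`.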


#### Fare paid vs. survival

In [None]:
ax = sns.catplot(kind='box', x='survived', y='fare_paid', showmeans=True, data=titanic_df)
ax.set(xlabel='Survived?', title='Boxplots of fare paid as a function of survival')
plt.show()

ax = sns.catplot(kind='point', x='survived', y='fare_paid', hue='ticket_class', data=titanic_df)
ax.set(xlabel='Survived?', title='Fare paid as a function of survival, by class');

Perhaps unsurprisingly, we see that those who paid a higher fare were more likely to survive. But, interestingly, this only seems to be true for the First class. 

#### Age vs. survival rate

In [None]:
ax = sns.catplot(kind='box', x='survived', y='age', showmeans=True, data=titanic_df)
ax.set(xlabel='Survived?', title='boxplots of age as a function of survival');

According to these boxplots, it seems there is little relationship between age and survival. It is possible that those who survived tend to be a bit younger on average, though the medians are almost identical. To look at this in more depth, it is better to compare the full distributions, rather than the summary statistics that the boxplot provides:

In [None]:
# age vs. survived
ax = sns.displot(titanic_df, x='age', hue='survived', bins = np.arange(0,81,2), stat='density', common_norm=False)
ax.set(ylabel='density', title='Distributions of age by survival');

We can see that little children clearly represent a larger share of those who survived than those who did not. It is possible that young adults (in their 20's) and elderly people (over ~60), on the other hand, represent a larger share of those who did not survive. Therefore, we can see that there's an association between age and survival rate which is somewhat obscured in the boxplots above.

To verify this, let's bin age to groups and check for differences in survival rates:

In [None]:
def bin_age(age):
    if age < 17:
        return '[0,17)'
    elif age < 30:
        return '[17,30)'  # lower edge is 17, matching the check above
    elif age < 60:
        return '[30,60)'
    elif np.isnan(age):
        return float('nan')
    else:
        return '[60+]'
    
titanic_df['binned_age'] = titanic_df['age'].apply(bin_age)
display(titanic_df.head(10))
titanic_df.info()

In [None]:
tmp = titanic_df.sort_values('binned_age')
ax=sns.catplot(kind='bar', x='binned_age', y='survived', data=tmp, palette='mako')
ax.set(xlabel='age group', ylabel='survival rate', title='Survival rate by age group');

Indeed, we see that children are much more likely to survive than others. Elderly people, and to a lesser extent, young adults, are somewhat less likely to survive (though the wide error bar for the elderly indicates the estimate is unstable). 
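
The same kind of binning can also be expressed with `pd.cut`, which returns an ordered categorical, so the groups plot in order without sorting first. This is a sketch on toy ages; the edges follow the thresholds used above (17, 30, 60), and the labels are written to match those edges.

```python
import numpy as np
import pandas as pd

age_edges = [0, 17, 30, 60, np.inf]
age_labels = ['[0,17)', '[17,30)', '[30,60)', '[60+]']

# Toy ages (hypothetical), including values on the bin edges and a missing value
toy_ages = pd.Series([4, 16.9, 17, 29, 30, 59, 60, 80, np.nan])

# right=False makes every bin closed on the left and open on the right
binned = pd.cut(toy_ages, bins=age_edges, labels=age_labels, right=False)
print(binned.tolist())
```

Missing ages stay missing, as in `bin_age`.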

### Missing values

We need to handle the missing values in *deck*, *age*, and *embarkation_port*.

#### Embarkation port

Dealing with the missing values in _embarkation_port_ seems easy. There are only 2 such observations. A reasonable way to handle this is to impute using the most frequent category:

In [None]:
grpby_port_count = titanic_df.groupby('embarkation_port').count()
print(grpby_port_count['passenger_id'])
ax=sns.catplot(data=titanic_df, x='embarkation_port', kind='count')
ax.set(xlabel='port of embarkation', ylabel='number of observations');

In [None]:
titanic_df.loc[titanic_df['embarkation_port'].isnull(), 'embarkation_port'] = 'Southampton'
titanic_df.info()

#### Deck

There are many missing values in _deck_. However, we already saw that the mere availability of information on deck is informative. Let's remove _deck_ and keep only _deck_info_.

In [None]:
titanic_df.drop(['deck'], axis=1, inplace=True)
titanic_df.info()

#### Age

Handling the missing data in _age_ is more challenging, as there are many missing observations, but age itself may be an important predictor for survival. A few options:
- Removing the observations with missing data. This is the simplest way, but it reduces the number of observations by nearly 20%.
- Replace each missing observation with the mean or median of _age_. This is simple, but it would put far too much weight on the mean or median age relative to the rest of the age distribution, and we already saw that age is related to survival.
- Replace each missing observation with a random value drawn from the range of mean age ± one sd of age. This is a bit more complex and will introduce some noise, but is a much less restrictive assumption. Still, we know children are more likely to survive, and this imputation is unlikely to help identify children.
- Predict age, or binned_age, or "child", using other variables in the dataset, and then use the predictions of the model to impute data! Be careful though: if you want to use the imputed values to predict survival later, using _survived_ to predict the imputations is wrong. It would imply that the imputed data would "have access" to whether or not someone in the **test data** survived (because the imputed value will tend to align more closely with whether or not the person survived). Therefore, it is best practice to fit the imputation model using only the training data (of the _survived_ prediction task) and without using _survived_ as a predictor. 

For now, let's keep in mind that age needs to be taken care of before we can use it in any analysis, and continue working with the data we do have.
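
To make the last option concrete, here is a minimal sketch that uses the simplest possible "model": group medians of age, learned from training rows only and without looking at _survived_. The function names, the choice of grouping columns, and the toy rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fit_age_medians(train_df, group_cols):
    """Learn per-group and overall median ages from TRAINING rows only."""
    group_medians = (train_df.groupby(group_cols, as_index=False)['age']
                     .median()
                     .rename(columns={'age': 'predicted_age'}))
    return group_medians, train_df['age'].median()

def impute_age(df, group_medians, overall_median, group_cols):
    """Fill missing ages with the group median, falling back to the overall median."""
    out = df.copy()
    merged = out[group_cols].merge(group_medians, on=group_cols, how='left')
    predicted = pd.Series(merged['predicted_age'].to_numpy(), index=out.index)
    out['age'] = out['age'].fillna(predicted).fillna(overall_median)
    return out

# Toy rows (hypothetical), standing in for a train/test split
toy_train = pd.DataFrame({'ticket_class': [1, 1, 3, 3, 3],
                          'sex': ['female', 'female', 'male', 'male', 'male'],
                          'age': [30.0, 40.0, 20.0, 24.0, np.nan]})
toy_test = pd.DataFrame({'ticket_class': [1, 3, 2],
                         'sex': ['female', 'male', 'male'],
                         'age': [np.nan, np.nan, np.nan]})

cols = ['ticket_class', 'sex']
medians, overall = fit_age_medians(toy_train, cols)
imputed_test = impute_age(toy_test, medians, overall, cols)
print(imputed_test)
```

The third test row belongs to a group that never appears in the training rows, so it falls back to the overall training median.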

## Hypothesis testing

### Is there a difference in survival rate between passengers boarding in Cherbourg and those boarding in Queenstown?

<img src=https://i0.wp.com/upload.wikimedia.org/wikipedia/commons/thumb/5/51/Titanic_voyage_map.png/450px-Titanic_voyage_map.png style="height:200px">

From one of the plots above, it seems that the survival rate of people boarding in Cherbourg was higher than that of people boarding in Queenstown. Is it possible that the rescue teams preferred rescuing French over Irish people?

$H_{0}$: In the population, there is no difference between survival rates of passengers boarding in Cherbourg and those boarding in Queenstown.  <br>
$H_{1}$: In the population, the survival rate of passengers boarding in Cherbourg is different (higher) than that of those boarding in Queenstown.

We will use a bootstrap method to compute a 95% confidence interval for the difference between the survival rate of those boarding in Cherbourg and the survival rate of those boarding in Queenstown.


In [None]:
# We should work only on observations from Cherbourg or Queenstown
port_c_or_q = titanic_df.loc[titanic_df['embarkation_port'] != "Southampton",:]
port_c_or_q

In [None]:
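# function that returns the difference in averages of "column_name" grouping by "grouping_var"
# (second group minus first group, with groups in sorted order, e.g. Queenstown minus Cherbourg)
def diff_of_avgs(df, column_name, grouping_var):
    grpby_var = df.groupby(grouping_var)
    avgs = grpby_var[column_name].mean()
    # avgs is indexed by group label, so use positional access explicitly
    return avgs.iloc[1] - avgs.iloc[0]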

def bootstrap_mean_difference(original_sample, column_name, grouping_var, num_replications):
    '''This function returns an array of bootstrapped differences between two sample averages:
      original_sample: df containing the original sample
      column_name: name of column containing the variable to average
      grouping_var: name of variable according to which to group
      num_replications: number of bootstrap samples'''
    original_sample_size = original_sample.shape[0] # we need to replicate with the same sample size
    original_sample_cols_of_interest = original_sample[[column_name, grouping_var]]
    bstrap_mean_diffs = np.empty(num_replications)
    for i in range(num_replications):
        bootstrap_sample = original_sample_cols_of_interest.sample(original_sample_size, replace=True) # note WITH REPLACEMENT!
        resampled_mean_diff = diff_of_avgs(bootstrap_sample, column_name, grouping_var)
        bstrap_mean_diffs[i] = resampled_mean_diff
    
    return bstrap_mean_diffs

# run the bootstrap procedure
bstrap_diffs = bootstrap_mean_difference(port_c_or_q, 'survived', 'embarkation_port', 5000)

In [None]:
# Get the endpoints of the 95% confidence interval
left_end = np.percentile(bstrap_diffs, 2.5, method='higher')
right_end = np.percentile(bstrap_diffs, 97.5, method='higher')
print('The 95% bootstrap confidence interval for the difference between population means:', [left_end,right_end])

# visualize results
ax = sns.displot(bstrap_diffs)
plt.hlines(y=0, xmin=left_end, xmax=right_end, colors='orange', linestyles='solid', lw=7, clip_on=False);  
ax.set(xlabel='difference between survival rates', ylabel='frequency',
       title='Distribution of bootstrap estimates for the difference between\nsurvival rate boarding in Queenstown and survival rate boarding in Cherbourg');

Our 95% confidence interval does not include 0, and we can therefore reject the null hypothesis and conclude there's a difference between the survival rates of people boarding in Cherbourg and in Queenstown.

Does this mean there was a bias against Irish people (or in favor of French people)?

Let's plot the survival rates as a function of ticket class:

In [None]:
ax = sns.catplot(kind='bar', x='ticket_class', y='survived', hue='embarkation_port', data=port_c_or_q)
ax.set(xlabel='class', ylabel='survival rate',title='Survival rates across embarkation ports and ticket classes')

ax = sns.catplot(kind='count', x='ticket_class', hue='embarkation_port', data=port_c_or_q)
ax.set(xlabel='class', title='Number of passengers in each embarkation port and ticket class');

- When comparing survival rates within each class, we see no stable differences between the survival rates of people boarding at the two ports. 
- It seems Simpson's paradox is at play here. People boarding in Cherbourg were far more likely to have been in First Class or Second Class than people from Queenstown, and people from these classes were far more likely to have survived.
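
The mechanism can be illustrated with a small worked example. The counts below are hypothetical, chosen only to show how a difference in class mix can create an aggregate gap even when the within-class survival rates are identical; they are not taken from the data.

```python
# Hypothetical counts: (passengers, survivors) per ticket class for two ports
counts = {
    'port A': {'first': (80, 48), 'third': (20, 5)},
    'port B': {'first': (10, 6), 'third': (88, 22)},
}

# Survival rate within each class, per port
within_class = {
    port: {cls: survivors / passengers for cls, (passengers, survivors) in classes.items()}
    for port, classes in counts.items()
}

# Survival rate per port, ignoring class
overall = {
    port: sum(s for _, s in classes.values()) / sum(n for n, _ in classes.values())
    for port, classes in counts.items()
}

print('within class:', within_class)
print('overall:', overall)
```

Both ports have the same rate within each class, yet port A looks far better overall because most of its passengers travelled first class.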

## Classification

The most immediate classification task that comes to mind is to predict whether or not someone _survived_. 

We start with a couple of scatter plots to visualize the relationship between the numeric variables we have and survival:

In [None]:
custom_palette = sns.color_palette('viridis', 2)
sns.set_palette(custom_palette)

sns.relplot(x='age', y='fare_paid', hue='survived',data=titanic_df, s=25)
# To jitter the dots for easier visualization, we use lmplot with fit_reg=False:
sns.lmplot(x='num_parents_children', y='num_siblings_spouse', hue='survived', \
           data=titanic_df, x_jitter=0.35, y_jitter=0.35, fit_reg=False, scatter_kws={"s": 10});

It seems all four numeric variables could be somewhat useful for our classification. 

Let's now look at the correlations between the variables.
The correlation matrix will not display correlations with non-numeric variables, so we should first transform the categorical variables to numeric (*deck* was already removed above, and *name* is excluded for obvious reasons).

Note that when we have a variable which is categorical but stored as numeric, like *ticket_class*, we should be explicit about transforming it.
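
A small sketch of why being explicit matters: by default, `pd.get_dummies` only encodes non-numeric columns, so an integer-coded categorical passes through unchanged unless it is named in `columns`. The data frame here is a toy one.

```python
import pandas as pd

# Toy frame (hypothetical rows): one integer-coded categorical, one string categorical
toy = pd.DataFrame({'ticket_class': [1, 2, 3, 1],
                    'sex': ['male', 'female', 'female', 'male']})

default_dummies = pd.get_dummies(toy)  # ticket_class is left as a single numeric column
explicit_dummies = pd.get_dummies(toy, columns=['ticket_class', 'sex'])

print(list(default_dummies.columns))
print(list(explicit_dummies.columns))
```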

In [None]:
# pd.get_dummies is one way
titanic_df_dummies = pd.get_dummies(titanic_df, 
                                    columns=['alone_on_board', 'deck_info', 'sex', 'embarkation_port', 'binned_age', 'ticket_class'])
titanic_df_dummies.head()

In [None]:
# compute correlation between each pair of variables in data frame
correlations = titanic_df_dummies.corr(numeric_only=True)

#plot heat map
plt.figure(figsize=(15,15))
g=sns.heatmap(correlations,annot=True,cmap="seismic")

We see the correlations with *survived* are not high, except for *sex*, which could be an important predictor. Other correlations are under 0.3, but based on our analysis so far we would probably like to predict using:

- ticket_class
- sex
- age or binned_age
- embarkation_port
- alone_on_board
- deck_info
- fare_paid
- fare_paid X ticket_class
- sex X ticket_class

and perhaps others as well. 

Note we will not use *num_siblings_spouse* and *num_parents_children* here because we already extracted the important information from them when we created *alone_on_board*, and the correlations of these variables with *alone_on_board* are relatively high.

For simplicity, we will not use the interaction variables listed above here, but only *age*, *fare_paid*, *sex*, *alone_on_board*, *deck_info*, *ticket_class*, and *embarkation_port*. 

Importantly, in transforming the categorical variables to numeric we are making an assumption that the difference between, for example, males and females, should be weighted the same (in terms of distance) as the difference between the maximal and minimal fares paid (if we scale to range) or to a difference of 1 standard deviation of this variable (if we scale to z-scores) etc.

In practice, this is not a decision we should take lightly and considerable thinking should be given here. Many times, the decision is data driven (i.e., we try and see what works best - but never on the test set!)
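
To see what the scaling choice implies for distances, the sketch below compares the gap between the two values of a 0/1 dummy with the full range of a fare-like variable under both scalings. The values are toy numbers and the helper names are ours. The z-scores use the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scale(values):
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population sd (ddof=0)
    return [(v - mean) / sd for v in values]

def spread(values):
    return max(values) - min(values)

# Toy values (hypothetical): a balanced 0/1 dummy and a skewed fare-like variable
toy_is_male = [0, 1, 0, 1, 0, 1]
toy_fare = [5.0, 10.0, 15.0, 30.0, 80.0, 500.0]

for name, scale in [('min-max', min_max_scale), ('z-score', z_scale)]:
    print(name,
          '| dummy gap:', round(spread(scale(toy_is_male)), 2),
          '| fare range:', round(spread(scale(toy_fare)), 2))
```

Under min-max scaling the male/female gap counts as much as the entire fare range, whereas under z-scores a balanced dummy's gap equals two standard deviations of any other variable.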

In [None]:
# First, let's keep only the variables we're interested in
knn_df = titanic_df[['sex', 'fare_paid', 'age', 'alone_on_board', 'deck_info', 
                     'embarkation_port', 'ticket_class', 'survived']]
display(knn_df)

In [None]:
# Second, we need to take care of the missing observations in age - let's remove them
# Again, this is not a decision we should make lightly
knn_df = knn_df[knn_df['age'].notnull()]
knn_df.info()

In [None]:
# Third, we need to transform our categorical variables to dummies. 
# Remember to do this even if the variable is stored as numeric
knn_df = pd.get_dummies(knn_df, columns=['alone_on_board', 'deck_info', 'sex', 'embarkation_port', 'ticket_class'])
knn_df.head()


In [None]:
# Fourth, we need to split our data to train and test sets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# It is usually sensible to randomly shuffle the dataframe first
knn_df = knn_df.sample(frac=1)

# Split to X and Y
X = knn_df.loc[:, knn_df.columns != 'survived'] # features
Y = knn_df.loc[:, 'survived'].values # labels

# Split to train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
X_train

In [None]:
# Fifth, we need to scale the variables to be on the same scale
# We choose here to standardize using z-scores, but this is a decision that needs some thought
from sklearn.preprocessing import StandardScaler
df_columns = X_train.columns
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
# note the difference between scaling the train and the test!
scaled_X_test = scaler.transform(X_test)

scaled_df = pd.DataFrame(scaled_X_train, columns=df_columns)
scaled_df.describe()

In [None]:
# Finally, we can run 10-fold CV to find optimal k - using only the training data
from sklearn.model_selection import cross_val_score

mean_cv_scores = []
k_list = range(1, 25)
for nn in k_list:
    knn_cv = KNeighborsClassifier(n_neighbors=nn)
    cv_scores = cross_val_score(knn_cv, scaled_X_train, Y_train, cv=10)
    mean_cv_scores.append(cv_scores.mean())
    
# output results
best_k = mean_cv_scores.index(max(mean_cv_scores))+1 # gets index of best performing k and adds 1
print('Highest accuracy is obtained for k =', best_k, 'and equals', max(mean_cv_scores))
plt.plot(k_list, mean_cv_scores, '-o')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Accuracy');

In [None]:
# Last, we need to retrain our chosen kNN on the whole training data and test its accuracy on the test data
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

knn_classifier = KNeighborsClassifier(n_neighbors=best_k)  
knn_classifier.fit(scaled_X_train, Y_train)
print('accuracy of the classifier is', knn_classifier.score(scaled_X_test, Y_test))

# Compute a confusion matrix
predictions = knn_classifier.predict(X=scaled_X_test) # get the classifier's predictions 
print('confusion matrix: \n', confusion_matrix(y_true=Y_test, y_pred=predictions, labels=[0, 1])) # rows are true values, columns are predicted values

print('precision: ', precision_score(y_true=Y_test, y_pred=predictions, labels=[0, 1]))
print('recall: ', recall_score(y_true=Y_test, y_pred=predictions, labels=[0, 1]))

Our model obtains decent accuracy, but it slightly underperforms the CV score (which is often the case).
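
Two things may contribute to this gap: k was picked as the best of 24 cross-validated scores, which is optimistic by construction, and the scaler was fit on the whole training set before cross-validation, so each validation fold influenced the scaling. A possible way to address the second point is to put the scaler inside a pipeline so that it is refit within every fold. The sketch below runs on synthetic data from `make_classification`, not on the Titanic features; the step names `scaler` and `knn` are our choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric feature matrix would do here)
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The scaler is part of the model, so each CV fold fits it on its own training part only
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier())])
search = GridSearchCV(pipeline,
                      param_grid={'knn__n_neighbors': list(range(1, 25))},
                      cv=10)
search.fit(X_tr, y_tr)

best_k = search.best_params_['knn__n_neighbors']
print('best k:', best_k)
print('mean CV accuracy:', search.best_score_)
print('test accuracy:', search.score(X_te, y_te))
```

`GridSearchCV` refits the best pipeline on the full training data by default, so `search.score` evaluates the retrained model on the held-out test set.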

We see the total accuracy is around 80%; however, the model's recall is lower than its precision. That is, it is better at being right when it predicts that someone survived than it is at finding all of those who actually survived. 
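
For reference, all three metrics can be read directly off a confusion matrix laid out as above (rows are true values, columns are predicted values). The counts in this sketch are hypothetical, chosen only to show a case where recall is lower than precision; they are not the model's actual results.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall for the positive class (survived = 1)."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of those predicted to survive, how many did
    recall = tp / (tp + fn)     # of those who survived, how many were found
    return accuracy, precision, recall

# Hypothetical confusion matrix [[tn, fp], [fn, tp]] = [[75, 10], [20, 38]]
accuracy, precision, recall = classification_metrics(tn=75, fp=10, fn=20, tp=38)
print(f'accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}')
```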