# Prepare the data

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
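# Compare numeric (major, minor) parts; a plain string comparison would misorder versions such as "0.9"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)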

# Common imports
import numpy as np
import os
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import tensorflow as tf
DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
  X_train = data['x_train']
  y_train = data['y_train']
  X_test = data['x_test']
  y_test = data['y_test']
num_pixels = X_train.shape[1] * X_train.shape[2]
X_train = X_train.reshape(X_train.shape[0], num_pixels).astype('float32')
X_test = X_test.reshape(X_test.shape[0], num_pixels).astype('float32')

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


## The data is prepared

## Problem 1
Train an SGDClassifier on MNIST to distinguish digits 3 and 5 from the other digits. Report the precision at a recall of 75% (you may round the recall to three decimal places). Use cv = 3 and random_state = 42.

In [1]:
# Get fit statistics


## Problem 2
Plot the precision-recall curve for the fit statistics from Problem 1, showing only the precision range between 60% and 90%.
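
A minimal plotting sketch follows. It assumes synthetic labels and scores in place of the cross-validated results from Problem 1, and it restricts the vertical axis with `set_ylim`; the variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Assumption: synthetic labels and scores stand in for the cross-validated
# targets and decision scores from Problem 1.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_scores = y_true + rng.normal(scale=0.8, size=500)

precisions, recalls, _ = precision_recall_curve(y_true, y_scores)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(recalls, precisions, linewidth=2)
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_ylim(0.6, 0.9)  # show only precision between 60% and 90%
ax.grid(True)
```

In a notebook with `%matplotlib inline` the figure renders at the end of the cell; in a plain script, call `plt.show()` afterwards.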

## Problem 3
Which threshold maximizes precision + recall in the previous problem? Report that threshold together with the corresponding precision and recall.
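
The search can be sketched on a tiny made-up example. `precision_recall_curve` returns one more precision/recall pair than thresholds (the final pair, precision 1 and recall 0, has no threshold), so the last entry is dropped before taking the argmax. The labels and scores below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and scores, only to illustrate the search.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# Drop the final (precision=1, recall=0) point, which has no threshold.
totals = precisions[:-1] + recalls[:-1]
best = np.argmax(totals)

print("threshold:", thresholds[best])
print("precision:", precisions[best])
print("recall:", recalls[best])
```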

**Load Titanic Data**

In [1]:
import functools
import numpy as np
import tensorflow as tf
import pandas as pd
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
train_data = pd.read_csv(train_file_path)
test_data = pd.read_csv(test_file_path)
train_data.head()

print(len(train_data))
print(len(test_data))
# create missing data to fill
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([train_data, test_data])
data.loc[[1,3,4,46,56,111,245], 'fare'] = np.nan
data.loc[[6,34,12,51,46,211,145],'n_siblings_spouses'] = np.nan
data.loc[[7,24,32,54,26,231,125], 'class'] = np.nan
data.loc[[9,39,19,59,49,219,195], 'deck'] = np.nan



Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


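|   | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.25 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.925 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |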


The attributes have the following meaning (column names as they appear in this version of the data):

- survived: the target; 0 means the passenger did not survive, 1 means he/she survived.
- class: passenger class.
- sex, age: self-explanatory.
- n_siblings_spouses: how many siblings and spouses of the passenger were aboard the Titanic.
- parch: how many children and parents of the passenger were aboard the Titanic.
- fare: price paid (in pounds).
- deck: the deck of the passenger's cabin.
- embark_town: where the passenger boarded the Titanic.
- alone: whether the passenger traveled alone (y/n).

## Problem 4
Build a pipeline to impute missing data for the numerical variables ["age", "n_siblings_spouses", "parch", "fare"]. Also impute the missing categorical variables ["class", "deck", "embark_town"] and convert them into binary indicator variables using one-hot encoding. Don't use ["sex", "alone"]. Merge the two groups so that the final data set is built from 7 input features. Use "survived" as the target and the rest of the variables as features. Transform the merged training and test data in one pipeline, then split the transformed data back into training and testing sets and create the feature and target vectors for each.

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer as Imputer

# A class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet   

In [3]:
from sklearn.preprocessing import OrdinalEncoder 


In [4]:
from sklearn.pipeline import FeatureUnion
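
The scaffolding above could be filled in roughly as follows. This is a sketch, not the reference solution: `DataFrameSelector` is a hypothetical helper written here, the imputation strategies (median and most frequent) are assumptions, and a small made-up frame `toy` stands in for the merged Titanic `data`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names]


num_attribs = ["age", "n_siblings_spouses", "parch", "fare"]
cat_attribs = ["class", "deck", "embark_town"]

# Numerical branch: select columns, fill gaps with the median (assumed strategy).
num_pipeline = Pipeline([
    ("select", DataFrameSelector(num_attribs)),
    ("impute", SimpleImputer(strategy="median")),
])
# Categorical branch: select, fill gaps with the most frequent value, one-hot encode.
cat_pipeline = Pipeline([
    ("select", DataFrameSelector(cat_attribs)),
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
full_pipeline = FeatureUnion([("num", num_pipeline), ("cat", cat_pipeline)])

# Toy stand-in for the merged frame; all values are made up.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0],
    "age": [22.0, np.nan, 26.0, 35.0, 28.0, 40.0],
    "n_siblings_spouses": [1, 1, np.nan, 0, 0, 2],
    "parch": [0, 0, 0, 1, 0, 2],
    "fare": [7.25, 71.28, np.nan, 53.1, 8.46, 30.0],
    "class": ["Third", "First", np.nan, "First", "Third", "Second"],
    "deck": ["unknown", "C", "unknown", np.nan, "unknown", "E"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Southampton", np.nan, "Queenstown"],
})
n_train = 4  # assumption: the first rows came from the training file

# Transform everything at once; the union is sparse because of the one-hot step.
X_all = full_pipeline.fit_transform(toy).toarray()
X_train_t, X_test_t = X_all[:n_train], X_all[n_train:]
y_train_t = toy["survived"].to_numpy()[:n_train]
y_test_t = toy["survived"].to_numpy()[n_train:]
print(X_all.shape)
```

The seven input columns expand to more output columns, since each category of a categorical variable gets its own indicator. With the real data, `toy` would be replaced by `data` and `n_train` by `len(train_data)`.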


## Problem 5
Predict survival using an SGD classifier (loss='log', named loss='log_loss' in scikit-learn 1.1 and later) and a logistic regression. Train each model on the training data and test it on the testing data. Report the ROC AUC score for each model. Set random_state = 42 and use method="predict_proba" to make the results comparable.

## Problem 6
Repeat the previous exercise, but predict the class instead of the probability of survival, i.e., don't use method="predict_proba". Which method is better? How do you explain the difference between the results?

Answer: The relative ranking is the same: logistic regression performs better. However, both models perform poorly. When predicting probabilities, a weak prediction of survival might have a score of 0.6, so if that prediction is wrong the error is only 0.6 - 0 = 0.6. When predicting classes, the same 0.6 is classified as survived, so the error becomes 1 - 0 = 1. Put differently, hard labels take only two values, so the ROC curve is reduced to a single operating point and the ranking information carried by the probabilities is lost. The difference barely affects predictions with very high or very low probabilities, because in those cases the probability and the class label are nearly equal.
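
The effect can be reproduced on a four-row made-up example, where the same predictions are scored once as probabilities and once as hard labels at a 0.5 cutoff (the cutoff is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up targets and predicted survival probabilities.
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.4, 0.9])

# Hard class labels obtained with a 0.5 cutoff.
labels = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)
auc_labels = roc_auc_score(y_true, labels)
print(auc_proba, auc_labels)  # 0.75 0.5
```

Here the two uncertain predictions (0.6 and 0.4) are both on the wrong side of the cutoff, and thresholding them to 1 and 0 lowers the score.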


## Problem 7
Use the Titanic data and the models estimated in Problem 6. Show the confusion matrix for each model and describe the most typical error each model makes.
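
As a reading aid, the sketch below builds a confusion matrix from made-up labels. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, so for a binary problem `ravel()` yields the counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1])

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"false positives={fp}, false negatives={fn}")
```

In this toy case false negatives outnumber false positives, so the typical error would be predicting that a survivor did not survive.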