Open a Jupyter Notebook to implement this exercise and import everything needed
to load and split the dataset, train a model, and evaluate its recall:

In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import tree
from sklearn.metrics import recall_score

This exercise uses the breast cancer dataset that ships with scikit-learn. Use
the following code to load it and create the pandas DataFrames that hold the
features and the targets:

In [2]:
breast_cancer = load_breast_cancer()

X = pd.DataFrame(breast_cancer.data)
Y = pd.DataFrame(breast_cancer.target)
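
As an optional variation, the feature columns can be labeled with the names
bundled with the dataset, which makes later inspection easier. The sketch below
assumes the feature_names attribute of the object returned by
load_breast_cancer(); the column name "target" is a hypothetical label chosen
for illustration and is not part of the original exercise:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_cancer = load_breast_cancer()

# Label the 30 feature columns with the names bundled with the dataset
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
# "target" is an illustrative column name, not used elsewhere in the exercise
Y = pd.DataFrame(breast_cancer.target, columns=["target"])

print(X.shape, Y.shape)
```

The rest of the exercise works the same way with either version, since the
shapes of X and Y are unchanged.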

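Split the dataset into training, validation (dev), and testing sets. The first
call holds out 10% of the data for testing. The second call reuses the size of
the test set to compute the fraction to hold out, so the validation set ends up
with the same number of instances as the test set: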

In [3]:
X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.1, random_state=101)
test_size = X_test.shape[0] / X_new.shape[0]
X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size=test_size, random_state=101)

print(X_train.shape, Y_train.shape, X_dev.shape, Y_dev.shape, X_test.shape, Y_test.shape)

(455, 30) (455, 1) (57, 30) (57, 1) (57, 30) (57, 1)
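
The sizes printed above can be checked by hand. The following is a minimal
sketch of the arithmetic only, assuming that train_test_split() rounds the
held-out portion up to the next whole instance; the variable names are
illustrative and no data is loaded:

```python
from math import ceil

n_samples = 569  # instances in the breast cancer dataset

# First split: 10% held out for testing, rounded up
n_test = ceil(0.1 * n_samples)
n_new = n_samples - n_test

# Second split: hold out the fraction that matches the test set size
dev_fraction = n_test / n_new
n_dev = ceil(dev_fraction * n_new)
n_train = n_new - n_dev

print(n_train, n_dev, n_test)  # 455 57 57
```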


Create a train/dev set that combines data from both the training and
validation sets:

In [4]:
np.random.seed(101)

indices_train = np.random.randint(0, len(X_train), 25)
indices_dev = np.random.randint(0, len(X_dev), 25)

X_train_dev = pd.concat([X_train.iloc[indices_train, :], X_dev.iloc[indices_dev, :]])
Y_train_dev = pd.concat([Y_train.iloc[indices_train, :], Y_dev.iloc[indices_dev, :]])

print(X_train_dev.shape, Y_train_dev.shape)

(50, 30) (50, 1)


First, a random seed is set to ensure the reproducibility of the results. Next,
the NumPy random.randint() function is used to select random row positions
from the X_train set. To do that, 25 random integers are drawn from the range
that starts at 0 and ends just before the total length of X_train. The same
process is used to generate the random indices for the dev set. Finally, one
variable is created to store the selected rows of X_train and X_dev, and
another to store the corresponding rows of Y_train and Y_dev.
The resulting variables contain 25 instances/labels from the train set and 25
instances/labels from the dev set.
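
Note that random.randint() samples with replacement, so the same row position
can be drawn more than once. If 25 distinct rows per set are wanted, one
possible alternative (not the approach used in this exercise) is to sample
without replacement. The sketch below assumes NumPy's Generator API and uses
the set sizes printed earlier; the variable names are illustrative:

```python
import numpy as np

n_train, n_dev, k = 455, 57, 25  # set sizes from the split above

# randint draws with replacement, so repeated positions are possible
with_replacement = np.random.RandomState(101).randint(0, n_dev, k)

# choice(..., replace=False) guarantees k distinct row positions
rng = np.random.default_rng(101)  # seed chosen to mirror the exercise
idx_train = rng.choice(n_train, size=k, replace=False)
idx_dev = rng.choice(n_dev, size=k, replace=False)

print(len(set(idx_train.tolist())), len(set(idx_dev.tolist())))  # 25 25
```

The resulting arrays can be passed to iloc in the same way as indices_train
and indices_dev.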
 
Train a decision tree on the train set, as follows:

In [5]:
model = tree.DecisionTreeClassifier(random_state=101)
model = model.fit(X_train, Y_train)

Use the predict method to generate the predictions for all of your sets (train,
train/dev, dev, and test). Next, considering that the objective of the study is to
maximize the model's ability to predict all malignant cases, calculate the recall
scores for all predictions. Store all of the scores in a variable named scores:

In [6]:
sets = ["Training", "Train/dev", "Validation", "Testing"]
X_sets = [X_train, X_train_dev, X_dev, X_test]
Y_sets = [Y_train, Y_train_dev, Y_dev, Y_test]

scores = {}
for name, X_set, Y_set in zip(sets, X_sets, Y_sets):
    pred = model.predict(X_set)
    scores[name] = recall_score(Y_set, pred)

print(scores)

{'Training': 1.0, 'Train/dev': 0.9705882352941176, 'Validation': 0.9333333333333333, 'Testing': 0.9714285714285714}
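
One caveat is worth checking before interpreting these numbers. In the copy of
the dataset bundled with scikit-learn, label 0 denotes malignant and label 1
denotes benign, while recall_score() treats label 1 as the positive class by
default. Under that encoding, the scores above would describe recall on the
benign class. The following sketch shows how the pos_label parameter selects
the class of interest; the toy labels are invented purely for illustration and
are not results from this exercise:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score

# Inspect the label encoding: index 0 is the name of label 0, and so on
target_names = list(load_breast_cancer().target_names)
print(target_names)

# Toy labels, invented for illustration (0 = malignant, 1 = benign)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0]

# Default behavior: recall for label 1 (benign), 3 of 4 found
recall_benign = recall_score(y_true, y_pred)
# Recall for label 0 (malignant), 2 of 3 found
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(recall_benign, recall_malignant)
```

If recall on malignant cases is the goal, passing pos_label=0 inside the loop
would be one way to measure it; the resulting scores would differ from those
printed above.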
