This notebook revisits a project I previously completed on diagnosing breast cancer. The original notebook can be found [here](https://github.com/rhkhoo/Breast_Cancer_Diagnosis/blob/master/LogisticRegression_Breast_Cancer.ipynb). Instead of using pandas, this version aims to reproduce the same workflow with Dask.

In [1]:
#!pip install dask_ml --quiet

In [2]:
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split 
from dask_ml.linear_model import LogisticRegression
from dask_ml.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix

In [3]:
cancer = dd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/cancer_processed.csv')

In [4]:
cancer.head()

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,diagnosis
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,M
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,M
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,M
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,M
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,M


In [5]:
diag_map = {'B':0, 'M': 1}

cancer['diagnosis'] = cancer['diagnosis'].map(diag_map)

In [6]:
X = cancer.drop('diagnosis', axis=1)
y = cancer['diagnosis']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20)

In [8]:
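scaler = StandardScaler()
# Note: the return values are not assigned here, so X_train and X_test
# keep their original values and the model below is fit on unscaled features.
scaler.fit_transform(X_train)
scaler.transform(X_test)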

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
npartitions=1,,,,,,,,,,
,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...
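
In the scaling cell above, the transformed frames are displayed but never assigned to a name. Assuming the scaler does not modify its input in place, the later cells therefore still work with unscaled features. The plain-Python sketch below illustrates the intended pattern with made-up numbers: the mean and standard deviation are learned from the training column only, the returned values are kept, and the same parameters are reused on the test column. The helper names `fit_scaler` and `transform` are hypothetical, and the use of the population standard deviation is an assumption about how the scaling is defined.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns to avoid division by zero.
    return mean, (std if std > 0 else 1.0)

def transform(column, mean, std):
    """Standardize values with previously learned parameters."""
    return [(value - mean) / std for value in column]

# Made-up values for illustration only, not taken from the dataset.
train_radius = [10.0, 12.0, 14.0, 16.0, 18.0]
test_radius = [11.0, 20.0]

mean, std = fit_scaler(train_radius)               # fit on training data only
train_scaled = transform(train_radius, mean, std)  # keep the returned values
test_scaled = transform(test_radius, mean, std)    # reuse the training parameters
```

With dask-ml, the equivalent step would be to assign the outputs, for example `X_train = scaler.fit_transform(X_train)` and `X_test = scaler.transform(X_test)`. Doing so would likely change the results reported later in this notebook.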


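dask-ml's `LogisticRegression` only works with Dask arrays, not Dask DataFrames, so `X_train` and `X_test` need to be converted. Passing `lengths=True` computes the chunk sizes, which are otherwise unknown for an array built from a DataFrame.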

In [9]:
X_train_array = X_train.to_dask_array(lengths = True)

In [10]:
X_test_array = X_test.to_dask_array(lengths = True)

In [11]:
%%time
lr_model = LogisticRegression()

lr_model.fit(X_train_array, y_train)

Wall time: 1.12 s


LogisticRegression()

In [12]:
y_pred = lr_model.predict(X_test_array).compute()

In [13]:
y_pred = y_pred.astype(int)

In [14]:
confusion_matrix(y_test, y_pred)

array([[57,  5],
       [ 3, 38]], dtype=int64)

The original pandas-based project produced the following confusion matrix:

[70,   2] <br>
[1,   41]

Using Dask resulted in lower recall (3 false negatives versus 1 in the pandas version). Because this dataset is small, there was no big difference in training time between the two models.
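
To make the comparison concrete, the sketch below derives recall, precision, and accuracy from the two confusion matrices quoted above. It assumes scikit-learn's layout for both matrices (rows are true labels, columns are predicted labels, with 1 = malignant as the positive class). The helper name `summarize` is hypothetical.

```python
def summarize(matrix):
    """Return recall, precision, and accuracy for a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "n_test": total,
    }

dask_cm = [[57, 5], [3, 38]]    # from the Dask run above
pandas_cm = [[70, 2], [1, 41]]  # from the original pandas project

for name, cm in [("dask", dask_cm), ("pandas", pandas_cm)]:
    stats = summarize(cm)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The two matrices sum to 103 and 114 samples respectively, so the test sets differ between the two runs and the metrics are not a like-for-like comparison.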