This project works with the Wisconsin Breast Cancer dataset. Below, I train a logistic regression model to predict the diagnosis (malignant or benign). Thirty features (f1 through f30) are used in the model. Before training, I fit a StandardScaler on these features; note that the training cells below read the unscaled `features` column, so the reported results come from unscaled inputs. I then train the model, compute the accuracy on the test set, and present the confusion matrix. Results are shown at the end.

In [2]:
! pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=45d1ba6daae209af4e2af3bd0c23d8469f50b382e7591856d9665507a5919f81
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [3]:
# load modules
from pyspark.sql import SparkSession
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import VectorAssembler 
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.evaluation import MulticlassMetrics

import os

In [6]:
# param init
infile = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

In [7]:
# read data into dataframe
df = spark.read.csv(infile, inferSchema=True, header = True)

In [8]:
df.count()

569

In [9]:
colnames = []
for i in df.schema.names:
    print(i,sep='',end=', ')
    colnames.append(i)


id, diagnosis, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, f15, f16, f17, f18, f19, f20, f21, f22, f23, f24, f25, f26, f27, f28, f29, f30, 

In [10]:
# check types:
[x for x in df.dtypes if x[1] != 'double']

[('id', 'int'), ('diagnosis', 'string')]

In [11]:
df.select([i for i in df.schema.names if i not in {'id', 'diagnosis'}])

DataFrame[f1: double, f2: double, f3: double, f4: double, f5: double, f6: double, f7: double, f8: double, f9: double, f10: double, f11: double, f12: double, f13: double, f14: double, f15: double, f16: double, f17: double, f18: double, f19: double, f20: double, f21: double, f22: double, f23: double, f24: double, f25: double, f26: double, f27: double, f28: double, f29: double, f30: double]

# MODEL TRAINING

In [12]:
assembler = VectorAssembler(inputCols=["f1","f2","f3","f4","f5","f6","f7","f8","f9",
                                        "f10","f11","f12","f13","f14","f15","f16","f17",
                                        "f18","f19","f20","f21","f22", "f23", "f24", "f25", 
                                       "f26", "f27", "f28", "f29", "f30"], 
                            outputCol="features") 

In [13]:
# scaling
from pyspark.ml.feature import StandardScaler

transformed = assembler.transform(df)
scaler = StandardScaler(inputCol='features',
                       outputCol='scaledFeatures')
sm = scaler.fit(transformed) # model
sd = sm.transform(transformed) # data
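
As a rough illustration of what the scaling step computes, here is a minimal pure-Python sketch. It assumes the scaler's default settings (divide each column by its sample standard deviation, no mean centering); the helper names `fit_std` and `scale` and the toy rows are hypothetical and are not part of the notebook.

```python
import math

def fit_std(rows):
    """Per-column sample standard deviation (n - 1 denominator)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [
        math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
        for c, m in zip(cols, means)
    ]

def scale(rows, stds):
    """Divide each value by its column's std; zero-variance columns map to 0.0."""
    return [[v / s if s > 0 else 0.0 for v, s in zip(row, stds)] for row in rows]

# Toy data (illustrative only, not taken from the dataset)
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
stds = fit_std(rows)
scaled = scale(rows, stds)

# After scaling, every column has unit sample standard deviation
print(fit_std(scaled))
```

Columns measured on very different scales (for example f4 in the thousands versus f5 near 0.1) end up with comparable spread after this step, which is the usual motivation for scaling before fitting a linear model.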

In [14]:
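# convert to RDD
# note: this selects the unscaled "features" column from `transformed`;
# the "scaledFeatures" column in `sd` is not used by the model below
dataRdd = transformed.select("diagnosis", "features").rdd.map(tuple)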

In [15]:
# look at some data
dataRdd.take(2)

[('M',
  DenseVector([17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.0787, 1.095, 0.9053, 8.589, 153.4, 0.0064, 0.049, 0.0537, 0.0159, 0.03, 0.0062, 25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189])),
 ('M',
  DenseVector([20.57, 17.77, 132.9, 1326.0, 0.0847, 0.0786, 0.0869, 0.0702, 0.1812, 0.0567, 0.5435, 0.7339, 3.398, 74.08, 0.0052, 0.0131, 0.0186, 0.0134, 0.0139, 0.0035, 24.99, 23.41, 158.8, 1956.0, 0.1238, 0.1866, 0.2416, 0.186, 0.275, 0.089]))]

In [16]:
# map label to binary values, then convert to LabeledPoint
lp = dataRdd.map(lambda row:(1 if row[0]=='M' else 0, Vectors.dense(row[1])))    \
                    .map(lambda row: LabeledPoint(row[0], row[1]))

In [17]:
# look at some data
lp.take(2)

[LabeledPoint(1.0, [17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189]),
 LabeledPoint(1.0, [20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902])]

In [19]:
# Split data approximately into training (60%) and test (40%)
training, test = lp.randomSplit([0.6, 0.4], seed=311)

In [20]:
# count records in datasets
(training.count(), test.count(), lp.count())

(352, 217, 569)

In [21]:
(training.count()/lp.count(), test.count()/lp.count(), lp.count()/lp.count())

(0.6186291739894552, 0.38137082601054484, 1.0)
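
The observed proportions (about 62% and 38%) differ from the requested 60% and 40% because the split is random per record rather than an exact cut. The sketch below illustrates the idea in plain Python, assuming each record is assigned independently with probability proportional to the weights. The function `random_split` is a hypothetical stand-in; it does not reproduce Spark's random number generator, so its counts will not match the ones above.

```python
import random

def random_split(records, weights, seed):
    """Assign each record independently to one split, with probability
    proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.6, 1.0] for weights [0.6, 0.4]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for rec in records:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split catches any floating-point rounding at the top end
            if u < b or i == len(bounds) - 1:
                splits[i].append(rec)
                break
    return splits

records = list(range(569))  # same number of rows as the dataset
training_ids, test_ids = random_split(records, [0.6, 0.4], seed=311)
print(len(training_ids), len(test_ids), len(training_ids) + len(test_ids))
```

Every record lands in exactly one split, so the two counts always sum to 569, but the individual sizes vary with the seed.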

In [22]:
# Build the model
model = LogisticRegressionWithLBFGS.train(training)

In [23]:
# Evaluating the model on test data
labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy (test): {}'.format(accuracy_te))

model accuracy (test): 0.9354838709677419


In [24]:
from pyspark.mllib.evaluation import MulticlassMetrics # for confusion matrix

print('(1) accuracy: {:.5f}'.format(accuracy_te))
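# note: MulticlassMetrics expects (prediction, label) pairs; passing (label, prediction)
# pairs leaves the accuracy unchanged but transposes the confusion matrix
confmat = (MulticlassMetrics(labelsAndPreds_te)).confusionMatrix().toArray()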
print('(2) Confusion Matrix:')
print(confmat)

(1) accuracy: 0.93548




(2) Confusion Matrix:
[[123.   7.]
 [  7.  80.]]
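
The accuracy can be recomputed directly from this confusion matrix as a sanity check. The sketch below is plain Python with hypothetical helper names. It also reports the share of each row and each column that lies on the diagonal; which of those corresponds to recall and which to precision depends on whether rows hold actual or predicted labels, so both are shown without committing to one reading.

```python
def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def diagonal_share_by_row(cm):
    """For each class i, cm[i][i] divided by the sum of row i."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def diagonal_share_by_col(cm):
    """For each class i, cm[i][i] divided by the sum of column i."""
    return [cm[i][i] / sum(row[i] for row in cm) for i in range(len(cm))]

# Confusion matrix printed above (class 0 = benign, class 1 = malignant)
cm = [[123, 7], [7, 80]]

print('accuracy: {:.5f}'.format(accuracy(cm)))
print('by row:', ['{:.5f}'.format(v) for v in diagonal_share_by_row(cm)])
print('by col:', ['{:.5f}'.format(v) for v in diagonal_share_by_col(cm)])
```

Because both off-diagonal entries are 7 in this run, the row and column shares coincide, so precision and recall are equal for each class here regardless of orientation.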


In [25]:
# Repeat model training with an intercept and a new split seed (314)
assembler = VectorAssembler(inputCols=["f1","f2","f3","f4","f5","f6","f7","f8","f9",
                                        "f10","f11","f12","f13","f14","f15","f16","f17",
                                        "f18","f19","f20","f21","f22", "f23", "f24", "f25", 
                                       "f26", "f27", "f28", "f29", "f30"], 
                            outputCol="features") 

# scaling
transformed = assembler.transform(df)
scaler = StandardScaler(inputCol='features',
                       outputCol='scaledFeatures')
sm = scaler.fit(transformed) # model
sd = sm.transform(transformed) # data
dataRdd = transformed.select("diagnosis", "features").rdd.map(tuple)
lp = dataRdd.map(lambda row:(1 if row[0]=='M' else 0, Vectors.dense(row[1])))    \
                    .map(lambda row: LabeledPoint(row[0], row[1]))
training, test = lp.randomSplit([0.6, 0.4], seed=314)
model = LogisticRegressionWithLBFGS.train(training, intercept=True)
training.count(), test.count(), lp.count()

labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

In [26]:
print('(1) accuracy: {:.5f}'.format(accuracy_te))
confmat = (MulticlassMetrics(labelsAndPreds_te)).confusionMatrix().toArray()
print('(2) Confusion Matrix:')
print(confmat)

(1) accuracy: 0.95775




(2) Confusion Matrix:
[[134.   4.]
 [  5.  70.]]


In [27]:
# Repeat model training with a 70-30 split and no intercept
assembler = VectorAssembler(inputCols=["f1","f2","f3","f4","f5","f6","f7","f8","f9",
                                        "f10","f11","f12","f13","f14","f15","f16","f17",
                                        "f18","f19","f20","f21","f22", "f23", "f24", "f25", 
                                       "f26", "f27", "f28", "f29", "f30"], 
                            outputCol="features") 

# scaling
from pyspark.ml.feature import StandardScaler
transformed = assembler.transform(df)
scaler = StandardScaler(inputCol='features',
                       outputCol='scaledFeatures')
sm = scaler.fit(transformed) # model
sd = sm.transform(transformed) # data
dataRdd = transformed.select("diagnosis", "features").rdd.map(tuple)
lp = dataRdd.map(lambda row:(1 if row[0]=='M' else 0, Vectors.dense(row[1])))    \
                    .map(lambda row: LabeledPoint(row[0], row[1]))
####
training, test = lp.randomSplit([0.7, 0.3], seed=314) #### RANDOM SPLIT 0.7, 0.3
####
model = LogisticRegressionWithLBFGS.train(training, intercept=False)
labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()


In [28]:
print('(1) accuracy: {:.5f}'.format(accuracy_te))
confmat = (MulticlassMetrics(labelsAndPreds_te)).confusionMatrix().toArray()
print('(2) Confusion Matrix:')
print(confmat)

(1) accuracy: 0.94340




(2) Confusion Matrix:
[[98.  6.]
 [ 3. 52.]]


In [29]:
# Repeat model training with a 70-30 split and an intercept
assembler = VectorAssembler(inputCols=["f1","f2","f3","f4","f5","f6","f7","f8","f9",
                                        "f10","f11","f12","f13","f14","f15","f16","f17",
                                        "f18","f19","f20","f21","f22", "f23", "f24", "f25", 
                                       "f26", "f27", "f28", "f29", "f30"], 
                            outputCol="features") 

# scaling
transformed = assembler.transform(df)
scaler = StandardScaler(inputCol='features',
                       outputCol='scaledFeatures')
sm = scaler.fit(transformed) # model
sd = sm.transform(transformed) # data
dataRdd = transformed.select("diagnosis", "features").rdd.map(tuple)
lp = dataRdd.map(lambda row:(1 if row[0]=='M' else 0, Vectors.dense(row[1])))    \
                    .map(lambda row: LabeledPoint(row[0], row[1]))


training, test = lp.randomSplit([0.7, 0.3], seed=314)
model = LogisticRegressionWithLBFGS.train(training, intercept=True)
training.count(), test.count(), lp.count()

labelsAndPreds_te = test.map(lambda p: (p.label, float(model.predict(p.features))))
accuracy_te = 1.0 * labelsAndPreds_te.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

print('(1) accuracy: {:.5f}'.format(accuracy_te))
confmat = (MulticlassMetrics(labelsAndPreds_te)).confusionMatrix().toArray()
print('(2) Confusion Matrix:')
print(confmat)

(1) accuracy: 0.93711




(2) Confusion Matrix:
[[98.  7.]
 [ 3. 51.]]


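The difference between the first and second models was the addition of an intercept in the second model, together with a different random seed for the split (311 vs 314). The second model's accuracy was higher (0.93548 before vs 0.95775 after), though the changed seed means the gain cannot be attributed to the intercept alone.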
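
To make the role of the intercept concrete, here is a minimal sketch of a logistic regression decision rule in plain Python. The weights, features, and intercept values are hypothetical and chosen only for illustration; the 0.5 threshold is assumed to match the library default.

```python
import math

def predict(weights, features, intercept=0.0, threshold=0.5):
    """Return 1 if sigmoid(w . x + b) exceeds the threshold, else 0."""
    z = sum(w * x for w, x in zip(weights, features)) + intercept
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if p > threshold else 0

# Hypothetical two-feature model (not the fitted model from this notebook)
weights = [1.0, -1.0]
x = [0.5, 0.2]  # w . x = 0.3

print(predict(weights, x))                  # no intercept: z = 0.3
print(predict(weights, x, intercept=-0.5))  # with intercept: z = -0.2
```

Without an intercept the decision boundary is forced through the origin of the feature space; the intercept shifts it, which can flip predictions for points near the boundary, as in the second call.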

The same change was made between the third and fourth models, which both used a 70-30 training/test split instead of 60-40 and the same seed (314). This time, adding an intercept lowered the accuracy slightly (0.94340 before vs 0.93711 after). Across all four runs, the second model (intercept, 60-40 split) had the highest accuracy and the first model (no intercept, 60-40 split) the lowest, which suggests that using an intercept with a 60-40 split gives the most accurate model for this data. The differences are small, however, and each comes from a single random split.
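
The accuracies of the four runs can be rechecked from the printed confusion matrices. The sketch below copies the matrices from the outputs above into a plain Python dictionary (the run labels are my own shorthand) and ranks the runs.

```python
def accuracy(cm):
    """Diagonal sum divided by the total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Confusion matrices as printed in cells 24, 26, 28 and 29
runs = {
    'run 1: 60-40, seed 311, no intercept': [[123, 7], [7, 80]],
    'run 2: 60-40, seed 314, intercept':    [[134, 4], [5, 70]],
    'run 3: 70-30, seed 314, no intercept': [[98, 6], [3, 52]],
    'run 4: 70-30, seed 314, intercept':    [[98, 7], [3, 51]],
}

accuracies = {name: accuracy(cm) for name, cm in runs.items()}
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print('{}: {:.5f}'.format(name, acc))

best = max(accuracies, key=accuracies.get)
```

The spread between the best and worst run is about two percentage points, which on test sets of 159 to 217 records corresponds to only a handful of predictions.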


