# For this exercise, you will analyze the breast cancer dataset and build an SVM model to predict whether a tumor is malignant or benign based on a number of features.

## Instruction: You may experiment with the code using trial-and-error and step-by-step methods. For submission, please prepare a clean copy of the notebook file with the final code and final result for each task, and with additional blank cells removed. Irrelevant code will be counted as wrong answers.

### Data file: breast-cancer-wisconsin.csv (no headers)
### Data description: breast_cancer_data_description.txt

## Step 1: create a Spark session

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('svm').getOrCreate()

## Step 2: import the data and display schema

### As the data has no headers, you need to first define a schema then import the data into a dataframe with the predefined schema.

In [3]:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

In [4]:
banknote_schema = StructType([
    StructField('variance',DoubleType(),True),
    StructField('skewness',DoubleType(),True),
    StructField('kurtosis',DoubleType(),True),
    StructField('entropy',DoubleType(),True),
    StructField('label',IntegerType(),True),
 ])

In [5]:
data = spark.read.csv('breast-cancer-wisconsin.csv',header=False,schema=banknote_schema)

In [6]:
data.printSchema()

root
 |-- variance: double (nullable = true)
 |-- skewness: double (nullable = true)
 |-- kurtosis: double (nullable = true)
 |-- entropy: double (nullable = true)
 |-- label: integer (nullable = true)



In [7]:
data.show()

+---------+--------+--------+-------+-----+
| variance|skewness|kurtosis|entropy|label|
+---------+--------+--------+-------+-----+
|1000025.0|     5.0|     1.0|    1.0|    1|
|1002945.0|     5.0|     4.0|    4.0|    5|
|1015425.0|     3.0|     1.0|    1.0|    1|
|1016277.0|     6.0|     8.0|    8.0|    1|
|1017023.0|     4.0|     1.0|    1.0|    3|
|1017122.0|     8.0|    10.0|   10.0|    8|
|1018099.0|     1.0|     1.0|    1.0|    1|
|1018561.0|     2.0|     1.0|    2.0|    1|
|1033078.0|     2.0|     1.0|    1.0|    1|
|1033078.0|     4.0|     2.0|    1.0|    1|
|1035283.0|     1.0|     1.0|    1.0|    1|
|1036172.0|     2.0|     1.0|    1.0|    1|
|1041801.0|     5.0|     3.0|    3.0|    3|
|1043999.0|     1.0|     1.0|    1.0|    1|
|1044572.0|     8.0|     7.0|    5.0|   10|
|1047630.0|     7.0|     4.0|    6.0|    4|
|1048672.0|     4.0|     1.0|    1.0|    1|
|1049815.0|     4.0|     1.0|    1.0|    1|
|1050670.0|    10.0|     7.0|    7.0|    6|
|1050718.0|     6.0|     1.0|   

## Step 3: Transform features and label

### Instructions: 
### 1) All the features need to be transformed into a vector. 
### 2) As the data contains invalid entries, you need to specify how to handle invalid data during data transformation. (Hint: use the handleInvalid='skip' option of VectorAssembler.)
### 3) The class column (2 vs 4) needs to be converted to a label column (0 vs 1) corresponding to benign vs malignant.

### 4) The final dataset should contain 683 rows and two columns (features and label). Show top 20 rows of your final dataset.

In [8]:
from pyspark.ml.feature import VectorAssembler

In [9]:
data.columns

['variance', 'skewness', 'kurtosis', 'entropy', 'label']

In [10]:
assembler = VectorAssembler(inputCols=['variance', 'skewness', 'kurtosis', 'entropy'],
                            outputCol='features')

In [11]:
output = assembler.transform(data)

In [None]:
# Skip rows with invalid (null) values when assembling the feature vector
assembler = assembler.setHandleInvalid('skip')
output = assembler.transform(data)

In [12]:
output.printSchema()

root
 |-- variance: double (nullable = true)
 |-- skewness: double (nullable = true)
 |-- kurtosis: double (nullable = true)
 |-- entropy: double (nullable = true)
 |-- label: integer (nullable = true)
 |-- features: vector (nullable = true)



In [13]:
final_data = output.select('features','label')

In [14]:
final_data.show(20)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[1000025.0,5.0,1....|    1|
|[1002945.0,5.0,4....|    5|
|[1015425.0,3.0,1....|    1|
|[1016277.0,6.0,8....|    1|
|[1017023.0,4.0,1....|    3|
|[1017122.0,8.0,10...|    8|
|[1018099.0,1.0,1....|    1|
|[1018561.0,2.0,1....|    1|
|[1033078.0,2.0,1....|    1|
|[1033078.0,4.0,2....|    1|
|[1035283.0,1.0,1....|    1|
|[1036172.0,2.0,1....|    1|
|[1041801.0,5.0,3....|    3|
|[1043999.0,1.0,1....|    1|
|[1044572.0,8.0,7....|   10|
|[1047630.0,7.0,4....|    4|
|[1048672.0,4.0,1....|    1|
|[1049815.0,4.0,1....|    1|
|[1050670.0,10.0,7...|    6|
|[1050718.0,6.0,1....|    1|
+--------------------+-----+
only showing top 20 rows



## Step 4: build a linear support vector classifier
### Instructions:
### 1) split the data into training and test sets
### 2) create and train the model
### 3) test the model on the test set
### 4) show the label and prediction columns of your predictions

In [1]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

NameError: name 'final_data' is not defined

In [16]:
from pyspark.ml.classification import LinearSVC

In [17]:
lsvc = LinearSVC(maxIter=10,regParam=0.1)

In [18]:
lsvc_model = lsvc.fit(train_data)

IllegalArgumentException: requirement failed: LinearSVC only supports binary classification. 11 classes detected in LinearSVC_90ca1b4b0e23__labelCol

In [None]:
lsvc_preds = lsvc_model.transform(test_data)

## Step 5: Model evaluation
### Display the following measures for the classifier:
### - Area under the ROC
### - Accuracy
### - Precision (optional)
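To make these three measures concrete, here is a plain-Python sketch that computes them by hand on a tiny made-up set of (label, prediction, raw score) rows. The values are hypothetical and are not results from this dataset. In the notebook itself the Spark evaluators do this work on the prediction DataFrame.

```python
# Hypothetical (label, prediction, raw score) rows, NOT output of the model.
rows = [
    (1, 1, 2.1), (1, 1, 0.7), (1, 0, -0.2),
    (0, 0, -1.5), (0, 0, -0.9), (0, 1, 0.3),
    (0, 0, -2.0), (0, 0, -0.4),
]

# Accuracy: share of rows where the prediction equals the label.
correct = sum(1 for label, pred, _ in rows if label == pred)
accuracy = correct / len(rows)

# Precision for the positive class: true positives / all predicted positives.
tp = sum(1 for label, pred, _ in rows if label == 1 and pred == 1)
fp = sum(1 for label, pred, _ in rows if label == 0 and pred == 1)
precision = tp / (tp + fp)


def pair_score(pos_score, neg_score):
    """1 if the positive row outranks the negative one, 0.5 on a tie, else 0."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


# Area under the ROC: share of (positive, negative) pairs ranked correctly.
pos_scores = [score for label, _, score in rows if label == 1]
neg_scores = [score for label, _, score in rows if label == 0]
auc = sum(pair_score(p, n) for p in pos_scores for n in neg_scores) / (
    len(pos_scores) * len(neg_scores))

print(accuracy, precision, auc)
```

Accuracy and precision depend only on the hard predictions, while the area under the ROC depends on how the raw scores rank positives against negatives, which is why the evaluator for it reads the raw prediction column rather than the prediction column.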

In [12]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [13]:
# default metric: area under ROC
my_binary_eval = BinaryClassificationEvaluator(labelCol='label')

In [14]:
print('Area under ROC')
print(my_binary_eval.evaluate(lsvc_preds))

Area under ROC


NameError: name 'lsvc_preds' is not defined

In [15]:
lsvc_preds.show()

NameError: name 'lsvc_preds' is not defined

In [16]:
# test accuracy, precision, and recall
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [17]:
acc_eval = MulticlassClassificationEvaluator(labelCol='label',
                                            metricName='accuracy')

In [18]:
lsvc_acc = acc_eval.evaluate(lsvc_preds)

NameError: name 'lsvc_preds' is not defined

In [19]:
lsvc_acc

NameError: name 'lsvc_acc' is not defined

## Step 6: Conclusions

### Briefly describe the evaluation results and how well your model predicts whether a tumor is malignant or benign based on the given features.