### University of Virginia
### DS 7200: Distributed Computing
### Lab: Supervised Learning
### Last Updated: August 20, 2023

---

### Justin Lee

### jgh2xh

---

#### Instructions

This project has two parts:
- Part I: Classification - build and apply a logistic regression model on the Wisconsin Breast Cancer dataset.
- Part II: Regression - build and apply a linear regression model on the California Housing dataset.

**Total Possible Points: 10**

---

#### Part I: Classification (5 POINTS)

Here are the specifications and grading breakdown:

- the target variable is `diagnosis`
- use `f1`, `f2` as predictors (1 PT)
- split data into 60% training set, 40% test set 
- standardize the predictors (1 PT)
- use seed=314 whenever a seed is needed
- fit a Logistic Regression model with an intercept (1 PT)
- compute and show the area under the ROC curve for the test set (2 PTS)
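
To make "standardize the predictors" concrete, here is a minimal plain-Python sketch of scaling one column to unit standard deviation, with optional centering. The helper name `standardize` and the toy values are hypothetical and not part of the lab. The sketch uses the sample standard deviation (n - 1), on the assumption that this matches how Spark's `StandardScaler` computes unit standard deviation.

```python
import statistics

def standardize(values, with_mean=False, with_std=True):
    """Scale a list of numbers; hypothetical helper for illustration only."""
    mean = statistics.mean(values) if with_mean else 0.0
    std = statistics.stdev(values) if with_std else 1.0  # sample std (n - 1)
    return [(v - mean) / std for v in values]

toy = [2.0, 4.0, 6.0, 8.0]  # made-up values, not taken from the dataset

scaled_only = standardize(toy)               # unit std, mean left as-is
centered = standardize(toy, with_mean=True)  # zero mean and unit std

print(scaled_only)
print(centered)
```

Both variants give a column with unit standard deviation; only the centered one also has zero mean. Which one a pipeline produces depends on whether centering is switched on.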

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

DATA_FILEPATH = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

sc = spark.sparkContext

seed = 314


In [2]:
wbc = spark.read.csv(DATA_FILEPATH, inferSchema=True, header=True)

# train-test 60-40 split
wbc_splits = wbc.randomSplit([0.6, 0.4], seed)
wbc_train = wbc_splits[0]
wbc_test = wbc_splits[1]

print('wbc shape:', (wbc.count(), len(wbc.columns)))
print('wbc_train shape:', (wbc_train.count(), len(wbc_train.columns)))
print('wbc_test shape:', (wbc_test.count(), len(wbc_test.columns)))

wbc shape: (569, 32)
wbc_train shape: (354, 32)
wbc_test shape: (215, 32)
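
The split sizes above (354 and 215 of 569 rows) are close to 60/40 but not exact. That is what one would expect if each row is assigned to a split by an independent random draw, which is my understanding of how `randomSplit` behaves. The sketch below illustrates the idea in plain Python. `bernoulli_split` is a hypothetical helper; it does not reproduce Spark's sampler or its exact counts.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(569))  # stand-in for the 569 rows of wbc
train, test = bernoulli_split(rows, 0.6, seed=314)

# Sizes land near 341 / 228 but are not guaranteed to hit them exactly.
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, but it does not make the proportions exact.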


In [3]:
wbc.printSchema()

root
 |-- id: integer (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)
 |-- f13: double (nullable = true)
 |-- f14: double (nullable = true)
 |-- f15: double (nullable = true)
 |-- f16: double (nullable = true)
 |-- f17: double (nullable = true)
 |-- f18: double (nullable = true)
 |-- f19: double (nullable = true)
 |-- f20: double (nullable = true)
 |-- f21: double (nullable = true)
 |-- f22: double (nullable = true)
 |-- f23: double (nullable = true)
 |-- f24: double (nullable = true)
 |-- f25: double (nullable = true)
 |-- f26: double (nullable = true)
 |-- f27: double (

#### Enter code and solution

In [4]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LogisticRegression

# Hyperparams
maxIter = 100
regParam = 0.0
elasticNetParam = 0.0

# convert diagnosis to a 0/1 label
# (by default StringIndexer gives index 0.0 to the most frequent value)
log_reg_stage1 = StringIndexer(inputCol='diagnosis', outputCol='bin_diagnosis')
# gather features into feature vector
log_reg_stage2 = VectorAssembler(inputCols=['f1', 'f2'], outputCol="features")
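# scale features to unit standard deviation
# (StandardScaler defaults: withStd=True, withMean=False, so features are not centered)
log_reg_stage3 = StandardScaler(inputCol='features', outputCol='scaledFeatures')
# logistic regression with an intercept (fitIntercept=True is also the default)
log_reg_stage4 = LogisticRegression(labelCol='bin_diagnosis',
                                    featuresCol='scaledFeatures',
                                    fitIntercept=True,
                                    maxIter=maxIter,
                                    regParam=regParam,
                                    elasticNetParam=elasticNetParam)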

log_reg_pipeline = Pipeline(stages=[log_reg_stage1, log_reg_stage2, log_reg_stage3, log_reg_stage4])

In [5]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# fit model on training set
lr_model = log_reg_pipeline.fit(wbc_train)

# make predictions on test set
lr_preds = lr_model.transform(wbc_test)

print('Coefficients: ' + str(lr_model.stages[-1].coefficients))
print('Intercept: ' + str(lr_model.stages[-1].intercept))

lr_preds.select('probability', 'prediction').show(5, False)

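# NOTE: rawPredictionCol='prediction' feeds the hard 0/1 labels to the evaluator,
# so the AUROC printed below reflects a single decision threshold. To score the
# model's continuous outputs instead, pass rawPredictionCol='rawPrediction'.
log_reg_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                                  labelCol='bin_diagnosis',
                                                  metricName='areaUnderROC')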

auroc = log_reg_evaluator.evaluate(lr_preds)

print(f'Area under ROC: {auroc}')

Coefficients: [3.570044628034343,1.0818379381825294]
Intercept: -19.503257088634616
+------------------------------------------+----------+
|probability                               |prediction|
+------------------------------------------+----------+
|[0.9240174999811224,0.07598250001887763]  |0.0       |
|[0.0014364608997718892,0.9985635391002281]|1.0       |
|[0.8388312056356977,0.16116879436430231]  |0.0       |
|[0.5748392470338999,0.4251607529661001]   |0.0       |
|[0.790322784707739,0.20967721529226102]   |0.0       |
+------------------------------------------+----------+
only showing top 5 rows

Area under ROC: 0.8934253402338509
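
The evaluator above is given the hard 0/1 `prediction` column rather than a continuous score, which can change the reported AUROC. The sketch below compares the two in plain Python, using the pairwise (Mann-Whitney) formulation of AUROC. The function `auroc` and the toy labels and probabilities are hypothetical and unrelated to the lab data.

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1]                      # made-up ground truth
probs = [0.10, 0.40, 0.45, 0.60, 0.70, 0.90]     # made-up positive-class scores
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]  # thresholded at 0.5

auc_scores = auroc(labels, probs)
auc_hard = auroc(labels, hard)
print(auc_scores, auc_hard)
```

On this toy data the score-based value is 8/9 and the thresholded value is 2/3. Thresholding discards the ranking within each predicted class, so the two numbers generally differ.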


#### Part II: Regression (5 POINTS)

In this part, you will work with the California Housing dataset to train a regression model and predict median home prices. Here are the specifications and grading breakdown:

- Scale the response variable `median_house_value`, dividing by 100000 (1 PT)

- Split data into train set (80%), test set (20%) using seed=314 (1 PT)

- Add new predictor: `rooms_per_household`

- In the training set, select all of these features and standardize them: (1 PT)

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

- Fit a linear regression model on the training set with these parameters:

  - maxIter=10
  - regParam=0.3
  - elasticNetParam=0.8  


- Compute the MSE on the test set (2 PTS)
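
For orientation, `regParam` and `elasticNetParam` together define an elastic-net penalty. `regParam` is the overall strength λ, and `elasticNetParam` is the mix α between an L1 penalty (α = 1) and an L2 penalty (α = 0). The plain-Python sketch below computes that penalty and the MSE metric. It assumes the penalty form λ(α‖w‖₁ + (1 − α)/2 · ‖w‖₂²) described in the Spark ML documentation. The function names and toy numbers are hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """lambda * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

toy_weights = [0.0, 0.5, -1.0]  # made-up coefficients
penalty = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=0.8)
mse_toy = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])

print(penalty, mse_toy)
```

With α = 0.8 most of the penalty is L1, which tends to push some coefficients to exactly zero.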

In [6]:
import os
import pandas as pd

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [7]:
DATA_FILEPATH2 = 'cal_housing_data_preproc_w_header.txt'

In [8]:
chp = spark.read.csv(DATA_FILEPATH2, inferSchema=True, header=True)

# scale response variable 'median_house_value'
chp = chp.withColumn('median_house_value_scaled', chp.median_house_value / 100000)

# add new predictor 'rooms_per_household'
chp = chp.withColumn('rooms_per_household', chp.total_rooms / chp.households)

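# train-test split: 60/40 as run here, and the outputs below reflect it
# NOTE: the specification above asks for 80/20, i.e. randomSplit([0.8, 0.2], seed)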
chp_splits = chp.randomSplit([0.6, 0.4], seed)
chp_train = chp_splits[0]
chp_test = chp_splits[1]

print('chp shape:', (chp.count(), len(chp.columns)))
print('chp_train shape:', (chp_train.count(), len(chp_train.columns)))
print('chp_test shape:', (chp_test.count(), len(chp_test.columns)))

chp shape: (20640, 11)
chp_train shape: (12305, 11)
chp_test shape: (8335, 11)


In [9]:
chp.printSchema()

root
 |-- median_house_value: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- median_house_value_scaled: double (nullable = true)
 |-- rooms_per_household: double (nullable = true)



#### Enter code and solution

In [10]:
from pyspark.ml.regression import LinearRegression

# Hyperparams
maxIter = 10
regParam = 0.3
elasticNetParam = 0.8

# gather features into feature vector
feats = ['total_bedrooms', 'population', 'households', 'median_income', 'rooms_per_household']
lin_reg_stage1 = VectorAssembler(inputCols=feats, outputCol='features')
# scale features
lin_reg_stage2 = StandardScaler(inputCol='features', outputCol='scaledFeatures')
# linear regression
lin_reg_stage3 = LinearRegression(featuresCol='scaledFeatures',
                                  labelCol='median_house_value_scaled',
                                  maxIter=maxIter,
                                  regParam=regParam,
                                  elasticNetParam=elasticNetParam)

lin_reg_pipeline = Pipeline(stages=[lin_reg_stage1, lin_reg_stage2, lin_reg_stage3])

In [11]:
from pyspark.ml.evaluation import RegressionEvaluator

# fit model on training set
lin_model = lin_reg_pipeline.fit(chp_train)

# make predictions on test set
lin_preds = lin_model.transform(chp_test)

print('Coefficients: ' + str(lin_model.stages[-1].coefficients))
print('Intercept: ' + str(lin_model.stages[-1].intercept))

lin_preds.select('prediction', 'scaledFeatures').show(5, False)

lin_reg_evaluator = RegressionEvaluator(predictionCol='prediction',
                                        labelCol='median_house_value_scaled',
                                        metricName='mse')

mse = lin_reg_evaluator.evaluate(lin_preds)

print(f'Mean Squared Error: {mse}')

Coefficients: [0.0,0.0,0.0,0.521610036433405,0.0]
Intercept: 1.0012589806950158
+------------------+----------------------------------------------------------------------------------------------------+
|prediction        |scaledFeatures                                                                                      |
+------------------+----------------------------------------------------------------------------------------------------+
|2.1522346892941897|[0.6345808898555432,0.5585850806708528,0.5863844642006424,2.2065827499584953,1.635274572558393]     |
|1.2185697382305414|[0.18775988875875624,0.14854093705737645,0.13812611823392912,0.41661536848758474,0.9250507609282013]|
|1.746159743357526 |[4.142601089955849,6.079504819084838,3.7502544177098867,1.4280798117994287,3.176214083207554]       |
|1.2365211384362043|[0.07843134593720197,0.05692586809384487,0.07036613570407708,0.4510307342815572,0.7467007180631932] |
|1.6038935242964385|[1.2905521467848688,1.265711098399082,1.256165