### University of Virginia
### DS 7200: Distributed Computing
### Lab: Supervised Learning
### Last Updated: August 20, 2023

## jq2uw 
## Jiaxing Qiu

---

#### Instructions

This project has two parts:
- Part I: Classification - build and apply a logistic regression model on the Wisconsin Breast Cancer dataset.
- Part II: Regression - build and apply a linear regression model on the California Housing dataset.

**Total Possible Points: 10**

---

#### Part I: Classification (5 POINTS)

Here are the specifications and grading breakdown:

- the target variable is `diagnosis`
- use `f1`, `f2` as predictors (1 PT)
- split data into 60% training set, 40% test set 
- standardize the predictors (1 PT)
- use seed=314 whenever a seed is needed
- fit a Logistic Regression model with an intercept (1 PT)
- compute and show the area under the ROC curve for the test set (2 PTS)

#### Enter code and solution

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.sql.types import StringType, StructType, StructField


In [2]:

DATA_FILEPATH = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

In [5]:
# load data
data = spark.read.csv(DATA_FILEPATH, header=True, inferSchema=True)
#data.show(2)

# ---- global engineering X ----
# assemble data 
assembler = VectorAssembler(inputCols=["f" + str(i) for i in range(1, 3)],
                            outputCol="features")
data = assembler.transform(data)
#data.show(2, truncate=False)

# ---- train/test split ----
from pyspark.sql.functions import rand
from pyspark.sql import functions as F
train_ratio = 0.6

# use row index as id per observation
data = data.withColumn("row_id", F.monotonically_increasing_id())
#data.select("id","row_id").show(20)

# get unique id per patient
grouped_data = data.groupBy("id").agg(F.max("row_id").alias("gid")).sort("gid")
#grouped_data.show(10)

# Use randomSplit to split the data
splits = grouped_data.randomSplit([0.6, 0.4], seed=413)

# The 'splits' list now contains two DataFrames with the specified ratios
train_groups = splits[0]
test_groups = splits[1]
#train_groups.show(10)

# Join the train and test groups with the original data to get the final train and test sets
train_data = data.join(train_groups, on="id")
test_data = data.join(test_groups, on="id")

# Show the resulting train and test DataFrames
print("Train DataFrame:")
train_data.select("id","gid").show(2)
print("Test DataFrame:")
test_data.select("id","gid").show(2)


# ---- engineering x & y based on training data ----
# --- y ---
indexer = StringIndexer(inputCol="diagnosis", outputCol="diagnosis_id")
indexerModel = indexer.fit(train_data)
train_data = indexerModel.transform(train_data)
test_data = indexerModel.transform(test_data)

# --- x ---
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(train_data)
train_data = scalerModel.transform(train_data)
test_data = scalerModel.transform(test_data)


# encoder = OneHotEncoder(inputCol="diagnosis_id", outputCol="diagnosis_onehot")
# encoderModel = encoder.fit(data)
# data = encoderModel.transform(data)
# data.select("scaledFeatures","diagnosis_onehot").show(2, truncate=False)



Train DataFrame:
+----+---+
|  id|gid|
+----+---+
|8670|131|
|8913|287|
+----+---+
only showing top 2 rows

Test DataFrame:
+--------+---+
|      id|gid|
+--------+---+
|84348301|  3|
|84799002| 15|
+--------+---+
only showing top 2 rows



In [6]:
# instantiate the model
lr = LogisticRegression(labelCol='diagnosis_id',
                        featuresCol='scaledFeatures',
                        maxIter=1000, 
                        regParam=0.1, 
                        elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(train_data)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Coefficients: [1.3128433255771863,0.15474126849459116]
Intercept: -6.415834114503757


In [7]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# compute predictions. this will append column "prediction" to dataframe
lrPred = lrModel.transform(test_data)
lrPred.select('probability','prediction').show(5,truncate=False)

# set up evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction",
                                          labelCol="diagnosis_id",
                                          metricName="areaUnderROC")

# pass to evaluator the DF with predictions, labels
auc = evaluator.evaluate(lrPred)

print("Area under ROC Curve:", auc)

+----------------------------------------+----------+
|probability                             |prediction|
+----------------------------------------+----------+
|[0.8148024469230046,0.18519755307699537]|0.0       |
|[0.5185936559323403,0.4814063440676597] |0.0       |
|[0.4370845371764407,0.5629154628235593] |1.0       |
|[0.5681036046827329,0.4318963953172671] |0.0       |
|[0.5701903017591848,0.42980969824081516]|0.0       |
+----------------------------------------+----------+
only showing top 5 rows

Area under ROC Curve: 0.8345588235294118


#### Part II: Regression (5 POINTS)

In this project, you will work with the California Home Price dataset to train a regression model and predict median home prices. Here are the specifications and grading breakdown:

- Scale the response variable median_house_value, dividing by 100000 (1 PT)

- Split data into train set (80%), test set (20%) using seed=314 (1 PT)

- Add new predictor: `rooms_per_household`

- In the training set, select all of these features and standardize them: (1 PT)

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

- Fit a linear regression model on the training set with these parameters:

  - maxIter=10
  - regParam=0.3
  - elasticNetParam=0.8  


- Compute the MSE on the test set (2 PTS)

In [8]:
import os
import pandas as pd

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [9]:
DATA_FILEPATH2 = 'cal_housing_data_preproc_w_header.txt'

#### Enter code and solution

In [19]:

from pyspark.sql.functions import col


# load data
data = spark.read.csv(DATA_FILEPATH2, header=True, inferSchema=True)
# data.show(2)


# Scale the response variable median_house_value, dividing by 100000 
data = data.withColumn("median_house_value",  col("median_house_value")/100000)

# Add new predictor: rooms_per_household
data = data.withColumn("rooms_per_household",  col("total_rooms")/col("households"))

assembler = VectorAssembler(inputCols=["total_bedrooms", "population", "households", "median_income", "rooms_per_household"],
                            outputCol="features")
data = assembler.transform(data)


# Split data into train set (80%), test set (20%) using seed=314
splits = data.randomSplit([0.8, 0.2], seed=413)
train_data = splits[0]
test_data = splits[1]

# In the training set, select all of these features and standardize them:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(train_data)
train_data = scalerModel.transform(train_data)
test_data = scalerModel.transform(test_data)
train_data.show(10)


+------------------+------------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+
|median_house_value|     median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|            features|      scaledFeatures|
+------------------+------------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+
|           0.14999|             0.536|              36.0|       98.0|          28.0|      18.0|       8.0|   40.31|  -123.17|              12.25|[28.0,18.0,8.0,0....|[0.06602916111827...|
|           0.14999|            1.6607|              16.0|      255.0|          73.0|      85.0|      38.0|   39.71|  -122.74| 6.7105263157894735|[73.0,85.0,38.0,1...|[0.17214745577263...|
|           0.14999|               2.1|              19

In [20]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features',         # feature vector name
                      labelCol='median_house_value',  # target variable name
                      maxIter=10,
                      regParam=0.3, 
                      elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(train_data)

# Print the weights and intercept for linear regression
print("Weights: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Weights: [0.0,0.0,0.0,0.27621981890278596,0.0]
Intercept: 0.9994485205489783


In [21]:
from pyspark.ml.evaluation import RegressionEvaluator

# compute predictions. this will append column "prediction" to dataframe
lrPred = lrModel.transform(test_data)
lrPred.show(1)

ev = RegressionEvaluator(predictionCol="prediction", labelCol="median_house_value")

print('-'*20)
print("METRICS")
print("Mean Squared Error:", ev.evaluate(lrPred, {ev.metricName: "mse"}))
print("R Squared:", ev.evaluate(lrPred, {ev.metricName:'r2'}))

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+------------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|            features|      scaledFeatures|        prediction|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+------------------+
|           0.14999|       4.1932|              52.0|      803.0|         267.0|     628.0|     225.0|   34.24|  -117.86|  3.568888888888889|[267.0,628.0,225....|[0.62963521494924...|2.1576934651721404|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+-----