### University of Virginia
### DS 7200: Distributed Computing
### Lab: Supervised Learning
### Last Updated: August 20, 2023

---

#### Instructions

This project has two parts:
- Part I: Classification - build and apply a logistic regression model on the Wisconsin Breast Cancer dataset.
- Part II: Regression - build and apply a linear regression model on the California Housing dataset.

**Total Possible Points: 10**

---

#### Part I: Classification (5 POINTS)

Here are the specifications and grading breakdown:

- the target variable is `diagnosis`
- use `f1`, `f2` as predictors (1 PT)
- split data into 60% training set, 40% test set 
- standardize the predictors (1 PT)
- use seed=314 whenever a seed is needed
- fit a Logistic Regression model with an intercept (1 PT)
- compute and show the area under the ROC curve for the test set (2 PTS)
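
As a reference for the last item, here is a minimal plain-Python sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting as half. The function name `roc_auc` and the toy labels and scores are made up for illustration. In PySpark the metric would normally come from `BinaryClassificationEvaluator` with `metricName="areaUnderROC"`, whose value may differ slightly from this pairwise count because of how the curve is binned.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: sequence of 0/1 values (1 = positive class)
    scores: sequence of predicted scores, higher = more likely positive
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, made up for illustration (not from the lab dataset)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.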

In [10]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.sql.types import StringType, StructType, StructField


In [1]:

DATA_FILEPATH = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

In [25]:
# load data
data = spark.read.csv(DATA_FILEPATH, header=True, inferSchema=True)
#data.show(2)

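# ---- engineering X ----
# assemble the two predictors f1 and f2 into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
data = assembler.transform(data)
#data.show(2, truncate=False)

# scale data
# NOTE: StandardScaler defaults to withStd=True, withMean=False, so the features
# are divided by their standard deviation but not centered. The scaler is also
# fit on the full dataset here, before the train/test split.
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(data)
data = scalerModel.transform(data)
#data.show(2, truncate=False)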

# ---- engineering y ----
indexer = StringIndexer(inputCol="diagnosis", outputCol="diagnosis_id")
indexerModel = indexer.fit(data)
data = indexerModel.transform(data)
data.select("scaledFeatures","diagnosis_id").show(2, truncate=False)

# encoder = OneHotEncoder(inputCol="diagnosis_id", outputCol="diagnosis_onehot")
# encoderModel = encoder.fit(data)
# data = encoderModel.transform(data)
# data.select("scaledFeatures","diagnosis_onehot").show(2, truncate=False)



+--------------------------------------+------------+
|scaledFeatures                        |diagnosis_id|
+--------------------------------------+------------+
|[5.104923594187837,2.4133721641714776]|1.0         |
|[5.837036038490485,4.131562943865814] |1.0         |
+--------------------------------------+------------+
only showing top 2 rows
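
The scaled values above are all positive and well away from zero, which is consistent with `StandardScaler`'s defaults (`withStd=True`, `withMean=False`): each feature is divided by its sample standard deviation but not centered. The plain-Python sketch below shows the difference between scaling only and full standardization on a toy column. The helper name `scale_column` and the toy values are assumptions for illustration, not part of the lab code.

```python
import statistics


def scale_column(values, with_mean=False, with_std=True):
    """Scale one numeric column, mirroring the two StandardScaler switches."""
    mean = statistics.mean(values) if with_mean else 0.0
    # statistics.stdev is the sample standard deviation (divides by n - 1)
    std = statistics.stdev(values) if with_std else 1.0
    return [(v - mean) / std for v in values]


# Toy column, made up for illustration
toy_column = [2.0, 4.0, 6.0]

scaled_only = scale_column(toy_column)                   # divide by std only
standardized = scale_column(toy_column, with_mean=True)  # center, then divide by std

print(scaled_only)
print(standardized)
```

If the spec's "standardize" is meant as zero mean and unit variance, passing `withMean=True` to `StandardScaler` would be the corresponding setting.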



In [26]:
data.select("diagnosis_id").distinct().show()

+------------+
|diagnosis_id|
+------------+
|         0.0|
|         1.0|
+------------+



In [35]:
df_with_random.select("id", "random_number").show(10)#.distinct().count()* 0.6

+-----+-------------+
|   id|random_number|
+-----+-------------+
| 8670|            1|
| 8913|            1|
| 8915|            1|
| 9047|            1|
|85715|            1|
|86208|            1|
|86211|            1|
|86355|            1|
|86408|            1|
|86409|            1|
+-----+-------------+
only showing top 10 rows



In [29]:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

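# Use window functions to assign a random number to each row within each group
# NOTE: `id` appears to be unique per row (the `random_number` column is 1 for
# every id shown), so each partition holds a single row and row_number() is
# always 1.
window_spec = Window.partitionBy("id").orderBy(F.rand(seed=314))
df_with_random = data.withColumn("random_number", F.row_number().over(window_spec))

# Determine the split threshold based on group size and ratio
# NOTE: this counts distinct (id, random_number) pairs over the whole table, so
# the threshold is at least as large as every row number. Every row passes the
# train filter and the test set comes out empty. A row-wise split such as
# data.randomSplit([0.6, 0.4], seed=314) avoids this.
split_threshold = int(df_with_random.select("id", "random_number").distinct().count() * 0.6)

# Split the DataFrame into train and test sets
train_data = df_with_random.filter(F.col("random_number") <= split_threshold).drop("random_number")
test_data = df_with_random.filter(F.col("random_number") > split_threshold).drop("random_number")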

# Show the resulting train and test DataFrames
print("Train DataFrame:")
train_data.show(2)
print("Test DataFrame:")
test_data.show(2)

Train DataFrame:
+----+---------+-----+-----+-----+-----+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+--------+-------+--------+-----+-----+-----+------+-------+------+------+-------+------+-------+-------------+--------------------+------------+
|  id|diagnosis|   f1|   f2|   f3|   f4|     f5|     f6|    f7|     f8|    f9|    f10|   f11|   f12|  f13|  f14|     f15|    f16|    f17|     f18|    f19|     f20|  f21|  f22|  f23|   f24|    f25|   f26|   f27|    f28|   f29|    f30|     features|      scaledFeatures|diagnosis_id|
+----+---------+-----+-----+-----+-----+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+--------+-------+--------+-----+-----+-----+------+-------+------+------+-------+------+-------+-------------+--------------------+------------+
|8670|        M|15.46|19.48|101.7|748.9| 0.1092| 0.1223|0.1466|0.08087|0.1931|0.05796|0.4743|0.7859|3.094|48.31| 0.00624|0.01484|0.02813|

In [30]:
test_data.show()

+---+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--------+--------------+------------+
| id|diagnosis| f1| f2| f3| f4| f5| f6| f7| f8| f9|f10|f11|f12|f13|f14|f15|f16|f17|f18|f19|f20|f21|f22|f23|f24|f25|f26|f27|f28|f29|f30|features|scaledFeatures|diagnosis_id|
+---+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--------+--------------+------------+
+---+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--------+--------------+------------+
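
The empty test set follows from the window-based split: with one row per `id`, every row number is 1 and never exceeds the threshold. A row-wise random split does not have this problem. The sketch below shows the idea in plain Python, sending each row to the training set with probability 0.6 under a fixed seed. The helper `random_split` and the toy ids are hypothetical; in PySpark the built-in `DataFrame.randomSplit([0.6, 0.4], seed=314)` plays this role, and its exact row assignment will not match this sketch.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Toy row ids, made up for illustration
toy_ids = list(range(1000))
train_ids, test_ids = random_split(toy_ids, train_fraction=0.6, seed=314)

# The sizes are only approximately 60/40 because each row is drawn independently.
print(len(train_ids), len(test_ids))
```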



In [24]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Create a Spark session
spark = SparkSession.builder.appName("GroupedTrainTestSplit").getOrCreate()

# Sample DataFrame with groups (e.g., user_id and features)
data = [(1, "A", 0.1),
        (1, "B", 0.2),
        (1, "C", 0.3),
        (2, "A", 0.4),
        (2, "B", 0.5),
        (3, "A", 0.6),
        (3, "B", 0.7),
        (3, "C", 0.8)]
columns = ["user_id", "feature", "value"]
df = spark.createDataFrame(data, columns)

# Define the train and test group sizes (e.g., 80% train, 20% test)
train_ratio = 0.8
test_ratio = 1 - train_ratio

# Use window functions to assign a random number to each row within each group
# NOTE: row numbers restart at 1 within each user_id and reach at most 3 here.
window_spec = Window.partitionBy("user_id").orderBy(F.rand())
df_with_random = df.withColumn("random_number", F.row_number().over(window_spec))

# Determine the split threshold based on group size and ratio
# NOTE: there are 8 distinct (user_id, random_number) pairs, so the threshold is
# int(8 * 0.8) = 6. No row number exceeds 3, so every row lands in the train set
# and the test set is empty.
split_threshold = int(df_with_random.select("user_id", "random_number").distinct().count() * train_ratio)

# Split the DataFrame into train and test sets
train_df = df_with_random.filter(F.col("random_number") <= split_threshold).drop("random_number")
test_df = df_with_random.filter(F.col("random_number") > split_threshold).drop("random_number")

# Show the resulting train and test DataFrames
print("Train DataFrame:")
train_df.show()
print("Test DataFrame:")
test_df.show()


Train DataFrame:
+-------+-------+-----+
|user_id|feature|value|
+-------+-------+-----+
|      1|      C|  0.3|
|      1|      A|  0.1|
|      1|      B|  0.2|
|      2|      B|  0.5|
|      2|      A|  0.4|
|      3|      B|  0.7|
|      3|      A|  0.6|
|      3|      C|  0.8|
+-------+-------+-----+

Test DataFrame:
+-------+-------+-----+
|user_id|feature|value|
+-------+-------+-----+
+-------+-------+-----+



In [8]:
# instantiate the model
# labelCol must be numeric, so use the StringIndexer output `diagnosis_id`.
# The original call used labelCol='diagnosis' and failed with:
#   IllegalArgumentException: requirement failed: Column diagnosis must be of
#   type numeric but was actually of type string.
lr = LogisticRegression(labelCol='diagnosis_id',
                        featuresCol='scaledFeatures',
                        fitIntercept=True,
                        maxIter=100,
                        regParam=0.3,
                        elasticNetParam=0.8)

# Fit the model on the training set
lrModel = lr.fit(train_data)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
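
The `regParam` and `elasticNetParam` arguments in the cell above control an elastic-net penalty on the coefficients. As described in the Spark MLlib documentation for linear methods, the penalty mixes an L1 term and an L2 term: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The plain-Python sketch below evaluates that expression for a toy coefficient vector. The function name and the coefficients are made up for illustration, and the sketch covers only the penalty term, not the full training objective.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Elastic-net penalty: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1_norm = sum(abs(w) for w in coefficients)
    l2_norm_squared = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1_norm
                        + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared)


# Toy coefficients, made up for illustration
toy_coefficients = [1.0, -2.0]

# Same settings as the notebook cell: regParam=0.3, elasticNetParam=0.8
penalty = elastic_net_penalty(toy_coefficients, reg_param=0.3, elastic_net_param=0.8)

# elastic_net_param=0 gives a pure L2 (ridge) penalty, 1 a pure L1 (lasso) penalty
ridge_penalty = elastic_net_penalty(toy_coefficients, reg_param=0.3, elastic_net_param=0.0)
lasso_penalty = elastic_net_penalty(toy_coefficients, reg_param=0.3, elastic_net_param=1.0)

print(penalty, ridge_penalty, lasso_penalty)
```

With `elasticNetParam=0.8` the penalty is mostly L1, which tends to push small coefficients to exactly zero.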

#### Enter code and solution

#### Part II: Regression (5 POINTS)

In this part, you will work with the California Housing dataset to train a regression model and predict median home prices. Here are the specifications and grading breakdown:

- Scale the response variable `median_house_value`, dividing by 100000 (1 PT)

- Split data into train set (80%), test set (20%) using seed=314 (1 PT)

- Add new predictor: `rooms_per_household`

- In the training set, select all of these features and standardize them: (1 PT)

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

- Fit a linear regression model on the training set with these parameters:

  - maxIter=10
  - regParam=0.3
  - elasticNetParam=0.8  


- Compute the MSE on the test set (2 PTS)
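
For reference, here is a minimal plain-Python sketch of the first and last items: dividing the response by 100000 and computing the mean squared error between actual and predicted values. The house values and predictions are made up for illustration, and `mean_squared_error` is a hypothetical helper. In PySpark the metric would typically come from `RegressionEvaluator` with `metricName="mse"`.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy median house values in dollars, made up for illustration
raw_values = [250000.0, 100000.0, 400000.0]

# Scale the response by dividing by 100000, as the spec asks
scaled_values = [v / 100000 for v in raw_values]

# Toy predictions on the scaled response, also made up
toy_predictions = [2.0, 1.5, 4.0]

mse = mean_squared_error(scaled_values, toy_predictions)
print(mse)
```

Because the response is scaled before fitting, the MSE is in units of (100000 dollars) squared rather than dollars squared.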

In [1]:
import os
import pandas as pd

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()


In [2]:
DATA_FILEPATH2 = 'cal_housing_data_preproc_w_header.txt'

#### Enter code and solution