### University of Virginia 
### DS 5110: Big Data Systems
### Assignment: Tools for Supervised Learning
### Last Updated: April 7, 2022
---  

**INSTRUCTIONS**  
In this assignment, you will code functions to support supervised learning tasks.  The outline is provided below.  The value *None* is used as a placeholder. For random sampling use seed=314 throughout. Each task should be treated independently.

**TOTAL POINTS: 10**

**NOTE TO GRADER**  
Results might vary slightly due to factors such as the number of cores and whether the RDD or DataFrame API is used.

In [1]:
import os

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("data preprocessing") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'8g') \
    .getOrCreate()

sc = spark.sparkContext

### PARAMETERS

In [None]:
# update to match your path
directory_path = '/sfs/qumulo/qhome/apt4c/ds5110/assignments/M7_8_supervised_tools/'
full_path_to_file = os.path.join(directory_path, 'breast_cancer_wisconsin.csv')
path_to_data = full_path_to_file

In [4]:
# class = 2 for benign (negative class), 4 for malignant (positive class)
target = 'class'
positive_label = 4
negative_label = 2

SEED = 314

### READ IN DATA

In [5]:
brca = spark.read.csv(path_to_data, header=True, inferSchema=True)

In [6]:
brca.printSchema()

root
 |-- id: integer (nullable = true)
 |-- clump_thickness: integer (nullable = true)
 |-- uniformity_cell_size: integer (nullable = true)
 |-- uniformity_cell_shape: integer (nullable = true)
 |-- marginal_adhesion: integer (nullable = true)
 |-- single_epithelial_cell_size: integer (nullable = true)
 |-- bare_nuclei: string (nullable = true)
 |-- bland_chromatin: integer (nullable = true)
 |-- normal_nucleoli: integer (nullable = true)
 |-- mitoses: integer (nullable = true)
 |-- class: integer (nullable = true)



In [7]:
brca.count()

699

In [8]:
# compute distribution of target variable
brca.groupBy(target).count().show()

+-----+-----+
|class|count|
+-----+-----+
|    4|  241|
|    2|  458|
+-----+-----+



### Task 1:  Cross Validate a Logistic Regression Model
i) (**4 PTS**) This task has the following requirements:
- import necessary modules
- use these features as predictors: `clump_thickness`,`uniformity_cell_size`,`uniformity_cell_shape`,`marginal_adhesion`  
- `class` is the response variable. apply recoding as needed. hint: save the recoded values as a new variable.
- use 3 folds in the cross validator object
- use BinaryClassificationEvaluator
- logistic regression model with `maxIter`=10  
- tuning grid with `regParam` values of 0.1 and 0.01
- finally, print the average metrics based on each `regParam` value. the attribute `avgMetrics` in the cv model will hold these. 
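
The recoding requirement above can be pictured with a minimal plain-Python sketch. It assumes the target should be mapped to 1 for malignant (originally 4) and 0 for benign (originally 2). The helper name `recode_label` and the sample values are hypothetical and not part of the assignment outline. In Spark, the same mapping would be expressed as a column expression and saved as a new column.

```python
# Hypothetical helper: maps the raw class codes to 0/1 labels.
POSITIVE_LABEL = 4  # malignant
NEGATIVE_LABEL = 2  # benign


def recode_label(value, positive_label=POSITIVE_LABEL, negative_label=NEGATIVE_LABEL):
    """Return 1 for the positive class and 0 for the negative class."""
    if value == positive_label:
        return 1
    if value == negative_label:
        return 0
    # Fail loudly instead of silently mislabeling an unexpected code.
    raise ValueError(f"unexpected class value: {value!r}")


raw_classes = [2, 4, 4, 2]  # made-up values for illustration only
labels = [recode_label(v) for v in raw_classes]
print(labels)  # [0, 1, 1, 0]
```

Raising an error on unknown codes is a design choice in this sketch; it surfaces bad rows early rather than letting them flow into the model.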

### Task 2:  Balancing a DataFrame with Downsampling  
i) (**2 PTS**) Write a function to implement downsampling.  Enter code into the cell containing the `downsample` function.  

INPUTS  
* df               - Spark dataframe  
* target           - string, target variable  
* positive_label   - integer, value of positive label  
* negative_label   - integer, value of negative label  

OUTPUT  
balanced spark dataframe  

Downsampling = sample from larger class to match smaller class  

**Example:**  

INITIAL STATE  
Smaller class has 100 records  
Larger class has 400 records

ACTION  
Sample 100 records from larger class, without replacement  
Retain all records from smaller class

END STATE    
This produces a balanced dataset containing 100 records from each class
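
The arithmetic behind the example can be sketched as follows. This is only an illustration of how a sampling fraction might be derived from the class counts; the function name `downsampling_fraction` is hypothetical, and the sketch does not perform the sampling itself.

```python
def downsampling_fraction(class_counts):
    """Fraction of the larger class to keep so that it matches the smaller class.

    class_counts: dict mapping a class label to its record count.
    """
    smaller = min(class_counts.values())
    larger = max(class_counts.values())
    return smaller / larger


# Counts from the example above: 100 records vs. 400 records.
example_counts = {"smaller": 100, "larger": 400}
fraction = downsampling_fraction(example_counts)
print(fraction)  # 0.25
```

For the class counts shown earlier in this notebook (241 malignant, 458 benign), the same calculation gives 241 / 458, or roughly 0.53.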

ii) **(1 PT)** Print the target distribution from this balanced dataset, to show the label counts nearly match.

#### IMPORTANT NOTE:
Sampling won't produce the exact fraction you request. To sample efficiently, Spark uses Bernoulli sampling: each row is independently included with the requested probability. If you request a 10% sample, each row has a 10% chance of being included, but this does not guarantee an exact 10% sample (it should be close, however).
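
A small plain-Python simulation may help show why the sample size is only approximate. It is a sketch of Bernoulli sampling in general, not of Spark's internal implementation; the function `bernoulli_sample` is hypothetical, and Python's `random` module will not reproduce the rows Spark selects for the same seed.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(458))   # stand-in for the 458 records of the larger class
fraction = 241 / 458      # aim for roughly 241 records
sample = bernoulli_sample(rows, fraction, seed=314)

# The size is close to 241 but is not guaranteed to equal it.
print(len(sample))
```

Because every row is kept or dropped independently, the resulting count varies around the expected value instead of hitting it exactly.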

### Task 3:  Univariate AUC Measurement  

In this exercise, you will measure (in a particular sense) the individual predictive power of the following variables:  
* clump_thickness
* uniformity_cell_size
* uniformity_cell_shape
* marginal_adhesion
* single_epithelial_cell_size

**(3 PTS)** Define a function called `compute_univariate_aucs`  
The function needs to do the following:

* Split the dataset into training and testing sets (*60% / 40%*, respectively)  
* For each variable v:  
    * train a logistic regression classifier with an intercept, using variable *v* as the only predictor
    * classify each record in the test set  
    * measure the area under the ROC curve  
* Return a dataframe containing each variable, its model weight (coefficient), and its Univariate AUC, sorted by Univariate AUC in descending order.
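
To make "area under the ROC curve" concrete, here is a minimal plain-Python sketch that computes it directly from labels and predicted scores. It uses the pairwise interpretation of AUC: the probability that a randomly chosen positive record scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc_from_scores` and the toy data are hypothetical. In Spark, this value would normally come from an evaluator rather than a hand-written loop.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 values; scores: predicted scores, same order.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one record of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]            # made-up values for illustration only
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(toy_labels, toy_scores))  # 0.75
```

In the toy data, three of the four positive/negative pairs are ordered correctly, which gives 0.75. The quadratic loop is fine for illustration but would not scale to large datasets.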

INPUTS  
* df, Spark dataframe 
* target variable as string
* training_fraction  
* max_iterations  
* seed  

OUTPUTS  
dataframe containing three columns: variable name, weight, AUROC  

#### IMPORTANT NOTES:   
1) If you use the RDD API, LabeledPoint requires that positive label = 1, negative label = 0  
2) Do NOT use the downsampling function

Call `compute_univariate_aucs` and print the results from the returned dataframe.  Remember not to downsample.