## PSTAT 235 Extra Assignment 1: Tools for Supervised Learning

### University of California, Santa Barbara  
### PSTAT 135/235 - Big Data Analytics
### Prof Tashman
### Last Updated: Jan 31, 2019

---  

### OBJECTIVE: In this assignment, you will code functions to support supervised learning tasks

### MODULES

In [1]:
import os

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("data preprocessing") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'8g') \
    .getOrCreate()

sc = spark.sparkContext

### PARAMETERS

In [6]:
# read in the Breast Cancer Wisconsin data


In [4]:
# class = 2 for benign (negative class), 4 for malignant (positive class)
target = 'class'
positive_label = 4
negative_label = 2

SEED = 314

### READ IN DATA

Print the schema.

### Task 1:  Balancing a DataFrame with Downsampling  
i) Write a function to implement downsampling.  

INPUTS  
* df               - Spark dataframe  
* target           - string, target variable  
* positive_label   - integer, value of positive label  
* negative_label   - integer, value of negative label  

OUTPUT  
balanced spark dataframe  

Downsampling = sample from larger class to match smaller class  

ii) Print the target distribution from this balanced dataset, to show the label counts nearly match.
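
One possible shape for such a function is sketched below. It is an illustration rather than the expected solution: the names `sampling_fractions` and `downsample` are hypothetical, and the sketch assumes `df` is a Spark DataFrame, relying only on its `filter`, `count`, `sample`, and `union` methods.

```python
def sampling_fractions(n_positive, n_negative):
    """Return (positive_fraction, negative_fraction).

    The smaller class keeps fraction 1.0; the larger class gets the
    fraction that shrinks it to roughly the size of the smaller class.
    """
    if n_positive == 0 or n_negative == 0:
        raise ValueError("both classes must contain at least one row")
    if n_positive >= n_negative:
        return n_negative / n_positive, 1.0
    return 1.0, n_positive / n_negative


def downsample(df, target, positive_label, negative_label, seed=None):
    """Hypothetical helper: balance a Spark DataFrame by downsampling."""
    positives = df.filter(df[target] == positive_label)
    negatives = df.filter(df[target] == negative_label)

    pos_frac, neg_frac = sampling_fractions(positives.count(), negatives.count())

    # Only the larger class is sampled; the smaller class is kept whole.
    if pos_frac < 1.0:
        positives = positives.sample(withReplacement=False, fraction=pos_frac, seed=seed)
    if neg_frac < 1.0:
        negatives = negatives.sample(withReplacement=False, fraction=neg_frac, seed=seed)

    return positives.union(negatives)
```

With the parameters defined earlier, a call might look like `downsample(df, target, positive_label, negative_label, seed=SEED)`, followed by `groupBy(target).count().show()` on the result to inspect the label counts.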

#### IMPORTANT NOTE:
Sampling won't produce the exact fraction you request. To sample efficiently, Spark uses Bernoulli sampling: each row is independently included with the requested probability.
If you request a 10% sample, each row individually has a 10% chance of being included, but this does not guarantee an exact 10% sample
(it should be close, however).
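
The effect can be seen in a small pure-Python simulation. This is only a sketch of the idea, not Spark's implementation; `bernoulli_sample` is a hypothetical helper that includes each row independently with the given probability.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))

# Request a 10% sample several times with different seeds.
# The sizes cluster around 100 but generally differ from run to run.
sizes = [len(bernoulli_sample(rows, 0.10, seed=s)) for s in range(5)]
print(sizes)
```

This is why the balanced dataset in Task 1 should have label counts that nearly match rather than match exactly.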

In [4]:
# code the function here


In [5]:
# call the function here


In [3]:
# compute the target variable distribution here


### Task 2:  Univariate AUC Measurement  

In this exercise, you will measure (in a particular sense) the individual predictive power of the following variables:  
* clump_thickness
* uniformity_cell_size
* uniformity_cell_shape
* marginal_adhesion
* single_epithelial_cell_size

Write a function that does the following in this order:  
* Split the dataset into training and testing sets (*60% / 40%*, respectively)  
* For each variable v:  
    * train a logistic regression classifier with intercept, using variable *v* as the only predictor
    * classify each record in the test set  
    * measure the area under the ROC curve  
* Return a pandas dataframe containing each variable, its model weight (coefficient), and its Univariate AUC, sorted by Univariate AUC in descending order
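
To make the AUC step concrete, the area under the ROC curve can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it that way on a toy example; `auc_from_scores` is a hypothetical helper for illustration and is not a substitute for `BinaryClassificationMetrics` on real Spark data.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` must contain 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative record")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy scores: three of the four positive/negative pairs are ordered correctly.
print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

A value near 0.5 means the variable ranks records no better than chance, while a value near 1.0 means it separates the classes almost perfectly.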

INPUTS  
* df, Spark dataframe 
* target variable as string
* training_fraction  
* max_iterations  
* seed  

OUTPUTS  
pandas dataframe with one row per variable and three columns: variable name, model weight (coefficient), AUROC, sorted by AUROC in descending order  

#### IMPORTANT NOTE:   
LabeledPoint requires that positive label = 1, negative label = 0

In [7]:
# load modules
import pandas as pd
import pyspark.sql.functions as F
import pyspark.mllib.regression as reg
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics

In [8]:
# parameters
training_fraction = 0.6
ITERS = 10

In [9]:
# narrow the list of features for modeling

vars_to_keep = [
 'clump_thickness',
 'uniformity_cell_size',
 'uniformity_cell_shape',
 'marginal_adhesion',
 'single_epithelial_cell_size'
]

In [7]:
# map target labels to 0/1
brca_f = brca_f.withColumn(target, F.when(brca_f[target] == positive_label, 1).otherwise(0))

In [8]:
# code the function here


In [9]:
# call the function here, printing results
