### Lab: Tuning and Topic Modeling

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023


#### Jiaxing Qiu
#### JQ2UW
---  

**INSTRUCTIONS**  
In this assignment, you will do three things:
1) Tune a logistic regression model  
2) Label-balance a dataset  
3) Run the Topic Modeling notebook, making small tweaks and capturing results  

**TOTAL POINTS: 10**

In [36]:
import os

In [37]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("data preprocessing") \
    .config("spark.executor.memory", '8g') \
    .config('spark.executor.cores', '4') \
    .config('spark.cores.max', '4') \
    .config("spark.driver.memory",'8g') \
    .getOrCreate()

sc = spark.sparkContext

### PARAMETERS

In [8]:
# update to match your path
directory_path = './'
path_to_data = os.path.join(directory_path, 'breast_cancer_wisconsin.csv')

In [9]:
# class = 2 for benign (negative class), 4 for malignant (positive class)
target = 'class'
positive_label = 4
negative_label = 2

SEED = 314

### READ IN DATA

In [17]:
brca = spark.read.csv(path_to_data, header=True, inferSchema=True)

In [18]:
brca.printSchema()

root
 |-- id: integer (nullable = true)
 |-- clump_thickness: integer (nullable = true)
 |-- uniformity_cell_size: integer (nullable = true)
 |-- uniformity_cell_shape: integer (nullable = true)
 |-- marginal_adhesion: integer (nullable = true)
 |-- single_epithelial_cell_size: integer (nullable = true)
 |-- bare_nuclei: string (nullable = true)
 |-- bland_chromatin: integer (nullable = true)
 |-- normal_nucleoli: integer (nullable = true)
 |-- mitoses: integer (nullable = true)
 |-- class: integer (nullable = true)



In [19]:
brca.count()

699

In [20]:
# compute distribution of target variable
brca.groupBy(target).count().show()

+-----+-----+
|class|count|
+-----+-----+
|    4|  241|
|    2|  458|
+-----+-----+



### Task 1:  Cross Validate a Logistic Regression Model
i) (**4 PTS**) This task has the following requirements:
- Import the necessary modules.
- Use these features as predictors: `clump_thickness`,`uniformity_cell_size`,`uniformity_cell_shape`,`marginal_adhesion`  
- `class` is the response variable. Apply recoding as needed. Hint: save the recoded values as a new variable.
- Use 3 folds in the cross-validator object.
- Use `BinaryClassificationEvaluator`.
- Use a logistic regression model with `maxIter`=10.  
- Use a tuning grid with `regParam` values of 0.1 and 0.01.
- Finally, print the average metric for each `regParam` value. The `avgMetrics` attribute of the fitted cross-validator model holds these.

In [27]:
# Import necessary modules
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import when

# Read in data again 
brca = spark.read.csv(path_to_data, header=True, inferSchema=True)

# Define the features and response variable
selected_features = ['clump_thickness', 'uniformity_cell_size', 'uniformity_cell_shape', 'marginal_adhesion']

# Recode the target variable into binary labels
target = 'class'
positive_label = 4
negative_label = 2
brca = brca.withColumn('label', when(brca[target] == positive_label, 1).otherwise(0))

# Create a feature vector using the selected features
feature_assembler = VectorAssembler(inputCols=selected_features, outputCol='features')
brca = feature_assembler.transform(brca)

# Create a Logistic Regression model
lr = LogisticRegression(maxIter=10)

# Create a parameter grid for hyperparameter tuning
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.1, 0.01])
              .build())

# Create a BinaryClassificationEvaluator (area under the ROC curve is the default metric; set explicitly here)
evaluator = BinaryClassificationEvaluator(labelCol='label', metricName='areaUnderROC')

# Create a CrossValidator object with 3 folds
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=param_grid,
                          evaluator=evaluator,
                          numFolds=3)

# Fit the model and perform cross-validation
cv_model = crossval.fit(brca)

# Print the average metrics based on each regParam value
for idx, metric in enumerate(cv_model.avgMetrics):
    print(f'For regParam = {param_grid[idx][lr.regParam]}, AUC-ROC = {metric:.4f}')

For regParam = 0.1, AUC-ROC = 0.9911
For regParam = 0.01, AUC-ROC = 0.9905
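
A minimal sketch of how the preferred setting could be read off these results. The lists `reg_params` and `avg_metrics` are hand-copied stand-ins for the parameter grid and `cv_model.avgMetrics` (they are not pulled from a live Spark session), and AUC-ROC is treated as a larger-is-better metric.

```python
# Values copied by hand from the cross-validation output above (assumption:
# they stand in for the param grid and cv_model.avgMetrics).
reg_params = [0.1, 0.01]
avg_metrics = [0.9911, 0.9905]

# AUC-ROC is larger-is-better, so take the index of the maximum metric.
best_idx = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
best_reg_param = reg_params[best_idx]

# Difference between the best and worst setting in the grid.
gap = avg_metrics[best_idx] - min(avg_metrics)

print(f"best regParam = {best_reg_param}, AUC-ROC = {avg_metrics[best_idx]:.4f}")
print(f"gap to the other setting = {gap:.4f}")
```

The gap between the two settings is only 0.0006, so with 3 folds the two `regParam` values perform essentially the same on this feature set.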


### Task 2:  Balancing a DataFrame with Downsampling  
i) (**2 PTS**) Write a function to implement downsampling.  Enter code into the cell containing the `downsample` function.  

INPUTS  
* df               - Spark dataframe  
* target           - string, target variable  
* positive_label   - integer, value of positive label  
* negative_label   - integer, value of negative label  

OUTPUT  
balanced spark dataframe  

Downsampling = sample from larger class to match smaller class  

**Example:**  

INITIAL STATE  
Smaller class has 100 records  
Larger class size has 400 records

ACTION  
Sample 100 records from larger class, without replacement  
Retain all records from smaller class

END STATE    
This produces a balanced dataset containing 100 records from each class
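
The example above can be sketched in plain Python, outside Spark, to make the target behavior concrete. This is only an illustration under stated assumptions: `downsample_exact` is a hypothetical helper that works on a list of dicts rather than a Spark dataframe, and it draws an exact number of records, which Spark's fraction-based sampling does not guarantee.

```python
import random

def downsample_exact(records, target, positive_label, negative_label, seed=None):
    """Return a balanced list of records by sampling the larger class down.

    Hypothetical plain-Python helper for illustration; not a Spark API.
    """
    positives = [r for r in records if r[target] == positive_label]
    negatives = [r for r in records if r[target] == negative_label]

    # Order the two groups by size: smaller first, larger second.
    smaller, larger = sorted([positives, negatives], key=len)

    # random.sample draws without replacement, so no record is repeated.
    rng = random.Random(seed)
    return smaller + rng.sample(larger, len(smaller))

# INITIAL STATE: 100 records in the smaller class, 400 in the larger class.
records = [{"class": 4}] * 100 + [{"class": 2}] * 400

balanced = downsample_exact(records, "class", 4, 2, seed=314)
counts = {label: sum(r["class"] == label for r in balanced) for label in (4, 2)}
print(counts)  # {4: 100, 2: 100}
```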

In [33]:
def downsample(df, target, positive_label, negative_label, seed=None):
    """Balance df by sampling the larger class down to roughly the size of the smaller one.

    seed is an optional addition for reproducible sampling; the default (None)
    leaves the sampling unseeded.
    """
    # Count the records in each class
    num_positive = df.filter(df[target] == positive_label).count()
    num_negative = df.filter(df[target] == negative_label).count()

    # Determine the smaller and larger classes
    if num_positive < num_negative:
        smaller_class_label, larger_class_label = positive_label, negative_label
    else:
        smaller_class_label, larger_class_label = negative_label, positive_label

    # Fraction of the larger class to keep so the two classes roughly match
    fraction = min(num_positive, num_negative) / max(num_positive, num_negative)

    # Keep every record of the smaller class; sample the larger class without replacement
    smaller_df = df.filter(df[target] == smaller_class_label)
    larger_sample = df.filter(df[target] == larger_class_label).sample(
        withReplacement=False, fraction=fraction, seed=seed
    )

    return smaller_df.union(larger_sample)



ii) **(1 PT)** Print the target distribution from this balanced dataset, to show the label counts nearly match.

#### IMPORTANT NOTE:
Sampling won't produce the exact fraction you request. To sample efficiently, Spark uses Bernoulli sampling: each row is included independently with the requested probability.
If you request a 10% sample, each row individually has a 10% chance of being included, but this does not guarantee an exact 10% sample  
(it should be close, however).
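
A small simulation of this effect, written in plain Python rather than Spark. It assumes the class counts from the target distribution shown earlier (458 benign, 241 malignant) and uses Python's `random` module as a stand-in for Spark's sampler, so the simulated counts will not match any particular Spark run.

```python
import math
import random

n_larger = 458            # benign rows (class = 2) in the full dataset
fraction = 241 / 458      # smaller class size / larger class size

def bernoulli_sample_size(n_rows, p, rng):
    """Count rows kept when each row is kept independently with probability p."""
    return sum(rng.random() < p for _ in range(n_rows))

# Repeat the sampling a few times to see how the sample size varies.
rng = random.Random(314)
sizes = [bernoulli_sample_size(n_larger, fraction, rng) for _ in range(5)]

# The sample size follows a binomial distribution with these moments.
expected = n_larger * fraction
std_dev = math.sqrt(n_larger * fraction * (1 - fraction))

print("simulated sample sizes:", sizes)
print(f"expected size = {expected:.0f}, standard deviation = {std_dev:.1f}")
```

The expected size is 241 rows with a standard deviation of roughly 10.7 rows, so a sampled count that lands ten or twenty rows away from 241 is plausible.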

In [35]:
# Read in data again 
brca = spark.read.csv(path_to_data, header=True, inferSchema=True)

balanced_brca = downsample(brca, 'class', 4, 2)

# compute distribution of target variable
target = 'class'
balanced_brca.groupBy(target).count().show()

+-----+-----+
|class|count|
+-----+-----+
|    4|  241|
|    2|  264|
+-----+-----+



### Task 3:  Topic Modeling

In this exercise, you will run the `topic_modeling.ipynb` notebook and answer the questions below.

i) **(1 PT)** For the first headline in the dataset, the code processes it and extracts tokens. Provide a list of the tokens.

+------------+--------------------------------------------------+  
|publish_date|headline_text                                     |  
+------------+--------------------------------------------------+  
|20030219    |aba decides against community broadcasting licence|  
+------------+--------------------------------------------------+  

In [38]:
# answer is below

[aba, decid, commun, broadcast, licenc]

ii) **(1 PT)** The code created a count vectorizer and extracted features. 

`cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=500, minDF=3.0)`

The first document had six tokens and the feature vector looked like this:

(500, [118, 498], [1.0, 1.0])   

Explain why there are only two non-zero elements in this feature vector.

In [39]:
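# answer is below

- Index 118 corresponds to one vocabulary term, and it has a count of 1.0.
- Index 498 corresponds to another vocabulary term, and it also has a count of 1.0.
- The vector is sparse, so it lists only the vocabulary indices with non-zero counts. Only two of the document's tokens are in the fitted vocabulary, which is capped at the 500 most frequent terms (`vocabSize=500`) among those appearing in at least 3 documents (`minDF=3.0`). The remaining tokens are out of vocabulary and contribute nothing to the vector.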
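
A toy sketch of how a sparse count vector like this comes about, in plain Python rather than Spark. The function `to_sparse_counts` and the 4-term `vocabulary` are hypothetical: the lab does not show which of the headline's tokens map to indices 118 and 498, so treating `commun` and `broadcast` as the in-vocabulary tokens is purely illustrative.

```python
from collections import Counter

def to_sparse_counts(tokens, vocabulary):
    """Map tokens to (size, indices, values), dropping out-of-vocabulary tokens."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return (len(vocabulary), indices, values)

# Hypothetical 4-term vocabulary (assumption); the real one holds up to 500
# terms and its token-to-index mapping is not shown in this lab.
vocabulary = {"polic": 0, "plan": 1, "commun": 2, "broadcast": 3}

# Tokens extracted from the first headline.
tokens = ["aba", "decid", "commun", "broadcast", "licenc"]

print(to_sparse_counts(tokens, vocabulary))  # (4, [2, 3], [1.0, 1.0])
```

Tokens missing from the vocabulary (`aba`, `decid`, `licenc` in this toy setup) simply leave no entry in the vector.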

iii) **(1 PT)** Change the number of topics to 2, rerun LDA, and visualize the topics by showing the topic words.

In [None]:
# answer is below

topic: 0
*************************
u
iraq
polic
council
sai
claim
warn
water
plan
world
*************************
topic: 1
*************************
war
man
u
new
win
iraqi
anti
call
govt
iraq
*************************
