Why cluster on predictions_df and not candidate_cases?

Just wanted to check that the [clustering logic](https://github.com/Azure/Strata2018/blob/f2b0e3a977e1f239208dfa45739153d56625d825/text_classification/1_wiki_detox_active_learning_workshop.Rmd#L89-L104) here is correct:
```
    predictions_df <- rxPredict(model, candidate_cases, extraVarsToWrite=c("rev_id", "flagged"))
    predictions_df$entropy <- entropy(predictions_df$Probability)

    predictions_df$cluster_id <- predictions_df %>%
      dist(method="euclidean") %>%
      hclust(method="ward.D2") %>%
      cutree(k=N)

    selected <- predictions_df %>%
      group_by(cluster_id) %>%
      arrange(-entropy) %>%
      slice(which.max(entropy)) %>%
      as.data.frame
```
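I was under the impression that the clusters should be based only on the comment's features, i.e. v2...v51 in `candidate_cases`. As written, `dist()` is computed over every column of `predictions_df`, which includes `rev_id`, `flagged`, and the prediction outputs rather than the feature columns.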
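To make the question concrete, here is a rough sketch of what I expected, using made-up data and base R only. The synthetic `candidate_cases`, the five stand-in feature columns, the random `Probability` values, and the `binary_entropy` helper are all assumptions for illustration; they are not the workshop's data, its `rxPredict` output, or its `entropy` function.

```r
set.seed(1)
N <- 4  # number of clusters, i.e. cases to select

# Hypothetical stand-in for candidate_cases: 20 comments, 5 feature columns
# (the real data would have the v2...v51 features instead).
features <- as.data.frame(matrix(rnorm(20 * 5), nrow = 20))
candidate_cases <- cbind(rev_id = 1:20, features)

# Hypothetical stand-in for the rxPredict() output; probabilities are random.
predictions_df <- data.frame(rev_id = candidate_cases$rev_id,
                             Probability = runif(20))

# Assumed binary entropy; the workshop's own entropy() may differ.
binary_entropy <- function(p) {
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  # avoid log(0)
  -(p * log2(p) + (1 - p) * log2(1 - p))
}
predictions_df$entropy <- binary_entropy(predictions_df$Probability)

# Cluster on the feature columns only, not on rev_id or the predictions.
# This assumes predictions_df rows are in the same order as candidate_cases.
feature_cols <- setdiff(names(candidate_cases), "rev_id")
feature_dist <- dist(candidate_cases[, feature_cols], method = "euclidean")
predictions_df$cluster_id <- cutree(hclust(feature_dist, method = "ward.D2"), k = N)

# Pick the highest-entropy case within each cluster.
selected <- do.call(rbind, lapply(
  split(predictions_df, predictions_df$cluster_id),
  function(g) g[which.max(g$entropy), ]
))
```

With this variant the clusters reflect similarity between comments, and entropy is only used afterwards to pick one case per cluster.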

