In [1]:
%AddDeps org.vegas-viz vegas_2.11 0.3.9 --transitive
%AddDeps org.vegas-viz vegas-spark_2.11 0.3.9 --transitive

Marking org.vegas-viz:vegas_2.11:0.3.9 for download
Obtained 42 files
Marking org.vegas-viz:vegas-spark_2.11:0.3.9 for download
Obtained 44 files


In [3]:
import vegas._
import vegas.DSL.OptArg
import java.sql.Date
import vegas.spec.Spec.MarkEnums.Rule
import vegas.sparkExt._
import java.io._
import org.apache.spark.sql.functions._

**Point pathToExampleScoreCard at the ExampleScoreCard location inside the ScoreCard directory, which batch scoring should have generated:**

In [5]:
val pathToExampleScoreCard = "Z:/ExampleScoreCard"
val scorecard = spark.read.parquet(pathToExampleScoreCard)

scorecard = [subjectId: string, scorecardScore: double ... 1 more field]


pathToExampleScoreCard: String = Z:/ExampleScoreCard


[subjectId: string, scorecardScore: double ... 1 more field]

## Score Distribution Histogram
**Histograms show the distribution of scores across your scorecard and are particularly useful when choosing an alert threshold. They also reveal whether too many scores are concentrated in a specific window.**

In [12]:
Vegas("Histogram")
    .withDataFrame(scorecard)
    .mark(Bar)
    .encodeX("scorecardScore",
             dataType = Quant, // For simpler binning, set this to Ordinal, set enableBin to false, and remove the bin argument.
             enableBin = true,
             bin = Bin(step = 0.2), // Tune this to change the width of the histogram bins.
             title = "Overall Score")
    .encodeY("scorecardScore", aggregate = AggOps.Count, title="# of Customers")
    .show
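
To make the effect of the bin step concrete, here is a minimal plain-Scala sketch of fixed-width binning. It is an illustration only, not how Vegas computes bins internally: the sample scores and the `binIndex` helper are hypothetical, and bin `i` is assumed to cover the half-open range from `i * step` to `(i + 1) * step`.

```scala
// Hypothetical sample scores; in the notebook these would come from the scorecardScore column.
val sampleScores = List(0.05, 0.18, 0.25, 0.39, 0.41, 0.95)
val step = 0.2 // same width as Bin(step = 0.2) in the chart above

// Index of the bin a score falls into, assuming bin i covers [i * step, (i + 1) * step).
def binIndex(score: Double, step: Double): Int = math.floor(score / step).toInt

// Count how many scores land in each bin, ordered by bin index.
val binCounts: List[(Int, Int)] = sampleScores
  .groupBy(binIndex(_, step))
  .map { case (index, scores) => (index, scores.size) }
  .toList
  .sortBy(_._1)

// Print each non-empty bin with its range and count.
binCounts.foreach { case (index, count) =>
  println(f"[${index * step}%.1f, ${(index + 1) * step}%.1f): $count")
}
```

A smaller step splits the same scores across more, narrower bins, which helps when checking whether scores cluster just above or below a candidate alert threshold.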

## Stacked Bar Chart
**Stacked bar charts can be useful for determining which scores contribute most to the final score total. The data must first be reorganized so that each score's contribution is easy to access.**

In [13]:
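// Explode the map into one row per (subject, score) pair; explode on a map column yields "key" and "value" columns.
val subjectIdAndWeightedScore = scorecard.select(col("subjectId"),
                                                 col("scorecardScore"),
                                                 explode($"weightedScoreOutputMap"))

// Lift contribution and severity out of "value" so they become top-level columns.
val scoresAndContributions = subjectIdAndWeightedScore.select(col("subjectId"),
                                                              col("scorecardScore"),
                                                              col("key"),
                                                              col("value.contribution"),
                                                              col("value.severity"))

// distinct runs on (key, contribution) pairs, so a score name is repeated once per distinct contribution value.
val listOfScores = scoresAndContributions.select($"key", $"contribution")
    .orderBy(desc("contribution"))
    .distinct
    .rdd.collect()
    .map(_.get(0).toString)
    .toList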


subjectIdAndWeightedScore: org.apache.spark.sql.DataFrame = [subjectId: string, scorecardScore: double ... 2 more fields]
scoresAndContributions: org.apache.spark.sql.DataFrame = [subjectId: string, scorecardScore: double ... 3 more fields]


List(CancelledCustomer, CustomerAddressInLowValuePostcode, NewCustomer, CustomerFromHighRiskCountry, CR100_CustomerRollupHighNumberOfRoundAmount, HighRiskCustomer, HighRiskCustomer, HighRiskCustomer, HighRiskCustomer, CustomerFromHighRiskCountry, CR100_CustomerRollupHighNumberOfRoundAmount, NewCustomer, HighRiskCustomer, CancelledCustomer, CustomerAddressInLowValuePostcode)
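
The list above repeats score names because the distinct step is applied to (key, contribution) pairs rather than to the key alone. If a list of unique names is wanted, one option is to deduplicate after collecting. The sketch below uses a shortened, hand-copied sample of the output above; `namesWithRepeats` and `uniqueNames` are illustrative names, not part of the notebook.

```scala
// Shortened copy of the list output above, duplicates included.
val namesWithRepeats = List("CancelledCustomer", "NewCustomer", "HighRiskCustomer",
                            "HighRiskCustomer", "NewCustomer", "CancelledCustomer")

// List.distinct keeps the first occurrence of each element and preserves order.
val uniqueNames = namesWithRepeats.distinct

println(uniqueNames) // List(CancelledCustomer, NewCustomer, HighRiskCustomer)
```

Applying the same call to the full list would leave each of its six score names exactly once.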

**We can use a stacked bar chart to take a closer look at the highest-scoring subjects. The chart below shows which scores the 10 highest-scoring customers triggered.**

In [14]:
val topIds = scoresAndContributions.select(col("subjectId"), col("scorecardScore")).distinct
    .orderBy(desc("scorecardScore"))
    .drop("scorecardScore")
    .limit(10) // Top X scoring subjects
val topSubjects = topIds.join(scoresAndContributions, Seq("subjectId"))
    .orderBy(desc("scorecardScore"), $"subjectId", desc("contribution"))

topSubjects = [subjectId: string, scorecardScore: double ... 3 more fields]


topIds: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [subjectId: string]


[subjectId: string, scorecardScore: double ... 3 more fields]

In [15]:
Vegas("StackedBar")
    .withDataFrame(topSubjects)
    .mark(Bar)
    .encodeY("contribution", Quant, AggOps.Sum, title = "Scorecard Score")
    .encodeX("subjectId", 
             Nominal, 
             hideAxis = false, //hideAxis hides subjectIds
             title = "Subject Ids",
             sortField = Sort("contribution", AggOps.Sum, Some(SortOrder.Desc))
             )
    .encodeColor("key",
                 Nominal, 
                 legend = Legend(orient = "left", title ="Score Name"),
                 sortField = Sort("contribution", AggOps.Values, Some(SortOrder.Asc))
                )
    .show



**We can see from the above example that the score _CustomerAddressInLowValuePostcode_ is rarely triggered.**

## Inspecting Specific Scores
**We can also look at specific scores in isolation, which can be useful when choosing weights.**

In [16]:
val distinctScores = scoresAndContributions.select("key").distinct
distinctScores.show(false)

+-------------------------------------------+
|key                                        |
+-------------------------------------------+
|CR100_CustomerRollupHighNumberOfRoundAmount|
|HighRiskCustomer                           |
|NewCustomer                                |
|CustomerAddressInLowValuePostcode          |
|CancelledCustomer                          |
|CustomerFromHighRiskCountry                |
+-------------------------------------------+



distinctScores = [key: string]


[key: string]

### Continuous View: HighRiskCustomer 
The graph below shows the distribution of the HighRiskCustomer contribution, indicating how much this score contributes to the scorecard.

In [18]:
Vegas("Histogram")
    .withDataFrame(scoresAndContributions.filter($"key" === "HighRiskCustomer"))
    .mark(Bar)
    .encodeX("contribution", 
             Ordinal,  
             title="HighRiskCustomer Contribution", 
             axis = Axis(ticks = OptArg(0.5))
            )
    .encodeY("contribution", aggregate = AggOps.Count)
    .show

### Bucket View: HighRiskCustomer

In [19]:
Vegas("Histogram")
    .withDataFrame(scoresAndContributions.filter($"key" === "HighRiskCustomer"))
    .mark(Bar)
    .encodeX("contribution", 
             Quant, 
             enableBin = true,
             bin = Bin(step = 0.1), // Adjust to change the bucket width.
             title="HighRiskCustomer Contribution")
    .encodeY("contribution", aggregate = AggOps.Count)
    .configCell(width=500, height=400)
    .show

**For scores which have a binary contribution, it may be useful to see what percentage of individuals the score was triggered for.**

In [20]:
val totalSubjects = scoresAndContributions.select("subjectId").distinct.count
def getPercentage(scoreName: String): Double = {
    scoresAndContributions.filter($"key" === scoreName).select("subjectId").distinct.count.toFloat / totalSubjects
}



totalSubjects: Long = 1439
getPercentage: (scoreName: String)Double
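
The same proportion can be sketched on plain Scala collections, which makes the arithmetic easy to check by hand. The `triggered` pairs, `sampleTotal`, and `proportionTriggered` below are hypothetical stand-ins for the DataFrame-based code above. This version divides as Double rather than Float, so the result is not limited to single precision.

```scala
// Hypothetical (subjectId, scoreName) pairs standing in for scoresAndContributions.
val triggered = List(
  ("s1", "NewCustomer"), ("s1", "HighRiskCustomer"),
  ("s2", "NewCustomer"), ("s3", "HighRiskCustomer"),
  ("s3", "HighRiskCustomer"), // same subject and score twice: counted once
  ("s4", "CancelledCustomer")
)

// Number of distinct subjects in the sample (4 here).
val sampleTotal = triggered.map(_._1).distinct.size

// Share of distinct subjects that triggered the given score.
def proportionTriggered(scoreName: String): Double =
  triggered.filter(_._2 == scoreName).map(_._1).distinct.size.toDouble / sampleTotal

println(proportionTriggered("NewCustomer"))       // 0.5
println(proportionTriggered("CancelledCustomer")) // 0.25
```

Counting distinct subject IDs, rather than rows, keeps a subject from being counted twice when the same score appears more than once for them.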


In [21]:
val scoresAndPercentages = listOfScores.map( score => (score, getPercentage(score)))




List((CancelledCustomer,0.5211952924728394), (CustomerAddressInLowValuePostcode,0.01737317629158497), (NewCustomer,0.2953439950942993), (CustomerFromHighRiskCountry,0.08547602593898773), (CR100_CustomerRollupHighNumberOfRoundAmount,0.07713690400123596), (HighRiskCustomer,0.33425989747047424), (HighRiskCustomer,0.33425989747047424), (HighRiskCustomer,0.33425989747047424), (HighRiskCustomer,0.33425989747047424), (CustomerFromHighRiskCountry,0.08547602593898773), (CR100_CustomerRollupHighNumberOfRoundAmount,0.07713690400123596), (NewCustomer,0.2953439950942993), (HighRiskCustomer,0.33425989747047424), (CancelledCustomer,0.5211952924728394), (CustomerAddressInLowValuePostcode,0.01737317629158497))

**The graph below shows which scores trigger most frequently for customers in the scorecard. We could further filter this to alerted customers, which would highlight scores that trigger unexpectedly often or unexpectedly rarely.**

In [23]:
Vegas("StackedBar")
    .withXY(scoresAndPercentages)
    .mark(Bar)
    .encodeY("y",
             Quant,
             axis = Axis(title = "Proportion of Customers who Trigger Score"),
             scale = Scale(domainValues = List(0, 1.0)))
    .encodeX("x", Nominal, title="Score Name")
    .encodeColor("x",Nominal, legend = Legend(orient = "left", title ="Score Name"))
    .configCell(height = 400)
    .show
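
The filter for alerted customers mentioned above could look roughly like the following plain-Scala sketch. Everything in it is an assumption made for illustration: the `Triggered` case class, the sample rows, the `alertThreshold` value, and the rule that a subject is alerted when its scorecardScore is at or above the threshold.

```scala
// Hypothetical row shape: one entry per (subject, triggered score).
case class Triggered(subjectId: String, scorecardScore: Double, scoreName: String)

val sampleRows = List(
  Triggered("s1", 1.5, "HighRiskCustomer"),
  Triggered("s1", 1.5, "NewCustomer"),
  Triggered("s2", 1.2, "HighRiskCustomer"),
  Triggered("s3", 0.4, "NewCustomer"),
  Triggered("s4", 0.3, "CancelledCustomer")
)

val alertThreshold = 1.0 // assumed value; choose it from the score distribution histogram

// Keep only rows belonging to subjects at or above the threshold.
val alertedRows = sampleRows.filter(_.scorecardScore >= alertThreshold)
val alertedTotal = alertedRows.map(_.subjectId).distinct.size

// Proportion of alerted subjects that triggered each score.
val alertedProportions: Map[String, Double] = alertedRows
  .groupBy(_.scoreName)
  .map { case (name, rows) => name -> rows.map(_.subjectId).distinct.size.toDouble / alertedTotal }

println(alertedProportions)
```

In the notebook itself, the equivalent first step would presumably be a filter on the scorecardScore column of scoresAndContributions before computing the percentages, with the threshold chosen to match the alerting configuration in use.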