# Explanations: Identifying Outliers & Biased Features

By identifying the features that most heavily influence the data, we can decide which features to examine when explaining unexpected query answers. We use the Awards dataset referenced in our paper to explain the following query, which retrieves the 10 institutions that received the most award money from the CISE directorate (Computer and Information Science and Engineering) in 2017:
```sql
SELECT B.instName, sum(A.amount) AS totalAward
FROM Award AS A 
INNER JOIN Institution AS B ON A.aid = B.aid
WHERE A.dir = 'CISE' and A.year = 2017
GROUP BY B.instName
ORDER BY totalAward DESC
LIMIT 10
```
Dataset Link: https://www.nsf.gov/awardsearch/download.jsp
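
To make the semantics of this query concrete, here is a minimal sketch that runs the same SQL against a few made-up rows using Python's built-in `sqlite3` module. The table contents and institution names are hypothetical, and SQLite's dialect is only assumed to behave like MLDB's for this simple query.

```python
import sqlite3

# Hypothetical toy rows; the real Award and Institution tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Award (aid INTEGER, amount INTEGER, dir TEXT, year INTEGER);
CREATE TABLE Institution (aid INTEGER, instName TEXT);
INSERT INTO Award VALUES
    (1, 500, 'CISE', 2017),
    (2, 300, 'CISE', 2017),
    (3, 900, 'MPS',  2017),
    (4, 400, 'CISE', 2016),
    (5, 250, 'CISE', 2017);
INSERT INTO Institution VALUES
    (1, 'Univ A'), (2, 'Univ B'), (3, 'Univ A'), (4, 'Univ B'), (5, 'Univ B');
""")

# Same query as above: filter to CISE awards from 2017, total them per
# institution, and keep the ten largest totals.
top = conn.execute("""
SELECT B.instName, sum(A.amount) AS totalAward
FROM Award AS A
INNER JOIN Institution AS B ON A.aid = B.aid
WHERE A.dir = 'CISE' AND A.year = 2017
GROUP BY B.instName
ORDER BY totalAward DESC
LIMIT 10
""").fetchall()

print(top)
```

Only awards 1, 2 and 5 pass the filter, so "Univ B" (300 + 250) ranks ahead of "Univ A" (500).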

### Import & Connect to MLDB
`Note: To execute the code make sure you have verified your account using the email sent by MLDB when you create an account.`

First we need to import MLDB and establish a connection:

In [1]:
import pymldb
mldb = pymldb.Connection()

### Importing the Data
The datasets are available from our GitHub repository. We parsed the XML data provided by NSF to generate the tables.

In [2]:
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'https://raw.githubusercontent.com/Mdevlin4/CMSC724/master/Award.csv',
        'outputDataset': 'Award',
        'delimiter': ','
        }
    })
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'https://raw.githubusercontent.com/Mdevlin4/CMSC724/master/Institution.csv',
        'outputDataset': 'Institution',
        'delimiter': ','
        }
    })
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'https://raw.githubusercontent.com/Mdevlin4/CMSC724/master/Investigator.csv',
        'outputDataset': 'Investigator',
        'delimiter': ','
        }
    })

<Response [201]>
<Response [201]>
<Response [201]>


### Executing the Example Query 

In [12]:
mldb.query("""
SELECT B.instName, sum(A.amount) AS totalAward
FROM Award AS A 
INNER JOIN Institution AS B ON A.aid = B.aid
WHERE A.dir = 'CISE' and A.year = 2017
GROUP BY B.instName
ORDER BY totalAward DESC
LIMIT 10
""")

Unnamed: 0_level_0,B.instName,totalAward
_rowName,Unnamed: 1_level_1,Unnamed: 2_level_1
"""[""""Clemson University""""]""",Clemson University,4425039
"""[""""University of North Carolina at Chapel Hill""""]""",University of North Carolina at Chapel Hill,3619587
"""[""""SUNY at Buffalo""""]""",SUNY at Buffalo,3162942
"""[""""University of Wisconsin-Madison""""]""",University of Wisconsin-Madison,3102391
"""[""""University of Colorado at Boulder""""]""",University of Colorado at Boulder,3024814
"""[""""Cornell University""""]""",Cornell University,2861738
"""[""""University of Illinois at Urbana-Champaign""""]""",University of Illinois at Urbana-Champaign,2838857
"""[""""Arizona State University""""]""",Arizona State University,2744283
"""[""""Carnegie-Mellon University""""]""",Carnegie-Mellon University,2661297
"""[""""US Ignite, Inc.""""]""","US Ignite, Inc.",2655164


### Labeling Elements based on Attributes
In our example query, we are retrieving the 10 Institutions with the most Total Award Money. So to generate training data for our machine learning model, we will label the data based on its "amount" attribute. The following query assigns a label of [1] to Awards with an amount greater than the average Award amount, and [0] otherwise. This will allow us to train our model based on the Awards that have the most impact on the total amount.

In [17]:
mldb.query("""
SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg 
FROM Award 
INNER JOIN (
    SELECT avg(amount) AS amtavg
    FROM Award
) LIMIT 10
""")

Unnamed: 0_level_0,aboveAvg,aid,amount,dir,div,enddate,startdate,title,year
_rowName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
[2]-[[]],0,1600011,122453,MPS,MS,0,0,"Non-Archimedean Techniques in Analysis, Dynami...",2016
[3]-[[]],1,1600012,1996139,GEO,OS,0,0,Coastal SEES: Enhancing sustainability in coas...,2016
[4]-[[]],0,1600014,10500,MPS,MS,0,0,Conference: Evolution Equations on Singular Sp...,2016
[5]-[[]],0,1600016,83117,ENG,CBETS,0,0,Rapid proposal: Fires and floods: Acquisition ...,2015
[6]-[[]],0,1600017,50000,ENG,IIP,0,0,I-Corps: A Tissue-engineered Nipple-Areolar Co...,2015
[7]-[[]],0,1600018,185436,GEO,AGS,0,0,Collaborative Research: P2C2--Ultra-High-Resol...,2016
[8]-[[]],0,1600023,180000,MPS,MS,0,0,Linear Partial Differential Equations on Singu...,2016
[9]-[[]],0,1600024,130476,MPS,MS,0,0,The Regularity of Cauchy-Riemann Mappings and ...,2016
[10]-[[]],0,1600028,73000,MPS,MS,0,0,Long Term Regularity of Solutions of Fluid Models,2016
[11]-[[]],0,1600032,158004,MPS,MS,0,0,New Methods in Tensor Triangular Geometry,2016
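
The labeling rule can also be sketched in plain Python. The amounts below are copied from the ten rows displayed above; note that this sketch averages only those ten rows, whereas the query averages the whole `Award` table, so the threshold here is an illustration rather than the real one.

```python
# (aid, amount) pairs copied from the ten rows displayed above.
awards = [
    (1600011, 122453), (1600012, 1996139), (1600014, 10500),
    (1600016, 83117), (1600017, 50000), (1600018, 185436),
    (1600023, 180000), (1600024, 130476), (1600028, 73000),
    (1600032, 158004),
]

# Mean over these ten rows only; the notebook's query averages every award.
mean_amount = sum(amount for _, amount in awards) / float(len(awards))

# Label 1 when the award exceeds the mean, 0 otherwise.
labels = [(aid, int(amount > mean_amount)) for aid, amount in awards]

print(mean_amount)
print(labels)
```

For these ten rows, only award 1600012 ends up above the mean.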


### Training a Model using Award Amount
We divide our dataset into two sets: one for training our model and one for testing it. We use `rowHash() % 4` to assign roughly 75% of the rows to the training set, keeping the remaining 25% for testing.
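
The queries in this notebook select training rows with `rowHash() % 4 != 0` and test rows with `rowHash() % 4 = 0`. The idea behind such a hash-based split can be sketched as follows. We use MD5 here purely as a stand-in, since we make no assumption about the hash MLDB uses internally, and the row names are hypothetical.

```python
import hashlib

def bucket(row_name, buckets=4):
    """Map a row name to a stable bucket in [0, buckets)."""
    digest = hashlib.md5(row_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# Hypothetical row names standing in for the rows of the Award dataset.
row_names = ["row-%d" % i for i in range(10000)]

# Bucket 0 is held out for testing; buckets 1-3 are used for training.
train_rows = [r for r in row_names if bucket(r) != 0]
test_rows = [r for r in row_names if bucket(r) == 0]

print(len(train_rows))
print(len(test_rows))
```

Because the bucket depends only on the row name, every query that applies the same predicate sees the same split, which is what lets the training, scoring and testing cells agree on which rows are held out.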

In [14]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            SELECT {* EXCLUDING (amount, aboveAvg)} AS features,
                   aboveAvg AS label FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            ) WHERE rowHash() % 4 != 0
            """,
        'modelFileUrl': 'file://award_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
        }
    })

<Response [201]>


The above code creates a classifier function named "score", which we later use to determine which attributes are most influential in the example query. Applied to a row, it returns a score: the higher the score, the more likely the model considers that award to be above average. We can run it on our test set (note the `rowHash() % 4 != 0` used for training vs. `rowHash() % 4 = 0` used here), as shown below:

In [15]:
mldb.query("""
SELECT score({features: {* EXCLUDING (amount, aboveAvg)}}) AS *
FROM (
    SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg 
    FROM Award INNER JOIN (
        SELECT avg(amount) AS amtavg
        FROM Award
    )
)
WHERE rowHash() % 4 = 0
LIMIT 10
""")

Unnamed: 0_level_0,score
_rowName,Unnamed: 1_level_1
[2]-[[]],-2.041656
[5]-[[]],-0.768587
[7]-[[]],-0.429806
[8]-[[]],-2.041656
[11]-[[]],-2.041656
[16]-[[]],-0.07685
[18]-[[]],-0.435151
[29]-[[]],-0.277778
[31]-[[]],-2.041656
[34]-[[]],-0.169184


We can measure how well our classifier performs by running the `classifier.test` procedure on the held-out test rows:

In [16]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (amount,aboveAvg)}})[score], label: aboveAvg
            FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            ) 
            WHERE rowHash() % 4 = 0
            """,
        'outputDataset': 'award_test',
        'mode': 'boolean'
        }
    })
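
The test procedure reports ranking statistics such as AUC. As a rough illustration of what AUC measures, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The scores and labels are made up, and this pairwise formulation is not a claim about how MLDB computes the statistic internally.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up scores and labels (1 = above-average award, 0 = otherwise).
scores = [-2.0, -0.8, -0.4, 0.3, -0.1, 1.2]
labels = [0, 0, 1, 1, 0, 1]

print(auc(scores, labels))
```

Here eight of the nine pairs are ordered correctly; the one miss is the positive scored -0.4 against the negative scored -0.1.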

From the statistics returned by the test procedure, we can see that our AUC (area under the ROC curve) is about 80.8%, which is not too bad! We can use various techniques to improve on this model, such as removing the biased term and training a new classifier. This allows us to find multi-variable correlations, especially when one biased term is significantly more influential than the other attributes. We can view "explanations" of our classifier to get a deeper understanding of what it is doing:

In [8]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://award_model.cls'
        }
    })

<Response [201]>


### Example of "Explaining" Every Single Example (how much each feature influences the final score)

In [10]:
mldb.query("""
SELECT explain({features: {* EXCLUDING (amount, aboveAvg)}, label: aboveAvg}) AS *
FROM (
    SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
    INNER JOIN (
        SELECT avg(amount) AS amtavg
        FROM Award
    )
)
WHERE rowHash() % 4 = 0
LIMIT 10
""")

Unnamed: 0_level_0,bias,explanation.aid,explanation.dir,explanation.div,explanation.title,explanation.year
_rowName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
[2]-[[]],-0.030529,0.518014,0.190778,1.331558,0.04872,-0.016885
[5]-[[]],-0.030529,0.256754,0.30362,0.065486,0.044316,0.128941
[7]-[[]],-0.030529,0.244593,-0.068111,0.236052,0.044316,0.003485
[8]-[[]],-0.030529,0.518014,0.190778,1.331558,0.04872,-0.016885
[11]-[[]],-0.030529,0.518014,0.190778,1.331558,0.04872,-0.016885
[16]-[[]],-0.030529,0.336294,-0.072483,-0.003304,-0.156613,0.003485
[18]-[[]],-0.030529,0.256754,0.196541,-0.074462,0.044316,0.042531
[29]-[[]],0.030529,-0.336294,0.072483,0.003304,-0.044316,-0.003485
[31]-[[]],-0.030529,0.518014,0.190778,1.331558,0.04872,-0.016885
[34]-[[]],-0.030529,-0.044929,-0.388256,0.195248,0.045288,0.392361


### Example of Aggregating "Explanations" by Attribute

In [19]:
mldb.query("""
SELECT *
FROM transpose((
    SELECT avg({explain({features: {* EXCLUDING (amount,aboveAvg)}, label: aboveAvg})[explanation] as *}) AS *
    NAMED 'explanation'
    FROM (
        SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
        INNER JOIN (
            SELECT avg(amount) AS amtavg
            FROM Award
        )
    )
    WHERE rowHash() % 4 = 0
))
ORDER BY abs(explanation) DESC
""")

Unnamed: 0_level_0,explanation
_rowName,Unnamed: 1_level_1
div,0.268783
aid,0.145859
dir,0.113124
title,0.039471
year,0.015672
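
The aggregation itself is just a per-attribute average followed by a sort on absolute value. A minimal sketch using only three of the per-row explanations shown earlier (rows `[5]`, `[7]` and `[34]`) looks like this. Because it covers three rows rather than the full test set, its ranking differs from the table above.

```python
# Per-row explanations copied from rows [5], [7] and [34] of the earlier output.
explanations = [
    {"aid": 0.256754, "dir": 0.30362, "div": 0.065486, "title": 0.044316, "year": 0.128941},
    {"aid": 0.244593, "dir": -0.068111, "div": 0.236052, "title": 0.044316, "year": 0.003485},
    {"aid": -0.044929, "dir": -0.388256, "div": 0.195248, "title": 0.045288, "year": 0.392361},
]

# Average each attribute's contribution across rows.
averages = {}
for attribute in explanations[0]:
    values = [row[attribute] for row in explanations]
    averages[attribute] = sum(values) / len(values)

# Rank attributes by the magnitude of their average contribution.
ranked = sorted(averages.items(), key=lambda item: abs(item[1]), reverse=True)

for attribute, value in ranked:
    print("%s %.6f" % (attribute, value))
```

Note that positive and negative contributions can cancel in the average (as `dir` does here), so a small average does not necessarily mean the attribute is unimportant for individual rows.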


By aggregating the explanation scores by attribute, we notice that two stand out: `div` and `dir`. These represent the Division and Directorate, respectively. Since Divisions fall under Directorates, it's not surprising to see both rank highly, because they are related to each other.

Note that the model mistakes `aid` for a notable attribute when it is really just a unique identifier. Since `aid` is the primary key for `Award`, we can infer that it is not useful for identifying correlated or biased terms, so we can remove it from the set of attributes at the outset (shown below).
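
When the schema is not known in advance, one simple heuristic for spotting key-like columns such as `aid` is to flag any column whose values are all distinct. This is only a sketch of that heuristic, not something the notebook itself does. The helper name `candidate_keys` is hypothetical, and the sample values are taken from the first rows of the `Award` output shown earlier.

```python
def candidate_keys(rows):
    """Return the column names whose values are distinct in every row."""
    columns = sorted(rows[0].keys())
    return [c for c in columns if len(set(r[c] for r in rows)) == len(rows)]

# A few Award rows (subset of columns) copied from the earlier output.
rows = [
    {"aid": 1600011, "dir": "MPS", "div": "MS", "year": 2016},
    {"aid": 1600012, "dir": "GEO", "div": "OS", "year": 2016},
    {"aid": 1600014, "dir": "MPS", "div": "MS", "year": 2016},
    {"aid": 1600016, "dir": "ENG", "div": "CBETS", "year": 2015},
]

print(candidate_keys(rows))
```

On real data, free-text columns such as `title` may also be nearly unique, so a check like this is a starting point that still needs the human judgment described above.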

## Retraining Without the Biased Feature: `div`
We can look at the effects of removing `div` by adding it to the excluded columns so that it is not used by the model. We will also remove `aid` based on the intuition described above. This allows us to identify other potential outliers and gain a better understanding of our data.

In [13]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
        
            SELECT {* EXCLUDING (amount, aboveAvg, div, aid)} AS features,
                   aboveAvg AS label
            FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            )
            WHERE rowHash() % 4 != 0
            """,
        'modelFileUrl': 'file://award_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
        }
    })

<Response [201]>


In [14]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (amount, aboveAvg, div, aid)}})[score], label: aboveAvg
            FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN Investigator ON Investigator.aid = Award.aid
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            )
            WHERE rowHash() % 4 = 0
            """,
        'outputDataset': 'award_test',
        'mode': 'boolean'
        }
    })

We can see that our AUC has taken a pretty serious hit, dropping from about 80% to 65%. This is because we removed two attributes (`div` and `aid`) from the feature set, so the classifier can no longer use them.

This allows us to better understand the next most influential attributes as well as correlated variables that were difficult to observe due to biased features. The effect of removing these terms is shown below:

In [15]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://award_model.cls'
        }
    })

<Response [201]>


In [16]:
mldb.query("""
SELECT *
FROM transpose((
    SELECT avg({explain({features: {* EXCLUDING (amount, aboveAvg, div, aid)}, label: aboveAvg})[explanation] as *}) AS *
    NAMED 'explanation'
    FROM (
        SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
        INNER JOIN (
            SELECT avg(amount) AS amtavg
            FROM Award
        )
    )
    WHERE rowHash() % 4 = 0
))
ORDER BY abs(explanation) DESC
""")

Unnamed: 0_level_0,explanation
_rowName,Unnamed: 1_level_1
dir,0.141265
title,0.066794
year,0.006125


By removing the biased attributes, we can now see that `dir` is somewhat correlated with the total amount, which makes sense because directorates and divisions are related by a hierarchical model (so if `div` influences the result heavily, it makes sense that `dir` would as well).
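
The hierarchical relationship can be checked directly: if every division belongs to exactly one directorate, then `div` determines `dir`. The sketch below tests this on the (`dir`, `div`) pairs from the ten `Award` rows displayed earlier. The helper name `dependency_violations` is hypothetical, and ten rows are of course not proof for the whole dataset.

```python
from collections import defaultdict

def dependency_violations(pairs):
    """Return {div: set of dirs} for divisions seen under more than one directorate."""
    seen = defaultdict(set)
    for directorate, division in pairs:
        seen[division].add(directorate)
    return {div: dirs for div, dirs in seen.items() if len(dirs) > 1}

# (dir, div) pairs copied from the ten Award rows displayed earlier.
pairs = [
    ("MPS", "MS"), ("GEO", "OS"), ("MPS", "MS"), ("ENG", "CBETS"),
    ("ENG", "IIP"), ("GEO", "AGS"), ("MPS", "MS"), ("MPS", "MS"),
    ("MPS", "MS"), ("MPS", "MS"),
]

violations = dependency_violations(pairs)
print(violations)
```

An empty result means no division in the sample appears under two directorates, which is consistent with `div` carrying all the information in `dir`.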



## Finding Explanations Between Multiple Tables
Using the technique described above, we can identify the attributes relevant for an explanation by examining biased terms in the query. In the examples above, we only looked at attributes in `Award` to find biased terms, but this could miss inter-table relationships. To examine attributes from multiple tables efficiently, we can join tables on foreign keys, which helps limit the number of resulting records.
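
One thing to keep in mind when joining several tables on the same key is that the join produces one row per matching combination. The sketch below runs a query of the same shape as the one in the next cell against made-up rows in SQLite (the names, emails and amounts are hypothetical) to show that an award with two institutions is counted twice in `sum(A.amount)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Award (aid INTEGER, amount INTEGER);
CREATE TABLE Investigator (aid INTEGER, name TEXT, email TEXT);
CREATE TABLE Institution (aid INTEGER, instName TEXT);
-- Hypothetical data: award 1 has two institutions, award 2 has two investigators.
INSERT INTO Award VALUES (1, 100), (2, 50);
INSERT INTO Investigator VALUES
    (1, 'Ann', 'ann@example.org'),
    (2, 'Ann', 'ann@example.org'),
    (2, 'Bob', 'bob@example.org');
INSERT INTO Institution VALUES (1, 'Univ A'), (1, 'Univ B'), (2, 'Univ A');
""")

rows = conn.execute("""
SELECT Invest.name AS name, Invest.email AS email,
       sum(A.amount) AS awardSum, count(Inst.instName) AS schoolCount
FROM Award AS A
INNER JOIN Investigator AS Invest ON Invest.aid = A.aid
INNER JOIN Institution AS Inst ON Inst.aid = A.aid
GROUP BY Invest.name, Invest.email
ORDER BY awardSum DESC LIMIT 10
""").fetchall()

print(rows)
```

Ann's two awards total 150, yet the query reports 250 because award 1 appears once per institution. Totals for rows with a `schoolCount` above 1 may therefore deserve a second look before being treated as outliers.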

In [33]:
mldb.query("""
SELECT Invest.name AS name, Invest.email AS email, 
sum(A.amount) AS awardSum, count(Inst.instName) as schoolCount
FROM Award AS A
INNER JOIN Investigator AS Invest ON Invest.aid = A.aid
INNER JOIN Institution AS Inst ON Inst.aid = A.aid
GROUP BY Invest.name, Invest.email 
ORDER BY awardSum DESC LIMIT 10
""")

Unnamed: 0_level_0,awardSum,email,name,schoolCount
_rowName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"""[""""Richard Farnsworth"""",""""farnsworthr@battelle.org""""]""",165908521,farnsworthr@battelle.org,Richard Farnsworth,3
"""[""""Ethan Schreier"""",""""ejs@aui.edu""""]""",61348257,ejs@aui.edu,Ethan Schreier,5
"""[""""Patricia Gumport"""",""""gumport@stanford.edu""""]""",12352332,gumport@stanford.edu,Patricia Gumport,1
"""[""""Francis Halzen"""",""""halzen@icecube.wisc.edu""""]""",12250000,halzen@icecube.wisc.edu,Francis Halzen,1
"""[""""Andrew Bowen"""",""""abowen@whoi.edu""""]""",12168732,abowen@whoi.edu,Andrew Bowen,1
"""[""""Marvin Hackert"""",""""m.hackert@mail.utexas.edu""""]""",8233333,m.hackert@mail.utexas.edu,Marvin Hackert,1
"""[""""Stephen Simoncini"""",""""stephen.g.simoncini@census.gov""""]""",7299517,stephen.g.simoncini@census.gov,Stephen Simoncini,1
"""[""""Paula McClain"""",""""pmcclain@duke.edu""""]""",6317500,pmcclain@duke.edu,Paula McClain,1
"""[""""Mark DeCoster"""",""""decoster@latech.edu""""]""",6000000,decoster@latech.edu,Mark DeCoster,1
"""[""""Jared Medina"""",""""jmedina@psych.udel.edu""""]""",6000000,jmedina@psych.udel.edu,Jared Medina,1


## Conclusion
Using ML techniques, we were able to identify correlated attributes, which are useful in explaining unexpected query answers. These techniques can be applied to incorporate automatic attribute selection into explainable databases by exploiting primary and foreign key relationships, knowledge about aggregate operators, and minimal human domain knowledge.