# Explanations: Identifying Outliers & Biased Features

By identifying the features that most strongly influence a model trained on the data, we can decide which features to examine when explaining unexpected query answers.

We use the Awards dataset referenced in our paper to explain the following query, which retrieves the 10 Universities that have received the most award money in the area of Computer Science for 2017:
```sql
SELECT B.instName, sum(A.amount) AS totalAward
FROM Award AS A 
INNER JOIN Institution AS B ON A.aid = B.aid
WHERE A.dir = 'CISE' and A.year = 2017
GROUP BY B.instName
ORDER BY totalAward DESC
LIMIT 10
```
Dataset Link: https://www.nsf.gov/awardsearch/download.jsp
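
To make the semantics of the query concrete, here is a minimal pure-Python sketch of the same join, filter, group, and top-k steps. The rows are hypothetical toy data, not values from the Awards dataset, and the sketch assumes each `aid` appears at most once in `Institution`.

```python
from collections import defaultdict

# Hypothetical toy rows, NOT taken from the Awards dataset.
awards = [
    {"aid": 1, "amount": 500, "dir": "CISE", "year": 2017},
    {"aid": 2, "amount": 300, "dir": "CISE", "year": 2017},
    {"aid": 3, "amount": 900, "dir": "MPS", "year": 2017},
    {"aid": 4, "amount": 200, "dir": "CISE", "year": 2016},
    {"aid": 5, "amount": 250, "dir": "CISE", "year": 2017},
]
institutions = [
    {"aid": 1, "instName": "University A"},
    {"aid": 2, "instName": "University B"},
    {"aid": 3, "instName": "University A"},
    {"aid": 4, "instName": "University B"},
    {"aid": 5, "instName": "University B"},
]


def top_institutions(awards, institutions, directorate, year, limit=10):
    # Assumes one Institution row per aid, so the join never duplicates awards.
    inst_by_aid = {row["aid"]: row["instName"] for row in institutions}
    totals = defaultdict(int)
    for award in awards:
        # WHERE A.dir = ... AND A.year = ...
        if award["dir"] != directorate or award["year"] != year:
            continue
        name = inst_by_aid.get(award["aid"])
        if name is None:
            continue  # INNER JOIN drops awards without a matching institution
        totals[name] += award["amount"]  # GROUP BY instName, sum(amount)
    # ORDER BY totalAward DESC LIMIT limit
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


result = top_institutions(awards, institutions, "CISE", 2017)
print(result)
```

Only the three 2017 CISE awards contribute to the totals; the MPS award and the 2016 award are filtered out before grouping.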

### Import & Connect to MLDB

In [2]:
import pymldb
mldb = pymldb.Connection()

## Importing the data
The datasets are available from our GitHub and Google Drive. We parsed the XML data provided by NSF to generate the tables.

In [3]:
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'https://raw.githubusercontent.com/Mdevlin4/CMSC724/master/Award.csv',
        'outputDataset': 'Award',
        'delimiter': ','
        }
    })
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'https://raw.githubusercontent.com/Mdevlin4/CMSC724/master/Institution.csv',
        'outputDataset': 'Institution',
        'delimiter': ','
        }
    })

<Response [201]>
<Response [201]>


### Executing the Example Query 

In [4]:
mldb.query("""
SELECT B.instName, sum(A.amount) AS totalAward
FROM Award AS A 
INNER JOIN Institution AS B ON A.aid = B.aid
WHERE A.dir = 'CISE' and A.year = 2017
GROUP BY B.instName
ORDER BY totalAward DESC
LIMIT 10
""")

| _rowName | B.instName | totalAward |
|---|---|---|
| `"[""Clemson University""]"` | Clemson University | 4425039 |
| `"[""University of North Carolina at Chapel Hill""]"` | University of North Carolina at Chapel Hill | 3619587 |
| `"[""SUNY at Buffalo""]"` | SUNY at Buffalo | 3162942 |
| `"[""University of Wisconsin-Madison""]"` | University of Wisconsin-Madison | 3102391 |
| `"[""University of Colorado at Boulder""]"` | University of Colorado at Boulder | 3024814 |
| `"[""Cornell University""]"` | Cornell University | 2861738 |
| `"[""University of Illinois at Urbana-Champaign""]"` | University of Illinois at Urbana-Champaign | 2838857 |
| `"[""Arizona State University""]"` | Arizona State University | 2744283 |
| `"[""Carnegie-Mellon University""]"` | Carnegie-Mellon University | 2661297 |
| `"[""US Ignite, Inc.""]"` | US Ignite, Inc. | 2655164 |


### Labeling Elements based on Attributes
In our example query, we retrieve the 10 institutions with the most total award money. To generate training data for our machine learning model, we therefore label each award based on its `amount` attribute. The following query assigns a label of 1 to awards with an amount greater than the average award amount, and 0 otherwise. This lets us train the model on the awards that have the most impact on the total amount.

In [5]:
mldb.query("""
SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg 
FROM Award INNER JOIN (
    SELECT avg(amount) AS amtavg
    FROM Award
) LIMIT 10
""")

| _rowName | aboveAvg | aid | amount | dir | div | enddate | startdate | title | year |
|---|---|---|---|---|---|---|---|---|---|
| `[2]-[[]]` | 0 | 1600011 | 122453 | MPS | MS | 0 | 0 | Non-Archimedean Techniques in Analysis, Dynami... | 2016 |
| `[3]-[[]]` | 1 | 1600012 | 1996139 | GEO | OS | 0 | 0 | Coastal SEES: Enhancing sustainability in coas... | 2016 |
| `[4]-[[]]` | 0 | 1600014 | 10500 | MPS | MS | 0 | 0 | Conference: Evolution Equations on Singular Sp... | 2016 |
| `[5]-[[]]` | 0 | 1600016 | 83117 | ENG | CBETS | 0 | 0 | Rapid proposal: Fires and floods: Acquisition ... | 2015 |
| `[6]-[[]]` | 0 | 1600017 | 50000 | ENG | IIP | 0 | 0 | I-Corps: A Tissue-engineered Nipple-Areolar Co... | 2015 |
| `[7]-[[]]` | 0 | 1600018 | 185436 | GEO | AGS | 0 | 0 | Collaborative Research: P2C2--Ultra-High-Resol... | 2016 |
| `[8]-[[]]` | 0 | 1600023 | 180000 | MPS | MS | 0 | 0 | Linear Partial Differential Equations on Singu... | 2016 |
| `[9]-[[]]` | 0 | 1600024 | 130476 | MPS | MS | 0 | 0 | The Regularity of Cauchy-Riemann Mappings and ... | 2016 |
| `[10]-[[]]` | 0 | 1600028 | 73000 | MPS | MS | 0 | 0 | Long Term Regularity of Solutions of Fluid Models | 2016 |
| `[11]-[[]]` | 0 | 1600032 | 158004 | MPS | MS | 0 | 0 | New Methods in Tensor Triangular Geometry | 2016 |
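
The labeling rule itself is simple enough to sketch in plain Python. The example below reuses the first five `amount` values from the output above, but it averages over those five rows only, whereas the query averages over the whole `Award` table. The function name `label_above_average` is a hypothetical helper, not part of MLDB.

```python
def label_above_average(amounts):
    """Return 1 for amounts strictly above the mean of the list, else 0."""
    mean = sum(amounts) / float(len(amounts))
    return [1 if amount > mean else 0 for amount in amounts]


# First five amounts from the output above. The mean here is computed over
# these five rows only, not over the full Award table as in the query.
amounts = [122453, 1996139, 10500, 83117, 50000]
labels = label_above_average(amounts)
print(labels)  # [0, 1, 0, 0, 0]
```

Even in this tiny sample, one very large award pulls the mean up, so most rows receive the label 0.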


### Training a Model using Award Amount
We divide our dataset into two sets: one for training the model and one for testing it. Using `rowHash() % 4`, we assign roughly 75% of the rows to training and keep the remaining 25% for testing.
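
The idea behind a hash-based split can be sketched outside MLDB as follows. This is only an illustration: MD5 stands in for MLDB's `rowHash()`, so the actual bucket assignments will differ, and the row names are hypothetical.

```python
import hashlib


def bucket(row_name, buckets=4):
    # MD5 is a stand-in for MLDB's rowHash(); real hash values will differ.
    digest = hashlib.md5(row_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def split(row_names):
    # Bucket 0 is held out for testing, buckets 1-3 are used for training.
    train = [name for name in row_names if bucket(name) != 0]
    test = [name for name in row_names if bucket(name) == 0]
    return train, test


row_names = ["row-%d" % i for i in range(10000)]  # hypothetical row names
train, test = split(row_names)
print(len(train), len(test))
```

Because the assignment depends only on the row name, the same row always lands in the same set, which keeps the training and testing queries consistent across cells without storing the split.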

In [6]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            SELECT {* EXCLUDING (amount, aboveAvg)} AS features,
                   aboveAvg AS label FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            ) WHERE rowHash() % 4 != 0
            """,
        'modelFileUrl': 'file://award_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
        }
    })

<Response [201]>


TODO: This creates a [`classifier`][1] function named "score" that we can apply to examples from our test set. The higher the score, the more likely the model considers the award to be above average.

[1]: ../../../../doc/#builtin/functions/ClassifierApply.md.html

In [7]:
mldb.query("""
SELECT score({features: {* EXCLUDING (amount, aboveAvg)}}) AS *
FROM (
    SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg 
    FROM Award INNER JOIN (
        SELECT avg(amount) AS amtavg
        FROM Award
    )
)
WHERE rowHash() % 4 = 0
LIMIT 10
""")

| _rowName | score |
|---|---|
| `[2]-[[]]` | -2.041656 |
| `[5]-[[]]` | -0.768587 |
| `[7]-[[]]` | -0.429806 |
| `[8]-[[]]` | -2.041656 |
| `[11]-[[]]` | -2.041656 |
| `[16]-[[]]` | -0.07685 |
| `[18]-[[]]` | -0.435151 |
| `[29]-[[]]` | -0.277778 |
| `[31]-[[]]` | -2.041656 |
| `[34]-[[]]` | -0.169184 |


Now let's see how well our model does on the 25% of the data we didn't train on, to get a feel for how well it should perform in real life.

In [8]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (amount,aboveAvg)}})[score], label: aboveAvg
            FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            ) 
            WHERE rowHash() % 4 = 0
            """,
        'outputDataset': 'award_test',
        'mode': 'boolean'
        }
    })

TODO: As we can see by inspecting the statistics returned by the classifier.test procedure, the model seems to be doing pretty well! The AUC is 0.95: let's ship this thing in production right now! ... Or let's be cautious!
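
For readers unfamiliar with AUC, the sketch below computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The scores and labels are hypothetical, and whether `classifier.test` computes AUC in exactly this way is an assumption.

```python
def auc(scores, labels):
    """Fraction of positive/negative pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and labels, not taken from the award_test dataset.
scores = [-2.0, -0.8, -0.4, -0.1, 0.3, 0.9]
labels = [0, 0, 1, 0, 1, 1]
print(auc(scores, labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is fine for a sketch; on large test sets a rank-based formulation would be preferable.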

To understand what's going on, let's use the [`classifier.explain` function][2]. This will give us an idea of how much each feature helps (or hurts) in making the predictions.

[2]: ../../../../doc/#builtin/functions/ClassifierExplain.md.html

In [9]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://award_model.cls'
        }
    })

<Response [201]>


TODO: You can "explain" every single example and see how much each feature influences the final score, like this:

In [10]:
mldb.query("""
SELECT explain({features: {* EXCLUDING (amount, aboveAvg)}, label: aboveAvg}) AS *
FROM (
    SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
    INNER JOIN (
        SELECT avg(amount) AS amtavg
        FROM Award
    )
)
WHERE rowHash() % 4 = 0
LIMIT 10
""")

| _rowName | bias | explanation.aid | explanation.dir | explanation.div | explanation.title | explanation.year |
|---|---|---|---|---|---|---|
| `[2]-[[]]` | -0.030529 | 0.518014 | 0.190778 | 1.331558 | 0.04872 | -0.016885 |
| `[5]-[[]]` | -0.030529 | 0.256754 | 0.30362 | 0.065486 | 0.044316 | 0.128941 |
| `[7]-[[]]` | -0.030529 | 0.244593 | -0.068111 | 0.236052 | 0.044316 | 0.003485 |
| `[8]-[[]]` | -0.030529 | 0.518014 | 0.190778 | 1.331558 | 0.04872 | -0.016885 |
| `[11]-[[]]` | -0.030529 | 0.518014 | 0.190778 | 1.331558 | 0.04872 | -0.016885 |
| `[16]-[[]]` | -0.030529 | 0.336294 | -0.072483 | -0.003304 | -0.156613 | 0.003485 |
| `[18]-[[]]` | -0.030529 | 0.256754 | 0.196541 | -0.074462 | 0.044316 | 0.042531 |
| `[29]-[[]]` | 0.030529 | -0.336294 | 0.072483 | 0.003304 | -0.044316 | -0.003485 |
| `[31]-[[]]` | -0.030529 | 0.518014 | 0.190778 | 1.331558 | 0.04872 | -0.016885 |
| `[34]-[[]]` | -0.030529 | -0.044929 | -0.388256 | 0.195248 | 0.045288 | 0.392361 |
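
One way to sanity-check this output is to add the bias to the per-feature contributions and compare the total with the score returned in In [7]. The sketch below does this for four of the displayed rows. In these rows the totals match the scores in magnitude; the sign differs for most of them, possibly because `explain` is given each row's label, but that interpretation is an assumption that this notebook does not confirm.

```python
# Values copied from the score (In [7]) and explain (In [10]) outputs above:
# (row name, score, bias, contributions for aid, dir, div, title, year)
rows = [
    ("[2]-[[]]", -2.041656, -0.030529,
     [0.518014, 0.190778, 1.331558, 0.04872, -0.016885]),
    ("[5]-[[]]", -0.768587, -0.030529,
     [0.256754, 0.30362, 0.065486, 0.044316, 0.128941]),
    ("[29]-[[]]", -0.277778, 0.030529,
     [-0.336294, 0.072483, 0.003304, -0.044316, -0.003485]),
    ("[34]-[[]]", -0.169184, -0.030529,
     [-0.044929, -0.388256, 0.195248, 0.045288, 0.392361]),
]


def explained_total(bias, contributions):
    """Sum of the bias and all per-feature contributions for one row."""
    return bias + sum(contributions)


totals = {}
for name, score, bias, contributions in rows:
    totals[name] = explained_total(bias, contributions)
    # Compare magnitudes only; the displayed values are rounded to 6 decimals.
    matches = abs(abs(totals[name]) - abs(score)) < 1e-5
    print(name, round(totals[name], 6), score, matches)
```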


Alternatively, we can average the explanations over all the test examples. Here we transpose the result and sort it by absolute value.

In [11]:
mldb.query("""
SELECT *
FROM transpose((
    SELECT avg({explain({features: {* EXCLUDING (amount,aboveAvg)}, label: aboveAvg})[explanation] as *}) AS *
    NAMED 'explanation'
    FROM (
        SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
        INNER JOIN (
            SELECT avg(amount) AS amtavg
            FROM Award
        )
    )
    WHERE rowHash() % 4 = 0
))
ORDER BY abs(explanation) DESC
""")

| _rowName | explanation |
|---|---|
| div | 0.268783 |
| aid | 0.145859 |
| dir | 0.113124 |
| title | 0.039471 |
| year | 0.015672 |
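
The averaging and ranking step can be reproduced in a few lines of Python. The sketch below uses only four of the per-example rows displayed earlier, so its averages differ from the full test-set values in the table above; `rank_features` is a hypothetical helper name.

```python
FEATURES = ["aid", "dir", "div", "title", "year"]

# Four per-example explanation rows copied from the In [10] output.
explanations = [
    [0.518014, 0.190778, 1.331558, 0.04872, -0.016885],   # [2]-[[]]
    [0.256754, 0.30362, 0.065486, 0.044316, 0.128941],    # [5]-[[]]
    [0.244593, -0.068111, 0.236052, 0.044316, 0.003485],  # [7]-[[]]
    [-0.044929, -0.388256, 0.195248, 0.045288, 0.392361], # [34]-[[]]
]


def rank_features(names, rows):
    """Average each column, then sort features by absolute mean, descending."""
    means = [sum(column) / float(len(column)) for column in zip(*rows)]
    return sorted(zip(names, means), key=lambda pair: abs(pair[1]), reverse=True)


ranked = rank_features(FEATURES, explanations)
for name, mean in ranked:
    print("%-6s %.6f" % (name, mean))
```

Note that averaging signed contributions lets positive and negative values cancel, as happens for `dir` in this small sample. Averaging absolute values instead would measure overall influence regardless of direction; which is more appropriate depends on the question being asked.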


What is striking here is that two features stand out: `div` and `dir`. These represent the Division and Directorate, respectively. Since Divisions fall under Directorates, it's not surprising to see both rank highly, as they are related to each other.

## Retraining Without the Biased Feature: `dir`
We can look at the effects of removing `dir` by adding it to the excluded columns so that it is not used by the model.
This allows us to identify other potential outliers and get a better understanding of our data.
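
Because the same exclusion list has to appear in the training, testing, and explanation queries, it can help to build the clause in one place. The following is a minimal sketch that mirrors the exclusion list used in the training cell; `features_clause` and `EXCLUDED` are hypothetical names, not part of pymldb.

```python
def features_clause(excluded):
    """Build the MLDB row expression that drops the given columns."""
    return "{* EXCLUDING (%s)}" % ", ".join(excluded)


# Hypothetical constant: the label, the column it is derived from, and the
# feature being removed from the model.
EXCLUDED = ["amount", "aboveAvg", "dir"]

training_select = (
    "SELECT %s AS features, aboveAvg AS label" % features_clause(EXCLUDED)
)
testing_select = (
    "SELECT score: score({features: %s})[score], label: aboveAvg"
    % features_clause(EXCLUDED)
)
print(training_select)
print(testing_select)
```

Generating every query from the same list makes it harder for the training and evaluation queries to drift apart when another feature is excluded later.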

In [12]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
        
            SELECT {* EXCLUDING (amount, aboveAvg, dir)} AS features,
                   aboveAvg AS label
            FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            )
            WHERE rowHash() % 4 != 0
            """,
        'modelFileUrl': 'file://award_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
        }
    })

<Response [201]>


In [13]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (amount, aboveAvg, dir)}})[score], label: aboveAvg
            FROM (
                SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
                INNER JOIN (
                    SELECT avg(amount) AS amtavg
                    FROM Award
                )
            )
            WHERE rowHash() % 4 = 0
            """,
        'outputDataset': 'award_test',
        'mode': 'boolean'
        }
    })

TODO: AUC of .79

If we run the explanation again, the highest ranking features seem more legitimate.

In [14]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://award_model.cls'
        }
    })

<Response [201]>


In [15]:
mldb.query("""
SELECT *
FROM transpose((
    SELECT avg({explain({features: {* EXCLUDING (amount, aboveAvg, dir)}, label: aboveAvg})[explanation] as *}) AS *
    NAMED 'explanation'
    FROM (
        SELECT Award.* AS *, Award.amount > amtavg AS aboveAvg FROM Award 
        INNER JOIN (
            SELECT avg(amount) AS amtavg
            FROM Award
        )
    )
    WHERE rowHash() % 4 = 0
))
ORDER BY abs(explanation) DESC
""")

| _rowName | explanation |
|---|---|
| div | 0.278097 |
| aid | 0.154999 |
| title | 0.029704 |
| year | 0.009421 |


## Conclusion
TODO
