Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A4 - Transparency
Please use the following structure as a starting point. Extend and change the notebook according to your needs. This structure should help guide you through your analysis. The notebook is the foundation for condensing your results and writing your reflection at the end, so please first read what we expect from you regarding the reflection and structure your analysis accordingly.

## [1] General understanding
> What is the model about and who is using it?

* What is your model about?
* Why is this model useful?
* Who is using this model?
* Who are the stakeholders or users of ORES?
* Why is this model useful to Wikipedia?
* What applications/projects/... within Wikipedia are using this model?

## [2] API
> What does the ORES API (v3) tell you about a specific model? What functions does the API offer?

Use the API to investigate your model: https://ores.wikimedia.org/v3/#/. What do the following API calls do and what do they tell you about your model?

* `https://ores.wikimedia.org/v3/scores/`
* `https://ores.wikimedia.org/v3/scores/?model_info`
* `https://ores.wikimedia.org/v3/scores/enwiki`
* `https://ores.wikimedia.org/v3/scores/enwiki?models=YOURMODELNAME&model_info`
* `https://ores.wikimedia.org/v3/scores/enwiki?models=YOURMODELNAME&revids=SOMEIDHERE`
* `https://ores.wikimedia.org/v3/scores/enwiki/REVID/YOURMODELNAME?model_info`
* `https://ores.wikimedia.org/v3/scores/enwiki/REVID/YOURMODELNAME?features=true`

### Feature Injection
Please check out the _feature injection_ feature of ORES: https://www.mediawiki.org/wiki/ORES/Feature_injection

**Example:**

     # Here you can get the prediction for a revision as if the user had been anonymous:
     https://ores.wikimedia.org/v3/scores/enwiki/991397091/damaging?features&feature.revision.user.is_anon=true
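
A URL like the one above can also be assembled programmatically. The following is a minimal sketch that only builds the URL and sends no request; the helper name `build_injection_url` and the constant `ORES_BASE` are hypothetical and not part of ORES.

```python
from urllib.parse import urlencode

ORES_BASE = 'https://ores.wikimedia.org/v3/scores'

def build_injection_url(rev_id, model, injected, project='enwiki'):
    """Build a score URL that overrides selected feature values.

    `injected` maps full feature names to the values to inject.
    This helper is hypothetical and only assembles the URL.
    """
    query = {'features': 'true'}
    for name, value in injected.items():
        # Booleans are written in lowercase, as in the example URL above
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return f'{ORES_BASE}/{project}/{rev_id}/{model}?{urlencode(query)}'

url = build_injection_url(991397091, 'damaging', {'feature.revision.user.is_anon': True})
print(url)
```

Several features can be injected at once by adding more entries to the dictionary.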

## [3] ML algorithm and training/test data
> Which machine learning model is underlying and what data is used to build the model?

* Check out `model_info` in detail.
* What does it tell you about the model performance?
* You can visualise and explain your results regarding model performance.
* What data was used to train and test the model?
* What machine learning algorithm is your model using? Please explain briefly.

In [2]:
import requests
import json
import pandas as pd

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/marisanest',
    'From': 'marisa.f.nest@fu-berlin.de'
}

def get_model_info(headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models={model}&model_info'

    params = {
        'model'   : 'damaging'
    }

    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    return response

First we check all info that we can get from the API endpoint about the model.

In [42]:
model_info = get_model_info(headers)

### What does it tell you about the model performance? 

To answer this question, we analyzed the different metrics the API provides about the model's performance:

In [69]:
# extract statistics
statistics = model_info['enwiki']['models']['damaging']['statistics'].copy()

# split statistics into different subdomains
count_statistics = statistics.pop('counts')
confusion_matrix = count_statistics.pop('predictions')
rate_statistics = statistics.pop('rates')

First we look at the overall count statistics. It seems that the test sample consists of 19332 data points in total, of which 18585 are labeled false and 747 are labeled true.

In [83]:
# raw count statistics from the API
count_statistics

{'labels': {'false': 18585, 'true': 747}, 'n': 19332}

Then we look at the rate statistics. These inform us about the share of true and false labeled data points within the whole population and within the sample.

In [85]:
# rate statistics from the API as data frame
pd.DataFrame(rate_statistics)

Unnamed: 0,population,sample
False,0.966,0.961
True,0.034,0.039
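
As a quick consistency check, the sample rates can be recomputed from the raw counts shown earlier. This is only a sketch with the counts copied in by hand; it does not call the API.

```python
# Counts copied from the API output above
counts = {'labels': {'false': 18585, 'true': 747}, 'n': 19332}

# Share of each label within the sample, rounded like the API values
sample_rates = {
    label: round(count / counts['n'], 3)
    for label, count in counts['labels'].items()
}
print(sample_rates)  # {'false': 0.961, 'true': 0.039}
```

The result agrees with the `sample` column, so the sample contains slightly more true labeled edits than the population rate of 0.034 would suggest.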


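The performance statistics also include all values needed for a confusion matrix (see below), but the API does not state how to read them (e.g. which value is the true positive count). The sums over the inner keys match the label counts reported above (17875 + 710 = 18585 and 318 + 429 = 747), so we read the outer key as the actual label and the inner key as the predicted label. Below you see the confusion matrix values from the API and an interpreted version as a data frame.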

In [80]:
# raw confusion matrix values from the API
confusion_matrix

{'false': {'false': 17875, 'true': 710}, 'true': {'false': 318, 'true': 429}}

In [82]:
# interpret the API response: outer key = actual label, inner key = predicted label
true_positive = confusion_matrix['true']['true']
false_positive = confusion_matrix['false']['true']
true_negative = confusion_matrix['false']['false']
false_negative = confusion_matrix['true']['false']

# generate a data frame to visualize the data (rows: predicted label, columns: actual label)
pd.DataFrame(
    data=[
        {'Positive': true_positive, 'Negative': false_positive}, 
        {'Positive': false_negative, 'Negative': true_negative}
    ], 
    index=['Positive', 'Negative']
)

Unnamed: 0,Positive,Negative
Positive,429,710
Negative,318,17875
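
To make the reading of the confusion matrix concrete, the usual metrics can be derived from the four counts. This sketch assumes that the outer key is the actual label and the inner key the predicted label. The values are computed from the raw sample counts only, so they may differ slightly from the metrics the API reports if those are scaled to the population rates (an assumption we did not verify).

```python
# Raw confusion matrix values copied from the API output above
confusion_matrix = {'false': {'false': 17875, 'true': 710}, 'true': {'false': 318, 'true': 429}}

# Assumption: outer key = actual label, inner key = predicted label
tp = confusion_matrix['true']['true']
fp = confusion_matrix['false']['true']
tn = confusion_matrix['false']['false']
fn = confusion_matrix['true']['false']

sample_metrics = {
    'accuracy': (tp + tn) / (tp + fp + tn + fn),
    'precision': tp / (tp + fp),
    'recall': tp / (tp + fn),
    'fpr': fp / (fp + tn),
}
for name, value in sample_metrics.items():
    print(f'{name}: {value:.3f}')
```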


As a last step we look into the actual performance metrics.

In [86]:
data = []

for value in statistics.values():
    data.append(
        {
            'labels_false': value['labels']['false'], 
            'labels_true': value['labels']['true'], 
            'macro': value['macro'], 
            'micro': value['micro']
        }
    )

In [87]:
print(f'For this model the following {len(statistics.keys())} different performance metrics with their respective values are provided:')
pd.DataFrame(data=data, index=statistics.keys())

For this model the following 12 different performance metrics with their respective values are provided:


Unnamed: 0,labels_false,labels_true,macro,micro
!f1,0.433,0.973,0.703,0.451
!precision,0.347,0.985,0.666,0.369
!recall,0.574,0.962,0.768,0.588
accuracy,0.949,0.949,0.949,0.949
f1,0.973,0.433,0.703,0.955
filter_rate,0.057,0.943,0.5,0.087
fpr,0.426,0.038,0.232,0.412
match_rate,0.943,0.057,0.5,0.913
pr_auc,0.997,0.448,0.722,0.978
precision,0.985,0.347,0.666,0.963


As you can see, there are a lot of different measurements. The "micro" and "macro" columns are two different ways of averaging a metric over the labels. The "labels_false" and "labels_true" columns are not explained in the API response; they presumably hold the metric computed separately for each label, which fits the observation that 0.703 for f1 is exactly the mean of 0.973 and 0.433. Some metrics are also clearer than others: accuracy, precision, recall, f1 and fpr (false positive rate) are well known, but filter_rate, match_rate, pr_auc, roc_auc and all metrics with a leading "!" need closer inspection. The "!" probably stands for "negative", e.g. negative precision, which would be an alias for the negative predictive value (NPV).

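Depending on the metric, the model performs quite well in some respects and rather poorly in others. For example, accuracy is 0.949, but with roughly 96% of the edits labeled false this says little about how well damaging edits are detected; precision for the true label is only 0.347.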
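
The two averaging columns can be checked against the per-label values. The sketch below copies three rows from the table by hand and compares an unweighted mean (what is usually called a macro average) and a mean weighted by the population rates with the two reported averages. That the second reported value matches a population-weighted mean is our observation from the numbers, not something we found documented.

```python
# Per-label values and the two reported averages, copied from the table above
rows = {
    'f1':        {'false': 0.973, 'true': 0.433, 'averages': (0.703, 0.955)},
    'precision': {'false': 0.985, 'true': 0.347, 'averages': (0.666, 0.963)},
    'fpr':       {'false': 0.426, 'true': 0.038, 'averages': (0.232, 0.412)},
}
# Population rates from the rate statistics
population = {'false': 0.966, 'true': 0.034}

checks = {}
for name, row in rows.items():
    unweighted = (row['false'] + row['true']) / 2
    weighted = sum(population[label] * row[label] for label in population)
    checks[name] = (unweighted, weighted)
    print(f'{name}: unweighted={unweighted:.3f} weighted={weighted:.3f} reported={row["averages"]}')
```

Small deviations in the third decimal are expected because the inputs are already rounded.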

### What data was used to train and test the model?

To answer this question, we looked into the [ORES documentation](https://www.mediawiki.org/wiki/ORES) and found the following:

> Advanced support
>
> Rather than assuming, we can ask editors to train ORES which edits are in-fact damaging and which edits look like they were saved in goodfaith. This requires additional work on the part of volunteers in the community, but it affords a more accurate and nuanced prediction with regards to the quality of an edit. Many tools will only function when advanced support is available for a target wiki.
> 
> damaging – predicts whether or not an edit causes damage
> 
> goodfaith – predicts whether an edit was saved in good-faith

This tells us that the training data for the damaging model was manually labeled by editors / volunteers of the respective wiki community.


We also checked the Meta-Wiki documentation about ORES and the [damaging model](https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/damaging). It likewise states that the model is trained on human judgement, with a reference to the [Wiki labels](https://meta.wikimedia.org/wiki/Wiki_labels) project [Edit quality](https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_quality) (see below).


> This model was trained on human judgement for whether or not an edit is damaging.


We therefore checked the [Wiki labels documentation](https://meta.wikimedia.org/wiki/Wiki_labels). Wiki labels is a tool/service/gadget that ORES uses to manage projects in which editors are invited to label data.

Afterwards we looked more closely at the Wiki labels project of the English Wikipedia mentioned above: [Edit quality](https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_quality). It states:

> We'll be using WP:Labels to review 6334 randomly sampled edits as "damaging" and/or "good-faith" in order to train classifiers for mw:ORES. 

### What machine learning algorithm is your model using?

To get more information about the training algorithm and its parameters, we look at different parts of the API response.

In [14]:
model_info['enwiki']['models']['damaging']['type']

'GradientBoosting'

In [16]:
model_info['enwiki']['models']['damaging']['params']['criterion']

'friedman_mse'

In [19]:
model_info['enwiki']['models']['damaging']['params']['loss']

'deviance'

In [18]:
model_info['enwiki']['models']['damaging']['score_schema']

{'properties': {'prediction': {'description': 'The most likely label predicted by the estimator',
   'type': 'boolean'},
  'probability': {'description': 'A mapping of probabilities onto each of the potential output labels',
   'properties': {'false': {'type': 'number'}, 'true': {'type': 'number'}},
   'type': 'object'}},
 'title': 'Scikit learn-based classifier score with probability',
 'type': 'object'}

What we can see is that **gradient boosting** is used as the training algorithm.

Gradient boosting is defined as:

> Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. [1]

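This means that, in general, different weak learners can serve as the base model. The API response does not name the base learner explicitly. Since the score schema refers to a scikit-learn-based classifier, the model is presumably built on scikit-learn's gradient boosting implementation, which uses decision trees.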

Additionally, we can see that **deviance** is used as the loss function and **Friedman mean squared error** as the criterion.

The loss function calculates a value that shows how well or badly the current model fits the training data. The task of the training algorithm is then to minimize this value.
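
To illustrate what such a loss value looks like, here is a minimal sketch of a binomial deviance (log loss) for binary labels. It assumes that `deviance` refers to the logistic loss, as in scikit-learn, up to a constant factor. The function name and the example probabilities are hypothetical.

```python
import math

def binomial_deviance(y_true, p_true):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_true):
        # y is 1 for a damaging edit and 0 otherwise; p is the predicted probability of 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions for one damaging and one non-damaging edit
confident_and_right = binomial_deviance([1, 0], [0.9, 0.1])
confident_and_wrong = binomial_deviance([1, 0], [0.1, 0.9])
print(confident_and_right, confident_and_wrong)
```

Confident wrong predictions are penalized much more strongly than confident correct ones are rewarded, which is why the training algorithm is pushed towards well calibrated probabilities.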

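At first we took the term criterion to be a synonym for loss function, but it would not make sense to have two different loss functions. Assuming the parameters are passed on to scikit-learn, `criterion` is instead the function that measures the quality of a split inside each decision tree, whereas `loss` is the function that the boosting procedure as a whole optimizes.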

The API also tells us about the score schema. Its title refers to a scikit-learn-based classifier with probability: the score contains a probability for each potential output label, and the prediction is the most likely label. This is our own interpretation; we are not sure whether it is the correct understanding of the provided info.
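
Under this reading of the score schema, the prediction follows directly from the probabilities. A minimal sketch with made-up numbers (the helper `most_likely_label` is hypothetical and not part of ORES):

```python
def most_likely_label(probability):
    """Return the label with the highest probability as a boolean."""
    best = max(probability, key=probability.get)
    return best == 'true'

# Hypothetical probabilities, not taken from a real API response
score = {'probability': {'false': 0.92, 'true': 0.08}}
score['prediction'] = most_likely_label(score['probability'])
print(score)
```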

[1] [Gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting)

## [4] Features
> Which features are used and which have the greatest influence on the prediction?

* What features is your model using?
* What do they mean?
* Which are the most important features?
* `https://ores.wikimedia.org/v3/scores/enwiki/991379667/articlequality?features=true`
* Are all models (in all language versions of Wikipedia) using the same features?

### What features is your model using?

To answer this question, we retrieve the feature info for one sample revision via the API with `features=true`.

In [122]:
def get_feature_info(headers, rev_id, project='enwiki'):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{rev_id}/{model}?features=true'
    
    params = {
        'project' : project,
        'model'   : 'damaging',
        'rev_id'  : rev_id
    }

    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    return response[project]['scores'][str(rev_id)]['damaging']['features']

In [113]:
feature_info = get_feature_info(headers, 991379667)

In [114]:
len(feature_info.keys())

78

In [115]:
feature_info

{'feature.english.badwords.revision.diff.match_delta_decrease': 0,
 'feature.english.badwords.revision.diff.match_delta_increase': 0,
 'feature.english.badwords.revision.diff.match_delta_sum': 0,
 'feature.english.badwords.revision.diff.match_prop_delta_decrease': 0.0,
 'feature.english.badwords.revision.diff.match_prop_delta_increase': 0.0,
 'feature.english.badwords.revision.diff.match_prop_delta_sum': 0.0,
 'feature.english.dictionary.revision.diff.dict_word_delta_decrease': -1,
 'feature.english.dictionary.revision.diff.dict_word_delta_increase': 0,
 'feature.english.dictionary.revision.diff.dict_word_delta_sum': -1,
 'feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease': -0.00045392646391284613,
 'feature.english.dictionary.revision.diff.dict_word_prop_delta_increase': 0.0,
 'feature.english.dictionary.revision.diff.dict_word_prop_delta_sum': -0.00045392646391284613,
 'feature.english.dictionary.revision.diff.non_dict_word_delta_decrease': 0,
 'feature.english.d

As you can see, this model uses the 78 features listed above (we also checked other revisions and they all seem to have 78 features).

### What do they mean?

For some features it is easy to guess from the key names listed above what the feature is about. For example, 'feature.revision.user.is_anon' is probably a boolean flag that describes whether the user who made this revision was anonymous. Other features, like 'feature.english.informals.revision.diff.match_prop_delta_sum', are not self-explanatory. It is also easy to see that the key names follow some structure, so that you can get an idea of the domain a feature deals with. For example, all features that start with 'feature.revision.user' are user-specific measurements. Here is a short description table:

| domain | description |
|:-------|:------------|
| feature.english.badwords.revision.diff | These features deal with newly added or removed bad words or BWDS (see [BWDS](https://www.mediawiki.org/wiki/ORES/BWDS_review)). For this, the old revision is compared with the new one. It is not clear what exactly the different features measure (e.g. match_prop_delta_sum). |
| feature.english.dictionary.revision.diff | These features deal with newly added or removed words that are known from a specific dictionary (see [Word lists](https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists)). For this, the old revision is compared with the new one. It is not clear what exactly the different features measure (e.g. dict_word_prop_delta_sum). |
| feature.english.informals.revision.diff | These features deal with newly added or removed informal words that are known from a specific word list (see [Word lists](https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists)). For this, the old revision is compared with the new one. It is not clear what exactly the different features measure (e.g. match_prop_delta_decrease). |
| feature.len | These features measure different lengths, e.g. datasource.wikitext.revision.words probably measures the number of words in the revision's wikitext. |
| feature.revision.comment | These features deal with the revision comment. |
| feature.revision.diff | These features deal with the difference between the current and previous revision. |
| feature.revision.page | These features deal with the page that is edited by the revision. |
| feature.revision.user | These features deal with the user that has made the revision. |
| feature.temporal.revision.user | These features also deal with the user that has made the revision. |
| feature.wikitext.revision | These features deal with the wikitext of the revision. This probably means the actual text of the edited page and not the metadata. |
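
The grouping by domain can also be done in code. The following sketch uses a handful of feature names copied from the API output above (not the full list of 78) and treats everything before the last dot as the domain. That rule is a simplification of ours and will not reproduce every row of the table.

```python
from collections import Counter

# A few feature names copied from the API output above; the full response has 78 entries
feature_names = [
    'feature.english.badwords.revision.diff.match_delta_decrease',
    'feature.english.badwords.revision.diff.match_delta_increase',
    'feature.english.badwords.revision.diff.match_delta_sum',
    'feature.english.dictionary.revision.diff.dict_word_delta_decrease',
    'feature.english.dictionary.revision.diff.dict_word_delta_increase',
    'feature.revision.user.is_anon',
]

def domain_of(feature_name):
    # Treat everything before the last dot as the domain
    return feature_name.rsplit('.', 1)[0]

domain_counts = Counter(domain_of(name) for name in feature_names)
for domain, count in domain_counts.items():
    print(domain, count)
```

Applied to `feature_info.keys()`, the same counter would show how many features each domain contributes.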

All in all, these features are not very well explained. We can guess the meaning of some of them, but we cannot be sure that our interpretation is correct. We could also look at the revision itself and check whether our interpretation and the values shown above for a specific feature correspond to the actual revision, but this is quite impractical. We also searched for more information but did not find anything.

### Which are the most important features?

This question cannot be answered properly so far. We tried to change each feature value separately and then measured the resulting predicted probability distribution. We then compared each of these distributions with the baseline distribution (the prediction without any changed feature value) using the Jensen-Shannon distance. The problem is that the model always uses all features: setting a value to 0, changing a value from 0 to 100 and flipping a False value to True are changes of very different size, so the resulting distances cannot be compared. To really compare the outputs, we would need to be able to exclude features from the prediction completely.
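
For illustration, the distance measure itself can be written in a few lines of plain Python. This is a sketch of the Jensen-Shannon distance with the natural logarithm, which to our understanding is also the default of `scipy.spatial.distance.jensenshannon`. The probabilities are hypothetical and not taken from the API.

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of the divergence, natural log)."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the divergence
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    # Mixture distribution of p and q
    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = (kl(p, m) + kl(q, m)) / 2
    # Guard against tiny negative values caused by rounding
    return math.sqrt(max(0.0, divergence))

# Hypothetical probabilities in the order (false, true)
baseline = [0.92, 0.08]
after_injection = [0.70, 0.30]
print(jensen_shannon_distance(baseline, after_injection))
```

The distance is 0 for identical distributions and grows as the injected feature value moves the prediction further away from the baseline.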

In [184]:
from scipy.spatial import distance
import operator

In [199]:
def get_probability(headers, rev_id, feature=None, model='damaging', project='enwiki'):
    
    # Define the endpoint; {feature} is an optional feature injection suffix
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{rev_id}/{model}?features{feature}'
    
    params = {
        'project' : project,
        'model'   : model,
        # inject the literal value "None" for the given feature, if any
        'feature' : '' if feature is None else f'&{feature}=None',
        'rev_id'  : rev_id
    }

    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    # The probabilities would be found at
    # response[project]['scores'][str(rev_id)][model]['score']['probability']
    return response

Just some tests and playing around:

In [None]:
# rev_id = 1

In [172]:
# base_probability = get_probability(headers, rev_id)

In [173]:
# feature_probabilities = {}
#
# for key in feature_info.keys():
#    feature_probabilities[key] = get_probability(headers, rev_id, feature=key)

In [181]:
# feature_probability_distances = {}

# for key, value in feature_probabilities.items():
#    feature_probability_distances[key] = distance.jensenshannon(list(value.values()), list(base_probability.values()))

In [202]:
# max(feature_probability_distances.items(), key=operator.itemgetter(1))

### Are all models (in all language versions of Wikipedia) using the same features?

To answer this question, we check other Wikipedia language versions that support this model.

In [125]:
feature_info = get_feature_info(headers, 1, project='frwiki')
feature_info

{'feature.english.badwords.revision.diff.match_delta_decrease': 0,
 'feature.english.badwords.revision.diff.match_delta_increase': 0,
 'feature.english.badwords.revision.diff.match_delta_sum': 0,
 'feature.english.badwords.revision.diff.match_prop_delta_decrease': 0.0,
 'feature.english.badwords.revision.diff.match_prop_delta_increase': 0.0,
 'feature.english.badwords.revision.diff.match_prop_delta_sum': 0.0,
 'feature.english.informals.revision.diff.match_delta_decrease': 0,
 'feature.english.informals.revision.diff.match_delta_increase': 0,
 'feature.english.informals.revision.diff.match_delta_sum': 0,
 'feature.english.informals.revision.diff.match_prop_delta_decrease': 0.0,
 'feature.english.informals.revision.diff.match_prop_delta_increase': 0.0,
 'feature.english.informals.revision.diff.match_prop_delta_sum': 0.0,
 'feature.french.badwords.revision.diff.match_delta_decrease': 0,
 'feature.french.badwords.revision.diff.match_delta_increase': 0,
 'feature.french.badwords.revision.d

In [126]:
feature_info = get_feature_info(headers, 1, project='dewiki')
feature_info

{'feature.english.badwords.revision.diff.match_delta_decrease': 0,
 'feature.english.badwords.revision.diff.match_delta_increase': 0,
 'feature.english.badwords.revision.diff.match_delta_sum': 0,
 'feature.english.badwords.revision.diff.match_prop_delta_decrease': 0.0,
 'feature.english.badwords.revision.diff.match_prop_delta_increase': 0.0,
 'feature.english.badwords.revision.diff.match_prop_delta_sum': 0.0,
 'feature.english.informals.revision.diff.match_delta_decrease': 0,
 'feature.english.informals.revision.diff.match_delta_increase': 0,
 'feature.english.informals.revision.diff.match_delta_sum': 0,
 'feature.english.informals.revision.diff.match_prop_delta_decrease': 0.0,
 'feature.english.informals.revision.diff.match_prop_delta_increase': 0.0,
 'feature.english.informals.revision.diff.match_prop_delta_sum': 0.0,
 'feature.german.badwords.revision.diff.match_delta_decrease': 0,
 'feature.german.badwords.revision.diff.match_delta_increase': 0,
 'feature.german.badwords.revision.d

As you can see, the French and German versions of the model have almost the same features, with the difference that they use some additional language-dependent features (e.g. all features that start with 'feature.french.badwords.revision.diff' or 'feature.german.badwords.revision.diff').
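
Such a comparison is easier to read with set operations than by scanning the printed dictionaries. The sketch below uses a few feature names copied from the truncated outputs above, so it illustrates the approach rather than the complete result.

```python
# Feature names copied from the truncated outputs above (not the full lists)
frwiki_features = {
    'feature.english.badwords.revision.diff.match_delta_sum',
    'feature.english.informals.revision.diff.match_delta_sum',
    'feature.french.badwords.revision.diff.match_delta_decrease',
}
dewiki_features = {
    'feature.english.badwords.revision.diff.match_delta_sum',
    'feature.english.informals.revision.diff.match_delta_sum',
    'feature.german.badwords.revision.diff.match_delta_decrease',
}

# Features both models share and features unique to each wiki
shared = frwiki_features & dewiki_features
only_frwiki = frwiki_features - dewiki_features
only_dewiki = dewiki_features - frwiki_features

print(sorted(shared))
print(sorted(only_frwiki))
print(sorted(only_dewiki))
```

With the full responses, `set(feature_info.keys())` for each project could be used in place of the hand-copied sets.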

## Sample code

In [20]:
import requests
import json

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/YOUR-USER-NAME',
    'From': 'YOUR-EMAIL@fu-berlin.de'
}

def get_ores_data(rev_id, headers):
    
    # Define the endpoint: This is an example!
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'YOURMODELNAME',
              'revids'  : rev_id
              }

    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    data = json.dumps(response)

    return data
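
Since `get_ores_data` returns a JSON string, it has to be parsed with `json.loads` before the scores can be read. The sketch below shows one way to pull the probabilities out of a parsed response. The helper `extract_probabilities` is hypothetical, the sample response contains made-up values, and the shape of the `error` entry is an assumption.

```python
def extract_probabilities(response, project, model):
    """Map each revision ID to its probability dict, or None if no score is present."""
    result = {}
    for rev_id, models in response[project]['scores'].items():
        score = models.get(model, {}).get('score')
        result[rev_id] = score['probability'] if score else None
    return result

# Hypothetical response with made-up values, shaped like the lookups used in this notebook
sample_response = {
    'enwiki': {
        'scores': {
            '991379667': {'damaging': {'score': {'prediction': False,
                                                 'probability': {'false': 0.9, 'true': 0.1}}}},
            # Assumed shape of an entry for a revision that could not be scored
            '1': {'damaging': {'error': {'message': 'hypothetical error entry'}}},
        }
    }
}
probabilities = extract_probabilities(sample_response, 'enwiki', 'damaging')
print(probabilities)
```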

***

#### Credits

We release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).