# Data import
## Question 0 - Get common Wikidata occupations

> Write a SPARQL query that retrieves the 100 most common occupations on Wikidata (Wikidata property P106).

You may use the interface https://query.wikidata.org/ to try different queries. Here are some example SPARQL queries: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
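
One possible shape for such a query, offered as a sketch rather than the reference answer: group the `wdt:P106` statements by occupation, count them, and keep the largest groups. The helper `top_occupations_query` and the variables `?person` and `?count` are illustrative assumptions; only `?o` is read by the assertion cell. Counts on Wikidata change over time, and a full aggregation over P106 may be slow on the public endpoint, so the result is not guaranteed to match the hard-coded list.

```python
def top_occupations_query(limit=100):
    """Build a SPARQL query ranking occupations (P106 values) by usage count."""
    return """
SELECT ?o (COUNT(?person) AS ?count)
WHERE {
  ?person wdt:P106 ?o .
}
GROUP BY ?o
ORDER BY DESC(?count)
LIMIT %d
""" % limit

# Name chosen to avoid clashing with the notebook's own `query` variable.
sketch_query = top_occupations_query()
print(sketch_query)
```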

In [6]:
query = """
SELECT ?o [TO COMPLETE HERE...]
"""

The following assertion should pass if your answer is correct.

In [7]:
import requests

occupations = ['Q82955', 'Q937857', 'Q36180', 'Q33999', 'Q1650915', 'Q1028181', 'Q1930187', 'Q177220', 'Q1622272', 'Q49757', 'Q36834', 'Q40348', 'Q47064', 'Q639669', 'Q10800557', 'Q201788', 'Q2526255', 'Q43845', 'Q28389', 'Q42973', 'Q10871364', 'Q39631', 'Q193391', 'Q482980', 'Q483501', 'Q11513337', 'Q3665646', 'Q12299841', 'Q19204627', 'Q16533', 'Q81096', 'Q11774891', 'Q188094', 'Q1281618', 'Q333634', 'Q189290', 'Q250867', 'Q33231', 'Q2259451', 'Q42603', 'Q628099', 'Q37226', 'Q2309784', 'Q901', 'Q2066131', 'Q6625963', 'Q10798782', 'Q2374149', 'Q170790', 'Q4610556', 'Q185351', 'Q486748', 'Q3055126', 'Q753110', 'Q4964182', 'Q169470', 'Q158852', 'Q1234713', 'Q14089670', 'Q10873124', 'Q3282637', 'Q593644', 'Q947873', 'Q13414980', 'Q131524', 'Q11338576', 'Q15117302', 'Q488205', 'Q14467526', 'Q183945', 'Q10843402', 'Q13382576', 'Q13141064', 'Q214917', 'Q855091', 'Q644687', 'Q19595175', 'Q121594', 'Q2865819', 'Q16010345', 'Q1231865', 'Q2405480', 'Q350979', 'Q3400985', 'Q13365117', 'Q10833314', 'Q3621491', 'Q15981151', 'Q212980', 'Q16145150', 'Q1792450', 'Q15296811', 'Q15627169', 'Q2306091', 'Q4263842', 'Q806798', 'Q5716684', 'Q2516866', 'Q3387717', 'Q131512']

def evalSparql(query):
    return requests.post('https://query.wikidata.org/sparql', data=query, headers={
        'content-type': 'application/sparql-query',
        'accept': 'application/json',
        'user-agent': 'User:Tpt'
    }).json()['results']['bindings']

myOccupations = [val['o']['value'].replace('http://www.wikidata.org/entity/', '') 
                 for val in evalSparql(query)]
assert(frozenset(occupations) == frozenset(myOccupations))

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

## Occupation labels

We load the English label of each occupation from Wikidata.

In [19]:
occupations_label = {}

query = """
SELECT DISTINCT ?o ?oLabel 
WHERE { 
    VALUES ?o { %s } 
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}"""% ' '.join('wd:' + o for o in occupations)

for result in evalSparql(query):
    occupations_label[result['o']['value'].replace('http://www.wikidata.org/entity/', '')] = result['oLabel']['value']

print(occupations_label)

{'Q42973': 'architect', 'Q188094': 'economist', 'Q201788': 'historian', 'Q14467526': 'linguist', 'Q1281618': 'sculptor', 'Q177220': 'singer', 'Q1650915': 'researcher', 'Q628099': 'association football manager', 'Q753110': 'songwriter', 'Q40348': 'lawyer', 'Q12299841': 'cricketer', 'Q1231865': 'pedagogue', 'Q488205': 'singer-songwriter', 'Q158852': 'conductor', 'Q121594': 'professor', 'Q1792450': 'art historian', 'Q183945': 'record producer', 'Q482980': 'author', 'Q1930187': 'journalist', 'Q19595175': 'amateur wrestler', 'Q16533': 'judge', 'Q855091': 'guitarist', 'Q483501': 'artist', 'Q47064': 'military personnel', 'Q131524': 'entrepreneur', 'Q214917': 'playwright', 'Q37226': 'teacher', 'Q28389': 'screenwriter', 'Q639669': 'musician', 'Q1028181': 'painter', 'Q644687': 'illustrator', 'Q39631': 'physician', 'Q11513337': 'athletics competitor', 'Q1234713': 'theologian', 'Q13141064': 'badminton player', 'Q13382576': 'rower', 'Q43845': 'businessperson', 'Q2374149': 'botanist', 'Q3665646': 'b

We load *all* the English labels of the occupations from Wikidata: the main label plus its aliases (`skos:altLabel`).

In [20]:
occupations_labels = {k: [v] for k, v in occupations_label.items()}

query = """
SELECT ?o ?altLabel 
WHERE {
  VALUES ?o { %s }
  ?o skos:altLabel ?altLabel . FILTER (lang(?altLabel) = "en")
}""" % ' '.join('wd:' + o for o in occupations) 

for result in evalSparql(query):
    occupations_labels[result['o']['value'].replace('http://www.wikidata.org/entity/', '')].append(result['altLabel']['value'])

print(occupations_labels)

{'Q188094': ['economist'], 'Q201788': ['historian', 'historiographer', 'historians'], 'Q14467526': ['linguist', 'linguistic scholar'], 'Q1281618': ['sculptor'], 'Q177220': ['singer', 'vocalist'], 'Q1650915': ['researcher'], 'Q628099': ['association football manager', 'football manager', 'soccer coach', 'association football coach', 'football coach', 'soccer manager'], 'Q753110': ['songwriter', 'song writer'], 'Q40348': ['lawyer', 'attorney', 'Jurisprudente'], 'Q12299841': ['cricketer', 'cricket player'], 'Q1231865': ['pedagogue', 'educationalist'], 'Q488205': ['singer-songwriter', 'singer songwriter', 'singer/songwriter', 'singersongwriter'], 'Q158852': ['conductor', 'Conducting'], 'Q121594': ['professor', 'Prof.'], 'Q1792450': ['art historian'], 'Q183945': ['record producer', 'music producer'], 'Q482980': ['author'], 'Q1930187': ['journalist', 'journo'], 'Q19595175': ['amateur wrestler', 'wrestler'], 'Q16533': ['judge', 'magistrate', 'justice', 'judges', 'justices'], 'Q855091': ['guit

## Wikipedia articles

Here we load the training and the testing sets. To save memory, each set is wrapped in an iterable that re-reads the file every time we iterate over the training or the testing examples, instead of holding all articles in a list.

In [26]:
import gzip
import json

def loadJson(filename):
    with gzip.open(filename, 'rt') as fp:
        for line in fp:
            yield json.loads(line)

class MakeIter(object):
    def __init__(self, generator_func, **kwargs):
        self.generator_func = generator_func
        self.kwargs = kwargs
    def __iter__(self):
        return self.generator_func(**self.kwargs)

training_set = MakeIter(loadJson, filename='wiki-train.json.gz')
testing_set = MakeIter(loadJson, filename='wiki-test.json.gz')
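
The reason for the wrapper class can be seen on a small file: a plain generator is exhausted after one pass, while an object whose `__iter__` calls the generator function again can be looped over repeatedly. The sketch below is self-contained; the names `load_json_lines` and `ReIterable` and the two-line sample file are hypothetical stand-ins for `loadJson`, `MakeIter` and the real data files.

```python
import gzip
import json
import os
import tempfile

def load_json_lines(filename):
    """Yield one parsed JSON object per line of a gzip-compressed file."""
    with gzip.open(filename, 'rt') as fp:
        for line in fp:
            yield json.loads(line)

class ReIterable(object):
    """Call generator_func afresh on every iteration, so the file is re-read."""
    def __init__(self, generator_func, **kwargs):
        self.generator_func = generator_func
        self.kwargs = kwargs
    def __iter__(self):
        return self.generator_func(**self.kwargs)

# Hypothetical two-line sample file, written only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(path, 'wt') as fp:
    fp.write(json.dumps({'title': 'A', 'summary': 'A singer.'}) + '\n')
    fp.write(json.dumps({'title': 'B', 'summary': 'A painter.'}) + '\n')

plain_generator = load_json_lines(path)
first_pass = [ex['title'] for ex in plain_generator]
second_pass = [ex['title'] for ex in plain_generator]  # generator already exhausted

wrapped = ReIterable(load_json_lines, filename=path)
wrapped_first = [ex['title'] for ex in wrapped]
wrapped_second = [ex['title'] for ex in wrapped]       # file is read again

print(first_pass, second_pass, wrapped_first, wrapped_second)
```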

# Extract occupations from summaries

## Task 1 - Dictionary extraction

> Using the ```occupations_labels``` dictionary, identify all occupations for each article. Complete the function below and evaluate the accuracy of this approach. It will serve as a baseline.

In [27]:
def predict_dictionnary(example, occupations_labels):
    ## example['summary'] contains the summary of the article
    ## Code here
    return None
    
def evaluate_dictionnary(training_set, occupations_labels):
    nexample = 0
    accuracy = 0.
    prediction = None
    for example in training_set:
        prediction = predict_dictionnary(example, occupations_labels)
        p = frozenset(prediction)
        g = frozenset(example['occupations'])
        accuracy += 1.*len(p & g) / len(p | g)
        nexample += 1
    return accuracy / nexample

evaluate_dictionnary(training_set, occupations_labels)

ZeroDivisionError: float division by zero
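
A minimal sketch of one way to approach the baseline, not the expected solution: match every label as a whole word in the lower-cased summary, and score with the same intersection-over-union measure as the evaluation loop. The names `predict_from_labels` and `jaccard`, the toy data, and the convention of returning 1.0 when both sets are empty are assumptions of this sketch; the last one is only one way to avoid a division by zero when the union is empty.

```python
import re

def predict_from_labels(summary, labels_by_occupation):
    """Return the set of occupation ids having at least one label in the summary."""
    text = summary.lower()
    found = set()
    for occupation, labels in labels_by_occupation.items():
        for label in labels:
            # \b keeps 'author' from matching inside 'authority'.
            if re.search(r'\b' + re.escape(label.lower()) + r'\b', text):
                found.add(occupation)
                break
    return found

def jaccard(predicted, gold):
    """Intersection over union; 1.0 when both are empty (assumed convention)."""
    p, g = frozenset(predicted), frozenset(gold)
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)

# Toy data for illustration only.
toy_labels = {'Q177220': ['singer', 'vocalist'], 'Q482980': ['author']}
toy_summary = 'An American vocalist and a local authority on blues.'
prediction = predict_from_labels(toy_summary, toy_labels)
score = jaccard(prediction, ['Q177220', 'Q639669'])
print(prediction, score)
```

Whole-word matching trades recall for precision: plural or inflected forms are found only if they appear among the alternative labels.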

## Task 2 - Simple neural network

We load each article's "summary" and take the average of its word vectors.
This is done with spaCy loaded with the fastText vectors.
Installation and loading take 8-10 minutes and download about 1.2 GB:
```
pip3 install spacy
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cc.en.300.vec.gz
python3 -m spacy init-model en /tmp/en_vectors_wiki_lg --vectors-loc cc.en.300.vec.gz
rm cc.en.300.vec.gz
```
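
Conceptually, the document vector used in this task is the element-wise mean of the vectors of the tokens in the summary. The sketch below illustrates this with made-up 3-dimensional vectors instead of the 300-dimensional fastText ones. `average_vector` and `toy_vectors` are hypothetical names, tokenization is reduced to whitespace splitting, and unknown tokens are counted as zero vectors, which is an assumption of this sketch.

```python
def average_vector(tokens, vectors, dim):
    """Element-wise mean over all tokens; unknown tokens count as zero vectors."""
    zero = [0.0] * dim
    if not tokens:
        return zero
    rows = [vectors.get(token, zero) for token in tokens]
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

# Made-up vectors for illustration only.
toy_vectors = {
    'american': [1.0, 0.0, 0.0],
    'singer':   [0.0, 1.0, 0.0],
    'actor':    [0.0, 0.0, 1.0],
}
doc_vector = average_vector('american singer and actor'.split(), toy_vectors, dim=3)
print(doc_vector)
```

Averaging discards word order, so two summaries containing the same words in a different order receive the same vector.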

In [9]:
import spacy
nlp = spacy.load('/tmp/en_vectors_wiki_lg')

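def vectorize(dataset, nlp):
    """Map each article title to its averaged summary vector (and its occupations, if known)."""
    result = {}
    for example in dataset:
        # Only the word vectors are needed, so skip the parser and the tagger.
        doc = nlp(example['summary'], disable=['parser', 'tagger'])
        entry = {'vector': doc.vector}
        if 'occupations' in example:
            entry['occupations'] = example['occupations']
        result[example['title']] = entry
    # Without this return, the caller would receive None.
    return result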

vectorized_training = vectorize(training_set, nlp)
vectorized_testing = vectorize(testing_set, nlp)
nlp = None

OSError: [E050] Can't find model '/tmp/en_vectors_wiki_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [28]:
print(vectorized_training['George_Washington']['vector'])

NameError: name 'vectorized_training' is not defined

In [None]:
# We encode the data

import numpy as np

inputs = np.array([vectorized_training[article]['vector'] for article in vectorized_training])
outputs = np.array([[(1 if occupation in vectorized_training[article]['occupations'] else 0)
                    for occupation in occupations ] for article in vectorized_training])
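
The `outputs` encoding above gives each article one row with a 1 for every occupation it has and a 0 elsewhere, in the fixed order of the `occupations` list. A small pure-Python illustration follows; `multi_hot` and `toy_order` are hypothetical names, and `Q999999` stands for any occupation outside the top-100 list.

```python
def multi_hot(article_occupations, occupation_order):
    """One 0/1 entry per known occupation, in the order of occupation_order."""
    wanted = set(article_occupations)
    return [1 if occupation in wanted else 0 for occupation in occupation_order]

toy_order = ['Q177220', 'Q33999', 'Q482980']   # singer, actor, author
row = multi_hot(['Q33999', 'Q177220', 'Q999999'], toy_order)
print(row)
```

Occupations that are not in the reference list are silently dropped, so an article whose occupations all fall outside the list gets an all-zero row.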

In [10]:
print(len(outputs[0]))

NameError: name 'outputs' is not defined

> Using Keras, define a sequential neural network with two layers. Use categorical_crossentropy as the loss function and softmax as the activation function of the output layer.

You can look into the documentation here: https://keras.io/getting-started/sequential-model-guide/

In [11]:
from tensorflow import keras
## Compile the model here

In [12]:
## Then train the model on ```inputs``` and ```outputs```

> Complete the function ```predict```: return the occupations whose corresponding neuron on the output layer of our model has a value > 0.1.

In [29]:
def predict(model, article_name, vectorized_dataset):
    prediction = None
    ## Code here
    return prediction

print(predict(model, 'Elvis_Presley', vectorized_training))
# should be {'Q177220'}

NameError: name 'model' is not defined
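
The thresholding step itself can be sketched independently of the model. In the notebook the scores would come from the trained model's output for the article vector. Here they are made-up numbers, and `occupations_above_threshold` and `toy_order` are hypothetical names.

```python
def occupations_above_threshold(scores, occupation_order, threshold=0.1):
    """Return the occupations whose score is strictly greater than threshold."""
    return {occupation
            for occupation, score in zip(occupation_order, scores)
            if score > threshold}

toy_order = ['Q177220', 'Q33999', 'Q482980']
toy_scores = [0.85, 0.10, 0.05]   # made-up output-layer values for one article
selected = occupations_above_threshold(toy_scores, toy_order)
print(selected)
```

Because softmax outputs sum to 1, a threshold of 0.1 can select at most nine occupations for a single article.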

In [None]:
def evaluate_nn(vectorized_training, model):
    nexample = 0
    accuracy = 0.
    prediction = None
    for article_name in vectorized_training:
        prediction = predict(model, article_name, vectorized_training)
        p = frozenset(prediction)
        g = frozenset(vectorized_training[article_name]['occupations'])
        accuracy += 1.*len(p & g) / len(p | g)
        nexample += 1
    return accuracy / nexample

## Task 3 - Your approach

> Propose your own approach (extend the previous examples or use original approaches) to improve the accuracy for this task. Apply it to the testing set and include the result as a JSON file with your submission.

***IMPORTANT*** Output format of the requested file 'results.json.gz': each line must be a JSON string representing a dictionary:
> ```{ "title": THE_ARTICLE_NAME, "prediction": [THE_LIST_OF_OCCUPATIONS]}```

In [14]:
# For example, if testset_solutions is a dictionary mapping article_name (key) -> prediction_list (value), use this function:
def export(testset_solutions):
    with gzip.open('results.json.gz', 'wt') as output:
        for article in testset_solutions:
            output.write(json.dumps({'title':article, 'prediction':testset_solutions[article]}) + "\n")
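
A round-trip check can catch format problems before submission. The sketch below writes to a temporary path instead of `results.json.gz`. `export_predictions` and `read_predictions` are hypothetical helpers. Predictions are converted with `sorted(...)` because Python sets are not JSON serializable, which matters if `predict` returns a set.

```python
import gzip
import json
import os
import tempfile

def export_predictions(solutions, path):
    """Write one JSON object per line: {"title": ..., "prediction": [...]}."""
    with gzip.open(path, 'wt') as output:
        for title, prediction in solutions.items():
            record = {'title': title, 'prediction': sorted(prediction)}
            output.write(json.dumps(record) + '\n')

def read_predictions(path):
    """Read the file back into a dict: title -> list of occupation ids."""
    with gzip.open(path, 'rt') as fp:
        return {rec['title']: rec['prediction'] for rec in map(json.loads, fp)}

# Toy predictions for illustration only.
toy_solutions = {'Elvis_Presley': {'Q177220'}, 'George_Washington': set()}
path = os.path.join(tempfile.mkdtemp(), 'results.json.gz')
export_predictions(toy_solutions, path)
round_trip = read_predictions(path)
print(round_trip)
```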