<a href="https://www.kaggle.com/code/angevalli/extracts-type-facts-from-wikipedia?scriptVersionId=133920084" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a><a target="_blank" href="https://drive.google.com/drive/folders/1Zx90bl8OwIEBK2jCFzyNV4FxvuJ9efSl?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><a target="_blank" href="https://drive.google.com/drive/folders/1KgANSekj2YwkT2fpy-89AnmnCDF870UC?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Extracts type facts from a Wikipedia file


=== Purpose ===

The goal of this lab is to extract the class to which an entity belongs from Wikipedia.
For example, given the Wikipedia article about Leicester:

    Leicester is a small city in England

the goal is to extract:

    Leicester TAB city


=== Task ===

Complete the extract_type() function so that it extracts the type of the article entity from the content.
For example, for a content of "Leicester is a beautiful English city in the UK", it should return "city".
Exclude terms that are too abstract ("member of...", "way of..."), and try to extract exactly the noun(s).
You can also skip articles (e.g. return None) if you are not sure or if the text does not contain any type.
In order to ensure a fair evaluation, do not use any non-standard Python libraries except NLTK.

Input:
April
April is the fourth month of the year with 30 days.

Output:
April TAB month
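
The expected output is one tab-separated line per article. As a minimal sketch of that format (the helper names `format_fact` and `parse_fact` are hypothetical and not part of the lab's utility script):

```python
def format_fact(title, typ):
    """Return one 'title TAB type' line, or None when no type was extracted."""
    if not typ:
        return None
    return f"{title}\t{typ}"


def parse_fact(line):
    """Split a 'title TAB type' line back into its two fields."""
    title, typ = line.rstrip("\n").split("\t", 1)
    return title, typ


print(format_fact("April", "month"))
print(parse_fact("Leicester\tcity\n"))
```

Returning None for a missing type mirrors the rule that an article may be skipped when no type can be found.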

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/wikidata/wikipedia-first.txt
/kaggle/input/type-facts-in-wikipedia/gold-standard-sample.tsv


In [2]:
from utility_script_entity_disambiguation_and_typing import Page, Parsy, cal_acc

In [3]:
# a simplified wiki page document
wiki_file = "/kaggle/input/wikidata/wikipedia-first.txt"
# some gold samples for validation
gold_file = "/kaggle/input/type-facts-in-wikipedia/gold-standard-sample.tsv"
# predicted results generated by your model
# you are supposed to submit this file
result_file = "/kaggle/working/results.tsv"

In [4]:
# Import the regular-expression module and NLTK, and download the POS tagger model
import re
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
def extract_type(wiki_page):
    '''
    :param wiki_page: an object that contains a title and the first sentence of its wiki page.
    :return: the extracted type as a string, or None if no type is found
    '''
    title = wiki_page.title
    content = wiki_page.content

    # POS-tag the whitespace-separated tokens (the dict keeps a single tag per distinct token)
    splitted_content = content.split()
    POS_tagging = dict(nltk.pos_tag(splitted_content))
    
    # We do two kinds of matching: verbal sentences first, then non-verbal sentences as a fallback.
    # Each pattern is compiled with re. Also handling non-verbal sentences raises the accuracy from 0.6555... to 0.6888...
    # First: verbal sentences
    compiled = re.compile('( has| have| is| are| was| were| means| refers to)'
                          '( it| its| a| an| the)?'
                          '( best| most| very| some| any| less)?'
                          '( new| former| huge| important| ancient| famous| certain| common| large| largest| tall| tallest| big| biggest| short| shortest| small| smallest)?'
                          '( name given to)?'
                          '( type of| kind of| version of| list of| class of| piece of| part of( a| the)?| sort of| set of)?'
                          '( first| second| third| 1st| 2nd| 3rd| ([a-z]|[1-9])*th)?( [A-Z][a-z]*)?( [a-z]*-[a-z]*)? ([a-z]+)')
    
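    # The pattern has 12 capturing groups: verbs, determiners, adverbs, adjectives, 'name given to' (frequent in the given corpus), referring expressions ('type of', 'kind of', ...), ordinals, proper nouns (starting with a capital letter), hyphenated compounds and the final lowercase word. Two of the 12 are nested groups, inside the 'part of' and ordinal alternatives.
    
    # Record the index of the last group (the candidate word) and run the match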
    number_of_groups = 12
    matching = compiled.search(content)

    # For a non-verbal sentence there is no match, so we retry with the same pattern minus the verb group
    if matching is None :
      compiled = re.compile('( it| its| a| an| the)'
                            '( best| most| very| some| any| less)?'
                            '( new| former| huge| important| ancient| famous| certain| common| large| largest| tall| tallest| big| biggest| short| shortest| small| smallest)?'
                            '( name given to)?'
                            '( type of| kind of| version of| list of| class of| piece of| part of( a| the)?| sort of| set of)?'
                            '( first| second| third| 1st| 2nd| 3rd| ([a-z]|[1-9])*th)?( [A-Z][a-z]*)?( [a-z]*-[a-z]*)? ([a-z]+)')
      number_of_groups = 11
      matching = compiled.search(content)
    
    # If a match was found, we refine it with the POS tags. Starting from the word captured by the last group, we look for nouns ('NN', or 'NNS' for plural nouns), as they are the most frequent types in the corpus.
    if matching is not None :
      declare_groups = matching.group(number_of_groups)
      if declare_groups in splitted_content :
        index = splitted_content.index(declare_groups)
        for ind in range(index+1,len(POS_tagging)) :
          if POS_tagging[declare_groups] in ['NN','NNS'] :
            break
          declare_groups = splitted_content[ind]
        ##############
        index_1 = index + 1
        if index_1 < len(splitted_content) :
          if (POS_tagging[declare_groups], POS_tagging[splitted_content[index_1]]) == ('NN',)*2 :
            declare_groups = splitted_content[index_1]
      return declare_groups
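
To see what the regular expression alone captures, before any POS-based refinement, the sketch below copies the verbal pattern into a standalone snippet and reads the last capturing group. The helper name `first_candidate` is hypothetical, and the snippet uses only the standard `re` module.

```python
import re

# Same verbal pattern as in extract_type(), copied so the snippet runs on its own.
VERBAL = re.compile(
    '( has| have| is| are| was| were| means| refers to)'
    '( it| its| a| an| the)?'
    '( best| most| very| some| any| less)?'
    '( new| former| huge| important| ancient| famous| certain| common| large'
    '| largest| tall| tallest| big| biggest| short| shortest| small| smallest)?'
    '( name given to)?'
    '( type of| kind of| version of| list of| class of| piece of'
    '| part of( a| the)?| sort of| set of)?'
    '( first| second| third| 1st| 2nd| 3rd| ([a-z]|[1-9])*th)?'
    '( [A-Z][a-z]*)?( [a-z]*-[a-z]*)? ([a-z]+)'
)


def first_candidate(sentence):
    """Return the word captured by the last group, or None if nothing matches."""
    match = VERBAL.search(sentence)
    if match is None:
        return None
    # Pattern.groups is the number of capturing groups, i.e. the index of the last one.
    return match.group(VERBAL.groups)


print(VERBAL.groups)
print(first_candidate("April is the fourth month of the year with 30 days."))
print(first_candidate("Leicester is a small city in England"))
print(first_candidate("Leicester is a beautiful English city in the UK"))
```

The last sentence yields "beautiful", since that adjective is not in the pattern's word lists. That is why the function then walks forward through the POS tags looking for a noun. Reading the index from `VERBAL.groups` is one possible way to avoid hard-coding 12 and 11 for the two patterns.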

In [6]:
def run():
    '''
    First, extract types from each sentence in the wiki file
    Next, use gold samples to evaluate your model
    :return:
    '''
    with open(result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wiki_file):
            typ = extract_type(page)
            if typ:
                output.write(page.title + "\t" + typ + "\n")

    # Evaluate on some gold samples for checking your model
    cal_acc(gold_file, result_file)


run()

the extracted type for Army is not correct!
the extracted type for Seam ripper is not correct!
the extracted type for Ziggy Stardust is not correct!
the extracted type for Hillary Rodham Clinton is not correct!
the extracted type for Edip Yuksel is not correct!
the extracted type for AC/DC is not correct!
the extracted type for Disney Channel is not correct!
the extracted type for North and South is not correct!
the extracted type for Loreto Region is not correct!
the extracted type for Tropical cyclone is not correct!
the extracted type for Pinocchio is not correct!
the extracted type for Mollusc is not correct!
the extracted type for Isthmus is not correct!
the extracted type for John the Apostle is not correct!
the extracted type for Medusa (animal) is not correct!
the extracted type for Crown-of-thorns starfish is not correct!
the extracted type for Vaquita is not correct!
the extracted type for Nu metal is not correct!
the extracted type for Moat is not correct!
the extracted type

# Using spaCy

In [7]:
# import spacy
import spacy
from spacy import displacy
# An example about how to use spacy
################# EXECUTE THIS LOCALLY
def dependency_parser():
    # load model
    nlp = spacy.load('en_core_web_sm')

    # execute dependency parse
    doc = nlp("A city is a place where many people live together.")

    # print out
    for token in doc:
        print(
            '{0}({1}) <-- {2} -- {3}({4})'.format(token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

    # display the tree
    # http://localhost:5000/
    displacy.serve(doc, style='dep')
###############



In [8]:
# Code goes here

en_model = spacy.load('en_core_web_sm') # Loading the model
forbidden_words = ['class', 'classes',
                   'list', 'lists',
                   'word', 'words',
                   'set', 'sets',
                   'name', 'names',
                   'version', 'versions',
                   'term','terms',
                   'kind', 'kinds',
                   'type', 'types',
                   'part', 'parts'] # Abstract words that are never returned as a type

def extract_type(wiki_page):
    '''

    :param wiki_page: an object that contains a title and the first sentence of its wiki page.
    :return: the extracted type as a string, or None if no type is found
    '''
    title = wiki_page.title
    content = wiki_page.content
    model_content = en_model(content) # Run the spaCy pipeline on the content
    list_of_tokens = [] # Initialize the list of tokens we keep

    for token in model_content :
      list_of_tokens.append(token.text)
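      if token.text not in forbidden_words :
        # Keep the first token that is either a non-subject noun (or an xcomp verb)
        # attached to a finite verb, or a noun that is the object of a preposition
        if (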
            (token.head.tag_ in ['VBP', 'VBZ', 'VBD'] and ((token.tag_ in ['NNS','NNP','NN'] and token.dep_ != 'nsubj') or
            (token.tag_ == 'VB' and token.dep_ == 'xcomp'))) or
            (token.head.tag_ == 'IN' and token.tag_ in ['NN', 'NNP', 'NNS'] and token.dep_ == 'pobj')
        ):
          return token.text
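
The selection rule above can be tried out without loading spaCy. The sketch below restates it as a predicate over hand-annotated tokens. The `Tok` tuple, the helper names and the tag and dependency labels are all assumptions written by hand for illustration. They are not output from `en_core_web_sm`, whose actual parse may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: text, fine-grained tag,
# dependency label, and the fine-grained tag of the token's head.
Tok = namedtuple("Tok", "text tag dep head_tag")

FORBIDDEN = {"kind", "kinds", "type", "types", "part", "parts"}  # shortened list
NOUN_TAGS = {"NN", "NNS", "NNP"}
FINITE_VERB_TAGS = {"VBP", "VBZ", "VBD"}


def is_type_candidate(tok):
    """Same condition as in extract_type(), split into its two branches."""
    under_verb = tok.head_tag in FINITE_VERB_TAGS and (
        (tok.tag in NOUN_TAGS and tok.dep != "nsubj")
        or (tok.tag == "VB" and tok.dep == "xcomp")
    )
    under_prep = tok.head_tag == "IN" and tok.tag in NOUN_TAGS and tok.dep == "pobj"
    return under_verb or under_prep


def first_type(tokens):
    """Return the first non-forbidden token that satisfies the rule, else None."""
    for tok in tokens:
        if tok.text not in FORBIDDEN and is_type_candidate(tok):
            return tok.text
    return None


# Hand-written annotations (assumed, not produced by running spaCy).
april = [
    Tok("April", "NNP", "nsubj", "VBZ"),
    Tok("is", "VBZ", "ROOT", "VBZ"),
    Tok("the", "DT", "det", "NN"),
    Tok("fourth", "JJ", "amod", "NN"),
    Tok("month", "NN", "attr", "VBZ"),
]
kind_of_city = [
    Tok("Leicester", "NNP", "nsubj", "VBZ"),
    Tok("is", "VBZ", "ROOT", "VBZ"),
    Tok("a", "DT", "det", "NN"),
    Tok("kind", "NN", "attr", "VBZ"),
    Tok("of", "IN", "prep", "NN"),
    Tok("city", "NN", "pobj", "IN"),
]

print(first_type(april))
print(first_type(kind_of_city))
```

In the second example "kind" is skipped as a forbidden word, and the preposition branch then picks up "city". This is a plausible reading of why the rule includes the `pobj` case.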

In [9]:
def run():
    '''
    First, extract types from each sentence in the wiki file
    Next, use gold samples to evaluate your model
    :return:
    '''
    with open(result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wiki_file):
            typ = extract_type(page)
            if typ:
                output.write(page.title + "\t" + typ + "\n")

    # Evaluate on some gold samples for checking your model
    cal_acc(gold_file, result_file)


run()

the extracted type for President of Russia is not correct!
the extracted type for Edip Yuksel is not correct!
the extracted type for São Tomé and Príncipe is not correct!
the extracted type for Tropical cyclone is not correct!
the extracted type for Isthmus is not correct!
the extracted type for Medusa (animal) is not correct!
the extracted type for Crown-of-thorns starfish is not correct!
the extracted type for Great Fire of London is not correct!
the extracted type for Nu metal is not correct!
the extracted type for Moat is not correct!
the extracted type for Alex Trebek is not correct!
the extracted type for Cerulean is not correct!
the extracted type for Communication Studies is not correct!
the extracted type for 2029 is not correct!
the extracted type for Petros Duryan is not correct!
the extracted type for Hadith is not correct!
the extracted type for Magnetosphere is not correct!
the extracted type for Phoenicia is not correct!
the accuracy of your results is 0.8
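
The implementation of `cal_acc` lives in the utility script and is not shown here. As a rough sketch only, assuming it reports the fraction of gold titles whose predicted type equals the gold type and counts a missing prediction as wrong, the computation could look like this (the function name `accuracy` and the toy dictionaries are hypothetical):

```python
def accuracy(gold, predicted):
    """Fraction of gold titles whose predicted type equals the gold type.

    A title that has no prediction counts as incorrect (an assumption).
    """
    if not gold:
        return 0.0
    correct = sum(1 for title, typ in gold.items() if predicted.get(title) == typ)
    return correct / len(gold)


# Toy data for illustration only, not taken from the gold-standard file.
gold = {"April": "month", "Leicester": "city"}
predicted = {"April": "month", "Leicester": "England"}

print(accuracy(gold, predicted))
```

If `cal_acc` does work this way, the 18 errors listed above together with an accuracy of 0.8 would be consistent with a gold sample of 90 entries, since 1 - 18/90 = 0.8.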
