Extracts type facts from a Wikipedia file


=== Purpose ===

The goal of this lab is to extract the class to which an entity belongs from Wikipedia.
For example, given the Wikipedia article about Leicester:

    Leicester is a small city in England

the goal is to extract:

    Leicester TAB city


=== Task ===

Complete the extract_type() function so that it extracts the type of the article entity from the content.
For example, given the content "Leicester is a beautiful English city in the UK", it should return "city".
Exclude terms that are too abstract ("member of...", "way of..."), and try to extract exactly the noun(s).
You can also skip articles (e.g. return None) if you are not sure or if the text does not contain any type.
In order to ensure a fair evaluation, do not use any non-standard Python libraries except NLTK.

Input:
April
April is the fourth month of the year with 30 days.

Output:
April TAB month
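
The output format above can be sketched in a few lines. This is only an illustration: the helper names `format_type_fact` and `parse_type_fact` are hypothetical and not part of the lab's utility script. It assumes one fact per line, with the title and the type separated by a single tab character, and it skips articles for which no type was found.

```python
def format_type_fact(title, typ):
    """Return one 'title<TAB>type' line, or None when no type was found.

    Hypothetical helper, written only to illustrate the expected format.
    """
    if not typ:
        return None
    return f"{title}\t{typ}\n"


def parse_type_fact(line):
    """Split a 'title<TAB>type' line back into a (title, type) pair."""
    title, typ = line.rstrip("\n").split("\t")
    return title, typ


line = format_type_fact("April", "month")
print(repr(line))                            # 'April\tmonth\n'
print(parse_type_fact(line))                 # ('April', 'month')
print(format_type_fact("Leicester", None))   # None, so the article is skipped
```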

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/wikidata/wikipedia-first.txt
/kaggle/input/type-facts-in-wikipedia/gold-standard-sample.tsv


In [2]:
from utility_script_entity_disambiguation_and_typing import Page, Parsy, cal_acc

In [3]:
# a simplified wiki page document
wiki_file = "/kaggle/input/wikidata/wikipedia-first.txt"
# some gold samples for validation
gold_file = "/kaggle/input/type-facts-in-wikipedia/gold-standard-sample.tsv"
# predicted results generated by your model
# this is the file you are supposed to submit
result_file = "/kaggle/working/results.tsv"

In [4]:
# Import the regex module and NLTK, then download the POS tagger model
import re
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True
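
The tagger downloaded above is used through `nltk.pos_tag`, which returns a list of (token, tag) pairs. The sketch below does not call NLTK: the pairs are written by hand in that shape, and the tags are assumed for illustration rather than produced by the tagger. It shows how common nouns ('NN', 'NNS') can be picked out, and what happens when such pairs are turned into a dict.

```python
# Hand-written (token, tag) pairs in the shape returned by nltk.pos_tag.
# The tags are assumptions for illustration, not real tagger output.
tagged = [
    ("April", "NNP"),
    ("is", "VBZ"),
    ("the", "DT"),
    ("fourth", "JJ"),
    ("month", "NN"),
    ("of", "IN"),
    ("the", "DT"),
    ("year", "NN"),
]

NOUN_TAGS = {"NN", "NNS"}  # singular and plural common nouns


def common_nouns(pairs):
    """Keep only the tokens tagged as common nouns, in sentence order."""
    return [token for token, tag in pairs if tag in NOUN_TAGS]


print(common_nouns(tagged))  # ['month', 'year']

# A dict keeps one entry per distinct token, so the repeated "the"
# collapses and the dict is shorter than the token list.
tag_by_word = dict(tagged)
print(len(tagged), len(tag_by_word))  # 8 7
```

Because a dict keeps only one tag per distinct word, a word that appears twice with different tags would keep only the last one.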

In [5]:
def extract_type(wiki_page):
    '''
    :param wiki_page: an object holding the article title and the first sentence of its wiki page.
    :return: the extracted type as a string, or None when neither pattern matches.
    '''
    title = wiki_page.title
    content = wiki_page.content

    # POS-tag the whitespace-separated tokens; the dict maps each distinct token to its tag
    splitted_content = content.split()
    POS_tagging = dict(nltk.pos_tag(splitted_content))
    
    # We try two patterns, compiled with re: one for sentences with a verb and one for sentences without.
    # Adding the verbless pattern raises the accuracy from 0.6555... to 0.6888...
    # First: sentences with a verb
    compiled = re.compile('( has| have| is| are| was| were| means| refers to)'
                          '( it| its| a| an| the)?'
                          '( best| most| very| some| any| less)?'
                          '( new| former| huge| important| ancient| famous| certain| common| large| largest| tall| tallest| big| biggest| short| shortest| small| smallest)?'
                          '( name given to)?'
                          '( type of| kind of| version of| list of| class of| piece of| part of( a| the)?| sort of| set of)?'
                          '( first| second| third| 1st| 2nd| 3rd| ([a-z]|[1-9])*th)?( [A-Z][a-z]*)?( [a-z]*-[a-z]*)? ([a-z]+)')
    
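    # The 12 groups are: verbs, determiners, adverbs, adjectives, 'name given to' (frequent in the given corpus),
    # referring phrases ('type of', 'kind of', ...), ordinal numbers, proper nouns (starting with a capital letter),
    # hyphenated compound words and the final candidate word. The nested parentheses in 'part of( a| the)?' and in
    # the ordinal pattern account for the two remaining groups.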
    
    # We then define the number of groups and do the matching
    number_of_groups = 12
    matching = compiled.search(content)

    # If nothing matched, the sentence may have no verb, so we retry with the same pattern minus the verb group
    # (the determiner is then mandatory, and the candidate word becomes group 11)
    if matching is None :
      compiled = re.compile('( it| its| a| an| the)'
                            '( best| most| very| some| any| less)?'
                            '( new| former| huge| important| ancient| famous| certain| common| large| largest| tall| tallest| big| biggest| short| shortest| small| smallest)?'
                            '( name given to)?'
                            '( type of| kind of| version of| list of| class of| piece of| part of( a| the)?| sort of| set of)?'
                            '( first| second| third| 1st| 2nd| 3rd| ([a-z]|[1-9])*th)?( [A-Z][a-z]*)?( [a-z]*-[a-z]*)? ([a-z]+)')
      number_of_groups = 11
      matching = compiled.search(content)
    
    # If a match was found, we use the POS tags. We look for nouns ('NN', or 'NNS' for plural nouns), as they are
    # the most frequent types in the corpus: starting from the matched word, we walk forward until we reach a noun.
    if matching is not None :
      declare_groups = matching.group(number_of_groups)
      if declare_groups in splitted_content :
        index = splitted_content.index(declare_groups)
        for ind in range(index+1,len(POS_tagging)) :
          if POS_tagging[declare_groups] in ['NN','NNS'] :
            break
          declare_groups = splitted_content[ind]
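        # Compound nouns: if the candidate and the token right after the matched word are both tagged 'NN',
        # keep that second token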
        index_1 = index + 1
        if index_1 < len(splitted_content) :
          if (POS_tagging[declare_groups], POS_tagging[splitted_content[index_1]]) == ('NN',)*2 :
            declare_groups = splitted_content[index_1]
      return declare_groups
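
To make the group numbering used above easier to follow, here is a minimal sketch with a deliberately reduced pattern. It is not the pattern from `extract_type()`: it keeps only a few verbs, determiners and adjectives, and it skips the POS-tagging step. It shows that nested parentheses count as groups of their own and that the candidate word is always the last group.

```python
import re

# Reduced, illustrative pattern: verb, optional determiner, optional adjective,
# optional "type of"-style phrase (with a nested optional determiner),
# then the candidate word.
pattern = re.compile(
    r'( is| are| was| were)'
    r'( a| an| the)?'
    r'( small| large| famous)?'
    r'( type of| kind of| part of( a| the)?)?'
    r' ([a-z]+)'
)

# Five top-level groups plus one nested group give six groups in total.
print(pattern.groups)  # 6


def candidate(sentence):
    """Return the last group of the first match, or None if nothing matches."""
    match = pattern.search(sentence)
    if match is None:
        return None
    return match.group(pattern.groups)


print(candidate("Leicester is a small city in England"))             # city
print(candidate("A croissant is a kind of pastry"))                  # pastry
print(candidate("Leicester is a beautiful English city in the UK"))  # beautiful
```

The last call returns an adjective because "beautiful" is not in the adjective list, which is the kind of case where walking forward to the next noun with the POS tags becomes useful.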

In [6]:
def run():
    '''
    First, extract types from each sentence in the wiki file
    Next, use gold samples to evaluate your model
    :return:
    '''
    with open(result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wiki_file):
            typ = extract_type(page)
            if typ:
                output.write(page.title + "\t" + typ + "\n")

    # Evaluate on some gold samples for checking your model
    cal_acc(gold_file, result_file)


run()

the extracted type for Army is not correct!
the extracted type for Seam ripper is not correct!
the extracted type for Ziggy Stardust is not correct!
the extracted type for Hillary Rodham Clinton is not correct!
the extracted type for Edip Yuksel is not correct!
the extracted type for AC/DC is not correct!
the extracted type for Disney Channel is not correct!
the extracted type for North and South is not correct!
the extracted type for Loreto Region is not correct!
the extracted type for Tropical cyclone is not correct!
the extracted type for Pinocchio is not correct!
the extracted type for Mollusc is not correct!
the extracted type for Isthmus is not correct!
the extracted type for John the Apostle is not correct!
the extracted type for Medusa (animal) is not correct!
the extracted type for Crown-of-thorns starfish is not correct!
the extracted type for Vaquita is not correct!
the extracted type for Nu metal is not correct!
the extracted type for Moat is not correct!
the extracted type