### Task 3 - Text Mining
- Find any suitable textual data for processing that contains at least 500 sentences:
  - you can manually collect texts from BBC/CNN/New York Times, or
  - use the crawler from the first homework/tutorial and extend it to crawl a particular website and collect content for this task, or
  - use any other suitable texts (e.g. OpenData speech datasets)
- Perform the following NLP tasks:
  - POS tagging
  - NER with entity classification (using nltk.ne_chunk)
  - NER with custom patterns
    - e.g. every match of an optional adjective followed by a proper noun (singular/plural) is treated as an entity
    - see slides 31 or 38 from lecture 4 for some NLTK examples using RegexpParser or custom NER
- Implement your custom entity classification for each detected entity (using both nltk.ne_chunk and custom patterns)
  - Try to find a page for the entity on Wikipedia
  - Extract the first sentence from the summary
  - Detect the category from the sentence as a noun phrase
  - Example:
    - for the „Wikipedia“ entity the first sentence is „Wikipedia (/ˌwɪkᵻˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a free online encyclopedia that aims to allow anyone to edit articles.“
    - you can detect the pattern „… is/VBZ a/DT free/JJ online/NN encyclopedia/NN …“
    - the output can be „Wikipedia“: „free online encyclopedia“
  - For unknown entities assign a default category, e.g. „Thing“
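
The category-detection step from the example can be sketched as follows. This is a minimal illustration, not the implementation used later in the notebook: the grammar `CATEGORY_GRAMMAR` and the helper `detect_category` are hypothetical names, and the tokens are tagged by hand so that the sketch runs without any NLTK model downloads.

```python
import nltk

# Hand-tagged tokens for the example sentence from the task description,
# so the sketch runs without downloading any NLTK models.
tagged_sentence = [
    ("Wikipedia", "NNP"), ("is", "VBZ"), ("a", "DT"), ("free", "JJ"),
    ("online", "NN"), ("encyclopedia", "NN"), ("that", "WDT"),
    ("aims", "VBZ"), ("to", "TO"), ("allow", "VB"), ("anyone", "NN"),
]

# Hypothetical grammar: a verb, an optional determiner, adjectives, nouns.
CATEGORY_GRAMMAR = "CAT: {<VBZ|VBD|VBP><DT>?<JJ>*<NN|NNS>+}"


def detect_category(tagged_tokens, default="Thing"):
    """Return the first noun phrase after a verb, or the default category."""
    tree = nltk.RegexpParser(CATEGORY_GRAMMAR).parse(tagged_tokens)
    for subtree in tree.subtrees(lambda t: t.label() == "CAT"):
        # Keep adjectives and nouns only; this drops the verb and determiner.
        words = [w for w, tag in subtree.leaves() if tag.startswith(("JJ", "NN"))]
        return " ".join(words)
    return default


category = detect_category(tagged_sentence)
fallback = detect_category([("Astaldo", "NNP")])  # no verb, so no match
print(category, "|", fallback)
```

Filtering the chunk by tag instead of cutting off a fixed number of leading words keeps the result stable whether or not a determiner is present.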

In [1]:
### Imports
import nltk
import glob
import json

### Task execution
I chose The Silmarillion for this task as the text is long enough and contains many non-English terms and names.

#### Contents of **src** folder:
  - jupyter notebook
  - silmarillion.txt  (input text)

#### Contents of **res** folder: 
  - entities_binary.json (NER with nltk.ne_chunk(), binary=True)
  - entities_all.json (NER with nltk.ne_chunk(), binary=False)
  - entities_custom.json (NER with custom pattern)
  - named_entitites_wordbook.txt (Wikipedia categorization of NE with nltk entities (binary))
  - custom.txt (Wikipedia categorization of NE with custom entities)
  
    

## Loading text file (Silmarillion) into string

In [36]:
def read_book(name):
    """
    Loads text file into string
    """
    # the with block closes the file automatically
    with open(name, 'r') as file:
        return file.read()

## Loading the whole Silmarillion book by Tolkien (apart from foreword, appendix, pronunciation notes and name index)
text = read_book('./../src/silmarillion.txt')

text = text.replace('lluvatar', 'Iluvatar') # fixing possible transcription error
text = text.replace('Copyright', "")

print("Number of characters: ", len(text))

sentences = nltk.sent_tokenize(text)
print("Number of sentences: ", len(sentences))

print("\n",text[:250], "...")

Number of characters:  711475
Number of sentences:  4207

 AINU LIN DALE 


The Music of the Ainur 


There was Eru, the One, who in Arda is called Iluvatar; 
and he made first the Ainur, the Holy Ones, that were the 
offspring of his thought, and they were with him before 
aught else was made. And he spoke  ...


### Part-of-Speech Tagging

In [3]:
from nltk import pos_tag, word_tokenize, ne_chunk

## Creating list of tokens from string
tokenz =  word_tokenize(text)

## Tagging tokens with their corresponding part of speech categories
tagged = pos_tag(tokenz)

print("First couple of tagged tokens: ")
print(tagged[:20])

First couple of tagged tokens: 
[('AINU', 'NNP'), ('LIN', 'NNP'), ('DALE', 'NNP'), ('The', 'DT'), ('Music', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Ainur', 'NNP'), ('There', 'EX'), ('was', 'VBD'), ('Eru', 'NNP'), (',', ','), ('the', 'DT'), ('One', 'CD'), (',', ','), ('who', 'WP'), ('in', 'IN'), ('Arda', 'NNP'), ('is', 'VBZ'), ('called', 'VBN')]


### NER chunks with nltk.ne_chunk()
Classification (recognition) of Named Entities from POS-tagged tokens
 - binary = True : only recognizes whether a token is a NE or not
 - binary = False : tries to recognize the type of NE (e.g. person, location). This setting seems to split some multi-word entities and does not work very well. That is unsurprising, given that the mostly made-up words and names in any Tolkien book must be harder to classify.
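
To make the difference between the two settings concrete, the sketch below counts the labels in a chunked tree. The tree is built by hand as a stand-in for the output of `ne_chunk(tagged, binary=False)`, so the labels are illustrative rather than taken from the real run, and `label_counts` is a hypothetical helper. With `binary=True` every chunk would carry the single label `NE`.

```python
from collections import Counter

from nltk import Tree

# Hand-built stand-in for the output of ne_chunk(tagged, binary=False);
# the labels here are illustrative, not taken from the real run.
chunked = Tree("S", [
    Tree("PERSON", [("Finrod", "NNP")]),
    ("came", "VBD"), ("to", "TO"),
    Tree("GPE", [("Nargothrond", "NNP")]),
    ("with", "IN"),
    Tree("PERSON", [("Beren", "NNP")]),
])


def label_counts(tree):
    """Count how often each entity label occurs among the top-level chunks."""
    return Counter(node.label() for node in tree if isinstance(node, Tree))


counts = label_counts(chunked)
print(counts.most_common())
```

Running the same count on the real `ne_chunk` output would give a quick view of how the classifier distributes Tolkien's names over its entity types.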

In [4]:
def extractEntities(chunks):
    """
    Function from 2nd tutorial: from a chunked tree, extracts the subtrees
    recognized as named entities and returns { entity_text : label }
    """
    data = {}
    for entity in chunks:
        if isinstance(entity, nltk.tree.Tree):
            text = " ".join([word for word, tag in entity.leaves()])
            data[text] = entity.label()
    return data

In [5]:
# NE with diff. types 
ne_chunked_f = ne_chunk(tagged, binary=False)
ne_entities_f = extractEntities(ne_chunked_f)

# NE without further classification
ne_chunked = ne_chunk(tagged, binary=True)
ne_entities = extractEntities(ne_chunked)

In [6]:
print("\nFurther classified NE: ")
print(list(ne_entities_f)[:50])

print("Only NE: ")
print(list(ne_entities)[:50])

print("\nFurther classified NE length: "+ str(len(list(ne_entities_f))))
print("\nOnly NE length: "+ str(len(list(ne_entities))))


Further classified NE: 
['Aredhel Turgon', 'Nienna', 'Colon', 'Rivil', 'Finrod Finarfin', 'Curse', 'Hallow', 'Emyn Beraid', 'Darkness', 'Crissaegrim', 'Orthanc', 'ButAtanamir', 'Younger', 'Nay', 'Borlad', 'Erech', 'Elder Children', 'Serinde', 'Mount Rerir', 'High Hope', 'Dimbar', 'Anarion', 'White Telperion', 'Eledhwen', 'Gwindor', 'Dark Years', 'Finrod', 'Tuneless Halls', 'Thangorodrim', 'Earendur', 'Nan Elmoth', 'Valiant', 'Long', 'Haidar', 'Wolf', 'Uruloki', 'Blessed Realm', 'Grey Havens', 'Nen Girith', 'Immortal', 'Tol Galen', 'Teiglin Turin', 'Denethor', 'Winged Crown', 'Fear', 'Seven Stars', 'Kingdom', 'Eldalie', 'Kindler', 'Fingon Fingolfin']
Only NE: 
['Amon Amarth', 'Nienna', 'Twilight Meres', 'Colon', 'Rivil', 'Finrod Finarfin', 'Hallow', 'Emyn Beraid', 'Darkness', 'Crissaegrim', 'Orthanc', 'Younger', 'Annael', 'Borlad', 'Erech', 'Elder Children', 'Serinde', 'Mount Rerir', 'Silmarien', 'Dimbar', 'Anarion', 'White Telperion', 'Eledhwen', 'Gwindor', 'Dark Years', 'Finrod', 'Tu

In [None]:
with open("./../res/entities_binary.json", "w") as f:
    json.dump(ne_entities, f)

with open("./../res/entities_all.json", "w") as f:
    json.dump(ne_entities_f, f)

#print(ne_entities_f)
## shows chosen entity types -> they are mostly incorrect anyway, thus I will keep using the binary chunked NE

### NER with custom patterns

I tried to choose the named entities in a way that matches the style of writing, since the text contains many nouns joined by a preposition, e.g. Eye of Sauron, Grey Havens of Lindon, Towers of Mist, ...
Yet many of the found entities are either nouns followed by a mistagged verb, e.g. 'Eldar desire', 'Manwe wept' or 'Middle-earth lay', or nouns mixed with mistagged words such as therefore or thereafter.

Overall I found the custom entities better in terms of completeness (e.g. Seeing Stone of Emyn Beraid, Feanor son of Finwe, Finduilas daughter of Orodreth, Lord of Morgul), but of course they are too complex to match Wikipedia-style page titles.
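
One possible way to drop candidates such as 'Eldar desire' or 'Manwe wept' is a capitalisation filter applied after extraction. The sketch below is an assumption of mine, not part of the notebook's pipeline: `CONNECTORS` and `looks_like_name` are hypothetical names, and the rule is deliberately simple.

```python
# Connector words allowed in lower case inside an entity name (an assumption).
CONNECTORS = {"of", "the"}


def looks_like_name(candidate):
    """Accept a candidate only if every word except connectors is capitalised."""
    words = candidate.split()
    if not words or words[-1] in CONNECTORS:
        return False
    return all(w in CONNECTORS or w[0].isupper() for w in words)


candidates = ["Eye of Sauron", "Eldar desire", "Manwe wept",
              "Middle-earth lay", "Lord of Morgul"]
kept = [c for c in candidates if looks_like_name(c)]
print(kept)
```

The trade-off is that descriptive entities with lower-case nouns, such as 'Feanor son of Finwe', would be rejected as well, so the connector set would need tuning for this text.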

In [8]:
from nltk import RegexpParser

def printCustomPattern(tagged_tokens, pattern_grammar):
    """
    Takes tagged tokens and a grammar as input, prints the number of matching chunks and the first 15 of them
    """
    cp = RegexpParser(pattern_grammar) ## build a chunk parser from the input grammar (no training involved)
    res = cp.parse(tagged_tokens) ## apply parser to pos-tagged tokens
    ents = extractEntities(res)  ## extract found patterns from results
    print("Found "+str(len(ents))+" strings with this pattern.")
    print("Some found strings with given pattern: ")
    print(list(ents)[:15])  ## slicing also covers the case of fewer than 15 results


In [9]:
#print("All tagged words: ")
#print(len(tagged))
#print(list(tagged)[:30])

In [None]:
"""
Chooses named entities as sequences of nouns starting with an upper case letter;
nouns connected by the preposition 'of' are joined into one entity, e.g. Gates of Argonath.
"""
found = []
custom = []

for word in tagged:
    if (word[1].startswith("N") or word[1].startswith("FW") or (found and word[1].startswith("IN") and word[0] in ['of'])):
        found.append(word)
        #print(word, word[1].startswith("NN"),word[1].startswith("FW"), word[1].startswith("IN"), word[1].startswith("TO"))

    else:
        if (found) and found[-1][1].startswith("IN"): #remove possible prepositions if in the beginning
            #print("\tin ", found[-1][1])
            found.pop()
        if (found and " ".join(e[0] for e in found)[0].isupper()): #if starts with upper case
            #print("\t", " ".join(e[0] for e in found))
            custom.append(" ".join(e[0] for e in found))
        found = []
        
print(set(custom))


with open("./../res/entities_custom.json", "w") as f:
    json.dump(list(set(custom)), f)

### Other experiments with own patterns:
A couple of experiments with picking different grammar patterns from the text.
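
For reference, the pattern named in the task description (an optional adjective followed by a proper noun, singular or plural) can be written as a single `RegexpParser` rule. This is a small sketch on hand-tagged tokens, so it runs without any NLTK model downloads; the chunk label `ENT` is an arbitrary choice.

```python
import nltk

# Hand-tagged tokens, so no NLTK model download is needed for this sketch.
tagged_tokens = [
    ("the", "DT"), ("Blessed", "JJ"), ("Realm", "NNP"), ("lay", "VBD"),
    ("beyond", "IN"), ("Valinor", "NNP"), ("and", "CC"),
    ("the", "DT"), ("Valar", "NNPS"),
]

# Optional adjective followed by one or more proper nouns (singular or plural).
parser = nltk.RegexpParser("ENT: {<JJ>?<NNP|NNPS>+}")
tree = parser.parse(tagged_tokens)

entities = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "ENT")
]
print(entities)
```

On the real text the same grammar string could be passed to `printCustomPattern(tagged, ...)` like the other experiments.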

In [11]:
grammar1 = "SP:{<DT><NNP>}" ## chooses DT (determiner) followed by NNP (proper noun)
printCustomPattern(tagged, grammar1)


grammar2 = "SP:{<DT><NN>}" ## chooses DT (determiner) followed by NN (some noun)
printCustomPattern(tagged, grammar2)

Found 606 strings with this pattern.
Some found strings with given pattern: 
['the Stewards', 'the Doomsman', 'the Halfling', 'all Brandir', 'this World', 'the Balrogs', 'the High', 'the Steadfast', 'the Flame', 'the Gwaith-i-Mirdain', 'the Crissaegrim', 'the Uruloki', 'the Gorgoroth', 'the Calacirya', 'the Ever-young']
Found 1799 strings with this pattern.
Some found strings with given pattern: 
['the building', 'this place', 'the duel', 'this doom', 'that counsel', 'the warden', 'that reason', 'a seed', 'a hall', 'each awoke', 'No love', 'the anguish', 'any treasury', 'a loss', 'a friendship']


In [12]:
grammar3 = "SP:{<DT><JJ><NN|NNP>}" ## chooses sequences of DT (determiner) followed by JJ (adjective) and NN (some or proper noun)
printCustomPattern(tagged, grammar3)

Found 785 strings with this pattern.
Some found strings with given pattern: 
['a great captain', 'the first hour', 'a great rock', 'the inmost circle', 'the first Sun', 'a dreadful fall', 'the right line', 'the left hand', 'the unsullied Light', 'the mighty ravine', 'any other thing', 'a sudden Nahar', 'the uttermost West', 'the first victory', 'a new star']


In [13]:
grammar4 = "Sentence: {<DT|PP\$>*<JJ>*<NN|NNP>+<VBD><JJ><NN|NNP>}" ## some simple sentences
printCustomPattern(tagged, grammar4)

Found 28 strings with this pattern.
Some found strings with given pattern: 
['house was well-nigh destroyed', 'power took cruel revenge', 'this deed Fingon won great renown', 'Morgoth gave small heed', 'hast found thy brother', 'Morgoth sent great strength', 'Beleriand did great evil', 'Caranthir paid little heed', 'Manwe said unto Melkor', 'the Valar gathered great store', 'tongue had great power', 'Earendil was long time', 'Manwe put forth Morgoth', 'the western world were rent asunder', 'Isil was first wrought']


In [14]:
grammar5 = "Sentence: {<NN|NNP><VBD><JJ>*<NN|NNP>*<DT|IN>+<JJ>+<NN|NNP><DT|IN>+<JJ>*<NN|NP>+}" ## some slightly longer simple sentences
printCustomPattern(tagged, grammar5)

Found 10 strings with this pattern.
Some found strings with given pattern: 
['Noldor had as little thought of faith', 'axe smoked in the black blood of the troll-guard', 'thereafter surpassed that desperate crossing in hardihood', 'dominion round about with an unseen wail of shadow', 'Pelori was an empty land in twilight', 'Fiercest burned the new flame of desire', 'Amandil set sail in a small ship at night', 'blade rang a cold voice in answer', 'Manwe made a high feast for the praising', 'danger was fraught with dreadful power because of the holy jewel']


In [15]:
grammar6 = "Sentence: {<NN|NNP><VBD>*<DT|IN>+<VBD>*<NN|NNP><VBD><JJ>+<DT|IN>+<JJ>*<DT|IN>*<NN|NP>*<DT|IN>*<JJ>*<NN|NNP>}" ## some more simple sentences
printCustomPattern(tagged, grammar6)

Found 3 strings with this pattern.
Some found strings with given pattern: 
['victory of the Elves was dear-bought For those of Ossiriand', 'air of Middle-earth became heavy with the breath of growth', 'Huan the hound was true of heart']


In [16]:
from IPython.display import HTML
grammar7 = """SP1:{<NN|NNP><VBD><DT|IN>+<NN|NNP>}
            """

grammar8 = """SP2: {<NN|NNP>*<JJ>*<DT|IN>+<NN|NNP><DT|IN><JJ>*<NN|NNP>}""" ## chooses phrases without verbs: nouns and adjectives joined by determiners or prepositions

grammar9 = """SP1:{<NNP><VBD><DT|IN|RP>+<NNP>}
            """

grammar10 = """SP1:{<PRP|WP><VBD><DT|IN|RP>+<PRP|WP>}
            """

grammar11 = """SP1:{<DT|IN>+<WP>+<VBD><DT|IN|RP>+<JJ>*<NN|NNP>*}
            """
display(HTML("\nActive sentences: "))
printCustomPattern(tagged, grammar7)
display(HTML("\nPhrases (no verbs): "))
printCustomPattern(tagged, grammar8)
display(HTML("\nSimple sentences with <b>proper nouns</b>: "))
printCustomPattern(tagged, grammar9)
display(HTML("\nSimple sentences <b>with pronouns</b>: "))
printCustomPattern(tagged, grammar10)
display(HTML("\nSimple sentences <b>without pronouns</b>: "))
printCustomPattern(tagged, grammar11)

Found 461 strings with this pattern.
Some found strings with given pattern: 
['Fingon passed over Anfauglith', 'Gwindor gave the sword', 'foam flew like snow', 'Fingon strung an arrow', 'Tlrion was the name', 'wind came out of the east', 'Galdor ruled the house', 'Sapphire was with Elrond', 'Ores loathed the Master', 'Feanor was at the mouth', 'Melkor brooded in the outer', 'land lay under a cloud', 'Mablung set a guard', 'Linaewen was the name', 'Tulkas left the council']


Found 1824 strings with this pattern.
Some found strings with given pattern: 
['of the fate of Elured', 'The love of Finwe', 'in vision from afar', 'Finduilas daughter of Orodreth the King', 'on the city of Armenelos', 'that in the making of Arda', 'friend upon the island of Tol', 'from the bridge of Menegroth', 'towards the Fen of Serech', 'By the ring of Felagund', 'with spilling of blood', 'under the power of Thingol', 'the Tower of Guard', 'the horse of Celegorm', 'feast of the Spring of Arda']


Found 121 strings with this pattern.
Some found strings with given pattern: 
['Fingon passed over Anfauglith', 'Rohirrim aided the Lords', 'Lord arose in Mirkwood', 'Iluvatar permitted the Valar', 'Varda hallowed the Silmarils', 'Beleg departed from Amon', 'Narog rose beneath the Mountains', 'Varda commanded the Moon', 'Ores loathed the Master', 'Tuor came upon an Elf', 'Valar passed over Middle-earth', 'Sauron called the Nazgul', 'Tuor fought with Maeglin', 'Melian was a Maia', 'Fingolfin marched into Mithrim']


Found 87 strings with this pattern.
Some found strings with given pattern: 
['he fled from them', 'she sped on before him', 'he made for them', 'he declared that it', 'he led before them', 'they won for themselves', 'he deemed that in him', 'he knew that it', 'he saw as he', 'he feared that it', 'she went with him', 'he lusted for them', 'she marvelled that she', 'they followed after him', 'he spoke of it']


Found 7 strings with this pattern.
Some found strings with given pattern: 
['of what lay before', 'Those who used the Nine Rings', 'against all who came in', 'of what passed in the', 'of those who led the Noldor', 'those who saw the', 'all who heard that sound']


### Custom entity classification with custom patterns

In [21]:
import wikipedia


def parseSummary(name, summary):
    """
    Searches the summary text for an appropriate entity description.
    The first chunk matching the pattern is returned in a dictionary.
    The first word of the chunk (typically a form of the verb "be") is removed.
    If no chunk matches, the description is set to "Thing".

    Arguments:
        entity name string, wikipedia summary string

    Returns:
        dict object { entity_name : description_string }
    """
    pattern_grammar = "NP: {<VBD|VBZ|VBP><DT|IN|RP|WP>*<JJ|RB>*<NN|NNP|NNS>+<VBD|VBN|VBP>*<DT|IN|RP|WP|JJ>*<VBD|VBN|VBP>*<DT|IN|RP|WP|JJ>*<NN|NNP|NNS>*}"
    cp = nltk.RegexpParser(pattern_grammar) ## build a chunk parser from the grammar
    tagged_tokens = pos_tag(word_tokenize(summary))
    res = cp.parse(tagged_tokens) ## apply parser to pos-tagged tokens of summary
    ## pick the first chunk labelled NP and return it
    for result in res:
        if isinstance(result, nltk.tree.Tree) and result.label() == "NP":
            words = [ent[0] for ent in result.leaves()]
            return {name: " ".join(words[1:])} # remove first word (typically is, are)

    return {name: "Thing"} # pattern not found


def findPage(name):
    """
    Looks up the Wikipedia page for the given name and returns { name : description }.
    """
    try:
        page = wikipedia.page(name)
        return parseSummary(name, page.summary)
    except Exception:
        # page missing, ambiguous or not reachable -> default category
        return {name: "Thing"}

In [18]:
"""
For list of named entities searches wikipedia pages for short description.
Prints (or stores) the found 'dictionary' as json object.
"""
def getCategories(entities_t, store = False, name = './../res/named_entitites_wordbook.txt'):
    json_object = {}
    for entity in entities_t:
        ent = findPage(entity)
        json_object[entity] = ent[entity]
    
    print(json.dumps(json_object, indent=3, sort_keys=True, ensure_ascii=False))
    if(store):
        with open(name, 'w') as outfile:  
            json.dump(json_object, outfile)


## A couple of tests
test = ['Morgoth', 'Mountain', 'Forest', 'Sapphire', 'Nimloth', 'Seven Stones', 'Ainur', 'Namo','Teleri']
getCategories(test)





Page not accessible atm exception.
{
   "Ainur": "the immortal spirits",
   "Forest": "a large area dominated by trees",
   "Morgoth": "a character from Tolkien",
   "Mountain": "a large landform",
   "Namo": "Thing",
   "Nimloth": "the name of an Elf-maid",
   "Sapphire": "a precious gemstone",
   "Seven Stones": "a traditional South Asian game",
   "Teleri": "the Elves who are"
}


In [None]:
print("Only NE: ")
print(list(ne_entities)[0:25])

## Printing found Wikipedia results for couple of named entities:
#entities = list(ne_entities)[30:70]
#getCategories(entities)

"""create wordbook from nltk entities and save them to named_entitites_wordbook.txt file"""
getCategories(list(ne_entities)[0:400], True) # the first 400 entities should be enough for demonstration

Only NE: 
['Amon Amarth', 'Nienna', 'Twilight Meres', 'Colon', 'Rivil', 'Finrod Finarfin', 'Hallow', 'Emyn Beraid', 'Darkness', 'Crissaegrim', 'Orthanc', 'Younger', 'Annael', 'Borlad', 'Erech', 'Elder Children', 'Serinde', 'Mount Rerir', 'Silmarien', 'Dimbar', 'Anarion', 'White Telperion', 'Eledhwen', 'Gwindor', 'Dark Years']


The results are not always successful even when the right Wikipedia page is found. For example, "Morgoth" is described as the "evil in the world of Middle-earth", but the first word of this description is treated as a verb and removed, so the remaining phrase ("in the world of Middle-earth") no longer makes any sense.

Other words (Teleri, Exiles etc.) are simply not described in a way that is easy to parse. If anything, I was actually impressed by how many of the terms were even found on Wikipedia.
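
The failure described above comes from dropping the first word of the matched chunk unconditionally. A possible mitigation, sketched here under my own assumptions, is to drop it only when it is a known form of "to be". `COPULAS` and `strip_leading_copula` are hypothetical names, not part of the notebook.

```python
# Forms of "to be" that may safely be dropped from the start of a description.
COPULAS = {"is", "are", "was", "were"}


def strip_leading_copula(words):
    """Drop the first word only when it is a known form of 'to be'."""
    if words and words[0].lower() in COPULAS:
        return words[1:]
    return list(words)


with_copula = strip_leading_copula(["is", "a", "precious", "gemstone"])
without_copula = strip_leading_copula(["evil", "in", "the", "world"])
print(" ".join(with_copula), "|", " ".join(without_copula))
```

In `parseSummary` this would replace the `[1:]` slice. It does not fix the underlying mistagging, but it keeps a mistagged first word in the description instead of silently deleting it.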


In [38]:
## Comparison between nltk Wikipedia entities (ne_entities) and custom extracted entities
"""create wordbook from custom entities and save them to custom.txt file"""
getCategories(list(custom)[0:400], True, './../res/custom.txt') # the first 400 entities should be enough for demonstration

{
   "AINU LIN DALE": "a list of people",
   "Abyss": "with an oil platform crew",
   "Ainu": "an indigenous people of Japan",
   "Ainur": "the immortal spirits",
   "Arda": "a Turkish professional footballer who",
   "Astaldo": "Thing",
   "Aule": "a fictional character from J. R. R. Tolkien",
   "Behold": "a brand of furniture polish",
   "Being": "the existence of a thing",
   "Being things": "the existence of a thing",
   "Both": "Thing",
   "Breath of Arda": "a fictional character in J. R. R. Tolkien",
   "Children of Eru": "a fictional character in J.R.R",
   "Children of Iluvatar": "the name given",
   "Children of Iluvatar arise therein": "Thing",
   "Children of Iluvatar hearken": "Thing",
   "Darkness": "Thing",
   "Dead": "Thing",
   "Deeps of Time": "the concept of geologic time",
   "Deeps of Time Melkor hath": "Thing",
   "Deer": "the hoofed ruminant mammals",
   "Dominion of Men": "a period in J.R.R",
   "Doomsman": "Thing",
   "Ea": "an American video game company headq

In conclusion, the resulting custom.txt file shows that, as expected, more of the custom entities were not found as a Wikipedia topic and were thus categorized as "Thing". In addition, the categorization could not correctly assign many made-up terms, especially those following non-English grammar rules, e.g. Ainu (singular) from Ainur (plural), Ea, Eldar, ...
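
The comparison could be made measurable by computing the share of entities that fell back to "Thing" in each result file. The sketch below uses a hypothetical helper `thing_ratio` and a handful of entries copied from the output above; it does not reproduce the full files.

```python
def thing_ratio(categories, default="Thing"):
    """Return the share of entities that fell back to the default category."""
    if not categories:
        return 0.0
    misses = sum(1 for value in categories.values() if value == default)
    return misses / len(categories)


# A few entries copied from the custom.txt output shown above.
sample = {
    "Ainur": "the immortal spirits",
    "Astaldo": "Thing",
    "Both": "Thing",
    "Deer": "the hoofed ruminant mammals",
}
ratio = thing_ratio(sample)
print(f"{ratio:.0%} of the sample fell back to the default category")
```

Since both result files are written with `json.dump`, they could be loaded with `json.load` and passed to the same function to compare the nltk and custom entity sets.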