## ENGG*3130 Final Project Jupyter Notebook

Start with required initialization:

In [None]:
import spacy # Word breakdowns
import en_core_web_sm
import tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from bs4 import BeautifulSoup # Not used, may be used in future
from transformers import pipeline # Question answering pipeline
import wikipediaapi
# Found on SO, may be obsolete given I fixed other issues with packages
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen

In [None]:
# This is a helper function from Stack Overflow for extracting proper nouns
# Credit goes to T. Jeanneau here: https://stackoverflow.com/questions/63450423/how-to-find-proper-noun-using-spacy-nlp
# Function utilizes spaCy's proper noun tags
def extract_proper_nouns(doc):
        pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
        consecutives = []
        current = []
        for elt in pos:
            if len(current) == 0:
                current.append(elt)
            else:
                if current[-1] == elt - 1:
                    current.append(elt)
                else:
                    consecutives.append(current)
                    current = [elt]
        if len(current) != 0:
            consecutives.append(current)
        return [doc[consecutive[0]:consecutive[-1]+1] for consecutive in consecutives]

First, we define our class, and define some basic strings to hold our results. We also assign the value of the question, which should be passed to the class object on initialization.

In [None]:
class someone_make_an_acronym:
    FINAL_TITLE = "" # Title that should give a correct Wikipedia article
    WIKI_ARTICLE = "" # Full Wiki article text
    QUESTION = "" # Question to be asked
    result = "" # Data structure that holds the result from the transformer
    def __init__(self, QUESTION):
        self.QUESTION = QUESTION

### Sentence Deconstruction
Next, we need to deconstruct the question. Ultimately, we just need the subject the sentence is talking about. Linguistics can be very complicated. Our initial deconstruction attempt is to retrieve an 'entity' using spaCy. If this fails, a helper function is used, which detects consecutive proper nouns and groups them, like _George Washington_ or _Single-Nucleotide Polymorphism_.
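The grouping step can be illustrated on plain token positions, without spaCy. This is only a sketch: `group_consecutive` is a hypothetical name, and the input list stands in for the indices of tokens tagged `PROPN`.

```python
def group_consecutive(indices):
    """Group sorted integer positions into runs of adjacent values."""
    groups = []
    for idx in indices:
        # Extend the current run if this position directly follows it
        if groups and groups[-1][-1] == idx - 1:
            groups[-1].append(idx)
        else:
            # Otherwise start a new run
            groups.append([idx])
    return groups


# Positions 2-3 and 9-10 are adjacent, so each pair forms one group
print(group_consecutive([2, 3, 7, 9, 10]))  # prints [[2, 3], [7], [9, 10]]
```

Each run of adjacent positions corresponds to one multi-word name, which is why _George Washington_ comes back as a single span rather than two separate tokens.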

If proper noun extraction also fails, an attempt is made to extract one of the following:
 - Compound noun
 - Nominal subject
 - Direct object
 - Object of a preposition
 
For brevity's sake, the specific linguistics of these four are not covered here. How a question is constructed influences which of them is the right one to extract, so we test for all four.

The first of these that is found, in the order listed above, is used as the title for the Wikipedia article. There are obvious flaws with this method, but it usually returns a usable title.

Note: Keep in mind that the program flow only falls here if there are _no_ proper nouns, which makes question answering more ambiguous to begin with.
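The fallback order described above can be sketched as a small standalone function. The names `FALLBACK_ORDER` and `pick_fallback_title` are hypothetical, and the input dictionary stands in for the candidate found under each spaCy dependency label.

```python
# Dependency labels tried in order once entity and proper noun detection fail
FALLBACK_ORDER = ("compound", "nsubj", "dobj", "pobj")


def pick_fallback_title(candidates):
    """Return the first non-empty candidate in FALLBACK_ORDER, or "" if none."""
    for label in FALLBACK_ORDER:
        title = candidates.get(label, "")
        if title:
            # Wikipedia titles use underscores in place of spaces
            return title.replace(" ", "_")
    return ""


# A nominal subject outranks an object of a preposition
print(pick_fallback_title({"nsubj": "chimneys", "pobj": "bricks"}))  # prints chimneys
```

Keeping the priority in one tuple makes the order easy to change if another part of speech turns out to be a better first guess.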

In [None]:
class someone_make_an_acronym(someone_make_an_acronym):
    def BreakSentence(self):
        # Load spaCy's small English model and parse the question
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(self.QUESTION)
        # These are backups for not proper nouns
        TITLE = ""
        TITLE2 = ""
        TITLE3 = ""
        TITLE4 = ""
        for ent in doc.ents:
            self.FINAL_TITLE = ent.text
            print("Entity detected, using: " + self.FINAL_TITLE)
            return
        
        # If we can't detect an entity, detect proper nouns instead
        TEST_LIST = extract_proper_nouns(doc)
        for item in TEST_LIST:
            print(item.text + "\n")
        if len(TEST_LIST): # If proper noun extraction worked, we're good
            self.FINAL_TITLE = str(TEST_LIST[0])
            print("Proper noun(s) detected, using: " + self.FINAL_TITLE)
        else: # Otherwise, try to get ANY subject in sentence
            print("No proper noun(s) detected, attempting to extract subject.")
            # Backup for sentence without proper nouns
            # Try everything remotely close to a noun or subject of sentence
            sub_toks = [tok for tok in doc if (tok.dep_ == "compound") ]
            sub_toks2 = [tok for tok in doc if (tok.dep_ == "nsubj") ]
            sub_toks3 = [tok for tok in doc if (tok.dep_ == "dobj") ]
            sub_toks4 = [tok for tok in doc if (tok.dep_ == "pobj") ]
            for value in sub_toks:
                TITLE = str(value).replace(" ", "_")
                print("Noun compound detected...")
            for value in sub_toks2:
                TITLE2 = str(value).replace(" ", "_")
                print("Nominal subject detected...")
            for value in sub_toks3:
                TITLE3 = str(value).replace(" ", "_")
                print("Direct object detected...")            
            for value in sub_toks4:
                TITLE4 = str(value).replace(" ", "_")
                print("Object of a preposition detected...")
            if len(TITLE):
                self.FINAL_TITLE = str(TITLE)
            elif len(TITLE2):
                self.FINAL_TITLE = str(TITLE2)
            elif len(TITLE3):
                self.FINAL_TITLE = str(TITLE3)
            elif len(TITLE4):
                self.FINAL_TITLE = str(TITLE4)

        if(len(self.FINAL_TITLE)):
            print(self.FINAL_TITLE, len(self.FINAL_TITLE))

### Parse Wikipedia

The original iteration of this can be seen at the end of the Notebook. It used Beautiful Soup to scrape a Wikipedia page directly. This was clunky and proved difficult to debug. It also introduced noise in the form of extraneous text at the end of each page. We therefore switched to the `wikipediaapi` package, which queries Wikipedia directly.

The Wikipedia object is created, and then the title extracted from the given question is fed to the object. If it is a valid Wikipedia article, the text (or just the summary, if desired) is returned. This provides concise context for the question answering pipeline. Either the summary or the full text can serve as context, depending on the scope of the question. For a general question, the summary may be sufficient, which is worth considering for large question sets. More specific questions may require the entire text for a correct answer.

We assign the Wikipedia article to `WIKI_ARTICLE` and use it in the next section.
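The choice between summary and full text can be sketched as a small helper. This is an illustration only: `select_context` and `FakePage` are hypothetical names, and `FakePage` merely mimics the `exists()`, `summary`, and `text` members that a `wikipediaapi` page object exposes, so the example runs without network access.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a wikipediaapi page object."""
    summary: str
    text: str
    found: bool = True

    def exists(self):
        return self.found


def select_context(page, use_full_text=False):
    """Return the summary or the full text, or "" if the page does not exist."""
    if not page.exists():
        return ""
    return page.text if use_full_text else page.summary


page = FakePage(summary="Short overview.", text="Short overview. Plus every section.")
print(select_context(page))                      # summary only
print(select_context(page, use_full_text=True))  # whole article
```

Checking `exists()` first avoids passing an empty context to the question answering step when the extracted title does not match an article.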

In [None]:
class someone_make_an_acronym(someone_make_an_acronym):
    def ScrapeAndSanitizeWiki(self):
        wiki_object = wikipediaapi.Wikipedia(
            language='en',
            extract_format=wikipediaapi.ExtractFormat.WIKI
        )
        
        wiki_article = wiki_object.page(self.FINAL_TITLE)
        print(wiki_article)
        self.WIKI_ARTICLE = wiki_article.summary # Use .text instead for the full article


### Answer the Question
The Hugging Face pipeline makes question answering very simple to use. We create the pipeline object for question answering, pass the Wikipedia article as the context and the question asked as the question, and set `topk` to 3 (as this is how many answers we would like to see). Then, `self.result` is assigned the result of the processed question. The result is a list of dictionaries, one per answer, so each answer and its score must be read by index and key.

In [None]:
class someone_make_an_acronym(someone_make_an_acronym):
    def AnswerQuestion(self):
        # Answer the question
        # Determine what parts are taking the longest (assumption: result = nlp)
        nlp = pipeline("question-answering")
        context = self.WIKI_ARTICLE
        self.result = nlp(question=self.QUESTION, context=context, topk = 3)

### Print Results
The results are then printed. We asked for 3 answers, so we print 3 answers, each followed by the model's confidence score for that answer.
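As a variation on the padding arithmetic used for printing, the same column alignment can be sketched with a format specification. `format_answers` is a hypothetical helper, and the sample list below only imitates the list-of-dictionaries shape of the pipeline result, using two of the answers from the run shown later in this notebook.

```python
def format_answers(results, width=25):
    """Return one line per answer: left-aligned answer text, then its score."""
    lines = []
    for item in results:
        # Pad the answer to a fixed width so the scores line up in a column
        lines.append(f"{item['answer']:<{width}} {item['score']:.4f}")
    return lines


sample = [
    {"answer": "Western Canada", "score": 0.938798725605011},
    {"answer": "in Western Canada", "score": 0.022149959579110146},
]
for line in format_answers(sample):
    print(line)
```

Answers longer than `width` are not truncated; they simply push their score further to the right, as happens with the third answer in the run shown later.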

In [None]:
class someone_make_an_acronym(someone_make_an_acronym):
    def PrintQnA(self):
        print("\n" + self.QUESTION + "\n")
        for i in range(0,3): # Increase for more sets of answers - this has no computation delay as they are already stored.
            print(str(self.result[i]['answer']), (25 - len(str(self.result[i]['answer'])))*" ", str(self.result[i]['score']))

### Function Calls
We call each function, making sure to include a question! The more straightforward the question, the more likely the answer is to be correct.

Note: Try to ask a few questions! You might be surprised how correct they are. Depending on context length and computer speed, it may take up to 30 seconds to receive an answer. If there is an error (or the sentence is not parsed properly), the call fails quickly rather than after a long wait.

Here are some good questions you could ask:
 - What is MSG used for?
 - What is the Earth made of?
 - Where is Saskatchewan?
 - What did Steve Jobs do before Apple?
 - What is the Meaning of Life?
 - What are _______ made from? (try chimneys, pencils, etc.)


In [52]:
test3 = someone_make_an_acronym("Where is Saskatchewan?")
test3.BreakSentence()
test3.ScrapeAndSanitizeWiki()
test3.AnswerQuestion()
test3.PrintQnA()


Where is Saskatchewan?

Western Canada             0.938798725605011
in Western Canada          0.022149959579110146
northern boreal half is mostly forested and sparsely populated  0.010969744995236397


In [None]:
# Here is the Wikipedia article, for reference.
print(test3.WIKI_ARTICLE)

In [None]:
    # Outdated Wikipedia scraping using Beautiful Soup
    html = urlopen("https://en.wikipedia.org/wiki/" + self.FINAL_TITLE)
    parsed_html = BeautifulSoup(html, "html.parser")
    paragraphs = parsed_html.select("p") # List of <p> elements
    print(paragraphs)
    # Collect the text nodes from every paragraph
    clean_html = [text for p in paragraphs for text in p.find_all(string=True)]
    no_newline_html = list(filter(("\n").__ne__, clean_html))
    self.WIKI_ARTICLE = TreebankWordDetokenizer().detokenize(no_newline_html)