Achintya Yedavalli

# Assignment 4: Information Extraction from Text

The objective of this assignment is to apply SpaCy's POS tagging, NER, and dependency parsing functionality to an article about Abraham Lincoln. By dissecting the text with SpaCy, we will gain insight into the linguistic structure, named entities, and syntactic relationships present in the article.

We will be using `Abraham_Lincoln.pdf` for this assignment.

## 1. Data Loading & Processing

### a) Load the dataset using an appropriate PDF library.

In [10]:
%pip install PyPDF2
# Extract the text of every page from the PDF
import PyPDF2

page_data = []
# Open in binary mode; the context manager closes the file when done
with open('Abraham_Lincoln.pdf', mode='rb') as f:
    pdfdoc = PyPDF2.PdfReader(f)
    for page in pdfdoc.pages:
        page_data.append(page.extract_text())

print("Text extracted using PyPDF2: \n", page_data)

Note: you may need to restart the kernel to use updated packages.
Text extracted using PyPDF2: 
 ['Abraham Lincoln: The Legacy of a Great Leader  Abraham Lincoln, the 16th President of the United States, remains one of the most revered ﬁgures in American history. Born on February 12, 1809, in a log cabin in Hardin County, Kentucky, Lincoln rose from humble beginnings to become a towering ﬁgure whose leadership during one of the nation\'s darkest periods, the Civil War, solidiﬁed his place as one of America\'s greatest presidents. His enduring legacy as the "Great Emancipator" and his unwavering commitment to preserving the Union continue to inspire generations around the world.  Early Life and Career  Lincoln\'s early life was marked by hardship and adversity. Growing up in a frontier environment, he experienced poverty, the loss of his mother at a young age, and limited access to formal education. Despite these challenges, Lincoln possessed an insatiable thirst for knowledge and self-
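The extracted text above contains typographic ligatures (for example the single character `ﬁ` in "ﬁgures"), which later show up as separate, unusual tokens. As a minimal sketch, assuming Unicode compatibility normalization is acceptable for this text, the ligatures could be expanded before tokenization. The helper name `clean_extracted_text` is hypothetical and not part of the original notebook.

```python
import unicodedata

def clean_extracted_text(text):
    """Expand typographic ligatures and trim surrounding whitespace."""
    # NFKC folds compatibility characters into their plain equivalents,
    # e.g. U+FB01 (the "fi" ligature) becomes the two letters "fi".
    text = unicodedata.normalize("NFKC", text)
    return text.strip()

# Sample fragment taken from the extracted text, with the ligature escaped
sample = "one of the most revered \ufb01gures in American history"
cleaned = clean_extracted_text(sample)
print(cleaned)
```

This only repairs characters that are valid Unicode ligatures. Text that the extractor has already decoded incorrectly would need a different fix.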

### b) Convert the entire text read into a set of sentences using NLTK's Sentence Tokenizer.

In [11]:
%pip install nltk spacy

Note: you may need to restart the kernel to use updated packages.


In [12]:
import nltk

def download_nltk_dataset(dataset_name):
    # Download an NLTK dataset only if it is not already available locally.
    # Note: nltk.data.find() expects a resource path such as "tokenizers/punkt";
    # with a bare package name the lookup fails, so nltk.download() runs each
    # time and simply reports the package as up to date when nothing changed.
    try:
        nltk.data.find(dataset_name)
    except LookupError:
        nltk.download(dataset_name)
        print(f"Downloaded {dataset_name}")
    else:
        print(f"{dataset_name} is already downloaded")

download_nltk_dataset("punkt")
download_nltk_dataset("averaged_perceptron_tagger")

Downloaded punkt
Downloaded averaged_perceptron_tagger


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [13]:
from nltk.tokenize import sent_tokenize

full_text = page_data[0] + " " + page_data[1]
tokens = sent_tokenize(full_text)
print(tokens)

['Abraham Lincoln: The Legacy of a Great Leader  Abraham Lincoln, the 16th President of the United States, remains one of the most revered ﬁgures in American history.', "Born on February 12, 1809, in a log cabin in Hardin County, Kentucky, Lincoln rose from humble beginnings to become a towering ﬁgure whose leadership during one of the nation's darkest periods, the Civil War, solidiﬁed his place as one of America's greatest presidents.", 'His enduring legacy as the "Great Emancipator" and his unwavering commitment to preserving the Union continue to inspire generations around the world.', "Early Life and Career  Lincoln's early life was marked by hardship and adversity.", 'Growing up in a frontier environment, he experienced poverty, the loss of his mother at a young age, and limited access to formal education.', 'Despite these challenges, Lincoln possessed an insatiable thirst for knowledge and self-improvement, educating himself through voracious reading and a keen interest in the la
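In the tokenizer output above, the title and section headings are fused into the first sentence of the paragraph that follows them (for example "Early Life and Career  Lincoln's early life ..."). In this extraction the headings appear to be separated from the body by a run of two spaces. A minimal sketch, assuming that double-space convention holds throughout the text, splits on those gaps before sentence tokenization. The helper `split_on_wide_gaps` is a hypothetical name introduced for illustration.

```python
import re

def split_on_wide_gaps(text):
    """Split text wherever two or more whitespace characters occur in a row."""
    return [part for part in re.split(r"\s{2,}", text) if part]

# Sample taken from the start of the extracted text (ligature already expanded)
sample = ("Abraham Lincoln: The Legacy of a Great Leader  "
          "Abraham Lincoln, the 16th President of the United States, "
          "remains one of the most revered figures in American history.")

segments = split_on_wide_gaps(sample)
for segment in segments:
    print(segment)
```

Each resulting segment could then be passed to `sent_tokenize` separately, so that a heading does not become part of the sentence after it.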

## 2. Part of Speech Tag Based Analysis [3 Marks]

### a) Using the SpaCy library, extract fine-grained POS tags for each sentence above (Hint: look for "tok.tag_" while looping over the tokens after processing the sentence using SpaCy).

In [14]:
import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")
all_tags = []

for sent in tokens:
    print("****************************************")
    print(sent)
    processed_sent = nlp(sent)

    # Tokenization results
    print("Tokens:")
    s_tokens = [token.text for token in processed_sent]
    print(s_tokens)

    # Fine-grained (Penn Treebank style) POS tags
    print("Fine tags:")
    s_tags = [token.tag_ for token in processed_sent]
    all_tags += s_tags
    print(s_tags)


****************************************
Abraham Lincoln: The Legacy of a Great Leader  Abraham Lincoln, the 16th President of the United States, remains one of the most revered ﬁgures in American history.
Tokens:
['Abraham', 'Lincoln', ':', 'The', 'Legacy', 'of', 'a', 'Great', 'Leader', ' ', 'Abraham', 'Lincoln', ',', 'the', '16th', 'President', 'of', 'the', 'United', 'States', ',', 'remains', 'one', 'of', 'the', 'most', 'revered', 'ﬁgures', 'in', 'American', 'history', '.']
Fine tags:
['NNP', 'NNP', ':', 'DT', 'NNP', 'IN', 'DT', 'NNP', 'NNP', '_SP', 'NNP', 'NNP', ',', 'DT', 'JJ', 'NNP', 'IN', 'DT', 'NNP', 'NNP', ',', 'VBZ', 'CD', 'IN', 'DT', 'RBS', 'VBN', 'NNS', 'IN', 'JJ', 'NN', '.']
****************************************
Born on February 12, 1809, in a log cabin in Hardin County, Kentucky, Lincoln rose from humble beginnings to become a towering ﬁgure whose leadership during one of the nation's darkest periods, the Civil War, solidiﬁed his place as one of America's greatest presi

### b) Analyze the frequency distribution of POS tags in the text. Which POS tags are most frequent, and what do they signify about the article? Share your insights.  

This is an open ended question and a decent explanation will fetch full marks. 

In [15]:
# make a frequency map using https://www.tutorialspoint.com/list-frequency-of-elements-in-python
import collections
# using Counter to find frequency of elements
frequency = collections.Counter(all_tags)

# printing the frequency
print(dict(frequency.most_common()))

{'NN': 159, 'IN': 129, 'DT': 96, 'NNP': 80, 'JJ': 63, ',': 41, 'CC': 39, '.': 32, 'VBD': 32, 'NNS': 30, 'PRP$': 26, 'VBG': 23, 'RB': 16, 'VBN': 14, 'VB': 12, 'POS': 12, 'CD': 11, 'PRP': 10, 'VBZ': 6, 'TO': 6, '_SP': 5, 'VBP': 5, '``': 4, "''": 4, 'WDT': 3, 'RBS': 2, 'HYPH': 2, 'MD': 2, ':': 1, 'WP$': 1, 'JJS': 1, 'RP': 1, 'UH': 1, 'JJR': 1, 'NNPS': 1}


In [16]:
nltk.download('tagsets')
print("Penn Treebank POS Tags:")
nltk.help.upenn_tagset()

Penn Treebank POS Tags:
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corpori

[nltk_data] Downloading package tagsets to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


As we can see, NN and IN are the two most frequent tags in the article. In the Penn Treebank tagset, NN marks singular common nouns and IN marks prepositions and subordinating conjunctions. The article is therefore dominated by common nouns (people, things, and abstract concepts such as "leadership" and "legacy") and by the prepositions that link them into longer phrases.

The third most frequent tag is DT (determiner), which covers words such as "the", "a", and "this". Determiners introduce noun phrases, so their high count follows naturally from the high noun count. Numerical information is tagged CD instead, and CD appears only 11 times.

NNP (singular proper noun) is fourth, reflecting how often the article names specific people and places such as Lincoln and Kentucky.
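To make the comparison between word classes more concrete, the fine-grained tags can be rolled up into broader groups. The sketch below reuses the counts printed in the frequency output above. The `GROUPS` mapping is an assumption made for illustration, not something defined by SpaCy or NLTK.

```python
from collections import Counter

# Tag counts copied from the frequency output above
tag_counts = Counter({
    "NN": 159, "IN": 129, "DT": 96, "NNP": 80, "JJ": 63, ",": 41, "CC": 39,
    ".": 32, "VBD": 32, "NNS": 30, "PRP$": 26, "VBG": 23, "RB": 16, "VBN": 14,
    "VB": 12, "POS": 12, "CD": 11, "PRP": 10, "VBZ": 6, "TO": 6, "_SP": 5,
    "VBP": 5, "``": 4, "''": 4, "WDT": 3, "RBS": 2, "HYPH": 2, "MD": 2,
    ":": 1, "WP$": 1, "JJS": 1, "RP": 1, "UH": 1, "JJR": 1, "NNPS": 1,
})

# Hypothetical coarse grouping of Penn Treebank tags (illustrative assumption)
GROUPS = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adjective": {"JJ", "JJR", "JJS"},
}

total = sum(tag_counts.values())
group_totals = {
    name: sum(tag_counts[tag] for tag in tags) for name, tags in GROUPS.items()
}

for name, count in group_totals.items():
    print(f"{name}: {count} of {total} tokens ({count / total:.1%})")
```

Grouping this way shows how much of the text is made up of nouns of any kind compared with verbs, which is harder to see when the counts are spread across several related tags.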

## 3. Named Entity Recognition (NER)

### a) Using SpaCy library extract all Named Entities (NE) from the text (such as persons, organizations, locations, dates, etc.)

In [24]:
# Map each distinct entity string to the label of its first occurrence
ne_dict = {}
for sent in tokens:
    processed_sentence = nlp(sent)
    for entity in processed_sentence.ents:
        if entity.text not in ne_dict:
            ne_dict[entity.text] = entity.label_
        print(entity.text, entity.label_)

Abraham Lincoln PERSON
Abraham Lincoln PERSON
16th ORDINAL
the United States GPE
American NORP
February 12, 1809 DATE
Hardin County GPE
Kentucky GPE
Lincoln ORG
the Civil War EVENT
America GPE
Great Emancipator WORK_OF_ART
Union ORG
Career PRODUCT
Lincoln ORG
Lincoln ORG
Illinois GPE
Lincoln ORG
1860 DATE
the Republican Party's ORG
the Civil War EVENT
Lincoln ORG
the United States GPE
North LOC
South LOC
oYice ORG
Lincoln ORG
the Civil War EVENT
1861 DATE
Lincoln ORG
Lincoln ORG
Lincoln ORG
Lincoln ORG
Gettysburg Address FAC
Emancipation Proclamation WORK_OF_ART
One CARDINAL
Lincoln ORG
1863 DATE
the Emancipation Proclamation LAW
American NORP
Confederate ORG
Lincoln ORG
the Declaration of Independence WORK_OF_ART
the Emancipation Proclamation LAW
Lincoln ORG
the Thirteenth Amendment LAW
1865 DATE
Abraham Lincoln's PERSON
American NORP
the Civil War EVENT
Union ORG
Americans NORP
Lincoln ORG
the Gettysburg Address ORG
second ORDINAL
American NORP
Abraham Lincoln's PERSON
American NORP


### b) Provide a list of named entities along with their types.

In [26]:
ne_dict

{'Abraham Lincoln': 'PERSON',
 '16th': 'ORDINAL',
 'the United States': 'GPE',
 'American': 'NORP',
 'February 12, 1809': 'DATE',
 'Hardin County': 'GPE',
 'Kentucky': 'GPE',
 'Lincoln': 'ORG',
 'the Civil War': 'EVENT',
 'America': 'GPE',
 'Great Emancipator': 'WORK_OF_ART',
 'Union': 'ORG',
 'Career': 'PRODUCT',
 'Illinois': 'GPE',
 '1860': 'DATE',
 "the Republican Party's": 'ORG',
 'North': 'LOC',
 'South': 'LOC',
 'oYice': 'ORG',
 '1861': 'DATE',
 'Gettysburg Address': 'FAC',
 'Emancipation Proclamation': 'WORK_OF_ART',
 'One': 'CARDINAL',
 '1863': 'DATE',
 'the Emancipation Proclamation': 'LAW',
 'Confederate': 'ORG',
 'the Declaration of Independence': 'WORK_OF_ART',
 'the Thirteenth Amendment': 'LAW',
 '1865': 'DATE',
 "Abraham Lincoln's": 'PERSON',
 'Americans': 'NORP',
 'the Gettysburg Address': 'ORG',
 'second': 'ORDINAL',
 'one': 'CARDINAL'}

### c) Analyze the frequency distribution of Named Entities. Which entity is most frequent (ideally it should be Abraham Lincoln or Lincoln). Which entity is the second most frequent? Write down your observations.

In [33]:
# doing the frequency analysis from part 2
entities_list = []
for sent in tokens:
    processed_sentence = nlp(sent)
    for entity in processed_sentence.ents:
        entities_list.append(entity.text) # create a list of named entities

# using Counter to find frequency of elements
ne_frequency = collections.Counter(entities_list)

# printing the frequency
print(dict(ne_frequency.most_common()))

{'Lincoln': 15, 'American': 5, 'the Civil War': 4, 'Abraham Lincoln': 2, 'the United States': 2, 'Great Emancipator': 2, 'Union': 2, 'the Emancipation Proclamation': 2, "Abraham Lincoln's": 2, '16th': 1, 'February 12, 1809': 1, 'Hardin County': 1, 'Kentucky': 1, 'America': 1, 'Career': 1, 'Illinois': 1, '1860': 1, "the Republican Party's": 1, 'North': 1, 'South': 1, 'oYice': 1, '1861': 1, 'Gettysburg Address': 1, 'Emancipation Proclamation': 1, 'One': 1, '1863': 1, 'Confederate': 1, 'the Declaration of Independence': 1, 'the Thirteenth Amendment': 1, '1865': 1, 'Americans': 1, 'the Gettysburg Address': 1, 'second': 1, 'one': 1}


As expected, "Lincoln" is the most frequent named entity, with 15 occurrences. This makes sense because Abraham Lincoln is the subject of the article. The count would be higher still if the variants "Abraham Lincoln" (2) and "Abraham Lincoln's" (2) were merged with it. Note that SpaCy labels the bare surname "Lincoln" as ORG rather than PERSON, which is a tagging error. The second and third most frequent entities are "American" (5) and "the Civil War" (4), which also make sense given Lincoln's central role in that period of American history. Most of the remaining entities are background details that appear only once or twice, such as "Kentucky" or "Career", and a few are variants of the same entity (for example, "Emancipation Proclamation" and "the Emancipation Proclamation").
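Because the same entity appears under several surface forms, the raw counts understate how often it is mentioned. The sketch below merges variants by stripping a trailing possessive and a leading "the", then applying a small alias table. The counts are a subset copied from the output above. `normalize_entity` and `ALIASES` are hypothetical names, and the rules are deliberately simple assumptions rather than a general solution.

```python
from collections import Counter

# Subset of the entity counts printed above
raw_counts = {
    "Lincoln": 15,
    "American": 5,
    "the Civil War": 4,
    "Abraham Lincoln": 2,
    "Abraham Lincoln's": 2,
    "the Emancipation Proclamation": 2,
    "Emancipation Proclamation": 1,
    "Gettysburg Address": 1,
    "the Gettysburg Address": 1,
}

# Hand-written alias table (an assumption for this article only)
ALIASES = {"Abraham Lincoln": "Lincoln"}

def normalize_entity(text):
    """Reduce an entity string to a canonical form for counting."""
    text = text.strip()
    if text.endswith("'s"):              # drop a trailing possessive
        text = text[:-2]
    if text.lower().startswith("the "):  # drop a leading article
        text = text[4:]
    return ALIASES.get(text, text)

merged = Counter()
for name, count in raw_counts.items():
    merged[normalize_entity(name)] += count

print(merged.most_common())
```

With variants merged, the ranking of the top entities stays the same, but the gap between "Lincoln" and everything else becomes wider.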

## 4. Dependency Parsing

### a) Using SpaCy library perform Dependency Parsing on each sentence. 

In [37]:
# Parse each sentence and print every dependency arc.
# ROOT tokens are collected here for use in part b).
root_words = []
for sent in tokens:
    processed_sentence = nlp(sent)
    for token in processed_sentence:
        if token.dep_ == "ROOT":
            # A ROOT token is its own head, so token.text == token.head.text
            root_words.append(token.text)
        print(f"{token.text} --({token.dep_})--> {token.head.text}")

Abraham --(compound)--> Lincoln
Lincoln --(ROOT)--> Lincoln
: --(punct)--> Lincoln
The --(det)--> Legacy
Legacy --(nsubj)--> remains
of --(prep)--> Legacy
a --(det)--> Lincoln
Great --(compound)--> Leader
Leader --(compound)--> Lincoln
  --(dep)--> Leader
Abraham --(compound)--> Lincoln
Lincoln --(pobj)--> of
, --(punct)--> Lincoln
the --(det)--> President
16th --(amod)--> President
President --(appos)--> Lincoln
of --(prep)--> President
the --(det)--> States
United --(compound)--> States
States --(pobj)--> of
, --(punct)--> remains
remains --(ROOT)--> remains
one --(attr)--> remains
of --(prep)--> one
the --(det)--> ﬁgures
most --(advmod)--> revered
revered --(amod)--> ﬁgures
ﬁgures --(pobj)--> of
in --(prep)--> ﬁgures
American --(amod)--> history
history --(pobj)--> in
. --(punct)--> remains
Born --(advcl)--> rose
on --(prep)--> Born
February --(pobj)--> on
12 --(nummod)--> February
, --(punct)--> February
1809 --(nummod)--> February
, --(punct)--> rose
in --(prep)--> rose
a --(det)-

### b) Identify the ROOT word in each sentence. Ideally these should represent verbs (barring a few aberrations because of parsing mistakes by the parser). List down all unique ROOT words. 

In [39]:
print(root_words)

['Lincoln', 'remains', 'rose', 'continue', 'marked', 'experienced', 'possessed', 'embarked', 'grew', 'assumed', 'been', 'faced', 'tested', 'demonstrated', 'balanced', 'faced', 'assailed', 'remained', 'is', 'issued', 'was', 'was', 'viewed', 'struck', 'left', 'preserved', 'inspired', 'extends', 'remain', 'serves', 'embody', 'endures', 'reminded']
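The list printed above still contains repeats ("faced" and "was" each appear twice), while the question asks for unique ROOT words. A minimal sketch of de-duplicating while keeping the order of first appearance is shown here, using the list copied from the output above.

```python
# ROOT words copied from the output above
root_words = [
    'Lincoln', 'remains', 'rose', 'continue', 'marked', 'experienced',
    'possessed', 'embarked', 'grew', 'assumed', 'been', 'faced', 'tested',
    'demonstrated', 'balanced', 'faced', 'assailed', 'remained', 'is',
    'issued', 'was', 'was', 'viewed', 'struck', 'left', 'preserved',
    'inspired', 'extends', 'remain', 'serves', 'embody', 'endures', 'reminded',
]

# dict.fromkeys keeps only the first occurrence of each key and preserves
# insertion order, so the narrative sequence of the roots is retained.
unique_roots = list(dict.fromkeys(root_words))

print(len(root_words), "roots,", len(unique_roots), "unique")
print(unique_roots)
```

A plain `set(root_words)` would also remove duplicates, but it would lose the order in which the roots occur in the article, which matters for part c).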


### c) Observe the ROOT words and offer insights on what kind of sequence of actions happened in Lincoln's lifetime. 


Most of these ROOT words are in the past tense ("rose", "marked", "experienced", "faced", "issued"), as expected in a biography. The present-tense roots stand out, and the first of them is "remains". It tells me that the article considers Lincoln's legacy to "remain" even to this day, which it does. In a historical context, many of the past-tense roots describe struggle and growth, such as "marked", "grew", "faced", "tested", "demonstrated", and "assailed". I think they represent the adversity that Lincoln went through in his life, and how he faced this adversity and grew from it as a person.

Going further, several roots evoke *legacy*, such as "remain", "serves", "embody", "endures", and "rose" (as a verb). Because this is an article about a historical figure, it stands to reason that Lincoln's legacy is presented in a positive light, which matches the real world. Even if we knew nothing about Lincoln beforehand, the ROOT words alone suggest that he struggled, faced adversity, and overcame it, eventually establishing himself as a legend worthy of the legacy he left behind.