Achintya Yedavalli

# Assignment 4: Information Extraction from Text

The objective of this assignment is to apply SpaCy's POS tagging, NER, and dependency parsing to an article about Abraham Lincoln. By dissecting the text with SpaCy, we gain insight into the linguistic structure, named entities, and syntactic relationships present in the article.

We will be using `Abraham_Lincoln.pdf` for this assignment.

## 1. Data Loading & Processing

### a) Load the dataset using an appropriate PDF library.

In [16]:
%pip install PyPDF2
# Extract the text of every page from the PDF
import PyPDF2

page_data = []
# Open in binary mode; the context manager closes the file after reading
with open('Abraham_Lincoln.pdf', mode='rb') as f:
    pdfdoc = PyPDF2.PdfReader(f)
    for page in pdfdoc.pages:
        page_data.append(page.extract_text())

print("Text extracted using PyPDF2: \n", page_data)

Note: you may need to restart the kernel to use updated packages.
Text extracted using PyPDF2: 
 ['Abraham Lincoln: The Legacy of a Great Leader  Abraham Lincoln, the 16th President of the United States, remains one of the most revered ﬁgures in American history. Born on February 12, 1809, in a log cabin in Hardin County, Kentucky, Lincoln rose from humble beginnings to become a towering ﬁgure whose leadership during one of the nation\'s darkest periods, the Civil War, solidiﬁed his place as one of America\'s greatest presidents. His enduring legacy as the "Great Emancipator" and his unwavering commitment to preserving the Union continue to inspire generations around the world.  Early Life and Career  Lincoln\'s early life was marked by hardship and adversity. Growing up in a frontier environment, he experienced poverty, the loss of his mother at a young age, and limited access to formal education. Despite these challenges, Lincoln possessed an insatiable thirst for knowledge and self-
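The extracted text above still contains typographic ligatures from the PDF (for example the single character `ﬁ` in `ﬁgures` and `solidiﬁed`) and doubled spaces where a heading runs into the next sentence. A minimal sketch of an optional cleanup step, assuming Unicode NFKC normalization is acceptable for this article; `clean_text` is a hypothetical helper name and is not part of the assignment:

```python
import re
import unicodedata

def clean_text(text):
    """Expand compatibility characters (e.g. the 'fi' ligature) and collapse whitespace."""
    # NFKC replaces compatibility characters such as U+FB01 with plain letters
    normalized = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (double spaces, newlines) into single spaces
    return re.sub(r"\s+", " ", normalized).strip()

sample = "one of the most revered \ufb01gures  in American history."
print(clean_text(sample))
```

In the notebook this could be applied to each page before tokenizing, for example `page_data = [clean_text(p) for p in page_data]`. The outputs recorded in this notebook were produced without this step.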

### b) Convert the entire extracted text into a set of sentences using NLTK's Sentence Tokenizer.

In [17]:
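import nltk

def download_nltk_dataset(dataset_name):
    # One-time download of an NLTK dataset, ONLY IF it is not found locally.
    # Caveat: nltk.data.find() looks resources up by path (e.g. "tokenizers/punkt"),
    # so with a bare package name the lookup is expected to fail and the
    # download branch runs even when the package is already installed.
    try:
        nltk.data.find(dataset_name)
    except LookupError:
        nltk.download(dataset_name)
        print(f"Downloaded {dataset_name}")
    else:
        print(f"{dataset_name} is already downloaded")

download_nltk_dataset("punkt")
download_nltk_dataset("averaged_perceptron_tagger")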

Downloaded punkt
Downloaded averaged_perceptron_tagger


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
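
The log above reports both packages as already up-to-date even though the cell printed "Downloaded ...". This is consistent with `nltk.data.find` being given a bare package name, since it looks resources up by path (for example `tokenizers/punkt`). A minimal sketch of a path-aware variant; `NLTK_RESOURCE_PATHS` and `ensure_nltk_package` are hypothetical names, and the mapping covers only the two packages used in this notebook:

```python
# Resource paths under nltk_data for the two packages used in this notebook
NLTK_RESOURCE_PATHS = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
}

def ensure_nltk_package(package_name):
    """Download an NLTK package only if its resource path is not found locally.

    Returns True if a download was triggered, False if it was already present.
    """
    import nltk  # imported lazily so the mapping can be inspected without NLTK

    resource_path = NLTK_RESOURCE_PATHS[package_name]
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package_name)
        return True
    return False
```

Calling `ensure_nltk_package("punkt")` in place of `download_nltk_dataset("punkt")` should then skip the download when the tokenizer is already installed.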


In [18]:
from nltk.tokenize import sent_tokenize

# Join all extracted pages into one string before sentence splitting
full_text = " ".join(page_data)
tokens = sent_tokenize(full_text)  # list of sentence strings
print(tokens)

['Abraham Lincoln: The Legacy of a Great Leader  Abraham Lincoln, the 16th President of the United States, remains one of the most revered ﬁgures in American history.', "Born on February 12, 1809, in a log cabin in Hardin County, Kentucky, Lincoln rose from humble beginnings to become a towering ﬁgure whose leadership during one of the nation's darkest periods, the Civil War, solidiﬁed his place as one of America's greatest presidents.", 'His enduring legacy as the "Great Emancipator" and his unwavering commitment to preserving the Union continue to inspire generations around the world.', "Early Life and Career  Lincoln's early life was marked by hardship and adversity.", 'Growing up in a frontier environment, he experienced poverty, the loss of his mother at a young age, and limited access to formal education.', 'Despite these challenges, Lincoln possessed an insatiable thirst for knowledge and self-improvement, educating himself through voracious reading and a keen interest in the la

## 2. Part of Speech Tag Based Analysis [3 Marks]

### a) Using the SpaCy library, extract fine-grained POS tags for each sentence above (Hint: look for "tok.tag_" while looping over the tokens after processing the sentence using SpaCy).

In [19]:
import spacy

# The loaded pipeline performs tokenization, tagging, and parsing
nlp = spacy.load("en_core_web_sm")

for sent in tokens:
    print("****************************************")
    print(sent)
    processed_sent = nlp(sent)

    # Tokenization results (named token_texts so the sentence list `tokens` is not overwritten)
    print("Tokens:")
    token_texts = [token.text for token in processed_sent]
    print(token_texts)

    # Fine-grained POS tags (Penn Treebank style, e.g. NNP, VBZ)
    print("Fine tags:")
    tags = [token.tag_ for token in processed_sent]
    print(tags)


****************************************
Abraham Lincoln: The Legacy of a Great Leader  Abraham Lincoln, the 16th President of the United States, remains one of the most revered ﬁgures in American history.
Tokens:
['Abraham', 'Lincoln', ':', 'The', 'Legacy', 'of', 'a', 'Great', 'Leader', ' ', 'Abraham', 'Lincoln', ',', 'the', '16th', 'President', 'of', 'the', 'United', 'States', ',', 'remains', 'one', 'of', 'the', 'most', 'revered', 'ﬁgures', 'in', 'American', 'history', '.']
Fine tags:
['NNP', 'NNP', ':', 'DT', 'NNP', 'IN', 'DT', 'NNP', 'NNP', '_SP', 'NNP', 'NNP', ',', 'DT', 'JJ', 'NNP', 'IN', 'DT', 'NNP', 'NNP', ',', 'VBZ', 'CD', 'IN', 'DT', 'RBS', 'VBN', 'NNS', 'IN', 'JJ', 'NN', '.']
****************************************
Born on February 12, 1809, in a log cabin in Hardin County, Kentucky, Lincoln rose from humble beginnings to become a towering ﬁgure whose leadership during one of the nation's darkest periods, the Civil War, solidiﬁed his place as one of America's greatest presi

### b) Analyze the frequency distribution of POS tags in the text. Which POS tags are most frequent, and what do they signify about the article? Share your insights.  

This is an open-ended question, and a decent explanation will fetch full marks.
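
One way to approach this is to pool the fine-grained tags from all sentences and count them. A minimal sketch using only the standard library; `tag_frequencies` is a hypothetical helper, and the sample input is just the tag list printed for the first sentence above, so the counts describe that one sentence rather than the whole article:

```python
from collections import Counter

def tag_frequencies(tag_lists):
    """Count POS tags across a list of per-sentence tag lists."""
    counts = Counter()
    for tags in tag_lists:
        counts.update(tags)
    return counts

# Fine tags printed for the first sentence in the output above
first_sentence_tags = [
    'NNP', 'NNP', ':', 'DT', 'NNP', 'IN', 'DT', 'NNP', 'NNP', '_SP',
    'NNP', 'NNP', ',', 'DT', 'JJ', 'NNP', 'IN', 'DT', 'NNP', 'NNP',
    ',', 'VBZ', 'CD', 'IN', 'DT', 'RBS', 'VBN', 'NNS', 'IN', 'JJ',
    'NN', '.',
]

freq = tag_frequencies([first_sentence_tags])
for tag, count in freq.most_common(3):
    print(tag, count)
```

For this first sentence alone, proper nouns (`NNP`), determiners (`DT`), and prepositions (`IN`) are the most frequent tags, which fits a sentence built around names and titles. In the notebook, the `tags` list from each loop iteration could be appended to a list and passed to `tag_frequencies`; whether the same ranking holds for the full article has to be checked against the full output.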