# Lab 2: Loading and Navigating Text Corpora and PDF Files, Analyzing Patterns Using Regular Expressions, and an Introduction to the NLTK and spaCy Libraries

Note: this is a **graded** lab session. Complete all exercises and upload them to Canvas under **Lab 2: PDF Files, Datasets, RegEx and more** (https://utexas.instructure.com/courses/1382133/assignments/6619547?module_item_id=13585840) no later than **01/25/2024, 11:59 PM** (labs are meant to be completed in class).


**Also Note:** The first take-home assignment, "Assignment 1: Regular Expressions based Pattern Extraction on PDF data", will be posted tonight.

## 1. Processing PDF files and extracting Textual Data

By now, we are familiar with the basics of reading and writing simple text files (with a `.txt` extension). However, before delving deeper into advanced text processing techniques, it's crucial to understand how to automatically read and process various file formats. In this context, let's explore PDF file processing, complementing our knowledge of processing web pages using BeautifulSoup and related libraries (previously discussed in Lab1). PDFs and web pages stand out as significant sources of data, collectively presenting vast amounts of text data for text mining endeavors.

Before you execute the example code below, please download the sample PDF file from **https://www.fdic.gov/news/events/affordable/hcachecklist.pdf** and place it under **Files**. You can, of course, first open the file locally and examine the content.

To read PDF content, we need a specialized library, *PyPDF2*, which may not be installed in your environment.

Let's run the installation commands first.

In [None]:
%pip install PyPDF2
%pip install nltk
%pip install spacy

In [None]:
import PyPDF2

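# Open the sample PDF in binary mode. The file is left open on purpose:
# `pdfdoc` is reused in Exercise E3.
f = open(r'hcachecklist.pdf', mode='rb')
pdfdoc = PyPDF2.PdfReader(f)

# Collect the extracted text of each page as one list element
page_data = []
for page in pdfdoc.pages:
  text = page.extract_text()
  page_data.append(text)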

print("Text extracted using PyPDF2: \n", page_data)

The text of each page is now stored as one element of the list `page_data`. We can use this data for downstream processing.
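As one possible downstream step, the per-page strings can be flattened into individual lines while remembering which page each line came from. This is a minimal sketch: `iter_lines` is a hypothetical helper, not part of PyPDF2, and the sample list stands in for a real `page_data`.

```python
def iter_lines(page_texts):
    """Yield (page_number, line) pairs for every non-empty line; pages are numbered from 1."""
    for page_number, text in enumerate(page_texts, start=1):
        # Guard against pages that yield no text (assumption: e.g. scanned images)
        for line in (text or "").split("\n"):
            line = line.strip()
            if line:
                yield page_number, line


# Hand-written stand-in for page_data
sample_pages = ["Title\n\n first line ", "", "last page"]
for page_number, line in iter_lines(sample_pages):
    print(page_number, line)
```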

### Exercise E1: Fetch all URLs mentioned in a few research papers.

1. Download this conference proceeding and place it under Files. URL: https://aclanthology.org/2023.acl-long.0.pdf

2. Process the PDF file using code similar to above.

3. Extract all URLs and print them. (Hint: after splitting the text on the newline symbol `\n`, you can assume that any line containing "https" qualifies as a URL.)

In [46]:
# PyPDF2 was imported in an earlier cell

# open pdf file
de_file = open(r'2023.acl-long.0.pdf', mode='rb')
dedoc = PyPDF2.PdfReader(de_file)

la_data = []
for page in dedoc.pages:
  text = page.extract_text()
  la_data.append(text)


In [47]:
for line in la_data:
  lis = line.split("\n")
  for lane in lis:
    if "https" in lane:
      print(lane)

1https://github.com/acl-org/acl-2023-materialsxl
4https://2023.aclweb.org/committees/program/xli
10https://aclrollingreview.org/reviewertutorialxlv
11https://2023.aclweb.org/blog/reviewer-assignment/xlvi
13https://www.sphinx-doc.org/
14https://myst-parser.readthedocs.io/xlix
40https://2023.aclweb.org/program/best_reviewerslxix
41https://2023.aclweb.org/blog/visa-info/lxxiii
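In the output above, footnote numbers are fused to the front of each URL because the check `"https" in lane` prints the whole line. A regular expression can isolate just the URL part. The sketch below is one way to do it, with `URL_RE` and `urls_in_line` as illustrative names; it cannot remove characters fused to the *end* of a URL (for example what appear to be page numerals), since those are indistinguishable from the URL itself.

```python
import re

# A URL is taken to be "http://" or "https://" followed by non-whitespace characters
URL_RE = re.compile(r"https?://\S+")


def urls_in_line(line):
    """Return every URL-like substring found in a single line of text."""
    return URL_RE.findall(line)


# Two lines copied from the output above, plus one line without a link
for line in ["13https://www.sphinx-doc.org/",
             "1https://github.com/acl-org/acl-2023-materialsxl",
             "no link here"]:
    print(urls_in_line(line))
```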


## 2. Navigating through existing Public Datasets

As we discussed in our last lecture, several datasets are available for public use, and Hugging Face (http://huggingface.co) is a hub that hosts a number of datasets, processed and readily available for building NLP applications. Let's check some datasets below.

Datasets from Hugging Face are made available through a library called `datasets`, which we have to install first.



In [None]:
%pip install datasets

In [None]:
from datasets import load_dataset

# Load IMDb reviews dataset as an example. This is where the data is hosted: https://huggingface.co/datasets/imdb
dataset = load_dataset("imdb")

# Get a sample text from the "train" split of the dataset
sample_text = dataset["train"][0]
print("Sample Text:\n", sample_text)

### Exercise E2: Print samples from the following datasets

In separate code blocks:
1. Try to print a few sample sentences from the `spanish_billion_words` corpus and the `simple_questions_v2` corpus.

2. Be sure to visit the Hugging Face website, type these corpus names in the search box, look at the data format, and modify the code above to get the textual elements.

In [None]:
# Load the spanish_billion_words dataset
dataset2 = load_dataset("spanish_billion_words")
sample_text2 = dataset2["train"][0]
print("Sample Text:\n", sample_text2)

My virtual workspace has no free space, the dataset is too computationally expensive to load locally, and Colab estimates 5 hours to finish, so this cell was not run. I tested the other dataset with the same code and it worked, so this one should work as well.

In [48]:
# Load the simple_questions_v2 dataset
dataset3 = load_dataset("simple_questions_v2")
sample_text3 = dataset3["train"][0]
print("Sample Text:\n", sample_text3)

Sample Text:
 {'id': '0', 'subject_entity': 'www.freebase.com/m/04whkz5', 'relationship': 'www.freebase.com/book/written_work/subjects', 'object_entity': 'www.freebase.com/m/01cj3p', 'question': 'what is the book e about\n'}


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
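Each record comes back as a dictionary, so the textual element can be picked out by its key. A minimal sketch, re-creating the record printed above as a plain Python dict so that no download is needed; the variable names are illustrative:

```python
# The record shown in the output above, copied as a plain dict
record = {
    "id": "0",
    "subject_entity": "www.freebase.com/m/04whkz5",
    "relationship": "www.freebase.com/book/written_work/subjects",
    "object_entity": "www.freebase.com/m/01cj3p",
    "question": "what is the book e about\n",
}

# Select the text field and drop the trailing newline
question = record["question"].strip()
print(question)
```

With a loaded dataset, the same key lookup applies to `dataset3["train"][0]`.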


## 3. Basic Regular Expression Patterns

Regular expressions, often abbreviated as regex or regexp, are sequences of characters that define a search pattern. They are powerful tools for string manipulation and matching within text. Regular expressions provide a concise and flexible means to:
  
- **Search Patterns:** Specify patterns of characters to search for in a text.
- **Match Patterns:** Identify whether a string conforms to a given pattern.
- **Extract Information:** Extract specific data from strings based on defined patterns.
- **Replace Patterns:** Replace occurrences of a pattern with a specified replacement.

For a detailed overview on Python RegEx, please follow: https://www.geeksforgeeks.org/python-regex/
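The four uses listed above map onto functions in Python's built-in `re` module. The following is a small sketch on a made-up sentence; the sample text and variable names are assumptions for illustration.

```python
import re

text = "Order 66 shipped on 2024-01-25 to contact@example.com."

# Search: find the first occurrence of a pattern anywhere in the text
first_number = re.search(r"\d+", text).group()

# Match: check whether an entire string conforms to a pattern
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-01-25") is not None

# Extract: collect every substring that fits the pattern
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Replace: substitute every match (here, each digit) with a replacement
masked = re.sub(r"\d", "#", text)

print(first_number, is_date, dates, masked, sep="\n")
```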

Let's say we want to extract important information from text such as dates, emails, and phone numbers. We can use regular expressions to extract and display these patterns from the text as follows:

In [None]:
import re

# Sample text containing dates, emails, and phone numbers
sample_text = "Meeting on 2022-05-20, contact@example.com, and call me at 123-456-7890."

# Regular expressions for extracting patterns
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample_text)
emails = re.findall(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[a-z]+', sample_text)
phone_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', sample_text)

print("Dates:", dates)
print("Emails:", emails)
print("Phone Numbers:", phone_numbers)
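The email pattern above works for the sample sentence, but it only allows letters and digits on either side of the `@`, so addresses with dots or multi-part domains are cut short. The sketch below contrasts it with a somewhat broader pattern. `BROAD` is an illustrative assumption, not a full email validator, and the address is made up.

```python
import re

NARROW = r"[A-Za-z0-9]+@[A-Za-z0-9]+\.[a-z]+"   # pattern from the cell above
BROAD = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"         # also allows dots, '+', '-' and sub-domains

text = "Write to first.last@mail.example.org."

narrow_matches = re.findall(NARROW, text)
broad_matches = re.findall(BROAD, text)

print("Narrow:", narrow_matches)
print("Broad: ", broad_matches)
```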


### Exercise E3:
1. Following section 1, extract all text from `hcachecklist.pdf` (you can copy and paste the code from section 1). Now, list all two-word phrases that follow the patterns `a <word>`, `an <word>` and `the <word>`. Some example phrases: `a key`, `the document`.

2. For the research paper at https://arxiv.org/abs/1706.03762 , programmatically find out whether there is any link to a codebase that can be used to replicate the paper. (Hint: load the text of the paper using PyPDF2, look for URLs, clean the URLs, and then keep only the ones that contain `github.com`. Print the final list of URLs.)

3. **[Optional, not-graded]** Repeat 2 for all papers from the EMNLP 2023 conference. You will have to scrape the website https://aclanthology.org/events/emnlp-2023/#2023emnlp-main , get the URLs of all PDF papers, download the PDFs from within your program, and repeat #2 for each of them. Dump the final list of GitHub URLs in a file.

In [49]:
import re
# question 1
page_str = ""
for page in pdfdoc.pages:
  text = page.extract_text()
  text = text.lower()
  page_str += text

#print(page_str)
two_word = re.findall(r'\b((?:a|an|the)\s+\w+)\b', page_str)
# \b anchors at word boundaries; (?:a|an|the) is a non-capturing group for the article;
# \s+\w+ matches the whitespace and the next word (\w also matches '_', and \s matches '\n')
print(two_word)

['the list', 'the document', 'the document', 'the doc', 'the documents', 'the ones', 'the \ncompleted', 'a copy', 'the backs', 'an off', 'the key', 'a safe', 'a copy', 'a \nfireproof', 'an emergency', 'an attorney', 'a copy', 'a sealed', 'the event', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____3', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a _____', 'a date', 'a small', 'an off', 'a secure']
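The output above contains blank-line fillers such as `a _____` and phrases split by a newline such as `the \ncompleted`, because `\w` matches underscores and `\s` matches newlines. One possible refinement, sketched here under the assumption that only alphabetic words are wanted, is to normalize whitespace first and require letters after the article. `article_phrases` and the sample sentence are hypothetical.

```python
import re


def article_phrases(text):
    """Return 'a/an/the <word>' phrases where <word> consists of letters only."""
    # Lowercase and collapse all whitespace (including newlines) to single spaces
    normalized = " ".join(text.lower().split())
    return re.findall(r"\b(?:a|an|the) [a-z]+\b", normalized)


sample = "Keep a copy of the \ncompleted list in a _____ and an off-site box."
print(article_phrases(sample))
```

Hyphenated words are still cut at the hyphen, which may or may not be desirable depending on the task.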


In [50]:
# question 2
# open pdf file
paper = open(r'1706.03762.pdf', mode='rb')
dedoc = PyPDF2.PdfReader(paper)

la_data = []
for page in dedoc.pages:
  text = page.extract_text()
  la_data.append(text)


In [51]:
urls = []
for line in la_data:
  lis = line.split("\n")
  for lane in lis:
    if ".com" in lane or ".gov" in lane or ".org" in lane:
      urls.append(lane)

urls

['avaswani@google.comNoam Shazeer∗',
 'noam@google.comNiki Parmar∗',
 'nikip@google.comJakob Uszkoreit∗',
 'usz@google.com',
 'llion@google.comAidan N. Gomez∗ †',
 'lukaszkaiser@google.com',
 'illia.polosukhin@gmail.com',
 'The code we used to train and evaluate our models is available at https://github.com/']
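The list above mixes email addresses with the one real link, and that link is cut off at `https://github.com/` because the URL wraps onto the next line of the PDF. A possible way to follow the hint more closely is sketched below: rejoin URLs broken after a `/`, extract URLs with a regular expression, strip trailing punctuation, and keep only those containing `github.com`. The rejoining rule is a heuristic assumption, `find_github_urls` is a hypothetical helper, and the sample pages are made up rather than taken from the paper.

```python
import re

# Stop at whitespace or a closing bracket so surrounding punctuation is not swallowed
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def find_github_urls(page_texts):
    """Return de-duplicated URLs containing 'github.com', in order of first appearance."""
    found = []
    for text in page_texts:
        # Heuristic: a '/' directly before a line break means the URL continues on the next line
        joined = re.sub(r"/\n", "/", text)
        for url in URL_PATTERN.findall(joined):
            url = url.rstrip(".,;:")  # drop sentence punctuation fused to the URL
            if "github.com" in url and url not in found:
                found.append(url)
    return found


# Hand-written stand-in for la_data
sample_pages = [
    "Code is available at https://github.com/\nexample-org/example-repo.\nContact: someone@example.com",
    "See also https://example.org/docs for details.",
]
print(find_github_urls(sample_pages))
```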

## 4. Introduction to NLTK and spaCy: Two popular NLP libraries for text processing

**NLTK (Natural Language Toolkit):**

NLTK is a powerful library for working with human language data. It provides easy-to-use interfaces to linguistic resources and algorithms, making it an excellent tool for natural language processing, text analysis, and machine learning. NLTK includes functionalities for tokenization, stemming, tagging, parsing, and more.

**spaCy:**

spaCy is a modern natural language processing library that is designed for efficiency and production use. It excels in processing large volumes of text quickly and accurately. spaCy provides pre-trained models for various languages, covering tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

**Part-of-Speech (PoS) Tagging Example:**

The example below demonstrates simple PoS tagging with NLTK and spaCy:

- NLTK is used for tokenization and PoS tagging.
- spaCy tokenizes and tags the sentence in a single `nlp(...)` call.

Both libraries report the part of speech of each word in the given sentence. Run the code below to see the PoS tagging results.

In [None]:
# NLTK PoS Tagging Example
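import nltk
from nltk.tokenize import word_tokenize
# Download the tokenizer and tagger resources (newer NLTK releases may instead
# ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')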

sentence = "NLTK is a powerful library for natural language processing."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

print("NLTK PoS Tagging:")
print(pos_tags)

# spaCy PoS Tagging Example
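import spacy

# Requires the small English model; if it is missing, run:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")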
doc = nlp(sentence)

spacy_pos_tags = [(token.text, token.pos_) for token in doc]

print("\nspaCy PoS Tagging:")
print(spacy_pos_tags)
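Both `nltk.pos_tag` and the spaCy list comprehension above produce a list of `(token, tag)` pairs, so the same downstream code can work on either. The tag inventories differ, though: NLTK's default English tagger emits Penn Treebank tags (such as `NN`, `VBZ`), while spaCy's `token.pos_` emits Universal POS tags (such as `NOUN`, `AUX`). The sketch below counts tags in such a list; `tag_counts` is a hypothetical helper, and the pairs are hand-written for illustration, not actual library output.

```python
from collections import Counter

# Hand-written (token, tag) pairs for illustration only; not actual library output
example_tags = [
    ("NLTK", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("powerful", "ADJ"), ("library", "NOUN"), (".", "PUNCT"),
]


def tag_counts(tagged_tokens):
    """Count how often each tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


counts = tag_counts(example_tags)
print(counts.most_common())
```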

We will use these libraries for text pre-processing and for conducting the layer-wise processing that we discussed in weeks 1 and 2.