# Lab 2: Loading and Navigating Text Corpora and PDF Files, Analyzing Patterns Using Regular Expressions, and an Introduction to the NLTK and spaCy Libraries

Note: This is a **graded** lab session. Complete all exercises and upload them to Canvas under **Lab 2: PDF Files, Datasets, RegEx and more** (https://utexas.instructure.com/courses/1382133/assignments/6619547?module_item_id=13585840) no later than **01/25/2024, 11:59 PM** (labs are meant to be completed in class).


**Also Note:** The first take-home assignment, "Assignment 1: Regular Expressions based Pattern Extraction on PDF data", will be posted tonight.

## 1. Processing PDF files and extracting Textual Data

By now, we are familiar with the basics of reading and writing simple text files (with a `.txt` extension). Before delving into more advanced text processing techniques, however, it is crucial to understand how to read and process other file formats automatically. Here we explore PDF file processing, which complements what we learned in Lab 1 about processing web pages with BeautifulSoup and related libraries. Together, PDFs and web pages are among the largest sources of text data for text mining.

Before you execute the example code below, please download the sample PDF file from **https://www.fdic.gov/news/events/affordable/hcachecklist.pdf** and place it under **Files**. You can, of course, first open the file locally and examine the content.

To read PDF content, we need a specialized library, *PyPDF2*, which may not be installed in your environment.

Let's run the installation command first.

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [3]:
import PyPDF2

# Path to the sample PDF file. PdfReader accepts a path directly,
# so no file handle is left open.
pdfdoc = PyPDF2.PdfReader('hcachecklist.pdf')

page_data = []
for page in pdfdoc.pages:
  text = page.extract_text()
  page_data.append(text)

print("Text extracted using PyPDF2:\n", page_data)

Text extracted using PyPDF2:
 ['1 of 4 \n \n 888-388-HOPE (4673) \nwww.hopecoalitionamerica.org\n CHECKLIST OF IMPORTANT LEGAL DOCUMENTS \nAND FINANCIAL STATEMENTS \n \nPlease review the list of important documents below  and check whether you have the document, whether \nyou need to obtain the document or whether the doc ument does not apply to your household. Next, \ncollect the documents you have and obtain the ones you still need. These documents, along with the \ncompleted forms provided here, make up your  Emergency Financial First Aid Kit (EFFAK). \nOnce you have all of these documents together, y ou should make a copy of your entire EFFAK. As \nimportant information is often printed on the backs of  these documents, please be sure to copy both \nsides. \nBecause these documents contain such important and  personal information, we strongly recommend that \nyou keep all original documents, photographs and co mputer backup disks in an off-site safety deposit \nbox.  And be sure to

The code above stores the text extracted from each page as one element of the list `page_data`. We can now use this list for downstream processing.
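As one small illustration of such downstream processing, the sketch below joins the per-page strings into a single document and counts the words on each page. The list `pages` is a hypothetical, hand-written stand-in for `page_data` (it is not the actual content of the PDF), so the snippet runs without the file.

```python
# Hypothetical stand-in for `page_data`; in the lab, use the list built above.
pages = [
    "1 of 4 \nCHECKLIST OF IMPORTANT LEGAL DOCUMENTS \nAND FINANCIAL STATEMENTS",
    "2 of 4 \nKeep all original documents in a safe place.",
]

# Join all pages into one string, separating pages with a newline.
full_text = "\n".join(pages)

# Count whitespace-separated words on each page (page numbers start at 1).
words_per_page = {i: len(page.split()) for i, page in enumerate(pages, start=1)}

print(words_per_page)
print(len(full_text.splitlines()), "lines in total")
```

Keeping the pages in a list preserves page boundaries, while the joined string is convenient when a pattern may appear anywhere in the document.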

Exercise E1: Fetch all URLs mentioned in a few research papers.

1. Download this conference proceeding and place it under Files. URL: https://aclanthology.org/2023.acl-long.0.pdf

2. Process the PDF file using code similar to above.

3. Extract and print all URLs. (Hint: after splitting the text on the newline character `\n`, you can assume that any line containing "https" is a URL.)

## 2. Navigating through existing Public Datasets

As we discussed in our last lecture, several datasets are available for public use. Hugging Face (http://huggingface.co) is a hub that hosts many such datasets, processed and readily available for building NLP applications. Let's look at some of them below.

Datasets from Huggingface are made available through a library called `datasets`. We have to install it.



In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


In [7]:
from datasets import load_dataset

# Load IMDb reviews dataset as an example. This is where the data is hosted: https://huggingface.co/datasets/imdb
dataset = load_dataset("imdb")

# Get a sample text from the "train" split of the dataset
sample_text = dataset["train"][0]
print("Sample Text:\n", sample_text)

Sample Text:
 {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are 
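Each record returned by `dataset["train"][0]` is a plain Python dictionary, and the review text shown above contains HTML line breaks (`<br />`). The sketch below shows one possible way to clean and preview such a record. The `record` dictionary is a shortened, hand-made stand-in for the real record, and the helper name `preview` is hypothetical, not part of the `datasets` library.

```python
import re

def preview(record, n_chars=80):
    """Return the first n_chars of a record's text with <br /> tags removed."""
    # Replace one or more consecutive <br /> tags with a single space.
    text = re.sub(r"(?:<br\s*/?>\s*)+", " ", record["text"])
    return text[:n_chars]

# Hand-made stand-in for dataset["train"][0], shortened from the output above.
record = {
    "text": "I really had to see this for myself.<br /><br />"
            "The plot is centered around a young Swedish drama student named Lena."
}

print(preview(record))
```

With the real dataset, the same helper could be called as `preview(dataset["train"][0])`, assuming the record stores its review under the `text` key as in the output above.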

## Exercise E2: Print samples from the following datasets

In separate code blocks:
1. Print a few sample sentences from the `spanish_billion_words` corpus and from the `simple_questions_v2` corpus.

2. Be sure to visit the Hugging Face website, type these corpus names in the search box, look at the data format, and modify the code above to get the textual elements.

## 3. Basic Regular Expression Patterns

Regular expressions, often abbreviated as regex or regexp, are sequences of characters that define a search pattern. They are powerful tools for string manipulation and matching within text. Regular expressions provide a concise and flexible means to:
  
- **Search Patterns:** Specify patterns of characters to search for in a text.
- **Match Patterns:** Identify whether a string conforms to a given pattern.
- **Extract Information:** Extract specific data from strings based on defined patterns.
- **Replace Patterns:** Replace occurrences of a pattern with a specified replacement.
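The four uses listed above map onto functions in Python's built-in `re` module. A minimal sketch, using a made-up sentence purely for illustration:

```python
import re

text = "Order 66 shipped on 2024-01-25; order 67 ships later."

# Search: find the first occurrence of a pattern anywhere in the text.
first_number = re.search(r"\d+", text)
print(first_number.group())

# Match: check whether an entire string conforms to a pattern.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-01-25") is not None
print(is_date)

# Extract: collect every non-overlapping occurrence.
# With one capturing group, findall returns only the captured part.
order_ids = re.findall(r"[Oo]rder (\d+)", text)
print(order_ids)

# Replace: substitute each occurrence with a replacement string.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)
print(masked)
```

Note that `re.search` and `re.fullmatch` return `None` when nothing matches, so check the result before calling `.group()` on real data.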

For a detailed overview on Python RegEx, please follow: https://www.geeksforgeeks.org/python-regex/

Let's say we want to extract important information from text such as dates, emails, and phone numbers. We can use regular expressions to extract and display these patterns from the text as follows:

In [8]:
import re

# Sample text containing dates, emails, and phone numbers
sample_text = "Meeting on 2022-05-20, contact@example.com, and call me at 123-456-7890."

# Regular expressions for extracting patterns
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample_text)
emails = re.findall(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[a-z]+', sample_text)
phone_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', sample_text)

print("Dates:", dates)
print("Emails:", emails)
print("Phone Numbers:", phone_numbers)


Dates: ['2022-05-20']
Emails: ['contact@example.com']
Phone Numbers: ['123-456-7890']
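The email pattern above is deliberately simple: it allows only letters and digits on either side of the `@` and a single dot in the domain. The sketch below compares it with a somewhat more permissive pattern on a made-up sentence. The second pattern is only an illustration, not a complete email validator.

```python
import re

text = "Write to first.last@mail.example.com or contact@example.com."

# Pattern from the example above: no dots in the local part, one dot in the domain.
simple = r"[A-Za-z0-9]+@[A-Za-z0-9]+\.[a-z]+"

# Illustrative alternative: allows . _ % + - before the @ and multi-part domains.
# The (?:...) group is non-capturing, so findall still returns the full match.
permissive = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

simple_matches = re.findall(simple, text)
permissive_matches = re.findall(permissive, text)

print(simple_matches)      # the first address is only partially captured
print(permissive_matches)
```

When a pattern silently returns truncated matches like this, testing it on a few hand-picked edge cases is usually the quickest way to notice.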


Exercise E3:
1. Following Section 1, extract all text from `hcachecklist.pdf` (you can copy and paste the code from Section 1). Now list all two-word phrases that follow the patterns `a <word>`, `an <word>`, and `the <word>`. Some example phrases: `a key`, `the document`.

2. For the research paper at https://arxiv.org/abs/1706.03762, programmatically find out whether it links to a codebase that can be used to replicate the paper. (Hint: load the text of the paper using PyPDF2, look for URLs, clean them, and keep only those that contain `github.com`. Print the final list of URLs.)

3. **[Optional, not graded]** Repeat 2 for all papers from the EMNLP 2023 conference. You will have to scrape the website https://aclanthology.org/events/emnlp-2023/#2023emnlp-main, get the URLs of all PDF papers, download the PDFs from within your program, and repeat step 2 for each of them. Write the final list of GitHub URLs to a file.

## 4. Introduction to NLTK and spaCy: Two popular NLP libraries for text processing

**NLTK (Natural Language Toolkit):**

NLTK is a powerful library for working with human language data. It provides easy-to-use interfaces to linguistic resources and algorithms, making it an excellent tool for natural language processing, text analysis, and machine learning. NLTK includes functionalities for tokenization, stemming, tagging, parsing, and more.

**spaCy:**

spaCy is a modern natural language processing library that is designed for efficiency and production use. It excels in processing large volumes of text quickly and accurately. spaCy provides pre-trained models for various languages, covering tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

**Part-of-Speech (PoS) Tagging Example:**

Let's demonstrate a simple PoS tagging example using NLTK and spaCy:



In the code below:
- NLTK is used for tokenization and PoS tagging.
- spaCy is used for PoS tagging (it tokenizes the sentence internally).

Both libraries report the part of speech of each word in the given sentence. Feel free to run the code to see the PoS tagging results.

In [9]:
# NLTK PoS Tagging Example
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "NLTK is a powerful library for natural language processing."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

print("NLTK PoS Tagging:")
print(pos_tags)

# spaCy PoS Tagging Example
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)

spacy_pos_tags = [(token.text, token.pos_) for token in doc]

print("\nspaCy PoS Tagging:")
print(spacy_pos_tags)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


NLTK PoS Tagging:
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]

spaCy PoS Tagging:
[('NLTK', 'PROPN'), ('is', 'AUX'), ('a', 'DET'), ('powerful', 'ADJ'), ('library', 'NOUN'), ('for', 'ADP'), ('natural', 'ADJ'), ('language', 'NOUN'), ('processing', 'NOUN'), ('.', 'PUNCT')]
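The two outputs use different tag sets: NLTK's default tagger produces Penn Treebank tags (for example `NN`, `NNP`, `VBZ`), while spaCy's `pos_` attribute holds Universal POS tags (for example `NOUN`, `PROPN`, `AUX`). As a small sketch of how the tags can be used, the code below picks out the nouns from each result. The tagged lists are copied from the output above so the snippet runs without either library.

```python
# Tagged output copied from the run above.
nltk_tags = [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'),
             ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'),
             ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
spacy_tags = [('NLTK', 'PROPN'), ('is', 'AUX'), ('a', 'DET'), ('powerful', 'ADJ'),
              ('library', 'NOUN'), ('for', 'ADP'), ('natural', 'ADJ'),
              ('language', 'NOUN'), ('processing', 'NOUN'), ('.', 'PUNCT')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nltk_nouns = [word for word, tag in nltk_tags if tag.startswith("NN")]

# Universal POS tags use NOUN for common nouns and PROPN for proper nouns.
spacy_nouns = [word for word, tag in spacy_tags if tag in {"NOUN", "PROPN"}]

print(nltk_nouns)
print(spacy_nouns)
```

For this sentence the two libraries agree on which words are nouns, although the tag names differ.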


We will use these libraries for text pre-processing and for conducting the layer-wise processing that we discussed in weeks 1 and 2.