# PDF Extraction

- Use PyPDF2 and Grobid to extract text from the provided PDF files (Topic
outlines).
- Structure the output into text files, following the naming convention:
Grobid_RR_{Year}_{Level}_combined.txt and
PyPDF_RR_{Year}_{Level}_combined.txt.
- Organize these text files into two separate folders named Grobid and
PyPDF, each containing three text files corresponding to the readings.
- Develop a Python notebook for this extraction process.

## Imports

In [1]:
import PyPDF2
import os

import requests
from dotenv import load_dotenv

load_dotenv('../config/.env',override=True)

True

In [2]:
def load_env():
    grobid_url = os.getenv("GROBID_URL")
    pdf_directory = os.getenv("PDF_DIR_PATH") # Store the downloaded PDF files from S3
    output_dir = os.getenv("OUTPUT_DIR_PATH") # Store the extracted txt files
    s3_bucket_name = os.getenv("S3_BUCKET_NAME")
    access_key = os.getenv("S3_ACCESS_KEY")
    secret_key = os.getenv("S3_SECRET_KEY")
    region = os.getenv("S3_REGION")
    
    return grobid_url, pdf_directory, output_dir, s3_bucket_name, access_key, secret_key, region

grobid_url, pdf_directory, output_dir, s3_bucket_name, access_key, secret_key, region = load_env()
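
Because `os.getenv` returns `None` for any variable that is not set, a missing entry in the `.env` file only surfaces later as a less obvious error. Below is a minimal sketch of an early check, assuming the same variable names as in `load_env`. The helper name `require_env` and its `environ` parameter are hypothetical and not part of the notebook.

```python
import os

# Variable names taken from load_env() above
REQUIRED_VARS = [
    "GROBID_URL", "PDF_DIR_PATH", "OUTPUT_DIR_PATH",
    "S3_BUCKET_NAME", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_REGION",
]

def require_env(names, environ=None):
    """Return {name: value} for the given variables, raising if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_VARS)` right after `load_dotenv` would stop the notebook at the first cell if the configuration is incomplete.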

## 1. PyPDF Extraction

The function below extracts text from a PDF, page by page, using PyPDF2's `PdfReader` class.

In [3]:
def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, concatenated in page order."""
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ''
        for page in pdf_reader.pages:
            # Pages without a text layer (e.g. scanned images) may yield no text
            text += page.extract_text() or ''
        return text

In [4]:
extract_text_from_pdf("../data/2024-l1-topics-combined-2.pdf")
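
Concatenating page text directly means the last word of one page can run into the first word of the next. A small sketch of joining pages with an explicit separator is shown below. It assumes `pages` is `PdfReader.pages` or any iterable of objects exposing `extract_text()`. The helper name `join_page_texts` is hypothetical.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page, skipping pages that yield no text."""
    parts = []
    for page in pages:
        page_text = page.extract_text()
        if page_text:  # skip None or "" from pages without extractable text
            parts.append(page_text)
    return separator.join(parts)
```

Inside `extract_text_from_pdf`, this could be called as `join_page_texts(pdf_reader.pages)` while the file is still open.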



## 2. Using GROBID

### Cloning the GROBID Python client
`git clone https://github.com/kermitt2/grobid_client_python`<br>
`cd grobid_client_python`<br>
`python3 setup.py install`

In [5]:
! pwd

/Users/sayalidalvi/Documents/Big_data/Assignment_2/Case_Study_2/code


In [6]:
%cd ../grobid_client_python

/Users/sayalidalvi/Documents/Big_data/Assignment_2/Case_Study_2/grobid_client_python


Now start the GROBID server in Docker. You can check that it is running by opening `http://localhost:8070/`.
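
The same check can be done from the notebook instead of the browser. The sketch below assumes the `isalive` endpoint of the GROBID REST API and the default address shown above. The helper names `grobid_endpoint` and `grobid_is_alive` are hypothetical.

```python
import urllib.error
import urllib.request

def grobid_endpoint(base_url, service):
    """Build a GROBID API URL such as http://localhost:8070/api/processFulltextDocument."""
    return base_url.rstrip("/") + "/api/" + service

def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if the server answers the isalive request with HTTP 200."""
    try:
        url = grobid_endpoint(base_url, "isalive")
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False
```

`grobid_endpoint(base_url, "processFulltextDocument")` yields the URL form used by the `curl` call in the web service section.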

### Verifying the installation

In [7]:
!python3 -m grobid_client.grobid_client 

usage: grobid_client.py [-h] [--input INPUT] [--output OUTPUT]
                        [--config CONFIG] [--n N] [--generateIDs]
                        [--consolidate_header] [--consolidate_citations]
                        [--include_raw_citations] [--include_raw_affiliations]
                        [--force] [--teiCoordinates] [--segmentSentences]
                        [--verbose]
                        service
grobid_client.py: error: the following arguments are required: service


### 1. Using Grobid Python Client

In [8]:
from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")
client.process("processFulltextDocument", "./resources/test_pdf",
                   output="./resources/test_out/", consolidate_citations=True, tei_coordinates=True, force=True)

GROBID server is up and running


#### Advantages:
- Processes every document in the `test_pdf` folder
- Saves all results to the output directory

#### Limitations:
- Saves the results as .xml files, whereas we need .txt files
- Does not return the XML to the calling program, so we cannot do any further processing on it

### 2. Using Grobid web service API

In [9]:
# ! curl -v --form input=@./thefile.pdf localhost:8070/api/processFulltextDocument

In [10]:
def extract_grobid_api(file_name, file_path):
    """POST a PDF to the GROBID endpoint in grobid_url; return the response XML, or None on failure."""
    # Open the PDF in a context manager so the handle is closed after the upload
    with open(file_path, 'rb') as pdf_file:
        files = {'input': (file_name, pdf_file)}
        response = requests.post(grobid_url, files=files)

    result = None

    if response.status_code == 200:
        print("POST request successful!")
        result = response.text
    else:
        print(f"POST request failed with status code {response.status_code}")
        print("Response:")
        print(response.text)

    return result


In [11]:
xml_content = extract_grobid_api("2024-l1-topics-combined-2.pdf", "../data/2024-l1-topics-combined-2.pdf")
xml_content

POST request successful!




Now that we have the response, we can preprocess it and store it in a .txt file at the desired location.

In [12]:
# Converting the xml to json
import json
import xmltodict

def convert_to_json(xml_content):
    
    # Convert XML to OrderedDict
    ordered_dict_data = xmltodict.parse(xml_content)

    # Convert OrderedDict to JSON
    json_data = json.dumps(ordered_dict_data, indent=2)

    print("XML converted to JSON ")
#     print(json_data)
    return json_data

In [13]:
convert_to_json(xml_content)

XML converted to JSON 
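
The conversion above keeps the whole XML document, re-encoded as JSON. If only the readable text is wanted, one option is to pull the paragraphs out of the document body directly. The sketch below uses only the standard library and assumes that GROBID's output uses the standard TEI namespace and keeps body text in `<p>` elements. `tei_to_text` is a hypothetical helper, and section headings (`<head>`) are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for GROBID's TEI output
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_to_text(xml_content):
    """Return the paragraphs found under the TEI <body>, one per line."""
    root = ET.fromstring(xml_content)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # itertext() also collects text nested in inline elements such as <ref>;
        # split/join collapses runs of whitespace and line breaks
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The result of `tei_to_text(xml_content)` is a plain string, so it could be written to a .txt file in the same way as the PyPDF output.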




## Downloading PDFs from S3 to local

We store the files provided by the professor in a private S3 bucket, because they are sensitive and must not be exposed publicly.

In [14]:
import boto3

def download_files_from_s3():
    s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=region)

    # List objects in the specified S3 folder
    response = s3.list_objects_v2(Bucket=s3_bucket_name, Prefix="raw_pdfs")

    # Download each file to the local directory ('Contents' is absent when nothing matches)
    for obj in response.get('Contents', []):
        key = obj['Key']
        # Skip the "folder" placeholder key (e.g. "raw_pdfs/"), which is not a file
        if key.endswith('/'):
            continue
        local_file_path = os.path.join(pdf_directory, os.path.basename(key))

        s3.download_file(s3_bucket_name, key, local_file_path)
        print(f"Downloaded: {key} to {local_file_path}")

In [15]:
download_files_from_s3()



Downloaded: raw_pdfs/2024-l1-topics-combined-2.pdf to ../data/2024-l1-topics-combined-2.pdf
Downloaded: raw_pdfs/2024-l2-topics-combined-2.pdf to ../data/2024-l2-topics-combined-2.pdf
Downloaded: raw_pdfs/2024-l3-topics-combined-2.pdf to ../data/2024-l3-topics-combined-2.pdf
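
A listing for a prefix can include a placeholder key for the folder itself (for example `raw_pdfs/`) alongside the PDFs. The sketch below selects keys by name rather than by position and computes the local path for each one. It is a pure function with no S3 calls, and `plan_downloads` is a hypothetical helper name.

```python
import os

def plan_downloads(keys, local_dir, suffix=".pdf"):
    """Map each S3 key ending in `suffix` to a local path, ignoring folder placeholders."""
    plan = []
    for key in keys:
        if key.endswith("/") or not key.lower().endswith(suffix):
            continue  # folder placeholder or a non-PDF object
        plan.append((key, os.path.join(local_dir, os.path.basename(key))))
    return plan
```

Each `(key, local_path)` pair could then be passed to `s3.download_file(s3_bucket_name, key, local_path)`.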


## Putting it all together

The loop below performs the following tasks:
1. Iterates over all the PDF files in the local directory
2. Extracts the text using PyPDF and Grobid
3. Saves the text files as 'Grobid_RR_{Year}_{Level}_combined.txt' for Grobid and 'PyPDF_RR_{Year}_{Level}_combined.txt' for PyPDF
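
The naming step can be sketched on its own as follows. It assumes input names of the form `{year}-{level}-...pdf`, as in `2024-l1-topics-combined-2.pdf`. The pattern and the helper name `output_names` are illustrative assumptions, not part of the notebook.

```python
import re

# Four-digit year, then a level such as "l1", each followed by a hyphen
FILENAME_PATTERN = re.compile(r"^(?P<year>\d{4})-(?P<level>l\d)-", re.IGNORECASE)

def output_names(pdf_filename):
    """Return (pypdf_name, grobid_name) for a PDF named like '2024-l1-...pdf'."""
    match = FILENAME_PATTERN.match(pdf_filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {pdf_filename}")
    year, level = match.group("year"), match.group("level")
    return (
        f"PyPDF_RR_{year}_{level}_combined.txt",
        f"Grobid_RR_{year}_{level}_combined.txt",
    )
```

Validating the name up front makes a stray PDF in the directory fail with a clear message instead of producing a misnamed output file.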

In [16]:
# Utility function to save the text file

def write_text_file(file_name, file_path, pdf_content):
    try:
        print("Saving txt file ",file_name, " at path ", file_path)
        with open(file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(pdf_content)
            print(f"Text successfully written to {file_path}")
        
    except FileNotFoundError:
        print(f"Error: The specified path {file_path} does not exist.")
    except PermissionError:
        print(f"Error: Permission denied. Unable to write to {file_path}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
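
`write_text_file` reports a missing path but does not create it, so the `PyPDF` and `Grobid` output folders have to exist before the loop runs. The variant below is a hedged sketch that creates missing parent directories first. The name `write_text_file_safe` is hypothetical.

```python
import os

def write_text_file_safe(file_path, content):
    """Write content to file_path, creating missing parent directories first."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as output_file:
        output_file.write(content)
    return file_path
```

Unlike the version above, this sketch lets exceptions such as `PermissionError` propagate, so a failed write stops the run instead of only printing a message.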

In [17]:

# Iterate through all PDF files in the directory
for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        pdf_file_path = os.path.join(pdf_directory, filename)
        print("Parsing file ",filename, " saved at path ", pdf_file_path)

        # File names look like "2024-l1-topics-combined-2.pdf": year first, then level
        year, level = filename.split("-")[:2]

        # pypdf
        pypdf_content = extract_text_from_pdf(pdf_file_path)
        pypdf_name = "PyPDF_RR_"+year+"_"+level+"_combined.txt"

        if pypdf_content:
            # Join path components separately so output_dir works with or without a trailing slash
            output_file_path = os.path.join(output_dir, "PyPDF", pypdf_name)
            write_text_file(pypdf_name, output_file_path, pypdf_content)
        else:
            print("No content for this file")

        # grobid
        grobid_content = extract_grobid_api(filename, pdf_file_path)
        grobid_name = "Grobid_RR_"+year+"_"+level+"_combined.txt"

        if grobid_content:
            # Convert only on success: extract_grobid_api returns None when the request fails
            grobid_content = convert_to_json(grobid_content)
            output_file_path = os.path.join(output_dir, "Grobid", grobid_name)
            write_text_file(grobid_name, output_file_path, grobid_content)
        else:
            print("No content for this file")
        

Parsing file  2024-l3-topics-combined-2.pdf  saved at path  ../data/2024-l3-topics-combined-2.pdf
Saving txt file  PyPDF_RR_2024_l3_combined.txt  at path  ../sample_output/PyPDF/PyPDF_RR_2024_l3_combined.txt
Text successfully written to ../sample_output/PyPDF/PyPDF_RR_2024_l3_combined.txt
POST request successful!
XML converted to JSON 
Saving txt file  Grobid_RR_2024_l3_combined.txt  at path  ../sample_output/Grobid/Grobid_RR_2024_l3_combined.txt
Text successfully written to ../sample_output/Grobid/Grobid_RR_2024_l3_combined.txt
Parsing file  2024-l1-topics-combined-2.pdf  saved at path  ../data/2024-l1-topics-combined-2.pdf
Saving txt file  PyPDF_RR_2024_l1_combined.txt  at path  ../sample_output/PyPDF/PyPDF_RR_2024_l1_combined.txt
Text successfully written to ../sample_output/PyPDF/PyPDF_RR_2024_l1_combined.txt
POST request successful!
XML converted to JSON 
Saving txt file  Grobid_RR_2024_l1_combined.txt  at path  ../sample_output/Grobid/Grobid_RR_2024_l1_combined.txt
Text successfu