## Q3. Parse the PDF v2

In this part, we will use `pdfminer` to extract the text from 200 papers related to COVID-19.

Parsing a PDF is time-consuming and memory-intensive, so PDFMiner uses a strategy called lazy parsing: it parses content only when needed, which reduces time and memory usage. Parsing a PDF requires at least two classes, PDFParser and PDFDocument: PDFParser extracts data from the file, and PDFDocument stores it. In addition, PDFPageInterpreter processes the page content, and PDFDevice converts it into the form we need. PDFResourceManager stores shared resources such as fonts or images.

A PDF is not like a Word or txt file, which can be read easily. Reading a PDF is a difficult task in itself: a program has to read the file in binary and then convert the result into text.

Reading a PDF is more like reading a picture. A PDF places content at exact positions on the page. In most cases there is no logical structure, such as sentences or paragraphs, and the content cannot adapt when the page size changes. PDFMiner tries to reconstruct the structure by guessing the layout, but this is not guaranteed to work.


## Install pdfminer

Use pip to install the library. The library requires `Python 3.7` or higher.

`pip install pdfminer3k`

Note: the imports used in the Coding Area (`pdfminer.pdfdocument`, `pdfminer.pdfpage`) follow the module layout of the `pdfminer.six` distribution. If they fail after installing `pdfminer3k`, installing `pdfminer.six` instead may resolve it.

### Library Documentation

For detailed documentation, please visit https://pypi.org/project/pdfminer/

There are several classes that we need to pay attention to.

1. PDFParser: gets data from files

2. PDFDocument: stores the document data structure in memory

3. PDFPageInterpreter: parses page content

4. PDFDevice: converts the parsed content into what you need

5. PDFResourceManager: stores shared resources, such as fonts or pictures
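
The way these classes fit together can be sketched for a single file as follows. This is a minimal sketch, not the reference solution: it assumes the module layout used by the imports in this notebook, and `extract_text` and `pdf_path` are hypothetical names chosen for illustration.

```python
def extract_text(pdf_path):
    """Return the text of one PDF as a single string (illustrative sketch)."""
    # Imported inside the function so this cell can be defined
    # even before pdfminer is installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
    from pdfminer.pdfparser import PDFParser

    chunks = []
    with open(pdf_path, "rb") as fp:
        parser = PDFParser(fp)            # reads data from the file
        document = PDFDocument(parser)    # holds the document structure
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed(pdf_path)

        resources = PDFResourceManager()  # shared fonts, images, ...
        # PDFPageAggregator is a PDFDevice that builds a layout tree per page.
        device = PDFPageAggregator(resources, laparams=LAParams())
        interpreter = PDFPageInterpreter(resources, device)

        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()  # the LTPage for this page
            for element in layout:
                if isinstance(element, LTTextBox):
                    chunks.append(element.get_text())
    return "".join(chunks)
```

Calling `extract_text` once per path and collecting the results in a list would give one raw string per paper.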

#### What is in the LTPage Object?



![image.png](attachment:image.png)
1. LTPage: Represents the entire page. May contain LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine sub-objects.

2. LTTextBox: Represents a group of text chunks that may be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects. The get_text() method returns the text content.

3. LTTextLine: Contains a list of LTChar objects representing a single text line. The characters are aligned either horizontally or vertically, depending on the writing mode of the text. The get_text() method returns the text content.

4. LTChar: Represents an actual letter in the text as a Unicode string. An LTChar object has actual boundaries.

5. LTAnno: Also represents a character as a Unicode string, but unlike LTChar it has no boundaries, because it is a "virtual" character inserted by the layout analysis based on the relationship between two characters (for example, a space).

6. LTImage: Represents an image object. Embedded images can be in JPEG or other formats, but currently PDFMiner does not place much effort on graphic objects.

7. LTLine: represents a straight line. Can be used to separate text or drawings.

8. LTRect: Represents a rectangle. Can be used to frame other pictures or figures.

9. LTCurve: Represents a generic Bezier curve.
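
Because these objects form a tree (a page contains boxes, boxes contain lines, and so on), the text can be collected with a small recursive walk. The sketch below is duck-typed and only an illustration: it assumes that text-bearing objects expose `get_text()` and that container objects are iterable, as described in the list above. `iter_text` is a hypothetical helper name, not part of pdfminer.

```python
def iter_text(element):
    """Yield the text of every text-bearing node, depth-first (sketch)."""
    # Plain strings are iterable but are not layout objects; skip them.
    if isinstance(element, (str, bytes)):
        return
    # Text objects (e.g. LTTextBox) return their whole content at once,
    # so there is no need to descend into their lines and characters.
    if hasattr(element, "get_text"):
        yield element.get_text()
        return
    # Containers (e.g. LTPage) are iterable; graphic objects such as
    # lines and rectangles are not and are simply skipped.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from iter_text(child)
```

For a page layout `layout`, `"".join(iter_text(layout))` would concatenate the text of all nodes found this way.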

### Coding Area

#### imports

In [2]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import numpy as np
import pandas as pd
from testing.testing import test
from re import search

#### Step 1. Parse out the PDF file using pdfminer

In this section, we want to parse the text in the PDFs into a list of strings.

**Hint**:
* PDFParser
* PDFDocument
* PDFTextExtractionNotAllowed
* LAParams
* Create a PDF page aggregation object

In [None]:
def parse_pdf(pdf_list):
    """
    Return the parsed raw text in a list
    
    Args:
        pdf_list(list): list of strings with the path of the pdf files
        
    Returns:
        contents(list): list of strings representing the raw text of each paper
    """
    
    return []

#### Step 2. Clean up the data

In this section, there are 5 things we want to clean up in the contents:
1. Collapse double spacing. E.g. `'Hello  world'` to `'Hello world'`
2. Remove the hyphens that split a word across a line break. E.g. `'hell-\no world'` to `'hello world'`
3. Remove all the `'\n'`.
4. Remove all the `'http'` and `'https'`.
5. Convert to lower case.
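
The steps above can be sketched with the standard `re` module. This is one possible reading, not the reference solution, and it rests on two assumptions that the list does not settle: a remaining `'\n'` is replaced by a space (so neighbouring words do not merge), and "remove `http`/`https`" means removing the whole URL. `clean_text` is a hypothetical helper name.

```python
import re


def clean_text(raw):
    """Clean the raw text of one paper (illustrative sketch)."""
    # Rejoin words hyphenated across a line break: 'hell-\no' -> 'hello'.
    text = re.sub(r"-\n", "", raw)
    # Assumption: remaining newlines become spaces.
    text = text.replace("\n", " ")
    # Assumption: drop whole URLs that start with http:// or https://.
    text = re.sub(r"https?://\S+", "", text)
    # Collapse runs of spaces left by the steps above.
    text = re.sub(r" {2,}", " ", text)
    return text.lower()


print(clean_text("Hell-\no  World\nsee https://example.com now"))
# prints: hello world see now
```

The order matters: hyphenated line breaks are handled before newlines are replaced, and spaces are collapsed last because the earlier steps can create new runs of spaces.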

In [None]:
def clean_data_test(clean_data):
    texts = clean_data()
    # TODO: check that each cleaned text matches the corresponding reference text file

def clean_data(contents):
    
    """
    Return the cleaned text of each paper as a list of strings
    
    Args:
        contents(list): list of strings representing the raw text of each paper 
        
    Returns: 
        texts(list): list of strings representing cleaned text of each paper
        
    """
    
    
    return []