<div style="text-align:center; font-size:24px; font-weight:bold;">AL2001-Programming For AI</div>
<br>
<div style="text-align:center; font-size:24px; font-weight:bold;">Instructor: Muhammad Saad Rashad</div>
<br>
<div style="text-align:center; font-size:18px; font-weight:bold;">email: saad.rashad@nu.edu.pk</div>


# **DATA EXTRACTION**

## **1. PDFs (Portable Document Format):**
- PDF is a file format developed by Adobe to present documents in a manner independent of application software, hardware, and operating systems.

### **1.1 Characteristics:**
- **Fixed Layout:** PDFs preserve the layout, fonts, images, and graphics of any source document, which makes them ideal for sharing formatted documents.
- **Cross Platform:** They can be viewed and printed on any device without requiring the original application used to create them.
- **Secure:** PDFs can include security features such as encryption, password protection, and digital signatures.

### **1.2 Why Extract Data From PDFs:**
- **Data Accessibility:** Many organizations still distribute their documents as PDFs, so the data must be extracted before it can be used in databases, analytics, and machine learning applications.

- **Diverse Use Cases:**
  - **Research Papers**: Extracting data for analysis, citations, and literature reviews.
  - **Invoices and Receipts**: Automating data entry processes to save time and reduce errors.
  - **Reports**: Extracting tables, charts, and figures for further data analysis or visualization.

### **1.3 Challenges in PDF Extraction**

- **Unstructured Data**
  - **Text Flow**: Text may not flow linearly; it could be arranged in multiple columns or layers, making extraction complicated.
  - **Mixed Content**: PDFs may contain text, images, tables, and graphical elements all in one document, requiring different extraction strategies for each type.

- **Encoding Issues:** Some PDFs use non-standard encodings, and others are scanned images rather than text documents. In both cases a plain text extractor may return garbled or empty output, and scanned pages need optical character recognition (OCR) before any text can be read.

- **Inconsistent Layout:** Different documents have different layouts; for instance, page numbers may be in headers, footers, or embedded within the text body. This inconsistency requires custom extraction logic.
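
The scanned-image case mentioned under Encoding Issues can be detected before any further processing by checking whether the PDF yields any text at all. The following is a minimal sketch, assuming pdfminer.six is installed; the helper names `has_text_layer` and `pdf_needs_ocr` and the `min_chars` threshold are hypothetical choices for illustration, not part of any library.

```python
def has_text_layer(text, min_chars=20):
    """Return True if text has at least min_chars non-whitespace characters."""
    # str.split() with no arguments drops every kind of whitespace
    visible = "".join(text.split())
    return len(visible) >= min_chars


def pdf_needs_ocr(pdf_path, min_chars=20):
    """Return True if the PDF appears to be scanned (little or no extractable text)."""
    # Imported here so has_text_layer can be tried without pdfminer.six installed
    from pdfminer.high_level import extract_text

    return not has_text_layer(extract_text(pdf_path), min_chars)


# Hypothetical strings standing in for extracted text
print(has_text_layer("Chapter 1: Introduction to data extraction"))
print(has_text_layer(" \n\x0c\n "))
```

A PDF for which `pdf_needs_ocr` returns `True` would have to go through an OCR tool first, which is outside the scope of this lab.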

### **1.4 Real-World Examples**

- **Academic Research**: Researchers often need to extract data from multiple studies, which are typically published as PDFs. This could involve gathering references, author information, or specific data points for meta-analysis.

- **Financial Documents**: Companies often receive invoices in PDF format. Automating the extraction of payment details (like invoice number, date, and total amount) helps streamline accounting processes.

- **Legal Documents**: Law firms deal with contracts and legal briefs that are often in PDF format. Extracting clauses, dates, and involved parties can save significant time in legal research.
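
To illustrate the invoice use case: once the text of a PDF has been extracted, individual fields can be pulled out with regular expressions. The sample text, field labels, and patterns below are hypothetical; real invoices vary and need patterns written for their own layout.

```python
import re

# Hypothetical invoice text, standing in for the output of a PDF text extractor
sample_text = """
Invoice Number: INV-1042
Date: 2024-09-15
Total Amount: 1,250.00
"""

# Assumed label formats; adjust the patterns to match the actual documents
PATTERNS = {
    "invoice_number": r"Invoice Number:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total Amount:\s*([\d,]+\.\d{2})",
}


def extract_invoice_fields(text):
    """Return a dict of the fields found in text; missing fields map to None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_invoice_fields(sample_text))
```

Returning `None` for a missing field, rather than raising an error, lets a batch job continue and flag incomplete invoices for manual review.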



**Commonly used Python libraries for working with PDFs:**


| Library         | Main Features                                | Best For                               |
|------------------|----------------------------------------------|----------------------------------------|
| pdfminer.six    | Low-level text and layout extraction         | Complex PDF layouts and detailed text  |
| PyPDF2          | Basic text extraction, merging, splitting    | Simple tasks, manipulating PDFs        |
| pdfplumber      | Text, table, and image extraction            | Extracting tables and detailed layouts |
| PyMuPDF (fitz)  | Fast text extraction, image extraction, rendering | High-speed text and image extraction   |
| Camelot         | Table extraction                             | Extracting tables                      |
| Tabula-py       | Table extraction, exports to DataFrame      | Data analysis, handling tables         |
| Tika            | Metadata and text extraction for many formats | Handling multiple document formats     |
| Slate           | Simple text extraction                       | Quick, lightweight extraction          |
| pdfrw           | Reading, writing, merging, and splitting PDFs | Manipulating PDF structure             |
| pdfquery        | XPath-like querying for PDFs                 | Position-based text extraction         |


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
pip install pdfminer.six


Collecting pdfminer.six
  Downloading pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Downloading pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 34.8 MB/s eta 0:00:00
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20240706


**Example 1:**

In [11]:
from pdfminer.high_level import extract_text

def count_words_in_pdf(pdf_path):
    # Extract the entire text from the PDF
    text = extract_text(pdf_path)

    # Split the text into words (by whitespace or newlines)
    words = text.split()

    # Count the number of words
    word_count = len(words)

    return word_count

# Example usage
pdf_path = "/content/drive/MyDrive/Saad Rashad Thesis Report v8.pdf"
total_words = count_words_in_pdf(pdf_path)
print(f"Total number of words: {total_words}")



Total number of words: 16766


**Example 2:**

In [13]:
from pdfminer.high_level import extract_text

def count_characters_in_pdf(pdf_path):
    # Extract the entire text from the PDF
    text = extract_text(pdf_path)

    # Count the number of characters (including spaces)
    character_count = len(text)

    return character_count

# Example usage
pdf_path = "/content/drive/MyDrive/Saad Rashad Thesis Report v8.pdf"
total_characters = count_characters_in_pdf(pdf_path)
print(f"Total number of characters: {total_characters}")


Total number of characters: 94982


**Example 3:**

In [14]:
from pdfminer.high_level import extract_text

def count_characters_excluding_spaces(pdf_path):
    # Extract the entire text from the PDF
    text = extract_text(pdf_path)

    # Remove white spaces (spaces, newlines, and tabs)
    text_no_spaces = text.replace(" ", "").replace("\n", "").replace("\t", "")

    # Count the number of characters excluding white spaces
    character_count = len(text_no_spaces)

    return character_count

# Example usage
pdf_path = "/content/drive/MyDrive/Saad Rashad Thesis Report v8.pdf"
total_characters = count_characters_excluding_spaces(pdf_path)
print(f"Total number of characters (excluding spaces): {total_characters}")


Total number of characters (excluding spaces): 76179
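
The chained `replace` calls in Example 3 remove only spaces, newlines, and tabs. Extracted text can also contain other whitespace, such as carriage returns or the form feed character (`\x0c`) that pdfminer.six typically writes at page breaks. A minimal alternative, shown here on a hypothetical string rather than a real PDF, is to split on all whitespace and rejoin the pieces:

```python
def count_non_whitespace(text):
    """Count the characters in text, ignoring every kind of whitespace."""
    # str.split() with no arguments splits on any run of whitespace,
    # including "\r" and the form feed "\x0c"
    return len("".join(text.split()))


def count_with_replace(text):
    """Count characters the way Example 3 does, for comparison."""
    return len(text.replace(" ", "").replace("\n", "").replace("\t", ""))


# Hypothetical sample standing in for extracted PDF text
sample = "Page one\ttext\r\n\x0cPage two"
print(count_non_whitespace(sample))
print(count_with_replace(sample))
```

On this sample the two functions disagree, because the carriage return and form feed survive the `replace` chain. The same difference can appear on a real PDF, so the two approaches may report different totals.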


**Example 4:**

In [None]:
from pdfminer.high_level import extract_text

def extract_all_text(pdf_path):
    return extract_text(pdf_path)

# Usage
all_text = extract_all_text("/content/drive/MyDrive/PFAI/PAI-LAB Manuals.pdf")
print(all_text)


**OR**

In [None]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

# Open the PDF and process each page
for page_layout in extract_pages("/content/drive/MyDrive/PFAI/PAI-LAB Manuals.pdf"):
    # For each page, loop through the elements in the layout
    for element in page_layout:
        # Check if the element is a block of text
        if isinstance(element, LTTextBox):
            # Extract and print the text in this text box
            print(element.get_text())

**Example 5:**

In [None]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTTextLine, LTChar

def extract_headings(pdf_path):
    headings = []

    # Loop through each page of the PDF
    for page_layout in extract_pages(pdf_path):
        # Loop through the elements on each page
        for element in page_layout:
            if isinstance(element, LTTextBox):  # Check if the element is a text box
                for text_line in element:
                    if isinstance(text_line, LTTextLine):  # Check if it's a line of text
                        for character in text_line:
                            if isinstance(character, LTChar):  # Check individual characters
                                # Filter on font size only (e.g., size > 12); detecting bold
                                # text would require inspecting character.fontname instead
                                if character.size > 12:  # This threshold can be adjusted
                                    heading_text = text_line.get_text().strip()
                                    headings.append(heading_text)
                                    break
    return headings

# Example usage
pdf_path = "/content/drive/MyDrive/PFAI/PAI-LAB Manuals.pdf"
headings = extract_headings(pdf_path)
for heading in headings:
    print(heading)
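
The fixed threshold of 12 in Example 5 works only when the body text is smaller than 12 points. One possible refinement is to treat the most common font size as the body size and flag lines that are noticeably larger. The sketch below works on `(text, size)` pairs; the function name `pick_headings`, the `ratio` parameter, and the sample data are hypothetical, and in practice the pairs would be collected from the `LTTextLine` and `LTChar` objects as in Example 5.

```python
from collections import Counter


def pick_headings(lines, ratio=1.2):
    """Return the text of lines whose font size exceeds ratio * the most common size.

    lines is a list of (text, font_size) tuples.
    """
    if not lines:
        return []
    # Round sizes so tiny floating-point differences do not split the counts
    sizes = Counter(round(size, 1) for _, size in lines)
    body_size = sizes.most_common(1)[0][0]
    return [text for text, size in lines if size > body_size * ratio]


# Hypothetical (text, size) pairs standing in for lines read from a PDF
sample_lines = [
    ("1. Introduction", 16.0),
    ("This lab covers PDF extraction.", 11.0),
    ("It uses pdfminer.six.", 11.0),
    ("2. Setup", 16.0),
    ("Install the library first.", 11.0),
]
print(pick_headings(sample_lines))
```

A relative threshold adapts to each document, which helps when the same function is run over several PDFs with different base font sizes.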

# **Tasks**

1. **Web Scraping:** You have already covered web scraping, with code examples, in the theory class.
- Scrape a website that lists books (e.g., Goodreads, Amazon, or Readings.pk), collect information such as title, author, rating, and price, and store it in a dictionary.

*Example:* {"Author Name": "Viktor Frankl", "Title": "Man's Search For Meaning", ...}

2. **PDF Mining:** Extract only the headings (e.g., main titles) from multiple PDF documents based on font size or style, and store the values in a single dictionary.

*Example:* {"Title 1": "7 Habits of Highly Effective People", "Title 2": "Power of Habit", "Title 3": "Alchemist"}
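
The dictionary shape asked for in Task 2 can be built from a list of heading strings with `enumerate`. This sketch covers only that final step; the function name `titles_to_dict` and the list `titles` are hypothetical stand-ins for the headings you extract from your own PDFs.

```python
def titles_to_dict(titles):
    """Map 'Title 1', 'Title 2', ... to the given heading strings, in order."""
    return {f"Title {i}": title for i, title in enumerate(titles, start=1)}


# Hypothetical headings, one per PDF
titles = ["7 Habits of Highly Effective People", "Power of Habit", "Alchemist"]
print(titles_to_dict(titles))
```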