<a href="https://colab.research.google.com/github/DvAzevedo/PDF_Data_Extraction/blob/main/PDF_Data_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Automated Data Extraction from Automotive Cable PDFs

This project is an exercise in extracting structured and textual data from PDF files. I used web scraping to download technical documents, then processed the PDFs to extract the relevant information and organize it in tabular form.

The initial task was to extract the specifications of 28 different types of automotive cable from the [Eland Cables](https://www.elandcables.com/cables/flr-automotive-thin-wall-cable) website.

Technologies used:

- Python: main programming language.
- requests: makes the HTTP requests and downloads the PDFs.
- BeautifulSoup4: parses the HTML and locates the PDF URLs.
- PyMuPDF (fitz): extracts text from the PDFs (descriptive information).
- tabula-py: extracts tables from the PDFs.
- pandas: manipulates, organizes, and stores tabular data (DataFrames and CSVs).
- re (regex): search patterns for extracting specific pieces of text.
- Google Colab: development environment used to run the notebook and integrate with Google Drive.




## Web Scraping
Downloading every FLR Automotive Wire PDF available at: https://www.elandcables.com/cables/flr-automotive-thin-wall-cable

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!pip install requests beautifulsoup4

import os

import requests
from bs4 import BeautifulSoup

MessageError: Error: credential propagation was unsuccessful

In [None]:
os.makedirs('/content/drive/MyDrive/Eland_Cables_Data/', exist_ok = True)
output_dir = '/content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/'
os.makedirs(output_dir, exist_ok=True)

base_url = "https://www.elandcables.com"
page_url = "https://www.elandcables.com/cables/flr-automotive-thin-wall-cable"

In [None]:
print(f"Accessing page: {page_url}")
try:
    # A timeout keeps the cell from hanging if the server never responds.
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    print("Page accessed successfully.")
except requests.exceptions.RequestException as e:
    print(f"Error accessing page: {e}")
    # Re-raise instead of calling exit(), which would stop the notebook kernel.
    raise

Accessing page: https://www.elandcables.com/cables/flr-automotive-thin-wall-cable
Page accessed successfully.


In [None]:
pdf_links = []
# Example of the anchor tag being matched:
# <a class="product-pdf" data-document="/media/39786/flry-a-cables.pdf"
# href="https://www.elandcables.com/media/39786/flry-a-cables.pdf" target="_blank">FLRY-A Cable</a>
for a_tag in soup.find_all('a', class_='product-pdf', href=True):
    href = a_tag['href']

    if href.endswith('.pdf') or '/media/' in href:
        # Relative links are resolved against the site root.
        full_url = href if href.startswith('http') else base_url + href
        pdf_links.append(full_url)

# Remove duplicate links (note that set() does not preserve the page order).
pdf_links = list(set(pdf_links))

print(f"Found {len(pdf_links)} PDF links.")
if not pdf_links:
    print("No PDF links found with the specified pattern.")
    print("Please check the page's HTML or the pattern used in the script.")


Found 28 PDF links.
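
The `base_url + href` concatenation above works for root-relative links such as `/media/...`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves any relative link against the page URL. The helper name `resolve_pdf_links` and the sample hrefs below are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.elandcables.com/cables/flr-automotive-thin-wall-cable"


def resolve_pdf_links(hrefs, page_url=PAGE_URL):
    """Return absolute, de-duplicated PDF URLs in first-seen order."""
    seen = {}
    for href in hrefs:
        absolute = urljoin(page_url, href)
        # Same filter as the notebook cell: keep PDF files or anything under /media/.
        if absolute.endswith(".pdf") or "/media/" in absolute:
            seen.setdefault(absolute, None)  # dict keys keep insertion order
    return list(seen)


# Hypothetical hrefs, shaped like the anchor tag quoted in the cell above.
sample_hrefs = [
    "/media/39786/flry-a-cables.pdf",                             # root-relative
    "https://www.elandcables.com/media/39786/flry-a-cables.pdf",  # absolute duplicate
    "/cables/some-other-page",                                    # filtered out
]
print(resolve_pdf_links(sample_hrefs))
```

Using a dict instead of `set()` keeps the links in page order, which makes repeated runs easier to compare.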


In [None]:
downloaded_count = 0
for pdf_url in pdf_links:
    try:
        file_name = pdf_url.split('/')[-1]
        file_path = os.path.join(output_dir, file_name)

        print(f"Downloading {file_name} from {pdf_url}")
        # stream=True lets iter_content() read the body in chunks instead of loading it all at once.
        response = requests.get(pdf_url, stream=True, timeout=60)
        response.raise_for_status()

        with open(file_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        # Count the file only after it has been fully written and closed.
        print(f"Saved: {file_path}")
        downloaded_count += 1
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {pdf_url}: {e}")
    except Exception as e:
        print(f"Unexpected error downloading {pdf_url}: {e}")

print(f"Download complete. {downloaded_count} files downloaded.")

Downloading flrywk-cable.pdf from https://www.elandcables.com/media/sq4bqo0m/flrywk-cable.pdf
Saved: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flrywk-cable.pdf
Downloading flr7y-b-cable.pdf from https://www.elandcables.com/media/d1jhmk4u/flr7y-b-cable.pdf
Saved: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr7y-b-cable.pdf
Downloading flr13y-a-cable.pdf from https://www.elandcables.com/media/5msdynfj/flr13y-a-cable.pdf
Saved: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr13y-a-cable.pdf
Downloading flryw-b-cable.pdf from https://www.elandcables.com/media/vumnvk5b/flryw-b-cable.pdf
Saved: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flryw-b-cable.pdf
Downloading flr13y-b-cable.pdf from https://www.elandcables.com/media/dzjeo0zy/flr13y-b-cable.pdf
Saved: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr13y-b-cable.pdf
Downloading flry-b-cables.pdf from https://www.elandcables.com/media/39716/flry-b-cables.pdf
Saved: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flry-b-cab
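
The download loop derives each file name with `pdf_url.split('/')[-1]`, which is enough for the URLs listed above. A slightly more defensive variant, sketched here with the standard library only, ignores any query string and decodes percent-escapes. `filename_from_url` is a hypothetical helper name, and the `example.com` URLs are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.pdf"):
    """Derive a local file name from the path component of a URL."""
    path = unquote(urlsplit(url).path)  # drops ?query and #fragment, decodes %20 etc.
    name = PurePosixPath(path).name     # last path segment (empty for a bare '/')
    return name or default


print(filename_from_url("https://www.elandcables.com/media/sq4bqo0m/flrywk-cable.pdf"))
print(filename_from_url("https://example.com/files/my%20cable.pdf?v=2"))  # made-up URL
print(filename_from_url("https://example.com/"))                          # falls back to the default
```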

In [None]:
!ls /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/

flr12y-a-cable.pdf  flr4y-a-cable.pdf	flr6y-b-cable.pdf  flrydy-cable.pdf
flr12y-b-cable.pdf  flr4y-b-cable.pdf	flr7y-a-cable.pdf  flryk-cable.pdf
flr13y-a-cable.pdf  flr51y-a-cable.pdf	flr7y-b-cable.pdf  flryw-a-cable.pdf
flr13y-b-cable.pdf  flr51y-b-cable.pdf	flr9y-a-cable.pdf  flryw-b-cable.pdf
flr14y-cable.pdf    flr5y-a-cable.pdf	flr9y-b-cable.pdf  flrywd-cable.pdf
flr2x-a-cable.pdf   flr5y-b-cable.pdf	flry-a-cables.pdf  flrywk-cable.pdf
flr2x-b-cable.pdf   flr6y-a-cable.pdf	flry-b-cables.pdf  flryy-cable.pdf


## Extracting PDF data
Extracting the descriptive text and the tables from each PDF.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!pip install tabula-py pymupdf pandas
import fitz
import tabula
import pandas as pd
import os
import re

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Collecting tabula-py
  Downloading tabula_py-2.10.0-py3-none-any.whl.metadata (7.6 kB)
Collecting pymupdf
  Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading tabula_py-2.10.0-py3-none-any.whl (12.0 MB)
Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf, tabula-py
Successfully installed pymupdf-1.26.0 tabula-py-2.10.0


In [None]:
# Extracts specific textual sections (Application, Characteristics, Construction, Standards), keeping the full content of each block.
def extract_textual_info(pdf_path, page_num=0):
    extracted_data = {}

    try:
        # The context manager closes the document even if reading the page fails.
        with fitz.open(pdf_path) as document:
            full_text = document[page_num].get_text("text")

        # Collapse whitespace runs so each heading can be located in one continuous string.
        cleaned_text = re.sub(r'\s+', ' ', full_text).strip()

        keywords = [
            "APPLICATION", "CHARACTERISTICS", "CONSTRUCTION", "STANDARDS",
            "THE CABLE LAB®", "Eland Product Group"
        ]

        # Record the first position of every heading found (case-insensitive search).
        found_keywords_with_pos = []
        for keyword in keywords:
            match = re.search(r'\b' + re.escape(keyword) + r'\b', cleaned_text, re.IGNORECASE)
            if match:
                found_keywords_with_pos.append({'keyword': keyword, 'start_pos': match.start()})

        found_keywords_with_pos.sort(key=lambda x: x['start_pos'])

        sections_raw = {}
        ordered_keywords_only = [item['keyword'] for item in found_keywords_with_pos]

        # Each block runs from the end of its heading to the start of the next heading.
        for i, current_keyword in enumerate(ordered_keywords_only):
            match_pos = found_keywords_with_pos[i]['start_pos']
            heading_pos = cleaned_text.find(current_keyword, match_pos)
            if heading_pos == -1:
                # find() is case-sensitive; fall back to the case-insensitive match position.
                heading_pos = match_pos
            start_index = heading_pos + len(current_keyword)

            end_index = len(cleaned_text)
            if i + 1 < len(ordered_keywords_only):
                next_keyword = ordered_keywords_only[i+1]
                next_keyword_match = re.search(r'\b' + re.escape(next_keyword) + r'\b', cleaned_text[start_index:], re.IGNORECASE)
                if next_keyword_match:
                    end_index = start_index + next_keyword_match.start()

            block_content = cleaned_text[start_index:end_index].strip()
            sections_raw[current_keyword] = block_content


        # Cable name, derived from the PDF file name
        pdf_file_name = os.path.basename(pdf_path)
        cable_name_clean = pdf_file_name.replace('.pdf', '').split('_data')[0]
        extracted_data['Cable_Name'] = cable_name_clean

        # Assign the complete blocks
        extracted_data['Application'] = sections_raw.get('APPLICATION', 'N/A')
        extracted_data['Characteristics'] = sections_raw.get('CHARACTERISTICS', 'N/A')
        extracted_data['Construction'] = sections_raw.get('CONSTRUCTION', 'N/A')
        extracted_data['Standards'] = sections_raw.get('STANDARDS', 'N/A')

        product_group_match = re.search(r'Eland Product Group:\s*([A-Z0-9]+)', cleaned_text, re.IGNORECASE)
        extracted_data['Eland_Product_Group'] = product_group_match.group(1).strip() if product_group_match else "N/A"

    except Exception as e:
        print(f"Error extracting textual information from {pdf_path}: {e}")

        # Fill any field that was not extracted so every row has the same columns.
        expected_keys = [
            'Cable_Name', 'Application', 'Characteristics', 'Construction', 'Standards', 'Eland_Product_Group'
        ]
        for key in expected_keys:
            if key not in extracted_data:
                extracted_data[key] = "ERROR_N/A"

    return extracted_data
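
To make the slicing logic in `extract_textual_info` easier to follow, here is a minimal, self-contained sketch of the same idea applied to a made-up string instead of a PDF page. The helper name `split_sections` and the sample text are assumptions for illustration; unlike the function above, the sketch matches headings case-sensitively.

```python
import re


def split_sections(text, headings):
    """Map each heading found in `text` to the text between it and the next heading."""
    text = re.sub(r"\s+", " ", text).strip()

    # First occurrence of each heading, as (start, end, heading) tuples sorted by position.
    found = []
    for heading in headings:
        match = re.search(r"\b" + re.escape(heading) + r"\b", text)
        if match:
            found.append((match.start(), match.end(), heading))
    found.sort()

    sections = {}
    for i, (_, end, heading) in enumerate(found):
        next_start = found[i + 1][0] if i + 1 < len(found) else len(text)
        sections[heading] = text[end:next_start].strip()
    return sections


# Made-up sample text; the real input is the first page of a datasheet.
sample = """APPLICATION
Used in vehicle wiring.
CONSTRUCTION
Copper conductor, thin-wall insulation.
STANDARDS
Placeholder standard reference."""

result = split_sections(sample, ["APPLICATION", "CHARACTERISTICS", "CONSTRUCTION", "STANDARDS"])
print(result)
```

Headings that do not occur in the text (here `CHARACTERISTICS`) are simply absent from the result, which is why the notebook function reads the sections with `.get(..., 'N/A')`.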


In [None]:
# Extracts tables from a specific PDF page using tabula-py. Returns a list of DataFrames, one for each table found on the page.
def extract_tables(pdf_path, page_num=1):

    try:
        # page_num is a 0-based index, while tabula-py numbers pages from 1.
        tables = tabula.read_pdf(pdf_path, pages=str(page_num + 1), multiple_tables=True,
                                 pandas_options={'header': None})
        cleaned_tables = []
        if tables:
            for df in tables:
                # Drop rows and columns that are entirely empty.
                df.dropna(how='all', inplace=True)
                df.dropna(axis=1, how='all', inplace=True)
                if not df.empty:
                    cleaned_tables.append(df)

            return cleaned_tables
        else:
            print(f"No tables found on page {page_num + 1} of {pdf_path}.")
            return []

    except Exception as e:
        print(f"Error extracting tables from {pdf_path}: {e}")
        return []

In [None]:

pdf_folder_path = '/content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/'
output_csv_folder = '/content/drive/MyDrive/Eland_Cables_Data/CablesCSV/'
os.makedirs(output_csv_folder, exist_ok=True)

all_pdf_files = [f for f in os.listdir(pdf_folder_path) if f.endswith('.pdf')]
print(f"\nFound {len(all_pdf_files)} PDFs in folder: {pdf_folder_path}")

all_textual_data_dfs = []

# Process each PDF
for pdf_file in all_pdf_files:
    pdf_path = os.path.join(pdf_folder_path, pdf_file)
    print(f"\n--- Processing: {pdf_file} ---")

    # Extract the textual data (page index 0, i.e. the first page)
    text_info = extract_textual_info(pdf_path, page_num=0)

    # Build a single-row DataFrame with the textual information
    df_text_current_pdf = pd.DataFrame([text_info])
    all_textual_data_dfs.append(df_text_current_pdf)

    print(f"Textual information extracted for {pdf_file}.")

    # Extract and save the tables (page index 1, i.e. the second page)
    df_tables_list = extract_tables(pdf_path, page_num=1)

    if df_tables_list:
        cable_name_for_tables = text_info.get('Cable_Name', os.path.basename(pdf_file).replace('.pdf', ''))

        for i, df_table in enumerate(df_tables_list):
            # Add the source file, cable name, and table index to the table
            df_table['Source_PDF'] = pdf_file
            df_table['Cable_Name'] = cable_name_for_tables
            df_table['Table_Index'] = i + 1

            # Build the CSV file name for the table ("Tabela" is Portuguese for "table")
            table_csv_name = f"{cable_name_for_tables}_Tabela_{i+1}.csv"
            table_csv_path = os.path.join(output_csv_folder, table_csv_name)

            df_table.to_csv(table_csv_path, index=False)
            print(f"Table {i+1} from {pdf_file} saved to: {table_csv_path}")
    else:
        print(f"No tables found or extracted from {pdf_file}.")

print("\n--- Finished processing all PDFs ---")

if all_textual_data_dfs:
    final_combined_textual_df = pd.concat(all_textual_data_dfs, ignore_index=True)
    combined_textual_output_path = os.path.join(output_csv_folder, 'all_cables_textual_data_combined.csv')
    final_combined_textual_df.to_csv(combined_textual_output_path, index=False)
    print(f"\nAll combined textual data saved to: {combined_textual_output_path}")
    print("\nPreview of the first rows of the combined textual DataFrame:")
    print(final_combined_textual_df.head())
else:
    print("\nNo textual data extracted to combine.")


Found 28 PDFs in folder: /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/

--- Processing: flrywk-cable.pdf ---
Textual information extracted for flrywk-cable.pdf.
Table 1 from flrywk-cable.pdf saved to: /content/drive/MyDrive/Eland_Cables_Data/CablesCSV/flrywk-cable_Tabela_1.csv

--- Processing: flr7y-b-cable.pdf ---
Textual information extracted for flr7y-b-cable.pdf.

ERROR:tabula.backend:Error from tabula-java:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Page number does not exist.
	at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:19)
	at technology.tabula.PageIterator.next(PageIterator.java:30)
	at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:161)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)

Error extracting tables from /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr7y-b-cable.pdf: Command '['java', '-Dfile.encoding=UTF8', '-jar', '/usr/local/lib/python3.11/dist-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar', '--pages', '2', '--guess', '--format', 'JSON', '/content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr7y-b-cable.pdf']' returned non-zero exit status 1.
No tables found or extracted from flr7y-b-cable.pdf.

--- Processing: flr13y-a-cable.pdf ---
Textual information extracted for flr13y-a-cable.pdf.
No tables found on page 2 of /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr13y-a-cable.pdf.
No tables found or extracted from flr13y-a-cable.pdf.

--- Processing: flryw-b-cable.pdf ---
Textual information extracted for flryw-b-cable.pdf.
Table 1 from flryw-b-cable.pdf saved to: /content/drive/MyDrive/Eland_Cables_Data/CablesCSV/flryw-b-cable_Tabela_1.csv
Table 2 from flryw-b-cable.pdf saved to: /content/drive/MyDrive/El

ERROR:tabula.backend:Error from tabula-java:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Page number does not exist.
	at technology.tabula.ObjectExtractor.extractPage(ObjectExtractor.java:19)
	at technology.tabula.PageIterator.next(PageIterator.java:30)
	at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:161)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)

Error extracting tables from /content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr5y-a-cable.pdf: Command '['java', '-Dfile.encoding=UTF8', '-jar', '/usr/local/lib/python3.11/dist-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar', '--pages', '2', '--guess', '--format', 'JSON', '/content/drive/MyDrive/Eland_Cables_Data/CablesPDFs/flr5y-a-cable.pdf']' returned non-zero exit status 1.
No tables found or extracted from flr5y-a-cable.pdf.

--- Processing: flr4y-a-cable.pdf ---
Textual information extracted for flr4y-a-cable.pdf.
Table 1 from flr4y-a-cable.pdf saved to: /content/drive/MyDrive/Eland_Cables_Data/CablesCSV/flr4y-a-cable_Tabela_1.csv

--- Processing: flry-a-cables.pdf ---
Textual information extracted for flry-a-cables.pdf.
Table 1 from flry-a-cables.pdf saved to: /content/drive/MyDrive/Eland_Cables_Data/CablesCSV/flry-a-cables_Tabela_1.csv

--- Processing: flr9y-a-cable.pdf ---
Textual information extracted for flr9y-a-cable.pdf.
Table 1 from flr9y-a-cable.pd
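
The `Page number does not exist` errors in the log come from tabula-java being asked for page 2 of a PDF that appears to have only one page. One possible guard, sketched below under that assumption, is to check the page count before calling tabula; PyMuPDF exposes it as `document.page_count`. The helper name `tabula_pages_arg` is hypothetical.

```python
def tabula_pages_arg(page_index, page_count):
    """Convert a 0-based page index into tabula-py's 1-based `pages` string.

    Returns None when the page does not exist, so the caller can skip the PDF
    instead of triggering an IndexOutOfBoundsException in tabula-java.
    """
    if 0 <= page_index < page_count:
        return str(page_index + 1)
    return None


# Intended use inside extract_tables (shown as comments, not executed here):
#     with fitz.open(pdf_path) as document:
#         pages = tabula_pages_arg(page_num, document.page_count)
#     if pages is None:
#         return []
#     tables = tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)

print(tabula_pages_arg(1, 2))  # two-page PDF, second page requested
print(tabula_pages_arg(1, 1))  # single-page PDF, second page requested
```

Skipping the call avoids starting the Java process at all for PDFs that cannot contain the requested page, and keeps the log free of stack traces.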

In [None]:
!ls /content/drive/MyDrive/Eland_Cables_Data/CablesCSV/

all_cables_textual_data_combined.csv  flr9y-a-cable_Tabela_1.csv
flr14y-cable_Tabela_1.csv	      flr9y-a-cable_Tabela_2.csv
flr2x-a-cable_Tabela_1.csv	      flr9y-b-cable_Tabela_1.csv
flr2x-a-cable_Tabela_2.csv	      flr9y-b-cable_Tabela_2.csv
flr2x-b-cable_Tabela_1.csv	      flry-a-cables_Tabela_1.csv
flr4y-a-cable_Tabela_1.csv	      flry-b-cables_Tabela_1.csv
flr4y-b-cable_Tabela_1.csv	      flrydy-cable_Tabela_1.csv
flr51y-a-cable_Tabela_1.csv	      flryk-cable_Tabela_1.csv
flr51y-b-cable_Tabela_1.csv	      flryk-cable_Tabela_2.csv
flr51y-b-cable_Tabela_2.csv	      flryw-a-cable_Tabela_1.csv
flr5y-b-cable_Tabela_1.csv	      flryw-b-cable_Tabela_1.csv
flr6y-a-cable_Tabela_1.csv	      flryw-b-cable_Tabela_2.csv
flr6y-a-cable_Tabela_2.csv	      flrywd-cable_Tabela_1.csv
flr6y-b-cable_Tabela_1.csv	      flrywd-cable_Tabela_2.csv
flr6y-b-cable_Tabela_2.csv	      flrywk-cable_Tabela_1.csv
flr7y-a-cable_Tabela_1.csv	      flryy-cable_Tabela_1.csv
flr7y-a-cable_Tabela_2.csv
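
Because every table is written to its own `<cable>_Tabela_<n>.csv` file, a later step may need to know which files belong to which cable. A small sketch of that grouping follows, using only the standard library and a few file names copied from the listing above; `group_table_files` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches names such as "flr9y-a-cable_Tabela_2.csv".
TABLE_FILE_RE = re.compile(r"^(?P<cable>.+)_Tabela_(?P<index>\d+)\.csv$")


def group_table_files(file_names):
    """Group table CSV names by cable, ordered by table index."""
    grouped = defaultdict(list)
    for name in file_names:
        match = TABLE_FILE_RE.match(name)
        if match:  # skips files such as the combined textual CSV
            grouped[match["cable"]].append((int(match["index"]), name))
    return {cable: [n for _, n in sorted(items)] for cable, items in sorted(grouped.items())}


listing = [
    "all_cables_textual_data_combined.csv",
    "flr9y-a-cable_Tabela_2.csv",
    "flr9y-a-cable_Tabela_1.csv",
    "flr14y-cable_Tabela_1.csv",
]
print(group_table_files(listing))
```

In the notebook, `listing` could be replaced by `os.listdir(output_csv_folder)`.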


In [None]:
final_combined_textual_df['Characteristics']

#final_combined_textual_df['Construction']

#final_combined_textual_df['Standards']

#final_combined_textual_df['Application']

#final_combined_textual_df['Eland_Product_Group']

#final_combined_textual_df['Cable_Name']

Unnamed: 0,Characteristics
0,Temperature Rating -50°C to +105°C
1,Temperature Rating -45°C to +180°C
2,Temperature Rating -40°C to +150°C
3,Temperature Rating -50°C to +125°C
4,Temperature Rating -40°C to +150°C
5,Test Voltage 3kv i.e < 0.5mm² 5kv i.e > 0.5mm²...
6,Temperature Rating -40°C to +150°C
7,Temperature Rating -40°C to +250°C
8,Temperature Rating -40°C to +260°C Blue
9,Temperature Rating -40°C to +105°C Sheath Colo...
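
Several `Characteristics` entries start with a temperature rating in the form `-50°C to +105°C`. If numeric columns are wanted, a regular expression can pull the two limits out. This is a sketch under the assumption that the rating always follows that pattern; `parse_temperature_range` is a hypothetical helper, and strings in which no rating is found yield `None`.

```python
import re

TEMP_RANGE_RE = re.compile(r"Temperature Rating\s*([+-]?\d+)\s*°C\s*to\s*([+-]?\d+)\s*°C")


def parse_temperature_range(characteristics):
    """Return (min_celsius, max_celsius), or None if no rating is found."""
    match = TEMP_RANGE_RE.search(characteristics)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


# Strings copied from the output above.
print(parse_temperature_range("Temperature Rating -50°C to +105°C"))
print(parse_temperature_range("Temperature Rating -40°C to +260°C Blue"))
print(parse_temperature_range("Test Voltage 3kv i.e < 0.5mm² 5kv i.e > 0.5mm²..."))
```

Applied to the DataFrame, this could look like `final_combined_textual_df['Characteristics'].apply(parse_temperature_range)`.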
