### __WordsOfChessAndAI - Metadata__ 

#### __Goal__

Define a method to extract the following metadata from each paper:
- title
- abstract
- DOI

For the author and abstract I will use Pytesseract, an OCR (Optical Character Recognition) tool, to extract the text from page images; I will then process the resulting strings to retrieve the information I need.

In [62]:
size = 10
main_path = "articles/"
target_word = "abstract"

In [63]:
import os

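def get_path_files(path: str) -> list[str]:
    """Return the paths of the PDF files found directly inside `path`."""
    path_files = []

    for item in os.listdir(path):
        if not item.lower().endswith(".pdf"):
            print("Not a pdf", item, "\n")
            continue  # skip non-PDF files so they never reach the OCR step

        path_files.append(os.path.join(path, item))

    return path_files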

path_files = get_path_files(main_path)

Having collected each path, I will now convert the files to images. This step is needed because Pytesseract extracts text from images, not directly from PDF files.

In [None]:
import pytesseract

from typing import Dict
from pdf2image import convert_from_path

class ScannedText:
    def __init__(self, index: int, text: str):
        self.index = index
        self.text = text

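def convert_file_to_images(path: str) -> list:
    """Render the first two pages of the PDF as images; return [] on failure."""
    try:
        # Render only pages 1-2 (where the abstract is assumed to be)
        # instead of converting the whole file and slicing afterwards.
        return convert_from_path(path, first_page=1, last_page=2)
    except Exception as e:
        print("Error during conversion from file to image:", e, "\n")
        return []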

def get_target_word_index(text: str, word: str = target_word) -> int:
    """Return the index of the first line containing `word`, or -1 if none does."""
    for count_line, line in enumerate(text.splitlines()):
        if word in line:
            return count_line

    return -1

def scan_file_to_text(paths: list[str]) -> dict[str, ScannedText]:
    dict_scanned_texts: Dict[str, ScannedText] = {}

    for path in paths:
        images = convert_file_to_images(path)

        text = ""
        try:
            for image in images:
                text = text + pytesseract.image_to_string(image)
        except Exception as e:
            print("Error during conversion from image to text:", e)

        # Always record an entry so later lookups by path cannot fail;
        # an index of -1 means the keyword was not found.
        dict_scanned_texts[path] = ScannedText(get_target_word_index(text.lower()), text)
        
    return dict_scanned_texts

dict_scanned_texts = scan_file_to_text(path_files)

Each __ScannedText__ value in __dict_scanned_texts__ holds two fields: __index__ stores the number of the line where the __keyword__ "abstract" first appears, and __text__ stores the text extracted with Pytesseract. The code below defines the functions needed to retrieve the abstract section from the extracted text.

The file "abstract.txt" is only a debugging aid for checking the correctness of the code.
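To make the two steps concrete, here is a minimal, self-contained sketch on a made-up OCR string. The sample text and the helper names `find_keyword_line` and `slice_abstract` are illustrative assumptions, not the functions defined in this notebook. The logic is simplified to: find the keyword line, skip the blank lines that follow it, then read until the next blank line. Unlike the notebook's functions, this sketch leaves the heading line out of the result.

```python
from typing import List


def find_keyword_line(text: str, word: str = "abstract") -> int:
    """Return the index of the first line containing `word` (case-insensitive), or -1."""
    for number, line in enumerate(text.splitlines()):
        if word in line.lower():
            return number
    return -1


def slice_abstract(text: str, index: int) -> List[str]:
    """Return the non-blank lines of the paragraph that follows line `index`."""
    lines = text.splitlines()
    position = index + 1

    # Skip blank lines between the heading and the abstract body.
    while position < len(lines) and not lines[position].strip():
        position += 1

    body = []
    # Read until the next blank line, taken here as the end of the abstract.
    while position < len(lines) and lines[position].strip():
        body.append(lines[position])
        position += 1
    return body


# Hypothetical OCR output, invented for this example.
sample = (
    "A Chess Paper\n\nAbstract\n\n"
    "We study chess engines.\nResults are reported.\n\n1 Introduction\n"
)

index = find_keyword_line(sample)
abstract_lines = slice_abstract(sample, index)
print(index, abstract_lines)
```

Treating a blank line as the end of the abstract is a heuristic: an abstract that OCR splits into several paragraphs would be cut short.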

In [65]:
from typing import List

def remove_white_spaces(index: int, text: str) -> List[str]:
    lines = text.splitlines()
    first_part = lines[:index + 1]

    start_index = index + 1
    count_line = start_index

    # Skip the blank (or whitespace-only) lines that follow the keyword line.
    for line in lines[start_index:]:
        if line.strip():
            break

        count_line += 1
        
    second_part = lines[count_line:]
    
    return first_part + second_part

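def extract_abstract_lines(index: int, lines: list[str]) -> str:
    """Collect the lines from `index` up to the first blank line."""
    abstract = ""

    for line in lines[index:]:
        # OCR output often contains whitespace-only lines, so test the stripped line.
        if not line.strip():
            break

        abstract += line + "\n"

    return abstract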

def extract_abstract(scanned_texts: dict[str, ScannedText]) -> dict[str, str]:
    """Map each file path to its abstract, or to "Not found" if no keyword line was detected."""
    dict_abstracts: dict[str, str] = {}

    for key, value in scanned_texts.items():
        if value.index > -1:
            abstract_lines = remove_white_spaces(value.index, value.text)
            text = extract_abstract_lines(value.index, abstract_lines)
        else:
            text = "Not found"

        dict_abstracts[key] = text

    return dict_abstracts

dict_abstracts = extract_abstract(dict_scanned_texts)
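
The goal lists the DOI, but none of the cells in this notebook extracts it. One possible approach, sketched here under assumptions, is to search the OCR text with a regular expression. The pattern follows the usual shape of modern DOIs (`10.`, a registrant code of 4 to 9 digits, `/`, then a suffix). The helper name `find_doi` is hypothetical, and the pattern will miss unusual or OCR-damaged DOIs.

```python
import re
from typing import Optional

# Assumed pattern for modern DOIs; it is a heuristic, not a full validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string found in `text`, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None

    # Trailing punctuation usually belongs to the sentence, not to the DOI.
    return match.group(0).rstrip(".,;:)")


# Made-up text standing in for OCR output.
example = "Available at https://doi.org/10.1234/abc.def-5. All rights reserved."
print(find_doi(example))
```

Stripping a trailing `)` is a trade-off: it removes sentence punctuation but would also truncate the rare DOI that really ends with a closing parenthesis.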

The next step is to define the functions that retrieve the papers' __metadata__. Several libraries already expose methods or fields containing the metadata I am looking for, so I will write a __recursive__ function that combines different libraries.

In [66]:
import pymupdf
import pdfplumber

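# pymupdf reports lowercase keys ("title", "author"), while pdfplumber usually
# keeps the capitalized keys of the PDF Info dictionary ("Title", "Author"),
# so both spellings are checked for pdfplumber.
def get_title_from_dicts(pymupdf_metadata: dict[str, str], pdfplumber_metadata: dict[str, str]) -> str:
    return (pymupdf_metadata.get("title") or pdfplumber_metadata.get("Title") or pdfplumber_metadata.get("title") or "Not found")

def get_author_from_dicts(pymupdf_metadata: dict[str, str], pdfplumber_metadata: dict[str, str]) -> str:
    return (pymupdf_metadata.get("author") or pdfplumber_metadata.get("Author") or pdfplumber_metadata.get("author") or "Not found")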

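def extract_title_and_author(i: int, remaining: int, paths: list[str], dict_titles = None, dict_authors = None) -> tuple[dict[str, str], dict[str, str]]:
    # Create fresh dictionaries on the first call: mutable default arguments
    # are shared between calls and would keep results from earlier runs.
    if dict_titles is None:
        dict_titles = {}
    if dict_authors is None:
        dict_authors = {}

    if remaining == 0:
        return (dict_titles, dict_authors)

    # Context managers close both file handles once the metadata is read.
    with pymupdf.open(paths[i]) as document:
        metadata_pymupdf = document.metadata or {}
    with pdfplumber.open(paths[i]) as pdf:
        metadata_pdfplumber = pdf.metadata or {}

    dict_titles[paths[i]] = get_title_from_dicts(metadata_pymupdf, metadata_pdfplumber)
    dict_authors[paths[i]] = get_author_from_dicts(metadata_pymupdf, metadata_pdfplumber)

    # Pass the accumulators along so every recursive call fills the same dictionaries.
    return extract_title_and_author(i + 1, remaining - 1, paths, dict_titles, dict_authors)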

tuple_dict_titles_authors = extract_title_and_author(0, len(path_files), path_files)
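
The fallback between libraries used above can be written once for any field. The sketch below is a generic version under assumptions: the helper name `first_available` and the two sample dictionaries are made up, and the dictionaries only stand in for what two different libraries might return. It walks the sources in priority order and returns the first non-empty value.

```python
from typing import Mapping, Optional, Sequence


def first_available(
    keys: Sequence[str],
    sources: Sequence[Mapping[str, Optional[str]]],
    default: str = "Not found",
) -> str:
    """Return the first non-empty value stored under any of `keys` in `sources`."""
    for source in sources:
        for key in keys:
            value = source.get(key)
            # Empty strings, whitespace and None all count as missing.
            if value and value.strip():
                return value.strip()
    return default


# Made-up metadata dictionaries, one per hypothetical library.
from_library_a = {"title": "", "author": "Jane Doe"}
from_library_b = {"Title": "Example Title", "Author": None}

sources = [from_library_a, from_library_b]
title = first_available(["title", "Title"], sources)
author = first_available(["author", "Author"], sources)
doi = first_available(["doi"], sources)
print(title, author, doi)
```

Listing both spellings of each key keeps the helper independent of how a given library capitalizes its metadata fields.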

In [None]:
import json

from typing import Dict

class Metadata:
    def __init__(self, title, author, abstract):
        self.title = title
        self.author = author
        self.abstract = abstract

    def get_dict(self) -> Dict[str, str]:
        return {
            "Title": self.title,
            "Author": self.author,
            "Abstract": self.abstract
        }

dict_titles = tuple_dict_titles_authors[0]
dict_authors = tuple_dict_titles_authors[1]

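metadata = []
for path in path_files:
    # .get with a default avoids a KeyError if a file has no scanned text.
    metadata.append(Metadata(dict_titles.get(path), dict_authors.get(path), dict_abstracts.get(path, "Not found")))

# Write all records as one JSON array: appending separate objects to the
# same file would not produce a valid JSON document.
os.makedirs("json", exist_ok=True)  # os is imported in an earlier cell
with open("json/metadata.json", "w") as file:
    json.dump([element.get_dict() for element in metadata], file, indent=3)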
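
A quick sanity check on the output is to load the file back. The sketch below writes a list of made-up records as a single JSON array to a temporary file and reads it back; the records and the temporary path are assumptions for illustration only. `json.load` expects exactly one JSON document per file, which is why the records are written in one `json.dump` call.

```python
import json
import os
import tempfile

# Made-up records with the same keys as Metadata.get_dict().
records = [
    {"Title": "Example Title", "Author": "Jane Doe", "Abstract": "Not found"},
    {"Title": "Another Title", "Author": "Not found", "Abstract": "A short abstract.\n"},
]

with tempfile.TemporaryDirectory() as folder:
    output_path = os.path.join(folder, "metadata.json")

    # Write every record in a single call so the file holds one JSON array.
    with open(output_path, "w", encoding="utf-8") as file:
        json.dump(records, file, indent=3)

    # Reading the file back confirms that it parses and nothing was lost.
    with open(output_path, encoding="utf-8") as file:
        loaded = json.load(file)

print(loaded == records)
```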