# **Document Title, Author(s) Extraction and Alternative Title Generation**


# Introduction

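This project automatically extracts the title and author(s) from scanned document images using OCR and question answering, and generates an alternative title for each document.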

# Setup

Install pytesseract, the Python wrapper used to call the OCR engine.

Install tesseract-ocr, the Tesseract engine that performs the text recognition.

In [1]:
!pip install pytesseract
!sudo apt install tesseract-ocr

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 1s (4,124 kB/s)

Install unrar for extracting .rar files.

In [2]:
!sudo apt-get install unrar

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unrar is already the newest version (1:6.1.5-1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


# Data Upload

Upload the dataset file (.rar format).

In [3]:
from google.colab import files

# Upload .rar file
uploaded = files.upload()

Saving scientific_publication.rar to scientific_publication.rar


# Extract Images

Extract the images from the uploaded .rar file.

In [4]:
# Extract images from the uploaded .rar folder
!unrar x "scientific_publication.rar"


UNRAR 6.11 beta 1 freeware      Copyright (c) 1993-2022 Alexander Roshal


Extracting from scientific_publication.rar

Creating    scientific_publication                                    OK
Extracting  scientific_publication/doc_000016.png                          1%  OK 
Extracting  scientific_publication/doc_000045.png                          4%  OK 
Extracting  scientific_publication/doc_000050.png                          6%  OK 
Extracting  scientific_publication/doc_000061.png                          7%  OK 
Extracting  scientific_publication/doc_000089.png                          9%  OK 
Extracting  scientific_publication/doc_000121.png                         11%  OK 
Extracting  scientific_publication/doc_000128.png                         13%  OK 
Extracting  scientific_publication/doc_000130.png                         15%  OK 
Extracting  scientific_publication/doc_000171.png                      

# Import Libraries

Import the required libraries: os, pytesseract, Image (from PIL), and pandas.

In [5]:
import os
import pytesseract
from PIL import Image
import pandas as pd

# Text Extraction From Images

**OCR Function**

Define a function to extract text from images using OCR.

In [6]:
def extract_text_from_image(image_path):
    # Open the image
    img = Image.open(image_path)

    # Perform OCR to extract text
    text = pytesseract.image_to_string(img)

    return text


**Set Image Directory**

Define the directory containing extracted images.


In [7]:
image_dir = '/content/scientific_publication/'

**List Image Files**

List all image files in the specified directory.

In [8]:
image_files = [os.path.join(image_dir, f) for f in os.listdir(image_dir) if os.path.isfile(os.path.join(image_dir, f))]
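The comprehension above picks up every regular file in the directory, whatever its type, and `os.listdir` returns names in arbitrary order. A minimal sketch of a stricter listing follows, assuming only common raster formats should be sent to OCR. The helper name `list_image_files` and the extension set are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Assumed set of extensions worth sending to OCR; adjust to the dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def list_image_files(directory):
    """Return sorted paths of image files found directly inside `directory`."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

# Usage in the notebook (assumes `image_dir` is defined as above):
# image_files = list_image_files(image_dir)
```

Sorting makes repeated runs visit the documents in the same order, which helps when comparing outputs between runs.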

**Extract Text from Images**

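Extract text from the images and store it in a dictionary keyed by file name. The `break` stops the loop after the first image, so the outputs below cover a single document; remove it to process the whole directory.

In [9]:
image_texts = {}

for image_file in image_files:
    text = extract_text_from_image(image_file)
    image_texts[os.path.basename(image_file)] = text
    break  # Stop after the first image; remove to OCR every file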
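While developing, it can be handy to cap the number of images processed without editing the loop body. The sketch below does this with `itertools.islice`. The names `ocr_subset` and `limit` are hypothetical, and the OCR function is passed in as an argument so that `extract_text_from_image` from the earlier cell can be supplied.

```python
import os
from itertools import islice

def ocr_subset(image_files, ocr_fn, limit=None):
    """Run `ocr_fn` on at most `limit` files (all files when limit is None)."""
    texts = {}
    for path in islice(image_files, limit):
        texts[os.path.basename(path)] = ocr_fn(path)
    return texts

# Usage in the notebook (assumes the earlier cells have run):
# image_texts = ocr_subset(image_files, extract_text_from_image, limit=1)
```

Passing `limit=None` processes every file, so the same cell can serve both quick checks and full runs.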


**Display Extracted Text**

Display the extracted text from each image.

In [10]:
for image_file, text in image_texts.items():
    print(f"Text extracted from {image_file}:\n{text}\n")

Text extracted from doc_000891.png:
Reprinted from Biochemistry, 198, 27, 40
— copyright © 198 bythe American Choma! Soest and reprinted By peviniaon of copyright owner.

Native Structure and Physical Properties of Bovine Brain Kinesin and
Identification of the ATP-Binding Subunit Polypeptide!

George S. Bloom," Mark C. Wagner, K. Kevin Pfister, and Soot T- Brady

Department of Cell Biology ond Anatomy, Unversity of Texas Souwen, “Medical Center at Dallas, $323 Harry Hines
‘Boulevard, Dallas, Texas 75235

‘Received June 2, 1987; Revised Manuscript Received October 29, 1987

 

 

anstract: Kinesin was extensively purified from bovine brain cytosol by 8 ricrotubule-binding step in
{fe presence of 5-adenylylimidodipbosphate (AMP-PNP),folowat by gel filtration chromatography and
the present on ultracentrifugation. The prodets consistently contained 12400 (124K) and 64000 (64K)
aaa rides, These tw popes appear to represent heat and BN chains of kinesin, respectively,
Galton Po aiid on suer
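The OCR output above contains whitespace-only lines and irregular spacing. A light normalization pass before question answering might look like the sketch below. The helper `normalize_ocr_text` is a hypothetical name, it only tidies whitespace and does not correct misrecognized characters, and whether it improves the downstream answers is an untested assumption.

```python
import re

def normalize_ocr_text(text):
    """Collapse runs of spaces/tabs and drop whitespace-only lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Usage in the notebook (assumes `image_texts` is populated as above):
# image_texts = {name: normalize_ocr_text(t) for name, t in image_texts.items()}
```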

# Title, Author(s) Extraction and Alternative Title Generation

**Install Packages**

Install datasets, evaluate, and transformers (with its sentencepiece extra) for NLP tasks.

In [11]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting datasets
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

**Import Pipeline**

Import pipeline from transformers for question-answering tasks.


In [12]:
from transformers import pipeline

**Title and Author Extraction**

Define a function to extract title and author using question answering.



In [13]:
# Function to extract title and author using question answering
def extract_title_and_author(text):
    """Return (title, author) answers extracted from the OCR text of one document."""
    # Initialize the QA pipeline (BERT-large fine-tuned on SQuAD).
    # Note: this reloads the model on every call; hoist it out of the
    # function when processing many documents.
    qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

    # Define questions
    title_question = "What is the title of this document?"
    author_question = "Who are the authors of this document?"

    # Extract title
    title_result = qa_pipeline(question=title_question, context=text)
    title = title_result['answer']

    # Extract author
    author_result = qa_pipeline(question=author_question, context=text)
    author = author_result['answer']

    return title, author
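Because `extract_title_and_author` builds the pipeline inside the function, the model is loaded again for every document. One possible restructuring, sketched below, passes an already-built pipeline in. `make_title_author_extractor` is a hypothetical helper. It assumes only that `qa` can be called with `question=` and `context=` keyword arguments and returns a dict with an `'answer'` key, as the pipeline in the cell above does.

```python
def make_title_author_extractor(qa):
    """Return a function that reuses one QA callable for every document."""
    questions = {
        "title": "What is the title of this document?",
        "author": "Who are the authors of this document?",
    }

    def extract(text):
        # One call per question, all against the same loaded model.
        answers = {
            key: qa(question=question, context=text)["answer"]
            for key, question in questions.items()
        }
        return answers["title"], answers["author"]

    return extract

# Usage in the notebook (assumes `pipeline` has been imported as above):
# qa = pipeline("question-answering",
#               model="bert-large-uncased-whole-word-masking-finetuned-squad")
# extract = make_title_author_extractor(qa)
# title, author = extract(text)
```

Injecting the callable also makes the extraction logic easy to exercise with a stub, without downloading the model.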


**Alternative Title Generation**

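Define a function to generate an alternative title with the BART summarization checkpoint `facebook/bart-large-cnn`. The function keeps only the first sentence of the generated summary.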



In [14]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load BART tokenizer and model
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def generate_alternative_title(text):
    # Tokenize input text
    inputs = tokenizer(text, max_length=1024, return_tensors="pt", truncation=True)

    # Generate alternative title using BART
    outputs = model.generate(
        **inputs,
        max_length=50,  # Set a maximum length for the output title
        num_beams=4,    # Use beam search for better alternatives
        length_penalty=0.6,  # Lower length penalty for shorter output
        early_stopping=True  # Stop beam search once num_beams finished candidates are found
    )

    # Decode the generated title
    generated_title = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Find the index of the first full stop
    first_fullstop_index = generated_title.find('.')
    if first_fullstop_index != -1:
        generated_title = generated_title[:first_fullstop_index+1]  # Include the full stop

    return generated_title

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
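`generate_alternative_title` cuts the decoded text at the first `.` character, which would also split inside a decimal number such as `12.4`. A slightly more careful rule, sketched below, treats a period as a sentence end only when whitespace or the end of the string follows it. The helper `first_sentence` is a hypothetical name, and abbreviations followed by a space (for example "et al. ") would still be treated as sentence ends.

```python
import re

# A period counts as a sentence end only when followed by whitespace or the
# end of the string, so "12.4" is not split.
_SENTENCE_END = re.compile(r"\.(?=\s|$)")

def first_sentence(text):
    """Return `text` up to and including the first sentence-ending period."""
    match = _SENTENCE_END.search(text)
    return text[: match.end()] if match else text

# Usage inside generate_alternative_title (replacing the find('.') logic):
# generated_title = first_sentence(generated_title)
```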



**Initialize Lists**

Initialize lists to store extracted data.


In [15]:
# Initialize lists to store extracted data
titles = []
authors = []
alternate_titles = []

**Extraction and Generation**

Extract titles and authors and generate alternative titles for each document.



In [16]:
# Generate alternative titles and extract titles and authors for each text
for image_file, text in image_texts.items():
    # Generate alternative title
    alternative_title = generate_alternative_title(text)
    # Extract titles and authors
    title, author = extract_title_and_author(text)
    # Append data to lists
    titles.append(title)
    authors.append(author)
    alternate_titles.append(alternative_title)





Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



**Create DataFrame**

Create a DataFrame to store extracted and generated data.

In [17]:
# Create DataFrame
df = pd.DataFrame({
    'Document Title': titles,
    'Author(s)': authors,
    'Generated Title': alternate_titles
}, index=image_texts.keys())

**Display DataFrame**

Display the DataFrame containing extracted and generated data.

In [18]:
# Display DataFrame
print(df)

                Document Title                  Author(s)  \
doc_000891.png  clectopharesis  American Chemical Society   

                                                  Generated Title  
doc_000891.png  Kinesin was extensively purified from bovine b...  


In [19]:
df.head(10)

| | Document Title | Author(s) | Generated Title |
|---|---|---|---|
| doc_000891.png | clectopharesis | American Chemical Society | Kinesin was extensively purified from bovine b... |


**Extract Additional Fields**

Expand the question-answering approach to cover more fields, asking one question per field.

In [43]:
import pandas as pd

# Load the QA pipeline once so every question reuses the same model
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

def extract_additional_details(image_texts):
    questions = {
        "Title":"What is the title of the paper?",
        "Year": "What is the publication year?",
        "Author":"Which persons are the authors of the paper?",
        "Type": "What type of document is it?",
        "Short description": "Can you provide a short description?",
        "Focal constructs": "What are the focal constructs?",
        "Theoretical perspective": "What is the theoretical perspective?",
        "Context": "Can you describe the context?",
        "Study design": "What study design was used?",
        "Level(s)": "What are the levels?",
        "Methods notes": "Do you have any methodological notes?",
        "Main study findings": "What are the main findings of the study?",
        "LLM Generated Executive Summary": "Can you provide a generated executive summary?",
        "Insights related to my research question": "What insights are related to your research question?",
        "Quotable points": "Can you provide any quotable points?",
        "Potential use": "How can this document be used?",
        "Additional Notes": "Do you have any additional notes?"
    }

    data = []
    for _, text in image_texts.items():
        details = []
        for question in questions.values():
            answer = qa_pipeline(question=question, context=text)
            details.append(answer["answer"])
        data.append(details)

    return pd.DataFrame(data, columns=questions.keys())

# Example usage:
additional_details_df = extract_additional_details(image_texts)
print(additional_details_df)


            Title  Year                     Author      Type  \
0  clectopharesis  1988  American Chemical Society  SDS-PAGE   

  Short description       Focal constructs  \
0   one polypeptide  ATP-binding\nsubunits   

              Theoretical perspective  \
0  selective probe for\n\nanon anspor   

                                   Context     Study design        Level(s)  \
0  as a selective probe for\n\nanon anspor  selective probe  stoichiometric   

               Methods notes  \
0  American Chemical Society   

                                 Main study findings  \
0  Native Structure and Physical Properties of Bo...   

  LLM Generated Executive Summary Insights related to my research question  \
0                        SDS-PAGE    microtubules from cick brain extracts   

     Quotable points                                  Potential use  \
0  0427-3409801.50/0  serve as a selective probe for\n\nanon anspor   

            Additional Notes  
0  American Chemical Societ

In [44]:
additional_details_df.head()

| | Title | Year | Author | Type | Short description | Focal constructs | Theoretical perspective | Context | Study design | Level(s) | Methods notes | Main study findings | LLM Generated Executive Summary | Insights related to my research question | Quotable points | Potential use | Additional Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | clectopharesis | 1988 | American Chemical Society | SDS-PAGE | one polypeptide | ATP-binding\nsubunits | selective probe for\n\nanon anspor | as a selective probe for\n\nanon anspor | selective probe | stoichiometric | American Chemical Society | Native Structure and Physical Properties of Bo... | SDS-PAGE | microtubules from cick brain extracts | 0427-3409801.50/0 | serve as a selective probe for\n\nanon anspor | American Chemical Society |
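Several fields in the output above receive the same span regardless of the question (for example "American Chemical Society" appears under Author, Methods notes, and Additional Notes). An extractive QA pipeline always returns some span, but each result also carries a `score`. The sketch below keeps an answer only when that score clears a threshold. `answers_with_threshold` is a hypothetical helper and the default `min_score=0.3` is an arbitrary assumption that would need tuning on real data.

```python
def answers_with_threshold(qa, questions, text, min_score=0.3):
    """Map each field to its answer, or None when the model's score is low."""
    results = {}
    for field, question in questions.items():
        result = qa(question=question, context=text)
        # `score` is the pipeline's confidence for the extracted span.
        results[field] = result["answer"] if result["score"] >= min_score else None
    return results

# Usage in the notebook (assumes `qa_pipeline` is loaded as above and
# `questions` is a field-to-question dict like the one in that cell):
# details = answers_with_threshold(qa_pipeline, questions, text)
```

Leaving low-confidence fields empty makes it easier to see which questions the extractive model cannot answer from the OCR text.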
