<a href="https://colab.research.google.com/github/Paushi2003/Text-Detection-and-Layout-Analysis/blob/main/Text_Detection_and_Layout_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''
  This project is divided into two subtasks: Text Detection and Layout Analysis.
  Modules required for image files: tesseract-ocr, pytesseract, cv2 (OpenCV), numpy, PIL (Pillow).
  Modules required for PDF files: pdf2image, poppler-utils (a system dependency of pdf2image), numpy.
  Layout Analysis uses layoutparser with a Detectron2 backend.
  The UI is built with Streamlit.
  The commands below install all the required libraries.
'''

!sudo apt install -y tesseract-ocr poppler-utils
!pip install pytesseract streamlit layoutparser pdf2image
!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'

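Before running the later cells, it can help to confirm that the system packages installed above are actually visible on PATH. The sketch below is not part of the original project: `check_binaries` is a hypothetical helper, and it assumes that pytesseract calls an executable named `tesseract` and that pdf2image calls poppler's `pdftoppm` and `pdfinfo`.

```python
import shutil

# Executables the Python wrappers shell out to (assumed names, see the note above).
REQUIRED_BINARIES = ("tesseract", "pdftoppm", "pdfinfo")


def check_binaries(names=REQUIRED_BINARIES):
    """Return {name: bool} telling whether each executable is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_binaries().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry is reported as missing, rerunning the apt install cell is the first thing to try.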
In [None]:
'''
  First, import all the required libraries. The app defines three functions:
  text_detection: detects text areas and marks each one with a bounding box.
  parserIMG: Layout Analysis for image documents.
  parserPDF: Layout Analysis for PDF documents.

  text_detection:-
    The input image is converted into a numpy array and passed to the
    image_to_data function of the pytesseract module, which locates the text
    so that bounding boxes can be drawn. The output has one row per detected
    element, and each row is split into a list of fields. The fields at
    indices 6, 7, 8 and 9 hold the x-coordinate and y-coordinate of the
    top-left corner, the width and the height of the word, respectively.
    A bounding box is drawn around each such region.

  parserIMG:-
    This function performs layout analysis on image data. It uses
    Detectron2LayoutModel from layoutparser with a Faster R-CNN model trained
    on PubLayNet to detect and classify layout elements (Text, Title, List,
    Table, Figure). The results are visualized with bounding boxes, each
    annotated with its element type, and the annotated image is displayed.

  parserPDF:-
    A PDF input is converted into a list of images, one per page, using the
    pdf2image module. Each page image is converted into a numpy array and
    passed to the same Detectron2LayoutModel, so every page is treated as a
    single image document. Bounding boxes are drawn to visualize the layout
    elements, and each page is displayed after layout analysis.
'''

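The field layout described for text_detection can be sketched without running Tesseract at all. The example below parses a TSV string in the column order that `pytesseract.image_to_data` emits and computes padded box corners. `word_boxes` and `SAMPLE_TSV` are hypothetical names with made-up data for illustration, and unlike the app code this sketch clamps the padded top-left corner at zero.

```python
# Column order of the TSV string returned by pytesseract.image_to_data.
COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
           "left", "top", "width", "height", "conf", "text"]


def word_boxes(tsv, pad=5):
    """Yield (text, top_left, bottom_right) for every row with non-empty text."""
    for line in tsv.splitlines()[1:]:          # skip the header row
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue                           # malformed row
        row = dict(zip(COLUMNS, fields))
        text = row["text"].strip()
        if not text:
            continue                           # page/block/line rows carry no text
        x, y = int(row["left"]), int(row["top"])
        w, h = int(row["width"]), int(row["height"])
        # Pad the box, keeping the top-left corner inside the image.
        top_left = (max(x - pad, 0), max(y - pad, 0))
        bottom_right = (x + w + pad, y + h + pad)
        yield text, top_left, bottom_right


# Hypothetical sample output: one page-level row, then two words.
SAMPLE_TSV = "\n".join([
    "\t".join(COLUMNS),
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
    "5\t1\t1\t1\t1\t1\t2\t30\t80\t20\t96\tHello",
    "5\t1\t1\t1\t1\t2\t100\t30\t90\t20\t95\tworld",
])

if __name__ == "__main__":
    for text, top_left, bottom_right in word_boxes(SAMPLE_TSV):
        print(text, top_left, bottom_right)
```

In the app, each pair of corners could be passed to `cv2.rectangle` in place of the inline arithmetic. pytesseract can also return the same columns as a dictionary via `output_type=pytesseract.Output.DICT`, which avoids manual splitting.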
In [None]:
%%writefile app.py
import streamlit as st
import pytesseract
import cv2
import pdf2image
import numpy as np
import layoutparser as lp
from PIL import Image


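def text_detection(img):
  # Force 3-channel RGB so the box color below is valid for PNGs with alpha or palette modes
  img1 = Image.open(img).convert('RGB')
  res = np.array(img1)
  with st.spinner('Processing Text Detection'):
    # image_to_data returns a TSV string: one header row, then one row per detected element
    boxes = pytesseract.image_to_data(res)
    for i,b in enumerate(boxes.splitlines()):
      if i!=0:  # skip the header row
        b = b.split()
        # rows with a non-empty text field split into exactly 12 fields
        if len(b)==12:
          x,y,w,h = int(b[6]),int(b[7]),int(b[8]),int(b[9])
          # draw a box padded by 5 px; (0,0,255) is blue because the array is in RGB order
          res = cv2.rectangle(res,(x-5,y-5),(w+x+5,h+y+5),(0,0,255),3)
          #res = cv2.putText(res,b[11],(x,y),cv2.FONT_HERSHEY_COMPLEX,1,(50,50,255),2)
  st.subheader("Text Detection")
  st.image(res)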

def parserIMG(path):
  res = Image.open(path)
  with st.spinner('Processing Layout Analysis'):
    # Faster R-CNN trained on PubLayNet; detections scoring below 0.8 are discarded
    model = lp.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
                                  extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
                                  label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})
    layout_result = model.detect(res)
    # overlay the detected blocks, each labeled with its element type
    res = lp.draw_box(res, layout_result, box_width=5, box_alpha=0.2, show_element_type=True)
  st.subheader("Layout Analysis")
  st.image(res)

def parserPDF(path):
  # convert_from_bytes returns one PIL image per PDF page
  pages = pdf2image.convert_from_bytes(path.read())
  st.subheader("Layout Analysis")

  with st.spinner('Processing Layout Analysis'):
    # load the model once and reuse it for every page
    model = lp.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
                                  extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
                                  label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})
    for page in pages:
      # convert page by page: pages may differ in size, so they cannot be stacked into one array
      page = np.array(page)
      layout_result = model.detect(page)
      page = lp.draw_box(page, layout_result, box_width=5, box_alpha=0.2, show_element_type=True)
      st.image(page)

#title 
st.title('Text Detection & Layout Analysis')
#uploading file
file = st.file_uploader(label='Upload your document here',type=['png','jpg','jpeg','pdf'])
button = st.button('Confirm')
if button and file is not None:
    if file.type in ("image/png", "image/jpg", "image/jpeg"):
      parserIMG(file)
      text_detection(file)
    elif file.type == "application/pdf":
      parserPDF(file)



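The upload routing at the end of app.py can be sketched as a pure function, which makes it checkable without starting Streamlit. `plan_for` is a hypothetical helper, not part of the original app. It only mirrors the MIME-type branches shown above and returns the names of the functions that would run.

```python
# MIME types the app treats as images (same values as the original check).
IMAGE_TYPES = {"image/png", "image/jpg", "image/jpeg"}
PDF_TYPE = "application/pdf"


def plan_for(mime_type):
    """Return the names of the app functions to run for an uploaded file's MIME type."""
    if mime_type in IMAGE_TYPES:
        # Images get layout analysis first, then word-level text detection.
        return ["parserIMG", "text_detection"]
    if mime_type == PDF_TYPE:
        # PDFs only get page-by-page layout analysis.
        return ["parserPDF"]
    return []  # unsupported type: nothing runs


if __name__ == "__main__":
    for mime in ("image/png", "application/pdf", "text/plain"):
        print(mime, "->", plan_for(mime))
```

Returning an empty list for unknown types matches the app's behavior of silently doing nothing when neither branch applies.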
In [None]:
!streamlit run app.py & npx localtunnel --port 8501 