## **Text Extraction**

This section extracts the text from a PDF file and then splits it into chunks of up to 1000 characters, with an overlap of 200 characters between consecutive chunks. The necessary steps are detailed below.
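
As a rough sketch of the arithmetic behind these numbers: with 1000-character chunks and a 200-character overlap, each chunk starts 800 characters after the previous one. The helper `chunk_windows` below is hypothetical (it is not part of this notebook) and only computes character offsets, assuming a 2500-character text and assuming chunks are cut by stepping from offset 0 until the start reaches the end of the text.

```python
def chunk_windows(text_length, chunk_size=1000, overlap=200):
    # Hypothetical helper: returns (start, end) offsets, not the text itself
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]

windows = chunk_windows(2500)
print(windows)
# → [(0, 1000), (800, 1800), (1600, 2500), (2400, 2500)]
```

Under these assumptions the last window is short and lies entirely inside the previous one, which is worth keeping in mind when counting chunks.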

### 1. Install the necessary libraries

Run the following command in your console: `pip install pymupdf pandas`

### 2. Import the necessary libraries

We import the two libraries installed in the previous step. PyMuPDF is imported under the name `fitz`.

In [16]:
import fitz  # PyMuPDF
import pandas as pd

### 3. Create a Function to Extract Text

This function opens the PDF document and extracts all of its text, page by page.

In [17]:
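def extract_text_from_pdf(pdf_path):
    # Collect the text of each page, then join once at the end
    page_texts = []
    # The context manager closes the document when extraction finishes
    with fitz.open(pdf_path) as document:
        for page in document:
            page_texts.append(page.get_text())
    return "".join(page_texts)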

### 4. Divide the Text into Chunks

This function splits the text into chunks of at most 1000 characters. Consecutive chunks overlap by 200 characters, so each chunk starts 800 characters after the previous one.

In [18]:
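def split_text_into_segments(text, chunk_size=1000, overlap=200):
    # The step must be positive, otherwise the loop below would never end
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    segments = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        # Slicing past the end of the string is safe; the last segment is just shorter
        segments.append(text[start:end])
        # Advance by less than chunk_size so consecutive segments overlap
        start += chunk_size - overlap
    return segments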
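
To check the overlap on something small enough to read, the sketch below runs the same splitting logic on the alphabet with `chunk_size=10` and `overlap=3`. The function is repeated so the snippet runs on its own; the toy text and parameter values are assumptions chosen for illustration.

```python
def split_text_into_segments(text, chunk_size=1000, overlap=200):
    segments = []
    start = 0
    while start < len(text):
        segments.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return segments

toy_text = "abcdefghijklmnopqrstuvwxyz"
segments = split_text_into_segments(toy_text, chunk_size=10, overlap=3)
print(segments)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']

# Each segment should begin with the last 3 characters of the previous one
overlaps_match = all(
    prev[-3:] == nxt[:3] for prev, nxt in zip(segments, segments[1:])
)
print(overlaps_match)
# → True
```

The final segment is shorter than `chunk_size` because the text runs out before the window is full.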

### 5. Example

Finally, here is an example that puts the previous steps together.

In [19]:
pdf_path = '../docs-testing/three-little-pigs-story.pdf'

pdf_text = extract_text_from_pdf(pdf_path)

text_segments = split_text_into_segments(pdf_text)

# Create a DataFrame from the segments, only to display them
df = pd.DataFrame(text_segments, columns=['text'])

# Show the first 5 rows
print(df.head())


                                                text
0  Story 1\nThe Three Little Pigs \nBoth a short ...
1  huffed\nand puffed and blew it down. \nThe sec...
2   discussed in this lesson are underlined.\n \n...
3   bad wolf huffed\nand puffed and blew the hous...
4  d and puffed and THE\nHOUSE DID NOT FALL DOWN!...
