## Install the required packages

In [1]:
%pip install unstructured

Note: you may need to restart the kernel to use updated packages.


## unstructured - extracting text from a PDF

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
from unstructured.partition.auto import partition

# when the input is a file path
filename = "./data/world-war-one.pdf"
elements = partition(filename=filename, content_type="application/pdf")
print("".join(str(e) for e in elements))

# when the input is an open file object
with open(filename, "rb") as fobj:
    elements = partition(file=fobj, include_page_breaks=True)

# print the extracted elements
for e in elements:
    print(e)




Orishti World War I |¢ World War I (WW J), also known as the Great War, lasted from 28 July 1914 to 11 November 1918.¢ WW Iwas fought between the Allied Powers and the Central Powers. © The main members of the Allied Powers were France, Russia, and Britain. The United States also fought on the side of the Allies after 1917.© The main members of the Central Powers were Germany, Austria-Hungary, the Ottoman Empire, and Bulgaria.Causes of the WarThere was no single event that led to World War I. The war happened because of several different events that took place in the years building up to 1914.¢ The new international expansionist policy of Germany: In 1890 the new emperor of Germany, Wilhelm II, began an international policy that sought to turn his country into a world power. Germany was seen as a threat by the other powers and destabilized the international situation. ¢ Mutual Defense Alliances: Countries throughout Europe made mutual defence agreements. These treaties meant that if on

### Check how unstructured captures the titles in the PDF

In [3]:
for e in elements:
    if e.category == "Title":
        print(e)

Orishti World War I |
Causes of the War
ATLANTIC OCEAN
Fig: Alliances at the beginning of the War
Phases of the War
Consequences of the war
UROPE, 1922
Military clauses:
¢ Drastic limitation of the German navy.
¢ Demilitarization of the Rhineland region.
War Reparations:
Other Treaties The Treaty of Neuilly, signed with Bulgaria
The Treaty of Sevres (1920) signed with Turkey
India and WWI
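
Several of the lines printed above, such as the two starting with `¢` and the map label `ATLANTIC OCEAN`, look more like list items or figure text than headings. A quick way to audit the classification is to count elements per category and flag titles that begin with a bullet-like glyph. The sketch below uses a hypothetical `FakeElement` stand-in with only `category` and `text` attributes so it runs without the PDF; the category labels on the sample rows are assumptions, and the same two functions should work on real elements as long as they expose those two attributes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in exposing only `category` and `text`."""
    category: str
    text: str


# Sample texts copied from the outputs above; the category labels are assumed.
sample_elements = [
    FakeElement("Title", "Causes of the War"),
    FakeElement("Title", "ATLANTIC OCEAN"),
    FakeElement("Title", "¢ Drastic limitation of the German navy."),
    FakeElement("NarrativeText", "There was no single event that led to World War I."),
]


def category_counts(elements):
    """Count how many elements fall into each category."""
    return Counter(e.category for e in elements)


def suspicious_titles(elements, bullet_chars="¢©•"):
    """Return the text of Title elements that start with a bullet-like glyph."""
    prefixes = tuple(bullet_chars)
    return [
        e.text
        for e in elements
        if e.category == "Title" and e.text.lstrip().startswith(prefixes)
    ]


counts = category_counts(sample_elements)
flagged = suspicious_titles(sample_elements)
print(counts)
print(flagged)
```

Flagged titles are candidates for manual review before they are used as section boundaries in title-based chunking.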


In [4]:
# when we know the file is a PDF, call the PDF partitioner directly
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="./data/world-war-one.pdf", languages=["eng"])
for e in elements:
    print(e)

Orishti World War I |
¢ World War I (WW J), also known as the Great War, lasted from 28 July 1914 to 11 November 1918.
¢ WW Iwas fought between the Allied Powers and the Central Powers. © The main members of the Allied Powers were France, Russia, and Britain. The United States also fought on the side of the Allies after 1917.
© The main members of the Central Powers were Germany, Austria-Hungary, the Ottoman Empire, and Bulgaria.
Causes of the War
There was no single event that led to World War I. The war happened because of several different events that took place in the years building up to 1914.
¢ The new international expansionist policy of Germany: In 1890 the new emperor of Germany, Wilhelm II, began an international policy that sought to turn his country into a world power. Germany was seen as a threat by the other powers and destabilized the international situation. ¢ Mutual Defense Alliances: Countries throughout Europe made mutual defence agreements. These treaties meant that

## unstructured - Cleaning the extracted text

In [5]:
# code to use the cleaning function from unstructured
from unstructured.cleaners.core import clean
cleaned_text = clean(text="".join(str(e) for e in elements),extra_whitespace=True, dashes=True, bullets=True, trailing_punctuation=True, lowercase=True)

# it's usually a good idea to strip non-ASCII characters as well
from unstructured.cleaners.core import clean_non_ascii_chars
cleaned_text = clean_non_ascii_chars(text=cleaned_text)

# in case your text starts with an ordered bullet (e.g. "1.1 "), remove it too
from unstructured.cleaners.core import clean_ordered_bullets
cleaned_text = clean_ordered_bullets(text=cleaned_text)

# remove the ASCII and unicode punctuation from the string
from unstructured.cleaners.core import remove_punctuation
cleaned_text = remove_punctuation(cleaned_text)

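# replace mis-encoded Unicode quote characters such as \x91 with regular quotes
# (after the non-ASCII and punctuation steps above, little is left for it to replace)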
from unstructured.cleaners.core import replace_unicode_quotes
cleaned_text = replace_unicode_quotes(cleaned_text)

cleaned_text

'orishti world war i | world war i ww j also known as the great war lasted from 28 july 1914 to 11 november 1918 ww iwas fought between the allied powers and the central powers  the main members of the allied powers were france russia and britain the united states also fought on the side of the allies after 1917 the main members of the central powers were germany austria hungary the ottoman empire and bulgariacauses of the warthere was no single event that led to world war i the war happened because of several different events that took place in the years building up to 1914 the new international expansionist policy of germany in 1890 the new emperor of germany wilhelm ii began an international policy that sought to turn his country into a world power germany was seen as a threat by the other powers and destabilized the international situation  mutual defense alliances countries throughout europe made mutual defence agreements these treaties meant that if one country was attacked allie
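
The cleaned string above contains fused words such as `bulgariacauses` and `warthere`. They come from joining the elements with an empty string, so the last word of one element runs into the first word of the next. A minimal sketch of a fix, using only the standard library: join with a separator and drop the `¢` and `©` glyphs that OCR produced in place of bullets. The helper name `join_elements` and the assumption that these two glyphs never carry meaning in this document are mine, not part of unstructured.

```python
import re

# Sample element texts copied from the extraction output above.
element_texts = [
    "© The main members of the Central Powers were Germany, Austria-Hungary, "
    "the Ottoman Empire, and Bulgaria.",
    "Causes of the War",
    "There was no single event that led to World War I.",
]

# Glyphs that OCR emitted instead of bullets in this document (assumption:
# they are never meaningful here, so removing them is safe).
OCR_BULLETS = re.compile(r"[¢©]\s*")


def join_elements(texts, separator="\n\n"):
    """Join element texts with a separator so neighbouring words do not fuse."""
    return separator.join(OCR_BULLETS.sub("", t).strip() for t in texts)


joined = join_elements(element_texts)
print(joined)
```

The resulting string can be passed to the cleaning functions in place of the `"".join(...)` expression; a single space works as the separator too if paragraph breaks are not needed downstream.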

## unstructured - Chunking post data cleaning

### chunking strategy - "basic"

In [6]:
from unstructured.chunking.basic import chunk_elements
elements = partition(filename="./data/world-war-one.pdf", languages=["eng"])
chunks = chunk_elements(elements)

for chunk in chunks:
    print(chunk)
    print("\n" + "%"*200)

Orishti World War I |

¢ World War I (WW J), also known as the Great War, lasted from 28 July 1914 to 11 November 1918.

¢ WW Iwas fought between the Allied Powers and the Central Powers. © The main members of the Allied Powers were France, Russia, and Britain. The United States also fought on the side of the Allies after 1917.

© The main members of the Central Powers were Germany, Austria-Hungary, the Ottoman Empire, and Bulgaria.

Causes of the War

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
There was no single event that led to World War I. The war happened because of several different events that took place in the years building up to 1914.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
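
In the output above, the first chunk packs five consecutive elements together and the next element starts a new chunk, which is what a size-limited greedy packer would produce. The sketch below illustrates that idea with plain strings. It is a simplified stand-in, not the library's implementation: `pack_texts` is a hypothetical helper, the 500-character limit is an assumed default, and a text longer than the limit is kept whole rather than split.

```python
def pack_texts(texts, max_characters=500, separator="\n\n"):
    """Greedily pack consecutive texts into chunks of at most max_characters.

    Simplification: a single text longer than max_characters becomes its own
    oversized chunk instead of being split.
    """
    chunks, current = [], ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # Adding this text would exceed the limit: close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample_texts = ["a" * 30, "b" * 30, "c" * 30]
packed = pack_texts(sample_texts, max_characters=70)
for chunk in packed:
    print(len(chunk), repr(chunk[:10]))
```

With a limit of 70, the first two 30-character texts fit together (62 characters including the separator) and the third starts a new chunk.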

### chunking strategy - "by_title"

In [7]:
from unstructured.chunking.title import chunk_by_title
elements = partition(filename="./data/world-war-one.pdf", languages=["eng"])
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n" + "%"*200)


Orishti World War I |

¢ World War I (WW J), also known as the Great War, lasted from 28 July 1914 to 11 November 1918.

¢ WW Iwas fought between the Allied Powers and the Central Powers. © The main members of the Allied Powers were France, Russia, and Britain. The United States also fought on the side of the Allies after 1917.

© The main members of the Central Powers were Germany, Austria-Hungary, the Ottoman Empire, and Bulgaria.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Causes of the War

There was no single event that led to World War I. The war happened because of several different events that took place in the years building up to 1914.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
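
Compared with the basic strategy, the output above breaks before `Causes of the War`, so the heading stays with the paragraph it introduces. The core idea can be sketched as "start a new section whenever a Title element appears". The code below works on hypothetical `(category, text)` pairs rather than real elements; the category labels are assumed, and the sketch ignores the size limits that the library also applies.

```python
def split_on_titles(elements):
    """Group (category, text) pairs into sections that each start at a Title."""
    sections, current = [], []
    for category, text in elements:
        if category == "Title" and current:
            # A new title closes the section collected so far.
            sections.append(current)
            current = []
        current.append(text)
    if current:
        sections.append(current)
    return ["\n\n".join(section) for section in sections]


# Texts copied from the output above; category labels are assumptions.
sample = [
    ("Title", "Orishti World War I |"),
    ("ListItem", "¢ World War I (WW J), also known as the Great War, "
                 "lasted from 28 July 1914 to 11 November 1918."),
    ("Title", "Causes of the War"),
    ("NarrativeText", "There was no single event that led to World War I."),
]
sections = split_on_titles(sample)
for section in sections:
    print(section)
    print("-" * 20)
```

Because section boundaries depend entirely on the Title labels, any element misclassified as a title produces a spurious break.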

## How to recover the page numbers of each chunk
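
The cell under this heading prints one page number per original element, so a chunk built from four elements on the same page prints `1` four times. When only the page span matters, the per-element numbers can be collapsed first. A minimal sketch, assuming the page numbers have already been collected into a plain list and that some entries may be `None` for elements without page information; `page_span` and `format_span` are hypothetical helper names.

```python
def page_span(page_numbers):
    """Collapse per-element page numbers into a sorted list of distinct pages."""
    return sorted({p for p in page_numbers if p is not None})


def format_span(pages):
    """Render a list of distinct, sorted pages as a short label."""
    if not pages:
        return "unknown"
    if len(pages) == 1:
        return f"page {pages[0]}"
    return f"pages {pages[0]}-{pages[-1]}"


single = format_span(page_span([1, 1, 1, 1]))
multi = format_span(page_span([2, None, 1, 2]))
print(single)
print(multi)
```

For a real chunk, the input list would be `[ele.metadata.page_number for ele in chunk.metadata.orig_elements]`, the same attributes the cell below reads.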

In [8]:
for i, chunk in enumerate(chunks):
    print(f"***chunk {i + 1} text is {chunk.text}")
    for ele in chunk.metadata.orig_elements:
        print("page number is", ele.metadata.page_number)
    print("*"*20)

***chunk 1 text is Orishti World War I |

¢ World War I (WW J), also known as the Great War, lasted from 28 July 1914 to 11 November 1918.

¢ WW Iwas fought between the Allied Powers and the Central Powers. © The main members of the Allied Powers were France, Russia, and Britain. The United States also fought on the side of the Allies after 1917.

© The main members of the Central Powers were Germany, Austria-Hungary, the Ottoman Empire, and Bulgaria.
page number is 1
page number is 1
page number is 1
page number is 1
********************
***chunk 2 text is Causes of the War

There was no single event that led to World War I. The war happened because of several different events that took place in the years building up to 1914.
page number is 1
page number is 1
********************
***chunk 3 text is ¢ The new international expansionist policy of Germany: In 1890 the new emperor of Germany, Wilhelm II, began an international policy that sought to turn his country into a world power. Ger