# Tutorial 4: Extract Text Section-wise 

## Retrieve earlier saved soup object 
### If you don't have this, work through Tutorials 1 to 3 first

In [22]:
%store -r soup

## Imports
### (In addition to BeautifulSoup, we also import the NavigableString, Tag, Comment and Stylesheet classes)

In [23]:
from bs4 import BeautifulSoup, NavigableString, Comment, Stylesheet, Tag

## Pattern-based segmenter for all h2 sections

In [24]:
import re
import pandas as pd

section_offsets_dict = {"section": [], "start": [], "end": [], "text_content": []}

# for body_text in soup.find_all("div", {"id": ["bodyContent"]}):
for body_text in soup.find_all("div", {"id": ["mw-content-text"]}):
    body_text_txt = body_text.get_text(strip=True)
    # Clean up text content if you like.
    # body_text_txt = re.sub(r'\s+', ' ', body_text_txt.strip())
    # Text above the first heading goes into an unlabelled "No Section" record
    sections, starts = ["No Section"], [0]
    for sect in body_text.find_all("h2"):
        # Get each heading's text and normalise its whitespace
        sect_txt = re.sub(r'\s+', ' ', sect.get_text(strip=True))
        # Match based on simple find (first occurrence in the body text)
        pos = body_text_txt.find(sect_txt)
        if pos != -1:
            sections.append(sect_txt)
            starts.append(pos)
    # Each section ends one character before the next one starts;
    # the last section ends at the end of the text
    ends = [next_start - 1 for next_start in starts[1:]] + [len(body_text_txt)]
    # Now slice the text content based on each offset pair
    for sect_name, start_pos, end_pos in zip(sections, starts, ends):
        section_offsets_dict["section"].append(sect_name)
        section_offsets_dict["start"].append(start_pos)
        section_offsets_dict["end"].append(end_pos)
        section_offsets_dict["text_content"].append(body_text_txt[start_pos:end_pos + 1].strip())

# Optional - add to a pandas dataframe
section_frame = pd.DataFrame(section_offsets_dict)
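
One caveat of matching with a plain `find` is that it returns the first occurrence of the heading text, which could sit in an earlier paragraph rather than at the heading itself. A minimal sketch of one way to reduce this risk, assuming the headings are passed in document order, is to search forward from the end of the previous match. The function name `segment_by_headings` and the sample text are hypothetical and not part of the tutorial's code. A mention inside the immediately preceding section would still be matched, so treat this as a partial safeguard only.

```python
def segment_by_headings(text, headings, first_label="No Section"):
    """Split text into (label, start, end, content) records, one per heading.

    Assumes headings are given in document order. Each heading is searched
    for from the end of the previous match, so a mention that occurs before
    the previous heading is not picked up by mistake.
    """
    labels, starts = [first_label], [0]
    cursor = 0
    for heading in headings:
        pos = text.find(heading, cursor)
        if pos == -1:
            continue  # heading text not present after the cursor; skip it
        labels.append(heading)
        starts.append(pos)
        cursor = pos + len(heading)
    # A section ends one character before the next one starts; the last
    # one ends at len(text), mirroring the offsets used in the cell above.
    ends = [next_start - 1 for next_start in starts[1:]] + [len(text)]
    return [
        (label, start, end, text[start:end + 1].strip())
        for label, start, end in zip(labels, starts, ends)
    ]


# Toy text: "References" is mentioned in the intro before its real heading
sample = "Intro cites References early.History[edit]Born 1805.References[edit]A book."
records = segment_by_headings(sample, ["History", "References"])
for record in records:
    print(record)
```

With the real page, the same function could be called with `body_text_txt` and the list of cleaned heading strings.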

## Check out segmented output

In [25]:
pd.set_option('display.max_colwidth', None)
section_frame.head(10)

Unnamed: 0,section,start,end,text_content
0,No Section,0,436,"Serbian writer, historian, and priestPavle StamatovićПавле СтаматовићBorn(1805-04-11)11 April 1805Jakovo,Military FrontierDied14 September 1864(1864-09-14)(aged 59)Novi Sad,Austrian EmpireAlma materRoyal University of PestPavle Stamatović(11 April 1805 – 14 September 1864) was a Serbian writer, historian, and archpriest. He chaired the delegation ofSouth Slavsat thePrague Slavic Congress, 1848.[1]He was also a member ofMatica Srpska."
1,Biography,437,2134,"Biography[edit]He was born in the town ofJakovoat a time whenSremwas under Habsburg rule. He finished his primary and secondary education in Jakovo,Sremski Karlovci, andBuda. He studied philosophy and theology at Sremski Karlovci andPest. In his last year at theRoyal University of Pest, he became acquainted withLjudevit Gaj, who happened to enroll at the same time when he was in his graduating year. In 1832 he became a monk in Pest and later, as a parish priest of theChurch of St. Nicholas, Szeged, from 1834 to 1844, he edited and published a Serbian almanacSrbska pčela(Serbian Bee) which had a significant circulation. He was transferred from his parish in Szeged toNovi Sad. There he continued to publish his almanac and joinedMatica Srpska. He was elected president ofMatica Srpskain 1831 and served as the editor ofLetopis Matice Srpske1831–32.[2]Stamatović was a corresponding member of the Society of Serbian Letters from 11 June 1842 (now part of theSerbian Academy of Sciences and Arts).[3]At the SzegedLycéehe initiated a student society calledMlado Jedinjepromoting the study of Slavic languages and works of literature; published an almanac,Srbska pčela ili novi cvetnikfor more than a decade (1830–1841);[4]composed an ode to Slavic unity and brotherhood (Slava slavenska u Evropi, 1837); translated from Polish the monumental""Historya prawodawstw slowianskich""(History of Slavic Legislation) byWacław Maciejowski; and the pioneering work""Prawda ruska""by Ignacy Benedikt Rakowiecki (1783–1839).[5]He translated many Russian, Polish and Czech articles on Slavic affairs.[citation needed]He also wrote a book calledMladyj Serbljin u vsemirnom carstvu, published in Buda in 1834.[6]"
2,References,2135,3141,"References[edit]^Maxwell, Alexander (15 October 2009).Choosing Slovakia: Slavic Hungary, the Czechoslovak Language and Accidental Nationalism. I. B. Tauris.ISBN9781848850743– via Google Books.^d.o.o, cubes.""190 godina Matice srpske - Vreme"".vreme.com/(in Serbian). Retrieved2024-08-24.^""Stamatovic Pavle"".www.sanu.ac.rs. Retrieved2024-08-24.^""Pavle Stamatović - Knjige po autorima : Antikvarne-knjige.com"".Antikvarneknjige.com.^""MDZ-Reader - Band - Prawda ruska czyli Prawa Wielkiego Xięcia Jarosława Władymirowicza / Rakowiecki, Ignacy Benedikt - Prawda ruska czyli Prawa Wielkiego Xięcia Jarosława Władymirowicza / Rakowiecki, Ignacy Benedikt"".reader.digitale-sammlungen.de.^Göpfert, Herbert Georg; Koziełek, Gerard; Wittmann, Reinhard (12 December 1977).Buch- und Verlagswesen im 18. und 19. Jahrhundert. Camen.ISBN9783921515020– via Google Books.Authority control databasesInternationalISNIVIAFNationalCroatiaRetrieved from ""https://en.wikipedia.org/w/index.php?title=Pavle_Stamatović&oldid=1297381790"""
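
The `text_content` values above still carry page residue, such as the `[edit]` link after each heading and bracketed markers like `[1]` or `[citation needed]`. A minimal sketch of stripping these with regular expressions follows. The helper name `clean_section_text` and the exact patterns are assumptions for illustration and may need tuning for other pages. The fused words (for example `priestPavle`) have a different cause: `get_text(strip=True)` joins strings with no separator, and passing one, as in `get_text(" ", strip=True)`, is one way to avoid that at extraction time.

```python
import re

# Hypothetical patterns, chosen to match the residue visible in the output above
EDIT_LINK = re.compile(r"\[edit\]")
CITATION = re.compile(r"\[(?:\d+|citation needed)\]")


def clean_section_text(section_text, heading=None):
    """Remove [edit] links and citation markers; optionally drop the leading heading."""
    cleaned = EDIT_LINK.sub(" ", section_text)
    cleaned = CITATION.sub(" ", cleaned)
    if heading and cleaned.startswith(heading):
        cleaned = cleaned[len(heading):]
    # Collapse the whitespace introduced by the substitutions
    return re.sub(r"\s+", " ", cleaned).strip()


# Abridged sample modelled on the Biography row above
raw = "Biography[edit]He was born in the town ofJakovo.[2]He translated many articles.[citation needed]"
print(clean_section_text(raw, heading="Biography"))
```

Applied to the dataframe, this could look like `section_frame["text_content"].map(clean_section_text)`, keeping the raw column alongside so the offsets stay meaningful.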


## Pattern-based segmenter for all section-like headings (h1 / h2 / h3)

In [26]:
import re
import pandas as pd

section_offsets_dict = {"section": [], "start": [], "end": [], "text_content": []}

# for body_text in soup.find_all("div", {"id": ["bodyContent"]}):
for body_text in soup.find_all("div", {"id": ["mw-content-text"]}):
    body_text_txt = body_text.get_text(strip=True)
    # Clean up text content if you like.
    # body_text_txt = re.sub(r'\s+', ' ', body_text_txt.strip())
    # Text above the first heading goes into an unlabelled "No Section" record
    sections, starts = ["No Section"], [0]
    for sect in body_text.find_all(["h1", "h2", "h3"]):
        # Get each heading's text and normalise its whitespace
        sect_txt = re.sub(r'\s+', ' ', sect.get_text(strip=True))
        # Match based on simple find (first occurrence in the body text)
        pos = body_text_txt.find(sect_txt)
        if pos != -1:
            sections.append(sect_txt)
            starts.append(pos)
    # Each section ends one character before the next one starts;
    # the last section ends at the end of the text
    ends = [next_start - 1 for next_start in starts[1:]] + [len(body_text_txt)]
    # Now slice the text content based on each offset pair
    for sect_name, start_pos, end_pos in zip(sections, starts, ends):
        section_offsets_dict["section"].append(sect_name)
        section_offsets_dict["start"].append(start_pos)
        section_offsets_dict["end"].append(end_pos)
        section_offsets_dict["text_content"].append(body_text_txt[start_pos:end_pos + 1].strip())

# Optional - add to a pandas dataframe
section_frame_all_hs = pd.DataFrame(section_offsets_dict)
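
When several heading levels are collected at once, it can help to record which level each heading came from. The sketch below works on a toy HTML snippet (an assumption, not the tutorial's page) and derives the level from each tag's `name`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Toy markup standing in for a page's content div
toy_html = """
<div id="mw-content-text">
  <p>Lead paragraph.</p>
  <h2>Biography</h2><p>Early years.</p>
  <h3>Education</h3><p>Studied in Pest.</p>
  <h2>References</h2><p>A book.</p>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

heading_levels = []
content_div = toy_soup.find("div", id="mw-content-text")
for sect in content_div.find_all(["h1", "h2", "h3"]):
    # sect.name is the tag name, e.g. "h2"; the digit after "h" is the level
    heading_levels.append((sect.get_text(strip=True), int(sect.name[1])))

print(heading_levels)
```

A level column like this could be added to the dataframe to tell top-level sections from subsections.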

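### Observe new segments
### (For this page the result matches the h2-only run, which suggests the content div has no h1 or h3 headings)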

In [27]:
section_frame_all_hs

Unnamed: 0,section,start,end,text_content
0,No Section,0,436,"Serbian writer, historian, and priestPavle StamatovićПавле СтаматовићBorn(1805-04-11)11 April 1805Jakovo,Military FrontierDied14 September 1864(1864-09-14)(aged 59)Novi Sad,Austrian EmpireAlma materRoyal University of PestPavle Stamatović(11 April 1805 – 14 September 1864) was a Serbian writer, historian, and archpriest. He chaired the delegation ofSouth Slavsat thePrague Slavic Congress, 1848.[1]He was also a member ofMatica Srpska."
1,Biography,437,2134,"Biography[edit]He was born in the town ofJakovoat a time whenSremwas under Habsburg rule. He finished his primary and secondary education in Jakovo,Sremski Karlovci, andBuda. He studied philosophy and theology at Sremski Karlovci andPest. In his last year at theRoyal University of Pest, he became acquainted withLjudevit Gaj, who happened to enroll at the same time when he was in his graduating year. In 1832 he became a monk in Pest and later, as a parish priest of theChurch of St. Nicholas, Szeged, from 1834 to 1844, he edited and published a Serbian almanacSrbska pčela(Serbian Bee) which had a significant circulation. He was transferred from his parish in Szeged toNovi Sad. There he continued to publish his almanac and joinedMatica Srpska. He was elected president ofMatica Srpskain 1831 and served as the editor ofLetopis Matice Srpske1831–32.[2]Stamatović was a corresponding member of the Society of Serbian Letters from 11 June 1842 (now part of theSerbian Academy of Sciences and Arts).[3]At the SzegedLycéehe initiated a student society calledMlado Jedinjepromoting the study of Slavic languages and works of literature; published an almanac,Srbska pčela ili novi cvetnikfor more than a decade (1830–1841);[4]composed an ode to Slavic unity and brotherhood (Slava slavenska u Evropi, 1837); translated from Polish the monumental""Historya prawodawstw slowianskich""(History of Slavic Legislation) byWacław Maciejowski; and the pioneering work""Prawda ruska""by Ignacy Benedikt Rakowiecki (1783–1839).[5]He translated many Russian, Polish and Czech articles on Slavic affairs.[citation needed]He also wrote a book calledMladyj Serbljin u vsemirnom carstvu, published in Buda in 1834.[6]"
2,References,2135,3141,"References[edit]^Maxwell, Alexander (15 October 2009).Choosing Slovakia: Slavic Hungary, the Czechoslovak Language and Accidental Nationalism. I. B. Tauris.ISBN9781848850743– via Google Books.^d.o.o, cubes.""190 godina Matice srpske - Vreme"".vreme.com/(in Serbian). Retrieved2024-08-24.^""Stamatovic Pavle"".www.sanu.ac.rs. Retrieved2024-08-24.^""Pavle Stamatović - Knjige po autorima : Antikvarne-knjige.com"".Antikvarneknjige.com.^""MDZ-Reader - Band - Prawda ruska czyli Prawa Wielkiego Xięcia Jarosława Władymirowicza / Rakowiecki, Ignacy Benedikt - Prawda ruska czyli Prawa Wielkiego Xięcia Jarosława Władymirowicza / Rakowiecki, Ignacy Benedikt"".reader.digitale-sammlungen.de.^Göpfert, Herbert Georg; Koziełek, Gerard; Wittmann, Reinhard (12 December 1977).Buch- und Verlagswesen im 18. und 19. Jahrhundert. Camen.ISBN9783921515020– via Google Books.Authority control databasesInternationalISNIVIAFNationalCroatiaRetrieved from ""https://en.wikipedia.org/w/index.php?title=Pavle_Stamatović&oldid=1297381790"""
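
A quick sanity check on the offsets is that the sections tile the text without gaps or overlaps: the first `start` should be 0, each later `start` should equal the previous `end` plus one, and the last `end` should equal the text length. A minimal sketch using the offsets shown above follows; the function name `offsets_are_contiguous` is hypothetical.

```python
def offsets_are_contiguous(starts, ends, text_length):
    """Return True if the (start, end) pairs cover the text with no gaps or overlaps."""
    if not starts or len(starts) != len(ends):
        return False
    if starts[0] != 0 or ends[-1] != text_length:
        return False
    # Every section must begin right after the previous one ends
    return all(next_start == end + 1 for end, next_start in zip(ends, starts[1:]))


# Offsets copied from the output above; the last end is the text length
starts = [0, 437, 2135]
ends = [436, 2134, 3141]
print(offsets_are_contiguous(starts, ends, text_length=3141))
```

On the dataframe, the same check could be run with `list(section_frame_all_hs["start"])` and `list(section_frame_all_hs["end"])`.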
