# Notebook to Embed and Cluster Text Segments

## Install and Import Dependencies

In [1]:
# Uncomment the line below to install the dependencies
# !pip install -qU numpy pandas scikit-learn torch sentence-transformers wtpsplit datasets

In [2]:
import numpy as np
from datasets import load_dataset

from utils import get_device, TextPreprocessor, SegmenterEmbedder, BiMapping

## Load Data

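The text comes from the [ubaada/booksum-complete-cleaned](https://huggingface.co/datasets/ubaada/booksum-complete-cleaned) dataset on the Hugging Face Hub. The cell below loads its `books` configuration and prints the first 500 characters of the first book in the `train` split.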

In [3]:
dataset_checkpoint = 'ubaada/booksum-complete-cleaned'
book_data = load_dataset(dataset_checkpoint, 'books')
print(book_data['train'][0]['text'][:500])

BOOK I.


    Of Mans First Disobedience, and the Fruit
  Of that Forbidden Tree, whose mortal tast
  Brought Death into the World, and all our woe,
  With loss of EDEN, till one greater Man
  Restore us, and regain the blissful Seat,
  Sing Heav'nly Muse, that on the secret top
  Of OREB, or of SINAI, didst inspire
  That Shepherd, who first taught the chosen Seed,
  In the Beginning how the Heav'ns and Earth
  Rose out of CHAOS: Or if SION Hill
  Delight thee more, and SILOA'S Brook that flow'


## Text Preprocessing

In [4]:
preprocessor = TextPreprocessor()

In [5]:
preprocessed_text = preprocessor(book_data['train'][0]['text'])
print(preprocessed_text[:1000])

BOOK I.

Of Mans First Disobedience, and the FruitOf that Forbidden Tree, whose mortal tastBrought Death into the World, and all our woe,With loss of EDEN, till one greater ManRestore us, and regain the blissful Seat,Sing Heav'nly Muse, that on the secret topOf OREB, or of SINAI, didst inspireThat Shepherd, who first taught the chosen Seed,In the Beginning how the Heav'ns and EarthRose out of CHAOS: Or if SION HillDelight thee more, and SILOA'S Brook that flow'dFast by the Oracle of God; I thenceInvoke thy aid to my adventrous Song,That with no middle flight intends to soarAbove th'AONIAN Mount, while it pursuesThings unattempted yet in Prose or Rhime.And chiefly Thou O Spirit, that dost preferBefore all Temples th' upright heart and pure,Instruct me, for Thou know'st; Thou from the firstWast present, and with mighty wings outspreadDove-like satst brooding on the vast AbyssAnd mad'st it pregnant: What in me is darkIllumine, what is low raise and support;That to the highth of this great
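In the output above, wrapped verse lines are joined without a space between them (for example `FruitOf` and `tastBrought`). The source of `TextPreprocessor` is not shown in this notebook, so the following is only a minimal sketch of one way to flatten wrapped lines while keeping word boundaries. The function name `normalize_whitespace` is hypothetical and is not part of `utils`.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Flatten wrapped lines inside each paragraph, keeping paragraph breaks."""
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # Collapse every whitespace run (including single newlines and
        # indentation) into one space so adjacent words stay separated.
        flat = re.sub(r"\s+", " ", paragraph).strip()
        if flat:
            cleaned.append(flat)
    return "\n\n".join(cleaned)


sample = (
    "BOOK I.\n\n\n"
    "    Of Mans First Disobedience, and the Fruit\n"
    "  Of that Forbidden Tree"
)
result = normalize_whitespace(sample)
print(result)
```

Whether a space or a newline is the better separator depends on what the sentence segmenter downstream expects; verse like this has no sentence-final punctuation at most line ends, so the choice can change where segments are cut.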

## Segmentation and Embedding

In [6]:
device = get_device()
segmenter_embedder = SegmenterEmbedder(device=device)
segmenter_embedder.device

'mps'
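The `'mps'` value means the Apple Metal (MPS) backend was selected on this machine. `get_device` is imported from the local `utils` module and its source is not shown here. As an assumption about what such a helper could do, the sketch below prefers CUDA, then MPS, then CPU. The name `pick_device` is hypothetical, and the fallback to `'cpu'` when PyTorch is not installed is an illustrative choice.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu', in that order of preference."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to select.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # The MPS backend targets Apple-silicon GPUs; older PyTorch builds
    # may not expose it, hence the defensive getattr.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"

    return "cpu"


chosen = pick_device()
print(chosen)
```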

In [7]:
segments, embeddings = segmenter_embedder(preprocessed_text)
len(segments), embeddings.shape

(1055, (1055, 768))

## Create a Bidirectional Map from Text to Embedding

In [8]:
mapping = BiMapping(segments, embeddings)

In [9]:
segment = segments[0]
embedding = embeddings[0]
segment == mapping[embedding], np.all(embedding == mapping[segment])

(True, np.True_)
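The check above confirms that a segment can be looked up from its embedding and the embedding from its segment. `BiMapping` lives in the local `utils` module and its implementation is not shown. NumPy arrays are not hashable, so one plausible approach (an assumption, not necessarily what `BiMapping` does) is to key the reverse dictionary on the raw bytes of each vector. The class name `TextEmbeddingMap` below is hypothetical.

```python
import numpy as np


class TextEmbeddingMap:
    """Two-way lookup between text segments and their embedding vectors."""

    def __init__(self, texts, vectors):
        vectors = np.asarray(vectors)
        if len(texts) != len(vectors):
            raise ValueError("texts and vectors must have the same length")
        self._text_to_vec = {}
        self._vec_to_text = {}
        for text, vec in zip(texts, vectors):
            # Duplicate texts or vectors overwrite earlier entries.
            self._text_to_vec[text] = vec
            # Arrays are unhashable, so use their raw bytes as the key.
            self._vec_to_text[vec.tobytes()] = text

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._text_to_vec[key]
        # The lookup only succeeds for a vector with the same dtype and
        # exactly the same values as the one that was stored.
        return self._vec_to_text[np.asarray(key).tobytes()]


texts = ["first segment", "second segment"]
vectors = np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)
lookup = TextEmbeddingMap(texts, vectors)

print(lookup[vectors[1]])
print(np.array_equal(lookup["first segment"], vectors[0]))
```

Because the byte key requires an exact match, this kind of map suits retrieving the text for a stored vector (for example, a cluster member) and not for nearest-neighbour queries on new vectors.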