We will implement a text tiling algorithm that segments a text into topically coherent sections. The algorithm consists of the following steps:
1. Split the text into sentences
2. Clean the sentences by removing stopwords and punctuation
3. Build a frequency table for the words in the sentences
    - The x-axis is the sentence number
    - The y-axis is each word
    - Each cell is the frequency of the word in the sentence
4. Initialize blocks of text with a fixed size
5. Iterate until the blocks no longer change
    - Calculate the intra-group cohesion for each block
    - Find the blocks of text where the topic changes (the cohesion drops in these blocks)
    - Move block boundaries to the topic change points
6. Return the blocks of text
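
Steps 4 to 6 can be sketched on toy data as follows. This is only one possible reading of the outline: cohesion is approximated here by the cosine similarity between the word counts on either side of each sentence gap, and the helper names (`cosine`, `gap_scores`, `refine_boundaries`) and parameters (`window`, `block_size`, `reach`) are hypothetical, not taken from the notebook.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(freqs, window=2):
    """Similarity across each gap g, i.e. between sentences g and g + 1.

    Up to `window` sentences on each side of the gap are merged before comparing.
    """
    scores = []
    for gap in range(len(freqs) - 1):
        left = sum(freqs[max(0, gap - window + 1): gap + 1], Counter())
        right = sum(freqs[gap + 1: gap + 1 + window], Counter())
        scores.append(cosine(left, right))
    return scores


def refine_boundaries(scores, block_size=3, reach=1, max_iter=50):
    """Start from fixed-size blocks, then slide every boundary to the
    lowest-similarity gap within `reach` positions until nothing moves.
    Boundaries that land on the same gap are merged (an assumption of this sketch).
    """
    boundaries = list(range(block_size - 1, len(scores), block_size))
    for _ in range(max_iter):
        moved = []
        for b in boundaries:
            lo, hi = max(0, b - reach), min(len(scores) - 1, b + reach)
            moved.append(min(range(lo, hi + 1), key=lambda g: scores[g]))
        moved = sorted(set(moved))
        if moved == boundaries:
            break
        boundaries = moved
    return boundaries


# toy input: three sentences about cats followed by three about markets
freqs = [
    Counter(['cat', 'purr']),
    Counter(['cat', 'sleep']),
    Counter(['cat', 'purr', 'sleep']),
    Counter(['stock', 'market']),
    Counter(['market', 'price']),
    Counter(['stock', 'price']),
]
scores = gap_scores(freqs)
boundaries = refine_boundaries(scores, block_size=2)
print(boundaries)  # [2]: the split falls after the third sentence
```

A boundary value `g` means "cut after sentence `g`", so the toy text ends up as sentences 0 to 2 and sentences 3 to 5. With real text the `window` and `reach` values would need tuning.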

In [7]:
import nltk
import numpy as np
from collections import Counter

In [8]:
# retrieve text from res/mixed_texts.txt
with open('res/mixed_texts.txt', 'r') as file:
    text = file.read()

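# split the text into sentences
# (needs the NLTK Punkt tokenizer data, e.g. nltk.download('punkt'),
#  or 'punkt_tab' on recent NLTK releases)
sentences = nltk.sent_tokenize(text)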

# print the first 5 sentences
sentences[:5]

['In order to get to punch his badge at 08:30, 16 years before Fantozzi began setting alarm clock at 06:15.',
 'Today, after continuous experiments and improvements, he manage to set it at 07:51: the limit of humanly possibilities.',
 "Everything is calculated on the edge of seconds: 5 seconds to regain consciousness; 4 seconds to overcome impact of seeing his wife, and 6 more seconds to ask himself -as always with any plausible answer- whatever pushed him to marry that kind of curious pet; 3 seconds to drink Mrs Fantozzi's coffee: 3000 Fahreneit Degrees!",
 'From 8 to 10 seconds to cool down his burned tongue... 2.5 seconds to kiss his daughter Mariangela; brioche and Latte meanwhile hair brushing, brushing coffee-flavoured teeth with minty toothpaste, resulting in an instantaneous bowel movement... all of this performed in 6 seconds, a European Record!',
 'He still has a 3-minute fortune to get dressed and run to the bus stop to catch the 08:01 bus.']

In [9]:
# remove stopwords and punctuation
from nltk.corpus import stopwords
from string import punctuation

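# get the list of English stopwords (needs the NLTK 'stopwords' corpus)
stopwords_list = stopwords.words('english')

# lowercase each token and drop stopwords and punctuation;
# `word not in punctuation` is a substring test, so it drops single
# punctuation characters but keeps multi-character tokens such as '...' or "'s"
cleaned_sentences = []
for sentence in sentences:
    cleaned_sentence = [
        word.lower()
        for word in nltk.word_tokenize(sentence)
        if word.lower() not in stopwords_list and word not in punctuation
    ]
    cleaned_sentences.append(cleaned_sentence)

# show the first 5 cleaned sentences
cleaned_sentences[:5]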

[['order',
  'get',
  'punch',
  'badge',
  '08:30',
  '16',
  'years',
  'fantozzi',
  'began',
  'setting',
  'alarm',
  'clock',
  '06:15'],
 ['today',
  'continuous',
  'experiments',
  'improvements',
  'manage',
  'set',
  '07:51',
  'limit',
  'humanly',
  'possibilities'],
 ['everything',
  'calculated',
  'edge',
  'seconds',
  '5',
  'seconds',
  'regain',
  'consciousness',
  '4',
  'seconds',
  'overcome',
  'impact',
  'seeing',
  'wife',
  '6',
  'seconds',
  'ask',
  '-as',
  'always',
  'plausible',
  'answer-',
  'whatever',
  'pushed',
  'marry',
  'kind',
  'curious',
  'pet',
  '3',
  'seconds',
  'drink',
  'mrs',
  'fantozzi',
  "'s",
  'coffee',
  '3000',
  'fahreneit',
  'degrees'],
 ['8',
  '10',
  'seconds',
  'cool',
  'burned',
  'tongue',
  '...',
  '2.5',
  'seconds',
  'kiss',
  'daughter',
  'mariangela',
  'brioche',
  'latte',
  'meanwhile',
  'hair',
  'brushing',
  'brushing',
  'coffee-flavoured',
  'teeth',
  'minty',
  'toothpaste',
  'resulting',
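
Token lists like the ones above can be arranged into the table described in step 3, with one column per sentence and one row per word. The following is a minimal sketch on toy data; the function name `frequency_matrix` and the alphabetical row order are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter


def frequency_matrix(tokenized_sentences):
    """Return (vocabulary, matrix) where matrix[i][j] is the number of times
    vocabulary[i] occurs in sentence j."""
    vocabulary = sorted({word for sentence in tokenized_sentences for word in sentence})
    row_of = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(tokenized_sentences) for _ in vocabulary]
    for col, sentence in enumerate(tokenized_sentences):
        for word, count in Counter(sentence).items():
            matrix[row_of[word]][col] = count
    return vocabulary, matrix


# toy input standing in for cleaned_sentences
toy_sentences = [['seconds', 'regain', 'seconds'], ['seconds', 'kiss', 'daughter']]
vocabulary, matrix = frequency_matrix(toy_sentences)
for word, row in zip(vocabulary, matrix):
    print(word, row)
```

The nested list can be wrapped in `np.array(matrix)` when vectorised operations over rows or columns are needed.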

In [10]:
# create a frequency table: one Counter of word counts per sentence
frequency_table = [Counter(sentence) for sentence in cleaned_sentences]

# show the frequency table for the first 5 sentences
frequency_table[:5]

[Counter({'order': 1,
          'get': 1,
          'punch': 1,
          'badge': 1,
          '08:30': 1,
          '16': 1,
          'years': 1,
          'fantozzi': 1,
          'began': 1,
          'setting': 1,
          'alarm': 1,
          'clock': 1,
          '06:15': 1}),
 Counter({'today': 1,
          'continuous': 1,
          'experiments': 1,
          'improvements': 1,
          'manage': 1,
          'set': 1,
          '07:51': 1,
          'limit': 1,
          'humanly': 1,
          'possibilities': 1}),
 Counter({'seconds': 5,
          'everything': 1,
          'calculated': 1,
          'edge': 1,
          '5': 1,
          'regain': 1,
          'consciousness': 1,
          '4': 1,
          'overcome': 1,
          'impact': 1,
          'seeing': 1,
          'wife': 1,
          '6': 1,
          'ask': 1,
          '-as': 1,
          'always': 1,
          'plausible': 1,
          'answer-': 1,
          'whatever': 1,
          'pushed': 1,
    