# Splitting the text into smaller, more meaningful chunks

One key task in dealing with text is to split it up into smaller, more meaningful chunks that can be processed.

Text segmentation and tokenisation are the processes of dividing written text into meaningful units called tokens, such as words, keywords, phrases, symbols and sentences.

For this exercise we will look at the two simplest cases to understand: line segmentation and word tokenisation.

For simplicity, let's start by tokenising a text into individual words. Let's take a body of text:

In [28]:
text_o = """Hamster, turnin' round in your wheel
I've got something to tell you:
I can  harness your  feel
Dynamo for electricity
You empower my feelings
Give me light for me to see
(Stavros:)
So won't you give   me some food?
I really need to get some energy in me
Give me some
I really need to get some energy in me
Give me some
The world is turning, kicking, screaming 'round my little head
To spread it on my bread
Don't you know that I spread it on my bread?
(Bunf:)
Rest yourself you've been long on your feet
If I buy you a matchbox
Will you keep it all neat?
Radiate, you're no flash in the pan
You're the battery incarnate
I'm an innocent man
(Stavros:)
So won't you  give me some food?
I really need to  get some energy in me
Give me some
I really need to get some energy in me
Give me some
The world is turning, kicking,   screaming 'round my little head
To spread it on my bread
Don't you know that     I spread it on my bread?"""

Let's split this up into tasks:

1. remove any punctuation
2. we have some text in brackets - let's remove it
3. remove any excess whitespace
4. split into a list by whitespace

We have the start of a process here: several simple steps that can be run in sequence to achieve our goal. This is often referred to as a pipeline.
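One way to picture a pipeline is as a list of small functions, each applied to the output of the one before. The sketch below is only an illustration: the function names and the short `sample` string are made up for this example, and the regular expressions mirror the ones used in this exercise.

```python
import re

def remove_punctuation(text):
    # Strip the punctuation characters targeted in this exercise.
    return re.sub(r"[.,?:']", "", text)

def remove_bracketed(text):
    # Drop everything from an opening bracket up to the next closing bracket.
    return re.sub(r"\([^)]*\)", "", text)

def collapse_spaces(text):
    # Replace any run of two or more spaces with a single space.
    return re.sub(r"  +", " ", text)

def run_pipeline(text, steps):
    # Apply each step in order, feeding the result into the next step.
    for step in steps:
        text = step(text)
    return text

# Hypothetical sample: a bracketed label followed by one line of the lyrics.
steps = [remove_punctuation, remove_bracketed, collapse_spaces]
sample = "(Bunf:)\nSo won't you   give me some food?"
cleaned = run_pipeline(sample, steps)
print(cleaned.split())
```

Keeping each step as its own function makes it easy to reorder, remove or add steps without touching the rest of the pipeline.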

In [35]:
import re

# remove any punctuation in the text
text_processed = re.sub(r'[.,?:\']', "", text_o)

# remove the words in brackets: match an opening bracket,
# then any run of characters other than a closing bracket,
# then the closing bracket itself
# (this leaves an empty line where each bracketed label stood)
text_processed = re.sub(r'\([^)]*\)', "", text_processed)

# replace any run of two or more spaces with a single space
text_processed = re.sub(r'  +', " ", text_processed)

print(text_processed)

Hamster turnin round in your wheel
Ive got something to tell you
I can harness your feel
Dynamo for electricity
You empower my feelings
Give me light for me to see

So wont you give me some food
I really need to get some energy in me
Give me some
I really need to get some energy in me
Give me some
The world is turning kicking screaming round my little head
To spread it on my bread
Dont you know that I spread it on my bread

Rest yourself youve been long on your feet
If I buy you a matchbox
Will you keep it all neat
Radiate youre no flash in the pan
Youre the battery incarnate
Im an innocent man

So wont you give me some food
I really need to get some energy in me
Give me some
I really need to get some energy in me
Give me some
The world is turning kicking screaming round my little head
To spread it on my bread
Dont you know that I spread it on my bread


In [36]:
# replace each newline character with a single space
textn = re.sub(r'\n', " ", text_processed)
# collapse the double spaces this creates, for example where empty lines were joined
textn = re.sub(r'  +', " ", textn)
print(textn)

#split on the spaces
tokens = textn.split(" ")
tokens

Hamster turnin round in your wheel Ive got something to tell you I can harness your feel Dynamo for electricity You empower my feelings Give me light for me to see So wont you give me some food I really need to get some energy in me Give me some I really need to get some energy in me Give me some The world is turning kicking screaming round my little head To spread it on my bread Dont you know that I spread it on my bread Rest yourself youve been long on your feet If I buy you a matchbox Will you keep it all neat Radiate youre no flash in the pan Youre the battery incarnate Im an innocent man So wont you give me some food I really need to get some energy in me Give me some I really need to get some energy in me Give me some The world is turning kicking screaming round my little head To spread it on my bread Dont you know that I spread it on my bread


['Hamster',
 'turnin',
 'round',
 'in',
 'your',
 'wheel',
 'Ive',
 'got',
 'something',
 'to',
 'tell',
 'you',
 'I',
 'can',
 'harness',
 'your',
 'feel',
 'Dynamo',
 'for',
 'electricity',
 'You',
 'empower',
 'my',
 'feelings',
 'Give',
 'me',
 'light',
 'for',
 'me',
 'to',
 'see',
 'So',
 'wont',
 'you',
 'give',
 'me',
 'some',
 'food',
 'I',
 'really',
 'need',
 'to',
 'get',
 'some',
 'energy',
 'in',
 'me',
 'Give',
 'me',
 'some',
 'I',
 'really',
 'need',
 'to',
 'get',
 'some',
 'energy',
 'in',
 'me',
 'Give',
 'me',
 'some',
 'The',
 'world',
 'is',
 'turning',
 'kicking',
 'screaming',
 'round',
 'my',
 'little',
 'head',
 'To',
 'spread',
 'it',
 'on',
 'my',
 'bread',
 'Dont',
 'you',
 'know',
 'that',
 'I',
 'spread',
 'it',
 'on',
 'my',
 'bread',
 'Rest',
 'yourself',
 'youve',
 'been',
 'long',
 'on',
 'your',
 'feet',
 'If',
 'I',
 'buy',
 'you',
 'a',
 'matchbox',
 'Will',
 'you',
 'keep',
 'it',
 'all',
 'neat',
 'Radiate',
 'youre',
 'no',
 'flash',
 'in',
 't
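The newline replacement and the second whitespace clean-up are needed because `split(" ")` splits on every single space and nothing else. As an aside, Python's built-in `str.split()` called with no argument treats any run of whitespace, including newlines, as one separator, so it could stand in for those steps. A small sketch, using a shortened piece of the lyrics as a made-up sample:

```python
raw = "So won't you   give me some food?\nI really need to  get some energy in me"

# Splitting on a single space keeps empty strings wherever spaces repeat,
# and leaves the newline stuck inside a token.
space_split = raw.split(" ")
print(space_split)

# With no argument, split() treats any run of whitespace as one separator.
tokens = raw.split()
print(tokens)
```

Punctuation is untouched here, so it would still need to be removed in an earlier step.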

To perform line segmentation we simply split the text up by newline. We first remove any punctuation, excess whitespace and words in brackets; the processed text output from earlier (`text_processed`) already has this done.

In [37]:
segments = text_processed.split('\n')
print(segments)

['Hamster turnin round in your wheel', 'Ive got something to tell you', 'I can harness your feel', 'Dynamo for electricity', 'You empower my feelings', 'Give me light for me to see', '', 'So wont you give me some food', 'I really need to get some energy in me', 'Give me some', 'I really need to get some energy in me', 'Give me some', 'The world is turning kicking screaming round my little head', 'To spread it on my bread', 'Dont you know that I spread it on my bread', '', 'Rest yourself youve been long on your feet', 'If I buy you a matchbox', 'Will you keep it all neat', 'Radiate youre no flash in the pan', 'Youre the battery incarnate', 'Im an innocent man', '', 'So wont you give me some food', 'I really need to get some energy in me', 'Give me some', 'I really need to get some energy in me', 'Give me some', 'The world is turning kicking screaming round my little head', 'To spread it on my bread', 'Dont you know that I spread it on my bread']
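The empty strings in the output above come from the lines where the bracketed labels were removed. If they are not wanted, one possible follow-up step is to filter them out, and each remaining line can then be tokenised on its own. This is a sketch on a short made-up sample (`sample_text`) rather than the full lyrics, and `clean_segments` and `tokens_per_line` are names chosen for illustration.

```python
# Hypothetical sample in the same shape as the processed text,
# including an empty line left behind by a removed label.
sample_text = (
    "Give me light for me to see\n"
    "\n"
    "So wont you give me some food\n"
    "Give me some"
)

# Split on newlines, keeping only lines that contain something
# other than whitespace.
clean_segments = [
    line.strip() for line in sample_text.split("\n") if line.strip()
]
print(clean_segments)

# Tokenise each segment, giving one list of words per line.
tokens_per_line = [segment.split() for segment in clean_segments]
print(tokens_per_line)
```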


You can see that we can start to build up lots of small tasks into pipelines to get the text into a format ready for subsequent analysis. Here we have simply split into words and lines, but you can also start splitting into phrases to try to understand the context and meaning of the text in more detail.
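What counts as a phrase depends on the analysis. One simple reading, assumed here purely for illustration, is a run of n consecutive words, often called an n-gram. The helper `ngrams` below is a hypothetical function written for this sketch, not part of the exercise.

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list,
    # collecting each window as a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# One line of the processed lyrics, split into words.
sample_tokens = "I really need to get some energy in me".split()

# Pairs of neighbouring words (n = 2).
bigrams = ngrams(sample_tokens, 2)
print(bigrams)
```

If the text has fewer than n tokens, the function returns an empty list.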