# Splitting the text into smaller, more meaningful chunks

One key task in dealing with text is to split it up into smaller, more meaningful chunks that can be processed. 

Text segmentation and tokenisation are the key tools for this. Both divide written text into meaningful units called tokens, such as words, keywords, phrases, symbols and sentences.

In this exercise we will look at two of the simplest forms to understand: line segmentation and word tokenisation.

For simplicity, let's start by tokenising a text into individual words.

Let's take a body of text:

In [None]:
text_o = """Hamster, turnin' round in your wheel
I've got something to tell you:
I can  harness your  feel
Dynamo for electricity
You empower my feelings
Give me light for me to see
(Stavros:)
So won't you give   me some food?
I really need to get some energy in me
Give me some
I really need to get some energy in me
Give me some
The world is turning, kicking, screaming 'round my little head
To spread it on my bread
Don't you know that I spread it on my bread?
(Bunf:)
Rest yourself you've been long on your feet
If I buy you a matchbox
Will you keep it all neat?
Radiate, you're no flash in the pan
You're the battery incarnate
I'm an innocent man
(Stavros:)
So won't you  give me some food?
I really need to  get some energy in me
Give me some
I really need to get some energy in me
Give me some
The world is turning, kicking,   screaming 'round my little head
To spread it on my bread
Don't you know that     I spread it on my bread?"""

## Word Tokenisation

### Breaking down the job - creating a process pipeline

Let's split this up into tasks:

1. remove any punctuation
2. remove the text in brackets
3. remove any excess whitespace
4. split into a list by whitespace

We have the start of a process here: several simple steps that can be run in sequence to achieve our goal. This is often referred to as a **pipeline**.
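
One way to make the idea of a pipeline concrete is to write each step as a small function and apply the functions in order. The sketch below is illustrative only: the function names (`remove_punctuation`, `remove_bracketed`, `collapse_spaces`, `run_pipeline`) and the short sample string are assumptions made for this example, not part of the exercise itself.

```python
import re

def remove_punctuation(text):
    # Strip a small, fixed set of punctuation marks (including apostrophes)
    return re.sub(r"[.,?:']", "", text)

def remove_bracketed(text):
    # Remove anything enclosed in round brackets, e.g. "(Bunf)"
    return re.sub(r"\([^)]*\)", "", text)

def collapse_spaces(text):
    # Replace runs of two or more spaces with a single space
    return re.sub(r"  +", " ", text)

def run_pipeline(text, steps):
    # Apply each step in turn, feeding the output of one into the next
    for step in steps:
        text = step(text)
    return text

# Hypothetical sample, shortened from the lyrics for illustration
steps = [remove_punctuation, remove_bracketed, collapse_spaces]
sample = "(Bunf:)\nRest  yourself, you've been long on your feet"
result = run_pipeline(sample, steps)
print(result)
```

Keeping the steps in a list makes it easy to reorder them, drop one, or add another without touching the rest of the code.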

In [None]:
import re

# remove any punctuation in the text
# (note: this also strips apostrophes, so "I've" becomes "Ive")
text_processed = re.sub(r'[.,?:\']', "", text_o)

# remove the words in brackets - match an opening bracket,
# then any run of characters other than a closing bracket,
# then the closing bracket itself
text_processed = re.sub(r'\([^)]*\)', "", text_processed)

# remove any excess whitespace (runs of two or more spaces)
text_processed = re.sub(r'  +', " ", text_processed)

print(text_processed)

### Split into word tokens

In [None]:
# replace any newline character with a single space
textn = re.sub(r'\n', " ", text_processed)
# collapse any double spaces this creates (e.g. where an empty line was)
textn = re.sub(r'  +', " ", textn)

#split on the spaces
tokens = textn.split(" ")
tokens
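
As an alternative sketch, Python's built-in `str.split()` called with no arguments splits on any run of whitespace, including newlines, so the newline replacement and the space collapsing can be folded into a single call. The sample string and the name `sample_tokens` below are made up for illustration.

```python
# Hypothetical sample containing extra spaces and a newline
sample = "Give me  some\nI really   need to get some energy in me"

# With no separator, split() treats any run of spaces, tabs or newlines
# as one delimiter and ignores leading and trailing whitespace
sample_tokens = sample.split()
print(sample_tokens)
```

By contrast, `split(" ")` returns an empty string for every extra space, which is why the cell above collapses the whitespace before splitting.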

## Line Segmentation

To perform line segmentation we simply split the text on the newline character '\n'. We can first remove any punctuation, excess whitespace and words in brackets. The processed text output from earlier has already had this done, so we can reuse it here.

In [None]:
segments = text_processed.split('\n')
print(segments)
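
Because the bracketed names sat on lines of their own, removing them leaves empty lines behind, and these show up as empty strings in `segments`. A minimal sketch of one way to drop them, using a short made-up sample (`sample_processed`) rather than the full lyrics:

```python
# Hypothetical sample: an empty line where a bracketed name was removed,
# plus a stray leading space on the last line
sample_processed = (
    "Give me light for me to see\n"
    "\n"
    "So wont you give me some food\n"
    " I really need to get some energy in me"
)

# splitlines() splits on line boundaries; strip() trims spaces at both ends;
# the `if` clause discards lines that are empty after trimming
clean_segments = [line.strip() for line in sample_processed.splitlines() if line.strip()]
print(clean_segments)
```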

## Taking it further

As you can see, lots of small tasks can be built up into pipelines that get the text into a format ready for subsequent analysis. Here we have simply split the text into words and lines, but you can also split it into phrases to try to understand its context and meaning in more detail.
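
As a hedged illustration of a first step towards phrases, adjacent word tokens can be paired into two-word sequences (often called bigrams). This is only one simple approach among many, and the `sample_tokens` list and the `bigrams` name are assumptions for this example.

```python
# Hypothetical list of word tokens
sample_tokens = ["give", "me", "some", "food"]

# Pair each token with the one that follows it
bigrams = list(zip(sample_tokens, sample_tokens[1:]))
print(bigrams)
```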