**Tokenization** with the NLTK library
*   **nltk**: the Natural Language Toolkit, a Python library for text processing.
*   **punkt**: an NLTK resource with pre-trained models for splitting text into sentences; **word_tokenize()** relies on it as well.
*   **word_tokenize()**: splits a string into word and punctuation tokens.




In [None]:
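import nltk
# Download the Punkt models used by the tokenizers below.
# Note: some newer NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')
from nltk.tokenize import word_tokenize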

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**Ex.1** Word tokenization

In [None]:
word_tokenize("Hi there!")

['Hi', 'there', '!']

**Ex.2** Word tokenization with contractions and possessives

In [None]:
word_tokenize("I don't like Sam's shoes.")

['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']


*   **sent_tokenize()**: splits a text into individual sentences.



In [None]:
from nltk.tokenize import sent_tokenize

**Ex.3** Sentence tokenization

In [None]:
text = "Hello, world! This is a test. Let's see how it works."
sentences = sent_tokenize(text)
print(sentences)

['Hello, world!', 'This is a test.', "Let's see how it works."]



*   **regexp_tokenize()**: tokenizes text according to a regular expression pattern.



In [None]:
from nltk.tokenize import regexp_tokenize

**Ex.4** Using regexp_tokenize()

In [None]:
text = "Hello, world! This is a test."

# Define a regular expression pattern that captures words and punctuation as separate tokens.
# \w+ matches a run of word characters; [^\w\s]+ matches a run of characters that are
# neither word characters (\w) nor whitespace (\s: spaces, tabs, line breaks).
pattern = r'\w+|[^\w\s]+'

tokens = regexp_tokenize(text, pattern)
print(tokens)

['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
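
With its default settings, `regexp_tokenize()` returns the non-overlapping matches of the pattern, so the result of Ex.4 can be reproduced with the standard library alone. The sketch below assumes that equivalence and uses `re.findall()`; it is meant only to show which half of the pattern produces which kind of token.

```python
import re

text = "Hello, world! This is a test."

# \w+       : a run of word characters (letters, digits, underscore)
# [^\w\s]+  : a run of characters that are neither word characters nor whitespace
pattern = r'\w+|[^\w\s]+'

tokens = re.findall(pattern, text)
print(tokens)
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']

# Because of the '+', consecutive punctuation marks stay together as one token
ellipsis_tokens = re.findall(pattern, "I am, ...")
print(ellipsis_tokens)
# ['I', 'am', ',', '...']
```

Dropping the `+` from `[^\w\s]+` would instead emit each punctuation character as its own token.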



*   **TweetTokenizer()**: a tokenizer designed for tweets and similar social media text. It keeps hashtags, mentions, emoticons, and URLs together as single tokens.


In [None]:
from nltk.tokenize import TweetTokenizer

**Ex.5** Using TweetTokenizer

In [None]:
tweet = "This is a #test tweet! 😊 Check out http://example.com @user #NLTK #Python"
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(tweet)
print(tokens)

['This', 'is', 'a', '#test', 'tweet', '!', '😊', 'Check', 'out', 'http://example.com', '@user', '#NLTK', '#Python']
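
Once hashtags and mentions survive as whole tokens, a common next step is to pull them out. The sketch below is a rough standard-library stand-in, not what `TweetTokenizer` does internally: it uses a hypothetical pattern, `tag_pattern`, on a shortened version of the tweet from Ex.5.

```python
import re

tweet = "This is a #test tweet! Check out http://example.com @user #NLTK #Python"

# Hypothetical pattern for illustration: '#' or '@' followed by word characters
tag_pattern = r'[#@]\w+'

tags = re.findall(tag_pattern, tweet)
print(tags)
# ['#test', '@user', '#NLTK', '#Python']

# Separate the two kinds by their first character
hashtags = [t for t in tags if t.startswith('#')]
mentions = [t for t in tags if t.startswith('@')]
print(hashtags, mentions)
```

A pattern this simple would also pick up the `@` part of an e-mail address, which is one reason a dedicated tokenizer is the safer choice on real data.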


**Let's practice!**

Word tokenization with NLTK

*   In this exercise you will use **word_tokenize** and **sent_tokenize** to split a Python string into words and sentences.
*   Start by importing the **sent_tokenize** and **word_tokenize** functions from **nltk.tokenize**.





In [None]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize



*   Upload the file "scene_one.txt".
*   Read the text file and store its content in **scene_one**.


In [None]:
from google.colab import files

# Upload the file (opens a file picker in Colab)
uploaded = files.upload()

# files.upload() returns a dict mapping file names to bytes; decode the bytes to text
scene_one = uploaded['scene_one.txt'].decode('utf-8')

# Display the content
print(scene_one)

Saving scene_one.txt to scene_one.txt
SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop clop clop] 
SOLDIER #1: Halt!  Who goes there?
ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!
SOLDIER #1: Pull the other one!
ARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.
SOLDIER #1: What?  Ridden on a horse?
ARTHUR: Yes!
SOLDIER #1: You're using coconuts!
ARTHUR: What?
SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.
ARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--
SOLDIER #1: Where'd you get the coconuts?
ARTHUR: We found them.
SOLDIER #1: Found them?  In Mercea?  The coconut's tropical!
ARTHUR: What do you



*   Tokenize all sentences in **scene_one** using the **sent_tokenize()** function.



In [None]:
# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)
print(sentences)

['SCENE 1: [wind] [clop clop clop] \r\nKING ARTHUR: Whoa there!', '[clop clop clop] \r\nSOLDIER #1: Halt!', 'Who goes there?', 'ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.', 'King of the Britons, defeator of the Saxons, sovereign of all England!', 'SOLDIER #1: Pull the other one!', 'ARTHUR: I am, ...  and this is my trusty servant Patsy.', 'We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.', 'I must speak with your lord and master.', 'SOLDIER #1: What?', 'Ridden on a horse?', 'ARTHUR: Yes!', "SOLDIER #1: You're using coconuts!", 'ARTHUR: What?', "SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.", 'ARTHUR: So?', "We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\r\nSOLDIER #1: Where'd you get the coconuts?", 'ARTHUR: We found them.', 'SOLDIER #1: Found them?', 'In Mercea?', "The coconut's tropical!", 'ARTHUR:



*  Tokenize the fourth sentence in **sentences**, which you can access as **sentences[3]**, using the **word_tokenize()** function.



In [None]:
# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])
print(tokenized_sent)

['ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son', 'of', 'Uther', 'Pendragon', ',', 'from', 'the', 'castle', 'of', 'Camelot', '.']




*   Find the unique tokens in the entire scene by using **word_tokenize()** on **scene_one** and then converting it into a set using **set()**.



In [None]:
# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))
print(unique_tokens)

{'am', 'if', '...', 'Wait', 'Well', 'That', 'court', 'Not', 'husk', 'will', 'back', 'to', 'suggesting', ':', 'that', 'in', 'my', 'Oh', 'strand', 'KING', 'Britons', 'must', 'martin', 'the', 'line', '#', 'master', 'seek', 'Yes', 'knights', 'but', 'trusty', 'you', 'times', "n't", 'Ridden', 'south', 'Supposing', 'house', 'beat', 'I', 'simple', 'all', 'SCENE', 'grips', 'Listen', 'What', 'Patsy', 'through', 'So', 'carrying', 'ARTHUR', "'s", '[', 'by', 'sovereign', 'coconuts', 'bangin', 'snows', "'", 'covered', 'In', 'coconut', 'tell', 'anyway', 'plover', 'horse', 'creeper', 'bird', 'ounce', 'may', 'carry', 'right', 'But', 'course', 'Pendragon', 'clop', 'there', 'Will', 'maintain', '?', '!', 'Uther', 'here', 'speak', 'your', 'an', 'SOLDIER', 'son', 'this', 'non-migratory', 'order', 'migrate', 'Halt', 'temperate', 'them', 'They', 'our', 'No', 'European', "'d", 'length', 'zone', 'needs', 'just', 'You', 'halves', 'yet', 'pound', 'from', 'dorsal', 'and', 'have', 'on', 'is', 'since', 'these', 'fea
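
The printed set is unordered and case-sensitive, which is why both 'In' and 'in' appear in it. The sketch below illustrates the same token-to-set step on two lines of the scene. To stay self-contained it assumes a simple regex tokenizer in place of `word_tokenize()`, so the exact tokens may differ from NLTK's.

```python
import re

line = "Found them? In Mercea? We found them."

# Stand-in tokenizer (an assumption, not NLTK): words and punctuation runs
tokens = re.findall(r'\w+|[^\w\s]+', line)

# A set keeps one copy of each distinct token
unique_tokens = set(tokens)
print(len(tokens), len(unique_tokens))

# Sets have no fixed order, so sort before printing for a stable display
print(sorted(unique_tokens))

# Lowercasing first merges 'Found' and 'found' into a single entry
unique_lower = {t.lower() for t in tokens}
print(len(unique_lower))
```

Whether to lowercase before building the set depends on the task: for vocabulary counts it usually helps, while for spotting speaker names such as ARTHUR the original case carries information.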



*   Use **re.search()** to search for the first occurrence of the word **"coconuts"** in **scene_one**. Store the result in **match**.



In [None]:
import re

In [None]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)
# Print the start and end indexes of match
print(match.start(), match.end())

588 596
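
The two numbers are character offsets into the string: `start()` is the index of the first matched character and `end()` is one past the last, so the pair can be used directly as a slice. A small sketch on a single line of the scene (the variable name `line` is chosen here for illustration):

```python
import re

line = "SOLDIER #1: You're using coconuts!"

# re.search scans the whole string and returns the first match, or None
match = re.search("coconuts", line)
print(match.start(), match.end())
# 25 33

# Slicing with the two offsets recovers the matched text
print(line[match.start():match.end()])
# coconuts

# When nothing matches, re.search returns None, so check before calling .start()
missing = re.search("swallow", line)
print(missing is None)
```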




*   Write a regular expression called **pattern1** to find anything in square brackets.



In [None]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
re.search(pattern1, scene_one)

<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
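
The match above spans both bracketed groups on the first line, which is the behavior of a greedy quantifier: `.*` runs to the last `]` on the line rather than stopping at the first. The sketch below contrasts that with a non-greedy variant on the first line of the scene; the names `greedy` and `lazy` are illustrative only.

```python
import re

line = "SCENE 1: [wind] [clop clop clop] "

# Greedy: .* extends as far as possible, up to the last ']' on the line
greedy = re.search(r"\[.*\]", line)
print(greedy.span(), greedy.group())
# (9, 32) [wind] [clop clop clop]

# Non-greedy: .*? stops at the first ']' it can
lazy = re.search(r"\[.*?\]", line)
print(lazy.span(), lazy.group())
# (9, 15) [wind]

# With findall, the non-greedy form returns each bracketed group separately
groups = re.findall(r"\[.*?\]", line)
print(groups)
# ['[wind]', '[clop clop clop]']
```

Both forms answer "find anything in square brackets"; which one is appropriate depends on whether adjacent groups should be treated as one span or as separate items.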




*   Create a pattern that matches the script notation (e.g. **Character**:) and assign it to **pattern2**. The pattern needs to match any word characters or spaces that precede the colon (such as the space within **SOLDIER #1**:).
*   Use **re.match()** with your new pattern to find and print the script notation in the fourth sentence. The tokenized sentences are available in your namespace as **sentences**.



In [None]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
re.match(pattern2, sentences[3])

<re.Match object; span=(0, 7), match='ARTHUR:'>
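
Two details are worth seeing in isolation. First, `re.match()` only succeeds if the pattern matches at the very start of the string, whereas `re.search()` looks anywhere. Second, a character class built from `\w` and `\s` covers letters, digits, and spaces but not `#`, so it handles `ARTHUR:` but not `SOLDIER #1:`. The sketch below illustrates both points; `speaker_pattern` and `speaker_pattern_hash` are hypothetical names, and adding `#` to the class is one possible extension rather than the exercise's required answer.

```python
import re

arthur = "ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot."
soldier = "SOLDIER #1: Halt!  Who goes there?"

# Word characters or whitespace, one or more times, followed by a colon
speaker_pattern = r"[\w\s]+:"

print(re.match(speaker_pattern, arthur).group())
# ARTHUR:

# '#' is neither \w nor \s, so matching from the start of the string fails
print(re.match(speaker_pattern, soldier))
# None

# re.search is not anchored, so it finds only the tail end of the notation
print(re.search(speaker_pattern, soldier).group())
# 1:

# Hypothetical extension: allow '#' inside the character class
speaker_pattern_hash = r"[\w\s#]+:"
print(re.match(speaker_pattern_hash, soldier).group())
# SOLDIER #1:
```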
