<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/sentence_splitting_and_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence splitting and tokenization example

This short notebook shows one comparatively fast way to do sentence splitting and tokenization in Python. Neither step is particularly accurate, but the approach should do the job when the details don't matter too much.

We'll use the [sentence-splitter](https://pypi.org/project/sentence-splitter/) and [regex](https://pypi.org/project/regex/) packages. 

In [17]:
!pip install --quiet sentence-splitter regex 

Download some example data. The `-nc` (no-clobber) flag skips the download if the file already exists.

In [18]:
!wget -nc http://dl.turkunlp.org/TKO_7095_2023/fiwiki-20221120-sample.txt

File ‘fiwiki-20221120-sample.txt’ already there; not retrieving.



Read in example data in paragraph-per-line format

In [3]:
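# Read one paragraph per line; each string keeps its trailing newline
with open('fiwiki-20221120-sample.txt', encoding='utf-8') as f:
    paragraphs = f.readlines()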

Instantiate the sentence splitter. Note that you need to provide the language, and not all languages are supported.

In [5]:
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='fi')

Run sentence splitting and log the runtime. To keep things reasonably fast, only the first 100,000 paragraphs are processed.

In [21]:
%%time

sentences = [s for p in paragraphs[:100000] for s in splitter.split(p)]

CPU times: user 1min 5s, sys: 344 ms, total: 1min 5s
Wall time: 1min 7s
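
The nested comprehension above flattens the per-paragraph sentence lists into a single list. The following minimal sketch shows the same pattern on toy data using only the standard library. Note that `naive_split` is a hypothetical stand-in written for this illustration; it is not how the sentence-splitter package works and is considerably cruder.

```python
import re

def naive_split(paragraph):
    """Hypothetical stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]

# Toy data in the same paragraph-per-line format as the example file
toy_paragraphs = [
    "Hän on toiminut senaattorina. Hän on ollut myös puheenjohtaja.\n",
    "Tunnetuimpia ovat vampyyrit ja zombit.\n",
]

# Outer loop over paragraphs, inner loop over the sentences of each paragraph
toy_sentences = [s for p in toy_paragraphs for s in naive_split(p)]

for s in toy_sentences:
    print(s)
```

The result is one flat list of sentences with paragraph boundaries discarded, which is the shape the tokenization step below expects.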


Split into tokens using a regular expression. Here the regular expression defines a token as either a maximal sequence of alphanumeric characters or any other single non-space character.

In [22]:
import regex

TOKENIZE_RE = regex.compile(r'([[:alnum:]]+|\S)')
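
To see what this kind of pattern does on a single sentence, here is a small sketch that uses only the standard-library `re` module. The pattern `[^\W_]+|\S` is an assumption made for this illustration: it approximates `[[:alnum:]]+|\S` (which `re` does not support) and may differ from the `regex` version on unusual characters. The names `APPROX_TOKENIZE_RE` and `tokenize` are likewise hypothetical.

```python
import re

# Standard-library approximation of the notebook's pattern (an assumption,
# not the notebook's code): [^\W_]+ matches a run of word characters other
# than the underscore, and \S matches any other single non-space character.
APPROX_TOKENIZE_RE = re.compile(r'[^\W_]+|\S')

def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return APPROX_TOKENIZE_RE.findall(text)

print(tokenize("Hän on toiminut senaattorina vuodesta 1975."))
print(tokenize("Sanan 'epäkuollut' kehitti kääntäjä."))
```

Because the alternation tries the alphanumeric run first, words and numbers stay whole, while punctuation such as `.` and `'` is emitted one character at a time.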

Tokenize and log runtime

In [24]:
%%time

tokenized = [TOKENIZE_RE.findall(s) for s in sentences]

CPU times: user 6.11 s, sys: 742 ms, total: 6.85 s
Wall time: 6.95 s


Check a few examples

In [26]:
for t in tokenized[:10]:
    print(t)

['Patrick', 'Joseph', 'Leahy', '(', 's', '.']
['31', '.', 'maaliskuuta', '1940', 'Montpelier', ',', 'Vermont', ')', 'on', 'yhdysvaltalainen', 'demokraattisen', 'puolueen', 'poliitikko', '.']
['Leahy', 'toimii', 'Yhdysvaltain', 'senaatin', 'president', 'pro', 'temporena', 'eli', 'de', 'facto', 'senaatin', 'varapresidenttinä', '.']
['Hän', 'on', 'toiminut', 'Vermontin', 'osavaltion', 'senaattorina', 'vuodesta', '1975', '.']
['Grassley', 'myös', 'toimi', 'senaatin', 'president', 'pro', 'temporena', 'joulukuusta', '2012', 'tammikuuhun', '2015', '.']
['Hän', 'on', 'ollut', 'myös', 'senaatin', 'oikeusvaliokunnan', 'puheenjohtaja', '.']
['Elävä', 'kuollut', 'eli', 'epäkuollut', 'tarkoittaa', 'yleisesti', 'erilaisia', 'taruolentoja', ',', 'jotka', 'ovat', 'heränneet', 'kuolleista', 'takaisin', 'elävien', 'maailmaan', '.']
['Populaarikulttuurissa', 'tunnetuimpia', 'eläviä', 'kuolleita', 'ovat', 'vampyyrit', 'ja', 'zombit', '.']
['Sanan', "'", 'epäkuollut', "'", 'kehitti', 'kääntäjä', 'Kersti', 