# Sentence tokenization

In this recipe, we count the number of sentences in a piece of text with NLTK's `sent_tokenize`, first on a single string and then on every row of a pandas DataFrame.

In [1]:
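import pandas as pd
from nltk.tokenize import sent_tokenize
from sklearn.datasets import fetch_20newsgroups

# sent_tokenize relies on NLTK's Punkt models; if they are missing, NLTK raises
# a LookupError that names the resource to fetch with nltk.download(...)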

In [2]:
text = """
The alarm rang at 7 in the morning as it usually did on Tuesdays. She rolled over,
stretched her arm, and stumbled to the button till she finally managed to switch it off.
Reluctantly, she got up and went for a shower. The water was cold as the day before the engineers
did not manage to get the boiler working. Good thing it was still summer.
Upstairs, her cat waited eagerly for his morning snack. Miaow! He voiced with excitement
as he saw her climb the stairs.
"""

In [3]:
# separate text into sentences

sent_tokenize(text)

['\nThe alarm rang at 7 in the morning as it usually did on Tuesdays.',
 'She rolled over,\nstretched her arm, and stumbled to the button till she finally managed to switch it off.',
 'Reluctantly, she got up and went for a shower.',
 'The water was cold as the day before the engineers\ndid not manage to get the boiler working.',
 'Good thing it was still summer.',
 'Upstairs, her cat waited eagerly for his morning snack.',
 'Miaow!',
 'He voiced with excitement\nas he saw her climb the stairs.']

In [4]:
# count number of sentences

len(sent_tokenize(text))

8
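
To see why a dedicated tokenizer is used here instead of a plain split on punctuation, the following sketch shows a naive regex baseline. `naive_sent_split` is a hypothetical helper written for illustration only; it is not how `sent_tokenize` works. It handles simple text, but it breaks on abbreviations, which is the kind of case the Punkt model behind `sent_tokenize` is designed to handle.

```python
import re


def naive_sent_split(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (illustrative baseline)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Simple text: the naive rule finds the three sentences.
print(naive_sent_split("Good thing it was still summer. Miaow! He smiled."))

# An abbreviation: the naive rule wrongly splits after "Dr."
print(naive_sent_split("Dr. Smith went home."))
```

The second call returns two pieces for what is a single sentence, so counts from a rule like this would be inflated on text with titles, initials, or other abbreviations.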

In [5]:
# now we do the same for an entire dataframe

# load data
data = fetch_20newsgroups(subset='train')
df = pd.DataFrame(data.data, columns=['text'])
df.head()

,text
0,From: lerxst@wam.umd.edu (where's my thing)\nS...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...


In [6]:
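# keep 10 rows to speed things up; .loc slices by label and includes the end
# label, so this selects index labels 1 to 10 (row 0 is dropped)
# .copy() avoids a warning about modifying a slice when columns are assigned later

df = df.loc[1:10].copy()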
df

,text
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...
5,From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\...
6,From: bmdelane@quads.uchicago.edu (brian manni...
7,From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ...
8,From: holmes7000@iscsvax.uni.edu\nSubject: WIn...
9,From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje...
10,From: irwin@cmptrc.lonestar.org (Irwin Arnstei...
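
The row selection above keeps index labels 1 to 10 rather than positions 0 to 9. A minimal sketch of the difference between label-based and position-based slicing, using a toy DataFrame (`toy` and its contents are made up for illustration):

```python
import pandas as pd

# Toy frame with the default integer index 0..11 (illustrative data only).
toy = pd.DataFrame({"text": [f"message {i}" for i in range(12)]})

# Label-based slice: both end labels are included, so labels 1..10.
by_label = toy.loc[1:10]

# Position-based slice: the end is excluded, so positions 0..9.
by_position = toy.iloc[:10]

print(list(by_label.index))
print(list(by_position.index))
```

Both slices contain 10 rows, but only `iloc[:10]` returns the first 10 rows of the frame.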


In [7]:
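# remove the email header: keep only what follows the 'Lines:' field
# (assumes every message contains 'Lines:'; x[1] raises IndexError otherwise)

df['text'] = df['text'].str.split('Lines:').apply(lambda x: x[1])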

df

,text
1,11\nNNTP-Posting-Host: carson.u.washington.ed...
2,"36\n\nwell folks, my mac plus finally gave up..."
3,14\nDistribution: world\nNNTP-Posting-Host: a...
4,23\n\nFrom article <C5owCB.n3p@world.std.com>...
5,58\n\nIn article <1r1eu1$4t@transfer.stratus....
6,12\n\nThere were a few people who responded t...
7,44\nDistribution: world\nNNTP-Posting-Host: d...
8,10\n\nI have win 3.0 and downloaded several i...
9,29\n\njap10@po.CWRU.Edu (Joseph A. Pellettier...
10,13\n\nI have a line on a Ducati 900GTS 1978 m...
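
The header removal above assumes that every message contains `Lines:`. A more defensive variant could look like the sketch below. `strip_header` is a hypothetical helper, not part of the original recipe, and it differs in one respect: it keeps everything after the first `Lines:`, whereas indexing the split result with `[1]` keeps only the text up to the next occurrence.

```python
def strip_header(message: str, marker: str = "Lines:") -> str:
    """Return the text after the first `marker`; leave the message as-is if absent."""
    _head, sep, tail = message.partition(marker)
    return tail if sep else message


# Made-up message in the same shape as the newsgroup posts.
sample = "From: someone@example.com\nSubject: test\nLines: 2\n\nFirst sentence. Second one."

print(repr(strip_header(sample)))

# A message without the marker is returned unchanged instead of raising.
print(strip_header("no header here"))
```

It could be applied to the column with `df['text'].apply(strip_header)`.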


In [8]:
# inspect one message (index label 1) after removing the header

print(df['text'][1])

 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>



In [9]:
# split each text into sentences, then count the sentences per row

df['num_sent'] = df['text'].apply(sent_tokenize).apply(len)

df

,text,num_sent
1,11\nNNTP-Posting-Host: carson.u.washington.ed...,6
2,"36\n\nwell folks, my mac plus finally gave up...",9
3,14\nDistribution: world\nNNTP-Posting-Host: a...,7
4,23\n\nFrom article <C5owCB.n3p@world.std.com>...,10
5,58\n\nIn article <1r1eu1$4t@transfer.stratus....,21
6,12\n\nThere were a few people who responded t...,8
7,44\nDistribution: world\nNNTP-Posting-Host: d...,15
8,10\n\nI have win 3.0 and downloaded several i...,3
9,29\n\njap10@po.CWRU.Edu (Joseph A. Pellettier...,12
10,13\n\nI have a line on a Ducati 900GTS 1978 m...,11
