# 3   Processing Raw Text

In [1]:
!pip install --upgrade pip


In [2]:
!pip install nltk




In [3]:
import nltk

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.

The goal of this chapter is to answer the following questions:

- How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?

- How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?

- How can we write programs to produce formatted output and save it in a file?

In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the web is in HTML format, we will also see how to dispense with markup.


<b>Note:</b>

<b>Important</b>: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:

In [4]:
#from __future__ import division  # Python 2 users only
import nltk, re, pprint
from nltk import word_tokenize

### 3.1   Accessing Text from the Web and from Disk



<b>Electronic Books</b>

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 texts each).

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.

In [5]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw),len(raw),raw[:75]



(str,
 1176896,
 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n')

The variable raw contains a string with 1,176,896 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return and line feed characters (the file must have been created on a Windows machine). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.
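As a rough illustration of what tokenization does, the sketch below splits a short string into words and punctuation marks with a regular expression. It is only a stand-in for word_tokenize, which handles many more cases (contractions, for instance); the variable names and the pattern are assumptions chosen for illustration.

```python
import re

# Sample text with Windows-style line endings, like the opening line of the file.
sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n"

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark.
# Whitespace (including \r and \n) is skipped because neither branch matches it.
simple_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(simple_tokens)
```

Because the carriage return and line feed count as whitespace, they never appear in the resulting list.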



In [5]:
nltk.download()  # opens the NLTK downloader after you run this cell; choose which data packages to install (installing all of them is the simplest option)

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
tokens = word_tokenize(raw)



In [8]:
print(type(tokens))
print(len(tokens))
tokens[:10]

<class 'list'>
254352


['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations like slicing:



In [9]:
text=nltk.Text(tokens)
type(text)

nltk.text.Text

In [10]:
print(text[1024:1062])

['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.']


In [11]:
text.collocations()  # collocations are multi-word expressions that commonly co-occur; this method prints its results and returns None

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market


Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:



In [12]:
raw.find("PART I")

5338

In [13]:
raw.rfind("End of Project Gutenberg's Crime")

1157746

In [14]:
raw = raw[5338:1157743] #[1]
raw.find("PART I")

0

The find() and rfind() ("reverse find") methods help us get the right index values to use for slicing the string [1]. We overwrite raw with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.
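The slicing step can also be written without hard-coded offsets. The following is a minimal sketch on a made-up miniature "ebook" (the sample string and the variable names start, end and content are assumptions for illustration); it also checks for the -1 that find() and rfind() return when a marker is missing.

```python
# A made-up miniature "ebook" with a header and a footer around the content.
sample = (
    "Title: Example Book\r\n"
    "Produced by volunteers\r\n"
    "PART I\r\n"
    "On an exceptionally hot evening...\r\n"
    "End of Project Gutenberg's Example Book\r\n"
    "License text\r\n"
)

start = sample.find("PART I")                             # first occurrence
end = sample.rfind("End of Project Gutenberg's Example")  # last occurrence

# find() and rfind() return -1 when the marker is missing, so check first.
if start == -1 or end == -1:
    raise ValueError("start or end marker not found")

content = sample[start:end]
print(content)
```

Keeping the offsets in variables means the slice stays correct if the header of the downloaded file changes length.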

This was our first brush with the reality of the web: texts found on the web may contain unwanted material, and there may not be an automatic way to remove it. But with a small amount of extra work we can extract the material we need.

#### Dealing with HTML



Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the section on files below. However, if you're going to do this often, it's easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called Blondes to die out in 200 years, an urban legend passed along by the BBC as established scientific fact:



In [15]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

You can type print(html) to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.



In [16]:
print(html)

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>
<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">
<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">
<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">
<meta name="IFS_URL" content="/2/hi/health/2284783.stm">
<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">
<meta name="Headline" content="Blondes 'to die out in 200 years'">
<meta name="Section" content="Health">
<meta name="Description" content="Natural blondes are an endangered species and will die out by 2202, a study suggests.">
<!-- GENMaps-->
<map name="banner">
<area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT">
</map>

<script src="/nol/shared/js/livestats_v1_1.js" langua

To get text out of HTML we will use a Python library called BeautifulSoup, available from http://www.crummy.com/software/BeautifulSoup/:



In [6]:
from bs4 import BeautifulSoup

In [18]:
raw = BeautifulSoup(html, "lxml").get_text()  # name the parser explicitly (requires the lxml package)
tokens = word_tokenize(raw)
tokens





['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'CATEGORIES',
 'TV',
 'RADIO',
 'COMMUNICATE',
 'WHERE',
 'I',
 'LIVE',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'World',
 'UK',
 'England',
 'N',
 'Ireland',
 'Scotland',
 'Wales',
 'Politics',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 'Education',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'World',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 '

This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.



In [19]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
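To see roughly what stripping markup involves, here is a minimal sketch that uses only Python's standard html.parser module instead of BeautifulSoup. The TextExtractor class and the page fragment are hypothetical, written for illustration; this is not how BeautifulSoup is implemented, and real pages need far more care.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

# Hypothetical page fragment, loosely modelled on the story above.
page = """<html><head><title>Blondes 'to die out in 200 years'</title>
<script>var stats = "ignored";</script></head>
<body><p>Blonde hair is caused by a <b>recessive</b> gene.</p></body></html>"""

extractor = TextExtractor()
extractor.feed(page)
extractor.close()
page_text = " ".join(extractor.chunks)
print(page_text)
```

Skipping script and style content removes one common source of unwanted tokens, though navigation text still has to be trimmed by hand as shown above.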


#### Processing Search Engine Results



The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very specific patterns, which would only match one or two examples in a smaller corpus, but which might match tens of thousands of examples when run on the web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very convenient tool for quickly checking a theory, to see if it is reasonable.



![](table3.JPG)

Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions. When content has been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use of search engine APIs).



<b>Your Turn</b>: Search the web for "the of" (inside quotes). Based on the large count, can we conclude that "the of" is a frequent collocation in English?

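<mark>Not reliably: as noted above, search engine counts are inconsistent and can be inflated by duplicated content, so a large count for "the of" is weak evidence that it is a genuine collocation in English.</mark>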


#### Processing RSS Feeds



The blogosphere is an important source of text, in both formal and informal registers. With the help of a Python library called the Universal Feed Parser, available from  https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:



In [21]:
!pip install feedparser



In [22]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
print(llog['feed']['title'])


Language Log


In [25]:
print(len(llog.entries))
post = llog.entries[2]
print(post.title)
content = post.content[0].value
print(content[:70])


13
Bad Chinese
<p>Sign south of the demolished Pfeiffer Bridge on Highway 1 in Monter


With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.



In [26]:
raw = BeautifulSoup(content, "lxml").get_text()  # name the parser explicitly (requires the lxml package)
print(word_tokenize(raw))

['Sign', 'south', 'of', 'the', 'demolished', 'Pfeiffer', 'Bridge', 'on', 'Highway', '1', 'in', 'Monterey', 'County', '(', 'photograph', 'taken', 'on', 'August', '12', ',', '2017', 'by', 'Richard', 'Masoner', 'while', 'on', 'a', 'Big', 'Sur', 'bike', 'trip', ',', 'via', 'Flickr', ')', ':', 'This', 'is', 'not', 'Chinglish', '.', 'It', 'is', 'the', 'opposite', 'of', 'Chinglish', ':', 'English', 'poorly', 'translated', 'into', 'Chinese', '.', 'The', 'sign', 'says', ':', 'Zhǔdòng', 'gōnglù', 'bùyào', 'zǒu', 'zài', 'zhōngjiān', 'de', 'lùxiàn', 'bǎochí', 'bái', 'xiàn', 'de', 'quánlì', '主动公路不要走在中间的路线保持白线的权利', 'It', "'s", 'difficult', 'for', 'me', 'to', 'make', 'sense', 'of', 'this', 'sign', '.', 'Chinese', 'friends', 'to', 'whom', 'I', 'show', 'this', 'sign', 'are', 'also', 'totally', 'confused', 'by', 'it', '.', 'Forced', 'translation', 'into', 'English', ':', "''", 'Active', 'highway', '.', 'Do', "n't", 'walk', '/', 'ride', 'in', 'the', 'center', 'line', '/', 'lane', '.', 'Keep', '/', 'maint
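To make the structure of a feed more concrete, the sketch below parses a tiny hand-written Atom document with the standard library's xml.etree.ElementTree. The feed text, titles and variable names are made up for illustration, and this is not how feedparser works internally; feedparser copes with many feed formats and malformed input that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written Atom document; the titles and content are made up.
atom = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello, world.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <content type="html">&lt;p&gt;More text.&lt;/p&gt;</content>
  </entry>
</feed>"""

# Atom elements live in a namespace, so lookups need a prefix mapping.
ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(atom.encode("utf-8"))

feed_title = root.find("atom:title", ns).text
posts = [
    (entry.find("atom:title", ns).text, entry.find("atom:content", ns).text)
    for entry in root.findall("atom:entry", ns)
]
print(feed_title, len(posts))
```

Each post's content is itself HTML, which is why the markup still has to be stripped before tokenizing.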





#### Reading Local Files



In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. If you have a file document.txt, you can load its contents like this:



In [7]:
f = open('document.txt')
raw = f.read()

To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:



In [8]:
import os 
os.listdir('.')  # document is here :)

['.ipynb_checkpoints',
 '1.Language_Processing_and_Python.ipynb',
 '2_Accessing_Text_Corpora_and_Lexical_Resources.ipynb',
 '3_Processing_Raw_Text.ipynb',
 'accusatif....JPG',
 'branches_phonétique.JPG',
 'Categorizing_and_Tagging_Words.ipynb',
 'document.txt',
 'LDM.pdf',
 'morohologie.JPG',
 'phonologie.JPG',
 'phonétique.JPG',
 'phonétique_VS_phonologie.JPG',
 'pipeline.JPG',
 'semantique.JPG',
 'table3.JPG',
 'testcv.pdf',
 'text-mining-pres.pdf',
 'tp-nltk-ss.pdf']

Another problem you may encounter when accessing a text file is the newline convention, which differs between operating systems. The built-in open() function has a second parameter for controlling how the file is opened. In open('document.txt', 'rU'), 'r' means to open the file for reading (the default), and 'U' stands for "Universal", which lets us ignore the different conventions used for marking newlines. In Python 3, text mode already translates newlines in this way by default; the 'U' flag is deprecated and was removed in Python 3.11, so open('document.txt') or open('document.txt', 'r') is sufficient.
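The effect of newline translation can be checked directly. The following sketch writes a small temporary file with Windows-style line endings and reads it back in text mode and in binary mode; the file contents and variable names are assumptions made for the example.

```python
import os
import tempfile

# Write a file with Windows-style (\r\n) line endings, as raw bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first line\r\nsecond line\r\n")
    tmp_path = tmp.name

# Text mode (the default) translates \r\n to \n while reading.
with open(tmp_path, "r", encoding="ascii") as f:
    as_text = f.read()

# Binary mode returns the bytes exactly as stored.
with open(tmp_path, "rb") as f:
    as_bytes = f.read()

os.remove(tmp_path)
print(repr(as_text))
print(as_bytes)
```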



Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:



In [37]:
f = open('document.txt')
raw = f.read()
raw


'Hello,\nMy name is Riahi LOURIZ \nI am new to this field of text mining\nthese are some notebooks that can help you to learn NLP.'

Recall that the '\n' characters are <b>newlines</b>; this is equivalent to pressing Enter on a keyboard and starting a new line.



We can also read a file one line at a time using a for loop:



In [9]:
f = open('document.txt')
for line in f:
    print(line.strip())

Hello,
My name is Riahi LOURIZ
I am new to this field of text mining
these are some notebooks that can help you to learn NLP.


Here we use the strip() method to remove the newline character at the end of the input line.
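Note that strip() removes whitespace at both ends of the line, not only the newline. The sketch below contrasts it with rstrip("\n"), which removes just the trailing newline; io.StringIO stands in for an open file so that the example needs nothing on disk, and the sample lines are made up.

```python
import io

# io.StringIO behaves like an open text file for the purposes of iteration.
f = io.StringIO("Hello,\n  indented line\nlast line without newline")

stripped = []
kept_indent = []
for line in f:
    stripped.append(line.strip())          # drops whitespace at both ends
    kept_indent.append(line.rstrip("\n"))  # drops only the trailing newline

print(stripped)
print(kept_indent)
```

Which of the two is appropriate depends on whether leading whitespace carries meaning in the text being read.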



NLTK's corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open and read it in the way we just demonstrated above:



In [10]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')

In [11]:
raw = open(path).read()


#### Extracting Text from PDF, MSWord and other Binary Formats



ASCII text and HTML text are human-readable formats. Text often comes in binary formats, such as PDF and MSWord, that can only be opened using specialized software. Third-party libraries such as <i>pypdf</i> and <i>pywin32</i> provide access to these formats. Extracting text from multi-column documents is particularly challenging. For one-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described above. If the document is already on the web, you can enter its URL in Google's search box. The search result often includes a link to an HTML version of the document, which you can save as text.



In [47]:
!pip install PyPDF2



In [12]:
import PyPDF2

In [13]:
pdfFileObj = open('LDM.pdf', 'rb')  # PDF is a binary format, so open the file in binary mode

In [14]:
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

In [15]:
pdfReader.numPages

1

In [16]:
pageObj = pdfReader.getPage(0)

In [17]:
pageObj.extractText()

"le8août2017\nRiahiLOURIZ\n+60142624708\nriahi.louriz@telecom-bretagne.eu\nIFREMER\nPointeduDiable,\n29280Plouzané\nMadame,Monsieur,\nC'estavecbeaucoupd'enthousiasmeetdemotivationquejevousadressemacandidaturepourrejoindre\nvotreentrepriseentantqueDataScientist.\nL'o˙requevousproposezcorrespondparfaitementàmesobjectifsenraisondel'intérêtparticulierqueje\nporteauxstatistiquesetàl'analysededonnées;denatureobservateur,perspicaceetdotéd'unefortecapacité\nd'analyseetd'anticipation,j'aipudévelopperungoûtprononcépourcedomaineàtraversplusieursmissions\ne˙ectuéestoutaulongdemonparcoursprofessionnel,notammentaucoursd'uncerti˝caten\nMachineLearning\nquej'aiobtenusurlescoursenlignedecoursera.org(Stanforduniversity,ProfessorAndrewNg)etquim'a\npermisdemefamiliariseraveclesoutilsetméthodesclassiquesd'analysededonnées(réseauxdeneurones,SVM,\nMapReduce,LogisticRegression,AnomalyDetection,...).J'aiégalementobtenuuncerti˝catnommé\nTheData\nScientist'sToolbox\n(coursera.org,JohnsHopkinsuniversity)cequim'ap

In [18]:
type(pageObj)

PyPDF2.pdf.PageObject

In [80]:
!pip install docx



#### Capturing User Input



Sometimes we want to capture the text that a user types while interacting with our program. To prompt the user to type a line of input, call the Python function input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.



In [20]:
s = input("Enter some text: ")


Enter some text: my name is Riahi LOURIZ


In [21]:
print("You typed", len(word_tokenize(s)), "words.")

You typed 5 words.


In [22]:
word_tokenize(s)

['my', 'name', 'is', 'Riahi', 'LOURIZ']

#### The NLP Pipeline



The figure below (3.1) summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter 1. (One step, normalization, will be discussed in Section 3.6.)




![](pipeline.JPG)

---
##### Example:
---

In [92]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"  # url of the content
html = request.urlopen(url).read().decode('utf8')  # open url and read it
raw = BeautifulSoup(html, "lxml").get_text()  # strip markup with bs4.BeautifulSoup, naming the parser explicitly
tokens = nltk.wordpunct_tokenize(raw) 
print(len(tokens))
text=nltk.Text(tokens)
words= [w.lower() for w in text]
vocab= sorted(set(words))
vocab





['"',
 '","").',
 '";',
 "'",
 '\'"><',
 "')!=-",
 "'+",
 "'<",
 "'~",
 '(',
 '(!',
 '(".',
 '("<',
 "('",
 "('<",
 '(/\\',
 ')',
 ')+"...";',
 ');',
 '){',
 ')~',
 '+',
 '+"";',
 '+\'"',
 '+\'">\'+\'<',
 "+'&",
 "+'</",
 ',',
 ',"',
 '-',
 '-------------',
 '----------------------------------------------------------------------------------',
 '.',
 '."',
 '/',
 '/"\'+\'',
 '/))',
 '//-->',
 '0',
 '01',
 '02',
 '08',
 '09',
 '1',
 '11',
 '12',
 '17',
 '2',
 '200',
 '2002',
 '2202',
 '2284783',
 '252',
 '27',
 '28',
 '51',
 '99',
 ':',
 '://',
 ';',
 ';}',
 '<!--',
 '=',
 '="',
 '="\'+',
 "='+",
 "='<",
 '>");',
 ">');",
 ">'+",
 ">';",
 '></',
 '>=',
 '>>',
 '?',
 '?~',
 '\\',
 '^^',
 'a',
 'abductees',
 'about',
 'africa',
 'aid',
 'alert',
 'alien',
 'also',
 'alzheimer',
 'americas',
 'an',
 'and',
 'ann',
 'applet',
 'apr',
 'are',
 'as',
 'asia',
 'at',
 'attractive',
 'babies',
 'back',
 'bbc',
 'be',
 'become',
 'believe',
 'beyond',
 'big',
 'bin',
 'blame',
 'blonde',
 'blonde

There's a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x), e.g. type(1) is int since 1 is an integer.

When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, Python's str data type. (We will learn more about strings in Section 3.2):

In [23]:
raw = open('document.txt').read()
type(raw)

str

When we tokenize a string we produce a list (of words), and this is Python's list type. Normalizing and sorting lists produces other lists:



In [24]:
tokens = word_tokenize(raw)
type(tokens)
words = [w.lower() for w in tokens]
type(words)
vocab = sorted(set(words))
type(vocab)

list

The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:



In [25]:
vocab.append('blog')


In [26]:
#raw.append('blog')  # AttributeError: 'str' object has no attribute 'append'
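The type changes along the pipeline can be traced in one small, self-contained sketch. The sample sentence is made up, and a regular expression stands in for word_tokenize so that the example runs without NLTK data; treat it as an illustration of the types involved rather than a full pipeline.

```python
import re

raw = "The gene is recessive. The GENE is rare."  # str
tokens = re.findall(r"\w+|[^\w\s]", raw)           # list of str (regex stand-in for word_tokenize)
words = [w.lower() for w in tokens]                # list of str, normalized to lowercase
vocab = sorted(set(words))                         # sorted list of the distinct items

vocab.append("blog")       # fine: lists can grow in place
try:
    raw.append("blog")     # strings have no append method
except AttributeError as err:
    error_message = str(err)

print(type(raw).__name__, type(tokens).__name__, type(vocab).__name__)
print(error_message)
```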

### 3.2   Strings: Text Processing at the Lowest Level



It's time to examine a fundamental data type that we've been studiously avoiding so far. In earlier chapters we focused on a text as a list of words. We didn't look too closely at words and how they are handled in the programming language. By using NLTK's corpus interface we were able to ignore the files that these texts had come from. The contents of a word, and of a file, are represented by programming languages as a fundamental data type known as a string. In this section we explore strings in detail, and show the connection between strings, words, texts and files.



#### Basic Operations with Strings



Strings are specified using single quotes [1] or double quotes [2], as shown below. If a string contains a single quote, we must backslash-escape the quote [3] so Python knows a literal quote character is intended, or else put the string in double quotes [2]. Otherwise, the quote inside the string [4] will be interpreted as a close quote, and the Python interpreter will report a syntax error:



In [28]:
monty = 'Monty Python' #[1]
print(monty)
circus = "Monty Python's Flying Circus" #[2]
print(circus)
circus = 'Monty Python\'s Flying Circus' #[3]
print(circus)


Monty Python
Monty Python's Flying Circus
Monty Python's Flying Circus


In [30]:
#circus = 'Monty Python's Flying Circus' #[4]  gives a syntax error

Sometimes strings go over several lines. Python provides us with various ways of entering them. In the next example, a sequence of two strings is joined into a single string. We need to use backslash [1] or parentheses [2] so that the interpreter knows that the statement is not complete after the first line.



In [32]:
couplet = "Shall I compare thee to a Summer's day?"\
          "Thou art more lovely and more temperate:" #[1]
print(couplet)

Shall I compare thee to a Summer's day?Thou art more lovely and more temperate:


In [33]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date:") #[2]
print(couplet)

Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:


Unfortunately the above methods do not give us a newline between the two lines of the sonnet. Instead, we can use a triple-quoted string as follows:



In [34]:
couplet = """Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate:"""
print(couplet)

Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate:


In [39]:
test = """This is a notebook for nltk's book.
You can explore it and get more knowledge in this field"""
print(test)

This is a notebook for nltk's book.
You can explore it and get more knowledge in this field
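The three ways of writing a multi-line string can be compared side by side. This short sketch (the variable names joined, with_newline and triple are chosen for illustration) shows that adjacent literals are joined with nothing in between, and that an explicit "\n" gives the same result as a triple-quoted string.

```python
# Adjacent string literals are joined with nothing in between.
joined = ("Rough winds do shake the darling buds of May,"
          "And Summer's lease hath all too short a date:")

# Adding "\n" to the end of the first literal puts a line break between them.
with_newline = ("Rough winds do shake the darling buds of May,\n"
                "And Summer's lease hath all too short a date:")

# A triple-quoted string keeps the line break exactly as typed.
triple = """Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:"""

print(with_newline == triple)
```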


Now that we can define strings, we can try some simple operations on them. First let's look at the + operation, known as concatenation [1]. It produces a new string that is a copy of the two original strings pasted together end-to-end. Notice that concatenation doesn't do anything clever like insert a space between the words. We can even multiply strings [2]:



In [42]:
'very' + 'very' + 'very'  # [1]


'veryveryvery'

In [41]:
'very' * 3 #[2]

'veryveryvery'

<b>Note</b>

Your Turn: Try running the following code, then try to use your understanding of the string + and * operations to figure out how it works. Be careful to distinguish between the string ' ', which is a single whitespace character, and '', which is the empty string.

In [48]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = ['*' * 2 * (7 - i) + 'SERL' * i for i in a]
for line in b:
    print(line)

************SERL
**********SERLSERL
********SERLSERLSERL
******SERLSERLSERLSERL
****SERLSERLSERLSERLSERL
**SERLSERLSERLSERLSERLSERL
SERLSERLSERLSERLSERLSERLSERL
**SERLSERLSERLSERLSERLSERL
****SERLSERLSERLSERLSERL
******SERLSERLSERLSERL
********SERLSERLSERL
**********SERLSERL
************SERL


#### Accessing Individual Characters



As with lists, strings are indexed starting from zero, so monty[0] is the first character. Again as with lists, we can use negative indexes for strings, where -1 is the index of the last character. Positive and negative indexes give us two ways to refer to any position in a string. In this case, since the string has a length of 12, indexes 5 and -7 both refer to the same character (a space). (Notice that 5 = len(monty) - 7.)



In [49]:
print(monty)

Monty Python


In [50]:
monty[0]

'M'

In [51]:
monty[5]

' '

In [52]:
monty[-7]

' '
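The relationship between the two kinds of index can be verified for every position at once. This is a small sketch; the names n, pairs and same are introduced here for illustration.

```python
monty = 'Monty Python'
n = len(monty)  # 12

# Every position has a non-negative index i and a matching negative index i - n.
pairs = [(i, i - n) for i in range(n)]
same = all(monty[i] == monty[j] for i, j in pairs)

print(pairs[5])
print(same)
```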

We can write for loops to iterate over the characters in strings. This print function includes the optional end=' ' parameter, which is how we tell Python to print a space instead of a newline at the end.



In [53]:
sent = 'colorless green ideas sleep furiously'
for ch in sent:
    print(ch, end=' ')

c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y 

We can count individual characters as well. We should ignore the case distinction by normalizing everything to lowercase, and filter out non-alphabetic characters:



In [54]:
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]

In [57]:
print([char for (char, count) in fdist.most_common()])


['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w', 'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']
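The same kind of count can be sketched without NLTK using collections.Counter from the standard library, which offers a similar most_common() method. The sample sentence below is a short stand-in for the full text, so the counts differ from those above.

```python
from collections import Counter

# A short stand-in for the full text of the book.
sample = "Call me Ishmael. Some years ago..."

# Lowercase each character and keep only alphabetic ones before counting.
letter_counts = Counter(ch.lower() for ch in sample if ch.isalpha())

print(letter_counts.most_common(3))
print(sum(letter_counts.values()))
```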


#### The Difference between Lists and Strings



Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join strings and lists:



In [62]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
query[2]

'o'

In [63]:
beatles[2]

'George'

In [64]:
query[:2]

'Wh'

In [65]:
beatles[:2]

['John', 'Paul']

In [66]:
query + " I don't"

"Who knows? I don't"

In [68]:
#beatles + 'Brian'     #TypeError: can only concatenate list (not "str") to list

In [69]:
beatles + ['Brian']

['John', 'Paul', 'George', 'Ringo', 'Brian']

<mark>When we open a file for reading into a Python program, we get a string corresponding to the contents of the whole file. If we use a for loop to process the elements of this string, all we can pick out are the individual characters; we don't get to choose the granularity. By contrast, the elements of a list can be as big or small as we like: for example, they could be paragraphs, sentences, phrases, words, characters. So lists have the advantage that we can be flexible about the elements they contain, and correspondingly flexible about any downstream processing. Consequently, one of the first things we are likely to do in a piece of NLP code is tokenize a string into a list of strings (Section 3.7). Conversely, when we want to write our results to a file, or to a terminal, we will usually format them as a string (Section 3.9).</mark>

Lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements:



In [70]:
beatles

['John', 'Paul', 'George', 'Ringo']

In [72]:
beatles[0]
del beatles[-1]
beatles


['John', 'Paul', 'George']

In [74]:
beatles[0] = "John Lennon"
beatles

['John Lennon', 'Paul', 'George']

On the other hand if we try to do that with a string — changing the 0th character in query to 'F' — we get:



In [76]:
#query[0] = 'F'   #TypeError: 'str' object does not support item assignment

This is because strings are <b>immutable</b> — you can't change a string once you have created it. However, lists are <b>mutable</b>, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.
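A practical consequence is sketched below: to "change" a string we build a new one, whereas a change to a list is visible through every name that refers to it. The names new_query and alias are introduced here for illustration.

```python
query = 'Who knows?'
try:
    query[0] = 'F'              # strings do not support item assignment
except TypeError as err:
    message = str(err)

# Build a new string instead; the original is left untouched.
new_query = 'F' + query[1:]

beatles = ['John', 'Paul', 'George', 'Ringo']
alias = beatles                 # a second name for the same list object
beatles[0] = 'John Lennon'      # in-place change, visible through both names

print(new_query, query)
print(alias[0])
```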



### 3.3   Text Processing with Unicode



Our programs will often need to deal with different languages, and different character sets. The concept of "plain text" is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Europe you might use one of the extended Latin character sets, containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and Breton, and "ň" for Czech and Slovak. In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.



##### What is Unicode?



Unicode supports over a million characters. Each character is assigned a number, called a code point. In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal form.

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can only support a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode — translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding — this translation out of Unicode is called encoding, and is illustrated in 3.3.

![](unicode.JPG)
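As a minimal sketch of encoding and decoding, the following uses a single Polish character and only built-in string and bytes methods. The variable names are illustrative.

```python
# One Polish character, written with its code point escape.
nacute = '\u0144'                       # LATIN SMALL LETTER N WITH ACUTE
print(hex(ord(nacute)))                 # the code point as a number

# Encoding: Unicode string -> bytes. The bytes depend on the encoding chosen.
latin2_bytes = nacute.encode('iso-8859-2')   # single-byte encoding
utf8_bytes = nacute.encode('utf-8')          # multi-byte encoding
print(latin2_bytes, utf8_bytes)

# Decoding: bytes -> Unicode string. The matching encoding round-trips.
assert latin2_bytes.decode('iso-8859-2') == nacute
assert utf8_bytes.decode('utf-8') == nacute

# ASCII has no byte for this character at all.
try:
    nacute.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print('representable in ASCII:', ascii_ok)
```

Decoding bytes with the wrong encoding usually does not raise an error for single-byte encodings; it silently yields the wrong characters, which is why knowing the file's encoding matters.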

From a Unicode perspective, characters are abstract entities which can be realized as one or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.



#### Extracting encoded text from files



Let's assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see http://pl.wikipedia.org/wiki/Biblioteka_Pruska). This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us.



In [78]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
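Once the path is known, the file can be opened with an explicit `encoding` argument. The sketch below does not use the NLTK sample file; it writes its own small Latin-2 file to a temporary location as a stand-in, so the text and file name are assumptions made for illustration.

```python
import os
import tempfile

text = 'Biblioteka Pruska, Pa\u0144stwowa'   # contains U+0144

# Write a small Latin-2 file (a stand-in for polish-lat2.txt).
with tempfile.NamedTemporaryFile('w', encoding='latin2',
                                 suffix='.txt', delete=False) as f:
    f.write(text)
    tmp_path = f.name

# Reading with the right encoding decodes the bytes back into Unicode.
with open(tmp_path, encoding='latin2') as f:
    decoded = f.read()

# unicode_escape renders non-ASCII characters as backslash escapes,
# which is handy when a terminal cannot display them.
escaped = decoded.encode('unicode_escape')
print(escaped)

os.remove(tmp_path)
```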


## 3.4   Regular Expressions for Detecting Word Patterns



Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety of such "word tests" in 4.2. Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.



To use regular expressions in Python we need to import the re library using: import re. We also need a list of words to search; we'll use the Words Corpus again (4). We will preprocess it to remove any proper names.



In [2]:
import nltk
import re

In [6]:
wordlist=[w for w in nltk.corpus.words.words('en') if w.islower()]
print(type(wordlist),len(wordlist)) 
print(wordlist[0:10],wordlist[len(wordlist)-2:-1])

<class 'list'> 210687
['a', 'aa', 'aal', 'aalii', 'aam', 'aardvark', 'aardwolf', 'aba', 'abac', 'abaca'] ['zythem']


#### Using Basic Meta-Characters



Let's find words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. We need to specify the characters of interest, and use the dollar sign which has a special behavior in the context of regular expressions in that it matches the end of the word:



In [10]:
print([w for w in wordlist if re.search('ed$',w)])

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', 'abridged', 'abscessed', 'absconded', 'absorbed', 'abstracted', 'abstricted', 'accelerated', 'accepted', 'accidented', 'accoladed', 'accolated', 'accomplished', 'accosted', 'accredited', 'accursed', 'accused', 'accustomed', 'acetated', 'acheweed', 'aciculated', 'aciliated', 'acknowledged', 'acorned', 'acquainted', 'acquired', 'acquisited', 'acred', 'aculeated', 'addebted', 'added', 'addicted', 'addlebrained', 'addleheaded', 'addlepated', 'addorsed', 'adempted', 'adfected', 'adjoined', 'admired', 'admitted', 'adnexed', 'adopted', 'adossed', 'adreamed', 'adscripted', 'aduncated', 'advanced', 'advised', 'aeried', 'aethered', 'afeared', 'affected', 'affectioned', 'affined', 'afflicted', 'affricated', 'affrighted', 'affronted', 'aforenamed', 'afterfeed', 'aftershafted', 'afterthoughted', 'afterwitted', 'agazed', 'aged', 'agglomerated', 'aggrieved', 'agminated', 'agnamed', 'agonied', 'agreed', 'agueweed', 'ahungere

The . <b>wildcard</b> symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period:



In [12]:
print([w for w in wordlist if re.search('^..j..t..$',w)])

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']


Note

<b>Your Turn</b>: The caret symbol ^ matches the start of a string, just like the $ matches the end. What results do we get with the above example if we leave out both of these, and search for «..j..t..»?

In [13]:
print([w for w in wordlist if re.search('^..j..t..',w)]) # here we care about just the start, third and sixth letters.

['abjectedness', 'abjection', 'abjective', 'abjectly', 'abjectness', 'adjection', 'adjectional', 'adjectival', 'adjectivally', 'adjective', 'adjectively', 'adjectivism', 'adjectivitis', 'adjustable', 'adjustably', 'adjustage', 'adjustation', 'adjuster', 'adjustive', 'adjustment', 'bijouterie', 'cojusticiar', 'dejected', 'dejectedly', 'dejectedness', 'dejectile', 'dejection', 'dejectly', 'dejectory', 'dejecture', 'injectable', 'injection', 'injector', 'injustice', 'majestic', 'majestical', 'majestically', 'majesticalness', 'majesticness', 'majestious', 'majestyship', 'objectable', 'objectation', 'objectative', 'objectee', 'objecthood', 'objectification', 'objectify', 'objection', 'objectionability', 'objectionable', 'objectionableness', 'objectionably', 'objectional', 'objectioner', 'objectionist', 'objectival', 'objectivate', 'objectivation', 'objective', 'objectively', 'objectiveness', 'objectivism', 'objectivist', 'objectivistic', 'objectivity', 'objectivize', 'objectization', 'objec

In [14]:
print([w for w in wordlist if re.search('..j..t..$',w)]) # unlike the previous example here we care about the end 

['abjectly', 'adjuster', 'coprojector', 'dejected', 'dejectly', 'injector', 'interjector', 'majestic', 'maladjusted', 'microprojector', 'munjistin', 'objectee', 'objector', 'projector', 'readjuster', 'rejecter', 'rejector', 'subjected', 'unadjusted', 'undejected', 'unejected', 'uninjected', 'uninterjected', 'unjilted', 'unjolted', 'unjustly', 'unmajestic', 'unobjected', 'unprojected', 'unsubjected']


In [17]:
print( [w for w in wordlist if re.search('..j..t..',w)])

['abjectedness', 'abjection', 'abjective', 'abjectly', 'abjectness', 'adjection', 'adjectional', 'adjectival', 'adjectivally', 'adjective', 'adjectively', 'adjectivism', 'adjectivitis', 'adjustable', 'adjustably', 'adjustage', 'adjustation', 'adjuster', 'adjustive', 'adjustment', 'antejentacular', 'antiprojectivity', 'bijouterie', 'coadjustment', 'cojusticiar', 'conjective', 'conjecturable', 'conjecturably', 'conjectural', 'conjecturalist', 'conjecturality', 'conjecturally', 'conjecture', 'conjecturer', 'coprojector', 'counterobjection', 'dejected', 'dejectedly', 'dejectedness', 'dejectile', 'dejection', 'dejectly', 'dejectory', 'dejecture', 'disjection', 'guanajuatite', 'inadjustability', 'inadjustable', 'injectable', 'injection', 'injector', 'injustice', 'insubjection', 'interjection', 'interjectional', 'interjectionalize', 'interjectionally', 'interjectionary', 'interjectionize', 'interjectiveness', 'interjector', 'interjectorily', 'interjectory', 'interjectural', 'interobjective', 
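The four variants above differ only in their anchors. A small sketch on a hand-picked word list (the list and the helper name `matching` are assumptions for illustration) makes the effect of each anchor easier to see than the long corpus output:

```python
import re

sample = ['abjectly', 'abjection', 'unobjected', 'conjecture', 'majestic']

def matching(pattern, words):
    """Return the words in which re.search() finds the pattern."""
    return [w for w in words if re.search(pattern, w)]

both = matching('^..j..t..$', sample)        # the whole word must fit the template
start_only = matching('^..j..t..', sample)   # the template must sit at the start
end_only = matching('..j..t..$', sample)     # the template must sit at the end
anywhere = matching('..j..t..', sample)      # the template may sit anywhere

for name, hits in [('both', both), ('start', start_only),
                   ('end', end_only), ('neither', anywhere)]:
    print(name, hits)
```

Each anchor removed makes the pattern less restrictive, so the unanchored result is a superset of the other three.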

Finally, the ? symbol specifies that the previous character is optional. Thus «^e-?mail$» will match both email and e-mail. We could count the total number of occurrences of this word (in either spelling) in a text using sum(1 for w in text if re.search('^e-?mail$', w)).

In [18]:
sum(1 for w in wordlist if re.search('^e-?mail$',w))

0

In [41]:
test="salut tu peux envoyer un e-mail à la direction de formation pour leur .... j ai dèjà reçu un email  en me informant que ..."

In [42]:
type(test)

str

In [43]:
sum(1 for w in test if re.search('^e-?mail$',w)) # test is still a str, so each w is a single character; tokenize first

0

In [44]:
tokens=nltk.word_tokenize(test)
test=nltk.Text(tokens)

In [49]:
type(test)


nltk.text.Text

In [46]:
sum(1 for w in test if re.search('^e-?mail$',w)) # test is now a sequence of tokens, so both spellings are counted

2

#### Ranges and Closures



![](T9.JPG)

The T9 system is used for entering text on mobile phones (see 3.5). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:



In [50]:
print([w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)])


['gold', 'golf', 'hold', 'hole']


The first part of the expression, «^[ghi]», matches the start of a word followed by g, h, or i. The next part of the expression, «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are also constrained. Only four words satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have written «^[hig][nom][ljk][fed]$» and matched the same words.
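The same idea can be generalized to any key sequence. The following is a sketch, assuming the standard letter layout on keys 2 to 9; the names `T9_KEYS`, `t9_pattern`, `textonyms`, and the sample word list are hypothetical.

```python
import re

# Letters on a standard phone keypad for digits 2-9.
T9_KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
           '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build a regular expression with one bracketed set per keypress."""
    return '^' + ''.join('[' + T9_KEYS[d] + ']' for d in digits) + '$'

def textonyms(digits, words):
    """Return the words that the given key sequence could spell."""
    pattern = t9_pattern(digits)
    return [w for w in words if re.search(pattern, w)]

sample_words = ['gold', 'golf', 'hold', 'hole', 'home', 'good', 'holed']
print(t9_pattern('4653'))
print(textonyms('4653', sample_words))
```

Passing the full `wordlist` from earlier in place of `sample_words` should reproduce the corpus search shown above.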



Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:



In [51]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [60]:
[w for w in chat_words if re.search('^[ha]+$', w)]


['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f]. Now let's replace + with *, which means "zero or more instances of the preceding item". The regular expression ^m*i*n*e*$ will match everything that we found using '^m+i+n+e+$', but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures.
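A quick comparison of the two closures on a hand-made list (the list is an assumption for illustration, not corpus data):

```python
import re

candidates = ['mine', 'miiinnee', 'me', 'min', 'mmmmm', 'mien', '']

# + requires every letter to appear at least once, in this order.
plus_hits = [w for w in candidates if re.search('^m+i+n+e+$', w)]

# * lets any of the letters be absent, but the order is still fixed.
star_hits = [w for w in candidates if re.search('^m*i*n*e*$', w)]

print(plus_hits)
print(star_hits)
```

Two details are worth noticing: mien is rejected by both patterns because its letters are out of order, and the starred pattern also accepts the empty string, since every part of it may match zero times.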

The ^ operator has another function when it appears as the first character inside square brackets. For example «[^aeiouAEIOU]» matches any character other than a vowel. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes non-alphabetic characters.
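The sketch below applies the negated set to a few hand-picked tokens, then shows one possible way to exclude the non-alphabetic items by listing consonant ranges explicitly. The token list and the second pattern are illustrative assumptions.

```python
import re

tokens = [':):):)', 'grrr', 'cyb3r', 'zzzzzzzz', 'hello', 'Ugh', 'hmm']

# Anything except a vowel, so digits and punctuation also qualify.
no_vowels = [t for t in tokens if re.search('^[^aeiouAEIOU]+$', t)]

# Lowercase consonant letters only, written as ranges that skip the vowels.
consonants_only = [t for t in tokens if re.search('^[b-df-hj-np-tv-z]+$', t)]

print(no_vowels)
print(consonants_only)
```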

Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: \, {}, (), and |:

In [61]:
import nltk
import re

In [70]:
# Reminder of two built-ins: set() and sorted()
t=['a','b','a']
print(set(t)) # duplicates are removed
print(sorted(t)) # returns a new sorted list

{'a', 'b'}
['a', 'a', 'b']


In [71]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [77]:
print([w for w in wsj if re.search('^[0-9]+\.+[0-9]+$',w)])

['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', '1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', '1.20', '1.24', '1.25', '1.26', '1.28', '1.35', '1.39', '1.4', '1.457', '1.46', '1.49', '1.5', '1.50', '1.55', '1.56', '1.5755', '1.5805', '1.6', '1.61', '1.637', '1.64', '1.65', '1.7', '1.75', '1.76', '1.8', '1.82', '1.8415', '1.85', '1.8500', '1.9', '1.916', '1.92', '10.19', '10.2', '10.5', '107.03', '107.9', '109.73', '11.10', '11.5', '11.57', '11.6', '11.72', '11.95', '112.9', '113.2', '116.3', '116.4', '116.7', '116.9', '118.6', '12.09', '12.5', '12.52', '12.68', '12.7', '12.82', '12.97', '120.7', '1206.26', '121.6', '126.1', '126.15', '127.03', '129.91', '13.1', '13.15', '13.5', '13.50', '13.625', '13.65', '13.73', '13.8', '13.90', '130.6', '130.7', '131.01', '132.9', '133.7', '133.8', '14.00', '14.13', '14.26', '14.28', '14.43', '14.5', '14.53', '14.54',

In [83]:
print( [w for w in wsj if re.search('^[A-Z]+\$$', w)]) # a literal dollar sign preceded by one or more letters from [A-Z]

['C$', 'US$']


In [88]:
print([w for w in wsj if re.search('^[0-9]{4}$', w)])  # tokens in wsj made of exactly four digits


['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956', '1961', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1975', '1976', '1977', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2005', '2009', '2017', '2019', '2029', '3057', '8300']


In [89]:
print([w for w in wsj if re.search('^[0-9]{3}$', w)])  # tokens in wsj made of exactly three digits


['100', '101', '102', '103', '105', '106', '107', '108', '110', '111', '114', '115', '118', '119', '120', '125', '128', '130', '132', '133', '135', '138', '139', '140', '144', '145', '148', '149', '150', '155', '160', '170', '175', '176', '177', '179', '180', '184', '187', '188', '190', '195', '198', '200', '203', '210', '212', '214', '220', '225', '227', '228', '235', '240', '241', '245', '250', '257', '260', '266', '270', '274', '275', '280', '282', '286', '295', '300', '301', '306', '310', '313', '320', '321', '326', '339', '343', '350', '353', '360', '370', '380', '386', '388', '397', '400', '405', '415', '420', '430', '445', '450', '451', '454', '458', '467', '472', '490', '492', '500', '501', '512', '534', '570', '576', '598', '600', '605', '609', '620', '644', '666', '672', '692', '700', '701', '721', '722', '730', '750', '753', '767', '777', '778', '800', '847', '850', '879', '890', '900', '909', '913', '917', '960', '963']


In [107]:
print([w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]) # digits, a hyphen, then 3 to 5 lowercase letters


['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']


In [109]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
# words of the form w1-w2-w3 where len(w1) >= 5, 2 <= len(w2) <= 3 and len(w3) <= 6


['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

In [118]:
print([w for w in wsj if re.search('(ed|ing)$', w)]) # words ending with ed or ing 



In [116]:
print([w for w in wsj if re.search('^(to|Ski)',w)])  # words starting with to or Ski

['Skilled', 'Skills', 'Skinner', 'to', 'toast', 'tobacco', 'today', 'together', 'toilet', 'told', 'tolerate', 'toll', 'tomorrow', 'ton', 'tone', 'tons', 'too', 'took', 'tool', 'tools', 'tooth', 'top', 'top-level', 'top-selling', 'top-yielding', 'topics', 'topped', 'tormentors', 'torn', 'torrent', 'tort', 'total', 'totaled', 'totaling', 'tote', 'touch', 'touched', 'touchy', 'tough', 'tour', 'tours', 'touted', 'tow', 'toward', 'tower', 'town', 'towns', 'tows', 'toy']


You probably worked out that a backslash means that the following character is deprived of its special powers and must literally match a specific character in the word. Thus, while . is special, \. only matches a period. The braced expressions, like {3,5}, specify the number of repeats of the previous item. The pipe character indicates a choice between the material on its left or its right. Parentheses indicate the scope of an operator: they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot. It is instructive to see what happens when you omit the parentheses from the last expression above, and search for «ed|ing$».

![](meta.JPG)
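The effect of omitting the parentheses can be checked on a small hand-made list before trying it on the corpus. The word list is an assumption for illustration, and the anchors around the third pattern are added here so that only whole words match.

```python
import re

words = ['walked', 'walking', 'editor', 'reedit', 'sing',
         'wit', 'wait', 'woot', 'what']

# With parentheses, $ applies to both alternatives: "ed" or "ing" at the end.
scoped = [w for w in words if re.search('(ed|ing)$', w)]

# Without them, $ binds only to "ing": "ed" anywhere, or "ing" at the end.
unscoped = [w for w in words if re.search('ed|ing$', w)]

# Parentheses also delimit the alternatives inside a longer pattern.
w_t = [w for w in words if re.search('^w(i|e|ai|oo)t$', w)]

print(scoped)
print(unscoped)
print(w_t)
```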

To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, it will interpret these specially. For example \b would be interpreted as the backspace character. In general, when using regular expressions containing backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the  re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string. For example, the raw string r'\band\b' contains two \b symbols that are interpreted by the re library as matching word boundaries instead of backspace characters. If you get into the habit of using r'...' for regular expressions — as we will do from now on — you will avoid having to think about these complications.
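The difference is easy to observe directly. In this sketch the sample sentence is an assumption chosen so that and occurs both as a word and inside other words.

```python
import re

plain = '\band\b'     # Python turns each \b into a backspace character
raw = r'\band\b'      # backslashes survive; re reads \b as a word boundary

print(len(plain), len(raw))

text = 'sand and bandit'
plain_hits = re.findall(plain, text)   # looks for backspace + 'and' + backspace
raw_hits = re.findall(raw, text)       # matches only the standalone word

print(plain_hits)
print(raw_hits)
```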



### 3.5   Useful Applications of Regular Expressions



The above examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking if a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.



##### Extracting Word Pieces

The re.findall() ("find all") method finds all (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them:





In [126]:
word = 'supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))
print(len(re.findall(r'[aeiou]',word)))

['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
16


Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:



In [130]:
wsj = sorted(set(nltk.corpus.treebank.words()))


In [138]:
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
print(fd)
fd.most_common(12)


<FreqDist with 43 samples and 3405 outcomes>


[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]
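The same counting pattern works without NLTK. This sketch uses `collections.Counter` from the standard library in place of `nltk.FreqDist`, on a hand-made word list that is an assumption for illustration.

```python
import re
from collections import Counter

sample = ['beautiful', 'queue', 'rain', 'ratio', 'nation', 'see', 'book', 'cat']

# One findall() per word; each hit is a maximal run of two or more vowels.
runs = [vs for word in sample for vs in re.findall(r'[aeiou]{2,}', word)]
counts = Counter(runs)

print(runs)
print(counts['io'])
```

Because `{2,}` is greedy, a longer run such as the one in queue is counted once as a whole rather than as several overlapping pairs.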

In [137]:
#re.findall(r'[aeiou]{2,}','aefbuio')

Note

<b>Your Turn</b>: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:

[int(n) for n in re.findall(?, '2009-12-31')]

In [140]:
#[int(n) for n in re.findall(,'2009-12-31')]

##### Doing More with Word Pieces



Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, like glue them back together or plot them.

It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored. This three-way disjunction is processed left-to-right: if one of the three parts matches at a given position, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together (see 3.9 for more about the join operation).



In [141]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

In [142]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')

In [144]:
print(english_udhr[:75])

['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'rights', 'of', 'all', 'members', 'of', 'the', 'human', 'family', 'is', 'the', 'foundation', 'of', 'freedom', ',', 'justice', 'and', 'peace', 'in', 'the', 'world', ',', 'Whereas', 'disregard', 'and', 'contempt', 'for', 'human', 'rights', 'have', 'resulted', 'in', 'barbarous', 'acts', 'which', 'have', 'outraged', 'the', 'conscience', 'of', 'mankind', ',', 'and', 'the', 'advent', 'of', 'a', 'world', 'in', 'which', 'human', 'beings', 'shall', 'enjoy', 'freedom', 'of', 'speech', 'and']


In [145]:
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


In [151]:
nltk.tokenwrap([compress(w) for w in english_udhr[:75]])

'Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and\nof the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn\nof frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn\nrghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,\nand the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and'

In [152]:
print(nltk.tokenwrap([compress(w) for w in english_udhr[:75]]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
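To see what the three-way disjunction actually extracts, it helps to print the pieces before they are joined. This sketch reuses the same regular expression on three words taken from the text above; the dictionary name `pieces_by_word` is illustrative.

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

pieces_by_word = {}
for word in ['declaration', 'inalienable', 'equal']:
    pieces = re.findall(regexp, word)
    pieces_by_word[word] = pieces
    print(word, pieces, ''.join(pieces))
```

Word-internal vowels match none of the three alternatives, so `re.findall()` simply skips over them, which is what produces the compression.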


Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair:



In [153]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


Examining the rows for s and t, we see they are in partial "complementary distribution", which is evidence that they are not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a pronunciation rule that the letter t is pronounced s when followed by i. (Note that the single entry having su, namely kasuari, 'cassowary' is borrowed from English.)
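The tabulation step can be approximated with the standard library, which shows why a two-character string is enough to initialize a conditional frequency distribution: it unpacks into a (condition, event) pair. The word list below is a made-up toy sample, not the Rotokas lexicon, and `table` is an illustrative name.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy word list for illustration only.
toy_words = ['kasi', 'kaka', 'tota', 'siko', 'tapo']

pairs = [cv for w in toy_words for cv in re.findall(r'[ptksvr][aeiou]', w)]

# Each two-character string unpacks as (consonant, vowel).
table = defaultdict(Counter)
for consonant, vowel in pairs:
    table[consonant][vowel] += 1

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```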

If we want to be able to inspect the words behind the numbers in the above table, it would be helpful to have an index, allowing us to quickly find the list of words that contains a given consonant-vowel pair, e.g. cv_index['su'] should give us all words containing su. Here's how we can do this:

In [155]:
cv_word_pairs = [(cv, w) for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]

cv_index = nltk.Index(cv_word_pairs)

print(cv_index['su'])

print(cv_index['po'])

['kasuari']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']
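An index of this kind is essentially a dictionary that maps each key to a list of values. The following sketch builds one with `collections.defaultdict` on four words taken from the output above; it is an approximation of the idea, and the name `toy_index` is hypothetical.

```python
import re
from collections import defaultdict

toy_words = ['kasuari', 'kapo', 'kepoto', 'karopo']   # small sample for illustration

toy_index = defaultdict(list)
for w in toy_words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        toy_index[cv].append(w)   # a word is listed once per occurrence of the pair

print(toy_index['su'])
print(toy_index['po'])
```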


##### Finding Word Stems

In [6]:
import nltk
import re

When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just deal with word stems.

There are various ways we can pull out the stem of a word. Here's a simple-minded approach which just strips off anything that looks like a suffix:

In [7]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


stem('finding')

'find'

Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.

In [9]:
re.findall(r'^.*ing|ly|ed|ious|ies|ive|es|s|ment$', 'processing')

['processing']

In [10]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. Here's the revised version.



In [13]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')


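['processing']

However, we would actually like to split the word into stem and suffix, so we parenthesize both parts of the regular expression:

In [14]: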
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')


[('process', 'ing')]

This looks promising, but still has a problem. Let's look at a different word, processes:



In [23]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
# (.*) is required: without it, the ^ anchor would force the word
# to start with the suffix we are looking for

[('processe', 's')]

The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is "greedy" and the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want:

In [19]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')


[('process', 'es')]

![](meta.JPG)

This works even when we allow an empty suffix, by making the content of the second parentheses optional:



In [24]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')


[('language', '')]

This approach still has many problems (can you spot them?) but we will move on to define a function to perform stemming, and apply it to a whole text:



In [25]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem


In [26]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

In [32]:
tokens=nltk.word_tokenize(raw)  # generates a list of words that are in raw
#type(tokens)
print([stem(t) for t in tokens])


['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


<mark>Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems in some applications.</mark>



#### Searching Tokenized Text



You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). For example, "<a> <man>" finds all instances of a man in the text. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's findall() method for texts). In the following examples, we include <.*>, which will match any single token; enclosing it in parentheses means that only the matched word (e.g. monied) and not the matched phrase (e.g. a monied man) is produced. The chat example finds three-word phrases ending with the word bro. A pattern such as <l.*>{3,} would find sequences of three or more words starting with the letter l.



In [36]:
from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> <.*> <man>")

a monied man; a nervous man; a dangerous man; a white man; a white
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man


In [38]:
moby.findall(r'<a> (<.*>) <man>') # the parentheses select just the
# word we want, not the whole phrase

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


In [42]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")
chat.findall(r"<.*>(<.*>)<bro>")  # findall() prints its results and returns None, so no print() is needed


you rule bro; telling you bro; u twizted bro
rule; you; twizted
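One way to picture how angle-bracket patterns can work is to wrap every token in angle brackets, join them into a single string, and translate the pattern into an ordinary regular expression. The sketch below is loosely modeled on that idea; it is a simplification, not NLTK's implementation. The function name `token_findall` is hypothetical, capturing groups are not handled, and tokens containing angle brackets would break it.

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with an angle-bracket pattern (no capturing groups)."""
    raw = ''.join('<' + t + '>' for t in tokens)        # "<a><monied><man>..."
    regexp = re.sub(r'\s', '', pattern)                  # whitespace is ignored
    regexp = regexp.replace('<', '(?:<(?:').replace('>', ')>)')
    regexp = re.sub(r'(?<!\\)\.', '[^>]', regexp)        # "." may not cross a token boundary
    hits = re.findall(regexp, raw)
    return [hit[1:-1].split('><') for hit in hits]

tokens = ['a', 'monied', 'man', 'met', 'a', 'white', 'man', 'and', 'a', 'dog']
phrases = token_findall(tokens, '<a> <.*> <man>')
print(phrases)
```

Replacing `.` with `[^>]` is the key step: it keeps `<.*>` from swallowing several tokens at once.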


Note

Your Turn: Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s) which annotates the string s to show every place where pattern p was matched, and nltk.app.nemo() which provides a graphical interface for exploring regular expressions. For more practice, try some of the exercises on regular expressions at the end of this chapter.

In [44]:
nltk.re_show(r'(.*)(me)+$','forgmeive')

forgmeive


It is easy to build search patterns when the linguistic phenomenon we're studying is tied to particular words. In some cases, a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form x and other ys allows us to discover hypernyms (cf 5):



In [46]:
from nltk.corpus import brown
hobbies_learned= nltk.Text(brown.words(categories=['hobbies','learned']))

In [51]:
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor. However, our search results will usually contain false positives, i.e. cases that we would want to exclude. For example, the result: demands and other factors suggests that demand is an instance of the type factor, but this sentence is actually about wage demands. Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches.
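The same search can be tried on plain strings with capturing groups, which yields (x, y) pairs ready for manual review. The three sentences below are invented for illustration and reuse phrases from the output above; the pattern is one possible formulation.

```python
import re

text = ("The museum displays pearls and other jewels. "
        "Rising wage demands and other factors slowed growth. "
        "They ship iron and other metals by rail.")

# Each hit is a (candidate instance, candidate category) pair;
# the trailing s requires a plural-looking category word.
pairs = re.findall(r'(\w+) and other (\w+s)\b', text)
print(pairs)
```

The second pair is the kind of false positive described above: the pattern has no way of knowing that demands belongs to the phrase wage demands.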

<b>Note</b>:

This combination of automatic and manual processing is the most common way for new corpora to be constructed. We will return to this in 11.

Searching corpora also suffers from the problem of false negatives, i.e. omitting cases that we would want to include. It is risky to conclude that some linguistic phenomenon doesn't exist in a corpus just because we couldn't find any instances of a search pattern. Perhaps we just didn't think carefully enough about suitable patterns.

<b>Note</b>:

Your Turn: Look for instances of the pattern as x as y to discover information about entities and their properties.

In [57]:
hobbies_learned.findall(r" <as> <\w*> <as> <\w*>")

as accurately as possible; as well as the; as faithfully as possible;
as much as what; as neat as a; as simple as you; as well as other; as
well as other; as involved as determining; as well as other; as
important as another; as accurately as possible; as accurate as any;
as much as any; as different as a; as Orphic as that; as coppery as
Delawares; as good as another; as large as small; as well as ease; as
well as their; as well as possible; as straight as possible; as well
as nailed; as smoothly as the; as soon as a; as well as injuries; as
well as many; as well as reason; as well as in; as well as of; as well
as a; as well as summer; as well as providing; as important as
cooling; as evenly as it; as much as shading; as well as some; as well
as subsoil; as high as possible; as well as many; as general as
electrical; as long as the; as well as the; as much as was; as well as
set; as well as by; as high as 15; as well as aid; as much as
possible; as well as personalities; as low as a; 

### 3.6   Normalizing Text



In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g. set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this, and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn. First, we need to define the data we will use in this section:



In [59]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

##### Stemmers

NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.



In [63]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [65]:
print([porter.stem(t) for t in tokens])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']


In [68]:
print([lancaster.stem(t) for t in tokens])

['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']


Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words (illustrated in 3.6, which uses object-oriented programming techniques that are outside the scope of this book, string formatting techniques to be covered in 3.9, and the enumerate() function to be explained in 4.2).

In [69]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [70]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


##### Lemmatization



The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the above stemmers. Notice that it doesn't handle lying, but it converts women to woman.

In [72]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])  # slower than the stemmers above


['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


<mark>The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).</mark>



<b>Note</b>

Another normalization task involves identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.
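The mapping described in this note can be sketched with two regular expressions. This is only an illustration: the function name `normalize_token`, the sample tokens, and the rule that an acronym is any run of two or more capital letters are assumptions, not part of NLTK.

```python
import re

def normalize_token(token):
    """Map decimal numbers to '0.0' and all-caps acronyms to 'AAA'."""
    if re.fullmatch(r'\d+\.\d+', token):
        return '0.0'
    # Assumption: treat any run of two or more capitals as an acronym.
    if re.fullmatch(r'[A-Z]{2,}', token):
        return 'AAA'
    return token

# Hypothetical sample tokens.
tokens = ['NASA', 'paid', '3.75', 'and', 'IBM', 'paid', '12.5', 'in', '1999']
normalized = [normalize_token(t) for t in tokens]
print(normalized)
```

The acronym rule is deliberately crude: it would also rewrite emphasized words written in capitals, so a real application would need a more careful test.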

### 3.7   Regular Expressions for Tokenizing Text



Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.



##### Simple Approaches to Tokenization



The simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice's Adventures in Wonderland:



In [74]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""


We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any space characters in the string [1] since this results in tokens that contain a \n newline character; instead we need to match any number of spaces, tabs, or newlines [2]:



In [77]:
print(re.split(r' ', raw))  # [1]

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]


In [78]:
print(re.split(r'[ \t\n]+', raw))  # [2]

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]


The regular expression «[ \t\n]+» matches one or more space, tab (\t) or newline (\n). Other whitespace characters, such as carriage-return and form-feed should really be included too. Instead, we will use a built-in re abbreviation, \s, which means any whitespace character. The above statement can be rewritten as re.split(r'\s+', raw).



In [79]:
print(re.split(r'\s+',raw))

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
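The Alice text contains only spaces and newlines, so the two patterns give the same result here. A small sketch with a made-up string (the variable `sample` is hypothetical) shows where they differ: a Windows-style line ending and a form-feed are only handled by \s.

```python
import re

# Hypothetical string with a carriage return + newline and a form-feed.
sample = "one two\tthree\r\nfour\x0cfive"

# The explicit class misses \r and \x0c, so they stay inside tokens.
explicit = re.split(r'[ \t\n]+', sample)
print(explicit)

# \s covers every whitespace character, including \r and form-feed.
general = re.split(r'\s+', sample)
print(general)
```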



<b>Note</b>

<b>Important: </b>Remember to prefix regular expressions with the letter r (meaning "raw"), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.
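The effect of the r prefix is easiest to see with \b, which means "word boundary" to the re module but "backspace" to the Python string parser. The sample string below is made up for illustration.

```python
import re

# Without the prefix, '\b' is a single backspace character;
# with it, r'\b' is a backslash followed by the letter b.
plain_length = len('\b')
raw_length = len(r'\b')
print(plain_length, raw_length)

sample = 'the other theme'

# Raw string: \b reaches the re module and matches word boundaries.
with_prefix = re.findall(r'\bthe\b', sample)
print(with_prefix)

# Plain string: the pattern looks for literal backspaces around "the".
without_prefix = re.findall('\bthe\b', sample)
print(without_prefix)
```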


Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_] for ASCII text (in Python 3 it also matches letters and digits from other scripts). It also defines the complement of this class \W, i.e. all characters other than letters, digits or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:



In [80]:
print(re.split(r'\W+',raw))

['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']


In [81]:
'xx'.split('x')

['', '', '']

Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). We get the same tokens, but without the empty strings, with re.findall(r'\w+', raw), using a pattern that matches the words instead of the spaces. Now that we're matching the words, we're in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.