# 3   Processing Raw Text

In [1]:
!pip install --upgrade pip

Requirement already up-to-date: pip in c:\users\pc\anaconda3\lib\site-packages


In [2]:
!pip install nltk




In [3]:
import nltk

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.

The goal of this chapter is to answer the following questions:

- How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?

- How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?

- How can we write programs to produce formatted output and save it in a file?

In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the web is in HTML format, we will also see how to dispense with markup.


<b>Important</b>: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:

In [4]:
#from __future__ import division  # Python 2 users only
import nltk, re, pprint
from nltk import word_tokenize

# 3.1   Accessing Text from the Web and from Disk



#### Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 texts each).

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.

In [5]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw),len(raw),raw[:75]



(str,
 1176896,
 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n')

The variable raw contains a string with 1,176,896 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return and line feed characters (the file must have been created on a Windows machine). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.
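The decoding step and the \r\n line endings can be seen on a small scale without downloading anything. The sketch below uses a hypothetical byte string standing in for response.read(), and a simple regular expression as a rough stand-in for tokenization. It is not NLTK's word_tokenize, and it will differ from it on contractions and abbreviations.

```python
import re

# Hypothetical stand-in for response.read(): bytes with Windows line endings.
data = b"Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nPART I\r\n"

raw = data.decode("utf8")            # bytes -> str
print(type(raw).__name__, len(raw))
print(repr(raw[:25]))

# Normalise Windows line endings so later steps only see '\n'.
clean = raw.replace("\r\n", "\n")

# Rough tokenization: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", clean)
print(tokens)
```

Normalising the line endings first is optional here, since the regular expression skips all whitespace, but it keeps later line-based processing predictable.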



In [6]:
nltk.download()  # opens the NLTK downloader; choose the packages to install (selecting "all" installs everything)

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
tokens = word_tokenize(raw)



In [8]:
print(type(tokens))
print(len(tokens))
tokens[:10]

<class 'list'>
254352


['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations like slicing:



In [9]:
text=nltk.Text(tokens)
type(text)

nltk.text.Text

In [10]:
print(text[1024:1062])

['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.']


In [11]:
text.collocations()  # collocations are multi-word expressions that commonly co-occur; the method prints its results itself

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market


Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:



In [12]:
raw.find("PART I")

5338

In [13]:
raw.rfind("End of Project Gutenberg's Crime")

1157746

In [14]:
raw = raw[5338:1157746] #[1]
raw.find("PART I")

0

The find() and rfind() ("reverse find") methods help us get the right index values to use for slicing the string [1]. We overwrite raw with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.
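Hard-coded index values only work for one particular download of the file. As a minimal sketch of the same idea, the helper below looks up both markers each time and slices between them. The function name trim_between and the miniature sample text are hypothetical, introduced only for illustration.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the part of raw from start_marker up to (not including) end_marker."""
    start = raw.find(start_marker)    # first occurrence, or -1 if absent
    end = raw.rfind(end_marker)       # last occurrence, or -1 if absent
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]

# Hypothetical miniature "book" with a header and a footer.
sample = (
    "The Project Gutenberg EBook of Example\n"
    "PART I\nOn an exceptionally hot evening...\n"
    "End of Project Gutenberg's Example\n"
)

content = trim_between(sample, "PART I", "End of Project Gutenberg's")
print(repr(content))
```

Raising an error when a marker is missing avoids silently slicing with -1, which would otherwise drop the last character or return an empty string.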

This was our first brush with the reality of the web: texts found on the web may contain unwanted material, and there may not be an automatic way to remove it. But with a small amount of extra work we can extract the material we need.

#### Dealing with HTML



Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the section on files below. However, if you're going to do this often, it's easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called Blondes to die out in 200 years, an urban legend passed along by the BBC as established scientific fact:



In [15]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

You can type print(html) to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.



In [16]:
print(html)

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>
<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">
<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">
<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">
<meta name="IFS_URL" content="/2/hi/health/2284783.stm">
<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">
<meta name="Headline" content="Blondes 'to die out in 200 years'">
<meta name="Section" content="Health">
<meta name="Description" content="Natural blondes are an endangered species and will die out by 2202, a study suggests.">
<!-- GENMaps-->
<map name="banner">
<area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT">
</map>

<script src="/nol/shared/js/livestats_v1_1.js" langua

To get text out of HTML we will use a Python library called BeautifulSoup, available from http://www.crummy.com/software/BeautifulSoup/:



In [17]:
from bs4 import BeautifulSoup

In [18]:
raw = BeautifulSoup(html, "html.parser").get_text()  # name the parser explicitly to avoid a bs4 warning
tokens = word_tokenize(raw)
tokens




['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'CATEGORIES',
 'TV',
 'RADIO',
 'COMMUNICATE',
 'WHERE',
 'I',
 'LIVE',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'World',
 'UK',
 'England',
 'N',
 'Ireland',
 'Scotland',
 'Wales',
 'Politics',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 'Education',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'World',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 '

This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.



In [19]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
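The trial and error needed to find the start and end indexes can be reduced by searching the token list for a short phrase known to open or close the content. The following is a sketch under that assumption. The helper find_phrase and the miniature token list are hypothetical and are not taken from the BBC page.

```python
def find_phrase(tokens, phrase):
    """Return the index where the token sequence phrase first occurs, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1

# Hypothetical token list: navigation text, then the article, then related links.
tokens = ['Help', 'EDITIONS', 'Blondes', "'to", 'die', 'out', "'", 'The', 'last',
          'natural', 'blondes', 'will', 'die', 'out', '.', 'See', 'also', ':', 'Links']

start = find_phrase(tokens, ['Blondes', "'to"])   # first tokens of the content
end = find_phrase(tokens, ['See', 'also'])        # first tokens after the content
article = tokens[start:end]
print(start, end)
print(article)
```

The marker phrases still have to be chosen by inspecting the page, so this only replaces counting positions by hand, not the inspection itself.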


#### Processing Search Engine Results



The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very specific patterns, which would only match one or two examples on a smaller corpus, but which might match tens of thousands of examples when run on the web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very convenient tool for quickly checking a theory, to see if it is reasonable.



![](table3.JPG)

Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent results, and can give widely different figures when used at different times or in different geographical regions. When content has been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use of search engine APIs).
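To illustrate the kind of pattern that a local corpus supports but a search box generally does not, the sketch below uses Python's re module to find pairs of words joined by "and other". The three-sentence corpus is hypothetical. In practice the text would come from files or from the web, as described earlier in this section.

```python
import re

# Hypothetical miniature corpus; in practice this would be text read from files.
corpus = (
    "We bought apples and other fruits at the market. "
    "She keeps cats and other animals. "
    "He plays chess and cards."
)

# "X and other Ys": a word, the fixed phrase, then another word.
pattern = re.compile(r"\b(\w+) and other (\w+)\b")

for match in pattern.finditer(corpus):
    print(match.group(1), "->", match.group(2))
```

Each match captures both words, so the results can be counted or filtered by a program, which is not possible with a list of search engine hits.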



<b>Your Turn</b>: Search the web for "the of" (inside quotes). Based on the large count, can we conclude that the of is a frequent collocation in English?


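Answer: not necessarily. The count for "the of" is large, but a raw hit count does not show that the two words form a collocation: search engines may match the words across punctuation or ignore the quotes, and, as noted above, their counts are inconsistent and can be inflated by duplicated content.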


#### Processing RSS Feeds



The blogosphere is an important source of text, in both formal and informal registers. With the help of a Python library called the Universal Feed Parser, available from  https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:



In [21]:
!pip install feedparser



In [22]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
print(llog['feed']['title'])


Language Log


In [25]:
print(len(llog.entries))
post = llog.entries[2]
print(post.title)
content = post.content[0].value
print(content[:70])


13
Bad Chinese
<p>Sign south of the demolished Pfeiffer Bridge on Highway 1 in Monter


With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.



In [26]:
raw = BeautifulSoup(content, "html.parser").get_text()  # name the parser explicitly to avoid a bs4 warning
print(word_tokenize(raw))

['Sign', 'south', 'of', 'the', 'demolished', 'Pfeiffer', 'Bridge', 'on', 'Highway', '1', 'in', 'Monterey', 'County', '(', 'photograph', 'taken', 'on', 'August', '12', ',', '2017', 'by', 'Richard', 'Masoner', 'while', 'on', 'a', 'Big', 'Sur', 'bike', 'trip', ',', 'via', 'Flickr', ')', ':', 'This', 'is', 'not', 'Chinglish', '.', 'It', 'is', 'the', 'opposite', 'of', 'Chinglish', ':', 'English', 'poorly', 'translated', 'into', 'Chinese', '.', 'The', 'sign', 'says', ':', 'Zhǔdòng', 'gōnglù', 'bùyào', 'zǒu', 'zài', 'zhōngjiān', 'de', 'lùxiàn', 'bǎochí', 'bái', 'xiàn', 'de', 'quánlì', '主动公路不要走在中间的路线保持白线的权利', 'It', "'s", 'difficult', 'for', 'me', 'to', 'make', 'sense', 'of', 'this', 'sign', '.', 'Chinese', 'friends', 'to', 'whom', 'I', 'show', 'this', 'sign', 'are', 'also', 'totally', 'confused', 'by', 'it', '.', 'Forced', 'translation', 'into', 'English', ':', "''", 'Active', 'highway', '.', 'Do', "n't", 'walk', '/', 'ride', 'in', 'the', 'center', 'line', '/', 'lane', '.', 'Keep', '/', 'maint




#### Reading Local Files



In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. If you have a file called document.txt, you can load its contents like this:



In [27]:
f = open('document.txt')
raw = f.read()

To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:



In [32]:
import os 
os.listdir('.')  # document is here :)

['.ipynb_checkpoints',
 '1.Language_Processing_and_Python.ipynb',
 '2_Accessing_Text_Corpora_and_Lexical_Resources.ipynb',
 '3_Processing_Raw_Text.ipynb',
 'accusatif....JPG',
 'branches_phonétique.JPG',
 'Categorizing_and_Tagging_Words.ipynb',
 'document.txt',
 'morohologie.JPG',
 'phonologie.JPG',
 'phonétique.JPG',
 'phonétique_VS_phonologie.JPG',
 'semantique.JPG',
 'table3.JPG',
 'text-mining-pres.pdf',
 'tp-nltk-ss.pdf']

Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for different operating systems. The built-in open() function has a second parameter for controlling how the file is opened: open('document.txt', 'r'), where 'r' means to open the file for reading (the default). In Python 3, files opened in text mode use universal newlines by default, so the different conventions used for marking newlines are all read as '\n'. The older 'rU' mode ('U' for "Universal") did the same job in Python 2; it is deprecated in Python 3 and was removed in Python 3.11.



Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:



In [37]:
f = open('document.txt', 'r')
raw = f.read()
raw


'Hello,\nMy name is Riahi LOURIZ \nI am new to this field of text mining\nthese are some notebooks that can help you to learn NLP.'

Recall that the '\n' characters are <b>newlines</b>; this is equivalent to pressing Enter on a keyboard and starting a new line.
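The effect of the newline conventions can be checked directly. The sketch below writes a small file with Windows-style line endings and reads it back twice: once in the default text mode, and once with newline='' so that the original endings are kept. The file name windows.txt and its contents are hypothetical.

```python
import os
import tempfile

# Write a small file with Windows-style line endings, in binary mode
# so that the bytes are stored exactly as given.
path = os.path.join(tempfile.mkdtemp(), "windows.txt")
with open(path, "wb") as f:
    f.write(b"Hello,\r\nworld\r\n")

# Default text mode: line endings are translated to '\n' on reading.
with open(path, encoding="utf8") as f:
    translated = f.read()

# newline='' switches the translation off and keeps the original endings.
with open(path, encoding="utf8", newline="") as f:
    untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

For most text processing the default is what you want, since later steps then only have to deal with '\n'.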



We can also read a file one line at a time using a for loop:



In [41]:
f = open('document.txt', 'r')
for line in f:
    print(line.strip())
    


Hello,
My name is Riahi LOURIZ
I am new to this field of text mining
these are some notebooks that can help you to learn NLP.


Here we use the strip() method to remove the newline character at the end of the input line.
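Note that strip() removes all leading and trailing whitespace, not just the newline. The sketch below contrasts it with rstrip('\n'), which removes only the trailing newline, and opens the file in a with statement so that it is closed automatically. The file contents are hypothetical and are written to a temporary directory.

```python
import os
import tempfile

# Hypothetical file with indentation on the second line.
path = os.path.join(tempfile.mkdtemp(), "document.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("Hello,\n    indented line\n")

# The with statement closes the file automatically when the block ends.
with open(path, encoding="utf8") as f:
    stripped = [line.strip() for line in f]        # drops all surrounding whitespace

with open(path, encoding="utf8") as f:
    trimmed = [line.rstrip("\n") for line in f]    # drops only the trailing newline

print(stripped)
print(trimmed)
```

Which one to use depends on whether indentation carries meaning in the file being read.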



NLTK's corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open and read it in the way we just demonstrated above:



In [42]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')

In [43]:
raw = open(path).read()  # text mode already handles the different newline conventions


#### Extracting Text from PDF, MSWord and other Binary Formats



ASCII text and HTML text are human-readable formats. Text often comes in binary formats, such as PDF and MSWord, that can only be opened using specialized software. Third-party libraries such as <i>pypdf</i> and <i>pywin32</i> provide access to these formats. Extracting text from multi-column documents is particularly challenging. For one-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described above. If the document is already on the web, you can enter its URL in Google's search box. The search result often includes a link to an HTML version of the document, which you can save as text.



In [47]:
!pip install PyPDF2



In [48]:
import PyPDF2

In [68]:
pdfFileObj = open('LDM.pdf', 'rb')

In [69]:
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

In [70]:
pdfReader.numPages

1

In [71]:
pageObj = pdfReader.getPage(0)

In [77]:
pageObj.extractText()

"le8août2017\nRiahiLOURIZ\n+60142624708\nriahi.louriz@telecom-bretagne.eu\nIFREMER\nPointeduDiable,\n29280Plouzané\nMadame,Monsieur,\nC'estavecbeaucoupd'enthousiasmeetdemotivationquejevousadressemacandidaturepourrejoindre\nvotreentrepriseentantqueDataScientist.\nL'o˙requevousproposezcorrespondparfaitementàmesobjectifsenraisondel'intérêtparticulierqueje\nporteauxstatistiquesetàl'analysededonnées;denatureobservateur,perspicaceetdotéd'unefortecapacité\nd'analyseetd'anticipation,j'aipudévelopperungoûtprononcépourcedomaineàtraversplusieursmissions\ne˙ectuéestoutaulongdemonparcoursprofessionnel,notammentaucoursd'uncerti˝caten\nMachineLearning\nquej'aiobtenusurlescoursenlignedecoursera.org(Stanforduniversity,ProfessorAndrewNg)etquim'a\npermisdemefamiliariseraveclesoutilsetméthodesclassiquesd'analysededonnées(réseauxdeneurones,SVM,\nMapReduce,LogisticRegression,AnomalyDetection,...).J'aiégalementobtenuuncerti˝catnommé\nTheData\nScientist'sToolbox\n(coursera.org,JohnsHopkinsuniversity)cequim'ap

In [78]:
type(pageObj)

PyPDF2.pdf.PageObject

In [80]:
!pip install python-docx  # maintained package that provides the docx module for reading .docx files


#### Capturing User Input



Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a line of input, call the Python function input(). After saving the input to a variable, we can manipulate it just as we have done for other strings.



In [83]:
s = input("Enter some text: ")


Enter some text: Hello, my name is Riahi LOURIZ


In [84]:
print("You typed", len(word_tokenize(s)), "words.")

You typed 7 words.


In [86]:
word_tokenize(s)

['Hello', ',', 'my', 'name', 'is', 'Riahi', 'LOURIZ']
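The count of 7 includes the comma, because the tokenizer treats punctuation as separate tokens. As a minimal sketch, the token list from the output above can be filtered to count only alphabetic tokens. This filter is an assumption about what counts as a word: it would also drop tokens such as numbers or "n't".

```python
# Token list copied from the output above.
tokens = ['Hello', ',', 'my', 'name', 'is', 'Riahi', 'LOURIZ']

# Keep only tokens made entirely of letters, so punctuation is not counted.
words = [t for t in tokens if t.isalpha()]

print("Tokens:", len(tokens))
print("Words:", len(words))
```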

#### The NLP Pipeline



Figure 3.1 summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter 1. (One step, normalization, will be discussed in Section 3.6.)




![](pipeline.JPG)

---
##### Example:
---

In [92]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"  # url of the content
html = request.urlopen(url).read().decode('utf8')  # open url and read it
raw = BeautifulSoup(html, "html.parser").get_text()  # strip markup with bs4, naming the parser explicitly
tokens = nltk.wordpunct_tokenize(raw) 
len(tokens)
text=nltk.Text(tokens)
words= [w.lower() for w in text]
vocab= sorted(set(words))
vocab




['"',
 '","").',
 '";',
 "'",
 '\'"><',
 "')!=-",
 "'+",
 "'<",
 "'~",
 '(',
 '(!',
 '(".',
 '("<',
 "('",
 "('<",
 '(/\\',
 ')',
 ')+"...";',
 ');',
 '){',
 ')~',
 '+',
 '+"";',
 '+\'"',
 '+\'">\'+\'<',
 "+'&",
 "+'</",
 ',',
 ',"',
 '-',
 '-------------',
 '----------------------------------------------------------------------------------',
 '.',
 '."',
 '/',
 '/"\'+\'',
 '/))',
 '//-->',
 '0',
 '01',
 '02',
 '08',
 '09',
 '1',
 '11',
 '12',
 '17',
 '2',
 '200',
 '2002',
 '2202',
 '2284783',
 '252',
 '27',
 '28',
 '51',
 '99',
 ':',
 '://',
 ';',
 ';}',
 '<!--',
 '=',
 '="',
 '="\'+',
 "='+",
 "='<",
 '>");',
 ">');",
 ">'+",
 ">';",
 '></',
 '>=',
 '>>',
 '?',
 '?~',
 '\\',
 '^^',
 'a',
 'abductees',
 'about',
 'africa',
 'aid',
 'alert',
 'alien',
 'also',
 'alzheimer',
 'americas',
 'an',
 'and',
 'ann',
 'applet',
 'apr',
 'are',
 'as',
 'asia',
 'at',
 'attractive',
 'babies',
 'back',
 'bbc',
 'be',
 'become',
 'believe',
 'beyond',
 'big',
 'bin',
 'blame',
 'blonde',
 'blonde
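The vocabulary above contains fragments of JavaScript, because script content survived the markup-stripping step in this run. One way to keep such material out is to skip script and style elements while extracting text. The sketch below does this with the standard library's html.parser module rather than BeautifulSoup. The class name TextExtractor, the function visible_text and the page fragment are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside script or style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())   # collapse runs of whitespace

# Hypothetical page fragment.
page = """<html><head><title>Blondes</title>
<script>var x = '<b>'+"";</script></head>
<body><p>Natural blondes are <b>not</b> dying out.</p></body></html>"""

print(visible_text(page))
```

The result of visible_text() is an ordinary string, so it can be passed to a tokenizer in the same way as raw in the pipeline above.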

There's a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x), e.g. type(1) is int since 1 is an integer.

When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, Python's str data type. (We will learn more about strings in 3.2):

In [93]:
raw = open('document.txt').read()
type(raw)

str

When we tokenize a string we produce a list (of words), and this is Python's list type. Normalizing and sorting lists produces other lists:



In [94]:
tokens = word_tokenize(raw)
type(tokens)
words = [w.lower() for w in tokens]
type(words)
vocab = sorted(set(words))
type(vocab)

list
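The cell above displays only the type of its last expression. The sketch below prints the type at every stage, using a hypothetical token list in place of the tokenizer output. It also makes visible one intermediate step: set(words) is a set, and it is sorted() that turns it back into a list.

```python
# Hypothetical token list standing in for the output of a tokenizer.
tokens = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']

words = [w.lower() for w in tokens]   # list comprehension -> list
unique = set(words)                   # removes duplicates -> set (unordered)
vocab = sorted(unique)                # sorted() always returns a list

for name, value in [('tokens', tokens), ('words', words),
                    ('unique', unique), ('vocab', vocab)]:
    print(name, type(value).__name__)

print(vocab)
```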

The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:



In [95]:
vocab.append('blog')


In [96]:
#raw.append('blog')  # uncommenting this line raises the error shown below

AttributeError: 'str' object has no attribute 'append'
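The difference can be reproduced safely by catching the error. The sketch below uses small hypothetical values for vocab and raw, and shows that adding text to a string is done with concatenation, which builds a new string instead of changing the old one.

```python
vocab = ['a', 'about']
raw = 'Hello'

vocab.append('blog')        # lists are mutable: this changes vocab in place

try:
    raw.append('blog')      # strings have no append() method
except AttributeError as err:
    message = str(err)
    print("AttributeError:", message)

raw = raw + ' blog'         # concatenation builds a new string instead
print(vocab, repr(raw))
```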

# 3.2   Strings: Text Processing at the Lowest Level

