---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

### What can be done with text?
#### Parse text
#### Find/Identify/Extract relevant information from text (Topic, Sentiment)
#### Classify text documents (multilabel)
#### Search for relevant text documents.
#### Sentiment analysis
#### Topic Modeling

## Primitive Constructs in Text
#### Sentences / input strings
#### Words or Tokens (which form sentences)
#### Characters (which form words)
#### Document, larger files

## Finding Specific Words
#### Long words: Words that are more than 3 letters long
#### Capitalized words
#### Words that end with a specific letter

In [4]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations"
## UN Tweet
len(text1) # The length of text1

75

In [5]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

13

In [6]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations']

<br>
List comprehension allows us to find specific words:

In [9]:
## Find long words
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [10]:
## Find Capitalized words
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [11]:
## Find words that end with the letter s
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [12]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [13]:
len(set(text4))   ## set() returns the unique words; it is case-sensitive, so 'To' and 'to' count as different words

5

In [14]:
set(text4)  ## 'To' and 'to' both appear because they differ only in case, so this needs to be fixed

{'To', 'be', 'not', 'or', 'to'}

In [15]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [16]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### Some word comparison functions
#### word.startswith(t)
#### word.endswith(t)
#### t in word
#### word.isupper(); word.islower(); word.istitle(); word.upper(); word.lower(); word.title()
#### word.isalpha() -- Check letters; word.isdigit() -- Check numbers; word.isalnum() -- Check combination of letter and number
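
The comparison methods listed above can be tried on a handful of tokens. This is a minimal sketch: the `tokens` list is a hypothetical sample chosen only to exercise each method, not data from the course.

```python
# Hypothetical sample tokens, chosen only to exercise each method.
tokens = ['Ethics', 'UN', 'ideals', '2016', 'bit.ly/2guVelr', 'abc123']

starts_with_e = [w for w in tokens if w.startswith('E')]  # prefix test
contains_dot = [w for w in tokens if '.' in w]            # substring test
title_cased = [w for w in tokens if w.istitle()]          # first letter upper, rest lower
all_upper = [w for w in tokens if w.isupper()]            # every cased character is uppercase
letters_only = [w for w in tokens if w.isalpha()]         # letters only
digits_only = [w for w in tokens if w.isdigit()]          # digits only
alnum_only = [w for w in tokens if w.isalnum()]           # letters and/or digits, nothing else

print(starts_with_e)  # ['Ethics']
print(contains_dot)   # ['bit.ly/2guVelr']
print(title_cased)    # ['Ethics']
print(all_upper)      # ['UN']
print(letters_only)   # ['Ethics', 'UN', 'ideals']
print(digits_only)    # ['2016']
print(alnum_only)     # ['Ethics', 'UN', 'ideals', '2016', 'abc123']
```

Note that `'UN'.istitle()` is `False`, because a title-cased word may have only its first letter in uppercase.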

### String Operations
#### word.lower(); word.upper(); word.title()
#### string.split(t)
#### string.splitlines()   eg: 'a thing \n two things'.splitlines()   Out[27]: ['a thing ', ' two things']
#### s.join(t)   -- Join the strings in t (a list or set of words), using the string s as the separator
eg. 'Herry'.join(['', ' Miller'])

Out[41]: 'Herry Miller'
#### string.strip() -- Remove all whitespace from the beginning and the end of the string
#### string.rstrip() -- Remove trailing (right-hand) whitespace   string.lstrip() -- Remove leading (left-hand) whitespace
#### string.find(t) -- Search for the substring t from the left and return the index of its first occurrence (-1 if not found)
#### string.rfind(t) -- Search for the substring t from the right and return the index of its last occurrence (-1 if not found)
#### string.replace(u, v) -- Return a copy of string with every occurrence of u replaced by v
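
The string operations above can be chained on one small input. This is only a sketch; the `raw` string is a made-up example.

```python
# Made-up input with leading/trailing spaces and an embedded newline.
raw = '  a thing\ntwo things  '

stripped = raw.strip()          # whitespace removed from both ends
left = raw.lstrip()             # only leading whitespace removed
right = raw.rstrip()            # only trailing whitespace removed
lines = stripped.splitlines()   # split at line breaks
rejoined = ' / '.join(lines)    # glue the pieces back with a separator

first_t = rejoined.find('t')    # index of the first 't'
last_t = rejoined.rfind('t')    # index of the last 't'
missing = rejoined.find('z')    # -1 signals "not found"
replaced = rejoined.replace('thing', 'item')

print(lines)                    # ['a thing', 'two things']
print(rejoined)                 # a thing / two things
print(first_t, last_t, missing) # 2 14 -1
print(replaced)                 # a item / two items
```

All of these methods return new strings (or lists); the original `raw` is never modified, since Python strings are immutable.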

In [26]:
text5 = 'ouagadougou'
text6 = text5.split('ou')
text6  ## there is nothing before the first 'ou' or after the last 'ou', so the list starts and ends with ''

['', 'agad', 'g', '']

In [19]:
'ou'.join(text6)  ## .join() will get back the original word

'ouagadougou'

In [23]:
## If we want to find all the characters in a word
text5.split('')   ## This will raise an error

ValueError: empty separator

In [24]:
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

In [27]:
[ c for c in text5]   ## another way to deal with this

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

### Cleaning Text

In [29]:
text8 = '    A quick brown fox jumped over the lazy dog'
text8.split(' ')  ## This will not give us what we want

['',
 '',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog']

In [36]:
text9 = text8.lstrip(' ')
print(text9)
text9.split()

A quick brown fox jumped over the lazy dog


['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

In [37]:
text8.split()   ## or just use this

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
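
The difference between `split(' ')` and `split()` shows up most clearly on input that mixes spaces, tabs and newlines. A small sketch, using a made-up `messy` string:

```python
# Made-up input with leading spaces, a tab, repeated spaces and a trailing newline.
messy = '  A quick\tbrown   fox\n'

# split(' ') cuts at every single space, so runs of spaces produce empty strings
# and other whitespace (tab, newline) stays inside the tokens.
on_space = messy.split(' ')

# split() with no argument cuts at runs of any whitespace and drops empty strings.
on_whitespace = messy.split()

print(on_space)       # ['', '', 'A', 'quick\tbrown', '', '', 'fox\n']
print(on_whitespace)  # ['A', 'quick', 'brown', 'fox']
```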

### Changing Text

#### Find and Replace

In [40]:
text9.find('o')   ## Indexing starts at 0, and whitespace counts as a character

10

In [41]:
text9.rfind('o')    ## find the last 'o'

40

In [42]:
text9.replace('o', 'O')

'A quick brOwn fOx jumped Over the lazy dOg'

### Handling Larger Texts

#### Reading files line by line

In [73]:
f = open('log.txt', 'r')  ## 'r' means read mode
f.readline()

'time,user,video,playback position,paused,volume\n'

#### Reading the Full File

In [71]:
f.seek(0)
text12 = f.read(1142)  ## read(n) reads at most n characters; f.read() with no argument reads the whole file
print(text12)
print(len(text12))
text13 = text12.splitlines()
print(text13)
print(len(text13))

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,FALSE,10
1469974454,cheryl,intro.html,6,,
1469974544,cheryl,intro.html,9,,
1469974574,cheryl,intro.html,10,,
1469977514,bob,intro.html,1,,
1469977544,bob,intro.html,1,,
1469977574,bob,intro.html,1,,
1469977604,bob,intro.html,1,,
1469974604,cheryl,intro.html,11,,
1469974694,cheryl,intro.html,14,,
1469974724,cheryl,intro.html,15,,
1469974454,sue,advanced.html,24,,
1469974524,sue,advanced.html,25,,
1469974424,sue,advanced.html,23,FALSE,10
1469974554,sue,advanced.html,26,,
1469974624,sue,advanced.html,27,,
1469974654,sue,advanced.html,28,,5
1469974724,sue,advanced.html,29,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974754,sue,advanced.html,30,,
1469974824,sue,advanced.html,31,,
1469974854,sue,advanced.html,32,,
1469974924,sue,advanced.html,33,,
1469977424,bob,intro.html,1,TRUE,10
1469977454,bob,intro.html,1,,
1469977484,bob,intro.html,1,,
1469977634,bob,intro.html,1,,
1469977664,bob,i

In [67]:
text13[0]   ## first line; the '\n' is missing because splitlines() removes the line breaks

'time,user,video,playback position,paused,volume'

### File Operations
#### f = open(filename, mode)  mode can be 'r' -- 'read' or 'w' -- 'write'
#### f.readline(); f.read(); f.read(n)  -- read one line; read the entire file; read n characters rather than the entire file
#### for line in f:   doSomething(line)
#### f.seek(n) -- Move the reading position to offset n (f.seek(0) resets it to the start of the file)
#### f.write(message) -- write a particular message into a file in the write mode
#### f.close() -- close the file
#### f.closed   -- check if the file is closed or not
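
The file operations above can be exercised end to end without the course's `log.txt`. This sketch writes a small temporary file first; the rows and the file name `sample_log.txt` are hypothetical stand-ins, and the `with` statement is used so that the file is closed automatically.

```python
import os
import tempfile

# Hypothetical rows shaped like the course log; not the real log.txt.
rows = ['time,user,video\n',
        '1469974424,cheryl,intro.html\n',
        '1469974454,bob,intro.html\n']

path = os.path.join(tempfile.mkdtemp(), 'sample_log.txt')

# 'w' mode: create (or overwrite) the file and write each row.
with open(path, 'w') as f:
    for row in rows:
        f.write(row)

# 'r' mode: read it back in several ways.
with open(path, 'r') as f:
    header = f.readline()      # one line, including its '\n'
    first_five = f.read(5)     # the next 5 characters only
    f.seek(0)                  # jump back to the start of the file
    lines = [line.rstrip('\n') for line in f]  # iterate line by line

is_closed = f.closed           # True: the with-block closed the file

print(repr(header))  # 'time,user,video\n'
print(first_five)    # 14699
print(lines)
print(is_closed)     # True
```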

### Issues with reading text files

In [78]:
f = open('log.txt', 'r')
text14 = f.readline()
text14

'time,user,video,playback position,paused,volume\n'

### How do you remove the last newline character?

In [80]:
text14.rstrip()
## Also works for DOS newlines (^M), which show up as '\r' or '\r\n'; note that rstrip() removes all trailing whitespace, not only the newline

'time,user,video,playback position,paused,volume'
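
One detail worth checking is what `rstrip()` removes besides the newline. The sketch below uses made-up lines to compare `rstrip()` with `rstrip('\r\n')`, which strips only carriage returns and newlines.

```python
# Made-up lines with Unix, DOS and space-padded endings.
unix_line = 'time,user\n'
dos_line = 'time,user\r\n'
padded_line = 'time,user  \n'

# rstrip() with no argument removes every trailing whitespace character.
print(repr(unix_line.rstrip()))    # 'time,user'
print(repr(dos_line.rstrip()))     # 'time,user'
print(repr(padded_line.rstrip()))  # 'time,user'  (the trailing spaces are gone too)

# rstrip('\r\n') removes only trailing '\r' and '\n' characters.
keep_spaces = padded_line.rstrip('\r\n')
print(repr(keep_spaces))           # 'time,user  '
```

If trailing spaces are meaningful in the data, the second form may be the safer choice.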

### Take Home Concepts
#### Handling text sentences
#### Splitting sentences into words, words into characters
#### Finding unique words
#### Handling text from documents

### Processing free-text

In [44]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hashtags:

In [45]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [46]:
[w for w in text6 if w.startswith('@')]   ## This only finds the bare '@', not a callout, so we need another approach

['@']

In [47]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example `'@[A-Za-z0-9_]+'` matches an `'@'` followed by one or more characters, each of which is a:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* number (`'0-9'`)
* or underscore (`'_'`)

In [81]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[a-zA-Z0-9_]+', w)]

['@UN', '@UN_Women']

In [84]:
## The same
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

In [None]:
## Parsing the callout regular expression
## @[A-Za-z0-9_]+  breaks down into  @   [A-Za-z0-9_]   +
## Starts with @
## Followed by any letter (upper or lower case), digit, or underscore
## That repeats at least once but any number of times
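
Two follow-ups to the callout pattern, as a sketch. First, `re.findall` can be applied to the whole tweet without splitting it. Second, `re.search` looks for the pattern anywhere in a token, while `re.match` only accepts it at the start; the `tokens` list below is a made-up example to show the difference.

```python
import re

tweet = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# Pull callouts and hashtags straight out of the unsplit string.
callouts = re.findall(r'@[A-Za-z0-9_]+', tweet)
hashtags = re.findall(r'#[A-Za-z0-9_]+', tweet)

# Made-up tokens: a callout, an email-like token, and a bare '@'.
tokens = ['@UN', 'me@example.com', '@']
found_anywhere = [w for w in tokens if re.search(r'@[A-Za-z0-9_]+', w)]
found_at_start = [w for w in tokens if re.match(r'@[A-Za-z0-9_]+', w)]

print(callouts)        # ['@UN', '@UN_Women']
print(hashtags)        # ['#UNSG']
print(found_anywhere)  # ['@UN', 'me@example.com']
print(found_at_start)  # ['@UN']
```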


### Meta-characters: Character matches
#### .: wildcard, matches any single character (except a newline, by default)
#### ^: start of a string
#### \$: end of a string
#### [ ]: matches one of the set of characters within [ ]
#### [a-z]: matches one of the range of characters a, b, ..., z

#### [^abc]: matches a character that is not a, b, or c
#### a|b: matches either a or b, where a and b are strings
#### (): Scoping for operators
#### \: Escaping character for special characters (\t, \n, \b)
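
The meta-characters above can be compared side by side on a small word list. This is a sketch; the `words` list is a made-up example.

```python
import re

# Made-up word list for comparing patterns.
words = ['cat', 'cot', 'coat', 'cut', 'scat']

def matching(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

print(matching(r'c.t'))          # ['cat', 'cot', 'cut', 'scat']  . is any one character
print(matching(r'^c.t$'))        # ['cat', 'cot', 'cut']          ^ and $ anchor both ends
print(matching(r'^c[^a]t$'))     # ['cot', 'cut']                 [^a] is anything but 'a'
print(matching(r'^(cat|cut)$'))  # ['cat', 'cut']                 | chooses, () scopes it

# A backslash makes a special character literal: \. matches only a real dot.
print(re.findall(r'\.', 'bit.ly/2guVelr'))  # ['.']
print(re.findall(r'.', 'a.b'))              # ['a', '.', 'b']
```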

### Meta-Characters: Character Symbols
#### \b: Matches word boundary
#### \d: Any digit, equivalent to [0-9]
#### \D: Any non-digit, equivalent to [^0-9]
#### \s: Any whitespace, equivalent to [ \t\n\r\f\v]
#### \S: Any non-whitespace, equivalent to [^ \t\n\r\f\v]
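#### \w: Word character (letter, digit, or underscore), equivalent to [a-zA-Z0-9_]
#### \W: Non-word character, equivalent to [^a-zA-Z0-9_]

The equivalences above hold for ASCII text; with Python 3 `str` patterns, these classes also match Unicode digits, whitespace, and letters unless the `re.ASCII` flag is set.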
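
A short sketch of the character symbols in use. The sample strings are made up, and the last two lines illustrate how `\w` treats a non-ASCII letter with and without the `re.ASCII` flag.

```python
import re

# Made-up sample sentence.
sample = 'Room B12 opens at 9am'

numbers = re.findall(r'\d+', sample)          # runs of digits
words = re.findall(r'\w+', sample)            # runs of word characters
spaces = re.findall(r'\s', sample)            # individual whitespace characters
non_digits = re.findall(r'\D+', '23-10-2002') # runs of non-digits

# \b restricts a match to whole words.
phrase = 'at the cat sat at home'
whole_at = re.findall(r'\bat\b', phrase)
any_at = re.findall(r'at', phrase)

# 'caf\u00e9' is "cafe" with an accented final e.
unicode_word = re.findall(r'\w+', 'caf\u00e9')
ascii_word = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)

print(numbers)                     # ['12', '9']
print(words)                       # ['Room', 'B12', 'opens', 'at', '9am']
print(len(spaces))                 # 4
print(non_digits)                  # ['-', '-']
print(len(whole_at), len(any_at))  # 2 4
print(unicode_word, ascii_word)
```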

### Meta-characters: Repetitions
#### * : matches zero or more occurrences
#### + : matches one or more occurrences
#### ? : matches zero or one occurrences
#### {n} : exactly n repetitions, n>=0
#### {n,} : at least n repetitions
#### {,n} : at most n repetitions
#### {m,n} : at least m and at most n repetitions
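
The repetition operators can be compared on strings that differ only in how many times one letter repeats. This is a sketch with made-up samples; `re.fullmatch` is used so that each pattern has to cover the whole string.

```python
import re

# Made-up samples: 'c', then zero to three 'a's, then 't'.
samples = ['ct', 'cat', 'caat', 'caaat']

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r'ca*t'))      # ['ct', 'cat', 'caat', 'caaat']  zero or more
print(matching(r'ca+t'))      # ['cat', 'caat', 'caaat']        one or more
print(matching(r'ca?t'))      # ['ct', 'cat']                   zero or one
print(matching(r'ca{2}t'))    # ['caat']                        exactly 2
print(matching(r'ca{2,}t'))   # ['caat', 'caaat']               at least 2
print(matching(r'ca{,2}t'))   # ['ct', 'cat', 'caat']           at most 2
print(matching(r'ca{1,2}t'))  # ['cat', 'caat']                 between 1 and 2

# A space inside the braces stops them from acting as a repetition count.
spaced = re.findall(r'\d{1, 2}', '23')
print(spaced)                 # []
```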

In [114]:
[w for w in text8 if re.findall(r'@\w+', w)]   ## This will give the same answer
[w for w in text8 if re.search(r'@\w+', w)]

['@UN', '@UN_Women']

### Let's look at some more examples!
#### Finding specific characters

In [86]:
text9 = 'ouagadougou'
re.findall(r'[aeiou]', text9)   ## find any vowels

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [88]:
re.findall(r'[^aeiou]', text9)  ## find everything that is not a vowel

['g', 'd', 'g']

### Case study: Regular expressions for Dates
#### Date variations for 23rd October 2002
23-10-2002 -- matched by `\d{2}[/-]\d{2}[/-]\d{4}`: 2 digits, dash or slash, 2 digits, dash or slash, 4 digits

23/10/2002

23/10/02

10/23/2002

23 Oct 2002

23 October 2002

Oct 23, 2002

October 23, 2002

In [107]:
dataStr = '23-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'
print(dataStr)

23-10-2002
23/10/2002
23/10/02
10/23/2002
23 Oct 2002
23 October 2002
Oct 23, 2002
October 23, 2002



In [92]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dataStr)  

['23-10-2002', '23/10/2002', '10/23/2002']

In [116]:
re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dataStr)   ## the year match 2 digits or 4 digits

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [95]:
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dataStr) ## There must be no space inside the braces: write {1,2}, not {1, 2}

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [96]:
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dataStr) 
## We want to pull out the whole date here, but only the month is returned. Why?
## Because the parentheses form a capturing group: the whole pattern is still matched, but findall returns only the text captured by the group

['Oct']

In [97]:
## Fix
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dataStr) 
## (?:...) makes the group non-capturing, so findall returns the whole match

['23 Oct 2002']

In [101]:
## We also want to match October
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dataStr)
## starts with the three-letter abbreviation, optionally followed by any lowercase letters

['23 Oct 2002', '23 October 2002']

In [111]:
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dataStr) 
## the question mark after (?:\d{2} ) makes the 2-digit day optional

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [112]:
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', dataStr) 

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
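
Capturing groups become useful once the parts of a date are needed separately. The sketch below is one possible approach, not part of the original notebook: it uses `re.finditer` with named groups, where `m.group(0)` is the whole match and the group names (`day_first`, `day_second`, `month`, `year`) are hypothetical labels chosen for this example.

```python
import re

date_str = '23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'

# Same shape as the last pattern above, with named capturing groups added.
pattern = re.compile(
    r'(?:(?P<day_first>\d{1,2}) )?'
    r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* '
    r'(?:(?P<day_second>\d{1,2}), )?'
    r'(?P<year>\d{4})'
)

parsed = []
for m in pattern.finditer(date_str):
    # The day sits before or after the month, depending on the date style.
    day = m.group('day_first') or m.group('day_second')
    parsed.append((m.group(0), day, m.group('month'), m.group('year')))

for row in parsed:
    print(row)
# ('23 Oct 2002', '23', 'Oct', '2002')
# ('23 October 2002', '23', 'Oct', '2002')
# ('Oct 23, 2002', '23', 'Oct', '2002')
# ('October 23, 2002', '23', 'Oct', '2002')
```

The `month` group captures only the three-letter abbreviation, because `[a-z]*` sits outside the group.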