---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [2]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [3]:
text2 = text1.split(' ') # Return a list of the words in text1, separated by ' '.

len(text2)

14

In [4]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
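
The trailing `''` in the list above comes from the space at the end of `text1`: splitting on a single space yields an empty string after the last separator. A minimal sketch of one way to avoid it is `str.split()` with no argument, which splits on runs of whitespace and drops empty strings. The variable names `words_with_empty` and `words` are illustrative only.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') treats every single space as a separator, so the trailing
# space produces an empty string at the end of the list.
words_with_empty = text1.split(' ')

# split() with no argument splits on any run of whitespace and
# discards empty strings at the start and end.
words = text1.split()

print(len(words_with_empty))  # 14
print(len(words))             # 13
```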

<br>
List comprehension allows us to find specific words:

In [5]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 characters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [6]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [7]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [8]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [9]:
len(set(text4))

5

In [10]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [11]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [12]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### Processing free-text

In [13]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hashtags:

In [14]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [15]:
[w for w in text6 if w.startswith('@')]

['@']

In [16]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches an `'@'` followed by one or more characters, each of which is a:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* number (`'0-9'`)
* or underscore (`'_'`)

Used with `re.search`, it keeps every word that contains such a match.

In [17]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
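
Because `re.search` looks for the pattern anywhere in a word, the comprehension above keeps any word that merely contains a match. As a hedged alternative, `re.findall` can pull the handles straight out of the unsplit string. The sketch below rebuilds `text7` from the earlier cell; the e-mail address is a made-up example.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns every non-overlapping match of the pattern in the string.
# The lone '@' is skipped because '+' requires at least one following character.
handles = re.findall(r'@[A-Za-z0-9_]+', text7)
print(handles)  # ['@UN', '@UN_Women']

# Without '^', the pattern also matches in the middle of a word.
contains = bool(re.search(r'@[A-Za-z0-9_]+', 'user@example.com'))   # True
anchored = bool(re.search(r'^@[A-Za-z0-9_]+', 'user@example.com'))  # False
```

Adding `^` is one way to insist that the word actually starts with `@`.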

In [18]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text5.split(" ")

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

In [19]:
[i for i in text5.split(" ") if i[0] == "@" or i[0] == "#"]

['#UNSG', '@']
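
Indexing with `i[0]` raises `IndexError` on an empty string, which `split(" ")` can produce when the text has leading, trailing, or doubled spaces. A small sketch of a safer variant: `str.startswith` accepts a tuple of prefixes and simply returns `False` for an empty string. The `sample` string is made up for illustration.

```python
sample = "#UNSG @ NY  Society "  # hypothetical text with a double and a trailing space

# Splitting on a single space leaves empty strings in the list.
tokens = sample.split(" ")

# startswith with a tuple checks several prefixes at once and is safe on ''.
tagged = [t for t in tokens if t.startswith(("@", "#"))]
print(tagged)  # ['#UNSG', '@']
```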

In [20]:
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"
[i for i in tweet.split(" ") if i[0] == "#"]

['#regex', '#pandas', '#python']

In [21]:
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"
[word for word in tweet.split(" ") if word.startswith("#")]

['#regex', '#pandas', '#python']

In [22]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

In [23]:
import re

[w for w in text7.split(" ") if re.search("@[A-Za-z0-9_]+",w)]

['@UN', '@UN_Women']

In [24]:
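# "@[A-Za-z0-9_]+" matches "@" followed by one or more (+) of A-Z, a-z, 0-9 or _
# note: "_" is a literal underscore, and re.search finds the pattern anywhere in the word.
# "." any single character (except a newline, by default)
# "^" the start of the string
# "$" the end of the string
# "[]" matches one of the set of characters within []
# "[^abc]" matches any single character that is not a, b or c
# "a|b" matches either a or b
# "()" scoping for operations
# "\" escape character for special characters (\t, \n, \b)
# "\b" matches a word boundary
# "\d" == [0-9] any digit
# "\D" == [^0-9] any non-digit
# "\s" == [ \t\n\r\f\v] any whitespace character
# "\S" == [^ \t\n\r\f\v] any non-whitespace character
# "\w" == [a-zA-Z0-9_] any word character
# "\W" == [^a-zA-Z0-9_] any non-word character
# (these are the ASCII equivalents; on Python 3 str patterns \d, \s and \w also match Unicode characters)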

# Meta Characters: Repetitions:

# "*" matches zero or more times 
# "+" matches one or more times
# "?" matches once or zero times
# "{n}" matches exactly n times
# "{n,}" matches at least n times
# "{,n}" matches at most n times
# "{m,n}" matches between m and n times
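
A few of the entries above can be tried out directly. This is only a sketch on a made-up `sample` string, not part of the original notes; the result names are illustrative.

```python
import re

sample = "Call 555-1234 or email a_b@x.org by 9am"  # made-up text for illustration

digits = re.findall(r"\d+", sample)                   # runs of one or more digits
four_chars = re.findall(r"\b\w{4}\b", sample)         # words of exactly four word characters
first_word = re.findall(r"^\w+", sample)              # word at the start of the string
last_word = re.findall(r"\w+$", sample)               # word at the end of the string
optional_u = re.findall(r"colou?r", "color colour")   # '?' makes the 'u' optional

print(digits)      # ['555', '1234', '9']
print(four_chars)  # ['Call', '1234']
print(first_word)  # ['Call']
print(last_word)   # ['9am']
print(optional_u)  # ['color', 'colour']
```

The patterns are written as raw strings (`r"..."`) so that Python passes backslashes such as `\d` and `\b` to the regex engine unchanged.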

In [25]:
# keeps the words that do not start with "S".
[w for w in text7.split(" ") if re.search("^[^S]",w)]

['@UN',
 '@UN_Women',
 '"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

In [26]:
# keeps the words that begin with a capital letter A to Z (the + is not needed for this test).
[w for w in text7.split(" ") if re.search("^[A-Z]+",w)]

['United', 'Nations"', 'NY', 'Society', 'Ethical', 'Culture']

In [27]:
# without [], "^A-Z" looks for the literal text "A-Z" at the start of the word, so nothing matches.
[w for w in text7.split(" ") if re.search("^A-Z",w)]

[]

In [28]:
# keeps the words that start with two capital letters.
[w for w in text7.split(" ") if re.search("^[A-Z]{2}",w)]

['NY']

In [30]:
text = "ouagadougou"
re.findall("[aeiou]",text)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [34]:
# remember that ^ inside the [] means not
re.findall("[^aeiou]",text)

['g', 'd', 'g']

## Dates

There are many ways to write dates:

In [63]:
text = "23-10-2002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n23 October 2002\nOctober 23, 2002\nOct 23, 2002"

In [64]:
# two digits, then - or /, then two digits, then - or /, then four digits
re.findall(r"\d{2}[/-]\d{2}[/-]\d{4}",text)

['23-10-2002', '23/10/2002', '10/23/2002']

In [65]:
# {2,4} allows the year to be two to four digits long
re.findall(r"\d{2}[/-]\d{2}[/-]\d{2,4}",text)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [66]:
# {1,2} also allows single-digit days and months
re.findall(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",text)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']
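
None of the numeric dates in `text` has a single-digit day or month, so the last two patterns return the same result there. A short sketch with made-up dates (`short_dates` is a hypothetical sample) shows where `\d{1,2}` makes a difference:

```python
import re

short_dates = "3/7/02\n03/07/2002\n12-1-2002"  # hypothetical sample

# Requires exactly two digits for day and month.
two_digit = re.findall(r"\d{2}[/-]\d{2}[/-]\d{2,4}", short_dates)

# Accepts one or two digits for day and month.
one_or_two = re.findall(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}", short_dates)

print(two_digit)   # ['03/07/2002']
print(one_or_two)  # ['3/7/02', '03/07/2002', '12-1-2002']
```

Note that `[/-]` is checked independently at each position, so a mixed form such as `3/7-02` would also be accepted.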

In [67]:
# what about dates like 23rd of October or 1st of September?

In [68]:
# with findall, a capturing group () makes it return only the text matched inside the group
re.findall(r"\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}", text)

['Oct']

In [69]:
# (?:...) is a non-capturing group: it still groups the alternatives,
# but findall returns the whole match instead of only the group.
re.findall(r"\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}", text)

['23 Oct 2002']
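
If the capturing group is wanted for other reasons, the full match is still available through match objects. A sketch using `re.finditer`, where `group(0)` is the whole match and `group(1)` is the captured month; the `text` here is a shortened stand-in for the sample above.

```python
import re

text = "23 Oct 2002\n23 October 2002"  # shortened stand-in for the dates sample

pattern = r"\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}"

# finditer yields match objects, which expose both the full match and each group.
matches = [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]
print(matches)  # [('23 Oct 2002', 'Oct')]
```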

In [70]:
# [a-z]* allows any lowercase letters after the abbreviation, e.g. Oct + ober
re.findall(r"\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}", text)

['23 Oct 2002', '23 October 2002']

In [74]:
# a ? after a group makes the whole group optional: (?:\d{2} )? allows an optional day
# before the month, and (?:\d{2}, )? allows an optional "day, " after it
re.findall(r"(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}", text)

['23 Oct 2002', '23 October 2002', 'October 23, 2002', 'Oct 23, 2002']
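
An earlier cell asks about dates written with ordinals, such as "1st of September". The `text` sample contains none, so the following is only a sketch on made-up input: an optional ordinal suffix `(?:st|nd|rd|th)?` and an optional ` of` are added to the day-first pattern. The names `ordinal_text`, `months` and `ordinal_dates` are illustrative.

```python
import re

# Made-up sentence with ordinal and plain day-first dates.
ordinal_text = "Due on the 23rd of October 2002, then 1st Sep 2003 and 2 November 2004."

months = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

# day, optional ordinal suffix, optional " of", space, month, space, four-digit year
pattern = r"\d{1,2}(?:st|nd|rd|th)?(?: of)? " + months + r" \d{4}"

ordinal_dates = re.findall(pattern, ordinal_text)
print(ordinal_dates)
# ['23rd of October 2002', '1st Sep 2003', '2 November 2004']
```

The pattern does not check that the suffix fits the number, so a form like "23th" would be accepted as well.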

In [101]:
text = "testing something xyztime timesxyz"

In [102]:
[w for w in text.split(" ") if re.search("xyz$",w)]

['timesxyz']

## Internationalization and Issues

How do we process text in different languages, and symbols such as emojis and music notes?

Historically, different character encodings were needed for different scripts, e.g. IBM EBCDIC for Latin, JIS for Japanese, CCCII for Chinese, etc.

Unicode unites all of these in a single standard, and UTF-8 is a widely used encoding of it.

UTF-8 makes it possible to use characters from all languages in the same text.

**Unicode** 
- Industry standard for encoding and representing text
- Over 128,000 characters from 130+ scripts and symbol sets
- UTF-8 is the default source-code encoding in Python 3
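
A short sketch of what this means in practice: a Python 3 `str` is a sequence of Unicode code points, and `encode`/`decode` convert it to and from UTF-8 bytes. The sample word is arbitrary and is written with escapes so that the code points are unambiguous.

```python
word = "R\u00e9sum\u00e9 \u266b"   # "Résumé ♫": accented letters plus a music-note symbol

encoded = word.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(len(word))        # 8 code points
print(len(encoded))     # 12 bytes: each é takes 2 bytes, the note takes 3
print(decoded == word)  # True, the round trip is lossless
```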

### Take home concepts for Internationalization:
- Diversity in Text
- ASCII and other character encodings
- Handling text in UTF-8