## 1. Introduction to Text Mining

* The volume of text data is growing rapidly. It is estimated at about 2.5 exabytes (2.5 million TB) per day.
* With all of this data, what can be done?
* We can parse the text and try to understand what it says, find and extract relevant information, classify documents, or run sentiment analysis, for example deciding whether a text is positive or negative.

## 2. Handling Text in Python

### **Primitive constructs in Text**
* Sentences / input strings
* Words or Tokens
* Characters
* Document, larger files

In [None]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

In [None]:
len(text1)

76

In [None]:
text2 = text1.split(' ')

In [None]:
len(text2)

14

In [None]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

* Let's find long words that are more than 3 letters long.

In [None]:
[w for w in text2 if len(w) > 3]

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

* Capitalized words: use istitle()

In [None]:
[w for w in text2 if w.istitle()]

['Ethics', 'United', 'Nations']

* Words that end with s

In [None]:
[w for w in text2 if w.endswith('s')]

['Ethics', 'ideals', 'objectives', 'Nations']

* Finding unique words : using set()

In [None]:
text3 = 'To be or not to be'

In [None]:
text4 = text3.split(' ')
len(text4)

6

In [None]:
len(set(text4))

5

In [None]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

* 'To' and 'to' are counted as different words. Lowercase every word first to ignore case.

In [None]:
len(set([w.lower() for w in text4]))

4

* Some word comparison functions
> * s.startswith(t)
> * s.endswith(t)
> * t in s
> * s.isupper(), s.islower(), s.istitle()
> * s.isalpha(), s.isdigit(), s.isalnum()

* String Operations
> * s.lower(), s.upper(), s.title()
> * s.split(t)
> * s.splitlines()
> * s.join(t)
> * s.strip(), s.rstrip() : remove white space
> * s.find(t), s.rfind(t)
> * s.replace(u, v)
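
The comparison and string operations listed above can be tried on a short string. This is a minimal sketch; the sample string `s` is made up for illustration and is not part of the original notebook.

```python
# Hypothetical sample string, chosen only to exercise the methods listed above.
s = "  Text Mining in Python 3  "

cleaned = s.strip()                  # remove leading and trailing whitespace
words = cleaned.split(' ')           # split on single spaces

starts = cleaned.startswith('Text')  # prefix test
contains = 'Mining' in cleaned       # substring test

title_words = [w for w in words if w.istitle()]  # capitalized tokens
digit_words = [w for w in words if w.isdigit()]  # purely numeric tokens

rejoined = '-'.join(words)           # the separator string joins the list
position = cleaned.find('in')        # index of first occurrence, -1 if absent

print(title_words, digit_words, rejoined, position)
```

Note that `find` reports the first `'in'` it sees, which here sits inside the word "Mining" rather than at the standalone word "in".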

* From words to characters

In [None]:
text5 = "ouagadougou"
text6 = text5.split('ou')

In [None]:
text6

['', 'agad', 'g', '']

In [None]:
'ou'.join(text6)

'ouagadougou'

In [None]:
text5.split('')

ValueError: empty separator

In [None]:
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

* Cleaning text

In [None]:
text8 = '    A quick brown fox jumped over the lazy dog. '
text8.split(' ')

['',
 '',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog.',
 '']

In [None]:
text9 = text8.strip()

In [None]:
text9.split(' ')

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
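
As an alternative to stripping first, `split()` can be called with no separator. A small sketch using the same `text8` string as above:

```python
text8 = '    A quick brown fox jumped over the lazy dog. '

# With no separator, split() splits on runs of whitespace and
# never produces empty strings for leading or trailing spaces.
tokens = text8.split()

# For this string the result matches strip() followed by split(' ').
same = text8.strip().split(' ')

print(tokens)
print(tokens == same)
```

The two approaches only differ when the text contains repeated spaces, tabs, or newlines between words, where `split(' ')` would still return empty strings.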

In [None]:
text9

'A quick brown fox jumped over the lazy dog.'

In [None]:
text9.find('o')

10

In [None]:
text9.rfind('o')

40

In [None]:
text9.replace('o','O')

'A quick brOwn fOx jumped Over the lazy dOg.'

* Handling larger texts

In [None]:
f = open('UNDHR.txt', 'r')
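
Once a file is open, it can be read line by line or all at once. The sketch below is only an illustration: it writes a small temporary file with made-up contents as a stand-in for 'UNDHR.txt', so the file name and text are assumptions, and it uses a `with` block so the file is closed automatically.

```python
import os
import tempfile

# Stand-in for 'UNDHR.txt': a temporary file with made-up contents.
sample = "First line of the document\nSecond line\nThird line\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)
    path = tmp.name

# The with-statement closes the file even if reading raises an exception.
with open(path, 'r', encoding='utf-8') as f:
    first_line = f.readline()   # reads one line, keeps the trailing '\n'
    f.seek(0)                   # move the read position back to the start
    whole_text = f.read()       # read the entire file as one string

lines = whole_text.splitlines() # split into lines, dropping the '\n'
os.remove(path)                 # clean up the temporary file

print(first_line.rstrip(), len(whole_text), len(lines))
```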

## 3. Regular Expressions

In [None]:
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"

print([word for word in tweet.split() if word.startswith('#')])

['#regex', '#pandas', '#python']


In [None]:
print([word for word in tweet.split() if word.startswith('@')])

['@nltk']


* Finding Callouts

In [None]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

In [None]:
print([word for word in text8 if word.startswith('@')])

['@UN', '@UN_Women', '@']


In [None]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
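
The same callouts can also be pulled straight from the full string, without splitting it first. A minimal sketch, using the tweet text from above written as one parenthesized string:

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall scans the whole string, so no prior split is needed.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)

# On ASCII text like this, \w matches the same characters as [A-Za-z0-9_].
callouts_w = re.findall(r'@\w+', text7)

# The same idea works for hashtags.
hashtags = re.findall(r'#\w+', text7)

print(callouts, hashtags)
```

The lone `@` before "NY" is skipped because the pattern requires at least one word character after it.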

* . : wildcard, a single character
* ^ : start of a string
* $ : end of a string
* [] : matches any one character from the given set or range
* [^abc] : matches a character that is not a, b, or c
* a|b : either a or b, where a and b are strings
* ( ) : Scoping for operators
* \ : Escape character for special characters

### Meta-characters
* \b : Matches word boundary
* \d : Any digit
* \D : Any non-digit
* \s : Any whitespace, [ \t\n\r\f\v]
* \S : Any non-whitespace
* \w : Alphanumeric character (letters, digits, and underscore)
* \W : Any non-alphanumeric character
* * : zero or more occurrences
* + : one or more
* ? : zero or one
* {n} : exactly n repetitions
* {n, } : at least n repetitions
* {,n} : at most n repetitions
* {m,n} : at least m, at most n repetitions
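
A few of the meta-characters and repetition operators above can be combined as follows. This is an illustrative sketch only; the sample sentence is made up.

```python
import re

# Made-up sample sentence for illustration.
sample = "Call 555-1234 or 5551234 before 9am; room 42 is on floor 3."

# \d+ : every run of one or more digits
digit_runs = re.findall(r'\d+', sample)

# \b\d{4}\b : exactly four digits with a word boundary on both sides
four_digits = re.findall(r'\b\d{4}\b', sample)

# \b\d{1,2}\b : one or two digits standing alone as a token
short_nums = re.findall(r'\b\d{1,2}\b', sample)

# -? : the dash between the two digit groups is optional
phone_like = re.findall(r'\d{3}-?\d{4}', sample)

print(digit_runs, four_digits, short_nums, phone_like)
```

The `9` in "9am" is not returned by the `\b\d{1,2}\b` pattern because a digit followed by a letter has no word boundary between them.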



In [None]:
text12 = 'ouagadougou'
re.findall(r'[aeiou]', text12)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [None]:
re.findall(r'[^aeiou]', text12)

['g', 'd', 'g']

* Regular Expression for Dates
> **23rd October 2002**
> * 23-10-2002
> * 23/10/2002
> * 23/10/02
> * 10/23/2002
> * 23 Oct 2002
> * 23 October 2002
> * Oct 23, 2002
> * October 23, 2002



In [None]:
text = '23 Oct 2002, 23 October 2002, Oct 23, 2002, October 23, 2002'

In [None]:
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', text)

['Oct']

In [None]:
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', text)

['23 Oct 2002']

In [None]:
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', text)

['23 Oct 2002', '23 October 2002']
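
The two remaining formats in `text` put the day after the month. One possible extension, offered as a sketch rather than a complete date parser, makes the leading day optional and allows an optional "day, " after the month name:

```python
import re

text = '23 Oct 2002, 23 October 2002, Oct 23, 2002, October 23, 2002'

pattern = (r'(?:\d{1,2} )?'                                      # optional leading day
           r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
           r'[a-z]* '                                            # rest of the month name
           r'(?:\d{1,2}, )?'                                     # optional "day, " after the month
           r'\d{4}')                                             # four-digit year

dates = re.findall(pattern, text)
print(dates)
```

All groups are non-capturing `(?: )`, so `findall` returns the whole match instead of just the group contents. The pattern does not validate day or month ranges.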

## 4. Regex with Pandas and Named Groups

### **Working with Text Data in pandas**

In [None]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

| | text |
|---|---|
| 0 | Monday: The doctor's appointment is at 2:45pm. |
| 1 | Tuesday: The dentist's appointment is at 11:30... |
| 2 | Wednesday: At 7:00pm, there is a basketball game! |
| 3 | Thursday: Be back home by 11:15 pm at the latest. |
| 4 | Friday: Take the train at 08:10 am, arrive at ... |


In [None]:
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [None]:
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [None]:
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [None]:
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [None]:
# find all occurrences of digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [None]:
#group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [None]:
# replace weekdays with '???' (regex=True is needed in pandas 2.0+, where the default is literal matching)
df['text'].str.replace(r'\w+day\b', '???', regex=True)

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [None]:
# replace weekdays with 3-letter abbreviations (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [None]:
# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

| | 0 | 1 |
|---|---|---|
| 0 | 2 | 45 |
| 1 | 11 | 30 |
| 2 | 7 | 00 |
| 3 | 11 | 15 |
| 4 | 08 | 10 |


In [None]:
# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

| | match | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |


In [None]:
# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

| | match | time | hour | minute | period |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |
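
Named groups work the same way in the plain `re` module, outside pandas. A minimal sketch on one of the sentences above, using the same pattern:

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."
pattern = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

# finditer yields one match object per occurrence;
# groupdict() maps each group name to the text it captured.
matches = [m.groupdict() for m in pattern.finditer(sentence)]

# A single group can also be read by name from a match object.
first = pattern.search(sentence)
print(first.group('hour'), first.group('minute'), first.group('period'))
print(matches)
```

The captured values are strings, so leading zeros such as '08' and '00' are preserved.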


In [None]:
import re


text = 'xyzdf dfxyz'
re.findall(r'xyz$', text)

['xyz']

## 5. Internationalization and Issues with Non-ASCII Characters

* ASCII : American Standard Code for Information Interchange
> * 7-bit character encoding standard : 128 valid codes
> * Range : 0x00 - 0x7f
> * Includes letters (upper and lower case), digits, punctuation, common symbols, control characters
> * Diacritics are not defined in ASCII
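
The ASCII limits above can be checked directly in Python. A small sketch; the sample words are made up, and `str.isascii()` assumes Python 3.7 or later:

```python
plain = "cafe"
accented = "caf\u00e9"   # 'café', with é written as an escape

# isascii() is True only if every character is in the range 0x00-0x7f.
plain_is_ascii = plain.isascii()
accented_is_ascii = accented.isascii()

# ord() returns the code point: 'A' falls inside 0-127, 'é' does not.
code_a = ord('A')
code_e_acute = ord('\u00e9')

try:
    accented.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    # 'é' has no ASCII code, so strict encoding fails
    ascii_ok = False

print(plain_is_ascii, accented_is_ascii, code_a, code_e_acute, ascii_ok)
```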

* Unicode : Industry standard for encoding and representing text
> * Over 128000 characters from 130+ scripts and symbol sets
> * Can be implemented by different character encodings
> * UTF-8 : an extensible encoding. One byte up to four bytes
> * UTF-16 : One or two 16 bit code units
> * UTF-32 : One 32-bit code unit

* UTF-8 : Unicode Transformation Format - 8 bits
> * variable length encoding : One to four bytes
> * Backward compatible with ASCII : One byte codes same as ASCII
> * Dominant character encoding for the web
> * Default in Python 3
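
The variable-length behaviour of UTF-8 can be seen by encoding a few characters and counting the bytes. A minimal sketch; the sample characters are chosen for illustration:

```python
# One character from each UTF-8 length class: A, é, €, and an emoji.
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

# Number of bytes each character needs in UTF-8 (one to four).
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]

# ASCII characters encode to the same single byte in UTF-8 and ASCII.
same_as_ascii = 'A'.encode('utf-8') == 'A'.encode('ascii')

# UTF-16 uses one or two 16-bit units; UTF-32 always uses one 32-bit unit.
# The '-le' variants are used so no byte-order mark is added to the count.
utf16_emoji_bytes = len('\U0001F600'.encode('utf-16-le'))
utf32_a_bytes = len('A'.encode('utf-32-le'))

# Encoding and then decoding with the same codec returns the original text.
roundtrip = 'caf\u00e9'.encode('utf-8').decode('utf-8')

print(utf8_lengths, same_as_ascii, utf16_emoji_bytes, utf32_a_bytes, roundtrip)
```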

* Resources
> * [Regular Expressions](https://docs.python.org/3/library/re.html)
> * Tips and tricks of the trade for cleaning text in Python:
>   1. https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
>   2. http://ieva.rocks/2016/08/07/cleaning-text-for-nlp/