# Applied Text Mining in Python
**<div style="text-align: right">Jiseong Yang</div>**  
**<font color='red'>Note</font>**: This is my personal lecture note based on [Applied Text Mining in Python](https://www.coursera.org/learn/python-text-mining) on [Coursera](https://www.coursera.org/) provided by University of Michigan

## Module 1: Working with Text in Python

### Handling Text in Python

#### Characters and Words

In [2]:
# Characters
text1= "Ethics are built right into the ideals and objectives of the United Nations."; print(text1)
len(text1)

Ethics are built right into the ideals and objectives of the United Nations.


76

In [3]:
# Words
text2 = text1.split(" "); print(text2)
len(text2)

['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations.']


13

#### Finding Specific Words

In [4]:
# Long words: words that are more than 3 letters long
[w for w in text2 if len(w) > 3]

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations.']

In [5]:
# Capitalized words
[w for w in text2 if w.istitle()]

['Ethics', 'United', 'Nations.']

In [6]:
# Words that end with s
[w for w in text2 if w.endswith('s')]

['Ethics', 'ideals', 'objectives']

In [7]:
# Finding unique words
text3 = "To be or not to be"
text4 = text3.split(" "); print(text4)
print(len(text4))
print(len(set(text4)))
print([w.lower() for w in text4])
print(len(set(w.lower() for w in text4)))

['To', 'be', 'or', 'not', 'to', 'be']
6
5
['to', 'be', 'or', 'not', 'to', 'be']
4


#### Word Comparison Functions
* s.startswith(t)
* s.endswith(t)
* t in s
* s.isupper(); s.islower(); s.istitle()   
e.g. UPPER, lower, Title
* s.isalpha(); s.isdigit(); s.isalnum()  
e.g. abcd, 1234, ab12
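
The comparison functions above all return a boolean. A minimal sketch, using sample strings of my own choosing, shows what each one reports:

```python
# Sample strings are illustrative only; any str works the same way.
s = "Nations"
print(s.startswith("Na"))   # prefix test
print(s.endswith("s"))      # suffix test
print("tion" in s)          # substring test

# (isupper, islower, istitle) for each sample word
checks = {w: (w.isupper(), w.islower(), w.istitle())
          for w in ["UPPER", "lower", "Title"]}

# (isalpha, isdigit, isalnum) for each sample word
kinds = {w: (w.isalpha(), w.isdigit(), w.isalnum())
         for w in ["abcd", "1234", "ab12"]}

print(checks)
print(kinds)
```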

#### String Operations
* s.lower(); s.upper(); s.title()
* s.split(t)
* s.splitlines() - split a string at line breaks such as '\n'
* s.join(t)
* s.strip(); s.rstrip()
* s.find(t); s.rfind(t)
* s.replace(u, v)
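
A short sketch of the operations listed above that the following cells do not cover. The input string is my own example, not from the lecture:

```python
s = "  To be\nor not to be  "

lines = s.splitlines()           # split at line breaks
stripped = s.strip()             # remove whitespace from both ends
right_only = s.rstrip()          # remove whitespace from the right end only
first = stripped.find("be")      # index of the first "be"
last = stripped.rfind("be")      # index of the last "be"
missing = stripped.find("xyz")   # find returns -1 when the substring is absent
swapped = stripped.replace("be", "BE")
titled = "to be or not to be".title()
joined = "-".join(["to", "be"])

print(lines, first, last, missing)
print(swapped, titled, joined)
```

Note that `find` and `rfind` return positions, not counts; `s.count(t)` is the method that counts occurrences.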

In [8]:
# Split
text5 = "ouagadougou"
text6 = text5.split("ou"); print(text6)

['', 'agad', 'g', '']


In [9]:
# Join
'ou'.join(text6)

'ouagadougou'

In [10]:
# Get all characters
text5.split("")

ValueError: empty separator

In [None]:
# Get all the characters
print(list(text5))
print([c for c in text5])

#### Cleaning Text

In [None]:
# Whitespaces
text8 = "        A quick brown fox jumped over the lazy dog.    "
text8.split(" ")

In [None]:
# Strip
text9 = text8.strip()
text9.split(" ")

#### Changing Text

In [None]:
# Find returns the index of the first occurrence of the target string (-1 if absent)
text9
text9.find('o')

In [None]:
# Replace
text9.replace("o", "O")

#### Handling Larger Texts

In [None]:
# Working directory
import os
wd = r"C:\Users\Jiseong Yang\Documents\Jiseong Yang\Scholar\SKKU\Semesters\Thesis\Lectures\Applied Text Mining in Python"
os.chdir(wd)
os.getcwd()

In [None]:
# Reading files line by line
f = open('UNDHR.txt', 'r')
f.readline()

In [None]:
# Reading the full file
f.seek(0) # Move to 0th byte.
text12 = f.read()
print(len(text12))

In [None]:
# Splitlines
text13 = text12.splitlines()
len(text13)
print(text13)
text13[0]

#### File Operations
* f = open(filename, mode)
* f.readline() - read one line
* f.read() - read the whole file
* f.read(n) - read n character(s)
* for line in f
* f.seek(n)
* f.write(message)
* f.close()
* f.closed - check if closed
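
The file operations above can be tried without `UNDHR.txt`. The sketch below writes a small temporary file first (the file name and contents are placeholders I made up) and uses `with`, which closes the file automatically:

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, "r", encoding="utf-8") as f:
    first = f.readline()                        # one line, newline included
    f.seek(0)                                   # back to the start of the file
    head = f.read(5)                            # the first 5 characters
    f.seek(0)
    lines = [line.rstrip("\n") for line in f]   # iterate line by line

print(repr(first), repr(head), lines)
print(f.closed)   # the with-block has already closed the file
```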

In [None]:
# Issues with Reading Text Files
f = open("UNDHR.txt", 'r')
text14 = f.readline()
text14

In [None]:
# Removing Newline Character
text14.rstrip()

# Also works for DOS newlines (^M), which show up as '\r' or '\r\n'

### Regular Expressions

#### Working With Text

In [None]:
# Text separation
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1) # The length of text1
text2 = text1.split(' ') # Return a list of the words in text1, split on ' '.
len(text2)
print(text2)

In [None]:
# Find words that are longer than 3 letters
[w for w in text2 if len(w) > 3]

In [11]:
# Find words that are capitalized
[w for w in text2 if w.istitle()]

['Ethics', 'United', 'Nations.']

In [12]:
# Find words that end with 's'
[w for w in text2 if w.endswith('s')]

['Ethics', 'ideals', 'objectives']

In [13]:
# Find unique words
text3 = "To be or not to be"
text4 = text3.split(" ")
print(text4)

['To', 'be', 'or', 'not', 'to', 'be']


In [14]:
# Lowercase the words before collecting the unique ones
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

##### Processing Free-text

In [15]:
# Text separation
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
print(text6)

['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']


In [16]:
# Finding hashtags
[w for w in text6 if w.startswith("#")]

['#UNSG']

In [17]:
# Finding callouts
[w for w in text6 if w.startswith("@")]

['@']

##### Necessity of Regular Expressions

In [18]:
# Text separation
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
print(text8)

['@UN', '@UN_Women', '"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']


In [19]:
# Finding callouts
[w for w in text8 if w.startswith('@')]

['@UN', '@UN_Women', '@']

In [20]:
# Finding callout with regular expressions
import re
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']

* `@[A-Za-z0-9_]+` matches any word that contains `'@'` followed by at least one:
    * capital letter (`'A-Z'`)
    * lowercase letter (`'a-z'`)
    * number (`'0-9'`)
    * or underscore (`'_'`)
* A lone `'@'` is excluded because no such character follows it.
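
As a variation on the cell above, `re.findall` can pull the callouts straight out of the raw text without splitting it first. This is only a sketch; the hashtag pattern `#\w+` is my own addition:

```python
import re

tweet = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# '@' followed by one or more letters, digits, or underscores
callouts = re.findall(r'@[A-Za-z0-9_]+', tweet)

# \w is a shorthand that covers the same set for ASCII text
callouts_w = re.findall(r'@\w+', tweet)

# Same idea for hashtags (pattern assumed, not from the lecture)
hashtags = re.findall(r'#\w+', tweet)

print(callouts, callouts_w, hashtags)
```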

##### Meta-characters

* Character Symbols
    * `.`: wildcard, matches a single character
    * `^`: start of a string
    * `$`: end of a string
    * `[]`: matches one of the set of characters within []
    * `[a-z]`: matches one character in the range a, b, ..., z
    * `[^abc]`: matches a character that is not a, b, or c
    * `a|b`: matches either a or b, where a and b are strings
    * `()`: Scoping for operators
    * `\`: Escape character for special characters (\t, \n, \b)
    * `\b`: matches word boundary
    * `\d`: any digit, equivalent to [0-9]
    * `\D`: any non-digit, equivalent to [^0-9]
    * `\s`: any whitespace, equivalent to [ \t\n\r\f\v]
    * `\S`: any non-whitespace, equivalent to [^ \t\n\r\f\v]
    * `\w`: alphanumeric character, equivalent to [a-zA-Z0-9_]
    * `\W`: non-alphanumeric character, equivalent to [^a-zA-Z0-9_]

* Repetitions
    * `*`: matches zero or more occurrences
    * `+`: matches one or more occurrences
    * `?`: matches zero or one occurrence
    * `{n}`: exactly n repetitions, n≥0
    * `{n,}`: at least n repetitions
    * `{,n}`: at most n repetitions
    * `{m,n}`: at least m and at most n repetitions
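
A few of the meta-characters and repetition operators above, tried on small strings of my own. This is a sketch for experimenting, not an exhaustive reference:

```python
import re

s = "a aa aaa aaaa"
exactly_two = re.findall(r'\ba{2}\b', s)     # {n}: exactly n repetitions
two_or_more = re.findall(r'\ba{2,}\b', s)    # {n,}: at least n repetitions
one_or_two = re.findall(r'\ba{1,2}\b', s)    # {m,n}: between m and n repetitions

optional_u = re.findall(r'colou?r', "color colour")    # ?: zero or one 'u'
digit_runs = re.findall(r'\d+', "Room 12, floor 3")    # +: one or more digits
first_word = re.findall(r'^\w+', "hello big world")    # ^: anchored at the start
last_word = re.findall(r'\w+$', "hello big world")     # $: anchored at the end

print(exactly_two, two_or_more, one_or_two)
print(optional_u, digit_runs, first_word, last_word)
```

The `\b` word boundaries matter here: without them, `a{2}` would also match inside the longer runs of `a`.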

In [21]:
# Find only vowels
text12 = "ouagadougou"
re.findall(r'[aeiou]', text12)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [22]:
# Find only consonants
re.findall(r'[^aeiou]', text12)

['g', 'd', 'g']

#### Case Study: Regular Expression for Dates
* Date variations for 23rd October 2002  
    * 23-13-1002
    * 23/10/2002
    * 23/10/02
    * 10/23/2002
    * 23 Oct 2002
    * 23 October 2002
    * Oct 23, 2002
    * October 23, 2002


In [23]:
# Date Structures
dateStr = '23-13-1002\n23/10/2002\n23/10/02\n10/23/2002\n23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n'
print(dateStr)

23-13-1002
23/10/2002
23/10/02
10/23/2002
23 Oct 2002
23 October 2002
Oct 23, 2002
October 23, 2002



#### Group 1 (23-13-1002, 23/10/2002, 23/10/02, 10/23/2002)

In [24]:
# XX/-XX/-XXXX
re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dateStr)

# XX/-XX/-XX(XX)
re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dateStr)

# X(X)/-X(X)/-XX(XX)
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dateStr)

['23-13-1002', '23/10/2002', '23/10/02', '10/23/2002']

#### Group 2 (23 Oct 2002, 23 October 2002, Oct 23, 2002, October 23, 2002)

In [25]:
# (Mon)
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['Oct']

In [26]:
# XX (Mon) XXXX
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['23 Oct 2002']

In [27]:
# XX ((Mon)th) XXXX 
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dateStr)

['23 Oct 2002', '23 October 2002']

In [28]:
# XX ((Mon)th) XX, (XXXX)
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [29]:
# X(X) ((Mon)th) X(X), (XXXX)
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']
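
The Group 1 and Group 2 patterns can be combined into one expression with `|`. The sketch below is one possible way to do it, using `re.VERBOSE` so the pattern can be commented; the variable names are my own:

```python
import re

date_str = ('23-13-1002\n23/10/2002\n23/10/02\n10/23/2002\n'
            '23 Oct 2002\n23 October 2002\nOct 23, 2002\nOctober 23, 2002\n')

# Month abbreviation, optionally followed by the rest of the month name
month = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'

# In VERBOSE mode unescaped whitespace is ignored, so literal spaces are written as '\ '.
pattern = re.compile(rf'''
      \d{{1,2}}[/-]\d{{1,2}}[/-]\d{{2,4}}                   # numeric form
    | (?:\d{{1,2}}\ )?{month}\ (?:\d{{1,2}},\ )?\d{{4}}     # textual form
''', re.VERBOSE)

matches = pattern.findall(date_str)
print(matches)
```

Because every group is non-capturing (`(?:...)`), `findall` returns the full matched strings rather than group contents.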

### Working with Text Data in Pandas

In [30]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

| | text |
|---|---|
| 0 | Monday: The doctor's appointment is at 2:45pm. |
| 1 | Tuesday: The dentist's appointment is at 11:30... |
| 2 | Wednesday: At 7:00pm, there is a basketball game! |
| 3 | Thursday: Be back home by 11:15 pm at the latest. |
| 4 | Friday: Take the train at 08:10 am, arrive at ... |


In [31]:
# Find the number of characters for each string in df['text']
df['text'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

In [32]:
# Find the number of tokens for each string in df['text']
df['text'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

In [33]:
# Find which entries contain the word 'appointment'
df['text'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

In [34]:
# Find how many times a digit occurs in each string
df['text'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

In [35]:
# Find all occurrences of the digits
df['text'].str.findall(r'\d')

0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

In [36]:
# Group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

In [37]:
# Replace weekdays with '???'
df['text'].str.replace(r'\w+day\b', '???', regex=True)

0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [38]:
# Replace weekdays with 3-letter abbreviations
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)

0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

In [39]:
# Create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

| | 0 | 1 |
|---|---|---|
| 0 | 2 | 45 |
| 1 | 11 | 30 |
| 2 | 7 | 00 |
| 3 | 11 | 15 |
| 4 | 08 | 10 |


In [40]:
# Extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

| | match | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |


In [41]:
# Extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

| | match | time | hour | minute | period |
|---|---|---|---|---|---|
| 0 | 0 | 2:45pm | 2 | 45 | pm |
| 1 | 0 | 11:30 am | 11 | 30 | am |
| 2 | 0 | 7:00pm | 7 | 00 | pm |
| 3 | 0 | 11:15 pm | 11 | 15 | pm |
| 4 | 0 | 08:10 am | 08 | 10 | am |
| 4 | 1 | 09:00am | 09 | 00 | am |


### Internationalization and Issues with Non-ASCII Characters
#### English and ASCII
* ASCII: American Standard Code for Information Interchange
    * 7-bit character encoding standard: 128 valid codes
    
#### Other Character Encodings
* IBM EBCDIC
* Latin-1
* JIS: Japanese Industrial Standards
* CCCII: Chinese Character Code for Information Interchange
* EUC: Extended Unix Code
* Unicode and UTF-8 (**Most common**)

#### Unicode
* Industry standard for encoding and representing text
* Over 128,000 characters from 130+ scripts and symbol sets

##### UTF-8
* Unicode Transformation Format, 8-bit
* Variable length encoding: One to four bytes
* Backward compatible with ASCII
    * One byte codes same as ASCII
* Dominant character encoding for the Web
* Default in Python 3
    * Python3: "Resume"
    * Python2: u"Resume"
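
The variable-length property of UTF-8 is easy to check in Python 3. The strings below are my own examples (written with escapes so the snippet stays ASCII-only); this is a small sketch, not part of the lecture:

```python
word = "R\u00e9sum\u00e9"            # the word "Resume" with two accented e's
encoded = word.encode("utf-8")

# 6 characters, but each accented e takes 2 bytes in UTF-8
print(len(word), len(encoded))

# Bytes needed per character: ASCII letter, accented e, euro sign, emoji
widths = {ch: len(ch.encode("utf-8"))
          for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]}
print(widths)

# One-byte codes are identical to ASCII
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")

# Decoding restores the original string
round_trip = encoded.decode("utf-8")
```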

#### Take Home Concepts
* Diversity in Text
* ASCII and other character encodings
* Handling text in UTF-8

## Module 2: Basic Natural Language Processing
### What is Natural Language Processing?
* Any computation or manipulation of natural language
* Natural languages evolve
    * new words get added
    * old words lose popularity
    * meanings of words change
    * language rules themselves may change

### NLP Tasks: A Broad Spectrum
* Counting words, counting frequency of words
* Finding sentence boundaries
* Part of speech tagging
* Parsing the sentence structure
* Identifying semantic roles
* Identifying entities in a sentence
* Finding which pronoun refers to which entity, and many more...

### Basic NLP Tasks with NLTK
* NLTK: Natural Language Toolkit
* Open source library in Python
* Has support for most NLP tasks
* Also provides access to numerous text corpora
* **More advanced tool than regular expressions**

#### Set-up

In [42]:
# Import module
import nltk
nltk.download()
from nltk.book import *

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [43]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [44]:
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [45]:
sent1

['Call', 'me', 'Ishmael', '.']

#### Counting Vocabulary of Words

In [46]:
text7

<Text: Wall Street Journal>

In [47]:
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [48]:
# Words
len(sent7)

18

In [49]:
# Tokens in the whole text (words and punctuation), not letters
len(text7)

100676

In [50]:
# Unique words
len(set(text7))

12408

In [51]:
# Ten of the unique tokens (a set has no fixed order)
list(set(text7))[:10]

['high-technology',
 'Navy',
 'contain',
 '434.4',
 'Avrett',
 'T-shirts',
 'Economy',
 'Society',
 'buoyed',
 'indicate']

#### Frequency of Words

In [52]:
# Frequency distribution
dist = FreqDist(text7)
len(dist)

12408

In [53]:
vocab1 = dist.keys()
#vocab1[:10] 
# In Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [54]:
dist['four']

20

In [55]:
# Restriction on the length of the word is important to avoid meaningless words.
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
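
NLTK's `FreqDist` builds on `collections.Counter`, so the same counting and filtering can be sketched with the standard library alone. The token list and thresholds below are made up for illustration:

```python
from collections import Counter

# Hypothetical toy corpus; text7 from nltk.book would be used the same way.
tokens = "the cat sat on the mat because the cat was tired".split()

dist = Counter(tokens)          # token -> number of occurrences
print(len(dist))                # number of unique tokens
print(dist["the"])              # frequency of a single token

# Same filter as above, with thresholds scaled down for the toy corpus
frequent = [w for w in dist if len(w) > 2 and dist[w] > 1]
top_two = dist.most_common(2)
print(frequent, top_two)
```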

#### Normalization
* Different forms of the same "word"

##### Stemming

In [56]:
# Normalization
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [57]:
# Extract stem
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

In [58]:
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [59]:
[porter.stem(t) for t in udhr[:20]] # Stemming: the resulting stems are not always valid words

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

##### Lemmatization
* Lemmatization: like stemming, but the resulting forms are all valid words.

In [60]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

#### Tokenization
* Splitting a sentence into words/tokens

In [61]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

* NLTK has a built-in tokenizer

In [62]:
# Tokenization separates negation and full stop
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

* More fundamental question: what is a sentence and how do you know sentence boundaries?

In [63]:
# Punctuation marks usually end sentences, but not every period does (e.g. "U.S." and "$2.99")
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"

* NLTK has an in-built sentence splitter.

In [64]:
sentences = nltk.sent_tokenize(text12)
len(sentences)

4

In [65]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
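
To see why a dedicated sentence splitter is useful, compare it with a naive rule that splits after every `.`, `!`, or `?` followed by whitespace. The regular expression is my own simplistic baseline, not what NLTK does internally:

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Naive rule: split at whitespace that follows '.', '!' or '?'
naive = re.split(r'(?<=[.!?])\s+', text)

for piece in naive:
    print(repr(piece))
print(len(naive))
```

The naive rule handles `$2.99` (no whitespace after the inner period) but wrongly breaks after the abbreviation `U.S.`, so it yields one piece more than `nltk.sent_tokenize`.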

### Advanced NLP Tasks with NLTK

#### POS tagging

In [66]:
import nltk

# Descriptions of the tag names. 
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [81]:
# Help
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [67]:
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]

* POS tagging allows you to group multiple words into a single category according to their part of speech
* Useful for feature engineering

#### Ambiguity in POS Tagging
* Ambiguity is common in English
* e.g. "Visiting aunts can be a nuisance" has two meanings:
    * "Aunts" who are visiting can be a nuisance.
    * The act of "visiting" aunts can be a nuisance.

In [68]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

* The POS tagger assigns only one tag, even though "visiting" can be either an adjective or a gerund.
* "Visiting" is less likely to be an adjective, so the gerund tag wins, although the alternative reading is still valid.

In [69]:
# Parsing sentence structure
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


#### Ambiguity in Parsing
* Ambiguity may exist even if sentences are grammatically correct.

In [79]:
# Preposition attachment problem
text16 = nltk.word_tokenize("I saw the man with a telescope")
grammar1 = nltk.data.load('mygrammar.cfg')
grammar1

<Grammar with 13 productions>

In [80]:
parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16)
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (Det the) (N man)))
    (PP (P with) (NP (Det a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (Det the) (N man) (PP (P with) (NP (Det a) (N telescope))))))


#### NLTK and Parse Tree Collection
* Grammar checking requires a lot of manual work

In [82]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


#### POS Tagging and Parsing Ambiguity

In [83]:
# Uncommon usages of words
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

In [86]:
# Well-formed sentences may still be meaningless
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]

#### Take Home Concepts
* POS tagging provides insights into the word classes/types in a sentence.
* Parsing the grammatical structures helps derive meaning.
* Both tasks are difficult, and linguistic ambiguity makes them even harder.
* Better models could be learned with supervised learning
* NLTK provides access to tools and data for training

## Module 3: Text Classification
### Classification
* Assigning the correct class label to a given input, given a set of classes.
* Examples
    * Topic Identification
    * Spam Detection
    * Sentiment Analysis
    * Spelling Correction
* Classification is a supervised learning task: the model learns from past (labeled) instances.

#### Supervised Classification
* Learn a <font color="red">classification model</font> on properties ("features") and their importance ("weights") from labeled instances. 
    * X: Set of attributes or features {x1, x2, ..., xn}
    * Y: A "class" label from the label set {y1, y2, ..., yk}
* Apply the model on new instances to predict the label

#### Classification Paradigms
* When there are only two possible classes; |Y| = 2:
<center><font color="red">Binary Classification</font></center>
* When there are more than two possible classes; |Y| > 2:
<center><font color="red">Multi-class Classification</font></center>
* When data instances can have two or more labels:
<center><font color="red">Multi-label Classification</font></center>

#### Questions to Ask in Supervised Learning?
* Training Phase
    * What are the features? How do you represent them?
    * What is the classification model/algorithm?
    * What are the model parameters?
    
* Inference Phase
    * What is the expected performance? What is a good measure?

### Identifying Features from Text
#### Why is Textual Data Unique?
* Textual data presents a unique set of challenges
* All the information you need is in the text
* But features can be pulled out from text at different granularities.

#### Types of Textual Features
* Words
    * By far the most common class of features
    * Handling commonly-occurring words: Stop words
    * Normalization: Make lower case vs. leave as-is
    * Stemming / Lemmatization
    * Characteristics of words: Capitalization
    * Parts of speech of words in a sentence
    * Grammatical structure, sentence parsing
    * Grouping words of similar meaning, semantics
        * {buy, purchase}
        * {Mr., Ms., Dr., Prof.}; Numbers / Digits; Dates 
    * Depending on the classification task, features may come from inside words and from word sequences
        * bigrams, trigrams, n-grams: "White House"
        * Character sub-sequences in words: "ing", "ion", ...
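
Word n-grams and character sub-sequences are straightforward to extract by sliding a window over the tokens or characters. A minimal sketch follows; the function names `ngrams` and `char_ngrams` are my own helpers, not NLTK's API:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Return all substrings of length n in a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = "the White House said".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
suffix_like = char_ngrams("running", 3)

print(bigrams)
print(trigrams)
print(suffix_like)
```

The bigram `('White', 'House')` carries a meaning that neither word has on its own, which is why n-grams are often used as features in addition to single words.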

### Naive Bayes Classifiers
#### Probabilistic Model
* Update the likelihood of the class given new information
    * e.g. Python (zoology) / Python download (computer science)
* <font color='red'>Prior Probability</font>: Pr(y=<font color='green'>Entertainment</font>), Pr(y=<font color='green'>Computer Science</font>), Pr(y=<font color='green'>Zoology</font>)
* <font color='red'>Posterior Probability</font>: Pr(y=<font color='green'>Entertainment</font>|x=<font color='blue'>"Python"</font>)

#### Bayes's Rule
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/87c061fe1c7430a5201eef3fa50f9d00eac78810">

* <font color="red">Naive assumption</font>: Given the class label, features are assumed to be independent of each other
--- 
* y* = argmax(y) Pr(y|X) = argmax(y) Pr(y) * Pr(X|y)  
* y* = argmax(y) Pr(y|X) = argmax(y) Pr(y) * ∏ Pr(xi|y)

#### Parameters
* Prior probabilities: Pr(y) for all y in Y
* Likelihood: Pr(xi|y) for all features xi and labels y in Y
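* If there are 3 classes (|Y|=3) and 100 features in X, how many parameters does a Naive Bayes model have?
    * 3 + 2 * 3 * 100 = 603: 3 priors, plus 2 likelihoods (feature present / absent) for each of the 3 classes and 100 binary features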
    
#### Learning Parameters
* Prior probabilities: Pr(y) for all y in Y
    * From training data
    * Count the number of instances in each class
    * If there are N instances in all, and n out of those are labeled as class y: Pr(y) = n / N
    
* Likelihood: Pr(xi|y) for all features xi and labels y in Y
    * Count how many times feature xi appears in instances labeled as class y
    * If there are p instances of class y, and xi appears in k of those, Pr(xi|y) = k / p
    
#### Smoothing
* What happens if Pr(xi|y) = 0?
    * Feature xi never occurs in documents labeled y
    * But then, the posterior probability Pr(y|xi) will be 0
    * Instead, smooth the parameters
    * <font color='red'>Laplace smoothing</font> or <font color='red'>Additive smoothing</font>: Add a dummy count
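
The counting rules and the smoothing step above can be put together in a few lines. The sketch below is a from-scratch toy under my own assumptions: binary (present / absent) word features, a made-up four-document training set, and add-one smoothing. It is not how scikit-learn or NLTK implement Naive Bayes:

```python
import math
from collections import defaultdict

# Hypothetical toy training set: (set of words in the document, class label).
train = [
    ({"python", "snake", "jungle"}, "Zoology"),
    ({"python", "download", "code"}, "Computer Science"),
    ({"python", "code", "library"}, "Computer Science"),
    ({"monty", "python", "film"}, "Entertainment"),
]

vocab = set().union(*(words for words, _ in train))
class_counts = defaultdict(int)                          # p: instances per class
feature_counts = defaultdict(lambda: defaultdict(int))   # k: instances of class y containing xi
for words, label in train:
    class_counts[label] += 1
    for w in words:
        feature_counts[label][w] += 1

def prior(label):
    return class_counts[label] / len(train)              # Pr(y) = n / N

def likelihood(word, label, alpha=1):
    # Laplace smoothing for a binary feature: (k + alpha) / (p + 2 * alpha)
    k = feature_counts[label][word]
    p = class_counts[label]
    return (k + alpha) / (p + 2 * alpha)

def predict(words):
    # argmax over y of log Pr(y) + sum of log Pr(xi | y), using logs to avoid underflow
    scores = {}
    for label in class_counts:
        log_score = math.log(prior(label))
        for w in vocab:
            pr = likelihood(w, label)
            log_score += math.log(pr if w in words else 1 - pr)
        scores[label] = log_score
    return max(scores, key=scores.get)

print(predict({"python", "download"}))
print(likelihood("download", "Zoology"))   # non-zero thanks to smoothing
```

Without smoothing, `likelihood("download", "Zoology")` would be 0 and the Zoology score would be undefined (`log(0)`), which is exactly the problem the dummy count avoids.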

### Support Vector Machines
* Please refer to [Support Vector Machine](https://github.com/Jiseong-Michael-Yang/Machine-Learning/blob/master/Support%20Vector%20Machines.md)
> You may have to install an additional plug-in, such as the one at this [link](https://chrome.google.com/webstore/detail/mathjax-plugin-for-github/ioemnmodlmafdkllaclgeombjnmnbima?hl=en), to properly render the page, though some elements may still be unavailable.

### Learning Text Classifiers in Python
* Toolkits for Supervised Text Classification
    * Scikit-learn
    * NLTK
        * Interfaces with sklearn and other ML toolkits (like Weka)

#### Scikit-learn
* Open-source Machine Learning library
* Started as Google Summer of Code by Dave Cournapeau, 2007
* It has a more programmatic interface

##### NaiveBayesClassifier

In [3]:
# Import modules
from sklearn import naive_bayes, metrics

# Choose the classifier
clfrNB = naive_bayes.MultinomialNB()

# Train the model
clfrNB.fit(train_data, train_labels)

# Test the model
predicted_labels = clfrNB.predict(test_data)

# Check the result
metrics.f1_score(test_labels, predicted_labels, average='micro')

NameError: name 'train_data' is not defined

##### SVM classifier

In [None]:
# Import modules
from sklearn import svm, metrics

# Choose the classifier: a linear kernel is typically used for text classification
clfrSVM = svm.SVC(kernel='linear', C=0.1)

# Train the model
clfrSVM.fit(train_data, train_labels)

# Test the model
predicted_labels = clfrSVM.predict(test_data)

# Check the result
metrics.f1_score(test_labels, predicted_labels, average='micro')

##### Model Selection in Scikit-learn
* train_test_split

In [None]:
# Import modules
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    train_data, train_labels, test_size=0.333, random_state=0)

* Cross-validation
    * K-fold cross-validation runs K iterations of training and testing
    * Average the results over the K iterations
    * 10-fold cross-validation is very common when the data set is large (training : test = 9 : 1)

In [None]:
# Import modules
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    train_data, train_labels, test_size=0.333, random_state=0)

predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)
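
To make the K-fold idea concrete, here is a plain-Python sketch of how the sample indices are partitioned. `k_fold_indices` is a hypothetical helper of my own; it does no shuffling or stratification, unlike the options scikit-learn offers:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Every sample lands in the test fold exactly once, so each of the K iterations evaluates on data the model did not see during that iteration's training.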

#### NLTK
* NLTK has some classification algorithms
    * NaiveBayesClassifier
    * DecisionTreeClassifier
    * ConditionalExponentialClassifier
    * MaxentClassifier
    * WekaClassifier
    * SklearnClassifier

##### NaiveBayesClassifier

In [None]:
# Import modules
import nltk
from nltk.classify import NaiveBayesClassifier

# Train the classifier
classifier = NaiveBayesClassifier.train(train_set)

# Classify new instances
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)

# Evaluate the model
nltk.classify.util.accuracy(classifier, test_set)

# Shows all the labels the classifier has trained on
classifier.labels()

# Shows the most important/informative feature
classifier.show_most_informative_features()



##### SklearnClassifier
* NLTK does not have a native SVM implementation, so scikit-learn's SVM is used through `SklearnClassifier`

In [5]:
# Import modules
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Choose the model
clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)
clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)

## Module 4: Topic Modeling
### Semantic Text Similarity
#### Application of Semantic Similarity
* Grouping similar words into semantic concepts
* As a building block in natural language understanding tasks
    * Textual entailment: deciding whether the meaning of one sentence (typically the shorter of the two) can be derived from, or is entailed by, another piece of text
    * Paraphrasing: rewriting a sentence into another sentence that has the same meaning

#### WordNet
* WordNet
    * Semantic dictionary of (mostly) English words, interlinked by semantic relations
    * Includes rich linguistic information
        * Part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms et cetera
    * Machine-readable, freely available

* Semantic Similarity Using WordNet
    * WordNet organizes information in a hierarchy
    * Many similarity measures use the hierarchy in some way
    * Verbs, nouns, and adjectives all have separate hierarchies

#### Measures of Similarity  
<img src="./img/word_hierarchy.jfif" width=500 />  
  
##### Path Similarity  
  * Find the shortest path between the two concepts
  * Similarity measure inversely related to path distance
      * PathSim(deer, elk) = 0.5 ($\frac{1}{1+1}$)
      * PathSim(deer, giraffe) = 0.33 ($\frac{1}{2+1}$)
      * PathSim(deer, horse) = 0.14 ($\frac{1}{6+1}$)
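
The three values above can be reproduced on a toy hierarchy. The child-to-parent links below are an assumption, chosen only so that the path lengths agree with the numbers in these notes; they are not read from WordNet, and `ancestors` and `path_similarity` are hypothetical helper names.

```python
# Hypothetical child -> parent links (not taken from WordNet itself)
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}


def ancestors(node):
    """Return the chain [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(u, v):
    """1 / (shortest path length + 1) between two nodes of the tree."""
    up_u = ancestors(u)
    # In a tree, the shortest path climbs from each node to the first shared ancestor
    for steps_v, node in enumerate(ancestors(v)):
        if node in up_u:
            distance = up_u.index(node) + steps_v
            return 1 / (distance + 1)
    return None  # the two nodes share no ancestor


for other in ("elk", "giraffe", "horse"):
    print(other, round(path_similarity("deer", other), 2))
```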
      
##### Lowest Common Subsumer (LCS)
* Find the closest ancestor to both concepts
    * LCS(deer, elk) = deer
    * LCS(deer, giraffe) = ruminant
    * LCS(deer, horse) = ungulate
* Lin Similarity
    * Ratio of the amount of information needed to state the commonality between two concepts to the information needed to fully describe both concepts
        * $LinSim(u,v) = \frac{2\log{P(LCS(u,v))}}{\log{P(u)}+\log{P(v)}}= \frac{2\log{\frac{N}{df(LCS(u,v))}}}{\log{\frac{N}{df(u)}}+\log{\frac{N}{df(v)}}} =\frac{2 \times \text{LCS(u,v)'s amount of information}}{\text{u's amount of information} + \text{v's amount of information}}$
        * A frequently occurring LCS may look like evidence of similarity, but this can simply reflect how frequent the concepts themselves are
            * Therefore, the value is normalized by dividing by the information content of the two words
        * $P(u), P(v)$ are given by the information content learnt over a large corpus
        > * The amount of information of a random variable X is greater when X is less likely to occur
        > * $\log{\frac{1}{P(X)}} = -\log{P(X)} = \log{\frac{N}{df}}$ ($df$: the number of documents in which X occurs)
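
As a rough numeric check of the formula, the sketch below computes Lin similarity from document frequencies. All counts are made up for illustration, and `information_content` and `lin_similarity` are hypothetical helpers, not NLTK functions.

```python
import math


def information_content(df, n_docs):
    """IC(x) = -log P(x) = log(N / df(x))."""
    return math.log(n_docs / df)


def lin_similarity(df_u, df_v, df_lcs, n_docs):
    """Twice the IC of the LCS, normalized by the IC of the two concepts."""
    ic_lcs = information_content(df_lcs, n_docs)
    ic_u = information_content(df_u, n_docs)
    ic_v = information_content(df_v, n_docs)
    return 2 * ic_lcs / (ic_u + ic_v)


# Hypothetical counts: the LCS is more general, so it occurs in more documents
N = 1000
sim = lin_similarity(df_u=10, df_v=20, df_lcs=100, n_docs=N)
print(round(sim, 4))
```

When both concepts and their LCS are equally frequent the ratio is 1; the more general (and so more frequent) the LCS is relative to the two concepts, the closer the value gets to 0.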

##### Measures of Similarity In Python

In [1]:
# WordNet easily imported into Python through NLTK
import nltk
nltk.download(["wordnet", "wordnet_ic"])
from nltk.corpus import wordnet as wn

# Pick the appropriate sense of each word: 'deer.n.01' is the first noun sense of "deer"
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

# Find path similarity
deer.path_similarity(elk) # 0.5
deer.path_similarity(horse) # 0.1428...

# Use information content (IC) to find Lin similarity
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat') # IC computed from the Brown corpus

# This measure does not explicitly use the distance between two concepts
deer.lin_similarity(elk, brown_ic) # 0.7726...
deer.lin_similarity(horse, brown_ic) # 0.8623....

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\양지성\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     C:\Users\양지성\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


0.7726998936065773

##### Collocations and Distributional Similarity
* Collocations
    * "You know a word by the company it keeps" [Firth, 1957]
    * Two words that frequently appear in similar contexts are more likely to be semantically related
        * The friends <u>met</u> <u>at</u> <u>a</u> <font color="red">café</font>.
        * Shyam <u>met</u> Ray <u>at</u> <u>a</u> <font color="red">pizzeria</font>.
        * Let's <u>meet</u> up <u>near</u> <u>the</u> <font color="red">coffee shop</font>.
        * The secret <u>meeting</u> <u>at</u> <u>the</u> <font color="red">restaurant</font> soon became public.
* Distributional Similarity: Context
    * Words before, after, within a small window
    * Part of speech of words before, after, in a small window
    * Specific syntactic relation to the target word
    * Words in the same sentence, same documents, ...
* Strength of Association between Words
    * How frequently do the two words occur together?
        * Not similar if two words don't occur together often
    * Also important to see how frequent individual words are
        * 'the' is very frequent, so chances are high that it co-occurs often with every other word (needs to be **normalized**)
        * Pointwise Mutual Information
            * $PMI(w,c)=\log{\left[\frac{P(w,c)}{P(w)P(c)}\right]}$
            * [To be added](https://bab2min.tistory.com/605)
            * Probability of the two words occurring together in a context, divided by the probability of that happening if the two words occurred independently
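
A small worked example may make the PMI formula concrete. The corpus below is made up, the "context" is taken to be the next word (adjacent pairs), and the logarithm is base 2; all three are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
tokens = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())


def pmi(w, c):
    """PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ) for an observed adjacent pair."""
    p_wc = bigrams[(w, c)] / n_bigrams
    p_w = unigrams[w] / n_unigrams
    p_c = unigrams[c] / n_unigrams
    return math.log2(p_wc / (p_w * p_c))


print("the cat:", round(pmi("the", "cat"), 3))
print("sat on :", round(pmi("sat", "on"), 3))
```

The pair ("sat", "on") occurs only once yet scores higher than ("the", "cat"), because both of its words are rare. This is why PMI is usually combined with a minimum-frequency filter.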

##### Collocations and Distributional Similarity in Python

In [9]:
# Use NLTK Collocations and Association measures
import nltk
from nltk.collocations import BigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()

# Learn based on the text
finder = BigramCollocationFinder.from_words(text) # text: a list of tokens, assumed to be defined

# Filter out any pairs that occur fewer than ten times in the corpus
finder.apply_freq_filter(10)

# Get top ten pairs of collocated words using PMI measures
finder.nbest(bigram_measures.pmi, 10) # finder also has other useful functions

* Take Home Concepts
    * Finding similarity between words and text is non-trivial
    * WordNet is a useful resource for semantic relationships between words
    * Many similarity functions exist
    * NLTK is a useful package for many such tasks   

### Topic Modeling
#### Underlying Idea
* Latent Dirichlet Allocation (Blei et al., '03)  
    <img src="./img/topic_modeling_example.jfif" width = 500 />  
    * Topic 1: Genetics (gene, sequence, genome, ...)
    * Topic 2: Computation (number, computer, analysis, ...)
    * Topic 3: Life Sciences (life, survive, organism, ...)
    * Topic 4: Anatomy (brain, neuron, nerve, ...)
         
* Intuition: Documents as a Mixture of Topics
<img src="./img/topic_modeling_intuition.jfif" width = 500 />

#### Topic Modeling
* Topic Modeling
    * A coarse-level analysis of what's in a text collection
    * Topic: the subject (theme) of a discourse
    * Topics are represented as a word distribution
        * Each topic assigns a different probability to a particular word
        * A topic is essentially a probability distribution over all words
        
* Information for Topic Modeling
    * Known information
        * The text collection or corpus
        * Number of topics
    * Unknown information
        * The actual topics
        * Topic distribution for each document

* Essentially, topic modeling is a text clustering problem
    * Documents and words clustered simultaneously
   
* Different topic modeling approaches available
    * Probabilistic Latent Semantic Analysis (PLSA) [Hofmann '99]
    * Latent Dirichlet Allocation (LDA) [Blei, Ng, and Jordan, '03]

#### Generative Models

##### Simple Generative Model for Text (Unigram Model)
<img src="./img/simple_generative_model.jfif" width = 500 />
<center>$Pr(text|model)$</center>

* Suppose we have a magic chest that gives out words
* We use those words coming from the chest to generate a document
* Conversely, given a document, we can infer the probability distribution that says how likely each word is to come out of the chest
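
A minimal sketch of $Pr(text|model)$ under a unigram model follows. The word probabilities are made up, and the words are assumed to be drawn independently, so the probability of a text is the product of its word probabilities.

```python
import math

# Hypothetical unigram model: one probability per word, summing to 1
model = {"gene": 0.4, "genome": 0.3, "computer": 0.2, "brain": 0.1}


def log_likelihood(words, model):
    """log Pr(text | model): sum of log P(word) over independently drawn words."""
    return sum(math.log(model[w]) for w in words)


text = ["gene", "genome", "gene"]
print(math.exp(log_likelihood(text, model)))  # 0.4 * 0.3 * 0.4
```

Working in log space avoids numerical underflow when the text is long.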

##### Generative Model Can Be Complex (Mixture Model)
<img src="./img/complex_generative_model.jfif" width = 500 />
<center>$Pr(text|model)$</center>

* Suppose we have several chests that give out words
* We use the words that randomly come out of them to generate the **same document** we created in the simple model
* Conversely, given the document, we can infer not only **the individual topic models (their word distributions)** but also the mixture model: how these **four topic models are combined in different _proportions_** to create one document
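
In a mixture model, the probability of a word is the proportion-weighted sum of its probability under each topic. The sketch below uses two made-up topics (rather than four) to keep it short; the topic names, word probabilities and proportions are all assumptions.

```python
import math

# Hypothetical topic models (word distributions) and mixing proportions
topics = {
    "genetics": {"gene": 0.7, "computer": 0.1, "brain": 0.2},
    "computation": {"gene": 0.1, "computer": 0.8, "brain": 0.1},
}
proportions = {"genetics": 0.75, "computation": 0.25}


def word_probability(word):
    """P(word) = sum over topics of P(topic) * P(word | topic)."""
    return sum(proportions[t] * topics[t][word] for t in topics)


def log_likelihood(words):
    """log Pr(text | mixture model) for independently drawn words."""
    return sum(math.log(word_probability(w)) for w in words)


print(word_probability("gene"))
print(log_likelihood(["gene", "computer", "gene"]))
```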

#### Latent Dirichlet Allocation (LDA)
##### Generative model for a document _d_
* Choose length of document _d_
* Choose a mixture of topics for document _d_
* Use a topic's multinomial distribution to output words to fill that topic's quota
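
The three steps can be sketched as a small generator. This is a simplification under stated assumptions: the topics and their word probabilities are made up, the mixture is passed in directly, and each topic's quota is filled in one go. In full LDA the mixture itself is drawn from a Dirichlet distribution and a topic is drawn for every word position.

```python
import random

# Hypothetical topics: each one is a multinomial distribution over words
TOPICS = {
    "genetics": {"gene": 0.5, "sequence": 0.3, "genome": 0.2},
    "computation": {"number": 0.4, "computer": 0.4, "analysis": 0.2},
}


def generate_document(length, mixture, rng):
    """Fill each topic's quota by sampling words from that topic's distribution."""
    document = []
    for topic, share in mixture.items():
        quota = round(length * share)
        words = list(TOPICS[topic])
        weights = list(TOPICS[topic].values())
        document.extend(rng.choices(words, weights=weights, k=quota))
    return document


rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document(length=10, mixture={"genetics": 0.7, "computation": 0.3}, rng=rng)
print(doc)
```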
    
##### [Latent Dirichlet Allocation for Topic Modeling](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)
* LDA is a matrix factorization technique
* In vector space, any corpus (collection of documents) can be represented as a document-term matrix (DTM)
***
* The following matrix shows a corpus of N documents $D_1, D_2, D_3, \ldots , D_N$ and vocabulary size of M words $W_1, W_2, W_3, \ldots , W_M$
* The value of $cell_{ij}$ gives the frequency count of word $W_j$ in Document $D_i$
    
|(N,M)|$W_1$|$W_2$|$W_3$|$\cdots$|$W_M$|
|-|-|-|-|-|-|
|$D_1$|0|2|1|$\cdots$|3|
|$D_2$|1|4|0|$\cdots$|0|
|$D_3$|0|2|3|$\cdots$|1|
|$\vdots$|$\vdots$|$\vdots$|$\vdots$|$\ddots$|$\vdots$|
|$D_N$|1|1|3|$\cdots$|0|
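
A document-term matrix like the one above can be built with the standard library alone. The three documents are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [
    ["gene", "genome", "gene"],
    ["computer", "analysis"],
    ["gene", "computer", "computer"],
]

# Column order of the matrix: the sorted vocabulary
vocabulary = sorted({word for doc in docs for word in doc})

# Row i holds the count of every vocabulary word in document i
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```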

* LDA converts this DTM into two lower dimensional matrices
    * $\text{M1 (N,K)}$
    * $\text{M2 (K,M)}$
* $M1$ is a **document-topics matrix** 
    
|(N,K)|$K_1$|$K_2$|$K_3$|$\cdots$|$K_K$|
|-|-|-|-|-|-|
|$D_1$|1|0|0|$\cdots$|1|
|$D_2$|1|1|0|$\cdots$|0|
|$D_3$|1|0|0|$\cdots$|1|
|$\vdots$|$\vdots$|$\vdots$|$\vdots$|$\ddots$|$\vdots$|
|$D_N$|1|0|1|$\cdots$|0|
    
* $M2$ is a **topic-terms matrix**

|(K,M)|$W_1$|$W_2$|$W_3$|$\cdots$|$W_M$|
|-|-|-|-|-|-|
|$K_1$|0|1|1|$\cdots$|1|
|$K_2$|1|1|1|$\cdots$|0|
|$K_3$|1|0|0|$\cdots$|1|
|$\vdots$|$\vdots$|$\vdots$|$\vdots$|$\ddots$|$\vdots$|
|$K_K$|1|1|0|$\cdots$|0|

* These two matrices already provide topic-word and document-topic distributions, but they need to be improved by LDA
    * LDA makes use of sampling techniques in order to improve these matrices
    * It iterates through each word $w$ in each document $d$ and tries to adjust the current topic assignment
    * A new topic $t$ is assigned to word $w$ with probability $P = P_1 \cdot P_2$

* For every topic, two probabilities $P_1$ and $P_2$ are calculated
    * $P_1=P(\text{topic t}|\text{document d})$: the proportion of words in document $d$ that are currently assigned to topic $t$
    * $P_2=P(\text{word w}|\text{topic t})$: the proportion of assignments to topic $t$, over all documents, that come from word $w$
					
![image.png](attachment:image.png)


* The current topic-word assignment is updated with a new topic with probability $P_1 \cdot P_2$
    * The model assumes that all existing word-topic assignments except the current word's are correct
    * This is essentially the probability that topic $t$ generated word $w$

* After a number of iterations, a steady state is reached where the document-topic and topic-term distributions are fairly good (**the convergence of LDA**)
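
The resampling loop described above can be sketched as a tiny collapsed Gibbs sampler. This is an illustration, not gensim's implementation: `gibbs_lda` is a hypothetical name, the documents are made up, and the smoothing constants `alpha` and `beta` are an added assumption so that no topic ever gets probability zero.

```python
import random


def gibbs_lda(docs, n_topics, n_iters, rng, alpha=0.1, beta=0.01):
    vocab_size = len({w for doc in docs for w in doc})
    # Start from a random topic for every word occurrence
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]      # words of doc d assigned to topic t
    topic_word = [dict() for _ in range(n_topics)]  # times word w is assigned to topic t
    topic_total = [0] * n_topics                    # all assignments to topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out; all other assignments are treated as correct
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = []
                for k in range(n_topics):
                    p1 = (doc_topic[d][k] + alpha) / (len(doc) - 1 + n_topics * alpha)
                    p2 = (topic_word[k].get(w, 0) + beta) / (topic_total[k] + vocab_size * beta)
                    weights.append(p1 * p2)
                # Draw the new topic with probability proportional to P1 * P2
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1
    return z, doc_topic


docs = [["gene", "genome", "gene"], ["computer", "number", "computer"], ["gene", "computer"]]
assignments, doc_topic = gibbs_lda(docs, n_topics=2, n_iters=50, rng=random.Random(0))
print(doc_topic)
```

Here `p2` is estimated as the share of topic `k`'s assignments that belong to word `w`, which is one common reading of $P(\text{word w}|\text{topic t})$.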

#### Topic Modeling in Practice
* How many topics?
    * Finding or even guessing the number of topics is hard
    * Knowledge of the specific domain helps in picking a sensible number
* Interpreting Topics
    * Topics are just word distributions
    * Making sense of words / generating labels is **subjective** 

##### Working with LDA in Python
   * Many packages available, such as gensim, lda
   * Pre-processing text
       * Tokenize, normalize (lowercase)
       * Stop word removal (also drop words that are commonly used across the documents of a specific topic)
       * Stemming
   * Convert tokenized documents to a document-term matrix
   * Build LDA models on the DTM
   * ldamodel can also be used to find topic distribution of documents
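
The pre-processing steps can be sketched with the standard library. The stop word list is a tiny made-up sample and `crude_stem` is a hypothetical stand-in for a real stemmer (such as NLTK's Porter stemmer); the result is a list of token lists, the shape these notes call `doc_set`.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}  # hypothetical, tiny list


def crude_stem(token):
    """Strip a couple of common suffixes; only a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(document):
    tokens = re.findall(r"[a-z]+", document.lower())      # tokenize and lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [crude_stem(t) for t in tokens]                # stemming


raw_docs = ["Genes are sequences in the genome.", "Computers are counting numbers."]
doc_set = [preprocess(d) for d in raw_docs]
print(doc_set)
```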

In [1]:
# Gensim is a library for topic modeling, document indexing, and similarity retrieval with large corpora
import gensim
from gensim import corpora, models

# Dictionary is a mapping between IDs and words
dictionary = corpora.Dictionary(doc_set) # doc_set: list of pre-processed (tokenized) documents, assumed to be defined

# Create DTM with all the documents in doc_set
corpus = [dictionary.doc2bow(doc) for doc in doc_set]

# Input DTM to LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)

# Print the topics
print(ldamodel.print_topics(num_topics=4, num_words=5))




##### Take Home Concepts
* Topic modeling is **an exploratory tool** frequently used for text mining
* Latent Dirichlet Allocation is a generative model used extensively for modeling large text corpora
* LDA can also be used as a **feature selection technique for text classification** and other tasks
    * Remove all **common** features coming from corpus
    * Focus on **features** on specific topics 

#### Information Extraction
##### Information is Hidden in Free-text
* Most traditional transactional information is structured
* The amount of unstructured, free-form text is growing (80% of data)
* The key is to convert unstructured text to structured form

##### Information Extraction
* Goal: identify and extract fields of interest from free text
    * Headlines
    * Author
    * Reviewer
    * Published date
    * Publisher
    
##### Fields of Interest
* Named entities
    * Noun phrases that are of a specific type and refer to specific individuals, places, organizations, ...
        * **<font color='red'>News</font>** People, Places, Dates, ...
        * **<font color='red'>Finance</font>** Money, Companies, ...
        * **<font color='red'>Medicine</font>** Diseases, Drugs, Procedures, ...
    * Recognition: Technique(s) to identify all mentions of pre-defined named entities in text
        * Identify the mention / phrases: Boundary detection
        * Identify the type: Tagging / classification
    * Examples of Named Entity Recognition Tasks  
    <img src="./img/entity_recognition_example.jpg" width=500 />
    * Approaches to Identify Named Entities
        * Depends on kinds of entities that need to be identified
        * For well-formatted fields like date, phone numbers: **Regular expressions**
        * For other fields: typically a machine learning approach
    * Person, Organization, Location/GPE
        * Standard NER task in NLP research community
        * Typically a four-class model
            * PER
            * ORG
            * LOC / GPE
            * Other / Outside (any other class)
                * e.g. John **met(Outside)** Brinda

* Relation Extraction
    * Identify relationships between named entities
        * Erbitux helps treat lung cancer
            * Erbitux $-(\text{treats})\rightarrow$ lung cancer
            * Lung cancer $-(\text{is treated by})\rightarrow$ Erbitux
    * What happened to whom, when, where, ...
    
* Co-reference Resolution
    * Disambiguate mentions and group mentions together
        * <font color='blue'>Anita</font> met <font color='red'>Joseph</font> at the market. <font color='red'>He</font> surprised <font color='blue'>her</font> with a rose.
    
* Question Answering
    * Given a question, find the most appropriate answer from the text
        * What does Erbitux treat?
        * Who gave Anita the rose?
    * Builds on named entity recognition, relation extraction, and co-reference resolution
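
For well-formatted fields such as dates and phone numbers, a regular expression is often enough. The two patterns below are assumptions covering just one format each (MM/DD/YYYY dates and NNN-NNN-NNNN phone numbers), and the sentence is made up.

```python
import re

# Hypothetical patterns: one narrow format each, for illustration only
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_fields(text):
    """Turn free text into a small structured record."""
    return {"phones": PHONE.findall(text), "dates": DATE.findall(text)}


record = extract_fields("Call 734-555-0199 before 10/23/2019 or 1/5/2020.")
print(record)
```

Entities such as people, organizations and locations have no such regular shape, which is why they are usually handled with a trained model instead.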

##### Take Home Concepts
* Information Extraction is important for natural language understanding and making sense of textual data
* Named Entity Recognition is a key building block to address many advanced NLP tasks
* Named Entity Recognition systems extensively deploy supervised machine learning and text mining techniques discussed in this course