## Module 4: Topic Modeling

### Semantic Text Similarity
### Applications of semantic similarity
#### Grouping similar words into semantic concepts
#### As a building block in natural language understanding tasks
---- Textual entailment: a shorter sentence, or a sentence from a text document, derives (entails) its meaning from another piece of text.

---- Paraphrasing

#### Which pair of words are most similar
---- a. deer, elk -- right choice, elk is one kind of deer

---- b. deer, giraffe

---- c. deer, horse

---- d. deer, mouse
##### How can we quantify this similarity?

### WordNet (most extensive in English, and a few other languages).
--- Resources useful for semantic similarity
#### Semantic dictionary of (mostly) English words, interlinked by semantic relations
---- Includes rich linguistic information: part of speech (e.g. verb or noun), word senses (different meanings of the same word), synonyms, hypernyms/hyponyms (the is-a relationship: a deer is a mammal), meronyms, derivationally related forms, ...
#### Machine-readable, freely available

### Semantic Similarity Using WordNet
#### WordNet organizes information in a hierarchy
#### Many similarity measures use the hierarchy in some way
#### Verbs, nouns, adjectives all have separate hierarchies.
e.g. elk is under deer; deer and giraffe are siblings under ruminant; deer and horse are only distantly related.

### Ways of Finding Similarity between Concepts
### 1. Path Similarity
#### Find the shortest path between the two concepts 
#### Similarity measure inversely related to path distance
PathSim(deer, elk) = 1/(distance + 1)= 1/(1+1) = 0.5

PathSim(deer, giraffe) = 1/(distance + 1) = 1/(2+1) = 0.33

PathSim(deer, horse) = 1/(distance + 1) = 1/(6+1) = 0.14
### 2. Lowest Common Subsumer(LCS)
#### Find the lowest (closest) ancestor to both concepts
LCS(deer, elk) = deer

LCS(deer, giraffe) = ruminant -- they share other ancestors too, but the LCS is the lowest one in the hierarchy.

LCS(deer, horse) = ungulate
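
The path similarity and LCS values above can be reproduced on a small hand-built hierarchy. This is a minimal sketch, assuming a simplified tree in which every concept has exactly one parent; the `PARENT` table and the function names are illustrative and are not part of WordNet or NLTK.

```python
# Toy hypernym hierarchy: each concept maps to its parent (None marks the root).
# This is a simplified, hand-built slice for illustration, not WordNet itself.
PARENT = {
    "ungulate": None,
    "even-toed ungulate": "ungulate",
    "odd-toed ungulate": "ungulate",
    "ruminant": "even-toed ungulate",
    "equine": "odd-toed ungulate",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "horse": "equine",
    "elk": "deer",
}


def ancestors(concept):
    """Return the chain [concept, parent, grandparent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lowest_common_subsumer(u, v):
    """Return the deepest concept that is an ancestor of (or equal to) both."""
    v_chain = set(ancestors(v))
    for node in ancestors(u):  # walks upward, so the first hit is the lowest
        if node in v_chain:
            return node
    raise ValueError("concepts share no ancestor")


def path_similarity(u, v):
    """1 / (shortest path length + 1); in a tree the path runs through the LCS."""
    lcs = lowest_common_subsumer(u, v)
    distance = ancestors(u).index(lcs) + ancestors(v).index(lcs)
    return 1 / (distance + 1)


for other in ("elk", "giraffe", "horse"):
    print(other, lowest_common_subsumer("deer", other),
          round(path_similarity("deer", other), 2))
```

In the real WordNet graph a synset can have several hypernyms, so a full implementation has to search over all paths rather than follow a single parent chain.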

### Lin Similarity
#### Similarity measure based on the information contained in the LCS of the two concepts
#### LinSim(u, v) = 2 * log P(LCS(u,v)) / (log P(u) + log P(v))
---- P(u) is the probability of encountering concept u, estimated from a large corpus; the information content of u is -log P(u).
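
To make the formula concrete, here is a small worked sketch. The counts in `COUNTS` and `TOTAL` are invented for illustration and do not come from any real corpus; the LCS is passed in by hand rather than looked up.

```python
import math

# Hypothetical corpus counts, invented for illustration only. A concept's count
# includes every occurrence of its descendants, so an ancestor is never rarer
# than the concepts below it.
COUNTS = {"ungulate": 1000, "deer": 120, "elk": 2, "horse": 800}
TOTAL = 100_000  # hypothetical number of noun occurrences in the corpus


def log_p(concept):
    """log P(concept), estimated as a relative frequency."""
    return math.log(COUNTS[concept] / TOTAL)


def lin_similarity(u, v, lcs):
    """LinSim(u, v) = 2 * log P(lcs) / (log P(u) + log P(v))."""
    return 2 * log_p(lcs) / (log_p(u) + log_p(v))


deer_elk = lin_similarity("deer", "elk", lcs="deer")
deer_horse = lin_similarity("deer", "horse", lcs="ungulate")
print(round(deer_elk, 3), round(deer_horse, 3))
```

With these invented counts the rarity of "elk" makes log P(elk) strongly negative, which enlarges the denominator and pulls LinSim(deer, elk) below LinSim(deer, horse), even though elk is closer to deer in the hierarchy.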

### How to do it in Python
#### 1. WordNet easily imported into Python through NLTK
#### 2. Find appropriate sense of the words
#### 3. Find path similarity
#### 4. Use information content to find Lin similarity

In [5]:
import nltk
from nltk.corpus import wordnet as wn

### Find the appropriate sense of each word
deer = wn.synset('deer.n.01')   ## first sense of "deer" as a noun
print(deer)

elk = wn.synset('elk.n.01')
print(elk)

horse = wn.synset('horse.n.01')   ## first sense of "horse" as a noun, used in the comparisons

## Find path similarity
print(deer.path_similarity(elk))    ## 0.5
print(deer.path_similarity(horse))  ## 0.14

In [7]:
### Use information content to find Lin similarity
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')   ## information content computed from the Brown corpus

deer.lin_similarity(elk, brown_ic)     ### 0.772
deer.lin_similarity(horse, brown_ic)   ### 0.862

## In the WordNet hierarchy deer and horse are far apart, yet their Lin similarity is high. Lin similarity depends on
## information content rather than path length, and in typical contexts both words are common mammal terms.
## Elk is a very specific kind of deer that rarely appears in documents, so its Lin similarity with deer comes out lower.

### Collocations and Distributional Similarity
#### 'You shall know a word by the company it keeps' [Firth, 1957]
#### Two words that frequently appear in similar contexts are more likely to be semantically related.
eg:
The friends met at a cafe.

Shyam met Ray at a pizzeria.

Let's meet up near the coffee shop.

The secret meeting at the restaurant soon became public.


Explanation: cafe, pizzeria, coffee shop, and restaurant are semantically similar because they appear in similar contexts, close to words such as 'meet', 'at', and 'near'.

### Distributional Similarity: Context
#### Words before, after, within a small window
e.g. for 'coffee shop' in the third sentence: the word before is 'the'; within a small window we find 'meet'
#### Parts of speech of words before, after, in a small window
#### Specific syntactic relation to the target word
#### Words in the same sentence, same document, ...
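
As a rough illustration of the first kind of context (words within a small window), the sketch below counts the neighbours of a target word and compares two words by the cosine of their context counts. The window size, the lowercasing, the joined `coffee_shop` token, and the function names are all assumptions made for this example.

```python
from collections import Counter
import math

SENTENCES = [
    "the friends met at a cafe",
    "shyam met ray at a pizzeria",
    "let's meet up near the coffee_shop",  # "coffee shop" joined into one token
    "the secret meeting at the restaurant soon became public",
]


def context_counts(target, sentences, window=3):
    """Count the words within `window` positions of each occurrence of target."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


cafe = context_counts("cafe", SENTENCES)
pizzeria = context_counts("pizzeria", SENTENCES)
print(cafe, pizzeria, cosine(cafe, pizzeria))
```

Four sentences are far too little data for reliable estimates; the point is only that shared neighbours such as 'at' and 'a' push the two context vectors toward each other.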

### Strength of Association between Words
#### How frequent are these?
---- Not similar if two words don't occur together often
#### Also important to see how frequent the individual words are
---- 'the' is very frequent, so chances are high that it co-occurs often with every other word
#### Thus we use Pointwise Mutual Information (PMI), which normalizes the co-occurrence by the individual word frequencies
PMI(w, c) = log [P(w, c) / (P(w) P(c))]
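
A minimal sketch of the PMI computation on adjacent word pairs follows. The toy corpus is made up, and the choice of base-2 logarithm is an assumption (the formula above does not fix the base).

```python
import math
from collections import Counter

# Tiny made-up corpus; real PMI estimates need far more text.
tokens = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1


def pmi(w, c):
    """PMI(w, c) = log2(P(w, c) / (P(w) * P(c))) for the adjacent pair (w, c).

    Raises ValueError for a pair that never occurs, because log2(0) is undefined.
    """
    p_wc = bigrams[(w, c)] / n_bigrams
    p_w = unigrams[w] / n_unigrams
    p_c = unigrams[c] / n_unigrams
    return math.log2(p_wc / (p_w * p_c))


print(pmi("the", "cat"), pmi("sat", "on"))
```

Here "the cat" occurs twice and "sat on" only once, yet "sat on" gets the higher PMI because 'the' is frequent on its own. This is also why very rare pairs can dominate a PMI ranking unless they are filtered by frequency.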

### How to do it in Python
#### 1. Use NLTK Collocations and Association Measures
#### finder also has other useful functions, such as frequency filter

In [9]:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)   ## text: the corpus as a list of word tokens

## finder has other useful functions, such as a frequency filter
finder.apply_freq_filter(10)   ## ignore any pair that occurs fewer than 10 times in the corpus

finder.nbest(bigram_measures.pmi, 10)  ## rank the remaining pairs by PMI and return the top 10

### Take Home Concepts
#### Finding similarity between words and text is non-trivial
#### WordNet is a useful resource for semantic relationships between words and semantic similarity
#### Many similarity functions exist
#### NLTK is a useful package for many such tasks

### Topic Modeling

### Documents exhibit multiple topics
Example article: 'Seeking Life's Bare (Genetic) Necessities'
#### Topic 1: Genetics: gene, sequence, genome,...
#### Topic 2: Computation: number, computer, analysis, prediction
#### Topic 3: Life sciences: life, survive, organism, ...
### Intuition: Documents as a mixture of topics

### What is Topic Modeling?
#### A coarse-level analysis of what's in a text collection.
#### Topic: the subject (theme) of a discourse
#### Topics are represented as a word distribution
#### A document is assumed to be a mixture of topics

### What is Topic Modeling (2) ?
#### What's known: The text collection or corpus; Number of topics
#### What's not known: The actual topics; Topic distribution for each document.
### What is Topic Modeling (3) ?
#### Essentially, text clustering problem
---- Documents and words clustered simultaneously

---- The model needs to figure out which words go together (semantic similarity) and which documents go together (same topic), along with the distribution of words in each document and the probability of each word in a topic.
#### Different topic modeling approaches available.
---- Probabilistic Latent Semantic Analysis (PLSA) [Hofmann, '99]

---- Latent Dirichlet Allocation (LDA) [Blei, Ng, and Jordan, '03]

### Generative Models and LDA
### Generative Models for Text
Generation: Use the model to generate the documents

Inference, Estimation: Use the documents to estimate the model, i.e. the distribution of words

Pr(text | model), eg Pr(the|model) = 0.1    Pr(is|model) = 0.07   Pr(harry |model) = 0.05    Pr(Potter|model)=0.04

Explanation: 1. Generation story: we have a single topic model and draw words from it to create the document.

2. Inference, Estimation: given the documents, estimate the model that defines Pr(text | model).

It can be more complex: one document can be generated by several models at once (a mixture model), e.g. four models with four distributions of words.
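
Both directions can be sketched with a single unigram model. The word probabilities are the ones from the example above; the function names and the sample text are illustrative assumptions.

```python
import math
from collections import Counter

# Word probabilities taken from the example above; all other words are left
# out for simplicity (a real model would cover the whole vocabulary).
model = {"the": 0.1, "is": 0.07, "harry": 0.05, "potter": 0.04}


def log_prob(words, model):
    """Generation direction: log Pr(text | model) under a unigram model."""
    return sum(math.log(model[w]) for w in words)


def estimate(words):
    """Estimation direction: relative word frequencies from observed text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


print(math.exp(log_prob(["harry", "potter", "is"], model)))
print(estimate("the boy is the hero".split()))
```

Working in log space avoids numerical underflow when the text is long, since the raw probability is a product of many small numbers.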

### Latent Dirichlet Allocation (LDA)
#### Generative model for a document d :
---- 1. Choose length of document d

---- 2. Choose a mixture of topics for document d

---- 3. Use a topic's multinomial distribution to output words to fill that topic's quota.

e.g. Suppose 40% of the words in a particular document d come from topic A; then topic A's multinomial distribution is used to output that 40% of the words.
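
The generative story can be sketched in a few lines. This is an illustration under stated assumptions, not a full LDA implementation: the two topics and their word probabilities are hypothetical, the document length is passed in rather than sampled, and the topic is drawn separately for each word (the usual formulation), which fills each topic's share only in expectation.

```python
import random

# Hypothetical topics: each is a word distribution (a multinomial over a vocabulary).
TOPICS = {
    "genetics": {"gene": 0.5, "genome": 0.3, "sequence": 0.2},
    "computation": {"computer": 0.4, "analysis": 0.3, "prediction": 0.3},
}


def sample_dirichlet(alphas, rng):
    """Draw topic proportions from a Dirichlet via normalized Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(length, rng, alpha=0.5):
    """Follow the steps above: given a length, pick a topic mixture, then emit words."""
    names = list(TOPICS)
    mixture = sample_dirichlet([alpha] * len(names), rng)  # step 2: topic mixture
    words = []
    for _ in range(length):                                # step 3: emit words
        topic = rng.choices(names, weights=mixture)[0]     # choose a topic for this word
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return dict(zip(names, mixture)), words


rng = random.Random(0)
mixture, words = generate_document(length=10, rng=rng)
print(mixture)
print(words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixtures and word distributions have to be inferred.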

### Topic Modeling in Practice
#### How many topics?
---- Finding or even guessing the number of topics is hard (you have to choose it yourself).
#### Interpreting topics
---- Topics are just word distributions.

---- Making sense of words / generating labels for that topic is a subjective decision.

### Topic Modeling: Summary
#### Great tool for exploratory text analysis
---- What are the documents (tweets, reviews, news articles) about?
#### Many tools available to do it effortlessly in Python.

### Working with LDA in Python
#### Many packages available, such as gensim, lda
#### Step 1: Pre-processing text before using the packages
---- Tokenize, normalize (lowercase)

---- Stop word removal

---- Stemming
#### Step 2: Convert tokenized documents to a document - term matrix
#### Step 3: Build LDA models on the doc-term matrix
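
Steps 1 and 2 can be sketched with the standard library alone. The stop-word list and the two documents are made up for this example, and stemming is left out; a real pipeline would typically use a proper stemmer and a full stop-word list.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}

docs = [
    "The genome is a sequence of genes.",
    "Computer analysis of the genome sequence.",
]


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


doc_set = [preprocess(d) for d in docs]
vocabulary = sorted({t for doc in doc_set for t in doc})

# Document-term matrix: one row per document, one column per vocabulary word.
doc_term = []
for doc in doc_set:
    counts = Counter(doc)
    doc_term.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in doc_term:
    print(row)
```

Without stemming, 'genes' and 'genome' stay separate columns and 'gene' would be a third; merging such variants is what the stemming step is for.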

#### Practice LDA in Python:
#### doc_set : set of pre-processed text documents
#### ldamodel can also be used to find topic distribution of documents

In [2]:
import gensim
from gensim import corpora, models
dictionary = corpora.Dictionary(doc_set)  ## maps each unique word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in doc_set]   ## document-term matrix as bag-of-words vectors
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 4, id2word = dictionary, passes = 50)
## id2word: the id-to-word mapping, so topics are reported as words rather than ids
print(ldamodel.print_topics(num_topics = 4, num_words = 5))  ## the top five words in each of the four topics

### Take Home Concepts
#### Topic modeling is an exploratory tool frequently used for text mining
#### Latent Dirichlet Allocation is a generative model used extensively for modeling large text corpora.
#### LDA can also be used as a feature selection technique for text classification and other tasks.

## Information Extraction

### Goal: Identify and extract fields of interest from free text
Fields of interest: Title, Author, Reviewer, Published time, Place

### Fields of Interest
#### Named entities:
---- [NEWS] People, Places, Dates, Organizations, ...

---- [FINANCE] Money, Companies, ...

---- [MEDICINE] Diseases, Drugs, Procedures, ...
#### Relations
---- What happened to whom, when, where, ...

### Information is hidden in free-text
#### Most traditional transactional information is structured
#### Currently, about 80% of data is unstructured, free-form text.
#### How to convert unstructured text to structured form? -- extract useful information from unstructured text

### Named Entity Recognition
#### Named entities: Noun phrases that are of specific type and refer to specific individuals, places, organizations, ...
#### Named Entity Recognition: Technique(s) to identify all mentions of pre-defined named entities in text
---- Identify the mention / phrase : Boundary detection (find start and end position)

---- Identify the type: Tagging / classification


### Approaches to identify named entities
#### Depends on the kinds of entities that need to be identified.
#### For well-formatted fields like dates and phone numbers: use Regular Expressions
#### For other fields: typically a machine learning approach that classifies words into different types
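
For the well-formatted case, a regular-expression sketch might look like the following. The patterns assume US-style phone numbers and month/day/year dates; other formats would need their own patterns, and the sample text is invented.

```python
import re

text = "Call 555-123-4567 before 10/23/2017, or (555) 987-6543 after 1/5/18."

# Assumed formats: US-style phone numbers and month/day/year dates.
PHONE = re.compile(r"\(?\d{3}\)?[-\s]\d{3}-\d{4}")
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b")

phones = PHONE.findall(text)
dates = DATE.findall(text)
print(phones)
print(dates)
```

These patterns check only the shape of the string, not its validity: a date such as 99/99/99 would also match.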

### Person, Organization, Location/GPE
#### Standard NER task in NLP research community
#### Typically a four-class model:
---- 1. PER

---- 2. ORG

---- 3. LOC / GPE

---- 4. Other / Outside (any other class).

e.g. John met Brendon. John and Brendon are PER; met is Outside.
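
Once every token has a class, boundary detection can be as simple as grouping neighbouring tokens with the same label. The sketch below assumes the per-token tags are already available (here they are written by hand in place of a classifier's output); the function name and the longer example sentence are illustrative.

```python
def extract_mentions(tokens, tags):
    """Group consecutive tokens that share a non-Outside tag into mentions."""
    mentions = []
    current_tokens, current_tag = [], None
    for token, tag in zip(tokens, tags):
        if tag == current_tag and tag != "O":
            current_tokens.append(token)  # extend the open mention
            continue
        if current_tokens:                # close the previous mention
            mentions.append((" ".join(current_tokens), current_tag))
        current_tokens, current_tag = ([token], tag) if tag != "O" else ([], None)
    if current_tokens:
        mentions.append((" ".join(current_tokens), current_tag))
    return mentions


# Hand-written tags standing in for a classifier's per-token predictions.
tokens = ["John", "met", "Brendon", "at", "New", "York", "University", "."]
tags = ["PER", "O", "PER", "O", "ORG", "ORG", "ORG", "O"]
print(extract_mentions(tokens, tags))
```

This simple grouping merges two adjacent mentions of the same type into one, which is the reason tagging schemes with explicit begin/inside markers are often used in practice.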

### Relation extraction
#### Identify relationships between named entities.
eg: Erbitux helps treat lung cancer. Erbitux(treatment) and lung cancer(disease) are two named entities.

Relationship: Erbitux is a treatment for lung cancer.
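
A very rough way to pull out this one relation is a surface pattern. The sketch below assumes a single hypothetical pattern ("X treats Y" or "X helps treat Y") and a capitalized treatment name; practical systems usually learn relations from labeled data rather than relying on hand-written patterns.

```python
import re

# Hypothetical surface pattern: "<Treatment> treats / helps treat <disease>".
TREATS = re.compile(
    r"(?P<treatment>[A-Z]\w+) (?:helps treat|treats) (?P<disease>[\w ]+?)[.,]"
)


def extract_treatment_relations(text):
    """Return (treatment, 'treats', disease) triples matched by the pattern."""
    return [(m["treatment"], "treats", m["disease"]) for m in TREATS.finditer(text)]


print(extract_treatment_relations("Erbitux helps treat lung cancer."))
```

The pattern misses any paraphrase it was not written for, such as "lung cancer is treated with Erbitux".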

### Co-reference resolution
#### Disambiguate mentions and group mentions together.
eg: Anita met Joseph at the market. He surprised her with a rose.

Anita and Joseph are two named entities; 'He' refers to Joseph and 'her' refers to Anita.

In this case it is pronoun resolution: working out which entity each pronoun refers to.

### Question Answering (why important)
#### Given a question, find the most appropriate answer from the text.
---- eg. What does Erbitux treat?

First we need to identify that Erbitux is a treatment and that the relation is 'treat'.

---- eg. Who gave Anita the rose?
#### Builds on named entity recognition, relation extraction, and co-reference resolution.

### Take Home Concepts
#### Information Extraction is important for natural language understanding and making sense of textual data.
#### Named Entity Recognition is a key building block to address many advanced NLP tasks.
#### Named Entity Recognition systems extensively deploy supervised machine learning and text mining techniques discussed in this course.
