# Semantic Text Similarity

## WordNet
    a) Organize information in hierarchy
    b) Many similarity measures using this hierarchy in some way
        i. Path similarity: find the shortest path between the two concepts; similarity is inversely related to path length
        ii. Lin similarity & lowest common subsumer (LCS): the LCS is the closest ancestor of both concepts;
            LinSim(u,v) = 2 * log P(LCS(u,v)) / (log P(u) + log P(v))
        iii. Collocations and Distributional Similarity: 
             Collocation: "You shall know a word by the company it keeps" (Firth, 1957)
             Two words that frequently appear in similar contexts are more likely to be semantically related
             e.g., cafe, pizzeria, coffee shop, and restaurant all occur with "meet" and "at"
             Distributional Similarity: Context
             1. words before, after, within a small window
             2. parts of speech of words before, after, in a small window (after a location morality)
             3. specific syntactic relation to the target word
             4. same sentence, same document, ...
             Strength of association between words:
             How frequently do they co-occur?
             What is the baseline frequency of the individual words?
             - Normalization: Pointwise Mutual Information, PMI(w,c) = log [P(w,c) / (P(w) P(c))]
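The window-based notion of context and the PMI normalization above can be sketched in plain Python. This is a minimal illustration, not part of the original notes: the toy sentences, the window size, and the helper names `contexts` and `pmi` are all assumptions made for the example, and probabilities are estimated from the co-occurrence counts themselves.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
sentences = [
    "meet me at the cafe".split(),
    "meet me at the pizzeria".split(),
    "meet me at the restaurant".split(),
    "the report is at the office".split(),
]

WINDOW = 2  # assumed: context = words within 2 positions on either side

# Count (word, context word) co-occurrences inside the window.
pair_counts = Counter()
for tokens in sentences:
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(w, tokens[j])] += 1

# Marginal counts for words and for context words.
word_marginal = Counter()
context_marginal = Counter()
for (w, c), n in pair_counts.items():
    word_marginal[w] += n
    context_marginal[c] += n
total_pairs = sum(pair_counts.values())


def contexts(word):
    """Set of context words observed around `word`."""
    return {c for (w, c) in pair_counts if w == word}


def pmi(w, c):
    """PMI(w, c) = log [P(w, c) / (P(w) P(c))], using the natural log."""
    n_wc = pair_counts[(w, c)]
    if n_wc == 0:
        return float("-inf")  # the pair was never observed together
    return math.log(n_wc * total_pairs / (word_marginal[w] * context_marginal[c]))


print(contexts("cafe"), contexts("pizzeria"))
print(pmi("meet", "me"), pmi("cafe", "office"))
```

In this toy corpus "cafe" and "pizzeria" share exactly the same contexts, which is the distributional signal that they are related. A positive PMI means the pair co-occurs more often than the individual frequencies would predict.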

In [7]:
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.


True

### Path Similarity

In [8]:
# Find appropriate sense of the words
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

# Find path similarity
deer.path_similarity(elk) # .5
deer.path_similarity(horse) # .14

0.14285714285714285
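The scores above are consistent with 1 / (d + 1), where d is the length of the shortest path between the two synsets: 0.5 = 1/2 and 0.1428... = 1/7. The sketch below reproduces that computation on a small hand-made taxonomy. The `PARENT` links are hypothetical and much shallower than the real WordNet graph, so the deer/horse value differs from the NLTK output.

```python
from collections import deque

# Hypothetical toy hypernym links (child -> parent); NOT the real WordNet graph.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "ruminant": "ungulate",
    "horse": "equine",
    "equine": "ungulate",
    "ungulate": "mammal",
}


def neighbors(node):
    """Nodes one edge away: the parent (if any) plus all children."""
    out = {child for child, parent in PARENT.items() if parent == node}
    if node in PARENT:
        out.add(PARENT[node])
    return out


def shortest_path_length(a, b):
    """Breadth-first search over the undirected taxonomy graph."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected


def path_similarity(a, b):
    """Similarity is inversely related to path length: 1 / (d + 1)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)


print(path_similarity("deer", "elk"))
print(path_similarity("deer", "horse"))
```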

### LinSim

In [9]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat') # brown corpus

deer.lin_similarity(elk,brown_ic) # .86
deer.lin_similarity(horse,brown_ic)

0.7726998936065773
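To make the Lin formula concrete, here is a minimal sketch that computes LinSim(u,v) = 2 * log P(LCS(u,v)) / (log P(u) + log P(v)) on a toy taxonomy. The probabilities in `P` and the `PARENT` links are invented for illustration; in the cell above they come from the Brown information-content file instead.

```python
import math

# Hypothetical concept probabilities; an ancestor's mass includes its descendants.
P = {"mammal": 0.20, "ungulate": 0.05, "deer": 0.010, "elk": 0.002, "horse": 0.015}

# Hypothetical hypernym links (child -> parent).
PARENT = {"elk": "deer", "deer": "ungulate", "horse": "ungulate", "ungulate": "mammal"}


def ancestors(concept):
    """The concept itself followed by its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lcs(u, v):
    """Lowest common subsumer: the closest ancestor shared by u and v."""
    v_ancestors = set(ancestors(v))
    for candidate in ancestors(u):
        if candidate in v_ancestors:
            return candidate
    return None


def lin_similarity(u, v):
    """2 * log P(LCS(u, v)) / (log P(u) + log P(v))."""
    return 2 * math.log(P[lcs(u, v)]) / (math.log(P[u]) + math.log(P[v]))


print(lcs("deer", "horse"), lin_similarity("deer", "horse"))
print(lcs("deer", "elk"), lin_similarity("deer", "elk"))
```

A rarer (more specific) LCS carries more information, so deer/elk, whose LCS is "deer", scores higher than deer/horse, whose LCS is the more general "ungulate".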

### Collocation & Association Measures

In [10]:
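import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# `text` must be a sequence of tokens (e.g., a tokenized corpus); it is not defined in this cell
finder = BigramCollocationFinder.from_words(text)

# Ten bigrams with the highest PMI scores
finder.nbest(bigram_measures.pmi, 10)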

The finder also offers other useful methods, such as a frequency filter,

e.g., finder.apply_freq_filter(10), which drops bigrams that occur fewer than 10 times

# Topic Modeling

## What?
    
    A coarse-level analysis of what is in a text collection
    a) Topic: the subject/theme of a discourse
    b) Topics are represented as a word distribution
    c) In practice,
        i. What's known?
            1. the text collection or corpus
            2. the number of topics
        ii. What's unknown?
            1. the actual topics
            2. topic distribution for each document
    d) Text clustering problem: documents and words are clustered simultaneously
    e) Different approaches available
        i. Probabilistic Latent Semantic Analysis (PLSA, 1999)
        ii. Latent Dirichlet Allocation (LDA, 2003)

### Generative Models and LDA

    Pr(text|model)
    a) Generative models: 
    Model ("chest" of words) -- generation --> Document -- inference, estimation --> Model
    i. individual models
    ii. mixture models: how models (topics) are combined to generate a document
    
    b) LDA: Generative model for a document d
    i. choose length of doc d 
    ii. choose a mixture of topics for doc d
    iii. use a topic's multinomial distribution to output words to fill that topic's quota
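The three generative steps above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the two `TOPICS`, the uniform length range (LDA is usually described with a Poisson length), and the Dirichlet parameter `alpha`. The sketch draws a topic for each word, which is the usual per-word formulation of "filling a topic's quota".

```python
import random

# Hypothetical topics: each one is a multinomial distribution over words.
TOPICS = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"market": 0.5, "stock": 0.3, "bank": 0.2},
}


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional Dirichlet(alpha, ..., alpha) sample via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, min_len=5, max_len=15, alpha=0.5):
    # i. choose the length of the document (uniform here, as a simplification)
    length = rng.randint(min_len, max_len)
    # ii. choose a mixture of topics for the document
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names), rng)
    # iii. for each word slot, pick a topic from the mixture,
    #      then pick a word from that topic's multinomial
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=mixture)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return dict(zip(names, mixture)), words


rng = random.Random(0)  # fixed seed so the run is repeatable
mixture, words = generate_document(rng)
print(mixture)
print(" ".join(words))
```

Fitting LDA runs this process in reverse: given only the generated words, it estimates the topics and each document's mixture.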

## Topic Modeling in Practice
    a) How many topics?
        Finding or even guessing the number of topics is hard
    b) How to interpret topics?
        Topics are just distributions over words
        

## Summary
    a) exploratory text analysis:
        What are the documents about?
    b) LDA: gensim, lda packages
    i. Preprocessing:
        1. Tokenize, normalize (lowercase)
        2. Stop word removal (domain-specific)
        3. Stemming
    ii. Converting to a document-term matrix (DTM)
    iii. Building LDA on top of the DTM
        1. doc_set: the list of preprocessed (tokenized) documents
        import gensim
        from gensim import corpora, models
            
        dictionary = corpora.Dictionary(doc_set) # map each token to an integer id
        corpus = [dictionary.doc2bow(doc) for doc in doc_set] # DTM; bow = bag of words
        ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
        print(ldamodel.print_topics(num_topics=4, num_words=5))
    LDA is used to explore large corpora
    It can also be used for feature selection
    It is often a first step in text mining
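The preprocessing and DTM steps listed above can be sketched without any third-party package. This is only an approximation of what the gensim calls do: the stop-word list and sample documents are hypothetical, stemming is left out, and the `token2id` mapping and `doc2bow` helper are written here to mimic the role of `corpora.Dictionary` and its `doc2bow` method.

```python
import re
from collections import Counter

# Hypothetical stop-word list; in practice it should be domain-specific.
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to"}


def preprocess(text):
    """Tokenize, lowercase, and remove stop words (stemming omitted for brevity)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


raw_docs = [
    "The team won the game.",
    "The stock market is falling.",
    "A game of the market and the team.",
]
doc_set = [preprocess(doc) for doc in raw_docs]

# Vocabulary: token -> integer id, assigned in order of first appearance.
token2id = {}
for doc in doc_set:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def doc2bow(doc):
    """Bag-of-words row of the DTM: sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


corpus = [doc2bow(doc) for doc in doc_set]
print(doc_set)
print(corpus)
```

Each entry of `corpus` is one sparse row of the document-term matrix, which is the input format an LDA implementation trains on.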

# Information Extraction
    
    a) Goal: identify and extract fields of interest from free text
    i. named entities
        [NEWS] People, Places, Dates, Geographic Entities, ... (typically capitalized)
        [FINANCE] Money, Companies, ...
        [MEDICINE] Drugs, Diseases, Procedures, ...
        
        HOW?
        Techniques to identify all mentions of pre-defined named entities in text:
        1. Identify the mention/phrase: boundary detection
        2. Identify the type/tag
            
    ii. relations
        What happened to whom, when, and where
    

## How to identify named entities

    Depends on kinds of entities that need to be identified
        a) for well-formatted fields like dates, phone numbers: Regular Expressions
        b) for other fields: Typically a ML approach
    
    Standard NER task in NLP: four-class model
        PER
        ORG
        LOC/GPE
        Other/Outside (any other class)
    * For NEWS text, nltk has a built-in named-entity chunker (nltk.ne_chunk)
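For the well-formatted fields mentioned above, a regular expression is often enough. The sketch below assumes two date layouts (`YYYY-MM-DD` and `M/D/YYYY`) and two US-style phone layouts; both patterns and the sample sentence are illustrative assumptions, and real data would need broader patterns.

```python
import re

# Assumed date formats: 2021-03-15 or 3/20/2021.
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")

# Assumed phone formats: (555) 123-4567 or 555-123-4567 (dots also accepted).
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.])\d{3}[-.]\d{4}\b")

text = (
    "Call (555) 123-4567 or 555-987-6543 before 2021-03-15; "
    "the visit on 3/20/2021 is confirmed."
)

dates = DATE_RE.findall(text)
phones = PHONE_RE.findall(text)

print(dates)   # ['2021-03-15', '3/20/2021']
print(phones)  # ['(555) 123-4567', '555-987-6543']
```

Entities without a fixed surface form, such as person or organization names, are where the ML approach takes over.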

## Relation Extraction
    Co-reference Resolution
    i. Disambiguate mentions and group together those that refer to the same entity