## Keyphrases:

### What?
- The most important topics in a document can be represented as keyphrases.
- Keyphrases are also called key terms, key segments, or simply keywords.
- E.g.: 


### Why?
- We need to understand what users are talking about.
- There is a huge number of documents to go through, such as user reviews, chat logs, and customer support tickets.
- The extracted insights help build or improve your product.


### How?

#### Preliminaries
- Noun Phrase:
A word or group of words containing a noun and functioning in a sentence as subject, object, or prepositional object.
    - [We] are attending [NLP Classes].
    - [Ram] killed [Ravan].


- Term Frequency (TF):
The number of times a term or phrase occurs in a document, expressed relative to the total number of terms in that document.
    - In the recent past, the government has taken constructive actions towards terrorism. -->
    {the: 2/12, in: 1/12, recent: 1/12, past: 1/12, ... }
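
The relative frequencies in the example above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `term_frequency` and the simple regex tokenizer are assumptions made for illustration and are not part of the pipeline further down.

```python
import re
from collections import Counter

def term_frequency(text):
    """Return {term: count / total_terms} for a single document."""
    # Assumption: lowercase alphabetic tokens are good enough for this example.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

sentence = ("In the recent past, the government has taken "
            "constructive actions towards terrorism.")
tf = term_frequency(sentence)
print(tf["the"])     # "the" accounts for 2 of the 12 tokens
print(tf["recent"])  # "recent" accounts for 1 of the 12 tokens
```

Because every count is divided by the same total, the scores of one document sum to 1, which keeps long and short documents comparable.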


- Inverse Document Frequency (IDF):
A term that appears in many documents is considered less important than a rare term that appears in only a few documents.
    - Doc1: This is a document with tag 1.
    - Doc2: However, it is important to have different words.
    - Doc3: This is a document with tag 3.
    - Document Frequency (DF) {is: 3, this: 2, a: 2, ... }
    - Total Documents (N): 3
    - Inverse Document Frequency (IDF) = $\log{\frac{N}{DF + 1}}$
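
As a rough check of the numbers above, the following sketch computes DF and IDF for the three example documents. The regex tokenizer and the variable names are assumptions chosen for this illustration.

```python
import math
import re

docs = [
    "This is a document with tag 1.",
    "However, it is important to have different words.",
    "This is a document with tag 3.",
]

def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

# Document frequency: number of documents that contain the term at least once.
doc_freq = {}
for doc in docs:
    for term in set(tokenize(doc)):
        doc_freq[term] = doc_freq.get(term, 0) + 1

n_docs = len(docs)
idf = {term: math.log(n_docs / (df + 1)) for term, df in doc_freq.items()}

print(doc_freq["is"], doc_freq["this"], doc_freq["a"])
print(idf["is"], idf["this"], idf["words"])
```

With the `+ 1` in the denominator, a term that occurs in every document ("is") gets a slightly negative IDF, and a term that occurs in all but one document ("this") gets exactly 0. Only rarer terms such as "words" receive a positive weight.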
    

### Algorithm

### Data Fetch

In [21]:
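import glob

def fetch_data():
    """Read every Hulth2003 training abstract into one string per file."""
    abstract_files = glob.glob("AutomaticKeyphraseExtraction/Hulth2003/train/*.abstr")
    full_data = []
    for file in abstract_files:
        # The context manager closes the file handle after reading.
        with open(file, 'rb') as f:
            lines = f.readlines()
        # Decode each line, trim surrounding whitespace and join into one string.
        file_data = " ".join(line.decode("utf-8").strip() for line in lines)
        full_data.append(file_data)
    return full_data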

### Identify Noun Phrases

In [45]:
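import nltk

def get_np_chunks(paragraph):
    """Return the noun-phrase chunks found in `paragraph` as strings."""
    # The tag patterns are matched against the tags produced by nltk.pos_tag.
    # Its default English tagger uses Penn Treebank tags, where possessive
    # pronouns are PRP$, so the optional <PP\$> does not absorb words like "our".
    grammar = r"""
      NP: {<PP\$>?<JJ>*<NN>+} 
          {<NN>*<NNP>+}                # chunk sequences of proper nouns
          {<NN>*<NNS>+}
          {<NN>+}
    """
    #{<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun

    # One parser is shared by all sentences of the paragraph.
    chunk_parser = nltk.RegexpParser(grammar)

    phrases = []
    for sent in nltk.sent_tokenize(paragraph):
        pos_tags = nltk.pos_tag(nltk.word_tokenize(sent))
        chunked = chunk_parser.parse(pos_tags)
        for subtree in chunked.subtrees(filter=lambda t: t.label() == 'NP'):
            # Each leaf is a (word, tag) pair; keep only the words.
            phrase = " ".join(word for word, tag in subtree.leaves())
            phrases.append(phrase)
    return phrases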
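
To see how the chunk rules behave without downloading tokenizer or tagger models, the sketch below runs the same grammar on a sentence fragment that is already tagged. The Penn Treebank tags are assigned by hand here, which is an assumption; a real tagger may label some words differently.

```python
import nltk

grammar = r"""
  NP: {<PP\$>?<JJ>*<NN>+}
      {<NN>*<NNP>+}
      {<NN>*<NNS>+}
      {<NN>+}
"""
parser = nltk.RegexpParser(grammar)

# Hand-tagged tokens (assumed tags, not tagger output).
tagged = [
    ("our", "PRP$"), ("primary", "JJ"), ("concern", "NN"),
    ("is", "VBZ"),
    ("the", "DT"), ("biological", "JJ"), ("niches", "NNS"),
]

tree = parser.parse(tagged)
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(phrases)
```

Only the first rule allows adjectives, and it requires a singular noun (`NN`). A plural head such as "niches" is therefore picked up by the third rule without its adjective, which is why single plural words can show up among the candidates.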


### Generate Candidate Phrases

In [58]:
import re

def clean_text(text):
    # Lowercase, then replace every character other than a-z, 0-9 and space with a space.
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return text.strip()

def get_candidate_phrases(full_data):
    
    doc_phrases = []
    for doc in full_data:
        phrases = get_np_chunks(doc)
        
        cleaned_phrases = []
        for phrase in phrases:
            text = clean_text(phrase)
            if len(text) > 1:
                cleaned_phrases.append(text)
        
        doc_phrases.append(cleaned_phrases)
    return doc_phrases
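
The effect of the cleaning and length filter is easiest to see on a few inputs. The sketch below repeats `clean_text` so that it runs on its own; the sample phrases are chosen for illustration only.

```python
import re

def clean_text(text):
    # Same normalization as above: lowercase, then blank out everything
    # except lowercase letters, digits and spaces.
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return text.strip()

samples = ["mind-independent world", "Realist conception", "%", "x"]
cleaned = [clean_text(s) for s in samples]

# Mirror the `len(text) > 1` check: drop empty and single-character results.
kept = [c for c in cleaned if len(c) > 1]
print(cleaned)
print(kept)
```

Hyphens become spaces, so "mind-independent world" turns into the three-word phrase "mind independent world", while the punctuation-only and single-character chunks are discarded.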

In [59]:
full_data = fetch_data()
candidates = get_candidate_phrases(full_data)

### Candidate Scoring using IDF

In [54]:
import math
from collections import Counter


def get_inverse_document_frequency(candidates):
    """Map each candidate phrase to log(N / (1 + DF)) over all documents."""
    total_docs = len(candidates)

    # Count each phrase at most once per document to get its document frequency.
    doc_freq = Counter()
    for datum in candidates:
        doc_freq.update(set(datum))

    inv_doc_freq = {}
    for phrase, freq in doc_freq.items():
        inv_doc_freq[phrase] = math.log(total_docs / (1.0 + freq))

    return inv_doc_freq
    

In [62]:
inv_doc_freq = get_inverse_document_frequency(candidates)

### Extracting Keyphrases

In [71]:
from operator import itemgetter

def get_keywords(doc_id):
    """Rank the candidate phrases of document `doc_id` by TF-IDF, highest first.

    Uses the module-level `candidates` and `inv_doc_freq` computed above.
    """
    phrases = candidates[doc_id]
    phrases_tf = Counter(phrases)
    total_tf = len(phrases)

    candidate_phrases = []
    for phrase in set(phrases):
        # TF is the phrase count relative to all candidate phrases in the document.
        tf = phrases_tf[phrase] / (1.0 * total_tf)
        tf_idf = tf * inv_doc_freq[phrase]
        candidate_phrases.append([phrase, tf_idf])

    keywords = sorted(candidate_phrases, key=itemgetter(1), reverse=True)
    return keywords
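
The scoring can be tried on a toy corpus without the Hulth2003 data. The sketch below is a self-contained approximation of the two steps above; the phrase lists and the helper names `inverse_document_frequency` and `rank_phrases` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical candidate phrases for three tiny "documents".
toy_candidates = [
    ["neural network", "training data", "neural network"],
    ["training data", "error rate"],
    ["training data", "search engine"],
]

def inverse_document_frequency(candidates):
    n_docs = len(candidates)
    # set(doc) makes each phrase count once per document.
    doc_freq = Counter(p for doc in candidates for p in set(doc))
    return {p: math.log(n_docs / (1.0 + df)) for p, df in doc_freq.items()}

def rank_phrases(doc_phrases, idf):
    counts = Counter(doc_phrases)
    total = len(doc_phrases)
    scored = [[p, (c / total) * idf[p]] for p, c in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

toy_idf = inverse_document_frequency(toy_candidates)
ranked = rank_phrases(toy_candidates[0], toy_idf)
for phrase, score in ranked:
    print(f"{phrase}: {score:.4f}")
```

In the first toy document, "neural network" ranks highest because it is frequent locally and absent elsewhere, while "training data" occurs in every document and ends up with a negative score.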

In [76]:
get_keywords(3)

[['organisms', 0.40094245796272204],
 ['mind independent world', 0.40094245796272204],
 ['realist conception', 0.40094245796272204],
 ['thesis', 0.3562232850233707],
 ['organism', 0.20047122898136102],
 ['realist idea', 0.20047122898136102],
 ['additional threat', 0.20047122898136102],
 ['powerful perspective', 0.20047122898136102],
 ['selective representing the idea', 0.20047122898136102],
 ['integrated system', 0.20047122898136102],
 ['scientific interest', 0.20047122898136102],
 ['consistent', 0.20047122898136102],
 ['primary concern', 0.20047122898136102],
 ['niches', 0.20047122898136102],
 ['realism', 0.18739170936496863],
 ['third', 0.16503212289529298],
 ['latter', 0.1600595203202201],
 ['profiles', 0.1600595203202201],
 ['contents', 0.1600595203202201],
 ['notion', 0.1485538769673578],
 ['compatibility', 0.1485538769673578],
 ['things', 0.145479355038186],
 ['issue', 0.1426725364256173],
 ['representations', 0.1426725364256173],
 ['sense', 0.13769993385054446],
 ['authors', 0.1

In [77]:
full_data[3]

'Selective representing and world-making We discuss the thesis of selective representing-the idea that the contents of the mental representations had by organisms are highly constrained by the biological niches within which the organisms evolved. While such a thesis has been defended by several authors elsewhere, our primary concern here is to take up the issue of the compatibility of selective representing and realism. We hope to show three things. First, that the notion of selective representing is fully consistent with the realist idea of a mind-independent world. Second, that not only are these two consistent, but that the latter (the realist conception of a mind-independent world) provides the most powerful perspective from which to motivate and understand the differing perceptual and cognitive profiles themselves. Third, that the (genuine and important) sense in which organism and environment may together constitute an integrated system of scientific interest poses no additional 

### Further Readings:
1. https://code.google.com/archive/p/kea-algorithm/
2. https://www.r-bloggers.com/key-phrase-extraction-from-tweets/
3. https://github.com/snkim/AutomaticKeyphraseExtraction
4. https://www.cs.waikato.ac.nz/~ml/publications/2005/chap_Witten-et-al_Windows.pdf