# Technical n-gram Identification

This function identifies technical n-grams (multi-word terms) within a text.
- n-grams provide more informative tokens for use in NLP functions
- Restricting the search to technical n-grams keeps the tokens relevant; generating every possible n-gram combination adds noise
- The technique is the part-of-speech filter defined by [Justeson & Katz, 1994](https://www.researchgate.net/publication/200044387_Technical_Terminology_Some_Linguistic_Properties_and_an_Algorithm_for_Identification_in_Text)
- Limited to a maximum of tri-grams (3-word terms), as longer terms typically appear only in highly technical texts such as academic papers
- Once identified, the words of each n-gram are joined with underscores
    - Save the processed text as a separate entity to preserve the original data
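
As a rough illustration of the filter described above, the sketch below applies the three part-of-speech patterns used in this notebook to a sentence that has been tagged by hand, so it runs without NLTK or its data downloads. The helper name `match_patterns` and the example tags are assumptions made for illustration; in practice the tags would come from `nltk.pos_tag`.

```python
# Hand-tagged tokens (Penn Treebank tags); these are illustrative assumptions,
# in practice the tags would be produced by nltk.pos_tag
tagged = [
    ('the', 'DT'), ('neural', 'JJ'), ('network', 'NN'),
    ('improves', 'VBZ'), ('speed', 'NN'), ('of', 'IN'),
    ('convergence', 'NN'),
]

NOUNS = ('NN', 'NNS')
NOUN_OR_ADJ = ('NN', 'NNS', 'JJ')


def match_patterns(tagged):
    """Hypothetical helper: return the bi-grams and tri-grams matching the patterns."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    terms = []
    for i in range(len(tagged) - 1):
        # bi-gram: (adjective | noun) + noun
        if tags[i] in NOUN_OR_ADJ and tags[i + 1] in NOUNS:
            terms.append(tuple(words[i:i + 2]))
        # tri-grams need a third token that is a noun
        if i + 2 < len(tagged) and tags[i + 2] in NOUNS:
            # (adjective | noun) + (adjective | noun) + noun
            if tags[i] in NOUN_OR_ADJ and tags[i + 1] in NOUN_OR_ADJ:
                terms.append(tuple(words[i:i + 3]))
            # noun + preposition + noun
            if tags[i] in NOUNS and tags[i + 1] == 'IN':
                terms.append(tuple(words[i:i + 3]))
    return terms


terms = match_patterns(tagged)
print(terms)
# [('neural', 'network'), ('speed', 'of', 'convergence')]
```

Here "neural network" matches the adjective + noun pattern and "speed of convergence" matches the noun + preposition + noun pattern; the remaining tokens are left as single words.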

In [None]:
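# library import
import nltk
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

# download the tokenizer and tagger data used below
# (newer NLTK releases may name these resources 'punkt_tab' and
# 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')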


def get_ngram(text):

    '''
    Accepts a text string and identifies technical n-grams (terms of up to
    3 words). Once identified, the n-grams are joined within the text with
    underscores to create more powerful tokens for use in further NLP work.
    '''

    # set up empty list to collect multi-word tokens
    terms = []
    # tokenize the input text and tag every token once, outside the loop
    tokens = word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    nouns = ('NN', 'NNS')
    noun_or_adj = ('NN', 'NNS', 'JJ')
    # run the filter to identify multiword tokens and append to list
    # (loop to len - 1 so the final bi-gram of the text is also checked)
    for x in range(len(tokens) - 1):
        # bi-gram: (adjective | noun) + noun
        if tags[x] in noun_or_adj and tags[x+1] in nouns:
            terms.append((tokens[x], tokens[x+1]))
        # tri-grams need a third token that is a noun
        if x + 2 < len(tokens) and tags[x+2] in nouns:
            # (adjective | noun) + (adjective | noun) + noun
            if tags[x] in noun_or_adj and tags[x+1] in noun_or_adj:
                terms.append((tokens[x], tokens[x+1], tokens[x+2]))
            # noun + preposition + noun
            if tags[x] in nouns and tags[x+1] == 'IN':
                terms.append((tokens[x], tokens[x+1], tokens[x+2]))

    # define function to stitch tokens back together 
    # and adjoin multiword tokens with an underscore
    def collate_tech_terms(text, terms):

        tokens = word_tokenize(text)
        term_set = set(terms)
        merged = []
        x = 0
        # walk through the tokens by index, building a new list rather than
        # removing items from the list being iterated over
        while x < len(tokens):
            # prefer a 3-word match, then fall back to a 2-word match
            if tuple(tokens[x:x+3]) in term_set:
                merged.append('_'.join(tokens[x:x+3]))
                x += 3
            elif tuple(tokens[x:x+2]) in term_set:
                merged.append('_'.join(tokens[x:x+2]))
                x += 2
            else:
                merged.append(tokens[x])
                x += 1

        return TreebankWordDetokenizer().detokenize(merged)

    # adjoin multiword tokens in user input text
    text = collate_tech_terms(text, terms)

    return text
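
A minimal sketch of the joining step on its own, using a pre-tokenized sentence and a hand-written list of terms so that it runs without NLTK. The helper name `join_terms` and the example data are assumptions for illustration and are not part of the function above; the sketch tries the 3-word match before the 2-word match at each position.

```python
def join_terms(tokens, terms):
    """Hypothetical helper: merge matched terms with underscores, longest match first."""
    term_set = set(terms)
    merged = []
    i = 0
    while i < len(tokens):
        for size in (3, 2):
            window = tuple(tokens[i:i + size])
            if len(window) == size and window in term_set:
                merged.append('_'.join(window))
                i += size
                break
        else:
            # no term starts at this position, keep the single token
            merged.append(tokens[i])
            i += 1
    return merged


# Illustrative input: tokens and terms are written by hand here
tokens = ['the', 'neural', 'network', 'improves', 'speed', 'of', 'convergence']
terms = [('neural', 'network'), ('speed', 'of', 'convergence')]

joined = ' '.join(join_terms(tokens, terms))
print(joined)
# the neural_network improves speed_of_convergence
```

Building a new list avoids the index shifts that come from deleting items from a list while looping over it.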