# Getting Started with Noun Phrase to Vec (NP2vec)

Noun phrases (NPs) play a particular role in NLP applications.
This code trains a word embedding model for NPs using the word2vec or fastText algorithm. All the terms in the corpus are used as context when training the model; however, at the end of training, only the embeddings of the NPs are stored. The exception is fastText training with word_ngrams=1: in that case, all word embeddings are stored, including those of non-NPs, so that embeddings can be estimated for out-of-vocabulary NPs (NPs that do not appear in the training corpus).
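
The storage rule described above can be sketched in plain Python. This is an illustration only, not NLP Architect's code: the helper `keep_np_vectors`, the toy vectors, and the convention that a marked NP ends with a trailing `_` are all assumptions made for the example.

```python
# Hypothetical illustration of the storage rule; not NLP Architect's implementation.
MARK_CHAR = "_"  # assumed marker appended to every noun phrase token


def keep_np_vectors(vectors, keep_all=False):
    """Return the embeddings that would be stored after training.

    vectors:  dict mapping each token to its embedding (a list of floats).
    keep_all: True mimics the fastText word_ngrams=1 case, where every
              embedding is kept so out-of-vocabulary NPs can be estimated.
    """
    if keep_all:
        return dict(vectors)
    # Default case: keep only tokens that carry the NP marker.
    return {tok: vec for tok, vec in vectors.items() if tok.endswith(MARK_CHAR)}


# Toy embeddings: two marked NPs and two ordinary context words.
toy_vectors = {
    "machine_learning_": [0.1, 0.9],
    "neural_network_": [0.2, 0.8],
    "the": [0.5, 0.5],
    "train": [0.7, 0.1],
}

np_only = keep_np_vectors(toy_vectors)
everything = keep_np_vectors(toy_vectors, keep_all=True)
print(sorted(np_only))
print(len(everything))
```

The context words still influence the NP vectors during training; they are simply dropped from what is saved in the default case.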

This tutorial shows how to train an NP2vec model.

First, let’s install the NLP Architect and wget libraries.

In [None]:
!git clone https://github.com/IntelLabs/nlp-architect.git
!pip install nlp-architect/
%cd nlp-architect
!pip install wget

Let's import the relevant libraries.

In [None]:
import gzip
import wget

from nlp_architect.models.np2vec import NP2vec
from solutions.set_expansion.prepare_data import load_parser, mark_noun_phrases

Let's download a sample corpus (a subset of an English Wikipedia dump).

In [None]:
url = 'https://github.com/IntelLabs/nlp-architect/raw/master/datasets/wikipedia/enwiki-20171201_subset.txt.gz'  
wget.download(url, 'enwiki-20171201_subset.txt.gz')

Let's extract the NPs from the corpus.

In [None]:
corpus = 'enwiki-20171201_subset.txt.gz'
marked_corpus = 'enwiki-20171201_subset_marked.txt'
chunker = 'spacy'
with gzip.open(corpus, 'rt', encoding='utf8', errors='ignore') as corpus_file, open(marked_corpus, 'w', encoding='utf8') as marked_corpus_file:
    nlp = load_parser(chunker)
    num_lines = sum(1 for line in corpus_file)
    corpus_file.seek(0)
    print('%i lines in corpus' % num_lines)
    mark_noun_phrases(corpus_file, marked_corpus_file, nlp, num_lines, chunker)
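
Before training, it can help to glance at the marked corpus and confirm that the noun phrases were marked as expected. A minimal sketch, assuming the cell above has finished writing the file; `preview` is a hypothetical helper, not part of NLP Architect:

```python
from itertools import islice


def preview(path, n=3):
    """Return the first n lines of a UTF-8 text file, without trailing newlines."""
    with open(path, encoding="utf8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Usage, once the marked corpus has been written:
# for line in preview("enwiki-20171201_subset_marked.txt"):
#     print(line)
```

Reading through `islice` avoids loading the whole corpus into memory just to inspect a few lines.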

Let's train the NP2vec model and store it.

In [None]:
np2vec = NP2vec(marked_corpus)
np2vec.save()

The NP2vec model can be used for term set expansion (a Term Set Expansion Jupyter notebook is available at nlp-architect/tutorials/Term_Set_Expansion/term_set_expansion.ipynb).
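
To give a feel for why NP embeddings suit term set expansion, here is a toy sketch that ranks candidate terms by cosine similarity to the average of a few seed terms. It is not the method used in the Term Set Expansion notebook; the functions `cosine` and `expand`, the two-dimensional vectors, and the trailing `_` on each term are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(seeds, vectors, topn=2):
    """Rank non-seed terms by cosine similarity to the mean of the seed vectors."""
    dim = len(next(iter(vectors.values())))
    centroid = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    candidates = [t for t in vectors if t not in seeds]
    candidates.sort(key=lambda t: cosine(vectors[t], centroid), reverse=True)
    return candidates[:topn]


# Toy NP embeddings (hypothetical values, not produced by NP2vec).
toy = {
    "new_york_": [0.9, 0.1],
    "los_angeles_": [0.8, 0.2],
    "san_francisco_": [0.85, 0.15],
    "apple_pie_": [0.1, 0.9],
}
print(expand(["new_york_", "los_angeles_"], toy, topn=1))
```

With a trained model, the toy dictionary would be replaced by the stored NP vectors.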