# Feature Engineering in NLP
- A feature can be a single word in your text, a phrase, a sentence, a paragraph, or even a complete document
- The process of converting text data into vectors of real numbers is called `Feature Extraction from Text Data`, `Text Representation`, or `Text Vectorization`. Its goal is to convert text into numbers in such a way that the numbers reflect the semantics (meaning) of the words
- In the simplest terms, we need to convert text into a vector of numbers such that the numbers represent the meaning of the text
- The angle between two vectors represents the similarity between the two vectors (documents)
- For example, if the angle between two vectors is 0°, then cos(0°) = 1, which means the two vectors point in the same direction (the documents are as similar as possible). If the angle is 90°, then cos(90°) = 0, which means the two vectors (documents) have nothing in common
- Cosine distance = 1 - cosine similarity, where cosine similarity is the cosine of the angle between the two vectors
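
The relationship between angle, cosine similarity, and cosine distance described above can be checked with a few lines of NumPy. This is a minimal sketch; the helper name `cosine_sim` and the toy vectors are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two vectors pointing in the same direction (angle of 0 degrees)
same_direction = cosine_sim([1, 2, 0], [2, 4, 0])
# Two vectors at a right angle (angle of 90 degrees)
orthogonal = cosine_sim([1, 0, 0], [0, 3, 0])

print("similarity:", same_direction, "distance:", 1 - same_direction)
print("similarity:", orthogonal, "distance:", 1 - orthogonal)
```

The first pair should give a similarity of (approximately) 1 and a distance of (approximately) 0, while the second pair gives a similarity of 0 and a distance of 1.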

## Different Methods of Text Vectorization
- Frequency or statistical-based approaches
> 1) Label Encoding (Deprecated)
> 2) One-Hot Encoding (Deprecated)
> 3) Bag of Words Encoding
> 4) Bag of n-grams
> 5) TF-IDF
- Prediction-based approaches (Embeddings)
> 1) Word2Vec - from Google
> 2) FastText - from Facebook

### 3) Bag of Words (BoW) Encoding
- It is the most basic strategy for converting text into numbers: it records the presence/count of each vocabulary word (or n-gram) in a document
- The most common NLP application of the BoW representation is text classification (i.e., classifying a collection of documents into categories like sports, entertainment, and politics)
- Consider the following corpus of three documents consisting of five, ten, and five words respectively, and the corresponding vocabulary of the corpus (the single-character word "I" is left out of the vocabulary, as `CountVectorizer` does by default)
>- doc1 = ["Ali youtube channel is amazing"]
>- doc2 = ["I like youtube lectures and ali also like youtube lectures"]
>- doc3 = ["Ali youtube lectures are amazing"]
>- vocab = {'ali':0, 'also':1, 'amazing':2, 'and':3, 'are':4, 'channel':5, 'is':6, 'lectures':7, 'like':8, 'youtube':9}
- Irrespective of its length, each document is converted into a v-dimensional `frequency vector`, where `v` is the size of the vocabulary
- The three documents are represented as a `Document-Term Matrix (DTM)`, which is a mathematical matrix that describes the frequency of terms that occur in a collection of documents
- doc1 has a total of 5 words, each appearing once, and it is represented as a vector of size 10 (the size of the vocabulary) with 5 non-zero values
#### Advantage:
- ***`Size is Fixed`***: Unlike one-hot encoding, which encodes every word separately, BoW encodes every document as a fixed-size vector, irrespective of the number of words in it. So it can easily be fed to an ML model
#### Disadvantages:
- ***`Size Reduced But Still Large`***: The vector representation of a document is smaller than with one-hot encoding, but its length still equals the size of the vocabulary, which can be very large
- ***`Sparsity Reduced But Still Exists`***: A bit better than one-hot encoding; however, the BoW representation still has lots of zeros. To save memory, the matrix is stored as a sparse matrix
- ***`OOV (Out of Vocabulary) Partially Solved`***: In BoW, a word that is not in the vocabulary is simply ignored (it contributes nothing to the vector). The only benefit is that it does not throw an error. If the ignored words are important for predicting the output, this is a disadvantage, because the vector will not capture the meaning of those words
- ***`Semantic Meaning is Partially Captured`***: A bit better than one-hot encoding; however, BoW does not capture the meaning of your sentence accurately. This limitation can, however, be reduced using lemmatization
- ***`Ordering of Words is Ignored`***: In the BoW representation, the ordering of words is not captured, while in languages like English the order of the words plays a role in conveying the meaning of the sentence
- ***`Two Very Similar Vectors Can Convey Completely Different Meanings`***: This can be handled using n-gram techniques
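
The last two disadvantages can be illustrated with a small pure-Python sketch. The two sentences and the helper names `bow_vector` and `bigrams` are made up for illustration; they show two sentences with opposite meanings receiving identical BoW vectors, while their bigrams differ.

```python
from collections import Counter

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def bigrams(text):
    """Return the list of consecutive word pairs in the text (hypothetical helper)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

s1 = "dog bites man"
s2 = "man bites dog"
vocab = sorted(set(s1.split()) | set(s2.split()))

v1 = bow_vector(s1, vocab)
v2 = bow_vector(s2, vocab)

print(vocab)                      # shared vocabulary
print(v1, v2, v1 == v2)           # identical BoW vectors despite different meanings
print(bigrams(s1), bigrams(s2))   # bigrams keep some word order, so they differ
```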

#### Creating Bag of Words Using `CountVectorizer`
- Steps to create the BoW representation of a corpus programmatically
>- ***`Tokenization`***: First, tokenize all the input documents
>- ***`Vocabulary Creation`***: Of all the tokens obtained, only the unique words are selected to create the vocabulary, which is then sorted in alphabetical order
>- ***`Vector Creation`***: Finally, a sparse matrix is created in which each row is a document vector whose length (the number of columns of the matrix) is equal to the size of the vocabulary. The value of each cell in a row/document is the frequency count of the word under that column
- ***`sklearn's CountVectorizer`***: It computes the frequency of occurrence of each word in a document. It converts a corpus of multiple documents into a Document-Term Matrix (a sparse matrix). It also allows you to
>- Control your n-gram size
>- Perform custom preprocessing
>- Perform custom Tokenization
>- Eliminate stop words
>- Limit Vocabulary size
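
The three steps listed above (tokenization, vocabulary creation, vector creation) can also be sketched by hand in plain Python. This is only an illustrative approximation of what `CountVectorizer` does; the regular expression below is an assumption chosen to mimic its default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
          "Ali youtube lectures are amazing"]

# Step 1: tokenization (lowercase, keep tokens with at least two word characters)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# Step 2: vocabulary creation (unique tokens, sorted alphabetically, mapped to column indices)
unique_terms = sorted({token for tokens in tokenized for token in tokens})
vocab = {term: index for index, term in enumerate(unique_terms)}

# Step 3: vector creation (one row of counts per document, one column per vocabulary term)
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts.get(term, 0) for term in unique_terms])

print(vocab)
for row in dtm:
    print(row)
```

A dense list of lists is used here for readability; a real implementation would store the counts in a sparse matrix.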

In [1]:
corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
        "Ali youtube lectures are amazing"]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
print(type(cv))

<class 'sklearn.feature_extraction.text.CountVectorizer'>


In [3]:
bow = cv.fit_transform(corpus) #generate vocabulary dictionary and return a DTM


In [4]:
print(cv.vocabulary_)

{'ali': 0, 'youtube': 9, 'channel': 5, 'is': 6, 'amazing': 2, 'like': 8, 'lectures': 7, 'and': 3, 'also': 1, 'are': 4}


In [5]:
print(cv.get_feature_names_out())

['ali' 'also' 'amazing' 'and' 'are' 'channel' 'is' 'lectures' 'like'
 'youtube']


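***Note that single-character words (such as "I") are not in the vocabulary, because the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more alphanumeric characters***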
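
If single-character words matter for your task, one option is to override the `token_pattern` argument. The sketch below compares the default behavior with a pattern that accepts tokens of one or more word characters; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like youtube lectures and ali also like youtube lectures"]

# Default pattern: tokens need at least two word characters, so "i" is dropped
default_cv = CountVectorizer()
default_cv.fit(docs)

# Custom pattern: tokens of one or more word characters, so "i" is kept
single_char_cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_cv.fit(docs)

print("i" in default_cv.vocabulary_)
print("i" in single_char_cv.vocabulary_)
```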

In [10]:
bow.shape
# 3 is the number of documents and 10 is the number of words in the vocabulary

(3, 10)

In [7]:
bow

<3x10 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [9]:
# Since bow is a SciPy sparse matrix, we can convert it to a dense NumPy array with toarray() or to a dense matrix with todense()
# bow.toarray()
bow.todense()

matrix([[1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
        [1, 1, 0, 1, 0, 0, 0, 2, 2, 2],
        [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]], dtype=int64)

#### Let's understand the sparsity of the Matrix

In [12]:
# Total count of values in a bow matrix 
total_cells = bow.shape[0] * bow.shape[1]
total_cells

30

In [14]:
# Total count of non-zero cells
nonzero_cells = bow.nnz
nonzero_cells

16

In [15]:
# Percentage of non-zero values in the Document Term Matrix
percentage = (nonzero_cells/total_cells)*100
percentage

53.333333333333336

***In order to save memory and speed up algebraic operations, we use a sparse representation of the matrix***

#### Let's save the corpus as Document Term Matrix of BoW representation

In [16]:
import pandas as pd
dtm = pd.DataFrame(data = bow.todense(), columns = cv.get_feature_names_out())
dtm

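| | ali | also | amazing | and | are | channel | is | lectures | like | youtube |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 2 | 2 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |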


In [18]:
dtm1 = pd.DataFrame(data = bow.todense())
dtm1

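| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 2 | 2 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |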


### Hyperparameters of `CountVectorizer`
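- To improve or fine-tune results on your dataset, you can tweak different hyperparameters of the `CountVectorizer` class
1) ***`vocabulary = None`***: You can pass a Python dictionary where keys are terms and values are indices in the feature matrix. If not given, the vocabulary is built from the input documents
2) ***`lowercase = True`***: Converts all characters to lowercase before tokenizing
3) ***`tokenizer = None`***: By default, tokens are extracted with the built-in `token_pattern`, which drops punctuation, special characters, and single-character tokens. If this is not what you need, pass your own callable here to override the tokenization step
4) ***`stop_words = None`***: By default, no stop words are removed. Pass `'english'` to use the built-in English stop-word list, or pass your own list of words
5) ***`preprocessor = None`***: A callable that overrides the preprocessing stage (accent stripping and lowercasing) while keeping the tokenization and n-gram generation steps
6) ***`binary = False`***: If `True`, all non-zero counts are set to 1, so the matrix records presence/absence rather than frequency
7) ***`max_features = None`***: If set to an integer, the vocabulary keeps only the top `max_features` terms ordered by term frequency across the corpus
8) ***`ngram_range = (1,1)`***: The lower and upper bounds of the n-gram sizes to extract; `(1,1)` means unigrams only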
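
A short sketch of how a few of these hyperparameters change the result on the same three-document corpus. The variable names (`cv_stop`, `cv_bin`, `cv_top`) are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
          "Ali youtube lectures are amazing"]

# stop_words="english" removes common English words such as "is", "and", "are"
cv_stop = CountVectorizer(stop_words="english")
cv_stop.fit(corpus)
print(sorted(cv_stop.vocabulary_))

# binary=True records presence (1) or absence (0) instead of raw counts
cv_bin = CountVectorizer(binary=True)
binary_dtm = cv_bin.fit_transform(corpus).toarray()
print(binary_dtm)

# max_features=3 keeps only the 3 most frequent terms across the whole corpus
cv_top = CountVectorizer(max_features=3)
cv_top.fit(corpus)
print(sorted(cv_top.vocabulary_))
```

In this corpus "youtube" occurs four times and "ali" and "lectures" three times each, so those are the three terms that `max_features=3` retains.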

## Bag of N-Grams Encoding
- In the simple BoW model, the vocabulary consists of the unique single words of the corpus, and its limitation is that the ordering of the words is not captured
- The bag of n-grams model is quite similar to the BoW model: it represents a text document as an unordered collection of its n-grams (contiguous sequences of n items from a given sample of text or speech)
- An n-gram of size 1 is called a unigram, of size 2 a bigram, of size 3 a trigram, and so on
- The formula to calculate the count of n-grams in a document is ***`X-(N-1)`***, where X is the number of words in the document and N is the number of words in each n-gram
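
The `X-(N-1)` formula can be verified with a small pure-Python sketch. The helper `ngrams` is a hypothetical name used only for this illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Ali youtube channel is amazing".lower().split()
x = len(tokens)  # X = 5 words in the document

ngram_counts = {}
for n in (1, 2, 3):
    grams = ngrams(tokens, n)
    ngram_counts[n] = len(grams)
    # The observed count should match the formula X - (N - 1)
    print(n, grams, len(grams), x - (n - 1))
```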

#### Advantages:
1) Able to capture more of the semantic meaning of a sentence: bigrams and trigrams take sequences of words into account, which makes it easier to find relationships between words
2) Intuitive and easy to implement: the implementation of bag of n-grams is straightforward, requiring only a small modification of the Bag of Words approach
#### Disadvantages:
1) As we move from unigrams to higher-order n-grams, the size of the vocabulary (and therefore the dimension of the vectors) increases, so computation and prediction take more time
2) Still no solution for out-of-vocabulary terms: we have no option other than ignoring the new words in a new sentence

#### Creating Bag of N-Grams using `CountVectorizer`
- To create a bag of n-grams, we can use the `ngram_range` argument of the `CountVectorizer` class
>- ***`cv = sklearn.feature_extraction.text.CountVectorizer(ngram_range = (1,1))`***
>- `ngram_range = (1,1)` will create a vocabulary and later a bag of unigrams
>- `ngram_range = (2,2)` will create a vocabulary and later a bag of bigrams
>- `ngram_range = (1,2)` will create a vocabulary and later a bag of combined unigrams and bigrams
>- `ngram_range = (1,3)` will create a vocabulary and later a bag of combined unigrams, bigrams, and trigrams

#### Example of Bi-grams

In [1]:
import pandas as pd
corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
        "Ali youtube lectures are amazing"]

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range = (2,2))

bow = cv.fit_transform(corpus)
print(cv.vocabulary_)
print(cv.get_feature_names_out())
print(bow.toarray())

{'ali youtube': 1, 'youtube channel': 10, 'channel is': 5, 'is amazing': 6, 'like youtube': 9, 'youtube lectures': 11, 'lectures and': 7, 'and ali': 3, 'ali also': 0, 'also like': 2, 'lectures are': 8, 'are amazing': 4}
['ali also' 'ali youtube' 'also like' 'and ali' 'are amazing' 'channel is'
 'is amazing' 'lectures and' 'lectures are' 'like youtube'
 'youtube channel' 'youtube lectures']
[[0 1 0 0 0 1 1 0 0 0 1 0]
 [1 0 1 1 0 0 0 1 0 2 0 2]
 [0 1 0 0 1 0 0 0 1 0 0 1]]


In [2]:
dtm = pd.DataFrame(data = bow.todense(), columns = cv.get_feature_names_out())
dtm

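| | ali also | ali youtube | also like | and ali | are amazing | channel is | is amazing | lectures and | lectures are | like youtube | youtube channel | youtube lectures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 2 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |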


#### Example of Tri-grams

In [3]:
import pandas as pd
corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
        "Ali youtube lectures are amazing"]

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range = (3,3))

bow = cv.fit_transform(corpus)
print(cv.vocabulary_)
print(cv.get_feature_names_out())
print(bow.toarray())

{'ali youtube channel': 1, 'youtube channel is': 9, 'channel is amazing': 5, 'like youtube lectures': 8, 'youtube lectures and': 10, 'lectures and ali': 6, 'and ali also': 4, 'ali also like': 0, 'also like youtube': 3, 'ali youtube lectures': 2, 'youtube lectures are': 11, 'lectures are amazing': 7}
['ali also like' 'ali youtube channel' 'ali youtube lectures'
 'also like youtube' 'and ali also' 'channel is amazing'
 'lectures and ali' 'lectures are amazing' 'like youtube lectures'
 'youtube channel is' 'youtube lectures and' 'youtube lectures are']
[[0 1 0 0 0 1 0 0 0 1 0 0]
 [1 0 0 1 1 0 1 0 2 0 1 0]
 [0 0 1 0 0 0 0 1 0 0 0 1]]


In [4]:
dtm = pd.DataFrame(data = bow.todense(), columns = cv.get_feature_names_out())
dtm

| | ali also like | ali youtube channel | ali youtube lectures | also like youtube | and ali also | channel is amazing | lectures and ali | lectures are amazing | like youtube lectures | youtube channel is | youtube lectures and | youtube lectures are |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |


## Term Frequency - Inverse Document Frequency (TF-IDF)
- In the bag-of-words representation of a document, the values of the vector are the number of times each word appears in the document, but they do not capture the importance of a word in the document
- In simple words, the BoW approach treats each word equally, irrespective of its importance
- So, BoW gives more weight to unimportant words that appear frequently in the document. For example, words like can, have, since, they, you, it may have a high frequency but are not important
- This draws the attention of the ML model away from less frequent but more important words
- ***TF-IDF*** stands for term frequency times inverse document frequency, and it is used to address this issue
    - The term frequency (TF) tells us how important a term is in a particular document, by assigning more weight to a term that appears more frequently in that document
    - The document frequency (DF) tells us how common a term is across the entire corpus. The intuition behind taking its inverse is that the more common a word is across the documents, the less important it is for any one document

#### 1) Term Frequency
- TF tells us the count of a term in a specific document and thus how important the term is in that document. Three common variants are:
>-  ***`    TF = Number of times a term occurs in a document`***
>-  ***`    TF = (Number of times a term occurs in a document)/(Total number of terms in the document)`***
>-  ***`    TF = (Number of times a term occurs in a document)/(Frequency of the most common term in the document)`***
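
The three TF variants listed above can be computed for the second document of the corpus with a short sketch. The function name `tf_variants` is hypothetical, and simple whitespace tokenization is assumed, so the single-character word "i" is counted as a term here.

```python
from collections import Counter

tokens = "I like youtube lectures and ali also like youtube lectures".lower().split()
counts = Counter(tokens)

def tf_variants(term):
    """Return (raw count, count / document length, count / max count) for a term."""
    raw = counts[term]
    relative = raw / len(tokens)                  # normalized by total terms in the document
    max_normalized = raw / max(counts.values())   # normalized by the most common term's count
    return raw, relative, max_normalized

for term in ("ali", "youtube"):
    print(term, tf_variants(term))
```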

#### 2) Inverse Document Frequency
- IDF tells us how important a term is in the entire corpus of documents
- Since we want to penalize words that occur frequently across all the documents, we take the inverse of the document frequency
- This way the IDF of rare words in the corpus will be large, while the IDF of very common words will be close to zero
>-   ***` IDF = log[(Total number of the documents in the corpus)/(Number of documents in which term t appears)]`***
- In a corpus with a large number of documents (n), if a word appears in only one document, then the IDF value of that word will be large
- The TF of a term varies from document to document, but the IDF of a term is the same for every document in the corpus
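
The IDF formula above can be sketched directly in Python. The notes do not state the base of the logarithm, so the natural logarithm is assumed here; the helper name `idf` and the whitespace tokenization are also illustrative choices.

```python
import math

corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
          "Ali youtube lectures are amazing"]

# Represent each document as the set of its lowercase terms
docs = [set(doc.lower().split()) for doc in corpus]

def idf(term):
    """IDF = log(total documents / documents containing the term); natural log assumed."""
    document_frequency = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / document_frequency)

for term in ("ali", "amazing", "channel"):
    print(term, idf(term))
```

"ali" appears in all three documents, so its IDF is zero, while "channel" appears in only one document and receives the largest IDF of the three terms.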

### Term Frequency - Inverse Document Frequency (TFIDF)
#### `TFIDF = TF * IDF`

#### Normalize TFIDF Values
- To avoid long documents in the corpus dominating short ones, we normalize each row of the sparse matrix to have unit Euclidean (L2) norm
- ***Normalized TFIDF = TFIDF / (Euclidean Norm of the Respective Document Vector)***
- ***The higher the TFIDF score of a term in a document, the more relevant the term is to that document***
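
The TF, IDF, and normalization steps can be combined by hand and compared against scikit-learn. Note one assumption: with its default settings, `TfidfVectorizer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, rather than the plain `log(n / df)` formula given in the IDF section, so the sketch below uses the smoothed form in order to match. Variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
          "Ali youtube lectures are amazing"]

# TF: raw term counts per document
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)

# IDF: smoothed variant (assumed to match scikit-learn's defaults)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TFIDF = TF * IDF, then divide each row by its Euclidean norm
tfidf = counts * idf
tfidf_normalized = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with the library result
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.round(tfidf_normalized, 4))
print(np.allclose(tfidf_normalized, reference))
```

After normalization every row has length 1, so a long document such as the second one no longer dominates purely because it contains more words.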

#### Creating TFIDF using `TfidfVectorizer`

In [5]:
import pandas as pd
corpus = ["Ali youtube channel is amazing",
          "I like youtube lectures and ali also like youtube lectures",
        "Ali youtube lectures are amazing"]

from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()

bow = cv.fit_transform(corpus)
print(cv.vocabulary_)
print(cv.get_feature_names_out())
print(bow.toarray())

{'ali': 0, 'youtube': 9, 'channel': 5, 'is': 6, 'amazing': 2, 'like': 8, 'lectures': 7, 'and': 3, 'also': 1, 'are': 4}
['ali' 'also' 'amazing' 'and' 'are' 'channel' 'is' 'lectures' 'like'
 'youtube']
[[0.32630952 0.         0.42018292 0.         0.         0.55249005
  0.55249005 0.         0.         0.32630952]
 [0.18623238 0.31531883 0.         0.31531883 0.         0.
  0.         0.4796162  0.63063767 0.37246476]
 [0.34957775 0.         0.45014501 0.         0.59188659 0.
  0.         0.45014501 0.         0.34957775]]


In [6]:
bow.shape

(3, 10)

In [7]:
bow

<3x10 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [8]:
bow.toarray()

array([[0.32630952, 0.        , 0.42018292, 0.        , 0.        ,
        0.55249005, 0.55249005, 0.        , 0.        , 0.32630952],
       [0.18623238, 0.31531883, 0.        , 0.31531883, 0.        ,
        0.        , 0.        , 0.4796162 , 0.63063767, 0.37246476],
       [0.34957775, 0.        , 0.45014501, 0.        , 0.59188659,
        0.        , 0.        , 0.45014501, 0.        , 0.34957775]])

In [9]:
dtm = pd.DataFrame(data = bow.todense(), columns = cv.get_feature_names_out())
dtm

| | ali | also | amazing | and | are | channel | is | lectures | like | youtube |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.32631 | 0.0 | 0.420183 | 0.0 | 0.0 | 0.55249 | 0.55249 | 0.0 | 0.0 | 0.32631 |
| 1 | 0.186232 | 0.315319 | 0.0 | 0.315319 | 0.0 | 0.0 | 0.0 | 0.479616 | 0.630638 | 0.372465 |
| 2 | 0.349578 | 0.0 | 0.450145 | 0.0 | 0.591887 | 0.0 | 0.0 | 0.450145 | 0.0 | 0.349578 |


#### Advantage: 
- The TFIDF technique of text vectorization is mainly used in ***Information Retrieval***, for example in search engines such as Google
#### Disadvantages:
- The size of the vector depends on the overall size of the vocabulary, so the number of dimensions increases with the vocabulary
- Sparsity exists
- Out-of-vocabulary problem: a new word that is not in the vocabulary cannot be vectorized
- Semantic meanings are not completely captured
    - Ordering of words is ignored
    - Two very similar vectors can convey different meanings

## Word Embeddings
- We have seen that in BoW and TFIDF encodings, every word is treated as an individual entity and semantics are largely ignored. These vectorization techniques work fine for NLP tasks like text generation and classification, but they do not work well for other NLP tasks like semantic analysis, machine translation, and question answering, where a deep understanding of the context is required for good results
- For this we turn to word embeddings, a featurized word-level representation capable of capturing the contextual meaning of a word
- Word embedding techniques map a single word (or an entire document) to a dense vector of fixed size (50 - 300 dimensions) that captures its semantic meaning
- ***Word Embedding Techniques***
1) Prediction-based
     - Word2Vec by Google (2013)
     - FastText by Facebook (2015)
2) Frequency/Count-based
     - Global Vectors (GloVe) by Stanford (2014)

### Word2Vec
- This technique uses a simple (shallow) neural network to generate word embeddings
- The embeddings are learned from the contexts in which words appear, so words used in similar contexts end up with similar vectors
- It converts a word into a vector of real numbers (e.g. 300 or 400 dimensions)
- Two approaches to train a Word2Vec model:
     - CBOW (uses the context words to predict the target word)
     - Skip-gram (uses a word to predict its surrounding context words)
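
The difference between the two training approaches lies in how training examples are formed from a sentence. The following pure-Python sketch only builds the (input, output) pairs and does not train any network; the function name `training_pairs` and the window size of 1 are assumptions made for illustration.

```python
def training_pairs(tokens, window=1):
    """Build CBOW and skip-gram training examples from a token list.

    CBOW examples are (context words, target word);
    skip-gram examples are (target word, one context word).
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, context_word) for context_word in context)
    return cbow, skipgram

tokens = "ali youtube lectures are amazing".split()
cbow, skipgram = training_pairs(tokens, window=1)

print(cbow[1])        # context words -> target word
print(skipgram[:3])   # target word -> individual context words
```

CBOW yields one example per position with all context words as the input, whereas skip-gram yields one example per (word, neighbor) pair.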

#### Word2Vec using spacy `en_core_web_lg` Model

In [1]:
# download spacy large model
import sys
!{sys.executable} -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl (587.7 MB)

In [2]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [3]:
len(nlp.vocab.vectors)

514157

In [4]:
nlp.vocab.vectors.shape

(514157, 300)

In [6]:
# Vector representation of a word
nlp(u'lion').vector

array([  1.2746  ,   0.46242 ,  -1.1829  ,  -5.2661  ,  -2.7128  ,
         1.8521  ,  -0.94273 ,   2.1865  ,   6.503   ,   0.6704  ,
         1.5361  ,   2.5992  ,  -0.36233 ,   4.3965  ,  -6.5644  ,
         1.6141  ,  -1.2897  ,   2.1184  ,  -0.63654 ,  -3.4572  ,
        -4.3771  ,   4.2074  ,  -3.6411  ,  -0.97214 ,   1.3253  ,
        -2.3125  ,  -3.6531  ,  -2.8398  ,   2.7913  ,  -1.53    ,
        -2.9984  ,  -2.6357  ,   0.50615 ,  -2.6925  ,   4.3401  ,
        -5.6017  ,   0.045691,   4.3832  ,  -0.19535 ,  -1.0751  ,
         0.32172 ,   2.4395  ,   4.6638  ,   3.4471  ,  -3.3847  ,
        -1.8238  ,   0.70212 ,   0.58557 ,   5.0032  ,  -3.1072  ,
         1.2364  ,   7.4595  ,   0.057368,   1.0111  ,  -1.0827  ,
         0.69113 ,   2.8009  ,  -3.4383  ,  -1.0599  ,  -2.2627  ,
        -5.149   ,  -5.0636  ,   3.1405  ,   1.0793  ,  -0.72892 ,
        -3.9939  ,  -0.69551 ,  -0.55767 ,   3.2555  ,  -2.9449  ,
         4.7114  ,   1.6388  ,   1.3828  ,   1.4255  ,  -3.233

In [13]:
# Out of vocabulary word
nlp(u'arif').vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [12]:
# Out of vocabulary words and L2 norm
# Word arif is not present in the vocabulary
word = nlp(u'lion cat arif')
for i in word:
    print(i.text, '--->', i.has_vector, '--->', i.is_oov, '--->', i.vector_norm)

lion ---> True ---> False ---> 55.145737
cat ---> True ---> False ---> 63.188496
arif ---> False ---> True ---> 0.0


#### Cosine Similarity and Distance among the Vectors

In [14]:
word = nlp(u'python language dog pet')
for i in word:
    for j in word:
        print(i.text, '--->', j.text, '--->', i.similarity(j))

python ---> python ---> 1.0
python ---> language ---> 0.3078760504722595
python ---> dog ---> 0.20198722183704376
python ---> pet ---> 0.18855559825897217
language ---> python ---> 0.3078760504722595
language ---> language ---> 1.0
language ---> dog ---> -0.017512168735265732
language ---> pet ---> -0.014662646688520908
dog ---> python ---> 0.20198722183704376
dog ---> language ---> -0.017512168735265732
dog ---> dog ---> 1.0
dog ---> pet ---> 0.7856059074401855
pet ---> python ---> 0.18855559825897217
pet ---> language ---> -0.014662646688520908
pet ---> dog ---> 0.7856059074401855
pet ---> pet ---> 1.0


#### You can compute the Cosine Similarity and Cosine Distance using sklearn as well

In [16]:
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
word = nlp(u'love hate')
# Both arguments must be 2D arrays of shape (n_samples, n_features), so each
# 300-dimensional vector is wrapped in a list. Passing word[1].vector directly
# raises "ValueError: Expected 2D array, got 1D array instead".
print(cosine_similarity([word[0].vector], [word[1].vector]))
print(cosine_distances([word[0].vector], [word[1].vector]))