In [68]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## Text Representation
Unlike images, which are already stored as numbers, text is not inherently **numeric**. For any type of 
digital device to process **anything**, the data has to be available as numbers, because computers are 
"computing" devices, right? Hence, they process **numbers**.  

For images, it is as simple as manipulating pixels and then running various levels of computation. But for text (and textual 
data), the representation can be built at any of the following levels:
- characters
- words
- sentences
- paragraphs

Things are **not so easy** for textual data, since it can be represented at any of the levels above. 
[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) is also a difficult 
domain in itself, partly because textual representations are ambiguous and because humans perceive language in different 
ways.  

For images, pixels combine to form abstract features, and such abstract features combine to form high-level features. 
The case is similar for textual data:
- characters and symbols combine to form words
- words combine to form phrases
- phrases combine to form sentences
- sentences combine to form paragraphs

So, we can employ different representation techniques for extracting features from text.
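
As a rough illustration of these levels, the snippet below splits a made-up sample into paragraphs, sentences, words, and characters. The `sample` string and the naive splitting rules (blank lines for paragraphs, end punctuation for sentences, whitespace for words) are assumptions for illustration only; real tokenizers handle many more cases.

```python
import re

# hypothetical sample text, not part of the notebook's data
sample = "I love caffeine. I love solving problems.\n\nI am a caffeine addict."

# paragraphs: separated by a blank line
paragraphs = [p for p in sample.split("\n\n") if p.strip()]

# sentences: split on whitespace that follows ., ! or ?
sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s]

# words: strip the end punctuation, then split on whitespace
words = [w for s in sentences for w in s.rstrip(".!?").split()]

# characters of a single word
characters = list(words[1])

print(len(paragraphs), len(sentences), len(words), characters)
```

Each level gives a different unit to build features from; the rest of this notebook works at the word level.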

## Feature Extraction in Text
This is one of the most important parts of any machine learning pipeline. We extract features that are relevant and that the computer/device can process. For text, "features" can mean many things.

Unlike images, whose pixel values are already available as numeric data/features, text has to be analyzed to extract features. Some of the things we can do are:
- count the occurrences of each word and use the counts as features
- use a one-hot encoding scheme for a text (word/paragraph/document)
- use techniques like tf-idf (term frequency-inverse document frequency), which weights each term by how rare it is across the documents.
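
The notebook does not implement tf-idf, so here is a minimal sketch of the idea. It assumes the plain textbook definitions, tf = count / document length and idf = log(N / df); libraries typically use smoothed variants, so their numbers will differ. The function name `tf_idf` and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (vocab, rows): one tf-idf row per non-empty document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            (counts[tok] / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok in vocab
        ])
    return vocab, rows

# made-up documents for illustration
docs = ["i love caffeine", "i love solving problems", "i am caffeine addict"]
tfidf_vocab, tfidf_rows = tf_idf(docs)

# "i" occurs in every document, so its weight is 0 everywhere
print(tfidf_rows[0][tfidf_vocab.index("i")])
```

A token that appears in every document carries no discriminating information and gets weight 0, while a token unique to one document gets the highest weight there.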


## Label Encoding of Words
First, let's convert the words of a text into features. Here, we are doing label encoding of the words/tokens.  
That is, we simply assign an index (id) number to each token/word in the given text.  
**Note**:  
Token -> Word occurring in the text  
Vocabulary -> Unique words in the given text or textual documents  
<br/>
A text is represented by a fixed-size vector with one slot per vocabulary token: a slot holds the token's id if that token occurs in the text, and 0 otherwise.

In [69]:
text = "I am paradox i am nish i love caffeine i love solving problems i am caffeine addict"

In [70]:
# preprocess
text = text.lower()

#### Generate Vocab

In [71]:

def get_vocab(text):
    """
        Just get unique tokens. 
        This is achieved through set() to remove duplicates
    """
    return sorted(set(text.split()))

In [72]:
vocab = get_vocab(text)
vocab

['addict',
 'am',
 'caffeine',
 'i',
 'love',
 'nish',
 'paradox',
 'problems',
 'solving']

#### Create a unique id for each vocab token

In [73]:
def map_token_to_id(vocab):
    """
        Loop over the sorted vocab and assign each token an id.
        Ids start from 1, because 0 is used later
        to mark tokens that are absent from a text.
    """
    vocab = sorted(vocab) # just in case vocab is not sorted
    res = {}
    for i, token in enumerate(vocab):
        res[token] = i+1
    return res

In [74]:
# create a dictionary for mapping vocab token to id
token_to_idx = map_token_to_id(vocab)
token_to_idx

{'addict': 1,
 'am': 2,
 'caffeine': 3,
 'i': 4,
 'love': 5,
 'nish': 6,
 'paradox': 7,
 'problems': 8,
 'solving': 9}

#### Create id to token mapping

In [75]:
def map_id_to_token(token_to_idx):
    """
        This is reverse of map_token_to_id
    """
    return {idx:token for token, idx in token_to_idx.items()}

In [76]:
# create mapping from id to token for future use
idx_to_token = map_id_to_token(token_to_idx)
idx_to_token

{1: 'addict',
 2: 'am',
 3: 'caffeine',
 4: 'i',
 5: 'love',
 6: 'nish',
 7: 'paradox',
 8: 'problems',
 9: 'solving'}

#### Encode each word from text

In [77]:
def encode_text(text, token_to_idx):
    """
        Build a vocabulary-sized vector: each slot holds the id of the
        vocab token if that token occurs in the text, else 0.
    """
    tokens = set(text.split())  # a set gives fast membership checks
    # dicts preserve insertion order, so the slots follow the sorted vocab
    return [idx if token in tokens else 0 for token, idx in token_to_idx.items()]

In [78]:
features = encode_text('i love paradox', token_to_idx)
features

[0, 0, 0, 4, 5, 0, 7, 0, 0]

In [80]:
df = pd.DataFrame([features], columns = vocab)

In [81]:
df

|   | addict | am | caffeine | i | love | nish | paradox | problems | solving |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 4 | 5 | 0 | 7 | 0 | 0 |
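
The vector above has one slot per vocabulary token, so the word order of the text is lost. A different use of the same ids keeps the order by encoding the text as a sequence of ids. The sketch below is an assumption about how that could look, not the notebook's method; `encode_as_sequence` and `unknown_id` are hypothetical names, and the mapping is copied from the output above so the block runs on its own.

```python
# token-to-id mapping copied from the notebook output above
token_to_idx = {
    "addict": 1, "am": 2, "caffeine": 3, "i": 4, "love": 5,
    "nish": 6, "paradox": 7, "problems": 8, "solving": 9,
}

def encode_as_sequence(text, token_to_idx, unknown_id=0):
    """Map each token of the text, in order, to its id (unknown_id if unseen)."""
    return [token_to_idx.get(tok, unknown_id) for tok in text.lower().split()]

print(encode_as_sequence("i love paradox", token_to_idx))
print(encode_as_sequence("i love to solving problems", token_to_idx))
```

The output length now depends on the text length, so sequences usually need padding or truncation before they can be compared or fed to a model.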


### Find Textual Similarity Using Label-Encoded Text
Let's test whether two texts are similar under this encoding.  
We are going to improve on this with other encoding schemes.

In [110]:
def compute_cosine_similarity(v1, v2):
    """Cosine of the angle between v1 and v2: dot(v1, v2) / (|v1| * |v2|)."""
    mag1 = np.linalg.norm(v1)
    mag2 = np.linalg.norm(v2)
    return np.dot(v1, v2)/(mag1 * mag2)

In [116]:
text1 = "i love problems"
text2 = "i love to solving problems"

In [117]:
features1 = encode_text(text1, token_to_idx)
features1

[0, 0, 0, 4, 5, 0, 0, 8, 0]

In [118]:
features2 = encode_text(text2, token_to_idx)
features2

[0, 0, 0, 4, 5, 0, 0, 8, 9]

In [119]:
df = pd.DataFrame([features1, features2], columns=vocab)

In [120]:
df

|   | addict | am | caffeine | i | love | nish | paradox | problems | solving |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 4 | 5 | 0 | 0 | 8 | 0 |
| 1 | 0 | 0 | 0 | 4 | 5 | 0 | 0 | 8 | 9 |


In [121]:
compute_cosine_similarity(features1, features2)

0.7513428837969107
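
To see where this number comes from, and why label ids make a shaky basis for similarity, the sketch below recomputes the score in plain Python and then changes only the (arbitrary) id of `solving` from 9 to 1. The function name `cosine_similarity_safe` and the zero-vector guard are additions for illustration; they are not part of the notebook's `compute_cosine_similarity`.

```python
import math

def cosine_similarity_safe(v1, v2):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = math.sqrt(sum(a * a for a in v1))
    mag2 = math.sqrt(sum(b * b for b in v2))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # avoids a division by zero for texts with no vocab tokens
    return dot / (mag1 * mag2)

features1 = [0, 0, 0, 4, 5, 0, 0, 8, 0]   # "i love problems"
features2 = [0, 0, 0, 4, 5, 0, 0, 8, 9]   # "i love to solving problems"

# dot = 4*4 + 5*5 + 8*8 = 105, |v1| = sqrt(105), |v2| = sqrt(186)
score = cosine_similarity_safe(features1, features2)

# same two texts, but pretend "solving" had been assigned id 1 instead of 9
relabeled2 = [0, 0, 0, 4, 5, 0, 0, 8, 1]
relabeled_score = cosine_similarity_safe(features1, relabeled2)

print(score, relabeled_score)
```

The texts did not change, yet the similarity did, because the ids act as magnitudes even though they are only labels. This is the weakness the next encodings address.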

## One Hot Encoding
This is similar to label encoding. Instead of using the **index** number, we simply use the binary values 1/0 
to represent the presence or absence of a token.

In [122]:
text = "I am paradox I am Nish i love caffeine i love solving problems i am caffeine addict"

In [123]:
# preprocess
text = text.lower().strip()

In [124]:
vocab = get_vocab(text)
vocab

['addict',
 'am',
 'caffeine',
 'i',
 'love',
 'nish',
 'paradox',
 'problems',
 'solving']

In [125]:
token_to_idx = map_token_to_id(vocab)
token_to_idx

{'addict': 1,
 'am': 2,
 'caffeine': 3,
 'i': 4,
 'love': 5,
 'nish': 6,
 'paradox': 7,
 'problems': 8,
 'solving': 9}

In [126]:
def encode_text2(text, token_to_idx):
    tokens = set(text.split())  # a set gives fast membership checks
    # loop through each token in vocab
    # and check if the vocab token is in our text
    # if present, we put 1, else 0
    return [1 if token in tokens else 0 for token in token_to_idx]

In [127]:
text1 = "i love problems"
text2 = "i love caffeine"

In [128]:
features1 = encode_text2(text1, token_to_idx)
features2 = encode_text2(text2, token_to_idx)
features1, features2

([0, 0, 0, 1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 1, 0, 0, 0, 0])

In [129]:
df = pd.DataFrame([features1, features2], columns=vocab)

In [130]:
df

|   | addict | am | caffeine | i | love | nish | paradox | problems | solving |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |


In [131]:
compute_cosine_similarity(features1, features2)

0.6666666666666667
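
With 1/0 vectors, the dot product counts the tokens the two texts share, and each norm is the square root of the number of distinct vocab tokens in a text. So the same score can be computed directly from sets. This is a small sketch of that equivalence; `one_hot_cosine` is a hypothetical helper and the vocab is copied from the output above.

```python
import math

# vocab copied from the notebook output above
vocab = ["addict", "am", "caffeine", "i", "love",
         "nish", "paradox", "problems", "solving"]

def one_hot_cosine(text_a, text_b, vocab):
    """Cosine similarity of two one-hot encoded texts, computed from sets."""
    known = set(vocab)
    a = set(text_a.split()) & known   # distinct vocab tokens in text_a
    b = set(text_b.split()) & known   # distinct vocab tokens in text_b
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# shared tokens: {"i", "love"} -> 2 / sqrt(3 * 3)
set_score = one_hot_cosine("i love problems", "i love caffeine", vocab)
print(set_score)
```

Unlike the label-encoded score, this value depends only on which tokens overlap, not on the ids they happen to receive.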

## Count Based Encoding
While one-hot encoding is often good enough, it discards how often each word occurs in the text. 
A count vectorizer keeps that information by using the frequency of each word as its feature value.  
So, here we have to create a mapping from each word to its count.

In [132]:
from collections import Counter

In [133]:
text = "I am paradox I am Nish i love caffeine i love solving problems i am caffeine addict".lower().strip()

In [134]:
def get_frequency_map(text):
    tokens = text.split()
    c = Counter(tokens)
    return dict(c)

In [135]:
frequency_map = get_frequency_map(text)
frequency_map

{'i': 5,
 'am': 3,
 'paradox': 1,
 'nish': 1,
 'love': 2,
 'caffeine': 2,
 'solving': 1,
 'problems': 1,
 'addict': 1}

In [136]:
vocab = get_vocab(text)
vocab

['addict',
 'am',
 'caffeine',
 'i',
 'love',
 'nish',
 'paradox',
 'problems',
 'solving']

In [137]:
token_to_idx = map_token_to_id(vocab)
token_to_idx

{'addict': 1,
 'am': 2,
 'caffeine': 3,
 'i': 4,
 'love': 5,
 'nish': 6,
 'paradox': 7,
 'problems': 8,
 'solving': 9}

In [144]:
def encode_text3(text, token_to_idx):
    frequency_map = get_frequency_map(text)
    # loop through each token in vocab
    # and look up how often it occurs in our text
    # if present, we put the corresponding count value, else 0
    return [frequency_map.get(token, 0) for token in token_to_idx]

In [145]:
features = encode_text3(text, token_to_idx)
df = pd.DataFrame([features], columns=vocab)

In [146]:
df

|   | addict | am | caffeine | i | love | nish | paradox | problems | solving |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 2 | 5 | 2 | 1 | 1 | 1 | 1 |


In [147]:
text1 = "i love problems"
text2 = "i love love caffeine"

In [149]:
features1 = encode_text3(text1, token_to_idx)
features2 = encode_text3(text2, token_to_idx)
features1, features2

([0, 0, 0, 1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 2, 0, 0, 0, 0])

In [150]:
df = pd.DataFrame([features1, features2], columns=vocab)

In [151]:
df

|   | addict | am | caffeine | i | love | nish | paradox | problems | solving |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 |


In [152]:
compute_cosine_similarity(features1, features2)

0.7071067811865476
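
The count vector can also be built straight from a `Counter`, since a `Counter` returns 0 for keys it has not seen. The sketch below uses that shortcut and checks the similarity by hand; `encode_counts` is a hypothetical name and the vocab is copied from the output above.

```python
import math
from collections import Counter

# vocab copied from the notebook output above
vocab = ["addict", "am", "caffeine", "i", "love",
         "nish", "paradox", "problems", "solving"]

def encode_counts(text, vocab):
    """One count per vocab token; tokens outside the vocab are ignored."""
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocab]  # missing tokens count as 0

counts1 = encode_counts("i love problems", vocab)
counts2 = encode_counts("i love love caffeine", vocab)

# dot = 1*1 + 1*2 = 3, |v1| = sqrt(3), |v2| = sqrt(1 + 1 + 4) = sqrt(6)
dot = sum(a * b for a, b in zip(counts1, counts2))
count_score = dot / (math.sqrt(3) * math.sqrt(6))
print(counts1, counts2, count_score)
```

Repeating `love` raises the score from the one-hot value of 2/3 to about 0.707, because the repeated word now weighs more in the second vector.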

## Using Scikit-Learn Vectorizer
Since scikit-learn already provides the encoding schemes above, we can simply use them.

In [153]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk

In [154]:
text = "I am paradox I am Nish i love caffeine i love solving problems i am caffeine addict".lower().strip()

In [155]:
tokens = nltk.word_tokenize(text)
tokens

['i',
 'am',
 'paradox',
 'i',
 'am',
 'nish',
 'i',
 'love',
 'caffeine',
 'i',
 'love',
 'solving',
 'problems',
 'i',
 'am',
 'caffeine',
 'addict']

In [59]:
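vectorizer = CountVectorizer()
# note: fit_transform expects an iterable of documents,
# so each token here is treated as its own one-word document
features = vectorizer.fit_transform(tokens)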

In [60]:
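# newer scikit-learn: get_feature_names_out(); 'i' is dropped by the default token_pattern (2+ characters)
vectorizer.get_feature_names_out().tolist()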

['addict', 'am', 'caffeine', 'love', 'nish', 'paradox', 'problems', 'solving']

In [61]:
test = "caffeine is love"

In [62]:
# one row per token; "is" is not in the vocabulary, so its row stays empty
features = vectorizer.transform(nltk.word_tokenize(test))
print(features)

  (0, 2)	1
  (2, 3)	1


In [63]:
features = features.toarray()

In [64]:
features

array([[0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0]])

In [65]:
df = pd.DataFrame(features, columns=vectorizer.get_feature_names_out())

In [156]:
df

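|   | addict | am | caffeine | love | nish | paradox | problems | solving |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |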
