> **Copyright (c) 2020 Skymind Holdings Berhad**<br><br>
> **Copyright (c) 2021 Skymind Education Group Sdn. Bhd.**<br>
<br>
Licensed under the Apache License, Version 2.0 (the "License");
<br>you may not use this file except in compliance with the License.
<br>You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0/
<br>
<br>Unless required by applicable law or agreed to in writing, software
<br>distributed under the License is distributed on an "AS IS" BASIS,
<br>WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
<br>See the License for the specific language governing permissions and
<br>limitations under the License.
<br>
<br>
**SPDX-License-Identifier: Apache-2.0**
<br>

# Introduction

This notebook is the hands-on companion for text representation. You will implement several text representation techniques from scratch and with the `sklearn` library.

# Notebook Content

* [One-Hot Encoding](#One-Hot-Encoding)


* [Bag-of-Words (BoW)](#Bag-of-Words-(BoW))


* [Count Vectorizer](#Count-Vectorizer)


* [TF-IDF](#TF-IDF)

# Text Representation
## One-Hot Encoding

In [1]:
import numpy as np
import re # Regular Expression
import nltk

from nltk import sent_tokenize, word_tokenize

In [2]:
# Read the whole sample file into a single string
with open("../../../resources/day_03/sample_text.txt") as file:
    text = file.read()

print("Text Data:")
print(text)

Text Data:
What Is Artificial Intelligence (AI)?
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.

The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by humans. Deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured data such as text, images, or video.

When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about

In [3]:
sentences = sent_tokenize(text)

print("Sentence List:")
print(sentences)

Sentence List:
['What Is Artificial Intelligence (AI)?', 'Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions.', 'The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.', 'The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal.', 'A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by humans.', 'Deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured data such as text, images, or video.', 'When most people hear the term artificial intelligence, the first thing they usually think of is robots.', "That's because big-budget films and n

In [4]:
# Lowercase each sentence and strip everything that is not a word character or whitespace
processed_sentences = [re.sub(r'[^\w\s]', '', s.lower()) for s in sentences]

print("Cleaned Sentences:")
print(processed_sentences)

Cleaned Sentences:
['what is artificial intelligence ai', 'artificial intelligence ai refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions', 'the term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problemsolving', 'the ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal', 'a subset of artificial intelligence is machine learning which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by humans', 'deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured data such as text images or video', 'when most people hear the term artificial intelligence the first thing they usually think of is robots', 'thats because bigbudget films and novels weave st

In [5]:
words = []

for s in processed_sentences:
    words += word_tokenize(s)

print("Word List:")
print(words)

Word List:
['what', 'is', 'artificial', 'intelligence', 'ai', 'artificial', 'intelligence', 'ai', 'refers', 'to', 'the', 'simulation', 'of', 'human', 'intelligence', 'in', 'machines', 'that', 'are', 'programmed', 'to', 'think', 'like', 'humans', 'and', 'mimic', 'their', 'actions', 'the', 'term', 'may', 'also', 'be', 'applied', 'to', 'any', 'machine', 'that', 'exhibits', 'traits', 'associated', 'with', 'a', 'human', 'mind', 'such', 'as', 'learning', 'and', 'problemsolving', 'the', 'ideal', 'characteristic', 'of', 'artificial', 'intelligence', 'is', 'its', 'ability', 'to', 'rationalize', 'and', 'take', 'actions', 'that', 'have', 'the', 'best', 'chance', 'of', 'achieving', 'a', 'specific', 'goal', 'a', 'subset', 'of', 'artificial', 'intelligence', 'is', 'machine', 'learning', 'which', 'refers', 'to', 'the', 'concept', 'that', 'computer', 'programs', 'can', 'automatically', 'learn', 'from', 'and', 'adapt', 'to', 'new', 'data', 'without', 'being', 'assisted', 'by', 'humans', 'deep', 'learni

In [6]:
unique_words = set(words)

print("Unique Words:")
print(unique_words)

Unique Words:
{'humans', 'which', 'innovators', 'skeptical', 'problemsolving', 'mimicking', 'mind', 'with', 'specific', 'hear', 'evolving', 'traits', 'longer', 'develop', 'intelligence', 'reasoning', 'many', 'previous', 'enable', 'since', 'programmed', 'any', 'activities', 'a', 'for', 'embody', 'through', 'function', 'concretely', 'some', 'industries', 'actions', 'reason', 'soon', 'experience', 'ideal', 'techniques', 'novels', 'robots', 'granted', 'become', 'absorption', 'rapid', 'adapt', 'easily', 'exceed', 'this', 'its', 'simple', 'automatically', 'and', 'have', 'term', 'crossdisciplinary', 'benefit', 'able', 'recognition', 'considered', 'outdated', 'perception', 'text', 'computer', 'are', 'goals', 'about', 'automatic', 'amounts', 'may', 'can', 'also', 'include', 'researchers', 'of', 'usually', 'films', 'truth', 'concept', 'or', 'chance', 'extent', 'laced', 'advances', 'character', 'ai', 'such', 'way', 'continuously', 'take', 'those', 'on', 'surprisingly', 'what', 'benchmarks', 'appr

In [7]:
vocab = {word:idx for idx, word in enumerate(unique_words)}

print("Vocabulary Dictionary:")
print(vocab)

Vocabulary Dictionary:
{'humans': 0, 'which': 1, 'innovators': 2, 'skeptical': 3, 'problemsolving': 4, 'mimicking': 5, 'mind': 6, 'with': 7, 'specific': 8, 'hear': 9, 'evolving': 10, 'traits': 11, 'longer': 12, 'develop': 13, 'intelligence': 14, 'reasoning': 15, 'many': 16, 'previous': 17, 'enable': 18, 'since': 19, 'programmed': 20, 'any': 21, 'activities': 22, 'a': 23, 'for': 24, 'embody': 25, 'through': 26, 'function': 27, 'concretely': 28, 'some': 29, 'industries': 30, 'actions': 31, 'reason': 32, 'soon': 33, 'experience': 34, 'ideal': 35, 'techniques': 36, 'novels': 37, 'robots': 38, 'granted': 39, 'become': 40, 'absorption': 41, 'rapid': 42, 'adapt': 43, 'easily': 44, 'exceed': 45, 'this': 46, 'its': 47, 'simple': 48, 'automatically': 49, 'and': 50, 'have': 51, 'term': 52, 'crossdisciplinary': 53, 'benefit': 54, 'able': 55, 'recognition': 56, 'considered': 57, 'outdated': 58, 'perception': 59, 'text': 60, 'computer': 61, 'are': 62, 'goals': 63, 'about': 64, 'automatic': 65, 'amou

In [8]:
# Size of vocab
m = len(vocab)

print("Size of vocab:", m)

Size of vocab: 200


In [9]:
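def oneHotEncode(sentence):
    # Returns one row per word; each row has length m (the vocabulary size)
    encoded_vec = []
    for word in word_tokenize(sentence):
        vec = [0] * m
        # Mark this word's position (raises KeyError if the word is not in vocab)
        vec[vocab[word]] = 1
        encoded_vec.append(vec)
    return np.array(encoded_vec)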

In [10]:
# Encode the first sentence
sentence = processed_sentences[0]

print("Sentence:", sentence)
encoded_sentence = oneHotEncode(sentence)
print("\nAfter One Hot Encoding:")
print(encoded_sentence)
print("\nSize of encoded vector:")
print(encoded_sentence.shape)

Sentence: what is artificial intelligence ai

After One Hot Encoding:
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

In [12]:
label_encoder = LabelEncoder()
int_words = label_encoder.fit_transform(words)

print(int_words)

[193  99  21  98  11  21  98  11 146 183 174 152 125  88  98  93 110 172
  20 137 183 179 106  90  16 115 175   6 174 170 114  13  29  18 183  17
 109 172  70 184  24 197   0  88 117 162  22 105  16 136 174  91  45 125
  21  98  99 101   1 183 141  16 165   6 172  84 174  36  43 125   5   0
 157  81   0 161 125  21  98  99 109 105 195 146 183 174  49 172  48 138
  41  26 104  77  16   9 183 120  55 198  32  23  39  90  56 105 168  64
 180  25 105 182 174   4 125  87  14 125 186  55 162  22 171  92 128 190
 194 119 132  86 174 170  21  98 174  75 178 177 188 179 125  99 149 173
  30  37  74  16 123 192 158   3  89 110 172 199  85 126  61  38 122  53
  29  80  77 174 185  21  98  99  27 126 174 135 172  88  98  41  29  57
  93   0 191 172   0 109  41  62 115 100  16  69 167  77 174 119 151 183
 181 172  20  65 118  47 174  82 125  21  98  94 116  88  46   8 148  16
  59  93 174  73  20 111 163 140 159  93 116   7 162  22 105 143  16 133
 183 174  72 172 176  41  29  50  57 155  33 172  9

In [13]:
oneHot_encoder = OneHotEncoder()

print(oneHot_encoder.fit_transform(np.array(words).reshape(len(words), 1)))

  (0, 193)	1.0
  (1, 99)	1.0
  (2, 21)	1.0
  (3, 98)	1.0
  (4, 11)	1.0
  (5, 21)	1.0
  (6, 98)	1.0
  (7, 11)	1.0
  (8, 146)	1.0
  (9, 183)	1.0
  (10, 174)	1.0
  (11, 152)	1.0
  (12, 125)	1.0
  (13, 88)	1.0
  (14, 98)	1.0
  (15, 93)	1.0
  (16, 110)	1.0
  (17, 172)	1.0
  (18, 20)	1.0
  (19, 137)	1.0
  (20, 183)	1.0
  (21, 179)	1.0
  (22, 106)	1.0
  (23, 90)	1.0
  (24, 16)	1.0
  :	:
  (332, 11)	1.0
  (333, 99)	1.0
  (334, 52)	1.0
  (335, 66)	1.0
  (336, 183)	1.0
  (337, 35)	1.0
  (338, 112)	1.0
  (339, 60)	1.0
  (340, 95)	1.0
  (341, 110)	1.0
  (342, 20)	1.0
  (343, 196)	1.0
  (344, 187)	1.0
  (345, 0)	1.0
  (346, 54)	1.0
  (347, 19)	1.0
  (348, 27)	1.0
  (349, 126)	1.0
  (350, 113)	1.0
  (351, 48)	1.0
  (352, 150)	1.0
  (353, 107)	1.0
  (354, 139)	1.0
  (355, 16)	1.0
  (356, 118)	1.0


In [14]:
# Encode the first sentence with the fitted sklearn encoder
sentence = processed_sentences[0]

print("Sentence:", sentence)
print("\nAfter One Hot Encoding:")
for w in word_tokenize(sentence):
    print(oneHot_encoder.transform([[w]]))

Sentence: what is artificial intelligence ai

After One Hot Encoding:
  (0, 193)	1.0
  (0, 99)	1.0
  (0, 21)	1.0
  (0, 98)	1.0
  (0, 11)	1.0
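The from-scratch encoder above builds its vocabulary from a `set`, whose iteration order for strings can change between Python runs, and it raises `KeyError` for a word that is not in the vocabulary. The following is a minimal sketch of one way to address both, assuming a sorted vocabulary and an all-zero row for unknown words; the helper names `build_index` and `one_hot` and the extra token `"robotics"` are hypothetical and not part of the original notebook.

```python
def build_index(tokens):
    # Sorting the unique tokens gives every word a stable, reproducible index
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}


def one_hot(tokens, index):
    # One row per token; an unknown token keeps an all-zero row (an assumption)
    rows = []
    for tok in tokens:
        row = [0] * len(index)
        if tok in index:
            row[index[tok]] = 1
        rows.append(row)
    return rows


sample_tokens = "what is artificial intelligence ai".split()
index = build_index(sample_tokens)

# "robotics" is not in the vocabulary, so its row stays all zeros
encoded = one_hot(sample_tokens + ["robotics"], index)

for tok, row in zip(sample_tokens + ["robotics"], encoded):
    print(tok, row)
```

Whether an unknown word should map to a zero row, a dedicated "unknown" index, or an error depends on the downstream model, so treat the zero row as just one option.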


## Bag-of-Words (BoW)

In [15]:
def vectorize(tokens):
    '''Take the list of words in a sentence and return a vector with 
    one entry per word in filtered_vocab. Each entry is the number of 
    times that word occurs in tokens (0 if it is absent).'''
    vector=[]
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector

In [16]:
def unique(sequence):
    '''Return a list with duplicates removed and the original order 
    preserved. set() alone does not preserve ordering, so it is 
    used here only to track the items already seen.'''
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

In [17]:
# Create a list of stopwords. You can also import stopwords from nltk
stopwords=["to","is","a"]
# List of special characters. You can also use regular expressions
special_char=[",",":"," ",";",".","?"]

In [18]:
# Define the sentences in the corpus; in our case, just two
string1="Welcome to Great Learning , Now start learning"
string2="Learning is a good practice"

In [19]:
#convert them to lower case
string1=string1.lower()
string2=string2.lower()

In [20]:
#split the sentences into tokens
tokens1=string1.split()
tokens2=string2.split()

print(tokens1)
print(tokens2)

['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'learning']
['learning', 'is', 'a', 'good', 'practice']


In [21]:
#create a vocabulary list
vocab=unique(tokens1+tokens2)
print(vocab)

['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'is', 'a', 'good', 'practice']


In [22]:
#filter the vocabulary list
filtered_vocab=[]
for w in vocab: 
    if w not in stopwords and w not in special_char: 
        filtered_vocab.append(w)

print(filtered_vocab)

['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']


In [23]:
# Convert sentences into vectors
vector1=vectorize(tokens1)
print(vector1)
vector2=vectorize(tokens2)
print(vector2)

[1, 1, 2, 1, 1, 0, 0]
[0, 0, 1, 0, 0, 1, 1]
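The Bag-of-Words steps above (tokenize, build an ordered vocabulary, filter it, count) can also be sketched more compactly with `collections.Counter`. This is an illustrative alternative rather than the notebook's own method; the names `build_vocab`, `bow_vector`, `tokens_a`, and `tokens_b` are hypothetical.

```python
from collections import Counter

stopwords = {"to", "is", "a"}
special_chars = {",", ":", ";", ".", "?"}


def build_vocab(token_lists):
    # Keep first-seen order while skipping stopwords and special characters
    vocab, seen = [], set()
    for tokens in token_lists:
        for tok in tokens:
            if tok in seen or tok in stopwords or tok in special_chars:
                continue
            seen.add(tok)
            vocab.append(tok)
    return vocab


def bow_vector(tokens, vocab):
    # Counter returns 0 for words that never occur in this sentence
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


tokens_a = "welcome to great learning , now start learning".split()
tokens_b = "learning is a good practice".split()

bow_vocab = build_vocab([tokens_a, tokens_b])
vec_a = bow_vector(tokens_a, bow_vocab)
vec_b = bow_vector(tokens_b, bow_vocab)

print(bow_vocab)
print(vec_a)
print(vec_b)
```

Counting once per sentence with `Counter` avoids calling `list.count` for every vocabulary word, which matters as the vocabulary grows.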


## Count Vectorizer

In [24]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
  
document = ["Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions.",
            "The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal.",
            "A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data"]
  
# Create a Vectorizer Object
vectorizer = CountVectorizer()
  
vectorizer.fit(document)
  
# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)
  
# Encode the Document
vector = vectorizer.transform(document)
  
# Summarizing the Encoded Texts
print("Encoded Document is:")
print(vector.toarray())

Vocabulary:  {'artificial': 7, 'intelligence': 23, 'ai': 4, 'refers': 37, 'to': 46, 'the': 43, 'simulation': 38, 'of': 33, 'human': 19, 'in': 22, 'machines': 30, 'that': 42, 'are': 6, 'programmed': 34, 'think': 45, 'like': 28, 'humans': 20, 'and': 5, 'mimic': 31, 'their': 44, 'actions': 2, 'ideal': 21, 'characteristic': 12, 'is': 24, 'its': 25, 'ability': 0, 'rationalize': 36, 'take': 41, 'have': 18, 'best': 9, 'chance': 11, 'achieving': 1, 'specific': 39, 'goal': 17, 'subset': 40, 'machine': 29, 'learning': 27, 'which': 47, 'concept': 14, 'computer': 13, 'programs': 35, 'can': 10, 'automatically': 8, 'learn': 26, 'from': 16, 'adapt': 3, 'new': 32, 'data': 15}
Encoded Document is:
[[0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 2 0 0 0 0 1 0 1 1 0 1 1 0
  0 1 1 0 0 0 1 1 1 1 2 0]
 [1 1 1 0 0 1 0 1 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 2 0 0
  1 0 0 1 0 1 1 2 0 0 1 0]
 [0 0 0 1 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 1 0 1 0 0 1 1 0 1
  0 1 0 0 1 0 1 1 0 0 2 1]]
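To see where the vocabulary indices above come from, here is a rough pure-Python sketch of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, and index the unique tokens in sorted order. It is a simplified approximation for illustration, not scikit-learn's implementation; the names `tokenize`, `count_vocab`, and `count_vector` are hypothetical.

```python
import re
from collections import Counter

docs = [
    "Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions.",
    "The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal.",
    "A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # so single-letter words such as "a" are dropped
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized_docs = [tokenize(d) for d in docs]

# Sorted unique tokens give each word its column index
unique_tokens = sorted({tok for doc in tokenized_docs for tok in doc})
count_vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}


def count_vector(tokens):
    # One count per vocabulary word, in column order
    counts = Counter(tokens)
    return [counts[tok] for tok in unique_tokens]


count_matrix = [count_vector(doc) for doc in tokenized_docs]

print(len(count_vocab), count_vocab["artificial"], count_vocab["to"])
print(count_matrix[0])
```

Because the single-letter word "a" is filtered out by the token pattern, it does not appear in the vocabulary printed above either.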


## TF-IDF

In [26]:
# Import TF-IDF Vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

text1 = ['i love nlp', 'nlp is so cool', 
'nlp is all about helping machines process language', 
'this tutorial is on baisc nlp technique']

# Initialize vectorizer
tf = TfidfVectorizer()

# Fit the vectorizer on the data
txt_fitted = tf.fit(text1)

# Vectorize the data
txt_transformed = txt_fitted.transform(text1)

print(txt_transformed)

idf = tf.idf_
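# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
print(dict(zip(txt_fitted.get_feature_names_out(), idf)))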

  (0, 9)	0.46263733109032296
  (0, 7)	0.8865476297873808
  (1, 12)	0.6108781210948048
  (1, 9)	0.318781545479458
  (1, 5)	0.3899155916311765
  (1, 3)	0.6108781210948048
  (2, 11)	0.38691946801941357
  (2, 9)	0.20191062952175412
  (2, 8)	0.38691946801941357
  (2, 6)	0.38691946801941357
  (2, 5)	0.24696568444132605
  (2, 4)	0.38691946801941357
  (2, 1)	0.38691946801941357
  (2, 0)	0.38691946801941357
  (3, 15)	0.4196006932295896
  (3, 14)	0.4196006932295896
  (3, 13)	0.4196006932295896
  (3, 10)	0.4196006932295896
  (3, 9)	0.2189650485963657
  (3, 5)	0.26782568715384725
  (3, 2)	0.4196006932295896
{'about': 1.916290731874155, 'all': 1.916290731874155, 'baisc': 1.916290731874155, 'cool': 1.916290731874155, 'helping': 1.916290731874155, 'is': 1.2231435513142097, 'language': 1.916290731874155, 'love': 1.916290731874155, 'machines': 1.916290731874155, 'nlp': 1.0, 'on': 1.916290731874155, 'process': 1.916290731874155, 'so': 1.916290731874155, 'technique': 1.916290731874155, 'this': 1.91629073
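The IDF values printed above are consistent with the smoothed formula that `TfidfVectorizer` uses by default, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, followed by L2 normalization of each row. The sketch below recomputes them from scratch on the same four sentences as a sanity check. It is a simplified illustration assuming the default settings, and the names `corpus`, `tokenize`, `idf_scores`, and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = ['i love nlp', 'nlp is so cool',
          'nlp is all about helping machines process language',
          'this tutorial is on baisc nlp technique']


def tokenize(doc):
    # Same default tokenization idea: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(d) for d in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(corpus)

# Document frequency: in how many documents each term appears
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf_scores = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    # Raw term count times IDF, then scale the row to unit L2 length
    counts = Counter(tokens)
    raw = [counts[t] * idf_scores[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


tfidf = [tfidf_row(doc) for doc in tokenized]

print(idf_scores["nlp"], idf_scores["is"], idf_scores["love"])
print(tfidf[0][vocabulary.index("love")], tfidf[0][vocabulary.index("nlp")])
```

A term such as "nlp" that appears in every document gets the minimum IDF of 1.0, while terms that appear in a single document get the highest weight.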

# Contributors

**Author**
<br>Chee Lam

# References

1. [Introduction to Text Representations for Language Processing](https://towardsdatascience.com/introduction-to-text-representations-for-language-processing-part-1-dc6e8068b8a4)
2. [An Overview for Text Representations in NLP](https://towardsdatascience.com/an-overview-for-text-representations-in-nlp-311253730af1)