
## Prepare Text Data with scikit-learn

### Word Counts with CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents,
build a vocabulary of known words, and encode new documents using that vocabulary.
Each document is encoded as a vector of word counts whose length equals the size of the vocabulary.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())


{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


In [2]:
# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())


[[0 0 0 0 0 0 0 1]]
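
The second result shows that a word missing from the fitted vocabulary ("puppy") is simply ignored, while "the" is still counted. The sketch below imitates that behavior in plain Python. It is an illustration only: `count_vector` is a hypothetical helper, not part of scikit-learn, and the regular expression is an approximation of the default tokenization (lowercased tokens of two or more word characters).

```python
import re

# Vocabulary copied from the output above: word -> column index.
vocabulary = {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2,
              'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}

def count_vector(document, vocabulary):
    """Count vocabulary words in a document; unknown words are skipped."""
    # Lowercase, then keep runs of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    counts = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            counts[vocabulary[token]] += 1
    return counts

print(count_vector("The quick brown fox jumped over the lazy dog.", vocabulary))
print(count_vector("the puppy", vocabulary))
```

Because unknown words are dropped, the vocabulary should be fit on text that is representative of the documents to be encoded later.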


### Word Frequencies with TfidfVectorizer


Word counts are a good starting point, but they are very basic. One issue with simple counts is that
some words, such as "the", appear many times, and their large counts are not very meaningful
in the encoded vectors. An alternative is to calculate word frequencies, and by far the most
popular method is TF-IDF. The acronym stands for Term Frequency - Inverse
Document Frequency, the two components of the score assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear in many documents.


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())


{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
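
The numbers above can be reproduced by hand. The sketch below assumes the default TfidfVectorizer settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, raw counts as term frequency, and L2 normalization of each document vector. The pre-tokenized `docs` list and all variable names are illustrative.

```python
import math
from collections import Counter

# The three documents, already lowercased and tokenized.
docs = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
        ["the", "dog"],
        ["the", "fox"]]
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency (assumed default behavior).
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

# TF-IDF for the first document, followed by L2 normalization.
tf = Counter(docs[0])
raw = {word: tf[word] * idf[word] for word in tf}
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {word: value / norm for word, value in raw.items()}

for word in sorted(tfidf):
    print(f"{word:8s} idf={idf[word]:.8f} tfidf={tfidf[word]:.8f}")
```

The word "the" appears in all three documents, so it gets the lowest possible IDF of 1.0. It still has the largest score in the first document because it occurs twice there.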


## How to Prepare Text Data With Keras

### Split Words with text_to_word_sequence

A good first step when working with text is to split it into words. Words are called tokens, and the process of splitting text into tokens is called tokenization. Keras provides the
text_to_word_sequence() function, which splits text into a list of words. By default it also lowercases the text and strips punctuation, as the output below shows.

In [4]:
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document
result = text_to_word_sequence(text)
print(result)


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
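
To make the steps concrete, here is a rough pure-Python approximation of what the function does: lowercase the text, replace filtered characters with the separator, then split. `simple_word_sequence` is a hypothetical helper, and the `FILTERS` string is assumed to match the Keras default rather than taken from this notebook.

```python
# Characters assumed to be filtered by default (punctuation, tab, newline).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Approximate text_to_word_sequence: lowercase, filter, then split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Split and drop the empty strings left by consecutive separators.
    return [token for token in text.split(split) if token]

print(simple_word_sequence("The quick brown fox jumped over the lazy dog."))
```

Note that the apostrophe is not in the filter list, so a contraction such as "don't" stays a single token in this sketch.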


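### Encoding with one_hot

Despite its name, the one_hot() function does not produce one-hot vectors. It hashes each word to an integer and returns a list of integer indexes. It needs a vocabulary size, which should be at least the number of unique words and can be larger if we intend to add more documents. Because the mapping is a hash, different words can receive the same integer. Like text_to_word_sequence(), it splits on white space and lowercases the tokens.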

In [7]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

8
[2, 7, 7, 5, 6, 3, 2, 6, 3]
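
The output above contains hash collisions: several different words share the same integer. The short check below makes them explicit, using only the tokens and the codes printed above. Since one_hot() relies on Python's built-in hash, which is randomized per interpreter session for strings unless PYTHONHASHSEED is fixed, the exact integers are likely to differ from run to run.

```python
from collections import defaultdict

# Tokens of the document and the codes printed above, in the same order.
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = [2, 7, 7, 5, 6, 3, 2, 6, 3]

# Group the distinct words that share each code.
words_by_code = defaultdict(set)
for word, code in zip(words, codes):
    words_by_code[code].add(word)

# Keep only the codes assigned to more than one distinct word.
collisions = {code: sorted(group)
              for code, group in words_by_code.items() if len(group) > 1}
print(collisions)
```

Three pairs of words collide in this run even though the hash space was sized 30% larger than the vocabulary. A larger space reduces collisions but does not rule them out.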


### Hash Encoding with hashing_trick

- A limitation of integer and count-based encodings is that they must maintain a vocabulary of
words and their mapping to integers. An alternative to this approach is to use a one-way hash
function to convert words to integers. This avoids the need to keep track of a vocabulary, which
is faster and requires less memory.

- The hashing_trick() function provides more flexibility than one_hot(), allowing you to specify
the hash function as either hash (the default) or other hash functions such as the built-in md5
function or your own function.

In [8]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'


# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)



8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
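
The idea behind the md5 variant can be sketched with the standard library alone. This is a sketch that is believed to follow the same approach (md5 digest taken modulo the hash space, with index 0 left unused), not the Keras source itself, and `md5_hashing_trick` is a hypothetical name. Unlike the built-in hash, md5 gives the same integers on every run.

```python
import hashlib

def md5_hashing_trick(tokens, n):
    """Map each token to an integer in [1, n - 1] using an md5 digest."""
    codes = []
    for token in tokens:
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        # Index 0 is left unused, so only n - 1 slots are available.
        codes.append(digest % (n - 1) + 1)
    return codes

tokens = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
n = round(len(set(tokens)) * 1.3)  # hash space size, as in the cell above
codes = md5_hashing_trick(tokens, n)
print(codes)
```

No vocabulary is stored anywhere: the same word always maps to the same integer, so new documents can be encoded without a fitting step.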


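### Tokenizer API

Keras provides the Tokenizer class for preparing text documents. The Tokenizer is fit on a set of documents and can then be used to encode them.
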
In [9]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)


In [10]:
# summarize what was learned / 4 attributes
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)


OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'good': 1, 'work': 2, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})
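
The word_index ordering follows from word_counts: words are ranked by count, highest first, and ties keep their order of first appearance. The sketch below rebuilds the same index in plain Python. It uses a simplified tokenization (lowercase, alphabetic runs only) that is an assumption for illustration, not the Tokenizer's actual filtering.

```python
import re
from collections import OrderedDict

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

# Count words in order of first appearance.
word_counts = OrderedDict()
for doc in docs:
    for word in re.findall(r"[a-z]+", doc.lower()):
        word_counts[word] = word_counts.get(word, 0) + 1

# Sort by count, highest first; the stable sort keeps first-seen order for ties.
ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

# Indexes start at 1; index 0 is not assigned to any word.
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
print(word_index)
```

"work" is the only word that occurs twice, which is why it receives index 1.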


In [16]:

# encode documents as count and TF-IDF matrices

encoded_docs_count = t.texts_to_matrix(docs, mode='count')
print(encoded_docs_count)
encoded_docs_tfidf = t.texts_to_matrix(docs, mode='tfidf')
print("Using TFIDF")
print(encoded_docs_tfidf)

[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
Using TFIDF
[[0.         0.         1.25276297 1.25276297 0.         0.
  0.         0.         0.        ]
 [0.         0.98082925 0.         0.         1.25276297 0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.         1.25276297
  1.25276297 0.         0.        ]
 [0.         0.98082925 0.         0.         0.         0.
  0.         1.25276297 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         1.25276297]]
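
Two details of these matrices are worth spelling out. First, each row has 9 columns for 8 words because column 0 is never assigned to a word. Second, the TF-IDF values differ from what TfidfVectorizer would give because the weighting is different. The sketch below uses a formula that reproduces the numbers printed above: `tf = 1 + ln(count)` and `idf = ln(1 + n_docs / (1 + df))`. The formula is inferred to match this output, and `keras_style_tfidf` is a hypothetical helper.

```python
import math

def keras_style_tfidf(count, doc_freq, n_docs):
    """TF-IDF weight for a word occurring `count` times in one document."""
    if count == 0:
        return 0.0
    tf = 1 + math.log(count)
    idf = math.log(1 + n_docs / (1 + doc_freq))
    return tf * idf

n_docs = 5  # t.document_count from the summary above

# "well" appears in one document, "work" in two (see t.word_docs).
print(round(keras_style_tfidf(1, 1, n_docs), 8))
print(round(keras_style_tfidf(1, 2, n_docs), 8))
```

As with any TF-IDF scheme, the word that appears in more documents ("work") gets the lower weight.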
