#Introduction to word representation

Working with machine learning models means you must express the information you want to process numerically. Many algorithms exist to represent words mathematically, such as Bag-of-Words, TF-IDF, Word2Vec, and FastText.
**This is called feature extraction or feature encoding.**


In this section we explain Bag-of-Words and TF-IDF and how they differ, then briefly introduce word embeddings.


##Bag-of-Words
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

a) A vocabulary of known words.

b) A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
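
The two ingredients above (a vocabulary and a measure of presence) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how scikit-learn implements it: the helper names `tokenize`, `build_vocabulary` and `bag_of_words` are hypothetical, and the tokenizer is a simple regular expression chosen for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of word characters (assumed tokenizer).
    return re.findall(r"\w+", text.lower())

def build_vocabulary(documents):
    # The vocabulary is the sorted set of every token seen in the corpus.
    return sorted({token for doc in documents for token in tokenize(doc)})

def bag_of_words(document, vocabulary):
    # Count each known word; tokens outside the vocabulary are ignored.
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each vector has one position per vocabulary word, and nothing in it records where in the sentence a word appeared.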


Limitations:
* **Vocabulary**: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
* **Sparsity**: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.
* **Meaning**: Discarding word order ignores the context, and in turn meaning of words in the document (semantics).
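
The loss of meaning can be seen with two toy sentences that use the same words in a different order. The sketch below assumes plain whitespace tokenization and two made-up Spanish sentences; `bow_counts` and `bigram_counts` are hypothetical helpers written for this illustration.

```python
from collections import Counter

def bow_counts(sentence):
    # Whitespace tokenization is enough for this toy example.
    return Counter(sentence.lower().split())

def bigram_counts(sentence):
    # Count pairs of adjacent words, which keeps some local word order.
    tokens = sentence.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = "el perro muerde al hombre"   # "the dog bites the man"
b = "el hombre muerde al perro"   # "the man bites the dog"

same_unigrams = bow_counts(a) == bow_counts(b)
same_bigrams = bigram_counts(a) == bigram_counts(b)

print(same_unigrams)  # True: the bag-of-words representations are identical
print(same_bigrams)   # False: bigrams tell the two sentences apart
```

Counting bigrams recovers part of the word order, at the cost of a larger and sparser vocabulary.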

The example below uses scikit-learn's `CountVectorizer` to build a BoW representation of two sentences in Spanish. It is deliberately basic: for a good word-vector transformation, the data should be cleaned before vectorization.

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords

# Download the NLTK stop-word lists (only needed once)
nltk.download('stopwords')

sent1 = 'El presidente y su joven gobierno son la mejor posibilidad para retomar la senda del desarrollo y la paz social'
sent2 = 'El presidente miente, la gente tiene que votar y llegar a un acuerdo'

# Remove Spanish stop words. The default ngram_range=(1, 1) counts single words;
# pass ngram_range=(2, 2) to count bigrams instead.
vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'))

# Learn the vocabulary and build the document-term matrix
BoW = vectorizer.fit_transform([sent1, sent2])
print(sorted(vectorizer.vocabulary_))  # vocabulary in alphabetical order
print(BoW.shape)                       # (number of documents, vocabulary size)
print(BoW.toarray())                   # word counts per document

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['acuerdo', 'desarrollo', 'gente', 'gobierno', 'joven', 'llegar', 'mejor', 'miente', 'paz', 'posibilidad', 'presidente', 'retomar', 'senda', 'social', 'votar']
(2, 15)
[[0 1 0 1 1 0 1 0 1 1 1 1 1 1 0]
 [1 0 1 0 0 1 0 1 0 0 1 0 0 0 1]]


In [56]:
import pandas as pd

#Visualization with pandas
df_bow_sklearn = pd.DataFrame(BoW.toarray(),columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

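|   | acuerdo | desarrollo | gente | gobierno | joven | llegar | mejor | miente | paz | posibilidad | presidente | retomar | senda | social | votar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |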


The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector. 



In [57]:
sent_out_of_vocab = ["Chile es un país joven"]
vector = vectorizer.transform(sent_out_of_vocab)
# "joven" is the only word in this sentence that exists in the vocabulary
print(vector.toarray())

[[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]]


In [58]:
#Visualization with pandas
import numpy as np

df_bow_sklearn = pd.DataFrame(np.concatenate([BoW.toarray(), vector.toarray()]),columns=vectorizer.get_feature_names_out())
df_bow_sklearn.head()

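|   | acuerdo | desarrollo | gente | gobierno | joven | llegar | mejor | miente | paz | posibilidad | presidente | retomar | senda | social | votar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |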


##TF-IDF
A problem with scoring by raw word frequency is that highly frequent words start to dominate the document representation (they receive larger scores), even though they may carry less “informational content” for the model than rarer, perhaps domain-specific, words.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, where:

**Term Frequency:** is a scoring of the frequency of the word in the current document (summarizes how often a given word appears within a document).

**Inverse Document Frequency:** is a scoring of how rare the word is across documents.

The resulting scores act as weights, so that not all words are treated as equally important or interesting.

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_TFIDF = TfidfVectorizer(stop_words = stopwords.words('spanish'))

TFIDF_model = vectorizer_TFIDF.fit_transform([sent1,sent2])
# summarize vocab
print(sorted(vectorizer_TFIDF.vocabulary_))

['acuerdo', 'desarrollo', 'gente', 'gobierno', 'joven', 'llegar', 'mejor', 'miente', 'paz', 'posibilidad', 'presidente', 'retomar', 'senda', 'social', 'votar']


In [60]:
print(vectorizer_TFIDF.idf_)

[1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.         1.40546511
 1.40546511 1.40546511 1.40546511]


In [61]:
tf_idf_df = pd.DataFrame(TFIDF_model.toarray(), columns=vectorizer_TFIDF.get_feature_names_out())
tf_idf_df.head()

|   | acuerdo | desarrollo | gente | gobierno | joven | llegar | mejor | miente | paz | posibilidad | presidente | retomar | senda | social | votar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.324336 | 0.0 | 0.324336 | 0.324336 | 0.0 | 0.324336 | 0.0 | 0.324336 | 0.324336 | 0.230768 | 0.324336 | 0.324336 | 0.324336 | 0.0 |
| 1 | 0.42616 | 0.0 | 0.42616 | 0.0 | 0.0 | 0.42616 | 0.0 | 0.42616 | 0.0 | 0.0 | 0.303216 | 0.0 | 0.0 | 0.0 | 0.42616 |
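
The scores printed above can be reproduced by hand, which helps to see where they come from. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF and L2 normalization of each row); `smoothed_idf` and `l2_normalize` are hypothetical helper names used only for this illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF (scikit-learn default): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(values):
    # Scale the vector so that its Euclidean length is 1.
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

n_docs = 2
idf_rare = smoothed_idf(n_docs, 1)    # word found in only one sentence
idf_shared = smoothed_idf(n_docs, 2)  # "presidente", found in both sentences

# After stop-word removal, sentence 1 keeps 9 words unique to it plus
# "presidente"; every word occurs once, so its term frequency is 1.
row1 = l2_normalize([1 * idf_rare] * 9 + [1 * idf_shared])
# Sentence 2 keeps 5 words unique to it plus "presidente".
row2 = l2_normalize([1 * idf_rare] * 5 + [1 * idf_shared])

print(idf_rare, idf_shared)  # compare with vectorizer_TFIDF.idf_
print(row1[0], row1[-1])     # unique word vs. "presidente" in sentence 1
print(row2[0], row2[-1])     # unique word vs. "presidente" in sentence 2
```

Because "presidente" appears in both sentences, its IDF is the lowest possible value, so it ends up with a smaller weight than the words that are unique to one sentence.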


Bag of Words just creates a set of vectors containing the count of word occurrences in each document, while the TF-IDF model also encodes how important each word is relative to the rest of the corpus.
Bag of Words vectors are easy to interpret. However, TF-IDF often performs better as input to machine learning models.

#Word Embeddings

Word embeddings, on the other hand, are n-dimensional vector representations of words in which words with similar meanings have similar representations. These representations can also help in identifying synonyms, antonyms, and various other relationships between words.
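
"Similar representation" is usually measured with cosine similarity between vectors. The sketch below uses hypothetical 3-dimensional vectors invented purely for illustration; real embeddings are learned from data and typically have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (made-up numbers, not from a trained model)
toy_embeddings = {
    "rey":     [0.90, 0.80, 0.10],
    "reina":   [0.85, 0.75, 0.20],
    "manzana": [0.10, 0.20, 0.90],
}

sim_related = cosine_similarity(toy_embeddings["rey"], toy_embeddings["reina"])
sim_unrelated = cosine_similarity(toy_embeddings["rey"], toy_embeddings["manzana"])

print(sim_related > sim_unrelated)  # True for these toy vectors
```

With trained embeddings, the same comparison is what lets a model treat related words as close neighbours even when they never share a position in a Bag-of-Words vector.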

* Syntax -> Bag-of-Words and TF-IDF (term frequency-inverse document frequency) based methods
* Semantics -> Word embeddings

In the next notebook, you will see how to use Word2Vec, FastText, and GloVe, along with a comparison against the TF-IDF method.

#Sources

* https://machinelearningmastery.com/gentle-introduction-bag-words-model/
* https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
* https://www.kaggle.com/vipulgandhi/bag-of-words-model-for-beginners
* https://medium.com/@dcameronsteinke/tf-idf-vs-word-embedding-a-comparison-and-code-tutorial-5ba341379ab0#:~:text=There%20are%20a%20couple%20of,compared%20to%20the%20embedding%20method.

