### Bag of Words Model

- A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
- It is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
- A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
    - A vocabulary of known words.
    - A measure of the presence of known words.
- It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
- This approach looks at the histogram of the words within the text, i.e. it treats each word count as a feature.
- The intuition is that documents are similar if they have similar content. Further, from the content alone we can learn something about the meaning of the document.

#### Steps:
- Collect Data
    - This is as simple as gathering a collection of documents.


- Design the vocabulary
    - The task is to find the unique words in the collected data (punctuation is stripped and case is ignored).
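
The vocabulary step can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and ASCII punctuation only; the helper names `tokenize` and `build_vocabulary` and the toy corpus are hypothetical, not part of any library.

```python
import string

def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocabulary(documents):
    """Return the sorted list of unique tokens found across all documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log!",
]
vocabulary = build_vocabulary(corpus)
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
```

Sorting the vocabulary is a design choice here: it gives every word a stable position, which is needed when documents are later turned into vectors.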


- Create Document Vectors
    - The next step is to score words in each document.
    - The objective is to turn each document of free text into a vector with one position per vocabulary word, where each position holds that word's score.
    - The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.
    - All ordering of the words is discarded, and we have a consistent way of extracting features from any document in our corpus, ready for use in modeling.
    - New documents that overlap with the vocabulary of known words, but may contain words outside of the vocabulary, can still be encoded: only the occurrences of known words are scored, and unknown words are ignored.
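
As a rough sketch of binary document vectors, the snippet below encodes a new document against a fixed vocabulary. The vocabulary, the example sentence, and the helper names `tokenize` and `binary_vector` are assumptions made for illustration.

```python
import string

def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def binary_vector(document, vocabulary):
    """Mark each vocabulary word as 1 (present) or 0 (absent) in the document."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical vocabulary learned from an earlier corpus.
vocabulary = ["cat", "dog", "log", "mat", "on", "sat", "the"]

# "chased" is not in the vocabulary, so it is silently ignored.
new_document = "The cat chased the dog."
vector = binary_vector(new_document, vocabulary)
print(vector)  # [1, 1, 0, 0, 0, 0, 1]
```

Note that "the" appears twice in the new document but still scores 1, since binary scoring records presence only.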



- Managing Vocabulary
    - As the vocabulary size increases, so does the length of the vector representation of documents.
    - You can imagine that for a very large corpus, such as thousands of books, the length of the vector might be thousands or millions of positions. Further, each document may contain very few of the known words in the vocabulary.
    - This results in a vector with lots of zero scores, called a sparse vector or sparse representation.
    - These long, mostly zero vectors demand more memory and computational resources when modeling, and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.
    - As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model.
    - Simple steps for cleaning the text: 
        - Ignoring case
        - Ignoring punctuation
        - Ignoring frequent words that carry little information, called stop words, such as "is" and "are".
        - Fixing misspelled words
        - Reducing words to their stem, called stemming, e.g. "play" from "playing".
    - A more sophisticated approach is to create a vocabulary of grouped words. This both changes the scope of the vocabulary and allows the bag-of-words to capture a little bit more meaning from the document.
    - In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible bigrams.
    - A simple bigram approach often works better than a 1-gram (unigram) bag-of-words model for tasks like document classification.
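
The cleaning steps and the bigram idea above can be sketched roughly as below. The stop-word set is a tiny illustrative list, not a standard one, and `tokenize`, `remove_stop_words`, and `ngrams` are hypothetical helper names. Spelling correction and stemming are left out because they usually rely on an external library.

```python
import string

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"is", "are", "the", "a", "an"}

def tokenize(text):
    """Ignore case and ASCII punctuation, then split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop frequent words that carry little information."""
    return [token for token in tokens if token not in stop_words]

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The weather is nice, the birds are singing.")
content_words = remove_stop_words(tokens)
bigrams = ngrams(tokens, n=2)

print(content_words)  # ['weather', 'nice', 'birds', 'singing']
print(bigrams[:3])    # ['the weather', 'weather is', 'is nice']

# Only bigrams that actually occur in the text enter the vocabulary.
bigram_vocabulary = sorted(set(bigrams))
```

In this sketch the bigrams are built from the unfiltered tokens. Removing stop words first would pair words that were not adjacent in the original text, which may or may not be desirable for a given task.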
     
 
- Scoring Words
    - Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored.
    - The simplest approach is a binary scoring of the presence or absence of words.
    - Other scoring methods include:
        - Counts - count the number of times each word appears in a document.
        - Frequencies - calculate how often each word appears in a document as a fraction of all the words in the document.
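
A minimal sketch of count and frequency scoring, assuming the document is already tokenized. The vocabulary, the sample tokens, and the function names `count_vector` and `frequency_vector` are made up for illustration.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Number of times each vocabulary word appears in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def frequency_vector(tokens, vocabulary):
    """Count of each vocabulary word divided by the document's total word count."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in vocabulary]

# Hypothetical vocabulary and pre-tokenized document.
vocabulary = ["cat", "mat", "on", "sat", "the"]
tokens = "the cat sat on the mat".split()

counts = count_vector(tokens, vocabulary)
freqs = frequency_vector(tokens, vocabulary)
print(counts)  # [1, 1, 1, 1, 2]
```

Here the frequencies divide by all words in the document, including any that fall outside the vocabulary, so they sum to 1 only when every word is known.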


- Word Hashing
    - We can use a hash representation of known words in our vocabulary. This addresses the problem of having a very large vocabulary for a large text corpus because we can choose the size of the hash space, which is in turn the size of the vector representation of the document.
    - Words are hashed deterministically to the same integer index in the target hash space. A binary score or count can then be used to score the word. This is called the “hash trick” or “feature hashing”.
    - The challenge is to choose a hash space large enough to accommodate the chosen vocabulary size, minimizing the probability of collisions while trading off against sparsity.
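
One possible sketch of the hash trick is shown below. It assumes MD5 from the standard library as the hash function, chosen only because it is deterministic across runs (Python's built-in `hash()` for strings is randomized per process by default). The bucket count of 16 and the names `hash_index` and `hashed_count_vector` are illustrative assumptions.

```python
import hashlib

def hash_index(token, n_buckets):
    """Map a token deterministically to an index in range(n_buckets)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashed_count_vector(tokens, n_buckets=16):
    """Count tokens into a fixed-size vector; no vocabulary is stored."""
    vector = [0] * n_buckets
    for token in tokens:
        # Different words may collide and share a bucket.
        vector[hash_index(token, n_buckets)] += 1
    return vector

tokens = "the cat sat on the mat".split()
vector = hashed_count_vector(tokens, n_buckets=16)

print(len(vector))  # 16, fixed by the chosen hash space
print(sum(vector))  # 6, every token lands in some bucket
```

A larger `n_buckets` lowers the chance of collisions but makes the vectors longer and sparser, which is the trade-off described above.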


- TF-IDF
    - A problem with scoring word frequency is that the words that appear most often start to dominate the document's representation, even though they may carry less information for the model than rarer but perhaps more document-specific words.
    - Term Frequency - a score of how frequently the word appears in the current document.
    - Inverse Document Frequency - a score of how rare the word is across documents.
    - The scores act as a weighting, since not all words are equally important or interesting.
    - The scores have the effect of highlighting words that are distinct (contain useful information) in a given document.
    - Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
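
A small sketch of TF-IDF scoring follows. It assumes one common formulation, tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed variants, so treat the exact formula, the toy corpus, and the function name `tf_idf` as assumptions.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical pre-tokenized corpus.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)

print(scores[0]["the"])                      # 0.0, "the" occurs in every document
print(scores[0]["mat"] > scores[0]["cat"])   # True, "mat" is rarer than "cat"
```

Although "the" is the most frequent word in the first document, its score is zero because it appears in every document, while the rarer "mat" receives the highest weight.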
    

#### Limitations of the Bag-of-Words Model

- Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
- Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.