# Bag of Words (BoW) Theory
Definition: Bag of Words (BoW) is a simple and widely used text representation technique in Natural Language Processing (NLP). It turns text into numbers: each document becomes a vector of word frequencies that ignores grammar and word order and records only how often each word occurs.

# Key Concepts:
Corpus: A collection of text documents.

Vocabulary: The set of unique words across all documents in the corpus.

Word Frequency: The number of times each word appears in a document.

# How BoW Works:
Tokenization: Split the text into individual words (or tokens).

Building the Vocabulary: Construct a list of all unique words from the entire corpus. This list forms the "vocabulary."

Encoding the Documents: For each document in the corpus, create a vector. The length of this vector is equal to the number of unique words (size of the vocabulary).

Each position in the vector corresponds to a word from the vocabulary.

The value in each position represents the frequency of the word in that document (i.e., how many times the word appears in the document).

Example:
    Let’s say we have two documents in the corpus:

    Document 1: "I love NLP."
    Document 2: "NLP is awesome."

    Step 1: Tokenization
    Document 1: ['I', 'love', 'NLP']
    Document 2: ['NLP', 'is', 'awesome']

    Step 2: Build the Vocabulary
    Vocabulary: ['I', 'love', 'NLP', 'is', 'awesome']

    Step 3: Create Vectors
    For Document 1:
    [1, 1, 1, 0, 0] (1 occurrence of 'I', 1 occurrence of 'love', 1 occurrence of 'NLP', 0 occurrences of 'is', 0 occurrences of 'awesome')

    For Document 2:
    [0, 0, 1, 1, 1] (0 occurrences of 'I', 0 occurrences of 'love', 1 occurrence of 'NLP', 1 occurrence of 'is', 1 occurrence of 'awesome')
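
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the helper names `tokenize`, `build_vocabulary`, and `encode` are made up for this example, tokens keep their original case, and the vocabulary is kept in order of first appearance so that it matches the example.

```python
from collections import Counter

PUNCTUATION = ".,!?"

def tokenize(text):
    # Step 1: split on whitespace and strip surrounding punctuation (case is kept).
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]

def build_vocabulary(tokenized_docs):
    # Step 2: collect unique words in order of first appearance.
    vocab, seen = [], set()
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    return vocab

def encode(tokens, vocab):
    # Step 3: one count per vocabulary word, in vocabulary order.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

docs = ["I love NLP.", "NLP is awesome."]
tokenized = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokenized)
vectors = [encode(tokens, vocab) for tokens in tokenized]

print(vocab)    # ['I', 'love', 'NLP', 'is', 'awesome']
print(vectors)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```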

# Advantages of BoW:
Simplicity: Easy to understand and implement.

Efficiency: Vectors are cheap to compute, and the representation works well for many text classification tasks, such as spam detection and sentiment analysis.

Interpretability: The representation is interpretable as it directly reflects word frequencies.

# Limitations of BoW:

Loss of Context: BoW ignores the order of words. For instance, "I love NLP" and "NLP love I" produce exactly the same vector.
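
A quick check of this claim, as a minimal sketch with a hand-picked vocabulary (the helper `bow_vector` is hypothetical and only splits on whitespace):

```python
from collections import Counter

def bow_vector(text, vocab):
    # Count whitespace-separated tokens and read the counts off in vocabulary order.
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vocab = ["I", "love", "NLP"]
a = bow_vector("I love NLP", vocab)
b = bow_vector("NLP love I", vocab)

print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True
```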

High Dimensionality: If the vocabulary is large, the vectors become large and sparse (most of the values in the vector are 0), making computation expensive.
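
To make the sparsity concrete, the sketch below builds a count matrix for four short made-up documents and counts the zero entries. The documents are illustrative only; real corpora have far larger vocabularies and a far higher share of zeros.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply today",
    "the weather is sunny and warm",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for tokens in tokenized for tok in tokens})

# Dense document-term matrix: one row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

total = len(matrix) * len(vocab)
zeros = sum(value == 0 for row in matrix for value in row)

print(f"vocabulary size: {len(vocab)}")               # vocabulary size: 20
print(f"{zeros} of {total} entries are zero")         # 58 of 80 entries are zero
```

This is why `CountVectorizer.fit_transform` returns a SciPy sparse matrix rather than a dense array: only the non-zero counts are stored, and `.toarray()` is needed to see the full vector.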

Semantic Information: BoW does not capture the meaning of words. Each word gets its own independent dimension, so words like "good" and "great" are treated as completely unrelated, even though they are semantically similar.
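
One way to see this is to compare cosine similarities, as in the sketch below. The sentences and the small vocabulary are made up for illustration. Under BoW, a sentence with "good" is exactly as similar to one with "great" as it is to one with "bad".

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two equal-length count vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vocab = ["the", "movie", "was", "good", "great", "bad"]

def encode(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

good = encode("the movie was good")
great = encode("the movie was great")
bad = encode("the movie was bad")

sim_great = cosine(good, great)
sim_bad = cosine(good, bad)
print(sim_great, sim_bad)  # 0.75 0.75
```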

# Improving BoW:
N-grams: Instead of counting only single words (unigrams), also count word pairs (bigrams) or triples (trigrams) to capture some word order.
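
A minimal sketch of bigram extraction (the helper `ngrams` is hypothetical). Unlike single-word counts, the bigrams of "I love NLP" and "NLP love I" share no features:

```python
def ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window into one feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = ngrams("I love NLP".split(), 2)
b = ngrams("NLP love I".split(), 2)

print(a)  # ['I love', 'love NLP']
print(b)  # ['NLP love', 'love I']
```

In scikit-learn, the `ngram_range` parameter of `CountVectorizer` controls this; the default `(1, 1)` counts unigrams only.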

TF-IDF (Term Frequency-Inverse Document Frequency): A more informative variation of BoW that accounts for the importance of words by downscaling the impact of frequent words that appear in many documents.
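
The weighting can be sketched as follows, assuming the textbook form: raw counts for term frequency and `idf = log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the word. Libraries such as scikit-learn use smoothed variants, so their numbers will differ. The three documents are made up for illustration.

```python
import math
from collections import Counter

docs = ["I love NLP", "NLP is awesome", "I love pizza"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({tok for tokens in tokenized for tok in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

# Inverse document frequency: words found in fewer documents get a larger weight.
idf = {word: math.log(n_docs / df[word]) for word in vocab}

# TF-IDF per document, keeping only the words that actually occur in it.
tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    tfidf.append({word: counts[word] * idf[word] for word in counts})

for doc, weights in zip(docs, tfidf):
    print(doc, {word: round(w, 3) for word, w in weights.items()})
```

In the second document, "awesome" and "is" occur in one document each and receive a higher weight than "nlp", which occurs in two.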

# Use Cases:
Text Classification: Spam detection, sentiment analysis, topic classification.

Information Retrieval: Document ranking based on relevance to a query.
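
As a rough illustration of the retrieval use case, the sketch below ranks three made-up documents against a query by the cosine similarity of their BoW counts. The helpers `bow` and `cosine` are hypothetical, and a real system would typically add TF-IDF weighting and an index.

```python
import math
from collections import Counter

def bow(text):
    # Lowercase, split on whitespace, and count the tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two Counters (missing words count as 0).
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "nlp is a subfield of artificial intelligence",
    "pizza is a popular italian dish",
    "many challenges in nlp involve understanding language",
]
query = bow("challenges in nlp")

scores = [cosine(query, bow(doc)) for doc in docs]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

print(ranked)  # [2, 0, 1]
```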

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 
concerned with the interactions between computers and human language. As such, NLP is related to the area of 
human-computer interaction. Many challenges in NLP involve understanding natural language to derive meaning 
and information from it."""


# Apply Bag of Words encoding
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([text])

# Display BoW result (word frequency vector)
print(X.toarray())
print(vectorizer.get_feature_names_out())


[[3 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 3 1 1 1 2 3 2 1 1 1 1 1 2 2 1 1]]
['and' 'area' 'artificial' 'as' 'between' 'challenges' 'computer'
 'computers' 'concerned' 'derive' 'from' 'human' 'in' 'information'
 'intelligence' 'interaction' 'interactions' 'involve' 'is' 'it'
 'language' 'linguistics' 'many' 'meaning' 'natural' 'nlp' 'of'
 'processing' 'related' 'science' 'subfield' 'such' 'the' 'to'
 'understanding' 'with']


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 
concerned with the interactions between computers and human language. As such, NLP is related to the area of 
human-computer interaction. Many challenges in NLP involve understanding natural language to derive meaning 
and information from it."""

# Fit the vectorizer on the single document and build the document-term matrix
vectorizer = CountVectorizer()
x = vectorizer.fit_transform([text])



In [4]:

x.toarray()

array([[3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1,
        1, 1, 2, 3, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1]], dtype=int64)

In [6]:
vectorizer.get_feature_names_out()

array(['and', 'area', 'artificial', 'as', 'between', 'challenges',
       'computer', 'computers', 'concerned', 'derive', 'from', 'human',
       'in', 'information', 'intelligence', 'interaction', 'interactions',
       'involve', 'is', 'it', 'language', 'linguistics', 'many',
       'meaning', 'natural', 'nlp', 'of', 'processing', 'related',
       'science', 'subfield', 'such', 'the', 'to', 'understanding',
       'with'], dtype=object)

In [7]:
vectorizer.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}
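
Two of these defaults explain the vocabulary shown earlier: `lowercase=True` turns "NLP" into "nlp", and `token_pattern` only matches tokens of two or more word characters, so the article "a" never enters the vocabulary. The sketch below approximates that behavior with the standard `re` module on a shortened sample sentence. It is an illustration of the defaults, not the library's own code.

```python
import re
from collections import Counter

# Same pattern as the 'token_pattern' default shown in get_params().
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample = ("NLP is a subfield of linguistics. As such, NLP is related "
          "to human-computer interaction.")

# Lowercase first (lowercase=True), then keep tokens of 2+ word characters.
tokens = re.findall(TOKEN_PATTERN, sample.lower())
counts = Counter(tokens)

print(tokens)
print("a" in counts)   # False: single-character tokens are dropped
print(counts["nlp"])   # 2
```

The hyphenated "human-computer" is split into "human" and "computer", because the hyphen is not a word character.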

In [8]:
# Returns None because stop_words=None, so this cell prints no output
vectorizer.get_stop_words()

In [9]:
vectorizer.__getstate__()

{'input': 'content',
 'encoding': 'utf-8',
 'decode_error': 'strict',
 'strip_accents': None,
 'preprocessor': None,
 'tokenizer': None,
 'analyzer': 'word',
 'lowercase': True,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'stop_words': None,
 'max_df': 1.0,
 'min_df': 1,
 'max_features': None,
 'ngram_range': (1, 1),
 'vocabulary': None,
 'binary': False,
 'dtype': numpy.int64,
 'fixed_vocabulary_': False,
 '_stop_words_id': 140721788710088,
 'stop_words_': set(),
 'vocabulary_': {'natural': 24,
  'language': 20,
  'processing': 27,
  'nlp': 25,
  'is': 18,
  'subfield': 30,
  'of': 26,
  'linguistics': 21,
  'computer': 6,
  'science': 29,
  'and': 0,
  'artificial': 2,
  'intelligence': 14,
  'concerned': 8,
  'with': 35,
  'the': 32,
  'interactions': 16,
  'between': 4,
  'computers': 7,
  'human': 11,
  'as': 3,
  'such': 31,
  'related': 28,
  'to': 33,
  'area': 1,
  'interaction': 15,
  'many': 22,
  'challenges': 5,
  'in': 12,
  'involve': 17,
  'understanding': 34,
  'derive': 9,
  'meaning': 23,
  'information': 13,
  'from': 10,
  'it': 19}}

In [11]:
vectorizer.inverse_transform(x)

[array(['natural', 'language', 'processing', 'nlp', 'is', 'subfield', 'of',
        'linguistics', 'computer', 'science', 'and', 'artificial',
        'intelligence', 'concerned', 'with', 'the', 'interactions',
        'between', 'computers', 'human', 'as', 'such', 'related', 'to',
        'area', 'interaction', 'many', 'challenges', 'in', 'involve',
        'understanding', 'derive', 'meaning', 'information', 'from', 'it'],
       dtype='<U13')]