# Bag of Words

_____________

### Introduction to Text Vectorization

In **Natural Language Processing (NLP)**, **text vectorization** is the process of converting text into a numerical format that machine learning models can work with. Most machine learning algorithms operate on numbers rather than raw strings, so text must first be transformed into a **numerical representation** (vectors) that records which words occur, how often, and, depending on the method, how important they are.

Once text is encoded as numbers, algorithms can process it, analyze it, and learn patterns from it.

### Why Do We Use Text Vectorization?

1. **Machine Learning Compatibility**: Most machine learning models accept only numerical input. Vectorization converts text into a format these models can consume.

2. **Capturing Features of Text**: Depending on the technique, vectorization can capture features such as the frequency of words, the importance of words in a particular document, and the relationships between different words.

3. **Improving Model Performance**: Vectorization techniques like **TF-IDF** (Term Frequency-Inverse Document Frequency) down-weight words that are common across all documents, which helps the model focus on distinctive features rather than uninformative ones and often improves its performance.

### What Does Vectorizing Text Mean?

Text vectorization transforms text into a **vector** (a list or array of numbers) that can be fed to machine learning models. The vector encodes information about the text, such as the presence or frequency of words. The key idea is to represent the **content** of the text in a fixed, structured form that a machine can analyze. Simple count-based vectors record which words occur, not what they mean.
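As a minimal sketch of how a sentence becomes a vector, the snippet below counts word occurrences against a small, fixed vocabulary. The vocabulary, the example sentence, and the helper name `vectorize` are illustrative assumptions and not part of any library.

```python
from collections import Counter

def vectorize(text, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and sentence, chosen only for illustration.
vocabulary = ["cat", "dog", "sat", "the"]
vector = vectorize("The cat sat on the mat", vocabulary)
print(vector)  # [1, 0, 1, 2]
```

Each position in the vector corresponds to one vocabulary word, so every text is mapped to a vector of the same length regardless of how long the text is. Words outside the vocabulary ("on", "mat") are simply not counted.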

### Common Methods of Text Vectorization

1. **The Bag of Words (BoW) Model**:
   - **BoW** is one of the simplest methods for text vectorization.
   - In this model, we count how many times each word appears in a document. BoW ignores the **order** of words and their **context**.
   - **Limitation**: It cannot capture relationships between words (such as synonyms or word context), and it produces sparse, high-dimensional vectors when the vocabulary contains many unique words.

2. **TF-IDF (Term Frequency-Inverse Document Frequency)**:
   - **TF-IDF** refines raw word counts by weighting them.
   - It assigns **weights** to words based on their importance in a particular document. The key idea is to give higher weight to words that are frequent in a specific document but uncommon across the corpus as a whole.
   - **Term Frequency (TF)** measures how often a word appears in a document.
   - **Inverse Document Frequency (IDF)** measures how rare a word is across all documents: the fewer documents contain the word, the higher its IDF.
   - **Importance**: The final weight is the product of TF and IDF, so words that appear in nearly every document (such as "the") receive weights close to zero.
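The TF and IDF definitions above can be sketched in a few lines of plain Python. This is the textbook variant, with TF as a relative frequency and IDF as `log(N / df)`; the toy corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions made for illustration.

```python
import math

# Hypothetical toy corpus, already lowercased and split on whitespace.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Inverse document frequency: log(N / number of documents containing `term`).

    Assumes `term` occurs in at least one document of the corpus.
    """
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

def tf_idf(term, tokens):
    """Weight of `term` in one document: TF multiplied by IDF."""
    return tf(term, tokens) * idf(term)

first = tokenized[0]
print(tf_idf("the", first))  # 0.0, because "the" occurs in every document
print(tf_idf("mat", first))  # positive, because "mat" occurs only in this document
```

Library implementations may use a different formula. For example, scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2-normalizes each document vector, so its numbers will not match this sketch exactly.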

In summary, **text vectorization** is the bridge that allows machine learning algorithms to work with text. Methods like **Bag of Words** and **TF-IDF** convert text into numerical vectors that retain useful information for analysis and decision-making in machine learning tasks.


_____________

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [3]:
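# Create a vectorizer with default settings (lowercasing, tokens of two or more characters)
countvec = CountVectorizer()

In [4]:
# Learn the vocabulary and build the sparse document-term count matrix
countvec_fit = countvec.fit_transform(data)

In [5]:
# Convert to a dense DataFrame: one row per sentence, one column per vocabulary word
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())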

In [6]:
print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 
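
Once fitted, a vectorizer can also encode text it has not seen before. The sketch below fits a `CountVectorizer` on a shortened corpus (two of the sentences used above) and transforms a new, made-up sentence. The new sentence and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened corpus for illustration (two of the sentences used above).
corpus = [
    "carol drank the blood as if she were a vampire",
    "the sign said there was road work ahead so he decided to speed up",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# The default tokenizer keeps only tokens of two or more characters,
# so the single-letter word "a" never enters the vocabulary.
print("a" in vectorizer.vocabulary_)  # False

# Words outside the learned vocabulary ("dracula", "also") are silently ignored.
new_counts = vectorizer.transform(["dracula drank the blood, the vampire also drank"])
row = new_counts.toarray()[0]
for word in ("drank", "the", "vampire"):
    # vocabulary_ maps each word to its column index in the count matrix
    print(word, row[vectorizer.vocabulary_[word]])
```

This also explains why the table above has no column for "a": with default settings, single-character tokens are dropped before counting.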