### Bag of Words: Implementation

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
documents = ["I love programming.", "Programming is fun."]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert the result to an array and print
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
print("BoW Matrix:\n", X.toarray())  # Display the document-term matrix

Vocabulary: ['fun' 'is' 'love' 'programming']
BoW Matrix:
 [[0 0 1 1]
 [1 1 0 1]]


Explanation:
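* The CountVectorizer converts the text data into a matrix where each column represents a word from the vocabulary. The columns follow the order returned by get_feature_names_out().
* The rows represent the individual documents, and the values are the counts of each word in those documents.
* With the default settings, the text is lowercased, punctuation is ignored, and single-character tokens are dropped, which is why "I" does not appear in the vocabulary.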
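
To make the counting step concrete, here is a minimal from-scratch sketch that uses only the standard library. It is not how scikit-learn is implemented internally; the `tokenize` helper and the regular expression are assumptions chosen to imitate the default behaviour of CountVectorizer (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

documents = ["I love programming.", "Programming is fun."]

# Assumption: imitate CountVectorizer's defaults by lowercasing the text and
# keeping only tokens of two or more word characters (so "I" is dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Hypothetical helper: split a document into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Vocabulary: the sorted set of unique tokens across all documents
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# One row per document, one column per vocabulary word
bow_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[word] for word in vocabulary])

print("Vocabulary:", vocabulary)
print("BoW Matrix:", bow_matrix)
```

For this corpus the sketch should yield the same vocabulary and counts as the CountVectorizer output above.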

### TF-IDF: Implementation
Term Frequency-Inverse Document Frequency (TF-IDF) weights each term by how often it occurs in a document and down-weights terms that occur in many documents, so terms that are distinctive for a document stand out.


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
documents = ["I love programming.", "Programming is fun."]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert the result to an array and print
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
print("TF-IDF Matrix:\n", X.toarray())  # Display the TF-IDF matrix

Vocabulary: ['fun' 'is' 'love' 'programming']
TF-IDF Matrix:
 [[0.         0.         0.81480247 0.57973867]
 [0.6316672  0.6316672  0.         0.44943642]]


Explanation:
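* The TfidfVectorizer computes a TF-IDF score for each word in each document. The higher the score, the more distinctive the word is for that document relative to the others.
* "programming" occurs in both documents, so in each row it scores lower than the words that occur in only one document ("love", "is", "fun").
* By default each row is scaled to unit Euclidean length, which is why the values are fractions rather than raw counts.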
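
The scores can also be worked out by hand. The sketch below assumes the default TfidfVectorizer settings: raw counts as term frequency, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tokenize` helper is a hypothetical stand-in for the default tokenizer, not part of scikit-learn.

```python
import math
import re
from collections import Counter

documents = ["I love programming.", "Programming is fun."]


def tokenize(text):
    """Hypothetical helper: lowercase, keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(documents)

# Document frequency: in how many documents does each word occur?
df = Counter(tok for toks in tokenized for tok in set(toks))

# Assumed smoothed idf formula (the scikit-learn default, smooth_idf=True)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

tfidf_matrix = []
for toks in tokenized:
    counts = Counter(toks)
    row = [counts[w] * idf[w] for w in vocabulary]   # tf * idf
    norm = math.sqrt(sum(v * v for v in row))        # L2 norm of the row
    tfidf_matrix.append([v / norm for v in row])

for row in tfidf_matrix:
    print([round(v, 4) for v in row])
```

Rounded to four decimals, the rows should agree with the TF-IDF matrix printed above.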

### When should you use Bag of Words instead of TF-IDF?

Use Bag of Words when raw word frequency is what matters and you do not need to discount words that are common across the corpus. Bag of Words records only how often each word occurs in a document, while TF-IDF also weights each word by how rare it is in the corpus as a whole.

### Challenge 1: Implement Bag of Words

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
documents = ["Machine learning is fun.", "Deep learning is a subset of machine learning."]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Display the vocabulary and the BoW matrix
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
# BoW Matrix:
print(X.toarray())  # Display the document-term matrix

Vocabulary: ['deep' 'fun' 'is' 'learning' 'machine' 'of' 'subset']
[[0 1 1 1 1 0 0]
 [1 0 1 2 1 1 1]]


Hint:
* Think about how CountVectorizer works. It tokenizes the text into words and counts the occurrences of each word in each document.
* The get_feature_names_out() method returns the words in your vocabulary, in column order.
* fit_transform() returns a sparse matrix; toarray() converts it to a dense array so you can read the count of each word in each document.
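
One way to read the matrix is to pair each column with its vocabulary word. The short sketch below copies the vocabulary and counts from the output above instead of recomputing them; the variable name `labeled` is only illustrative.

```python
# Values copied from the Challenge 1 output above
vocabulary = ["deep", "fun", "is", "learning", "machine", "of", "subset"]
bow_matrix = [
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 2, 1, 1, 1],
]

labeled = []
for i, row in enumerate(bow_matrix):
    # Keep only the words that actually occur in this document
    counts = {word: n for word, n in zip(vocabulary, row) if n > 0}
    labeled.append(counts)
    print(f"Document {i}: {counts}")
```

Read this way, the second document contains "learning" twice, and the single-character word "a" has no column at all because the default tokenizer drops it.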

### Challenge 2: Implement TF-IDF

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
documents = ["Machine learning is fun.", "Deep learning is a subset of machine learning."]
# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Display the vocabulary and the TF-IDF matrix
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
# TF-IDF Matrix:
print(X.toarray())  # Display the TF-IDF matrix

Vocabulary: ['deep' 'fun' 'is' 'learning' 'machine' 'of' 'subset']
[[0.         0.63009934 0.44832087 0.44832087 0.44832087 0.
  0.        ]
 [0.40697968 0.         0.2895694  0.57913879 0.2895694  0.40697968
  0.40697968]]


Hint: 

* The TfidfVectorizer works like CountVectorizer, but it rescales each count by how rare the word is across documents and then normalizes each row.
* Compare the TF-IDF values with the word counts from BoW. In the second document, "deep" and "is" each occur once, yet "deep" scores 0.407 and "is" scores 0.290, because "deep" occurs in only one document while "is" occurs in both.
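
To see the reweighting directly, divide each TF-IDF value by the matching count. The sketch below does this for the second document, using numbers copied from the two outputs above; `per_occurrence` is an illustrative name, not a scikit-learn attribute.

```python
# Second document: counts from Challenge 1, TF-IDF values from Challenge 2
vocabulary = ["deep", "fun", "is", "learning", "machine", "of", "subset"]
bow_row = [1, 0, 1, 2, 1, 1, 1]
tfidf_row = [0.40697968, 0.0, 0.2895694, 0.57913879,
             0.2895694, 0.40697968, 0.40697968]

# Weight contributed by a single occurrence of each word in this document
per_occurrence = {
    word: tfidf / count
    for word, count, tfidf in zip(vocabulary, bow_row, tfidf_row)
    if count > 0
}

for word, weight in per_occurrence.items():
    print(f"{word:10s} {weight:.4f}")

# Words unique to this document versus words shared by both documents
ratio = per_occurrence["deep"] / per_occurrence["is"]
print(f"rare / common weight ratio: {ratio:.4f}")
```

Words found only in this document ("deep", "of", "subset") carry roughly 1.4 times the per-occurrence weight of the words shared by both documents ("is", "learning", "machine").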

### Challenge 3: Comparing Bag of Words and TF-IDF

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample data
documents = ["Data science is exciting!", "Data science requires programming knowledge.", "Programming is essential for data science."]

# Initialize the CountVectorizer and TfidfVectorizer
vectorizer_bow = CountVectorizer()
vectorizer_tfidf = TfidfVectorizer()

# Fit and transform the documents using both vectorizers
X_bow = vectorizer_bow.fit_transform(documents)
X_tfidf = vectorizer_tfidf.fit_transform(documents)

# Display the results
# BoW Matrix:
print(X_bow.toarray())
# TF-IDF Matrix:
print(X_tfidf.toarray())

[[1 0 1 0 1 0 0 0 1]
 [1 0 0 0 0 1 1 1 1]
 [1 1 0 1 1 0 1 0 1]]
[[0.39148397 0.         0.66283998 0.         0.50410689 0.
  0.         0.         0.39148397]
 [0.32630952 0.         0.         0.         0.         0.55249005
  0.42018292 0.55249005 0.32630952]
 [0.30083189 0.50935267 0.         0.50935267 0.38737583 0.
  0.38737583 0.         0.30083189]]


* For BoW, each value is the number of times a word occurs in that document. A word that occurs in every document, such as "data", simply has a 1 in every row; its count does not grow because it also occurs elsewhere.
* For TF-IDF, words such as "data" and "science", which occur in all three documents, get the lowest non-zero weight in every row.
* Compare the two matrices to see how TF-IDF rescales the counts to reflect how distinctive each word is for each document.
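
The Challenge 3 output does not print the vocabulary, so it helps to list each word next to its document frequency. The sketch below uses only the standard library; the tokenization regex and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` are assumptions that imitate the scikit-learn defaults.

```python
import math
import re
from collections import Counter

documents = [
    "Data science is exciting!",
    "Data science requires programming knowledge.",
    "Programming is essential for data science.",
]

# Assumption: lowercase and keep tokens of 2+ word characters, one set per document
tokenized = [set(re.findall(r"\b\w\w+\b", doc.lower())) for doc in documents]
n_docs = len(documents)

# Document frequency: number of documents that contain each word
df = Counter(tok for toks in tokenized for tok in toks)

# Assumed smoothed idf; words found in every document get the minimum value
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in df}

# Sorted order matches the column order of the matrices above
for word in sorted(df):
    print(f"{word:12s} df={df[word]}  idf={idf[word]:.4f}")
```

Here "data" and "science" have a document frequency of 3 and therefore the smallest idf, which is why their columns hold the smallest non-zero TF-IDF values.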