# **Bag of Words**

In the following code:

1. We first import CountVectorizer from scikit-learn.
2. Then, we define a list of sample documents.
3. We create an instance of CountVectorizer.
4. The fit_transform method learns the vocabulary from the documents and converts them into a bag-of-words matrix, which is returned in sparse form.
5. We then get the feature names (the vocabulary words, lowercased and sorted alphabetically) and convert the matrix to a dense array. Each row of the array is a document and each column holds the number of times one word occurs in that document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good"
]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

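# Learn the vocabulary and transform the documents into a sparse count matrix
bag_of_words = vectorizer.fit_transform(documents)
# Printing a sparse matrix lists only the non-zero entries,
# each as (document index, feature index) followed by the count
print(bag_of_words)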

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Convert bag of words to an array
bag_of_words_array = bag_of_words.toarray()

# Display the feature names and the bag of words array
print("Feature names:", feature_names)
print("Bag of Words array:\n", bag_of_words_array)

  (0, 9)	1
  (0, 4)	1
  (0, 2)	1
  (0, 10)	1
  (0, 6)	1
  (0, 0)	1
  (0, 3)	1
  (1, 9)	1
  (1, 4)	1
  (1, 2)	2
  (1, 6)	1
  (1, 0)	1
  (1, 5)	1
  (1, 7)	1
  (2, 9)	1
  (2, 4)	1
  (2, 2)	1
  (2, 0)	1
  (2, 8)	1
  (2, 1)	1
Feature names: ['and' 'good' 'is' 'long' 'movie' 'not' 'scary' 'slow' 'spooky' 'this'
 'very']
Bag of Words array:
 [[1 0 1 1 1 0 1 0 0 1 1]
 [1 0 2 0 1 1 1 1 0 1 0]
 [1 1 1 0 1 0 0 0 1 1 0]]
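
To see where these counts come from, the same array can be rebuilt by hand. The sketch below is not how CountVectorizer is implemented; it only mimics its default behaviour as assumed here (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The names tokenize and manual_counts are made up for this illustration.

```python
import re
from collections import Counter

documents = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]

# The vocabulary is every distinct token, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word.
manual_counts = []
for tokens in tokenized:
    counts = Counter(tokens)  # a missing word counts as 0
    manual_counts.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in manual_counts:
    print(row)
```

Each printed row should line up with the corresponding row of the Bag of Words array above, for example the 2 in the second row for the word "is".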


# **TF-IDF**

In the following code:

1. We first import TfidfVectorizer from scikit-learn.
2. Then, we define a list of sample documents.
3. We create an instance of TfidfVectorizer.
4. The fit_transform method learns the vocabulary and document frequencies from the documents and converts them into a TF-IDF matrix, which is returned in sparse form.
5. We then get the feature names and the TF-IDF array, which shows the TF-IDF score of each word in each document. A word scores high when it occurs often in a document but in few other documents of the corpus, so a word shared by every document (such as "movie") scores lower than a distinctive one (such as "spooky").
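
Before the scikit-learn version, the arithmetic behind the score can be sketched by hand. This is an illustration only, assuming TfidfVectorizer's default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document vector scaled to unit L2 length. The names doc_freq, idf and manual_tfidf are made up for this sketch.

```python
import math
import re
from collections import Counter

documents = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency (assumed default behaviour).
idf = {
    word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
    for word in vocabulary
}

manual_tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Term count times idf for every vocabulary word.
    weights = [counts[word] * idf[word] for word in vocabulary]
    # Scale the document vector to unit Euclidean length.
    norm = math.sqrt(sum(w * w for w in weights))
    manual_tfidf.append([w / norm for w in weights])

for row in manual_tfidf:
    print([round(value, 8) for value in row])
```

Under these assumptions, words that occur in all three documents get an idf of exactly 1, while a word that occurs in only one document gets ln(2) + 1, which is why the rarer words end up with the larger scores.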

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good"
]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

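# Learn the vocabulary and document frequencies, then transform the documents
# into a sparse TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)
# Printing a sparse matrix lists only the non-zero entries,
# each as (document index, feature index) followed by the score
print(tfidf_matrix)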

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Convert TF-IDF matrix to an array
tfidf_array = tfidf_matrix.toarray()

# Display the feature names and the TF-IDF array
print("Feature names:", feature_names)
print("TF-IDF array:\n", tfidf_array)


  (0, 3)	0.5016513317715935
  (0, 0)	0.2962833577206743
  (0, 6)	0.3815187681027303
  (0, 10)	0.5016513317715935
  (0, 2)	0.2962833577206743
  (0, 4)	0.2962833577206743
  (0, 9)	0.2962833577206743
  (1, 7)	0.4463133444082536
  (1, 5)	0.4463133444082536
  (1, 0)	0.2635998509359665
  (1, 6)	0.3394328023512059
  (1, 2)	0.527199701871933
  (1, 4)	0.2635998509359665
  (1, 9)	0.2635998509359665
  (2, 1)	0.5427006131762078
  (2, 8)	0.5427006131762078
  (2, 0)	0.32052772458725637
  (2, 2)	0.32052772458725637
  (2, 4)	0.32052772458725637
  (2, 9)	0.32052772458725637
Feature names: ['and' 'good' 'is' 'long' 'movie' 'not' 'scary' 'slow' 'spooky' 'this'
 'very']
TF-IDF array:
 [[0.29628336 0.         0.29628336 0.50165133 0.29628336 0.
  0.38151877 0.         0.         0.29628336 0.50165133]
 [0.26359985 0.         0.5271997  0.         0.26359985 0.44631334
  0.3394328  0.44631334 0.         0.26359985 0.        ]
 [0.32052772 0.54270061 0.32052772 0.         0.32052772 0.
  0.         0.       