CountVectorizer is used for text preprocessing in natural language processing (NLP) and machine learning tasks. It is a common technique for converting text data into numerical features that can be used as input for machine learning algorithms. Here's why CountVectorizer is used:

Text to Numeric Conversion: Machine learning algorithms typically work with numerical data, and text data is inherently non-numeric. CountVectorizer converts text documents (corpus) into a matrix of token counts, where each row represents a document, and each column represents a unique word in the corpus. The values in the matrix represent the frequency of each word's occurrence in the corresponding document.

Feature Extraction: It helps in feature extraction from text data. Each unique word in the corpus becomes a feature, and the count of how many times each word appears in a document becomes the value of that feature. These features can then be used to train machine learning models.

Bag of Words (BoW) Representation: CountVectorizer implements the Bag of Words model, which treats each document as an unordered collection of words, ignoring grammar and word order. This simplifies the representation but can still capture important information about the text.

Preprocessing: CountVectorizer handles common text preprocessing tasks like tokenization (splitting text into words or tokens), lowercase conversion, and stop word removal. It can also be customized with various options for stemming, lemmatization, and more.

handling features that represent discrete data like word counts or frequencies. Multinomial Naive Bayes is particularly well-suited for text classification problems, where the data is often represented as a bag of words or term frequencies.

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Read the IMDB Movie Reviews CSV file
d=pd.read_csv("IMDB Dataset.csv",delimiter=",",nrows=2000)

In [39]:
# Split the data into train and test sets
split = 0.7
train = d[:int(split * (len(d)))]
test = d[int(1 - split * (len(d))):]

# Create a CountVectorizer object
v = CountVectorizer()

# Fit the CountVectorizer object to the train data
train_features = v.fit_transform(train["review"])

# Get the feature names
feature_names = v.get_feature_names_out()

# Convert the sparse matrix to a dense matrix
train_features_dense = train_features.todense()

# Create a pandas DataFrame from the dense matrix
train_features_df = pd.DataFrame(train_features_dense, columns=feature_names)

# Display the DataFrame
train_features_df


Unnamed: 0,00,000,007,00am,01pm,04,08,10,100,1000,...,zu,zucker,zulu,zwart,zwick,zzzzzzzzzzzzzzzzzz,æon,élan,ís,ísnt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1395,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1396,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1397,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1398,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## sentiment analysis

In [19]:
train.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [37]:
test_features=v.transform(test.review)

# Convert the sparse matrix to a dense matrix
test_features_dense = test_features.todense()

# Create a pandas DataFrame from the dense matrix
test_features_df = pd.DataFrame(test_features_dense, columns=feature_names)

test_features_df

Unnamed: 0,00,000,007,00am,01pm,04,08,10,100,1000,...,zu,zucker,zulu,zwart,zwick,zzzzzzzzzzzzzzzzzz,æon,élan,ís,ísnt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1394,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1395,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1396,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1397,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
print(train_features)

  (0, 12969)	1
  (0, 12894)	7
  (0, 18596)	16
  (0, 13106)	2
  (0, 15510)	1
  (0, 8498)	1
  (0, 11721)	1
  (0, 18592)	4
  (0, 581)	1
  (0, 20201)	2
  (0, 10177)	2
  (0, 13251)	6
  (0, 6333)	2
  (0, 20768)	3
  (0, 10977)	3
  (0, 1769)	2
  (0, 8936)	1
  (0, 18641)	1
  (0, 1155)	2
  (0, 15594)	2
  (0, 1247)	4
  (0, 18669)	3
  (0, 9862)	9
  (0, 6514)	1
  (0, 20321)	2
  :	:
  (1399, 20594)	1
  (1399, 16332)	1
  (1399, 18795)	1
  (1399, 525)	1
  (1399, 787)	1
  (1399, 15402)	1
  (1399, 18847)	1
  (1399, 8786)	1
  (1399, 13267)	1
  (1399, 15733)	1
  (1399, 1425)	1
  (1399, 19092)	1
  (1399, 18482)	1
  (1399, 11471)	1
  (1399, 11423)	1
  (1399, 9763)	1
  (1399, 19761)	1
  (1399, 16267)	1
  (1399, 12013)	1
  (1399, 811)	1
  (1399, 5208)	1
  (1399, 5881)	1
  (1399, 14054)	1
  (1399, 10012)	1
  (1399, 2913)	1


In [41]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()
model.fit(train_features,train.sentiment)
pred=model.predict_proba(test_features)

In [42]:
review="I love the movie"
print(model.predict(v.transform([review]))[0])

positive
