
Commit b58c72c

Merge pull request avinashkranjan#836 from zaverisanya/master
Bag of words model
2 parents 2ac8f5e + 7705143 commit b58c72c

File tree

2 files changed: +65 −0 lines changed


Bag of words model/README.md

Lines changed: 32 additions & 0 deletions
# Bag of Words Model

--> Package installed: NLTK

- NLTK stands for 'Natural Language Toolkit'. It provides the most common NLP algorithms, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.
--> Pandas

- pandas is a library in which your data can be stored, analyzed, and processed in a row-and-column representation.
--> from sklearn.feature_extraction.text import CountVectorizer

- Scikit-learn's CountVectorizer converts a collection of text documents into a vector of term/token counts. It also enables pre-processing of the text data prior to generating the vector representation. This makes it a highly flexible feature-representation module for text.
## Setup instructions

1) Input the sentences you would like to vectorize.
2) The script will tokenize the sentences.
3) It will transform the text into vectors where each word and its count is a feature.
4) The bag-of-words model is then ready.
5) The script creates a DataFrame, which is analogous to an Excel spreadsheet.
6) Open Excel and check 'bowp.xlsx'; the DataFrame is stored there in a sheet named 'data'.
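Steps 5 and 6 above can be sketched as a DataFrame-to-Excel round trip (a minimal example assuming `openpyxl` is installed; the file name `bowp_demo.xlsx` and the tiny count matrix are illustrative, not from the script):

```python
import pandas as pd

# A tiny count matrix like the one the script produces
bow = [[1, 0, 1], [0, 2, 0]]
df = pd.DataFrame(bow, columns=["am", "caring", "name"])

# Write to an Excel sheet named 'data', then read it back to verify
df.to_excel("bowp_demo.xlsx", sheet_name="data")
restored = pd.read_excel("bowp_demo.xlsx", sheet_name="data", index_col=0)
print(restored.equals(df))  # the round trip should preserve the data
```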
## Output
![Image](https://i.postimg.cc/pLQq8Vdc/output.png)

## Author(s)

- This code is written by [Sanya Devansh Zaveri](https://github.com/zaverisanya)

## Disclaimers, if any

There are no disclaimers for this script.

Bag of words model/bow.py

Lines changed: 33 additions & 0 deletions
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas as pd  # pandas stores, analyzes, and processes data in a row-and-column representation
from openpyxl import Workbook  # openpyxl is the engine pandas uses to write .xlsx files

sentences = input("Enter your sentences: ")
# e.g. My name is sanya. I am caring and loving. I am generous.

# converting to lower case (normalization)
sentences = sentences.lower()

# sentence tokenization (the 'punkt' tokenizer data must be downloaded once)
nltk.download('punkt', quiet=True)
tokenized_sentences = nltk.tokenize.sent_tokenize(sentences)
print(tokenized_sentences)

tokenized_sentences1 = []
for x in tokenized_sentences:
    x = x.replace(".", "")  # remove full stops
    tokenized_sentences1.append(x)
print(tokenized_sentences1)  # the word list can be converted to a set to get unique words

# instantiating CountVectorizer()
countVectorizer = CountVectorizer()  # BOW
# transforming the text into vectors where each word and its count is a feature
tmpbow = countVectorizer.fit_transform(tokenized_sentences1)  # pass the list of sentences as the argument
print("tmpbow \n", tmpbow)  # the bag-of-words model is ready

bow = tmpbow.toarray()
print("Vocabulary = ", countVectorizer.vocabulary_)
print("Features = ", countVectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
# features in machine learning are simply the names of the columns
print("BOW ", bow)

# create DataFrame (a DataFrame is analogous to an Excel spreadsheet)
cv_dataframe = pd.DataFrame(bow, columns=countVectorizer.get_feature_names_out())

print("cv_dataframe is below\n", cv_dataframe)
cv_dataframe.to_excel('./Bag of words model/bowp.xlsx', sheet_name='data')
