
Commit b58c72c

Merge pull request avinashkranjan#836 from zaverisanya/master
Bag of words model
2 parents 2ac8f5e + 7705143 commit b58c72c

File tree

2 files changed: +65 −0 lines changed


Bag of words model/README.md

Lines changed: 32 additions & 0 deletions
# Bag of Words Model

--> Package installed: NLTK

- NLTK stands for 'Natural Language Toolkit'. It provides the most common NLP algorithms, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.
--> Pandas

- pandas is a library in which your data can be stored, analyzed, and processed in a row-and-column representation.
--> from sklearn.feature_extraction.text import CountVectorizer

- Scikit-learn's CountVectorizer converts a collection of text documents into a vector of term/token counts. It also enables pre-processing of the text data prior to generating the vector representation. This makes it a highly flexible feature-representation module for text.
## Setup instructions

1) Input the sentences you would like to vectorize.
2) The script will tokenize the sentences.
3) It will transform the text into vectors where each word and its count is a feature.
4) The bag-of-words model is then ready.
5) The script creates a DataFrame, which is analogous to an Excel spreadsheet.
6) Open Excel and check 'bowp.xlsx'; the DataFrame is stored there in a sheet named 'data'.
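Steps 5 and 6 above can be sketched as a DataFrame-to-Excel round trip (a minimal example assuming `openpyxl` is installed; the file name `bowp_demo.xlsx` and the tiny count matrix are illustrative, not from the script):

```python
import pandas as pd

# A tiny count matrix like the one the script produces
bow = [[1, 0, 1], [0, 2, 0]]
df = pd.DataFrame(bow, columns=["am", "caring", "name"])

# Write to an Excel sheet named 'data', then read it back to verify
df.to_excel("bowp_demo.xlsx", sheet_name="data")
restored = pd.read_excel("bowp_demo.xlsx", sheet_name="data", index_col=0)
print(restored.equals(df))  # the round trip should preserve the data
```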
## Output
![Image](https://i.postimg.cc/pLQq8Vdc/output.png)

## Author(s)

- This code is written by [Sanya Devansh Zaveri](https://github.com/zaverisanya)

## Disclaimers, if any

There are no disclaimers for this script.

Bag of words model/bow.py

Lines changed: 33 additions & 0 deletions
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas as pd  # pandas stores, analyzes, and processes data in a row-and-column representation
from openpyxl import Workbook  # openpyxl is the engine pandas uses to write .xlsx files

sentences = input("Enter your sentences: ")
# e.g. My name is sanya. I am caring and loving. I am generous.

# converting to lower case (normalization)
sentences = sentences.lower()

# sentence tokenization (the 'punkt' tokenizer data must be downloaded once)
nltk.download('punkt', quiet=True)
tokenized_sentences = nltk.tokenize.sent_tokenize(sentences)
print(tokenized_sentences)

tokenized_sentences1 = []
for x in tokenized_sentences:
    x = x.replace(".", "")  # remove full stops
    tokenized_sentences1.append(x)
print(tokenized_sentences1)  # the word list can be converted to a set to get unique words

# instantiating CountVectorizer()
countVectorizer = CountVectorizer()  # BOW
# transforming the text into vectors where each word and its count is a feature
tmpbow = countVectorizer.fit_transform(tokenized_sentences1)  # pass the list of sentences as the argument
print("tmpbow \n", tmpbow)  # the bag-of-words model is ready

bow = tmpbow.toarray()
print("Vocabulary = ", countVectorizer.vocabulary_)
print("Features = ", countVectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
# features in machine learning are simply the names of the columns
print("BOW ", bow)

# create DataFrame (a DataFrame is analogous to an Excel spreadsheet)
cv_dataframe = pd.DataFrame(bow, columns=countVectorizer.get_feature_names_out())

print("cv_dataframe is below\n", cv_dataframe)
cv_dataframe.to_excel('./Bag of words model/bowp.xlsx', sheet_name='data')
