# Bag of Words Model in Pandas

A bag of words is a matrix representation of a collection of documents. Each column is a unique word and each row is a document. Each cell indicates whether the word is present in that document. A DataFrame representation is shown below.

![Image](./data/bagofwords.PNG)

Each column (except doc_id) is a word and each row is a document. The first column holds the document name. The first row tells us that the document with doc_id 1987_1 does not contain the words abalone, abbeel, or zhou, so each of those values is 0. If a word is contained in the document, the corresponding value in that column is 1.
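
As a rough illustration of the layout described above, the sketch below builds a presence/absence matrix for a tiny made-up corpus using only pandas. The document ids, the texts, and the variable names (`docs`, `tokens`, `vocabulary`, `bow`) are assumptions for the example and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy corpus: document id -> text
docs = {
    "doc_a": "the model learns word vectors",
    "doc_b": "the model predicts sentiment",
}

# Tokenize each document into a set of unique lowercase words
tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

# The vocabulary is the sorted union of all words in the corpus
vocabulary = sorted(set().union(*tokens.values()))

# One row per document, one column per word; 1 = present, 0 = absent
bow = pd.DataFrame(
    [[int(word in words) for word in vocabulary] for words in tokens.values()],
    index=list(tokens.keys()),
    columns=vocabulary,
)
bow.index.name = "doc_id"
print(bow)
```

Words shared by both documents ("the", "model") get a 1 in both rows, while a word such as "learns" gets a 1 only in the row for `doc_a`.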

We have to build this bag of words model from 5 documents, named doc1.txt, doc2.txt, doc3.txt, doc4.txt and doc5.txt.

**There should be 5 rows in the DataFrame. The columns should be the unique words across all documents. Only words longer than 4 characters should be kept, and the words should not contain any punctuation marks.**

### From the DataFrame find the following information:
1. Find out all the common words across the five documents.
2. Find out all the uncommon words across the five documents.
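
Given a presence/absence DataFrame, both questions can be answered from column sums. The sketch below assumes that "common" means a word appears in every document and "uncommon" means it appears in exactly one; the small `bow` matrix and its words are hypothetical stand-ins for the real five-document matrix.

```python
import pandas as pd

# Hypothetical presence/absence matrix for three documents
bow = pd.DataFrame(
    {"analysis": [1, 1, 1], "neural": [1, 0, 0], "words": [1, 1, 0]},
    index=["doc1", "doc2", "doc3"],
)

# Number of documents that contain each word
doc_frequency = (bow > 0).sum(axis=0)

# Common words appear in every document
common_words = doc_frequency[doc_frequency == len(bow)].index.tolist()

# Uncommon words appear in exactly one document
uncommon_words = doc_frequency[doc_frequency == 1].index.tolist()

print(common_words)
print(uncommon_words)
```

Comparing with `bow > 0` means the same code also works on a count matrix, where cells can be larger than 1.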

## Import Files
    

In [1]:
def read_doc(n):
    # The with-block closes the file as soon as it has been read
    with open(f"./data/Bag of Words Docs/doc{n}.txt", 'r') as fo:
        return fo.read()

text1, text2, text3, text4, text5 = (read_doc(n) for n in range(1, 6))

## DataFrame Representation

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = [text1, text2, text3, text4, text5]

# Count single words (unigrams) and drop English stop words;
# use ngram_range=(2, 2) to count bigrams instead
CountVec = CountVectorizer(ngram_range=(1, 1), stop_words='english')

# Learn the vocabulary and build the document-term count matrix
Count_data = CountVec.fit_transform(sentences)

# Create the DataFrame (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement)
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
cv_dataframe

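| | 000 | 100 | academic | achieve | advances | ambiguity | amounts | analogies | analysis | appear | ... | ways | won | word | word2vec | words | work | works | world | years | york |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 3 | 2 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |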


## Removing Punctuation and Words of Length 4 or Less

In [3]:
import string

text_file = [text1, text2, text3, text4, text5]

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

stripped = []
for text in text_file:
    # Split on single spaces and keep tokens longer than 4 characters
    words = [x for x in text.split(" ") if len(x) > 4]
    # Remove punctuation from the remaining tokens
    stripped.append([w.translate(table) for w in words])

str1, str2, str3, str4, str5 = stripped

#print(str1)
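
In the cell above, the length check runs before punctuation is removed, and tokens are separated only on single spaces, so tokens joined by newlines stay glued together. A possible variant, which is not the approach used in the rest of this notebook, strips punctuation first and splits on any whitespace. The helper name `clean_tokens`, its `min_length` parameter, the lowercasing step, and the sample sentence are all assumptions for illustration.

```python
import string

def clean_tokens(text, min_length=5):
    """Return lowercase tokens with punctuation removed and at least min_length characters."""
    table = str.maketrans('', '', string.punctuation)
    # split() with no argument splits on any whitespace, including newlines
    tokens = (w.translate(table).lower() for w in text.split())
    return [w for w in tokens if len(w) >= min_length]

sample = "Google's Word2Vec is a deep-learning method.\n\nTo begin, read on!"
print(clean_tokens(sample))
```

Here "method." and "To" are separated correctly despite the blank line between them, and "begin," is measured as 5 characters because the comma is removed before the length check.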

In [4]:
# Convert each list of words back into a single space-separated string
Tolist = [" ".join(words) for words in stripped]

print(Tolist[0])

tutorial competition little deeper sentiment analysis Googles Word2Vec deeplearning inspired method focuses meaning words Word2Vec attempts understand meaning semantic relationships among words works similar approaches recurrent neural neural nets computationally efficient tutorial focuses Word2Vec sentiment analysis


## Creating a Dictionary for Each File to Find Word Frequencies

In [5]:
# Split each cleaned document into tokens on whitespace
dd = [doc.split() for doc in Tolist]

# Build one word -> frequency dictionary per document
du = []
for tokens in dd:
    freq = dict()
    for word in tokens:
        word = word.lower()
        freq[word] = freq.get(word, 0) + 1
    du.append(freq)
#print(du[1])
#print(dd[1])
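
The same word-to-frequency mapping can also be produced with `collections.Counter` from the standard library. This is only an alternative sketch; the sample string imitates the space-joined form of an entry in `Tolist`, and the names `sample_doc` and `freq` are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned document, in the same space-joined form as Tolist[i]
sample_doc = "tutorial focuses Word2Vec sentiment analysis Word2Vec tutorial"

# Counter tallies each lowercase word in a single pass
freq = Counter(word.lower() for word in sample_doc.split())
print(freq.most_common(2))
```

A `Counter` is a `dict` subclass, so it can be used anywhere the hand-built dictionaries are used.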

## 1. Find out all the common words across the five documents.

In [6]:
from collections import Counter

# Wrap each frequency dictionary in a Counter
c1, c2, c3, c4, c5 = (Counter(d) for d in du)

# & keeps only the words present in every Counter (with their minimum count)
t = c1 & c2 & c3 & c4 & c5
t

Counter()
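
The empty `Counter()` means that no word survives the intersection, that is, no word occurs in all five documents. To show how the `&` operator behaves, here is a small sketch with two made-up counters (`c_a` and `c_b` are hypothetical names).

```python
from collections import Counter

c_a = Counter({"analysis": 2, "sentiment": 1})
c_b = Counter({"analysis": 1, "neural": 3})

# & keeps only keys present in both, with the minimum of the two counts
shared = c_a & c_b
print(shared)

# If only the words matter, a plain set intersection of the keys is enough
shared_words = set(c_a) & set(c_b)
print(shared_words)
```

Only "analysis" is present in both counters, and its count is the smaller of 2 and 1.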

## 2. Find out all the uncommon words across the five documents.

In [7]:
def UncommonWords(A, B, C, D, E):
    """Return the words that occur exactly once across the five word lists."""
    count = {}
    # Tally every occurrence of every word in all five lists
    for words in (A, B, C, D, E):
        for word in words:
            count[word] = count.get(word, 0) + 1
    # A word repeated anywhere, even within one document, is excluded
    return [word for word in count if count[word] == 1]

A, B, C, D, E = str1, str2, str3, str4, str5
print(UncommonWords(A, B, C, D, E))

['little', 'deeper', 'Googles', 'deeplearning', 'method', 'attempts', 'understand', 'semantic', 'among', 'works', 'recurrent', 'nets', 'computationally', 'efficient', 'Sentiment', 'challenging', 'subject', 'People', 'express', 'their', 'emotions', 'often', 'obscured', 'sarcasm', 'ambiguity', 'plays', 'could', 'misleading', 'humans', 'computers', 'Theres', 'another', 'review', 'explore', 'applied', 'problem', 'years', 'front', 'Times', 'These', 'techniques', 'architecture', 'human', 'brain', 'possible', 'recent', 'advances', 'computing', 'power', 'waves', 'breakthrough', 'results', 'speech', 'processing', 'natural', 'tasks', 'Recently', 'competitions', 'including', 'discovery', 'task', 'Since', 'rapidly', 'evolving', 'field', 'large', 'amounts', 'published', 'exists', 'academic', 'papers', 'exploratory', 'prescriptive', 'experiment', 'rather', 'giving', 'recipe', 'output\n\nTo', 'achieve', 'these', 'goals', '100000', 'multiparagraph', 'reviews', 'positive', 'negative', 'labels', 'order'
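
The function above counts every occurrence, so a word repeated inside a single document is dropped, and "Sentiment" and "sentiment" are treated as different words. If "uncommon" is instead taken to mean "appears in exactly one document", one possible sketch counts each word at most once per document after lowercasing. The helper name `words_in_one_document` and the sample token lists are assumptions for illustration.

```python
from collections import Counter

def words_in_one_document(documents):
    """Return words that occur in exactly one of the given token lists."""
    doc_count = Counter()
    for tokens in documents:
        # set() ensures each document contributes at most once per word
        doc_count.update(set(w.lower() for w in tokens))
    return sorted(w for w, n in doc_count.items() if n == 1)

sample_docs = [
    ["Word2Vec", "tutorial", "tutorial"],
    ["word2vec", "sentiment"],
]
print(words_in_one_document(sample_docs))
```

In this sample, "tutorial" is kept even though it occurs twice, because both occurrences are in the same document, while "Word2Vec" is dropped because it appears in both documents once case is ignored.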