# Bag of Words
Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

In simple terms, a bag of words represents a sentence as the collection of its words together with their counts, largely disregarding the order in which the words appear.
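
As a small illustration of the "order is disregarded" point, the sketch below builds a bag for two sentences that contain the same words in a different order. The helper name `bag_of_words`, the two sample sentences, and the whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter


def bag_of_words(sentence):
    """Return a word -> count mapping for a whitespace-tokenized sentence."""
    return Counter(sentence.lower().split())


a = bag_of_words("John likes tennis and Mary likes football")
b = bag_of_words("Mary likes football and John likes tennis")

print(a["likes"])  # 2
print(a == b)      # True: same words and counts, different order
```

Both sentences map to the same bag, which is exactly the information a BOW model keeps and the information it throws away.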

BOW is an approach widely used with:

- Natural language processing
- Information retrieval from documents
- Document classification

The steps for BOW are as follows:
<img src = "https://cdn-media-1.freecodecamp.org/images/qRGh8boBcLLQfBvDnWTXKxZIEAk5LNfNABHF">

Let's say we have three sentences:
- "I like to play football"
- "Did you go outside to play tennis"
- "John and I play tennis"

## Step 1: Tokenize the Sentences
<img src= "./images/token.PNG">
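
A minimal sketch of this step on the three example sentences. It lowercases each sentence and splits on whitespace, which is an assumption made to keep the example dependency-free; the notebook cells further down use `nltk.word_tokenize` instead.

```python
sentences = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Lowercase, then split on whitespace to get one token list per sentence
tokenized = [sentence.lower().split() for sentence in sentences]

for tokens in tokenized:
    print(tokens)
```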

## Step 2: Create a Dictionary of Word Frequency
Create a dictionary that contains all the words in our corpus as keys and the frequency of the occurrence of the words as values.
<img src= "./images/freq.PNG">
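
The frequency dictionary for the three example sentences can be sketched with `collections.Counter`, which behaves like a dictionary of word counts. Whitespace tokenization and lowercasing are assumptions of this sketch.

```python
from collections import Counter

sentences = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Accumulate counts over every token of every sentence
wordfreq = Counter()
for sentence in sentences:
    wordfreq.update(sentence.lower().split())

print(wordfreq["play"])    # 3
print(wordfreq["tennis"])  # 2
print(len(wordfreq))       # 12 unique words
```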

## Step 3: Creating the Bag of Words Model
To create the bag of words model, we need to create a matrix in which the columns correspond to the most frequent words in our dictionary and the rows correspond to the documents or sentences.

Suppose we keep the 8 most frequently occurring words from our dictionary. Then the document frequency matrix will look like this:
<img src= "./images/docFreq.PNG">
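
One way to sketch such a matrix for the three example sentences is shown below. It stores per-sentence counts in each cell and picks the columns with `Counter.most_common(8)`; both choices, along with the whitespace tokenization, are assumptions of this example rather than the exact procedure behind the figure.

```python
from collections import Counter

sentences = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]
tokenized = [sentence.lower().split() for sentence in sentences]

# Corpus-wide word frequencies
wordfreq = Counter(token for tokens in tokenized for token in tokens)

# Keep the 8 most frequent words as the matrix columns
vocabulary = [word for word, _ in wordfreq.most_common(8)]

# One row per sentence, one column per vocabulary word,
# each cell holding how often that word occurs in that sentence
matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Several words occur only once in this tiny corpus, so which of them make the cut for the last columns depends on how ties are broken.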

In [1]:
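# nltk.sent_tokenize and nltk.word_tokenize need NLTK's Punkt tokenizer data,
# e.g. nltk.download('punkt') ('punkt_tab' on recent NLTK releases)
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re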

In [2]:
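# Download the Wikipedia article as raw HTML
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
raw_html = raw_html.read()

# Parse the HTML (the 'lxml' parser must be installed)
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Concatenate the text of every <p> element into one string
article_paragraphs = article_html.find_all('p')
article_text = ''.join(para.text for para in article_paragraphs)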

In [3]:
corpus = nltk.sent_tokenize(article_text)

In [4]:
corpus

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.',
 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.',
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.',
 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.',
 'Natural language processing has its roots in the 1950s.',
 'Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves 

In [5]:
for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()               # lowercase everything
    corpus[i] = re.sub(r'\W', ' ', corpus[i])   # replace non-word characters with spaces
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])  # collapse runs of whitespace
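
To see what these three substitutions do to a single sentence, the same steps can be wrapped in a small function. The function name `clean` is an assumption for this sketch; the regular expressions are the ones used in the cell above.

```python
import re


def clean(sentence):
    """Lowercase, replace non-word characters with spaces, collapse whitespace."""
    sentence = sentence.lower()
    sentence = re.sub(r'\W', ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence


print(repr(clean('Hello, World!')))  # 'hello world '
```

The trailing space comes from the final punctuation mark being replaced by a space, which is why the cleaned sentences in the corpus also end with a space. Note that `\W` does not match underscores or digits, so those are kept.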

In [6]:
print(len(corpus))

49


In [7]:
corpus[30]

'in some areas this shift has entailed substantial changes in how nlp systems are designed such that deep neural network based approaches may be viewed as a new paradigm distinct from statistical natural language processing '

In [8]:
# Count how often each token occurs across the whole corpus
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        wordfreq[token] = wordfreq.get(token, 0) + 1

In [9]:
wordfreq

{'natural': 21,
 'language': 29,
 'processing': 17,
 'nlp': 17,
 'is': 17,
 'a': 25,
 'subfield': 1,
 'of': 68,
 'linguistics': 9,
 'computer': 4,
 'science': 3,
 'and': 30,
 'artificial': 2,
 'intelligence': 4,
 'concerned': 1,
 'with': 11,
 'the': 68,
 'interactions': 1,
 'between': 1,
 'computers': 2,
 'human': 2,
 'in': 30,
 'particular': 1,
 'how': 2,
 'to': 31,
 'program': 1,
 'process': 2,
 'analyze': 2,
 'large': 3,
 'amounts': 1,
 'data': 5,
 'goal': 1,
 'capable': 1,
 'understanding': 4,
 'contents': 1,
 'documents': 4,
 'including': 1,
 'contextual': 1,
 'nuances': 1,
 'within': 1,
 'them': 1,
 'technology': 1,
 'can': 5,
 'then': 2,
 'accurately': 1,
 'extract': 1,
 'information': 1,
 'insights': 1,
 'contained': 1,
 'as': 16,
 'well': 2,
 'categorize': 1,
 'organize': 1,
 'themselves': 1,
 'challenges': 1,
 'frequently': 2,
 'involve': 2,
 'speech': 5,
 'recognition': 2,
 'generation': 2,
 'has': 6,
 'its': 2,
 'roots': 1,
 '1950s': 1,
 'already': 1,
 '1950': 1,
 'alan': 1

In [10]:
import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)
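
A quick look at how `heapq.nlargest` behaves when it iterates over a dictionary with `key=wordfreq.get`: it returns the keys, ordered by their values from largest to smallest. The small dictionary below reuses a few counts from the output above and is only a sketch.

```python
import heapq

wordfreq = {'of': 68, 'the': 68, 'language': 29, 'subfield': 1, 'to': 31}

# The three keys with the highest counts, largest first
top3 = heapq.nlargest(3, wordfreq, key=wordfreq.get)
print(top3)  # ['of', 'the', 'to']

# Asking for more items than exist simply returns every key, sorted by count
everything = heapq.nlargest(200, wordfreq, key=wordfreq.get)
print(len(everything))  # 5
```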

In [11]:
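# Build one binary vector per sentence:
# 1 if the vocabulary word occurs in the sentence, 0 otherwise
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = set(nltk.word_tokenize(sentence))  # set for fast membership tests
    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)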

In [12]:
sentence_vectors = np.asarray(sentence_vectors)
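
The vectors built above record only whether a word is present (1) or absent (0). If word counts are wanted instead, a count-based variant could look like the sketch below. The function names, the four-word vocabulary, and the sample sentence are hypothetical and chosen only to show the difference.

```python
def binary_vector(tokens, vocabulary):
    """1 if the vocabulary word occurs among the tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def count_vector(tokens, vocabulary):
    """Number of times each vocabulary word occurs among the tokens."""
    return [tokens.count(word) for word in vocabulary]


vocabulary = ["natural", "language", "processing", "speech"]
tokens = "natural language understanding and natural language generation".split()

print(binary_vector(tokens, vocabulary))  # [1, 1, 0, 0]
print(count_vector(tokens, vocabulary))   # [2, 2, 0, 0]
```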