## Bag of Words

![alt text](https://cdn-media-1.freecodecamp.org/images/qRGh8boBcLLQfBvDnWTXKxZIEAk5LNfNABHF)

Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

In simple terms, it represents a sentence as a collection of words with their counts, mostly disregarding the order in which they appear.

BOW is an approach widely used with:

* Natural language processing
* Information retrieval from documents
* Document classification

Let’s start with an example: take two sentences and generate a vector for each.

1. "John likes to watch movies. Mary likes movies too."
2. "John also likes to watch football games."

Next, for each sentence, collapse repeated words into a single entry and record how many times each word occurs.

In [None]:
sentence_1_counts = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
sentence_2_counts = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}

Assuming these sentences are part of a document, below is the combined word frequency for our entire document. Both sentences are taken into account.

In [None]:
{"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,  "also":1,"football":1,"games":1}

{'John': 2,
 'Mary': 1,
 'also': 1,
 'football': 1,
 'games': 1,
 'likes': 3,
 'movies': 2,
 'to': 2,
 'too': 1,
 'watch': 2}
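
The per-sentence and combined counts above can be reproduced with the standard library. This is a minimal sketch, assuming a simple regex tokenizer that keeps case as in the example; the helper name `simple_tokens` is illustrative and not part of the original notebook.

```python
import re
from collections import Counter

sentences = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def simple_tokens(sentence):
    # Assumption: a word is a run of letters; punctuation is dropped, case is kept.
    return re.findall(r"[A-Za-z]+", sentence)

# One Counter per sentence, then a combined count for the whole document.
per_sentence = [Counter(simple_tokens(s)) for s in sentences]
combined = sum(per_sentence, Counter())

print(per_sentence[0])
print(combined)
print(len(combined), "unique words")
```

The number of keys in `combined` is the vocabulary size, which is also the length of every vector built from it.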

**The length of the vector will always be equal to the vocabulary size. In this case the vector length is 10.**

To represent each original sentence as a vector, first initialize a vector of all zeros, one position per vocabulary word: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Then iterate over the words of the sentence and, for each one, increment the vector position of the matching vocabulary word.

Using the vocabulary order John, likes, to, watch, movies, Mary, too, also, football, games, the two sentences become:

1. "John likes to watch movies. Mary likes movies too.": [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
2. "John also likes to watch football games.": [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

For example, the word "likes" is in the second position of the vocabulary and appears twice in sentence 1, so the second element of the vector for sentence 1 is 2: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

The length of each vector always equals the size of the vocabulary, no matter how short the sentence is.
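
The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, assuming the vocabulary is ordered by first appearance (which matches the column order used in this example) and a simple letters-only tokenizer; `simple_tokens` and `to_vector` are hypothetical helper names.

```python
import re

sentences = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def simple_tokens(sentence):
    # Assumption: a word is a run of letters; case is kept as in the example.
    return re.findall(r"[A-Za-z]+", sentence)

# Build the vocabulary in order of first appearance.
vocab = []
for s in sentences:
    for w in simple_tokens(s):
        if w not in vocab:
            vocab.append(w)

# Map each word to its position in the vector.
index = {w: i for i, w in enumerate(vocab)}

def to_vector(sentence):
    vec = [0] * len(vocab)          # start with all zeros
    for w in simple_tokens(sentence):
        vec[index[w]] += 1          # count each occurrence
    return vec

vectors = [to_vector(s) for s in sentences]
print(vocab)
for s, v in zip(sentences, vectors):
    print(s, v)
```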

A large corpus produces a large vocabulary, so most positions in any single document's vector are 0. Such a vector is called a sparse vector. Stored densely, sparse vectors waste memory and computation, and the large number of positions or dimensions can make the modeling process very challenging for traditional algorithms.
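
To make the memory argument concrete, here is a small sketch that stores only the non-zero positions of a count vector in a dictionary. The vocabulary size and the word positions are made-up values for illustration; real libraries use more compact formats, but the idea is the same.

```python
def to_sparse(dense):
    """Keep only the non-zero positions of a count vector as {index: count}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_dense(sparse, size):
    """Rebuild the full vector from the {index: count} form."""
    vec = [0] * size
    for i, v in sparse.items():
        vec[i] = v
    return vec

# Hypothetical document that uses 3 distinct words out of a 10,000-word vocabulary.
vocab_size = 10_000
dense = [0] * vocab_size
dense[12] = 2
dense[345] = 1
dense[9_876] = 1

sparse = to_sparse(dense)
print(len(dense), "dense entries vs", len(sparse), "stored entries")
```

scikit-learn's `CountVectorizer` returns a SciPy sparse matrix for the same reason: only the non-zero counts are stored.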

In [None]:
import numpy
import re

'''
tokenize: build the sorted vocabulary of unique words across all sentences.
word_extraction: split one sentence into lowercase words, for example
Input : "John likes to watch movies. Mary likes movies too"
Output : "john","likes","to","watch","movies","mary","likes","movies","too"
'''
def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)

    words = sorted(list(set(words)))
    return words

def word_extraction(sentence):
    ignore = ['a', "the", "is"]
    # replace every non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ",  sentence).split()
    # lowercase before filtering so that "The" is ignored as well as "the"
    cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
    return cleaned_text

def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i,word in enumerate(vocab):
                if word == w: 
                    bag_vector[i] += 1
                    
        print("{0} \n{1}\n".format(sentence,numpy.array(bag_vector)))


allsentences = ["joe waited for the train", "the train was late", "mary and samantha took the bus", 
            "i looked for mary and samantha at the bus station", 
            "mary and samantha arrived at the bus station early but waited until noon for the bus"]


generate_bow(allsentences)

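# or one can use sklearn; note that CountVectorizer's default tokenizer keeps "the"
# and drops single-character tokens such as "i", so its columns differ from the ones above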

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)
print(X.toarray())

Word List for Document 
['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'took', 'train', 'until', 'waited', 'was'] 

joe waited for the train 
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

the train was late 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]

mary and samantha took the bus 
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0.]

i looked for mary and samantha at the bus station 
[1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0.]

mary and samantha arrived at the bus station early but waited until noon for the bus 
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0.]

[[0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1]
 [1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0]
 [1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0]
 [1 1 1 2 1 1 1 0 0 0 1 1 1 1 2 0 0 1 1 0]]


### Limitations of BOW

**Semantic meaning**: the basic BOW approach does not consider the meaning of the word in the document. It completely ignores the context in which it’s used. The same word can carry different meanings depending on the context or nearby words, and BOW cannot tell those uses apart.
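
A quick way to see this limitation is to compare two sentences that use the same words in a different order. The sentences below are made up for illustration.

```python
from collections import Counter

a = "the dog bit the man"
b = "the man bit the dog"

# Word counts ignore order, so both sentences get the same bag of words.
bow_a = Counter(a.split())
bow_b = Counter(b.split())

print(bow_a)
print(bow_a == bow_b)
```

The two sentences mean opposite things, yet their BOW representations are identical.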

**Vector size**: For a large document, the vector size can be huge resulting in a lot of computation and time. You may need to ignore words based on relevance to your use case.
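
One common way to shrink the vector is to drop words that occur in too few or too many documents. The sketch below does this by hand on three toy sentences; the thresholds and the `prune_vocab` helper are illustrative assumptions. scikit-learn's `CountVectorizer` exposes the same idea through its `min_df`, `max_df`, and `max_features` parameters.

```python
from collections import Counter

docs = [
    "joe waited for the train",
    "the train was late",
    "mary and samantha took the bus",
]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for d in docs:
    doc_freq.update(set(d.split()))

def prune_vocab(doc_freq, n_docs, min_df=1, max_df_ratio=1.0):
    """Keep words seen in at least min_df documents and in at most
    max_df_ratio (a fraction) of all documents."""
    return sorted(
        w for w, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    )

full = prune_vocab(doc_freq, len(docs))
pruned = prune_vocab(doc_freq, len(docs), min_df=2, max_df_ratio=0.9)

print(len(full), "words before pruning")
print(pruned)
```

Here "the" is dropped because it occurs in every document, and words seen only once are dropped as too rare.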

## Bi-gram / N-gram

In [None]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [None]:
import nltk
from nltk.corpus import stopwords
from collections import Counter

word_list = []

# Set up a quick lookup table for common words like "the" and "an" so they can be excluded
stops = set(stopwords.words('english'))

# For all 18 texts in NLTK's Gutenberg corpus, extract all their words
for f in nltk.corpus.gutenberg.fileids():
    word_list.extend(nltk.corpus.gutenberg.words(f))

# Filter out tokens that contain punctuation and make everything lower-case
cleaned_words = [w.lower() for w in word_list if w.isalnum()]

# Ask NLTK to generate a list of bigrams that contain the word "sun", excluding
# those words which are too common to be interesting
sun_bigrams = [b for b in nltk.bigrams(cleaned_words)
               if (b[0] == 'sun' or b[1] == 'sun')
               and b[0] not in stops and b[1] not in stops]

In [None]:
print(sun_bigrams)
print(len(sun_bigrams))

[('day', 'sun'), ('glaring', 'sun'), ('sun', 'tired'), ('sun', 'bright'), ('sun', 'appeared'), ('western', 'sun'), ('rising', 'sun'), ('sun', 'frequently'), ('sun', 'gained'), ('sun', 'went'), ('sun', 'rose'), ('sun', 'waxed'), ('sun', '17'), ('sun', 'goeth'), ('sun', 'shall'), ('sun', 'goeth'), ('sun', 'goeth'), ('sun', 'go'), ('sun', 'shall'), ('israel', 'sun'), ('sun', 'stand'), ('sun', 'stood'), ('sun', 'stood'), ('sun', 'went'), ('sun', 'went'), ('sun', 'went'), ('sun', '12'), ('sun', '12'), ('sun', 'riseth'), ('sun', 'saying'), ('sun', 'shone'), ('sun', 'going'), ('sun', '19'), ('sun', 'unto'), ('sun', '58'), ('sun', '74'), ('sun', 'knoweth'), ('sun', 'ariseth'), ('sun', 'unto'), ('sun', 'shall'), ('sun', '1'), ('sun', 'also'), ('sun', 'goeth'), ('sun', '1'), ('sun', '2'), ('sun', '2'), ('sun', '2'), ('sun', '4'), ('sun', '4'), ('sun', 'namely'), ('sun', '7'), ('sun', '7'), ('sun', '8'), ('sun', '9'), ('sun', '9'), ('sun', '11'), ('sun', 'hath'), ('sun', 'shall'), ('sun', 'ashame

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams 

sentences = ["To Sherlock Holmes she is always the woman.", "I have seldom heard him mention her under any other name."]

bigrams = []
for sentence in sentences:
    sequence = word_tokenize(sentence) 
    bigrams.extend(list(ngrams(sequence, 2)))

freq_dist = nltk.FreqDist(bigrams)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

In [None]:
print(bigrams)
print(freq_dist)

[('To', 'Sherlock'), ('Sherlock', 'Holmes'), ('Holmes', 'she'), ('she', 'is'), ('is', 'always'), ('always', 'the'), ('the', 'woman'), ('woman', '.'), ('I', 'have'), ('have', 'seldom'), ('seldom', 'heard'), ('heard', 'him'), ('him', 'mention'), ('mention', 'her'), ('her', 'under'), ('under', 'any'), ('any', 'other'), ('other', 'name'), ('name', '.')]
<FreqDist with 19 samples and 19 outcomes>


In [None]:
from nltk.util import ngrams
text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
tokenize = nltk.word_tokenize(text)
print(tokenize)
print (len(tokenize))
trigrams=ngrams(tokenize,3)
print(trigrams)
fourgrams=ngrams(tokenize,4)
print(fourgrams)

['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
27
<generator object ngrams at 0x7f4e21150a40>
<generator object ngrams at 0x7f4e3273ea98>


In [None]:
def get_ngrams(n_grams):
    return [ ' '.join(grams) for grams in n_grams]
get_ngrams(trigrams)

['I am aware',
 'am aware that',
 'aware that nltk',
 'that nltk only',
 'nltk only offers',
 'only offers bigrams',
 'offers bigrams and',
 'bigrams and trigrams',
 'and trigrams ,',
 'trigrams , but',
 ', but is',
 'but is there',
 'is there a',
 'there a way',
 'a way to',
 'way to split',
 'to split my',
 'split my text',
 'my text in',
 'text in four-grams',
 'in four-grams ,',
 'four-grams , five-grams',
 ', five-grams or',
 'five-grams or even',
 'or even hundred-grams']

In [None]:
get_ngrams(fourgrams)

['I am aware that',
 'am aware that nltk',
 'aware that nltk only',
 'that nltk only offers',
 'nltk only offers bigrams',
 'only offers bigrams and',
 'offers bigrams and trigrams',
 'bigrams and trigrams ,',
 'and trigrams , but',
 'trigrams , but is',
 ', but is there',
 'but is there a',
 'is there a way',
 'there a way to',
 'a way to split',
 'way to split my',
 'to split my text',
 'split my text in',
 'my text in four-grams',
 'text in four-grams ,',
 'in four-grams , five-grams',
 'four-grams , five-grams or',
 ', five-grams or even',
 'five-grams or even hundred-grams']
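
The n-gram lists above can also be produced without NLTK. This is a minimal sketch using `zip` over shifted copies of the token list; `simple_ngrams` is a hypothetical helper name, and whitespace splitting stands in for `word_tokenize`.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so exactly len(tokens) - n + 1 tuples come out
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I am aware that nltk only offers bigrams".split()

tri = simple_ngrams(tokens, 3)
four = simple_ngrams(tokens, 4)

print([" ".join(g) for g in tri])
print(len(tri), len(four))
```

A list of k tokens yields k - n + 1 n-grams, which is why the 27 tokens above give 25 trigrams and 24 four-grams.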

## TF-IDF Vectorizer


TF-IDF stands for term frequency-inverse document frequency. TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

**Term Frequency (TF)**: a score of how often the word occurs in the current document. Since documents differ in length, a term may appear many more times in a long document than in a short one, so the term frequency is often divided by the document length to normalize.

![alt text](https://miro.medium.com/max/404/1*SUAeubfQGK_w0XZWQW6V1Q.png)


**Inverse Document Frequency (IDF)**: a score of how rare the word is across documents. The rarer the term, the higher its IDF score.

![alt text](https://miro.medium.com/max/411/1*T57j-UDzXizqG40FUfmkLw.png)


Thus,

![alt text](https://miro.medium.com/max/215/1*YrgmAeG7KNRB4dQcGcsdyg.png)

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
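
As a worked example, the sketch below computes TF-IDF by hand for three toy sentences. It assumes the common textbook definitions, tf = (count of term in document) / (document length) and idf = ln(number of documents / number of documents containing the term); the exact variant shown in the figures above may differ, and scikit-learn uses a smoothed formula.

```python
import math
from collections import Counter

docs = [
    "the train was late",
    "the bus was early",
    "mary took the bus",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter()
for toks in tokenized:
    doc_freq.update(set(toks))

def tf(term, toks):
    # term count normalized by document length
    return toks.count(term) / len(toks)

def idf(term):
    # assumption: unsmoothed natural-log idf
    return math.log(n_docs / doc_freq[term])

def tf_idf(term, toks):
    return tf(term, toks) * idf(term)

first = tokenized[0]
scores = {t: round(tf_idf(t, first), 3) for t in first}
print(scores)
```

"the" occurs in every document, so its idf and therefore its score is 0, while "train" and "late" occur in only one document and score highest.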

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving stackoverflow-data-idf.json to stackoverflow-data-idf.json
User uploaded file "stackoverflow-data-idf.json" with length 43476643 bytes


In [None]:
import pandas as pd

# read json into a dataframe
df_idf=pd.read_json("stackoverflow-data-idf.json",lines=True)

# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)

Schema:

 id                            int64
title                        object
body                         object
answer_count                  int64
comment_count                 int64
creation_date                object
last_activity_date           object
last_editor_display_name     object
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
view_count                    int64
accepted_answer_id          float64
favorite_count              float64
last_edit_date               object
last_editor_user_id         float64
community_owned_date         object
dtype: object
Number of questions,columns= (20000, 19)


In [None]:
import re
def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    #remove tags
    text=re.sub("</?.*?>"," <> ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text

df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x:pre_process(x))

#show the first 'text'
df_idf['text'][0]

'serializing a private struct can it be done i have a public class that contains a private struct the struct contains properties mostly string that i want to serialize when i attempt to serialize the struct and stream it to disk using xmlserializer i get an error saying only public types can be serialized i don t need and don t want this struct to be public is there a way i can serialize it and keep it private '

In [None]:
df_idf['text'][10]

'cannot assign a value on a integer in bash linux i asked a few days ago about an error in a bash script in the meantime i had rebuilded the code and now i still have problem i want to assign the value of waarde to vorigewaarde in the if loop that looks how much your score is while i tried to find the problem with bash x filename it says that the value of vorigewaarde empty is but i don t see how that can happen the code i have marked the line where the fault is it is after the until code line bin bash score uitkomst juist waardegok number random kaarten harten harten harten harten harten harten harten harten harten harten harten harten harten klaver klaver klaver klaver klaver klaver klaver klaver klaver klaver klaver klaver klaver schop schop schop schop schop schop schop schop schop schop schop schop schop ruit ruit ruit ruit ruit ruit ruit ruit ruit ruit ruit ruit ruit gokkaart kaarten number echo if number gt amp amp number lt then case gokkaart in harten waarde harten waarde hart

In [None]:
uploaded = files.upload()

Saving stopwords.txt to stopwords.txt


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

#load a set of stop words
stopwords=get_stop_words("stopwords.txt")

#get the text column 
docs=df_idf['text'].tolist()

#create a vocabulary of words, 
#ignore words that appear in more than 85% of documents, 
#eliminate stop words
cv=CountVectorizer(max_df=0.85,stop_words=stopwords)
word_count_vector=cv.fit_transform(docs)

In [None]:
word_count_vector.shape

(20000, 124901)

In [None]:
word_count_vector

<20000x124901 sparse matrix of type '<class 'numpy.int64'>'
	with 1079735 stored elements in Compressed Sparse Row format>

In [None]:
cv=CountVectorizer(max_df=0.85,stop_words=stopwords,max_features=10000)
word_count_vector=cv.fit_transform(docs)
word_count_vector.shape

(20000, 10000)

In [None]:
list(cv.vocabulary_.keys())[:10]

['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']

In [None]:

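# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
list(cv.get_feature_names_out())[2000:2015]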

['customization',
 'customize',
 'customized',
 'customlog',
 'customview',
 'cut',
 'cv',
 'cv_',
 'cval',
 'cvc',
 'cw',
 'cwd',
 'cx',
 'cx_oracle',
 'cxf']

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [None]:

tfidf_transformer.idf_

array([ 7.37717703,  9.80492526,  9.51724319, ...,  8.82409601,
       10.21039037,  9.51724319])
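
The values in `idf_` above are consistent with the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below recomputes it by hand; the document frequencies 1, 2, and 3 are assumptions chosen to match the printed values, and `smooth_idf` is an illustrative helper name.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf = ln((1 + n) / (1 + df)) + 1, the smooth_idf=True variant."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 20000  # number of training documents in this notebook

# Assumed document frequencies for very rare terms.
for df in (1, 2, 3):
    print(df, round(smooth_idf(n_docs, df), 8))
```

Under this formula, the value 10.21039037 corresponds to a term that appears in exactly one of the 20000 documents; terms that appear in more documents get a lower idf.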

In [None]:
uploaded = files.upload()

Saving stackoverflow-test.json to stackoverflow-test.json


In [None]:
# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] =df_test['text'].apply(lambda x:pre_process(x))

# get test docs into a list
docs_test=df_test['text'].tolist()
docs_title=df_test['title'].tolist()
docs_body=df_test['body'].tolist()

In [None]:
def sort_coo(coo_matrix):
    """sort (column index, score) pairs by descending score"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    #map each feature name to its rounded score, highest score first
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)
    
    return results

In [None]:
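# you only need to do this once
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
feature_names=cv.get_feature_names_out()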

# get the document that we want to extract keywords from
doc=docs_test[0]

#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)

# now print the results
print("\n=====Title=====")
print(docs_title[0])
print("\n=====Body=====")
print(docs_body[0])
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Title=====
Integrate War-Plugin for m2eclipse into Eclipse Project

=====Body=====
<p>I set up a small web project with JSF and Maven. Now I want to deploy on a Tomcat server. Is there a possibility to automate that like a button in Eclipse that automatically deploys the project to Tomcat?</p>

<p>I read about a the <a href="http://maven.apache.org/plugins/maven-war-plugin/" rel="nofollow noreferrer">Maven War Plugin</a> but I couldn't find a tutorial how to integrate that into my process (eclipse/m2eclipse).</p>

<p>Can you link me to help or try to explain it. Thanks.</p>

===Keywords===
eclipse 0.593
war 0.317
integrate 0.281
maven 0.273
tomcat 0.27
project 0.239
plugin 0.214
automate 0.157
jsf 0.152
possibility 0.146


In [None]:
# put the common code into reusable functions
def get_keywords(idx):

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    
    return keywords

def print_results(idx,keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k,keywords[k])

In [None]:
idx=120
keywords=get_keywords(idx)
print_results(idx,keywords)


=====Title=====
SQL Import Wizard - Error

=====Body=====
<p>I have a CSV file that I'm trying to import into SQL Management Server Studio.</p>

<p>In Excel, the column giving me trouble looks like this:
<a href="https://i.stack.imgur.com/pm0uS.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pm0uS.png" alt="enter image description here"></a></p>

<p>Tasks > import data > Flat Source File > select file</p>

<p><a href="https://i.stack.imgur.com/G4b6I.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/G4b6I.png" alt="enter image description here"></a></p>

<p>I set the data type for this column to DT_NUMERIC, adjust the DataScale to 2 in order to get 2 decimal places, but when I click over to Preview, I see that it's clearly not recognizing the numbers appropriately:</p>

<p><a href="https://i.stack.imgur.com/NZhiQ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/NZhiQ.png" alt="enter image description here"></a></p>

<p>The column ma