# Problem Statement:
Once you understand the syntax of your text document, extract features from it using
methods such as bag-of-words, CountVectorizer, and Term Frequency–Inverse Document
Frequency (TF-IDF).

# Assignment for Bag-of-words
1. Perform the following tasks using bag-of-words in Python:
a) Create a function named ‘tokenized_text()’ that takes ‘sentence’ as its argument, performs
word tokenization, and removes all stopwords
b) Create a function named ‘sorted_token()’ that takes ‘sentence’ as its argument, removes
the duplicate word tokens, and returns a sorted list of word tokens
c) Create a function named ‘bag_of_word()’ that takes ‘sentence’ and ‘word’ as its arguments,
calculates the frequency of each word token, and returns the counts as a NumPy array
d) Create a bag-of-words model on the following sentences using the three functions defined
above:
   - Joe went to the store
   - Joe wants to buy a dining set
   - Joe met John at the store
   - Joe and John are best friends
e) Convert the sentences into vectors using bag-of-words





In [78]:
import nltk
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

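def tokenized_text(sentence):
    """Tokenize a sentence into lowercase words and drop English stopwords."""
    # Requires NLTK's tokenizer models and 'stopwords' corpus, installed via nltk.download().
    words = word_tokenize(sentence)
    # Lowercase before the stopword check so capitalized stopwords ("The") are removed too.
    cleaned_text = [w.lower() for w in words if w.lower() not in stop_words]
    return cleaned_text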

In [79]:
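def sorted_token(sentences):
    """Return the sorted list of unique word tokens across all sentences."""
    words = []
    for sentence in sentences:
        # Tokenize with the function defined above and collect every token.
        w = tokenized_text(sentence)
        words.extend(w)
    # set() removes duplicates; sorted() returns an alphabetically ordered list.
    words = sorted(set(words))
    return words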

In [80]:
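def bag_of_words(sentence, words):
    """Return a NumPy array counting how often each vocabulary word occurs in the sentence."""
    sentence_words = tokenized_text(sentence)
    bag = np.zeros(len(words))  # one slot per vocabulary word
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1
    return bag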

In [81]:
corpus = ["Joe went to the store",
          "Joe wants to buy a dining set",
          "Joe met John at the store",
          "Joe and John are best friends"]

In [82]:
vocabulary = sorted_token(corpus)
print(vocabulary)

['best', 'buy', 'dining', 'friends', 'joe', 'john', 'met', 'set', 'store', 'wants', 'went']


In [83]:
bag_of_words("Joe went to the store", vocabulary)

array([0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1.])
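
Task (e) asks for every sentence to be converted, not only the first one. A minimal, self-contained sketch of that step follows. To avoid depending on NLTK data downloads it uses `str.split()` and a small hand-picked `STOP_WORDS` set; both are stand-in assumptions, not the `tokenized_text()` function defined above, although they should give the same tokens for this particular corpus.

```python
from collections import Counter

# Assumption: a tiny hand-picked stop-word set standing in for NLTK's English list.
STOP_WORDS = {"to", "the", "a", "at", "and", "are"}

corpus = ["Joe went to the store",
          "Joe wants to buy a dining set",
          "Joe met John at the store",
          "Joe and John are best friends"]

def simple_tokens(sentence):
    """Lowercase, split on whitespace, and drop stop words (stand-in tokenizer)."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

# Sorted vocabulary of unique tokens over the whole corpus.
vocabulary = sorted({w for s in corpus for w in simple_tokens(s)})

def to_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(simple_tokens(sentence))
    return [counts[word] for word in vocabulary]

# One count vector per sentence, in corpus order.
bow_matrix = [to_vector(s, vocabulary) for s in corpus]
for sentence, row in zip(corpus, bow_matrix):
    print(row, sentence)
```

Looking each token up in a `Counter` avoids the nested loop over the vocabulary, which matters little for four sentences but scales better for larger corpora.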

# Assignment for CountVectorizer
1. Perform the following tasks using CountVectorizer:
a) Import ‘CountVectorizer’ from ‘sklearn.feature_extraction.text’
b) Create a NumPy array named ‘corpus’ that contains the following sentences:
   - Joe went to the store
   - Joe wants to buy a dining set
   - Joe met John at the store
   - Joe and John are best friends
c) Use the ‘CountVectorizer’ class object to fit and transform the text present in ‘corpus’ and
store the result in ‘bag_of_words’
d) Print ‘bag_of_words’ as a numpy array
e) Print all feature names of the above-created ‘CountVectorizer’ object

In [38]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [41]:
corpus1 = np.array(["Joe went to the store",
          "Joe wants to buy a dining set",
          "Joe met John at the store",
          "Joe and John are best friends"])

In [42]:
count = CountVectorizer()
bag_of_words = count.fit_transform(corpus1)
bag_of_words.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1],
       [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [43]:
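# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available since 1.0) replaces it. tolist() keeps the result a plain Python list.
feature_names = count.get_feature_names_out().tolist()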

In [45]:
feature_names

['and',
 'are',
 'at',
 'best',
 'buy',
 'dining',
 'friends',
 'joe',
 'john',
 'met',
 'set',
 'store',
 'the',
 'to',
 'wants',
 'went']
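
These feature names differ from the hand-built vocabulary of the bag-of-words assignment. By default `CountVectorizer` lowercases the text, does not remove stop words (so 'the' and 'to' stay), and tokenizes with the pattern `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters (so 'a' is dropped). The sketch below mimics that default behaviour with the standard library only; `count_vectorize` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_vectorize(docs):
    """Hypothetical helper: lowercase, tokenize, build a sorted vocabulary, count terms."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    features = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[f] for f in features])
    return features, rows

docs = ["Joe went to the store",
        "Joe wants to buy a dining set",
        "Joe met John at the store",
        "Joe and John are best friends"]

features, rows = count_vectorize(docs)
print(features)
for row in rows:
    print(row)
```

Passing `stop_words="english"` to `CountVectorizer` would remove common words as well, which brings its vocabulary closer to the hand-built one.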

In [46]:
pd.DataFrame(bag_of_words.toarray(),columns=feature_names)

   and  are  at  best  buy  dining  friends  joe  john  met  set  store  the  to  wants  went
0    0    0   0     0    0       0        0    1     0    0    0      1    1   1      0     1
1    0    0   0     0    1       1        0    1     0    0    1      0    0   1      1     0
2    0    0   1     0    0       0        0    1     1    1    0      1    1   0      0     0
3    1    1   0     1    0       0        1    1     1    0    0      0    0   0      0     0


# Assignment for Term Frequency–Inverse Document Frequency
1. Do the following operations:
a) Import ‘TfidfVectorizer’ from ‘sklearn.feature_extraction.text’
b) Create a NumPy array named ‘corpus’ that contains the following sentences:
   - Joe went to the store
   - Joe wants to buy a dining set
   - Joe met John at the store
   - Joe and John are best friends
c) Use the ‘TfidfVectorizer’ class to create an object to fit and transform the text present in
‘corpus’ created above and store the result in ‘bag_of_words’
g) Print ‘bag_of_words’ as a numpy array
h) Print all feature names of the above-created ‘TfidfVectorizer’ object
i) Print the ‘bag_of_words’ array as a pandas data frame with column names as feature names

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [49]:
corpus2 = np.array(["Joe went to the store",
          "Joe wants to buy a dining set",
          "Joe met John at the store",
          "Joe and John are best friends"])

In [50]:
vectorizer = TfidfVectorizer(use_idf=True)

In [60]:
bag_of_words = vectorizer.fit_transform(corpus2)

In [65]:
bag_of_words.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.29462843, 0.        , 0.        ,
        0.        , 0.44513219, 0.44513219, 0.44513219, 0.        ,
        0.56459374],
       [0.        , 0.        , 0.        , 0.        , 0.45203489,
        0.45203489, 0.        , 0.23589056, 0.        , 0.        ,
        0.45203489, 0.        , 0.        , 0.3563895 , 0.45203489,
        0.        ],
       [0.        , 0.        , 0.49164562, 0.        , 0.        ,
        0.        , 0.        , 0.25656108, 0.38761905, 0.49164562,
        0.        , 0.38761905, 0.38761905, 0.        , 0.        ,
        0.        ],
       [0.45203489, 0.45203489, 0.        , 0.45203489, 0.        ,
        0.        , 0.45203489, 0.23589056, 0.3563895 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])
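
These weights can be reproduced by hand. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and scales each row to unit Euclidean length. The sketch below applies that formula using only the standard library; the helper names `idf` and `tfidf_row` and the pre-tokenized documents are assumptions made for illustration.

```python
import math
from collections import Counter

# Documents tokenized the way TfidfVectorizer would by default
# (lowercased, single-character tokens such as "a" dropped).
docs = [["joe", "went", "to", "the", "store"],
        ["joe", "wants", "to", "buy", "dining", "set"],
        ["joe", "met", "john", "at", "the", "store"],
        ["joe", "and", "john", "are", "best", "friends"]]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    """Hypothetical helper: tf * idf per term, then L2-normalize the row."""
    tf = Counter(doc)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Print the non-zero weights of the first sentence, sorted by term.
for term, weight in sorted(tfidf_row(docs[0]).items()):
    print(f"{term:6s} {weight:.8f}")
```

'joe' occurs in all four sentences, so its idf is exactly 1 and it receives the smallest weight in every row; 'went' occurs in only one sentence and receives the largest weight in the first row.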

In [66]:
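# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available since 1.0) replaces it. tolist() keeps the result a plain Python list.
feature_names = vectorizer.get_feature_names_out().tolist()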
feature_names

['and',
 'are',
 'at',
 'best',
 'buy',
 'dining',
 'friends',
 'joe',
 'john',
 'met',
 'set',
 'store',
 'the',
 'to',
 'wants',
 'went']

In [67]:
dataframe1 = pd.DataFrame(bag_of_words.toarray(), columns =feature_names)
dataframe1

        and       are        at      best       buy    dining   friends       joe      john       met       set     store       the        to     wants      went
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.294628  0.000000  0.000000  0.000000  0.445132  0.445132  0.445132  0.000000  0.564594
1  0.000000  0.000000  0.000000  0.000000  0.452035  0.452035  0.000000  0.235891  0.000000  0.000000  0.452035  0.000000  0.000000  0.356389  0.452035  0.000000
2  0.000000  0.000000  0.491646  0.000000  0.000000  0.000000  0.000000  0.256561  0.387619  0.491646  0.000000  0.387619  0.387619  0.000000  0.000000  0.000000
3  0.452035  0.452035  0.000000  0.452035  0.000000  0.000000  0.452035  0.235891  0.356389  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000


# END OF ASSIGNMENT