# Feature Extraction: Bag of Words

In [6]:
import pandas as pd

In [2]:
# Examples for the lesson
sentence1 = "Word Embeddings convert a Word into a number"
sentence2 = "Computers do not like seeing a Word instead of a number"


In [3]:
dictionary = ['Word', 'Embeddings', 'convert', 'a', 'into', 'number', 
              'Computers', 'do', 'not', 'like', 'seeing', 'instead', 'of']
print('Sentence 1: ' + sentence1)
print('Sentence 2: ' + sentence2)
print('Dictionary: ', dictionary)

Sentence 1: Word Embeddings convert a Word into a number
Sentence 2: Computers do not like seeing a Word instead of a number
Dictionary:  ['Word', 'Embeddings', 'convert', 'a', 'into', 'number', 'Computers', 'do', 'not', 'like', 'seeing', 'instead', 'of']


##### Vectors
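A vector in this case is a binary presence vector over the dictionary: 1 at each position whose word appears in the sentence and 0 everywhere else. A one-hot vector encodes a single word, so a sentence vector is the element-wise OR of the one-hot vectors of its words.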

In [4]:
S1_vector = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
S2_vector = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print('Sentence 1 Word Embedding Vector: ', S1_vector)
print('Sentence 2 Word Embedding Vector: ', S2_vector)

Sentence 1 Word Embedding Vector:  [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Sentence 2 Word Embedding Vector:  [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
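
The vectors above were written by hand. As a minimal sketch, the same presence vectors can be derived from the sentences, assuming plain whitespace tokenization and case-sensitive matching. The helper name `presence_vector` is illustrative and not part of the lesson.

```python
# Hypothetical helper: build a binary presence vector over a fixed dictionary.
def presence_vector(sentence, dictionary):
    # Whitespace tokenization; matching is case-sensitive, like the dictionary itself.
    tokens = set(sentence.split())
    return [1 if word in tokens else 0 for word in dictionary]

sentence1 = "Word Embeddings convert a Word into a number"
sentence2 = "Computers do not like seeing a Word instead of a number"
dictionary = ['Word', 'Embeddings', 'convert', 'a', 'into', 'number',
              'Computers', 'do', 'not', 'like', 'seeing', 'instead', 'of']

S1_vector = presence_vector(sentence1, dictionary)
S2_vector = presence_vector(sentence2, dictionary)
print(S1_vector)
print(S2_vector)
```

Because the tokens are collected into a set, a repeated word such as "Word" in the first sentence still contributes a single 1.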


##### Putting it all together

In [7]:
df = pd.DataFrame({'Sentence1' : S1_vector,
              'Sentence2' : S2_vector
             }, columns = ['Sentence1', 'Sentence2']).transpose()
df.columns = dictionary

df

Unnamed: 0,Word,Embeddings,convert,a,into,number,Computers,do,not,like,seeing,instead,of
Sentence1,1,1,1,1,1,1,0,0,0,0,0,0,0
Sentence2,1,0,0,1,0,1,1,1,1,1,1,1,1


**Goal:** Take in a corpus of documents and return a dataframe in which each row is a document and each column is one word from the dictionary of all distinct words in the corpus (built after normalizing the text). Each value in this matrix (stored as a dataframe) is the number of times that word appears in that document.

In [8]:
print('Sentence 1: ' + sentence1)
print('Sentence 2: ' + sentence2)
print('Dictionary: ', dictionary)

Sentence 1: Word Embeddings convert a Word into a number
Sentence 2: Computers do not like seeing a Word instead of a number
Dictionary:  ['Word', 'Embeddings', 'convert', 'a', 'into', 'number', 'Computers', 'do', 'not', 'like', 'seeing', 'instead', 'of']


- Each document (sentence) gets a vector holding the number of instances of each dictionary word in that document.
- We combine these vectors to create a count matrix.
- Each entry in the count matrix can be either the frequency (the number of times the word appears in the document) or the presence (whether the word appears in the document at all). The frequency method is generally preferred.
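
The goal and the frequency-versus-presence choice can be sketched as one small function. This is a minimal illustration, not the lesson's implementation: the name `bag_of_words`, its `binary` flag, and the normalization (lowercasing plus a `\w+` regex tokenizer) are all assumptions made for the example.

```python
import re
from collections import Counter

import pandas as pd


def bag_of_words(corpus, binary=False):
    """Hypothetical helper: return a document-by-word DataFrame for a corpus."""
    # Normalize: lowercase, then keep runs of letters, digits, and underscores as tokens.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]
    # Dictionary: every distinct token in the corpus, sorted for a stable column order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if binary:
            # Presence: 1 if the word occurs at all, else 0.
            rows.append([int(word in counts) for word in vocabulary])
        else:
            # Frequency: number of occurrences (Counter returns 0 for absent words).
            rows.append([counts[word] for word in vocabulary])
    return pd.DataFrame(rows, columns=vocabulary)


corpus = ["Word Embeddings convert a Word into a number",
          "Computers do not like seeing a Word instead of a number"]
counts_df = bag_of_words(corpus)
presence_df = bag_of_words(corpus, binary=True)
print(counts_df)
```

Note that lowercasing changes the dictionary relative to the hand-written one: the columns become `word`, `embeddings`, and so on, in sorted order.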

In [9]:
S1_bow = [2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0]
S2_bow = [1, 0, 0, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1]

df = pd.DataFrame({'Sentence1' : S1_bow,
              'Sentence2' : S2_bow
             }, columns = ['Sentence1', 'Sentence2']).transpose()
df.columns = dictionary

df

Unnamed: 0,Word,Embeddings,convert,a,into,number,Computers,do,not,like,seeing,instead,of
Sentence1,2,1,1,2,1,1,0,0,0,0,0,0,0
Sentence2,1,0,0,2,0,1,1,1,1,1,1,1,1


In [23]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os

#### a. prepare environment

In [11]:
import pandas as pd
import numpy as np
import spacy
from requests import get
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore")

#### b. wrangle data (acquire and parse)

## Example

In [12]:
# Assign the address of the web page to a variable named url.
url = 'https://inshorts.com/en/read/business'

# Request the content of the web page from the server with get(), and store the server's response in the variable response.
response = get(url)

# Use BeautifulSoup to parse the html into a variable ('soup')
soup = BeautifulSoup(response.content, 'html.parser')

# Identify the key tags you need to extract the data you are looking for
news_category = 'Business'
news_data = []
# using list comprehension to create a list of the data desired
news_articles = [{'news_headline': headline.find('span', attrs={"itemprop": "headline"}).string,
                  'news_article': article.find('div', attrs={"itemprop": "articleBody"}).string,
                  'news_category': news_category}
                 for headline, article in 
                 zip(soup.find_all('div', class_=["news-card-title news-right-box"]),
                     soup.find_all('div', class_=["news-card-content news-right-box"]))
                ]
news_data.extend(news_articles)

# Create a dataframe of the data desired
news_df =  pd.DataFrame(news_data)

# combine headline and article text and create a corpus of documents
news_df['full_text'] = news_df["news_headline"].map(str)+ '. ' + news_df["news_article"]

# Create a corpus of the column with the text you want to analyze.
corp = news_df['full_text']

0     I've been an idiot for not buying: Buffett's f...
1     I'd be disgusted if someone celebrated $1tn m-...
2     Billionaire Elon Musk personally owes $507 mil...
3     Air India warns employees not to speak to medi...
4     PepsiCo withdraws its lawsuit against 4 Gujara...
5     I’m only 28, can take time to build my busines...
6     Anil Ambani's wealth falls 60% in 2 months to ...
7     Brexit is worse than Y2K for customers, says I...
8     Delhi HC asks J&J to pay ₹25 lakh each to 4 hi...
9     India secures additional oil supplies as Iran ...
10    India again postpones retaliatory tariffs agai...
11    Gujarat farmers seek compensation from PepsiCo...
12    India's April jobless rate at 7.6%, highest si...
13    JSW Group enters paints business, Parth Jindal...
14    Qualcomm expects at least $4.5 billion from Ap...
15    Ex-Alibaba exec Bhushan Patil quits as Paytm P...
16    Standard Life to sell 1.78% stake in HDFC Life...
17    Jet Airways down 20% on report bidders not

##### Leave out the last document to test the vectorizer

In [19]:
# extract the bag of words
# hold out the last document so we can later run transform() on it as a new, unseen document
orig_docs = corp[:-1]
print(orig_docs.shape)
new_doc = corp[-1:]
print(new_doc.shape)

(24,)
(1,)


##### count vectorizer:  Convert a collection of text documents to a matrix of token counts

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
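# create the vectorizer object; min_df=1 keeps every term and
# ngram_range=(1, 1) counts single words only (both are the defaults)
vectorizer = CountVectorizer(min_df=1, ngram_range=(1, 1))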

# fit and transform on the original docs
features = vectorizer.fit_transform(orig_docs)    

# convert the sparse matrix to a dense NumPy array
features = features.toarray()

# look at the vector for each of the first 10 documents
print(features[:10])

[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 3 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


##### get feature names, or list of unique words

In [21]:
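# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out().tolist()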

# look at the first 10 feature names (words in alphabetical order)
print(feature_names[:10])

['000', '04', '10', '120', '13', '14', '16', '1tn', '20', '2016']


##### create a dataframe out of the matrix and feature names

In [22]:
df = pd.DataFrame(data=features,
                  columns=feature_names)
df.head()

Unnamed: 0,000,04,10,120,13,14,16,1tn,20,2016,...,withheld,within,without,world,worse,would,writing,y2k,year,years
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We now have a dataframe we can use for modeling tasks such as sentiment analysis and topic modeling.