# Feature Extraction and Bag of Words

### Creating Bag of Words from Text Files Using Python:

In this section we'll use basic Python to build a rudimentary NLP system. We'll build a corpus of two small text files, create a vocabulary from all the words in both documents, and then use the Bag of Words technique to extract features from each document.

#### 1) Reading Whole Files:

In [10]:
# Reading One.txt
with open("One.txt") as file_one:
    print(file_one.read())

This is a story about dogs
our canine pets
Dogs are furry animals



In [12]:
# Reading Two.txt
with open("Two.txt") as file_two:
    print(file_two.read())

This story is about surfing
Catching waves is fun
Surfing is a popular water sport



#### 2) Reading Entire File as a Single String:

In [13]:
with open("One.txt") as file_one:
    text1 = file_one.read()

In [14]:
text1

'This is a story about dogs\nour canine pets\nDogs are furry animals\n'

In [15]:
with open("Two.txt") as file_two:
    text2 = file_two.read()

In [16]:
text2

'This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport\n'

#### 3) Reading Each Line of File as an Element of a List:

In [17]:
with open("One.txt") as file_one:
    lines1 = file_one.readlines()

In [18]:
lines1

['This is a story about dogs\n',
 'our canine pets\n',
 'Dogs are furry animals\n']

In [19]:
with open("Two.txt") as file_two:
    lines2 = file_two.readlines()

In [20]:
lines2

['This story is about surfing\n',
 'Catching waves is fun\n',
 'Surfing is a popular water sport\n']

#### 4) Reading Words from File Separately:

In [25]:
with open("One.txt") as one:
    words1 = one.read().lower().split()

In [26]:
words1

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [27]:
with open("Two.txt") as two:
    words2 = two.read().lower().split()

In [28]:
words2

['this',
 'story',
 'is',
 'about',
 'surfing',
 'catching',
 'waves',
 'is',
 'fun',
 'surfing',
 'is',
 'a',
 'popular',
 'water',
 'sport']
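
Note that `split()` separates on whitespace only, which works here because the sample files contain no punctuation. For text that does, one possible alternative is a regular expression that keeps runs of letters, digits, and apostrophes. This is only a sketch: the sample sentence and the `tokenize` helper are hypothetical and are not part of One.txt or Two.txt.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "Dogs are furry animals. Surfing is fun!"  # hypothetical sentence

# Plain split() leaves punctuation attached to the neighbouring word.
print(sample.lower().split())

# The regex-based helper drops the punctuation.
print(tokenize(sample))
```

With `split()`, `'animals.'` and `'animals'` would be counted as two different vocabulary entries, which is why some form of punctuation handling is usually wanted on real text.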

#### Building a vocabulary (Creating a "Bag of Words")

Let's create a dictionary that maps each unique word in the documents to its own index. We can think of this as laying out every word available across all (here, both) documents.

In [29]:
with open("One.txt") as one:
    words1 = one.read().lower().split()

In [30]:
words1

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [31]:
len(words1)

13

In [32]:
unique_words_1 = set(words1)

In [33]:
unique_words_1

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [34]:
with open("Two.txt") as two:
    words2 = two.read().lower().split()

In [35]:
words2

['this',
 'story',
 'is',
 'about',
 'surfing',
 'catching',
 'waves',
 'is',
 'fun',
 'surfing',
 'is',
 'a',
 'popular',
 'water',
 'sport']

In [36]:
len(words2)

15

In [37]:
unique_words_2 = set(words2)

In [38]:
unique_words_2

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [40]:
# Getting all Unique Words from Both Documents in One Set:

all_unique_words = set()

all_unique_words.update(unique_words_1)
all_unique_words.update(unique_words_2)

In [41]:
all_unique_words

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [42]:
# Creating a Dictionary That Maps Each Unique Word to an Index:
# (sets are unordered, so the index each word receives can differ between runs)

dict_words = dict()

for index, word in enumerate(all_unique_words):
    dict_words[word] = index

In [43]:
dict_words

{'story': 0,
 'sport': 1,
 'about': 2,
 'water': 3,
 'a': 4,
 'pets': 5,
 'furry': 6,
 'this': 7,
 'catching': 8,
 'animals': 9,
 'our': 10,
 'canine': 11,
 'is': 12,
 'are': 13,
 'waves': 14,
 'fun': 15,
 'popular': 16,
 'surfing': 17,
 'dogs': 18}
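
Because `all_unique_words` is a set, the order in which Python iterates over it is not guaranteed, so the indices shown above may differ on another run. A minimal sketch, assuming a reproducible column order is wanted, sorts the vocabulary before numbering it. The names `sorted_vocab` and `word_to_index` are illustrative, and the two strings repeat the contents of One.txt and Two.txt so the snippet runs without the files.

```python
# Contents of One.txt and Two.txt, copied inline so the sketch is self-contained.
text1 = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
text2 = "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport\n"

# Union of the unique lowercase words from both documents.
vocabulary = set(text1.lower().split()) | set(text2.lower().split())

# Sorting fixes the order, so every run assigns the same index to each word.
sorted_vocab = sorted(vocabulary)
word_to_index = {word: index for index, word in enumerate(sorted_vocab)}

print(len(word_to_index))
print(word_to_index["a"], word_to_index["dogs"], word_to_index["waves"])
```

The rest of the walkthrough works the same way with either mapping; only the column order of the final table changes.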

#### Bag Of Words:

1) We will create a matrix or DataFrame whose columns are all the unique words from all the documents.

2) Each row will represent a single document.

3) Each cell of the matrix or DataFrame will show how many times a particular word (column) appears in a particular document (row).
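
The three steps above can be sketched compactly with `collections.Counter` from the standard library. This is an illustrative variant under stated assumptions, not the approach used in the cells that follow: the document texts are copied inline, the columns are sorted alphabetically for a stable order, and the names `documents`, `tokenized`, `columns`, and `rows` are made up for this example.

```python
from collections import Counter

# Contents of One.txt and Two.txt, copied inline so the sketch is self-contained.
documents = [
    "This is a story about dogs\nour canine pets\nDogs are furry animals\n",
    "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport\n",
]

# Tokenize each document: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# Columns: every unique word across all documents, sorted for a stable order.
columns = sorted(set(word for words in tokenized for word in words))

# Rows: one count vector per document; a Counter returns 0 for absent words.
rows = []
for words in tokenized:
    counts = Counter(words)
    rows.append([counts[word] for word in columns])

print(columns)
for row in rows:
    print(row)
```

Each row sums to the number of words in its document (13 and 15), which is a quick sanity check on the counts.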

In [44]:
# Creating Zero-Filled Lists for Both Documents, Each the Same Length as the Dictionary of Unique Words:

word_freq_1 = [0] * len(dict_words)
word_freq_2 = [0] * len(dict_words)

In [45]:
word_freq_1

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [46]:
word_freq_2

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [47]:
# Adding Frequency of Each Word Per Document in List:

with open("One.txt") as x:
    word1 = x.read().lower().split()

In [48]:
for word in word1:
    
    index = dict_words[word]
    word_freq_1[index] = word_freq_1[index] + 1

In [49]:
word_freq_1

[1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2]

In [50]:
with open("Two.txt") as x:
    word2 = x.read().lower().split()

In [51]:
for word in word2:
    
    index = dict_words[word]
    word_freq_2[index] = word_freq_2[index] + 1

In [52]:
word_freq_2

[1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 3, 0, 1, 1, 1, 2, 0]

In [53]:
import pandas as pd

In [54]:
bag_of_words = pd.DataFrame(data=[word_freq_1, word_freq_2], columns=dict_words.keys())

In [55]:
bag_of_words

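|   | story | sport | about | water | a | pets | furry | this | catching | animals | our | canine | is | are | waves | fun | popular | surfing | dogs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 1 | 1 | 2 | 0 |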
