In [1]:
# accessing the text file
with open('One.txt') as mytext:
    a = mytext.read()

In [2]:
a

'This is a story about dogs\nour canine pets\nDogs are furry animals\n'

In [3]:
print(a)

This is a story about dogs
our canine pets
Dogs are furry animals



In [5]:
with open('One.txt') as mytext:
    a = mytext.readlines() # readlines returns a list of lines

In [6]:
a # readlines returns a list of lines

['This is a story about dogs\n',
 'our canine pets\n',
 'Dogs are furry animals\n']

Keep in mind that for really large text files you may want to avoid displaying the output.

Reading a large text file is not a big deal for Python if you have enough RAM, but printing an entire novel into an output cell can take a very long time to render and may even crash Jupyter.
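One way to inspect a large file without rendering all of it is to read only the first few lines. The sketch below is illustrative: the file `big.txt` is a hypothetical stand-in created in a temporary directory, not one of the notebook's files.

```python
import os
import tempfile
from itertools import islice

# Hypothetical stand-in for a large text file (assumption, not from the notebook).
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "w") as f:
    for n in range(100_000):
        f.write(f"line {n}\n")

# A file object yields one line at a time, so islice stops after three lines
# without loading or printing the rest of the file.
with open(path) as f:
    preview = list(islice(f, 3))

print(preview)
```

Because the file is consumed lazily, the cost of the preview does not grow with the size of the file.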

In [7]:
# Reading in each file separately (a is still the list returned by readlines())
a.lower().split()

AttributeError: 'list' object has no attribute 'lower'
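The error occurs because `readlines()` returns a list, while `.lower()` is a string method. The notebook continues by re-reading the file with `read()`; as an alternative sketch, the list of lines can also be tokenized directly by applying the string methods to each line:

```python
# The same list that readlines() returned above.
lines = ['This is a story about dogs\n',
         'our canine pets\n',
         'Dogs are furry animals\n']

# String methods apply to each line, not to the list itself,
# so flatten the per-line token lists into a single list.
tokens = [word for line in lines for word in line.lower().split()]

print(tokens)
```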

In [8]:
with open('One.txt') as mytext:
    a = mytext.read()

In [9]:
a.lower().split()

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

This is exactly what we want for the bag-of-words model: a flat list of lowercase words.

In [10]:
with open('Two.txt') as mytext2:
    b = mytext2.read()

In [11]:
b.lower().split()

['this',
 'story',
 'is',
 'about',
 'surfing',
 'catching',
 'waves',
 'is',
 'fun',
 'surfing',
 'is',
 'a',
 'popular',
 'water',
 'sport']

So let's begin by building out a vocabulary.

* The idea of a vocabulary is extremely important in natural language processing.
------
1. Across all the documents available to us, only a limited vocabulary is actually in use.

2. For example, if you're dealing with something like the novel Moby Dick, only a certain number of distinct words are used throughout the entire novel.

3. We can think of that set of unique words as the vocabulary for that particular document or text.

4. We want to build a vocabulary for both One.txt and Two.txt. Some words are shared between One.txt and Two.txt, and some words are unique to each document.

This is where we want to move from simply counting words to comparing the frequency of particular words between the two documents.
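The split between shared and document-specific words can be made concrete with Python's set operators. This is only a sketch: the token lists are typed in from the outputs above rather than read from the files, and the variable names are illustrative.

```python
# Tokens copied from the a.lower().split() and b.lower().split() outputs above.
words_one = "this is a story about dogs our canine pets dogs are furry animals".split()
words_two = ("this story is about surfing catching waves is fun "
             "surfing is a popular water sport").split()

vocab_one, vocab_two = set(words_one), set(words_two)

shared = vocab_one & vocab_two      # words used in both documents
only_one = vocab_one - vocab_two    # words used only in One.txt
only_two = vocab_two - vocab_one    # words used only in Two.txt

print(sorted(shared))
print(sorted(only_one))
print(sorted(only_two))
```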

#### So let's begin the process by getting all the unique words across all the documents.

In [13]:
# Opening one.txt and getting all the unique words
with open('One.txt') as text1:
    words_1 = text1.read().lower().split()
    
    #unique words
    uni_words_1 = set(words_1)

uni_words_1

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

Above we can see the unique words from all the sentences in One.txt.

In [15]:
# Opening two.txt and  getting the unique words
with open('Two.txt') as text2:
    words_2 = text2.read().lower().split()
    
    #unique words
    uni_words_2 = set(words_2)
    
uni_words_2

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

Above we can see the unique words from all the sentences in Two.txt.

In [18]:
# Getting unique words across all the documents i.e. one.txt + two.txt
all_uni_words = set()
all_uni_words.update(uni_words_1)
all_uni_words # now has unique words from document 1

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [19]:
all_uni_words.update(uni_words_2)
all_uni_words # has all unique words from both one.txt and two.txt

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [20]:
# Assigning a number to each of these words
full_vocab = dict()
i = 0

for word in all_uni_words:
    full_vocab[word] = i
    i += 1

In [21]:
full_vocab

{'this': 0,
 'pets': 1,
 'story': 2,
 'are': 3,
 'animals': 4,
 'about': 5,
 'surfing': 6,
 'furry': 7,
 'a': 8,
 'is': 9,
 'water': 10,
 'waves': 11,
 'our': 12,
 'popular': 13,
 'fun': 14,
 'catching': 15,
 'canine': 16,
 'dogs': 17,
 'sport': 18}
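Python sets are unordered, so the numbering above can differ from one run to the next. If a reproducible mapping is wanted, one option (a variation, not what the cell above does) is to sort the words before numbering them. The name `full_vocab_sorted` is introduced here only for illustration.

```python
# The combined vocabulary from both documents, as shown above.
all_uni_words = {'a', 'about', 'animals', 'are', 'canine', 'catching', 'dogs',
                 'fun', 'furry', 'is', 'our', 'pets', 'popular', 'sport',
                 'story', 'surfing', 'this', 'water', 'waves'}

# Sorting first gives every word the same index on every run.
full_vocab_sorted = {word: i for i, word in enumerate(sorted(all_uni_words))}

print(full_vocab_sorted)
```

With this variant the indices are alphabetical, so they will not match the `full_vocab` output shown above.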

In [22]:
one_freq = [0]*len(full_vocab)
two_freq = [0]*len(full_vocab)
all_words = ['']*len(full_vocab)

In [23]:
one_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [26]:
two_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

These zero-filled lists are the starting frequency counts for One.txt and Two.txt, with one slot for every word in the vocabulary.

In [25]:
all_words

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Next, I'll go through each document, and every time I encounter a word, I'll look it up in full_vocab and note its index position, such as 0.

Here 0 stands for the word "this", the 0th word in full_vocab.

In [28]:
with open('One.txt') as f:
    one_t = f.read().lower().split()
    
one_t

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

So what is happening is:
1. First we loop through "one_t", which contains all the words from the sentences in One.txt.
2. For each word, we look up its index in "full_vocab", which holds all the unique words from both text files.
3. Finally, we add 1 to the "one_freq" list at that index, so each slot ends up holding the number of times its word appears in "one_t".

* Ex: if we find the word dogs in "one_t" and dogs is at index 17 in "full_vocab", we add 1 at index 17 of the "one_freq" list, and we continue the process for all the words.

In [29]:
for word in one_t:
    word_index = full_vocab[word]
    one_freq[word_index] += 1

In [30]:
one_freq

[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 2, 0]

In [32]:
# for two.txt
with open('Two.txt') as s:
    two_t = s.read().lower().split()
    
two_t

['this',
 'story',
 'is',
 'about',
 'surfing',
 'catching',
 'waves',
 'is',
 'fun',
 'surfing',
 'is',
 'a',
 'popular',
 'water',
 'sport']

In [33]:
for word in two_t:
    word_indexes = full_vocab[word]
    two_freq[word_indexes] += 1

In [34]:
two_freq

[1, 0, 1, 0, 0, 1, 2, 0, 1, 3, 1, 1, 0, 1, 1, 1, 0, 0, 1]
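As a cross-check on the loop above, the same vector can be produced with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's method; `full_vocab` and `two_t` are copied from the outputs above, and `two_freq_check` is a name introduced for illustration.

```python
from collections import Counter

# Word-to-index mapping copied from the full_vocab output above.
full_vocab = {'this': 0, 'pets': 1, 'story': 2, 'are': 3, 'animals': 4,
              'about': 5, 'surfing': 6, 'furry': 7, 'a': 8, 'is': 9,
              'water': 10, 'waves': 11, 'our': 12, 'popular': 13, 'fun': 14,
              'catching': 15, 'canine': 16, 'dogs': 17, 'sport': 18}

two_t = ['this', 'story', 'is', 'about', 'surfing', 'catching', 'waves', 'is',
         'fun', 'surfing', 'is', 'a', 'popular', 'water', 'sport']

counts = Counter(two_t)  # word -> number of occurrences in Two.txt

# Place each count at the index assigned by full_vocab;
# a Counter returns 0 for words that never occur.
two_freq_check = [0] * len(full_vocab)
for word, index in full_vocab.items():
    two_freq_check[index] = counts[word]

print(two_freq_check)
```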

And finally we'll fill out all_words with the unique words in the full_vocab dictionary.

In [45]:
for word in full_vocab:
    word_index = full_vocab[word]
    all_words[word_index] = word

In [46]:
all_words

['this',
 'pets',
 'story',
 'are',
 'animals',
 'about',
 'surfing',
 'furry',
 'a',
 'is',
 'water',
 'waves',
 'our',
 'popular',
 'fun',
 'catching',
 'canine',
 'dogs',
 'sport']

This list now holds all the unique words from the full_vocab dict, and its indexes line up with full_vocab, one_freq, and two_freq.

In [39]:
#### Finally we can build a nice dataframe to organise this info from all_words, one_freq and two_freq.

In [40]:
import pandas as pd

In [42]:
# BOW - bag of words
bow = pd.DataFrame(data=[one_freq, two_freq],columns=all_words)
bow

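|   | this | pets | story | are | animals | about | surfing | furry | a | is | water | waves | our | popular | fun | catching | canine | dogs | sport |
|---|------|------|-------|-----|---------|-------|---------|-------|---|----|-------|-------|-----|---------|-----|----------|--------|------|-------|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 3 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |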


### So what is this bow - bag of words
This is known as a bag-of-words model: a frequency count of all the words in the documents.

For visualization purposes, the DataFrame shows every word alongside its frequency counts.

The first row shows the number of times each word in all_words appears in One.txt, and the second row shows the same for Two.txt.

### So this is how Scikit-learn directly converts text into these frequency counts.

## Points:-
The more similar these counts are between two documents, the more similar the documents themselves are likely to be, i.e. they could be talking about the same thing.
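One common way to put a number on "how similar are these counts" is cosine similarity. The notebook does not prescribe a particular measure, so treat this as one possible choice. The function name `cosine_similarity` is defined here for illustration, and the vectors are copied from the outputs above.

```python
import math

# Count vectors copied from the one_freq and two_freq outputs above.
one_freq = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 2, 0]
two_freq = [1, 0, 1, 0, 0, 1, 2, 0, 1, 3, 1, 1, 0, 1, 1, 1, 0, 0, 1]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(one_freq, two_freq)
print(round(similarity, 3))
```

For count vectors the result lies between 0 and 1: a value of 1 means the two documents use words in identical proportions, and 0 means they share no words at all.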



### Before using Scikit-learn.

1. Right now we have this bag-of-words model, but raw counts by themselves may not be very helpful until we consider term frequency together with how often each individual word appears across the documents.

2. Suppose you're dealing with a bunch of documents that all belong to the same general category. For example, you're working only with text from the sports section of a newspaper and trying to classify articles as soccer, baseball, basketball, etc. Certain words, such as "run" or "running", may appear in all the documents, because you do that in all those sports.

3. So what we have to consider is not just whether a word is common in the English language, which is the idea behind stop words. Words such as "a", "the", and "and" are very common across the entire English language, and it can be a good idea to remove them from all documents.

4. The other thing we want to consider is, within the particular scope of our documents, whether a word shows up across all the documents or in just a few. For example, you won't see the word "soccer" across all sports documents, since it is specific to one sport and will show up only in soccer-related articles.

5. TF-IDF (term frequency-inverse document frequency) helps to overcome those issues.
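To make the idea concrete, here is a minimal sketch of one textbook TF-IDF variant, where tf is the word's count divided by the document length and idf is log(N / df). This is an assumption for illustration: Scikit-learn's `TfidfVectorizer` uses a smoothed formula with normalization, so its numbers will differ. The function name `tf_idf` and the token lists are illustrative, copied from the outputs above.

```python
import math

# Token lists copied from the earlier outputs.
docs = {
    "one": "this is a story about dogs our canine pets dogs are furry animals".split(),
    "two": ("this story is about surfing catching waves is fun "
            "surfing is a popular water sport").split(),
}

def tf_idf(docs):
    """Return {doc_name: {word: tf * idf}} using tf = count/len and idf = log(N/df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for tokens in docs.values():
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1
    scores = {}
    for name, tokens in docs.items():
        scores[name] = {
            word: (tokens.count(word) / len(tokens)) * math.log(n_docs / df[word])
            for word in set(tokens)
        }
    return scores

scores = tf_idf(docs)
print(scores["one"]["dogs"])  # appears only in document one, so it gets a positive weight
print(scores["one"]["is"])    # appears in both documents, so idf is log(1) = 0
```

Words that occur in every document, such as "is" or "this", get a weight of zero under this variant, while document-specific words such as "dogs" or "surfing" stand out.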