### Convert text to vectors using Bag-of-Words (BOW)

In [1]:
import nltk

In [3]:
## Create a corpus
speech = '''Fourscore and seven years ago our fathers brought forth, on this continent, a new nation, conceived 
in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great 
civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are 
met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final 
resting-place for those who here gave their lives, that that nation might live. It is altogether fitting and 
proper that we should do this. But, in a larger sense, we cannot dedicate, we cannot consecrate—we cannot 
hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it far above our poor 
power to add or detract. The world will little note, nor long remember what we say here, but it can never 
forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which 
they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great 
task remaining before us—that from these honored dead we take increased devotion to that cause for which they 
here gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in 
vain—that this nation, under God, shall have a new birth of freedom, and that government of the people, by the 
people, for the people, shall not perish from the earth.'''

print(speech)

Fourscore and seven years ago our fathers brought forth, on this continent, a new nation, conceived 
in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great 
civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are 
met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final 
resting-place for those who here gave their lives, that that nation might live. It is altogether fitting and 
proper that we should do this. But, in a larger sense, we cannot dedicate, we cannot consecrate—we cannot 
hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it far above our poor 
power to add or detract. The world will little note, nor long remember what we say here, but it can never 
forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which 
they who fought here have thus

#### Cleaning the Text

- Removing stopwords
- Removing punctuation
- Stemming and lemmatization
- Lowercasing the sentences
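
The cleaning steps listed above can be sketched in plain Python before bringing in NLTK. This is only a minimal illustration: `STOP_WORDS` is a tiny hypothetical stop-word set and `clean_sentence` is a made-up helper name, neither of which comes from the notebook. Stemming and lemmatization are left out here because they need the NLTK objects created in the next cells.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"a", "and", "are", "in", "of", "on", "that", "to", "we"}

def clean_sentence(sentence: str) -> str:
    """Remove punctuation, lowercase the text, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)  # punctuation -> spaces
    tokens = letters_only.lower().split()               # lowercase, then split on whitespace
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)

cleaned = clean_sentence("We are met on a great battle-field of that war.")
print(cleaned)  # met great battle field war
```

Replacing every non-letter with a space (rather than deleting it) is what turns `battle-field` into the two tokens `battle` and `field`.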

In [4]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [5]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [6]:
sentences = nltk.sent_tokenize(speech)
sentences

['Fourscore and seven years ago our fathers brought forth, on this continent, a new nation, conceived \nin liberty, and dedicated to the proposition that all men are created equal.',
 'Now we are engaged in a great \ncivil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure.',
 'We are \nmet on a great battle-field of that war.',
 'We have come to dedicate a portion of that field, as a final \nresting-place for those who here gave their lives, that that nation might live.',
 'It is altogether fitting and \nproper that we should do this.',
 'But, in a larger sense, we cannot dedicate, we cannot consecrate—we cannot \nhallow—this ground.',
 'The brave men, living and dead, who struggled here, have consecrated it far above our poor \npower to add or detract.',
 'The world will little note, nor long remember what we say here, but it can never \nforget what they did here.',
 'It is for us the living, rather, to be dedicated here to the unfinished 

In [7]:
### PERFORM STEMMING
stop_words = set(stopwords.words('english'))     ### Build the stop-word set once instead of once per word
stem_corpus = []
for sentence in sentences:
    only_alphas = re.sub('[^a-zA-Z]', ' ', sentence)     ### Keep only alphabetic characters
    lower_words = only_alphas.lower()     ### Convert the remaining text to lowercase
    split_words = lower_words.split()     ### Split the lowercased text into a list of words
    stem_words = [stemmer.stem(word) for word in split_words if word not in stop_words]     ### Drop stop words, stem the rest
    final_text = ' '.join(stem_words)
    stem_corpus.append(final_text)

stem_corpus

['fourscor seven year ago father brought forth contin new nation conceiv liberti dedic proposit men creat equal',
 'engag great civil war test whether nation nation conceiv dedic long endur',
 'met great battl field war',
 'come dedic portion field final rest place gave live nation might live',
 'altogeth fit proper',
 'larger sens cannot dedic cannot consecr cannot hallow ground',
 'brave men live dead struggl consecr far poor power add detract',
 'world littl note long rememb say never forget',
 'us live rather dedic unfinish work fought thu far nobli advanc',
 'rather us dedic great task remain us honor dead take increas devot caus gave last full measur devot highli resolv dead shall die vain nation god shall new birth freedom govern peopl peopl peopl shall perish earth']

In [8]:
### PERFORM LEMMATIZATION
stop_words = set(stopwords.words('english'))     ### Build the stop-word set once instead of once per word
lemmat_corpus = []
for sentence in sentences:
    only_alphas = re.sub('[^a-zA-Z]', ' ', sentence)     ### Keep only alphabetic characters
    lower_words = only_alphas.lower()     ### Convert the remaining text to lowercase
    split_words = lower_words.split()     ### Split the lowercased text into a list of words
    lemma_words = [lemmatizer.lemmatize(word) for word in split_words if word not in stop_words]     ### Drop stop words, lemmatize the rest
    final_text = ' '.join(lemma_words)
    lemmat_corpus.append(final_text)

lemmat_corpus

['fourscore seven year ago father brought forth continent new nation conceived liberty dedicated proposition men created equal',
 'engaged great civil war testing whether nation nation conceived dedicated long endure',
 'met great battle field war',
 'come dedicate portion field final resting place gave life nation might live',
 'altogether fitting proper',
 'larger sense cannot dedicate cannot consecrate cannot hallow ground',
 'brave men living dead struggled consecrated far poor power add detract',
 'world little note long remember say never forget',
 'u living rather dedicated unfinished work fought thus far nobly advanced',
 'rather u dedicated great task remaining u honored dead take increased devotion cause gave last full measure devotion highly resolve dead shall died vain nation god shall new birth freedom government people people people shall perish earth']

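##### Here we can see that lemmatization keeps real dictionary words (for example, `liberty` instead of the stem `liberti`), so the sentences stay readable. It is not perfect (`us` is reduced to `u`), but we will go with lemmatization.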
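
To make the comparison concrete, the snippet below lines up the first cleaned sentence from the stemming output and the lemmatization output shown above and prints only the tokens that differ. The two strings are copied from those outputs; the variable names are illustrative and not part of the notebook.

```python
# First sentence of stem_corpus and lemmat_corpus, copied from the outputs above.
stemmed = ("fourscor seven year ago father brought forth contin new nation "
           "conceiv liberti dedic proposit men creat equal").split()
lemmatized = ("fourscore seven year ago father brought forth continent new nation "
              "conceived liberty dedicated proposition men created equal").split()

# Both versions have the same number of tokens, so they can be compared pairwise.
differences = [(s, l) for s, l in zip(stemmed, lemmatized) if s != l]

for stem, lemma in differences:
    print(f"{stem:<10} -> {lemma}")
```

Every token that differs is a truncated stem on the left and a readable word on the right, which is the reason for preferring the lemmatized corpus here.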

#### Creating the BOW Model

In [9]:
from sklearn.feature_extraction.text import CountVectorizer    ### CountVectorizer counts word frequencies (the histogram), orders the vocabulary, and finally gives us the count matrix.

In [10]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmat_corpus).toarray()     ### fit_transform returns a sparse matrix; toarray() converts it to a dense NumPy array
X

array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [13]:
import numpy as np
np.shape(X)

(10, 93)

> Here we have **10 sentences**, and each sentence is represented by a vector of length **93**, the number of unique words in the vocabulary.
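
To see where the shape `(10, 93)` comes from, here is a rough pure-Python sketch of the counting that a bag-of-words model performs. It is not scikit-learn's implementation, and the names `TOKEN_PATTERN`, `vocabulary`, `index`, and `bow` are illustrative. The one assumption borrowed from `CountVectorizer` is its default token pattern, which keeps only tokens of two or more word characters; that is why the single-letter token `u` does not get a column.

```python
import re

# Lemmatized corpus, copied from the lemmatization output above.
lemmat_corpus = [
    'fourscore seven year ago father brought forth continent new nation conceived liberty dedicated proposition men created equal',
    'engaged great civil war testing whether nation nation conceived dedicated long endure',
    'met great battle field war',
    'come dedicate portion field final resting place gave life nation might live',
    'altogether fitting proper',
    'larger sense cannot dedicate cannot consecrate cannot hallow ground',
    'brave men living dead struggled consecrated far poor power add detract',
    'world little note long remember say never forget',
    'u living rather dedicated unfinished work fought thus far nobly advanced',
    'rather u dedicated great task remaining u honored dead take increased devotion cause gave last full measure devotion highly resolve dead shall died vain nation god shall new birth freedom government people people people shall perish earth',
]

# Same idea as CountVectorizer's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in lemmat_corpus]

# The vocabulary is the sorted set of unique tokens; each term gets one column.
vocabulary = sorted({tok for doc in tokenized for tok in doc})
index = {term: col for col, term in enumerate(vocabulary)}

# One row per sentence, one count per vocabulary term.
bow = []
for doc in tokenized:
    row = [0] * len(vocabulary)
    for tok in doc:
        row[index[tok]] += 1
    bow.append(row)

print(len(bow), len(vocabulary))   # 10 93
print(bow[1][index["nation"]])     # 2 ("nation" appears twice in the second sentence)
```

The `2` in the second row of `X` above corresponds to the same repeated word, `nation`.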