List comprehensions allow us to quickly generate lists using the logic of loops and iterations.

They are optimized to run very quickly in Python and they are easy to write concisely, but they can be hard to read and their syntax takes some getting used to.

A corpus is a body of text that will 'teach' our algorithm the language and style we want it to use.

In order to use algorithms with language, we must either make language simpler, so that the short mathematical algorithms we have explored so far can reliably work with it, or make our algorithms smarter, so that they can deal with the messy complexity of human language as it has developed naturally. We’ll do the latter.

# Space insertion

Suppose we are given only a digitized text that contains a few errors, and we do not have the paper record from which it was extracted.

How do we correct the mistakes?

In [1]:
text = "The oneperfectly divine thing, the oneglimpse of God's paradisegiven on earth, is to fight a losingbattle - and notlose it."

In the above example, the spellings are all correct, but spaces are missing between certain words.

# Defining a word list and finding words

The first thing we will do for our algorithm is teach it some English words.

In [2]:
word_list = ['The','one', 'perfectly', 'divine']

### In this chapter we will create and manipulate list comprehensions

In [3]:
word_list_copy = [word for word in word_list]

In [4]:
# this is a simple example which can be made more complex as follows

has_n = [word for word in word_list if 'n' in word]

In [5]:
has_n

['one', 'divine']
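
To make the syntax less mysterious, here is a minimal sketch that writes the same filter as an ordinary loop. The name has_n_loop is introduced only for this illustration.

```python
word_list = ['The', 'one', 'perfectly', 'divine']

# The comprehension from the cell above.
has_n = [word for word in word_list if 'n' in word]

# The same logic written as an explicit loop.
has_n_loop = []
for word in word_list:
    if 'n' in word:
        has_n_loop.append(word)

print(has_n_loop)  # ['one', 'divine']
```

Both versions build the same list; the comprehension simply folds the loop, the condition and the append into one expression.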

# We will be using re to access text manipulation tools

In [6]:
import re
locs = list(set([(m.start(), m.end()) for word in word_list for m in re.finditer(word, text)])) 

The locs variable will contain the locations in the text of every word in our word list. We use a list comprehension to get this list of locations.

We use for word in word_list to iterate over every word in word_list.

For each word we call re.finditer(), which searches our text for that word and returns an iterator of match objects, one for every location where the word occurs. We iterate over these matches, and each individual match is stored in m.
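
As a small sketch of what re.finditer() yields, the snippet below searches a made-up sample string (not the chapter's text). It also shows re.escape(), which is an optional precaution not used in the notebook: it forces characters such as + or . to be matched literally rather than as regular-expression syntax.

```python
import re

sample = "the oneglimpse of the one"   # hypothetical sample string

# finditer yields one match object per non-overlapping occurrence.
matches = [(m.start(), m.end()) for m in re.finditer('one', sample)]
print(matches)  # [(4, 7), (22, 25)]

# re.escape makes sure special characters are treated as plain text.
literal = [(m.start(), m.end()) for m in re.finditer(re.escape('a+b'), 'x a+b y')]
print(literal)  # [(2, 5)]
```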

From each match we take m.start() and m.end(), the locations in the text of the beginning and end of the word, respectively.

The whole list comprehension is enveloped by list(set()). This is a convenient way to get a list that contains only unique values with no duplicates. Our list comprehension alone might have multiple identical elements, but converting it to a set automatically removes duplicates, and then converting it back to a list puts it in the format we want: a list of unique word locations. We can just run print(locs) to see the result of the whole operation.

In [7]:
print(locs)

[(17, 23), (7, 16), (0, 3), (35, 38), (4, 7)]
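
The nested comprehension can be hard to read at first, so here is a sketch of the same logic unrolled into two ordinary loops, run on a shortened version of the text. The names found and locs_sorted are introduced only for this illustration, and the final sort is an addition that makes the output order predictable.

```python
import re

text = "The oneperfectly divine thing"
word_list = ['The', 'one', 'perfectly', 'divine']

found = set()                              # a set silently drops duplicate locations
for word in word_list:                     # outer loop: every word we know
    for m in re.finditer(word, text):      # inner loop: every occurrence of that word
        found.add((m.start(), m.end()))

locs_sorted = sorted(found)                # sorted only to make the output stable
print(locs_sorted)  # [(0, 3), (4, 7), (7, 16), (17, 23)]
```

A set has no guaranteed order, which is why the locations printed by list(set(...)) need not appear in text order.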


In Python, ordered pairs like these are called tuples, and these tuples show the location of each word from word_list in our text.

Some of these words end at the same index where another word begins. To find places where we need to insert spaces, we will look for cases like this: where the end of one valid word is at the same place as the beginning of another valid word.
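
As a rough sketch of that idea, using the locs output shown above, we can collect every word ending and then keep the word beginnings that coincide with one. The names word_ends and junctions are hypothetical helpers, not part of the notebook.

```python
locs = [(17, 23), (7, 16), (0, 3), (35, 38), (4, 7)]

# Every index at which some valid word ends.
word_ends = {end for _, end in locs}

# Indices where one valid word begins exactly where another ends.
junctions = sorted(start for start, _ in locs if start in word_ends)
print(junctions)  # [7]
```

Index 7 is where 'one' ends and 'perfectly' begins, so it is a candidate position for a missing space.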

# Dealing with compound words

What if there is a word like butterfly? Both butter and fly are valid words, but that does not make butterfly invalid. So we cannot simply split every pair of valid words that appear together; we also need to check whether the two words joined together already form a valid word.

In order to check this, we need to find all the spaces in our text. We can look at all the substrings between two consecutive spaces and call those potential words. If a potential word is not in our list, then we will conclude it is invalid and check whether it is made of two valid words joined together.

Now we will again use re.finditer() to find all the spaces in the text and store their locations in a variable called spacestarts.

In [8]:
spacestarts = [m.start() for m in re.finditer(' ', text)]
spacestarts.append(-1)
spacestarts.append(len(text))
spacestarts.sort()

In [9]:
spacestarts

[-1,
 3,
 16,
 23,
 30,
 34,
 45,
 48,
 54,
 68,
 71,
 78,
 81,
 84,
 90,
 92,
 105,
 107,
 111,
 119,
 123]

The -1 stands in for an imaginary space just before the first word, and len(text) stands in for one just after the last word, so that the first and last words are also treated as lying between two spaces.

It will be useful to have another list that records the locations of the first character of each potential word.

We will call that list spacestarts_affine, since in technical terms, this new list is an affine transformation of the spacestarts list.

Affine is often used to refer to linear transformations, such as adding 1 to each location, which we will do here.

In [10]:
spacestarts_affine = [ss+1 for ss in spacestarts]
spacestarts_affine.sort()

Next we will get all the substrings that are between two spaces

In [11]:
between_spaces = [(spacestarts[k] + 1, spacestarts[k+1]) for k in range(0, len(spacestarts) - 1)]

The between_spaces variable is a list of tuples of the form (location of beginning of substring, location of end of substring), like (17, 23). We get these tuples through a list comprehension.

This list comprehension iterates over k. In this case, k takes on every integer value from 0 up to, but not including, len(spacestarts) - 1.

For each k we generate one tuple.

The first element of the tuple is spacestarts[k] + 1, which is one position after the location of each space. The second element of the tuple is spacestarts[k+1], which is the location of the next space in the text. This way our final output contains tuples that indicate the beginning and end of each substring between spaces.
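
The same pairing of each space with the next one can be sketched with zip() instead of an index k. This is an alternative formulation on a made-up sample string; the names sample, between and words are hypothetical.

```python
import re

sample = "a losingbattle - and"   # hypothetical sample string

# Sentinels: -1 before the first word, len(sample) after the last word.
spacestarts = [-1] + [m.start() for m in re.finditer(' ', sample)] + [len(sample)]

# Pair every space with the next one: (one past this space, next space).
between = [(left + 1, right) for left, right in zip(spacestarts, spacestarts[1:])]
words = [sample[begin:end] for begin, end in between]

print(between)  # [(0, 1), (2, 14), (15, 16), (17, 20)]
print(words)    # ['a', 'losingbattle', '-', 'and']
```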

Now consider all of the potential words that are between spaces, and find the ones that are not valid (not in our word list).

In [12]:
between_spaces_notvalid = [loc for loc in between_spaces if text[loc[0]:loc[1]] not in word_list]

In [13]:
between_spaces_notvalid

[(4, 16),
 (24, 30),
 (31, 34),
 (35, 45),
 (46, 48),
 (49, 54),
 (55, 68),
 (69, 71),
 (72, 78),
 (79, 81),
 (82, 84),
 (85, 90),
 (91, 92),
 (93, 105),
 (106, 107),
 (108, 111),
 (112, 119),
 (120, 123)]

Looking at this output, we can see that it is a list of the locations of all invalid potential words in our text.

In [14]:
text[4:16]

'oneperfectly'

In [15]:
text[24:30]

'thing,'

In [16]:
text[31:34]

'the'

One way to get every word recognized is to add words to our list manually. A simpler way is to import a word list that already contains a substantial body of valid English words. Such a collection of words is referred to as a corpus.

# Using an imported corpus to check for valid words

In [17]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/devdutt/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [18]:
from nltk.corpus import brown
wordlist = set(brown.words())
word_list = list(wordlist)

Before we use this new word_list, however, we should do some cleanup to remove entries that it treats as words but that are actually punctuation marks.

In [19]:
# Strip each punctuation symbol out of every entry.
for symbol in '*[]?.+/;:,()':
    word_list = [word.replace(symbol, '') for word in word_list]
# Drop every entry that is now empty (list.remove('') would delete only the first one).
word_list = [word for word in word_list if word != '']

First we replaced every symbol with an empty string '', and then we removed the empty strings that were left behind.
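
For comparison, here is a sketch of an alternative cleanup that strips all of the symbols in a single pass with str.translate(). The token list raw_words is made up for the illustration and is not taken from the Brown corpus.

```python
raw_words = ['The', '(', 'one.', '*', 'fight,', '??']   # hypothetical tokens

# A translation table that deletes every listed symbol.
strip_table = str.maketrans('', '', '*[]?.+/;:,()')

cleaned = [word.translate(strip_table) for word in raw_words]
cleaned = [word for word in cleaned if word]   # keep only non-empty strings

print(cleaned)  # ['The', 'one', 'fight']
```

A filter is used for the last step because list.remove('') deletes only the first empty string it finds.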

In [20]:
# now rerunning the code to get a better output
between_spaces_notvalid = [loc for loc in between_spaces if text[loc[0]:loc[1]] not in word_list]
between_spaces_notvalid

[(4, 16),
 (24, 30),
 (35, 45),
 (55, 68),
 (72, 78),
 (93, 105),
 (112, 119),
 (120, 123)]

In [21]:
# now we will check for the potential words that might be there in the substring

partial_words = [loc for loc in locs if loc[0] in spacestarts_affine and loc[1] not in spacestarts]

In [22]:
partial_words

[(35, 38), (4, 7)]

In [23]:
text[35:38]

'one'

In [24]:
text[4:7]

'one'

Our locs variable contains the location of every word from the word list that was found in the text. The list comprehension iterates over each loc in it.

It checks whether loc[0], the beginning of the word, is in spacestarts_affine, a list containing the locations of the characters that come just after a space.

Then it checks that loc[1] is not in spacestarts, which means the word does not end where a space begins.

If a word starts after a space and does not end at the same place as a space, we put it in our partial_words variable, because this could be a word that needs to have a space inserted after it.
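
To see the two conditions at work, here is a small sketch on a made-up sample string with a four-word vocabulary. The names sample, words and partial are hypothetical and chosen only for the illustration.

```python
import re

sample = "the oneglimpse of"              # hypothetical sample string
words = ['the', 'one', 'glimpse', 'of']   # hypothetical vocabulary

locs = sorted({(m.start(), m.end()) for w in words for m in re.finditer(w, sample)})
spacestarts = sorted([-1, len(sample)] + [m.start() for m in re.finditer(' ', sample)])
spacestarts_affine = [ss + 1 for ss in spacestarts]

# Show both conditions for every word found in the sample.
for start, end in locs:
    print(sample[start:end], start in spacestarts_affine, end in spacestarts)

# Starts right after a space, but does not end at a space.
partial = [loc for loc in locs if loc[0] in spacestarts_affine and loc[1] not in spacestarts]
print(partial)  # [(4, 7)]
```

Only 'one' qualifies: it begins right after a space but runs straight into 'glimpse' instead of ending at a space.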

In [25]:
# Next we will write a list comprehension that finds every valid word that ends at a space but does not begin right after one

partial_words_end = [loc for loc in locs if loc[0] not in spacestarts_affine and loc[1] in spacestarts]

In [26]:
partial_words_end

[(7, 16)]

In [27]:
text[35:46]

'oneglimpse '

# FINDING FIRST AND SECOND HALVES OF POTENTIAL WORDS

In [28]:
loc = between_spaces_notvalid[0]

In [29]:
loc

(4, 16)

We will now check whether any of the words in partial_words could be the first half of oneperfectly.

For a valid word to be the first half, it would have to have the same beginning location in the text, but not the same ending location.

In [30]:
endsofbeginnings = [loc2[1] for loc2 in partial_words if loc2[0] == loc[0] and (loc2[1] - loc[0]) > 1]

We have specified two conditions in this list comprehension. The first is that the word must start at the same place as the potential word, loc[0]. The second ensures that the word is more than one character long.

The second condition is not strictly necessary, but it helps us avoid false positives in words like avoid, aside, along, irate, and iconic.

In each of these examples the first letter (a or i) is itself a valid word, but splitting it off would almost certainly be wrong.

Our list endsofbeginnings should include the ending location of every valid word that begins at the same place as oneperfectly.

Now we will find the beginning location of every valid word that ends at the same place as oneperfectly.

In [31]:
beginningsofends = [loc2[0] for loc2 in partial_words_end if loc2[1] == loc[1] and (loc2[1] - loc[0])>1]

In [32]:
beginningsofends

[7]

Now we just need to find whether any locations are contained in both endsofbeginnings and beginningsofends. If there are, that means that our invalid word is indeed a combination of two valid words without a space. We can use the intersection() method of a set to find all the elements that are shared by both lists.

In [33]:
pivot = list(set(endsofbeginnings).intersection(beginningsofends))

In [34]:
pivot

[7]
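
As a minimal sketch of the intersection step on its own, the values below are invented so that each list contains one location the other lacks. The & operator on two sets is equivalent to calling intersection().

```python
endsofbeginnings = [7, 9]     # hypothetical: valid words starting at 4 end at these indices
beginningsofends = [7, 12]    # hypothetical: valid words ending at 16 start at these indices

# Only locations present in both lists can split the string into two valid words.
shared = sorted(set(endsofbeginnings) & set(beginningsofends))
print(shared)  # [7]
```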

A string can have more than one valid split. For example, text from a travel brochure that says choose spain! could, once the space is lost, be interpreted as chooses pain.

Hence a more sophisticated approach is to take the context into account: whether the words around choosespain tend to be about olives and bullfighting or about whips and superfluous dentist appointments.

However, such an approach is difficult to do well and impossible to do perfectly, illustrating again the difficulty of language algorithms in general.

In our case we will take the smallest element of pivot, not because this is certainly the correct one, but just because we have to take one.

In [35]:
import numpy as np
pivot = np.min(pivot)
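
The choosespain ambiguity can be sketched directly. The four-word vocabulary below is made up for the illustration, and the built-in min() is used in place of np.min(), which gives the same result on a plain list of integers.

```python
word = 'choosespain'
vocab = {'choose', 'chooses', 'spain', 'pain'}   # hypothetical vocabulary

# Every index that splits the string into two valid words.
pivots = [i for i in range(1, len(word)) if word[:i] in vocab and word[i:] in vocab]
splits = [(word[:i], word[i:]) for i in pivots]

print(pivots)  # [6, 7]
print(splits)  # [('choose', 'spain'), ('chooses', 'pain')]

smallest = min(pivots)
print(word[:smallest] + ' ' + word[smallest:])  # choose spain
```

Taking the smallest pivot happens to give the intended reading here, but that is luck rather than understanding; nothing in the rule looks at context.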

Finally, in one line, we replace the invalid word with its two valid halves separated by a space.

In [36]:
textnew = text
textnew = textnew.replace(text[loc[0]:loc[1]], text[loc[0]:pivot]+' '+text[pivot:loc[1]])

In [37]:
textnew

"The one perfectly divine thing, the oneglimpse of God's paradisegiven on earth, is to fight a losingbattle - and notlose it."

As we can see, the code worked for oneperfectly and left all the other invalid words as they were, which is what it was intended to do at this stage.

Assembling all of these steps inside one function, so that it can repair the whole text, we get the following.

In [38]:
def insertspaces(text,word_list):
    locs = list(set([(m.start(),m.end()) for word in word_list for m in re.finditer(word, text)]))    
    spacestarts = [m.start() for m in re.finditer(' ', text)]
    spacestarts.append(-1)
    spacestarts.append(len(text))
    spacestarts.sort()
    spacestarts_affine = [ss + 1 for ss in spacestarts]
    spacestarts_affine.sort()
    partial_words = [loc for loc in locs if loc[0] in spacestarts_affine and loc[1] not in spacestarts]
    partial_words_end = [loc for loc in locs if loc[0] not in spacestarts_affine and loc[1] in spacestarts]
    between_spaces = [(spacestarts[k] + 1, spacestarts[k+1]) for k in range(0, len(spacestarts) - 1)]
    between_spaces_notvalid = [loc for loc in between_spaces if text[loc[0]:loc[1]] not in word_list]
    textnew = text
    for loc in between_spaces_notvalid:
        endsofbeginnings = [loc2[1] for loc2 in partial_words if loc2[0] == loc[0] and (loc2[1] - loc[0])>1]
        beginningsofends = [loc2[0] for loc2 in partial_words_end if loc2[1] == loc[1] and (loc2[1] - loc[0])>1]
        pivot = list(set(endsofbeginnings).intersection(beginningsofends))
        if len(pivot)>0:
            pivot = np.min(pivot)
            textnew = textnew.replace(text[loc[0]:loc[1]], text[loc[0]:pivot]+' '+text[pivot:loc[1]])
    textnew = textnew.replace('  ',' ')  # collapse any double spaces into single spaces
    return textnew

Now we can define any text and call the function to insert spaces automatically.

In [39]:
print(insertspaces(text, word_list))

The one perfectly divine thing, the one glimpse of God's paradise given on earth, is to fight a losing battle - and not lose it.


#### We have successfully created an algorithm that can insert missing spaces into English text

# Phrase completion

On the surface, building this feature is simple.

We will start with a corpus, but this time we are interested not just in the words but also in how they fit together, so we will compile lists of n-grams from our corpus.

An n-gram is simply a collection of n words that appear together. For example, the phrase 'Reality is not always probable, or likely' is made up of seven words once spoken by someone great.

A 1-gram is an individual word, so the 1-grams of the above phrase are the seven words it consists of.

The 2-grams are every string of two consecutive words. For the above phrase they are reality is, is not, not always, always probable, and so on. The 3-grams are every string of three consecutive words, and the pattern continues for any n.
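
To make the definition concrete, here is a minimal pure-Python sketch that builds n-grams by slicing a token list. The function simple_ngrams is a hypothetical helper written for this illustration; it splits on whitespace only and drops the comma by hand, which is cruder than a real tokenizer.

```python
phrase = 'Reality is not always probable, or likely'
tokens = phrase.replace(',', '').split()   # crude tokenization for the example

def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(simple_ngrams(tokens, 2)[:3])
# [('Reality', 'is'), ('is', 'not'), ('not', 'always')]
print(len(simple_ngrams(tokens, 3)))  # 5
```

A phrase of seven words has six 2-grams and five 3-grams, because each n-gram needs n consecutive positions.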

## Tokenizing and getting N-grams

In [40]:
from nltk.tokenize import sent_tokenize, word_tokenize
text = 'Time forks perpetually toward innumerable futures'
print(word_tokenize(text))

['Time', 'forks', 'perpetually', 'toward', 'innumerable', 'futures']


In [41]:
# now we can get n-grams easily, like this
import nltk
from nltk.util import ngrams
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)

In [42]:
token

['Time', 'forks', 'perpetually', 'toward', 'innumerable', 'futures']

In [43]:
# alternatively we can put all the n-grams inside a list called grams

grams = [ngrams(token,2),ngrams(token,3),ngrams(token,4),ngrams(token,5)]

In [44]:
grams

[<zip at 0x132cb07c0>,
 <zip at 0x132cb2640>,
 <zip at 0x132cb0c40>,
 <zip at 0x132cb2380>]

One corpus we could use is a collection of literary texts made available online by Google's Peter Norvig at some link.

For this chapter we will download Shakespeare's complete works, available for free online at some link as well.

We read a corpus in python as follows

In [45]:
import requests
file = requests.get('http://www.bradfordtuckfield.com/shakespeare.txt')
file = file.text
text = file.replace('\n','')

Here we use requests to read a text file containing the collected works of Shakespeare from the website where it is hosted, and then store its contents in our Python session in a variable called text.

Now we rerun the code that created the grams variable, this time with the new definition of the text variable.

In [46]:
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)
grams = [ngrams(token, 1),ngrams(token, 2),ngrams(token, 3),ngrams(token, 4), ngrams(token, 5)]

Now, when a user types a search term of n words, we will suggest an (n+1)-gram: the n words they entered plus one more word. The first n words of the suggestion must match everything the user typed.

# Finding candidate n+1 grams

We can use the following simple lines to get the length of the search term.

In [47]:
from nltk.tokenize import sent_tokenize, word_tokenize
search_term = 'life is a'
split_term = tuple(search_term.split(' '))
search_term_length = len(search_term.split(' '))

Next we need to know which (n+1)-grams appear most frequently, so we will count them with the Counter class from the collections module.

In [52]:
from collections import Counter
counted_grams = Counter(grams[search_term_length])

This line selects only the (n+1)-grams from our grams variable.

Applying Counter() creates a dictionary-like object that maps each (n+1)-gram to its frequency in our corpus. Calling items() on it gives tuples: each tuple has an (n+1)-gram as its first element and the frequency of that (n+1)-gram in our corpus as its second element.
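
Here is a small sketch of Counter on a toy token list (not the Shakespeare corpus). It also illustrates one thing to watch for: the entries of grams are displayed as zip objects, and a zip object is an iterator that can be consumed only once, so counting the same entry a second time yields an empty result.

```python
from collections import Counter

tokens = 'to be or not to be'.split()      # toy corpus for the example

counts = Counter(zip(tokens, tokens[1:]))  # count every 2-gram
print(counts[('to', 'be')])                # 2
print(list(counts.items())[0])             # (('to', 'be'), 2)

lazy = zip(tokens, tokens[1:])             # a one-shot iterator
first = Counter(lazy)
second = Counter(lazy)                     # lazy is already exhausted here
print(len(first), len(second))             # 4 0
```

If the counts are needed more than once, keeping the Counter (or converting the iterator to a list first) avoids this surprise.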

In [53]:
print(list(counted_grams.items())[0])

(('From', 'fairest', 'creatures', 'we'), 1)


This n-gram is the beginning of Shakespeare's Sonnet 1.

In [58]:
list(counted_grams)[10000]

('this', 'learning', 'mayst', 'thou')

We do not want just any (n+1)-gram, however. We want only the ones whose first n words match the search term.

In [59]:
matching_terms = [element for element in list(counted_grams.items()) if element[0][:-1] == tuple(split_term)]

This list comprehension iterates over every (n+1)-gram and its count, and calls each one element as it does so.

For each element it checks whether element[0][:-1] == tuple(split_term). The left side of this equality, element[0][:-1], simply takes the first n elements of each (n+1)-gram: the [:-1] is a handy way to disregard the last element of a sequence.

The right side of the equality, tuple(split_term), is the n-gram we are searching for ('life is a'). So we are checking for (n+1)-grams whose first n elements are the same as our n-gram of interest.

Whichever terms match are stored in our final output, matching_terms.
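
The matching step can be sketched on a toy corpus as follows. The sentence in tokens is made up, and the 4-grams are built with zip() rather than NLTK so that the example runs on its own.

```python
from collections import Counter

# Toy corpus, invented for the example.
tokens = 'life is a dream and life is a tale and life is a dream'.split()

# Every run of four consecutive tokens, counted.
counted = Counter(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))

split_term = tuple('life is a'.split(' '))

# Keep the 4-grams whose first three words equal the search term.
matching = [item for item in counted.items() if item[0][:-1] == split_term]
print(matching)
# [(('life', 'is', 'a', 'dream'), 2), (('life', 'is', 'a', 'tale'), 1)]
```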

## Selecting a phrase based on frequency

In [65]:
if len(matching_terms)>0:
    frequencies = [item[1] for item in matching_terms]
    maximum_frequency = np.max(frequencies)
    highest_frequency_term = [item[0] for item in matching_terms if item[1] == maximum_frequency][0]
    combined_terms = ' '.join(highest_frequency_term)

In the above code we defined frequencies, a list containing the frequency of every (n+1)-gram in our corpus that matches the search term. Then we used the numpy module's max() function to get the highest of those frequencies.

We used another list comprehension to get the first (n+1)-gram that occurs with the highest frequency in the corpus, and finally we created combined_terms, a string that puts together all of the words in that (n+1)-gram, with spaces separating the words.

In [66]:
combined_terms

'life is a tedious'
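
As an alternative sketch, the same selection can be written with the built-in max() and a key function, which avoids building the intermediate frequencies list. The counts in matching_terms below are invented for the illustration.

```python
# Hypothetical matches: ((n+1)-gram, frequency) pairs.
matching_terms = [(('life', 'is', 'a', 'dream'), 2), (('life', 'is', 'a', 'tale'), 1)]

# max() with a key returns the first item that has the highest frequency.
best_gram, best_count = max(matching_terms, key=lambda item: item[1])
suggestion = ' '.join(best_gram)

print(suggestion)  # life is a dream
```

When several (n+1)-grams tie for the highest frequency, max() returns the first one it encounters, which matches the effect of the [0] in the cell above.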

## Putting it all together

In [74]:
def search_suggestion(search_term, text):
    token = nltk.word_tokenize(text)
    # grams[k] holds the (k+2)-grams, so index n-1 selects the (n+1)-grams
    grams = [ngrams(token, 2), ngrams(token, 3), ngrams(token, 4), ngrams(token, 5)]
    split_term = tuple(search_term.split(' '))
    search_term_length = len(split_term)
    # count how often each (n+1)-gram occurs in the corpus
    counted_grams = Counter(grams[search_term_length - 1])
    combined_term = 'No suggested searches'
    # keep the (n+1)-grams whose first n words match the search term
    matching_terms = [element for element in counted_grams.items() if element[0][:-1] == split_term]
    if len(matching_terms) > 0:
        frequencies = [item[1] for item in matching_terms]
        maximum_frequency = np.max(frequencies)
        # the first match that has the highest frequency becomes the suggestion
        highest_frequency_term = [item[0] for item in matching_terms if item[1] == maximum_frequency][0]
        combined_term = ' '.join(highest_frequency_term)
    return combined_term

In [79]:
print(search_suggestion('life of a', text))

life of a man
