# 1. Text processing

We will build a text preprocessing pipeline.

# 1.1 Normalization

The first step is normalization.
It might include:
* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing punctuation, accent marks and other diacritics
* removing extra white space
* expanding abbreviations

For this exercise, it is enough to produce lowercase text without special characters or digits and without unnecessary whitespace.
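
Two of the steps listed above, expanding abbreviations and removing numbers, are not needed for this exercise. As a rough sketch of what they could look like: the abbreviation table and both function names below are hypothetical, and a real pipeline would need a far larger table.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "e.g.": "for example"}

def expand_abbreviations(text):
    """Replace known abbreviations (case-insensitive) with their long forms."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())

def remove_numbers(text):
    """Drop digit runs and collapse the whitespace they leave behind."""
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Dr. Smith lives at 221 Baker St. in London"
print(remove_numbers(expand_abbreviations(sample)))
# doctor Smith lives at Baker street in London
```

Abbreviations are expanded before punctuation is removed, because the trailing dot is what identifies them.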

How could neural networks be used for text normalization?

In [1]:
!pip install Unidecode



In [2]:
!pip install config



In [3]:
!pip install bs4



In [4]:
!pip install --user -U nltk

Requirement already up-to-date: nltk in /home/jafar/anaconda3/lib/python3.7/site-packages (3.4.5)


In [5]:
!mkdir -p misc


In [6]:
import re
import json
import unidecode

# Normalize text: strip accents, lowercase, keep only letters and single spaces
def normalize(text, is_query=False):
    text = unidecode.unidecode(text)  # transliterate accented characters to ASCII
    text = text.lower()               # lowercase the entire text
    if is_query:
        # Queries keep '$' (word boundary marker) and '*' (wildcard)
        text = re.sub(r'[^a-z $*]', "", text)
    else:
        text = re.sub(r'[^a-z ]', "", text)  # remove punctuation, digits and newlines
    text = re.sub(r' +', " ", text)  # collapse runs of spaces into a single space
    return text

In [7]:
text = """Borrowed from \n\n Latins* $ teachers drunks niggas per he drank, he killed  he'll  \"\"\"\"\'\'   sē (“by itself”), from per (“by, through”) and sē (“itself, himself, herself, themselves”)"""

text = normalize(text, False)
print(text)

borrowed from latins teachers drunks niggas per he drank he killed hell se by itself from per by through and se itself himself herself themselves


# 1.2 Tokenize
Use the NLTK tokenizer to tokenize the text.

In [8]:
# Tokenize text using the NLTK library
import nltk
import config

config.flag = False  # becomes True once the NLTK resources have been downloaded

def download_packages():
    # Download the required NLTK resources only once per session
    if not config.flag:
        print(nltk.download('punkt'))
        nltk.download('stopwords')
        nltk.download('wordnet')
    config.flag = True

def tokenize(text):
    download_packages()
    return nltk.word_tokenize(text)

In [9]:
tokens = tokenize(text)
print(tokens)

True
['borrowed', 'from', 'latins', 'teachers', 'drunks', 'niggas', 'per', 'he', 'drank', 'he', 'killed', 'hell', 'se', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'se', 'itself', 'himself', 'herself', 'themselves']


[nltk_data] Downloading package punkt to /home/jafar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jafar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jafar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# 1.3 Lemmatization
What is the difference between stemming and lemmatization?

[Optional reading](https://towardsdatascience.com/state-of-the-art-multilingual-lemmatization-f303e8ff1a8)
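
As a rough illustration of the question above: a stemmer strips suffixes by rule and may return non-words, while a lemmatizer maps a word to its dictionary form. The toy sketch below uses a hypothetical suffix list and a hypothetical lemma table; it does not reproduce NLTK's actual algorithms.

```python
# Toy suffix-stripping stemmer: rule-based, no dictionary, may yield non-words.
def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: looks the word up in a tiny, hypothetical lemma table.
TOY_LEMMAS = {"drank": "drink", "killed": "kill", "teachers": "teacher"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ("killed", "drank", "themselves"):
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

The stemmer turns "themselves" into the non-word "themselv", while only the dictionary lookup can relate the irregular form "drank" to "drink".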


In [10]:
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

def lemmatization(tokens):
    # WordNetLemmatizer treats every token as a noun unless a POS tag is passed
    lm = WordNetLemmatizer()
    return list(map(lm.lemmatize, tokens))

def stemmatiztion(tokens):
    # The Snowball stemmer strips suffixes by rule
    ps = SnowballStemmer(language='english')
    return list(map(ps.stem, tokens))

In [11]:
lemmed = lemmatization(tokens)
print(lemmed)


['borrowed', 'from', 'latin', 'teacher', 'drunk', 'nigga', 'per', 'he', 'drank', 'he', 'killed', 'hell', 'se', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'se', 'itself', 'himself', 'herself', 'themselves']


In [12]:
stemmed = stemmatiztion(tokens)
print(stemmed)

['borrow', 'from', 'latin', 'teacher', 'drunk', 'nigga', 'per', 'he', 'drank', 'he', 'kill', 'hell', 'se', 'by', 'itself', 'from', 'per', 'by', 'through', 'and', 'se', 'itself', 'himself', 'herself', 'themselv']


# 1.4 Stop words
The next step is to remove stop words. Take the list of stop words from nltk.

In [13]:


def remove_stop_word(tokens):
    stopping_words = set(nltk.corpus.stopwords.words('english'))
    return [word for word in tokens if word not in stopping_words]

In [14]:
clean = remove_stop_word(stemmed)
print(clean)

['borrow', 'latin', 'teacher', 'drunk', 'nigga', 'per', 'drank', 'kill', 'hell', 'se', 'per', 'se', 'themselv']


# 1.5 Pipeline
Run the complete pipeline in one function.

In [15]:
def preprocess(text, is_query=False):
    res = remove_stop_word(lemmatization(tokenize(normalize(text, is_query))))
    if not is_query:
        # Wrap document tokens in '$' boundary markers for the n-gram index
        return ["$" + i + "$" for i in res]
    else:
        return res

In [16]:

clean = preprocess(text)
print(clean)

['$borrowed$', '$latin$', '$teacher$', '$drunk$', '$nigga$', '$per$', '$drank$', '$killed$', '$hell$', '$se$', '$per$', '$se$']


# 2. Collection

Download Reuters data from here:
https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz

Read data description here:
https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection

The function should return a list of strings (the raw texts). Remove HTML tags using the bs4 package.

## 2.1 Alternative (0.5 task bonus points)

Download songs (the process takes time; 1000 documents might be enough for the sake of the exercise) from https://www.lyrics.com/ and implement text search on them. In this case you have to create a class *Song* with the fields *title*, *artist* and *text*. The collection will then contain a list of songs.
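
A minimal sketch of such a class, assuming Python 3.7+ so that `dataclasses` is available. The sample values are placeholders, not real lyrics.com data.

```python
from dataclasses import dataclass

@dataclass
class Song:
    title: str
    artist: str
    text: str

# The collection is then simply a list of Song objects (placeholder values).
songs = [Song(title="Example Title", artist="Example Artist", text="la la la")]
print(songs[0].artist)
```

The `text` field is what would be passed to `preprocess`, while `title` and `artist` can be shown in the search results.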

# 2.2 Load collection.json

Here we simply load the JSON file if it already exists, to save time while testing. Otherwise the collection is built from the downloaded archive.

In [17]:
import urllib.request as req
import os.path
import tarfile
import codecs
from bs4 import BeautifulSoup


def download(url, filepath):
    # Skip the download if the archive is already on disk
    if not os.path.isfile(filepath):
        req.urlretrieve(url, filepath)

def get_collection():
    # Reuse the cached collection if it was saved earlier
    if os.path.isfile('collection.json'):
        with open('collection.json', 'r') as fd:
            return json.load(fd)
    download('https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz', './misc/reuters.tar.gz')
    with tarfile.open("./misc/reuters.tar.gz") as tf:
        tf.extractall('./misc')
    collection = []
    pattern = re.compile(r".+\.sgm$")
    for file in os.listdir("./misc"):
        file = './misc/' + file
        if not pattern.match(file):
            continue
        tokens = []
        with codecs.open(file, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                # Strip the SGML/HTML tags line by line, then preprocess the text
                soup = BeautifulSoup(line, 'html.parser')
                tokens.extend(preprocess(soup.get_text()))
        print(file)
        # Each .sgm file becomes one document: (tokens, document id)
        collection.append((tokens, len(collection)))
    return collection

In [18]:
collection = get_collection()

print(len(collection))

22


# 2.3 Save collection

Save the collection for faster testing.


In [19]:
import json

with open('collection.json', 'w') as fd:
    json.dump(collection, fd)

# 3.0 Wildcard processing and insertion into the data structure

For example, the word 'word' is first wrapped in boundary markers to give '\$word\$' and is then split into the collection of bigrams ["\$w", "wo", "or", "rd", "d\$"].

In [20]:
def to_n_gram(tokens, n=2):  # default is a bigram
    # Accept a single token as well as a list of tokens
    if not isinstance(tokens, list):
        tokens = [tokens]
    res = []
    for token in tokens:
        # Zip n shifted copies of the token to get its character n-grams
        ngrams = zip(*[token[i:] for i in range(n)])
        res += ["".join(ngram) for ngram in ngrams]
    return res

In [21]:
print(to_n_gram(['$hello$','$ja'],2))
print(to_n_gram('$hello$',2))


['$h', 'he', 'el', 'll', 'lo', 'o$', '$j', 'ja']
['$h', 'he', 'el', 'll', 'lo', 'o$']


# 3.1 Word Index

Here we build two indexes: one maps each word to its n-grams, and the other maps each n-gram to the set of words that contain it.

## get_cached, write_cached

These two functions are not part of the assignment; I used them to make testing on my machine faster.

In [22]:

import os.path

def get_cached():
    # Load both indexes from disk if present; JSON lists are turned back into sets
    a, b = {}, {}
    if os.path.isfile('inverted_word_index.json'):
        with open('inverted_word_index.json', 'r') as fd:
            a = json.load(fd)
            a = {vi: set(a[vi]) for vi in a.keys()}
    if os.path.isfile('ngram_word_index.json'):
        with open('ngram_word_index.json', 'r') as fd:
            b = json.load(fd)
            b = {vi: set(b[vi]) for vi in b.keys()}
    return a, b

def write_cached(inverted_word_index, ngram_word_index):
    # Sets are not JSON-serializable, so they are stored as lists
    a, b = inverted_word_index, ngram_word_index
    with open('inverted_word_index.json', 'w') as fd:
        json.dump({vi: list(a[vi]) for vi in a.keys()}, fd)
    with open('ngram_word_index.json', 'w') as fd:
        json.dump({vi: list(b[vi]) for vi in b.keys()}, fd)
    return 'Success'

def make_word_index(collection, force=False):
    inverted_word_index = {}  # word -> list of its n-grams
    ngram_word_index = {}     # n-gram -> set of words containing it
    if not force:
        a, b = get_cached()
        if a and b:
            return a, b
    for (group, name) in collection:
        for word in group:
            if word in inverted_word_index:
                continue
            ngrams = to_n_gram(word)
            inverted_word_index[word] = ngrams
            for gram in set(ngrams):
                ngram_word_index.setdefault(gram, set()).add(word)
    write_cached(inverted_word_index, ngram_word_index)
    return inverted_word_index, ngram_word_index
                
                

In [23]:
inverted_word_index, ngram_word_index = make_word_index(collection)





# 4 Wildcard queries

Here I retrieve all the words in the lexicon that match a wildcard query.


In [24]:
import re

def wild_find(token, ngram_word_index):
    # Too short to form a bigram query: no matches
    if len(token) < 2:
        return []
    # Add boundary markers, then split on wildcards: 'h*ell' -> ['$h', 'ell$']
    parts = ("$" + token + "$").split("*")
    ngrams = to_n_gram(parts)
    if not ngrams:
        return []
    # Candidate words must contain every bigram of the query
    candidates = set.intersection(*[ngram_word_index.get(g, set()) for g in ngrams])
    # Post-filter with a regex: sharing all bigrams does not guarantee a match
    regex = "[a-z]*".join(re.escape(p) for p in token.split("*"))
    pattern = re.compile(r"\$" + regex + r"\$")
    return [w for w in candidates if pattern.match(w)]
    

In [25]:
#test

print("results for \'h*ell\' are \n",wild_find('h*ell', ngram_word_index))
print("results for \'j*far\' are \n",wild_find('j*far', ngram_word_index))
print("results for \'jaafar\' are \n",wild_find('jaafar', ngram_word_index))

print(wild_find('k*', ngram_word_index))


results for 'h*ell' are 
 ['$howell$', '$hartnell$', '$honeywell$', '$hell$', '$hopewell$']
results for 'j*far' are 
 ['$jaafar$']
results for 'jaafar' are 
 ['$jaafar$']
['$klms$', '$kimal$', '$kistler$', '$keatings$', '$kiyonga$', '$kota$', '$kirk$', '$kamloops$', '$kindled$', '$kellwood$', '$keewatin$', '$krenzler$', '$khoo$', '$keta$', '$khon$', '$kompagni$', '$knapp$', '$kcsi$', '$krishna$', '$krupp$', '$kitty$', '$known$', '$kosan$', '$ko$', '$kiena$', '$kotch$', '$kaufhof$', '$kertosastro$', '$khashoggi$', '$kuwaitussr$', '$keswick$', '$kwextv$', '$kilowatthours$', '$kenai$', '$kurzkasch$', '$kurdish$', '$kims$', '$kitwe$', '$kampala$', '$krumper$', '$kickoff$', '$kittiwake$', '$ka$', '$kyushu$', '$knowledgable$', '$kohldelors$', '$kuroda$', '$kinzoku$', '$keenly$', '$kennametal$', '$karda$', '$kerkorian$', '$kmg$', '$kimmelman$', '$kamel$', '$kp$', '$kleckner$', '$klaus$', '$kk$', '$kauai$', '$knockdown$', '$kingsleyjones$', '$kleve$', '$keatinghawke$', '$kepco$', '$krolik$', '
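
The lookup idea can be sketched on a tiny lexicon: intersect the posting sets of the query's bigrams, then post-filter the candidates with a regular expression. The lexicon and the names `bigrams`, `toy_ngram_index` and `toy_wild_find` are hypothetical and serve only to keep the example self-contained.

```python
import re

def bigrams(s):
    """Character bigrams of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

# Hypothetical toy lexicon, already wrapped in '$' boundary markers.
lexicon = ["$hell$", "$hello$", "$howell$", "$shell$", "$help$"]

# Bigram -> set of words containing it.
toy_ngram_index = {}
for word in lexicon:
    for gram in set(bigrams(word)):
        toy_ngram_index.setdefault(gram, set()).add(word)

def toy_wild_find(query):
    # 'h*ell' -> parts '$h' and 'ell$'; bigrams are taken inside each part only.
    parts = ("$" + query + "$").split("*")
    grams = [g for part in parts for g in bigrams(part)]
    if not grams:
        return []
    candidates = set.intersection(*(toy_ngram_index.get(g, set()) for g in grams))
    # Post-filter, since sharing all bigrams does not guarantee a match.
    regex = "[a-z]*".join(map(re.escape, query.split("*")))
    pattern = re.compile(r"\$" + regex + r"\$")
    return sorted(w for w in candidates if pattern.fullmatch(w))

print(toy_wild_find("h*ell"))  # ['$hell$', '$howell$']
```

Looking bigrams up with `.get(g, set())` means that a query containing an unseen bigram returns no matches instead of raising a `KeyError`.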

# 5 Soundex algorithm

Here I explore the Soundex algorithm presented in the lecture.

In [26]:
import re

def soundex_encode(word):
    assert len(word) > 0, "Word should be non empty"
    groups = {'bfpv': '1', 'cgjkqsxz': '2', 'dt': '3', 'l': '4', 'mn': '5', 'r': '6'}
    # Expand the letter groups into a per-letter lookup table
    codes = {letter: digit for letters, digit in groups.items() for letter in letters}
    # Keep the first letter, drop vowels and h/w/y from the rest, encode what remains
    rest = re.sub(r'[aeiouhwy]', '', word[1:].lower())
    word = word[0].upper() + "".join(codes[x] for x in rest)
    # Note: collapsing consecutive identical digits is left out here
    # (open question about lecture slide 39).
    # Pad with zeros and keep the first four characters
    return (word + "000")[:4]
    

In [27]:
#test

soundex_encode("Herman")

'H655'
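
The cell above leaves out the step that collapses consecutive identical digits. For comparison, here is a sketch of the common textbook formulation, which maps vowels and h/w/y to '0', collapses runs of identical digits, and only then removes the zeros. The name `soundex_collapsed` is hypothetical, and treating non-letters as '0' is an assumption of this sketch.

```python
import re

def soundex_collapsed(word):
    """Soundex variant that collapses consecutive identical digits."""
    groups = {'bfpv': '1', 'cgjkqsxz': '2', 'dt': '3', 'l': '4', 'mn': '5', 'r': '6'}
    codes = {letter: digit for letters, digit in groups.items() for letter in letters}
    word = word.lower()
    # Vowels, h, w, y (and, by assumption, any non-letter) become '0'
    digits = "".join(codes.get(ch, "0") for ch in word[1:])
    # Remove one out of each run of identical digits, then drop the zeros
    digits = re.sub(r"(\d)\1+", r"\1", digits).replace("0", "")
    # Keep the first letter, pad with zeros, return four characters
    return (word[0].upper() + digits + "000")[:4]

print(soundex_collapsed("Herman"))   # H655
print(soundex_collapsed("Jackson"))  # J250
```

The two variants agree on "Herman", but for "Jackson" the version without collapsing would give 'J222', because c, k and s all map to '2'.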

# 5.1 Levenshtein distance

Here we see a pairwise computation of the Levenshtein distance.


In [28]:
 
def edit_distance(first, second):
    N, M = len(first), len(second)
    # dp[i][j] = distance between first[:i] and second[:j]
    dp = [[0 for i in range(M+1)] for j in range(N+1)]
    for i in range(N+1):
        dp[i][0] = i  # delete all i characters
    for i in range(M+1):
        dp[0][i] = i  # insert all i characters
    for i in range(1, N+1):
        for j in range(1, M+1):
            dp[i][j] = min(dp[i-1][j-1] + (1 if first[i-1] != second[j-1] else 0),  # match or substitution
                           dp[i-1][j] + 1,   # deletion
                           dp[i][j-1] + 1)   # insertion
    return dp[N][M]


In [62]:
edit_distance('worrld','world')


1
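
The table filled in by `edit_distance` follows the usual Levenshtein recurrence. Writing $a$ for `first`, $b$ for `second`, and $[\cdot]$ for the indicator that equals 1 when its condition holds (notation introduced here for illustration):

```latex
D_{i,0} = i, \qquad D_{0,j} = j, \qquad
D_{i,j} = \min\bigl( D_{i-1,j-1} + [a_i \neq b_j],\; D_{i-1,j} + 1,\; D_{i,j-1} + 1 \bigr)
```

The result is $D_{N,M}$ with $N = |a|$ and $M = |b|$. For 'worrld' and 'world', a single deletion of one 'r' suffices, which gives the distance 1 shown above.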

# 6. Inverted index
You will work with the boolean search model. Construct a dictionary which maps words to the postings.  

In [30]:
def make_index(collection):
    inverted_index = {}
    for (group,name) in collection:
        #print(name)
        for word in group:
            if word in inverted_index.keys():
                inverted_index[word].add(name)
            else:
                inverted_index[word] = set([name])
    return inverted_index

In [31]:
index = make_index(collection)

print(index['$food$'])

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21}


# 7. Spelling correction 

Here I define a function that, given a lexicon, returns candidate words that are close to the original token.

In [87]:
def get_options(token, inverted_word_index, ngram_word_index):
    # Candidates: every word sharing at least one bigram with the token
    candidates = set()
    for gram in to_n_gram(token, 2):
        candidates |= ngram_word_index.get(gram, set())

    # Keep the candidates within edit distance 3, closest first
    arr = []
    for word in candidates:
        edit_dst = edit_distance(token, word)
        if edit_dst > 3:
            continue
        arr.append((edit_dst, word))
    arr = sorted(arr)
    # Return up to five of the closest words, all at the minimal distance
    return [i[1] for i in arr[:5] if i[0] == arr[0][0]]

def correct_spelling(text, inverted_word_index, ngram_word_index):
    tokens = preprocess(text)
    print("query is ", tokens)
    new_text = ""
    for token in tokens:
        options = get_options(token, inverted_word_index, ngram_word_index)
        # Fall back to the original token when no close word is found
        oword = options[0] if options else token
        new_text = new_text + " " + oword
    return new_text

In [45]:
correct_spelling('inaguration respctful consititution jafar',inverted_word_index, ngram_word_index)

query is  ['$inaguration$', '$respctful$', '$consititution$', '$jafar$']


' $inauguration$ $respectful$ $constitution$ $jaafar$'

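# 8. Logical OR operation in the boolean search model

Here wildcard terms in the query are processed like ordinary text: each query word expands to a set of candidate words that are OR-ed together, and the results for the individual query words are AND-ed.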

In [112]:
def search(index, query, collection):
    # Relies on the global ngram_word_index and inverted_word_index built above
    query_copy = preprocess(query, is_query=True)
    expression = ''
    relevant = None
    for word in query_copy:
        if '*' in word:
            potential_words = wild_find(word, ngram_word_index)
        else:
            # No wildcard: use the spelling-correction candidates instead
            potential_words = get_options("$" + word + "$", inverted_word_index, ngram_word_index)
        expression = expression + " && (" + " || ".join(potential_words) + ") "

        # OR: union of the postings of all candidate words
        new_set = set()
        for w in potential_words:
            new_set |= index.get(w, set())
        if not new_set:
            return []

        # AND: intersect with the documents matched so far
        relevant = new_set if relevant is None else relevant & new_set
    relevant_documents = [collection[doc_id] for doc_id in (relevant or set())]
    print("Query expression is :: ", expression[3:])
    return relevant_documents

In [115]:
query = 'fanci worrd' # change for something else if you are searching song lyrics
relevant = search(index, query, collection)
print("how many relevant docs: ",len(relevant))

Query expression is ::   ($fancy$ || $fauci$)  && ($word$ || $world$ || $worry$) 
how many relevant docs:  2
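
The way such a query expression is evaluated can be sketched on a toy index: words inside one group are OR-ed (union of postings) and the groups are AND-ed (intersection). The index contents and the name `evaluate` are hypothetical.

```python
# Toy inverted index: word -> set of document ids (hypothetical data).
toy_index = {
    "$word$": {0, 2},
    "$world$": {1, 2},
    "$fancy$": {2, 3},
}

def evaluate(groups, index):
    """AND together groups of OR-ed words; each group is a list of words."""
    result = None
    for group in groups:
        # OR within a group: union of postings (unknown words contribute nothing)
        postings = set().union(*(index.get(w, set()) for w in group))
        # AND across groups: intersect with what has matched so far
        result = postings if result is None else result & postings
    return result if result is not None else set()

# ($fancy$) && ($word$ || $world$)
print(evaluate([["$fancy$"], ["$word$", "$world$"]], toy_index))  # {2}
```

Checking `result is None` rather than the truthiness of `result` matters here: an empty intermediate result must stay empty instead of being replaced by the next group's postings.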
