# Text Retrieval
**Total Points: 7**

In this exercise, we will look at some essential components of text retrieval. Many Python libraries have been developed for working with text data; we will use scikit-learn, NLTK, and Gensim in this exercise.

In [1]:
#Import the necessary packages
from time import time
import csv

import gensim.models
import pandas as pd
import glob
import pickle
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import WordPunctTokenizer
from collections import Counter
from num2words import num2words
import nltk
import os
import string
import numpy as np
import copy
import re
import math

NLTK is one of the most prominent libraries for natural language processing. Before we can preprocess text, we need to download some of its static resources (the stop-word list, the Punkt tokenizer models, and WordNet):

In [2]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Text Preprocessing 
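Text preprocessing is one of the most important steps in text retrieval: it normalizes the raw text (case, punctuation, stop words, numbers, and word forms) so that equivalent terms map to the same token.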


In [3]:
# functions for preprocessing
def convert_lower_case(data):
    return np.char.lower(data)

def remove_stop_words_en(data):
    stop_words = stopwords.words('english')
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text

def remove_punctuation_en(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"  # backslash escaped explicitly to avoid an invalid escape sequence
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

def pos_tag_en(data):
    # tag the whole token list once; tagging inside a per-token loop only repeats the same work
    tokens = word_tokenize(str(data))
    return nltk.pos_tag(tokens)

def remove_apostrophe_en(data):
    return np.char.replace(data, "'", "")

def stemming_en(data):
    # despite the function name, this lemmatizes with WordNet rather than stemming
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + lemmatizer.lemmatize(w)
    return new_text

def convert_numbers_en(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        try:
            w = num2words(int(w))
        except ValueError:
            pass  # not an integer token; keep it unchanged
        new_text = new_text + " " + w
    new_text = np.char.replace(new_text, "-", " ")
    return new_text

def preprocess_en(data):
    data = convert_lower_case(data)
    data = remove_punctuation_en(data) #commas are removed separately
    data = remove_apostrophe_en(data)
    data = remove_stop_words_en(data)
    data = convert_numbers_en(data)
    data = stemming_en(data)
    data = remove_punctuation_en(data)
    data = convert_numbers_en(data)
    data = stemming_en(data) #needed again as we need to stem the words
    data = remove_punctuation_en(data) #needed again as num2words produces a few hyphens and commas, e.g. forty-one
    data = remove_stop_words_en(data) #needed again as num2words produces stop words, e.g. 101 - one hundred and one
    return data

In [4]:
#imports for the bow model
from gensim import corpora
from gensim import similarities
from gensim import models

In this exercise, we will work with the 20 Newsgroups dataset. It contains roughly 20,000 messages taken from 20 different newsgroups, which serve as the 20 classes, for example alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, and comp.sys.mac.hardware. We will use a simple bag-of-words model to retrieve documents related to a query.

In [5]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

news_group = fetch_20newsgroups(subset='train')

news_group_data = news_group.data
news_group_target_names = news_group.target_names
news_group_target = news_group.target

# Creating a dataframe from the loaded data
news_df = pd.DataFrame({'news': news_group_data, 
                        'class': news_group_target})

In [6]:
from pprint import pprint
pprint(list(news_group.target_names)) #class names


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [7]:
textDF1 = news_df.to_dict()
#preprocess the data in order to create the dictionary and indexes
article_text = []
for x in range(len(textDF1['news'])):
    article_text.append(word_tokenize(str(preprocess_en(textDF1['news'][x]))))

### Index creation 
In index creation, we index documents in order to prepare them for retrieval.
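
To make the indexing step more concrete, here is a minimal standard-library sketch of what a bag-of-words dictionary and a `doc2bow`-style conversion do conceptually. The helper names `build_dictionary` and `doc2bow` below are illustrative stand-ins, not Gensim's implementation, and the id assignment order (first appearance) is an assumption; Gensim may number tokens differently.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign consecutive integer ids to tokens in order of first appearance (assumed order)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(token2id, tokens):
    """Return sorted (token_id, count) pairs, silently skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy corpus, already tokenized
docs = [["space", "shuttle", "launch"], ["space", "space", "station"]]
token2id = build_dictionary(docs)
bows = [doc2bow(token2id, d) for d in docs]

print(len(token2id))  # number of unique tokens, analogous to len(corpora_dict)
print(bows)
```

Each document becomes a sparse vector that stores only the ids of the tokens it actually contains, which is why the index can be held as a sparse matrix.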

In [8]:
#build dictionary for corpus and get the similarity matrix for the dictionary
corpora_dict = corpora.Dictionary(article_text)
for token, token_id in corpora_dict.token2id.items():
    corpora_dict.id2token[token_id] = token
corpus = [corpora_dict.doc2bow(text) for text in article_text]
index_bow = similarities.SparseMatrixSimilarity(corpus, num_features=len(corpora_dict))

In [9]:
len(corpora_dict)

113605

**Exercise -- 1 point**

Play around with the preprocessing techniques and observe the change in the size of the dictionary (the number of unique tokens, `len(corpora_dict)`).

# Solution Exercise 1

I first commented out everything in the preprocessing function except the first line to see how much difference the remaining steps make:

```python
def preprocess_en(data):
    data = convert_lower_case(data)
    #data = remove_punctuation_en(data) #commas are removed separately
    #data = remove_apostrophe_en(data)
    #data = remove_stop_words_en(data)
    #data = convert_numbers_en(data)
    #data = stemming_en(data)
    #data = remove_punctuation_en(data)
    #data = convert_numbers_en(data)
    #data = stemming_en(data) #needed again as we need to stem the words
    #data = remove_punctuation_en(data) #needed again as num2words produces a few hyphens and commas, e.g. forty-one
    #data = remove_stop_words_en(data) #needed again as num2words produces stop words, e.g. 101 - one hundred and one
    return data
```

This resulted in a stark increase in dictionary size (the number of unique tokens) from 113605 to 181531.
Next, I uncommented the remaining lines one at a time and noted the resulting dictionary sizes below:

1. 131458
2. 131335
3. 131148
4. 120347
5. 113727
6. 113726
7. 113692
8. 113666
9. 113666
10. 113605


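Interestingly, almost every preprocessing step decreases the dictionary size; only between steps 8 and 9 (the third punctuation-removal pass) is there no decrease. The largest drops come from the first punctuation removal (step 1) and the first number conversion (step 4).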
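
The same shrinking effect can be reproduced on a toy example. The following sketch uses only the standard library; the `vocabulary` helper and the tiny `STOP_WORDS` set are made up for illustration and are much cruder than the NLTK-based functions above.

```python
import string

STOP_WORDS = {"the", "a", "is", "and", "of"}  # tiny illustrative list, not NLTK's

def vocabulary(docs, lower=False, strip_punct=False, drop_stop=False):
    """Collect the set of unique whitespace-separated tokens after optional normalization."""
    # map every punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    vocab = set()
    for text in docs:
        if lower:
            text = text.lower()
        if strip_punct:
            text = text.translate(table)
        for tok in text.split():
            if drop_stop and tok in STOP_WORDS:
                continue
            vocab.add(tok)
    return vocab

docs = ["The shuttle is a Shuttle.", "Launch of the shuttle, and the launch!"]
sizes = [
    len(vocabulary(docs)),
    len(vocabulary(docs, lower=True)),
    len(vocabulary(docs, lower=True, strip_punct=True)),
    len(vocabulary(docs, lower=True, strip_punct=True, drop_stop=True)),
]
print(sizes)  # [11, 10, 7, 2]
```

Each added step merges tokens that differed only in case or attached punctuation, or removes tokens entirely, so the vocabulary can only stay the same or shrink.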

**Exercise -- 2 points**

Create a TF-IDF index using the [tfidf](https://radimrehurek.com/gensim/models/tfidfmodel.html) model from Gensim.

# Solution Exercise 2

In [10]:
model = models.TfidfModel(corpus)
index_tfidf = similarities.SparseMatrixSimilarity(model[corpus], num_features=len(corpora_dict))
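
For intuition about what the TF-IDF model changes, here is a small standard-library sketch of one common weighting scheme: term frequency times `log2(N / df)`, followed by L2 normalization. I assume this is close to Gensim's default settings, but the `tfidf_vectors` helper is only an illustration and not the library's code.

```python
import math
from collections import Counter

def tfidf_vectors(bow_docs):
    """Weight each term by tf * log2(N / df) and scale every document to unit length."""
    n_docs = len(bow_docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in bow_docs for term in doc)
    vectors = []
    for doc in bow_docs:
        weights = {t: tf * math.log2(n_docs / df[t]) for t, tf in doc.items()}
        weights = {t: w for t, w in weights.items() if w > 0}  # drop zero-weight terms
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy bag-of-words documents as {term: count}
bow_docs = [
    {"space": 2, "shuttle": 1},
    {"space": 1, "station": 1},
    {"hockey": 3},
]
vecs = tfidf_vectors(bow_docs)
print(vecs[0])
```

In the first toy document, "shuttle" ends up with a larger weight than "space" even though it occurs less often, because "space" appears in two of the three documents and is therefore less distinctive.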

### Query search
Once the documents are indexed, the index can be queried. For a given query, we compute the cosine similarity between the query and every document. The documents with the highest similarity values are treated as matches and retrieved.
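
The ranking idea can be sketched without any library. The `cosine` and `rank` helpers below are hypothetical names for illustration; they work on sparse vectors stored as `{term: weight}` dictionaries and use the same 0.05 cut-off as the search function in this notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, threshold=0.05, top_n=10):
    """Return (similarity, doc_index) pairs above the threshold, best match first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored = [s for s in scored if s[0] > threshold]
    scored.sort(key=lambda s: -s[0])
    return scored[:top_n]

docs = [
    {"space": 2, "shuttle": 1},
    {"hockey": 3, "game": 1},
    {"space": 1, "station": 1, "crew": 1},
]
results = rank({"space": 1}, docs)
print(results)  # document 0 first, then document 2; document 1 is filtered out
```

Because cosine similarity divides by the length of each document vector, a document is rewarded for the share of its weight that falls on the query terms, not for the raw number of matches alone.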

In [11]:
#main search function.
def search(index, query):
    # note: the query is only lower-cased and split on whitespace, it is not passed through preprocess_en
    bow_vec = corpora_dict.doc2bow(query.lower().split())
    sims = index[bow_vec]  # cosine similarities between the query and all indexed documents
    # (similarity, document id) pairs, sorted by similarity in decreasing order
    ranked = sorted(((x, i) for i, x in enumerate(sims)), key=lambda elem: -elem[0])
    classes = list(news_df['class'])
    res = []
    for sim, doc_id in ranked[:10]:
        if sim > 0.05:  # change the confidence value here if you wish to
            res.append((doc_id, classes[doc_id], sim, article_text[doc_id]))
    return res

**Exercise -- 4 points**

Search for the following queries using BOW and TFIDF index. Compare top 5 results and comment on the differences. 
1. hardware
2. space
3. natural
4. sport
5. medicine

# Solution Exercise 3

In [24]:
search_words = ["hardware", "space", "natural", "sport", "medicine"]

for s_word in search_words:
    one_tfidf = search(index_tfidf, s_word)
    one_bow = search(index_bow, s_word)
    print(f"-------------------SEARCH WORD {s_word}-------------------")
    print()
    for i in range(5):
        print(f'-------------------RANK {i}-------------------')
        print('bow:', one_bow[i][0])
        print('tfidf', one_tfidf[i][0])
    print()
    print()
    print()

-------------------SEARCH WORD hardware-------------------

-------------------RANK 0-------------------
bow: 3908
tfidf 3908
-------------------RANK 1-------------------
bow: 8257
tfidf 8257
-------------------RANK 2-------------------
bow: 11205
tfidf 802
-------------------RANK 3-------------------
bow: 802
tfidf 11205
-------------------RANK 4-------------------
bow: 10435
tfidf 10752



-------------------SEARCH WORD space-------------------

-------------------RANK 0-------------------
bow: 7545
tfidf 9986
-------------------RANK 1-------------------
bow: 2800
tfidf 7545
-------------------RANK 2-------------------
bow: 4504
tfidf 2800
-------------------RANK 3-------------------
bow: 1665
tfidf 4425
-------------------RANK 4-------------------
bow: 6707
tfidf 4504



-------------------SEARCH WORD natural-------------------

-------------------RANK 0-------------------
bow: 5800
tfidf 5800
-------------------RANK 1-------------------
bow: 2509
tfidf 2509
-------------------RANK 

Generally, we can observe that both indices find very similar documents, but sometimes rank them differently.

For example, in the last case (5, "medicine") they find the same documents, but ranks 1 and 2 are switched between the BoW and TF-IDF indices.
We can observe something similar in case 1 ("hardware"), where ranks 2 and 3 are swapped and only the last item differs completely (TF-IDF finds a different document). This is probably because the TF-IDF index values repeated occurrences of a word lower than the BoW (bag-of-words) index does. One example is documents 6033 and 6793 in case 5, which I printed below. Document 6793 contains more occurrences of the word "medicine" than document 6033, yet the TF-IDF index ranks 6033 higher than 6793. The same effect appears in other cases, for example with documents 480 and 1810 when searching for "natural". We can also observe that BoW tends to rank longer texts higher than TF-IDF does, which is in line with the previous observation, as longer texts probably also contain more occurrences of the same word.
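
One way to probe this explanation is a toy corpus in which the document with more occurrences of the query term still loses under TF-IDF. The sketch below is standard-library only and assumes `tf * log2(N / df)` weighting with cosine similarity, which may differ in detail from Gensim; the corpus and the `single_term_scores` helper are invented for illustration. For a single-term query, the cosine similarity reduces to the weight of that term divided by the length of the document vector.

```python
import math
from collections import Counter

def single_term_scores(docs, term, use_idf):
    """Cosine similarity of every document to a one-word query: w_term / ||doc||."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in d)  # document frequency per term
    scores = []
    for d in docs:
        w = {t: tf * (math.log2(n_docs / df[t]) if use_idf else 1.0)
             for t, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))
        scores.append(w.get(term, 0.0) / norm if norm else 0.0)
    return scores

docs = [
    {"medicine": 2, "rare1": 1, "rare2": 1},  # more hits, but surrounded by rare words
    {"medicine": 1, "re": 1},                 # fewer hits, surrounded by a common word
    {"re": 1, "hockey": 1},
    {"re": 1, "game": 1},
]
bow = single_term_scores(docs, "medicine", use_idf=False)
tfidf = single_term_scores(docs, "medicine", use_idf=True)
print("bow:  ", [round(s, 3) for s in bow])
print("tfidf:", [round(s, 3) for s in tfidf])
```

In this sketch BoW prefers the first document and TF-IDF prefers the second. The flip comes from how the other words in each document are weighted: rare words inflate the length of the first document's TF-IDF vector, while the common word "re" barely contributes to the second. This may also play a part in the ranking differences seen above.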

In [18]:
print(textDF1['news'][6033])
print("BORDER")
print(textDF1['news'][6793])

From: kxgst1+@pitt.edu (Kenneth Gilbert)
Subject: Re: quality control in medicine
Organization: University of Pittsburgh
Lines: 20

In article <93108.003258U19250@uicvm.uic.edu> U19250@uicvm.uic.edu writes:
:Does anybody know of any information regarding the implementaion of total
: quality management, quality control, quality assurance in the delivery of
: health care service.  I would appreciate any information.  If there is enough
:interest, I will post the responses.


This is in fact a hot topic in medicine these days, and much of the
medical literature is devoted to this.  The most heavily funded studies
these days are for outcome research, and physicians (and others!) are
constantly questionning whether what we do it truly effective in any given
situation.  QA activities are a routine part of every hospital's
administrative function and are required by accreditation agencies.  There
are even entire publications devoted to QA issues.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-|-=-=-=

In [21]:
print(textDF1['news'][480])
print("BORDER")
print(textDF1['news'][1810])

From: keith@cco.caltech.edu (Keith Allan Schneider)
Subject: Re: <Political Atheists?
Organization: California Institute of Technology, Pasadena
Lines: 20
NNTP-Posting-Host: punisher.caltech.edu

livesey@solntze.wpd.sgi.com (Jon Livesey) writes:

>Perhaps the chimps that failed to evolve cooperative behaviour
>died out, and we are left with the ones that did evolve such
>behaviour, entirely by chance.

That's the entire point!

>Are you going to proclaim a natural morality every time an
>organism evolves cooperative behaviour?

Yes!

Natural morality is a morality that developed naturally.

>What about the natural morality of bee dance?

Huh?

keith

BORDER
From: kcochran@nyx.cs.du.edu (Keith "Justified And Ancient" Cochran)
Subject: We don't need no stinking subjects!
X-Disclaimer: Nyx is a public access Unix system run by the University
	of Denver for the Denver community.  The University has neither
	control over nor responsibility for the opinions of users.
Organization: The Loyal 