<a href="https://colab.research.google.com/github/Echo9k/Information-Retrival/blob/main/Programming_Assignment_%7C_Part_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title About this notebook
## This code is an extension of the previous programming assignment.
__author__    = "Guillermo Alcantara Gonzalez (Echo9k)"
__class__ = "CS3308: INFORMATION RETRIEVAL"
__originalAuthor__ = "University of the People"

__contact__   = {"email":"guillermoalcantara@my.uopeople.edu",
                 "linkedin":"guillermoaagg"}
__licence__   = "UNKNOWN"
__version__   = 3.0
__date__      = 'February 20, 2021'

# Set up

In [2]:
#@title installing libraries
%%capture
%%bash
pip install -U textblob
python -m textblob.download_corpora
rm /content/sample_data -r

In [3]:
#@title Imports
#@markdown Import the necesary libraries for this notebook to work.
%%capture
# Other
import sys, os, re, time, timeit
# Numerical computations
import scipy.stats as stats
import numpy as np

from sklearn import preprocessing
from sklearn import metrics

import math
from collections import Counter

# DataFrames
import pandas as pd
from google.colab.data_table import DataTable
%load_ext google.colab.data_table
# Graphs
import seaborn as sns
import plotly.express as px
# Read HTML
import codecs
from bs4 import BeautifulSoup as bs
# Text processing
from textblob import TextBlob
from textblob import Word
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# from nltk.tokenize import word_tokenize  
# from nltk.stem import PorterStemmer

In [4]:
#@markdown Get the file path and unzip its content<br>
#@markdown The file should be downloaded automaticaly downloaded
URL = 'https://raw.githubusercontent.com/Echo9k/Information-Retrival/main/cacm.zip'
path = "/content/"  # @param {type:"string"}
!wget -q O- {path} URL

folder_name = URL.rsplit('/',1)[1].split('.')[0]

!unzip -qq -n {path}{folder_name}.zip -d {path}
dirname = f"{path}{folder_name}"


# Part 2

## Functions
In this section I define the functions and classes we will be using to create the index.

In [5]:
#@title Stopwatch functionality
#@markdown Wrapper function from [realPython](https://realpython.com/primer-on-python-decorators/)
import functools
def timer(func):
    """Print the runtime of the decorated function"""
    @functools.wraps(func)
    def wrapper_timer(*args, **kwargs):
        start_time = time.perf_counter()    # 1
        value = func(*args, **kwargs)
        end_time = time.perf_counter()      # 2
        run_time = end_time - start_time    # 3
        print(f"Finished {func.__name__!r} in {run_time:.4f} secs")
        return value
    return wrapper_timer

In [6]:
#@title Text processing
#@markdown 1. **Stop words removal:** Removes english stopwords from the sentence.<br><br>
#@markdown 2. **Language:** Select the language of your text so we know the correct stopwords.
language = "spanish" #@param ["turkish", "spanish", "english", "french", "german", "swedish"]

#@markdown 3. **Method:** Applies one of the following types of text normalization.
#@markdown * Lematize: Gets the core/regular form of each words.
#@markdown * Steam: Truncates each word by removing es/ing/...
#@markdown * Remove stopwords: Returns the stopwords witouth any change.<br><br>
method = "steam" #@param ["steam", "lematize", "Remoe stopwords"]


class corpora_processor():
    def __init__(self, method, language="english"):
        self.method = method
        self.stopwords = stopwords.words(language)
        self.sw_count = 0
        self.numeric_tokens = 0
        self.small_tokens = 0
    def text_processor(self, text) -> list:
        def _process(word):
            # Skips numeric tokens
            try:
                int(word)                
                self.numeric_tokens += 1
                return None
            except ValueError:
                pass
            # Skips stop words
            if word in self.stopwords:
                self.sw_count += 1
                return None
            # Skips small tokens
            if len(word)<3:
                self.small_tokens += 1
                return None
                
            else:
                if self.method == "lematize": return word.lemmatize("v")
                if self.method == "steam": return word.stem()
                else: return word
            

        words = TextBlob(text).words
        words = words.lower()
        
        return [_process(word) for word in words if _process(word) is not None]

processor = corpora_processor(method, language)
sample_text = "Esto es un 1111 texto de prueba el investigador que este estudiando el tema o algun estudiante para el estudiante" #@param {type:"string"}
processor.text_processor(sample_text)

['texto',
 'prueba',
 'investigador',
 'estudiando',
 'tema',
 'algun',
 'estudiant',
 'estudiant']

## Text processing
Let's process the corpus but this time applaying removing stopwords and applaying text normalization.

Ignore from the index
* Stop words for the selected language
* Tokens with less than 2 characters in length
* Numerical terms
* Starting with a punctuation character are not indexed

### Text processing
Integrate the Porter Stemmer code into our index

### Additioanl (new) metrics
Calculate the tf-idft,d value for each unique combination of document and term


In [7]:
#@title Text processing function
#@markdown This function returns:

@timer
def indexer(path):
    os.chdir(dirname)
    total_tokens = 0
    total_documents = 0
    token_count = {}  # Term frequency corpus
    token_document = {} # Token:[document which contains it]
    label_encoder = preprocessing.LabelEncoder()
    document_list = [os.path.join(path, i) for i in os.listdir(path) if os.path.isfile(os.path.join(path, i))]
    tfd = {}
    tf_idf = {}
    for file_ in document_list:
        total_documents += 1
        with open(file_, 'r') as my_file:
            f = file_.split('/')[-1]
            tf_idf.update({f:{}})
            tfd.update({f:{}})
            text = bs(my_file.read()).get_text('\n',strip=True)
            local_tokens = processor.text_processor(text)
            N = len(local_tokens)

            # Metrics & data structures
            total_tokens += N
            
            for t in local_tokens:
                # Term frecuency in the corpus
                try: token_count[t] += 1
                except: token_count[t] = 1
                # Frecuency of a token per document document:tf
                try: tfd[f][t] += 1
                except KeyError: tfd[f].update({t:1})
                # The documents where you find a token: {token:[document]}
                try: token_document[t].append(f)
                except KeyError: token_document.update({t:[f]})
            
            # TF_IDF
            for t in set(local_tokens):
                # Calculate the Inverse document frequency: idf
                idf = np.log(N/tfd[f][t])
                tf =  np.log(1+token_count[t])
                tf_idf[f].update({t:tf*idf})

    # return to original directory
    os.chdir(path)


    # Write to CSV
    processed_documents = sorted(tfd.keys())
    processed_documents = pd.DataFrame(processed_documents)
    processed_documents.to_csv("processed_documents_df.csv")
    
    index_encoded = label_encoder.fit_transform(list(token_count.keys()))
    term_frequency_corpus = pd.DataFrame.from_dict(token_count,
                                            orient='index',
                                            columns=['count'])\
                                .sort_values('count')
    term_frequency_corpus.to_csv("token_count.csv")

    #output
    metrics = {
        "total_documents": total_documents,
        "total_tokens": total_tokens,
        "unique_tokens":len(token_count),
        "stopword_count":processor.sw_count,
        "numeric_tokens":processor.numeric_tokens,
        "small_tokens":processor.small_tokens
        }
    data_structures = {        
        "term_frequency_corpus": term_frequency_corpus,
        "token_document":token_document,
        "processed_documents":processed_documents,
        "label_encoder":label_encoder
        }
    scores = {        
        "tfd": tfd,
        "idf":{k: np.log(total_tokens/v) for k,v in token_count.items()},
        "tf_idf":tf_idf,
        "token count":token_count
        }
    return metrics, data_structures, scores
#@markdown Metrics
#@markdown * total_documents: Count of documents
#@markdown * total_tokens: Count of tokens processed.
#@markdown * unique_tokens: Unique tokens processed.
#@markdown * stopword_count: Count of stopwords.
#@markdown > Ignores irrelevant tokens _(Small, numeric, stopwords)_

#@markdown Data Structures
#@markdown * term_frequency_corpus: Frequency of each term in the corpus.
#@markdown * token_document: In which documents you find what token
#@markdown * processed_documents: List of processed documents


#@markdown Scores
#@markdown * tfd: Term frequency by document
#@markdown * inverse_document_frequency: Inverse term frequency by document
metrics, data_structures, scores = indexer(f"{path}/{folder_name}")

Finished 'indexer' in 2.2431 secs


In [8]:
#@title Build the index
#@markdown Let's recreate the index and take a meassure of the new metrics.

#@markdown ### Configure the text function
method = "steam" #@param ["steam", "lematize", "Remoe stopwords"]
language = "english" #@param ["turkish", "spanish", "english", "french", "german", "swedish"]
processor = corpora_processor(method, language)
metrics, data_structures, scores = indexer(f"{path}/{folder_name}")
print(
    f"Total documents:\t{metrics['total_documents']}\n"
    f"Total tokens:\t\t{metrics['total_tokens']} \n"
    f"Unique tokens:\t\t{metrics['unique_tokens']}\n"
    f"stopword count: \t{metrics['stopword_count']}\n"
    # f"Small tokens:\t\t{metrics['small_tokens']}\n"
    # f"Numeric tokens:\t\t{metrics['numeric_tokens']}\n"
    
)

Finished 'indexer' in 2.1105 secs
Total documents:	572
Total tokens:		17601 
Unique tokens:		7191
stopword count: 	4228



For the record, the metrics for version 1.0 of this function were:

```
Finished 'improved_function' in 0.2038 secs
Total documents:	  570
Total tokens:		 34357
Unique tokens:		5792
```

## Output documents
The output is stored the directory `/contents/`:
* documents.csv
* index.csv

Let's take a look at them!

In [9]:
#@title Documents
#@markdown The list of processed documents.
DataTable(data_structures['processed_documents'],False,3)

Unnamed: 0,0
0,CACM-0001.html
1,CACM-0002.html
2,CACM-0003.html
3,CACM-0004.html
4,CACM-0005.html
...,...
567,CACM-0568.html
568,CACM-0569.html
569,CACM-0570.html
570,processed_documents_df.csv


In [10]:
#@title Term frequency by document
#@markdown Type the name of the document you want to check.
doc_name = 'CACM-0569.html' #@param {type:"string"}
scores['tfd'][doc_name]

{'9:22': 1,
 'algorithm': 1,
 'binomi': 1,
 'ca620616': 1,
 'cacm': 1,
 'coeffici': 1,
 'june': 1,
 'march': 1,
 'steck': 1}

In [11]:
#@title Word count
#@markdown * Graph token count
try: data_structures['term_frequency_corpus'].reset_index(inplace=True)
except ValueError: pass

fig = px.bar(data_structures['term_frequency_corpus'], x="count", y="index", color='count', orientation='h',height=600,
             log_x=True,
             title='Word count')
fig.show()

#@markdown * Count of aparences by token
DataTable(data_structures['term_frequency_corpus'][['index', 'count']], True,5)

Unnamed: 0,index,count
0,ca590602,1
1,sssr,1
2,sole,1
3,garner,1
4,moder,1
...,...,...
7186,program,115
7187,comput,151
7188,algorithm,228
7189,cacm,570


In [12]:
#@title Inverse document frecuency: idf_df
#@markdown log(N/tdf)
import plotly.express as px
idf_df = pd.DataFrame(scores['idf'],['idf']).T
display(DataTable(idf_df, num_rows_per_page=5))
DataTable(idf_df, num_rows_per_page=5)

idf_df.sort_values('idf', ascending=True, inplace=True)
try: idf_df.reset_index(inplace=True)
except ValueError: pass

fig = px.bar(idf_df, x="idf", y="index", color='idf', orientation='h',height=600,
             log_x=True,
             title='Inverse document frecuency')
fig.show()

Unnamed: 0,idf
approxim,7.067661
curv,7.829801
line,7.696269
segment,7.829801
use,5.513031
...,...
ca610432,9.775711
11:27,9.775711
computer-ori,9.775711
programmer-ori,9.775711


# Part 3

## Assestment criteria
### Instructions
1. Prompt the user to enter a query. Multiple terms are separated by a space
2. For each query term entered, you process must determine the tf-idft,d weight.
3. Search for the documents that contains each of the query terms
4. For each of this documents calculate the cosine similarity between the query and the document.
 * The cosine similarity scores must be sorted in descending.
5. Finally your search process must print out the top 20 documents (or as many as are returned by the search if there are fewer than 20) listing the following statistics for each:
 * The document file name
 * The cosine similarity score for the document
 * The total number of items that were retrieved as candidates (you will only print out the top 20 documents)
 ‘home mortgage’ is provided in the output of the search for terms 

In [13]:
#@title Config
def _get_documents(query_terms):
    document_set = set()
    for term in query_terms:
        try: document_set = document_set.union(data_structures['token_document'][term])
        except KeyError: continue
    return document_set
def _tfidf(document_set, terms):
    doc_rank_tfidf = {}
    for document in document_set:
        doc_rank_tfidf.update({document:0})
        for term in terms:
            try:
                score = scores['tf_idf'][document][term]
                doc_rank_tfidf[document] += score
            except KeyError: continue
    return doc_rank_tfidf
encoder = data_structures['label_encoder']
#@markdown Function: Cosine similarity
def _cosine_similarity(document_set, query_tokens):
    def tokens_from_html(html_doc):
        with open(f'{path}/{folder_name}/' +html_doc, 'r') as page:
            text = bs(page).get_text(strip=True)
            return query_processor.text_processor(text)
        
    def buildVector(iterable1, iterable2):
        counter1 = Counter(iterable1)  # document
        counter2 = Counter(iterable2)  # query
        # all_items = set(counter1.keys()).union( set(counter2.keys()) )
        vector1 = [counter1[k] for k in counter2]
        vector2 = [counter2[k] for k in counter2]
        return vector1, vector2

    vectors = {}
    for document in document_set:
        doc_tokens = tokens_from_html(document)
        v1,v2 = buildVector(query_tokens, doc_tokens)
        cos_similarity = nltk.cluster.util.cosine_distance(v1,v2)
        vectors.update({document:cos_similarity})
    return vectors

def best_n_match(query_terms, sort_by, n=None):
    document_set = _get_documents(query_terms)
    # Ranks
    tfidf_rank = _tfidf(document_set, query_terms)
    tfidf_rank = pd.Series(tfidf_rank)
    # cosine_similarity
    cos_sim = _cosine_similarity(document_set, query_terms)
    cos_sim = pd.Series(cos_sim)
    df = pd.concat([tfidf_rank, cos_sim],1)
    df.columns=["tf_idf", "cosine_similarity"]
    df.index.name = "document_name"
    df.sort_values(sort_by, ascending=False, inplace=True)
    return df.iloc[:n]

method = "steam" #@param ["steam", "lematize", "Remoe stopwords"]
language = "english" #@param ["turkish", "spanish", "english", "french", "german", "swedish"]
query_processor = corpora_processor(method, language)

In [14]:
#@title Querry
#@markdown Type down your querry.
query = "A new method of specifying all diagnostic operations" #@param {type:"string"}
query_terms = query_processor.text_processor(query)

In [16]:
#@markdown Get the list of the best documents according to which metric:
sort_by = "tf_idf" #@param ["tf_idf", "cosine_similarity"]
#@markdown How many documents do you want to get?
best_n = 5 #@param ["None", "1", "5", "10", "20"] {type:"raw"}
best_documents = best_n_match(query_terms, sort_by, best_n)
best_documents

Unnamed: 0_level_0,tf_idf,cosine_similarity
document_name,Unnamed: 1_level_1,Unnamed: 2_level_1
CACM-0202.html,40.491145,0.647792
CACM-0531.html,31.101175,0.900249
CACM-0435.html,23.585628,0.700187
CACM-0558.html,20.345399,0.861325
CACM-0252.html,19.732126,0.761333


In [17]:
#@markdown Render
from IPython.core.display import display, HTML
document_name = "CACM-0202.html" #@param {type:"string"}

with open(dirname + '/' + document_name, 'r') as my_file:
        text = bs(my_file).text
        display(HTML(text))

# Further Analysis
For my first experiment I copied a section  *"A new method of specifying all diagnostic operations"* from the document **CACM-0202.html**. I expected both metrics to be high for this document but surprisingly cosine similarity was quite low while TF-IDF did brougth the rigth document to the top of the list

## Results
Top results ordered by tf_idf

|document_name|tf_idf|cosine_similarity|
|---------------|--------------------|---------------------|
|CACM-0202.html |  49.9160152996059  | 0.6477917762013607  |
|CACM-0531.html | 28.764743545584647 | 0.9369119702123488  |
|CACM-0558.html | 24.467685375908488 | 0.9122941980692971  |
|CACM-0492.html | 24.40473070667761  | 0.8601242787639529  |

## Cosine similarity analysis
I tougth that what cause the the cosine similarity to be low was that in my first run I created two vectors with all of the tokens for both the document I was analysing and the query terms.
This is:


```
query_terms = ['A', 'B', 'C', 'M', 'M', 'O']
docu_terms = ['X', 'Y', 'Y', 'Z', 'O', 'M', 'O']

all_unique_tokens = ['A', 'B', 'C', 'M', 'O', 'X', 'Y', 'Z']
```

This way I could form vectors of the same size including all the values.
```
# Original vectors
query_vector = {'A':1, 'B':1, 'C':1, 'M':2, 'O':1]
doc_vector = {'X':1, 'Y':2, 'Z':1, 'M':1, 'O':2]

# Extended query vecotrs
query_vector = {'A':1, 'B':1, 'C':1, 'M':2, 'O':1, 'X':0, 'Y':0, 'Z':0}
doc_vector   = {'A':0, 'B':0, 'C':0, 'M':1, 'O':2, 'X':1, 'Y':2, 'Z':1}
```
Onece we have vectors of the same size we can apply cosine similarity defined as:

```
cosine-similarity = u•v / |u|•|v|
```

```
# using a function to compute that
cosine_similarity(u=query_vector, v=doc_vector) 
```
Doing this give us the results in the table above. Never the less these results are far from optimal for the cosine similarity as we expected to retrive CAM-0202.html as the top result.



|document_name|	tf_idf|	cosine_similarity|
|---------------|-----------------------|---------------------|
|CACM-0202.html	|   49.9160152996059    |   0.6477917762013607|
|CACM-0531.html	|   28.764743545584647  |   0.9002490663892367|
|CACM-0558.html	|   24.467685375908488  |   0.8613249509436927|
|CACM-0492.html	|   24.40473070667761   |   0.7788370657676543|