# **RankNet Text Ranking**
Text ranking is the process of ordering texts according to some relevance criterion. Its goal is to produce an ordered list of texts in response to a specific query: given a collection of texts and a query, we return the collection sorted by relevance to that query.

<div align="center" style="margin-top: 40px;">
    <img src="./images/rank.png" alt="Alt text" width="800"/>
</div>

### **Vector Space Ranking**
In the vector space approach, both the documents and the query are represented as vectors. For each document-query pair, the cosine similarity between the two vectors serves as the relevance score: it measures how relevant the document is to the query, that is, how well the document matches what the query is actually asking for.
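
For reference, the cosine similarity mentioned above can be written as follows. The symbols $q$ (query vector) and $d$ (document vector) are notation assumed here, not taken from the original text:

$$
\operatorname{sim}(q, d) = \cos\theta = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
$$

A value close to 1 means the two vectors point in nearly the same direction. With non-negative weights such as the ones used in this notebook, the score stays between 0 and 1.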

<div align="center" style="margin-top: 40px;">
    <img src="./images/vector-space.png" alt="Alt text" width="800"/>
</div>

### **Imports**

In [1]:
import numpy as np
import math
from numpy import linalg as LA

### **Load Data**

In [2]:
D1 = 'Machine learning teaches machine how to learn'
D2 = 'Machine translation is my favorite subject'
D3 = 'Term frequency and inverse document frequency is important'

### **Text Processing**

In [3]:
def normalize_document(document):
    return document.lower()

D1 = normalize_document(D1)
D2 = normalize_document(D2)
D3 = normalize_document(D3)

print(D1)
print(D2)
print(D3)

machine learning teaches machine how to learn
machine translation is my favorite subject
term frequency and inverse document frequency is important


In [4]:
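# Term frequency: occurrences of the term divided by the document length
def term_frequency(term, document):
    doc = document.split()
    return doc.count(term.lower()) / float(len(doc))

# Inverse document frequency: 1 + ln(N / number of documents containing the term)
def inverse_document_frequency(term, documents):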
    count = 0
    
    for doc in documents:
        if term.lower() in doc.lower().split():
            count += 1
    
    if count > 0:
        return 1.0 + math.log(float(len(documents)) / count)
    else:
        return 1.0
    
# tf-idf of a term in a document
def tf_idf(term, document, documents):
    tf = term_frequency(term, document)
    idf = inverse_document_frequency(term, documents)
    return tf * idf
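
Written out as formulas, the three functions above appear to compute the following. The symbols are notation assumed here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$. The logarithm is natural, since the code calls `math.log` with one argument.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}
\qquad
\mathrm{idf}(t) =
\begin{cases}
1 + \ln\dfrac{N}{n_t} & \text{if } n_t > 0 \\
1 & \text{otherwise}
\end{cases}
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$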

In [5]:
documents = [D1, D2, D3]

for term in D1.split():
    print('{}: {}'.format(term, inverse_document_frequency(term, documents)))

machine: 1.4054651081081644
learning: 2.09861228866811
teaches: 2.09861228866811
machine: 1.4054651081081644
how: 2.09861228866811
to: 2.09861228866811
learn: 2.09861228866811


In [6]:
for term in D1.split():
    print('{}: {}'.format(term, tf_idf(term, D1, documents)))

machine: 0.4015614594594755
learning: 0.2998017555240157
teaches: 0.2998017555240157
machine: 0.4015614594594755
how: 0.2998017555240157
to: 0.2998017555240157
learn: 0.2998017555240157
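
As a sanity check, the weight printed for `machine` can be reproduced by hand. This is a minimal standalone sketch that does not reuse the functions above. The variable names (`tf`, `idf`, `weight`) are illustrative, and the counts (3 documents, 2 of them containing `machine`) are read off the example data.

```python
import math

# D1 after normalization, split into tokens
doc = 'machine learning teaches machine how to learn'.split()

# 'machine' occurs 2 times among the 7 tokens of D1
tf = doc.count('machine') / len(doc)

# 3 documents in the collection, 2 of them (D1 and D2) contain 'machine'
idf = 1.0 + math.log(3 / 2)

weight = tf * idf
print(weight)  # roughly 0.4016, in line with the output above
```

The term is frequent inside D1, which raises its weight, but it also appears in D2, so its IDF is lower than that of terms unique to one document.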


In [8]:
query = 'machine learning document'

def generate_vectors(query, documents):
    tf_idf_matrix = np.zeros((len(query.split()), len(documents)))
    
    for i, s in enumerate(query.lower().split()):
        idf = inverse_document_frequency(s, documents)
        
        for j, doc in enumerate(documents):
            tf_idf_matrix[i][j] = idf * term_frequency(s, doc)
    
    return tf_idf_matrix

tf_idf_matrix = generate_vectors(query, documents)

def word_count(s):
    counts = dict()
    words = s.lower().split()
    
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
            
    return counts

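def build_query_vector(query, documents):
    count = word_count(query)
    words = query.lower().split()
    # One entry per query token so the rows line up with generate_vectors,
    # even when the query repeats a word
    vector = np.zeros((len(words), 1))
    
    for i, word in enumerate(words):
        # Query term frequency times the IDF over the collection
        vector[i] = float(count[word]) / len(words) * inverse_document_frequency(word, documents)
        
    return vector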

query_vector = build_query_vector(query, documents)

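def cosine_similarity(v1, v2):
    # Flatten both inputs so the dot product is a scalar
    v1, v2 = np.ravel(v1), np.ravel(v2)
    norm = LA.norm(v1) * LA.norm(v2)
    # A document sharing no terms with the query has a zero vector
    if norm == 0:
        return 0.0
    return float(np.dot(v1, v2) / norm)

def compute_relevance(query, documents):
    # Build the vectors from the arguments instead of relying on globals
    tf_idf_matrix = generate_vectors(query, documents)
    query_vector = build_query_vector(query, documents)
    
    for i in range(len(documents)):
        similarity = cosine_similarity(tf_idf_matrix[:, i], query_vector)
        print('Query document {}, similarity {}'.format(i, similarity))
        
compute_relevance(query, documents)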

Query document 0, similarity 0.7252786189058528
Query document 1, similarity 0.4279929226831737
Query document 2, similarity 0.639070441396375
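
The similarities above are printed in document order, not ranked. One way to turn them into a ranking is to sort the document indices by descending score. This is a minimal sketch: `rank_documents` is a hypothetical helper, not part of the notebook, and the scores are copied from the output above.

```python
import numpy as np

# Similarity scores copied from the output above (documents 0, 1, 2)
scores = np.array([0.7252786189058528, 0.4279929226831737, 0.639070441396375])

def rank_documents(scores):
    """Return document indices ordered from most to least similar."""
    # argsort sorts ascending, so negate the scores to get descending order
    return np.argsort(-scores).tolist()

ranking = rank_documents(scores)
print(ranking)  # [0, 2, 1]
```

For the query `machine learning document`, D1 comes first, followed by D3 and then D2.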


### **Learning to Rank**