##### Vector Space Model (VSM) is the basis of most Information Retrieval systems.
```vue
VSM is used to find relevant documents with respect to a given query. In VSM, each document or query is an N-dimensional vector, where N is the number of distinct terms over all the documents and queries. The i-th index of a vector contains the score of the i-th term for that vector.

    - The Vector Space Model for Information Retrieval represents documents and queries as vectors of weights.
    - The weights represent the importance of the terms (aka words, tokens) in the documents and queries.
    - Each weight is a measure of the importance of an index term in a document or a query, respectively.
    
```
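As a minimal sketch of this representation, the snippet below builds one vector per text over a shared vocabulary. The toy texts, the whitespace tokenization and the use of raw counts as scores are assumptions made for illustration; they are not part of the model's definition.

```python
# Toy corpus and query (hypothetical data, already lowercased)
texts = {
    "doc1": "apples and oranges",
    "doc2": "apples and doctors",
    "query": "oranges",
}

# The N distinct terms over all documents and queries, in a fixed order
vocabulary = sorted({term for text in texts.values() for term in text.split()})

def to_vector(text):
    # The i-th entry holds the score of the i-th vocabulary term (a raw count here)
    terms = text.split()
    return [terms.count(term) for term in vocabulary]

vectors = {name: to_vector(text) for name, text in texts.items()}
print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Every vector has the same length N, so a query can be compared with any document position by position.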

#### The main score functions are based on Term-Frequency (tf) and Inverse-Document-Frequency (idf).
```vue

Term-Frequency and Inverse-Document-Frequency: the Term-Frequency (tf_{ij}) is computed with respect to the i-th term and the j-th document,
where n_{i,j} is the number of occurrences of the i-th term in the j-th document.

The idea is that if a document contains multiple occurrences of a given term, it probably deals with that topic.
The Inverse-Document-Frequency (idf_{i}) considers the i-th term across all the documents in the collection:
```
![image](idf.png)
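The exact formula is the one shown in the figure. As a hedged sketch, the snippet below assumes the common form idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `inverse_document_frequency` and the toy collection are hypothetical.

```python
import math

# Toy collection: each document is reduced to its set of distinct terms
documents = [
    {"apple", "like"},
    {"apple", "hate"},
    {"apple", "doctor"},
    {"orange", "compare"},
]

def inverse_document_frequency(term, documents):
    # df: number of documents that contain the term
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0  # assumption: unseen terms get weight 0 instead of dividing by zero
    # Assumed form: idf = log(N / df), with N the size of the collection
    return math.log(len(documents) / df)

for term in ("apple", "orange"):
    print(term, round(inverse_document_frequency(term, documents), 3))
```

Under this assumed form, "orange" (one document) receives a larger weight than "apple" (three documents).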

```vue
The intuition is that rare terms are more important than common ones: if a term appears in only one document, it likely characterizes that document.

The final score for the i-th term in the j-th document is the product tf_{ij} * idf_{i}.

Since a document or query contains only a subset of all the distinct terms in the collection, the term frequency is zero for a large number of terms, so a sparse vector representation is needed to keep the space requirements low.
```
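The score and the sparse storage described above can be sketched with dictionaries that hold only the terms a document actually contains. This is an illustration under stated assumptions: tf is taken as the count divided by the document length, idf as log(N / df), and `tf_idf_vectors` is a hypothetical helper, not the notebook's code.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents
documents = [
    ["like", "apples", "like", "oranges"],
    ["love", "apples", "hate", "doctors"],
    ["apple", "day", "doctor"],
]

def tf_idf_vectors(documents):
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # Sparse vector: only terms that occur in the document get an entry
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

vectors = tf_idf_vectors(documents)
vocabulary_size = len({term for doc in documents for term in doc})
print("vocabulary size:", vocabulary_size)
for vector in vectors:
    print(len(vector), "stored entries:", vector)
```

Each document stores three or four entries instead of one slot per vocabulary term, and the gap widens as the collection grows.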

### The algorithm steps for VSM are:

```vue
        1- Collecting and preprocessing documents  
        2- Creating a vocabulary of unique terms   
        3- Representing each document as a vector 
        4- Representing each query as a vector
        5- Calculating the similarity between each document vector and the query vector
        6- Ranking documents based on their similarity to the query.
```

In [38]:
# create some documents
doc1 = "I like apples. I like oranges too?"
doc2 = "I love apples. I hate doctors"
doc3 = "An apple a day keeps/ the doctor away"
doc4 = "Never compare an apple // to an orange"
doc5 = "I prefer scikit-learn to learn"

# create some queries
query1 = "I hate apples"
query2 = "I like oranges"
query3 = "I like apples. I like oranges too"
query4 = "It was the best time to go to doctor"
query5 = "It's wrong to compare an apple to an orange"
query6 = "I would like to learn scikit-learn and pandas"

# add all docs to a list
documents = [doc1, doc2, doc3, doc4, doc5]
print("Documents:\n",documents)


# add all queries to a list
queries = [query1, query2, query3, query4, query5, query6]
print("Queries:\n",queries)


Documents:
 ['I like apples. I like oranges too?', 'I love apples. I hate doctors', 'An apple a day keeps/ the doctor away', 'Never compare an apple // to an orange', 'I prefer scikit-learn to learn']
Queries:
 ['I hate apples', 'I like oranges', 'I like apples. I like oranges too', 'It was the best time to go to doctor', "It's wrong to compare an apple to an orange", 'I would like to learn scikit-learn and pandas']


In [39]:
# preprocess documents and queries

import re

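def RemoveStopWords(token):
    # Load the stop-word list (whitespace-separated words in StopWords.txt)
    with open('StopWords.txt', 'r') as f:
        stop_words = f.read().split()
    # Strip every non-alphanumeric character, e.g. "scikit-learn" -> "scikitlearn"
    token = re.sub('[^A-Za-z0-9]+', '', token)

    # The comparison is case-sensitive: a capitalized token such as "I" or "An"
    # is kept unless the file lists it in exactly that form
    if token in stop_words:
        return ''

    return token

def tokenize(document):
    # Split on whitespace, clean each token, and drop the empty results
    tokens = [token.strip() for token in document.split()]
    tokens = [RemoveStopWords(token) for token in tokens]
    tokens = [token for token in tokens if token != '']
    # Returning a set keeps only the distinct terms; term counts are discarded
    return set(tokens)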

In [40]:
# create the vocabulary of unique terms

vocabulary = set()
for doc in documents:
    vocabulary.update(tokenize(doc))

# print the vocabulary
print(vocabulary)


{'learn', 'apple', 'Never', 'doctors', 'keeps', 'love', 'I', 'away', 'apples', 'prefer', 'doctor', 'orange', 'compare', 'day', 'hate', 'oranges', 'like', 'An', 'scikitlearn'}


In [41]:
# Representing each document as a vector (here: the set of its distinct terms, i.e. a binary vector)
documents_vector = [tokenize(doc) for doc in documents]
print(documents_vector)


[{'I', 'like', 'oranges', 'apples'}, {'apples', 'doctors', 'love', 'I', 'hate'}, {'apple', 'doctor', 'keeps', 'away', 'day', 'An'}, {'Never', 'orange', 'compare', 'apple'}, {'I', 'learn', 'scikitlearn', 'prefer'}]


In [42]:
# Representing each query as a vector (here: the set of its distinct terms, i.e. a binary vector)
queries_vector = [tokenize(que) for que in queries]
print(queries_vector)


[{'I', 'hate', 'apples'}, {'I', 'like', 'oranges'}, {'I', 'like', 'oranges', 'apples'}, {'It', 'doctor', 'best', 'time'}, {'wrong', 'apple', 'Its', 'orange', 'compare'}, {'learn', 'I', 'like', 'pandas', 'scikitlearn'}]


#### Cosine Similarity
```vue
To compute the similarity between two vectors a and b (document/query, but also document/document),
the cosine similarity is used:
```
    
![image](cos.png)
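For weighted vectors stored as dictionaries, the cosine similarity (dot product divided by the product of the two norms) could be sketched as follows. The function name `cosine_similarity_sparse`, the toy weights and the choice to return 0.0 for an empty vector are assumptions made for illustration.

```python
import math

def cosine_similarity_sparse(a, b):
    # a and b map term -> weight; terms missing from a dictionary count as 0
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical vectors with unit weights
query = {"hate": 1.0, "apples": 1.0}
document = {"love": 1.0, "apples": 1.0, "hate": 1.0, "doctors": 1.0}
print(cosine_similarity_sparse(query, document))
```

With unit weights this reduces to the size of the term overlap divided by the square roots of the two term counts.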

In [43]:
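# Rank documents by the cosine similarity between query terms and document terms

import math
def cosine_similarity(query, document):
    # For sets of distinct terms (binary vectors), the dot product is the size of
    # the intersection and each norm is the square root of the set size
    if not query or not document:
        return 0.0  # avoid dividing by zero when either side has no terms
    intersection = query.intersection(document)
    return len(intersection) / (math.sqrt(len(query)) * math.sqrt(len(document)))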

# enumerate gives the right index even when two queries reduce to the same term set
for query_index, query in enumerate(queries_vector):
    print("Query:(",query_index+1, ") : ", queries[query_index])
    scores = [cosine_similarity(query, document) for document in documents_vector]
    print("Scores:", scores)
    
    print("===========================================")
    print("Ranking documents for query:", query_index+1)
    print("===========================================")
    for document_index in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True):
        print("Document(", document_index+1, ") ==>",
              documents[document_index])

    print("\n")


Query:( 1 ) :  I hate apples
Scores: [0.5773502691896258, 0.7745966692414834, 0.0, 0.0, 0.2886751345948129]
Ranking documents for query: 1
Document( 2 ) ==> I love apples. I hate doctors
Document( 1 ) ==> I like apples. I like oranges too?
Document( 5 ) ==> I prefer scikit-learn to learn
Document( 3 ) ==> An apple a day keeps/ the doctor away
Document( 4 ) ==> Never compare an apple // to an orange


Query:( 2 ) :  I like oranges
Scores: [0.8660254037844387, 0.2581988897471611, 0.0, 0.0, 0.2886751345948129]
Ranking documents for query: 2
Document( 1 ) ==> I like apples. I like oranges too?
Document( 5 ) ==> I prefer scikit-learn to learn
Document( 2 ) ==> I love apples. I hate doctors
Document( 3 ) ==> An apple a day keeps/ the doctor away
Document( 4 ) ==> Never compare an apple // to an orange


Query:( 3 ) :  I like apples. I like oranges too
Scores: [1.0, 0.4472135954999579, 0.0, 0.0, 0.25]
Ranking documents for query: 3
Document( 1 ) ==> I like apples. I like oranges too?
Document

In [44]:
# Another way to rank documents: by the number of terms shared between the query and each document (raw overlap, without the cosine normalization)

def rank_documents(query, documents):
    scores = [len(query.intersection(document)) for document in documents]
    return scores

# enumerate gives the right index even when two queries reduce to the same term set
for query_index, query in enumerate(queries_vector):
    print("Query:(",query_index+1, ") : ", queries[query_index])
    scores = rank_documents(query, documents_vector)
    print("Scores:", scores)

    print("===========================================")
    print("Ranking documents for query:",query_index+1)
    print("===========================================")
    for document_index in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True):
        print("Document(", document_index+1, ") ==>", documents[document_index])

    print("\n")
    

Query:( 1 ) :  I hate apples
Scores: [2, 3, 0, 0, 1]
Ranking documents for query: 1
Document( 2 ) ==> I love apples. I hate doctors
Document( 1 ) ==> I like apples. I like oranges too?
Document( 5 ) ==> I prefer scikit-learn to learn
Document( 3 ) ==> An apple a day keeps/ the doctor away
Document( 4 ) ==> Never compare an apple // to an orange


Query:( 2 ) :  I like oranges
Scores: [3, 1, 0, 0, 1]
Ranking documents for query: 2
Document( 1 ) ==> I like apples. I like oranges too?
Document( 2 ) ==> I love apples. I hate doctors
Document( 5 ) ==> I prefer scikit-learn to learn
Document( 3 ) ==> An apple a day keeps/ the doctor away
Document( 4 ) ==> Never compare an apple // to an orange


Query:( 3 ) :  I like apples. I like oranges too
Scores: [4, 2, 0, 0, 1]
Ranking documents for query: 3
Document( 1 ) ==> I like apples. I like oranges too?
Document( 2 ) ==> I love apples. I hate doctors
Document( 5 ) ==> I prefer scikit-learn to learn
Document( 3 ) ==> An apple a day keeps/ the doc