
By combining wget and unzip, we can easily download a zip file and unpack it.
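
Where `wget` and `unzip` are not available, the same download-and-unpack step can be approximated with the Python standard library. This is a minimal sketch, not part of the original notebook: the helper name `download_and_unzip` and its parameters are assumptions made for illustration.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, zip_path, target_dir="."):
    """Download the archive at `url` to `zip_path` and extract it into `target_dir`.

    Hypothetical helper for illustration; returns the names of the extracted members.
    """
    zip_path = Path(zip_path)
    # Stream the response to disk instead of holding the whole file in memory.
    with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)
    # Extract every member of the archive and report what was unpacked.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

For this notebook, the function could be called with the URL used in the cell below and `zip_path="corpus.zip"`.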


In [1]:
!wget -O corpus.zip https://github.com/probabll/basic-probability-programming/raw/master/weekly_tasks/week3/homework/code/corpus.zip
!unzip corpus.zip

--2020-11-18 10:00:01--  https://github.com/probabll/basic-probability-programming/raw/master/weekly_tasks/week3/homework/code/corpus.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/probabll/basic-probability-programming/master/weekly_tasks/week3/homework/code/corpus.zip [following]
--2020-11-18 10:00:01--  https://raw.githubusercontent.com/probabll/basic-probability-programming/master/weekly_tasks/week3/homework/code/corpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5559299 (5.3M) [application/zip]
Saving to: ‘corpus.zip’


2020-11-18 10:00:01 (21.5 MB/s) - ‘corpus.zip’ saved [5559299/5

We can now load the prepared dictionary. See the assignment for details on the dictionaries and their purpose.



In [3]:
import json
import numpy as np

# Use a context manager so the file is closed after loading.
with open('corpus/corpus-subset.json', 'r') as f:
    corpus = json.load(f)

# Print an example to inspect the structure of the dictionary.
print(corpus["1789-Washington"])

{'collection': 'inaugural', 'fid': '1789-Washington', 'filename': 'corpus/1789-Washington.txt'}
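
The printed entry suggests that `corpus` maps each document id to a small metadata dictionary with the keys `collection`, `fid`, and `filename`. As a hedged illustration of how such a structure can be traversed, the sketch below groups file names by collection. The dictionary `toy_corpus`, its second entry, and the helper `filenames_by_collection` are hypothetical and not part of the assignment.

```python
# Toy stand-in for `corpus`. The first entry is copied from the output above;
# the second entry is made up purely for illustration.
toy_corpus = {
    "1789-Washington": {
        "collection": "inaugural",
        "fid": "1789-Washington",
        "filename": "corpus/1789-Washington.txt",
    },
    "example-doc": {  # hypothetical entry
        "collection": "example",
        "fid": "example-doc",
        "filename": "corpus/example-doc.txt",
    },
}


def filenames_by_collection(corpus):
    """Group the 'filename' of every document under its 'collection'."""
    grouped = {}
    for doc_id, info in corpus.items():
        grouped.setdefault(info["collection"], []).append(info["filename"])
    return grouped


print(filenames_by_collection(toy_corpus))
# {'inaugural': ['corpus/1789-Washington.txt'], 'example': ['corpus/example-doc.txt']}
```

The same loop over `corpus.items()` is what the next cell uses to read each document's file.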


In [5]:
from collections import Counter
import json
print_steps = True
def step(message):
    if print_steps:
        print("\n" + "*"*70 + '\n{:*^70}\n'.format(message))



def split_text(text):
    """
    Split the string in words (tokens)
    :param text: string
    :return: list
    """
    return text.lower().split(' ')

def get_file_freqs(filename):
    """
    Get the word frequencies in a file
    :param filename:
    :return: Counter
    """
    freqs = Counter()
    with open(filename, 'r', encoding='utf8') as file:
        for line in file:
            words = split_text(line)
            freqs.update(words)

    return freqs


#####################################################################

####
step("1. Collect the corpus frequencies")
# Make a Counter object `corpus_freqs` with the corpus (word) freqs,
# the freqs of words in all documents combined. You can use the
# function get_file_freqs and Counter.update() might be useful.
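corpus_freqs = Counter()
# Document frequencies: the number of documents each word occurs in.
idf_counter = Counter()
for docid, info in corpus.items():
    file_freq = get_file_freqs(info['filename'])
    corpus_freqs.update(file_freq)
    # Count each word once per document it occurs in.
    idf_counter.update(file_freq.keys())
    print('Number of unique words in corpus:', len(corpus_freqs))

# tf-idf score per word: corpus frequency times a smoothed inverse document
# frequency, log(N / (1 + df)) + 1, where N is the number of documents.
tf_idf = dict()
total_n = len(corpus)
for key in idf_counter.keys():
    tf_idf[key] = corpus_freqs[key] * (np.log(total_n / (1 + idf_counter[key])) + 1)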

####
step("2. Make vocabulary")
# To scale things down, we only consider some of the most frequent
# words in the corpus. Use the method Counter.most_common() to make
# a list (!) called `vocabulary` which contains the voc_size=100 most
# common words in the corpus
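voc_size = 100
vocabulary = [word for word, _ in corpus_freqs.most_common(voc_size)]
# The voc_size words with the highest tf-idf score (not the alphabetically first ones).
tf_idf_voc = sorted(tf_idf, key=tf_idf.get, reverse=True)[:voc_size]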

####
step("# 3. Collect vocabulary word frequencies")
# We are only interested in the frequency of words in the vocabulary.
# Write a function that, given a frequency counter and the vocabulary
# returns a list (!) of frequencies of the words in the vocabulary.
# So if freqs['book'] = 10 and 'book' is the 3rd word in the vocabulary,
# then the function should output a list with 10 as the 3rd item.
def freqs_to_vector(freqs, vocabulary):
    """
    Turns a frequency Counter into a list (!) of frequencies, in the
    same order as the words in vocabulary.
    :param freqs: a Counter with word frequencies
    :param vocabulary: a list of vocabulary words
    :return: a list with the frequencies of each of the voc. words
    """
    freq_list = list()
    for word in vocabulary:
        if word in freqs:
            freq_list.append(freqs[word])
        else:
            freq_list.append(0)
    return freq_list


####
step("4. Collect vocabulary word frequencies")
# Store the frequency vector of every document in the `corpus` object
# as `corpus[doc_id]['freqs']` (remember `corpus` is a dictionary of
# dictionaries). You first have to read out the frequencies again
# using `get_file_freqs`, and then you can use `freqs_to_vector`.

for docid, info in corpus.items():
    item_freqs = get_file_freqs(info['filename'])
    corpus[docid]['freqs'] = freqs_to_vector(item_freqs, vocabulary)
    # Second vector over the tf-idf-selected vocabulary, used when tfidf=True.
    corpus[docid]['freqs_tfidf'] = freqs_to_vector(item_freqs, tf_idf_voc)

####
step("5. Norm")
# Define a function that returns the norm of a vector (list of numbers).
# So e.g. norm([3, 4]) = sqrt( 3^2 + 4^2 ) = 5. You can use the function
# `math.sqrt` for the square root, and `sum(my_list)` is also useful

import numpy as np
def norm(vector):
    """
    Computes the norm of a list of numbers
    :param vector: a list of numbers
    :return: a number
    """
    return np.linalg.norm(vector)

#Here are some tests that your norm function should pass
print("\nTest 5: norm:")
print( norm([3, 4]) ) # Should be 5.0
print( norm([1, 1, 1, 1])) # Should be 2.0
print( norm([5, 20, 102, 9, 1])) # Should be 104.45...

####
step("6. Cosine similarity")
# Write a function that computes the cosine similarity of two vectors
# A = [a_1, ..., a_n] and B = [b_1, ..., b_n]. Recall that the cosine
# similarity is defined as
#   sim = (a_1 * b_1 + ... + a_n * b_n) / ( norm(A) * norm(B) )
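def similarity(a, b):
    """
    Computes the cosine similarity between two vectors (of equal length).
    If the lengths differ, both vectors are cropped to the shorter length.
    :param a: a vector (list of numbers)
    :param b: another vector (list of numbers)
    :return: the cosine similarity (a number between -1 and 1)
    """
    if len(a) != len(b):
        crop = min(len(a), len(b))
        a, b = a[:crop], b[:crop]
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    # The cosine is undefined for an all-zero vector; return 0.0 in that case.
    if denominator == 0:
        return 0.0
    return np.dot(a, b) / denominator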

# Here are some tests that your similarity function should pass
print("\nTest 6: cosine similarity:")
print( similarity([1,0,0], [1,0,0]) ) # Should be 1.0
print( similarity([1,0,0], [0,1,0]) ) # Should be 0.0
print( similarity([1,0,0], [-1,0,0]) )# Should be -1.0
print( similarity([1,1,.5], [0.5,2,0]))# Should be 0.808...

####
step("7. Compute cosine similarities")
# Compute the cosine similarity between three documents:
# the first and second inaugural address by Washington
# (id's "1789-Washington" and "1793-Washington") and the
# poetry collection 'Leaves of grass' by Walt Whitman
# (id "whitman-leaves").
washington1 = corpus["1789-Washington"]['freqs']
washington2 = corpus["1793-Washington"]['freqs']
whitman = corpus['whitman-leaves']['freqs']

print("\nCosine similarities")
print("Washington 1 vs Washington 2:", similarity(washington1, washington2))
print("Washington 1 vs Whitman:", similarity(washington1, whitman))
print("Washington 2 vs Whitman:", similarity(washington2, whitman))


####
step("8. Arbitrary queries")
# We want to use the cosine similarity to automatically find
# the document most similar to a 'query', that is, we want
# to build a kind of search engine. Write a function that turns
# a query string into a frequency vector. You probably want to use
# the functions split_text and freqs_to_vector we defined before.
def text_to_vector(text, vocabulary):
    """
    Turns a string into a vector (list) of word frequencies for those
     words in the vocabulary
    :param text: the input string
    :param vocabulary: the list of vocabulary words
    :return: a list of word-frequencies
    """
    # Count the query words first; freqs_to_vector expects (freqs, vocabulary).
    words_list = split_text(text)
    return freqs_to_vector(Counter(words_list), vocabulary)


# We have already written the function that ranks the documents for you
def rank_documents(query_vector, corpus, num=100, tfidf=False):
    """
    Ranks the documents according to their cosine similarity to a query vector.
    :param query_vector: list
    :param corpus: the dictionary of documents, with stored frequency vectors
    :param num: only return the `num` top ranking documents
    :param tfidf: if True, compare against the 'freqs_tfidf' vectors instead of 'freqs'
    :return: two lists: a list of ranked document ids (most similar first) and a
        list with the corresponding similarity scores
    """
    similarities = {}
    freq_key = 'freqs_tfidf' if tfidf else 'freqs'
    for doc_id, info in corpus.items():
        similarities[doc_id] = similarity(query_vector, info[freq_key])

    ranked_ids = sorted(similarities, key=lambda i: similarities[i], reverse=True)
    ranked_sims = [similarities[doc_id] for doc_id in ranked_ids]
    return ranked_ids[:num], ranked_sims[:num]


####
step("9. Rank documents")
# Use the functions text_to_vector and rank_documents to find the document
# closest to the following excerpt from Adams's inaugural address.

adams_txt = "When it was first perceived, in early times, that no middle \
course for America remained between unlimited submission to a foreign \
legislature and a total independence of its claims, men of reflection \
were less apprehensive of danger from the formidable power of fleets \
and armies they must determine to resist than from those contests and \
dissensions which would certainly arise concerning the forms of government \
to be instituted over the whole and over the parts of this extensive country."

# The query vector must use the same vocabulary as the stored 'freqs' vectors.
print(rank_documents(text_to_vector(adams_txt, vocabulary), corpus))

# Do play around with our querying system. To use the full collection,
# rather than the 3-document corpus we used so far, uncomment the line
# `corpus = json.load(open('documents.json', 'r'))` at the top of this file.
# You will notice that our system isn't very reliable, and can be improved in
# many ways. The first thing you would want to do is tackle stop-words.
#
# If you look at the vocabulary (print it!) it contains many words like
# 'and', 'of', 'it', and so on. These are highly frequent in all texts,
# and not informative for a document's content. There are at least two
# improvements: (1) remove such words from the vocabulary, or (2) adjust
# our vectors to be less 'sensitive' to those words.
#
# A common approach to (2) is to use so-called tf-idf vectors, which stands
# for (term frequency)-(inverse document frequency). The inverse document
# frequency roughly punishes words that occur in many of the documents
# in the corpus. It is fairly easy to extend this assignment to use
# tf-idf scores instead. If you're interested, look up the Wikipedia page
# https://en.wikipedia.org/wiki/Tf%E2%80%93idf or ask Bas if you need help.

step("11. Play around")
# With tfidf=True the documents are compared on their 'freqs_tfidf' vectors,
# so the query is vectorised over the matching vocabulary `tf_idf_voc`.
print(rank_documents(text_to_vector(adams_txt, tf_idf_voc), corpus, tfidf=True))



**********************************************************************
******************1. Collect the corpus frequencies*******************

Number of unique words in corpus: 634
Number of unique words in corpus: 670
Number of unique words in corpus: 24164

**********************************************************************
**************************2. Make vocabulary**************************


**********************************************************************
***************# 3. Collect vocabulary word frequencies***************


**********************************************************************
****************4. Collect vocabulary word frequencies****************


**********************************************************************
*******************************5. Norm********************************


Test 5: norm:
5.0
2.0
104.45573225055675

**********************************************************************
*************************6. Cosine similarity*****