# Exercise set no. 1

* This is a mixture of programming and non-programming exercises.
* You may not be able to do all of them, and that's fine.
* The exercises rely on the file /home/ginter/IR_Course/fiwiki-20140809-corpus.txt.gz
* This file is on the course server (vm0964.kaj.pouta.csc.fi), so if you do your exercises there, you don't have to do anything.
* If you do your exercises on your own computer, you can copy the file with the command below (note the dot at the end; it is part of the command):
  
    scp yourusername@vm0964.kaj.pouta.csc.fi:/home/ginter/IR_Course/fiwiki-20140809-corpus.txt.gz .



Below, gathered in one place, is the best IR system from the lecture. This system and the Wikipedia data will be the basis of our first exercises.

* It stores its data in an efficient sparse matrix
* It does full tf.idf weighting
* It can answer queries consisting of multiple terms
* It cannot handle negation; we had to give up on that for the moment

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import gzip

def articles(gzipfile,max_articles=1000):
    """A function to yield documents from the wiki text dumps, one at a time"""
    with gzip.open(gzipfile,"rt") as f:
        article=[] #here we assemble the lines of the current article
        for line in f:
            line=line.strip()
            article.append(line)
            if line=="</article>": #end of article:
                yield " ".join(article) # yield it
                max_articles-=1
                if max_articles==0:
                    break
                article=[] # get ready for the next article

def search(keywords,td_matrix,tfvec):
    """Carry out the search"""
    term2row=tfvec.vocabulary_ #A more readable variable name
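    # Sum up the rows of the tf-idf weighted matrix, one row per query term.
    # The built-in sum() is used because calling np.sum() on a generator is deprecated.
    hits=sum(td_matrix[term2row[keyword]] for keyword in keywords)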
    #Hits is a sparse matrix with a single row
    #This is how you get the document indices and scores as arrays
    document_indices=hits.nonzero()[1]
    document_scores=hits[hits.nonzero()].A1 #A1 returns itself as flat array
    # search done, return one array with document indices of the hits, and one array with their scores
    return document_indices, document_scores

def top_n_simple(document_indices, document_scores,top_N):
    """Sort the hits and return top N - simple and quite slow version for large collections"""
    print("LEN",len(document_indices))
    sorted_hits=sorted(zip(document_scores, document_indices),reverse=True) # Rank the results
    #sorted_hits=list(zip(document_scores, document_indices)) # No ranking
    return sorted_hits[:top_N] # Returns list of (score, doc_idx), (score, doc_idx), ...

def index_wiki(doc_count):
    # Read the documents into a list
    # We need to remember them, so we can refer to them later
    documents=list(articles("fiwiki-20140809-corpus.txt.gz",doc_count))
    tfv_wiki=TfidfVectorizer(input="content",lowercase=True,sublinear_tf=True,use_idf=True,norm=None)
    td_matrix_wiki=tfv_wiki.fit_transform(documents)
    td_matrix_wiki=td_matrix_wiki.T.tocsr() #Turn document-term into term-document sparse matrix
    return td_matrix_wiki, tfv_wiki, documents #Returns the matrix, the learned vectorizer, and the documents

td_matrix_wiki, tfv_wiki, documents=index_wiki(15000)
document_indices,document_scores=search(["ilma","kone"],td_matrix_wiki,tfv_wiki)
top_n=top_n_simple(document_indices, document_scores,4)
for score, doc_idx in top_n:
    print("****", score)
    print(documents[doc_idx][:200]," (...)") #Print first 200 characters
    print()
    print()

LEN 385
**** 28.6950241217
<article name="Lentokone"> Lentokone on ilmassa liikkuva, ilmaa raskaampi kiinteäsiipinen ilma-alus. Lentokone pysyy ilmassa sen kantopintojen, kuten siipien aiheuttaman nostovoiman ansiosta, mutta le  (...)


**** 24.9326224979
<article name="Malév"> Malév Hungarian Airlines oli unkarilainen lentoyhtiö. Malév liikennöi 34 maahan ja 50 kaupunkiin. Partner-yhtiöiden code-share-lennot mukaan laskien luvut nousevat 42 maahan ja   (...)


**** 24.468528463
<article name="Suomen ilmavoimat"> Suomen ilmavoimat on yksi Suomen puolustusvoimien kolmesta puolustushaarasta. Muut kaksi ovat maavoimat ja merivoimat. Ilmavoimien perustaminen Pääartikkeli Suomen i  (...)


**** 23.8049677357
<article name="Wrightin veljekset"> Wrightin veljeksiä, Orville Wrightia (19. elokuuta 1871 – 30. tammikuuta 1948) ja Wilbur Wrightia (16. huhtikuuta 1867 – 30. toukokuuta 1912), pidetään yleisesti en  (...)




...it seems to work fine!

# Exercise 1 - Read the code

Read the source code above and do your best to understand what's going on. Try putting print statements in various places to make sure you know what it does. Then answer: how many unique terms do you get when you use the code to index the first 25,000 wiki articles, then 50,000, and then 100,000? How does the number of terms change as the number of documents increases?
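
To build some intuition before running the real experiment, here is a rough sketch that counts unique terms in a tiny made-up corpus as documents are added one at a time. It is not part of the lecture code: `vocabulary_sizes` and `toy_docs` are hypothetical names, and the regular expression is assumed to mirror the default `token_pattern` of `TfidfVectorizer` (words of two or more characters).

```python
import re

# Assumed to match TfidfVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def vocabulary_sizes(docs):
    """Return the number of unique terms seen after each document."""
    seen = set()
    sizes = []
    for doc in docs:
        seen.update(TOKEN_PATTERN.findall(doc.lower()))
        sizes.append(len(seen))
    return sizes

# Made-up toy documents, only for illustration
toy_docs = ["kissa ja koira", "koira ja lintu", "kissa ja lintu"]
sizes = vocabulary_sizes(toy_docs)
print(sizes)  # [3, 4, 4]
```

On the real data, the corresponding number can be read from the fitted vectorizer, whose `vocabulary_` attribute is a dictionary from terms to indices.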

I'll be in the room so ask if you don't understand something.

# Exercise 2 - Ranking on/off

Find the line in the code that ranks the results by their score and turn the ranking off. That will degrade the system into simple keyword matching.

1. When you search for ["kissa","koira"] - is this degraded system returning documents for "kissa AND koira" or is it returning documents for "kissa OR koira"?
2. Can you explain why the first hit you get is the page for Adolf Hitler?

# Exercise 3 - Fix the system

The system crashes if you query it with a word that is not in the vocabulary. Fix it. :)

In [2]:
search(["somewordwhichdoesnotexist"],td_matrix_wiki,tfv_wiki)

KeyError: 'somewordwhichdoesnotexist'

# Exercise 4 - Does it scale?

The "system" is just a handful of lines of Python. I wonder how well it scales? Let's see. Here are your tasks:

1. Index 10,000, 50,000 and 100,000 articles from the Finnish Wikipedia (or any other language you like)
2. Report for these three data sizes:
   1. How long does the indexing take?
   2. How long does it take to answer various queries of your choice - are there big differences?
   3. Roughly how much memory does the system take (see the `top` command)?
3. Based on your experience in (2), is our quick'n'dirty IR system totally sucky or not?

In [None]:
# Here is how you can measure the time spent doing something
import time
start=time.time() #Current time
#... do something here
x=sum(x for x in range(1000000)) #waste some time
end=time.time()
print("Spent",end-start,"seconds")
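
Memory can also be checked from inside Python, as a complement to watching `top`. The sketch below is one possible way, assuming a Unix-like system such as the course server (the standard `resource` module is not available on Windows). `peak_memory_mb` is a hypothetical helper name, and the big list is only a stand-in for the indexing step.

```python
import resource
import sys

def peak_memory_mb():
    """Peak resident memory of this process so far, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_memory_mb()
data = [0] * 5_000_000  # stand-in for something memory-hungry, e.g. indexing
after = peak_memory_mb()
print("Peak memory before:", round(before), "MB, after:", round(after), "MB")
```

Note that this reports the peak for the whole process, so it never goes down; for a clean measurement, start a fresh Python process for each data size.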

# Exercise 5 - Tf.idf and other options

When creating the vectorizer in the function `index_wiki()`, you can turn various options that control its behavior on and off. The full list is in the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Index a decent number of documents from the wiki, say 25,000, and experiment with various queries and various combinations of parameters. Try at least IDF on/off and sub-linear TF on/off. Can you spot any differences in the results? Would you agree that the tf.idf weighting is superior? Write up your experiences.
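
To see what these two switches do to a single weight, here is a small pure-Python sketch. It assumes scikit-learn's documented formulas with default smoothing: the term frequency becomes `1 + ln(tf)` when `sublinear_tf=True`, and the IDF is `ln((1 + n) / (1 + df)) + 1` when `use_idf=True`. No normalization is applied, matching `norm=None` in `index_wiki()`. The function names and the counts for the two terms are made up for illustration.

```python
import math

def tf_weight(count, sublinear_tf):
    """Term-frequency part: raw count, or 1 + ln(count) when sublinear_tf is on."""
    if count == 0:
        return 0.0
    return 1.0 + math.log(count) if sublinear_tf else float(count)

def idf_weight(n_docs, doc_freq, use_idf):
    """IDF part with smoothing, ln((1 + n) / (1 + df)) + 1, or 1 when IDF is off."""
    if not use_idf:
        return 1.0
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_docs = 25000
# (term, occurrences in one document, number of documents containing it): made-up numbers
for term, count, doc_freq in [("common", 50, 20000), ("rare", 2, 10)]:
    for sublinear_tf in (False, True):
        for use_idf in (False, True):
            w = tf_weight(count, sublinear_tf) * idf_weight(n_docs, doc_freq, use_idf)
            print(term, "sublinear_tf =", sublinear_tf, "use_idf =", use_idf, "weight =", round(w, 2))
```

Comparing the printed weights for the frequent and the rare term under each combination gives a hint of what to look for in the real search results.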

# Exercise 6 - Intersect lists

In the lecture, I showed an algorithm to intersect several sorted lists of IDs. Write a function `intersect(list_of_lists)` which computes a list containing their intersection. So `intersect([[1,3,4,6],[2,3,6,9],[1,3,4,5,6,9,27]])` would return `[3,6]`. If you want to get fancy, you can accept a list of iterators and return an iterator. That way, you could chain these functions.
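
To check your implementation, it may help to compare it against a simple set-based reference. The sketch below is deliberately not the merge-style algorithm from the lecture: it ignores the fact that the lists are sorted and is meant only for testing. `reference_intersect` is a hypothetical name.

```python
def reference_intersect(list_of_lists):
    """Set-based intersection, for checking results only (not the lecture algorithm)."""
    if not list_of_lists:
        return []
    common = set(list_of_lists[0])
    for ids in list_of_lists[1:]:
        common &= set(ids)
    return sorted(common)

example = [[1, 3, 4, 6], [2, 3, 6, 9], [1, 3, 4, 5, 6, 9, 27]]
print(reference_intersect(example))  # [3, 6]
```

Once your own `intersect` exists, `assert intersect(example) == reference_intersect(example)` on a few random inputs is a quick sanity check.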