# Document Analysis Assignment 1: Information Retrieval

In this assignment, your task is to index a new document collection into *Elasticsearch* and measure search performance on a set of predefined queries. 
A new document collection containing more than 10000 government site descriptions and a set of predefined queries will be provided for this assignment.
Throughout this assignment, 
1. you will gain a better understanding of the indexer, including tokenisers, parsers, and normalisers, to improve search performance under a predefined evaluation metric, 
2. you will gain a better understanding of the search algorithm to obtain better search results, and 
3. you will find the best way to combine the indexer and the search algorithm to maximise performance.

Below, you will solve five programming assignments (Q1-Q5), followed by three written assignments. We will check the correctness of your code, but the programming assignments will be scored based on your performance in the Kaggle competition.
- Write your code after `# YOUR CODE HERE`, and remove `raise NotImplementedError` after implementation.
- Written assignments should be written in the given notebook cells. Please write them directly into the designated cells, and upload the notebook file to the Wattle page.
- Write answers in this notebook file, and upload the file to the Wattle submission site. **Please rename the Jupyter notebook file (`Assignment1.ipynb`) to `your_uid.ipynb` (e.g. `u6000001.ipynb`) and submit it with your written answers therein**. Do not upload any other files to Wattle except this notebook file.

For the Kaggle competition:
- Note that you are only allowed to upload **3 copies** of your results to Kaggle per day. Make every upload count, and don't waste your opportunities!

Score distribution (total 10 points)
- Kaggle competition (Q1-Q5): 4 points
- Written assignment 1: 2 points
- Written assignment 2: 2 points
- Written assignment 3: 2 points


## Coding assignment (Q1 - Q5), 4 points

## Q1. Index Gov dataset

Download a collection of government documents, `gov.zip`, from Wattle, and unzip the file. The unzipped archive contains two sub-folders: `documents` and `topics`. The `documents` folder consists of sub-folders, each of which contains multiple documents. Your first job is to index the documents as we have done in the lab exercises.

- Note that, depending on your machine, indexing may take from several minutes to a few hours. You may implement a multi-threaded version of indexing to mitigate the problem.
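
The directory walk that indexing requires can be sketched as follows. This is only a sketch under stated assumptions, not the demo's implementation: the helper name `iter_index_actions` is hypothetical, files are read as UTF-8 with undecodable bytes ignored, and the `_type` key assumes an Elasticsearch version that still uses mapping types (as the `doc` mapping in this notebook does). The yielded dictionaries follow the action format accepted by `elasticsearch.helpers.bulk`.

```python
import io
import os


def iter_index_actions(doc_path, index_name, doc_type="doc"):
    """Yield one bulk-index action per file found under doc_path, including sub-folders.

    `iter_index_actions` is a hypothetical helper name used for illustration.
    """
    for root, _dirs, files in os.walk(doc_path):
        for name in sorted(files):
            full_path = os.path.join(root, name)
            # Assumption: documents are (mostly) UTF-8; undecodable bytes are skipped.
            with io.open(full_path, encoding="utf-8", errors="ignore") as fh:
                text = fh.read()
            yield {
                "_index": index_name,
                "_type": doc_type,  # assumes a version with mapping types
                "_source": {"filename": name, "path": full_path, "text": text},
            }
```

With a live connection, the generator could be passed to `elasticsearch.helpers.bulk(es_conn, iter_index_actions(DOCS_PATH, INDEX_NAME))`. The same module also offers `parallel_bulk`, a multi-threaded variant that may shorten indexing time.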

Below we provide the basic configuration for indexing.

In [1]:
# basic configuration for indexing
basic_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "char_filter": [
            "html_strip"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }   
}

You need to implement the function `build_gov_index` below. Don't forget to remove `raise NotImplementedError` after implementation.

In [2]:
from elasticsearch import Elasticsearch

ES_HOSTS = ['http://localhost:9200']
DOCS_PATH = 'gov/documents'
INDEX_NAME = 'gov'
DOC_TYPE = 'doc'

def build_gov_index(es_conn, index_name, doc_path, settings):
    # TODO: implement a function that
    # 1. creates an index named `index_name`; if `index_name` already exists, remove the index first.
    # 2. indexes the documents under doc_path, including sub-folders, into Elasticsearch (hint: read the demo carefully)
    # Note that this function will be used throughout this assignment
    # YOUR CODE HERE
    raise NotImplementedError()

es_conn = Elasticsearch(ES_HOSTS)
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, basic_settings)


## Q2. Search and measure performance

For the second task, you first need to read the `topics/gov.topics` file. As in the lab demo session, each line of this file is formatted as

`query_id query_terms`

`query_id` is a number, and `query_terms` consists of multiple keywords used as search terms. Your job is to read the query file and search using the provided `search` function. You need to write the search results to the `output.csv` file. The first line of `output.csv` should be the following header:

`QueryId,DocumentNumber,Similarity,Iteration,RunId,Rank`, 

but you only need to fill in `QueryId` and `DocumentNumber` for this assignment. Specifically, for each query, you will

1. rank the output of search results based on their scores,
2. write the top 10 documents, by highest score, to `output.csv`. Except for `QueryId` and `DocumentNumber`, all other fields should be filled with 0s. For example, each line of the output file will be formatted as:
    
    `1,G00-27-1804490,0,0,0,0`
    
    where `1` is the `query_id`, and `G00-27-1804490` is the file name of the retrieved document (=`DocumentNumber`).
    
    Whitespace is not allowed between fields, i.e. `1, G00-27-1804490, 0, 0, 0, 0` will not be evaluated properly.
    - **Note that you are only allowed to write at most 10 documents for each query**. If your output file contains more than 10 documents per query, **you will receive a score of 0 for the programming assignment**.
    
3. upload the `output.csv` file to the [Kaggle competition site](#Upload-output-file-to-Kaggle-competition-site) (see the details below). Check the performance of the basic search algorithm in terms of Precision@10.
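
The line parsing and row formatting described in the steps above can be sketched as follows. The helper names `parse_topic_line` and `format_rows` are hypothetical, and the sketch assumes each hit is a dictionary with `_score` and `_source.filename`, which is the shape returned by the `search` function in this notebook. Elasticsearch already returns hits sorted by score by default, so the explicit sort is only a safeguard.

```python
def parse_topic_line(line):
    """Split one `query_id query_terms` line into its two parts."""
    query_id, query_terms = line.strip().split(None, 1)
    return query_id, query_terms


def format_rows(query_id, hits, limit=10):
    """Return at most `limit` CSV rows for one query, best-scoring hit first."""
    ranked = sorted(hits, key=lambda hit: hit["_score"], reverse=True)[:limit]
    # Only QueryId and DocumentNumber are filled; the other four fields are 0.
    # No whitespace is placed between fields.
    return ["{},{},0,0,0,0".format(query_id, hit["_source"]["filename"])
            for hit in ranked]
```

Each returned row still needs a trailing newline when it is written to `output.csv`.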

In [None]:
def search(query_string, es_conn, index_name):
    '''
        searches for query_string with default search algorithm
        input:
            - query_string: a query
            - es_conn: elasticsearch connection
            - index_name: name of index
        output:
            - a list of hits (dicts) ordered by descending score; the filename is in
              hit['_source']['filename'] and the score is in hit['_score']

    '''
    res = es_conn.search(index = index_name,
        body = {
            "_source": [ "filename"],
            "query": {
                "query_string": {
                    "query": query_string,
                }
            }
        }
    )
    return res['hits']['hits']

# TODO: read query file from `query_path`, search using `search_fn`,
#       and write top 10 outputs per query to `output_file`
#       Note that the function takes a search function as an argument, you can directly call the search function
#          as `result = search_fn(query_string, es_conn, index_name)` within the function.
#       This function will be used throughout this assignment
def read_search_write_output(search_fn, query_path, output_file):
    with open(output_file, 'w') as output:
        output.write('QueryId,DocumentNumber,Similarity,Iteration,RunId,Rank\n')  #for your convenience

        # YOUR CODE HERE
        raise NotImplementedError()

query_path = 'topics/gov.topics'
output_file = 'output.csv'
read_search_write_output(search, query_path, output_file)

## Q3. Improve indexer

In this task, you will change the configuration of the indexer (`basic_settings`) to improve search performance.

Please look at the official Elasticsearch documentation [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) for a better understanding of the configuration and other options.
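
As an illustration of the options described in that documentation, the fragment below shows one possible analyzer definition. It is a hedged starting point rather than a recommended answer, and the variable name `candidate_analyzer` is hypothetical. The names `standard`, `lowercase`, `stop`, `porter_stem`, and `html_strip` are built-in Elasticsearch components. Whether this combination improves Precision@10 on this collection is something you need to verify yourself.

```python
# A candidate body for "my_analyzer" (illustrative only).
candidate_analyzer = {
    "type": "custom",
    "char_filter": ["html_strip"],   # strip HTML markup before tokenising
    "tokenizer": "standard",         # splits on word boundaries, drops most punctuation
    "filter": [
        "lowercase",                 # placed before "stop": the stop filter is case-sensitive by default
        "stop",                      # removes English stop words
        "porter_stem",               # reduces words to their stems
    ],
}
```

The order of token filters matters: with the `whitespace` tokenizer and no `lowercase` filter, as in `basic_settings`, a capitalised stop word such as `The` is not removed.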

Note that you can check how your tokeniser tokenises an input string via the `analyze_query` function provided in the demo code.
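
The `analyze_query` helper from the demo code is not reproduced in this notebook, so the following is a hypothetical stand-in (`show_tokens`), not the demo's implementation. It assumes the same client generation as the rest of this notebook, where `es_conn.indices.analyze` accepts a `body` argument, and it assumes the analyzer is named `my_analyzer`.

```python
def show_tokens(es_conn, index_name, text, analyzer="my_analyzer"):
    """Return the tokens that `analyzer` (defined on `index_name`) produces for `text`."""
    response = es_conn.indices.analyze(
        index=index_name,
        body={"analyzer": analyzer, "text": text},
    )
    # The analyze API responds with a list of token descriptions under "tokens".
    return [token["token"] for token in response["tokens"]]
```

Comparing the tokens produced for a query with those produced for a sentence from a document is a quick way to see whether the two would match.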

In [None]:
# TODO: configure settings to define your own analyzer for indexing
q3_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
          "my_analyzer": {
            # YOUR CODE HERE
            raise NotImplementedError()
        }
      }
    }
  }
}

In [None]:
# TODO: run this block to generate an output based on q3_settings defined above.
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, q3_settings)
read_search_write_output(search, query_path, output_file)

Upload the final output to Kaggle to check the difference in Precision@10.

## Q4. Improve search algorithm

*Elasticsearch* also provides multiple configurable scoring algorithms. For this task, you will look for a better similarity module to improve search performance. Please refer to the documentation [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/similarity.html) for a better understanding of the configurable Elasticsearch similarity modules.
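
As a sketch of what a configurable similarity looks like, the fragment below defines a custom BM25 similarity and refers to it from the `text` field. The names `my_bm25`, `bm25_index_settings`, and `text_field_mapping` are hypothetical; the values shown for `k1` and `b` are the documented BM25 defaults, not tuned values.

```python
# Would sit under "settings", next to the "analysis" block (illustrative only).
bm25_index_settings = {
    "index": {
        "similarity": {
            "my_bm25": {        # hypothetical name for the custom similarity
                "type": "BM25",
                "k1": 1.2,      # term-frequency saturation (documented default)
                "b": 0.75,      # document-length normalisation (documented default)
            }
        }
    }
}

# Would replace the body of the "text" property in the mapping.
text_field_mapping = {
    "type": "text",
    "similarity": "my_bm25",    # must match the name defined above
    "analyzer": "my_analyzer",
    "search_analyzer": "my_analyzer",
}
```

Unlike the `boolean` similarity used in `basic_settings`, which scores only on whether terms match, BM25 takes term frequency and document length into account.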

You can also change the `search` function to improve performance. Please refer to the documentation [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) for a better understanding of the Query DSL used in *Elasticsearch*.
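
One Query DSL alternative to the `query_string` query is a `match` query, sketched below. The helper name `build_match_body` and its parameters are hypothetical, and this is one option among many rather than the expected solution. The resulting dictionary could be passed as the `body` argument of `es_conn.search`.

```python
def build_match_body(query_string, field="text", size=10):
    """Build a search body that runs a `match` query on `field`."""
    return {
        "_source": ["filename"],
        "size": size,  # number of hits to return
        "query": {
            "match": {
                field: {
                    "query": query_string,
                    "operator": "or",  # a document matches if any term matches
                }
            }
        },
    }
```

A `match` query treats the query text as plain terms, whereas `query_string` parses it with a query syntax in which some characters have special meaning.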

In [None]:
# TODO: define your own analyzer for indexing and searching
q4_settings = {
  "mappings": {
    "doc": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
            # YOUR CODE HERE
            raise NotImplementedError()
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
            # YOUR CODE HERE
            raise NotImplementedError()
      }
    }
  }
}


# TODO: change search algorithm to improve the search results, the return type should be the same as that of `search` function
def my_search(query_string, es_conn, index_name):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# TODO: run this block to generate an output based on q4_settings and my_search defined above.
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, q4_settings)
read_search_write_output(my_search, query_path, output_file)

## Q5. Find the best combination

Now it's time to explore the best configuration of indexer and search algorithm. Each combination will yield a different search outcome. Try different combinations and report the best results below.

In [None]:
# TODO: find the best combination of indexer configuration and search algorithm to maximise the performance of search result.
best_settings = {
    # YOUR CODE HERE
    raise NotImplementedError()
}

def best_search(query_string, es_conn, index_name):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# TODO: run this block to generate the output
build_gov_index(es_conn, INDEX_NAME, DOCS_PATH, best_settings)
read_search_write_output(best_search, query_path, output_file)

Answer the following questions based on your implementation from Q1 to Q5.

# Written Assignment 1 (2pt)

What changes did you make to the **indexer** to improve the performance of the system? Why do you think they improve the performance?

(Provide your answer as a bullet list with 2-3 items. Check [this](https://sourceforge.net/p/jupiter/wiki/markdown_syntax/#md_ex_lists) if you are not familiar with markdown syntax.)

YOUR ANSWER HERE

# Written Assignment 2  (2pt)

What changes did you make to the **search algorithm** to improve the performance of the system? Why do you think they improve the performance?

(Provide your answer as a bullet list with 2-3 items.)

YOUR ANSWER HERE

# Written Assignment 3 (2pt)

In the Kaggle competition, we only use Precision@10 as an evaluation metric. What other metrics can be used to measure the performance of an IR system for the government document collection? Provide two metrics and explain why.

(Provide your answer as a bullet list with 2-3 items.)

YOUR ANSWER HERE

# Upload output file to Kaggle competition site

Once you have generated the `output.csv` file, you can upload your result to the Kaggle competition site. To upload and evaluate your result:

1. Go to Kaggle competition site: [Click here](https://www.kaggle.com/t/26d21af1d3ea4447b86737fe889160ff).
1. Sign up for Kaggle if you do not have an account, then go back to the [original Kaggle page](https://www.kaggle.com/t/26d21af1d3ea4447b86737fe889160ff).
1. Before submitting your result, go to the `team` menu and change **your team name to your university ID**.
![ChangeUID](images/changeuid.png)
1. Time to submit your own result. Click `submit predictions` in the menu; you may need to agree to the competition rules before submitting your result.
1. Upload your output csv file. You can add a description of your submission in the description box.
    Note that you are only allowed to submit **3 results per day**. Do not upload arbitrary results; think carefully about which algorithm or parser is likely to perform best before submitting.
1. If your output format is correct, the system will generate your score automatically.
1. Go to the `Leaderboard` menu. The leaderboard will show the current scores of the other students.
![Leaderboard](images/leaderboard.png)


Note that you can check all of your submissions from the `my submission` menu. Please select your best-performing submission before the assignment is due. The selected submission will be used to measure performance on the *hidden* test cases (see below for details).
![Check](images/check.png)

### Evaluation

The system uses *precision@10* to measure the performance of each submission, that is, the fraction of your top 10 returned documents that are relevant to the query.
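
The metric can be sketched in a few lines. This follows the standard definition of precision at `k`; the exact implementation used by the competition site is not shown in this notebook, and the function name `precision_at_k` and the document ids in the example are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    # Dividing by k (not by the number of documents returned) means that
    # returning fewer than k documents cannot raise the score.
    return hits / float(k)


# Two of the top four documents are relevant.
score = precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}, k=4)
```

Here `score` is 0.5. Note that the metric ignores the order of documents within the top `k`.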

It is important to understand that the leaderboard score is computed on only half of the test cases; the score on the remaining half will be computed after the deadline, based on your selected submission. This ensures that your performance is not only good on the known test cases but also generalises to the unknown ones. We will combine these two scores to grade the first assignment.