# Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

![TREC logo](img/treclogo-c.gif)

# Conferences

Special Interest Group on Information Retrieval: <a href='https://sigir.org'>SIGIR</a> <br>
Text REtrieval Conference: <a href='https://trec.nist.gov/proceedings/proceedings.html'>TREC</a>  <br>
European Conference on Information Retrieval: <a href='http://www.ecir2018.org/'>ECIR</a>  <br>

# Example papers:
<ul>
<li><a href='http://nrl.northumbria.ac.uk/30863/1/SIGIR2017_Elsweiler.pdf'>Exploiting Food Choice Biases for Healthier <b>Recipe Recommendation</b></a> -> dataset of food recipes with nutrition information crawled from Allrecipes.com
<li>CitySearcher: A <b>City Search</b> Engine For Interests 
<li>A Test Collection for Evaluating <b>Legal Case Law Search</b>
<li>Multihop Attention Networks for <b>Question Answer Matching</b>
<li>Semantic Location in <b>Email Query Suggestion</b>
<li>Online <b>Job Search</b>: Study of Users’ Search Behavior using Search Engine Query Logs	
</ul>

# Most Valued Projects:

- Something Useful for Sofia University, the Master's Degree, etc. (contact Prof. Koychev)
- Participating in Shared Tasks (contact us or Prof. Koychev)

## Some project ideas
- Grammarly or [Hemingway](http://www.hemingwayapp.com/) for Bulgarian
- Collect/crawl questions and answers from exams after 4th/12th grade (there are a lot of online resources!). This would serve as a good starting point for building a Machine Reading/Question Answering model for Bulgarian!

# Some Shared Tasks

## <a href='http://alt.qcri.org/semeval2019/index.php?id=tasks'>SemEval</a>
- Fact Checking in Community Question Answering Forums
- Suggestion Mining from Online Reviews and Forums
- RumourEval 2019: Determining Rumour Veracity and Support for Rumours
- many more

tbc.

# Basic (pre-)requisites

## Python basics:
- http://nbviewer.jupyter.org/github/justmarkham/python-reference/blob/master/reference.ipynb
- https://www.cs.put.poznan.pl/csobaniec/software/python/py-qrc.html
- https://www.stavros.io/tutorials/python/

## Jupyter Notebooks
- https://www.dataquest.io/blog/jupyter-notebook-tutorial/

## Text Processing Libraries
- NLTK - collection of libraries and tools for text processing, created by academics (not production ready)
- spaCy - Industrial-Strength Natural Language Processing
- scikit-learn - Machine Learning in Python
- Pandas - Structures and data analysis tools for Python
- Numpy - scientific computing with Python
- Keras, TensorFlow, PyTorch - deep learning libraries for Python

# Books:
## Information Retrieval 
- Book for the course: https://nlp.stanford.edu/IR-book/information-retrieval-book.html
## NLP
- Foundations of Statistical Natural Language Processing https://nlp.stanford.edu/fsnlp/
- Speech and Language Processing https://web.stanford.edu/~jurafsky/slp3/ (video lectures are also available on YouTube)

# Online sources

## Search for papers to find relevant work and existing approaches
- https://scholar.google.com/
- https://www.researchgate.net/

## Corpora
- https://toolbox.google.com/datasetsearch
- https://www.kaggle.com/
- https://archive.ics.uci.edu/ml/datasets.html

## Facebook groups
- https://www.facebook.com/groups/1034542806576291/
- https://www.facebook.com/groups/829586007120477/
- https://www.facebook.com/groups/machine.learning.bg/
- https://www.facebook.com/datasciencesoc/

## Misc
- https://www.kdnuggets.com/
- https://machinelearningmastery.com/start-here/
- https://nlpprogress.com/ - latest research in NLP
- https://paperswithcode.com/ - implementations of papers

# Incidence Matrices

In [49]:
sample_bbc_news_sentences = [
    "China confirms Interpol chief detained",
    "Turkish officials believe the Washington Post writer was killed in the Saudi consulate in Istanbul.",
    "US wedding limousine crash kills 20",
    "Bulgarian journalist killed in park",
    "Kanye West deletes social media profiles",
    "Brazilians vote in polarised election",
    "Bull kills woman at French festival",
    "Indonesia to wrap up tsunami search",
    "Tina Turner reveals wedding night ordeal",
    "Victory for Trump in Supreme Court battle",
    "Clashes at German far-right rock concert",
    "The Walking Dead actor dies aged 76",
    "Jogger in Netherlands finds lion cub",
    "Monkey takes the wheel of Indian bus"
]

In [50]:
#basic tokenization
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
sample_bbc_news_sentences_tokenized = [tokenizer.tokenize(sent) 
                            for sent in sample_bbc_news_sentences]
sample_bbc_news_sentences_tokenized[0]

['China', 'confirms', 'Interpol', 'chief', 'detained']

In [51]:
sample_bbc_news_sentences_tokenized_lower = [[_t.lower() 
                                              for _t in _s] 
                for _s in sample_bbc_news_sentences_tokenized]
sample_bbc_news_sentences_tokenized_lower[0]

['china', 'confirms', 'interpol', 'chief', 'detained']

In [52]:
#get all unique tokens
unique_tokens = set(sum(sample_bbc_news_sentences_tokenized_lower, []))
list(unique_tokens)[:5]

['deletes', 'social', 'bull', 'tsunami', 'actor']

In [55]:
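# create the term-document matrix: rows are terms, columns are documents,
# cells hold term counts (a strict incidence matrix is the binary version: counts > 0)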
import numpy as np
incidence_matrix = np.array([[sent.count(token) for sent in sample_bbc_news_sentences_tokenized_lower] 
                    for token in unique_tokens])
print(incidence_matrix)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]


For a larger vocabulary this matrix can take too much memory (number of tokens × number of documents cells), even though it is sparse: almost all of its entries are zero.
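
To make the memory argument concrete, here is a minimal sketch that compares the number of cells in a dense term-document matrix with the number of non-zero cells. The three-headline corpus, the whitespace tokenization, and the names `dense_cells` and `nonzero` are assumptions chosen for illustration; they are not part of the notebook above.

```python
# Toy corpus (hypothetical): lower-cased headlines, tokenized on whitespace.
docs = [
    "us wedding limousine crash kills 20",
    "bull kills woman at french festival",
    "tina turner reveals wedding night ordeal",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))

# Dense matrix: one cell per (term, document) pair.
dense_cells = len(vocab) * len(tokenized)

# Sparse view: store only the non-zero cells as {(term, doc_id): count}.
nonzero = {}
for doc_id, doc in enumerate(tokenized):
    for token in doc:
        nonzero[(token, doc_id)] = nonzero.get((token, doc_id), 0) + 1

print(f"dense cells:    {dense_cells}")
print(f"non-zero cells: {len(nonzero)}")
print(f"density:        {len(nonzero) / dense_cells:.2%}")
```

Even on three short headlines most cells are zero. On a real collection each document contains only a tiny fraction of the vocabulary, so the gap between the two numbers grows quickly.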

![Inverted Index](img/inverted-index.png)
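
As a rough sketch of the idea in the figure, an inverted index can be kept as a dictionary that maps each term to a sorted list of document ids (its postings list). The function names `build_inverted_index` and `intersect` and the four sample headlines are hypothetical and used only for illustration.

```python
from collections import defaultdict


def build_inverted_index(tokenized_docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(tokenized_docs):
        for term in set(tokens):  # a document appears at most once per term
            index[term].append(doc_id)  # doc ids arrive in increasing order
    return dict(index)


def intersect(postings_a, postings_b):
    """Boolean AND: merge two sorted postings lists in linear time."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


headlines = [
    "us wedding limousine crash kills 20",
    "bulgarian journalist killed in park",
    "bull kills woman at french festival",
    "tina turner reveals wedding night ordeal",
]
index = build_inverted_index([h.split() for h in headlines])

print(index["kills"])
print(index["wedding"])
print(intersect(index["kills"], index["wedding"]))
```

Only terms that actually occur are stored, so memory grows with the number of (term, document) occurrences rather than with vocabulary size times collection size.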

# Dataset 
https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups 

In [6]:
!ls data/mini_newsgroups/sci.electronics/
!tail -50 data/mini_newsgroups/sci.electronics/52464 | head -10

52464 52830 53589 53676 53750 53820 53871 53918 54041 54111 54157 54248 54337
52758 53508 53640 53683 53769 53824 53872 53921 54042 54114 54160 54255 54353
52766 53511 53641 53692 53772 53829 53891 53935 54057 54115 54165 54265 54489
52792 53521 53653 53706 53777 53837 53892 53938 54066 54122 54175 54302 54490
52794 53529 53655 53708 53804 53839 53909 53971 54069 54132 54176 54305
52817 53569 53664 53712 53808 53850 53911 53976 54090 54140 54212 54306
52820 53574 53669 53741 53812 53865 53913 53986 54092 54143 54224 54310
52822 53584 53675 53742 53818 53868 53915 54010 54096 54147 54244 54325
Lines: 48

In article <1993Mar25.161909.8110@wuecl.wustl.edu> dp@cec1.wustl.edu (David Prutchi) writes:
>In article <C4CntG.Jv4@spk.hp.com> long@spk.hp.com (Jerry Long) writes:
>>Fred W. Culpepper (fculpepp@norfolk.vak12ed.edu) wrote:
>>[...]
>>A couple of years ago I put together a Tesla circuit which
>>was published in an electronics magazine and could have been
>>the circuit which is referred t

# You will now construct the Inverted Index: only the dictionary part (term and number of documents)

In [36]:
from nltk.tokenize import sent_tokenize
from collections import defaultdict, Counter
from string import punctuation
import os

In [37]:
def preprocess_document(content):
    """
    Returns a list of tokens for a document's content. 
    Tokens should not contain punctuation and should be lower-cased.
    """
    pass

def prepare_dataset(documents_dir):
    """
    Returns list of documents in the documents_dir, where each document is a list of its tokens. 
    
    """
    pass

__Example Output:__ <br>

prepare_dataset('data/mini_newsgroups/talk.politics.guns/')

>Found documents:  100 <br>
>[['path', 'cantaloupe.srv.cs.cmu.edu', 'crabapple.srv.cs.cmu.edu', 'fs7.ece.cmu.edu', 'europa.eng.gtefsd.com', 'howland.reston.ans.net', 'wupost', 'cs.utexas.edu', 'uunet', 'olivea', 'sgigate','sgiblab','adagio.panasonic.com',...
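
One possible way to approach the two stubs is sketched below. It is not the reference solution: the names `preprocess_sketch` and `prepare_dataset_sketch` are hypothetical, tokenization is plain whitespace splitting with punctuation trimmed from token edges (an assumption, so the tokens will not match the example output exactly), and reading files as `latin-1` is also an assumption.

```python
import os
from string import punctuation


def preprocess_sketch(content):
    """Hypothetical helper: lower-case, split on whitespace, trim edge punctuation."""
    tokens = []
    for raw in content.lower().split():
        token = raw.strip(punctuation)
        if token:  # skip tokens that consisted of punctuation only
            tokens.append(token)
    return tokens


def prepare_dataset_sketch(documents_dir):
    """Hypothetical helper: tokenize every regular file in documents_dir."""
    documents = []
    for name in sorted(os.listdir(documents_dir)):
        path = os.path.join(documents_dir, name)
        if not os.path.isfile(path):
            continue
        # Assumption: latin-1 decodes any byte sequence without raising.
        with open(path, encoding="latin-1") as handle:
            documents.append(preprocess_sketch(handle.read()))
    print("Found documents: ", len(documents))
    return documents


print(preprocess_sketch("Path: cantaloupe.srv.cs.cmu.edu >> Hello, World!"))
```

Trimming punctuation only at the edges keeps host names such as `cantaloupe.srv.cs.cmu.edu` intact. A tokenizer from NLTK would split the text differently.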

In [45]:
def document_frequency(tokenized_documents):
    """
    Returns a dictionary {token:number of documents containing the token}
    """
    pass

__Example Output:__<br><br>
selected_category = 'data/mini_newsgroups/sci.crypt/' <br>
print(selected_category) <br>
tokenized_dataset = prepare_dataset(selected_category) <br>
print("Sample tokenized document:") <br>
print(tokenized_dataset[0][:10])

> data/mini_newsgroups/sci.crypt/<br>
> Found documents:  100<br>
> Sample tokenized document:<br>
> ['newsgroups', 'sci.crypt', 'path', 'cantaloupe.srv.cs.cmu.edu', 'rochester', 'udel', 'bogus.sura.net', 'howland.reston.ans.net', 'torn', 'nott']

df = document_frequency(tokenized_dataset) <br>
print("Most common words:")<br>
print(Counter(df).most_common(10))<br>
print("Least common words:")<br>
print(Counter(df).most_common()[-10:])<br>

>Most common words:<br>
>[('apr', 100), ('newsgroups', 100), ('path', 100), ('message-id', 100), ('cantaloupe.srv.cs.cmu.edu', 100), ('from', 100), ('date', 100), ('sci.crypt', 100), ('subject', 100), ('lines', 99)]<br>
>Least common words:<br>
>[('herndon', 1), ('type', 1), ('stu-iii', 1), ('confident', 1), ('elephants', 1), ('ii', 1), ('combined', 1), ('fuzzy', 1), ('myktotronx', 1), ('amanda@intercon.com', 1)]
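
For comparison, a minimal sketch of a document-frequency count on toy data is given below. The name `document_frequency_sketch` and the three toy documents are made up for illustration and are not the course's reference solution.

```python
from collections import Counter


def document_frequency_sketch(tokenized_documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_documents:
        df.update(set(tokens))  # set(): a token counts once per document
    return dict(df)


toy_docs = [
    ["subject", "tesla", "coil", "coil"],
    ["subject", "key", "escrow"],
    ["subject", "tesla", "key"],
]
df = document_frequency_sketch(toy_docs)
print(Counter(df).most_common(3))
```

Here `coil` occurs twice in the collection but in only one document, so its document frequency is 1.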

## __Questions__: 
- Which words have the highest and which the lowest total frequency?
- What are some examples of inverted indexes we have seen in everyday life?
- Why do we use inverted indexes?

![Sort-Based-Index](img/sort-based-method.png)
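
Assuming the figure refers to sort-based index construction as described in the course book, the idea can be sketched in three steps: emit (term, document id) pairs, sort them, and group equal terms into postings lists. The function name `sort_based_index` and the toy documents are hypothetical, and everything fits in memory here, whereas a real implementation would sort blocks on disk and merge them.

```python
from itertools import groupby


def sort_based_index(tokenized_docs):
    """Build term -> (document frequency, postings) by sorting (term, doc_id) pairs."""
    # Steps 1 and 2: collect unique (term, doc_id) pairs, sort by term, then by doc id.
    pairs = sorted({(term, doc_id)
                    for doc_id, tokens in enumerate(tokenized_docs)
                    for term in tokens})
    # Step 3: consecutive pairs with the same term form that term's postings list.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index


docs = [["bull", "kills", "woman"], ["crash", "kills", "20"], ["bull", "festival"]]
for term, (doc_freq, postings) in sort_based_index(docs).items():
    print(term, doc_freq, postings)
```

Because the pairs are sorted first, both the dictionary and each postings list come out in order without any extra work.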

# What's next in research:

- <a href='http://nbjl.nankai.edu.cn/Lab_Papers/2018/SIGIR2018.pdf'>Index Compression for BitFunnel Query Processing</a>
- <a href='https://link.springer.com/chapter/10.1007/978-3-319-76941-7_47'>Inverted List Caching for Topical Index Shards</a>
