# RSS Wikipedia Search Dashboard

1. Initialize Danish Wikipedia
2. Initialize the RsspediaInit class
3. Initialize and start the RSSWikiDashboard class

**Global setup**

In [5]:
try:
    with open("../../global_setup.py") as setupfile:
        exec(setupfile.read())
except FileNotFoundError:
    print('Setup already completed')

In [6]:
%%html
<style>
.output_wrapper, .output {
    height:auto !important;
    max-height: 10000px;
}
</style>

**Import modules**

In [7]:
from src.text.document_retrieval.wikipedia import Wikipedia
from notebooks.exercises.src.text.news_wiki_search_init import RsspediaInit
from notebooks.exercises.src.text.news_wiki_dashboard import RSSWikiDashboard

## 1. Initialize Danish Wikipedia

In [12]:
wikipedia = Wikipedia(
    language="Danish",
    cache_directory_url=False
)

Loading parsed documents.
Loading preprocessed documents.
Wikipedia loaded.


## 2. Initialize RsspediaInit class

This step loads the data needed to perform search:
1. Okapi BM-25: uses precomputed Wikipedia TF and IDF vectors. No further preprocessing is done.
2. Explicit Semantic Analysis: loads the stored TF-IDF vectors, or computes them and stores them on disk.
3. FTN-a and FTN-b: loads the stored fastText vectors for Wikipedia titles and abstracts, or computes them and stores them on disk. As preprocessing, non-alphanumeric characters are removed and zero-length abstracts are dropped. Wikipedia documents and titles are adjusted accordingly.
<br><br>
For both (2) and (3) we remove the following stop words: <br>
<code>stop_words = ["den", "det", "denne", "dette", "en", "et", "om", "for", "til", "at", "af", "på", "som", "og", "er", "i"]</code>
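
The preprocessing described above can be sketched roughly as follows. The helper names `preprocess` and `drop_empty_abstracts` are hypothetical, and lowercasing and whitespace tokenization are assumptions made for this example; the actual `RsspediaInit` code may differ.

```python
import re

# Stop-word list as given in the text above.
STOP_WORDS = {"den", "det", "denne", "dette", "en", "et", "om", "for",
              "til", "at", "af", "på", "som", "og", "er", "i"}


def preprocess(text):
    """Strip non-alphanumeric characters and remove stop words.

    Lowercasing and whitespace tokenization are assumptions of this sketch.
    """
    # \W is Unicode-aware in Python 3, so Danish letters (æ, ø, å) are kept.
    cleaned = re.sub(r"[\W_]+", " ", text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]


def drop_empty_abstracts(titles, abstracts):
    """Keep only (title, tokens) pairs whose abstract is non-empty after cleaning."""
    kept_titles, kept_tokens = [], []
    for title, abstract in zip(titles, abstracts):
        tokens = preprocess(abstract)
        if tokens:  # zero-length abstracts are dropped together with their title
            kept_titles.append(title)
            kept_tokens.append(tokens)
    return kept_titles, kept_tokens
```

Filtering titles and abstracts in the same loop keeps the two lists aligned, which is what "adjusted accordingly" requires.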

In [13]:
rss_search_init = RsspediaInit(wikipedia = wikipedia)



Loading vectorized TF-IDF documents.
Vectorized TF-IDF documents loaded.


Initializing and loading Fasttext binaries.
Fasttext initialized and loaded.


Loading vectorized Fasttext documents.
Vectorized Fasttext documents loaded.


## 3. Initialize and start the RSSWikiDashboard

### 3.1 Types of search and their descriptions:

#### 3.1.1 Okapi BM-25

In information retrieval, Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s, was the first system to implement this function.

BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval.

Given a query $Q$ containing keywords $q_1, \dots, q_n$, the BM25 score of a document $D$ is:

$$
score(D,Q)=\sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i,D) \cdot (k_1+1)}{f(q_i,D)+k_1 \cdot \left(1-b+b \cdot \frac{|D|}{avgdl}\right)},
$$

where $f(q_i,D)$ is the term frequency of $q_i$ in the document $D$, $|D|$ is the length of the document $D$ in words, and $avgdl$ is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters, usually chosen, in the absence of advanced optimization, as $k_1\in [1.2,2.0]$ and $b=0.75$. $IDF(q_i)$ is the IDF (inverse document frequency) weight of the query term $q_i$.

It is usually computed as:

$$
IDF(q_i)=\log\frac{N-n(q_i)+0.5}{n(q_i)+0.5},
$$

where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing $q_i$.
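
As an illustration of the two formulas above, here is a minimal sketch in plain Python. The function names `bm25_idf` and `bm25_score`, the toy corpus, and the pre-tokenized input are assumptions made for this example; this is not the implementation used by `RsspediaInit`, which works from precomputed vectors.

```python
import math
from collections import Counter


def bm25_idf(term, tokenized_docs):
    """IDF(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5))."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5))


def bm25_score(query_terms, doc, tokenized_docs, k1=1.2, b=0.75):
    """BM25 score of one tokenized document `doc` for the given query terms."""
    avgdl = sum(len(d) for d in tokenized_docs) / len(tokenized_docs)
    term_freq = Counter(doc)
    score = 0.0
    for term in query_terms:
        f = term_freq[term]  # f(q_i, D); 0 if the term is absent
        denominator = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += bm25_idf(term, tokenized_docs) * f * (k1 + 1) / denominator
    return score


# Hypothetical toy corpus, already tokenized.
docs = [
    ["kat", "sidder"],
    ["hund", "løber"],
    ["fugl", "flyver"],
    ["fisk", "svømmer"],
    ["hest", "løber", "hurtigt"],
]

# Rank all documents for a one-word query, best match first.
ranking = sorted(docs, key=lambda d: bm25_score(["kat"], d, docs), reverse=True)
```

Note that with this IDF definition a term occurring in more than half of the documents gets a negative weight; some implementations clamp or shift the IDF to avoid that.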

#### 3.1.2 Explicit Semantic Analysis

#### 3.1.3 FTN-a

#### 3.1.4 FTN-b

### 3.2 Post-processing

In [10]:
rsswdb = RSSWikiDashboard(wikipedia, rss_search_init)
rsswdb.start

VBox(children=(Dropdown(description='Vælg nyhedskilde:', layout=Layout(width='400px'), options={'Politiken.dk'…

In [None]:
r = rsswdb.rsspedia_search.cdist_func([rss_search_init.sumVectorRepresentation("6")],
    [rss_search_init.sumVectorRepresentation("Guide Anmelderne anbefaler 6 gode spisesteder med pasta på menuen")])

In [None]:
import pprint
pprint.pprint(titles)