# Information Retrieval

Σουκαράς Σωτήριος ice21390206
Θεοφάνης Κουνιάκης ice21390103


### Step 1: Data collection

For data collection, a recursive data-scraping program for Wikipedia was built. Starting from one Wikipedia page, it recursively follows every link the page contains, down to a given depth. The results are saved to the file "wiki_scrape.json", where each JSON object holds the URL of a page and a list of that page's paragraphs.

**The ID "mw-content-text"** <br>
On Wikipedia pages, all of the substantive content sits inside the element with id "mw-content-text". The scraper follows only the links inside this element, so links that lead elsewhere, for example to the Wikipedia main page, are not explored.
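The depth-limited recursion described above can be tried without network access. The following is only a sketch under stated assumptions: `LINKS` is a hypothetical in-memory link graph standing in for real pages, and the `visited` set is an addition that the scraper in this report does not have. It is included to show how a page linked from several places could be fetched only once.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory "site": each page maps to the pages it links to.
LINKS: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def crawl(page: str, depth_limit: int, depth: int = 1,
          visited: Optional[Set[str]] = None) -> List[str]:
    """Return the pages visited from `page`, each at most once."""
    if visited is None:
        visited = set()
    visited.add(page)
    order = [page]
    # Stop following links once the depth limit is reached
    if depth >= depth_limit:
        return order
    for target in LINKS.get(page, []):
        if target in visited:   # already fetched: skip the duplicate request
            continue
        order.extend(crawl(target, depth_limit, depth + 1, visited))
    return order

print(crawl("A", 2))  # ['A', 'B', 'C']
print(crawl("A", 3))  # ['A', 'B', 'D', 'C']
```

With a depth limit of 2 only the start page and its direct links are visited, which matches the call used in this step.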

In [1]:
import requests
import json
import time
from bs4 import BeautifulSoup

def fetch_wikipedia(URL, depth_limit, depth = 1):
    parsed_paragraphs = {}
    print("Searching " + URL + "...")
    
    try:
        wiki_responce = requests.get(URL)
        wiki_responce.raise_for_status()        # Throw if error was encountered in the request

        # Parse the response with BeautifulSoup
        soup_responce = BeautifulSoup(wiki_responce.text, 'html.parser')
        soup_paragraphs = soup_responce.find_all('p')

        # Remove html tags and append them to the return values if they have text
        parsed_paragraphs[URL] = [p.text.strip() for p in soup_paragraphs if p.text.strip() != ""]

        # If the max depth of the search has been reached, exit the recursion
        if depth >= depth_limit:
            return parsed_paragraphs
        
        # Find the main content of the wiki article if exists
        body = soup_responce.find(id="mw-content-text")
        if not body:
            return parsed_paragraphs
        
        for link in body.find_all('a'):
            # Skip anchors without an href, links that do not point to another wiki article, and file pages
            href = link.get('href', '')
            if not href.startswith("/wiki/") or "File:" in href:
                continue

            # Search the next wiki link
            new_paragraphs = fetch_wikipedia("https://en.wikipedia.org" + link['href'], depth_limit, depth + 1)
            # Don't spam the wiki servers
            time.sleep(1)       

            # Skip empty return values
            if not new_paragraphs:
                continue
            
            # Append the return values to the dictionary 
            parsed_paragraphs.update(new_paragraphs)

        return parsed_paragraphs
    except Exception:   # a bare except would also swallow KeyboardInterrupt
        print("Unable to parse link: " + URL)
        return parsed_paragraphs


# Fetch info for the link with a max recursion depth of 2
results = fetch_wikipedia("https://en.wikipedia.org/wiki/World_War_II", 2)

filename = "wiki_scrape.json"

# Convert to json object
json_object = [
    {
        "website_url": website,
        "content": data_list, 
    }
    for website, data_list in results.items()
]

# Save as JSON file
try:
    with open(filename, "w", encoding="utf-8") as file:
        json.dump(json_object, file, indent=4)
    print(f"Data saved to JSON file: {filename}")
except IOError as e:
    print(f"Error saving to JSON file: {e}")


Searching https://en.wikipedia.org/wiki/World_War_II...
Searching https://en.wikipedia.org/wiki/WWII_(disambiguation)...
Searching https://en.wikipedia.org/wiki/The_Second_World_War_(disambiguation)...
Searching https://en.wikipedia.org/wiki/World_War_II_(disambiguation)...
Searching https://en.wikipedia.org/wiki/Junkers_Ju_87...
Searching https://en.wikipedia.org/wiki/Eastern_Front_(World_War_II)...
Searching https://en.wikipedia.org/wiki/Matilda_II...
Searching https://en.wikipedia.org/wiki/North_African_campaign...
Searching https://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki...
Searching https://en.wikipedia.org/wiki/Battle_of_Stalingrad...
Searching https://en.wikipedia.org/wiki/Raising_a_Flag_over_the_Reichstag...
Searching https://en.wikipedia.org/wiki/Reichstag_building...
Searching https://en.wikipedia.org/wiki/Battle_of_Berlin...
Searching https://en.wikipedia.org/wiki/Invasion_of_Lingayen_Gulf...
Searching https://en.wikipedia.org/wiki/Japanese_occupation

### NLTK Setup
The remaining programs need the NLTK library to work correctly. The code below checks for and downloads the required NLTK resources.


In [2]:
import nltk

# Download the resources used by the later cells.
# Each package is requested by name, because nltk.download() without
# arguments opens the interactive downloader.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Verify that the resources can be located and retry the downloads if not
try:
    nltk.data.find('corpora/stopwords')
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Kouniakis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Kouniakis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Kouniakis\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Kouniakis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Kouniakis\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Kouniakis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Step 2. Text preprocessing (Text Processing):
For text preprocessing, a program was created that takes the .json file from step 1 and:
1) splits the texts into words (tokenization),
2) removes the stop words (e.g. 'is', 'the', 'or'),
3) and converts each word to its dictionary form (lemmatization)
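The three steps above can be illustrated on a single sentence. This is a toy sketch, not the pipeline used in this report: `STOP_WORDS` and `LEMMAS` are small hand-written tables (assumptions) that stand in for NLTK's stop-word list and `WordNetLemmatizer`, and tokenization is done with a plain regular expression.

```python
import re
from typing import Dict, List, Set

# Toy stand-ins (assumptions): the real pipeline uses NLTK's stop-word list
# and WordNetLemmatizer instead of these hand-written tables.
STOP_WORDS: Set[str] = {"is", "the", "or", "a", "of", "were"}
LEMMAS: Dict[str, str] = {"tanks": "tank", "battles": "battle"}

def toy_preprocess(text: str) -> List[str]:
    # 1) Tokenization: lowercase and keep runs of letters/digits
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2) Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3) Lemmatization: map each token to its dictionary form when known
    return [LEMMAS.get(t, t) for t in tokens]

print(toy_preprocess("The tanks were a part of the battles."))
# ['tank', 'part', 'battle']
```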

In [3]:
import json
import re       #REGEX

import nltk.stem
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clear_special_char(text: str) -> str:
    brackets_regex = r"\[[^\]]*\]"
    alphanumeric_regex = r"[^a-zA-Z0-9\s]"

    # Remove references such as [55] or [a]
    parsed_text = re.sub(brackets_regex, "", text)

    # '-' is often used as a separator, so replace it with ' ' to split the words
    parsed_text = re.sub('-', " ", parsed_text)
    # Remove any non-alphanumeric character
    parsed_text = re.sub(alphanumeric_regex, "", parsed_text)

    # strip() returns a new string, so its result is what gets returned
    return parsed_text.strip()

def preprocess_text(text: str) -> str:
    # Lemmatizer and stop word objects for english
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    # Tokenize the paragraph
    tokens = word_tokenize(text)

    # Remove all the words inside the stop word container
    non_stopwords_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Lemmatize the remaining words
    lemmatized = [lemmatizer.lemmatize(word) for word in non_stopwords_tokens]

    # Join all the lemmatized words into a string
    final_text = " ".join(lemmatized)
    return final_text

# Open the save file
filepath = "wiki_scrape.json"
with open(filepath, "r", encoding="utf-8") as file:
    data = json.load(file)

# Parse all the text in the scrape file
parsed_data = {}
for index, site_data in enumerate(data):
    parsed_content = []
    link = site_data["website_url"]
    for p_index, paragraph in enumerate(site_data["content"]):
        # For each paragraph run clean and preprocess
        clear_paragraph = clear_special_char(paragraph)
        parsed_paragraph = preprocess_text(clear_paragraph)
        parsed_content.append(parsed_paragraph)
    
    parsed_data[link] = parsed_content

# Convert to json object
json_object = [
    {
        "website_url": website,
        "content": data_list, 
    }
    for website, data_list in parsed_data.items()
]

# Save as JSON file
filename = "parsed_scrape.json"
try:
    with open(filename, "w") as file:
        json.dump(json_object, file, indent=4)
    print(f"Data saved to JSON file: {filename}")
except IOError as e:
    print(f"Error saving to JSON file: {e}")

Data saved to JSON file: parsed_scrape.json


### Step 3. Index (Indexing):
The indexing program takes the preprocessed text and builds the inverted index. The inverted index is a structure in which each word is mapped to a list of all the documents the word appears in. The letters of every word are converted to lowercase so that the data is uniform and therefore easier to look up. <br>
Finally, the inverted index is saved to the file "inverted_index.json". <br><br>

**Inverted index format:** <br>
`word: [ document_link1, document_link2, document_link3 ]`
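As a small illustration of this format, the sketch below builds an inverted index for two made-up documents. The names `build_index`, `doc1` and `doc2` are hypothetical, and a `set` is used while building so that each link is stored once per word; this is one possible approach rather than the exact code of this step.

```python
from collections import defaultdict
from typing import Dict, List

def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map every lowercase word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for link, text in docs.items():
        for word in text.lower().split():
            index[word].add(link)       # a set ignores repeated links
    return {word: sorted(links) for word, links in index.items()}

# Hypothetical two-document corpus
docs = {"doc1": "Pacific war", "doc2": "war Asia"}
index = build_index(docs)

print(index["war"])      # ['doc1', 'doc2']
print(index["pacific"])  # ['doc1']
```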

In [8]:
import json
from collections import defaultdict

def create_inverted_index(data):
    inverted_index = defaultdict(list)

    # For all the entries in the file
    for index, site_data in enumerate(data):
        link = site_data["website_url"]

        # For each paragraph of the entry
        for p_index, paragraph in enumerate(site_data["content"]):
            # Convert to lowercase
            paragraph = paragraph.lower()

            # Take only the unique words inside the paragraph
            # Split them on whitespace
            words = set(paragraph.split())

            # For every unique word in the paragraph
            for word in words:
                # Append the website link if it is not already in the word's entry
                if link not in inverted_index[word]:
                    inverted_index[word].append(link)

    return inverted_index

# Open the save file
filepath = "parsed_scrape.json"
with open(filepath, "r", encoding="utf-8") as file:
    data = json.load(file)

inverted_index = create_inverted_index(data)

# Convert to dict
inverted_index = dict(inverted_index)

# Save to JSON
index_filename = "inverted_index.json"
try:
    with open(index_filename, "w", encoding="utf-8") as file:
        json.dump(inverted_index, file, ensure_ascii=False, indent=4)
    print(f"Inverted index saved to: {index_filename}")
except IOError as e:
    print(f"Error saving inverted index: {e}")

Inverted index saved to: inverted_index.json


### Step 4. Search engine (Search Engine):
#### a) Query processing (Query Processing):
The query processor accepts a simple Boolean query and, once it has located the documents for the requested words, builds the answer by applying the logical operators between the words. The implemented logical operators are AND, OR and NOT, and the default operation (when no logical operator is given) is OR. <br>
The logical operators AND, OR and NOT are removed from the stop-word set, because they must not be deleted while the query is being preprocessed.

In the example we ran the query: *Pacific and not Asia*
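To make the left-to-right evaluation concrete, here is a minimal sketch on a made-up index. `toy_index` and its document names are hypothetical, and the query is assumed to be already tokenized and lowercased. As in the description above, the default operator is OR, words missing from the index are skipped, and an operator stays active until the next one is read.

```python
from typing import Dict, List, Set

OPERATORS = {"and", "or", "not"}

def evaluate(tokens: List[str], index: Dict[str, List[str]]) -> Set[str]:
    result: Set[str] = set()
    op = "or"                       # default operation
    for token in tokens:
        if token in OPERATORS:      # an operator only changes the mode
            op = token
            continue
        if token not in index:      # unknown words are ignored
            continue
        docs = set(index[token])
        if op == "or":
            result |= docs          # union
        elif op == "and":
            result &= docs          # intersection
        else:
            result -= docs          # difference (NOT)
    return result

# Hypothetical index
toy_index = {"pacific": ["d1", "d2", "d3"], "asia": ["d2"], "japan": ["d3", "d4"]}

print(sorted(evaluate(["pacific", "and", "not", "asia"], toy_index)))  # ['d1', 'd3']
print(sorted(evaluate(["pacific", "and", "japan"], toy_index)))        # ['d3']
```

In *pacific and not asia* the token `not` overrides the preceding `and`, so the query reduces to the documents for `pacific` minus those for `asia`.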

In [9]:
import json

# NLTK imports
import nltk.stem
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def request_query(query: str, index: dict) -> set:
    logic_operators = {"and", "or", "not"}

    # Init nltk objects
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english')) - logic_operators

    # Tokenize query
    tokens = word_tokenize(query.lower())

    # Remove all the stop words inside the query tokens
    non_stopwords_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Lemmatize the remaining words
    lemmatized_query = [lemmatizer.lemmatize(word) for word in non_stopwords_tokens]

    # Remove any duplicates
    result = set()

    # Default search op is logic or
    op = "or"

    for token in lemmatized_query:
        # Change mode
        if token.lower() in logic_operators:
            op = token.lower()
            continue

        # Token is not found
        if token not in index:
            continue
        
        url_list = set(index[token])

        # Python sets support the logic operations directly
        if op == "or":
            result |= url_list  # If or join the two url lists
        elif op == "and":
            result &= url_list  # If and take the common links only
        elif op == "not":
            result -= url_list  # If not remove the links from the result

    return result

# Open the save file
filepath = "inverted_index.json"
with open(filepath, "r", encoding="utf-8") as file:
    data = json.load(file)

query = input("Request query: ")
sites = request_query(query, data)

for res in sites:
    print(res) 

https://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki
https://en.wikipedia.org/wiki/Winston_Churchill
https://en.wikipedia.org/wiki/Invasion_of_Lingayen_Gulf
https://en.wikipedia.org/wiki/Chiang_Kai-shek
https://en.wikipedia.org/wiki/Adolf_Hitler
https://en.wikipedia.org/wiki/Theater_(warfare)
https://en.wikipedia.org/wiki/Matilda_II


#### b) Result ranking (Ranking):
In this implementation, the user can choose among three retrieval algorithms.<br><br>
    **1. Boolean Retrieval** <br>
    The algorithm is largely the same as the implementation above, except that its results are sorted with the TF-IDF ranking algorithm.<br><br>
    **2. Vector Space Model** <br>
    The algorithm retrieves the data using a matrix of term frequencies over the documents' words. For this reason it takes the parsed scrape rather than the inverted index as input, since the matrix is easier to build that way.<br><br>
    **3. Okapi BM25** <br>
    A probabilistic ranking algorithm that scores each document from the words of the query and the words of the documents.
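For readers unfamiliar with BM25, the sketch below scores a few tokenized documents with a common textbook form of the formula. It is an illustration under stated assumptions: the function `bm25_scores`, the values `k1 = 1.5` and `b = 0.75`, and the non-negative IDF variant are choices made here, and the result is not guaranteed to match the scores produced by the `rank_bm25` library.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher weight (non-negative IDF variant)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical tokenized documents
docs = [
    ["pacific", "war", "japan"],
    ["europe", "war", "germany"],
    ["pacific", "pacific", "ocean"],
]
scores = bm25_scores(["pacific"], docs)
print(scores)
```

The third document mentions the query term twice and therefore scores highest, while the second document does not contain it and scores 0.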

In the test below we ran the query *Pacific and not Asia* with the VSM algorithm.
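The idea behind the VSM ranking can be sketched in plain Python: documents and the query become TF-IDF vectors, and documents are ordered by cosine similarity to the query. This is a simplified sketch with hypothetical documents; the weighting `tf * (ln(N / df) + 1)` is an assumption made for brevity and differs from the smoothed weighting that scikit-learn's `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query: List[str], docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    n = len(docs)
    # Document frequency and a simple (unsmoothed) IDF per term
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens: List[str]) -> Dict[str, float]:
        # Terms outside the corpus vocabulary are dropped
        tf = Counter(t for t in tokens if t in idf)
        return {t: c * idf[t] for t, c in tf.items()}

    q = vec(query)
    scored = [(name, cosine(q, vec(tokens))) for name, tokens in docs.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical tokenized documents
docs = {
    "d1": ["pacific", "war"],
    "d2": ["europe", "war"],
    "d3": ["pacific", "pacific", "ocean"],
}
ranked = rank(["pacific"], docs)
print([name for name, _ in ranked])  # ['d3', 'd1', 'd2']
```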


In [10]:
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from rank_bm25 import BM25Okapi

logic_operators = {"and", "or", "not"}

def preprocess_query(query: str, exclude_words: frozenset = frozenset()) -> str:

    # Init nltk objects
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english')) - exclude_words

    # Tokenize query
    tokens = word_tokenize(query.lower())
    # Remove all the stop words inside the query tokens
    non_stopwords_tokens = [word for word in tokens if word.lower() not in stop_words]
    # Lemmatize the remaining words
    lemmatized_query = " ".join([lemmatizer.lemmatize(word) for word in non_stopwords_tokens])

    return lemmatized_query

def ranking_TF_IDF(parsed_scrape: list, query: str, result_set: set = None):

    if not query:
        return []

    # Preprocess the query
    lemmatized_query = preprocess_query(query, logic_operators)

    # Map each URL to its paragraphs joined into a single string
    documents = {entry['website_url']: " ".join(entry['content']) for entry in parsed_scrape}
    
    # A result set has been provided
    if result_set is not None:

        # Remove documents that aren't in the result set
        documents = {url: content for url, content in documents.items() if url in result_set}

    # Nothing left to rank (TfidfVectorizer raises on an empty corpus)
    if not documents:
        return []

    # Init the TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents.values())

    query_vector = vectorizer.transform([lemmatized_query])

    scores = cosine_similarity(query_vector, tfidf_matrix).flatten()  # flatten() converts the 1 x N result to a vector
    ranked_results = sorted(zip(documents.keys(), scores), key=lambda x: x[1], reverse=True)

    return ranked_results
    
def boolean_retrieval(query: str, index: dict) -> set:
    
    if not query: 
        return set()
    
    # Preprocess the query to remove any unwanted words or characters, keeping the logic ops
    lemmatized_query = preprocess_query(query, logic_operators)

    # Remove any duplicates
    result = set()

    # Default search op is logic or
    op = "or"

    for token in lemmatized_query.split():
        # Change mode
        if token.lower() in logic_operators:
            op = token.lower()
            continue

        # Token is not found
        if token not in index:
            continue
        
        url_list = set(index[token])

        # Python sets support the logic operations directly
        if op == "or":
            result |= url_list  # If or join the two url lists
        elif op == "and":
            result &= url_list  # If and take the common links only
        elif op == "not":
            result -= url_list  # If not remove the links from the result

    return result

def vsm_retrieval(query: str, parsed_scrape: list):
    if not query:
        return []

    processed_query = preprocess_query(query)

    results = ranking_TF_IDF(parsed_scrape, processed_query)
    return results

def probabilistic_retrieval(parsed_scrape: list, query: str):
    if not query:
        return []

    # Preprocess the query
    processed_query = preprocess_query(query)

    # Map each URL to its paragraphs joined into a single string
    documents = {entry['website_url']: " ".join(entry['content']) for entry in parsed_scrape}

    # Tokenize documents
    tokenized_documents = [word_tokenize(doc.lower()) for doc in documents.values()]

    # Initialize BM25 model
    bm25 = BM25Okapi(tokenized_documents)

    # Tokenize the query
    tokenized_query = word_tokenize(processed_query)

    # Get BM25 scores
    scores = bm25.get_scores(tokenized_query)
    ranked_results = sorted(zip(documents.keys(),scores), key=lambda x: x[1], reverse=True)

    return ranked_results

def dataRetrival(inverted_index: dict, parsed_scrape: list, option: str, query: str) -> list:
    result_set = set()
    if option == "1":
        bool_results = boolean_retrieval(query, inverted_index)
        result_set = ranking_TF_IDF(parsed_scrape, query, bool_results)
    elif option == "2":
        result_set = vsm_retrieval(query, parsed_scrape)
    elif option == "3":
        result_set = probabilistic_retrieval(parsed_scrape,query)
    else:
        raise ValueError(f"Invalid retrieval method: \"{option}\"")
    
    return result_set

if __name__ == "__main__":
    # Open the inverted index save file
    with open('inverted_index.json', 'r') as file:
        inverted_index = json.load(file)

    # Open the parsed data (Used for TF-IDF matrix init)
    with open('parsed_scrape.json', 'r') as file:
        parsed_scrape = json.load(file)

    print("Options:")
    print("0. Exit")
    print("1. Boolean Retrieval")
    print("2. Vector Space Model (TF-IDF Ranking)")
    print("3. Okapi BM25")
    option = input("0,1,2,3: ")
    if option == "0":
        exit()

    query = input("Request query: ")

    result_set = dataRetrival(inverted_index, parsed_scrape, option, query)

    print("\nResults: ")
    for url, score in result_set:
        print(f"URL: {url}. Score: {score}")


Options:
0. Exit
1. Boolean Retrieval
2. Vector Space Model (TF-IDF Ranking)
3. Okapi BM25

Results: 
URL: https://en.wikipedia.org/wiki/Template:Campaignbox_World_War_II. Score: 0.525416978120242
URL: https://en.wikipedia.org/wiki/South-East_Asian_theatre_of_World_War_II. Score: 0.13196562210054344
URL: https://en.wikipedia.org/wiki/Pacific_War. Score: 0.10440782793030888
URL: https://en.wikipedia.org/wiki/World_War_II. Score: 0.08540729649592382
URL: https://en.wikipedia.org/wiki/Indian_Ocean_in_World_War_II. Score: 0.05381481754044277
URL: https://en.wikipedia.org/wiki/Empire_of_Japan. Score: 0.049830418553484836
URL: https://en.wikipedia.org/wiki/Template_talk:Campaignbox_World_War_II. Score: 0.03807856292675406
URL: https://en.wikipedia.org/wiki/American_Theater_(World_War_II). Score: 0.03437410663691899
URL: https://en.wikipedia.org/wiki/World_War_II_by_country. Score: 0.0324988986782311
URL: https://en.wikipedia.org/wiki/Allies_of_World_War_II. Score: 0.029072793902156127
URL: h

### Step 5. System evaluation:
To evaluate the system, a structure of *ground truths* was created and is used as the reference point for the evaluation. <br>
In the evaluation program, the user selects the algorithm to evaluate. The program then runs the queries listed in the ground truths through the retrieval function from the previous step, compares the search results with the ground truths, and produces the evaluation values.
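As background for the evaluation values, the sketch below computes precision, recall, F1 and average precision directly from a ranked list of links and a ground-truth set. It is a minimal set-based sketch with made-up link names and is offered as a reference for the definitions, not as the evaluation code of this report, which relies on `sklearn.metrics`.

```python
from typing import List, Set, Tuple

def precision_recall_f1(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float, float]:
    found = set(retrieved)
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at the rank of every relevant hit."""
    hits = 0
    running = 0.0
    for position, link in enumerate(ranked, start=1):
        if link in relevant:
            hits += 1
            running += hits / position
    return running / len(relevant) if relevant else 0.0

# Hypothetical ranked results and ground truth
ranked = ["a", "x", "b", "y"]
truth = {"a", "b", "c"}

p, r, f1 = precision_recall_f1(ranked, truth)
ap = average_precision(ranked, truth)
print(p, r, f1, ap)
```

Here two of the four retrieved links are relevant (precision 1/2) and two of the three ground-truth links are found (recall 2/3). Averaging the average precision over all queries gives the mean average precision (MAP).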

##### ground_truths.json:

In [6]:
data = [
    {
        "query": "Junkers Ju 87",
        "links": [
            "https://en.wikipedia.org/wiki/Junkers_Ju_87",
            "https://en.wikipedia.org/wiki/Battle_of_Britain",
            "https://en.wikipedia.org/wiki/Battle_of_France",
            "https://en.wikipedia.org/wiki/Invasion_of_Poland"
        ]
    },
    {
        "query": "Major Battles",
        "links": [
            "https://en.wikipedia.org/wiki/Battle_of_Stalingrad",
            "https://en.wikipedia.org/wiki/Allies_of_World_War_II",
            "https://en.wikipedia.org/wiki/Second_Sino-Japanese_War"
        ]
    },
    {
        "query": "United Kingdom and not Germany",
        "links": [
            "https://en.wikipedia.org/wiki/South-East_Asian_theatre_of_World_War_II",
            "https://en.wikipedia.org/wiki/Japanese_occupation_of_the_Philippines",
            "https://en.wikipedia.org/wiki/Attacks_on_Australia_during_World_War_II",
            "https://en.wikipedia.org/wiki/Matilda_II",
            "https://en.wikipedia.org/wiki/Invasion_of_Lingayen_Gulf"
        ]
    },
    {
        "query": "United States in the Pacific",
        "links": [
            "https://en.wikipedia.org/wiki/United_States",
            "https://en.wikipedia.org/wiki/Allies_of_World_War_II",
            "https://en.wikipedia.org/wiki/World_War_II",
            "https://en.wikipedia.org/wiki/Aftermath_of_World_War_II",
            "https://en.wikipedia.org/wiki/Axis_powers",
            "https://en.wikipedia.org/wiki/Soviet_Union",
            "https://en.wikipedia.org/wiki/South-East_Asian_theatre_of_World_War_II",
            "https://en.wikipedia.org/wiki/Pacific_War",
            "https://en.wikipedia.org/wiki/American_theater_(World_War_II)",
            "https://en.wikipedia.org/wiki/Empire_of_Japan",
            "https://en.wikipedia.org/wiki/Japanese_occupation_of_the_Philippines"
        ]
    }
]


*The `EvaluationValues` class is used to pass the data around more easily and to display the results.*

In [7]:
#import search_engine_2     # Use this outside Jupyter notebook
import json
import sklearn.metrics as metrics

# Data transfer class for the metrics
class EvaluationValues:
    _Precision: float
    _Recall: float
    _F1_score: float
    _Map: float

    # Print method
    def print_values(self):
        print(f"Precission: {self._Precision}")
        print(f"Recall: {self._Recall}")
        print(f"F1 Score: {self._F1_score}")
        print(f"Map: {self._Map}")


def evaluate_query(results, ground_truth_set: set) -> EvaluationValues:
    relevant_results = []

    # Keep only the relevant docs (non-zero score)
    for link, score in results:
        if score != 0:
            relevant_results.append(link)

    # Convert to lists for the sklearn lib
    y_true = [1 if link in ground_truth_set else 0 for link in relevant_results]    
    y_pred = [1 if link in relevant_results else 0 for link in ground_truth_set]

    # Pad the shorter list with 0 so that both have the same length
    size = max(len(y_true), len(y_pred))
    y_true += [0] * (size - len(y_true))
    y_pred += [0] * (size - len(y_pred))

    results = EvaluationValues()
    results._Precision = metrics.precision_score(y_true, y_pred)
    results._Recall = metrics.recall_score(y_true, y_pred)
    results._F1_score = metrics.f1_score(y_true, y_pred)
    results._Map = metrics.average_precision_score(y_true, y_pred)

    return results


if __name__ == "__main__":
    # Open the inverted index save file
    with open('inverted_index.json', 'r') as file:
        inverted_index = json.load(file)

    # Open the parsed data (Used for TF-IDF matrix init)
    with open('parsed_scrape.json', 'r') as file:
        parsed_scrape = json.load(file)
    
    # Use this outside Jupyter notebook
    #with open('ground_truths.json', 'r') as file:
    #   data = json.load(file)

    # Select the algorithm
    print("Select Evaluation Algorithm")
    print("0. Exit")
    print("1. Boolean Retrieval")
    print("2. Vector Space Model (TF-IDF Ranking)")
    print("3. Okapi BM25")
    
    option = input("0,1,2,3: ")
    
    if option == "0":
        exit()
    elif option != "1" and option != "2" and option != "3":
        raise Exception(f"Invalid option: {option}")

    for index, question in enumerate(data):
        query = question['query']
        ground_truths = set(question['links'])

        # Call the data retrieval func from the previous step
        #result_set = search_engine_2.dataRetrival(inverted_index, parsed_scrape, option, query)    # Use this outside Jupyter notebook
        result_set = dataRetrival(inverted_index, parsed_scrape, option, query)     # Use this inside Jupyter notebook

        ev = evaluate_query(result_set, ground_truths)

        # Print the results
        print(f"\nQuery: {query}. Evaluation Values:")
        ev.print_values()

    

Select Evaluation Algorithm
0. Exit
1. Boolean Retrieval
2. Vector Space Model (TF-IDF Ranking)
3. Okapi BM25

Query: Junkers Ju 87. Evaluation Values:
Precission: 1.0
Recall: 1.0
F1 Score: 1.0
Map: 1.0

Query: Major Battles. Evaluation Values:
Precission: 0.0
Recall: 0.0
F1 Score: 0.0
Map: 0.04838709677419355

Query: United Kingdom and not Germany. Evaluation Values:
Precission: 0.0
Recall: 0.0
F1 Score: 0.0
Map: 0.078125

Query: United States in the Pacific. Evaluation Values:
Precission: 0.7
Recall: 0.7
F1 Score: 0.7
Map: 0.5361538461538461
