In [1]:
from bs4 import BeautifulSoup
from tqdm import tqdm
import requests
from functions import *
import os
import re
from datetime import datetime
import csv
import pandas as pd
import json
pd.set_option("display.max_rows", None, "display.max_columns", None, "max_colwidth", None)

# 1. Data collection

First, we collected the links of the anime listed in the first 400 pages of the [top anime list](https://myanimelist.net/topanime.php). Once we had all the URLs, we downloaded the corresponding HTML pages and stored them in a local folder, so that we could work on them offline. After the crawling, the last step was to parse the downloaded pages, i.e. to extract only the requested information, and to save each anime in its own tsv file. <br>
Finally, we could start building the search engine!

## 1.1 Get the list of animes

**THE CELL BELOW MUST BE RUN ONLY ONCE, THEN COMMENTED OUT**

In [2]:
# get_link('https://myanimelist.net/topanime.php?limit=', 'anime_links.txt')
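
As a rough illustration of what the link collection has to iterate over, the sketch below builds the URLs of the listing pages. It assumes that each listing page shows 50 anime (as the page count computed later suggests) and that the `limit` parameter is the offset of the first entry. `listing_urls` is a hypothetical helper, not the `get_link` function from `functions.py`.

```python
BASE_URL = 'https://myanimelist.net/topanime.php?limit='

def listing_urls(n_pages, per_page=50):
    """Return the URL of each listing page.

    Assumption: `limit` is the offset of the first anime shown on the page.
    """
    return [f'{BASE_URL}{page * per_page}' for page in range(n_pages)]

urls_to_visit = listing_urls(400)
print(urls_to_visit[0])    # first listing page
print(urls_to_visit[-1])   # last listing page
```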

## 1.2 Crawl animes

In [29]:
# how many urls the text file contains
with open('anime_links.txt', 'r', encoding='utf-8') as file:
    urls = list(file.read().splitlines())
    len_list = len(urls)
    print("Number of anime found:", len_list)

print("Number of pages found:", len_list//50+1)

Number of anime found: 19130
Number of pages found: 383


As the output of the cell above shows, there were only 383 pages instead of 400, and only 19130 anime instead of 20000. This is not the only "problem" that we found...

**THE CELL BELOW MUST BE RUN ONLY ONCE, THEN COMMENTED OUT**

We used this cell to create the folders to store the html pages.

In [4]:
# os.mkdir('pages')
# for i in range(len_list//50 + 1):
#     os.mkdir(f'pages/page_{i + 1}')

**THE CELL BELOW MUST BE RUN ONLY ONCE, THEN COMMENTED OUT**

This cell was used to crawl each anime.

In [5]:
# change the start_ind value if the process stops
# and the crawling will continue from that index
# change stop_ind value to recover only the pages
# from start_ind to stop_ind

# start_ind = 0
# stop_ind = len_list
# crawl_html(start_ind)
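
To make the restart logic concrete, here is a minimal sketch of a resumable crawl loop. It is not the `crawl_html` function from `functions.py`: `crawl_range` and its `fetch` parameter are hypothetical, and `fetch` stands for any callable that maps a URL to its HTML text (for example a wrapper around `requests.get`), injected so that the sketch can run offline. The folder and file names follow the layout used in this notebook.

```python
import os

def crawl_range(urls, start_ind, stop_ind, fetch, out_dir='pages', per_page=50):
    """Download urls[start_ind:stop_ind] and return the index to restart from."""
    stop_ind = min(stop_ind, len(urls))
    for i in range(start_ind, stop_ind):
        folder = os.path.join(out_dir, f'page_{i // per_page + 1}')
        os.makedirs(folder, exist_ok=True)
        try:
            html = fetch(urls[i])
        except Exception as err:
            # stop and report the index from which the crawl can be resumed
            print(f'stopped at index {i}: {err}')
            return i
        file_path = os.path.join(folder, f'article_{i + 1}.html')
        with open(file_path, 'w', encoding='utf-8') as html_file:
            html_file.write(html)
    return stop_ind
```

If the function returns a value smaller than `stop_ind`, that value can be passed back as `start_ind` to continue from the page that failed.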

## 1.3 Parse downloaded pages

After the crawling, we found out that two pages, `article_7242.html` and `article_15009.html`, were missing because the corresponding anime URLs do not exist:<br>
https://myanimelist.net/anime/2644/Doraemon__Treasure_of_the_Shinugumi_Mountain<br>
https://myanimelist.net/anime/43247/Bing_Di_Lian

**THE CELL BELOW MUST BE RUN ONLY ONCE, THEN COMMENTED OUT**

We used this cell to create the folders that store the tsv file of each anime.

In [6]:
# os.mkdir('pages_tsv')
# for i in range(len_list//50 + 1):
#     os.mkdir(f'pages_tsv/page_{i + 1}')

In the cell below we store the requested information for each anime in its own tsv file: the first row contains the column names and the second row the values. We have already executed the cell, so there is no need to run it again and we can leave it commented out!

In [30]:
# for i in tqdm(range(len_list)):
#     # the files with the indexes below do not exist
#     if i == 7241 or i == 15008:
#         continue
#     page = i//50 + 1
#     with open(f'pages/page_{page}/article_{i+1}.html', 'r', encoding='utf-8') as file:
#         soup = BeautifulSoup(file, 'html.parser')
#         with open(f'pages_tsv/page_{page}/anime_{i+1}.tsv', 'w', encoding='utf-8') as tsv_file:
#             tsv = csv.writer(tsv_file, delimiter='\t')
#             tsv.writerow(['animeTitle', 'animeType', 'animeNumEpisode', 'releaseDate', 'endDate', 'animeNumMembers', 'animeScore', 'animeUsers', 'animeRank', 'animePopularity', 'animeDescription', 'animeRelated', 'animeCharacters', 'animeVoices', 'animeStaff'])
#             tsv.writerow([get_title(soup), get_type(soup), get_num_ep(soup), get_dates(soup)[0], get_dates(soup)[1], get_memb(soup), get_score(soup), get_users(soup), get_rank(soup), get_pop(soup), get_descr(soup), get_rel_an(soup), get_char(soup), get_voices(soup), get_staff(soup)])
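
The two-row layout can be checked on a toy file. The sketch below writes a header row and a value row with `csv.writer` and reads them back with `csv.DictReader`, which is how the tsv files are consumed later in the notebook. The column list is shortened and the values are made up; the file is written to a temporary folder.

```python
import csv
import os
import tempfile

columns = ['animeTitle', 'animeType', 'animeNumEpisode']   # shortened column list
values = ['Example Title', 'TV', '12']                      # made-up values

path = os.path.join(tempfile.mkdtemp(), 'anime_1.tsv')

# first row: column names, second row: the information
with open(path, 'w', encoding='utf-8', newline='') as tsv_file:
    writer = csv.writer(tsv_file, delimiter='\t')
    writer.writerow(columns)
    writer.writerow(values)

# DictReader uses the first row as keys for the rows that follow
with open(path, 'r', encoding='utf-8', newline='') as tsv_file:
    anime = next(csv.DictReader(tsv_file, delimiter='\t'))

print(anime['animeTitle'])
```

The `newline=''` argument follows the recommendation in the `csv` module documentation for files passed to `csv.writer` and `csv.reader`.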


# 2. Search Engine

## 2.1. Conjunctive query

In [8]:
# download all the nltk files needed
download()

# creates the vocabulary of every word in the Synopsis of every anime
# and stores it in a json file
create_vocab()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\clara\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\clara\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
100%|██████████| 383/383 [01:22<00:00,  4.62it/s]


### 2.1.1) Create your index!

In [9]:
# creates the inverted index dictionary and stores it in a json file
invertedIndex()

100%|██████████| 383/383 [00:26<00:00, 14.42it/s]
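
For readers who do not have `functions.py` at hand, the following toy sketch shows one way a vocabulary and an inverted index of this kind can be built and then used for a conjunctive query. It is an assumption about the general technique, not the code behind `create_vocab` and `invertedIndex`: `tokenize` is a simplified stand-in for `text_mining` (no stemming or stopword removal), the documents are made up, and term ids are kept as integers because nothing is written to json.

```python
docs = {
    'document_1': 'saiyan warrior race',
    'document_2': 'saiyan prince',
    'document_3': 'race across the galaxy',
}

def tokenize(text):
    # simplified stand-in for text_mining: lowercase and split on whitespace
    return text.lower().split()

vocabulary = {}       # word -> term id
inverted_index = {}   # term id -> list of documents containing the word

for doc_id, text in docs.items():
    for word in set(tokenize(text)):
        term_id = vocabulary.setdefault(word, len(vocabulary))
        inverted_index.setdefault(term_id, []).append(doc_id)

def conjunctive_query(query):
    """Return the documents that contain every word of the query."""
    postings = []
    for word in tokenize(query):
        if word not in vocabulary:
            return set()          # an unknown word cannot be in any document
        postings.append(set(inverted_index[vocabulary[word]]))
    return set.intersection(*postings) if postings else set()

print(conjunctive_query('saiyan race'))
```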


### 2.1.2) Execute the query

The cell below is used to open the vocabulary's terms, the inverted_index and the anime_links and to store them in global variables to use in other functions.

In [2]:
# opening the vocabulary
with open('vocabulary.json', 'r', encoding='utf-8') as voc_json:
    vocabulary = json.load(voc_json)

# opening the inverted_index
with open('inverted_index.json', 'r', encoding='utf-8') as inv_ind_json:
    inverted_index = json.load(inv_ind_json)

# opening the anime_links
with open('anime_links.txt', 'r', encoding='utf-8') as list_url_txt:
    list_url = list_url_txt.read().splitlines()

In [11]:
# taking in input the query
query = input("Insert the query for the inverted index: ")
assert len(query) > 0, "The query is empty!!"
# stemming the query
query_stemmed = text_mining(query)
print("Query:", query)

Query: saiyan race


In [12]:
# creating index query dictionary
query_dict = dict()
for word in query_stemmed:
    if word in vocabulary.keys():
        query_dict[vocabulary[word]] = inverted_index[str(vocabulary[word])]

In [13]:
# term ids of the query words that appear in the vocabulary
query_index = list(query_dict.keys())


# searching for the documents requested by the query:
# keep only the documents that contain every query word
doc_list = set(query_dict[query_index[0]]) if query_index else set()
for query_word in query_index[1:]:
    doc_list.intersection_update(query_dict[query_word])

# creating a pandas dataframe for the final result
doc_df = pd.DataFrame(columns=["animeTitle", "animeDescription", "Url"])
for doc in doc_list:
    # document ids look like 'document_365': extract the number
    i = int(''.join(re.findall(r'\d+', doc)))
    doc_page = (i-1)//50 + 1
    path = f'pages_tsv/page_{doc_page}/anime_{i}.tsv'
    with open(path, 'r', encoding='utf-8') as tsv_file:
        anime = next(csv.DictReader(tsv_file, delimiter='\t'))
    doc_df.loc[doc, ["animeTitle", "animeDescription", "Url"]] = [anime["animeTitle"], anime["animeDescription"], list_url[i-1]]

In [14]:
doc_d = dict(selector="th",
             props=[('text-align', 'center')])
def make_clickable(val):
    return '<a href="{}">{}</a>'.format(val, val)

doc_df.style.set_properties(**{'text-align':'center'}).set_table_styles([doc_d]).format({'Url': make_clickable})

Unnamed: 0,animeTitle,animeDescription,Url
document_365,Dragon Ball Z,"Five years after winning the World Martial Arts tournament, Gokuu is now living a peaceful life with his wife and son. This changes, however, with the arrival of a mysterious enemy named Raditz who presents himself as Gokuu's long-lost brother. He reveals that Gokuu is a warrior from the once powerful but now virtually extinct Saiyan race, whose homeworld was completely annihilated. When he was sent to Earth as a baby, Gokuu's sole purpose was to conquer and destroy the planet; but after suffering amnesia from a head injury, his violent and savage nature changed, and instead was raised as a kind and well-mannered boy, now fighting to protect others. With his failed attempt at forcibly recruiting Gokuu as an ally, Raditz warns Gokuu's friends of a new threat that's rapidly approaching Earth—one that could plunge Earth into an intergalactic conflict and cause the heavens themselves to shake. A war will be fought over the seven mystical dragon balls, and only the strongest will survive in Dragon Ball Z. [Written by MAL Rewrite]",https://myanimelist.net/anime/813/Dragon_Ball_Z
document_1035,Dragon Ball Kai,"Five years after the events of Dragon Ball, martial arts expert Gokuu is now a grown man married to his wife Chi-Chi, with a four-year old son named Gohan. While attending a reunion on Turtle Island with his old friends Master Roshi, Krillin, Bulma and others, the festivities are interrupted when a humanoid alien named Raditz not only reveals the truth behind Gokuu's past, but kidnaps Gohan as well. With Raditz displaying power beyond anything Gokuu has seen before, he is forced to team up with his old nemesis, Piccolo, in order to rescue his son. But when Gokuu and Piccolo reveal the secret of the seven mystical wish-granting Dragon Balls to Raditz, he informs the duo that there is more of his race, the Saiyans, and they won’t pass up an opportunity to seize the power of the Dragon Balls for themselves. These events begin the saga of Dragon Ball Kai, a story that finds Gokuu and his friends and family constantly defending the galaxy from increasingly more powerful threats. Bizarre, comical, heartwarming and threatening characters come together in a series of battles that push the powers and abilities of Gokuu and his friends beyond anything they have ever experienced.",https://myanimelist.net/anime/6033/Dragon_Ball_Kai
document_401,Dragon Ball Super: Broly,"Forty-one years ago on Planet Vegeta, home of the infamous Saiyan warrior race, King Vegeta noticed a baby named Broly whose latent power exceeded that of his own son. Believing that Broly's power would one day surpass that of his child, Vegeta, the king sends Broly to the desolate planet Vampa. Broly's father Paragus follows after him, intent on rescuing his son. However, his ship gets damaged, causing the two to spend years trapped on the barren world, unaware of the salvation that would one day come from an unlikely ally. Years later on Earth, Gokuu Son and Prince Vegeta—believed to be the last survivors of the Saiyan race—are busy training on a remote island. But their sparring is interrupted when the appearance of their old enemy Frieza drives them to search for the last of the wish-granting Dragon Balls on a frozen continent. Once there, Frieza shows off his new allies: Paragus and the now extremely powerful Broly. A legendary battle that shakes the foundation of the world ensues as Gokuu and Vegeta face off against Broly, a warrior without equal whose rage is just waiting to be unleashed. [Written by MAL Rewrite]",https://myanimelist.net/anime/36946/Dragon_Ball_Super__Broly
document_1469,Dragon Ball Z Special 1: Tatta Hitori no Saishuu Kessen,"Bardock, Son Goku's father, is a low-ranking Saiyan soldier who was given the power to see into the future by the last remaining alien on a planet he just destroyed. He witnesses the destruction of his race and must now do his best to stop Frieza's impending massacre. (Source: ANN)",https://myanimelist.net/anime/986/Dragon_Ball_Z_Special_1__Tatta_Hitori_no_Saishuu_Kessen


## 2.2) Conjunctive query & Ranking score

### 2.2.1) Inverted index

We create two dictionaries:
- inverted_index_tfidf, such that for each word we have the list of documents that contain it, together with the corresponding tf-idf score.
- inverted_doc, such that for each document we have the sum of the squares of its tf-idf scores; we will use this dictionary when executing the query.

We compute the tf-idf as $tf \cdot idf$:
- $tf=\frac{n_i}{|d|}$, where $n_i$ is the number of occurrences of the i-th word in the document and $|d|$ is the number of words in the document
- $idf=\log_{10}\left(\frac{N}{n_d}\right)$, where $N$ is the total number of documents and $n_d$ is the number of documents containing the word
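
The two formulas can be sketched on a toy corpus as follows. This is only an illustration of the definitions above, not the code of `invertedIndex_tfidf`: the documents are made up and already tokenized, and the dictionaries are keyed by word for readability (whether the stored json files are keyed by word or by term id is not shown here).

```python
import math
from collections import Counter

docs = {
    'document_1': ['saiyan', 'warrior', 'race', 'saiyan'],
    'document_2': ['saiyan', 'prince'],
    'document_3': ['space', 'race'],
}

N = len(docs)
# n_d: number of documents containing each word
doc_freq = Counter(word for tokens in docs.values() for word in set(tokens))

inverted_index_tfidf = {}   # word -> list of (document, tf-idf)
inverted_doc = {}           # document -> sum of the squared tf-idf scores

for doc_id, tokens in docs.items():
    squares = 0.0
    for word, n_i in Counter(tokens).items():
        tf = n_i / len(tokens)                  # n_i / |d|
        idf = math.log10(N / doc_freq[word])    # log10(N / n_d)
        tfidf = tf * idf
        inverted_index_tfidf.setdefault(word, []).append((doc_id, tfidf))
        squares += tfidf ** 2
    inverted_doc[doc_id] = squares

print(inverted_index_tfidf['saiyan'])
```

Storing the sum of squares per document means the document norm needed later is a single square root away.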

In [15]:
# creates the inverted_index_tfidf and inverted_doc dictionaries and stores them in a json file
invertedIndex_tfidf(vocabulary, inverted_index)

100%|██████████| 383/383 [00:27<00:00, 13.89it/s]


### 2.2.2) Execute the query

The cell below is used to open the inverted_index_tfidf and the inverted_doc and to store them in global variables to use in other functions.

In [3]:
# opening the inverted_index_tfidf
with open('inverted_index_tfidf.json', 'r', encoding='utf-8') as inv_ind_tfidf_json:
    inverted_index_tfidf = json.load(inv_ind_tfidf_json)

# opening the inverted_doc
with open('inverted_doc.json', 'r', encoding='utf-8') as inv_doc_json:
    inverted_doc = json.load(inv_doc_json)

Given a query, we get the set of documents containing all the words in the query and sort them according to their similarity to the query:

- First we consider only the documents that contain all the words in the query.

- We create a dictionary called numerator such that, for each document, we have the sum of the tf-idf scores (in that document) of the words in the query.

- We compute the cosine similarity for each document as $ cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \cdot |\vec{d}|}$ where:
<p> $\vec{q} \cdot \vec{d}$ is the dot product of the query and document vectors: since the query vector only has components equal to 1, it reduces to the value stored in the numerator dictionary for each document;</p>
<p> $|\vec{q}|$ and $|\vec{d}|$ are the norms of the query and document vectors. We compute $|\vec{q}|$ as the square root of the length of the query (because the query vector has only components equal to 1 corresponding to the query words). We compute $|\vec{d}|$ as the square root of the sum of the squares of the tf-idf of all words in the document (so we use the inverted_doc dictionary to compute the document norm).</p>

- Then we create the result dictionary to store, for each document, the corresponding cosine similarity to the query.
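
The steps above can be sketched as follows. This is a minimal illustration with made-up tf-idf weights, not the body of `top_k_documents`: `cosine_binary_query` is a hypothetical helper, duplicated query words are counted once (an assumption), and `heapq.nlargest` is used as a stand-in for whatever top-k selection the real function performs.

```python
import heapq
import math

def cosine_binary_query(query_words, doc_weights):
    """Cosine between a 0/1 query vector and a document's tf-idf vector.

    doc_weights maps word -> tf-idf for a single document.
    """
    words = set(query_words)
    numerator = sum(doc_weights.get(w, 0.0) for w in words)
    q_norm = math.sqrt(len(words))
    d_norm = math.sqrt(sum(v * v for v in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return numerator / (q_norm * d_norm)

# made-up tf-idf weights for two documents that both contain the query words
docs_tfidf = {
    'document_1': {'saiyan': 0.3, 'race': 0.4},
    'document_2': {'saiyan': 0.1, 'race': 0.1, 'planet': 0.7, 'king': 0.7},
}
query = ['saiyan', 'race']

result = {doc: cosine_binary_query(query, w) for doc, w in docs_tfidf.items()}
top_k = heapq.nlargest(1, result, key=result.get)
print(top_k, result)
```

In this toy case the shorter document ranks first: the query words carry most of its weight, while in the longer one they are diluted by other heavily weighted words.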

In [4]:
# taking in input the query
query_2 = input("Insert the query for the inverted index tfidf: ")
k = int(input("Insert k: "))
assert len(query_2) > 0, "The query is empty!!"
assert k > 0, "k should be a positive integer!!"
# stemming the query
query_stemmed_2 = text_mining(query_2)
print("Query:", query_2)
print('k:', k)

Query: saiyan race
k: 3


In [5]:
# creating a pandas dataframe for the final result
doc_df_2 = pd.DataFrame(columns=["animeTitle", "animeDescription", "Url", "Similarity"])
final_doc, result = top_k_documents(k, query=query_stemmed_2, inverted_index=inverted_index, inverted_index_tfidf=inverted_index_tfidf, inverted_doc=inverted_doc, vocabulary=vocabulary)
for doc in final_doc:
    i = int(''.join(re.findall(r'\d+', doc)))
    doc_page = (i-1)//50 + 1
    path = f'pages_tsv/page_{doc_page}/anime_{i}.tsv'
    with open(path, 'r', encoding='utf-8') as tsv_file:
        anime = next(csv.DictReader(tsv_file, delimiter='\t'))
    doc_df_2.loc[doc, ["animeTitle", "animeDescription", "Url", "Similarity"]] = [anime["animeTitle"], anime["animeDescription"], list_url[i-1], result[doc]]

In [6]:
doc_dict = dict(selector="th",
                props=[('text-align', 'center')])
def make_clickable(val):
    return '<a href="{}">{}</a>'.format(val, val)

doc_df_2.style.set_properties(**{'text-align':'center'}).set_table_styles([doc_dict]).format({'Url': make_clickable})

Unnamed: 0,animeTitle,animeDescription,Url,Similarity
document_1469,Dragon Ball Z Special 1: Tatta Hitori no Saishuu Kessen,"Bardock, Son Goku's father, is a low-ranking Saiyan soldier who was given the power to see into the future by the last remaining alien on a planet he just destroyed. He witnesses the destruction of his race and must now do his best to stop Frieza's impending massacre. (Source: ANN)",https://myanimelist.net/anime/986/Dragon_Ball_Z_Special_1__Tatta_Hitori_no_Saishuu_Kessen,0.320894
document_401,Dragon Ball Super: Broly,"Forty-one years ago on Planet Vegeta, home of the infamous Saiyan warrior race, King Vegeta noticed a baby named Broly whose latent power exceeded that of his own son. Believing that Broly's power would one day surpass that of his child, Vegeta, the king sends Broly to the desolate planet Vampa. Broly's father Paragus follows after him, intent on rescuing his son. However, his ship gets damaged, causing the two to spend years trapped on the barren world, unaware of the salvation that would one day come from an unlikely ally. Years later on Earth, Gokuu Son and Prince Vegeta—believed to be the last survivors of the Saiyan race—are busy training on a remote island. But their sparring is interrupted when the appearance of their old enemy Frieza drives them to search for the last of the wish-granting Dragon Balls on a frozen continent. Once there, Frieza shows off his new allies: Paragus and the now extremely powerful Broly. A legendary battle that shakes the foundation of the world ensues as Gokuu and Vegeta face off against Broly, a warrior without equal whose rage is just waiting to be unleashed. [Written by MAL Rewrite]",https://myanimelist.net/anime/36946/Dragon_Ball_Super__Broly,0.085567
document_365,Dragon Ball Z,"Five years after winning the World Martial Arts tournament, Gokuu is now living a peaceful life with his wife and son. This changes, however, with the arrival of a mysterious enemy named Raditz who presents himself as Gokuu's long-lost brother. He reveals that Gokuu is a warrior from the once powerful but now virtually extinct Saiyan race, whose homeworld was completely annihilated. When he was sent to Earth as a baby, Gokuu's sole purpose was to conquer and destroy the planet; but after suffering amnesia from a head injury, his violent and savage nature changed, and instead was raised as a kind and well-mannered boy, now fighting to protect others. With his failed attempt at forcibly recruiting Gokuu as an ally, Raditz warns Gokuu's friends of a new threat that's rapidly approaching Earth—one that could plunge Earth into an intergalactic conflict and cause the heavens themselves to shake. A war will be fought over the seven mystical dragon balls, and only the strongest will survive in Dragon Ball Z. [Written by MAL Rewrite]",https://myanimelist.net/anime/813/Dragon_Ball_Z,0.067677


To check that we were computing the cosine similarity correctly, we ran some tests using the description of a document as the query.
<p>In the first test we used the description of document 401 as the query and obtained a cosine similarity of 0.65. This is a high value, but one might expect a value close to 1, since the query is the document description itself. This does not happen because the query vector has all components equal to 1 while the document vector contains tf-idf weights: the cosine equals 1 only if all the weights in the document are equal, and a long description such as that of document 401 contains many words with very different weights. Note also that the query echoed in the output below stops before the end of the description.</p>
<p>In the second test we used the description of document 203, which has only 7 words, and in this case the cosine similarity is 0.96.</p>
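
A small numeric check, with made-up weights, of how a 0/1 query vector behaves when a document is queried with its own words: the cosine reaches 1 only when all the tf-idf weights of the document are equal, and it drops as the weights become more uneven. `self_query_cosine` is a hypothetical helper written for this illustration.

```python
import math

def self_query_cosine(weights):
    """Cosine between a tf-idf vector and an all-ones vector of the same length."""
    numerator = sum(weights)
    q_norm = math.sqrt(len(weights))
    d_norm = math.sqrt(sum(w * w for w in weights))
    return numerator / (q_norm * d_norm)

print(self_query_cosine([0.2, 0.2, 0.2]))        # equal weights
print(self_query_cosine([0.9, 0.1, 0.1, 0.1]))   # one dominant weight
```

A long description is more likely to mix rare and common words, hence very different weights, which is consistent with the lower value observed for document 401.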

#### FIRST TEST
Using the whole anime description of document_401 as the query

In [20]:
# taking in input the query
query_2 = input("Insert the query for the first test: ")
k = int(input("Insert k: "))
assert len(query_2) > 0, "The query is empty!!"
assert k > 0, "k should be a positive integer!!"
# stemming the query
query_stemmed_2 = text_mining(query_2)
print("Query:", query_2)
print('k:', k)

Query: Forty-one years ago on Planet Vegeta, home of the infamous Saiyan warrior race, King Vegeta noticed a baby named Broly whose latent power exceeded that of his own son. Believing that Broly's power would one day surpass that of his child, Vegeta, the king sends Broly to the desolate planet Vampa. Broly's father Paragus follows after him, intent on rescuing his son. However, his ship gets damaged, causing the two to spend years trapped on the barren world, unaware of the salvation that would one day come from an unlikely ally. Years later on Earth, Gokuu Son and Prince Vegeta—believed to be the last survivors of the Saiyan race—are busy training on a remote island. But their sparring is interrupted when the appearance of their old enemy Frieza drives them to search for the last of the wish-granting Dragon Balls on a frozen continent. Once there, Frieza shows off his new allies: Paragus and the now extremely powerful Broly. A legendary battle that shakes the foundation of the world

In [21]:
# creating a pandas dataframe for the final result
doc_df_2 = pd.DataFrame(columns=["animeTitle", "animeDescription", "Url", "Similarity"])
final_doc, result = top_k_documents(k, query=query_stemmed_2, inverted_index=inverted_index, inverted_index_tfidf=inverted_index_tfidf, inverted_doc=inverted_doc, vocabulary=vocabulary)
for doc in final_doc:
    i = int(''.join(re.findall(r'\d+', doc)))
    doc_page = (i-1)//50 + 1
    path = f'pages_tsv/page_{doc_page}/anime_{i}.tsv'
    with open(path, 'r', encoding='utf-8') as tsv_file:
        anime = next(csv.DictReader(tsv_file, delimiter='\t'))
    doc_df_2.loc[doc, ["animeTitle", "animeDescription", "Url", "Similarity"]] = [anime["animeTitle"], anime["animeDescription"], list_url[i-1], result[doc]]

In [22]:
doc_dict = dict(selector="th",
                props=[('text-align', 'center')])
def make_clickable(val):
    return '<a href="{}">{}</a>'.format(val, val)

doc_df_2.style.set_properties(**{'text-align':'center'}).set_table_styles([doc_dict]).format({'Url': make_clickable})

Unnamed: 0,animeTitle,animeDescription,Url,Similarity
document_401,Dragon Ball Super: Broly,"Forty-one years ago on Planet Vegeta, home of the infamous Saiyan warrior race, King Vegeta noticed a baby named Broly whose latent power exceeded that of his own son. Believing that Broly's power would one day surpass that of his child, Vegeta, the king sends Broly to the desolate planet Vampa. Broly's father Paragus follows after him, intent on rescuing his son. However, his ship gets damaged, causing the two to spend years trapped on the barren world, unaware of the salvation that would one day come from an unlikely ally. Years later on Earth, Gokuu Son and Prince Vegeta—believed to be the last survivors of the Saiyan race—are busy training on a remote island. But their sparring is interrupted when the appearance of their old enemy Frieza drives them to search for the last of the wish-granting Dragon Balls on a frozen continent. Once there, Frieza shows off his new allies: Paragus and the now extremely powerful Broly. A legendary battle that shakes the foundation of the world ensues as Gokuu and Vegeta face off against Broly, a warrior without equal whose rage is just waiting to be unleashed. [Written by MAL Rewrite]",https://myanimelist.net/anime/36946/Dragon_Ball_Super__Broly,0.655309


#### SECOND TEST
Using the description of document_203

In [23]:
# taking in input the query
query_2 = input("Insert the query for the second test: ")
k = int(input("Insert k: "))
assert len(query_2) > 0, "The query is empty!!"
assert k > 0, "k should be a positive integer!!"
# stemming the query
query_stemmed_2 = text_mining(query_2)
print("Query:", query_2)
print('k:', k)

Query: Two special episodes bundled in the fourth and fifth volume of the Blu-ray/DVD.
k: 1


In [24]:
# creating a pandas dataframe for the final result
doc_df_2 = pd.DataFrame(columns=["animeTitle", "animeDescription", "Url", "Similarity"])
final_doc, result = top_k_documents(k, query=query_stemmed_2, inverted_index=inverted_index, inverted_index_tfidf=inverted_index_tfidf, inverted_doc=inverted_doc, vocabulary=vocabulary)
for doc in final_doc:
    i = int(''.join(re.findall(r'\d+', doc)))
    doc_page = (i-1)//50 + 1
    path = f'pages_tsv/page_{doc_page}/anime_{i}.tsv'
    with open(path, 'r', encoding='utf-8') as tsv_file:
        anime = next(csv.DictReader(tsv_file, delimiter='\t'))
    doc_df_2.loc[doc, ["animeTitle", "animeDescription", "Url", "Similarity"]] = [anime["animeTitle"], anime["animeDescription"], list_url[i-1], result[doc]]

In [25]:
doc_dict = dict(selector="th",
                props=[('text-align', 'center')])
def make_clickable(val):
    return '<a href="{}">{}</a>'.format(val, val)

doc_df_2.style.set_properties(**{'text-align':'center'}).set_table_styles([doc_dict]).format({'Url': make_clickable})

Unnamed: 0,animeTitle,animeDescription,Url,Similarity
document_203,Natsume Yuujinchou Go Specials,Two special episodes bundled in the fourth and fifth volume of the Blu-ray/DVD.,https://myanimelist.net/anime/34534/Natsume_Yuujinchou_Go_Specials,0.961133


# 3. Define a new Score!

We define the new score as:
\begin{equation}
score=\sum_{j \in columns} score_j
\end{equation}
where $columns=\{animeTitle, animeType, ..., animeStaff\}$

We compute the individual scores as follows:
- the title score: $score_t=\sum_{i \in query}\frac{2 \cdot n_i}{len(title)+len(query)}$, where $n_i$ is the number of occurrences of $word_i$ of the query in the title. We count how many times each query word appears in the title and normalize by the sum of the query length and the title length. We multiply the numerator by two to give more weight to queries containing a word that appears in the title.
- the type score: $score_p=\sum_{i \in query}\frac{n_i}{2}$, where $n_i$ is different from zero only if the query specifies the anime type; we divide $n_i$ by two so that the type does not weigh too much on the total.
- score_episode, score_members, score_animeScore, score_users, score_rank, score_popularity are computed so that their sum is equal to one; we give more weight to matches on score_animeScore, score_rank and score_episode. In our opinion the others are less significant for the similarity, so we gave them less weight.
- the description score: $score_d=\sum_{i \in query}\frac{2 \cdot d_i}{len(description)+len(query)}$, where $d_i$ is the number of occurrences of $word_i$ of the query in the description. We compute this score with the same rule as the title score.
- the releaseDate_score and endDate_score: each of these scores is equal to 0.5 if the query contains a year that coincides with the year of the release date or of the end date, respectively. We only take the year into account because we think a user is more likely to remember the year than the exact date with day and month.
- the score_staff, score_characters, score_voice: we compute these scores by adding 0.5 every time a name in the query appears in the corresponding list of strings. With this choice, if the user enters only the first name or only the surname of a character/staff member/voice actor the score is 0.5, while if the user enters both the score is 1 (as the match is more precise).
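
Three of the rules above (title, year and names) can be sketched as follows. This is not the `new_score` function: `title_score`, `year_score` and `name_score` are hypothetical helpers, lengths are assumed to be measured in words, and the query, title, year and name are made-up values used only to exercise the formulas.

```python
def title_score(query_words, title_words):
    """Sum over the query words of 2 * n_i / (len(title) + len(query))."""
    denom = len(title_words) + len(query_words)
    if denom == 0:
        return 0.0
    return sum(2 * title_words.count(w) / denom for w in query_words)

def year_score(query_words, year):
    """0.5 if the given year appears in the query, 0 otherwise."""
    return 0.5 if str(year) in query_words else 0.0

def name_score(query_words, names):
    """Add 0.5 for every query word that is part of a name in the list."""
    parts = {p.lower() for name in names for p in name.replace(',', ' ').split()}
    return sum(0.5 for w in query_words if w.lower() in parts)

# made-up example values
query = ['gintama', 'enchousen', 'youichi', '2018']
title = ['gintama', 'enchousen']

score = (title_score(query, title)
         + year_score(query, 2018)
         + name_score(query, ['Youichi Fujita']))
print(score)
```

With the same rule, the description score only changes the list of words passed as the second argument of `title_score`.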

In [7]:
query = input("insert the query for the new score: ")
k = int(input("insert k: "))
assert len(query) > 0, "The query is empty!!"
assert k > 0, "k should be a positive integer!!"
print("Query:", query)
print('k:', k)

Query: gintama enchousen Youichi 2018 64
k: 5


In [7]:
new_score(query)

100%|██████████| 383/383 [14:39<00:00,  2.30s/it]


In [8]:
# opening the score_dict
with open('score_dict.json', 'r', encoding='utf-8') as score_dict_json:
    score_dict = json.load(score_dict_json)

# opening the heap list
with open('heap.json', 'r', encoding='utf-8') as heap_json:
    heap = json.load(heap_json)

In [9]:
# creating a pandas dataframe for the final result
doc_df = pd.DataFrame(columns=["animeTitle", "animeDescription", "Url", "Similarity"])
final_doc, final_score_dict = top_k_documents(k, heap=heap, score_dict=score_dict)
for doc in final_doc:
    i = int(''.join(re.findall(r'\d+', doc)))
    doc_page = (i-1)//50 + 1
    path = f'pages_tsv/page_{doc_page}/anime_{i}.tsv'
    with open(path, 'r', encoding='utf-8') as tsv_file:
        anime = next(csv.DictReader(tsv_file, delimiter='\t'))
    doc_df.loc[doc, ["animeTitle", "animeDescription", "Url", "Similarity"]] = [anime["animeTitle"], anime["animeDescription"], list_url[i-1], final_score_dict[doc]]

In [10]:
doc_dict = dict(selector="th",
                props=[('text-align', 'center')])
def make_clickable(val):
    return '<a href="{}">{}</a>'.format(val, val)

doc_df.style.set_properties(**{'text-align':'center'}).set_table_styles([doc_dict]).format({'Url': make_clickable})

Unnamed: 0,animeTitle,animeDescription,Url,Similarity
document_9,Gintama': Enchousen,"While Gintoki Sakata was away, the Yorozuya found themselves a new leader: Kintoki, Gintoki's golden-haired doppelganger. In order to regain his former position, Gintoki will need the help of those around him, a troubling feat when no one can remember him! Between Kintoki and Gintoki, who will claim the throne as the main character? In addition, Yorozuya make a trip back down to red-light district of Yoshiwara to aid an elderly courtesan in her search for her long-lost lover. Although the district is no longer in chains beneath the earth's surface, the trio soon learn of the tragic backstories of Yoshiwara's inhabitants that still haunt them. With flashback after flashback, this quest has Yorozuya witnessing everlasting love and protecting it as best they can with their hearts and souls. Gintama': Enchousen includes moments of action-packed intensity along with their usual lighthearted, slapstick humor for Gintoki and his friends. [Written by MAL Rewrite]",https://myanimelist.net/anime/15417/Gintama__Enchousen,0.979227
document_10,Gintama: The Final,New Gintama movie.,https://myanimelist.net/anime/39486/Gintama__The_Final,0.972222
document_22,Gintama.: Shirogane no Tamashii-hen - Kouhan-sen,Second Season of the final arc of Gintama.,https://myanimelist.net/anime/37491/Gintama__Shirogane_no_Tamashii-hen_-_Kouhan-sen,0.848485
document_15,Gintama,"The Amanto, aliens from outer space, have invaded Earth and taken over feudal Japan. As a result, a prohibition on swords has been established, and the samurai of Japan are treated with disregard as a consequence. However one man, Gintoki Sakata, still possesses the heart of the samurai, although from his love of sweets and work as a yorozuya, one might not expect it. Accompanying him in his jack-of-all-trades line of work are Shinpachi Shimura, a boy with glasses and a strong heart, Kagura with her umbrella and seemingly bottomless stomach, as well as Sadaharu, their oversized pet dog. Of course, these odd jobs are not always simple, as they frequently have run-ins with the police, ragtag rebels, and assassins, oftentimes leading to humorous but unfortunate consequences. Who said life as an errand boy was easy? [Written by MAL Rewrite]",https://myanimelist.net/anime/918/Gintama,0.833333
document_6,Gintama',"After a one-year hiatus, Shinpachi Shimura returns to Edo, only to stumble upon a shocking surprise: Gintoki and Kagura, his fellow Yorozuya members, have become completely different characters! Fleeing from the Yorozuya headquarters in confusion, Shinpachi finds that all the denizens of Edo have undergone impossibly extreme changes, in both appearance and personality. Most unbelievably, his sister Otae has married the Shinsengumi chief and shameless stalker Isao Kondou and is pregnant with their first child. Bewildered, Shinpachi agrees to join the Shinsengumi at Otae and Kondou's request and finds even more startling transformations afoot both in and out of the ranks of the the organization. However, discovering that Vice Chief Toushirou Hijikata has remained unchanged, Shinpachi and his unlikely Shinsengumi ally set out to return the city of Edo to how they remember it. With even more dirty jokes, tongue-in-cheek parodies, and shameless references, Gintama' follows the Yorozuya team through more of their misadventures in the vibrant, alien-filled world of Edo. [Written by MAL Rewrite]",https://myanimelist.net/anime/9969/Gintama,0.801462


# 5. Algorithmic question

## Write an algorithm that computes the acceptable solution with the longest possible duration.


**MyAlg**

```
seq = the sequence of requests (durations)
query = the list of requests to schedule
query_index = empty list
remaining = copy of seq

for time in query:
    i = remaining.index(time)
    query_index.append(i)
    remaining[i] = 0        # mark as used: repeated durations get different positions

new_query = [0 for i in seq]

for i in query_index:
    new_query[i] = seq[i]

max_sum_nonadjacent_elements(new_query)
```

- In the first loop we record the position in the sequence of each element of the query, marking each matched position as used so that repeated durations map to different positions.
- Then we define a new vector with the same length as the sequence, in which the elements of the query are placed at the positions stored in query_index and all the other values are 0.
- Finally, we call the function max_sum_nonadjacent_elements(new_query), keeping track of the elements we are summing.

**max_sum_nonadjacent_elements(A)**

```
incl_sum = A[0]
excl_sum = 0

for i = 1 to length(A) - 1:
    max_sum = max(incl_sum, excl_sum)
    incl_sum = excl_sum + A[i]
    excl_sum = max_sum

max_sum = max(incl_sum, excl_sum)

return max_sum
```

- We take two variables, incl_sum and excl_sum, and we initialize incl_sum to the first element and excl_sum to zero;
- For each following element, we first compute the maximum of incl_sum and excl_sum;
- incl_sum becomes excl_sum at the previous step plus the current element;
- excl_sum becomes the maximum of excl_sum and incl_sum at the previous step;
- After all the steps, we take the maximum of incl_sum and excl_sum as the result.
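
The recurrence described above can be transcribed into Python almost line by line. The sketch below returns only the maximum sum (it does not track which elements were chosen) and assumes non-negative values, as is the case for durations; the empty-list guard is an addition that the pseudocode does not cover.

```python
def max_sum_nonadjacent_elements(A):
    """Largest sum of elements of A such that no two chosen elements are adjacent."""
    if not A:
        return 0
    incl_sum = A[0]   # best sum that includes the current element
    excl_sum = 0      # best sum that excludes the current element
    for value in A[1:]:
        best_so_far = max(incl_sum, excl_sum)
        incl_sum = excl_sum + value
        excl_sum = best_so_far
    return max(incl_sum, excl_sum)

print(max_sum_nonadjacent_elements([30, 40, 25, 0, 30, 0]))   # prints 85
```

The loop visits each element once and keeps two numbers, so the running time is linear in the length of the sequence and the extra memory is constant.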

## Implement a program that given in input an instance in the form given above, gives the optimal solution.

In [26]:
def MyAlg(seq, query):

    query_index = []
    remaining = seq.copy()   # working copy, so the caller's list is not modified

    for time in query:
        try:
            pos = remaining.index(time)
        except ValueError:
            # the requested duration is not (or no longer) available in the sequence
            print('not valid query')
            return
        query_index.append(pos)
        remaining[pos] = 0   # mark as used, so repeated durations get different positions

    new_query = [0 for i in seq]

    # creating new_query as explained before
    for i in query_index:
        new_query[i] = seq[i]

    # using max_sum_nonadjacent_elements(new_query)
    incl_sum = new_query[0]
    excl_sum = 0
    ind1 = [0]             # indices of the inclusive sum
    ind2 = []              # indices of the exclusive sum
    indm = ind1[:]         # indices of max sum
    max_sum = incl_sum

    for i in range(1, len(new_query)):
        incl_sum = excl_sum + new_query[i]
        ind2.append(i)
        excl_sum = max_sum
        ind1 = indm[:]
        if incl_sum > excl_sum:
            max_sum = incl_sum
            indm = ind2[:]
            ind2 = ind1[:]
        else:
            max_sum = excl_sum
            indm = ind1[:]
            ind2 = ind1[:]

    # positions that hold 0 were not requested, so they are left out of the solution
    query_opt = [new_query[i] for i in indm if new_query[i] != 0]

    print(f"The optimal solution is: {query_opt}")
    print(f"The duration is: {max_sum}")

In [27]:
seq=[30, 40, 25, 50, 30, 20]
query=[30, 40, 25, 30]
MyAlg(seq, query)

The optimal solution is: [30, 25, 30]
The duration is: 85


We added a try/except to avoid an overlap of appointments: a query that requests a duration more times than it appears in the sequence is rejected, as shown in the following example:

In [28]:
seq = [30, 40, 25, 50, 30, 20]
query = [30, 40, 25, 50, 30, 20, 20]
MyAlg(seq,query)

not valid query
