# HOMEWORK 3

## 1) DATA COLLECTION

### 1.1 Get the list of animes
We start from the list of animes to include in your corpus of documents. In particular, we focus on the top animes ever list. From this list we want to collect the url associated with each anime. The list is long and split across many pages. We ask you to retrieve only the urls of the animes listed in the first 400 pages (each page has 50 animes, so you will end up with 20000 unique anime urls).

The output of this step is a .txt file in which each line corresponds to an anime's url.

## Solution:
The main idea is to split the work between retrieving the html pages (skipping the urls already stored) and extracting the links from those pages.<br>
Once we have defined a way to scrape the urls from a page, we iterate over the offsets from 0 to 20000 in steps of 50 to visit all the pages up to that number. At the time of execution only 19122 urls were retrieved, while another run returned 19128, so the count depends on when the code is executed. The file was created on 5-Nov-2021, so the list reflects the content of the site on that day.

In [None]:
# import libraries
import requests
from bs4 import BeautifulSoup

url = 'https://myanimelist.net/topanime.php'
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
def get_links_from_soup(soup):
    ''' 
        Given a BeautifulSoup object, the function iterates over all the links in the table rows of the corresponding page.
        It returns a list of (title, href) pairs, one for each anime link found.
    '''
    anime = []
    for tag in soup.find_all('tr'):
        links = tag.find_all('a')
        for link in links:
            # keep only the links that have an id and a non-trivial text content
            if isinstance(link.get('id'), str) and link.contents and len(link.contents[0]) > 1:
                anime.append((link.contents[0], link.get('href')))
    return anime

In [None]:
tot_list= []
for lim in range(0, 20000, 50):
    
    # The main page of the ranking: https://myanimelist.net/topanime.php
    if lim==0:
        new_url = url
    else: # we have to skip the first lim elements
        new_url = url+'?limit='+str(lim)
    response = requests.get(new_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    tot_list += get_links_from_soup(soup)

In [None]:
len(tot_list)

In [None]:
# The list contains (name, link) pairs and we only want the link, so we save only the second element

with open('out.txt', 'w') as f:
    for n, link in tot_list:
        f.write(link+'\n')
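
The pagination scheme and a final de-duplication step can be sketched in isolation. The helper names `page_urls` and `unique_in_order` are illustrative and not part of the notebook; the `?limit=` offset is the one used in the loop above, and whether de-duplication is actually needed depends on what the site returns.

```python
def page_urls(base_url, n_pages, per_page=50):
    """Return the URL of each ranking page; page k starts at offset k * per_page."""
    urls = []
    for offset in range(0, n_pages * per_page, per_page):
        # the first page has no offset parameter
        urls.append(base_url if offset == 0 else f"{base_url}?limit={offset}")
    return urls


def unique_in_order(links):
    """Drop repeated links while keeping the first occurrence of each."""
    return list(dict.fromkeys(links))


pages = page_urls("https://myanimelist.net/topanime.php", n_pages=400)
print(len(pages), pages[1])
```

Keeping the original order matters here, because the position of a link in the file is later used as the index of the corresponding document.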

### 1.2. Crawl animes
Once you get all the urls in the first 400 pages of the list, you:

* Download the html corresponding to each of the collected urls.
* After you collect a single page, immediately save its html in a file. In this way, if your program stops, for any reason, you will not lose the data collected up to the stopping point. More details in Important (2).
* Organize the entire set of downloaded html pages into folders. Each folder will contain the htmls of the animes in page 1, page 2, ... of the list of animes.

**Important**

Due to the large amount of pages you need to download, follow the next tips that will help you speeding up several time-consuming operations.

[Save time downloading files] You are asked to crawl a considerable number of pages, which will take plenty of time. To speed up the operation, we suggest you work in parallel with your group's colleagues or even write code that works in parallel on all the CPUs available in your computer. In particular, using the same code, each member of the group can be in charge of downloading a subset of pages (e.g., the first 100). PAY ATTENTION: Once you have obtained all the pages, merge your results into a unique dataset. In fact, the search engine must look up results in the whole set of documents.

[Save your data] It is not nice to restart a crawling procedure, given its runtime. For this reason, it is extremely important that every time you crawl a page, you save it with the name article_i.html, where i corresponds to the number of articles you have already downloaded. In this way, if something goes wrong, you can restart your crawling procedure from the i+1-th document.

## Solution:
Since the website protects itself by blocking clients that send too many requests, the downloader first checks whether the response is an error page and, in that case, pauses for a random amount of time before retrying.<br>
In this way we collected only clean pages (not the ones containing the error button).<br>
The program (found in the [downloader.py](downloader.py) file) is not run from this notebook: we made a CLI version so that it can be launched from as many terminals as you want in order to work in parallel.<br>
Its specification can be found in that file, or by launching it from the terminal with only the ```--h``` flag to see the documentation and the usage.
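
The "skip what is already stored, wait when blocked, save immediately" logic can be sketched as follows. This is not the content of downloader.py: the function name `crawl`, the `fetch` callable and its "return None when blocked" contract, and the `page_k` folder names are all assumptions made for illustration.

```python
import os
import random
import time


def crawl(urls, out_dir, fetch, per_page=50, max_wait=0.0):
    """Download every URL not yet on disk and save it as article_XXXXX.html.

    `fetch(url)` is expected to return the page HTML as a string, or None
    when the site answered with its "too many requests" page
    (hypothetical contract, chosen for this sketch).
    """
    for i, url in enumerate(urls):
        # one folder per page of the ranking (hypothetical naming)
        folder = os.path.join(out_dir, f"page_{i // per_page + 1}")
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, f"article_{i:05d}.html")
        if os.path.exists(path):      # already downloaded in a previous run
            continue
        html = fetch(url)
        while html is None:           # blocked: wait a random time, then retry
            time.sleep(random.uniform(0, max_wait))
            html = fetch(url)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)             # save immediately so nothing is lost
```

Because the file name is derived from the position of the url in the list, several terminals can work on disjoint slices of the same list without overwriting each other, provided each slice keeps its original indices.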

### 1.3 Parse downloaded pages
At this point, you should have all the html documents about the animes of interest and you can start to extract the animes' information. The list of information we desire for each anime and their format is the following:

* **Anime Name** (to save as animeTitle): *String*
* **Anime Type** (to save as animeType): *String*
* **Number of episode** (to save as animeNumEpisode): *Integer*
* **Release and End Dates of anime** (to save as releaseDate and endDate): Convert both release and end date into *datetime format*.
* **Number of members** (to save as animeNumMembers): *Integer*
* **Score** (to save as animeScore): *Float*
* **Users** (to save as animeUsers): *Integer*
* **Rank** (to save as animeRank): *Integer*
* **Popularity** (to save as animePopularity): *Integer*
* **Synopsis** (to save as animeDescription): *String*
* **Related Anime** (to save as animeRelated): Extract all the related animes, but only keep unique values and those that have a hyperlink associated to them. *List of strings*.
* **Characters** (to save as animeCharacters): *List of strings*.
* **Voices** (to save as animeVoices): *List of strings*
* **Staff** (to save as animeStaff): Include the staff name and their responsibility/task in a *list of lists*.

In [5]:
import html_parser

## Solution:

All the needed functions are in the [html_parser.py](./html_parser.py) file, which also contains the relevant comments and documentation.
The only function that has to be called to solve this point entirely is the ```save_tsv_info``` function.<br>
Even though a single function call produces all the needed data in tsv format, several functions are implemented, and they fall roughly into three categories:
* support functions (to obtain a BeautifulSoup object from the html or to retrieve the datetime format from a string)
* webscraping functions: they are the core of this implementation; all of them take a BeautifulSoup object and a dictionary as parameters, and their purpose is to put into this dictionary as much information as possible starting from the input object
* applicative functions: they allow obtaining the data directly from a file or an index by employing the other functions (e.g. ```save_tsv_info```)

### Webscraping:
In order to retrieve the required fields we first need to study the structure of the page.<br>
In general the title is simple to retrieve using BeautifulSoup.<br>
The page is structured as a series of nested tables: a large table has two columns, one for the left menu, which contains all the information about the score, the type of the series, the ranking and so on, and one for the real body of the page, which contains the synopsis, the related animes and the other required fields.<br>
Considering that, we have written a function called ```get_left_attributes``` that retrieves the values for the fields:
* episodes
* start_date
* end_date
* score
* users
* rank
* members
* popularity
* type

Each of these fields sits in its own div with the class '*spaceit_pad*'. The content of the div is a list of values of the form (with the exception of the score) ```['\n', name_of_the_value, value]```; the value then has to be processed in order to extract the date, the integer and so on.<br>
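
The post-processing of the raw values can be sketched as below. The helpers `to_int` and `to_dates` are hypothetical and not taken from html_parser.py, and the raw formats used in the examples (`'2,675,906'`, `'#1'`, `'Apr 5, 2009 to Jul 4, 2010'`) are assumptions about how the site displays these fields. Note that `%b` is locale-dependent, so the sketch assumes English month names.

```python
from datetime import datetime


def to_int(raw):
    """Keep only the digits of an integer field; return None when there are none.

    Meant for integer fields only (members, users, rank, popularity, episodes).
    """
    digits = "".join(ch for ch in raw if ch.isdigit())
    return int(digits) if digits else None


def to_dates(raw, fmt="%b %d, %Y"):
    """Split a 'start to end' value into two datetimes (None when not parseable)."""
    parts = [p.strip() for p in raw.split(" to ")]
    dates = []
    for part in parts[:2]:
        try:
            dates.append(datetime.strptime(part, fmt))
        except ValueError:          # e.g. '?' or a year-only value
            dates.append(None)
    while len(dates) < 2:           # single-date entries have no end date
        dates.append(None)
    return tuple(dates)
```

Returning `None` instead of raising keeps a single malformed field from discarding the whole page.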

The remaining values can be found in the right column of the huge table inside different divs or subtables.<br>

#### Synopsis<br>
It can be found in a paragraph with the attribute '*itemprop*' equal to 'description'.<br>
#### Related animes<br>
They are inside a dedicated table with the class '*anime_detail_related_anime*'. We want only the animes for which a link exists, so we iterate over all the links in that table, collect their contents and return the unique values.
#### Staff<br>
In this case (as in the next one) the function works directly on a specific div (the one with the class '*detail-characters-list clearfix*').<br>
All the names we need are the contents of links inside that div, so we simply iterate over them and collect the data of interest.
#### Characters and voices<br>
This div has the same class as the one above, so when only one of the two is present we can tell them apart by the content of an h3 element of the class '*h3_characters_voice_actors*'.<br>
Inside this div there are two more divs that split the list into two columns. Each of them contains a list of tables, and each table has a single row with three columns: one for the image of the character, one for the name of the character and one for the voice. We therefore iterate over the rows and keep only the last two columns.<br>
From each of these two columns we take the link element that contains the name of the character or of the voice.
#### Putting it all together<br>
This is done by a function that uses each of these subfunctions to retrieve, given an html page, the content of interest as a dictionary.

Once we can obtain a dictionary with the data of interest, converting it and saving it to a tsv file is trivial and is done by ad-hoc functions. All this work is encapsulated in the function ```save_tsv_info```, which takes the range of pages for which we want to save the scraped tsv, the base directory where these pages are stored and the destination directory in which the outputs will be stored.
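
As an illustration of the dictionary-to-tsv step, a minimal sketch could look like this. `dict_to_tsv` is a hypothetical helper, not one of the functions in html_parser.py; replacing tabs and newlines inside values is an assumption made here so that each anime stays on one row.

```python
def dict_to_tsv(info, fields):
    """Return (header, row) as tab-separated strings for the given field order."""
    def cell(value):
        # tabs and newlines inside a value would break the one-row-per-anime layout
        return str(value).replace("\t", " ").replace("\n", " ")

    header = "\t".join(fields)
    # missing fields become empty cells so that every row has the same width
    row = "\t".join(cell(info.get(name, "")) for name in fields)
    return header, row


header, row = dict_to_tsv({"title": "Fullmetal Alchemist: Brotherhood", "episodes": 64},
                          ["title", "episodes", "score"])
print(header)
print(row)
```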

In [6]:
html_parser.get_total_info('../data/html_pages/article_00000.html')

{'title': 'Fullmetal Alchemist: Brotherhood',
 'type': 'TV',
 'episodes': 64,
 'start_date': datetime.datetime(2009, 4, 5, 0, 0),
 'end_date': datetime.datetime(2010, 7, 4, 0, 0),
 'score': 9.16,
 'users': 1622384,
 'ranked': 1,
 'popularity': 3,
 'members': 2675906,
 'synopsis': "After a horrific alchemy experiment goes wrong in the Elric household, brothers Edward and Alphonse are left in a catastrophic new reality. Ignoring the alchemical principle banning human transmutation, the boys attempted to bring their recently deceased mother back to life. Instead, they suffered brutal personal loss: Alphonse's body disintegrated while Edward lost a leg and then sacrificed an arm to keep Alphonse's soul in the physical realm by binding it to a hulking suit of armor.",
 'related_anime': ['Fullmetal Alchemist: Brotherhood - 4-Koma Theater',
  'Fullmetal Alchemist: Brotherhood Specials',
  'Fullmetal Alchemist',
  'Fullmetal Alchemist: The Sacred Star of Milos'],
 'characters': ['Elric, Edward

For each anime, you create an anime_i.tsv file of this structure:

animeTitle \t animeType \t  ... \t animeStaff

## Solution:
In this case we simply ran the function ```save_tsv_info``` from 0 to 19123 (the total number of pages) in order to get all the necessary files.
An example of the output of the function that builds the tsv is the following:

In [7]:
header, content = html_parser.get_tsv_from_idx(0)
print(f"The header of the tsv table is the following:\n\t{header}")
print()
print(f"The content for the 0-th page (the one of the dictionary above) is the following:\n\t{content}")

The header of the tsv table is the following:
	title	type	episodes	start_date	end_date	score	users	ranked	popularity	members	synopsis	related_anime	characters	voices	staff

The content for the 0-th page (the one of the dictionary above) is the following:
	Fullmetal Alchemist: Brotherhood	TV	64	2009-04-05 00:00:00	2010-07-04 00:00:00	9.16	1622384	1	3	2675906	After a horrific alchemy experiment goes wrong in the Elric household, brothers Edward and Alphonse are left in a catastrophic new reality. Ignoring the alchemical principle banning human transmutation, the boys attempted to bring their recently deceased mother back to life. Instead, they suffered brutal personal loss: Alphonse's body disintegrated while Edward lost a leg and then sacrificed an arm to keep Alphonse's soul in the physical realm by binding it to a hulking suit of armor.	['Fullmetal Alchemist: Brotherhood - 4-Koma Theater', 'Fullmetal Alchemist: Brotherhood Specials', 'Fullmetal Alchemist', 'Fullmetal Alchemist: The Sa

## Getting the TSVs!!!

In [8]:
# The following line is commented out for safety: running it raises an error because the data does not exist on your machine

#html_parser.save_tsv_info(0, 19123) # We've used the default directory for source and dest, see the code for details

## 2. SEARCH ENGINE
Now, we want to create two different Search Engines that, given as input a query, return the animes that match the query.

First, you must pre-process all the information collected for each anime by:

* Removing stopwords
* Removing punctuation
* Stemming
* Anything else you think is needed

For this purpose, you can use the nltk library.

In [None]:
import pandas as pd
import string
import json
from operator import itemgetter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import nltk

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

### 2.1. Conjunctive query
For the first version of the search engine, we narrow our interest on the Synopsis of each anime. It means that you will evaluate queries only with respect to the anime's description.

In [None]:
df = pd.read_table("../shared_stuff/tsv_files/0total_pages.tsv",
                       delimiter = "\t",
                       header = "infer",
                       on_bad_lines = "skip") # fix line 16473

In [None]:
def str_to_list(s):
    """
    This function returns a list
    with each of the characters of the input string
    
    Arguments
        s : string
        
    Returns
        (list)
    """
    
    return list(s)

def has_digits(s):
    """
    This function checks whether a string
    contains any digits
    
    Arguments
        s : string
        
    Returns
        (bool) True / False
    """
    
    return any(char.isdigit() for char in s)

def bad_words():
    """
    This function creates a list with words
    that should be excluded from the vocabulary
    during preprocessing, including punctuation,
    stopwords et similia
    
    Arguments
        none
        
    Returns
        (list)
    """
    
    punct = str_to_list(string.punctuation)
    punct += ["...", "''", "``", '""']
    
    stops = stopwords.words("english")
    
    other_suffixes = ["'s", "n't"]
    
    return punct + stops + other_suffixes

def preprocess(text, stemmer):
    """
    This function preprocesses some text (a document)
    by isolating each word, excluding stopwords et similia,
    and finally stemming them
    
    Arguments
        text : (string)
        stemmer : stemmer object, e.g. SnowballStemmer("english")
    
    Returns
        (list) preprocessed input text
    """
    
    text = str(text)
    
    tokens = word_tokenize(text)
        
    bad = set(bad_words())  # build the exclusion set once, not once per token

    return [stemmer.stem(w) for w in tokens 
            if w not in bad and not has_digits(w)]

In [None]:
stemmer = SnowballStemmer("english")

df['synopsis_clean'] = df.apply(lambda row: preprocess(row['synopsis'], stemmer), 
                                axis = 'columns')

### 2.1.1) Create your index!
Before building the index,

Create a file named vocabulary, in the format you prefer, that maps each word to an integer (term_id).
Then, the first brick of your homework is to create the Inverted Index. It will be a dictionary of this format:

{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
where document_i is the id of a document that contains the word.

Hint: Since you do not want to compute the inverted index every time you use the Search Engine, it is worth to think to store it in a separate file and load it in memory when needed.

In [None]:
def create_vocab(corpus):
    """
    This function creates a set of unique
    and preprocessed words from a corpus
    
    Arguments
        corpus : pandas df column or list-like
    
    Returns
        vocab  : dictionary with the words as keys
                 and a unique integer for each as values
    """
    
    vocab = set()
    
    for doc in corpus:
        vocab.update(set(doc))

    # sorting makes the word -> id mapping reproducible across runs
    return {word: idx for idx, word in enumerate(sorted(vocab))}
    

def save_dict_to_file(dct, filename):
    """
    This function saves a dictionary 
    to an external JSON file
    
    Arguments
        dct       : dictionary
        filename  : name of the file
        
    Returns
        void
    """
        
    with open(filename, "w") as file:
        json.dump(dct, file)
        

def read_dict_from_file(filename):
    """
    This function reads a dictionary
    from an external JSON file
        
    Arguments
        filename : name of the file
    
    Returns
        dct : dictionary with the contents of 'filename'
    """

    with open(filename, "r") as file:
        dct = json.load(file)

    return dct

In [None]:
# only execute this cell the first time or 
# when the preprocessing changes!

vocab = create_vocab(df['synopsis_clean'])

save_dict_to_file(vocab, "vocabulary.json")

In [None]:
vocab = read_dict_from_file("vocabulary.json")

In [None]:
def create_inv_idx(corpus, vocab): 
    """
    This function creates an inverted index
    given a corpus of documents and a vocabulary
    
    Arguments
        corpus  : pandas df column or list-like
        vocab   : dictionary of all the words in the corpus
    
    Returns
        inv_idx : dictionary with the words as referenced in 'vocab' as keys 
                  and the lists of the documents each word is in as values       
    """
    
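    inv_idx = {idx: [] for idx in vocab.values()}
    
    # single pass over the corpus: each document is appended to the
    # posting list of every distinct word it contains
    for doc_id, doc in enumerate(corpus):
        for word in set(doc):
            if word in vocab:
                inv_idx[vocab[word]].append(doc_id)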
    
    return inv_idx

In [None]:
# only execute this cell the first time or 
# when the preprocessing/vocabulary change!

inv_idx = create_inv_idx(df['synopsis_clean'], vocab)

#save_dict_to_file(inv_idx, "inv_idx.json")

In [None]:
# JSON object keys are always strings, so the integer term ids used when the index
# was created come back as str; get_results handles this by looking up str(q)
inv_idx = read_dict_from_file("inv_idx.json")
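
The key conversion can be checked on a toy dictionary. The names `inv_idx_demo`, `restored` and `restored_int` exist only in this example; converting the keys back to `int` is optional and is shown as one possible alternative to looking up `str(q)`.

```python
import json

inv_idx_demo = {0: [1, 4], 1: [2]}             # integer term ids as keys
restored = json.loads(json.dumps(inv_idx_demo))
print(list(restored))                           # the keys are now strings

# Optional: convert the keys back to int right after loading
restored_int = {int(term_id): docs for term_id, docs in restored.items()}
print(restored_int == inv_idx_demo)
```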

### 2.1.2) Execute the query
Given a query, that you let the user enter:

saiyan race
the Search Engine is supposed to return a list of documents.

What documents do we want?
Since we are dealing with conjunctive queries (AND), each of the returned documents should contain all the words in the query. The final output of the query must return, if present, the following information for each of the selected documents:

animeTitle
animeDescription
Url
Example Output:

animeTitle	animeDescription	Url
Fullmetal Alchemist: Brotherhood	...	https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood
Gintama	...	https://myanimelist.net/anime/28977/Gintama%C2%B0
Shingeki no Kyojin Season 3 Part 2	...	https://myanimelist.net/anime/38524/Shingeki_no_Kyojin_Season_3_Part_2
If everything works well in this step, you can go to the next point, and make your Search Engine more complex and better in answering queries.

In [None]:
def parse_query(query, vocab):
    """
    This function converts the list of words
    input by the user into the list of the IDs
    the words are saved as in the vocabulary
    
    Arguments
        query : list of words
        vocab : vocabulary of words with the words as keys
                and their IDs as values
    Returns
        list of the IDs of the words in the query
    """
    
    parsed_query = []
    
    for word in query:
        try:
            parsed_query.append(vocab[stemmer.stem(word)])
        except KeyError:
            print(f"The term '{word}' wasn't found anywhere!")
    
    return parsed_query


def get_results(query, inv_idx):
    """
    This function finds the documents all the words
    in the query are in.
    
    It finds them in three steps:
    1. creates a list of docs each word is in from the inverted index
    2. converts that list into a set
    3. intersects all those sets into a single set
       
    Arguments
        query : list of words as parsed by 'parse_query'
        
    Returns
        set with the documents that contain all the words in the query
    """
    
    return set.intersection(*[set(inv_idx[str(q)]) for q in query])


def get_df_entries(df, results,
                   url_file = "../shared_stuff/url_list.txt"):
    """
    This function filters the dataset so it only shows
    the rows which match the results, and adds a new column
    with the URL for the anime of each row.
    
    Arguments
        df       : pandas dataframe
        results  : set with the indices of the rows to keep
        url_file : external file with the URLs for each of the rows in df
    
    Returns
        df : filtered pandas dataframe
    """
    
    if not results:
        print("No results!")
        return
    
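    with open(url_file, 'r') as file:
        url_list = file.read().splitlines()

    # a fixed row order keeps titles and URLs aligned
    rows = sorted(results)

    # copy() avoids assigning a new column to a view of the original dataframe
    df = df.iloc[rows][["title", "synopsis"]].copy()
    df = df.rename(columns = {"title": "animeTitle", 
                              "synopsis": "animeDescription"})
    
    df['animeUrl'] = [url_list[i] for i in rows]
    
    return df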

In [None]:
query = parse_query(input().split(), vocab)

# if at least one word in the query
# is in the vocabulary
if query:
    
    results = get_results(query, inv_idx)

    df_entries = get_df_entries(df, results)

    # get_df_entries returns None when there are no results
    if df_entries is not None: 
        display(df_entries)
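
The conjunctive (AND) lookup can be tried on a toy index without loading the dataset. `toy_vocab`, `toy_index` and `conjunctive` are made up for this example; the string keys mimic an index that was read back from JSON.

```python
toy_vocab = {"saiyan": 0, "race": 1, "planet": 2}
toy_index = {"0": [3, 7, 9], "1": [1, 3, 9], "2": [9]}   # str keys, as after a JSON load


def conjunctive(term_ids, index):
    """Return the sorted ids of the documents that contain every term."""
    if not term_ids:
        return []
    postings = [set(index.get(str(t), [])) for t in term_ids]
    return sorted(set.intersection(*postings))


print(conjunctive([toy_vocab["saiyan"], toy_vocab["race"]], toy_index))   # [3, 9]
```

Adding a third term can only shrink the result, which is the defining property of a conjunctive query.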

### 2.2) Conjunctive query & Ranking score
For the second search engine, given a query, we want to get the top-k (the choice of k it's up to you!) documents related to the query. In particular:

* Find all the documents that contains all the words in the query.
* Sort them by their similarity with the query.
* Return in output k documents, or all the documents with non-zero similarity with the query when the results are less than k. You must use a heap data structure (you can use Python libraries) for maintaining the top-k documents.

To solve this task, you will have to use the tfIdf score and the cosine similarity. The field to consider is still the synopsis. Let's see how.

### 2.2.1) Inverted index

In [None]:
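# draft (unfinished): inverted index that also stores how often each word occurs in each document

'''
from collections import Counter

def create_inv_idx(corpus, vocab):
    
    inv_idx = {idx: [] for idx in vocab.values()}
   
    for doc_num, doc in enumerate(corpus):
        cnt = Counter(doc)
        for word, freq in cnt.items():
            inv_idx[vocab[word]].append((doc_num, freq))
     
    return inv_idx
'''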
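
One way an index with tf-idf weights could be built is sketched below. This is not the authors' solution, which is left unfinished in this notebook: the function name `tfidf_index` is hypothetical, and the weighting (term count divided by document length, times log(N / df)) is only one of several common variants.

```python
import math
from collections import Counter


def tfidf_index(corpus, vocab):
    """Map each term id to a list of (doc_id, tf-idf) pairs.

    tf is the term count divided by the document length and
    idf is log(N / df); other weighting variants are equally valid.
    """
    n_docs = len(corpus)
    counts = [Counter(doc) for doc in corpus]
    df = Counter(word for cnt in counts for word in cnt)   # document frequency

    index = {idx: [] for idx in vocab.values()}
    for doc_id, cnt in enumerate(counts):
        length = sum(cnt.values())
        for word, freq in cnt.items():
            if word in vocab:
                weight = (freq / length) * math.log(n_docs / df[word])
                index[vocab[word]].append((doc_id, weight))
    return index
```

With this definition a word that occurs in every document gets weight 0, so it never contributes to the similarity.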

### 2.2.2) Execute the query
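
A possible shape for the ranking step, combining cosine similarity with a heap of size k, is sketched here. The names `cosine` and `top_k` and the sparse `{term_id: weight}` representation are assumptions for this example; `doc_vecs` is meant to hold only the documents that already satisfy the conjunctive query.

```python
import heapq
import math


def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse vectors stored as {term_id: weight}."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def top_k(query_vec, doc_vecs, k):
    """Keep the k most similar documents in a min-heap of (score, doc_id)."""
    heap = []
    for doc_id, vec in doc_vecs.items():
        score = cosine(query_vec, vec)
        if score == 0.0:                      # zero-similarity documents are not returned
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:              # better than the current worst of the top k
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

The min-heap never holds more than k entries, so memory stays bounded by k regardless of how many documents match the query.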

## 3. DEFINE A NEW SCORE!

Now it's your turn. Build a new metric to rank animes based on the queries of their users.

In this scenario, a single user can give in input more information than the single textual query, so you need to take into account all this information, and think a creative and logical way on how to answer at user's requests.

#### Practically:

The user will enter you a text query. As a starting point, get the query-related documents by exploiting the search engine of Step 2.1.<br>

Once you have the documents, you need to sort them according to your new score. In this step you won't have anymore to take into account just the plot of the documents, you must use the remaining variables in your dataset (or new possible variables that you can create from the existing ones...). You must use a heap data structure (you can use Python libraries) for maintaining the top-k documents.<br>

**Q: How to sort them?** <br>
**A: Allow the user to specify more information** that you find in the documents, and define a new metric that ranks the results based on the new request. You can also use other information regarding the anime to score some animes above others.<br>
N.B.: You have to define a scoring function, not a filter!

The output, must contain:

* **animeTitle**
* **animeDescription**
* **Url**
* **The new similarity score of the documents** with respect to the query


Are the results you obtain better than with the previous scoring function? Explain and compare the results.


## Solution

The basic idea is that in point 2.1 a user can only search for words in the synopsis, so here we let the user query the data in the fields they prefer among:
* title
* staff
* characters
* voices
* synopsis

Given that, the query should be in the form:<br>
"```word1 word2 [where_to_search] word3 word4 [where_t_s2] ...```"<br>
and the program will return the documents that contain word1 AND word2 in the field 'where_to_search' AND contain word3 AND word4 in the field 'where_t_s2'.

Hence the idea is to create an inverted index for each of these fields, then parse the query and obtain the documents that match it.<br>
We need to sort these documents according to a customized score and return the necessary information for each of them.
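
The query syntax can be illustrated with a small tokenizer. This is not the `parse_advanced_query` function imported from advanced_queryer, which also maps words to term ids; `split_fielded_query` is a hypothetical helper, and sending untagged trailing words to the synopsis is an assumption made for this sketch.

```python
import re

FIELDS = {"title", "staff", "characters", "voices", "synopsis"}


def split_fielded_query(query):
    """Group the words of a query under the field tag that follows them."""
    parsed, pending = {}, []
    for token in query.split():
        match = re.fullmatch(r"\[(\w+)\]", token)
        if match is None:
            pending.append(token)          # a plain word waiting for its field tag
            continue
        field = match.group(1).lower()
        if field not in FIELDS:
            raise ValueError(f"unknown field: {field}")
        parsed.setdefault(field, []).extend(pending)
        pending = []
    if pending:                            # trailing words without a tag (assumed: synopsis)
        parsed.setdefault("synopsis", []).extend(pending)
    return parsed


print(split_fielded_query("dragon [title] vegeta [characters]"))
```

Each group of words can then be stemmed and looked up in the inverted index of its own field, and the per-field result sets intersected.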


### Importing the data...
Given that the file can contain data that are not perfectly clean, we have to use some special arguments of the read_table function in order to retrieve a dataframe containing all the information of interest.<br>
As expected, the dataframe contains exactly 19122 rows indexed from 0 to 19121.

In [3]:
import csv
import pandas as pd 

def import_df(path = "../shared_stuff/tsv_files/0total_pages.tsv"):
    return pd.read_table(path,
                        delimiter = "\t",
                        header = "infer", quoting = csv.QUOTE_NONE,
                        on_bad_lines = "skip")  # replaces error_bad_lines, removed in pandas 2.0


In [4]:
df = import_df()
df.info()





<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19122 entries, 0 to 19121
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          19122 non-null  object
 1   type           19122 non-null  object
 2   episodes       19122 non-null  object
 3   start_date     19122 non-null  object
 4   end_date       19122 non-null  object
 5   score          19122 non-null  object
 6   users          19122 non-null  object
 7   ranked         19122 non-null  object
 8   popularity     19122 non-null  int64 
 9   members        19122 non-null  int64 
 10  synopsis       19122 non-null  object
 11  related_anime  19122 non-null  object
 12  characters     19122 non-null  object
 13  voices         19122 non-null  object
 14  staff          19122 non-null  object
dtypes: int64(2), object(13)
memory usage: 2.2+ MB


In [1]:
from advanced_queryer import *
from question_two import *

In [6]:
import ast

def preprocessing_staff(df):
    df['str_staff']= df.apply(lambda row: ' '.join([el[0] for el in ast.literal_eval(row['staff'])]), 
                                axis = 'columns')
    print(f"[Staff]: Converted all the lists in strings, starting the preprocessing...")
    prepr_staff = df.apply(lambda row: (preprocess(row['str_staff'], SnowballStemmer('english'))), axis = 'columns')
    return prepr_staff

def preprocessing_voices(df):
    df['str_voices'] = df.apply(lambda row: ' '.join(ast.literal_eval(row['voices'])), 
                                    axis = 'columns')
    print(f"[Voices]: Converted all the lists in strings, starting the preprocessing...")
    return df.apply(lambda row: (preprocess(row['str_voices'], SnowballStemmer('english'))), axis = 'columns')

def preprocessing_characters(df):
    df['str_characters'] = df.apply(lambda row: ' '.join(ast.literal_eval(row['characters'])), 
                                    axis = 'columns')
    print(f"[Characters]: Converted all the lists in strings, starting the preprocessing...")
    return df.apply(lambda row: (preprocess(row['str_characters'], SnowballStemmer('english'))), axis = 'columns')

def preprocessing_title(df):
    print(f"[Title]: Starting the preprocessing...")
    return df.apply(lambda row: (preprocess(row['title'], SnowballStemmer('english'))), axis = 'columns')

def preprocessing_synopsis(df):
    print(f"[Synopsis]: Starting the preprocessing...")
    return df.apply(lambda row: (preprocess(row['synopsis'], SnowballStemmer('english'))), axis = 'columns')

In [7]:
'''
Preprocessing staff,voices,characters and title
'''


prepr_staff = preprocessing_staff(df)
print("[Staff]: Done.")

prepr_voices = preprocessing_voices(df)
print("[Voices]: Done.")

prepr_characters = preprocessing_characters(df)
print("[Characters]: Done.")

prepr_title = preprocessing_title(df)
print("[Title]: Done.")

[Staff]: Converted all the lists in strings, starting the preprocessing...
[Staff]: Done.
[Voices]: Converted all the lists in strings, starting the preprocessing...
[Voices]: Done.
[Characters]: Converted all the lists in strings, starting the preprocessing...
[Characters]: Done.
[Title]: Starting the preprocessing...
[Title]: Done.


In [8]:
voc_staff = create_vocab(prepr_staff)
print("[Staff]: Created vocabulary")

voc_voices = create_vocab(prepr_voices)
print("[Voices]: Created vocabulary")

voc_characters = create_vocab(prepr_characters)
print("[Characters]: Created vocabulary")

voc_title = create_vocab(prepr_title)
print("[Title]: Created vocabulary")

[Staff]: Created vocabulary
[Voices]: Created vocabulary
[Characters]: Created vocabulary
[Title]: Created vocabulary


In [None]:
'''
Creation of inverted indexes of staff,voices,character and title
'''

idx_staff = create_inv_idx(prepr_staff, voc_staff)
print("[Staff]: Created index")

idx_voices = create_inv_idx(prepr_voices, voc_voices)
print("[Voices]: Created index")

idx_characters = create_inv_idx(prepr_characters, voc_characters)
print("[Characters]: Created index")

idx_title = create_inv_idx(prepr_title, voc_title)
print("[Title]: Created index")

In [None]:
import os

idx_dir = os.path.join('..', 'shared_stuff', 'indexes')
staff_dir = os.path.join(idx_dir, 'staff')
voices_dir = os.path.join(idx_dir, 'voices')
ch_dir = os.path.join(idx_dir, 'characters')
title_dir = os.path.join(idx_dir, 'title')
syns_dir = os.path.join(idx_dir, 'synopsis')

# the synopsis directory is included because its index is saved further below
tot_dirs = {'staff': staff_dir, 'voices': voices_dir, 'characters': ch_dir,
            'title': title_dir, 'synopsis': syns_dir}
for d in tot_dirs.values():
    # makedirs also creates missing parent directories
    os.makedirs(d, exist_ok=True)



In [None]:
'''
save all the vocabularies in the corresponding json files
'''

for p in [staff_dir, voices_dir, ch_dir, title_dir]:
    idx = os.path.join(p, 'inv_idx.json')
    voc = os.path.join(p, 'vocabulary.json')

    if os.path.exists(idx) or os.path.exists(voc):
        var = input(f"The directory {p} contains an index or vocabulary that will be overwritten, do you want to proceed? [y/n]")
        if var=='y':
            continue
        else:
            raise ValueError("Indexes already exist!")


save_dict_to_file(dct=voc_staff, filename=os.path.join(staff_dir, 'vocabulary.json'))
save_dict_to_file(dct=idx_staff, filename=os.path.join(staff_dir, 'inv_idx.json'))
print("[Staff]: All saved.")

save_dict_to_file(dct=voc_voices, filename=os.path.join(voices_dir, 'vocabulary.json'))
save_dict_to_file(dct=idx_voices, filename=os.path.join(voices_dir, 'inv_idx.json'))
print("[Voices]: All saved.")

save_dict_to_file(dct=voc_characters, filename=os.path.join(ch_dir, 'vocabulary.json'))
save_dict_to_file(dct=idx_characters, filename=os.path.join(ch_dir, 'inv_idx.json'))
print("[Characters]: All saved.")

save_dict_to_file(dct=voc_title, filename=os.path.join(title_dir, 'vocabulary.json'))
save_dict_to_file(dct=idx_title, filename=os.path.join(title_dir, 'inv_idx.json'))
print("[Title]: All saved.")

Here we're fixing the synopsis vocabulary

In [None]:
prepr_syns = preprocessing_synopsis(df)
print("[Synopsis]: Done.")
voc_syns = create_vocab(prepr_syns)
print("[Synopsis]: Created vocabulary")
idx_syns = create_inv_idx(prepr_syns, voc_syns)
print("[Synopsis]: Created index")


In [None]:


save_dict_to_file(dct=voc_syns, filename=os.path.join(syns_dir, 'vocabulary.json'))
save_dict_to_file(dct=idx_syns, filename=os.path.join(syns_dir, 'inv_idx.json'))
print("[Synopsis]: All saved.")

In [2]:
p_q = parse_advanced_query("dragon [title] vegeta [characters]")
print(p_q)

['dragon', '[title]', 'vegeta', '[characters]']
{'title': [4998], 'characters': [14666]}
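
As an illustration of the query syntax used above, here is a minimal sketch of how the words of an advanced query could be grouped by the `[field]` tag that follows them. `split_query_by_field` is a hypothetical helper, not the notebook's `parse_advanced_query`: it stops at the raw words and does not map them to vocabulary ids as the output above does.

```python
import re

def split_query_by_field(query):
    """Group the words of an advanced query by the [field] tag that follows them.

    Hypothetical helper for illustration only: it returns raw words per field
    and does not look them up in any vocabulary.
    """
    tokens = query.split()
    fields = {}
    pending = []  # words seen since the last [field] tag
    for tok in tokens:
        match = re.fullmatch(r"\[(\w+)\]", tok)
        if match:
            # a tag closes the group of words collected so far
            fields.setdefault(match.group(1), []).extend(pending)
            pending = []
        else:
            pending.append(tok.lower())
    # words after the last tag have no field and are ignored in this sketch
    return tokens, fields

tokens, fields = split_query_by_field("dragon [title] vegeta [characters]")
print(tokens)
print(fields)
```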


The idea is to assign a similarity score to each document that contains all the words we are searching for, following the formula:

$\displaystyle \sum_{i=0}^n \frac{q_i \cdot p_i}{d_i}$

where:
* $q_i$ = *length of the query in the i-th field*
* $d_i$ = *length of the document in the i-th field*
* $p_i$ = *weight of the i-th field of the search*

Obviously, fields such as title are more important than synopsis which might be a very long description: the user is able to find relevant results just by writing one word of the title or one main character.
The score is upper-bounded by 1. Nevertheless, it is almost impossible to reach this score, since the user would have to write all the right information in the appropriate field.

Since queries are commonly very short, this method gives good results.

Comparing our scoring function with the previous one, both range between 0 and 1, but the tf-idf scoring function reaches values close to 1 more easily.
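
The formula above can be sketched in a few lines. This is only an illustration under stated assumptions: `field_score` and the values in `FIELD_WEIGHTS` are hypothetical (the weights actually used by `query_anime` are not shown in this section), and documents are represented as plain lists of preprocessed words per field.

```python
# Hypothetical field weights: they sum to 1 so that the score cannot exceed 1.
FIELD_WEIGHTS = {"title": 0.5, "characters": 0.3, "synopsis": 0.2}

def field_score(query_fields, doc_fields, weights=FIELD_WEIGHTS):
    """Sum over the queried fields of weight * len(query) / len(document)."""
    score = 0.0
    for field, q_terms in query_fields.items():
        d_terms = doc_fields.get(field, [])
        # only documents that contain every query word of the field are scored
        if not d_terms or not set(q_terms) <= set(d_terms):
            return 0.0
        score += weights.get(field, 0.0) * len(q_terms) / len(d_terms)
    return score

query = {"title": ["dragon"], "characters": ["gohan"]}
doc = {
    "title": ["dragon", "ball", "z"],
    "characters": ["gohan", "goku", "vegeta", "piccolo"],
}
print(field_score(query, doc))
```

With these made-up weights the short title contributes more than the longer character list, which is the behaviour described above: one matching word in a short field is already enough to rank a document well.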


In [6]:

dragonball = "dragon [title] gohan [characters]"
query_anime(df, dragonball, 15)

['dragon', '[title]', 'gohan', '[characters]']


| | doc_id | title | description | url | score |
|---|---|---|---|---|---|
| 0 | 6339 | Dragon Ball GT | Emperor Pilaf finally has his hands on the Bla... | https://myanimelist.net/anime/225/Dragon_Ball_... | 0.243333 |
| 1 | 365 | Dragon Ball Z | Five years after winning the World Martial Art... | https://myanimelist.net/anime/813/Dragon_Ball_Z\n | 0.242564 |
| 2 | 1960 | Dragon Ball Super | Seven years after the events of | https://myanimelist.net/anime/30694/Dragon_Bal... | 0.241905 |
| 3 | 10958 | Super Dragon Ball Heroes | In May 2018, V-Jump announced a promotional an... | https://myanimelist.net/anime/37885/Super_Drag... | 0.183 |
| 4 | 4796 | Dragon Ball Z: Saiya-jin Zetsumetsu Keikaku | Dr. Raichii is the only Tsufurujin (the race e... | https://myanimelist.net/anime/984/Dragon_Ball_... | 0.125238 |
| 5 | 6340 | Dragon Ball Z: Atsumare! Gokuu World | Dragon Ball Z: Atsumare! Goku's World is a Ter... | https://myanimelist.net/anime/6714/Dragon_Ball... | 0.125238 |
| 6 | 5373 | Dragon Ball: Super Saiya-jin Zetsumetsu Keikaku | Remake of Dragon Ball Z: Plan to Destroy the S... | https://myanimelist.net/anime/10017/Dragon_Bal... | 0.125238 |
| 7 | 5660 | Dragon Ball Z: Summer Vacation Special | One peaceful afternoon, the Son family and fri... | https://myanimelist.net/anime/22695/Dragon_Bal... | 0.124167 |
| 8 | 1035 | Dragon Ball Kai (Dragon Ball Z Kai) | Five years after the events of Dragon Ball, ma... | https://myanimelist.net/anime/6033/Dragon_Ball... | 0.109231 |
| 9 | 5048 | Dragon Ball Z Movie 03: Chikyuu Marugoto Chouk... | A mysterious device crashes on planet Earth, c... | https://myanimelist.net/anime/896/Dragon_Ball_... | 0.0975 |


## 5. Algorithmic question
You consult for a personal trainer who has a back-to-back sequence of requests for appointments. A sequence of requests is of the form 30, 40, 25, 50, 30, 20, where each number is the time that the person who makes the appointment wants to spend. You need to accept some requests, however you need a break between them, so you cannot accept two consecutive requests. For example, [30, 50, 20] is an acceptable solution (of duration 100), but [30, 40, 50, 20] is not, because 30 and 40 are two consecutive appointments. Your goal is to provide to the personal trainer a schedule that maximizes the total length of the accepted appointments. For example, in the previous instance, the optimal solution is [40, 50, 20], of total duration 110.

* Write an algorithm that computes the acceptable solution with the longest possible duration.
* Implement a program that given in input an instance in the form given above, gives the optimal solution.
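
One way to approach the first point is a recurrence over prefixes of the sequence. The notation is ours and only a sketch: $t_i$ is the duration of the i-th request and $OPT(i)$ is the best total duration achievable using only the first $i$ requests.

$$
OPT(0) = 0, \qquad OPT(1) = t_1, \qquad OPT(i) = \max\bigl(OPT(i-1),\; OPT(i-2) + t_i\bigr) \quad \text{for } i \ge 2
$$

The first term skips request $i$; the second accepts it, which forces request $i-1$ to be skipped. On the instance 30, 40, 25, 50, 30, 20 the values of $OPT(1), \dots, OPT(6)$ are 30, 40, 55, 90, 90, 110, which agrees with the optimal duration of 110 given above.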

In [None]:
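def algoritmo(array):
    # a = best total up to two requests back, b = best total up to the previous request
    if not array:
        return 0
    a = 0
    b = array[0]
    for elem in array[1:]:
        # either accept elem (on top of a) or skip it (keep b)
        n = max(a + elem, b)
        a = b
        b = n
    return b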

In [10]:
array=[3,4,5,60,4]
algoritmo(array)

64

In [11]:
caso_limite = [50, 65, 90, 60, 35, 10, 15, 25]
print(algoritmo(caso_limite))

200


In [12]:
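def recursive_algo(arr):
    # base case: with zero or one request the best choice is to take everything
    if len(arr) < 2:
        return arr

    # either skip the first request, or take it and skip the second one
    return max(recursive_algo(arr[1:]), [arr[0]] + recursive_algo(arr[2:]), key=sum)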

recursive_algo(caso_limite)

[50, 90, 35, 25]

In [13]:
def algoritmo_leo(array):
    a=0
    b=array[0]
    step=[array[0]]
    for elem in array[1:]:
        n=max(a+elem,b)
        if a+elem>b:
            step.append(elem)
        print(elem,'max between',a+elem,b,': n=',n,'a=',a,'b=',b)
        a=b
        b=n
    return b,step

print(algoritmo_leo(caso_limite)[1])

65 max between 65 50 : n= 65 a= 0 b= 50
90 max between 140 65 : n= 140 a= 50 b= 65
60 max between 125 140 : n= 140 a= 65 b= 140
35 max between 175 140 : n= 175 a= 140 b= 140
10 max between 150 175 : n= 175 a= 140 b= 175
15 max between 190 175 : n= 190 a= 175 b= 175
25 max between 200 190 : n= 200 a= 175 b= 190
[50, 65, 90, 35, 15, 25]
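
The list printed above still contains adjacent requests (50 and 65, 15 and 25), because an element is appended every time it improves the running total, even if a later choice makes it obsolete. A possible alternative, sketched here with a hypothetical `best_schedule` function, is to store the whole table of partial optima and walk it backwards to recover the accepted requests.

```python
def best_schedule(requests):
    """Return (best total duration, accepted requests) with no two adjacent."""
    n = len(requests)
    # best[i] = maximum total duration using only the first i requests
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        skip = best[i - 1]
        take = requests[i - 1] + (best[i - 2] if i >= 2 else 0)
        best[i] = max(skip, take)

    # Walk back through the table to recover which requests were accepted.
    chosen = []
    i = n
    while i >= 1:
        if best[i] == best[i - 1]:
            i -= 1  # request i was skipped
        else:
            chosen.append(requests[i - 1])
            i -= 2  # request i was accepted, so request i-1 cannot be
    chosen.reverse()
    return best[n], chosen

print(best_schedule([50, 65, 90, 60, 35, 10, 15, 25]))
# prints (200, [50, 90, 35, 25])
```

This uses linear time and linear extra memory, and the result agrees with the one returned by `recursive_algo` on the same input.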


In [1]:
step=[50, 65, 90, 35, 15, 25, 35]
original=[50, 65, 90, 35, 15, 25, 35]

def algoritmo_leo(array):
    a=0
    b=array[0]
    step=[array[0]]
    for elem in array[1:]:
        n=max(a+elem,b)
        if a+elem>b:
            step.append(elem)
        print(elem,'max between',a+elem,b,': n=',n,'a=',a,'b=',b)
        a=b
        b=n
    return b,step

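def filtra_ris(step,original):
    # Walk the candidates from last to first and drop any candidate that is
    # adjacent, in the original sequence, to one we are keeping.
    step.reverse()
    print('step',step)

    ultimo=step[-1]
    i=0
    condizione=len(step)
    while i<condizione:
        print('while number',i,'step',step)
        # position of the current candidate in the original sequence
        j=0
        while (step[i]!=original[j]):
            j+=1
        # bounds checks come first so the lookups below cannot raise IndexError
        if (i+1<len(step) and j>0 and step[i]!=ultimo and step[i+1]==original[j-1]):
            step.remove(step[i+1])
            print(step)
        elif (i>0 and j+1<len(original) and step[i]==ultimo and step[i-1]==original[j+1]):
            step.remove(step[i-1])

        condizione=len(step)
        print('step',step)
        i+=1
    step.reverse()
    return step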


def solution(array):
    b,step=algoritmo_leo(array)
    risultato=filtra_ris(step,array)