# Homework 3 - Algorithmic Methods of Data Mining

In [None]:
import os
import time
import multiprocessing as mp
import ctypes
import nltk
from nltk.stem import PorterStemmer

import list_urls as l_u
import content_html as c_html
import anime_information as a_info
import vocabularize as voc

## 1. Data Collection

We start from the list of animes to include in our corpus of documents. In particular, we focus on the top animes listed in the first 400 pages. From this list we want to collect the URL associated with each anime. The list is long and split across many pages. The assignment asks us to retrieve only the URLs of the animes listed in the first 400 pages (each page has 50 animes, so we end up with 20000 unique anime URLs).
The output of this step is a .txt file in which each line contains one anime's URL.

### 1.1. Get the list of animes
We extract from the main pages (the first 400 according to pagination) the URLs of the animes' web pages, which are retrieved in the next steps.

In [None]:
path = os.getcwd()
# get the txt file with all the urls
l_u.parallelize_extraction()
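As an illustration of the extraction step, here is a minimal sketch that collects anime links from one listing page using only the standard library. It is not the code in `list_urls`: the `LinkCollector` class, the `example.org` URL prefix and the sample HTML are all assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect unique hrefs that start with a given prefix (hypothetical helper)."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.urls = []      # URLs in order of first appearance
        self._seen = set()  # used to skip duplicates

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix) and href not in self._seen:
            self._seen.add(href)
            self.urls.append(href)


# Made-up listing page: two anime links (one repeated) and one unrelated link.
sample_page = """
<table>
  <tr><td><a href="https://example.org/anime/1/alpha">Alpha</a></td></tr>
  <tr><td><a href="https://example.org/anime/1/alpha">Alpha (image)</a></td></tr>
  <tr><td><a href="https://example.org/anime/2/beta">Beta</a></td></tr>
  <tr><td><a href="https://example.org/about">About</a></td></tr>
</table>
"""

collector = LinkCollector("https://example.org/anime/")
collector.feed(sample_page)

# One URL per line, as in the .txt output described above.
print("\n".join(collector.urls))
```

Running the same collector over every listing page and concatenating the results would give the one-URL-per-line file; how the real module splits the pages across processes is not shown here.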

### 1.2. Get the html files from animes' urls previously collected
We download the HTML file behind each URL collected in the previous step and store the files in a directory organized as follows:
- one directory is created per listing page, containing the HTML files of the animes listed on that page.

In [None]:
# get the html files
c_html.get_content()
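The one-directory-per-page layout can be sketched as a mapping from an anime's position in the URL list to a file path. The 50 animes per page come from the text above; the `page_N` and `anime_N.html` names and the `html_path` helper are hypothetical and may differ from what `content_html` actually writes.

```python
from pathlib import Path

ANIMES_PER_PAGE = 50  # stated in the Data Collection section


def html_path(base_dir, index):
    """Return the target path for the anime at 0-based position `index`.

    Directory and file names are assumptions made for this sketch.
    """
    page_number = index // ANIMES_PER_PAGE + 1
    return Path(base_dir) / f"page_{page_number}" / f"anime_{index + 1}.html"


first = html_path("data", 0)       # first anime of page 1
second_page = html_path("data", 50)  # first anime of page 2
last = html_path("data", 19999)    # last anime of page 400
print(first, second_page, last)
```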

### 1.3 Get the needed information about animes
Here we read each HTML file previously collected, extract the requested information from the web page, and store it in a TSV file. Each page directory contains a subdirectory holding the TSV files of the corresponding HTML files.

In [None]:
# get the info about the animes
a_info.parallelize_parsing(path)

## 2. Search Engine

### 2.1 Conjunctive query

#### 2.1.1 Create indexes
After collecting the necessary data, we turned to the search engine.
We first initialize the objects needed to build the index files (JSON files).

In [None]:
# initialize the objects that will be needed: shared "managed" dictionaries
manager = mp.Manager()
vocabulary = manager.dict()
inverted_index = manager.dict()
complex_index = manager.dict()
docs_short = manager.dict()
# shared "managed" counter and lock to control tasks incrementing the counter
v = manager.Value(ctypes.c_ulonglong, 0)
lock = manager.Lock()
# stemming utilities
nltk.download('punkt')
porter = PorterStemmer()

At this point, we build the index files used in the following sections:

- Vocabulary: a txt file in which each line contains a pair: a pre-processed word and its unique identifier;
- Inverted_index: a JSON file in which each entry contains a pair: a word ID and the list of documents in which the word appears. Each document is represented as a string made of the word "document" and its identification number;
- Tf_complex_index: similar to inverted_index, but the list of documents contains tuples (document, tf score). At this point we only have the tf; the idf is computed afterwards.

In [None]:
# get JSON files for further calculations
voc.parallelize_process_anime(path, vocabulary, inverted_index, complex_index, porter, v, lock)
voc.write_index(vocabulary, "vocabulary")
voc.write_index(inverted_index, "inverted_index")
voc.write_index(complex_index, "tf_complex_index")
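To make the three structures concrete, here is a minimal single-process sketch that builds them from two tiny, already pre-processed documents. It is not the implementation in `vocabularize`: the `build_indexes` function, the `document_N` naming and the tf definition (term count divided by document length) are assumptions made for the example.

```python
from collections import Counter


def build_indexes(docs):
    """Build vocabulary, inverted index and tf index.

    docs: dict mapping a document name to its list of pre-processed tokens.
    """
    vocabulary = {}      # word -> term ID
    inverted_index = {}  # term ID -> list of document names
    tf_index = {}        # term ID -> list of (document name, tf)
    for doc_name, tokens in docs.items():
        counts = Counter(tokens)
        for word, count in counts.items():
            # Assign the next free ID the first time a word is seen.
            term_id = vocabulary.setdefault(word, len(vocabulary))
            inverted_index.setdefault(term_id, []).append(doc_name)
            # Assumed tf definition: relative frequency within the document.
            tf_index.setdefault(term_id, []).append((doc_name, count / len(tokens)))
    return vocabulary, inverted_index, tf_index


# Toy corpus with stemmed tokens (made up for illustration).
docs = {
    "document_1": ["ninja", "villag", "ninja"],
    "document_2": ["pirat", "villag"],
}
vocabulary, inverted_index, tf_index = build_indexes(docs)
print(vocabulary)
print(inverted_index)
print(tf_index)
```

In the notebook the same bookkeeping is spread across processes, which is why the dictionaries, the counter and the lock above are created through `mp.Manager()`.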

Now we compute two more files, which are needed for the rankings that follow:

- TfIdf_complex_index: like tf_complex_index, but with the tf-idf score instead of the tf score. Two steps were necessary because computing the idf requires corpus-wide information (the number of documents containing each word) that was not available during the first pass. Once that information is available (stored in inverted_index.json), the tf-idf scores can be computed;
- Docs_short: a complementary JSON file needed to compute the custom measure.

In [None]:
# at this point complex_index is incomplete: the number associated with each doc is only the tf part
# compute the complete tfIdf complex_index
voc.get_complex_index(complex_index)
voc.write_index(complex_index, "tfIdf_complex_index")
# get index with documents as key to retrieve information on members/popularity of each anime
voc.parallelize_docs_short(path, docs_short, porter)
voc.write_docs_short(docs_short)
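The second pass from tf to tf-idf can be sketched as follows. This assumes the common definition idf = log(N / df), where N is the number of documents and df the number of documents containing the term; the exact formula used by `voc.get_complex_index` is not shown in this notebook, and `to_tfidf` is a hypothetical name.

```python
import math


def to_tfidf(tf_index, n_docs):
    """Turn a tf index into a tf-idf index.

    tf_index: dict mapping term ID -> list of (document name, tf).
    n_docs: total number of documents in the corpus.
    """
    tfidf_index = {}
    for term_id, postings in tf_index.items():
        # The document frequency is the length of the posting list.
        idf = math.log(n_docs / len(postings))
        tfidf_index[term_id] = [(doc, tf * idf) for doc, tf in postings]
    return tfidf_index


# Toy tf index over two documents (values made up for illustration).
toy_tf_index = {
    0: [("document_1", 0.5)],                       # appears in one document
    1: [("document_1", 0.5), ("document_2", 1.0)],  # appears in both
}
toy_tfidf_index = to_tfidf(toy_tf_index, n_docs=2)
print(toy_tfidf_index)
```

With this definition a term that appears in every document gets weight 0, so it does not contribute to the cosine similarity.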

#### 2.1.2 Execute the query

In [None]:
# run conjunctive query and print the result as table
docs = voc.conjunctive_query(path, porter)
voc.print_tables(docs, path, "")
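A conjunctive query returns the documents that contain every query term, which amounts to intersecting posting lists. The sketch below assumes the query has already been pre-processed the same way as the documents; `conjunctive_search` and the toy data are hypothetical and do not reproduce `voc.conjunctive_query`, which also reads the query from the user.

```python
def conjunctive_search(query_terms, vocabulary, inverted_index):
    """Return the set of documents containing all query terms."""
    doc_sets = []
    for word in query_terms:
        term_id = vocabulary.get(word)
        if term_id is None:
            # A word outside the vocabulary cannot match any document.
            return set()
        doc_sets.append(set(inverted_index.get(term_id, [])))
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)


# Toy vocabulary and inverted index (made up for illustration).
toy_vocabulary = {"ninja": 0, "villag": 1, "pirat": 2}
toy_inverted_index = {
    0: ["document_1"],
    1: ["document_1", "document_2"],
    2: ["document_2"],
}

both = conjunctive_search(["ninja", "villag"], toy_vocabulary, toy_inverted_index)
none = conjunctive_search(["ninja", "pirat"], toy_vocabulary, toy_inverted_index)
unknown = conjunctive_search(["robot"], toy_vocabulary, toy_inverted_index)
print(both, none, unknown)
```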

### 2.2 Cosine-similarity ranking

In [None]:
# run cosine similarity query and print the result as table
top = voc.cosine_similarity_rank(porter, 5)
voc.print_tables(top, path, "tfIdf")
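For reference, a cosine-similarity ranking over a tf-idf index can be sketched as below. It assumes that each distinct query term has weight 1, that document norms are computed over all terms of the document, and that the second argument of `voc.cosine_similarity_rank` is the number of results to return; none of this is confirmed by the notebook. `cosine_top_k` is a hypothetical name, and `heapq.nlargest` is used here to pick the top results.

```python
import heapq
import math
from collections import defaultdict


def cosine_top_k(query_term_ids, tfidf_index, k):
    """Return up to k (score, document) pairs, best first."""
    # Squared Euclidean norm of every document vector.
    squared_norms = defaultdict(float)
    for postings in tfidf_index.values():
        for doc, weight in postings:
            squared_norms[doc] += weight * weight

    # Dot product between the query (weight 1 per distinct term) and each document.
    unique_terms = set(query_term_ids)
    dots = defaultdict(float)
    for term_id in unique_terms:
        for doc, weight in tfidf_index.get(term_id, []):
            dots[doc] += weight

    query_norm = math.sqrt(len(unique_terms))
    scores = [
        (dot / (math.sqrt(squared_norms[doc]) * query_norm), doc)
        for doc, dot in dots.items()
        if squared_norms[doc] > 0
    ]
    return heapq.nlargest(k, scores)


# Toy tf-idf index (values made up for illustration).
toy_index = {
    0: [("document_1", 0.6)],
    1: [("document_1", 0.8), ("document_2", 0.5)],
}
ranking_two_terms = cosine_top_k([0, 1], toy_index, k=5)
ranking_one_term = cosine_top_k([1], toy_index, k=1)
print(ranking_two_terms)
print(ranking_one_term)
```

In the toy data, `document_1` ranks first when both terms are queried, while `document_2` wins for the single term it is entirely made of.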

## 3. Custom ranking 

In [None]:
# run custom query and print the result as table
top = voc.custom_rank(porter, 5)
voc.print_tables(top, path, "tfIdf")