This implementation is from group 05, made by:

- Joao Fonseca 89476
- Tomas Lopes 89552


This notebook showcases the functionality of the second delivery. Each section presents a function call and its output, followed by a description of the structure and meaning of the function's inputs and outputs.

In [1]:
# Imports - clustering
from clustering import *

In [2]:
# Clustering Approach (a) - clustering

# Processing the topics corpus (stop words removal, stemming, tag removal, selection of pertinent sections)
corpus = process_topics(topic_directory)

# Clustering the corpus (with n_clusters=50)
# Note: scikit-learn >= 1.2 renames the `affinity` parameter to `metric`
clusters = clustering(corpus, clustering_model=AgglomerativeClustering(n_clusters=50, linkage="complete", affinity="cosine"))

print(f"Clusters: {clusters}")

Clusters: [(array([0., 0., 0., ..., 0., 0., 0.]), {197, 134, 102, 136, 180}), (array([0., 0., 0., ..., 0., 0., 0.]), {160, 170, 177, 187, 159}), (array([0., 0., 0., ..., 0., 0., 0.]), {174, 198, 158}), (array([0., 0., 0., ..., 0., 0., 0.]), {107, 125}), (array([0., 0., 0., ..., 0., 0., 0.]), {114, 110}), (array([0., 0., 0., ..., 0., 0., 0.]), {192, 111}), (array([0.06980858, 0.        , 0.        , ..., 0.06186659, 0.05997764,
       0.33195777]), {148, 135}), (array([0., 0., 0., ..., 0., 0., 0.]), {141, 149}), (array([0.       , 0.       , 0.       , ..., 0.       , 0.0707719,
       0.       ]), {105, 147}), (array([0.        , 0.        , 0.        , ..., 0.11700774, 0.        ,
       0.        ]), {176, 162, 165}), (array([0., 0., 0., ..., 0., 0., 0.]), {156, 175}), (array([0., 0., 0., ..., 0., 0., 0.]), {145, 143}), (array([0., 0., 0., ..., 0., 0., 0.]), {112, 106}), (array([0.        , 0.12061111, 0.        , ..., 0.        , 0.        ,
       0.        ]), {104, 131}), (array(

In this case, we clustered the entire topic collection.

__@input:__

__corpus__ corresponds to the processed topic/document collection

__clustering_model__ corresponds to the clustering model to be used

__@output__

A list of tuples, one per cluster. Each tuple pairs the cluster's centroid with the set of topic/document identifiers in that cluster.
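As an illustration of this output structure, the sketch below groups flat cluster labels into (centroid, set of identifiers) pairs. It is not the code of `clustering`: the helper name `group_clusters`, the plain-list vectors and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def group_clusters(doc_ids, vectors, labels):
    """Build (centroid, set_of_ids) pairs from flat cluster labels (hypothetical helper)."""
    members = defaultdict(list)
    for doc_id, vec, label in zip(doc_ids, vectors, labels):
        members[label].append((doc_id, vec))

    clusters = []
    for label in sorted(members):
        ids = {doc_id for doc_id, _ in members[label]}
        vecs = [vec for _, vec in members[label]]
        # Centroid = component-wise mean of the member vectors
        centroid = [sum(column) / len(vecs) for column in zip(*vecs)]
        clusters.append((centroid, ids))
    return clusters


# Toy data: three topics, two clusters (all values are made up)
example = group_clusters(
    doc_ids=[101, 102, 103],
    vectors=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    labels=[0, 1, 0],
)
```

Here `example[0]` holds the centroid and identifiers of the first cluster, mirroring how `clusters[0]` is used in the next cell.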

In [3]:
# Clustering Approach (b) - interpret

# Describes the documents in the first cluster (considering median and medoid criteria)
n_docs, docs_in_cluster, centroid, medoid, label, median = interpret(clusters[0], corpus)

print(f"Number of docs in cluster 0: {n_docs}")
print(f"Docs in cluster 0: {docs_in_cluster}")
print(f"Cluster 0 centroid: {centroid}")
print(f"Cluster 0 medoid: {medoid}")
print(f"Suggested label for cluster 0: {label}")
print(f"Geometric median of cluster 0: {median}")

Number of docs in cluster 0: 5
Docs in cluster 0: [102, 134, 136, 180, 197]
Cluster 0 centroid: [0. 0. 0. ... 0. 0. 0.]
Cluster 0 medoid: 197
Suggested label for cluster 0: crime crimin law
Geometric median of cluster 0: [0. 0. 0. ... 0. 0. 0.]


In this case, we interpreted the first cluster of the list of clusters returned in the previous cell.

__@input:__

__cluster__ corresponds to the cluster which is going to be analyzed

__corpus__ corresponds to the processed topic/document collection

__@output__

The number of topics/documents in the cluster and their identifiers, the centroid and medoid of the cluster, the suggested label for the cluster given the corpus, and the cluster's geometric median (computed through unconstrained minimization, using the BFGS method with cosine distance).
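To make the medoid concrete, here is a minimal sketch that picks the cluster member with the smallest total cosine distance to the other members. Whether `interpret` uses exactly this criterion is an assumption; the function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def medoid(vectors_by_id):
    """Return the id of the member minimising the summed distance to all members."""
    return min(
        vectors_by_id,
        key=lambda i: sum(
            cosine_distance(vectors_by_id[i], v) for v in vectors_by_id.values()
        ),
    )


# Toy cluster: document 2 lies "between" documents 1 and 3
toy = {1: [1.0, 0.0], 2: [1.0, 1.0], 3: [0.0, 1.0]}
toy_medoid = medoid(toy)
```

Unlike the centroid or the geometric median, the medoid is always an actual member of the cluster, which is why the output above reports it as a document identifier rather than a vector.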

In [4]:
# Clustering Approach (c) - evaluate

# Evaluates the solution produced by the clustering function
sil_score, vrc, dbi = evaluate(corpus, clustering_model=AgglomerativeClustering(n_clusters=50, linkage="complete", affinity="cosine"))

print(f"Silhouette coefficient: {sil_score}")
print(f"Variance Ratio Criterion: {vrc}")
print(f"Davies-Bouldin index: {dbi}")

Silhouette coefficient: 0.19126574065915034
Variance Ratio Criterion: 2.1279172493520955
Davies-Bouldin index: 1.0913503208711628


In this case, we evaluated the clustering approach we had selected (AgglomerativeClustering with 50 clusters, complete linkage and cosine affinity).

__@input:__

__corpus__ corresponds to the processed topic/document collection

__clustering_model__ corresponds to the clustering model to be used

__@output__

The silhouette coefficient, the variance ratio criterion and the Davies-Bouldin index of the selected clustering approach.
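For intuition on the first of these metrics, the following is a small from-scratch sketch of the mean silhouette coefficient. It is only an illustration under assumptions: `evaluate` presumably relies on library implementations, and the distance function, the convention of scoring singleton clusters as 0 and the toy points are choices made here.

```python
def silhouette(points, labels, dist):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(dist(p, q))
        own = by_cluster.pop(own_label, [])
        if not own or not by_cluster:
            # Singleton cluster or only one cluster: scored 0 in this sketch
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def abs_diff(x, y):
    """Distance between two 1-D points."""
    return abs(x - y)


# Two well-separated 1-D clusters (made-up values)
toy_score = silhouette([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1], abs_diff)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.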

In [5]:
# Imports - classification
from classification import *

In [6]:
# Supervised Approach - setting up

# Process documents for the training and testing corpus (stop words removal, stemming, tag removal, selection of pertinent sections)
train_corpus = process_documents(corpus_directory, train=True)
test_corpus = process_documents(corpus_directory, train=False)

# Extract the relevance feedback for the training and testing process
train_rels = extract_relevance(qrels_train_directory)
test_rels = extract_relevance(qrels_test_directory)

In [7]:
# Supervised Approach (a) - training

# Processes topic R135 (stop words removal, stemming, tag removal, selection of pertinent sections)
topic = process_topic(135, topic_directory)

# The classifier used was K-Nearest-Neighbours with n_neighbors=25 and metric="euclidean"
classification_model = KNeighborsClassifier(n_neighbors=25, metric="euclidean")

# Trains the classification model with the documents that have relevance feedback for topics [R104, R135, R175],
# using features from topic 135
model = training(topic, train_corpus, train_rels, model=classification_model)

print(model)

KNeighborsClassifier(metric='euclidean', n_neighbors=25)


In this case, we trained the classification model with documents and relevance feedback regarding topics R104, R135 and R175, and with features calculated using the processed content of topic R135.
The classifier used was K-Nearest-Neighbours with n_neighbors=25 and metric="euclidean".

__@input:__

__topic__ corresponds to the processed topic content

__d_train__ corresponds to the processed training corpus

__r_train__ corresponds to the relevance feedback for the Dtrain collection

__model__ corresponds to the classifier used for this approach

__@output__

The fitted classification model.

In [8]:
# Supervised Approach (b) - classify

classes = [classify(test_corpus[i][1], topic, model) for i in range(len(test_corpus))]

print(classes)

[0.0, 0.56, 0.0, 0.56, 0.36, 0.0, 0.0, 0.36, 0.56, 0.0, 0.0, 0.36, 0.0, 0.0, 0.48, 0.0, 0.56, 0.56, 0.48, 0.48, 0.56, 0.48, 0.0, 0.0, 0.36, 0.32, 0.0, 0.36, 0.0, 0.36, 0.28, 0.32, 0.0, 0.28, 0.28, 0.32, 0.56, 0.56, 0.28, 0.44, 0.56, 0.48, 0.56, 0.44, 0.36, 0.56, 0.0, 0.36, 0.56, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.28, 0.56, 0.0, 0.0, 0.56, 0.28, 0.0, 0.0, 0.56, 0.56, 0.36, 0.0, 0.0, 0.36, 0.36, 0.28, 0.56, 0.36, 0.28, 0.56, 0.28, 0.36, 0.0, 0.0, 0.36, 0.0, 0.48, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.36, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.48, 0.0, 0.0, 0.0, 0.04, 0.48, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.56, 0.0, 0.0, 0.0, 0.0, 0.48, 0.4, 0.56, 0.56, 0.0, 0.0, 0.0, 0.0, 0.56, 0.56, 0.56, 0.0, 0.0, 0.48, 0.56, 0.0, 0.56, 0.56,

In this case, we classified each document in the test corpus according to what the model learned.

__@input:__

__doc__ corresponds to the document to be classified by the model

__topic__ corresponds to the processed content of the topic, for which the model will give a probabilistic output on the relevance of the given document

__model__ corresponds to the trained classifier

__@output__

The probabilistic output on the relevance of the given document to the given topic, as a value between 0 and 1.
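The printed values are all multiples of 0.04 = 1/25, which is what one would expect if the probability is the fraction of relevant documents among the 25 nearest neighbours. The sketch below illustrates that idea with uniform neighbour weights; it is an assumption about how the probability arises, not the code of `classify`, and the function name and toy data are hypothetical.

```python
import math


def knn_relevance_probability(doc_vec, train_vecs, train_labels, k):
    """Fraction of the k nearest training documents (Euclidean) labelled relevant (1)."""
    nearest = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: math.dist(doc_vec, pair[0]),
    )[:k]
    return sum(label for _, label in nearest) / k


# Toy training set: two relevant documents near the origin, three others elsewhere
train_vecs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_labels = [1, 1, 0, 0, 0]

# Two of the three nearest neighbours are relevant
p = knn_relevance_probability([0.05, 0.05], train_vecs, train_labels, k=3)
```

With `k=25`, the same computation can only produce values such as 0.0, 0.04, ..., 0.56, matching the granularity of the output above.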

In [9]:
# Supervised Approach (c) - evaluate

evaluation = evaluate([135], test_corpus, test_rels, [classes])

print(f"***Statistics for topic R135***")
mse, mae, evs, r2, tp, fn, fp, tn, acc, sens, spec = evaluation[0]
print("**With the original continuous probability values (using regression metrics)**")
print()
print(f"Mean squared error: {mse}")
print(f"Mean absolute error: {mae}")
print(f"Explained variance score: {evs}")
print(f"R^2 score: {r2}")
print()
print(f"**With binarization of probability values (p >= {bin_prob_threshold} is considered relevant, else irrelevant)**")
print()
print(f"True positives: {tp}")
print(f"False negatives: {fn}")
print(f"False positives: {fp}")
print(f"True negatives: {tn}")
print(f"Accuracy score: {acc}")
print(f"Sensitivity: {sens}")
print(f"Specificity: {spec}")

Evaluating topic 135
***Statistics for topic R135***
**With the original continuous probability values (using regression metrics)**

Mean squared error: 0.1046464
Mean absolute error: 0.19424
Explained variance score: 0.5681060817093302
R^2 score: 0.5035655325527997

**With binarization of probability values (p >= 0.3 is considered relevant, else irrelevant)**

True positives: 138
False negatives: 13
False positives: 35
True negatives: 314
Accuracy score: 0.904
Sensitivity: 0.9139072847682119
Specificity: 0.8997134670487106


In this case, we evaluated the output of the classification approach we had selected, using a probability threshold of 0.3: documents with a probability greater than or equal to this threshold are regarded as relevant when computing the classification metrics.

__@input:__

__topics__ corresponds to the identifiers of the topics used in the training procedure

__d_test__ corresponds to the processed testing corpus

__r_test__ corresponds to the relevance feedback for the Dtest collection

__classes_list__ corresponds to the list of previously calculated probabilities on the relevance of the documents

__@output__

The mean squared error, mean absolute error, explained variance score, and R^2 score of the classification output (regression metrics), as well as the confusion matrix, accuracy score, sensitivity, and specificity (classification metrics).
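The classification metrics can be sketched as follows: probabilities are binarized at the threshold and the confusion-matrix counts are derived from the result. The function name `binary_metrics` and the toy inputs are hypothetical, and the real `evaluate` may compute these values differently.

```python
def binary_metrics(probabilities, truths, threshold=0.3):
    """Binarize probabilities (p >= threshold means relevant) and compute basic metrics."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predicted, truths))
    tp = sum(1 for y, t in pairs if y == 1 and t == 1)
    fn = sum(1 for y, t in pairs if y == 0 and t == 1)
    fp = sum(1 for y, t in pairs if y == 1 and t == 0)
    tn = sum(1 for y, t in pairs if y == 0 and t == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on relevant documents
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on irrelevant documents
    return tp, fn, fp, tn, accuracy, sensitivity, specificity


# Toy example: four documents, two of them truly relevant
toy_metrics = binary_metrics([0.56, 0.28, 0.36, 0.0], [1, 1, 0, 0])
```

The figures reported for topic R135 are consistent with these definitions: (138 + 314) / 500 = 0.904, 138 / 151 ≈ 0.9139 and 314 / 349 ≈ 0.8997.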

In [10]:
# Imports - graph ranking
from pagerank import *

In [11]:
# Graph Ranking Approach - setting up

# Process documents and topics (stop words removal, stemming, tag removal, selection of pertinent sections)
corpus = process_documents(corpus_directory, stemmed=True, train=False)
topics = process_topics(topic_directory, stemmed=True)

In [12]:
# Graph Ranking Approach (a) - build graph

# Construct the graph that reflects document relationships
graph = build_graph(corpus, use_idf=True, threshold=0.4)

# Get the nodes and edges of the graph
nodes = list(graph.nodes)
edges = list(graph.edges)

print("First 20 Nodes:")
print(nodes[:20])
print("First 20 Edges:")
print(edges[:20])

First 20 Nodes:
[86971, 87911, 88903, 88908, 88914, 88993, 89021, 89133, 89439, 89726, 89905, 89987, 91489, 91543, 91659, 91966, 92158, 92447, 92516, 92593]
First 20 Edges:
[(86971, 101496), (86971, 115834), (86971, 132311), (86971, 149293), (86971, 165346), (86971, 181714), (86971, 181893), (86971, 198798), (86971, 216162), (86971, 230195), (86971, 274492), (86971, 281402), (86971, 290055), (88903, 88993), (88903, 89905), (88903, 91543), (88903, 102895), (88903, 103512), (88903, 105574), (88903, 106284)]


__@input:__

__corpus__ corresponds to the processed documents content

__use_idf__ corresponds to whether the TF-IDF vectorizer will use IDF to obtain document similarities or not

__threshold__ corresponds to the value of theta (minimum similarity threshold)

__@output__

An undirected graph that captures document relationships based on their similarities. 
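The thresholding idea behind the graph can be sketched with a plain dictionary instead of a graph library: two documents are linked when their cosine similarity reaches the threshold. The vector representation, the use of `>=` and all names below are assumptions for illustration, not the code of `build_graph`.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_graph(vectors_by_id, threshold):
    """Undirected graph as {node: {neighbour: similarity}}; edge iff similarity >= threshold."""
    graph = {doc_id: {} for doc_id in vectors_by_id}
    for a, b in combinations(vectors_by_id, 2):
        sim = cosine_similarity(vectors_by_id[a], vectors_by_id[b])
        if sim >= threshold:
            # Store the edge in both directions, since the graph is undirected
            graph[a][b] = sim
            graph[b][a] = sim
    return graph


# Toy documents: 1 and 3 are orthogonal, 2 is similar to both
toy_graph = build_similarity_graph(
    {1: [1.0, 0.0], 2: [1.0, 1.0], 3: [0.0, 1.0]}, threshold=0.4
)
```

Raising the threshold removes weaker edges and yields a sparser graph, while lowering it connects more documents.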

In [13]:
# Graph Ranking Approach (b) - undirected page rank

# Calculates the PageRank for a set of documents
upr = undirected_page_rank(topics[3], corpus, n_docs=20, threshold=0.9, sim="TF-IDF", use_priors=True, weighted=True)
    
print("Top 20 PageRank docs:")
print(upr)

Top 20 PageRank docs:
[(115834, 0.05262626913048876), (149293, 0.05160712691763864), (132311, 0.051569304340374975), (165346, 0.04645318796295266), (181893, 0.04601172198864406), (181714, 0.04577070763277186), (198798, 0.0453270234053804), (216162, 0.0382565070915663), (230195, 0.03630450053135198), (101496, 0.03004616379067409), (274492, 0.024873846342984757), (281402, 0.024845811461217893), (290055, 0.02471151893781811), (86971, 0.019467718235574773), (181687, 0.012741160355552096), (183845, 0.012741160355552096), (102965, 0.009073035613730683), (103671, 0.009073035613730683), (181612, 0.008688010758045435), (181880, 0.008688010758045435)]


In this case, we calculated the PageRank for a set of documents. The calculation of the priors was done based on the ranking scores for topic R104.
The function did not use the previously built graph, as the build_graph function is called inside this one. The previous call was purely for demonstration purposes.

__@input:__

__topic__ corresponds to the processed topic content, used for calculation of prior probabilities

__corpus__ corresponds to the processed documents content

__n_docs__ corresponds to the number of top documents to return

__sim__ corresponds to the similarity criterion that should be used to compare documents

__threshold__ corresponds to the minimum similarity value for which documents are considered similar

__max_iter__ corresponds to the maximum number of iterations the PageRank algorithm will perform

__damping__ corresponds to the damping factor

__use_priors__ corresponds to whether the PageRank algorithm will account for prior probabilities or simply use a uniform distribution

__weighted__ corresponds to whether the PageRank algorithm will use weighted edges or simply uniformly distributed weights

__@output__

A list of (document identifier, score) pairs, sorted in descending order of score.
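How priors and edge weights can enter the computation is sketched below with a standard personalized PageRank iteration over a `{node: {neighbour: weight}}` dictionary. The exact update rule, the default damping of 0.85 and the default of 50 iterations are assumptions; `undirected_page_rank` may differ in any of these.

```python
def page_rank(graph, priors=None, damping=0.85, max_iter=50):
    """Weighted PageRank with optional priors on an undirected graph (illustrative sketch)."""
    nodes = list(graph)
    if priors is None:
        priors = {node: 1.0 for node in nodes}  # uniform jump distribution
    prior_total = sum(priors.values())
    # Total edge weight leaving each node, used to normalise its contributions
    out_weight = {node: sum(graph[node].values()) for node in nodes}

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iter):
        new_rank = {}
        for node in nodes:
            # Each neighbour passes on a share of its rank proportional to the edge weight
            incoming = sum(
                rank[nb] * weight / out_weight[nb]
                for nb, weight in graph[node].items()
            )
            jump = priors[node] / prior_total
            new_rank[node] = (1 - damping) * jump + damping * incoming
        rank = new_rank
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)


# Toy path graph 1 - 2 - 3 with unit weights: the middle node should rank first
toy_graph = {1: {2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
ranking = page_rank(toy_graph)
```

Setting all edge weights to 1.0 corresponds to the unweighted case, and omitting `priors` corresponds to a uniform prior. Isolated nodes receive only the jump term in this sketch, so scores sum to 1 only when every node has at least one edge.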