# Unsupervised Retrieval

In this notebook we use the generated features for a unsupervised crosslingual information retrieval task.

## I. Import Data

In this section we import the feature dataframe for the retrieval task.

In [1]:
import pickle5 as pickle
import pandas as pd
import sys, os
sys.path.append(os.path.dirname((os.path.abspath(''))))

from src.models.predict_model import MAP_score 

feature_retrieval = pd.read_feather("../data/processed/feature_retrieval_en_de_testset.feather")
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_pl = pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_it = pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it = feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_doc = pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})

## II. Unsupervised Retrieval

For Unsupervised Classification we use the distance measure features defined in the feature generation. Therefore we have four unsupervised models. Three crosslingual embedding based and one sentence encoder based. The three crosslingual embedding based models work with euclidean, cosine and word mover distance.

# Unsupervised: Proc 5k

In [2]:
print("Unsupervised for EN-DE")
unsupervised_prediction = feature_retrieval["cosine_similarity_average_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval["cosine_similarity_tf_idf_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval["jaccard_translation_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-DE
Map Score for Cosine Similarity Average: 0.3505914370259752
Map Score for Cosine Similarity tf-idf: 0.4943535537432778
Map Score for Jaccard Translation: 0.7535745719798199


In [2]:
print("Unsupervised for EN-DE")
unsupervised_prediction = feature_retrieval["cosine_similarity_average_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval["cosine_similarity_tf_idf_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval["jaccard_translation_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-DE
Map Score for Cosine Similarity Average: 0.2584898701767092
Map Score for Cosine Similarity tf-idf: 0.413996493330724
Map Score for Jaccard Translation: 0.6985180520618797


In [3]:
print("Unsupervised for EN-IT")
unsupervised_prediction = feature_retrieval_it["cosine_similarity_average_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_it["cosine_similarity_tf_idf_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_it["jaccard_translation_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-IT
Map Score for Cosine Similarity Average: 0.31014113064298576
Map Score for Cosine Similarity tf-idf: 0.4850770175117134
Map Score for Jaccard Translation: 0.7852548730691135


In [4]:
print("Unsupervised for EN-PL")
unsupervised_prediction = feature_retrieval_pl["cosine_similarity_average_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_pl["cosine_similarity_tf_idf_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_pl["jaccard_translation_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-PL
Map Score for Cosine Similarity Average: 0.2786150707142338
Map Score for Cosine Similarity tf-idf: 0.4766138420359052
Map Score for Jaccard Translation: 0.8137032595592242


# Unsupervised: Proc-B 1k

In [5]:
print("Unsupervised for EN-DE")
unsupervised_prediction = feature_retrieval["cosine_similarity_average_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval["cosine_similarity_tf_idf_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval["jaccard_translation_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-DE
Map Score for Cosine Similarity Average: 0.3457652409615119
Map Score for Cosine Similarity tf-idf: 0.4927404504291553
Map Score for Jaccard Translation: 0.7395733864576752


In [6]:
print("Unsupervised for EN-IT")
unsupervised_prediction = feature_retrieval_it["cosine_similarity_average_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_it["cosine_similarity_tf_idf_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_it["jaccard_translation_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-IT
Map Score for Cosine Similarity Average: 0.3248569226142762
Map Score for Cosine Similarity tf-idf: 0.47241797298157584
Map Score for Jaccard Translation: 0.7666288527118588


In [7]:
print("Unsupervised for EN-PL")
unsupervised_prediction = feature_retrieval_pl["cosine_similarity_average_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_pl["cosine_similarity_tf_idf_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_pl["jaccard_translation_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-PL
Map Score for Cosine Similarity Average: 0.22835769863245614
Map Score for Cosine Similarity tf-idf: 0.41528536333978944
Map Score for Jaccard Translation: 0.7747319008780764


# Unsupervised: VecMap

In [8]:
unsupervised_prediction = feature_retrieval["cosine_similarity_average_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval["cosine_similarity_tf_idf_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval["jaccard_translation_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval["source_id"], feature_retrieval["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Map Score for Cosine Similarity Average: 0.3933087812214947
Map Score for Cosine Similarity tf-idf: 0.5169790332324079
Map Score for Jaccard Translation: 0.7444692427185002


In [9]:
print("Unsupervised for EN-IT")
unsupervised_prediction = feature_retrieval_it["cosine_similarity_average_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_it["cosine_similarity_tf_idf_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_it["jaccard_translation_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_it["source_id"], feature_retrieval_it["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-IT
Map Score for Cosine Similarity Average: 0.4223994289089312
Map Score for Cosine Similarity tf-idf: 0.5556760158274051
Map Score for Jaccard Translation: 0.7952996481245238


In [10]:
print("Unsupervised for EN-PL")
unsupervised_prediction = feature_retrieval_pl["cosine_similarity_average_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_pl["cosine_similarity_tf_idf_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tf-idf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_pl["jaccard_translation_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_pl["source_id"], feature_retrieval_pl["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Unsupervised for EN-PL
Map Score for Cosine Similarity Average: 0.3548724728578375
Map Score for Cosine Similarity tf-idf: 0.5251183009700898
Map Score for Jaccard Translation: 0.8103150605249755


# Document Corpus

In [13]:
unsupervised_prediction = feature_retrieval_doc["cosine_similarity_average_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Document Corpus Evaluation:")

print("\n------Proc-5K-----")
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_doc["cosine_similarity_tf_idf_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tfidf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_doc["jaccard_translation_proc_5k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))


print("\n------Proc-b-1K-----")
unsupervised_prediction = feature_retrieval_doc["cosine_similarity_average_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_doc["cosine_similarity_tf_idf_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tfidf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_doc["jaccard_translation_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))


print("\n------VecMap-----")
unsupervised_prediction = feature_retrieval_doc["cosine_similarity_average_proc_b_1k"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity Average: {}".format(map_score))

unsupervised_prediction = feature_retrieval_doc["cosine_similarity_tf_idf_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Cosine Similarity tfidf: {}".format(map_score))

unsupervised_prediction = feature_retrieval_doc["jaccard_translation_vecmap"].to_numpy()
unsupervised_prediction = [[1- pos_prediction_prob, pos_prediction_prob] for pos_prediction_prob in unsupervised_prediction]
map_score = MAP_score(feature_retrieval_doc["source_id"], feature_retrieval_doc["Translation"], unsupervised_prediction)
print("Map Score for Jaccard Translation: {}".format(map_score))

Document Corpus Evaluation:

------Proc-5K-----
Map Score for Cosine Similarity Average: 0.0025130097534969286
Map Score for Cosine Similarity tfidf: 0.025811535832982724
Map Score for Jaccard Translation: 0.059061184981531165

------Proc-b-1K-----
Map Score for Cosine Similarity Average: 0.0029142710632124097
Map Score for Cosine Similarity tfidf: 0.02737368845495181
Map Score for Jaccard Translation: 0.08353223927958048

------VecMap-----
Map Score for Cosine Similarity Average: 0.0029142710632124097
Map Score for Cosine Similarity tfidf: 0.023074329213133172
Map Score for Jaccard Translation: 0.08408493760362276


# Analysis: Why Jaccard Translation works well