# Classification / Extraction
--- 

We consider only some of the properties derived from the items. In this case we can build a vocabulary of properties, initially for two categories: **Culture Representative** and **Culture Agnostic**.


### How should we choose properties from Culture Agnostic and Culture Representative items?
### Training Phase:

--- 

### Step 1: Building a vocabulary of properties associated with Culture Representative and Culture Agnostic
In this phase, for every sample $item_{i}$ associated with one of the two categories, we put all its properties into a set $S_{P}$.
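
As a minimal sketch of this step, assuming each item is already mapped to the set of Wikidata property IDs it carries (the `item_properties` dictionary and the helper name `build_property_set` are hypothetical, used only for illustration), $S_P$ is the union of the per-item property sets:

```python
# Hypothetical toy data: item ID -> set of property IDs found on that item.
item_properties = {
    "Q1": {"P31", "P17", "P625"},
    "Q2": {"P31", "P279"},
    "Q3": {"P17", "P495"},
}

def build_property_set(item_properties):
    """Return S_P, the union of the property sets of all items."""
    s_p = set()
    for props in item_properties.values():
        s_p |= props  # add every property of this item
    return s_p

S_P = build_property_set(item_properties)
# Sorting gives every property a stable index for the later binary vectors.
vocabulary = sorted(S_P)
print(vocabulary)
```

Sorting is a design choice of this sketch, not a requirement of the method: any fixed ordering works as long as it is reused for every item.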
### Step 2: Embedding vectors of properties for every item
We associate a binary vector with every $item_{i}$: for every possible property $P_k$, the entry is 1 if the item has that property and 0 otherwise.
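
The binary encoding can be sketched as follows. The vocabulary and the item's property set are toy values, and `embed_item` is a hypothetical helper name, not a function of the notebook:

```python
import numpy as np

def embed_item(item_props, vocabulary):
    """Binary vector with one entry per vocabulary property: 1 if present, else 0."""
    return np.array([1 if p in item_props else 0 for p in vocabulary], dtype=np.int8)

vocabulary = ["P17", "P279", "P31"]          # toy vocabulary with a fixed order
vec = embed_item({"P31", "P17", "P999"}, vocabulary)
print(vec)  # P999 is not in the vocabulary, so it is ignored
```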
### Step 3: Compute the centroids for every category (C.A., C.R., C.EX.)
After preprocessing the data, we compute the centroids $C_{CA}$ and $C_{CR}$ as the mean of every property $P_k$ over all samples of the category.
### Step 4: Compute the Euclidean distance between properties and centroids
For every sample we compute the distance between each property and the centroids. This lets us identify the properties that have a strong relationship with cultural concepts and those tied to more neutral concepts.
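
A small sketch of Steps 3 and 4 on toy data, assuming the samples are already encoded as binary vectors. The arrays `X_ca` and `X_cr` and the helper `distance_to_centroid` are illustrative names, and the distance shown here is the sample-to-centroid Euclidean distance, which is one possible reading of the step:

```python
import numpy as np

# Toy binary embeddings: rows are samples, columns are properties.
X_ca = np.array([[1, 0, 1], [1, 1, 1]], dtype=float)  # Culture Agnostic samples
X_cr = np.array([[0, 1, 0], [0, 1, 1]], dtype=float)  # Culture Representative samples

# Centroid = mean of every property over all samples of the category.
c_ca = X_ca.mean(axis=0)
c_cr = X_cr.mean(axis=0)

def distance_to_centroid(x, centroid):
    """Euclidean distance between one embedded sample and a centroid."""
    return float(np.linalg.norm(x - centroid))

x = np.array([1.0, 0.0, 1.0])
d_ca = distance_to_centroid(x, c_ca)
d_cr = distance_to_centroid(x, c_cr)
print(d_ca, d_cr)  # the sample is closer to the agnostic centroid
```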
### Step 5: Assign a similarity to every property with respect to the centroids
Every property gets a weight computed from its distance to the centroids $C_{CA}$ and $C_{CR}$:
#### Case 1: Kernel Function (Kernel Density Estimation)
We use a kernel to give a more flexible weight: $w_{gauss}(item_{i})=\exp\left(-\frac{d_{C}(item_{i})^2}{2\sigma^2}\right)$.
$\sigma$ is a constant that controls the size of the neighbourhood of entities that influence the weight.
We can compute it as $\sigma=\frac{1}{\sqrt{2}\cdot\bar{d}}$, where $\bar{d}$ is the mean distance; it can also be tuned empirically.
**The Gaussian kernel has the advantage of providing a gradual decrease in weight rather than a linear or inversely proportional decrease, allowing us to assign greater weight to nearby entities without excluding those that are further away.**
We normalize all weights by dividing each one by the sum of the weights computed with respect to the same centroid.
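
The weighting can be sketched as below. This is only an illustration under stated assumptions: the kernel uses the squared distance (as the notebook's `gaussian_kernel_estimation` does), $\sigma$ defaults to $1/(\sqrt{2}\cdot\text{mean distance})$, the mean distance is assumed to be non-zero, and `gaussian_weights` is a hypothetical helper name:

```python
import numpy as np

def gaussian_weights(distances, sigma=None):
    """Gaussian-kernel weights from distances, normalized to sum to 1."""
    distances = np.asarray(distances, dtype=float)
    if sigma is None:
        # Heuristic from the text; assumes the mean distance is > 0.
        sigma = 1.0 / (np.sqrt(2.0) * distances.mean())
    w = np.exp(-distances**2 / (2.0 * sigma**2))
    return w / w.sum()

w = gaussian_weights([0.5, 1.0, 1.5])
print(w)  # smaller distance -> larger weight, but no weight is exactly zero
```

Passing an explicit `sigma` makes it possible to tune the neighbourhood size empirically instead of relying on the heuristic.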

### Test Phase:

### Step 1: Compute the distance of every element from both centroids
For every $item_i$ and every feature such as $P_{123}, P_{2345}$, etc., we compute the Euclidean distance.
### Step 2: Compute the similarity of every test sample (with the kernel approach) to both centroids
This step tells us how strongly a sample is drawn toward each of the two centroids.
### Step 3: Weighted sum of the feature importances of a sample
In this step we compute Culture_score and Agnostic_score, each as $\sum_{f=1}^{N_{feat}} item_{f} \cdot w_{f}$, where $w_{f}$ is the importance (weight) of feature $f$ computed in the training phase for the corresponding class.
### Step 4: Multiply the weighted score by the similarity to emphasize entities closer to the centroids
We multiply the weighted score by the similarity to get a total score that is more sensitive to the entity's proximity to the centroids. This helps make a more accurate prediction based on how much the entity resembles the cultural or agnostic center.

total_culture_score = $Culture\_score \cdot similarity(test\_sample)$, repeated for every class score
total_agnostic_score = ... 
### Step 5: Classification based on the best result
if total_culture_score > total_agnostic_score -> **Culture Representative**, otherwise **Culture Agnostic**
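
The whole test phase can be sketched in a few lines. Everything below is illustrative: the centroids, the weights, and the fixed `sigma=1.0` are made-up toy values, and `gaussian_similarity` and `classify` are hypothetical helper names rather than the notebook's functions:

```python
import numpy as np

def gaussian_similarity(x, centroid, sigma):
    """Gaussian-kernel similarity between a sample and a centroid."""
    return float(np.exp(-np.sum((x - centroid) ** 2) / (2.0 * sigma**2)))

def classify(x, centroids, weights, sigma=1.0):
    """Return the best label and the per-class total scores."""
    totals = {}
    for label in centroids:
        weighted_score = float(np.dot(x, weights[label]))      # Step 3
        similarity = gaussian_similarity(x, centroids[label], sigma)  # Step 2
        totals[label] = weighted_score * similarity             # Step 4
    return max(totals, key=totals.get), totals                  # Step 5

# Toy centroids and normalized property weights for the two classes.
centroids = {"CR": np.array([0.9, 0.8, 0.1]), "CA": np.array([0.1, 0.2, 0.9])}
weights = {"CR": np.array([0.5, 0.4, 0.1]), "CA": np.array([0.1, 0.2, 0.7])}

x = np.array([1.0, 1.0, 0.0])  # binary embedding of a test sample
label, totals = classify(x, centroids, weights)
print(label, totals)
```

Keeping the per-class totals alongside the label makes it easy to inspect how close a decision was, and adding a third class only requires one more entry in each dictionary.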









# Step 0: Building a Vocabulary with all properties of training items

In [None]:
from datasets import load_dataset
import pandas as pd
import torch as t 
import numpy as np
from wikidata.client import Client
from itertools import islice
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from scipy.spatial.distance import euclidean
import urllib.error

In [None]:
ds=load_dataset("sapienzanlp/nlp2025_hw1_cultural_dataset")
train_set=pd.DataFrame(ds["train"])
test_set=pd.DataFrame(ds["validation"])  # the validation split is used as test set
list_category=set(list(train_set.category))
print(list_category)
list_subcategory=set(list(train_set.subcategory))
X_train=train_set.values
X_test =test_set.values

print(X_train) # print the whole dataset
#print(X_train.shape) 

In [None]:
def extract_entity_id(url):
    return url.strip().split("/")[-1]
#print(X_train.shape[0])
def extract_sample_from_cat(X,cat):
    l=list()
    for elem in X:
        if elem[4]==cat:
            l.append(elem[0])
    return np.array(l)
#entity_train=np.zeros(shape=
#def count_frequency_prop_id(list_sampl):
    

def dynamic_threshold(n):
    if n>=430:
        return 0.35
    elif n>=340:
        return 0.4
    elif n>=250:
        return 0.45
    else:
        return 0.50
# Build the vocabulary from the most frequent properties of each category
client = Client()
vocabulary=list()
def build_vocabulary(cat):
    vocabulary_subset=list()
    list_sample_cat=extract_sample_from_cat(X_train,cat)
    number_list_sample_cat=len(list_sample_cat)
    #print("sample category: "+cat)
    #print(len(list_sample_cat))
    set_properties=list()
    #set_properties=np.array(set_properties)
    for url in list_sample_cat:
        entity_train=extract_entity_id(url)
        #-----Check on entity train
        if not entity_train or not entity_train.startswith("Q"): #verify if entity_id starts with Q
            continue
        try:
            item = client.get(entity_train, load=True)
        except urllib.error.HTTPError: #handle error of HTTP error like HTTP 404
            continue
        #------
        claim_item_i=item.data.get("claims",{})
       # print(claim_item_i)
        #print(len(claim_item_i))
        set_property_item=set()
        for prop_id,values in islice(claim_item_i.items(),len(claim_item_i)):
            #prop_entity = client.get(prop_id, load=True)
            #label = prop_entity.label
            #print(f"{prop_id} = {label} ({len(values)} statement{'s' if len(values) != 1 else ''})")
            set_property_item.add(str(prop_id))
        set_properties.append(set_property_item)

    print("Category:"+cat+" collected")
    counter=Counter()
    for s in set_properties:
        for prop in s:
            counter[prop]+=1
    frequency_prop=counter
    #frequency_prop=count_frequency_prop_id(set_properties)
    sorted_prop=frequency_prop.most_common()
   # print(sorted_prop)
    #threshold=int(len(sorted_prop)*0.6)
    # ERROR
    #------
    th=0
    if(number_list_sample_cat>=429):
        th=0.35
    elif number_list_sample_cat>=300:
        th=0.4
    elif number_list_sample_cat>=150:
        th=0.45
    #th=dynamic_threshold(number_list_sample_cat)
    for prop_i,count in sorted_prop:
        support_categories=count/number_list_sample_cat
        if support_categories>=th: #min support 
            vocabulary_subset.append(prop_i)
        #if len(vocabulary_subset)>=6: #the first top k
            #break
    #print("Updated Vocabulary",vocabulary_subset) 
    #-----       
    return vocabulary_subset
        #set_property_item.clear()
    #print("intersection of sample belongs"+cat,set.intersection(*set_properties))

        #label = prop_entity.labels
        #print(claim_item_i)


### Execute for category

In [None]:
with  ThreadPoolExecutor(max_workers=10) as executor:
    results=list(executor.map(build_vocabulary,list_category))

for vocab in results:
    vocabulary.extend(vocab)
vocabulary=list(set(vocabulary))
print("Updated Vocabulary: for categories",vocabulary)
#Updated Vocabulary: for categories ['P495', 'P279', 'P345', 'P910', 'P17', 'P571', 'P373', 'P18', 'P625', 'P856', 'P244', 'P646', 'P31', 'P131', 'P641']

### Execute for subcategory

In [None]:
with  ThreadPoolExecutor(max_workers=10) as executor:
    results=list(executor.map(build_vocabulary,list_subcategory))

for vocab in results:
    vocabulary.extend(vocab)
vocabulary=list(set(vocabulary))
print("Updated Vocabulary: for subcategories",vocabulary)
#Updated Vocabulary: for categories ['P495', 'P279', 'P345', 'P910', 'P17', 'P571', 'P373', 'P18', 'P625', 'P856', 'P244', 'P646', 'P31', 'P131', 'P641']

# Step 1: Embedding with binary 1/0 vectors

In [None]:
def gaussian_kernel_estimation(X,d):
    sigma=1/(np.sqrt(2)*np.mean(d,axis=0))
    dist_sq=np.sum((X-d)**2)
    return np.exp(-dist_sq/(2*sigma**2))

# too slow :(
#----------
def embedding_sample(X,vocabulary,device=None):
    X_train_emb=t.zeros(X.shape[0],len(vocabulary),dtype=t.int,device=device)
   # X_train_emb=np.zeros(shape=(X.shape[0],len(vocabulary)))
    #print(f"Processing {X.shape[0]} sample")
    for j in range(0,X.shape[0]):
        ent=extract_entity_id(X[j,0])
        item_j = client.get(ent, load=True)
        claim_item_i=item_j.data.get("claims",{})
        set_p=set()
        for prop_id,__ in islice(claim_item_i.items(),len(claim_item_i)):
            set_p.add(str(prop_id))
        #print("processed:",claim_item_i)
        for v in range(0,len(vocabulary)):
            if vocabulary[v] in set_p:
                X_train_emb[j,v]=1
            else:
                X_train_emb[j,v]=0
        #np.arange(X_train_emb=X[j,6]
        #print(f"Update X_train_emb {j}: {X_train_emb[j]}")
    return X_train_emb
device = t.device("cuda" if t.cuda.is_available() else "cpu")
#------------
#X_embed_train=embedding_sample(X_train,vocabulary)
print("Embedding culture representative samples")
culture_representative_train=embedding_sample(X_train[X_train[:,6]=='cultural representative'],vocabulary,device).to('cpu').numpy()
print("--Embedding culture agnostic sample")
culture_agnostic_train=embedding_sample(X_train[X_train[:,6]=='cultural agnostic'],vocabulary,device).to('cpu').numpy()
print("--Embedding Culture exclusive sample")
culture_exclusive_train=embedding_sample(X_train[X_train[:,6]=='cultural exclusive'],vocabulary,device).to('cpu').numpy()


print(f"culture_agnostic_train: {culture_agnostic_train}")
print(f"culture_agnostic_train shape {culture_agnostic_train.shape}")

print(f"culture_representative_train {culture_representative_train}")
print(f"culture_representative_train shape {culture_representative_train.shape}")

print(f"culture_exclusive_train {culture_exclusive_train}")
print(f"culture_exclusive_train shape {culture_exclusive_train.shape}")

centroid_agnostic=np.mean(culture_agnostic_train,axis=0)
print(f"centroid agnostic for each property {centroid_agnostic.shape}",centroid_agnostic)
centroid_representative=np.mean(culture_representative_train,axis=0)
print(f"centroid_representative {centroid_representative.shape}",centroid_representative)
centroid_exclusive=np.mean(culture_exclusive_train,axis=0)
print(f"centroid_exclusive {centroid_exclusive.shape}",centroid_exclusive)


weights_agnostic=np.zeros(shape=len(vocabulary))
weights_representative=np.zeros(shape=len(vocabulary))
weights_exclusive=np.zeros(shape=len(vocabulary))

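# Assumption: the weight of property i compares the i-th column of each class's
# binary embeddings with the i-th component of that class's centroid
# (X_train holds the raw dataset columns, not the embeddings).
for i in range(len(vocabulary)):
    weights_agnostic[i]=gaussian_kernel_estimation(culture_agnostic_train[:,i],centroid_agnostic[i:i+1])
    weights_representative[i]=gaussian_kernel_estimation(culture_representative_train[:,i],centroid_representative[i:i+1])
    weights_exclusive[i]=gaussian_kernel_estimation(culture_exclusive_train[:,i],centroid_exclusive[i:i+1])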

# normalization: divide each weight vector by its own sum
weights_agnostic/=np.sum(weights_agnostic)
print("weights agnostic normalized for every property")
weights_representative/=np.sum(weights_representative)
weights_exclusive/=np.sum(weights_exclusive)


#---

# Work in progress: implemented up to here
def predict_entity_score(x_sample,centroid_CA,centroid_CR,centroid_CE,weights_agnostic,weights_representative,weights_exclusive):
    # Kernel similarity of the sample to each class centroid
    similarity_sample_CA=gaussian_kernel_estimation(x_sample,centroid_CA)
    similarity_sample_CR=gaussian_kernel_estimation(x_sample,centroid_CR)
    similarity_sample_CE=gaussian_kernel_estimation(x_sample,centroid_CE)
    #weight_CA=
    #weight_CR=
   # weight_CE=
    # Weighted sum of the sample's properties for each class
    Sum_score_Agnostic=0
    Sum_score_Representative=0
    Sum_score_Exclusive=0
    for i in range(0,x_sample.shape[0]):
        Sum_score_Agnostic+=x_sample[i]*weights_agnostic[i]
        Sum_score_Representative+=x_sample[i]*weights_representative[i]
        Sum_score_Exclusive+=x_sample[i]*weights_exclusive[i]
        
    total_score_agnostic=Sum_score_Agnostic*similarity_sample_CA
    total_score_representative=Sum_score_Representative*similarity_sample_CR
    total_score_exclusive=Sum_score_Exclusive*similarity_sample_CE
    
    # Index of the best class: 0 = agnostic, 1 = representative, 2 = exclusive
    return np.argmax([total_score_agnostic,total_score_representative,total_score_exclusive])

### GPU implementation by LUCA (Emilio's is surely better)

In [36]:
import torch as t
import numpy as np
from itertools import islice
import time
import urllib.error  # needed for the HTTPError handling below


# Configure the device
device = t.device("cuda" if t.cuda.is_available() else "cpu")

def embedding_sample_optimized_gpu(X_numpy, vocabulary, client, extract_entity_id, device=None):
    """Create binary embeddings, keeping the tensor on the GPU."""
    num_samples = X_numpy.shape[0]
    vocab_size = len(vocabulary)
    X_train_emb = t.zeros((num_samples, vocab_size), dtype=t.float32, device=device)

    for j in range(num_samples):
        ent = extract_entity_id(X_numpy[j, 0])
        try:
            item_j = client.get(ent, load=True)
        except urllib.error.HTTPError: #handle error of HTTP error like HTTP 404
            continue
        claim_item_i = item_j.data.get("claims", {})
        # Iterate over the claim keys (property IDs), not over (key, value) pairs
        set_p = {str(prop_id) for prop_id in claim_item_i}
        present_mask = t.tensor([vocab_item in set_p for vocab_item in vocabulary], dtype=t.bool, device=device)
        X_train_emb[j, present_mask] = 1.0

    return X_train_emb.int()
def gaussian_kernel_estimation_torch(X, d, sigma):
    """Gaussian kernel estimate using PyTorch tensors."""
    dist_sq = t.sum((X - d)**2, dim=1, keepdim=True) if X.ndim > 1 else t.sum((X - d)**2)
    return t.exp(-dist_sq / (2 * sigma**2))
# Assume X_train and vocabulary are already defined (X_train as a NumPy array)
X_train_numpy = X_train
vocabulary_torch = [str(v) for v in vocabulary]

start_time = time.time()
print("Embedding culture representative samples (GPU)")
culture_representative_train_gpu = embedding_sample_optimized_gpu(
    X_train_numpy[X_train_numpy[:, 6] == 'cultural representative'],
    vocabulary_torch, client, extract_entity_id, device
)
print(f"--Embedding culture representative sample time: {time.time() - start_time:.4f} seconds")

start_time = time.time()
print("--Embedding culture agnostic sample (GPU)")
culture_agnostic_train_gpu = embedding_sample_optimized_gpu(
    X_train_numpy[X_train_numpy[:, 6] == 'cultural agnostic'],
    vocabulary_torch, client, extract_entity_id, device
)
print(f"--Embedding culture agnostic sample time: {time.time() - start_time:.4f} seconds")

start_time = time.time()
print("--Embedding Culture exclusive sample (GPU)")
culture_exclusive_train_gpu = embedding_sample_optimized_gpu(
    X_train_numpy[X_train_numpy[:, 6] == 'cultural exclusive'],
    vocabulary_torch, client, extract_entity_id, device
)
print(f"--Embedding Culture exclusive sample time: {time.time() - start_time:.4f} seconds")

print(f"culture_agnostic_train_gpu: {culture_agnostic_train_gpu.cpu().numpy()}")
print(f"cutlure agnostic shape {culture_agnostic_train_gpu.shape}")
print(f"culture_representative_train_gpu {culture_representative_train_gpu.cpu().numpy()}")
print(f"culture_representative_train_gpu shape {culture_representative_train_gpu.shape}")
print(f"culture_exclusive_train_gpu {culture_exclusive_train_gpu.cpu().numpy()}")
print(f"culture_exclusive_train_gpu shape {culture_exclusive_train_gpu.shape}")

# Compute the centroids on the GPU
centroid_agnostic_gpu = culture_agnostic_train_gpu.float().mean(dim=0)
print(f"centroid agnostic for each property (GPU) {centroid_agnostic_gpu.shape}", centroid_agnostic_gpu.cpu().numpy())
centroid_representative_gpu = culture_representative_train_gpu.float().mean(dim=0)
print(f"centroid_representative (GPU) {centroid_representative_gpu.shape}", centroid_representative_gpu.cpu().numpy())
centroid_exclusive_gpu = culture_exclusive_train_gpu.float().mean(dim=0)
print(f"centroid_exclusive (GPU) {centroid_exclusive_gpu.shape}", centroid_exclusive_gpu.cpu().numpy())

# Compute sigma from the embedded data (now on the GPU)
all_data_gpu = t.cat([
    culture_agnostic_train_gpu.float(),
    culture_representative_train_gpu.float(),
    culture_exclusive_train_gpu.float()
], dim=0)
distances_gpu = t.cdist(all_data_gpu, all_data_gpu)
sigma_gpu = 1 / (t.sqrt(t.tensor(2.0, device=device)) * (distances_gpu.mean() + 1e-8))

# Compute the weights with the Gaussian kernel (now on the embeddings on the GPU)
embeddings_agnostic_gpu = culture_agnostic_train_gpu.float()
weights_agnostic_gpu = gaussian_kernel_estimation_torch(embeddings_agnostic_gpu, centroid_agnostic_gpu.unsqueeze(0), sigma_gpu).mean(dim=0)

embeddings_representative_gpu = culture_representative_train_gpu.float()
weights_representative_gpu = gaussian_kernel_estimation_torch(embeddings_representative_gpu, centroid_representative_gpu.unsqueeze(0), sigma_gpu).mean(dim=0)

embeddings_exclusive_gpu = culture_exclusive_train_gpu.float()
weights_exclusive_gpu = gaussian_kernel_estimation_torch(embeddings_exclusive_gpu, centroid_exclusive_gpu.unsqueeze(0), sigma_gpu).mean(dim=0)

# Normalize the weights (now on the GPU)
weights_agnostic_gpu /= (weights_agnostic_gpu.sum() + 1e-8)
print("weights agnostic normalized for every property (GPU)")
weights_representative_gpu /= (weights_representative_gpu.sum() + 1e-8)
weights_exclusive_gpu /= (weights_exclusive_gpu.sum() + 1e-8)

# Prediction function optimized for the GPU (takes a NumPy input and moves it to the GPU)
def predict_entity_score_optimized_gpu(x_sample_np, centroid_CA_gpu, centroid_CR_gpu, centroid_CE_gpu,
                                      weights_agnostic_gpu, weights_representative_gpu, weights_exclusive_gpu,
                                      sigma_gpu):
    """Predict the class membership score using PyTorch tensors on the GPU."""
    x_sample = t.tensor(x_sample_np, dtype=t.float32, device=centroid_CA_gpu.device)

    similarity_sample_CA = gaussian_kernel_estimation_torch(x_sample.unsqueeze(0), centroid_CA_gpu.unsqueeze(0), sigma_gpu)
    similarity_sample_CR = gaussian_kernel_estimation_torch(x_sample.unsqueeze(0), centroid_CR_gpu.unsqueeze(0), sigma_gpu)
    similarity_sample_CE = gaussian_kernel_estimation_torch(x_sample.unsqueeze(0), centroid_CE_gpu.unsqueeze(0), sigma_gpu)

    weighted_sample_agnostic = x_sample * weights_agnostic_gpu
    weighted_sample_representative = x_sample * weights_representative_gpu
    weighted_sample_exclusive = x_sample * weights_exclusive_gpu

    total_score_agnostic = weighted_sample_agnostic.sum() * similarity_sample_CA
    total_score_representative = weighted_sample_representative.sum() * similarity_sample_CR
    total_score_exclusive = weighted_sample_exclusive.sum() * similarity_sample_CE

    return t.argmax(t.stack([total_score_agnostic, total_score_representative, total_score_exclusive])).item()

# Prediction example on a random sample with one value per vocabulary property
sample_to_predict = np.random.rand(len(vocabulary))

prediction = predict_entity_score_optimized_gpu(
    sample_to_predict,
    centroid_agnostic_gpu,
    centroid_representative_gpu,
    centroid_exclusive_gpu,
    weights_agnostic_gpu,
    weights_representative_gpu,
    weights_exclusive_gpu,
    sigma_gpu
)
print(f"Prediction for sample: {prediction}")

Embedding culture representative samples (GPU)


KeyboardInterrupt: 