# Classification / Extraction
--- 

We should consider only some of the properties derived from the items. To do so, we can build a vocabulary of properties, initially for two categories: **Culture Representative** and **Culture Agnostic**.


### How should we choose properties from Culture Agnostic and Culture Representative items?
### Training Phase:

--- 

### Step 1: Building a vocabulary of properties associated with Culture Representative and Culture Agnostic
In this phase, for every sample $item_{i}$ associated with one of the two categories, we put all of its properties into a set $S_{P}$.
### Step 2: Encoding the properties of every item as a vector
We associate a binary vector with every $item_{i}$: for every property $P_k$ in the vocabulary, the corresponding entry is 1 if the item has $P_k$ and 0 otherwise.
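
The encoding described in this step can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the helper name `encode_items` and the toy items are assumptions made for the example.

```python
import numpy as np

def encode_items(item_properties, vocabulary):
    """Return a (n_items, n_properties) matrix of 0/1 entries."""
    index = {prop: k for k, prop in enumerate(vocabulary)}
    X = np.zeros((len(item_properties), len(vocabulary)), dtype=np.float64)
    for i, props in enumerate(item_properties):
        for prop in props:
            k = index.get(prop)
            if k is not None:  # properties outside the vocabulary are ignored
                X[i, k] = 1.0
    return X

# Hypothetical toy data: two made-up items described by Wikidata-style property IDs.
vocabulary = ["P17", "P31", "P495"]
item_properties = [{"P31", "P495"}, {"P17", "P31", "P999"}]
X = encode_items(item_properties, vocabulary)
print(X)
```

Each row is one item and each column one vocabulary property, so the matrix can be passed directly to the centroid and distance computations of the following steps.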
### Step 3: Compute the centroids for every category (C.A, C.R, C.E.X)
After preprocessing the data, we compute the centroids $C_{CA}$ and $C_{CR}$, each defined as the mean of every property $P_k$ over all samples of that category.
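
A centroid of this kind is a per-column mean over the binary vectors of one category. The sketch below assumes the items are already encoded as a 0/1 matrix; the function name `compute_centroids`, the short labels `"CR"`/`"CA"`, and the toy data are hypothetical.

```python
import numpy as np

def compute_centroids(X, labels):
    """Mean binary vector per category: entry k is the fraction of items having P_k."""
    labels = np.asarray(labels)
    return {str(c): X[labels == c].mean(axis=0) for c in np.unique(labels)}

# Hypothetical toy matrix: 4 items x 3 properties, with made-up labels.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 1]], dtype=float)
labels = ["CR", "CR", "CA", "CA"]

centroids = compute_centroids(X, labels)
print(centroids["CR"], centroids["CA"])
```

Because the vectors are binary, each centroid component can be read as the share of items in that category that carry the property.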
### Step 4: Compute the Euclidean distance between properties and centroids
For every sample we compute the distance between each property and the centroids. This lets us identify the properties that are strongly related to cultural concepts, as opposed to those tied to more neutral concepts.
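
The text leaves open whether the distance is taken per item or per property, so the sketch below shows both readings as assumptions: the Euclidean distance of a whole item vector from a centroid, and the per-property absolute gap that makes up that distance. Function names and data are hypothetical.

```python
import numpy as np

def distance_to_centroid(X, centroid):
    """Euclidean distance of every item (row of X) from one centroid."""
    return np.linalg.norm(X - centroid, axis=1)

def per_property_gap(X, centroid):
    """Absolute difference between each item entry and the centroid, per property."""
    return np.abs(X - centroid)

# Hypothetical toy data: two items and one centroid over three properties.
X = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)
c_cr = np.array([1.0, 0.5, 0.0])

d = distance_to_centroid(X, c_cr)
gap = per_property_gap(X, c_cr)
print(d)
print(gap)
```

A small gap on a property means the item agrees with the typical item of that category on that property.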
### Step 5: Assigning a similarity weight to every property with respect to the centroids
Every property gets a weight computed from its distance to the centroids $C_{CA}$ and $C_{CR}$:
#### Case 1: Kernel Function (Kernel Density Estimation)
We use a kernel to obtain a more flexible weight: $w_{gauss}(item_{i})=\exp\left(-\frac{d_{C}(item_{i})}{2\sigma^2}\right)$, where $d_{C}(item_{i})$ is the distance from centroid $C$.
**$\sigma$** is a constant that controls the size of the neighbourhood of entities that receive a significant weight.
We can compute it as $\sigma=\frac{1}{\sqrt{2}\,\bar{d}}$, where $\bar{d}$ is the mean distance; it can also be tuned empirically.
**The Gaussian kernel has the advantage of providing a gradual decrease in weight rather than a linear or inversely proportional decrease, allowing us to assign greater weight to nearby entities without excluding those that are further away.**
Finally, we normalize the weights for each centroid by dividing them by their sum.
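
The weighting can be sketched as below. The code follows the formula exactly as written in this section, with the distance not squared (a textbook Gaussian kernel would use the squared distance); the function name `kernel_weights` and the guard on a zero mean distance are assumptions added for the example.

```python
import numpy as np

def kernel_weights(distances, sigma=None):
    """Kernel weights exp(-d / (2 * sigma^2)), normalized to sum to 1."""
    distances = np.asarray(distances, dtype=float)
    if sigma is None:
        mean_d = distances.mean()
        if mean_d == 0:
            raise ValueError("mean distance is zero, sigma is undefined")
        # Heuristic from the text: sigma = 1 / (sqrt(2) * mean distance)
        sigma = 1.0 / (np.sqrt(2.0) * mean_d)
    w = np.exp(-distances / (2.0 * sigma ** 2))
    return w / w.sum()

# Hypothetical distances of two entities from one centroid.
weights = kernel_weights([0.5, 1.5])
print(weights)
```

The closer entity receives the larger weight, and the farther one still keeps a non-zero share, which is the behaviour the kernel is chosen for.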

### Test Phase:

### Step 1: Compute the distance of every element from both centroids
For every $item_i$ and every feature (properties such as $P_{123}$, $P_{2345}$, etc.) we compute the Euclidean distance.
### Step 2: Compute the similarity of every test sample (with the kernel approach) to both centroids
This step tells us which of the two centroids a sample leans toward.
### Step 3: Weighted sum of the sample's features by importance
In this step we compute, for each class $C \in \{CR, CA\}$, the score $score_{C} = \sum_{f=1}^{N_{feat}} item_{f} \cdot importance_{C,f}$, where the importance of each feature is the weight computed in the training phase.
### Step 4: Multiply the weighted score by the similarity
To emphasize the influence of entities closer to the centroids, we multiply the weighted score by the similarity. The resulting total score is more sensitive to the entity's proximity to the centroids, which helps make a more accurate prediction based on how much the entity resembles the cultural or agnostic center.

`total_culture_score` $= score_{CR} \cdot similarity_{CR}(test\ sample)$, repeated for every class score

`total_agnostic_score` = ...
### Step 5: Classification based on the best result
if `total_culture_score` > `total_agnostic_score` -> **Culture Representative**
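
Putting the test-phase steps together, a minimal sketch could look like this. It assumes per-class centroids and per-class feature importances are already available from training; the function name `classify`, the fixed `sigma`, and all numbers are hypothetical.

```python
import numpy as np

def classify(x, centroids, importance, sigma):
    """Return the class with the highest (weighted score * kernel similarity)."""
    totals = {}
    for name, centroid in centroids.items():
        distance = np.linalg.norm(x - centroid)                # Step 1
        similarity = np.exp(-distance / (2.0 * sigma ** 2))    # Step 2
        weighted_score = float(np.dot(x, importance[name]))    # Step 3
        totals[name] = weighted_score * similarity             # Step 4
    return max(totals, key=totals.get), totals                 # Step 5

# Hypothetical centroids and feature importances over three properties.
centroids = {
    "Culture Representative": np.array([1.0, 0.5, 0.0]),
    "Culture Agnostic": np.array([0.0, 1.0, 1.0]),
}
importance = {
    "Culture Representative": np.array([0.6, 0.3, 0.1]),
    "Culture Agnostic": np.array([0.1, 0.4, 0.5]),
}
x = np.array([1.0, 1.0, 0.0])  # binary vector of a made-up test item

label, totals = classify(x, centroids, importance, sigma=1.0)
print(label, totals)
```

Taking the maximum over a dictionary of totals extends naturally to a third class if a centroid and importances are provided for it.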









# Step 0: Building a Vocabulary with all properties of training items

In [48]:
from datasets import load_dataset
import pandas as pd
import torch as t 
import numpy as np
from wikidata.client import Client
from itertools import islice
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

In [32]:
ds=load_dataset("sapienzanlp/nlp2025_hw1_cultural_dataset")
train_set=train_data=pd.DataFrame(ds["train"])
list_category=set(list(train_set.category))
X_train=train_set.values

print(X_train) # print the whole dataset
#print(X_train.shape) 

[['http://www.wikidata.org/entity/Q32786' '916' '2012 film by M. Mohanan'
  ... 'films' 'film' 'cultural exclusive']
 ['http://www.wikidata.org/entity/Q371' '!!!'
  'American dance-punk band from California' ... 'music' 'musical group'
  'cultural representative']
 ['http://www.wikidata.org/entity/Q3729947' '¡Soborno!'
  'Mort & Phil comic' ... 'comics and anime' 'comics'
  'cultural representative']
 ...
 ['http://www.wikidata.org/entity/Q10779' 'Zwenkau'
  'city in the district of Leipzig, in the Free State of Saxony, Germany'
  ... 'geography' 'city' 'cultural exclusive']
 ['http://www.wikidata.org/entity/Q245296' 'zydeco'
  'music genre evolved in southwest Louisiana which blends blues, rhythm and blues, and music indigenous to the Louisiana Creoles and the Native people of Louisiana'
  ... 'music' 'music genre' 'cultural representative']
 ['http://www.wikidata.org/entity/Q129298' 'Zygmunt Chmielewski'
  'actor (1894-1978)' ... 'performing arts' 'theatrical director'
  'cultural ex

In [None]:
def extract_entity_id(url):
    return url.strip().split("/")[-1]
print(X_train.shape[0])
def extract_sample_from_cat(X,cat):
    l=list()
    for elem in X:
        if elem[4]==cat:
            l.append(elem[0])
    return np.array(l)
#entity_train=np.zeros(shape=
def count_frequency_prop_id(list_sampl):
    counter=Counter()
    for s in list_sampl:
        for prop in s:
            counter[prop]+=1
    return counter

# Build the vocabulary from the most frequent properties of each category
client = Client()
vocabulary=list()
def build_vocabulary(cat):
    vocabulary_subset=list()
    list_sample_cat=extract_sample_from_cat(X_train,cat)
    #print("sample category: "+cat)
    #print(len(list_sample_cat))
    set_properties=list()
    #set_properties=np.array(set_properties)
    for url in list_sample_cat:
        entity_train=extract_entity_id(url)
        item = client.get(entity_train, load=True)
        claim_item_i=item.data.get("claims",{})
       # print(claim_item_i)
        #print(len(claim_item_i))
        set_property_item=set()
        for prop_id,values in islice(claim_item_i.items(),len(claim_item_i)):
            #prop_entity = client.get(prop_id, load=True)
            #label = prop_entity.label
            #print(f"{prop_id} = {label} ({len(values)} statement{'s' if len(values) != 1 else ''})")
            set_property_item.add(str(prop_id))
        set_properties.append(set_property_item)

    print("Category:"+cat+"is collected")
    frequency_prop=count_frequency_prop_id(set_properties)
    sorted_prop=frequency_prop.most_common()
    print(sorted_prop)
    #threshold=int(len(sorted_prop)*0.6)
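    # Keep the properties whose support within the category is at least 0.4.
    # Assumption: the denominator (missing in the original cell) is the number
    # of items in the category, so support = fraction of items with the property.
    n_items=len(list_sample_cat)
    for prop_i,count in sorted_prop:
        support_categories=count/n_items if n_items else 0.0
        if support_categories>=0.4:
            vocabulary_subset.append(prop_i)
    #print("Updated Vocabulary",vocabulary_subset)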
    return vocabulary_subset
        #set_property_item.clear()
    #print("intersection of sample belongs"+cat,set.intersection(*set_properties))

        #label = prop_entity.labels
        #print(claim_item_i)
with  ThreadPoolExecutor(max_workers=10) as executor:
    results=list(executor.map(build_vocabulary,list_category))

for vocab in results:
    vocabulary.extend(vocab)
vocabulary=list(set(vocabulary))
print("Updated Vocabulary: for categories",vocabulary)
    

6251
Category:booksis collected
[('P31', 172), ('P646', 130), ('P373', 85), ('P244', 78), ('P227', 77), ('P18', 74), ('P214', 66), ('P213', 61), ('P8189', 60), ('P279', 54), ('P2671', 51), ('P735', 50), ('P106', 50), ('P27', 50), ('P19', 50), ('P21', 50), ('P569', 50), ('P10832', 49), ('P268', 49), ('P1412', 48), ('P13049', 48), ('P269', 46), ('P17', 46), ('P691', 45), ('P734', 40), ('P1006', 38), ('P1207', 38), ('P407', 37), ('P7902', 37), ('P910', 33), ('P625', 33), ('P570', 32), ('P856', 32), ('P571', 31), ('P9964', 30), ('P1368', 29), ('P20', 28), ('P50', 27), ('P1343', 26), ('P131', 26), ('P2163', 25), ('P1889', 25), ('P9918', 25), ('P3368', 24), ('P166', 24), ('P1015', 23), ('P463', 22), ('P648', 22), ('P159', 22), ('P1014', 22), ('P3417', 21), ('P69', 21), ('P10553', 21), ('P7293', 21), ('P495', 20), ('P108', 19), ('P1559', 19), ('P271', 18), ('P1695', 18), ('P866', 18), ('P11496', 18), ('P937', 17), ('P3987', 17), ('P12458', 17), ('P950', 16), ('P1284', 16), ('P5398', 16), ('P1

KeyboardInterrupt: 

Category:architectureis collected
[('P646', 355), ('P373', 309), ('P31', 282), ('P18', 272), ('P279', 203), ('P227', 128), ('P17', 126), ('P910', 106), ('P244', 105), ('P131', 102), ('P625', 100), ('P1343', 98), ('P1014', 98), ('P6366', 85), ('P8189', 80), ('P1417', 78), ('P214', 77), ('P571', 68), ('P2671', 68), ('P3417', 67), ('P691', 64), ('P268', 64), ('P1889', 63), ('P12596', 55), ('P8814', 53), ('P569', 52), ('P106', 52), ('P27', 52), ('P21', 52), ('P2581', 52), ('P5008', 51), ('P213', 51), ('P735', 50), ('P19', 50), ('P8408', 49), ('P8313', 49), ('P8406', 49), ('P570', 46), ('P2347', 46), ('P4342', 45), ('P1435', 44), ('P856', 43), ('P1412', 43), ('P361', 43), ('P4212', 42), ('P269', 42), ('P366', 42), ('P12385', 41), ('P20', 41), ('P508', 41), ('P186', 40), ('P1296', 40), ('P10283', 39), ('P3827', 39), ('P10832', 39), ('P2163', 39), ('P5604', 39), ('P7902', 38), ('P734', 37), ('P13049', 37), ('P5508', 36), ('P935', 36), ('P2924', 36), ('P6375', 34), ('P349', 34), ('P245', 33), 