# Table of Contents
* [Dataset Creation:](#Dataset-Creation:)


## Dataset Creation:

How can we generate the dataset?

In [1]:
import numpy as np
import random
import pandas as pd
print(np.random.random(10))




[0.10773651 0.71644464 0.94283329 0.85951229 0.2807103  0.16025629
 0.88060116 0.23190625 0.43986917 0.55574045]


So we need to generate a dataset that has:

- Symmetrical relations
- Entities
- Logic task
- Redundant information for the second dataset

The first step is to define the set of relations, the axioms of symmetry between them, and the rules of compositionality and recursiveness. Then we can think about the actual generation by expanding a simple example into a more general case.

In [2]:
import string

In [9]:
r = ['P', 'S'] # set of symmetrical relations

r = np.asarray(r)
e = np.array(list(string.ascii_lowercase)) # this will be the set of entities 



Now we can define the axioms and the rules used to generate our dataset.

We want elements of the form $S(a, b)$.

Our first axiom is that $S(a, b) \iff P(b, a)$.




In [8]:
i = np.random.choice(r)

In [9]:
t = '{}({}, {})'.format(i, np.random.choice(e), np.random.choice(e))

In [30]:
t

'S(t, d)'

In [53]:
def check_rules(relation):
    # Apply the symmetry axiom S(a, b) <=> P(b, a) to a string such as 'S(t, d)'.
    # Relies on the fixed layout 'R(x, y)' with single-character entities.
    inverse = {'S': 'P', 'P': 'S'}
    if relation[0] not in inverse:
        return None  # no axiom defined for this relation

    x = list(relation)
    x[0] = inverse[x[0]]
    x[2], x[5] = x[5], x[2]  # swap the two entities
    return ''.join(x)
    
    

In [34]:
check_rules(t)

'P(d, t)'
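
`check_rules` depends on fixed character positions, so it only works for single-letter entities written exactly as `R(x, y)`. Below is a minimal sketch of a more tolerant variant that parses the string with a regular expression. The names `invert_relation`, `RELATION_RE` and `INVERSE` are hypothetical and not part of the notebook; only the `S`/`P` axiom stated above is assumed.

```python
import re

# Matches strings such as 'S(t, d)'; entity names may be longer than one character.
RELATION_RE = re.compile(r"^\s*([A-Z])\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*$")
INVERSE = {"S": "P", "P": "S"}  # axiom: S(a, b) <=> P(b, a)


def invert_relation(relation):
    """Return the symmetric counterpart of `relation`, or None if no axiom applies."""
    match = RELATION_RE.match(relation)
    if match is None:
        return None
    name, first, second = match.groups()
    if name not in INVERSE:
        return None
    return f"{INVERSE[name]}({second}, {first})"


print(invert_relation("S(t, d)"))      # P(d, t)
print(invert_relation("P(ann, bob)"))  # S(bob, ann)
```

Applying the function twice returns the original string, which is a cheap way to check that the axiom is encoded symmetrically.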

In [37]:
e = list(e)

Now let's try to generate relations from a random subset of k entities (the function below produces `k // 2` of them):

In [49]:
def generate_trees(k):
    sub_e = np.asarray(random.sample(e, k))
    
    inst = []
    for _ in range(k // 2):
        t = '{}({}, {})'.format(np.random.choice(r), np.random.choice(sub_e), np.random.choice(sub_e))
        inst.append(t)
        
    return inst

        
    

In [50]:
generate_trees(4)

['P(l, p)', 'S(p, l)']

In [54]:
for i in generate_trees(6):
    print(i)
    f = check_rules(i)
    print(f)

S(y, y)
P(y, y)
P(w, v)
S(v, w)
P(v, s)
S(s, v)
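
The output above contains `S(y, y)` because `generate_trees` draws the two arguments independently, so an entity can end up related to itself. A minimal sketch of a variant that always draws two distinct entities is shown here, assuming self-relations are unwanted. The function name `generate_distinct_relations` and its `rng` parameter are hypothetical.

```python
import random
import string

RELATIONS = ["P", "S"]
ENTITIES = list(string.ascii_lowercase)


def generate_distinct_relations(k, rng=random):
    """Generate k // 2 relations over k sampled entities, never relating an entity to itself."""
    sub_e = rng.sample(ENTITIES, k)
    facts = []
    for _ in range(k // 2):
        first, second = rng.sample(sub_e, 2)  # two different entities
        facts.append(f"{rng.choice(RELATIONS)}({first}, {second})")
    return facts


for fact in generate_distinct_relations(6, random.Random(0)):
    print(fact)
```

Passing a seeded `random.Random` instance makes the draw reproducible; omitting it falls back to the module-level generator.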


To create a consistent knowledge base we use the following approach:

- A subset of all the entities is taken
- The depth of the tree is decided
- The first relationship is generated, defining the first layer of the tree
- From this point on, every further relation is created subject to the constraints imposed by the previous relations





So, suppose that the tree has been created by the first relation $S(a, b)$, meaning that a is the son of b.

It follows that there are two levels (parent and son), and the following relations must account for this. 
So, if the next relation is of the type $B(b, c)$, which means that b is the brother of c, an entity is added on the same hierarchical level as b, and that constrains which relations of type $B()$ are possible between the levels.

So a relation of the type $B(a, d)$ is admissible, but after it we cannot state $B(d, b)$, because the brother relation is only acceptable between a previous entity and a new one, or between entities that share the same branch.
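
The level argument can be made concrete by giving every entity an integer depth and checking that no two facts disagree. This is only a sketch under stated assumptions: `S(a, b)` puts b one level above a, `P(a, b)` puts b one level below a, and `B` keeps both on the same level. It checks levels only, not the condition that brothers share a branch. The names `OFFSET` and `assign_levels` are hypothetical.

```python
from collections import defaultdict, deque

# Depth change from the first argument to the second (depth grows downwards).
OFFSET = {"S": -1, "P": +1, "B": 0}


def assign_levels(facts):
    """Return {entity: depth} if the facts admit consistent levels, else None.

    `facts` is a list of (relation, first, second) tuples. Each connected
    component is anchored at depth 0 for the first entity encountered.
    """
    graph = defaultdict(list)
    for relation, first, second in facts:
        graph[first].append((second, OFFSET[relation]))
        graph[second].append((first, -OFFSET[relation]))

    depth = {}
    for start in graph:
        if start in depth:
            continue
        depth[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour, delta in graph[node]:
                expected = depth[node] + delta
                if neighbour not in depth:
                    depth[neighbour] = expected
                    queue.append(neighbour)
                elif depth[neighbour] != expected:
                    return None  # two facts disagree about this entity's level
    return depth


admissible = [("S", "a", "b"), ("B", "b", "c"), ("B", "a", "d")]
print(assign_levels(admissible))
print(assign_levels(admissible + [("B", "d", "b")]))  # None: d and b are on different levels
```

With these offsets the example from the text behaves as described: $B(a, d)$ is accepted after $S(a, b)$, while adding $B(d, b)$ makes the assignment fail.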

*Definition of the algorithm*

- Define the number of relations k in the instance I
- Check whether the set of used entities U is empty
- If it is, select a random entity from the set of entities
- Select the second entity (chosen at random either from the used entities, if U has more than 2 elements, or from all entities)
- Check for constraints between the two
- Generate the first relation between the two and add its constraints to C
- Repeat from step 2; this time the first entity is the one used previously, so that the relations form a chain
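
The steps above can be sketched as a single loop. This is one possible reading of the list, not a finished implementation: constraints are reduced to integer depth levels, a reused entity forces the relation implied by the two levels, and the branch condition for brothers is ignored. The names `build_instance`, `OFFSET`, `RELATION_FOR` and `max_attempts` are hypothetical.

```python
import random
import string

# Assumed depth change from the first argument to the second.
OFFSET = {"S": -1, "P": +1, "B": 0}
RELATION_FOR = {delta: name for name, delta in OFFSET.items()}


def build_instance(k, entities, rng=random, max_attempts=1000):
    """Generate up to k chained facts as (relation, first, second) tuples."""
    levels = {}   # C: depth recorded for every used entity (U is its key set)
    facts = []    # I: the instance being built
    head = None   # entity that starts the next relation
    for _ in range(max_attempts):
        if len(facts) == k:
            break
        if head is None:  # U is empty: start from a random entity
            head = rng.choice(entities)
            levels[head] = 0
        fresh = [e for e in entities if e not in levels]
        reusable = [e for e in levels
                    if e != head and abs(levels[e] - levels[head]) <= 1]
        if reusable and len(levels) > 2 and (not fresh or rng.random() < 0.5):
            # Both entities are already constrained, so the relation is forced.
            second = rng.choice(reusable)
            relation = RELATION_FOR[levels[second] - levels[head]]
        elif fresh:
            # A new entity is unconstrained: pick any relation and record its level.
            second = rng.choice(fresh)
            relation = rng.choice(list(OFFSET))
            levels[second] = levels[head] + OFFSET[relation]
        else:
            break  # nothing admissible is left
        fact = (relation, head, second)
        if fact not in facts:
            facts.append(fact)
        head = second  # chain: the next relation starts where this one ended
    return facts, levels


facts, levels = build_instance(3, list(string.ascii_lowercase), random.Random(0))
print(facts)
```

Keeping the constraints in a plain dictionary makes every generated fact checkable afterwards: the depth difference between its two entities must equal the offset of its relation.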


When the size of I reaches k, we need to select a hypothesis to check: 


Another idea: select two entities and define their distance in the concept space (number of instances?).
So, let's say k=3, with entities a and b: I can connect them with the minimal number of relations, knowing that their distance should be 2 (k-1). Thus I start from a, which cannot be connected directly to b, as that would give dist = 1.

a will be in a random relation, for example P(a, x), so a is the parent of x. Now I know that from here there must be another connection between x and b to get the final relation: it cannot be on the same level, so it can be B(x, b). Now we have the connections (we can make them redundant by using the check formula). So now we must create the H, which would be 


Define in advance the following for the algorithm to work: the final relation to check and the steps needed to obtain it. 
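
The distance idea can be checked mechanically by treating every relation as an undirected edge and counting edges on the shortest path. This is a sketch assuming that "distance" means the number of relations needed to link the two entities; `relation_distance` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict, deque


def relation_distance(facts, source, target):
    """Number of facts on the shortest path linking source to target, or None."""
    neighbours = defaultdict(set)
    for _, first, second in facts:
        neighbours[first].add(second)
        neighbours[second].add(first)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for other in neighbours[node]:
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return None  # the two entities are not connected


facts = [("P", "a", "x"), ("B", "x", "b")]
print(relation_distance(facts, "a", "b"))  # 2, i.e. k - 1 for k = 3
```

A generator could use such a check to reject instances in which noise or redundant facts accidentally shorten the path between the two task entities.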


Task: given a relation, say whether it is true or not. What can the model use?
    
    
    

In [3]:
def generate_task():
    t = '{}({}, {})'.format(np.random.choice(r), np.random.choice(e), np.random.choice(e))
    return t

In [4]:
# Insert `elements` into `arr` in place, starting at index 1 and keeping their order


def add_elements(arr, elements):
    position = 1
    for element in elements:
        arr.insert(position, element)
        position += 1
    return arr



In [7]:
def generate_inst(k):
    task = generate_task()
    # Entities for the main chain: start with the two entities of the task
    l = [task[2], task[5]]

    # Sample k/2 intermediate entities
    # TODO: control this step to avoid redundancy between entities
    ent = np.random.choice(e, int(k/2))

    # Place the intermediate entities between the head and the tail of the chain
    add_elements(l, ent)

    # TODO: iterate over the entities: start from the head and the second entity,
    #       create the first relation, add it, and stop at the last entity
    # TODO: keep the task constraint in mind while building the data (track the depth)
    # TODO: generate noise

    print(task)
    return l

In [10]:
generate_inst(6)

P(r, r)


['r', 'v', 'y', 'f', 'r']
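
The comment in `generate_inst` notes that repeated entities still need to be controlled, and the output above shows the task `P(r, r)` using the same entity twice. Below is a minimal sketch of the intermediate-entity step that samples without replacement and excludes the two task entities. The helper name `pick_chain_entities` and its parameters are hypothetical; choosing a distinct head and tail is left to the caller.

```python
import random
import string

ENTITIES = list(string.ascii_lowercase)


def pick_chain_entities(head, tail, n_between, rng=random):
    """Return [head, <n_between distinct intermediates>, tail] with no repeats inside."""
    pool = [e for e in ENTITIES if e not in {head, tail}]
    between = rng.sample(pool, n_between)  # sampling without replacement
    return [head, *between, tail]


print(pick_chain_entities("r", "t", 3, random.Random(0)))
```

`random.sample` raises `ValueError` if `n_between` exceeds the pool size, which is a reasonable failure mode for a chain that cannot be built.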

### Try new method:

In [5]:
import random
import string

def generate_chain(task_relation, entity1, entity2, k):
    entities = list(string.ascii_lowercase)
    chain = []
    current_entity = entity1
    remaining_entities = [e for e in entities if e not in {entity1, entity2}]

    # Define possible next relations based on the current relation
    relation_followup = {
        'P': ['P', 'B'],
        'S': ['S', 'B'],
        'B': ['P', 'S', 'B']
    }

    # Generate the chain of relations
    for _ in range(k - 1):
        if not remaining_entities:
            break  # Prevent empty list operations
        next_entity = random.choice(remaining_entities)
        remaining_entities.remove(next_entity)
        next_relation = random.choice(relation_followup[task_relation])
        chain.append((next_relation, current_entity, next_entity))
        current_entity = next_entity
        task_relation = next_relation

    # Final link in the chain (uses the last relation drawn, which may differ from the task relation)
    chain.append((task_relation, current_entity, entity2))
    return chain

def add_noise(chain, entities, noise_count):
    noise = []
    used_entities = {e for _, e1, e2 in chain for e in {e1, e2}}
    available_entities = [e for e in entities if e not in used_entities]

    while noise_count > 0 and len(available_entities) >= 2:
        e1, e2 = random.sample(available_entities, 2)
        relation = random.choice(['P', 'S', 'B'])
        noise.append((relation, e1, e2))
        available_entities.remove(e1)
        available_entities.remove(e2)
        noise_count -= 1

    return noise

def generate_redundant_info(direct_chain):
    redundant_chain = []
    for relation, e1, e2 in direct_chain:
        redundant_chain.append((relation, e1, e2))
        if relation == 'P':
            redundant_chain.append(('S', e2, e1))
        elif relation == 'S':
            redundant_chain.append(('P', e2, e1))
        elif relation == 'B':
            redundant_chain.append(('B', e2, e1))
    return redundant_chain

def generate_binary_vector(direct_chain, combined_chain):
    return [1 if item in direct_chain else 0 for item in combined_chain]

def generate_task_and_chain():
    entities = list(string.ascii_lowercase)
    relations = ['P', 'S', 'B']
    k = random.randint(3, 5)
    noise_count = random.randint(2, 4)
    entity1, entity2 = random.sample(entities, 2)
    task_relation = random.choice(relations)

    # Generate the direct chain and redundant chain
    direct_chain = generate_chain(task_relation, entity1, entity2, k)
    redundant_chain = generate_redundant_info(direct_chain)
    # Generate noise and combine with direct and redundant chains
    noise = add_noise(direct_chain, entities, noise_count)
    combined_direct_chain = direct_chain + noise
    combined_redundant_chain = redundant_chain + noise
    random.shuffle(combined_direct_chain)
    random.shuffle(combined_redundant_chain)
    # Generate binary vectors
    binary_vector_direct = generate_binary_vector(direct_chain, combined_direct_chain)
    binary_vector_redundant = generate_binary_vector(redundant_chain, combined_redundant_chain)

    # Format output
    direct_chain_repr = " | ".join(f"{r}({a}, {b})" for r, a, b in combined_direct_chain)
    redundant_chain_repr = " | ".join(f"{r}({a}, {b})" for r, a, b in combined_redundant_chain)
    print("Task:", f"{task_relation}({entity1}, {entity2})")
    print("Direct Chain and Task:", direct_chain_repr)
    print("Direct Binary Vector:", ''.join(map(str, binary_vector_direct)))
    print("Redundant Chain and Task:", redundant_chain_repr)
    print("Redundant Binary Vector:", ''.join(map(str, binary_vector_redundant)))

# Example usage
generate_task_and_chain()


Task: P(b, i)
Direct Chain and Task: B(h, f) | B(c, x) | B(b, g) | P(n, p) | S(l, s) | P(q, o) | B(g, c) | B(x, i)
Direct Binary Vector: 01100011
Redundant Chain and Task: P(n, p) | B(g, c) | B(c, g) | B(b, g) | B(i, x) | S(l, s) | B(x, c) | B(x, i) | B(h, f) | B(c, x) | P(q, o) | B(g, b)
Redundant Binary Vector: 011110110101


In [12]:
import random
import string

def generate_chain(task_relation, entity1, entity2, k):
    entities = list(string.ascii_lowercase)
    chain = []
    current_entity = entity1
    remaining_entities = [e for e in entities if e not in {entity1, entity2}]

    # Generate the chain of relations
    for _ in range(k - 1):
        if not remaining_entities:
            break
        next_entity = random.choice(remaining_entities)
        remaining_entities.remove(next_entity)
        chain.append((task_relation, current_entity, next_entity))
        current_entity = next_entity

    # Final link in the chain
    chain.append((task_relation, current_entity, entity2))
    return chain

def add_noise(chain, entities, noise_count):
    noise = []
    used_entities = {e for _, e1, e2 in chain for e in {e1, e2}}
    available_entities = [e for e in entities if e not in used_entities]

    while noise_count > 0 and len(available_entities) >= 2:
        e1, e2 = random.sample(available_entities, 2)
        relation = random.choice(['P', 'S', 'B'])
        noise.append((relation, e1, e2))
        available_entities.remove(e1)
        available_entities.remove(e2)
        noise_count -= 1

    return noise

def generate_redundant_info(direct_chain):
    redundant_chain = []
    for relation, e1, e2 in direct_chain:
        redundant_chain.append((relation, e1, e2))
        if relation == 'P':
            redundant_chain.append(('S', e2, e1))
        elif relation == 'S':
            redundant_chain.append(('P', e2, e1))
        elif relation == 'B':
            redundant_chain.append(('B', e2, e1))
    return redundant_chain

def generate_binary_vector(direct_chain, combined_chain):
    direct_set = set(direct_chain)
    return [1 if item in direct_set else 0 for item in combined_chain]

def generate_task_and_chain():
    entities = list(string.ascii_lowercase)
    relations = ['P', 'S', 'B']
    k = random.randint(3, 5)
    noise_count = random.randint(2, 4)
    entity1, entity2 = random.sample(entities, 2)
    task_relation = random.choice(relations)

    direct_chain = generate_chain(task_relation, entity1, entity2, k)
    redundant_chain = generate_redundant_info(direct_chain)
    noise = add_noise(direct_chain, entities, noise_count)
    
    combined_direct_chain = direct_chain + noise
    random.shuffle(combined_direct_chain)
    binary_vector_direct = generate_binary_vector(direct_chain, combined_direct_chain)
    
    combined_redundant_chain = redundant_chain + noise
    random.shuffle(combined_redundant_chain)
    binary_vector_redundant = generate_binary_vector(direct_chain, combined_redundant_chain)  # Use direct chain to mark

    print("Task:", f"{task_relation}({entity1}, {entity2})")
    print("Direct Chain and Task:", " | ".join(f"{r}({a}, {b})" for r, a, b in combined_direct_chain))
    print("Direct Binary Vector:", ''.join(map(str, binary_vector_direct)))
    print("Redundant Chain and Task:", " | ".join(f"{r}({a}, {b})" for r, a, b in combined_redundant_chain))
    print("Redundant Binary Vector:", ''.join(map(str, binary_vector_redundant)))

generate_task_and_chain()


Task: P(z, g)
Direct Chain and Task: P(w, b) | P(p, v) | S(q, s) | B(i, j) | P(z, t) | P(b, g) | P(t, w)
Direct Binary Vector: 1000111
Redundant Chain and Task: S(q, s) | P(t, w) | B(i, j) | P(p, v) | P(w, b) | S(w, t) | S(t, z) | P(b, g) | S(g, b) | S(b, w) | P(z, t)
Redundant Binary Vector: 01001001001


In [14]:
import random
import string

def generate_chain(task_relation, entity1, entity2, k):
    entities = list(string.ascii_lowercase)
    chain = []
    current_entity = entity1
    remaining_entities = [e for e in entities if e not in {entity1, entity2}]

    # Generate the chain of relations
    for _ in range(k - 1):
        if not remaining_entities:
            break
        next_entity = random.choice(remaining_entities)
        remaining_entities.remove(next_entity)
        chain.append(f"{task_relation}({current_entity}, {next_entity})")
        current_entity = next_entity

    # Final link in the chain
    chain.append(f"{task_relation}({current_entity}, {entity2})")
    return chain

def add_noise(chain, entities, noise_count):
    noise = []
    used_entities = {e for rel in chain for e in rel[rel.find('(')+1:rel.find(')')].split(', ')}
    available_entities = [e for e in entities if e not in used_entities]

    while noise_count > 0 and len(available_entities) >= 2:
        e1, e2 = random.sample(available_entities, 2)
        relation = random.choice(['P', 'S', 'B'])
        noise.append(f"{relation}({e1}, {e2})")
        available_entities.remove(e1)
        available_entities.remove(e2)
        noise_count -= 1

    return noise

def generate_redundant_info(direct_chain):
    redundant_chain = []
    for rel in direct_chain:
        rel_type, entities = rel[0], rel[2:-1].split(',')
        e1, e2 = entities[0].strip(), entities[1].strip()
        redundant_chain.append(rel)
        if rel_type == 'P':
            redundant_chain.append(f"S({e2}, {e1})")
        elif rel_type == 'S':
            redundant_chain.append(f"P({e2}, {e1})")
        elif rel_type == 'B':
            redundant_chain.append(f"B({e2}, {e1})")
    return redundant_chain

def generate_binary_vector(direct_chain, combined_chain):
    direct_set = set(direct_chain)
    return [1 if item in direct_set else 0 for item in combined_chain]

def generate_datasets(k):
    entities = list(string.ascii_lowercase)
    relations = ['P', 'S', 'B']
    direct_data = []
    redundant_data = []

    for _ in range(k):
        entity1, entity2 = random.sample(entities, 2)
        task_relation = random.choice(relations)
        k_chain = random.randint(3, 5)
        noise_count = random.randint(2, 4)

        direct_chain = generate_chain(task_relation, entity1, entity2, k_chain)
        redundant_chain = generate_redundant_info(direct_chain)
        noise = add_noise(direct_chain, entities, noise_count)

        combined_direct_chain = direct_chain + noise
        random.shuffle(combined_direct_chain)
        binary_vector_direct = generate_binary_vector(direct_chain, combined_direct_chain)

        combined_redundant_chain = redundant_chain + noise
        random.shuffle(combined_redundant_chain)
        binary_vector_redundant = generate_binary_vector(redundant_chain, combined_redundant_chain)

        task_str = f"{task_relation}({entity1}, {entity2})"
        direct_data.append({
            'input': f"[{task_str}] " + " | ".join(combined_direct_chain),
            'output': binary_vector_direct
        })

        redundant_data.append({
            'input': f"[{task_str}] " + " | ".join(combined_redundant_chain),
            'output': binary_vector_redundant
        })

    return direct_data, redundant_data

# Generate datasets
k = 10  # Number of instances to generate
direct_data, redundant_data = generate_datasets(k)

# Example: Printing the first instance of each type for verification
print("Direct Dataset Example:")
print(direct_data[0])
print("\nRedundant Dataset Example:")
print(redundant_data[0])


Direct Dataset Example:
{'input': '[B(f, n)] B(a, n) | B(f, w) | B(w, a) | P(y, z) | B(l, u) | S(b, m)', 'output': [1, 1, 1, 0, 0, 0]}

Redundant Dataset Example:
{'input': '[B(f, n)] B(a, n) | B(a, w) | S(b, m) | B(w, a) | B(n, a) | B(w, f) | B(f, w) | P(y, z) | B(l, u)', 'output': [1, 1, 0, 1, 1, 1, 1, 0, 0]}


In [23]:
import random
import string

def generate_chain(task_relation, entity1, entity2, k):
    entities = list(string.ascii_lowercase)
    chain = []
    current_entity = entity1
    remaining_entities = [e for e in entities if e not in {entity1, entity2}]

    for _ in range(k - 1):
        if not remaining_entities:
            break
        next_entity = random.choice(remaining_entities)
        remaining_entities.remove(next_entity)
        chain.append(f"{task_relation}({current_entity}, {next_entity})")
        current_entity = next_entity

    chain.append(f"{task_relation}({current_entity}, {entity2})")
    return chain

def add_noise(chain, entities, noise_count):
    noise = []
    used_entities = {e for rel in chain for e in rel[rel.find('(')+1:rel.find(')')].split(', ')}
    available_entities = [e for e in entities if e not in used_entities]

    while noise_count > 0 and len(available_entities) >= 2:
        e1, e2 = random.sample(available_entities, 2)
        relation = random.choice(['P', 'S', 'B'])
        noise.append(f"{relation}({e1}, {e2})")
        available_entities.remove(e1)
        available_entities.remove(e2)
        noise_count -= 1

    return noise

def generate_redundant_info(direct_chain):
    redundant_chain = []
    for rel in direct_chain:
        relation, pair = rel.split('(')
        e1, e2 = pair[:-1].split(', ')
        redundant_chain.append(rel)
        if relation == 'P':
            redundant_chain.append(f"S({e2}, {e1})")
        elif relation == 'S':
            redundant_chain.append(f"P({e2}, {e1})")
        elif relation == 'B':
            redundant_chain.append(f"B({e2}, {e1})")
    return redundant_chain

def generate_binary_vector(direct_chain, combined_chain):
    direct_set = set(direct_chain)
    return [1 if item in direct_set else 0 for item in combined_chain]

def generate_datasets(k):
    entities = list(string.ascii_lowercase)
    relations = ['P', 'S', 'B']
    direct_data = []
    redundant_data = []

    for _ in range(k):
        entity1, entity2 = random.sample(entities, 2)
        task_relation = random.choice(relations)
        k_chain = random.randint(3, 5)
        noise_count = random.randint(2, 4)

        direct_chain = generate_chain(task_relation, entity1, entity2, k_chain)
        redundant_chain = generate_redundant_info(direct_chain)
        noise = add_noise(direct_chain, entities, noise_count)
        
        combined_direct_chain = direct_chain + noise
        random.shuffle(combined_direct_chain)
        binary_vector_direct = generate_binary_vector(direct_chain, combined_direct_chain)
        
        combined_redundant_chain = redundant_chain + noise
        random.shuffle(combined_redundant_chain)
        binary_vector_redundant = generate_binary_vector(direct_chain, combined_redundant_chain)

        task_str = f"{task_relation}({entity1}, {entity2})"
        direct_data.append({
            'input': f"[{task_str}] " + " | ".join(combined_direct_chain),
            'output': binary_vector_direct
        })

        redundant_data.append({
            'input': f"[{task_str}] " + " | ".join(combined_redundant_chain),
            'output': binary_vector_redundant
        })

    return direct_data, redundant_data

# Generate datasets
k = 10000  # Number of instances to generate
direct_data, redundant_data = generate_datasets(k)

# Example: Printing the first instance of each type for verification
print("Direct Dataset Example:")
print(direct_data[0])
print("\nRedundant Dataset Example:")
print(redundant_data[0])


Direct Dataset Example:
{'input': '[B(p, g)] P(q, t) | B(y, j) | P(e, b) | P(w, s) | B(r, n) | B(u, g) | B(n, u) | B(p, r)', 'output': [0, 0, 0, 0, 1, 1, 1, 1]}

Redundant Dataset Example:
{'input': '[B(p, g)] B(u, n) | P(w, s) | B(n, r) | B(n, u) | B(r, n) | P(e, b) | B(u, g) | B(p, r) | B(r, p) | B(y, j) | P(q, t) | B(g, u)', 'output': [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0]}
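
A quick sanity check on generated instances is to parse the `input` string back into facts and confirm that the facts labelled 1 really link the two task entities. This sketch assumes the labelling of the last cell, where only direct-chain facts are marked, and the exact `[task] fact | fact | ...` layout shown above. `FACT_RE`, `parse_instance` and `labelled_facts_link_task` are hypothetical names.

```python
import re

FACT_RE = re.compile(r"([A-Z])\((\w+), (\w+)\)")


def parse_instance(instance):
    """Split an instance into its task, its facts, and its labels."""
    task_part, facts_part = instance["input"].split("] ", 1)
    task = FACT_RE.fullmatch(task_part.lstrip("[")).groups()
    facts = [FACT_RE.fullmatch(f).groups() for f in facts_part.split(" | ")]
    return task, facts, instance["output"]


def labelled_facts_link_task(instance):
    """True if the facts labelled 1 form a path from the task's first entity to its second."""
    (_, start, goal), facts, labels = parse_instance(instance)
    if len(facts) != len(labels):
        return False
    step = {first: second for (_, first, second), y in zip(facts, labels) if y == 1}
    current, seen = start, set()
    while current in step and current not in seen:
        seen.add(current)
        current = step[current]
    # Every labelled fact must have been used, and the walk must end at the goal.
    return current == goal and len(seen) == len(step)


example = {
    "input": "[B(p, g)] P(q, t) | B(y, j) | P(e, b) | P(w, s) | B(r, n) | B(u, g) | B(n, u) | B(p, r)",
    "output": [0, 0, 0, 0, 1, 1, 1, 1],
}
print(labelled_facts_link_task(example))  # True
```

Running the check over all of `direct_data` and `redundant_data` before saving would catch mismatches between the shuffled facts and their labels.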


In [24]:
import pandas as pd

def convert_to_dataframe(data):
    # Extract 'input' and 'output' from the data to form a DataFrame
    df = pd.DataFrame(data)
    return df

def save_to_csv(df, filename):
    # Save the DataFrame to a CSV file
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

# Convert data to DataFrame
direct_df = convert_to_dataframe(direct_data)
redundant_df = convert_to_dataframe(redundant_data)

# Optionally, save to CSV files
#save_to_csv(direct_df, 'direct_data.csv')
#save_to_csv(redundant_df, 'redundant_data.csv')

# The DataFrames can now be used to fine-tune a language model.
# Here, 'direct_df' and 'redundant_df' are ready for use.


In [21]:
direct_df.head()

Unnamed: 0,input,output
0,"[S(l, p)] S(y, f) | P(a, c) | S(f, p) | S(l, v...","[1, 0, 1, 1, 0, 1]"
1,"[S(y, p)] S(n, p) | S(x, c) | S(r, v) | B(b, q...","[1, 0, 1, 0, 0, 1, 1, 1]"
2,"[S(w, q)] S(w, g) | P(d, y) | S(r, q) | S(g, r...","[1, 0, 1, 1, 0]"
3,"[B(l, n)] B(y, k) | S(w, x) | B(s, v) | B(v, n...","[0, 0, 1, 1, 1, 1]"
4,"[B(g, o)] P(e, f) | B(g, t) | S(v, r) | B(j, o...","[0, 1, 0, 1, 1]"


In [22]:
redundant_df.head()

Unnamed: 0,input,output
0,"[S(l, p)] S(v, y) | P(p, f) | P(a, c) | S(l, v...","[1, 0, 0, 1, 0, 1, 0, 0, 1, 0]"
1,"[S(y, p)] S(r, v) | S(x, c) | P(v, r) | P(r, i...","[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]"
2,"[S(w, q)] P(q, r) | P(g, w) | S(f, u) | S(w, g...","[0, 0, 0, 1, 1, 0, 1, 0]"
3,"[B(l, n)] B(y, k) | B(s, v) | B(n, v) | S(w, x...","[0, 1, 0, 0, 1, 0, 1, 0, 1, 0]"
4,"[B(g, o)] B(o, j) | P(e, f) | B(j, t) | B(t, g...","[0, 0, 0, 0, 0, 1, 1, 1]"


In [25]:
save_to_csv(direct_df, 'direct_data.csv')
save_to_csv(redundant_df, 'redundant_data.csv')

Data saved to direct_data.csv
Data saved to redundant_data.csv
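
One thing to keep in mind when reloading these files: a CSV has no list type, so the `output` column comes back as a string such as `"[1, 0, 1]"` and has to be parsed again. The sketch below shows the round trip with the standard `csv` module and an in-memory buffer so that it runs without the generated files; the single row is a shortened, illustrative instance. With pandas, the same parsing function could be passed through the `converters` argument of `read_csv`.

```python
import ast
import csv
import io

# Shortened, illustrative row in the same shape as the generated data.
rows = [{"input": "[B(f, n)] B(a, n) | B(f, w)", "output": [1, 1]}]

# Write to an in-memory CSV; the list is stored as its string form "[1, 1]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["input", "output"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is a string until it is parsed explicitly.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["output"]))  # <class 'str'>

labels = ast.literal_eval(loaded[0]["output"])
print(labels)  # [1, 1]
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for turning the stored text back into a list.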
