The aim of this project is to investigate how seemingly random sequences produced by humans or LLMs can be differentiated from truly random, machine-generated ones. Consider, for instance, a sequence of 1s and 0s, each with an equal chance of occurring. While humans can accurately balance the frequency of each symbol, they tend to avoid long runs of the same symbol and show other biases that set their sequences apart. This project attempts to find a method by which this difference can be accurately described and quantified.
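One of the biases mentioned above, the avoidance of long runs, can be measured directly. The following is only an illustrative sketch: `longest_run` is a hypothetical helper that is not part of the project code, and the example string is made up for demonstration.

```python
import random
from itertools import groupby

def longest_run(sequence):
    """Return the length of the longest run of identical consecutive symbols."""
    return max((len(list(group)) for _, group in groupby(sequence)), default=0)

# Made-up sequence in a "human" style: it alternates often and never repeats a symbol more than twice
human_like = "0101101001011010010110100101"

# Pseudo-random reference sequence of the same length (seeded so the run is repeatable)
rng = random.Random(0)
reference = "".join(rng.choice("01") for _ in range(len(human_like)))

print(longest_run(human_like), longest_run(reference))
```

For a fair binary sequence of length n, the longest run is typically on the order of log2(n), so a markedly shorter longest run hints at a non-random source.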

In [71]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import numpy as np
import csv

Sources for the data:
The Randomly_generated.dat file was produced by myself using the numpy.random library.
The Gpt_generated.dat file was made from the output of ChatGPT o mini when it was asked to produce a random sequence of 0s and 1s.
Other strings of random symbols below were made by either ChatGPT or myself, as explained in each instance.

Hypothesis:
By examining the information contained in each string, we can accurately describe its source.
The methods used to do this include examining the entropy of a string and implementing a neural network.
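As a rough illustration of what "entropy of a string" means here, the sketch below computes the empirical Shannon entropy of the single-symbol frequencies. The helper name `shannon_entropy` is an assumption made for this example and does not appear in the project code.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Empirical Shannon entropy of the symbol frequencies, in bits per symbol."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy("ABABABAB"))  # perfectly balanced symbols
print(shannon_entropy("AAAAAAAB"))  # heavily skewed symbols
```

Note that a strictly alternating string such as "ABABABAB" reaches the maximum of 1 bit per symbol even though it is clearly not random, which is why the analysis that follows looks at subsequences of several symbols rather than single symbols.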

Analysis: Entropy based method

In [72]:
def generate_sequence(length, symbols, *probabilities):
    """
    Generates a string of a given length, where each symbol has the given probability of occurring at any position.
    Note: for n symbols, only n-1 probabilities need to be provided; the last symbol receives the remainder.
    returns: string
    This function will provide the standard for perfectly random generation
    """
    throws = np.random.rand(length)
    sequence = ""
    for i, entry in enumerate(throws):
        prob = 0
        for j, prob_ in enumerate(probabilities):
            prob += prob_
            if entry<prob:
                sequence+=symbols[j]
                break
        else:
            sequence += symbols[-1]
    return sequence
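For comparison, the same kind of sampling can be sketched with the standard library alone. This is an alternative under stated assumptions, not the function used in the rest of the notebook: `generate_sequence_weighted` is a hypothetical name, and unlike `generate_sequence` it expects one weight per symbol.

```python
import random

def generate_sequence_weighted(length, symbols, weights, seed=None):
    """Draw `length` symbols independently, each chosen according to `weights`."""
    rng = random.Random(seed)  # a seed makes the output repeatable
    return "".join(rng.choices(symbols, weights=weights, k=length))

# Fair two-symbol sequence, analogous to generate_sequence(20, ["A", "B"], 0.5)
sample = generate_sequence_weighted(20, ["A", "B"], [0.5, 0.5], seed=1)
print(sample)
```

Passing all n weights explicitly avoids the implicit "remaining probability" for the last symbol, at the cost of a slightly longer call.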

In [73]:
def calculate_probability(key, symbols, *probabilities):
    """
    Calculates the probability of a given string (key) occurring if it was generated according to the probabilities provided.
    Note: for n symbols, only n-1 probabilities need to be provided
    returns: float
    """
    prob = 1
    last_prob = 1- sum(probabilities)
    for i in key:
        location_index = symbols.index(i)
        if location_index == len(probabilities):
            prob*=last_prob
        else:
            prob*=probabilities[location_index]
    return prob


def convert_to_base(num, symbols):
    """
    Converts a given number (num) to a base composed of the symbols provided.
    For example, if there are 6 symbols, the number will be represented in base 6, with 0 being the first symbol etc.
    returns: string
    """
    if num == 0:
        return symbols[0]
    
    result = ""
    while num > 0:
        result = symbols[num % len(symbols)] + result
        num //= len(symbols)
    
    return result
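The probability of a key under independent draws is simply the product of the per-symbol probabilities. A minimal worked sketch follows; `sequence_probability` and the two example distributions are hypothetical and only serve to illustrate the arithmetic.

```python
import math

def sequence_probability(key, probability_of):
    """Probability of `key` when each symbol is drawn independently."""
    return math.prod(probability_of[symbol] for symbol in key)

fair = {"A": 0.5, "B": 0.5}
biased = {"A": 0.9, "B": 0.1}

print(sequence_probability("AAB", fair))    # 0.5 * 0.5 * 0.5
print(sequence_probability("AAB", biased))  # 0.9 * 0.9 * 0.1
```

Under a fair two-symbol source every key of length k has the same probability, 1/2^k, so the ideal distribution over subsequences is uniform.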

In [74]:
def occurance_chain(sequence, glyphs, symbols):
    """
    Creates a dictionary whose keys are all possible subsequences of length glyphs (int) that can be formed with the symbols.
    The value of each key is the fraction of windows of the sequence in which that subsequence occurs.
    returns: dict 
    """
    dictionary = dict()
    for i in range(len(symbols)**glyphs):
        key  = convert_to_base(i, symbols)
        key = key.rjust(glyphs, symbols[0])
        dictionary.update({key:0})

    for i in range(len(sequence)-glyphs+1):
        dictionary[sequence[i:glyphs+i]]+=1

    for key in dictionary.keys():
        dictionary[key]/=(len(sequence)-glyphs+1)

    return dictionary
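The same sliding-window count can be sketched with `collections.Counter` and `itertools.product`. This is an equivalent illustration rather than the notebook's implementation; `kgram_frequencies` is a hypothetical name, and the sketch assumes the sequence is at least `k` symbols long.

```python
from collections import Counter
from itertools import product

def kgram_frequencies(sequence, k, symbols):
    """Relative frequency of every possible length-k subsequence (overlapping windows)."""
    windows = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(windows))
    # product() enumerates every possible key, so unseen keys appear with frequency 0
    return {"".join(key): counts["".join(key)] / windows
            for key in product(symbols, repeat=k)}

freqs = kgram_frequencies("AABAB", 2, ["A", "B"])
print(freqs)
```

In "AABAB" the four windows are AA, AB, BA and AB, so "AB" accounts for half of them and "BB" never occurs.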

In [75]:
def cross_entropy_loss(dictionary, symbols, *probabilities):
    """
    Calculates the cross entropy between the ideal distribution of subsequences implied by the probabilities provided
    and the distribution derived from the sequence, provided in the form of a dictionary created by occurance_chain
    returns: float
    """
    entropy = 0
    for key in dictionary.keys():
        entropy -= np.log2(dictionary[key])*calculate_probability(key, symbols, *probabilities)
    return entropy
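When a subsequence never occurs in the string, its observed frequency is 0 and the logarithm diverges. One possible way to keep the statistic finite is to put a small floor on the observed frequencies. The sketch below shows that idea only; the floor `epsilon` and the name `smoothed_cross_entropy` are assumptions, not part of the method used in this notebook, and the floor changes the value of the statistic.

```python
import math

def smoothed_cross_entropy(observed, ideal, epsilon=1e-9):
    """Cross entropy H(ideal, observed) in bits, with a floor on observed frequencies."""
    return -sum(p * math.log2(max(observed.get(key, 0.0), epsilon))
                for key, p in ideal.items())

ideal = {"AA": 0.25, "AB": 0.25, "BA": 0.25, "BB": 0.25}
matching = dict(ideal)                                      # observed equals ideal
skewed = {"AA": 0.25, "AB": 0.5, "BA": 0.25, "BB": 0.0}     # "BB" never observed

print(smoothed_cross_entropy(matching, ideal))
print(smoothed_cross_entropy(skewed, ideal))
```

The cross entropy is smallest when the observed distribution matches the ideal one, which is why larger values indicate a stronger departure from randomness.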

In [76]:
def p_for_n_glyphs(sequence, glyphs, symbols, *probabilities):
    """
    Estimates the p value: the probability, under the null hypothesis that the string is truly random,
    of observing a cross entropy at least as large as that of the given sequence.
    This is done by measuring the cross entropy of the given sequence, then randomly generating many others of equal length and measuring theirs.
    The p value is then the fraction of those randomly generated cross entropies that are equal to or greater than that of the sequence provided.
    Note that a uniform prior is assumed.
    returns: float
    """
    cross_entropy = cross_entropy_loss(occurance_chain(sequence, glyphs, symbols), symbols, *probabilities)
    # the lower bound determines how many random strings will be generated, thus limiting the resolution with which p is calculated
    # hence, a p value of 0 can only be interpreted as something less than the lower bound
    lower_bound = 0.0001
    number_of_tries = int(1/lower_bound)
    number_of_rand = 0
    for _ in range(number_of_tries):
        rand_sequence = generate_sequence(len(sequence), symbols, *probabilities)
        rand_cross_entropy = cross_entropy_loss(occurance_chain(rand_sequence, glyphs, symbols), symbols, *probabilities)
        if rand_cross_entropy >= cross_entropy:
            number_of_rand += 1
    return number_of_rand/number_of_tries
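The Monte Carlo estimate above can be written generically for any test statistic. The sketch below is an illustration with a deliberately simple toy statistic (symbol imbalance) instead of the cross entropy; all names in it are hypothetical. It also uses the common add-one correction, so the estimate is never exactly 0, which differs from the function above.

```python
import random

def monte_carlo_p_value(observed_stat, simulate_stat, trials=1000, seed=0):
    """Fraction of simulated statistics at least as extreme as the observed one."""
    rng = random.Random(seed)
    exceed = sum(simulate_stat(rng) >= observed_stat for _ in range(trials))
    # add-one correction: the observed value counts as one of the samples
    return (exceed + 1) / (trials + 1)

def imbalance(sequence):
    """Toy statistic: absolute difference between the counts of A and B."""
    return abs(sequence.count("A") - sequence.count("B"))

def simulated_imbalance(rng, length=100):
    """Imbalance of one fair random sequence of the given length."""
    return imbalance("".join(rng.choice("AB") for _ in range(length)))

p = monte_carlo_p_value(imbalance("A" * 90 + "B" * 10), simulated_imbalance)
print(p)
```

With the correction, the smallest reportable value is 1/(trials + 1), which makes the resolution limit of the simulation explicit.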

In [90]:
# testing the results so far

# the first test involves sequences of two symbols (A and B)
# this string was generated by ChatGPT
test_sequence_1 = "ABBAAABABABBBABABABBAABBAABBBABABABBAABBABABABABABABAABBBABAABABBBABAABABBBAABABBBABBAAABAAABBAABBABABAAABBABBBAAABBBABABBAABBABABABABA"
print(len(test_sequence_1))

# this string was generated by myself
test_sequence_2 = "ABABABABABBABBABABBABBABABBBAAABABABABAABABABBABABABBABABABBABABBABABAAABABABABABAAAABABABAABABBABABAAABBABBABAAABABAABABAABBABABAAABBB"

# this string was generated by myself, paying special attention in an attempt to
# make it as random as possible in light of what I learned in this project
test_sequence_3 = "ABBAAABABABBBAAABBABAAAAABBAABAABBBBAAABABABAABAAABBBBABABBBABBBBAAAABABABBBABBABABABBBAAABABABABABABABABABBBBBAAAAABBAABABBABABBBBAAAB"

# this string is generated randomly through numpy
test_sequence_4 = generate_sequence(100, ["A", "B"], 0.5)

p = p_for_n_glyphs(test_sequence_1, 2, ["A", "B"], 0.5)
print(f"The p value for the first string is {p}")

p = p_for_n_glyphs(test_sequence_2, 2, ["A", "B"], 0.5)
print(f"The p value for the second string is {p}")

p = p_for_n_glyphs(test_sequence_3, 2, ["A", "B"], 0.5)
print(f"The p value for the third string is {p}")

p = p_for_n_glyphs(test_sequence_4, 2, ["A", "B"], 0.5)
print(f"The p value for the fourth string is {p}")


135
The p value for the first string is 0.0091
The p value for the second string is 0.0
The p value for the third string is 0.3463
The p value for the fourth string is 0.2444


In [91]:
# the preceding test used information about the occurrences of series of two symbols; the same can be done with more:

p = p_for_n_glyphs(test_sequence_1, 3, ["A", "B"], 0.5)
print(f"The p value for the first string is {p}")

p = p_for_n_glyphs(test_sequence_2, 3, ["A", "B"], 0.5)
print(f"The p value for the second string is {p}")

p = p_for_n_glyphs(test_sequence_3, 3, ["A", "B"], 0.5)
print(f"The p value for the third string is {p}")

p = p_for_n_glyphs(test_sequence_4, 3, ["A", "B"], 0.5)
print(f"The p value for the fourth string is {p}")

The p value for the first string is 0.0054
The p value for the second string is 0.0
The p value for the third string is 0.3487


  entropy -= np.log2(dictionary[key])*calculate_probability(key, symbols, *probabilities)


The p value for the fourth string is 0.3596


In [96]:
# for the third string, using a subsequence of length 4 is especially valuable, providing more certainty:

p = p_for_n_glyphs(test_sequence_3, 4, ["A", "B"], 0.5)
print(f"The p value for the third string is {p}")

  entropy -= np.log2(dictionary[key])*calculate_probability(key, symbols, *probabilities)


The p value for the third string is 0.1811


However, it is important to remember that increasing the subsequence length is not always helpful: the number of possible combinations grows as 2^n, and for large n the string becomes too short to build an accurate picture, as seen in the example below

In [97]:
p = p_for_n_glyphs(test_sequence_1, 4, ["A", "B"], 0.5)
print(f"The p value for the first string is {p}")

  entropy -= np.log2(dictionary[key])*calculate_probability(key, symbols, *probabilities)


The p value for the first string is 0.0167
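The loss of resolution for longer subsequences can be estimated by dividing the number of windows in the string by the number of possible subsequences. The helper below is a hypothetical illustration; the length 135 is the length of test_sequence_1 printed earlier.

```python
def expected_count_per_kgram(sequence_length, k, alphabet_size=2):
    """Average number of times each length-k subsequence is expected under uniform randomness."""
    windows = sequence_length - k + 1
    return windows / alphabet_size ** k

for k in range(2, 8):
    print(k, round(expected_count_per_kgram(135, k), 2))
```

For a string of 135 symbols, each subsequence of length 2 is expected about 33 times, but each subsequence of length 4 only about 8 times, so the observed frequencies become increasingly noisy.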


This method is not limited to two symbols, as seen below. However, for accurate results, the more symbols there are, the greater the amount of data needed

In [98]:
# the same but for three symbols instead of two

# this string was generated by myself
test_sequence_5 ="ABCABABCABABCABBACAAABCBABACBABACBABCCAAACABCABABCABABCABBACAAABCBABACBABACBABCCAAAC"

p = p_for_n_glyphs(test_sequence_5, 2, ["A", "B", "C"], 0.33, 0.33)
print(f"The p value is {p}")


  entropy -= np.log2(dictionary[key])*calculate_probability(key, symbols, *probabilities)


The p value is 0.0032


Results for the entropy based method:
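In the tests above, this method separated the numpy-generated string (p between 0.24 and 0.36) from the ChatGPT-generated and casually hand-written strings (p below 0.02 in every test). The string written with deliberate care (test_sequence_3) was not flagged, with p between 0.18 and 0.35.
By calculating the p value associated with each sequence, we can therefore check somewhat reliably whether a sequence is truly random or not.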


Analysis: Neural network approach.
In this section, a simple binary classifier neural network attempts to answer this question. The model will classify a sequence of fixed length 150 as either truly random (0) or not (1)

In [None]:
# loading the data from the two .dat files, and splitting them into a test and train set
X_gpt = []

with open('Machine_Generated.dat', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        X_gpt.append(np.array([float(i) for i in list(row[0])]))

X_gpt = np.array(X_gpt)
Y_gpt = np.ones(X_gpt.shape[0], dtype="float")


X_rand = []

with open('Randomly_generated.dat', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        X_rand.append(np.array([float(i) for i in list(row[0])]))

X_rand = np.array(X_rand)
Y_rand= np.zeros(X_rand.shape[0], dtype="float")

X = np.concatenate((X_rand, X_gpt))
Y = np.concatenate((Y_rand, Y_gpt))

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)



[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]


In [82]:
# Create a simple neural network that classifies a series into either truly random or biased
class BinaryClassifier(nn.Module):
    def __init__(self):
        super(BinaryClassifier, self).__init__()
        self.layer1 = nn.Linear(150, 200)  # Input layer to first hidden layer
        self.layer2 = nn.Linear(200, 40)   # First hidden layer to second hidden layer
        self.layer3 = nn.Linear(40, 1)     # Second hidden layer to output layer
        self.relu = nn.ReLU()           # ReLU activation function
        self.sigmoid = nn.Sigmoid()     # Sigmoid activation function for output between 1 and 0
    
    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.sigmoid(self.layer3(x))
        return x
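To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below only does this arithmetic for the layer sizes given above; `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Layer widths of BinaryClassifier: 150 inputs -> 200 -> 40 -> 1 output
layer_sizes = [150, 200, 40, 1]
print(count_parameters(layer_sizes))  # 30200 + 8040 + 41
```

With tens of thousands of parameters, a model of this size can memorise a small training set, so the test accuracy is the more informative number.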

In [83]:
# Model, optimizer and loss function initialization

model = BinaryClassifier()

loss_func = nn.BCELoss()  # Binary Cross Entropy Loss
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [84]:
# Model training
num_epochs = 200
for epoch in range(num_epochs):
    model.train()
    

    outputs = model(X_train_tensor) # calculates the output with the current state of the model
    loss = loss_func(outputs, y_train_tensor) # calculates the loss with respect to the last output
    

    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Backpropagation to calculate gradients
    optimizer.step()       # Update model weights
    
    # prints the loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')




Epoch [10/200], Loss: 0.4252
Epoch [20/200], Loss: 0.4190
Epoch [30/200], Loss: 0.4127
Epoch [40/200], Loss: 0.4000
Epoch [50/200], Loss: 0.3813
Epoch [60/200], Loss: 0.3543
Epoch [70/200], Loss: 0.3181
Epoch [80/200], Loss: 0.2787
Epoch [90/200], Loss: 0.2429
Epoch [100/200], Loss: 0.2078
Epoch [110/200], Loss: 0.1703
Epoch [120/200], Loss: 0.1282
Epoch [130/200], Loss: 0.0858
Epoch [140/200], Loss: 0.0515
Epoch [150/200], Loss: 0.0285
Epoch [160/200], Loss: 0.0161
Epoch [170/200], Loss: 0.0099
Epoch [180/200], Loss: 0.0067
Epoch [190/200], Loss: 0.0049
Epoch [200/200], Loss: 0.0038


In [None]:
# Model evaluation

model.eval()  #evaluation mode
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = y_pred.round()  # Round the predictions to 0 or 1
    
    # Calculate accuracy
    accuracy = (y_pred_class.eq(y_test_tensor).sum().item()) / y_test_tensor.size(0)
    print(f'Test Accuracy: {accuracy:.4f}')


Test Accuracy: 0.8498


In [113]:
# another metric for the evaluation is the false positive rate, comparable to the p value
# for that, we will perform the test only with truly random sequences

X_rand = []

for _ in range(300):
    X_rand.append(np.array([float(i) for i in list(generate_sequence(150, ["0", "1"], 0.5))]))

X_rand = np.array(X_rand)
Y_rand= np.zeros(X_rand.shape[0], dtype="float")

X_test_tensor = torch.tensor(X_rand, dtype=torch.float32)
y_test_tensor = torch.tensor(Y_rand, dtype=torch.float32).view(-1, 1)

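with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = y_pred.round()  # Round the predictions to 0 or 1
    
    # All labels here are 0 (truly random), so the accuracy is the true negative rate
    # and the false positive rate is its complement
    true_negative_rate = (y_pred_class.eq(y_test_tensor).sum().item()) / y_test_tensor.size(0)
    print(f'True negative rate: {true_negative_rate:.4f}')
    print(f'False positive rate: {1 - true_negative_rate:.4f}')

True negative rate: 0.9333
False positive rate: 0.0667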
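With 300 test sequences, a correct-classification rate of 0.9333 corresponds to 280 correctly classified and 20 misclassified random sequences. Because this rate comes from a finite sample, it carries some uncertainty. The sketch below estimates it with a Wilson score interval; the counts are inferred from the reported rate and the helper name is hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half

# Assumed counts: 20 of 300 truly random sequences classified as not random
low, high = wilson_interval(20, 300)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 4% to 10%, so the misclassification rate on random sequences is only loosely pinned down by 300 samples.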


Results for the neural network method:
The model does detect a difference between the two types of sequences, and achieved a respectable, though not high, final test accuracy of 85%. The false positive rate is about 7%

Conclusion:
We can see that the hypothesis is indeed supported by the results obtained, as both methods managed to differentiate between the two types of data, with varying degrees of accuracy. The entropy based method is more flexible and extendable but 