# Rephrasing Chemistry  

### The Hurst parameter could allow us to predict downstream performance.

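To calculate it: run a language model (LM) over the text. For each position, take the probability the model assigned to the actual next token and convert it to the number of bits needed to encode that token, `-log2(p)`. Use the resulting per-token sequence of bits as the time series from which to estimate the Hurst parameter.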
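
As a minimal sketch of the conversion step, assuming that the "bits required" for a token is its information content `-log2(p)` under the model. The names `probs_to_bits` and `eps` are hypothetical and do not appear in the notebook:

```python
import math


def probs_to_bits(probs, eps=1e-12):
    """Map next-token probabilities to bits: log2(1 / p) = -log2(p).

    eps is an assumed floor that keeps a probability of exactly 0
    from producing an infinite value.
    """
    return [math.log2(1.0 / max(p, eps)) for p in probs]


# A token predicted with p = 0.5 costs 1 bit, p = 0.25 costs 2 bits,
# and a token predicted with certainty costs 0 bits.
bits = probs_to_bits([0.5, 0.25, 1.0])
print(bits)
```

Confidently predicted tokens contribute values near zero and surprising tokens contribute large values, so the series fluctuates with how predictable the text is at each position.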

In [31]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from torch.nn.functional import softmax
from torch.utils.data import DataLoader
from paperDatabase import TextDataset
import json
from tqdm import tqdm

device = "cuda"  # device the input tensors are moved to; device_map="auto" places the model itself
models_path = {"qwen": "Qwen/Qwen1.5-7B-Chat", "mistral-inst": "mistralai/Mistral-7B-Instruct-v0.1", "zephyr": "HuggingFaceH4/zephyr-7b-alpha"}
MODELPATH = models_path["qwen"]

model = AutoModelForCausalLM.from_pretrained(MODELPATH, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODELPATH)
tokenizer.pad_token = tokenizer.eos_token

batch_size = 4
dataset = TextDataset("./papers")
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

results = []

paragraph_text = """Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. Her artistry, songwriting, and entrepreneurship have influenced the music industry and popular culture. A subject of widespread public interest, Swift is an advocate of artists' rights and has had a political impact.

Swift began professional songwriting at age 14. She signed with Big Machine Records in 2005 and achieved prominence as a country pop singer with the albums Taylor Swift (2006) and Fearless (2008). Their singles "Teardrops on My Guitar", "Love Story", and "You Belong with Me" were crossover successes on country and pop radio formats and brought Swift mainstream fame. She experimented with rock and electronic styles on her next albums, Speak Now (2010) and Red (2012), respectively, with the latter featuring her first Billboard Hot 100 number-one single, "We Are Never Ever Getting Back Together". Swift recalibrated her image from country to pop with 1989 (2014), a synth-pop album containing the chart-topping songs "Shake It Off", "Blank Space", and "Bad Blood". Media scrutiny inspired the hip-hop-influenced Reputation (2017) and its number-one single "Look What You Made Me Do".

After signing with Republic Records in 2018, Swift released the eclectic pop album Lover (2019) and the autobiographical documentary Miss Americana (2020). She explored indie folk styles on the 2020 albums Folklore and Evermore, subdued electropop on Midnights (2022), and re-recorded four albums subtitled Taylor's Version after a dispute with Big Machine. These albums spawned the number-one songs "Cruel Summer", "Cardigan", "Willow", "Anti-Hero", "All Too Well", and "Is It Over Now?". Her Eras Tour (2023–2024) and its accompanying concert film became the highest-grossing tour and concert film of all time, respectively. Swift has directed videos and films such as Folklore: The Long Pond Studio Sessions (2020) and All Too Well: The Short Film (2021).

One of the world's best-selling musicians, Swift has sold over 200 million records as of 2019. She is the highest-grossing female touring act, the most-streamed woman on Spotify and Apple Music, and the first billionaire with music as the main source of income. Six of her albums have opened with over one million sales in a week. The 2023 Time Person of the Year, Swift has appeared on lists such as Rolling Stone's 100 Greatest Songwriters of All Time, Billboard's Greatest of All Time Artists, and Forbes' World's 100 Most Powerful Women. Her accolades include 14 Grammy Awards, a Primetime Emmy Award, 40 American Music Awards, 40 Billboard Music Awards, and 23 MTV Video Music Awards; she has won the Grammy Award for Album of the Year, the MTV Video Music Award for Video of the Year, and the IFPI Global Recording Artist of the Year a record four times each.

Taylor Alison Swift was born on December 13, 1989, in West Reading, Pennsylvania. Her father, Scott Kingsley Swift, is a former stockbroker for Merrill Lynch; her mother, Andrea Gardner Swift (née Finlay), worked for a time as a mutual fund marketing executive. Her younger brother, Austin, is an actor. Swift's maternal grandmother, Marjorie Finlay (née Moehlenkamp), was an opera singer, whose singing in church became one of Swift's earliest memories of music that shaped her career. Swift spent her early years on a Christmas tree farm in Pennsylvania that her father had purchased from one of his clients, and she spent her summers at her family's vacation home in Stone Harbor, New Jersey, where she occasionally performed acoustic songs at a local coffee shop. She was raised Christian and attended preschool and kindergarten at a Montessori school run by the Bernardine Sisters of St. Francis before transferring to the Wyndcroft School. When her family moved to Wyomissing, Pennsylvania, she attended Wyomissing Area Junior/Senior High School. As a child, she performed in Berks Youth Theatre Academy productions and traveled regularly to New York City for vocal and acting lessons. Her early love for country music was influenced by Shania Twain, Patsy Cline, LeAnn Rimes, and the Dixie Chicks, and she spent weekends performing at local festivals and events. After watching a documentary about Faith Hill, she became determined to pursue a country-music career in Nashville, Tennessee."""

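def calculate_probability_and_perplexity(input_ids, model):
    """Return the probability assigned to each actual next token, plus the sequence perplexity."""
    with torch.no_grad():
        # Passing labels makes the model also return the mean next-token cross-entropy loss
        outputs = model(input_ids, labels=input_ids)
        logits = outputs.logits
        # Logits at position t predict token t+1, so drop the last logit and the first label
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = input_ids[..., 1:].contiguous()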

        # print(shift_labels)
        # print(tokenizer.decode(shift_labels[0]))
        
        # Softmax to convert logits to probabilities
        probabilities = softmax(shift_logits, dim=-1)
        
        # Gather the probabilities of the actual next tokens
        actual_next_token_probs = torch.gather(probabilities, 2, shift_labels.unsqueeze(-1)).squeeze(-1)
        
        # Perplexity is the exponential of the mean next-token negative log-likelihood (outputs.loss)
        loss = outputs.loss
        perplexity = torch.exp(loss).item()
        
        return actual_next_token_probs, perplexity

results_probs_next_token = []

# for i in range(0, 10):
    # paragraph_text = dataset.__getitem__(i)
    # for paragraph_text in batch:
    # paragraph_text = """Nickel has become a promising platform for developing versatile fluoroalkylation processes, as vital steps significant to chemical bond-forming reactions have been established. These critical steps include reductive eliminations at elevated valence states of nickel,[1, 2, 3, 4] the generation of fluoroalkyl radicals via electron transfer or atom abstraction reactions involving fluoroalkyl electrophiles and nickel,[5, 6, 7, 8, 9, 10, 11, 12] and catalytic transformations employing selected fluoroalkyl groups. However, most previous studies required nickel to have stabilizing ligands, which have the potential to prevent catalysis through redistributions that affect the specificity of the nickel catalyst. Moreover, these additional ligands increase the cost of the reaction and can contribute to air sensitivity. To address this, we sought to develop fluoroalkylation methods with nickel that employ solvent as the only coordinating ligand. These reactions are often referred to as \"ligandless\" conditions, and have been demonstrated in other systems with notable success.[22, 23, 2"""
# The text is truncated to 512 tokens, which yields 511 next-token probabilities
inputs = tokenizer(paragraph_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
input_ids = inputs.input_ids.to(device)

actual_next_token_probs, perplexity = calculate_probability_and_perplexity(input_ids, model)

# Probability the model assigned to the first predicted token (assumes a batch size of 1)
first_token_prob = actual_next_token_probs[:, 0].item()
# Store plain Python floats instead of 0-d tensors that live on the model's device
results_probs_next_token.extend(actual_next_token_probs[0].cpu().tolist())
# print(f"Probability of the actual next token for the first token in the sequence: {actual_next_token_probs}")

results.append({"input_text": paragraph_text, "first_token_prob": first_token_prob, "perplexity": perplexity})

# You can include code here to save results periodically

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.12it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
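
The perplexity returned by the function above is the exponential of the model's mean next-token cross-entropy, so it can in principle be recovered from the per-token probabilities as well. A small sketch with made-up probabilities (`perplexity_from_probs` is a hypothetical helper, not a `transformers` function):

```python
import math


def perplexity_from_probs(probs):
    """exp of the mean negative natural-log likelihood of the observed tokens."""
    nll = [-math.log(p) for p in probs]
    return math.exp(sum(nll) / len(nll))


# Illustrative values only, not model output.
ppl = perplexity_from_probs([0.25, 1.0, 0.5, 0.5])
print(ppl)
```

On real model output the two routes may differ slightly because of floating-point precision in the softmax.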


In [32]:
print(len(results_probs_next_token))

511


In [34]:
import numpy as np
import matplotlib.pyplot as plt
from hurst import compute_Hc, random_walk

# Use random_walk() function or generate a random walk series manually:
# series = random_walk(99999, cumprod=True)

# Note: these are raw probabilities; the method described at the top calls for bits, -log2(p)
series = [x for x in actual_next_token_probs.cpu()[0]]
print(series)
np.random.seed(42)
# random_changes = 1. + np.random.randn(99999) / 1000.
# series = np.cumprod(random_changes)  # create a random walk from random changes

# Evaluate Hurst equation
H, c, data = compute_Hc(series, kind='change', simplified=True)

# Plot
f, ax = plt.subplots()
ax.plot(data[0], c*data[0]**H, color="deepskyblue")
ax.scatter(data[0], data[1], color="purple")
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Time interval')
ax.set_ylabel('R/S ratio')
ax.grid(True)
plt.show()

print("H={:.4f}, c={:.4f}".format(H,c))

[tensor(3.1626e-06), tensor(0.4547), tensor(0.4547), tensor(0.9664), tensor(0.9847), tensor(1.0000), tensor(0.9998), tensor(0.9999), tensor(1.0000), tensor(0.9986), tensor(1.0000), tensor(1.0000), tensor(1.0000), tensor(0.9906), tensor(0.9891), tensor(0.9990), tensor(0.9970), tensor(0.9999), tensor(0.9990), tensor(0.9973), tensor(1.0000), tensor(0.2832), tensor(0.6703), tensor(0.0003), tensor(0.9877), tensor(0.0001), tensor(0.1895), tensor(0.9997), tensor(0.8271), tensor(0.9803), tensor(1.0951e-05), tensor(0.9651), tensor(0.0073), tensor(0.1997), tensor(0.0774), tensor(0.9973), tensor(0.1467), tensor(0.6242), tensor(0.9999), tensor(0.1829), tensor(0.0034), tensor(1.0048e-07), tensor(0.9459), tensor(0.0140), tensor(0.0035), tensor(0.5249), tensor(0.9648), tensor(0.8907), tensor(0.2346), tensor(0.0002), tensor(0.3342), tensor(0.0032), tensor(0.0073), tensor(0.9946), tensor(0.9970), tensor(0.4838), tensor(0.5545), tensor(0.0009), tensor(0.6308), tensor(0.0002), tensor(0.7298), tensor(0.05

UnboundLocalError: local variable 'R' referenced before assignment
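
For reference, the rescaled-range (R/S) idea behind the Hurst estimate can be sketched without any third-party package. This is a textbook-style approximation under stated assumptions, not the `hurst` package's `compute_Hc`, and it does not diagnose the `UnboundLocalError` above. The helper names (`rescaled_range`, `estimate_hurst`), the doubling window schedule, and the stand-in data are all assumptions made for illustration:

```python
import math
import random


def rescaled_range(chunk):
    """R/S of one window: range of cumulative mean-adjusted sums over the std."""
    n = len(chunk)
    mean = sum(chunk) / n
    running, cumulative = 0.0, []
    for x in chunk:
        running += x - mean
        cumulative.append(running)
    spread = max(cumulative) - min(cumulative)
    std = math.sqrt(sum((x - mean) ** 2 for x in chunk) / n)
    return spread / std if std > 0 else None


def estimate_hurst(series, min_window=8):
    """Slope of log(mean R/S) against log(window size), fitted by least squares."""
    n = len(series)
    log_sizes, log_rs = [], []
    size = min_window
    while size <= n // 2:
        # Non-overlapping windows of the current size
        values = [rescaled_range(series[i:i + size])
                  for i in range(0, n - size + 1, size)]
        values = [v for v in values if v]  # drop constant windows
        if values:
            log_sizes.append(math.log(size))
            log_rs.append(math.log(sum(values) / len(values)))
        size *= 2
    if len(log_sizes) < 2:
        raise ValueError("series too short for an R/S fit")
    mx = sum(log_sizes) / len(log_sizes)
    my = sum(log_rs) / len(log_rs)
    slope = sum((x - mx) * (y - my) for x, y in zip(log_sizes, log_rs))
    return slope / sum((x - mx) ** 2 for x in log_sizes)


# Stand-in data: random probabilities, NOT model output.
random.seed(0)
fake_probs = [random.uniform(0.01, 1.0) for _ in range(511)]
bits = [-math.log2(p) for p in fake_probs]
print(f"H = {estimate_hurst(bits):.3f}")
```

To try it on the notebook's data, the list of floats in `results_probs_next_token` would be converted to bits in the same way and passed to `estimate_hurst`. With only 511 points the fit rests on a handful of window sizes, so the estimate should be read as rough.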