<h1>Finenzyme generation and fine-tuning tutorial</h1>

The generation process through the Protein Language Model Finenzyme involves several steps:

1. **Tokenization**: The input text composed of conditioning tags and an amino acid sequence is translated into tokens.

2. **Feeding the Model**: The tokens are then fed into the model. The model takes in a sequence of tokens and returns a probability distribution for the next token - that is, it assigns a probability to every possible next token over the amino acid vocabulary.

3. **Sampling**: A token is selected from this distribution. The way this selection happens can vary - it could be the token with the highest probability, or it could be a random selection weighted by the probabilities.

4. **Decoding**: The selected token is then added to the sequence of input tokens, and the process repeats: the extended sequence is fed back into the model, which returns a new probability distribution for the next token, and so on. This continues until a stop condition is met, such as reaching a maximum length or generating a special end-of-sequence stop token.

5. **Detokenization**: Finally, the sequence of tokens is converted back into an amino acid sequence.
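
The five steps above can be sketched with a toy example. This is only an illustration under stated assumptions, not Finenzyme's implementation: `toy_model`, `sample_top_k`, `generate`, and the `<eos>` stop token are hypothetical names, and the "model" is a fixed function of the sequence length rather than a neural network.

```python
import random

# Toy vocabulary: the 20 standard amino acids plus a hypothetical stop token.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STOP = "<eos>"
VOCAB = list(AMINO_ACIDS) + [STOP]


def toy_model(tokens):
    """Stand-in for the language model: one probability per vocabulary entry.

    A real model would condition on the whole token sequence; this toy
    only looks at its length.
    """
    weights = [1.0 + ((len(tokens) + i) % 5) for i in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights]


def sample_top_k(probs, k, rng):
    """Keep the k most probable tokens and sample among them by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


def generate(prefix, max_new_tokens=10, k=1, seed=0):
    rng = random.Random(seed)
    tokens = [VOCAB.index(aa) for aa in prefix]    # 1. tokenization
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)                  # 2. feed the model
        next_token = sample_top_k(probs, k, rng)   # 3. sampling
        if VOCAB[next_token] == STOP:              # stop condition
            break
        tokens.append(next_token)                  # 4. extend the input and repeat
    return "".join(VOCAB[t] for t in tokens)       # 5. detokenization


print(generate("MIE", max_new_tokens=5, k=1))
```

With `k = 1` the sampler always picks the most probable token, so the output is deterministic. Larger values of `k` introduce randomness among the `k` most likely tokens.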

<h2>Step 1. Generation through the pre-trained model</h2>

In this section, we generate sequences using the pre-trained Finenzyme model. This involves specifying UniProt keywords, adding a taxonomy keyword, and writing the code that performs the generation.

#### Process Overview
1. **Model Loading**: First, we load a pre-trained model that has been trained on a large dataset of protein sequences. This model is capable of generating new sequences based on the patterns it has learned.
2. **Sequence Generation Example**: Using the model, we generate a new protein sequence. This can be done by providing a starting sequence or keywords that guide the generation process (in the example below, we use both).
3. **Keywords Specification**: We specify keywords from the UniProt database. These keywords steer the generation toward sequences with the desired biological functions and structures.
4. **Taxonomy Keyword**: Additionally, we use a taxonomy keyword to focus the generation on a specific organism.

#### UniProt Keywords
UniProt keywords are terms used to describe various aspects of protein sequences, including their functions, domains, and cellular locations. By specifying these keywords, we can guide the model to generate sequences that have desired properties. Examples of UniProt keywords include:
- **Kinase**: To generate sequences related to kinases.
- **Membrane**: For sequences associated with membrane proteins.
- **DNA-binding**: To focus on proteins that bind to DNA.

#### Taxonomy Keywords
Taxonomy keywords allow us to filter sequences based on the organism or group of organisms of interest. By incorporating these keywords, we can ensure the generated sequences are relevant to specific taxonomic groups. Examples of taxonomy keywords include:
- **Homo sapiens**: For human-related sequences.
- **Escherichia coli**: To focus on sequences from E. coli.
- **Fungi**: For sequences related to fungal organisms.

By combining UniProt and taxonomy keywords, we can guide the generation process of protein sequences using the pre-trained model.

#### Code
The code for generating sequences through the pre-trained model involves loading the model, specifying the generation parameters, and executing the generation process.

In [1]:
from generation_manager import GeneratorManager
from tokenizer import Tokenizer

In [19]:
# Here we set the model checkpoint path.
# For this example, we use the pre-trained Finenzyme model.
model_path = 'ckpt/pretrain_progen_full.pth' 

# Now we create the generator manager with the default penalty (0) and top-k sampling with k = 1.
# This class loads the model in memory.
generator = GeneratorManager(model_path, topk = 1)

# Now, let's load the tokenizer
tokenizer = Tokenizer()

MODEL SIZE: 
1280
Found PyTorch checkpoint at  ckpt/pretrain_progen_full.pth
GPU aviable. Previous checkpoint loaded in GPU


In [29]:
# An example sequence: P05102	MTH1_HAEPH	Type II methyltransferase M.HhaI 
# (source: https://www.uniprot.org/uniprotkb/P05102/entry)
sequence = "MIEIKDKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEKPEGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREKKPKVVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMICFRNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTTPKTVRLGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMGYPDSYKVHPSTSQAYKQFGNSVVINVLQYIAYNIGSSLNFKPY"

# keyword IDs in input (from UniProt): 
keywords_uniprot = 'KW-0002; KW-0238; KW-0489; KW-0680; KW-0949; KW-0808'
keywords_uniprot = [int(i.split('-')[-1]) for i in keywords_uniprot.split('; ') if i != '']

# Convert keywords to control IDs.
# UniProt is continuously updated, so some keywords may be missing from the encoding dictionary
keywords_finenzyme = [tokenizer.kw_to_ctrl_idx[i] for i in keywords_uniprot if i in tokenizer.kw_to_ctrl_idx]
print('Keywords from uniProt, translated into Finenzyme control codes: ', keywords_finenzyme)

# we can also add the taxonomy ID as a keyword
taxonomy_id = 735 # from UniProt
taxonomy_id = tokenizer.taxa_to_ctrl_idx[taxonomy_id] if taxonomy_id in tokenizer.taxa_to_ctrl_idx else None
print('Do we have the taxonomy ID in the tokenizer? ', taxonomy_id is not None)
print('taxonomy id: ', taxonomy_id)

# next, we add the taxonomy Finenzyme tag to the list of input tags,
# but only if the tokenizer knows it (otherwise we would append None)
if taxonomy_id is not None:
    keywords_finenzyme.append(taxonomy_id)

# Next, we set the length of the amino acid prefix given as input to the model
prefix = 20


Keywords from uniProt, translated into Finenzyme control codes:  [13, 29, 422, 49, 3]
Do we have the taxonomy ID in the tokenizer?  True
taxonomy id:  1634


In [30]:
# Last step: generation. With after_n_generation we generate up to the real protein length.
res, tokens_prob, offset = generator.after_n_generation(sequence, keywords_finenzyme, prefix)

In [31]:
# What happened during generation?
print('Finenzyme generated a protein with a prefix of ', offset, 'amino acids.')
print('Is the predicted protein equal to the real one?', res == sequence[prefix:])
print('Generated protein:')
print(res)
print('Actual protein:')
print(sequence[prefix:])


Finenzyme generated a protein with a prefix of  20 amino acids.
Is the predicted protein equal to the real one? False
Generated protein:
IGGFHAALHRLGGRCVYASEIDPHVRKVYELNFGDRPEFTEDIRKITEEEIPDHDVLLAGFPCQPFSIIGKRRGFKDERGTLYFEILRILKAKRPRAFLLENVKNFVNHDKGRTFKIIEDVLEELDFSFSYKLLDPKNFGVPQNRERVFIVGFREKEKLDFKFPKPEELPKPLTLSDILEDNPDSQYFLSKDKLTKLHRHKEKGNGFGFGLVNIEGKGGKIARTLSARYHKDNIDNIEGARNNARPNHQANGIPLSPQQAAKIQGFPEDFKIIGNDAVYKQLGNAVVVPLIQAIGEKILKELNKEKK
Actual protein:
LGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEKPEGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREKKPKVVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMICFRNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTTPKTVRLGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMGYPDSYKVHPSTSQAYKQFGNSVVINVLQYIAYNIGSSLNFKPY


In [32]:
# here we search for indices of amino acids that differ from the natural sequence
differing_idx = [(i, [actual, predicted]) for i, (actual, predicted) in enumerate(zip(sequence[prefix:], res)) if actual != predicted]
print(f'The pre-trained model generated an enzyme with {len(differing_idx)} different amino acids from the original one.')

The pre-trained model generated an enzyme with 259 different amino acids from the original one.


In [34]:
# We can inspect the probabilities that Finenzyme assigned at the mismatching positions.
# Only the first mismatch is printed here.
from tokenizer import Tokenizer
tokenizer = Tokenizer()
for idx, (actual, predicted) in differing_idx[:1]:
    probs = tokens_prob[idx][0]
    print(f'The predicted sequence in generated index {idx},'
    f' has probabilities for true aa {actual} of {probs[tokenizer.aa_to_probs_index[actual]]:.3f},'
    f' and for predicted aa {predicted} of {probs[tokenizer.aa_to_probs_index[predicted]]:.3f}')


The predicted sequence in generated index 0, has probabilities for true aa L of 0.169, and for predicted aa I of 0.666


### Comment on the Generation with the Pre-trained Model

In this section, we discuss the results obtained from generating protein sequences using the pre-trained model. The model was guided by specific UniProt keywords and a taxonomy ID to produce sequences relevant to the desired biological context.

#### Results Analysis
The generation process began with the input sequence of the Type II methyltransferase P05102 and relevant UniProt keywords related to this entry, translated into control codes for the model. A taxonomy ID was also added to refine the context further.

The model generated a sequence with a specified prefix length, and the resulting sequence was compared to the original protein sequence. Key findings:

- **Generated Sequence**: The pre-trained model received the first 20 amino acids of the natural sequence as an input prefix. The rest of the sequence was generated from the model's learned patterns and the specified keywords.
- **Comparison to Actual Sequence**: The generated sequence was compared to the actual sequence after the 20-residue prefix (that is, to `sequence[prefix:]`). The two sequences are not identical.
- **Differing Indices**: The indices where the generated sequence differed from the actual sequence were identified. The analysis found 259 mismatching positions, which mark where the model's predictions deviated from the natural sequence.
- **Probability Analysis**: For the differing amino acids, the probabilities assigned by the model to both the actual and the predicted amino acid were examined. At generated index 0, for example, the model assigned 0.169 to the true residue L and 0.666 to the predicted residue I. This provides insight into the model's confidence in its predictions and highlights the positions where it may be uncertain.

#### Conclusion
The pre-trained model generated a protein sequence influenced by the input keywords and taxonomy ID. Even with greedy decoding (top-k sampling with k = 1), the generated sequence was not identical to the natural one. The differences show where the model's predictions deviate from the natural enzyme and motivate further improvement through fine-tuning.
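
The position-by-position comparison can also be summarized as a percent identity. The helper below is a minimal sketch, not part of Finenzyme: `percent_identity` is a hypothetical name, and it compares residues index by index without performing a sequence alignment. The two strings are the first ten residues of the actual and generated sequences printed above.

```python
def percent_identity(reference, generated):
    """Percentage of positions at which two sequences carry the same residue.

    Positions are compared index by index (no alignment, no gaps) over the
    length of the shorter sequence.
    """
    n = min(len(reference), len(generated))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(reference, generated) if a == b)
    return 100.0 * matches / n


# First ten residues after the prefix: actual vs. generated
print(percent_identity("LGGFRLALES", "IGGFHAALHR"))  # prints 50.0
```

An alignment-based identity would be more robust when insertions or deletions shift the generated sequence, but the index-wise version matches the comparison used in this tutorial.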

<h2>Step 2. Fine-tuning the model on a specific protein set</h2>

<p>Dataset preparation: the dataset is a pickled dictionary containing the keywords (already defined and encoded as control codes) and the amino acid sequences (encoded as letters).</p>
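
As a rough illustration of such a pickled dictionary, the sketch below writes and reloads a tiny dataset. The layout is an assumption: the entry IDs, the `"kw"` and `"seq"` keys, and the `train.p` file name are all hypothetical, and the exact structure expected by the training script should be checked against its source before use.

```python
import os
import pickle
import tempfile

# Hypothetical layout: entry ID -> {"kw": control codes, "seq": amino acid letters}.
# The sequences are short toy fragments, not complete proteins.
dataset = {
    "example_1": {"kw": [0], "seq": "MIEIKDKQLTGLRFIDLFAG"},
    "example_2": {"kw": [0], "seq": "MKDKQLTGLRF"},
}


def save_dataset(data, path):
    """Serialize the dictionary to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(data, handle)


def load_dataset(path):
    """Load a dictionary previously written by save_dataset."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Write to a temporary directory so the example leaves the project untouched
path = os.path.join(tempfile.mkdtemp(), "train.p")
save_dataset(dataset, path)
restored = load_dataset(path)
print(len(restored), "entries restored")
```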

<p>Fine-tuning of Finenzyme is performed by the pytorch_training.py module. Call it directly from the command line, passing as input the starting pre-trained model, the training dataset, optionally the validation dataset, and any fine-tuning parameters.</p>

In [None]:
DISCLAIMER: this step takes a long time; parallel hardware (a GPU) is recommended

In [None]:
import subprocess

# Define the command as a string
command = """
python pytorch_training.py --model_dir 'ckpt/' --model_path 'ckpt/pretrain_progen_full.pth' --stop_token 1 --model_name 'ec_2_1_1_37' --db_directory 'data_specific_enzymes/databases/pickles/'
"""

# Run the command, merging stderr into stdout so that a full stderr pipe
# cannot block the process while we are still reading stdout
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

# Print the output (including error messages) line by line as it is generated
for line in process.stdout:
    print(line, end='')

# Wait for the process to finish and report a non-zero exit code, if any
return_code = process.wait()
if return_code != 0:
    print(f'Training exited with return code {return_code}')



In [None]:
# commands to call from the notebook.....
# and explanation......

<h2>Step 3. Generation through the fine-tuned model</h2>

<p>Test paragraph</p>

In [1]:
from generation_manager import GeneratorManager

In [44]:
# Here we set the model checkpoint path.
# For this example, we use a Finenzyme checkpoint fine-tuned on EC 2.1.1.37.
model_path = 'ckpt/ec_2_1_1_37_warmup_1000_earlystop_epoch6_015_flip_LR01_2batch.pth' 

# Now we create the generator manager with the default penalty (0) and top-k sampling with k = 1.
# This class loads the model and the tokenizer in memory.
generator = GeneratorManager(model_path, topk = 1)

MODEL SIZE: 
1280
Found PyTorch checkpoint at  ckpt/ec_2_1_1_37_warmup_1000_earlystop_epoch6_015_flip_LR01_2batch.pth
GPU aviable. Previous checkpoint loaded in GPU


In [52]:
# An example sequence: P05102	MTH1_HAEPH	Type II methyltransferase M.HhaI 
# (source: https://www.uniprot.org/uniprotkb/P05102/entry)
sequence = "MIEIKDKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEKPEGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREKKPKVVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMICFRNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTTPKTVRLGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMGYPDSYKVHPSTSQAYKQFGNSVVINVLQYIAYNIGSSLNFKPY"

# input keyword IDs defined during fine-tuning (already encoded as control codes)
keywords_finenzyme = [0]

# Next, we set the length of the amino acid prefix given as input to the model
prefix = 20


In [53]:
# Last step: generation. With after_n_generation we generate up to the real protein length.
res, tokens_prob, offset = generator.after_n_generation(sequence, keywords_finenzyme, prefix)

In [54]:
# What happened during generation?
print('Finenzyme generated a protein with a prefix of ', offset, 'amino acids.')
print('Is the predicted protein equal to the real one?', res == sequence[prefix:])
print('Generated protein:')
print(res)
print('Actual protein:')
print(sequence[prefix:])


Finenzyme generated a protein with a prefix of  20 amino acids.
Is the predicted protein equal to the real one? False
Generated protein:
LGGFRLALESFGAECVYSNEWDKYAQEVYQMNFGDKPDGDITLVDENSVPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDVARIVKAKNPKVVFMENVKNFASHDNGNTLKVVKNIMVDLGYDFYSDVLNSLDFGIPQKRERIYMVCFRKDLNIKNFTFPKPFKLSTFLEDLLLPDEEVSNLIINRPDLVLKDIEIKNNSNKTIRIGEVGKGGQGERIYSPKGIAITLSAYGGGVFSKTGGYLINGKTRKLHPRECARIMGYPDSYLIHPSWNQAYKQFGNSVVVNVLQYITKNMGEALSGEYN
Actual protein:
LGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEKPEGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREKKPKVVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMICFRNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTTPKTVRLGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMGYPDSYKVHPSTSQAYKQFGNSVVINVLQYIAYNIGSSLNFKPY


In [55]:
# here we search for indices of amino acids that differ from the natural sequence
differing_idx = [(i, [actual, predicted]) for i, (actual, predicted) in enumerate(zip(sequence[prefix:], res)) if actual != predicted]
print(f'The fine-tuned model generated an enzyme with {len(differing_idx)} different amino acids from the original one.')

The fine-tuned model generated an enzyme with 71 different amino acids from the original one.


In [57]:
# We can inspect the probabilities that Finenzyme assigned at the mismatching positions.
# Only the first mismatch is printed here.
from tokenizer import Tokenizer
tokenizer = Tokenizer()
for idx, (actual, predicted) in differing_idx[:1]:
    probs = tokens_prob[idx][0]
    print(f'The predicted sequence in index {idx},'
    f' has probabilities for true aa {actual} of {probs[tokenizer.aa_to_probs_index[actual]]:.3f},'
    f' and for predicted aa {predicted} of {probs[tokenizer.aa_to_probs_index[predicted]]:.3f}')


The predicted sequence in index 10, has probabilities for true aa C of 0.017, and for predicted aa F of 0.796


<p>Final considerations</p>