<a href="https://www.kaggle.com/code/ricardoriveroh/gemini-the-flu-genomics-expert?scriptVersionId=206062275" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# Get the API key from here: https://ai.google.dev/tutorials/setup
# Create a new secret called "GEMINI_API_KEY", via Add-ons/Secrets in the top menu, and attach it to this notebook

import google.generativeai as genai
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
api_key = user_secrets.get_secret("GEMINI_API_KEY")
genai.configure(api_key = api_key)

In [None]:
import os
import pandas as pd

# Path to your FASTA file
path = '/kaggle/input/h5-98cluster/H5_GisaidData.clustered.aligned.fasta'

# Initialize empty lists to store the sequences and their headers
sequences = []
headers = []
# Open the FASTA file and read sequences
with open(path, 'r') as file:
    sequence = ''
    for line in file:
        line = line.strip()
        if line.startswith('>'):
            headers.append(line)
            if sequence:
                sequences.append(sequence)
                sequence = ''
        else:
            sequence += line
    if sequence:
        sequences.append(sequence)

print(f"Total sequences loaded: {len(sequences)}")
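
The loader above assumes that every header is followed by a non-empty sequence and that the file is already aligned. A minimal sanity check for both assumptions could look like the sketch below. `check_alignment` and the toy records are hypothetical names and made-up data for illustration, not part of the dataset.

```python
def check_alignment(headers, sequences):
    """Return (n_records, alignment_length); raise ValueError on inconsistencies."""
    # Every header should have exactly one sequence
    if len(headers) != len(sequences):
        raise ValueError(f"{len(headers)} headers but {len(sequences)} sequences")
    # In an alignment, all sequences (gaps included) share one length
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return len(sequences), (lengths.pop() if lengths else 0)


# Toy records, invented for illustration only
toy_headers = [">A/duck/X/1/2020", ">A/cattle/Y/2/2024"]
toy_sequences = ["MEK-IVLL", "MEKNIVLL"]
print(check_alignment(toy_headers, toy_sequences))
```

In the notebook this would be called as `check_alignment(headers, sequences)` right after loading.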

In [None]:
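# Number of sequences loaded (this is a record count, not an LLM token count)
sequence_count = len(sequences)
print(sequence_count)
# Keep a small subset so the prompt stays short.
# Note: the slice [1:50] skips record 0 and keeps 49 records.
test_seq = sequences[1:50]
test_headers = headers[1:50]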
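
The headers are "/"-delimited, with the host in the second field, the clade second from last, and the date last (this is the layout the prompt relies on). A small sketch of splitting a header that way is shown below. `parse_header` is a hypothetical helper, and the example header is invented to match that description rather than copied from the dataset.

```python
def parse_header(header):
    """Split a '/'-delimited FASTA header into host, clade and date fields."""
    fields = header.lstrip(">").split("/")
    if len(fields) < 4:
        raise ValueError(f"unexpected header layout: {header!r}")
    return {
        "host": fields[1],    # second field
        "clade": fields[-2],  # second from last
        "date": fields[-1],   # last field
    }


# Invented header, for illustration only
example = ">A/chicken/Ghana/123/2021/2.3.4.4b/2021-07-01"
print(parse_header(example))
```

Applied to the subset, `[parse_header(h)["host"] for h in test_headers]` would give one host label per sequence, assuming every header follows the same layout.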

In [None]:
# Define the prompt
prompt = f"""
You are a computational biologist AI specializing in identifying patterns, mutations, and areas of interest in protein sequences in order to learn from the evolutionary trajectories of Influenza A.

Analyze the following protein alignment of the Hemagglutinin sequence of Influenza A subtype H5. These sequences come from different bird and cattle species as well as humans;
the goal is to identify signatures in the sequences belonging to each host species.
I will provide you with the alignment in FASTA format, where each line that starts with ">" is a header whose fields are separated by "/".
If you split the header on that delimiter, the second field is the source of the sample (host), the second from last is the clade, and the last is the date.
Based on that, analyze the alignment and provide insights on the following:

- Repetitive sequences and their significance.
- Higher-order interactions (positions whose value distribution depends on other positions).
- Common mutations or SNPs (Single Nucleotide Polymorphisms) and their possible effect (non-synonymous, synonymous, deletion, etc).
- Any patterns that could be of medical research interest because it might be a signature of antiviral resistance or immune escape.

Genome Sequences:
{test_seq}
headers:
{test_headers}

Instead of giving me code to do the analysis, execute the required code internally, analyze the results and provide insights as the computational biologist expert you are.
"""

In [None]:
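# Sampling settings for the model
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
    "max_output_tokens": 8000,
}

# Create the model, passing the settings so that they actually take effect
model = genai.GenerativeModel(
    model_name='gemini-1.5-pro-latest',
    generation_config=generation_config,
)
response = model.generate_content(prompt)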

In [None]:
documentation = response.text
print(documentation)
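
To cross-check host-specific positions mentioned in the response against the alignment itself, one option is to tabulate residues per column and per host. The sketch below is only an illustration of that idea: `variable_columns` is a hypothetical helper, the toy data is invented, and it assumes all sequences have the same length and that `-` marks a gap.

```python
from collections import Counter, defaultdict


def variable_columns(sequences, hosts, gap="-"):
    """Map each variable alignment column (0-based) to per-host residue counts."""
    if not sequences:
        return {}
    by_column = {}
    for col in range(len(sequences[0])):
        # A column is variable if it holds at least two distinct non-gap residues
        residues = {seq[col] for seq in sequences if seq[col] != gap}
        if len(residues) < 2:
            continue
        per_host = defaultdict(Counter)
        for seq, host in zip(sequences, hosts):
            per_host[host][seq[col]] += 1
        by_column[col] = {host: dict(counts) for host, counts in per_host.items()}
    return by_column


# Toy alignment and host labels, invented for illustration only
toy_sequences = ["MEKIVLL", "MEKIVLL", "MEKTVLL", "MERTVLL"]
toy_hosts = ["duck", "duck", "cattle", "cattle"]
for col, counts in variable_columns(toy_sequences, toy_hosts).items():
    print(col, counts)
```

A column where each host carries a different residue (column 3 in the toy data) is a candidate host signature; columns where the split is only partial need more sequences before drawing conclusions.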