<a href="https://www.kaggle.com/code/ricardoriveroh/gemini-the-flu-genomics-expert?scriptVersionId=206230846" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# Get the API key from here: https://ai.google.dev/tutorials/setup
# Create a new secret called "GEMINI_API_KEY", via Add-ons/Secrets in the top menu, and attach it to this notebook

import google.generativeai as genai
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
api_key = user_secrets.get_secret("GEMINI_API_KEY")
genai.configure(api_key=api_key)

In [2]:
import os
import pandas as pd
import numpy as np

# Path to your FASTA file
path = '/kaggle/input/h5-98cluster/H5_GisaidData.clustered.aligned.fasta'

# Initialize empty lists to store the sequences and their headers
sequences = []
headers = []
# Open the FASTA file and read headers and sequences
with open(path, 'r') as file:
    sequence = ''
    for line in file:
        line = line.strip()
        if line.startswith('>'):
            headers.append(line)
            if sequence:
                sequences.append(sequence)
                sequence = ''
        else:
            sequence += line
    if sequence:
        sequences.append(sequence)

print(f"Total sequences loaded: {len(sequences)}")

Total sequences loaded: 591
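
The loop above keeps headers and sequences in two parallel lists, which stay aligned only if every header is followed by at least one sequence line. The following is a minimal sketch of an alternative that yields each header together with its sequence. `read_fasta` is a hypothetical helper that is not part of the original notebook, and the sample records are made up for illustration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new record starts: emit the previous one, even if it was empty
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

# Made-up two-record example; the second record spans two lines.
example = io.StringIO(">seq1\nMEK-IV\n>seq2\nMEKI\nVLLF\n")
records = list(read_fasta(example))
print(records)
```

Because each pair is produced together, a header with no sequence yields an empty string instead of silently shifting every later sequence onto the wrong header.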


In [3]:
# Number of sequences loaded (a sequence count, not an LLM token count)
num_total_sequences = len(sequences)
print(num_total_sequences)
num_samples = 50
# Randomly select sequence indices without replacement
random_indices = np.random.choice(num_total_sequences, size=num_samples, replace=False)
test_seq = [sequences[i] for i in random_indices]
test_headers = [headers[i] for i in random_indices]

591
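
The random draw above changes on every run, so the sequences sent to the model (and therefore its answer) are not reproducible. A possible sketch of a seeded draw using NumPy's `Generator` API follows; `sample_indices` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def sample_indices(num_total, num_samples, seed):
    """Draw unique indices in [0, num_total) with a fixed seed."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_total, size=num_samples, replace=False)

# 591 and 50 mirror the counts used in this notebook; the seed is arbitrary.
idx_a = sample_indices(591, 50, seed=0)
idx_b = sample_indices(591, 50, seed=0)
print((idx_a == idx_b).all())  # the same seed gives the same selection
```

Recording the seed alongside the model response makes it possible to recover exactly which sequences were analyzed.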


In [4]:
# Define the prompt
prompt = f"""
You are a computational biologist specializing in identifying patterns, mutations, and areas of interest in protein sequences to understand the evolutionary trajectories of Influenza A.

Please analyze the Hemagglutinin protein alignment from Influenza A subtype H5. These sequences come from different bird and cattle species, as well as humans. Each sequence header contains metadata separated by '/', where:
	•	The second field represents the host species.
	•	The second-to-last field indicates the clade.
	•	The last field provides the date of the sample.

Based on this information, please provide insights on the following:
	1.	Repetitive Sequences: Identify any repetitive sequences within the alignment and discuss their potential significance.
	2.	Epistatic Interactions: Discuss areas where higher-order or epistatic interactions might occur, i.e., positions where the amino acid distribution depends on other positions.
	3.	Common Mutations or SNPs: Identify prevalent mutations or single nucleotide polymorphisms (SNPs) and their possible effects (e.g., non-synonymous, synonymous, deletions).
	4.	dN/dS Ratios: Highlight sites that may have a significant ratio of non-synonymous to synonymous substitutions (dN/dS), indicating selective pressures.
	5.	Medical Research Interest: Identify patterns that could be of medical significance, such as signatures of antiviral resistance or immune escape. Indicate their potential structural or biochemical effects and specify the domains where they are located (e.g., Globular head, HA1, HA2, Stalk).
	6.	Phylogenetic Relationships: Discuss the phylogenetic relationships among the sequences, referencing methods like neighbor-joining trees, and provide insights into the evolutionary trajectories.

Please provide your expert analysis on these topics.
Protein sequences:
{test_seq}
Headers:
{test_headers}

Instead of giving me code to do the analysis, execute the required code internally, analyze the results and provide insights as the computational biologist expert you are.
"""

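The prompt describes the header layout in words: fields separated by '/', with the host in the second field, the clade in the second-to-last, and the date in the last. A minimal sketch of extracting those fields directly follows, so the metadata can be checked before it is sent to the model. `parse_header` and the example headers are hypothetical; real headers in the dataset may use a different layout.

```python
from collections import Counter

def parse_header(header):
    """Split a '/'-delimited FASTA header into host, clade, and date."""
    fields = header.lstrip('>').split('/')
    if len(fields) < 4:
        raise ValueError(f"Unexpected header layout: {header!r}")
    return {"host": fields[1], "clade": fields[-2], "date": fields[-1]}

# Hypothetical headers for illustration only; real headers may differ.
example_headers = [
    ">A/chicken/Example/001/2.3.4.4b/2022-03-15",
    ">A/cattle/Example/002/2.3.4.4b/2024-04-02",
    ">A/chicken/Example/003/2.3.2.1c/2019-11-20",
]
parsed = [parse_header(h) for h in example_headers]
host_counts = Counter(p["host"] for p in parsed)
print(host_counts)
```

Tallying hosts and clades in the sampled subset gives a quick way to sanity-check any host- or clade-specific claims in the model's answer.
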
In [5]:
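# Generation settings
generation_config = {
  "temperature": 0.5,
  "top_p": 0.95,
  # Gemini 1.5 Pro caps output at 8,192 tokens; adjust for other models
  "max_output_tokens": 8192,
}

# Create the model, passing the settings so they actually take effect
model = genai.GenerativeModel(
    model_name='gemini-1.5-pro-latest',
    generation_config=generation_config,
)
response = model.generate_content(prompt)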
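
Fifty aligned sequences plus their headers make for a long prompt, so it may be worth checking its size before sending it. The sketch below wraps the SDK's `count_tokens` method, which returns an object with a `total_tokens` attribute. `check_prompt_size`, the `input_limit` argument, and the stand-in model class are assumptions for illustration; the stand-in only exists so the example runs without an API key.

```python
from types import SimpleNamespace

def check_prompt_size(model, prompt, input_limit):
    """Return (total_tokens, fits) for a prompt against a chosen limit."""
    total = model.count_tokens(prompt).total_tokens
    return total, total <= input_limit

class FakeModel:
    """Stand-in for genai.GenerativeModel; counts whitespace-separated words."""
    def count_tokens(self, text):
        return SimpleNamespace(total_tokens=len(text.split()))

total, fits = check_prompt_size(FakeModel(), "MEKIVLLF AIVSLVKS DQICIGYH", input_limit=10)
print(total, fits)
```

In the notebook itself the call would be `check_prompt_size(model, prompt, input_limit=...)`, with the limit taken from the documentation of whichever model is in use.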

In [6]:
documentation = response.text
print(documentation)

## Analysis of Hemagglutinin Protein Alignment from Influenza A Subtype H5

After analyzing the provided Hemagglutinin (HA) protein alignment from various avian, cattle, and human hosts, I’ve identified several key patterns and variations that provide insights into the evolutionary trajectory of Influenza A H5.

**1. Repetitive Sequences:**

The alignment does not exhibit striking long repetitive amino acid sequences. However, shorter k-mer repeats (e.g., RKKR) are observed, particularly in the stalk region. These short repeats can be important for protein stability and interactions with other viral or host proteins. Changes in these repeats could affect HA function and viral fitness.  Further analysis with specialized repeat-finding tools might reveal more subtle repetitive patterns.

**2. Epistatic Interactions:**

Potential epistatic interactions are likely present, as suggested by the variability at certain positions coupled with dependencies on amino acids at other sites. For exam