<a href="https://colab.research.google.com/github/ArjtheGreat/Genie-ome/blob/main/DNADetectives_StreamlitApp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#de3023"><h1><b>REMINDER MAKE A COPY OF THIS NOTEBOOK, DO NOT EDIT</b></h1></font>

![](https://www.pennmedicine.org/news/-/media/images/pr%20news/news/2021/october/dna.ashx)

# **Goals**
In this notebook, you will:
*   Learn how to clean up and preprocess genome data
*   Convert genomic data into a feature matrix
*   Build a logistic regression model predicting the region of the world a SARS-CoV-2 lineage came from based on its genome





In [None]:
#@title Run this cell to set up the environment { display-mode: "form" }
!pip install Biopython
from Bio import SeqIO
import numpy as np
import pandas as pd
import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from collections import Counter
from sklearn import model_selection, linear_model

# data_path = 'https://drive.google.com/uc?id=1f1CtRwSohB7uaAypn8iA4oqdXlD_xXL1'
!wget -q --show-progress 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20DNA%20Detectives/SARS_CoV_2_sequences_global.fasta'
cov2_sequences = 'SARS_CoV_2_sequences_global.fasta'




# **Data Preprocessing**

## **Examining Data**

We are going to read in a set of SARS-CoV-2 genomes from around the world. Note that sequence #0 is the "reference sequence", one of the original sequences from Wuhan. These global sequences come from the [NCBI database](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049&SLen_i=29000%20TO%2031000&Completeness_s=complete&HostLineage_ss=Homo%20sapiens%20(human),%20taxid:9606). You can examine the different sequences using the form below.

In [None]:
sequences = [r for r in SeqIO.parse(cov2_sequences, 'fasta')]
sequence_num =  0#@param {type:"integer"}
print(sequences[sequence_num])

ID: NC_045512
Name: NC_045512
Description: NC_045512 |Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1| complete genome|China
Number of features: 0
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA')


### **Exercise: How many sequences are there?**

Note: Sequences have been uploaded/stored in a variable called ```sequences```.

In [None]:
n_sequences = len(sequences) ### YOUR CODE HERE
print(f"There are {n_sequences} sequences")

There are 1538 sequences


### **Exercise: How different are the 1st (reference) and 10th SARS-CoV-2 sequences?**



In [None]:
sequence_1 = np.array(sequences[0])
sequence_10 = np.array(sequences[9])
percent_similarity = np.sum(sequence_1 == sequence_10) / len(sequence_1)*100
print("Sequence 1 and 10 similarity: %", percent_similarity)

Sequence 1 and 10 similarity: % 99.9765909774939


### **Exercise (BONUS):  Make a histogram of the number of mutations each SARS-CoV-2 sequence has compared to the reference genome.**

Interestingly, it looks like there are a couple sequences with a LOT of mutations! We can investigate these sequences a little more.

**Examine some of these sequences with high number of mutations by selecting the minimum # of mutations from the form below. What do you notice about the sequences? Discuss with your instructor and peers.**
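One way to approach the bonus exercise is sketched below. This is a minimal illustration on toy data, not the notebook's solution: the helper `count_mutations`, the toy sequences, and the decision to skip `N` positions are all assumptions made for this example.

```python
from collections import Counter

def count_mutations(reference, sequence):
    """Count positions where `sequence` differs from `reference`.

    Positions holding 'N' (unknown base) are skipped rather than counted
    as mutations; that is a choice made for this sketch.
    """
    return sum(1 for ref_base, base in zip(reference, sequence)
               if base != ref_base and base != 'N')

# Toy aligned sequences (hypothetical data, not from the FASTA file).
reference = "ATGCATGCAT"
toy_sequences = ["ATGCATGCAT", "ATGAATGCAT", "TTGAATGNAT", "ATGCATGCAA"]

mutation_counts = [count_mutations(reference, s) for s in toy_sequences]
histogram = Counter(mutation_counts)

# Print a simple text histogram: one '#' per sequence with that many mutations.
for n_mutations in sorted(histogram):
    print(f"{n_mutations} mutation(s): {'#' * histogram[n_mutations]}")
```

In the notebook itself, the same list of counts could be built from `sequences` and passed to `plt.hist` to draw the histogram.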



## Missing Data


It is hard to see, but some of the sequences have `N` in them. Run the cell below for an example.

### **Exercise: Calculate the number of sequences that have an ```N``` in them.**

**What do you think ```N``` means?**

In [None]:
n_sequences_with_N = sum(['N' in s for s in sequences])

print(f'{n_sequences_with_N} sequences have at least 1 "N"!')

326 sequences have at least 1 "N"!



`N` is not a nucleotide; it stands for "missing" or "low quality". "Missing" is different from ```-```, which marks a deletion. At the locations with ```N```, the sequencing machine produced low-quality data, so it could not determine which base was there. We should remember this when we extract our features. Stay tuned for more on sequencing machines and how sequences are built in the bonus notebook of this project!
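To make the difference between an unknown base and a deletion concrete, here is a small sketch that tallies both in a toy sequence. The function name `summarize_ambiguity` and the example string are hypothetical and used only for illustration.

```python
def summarize_ambiguity(sequence):
    """Count unknown bases ('N'), deletions ('-'), and called bases."""
    n_unknown = sequence.count("N")
    n_deleted = sequence.count("-")
    return {
        "unknown_N": n_unknown,          # sequencer could not call a base
        "deletion_gap": n_deleted,       # base is absent relative to the alignment
        "called": len(sequence) - n_unknown - n_deleted,
    }

# Toy aligned sequence (hypothetical, not from the dataset).
summary = summarize_ambiguity("ATG-NNCA-T")
print(summary)
```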

# **Feature Extraction**

We are going to build a model that predicts where in the world a SARS-CoV-2 virus came from based on its genome.

### **Exercise: Recall the structure of machine learning models.**
**In general what two categories of data do we need to build a supervised machine learning model? What will we use for each category?**


In [None]:
_1_  =  '' #@param {type:"string"}
_2_  =  '' #@param {type:"string"}

print('1. We need a set of FEATURES (X).\n',
      '  Our features will be the genomes of the different sequences.')
print('2. We need LABELS (Y).\n',
      '  Our labels will be the country that each sequence came from.')


1. We need a set of FEATURES (X).
   Our features will be the genomes of the different sequences.
2. We need LABELS (Y).
   Our labels will be the country that each sequence came from.


**Question: How will we turn our features into a numeric matrix?**

## Extract Features (X)

Remember that our input must be a *numeric* matrix/table.
We are going to create a matrix where our features are the presence/absence of a specific mutation (given by ```<location>```_```<base>```).

Our columns will be 1_A, 1_T, 3_G, 4_A, etc.


| Sequence ID | 1_A | 1_C | 3_G | 4_A  | ...|
|-------------|-----|-----|-----|------|----|
|Sequence 1   |  1  |  0  |   1 |    0 |  0 |
|Sequence 2   |  0  |  0  |   1 |    0 |  0 |
|Sequence 3   |  1  |  0  |   0 |    0 |  0 |
|Sequence 4   |  0  |  1  |   0 |    1 |  1 |
|Sequence 5   |  1  |  0  |   0 |    0 |  1 |
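The encoding in the table above can be sketched on a few toy sequences. This is a simplified illustration in plain Python, assuming aligned sequences of equal length; the helper `build_feature_columns` and the toy data are hypothetical. Locations are zero-based here, matching Python indexing.

```python
def build_feature_columns(sequences, bases=("A", "T", "G", "C", "-")):
    """Build {"<location>_<base>": [0/1 per sequence]} for variable positions."""
    n_positions = len(sequences[0])
    columns = {}
    for location in range(n_positions):
        bases_here = [s[location] for s in sequences]
        # Skip positions where every sequence has the same base.
        if len(set(bases_here)) == 1:
            continue
        for base in bases:
            columns[f"{location}_{base}"] = [int(b == base) for b in bases_here]
    return columns

# Toy aligned sequences (hypothetical data).
toy_sequences = ["ATGC", "ATGA", "TTGC"]
columns = build_feature_columns(toy_sequences)

for name, values in columns.items():
    print(name, values)
```

Only positions 0 and 3 vary across the toy sequences, so only those positions produce columns; positions where all sequences agree carry no information for the model.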




In [None]:
# Note: This can take a couple minutes to run!
# but we can monitor our progress using the tqdm library (which creates a progress bar)
n_bases_in_seq = len(sequences[0])
columns = {}

# Iterate through all positions in this sequence.
for location in tqdm.tqdm(range(n_bases_in_seq)): # tqdm is a nice library that prints our progress.
  bases_at_location = np.array([s[location] for s in sequences])
  # If there are no mutations at this position, move on.
  if len(set(bases_at_location))==1: continue
  for base in ['A', 'T', 'G', 'C', '-']:
    feature_values = (bases_at_location==base)

    # Unknown bases ('N') never equal A/T/G/C/-, so they are already False
    # (encoded as 0) here. Do not assign np.nan: in a boolean array it is
    # stored as True, which would mark every base as present at that position,
    # and the logistic regression model cannot be fit on NaN values anyway.

    # Convert from True/False to 1/0.
    feature_values = feature_values.astype(int)

    # Make the column name look like <location>_<base> (1_A, 2_G, 3_A, etc.)
    column_name = str(location) + '_' + base

    # Add column to dict
    columns[column_name] = feature_values


mutation_df = pd.DataFrame(columns)

# Print the size of the feature matrix/table.
n_rows = np.shape(mutation_df)[0]
n_columns = np.shape(mutation_df)[1]
print(f"Size of matrix: {n_rows} rows x {n_columns} columns")

# Check what the matrix looks like:
mutation_df.tail()

100%|██████████| 29903/29903 [03:05<00:00, 161.01it/s]


Size of matrix: 1538 rows x 12680 columns


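|      | 0_A | 0_T | 0_G | 0_C | 0_- | 1_A | 1_T | 1_G | 1_C | 1_- | ... | 29901_A | 29901_T | 29901_G | 29901_C | 29901_- | 29902_A | 29902_T | 29902_G | 29902_C | 29902_- |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 1533 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1534 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1535 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1536 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1537 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |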


## Extract Label (Y)

We are going to use the region of the world that each sample came from as the **label**. ![alt text](https://upload.wikimedia.org/wikipedia/commons/3/3d/Flag-map_of_the_world_%282017%29.png)

First, let's see how many samples we have from different countries.

In [None]:
#@title ###**Exercise: Explore the different number of samples that come from each country.**
country = "USA" #@param ['China', 'Kazakhstan', 'India', 'Sri Lanka', 'Taiwan', 'Hong Kong', 'Viet Nam', 'Thailand', 'Nepal', 'Israel', 'South Korea', 'Iran', 'Pakistan', 'Turkey', 'Australia', 'USA']
countries = [(s.description).split('|')[-1] for s in sequences]
print(f"There are {Counter(countries)[country]} sequences from {country}.")

There are 1215 sequences from USA.
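The cell above relies on the FASTA description ending with the country after the last `|`. A small sketch of that parsing step follows; the first description is the reference record shown earlier, while the `TOY_` records are hypothetical entries added only for illustration.

```python
from collections import Counter

descriptions = [
    "NC_045512 |Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1| complete genome|China",
    "TOY_0001 |hypothetical isolate| complete genome|USA",  # made-up record
    "TOY_0002 |hypothetical isolate| complete genome|USA",  # made-up record
]

# The country is the text after the final '|' in each description.
countries = [d.split("|")[-1] for d in descriptions]
country_counts = Counter(countries)

print(country_counts.most_common())
```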


Since some countries only have a couple samples, we are going to use the **region** of the world as our labels.

Since we have a large number of samples from Asia, North America, and Oceania, we will filter our sequences to just these regions. We will convert our countries to regions using the code below.

### **Exercise: Convert each country to its region of the world.**
**Use the code below to create a dictionary of ```<country>```:```<region>``` where ```region``` is either ```'Oceania'```, ```'North America'```, or ```'Asia'```, and convert each country to region.**

In [None]:
### YOUR CODE HERE: Replace the Nones below!
countries_to_regions_dict = {
         'Australia': 'Oceania',
         'China': 'Asia',
         'Hong Kong': 'Asia',
         'India': 'Asia',
         'Nepal': 'Asia',
         'South Korea': 'Asia',
         'Sri Lanka': 'Asia',
         'Taiwan': 'Asia',
         'Thailand': 'Asia',
         'USA': 'North America',
         'Viet Nam': 'Asia'
}

regions = [countries_to_regions_dict[c] if c in
           countries_to_regions_dict else 'NA' for c in countries]
mutation_df['label'] = regions

**Now see how many samples there are from each region of the world.**

In [None]:
region = "Asia" #@param ['Oceania', 'North America', 'Asia']
print(f"There are {Counter(regions)[region]} sequences from {region}.")

There are 152 sequences from Asia.


## Balancing the Data


Recall that ML models work best when we have *balanced* data: a dataset with equal numbers of samples for each label. Run the following code to remove duplicate samples from the dataset, and then balance the samples.

### **Exercise: Balance the data equally between samples from Asia, Oceania, and North America**

In [None]:
balanced_df = mutation_df.copy()
balanced_df['label'] = regions
balanced_df = balanced_df[balanced_df.label!='NA']
balanced_df = balanced_df.drop_duplicates()
samples_north_america = balanced_df[balanced_df.label=='North America']
samples_oceania = balanced_df[balanced_df.label=='Oceania']
samples_asia = balanced_df[balanced_df.label=='Asia']

# Number of samples we will use from each region.
n = min(len(samples_north_america),
        len(samples_oceania),
        len(samples_asia))

balanced_df = pd.concat([samples_north_america[:n],
                    samples_asia[:n],
                    samples_oceania[:n]])
print("Number of samples in each region: ", Counter(balanced_df['label']))

Number of samples in each region:  Counter({'North America': 128, 'Asia': 128, 'Oceania': 128})
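The cell above balances the data by keeping the first `n` rows from each region. As a variation (an assumption of this sketch, not what the notebook does), the rows could instead be chosen at random with a fixed seed, so the kept samples are not tied to file order. The helper `balance_by_downsampling` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance_by_downsampling(labels, seed=0):
    """Return sorted row indices with the same number of rows per label."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)
    # Every label is cut down to the size of the rarest label.
    n = min(len(indices) for indices in indices_by_label.values())
    kept = []
    for indices in indices_by_label.values():
        kept.extend(rng.sample(indices, n))
    return sorted(kept)

# Toy labels (hypothetical): 6 North America, 3 Asia, 2 Oceania.
toy_labels = ["North America"] * 6 + ["Asia"] * 3 + ["Oceania"] * 2
kept = balance_by_downsampling(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept)

print(balanced_counts)
```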


# **Logistic Regression Model**

***Congrats!***  We finally are done with preprocessing/cleaning our data! Although tedious, this is an important part of doing machine learning in biology. The data can be complex and messy, and if we don't do some cleaning up beforehand, our models will have poor performance.


![](https://media.makeameme.org/created/we-did-it-3b3ac27d2a.jpg)


Finally, run the code to set up an ```X``` feature matrix and a ```Y``` label list from our ```balanced_df```. You can explore the different values using the code below:

In [None]:
X = balanced_df.drop('label', axis=1)
Y = balanced_df.label
data = "Y (label)" #@param ['X (features)', 'Y (label)']
start = 1 #@param {type:'integer'}
stop =  10#@param {type:'integer'}

if start>=stop:print("Start must be < stop!")
else:
  if data=='X (features)':
    print(X.iloc[start:stop])
  if data=='Y (label)':
    print(Y[start:stop])

323    North America
324    North America
325    North America
326    North America
327    North America
328    North America
329    North America
330    North America
331    North America
Name: label, dtype: object


In preparation for training and testing our model, we need to import two key functions:

* `train_test_split()`, used to split the data into training and testing portions, and
* `accuracy_score()`, for testing the accuracy of our model's predictions against the labels in our testing data.

Run the next code block to import these functions!

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
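To show what these two functions do conceptually, here is a simplified plain-Python sketch. The helpers `simple_train_test_split` and `simple_accuracy` are hypothetical stand-ins written for this example; they are not scikit-learn's implementations, and details such as how the test size is rounded may differ.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def simple_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

train, test = simple_train_test_split(list(range(10)), test_size=0.2)
print(len(train), "training items,", len(test), "test items")

toy_accuracy = simple_accuracy(
    ["Asia", "Oceania", "Asia", "Asia"],   # true labels (toy data)
    ["Asia", "Asia", "Asia", "Asia"],      # predicted labels (toy data)
)
print(toy_accuracy)
```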

## **Training**

We will be using the logistic regression model we have learned about with one modification. We will use the "multinomial" class of logistic regression model.  This is used when there are more than 2 categories in the label set. In our case, we have ```Asia```, ```North America```, and ```Oceania``` as our possible labels.
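With more than two labels, a multinomial model produces one score per class and converts the scores into probabilities with the softmax function. The sketch below illustrates that conversion; the scores are made-up numbers, not output from the trained model.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Asia", "North America", "Oceania"]
toy_scores = [2.0, 0.5, -1.0]  # hypothetical per-class scores for one sample

probabilities = softmax(toy_scores)
predicted = classes[probabilities.index(max(probabilities))]

for name, p in zip(classes, probabilities):
    print(f"{name}: {p:.3f}")
print("Predicted:", predicted)
```

The predicted label is simply the class with the highest probability.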

### **Exercise: Train the model using the standard pipeline you have mastered!**

In [None]:
lm = linear_model.LogisticRegression(
    multi_class="multinomial", max_iter=1000,
    fit_intercept=False, tol=0.001, solver='saga', random_state=42)

# Split into training/testing set.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size = 0.2)

# Train/fit model.
lm.fit(X_train, y_train)

## **Testing/Evaluation**

In addition to printing the accuracy of a model, we can also use a *confusion matrix* to see how well the model performed.

### **Exercise: Evaluate the model on the test set.**


In [None]:
# Predict on the test set.
y_pred = lm.predict(X_test)

# Compute accuracy as a percentage.
accuracy = 100 * accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}%")

# Compute confusion matrix (rows are true labels, columns are predictions).
confusion_mat = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=lm.classes_))
confusion_mat.columns = [c + ' predicted' for c in lm.classes_]
confusion_mat.index = [c + ' true' for c in lm.classes_]

print(confusion_mat)

Accuracy: 93.51%
                    Asia predicted  North America predicted  Oceania predicted
Asia true                       16                        0                  1
North America true               2                       25                  0
Oceania true                     1                        1                 31
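The confusion matrix printed above contains everything needed to recompute the accuracy and to see how well each region is recovered. The sketch below copies the counts from that output; the per-region recall calculation is an addition for illustration.

```python
classes = ["Asia", "North America", "Oceania"]

# Rows are true labels, columns are predicted labels (copied from the output above).
confusion = [
    [16, 0, 1],
    [2, 25, 0],
    [1, 1, 31],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

# Recall: of the samples truly from a region, what fraction was predicted correctly?
recall = {name: confusion[i][i] / sum(confusion[i])
          for i, name in enumerate(classes)}

print(f"Accuracy: {100 * accuracy:.2f}%")
for name, value in recall.items():
    print(f"Recall for {name}: {value:.3f}")
```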


In [None]:
X_test

Unnamed: 0,0_A,0_T,0_G,0_C,0_-,1_A,1_T,1_G,1_C,1_-,...,29901_A,29901_T,29901_G,29901_C,29901_-,29902_A,29902_T,29902_G,29902_C,29902_-
278,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
214,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
14,1,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,1,0,0,0,0
271,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
268,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
434,0,0,0,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1
230,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
223,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,1


In [None]:
%%writefile header.py
import streamlit as st

def create_header():
    st.markdown("""
        <style>
        .header {
            display: flex;
            align-items: center;
            width: 100%;
        }
        .header img {
            animation: smokeDisappear 2s ease-in-out forwards;
        }

         @keyframes smokeDisappear {
            0% {
                opacity: 1;
                transform: scale(1);
            }
            100% {
                opacity: 0;
                transform: scale(2);
                filter: blur(10px);
            }
        }
        </style>
    """, unsafe_allow_html=True)

    st.markdown("<div class='header'>", unsafe_allow_html=True)
    st.image("/content/drive/My Drive/Inspirit AI Demonstration/Genie-ome.png", width=180)
    st.markdown("</div>", unsafe_allow_html=True)
    st.markdown("<div class='title'>Genie-ome: A Genome Geolocator Model</div>", unsafe_allow_html=True)
    st.markdown("<div class='subtitle'>By Arjun, Daren, Sarina, Pooja</div>", unsafe_allow_html=True)
    st.markdown("<div class='subtitle'>Upload a .fasta file, let's predict which region of the world it comes from</div>", unsafe_allow_html=True)
    st.markdown("</div>", unsafe_allow_html=True)


Overwriting header.py


In [None]:
!pip install Biopython



In [None]:
%%writefile userinput.py

import streamlit as st
from Bio import SeqIO
import numpy as np
from io import StringIO
import pandas as pd

def get_user_input():
    uploaded_file = st.file_uploader("Choose a .fasta file", type="fasta")
    sequence = ""
    if uploaded_file is not None:
        # Convert the uploaded file to a StringIO object
        stringio = StringIO(uploaded_file.getvalue().decode("utf-8"))
        for record in SeqIO.parse(stringio, 'fasta'):
            sequence = str(record.seq)
            st.write(sequence[:20] + "..." + sequence[-20:])
        st.success("File uploaded successfully!")
        st.markdown("</div>", unsafe_allow_html=True)
        return sequence
    else:
        st.warning("Please upload a .fasta file.")
        st.markdown("</div>", unsafe_allow_html=True)
        return None

def create_dataframe(sequence, X_test_columns):
    columns = {}
    if sequence:
        for column in X_test_columns:
            location = int(column.split('_')[0])
            base = column.split('_')[1]

            if location < len(sequence):
                if sequence[location] == base:
                    columns[column] = 1
                else:
                    columns[column] = 0
            else:
                columns[column] = 0
        df = pd.DataFrame([columns])
        return df
    else:
        return None


def pad_or_truncate_sequence(sequence, expected_length):
    if len(sequence) > expected_length:
        return sequence[:expected_length]
    elif len(sequence) < expected_length:
        return sequence + 'N' * (expected_length - len(sequence))
    else:
        return sequence

Overwriting userinput.py
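The encoding step in `create_dataframe` can be illustrated without Streamlit or pandas. The sketch below is a simplified re-implementation for explanation only: `encode_sequence` and the toy column names are hypothetical, and it returns a plain dictionary instead of a DataFrame.

```python
def encode_sequence(sequence, feature_columns):
    """Return {column: 0 or 1} for one sequence, given "<location>_<base>" columns."""
    encoded = {}
    for column in feature_columns:
        location_text, base = column.split("_")
        location = int(location_text)
        # Columns beyond the end of the sequence are encoded as 0.
        in_range = location < len(sequence)
        encoded[column] = int(in_range and sequence[location] == base)
    return encoded

# Toy column names in the same format as the training matrix (hypothetical).
toy_columns = ["0_A", "0_T", "3_C", "3_-", "9_G"]
encoded = encode_sequence("TTGC", toy_columns)

print(encoded)
```

Because the uploaded sequence is indexed by position, it needs to be aligned to the same coordinates as the training sequences for the columns to line up.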


In [None]:
%%writefile genomicbreakdown.py
import streamlit as st
import numpy as np
from Bio.Seq import Seq

def display_genomic_breakdown(sequence, original_sequence, left_col, right_col):
    base_counts = {
        'A': sequence.count('A'),
        'T': sequence.count('T'),
        'C': sequence.count('C'),
        'G': sequence.count('G'),
        'N': sequence.count('N')
    }

    with left_col:
        st.subheader("Genomic Breakdown")
        st.write(f"**Total Length:** {len(sequence)}")

        st.write(f"**A (Adenine):** {base_counts['A']} - Essential for cellular respiration and energy storage.")
        st.write(f"**T (Thymine):** {base_counts['T']} - Vital for DNA stability and structure.")
        st.write(f"**C (Cytosine):** {base_counts['C']} - Important for cell signaling and genetic regulation.")
        st.write(f"**G (Guanine):** {base_counts['G']} - Crucial for protein synthesis and enzyme function.")
        st.write(f"**N (Unknown):** {base_counts['N']} - Represents unrecognized or missing bases.")

    with right_col:
        st.subheader("Comparison to the Original Strand")
        n_bases_different = sum(1 for a, b in zip(sequence, original_sequence) if a != b)
        n_bases_same = len(sequence) - n_bases_different
        st.write("Your sequence is compared to the original strand of Severe acute respiratory syndrome coronavirus 2 (isolate Wuhan-Hu-1) from Wuhan, China. This helps identify mutations and understand how the virus has changed across locations.")
        st.write(f"1. Number of bases that differ: **{n_bases_different}**")
        st.write(f"2. Number of bases that are same: **{n_bases_same}**")
        percent_similarity = 100 * n_bases_same / len(sequence)
        st.write(f"3. Percent similarity: **{percent_similarity:.2f}%**")

        st.markdown(
            f"""
            <div class="double-helix-progress-container">
                <div class="double-helix-progress-fill" style="width: {percent_similarity}%"></div>
            </div>
            """,
            unsafe_allow_html=True
        )



Overwriting genomicbreakdown.py


In [None]:
type(X_test.columns[0])

str

In [None]:
sequences[0][location]

'A'

In [None]:
%%writefile predictor.py

def make_prediction(model, input_features):
    print(input_features)
    if len(input_features.shape) == 1:
        input_features = input_features.reshape(1, -1)
    return model.predict(input_features)

Overwriting predictor.py


In [None]:
%%writefile response.py
import streamlit as st

def get_app_response(prediction):
    st.markdown("<div class='prediction-result'>Prediction Result</div>", unsafe_allow_html=True)
    st.markdown(f"<div class='prediction-text'>{prediction[0]}</div>", unsafe_allow_html=True)
    st.markdown("</div>", unsafe_allow_html=True)


Overwriting response.py


In [None]:
from joblib import dump
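# sequence_1 holds the reference genome (sequences[0]) as an array of bases;
# the app loads it to compare uploaded sequences against the reference.
dump(lm, 'lm_model.joblib')
dump(X_test, 'X_test.joblib')
dump(sequence_1, 'original_sequence.joblib')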

['original_sequence.joblib']

In [None]:
# Replace the placeholder with your own ngrok authtoken; never share or commit a real token.
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
!pip install streamlit_extras >/dev/null

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%writefile app.py
import streamlit as st
from streamlit_extras.let_it_rain import rain
from joblib import load
from header import *
from userinput import *
from response import *
from predictor import *
from genomicbreakdown import *

# Load our logistic regression model, feature columns, and reference sequence into our web app
lm = load("lm_model.joblib")
X_test = load('X_test.joblib')
original_sequence = load('original_sequence.joblib')
st.set_page_config(layout="wide")
# Custom CSS for the entire app
st.markdown(
    """
    <style>
    @import url('https://fonts.googleapis.com/css2?family=Montserrat:wght@700&family=Poppins:wght@400;600&display=swap');

    body {
        background: url('https://www.toptal.com/designers/subtlepatterns/memphis-mini-pattern/');
        font-family: 'Poppins', sans-serif;
        background-color: #e0f7fa;
        width: 100%;
    }

    .title {
        font-family: 'Montserrat', sans-serif;
        font-size: 3em;
        color: #004d40;
        text-align: center;
        margin-bottom: 0.5em;
        animation: slideInFromLeft 1s;
    }

    .subtitle {
        font-family: 'Poppins', sans-serif;
        font-size: 1.2em;
        color: #00796b;
        text-align: center;
        margin-bottom: 1.5em;
        animation: slideInFromRight 1s;
    }

    .upload-box {
        display: flex;
        justify-content: center;
        align-items: center;
        margin-top: 1em;
        padding: 2em;
        border: 2px dashed #00796b;
        border-radius: 15px;
        background-color: #ffffff;
        box-shadow: 0 5px 10px rgba(0, 0, 0, 0.1);
        transition: transform 0.3s ease;
    }

    .upload-box:hover {
        transform: scale(1.05);
    }

    .response-box {
        background: linear-gradient(135deg, #a7ffeb, #64ffda);
        border-radius: 20px;
        padding: 30px;
        margin-top: 20px;
        box-shadow: 0 10px 20px rgba(0, 0, 0, 0.2);
        animation: fadeIn 2s;
    }

    .prediction-text {
        font-family: 'Montserrat', sans-serif;
        font-size: 2em;
        color: #004d40;
        text-align: center;
        font-weight: bold;
        animation: slideInFromRight 2s;
    }

    .prediction-result {
        font-family: 'Poppins', sans-serif;
        font-size: 1.5em;
        color: #004d40;
        text-align: center;
        animation: slideInFromLeft 1s;
    }

    .back-button {
        font-family: 'Poppins', sans-serif;
        font-size: 1.2em;
        color: #ffffff;
        background-color: #00796b;
        border: none;
        border-radius: 10px;
        padding: 10px 20px;
        cursor: pointer;
        margin-top: 20px;
        display: block;
        text-align: center;
        transition: background-color 0.3s ease;
    }

    .back-button:hover {
        background-color: #004d40;
    }

    .continent-image {
        width: 100%;
        border-radius: 10px;
        margin-top: 20px;
        box-shadow: 0 5px 15px rgba(0, 0, 0, 0.1);
    }

    .double-helix-progress-container {
        height: 20px;
        width: 100%;
        background: repeating-linear-gradient(
            -45deg,
            lightgrey,
            lightgrey 10px,
            #ffffff 10px,
            #ffffff 20px
        );
        border-radius: 10px;
        margin-top: 10px;
        overflow: hidden;
        position: relative;
    }

    .double-helix-progress-fill {
        height: 100%;
        background: repeating-linear-gradient(
            -45deg,
            #00796b,
            #00796b 10px,
            #ffffff 10px,
            #ffffff 20px
        );
        position: absolute;
        top: 0;
        left: 0;
        border-radius: 10px;
    }

    .col-container {
        display: flex;
        justify-content: space-between;
        width: 90%;
        margin: 0 auto;
    }

    .col1, .col2 {
        flex: 1;
        padding: 0 2%;
        justify-content: center;
    }

    @keyframes fadeIn {
        from { opacity: 0; }
        to { opacity: 1; }
    }

    @keyframes slideInFromLeft {
        from { transform: translateX(-100%); opacity: 0; }
        to { transform: translateX(0); opacity: 1; }
    }

    @keyframes slideInFromRight {
        from { transform: translateX(100%); opacity: 0; }
        to { transform: translateX(0); opacity: 1; }
    }

    @keyframes pulse {
        0% { transform: scale(1); }
        50% { transform: scale(1.05); }
        100% { transform: scale(1); }
    }
    </style>
    """,
    unsafe_allow_html=True
)

if 'show_upload' not in st.session_state:
    st.session_state.show_upload = True

def show_results():
    st.session_state.show_upload = False

def show_upload():
    st.session_state.show_upload = True

if st.session_state.show_upload:
    create_header()
    sequence = get_user_input()
    st.image("/content/drive/My Drive/Inspirit AI Demonstration/COVID World Map.png", use_column_width=True)
    st.write("Credit: Bloomberg")
    if sequence is not None:
        input_features = create_dataframe(sequence, X_test.columns)
        if input_features is not None and input_features.size > 0:
            st.session_state.sequence = sequence
            st.session_state.input_features = input_features
            show_results()
            st.rerun()
else:
    st.button('Back', on_click=show_upload)
    st.markdown("<div class='title'>Genie-ome Results</div>", unsafe_allow_html=True)
    st.markdown("<style>div.row-widget.stRadio > div{flex-direction:row;}</style>", unsafe_allow_html=True)
    rain(emoji="🦠😷🧑‍🔬",font_size=35,falling_speed=1,animation_length=3,)
    st.markdown('<div class="col-container">', unsafe_allow_html=True)
    col1, col2 = st.columns([1, 1], gap="large")
    with col1:
        prediction = make_prediction(lm, st.session_state.input_features)
        st.markdown('<div class="prediction-text">Prediction Result</div>', unsafe_allow_html=True)
        st.markdown(f'<div class="prediction-result">{prediction[0]}</div>', unsafe_allow_html=True)

        if prediction[0] == "Oceania":
            col_center = st.columns([1, 2, 1])[1]
            with col_center:
                st.image("/content/drive/My Drive/Inspirit AI Demonstration/Oceania.png", use_column_width=True)
        elif prediction[0] == "Asia":
            col_center = st.columns([1, 2, 1])[1]
            with col_center:
                st.image("/content/drive/My Drive/Inspirit AI Demonstration/Asia.png", use_column_width=True)
        elif prediction[0] == "North America":
            col_center = st.columns([1, 2, 1])[1]
            with col_center:
                st.image("/content/drive/My Drive/Inspirit AI Demonstration/North America.png", use_column_width=True)


    with col2:
        col3, col4 = st.columns([1, 1])
        display_genomic_breakdown(st.session_state.sequence, original_sequence, col3, col4)
    st.markdown('</div>', unsafe_allow_html=True)

Overwriting app.py


In [None]:
import pandas as pd
import os
from joblib import dump, load

import warnings
warnings.filterwarnings("ignore")
!pip -q install streamlit
!pip -q install pyngrok
from pyngrok import ngrok

def launch_website():
  print ("Click this link to try your web app:")
  public_url = ngrok.connect()
  print (public_url)
  !streamlit run --server.port 80 app.py >/dev/null

In [None]:
launch_website()

Click this link to try your web app:
NgrokTunnel: "https://2488-104-196-68-249.ngrok-free.app" -> "http://localhost:80"
2024-06-02 19:29:31.306 Please replace `st.experimental_rerun` with `st.rerun`.

`st.experimental_rerun` will be removed after 2024-04-01.

# **Wrapping Up!**

***Great job!*** You built a model that uses genomic data to predict which region of the world a SARS-CoV-2 sample comes from, reaching about 94% accuracy on the test set.