
To find the genome sequences most closely related to the reference genome (NC_045512.2, Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1), and potentially to identify a subsequence that an antiviral drug could target, we can use the bit scores from the alignment hit table as a measure of similarity.

A higher bit score indicates a stronger alignment, so sequences with higher scores are more closely related to the reference genome, while lower scores indicate less similarity. Note that the bit score also grows with alignment length, so a short but perfect match can still score low.

To leverage this information, we can focus on the sequences with the highest bit scores. These sequences are more likely to share a closer evolutionary relationship with the reference genome and are therefore potential candidates for further analysis and drug targeting.

Conversely, sequences with the lowest bit scores may represent more divergent or distinct strains, and a drug designed against the reference sequence may be less effective against them.
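
The ranking idea described above can be sketched with pandas. This is a minimal illustration rather than the notebook's own code: the five rows are copied from the hit table preview shown further down, and the names `hits`, `closest`, and `most_divergent` are hypothetical.

```python
import pandas as pd

# Rows copied from the first five hits of the hit table used in this notebook.
hits = pd.DataFrame({
    "subject acc.ver": ["MN997409.1", "MT020881.1", "MT020880.1",
                        "MN985325.1", "MN975262.1"],
    "% identity": [100.0, 99.99, 99.99, 99.99, 99.99],
    "bit score": [55182, 55166, 55166, 55166, 55166],
})

# Highest bit scores: the hits most similar to the query sequence.
closest = hits.nlargest(3, "bit score")

# Lowest bit scores: the most divergent hits in this (tiny) sample.
most_divergent = hits.nsmallest(1, "bit score")

print(closest[["subject acc.ver", "bit score"]])
print(most_divergent[["subject acc.ver", "bit score"]])
```

In this sample the top hit is the query aligned to itself (MN997409.1 against MN997409.1), so in practice it may make sense to drop rows where the query and subject accessions are equal before ranking.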

In [1]:
### Importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [2]:
### Read in the data. We'll store it in a DataFrame called 'df'.

df = pd.read_csv('/content/Alignment-HitTable.csv', header=None)
df.columns = ['query acc.ver', 'subject acc.ver', '% identity', 'alignment length', 'mismatches',
              'gap opens', 'q. start', 'q. end', 's. start', 's. end', 'evalue', 'bit score']
df.head()

| | query acc.ver | subject acc.ver | % identity | alignment length | mismatches | gap opens | q. start | q. end | s. start | s. end | evalue | bit score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MN997409.1 | MN997409.1 | 100.0 | 29882 | 0 | 0 | 1 | 29882 | 1 | 29882 | 0.0 | 55182 |
| 1 | MN997409.1 | MT020881.1 | 99.99 | 29882 | 3 | 0 | 1 | 29882 | 1 | 29882 | 0.0 | 55166 |
| 2 | MN997409.1 | MT020880.1 | 99.99 | 29882 | 3 | 0 | 1 | 29882 | 1 | 29882 | 0.0 | 55166 |
| 3 | MN997409.1 | MN985325.1 | 99.99 | 29882 | 3 | 0 | 1 | 29882 | 1 | 29882 | 0.0 | 55166 |
| 4 | MN997409.1 | MN975262.1 | 99.99 | 29882 | 3 | 0 | 1 | 29882 | 1 | 29882 | 0.0 | 55166 |


In [3]:
## In this final notebook, we'll predict 'bit score' from some of the other columns in the data.
## Create a feature DataFrame called 'X' with the columns:
## ['% identity', 'mismatches', 'gap opens', 'q. start', 's. start'].
## Store the bit scores in the target 'y'.

X = df[['% identity', 'mismatches', 'gap opens', 'q. start', 's. start']]
y = df['bit score']

## Use StandardScaler from sklearn.preprocessing to standardize the features
## (zero mean, unit variance per column).
## Store the transformed X data in a variable called 'X_transformed'.
ss = StandardScaler()
X_transformed = ss.fit_transform(X)
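
To make the scaling step concrete, the sketch below checks that `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The small `toy` array is made-up illustrative data, not values from the hit table, and the names `toy`, `scaled`, and `manual` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up illustrative values (two feature columns), NOT taken from the hit table.
toy = np.array([[100.0, 0.0],
                [99.5, 3.0],
                [98.0, 12.0],
                [90.0, 60.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual equivalent: z = (x - column mean) / column standard deviation (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))    # True when the two computations agree
print(scaled.mean(axis=0).round(6))   # per-column mean after scaling
print(scaled.std(axis=0).round(6))    # per-column standard deviation after scaling
```

For ordinary least squares with an intercept, standardizing the features does not change the predictions or the R² score. Its main benefit here is that the fitted coefficients become comparable across features measured on different scales.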

In [4]:
## Predict the bit score by fitting a linear regression model. Store the model's R^2 score in
## a variable called 'lin_score' and the predicted bit scores in a variable called 'lin_pred'.
## Store the linear regression coefficients in a variable called 'coef'.

linreg = LinearRegression()
linreg.fit(X_transformed, y)
lin_score = linreg.score(X_transformed, y)
lin_pred = linreg.predict(X_transformed)
coef = linreg.coef_
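
The cell above scores and predicts on the same rows the model was fitted to, so the score describes goodness of fit rather than predictive ability. A hedged sketch of a held-out evaluation follows. It runs on synthetic data generated inside the snippet: the feature names mirror three of the notebook's columns, but the numbers and the linear relationship are invented for illustration. The use of `train_test_split`, `make_pipeline`, and `random_state=0` is a choice made for this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: invented numbers, NOT the real hit table.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "% identity": rng.uniform(80, 100, n),
    "mismatches": rng.integers(0, 50, n),
    "gap opens": rng.integers(0, 5, n),
})
# Invented linear relationship plus noise, only so the example has a target.
y_demo = (500 * X_demo["% identity"] - 20 * X_demo["mismatches"]
          - 100 * X_demo["gap opens"] + rng.normal(0, 50, n))

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training rows only, then the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)  # R^2 on rows the model never saw

# Pair each coefficient with its feature name for easier reading.
fitted = model.named_steps["linearregression"]
coef_by_feature = dict(zip(X_demo.columns, fitted.coef_))
print(train_r2, test_r2)
print(coef_by_feature)
```

Wrapping the scaler and the regression in one pipeline keeps the test rows out of the scaler's mean and variance estimates. Applied to the notebook's data, the same pattern would take `X` and `y` in place of `X_demo` and `y_demo`.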

In [5]:
lin_pred

array([49636.55233745, 49588.64767161, 49588.64767161, 49588.64767161,
       49588.64767161, 49559.15766986, 49573.46264729, 49573.46264729,
       49573.46264729, 49573.46264729, 49573.46264729, 49543.97264554,
       49543.97264554, 49558.57097193, 49558.57097193, 49555.92803009,
       49555.92803009, 49603.83269593, 49540.74300576, 49540.74300576,
       49525.55798144, 49525.55798144, 49525.55798144, 49525.55798144,
       49525.55798144, 49379.86020808, 49477.65331559, 49492.83833992,
       49753.08673984, 49463.34833817, 49540.74300576, 49477.65331559,
       49275.62914013, 49427.39905687, 49364.97919228, 49168.45112182,
       49347.15122611, 49144.83370168, 49362.04290147, 49666.32646219,
       49528.78762121, 49049.31771896, 49824.45490142, 49528.7551882 ,
       48792.44958797, 36498.1816252 , 33886.59064785, 20530.74441579,
        8046.27178069, 10757.14286689,  8028.89015196, 32019.62703583,
       17893.6270238 , -5388.20141405,  7722.01096679,  4631.63446961,
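
A quick check on the output above: comparing the first five predictions with the first five actual bit scores from the hit table preview shows how far off the fit is for the best hits, and scanning for values below zero flags predictions that fall outside the range of bit scores seen in the preview. The sketch assumes the prediction array is in the same row order as `df`, which holds because `lin_pred` was produced from `X_transformed` without shuffling. The names `actual_head`, `pred_head`, and `shown_tail` are hypothetical.

```python
import numpy as np

# First five actual bit scores, copied from the df.head() preview.
actual_head = np.array([55182, 55166, 55166, 55166, 55166])
# First five predictions, copied from the lin_pred output.
pred_head = np.array([49636.55233745, 49588.64767161, 49588.64767161,
                      49588.64767161, 49588.64767161])

residuals = actual_head - pred_head   # positive values mean the model under-predicts
mae = np.abs(residuals).mean()        # mean absolute error on these five rows
print(residuals.round(2), round(float(mae), 2))

# Four consecutive later predictions, copied from the lin_pred output.
shown_tail = np.array([17893.6270238, -5388.20141405,
                       7722.01096679, 4631.63446961])
print(shown_tail[shown_tail < 0])     # predictions below zero
```

Under-prediction of several thousand on the top hits, together with a negative prediction further down, suggests that a plain linear model on these five features is a rough fit at best.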
      

By analyzing the bit scores, we can identify sequences that are highly similar to the reference genome, which helps trace closely related genomes and may point to a subsequence that an antiviral drug could target. However, further analysis, such as phylogenetic studies and experimental validation, would be needed to confirm these findings and to assess whether a particular sequence is a suitable drug target.