
# Install the biopython library
Biopython is a collection of tools and libraries for computational biology and bioinformatics. It provides functionality to work with biological data in Python.

Biopython simplifies complex biological computations, enabling researchers and students to perform bioinformatics tasks efficiently using Python.

In [None]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85



### Bio and SeqIO

#### Bio
`Bio` is the top-level package of Biopython. It contains the submodules and tools that handle a wide range of biological data and analyses.

#### SeqIO
The `SeqIO` submodule within Biopython is specifically used for reading and writing sequence data in different formats. It provides a straightforward interface to parse sequence files (such as FASTA, GenBank, etc.) and to output sequences in these formats.

#### Key Uses of SeqIO
- **Reading Sequences**: Easily read sequences from files and handle them as Python objects.
- **Writing Sequences**: Write sequence data to files in various formats.
- **Format Conversion**: Convert sequence data between different file formats.
- **Iterating Over Sequences**: Efficiently process large sequence datasets by iterating through sequences in a file.


In [None]:
from Bio import SeqIO

### Code Description

The following code uses the `SeqIO` module from Biopython to read each record from a FASTA file and print its identifier, its sequence, and its length.

The file `example.fa` is a sample FASTA file used for illustration.


In [None]:
for sequence in SeqIO.parse('/content/example.fa', "fasta"):
  print(sequence.id)
  print(sequence.seq)
  print(len(sequence))

ENST00000435737.5
ATGTTTCGCATCACCAACATTGAGTTTCTTCCCGAATACCGACAAAAGGAGTCCAGGGAATTTCTTTCAGTGTCACGGACTGTGCAGCAAGTGATAAACCTGGTTTATACAACATCTGCCTTCTCCAAATTTTATGAGCAGTCTGTTGTTGCAGATGTCAGCAACAACAAAGGCGGCCTCCTTGTCCACTTTTGGATTGTTTTTGTCATGCCACGTGCCAAAGGCCACATCTTCTGTGAAGACTGTGTTGCCGCCATCTTGAAGGACTCCATCCAGACAAGCATCATAAACCGGACCTCTGTGGGGAGCTTGCAGGGACTGGCTGTGGACATGGACTCTGTGGTACTAAATGAAGTCCTGGGGCTGACTCTCATTGTCTGGATTGACTGA
390
ENST00000419127.5
ATGTTTCGCATCACCAACATTGAGTTTCTTCCCGAATACCGACAAAAGGAGTCCAGGGAATTTCTTTCAGTGTCACGGACTGTGCAGCAAGTGATAAACCTGGTTTATACAACATCTGCCTTCTCCAAATTTTATGAGCAGTCTGTTGTTGCAGATGTCAGCAACAACAAAGGCGGCCTCCTTGTCCACTTTTGGATTGTTTTTGTCATGCCACGTGCCAAAGGCCACATCTTCTGTGAAGACTGTGTTGCCGCCATCTTGAAGGACTCCATCCAGACAAGCATCATAAACCGGACCTCTGTGGGGAGCTTGCAGGGACTGGCTGTGGACATGGACTCTGTGGTACTAAATGACAAAGGCTGCTCTCAGTACTTCTATGCAGAGCATCTGTCTCTCCACTACCCGCTGGAGATTTCTGCAGCCTCAGGGAGGCTGATGTGTCACTTCAAGCTGGTGGCCATAGTGGGCTACCTGATTCGTCTCTCAATCAAGTCCATCCAAATCGAAGCCGACAACTGTGTCACTGACTCCCTGACCATTTACGACTCCCTTTTGCCCATCCGGAGCAG

### Breakdown of Output Components

- **Sequence ID**: The first line of each block represents the identifier of the sequence, such as `ENST00000435737.5` and `ENST00000419127.5`.
- **Sequence**: The second line is the actual nucleotide sequence, represented by letters (A, T, C, G).
- **Sequence Length**: The third line shows the length of the sequence, indicating the number of nucleotides.

### Purpose of Each Component

- **Sequence ID**: Identifies each unique sequence in the FASTA file, which is essential for referencing and further analysis.
- **Sequence**: Provides the nucleotide composition of the DNA or RNA sequence, crucial for biological studies and computational analysis.
- **Sequence Length**: Indicates the size of each sequence, useful for understanding the complexity and for size comparisons between sequences.
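
To make the three components concrete, here is a minimal sketch of how a FASTA file is laid out and read: a header line starting with `>` carries the identifier, and the lines that follow carry the sequence. The `parse_fasta` helper and the two toy records are assumptions made up for this illustration; this is not how `SeqIO.parse` is implemented, and it ignores most of what Biopython handles.

```python
from io import StringIO

def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text (hypothetical helper)."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first whitespace-delimited word after '>'.
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if identifier is not None:
        yield identifier, "".join(chunks)

# Made-up two-record FASTA text, not taken from example.fa.
fasta_text = """>seq1 first toy record
ATGTTTCG
CATCACC
>seq2 second toy record
ATGGA
"""

records = list(parse_fasta(StringIO(fasta_text)))
for identifier, sequence in records:
    print(identifier, sequence, len(sequence))
```

For real data, `SeqIO.parse` remains the better choice, since it returns full record objects and supports many formats.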


### Introduction to Sequence Data Preprocessing with Python

In this section, we preprocess biological sequence data using Python. The goal is to convert a nucleotide sequence string into a NumPy array and prepare it for further analysis. We also create a label encoder that maps the nucleotide characters to numerical values.


In [None]:
import numpy as np
import re

### Define the `string_to_array` Function

- **Input**: This function takes a nucleotide sequence string as input.
- **Lowercase Conversion**: It converts the string to lowercase.
- **Character Replacement**: It replaces any non-nucleotide character (anything other than 'a', 'c', 'g', 't') with the placeholder 'z', the same placeholder the label encoder is fitted on.
- **NumPy Array Conversion**: It converts the cleaned string into a NumPy array of single characters.


In [None]:
def string_to_array(seq_string):
   seq_string = seq_string.lower()                  # normalize case
   seq_string = re.sub('[^acgt]', 'z', seq_string)  # placeholder must match the label encoder
   seq_string = np.array(list(seq_string))          # one character per element
   return seq_string
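
The cleaning step matters most for sequences that contain ambiguity codes or gaps. The sketch below runs the same three steps on such a sequence. The name `clean_to_array` and its `placeholder` parameter are hypothetical and exist only for this example; whichever placeholder is used must be a character the label encoder was fitted on, otherwise `transform` fails later on an unseen label.

```python
import re
import numpy as np

def clean_to_array(seq_string, placeholder="z"):
    # Hypothetical variant of string_to_array with a configurable placeholder.
    seq_string = seq_string.lower()
    seq_string = re.sub("[^acgt]", placeholder, seq_string)
    return np.array(list(seq_string))

# 'N', 'R', 'Y' and '-' are not plain nucleotides, so each becomes the placeholder.
mixed = "ACGTNRY-acgt"
cleaned = clean_to_array(mixed)
print(cleaned)
```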

### Create a Label Encoder

We use `LabelEncoder` from the `sklearn.preprocessing` module to encode the nucleotide characters as integers.
The encoder is fitted on the characters 'a', 'c', 'g', 't', plus a placeholder 'z' that stands for any other character. Because `LabelEncoder` assigns codes in sorted order, this gives a=0, c=1, g=2, t=3, z=4.

In [None]:
# create a label encoder with 'acgtz' alphabet ('z' is the placeholder for any other character)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','c','g','t','z']))
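
The choice of placeholder affects the integer codes, because the labels are numbered in sorted order. The following NumPy-only sketch reproduces that mapping with `np.unique` and `np.searchsorted`; it is an illustration of the idea under that assumption, not scikit-learn's own code. It also shows what would change if 'n' were used instead of 'z'.

```python
import numpy as np

# Sorted unique labels play the role of the encoder's fitted classes.
classes = np.unique(np.array(["a", "c", "g", "t", "z"]))
codes = {str(label): int(i) for i, label in enumerate(classes)}
print(codes)

# Encoding a sequence is a lookup of each character's position in `classes`.
encoded = np.searchsorted(classes, np.array(list("gattaca")))
print(encoded)

# With 'n' as the placeholder, 'n' sorts between 'g' and 't',
# so 't' would no longer receive the code 3.
with_n = np.unique(np.array(["a", "c", "g", "t", "n"]))
print(with_n)
```

Keeping the placeholder last in sort order means the four nucleotides keep the codes 0 to 3.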

### Ordinal Encoder Function

The `ordinal_encoder` function transforms an array of nucleotide characters into a numerical format suitable for machine learning models. Here's a step-by-step description of how the function works:

1. **Input**:
   - The function takes an array of nucleotide characters (`my_array`) as input.

2. **Integer Encoding**:
   - The nucleotide characters are first transformed into integer values using a pre-defined label encoder (`label_encoder`).
   - The `label_encoder.transform(my_array)` method converts the characters into integers based on the mapping defined during the label encoder creation.

3. **Float Encoding**:
   - The integer-encoded array is then converted to float type using `astype(float)`.

4. **Custom Encoding Values**:
   - Each integer code is then replaced with a fixed float value between 0 and 1:
     - `0` is replaced with `0.25` for 'A'
     - `1` is replaced with `0.50` for 'C'
     - `2` is replaced with `0.75` for 'G'
     - `3` is replaced with `1.00` for 'T'
     - `4` is replaced with `0.00` for any other character (the placeholder 'z')

5. **Return**:
   - The function returns the float-encoded array.



In [None]:
def ordinal_encoder(my_array):
   integer_encoded = label_encoder.transform(my_array)
   float_encoded = integer_encoded.astype(float)
   float_encoded[float_encoded == 0] = 0.25 # A
   float_encoded[float_encoded == 1] = 0.50 # C
   float_encoded[float_encoded == 2] = 0.75 # G
   float_encoded[float_encoded == 3] = 1.00 # T
   float_encoded[float_encoded == 4] = 0.00 # anything else (placeholder 'z')
   return float_encoded


In [None]:
seq_test = 'TTCAGCCAGTG'
print(string_to_array(seq_test))
ordinal_encoder(string_to_array(seq_test))


['t' 't' 'c' 'a' 'g' 'c' 'c' 'a' 'g' 't' 'g']


array([1.  , 1.  , 0.5 , 0.25, 0.75, 0.5 , 0.5 , 0.25, 0.75, 1.  , 0.75])
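
The same character-to-float mapping can also be written as a single dictionary lookup, which avoids the chain of in-place replacements. This is a sketch of an alternative, not the notebook's function: the names `NUCLEOTIDE_VALUES` and `ordinal_encode_lookup` are made up for the example, and it works directly on the string without a label encoder.

```python
import numpy as np

# Hypothetical lookup table mirroring the values used by ordinal_encoder.
NUCLEOTIDE_VALUES = {"a": 0.25, "c": 0.50, "g": 0.75, "t": 1.00}

def ordinal_encode_lookup(seq_string, default=0.0):
    # Any character missing from the table gets `default`.
    return np.array([NUCLEOTIDE_VALUES.get(ch, default) for ch in seq_string.lower()])

encoded = ordinal_encode_lookup("TTCAGCCAGTG")
print(encoded)
```

For the test sequence `'TTCAGCCAGTG'` this yields the same values as the output shown above.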

### Difference Between Ordinal Encoding and One-Hot Encoding

#### Ordinal Encoding

The `ordinal_encoder` function converts nucleotide sequences into numerical values that represent each nucleotide with a specific float value. Here’s a summary of its key characteristics:

- **Method**:
  - Converts characters to integers using a label encoder.
  - Maps these integers to fixed float values, one for each nucleotide ('A', 'C', 'G', 'T') and one for the placeholder that stands for any other character.

- **Output**:
  - Produces a one-dimensional array of floats.
  - Encodes nucleotides as 0.25 for 'A', 0.50 for 'C', 0.75 for 'G', 1.00 for 'T', and 0.00 for any other character.

In [None]:
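from sklearn.preprocessing import OneHotEncoder
def one_hot_encoder(seq_string):
   # integer-encode the characters with the label encoder fitted above (a=0, c=1, g=2, t=3, z=4)
   int_encoded = label_encoder.transform(seq_string)
   # fix the categories to all five codes so the columns do not depend on which characters occur
   all_codes = np.arange(len(label_encoder.classes_))
   onehot_encoder = OneHotEncoder(categories=[all_codes], sparse_output=False, dtype=int)
   int_encoded = int_encoded.reshape(len(int_encoded), 1)
   onehot_encoded = onehot_encoder.fit_transform(int_encoded)
   # drop the last column (placeholder 'z'), leaving four columns for a, c, g, t
   onehot_encoded = np.delete(onehot_encoded, -1, 1)
   return onehot_encoded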


In [None]:
seq_test = 'GAATTCTCGA'
one_hot_encoder(string_to_array(seq_test))


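Note: a `TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'sparse'` is raised when the older `sparse=False` argument is passed to a recent scikit-learn release, where the parameter is named `sparse_output`, as used above.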
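
For contrast with ordinal encoding, one-hot encoding represents each nucleotide as a row of four binary values with a single 1, so the result is a two-dimensional array instead of a one-dimensional one. The sketch below shows the layout that this encoding aims for, using NumPy only. The name `one_hot_sketch` and the column order a, c, g, t are assumptions for this illustration, and characters outside that alphabet are assumed to map to an all-zero row.

```python
import numpy as np

ALPHABET = "acgt"  # assumed column order: a, c, g, t

def one_hot_sketch(seq_string):
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq_string), len(ALPHABET)), dtype=int)
    for row, ch in enumerate(seq_string.lower()):
        col = index.get(ch)
        if col is not None:  # unknown characters keep an all-zero row
            encoded[row, col] = 1
    return encoded

one_hot = one_hot_sketch("GAATTCTCGA")
print(one_hot)
```

Unlike the ordinal values, these rows impose no ordering on the nucleotides: every pair of distinct nucleotides is equally far apart.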

### Explanation of `Kmers_funct` Code

The `Kmers_funct` function is designed to generate k-mers from a given nucleotide sequence. A k-mer is a substring of length `k` from a sequence, commonly used in bioinformatics for sequence analysis. Here’s a detailed breakdown of the function:

- **List Comprehension**:
  - The function uses a list comprehension to generate the list of k-mers.
  - It iterates over `range(len(seq) - size + 1)`, that is, the sequence length minus the k-mer size plus one.

- **Substring Extraction**:
  - For each position `x` in the range, it extracts the substring of length `size` starting at position `x` (`seq[x:x+size]`).
  - The substring is converted to lowercase using the `lower()` method.

In [None]:
def Kmers_funct(seq, size):
   return [seq[x:x+size].lower() for x in range(len(seq) - size + 1)]

In [None]:
mySeq = 'GTGCCCAGGTTCAGTGAGTGACACAGGCAG'
Kmers_funct(mySeq, size=4)

['gtgc',
 'tgcc',
 'gccc',
 'ccca',
 'ccag',
 'cagg',
 'aggt',
 'ggtt',
 'gttc',
 'ttca',
 'tcag',
 'cagt',
 'agtg',
 'gtga',
 'tgag',
 'gagt',
 'agtg',
 'gtga',
 'tgac',
 'gaca',
 'acac',
 'caca',
 'acag',
 'cagg',
 'aggc',
 'ggca',
 'gcag']
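
A sequence of length `n` yields `n - k + 1` overlapping k-mers, and none once `k` exceeds `n`. The sketch below checks this for the 30-nucleotide sequence used above and lists the 4-mers that occur more than once. The function is restated under the hypothetical name `kmers` so that the example runs on its own.

```python
from collections import Counter

def kmers(seq, size):
    # Same logic as Kmers_funct, restated so this example is self-contained.
    return [seq[x:x + size].lower() for x in range(len(seq) - size + 1)]

seq = "GTGCCCAGGTTCAGTGAGTGACACAGGCAG"
counts_by_k = {k: len(kmers(seq, k)) for k in (4, 6, 30, 31)}
print(counts_by_k)

# k-mers overlap, so the same k-mer can appear at several positions.
repeated = {kmer: n for kmer, n in Counter(kmers(seq, 4)).items() if n > 1}
print(repeated)
```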

### Explanation of Code for Generating and Joining K-mers

This code snippet generates k-mers from a nucleotide sequence and joins them into a single space-separated string, so that the sequence can be treated like a sentence of words:

- **Function Call**: `Kmers_funct` is called with `mySeq` as the input sequence and `size=4`, producing a list of 4-mers (substrings of length 4) that is stored in the variable `words`.
- **Joining**: `' '.join(words)` concatenates the 4-mers with single spaces and stores the result in `joined_sentence`.


In [None]:
words = Kmers_funct(mySeq, size=4)
joined_sentence = ' '.join(words)
joined_sentence

'gtgc tgcc gccc ccca ccag cagg aggt ggtt gttc ttca tcag cagt agtg gtga tgag gagt agtg gtga tgac gaca acac caca acag cagg aggc ggca gcag'

In [None]:
mySeq1 = 'TCTCACACATGTGCCAATCACTGTCACCC'
mySeq2 = 'GTGCCCAGGTTCAGTGAGTGACACAGGCAG'
sentence1 = ' '.join(Kmers_funct(mySeq1, size=4))
sentence2 = ' '.join(Kmers_funct(mySeq2, size=4))

In [None]:
sentence1

'tctc ctca tcac caca acac caca acat catg atgt tgtg gtgc tgcc gcca ccaa caat aatc atca tcac cact actg ctgt tgtc gtca tcac cacc accc'

In [None]:
sentence2

'gtgc tgcc gccc ccca ccag cagg aggt ggtt gttc ttca tcag cagt agtg gtga tgag gagt agtg gtga tgac gaca acac caca acag cagg aggc ggca gcag'

### Steps in Text Vectorization

- **Tokenization**:
  - The `fit_transform` method of `CountVectorizer` tokenizes each input string (splitting it into individual words, here the k-mers) and builds a vocabulary of all distinct tokens across the input documents.

- **Count Matrix**:
  - It then transforms the input text into a matrix where each row corresponds to a document and each column corresponds to a token from the vocabulary. The values in the matrix represent the count of each token in the respective document.

- **Conversion to Array**:
  - The resulting sparse matrix is converted to a dense numpy array using the `toarray()` method and stored in the variable `X`.
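
The steps above can be sketched in plain Python to show what the count matrix contains. The `count_matrix` helper and the two short toy documents are made up for this illustration, and it splits on whitespace only. `CountVectorizer` applies its own tokenization rules (by default it lowercases the text and ignores single-character tokens), which give the same tokens for space-separated 4-mers.

```python
from collections import Counter

def count_matrix(documents):
    # Tokenization: split each document into words.
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct token, in sorted order (one column each).
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    # Count matrix: one row per document, one count per vocabulary token.
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows

# Two toy "sentences" of 4-mers.
docs = ["gtgc tgcc gccc tgcc", "tgcc gccc ccca"]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
print(rows)
```

Each column corresponds to one k-mer, so two sequences that share many k-mers have similar rows.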


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform([joined_sentence, sentence1, sentence2]).toarray()
print(X)

[[0 1 1 0 0 0 1 1 2 0 0 0 1 0 0 2 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 2 1 1 0 1
  0 1 1 1 0 0 1]
 [1 1 0 1 1 1 0 0 0 1 1 1 2 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 3 0
  1 0 0 1 1 1 0]
 [0 1 1 0 0 0 1 1 2 0 0 0 1 0 0 2 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 2 1 1 0 1
  0 1 1 1 0 0 1]]


In [None]:
# Get the feature names (tokens)
tokens = cv.get_feature_names_out()

# Print the number of tokens
print(f"Number of tokens: {len(tokens)}")

# Print the tokens
print("Tokens:")
for token in tokens:
    print(token)

# Optionally, print the count matrix for verification
print("Count Matrix:")
print(X)

Number of tokens: 43
Tokens:
aatc
acac
acag
acat
accc
actg
aggc
aggt
agtg
atca
atgt
caat
caca
cacc
cact
cagg
cagt
catg
ccaa
ccag
ccca
ctca
ctgt
gaca
gagt
gcag
gcca
gccc
ggca
ggtt
gtca
gtga
gtgc
gttc
tcac
tcag
tctc
tgac
tgag
tgcc
tgtc
tgtg
ttca
Count Matrix:
[[0 1 1 0 0 0 1 1 2 0 0 0 1 0 0 2 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 2 1 1 0 1
  0 1 1 1 0 0 1]
 [1 1 0 1 1 1 0 0 0 1 1 1 2 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 3 0
  1 0 0 1 1 1 0]
 [0 1 1 0 0 0 1 1 2 0 0 0 1 0 0 2 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 2 1 1 0 1
  0 1 1 1 0 0 1]]
