# One-Hot Encoding of DNA Sequences for AI in Drug Development

This notebook explains **why and how** categorical biological sequences are converted into numerical representations suitable for machine learning models.

## Conceptual Overview

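DNA nucleotides (A, C, G, T) are categorical symbols with no inherent numerical meaning. Mapping them to integers such as A=0, C=1, G=2, T=3 would imply an ordering and unequal distances between bases, which a model could mistake for signal. One-hot encoding instead converts each symbol into a binary vector containing a single 1, so no artificial ordering or distance is introduced.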
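
The ordering and distance argument can be checked numerically. The sketch below is illustrative only: the integer labels in `integer_code` are a hypothetical choice, not part of this notebook's pipeline, and the one-hot vectors are built by hand from an identity matrix.

```python
import numpy as np

# Hypothetical integer labels, chosen only to illustrate the problem
integer_code = {"A": 0, "C": 1, "G": 2, "T": 3}

# One-hot vectors: row i of the 4x4 identity matrix represents base i
bases = ["A", "C", "G", "T"]
identity = np.eye(4)
one_hot_code = {base: identity[i] for i, base in enumerate(bases)}

# Integer labels make A look three times farther from T than from C
int_dist_ac = abs(integer_code["A"] - integer_code["C"])
int_dist_at = abs(integer_code["A"] - integer_code["T"])
print("Integer distances  A-C:", int_dist_ac, " A-T:", int_dist_at)

# One-hot vectors place every pair of distinct bases at the same distance
dist_ac = np.linalg.norm(one_hot_code["A"] - one_hot_code["C"])
dist_at = np.linalg.norm(one_hot_code["A"] - one_hot_code["T"])
print("One-hot distances  A-C:", dist_ac, " A-T:", dist_at)
```

With one-hot vectors, the Euclidean distance between any two different bases is the square root of 2, so no pair is treated as more similar than another.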

## Code: One-Hot Encoding with Detailed Syntax Explanation

In [None]:
# Import NumPy for numerical array handling
# NumPy provides efficient storage and manipulation of numeric data
import numpy as np

# Import OneHotEncoder from scikit-learn
# OneHotEncoder converts categorical values into binary indicator vectors
from sklearn.preprocessing import OneHotEncoder

# -------------------------------
# Define the DNA sequence
# -------------------------------
# list("ATGC") splits the string into individual characters
# np.array(...) converts the list into a NumPy array
# reshape(-1, 1) converts the array into a column vector:
#   - '-1' lets NumPy infer the number of rows automatically
#   - '1' specifies one column (one feature); OneHotEncoder expects
#     2-D input of shape (n_samples, n_features)
sequence = np.array(list("ATGC")).reshape(-1, 1)

# -------------------------------
# Initialize the OneHotEncoder
# -------------------------------
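# categories explicitly defines the allowed symbols and their column order
# Without it, the encoder infers the categories from the data in sorted order,
# so a sequence that lacks one base would produce fewer than four columns
# sparse_output=False returns a dense NumPy array instead of a SciPy sparse matrix
# (the parameter is named sparse_output in scikit-learn >= 1.2; older releases use sparse)
# Dense arrays are easier to inspect and visualize for learning purposes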
encoder = OneHotEncoder(
    categories=[['A', 'C', 'G', 'T']],
    sparse_output=False
)

# -------------------------------
# Fit and transform the data
# -------------------------------
# fit_transform performs two steps:
#   1) fit(): learns the mapping from symbols to columns
#   2) transform(): applies the binary encoding to the data
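# Each nucleotide becomes a 4-dimensional binary vector, so the result has shape (4, 4)
# Rows follow the sequence order (A, T, G, C); columns follow the category order (A, C, G, T)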
one_hot_matrix = encoder.fit_transform(sequence)

# -------------------------------
# Display the results
# -------------------------------
# Print the column legend first, then the matrix (one row per nucleotide)
print("--- Sequence: ATGC ---")
print("Columns: [A, C, G, T]")
print(one_hot_matrix)