# Conceptual and Statistical Introduction

## Why categorical data is a problem in statistics

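Categorical variables such as DNA bases (A, T, G, C) have no natural ordering and no numerical distance between them. Assigning integer codes directly (for example A=0, C=1, G=2, T=3) imposes an order and a set of distances that have no biological meaning: a model would treat A as closer to C than to T purely because of the labels.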
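
To make the problem concrete, here is a minimal sketch of the distances an integer encoding implies. The specific codes in `integer_codes` are an arbitrary assumption chosen for illustration; any other assignment would create a different, equally artificial structure.

```python
import numpy as np

# Hypothetical integer codes, chosen arbitrarily for illustration
integer_codes = {"A": 0, "C": 1, "G": 2, "T": 3}

bases = list(integer_codes)
codes = np.array([integer_codes[b] for b in bases])

# Pairwise absolute differences between the integer codes
distance = np.abs(codes[:, None] - codes[None, :])

print(bases)
print(distance)
# Under this labeling A and T end up 3 units apart while A and C are 1 unit apart.
# That gap comes from the labels, not from the nucleotides themselves.
```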

## Geometric intuition

One-hot encoding gives each category its own axis, so the category vectors are mutually orthogonal. Every pair of distinct categories is then the same Euclidean distance apart (√2), so the encoding itself carries no built-in similarity between categories.
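
The geometric claim can be checked numerically. The sketch below assumes the column order A, C, G, T and builds the four one-hot vectors as the rows of an identity matrix.

```python
import numpy as np

# Rows are the one-hot vectors for A, C, G, T (assumed column order)
one_hot = np.eye(4)

# Dot products between every pair of vectors: identity matrix means orthogonal
gram = one_hot @ one_hot.T

# Euclidean distance between every pair of vectors
diff = one_hot[:, None, :] - one_hot[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Distances between distinct categories only (off-diagonal entries)
off_diagonal = dist[~np.eye(4, dtype=bool)]

print(gram)
print(np.unique(np.round(off_diagonal, 6)))  # a single value, sqrt(2)
```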

## Drug development relevance

Many sequence-based models used in drug discovery depend on distances or dot products between input vectors. With an arbitrary numeric encoding, such models can fit artifacts of the encoding rather than biological patterns.
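
As a rough illustration of such an artifact, the sketch below compares a reference sequence with two single-base variants. The sequences, the helper names `encode_integer` and `encode_one_hot`, and the integer codes (index in `ALPHABET`) are all hypothetical. Under integer encoding the two variants sit at different distances from the reference, while under one-hot encoding any single substitution gives the same distance.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order

def encode_integer(seq):
    """Map each base to its index in ALPHABET (arbitrary integer code)."""
    return np.array([ALPHABET.index(b) for b in seq], dtype=float)

def encode_one_hot(seq):
    """Map each base to a row of the 4x4 identity matrix."""
    idx = np.array([ALPHABET.index(b) for b in seq])
    return np.eye(len(ALPHABET))[idx]

reference = "ACGT"
variants = {"A>C": "CCGT", "A>T": "TCGT"}  # one substitution at position 0

dist_int = {}
dist_hot = {}
for name, seq in variants.items():
    dist_int[name] = np.linalg.norm(encode_integer(reference) - encode_integer(seq))
    dist_hot[name] = np.linalg.norm(encode_one_hot(reference) - encode_one_hot(seq))

print("integer:", dist_int)  # the two substitutions get different distances
print("one-hot:", dist_hot)  # both substitutions get the same distance
```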

# One-Hot Encoding of DNA Sequences (Drug Development)

## 1. Motivation

DNA sequences must be converted into numerical form for machine learning models without introducing artificial ordering or similarity.

## 2. Preparing the DNA Sequence

Each nucleotide is treated as one observation (one row) of a single categorical feature. The scikit-learn encoder expects a two-dimensional input array of shape (n_samples, n_features).

In [None]:
import numpy as np
# NumPy is used for efficient numerical array storage and manipulation

# Convert the DNA string into individual characters
# list("ATGC") → ['A', 'T', 'G', 'C']
# np.array(...) converts the list into a NumPy array
# reshape(-1, 1):
#   - -1 lets NumPy infer the number of rows automatically
#   - 1 enforces a single feature column, required by sklearn encoders
sequence = np.array(list("ATGC")).reshape(-1, 1)

sequence
# Display the reshaped sequence to verify correct structure
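
The effect of the reshape is easiest to see by comparing shapes before and after. This is a small sketch; the variable names `flat` and `column` are illustrative only.

```python
import numpy as np

flat = np.array(list("ATGC"))   # one-dimensional: four characters
column = flat.reshape(-1, 1)    # two-dimensional: four rows, one feature column

print(flat.shape)    # (4,)
print(column.shape)  # (4, 1)
```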

## 3. Defining the One-Hot Encoder

Explicit category ordering ensures consistent column semantics across datasets and model training runs.

In [None]:
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder converts categorical symbols into binary indicator vectors

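# categories fixes the allowed nucleotide symbols and their column order
# Without it, sklearn infers the categories from the symbols present in the fitted data,
# so a sequence lacking a base would produce fewer columns
# sparse_output=False returns a dense NumPy array instead of a sparse matrix
# (the parameter is named sparse_output in scikit-learn 1.2+; older releases call it sparse)
# Dense output is easier to inspect and reason about during learning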
encoder = OneHotEncoder(
    categories=[['A', 'C', 'G', 'T']],
    sparse_output=False
)

encoder
# Display encoder configuration for verification
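
Why a fixed category list matters can be sketched without scikit-learn. The two helpers below are hypothetical NumPy stand-ins: `encode_fixed` always uses the full alphabet, while `encode_inferred` derives its categories from the data with `np.unique`, which is only an approximation of what an encoder does when it infers categories itself.

```python
import numpy as np

ALPHABET = np.array(["A", "C", "G", "T"])  # fixed column order

def encode_fixed(seq):
    """One-hot encode against the full, fixed alphabet."""
    symbols = np.array(list(seq))
    return (symbols[:, None] == ALPHABET[None, :]).astype(float)

def encode_inferred(seq):
    """One-hot encode against only the symbols present in seq."""
    symbols = np.array(list(seq))
    categories = np.unique(symbols)  # sorted unique symbols found in the data
    return (symbols[:, None] == categories[None, :]).astype(float)

seq = "ATTA"  # contains no C or G
fixed = encode_fixed(seq)
inferred = encode_inferred(seq)

print(fixed.shape)     # (4, 4): columns A, C, G, T
print(inferred.shape)  # (4, 2): columns A, T only
```

With inferred categories, the second column means T for this sequence but could mean C for another, so matrices from different sequences would not be comparable column by column.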

## 4. Applying One-Hot Encoding

The encoder learns the mapping from symbols to columns and then converts each nucleotide into a binary vector.

In [None]:
# fit_transform performs two operations:
# 1. fit(): records the mapping from symbols to column positions (fixed here by categories)
# 2. transform(): applies that mapping to generate binary vectors
one_hot_matrix = encoder.fit_transform(sequence)

# Print the sequence, the column order, and the encoded matrix
print("--- Sequence: ATGC ---")
print("Columns: [A, C, G, T]")
print(one_hot_matrix)
# Each row corresponds to one nucleotide, with exactly one active (1) position
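
A quick sanity check on the result is to confirm that every row has exactly one active position and that the matrix decodes back to the original sequence. The sketch below writes out by hand the matrix expected for "ATGC" under the column order A, C, G, T, then decodes it with `argmax`. The name `COLUMNS` is illustrative; scikit-learn's encoder also provides an `inverse_transform` method for the same purpose.

```python
import numpy as np

COLUMNS = np.array(["A", "C", "G", "T"])  # assumed column order

# Matrix expected for "ATGC" under the column order above (written by hand)
one_hot_matrix = np.array([
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 0.0, 0.0, 1.0],  # T
    [0.0, 0.0, 1.0, 0.0],  # G
    [0.0, 1.0, 0.0, 0.0],  # C
])

# Every row should contain exactly one 1
row_sums = one_hot_matrix.sum(axis=1)

# The position of the 1 in each row identifies the nucleotide
decoded = "".join(COLUMNS[one_hot_matrix.argmax(axis=1)])

print(row_sums)
print(decoded)  # ATGC
```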