# Conceptual and Statistical Introduction

## Why categorical data is a problem in statistics

Distances and averages are meaningful only for quantities measured on a numerical scale. Nominal categorical variables such as DNA bases (A, T, G, C) have no inherent magnitude, order, or spacing. Assigning numbers directly (e.g., A=1, C=2) silently imposes both a false order and a false metric on them.
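A minimal sketch of the problem, assuming the illustrative integer codes A=1, C=2, G=3, T=4 (the codes for G and T are an assumption added here to complete the example):

```python
# Hypothetical integer codes, used only to illustrate the problem.
codes = {"A": 1, "C": 2, "G": 3, "T": 4}

# Distances implied by the codes: A appears three times farther from T than from C.
dist_a_c = abs(codes["A"] - codes["C"])
dist_a_t = abs(codes["A"] - codes["T"])

# "Averaging" A and G lands exactly on the code for C, which has no biological meaning.
mean_a_g = (codes["A"] + codes["G"]) / 2

print(dist_a_c, dist_a_t, mean_a_g)
```

None of these relationships exist between the bases themselves; they come entirely from the arbitrary choice of codes.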

## Geometric intuition

Integer encoding places the categories on a number line, so some pairs end up closer together than others. One-hot encoding instead maps each category to its own orthogonal basis vector. Every pair of distinct categories is then the same Euclidean distance apart (√2), so no base is privileged over another.
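The geometric claim can be checked numerically. The sketch below assumes the column order A, C, G, T and, for comparison, hypothetical integer codes 1 to 4:

```python
from itertools import combinations

import numpy as np

bases = ["A", "C", "G", "T"]
one_hot = np.eye(len(bases))  # row i is the one-hot vector for bases[i]

# Distinct one-hot vectors are orthogonal: off-diagonal dot products are 0.
gram = one_hot @ one_hot.T

# Pairwise Euclidean distances between distinct bases under one-hot encoding.
one_hot_dists = {
    (bases[i], bases[j]): float(np.linalg.norm(one_hot[i] - one_hot[j]))
    for i, j in combinations(range(len(bases)), 2)
}

# The same pairs under hypothetical integer codes 1..4 (illustration only).
int_codes = np.arange(1, len(bases) + 1)
int_dists = {
    (bases[i], bases[j]): int(abs(int_codes[i] - int_codes[j]))
    for i, j in combinations(range(len(bases)), 2)
}

print(one_hot_dists)
print(int_dists)
```

Under one-hot encoding every pair has the same distance, while the integer codes produce three different distances depending on which pair is compared.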

## Drug development relevance

Many sequence-based models used in drug discovery are sensitive to distances or magnitudes in their input. With an incorrect encoding, such models can learn numerical artifacts of the encoding instead of biological patterns.

# One-Hot Encoding for DNA Sequences (Drug Development)

## 1. Motivation

DNA sequences consist of symbols (A, T, G, C) with no numerical ordering. Machine learning models require numerical input, so categorical encoding must preserve biological neutrality.

## 2. Preparing the DNA Sequence

The sequence is split into single characters and reshaped into a column vector of shape (n, 1). scikit-learn encoders expect a 2D array of samples by features, so each nucleotide becomes one observation of a single feature.

```python
import numpy as np

# Split the string into single bases and reshape to (n_samples, 1).
sequence = np.array(list("ATGC")).reshape(-1, 1)

sequence
```
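For sequences other than the four-base example, the same preparation step could be wrapped in a small function. The helper name `sequence_to_column` is hypothetical, and rejecting every symbol outside A, C, G, T (including ambiguity codes such as N) is an assumption of this sketch rather than a requirement of the encoder:

```python
import numpy as np


def sequence_to_column(seq: str) -> np.ndarray:
    """Hypothetical helper: turn a DNA string into an (n, 1) array of bases."""
    seq = seq.strip().upper()
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        # Assumption: anything other than A, C, G, T is treated as an error.
        raise ValueError(f"Unexpected symbols: {sorted(unexpected)}")
    return np.array(list(seq)).reshape(-1, 1)


column = sequence_to_column("atgcca")
print(column.shape)
```

Validating early keeps malformed input from reaching the encoder, where the resulting error would be harder to trace back to the sequence.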

## 3. Defining the One-Hot Encoder

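Passing the categories explicitly fixes the meaning of each output column (A, C, G, T, in that order). Without it, the encoder infers the categories from whatever data it is fitted on, so a sequence that lacks a base would produce fewer columns and shift the meaning of the remaining ones.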

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories=[['A', 'C', 'G', 'T']],  # fixed column order: A, C, G, T
    sparse_output=False                 # dense NumPy array (scikit-learn >= 1.2)
)

encoder
```

## 4. Applying One-Hot Encoding

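Calling fit_transform fits the encoder, which here means checking the sequence against the fixed categories, and returns a binary matrix with one row per nucleotide and one column per base. This matrix can be passed directly to ML models.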

```python
# Fit against the fixed categories and encode the sequence in one step.
one_hot_matrix = encoder.fit_transform(sequence)

print("--- Sequence: ATGC ---")
print("Columns: [A, C, G, T]")
print(one_hot_matrix)
```