# Module 1: Vectors & Gene Expression (Single-Cell Focus)

This notebook is part of the *Linear Algebra for Omics Data Science* series. It introduces vectors using **single-cell gene expression data** as a motivating example. You'll learn how to represent, visualize, and compare gene expression patterns using basic vector operations.


## 🧬 What is Gene Expression in Single-Cell Analysis?

In single-cell transcriptomics, we measure how active each gene is in **individual cells** by counting RNA molecules. This gives us a high-resolution view of cellular diversity, enabling us to identify cell types, track development, or detect disease-related changes.

Each gene’s expression across cells can be represented as a vector. These vectors are then analyzed using linear algebra tools to uncover patterns in cell behavior and gene activity.



## What is a Vector?

A vector is an ordered list of numbers. In single-cell gene expression, a vector typically represents the expression levels of a single gene across individual cells. More broadly, in omics, vectors can represent measured values of any biological feature — such as proteins or metabolites — across samples or conditions.


In [None]:

import numpy as np

# Gene expression values for GeneA and GeneB across 5 single cells
geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
geneB_expr = np.array([1.2, 3.3, 2.4, 3.8, 3.0])


## Visualizing Gene Expression Across Cells
Let's visualize how two genes behave across five single cells.

In [None]:

import matplotlib.pyplot as plt

plt.plot(geneA_expr, marker='o', label='GeneA')
plt.plot(geneB_expr, marker='s', label='GeneB')
plt.title("Gene Expression Across Single Cells")
plt.xlabel("Cell Index")
plt.ylabel("Expression Level")
plt.xticks(ticks=range(len(geneA_expr)), labels=[f"Cell{i+1}" for i in range(len(geneA_expr))])
plt.legend()
plt.grid(True)
plt.show()



## Vector Operations in Single-Cell Gene Expression

Once gene expression data is represented as vectors, we can apply vector operations to explore relationships between genes or cells. These operations help us:
- Combine expression levels (e.g., summing two genes' activity)
- Compare expression patterns (e.g., dot product, cosine similarity)
- Prepare for dimensionality reduction or clustering


### Addition
Adds the expression levels of two genes cell by cell. Both vectors must list the same cells in the same order.

In [None]:

sum_expr = geneA_expr + geneB_expr
print("Combined Expression (GeneA + GeneB):", sum_expr)


### Dot Product
Sums the cell-by-cell products of two expression vectors, so it reflects both pattern and magnitude. A high value suggests that the two genes are strongly expressed in the same cells.

In [None]:

dot_product = np.dot(geneA_expr, geneB_expr)
print("Dot Product:", dot_product)



### Understanding the Dot Product in Gene Expression

The **dot product** between two vectors is a way to measure how much they align, considering both their **direction** and their **magnitude**. Mathematically:

\[
\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^{n} A_i \cdot B_i
\]

In the context of **single-cell gene expression**, the dot product between two gene vectors reflects how similarly and how strongly two genes are expressed across the same set of cells.

---

#### 🧬 What It Tells Us

- A **high dot product** means both genes are **strongly** expressed in the **same cells**, so large values line up cell by cell.
- A **small dot product** has more than one possible cause:
  - Expression levels are low overall
  - The genes are active in **different cells**
  - The expression patterns otherwise fail to align
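
The cases listed above can be sketched with a few made-up vectors. The values and the names `gene_x`, `gene_y`, `gene_z`, and `gene_w` are hypothetical and chosen only to make each case easy to see; they are not taken from a real dataset.

```python
import numpy as np

# Hypothetical expression values across 4 cells (illustrative only)
gene_x = np.array([5.0, 4.0, 0.0, 0.0])  # active in Cell1 and Cell2
gene_y = np.array([4.0, 5.0, 0.0, 0.0])  # active in the same cells as gene_x
gene_z = np.array([0.0, 0.0, 5.0, 4.0])  # active in different cells
gene_w = np.array([0.5, 0.4, 0.0, 0.0])  # same pattern as gene_x, but 10x weaker

dot_same = np.dot(gene_x, gene_y)  # strong expression in the same cells
dot_diff = np.dot(gene_x, gene_z)  # no overlap between active cells
dot_weak = np.dot(gene_x, gene_w)  # aligned pattern, low magnitude

print("Same cells, strong:", dot_same)
print("Different cells:   ", dot_diff)
print("Same cells, weak:  ", dot_weak)
```

A small dot product on its own therefore cannot distinguish "weakly expressed" from "expressed in different cells".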

---

#### 📊 Example (GeneA and GeneB across 5 cells)

We compute:

```python
np.dot(geneA_expr, geneB_expr)
```

This performs:

\[
2.1 \cdot 1.2 + 3.5 \cdot 3.3 + 1.8 \cdot 2.4 + 4.0 \cdot 3.8 + 2.9 \cdot 3.0
\]

It adds up the **cell-wise products**, giving more weight to cells where **both genes are highly expressed**.

---

#### When is it useful?

- To detect **co-expressed genes**
- To find **correlated patterns**
- As a building block for similarity, PCA, and other machine learning techniques
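
One way to make the link to correlation concrete: the Pearson correlation of two vectors equals the dot product of their mean-centered versions, divided by the product of their lengths. The sketch below checks this against `np.corrcoef` using the GeneA and GeneB values from earlier; the names `a_centered`, `b_centered`, `pearson_manual`, and `pearson_numpy` are introduced here for illustration.

```python
import numpy as np

geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
geneB_expr = np.array([1.2, 3.3, 2.4, 3.8, 3.0])

# Center each vector by subtracting its mean expression across cells
a_centered = geneA_expr - geneA_expr.mean()
b_centered = geneB_expr - geneB_expr.mean()

# Dot product of the centered vectors, divided by the product of their lengths
pearson_manual = np.dot(a_centered, b_centered) / (
    np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
)

# NumPy's correlation matrix; entry [0, 1] is the GeneA-GeneB correlation
pearson_numpy = np.corrcoef(geneA_expr, geneB_expr)[0, 1]

print("Pearson (manual):     ", pearson_manual)
print("Pearson (np.corrcoef):", pearson_numpy)
```

Without the centering step, the raw dot product is not a correlation, since it also grows with overall expression level.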


In [None]:

# Manual dot product step by step
products = geneA_expr * geneB_expr
print("Element-wise products:", products)
print("Dot product:", np.sum(products))


### 🔗 Cosine Similarity
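Measures how similar two expression patterns are, regardless of overall magnitude. Because expression values are non-negative, the result here falls between 0 and 1.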

In [None]:

cos_sim = np.dot(geneA_expr, geneB_expr) / (np.linalg.norm(geneA_expr) * np.linalg.norm(geneB_expr))
print("Cosine Similarity:", cos_sim)
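
To see what "regardless of magnitude" means in practice, the following sketch rescales GeneA by a factor of 10 (a hypothetical gene with the same pattern but higher expression) and compares how the dot product and the cosine similarity respond. The name `scaled_A` and the factor 10 are assumptions made for this illustration.

```python
import numpy as np

geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
geneB_expr = np.array([1.2, 3.3, 2.4, 3.8, 3.0])

# Hypothetical gene: same pattern as GeneA, 10x the expression level
scaled_A = 10 * geneA_expr

dot_original = np.dot(geneA_expr, geneB_expr)
dot_scaled = np.dot(scaled_A, geneB_expr)

cos_original = dot_original / (np.linalg.norm(geneA_expr) * np.linalg.norm(geneB_expr))
cos_scaled = dot_scaled / (np.linalg.norm(scaled_A) * np.linalg.norm(geneB_expr))

print("Dot product (original, scaled):", dot_original, dot_scaled)
print("Cosine sim  (original, scaled):", cos_original, cos_scaled)
```

The dot product grows tenfold, while the cosine similarity is unchanged up to floating-point rounding.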



### 🧮 How to Normalize a Gene Expression Vector

To visualize cosine similarity, we first normalize the gene expression vectors so that they have a **length (norm) of 1**. This keeps the direction but removes the scale — turning them into **unit vectors**.

Given a gene expression vector like:

```python
geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
```

We compute its norm (length) as:

\[
\text{norm} = \sqrt{2.1^2 + 3.5^2 + 1.8^2 + 4.0^2 + 2.9^2}
\]

Then divide each value by that norm to create the unit vector:

\[
\hat{\mathbf{A}} = \frac{\mathbf{A}}{\lVert \mathbf{A} \rVert}
\]



### Visualizing Cosine Similarity

We can visualize cosine similarity as the angle between two vectors. The closer the vectors point in the same direction, the smaller the angle and the higher the cosine similarity (close to 1). The cell below first builds the unit vector for GeneA and confirms that its length is 1.


In [None]:

import numpy as np

geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
norm_geneA = np.linalg.norm(geneA_expr)
print("Norm of geneA_expr:", norm_geneA)

v1 = geneA_expr / norm_geneA
print("Normalized (unit) vector v1:", v1)
print("Length of v1:", np.linalg.norm(v1))
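
The angle itself can also be computed numerically. The sketch below normalizes both genes, takes the dot product of the unit vectors (which equals the cosine similarity), and converts it to an angle with `np.arccos`. The variable names `angle_rad` and `angle_deg` are introduced here for illustration.

```python
import numpy as np

geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
geneB_expr = np.array([1.2, 3.3, 2.4, 3.8, 3.0])

# Unit vectors: same direction as the originals, length 1
v1 = geneA_expr / np.linalg.norm(geneA_expr)
v2 = geneB_expr / np.linalg.norm(geneB_expr)

# For unit vectors, the dot product is the cosine of the angle between them
cos_sim = np.dot(v1, v2)

# np.clip guards against floating-point values slightly outside [-1, 1]
angle_rad = np.arccos(np.clip(cos_sim, -1.0, 1.0))
angle_deg = np.degrees(angle_rad)

print("Cosine similarity:", cos_sim)
print("Angle in degrees: ", angle_deg)
```

This angle is measured in the full 5-dimensional space, so it uses all five cells rather than any 2D picture of the vectors.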



## Summary

| Metric            | What it Measures          | Magnitude-Sensitive? | Use Case                            |
|-------------------|----------------------------|-----------------------|-------------------------------------|
| Dot Product       | Magnitude + Pattern        | ✅ Yes                | Find strongly co-expressed genes    |
| Cosine Similarity | Direction/Pattern Only     | ❌ No                 | Identify similar expression trends  |



### ⚠️ Note on 2D Visualization of Cosine Similarity

In reality, gene expression vectors often live in high-dimensional space (e.g., 5D if measured across 5 cells). The **cosine similarity we calculate** uses **all dimensions**.

However, to visualize the direction of vectors in a plot, we can only show 2D (or 3D). So we use **only the first two values** of each normalized vector to illustrate their alignment.

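This 2D projection is **approximate**: the plotted arrows are shorter than unit length because three of the five components are dropped, and the angle between them generally differs from the angle used in cosine similarity. It is for building **intuition** only.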


In [None]:

from numpy.linalg import norm
import matplotlib.pyplot as plt

# Normalize vectors
v1 = geneA_expr / norm(geneA_expr)
v2 = geneB_expr / norm(geneB_expr)

# Plot normalized vectors from origin
plt.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', label='GeneA (unit vector)')
plt.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='orange', label='GeneB (unit vector)')

plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.gca().set_aspect('equal')
plt.grid(True)
plt.title('Cosine Similarity Visualization (first two components)')
plt.legend()
plt.show()


In [None]:

# Compare full cosine similarity vs. 2D projection
cos_full = np.dot(geneA_expr, geneB_expr) / (norm(geneA_expr) * norm(geneB_expr))
cos_2D = np.dot(geneA_expr[:2], geneB_expr[:2]) / (norm(geneA_expr[:2]) * norm(geneB_expr[:2]))

print("Cosine similarity (full vector):", cos_full)
print("Cosine similarity (2D projection):", cos_2D)



---

## Vectors vs Matrices

In this module, we focused on **vectors** — representing the expression of individual genes across single cells. Each gene’s expression across cells was a separate vector that we could analyze using linear algebra.

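However, in real datasets like those in the **Human Cell Atlas**, gene expression is stored as a **matrix**. Here we describe it with genes as rows; some tools, such as AnnData, store the transpose, with cells as rows: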

- **Rows** = genes  
- **Columns** = cells  
- Each entry in the matrix = expression value of one gene in one cell

This matrix allows us to apply more advanced operations — like **dimensionality reduction (PCA)**, **clustering**, or **graph-based analyses** — by treating the entire dataset as a mathematical object.
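
As a small preview, the two gene vectors from this module can be stacked into a matrix with the genes-as-rows layout described above. This is only a sketch on the toy data; the names `expr_matrix`, `gene_row`, and `cell_col` are introduced here for illustration.

```python
import numpy as np

geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
geneB_expr = np.array([1.2, 3.3, 2.4, 3.8, 3.0])

# Stack the gene vectors as rows: shape is (n_genes, n_cells)
expr_matrix = np.vstack([geneA_expr, geneB_expr])
print("Matrix shape (genes, cells):", expr_matrix.shape)

# A row is one gene across all cells (the vectors used in this module)
gene_row = expr_matrix[0, :]
print("GeneA across cells:", gene_row)

# A column is one cell across all genes
cell_col = expr_matrix[:, 2]
print("Cell3 across genes:", cell_col)
```

Reading the matrix by row gives gene vectors, and reading it by column gives cell vectors; the same vector operations apply to both.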

👉 In **Module 2**, we’ll work directly with gene expression matrices, explore their structure, and apply matrix-based techniques to reveal biological patterns.

---



## 🧪 Exercises

1. Add a third gene vector and compare it with GeneA and GeneB.
2. Visualize all three gene expression profiles.
3. Compute cosine similarities between all pairs.
4. Try these operations on rows of a real single-cell dataset (e.g., PBMC).
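
A possible starting point for the exercises is a small reusable helper. The function name `safe_cosine_similarity` and its choice to return `np.nan` for all-zero vectors are assumptions made for this sketch, not part of any library. The zero check matters on real single-cell data, where a gene may have no counts in any of the selected cells.

```python
import numpy as np

def safe_cosine_similarity(u, v):
    """Cosine similarity of two vectors; np.nan if either has zero length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if u.shape != v.shape:
        raise ValueError("Vectors must cover the same cells (same shape).")
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        # Direction is undefined for an all-zero vector
        return np.nan
    return float(np.dot(u, v) / denom)

geneA_expr = np.array([2.1, 3.5, 1.8, 4.0, 2.9])
geneB_expr = np.array([1.2, 3.3, 2.4, 3.8, 3.0])

print("GeneA vs GeneB:", safe_cosine_similarity(geneA_expr, geneB_expr))
print("GeneA vs all-zero gene:", safe_cosine_similarity(geneA_expr, np.zeros(5)))
```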
