---
title: "Roulis Lab - Yale Colon Cancer Analysis I"
format:
  pdf:
    code-overflow: wrap
    geometry: margin=0.5in
---

# LOG FILE FOR YALE COLON CANCER ANALYSIS — PART I

**Author:** Bruno Ndiba Mbwaye Roy, School of Engineering and Applied Science (SEAS) 2027  
**Research mentor:** Manolis Roulis, Perelman School of Medicine (PSOM), Department of Pathology and Laboratory Medicine  
**Date:** February 15, 2026

---

## What this log covers

This notebook is Part I of the log book for the Yale colon cancer analysis. It documents data loading, preprocessing, and initial exploratory analyses. The table below lists the **output folders** and their contents; it will be updated as the analysis progresses.

| Folder | Contents |
|--------|----------|
| *(To be filled as analysis develops)* | |

Data: Yale colon cancer dataset; paths and file names to be specified in the data-loading section below.

---

## Important definitions

This section explains the main ideas and terms used in the notebook, in plain language. Where relevant, we mention the **function** used in the code (e.g. `sc.tl.umap`) so you can search for it if needed.

---

**UMAP (Uniform Manifold Approximation and Projection)**  
UMAP is a way to take many measurements per cell (thousands of genes) and represent them in a simple 2D picture. Cells that are similar in their gene activity end up close together on the plot, and cells that are different usually end up in separate groups. Distances between far-apart groups are only a rough guide, so a UMAP is best read as a *map* of which cells group together, not as an exact measure of how different they are. In the code we use **`sc.tl.umap`** to build this map. When we say "expression UMAP," we mean a UMAP built from gene expression (how much each gene is turned on in each cell).

---

**Clustering (e.g. Leiden clustering)**  
Clustering groups cells that look similar into the same "cluster." For example, we might get clusters that correspond to fibroblasts, muscle cells, immune cells, etc. **Leiden** is the name of the method we use to find these groups; it works on the neighbor graph (the connections between each cell and its most similar cells, built with `sc.pp.neighbors`), not on the UMAP coordinates, to decide who belongs together. In the code we use **`sc.tl.leiden`**. A **subcluster** is a finer grouping *within* a bigger group (e.g. different types of fibroblasts within all fibroblast-like cells). We get subclusters by running clustering again on just that subset of cells.

---

**Expression**  
Expression means how much a gene is "on" in a cell—i.e. how much RNA or protein is produced. When we color a UMAP or a spatial plot by a gene, we are showing where that gene is more or less active across cells or across the tissue.

---

**Spatial plot**  
A spatial plot shows cells at their *real* positions in the tissue (x and y coordinates from the microscope). So unlike a UMAP (which is a "similarity map"), a spatial plot shows *where* cells actually sit in the slice—useful to see if a cell type or a gene is in a particular region (e.g. near the muscle layer or the lining).

---

**Slice**  
A slice is one piece of tissue that was imaged (e.g. one section from the colon). We have multiple slices per sample; each has its own x, y coordinates. Many analyses are done *per slice* so that we only compare neighbors within the same physical tissue.

---

**Tier1, Tier2, Tier3**  
These are levels of cell-type labels from an earlier annotation. **Tier1** is broad (e.g. "Fibroblast," "Epithelial," "Immune"); **Tier2** and **Tier3** are more detailed (e.g. specific subtypes). We use them to color plots and to study the "neighborhood" of each cell (who its neighbors are by cell type).

---

**Neighborhood**  
In this notebook, "neighborhood" can mean two things: (1) The **spatial** neighbors of a cell—the *k* closest cells in the tissue (by x, y distance). (2) A **cluster** from an analysis that uses the *composition* of those neighbors (e.g. how many Tier1/Tier3 types are nearby). So "neighborhood composition" is: for each cell, what types of cells are around it, and in what proportion?
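
To make "neighborhood composition" concrete, here is a minimal plain-Python sketch on a toy slice. The function name `neighborhood_composition`, the coordinates, and the labels are made up for illustration; this is not the notebook's actual implementation, and the brute-force distance search is only practical for small examples.

```python
import math
from collections import Counter


def neighborhood_composition(coords, labels, k):
    """For each cell, return the fraction of each label among its k nearest spatial neighbors."""
    if not 1 <= k < len(coords):
        raise ValueError("k must be between 1 and the number of cells minus 1")
    compositions = []
    for i, (xi, yi) in enumerate(coords):
        # Distance from cell i to every other cell (the cell itself is excluded).
        distances = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        ]
        nearest = sorted(distances)[:k]
        counts = Counter(labels[j] for _, j in nearest)
        compositions.append({label: n / k for label, n in counts.items()})
    return compositions


# Toy slice: made-up coordinates and Tier1-style labels, for illustration only.
coords = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels = ["Fibroblast", "Fibroblast", "Immune", "Epithelial", "Epithelial", "Epithelial"]
composition = neighborhood_composition(coords, labels, k=2)
print(composition[0])
```

On real data this would be run separately for each slice, so that a cell's neighbors always come from the same piece of tissue.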

---

**Marker genes / top markers**  
Marker genes are genes that are especially characteristic of a cell type or cluster (e.g. high in that group and lower elsewhere). We find them by comparing gene expression between clusters; the "top" markers are the ones that best distinguish a given cluster. In the code we use **`sc.tl.rank_genes_groups`** to rank genes and **`sc.get.rank_genes_groups_df`** to export the list.
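
The idea of "high in that group and lower elsewhere" can be sketched with a very simple score: the mean expression inside the cluster minus the mean expression in all other cells. This is a simplified stand-in for illustration only; `sc.tl.rank_genes_groups` uses proper statistical tests, and the helper `rank_markers`, the gene names, and the values below are hypothetical.

```python
from statistics import mean


def rank_markers(expression_by_gene, clusters, target):
    """Rank genes by (mean in target cluster) - (mean in all other cells), highest first."""
    in_target = [c == target for c in clusters]
    if not any(in_target) or all(in_target):
        raise ValueError("target must match some, but not all, cells")
    scores = {}
    for gene, values in expression_by_gene.items():
        inside = mean(v for v, t in zip(values, in_target) if t)
        outside = mean(v for v, t in zip(values, in_target) if not t)
        scores[gene] = inside - outside
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: four cells, two clusters, three placeholder genes.
expression_by_gene = {
    "GeneA": [5.0, 4.0, 0.0, 1.0],
    "GeneB": [1.0, 1.0, 1.0, 1.0],
    "GeneC": [0.0, 1.0, 4.0, 5.0],
}
clusters = ["c0", "c0", "c1", "c1"]
ranked = rank_markers(expression_by_gene, clusters, target="c0")
print(ranked)
```

Here `GeneA` comes out on top for cluster `c0` because it is high in `c0` and low elsewhere, while `GeneC` ranks last because it shows the opposite pattern.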

---

**Dot plot**  
A dot plot shows how strongly marker genes are expressed in each cluster: dot *size* usually means "what fraction of cells in that cluster express the gene," and dot *color* means "how strong the expression is on average." So you can quickly see which genes are specific to which cluster.
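
The two numbers behind each dot can be computed by hand, as in the sketch below for a single gene. The helper `dotplot_stats` and the toy values are made up for illustration, and treating any value above 0 as "expressed" is an assumption; a plotting function such as scanpy's `sc.pl.dotplot` does this bookkeeping internally.

```python
def dotplot_stats(expression, clusters):
    """Return {cluster: (fraction_expressing, mean_expression)} for one gene."""
    grouped = {}
    for value, cluster in zip(expression, clusters):
        grouped.setdefault(cluster, []).append(value)
    stats = {}
    for cluster, values in grouped.items():
        # Dot size: fraction of cells in the cluster with expression above 0.
        fraction_expressing = sum(v > 0 for v in values) / len(values)
        # Dot color: mean expression over all cells in the cluster.
        mean_expression = sum(values) / len(values)
        stats[cluster] = (fraction_expressing, mean_expression)
    return stats


# Toy data: one gene measured in six cells from two clusters.
expression = [0.0, 2.0, 4.0, 0.0, 0.0, 1.0]
clusters = ["A", "A", "A", "B", "B", "B"]
stats = dotplot_stats(expression, clusters)
print(stats)
```

In this toy example cluster `A` would get a larger and more strongly colored dot than cluster `B`, since two of its three cells express the gene and its mean expression is higher.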

---

**Feature plot**  
A feature plot is a UMAP (or sometimes a spatial plot) where each point (cell) is colored by a single "feature"—usually the expression level of one gene. So you see *where* in the map or in the tissue that gene is high or low.

---

**PCA (Principal Component Analysis)**  
PCA is a way to reduce many variables (e.g. thousands of genes) to a smaller set of "summary" dimensions that capture most of the variation. We often run PCA first and then build the UMAP or the neighbor graph from these summary dimensions, which makes the analysis faster and less noisy. In the code we use **`sc.tl.pca`**.

---

**k nearest neighbors (kNN)**  
For each cell, we look at the *k* "nearest" other cells: nearest either in gene space (similar expression) or in physical space (distance in the tissue, from the x, y coordinates). *k* is just a number we choose (e.g. 15 or 40). The expression-based neighbor graph (used for clustering and UMAP) is built from these connections; **`sc.pp.neighbors`** does this in the code.
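
The functions named in this section are typically chained in a fixed order: PCA, then the neighbor graph, then clustering and UMAP on that graph. The sketch below shows that order with scanpy. The wrapper name `run_expression_workflow` and the parameter values are assumptions for illustration, not the settings used in this analysis, and `adata` is assumed to be an AnnData object that has already been normalized and log-transformed.

```python
def run_expression_workflow(adata, n_pcs=30, n_neighbors=15, resolution=1.0):
    """PCA -> neighbor graph -> Leiden clusters -> UMAP, modifying `adata` in place."""
    # Imported inside the function so that defining it does not require scanpy.
    import scanpy as sc

    # Reduce the genes to n_pcs summary dimensions.
    sc.tl.pca(adata, n_comps=n_pcs)
    # Build the kNN graph in PCA space (k = n_neighbors).
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    # Cluster on the graph; labels are stored in adata.obs["leiden"].
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden")
    # Compute the 2D layout from the same graph; stored in adata.obsm["X_umap"].
    sc.tl.umap(adata)
    return adata
```

After a call such as `run_expression_workflow(adata)`, the clusters can be shown on the map with `sc.pl.umap(adata, color="leiden")`. Note that `n_pcs` must be smaller than both the number of cells and the number of genes.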