In [None]:
print("Can one define different classes of peaks based on the signal and its variation across cells?")

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist
import seaborn as sns
import matplotlib.pyplot as plt
# TODO: also try Pearson and Spearman correlation

# Load the data
# Forward slash avoids the invalid "\I" escape sequence and works on all platforms
df_raw = pd.read_csv("data/ImmGenATAC18_AllOCRsInfo.csv", header=0, quotechar='"', low_memory=False)

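# Extract only the columns for NK and ILC, restricted to the first 5000 OCRs
df_expr = df_raw[['NK.27+11b-.BM', 'NK.27+11b+.BM', 'NK.27-11b+.BM', 'NK.27+11b-.Sp',
       'NK.27+11b+.Sp', 'NK.27-11b+.Sp', 'ILC2.SI', 'ILC3.NKp46-CCR6-.SI',
       'ILC3.NKp46+.SI', 'ILC3.CCR6+.SI']]
# .copy() so that adding the 'Cluster' column later does not write into a view of df_raw
df_expr = df_expr.iloc[:5000, :].copy()
# Assumption: the first column of df_raw holds the OCR ID; adjust if the ID is stored elsewhere
df_expr.index = df_raw.iloc[:5000, 0]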
print(df_expr[:10])


# K-means clustering of the OCRs by their accessibility profile

from sklearn.cluster import KMeans

k = 9 
kmeans = KMeans(n_clusters=k, random_state=42)
df_expr['Cluster'] = kmeans.fit_predict(df_expr)

cluster_means = df_expr.groupby('Cluster').mean()

# heatmap of the cluster means
sns.heatmap(cluster_means, cmap='vlag')
plt.title("Cluster Mean Accessibility per Cell Type")
plt.show()

# OCRs of Cluster 4
cluster_4_ocr = df_expr[df_expr['Cluster'] == 4]

ocr_names_cluster_4 = cluster_4_ocr.index.tolist()

# Printing of OCR names
for name in ocr_names_cluster_4:
    print(name)

# Number of OCRs in Cluster 4
print(f"Number of OCRs in Cluster 4: {len(ocr_names_cluster_4)}")

#Next: compare NK and ILC to closely related cell types
df_expr2 = df_raw[['NK.27+11b-.BM', 'NK.27+11b+.BM', 'NK.27-11b+.BM', 'NK.27+11b-.Sp',
       'NK.27+11b+.Sp', 'NK.27-11b+.Sp', 'ILC2.SI', 'ILC3.NKp46-CCR6-.SI',
       'ILC3.NKp46+.SI', 'ILC3.CCR6+.SI', 'proB.CLP.BM','proB.FrA.BM','proB.FrBC.BM', 'preT.DN1.Th','preT.DN2a.Th', 'preT.DN2b.Th','preT.DN3.Th']]
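# .copy() so that adding the 'Cluster' column later does not write into a view of df_raw
df_expr2 = df_expr2.iloc[:5000, :].copy()
# Assumption: the first column of df_raw holds the OCR ID; adjust if the ID is stored elsewhere
df_expr2.index = df_raw.iloc[:5000, 0]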

print(df_expr2[:10])

from sklearn.cluster import KMeans

k = 15  
kmeans = KMeans(n_clusters=k, random_state=42)
df_expr2['Cluster'] = kmeans.fit_predict(df_expr2)

cluster_means = df_expr2.groupby('Cluster').mean()

# heatmap of the cluster means
sns.heatmap(cluster_means, cmap='vlag')
plt.title("Cluster Mean Accessibility per Cell Type")
plt.show()

# OCRs of Cluster 4
cluster_4_ocr = df_expr2[df_expr2['Cluster'] == 4]

ocr_names_cluster_4 = cluster_4_ocr.index.tolist()

# Printing of OCR names
for name in ocr_names_cluster_4:
    print(name)

# Number of OCRs in Cluster 4
print(f"Number of OCRs in Cluster 4: {len(ocr_names_cluster_4)}")

#When are the clusters active?
cluster4_df = df_expr2[df_expr2['Cluster'] == 4].drop(columns='Cluster')
# mean accessibility of cluster 4 per cell type
mean_accessibility = cluster4_df.mean(axis=0)

# define differentiation level (NK from spleen)
diff_path = [
             'NK.27+11b-.Sp', 'NK.27+11b+.Sp', 'NK.27-11b+.Sp']

# Plot 

mean_accessibility = mean_accessibility[diff_path]
plt.figure(figsize=(12, 4))
plt.plot(mean_accessibility.index, mean_accessibility.values, marker='o')
plt.xticks(rotation=90)
plt.ylabel("Mean Accessibility (Cluster 4)")
plt.title("CRE Activity along Differentiation Path (Cluster 4)")
plt.tight_layout()
plt.show()

#same with NK from bone marrow
cluster4_df = df_expr2[df_expr2['Cluster'] == 4].drop(columns='Cluster')
# mean accessibility of cluster 4 per cell type
mean_accessibility = cluster4_df.mean(axis=0)

# define differentiation level
diff_path = ['NK.27+11b-.BM', 'NK.27+11b+.BM', 'NK.27-11b+.BM'
             ]

# Plot 

mean_accessibility = mean_accessibility[diff_path]
plt.figure(figsize=(12, 4))
plt.plot(mean_accessibility.index, mean_accessibility.values, marker='o')
plt.xticks(rotation=90)
plt.ylabel("Mean Accessibility (Cluster 4)")
plt.title("CRE Activity along Differentiation Path (Cluster 4)")
plt.tight_layout()
plt.show()



**Can one define different classes of peaks based on the signal and its variation across cells?**


In this section, we take a closer look at the OCR x cell type matrix and ask whether we can define different classes of peaks according to their signal variation across the NK and ILC subtypes.
We load the data set, extract only the relevant columns and keep the first 5000 peaks. The index is set to the OCR IDs from the original dataset to allow for easier identification.
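The code cell selects rows with `iloc[:5000]`, which keeps the first 5000 rows of the file rather than drawing them at random. If a random subset is wanted, `DataFrame.sample` is one way to get it. The sketch below contrasts the two on a small made-up table; the column names `peak_id` and `signal` are illustrative assumptions, not columns of the ImmGen file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the peak table (20 peaks, one signal column).
toy = pd.DataFrame(
    {"peak_id": [f"ocr{i}" for i in range(20)],
     "signal": np.arange(20, dtype=float)}
).set_index("peak_id")

# First n rows: deterministic, but depends on how the file is ordered.
first_rows = toy.iloc[:5]

# Random n rows without replacement; random_state makes the draw reproducible.
random_rows = toy.sample(n=5, random_state=42)

print(first_rows.index.tolist())
print(random_rows.index.tolist())
```

If the file is sorted by genomic position, the first rows all come from the same region, so a random draw is usually the more representative choice.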

We then performed k-means clustering (k = 9) and computed the mean accessibility of each cluster per cell type. The heatmap visualizes the result: each row represents a cluster of OCRs with similar accessibility patterns across the NK and ILC cell types. The clusters differ visibly, with some showing broad accessibility (e.g. cluster 4), which suggests that distinct classes of peaks do exist. This supports the idea that chromatin accessibility varies across cell types and can be used to define functionally distinct groups of regulatory elements. We take a closer look at the cluster that is highly accessible in all NK and ILC subtypes and extract the peaks it contains, to check whether they match the peaks in the heatmap of atac.ipynb.
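K-means cluster IDs are arbitrary, so the broadly accessible cluster does not have to be number 4 in every run. A minimal sketch of how that cluster could be picked from the data instead of read off the heatmap is given here. It uses synthetic values and hypothetical column names, and the criterion (highest minimum of the cluster mean across cell types) is an assumption made for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the OCR x cell type matrix (all values are made up).
rng = np.random.default_rng(0)
cell_types = ["NK_a", "NK_b", "ILC_a", "ILC_b"]  # hypothetical column names
centers = np.array([
    [5.0, 5.0, 5.0, 5.0],  # broadly accessible peaks
    [5.0, 5.0, 0.5, 0.5],  # NK-specific peaks
    [0.5, 0.5, 0.5, 0.5],  # mostly closed peaks
])
blocks = [c + rng.normal(0.0, 0.2, size=(50, 4)) for c in centers]
toy = pd.DataFrame(np.vstack(blocks), columns=cell_types)

# Fit on the signal columns only and keep the labels in a separate Series,
# so the label column can never leak into a later fit.
labels = pd.Series(
    KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(toy),
    index=toy.index,
    name="Cluster",
)
cluster_means = toy.groupby(labels).mean()

# Assumed criterion: the cluster whose lowest cell-type mean is the highest.
broad_cluster = cluster_means.min(axis=1).idxmax()
broad_size = int((labels == broad_cluster).sum())

print(cluster_means.round(2))
print("broadly accessible cluster:", broad_cluster, "with", broad_size, "peaks")
```

Keeping the labels outside the data frame also avoids having to drop a `Cluster` column before computing means or refitting.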

To compare NK and ILC cells with closely related cell types and determine whether they differ, we repeat the procedure with pre-T and pro-B cells included (k = 15). Cluster 4, containing the same peaks as cluster 4 in the first heatmap, is still accessible in all cell subtypes. This suggests that the accessibility of these peaks is highly conserved across the related lineages. However, the clusters do not show distinct patterns across the different cell types. This does not support the hypothesis that one can define different classes of peaks according to their signal in different cell types, so further investigation is needed.
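Because cluster numbers from two separate k-means runs are not linked, the statement that a cluster contains the same peaks in both runs is best checked explicitly. A minimal sketch of such a check follows, using a contingency table and the Jaccard index on made-up labels. The peak names, the labels and the helper `jaccard` are hypothetical.

```python
import pandas as pd

# Hypothetical cluster labels for the same six peaks from two separate runs.
peaks = ["ocr1", "ocr2", "ocr3", "ocr4", "ocr5", "ocr6"]
labels_run1 = pd.Series([4, 4, 4, 0, 0, 1], index=peaks, name="run1")
labels_run2 = pd.Series([7, 7, 2, 2, 2, 5], index=peaks, name="run2")

# Contingency table: rows are run-1 clusters, columns are run-2 clusters.
overlap = pd.crosstab(labels_run1, labels_run2)

def jaccard(a, b):
    """Shared members divided by all members of either set."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Run-2 cluster that shares the most peaks with run-1 cluster 4.
best_match = overlap.loc[4].idxmax()
members_run1 = labels_run1.index[labels_run1 == 4]
members_run2 = labels_run2.index[labels_run2 == best_match]
score = jaccard(members_run1, members_run2)

print(overlap)
print(f"best match for run-1 cluster 4: run-2 cluster {best_match}, Jaccard = {score:.2f}")
```

A Jaccard index close to 1 would back the claim that the two clusters hold the same peaks; a low value would mean the matching cluster number is a coincidence.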

Cluster 4 is active in all three NK cell subtypes:
All three cell types show relatively high mean accessibility, indicating that the CREs in Cluster 4 are accessible across NK subtypes in the spleen. There is a slight drop in accessibility from NK.27+11b-.Sp to NK.27+11b+.Sp, followed by an increase in NK.27-11b+.Sp.
This suggests that some CREs in Cluster 4 may become transiently less active during intermediate stages of NK cell differentiation. Their highest accessibility is observed in the NK.27-11b+.Sp subtype, possibly indicating mature or fully differentiated NK cells. Cluster 4 CREs also show dynamic regulation during NK development in the bone marrow, with activity peaks at the early and late stages.
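The line plots show the cluster mean, which cannot tell whether the intermediate dip is shared by most peaks or driven by a few. A minimal sketch of a per-peak check along the spleen path is given below. The accessibility values and peak names are made up, and "dip" is defined here, as an assumption, as the intermediate stage being lower than both neighbouring stages.

```python
import pandas as pd

# Spleen differentiation path, column names as used in the notebook.
path = ["NK.27+11b-.Sp", "NK.27+11b+.Sp", "NK.27-11b+.Sp"]

# Made-up accessibility values for four hypothetical peaks of one cluster.
toy_cluster = pd.DataFrame(
    [[5.0, 4.0, 6.0],
     [5.5, 4.5, 6.5],
     [5.0, 5.2, 5.4],
     [6.0, 5.0, 5.5]],
    index=["ocr_a", "ocr_b", "ocr_c", "ocr_d"],
    columns=path,
)

early, mid, late = (toy_cluster[c] for c in path)

# A peak "dips" if the intermediate stage is lower than both neighbours.
dips = (mid < early) & (mid < late)
fraction_dipping = dips.mean()

print(toy_cluster.mean(axis=0))  # cluster-level trend, as in the line plot
print(f"fraction of peaks with an intermediate dip: {fraction_dipping:.2f}")
```

Reporting this fraction next to the mean curve makes it clear how many CREs actually follow the averaged trend.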