# Interferon Signature Analysis in Pancreatic Adenocarcinoma

This analysis examines the type I interferon (IFN) gene signature in pancreatic adenocarcinoma (PAAD) using gene expression data from the `PAAD.gct` file. IFN activity is scored per sample with GSVA, and tumor subtypes are explored with PCA.

### Dataset:
- `PAAD.gct`: Gene expression data (~20,000 genes & 185 samples)
- `type1_IFN.txt`: 25-gene IFN signature

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from cmapPy.pandasGEXpress import parse
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from gseapy import gsva

gct_data = parse.parse("PAAD.gct")
expression_data = gct_data.data_df
metadata = gct_data.col_metadata_df
cleaned_data = expression_data.dropna()
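
To see how many genes `dropna()` removes, the rows containing any missing value can be counted before cleaning. The sketch below uses a small toy matrix with hypothetical values in place of `gct_data.data_df`; the names `toy_expression`, `genes_with_nan`, and `toy_cleaned` are illustrative and not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy genes x samples matrix standing in for gct_data.data_df (hypothetical values).
toy_expression = pd.DataFrame(
    {
        "sample_1": [1.2, np.nan, 0.4, 2.2],
        "sample_2": [0.9, 1.1, np.nan, 2.0],
        "sample_3": [1.5, 1.3, 0.7, 1.8],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# dropna() with default arguments drops a gene (row) if any sample is missing.
genes_with_nan = toy_expression.isna().any(axis=1)
n_dropped = int(genes_with_nan.sum())

toy_cleaned = toy_expression.dropna()
print(f"Genes removed: {n_dropped} of {len(toy_expression)}")
print(f"Genes kept: {toy_cleaned.shape[0]}")
```

Running the same two lines on `expression_data` would give the number of genes removed from the real dataset.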

In [None]:
plt.figure(figsize=(14, 6))
sns.boxplot(data=cleaned_data.T, palette="coolwarm")
plt.xticks([], [])
plt.ylabel("Gene Expression")
plt.title("Boxplot of Gene Expression for All Samples")
plt.tight_layout()
plt.show()

In [None]:
X = StandardScaler().fit_transform(cleaned_data.T)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
metadata = gct_data.col_metadata_df
histology = metadata["histological_type_other"].values
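
The share of variance captured by the first two components can be read from the fitted PCA object. The following is a minimal sketch on random toy data (not the PAAD matrix), assuming the same `StandardScaler` plus `PCA(n_components=2)` setup as above; `toy_samples`, `toy_pca`, and `ratios` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random toy matrix: 30 samples x 50 genes (stand-in for cleaned_data.T).
rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(30, 50))

# Scale each gene to zero mean and unit variance, then fit a 2-component PCA.
toy_scaled = StandardScaler().fit_transform(toy_samples)
toy_pca = PCA(n_components=2).fit(toy_scaled)

# Fraction of total variance explained by each retained component.
ratios = toy_pca.explained_variance_ratio_
cumulative = float(ratios.sum())
print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, together: {cumulative:.1%}")
```

On the real data, `pca.explained_variance_ratio_` would supply the numbers behind any statement about how much variance the two plotted axes represent.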

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=histology, palette="Set2")
plt.title("PCA of Samples (colored by Histology)")
plt.xlabel("PCA1")
plt.ylabel("PCA2")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

In [None]:
exocrine_samples = metadata[metadata["histological_type_other"] != "Neuroendocrine"].index
exocrine_expression = cleaned_data[exocrine_samples]

with open("type1_IFN.txt", "r") as f:
    # Skip blank lines so empty strings do not end up in the gene list
    ifn_genes = [line.strip() for line in f if line.strip()]

# Normalize gene symbols to upper case before matching so the comparison is case-insensitive
exocrine_expression.index = exocrine_expression.index.str.upper()
ifn_genes = [g.strip().upper() for g in ifn_genes]
matching_genes = [g for g in ifn_genes if g in exocrine_expression.index]
# Signature-only subset of the expression matrix
expression_gsva = exocrine_expression.loc[matching_genes].copy()

with open("ifn_signature.gmt", "w") as f:
    f.write("IFN_Signature\tNA\t" + "\t".join(matching_genes) + "\n")

# Signature genes only, cast to float (for inspection; GSVA below is run on the full matrix)
filtered_expression = exocrine_expression.loc[matching_genes].astype(float)
gsva_results = gsva(data=exocrine_expression, gene_sets="ifn_signature.gmt", method="gsva", outdir=None, verbose=True, min_size=5)
gsva_results_df = gsva_results.res2d
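
The gene matching and GMT export above can be checked in isolation. A GMT line is tab-separated: set name, description, then the gene symbols. The sketch below uses hypothetical helper functions (`match_signature`, `write_gmt`) and toy gene lists; it writes to a temporary directory rather than to `ifn_signature.gmt`.

```python
import os
import tempfile


def match_signature(signature, expression_genes):
    """Return signature genes found in the expression index (case-insensitive).

    Order follows the signature; blanks and duplicates are skipped.
    """
    available = {g.strip().upper() for g in expression_genes}
    matched = []
    for gene in signature:
        symbol = gene.strip().upper()
        if symbol and symbol in available and symbol not in matched:
            matched.append(symbol)
    return matched


def write_gmt(path, set_name, genes, description="NA"):
    """Write one gene set as a single tab-separated GMT line."""
    with open(path, "w") as handle:
        handle.write("\t".join([set_name, description, *genes]) + "\n")


# Toy inputs (hypothetical): mixed case, trailing space, a blank entry, an absent gene.
toy_signature = ["Ifit1", "ISG15 ", "", "MX1", "NOT_A_GENE"]
toy_index = ["ISG15", "IFIT1", "MX1", "ACTB"]
matched = match_signature(toy_signature, toy_index)

gmt_path = os.path.join(tempfile.mkdtemp(), "toy_signature.gmt")
write_gmt(gmt_path, "IFN_Signature", matched)

# Read the file back to confirm the layout.
with open(gmt_path) as handle:
    fields = handle.read().rstrip("\n").split("\t")
print(fields)
```

Comparing `len(matched)` with the length of the signature shows how many signature genes are missing from the expression matrix, which matters when `min_size` is set.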

In [None]:
plt.figure(figsize=(8, 5))
# Assumes res2d is long-format with columns Name, Term, ES; plot the ES column
sns.histplot(gsva_results_df["ES"].astype(float), kde=True, bins=30)
plt.title("Distribution of IFN Signature GSVA Scores")
plt.xlabel("GSVA Score")
plt.ylabel("Sample Count")
plt.tight_layout()
plt.show()

## Summary

- Approximately 4,367 genes in the dataset contained NaN values and were removed with `dropna()`.
- After NaN removal, expression values in every sample are concentrated around a central range, with a large number of outliers.
- The PCA plot shows the variation in gene expression across samples; samples with similar expression patterns cluster together.
- Neuroendocrine tumors show partial clustering separate from the main cluster.
- PCA1 and PCA2 represent a substantial portion of the total variance, but not a majority of it.
- GSVA characterizes the IFN signature in PAAD by assigning each sample a score that reflects whether the IFN genes are relatively enriched (positive score) or depleted (negative score) in that sample.
- The scores range from -0.8 to 0.8 across samples and are distributed relatively evenly, except for a few clusters.
- Samples with high GSVA scores represent a high IFN subtype while low GSVA scores represent a low IFN subtype.
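
One way to turn the scores into the high and low IFN groups described above is a median split. This is only a sketch: the median cutoff is an assumption (the analysis does not specify a threshold), and the sample names and scores in `toy_scores` are hypothetical.

```python
from statistics import median

# Hypothetical per-sample GSVA scores (sample ID -> score).
toy_scores = {"s1": -0.62, "s2": -0.10, "s3": 0.05, "s4": 0.41, "s5": 0.77}

# Assumed cutoff: the median score across samples.
cutoff = median(toy_scores.values())

# Samples strictly above the cutoff are labeled IFN-high, the rest IFN-low.
ifn_group = {
    sample: ("IFN-high" if score > cutoff else "IFN-low")
    for sample, score in toy_scores.items()
}

for sample, label in ifn_group.items():
    print(sample, toy_scores[sample], label)
```

Other cutoffs (for example zero, or upper and lower quartiles) would give different groupings, so the choice should be stated alongside any subtype comparison.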