## Introduction

In this analysis, we build and critically evaluate ML classifiers that predict whether a given cell sample is hypoxic or normoxic, for the two cell lines MCF7 and HCC1806 sequenced with the Dropseq and Smartseq methods. To this end, we first analyse in detail how gene expression is distributed across the two cell conditions (hypoxia vs normoxia) in the Exploratory Data analysis. In the unsupervised analysis, we then investigate the structure of the clusters corresponding to the hypoxia and normoxia classes, evaluating how effectively the two conditions can be separated with different dimensionality reduction techniques (used both for visualisation and to improve performance). In parallel, we check the datasets for outlier samples and other noisy observations to understand whether only two cell conditions naturally emerge from the data. Based on these findings, we select the most expressive and useful genes in the dataset. Finally, since the best model is not known a priori, we train a wide range of ML classifiers and combine their predictions using Voting and Stacking ensemble classifiers.

In this report, we include only one representative notebook for each part of the analysis, since similar notebooks were applied to the different datasets; all the notebooks can be found on the provided drive/GitHub repository.

The structure of the report is as follows:

- Exploratory Data analysis (focused on Smartseq unfiltered data)

- Outlier/noise detection with Isolation Forest (focused on Dropseq data)

- Dimensionality reduction methods and parameter optimization (focused on Smartseq data)

- Clustering with DBSCAN and K-means (focused on Smartseq data)

- Supervised analysis of the HCC cell line for both Dropseq and Smartseq data, using different classifiers including KNN, SVM, Random Forest, XGBoost and MLP

# Exploratory Data analysis

In [None]:
# Project - validation data to be given at the end of the course & data not to be shared
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.inspection import DecisionBoundaryDisplay

We analyse the metadata to better understand the data labels and the general information available for each sample.

For instance, one cell line is kept under experimental conditions for 72 hours and the other for only 24 hours, because the two cell lines have different reaction times.

In [None]:
# retrieving metadata, chiefly sample labels
df_meta = pd.read_csv("SmartSeq/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
df_meta.iloc[10:20,:]

In [None]:
# retrieving metadata, chiefly sample labels
df_meta = pd.read_csv("SmartSeq/HCC1806_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
df_meta.iloc[20:30,:]

## EDA on Unfiltered SmartSeq HCC1806 and MCF7 data

In [None]:
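# raw string: "\ " is an invalid escape sequence in a normal string literal; the regex still matches a single space
df_MCF7_unfiltered = pd.read_csv("SmartSeq/MCF7_SmartS_Unfiltered_Data.txt",delimiter=r"\ ",engine='python',index_col=0).T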
print("Dataframe dimensions:", np.shape(df_MCF7_unfiltered))
df_MCF7_unfiltered.head()

In [None]:
df_MCF7_unfiltered.info()
df_MCF7_unfiltered.describe()

In [None]:
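# raw string: "\ " is an invalid escape sequence in a normal string literal; the regex still matches a single space
df_HCC_unfiltered = pd.read_csv("SmartSeq/HCC1806_SmartS_Unfiltered_Data.txt",delimiter=r"\ ",engine='python',index_col=0).T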
print("Dataframe dimensions:", np.shape(df_HCC_unfiltered))
df_HCC_unfiltered.head()

In [None]:
df_HCC_unfiltered.info()
df_HCC_unfiltered.describe()

In the next cell, we try to understand how many genes are in common between the two cell lines in order to see if the same classifier could be trained on both datasets.

In [None]:
# Get sets of gene names (column names)
genes_1 = set(df_MCF7_unfiltered.columns)
genes_2 = set(df_HCC_unfiltered.columns)

# Genes only in the MCF7 dataset
unique_to_1 = genes_1 - genes_2

# Genes only in the HCC dataset
unique_to_2 = genes_2 - genes_1

# Union of all unique genes (not shared)
not_in_common = unique_to_1.union(unique_to_2)

# Output
print(f"Genes only in MCF7 dataset: {len(unique_to_1)}")
print(f"Genes only in HCC dataset: {len(unique_to_2)}")
print(f"Total genes not in common: {len(not_in_common)}")
print(f"{not_in_common}")

### MISSING VALUES, DUPLICATES and SPARSITY ANALYSIS 

In [None]:
# checking for duplicates and missing values
# Percentage of missing values for the whole dataset
print("MCF7 dataset")
total_cells = (df_MCF7_unfiltered.shape[0] * df_MCF7_unfiltered.shape[1])
total_missing = df_MCF7_unfiltered.isnull().sum().sum()
missing_percentage_total = (total_missing / total_cells) * 100
print(f"Total missing values in MCF7 dataset: {missing_percentage_total:.2f}%")

# Missing values per column (only where missing values exist)
missing_per_column = df_MCF7_unfiltered.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

print("\nColumns with missing values (count):")
print(missing_per_column)

print(f"count of duplicates: {df_MCF7_unfiltered.duplicated().sum()}")
# Remove if needed
df = df_MCF7_unfiltered.drop_duplicates()

print("\nHCC dataset")
# checking for duplicates and missing values
# Percentage of missing values for the whole dataset
total_cells = (df_HCC_unfiltered.shape[0] * df_HCC_unfiltered.shape[1])
total_missing = df_HCC_unfiltered.isnull().sum().sum()
missing_percentage_total = (total_missing / total_cells) * 100
print(f"Total missing values in HCC dataset: {missing_percentage_total:.2f}%")

# Missing values per column (only where missing values exist)
missing_per_column = df_HCC_unfiltered.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

print("\nColumns with missing values (count):")
print(missing_per_column)

print(f"count of duplicates: {df_HCC_unfiltered.duplicated().sum()}")
# Remove if needed
df = df_HCC_unfiltered.drop_duplicates()

Neither dataset contains missing values, so both can be analysed directly. We cannot rule out, however, that some missing entries were recorded as 0, as the large proportion of zeros might suggest.

In [None]:
print("MCF7 dataset")
# analyse the transposed dataset with samples as columns and genes as rows
print(f"count of duplicates: {df_MCF7_unfiltered.T.duplicated().sum()}")

# Find duplicated rows of the transposed dataset, i.e. duplicated genes
duplicated_rows = df_MCF7_unfiltered.T[df_MCF7_unfiltered.T.duplicated(keep=False)]

# Print the gene names (index labels) of the duplicated rows
if not duplicated_rows.empty:
    print("Duplicated genes found:")
    print(duplicated_rows.index.tolist())
else:
    print("No duplicated genes found.")

print("\n HCC dataset")
# analyse the transposed dataset with samples as columns and genes as rows
print(f"count of duplicates: {df_HCC_unfiltered.T.duplicated().sum()}")

# Find duplicated rows of the transposed dataset, i.e. duplicated genes
duplicated_rows2 = df_HCC_unfiltered.T[df_HCC_unfiltered.T.duplicated(keep=False)]

# Print the gene names (index labels) of the duplicated rows
if not duplicated_rows2.empty:
    print("Duplicated genes found:")
    print(duplicated_rows2.index.tolist())
else:
    print("No duplicated genes found.")

A few genes in both datasets have duplicated entries, which increases the dimensionality of the dataset without adding information. They are not removed for now because these genes are very sparse and so do not pose a major issue for EDA. Moreover, since the dataset shape is very unbalanced (very few samples and a very large number of genes), it is possible, if not likely, that the entries of some genes coincide by chance.
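
Should the duplicated genes need to be removed at a later stage, one possible approach is to keep only the first occurrence of each set of identical columns. The following is a minimal sketch on a toy table, not a step applied in this notebook; the helper name `drop_duplicate_genes` and the toy gene and cell names are hypothetical.

```python
import pandas as pd


def drop_duplicate_genes(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first of any set of gene columns with identical values (hypothetical helper)."""
    # df is samples x genes, so duplicated genes are duplicated rows of df.T
    keep = ~df.T.duplicated(keep="first")
    return df.loc[:, keep.to_numpy()]


# toy data: GENE_C repeats the values of GENE_A
toy = pd.DataFrame(
    {"GENE_A": [0, 3, 0], "GENE_B": [1, 2, 5], "GENE_C": [0, 3, 0]},
    index=["cell_1", "cell_2", "cell_3"],
)
deduplicated = drop_duplicate_genes(toy)
print(deduplicated.columns.tolist())
```

Keeping the first occurrence is an arbitrary choice: the duplicated columns are identical, so only the retained gene name differs.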

### SPARSITY and GENE EXPRESSION by genes' expression distribution

In what follows, we analyse the sparsity of the dataset. We call a column/gene sparse when its distribution is extremely skewed towards zero, which we measure by the mean of the gene count minus its standard deviation: a gene is considered non-sparse when this difference is at least 1.

In [None]:
# understanding the sparsity and distribution of the dataset
non_sparse_genes = [i for i in df_MCF7_unfiltered.columns if df_MCF7_unfiltered[i].mean() - df_MCF7_unfiltered[i].std() >= 1] # non-sparse genes have mean minus std of at least 1
non_sparse_genes = sorted(non_sparse_genes, key = lambda x: df_MCF7_unfiltered[x].mean(), reverse=True) # sort by decreasing mean expression, used later for the plots
# the :.2% format multiplies the fraction by 100 before printing
print(f"Number of non-sparse genes in MCF7 dataset: {len(non_sparse_genes)} or {len(non_sparse_genes)/len(df_MCF7_unfiltered.columns):.2%} of all genes")

# same analysis for the HCC dataset
non_sparse_genes2 = [i for i in df_HCC_unfiltered.columns if df_HCC_unfiltered[i].mean() - df_HCC_unfiltered[i].std() >= 1] # non-sparse genes have mean minus std of at least 1
non_sparse_genes2 = sorted(non_sparse_genes2, key = lambda x: df_HCC_unfiltered[x].mean(), reverse=True) # sort by decreasing mean expression, used later for the plots
print(f"Number of non-sparse genes in HCC dataset: {len(non_sparse_genes2)} or {len(non_sparse_genes2)/len(df_HCC_unfiltered.columns):.2%} of all genes")
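
The mean-minus-standard-deviation criterion can also be computed for all genes at once, which avoids looping over tens of thousands of columns. The sketch below shows the idea on a toy table; the gene names and values are made up for illustration.

```python
import pandas as pd

# toy count matrix: 4 samples x 3 genes (illustrative values only)
toy = pd.DataFrame(
    {
        "GENE_A": [10, 12, 11, 13],  # high and stable expression
        "GENE_B": [0, 0, 0, 40],     # mostly zero with one large count
        "GENE_C": [0, 0, 1, 0],      # almost always zero
    }
)

# pandas .std() uses the sample standard deviation (ddof=1) by default
score = toy.mean() - toy.std()
selected = score[score >= 1].index

# order the retained genes by decreasing mean expression
non_sparse = toy[selected].mean().sort_values(ascending=False).index.tolist()
print(non_sparse)
```

Note that a gene with a large mean can still be rejected when its counts are concentrated in a few samples, as `GENE_B` is here.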

In [None]:
n = 10
print(f"TOP non-sparse genes in MCF7 dataset: {non_sparse_genes[:n]}")
print(f"TOP non-sparse genes in HCC dataset: {non_sparse_genes2[:n]}")

We now analyse the distribution of the least sparse genes, according to the above definition, using histograms and violin plots.

In [None]:
fig, ax = plt.subplots(ncols = int(n/2), nrows = 2, figsize = (30,15))

for i, col in enumerate(non_sparse_genes[:n]):
    # Plot histogram for df_MCF7_unfiltered
    ax.flat[i].hist(df_MCF7_unfiltered[col], bins=30, alpha=0.5, edgecolor='black', density=True, label='MCF7')
    # Plot histogram for df_HCC_unfiltered
    ax.flat[i].hist(df_HCC_unfiltered[col], bins=30, alpha=0.5, edgecolor='black', density=True, label='HCC')
    
    # KDE plots for both
    sns.kdeplot(df_MCF7_unfiltered[col], ax=ax.flat[i], bw_adjust=1, color='blue', label='MCF7 KDE')
    sns.kdeplot(df_HCC_unfiltered[col], ax=ax.flat[i], bw_adjust=1, color='red', label='HCC KDE')

    ax.flat[i].set_xlabel('')
    ax.flat[i].set_title(f"{col}\nMCF7 μ: {df_MCF7_unfiltered[col].mean():.0f}, HCC μ: {df_HCC_unfiltered[col].mean():.0f}")
    ax.flat[i].legend()

plt.suptitle("distribution of top 10 non-sparse genes in MCF7 dataset and comparison with HCC dataset", fontsize = 24)
plt.tight_layout(rect=[0, 0.03, 1, 0.97]) 
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=5, figsize=(30, 15))

for i, col in enumerate(non_sparse_genes[:n]):
    combined_df = pd.DataFrame({
        'value': pd.concat([df_MCF7_unfiltered[col], df_HCC_unfiltered[col]]),
        'dataset': ['MCF7'] * len(df_MCF7_unfiltered) + ['HCC'] * len(df_HCC_unfiltered)
    })
    
    # Violin plot
    sns.violinplot( data=combined_df, x='dataset', y='value', hue='dataset',  palette=['lightblue', 'lightgreen'],  ax=ax.flat[i],  inner=None,  linewidth=1,  legend=False)
    # Boxplot overlay
    sns.boxplot(    data=combined_df, x='dataset',  y='value',  hue='dataset', palette=['lightblue', 'lightgreen'], ax=ax.flat[i],   width=0.2, showcaps=True, fliersize=2, legend=False)

    # Remove default labels
    ax.flat[i].set_xlabel('')
    ax.flat[i].set_ylabel('')
    
    # Custom title with means
    mcf7_mean = df_MCF7_unfiltered[col].mean()
    df2_mean = df_HCC_unfiltered[col].mean()
    ax.flat[i].set_title(f"{col}\nMCF7 μ: {mcf7_mean:.0f}, HCC μ: {df2_mean:.0f}")


plt.suptitle("distribution of top 10 non-sparse genes in MCF7 dataset and comparison with HCC dataset", fontsize = 16)
plt.tight_layout(rect=[0, 0.03, 1, 0.97]) 
plt.show()

In [None]:
fig, ax = plt.subplots(ncols = int(n/2), nrows = 2, figsize = (30,15))

for i, col in enumerate(non_sparse_genes2[:n]):
    # Plot histogram for df_MCF7_unfiltered
    ax.flat[i].hist(df_MCF7_unfiltered[col], bins=30, alpha=0.5, edgecolor='black', density=True, label='MCF7')
    # Plot histogram for df_HCC_unfiltered
    ax.flat[i].hist(df_HCC_unfiltered[col], bins=30, alpha=0.5, edgecolor='black', density=True, label='HCC')
    
    # KDE plots for both
    sns.kdeplot(df_MCF7_unfiltered[col], ax=ax.flat[i], bw_adjust=1, color='blue', label='MCF7 KDE')
    sns.kdeplot(df_HCC_unfiltered[col], ax=ax.flat[i], bw_adjust=1, color='red', label='HCC KDE')

    ax.flat[i].set_xlabel('')
    ax.flat[i].set_title(f"{col}, MCF7 μ: {df_MCF7_unfiltered[col].mean():.0f}, HCC μ: {df_HCC_unfiltered[col].mean():.0f}")
    ax.flat[i].legend()

plt.suptitle("distribution of top 10 non-sparse genes in HCC dataset and comparison with MCF7 dataset", fontsize = 24)
plt.tight_layout(rect=[0, 0.03, 1, 0.97]) 
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=5, figsize=(30, 15))

for i, col in enumerate(non_sparse_genes2[:n]):
    combined_df = pd.DataFrame({
        'value': pd.concat([df_MCF7_unfiltered[col], df_HCC_unfiltered[col]]),
        'dataset': ['MCF7'] * len(df_MCF7_unfiltered) + ['HCC'] * len(df_HCC_unfiltered)
    })
    
    # Violin plot
    sns.violinplot( data=combined_df, x='dataset', y='value', hue='dataset',  palette=['lightblue', 'lightgreen'],  ax=ax.flat[i],  inner=None,  linewidth=1,  legend=False)
    # Boxplot overlay
    sns.boxplot(    data=combined_df, x='dataset',  y='value',  hue='dataset', palette=['lightblue', 'lightgreen'], ax=ax.flat[i],   width=0.2, showcaps=True, fliersize=2, legend=False)

    # Remove default labels
    ax.flat[i].set_xlabel('')
    ax.flat[i].set_ylabel('')
    
    # Custom title with means
    mcf7_mean = df_MCF7_unfiltered[col].mean()
    df2_mean = df_HCC_unfiltered[col].mean()
    ax.flat[i].set_title(f"{col}\nMCF7 μ: {mcf7_mean:.0f}, HCC μ: {df2_mean:.0f}")


plt.suptitle("distribution of top 10 non-sparse genes in HCC dataset and comparison with MCF7 dataset", fontsize = 16)
plt.tight_layout(rect=[0, 0.03, 1, 0.97]) 
plt.show()

### SPARSITY by density of zeroes in datasets

We now consider a different and simpler definition of sparsity, namely the fraction of zeros in each column/gene. We look at how it compares with the previous results and at how the least sparse genes are distributed, including whether they are expressed more in hypoxic or normoxic cells.

In the following heatmap (for a random subset of genes) the zeros are represented as blue squares; the amount of blue indicates qualitatively that sparsity is a substantial issue in the dataset.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Choose a random subset of columns (here 500 genes)
num_cols = 500
random_cols_1 = np.random.choice(df_MCF7_unfiltered.columns, size=num_cols, replace=False)
random_cols_2 = np.random.choice(df_HCC_unfiltered.columns, size=num_cols, replace=False)

# Filter each dataset with its own random subset of columns
df1_sampled = df_MCF7_unfiltered[random_cols_1]
df2_sampled = df_HCC_unfiltered[random_cols_2]

# Compute zero masks
zero_mask1 = df1_sampled == 0
zero_mask2 = df2_sampled == 0

# Print sparsity index
sparsity1 = zero_mask1.sum().sum() / (df1_sampled.shape[0] * df1_sampled.shape[1])
sparsity2 = zero_mask2.sum().sum() / (df2_sampled.shape[0] * df2_sampled.shape[1])

print(f"Dataset MCF7 Sparsity Index: {sparsity1:.4f}")
print(f"Dataset HCC Sparsity Index: {sparsity2:.4f}")

# Plotting
plt.figure(figsize=(14, 8))

# Heatmap for Dataset 1
plt.subplot(2,1, 1)
sns.heatmap(zero_mask1, cbar=False, cmap=sns.color_palette(["white", "blue"]))
plt.title('Zero Values Heatmap - MCF7 Dataset')
plt.xlabel("Genes")
plt.ylabel("Samples")
plt.xticks([], [])
plt.yticks([], [])

# Heatmap for Dataset 2
plt.subplot(2, 1, 2)
sns.heatmap(zero_mask2, cbar=False, cmap=sns.color_palette(["white", "blue"]))
plt.title('Zero Values Heatmap - HCC Dataset')
plt.xlabel("Genes")
plt.ylabel("Samples")
plt.xticks([], [])
plt.yticks([], [])

plt.tight_layout()
plt.show()

In the following plot, we show the histogram of a per-gene sparsity index, defined as the fraction of zeros in a given column. Both datasets show two clear spikes, around 0 and 1 (that is, 0% and 100% zeros): some genes are almost identically zero while others are very dense, which makes the datasets well suited to feature selection.

In [None]:
# histogram of the sparsity index (fraction of zeros) across genes
# Calculate sparsity index for each gene (column) in both datasets
sparsity1 = (df_MCF7_unfiltered == 0).sum(axis=0) / df_MCF7_unfiltered.shape[0]
sparsity2 = (df_HCC_unfiltered == 0).sum(axis=0) / df_HCC_unfiltered.shape[0]

# Plot overlayed histograms
plt.figure(figsize=(10, 8))

plt.hist(sparsity1, bins=80, range=(0, 1), edgecolor='black', alpha=0.6, label='MCF7 Dataset', color='blue')
plt.hist(sparsity2, bins=80, range=(0, 1), edgecolor='black', alpha=0.6, label='HCC Dataset', color='lightgreen')

plt.title("Overlayed Histogram of Gene Sparsity Indices")
plt.xlabel("Sparsity Index (Fraction of Zero Counts)")
plt.ylabel("Number of Genes")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

The unfiltered dataset is very sparse, so it makes sense to apply dimensionality reduction as a preprocessing step before the unsupervised or supervised analysis. Since the columns corresponding to some genes are almost identically zero, it is natural to conjecture that a low-dimensional embedding of the datasets exists and is a good representation of the whole.
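
A simple way to act on this observation, before any embedding, is to discard the genes that are zero in almost every sample. The following is only a sketch on a toy table: the cutoff `max_zero_fraction = 0.8` is an arbitrary assumption and not a value used in our analysis.

```python
import pandas as pd

# toy count matrix: 6 samples x 4 genes (illustrative values only)
toy = pd.DataFrame(
    {
        "GENE_A": [12, 30, 7, 18, 25, 9],
        "GENE_B": [0, 0, 0, 0, 0, 0],
        "GENE_C": [0, 0, 0, 0, 0, 7],
        "GENE_D": [3, 0, 5, 2, 0, 4],
    }
)

max_zero_fraction = 0.8  # assumed cutoff, to be tuned on the real data

# fraction of zero entries per gene (the sparsity index)
zero_fraction = (toy == 0).mean(axis=0)

# keep only the genes whose sparsity index does not exceed the cutoff
filtered = toy.loc[:, zero_fraction <= max_zero_fraction]
print(filtered.columns.tolist())
```

A filter of this kind only removes uninformative columns; it does not replace methods such as PCA, which also exploit the correlation between the remaining genes.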

## Gene expression

In [None]:
# Calculate total counts per cell (row) for both datasets
df1_total_counts = df_MCF7_unfiltered.sum(axis=1)
df2_total_counts = df_HCC_unfiltered.sum(axis=1)

# Plot overlayed histograms with KDE
plt.figure(figsize=(10, 6))

sns.histplot(df1_total_counts, bins=70, kde=True, color='steelblue', label='MCF7 Dataset', alpha=0.6)
sns.histplot(df2_total_counts, bins=70, kde=True, color='darkorange', label='HCC Dataset', alpha=0.6)

plt.title("Overlayed Histogram of Total Gene Counts per Cell")
plt.xlabel("Total Counts per Cell")
plt.ylabel("Number of Samples")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In what follows, we compare gene expression across the two cell conditions (hypoxia vs normoxia) to understand whether any given gene appears predominantly in one condition and therefore acts as a good discriminant for our supervised analysis. To do so, we consider only the genes which are not sparse according to the second definition, and we extract the labels from the sample names.

In [None]:
sample_names = df_MCF7_unfiltered.T.columns
cell_type = []
for name in sample_names:
    if "Norm" in name:
        cell_type.append("Norm")
    elif "Hypo" in name:
        cell_type.append("Hypo")
df_MCF7_unfiltered["CellType"] = cell_type

sample_names = df_HCC_unfiltered.T.columns
cell_type = []
for name in sample_names:
    if "Norm" in name:
        cell_type.append("Norm")
    elif "Hypo" in name:
        cell_type.append("Hypo")
df_HCC_unfiltered["CellType"] = cell_type

print("MCF7 dataset")
# values are computed outside the f-strings: nested double quotes are a syntax error before Python 3.12
norm_share = (df_MCF7_unfiltered["CellType"] == "Norm").mean()  # fraction of normoxic samples
print(f"Checking for unbalanced classes \nNormoxia: {norm_share:.2%}, \t Hypoxia: {1 - norm_share:.2%}")
count_diff = (df_MCF7_unfiltered["CellType"] == "Norm").sum() - (df_MCF7_unfiltered["CellType"] == "Hypo").sum()
print(f"Difference in absolute count of samples with Normoxia vs (-) Hypoxia: {count_diff}")
print("the classes are almost perfectly balanced \n")

# Separate the features from the label
feature_columns_1 = df_MCF7_unfiltered.columns.difference(["CellType"])
# Group by the label and sum each group
grouped_sums_1 = df_MCF7_unfiltered.groupby("CellType")[feature_columns_1].sum() 
print(grouped_sums_1.head())

print("\nHCC dataset")
# values are computed outside the f-strings: nested double quotes are a syntax error before Python 3.12
norm_share = (df_HCC_unfiltered["CellType"] == "Norm").mean()  # fraction of normoxic samples
print(f"Checking for unbalanced classes \nNormoxia: {norm_share:.2%}, \t Hypoxia: {1 - norm_share:.2%}")
count_diff = (df_HCC_unfiltered["CellType"] == "Norm").sum() - (df_HCC_unfiltered["CellType"] == "Hypo").sum()
print(f"Difference in absolute count of samples with Normoxia vs (-) Hypoxia: {count_diff}")
print("the classes are almost perfectly balanced \n")

# Separate the features from the label
feature_columns_2 = df_HCC_unfiltered.columns.difference(["CellType"])
# Group by the label and sum each group
grouped_sums_2 = df_HCC_unfiltered.groupby("CellType")[feature_columns_2].sum() 
grouped_sums_2.head()
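
The label-extraction loop above silently skips any sample whose name contains neither tag, which would then produce a label list shorter than the dataframe. A stricter variant could look as follows; this is a sketch in which the helper name `label_from_name` and the toy sample names are hypothetical.

```python
import pandas as pd


def label_from_name(name: str) -> str:
    """Infer the cell condition from a sample name (hypothetical helper)."""
    if "Norm" in name:
        return "Norm"
    if "Hypo" in name:
        return "Hypo"
    # fail loudly instead of skipping the sample
    raise ValueError(f"cannot infer condition from sample name: {name!r}")


# toy dataframe with made-up sample names in the index
toy = pd.DataFrame(
    {"GENE_A": [5, 0, 3]},
    index=["s1_Norm_a", "s2_Hypo_b", "s3_Norm_c"],
)
labels = toy.index.to_series().map(label_from_name)

# class proportions, as a quick balance check
print(labels.value_counts(normalize=True))
```

Because the labels are built as a Series indexed by sample name, they stay aligned with the rows of the dataframe when assigned as a new column.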

In [None]:
top_n = 20

# Create subplots with 2 rows and 1 column
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# MCF7 dataset
total_sums_1 = grouped_sums_1.sum(axis=0)

# Sort columns by total sum and pick top N
top_columns_1 = total_sums_1.sort_values(ascending=False).head(top_n).index
top_data_1 = grouped_sums_1[top_columns_1]

# First subplot: MCF7 dataset
top_data_1.T.plot(kind='bar', stacked=True, ax=axes[0])
axes[0].set_title("Genes' Expression by Cell Condition - MCF7 Dataset", fontsize=18)
axes[0].set_xlabel('Genes')
axes[0].set_ylabel('Total Counts')
axes[0].legend(title='Cell Condition')
axes[0].grid(axis='y', linestyle='--', alpha=0.5)

# HCC dataset
total_sums_2 = grouped_sums_2.sum(axis=0)

# Sort columns by total sum and pick top N
top_columns_2 = total_sums_2.sort_values(ascending=False).head(top_n).index
top_data_2 = grouped_sums_2[top_columns_2]

# Second subplot: HCC dataset
top_data_2.T.plot(kind='bar', stacked=True, ax=axes[1])
axes[1].set_title("Genes' Expression by Cell Condition - HCC Dataset", fontsize=18)
axes[1].set_xlabel('Genes')
axes[1].set_ylabel('Total Counts')
axes[1].legend(title='Cell Condition')
axes[1].grid(axis='y', linestyle='--', alpha=0.5)

# Final layout adjustment
plt.tight_layout()
plt.show()

Similarly, in the following graph, we see how a given gene, for a particular condition (either hypoxia or normoxia), is expressed with different counts across the two cell lines MCF7 and HCC1806.

In [None]:
# 2. Get top genes from dataset 1
total_sums = grouped_sums_1.sum(axis=0)
top_genes = total_sums.sort_values(ascending=False).head(top_n).index

# 3. Subset both datasets
df1_top = grouped_sums_1[top_genes].copy()
df2_top = grouped_sums_2[top_genes].copy()

# 4. Add 'Condition' column before melting
df1_top['Condition'] = df1_top.index
df2_top['Condition'] = df2_top.index

# 5. Melt both to long format
df1_long = df1_top.melt(id_vars='Condition', var_name='Gene', value_name='Count')
df1_long['Dataset'] = 'MCF7 Dataset'
df2_long = df2_top.melt(id_vars='Condition', var_name='Gene', value_name='Count')
df2_long['Dataset'] = 'HCC Dataset'

# 6. Combine both
combined_long = pd.concat([df1_long, df2_long], ignore_index=True)

# 7. Create composite category: Gene + Condition
combined_long['Gene_Condition'] = combined_long['Gene'] + ' (' + combined_long['Condition'] + ')'

# 8. Plot side-by-side bars
plt.figure(figsize=(16, 6))
sns.barplot(
    data=combined_long,
    x='Gene_Condition',
    y='Count',
    hue='Dataset',
    palette=['blue', 'red']
)

# 9. Styling
plt.title("Gene Expression by Condition — HCC Dataset vs MCF7 Dataset", fontsize=16)
plt.xlabel("Gene (Condition)", fontsize=12)
plt.ylabel("Total Counts", fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(title='Dataset')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
df_HCC_unfiltered.drop(columns="CellType", inplace=True)
df_MCF7_unfiltered.drop(columns="CellType", inplace=True)

This suggests that some genes are more important than others for identifying hypoxic cells, since their expression is linked to one particular condition. Feature selection and dimensionality reduction might therefore be very helpful to remove genes that are expressed similarly in hypoxic and normoxic cells, which add no information on whether a sample belongs to one class or the other.
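
One common way to quantify how strongly a gene is tied to one condition is the log fold change between the class means. The sketch below illustrates it on synthetic counts; the pseudocount of 1 is an assumption made to avoid division by zero, and this is not necessarily the feature selection used in the supervised part.

```python
import numpy as np
import pandas as pd

# toy count matrix: 4 samples x 3 genes (illustrative values only)
toy = pd.DataFrame(
    {
        "GENE_A": [100, 120, 10, 12],  # higher under hypoxia
        "GENE_B": [50, 55, 52, 49],    # similar in both conditions
        "GENE_C": [1, 0, 30, 34],      # higher under normoxia
    },
    index=["c1", "c2", "c3", "c4"],
)
condition = pd.Series(["Hypo", "Hypo", "Norm", "Norm"], index=toy.index)

# mean expression of each gene within each condition
mean_by_condition = toy.groupby(condition).mean()

# log2 fold change Hypo vs Norm, with a pseudocount of 1 (assumption)
log2_fc = np.log2(
    (mean_by_condition.loc["Hypo"] + 1) / (mean_by_condition.loc["Norm"] + 1)
)

# genes ranked by how strongly they differ between conditions
ranking = log2_fc.abs().sort_values(ascending=False)
print(ranking)
```

Genes with a fold change close to zero, such as `GENE_B` here, are the ones that carry little information about the condition.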

## Evaluating correlation of samples and genes 

In [None]:
n_genes = 50
top_genes_1 = df_MCF7_unfiltered.iloc[:, :-2].sum().sort_values(ascending=False).head(n_genes).index
sampled_cells_1 = df_MCF7_unfiltered.sample(25)
truncated_rows_1 = [i[12:27] for i in sampled_cells_1.index.tolist()]

top_genes_2 = df_HCC_unfiltered.iloc[:, :-2].sum().sort_values(ascending=False).head(n_genes).index
sampled_cells_2 = df_HCC_unfiltered.sample(25)
truncated_rows_2 = [i[12:38] for i in sampled_cells_2.index.tolist()]

In [None]:
# Plotting
fig, axes = plt.subplots(2, 1, figsize=(14, 12))

# Heatmap for Dataset 1 (MCF7)
sns.heatmap(sampled_cells_1[top_genes_1].corr(), ax=axes[0], cbar=True, cmap="coolwarm")
axes[0].set_title(f'Gene correlation matrix: Top {n_genes} Genes in Sampled Cells - MCF7 Dataset')
axes[0].set_xlabel("Genes")
axes[0].set_ylabel("Genes")

# Heatmap for Dataset 2 (HCC)
sns.heatmap(sampled_cells_2[top_genes_2].corr(), ax=axes[1], cbar=True, cmap="coolwarm")
axes[1].set_title(f'Gene correlation matrix: Top {n_genes} Genes in Sampled Cells - HCC Dataset')
axes[1].set_xlabel("Genes")
axes[1].set_ylabel("Genes")

plt.tight_layout()
plt.show()

In [None]:
# Scale and compute distances for MCF7                                           
scaler = StandardScaler()
df_scaled_1 = scaler.fit_transform(df_MCF7_unfiltered[top_genes_1].T)
distance_matrix_1 = pdist(df_scaled_1, metric='correlation')
distance_square_1 = squareform(distance_matrix_1)

# Scale and compute distances for HCC
df_scaled_2 = scaler.fit_transform(df_HCC_unfiltered[top_genes_2].T)
distance_matrix_2 = pdist(df_scaled_2, metric='correlation')
distance_square_2 = squareform(distance_matrix_2)

# Create two clustermaps separately
g1 = sns.clustermap(distance_square_1, cmap='coolwarm', xticklabels=top_genes_1, yticklabels=top_genes_1)
g1.cax.set_position([.99, .08, .03, .74])
g2 = sns.clustermap(distance_square_2, cmap='coolwarm', xticklabels=top_genes_2, yticklabels=top_genes_2)
g2.cax.set_position([.99, .08, .03, .74])

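# Titles: the clustermaps show correlation distances (1 - correlation) between the top n_genes genes
g1.fig.suptitle(f'Gene Correlation Distance Matrix - MCF7 Dataset (Top {n_genes} Genes)', y=1.05, fontsize=16)
g2.fig.suptitle(f'Gene Correlation Distance Matrix - HCC Dataset (Top {n_genes} Genes)', y=1.05, fontsize=16)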

plt.show()

In [None]:
# Sample 100 cells
sampled_cells_1 = df_MCF7_unfiltered.sample(100)
sampled_cells_2 = df_HCC_unfiltered.sample(100)

# Select top 20 genes by total expression
top_genes_1 = df_MCF7_unfiltered.iloc[:, :-2].sum().sort_values(ascending=False).head(20).index
top_genes_2 = df_HCC_unfiltered.iloc[:, :-2].sum().sort_values(ascending=False).head(20).index

# Extract expression data for top genes
sampled_data_1 = sampled_cells_1[top_genes_1]
sampled_data_2 = sampled_cells_2[top_genes_2]

# Compute sample-to-sample correlation: transpose so that samples become columns before calling .corr()
sample_corr_1 = sampled_data_1.T.corr(method='pearson')
sample_corr_2 = sampled_data_2.T.corr(method='pearson')

# Truncate sample names
truncated_names_1 = [idx[12:27] for idx in sample_corr_1.index]
truncated_names_2 = [idx[12:38] for idx in sample_corr_2.index]

# Apply truncated names to both axes
sample_corr_1.index = truncated_names_1
sample_corr_1.columns = truncated_names_1
sample_corr_2.index = truncated_names_2
sample_corr_2.columns = truncated_names_2

# Plotting
fig, axes = plt.subplots(2, 1, figsize=(14, 14))

# MCF7 correlation heatmap
sns.heatmap(sample_corr_1, ax=axes[0], cmap="coolwarm", cbar=True)
axes[0].set_title("Sample Correlation Heatmap - MCF7 (Top 20 Genes)")
axes[0].set_xlabel("Samples")
axes[0].set_ylabel("Samples")

# HCC correlation heatmap
sns.heatmap(sample_corr_2, ax=axes[1], cmap="coolwarm", cbar=True)
axes[1].set_title("Sample Correlation Heatmap - HCC (Top 20 Genes)")
axes[1].set_xlabel("Samples")
axes[1].set_ylabel("Samples")

plt.tight_layout()
plt.show()

The samples are not really independent, as can be seen from the bright colours of the heatmap, and the genes also show some dependence on each other. This suggests that the dimensionality could be reduced with PCA or other techniques without substantial loss of information.
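
To illustrate why correlated features allow such a reduction, the following sketch applies PCA to synthetic data in which 30 features are driven by only 3 latent factors. The data, the noise level and the number of components are assumptions chosen for illustration and are unrelated to the real datasets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic data: 200 samples, 30 correlated features built from 3 latent factors
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 30))
X = latent @ loadings + 0.05 * rng.normal(size=(200, 30))

# standardise the features, then fit PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10).fit(X_scaled)

# cumulative share of variance explained by the first components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative[:5], 3))
```

When the cumulative explained variance saturates after a few components, as it does in this construction, the remaining dimensions mostly contain noise and can be dropped.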