<a href="https://colab.research.google.com/github/DPariser/DataScience/blob/main/Preprocessing/052423_DNP1_Combined_h5ad_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This is  used to time the running of the notebook
import time
start_time = time.time()

In [None]:
# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally
%%capture
!pip install matplotlib
!pip install scikit-learn
!pip install numpy
!pip install scipy
!pip install scanpy
!pip install anndata
!pip3 install leidenalg

In [3]:
# Install packages for analysis and plotting
from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import os
import scanpy as sc
import anndata as ad
import pandas as pd
import seaborn as sns
import xml.etree.ElementTree as ET

from scipy.sparse import csr_matrix
matplotlib.rcParams.update({'font.size': 22})
%config InlineBackend.figure_format = 'retina'

## ❗**Connect to the Data**

The data is stored on a shared location in Google Drive. Since many of the files are very large and thus it is not feasable to download them to a location and use them. One good way of dealing with this situation is to create a shortcut to your own Google Drive and point to the shortcut and use them just like they are your own files on Google Drive. Here is the instruction how to set this up.

* Click on the link to the share location of the data.
* Nevigate to the "Data files" folder.
* Click on the "Dropdown" arrow right next to the breaksrumb on the top right.
* Choose "Add shortcut to Drive".

Now it should appear in your Google Drive as the "Data files" folder.
You can now connect to your Google Drive and access the file.
From this point on, we assume that you have the Google Drive setup this way.

Let's mount the Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Google drive root
gd_root = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics"

# Data roots
patient_root = f"{gd_root}/H17_LungMk/Data_files/HRA001149/HRR339729"
lungmk_root = f"{gd_root}/H17_LungMk/LungMk"

# Working directories
patient_dir = f"{patient_root}"
lungmk_dir = f"{lungmk_root}"

# Create the directories if they don't exist
!mkdir -p "{patient_dir}"
!mkdir -p "{lungmk_dir}"

# List the contents of the directories
print("Contents of patient directory:")
!ls "{patient_dir}"
print("\nContents of LungMk directory:")
!ls "{lungmk_dir}"

Contents of patient directory:
10xv2_whitelist.txt		 HRR339729_r2.fastq.gz	output.bus
counts_unfiltered		 HRR339729_sta.xml	output.unfiltered.bus
filtered_normalized_counts.h5ad  inspect.json		run_info.json
HRR339729_f1.fastq.gz		 matrix.ec		transcripts.txt

Contents of LungMk directory:
10xv2_whitelist.txt  inspect.json	    run_info.json
counts_unfiltered    matrix.ec		    t2g.txt
GRCh38genome.idx     output.bus		    transcripts.txt
index.idx	     output.unfiltered.bus  v1nm7lpnqz5syh8dyzdk2zs8bglncfib.gz


This code performs an analysis of patient data stored in filtered_normalized_counts.h5ad files within a specific directory. It starts by importing the necessary libraries. The goal is to extract relevant information for each patient, including cell count, UMI count, gene count, and mitochondrial percentage. The code defines a function called get_patient_data that retrieves and organizes the data for each patient. It iterates through the patient IDs, checks if the corresponding file exists, and loads it using anndata. The required metrics are calculated from the loaded data. The patient data is then stored in a DataFrame. The code sorts the DataFrame based on cell count in descending order and prints the sorted data. Finally, the sorted data is saved as a LaTeX file. This code allows for the efficient extraction, analysis, and sorting of patient data, facilitating further investigations or reporting.

In [None]:
# Identify all the patients so we can loop through them in the cell below
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"

# List the directories in the folder
directories = [d for d in os.listdir(folder_path) if os.path.isdir(os.path.join(folder_path, d))]

# Use the directories as patient_ids
patient_ids = directories

def get_patient_data(patient_ids, folder_path):
    # Open the LaTeX file
    with open("/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/patient_data.tex", "w") as f:
        # Write the header
        f.write("\\begin{tabular}{lrrrr}\n")
        f.write("\\textbf{Patient ID} & \\textbf{Cell Count} & \\textbf{UMI Count} & \\textbf{Gene Count} & \\textbf{Mito \%} \\\\ \\hline\n")

        for patient_id in patient_ids:
            patient_root = f"{folder_path}/{patient_id}"
            filtered_normalized_counts_file = f"{patient_root}/filtered_normalized_counts.h5ad"

            # Check if the filtered and normalized counts file exists
            if os.path.exists(filtered_normalized_counts_file):
                # Load the h5ad file
                adata = ad.read_h5ad(filtered_normalized_counts_file)

                # Calculate the cell count
                cell_count = adata.shape[0]

                # Calculate total UMI count
                umi_count = adata.X.sum()

                # Calculate total gene count
                gene_count = adata.shape[1]

                # Calculate mito % per patient
                mito_genes = adata.var_names.str.startswith('MT-')
                mito_percent = np.sum(adata[:, mito_genes].X) / np.sum(adata.X)

                # Write the data for this patient to the file
                f.write(f"{patient_id} & {cell_count} & {umi_count} & {gene_count} & {mito_percent:.2f} \\\\ \n")

        # Write the footer
        f.write("\\end{tabular}")

# Call the function with your patient_ids and folder_path
get_patient_data(patient_ids, folder_path)

  mito_percent = np.sum(adata[:, mito_genes].X) / np.sum(adata.X)


Here is a piece of code that will go through all the patients and print the IDs of those with more than 2000 cells

In [None]:
# Identify all the patients
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"

# List the directories in the folder
directories = [d for d in os.listdir(folder_path) if os.path.isdir(os.path.join(folder_path, d))]

# Use the directories as patient_ids
patient_ids = directories

for patient_id in patient_ids:
    patient_root = f"{folder_path}/{patient_id}"
    filtered_normalized_counts_file = f"{patient_root}/filtered_normalized_counts.h5ad"

    # Check if the filtered and normalized counts file exists
    if os.path.exists(filtered_normalized_counts_file):
        # Load the h5ad file
        adata = ad.read_h5ad(filtered_normalized_counts_file)

        # Calculate the cell count
        cell_count = adata.shape[0]

        # If the cell count is above 2000, print the patient ID
        if cell_count > 2000:
            print(f"Patient {patient_id} has {cell_count} cells.")


Patient HRR339741 has 4390 cells.
Patient HRR339743 has 5175 cells.
Patient HRR339748 has 4847 cells.
Patient HRR339751 has 6052 cells.
Patient HRR339754 has 4922 cells.
Patient HRR339757 has 5752 cells.
Patient HRR339760 has 5193 cells.
Patient HRR339763 has 7206 cells.
Patient HRR339787 has 6195 cells.
Patient HRR339790 has 5372 cells.


Here we combine h5ad files from the patients with the most cells. However, due to memory constraints, we will chunk the data. We append the AnnData objects from each h5ad file one by one. It will then save the combined data to a new h5ad file.

In this script, each AnnData object is loaded and then concatenated to a main AnnData object, which is then written back to disk after each iteration. This allows us to keep only one AnnData object in memory at a time.

Note: Be aware that this operation may take some time depending on the size of your data.

*   I tried this method with all top ten of the patients with the highest cell count and it was too much RAM
* Originally I had it at >2,000 cells, changed it to >5,300 cells



In [None]:
# Identify all the patients
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"
save_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# List the directories in the folder
directories = [d for d in os.listdir(folder_path) if os.path.isdir(os.path.join(folder_path, d))]

# Use the directories as patient_ids
patient_ids = directories

# Initialize combined_adata as an empty AnnData object
combined_adata = ad.AnnData(X=np.empty((0,0)))

for patient_id in patient_ids:
    patient_root = f"{folder_path}/{patient_id}"
    filtered_normalized_counts_file = f"{patient_root}/filtered_normalized_counts.h5ad"

    # Check if the filtered and normalized counts file exists
    if os.path.exists(filtered_normalized_counts_file):
        # Load the h5ad file
        adata = ad.read_h5ad(filtered_normalized_counts_file)

        # Calculate the cell count
        cell_count = adata.shape[0]

        # If the cell count is above 5300, include the patient data in the combined file
        if cell_count > 5300:
            print(f"Adding data for patient {patient_id}...")
            adata.obs['batch'] = patient_id  # Assign batch category for the current patient

            if combined_adata.n_obs == 0:
                combined_adata = adata
            else:
                combined_adata = combined_adata.concatenate(adata, index_unique=None, batch_key='batch')

# Write the final combined dataset to disk
if combined_adata.n_obs > 0:
    combined_adata.write(save_path)

print("Combination of all files finished.")


Adding data for patient HRR339751...
Adding data for patient HRR339757...



See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html
  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")



It seems that we can only get through HRR339751 and HRR339757, thus we are going to just focus on combining these two

In [1]:
# Identify all the patients
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"
save_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# Specify the two patients
patient_ids = ["HRR339751", "HRR339757"]

# Initialize combined_adata as an empty AnnData object
combined_adata = ad.AnnData(X=np.empty((0,0)))

for patient_id in patient_ids:
    patient_root = f"{folder_path}/{patient_id}"
    filtered_normalized_counts_file = f"{patient_root}/filtered_normalized_counts.h5ad"

    # Check if the filtered and normalized counts file exists
    if os.path.exists(filtered_normalized_counts_file):
        # Load the h5ad file
        adata = ad.read_h5ad(filtered_normalized_counts_file)

        print(f"Adding data for patient {patient_id}...")
        adata.obs['batch'] = patient_id  # Assign batch category for the current patient

        if combined_adata.n_obs == 0:
            combined_adata = adata
        else:
            combined_adata = combined_adata.concatenate(adata, index_unique=None, batch_key='batch')

# Write the final combined dataset to disk
if combined_adata.n_obs > 0:
    combined_adata.write(save_path)

print("Combination of all files finished.")

Adding data for patient HRR339751...
Adding data for patient HRR339757...



See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html
  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")


Combination of all files finished.


I tried adding more patients to the combined data

RAM crashed with these patients
*   HRR339763 

Patients Added:

*   HRR339751
*   HRR339757
* HRR339741

Patient HRR339743 has 5175 cells.
Patient HRR339748 has 4847 cells.
Patient HRR339754 has 4922 cells.
Patient HRR339760 has 5193 cells.
Patient HRR339763 has 7206 cells.
Patient HRR339787 has 6195 cells.
Patient HRR339790 has 5372 cells.



In [1]:
# Paths to the files
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"
save_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"
combined_data_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# Load the combined data
combined_adata = ad.read_h5ad(combined_data_path)

# Add the new patient
new_patient_id = "HRR339741"
new_patient_root = f"{folder_path}/{new_patient_id}"
new_patient_file = f"{new_patient_root}/filtered_normalized_counts.h5ad"

# Check if the new patient's file exists
if os.path.exists(new_patient_file):
    # Load the h5ad file
    new_patient_adata = ad.read_h5ad(new_patient_file)

    print(f"Adding data for patient {new_patient_id}...")
    new_patient_adata.obs['batch'] = new_patient_id  # Assign batch category for the current patient

    combined_adata = combined_adata.concatenate(new_patient_adata, index_unique=None, batch_key='batch')

    # Write the final combined dataset to disk
    combined_adata.write(save_path)

    print("Addition of the new patient's data finished.")
else:
    print(f"No data found for patient {new_patient_id}.")


  utils.warn_names_duplicates("obs")


Adding data for patient HRR339741...



See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html
  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")


Addition of the new patient's data finished.


I tried adding more patients to the combined data

RAM crashed with these patients
*   HRR339763 
* HRR339743

Patients Added:

*   HRR339751
*   HRR339757
* HRR339741
* HRR339748

Patient HRR339754 has 4922 cells.
Patient HRR339760 has 5193 cells.
Patient HRR339763 has 7206 cells.
Patient HRR339787 has 6195 cells.
Patient HRR339790 has 5372 cells.

In [1]:
# Paths to the files
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"
save_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"
combined_data_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# Load the combined data
combined_adata = ad.read_h5ad(combined_data_path)

# Add the new patient
new_patient_id = "HRR339748"
new_patient_root = f"{folder_path}/{new_patient_id}"
new_patient_file = f"{new_patient_root}/filtered_normalized_counts.h5ad"

# Check if the new patient's file exists
if os.path.exists(new_patient_file):
    # Load the h5ad file
    new_patient_adata = ad.read_h5ad(new_patient_file)

    print(f"Adding data for patient {new_patient_id}...")
    new_patient_adata.obs['batch'] = new_patient_id  # Assign batch category for the current patient

    combined_adata = combined_adata.concatenate(new_patient_adata, index_unique=None, batch_key='batch')

    # Write the final combined dataset to disk
    combined_adata.write(save_path)

    print("Addition of the new patient's data finished.")
else:
    print(f"No data found for patient {new_patient_id}.")


  utils.warn_names_duplicates("obs")


Adding data for patient HRR339748...



See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html
  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")


Addition of the new patient's data finished.


I tried adding more patients to the combined data

RAM crashed with these patients
*   HRR339763 
* HRR339743
* HRR339754
* HRR339760
* HRR339763
* HRR339787
* HRR339790

Patients Added:

*   HRR339751
*   HRR339757
* HRR339741
* HRR339748

In [None]:
# Paths to the files
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"
save_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"
combined_data_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# Load the combined data
combined_adata = ad.read_h5ad(combined_data_path)

# Add the new patient
new_patient_id = "HRR339790"
new_patient_root = f"{folder_path}/{new_patient_id}"
new_patient_file = f"{new_patient_root}/filtered_normalized_counts.h5ad"

# Check if the new patient's file exists
if os.path.exists(new_patient_file):
    # Load the h5ad file
    new_patient_adata = ad.read_h5ad(new_patient_file)

    print(f"Adding data for patient {new_patient_id}...")
    new_patient_adata.obs['batch'] = new_patient_id  # Assign batch category for the current patient

    combined_adata = combined_adata.concatenate(new_patient_adata, index_unique=None, batch_key='batch')

    # Write the final combined dataset to disk
    combined_adata.write(save_path)

    print("Addition of the new patient's data finished.")
else:
    print(f"No data found for patient {new_patient_id}.")


  utils.warn_names_duplicates("obs")


Adding data for patient HRR339790...



See the tutorial for concat at: https://anndata.readthedocs.io/en/latest/concatenation.html


In [3]:
import anndata as ad

# Identify all the patients
combined_data_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# Load the combined data
combined_adata = ad.read_h5ad(combined_data_path)

# Get patient IDs
patient_ids = combined_adata.obs['batch'].unique()

# Print patient IDs and cell counts
for patient_id in patient_ids:
    cell_count = sum(combined_adata.obs['batch'] == patient_id)
    print(f'Patient ID: {patient_id}, Cell Count: {cell_count}')


Patient ID: 0, Cell Count: 16194
Patient ID: 1, Cell Count: 4847


  utils.warn_names_duplicates("obs")


In [None]:
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt

# Path to the combined data file
combined_data_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Combined_Data/combined_data.h5ad"

# Load the combined data
adata = sc.read_h5ad(combined_data_path)

# Calculate n_genes, n_counts, percent_mito for each cell
adata.obs['n_genes'] = adata.X.getnnz(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # assuming mitochondrial genes are prefixed with 'MT-'
adata.obs['percent_mito'] = np.sum(
    adata[:, adata.var['mt']].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

# Prepare a figure to plot the violin plots
fig, axs = plt.subplots(3, 1, figsize=(5, 15))

# A list of metrics to plot
metrics = ['n_genes', 'n_counts', 'percent_mito']

# Loop through each metric and plot a violin plot
for ax, metric in zip(axs, metrics):
    # Create a boxplot for each batch (patient)
    sc.pl.violin(adata, metric, groupby='batch', ax=ax, show=False)

plt.tight_layout()
plt.show()
