<a href="https://colab.research.google.com/github/DPariser/DataScience/blob/main/042923_DNP2_QC_and_Pre_Processing_FASTQ_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Please DO NOT RUN THIS
## Re-running this notebook will overwrite the existing files in Google Drive. Please treat it as VIEW ONLY.

# Setup Environment

This notebook was created in a fresh environment with nothing installed explicitly beyond what is shown here, so it should run out of the box if you follow the instructions exactly.

In [1]:
# This is used to time the running of the notebook
import time
start_time = time.time()

In [2]:
%%capture
# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally
!pip install matplotlib
!pip install scikit-learn
!pip install numpy
!pip install scipy
!pip install scanpy
!pip install anndata
!pip3 install leidenalg

In [3]:
# Import packages for analysis and plotting
from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import os
import scanpy as sc
import anndata
import pandas as pd
import seaborn as sns
import xml.etree.ElementTree as ET

from scipy.sparse import csr_matrix
matplotlib.rcParams.update({'font.size': 22})
%config InlineBackend.figure_format = 'retina'

In [4]:
%%time
%%capture
# `kb` is a wrapper for the kallisto and bustools programs, and the kb-python package contains the kallisto and bustools executables.
!pip install kb-python==0.24.1

CPU times: user 151 ms, sys: 18.6 ms, total: 170 ms
Wall time: 16.8 s


## ❗**Connect to the Data**

The data is stored in a shared location on Google Drive. Many of the files are very large, so downloading them is not feasible. Instead, create a shortcut to the shared folder in your own Google Drive and use the files through that shortcut as if they were your own. To set this up:

* Click on the link to the shared location of the data.
* Navigate to the "Data files" folder.
* Click on the "Dropdown" arrow right next to the breadcrumb on the top right.
* Choose "Add shortcut to Drive".

Now it should appear in your Google Drive as the "Data files" folder.
You can now connect to your Google Drive and access the file.
From this point on, we assume that you have Google Drive set up this way.

Let's mount the Google Drive:

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Google drive root
gd_root = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics"

# Data roots
patient_root = f"{gd_root}/H17_LungMk/Data_files/HRA001149/HRR339729"
lungmk_root = f"{gd_root}/H17_LungMk/LungMk"

# Working directories
patient_dir = f"{patient_root}"
lungmk_dir = f"{lungmk_root}"

# Create the directories if they don't exist
!mkdir -p "{patient_dir}"
!mkdir -p "{lungmk_dir}"

# List the contents of the directories
print("Contents of patient directory:")
!ls "{patient_dir}"
print("\nContents of LungMk directory:")
!ls "{lungmk_dir}"

Contents of patient directory:
10xv2_whitelist.txt		 HRR339729_r2.fastq.gz	output.bus
counts_unfiltered		 HRR339729_sta.xml	output.unfiltered.bus
filtered_normalized_counts.h5ad  inspect.json		run_info.json
HRR339729_f1.fastq.gz		 matrix.ec		transcripts.txt

Contents of LungMk directory:
10xv2_whitelist.txt  inspect.json	    run_info.json
counts_unfiltered    matrix.ec		    t2g.txt
GRCh38genome.idx     output.bus		    transcripts.txt
index.idx	     output.unfiltered.bus  v1nm7lpnqz5syh8dyzdk2zs8bglncfib.gz


In [7]:
# Check if the directories exist
if os.path.exists(lungmk_dir):
    print(f"The directory {lungmk_dir} exists.")
else:
    print(f"The directory {lungmk_dir} does not exist.")

if os.path.exists(patient_dir):
    print(f"The directory {patient_dir} exists.")
else:
    print(f"The directory {patient_dir} does not exist.")

The directory /content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/LungMk exists.
The directory /content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149/HRR339729 exists.


# Quality Control

## need to edit this for looping

* https://colab.research.google.com/github/pachterlab/kallistobustools/blob/master/docs/tutorials/kb_getting_started/python/kb_intro_2_python.ipynb

## Filtering cells based on count
Preliminary counts were then used for downstream analysis. Quality control was applied to cells step by step, based on three metrics: total UMI counts, number of detected genes, and proportion of mitochondrial gene counts per cell. Specifically, cells with fewer than 1,500 UMI counts or fewer than 500 detected genes were removed, as were cells with more than 10% mitochondrial gene counts.
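
As a rough illustration of these three filters, the sketch below applies them to a tiny made-up count matrix with NumPy. The gene names, the counts, and the lowered gene threshold are assumptions chosen to keep the toy example readable; the notebook itself relies on Scanpy's `filter_cells` for the real data.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are UMI counts).
# Gene names and counts are made up for illustration.
gene_names = np.array(["MT-CO1", "MT-ND1", "ACTB", "GAPDH"])
counts = np.array([
    [ 50,  30, 1200, 900],   # passes all three filters
    [400, 300,  500, 600],   # fails: mitochondrial fraction too high
    [  5,   0,  300, 200],   # fails: too few UMI counts
])

MIN_COUNTS = 1500          # minimum total UMI counts per cell
MIN_GENES = 3              # toy value; the text uses 500 on real data
MAX_MITO_FRACTION = 0.10   # cells above 10% mitochondrial counts are dropped

total_counts = counts.sum(axis=1)
detected_genes = (counts > 0).sum(axis=1)
is_mito = np.char.startswith(gene_names, "MT-")
mito_fraction = counts[:, is_mito].sum(axis=1) / total_counts

# A cell is kept only if it passes every filter.
keep = (
    (total_counts >= MIN_COUNTS)
    & (detected_genes >= MIN_GENES)
    & (mito_fraction < MAX_MITO_FRACTION)
)
print(keep)  # [ True False False]
```

Building one boolean mask per metric makes it easy to report how many cells each filter removes before combining them.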

## Remove potential doublets (double balloon effect)

This is what the investigators did in the original paper:


*   To remove potential doublets, for PBMC samples, cells with UMI counts above 25,000 and detected genes above 5,000 are filtered out. For other tissues, cells with UMI counts above 70,000 and detected genes above 7,500 are filtered out. Additionally, we applied Scrublet (Wolock et al., 2019) to identify potential doublets. The doublet score for each single cell and the threshold based on the bimodal distribution were calculated using default parameters. The expected doublet rate was set to 0.08, and cells predicted to be doublets or with a doubletScore larger than 0.25 were filtered. After quality control, a total of 1,598,708 cells remained.
*   For now, we use only the PBMC count and gene thresholds, applied to all tissues; Scrublet is not run.
*  *We may revisit this later*
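
The count-based thresholds quoted above can be sketched as a small lookup, shown here in plain Python. The function name `is_potential_doublet` and the `"other"` key are hypothetical, and the sketch assumes that exceeding either threshold is enough to flag a cell, which is how the sequential `filter_cells` calls in this notebook behave. Scrublet scoring is not covered.

```python
# Upper thresholds quoted above: (max UMI counts, max detected genes).
DOUBLET_THRESHOLDS = {
    "PBMC": (25_000, 5_000),
    "other": (70_000, 7_500),
}

def is_potential_doublet(n_counts, n_genes, tissue="PBMC"):
    """Flag a cell whose UMI count or gene count exceeds the tissue threshold.

    Assumption: exceeding either threshold is enough to flag the cell.
    Tissues without their own entry fall back to the "other" thresholds.
    """
    max_counts, max_genes = DOUBLET_THRESHOLDS.get(tissue, DOUBLET_THRESHOLDS["other"])
    return n_counts > max_counts or n_genes > max_genes

print(is_potential_doublet(30_000, 4_000, tissue="PBMC"))  # True
print(is_potential_doublet(30_000, 4_000, tissue="lung"))  # False
```

The same cell is flagged under the PBMC thresholds but not under the looser ones, so applying the PBMC values to every tissue is the stricter choice.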

## Normalize
For normalization of UMI counts, the Scanpy package provides several methods, including total-count normalization and log transformation, which are commonly used in single-cell RNA-seq analysis. Here, we first load the count matrix using Scanpy's `read_mtx` function. We then normalize the data using the `normalize_total` function, which scales the counts for each cell so that every cell has the same total count (in this case, 10,000). Next, we logarithmically transform the data using the `log1p` function, and finally standardize each gene to zero mean and unit variance using the `scale` function.
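
To make the arithmetic behind these steps concrete, here is a minimal NumPy sketch on a made-up 3 x 3 matrix. It mirrors the order used in this notebook (total-count normalization, log transform, per-gene scaling) but does not call Scanpy; the use of the sample standard deviation (`ddof=1`) is an assumption of the sketch.

```python
import numpy as np

# Toy matrix: 3 cells x 3 genes of raw UMI counts (made-up values).
counts = np.array([
    [10.0, 30.0, 60.0],
    [ 1.0,  1.0,  2.0],
    [ 5.0,  0.0,  5.0],
])
TARGET_SUM = 1e4

# 1. Total-count normalization: every cell sums to TARGET_SUM afterwards.
normalized = counts / counts.sum(axis=1, keepdims=True) * TARGET_SUM

# 2. Log transform: log(1 + x), which keeps zeros at zero.
logged = np.log1p(normalized)

# 3. Per-gene scaling: each column gets zero mean and unit variance.
scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)

print(normalized.sum(axis=1))
print(scaled.mean(axis=0).round(6))
```

Note that the first two steps act on each cell (row) independently, while the last step acts on each gene (column) across all cells, so its result depends on which cells are present.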

# DO NOT RUN THIS UNLESS NEEDED!
This code deletes the post-processed .h5ad file if a new one needs to be created for any reason.

In [None]:
import os

# Google drive root
gd_root = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics"

# Data roots
data_root = f"{gd_root}/H17_LungMk/Data_files/HRA001149"
lungmk_root = f"{gd_root}/H17_LungMk/LungMk"

# Working directories
data_dir = f"{data_root}"
lungmk_dir = f"{lungmk_root}"

# Create the directories if they don't exist
os.makedirs(data_dir, exist_ok=True)
os.makedirs(lungmk_dir, exist_ok=True)

# List the contents of the directories
print("Contents of data directory:")
!ls "{data_dir}"
print("\nContents of LungMk directory:")
!ls "{lungmk_dir}"

# Loop through the patients
patient_ids = [d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d))]

for patient_id in patient_ids:
    # Patient directories
    patient_root = f"{data_dir}/{patient_id}"
    patient_dir = f"{patient_root}"
    
    # Check if the file exists and delete it
    filtered_normalized_counts_file = f"{patient_dir}/filtered_normalized_counts.h5ad"
    if os.path.exists(filtered_normalized_counts_file):
        print(f"Deleting {filtered_normalized_counts_file}")
        os.remove(filtered_normalized_counts_file)
        print("File deleted.")
    else:
        print(f"File {filtered_normalized_counts_file} does not exist.")


# DO NOT RUN THIS UNLESS NEEDED!
This code filters and normalizes each patient's counts and writes a `filtered_normalized_counts.h5ad` file to that patient's folder. Patients that already have this file are skipped.

In [None]:

# Identify all the patients so we can loop through them in the cell below
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"

# List the directories in the folder
directories = [d for d in os.listdir(folder_path) if os.path.isdir(os.path.join(folder_path, d))]

# Print the directories
print(directories)

# Use the directories as patient_ids
patient_ids = directories

for patient_id in patient_ids:
    patient_root = f"{folder_path}/{patient_id}"
    patient_dir = f"{patient_root}"
    
    # Create the directory if it doesn't exist
    !mkdir -p "{patient_dir}"
    
    # Check if the directory exists
    if os.path.exists(patient_dir):
        print(f"The directory {patient_dir} exists.")
    else:
        print(f"The directory {patient_dir} does not exist.")
        
# Function to filter and normalize the data based on the given criteria
def filter_and_normalize_data(patient_dir):
    try:
        adata = sc.read_mtx(f"{patient_dir}/counts_unfiltered/cells_x_genes.mtx")
        adata.var_names = pd.read_csv(f"{patient_dir}/counts_unfiltered/cells_x_genes.genes.txt", header=None, sep='\t')[0]
        adata.obs_names = pd.read_csv(f"{patient_dir}/counts_unfiltered/cells_x_genes.barcodes.txt", header=None)[0]
    except FileNotFoundError:
        print(f"File not found error encountered for patient data in {patient_dir}. Skipping.")
        return
    
    print(f"Initial number of cells: {adata.shape[0]}, Initial number of genes: {adata.shape[1]}")
    
    # Filter cells based on total UMI counts, number of detected genes, and proportion of mitochondrial gene counts per cell
    sc.pp.filter_cells(adata, min_counts=1500)
    sc.pp.filter_cells(adata, min_genes=500)
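    # Assumes gene names are symbols, so mitochondrial genes start with 'MT-'
    mito_genes = adata.var_names.str.startswith('MT-')
    adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
    # Copy so that the in-place filters below act on real data rather than a view
    adata = adata[adata.obs['percent_mito'] < 0.1, :].copy()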
    
    print(f"Number of cells after filtering by UMI counts, detected genes, and mitochondrial gene counts: {adata.shape[0]}")
    
    # Filter cells based on potential doublets
    sc.pp.filter_cells(adata, max_counts=25000)
    sc.pp.filter_cells(adata, max_genes=5000)
    
    print(f"Number of cells after filtering by potential doublets: {adata.shape[0]}")
    
    if adata.shape[0] == 0:
        print(f"No cells remaining after filtering for patient data in {patient_dir}. Skipping.")
        return
    
    try:
        # Normalize the data using Total Count Normalization (TCN) method
        sc.pp.normalize_total(adata, target_sum=1e4)

        # Logarithmically transform the data
        sc.pp.log1p(adata)

        # Scale each gene to zero mean and unit variance
        sc.pp.scale(adata)

    except ZeroDivisionError:
        print(f"ZeroDivisionError encountered for patient data in {patient_dir}. Skipping.")
        return
    
    # Calculate the remaining number of cells after filtering
    num_cells = adata.shape[0]
    print(f"Number of cells remaining after filtering: {num_cells}")

    # Save the filtered and normalized data in the patient's original folder
    adata.write(f"{patient_dir}/filtered_normalized_counts.h5ad")

# Loop through each patient and filter and normalize the data
for patient_id in patient_ids:
    patient_root = f"{folder_path}/{patient_id}"
    print(f"Processing patient {patient_id}")

    # Check if the filtered and normalized counts file already exists
    filtered_normalized_counts_file = f"{patient_root}/filtered_normalized_counts.h5ad"
    if os.path.exists(filtered_normalized_counts_file):
        print(f"Filtered and normalized counts file for patient {patient_id} already exists. Skipping.")
        continue

    # If the filtered and normalized counts file doesn't exist, process, filter and normalize the data
    filter_and_normalize_data(patient_root)

## Combine all patient files into one combined data file (.h5ad)

In [None]:
import os
import scanpy as sc
import anndata

# Identify all the patients so we can loop through them
folder_path = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149"

# List the directories in the folder
directories = [d for d in os.listdir(folder_path) if os.path.isdir(os.path.join(folder_path, d))]

# Use the directories as patient_ids
patient_ids = directories

chunk_size = 5  # Define the size of the chunks
chunks = [patient_ids[i:i + chunk_size] for i in range(0, len(patient_ids), chunk_size)]

combined_data = None

for i, chunk in enumerate(chunks):
    adatas = []
    for patient_id in chunk:
        patient_root = f"{folder_path}/{patient_id}"
        patient_dir = f"{patient_root}"

        # Load the filtered and normalized data for each patient
        filtered_normalized_file = f"{patient_dir}/filtered_normalized_counts.h5ad"
        try:
            adata = sc.read(filtered_normalized_file)  # Load data into memory
        except FileNotFoundError:
            print(f"File not found error for patient {patient_id}. Skipping.")
            continue

        # Append the adata to the list
        adatas.append(adata)

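    # Skip this chunk if none of its patients had a processed file
    if not adatas:
        print(f"No processed files found in chunk {i}. Skipping.")
        continue

    # Concatenate the AnnData objects
    chunk_data = anndata.concat(adatas, join='outer')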

    # If this is the first chunk, assign it to combined_data
    if combined_data is None:
        combined_data = chunk_data
    else:
        # If this is not the first chunk, concatenate it with the existing combined_data
        combined_data = anndata.concat([combined_data, chunk_data], join='outer')

    # Delete the adatas and chunk_data variables to free up memory
    del adatas
    del chunk_data

# Save the combined data to a file
combined_data_file = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/combined_data.h5ad"
combined_data.write(combined_data_file)


  utils.warn_names_duplicates("obs")
  utils.warn_names_duplicates("obs")
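
The `warn_names_duplicates("obs")` lines above come from anndata's check for repeated observation names, which is expected when cell barcodes from different patients are pooled. The plain-Python sketch below shows how such collisions arise and one way to avoid them; the barcodes and patient labels are hypothetical. `anndata.concat` also accepts `keys` and `index_unique` arguments that append a per-input key to each name, which may be a more direct fix here.

```python
# Hypothetical barcodes from two patients. The same barcode can appear in both
# because every sample draws from the same barcode whitelist.
barcodes_by_patient = {
    "patient_A": ["AAACCTGAG", "AAACGGGTC"],
    "patient_B": ["AAACCTGAG", "TTTGTCATC"],
}

# Naive pooling keeps the collision: 4 names, only 3 distinct.
pooled = [bc for bcs in barcodes_by_patient.values() for bc in bcs]
print(len(pooled), len(set(pooled)))  # 4 3

# Suffixing each barcode with its patient ID removes the collision.
unique = [f"{bc}-{pid}" for pid, bcs in barcodes_by_patient.items() for bc in bcs]
print(len(unique), len(set(unique)))  # 4 4
```

Keeping the patient ID in the cell name also makes it possible to trace each cell back to its sample after the files are combined.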


# Perform a QC check of the counts post-filtering

In [None]:
# Generate violin plots for combined data
sc.pl.violin(combined_data, ['n_genes', 'n_counts', 'percent_mito'], jitter=0.4, multi_panel=True)

In [None]:
# Scatter plot: n_counts vs. percent_mito
sc.pl.scatter(combined_data, x='n_counts', y='percent_mito')

# Scatter plot: n_counts vs. n_genes
sc.pl.scatter(combined_data, x='n_counts', y='n_genes')