Gene Expression Data Preprocessing

Overview:
This notebook performs normalization, filtering, and statistical analysis of gene expression data
from single-cell RNA sequencing (scRNA-seq).

The workflow involves:
1. Loading and cleaning the gene expression data from a CSV file.
2. Normalizing the data into expression ratios: each gene's value in a cell is divided by the
   total expression detected in that cell, so every cell's ratios sum to 1.
3. Filtering out low-expression genes, removing genes expressed in fewer than 1% of the cells.
4. Computing key statistics for each gene across all cells: the sum, mean, and variance of its
   expression ratios.
5. Saving the results to a text file for further analysis.
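
The steps above can be sketched on a toy matrix before running them on real data. This is only an illustration: the gene and cell names and the counts are made up, and the layout (genes as rows, cells as columns) is assumed from the functions defined later in the notebook.

```python
import pandas as pd

# Hypothetical toy matrix: rows are genes, columns are cells (raw counts).
counts = pd.DataFrame(
    {"cell_1": [2, 0, 8], "cell_2": [1, 0, 3], "cell_3": [0, 0, 4]},
    index=["gene_a", "gene_b", "gene_c"],
)

# Step 1 (cleaning): drop genes whose counts are zero in every cell.
counts = counts.loc[~(counts == 0).all(axis=1)]

# Step 2: divide each count by its cell's total so every column sums to 1.
ratios = counts / counts.sum(axis=0)

# Step 3: keep genes detected in at least 1% of the cells.
detected_in = (ratios > 0).sum(axis=1)
kept = ratios.loc[detected_in >= 0.01 * ratios.shape[1]]

# Step 4: per-gene sum, mean, and population variance of the ratios.
stats = pd.DataFrame({
    "sum": kept.sum(axis=1),
    "mean": kept.mean(axis=1),
    "variance": kept.var(axis=1, ddof=0),
})
print(stats)
```

Here gene_b is removed during cleaning because it is zero in every cell, while gene_a and gene_c pass the 1% filter.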

In [10]:
# Required Libraries
import pandas as pd
from tqdm import tqdm

In [11]:
# Function to Load and Clean Data
def load_and_clean_data(file_path, chunksize=1000):
    """
    Load gene expression data in chunks and drop genes (rows) that are zero in every cell.
    :param file_path: Path to the CSV file containing gene expression data.
    :param chunksize: Number of rows to process in each chunk (for large files).
    :return: Cleaned gene expression data.
    """
    chunks = []
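    # Count data rows up front (closing the file afterwards) so tqdm can show progress
    with open(file_path) as f:
        total_rows = sum(1 for _ in f) - 1  # Subtract 1 for the header
    total_chunks = -(-total_rows // chunksize)  # Ceiling division: count a partial final chunk
    with pd.read_csv(file_path, index_col=0, chunksize=chunksize) as reader:
        for chunk in tqdm(reader, desc="Reading and cleaning data", total=total_chunks):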
            chunk = chunk.loc[~(chunk == 0).all(axis=1)]
            chunks.append(chunk)
    data = pd.concat(chunks, axis=0)
    return data

In [12]:
# Function to Normalize Gene Expression Data
def normalize_gene_expression(data):
    """
    Normalize gene expression by dividing each value by the sum of its column (one column per cell).
    :param data: Gene expression data (pandas DataFrame).
    :return: Normalized gene expression data.
    """
    column_sums = data.sum(axis=0)
    normalized_data = data.copy()
    for col in tqdm(data.columns, desc="Normalizing gene expression"):
        normalized_data[col] = data[col] / column_sums[col]
    return normalized_data
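
The per-column loop above can also be written as a single vectorized division. The following is a minimal sketch, not a replacement the notebook itself uses: the function name is hypothetical, and turning zero-total cells into NaN is an assumption about how empty cells should be handled.

```python
import pandas as pd

def normalize_gene_expression_vectorized(data):
    """Divide every column (cell) by its total in one vectorized call."""
    column_sums = data.sum(axis=0)
    # Assumption: a cell with zero total counts gets NaN ratios rather than
    # triggering a division by zero; drop or impute such cells as needed.
    safe_sums = column_sums.where(column_sums != 0)
    # axis=1 aligns the sums with the column labels.
    return data.div(safe_sums, axis=1)

# Hypothetical example: cell "c2" has no counts at all.
toy = pd.DataFrame({"c1": [1, 3], "c2": [0, 0], "c3": [2, 2]}, index=["g1", "g2"])
print(normalize_gene_expression_vectorized(toy))
```

On wide matrices this avoids one Python-level assignment per cell, at the cost of losing the progress bar.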

In [13]:
# Function to Filter Low-Expression Genes
def filter_low_expression_genes(data, threshold=0.01):
    """
    Filter out genes expressed in fewer than a given fraction of cells.
    :param data: Normalized gene expression data (pandas DataFrame).
    :param threshold: Minimum fraction of cells (0.01 = 1%) in which a gene must be expressed.
    :return: Filtered gene expression data.
    """
    non_zero_counts = (data > 0).sum(axis=1)
    num_cells = data.shape[1]
    min_cells_expressed = threshold * num_cells
    filtered_data = data.loc[non_zero_counts >= min_cells_expressed]
    return filtered_data
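
As a worked example of the cutoff, the arithmetic below uses the 12333 columns (cells) reported by the normalization progress bar in the run output at the end of this notebook. Because the comparison is `>=` against a non-integer product, the effective minimum is the next whole number of cells.

```python
import math

num_cells = 12333   # columns reported by the normalization progress bar
threshold = 0.01    # fraction of cells, i.e. 1%

min_cells_expressed = threshold * num_cells      # about 123.33
min_cells_needed = math.ceil(min_cells_expressed)

# A gene must have a non-zero value in at least this many cells to be kept.
print(min_cells_needed)  # 124
```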

In [14]:
# Function to Compute Gene Statistics
def compute_gene_statistics(data):
    """
    Compute the sum, mean, and variance for each gene across all cells.
    :param data: Filtered gene expression data (pandas DataFrame).
    :return: Dictionary mapping each gene to a list [sum, mean, variance].
    """
    gene_statistics = {}
    
    for gene in tqdm(data.index, desc="Calculating gene statistics"):
        gene_values = data.loc[gene].values  # Get the values as a numpy array
        gene_sum = gene_values.sum()  # Sum of gene expression across all cells
        mean_expression = gene_values.mean()  # Mean expression ratio
        variance = gene_values.var()  # Population variance (NumPy default, ddof=0)
        
        # Store results in the dictionary
        gene_statistics[gene] = [gene_sum, mean_expression, variance]
    
    return gene_statistics
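
The same statistics can be computed without a per-gene loop. This is a hedged sketch with a hypothetical function name; it returns a DataFrame rather than the dictionary used above. Note that pandas defaults to the sample variance (`ddof=1`), so `ddof=0` is passed explicitly to match NumPy's `var()`. pandas also skips NaN values by default, whereas the NumPy calls in the loop do not.

```python
import pandas as pd

def compute_gene_statistics_vectorized(data):
    """Return per-gene sum, mean, and population variance as a DataFrame."""
    return pd.DataFrame({
        "sum": data.sum(axis=1),
        "mean": data.mean(axis=1),
        "variance": data.var(axis=1, ddof=0),  # ddof=0 matches numpy's default
    })

# Hypothetical example with two genes and two cells.
toy = pd.DataFrame([[1.0, 3.0], [2.0, 2.0]], index=["g1", "g2"], columns=["c1", "c2"])
print(compute_gene_statistics_vectorized(toy))
```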

In [15]:
# Function to Save Gene Statistics to a File
def save_dict_to_file(gene_statistics, output_file="gene_statistics.txt"):
    """
    Save the computed gene statistics to a text file.
    :param gene_statistics: Dictionary containing gene statistics.
    :param output_file: Output file name.
    """
    with open(output_file, 'w') as f:
        for gene, stats in gene_statistics.items():
            gene_sum, mean_expression, variance = stats
            f.write(f"{gene}: Sum = {gene_sum}, Mean = {mean_expression}, Variance = {variance}\n")
    print(f"Gene statistics saved to '{output_file}'.")
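
If the statistics are meant to be read back by other tools, a tab-separated table may be easier to parse than the free-text lines written above. The helper below is a sketch under that assumption; its name, the column labels, and the default file name are hypothetical and not part of the original workflow.

```python
import pandas as pd

def save_statistics_as_tsv(gene_statistics, output_file="gene_statistics.tsv"):
    """Write the {gene: [sum, mean, variance]} dictionary as a TSV table."""
    table = pd.DataFrame.from_dict(
        gene_statistics, orient="index", columns=["sum", "mean", "variance"]
    )
    table.to_csv(output_file, sep="\t", index_label="gene")
    return table
```

The file can then be reloaded with `pd.read_csv(output_file, sep="\t", index_col="gene")`.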

In [16]:
# Main Processing Function
def process_gene_expression_data(file_path, threshold=0.01, chunksize=1000):
    """
    Main function to process gene expression data: load, clean, normalize, filter, and compute statistics.
    :param file_path: Path to the gene expression CSV file.
    :param threshold: Minimum fraction of cells (0.01 = 1%) in which a gene must be expressed.
    :param chunksize: Number of rows to process per chunk (for large files).
    :return: Dictionary containing computed gene statistics.
    """
    data = load_and_clean_data(file_path, chunksize=chunksize)
    print("Normalizing gene expression data...")
    normalized_data = normalize_gene_expression(data)
    print("Filtering low-expression genes...")
    filtered_data = filter_low_expression_genes(normalized_data, threshold)
    
    # Compute gene statistics and store in a dictionary
    gene_statistics = compute_gene_statistics(filtered_data)
    
    return gene_statistics

In [17]:
# Input Path Prompt for Reproducibility
# The user is prompted to input the path to the CSV file dynamically
file_path = input("Please enter the path to your gene expression CSV file (dense matrix): ")
threshold = 0.01  # Keep genes expressed in at least 1% of cells
chunksize = 1000  # Set chunk size for reading the file

In [18]:
# Process the data and save results
try:
    gene_statistics = process_gene_expression_data(file_path, threshold, chunksize)

    # Save gene statistics to a text file
    save_dict_to_file(gene_statistics, "gene_statistics.txt")
except Exception as e:
    print(f"An error occurred during processing: {e}")

Reading and cleaning data: 61it [02:14,  2.21s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 12333/12333 [00:17<00:00, 709.61it/s]


Filtering low-expression genes...


Calculating gene statistics: 100%|██████████| 19603/19603 [07:05<00:00, 46.09it/s]


Gene statistics saved to 'gene_statistics.txt'.
