Gene Expression Data Preprocessing

Overview:
This notebook performs normalization, filtering, and statistical analysis of gene expression data
from single-cell RNA sequencing (scRNA-seq).

The workflow involves:
1. Loading and cleaning the gene expression data from a CSV file.
2. Normalizing the data by calculating the gene expression ratio for each gene in each cell.
   (Each cell's gene expression value is divided by the total RNA detected for that cell.)
3. Computing key statistics for each gene across cells: the sum of its expression ratios, the mean
   expression ratio, and the variance.
4. Saving the results for each sample to a CSV file in the 'gene_statistics' output directory for further analysis.
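
As a rough illustration of steps 1 to 3, here is a minimal sketch on a toy matrix. The gene and cell names and the counts are made up for the example; it assumes, as in the rest of the notebook, that rows are genes and columns are cells.

```python
import pandas as pd

# Hypothetical toy matrix: rows are genes, columns are cells (raw counts).
counts = pd.DataFrame(
    {"cell_A": [2, 6, 0], "cell_B": [1, 1, 0], "cell_C": [3, 0, 0]},
    index=["gene_1", "gene_2", "gene_3"],
)

# Step 1 (cleaning): drop genes that are zero in every cell.
counts = counts.loc[~(counts == 0).all(axis=1)]

# Step 2 (normalization): divide each value by the total of its cell (column).
ratios = counts / counts.sum(axis=0)

# Step 3 (statistics): per-gene sum, mean, and population variance of the ratios.
stats = pd.DataFrame(
    {
        "Sum": ratios.sum(axis=1),
        "Mean": ratios.mean(axis=1),
        "Variance": ratios.var(axis=1, ddof=0),
    }
)
print(stats)
```

After normalization every cell's column sums to 1, so the per-gene values are fractions of that cell's total detected RNA.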

In [1]:
# Required Libraries
import os
import pandas as pd
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
import csv

In [2]:
# Function to load and clean data in chunks, allowing for large files to be processed
def load_and_clean_data(file_path, chunksize=1000):
    """
    Loads and cleans the data from a CSV file in chunks.
    It removes rows where all values are zero.
    
    Parameters:
    - file_path: str, the path to the CSV file
    - chunksize: int, the number of rows per chunk

    Returns:
    - data: DataFrame, the concatenated and cleaned data
    """
    chunks = []  # List to store cleaned chunks

    # Count data rows (excluding the header) so the progress bar has a total
    with open(file_path) as f:
        total_rows = sum(1 for _ in f) - 1
    total_chunks = (total_rows + chunksize - 1) // chunksize  # Ceiling division: the last chunk may be partial

    # Read the data in chunks and remove rows where all values are zero
    try:
        with pd.read_csv(file_path, index_col=0, chunksize=chunksize) as reader:
            for chunk in tqdm(reader, desc="Reading and cleaning data", total=total_chunks):
                # Remove rows (genes) where all values are zero
                chunk = chunk.loc[~(chunk == 0).all(axis=1)]
                chunks.append(chunk)
    except pd.errors.ParserError as e:
        print(f"Error reading CSV file: {e}")
        raise  # Re-raise so a partially read file is not processed further
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        raise

    # Concatenate the chunks into a single DataFrame
    data = pd.concat(chunks, axis=0)
    return data
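
The chunked read-and-clean pattern can be tried on a small in-memory CSV. This is only a sketch: the CSV text and the chunk size of 2 are invented for the example. It also shows why the number of chunks is the ceiling of rows / chunksize, since the last chunk may be partial.

```python
import io

import pandas as pd

# Hypothetical CSV: first column holds gene names, remaining columns are cells.
csv_text = "gene,cell_A,cell_B\ng1,0,0\ng2,1,0\ng3,0,0\ng4,0,5\ng5,2,2\n"
chunksize = 2

chunks = []
with pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=chunksize) as reader:
    for chunk in reader:
        # Keep only rows that have at least one non-zero value.
        chunks.append(chunk.loc[~(chunk == 0).all(axis=1)])
cleaned = pd.concat(chunks, axis=0)

# Five data rows in chunks of two need three chunks (ceiling division).
total_rows = 5
n_chunks = (total_rows + chunksize - 1) // chunksize

print(list(cleaned.index), n_chunks)
```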

In [3]:
# Function to normalize the gene expression data by dividing each gene expression by the column sum
def normalize_gene_expression(data):
    """
    Normalizes gene expression data by dividing each value by the sum of its column (i.e., the total expression detected in that cell).
    
    Parameters:
    - data: DataFrame, the gene expression data (genes as rows, cells as columns)
    
    Returns:
    - normalized_data: DataFrame, the normalized data
    """
    column_sums = data.sum(axis=0)  # Total expression for each cell (one sum per column)
    normalized_data = data.copy()   # Create a copy to avoid modifying the original data
    
    # Normalize each column by dividing by its column sum
    for col in tqdm(data.columns, desc="Normalizing gene expression"):
        normalized_data[col] = data[col] / column_sums[col]
    
    return normalized_data
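
The column-by-column loop can also be written as a single vectorized division. The sketch below is an alternative, not the code that produced the recorded output; `normalize_vectorized` is a hypothetical name, and the toy matrix is made up.

```python
import pandas as pd


def normalize_vectorized(data):
    """Divide every value by the total of its column (cell) in one operation."""
    column_sums = data.sum(axis=0)
    # axis=1 aligns the sums with the DataFrame's columns.
    return data.div(column_sums, axis=1)


# Hypothetical toy matrix: genes as rows, cells as columns.
data = pd.DataFrame(
    {"cell_A": [2.0, 6.0], "cell_B": [1.0, 3.0]},
    index=["gene_1", "gene_2"],
)
normalized = normalize_vectorized(data)
print(normalized)
```

As with the loop, a cell whose total is zero would yield NaN values (0 / 0), so such columns may need to be dropped beforehand.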

In [4]:
# # Function to filter out genes with low expression based on a threshold
# def filter_low_expression_genes(data, threshold=0.01):
#     """
#     Filters out genes with low expression across cells based on a threshold.
    
#     Parameters:
#     - data: DataFrame, the normalized gene expression data
#     - threshold: float, the minimum fraction of cells in which a gene must be expressed for the gene to be retained
    
#     Returns:
#     - filtered_data: DataFrame, the filtered data
#     """
#     num_cells = data.shape[1]  # Number of cells (columns)
#     min_cells_expressed = threshold * num_cells  # Minimum number of cells required for a gene to be expressed
#     non_zero_counts = (data > 0).sum(axis=1)  # Count of non-zero values per gene (row)
    
#     # Use .loc to filter rows (genes) based on the condition
#     filtered_data = data.loc[non_zero_counts >= min_cells_expressed]
    
#     # Print the number of genes retained after filtering
#     print(f"Filtering complete: {filtered_data.shape[0]} genes retained out of {data.shape[0]} total.")
    
#     return filtered_data

In [5]:
# Function to compute gene statistics like sum, mean, and variance for each gene
def compute_gene_statistics(data):
    """
    Computes gene statistics such as sum, mean, and variance for each gene across all cells.
    
    Parameters:
    - data: DataFrame, the normalized gene expression data (genes as rows, cells as columns)
    
    Returns:
    - gene_statistics: dict, a dictionary of gene statistics where keys are gene names
      and values are lists containing sum, mean, and variance for each gene
    """
    gene_statistics = {}  # Dictionary to store statistics
    
    # Function to calculate statistics for a single gene
    def calc_stats(gene):
        gene_values = data.loc[gene].values  # Get expression values for the gene
        gene_sum = gene_values.sum()         # Total expression across all cells
        mean_expression = gene_values.mean() # Mean expression value
        variance = gene_values.var()         # Population variance (NumPy default, ddof=0)
        return gene, [gene_sum, mean_expression, variance]

    # Use tqdm to show progress bar while calculating gene statistics
    with ThreadPoolExecutor() as executor:
        results = tqdm(executor.map(calc_stats, data.index), total=len(data.index), desc="Calculating gene statistics")
        
        # Update the dictionary with calculated statistics
        for gene, stats in results:
            gene_statistics[gene] = stats

    return gene_statistics
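
The same per-gene statistics can be computed with row-wise pandas reductions instead of one task per gene, which may be considerably faster on large matrices. This is a sketch under assumptions: `compute_gene_statistics_vectorized` is a hypothetical name, and `ddof=0` is passed explicitly so the variance matches the NumPy default used in `calc_stats`.

```python
import pandas as pd


def compute_gene_statistics_vectorized(data):
    """Return {gene: [sum, mean, variance]} using row-wise reductions."""
    stats = pd.DataFrame(
        {
            "Sum": data.sum(axis=1),
            "Mean": data.mean(axis=1),
            # pandas defaults to ddof=1; ddof=0 gives the population variance.
            "Variance": data.var(axis=1, ddof=0),
        }
    )
    return dict(zip(stats.index, stats.to_numpy().tolist()))


# Hypothetical normalized toy matrix: genes as rows, cells as columns.
toy = pd.DataFrame(
    {"cell_A": [0.25, 0.75], "cell_B": [0.5, 0.5], "cell_C": [1.0, 0.0]},
    index=["gene_1", "gene_2"],
)
result = compute_gene_statistics_vectorized(toy)
print(result)
```

One difference to keep in mind: pandas reductions skip NaN values by default, whereas the NumPy calls in `calc_stats` propagate them.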

In [6]:
# Function to save gene statistics to a CSV file
def save_dict_to_csv(gene_statistics, output_file):
    """
    Saves the gene statistics dictionary to a CSV file.
    
    Parameters:
    - gene_statistics: dict, the dictionary containing gene statistics
    - output_file: str, the name of the output CSV file
    
    Returns:
    - None
    """
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        # Write header row
        writer.writerow(["Gene", "Sum", "Mean", "Variance"])
        # Write gene statistics
        for gene, stats in gene_statistics.items():
            writer.writerow([gene] + stats)
    
    print(f"Gene statistics saved to '{output_file}'.")
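
To show the file layout this produces, here is a small round-trip sketch that writes a dictionary in the same shape and reads it back with `csv.DictReader`. The statistics values and the file name `example.csv` are invented for the example.

```python
import csv
import os
import tempfile

# Hypothetical statistics in the shape used above: gene -> [sum, mean, variance].
example_stats = {"gene_1": [1.5, 0.5, 0.25], "gene_2": [0.3, 0.1, 0.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")

    # Write one header row, then one row per gene.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Gene", "Sum", "Mean", "Variance"])
        for gene, stats in example_stats.items():
            writer.writerow([gene] + stats)

    # Read the rows back as dictionaries keyed by the header names.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))

restored = {
    row["Gene"]: [float(row["Sum"]), float(row["Mean"]), float(row["Variance"])]
    for row in rows
}
print(restored)
```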

In [7]:
# Recursive function to find 'dense-matrix.csv' in nested subdirectories
def find_dense_matrix_file(root_dir):
    """
    Recursively searches for the 'dense-matrix.csv' file in the given directory and its subdirectories.
    
    Parameters:
    - root_dir: str, the root directory to start the search
    
    Returns:
    - str or None: the full path to 'dense-matrix.csv' if found, otherwise None
    """
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if file == "dense-matrix.csv":
                return os.path.join(root, file)
    return None  # Return None if the file is not found
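
An equivalent search can be sketched with `pathlib`. `find_dense_matrix_file_pathlib` is a hypothetical name, and the temporary directory layout below is made up for the demonstration. Like the `os.walk` version, it returns None when the starting directory does not exist.

```python
import tempfile
from pathlib import Path


def find_dense_matrix_file_pathlib(root_dir, filename="dense-matrix.csv"):
    """Return the path of the first matching file under root_dir, or None."""
    root = Path(root_dir)
    if not root.is_dir():
        return None
    # Sort the matches so the result is deterministic if several files exist.
    matches = sorted(p for p in root.rglob(filename) if p.is_file())
    return str(matches[0]) if matches else None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical nested layout, created only for this example.
    nested = Path(tmp) / "single_cell" / "run_1" / "qc"
    nested.mkdir(parents=True)
    (nested / "dense-matrix.csv").write_text("gene,cell_A\n")

    found = find_dense_matrix_file_pathlib(Path(tmp) / "single_cell")
    missing = find_dense_matrix_file_pathlib(Path(tmp) / "does_not_exist")

print(found, missing)
```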

In [8]:
def process_gene_expression_data(file_path, threshold=0.01, chunksize=1000):
    """
    Processes the gene expression data by loading, cleaning, and normalizing it,
    then computing per-gene statistics. The low-expression filtering step is currently disabled.
    
    Parameters:
    - file_path: str, the path to the gene expression data file
    - threshold: float, the minimum fraction of cells in which a gene must be expressed
      (currently unused because filtering is disabled)
    - chunksize: int, the number of rows to read per chunk when loading data
    
    Returns:
    - gene_statistics: dict, the dictionary of computed gene statistics
    """
    # Load and clean the data in chunks
    data = load_and_clean_data(file_path, chunksize=chunksize)
    
    # Normalize the gene expression data
    print("Normalizing gene expression data...")
    normalized_data = normalize_gene_expression(data)
    
    # Filter out low-expression genes based on the threshold (currently disabled)
    # print("Filtering low-expression genes...")
    # filtered_data = filter_low_expression_genes(normalized_data, threshold)
    
    # Compute gene statistics for the normalized data
    gene_statistics = compute_gene_statistics(normalized_data)
    
    return gene_statistics

In [9]:
def process_all_files(input_dir, output_dir, threshold=0.01, chunksize=1000):
    """
    Loops through the input directory to find 'dense-matrix.csv' in each sample subdirectory,
    processes it, and saves the results in the output directory. Creates a .txt file for each sample
    whose dense-matrix.csv file is missing, then continues with the next sample.
    
    Parameters:
    - input_dir: str, the root directory containing cases and sample subdirectories
    - output_dir: str, the directory where the output CSV or text files will be saved
    - threshold: float, the minimum fraction of cells in which a gene must be expressed
    - chunksize: int, the number of rows to read per chunk when loading data
    
    Returns:
    - None
    """
    # Ensure the output directory exists, creating it if necessary
    os.makedirs(output_dir, exist_ok=True)

    # Sort case directories to ensure consistent processing order
    for case_dir in sorted(os.listdir(input_dir)):
        case_path = os.path.join(input_dir, case_dir)
        # Ensure it's a directory and ignore hidden/system files like .DS_Store
        if not os.path.isdir(case_path) or case_dir.startswith('.'):
            continue  # Skip non-directory or hidden/system files

        # Loop through sample directories within each case
        for sample_dir in sorted(os.listdir(case_path)):
            sample_path = os.path.join(case_path, sample_dir, 'single_cell')

            # Ignore hidden/system files in the sample directories as well
            if not os.path.isdir(os.path.join(case_path, sample_dir)) or sample_dir.startswith('.'):
                continue  # Skip non-directory or hidden/system files

            # Search for 'dense-matrix.csv' in the sample's subdirectories
            dense_matrix_path = find_dense_matrix_file(sample_path)

            # If dense-matrix.csv is found, process the file
            if dense_matrix_path:
                print(f"Processing {dense_matrix_path} for case {case_dir}, sample {sample_dir}...")
                try:
                    # Process the dense-matrix.csv file and compute statistics
                    gene_statistics = process_gene_expression_data(dense_matrix_path, threshold, chunksize)
                    
                    # Save the output as a CSV in the output directory
                    output_file = os.path.join(output_dir, f"{case_dir}_{sample_dir}.csv")
                    save_dict_to_csv(gene_statistics, output_file)
                except Exception as e:
                    print(f"Error processing file {dense_matrix_path}: {e}")
            
            # If dense-matrix.csv is not found, create a .txt file indicating the file is missing
            else:
                print(f"No dense-matrix.csv found for {case_dir}, sample {sample_dir}.")
                # Create a text file indicating no file was found
                output_file = os.path.join(output_dir, f"{case_dir}_{sample_dir}.txt")
                with open(output_file, 'w') as f:
                    f.write(f"No dense-matrix.csv found in {sample_path}")
                print(f"Created {output_file}.")

In [10]:
# Process the data and save results
try:
    # Input and output directories
    input_dir = "GDC-data"  # Input directory with all the cases
    output_dir = "gene_statistics"  # Output directory for results

    # Process all files
    process_all_files(input_dir, output_dir, threshold=0.01, chunksize=1000)

except Exception as e:
    print(f"An error occurred during processing: {e}")

Processing GDC-data/C3L-00359/1/single_cell/acfe95e5-b1ad-46ca-b709-46644c9a0c6d/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-00359, sample 1...


Reading and cleaning data: 61it [02:33,  2.52s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 14598/14598 [00:18<00:00, 773.85it/s]
Calculating gene statistics: 100%|██████████| 37387/37387 [19:36<00:00, 31.78it/s] 


Gene statistics saved to 'gene_statistics/C3L-00359_1.csv'.
Processing GDC-data/C3L-00606/1/single_cell/03845ba9-15a3-4216-b484-0489eb0fef90/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-00606, sample 1...


Reading and cleaning data: 61it [01:34,  1.55s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 8961/8961 [00:07<00:00, 1237.19it/s]
Calculating gene statistics: 100%|██████████| 34237/34237 [10:38<00:00, 53.64it/s] 


Gene statistics saved to 'gene_statistics/C3L-00606_1.csv'.
Processing GDC-data/C3L-00606/2/single_cell/1259da7b-864a-494d-b712-6c566dae06cd/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-00606, sample 2...


Reading and cleaning data: 61it [04:50,  4.76s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 16446/16446 [00:21<00:00, 771.94it/s]
Calculating gene statistics: 100%|██████████| 35468/35468 [21:23<00:00, 27.64it/s]  


Gene statistics saved to 'gene_statistics/C3L-00606_2.csv'.
Processing GDC-data/C3L-00606/3/single_cell/b37c3cf9-8215-4636-906c-fecced524729/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-00606, sample 3...


Reading and cleaning data: 61it [01:44,  1.72s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 10270/10270 [00:08<00:00, 1227.21it/s]
Calculating gene statistics: 100%|██████████| 34007/34007 [12:50<00:00, 44.16it/s]  


Gene statistics saved to 'gene_statistics/C3L-00606_3.csv'.
Processing GDC-data/C3L-01287/1/single_cell/1587552f-810b-4ae7-9efc-d0ec3da9fb2b/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-01287, sample 1...


Reading and cleaning data: 61it [00:49,  1.22it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 7726/7726 [00:06<00:00, 1265.81it/s]
Calculating gene statistics: 100%|██████████| 36564/36564 [10:43<00:00, 56.80it/s]  


Gene statistics saved to 'gene_statistics/C3L-01287_1.csv'.
Processing GDC-data/C3L-01287/2/single_cell/cdea80e2-c55c-45c8-8656-f77374e35b6e/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-01287, sample 2...


Reading and cleaning data: 61it [01:34,  1.54s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 9210/9210 [00:07<00:00, 1276.70it/s]
Calculating gene statistics: 100%|██████████| 34914/34914 [12:10<00:00, 47.78it/s] 


Gene statistics saved to 'gene_statistics/C3L-01287_2.csv'.
Processing GDC-data/C3L-01953/1/single_cell/0a8da4f4-7030-4c81-8f91-48d1b31d7551/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-01953, sample 1...


Reading and cleaning data: 61it [01:47,  1.77s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 10783/10783 [00:08<00:00, 1200.80it/s]
Calculating gene statistics: 100%|██████████| 34687/34687 [14:46<00:00, 39.12it/s] 


Gene statistics saved to 'gene_statistics/C3L-01953_1.csv'.
Processing GDC-data/C3L-02705/1/single_cell/f1285e81-3d62-4e4d-bf54-005b25c95351/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-02705, sample 1...


Reading and cleaning data: 61it [02:01,  1.98s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 11938/11938 [00:12<00:00, 923.46it/s]
Calculating gene statistics: 100%|██████████| 39523/39523 [19:46<00:00, 33.30it/s]  


Gene statistics saved to 'gene_statistics/C3L-02705_1.csv'.
Processing GDC-data/C3L-02858/1/single_cell/13224031-ccab-401b-bdb9-1759b0e2a469/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-02858, sample 1...


Reading and cleaning data: 61it [00:33,  1.85it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 5210/5210 [00:03<00:00, 1405.25it/s]
Calculating gene statistics: 100%|██████████| 33071/33071 [06:46<00:00, 81.44it/s] 


Gene statistics saved to 'gene_statistics/C3L-02858_1.csv'.
Processing GDC-data/C3L-03405/1/single_cell/4aada0be-e49e-4816-8f8c-c83ab265a7f2/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-03405, sample 1...


Reading and cleaning data: 61it [00:35,  1.70it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 5631/5631 [00:03<00:00, 1451.64it/s]
Calculating gene statistics: 100%|██████████| 35278/35278 [07:47<00:00, 75.52it/s] 


Gene statistics saved to 'gene_statistics/C3L-03405_1.csv'.
Processing GDC-data/C3L-03968/1/single_cell/e0983229-ac11-4903-901e-950c4b15b14a/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3L-03968, sample 1...


Reading and cleaning data: 61it [01:33,  1.53s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 9172/9172 [00:07<00:00, 1283.56it/s]
Calculating gene statistics: 100%|██████████| 34818/34818 [12:12<00:00, 47.51it/s]  


Gene statistics saved to 'gene_statistics/C3L-03968_1.csv'.
Processing GDC-data/C3N-00148/1/single_cell/dbe28c52-cdbf-477e-a728-9a61fd4a8139/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00148, sample 1...


Reading and cleaning data: 61it [00:48,  1.26it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 7598/7598 [00:05<00:00, 1305.99it/s]
Calculating gene statistics: 100%|██████████| 36865/36865 [10:45<00:00, 57.07it/s] 


Gene statistics saved to 'gene_statistics/C3N-00148_1.csv'.
Processing GDC-data/C3N-00148/2/single_cell/30b9bf4c-6db5-4663-a700-e6fe3082dcf9/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00148, sample 2...


Reading and cleaning data: 61it [00:45,  1.33it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 7171/7171 [00:05<00:00, 1253.93it/s]
Calculating gene statistics: 100%|██████████| 36520/36520 [10:01<00:00, 60.72it/s] 


Gene statistics saved to 'gene_statistics/C3N-00148_2.csv'.
Processing GDC-data/C3N-00148/3/single_cell/e2213c27-8571-498f-98dd-c1a4319e2146/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00148, sample 3...


Reading and cleaning data: 61it [02:03,  2.03s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 11994/11994 [00:12<00:00, 950.48it/s]
Calculating gene statistics: 100%|██████████| 37870/37870 [17:55<00:00, 35.20it/s]  


Gene statistics saved to 'gene_statistics/C3N-00148_3.csv'.
Processing GDC-data/C3N-00148/4/single_cell/1f898240-4707-4d5f-a992-faa54104cef3/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00148, sample 4...


Reading and cleaning data: 61it [01:28,  1.45s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 8796/8796 [00:06<00:00, 1323.80it/s]
Calculating gene statistics: 100%|██████████| 36901/36901 [12:33<00:00, 48.97it/s] 


Gene statistics saved to 'gene_statistics/C3N-00148_4.csv'.
Processing GDC-data/C3N-00149/1/single_cell/1896aafc-d107-4a34-9b87-8d893dce0ca0/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00149, sample 1...


Reading and cleaning data: 61it [01:43,  1.70s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 10305/10305 [00:08<00:00, 1181.69it/s]
Calculating gene statistics: 100%|██████████| 36694/36694 [14:07<00:00, 43.28it/s] 


Gene statistics saved to 'gene_statistics/C3N-00149_1.csv'.
Processing GDC-data/C3N-00149/2/single_cell/f9f01268-2c82-40c1-91c7-eab1adee0c99/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00149, sample 2...


Reading and cleaning data: 61it [01:34,  1.55s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 9412/9412 [00:07<00:00, 1198.49it/s]
Calculating gene statistics: 100%|██████████| 36317/36317 [12:58<00:00, 46.66it/s] 


Gene statistics saved to 'gene_statistics/C3N-00149_2.csv'.
Processing GDC-data/C3N-00149/3/single_cell/7339223d-c470-498c-a0ca-52b05cc5a405/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00149, sample 3...


Reading and cleaning data: 61it [00:27,  2.20it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 4411/4411 [00:02<00:00, 1634.73it/s]
Calculating gene statistics: 100%|██████████| 34299/34299 [05:48<00:00, 98.44it/s] 


Gene statistics saved to 'gene_statistics/C3N-00149_3.csv'.
Processing GDC-data/C3N-00439/1/single_cell/cc317a3f-9aa8-4e82-b394-25a13320956b/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00439, sample 1...


Reading and cleaning data: 61it [00:46,  1.31it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 7471/7471 [00:05<00:00, 1321.29it/s]
Calculating gene statistics: 100%|██████████| 31830/31830 [08:56<00:00, 59.34it/s]  


Gene statistics saved to 'gene_statistics/C3N-00439_1.csv'.
Processing GDC-data/C3N-00662/1/single_cell/15baaf0d-6dce-4fe0-add9-a703eb4cdade/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-00662, sample 1...


Reading and cleaning data: 61it [00:40,  1.51it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 6315/6315 [00:04<00:00, 1305.59it/s]
Calculating gene statistics: 100%|██████████| 38103/38103 [09:37<00:00, 66.00it/s] 


Gene statistics saved to 'gene_statistics/C3N-00662_1.csv'.
Processing GDC-data/C3N-01175/1/single_cell/c8a2aad7-a5c0-473f-a3d6-3d2b1ccb9102/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01175, sample 1...


Reading and cleaning data: 61it [01:54,  1.88s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 11463/11463 [00:10<00:00, 1105.90it/s]
Calculating gene statistics: 100%|██████████| 35341/35341 [15:13<00:00, 38.68it/s] 


Gene statistics saved to 'gene_statistics/C3N-01175_1.csv'.
No dense-matrix.csv found for C3N-01180, sample 1.
Created gene_statistics/C3N-01180_1.txt.
Processing GDC-data/C3N-01270/1/single_cell/b23449fb-b4e2-4816-9f9e-721eed95a124/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01270, sample 1...


Reading and cleaning data: 61it [00:10,  5.92it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 2098/2098 [00:01<00:00, 1903.25it/s]
Calculating gene statistics: 100%|██████████| 27985/27985 [00:20<00:00, 1367.14it/s] 


Gene statistics saved to 'gene_statistics/C3N-01270_1.csv'.
No dense-matrix.csv found for C3N-01334, sample 1.
Created gene_statistics/C3N-01334_1.txt.
Processing GDC-data/C3N-01798/1/single_cell/596d80de-595a-46b6-9309-6030a7213648/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01798, sample 1...


Reading and cleaning data: 61it [02:17,  2.25s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 13089/13089 [00:17<00:00, 747.75it/s]
Calculating gene statistics: 100%|██████████| 42498/42498 [22:58<00:00, 30.83it/s]  


Gene statistics saved to 'gene_statistics/C3N-01798_1.csv'.
No dense-matrix.csv found for C3N-01798, sample single_cell.
Created gene_statistics/C3N-01798_single_cell.txt.
Processing GDC-data/C3N-01814/1/single_cell/255567c3-0640-4de5-b224-1e70cfca6b7f/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01814, sample 1...


Reading and cleaning data: 61it [02:08,  2.10s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 12431/12431 [00:17<00:00, 697.48it/s]
Calculating gene statistics: 100%|██████████| 42400/42400 [22:07<00:00, 31.93it/s]


Gene statistics saved to 'gene_statistics/C3N-01814_1.csv'.
Processing GDC-data/C3N-01815/1/single_cell/35ebba03-d063-4b0a-ad43-ada21fedc3da/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01815, sample 1...


Reading and cleaning data: 61it [00:51,  1.17it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 7762/7762 [00:06<00:00, 1249.58it/s]
Calculating gene statistics: 100%|██████████| 38474/38474 [12:23<00:00, 51.72it/s] 


Gene statistics saved to 'gene_statistics/C3N-01815_1.csv'.
Processing GDC-data/C3N-01816/1/single_cell/b6fe5e0d-1fd3-4630-bbb4-6b77180e757a/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01816, sample 1...


Reading and cleaning data: 61it [04:47,  4.72s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 16977/16977 [00:23<00:00, 708.45it/s]
Calculating gene statistics: 100%|██████████| 42754/42754 [33:44<00:00, 21.12it/s]  


Gene statistics saved to 'gene_statistics/C3N-01816_1.csv'.
Processing GDC-data/C3N-01904/1/single_cell/fc6c80a4-0827-4247-b526-74dae27bbeb2/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-01904, sample 1...


Reading and cleaning data: 61it [02:21,  2.32s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 13775/13775 [00:17<00:00, 765.51it/s]
Calculating gene statistics: 100%|██████████| 36187/36187 [21:06<00:00, 28.58it/s]  


Gene statistics saved to 'gene_statistics/C3N-01904_1.csv'.
Processing GDC-data/C3N-02181/1/single_cell/6ad587be-e070-4017-a7ef-fceb4dfa6eeb/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-02181, sample 1...


Reading and cleaning data: 61it [02:00,  1.98s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 11806/11806 [00:16<00:00, 725.99it/s]
Calculating gene statistics: 100%|██████████| 41742/41742 [21:04<00:00, 33.01it/s] 


Gene statistics saved to 'gene_statistics/C3N-02181_1.csv'.
Processing GDC-data/C3N-02188/1/single_cell/6c3e3003-7d10-4cf9-a031-6dca74f90274/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-02188, sample 1...


Reading and cleaning data: 61it [02:34,  2.53s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 15121/15121 [00:19<00:00, 781.02it/s]
Calculating gene statistics: 100%|██████████| 40357/40357 [27:09<00:00, 24.77it/s] 


Gene statistics saved to 'gene_statistics/C3N-02188_1.csv'.
Processing GDC-data/C3N-02190/1/single_cell/149fdcf8-350b-44d2-89c1-865e6ac7c88f/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-02190, sample 1...


Reading and cleaning data: 61it [01:22,  1.35s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 8313/8313 [00:06<00:00, 1349.78it/s]
Calculating gene statistics: 100%|██████████| 33007/33007 [09:54<00:00, 55.48it/s]  


Gene statistics saved to 'gene_statistics/C3N-02190_1.csv'.
Processing GDC-data/C3N-02769/1/single_cell/5b104813-de6b-4369-b8ae-30f01302c232/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-02769, sample 1...


Reading and cleaning data: 61it [00:42,  1.43it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 6650/6650 [00:05<00:00, 1305.69it/s]
Calculating gene statistics: 100%|██████████| 35777/35777 [09:05<00:00, 65.64it/s] 


Gene statistics saved to 'gene_statistics/C3N-02769_1.csv'.
Processing GDC-data/C3N-02783/1/single_cell/b3d2a5df-bf0d-4f79-8c1c-ef473c07b412/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-02783, sample 1...


Reading and cleaning data: 58it [04:49,  5.00s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 17997/17997 [00:23<00:00, 759.69it/s]
Calculating gene statistics: 100%|██████████| 40247/40247 [33:46<00:00, 19.86it/s]  


Gene statistics saved to 'gene_statistics/C3N-02783_1.csv'.
No dense-matrix.csv found for C3N-02783, sample single_cell.
Created gene_statistics/C3N-02783_single_cell.txt.
Processing GDC-data/C3N-02784/1/single_cell/84ec08cc-c425-45fc-a3e3-387c418769b6/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-02784, sample 1...


Reading and cleaning data: 61it [00:48,  1.25it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 7710/7710 [00:06<00:00, 1216.12it/s]
Calculating gene statistics: 100%|██████████| 39291/39291 [12:03<00:00, 54.33it/s] 


Gene statistics saved to 'gene_statistics/C3N-02784_1.csv'.
Processing GDC-data/C3N-03184/1/single_cell/90569c86-d332-4743-8064-5314188bf06a/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-03184, sample 1...


Reading and cleaning data: 61it [04:32,  4.47s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 16445/16445 [00:20<00:00, 796.27it/s]
Calculating gene statistics: 100%|██████████| 35789/35789 [23:44<00:00, 25.12it/s]  


Gene statistics saved to 'gene_statistics/C3N-03184_1.csv'.
Processing GDC-data/C3N-03186/1/single_cell/261a8ad9-59ba-4ba3-895d-86c4b8c396e2/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-03186, sample 1...


Reading and cleaning data: 61it [00:30,  2.00it/s]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 4827/4827 [00:03<00:00, 1505.37it/s]
Calculating gene statistics: 100%|██████████| 35834/35834 [06:19<00:00, 94.36it/s] 


Gene statistics saved to 'gene_statistics/C3N-03186_1.csv'.
Processing GDC-data/C3N-03188/1/single_cell/f7404457-fae1-487d-8c35-345a457d2b30/qc_filtered_bc_feature_matrix/dense-matrix.csv for case C3N-03188, sample 1...


Reading and cleaning data: 61it [02:06,  2.07s/it]                        


Normalizing gene expression data...


Normalizing gene expression: 100%|██████████| 12333/12333 [00:16<00:00, 743.48it/s]
Calculating gene statistics: 100%|██████████| 43011/43011 [24:44<00:00, 28.98it/s] 


Gene statistics saved to 'gene_statistics/C3N-03188_1.csv'.
