## Ground Truth Data Extraction 

### Overview
This notebook extracts the ground truth from copy number variation (CNV) data generated from single-cell RNA sequencing (scRNA-seq).

In total, 30 patient samples are extracted with this notebook.

The ground truth consists of CNVs detected in tumour samples. Each row of the ground truth table represents a specific genomic segment with an altered copy number and contains the following fields:

- GDC_Aliquot: The unique identifier of the tumour sample.
- Chromosome: Indicates which chromosome the CNV is located on.
- Start: The starting genomic position of the CNV segment.
- End: The ending genomic position of the CNV segment.
- Copy_Number: The total number of copies of this genomic segment in the tumour cells. In a normal diploid genome, this would be 2.
- Major_Copy_Number: The number of copies of the more abundant allele.
- Minor_Copy_Number: The number of copies of the less abundant allele.
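
The relationships between these columns can be checked with a short sanity test. The following is a minimal sketch, assuming that the total copy number equals the sum of the major and minor allele copy numbers (this holds for the example rows shown in this notebook's output, but should be confirmed for your own data). `validate_segments` is a hypothetical helper, not part of the notebook, and the third row of the example table is made up to trigger a failure.

```python
import pandas as pd

# Hypothetical helper (not part of the original notebook).
def validate_segments(df):
    """Return the rows that fail basic consistency checks on a segment table."""
    # Assumption: total copy number = major + minor allele copy number
    bad_total = df["Copy_Number"] != df["Major_Copy_Number"] + df["Minor_Copy_Number"]
    # The major allele is, by definition, at least as abundant as the minor one
    bad_alleles = df["Major_Copy_Number"] < df["Minor_Copy_Number"]
    # A segment must not end before it starts
    bad_coords = df["Start"] > df["End"]
    return df[bad_total | bad_alleles | bad_coords]

# First two rows come from the notebook output; the third is deliberately inconsistent
segments = pd.DataFrame({
    "GDC_Aliquot": ["2eb1196d-5e6d-4223-953c-d558c75a8c2e"] * 2 + ["made-up-aliquot"],
    "Chromosome": ["chr1", "chr2", "chr1"],
    "Start": [13116, 10587, 100],
    "End": [248945703, 242183243, 200],
    "Copy_Number": [4, 4, 3],
    "Major_Copy_Number": [2, 2, 1],
    "Minor_Copy_Number": [2, 2, 1],
})

invalid = validate_segments(segments)
print(invalid)
```

An empty result means every segment passed all three checks.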

### Import Libraries

In [1]:
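import os  # Needed for os.walk, os.sep and os.path.join below

import pandas as pd
from IPython.display import display  # Built in to notebooks; explicit import lets the code run as a script too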

In [14]:
# Helper function to extract the ground truth files from the cluster directories
def find_files_with_ending(root_dir, file_ending):
    # List to store file details
    file_data = []
    
    # Recursively run through each of the subdirectories
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for file in filenames:
            if file.endswith(file_ending):  # Match files with the specific ending
                # Split the path relative to root_dir into its subdirectory levels,
                # so the result does not depend on how root_dir itself is written
                rel_dir = os.path.relpath(dirpath, root_dir)
                path_parts = [] if rel_dir == os.curdir else rel_dir.split(os.sep)
                
                file_info = {}
                
                # Record each level below the root as subdir_1, subdir_2, ...
                for i, part in enumerate(path_parts, start=1):
                    file_info[f'subdir_{i}'] = part
                
                # Add the file path to the dictionary
                file_info['file_path'] = os.path.join(dirpath, file)

                file_data.append(file_info)
    
    # Convert the list of dictionaries to a DataFrame
    df = pd.DataFrame(file_data)
    return df

# Function to read tab-delimited files from a DataFrame and combine them
def read_tab_delimited_files_from_dataframe(df):
    all_dataframes = []
    for index, row in df.iterrows():
        file_path = row['file_path']  # Get the full file path
        try:
            # Read the tab-delimited file
            file_df = pd.read_csv(file_path, delimiter='\t')
            
            # Include original metadata in the new DataFrame
            for key in row.index:
                if key not in file_df.columns and key != 'file_path':  # Avoid overwriting columns from the CSV and exclude 'file_path'
                    file_df[key] = row[key]
            
            # Append the DataFrame to the list
            all_dataframes.append(file_df)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    
    # Combine all DataFrames into one
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    
    return combined_df

root_directory = "D:/GDC-data"
file_ending = "copy_number_variation.seg.txt"

# Find files with the specified ending
df_files = find_files_with_ending(root_directory, file_ending)

# Read tab-delimited files from the DataFrame and combine them
combined_dataframe = read_tab_delimited_files_from_dataframe(df_files)
combined_dataframe.rename(columns={'subdir_1': 'Case_ID', 'subdir_2': 'Sample'}, inplace=True)
combined_dataframe.drop(columns=['subdir_3', 'subdir_4'], inplace=True)
display(combined_dataframe.head())

# Export the DataFrame as a CSV file
output_csv_path = "ground_truth_combined.csv"
combined_dataframe.to_csv(output_csv_path, index=False)

| | GDC_Aliquot | Chromosome | Start | End | Copy_Number | Major_Copy_Number | Minor_Copy_Number | Case_ID | Sample |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2eb1196d-5e6d-4223-953c-d558c75a8c2e | chr1 | 13116 | 248945703 | 4 | 2 | 2 | C3L-00359 | 1 |
| 1 | 2eb1196d-5e6d-4223-953c-d558c75a8c2e | chr2 | 10587 | 242183243 | 4 | 2 | 2 | C3L-00359 | 1 |
| 2 | 2eb1196d-5e6d-4223-953c-d558c75a8c2e | chr3 | 18519 | 198181744 | 4 | 2 | 2 | C3L-00359 | 1 |
| 3 | 2eb1196d-5e6d-4223-953c-d558c75a8c2e | chr4 | 11961 | 190122722 | 4 | 2 | 2 | C3L-00359 | 1 |
| 4 | 2eb1196d-5e6d-4223-953c-d558c75a8c2e | chr5 | 11882 | 181363900 | 4 | 2 | 2 | C3L-00359 | 1 |
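
Once the combined table exists, a quick summary can help confirm that the extraction covers the expected cases. The sketch below counts distinct cases and labels each segment relative to the diploid baseline of 2 described in the overview. The `CN_State` column and the `label_copy_number_state` helper are illustrative names that do not appear in the notebook, and the third example row is invented. The baseline of 2 is also a simplification, since it does not hold for sex chromosomes in male samples.

```python
import pandas as pd

DIPLOID_COPY_NUMBER = 2  # Baseline for a normal diploid genome

# Hypothetical helper (not part of the original notebook).
def label_copy_number_state(df):
    """Add a CN_State column marking each segment as gain, loss, or neutral."""
    labelled = df.copy()
    labelled["CN_State"] = "neutral"
    labelled.loc[labelled["Copy_Number"] > DIPLOID_COPY_NUMBER, "CN_State"] = "gain"
    labelled.loc[labelled["Copy_Number"] < DIPLOID_COPY_NUMBER, "CN_State"] = "loss"
    return labelled

# Small stand-in for the combined table; the last row is invented for illustration
example = pd.DataFrame({
    "Case_ID": ["C3L-00359", "C3L-00359", "hypothetical-case"],
    "Chromosome": ["chr1", "chr2", "chr1"],
    "Copy_Number": [4, 4, 1],
})

labelled = label_copy_number_state(example)
n_cases = labelled["Case_ID"].nunique()
state_counts = labelled.groupby("Case_ID")["CN_State"].value_counts()

print(f"Distinct cases: {n_cases}")
print(state_counts)
```

To run the same summary on the real data, replace `example` with `pd.read_csv("ground_truth_combined.csv")`.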
