# Cell Profiler - Per Patient
Redistribute the rows of the Nucleus.csv files into one file per patient.  
Store each patient's file in its own directory, in preparation for the other csv files.  
Also, drop the columns that won't be used for classification.  

Problem: the raw CellProfiler outputs were organized per run.  
Inputs like: Class0/Process100_Nucleus.csv (combines patients of cancer class 0)  
Outputs like: TCGA-HT-7482-01Z-00/Process100_Nucleus.csv (specific to one patient)

Based on CP_PerPatient.01, which worked out the Image.csv mechanics.   

Result: failure. This notebook exhausts memory.    
The class 0 nucleus file is too big to load with a single read.    
The next notebook takes another approach.
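
One way to avoid loading a whole Nucleus.csv at once is to stream it in chunks and keep only the rows that are needed. The sketch below is an assumption about what could work, not the approach taken in the next notebook. The helper name `rows_for_images`, the chunk size, and the demo data are all made up for illustration.

```python
import io
import pandas as pd

def rows_for_images(csv_source, image_numbers,
                    image_col='ImageNumber', chunksize=100_000):
    """Hypothetical helper: stream a large CSV in chunks and
    keep only the rows whose image number is wanted."""
    wanted = set(image_numbers)
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most
    # `chunksize` rows, so only one chunk is in memory at a time.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        kept.append(chunk[chunk[image_col].isin(wanted)])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for a Nucleus.csv file (made-up values).
demo_csv = io.StringIO(
    "ImageNumber,ObjectNumber,AreaShape_Area\n"
    "1,1,50\n"
    "1,2,60\n"
    "2,1,70\n"
    "3,1,80\n"
    "3,2,90\n"
)
demo = rows_for_images(demo_csv, image_numbers=[1, 3], chunksize=2)
print(demo.shape)
```

Peak memory then depends on the chunk size plus the rows kept, rather than on the size of the whole file. Passing `usecols` to `pd.read_csv` would reduce it further by skipping unused columns at read time.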

In [1]:
import numpy as np
import pandas as pd
import os
from datetime import datetime
print(datetime.now())

2022-08-08 16:21:47.831663


In [2]:
BASE_PATH_IN='/home/jrm/Adjeroh/Glioma/August_Run/CellProfilerOutputs/'
BASE_PATH_OUT='/home/jrm/Adjeroh/Glioma/August_Run/CellProfilerPerPatient/'
TRACKING_FILE=BASE_PATH_OUT+'PatchTracking.csv'
INPUT_DIRS=[
'Output5/',
'Output5.1/',
'Output4/',
'Output4.1/',
'Output3/',
'Output3.1/',
'Output2/',
'Output1/',
'Output0/'
]
FILENAMES=[
'Process100_Image.csv',
'Process100_Cells.csv',
'Process100_ExpandCells.csv',
'Process100_Experiment.csv',
'Process100_MergeRBC.csv',
'Process100_Nucleus.csv',
'Process100_RBC.csv',
'Process100_ShrinkRBC.csv',
'Process100_Tissue.csv']
IMAGE_FILE=FILENAMES[0]
NUCLEUS_FILE=FILENAMES[5]
IMAGE_COL='ImageNumber'
TUMOR_COL='FileName_Tumor'   # use this column to disambiguate patients
PATIENT_COL='Patient'        # add this column to emphasize patient ID
# Patch filename format: TCGA-06-0129-01Z-00-DX1_5400_5100.png
# For WSI ID, use the first 23 characters.
# For patient or case ID, use the first 19 characters.
LEN_CASE_ID=19
LEN_WSI_ID=23
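
The two length constants above can be checked against the example patch filename from the comment. This is a minimal sketch; the one-column DataFrame is made-up demo data, and applying `.str[:n]` to the `FileName_Tumor` column is an assumption about how the IDs would be derived.

```python
import pandas as pd

LEN_CASE_ID = 19
LEN_WSI_ID = 23
patch_filename = 'TCGA-06-0129-01Z-00-DX1_5400_5100.png'

# Plain string slicing on one filename.
case_id = patch_filename[:LEN_CASE_ID]
wsi_id = patch_filename[:LEN_WSI_ID]
print(case_id)  # TCGA-06-0129-01Z-00
print(wsi_id)   # TCGA-06-0129-01Z-00-DX1

# The same slice applied to a whole column at once (demo data).
demo = pd.DataFrame({'FileName_Tumor': [patch_filename]})
demo['Patient'] = demo['FileName_Tumor'].str[:LEN_CASE_ID]
```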

In [3]:
TEST = BASE_PATH_IN + INPUT_DIRS[0] + NUCLEUS_FILE
df = pd.read_csv(TEST)
bad_cols=['Children_Cells_Count',
          'Number_Object_Number','ImageNumber','ObjectNumber']
loc_cols=[c for c in df.columns if c.startswith('Location_')
          or c.startswith('AreaShape_BoundingBoxM')
          or c.startswith('AreaShape_Center')]
#for c in df.columns:
#    print(c)
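
The cell above only builds the lists of unwanted columns. A minimal sketch of the drop itself follows, on a made-up two-row frame; the demo column names are illustrative and the variable `features` is hypothetical.

```python
import pandas as pd

# Made-up stand-in for a few Nucleus.csv columns.
demo = pd.DataFrame({
    'ImageNumber': [1, 1],
    'ObjectNumber': [1, 2],
    'Location_Center_X': [10.0, 20.0],
    'AreaShape_Center_X': [10.0, 20.0],
    'AreaShape_BoundingBoxMaximum_X': [15, 25],
    'AreaShape_Area': [50, 60],
})
bad_cols = ['Children_Cells_Count',
            'Number_Object_Number', 'ImageNumber', 'ObjectNumber']
# str.startswith accepts a tuple of prefixes.
prefixes = ('Location_', 'AreaShape_BoundingBoxM', 'AreaShape_Center')
loc_cols = [c for c in demo.columns if c.startswith(prefixes)]
# errors='ignore' skips names that are absent from this frame.
features = demo.drop(columns=bad_cols + loc_cols, errors='ignore')
print(list(features.columns))  # ['AreaShape_Area']
```

Since `ImageNumber` is in `bad_cols` but is also the column used to select a patch's rows, the drop would presumably have to happen after that selection.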

In [4]:
def load_tracking_info():
    df = pd.read_csv(TRACKING_FILE)
    return df

In [5]:
def save_patient(patient,nucleus_df):
    if patient is not None:
        print('Save')

In [None]:
print(datetime.now())
tracking_df = pd.read_csv(TRACKING_FILE)
prev_directory = None
orig_nucleus_df = None
patient_nucleus_df = None
prev_patient = None
for ndx,data in tracking_df.iterrows():
    patient=data['patient_directory']
    patch_filename=data['patch_filename']
    orig_directory=data['orig_directory']
    orig_imagenum=data['orig_imagenum']
    new_imagenum=data['new_imagenum']
    #
    if prev_patient != patient:
        save_patient(prev_patient,patient_nucleus_df)
        patient_nucleus_df = None
        prev_patient = patient
    if prev_directory != orig_directory:
        prev_directory = orig_directory
        filename = BASE_PATH_IN + orig_directory + NUCLEUS_FILE
        print('Loading big original file',filename)
        orig_nucleus_df = pd.read_csv(filename)
        # Fix: the original called fillna on nucleus_df, which is undefined here.
        orig_nucleus_df = orig_nucleus_df.fillna(0)
    #
    print(patient)
    patch_df = orig_nucleus_df[orig_nucleus_df[IMAGE_COL]==orig_imagenum]
    if len(patch_df)<1:
        print('Uh oh')
    if patient_nucleus_df is None:
        patient_nucleus_df = patch_df
    else:
        patient_nucleus_df = pd.concat( (patient_nucleus_df,patch_df) )
    print('Nucleus shape',patient_nucleus_df.shape)
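    #
    # TODO: must renumber images (orig_imagenum -> new_imagenum)
# The loop only saves on a patient change, so save the last patient here.
save_patient(prev_patient,patient_nucleus_df)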

2022-08-08 16:22:23.389333
Loading big original file /home/jrm/Adjeroh/Glioma/August_Run/CellProfilerOutputs/Output0/Process100_Nucleus.csv
