<a href="https://colab.research.google.com/github/EdRey05/Resources_for_Mulligan_Lab/blob/main/Tools%20for%20students/Eduardo%20Reyes/04-ExtractCells_Broad_Institute_DepMapCCLE%5BColab%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#***Notebook to extract data from the updated Cancer Cell Line Encyclopedia (DepMap Public 22Q2 files)***

**Original data from:** *Broad Institute and Novartis*

**Data downloaded from:** *https://depmap.org/portal/download/*

**Notebook made by:** *Eduardo Reyes-Alvarez (Ph.D. candidate)*

**Affiliation:** *Dr. Lois Mulligan's lab, Queen's University.*

**Contact:** *eduardo_reyes09@hotmail.com*

**Date of latest version:** December 08, 2022.

##Instructions

***This notebook is similar to 03-ExtractCells_Broad_Institute_CCLE_2019_[Colab], but uses more recent data.***

***NOTE:*** You need to log in using the Gmail account for the Mulligan lab to use this notebook. Alternatively, you can download the files mentioned below from the website, save them in your Google Drive and edit the directories in the "Main function" section of the code.

Before starting a search, it is recommended to check the website mentioned above and read the description of the files they have there. In this notebook, we used the most recent files (version 22Q2):

  1.   **CCLE_expression.csv** : Gene expression TPM values of the protein coding genes for DepMap cell lines. Values are inferred from RNA-seq data using the RSEM tool and are reported after log2 transformation, using a pseudo-count of 1; log2(TPM+1).
  2.   **sample_info.csv** : Metadata for all of DepMap’s cancer models/cell lines. A full description of each column is available in the DepMap Release README file.
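
Because the expression values are stored as log2(TPM+1), recovering a TPM value means inverting that transform: TPM = 2^x - 1. A minimal sketch of both directions follows; the helper names `tpm_to_log2` and `log2_to_tpm` are hypothetical and are not part of this notebook.

```python
import math

def tpm_to_log2(tpm):
    """Forward transform described for CCLE_expression.csv: log2(TPM + 1)."""
    return math.log2(tpm + 1)

def log2_to_tpm(value):
    """Invert log2(TPM + 1) to recover the TPM value."""
    return 2 ** value - 1

# A gene with TPM = 7 is stored as log2(8) = 3.0, and converts back to 7.0
stored = tpm_to_log2(7)
recovered = log2_to_tpm(stored)
```

Note that a stored value of 0 corresponds to a TPM of 0, which is the reason for the pseudo-count of 1.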

* You can download and try to preview the files in Excel if you want, but they are very big and may not load very well.

* Once you are ready to run this notebook, place your cursor in the grey box under the "Code" section (it says "15 cells hidden"). A "play" button should appear; click on it. After about 1 minute, go to the box under the "Run your search" section, click on its "play" button in the same way, and follow the instructions.

##Code



###Function to validate inputs

In [None]:
#Input validation (type, min, max and range)
#Modified from here: https://stackoverflow.com/questions/23294658/asking-the-user-for-input-until-they-give-a-valid-response

def check_input(prompt, type_=None, min_=None, max_=None, range_=None):
    if min_ is not None and max_ is not None and max_ < min_:
        raise ValueError("min_ must be less than or equal to max_.")
    while True:
        user_input = input(prompt)
        if type_ is not None:
            try:
                user_input = type_(user_input)
            except ValueError:
                print("Input type must be {0}.".format(type_.__name__))
                continue
        if max_ is not None and user_input > max_:
            print("Input must be less than or equal to {0}.".format(max_))
        elif min_ is not None and user_input < min_:
            print("Input must be greater than or equal to {0}.".format(min_))
        elif range_ is not None and user_input not in range_:
            if isinstance(range_, range):
                template = "Input must be between {0.start} and {0.stop}."
                print(template.format(range_))
            else:
                template = "Input must be {0}."
                if len(range_) == 1:
                    print(template.format(*range_))
                else:
                    expected = " or ".join((
                        ", ".join(str(x) for x in range_[:-1]),
                        str(range_[-1])
                    ))
                    print(template.format(expected))
        else:
            return user_input
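
The checks in `check_input` run in a fixed order: type conversion, then maximum, then minimum, then membership in the allowed values. The sketch below applies the same order to a single value without prompting, which makes the behaviour easier to try out. The function `validate_value` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def validate_value(raw, type_=None, min_=None, max_=None, range_=None):
    """Return (value, None) if raw passes every check, else (None, message)."""
    value = raw
    if type_ is not None:
        try:
            value = type_(raw)
        except ValueError:
            return None, "Input type must be {0}.".format(type_.__name__)
    if max_ is not None and value > max_:
        return None, "Input must be less than or equal to {0}.".format(max_)
    if min_ is not None and value < min_:
        return None, "Input must be greater than or equal to {0}.".format(min_)
    if range_ is not None and value not in range_:
        allowed = ", ".join(str(x) for x in range_)
        return None, "Input must be one of: {0}.".format(allowed)
    return value, None

# Examples mirroring how the notebook calls check_input
ok_number = validate_value("3", int, 0, 5)
too_large = validate_value("9", int, 0, 5)
yes_no = validate_value("y", str, range_=("Y", "y", "N", "n"))
```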

###Function to search for cell lines

In [None]:
def search_cells(cell_menu):

  #Lists to store selected cell line names
  keepcells_name = []

  #Loop to search text keys
  while True:
    #Check if we want to continue or exit
    continue_search = check_input("\n Search for cell line? (Y/N) ", str, range_=("Y", "y", "N", "n"))
    if continue_search=="N" or continue_search=="n":
      break
    
    #Get string of interest (we need a valid input to proceed)  
    while True:
      search_string = str(input("\n Type term to search for: "))
      search_string = search_string.upper()
      search_results = [cell for cell in cell_menu if search_string in cell]
      
      #If nothing matches, ask again; otherwise add a "-" placeholder at index 0 so that results are numbered from 1
      if search_results==[]:
        print("\n Nothing was found! Try another term or a shorter version of it!")
      else:
        search_results = ["-"] + search_results
        break
        
    #Print the numbered list of matching cell lines
    for i,cell in enumerate(search_results):
      print(" \t ", i,cell)
    
    #Get index of cell line the user wants to extract and save its name
    keepcells_index = check_input("\n Number of cell line to keep (use 0 if none are needed):", int, 0, len(search_results)-1)
    if keepcells_index!=0:
      keepcells_name.append(search_results[keepcells_index]) 

    #Clear output window before starting at the top of the loop again
    clear_output(wait=True)

  #Once the search is done, sort the selected cell line names alphabetically and return them
  keepcells_name = sorted(keepcells_name)
 
  return keepcells_name
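
The core of the search is a substring match against the list of CCLE names. A minimal sketch of that step on its own follows. The function `find_matches` is a hypothetical helper and the short `menu` list is only an illustrative sample. Unlike `search_cells`, this sketch upper-cases both sides of the comparison, so it does not assume that every name in the menu is already upper case.

```python
def find_matches(cell_menu, term):
    """Case-insensitive substring search over a list of cell line names."""
    term = term.upper()
    return [cell for cell in cell_menu if term in cell.upper()]

# Small illustrative sample of names in the NAME_TISSUE style
menu = ["A549_LUNG", "MCF7_BREAST", "NCIH1299_LUNG", "TT_THYROID"]

lung_hits = find_matches(menu, "lung")
no_hits = find_matches(menu, "xyz")
```

An empty result corresponds to the "Nothing was found!" branch, where the user is asked for another term.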

###Function to finish analysis and save

In [None]:
def end_analysis(extracted_RNA_data):
  
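  #Each column is one selected cell line (gene names are the index), so a single selection is enough to offer saving
  if len(extracted_RNA_data.columns)>0: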
    print("\n \t Preview of your dataset: \n", extracted_RNA_data.head())
    save_file = check_input("\n Save dataset? (Y/N)", str, range_=("Y", "y", "N", "n"))
    
    if save_file=="Y" or save_file=="y":
      output_name = str(input("\n Save file as: "))
    
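      #Write the dataset to an Excel file; the "with" block saves and closes the file on exit
      #(writer.save() was removed in pandas 2.0)
      global destination
      destination = data_output_dir + "/" + output_name + ".xlsx"
      with pd.ExcelWriter(destination, engine='xlsxwriter') as writer:
        extracted_RNA_data.to_excel(writer, sheet_name="RNA_expression")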
  
  new_search = check_input("\n Start a new search? (Y/N)", str, range_=("Y", "y", "N", "n"))
  new_search = new_search in ("Y", "y")
  return new_search


###Main function

**Import required packages**

In [None]:
#This is needed when working in Google Colab to synchronize google drive
from google.colab import drive
drive.mount('/content/drive/')

import pandas as pd
from IPython.display import clear_output

!pip install XlsxWriter



**Import required files**

In [None]:
#Paths of the 2 input files to be used
directory1 = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab/RNASeq/Broad_Institute_CCLE/input_files/DepMap Public 22Q2 Sample_info.csv"
directory2 = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab/RNASeq/Broad_Institute_CCLE/input_files/DepMap Public 22Q2 CCLE_expression.csv"

#Directory where any outputs generated will be saved
data_output_dir = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab/RNASeq/Broad_Institute_CCLE/output_files"

#Import all csv files into dataframes
sample_IDs = pd.read_csv(directory1)
RNA_expression = pd.read_csv(directory2)

**Pre-Processing**

Some pre-processing is needed because the two files are organized differently. The RNA expression file has one row per cell line and one column per gene, whereas we want genes as rows, so we transpose it. Its rows are labelled only with the Achilles ID (the `DepMap_ID` column of the sample info file, which is the newer common identifier across all DepMap datasets); ***the names of the cell lines are not given in this newer dataset***. To let the user search by name, we take the CCLE names from the sample info file, look up the Achilles ID of each name the user selects, and use those IDs to extract only the wanted cell lines from the expression dataset.
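
The steps described above can be sketched on two toy tables. Everything in this example is made up for illustration (the IDs, cell line names, gene names and values), and the `name_to_id` dictionary is just one way to do the lookup; the notebook itself uses an index-based lookup.

```python
import pandas as pd

# Toy stand-in for sample_info.csv (note the entry without a CCLE name)
sample_info = pd.DataFrame({
    "DepMap_ID": ["ACH-000001", "ACH-000002", "ACH-000003"],
    "CCLE_Name": ["CELLA_LUNG", "CELLB_BREAST", None],
})

# Toy stand-in for CCLE_expression.csv: rows are DepMap IDs, columns are genes
expression = pd.DataFrame(
    {"GENE1": [1.0, 2.0], "GENE2": [0.5, 3.5]},
    index=["ACH-000001", "ACH-000002"],
)

# Drop cell lines without a name and build a name -> DepMap ID lookup
named = sample_info.dropna(subset=["CCLE_Name"])
name_to_id = dict(zip(named["CCLE_Name"], named["DepMap_ID"]))

# Transpose so that genes are rows and cell lines are columns
by_gene = expression.T
by_gene.index.name = "Gene"

# Extract the requested cell lines and label the columns with their names
wanted = ["CELLB_BREAST", "CELLA_LUNG"]
ids = [name_to_id[name] for name in wanted]
subset = by_gene[ids].copy()
subset.columns = wanted
```

The resulting `subset` has one row per gene and one column per requested cell line, which is the layout that gets saved to Excel.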

In [None]:
#First, sort the IDs by CCLE name and extract them to show the user the available cell lines (there are some NaN entries!!!)
sample_IDs = sample_IDs.sort_values(by=["CCLE_Name"])
sample_IDs = sample_IDs[sample_IDs["CCLE_Name"].notna()]
cell_menu = sample_IDs["CCLE_Name"].values.tolist()

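#Second, the first column of the RNASeq dataset (the Achilles IDs) has no header, so pandas names it "Unnamed: 0"
#We use it as the index and transpose, so that genes become rows and cell lines become columns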
RNA_expression = RNA_expression.set_index("Unnamed: 0").T
RNA_expression.index.name = "Gene"
RNA_expression = RNA_expression.sort_index(axis=1)
RNA_expression = RNA_expression.sort_index()


**Function to run the search loops**

In [None]:
def Begin_Search_Here():

  #We do this loop in case the user wants to generate multiple files with different subsets of cell lines
  while True:
    
    #Pass the list of available cells, get back the ones requested by the user
    keepcells_name = search_cells(cell_menu)
    
    #Search the names the user wants in the sample info file to get their index and find their Achilles code
    #NOTE: The .index method gives Int64Index([1360], dtype='int64'), so we need to select the [0] item to get the number only
    keepcells_ACH = []
    for cell_name in keepcells_name:
      cell_name_index = sample_IDs.index[sample_IDs["CCLE_Name"] == cell_name]
      keepcells_ACH.append(sample_IDs["DepMap_ID"].loc[cell_name_index[0]])
    
    #Once we have the Achilles IDs, we can filter the dataset with the corresponding cell lines and rename them
    extracted_RNA_data = RNA_expression[keepcells_ACH]
    extracted_RNA_data.columns = keepcells_name

    #This search is done: offer to save the results and ask whether to start another search
    clear_output(wait=True)
    new_search = end_analysis(extracted_RNA_data)
    if not new_search:
      print("\n Process completed! Your file(s) can be found in the output_files folder...", "\n\n To start a new search, run the Begin_Search_Here box again :) ")
      break
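
Selecting columns with a list, as in `RNA_expression[keepcells_ACH]`, raises a `KeyError` if any of the labels is absent. That could happen if a cell line listed in the sample info file has no expression profile. A minimal sketch of a guard follows, assuming a hypothetical helper `split_available` that is not part of this notebook; the names and IDs are made up.

```python
def split_available(selected, available_ids):
    """Separate selected cell lines by whether their ID has expression data.

    selected: list of (cell_name, depmap_id) pairs
    available_ids: iterable of IDs present as columns in the expression table
    Returns (found_pairs, missing_names).
    """
    available = set(available_ids)
    found = [(name, dep_id) for name, dep_id in selected if dep_id in available]
    missing = [name for name, dep_id in selected if dep_id not in available]
    return found, missing

# Made-up names and IDs, for illustration only
selected = [("CELLA_LUNG", "ACH-000001"), ("CELLC_SKIN", "ACH-000003")]
found, missing = split_available(selected, ["ACH-000001", "ACH-000002"])
```

In this notebook, `available_ids` would be `RNA_expression.columns`, and the names in `missing` could be reported to the user before extracting only the IDs in `found`.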


##Run your search

In [None]:
#RUN THIS BOX TO START!
Begin_Search_Here()