<a href="https://colab.research.google.com/github/EdRey05/Resources_for_Mulligan_Lab/blob/main/Tools%20for%20students/Eduardo%20Reyes/ExtractCells_Broad_Institute_CCLE_2019_%5BColab%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#***Notebook to extract data from the Cancer Cell Line Encyclopedia 2019***

**Original data from:** *Broad Institute and Novartis*

**Publication DOI:** *10.1038/s41586-019-1186-3*

**Data downloaded from:** *https://www.cbioportal.org/study/summary?id=ccle_broad_2019*

**Notebook made by:** *Eduardo Reyes-Alvarez (Ph.D. candidate)*

**Affiliation:** *Dr. Lois Mulligan's lab, Queen's University.*

**Contact:** *eduardo_reyes09@hotmail.com*

**Date of latest version:** February 04, 2021.

##Instructions

* Before starting an analysis, it is recommended to read the txt file containing the Info and Metadata of the study used in this notebook. You will be asked to connect the Google Drive account where the input files are stored before you can use this notebook.

* The study provides RNA-Seq data for **cancer cell lines only**, in 3 different units:
  1.   mRNA expression (RNA-Seq RPKM)
  2.   Log-transformed mRNA z-Scores compared to the expression distribution of all samples (log RNA-Seq RPKM)
  3.   mRNA expression z-Scores relative to diploid samples (RNA-Seq RPKM)

* Once you have decided which units you want, remember the number (1, 2 or 3). If you are not sure which one to use, choose option 1, which contains the RPKM values without any z-score transformation.

* Place your cursor on the grey box under the "Code" section (it says 13 cells hidden). A "play" button should appear; click on it. A "stop" icon is shown while the code runs and disappears once it finishes.

* Place your cursor on the grey box under the "Run your search" section. A "play" button will appear; click on it and follow the instructions.
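
To make the difference between units 1 and 2 more concrete, here is a minimal sketch of how a per-gene z-score can be derived from RPKM values. The numbers are made up, and the `log2(x + 1)` transform and the sample standard deviation are assumptions chosen for illustration; the files in the study may have been computed differently. Unit 3 follows the same idea but uses only diploid samples as the reference distribution, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Toy RPKM table (made-up numbers): rows are genes, columns are cell lines
rpkm = pd.DataFrame(
    {"CELL_A": [0.0, 3.0], "CELL_B": [1.0, 15.0], "CELL_C": [3.0, 63.0]},
    index=["GENE1", "GENE2"],
)

# Assumption: log2(x + 1) transform; the study files may use a different one
log_rpkm = np.log2(rpkm + 1)

# z-score per gene: subtract the gene's mean across cell lines, divide by its std
gene_mean = log_rpkm.mean(axis=1)
gene_std = log_rpkm.std(axis=1)  # pandas default: sample std (ddof=1)
zscores = log_rpkm.sub(gene_mean, axis=0).div(gene_std, axis=0)

print(zscores.round(2))
```

In this toy table both genes get z-scores of -1, 0 and 1 even though their RPKM values differ a lot, which is why z-scores are useful for comparing genes with very different expression levels.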

##Code



###Import packages and directories

In [None]:
import pandas as pd
from IPython.display import clear_output

!pip install XlsxWriter


In [None]:
#Directories of the 3 files available from the Broad Institute (different units of RNA expression)

directory1 = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab - PhD/RNASeq/Broad_Institute_CCLE/input_files/1-data_RNA_Seq_expression_median.txt"
directory2 = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab - PhD/RNASeq/Broad_Institute_CCLE/input_files/2-data_RNA_Seq_mRNA_median_all_sample_Zscores.txt"
directory3 = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab - PhD/RNASeq/Broad_Institute_CCLE/input_files/3-data_RNA_Seq_mRNA_median_Zscores.txt"
data_output_dir = "/content/drive/MyDrive/Colab Notebooks/Mulligan Lab - PhD/RNASeq/Broad_Institute_CCLE/output_files"
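
If the notebook fails with a "file not found" error, the paths above probably do not match the folder structure of your Google Drive. The sketch below shows one way to check this before running a search. `find_missing_files` is a hypothetical helper that is not part of this notebook.

```python
from pathlib import Path

def find_missing_files(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Hypothetical usage with the variables defined in the cell above
# (Google Drive must already be mounted):
# missing = find_missing_files([directory1, directory2, directory3])
# if missing:
#     print("Missing input files:", *missing, sep="\n  ")
```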


###Function to validate inputs

In [None]:
#Input validation (type, min, max and range)
#Modified from here: https://stackoverflow.com/questions/23294658/asking-the-user-for-input-until-they-give-a-valid-response

def check_input(prompt, type_=None, min_=None, max_=None, range_=None):
    if min_ is not None and max_ is not None and max_ < min_:
        raise ValueError("min_ must be less than or equal to max_.")
    while True:
        user_input = input(prompt)
        if type_ is not None:
            try:
                user_input = type_(user_input)
            except ValueError:
                print("Input type must be {0}.".format(type_.__name__))
                continue
        if max_ is not None and user_input > max_:
            print("Input must be less than or equal to {0}.".format(max_))
        elif min_ is not None and user_input < min_:
            print("Input must be greater than or equal to {0}.".format(min_))
        elif range_ is not None and user_input not in range_:
            if isinstance(range_, range):
                #range_.stop is exclusive, so the largest valid value is stop - 1
                template = "Input must be between {0} and {1}."
                print(template.format(range_.start, range_.stop - 1))
            else:
                template = "Input must be {0}."
                if len(range_) == 1:
                    print(template.format(*range_))
                else:
                    expected = " or ".join((
                        ", ".join(str(x) for x in range_[:-1]),
                        str(range_[-1])
                    ))
                    print(template.format(expected))
        else:
            return user_input
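
The function above keeps asking until the reply passes every check. A stripped-down sketch of the same pattern is shown below. `ask_until_valid` and its `input_fn` parameter are hypothetical names that are not part of this notebook; passing the input function as a parameter is an assumption made here so that the loop can be tried with scripted answers instead of the keyboard.

```python
def ask_until_valid(prompt, choices, input_fn=input):
    """Keep asking until the answer is one of `choices`; return that answer."""
    while True:
        answer = input_fn(prompt)
        if answer in choices:
            return answer
        print("Input must be one of: {0}.".format(", ".join(choices)))

# Scripted answers stand in for the keyboard: two invalid replies, then a valid one
scripted = iter(["maybe", "", "y"])
result = ask_until_valid("Search for cell line? (Y/N) ", ("Y", "y", "N", "n"),
                         input_fn=lambda prompt: next(scripted))
print(result)  # y
```

The two invalid replies each print the reminder message, and only the third reply is returned.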

###Function to select input file

In [None]:
#Ask for RNA expression file to use, and read only the cell lines/column names

def select_input_file ():

  RNA_input = check_input("\n Select the gene expression file to use (1, 2 or 3): ", int, range_=(1, 2, 3))
  global directory
  directory = directory1 if RNA_input==1 else directory2 if RNA_input==2 else directory3
  cell_menu = pd.read_csv(directory, sep='\t', header=0, nrows=0).columns.tolist()
  cell_menu.remove("Hugo_Symbol")

  return cell_menu
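
The `nrows=0` argument is what makes this step fast: only the header line is parsed, so the list of cell lines is available without loading any expression values. A minimal sketch with a small in-memory stand-in for the input file (the values are borrowed from the sample output at the end of this notebook):

```python
import io
import pandas as pd

# Small tab-separated stand-in for one of the expression files
fake_file = io.StringIO(
    "Hugo_Symbol\tTT_THYROID\tDLD1_LARGE_INTESTINE\n"
    "ATL1\t0.37566\t1.52382\n"
    "HNRNPUL2\t32.47757\t32.01634\n"
)

# nrows=0 parses the header line only, so no expression values are loaded
header_only = pd.read_csv(fake_file, sep="\t", header=0, nrows=0)
cell_menu = header_only.columns.tolist()
cell_menu.remove("Hugo_Symbol")

print(cell_menu)         # ['TT_THYROID', 'DLD1_LARGE_INTESTINE']
print(len(header_only))  # 0
```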
 

###Function to search for cell lines

In [None]:
def search_cells(cell_menu):

  #Lists to store selected cell line names
  keepcells_name = []

  #Loop to search text keys
  while True:
    #Check if we want to continue or exit
    continue_search = check_input("\n Search for cell line? (Y/N) ", str, range_=("Y", "y", "N", "n"))
    if continue_search=="N" or continue_search=="n":
      break
    
    #Get string of interest (we need a valid input to proceed)  
    while True:
      search_string = str(input("\n Type term to search for: "))
      search_string = search_string.upper()
      search_results = [cell for cell in cell_menu if search_string in cell]
      
      #Print columns that contain that string
      if search_results==[]:
        print("\n Nothing was found! Try other term or a shorter version of it!")
      else:
        search_results = ["-"] + search_results
        break
        
    for i,cell in enumerate(search_results):
      print(" \t ", i,cell)
    
    #Get index of cell line the user wants to extract and save its name
    keepcells_index = check_input("\n Number of cell line to keep (use 0 if none are needed):", int, 0, len(search_results)-1)
    if keepcells_index!=0:
      keepcells_name.append(search_results[keepcells_index]) 

    #Clear output window before starting at the top of the loop again
    clear_output(wait=True)

  #Once the search is done, import all the columns selected by the user
  #We sort the cell line names (dropping duplicates) and add back the column containing the gene names (Hugo_Symbol)

  keepcells_name = sorted(set(keepcells_name))
  import_cells = ["Hugo_Symbol"] + keepcells_name
 
  return import_cells
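
The search itself is a plain substring match: the term is converted to upper case and every column name containing it is listed. The sketch below isolates that step. `find_cells` is a hypothetical helper, stripping the surrounding spaces is an addition that the function above does not do, and the match relies on the assumption that the column names are upper case, as they are in the sample output of this notebook.

```python
# Cell line names copied from the sample output at the end of this notebook
cell_menu = ["MIAPACA2_PANCREAS", "TT_THYROID", "DLD1_LARGE_INTESTINE"]

def find_cells(term, names):
    """Return the names containing `term`, ignoring case and surrounding spaces."""
    term = term.strip().upper()
    return [name for name in names if term in name]

print(find_cells("panc", cell_menu))  # ['MIAPACA2_PANCREAS']
print(find_cells(" tt ", cell_menu))  # ['TT_THYROID']
print(find_cells("hela", cell_menu))  # []
```

Because the match is on substrings, a short term such as a tissue name returns every cell line from that tissue, which is usually the quickest way to find the exact column name.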

###Function to finish analysis and save

In [None]:
def end_analysis(extracted_data):
  
  if len(extracted_data.columns)>1:
    print("\n \t Sample of your dataset: \n", extracted_data.sample(10))
    save_file = check_input("\n Save dataset? (Y/N)", str, range_=("Y", "y", "N", "n"))
    
    if save_file=="Y" or save_file=="y":
      output_name = str(input("\n Save file as: "))
    
      global destination
      destination = data_output_dir + "/" + output_name + ".xlsx"
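      #The "with" block saves and closes the file on exit (writer.save() was removed in pandas 2.0)
      #Excel limits sheet names to 31 characters, so longer names are truncated
      with pd.ExcelWriter(destination, engine='xlsxwriter') as writer:
        extracted_data.to_excel(writer, sheet_name=output_name[:31])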
  
  new_search = check_input("\n Start a new search? (Y/N)", str, range_=("Y", "y", "N", "n"))
  new_search = new_search in ("Y", "y")
  return new_search
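
The name typed at the "Save file as" prompt is used both as the file name and as the sheet name. Excel sheet names cannot be longer than 31 characters and cannot contain any of the characters `[ ] : * ? / \`, so some names will make the save step fail. The sketch below shows one possible way to clean a name first. `safe_sheet_name` is a hypothetical helper that is not part of this notebook.

```python
import re

def safe_sheet_name(name, max_length=31):
    """Replace characters Excel rejects in sheet names and trim to the length limit."""
    cleaned = re.sub(r"[\[\]:*?/\\]", "_", name)
    return cleaned[:max_length]

print(safe_sheet_name("TT-Miapaca-Panc1-DLD1"))         # TT-Miapaca-Panc1-DLD1
print(safe_sheet_name("thyroid/pancreas: comparison"))  # thyroid_pancreas_ comparison
print(len(safe_sheet_name("x" * 40)))                   # 31
```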


###Main function

In [None]:
def Begin_Search_Here():

  #This is needed when working in Google Colab to synchronize google drive
  from google.colab import drive
  drive.mount('/content/drive/')

  while True:
    cell_menu = select_input_file()
    import_cells = search_cells(cell_menu)
    
    extracted_data = pd.read_csv(directory, sep='\t', usecols=import_cells)
    extracted_data = extracted_data.sort_values("Hugo_Symbol")
    extracted_data = extracted_data.reset_index(drop=True)
    clear_output(wait=True)
    
    new_search = end_analysis(extracted_data)
    if not new_search:
      print("\n Process completed!", "\n\n To start a new analysis, run the Begin_Search_Here box again :) ")
      break
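
The extraction step in the main function relies on `usecols`, which loads only the selected columns instead of the whole file. A minimal sketch with a small in-memory stand-in for the input file (the values are borrowed from the sample output at the end of this notebook):

```python
import io
import pandas as pd

# Small tab-separated stand-in for one of the expression files
fake_file = io.StringIO(
    "Hugo_Symbol\tMIAPACA2_PANCREAS\tTT_THYROID\tDLD1_LARGE_INTESTINE\n"
    "HNRNPUL2\t18.92155\t32.47757\t32.01634\n"
    "ATL1\t0.10348\t0.37566\t1.52382\n"
)

import_cells = ["Hugo_Symbol", "TT_THYROID"]

# usecols loads only the requested columns; the rest are skipped while parsing
extracted = pd.read_csv(fake_file, sep="\t", usecols=import_cells)
extracted = extracted.sort_values("Hugo_Symbol").reset_index(drop=True)

print(extracted.columns.tolist())         # ['Hugo_Symbol', 'TT_THYROID']
print(extracted["Hugo_Symbol"].tolist())  # ['ATL1', 'HNRNPUL2']
```

Note that pandas returns the columns in the order they appear in the file, not in the order of the `usecols` list.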


##Run your search

In [None]:
#Run this code box to begin!
Begin_Search_Here()


 	 Sample of your dataset: 
          Hugo_Symbol  MIAPACA2_PANCREAS  ...  TT_THYROID  DLD1_LARGE_INTESTINE
36409  RP11-276M12.1            0.01839  ...     2.37423               0.00000
39075   RP11-425A6.5            0.02247  ...     0.08701               0.00000
11480  CTD-2196E14.9            1.33212  ...     2.45094               0.71639
37579   RP11-343H5.4           30.09873  ...     3.07645              37.76643
10793    CTC-297N7.5            0.57077  ...     0.17824               0.00000
6918            ATL1            0.10348  ...     0.37566               1.52382
22979      MIRLET7A1            0.00000  ...     0.00000               0.00000
36948  RP11-308D16.1            0.00000  ...     0.00000               0.00000
32701   RP1-130L23.1            0.00000  ...     0.00000               0.00000
17422       HNRNPUL2           18.92155  ...    32.47757              32.01634

[10 rows x 5 columns]

 Save dataset? (Y/N)y

 Save file as: TT-Miapaca-Panc1-DLD1

 Start a new sea