First you need to link your Google Drive to the notebook in order to access the files needed for this module.

Run the cell below and follow instructions to mount the drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Installing Biopython

At the beginning of each module, we will install **Biopython**. Biopython is a large open-source application programming interface (API) used both in bioinformatics software development and in everyday scripts for common bioinformatics tasks. It contains several packages that you will import to run the analyses required for this project.

REF:
* Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11), 1422–1423. https://doi.org/10.1093/bioinformatics/btp163


In [None]:
!pip install biopython

# Investigating the biological impact of the mutation and its possible role in human disease
For this section, your research will focus on investigating the biological impact of the mutation you are studying. To do this, you will use the OMIM and KEGG databases.

## OMIM: searching for information on genetic diseases

The **OMIM** (Online Mendelian Inheritance in Man) database contains short, referenced reviews of genetic loci and genetic diseases. It can be a very useful resource for finding out what research has been done on a gene or a disease.

REF:
* http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

## Install and import the necessary packages:

The **romim package** was created to query the OMIM database but it runs in R. 

**R** is another programming language so you will need to install **rpy2** to run R code in Google Colab.

**Methods** and **remotes** are R packages that help us both install the package and use the functions in the code.

The **XML** package will be used to read the results that you obtain from your database searches. 

REF:
* https://github.com/davetang/romim

In [None]:
# Load the rpy2 extension so that cells starting with %%R run as R code
%load_ext rpy2.ipython

In [None]:
%%R
# %%R must be the first line of every R cell in Colab

# Install the helper packages first: remotes provides install_github,
# and XML is used to save and read the search results
install.packages('remotes')
install.packages('XML')
# methods ships with R, so it only needs to be loaded
library(methods)

# Install the main package from GitHub
# Note how different this is from Python code
remotes::install_github('davetang/romim')

# Load the libraries
library(romim)
library(XML)

# Press 1 and ENTER if prompted

## Obtaining the ID number (called mim number) associated with the lung cancer entry in OMIM

In [None]:
%%R
# %%R must be the first line of every R cell in Colab

# To access OMIM, we will use this key, which works as our password to the database
set_key('4PUvWRqSSD2BuprIVAP_VQ')

# First, let's get the list of MIM numbers associated with KRAS2
# Write KRAS2 inside the parentheses below

my_list <- gene_to_omim('####')

# Now let's fetch the OMIM entry for each MIM number (with get_omim)
my_list_omim <- sapply(my_list, get_omim)

# This will list the title of each entry
sapply(my_list_omim, get_title)

Write down the ID number for 'LUNG CANCER' from the results above. (ID precedes the entry title)

Answer here

## Using OMIM to obtain more information about the disease
This time you will search the OMIM 'LUNG CANCER' entry for information.

The function 'get_omim' helps you do just that since we can set certain arguments to 'TRUE' and obtain specific information about the entry.

Run the next cell to see a list of Arguments that you can access.

In [None]:
%%R
help(get_omim)

### Start by setting 'referenceList' to TRUE


In [None]:
%%R
set_key('4PUvWRqSSD2BuprIVAP_VQ') # The key must be added before every request

# Using mim number to get the article list
# Write the mim number inside the parenthesis and set 'referenceList' to 'TRUE'
omim_result <- get_omim(####, referenceList = ####)

# Save the results as an XML file
saveXML(omim_result, file='FILE NAME HERE.xml') # Write a file name

# File name will display as output for this cell


### Display the results in the form of a table

In [None]:
#@title Load the results by providing the file name in this form (include file extension .xml)

# MAKING RESULTS LOOK GOOD
import xml.etree.ElementTree as ET
import csv
import pandas as pd

file_name = '' #@param {type:"string"}


# Parse the XML file saved in the previous cell
tree = ET.parse(file_name)
root = tree.getroot()

# Fields to pull out of each <reference> element; they double as column headers
fields = ['authors', 'title', 'source', 'pubmedID']

# newline='' keeps the csv module from adding blank rows on some platforms
with open('refdata.csv', 'w', newline='') as ref_data:
    csvwriter = csv.writer(ref_data)
    csvwriter.writerow(fields)
    for member in root.findall('.//reference'):
        # findtext returns None (written as an empty cell) when a field is missing
        csvwriter.writerow([member.findtext('.//' + field) for field in fields])

data = pd.read_csv("refdata.csv")
data
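
The cell above walks over every `<reference>` element in the saved XML and copies a few child fields into a table. The sketch below shows the same idea on a small inline XML string, using only the standard library. The XML content is a made-up placeholder, not real OMIM output, and the second reference deliberately lacks a `pubmedID` to show how a missing field can be handled.

```python
import xml.etree.ElementTree as ET

# Toy XML standing in for a saved result file; every value is a placeholder
toy_xml = """
<omim>
  <reference>
    <authors>Author A, Author B</authors>
    <title>First placeholder title</title>
    <pubmedID>0000001</pubmedID>
  </reference>
  <reference>
    <authors>Author C</authors>
    <title>Second placeholder title</title>
  </reference>
</omim>
"""

root = ET.fromstring(toy_xml)
fields = ['authors', 'title', 'pubmedID']

rows = []
for reference in root.findall('.//reference'):
    # findtext returns the default ('' here) when the child element is absent
    rows.append([reference.findtext(field, default='') for field in fields])

for row in rows:
    print(row)
```

Using `findtext` with a default avoids an error when an element is missing, at the cost of leaving an empty cell in the table.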

## Reading the abstract from the first article in the list

In [None]:
from Bio import Entrez
# Searching for the abstract in the database Pubmed
Entrez.email = 'YOUR EMAIL HERE'

query7 = Entrez.efetch(db='####', id = '####', rettype = '####', retmode = 'text')
# Hint for rettype (retrieval type): you want to retrieve the abstract

# Reading the query and printing it
print(query7.read())

# Closing the query
query7.close()

## Answer the following questions:
Input your answer in the cell below each question and press SHIFT+ENTER.

1. Who is the first author on this article and what journal was it published in?


Answer here

2. Describe who was involved in the study (how many and what categories of patients?).

Answer here

3. What did the researchers find out about K-Ras mutations?


Answer here

4. What conclusion(s) did the researchers come to about K-Ras mutations based on their data? (Summarize and put into your own words)

Answer here

### Get information about genes related to the disease by setting 'geneMap' to 'TRUE'

In [None]:
%%R
set_key('####')

# Using mim number of 'LUNG CANCER' we search OMIM, setting 'geneMap' to TRUE
omim_result2 <- get_omim(####, geneMap = ####)

# Save the results as an XML file
saveXML(omim_result2, file='FILE NAME HERE.xml') # Write a file name

### Display the results in the form of a table

In [None]:
#@title Load the results by providing the file name in this form (include file extension .xml)

# MAKING RESULTS LOOK GOOD
import xml.etree.ElementTree as ET
import csv
import pandas as pd

file_name = "" #@param {type:"string"}

# Parse the XML file saved in the previous cell
tree = ET.parse(file_name)
root = tree.getroot()

# Fields to pull out of each <phenotypeMap> element; they double as column headers
fields = ['mimNumber', 'geneSymbols', 'phenotype']

# newline='' keeps the csv module from adding blank rows on some platforms
with open('refdata2.csv', 'w', newline='') as ref_data2:
    csvwriter = csv.writer(ref_data2)
    csvwriter.writerow(fields)
    for member in root.findall('.//phenotypeMap'):
        # findtext returns None (written as an empty cell) when a field is missing
        csvwriter.writerow([member.findtext('.//' + field) for field in fields])

data2 = pd.read_csv("refdata2.csv")
data2
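
If the table is long, it may be easier to filter it than to scan it by eye. The sketch below is one possible way to do that with the standard library. It assumes the `geneSymbols` column holds a comma-separated list of symbols; the gene names, numbers, and the helper `rows_for_gene` are hypothetical placeholders, not real OMIM data or part of any package used here.

```python
import csv
import io

# Toy table in the same shape as refdata2.csv; all values are made-up placeholders
toy_csv = io.StringIO(
    "mimNumber,geneSymbols,phenotype\n"
    "000001,\"GENEA, ALIASA\",Placeholder phenotype 1\n"
    "000002,\"GENEB, GENEB2\",Placeholder phenotype 2\n"
)

def rows_for_gene(handle, symbol):
    """Return rows whose comma-separated geneSymbols field lists `symbol` exactly."""
    matches = []
    for row in csv.DictReader(handle):
        symbols = [s.strip() for s in row['geneSymbols'].split(',')]
        if symbol in symbols:
            matches.append(row)
    return matches

hits = rows_for_gene(toy_csv, 'GENEB')
for hit in hits:
    print(hit['mimNumber'], hit['geneSymbols'])
```

Splitting the field and comparing whole symbols avoids false matches on partial names, such as a search for `GENE` matching `GENEB`.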

Write down the KRAS mim number below

Answer here

## Using OMIM to search for gene information

### First, read the description for the gene entry

In [None]:
%%R
# Search OMIM again, this time with the KRAS MIM number, to obtain the description
set_key('####')


# Using mim number to get the description, set 'text' argument to true
omim_result <- get_omim(####, text = ####)

saveXML(omim_result, file="FILE NAME HERE.xml")

### Display the results in the form of a table



In [None]:
#@title Load the results by providing the file name in this form (include file extension .xml)

# MAKING RESULTS LOOK GOOD
import xml.etree.ElementTree as ET
import csv
import pandas as pd

file_name = "" #@param {type:"string"}

# Parse the XML file saved in the previous cell
tree = ET.parse(file_name)
root = tree.getroot()

# Path to the first text section of an entry; its last part is the column header
path = './/textSectionList//textSectionContent'

# newline='' keeps the csv module from adding blank rows on some platforms
with open('refdata3.csv', 'w', newline='') as ref_data3:
    csvwriter = csv.writer(ref_data3)
    csvwriter.writerow(['textSectionContent'])
    for member in root.findall('.//entry'):
        # findtext returns None (written as an empty cell) when the section is missing
        csvwriter.writerow([member.findtext(path)])

data3 = pd.read_csv("refdata3.csv")
pd.set_option('display.max_colwidth', 1000)

data3

### Now read information about allelic variants 
An allele is one of two or more versions of a gene that differ in DNA sequence.

Allelic variation describes the presence or number of different alleles at a particular locus (locus, plural loci, means "place") on a chromosome.

REF:  
* https://warwick.ac.uk/fac/sci/lifesci/research/vegin/geneticimprovement/diversitycollection/allelicvariation/
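
To make the idea of an allele concrete, the sketch below compares two versions of a single codon and labels the resulting amino acid change. It uses codon 12 of KRAS, where the change from GGT to TGT replaces glycine with cysteine (Gly12Cys, also written G12C). The helper names `differing_positions` and `describe_variant` are hypothetical, and the codon table is deliberately reduced to the two codons needed here.

```python
# Minimal codon table: only the two codons needed for this illustration
CODON_TABLE = {"GGT": "Gly", "TGT": "Cys"}

def differing_positions(ref, alt):
    """Return the 1-based positions at which two equal-length sequences differ."""
    return [i for i, (a, b) in enumerate(zip(ref, alt), start=1) if a != b]

def describe_variant(ref_codon, alt_codon, codon_number):
    """Return a protein-level label such as 'Gly12Cys' for a single-codon change."""
    return f"{CODON_TABLE[ref_codon]}{codon_number}{CODON_TABLE[alt_codon]}"

reference_allele = "GGT"  # codon 12 in the reference sequence
variant_allele = "TGT"    # codon 12 in the variant allele

positions = differing_positions(reference_allele, variant_allele)
label = describe_variant(reference_allele, variant_allele, 12)

print("Differing base positions within the codon:", positions)
print("Protein-level change:", label)
```

A single base difference at one locus is enough to define a distinct allele, which is why the variant table that follows lists each mutation separately.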


In [None]:
%%R
# Search OMIM again with the KRAS MIM number, this time to obtain the allelic variants
set_key('####')

# Set allelicVariantList to TRUE
omim_result <- get_omim(###, #### = ###)

saveXML(omim_result, file='FILE NAME HERE.xml')

### Display the results in the form of a table

In [None]:
#@title Load the results by providing the file name in this form (include file extension .xml)

# MAKING RESULTS LOOK GOOD
import xml.etree.ElementTree as ET
import csv
import pandas as pd

file_name = "" #@param {type:"string"}

# Parse the XML file saved in the previous cell
tree = ET.parse(file_name)
root = tree.getroot()

# Fields to pull out of each <allelicVariant> element; they double as column headers
fields = ['mutations', 'text']

# newline='' keeps the csv module from adding blank rows on some platforms
with open('refdata4.csv', 'w', newline='') as ref_data4:
    csvwriter = csv.writer(ref_data4)
    csvwriter.writerow(fields)
    for member in root.findall('.//allelicVariant'):
        # findtext returns None (written as an empty cell) when a field is missing
        csvwriter.writerow([member.findtext('.//' + field) for field in fields])

data4 = pd.read_csv("refdata4.csv")
pd.set_option('display.max_colwidth', 10000)

data4



## Answer the following questions:
Input your answer in the cell below each question and press SHIFT+ENTER.

1. In the results above, read the entry for the Gly12Cys mutation. Describe the differences and similarities for the K-Ras, H-Ras, and N-Ras genes and proteins.

Answer here

2. How common was the G12C mutation in the Ahrendt et al. (2001) study?

Answer here