# Python Fundamentals for Data Science: Final Exam
### Summer 2016


## Instructions
The final exam is designed to evaluate your grasp of Python theory as well as Python coding.

- This is an individual exam.
- You have 24 hours to complete the exam, starting from the point at which you first access it.
- You will be graded on the quality of your answers.  Use clear, persuasive arguments based on concepts we covered in class.

## General Questions (21 pts)

1) The following method is part of a larger program used by a mobile phone company.  It will work when an object of type MobileDevice or of type ServiceContract is passed in.  This is a demonstration of (select all that apply):

    1. Inheritance
    2. Polymorphism
    3. Duck typing
    4. Top-down design
    5. Functional programming

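```python
def add_to_cart(item):
    # cart and total are assumed to be module-level variables in the larger program
    global total  # total is rebound below, so it must be declared global
    cart.append(item)
    total += item.price
```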
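As a minimal sketch of how the method above might be exercised, consider the snippet below. The exam does not show `MobileDevice` or `ServiceContract`, so the two classes here are hypothetical stand-ins, and the module-level `cart` and `total` are assumptions made only for illustration. Whether the real classes share a base class is not stated.

```python
# Hypothetical stand-ins; the real classes are not shown in the exam.
class MobileDevice:
    def __init__(self, model, price):
        self.model = model
        self.price = price


class ServiceContract:
    def __init__(self, months, price):
        self.months = months
        self.price = price


# Assumed module-level state that the larger program would maintain.
cart = []
total = 0


def add_to_cart(item):
    global total  # total is rebound below, so it must be declared global
    cart.append(item)
    total += item.price


add_to_cart(MobileDevice("example phone", 300))
add_to_cart(ServiceContract(24, 480))
print(len(cart), total)  # 2 780
```

Note that `add_to_cart` never checks the type of `item`; it only relies on the object having a `price` attribute.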

2) Suppose you have a long list of digits (0-9) that you want to write to a file.  Would it be more efficient to use ASCII or UTF-8 as an encoding?  How could you create an even smaller binary file to store the information?
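A quick experiment like the following can help you check your reasoning on this question. It is only a sketch: the sample string `digits` and the helper `pack_digits` are made up for illustration, and packing two digits into each byte is just one possible binary layout, not the required answer.

```python
digits = "3141592653"  # hypothetical sample data

ascii_bytes = digits.encode("ascii")
utf8_bytes = digits.encode("utf-8")
print(len(ascii_bytes), len(utf8_bytes))  # 10 10


def pack_digits(text):
    """Pack two decimal digits into each byte (4 bits per digit)."""
    values = [int(ch) for ch in text]
    if len(values) % 2:
        values.append(0xF)  # pad odd-length input with a marker nibble
    return bytes((hi << 4) | lo for hi, lo in zip(values[::2], values[1::2]))


packed = pack_digits(digits)
print(len(packed))  # 5
```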

3) You are part of a team working on a spreadsheet program that is written in Python 3.  The program includes several classes to represent different types of objects that fit into a cell of a spreadsheet.  Give a strong argument for why your team should write an abstract base class to represent such objects, and give examples of what should go into such an abstract base class.
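For reference, the mechanics of an abstract base class in Python 3 can be sketched with the standard `abc` module. The class and method names below (`CellContent`, `display_text`, `value`, `NumberCell`, `TextCell`) are hypothetical and chosen only to illustrate the syntax; they are not a model answer for what the spreadsheet team should include.

```python
from abc import ABC, abstractmethod


class CellContent(ABC):
    """Hypothetical base class for anything stored in a spreadsheet cell."""

    @abstractmethod
    def display_text(self):
        """Return the string shown in the cell."""

    @abstractmethod
    def value(self):
        """Return the underlying value used in calculations."""


class NumberCell(CellContent):
    def __init__(self, number):
        self.number = number

    def display_text(self):
        return f"{self.number:.2f}"

    def value(self):
        return self.number


class TextCell(CellContent):
    def __init__(self, text):
        self.text = text

    def display_text(self):
        return self.text

    def value(self):
        return self.text


cells = [NumberCell(3.5), TextCell("total")]
print([c.display_text() for c in cells])  # ['3.50', 'total']

# The abstract class itself cannot be instantiated.
try:
    CellContent()
except TypeError as err:
    print("cannot instantiate:", err)
```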

4) Explain why NumPy is better than lists for "vectorized" math operations. Give an example of an operation that is either impossible or painful to implement using traditional Python lists compared to NumPy arrays.

5) We want a list of the squares of the nonnegative integers less than 10, keeping only those squares that are greater than 10.  The list comprehension below gives an empty list.  Correct it so that we get the desired output, [16, 25, 36, 49, 64, 81].

```python
[x**2 for x in range(10) if x > 10]
```

```
[]
```

6) Explain why the following code prints what it does.

```python
def f(): pass
print(type(f))
```

```
<class 'function'>
```


7) Explain why the following code prints something different.

```python
def f(): pass
print(type(f()))
```

```
<class 'NoneType'>
```


## Data Integrity (25 pts)

1) Why is it important to sanity-check your data before you begin your analysis? What could happen if you don't?

2) Explain, in your own words, why real-world data is often messy.

3) How do you determine which variables in your dataset you should check for issues prior to starting an analysis? 

4) How do you know when you have adequately checked these variables?

5) Is it possible to fully vet your data for errors before you begin your analysis? If not, what should you be looking out for while you complete your analysis?

## Elections (24 pts)

Consider the following data frame in Pandas.

```python
import pandas

# creating a data frame from scratch - list of lists

data = [['marco', 165, 'blue', 'FL'],
        ['jeb', 0, 'red', 'FL'],
        ['chris', 0, 'white', 'NJ'],
        ['donald', 1543, 'white', 'NY'],
        ['ted', 559, 'blue', 'TX'],
        ['john', 161, 'red', 'OH']
       ]

# create a data frame with column names - list of lists

col_names = ['name', 'delegates', 'color', 'state']
df = pandas.DataFrame(data, columns=col_names)
df
```

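|   | name | delegates | color | state |
|---|------|-----------|-------|-------|
| 0 | marco | 165 | blue | FL |
| 1 | jeb | 0 | red | FL |
| 2 | chris | 0 | white | NJ |
| 3 | donald | 1543 | white | NY |
| 4 | ted | 559 | blue | TX |
| 5 | john | 161 | red | OH |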


1) Using bracket indexing in Pandas, show how many delegates `ted` got.

2) Using bracket indexing in Pandas, show how many total delegates were obtained by candidates whose favorite color is blue.

3) Using groupby and aggregate in Pandas, show how many total delegates were obtained by candidates grouped by favorite color.

## Clinical Disease Data (30 pts)

Your boss comes to you Monday morning and says, “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar**, which is a repository that contains expert-curated data, and it is free for the taking. This is a gold mine! Take a week and **<u>tell me what gene and mutation combinations are classified as dangerous.</u>**”

1) Look at the data set and develop a plan of action to use Python to extract and summarize just what your boss wants. **Don’t code**. You can use pseudocode and/or an essay format to generate a plan in 500 words or fewer.

2) Tell us the output that you expect from your planned code.

**Hints:**  

* Look at the file carefully. What fields do you want to extract? Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* Filter out junk. Focus only on what your boss asked for: (1) the gene name and (2) the mutation reference, (3) filtered to include only mutations that are dangerous as you define it.

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* While you shouldn't code your analysis, writing a few lines of code while you think through the problem may be helpful (so that you can sanity-check that your plan works). So that you can experiment, we have included the data file below as a tab-separated values file, "Genomics_Questions.txt". Please do not submit any such code. For example, if I wanted to check that I accurately understand the "split" function in the context of this data, I could type:

```python
sample = "abc;def;asd"
test = sample.split(';')
```

**This is a planning question: we want you to lay out a plan in text, not code.**

### VCF file description (Summarized from version 4.1)

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header
line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

* Fixed fields:

There are 4 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position, as a DNA nucleotide (base) count along the chromosome
3. ID - The unique identifier for each mutation
4. INFO - a semicolon-separated series of keys with values in the format: <key>=<data>, and specified as <key>=<data name>[data value definition].

```
### INFO field specifications

```
GENEINFO = <Gene symbol>
CLNSIG =  <Variant Clinical Significance (Severity)>[0 - Uncertain significance, 2 - Benign, 5 - Pathogenic, 255 - other]

```
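To make the `<key>=<data>` layout concrete, here is a small throwaway sketch in the spirit of the `split` experiment shown in the hints. The helper name `parse_info` is hypothetical, and the choices to strip stray whitespace and to skip empty or dot-only pieces are assumptions about how you might handle messy input, not part of the VCF specification. As with any experimental code for this question, it is for checking your understanding only and should not be submitted.

```python
def parse_info(info):
    """Split an INFO string such as 'GENEINFO=ISG15;CLNSIG=5' into a dict."""
    fields = {}
    for pair in info.split(";"):
        pair = pair.strip()
        if not pair or pair == ".":
            continue  # skip empty pieces and the missing-value dot
        key, sep, data = pair.partition("=")
        # A key with no '=' is kept with a value of None.
        fields[key.strip()] = data.strip() if sep else None
    return fields


print(parse_info("GENEINFO=ISG15;CLNSIG=5"))
# {'GENEINFO': 'ISG15', 'CLNSIG': '5'}
print(parse_info("GENEINFO=AGG").get("CLNSIG"))
# None
```

Note that the values come back as strings, so a severity code such as `'5'` would still need converting before any numeric comparison.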

### Representative ClinVar data (vcf file format)

```
##fileformat=VCFv4.0
##fileDate=20160705
##source=ClinVar and dbSNP
##dbSNP_BUILD_ID=147
#CHROM	POS	ID	INFO
1	949523	rs786201005   GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345   GENEINFO=ISG15;CLNSIG=5
1	949739	rs672601312	  GENEINFO=ISG15; CLNSIG=0
1	955597	rs115173026	  GENEINFO=AGRN;CLNSIG=2
1	955619	rs201073369	  GENEINFO=AGG
1	957640	.	  GENEINFO=AGG;CLNSIG=5
1	976059	rs544749044	  GENEINFO=AGG;CLNSIG=255
```