# Python Fundamentals for Data Science: Final Exam
### Summer 2016


## Instructions
The final exam is designed to evaluate your grasp of Python theory as well as Python coding.

- This is an individual exam.
- You have 24 hours to complete the exam, starting from the point at which you first access it.
- You will be graded on the quality of your answers.  Use clear, persuasive arguments based on concepts we covered in class.

## General Questions (21 pts)

1) The following method is part of a larger program used by a mobile phone company.  It will work when an object of type MobileDevice or of type ServiceContract is passed in.  This is a demonstration of (select all that apply):

    1. Inheritance
    2. Polymorphism
    3. Duck typing
    4. Top-down design
    5. Functional programming

In [1]:
def add_to_cart(item):
    cart.append(item)
    total += item.price

**[Cynthia Hu]:**
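The example above demonstrates:
2. Polymorphism
3. Duck typing

The function relies only on `item` having a `price` attribute, so any object that provides one works, regardless of its class. It is not functional programming, because the function changes state outside itself (`cart` and `total`).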

2) Suppose you have a long list of digits (0-9) that you want to write to a file.  Would it be more efficient to use ASCII or UTF-8 as an encoding?  How could you create an even smaller binary file to store the information?

**[Cynthia Hu]:**
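For a list of digits, ASCII and UTF-8 are equally efficient: UTF-8 encodes every ASCII character, including the digits 0-9, as the same single byte, so both files are the same size.
To create a smaller file, store the digits in binary instead of as text. Each digit needs only 4 bits, so two digits can be packed into one byte, which halves the file size. The packed bytes are then written with write() to a file opened in binary write mode ('wb').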
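
As a rough illustration of the size comparison, the sketch below encodes a short list of digits both ways and then packs two digits per byte. The helper names `pack_digits` and `unpack_digits`, the padding marker `0xF`, and the file name `digits.bin` are assumptions made for this example, not part of the exam.

```python
import os
import tempfile

digits = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

# As text, ASCII and UTF-8 produce exactly the same bytes for digits.
text = "".join(str(d) for d in digits)
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")


def pack_digits(digits):
    """Pack two decimal digits into each byte (4 bits per digit)."""
    padded = list(digits)
    if len(padded) % 2:
        padded.append(0xF)  # 0xF marks padding; it is not a valid digit
    return bytes((hi << 4) | lo for hi, lo in zip(padded[::2], padded[1::2]))


def unpack_digits(data):
    """Reverse pack_digits, dropping the padding marker if present."""
    out = []
    for byte in data:
        out.extend((byte >> 4, byte & 0x0F))
    if out and out[-1] == 0xF:
        out.pop()
    return out


packed = pack_digits(digits)

# Write in binary write mode ('wb') and read back in binary read mode ('rb').
path = os.path.join(tempfile.mkdtemp(), "digits.bin")
with open(path, "wb") as fh:
    fh.write(packed)
with open(path, "rb") as fh:
    restored = unpack_digits(fh.read())

print(len(ascii_bytes), len(utf8_bytes), len(packed))
```

Packing is only one option; general-purpose compression would be another way to shrink the file.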

3) You are part of a team working on a spreadsheet program that is written in Python 3.  The program includes several classes to represent different types of objects that fit into a cell of a spreadsheet.  Give a strong argument for why your team should write an abstract base class to represent such objects, and give examples of what should go into such an abstract base class.

**[Cynthia Hu]:**
Because a spreadsheet cell can hold different types of objects, the team should first write an abstract base class and then a subclass for each type of object with its additional attributes. The base class defines one common interface that every cell type must implement, so new subclasses are quick to create and remain flexible. A change that affects all cell objects can often be made once, in the base class. Also, because Python does not check object types, the same function can work on any of these objects as long as they share the interface defined by the base class.

For example, the base class may include the attributes `type` and `value` and a `write()` method.
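
A minimal sketch of such a base class follows, using the standard `abc` module. The class names `Cell`, `NumberCell`, and `TextCell` and the `display()` method are hypothetical names chosen for illustration; the answer above only names `value` and `write()`.

```python
from abc import ABC, abstractmethod


class Cell(ABC):
    """Hypothetical base class for anything that can live in a spreadsheet cell."""

    def __init__(self, value):
        self.value = value

    @abstractmethod
    def display(self):
        """Return the string shown in the cell; each subclass must define this."""

    def write(self):
        # Shared behavior is written once here and inherited by every subclass.
        return f"{type(self).__name__}:{self.display()}"


class NumberCell(Cell):
    def display(self):
        return f"{self.value:.2f}"


class TextCell(Cell):
    def display(self):
        return str(self.value)


cells = [NumberCell(3.14159), TextCell("total")]
rendered = [cell.write() for cell in cells]

# The abstract base class itself cannot be instantiated.
try:
    Cell(1)
except TypeError:
    abstract_blocked = True
else:
    abstract_blocked = False

print(rendered, abstract_blocked)
```

A subclass that forgets to implement `display()` would fail at instantiation in the same way, which catches the mistake early.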

4) Explain why NumPy is better than lists for "vectorized" math operations. Give an example of an operation that is either impossible or painful to implement using traditional Python lists compared to NumPy arrays.

**[Cynthia Hu]:**
NumPy works well with multi-dimensional arrays and hides the details of common array and vector math operations from end users. Its operations apply to whole arrays at once in compiled code, so no explicit Python loop is needed.

For example, it's easy to multiply each item in a NumPy array by 2, whereas a list requires a loop.

a = np.arange(10)

a * 2

Another example is creating two 3-by-2 arrays in NumPy and adding them element by element to produce a third 3-by-2 array.
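
Both examples can be sketched side by side with their list equivalents. The variable names and the sample values below are made up for illustration.

```python
import numpy as np

# Doubling every element: a list needs a loop or comprehension,
# because `values * 2` repeats the list instead of doubling its items.
values = list(range(10))
doubled_list = [v * 2 for v in values]

a = np.arange(10)
doubled_array = a * 2  # applied to every element at once

# Element-wise addition of two 3-by-2 arrays.
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([[10, 20], [30, 40], [50, 60]])
total = x + y

# The same addition with nested lists needs nested iteration.
total_list = [
    [p + q for p, q in zip(row_x, row_y)]
    for row_x, row_y in zip(x.tolist(), y.tolist())
]

print(doubled_array)
print(total)
```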

5) We want a list of the squares of the nonnegative integers less than 10, keeping only the squares that are greater than 10.  The list comprehension below gives an empty list.  Correct it so that we get the desired output, [16, 25, 36, 49, 64, 81].

In [3]:
[ x**2 for x in range(1,10) if x**2 > 10]

[16, 25, 36, 49, 64, 81]

6) Explain why the following code prints what it does.

In [4]:
def f(): pass
print(type(f))

<class 'function'>


**[Cynthia Hu]:**
`pass` is a statement that does nothing, so `f` is defined with an empty body and no error is raised. Even though the body is empty, `def` still creates a function object and binds it to the name `f`. The code passes `f` itself to `type()` without calling it, so the type is `function`.

7) Explain why the following code prints something different.

In [13]:
def f(): pass
print(type(f()))

<class 'NoneType'>


**[Cynthia Hu]:**
`f()` calls the function. Because the body is empty and there is no `return` statement, the call returns `None`, whose type is `NoneType`.

## Data Integrity (25 pts)

1) Why is it important to sanity-check your data before you begin your analysis? What could happen if you don't?

**[Cynthia Hu]:**
It's important to sanity-check the data before beginning the analysis. If something is wrong or missing in the dataset, you may only detect it later in the analysis and have to redo the work. Even worse, if you never detect the issues, your analysis and conclusions will be wrong.

Common sanity checks I would do include the following:
0. Get an overview of the dataset, such as data types and size.
1. Count the records and the distinct values of some variables. For example, if I expect one row per customer, the row count should equal the distinct customer count. Comparing the count and the distinct count also tells me whether the table has a unique key.
2. Check the range of some variables, such as the date range.
3. Check for missing values.
4. Check for outliers or abnormal distributions in some variables, such as 90% zeros for 'Amount'.
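
The checks listed above might look like the following in pandas. This is only a sketch: the table, its column names (`customer_id`, `signup_date`, `amount`), and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical customer table; names and values are made up.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "signup_date": pd.to_datetime(["2016-01-05", "2016-02-10", "2016-02-10", None]),
    "amount": [0.0, 0.0, 0.0, 125.5],
})

# 0. Overview: size and data types.
print(df.shape)
print(df.dtypes)

# 1. Row count versus distinct count of the expected key.
n_rows = len(df)
n_customers = df["customer_id"].nunique()
key_is_unique = n_rows == n_customers

# 2. Range of a variable, here the date range (missing dates are skipped).
date_range = (df["signup_date"].min(), df["signup_date"].max())

# 3. Missing values per column.
missing_per_column = df.isna().sum()

# 4. Suspicious distributions, such as the share of zero amounts.
share_zero_amount = (df["amount"] == 0).mean()

print(key_is_unique, date_range, share_zero_amount)
```

Here the duplicate `customer_id` shows up immediately, because the row count and the distinct count disagree.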

2) Explain, in your own words, why real-world data is often messy.

**[Cynthia Hu]:**
There are several reasons I can think of for messy data in the real world:
1. Upstream data may come from user input. Users may not understand the system, may not know which values to enter, or may not follow the standard process. Sometimes they enter arbitrary values just to get past a required field. These values may not reflect the truth and are invalid for further analysis.

2. Upstream data sources change often, and different systems and tables are not kept in sync. You will then notice many discrepancies between sources and spend a lot of time investigating them.

3. A data warehouse has multiple layers. Different rules are applied along the way, and there may be no master data dictionary or documentation. To answer a business question, you may find several relevant tables but struggle to decide which one is correct. Or you may not know how the data was processed and therefore misinterpret it.

3) How do you determine which variables in your dataset you should check for issues prior to starting an analysis? 

**[Cynthia Hu]:**
I would first get an overview of the dataset using head(), describe(), and dtypes. If some variables are irrelevant or duplicated, I would narrow the selection to fewer variables.

Next, I would check that the key is unique and not null.
Then, I would check the variables most relevant to my questions. For example, if I am analyzing revenue by customer and product type, I would check the distributions of revenue, product type, and customer.

4) How do you know when you have adequately checked these variables?

**[Cynthia Hu]:**
It's difficult to say for certain that I have adequately checked these variables. I may miss something and only detect it later, during exploration and analysis, and then I would look into the issue further.

During the sanity check, I may examine the same variables from different perspectives. If the results are consistent and reasonable, I would move on to the next step.

5) Is it possible to fully vet your data for errors before you begin your analysis? If not, what should you be looking out for while you complete your analysis?

**[Cynthia Hu]:**
As I explained above, it's impossible to fully vet the data for errors before beginning the analysis. This is an iterative process, not a one-time step. If the initial sanity check passes, I would move on while keeping potential issues in mind.

After exploring the dataset further, I may answer some questions I had before or find real issues with the dataset. Then I have to decide how to process the data, such as removing nulls or replacing them with zeros. Finally, once the analysis is complete, I should ask: do my conclusions make sense, do they meet my expectations, and do they contradict the data itself?

## Elections (24 pts)

Consider the following data frame in Pandas.

In [5]:
import pandas

# creating a data frame from scratch - list of lists

data = [ ['marco', 165, 'blue', 'FL'], 
         ['jeb', 0, 'red', 'FL'], 
         ['chris', 0, 'white', 'NJ'], 
         ['donald', 1543, 'white', 'NY'],
         ['ted', 559, 'blue', 'TX'],
         ['john', 161, 'red', 'OH']
       ]

# create a data frame with column names - list of lists

col_names = ['name', 'delegates', 'color', 'state']
df = pandas.DataFrame(data, columns=col_names)
df

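|   | name | delegates | color | state |
|---|------|-----------|-------|-------|
| 0 | marco | 165 | blue | FL |
| 1 | jeb | 0 | red | FL |
| 2 | chris | 0 | white | NJ |
| 3 | donald | 1543 | white | NY |
| 4 | ted | 559 | blue | TX |
| 5 | john | 161 | red | OH |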


1) Using bracket indexing in Pandas, show how many delegates `ted` got.

In [8]:
df[df['name'] == 'ted']['delegates']

4    559
Name: delegates, dtype: int64

2) Using bracket indexing in Pandas, show how many total delegates were obtained by candidates whose favorite color is blue.

In [10]:
temp = df[df['color'] == 'blue']
temp.delegates.sum()

724

3) Using groupby and aggregate in Pandas, show how many total delegates were obtained by candidates grouped by favorite color.

In [13]:
df.groupby('color').delegates.sum()

color
blue      724
red       161
white    1543
Name: delegates, dtype: int64
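
Since the question mentions aggregate explicitly, an equivalent form using `agg` is sketched below. It rebuilds the same data frame so that it runs on its own; the `.loc` lookup for `ted` is an alternative to the chained bracket indexing used earlier, not a required form.

```python
import pandas as pd

data = [
    ["marco", 165, "blue", "FL"],
    ["jeb", 0, "red", "FL"],
    ["chris", 0, "white", "NJ"],
    ["donald", 1543, "white", "NY"],
    ["ted", 559, "blue", "TX"],
    ["john", 161, "red", "OH"],
]
df = pd.DataFrame(data, columns=["name", "delegates", "color", "state"])

# Group by favorite color, then aggregate the delegates column with a sum.
by_color = df.groupby("color")["delegates"].agg("sum")

# Boolean mask plus column label in a single .loc call.
ted_delegates = df.loc[df["name"] == "ted", "delegates"].iloc[0]

print(by_color)
print(ted_delegates)
```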

## Clinical disease data (30 pts)

Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Take a week and **<u>tell me what gene and mutation combinations are classified as dangerous.”</u>**

1)  Look at the data set and develop a plan of action to use Python to extract and summarize just what your boss wants. **Don’t code**. You can use pseudocode and/or an essay format to generate a plan in 500 words or less. 

2) Tell us the output that you expect from your planned code.

**Hints:**  

* Look at the file carefully. What fields do you want to extract? Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* Filter out junk. Just focus on what your boss asked for (1) gene name (2) mutation reference. (3) Filter your data to include only mutations that are dangerous as you define it. 

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* While you shouldn't code your analysis, creating a few lines of code while you think through the problem may be helpful (so that you can sanity check that your plan works). So you can experiment, we have included the data file below as a Tab Separated Value file "Genomics_Questions.txt". Please do not submit any such code. For example, if I wanted to check that I accurately understand the "split" function in the context of this data, I could type:

```python
sample = "abc;def;asd"
test = sample.split(';')
```

**This is a planning question. We want you to lay out a plan in text, not code.** 

### VCF file description (Summarized from version 4.1)

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header
line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

* Fixed fields:

There are 4 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. INFO - a semicolon-separated series of  keys with values in the format: <key>=<data>, and specified as <key>=<data name>[data value definition].

```
### INFO field specifications

```
GENEINFO = <Gene symbol>
CLNSIG =  <Variant Clinical Significance (Severity)>[0 - Uncertain significance, 2 - Benign, 5 - Pathogenic, 255 - other]

```

### Representative ClinVar data (vcf file format)

```
##fileformat=VCFv4.0
##fileDate=20160705
##source=ClinVar and dbSNP
##dbSNP_BUILD_ID=147
#CHROM	POS	ID	INFO
1	949523	rs786201005   GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345   GENEINFO=ISG15;CLNSIG=5
1	949739	rs672601312	  GENEINFO=ISG15; CLNSIG=0
1	955597	rs115173026	  GENEINFO=AGRN;CLNSIG=2
1	955619	rs201073369	  GENEINFO=AGG
1	957640	.	  GENEINFO=AGG;CLNSIG=5
1	976059	rs544749044	  GENEINFO=AGG;CLNSIG=255
```

**[Cynthia Hu]:**

My goal is to answer the question: which gene and mutation combinations are classified as dangerous? My expected output is a processed dataset containing only the columns related to gene and mutation information and only the dangerous combinations. I will also summarize the data, for example by showing the top 10 dangerous combinations in a bar chart.

Below are my steps.

1. read the data
   Read the VCF file, tab delimited, using pandas.read_csv()
   skip the first four rows, as they are meta-information about the data.
   The column names are in the fifth row.
   
2. print several rows to examine

3. remove '#' from column name for #CHROM

4. understand the definition of each column, referring to the document above

5. select variables used for this task: CHROM, POS, ID and INFO. 

6. split the INFO column to get separate columns for GENEINFO, CLNSIG and CLNDBN

   1) use apply() + lambda function to split each row --> output may be a list of lists  
   2) each variable is in key=value format; use a dictionary/list to convert it to columns?
      loop through each item in the list (nested loops), extract the key and value respectively, using find("=") and len() functions.       
   3) not all records have all three variables; may use blank or null for missing values  
   4) after testing the steps, then try to combine them to create several columns directly
   
7. explore the dataset and do sanity-check

    1) check several records of the dataframe, head()
    
    2) how many variables and their data type: shape, dtypes,describe()
    
    3) which variables have missing values, and what's the percentage of missing records. can calculate from step 2)
    
    4) distribution of the variables, using pd.value_counts()
    
8. filter the dataset from the prior step to include only records with dangerous mutations
    CLNSIG == 5 (Pathogenic)
    Records with missing values for CLNSIG would be excluded from the analysis
    
9. group the dataset by GENEINFO and ID; use count function to summarize and sort the data descending

10. print the grouped data

11. transform the data for plotting. reset_index

12. show top 10 combinations in a bar chart
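
In the spirit of the hint about sanity-checking a plan with a few lines of code, the parsing and filtering steps (1, 6, and 8) could be tried out on a handful of rows as sketched below. This uses only the standard library instead of pandas. The `SAMPLE` text is modeled on the representative data and assumes every column is separated by a single tab; the helper names `parse_info` and `pathogenic_records` are hypothetical.

```python
import csv
import io

# A few rows modeled on the representative data (assumed tab-delimited).
SAMPLE = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tINFO\n"
    "1\t949523\trs786201005\tGENEINFO=ISG15;CLNSIG=5\n"
    "1\t949739\trs672601312\tGENEINFO=ISG15; CLNSIG=0\n"
    "1\t955619\trs201073369\tGENEINFO=AGG\n"
    "1\t957640\t.\tGENEINFO=AGG;CLNSIG=5\n"
)


def parse_info(info):
    """Turn 'K1=V1;K2=V2' into a dict, tolerating stray spaces and missing keys."""
    pairs = {}
    for item in info.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:  # skip fragments that are not key=value
            pairs[key] = value
    return pairs


def pathogenic_records(text):
    """Return (gene, mutation ID) pairs for rows with CLNSIG == 5."""
    # Drop the '##' meta-information lines; the '#CHROM' line is the header.
    lines = (ln for ln in io.StringIO(text) if not ln.startswith("##"))
    reader = csv.DictReader(lines, delimiter="\t")
    rows = []
    for rec in reader:
        info = parse_info(rec["INFO"])
        if info.get("CLNSIG") == "5":  # rows without CLNSIG are excluded
            mutation_id = None if rec["ID"] == "." else rec["ID"]  # '.' means missing
            rows.append((info.get("GENEINFO"), mutation_id))
    return rows


result = pathogenic_records(SAMPLE)
print(result)
```

Using `partition("=")` avoids the manual `find("=")` and `len()` arithmetic, and `dict.get` handles records where a key is absent.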
