In [2]:
# We have to do something a bit weird as a first step and we will learn the reasons for this as we move through 
# the course material. We have to mount a drive so that our Jupyter notebooks can access image files and data
# files. 


# from google.colab import drive
# drive.mount('/content/drive')

# this will ask you for permission to mount whatever is in your Google drive

# because we will want to display images that are embedded in the Jupyter notebook, we will also bring in the following:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os 

# What is Data Science? 


- "Thinking clearly with help from data"

- "The scientific process of extracting value from data"
<br></br>

Data science involves solving problems using the following tools: 

        Statistics + Computer Science + sprinkling of Linear Algebra + Domain Expertise

Good data scientists understand their tools, but also how those tools affect society, culture, and politics.
<br></br>

## Data Science Pipeline
Typically, Data Scientists use the following pipeline: 

        *DATA --> DATA WRANGLING* --> *MODELING/ANALYSIS --> INTERPRETATION/VISUALIZATION/COMMUNICATION
*HYPOTHESIS can be formulated in multiple places in this pipeline (at the beginning of the pipeline, at the end of wrangling, or in the modeling/analysis step).
<br></br>

This program is designed to introduce you to this pipeline. Along the way we will refine the following question:

**How can we ask a good question that is answerable via Data Science Methods?**

# Course Overview

## Objectives
- Develop [Habits of Mind](https://www.habitsofmindinstitute.org/what-are-habits-of-mind/)
  - Data Scientists are __curious people.__ 
- Formulate a plan for and complete a data science project from start (question) to finish (communication)
  - Data Scientists are __problem solvers__ who __iterate__ their solutions
- Explain and carry out descriptive, exploratory, inferential, and predictive analyses in Python
  - Data Scientists __quantify uncertainty__ 
- Communicate results concisely and effectively in presentations
  - Data Scientists pull out an __authentic narrative__ from their data that is accessible for their audience
- Identify and explain how to approach an unfamiliar data science task
  - Data Scientists are __fearless__ when confronting uncertainty
 
## Outline
Below is a general outline of things we will cover in this program. See the Syllabus for additional details.
1. Learn Python via Rosalind examples
2. Use Biostatistics and understand statistical principles that guide inquiry (how do we know what we know and how well do we know it?)
3. Interrogate model assumptions and interpret data ethically 
4. Career development: how to track down a problem, how other people have ended up in Data Science, how to display your data fairly
5. A small capstone project and presentation to demonstrate how you apply data science thinking and communicate your findings

# Final Project

## Objectives:
- Identify the problems and goals of a real situation and dataset (hopefully one that is interesting to you).
- Clean and curate the data, as necessary to address your question.
- Choose an appropriate approach for formalizing and testing the problems and goals, and articulate the reasoning for that selection.
- Implement your analysis choices on the dataset.
- Interpret the results of the analyses.
- Contextualize those results within a greater scientific and social context, acknowledging and addressing any potential issues related to privacy, bias, and ethics.
- Work effectively as a member of a team.

## Components:
1. **One-paragraph Project Proposal (Due 1st Friday end of day):**
     1. identify problem/formulate question
     2. secure data set
     3. distribute your efforts in a schedule map
        
We recommend you consider [writing a group contract](https://sheridancollege.libguides.com/groupwork/managing-group-projects/writing-a-group-contract) to establish expectations about how your group will work together to accomplish your project.

2. **Oral presentation (10 minutes, given 2nd Friday afternoon):**
    1. Central Question - why is it interesting?
    2. Data and data grooming decisions - what data and how did you curate it?
    3. Analysis (why, what, how) - how robust is your model?
    4. Visualization, interpretation of the results - is there more than one interpretation? If so, how would you resolve it?
    5. Lingering questions, for example:
       - Social context: were there ethical issues that you accounted for, or could not account for? How did you account for them, and what impact might the remaining ones have?
       - Next steps: if you had time and resources, how would you expect to expand this project? 

## Potential Data Sources:
These are suggestions - you do not have to use any of these if you have other repositories or interests!

* [Gapminder - Hans Rosling](https://www.gapminder.org/data/)
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets/blob/master/README.rst)
* [Data.gov](https://catalog.data.gov/dataset)
* [Data Is Plural](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)
* [Datasets | Deep Learning](http://deeplearning.net/datasets/)
* [Stanford | Social Science Data Collection](https://data.stanford.edu/)
* [US Census](https://www.census.gov/)
* [Open Climate Data](http://openclimatedata.net/)
* [Data and Story Library](https://dasl.datadescription.com/datafiles/)
* [Kaggle](https://www.kaggle.com/)
* [FiveThirtyEight](https://data.fivethirtyeight.com/)
* [data.world](https://data.world/)
* [Free Datasets - R and Data Mining](http://www.rdatamining.com/resources/data)
* [CIA World Factbook](https://www.cia.gov/the-world-factbook/)

# How to think like a Data Scientist

1. Nature of a data scientist
    * data-driven
    * care about answers
    * analyze data to discover something about how the world works
    * know that each analysis is just a different viewpoint, trying to make sense of a complex whole that can’t easily be perceived
    * care about whether the results make sense, because they care about what it means
    * are comfortable with the idea that data have errors
    * are comfortable with the idea that there’s more than one way to analyze the same data
    * know nothing is ever completely true or false in science
    * know that you can still learn something and make decisions in spite of these uncertainties
    * care about communicating these subtleties as well as the results themselves
<br></br>

2. Nature of a **GREAT** data scientist
    * Conscientious, works using proven and understood methods, triple checks things
    * Yet is open to new methods and creative at finding solutions (just checks them thoroughly!)
    * Methodical and transparent about process
    * Yet after working down in the details, takes a step back and questions the big picture

In [None]:
# This cell only needs to be evaluated if you are using Colab.
img_path = "/content/drive/MyDrive/DS_Pipeline.png"
img = mpimg.imread(img_path)
plt.imshow(img)

# typically in a local Jupyter notebook (for example, one installed with Anaconda), you could instead put this in a markdown cell
#![DS Pipeline](/content/drive/MyDrive/DS_Pipeline.png)
# but that isn't true in Colab. 
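
The cell above assumes the image sits in a mounted Google Drive. As a minimal sketch, the standard-library `os` module imported at the top of the notebook can check for the file first, trying the Colab path and then a copy next to the notebook. The helper name `first_existing_path` is hypothetical and not part of the course code.

```python
import os


def first_existing_path(candidates):
    """Return the first path in candidates that exists on disk, else None."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


# Colab location from the cell above, then a copy next to the notebook.
img_path = first_existing_path([
    "/content/drive/MyDrive/DS_Pipeline.png",
    "DS_Pipeline.png",
])

if img_path is None:
    print("DS_Pipeline.png not found; is Google Drive mounted?")
else:
    # Imported here so the path check itself has no plotting dependency.
    import matplotlib.image as mpimg
    import matplotlib.pyplot as plt

    plt.imshow(mpimg.imread(img_path))
    plt.axis("off")  # axis ticks are not meaningful for a diagram
    plt.show()
```

If neither file is present, the sketch prints a message instead of raising an error, which is usually easier to diagnose on a first run.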

# How do you ask a 'good' question?

![DS Pipeline](DS_Pipeline.png)

Image from [KDnuggets](https://www.kdnuggets.com/2016/03/data-science-process-rediscovered.html)
<br></br>
1. Data Science questions should be…
    - answerable with data that it is actually possible to collect
    - unambiguous in meaning
    - specific about exactly which data, metrics, and analyses are required to answer the question
    - big enough to be interesting, small enough to be accomplishable
<br></br>

2. Specifying what you’re going to measure is important. 
<br></br>

## Examples:
Examples of poor questions that leave wiggle room for useless answers:
* What can my data tell me about my business?
* What should I do with *some piece of information*?
* How can I increase my profits?

Examples of good questions where the answer is impossible to avoid:
* How many units of a particular car model will sell in region H (wherever that is defined to be) during the third quarter?
* How many students will apply for admission to a particular university ("X") in 2030?
* How many students should a particular university ("X") admit in 2030 for a target class size of 50,000?
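
To see why the last question leaves so little wiggle room, note that it reduces to arithmetic once one more quantity is specified. The sketch below assumes a hypothetical yield rate (the fraction of admitted students who actually enroll). That number is not given in the examples above and would itself have to be estimated from past data. The function name `admits_needed` is also made up for illustration.

```python
import math


def admits_needed(target_class_size, yield_rate):
    """Number of admission offers needed to expect the target class size."""
    if not 0 < yield_rate <= 1:
        raise ValueError("yield_rate must be in (0, 1]")
    # Round up: it is not possible to admit a fraction of a student.
    return math.ceil(target_class_size / yield_rate)


# The target of 50,000 comes from the question; the 25% yield is hypothetical.
print(admits_needed(50_000, 0.25))  # 200000
```

The vague question "How can I increase my profits?" has no comparable formula, which is one sign that it is not yet specific enough.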

# Exercises: Nailing down the right question

### Example 1: Cancer

**Too-vague question: What causes cancer?**

__Improving: Can gene expression predict cancer subtype?__

* Given RNA-seq gene expression profiles (obtained through [GEO](https://www.ncbi.nlm.nih.gov/geo/) or [TCGA](https://www.cancer.gov/ccg/access-data)), can we classify tumor samples into cancer subtypes?
* Is there a relationship between RNA-seq expression data and cancer prognosis?

### Example 2: Microbial Diversity

**Too-vague question: How do microbial communities vary across environments?**

__Improving: How does gut microbiome diversity differ between individuals with and without a condition?__

* Are there obvious differences in community make-up between the two groups?
* Data: Human Microbiome Project or Earth Microbiome Project (depending on focus)
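
Making this question answerable means deciding how "diversity" will be measured. One common choice (an assumption here, not something the example prescribes) is the Shannon index, which is higher when more taxa are present and their abundances are more even. A minimal sketch with made-up counts:

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Made-up taxon counts for two hypothetical samples (not real data).
even_sample = [25, 25, 25, 25]   # four taxa, equally abundant
skewed_sample = [97, 1, 1, 1]    # one taxon dominates

print(shannon_index(even_sample) > shannon_index(skewed_sample))  # True
```

With a metric like this fixed in advance, "does diversity differ?" becomes a comparison of numbers between the two groups rather than a matter of impression.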

### Group work

We will split the room into two and improve the following two questions: 

1. Too-vague question: Who does well at university?
2. Too-vague question: Why is it so expensive to live in NYC?
<br></br>

Then, we will come back together to examine the following: 

3. What factors influence lifespan?
4. How do seasonal changes affect disease outbreaks?
5. Does methylation change with aging?

# Congratulations, you have figured out a question. What's next? 
* What are the constraints?
* What are the resources available?
* ARE THE NECESSARY DATA available?
* What are the potential risks and rewards? (Includes ethical!)
* Can we define a metric to determine the success of the project?
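
One way to make the last point concrete, as a minimal sketch: for a classification question such as the cancer-subtype example, a project might count as successful only if the model's accuracy beats a naive baseline that always predicts the most common class. The labels below are made up for illustration, and the choice of accuracy as the metric is an assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up subtype labels and model predictions (not real data).
y_true = ["A", "A", "A", "B", "B", "A", "B", "A"]
y_pred = ["A", "A", "B", "B", "B", "A", "A", "A"]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)
print(f"model={model_acc:.3f} baseline={baseline_acc:.3f}")
print("success" if model_acc > baseline_acc else "no better than guessing")
```

Writing the success criterion down before the analysis makes it harder to move the goalposts afterwards.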

# ALWAYS LURKING: The Alignment Problem
The alignment problem arises when there is a difference between what we *think* we are programming, solving, or asking and what we are *actually* programming, solving, or asking.
  
"When the systems we attempt to teach will not, in the end, do what we want or what we expect, ethical and potentially existential risks emerge" - Brian Christian