```python
%config InlineBackend.figure_format='retina'
```

# DSC 80
The Practice and Application of Data Science

**Spring 2021**

## Lecture Outline
* Introduction
* What's a "practicing data scientist"?
* Course overview
* The "Data Science Workflow"

# What is a data scientist?

<div class="image-txt-container">
    
* [The Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram), Drew Conway, 2010
* Statistics: drawing conclusions.
* Hacking: extracting information.
* Substantive Expertise: understanding context.

<img src="imgs/image_0.png">

</div>

# What is a data scientist?

<div class="image-txt-container">

* [Battle of the Data Science Venn Diagrams](http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html)
* Stephan Kolassa, 2014
* Data Science is expanding and maturing.
* Find a niche you enjoy!
* Understanding a little of everything is powerful.

<img src="imgs/image_1.png">

</div>

## What is a data scientist?

Who knows? What do data scientists do?

* Extract usable information from data. (~CS)
* Use that information to answer questions. (~Stats)
* Use that information to solve problems. (~Science)

Let's look at a few examples...

## Predicting Elections

<div class="image-txt-container">
    <div></div>
        
* What does the electorate look like?
* What are the important traits?
* How do we gather/measure those traits?
* Quantify these traits in a model and draw conclusions with confidence!
        
<img src="imgs/538.png" width="50%">
</div>

## Internet Advertisements
Can internet ads increase a dealership's truck sales?

<div class="image-txt-container">
    <div></div>
        
* How likely is a person to click?
* Who is the ad shown to?
* How do ad clicks translate to sales? 
* Are they *causing* higher sales?
* What data should I collect?

<img src="imgs/internet_ads.gif" width="500">
</div>




## Image recognition for celebrity GIFs

[Gfycat](https://www.wired.com/story/how-coders-are-fighting-bias-in-facial-recognition-software/) resolves celebrity faces to generate memes.

<div class="image-txt-container">
<img src="imgs/Chris.png" width="50%">
<img src="imgs/Twice.png" width="50%">
</div>




## Data Science requires responsible practice

<div class="image-txt-container">
    
- It is easy to obscure complex decisions made with data:
    * 2009 market crash.
    * Criminal sentencing.
    * Hiring and Admissions.
    * Hyper-personalized ad recommendations.

<img src="imgs/image_2.png" width="50%">

</div>

## Data Science requires responsible practice

<div class="image-txt-container">
    
* Reinforcing historical trends and biases:
    - Hiring based on previous hiring data.
    - Criminal sentencing using previous decisions.
    - Social media, news, and politics.

* Data is generated by people; treat people responsibly!


<img src="imgs/image_2.png" width="50%">

</div>

# Course Information

## Course Goals

* Practice translating potentially vague questions into quantitative questions about measurable observations.
* Learn to reason about 'black-box' processes (e.g. complicated models).
* Understand computational and statistical implications of working with data.
* Learn to use real data tools (and learn to love the documentation!).
* Give a taste of the "life of a data scientist".

## Course Outcomes

* Prepare you for internships and data science "take home" interviews!
* Enable you to create your own portfolio of personal projects.
* Prepare you for upper-division ML and Stats courses (material and maturity).

## Course Materials and Information

* The course [github repository](https://github.com/eldridgejm/dsc80-sp21): assignments, lectures, references.
    - https://github.com/eldridgejm/dsc80-sp21
* The course [website](http://dsc80.com/): syllabus, links, schedule.
    - http://dsc80.com

## Course Components: Description

|Component|Description|
|---|---|
|Labs |Practice with basic material|
|Discussions|(Extra) Practice with basic material|
|Projects|In-depth work on real problems|
|First 5 weeks|Working with and reasoning about data|
|Last 5 weeks|Creating and evaluating models|


## Course Components: Grade Breakdown

|Component|Percentage|
|---|---|
|Labs (9, weekly)|25%|
|Projects (5, bi-weekly)|30%|
|Checkpoints|5%|
|Midterm (mid-quarter)|15%|
|Final (cumulative)|25%|
|Discussions (extra credit)|5% ec|

## Course Materials

* This is (still) a new course (anywhere!); be patient, please. 😁
* We kinda-sorta have a book! (finally!)
    - https://afraenkel.github.io/practical-data-science
    - the lecture slides serve as sources of information and practice.
    
* Just like a working Data Scientist:
    - using new, evolving tools requires doing your own research!
    - using real data requires doing your own research!
    - *You* will be responsible for assessing the correctness of your research!

## Secondary references

* Wes McKinney. "Python for Data Analysis" ([Link - requires UCSD internet](proquest.safaribooksonline.com/9781449323592))
* Sam Lau, Joey Gonzalez, and Deb Nolan. "Principles and Techniques of Data Science" (https://www.textbook.ds100.org/)
* Ani Adhikari and John DeNero. "Computational and Inferential Thinking" (https://www.inferentialthinking.com)
* Online tutorials are great, but be sure to understand the *concepts* in the lecture!

## A few last remarks...

* This course requires *a lot* of work; gaining fluency working with data is hard!
* You will have to learn things on your own (so learn to love the documentation).
* Learning to effectively check your work (and debug) pays dividends:
    - Does my answer *look* right for the context?
    - Does my code do what I think? (small data testing)
    - Does my code generalize properly? (unseen data testing)
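The last two checks can be sketched in code. This is a minimal illustration, not a prescribed testing workflow for the course: the function `mean_by_group` and the toy records are hypothetical, invented only to show the idea of testing on data small enough to verify by hand and then on inputs the code was not written against.

```python
from collections import defaultdict


def mean_by_group(records):
    """Return {group: mean of values} for a list of (group, value) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Small data testing: an input small enough to check by hand.
small = [("a", 1), ("a", 3), ("b", 10)]
small_result = mean_by_group(small)
assert small_result == {"a": 2.0, "b": 10.0}

# Unseen data testing: inputs the code was not written against,
# such as an empty dataset or a group with a single negative value.
empty_result = mean_by_group([])
assert empty_result == {}
edge_result = mean_by_group([("c", -5)])
assert edge_result == {"c": -5.0}
```

If any of the assertions fails, the code does not do what you think it does, and it is far easier to find out on three rows than on three million.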

# The Data Science Lifecycle

## The scientific method hides complexity

<div class="image-txt-container">
    
* From what context did your hypothesis come?
* What data are you using/measuring?
    - What if the data isn't sufficient?
* Under which conditions are the conclusions valid?
* The language of modeling helps answer these questions.

<img src="imgs/image_3.png">

</div>

## Introduction to models

* A **data generating process** is the real-world phenomenon under consideration.
* The **true (probability) model** is a mathematical representation of the random phenomenon that generates the observations.
* The **observations** are the data produced by the data generating process.
* A **(fit) statistical model** of the data is the best approximation of the data generating process under the probability model.
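To make these four terms concrete, here is a small simulation. It is only a sketch under stated assumptions: the data generating process is imagined to be a biased coin, the true model is assumed to be a Bernoulli distribution with `TRUE_P = 0.6` (a value chosen for illustration; in real work it is unknown), and the function names `generate_observations` and `fit_model` are hypothetical.

```python
import random

# Assumption for illustration: the "true" probability model is a biased coin
# with P(heads) = 0.6. In practice this value is unknown.
TRUE_P = 0.6


def generate_observations(n, p=TRUE_P, seed=0):
    """Simulate n coin flips (1 = heads, 0 = tails) from the true model."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


def fit_model(observations):
    """Fit a Bernoulli model: estimate P(heads) by the sample proportion."""
    return sum(observations) / len(observations)


# The observations are all we get to see of the data generating process.
observations = generate_observations(1000)

# The fit model approximates the true model using only the observations.
p_hat = fit_model(observations)
print(f"true p = {TRUE_P}, fitted p = {p_hat:.3f}")
```

The fitted value will generally be close to, but not equal to, the true value. Rerunning with a different `seed` or a smaller `n` shows how much the fit model can vary from one set of observations to the next.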

### The Data Science Lifecycle: Everything leads to more questions!
<img src="imgs/DSLC.png" width="40%">

<img src="./imgs/eeg.png" width="60%">

### Research Domain

<div class="image-txt-container">
    <div></div>

* What subject do we care about?
* How is relevant data generated?

<img src="imgs/DSLC.png">
</div>





### Question or Hypothesis

<div class="image-txt-container">
    <div></div>

* What do we want to know?
* What problem are we trying to solve?
* What are our metrics for success?
* Hypotheses can be refined at any stage of the work!

<img src="imgs/DSLC.png">
</div>

### Find and Clean Data

<div class="image-txt-container">
    <div></div>

* What data exists, and can it answer the question?
* Do we need to collect/measure our own data?
* Cleaning the data: does it represent the domain well?

<img src="imgs/DSLC.png">
</div>


### Data Modeling

<div class="image-txt-container">
    <div></div>

* What assumptions about the data are made in order to draw conclusions?
* What biases or anomalies exist in the data?
* How is the data simplified for use in prediction and inference?

<img src="imgs/DSLC.png">
</div>



### Predictions and Inference

<div class="image-txt-container">
    <div></div>

* What does the data say about the world?
* Does it answer our questions? Solve our problem?
* Can we trust our conclusions and predictions?

<img src="imgs/DSLC.png">
</div>
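
One common way to approach the question "can we trust our conclusions?" is to quantify how much an estimate would change if the data had come out differently. The sketch below uses a percentile bootstrap to build a confidence interval for a mean. It is an illustration, not necessarily the method used later in this course: the `sample` values are made up, and `bootstrap_ci` is a hypothetical helper.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Take the middle `level` fraction of the resampled estimates.
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


# Hypothetical measurements, made up for illustration.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.1, 3.9, 2.5]
low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

A wide interval is a signal that the data alone does not pin down the answer, which usually leads back to an earlier stage of the lifecycle: collect more data, or refine the question.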


