In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)

import util

# Lecture 1 – Introduction, Data Science Lifecycle

## DSC 80, Spring 2022

<center><h3>Welcome to DSC 80! 🎉</h3></center>

### Agenda

- Who are we?
- What does a data scientist do?
- What is this course about, and how will it run?
- The data science lifecycle.

### About the instructor

#### Suraj Rampure (call me Suraj, pronounced “soo-rudge”)

- Originally from Windsor, ON, Canada 🇨🇦.
- BS (’20) and MS (’21) in EECS from UC Berkeley 🐻.
    - Designed and taught several data science courses as a student there.
- Third quarter teaching at UCSD 🌴, and first time teaching DSC 80.
    - Also teaching DSC 90 ([History of Data Science](https://historyofdsc.com)) this quarter.
    - Previously taught DSC 10 (x2) and DSC 40A.
- Outside the classroom 👨‍🏫: watching basketball, traveling, eating, watching TikTok, FaceTiming my dog 🐶, etc.

<center><img src='imgs/break-pic.png' width=700></center>

<center>Me with my mom, my dog, and my friend over spring break.</center>

### Course staff

In addition to the instructor, we have several other course staff members who are here to support you in discussion, office hours, and Campuswire.

- 1 graduate TA: Murali Dandu.
- 11 undergraduate tutors: Nicole Brye, Aven Huang, Shubham Kaushal, Karthikeya Manchala, Yash Potdar, Costin Smiliovici, Anjana Sriram, Ruojia Tao, Du Xiang, Sheng Yang, and Winston Yu.

Learn more about them at [dsc80.com/staff](https://dsc80.com/staff).

## What is data science? 🤔

### What is data science?

<br>

<center><img src='imgs/what-is-ds.png' width=800></center>

<center>Everyone seems to have their own definition of what data science is.</center>

### The DSC 10 definition

In DSC 10, we told you that data science is about **drawing useful conclusions from data using computation**.

- **Exploration.**
    - Identifying patterns in information.
    - Uses visualizations.
- **Prediction.**
    - Making informed guesses.
    - Uses machine learning and optimization.
- **Inference.**
    - Quantifying whether those predictions are reliable.
    - Uses randomization.

Let's look at some other definitions.

### What is data science?

<center><img src="imgs/image_0.png"></center>

In 2010, Drew Conway published his famous [Data Science Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram).

### What is data science?

There isn't agreement on which "Venn Diagram" is correct!

<center><img src="imgs/image_1.png" width=500></center>

- **Why not?** The field is new and rapidly developing.
- Make sure you're solid on the fundamentals, then find a niche that you enjoy.
- Read Kolassa, [Battle of the Data Science Venn Diagrams](http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html).

### What does a _data scientist_ do?

In 2016, O'Reilly administered a [Data Science Salary Survey](https://www.oreilly.com/radar/2016-data-science-salary-survey-results/). Part of the survey asked self-identified data scientists what tasks they do on a regular basis.

<center><img src='imgs/oreilly.png' width=600></center>

What do you notice?

### What does a _data scientist_ do?

My take: in DSC 80, and in the DSC major more broadly, we are equipping you to **ask and answer questions using data**.

Let's look at some examples of data science in practice.

### Analyzing Wordle trends

<center><img src='imgs/wordle-moving-average.png'></center>
    
Moving average of the average number of guesses taken for each Wordle word, based on patterns shared on Twitter. ([source](https://observablehq.com/@rlesser/wordle-twitter-exploration))

### Analyzing Wordle trends

<center><img src='imgs/wordle-tweets.png'></center>
    
Number of Wordle patterns shared per day on Twitter. ([source](https://observablehq.com/@rlesser/wordle-twitter-exploration))

### Forecasting COVID cases

<center><img src='imgs/ucsd-forecast.png' width=80%></center>

Results of the `UCSD_NEU-DeepGLEAM` COVID cases forecasting model for the upcoming week ([source](https://viz.covid19forecasthub.org)).

### Forecasting COVID cases

> Evaluation of case forecasts showed that more reported cases than expected fell outside the forecast prediction intervals for extended periods of time. Given this low reliability, COVID-19 case forecasts will no longer be posted by the Centers for Disease Control and Prevention. - [CDC.gov](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasts-cases.html)

### Depixelizer

A ["Face Depixelizer"](https://github.com/tg-bomze/Face-Depixelizer) released in 2020 takes pixelated images and generates images that are perceptually realistic and downscale correctly.

<center>
<img src='imgs/depixel.png' width=300>
</center>

What happened here? Why do you think this happened?

<center>
<img src='imgs/depixel2.png' width=600>
</center>

### Data science involves _people_ 🧍

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

- COVID case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.
- Criminal sentencing.

### Warning! ⚠️

- Good data analysis is not:
    - A simple application of a statistics formula.
    - A simple application of statistical software.

- There are many tools out there for data science, but they are merely tools. **They don’t do any of the important thinking – that's where you come in!**

> _“The purpose of computing is insight, not numbers.”_ - R. Hamming. Numerical Methods for Scientists and Engineers (1962).

## Course content

### Course goals

In this course, you will...

* Practice translating potentially vague questions into quantitative questions about measurable observations.
* Learn to reason about 'black-box' processes (e.g. complicated models).
* Understand computational and statistical implications of working with data.
* Learn to use real data tools (e.g. love the documentation!).
* Get a taste of the "life of a data scientist".

### Course outcomes

After this course, you will...

* Be prepared for internships and data science "take home" interviews!
* Be ready to create your own portfolio of personal projects.
* Have the background and maturity to succeed in upper-division courses.

### Topics

This course was designed by a former data scientist at Amazon (Aaron Fraenkel). As such, you'll be learning skills that you **need** to know as a data scientist.

- Week 1: DataFrames in `pandas`
- Week 2: Messy Data and Hypothesis Testing
- Week 3: Combining Data
- Week 4: Permutation Testing and Missing Values
- Week 5: Imputation, **Midterm Exam**
- Week 6: Web Scraping and Regex
- Week 7: Feature Engineering
- Week 8: Modeling in `scikit-learn`
- Week 9: Model Evaluation
- Week 10: Review, **Final Exam**

## Course logistics

### Course website

The course website is your one-stop-shop for all things related to the course.

<br>

<center><h3><a href="https://dsc80.com">dsc80.com</a></h3></center>

<br>

Make sure to **read the [syllabus](https://dsc80.com/syllabus)**!

### Getting set up

- **Campuswire**: Q&A forum. Must be active here, since this is where all announcements will be made. You should have been added already; if not, [join here](https://campuswire.com/p/G325FA25B) (code 2756).
- **Gradescope**: Where you will submit all assignments for autograding, and where all of your grades will live. You should have been added already; contact us if not.

In addition, you must also fill out our [Welcome + Alternate Exams Form](https://docs.google.com/forms/d/e/1FAIpQLSdBKLcPs4Xi0plaIw0MVZ0DyGcvnSZyHxKVC7S7LwEiCchepQ/viewform).

### Accessing course content on GitHub

You will access all course content by pulling the course GitHub repository:

<br>

<center><b><a href=https://github.com/dsc-courses/dsc80-2022-sp>github.com/dsc-courses/dsc80-2022-sp</a></b></center>

<br>

We will post HTML versions of lecture notebooks on the course website, but otherwise you must **pull** from this repository to access all course materials (including blank copies of assignments).

### Environment setup

- You have two choices:
    - Set up your own Python environment (**strongly recommended**).
    - Use DataHub.
- Either way, follow the instructions on the [Tech Support](https://dsc80.com/tech_support/) page of the course website.
- Once you set up your environment, you will pull the course repo every time a new assignment comes out.
- **Note:** You will submit your work to Gradescope directly, without using Git.
- Will post a demo video with Lab 1.

### Course meetings

- **Lectures** are MWF 3-3:50PM or 4-4:50PM. Come in-person at Center Hall 109 or via Zoom. Anyone can attend either section.
- **Discussions** are W 6-6:50PM or 7-7:50PM. Come in-person at Pepper Canyon Hall 122 or via Zoom. Anyone can attend either section.

### Assignments

In this course, you will learn by doing!

- **Labs (25%):** 9 total. Due weekly on Mondays, starting next week.
- **Projects (30%)**: 5 total. Usually due on Thursdays, and usually have a "checkpoint."
- **Discussions (2% EC)**: 8 total. Extra credit.

In DSC 80, assignments will usually consist of both a Jupyter Notebook and a `.py` file. You will write your code in the `.py` file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.

### Exams

- **Midterm Exam (15%):** Wednesday, April 27th during your assigned lecture slot (3-3:50PM or 4-4:50PM). **In-person in Center 109**.
- **Final Exam (25%):** Saturday, June 4th from 11:30AM-2:30PM. **In-person, location TBD.**
- Fill out the [Welcome + Alternate Exams Form](https://docs.google.com/forms/d/e/1FAIpQLSdBKLcPs4Xi0plaIw0MVZ0DyGcvnSZyHxKVC7S7LwEiCchepQ/viewform) to tell us if you have a conflict.

### Resources

- Your main resource will be lecture notebooks.
- Most lectures also have supplemental readings that come from our course notes, [notes.dsc80.com](https://notes.dsc80.com).
- Other resources:
    - Wes McKinney. "Python for Data Analysis".
    - [DSC 10 Course Notes](https://notes.dsc10.com) – great refresher on `babypandas`.
    - [Principles and Techniques of Data Science](https://www.textbook.ds100.org/).
    - [Computational and Inferential Thinking](https://www.inferentialthinking.com).
    - [pandastutor.com](https://pandastutor.com).
    - As the quarter progresses, we'll add more resources to the [Resources tab](https://dsc80.com/resources) of the course website.

### Support 🫂

It is no secret that this course requires **a lot** of work - becoming fluent with working with data is hard!

- You will learn how to solve problems **independently** – documentation and the internet will be your friends.
- Learning how to effectively check your work and debug is extremely useful.

Once you've tried to solve problems on your own, we're glad to help.

- **Office hours** are offered both remotely and in-person. See the [Calendar 📆](https://dsc80.com/calendar/) for details.
- **Campuswire** is your friend too. Make your conceptual questions public, and make your debugging questions private.

## The data science lifecycle 🚴

### The scientific method

You learned about the scientific method in elementary school. 

<center><img src="imgs/image_3.png" width=500></center>

However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?

### The data science lifecycle

<center><img src="imgs/DSLC.png" width="40%"></center>

**All steps lead to more questions!**

## Example: San Diego employee salaries

<center><img src="imgs/DSLC.png" width="40%"></center>

### Research Domain and Questions

We have our domain – City of San Diego employee salaries. What are some questions we might want to ask? 

- Which jobs have the highest and lowest salaries?
- Who works part-time? Who works full-time?
- Are salaries "fair"?
- What is the predicted 2025 salary for the mayor of San Diego?
- Can we build a "profile" of the average San Diego city employee?

### Context

Why is this dataset relevant?
- Journalists might search for salary anomalies.
- Auditors may want actionable advice on fair employment practices.

### Find and Clean Data

<center><img src="imgs/DSLC.png" width="40%"></center>

### Initial look at the data

- [Transparent California](https://transparentcalifornia.com/salaries/san-diego/) publishes the salaries of all City of San Diego employees.
- The latest available data is from 2020.

In [None]:
salary_path = util.safe_download('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')

In [None]:
salaries = pd.read_csv(salary_path)
util.anonymize_names(salaries)
salaries

### Aside on privacy and ethics

- Employee names correspond to **real** people.
- Be careful when dealing with PII (personally identifiable information).
    - Only work with the data that is needed for your analysis.
    - Even when data is public, people have a reasonable right to privacy.

### Data cleaning

- As we saw in the O'Reilly survey results earlier, data cleaning is a **huge** component of real-world data science.
    - You didn't get much exposure to it in DSC 10, but you will in DSC 80.
- Let's look at **summary statistics** for each of our numeric columns.
    - Do you notice anything strange?
    - What are the implications on data reliability?

In [None]:
# .T is for transpose()
salaries.describe().T

- Someone had an `'Overtime Pay'` of -\$293!
- The `'Other Pay'` column contained numbers, so why doesn't it appear here?
- How many people have salaries of \$0?
- Why is there a `'Notes'` column that is missing for everybody?
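
Each of these observations can be turned into a quick programmatic check. The sketch below runs on a small made-up table rather than the real `salaries` data; the values, the `'Not Provided'` entry, and the idea that `'Other Pay'` is stored as text are all assumptions for illustration, not findings about the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `salaries`; every value here is made up.
toy = pd.DataFrame({
    'Base Pay': [50000.0, 0.0, 72000.0, 0.0],
    'Overtime Pay': [1200.0, -293.0, 0.0, 0.0],
    'Other Pay': ['500', 'Not Provided', '0', '250'],  # numbers stored as text
    'Total Pay': [51700.0, 0.0, 72000.0, 250.0],
    'Notes': [np.nan, np.nan, np.nan, np.nan],
})

# Negative values in a column that should never be negative.
n_negative_overtime = (toy['Overtime Pay'] < 0).sum()

# describe() summarizes only numeric columns by default, so a column that
# was read in as text is left out of the summary.
summarized = toy.describe().columns.tolist()

# Converting to numbers exposes the entries that are not valid numbers.
other_pay_numeric = pd.to_numeric(toy['Other Pay'], errors='coerce')

# Rows whose total pay is exactly zero.
n_zero_pay = (toy['Total Pay'] == 0).sum()

# Columns that are missing in every row.
all_missing = toy.columns[toy.isna().all()].tolist()
```

On the real data, the same expressions applied to `salaries` would show whether these explanations actually hold.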

### Empirical distribution of salaries

Let's plot the distribution of salaries.

In [None]:
salaries['Total Pay'].plot(kind='hist', density=True, bins=50, ec='w', 
                           title='City of San Diego Employee Salaries');

### Discussion Question

Which of the following best describes the distribution of San Diego employee salaries?

- A. Right-skewed, unimodal
- B. Right-skewed, bimodal
- C. Left-skewed, unimodal
- D. Left-skewed, bimodal

### 🙋 To answer, go to [yellkey.com/job](https://www.yellkey.com/job).

### Empirical distribution of salaries

Let's draw the distribution of salaries separately for part-time and full-time employees.

In [None]:
bystatus = salaries.groupby('Status')
bystatus['Total Pay'].plot(kind='kde', title='City of San Diego Employee Salaries, Part-Time vs. Full-Time')
plt.legend(bystatus.groups);

### Question: Does gender influence pay?

- Do employees of different genders have similar pay?
- The salary dataset we downloaded does not contain employee gender, so we can't answer this question using just the data we have.

In [None]:
salaries.head()

- We **do**, however, have the first name of each employee.

### Social Security Administration baby names 👶

- The US Social Security Administration (SSA) keeps track of the **first name**, **birth year**, and **assigned gender at birth** for all babies born in the US.
- We can somehow combine the SSA's dataset with the `salaries` dataset to infer the gender of San Diego employees.

In [None]:
names_path = util.safe_download('https://www.ssa.gov/oact/babynames/names.zip')

In [None]:
import pathlib

dfs = []
for path in pathlib.Path('data/names/').glob('*.txt'):
    year = int(path.stem[3:])  # file names look like 'yob1964.txt'
    if year >= 1964:
        df = pd.read_csv(path, names=['firstname', 'gender', 'count']).assign(year=year)
        dfs.append(df)
        
names = pd.concat(dfs)
names

> We began compiling the baby name list in 1997, with names dating back to 1880. At the time of a child’s birth, parents supply the name to us when applying for a child’s Social Security card, thus making Social Security America’s source for the most popular baby names. Please share this with your friends and family—and help us spread the word on social media. - [Social Security’s Top Baby Names for 2020](https://blog.ssa.gov/social-securitys-top-baby-names-for-2020/)

### Exploring `names`

- The only values of `'gender'` in `names` are `'M'` and `'F'`.
- Many names have non-zero counts for both `'M'` and `'F'`.
- Most names occur only a few times per year, but a few names occur very often.
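
The second claim can be checked directly by counting how many distinct genders each name appears with. This is a minimal sketch on made-up rows laid out like `names`; the specific names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same layout as `names`; counts are made up.
toy_names = pd.DataFrame({
    'firstname': ['Billy', 'Billy', 'Maria', 'Jordan', 'Jordan', 'Maria'],
    'gender':    ['M',     'F',     'F',     'M',      'F',      'F'],
    'count':     [900,     40,      1500,    700,      300,      1600],
    'year':      [1990,    1990,    1990,    1991,     1991,     1991],
})

# Number of distinct genders recorded for each name.
genders_per_name = toy_names.groupby('firstname')['gender'].nunique()

# Names that appear with both 'M' and 'F'.
both = genders_per_name[genders_per_name == 2].index.tolist()
```

Running the same two lines on `names` would give the real list of names recorded for both genders.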

In [None]:
names.head()

In [None]:
# Get the count of each unique value in the 'gender' column
names['gender'].value_counts()

In [None]:
# Look at a single name
names[names['firstname'] == 'Billy']

In [None]:
# Look at various summary statistics
names.describe()

### Data Modeling

<center><img src="imgs/DSLC.png" width="40%"></center>

### Determining the most common gender for each name

- Recall, our goal is to infer the gender of each San Diego city employee. To do this, we need a mapping of first names to genders.

- **A (very imperfect) model:** If someone has a name that is predominantly used by gender $g$, we'll infer their gender to be $g$.

- **Approach:** Create a DataFrame indexed by `'firstname'` that describes the total number of `'F'` and `'M'` babies in `names` for each unique `'firstname'`.
    - If there are more female babies born with a given name than male babies, we will "classify" the name as female.
    - Otherwise, we will classify the name as male.
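
To see what this rule does in edge cases, here is a small worked example on made-up counts. It uses `pivot_table` as a compact stand-in for the aggregation step, which is a substitution made for this sketch; the toy names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical name counts; 'Sam' is deliberately tied and 'Tom' has no 'F' rows.
toy_names = pd.DataFrame({
    'firstname': ['Alex', 'Alex', 'Maria', 'Sam', 'Sam', 'Tom'],
    'gender':    ['M',    'F',    'F',     'M',   'F',   'M'],
    'count':     [60,     40,     100,     50,    50,    80],
})

# One row per name, one column per gender, missing combinations filled with 0.
toy_counts = toy_names.pivot_table(
    index='firstname', columns='gender', values='count',
    aggfunc='sum', fill_value=0,
)

# Classify as 'F' only when the 'F' total is strictly larger.
toy_genders = toy_counts.assign(
    gender=np.where(toy_counts['F'] > toy_counts['M'], 'F', 'M')
)
```

Because the comparison is strict, a name with equal counts (like `'Sam'` here) is classified as `'M'`. That is one of the ways in which this model is imperfect.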

### Determining the most common gender for each name

In [None]:
counts_by_gender = (
    names
    .groupby(['firstname', 'gender'])
    .sum()
    .reset_index()
    .pivot(index='firstname', columns='gender', values='count')
    .fillna(0)
)
counts_by_gender

In [None]:
counts_by_gender['F'] > counts_by_gender['M']

In [None]:
genders = counts_by_gender.assign(gender=np.where(counts_by_gender['F'] > counts_by_gender['M'], 'F', 'M'))
genders

### Adding a `'gender'` column to `salaries`

This involves two steps:
1. Extracting just the first name from `'Employee Name'`.
2. **Merging** `salaries` and `genders`.
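
The two steps can be sketched on a toy table to show what a left merge does with first names that have no match. The employee names and pay values below are made up, and taking the first whitespace-separated token as the first name is an assumption about how names are formatted.

```python
import pandas as pd

# Hypothetical employees; 'Zyx' does not appear in the toy gender table.
toy_salaries = pd.DataFrame({
    'Employee Name': ['Maria Lopez', 'Tom Nguyen', 'Zyx Q Smith'],
    'Total Pay': [82000, 64000, 71000],
})

# Hypothetical name-to-gender table, indexed by 'firstname'.
toy_genders = pd.DataFrame(
    {'gender': ['F', 'M']},
    index=pd.Index(['Maria', 'Tom'], name='firstname'),
)

# Step 1: take the first token of the full name as the first name.
toy_salaries['firstname'] = toy_salaries['Employee Name'].str.split().str[0]

# Step 2: a left merge keeps every employee; indicator=True adds a '_merge'
# column recording whether each row found a match.
merged = toy_salaries.merge(
    toy_genders[['gender']], on='firstname', how='left', indicator=True
)
n_unmatched = (merged['_merge'] == 'left_only').sum()
```

Unmatched employees end up with a missing `'gender'`, so they silently drop out of any later `groupby('gender')`. Counting them first shows how much of the data the analysis actually covers.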

In [None]:
# Add firstname column
salaries['firstname'] = salaries['Employee Name'].str.split().str[0]
salaries

In [None]:
# Merge salaries and genders
salaries_with_gender = salaries.merge(genders[['gender']], on='firstname', how='left')
salaries_with_gender

### Predictions and Inference

<center><img src="imgs/DSLC.png" width="40%"></center>

### Question: Does gender influence pay?

This was our original question. Let's find out!

In [None]:
pd.concat([
    salaries_with_gender.groupby('gender')['Total Pay'].describe().T,
    salaries_with_gender['Total Pay'].describe().rename('All')
], axis=1)

- Unfortunately, there's a fairly large difference between the mean salaries of male employees and female employees.
- A similar difference also exists for the median.
- Can this difference be explained by random chance?

### A hypothesis test

- **Null Hypothesis:** Gender is independent of salary, and any observed differences are due to random chance.
- **Alternate Hypothesis:** Gender is not independent of salary. Female employees earn less than male employees.

In [None]:
n_female = np.count_nonzero(salaries_with_gender['gender'] == 'F')
n_female

**Strategy:** 
- Randomly select 4075 employees from `salaries_with_gender` and compute their median salary.
- Repeat this many times.
- See where the observed median salary of female employees lies in this empirical distribution.

### Running the hypothesis test

In [None]:
# Observed statistic
female_median = salaries_with_gender.loc[salaries_with_gender['gender'] == 'F', 'Total Pay'].median()

# Simulate 1000 samples of size n_female from the population
medians = np.array([])
for _ in np.arange(1000):
    median = salaries_with_gender.sample(n_female)['Total Pay'].median()
    medians = np.append(medians, median)

medians[:10]

In [None]:
title='Median salary of randomly chosen groups from population'
pd.Series(medians).plot(kind='hist', density=True, ec='w', title=title);
plt.axvline(x=female_median, color='red')
plt.legend(['Observed Median Salary of Female Employees', 'Median Salaries of Random Groups']);

- Our hypothesis test has an empirical p-value of 0, so we reject the null.
    - Under the assumption that gender is independent of salary, the chance of seeing a median salary this low is essentially 0.
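
The p-value here is the proportion of simulated medians that are at or below the observed one. The sketch below repeats the whole procedure on synthetic data; the pay values, group size, and the 5,000 shift are made up and are not the San Diego numbers.

```python
import numpy as np

rng = np.random.default_rng(80)

# Hypothetical data: 10,000 pay values, where the first 4,000 belong to the
# group of interest and are shifted down by 5,000.
pay = rng.normal(loc=70_000, scale=20_000, size=10_000)
in_group = np.arange(pay.size) < 4_000
pay[in_group] -= 5_000

# Observed statistic: the median pay within the group.
observed = np.median(pay[in_group])
n_group = in_group.sum()

# Medians of random groups of the same size, drawn without replacement
# from the whole population.
n_reps = 1_000
medians = np.array([
    np.median(rng.choice(pay, size=n_group, replace=False))
    for _ in range(n_reps)
])

# Empirical p-value: share of simulated medians at or below the observed one.
p_value = np.mean(medians <= observed)
```

With 1,000 repetitions, an empirical p-value of 0 only says that none of the simulated medians was as low as the observed one. The true p-value is small but not exactly 0.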

### Next time

- While performing this analysis, we made several assumptions. What were they, and how did they affect our results?
- After wrapping up this example, we'll dive deep into `pandas`!
- **Lab 1 will be released tomorrow!**