# Discussion 1

### Due Saturday April 3, 11:59 PM

**Discussions will be due by the end of the day on Saturday**

* Lecture Review: models and the data science life-cycle.
* Overview: How to work on homework.
* Tutorial: `numpy` review and an example HW problem.

---

## Lecture Review

The terminology of modeling:
* A **data generating process (DGP)** is the real-world phenomenon under consideration.
* The **true (probability) model** is a mathematical representation of the random phenomenon that generates any representative observations.
* The **observations** are data representing the data generating process.
* A **(fit) statistical model** of the data is the best approximation of the data generating process under the probability model.
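
As a rough illustration of how these four terms relate, the sketch below simulates a toy setting. The normal distribution, its parameters, and all variable names are assumptions chosen for illustration; in a real problem the true model is unknown and can only be approximated.

```python
import numpy as np

# Assumed "true" probability model: values are Normal(mean=15, sd=4).
# In a real problem these parameters are unknown.
true_mean, true_sd = 15.0, 4.0

# Observations: a finite sample drawn from the (simulated) DGP.
rng = np.random.default_rng(seed=0)
observations = rng.normal(loc=true_mean, scale=true_sd, size=1_000)

# Fit statistical model: the normal distribution whose parameters are
# estimated from the observations.
fit_mean = observations.mean()
fit_sd = observations.std(ddof=1)

print(f"fit mean = {fit_mean:.2f}, fit sd = {fit_sd:.2f}")
```

The fit parameters should land near, but not exactly on, the assumed true values; the gap between them is what model assessment tries to quantify.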

The data science life-cycle:
* Researching domain
* Questions and hypotheses
* Finding and cleaning data
* Data modeling
* Predictions and Inference
* Decisions

Where does each term describing the modeling process fit into the data science life-cycle?

Researching the domain tells us what we care about and how relevant data is generated, which lets us formulate questions and hypotheses and identify and clean data sets. Questions and hypotheses can lead us to find and clean even more data, since this is the step where we narrow the scope of what we want to know, the problem we want to solve, and the metrics we will measure. Finding and cleaning the data tells us what data exists and whether we need to collect our own; it also helps us understand how well the data sets represent the domain we're interested in. Next comes data modeling, where we identify biases or anomalies in the data, simplify the data for use in predictions and inference, and use data-informed assumptions to draw conclusions. Lastly, we use these models to tell stories and answer our questions through prediction and inference.

**Example:** Suppose you want to predict the outcome of the next presidential election.
1. What is the DGP?
2. What observations might you collect? Are they representative of the DGP?
3. What measurements do you care about? (i.e. what do your observations look like?)
4. What is the probability model? What statistical model might you use?
5. How might you assess the quality of your fit model?

1. We're considering the outcome of the next presidential election, or in other words, which candidate is most likely going to win the most electoral votes. The population we're seeking to sample from would be the US electorate.
2. Observations we can gather include state-level polling on the general public's political affiliation, aggregated from public online polls. Census data might also be useful if we want to use demographic information as proxy measures, or if we want to conduct PCA to help predict political ideology. Census data covers nearly the whole population, but online polls can suffer from self-selection, so how well they represent the DGP should be checked.
3. Each observation would represent either a household or individual, depending on whether it's information gathered from a poll on the individual level or household information gathered from the census.
4. We might want to rely on traditional election forecasting models that use time-series data, demographic, or biographical data. Alternatively, we can use artificial neural networks and support vector regression models based on a select few datapoints from aggregating poll data.
5. We would test the model on poll data from previous elections and run many iterations comparing predicted results against actual results.

**Example**: Suppose we want to understand the pay disparity between men and women among city of SD employees.
1. What is the DGP?
2. Does the dataset in lecture adequately represent the DGP?
3. What is the probability model?
4. How do your answers to the questions above apply to other years and cities?

1. The DGP in this example would be the process that determines the pay of men and women employed by the City of San Diego. The population we're seeking to sample from would be the city's employees.
2. Yes, it contains thousands of observations of SD employees, including both men and women.
3. We would likely want to conduct a Student's t-test for a difference of means to determine whether there's a significant difference between the wages that men and women earn in SD.
4. If there is a similarly formatted dataset for other cities in other years, the same answers apply. However, if the datasets are too small or if they don't contain enough information regarding employees, the same answers do not hold.
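
To make the difference-of-means idea concrete, here is a minimal sketch of the pooled-variance (Student's) t statistic computed with `numpy`. The salaries are synthetic numbers generated for illustration, not the lecture dataset, and the function name `pooled_t_statistic` is hypothetical.

```python
import numpy as np

def pooled_t_statistic(x, y):
    """Two-sample t statistic assuming equal variances in both groups."""
    nx, ny = len(x), len(y)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    standard_error = np.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (x.mean() - y.mean()) / standard_error

# Synthetic salaries for two groups (assumed values, for illustration only).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=70_000, scale=15_000, size=500)
group_b = rng.normal(loc=65_000, scale=15_000, size=500)

t_stat = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.2f}")
```

A large absolute value of the statistic suggests the gap in means is unlikely to be due to sampling noise alone, though it says nothing about why the gap exists.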

---

## Overview: working on assignments

The class assignments are available in the class git repository; each consists of a notebook with the problem statements, starter code in a `.py` file, and any required supplementary files (e.g. data). After pulling the HW material, you will develop your solutions using a combination of Jupyter notebooks and your favorite IDE (e.g. Sublime Text, or the JupyterHub server). Once finished, you will submit your assignment to Gradescope.


### Obtaining course materials (assignments)

Git is a version control system used in the development of the course materials. For an introduction to using git in the course, see this [tutorial](https://drive.google.com/open?id=1m6mXfhjFInHPeJyaHdAwfiakcFYh73HC8TeAB9E9Xeo) and this hands-on [tutorial](https://docs.google.com/document/d/1E2Zg0pC8S3cyT564jug6rqAhSNraR_7Yy_4AnvHaGu4/edit?usp=sharing). To use git on a Mac, you will need to open the terminal; on Windows, you should download [git-bash](https://gitforwindows.org/).

The course materials are stored in a git repository on *github* (a git server) -- you can view it in a browser [here](https://github.com/ucsd-ets/dsc80-sp19). To obtain the course files, follow the directions in the tutorial above.

### The notebook / IDE balance

Now that your assignment is on your computer, you are ready to work. You will be using two different tools to develop the code and create the analyses that the assignments require of you. Generally, these are:
1. Jupyter notebooks contain the problem statements themselves; they also provide a place to test out code, understand data, and produce reports/summaries of conclusions.
2. An IDE for developing reusable and testable Python code. Abstracting your notebook code into Python library code avoids common mistakes in notoriously error-prone notebook environments. Luckily, once a function is in your `.py` file, you can still import/use it in a notebook!

Both of these environments are essential in the data scientist toolkit.

### Checking your work

An effective environment for testing and understanding your work is essential to success in the class. The notebook and the IDE play different roles in checking your work.

* The notebook provides a place to understand the output of your function and test it against your intuition and understanding of what the correct output should be. When working with data, you should always check the correctness of your work using your understanding of that data (i.e. is my conclusion reasonable given what I know about the data?). This is typically the ultimate goal for a problem, so you should *always* interpret your answer on the data in a notebook.

* Abstracting your code to library functions/classes in a `.py` file encourages using software development best-practices in your data processing and analyses. While expressive, notebooks are error-prone, manual, and hard to debug. Moving useful code to a `.py` file makes your code more clear, encourages code reuse, and makes debugging easier. Once you have moved any work from your notebook to a `.py` file, you should check the correctness of your work in two ways:
    - Run the doctests. The doctests ensure your code *meets the contract* specified in the question (or by you, in your own projects). That is, does your code accept the correct inputs and return the correct outputs? **Doctests check only that your code acts on the correct types, nothing more**.
    - Import your function into the notebook and test it on data as above. Use your understanding of the data to assess the correctness of your code!
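
To see why passing the doctests is not the same as being correct, consider the sketch below. The function `average` is a hypothetical example with a deliberate bug; its doctest checks only the return type, so the test passes even though the value is wrong. The doctest is run programmatically here with the standard-library `doctest` module so the example is self-contained.

```python
import doctest

def average(values):
    """
    Hypothetical example: the doctest checks only the return type.

    >>> isinstance(average([1.0, 2.0, 3.0]), float)
    True
    """
    # Deliberate bug: the sum is never divided by len(values).
    return float(sum(values))

# Collect and run the docstring examples for this one function.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(average, name="average", globs={"average": average}):
    runner.run(test)
results = runner.summarize(verbose=False)

print(results.failed, results.attempted)  # the doctest reports no failures
print(average([1.0, 2.0, 3.0]))           # 6.0, but the true mean is 2.0
```

Only checking the output against what you know about the data, as in the second bullet above, would catch this bug.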


### HW submission

Once you have finished the assignment, log into Gradescope and submit the `.py` file to the appropriate assignment. 
* Upon submission, the autograder will run the doctests and show whether the tests passed. These are worth *zero* points; the purpose is to check that the autograder environment is consistent with the environment on which you developed your HW.
* The results of the "correctness tests" that you will be ultimately graded on will not be visible until after the due date.
* The autograder will tell you if your code failed to run, though generally will not tell you why. The most common reasons are listed below.
    - **Timeout**: the autograder *will* tell you if your code did not finish running within 20 minutes. If this occurs, you should try to isolate which problem is causing the timeout and either fix it or comment it out!
    - **Syntax Errors**: Any syntax errors (e.g. bad code indentation) will cause Gradescope to fail (giving a 0 on the assignment). Always double check that your code passes doctests *on the command line* (just as the autograder runs it). Further, pulling your code (from GitHub) onto DataHub and running the tests there is a good debugging technique, as the environment is very similar to Gradescope.
    - **OOM (out of memory)**: The autograder runs on a server with 1 GB of memory, which is less than your computer has. Assignments should never require more memory than this; you should think about how to simplify your code!

If the problem persists, ask course staff why the autograder is failing.
    

### A remark on DataHub

UCSD Educational Technology Services has made servers available for use at [DataHub](https://datahub.ucsd.edu). Once logged in, you have not only a Jupyter notebook server running, but an entire Unix environment. To make best use of this environment, once logged in, replace the `/tree` in the URL with `/lab` and you can use the JupyterLab IDE/notebook environment. Here, you can use (1) Jupyter notebooks, (2) terminals, and (3) a simple text editor for editing Python files.

## Tutorial: `numpy` review as a HW problem

Work on this tutorial like an assignment. **Complete Questions 1 and 8, and turn them in to Gradescope by midnight on Saturday**.

In [1]:
# What is this? (discuss imports)
%load_ext autoreload
%autoreload 2

In [3]:
import disc01 as disc

In [5]:
# What is this?

%matplotlib inline
import matplotlib.pyplot as plt

In [6]:
import numpy as np
import os

For a review of working with NumPy arrays, see the [arrays chapter](https://www.inferentialthinking.com/chapters/05/1/Arrays.html) of Inferential Thinking (DSC10). The most relevant concepts are:
* element-wise array operations, that avoid loops ('vectorization')
* the functions and methods for performing array arithmetic (see the tables in the page referenced above).
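
As a small illustration of vectorization, the sketch below computes the same quantity with an explicit loop and with an element-wise array operation. The three bill amounts are made-up values.

```python
import numpy as np

bills = np.array([16.87, 17.41, 10.18])  # made-up bill amounts

# Loop version: one Python-level multiplication per element.
tips_loop = np.array([bill * 0.18 for bill in bills])

# Vectorized version: a single element-wise operation on the whole array.
tips_vectorized = bills * 0.18

print(np.allclose(tips_loop, tips_vectorized))
```

Both versions agree, but the vectorized form is shorter and typically much faster on large arrays because the loop runs in compiled code.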

**Question 1** Write a function that takes in a file path pointing to a data file like `restaurant.csv` and returns an array of the restaurant bills.

*Notes*: Where is the file? What values? Look at the starter code documentation in `disc01.py`.

In [8]:
fp = os.path.join('data', 'restaurant.csv')
fp

'data\\restaurant.csv'

In [22]:
def data2array(filepath):
    """
    data2array takes in the filepath of a 
    data file like `restaurant.csv` in 
    data directory, and returns a 1d array
    of data.

    :Example:
    >>> fp = os.path.join('data', 'restaurant.csv')
    >>> arr = data2array(fp)
    >>> isinstance(arr, np.ndarray)
    True
    >>> arr.dtype == np.dtype('float64')
    True
    >>> arr.shape[0]
    100000
    """
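    # Use a context manager so the file is closed even if parsing fails.
    with open(filepath) as fh:
        fh.readline()  # skip the first (header) line
        data = [float(line.strip()) for line in fh]

    return np.array(data)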

In [23]:
arr = disc.data2array(fp)
print(isinstance(arr, np.ndarray))
print(arr.dtype == np.dtype('float64'))
print(arr.shape[0])

True
True
100000
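
As an alternative to parsing the file by hand, `np.loadtxt` can read a single numeric column directly. The sketch below assumes the file has one header line followed by one number per line, which is what the solution above implies; the header name `bill` and the three values are hypothetical stand-ins written to a temporary file.

```python
import os
import tempfile

import numpy as np

# Build a tiny stand-in file; the header name "bill" is hypothetical.
with tempfile.TemporaryDirectory() as tmpdir:
    small_fp = os.path.join(tmpdir, "restaurant.csv")
    with open(small_fp, "w") as fh:
        fh.write("bill\n16.87\n17.41\n10.18\n")

    # skiprows=1 skips the header line, like fh.readline() above.
    small_arr = np.loadtxt(small_fp, skiprows=1)

print(small_arr.dtype, small_arr.shape)
```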


**Question 2:** How many restaurant bills are there?

In [24]:
arr.size

100000

In [25]:
arr.shape[0]

100000

**Question 3:** Suppose everyone leaves an 18% tip. Create an array of tip amounts. What is the total amount of tips in the array?

In [26]:
tips = arr * 0.18
tips

array([3.0366, 3.1338, 1.8324, ..., 2.8962, 2.7504, 5.3496])
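
The question also asks for the total amount of tips, which is the sum of the tips array. A minimal sketch on made-up bills (not the real data):

```python
import numpy as np

bills = np.array([10.0, 20.0, 30.0])  # made-up bills, not the real data
tips = bills * 0.18                   # element-wise: one tip per bill
total_tips = np.sum(tips)             # equivalently, tips.sum()

print(total_tips)
```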

**Question 4:** What are the average, median, min, and max restaurant bills? Give the answer in an array, in the order listed.

In [31]:
centers = np.array([np.mean(arr), np.median(arr), np.min(arr), np.max(arr)])
centers

array([14.9644172, 13.09     ,  3.       , 77.91     ])

**Question 5:** How many restaurant bills are greater than $15?

In [35]:
grt_than_15 = arr[arr>15].size
grt_than_15

42347

**Question 6:** How much total money for the restaurant is there? What proportion of that comes from bills less than $5?

In [36]:
total = np.sum(arr)
total

1496441.7200000002

In [37]:
# Sum only the bills under $5, then divide by the total.
proportion = np.sum(arr[arr < 5]) / total
proportion
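
Questions 5 and 6 both rely on boolean masks: comparing an array to a number gives an array of `True`/`False` values, which can be counted or used to select elements. A small sketch on made-up bills:

```python
import numpy as np

bills = np.array([3.0, 4.0, 13.0, 20.0])  # made-up bills

over_15 = bills > 15                       # boolean mask, one entry per bill
n_over_15 = np.count_nonzero(over_15)      # same as bills[over_15].size

under_5 = bills < 5
prop_under_5 = bills[under_5].sum() / bills.sum()

print(n_over_15, prop_under_5)
```

Here one bill exceeds $15, and the two bills under $5 contribute 7 of the 40 dollars in total.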

**Question 7:** What proportion of bills have at least one other bill within $0.05 of the given amount?

In [66]:
# Brute-force comparison of every pair of bills; too slow for 100,000 bills.
# count = 0
# for i in range(len(arr)):
#     check = arr[i]
#     checklow = check - 0.05
#     checkhigh = check + 0.05
#     for j in range(len(arr)):
#         if i == j:
#             continue
#         elif checklow <= arr[j] <= checkhigh:
#             count += 1
#             break  # count each bill at most once
# print(count/len(arr))

KeyboardInterrupt: 
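
The nested loop above compares every pair of bills, which is on the order of 10^10 comparisons for 100,000 bills. One possible vectorized approach, sketched below under the assumption that bills are whole-cent amounts, is to sort the bills: after sorting, the closest other bill is always an adjacent one, so it is enough to check each bill's left and right neighbors. The function name `prop_with_close_neighbor` and the sample values are hypothetical.

```python
import numpy as np

def prop_with_close_neighbor(bills, tol_cents=5):
    """Proportion of bills with at least one other bill within tol_cents."""
    # Work in integer cents to avoid floating-point comparison issues.
    cents = np.sort(np.round(bills * 100).astype(np.int64))
    if cents.size < 2:
        return 0.0
    # close[i] is True when sorted bills i and i + 1 are within tolerance.
    close = np.diff(cents) <= tol_cents
    has_close_left = np.concatenate(([False], close))
    has_close_right = np.concatenate((close, [False]))
    return float(np.mean(has_close_left | has_close_right))

sample = np.array([10.00, 10.04, 12.50, 20.00, 20.05, 30.00])  # made-up bills
print(prop_with_close_neighbor(sample))
```

In the sample, four of the six bills (10.00, 10.04, 20.00, 20.05) have a neighbor within five cents. Sorting costs O(n log n), so this should finish quickly even on the full array.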

**Question 8:** What proportion of restaurant bills end in 9?

*Hint:* Use the remainder function `%`, but be careful of floating point operations! (What sort of data types have remainders?)

Create a function `ends_in_9` that takes in an array of dollar amounts (like the output of Question 1) and returns the proportion of values that end in 9 in the hundredths place.

In [60]:
def ends_in_9(arr):
    """
    ends_in_9 takes in an array of dollar amounts 
    and returns the proportion of values that end 
    in 9 in the hundredths place.

    :Example:
    >>> arr = np.array([23.04, 45.00, 0.50, 0.09])
    >>> out = ends_in_9(arr)
    >>> 0 <= out <= 1
    True
    """
    # Round away floating-point noise (e.g. 0.29 * 100 == 28.999999999999996)
    # before truncating to whole cents.
    cents = np.round(arr * 100, 6).astype(int)
    return np.count_nonzero(cents % 10 == 9) / len(cents)

In [None]:
def ends_in_9(arr):
    """
    ends_in_9 takes in an array of dollar amounts 
    and returns the proportion of values that end 
    in 9 in the hundredths place.

    :Example:
    >>> arr = np.array([23.04, 45.00, 0.50, 0.09])
    >>> out = ends_in_9(arr)
    >>> 0 <= out <= 1
    True
    """
    # Naive version: remainders of floats are unreliable, because
    # arr * 100 is often not an exact integer (see the hint above).
    rounded = (arr * 100)
    return np.count_nonzero(rounded % 100 % 10 == 9) / len(rounded)

In [61]:
ends_in_9(arr)

0.1019

In [62]:
test = np.array([23.04, 45.00, 0.50, 0.09])
out = ends_in_9(test)
print(0 <= out <= 1)
print(out)

True
0.25


In [63]:
test = np.array([23.04, 45.00, 0.50, 0.595])
out = ends_in_9(test)
print(0 <= out <= 1)
print(out)

True
0.25
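
The hint's warning about floating point can be seen directly on a few made-up amounts. The sketch below assumes whole-cent inputs and compares truncating with rounding when converting dollars to integer cents.

```python
import numpy as np

amounts = np.array([0.29, 0.09, 1.10])  # made-up dollar amounts
scaled = amounts * 100                  # 0.29 * 100 is 28.999999999999996

truncated = scaled.astype(int)          # drops the fraction: 0.29 becomes 28
rounded = np.round(scaled).astype(int)  # nearest integer: 0.29 becomes 29

print(truncated[0], rounded[0])         # 28 29
print(scaled[0] % 10 == 9)              # False: float remainder misses the 9
```

Integers have exact remainders, so converting to integer cents by rounding before applying `%` avoids both problems.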
