# Learning Goals
In our [Principled Data Processing](https://www.youtube.com/watch?v=ZSunU9GQdcI) approach, we want to make our data analysis:
- **auditable**
     - the result of every task, where possible, can be tested by a different analyst in a different language
- **reproducible**
     - the same results can be rebuilt by any scientist with the same tools and code
- **transparent**
     - the result of every task is reviewable
- **scalable**
     - the task and results are well-suited for "more than 2" (datasets used, analysts contributing to codebase, reports written, etc.)

There are some specific tools and functions we look for to achieve these characteristics in our code. This notebook is intended to introduce a number of these, provide some context, and stage some examples where possible.

_Note: This notebook is formatted as an overview of many python topics, and is not a good example of principled data analysis using Jupyter notebooks. For tips on how to utilize Jupyter notebooks to write python scripts iteratively and quickly, ask Bailey!_

In [None]:
# dependencies
import argparse
import logging
import numpy as np
import pandas as pd

## Absolute basics
We'll cover a few language-specific pointers here to get up to speed on some python syntax, and then we'll dive in to a few other topics related to principled data processing.

If you need more information on python syntax and standard libraries, you're encouraged to ask questions, share code examples, and do a bit of googling. Also, check out:
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [the library of python books](https://wiki.python.org/moin/PythonBooks)
- [python's built-in types](https://docs.python.org/3/library/stdtypes.html#)
- [python's built-in functions](https://docs.python.org/3/library/functions.html)

If you want more information on performance of different implementations in python, try:
- [scalability.md](https://github.com/HRDAG/training-docs/blob/main/language-tips/python/scalability.md)
- [time complexity](https://wiki.python.org/moin/TimeComplexity)
- searching for topics related to a specific data structure plus 'documentation', 'time complexity', or 'trade offs'

In [None]:
# assignment
# ints, floats, np.nan
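
A minimal sketch of assignment with the numeric types named above. The variable names (`n_records`, `mean_age`, `missing`) are made up for illustration.

```python
import numpy as np

# assignment binds a name to a value; no type declaration is needed
n_records = 42        # int
mean_age = 37.5       # float
missing = np.nan      # a float used as the marker for a missing numeric value

# np.nan never compares equal to anything, including itself,
# so use np.isnan() rather than == to detect it
print(missing == missing)   # False
print(np.isnan(missing))    # True

# arithmetic with nan propagates the missing value
total = n_records + missing
print(np.isnan(total))      # True
```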

In [None]:
# str, None
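
A short sketch of strings and `None`, using invented example values. Note that `None` is checked with `is`, not `==`.

```python
name = "Ada Lovelace"

# common string operations: split, index, join, f-string formatting
initials = "".join(part[0] for part in name.split())
label = f"{name.upper()} ({len(name)} characters)"
print(initials)   # AL
print(label)      # ADA LOVELACE (12 characters)

# None is python's "no value" object; test for it with `is`
middle_name = None
has_middle_name = middle_name is not None
print(has_middle_name)   # False
```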

In [None]:
# if statements
# missing check
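
One possible way to combine an `if` statement with a missing-value check. The helper `describe` is hypothetical; `pd.isna` is used here because it treats both `None` and `np.nan` as missing.

```python
import numpy as np
import pandas as pd

def describe(value):
    """Hypothetical helper: label a single numeric value."""
    # check for missing first, because comparing None with < raises a TypeError
    if pd.isna(value):
        return "missing"
    elif value < 0:
        return "negative"
    else:
        return "non-negative"

print(describe(np.nan))   # missing
print(describe(None))     # missing
print(describe(-3))       # negative
print(describe(12.5))     # non-negative
```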

In [None]:
# for loops
# 1) build input
# 2) filter for a condition
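
A sketch of the two loop patterns listed above, assuming some made-up ages as input.

```python
# 1) build input: collect one record per observed age
records = []
for i, age in enumerate([34, 17, 52, 15]):
    records.append({"id": i, "age": age})

# 2) filter for a condition: keep the ids of records aged 18 or over
adult_ids = []
for rec in records:
    if rec["age"] >= 18:
        adult_ids.append(rec["id"])

print(adult_ids)   # [0, 2]
```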

In [None]:
# list
# list comprehension

# dict
# dict comprehension

# set
# set comprehension
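
A brief illustration of the three containers and their comprehension forms, using an invented list of words.

```python
words = ["apple", "banana", "apple", "cherry"]

# list: ordered, allows duplicates
lengths = [len(w) for w in words]

# dict: maps unique keys to values (the repeated "apple" collapses to one key)
length_by_word = {w: len(w) for w in words}

# set: unordered collection of unique values
first_letters = {w[0] for w in words}

print(lengths)          # [5, 6, 5, 6]
print(length_by_word)   # {'apple': 5, 'banana': 6, 'cherry': 6}
print(len(first_letters))   # 3
```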

In [None]:
# function declaration
# function calling
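
A minimal sketch of declaring and calling a function. `percent_missing` is a hypothetical helper, not something from a library.

```python
def percent_missing(values, missing_marker=None):
    """Return the percentage of items in `values` that are the missing marker."""
    if not values:
        return 0.0
    n_missing = sum(1 for v in values if v is missing_marker)
    return 100 * n_missing / len(values)

# calling with a positional argument; missing_marker falls back to its default
share = percent_missing(["a", None, "b", None])
print(share)   # 50.0

# calling with a keyword argument to override the default
share_na = percent_missing(["a", "NA", "b"], missing_marker="NA")
```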

## Running python scripts
You can (and should) run python code as a script from the command line or as a target in a makefile. In order to do this effectively, there are 2 libraries, 1 command, and 1 logic block we can't live without.

#### libraries
- `argparse` lets us handle arguments provided to the script call in the terminal

In [None]:
# argparse
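
A minimal `argparse` sketch. The argument names (`--input`, `--output`, `--verbose`) and file paths are assumptions for illustration. An explicit list is passed to `parse_args` so the cell runs inside a notebook; in a script you would call `get_args()` with no arguments so that the values come from the command line.

```python
import argparse

def get_args(argv=None):
    """Parse arguments; argv=None makes argparse read from the command line."""
    parser = argparse.ArgumentParser(description="Clean a raw data file.")
    parser.add_argument("--input", required=True, help="path to the input file")
    parser.add_argument("--output", default="output/clean.csv",
                        help="path to write the result")
    parser.add_argument("--verbose", action="store_true",
                        help="print extra detail while running")
    return parser.parse_args(argv)

# equivalent to: python clean.py --input input/raw.csv
args = get_args(["--input", "input/raw.csv"])
print(args.input, args.output, args.verbose)
```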

- `logging` lets us save info to a logfile that we can read later on to answer questions about our data at runtime, without having to reload our data or run the program again

In [None]:
# logging
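
One way to set up a logfile, sketched under assumptions: the logger name `pdp_example`, the helper `get_logger`, and the logged row count are all invented, and a temporary directory stands in for a real output folder.

```python
import logging
import os
import tempfile

def get_logger(logfile):
    """Hypothetical helper: return a logger that writes INFO and above to logfile."""
    logger = logging.getLogger("pdp_example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if the cell is re-run
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger

logfile = os.path.join(tempfile.mkdtemp(), "clean.log")
logger = get_logger(logfile)
logger.info("rows read: %d", 1200)

# the logfile can be read later without reloading the data
with open(logfile) as f:
    log_text = f.read()
print(log_text)
```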

#### command
- `assert` is a critical statement that asks python to confirm that a condition is true before proceeding; if the condition is false, the script stops with an `AssertionError`. It speaks to our ability to audit the data and be transparent in our approach.

In [7]:
# checking df.shape
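
A sketch of a shape check, assuming a small made-up DataFrame and an expected shape known in advance. The message after the comma is shown if the assertion fails.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "county": ["A", "B", "A"],
    "age": [34, 51, 29],
})

# df.shape is a (rows, columns) tuple
expected_shape = (3, 3)   # assumed known from the data source
assert df.shape == expected_shape, f"expected {expected_shape}, got {df.shape}"

n_rows, n_cols = df.shape
```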

In [None]:
# all() to check columns after shape passes
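
Once the shape passes, `all()` can confirm that every expected column is present. The column names below are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "county": ["A", "B", "A"],
    "age": [34, 51, 29],
})
expected_cols = ["id", "county", "age"]   # hypothetical expected columns

# shape first: the right number of columns
assert df.shape[1] == len(expected_cols)

# then names: all() is True only if every expected column is present
assert all(col in df.columns for col in expected_cols), "missing an expected column"

# a stricter check that also requires the same column order
cols_match = list(df.columns) == expected_cols
```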

In [None]:
# assert len(data[col].unique()) == known_unique_count
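
The commented assertion above, filled in with made-up data. `data`, `col`, and `known_unique_count` are placeholders for your own DataFrame, column name, and expected count.

```python
import pandas as pd

data = pd.DataFrame({"county": ["A", "B", "A", "C"]})
col = "county"
known_unique_count = 3   # assumed known from documentation or a prior task

assert len(data[col].unique()) == known_unique_count

# note: unique() counts NaN as a value, while nunique() drops it by default
n_unique = data[col].nunique()
```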

In [None]:
# check to confirm a manipulation was successful
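
A sketch of asserting that a manipulation did what was intended, here a hypothetical name-cleaning step on invented data.

```python
import pandas as pd

df = pd.DataFrame({"name": [" ada ", "Grace", "  alan"]})
n_before = len(df)

# the manipulation: strip surrounding whitespace and lowercase
df["name_clean"] = df["name"].str.strip().str.lower()

# no rows were gained or lost
assert len(df) == n_before
# no cleaned value starts or ends with whitespace
assert not df["name_clean"].str.contains(r"^\s|\s$").any()
# every cleaned value is already lowercase
assert (df["name_clean"] == df["name_clean"].str.lower()).all()
```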

#### logic block
While we can technically run a .py file that doesn't contain an `if __name__ == '__main__'` block as a script, the block makes the script read as a program. Code inside it runs only when the file is executed directly, not when it is imported by another module, which gives the script a single, obvious entry point for steps like parsing arguments and setting up logging. That makes it an ideal inclusion in your .py scripts.

In [8]:
# inside a jupyter notebook, __name__ is '__main__', so this conditional is always True
if __name__ == '__main__':
    print('name:\t', __name__)
    print('msg:\t', 'Hello world')

name:	 __main__
msg:	 Hello world
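
A sketch of how a script might be organized around the main block. The names `get_args` and `main`, the `--input` argument, and the file path are assumptions for illustration, and the body of `main` is a stand-in for real work.

```python
import argparse

def get_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    return parser.parse_args(argv)

def main(argv=None):
    args = get_args(argv)
    # real work (read, check with assert, write, log) would go here
    return f"would process {args.input}"

if __name__ == "__main__":
    # In a script this line would be `main()`, so argparse reads the command line.
    # An explicit list is passed here so the sketch also runs inside a notebook.
    print(main(["--input", "input/raw.csv"]))
```

Because the work lives in functions and only the call sits under the conditional, another script or a test can import `main` without triggering a run.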


## Other useful libraries

In [None]:
# numpy
# - stat methods
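
A few numpy summary statistics on a made-up array with one missing value. The plain functions return `nan` when any value is missing; the `nan`-prefixed versions skip missing values.

```python
import numpy as np

ages = np.array([34, 51, 29, np.nan, 42])

mean_with_nan = np.mean(ages)     # nan, because one value is missing
mean_age = np.nanmean(ages)       # mean of the four observed values
median_age = np.nanmedian(ages)   # median of the four observed values
sd_age = np.nanstd(ages)          # population standard deviation (ddof=0)

print(mean_with_nan)   # nan
print(mean_age)        # 39.0
print(median_age)      # 38.0
```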

In [None]:
# pandas
# - DataFrame
# - Series
# - suggested tutorial
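
A minimal look at the two core pandas objects, using invented data. A `DataFrame` is a table, and each of its columns is a `Series`.

```python
import pandas as pd

# DataFrame: a table with labeled columns
df = pd.DataFrame({
    "county": ["A", "B", "A", "C"],
    "age": [34, 51, 29, 42],
})

# Series: a single labeled column
ages = df["age"]

# boolean filtering keeps the rows where the condition is True
over_30 = df[df["age"] > 30]

# groupby + aggregation returns a Series indexed by group
mean_age_by_county = df.groupby("county")["age"].mean()

print(type(ages).__name__)        # Series
print(len(over_30))               # 3
print(mean_age_by_county["A"])    # 31.5
```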