# Learning Goals
In our [Principled Data Processing](https://www.youtube.com/watch?v=ZSunU9GQdcI) approach, we want to make our data analysis:
- **auditable**
     - the result of every task, where possible, can be tested by a different analyst in a different language
- **reproducible**
     - the same results can be rebuilt by any scientist with the same tools and code
- **transparent**
     - the result of every task is reviewable
- **scalable**
     - the task and results are well-suited for "more than 2" (datasets used, analysts contributing to the codebase, reports written, etc.)

There are some specific tools and functions we look for to achieve these characteristics in our code. This notebook is intended to introduce a number of these, provide some context, and stage some examples where possible.

_Note: This notebook is formatted as an overview of many python topics, and is not a good example of principled data analysis using Jupyter notebooks. For tips on how to utilize Jupyter notebooks to write python scripts iteratively and quickly, ask Bailey!_

In [2]:
# dependencies
import argparse
import logging
import numpy as np
import pandas as pd

In [1]:
# support methods
def print_thing_info( val ):
    print('arg:\t', val)
    print('type:\t', type(val))
    print()

def print_formula_info( a, b ):
    c = a+b
    print('formula:\t', f'{a} + {b} = {c}')
    print('result type:', type(c))
    print()

## Absolute basics
We'll cover a few language-specific pointers here to get up to speed on some python syntax, and then we'll dive into a few other topics related to principled data processing.

If you need more information on python syntax and standard libraries, you're encouraged to ask questions, share code examples, and do a bit of googling. Also, check out:
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [the library of python books](https://wiki.python.org/moin/PythonBooks)
- [python's built-in types](https://docs.python.org/3/library/stdtypes.html#)
- [python's built-in functions](https://docs.python.org/3/library/functions.html)

If you want more information on performance of different implementations in python, try:
- [scalability.md](https://github.com/HRDAG/training-docs/blob/main/language-tips/python/scalability.md)
- [time complexity](https://wiki.python.org/moin/TimeComplexity)
- searching for topics related to a specific data structure plus 'documentation', 'time complexity', or 'trade offs'

#### assignment examples

In [4]:
test_int = 3
test_float = 3.14
test_nan = np.nan
test_str = 'Hello'
test_none = None

In [5]:
print("Assigned variables:")
print("===================")
print_thing_info(test_int)
print_thing_info(test_float)
print_thing_info(test_nan)
print_thing_info(test_str)
print_thing_info(test_none)

Assigned variables:
arg:	 3
type:	 <class 'int'>

arg:	 3.14
type:	 <class 'float'>

arg:	 nan
type:	 <class 'float'>

arg:	 Hello
type:	 <class 'str'>

arg:	 None
type:	 <class 'NoneType'>



In [6]:
print("Mixing types")
print_formula_info(test_int, test_float)
print_formula_info(test_int, test_nan)

Mixing types
formula:	 3 + 3.14 = 6.140000000000001
result type: <class 'float'>

formula:	 3 + nan = nan
result type: <class 'float'>
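
The two results above have practical consequences for comparisons: floating-point sums are rarely exactly equal to the decimal you expect, and `nan` is not equal to anything, including itself. A small sketch of safer checks, using `math.isclose` and `np.isnan` (the variable name `total` is introduced here for illustration):

```python
import math
import numpy as np

total = 3 + 3.14

# exact equality fails because of binary floating-point rounding
print(total == 6.14)
# compare within a tolerance instead
print(math.isclose(total, 6.14))

# nan never compares equal, so test for it explicitly
print(np.nan == np.nan)
print(np.isnan(3 + np.nan))
```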



#### for loops
Two common purposes:
1. building a collection
2. filtering a collection for a condition

In [9]:
vals = ['a', 'b', 'c']
for val in vals:
    if val == 'a':
        print(val)

    
for i in range(len(vals)):
    val = vals[i]
    if val == 'b':
        print(val)

a
b
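
The loops above filter a collection. The other purpose listed, building a collection, might look like the sketch below; `upper_vals` and `positions` are names made up for this example.

```python
vals = ['a', 'b', 'c']

# build a list by appending inside a loop
upper_vals = []
for val in vals:
    upper_vals.append(val.upper())

# build a dict that maps each value to its position
positions = {}
for i, val in enumerate(vals):
    positions[val] = i

print(upper_vals)
print(positions)
```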


In [12]:
out = {0: {'record_id':'alkjfd23l', 'age': 47}}

df = pd.DataFrame(out).T
df.head()

Unnamed: 0,age,record_id
0,47,alkjfd23l


In [16]:
for idx, info_dict in df.T.to_dict().items():
    # iterating over a dict yields its keys directly
    for key in info_dict:
        print(key)

age
record_id


In [17]:
df.loc[df.age == 47]

Unnamed: 0,age,record_id
0,47,alkjfd23l


#### Collections
Some data structures exist to give us the ability to collect multiple values and store them. There are several ways to initialize these variables:
1. empty initialization
2. initialization with values manually prescribed
3. initialization with values prescribed by a comprehension formula

In [20]:
empty_list = []
hand_list = [0, 1, 2, 3, 4, 5]
auto_list = [val for val in range(6)]

empty_dict = {}
hand_dict = {0:'a', 1:'a', 2:'a', 3:'a', 4:'a', 5:'a'}
auto_dict = {key:'a' for key in hand_list}

empty_set = set()
hand_set = {0, 1, 2, 3, 4, 5}
auto_set = {key for key in hand_list}

In [21]:
print_thing_info(empty_list)
print_thing_info(hand_list)
print_thing_info(empty_dict)
print_thing_info(hand_dict)
print_thing_info(empty_set)
print_thing_info(hand_set)

arg:	 []
type:	 <class 'list'>

arg:	 [0, 1, 2, 3, 4, 5]
type:	 <class 'list'>

arg:	 {}
type:	 <class 'dict'>

arg:	 {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'a'}
type:	 <class 'dict'>

arg:	 set()
type:	 <class 'set'>

arg:	 {0, 1, 2, 3, 4, 5}
type:	 <class 'set'>



In [26]:
empty_set.add('test')
empty_set

{'test'}

In [27]:
print(hand_list == auto_list)
print(hand_dict == auto_dict)
print(hand_set == auto_set)

True
True
True


#### Booleans and conditionals

In [28]:
if True:
    print('True')

if 1:
    print('wait what?')
else:
    print('I never print')

True
wait what?


In [29]:
test_vals = [1, 2, 3, np.nan, 5, 6, 7, 8, None, 10]

In [30]:
print("Check for missing data with `if`")
print("================================")
for val in test_vals:
    if val:
        print(val)

print()
print("Check for missing data with `np.isnan()`")
print("========================================")
for i in range(len(test_vals)):
    val = test_vals[i]
    if not np.isnan(val):
        print(val)

Check for missing data with `if`
1
2
3
nan
5
6
7
8
10

Check for missing data with `np.isnan()`
1
2
3
5
6
7
8


TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Check out the [missingness](https://github.com/HRDAG/training-docs/blob/main/language-tips/python/missingness.md) python doc in our training repo or the in-depth chapter in the Python Data Science Handbook [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html) to learn more about this type error and handling missing data.
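
One way to sidestep the `TypeError` above is `pd.isna()`, which accepts `None` as well as `np.nan`. A minimal sketch, reusing the `test_vals` list from the earlier cell; `present_vals` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

test_vals = [1, 2, 3, np.nan, 5, 6, 7, 8, None, 10]

# pd.isna() is True for both np.nan and None, so no TypeError is raised
present_vals = [val for val in test_vals if not pd.isna(val)]
print(present_vals)
```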

#### functions
You might've noticed I created and called some functions before this section, and that's because I had a very repetitive series of instructions I wanted to run on multiple inputs. Using a function involves 2 steps:
1. declaration
2. calling

In [31]:
def pretty_print( label, val ):
    print('{:40}{}'.format(label, val))

pretty_print('test', 1)

test                                    1


## Running python scripts
You can (and should) run python code as a script from the command line or as a target in a makefile. In order to do this effectively, there are 2 libraries, 1 command, and 1 logic block we can't live without.

#### libraries
- `argparse` lets us handle arguments provided to the script call in the terminal

In [None]:
# argparse
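
A minimal sketch of how `argparse` might be used. The argument names `--input` and `--output`, the default `output.csv`, the path `input/records.csv`, and the helper `get_args` are all hypothetical, chosen only for illustration.

```python
import argparse

def get_args(argv=None):
    # --input and --output are hypothetical argument names
    parser = argparse.ArgumentParser(description='example script')
    parser.add_argument('--input', required=True, help='path to the input file')
    parser.add_argument('--output', default='output.csv', help='path to the output file')
    # argv=None tells argparse to read sys.argv[1:]; in a notebook, pass a list instead
    return parser.parse_args(argv)

args = get_args(['--input', 'input/records.csv'])
print(args.input, args.output)
```

In a script, calling `get_args()` with no argument would read whatever was typed in the terminal, for example `python script.py --input input/records.csv`.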

- `logging` lets us save info to a logfile that we can read later on to answer questions about our data at runtime, without having to reload our data or run the program again

In [None]:
# logging
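
A minimal sketch of writing to a logfile with `logging`. The logger name `example`, the helper `get_logger`, and the temporary file location are assumptions made for this example; a real task would more likely write to a fixed path in its output directory.

```python
import logging
import os
import tempfile

def get_logger(logfile):
    # 'example' is a hypothetical logger name
    logger = logging.getLogger('example')
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(logfile, mode='w')
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    return logger

# write the log to a temporary directory so the example leaves nothing behind
logfile = os.path.join(tempfile.mkdtemp(), 'example.log')
logger = get_logger(logfile)
logger.info('rows in input: %d', 1)

# read the logfile back to confirm what was recorded
with open(logfile) as f:
    log_text = f.read()
print(log_text)
```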

#### command
- `assert` is a critical statement that asks python to confirm that a condition is true before proceeding, and raises an `AssertionError` if it is not. This statement speaks to our ability to audit the data and be transparent in our approach.

In [32]:
# checking df.shape
df.shape

(1, 2)

In [43]:
# check that the expected columns are present (sorted, so column order does not matter)
# note: all(...) returns a single bool, so comparing it to a list always fails
try:
    assert sorted(df.columns.tolist()) == ['age', 'record_id']
except AssertionError:
    print(df.columns)
    raise


In [37]:
df.columns

Index(['age', 'record_id'], dtype='object')

In [41]:
assert len(df['age'].unique()) == 1

In [None]:
# check to confirm a manipulation was successful
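
One possible shape for such a check is sketched below: assert on the data before and after a deduplication step. The records and the names `n_before` and `deduped` are made up for this example.

```python
import pandas as pd

# made-up records, including one duplicated record_id
df = pd.DataFrame({'record_id': ['alkjfd23l', 'alkjfd23l', 'qpwoe91x'],
                   'age': [47, 47, 52]})

n_before = len(df)
deduped = df.drop_duplicates(subset='record_id')

# the manipulation should never add rows
assert len(deduped) <= n_before
# after deduplication, every record_id should appear exactly once
assert deduped['record_id'].is_unique, 'duplicate record_id values remain'
print(n_before, len(deduped))
```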

#### logic block
While we can technically run a .py file that doesn't contain an `if __name__ == '__main__'` block as a script, the block marks the script's entry point: code inside it runs only when the file is executed directly, not when the file is imported by another module. Keeping setup such as argument parsing inside the block makes the script read as a program, so it's an ideal inclusion in your .py scripts.

In [None]:
# in a jupyter notebook, __name__ is '__main__', so this conditional is always True
if __name__ == '__main__':
    print('name:\t', __name__)
    print('msg:\t', 'Hello world')
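
To see why the conditional matters, compare `__name__` in an imported module with the guard in a script. In this sketch, `json` stands in for any imported module and `main` is a hypothetical entry-point function.

```python
import json

# an imported module's __name__ is its own name, not '__main__'
print(json.__name__)

def main():
    # hypothetical entry point for the script
    return 'Hello world'

if __name__ == '__main__':
    # runs only when this file is executed directly, not when it is imported
    print(main())
```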

## Other useful libraries

In [None]:
# numpy
# - stat methods

In [None]:
# pandas
# - DataFrame
# - Series
# - suggested tutorial
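
A minimal sketch of how the two core pandas objects relate, using made-up records: a `Series` is one labeled column, and a `DataFrame` is a table of them.

```python
import pandas as pd

# a Series is a single labeled column of values
ages = pd.Series([47, 52, 31], name='age')

# a DataFrame is a table whose columns are Series
df = pd.DataFrame({'record_id': ['a1', 'b2', 'c3'], 'age': ages})

# selecting one column returns a Series
print(type(df['age']))

# boolean filtering returns a smaller DataFrame
over_40 = df.loc[df.age > 40]
print(over_40)
```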