# Scripts, modules, packages

# Scripts
- text file, input for interpreter
- define and use functions, variables



In [4]:
%%bash 

python script.py

12.56636
mod __name__: mod
28.27431


### problem
- as code grows, scripts become unwieldy


### solution: modules
- helps maintain/extend code
- easier to read
- easier to reuse across different programs

# Modules

- defines functions, constants, variables, runnable code
- identified by .py extension
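- `sys.path` lists the directories Python searches for modules, including:
    - the directory of the script being run (or the current directory in an interactive session)
    - directories listed in the shell variable `PYTHONPATH`
    - installation-dependent defaults (e.g. the standard library)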
- can be run directly or imported by other scripts/modules
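
To see the search path described above, the snippet below prints each entry of `sys.path` and the current value of `PYTHONPATH`. It is a minimal sketch; the placeholder string used when `PYTHONPATH` is unset is chosen here for illustration only.

```python
import os
import sys

# sys.path is a plain list of directories, searched in order on import.
for entry in sys.path:
    print(repr(entry))

# Directories named in PYTHONPATH (if it is set) are added to sys.path at startup.
pythonpath = os.environ.get("PYTHONPATH", "<PYTHONPATH not set>")
print(pythonpath)
```

Because `sys.path` is an ordinary list, a script can also append a directory to it at runtime, although setting `PYTHONPATH` or installing the code as a package is usually cleaner.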


### Running mod.py

`mod.py` contains a function to compute the area of a circle, but it also runs some top-level code, e.g. for testing.

In [3]:
%%bash
python mod.py

12.56636
mod __name__: __main__
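
The contents of `mod.py` are not shown in these notes. The sketch below is a hypothetical reconstruction that is consistent with the output above: the names `pi` and `circ_area` appear in the `dir(mod)` listing later on, while the value `3.14159` and the argument `2` are assumptions.

```python
# mod.py (hypothetical reconstruction, not the original file)

pi = 3.14159  # assumed value of the module-level constant


def circ_area(radius):
    """Return the area of a circle with the given radius."""
    return pi * radius ** 2


# Top-level code: runs when the file is executed directly AND when it is imported.
print(circ_area(2))
print("mod __name__:", __name__)
```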


### Importing mod.py

- `script.py` imports `mod.py`.
- Based on the contents of `script.py` we expected 2 things to be printed, but the top-level code in `mod.py` also ran on import!

#### special variable \_\_name\_\_

To prevent runnable code within `mod.py` (e.g. tests) from being executed on import, uncomment the line containing `if __name__ == "__main__":` and rerun the previous cell!

For `mod.py`, `__name__` takes on values:
- `__main__` if given to python interpreter
- `mod` if imported
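
The two values of `__name__` can be checked without touching `mod.py`. The sketch below writes a small throwaway module (the file name `guard_demo.py` and its contents are hypothetical) to a temporary directory, runs it directly, and then imports it from a second interpreter process.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical module source, written to a temporary directory for the demo.
SOURCE = '''\
def circ_area(radius):
    return 3.14159 * radius ** 2

if __name__ == "__main__":
    # Only reached when the file is run directly.
    print("running as a script, __name__ =", __name__)
'''

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "guard_demo.py").write_text(SOURCE)

    # Case 1: run the file directly, so the guarded block executes.
    ran = subprocess.run(
        [sys.executable, "guard_demo.py"],
        cwd=tmp, capture_output=True, text=True, check=True,
    )
    # Case 2: import it from another process, so the guarded block is skipped.
    imported = subprocess.run(
        [sys.executable, "-c", "import guard_demo; print(guard_demo.__name__)"],
        cwd=tmp, capture_output=True, text=True, check=True,
    )

print(ran.stdout.strip())
print(imported.stdout.strip())
```

The first line printed comes from the guarded block; the second shows that an imported module sees its own file name (without `.py`) as `__name__`.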

#### standard practice

- modules typically only define functions, etc., and do not contain runnable code
    - tests usually stored elsewhere (next section)
- the `if __name__ == "__main__":` guard is typically included only in the top-level script
    - shows the user the program's entry point
    - prevents the guarded code from running if the file is imported

### Modules: final notes
- to simplify syntax, you can use `from mod import circ_area` to use the function without the dot notation
- to speed up loading modules, python caches the compiled version in the \_\_pycache\_\_ directory 
- the built-in `dir()` function shows which names a module defines
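
The two import styles and `dir()` can be tried on any module. The sketch below uses the standard-library `math` module as a stand-in for `mod`, since `mod.py` may not be on your path.

```python
import math             # names are reached through the module: math.sqrt
from math import sqrt   # the name is bound directly: sqrt

print(math.sqrt(16.0))
print(sqrt(16.0))

# dir() lists the names a module defines; drop the underscore-prefixed ones.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])
```

Both spellings refer to the same function object; the `from ... import ...` form only changes which name is bound in your namespace.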



In [4]:
import mod
dir(mod)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'circ_area',
 'pi']

# Packages

- a directory containing a collection of modules
    - each module contains code related to topic (e.g. I/O functionality)
    - use an \_\_init\_\_.py file in directory/subdirectories to indicate it’s a package
- structured using python’s “dotted module names” namespace
    - Module name `A.B` specifies a submodule `B` in a package named `A`
    - e.g. `sklearn.mixture`, the submodule that provides the `GaussianMixture` class (`sklearn.mixture.GaussianMixture`)
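
Dotted names work the same way in the standard library, so they can be explored without installing anything. The sketch below uses `json` (a package) and `json.decoder` (a submodule inside it) purely as an illustration.

```python
import importlib
import json
import json.decoder  # "json" is the package, "decoder" a submodule inside it

# Packages carry a __path__ attribute; plain modules do not.
print(hasattr(json, "__path__"))
print(hasattr(json.decoder, "__path__"))

# A dotted name can also be resolved from a string at runtime.
decoder = importlib.import_module("json.decoder")
print(decoder.__name__)
```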

### why use packages?

- many tools to build, install, distribute packages
- divides code into well-structured, logical units with standard layout
    - easier to read/understand/use/extend





### package structure

The layout below is a basic package that is not meant for publishing: it lacks the files typically used for that (`setup.py`, `README.md`, and `MANIFEST.in`).

```
datascience
├── __init__.py
├── __main__.py
├── analysis            # subpackage
│   ├── __init__.py
│   └── regression.py
└── preprocess          # subpackage
    ├── __init__.py
    ├── filter.py
    └── transform.py
```

### accessing modules and functions

See the `__init__.py` file in the `datascience` directory to explore the various ways to change the behavior of import statements. Comment or uncomment its sections, restart the kernel (Kernel -> Restart Kernel), and re-import the package.

Use the `dir()` function to see which modules and functions get imported into your namespace.
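
The real `__init__.py` is not reproduced in these notes. As a rough sketch of the mechanism, the code below builds a miniature package in a temporary directory whose `__init__.py` re-exports one function from a nested module. The package name `ds_demo`, the file contents, and the behavior of `filter_1` are all hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical miniature of the layout above, named ds_demo to avoid
# clashing with any installed package.
FILES = {
    "ds_demo/__init__.py": "from .preprocess.filter import filter_1\n",
    "ds_demo/preprocess/__init__.py": "",
    "ds_demo/preprocess/filter.py": (
        "def filter_1(values):\n"
        "    return [v for v in values if v is not None]\n"
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    for rel_path, source in FILES.items():
        target = Path(tmp, rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source)

    sys.path.insert(0, tmp)  # make the temporary package importable
    try:
        ds_demo = importlib.import_module("ds_demo")
    finally:
        sys.path.remove(tmp)

    # Names re-exported by __init__.py appear directly on the package,
    # next to the subpackage that was loaded along the way.
    exported = [name for name in dir(ds_demo) if not name.startswith("_")]

print(exported)
print(ds_demo.filter_1([1, None, 2]))
```

Removing the import line from `__init__.py` would leave the package namespace empty until a submodule is imported explicitly, which is the kind of difference the exercise above asks you to explore.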

In [5]:
import datascience as ds
# "from datascience import *" is not recommended!
dir(ds)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'analysis',
 'filter_1',
 'filter_2',
 'lasso',
 'logistic',
 'preprocess',
 'transform_1',
 'transform_2']

In [7]:
dir(ds.preprocess.filter)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'filter_1',
 'filter_2']

### \_\_main\_\_.py 

Some packages also have a main script that imports submodules and runs an analysis for the user.

One option is to have a top-level script that includes:
```
if __name__ == "__main__":
    main()
```

but it can be ambiguous which script in your package is the main file (e.g. `run.py` or `pipeline.py`).

Instead, include a `__main__.py` file and execute the package's top-level code with:


```
python -m datascience
```

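or by passing the package directory to the interpreter, which runs `__main__.py` as a plain script (the package's `__init__.py` is not executed and relative imports inside `__main__.py` will fail):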

```
python datascience
```
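
The sketch below shows the `-m` mechanism end to end on a throwaway package. The package name `main_demo`, the `main(argv)` function, and the choice to read the input file from the command line are assumptions made for illustration; the actual `datascience/__main__.py` is not shown in these notes and may obtain its input differently.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical package "main_demo" with a __main__.py entry point.
FILES = {
    "main_demo/__init__.py": "",
    "main_demo/__main__.py": (
        "import sys\n"
        "\n"
        "def main(argv):\n"
        "    if not argv:\n"
        "        print('Please provide input file')\n"
        "        return 1\n"
        "    print(f'{argv[0]} processed and analyzed')\n"
        "    return 0\n"
        "\n"
        "if __name__ == '__main__':\n"
        "    sys.exit(main(sys.argv[1:]))\n"
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    for rel_path, source in FILES.items():
        target = Path(tmp, rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source)

    # "python -m main_demo <file>" imports the package, then runs __main__.py.
    result = subprocess.run(
        [sys.executable, "-m", "main_demo", "input_file"],
        cwd=tmp, capture_output=True, text=True,
    )

print(result.stdout.strip())
print(result.returncode)
```

Returning an exit status from `main` and passing it to `sys.exit` lets shell scripts detect a missing argument, which is a common convention rather than a requirement.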

In [2]:
%%bash 

python -m datascience 
"input_file"

Please provide input file
"input_file" processed and analyzed
