## Advanced libraries and Subprocesses
### BIOINF 575

### Advanced libraries
#### scipy, scikit-learn, tensorflow, keras, pytorch

It is important to practice using common advanced libraries:
* https://dataheadhunters.com/academy/how-to-use-python-for-bioinformatics-detailed-step-by-step-guide/

#### scipy
https://scipy.org

https://github.com/scipy/scipy

Summary generated with perplexity AI:    
https://www.perplexity.ai/search/you-are-a-teacher-in-grad-leve-540weGGrTa.oAUSrn8ecqg

#### Overview of SciPy
SciPy is a powerful open-source Python library designed for scientific and technical computing. It builds on the capabilities of the NumPy library, providing additional tools and algorithms for a wide range of mathematical and scientific calculations. SciPy is known for its efficient and user-friendly interfaces, making it a popular choice among scientists, engineers, and data analysts.
##### Key Features
* Numerical Integration and Optimization: SciPy offers functions for numerical integration, interpolation, and optimization, which are essential for solving complex mathematical problems.
* Statistics: The scipy.stats module provides a comprehensive suite of statistical functions, including probability distributions, statistical tests, and descriptive statistics.
* Linear Algebra: SciPy includes advanced linear algebra capabilities, such as solving systems of equations, eigenvalue problems, and matrix decompositions.
* Signal Processing: The library also supports signal processing tasks through its signal module, which includes tools for filtering, spectral analysis, and resampling.

Table from:   
https://lectures.scientific-python.org/intro/scipy/index.html


| Module            | Description                            |
|-------------------|----------------------------------------|
| scipy.cluster     | Vector quantization / Kmeans           |
| scipy.constants   | Physical and mathematical constants    |
| scipy.fft         | Fourier transform                      |
| scipy.integrate   | Integration routines                   |
| scipy.interpolate | Interpolation                          |
| scipy.io          | Data input and output                  |
| scipy.linalg      | Linear algebra routines                |
| scipy.ndimage     | n-dimensional image package            |
| scipy.odr         | Orthogonal distance regression         |
| scipy.optimize    | Optimization                           |
| scipy.signal      | Signal processing                      |
| scipy.sparse      | Sparse matrices                        |
| scipy.spatial     | Spatial data structures and algorithms |
| scipy.special     | Any special mathematical functions     |
| scipy.stats       | Statistics                             |


In [None]:
! pip install scipy

In [None]:
# https://stackoverflow.com/questions/8400382/python-pip-silent-install/8400396

In [None]:
import scipy as sp
import numpy as np
from matplotlib import pyplot as plt

dist = sp.stats.norm(loc=0, scale=1)  # standard normal distribution
sample = dist.rvs(size=100000)  # "random variate sample"
plt.hist(sample, bins=50, density=True, label='normalized histogram')  
x = np.linspace(-5, 5)
plt.plot(x, dist.pdf(x), label='PDF')

plt.legend()


In [None]:
res = sp.stats.normaltest(sample)
res.statistic

In [None]:
! pip install pooch

In [None]:
# Load an image
face = sp.datasets.face(gray=True)

# Shift, rotate and zoom it
shifted_face = sp.ndimage.shift(face, (50, 50))
shifted_face2 = sp.ndimage.shift(face, (50, 50), mode='nearest')
rotated_face = sp.ndimage.rotate(face, 30)
cropped_face = face[50:-50, 50:-50]
zoomed_face = sp.ndimage.zoom(face, 2)
zoomed_face.shape

In [None]:
plt.subplot(151)
plt.imshow(shifted_face, cmap=plt.cm.gray)
plt.subplot(152)
plt.imshow(shifted_face2, cmap=plt.cm.gray)
plt.subplot(153)
plt.imshow(rotated_face, cmap=plt.cm.gray)
plt.subplot(154)
plt.imshow(cropped_face, cmap=plt.cm.gray)
plt.subplot(155)
plt.imshow(zoomed_face, cmap=plt.cm.gray)
plt.axis('off')

#### scikit-learn

https://scikit-learn.org/stable/

https://github.com/scikit-learn/scikit-learn

Summary generated with perplexity AI:    
https://www.perplexity.ai/search/you-are-a-teacher-in-grad-leve-eLszD.e9QpOEC14u7goMRg#0

#### Overview of Scikit-learn
Scikit-learn, often referred to as sklearn, is a widely-used open-source machine learning library for Python. It provides a range of supervised and unsupervised learning algorithms through a consistent interface in Python. The library is built on top of NumPy, SciPy, and Matplotlib, making it efficient for data analysis and modeling.

#### Key Features
* Classification: Algorithms for categorizing data into predefined classes, such as logistic regression and support vector machines.
* Regression: Techniques for predicting continuous values, like linear regression and decision tree regression.
* Clustering: Methods for grouping similar data points, including k-means and DBSCAN.
* Dimensionality Reduction: Tools like Principal Component Analysis (PCA) to reduce the number of variables in data.
* Model Selection and Evaluation: Functions to compare, validate, and choose models and their parameters.

#### Applications in Bioinformatics
In bioinformatics, scikit-learn is particularly useful due to its versatility in handling various types of biological data. Here are some specific applications:
* Gene Expression Analysis
  * Classification: Used to classify gene expression profiles to identify different types of cancer or other diseases. Algorithms like support vector machines (SVM) are commonly applied for this purpose.
* Protein Structure Prediction
  * Regression Models: Employed to predict protein structures based on amino acid sequences. Linear regression can be used to model relationships between sequence features and structural properties.
* Clustering Biological Data
  * Clustering Techniques: Useful for grouping similar gene expression profiles or protein structures. K-means clustering helps in identifying patterns within large datasets.
* Dimensionality Reduction
  * PCA and Other Techniques: These are used to reduce the dimensionality of complex biological datasets, making it easier to visualize and interpret the data.

Scikit-learn's ability to integrate with other Python libraries such as NumPy, Pandas, and Matplotlib allows for seamless preprocessing, analysis, and visualization of bioinformatics data. Its user-friendly API makes it accessible for researchers looking to apply machine learning techniques without extensive programming expertise.

In [None]:
! pip install scikit-learn

In [None]:
# An example of clustering of digits
# This dataset contains handwritten digits from 0 to 9. We would like to group images such that the handwritten digits on the image are the same.
# Example from:
# https://scikit-learn.org/1.5/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

In [None]:
# More examples at: 
# https://github.com/glouppe/tutorial-sklearn-lhcb/blob/master/An%20introduction%20to%20Machine%20Learning%20with%20Scikit-Learn-Rendered.ipynb


In [None]:
import numpy as np

from sklearn.datasets import load_digits

data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size

print(f"# digits: {n_digits}; # samples: {n_samples}; # features {n_features}")

In [None]:
from time import time

from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def bench_k_means(kmeans, name, data, labels):
    """Benchmark to evaluate the KMeans initialization methods.

    Parameters
    ----------
    kmeans : KMeans instance
        A :class:`~sklearn.cluster.KMeans` instance with the initialization
        already set.
    name : str
        Name given to the strategy. It will be used to show the results in a
        table.
    data : ndarray of shape (n_samples, n_features)
        The data to cluster.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics which requires some
        supervision.
    """
    t0 = time()
    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)
    fit_time = time() - t0
    results = [name, fit_time, estimator[-1].inertia_]

    # Define the metrics which require only the true labels and estimator
    # labels
    clustering_metrics = [
        metrics.homogeneity_score,
        metrics.completeness_score,
        metrics.v_measure_score,
        metrics.adjusted_rand_score,
        metrics.adjusted_mutual_info_score,
    ]
    results += [m(labels, estimator[-1].labels_) for m in clustering_metrics]

    # The silhouette score requires the full dataset
    results += [
        metrics.silhouette_score(
            data,
            estimator[-1].labels_,
            metric="euclidean",
            sample_size=300,
        )
    ]

    # Show the results
    formatter_result = (
        "{:9s}\t{:.3f}s\t{:.0f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}"
    )
    print(formatter_result.format(*results))

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

print(82 * "_")
print("init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette")

kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=4, random_state=0)
bench_k_means(kmeans=kmeans, name="k-means++", data=data, labels=labels)

kmeans = KMeans(init="random", n_clusters=n_digits, n_init=4, random_state=0)
bench_k_means(kmeans=kmeans, name="random", data=data, labels=labels)

pca = PCA(n_components=n_digits).fit(data)
kmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)
bench_k_means(kmeans=kmeans, name="PCA-based", data=data, labels=labels)

print(82 * "_")

In [None]:
import matplotlib.pyplot as plt

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=4)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = 0.02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=plt.cm.Paired,
    aspect="auto",
    origin="lower",
)

plt.plot(reduced_data[:, 0], reduced_data[:, 1], "k.", markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    s=169,
    linewidths=3,
    color="w",
    zorder=10,
)
plt.title(
    "K-means clustering on the digits dataset (PCA-reduced data)\n"
    "Centroids are marked with white cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

#### tensorflow

https://www.tensorflow.org

https://github.com/tensorflow/tensorflow

Summary generated with perplexity AI:    
https://www.perplexity.ai/search/you-are-a-teacher-in-grad-leve-vrp1AdYmQl6NaeCh7428Eg#0

#### Overview of TensorFlow
TensorFlow is an open-source library developed by Google for numerical computation and large-scale machine learning, particularly deep learning applications. It allows developers to build and train models using a comprehensive ecosystem of tools and libraries. TensorFlow supports various machine learning tasks, including image recognition, natural language processing, and more, by utilizing dataflow graphs where nodes represent mathematical operations and edges represent multidimensional data arrays called tensors.

#### Key Features of TensorFlow
* Dataflow Graphs: TensorFlow uses dataflow graphs to represent computations. This allows for efficient execution across different hardware platforms, including CPUs, GPUs, and TPUs.
* High-Level APIs: TensorFlow provides high-level APIs that simplify model building and training, making it accessible even to those not deeply familiar with the underlying algorithms.
* Cross-Platform Support: Models built with TensorFlow can be deployed on various platforms such as desktops, mobile devices, and cloud environments.
* Abstraction: The library abstracts complex operations, allowing developers to focus on the logic of their applications without delving into the intricacies of algorithm implementation.

##### Applications in Bioinformatics
TensorFlow's capabilities are particularly useful in bioinformatics for handling large datasets and complex models. Here are some ways it is applied in this field:
* Genomic Data Analysis: TensorFlow can process large genomic datasets to identify patterns and make predictions about genetic traits or disease predispositions.
* Protein Structure Prediction: Deep learning models built with TensorFlow can predict protein structures based on amino acid sequences, aiding in drug discovery and development.
* Medical Imaging: TensorFlow's image recognition capabilities are leveraged for analyzing medical images such as MRI or CT scans to detect anomalies or diseases.
* Natural Language Processing (NLP): In bioinformatics, NLP models can process scientific literature or patient records to extract meaningful information for research or clinical decision-making.

#### Advantages for Bioinformatics
* Scalability: TensorFlow's ability to handle large-scale computations makes it ideal for bioinformatics tasks that require processing vast amounts of biological data.
* Flexibility: The library's support for various machine learning models allows researchers to experiment with different approaches to solve bioinformatics problems.
* Community and Resources: Being widely used, TensorFlow has a strong community and a wealth of resources that can aid bioinformaticians in developing robust solutions.

In summary, TensorFlow is a powerful tool in the bioinformatics domain due to its scalability, flexibility, and comprehensive ecosystem that supports a wide range of applications from genomic analysis to medical imaging.

In [None]:
! pip install tensorflow

#### keras

In [None]:
import keras

Summary generated with perplexity AI:    
https://www.perplexity.ai/search/you-are-a-teacher-in-grad-leve-3Rzgk7IoTfCkvSQdcl4B9Q#0

#### Overview of Keras
Keras is a high-level deep learning API written in Python and developed by Google. It is designed to simplify the implementation of neural networks by providing a user-friendly and modular interface. Keras has historically operated on top of several backend engines, including TensorFlow, Theano, MXNet, and CNTK (Keras 3 supports TensorFlow, JAX, and PyTorch), with TensorFlow being its most common backend due to its integration as TensorFlow's official high-level API.

##### Key Features
* Ease of Use: Keras offers a simple and consistent API that reduces the complexity of implementing deep learning models. This makes it accessible for both beginners and experts.
* Modularity: The library is modular, allowing users to create complex architectures using the Sequential API or the Functional API, which supports arbitrary graphs of layers.
* Cross-Platform Compatibility: Keras models can run on various hardware configurations, including CPUs and GPUs, and can be exported for use across different platforms.
* Rapid Prototyping: The high-level abstractions in Keras facilitate quick iteration on ideas, making it suitable for fast experimentation.

#### Applications in Bioinformatics
Keras is extensively used in bioinformatics for tasks such as genomics, proteomics, and medical imaging. Here are some specific applications:

##### Genomics
* Sequence Classification: Keras can be used to classify DNA sequences by building models that predict functional regions or detect mutations.
* Gene Expression Analysis: Deep learning models in Keras can analyze gene expression data to identify patterns associated with diseases.

##### Proteomics
* Protein Structure Prediction: Keras models help predict protein structures from amino acid sequences using techniques like convolutional neural networks (CNNs).
* Protein Function Prediction: By leveraging large datasets of protein sequences, Keras can assist in predicting protein functions based on structural and sequence data.

##### Medical Imaging
* Image Classification: Keras is used to classify medical images such as MRI or CT scans for diagnostic purposes.
* Segmentation Tasks: In tasks like tumor detection, Keras models can segment medical images to highlight areas of interest.

#### Advantages in Bioinformatics
* Integration with TensorFlow: This allows for scalability and efficient computation on large datasets typical in bioinformatics.
* Preprocessing Capabilities: Keras provides preprocessing layers that facilitate handling complex bioinformatics data types like sequences and images.
* Community Support: A robust community and extensive documentation support bioinformatics researchers in implementing cutting-edge models quickly.

Keras's simplicity, flexibility, and integration with powerful backends make it a valuable tool for advancing research and applications in bioinformatics.

In [None]:
# Example from:
# https://www.tensorflow.org/tutorials/quickstart/beginner

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

In [None]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

In [None]:
predictions = model(x_train[:1]).numpy()
predictions

In [None]:
tf.nn.softmax(predictions).numpy()

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [None]:
loss_fn(y_train[:1], predictions).numpy()

In [None]:
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

In [None]:
model.fit(x_train, y_train, epochs=5)

In [None]:
model.evaluate(x_test,  y_test, verbose=2)

In [None]:
probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])

In [None]:
probability_model(x_test[:5])

#### pytorch   
https://pytorch.org

https://github.com/pytorch/pytorch

Summary generated with perplexity AI:   
https://www.perplexity.ai/search/you-are-a-teacher-in-grad-leve-urfuUUoZT1.IY8zNfEfHIQ#0


#### Overview of PyTorch
PyTorch is a widely-used open-source machine learning library developed by Meta AI, now part of the Linux Foundation. It is designed for applications such as computer vision and natural language processing (NLP) and is known for its flexibility and ease of use in building deep learning models. PyTorch offers a dynamic computational graph, which allows for more intuitive model building and debugging compared to static graph frameworks such as TensorFlow 1.x.

#### Key Features
* Tensor Computing: PyTorch provides tensor computing similar to NumPy, with strong acceleration via GPUs, making it suitable for high-performance scientific computing.
* Deep Neural Networks: The library supports the creation of deep neural networks through its torch.nn module, which includes various layers and activation functions.
* Automatic Differentiation: PyTorch uses a tape-based automatic differentiation system that simplifies the process of computing gradients, essential for training neural networks.

#### Applications in Bioinformatics
In bioinformatics, PyTorch is increasingly used due to its flexibility and powerful features that support complex data analysis and model development. Here are some specific applications:
##### Genomic Data Analysis
* Sequence Analysis: PyTorch can be used to build models that analyze DNA sequences, predicting gene expression or identifying mutations. The dynamic computational graph allows researchers to experiment with different architectures easily.
* Protein Structure Prediction: Deep learning models in PyTorch can predict protein structures from amino acid sequences, aiding in understanding protein functions and interactions.
##### Imaging in Bioinformatics
* Microscopy Image Analysis: PyTorch's capabilities in computer vision make it suitable for analyzing microscopy images, such as identifying cellular structures or quantifying biological markers.
* Medical Imaging: It is also used in processing medical images like MRI or CT scans to detect anomalies or classify diseases.

#### Integration with Other Tools
PyTorch integrates well with other bioinformatics tools and libraries. For instance, it can be combined with libraries like scikit-learn for preprocessing or post-processing tasks. Additionally, its compatibility with cloud platforms facilitates large-scale data processing and model training.

#### Advantages in Bioinformatics
* Flexibility: The dynamic nature of PyTorch allows bioinformaticians to modify models on-the-fly, which is crucial when dealing with complex biological data.
* Community and Ecosystem: A robust community supports the development of specialized bioinformatics tools within the PyTorch ecosystem, enhancing collaborative research efforts.

In summary, PyTorch's adaptability and comprehensive feature set make it an excellent choice for bioinformatics applications, supporting a wide range of tasks from sequence analysis to medical imaging.

In [None]:
# https://github.com/pytorch/examples
# Example from:
# https://pytorch.org/tutorials/beginner/basics/intro.html

In [None]:
# https://pytorch.org/get-started/locally/

In [None]:
! pip3 install torch torchvision torchaudio

In [None]:
import torch
import numpy as np

In [None]:
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)

In [None]:
np_array = np.array(data)
x_np = torch.from_numpy(np_array)

In [None]:
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

In [None]:
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

In [None]:
# https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")


### Subprocesses

One of the biggest strengths of Python is that it can be used as a *glue* language. <br>
It can 'glue' together a series of programs into a flexible and highly extensible pipeline.

### Why subprocesses
One of the most common, yet complicated, tasks that most programming languages need to do is creating new processes. <br>
This could be as simple as seeing what files are present in the current working directory (`ls`) or as complicated as creating a program workflow that *pipes* output from one program into another program's input. <br/><br/>
Many such tasks are easily taken care of through the use of Python libraries and modules (`import`) that *wrap* the programs into Python code, effectively creating Application Programming Interfaces (API). <br/><br/>
However, there are many use cases that require the user to make calls to the terminal from ***within*** a Python program.

#### Operating System Conundrum

First, we need to address the following issue. As many in this class have found out, Python can be installed on most operating systems, but doing the same thing in one operating system (Unix) may not always yield the same results in another (Windows).<br/><br/>
The very first step to making a program **"OS-agnostic"** is through the use of the `os` module.

In [None]:
import os

https://docs.python.org/3/library/os.html

In [None]:
#dir(os)

In [None]:
for elem in dir(os):
    if "error" in elem:
        print(elem)

In [None]:
# The name of the operating system dependent module imported. 
# The following names have currently been registered: 'posix', 'nt', 'java'
# Portable Operating System Interface -  IEEE standard designed to facilitate application portability
# (Windows) New Technology - a 32-bit operating system that supports preemptive multitasking
# 
os.name

In [None]:
# Returns information identifying the current operating system (available on Unix-like systems only).
# The return value is an object with five attributes:
# - sysname - operating system name
# - nodename - name of machine on network (implementation-defined)
# - release - operating system release
# - version - operating system version
# - machine - hardware identifier

os.uname()
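Because `os.uname()` exists only on Unix-like systems, a portable notebook may prefer the standard-library `platform` module. The following is a minimal sketch under that assumption; the variable name `info` is illustrative and not part of the original lecture.

```python
import os
import platform

# platform.uname() works on every OS that Python supports,
# whereas os.uname() is only defined on Unix-like systems.
info = platform.uname()
print(info.system, info.release, info.machine)

# Guarded call: only use os.uname() when the attribute exists.
if hasattr(os, "uname"):
    print(os.uname().sysname)
else:
    print("os.uname() is not available on this platform")
```

The `hasattr` guard keeps the cell from raising `AttributeError` on Windows.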

In [None]:
import sys

# https://docs.python.org/3/library/sys.html
# This string contains a platform identifier that can be used to append platform-specific components
# to sys.path, for instance.
    
sys.platform

In [None]:
# A list of strings that specifies the search path for modules. 

sys.path

In [None]:
# A mapping object representing the string environment.

os.environ['HOME']

In [None]:
os.environ

In [None]:
#Return the value of the environment variable key if it exists, 
#or default if it doesn’t. key, default and the result are str.

os.getenv("HOME")

In [None]:
os.getenv("PATH")

In [None]:
# Returns the list of directories that will be searched for a named executable,
#similar to a shell, when launching a process. 
# env, when specified, should be an environment variable dictionary to lookup the PATH in. 
# By default, when env is None, environ is used.

os.get_exec_path()
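The search path above is what the standard-library function `shutil.which` consults when locating a program. The sketch below wraps it in a small helper; the helper name `find_tool` and the tool names passed to it are hypothetical and only for illustration.

```python
import shutil

def find_tool(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)

# The result depends on what is installed on the current machine.
print(find_tool("python3"))

# A name that is not on PATH yields None instead of raising an error.
print(find_tool("surely-not-a-real-tool-575"))
```

Checking for `None` before launching an external program gives a clearer error message than letting the launch itself fail.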

The `os` module wraps OS-specific operations into a set of standardized commands. <br>
For instance, the Linux end-of-line (EOL) character is a `\n`, but `\r\n` in Windows. <br>
In Python, we can just use the following:

In [None]:
# EOL - for the current (detected) environment

'''
The string used to separate (or, rather, terminate) lines on the current platform. 
This may be a single character, such as '\n' for POSIX, or multiple characters, 
for example, '\r\n' for Windows. 
Do not use os.linesep as a line terminator when writing files opened in text mode (the default); 
use a single '\n' instead, on all platforms.
'''

os.linesep
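The advice in the docstring above can be checked with a short experiment. This sketch writes `'\n'` in text mode and then inspects the raw bytes; the file name `lines.txt` is an arbitrary choice for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lines.txt")

    # Text mode: write '\n' and let Python translate it to the platform EOL.
    with open(path, "w") as handle:
        handle.write("first\nsecond\n")

    # Binary mode shows the bytes that actually landed on disk.
    with open(path, "rb") as handle:
        raw = handle.read()

    # Reading back in text mode translates the platform EOL to '\n' again.
    with open(path) as handle:
        text = handle.read()

print(repr(raw))
print(repr(text))
```

On POSIX the raw bytes contain `\n`; on Windows they contain `\r\n`. In both cases the text-mode read returns the string that was written.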

As another example, in a Linux environment one would use the following command to list the contents of the current directory:
```
ls -alh 
```

In Windows, the equivalent is as follows:
```
dir
```

Python lets users issue a single command, regardless of the OS:

In [None]:
# List directory contents

os.listdir("demoCM")

However, the biggest issue for creating an OS-agnostic program is ***paths*** <br/>
Windows: `"C:\\Users\\MDS\\Documents"`<br/>
Linux: `/mnt/c/Users/MDS/Documents/`<br/><br/>
Enter Python:

In [None]:
# path joining from pwd
pwd = os.getcwd()
os.path.join(pwd,"test.py")
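The standard-library `pathlib` module offers an object-oriented alternative to `os.path.join`. The sketch below is one possible way to express the same join; the two "pure" path classes are used only to show how each OS flavor would render the example paths above, without touching the file system.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# The / operator joins path components using the current OS's separator.
script = Path.cwd() / "test.py"
print(script)
print(script.name, script.suffix)

# Pure paths render a given flavor on any machine (no file access involved).
windows_style = PureWindowsPath("C:/Users/MDS/Documents") / "test.py"
posix_style = PurePosixPath("/mnt/c/Users/MDS/Documents") / "test.py"
print(windows_style)  # C:\Users\MDS\Documents\test.py
print(posix_style)    # /mnt/c/Users/MDS/Documents/test.py
```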

### `subprocess`

If you search for how to run shell commands from Python without specifying Python 3.x, you will likely get an answer that includes `popen`, `popen2`, or `popen3`. These were once the most common ways to *open* a new *p*rocess. In Python 3.5 and later, the recommended interface is the `run` function, available through the `subprocess` module.

In [None]:
# Import and alias (note: this rebinds `sp`, which referred to scipy earlier in this notebook)
import subprocess as sp

#### `check_output`

In [None]:
# check_output returns a bytestring by default, so I set encoding to convert it to strings.
# [command, command line arguments]
# change from bytes to string using encoding

sp.check_output(["echo","test"],encoding='utf_8')

In [None]:
sp.check_output([os.path.join(pwd,"test.py"),"[1,2,3]"],encoding='utf_8')

The examples above are trivial: they demonstrate just capturing the *output* (stdout) of a program.

While the `check_output` function is still in the `subprocess` module, any call to it can easily be converted into a more specific and/or flexible `run` call.

#### `run`

In [None]:
sub = sp.run(
    [
        'echo',             # The command we want to run
        'test'              # Arguments for the command
    ],
    encoding='utf_8',       # Decode the output bytes to str
    stdout=sp.PIPE,         # Where to send the output
    check=True              # Whether to raise an error if the process fails
)  
sub

In [None]:
dir(sub)

In [None]:
print(sub.stdout)

The main utility of `check_output` was to capture the output (stdout) of a program. <br>
By using the `stdout=subprocess.PIPE` argument, the output can easily be captured, along with its return code. <br>
A return code signifies the program's exit status: 0 for success, any other value for failure.
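The same capture pattern can be written without relying on a Unix `echo` binary. The sketch below assumes Python 3.7 or later (for `capture_output` and `text`) and uses `sys.executable` as the child program purely so the example runs on any OS.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "print('test')"],  # run a tiny Python program as the child
    capture_output=True,                      # shorthand for stdout=PIPE and stderr=PIPE
    text=True,                                # decode bytes to str
)

print(result.returncode)
print(result.stdout.strip())
```

Both the exit status and the captured text are available on the returned `CompletedProcess` object.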

In [None]:
sub.returncode

With our `run` code above, our program ran to completion, exiting with status 0. The next example shows a different status.

In [None]:
sp.run(
        'exit 1',      # Command & arguments
        shell = True   # Run from the shell
        )


However, if the `check=True` argument is used, it will raise a `CalledProcessError` if your program exits with anything different than 0. This is helpful for detecting a pipeline failure, and exiting or correcting before attempting to continue computation.

In [None]:
sp.run(
        'exit 1',      # Command & arguments
        shell = True,  # Run from the shell
        check = True   # Check exit status
    )

In [None]:
sub = sp.run(
        'exit 1',      # Command & arguments
        shell = True,  # Run from the shell
        # check = True   # Check exit status
    )
if sub.returncode != 0:
    print(f"Exit code {sub.returncode}. Expected 0 when there is no error.")
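A third option is to keep `check=True` and handle the exception, which lets a pipeline react to a failure instead of stopping with a traceback. This is a minimal sketch; the child command and the variable name `failed_code` are illustrative choices, and `sys.executable` is used only for portability.

```python
import subprocess
import sys

try:
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(3)"],  # child exits with status 3
        check=True,                                         # raise on non-zero exit
    )
except subprocess.CalledProcessError as err:
    # The exception carries the command and its exit status.
    failed_code = err.returncode
    print(f"Command failed with exit code {failed_code}")
```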

#### Syntax

Syntax when using `run`:<br/>
1. A list of arguments: `subprocess.run(['echo', 'test', ...], ...)` 
2. A string and `shell`: `subprocess.run('exit 1', shell = True, ...)`

The preferred way of using `run` is the first way. <br>
This preference is mainly due to security purposes (to prevent shell injection attacks). <br>
It also allows the module to take care of any required escaping and quoting of arguments for a pseudo-OS-agnostic approach. 
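To illustrate the security point, the sketch below passes a string containing shell metacharacters as a list element. The value of `untrusted` is a made-up example of hostile input. Because no shell is involved, the child receives it as one literal argument and nothing after the `;` is executed.

```python
import shlex
import subprocess
import sys

untrusted = "sample.txt; echo injected"  # hypothetical user-supplied value

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    capture_output=True,
    text=True,
    check=True,
)
# The child sees the whole string as a single argument.
print(result.stdout.strip())

# If a shell string is unavoidable, shlex.quote escapes the value for POSIX shells.
print(shlex.quote(untrusted))
```

Note that `shlex.quote` targets POSIX shells only, so the list form remains the safer default.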

There are some guidelines though:
1. A sequence (list) of arguments is generally preferred
2. A str is appropriate if the user is just calling a program with no arguments
3. The user should use a str to pass arguments if `shell` is `True`

Your next question should be, "What is `shell`?"

`shell` is just your terminal/command prompt. This is the environment in which you call `ls` or `dir`. It is also where users can define variables. More importantly, this is where your *environment variables*, such as `PATH`, are set.<br/><br/>
By using `shell = True`, the user can now use shell-based environment variable expansion from within a Python program.

In [None]:
sp.run(
        'echo $PATH',            # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )      # Look at the output


In [None]:
p1 = sp.run(
        'sleep 5',               # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p1)
p2 = sp.run(
        'echo done',             # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p2)

For the most part, you shouldn't need to use `shell` simply because Python has modules in the standard library that can do most of the shell commands. For example, `mkdir` can be done with `os.mkdir()`, and `$PATH` can be retrieved using `os.getenv("PATH")` or `os.get_exec_path()` as shown above.
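As a rough sketch of that idea, the following performs a few common shell tasks with the standard library alone. The directory names `results` and `run1` are arbitrary placeholders, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Equivalent of `mkdir -p results/run1`
    new_dir = os.path.join(tmp_dir, "results", "run1")
    os.makedirs(new_dir, exist_ok=True)
    created = os.path.isdir(new_dir)

    # Equivalent of `ls` / `dir`
    listing = os.listdir(tmp_dir)

# Equivalent of `echo $PATH`, split into individual directories
search_dirs = os.getenv("PATH", "").split(os.pathsep)

print(created, listing)
print(len(search_dirs))
```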

#### Blocking vs Non-blocking

The last topic of this lecture is "blocking". This is computer science lingo/jargon for whether or not a program ***waits*** until something is complete before moving on. Think of this like a really bad website that takes forever to load because it is waiting until it has rendered all its images first, versus the website that sets the formatting and text while it works on the images.

1. `subprocess.run()` is blocking (it waits until the process is complete)
2. `subprocess.Popen()` is non-blocking (it will run the command, then move on)
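
The difference can be measured directly. The sketch below is an illustration, not part of the original lecture code: it uses `sys.executable` to start a child Python process that sleeps for one second, which avoids relying on an OS-specific `sleep` command.

```python
import subprocess as sp
import sys
import time

# A child process that sleeps for one second
cmd = [sys.executable, "-c", "import time; time.sleep(1)"]

# run() blocks: the call returns only after the child has finished
start = time.perf_counter()
sp.run(cmd)
run_elapsed = time.perf_counter() - start

# Popen() does not block: the call returns while the child is still running
start = time.perf_counter()
proc = sp.Popen(cmd)
popen_elapsed = time.perf_counter() - start
still_running = proc.poll() is None   # poll() returns None while the child runs

proc.wait()                           # Wait explicitly before moving on
print(f"run: {run_elapsed:.2f}s, Popen: {popen_elapsed:.2f}s")
```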

***Most*** use cases can be handled through the use of `run()`.<br> 
`run()` is just a *wrapped* version of `Popen()` that simplifies use. <br>
However, `Popen()` allows the user a more flexible control of the subprocess call. <br>
`Popen()` can be used in a similar way to `run()` (with more optional parameters).

An example use case for `Popen()` is if the user has some intermediate data that needs to get processed, but the output of that data doesn't necessarily affect the rest of the pipeline.
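
That use case might look like the following sketch. The side task here is a stand-in (a child Python process that prints one line), and the "main pipeline" is just a sum; both are hypothetical placeholders.

```python
import subprocess as sp
import sys

# Start the side task in the background (hypothetical stand-in command)
side_task = sp.Popen(
    [sys.executable, "-c", "print('side task finished')"],
    stdout=sp.PIPE,          # Capture the output
    encoding="utf_8",        # Convert from bytes to string
)

# The main pipeline keeps going without waiting for the side task
main_result = sum(range(1_000_000))

# Collect the side task's output once the main work is done;
# communicate() waits for the process to exit and reads stdout
side_output, _ = side_task.communicate()
print(main_result, side_output)
```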

#### `Popen`

In [None]:
p1 = sp.Popen(
        'sleep 5',               # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p1)
p2 = sp.Popen(
        'echo done',             # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p2)
print("processes ran")

print(p1.stdout.read())
print(p2.stdout.read())
print("processes completed")



In [None]:
# Use context manager to handle process while it is running,
# and gracefully close it
with sp.Popen(
    [
        'echo',         # Command
        'here we are'       # Command line arguments
    ],
    encoding='utf_8', # Convert from byte to string
    stdout=sp.PIPE    # Where to send it
) as proc:            # Enclose and alias the context manager
    print(
        proc.stdout.read() # Look at the output
    )

In [None]:
for elem in dir(proc):
    if not elem.startswith('_'):
        print(elem)

#### ***NOTE***: From here on out, there might be different commands used for **Linux** / **MacOS** or **Windows**

Add the following text to a new file `test_pipe.txt` 
```
testing
a
subprocess
pipe
```

In [None]:
# Another way to add the text to the file
# test_pipe.txt - a file used to demonstrate piping cat into sort
!echo testing > test_pipe.txt
!echo a >> test_pipe.txt
!echo subprocess >> test_pipe.txt
!echo pipe >> test_pipe.txt


In [None]:
# start the first process - cat - reading the file content

# macOS / Linux
p1 = sp.Popen(['cat','test_pipe.txt'], stdout=sp.PIPE, encoding='utf_8')

# Windows (type is a cmd.exe built-in, so it needs the shell)
# p1 = sp.Popen('type test_pipe.txt', shell=True, stdout=sp.PIPE, encoding='utf_8')

print(p1.stdout.read())

In [None]:
# add the second process and connect the pipe: 
# for p2 we use stdin=p1.stdout

# macOS / Linux
p1 = sp.Popen(['cat','test_pipe.txt'], stdout=sp.PIPE, encoding='utf_8')

# Windows (type is a cmd.exe built-in, so it needs the shell)
# p1 = sp.Popen('type test_pipe.txt', shell=True, stdout=sp.PIPE, encoding='utf_8')


p2 = sp.Popen(['sort'], stdin=p1.stdout, stdout=sp.PIPE, encoding='utf_8')
p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits
output = p2.communicate()[0]
print(output)
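
Because `cat`, `type`, and `sort` differ between operating systems, here is a platform-independent sketch of the same two-stage pipe. It is an illustration under one assumption: both stages are small child Python processes started through `sys.executable`, standing in for `cat` and `sort`.

```python
import subprocess as sp
import sys

# Stage 1 stand-in for cat: prints the four lines of test_pipe.txt
producer = [sys.executable, "-c", "print('testing\\na\\nsubprocess\\npipe')"]
# Stage 2 stand-in for sort: sorts the lines it receives on stdin
sorter = [sys.executable, "-c",
          "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

p1 = sp.Popen(producer, stdout=sp.PIPE)
p2 = sp.Popen(sorter, stdin=p1.stdout, stdout=sp.PIPE, encoding="utf_8")
p1.stdout.close()              # Allow p1 to receive a SIGPIPE if p2 exits
output, _ = p2.communicate()   # Read everything p2 wrote, then wait for it
p1.wait()                      # Reap the first process as well
print(output)
```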


`Popen` can create background processes: like a command sent to the background in a shell, the process runs without blocking the Python program. <br>
`Popen` has a lot more functionality than `run`.

In [None]:
sub_popen = sp.Popen(
    [
        'echo',          # Command
        'test',        # Command line arguments
    ],
    encoding='utf_8',  # Convert from byte to string
    stdout=sp.PIPE     # Where to send it
)
for j in dir(sub_popen):
    if not j.startswith('_'):
        print(j)


In [None]:
sub_popen.kill()       # Kill the process (no effect if it has already exited)
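
The `echo` process above finishes almost immediately, so `kill()` has little to do there. A sketch with a longer-running child makes the effect visible. The 30-second sleep is an arbitrary, hypothetical workload.

```python
import subprocess as sp
import sys

# A child process that would run for 30 seconds if left alone
proc = sp.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

running_before = proc.poll() is None   # None means it is still running
proc.kill()                            # Force the process to stop
proc.wait()                            # Reap it so returncode is set
finished = proc.poll() is not None

# The exact return code of a killed process is platform-dependent
print(running_before, finished, proc.returncode)
```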

Examples of creating child processes:<br>
https://pymotw.com/3/subprocess/

A collection of `Popen` examples: <br>
https://www.programcreek.com/python/example/50/subprocess.Popen

#### Exercise 
Write a bash script that takes a file as an argument and returns the lines that contain the letter p.    
Call that script from Python.



#### Exercise -  only if you have R installed
`test.R` - an R script that takes the file `test_R.txt` as an argument and returns the sum of the matrix from the file<br>
`Rscript` - the executable/interpreter used to run the R script

```
rN	val1	val2
r1	1	2
r2	3	4
```