<html>
    <summary></summary>
    <p float="left">
         <div> <p></p> </div>
         <div style="font-size: 20px; width: 800px;">
              <h1>
               <left>Intro to Python: importing packages, classes/modules, os navigation, reading files </left>
              </h1>
              <p><left>============================================================================</left> </p>              
             <pre>Course: BIOM 421, Spring 2024
Instructor: Dr. Brian Munsky
Contact Info: munsky@colostate.edu
Authors: Will Raymond, Dr. Luis Aguilera, Dr. Brian Munsky
</pre>
         </div>
    </p>

</html>



<details>
  <summary>Copyright info</summary>

```
Copyright 2023 Brian Munsky

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```
</details>

Here we will learn about one of the nicest things about Python: the modular nature of object-oriented programming.

Importing modules lets you load a particular subset of code into your instance of Python for use as needed.

For example, we can import a single section or function of a package instead of loading the entire package. Which modules are available depends on your current path / environment variables and your current installation of Python.

You can import the package ```sys``` (system) and take a glance at which paths (folders/directories) Python searches when importing.

---

```Reading: Kinder, Nelson Section 4 and Appendix B```


In [None]:
import os, sys
sys.path  #current paths python has
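
If you want to know where one specific import would come from, the standard library's `importlib.util.find_spec` can tell you without actually importing anything. The snippet below is only a minimal sketch: the helper name `where_is` and the nonexistent module name are made up for illustration.

```python
import importlib.util
import sys

def where_is(module_name):
    """Return where a top-level module would be loaded from, or None if it cannot be found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None        # not on sys.path and not built in
    return spec.origin     # a file path, or 'built-in' / 'frozen' for compiled-in modules

print(where_is('json'))                    # typically a file inside one of the sys.path folders
print(where_is('not_a_real_module_xyz'))   # hypothetical name, so this prints None
print(len(sys.path), 'folders are searched on import')
```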

Let's take a glance at what we have installed in your local environment (it's a lot). To do this we are going to call ```pip```. ```pip``` is the main package manager for Python and has been bundled with the standard installers since Python 3.4. ```pip``` connects to [PyPI](https://pypi.org/), the Python Package Index, where people are free to release code and modules that they have developed.

**Tip - An exclamation point at the start of a line tells the Jupyter Notebook to run that command in the system shell, as if it were typed into a terminal or command prompt. The ```%pip``` magic used below does the same job, but makes sure ```pip``` acts on the environment that the notebook kernel is running in.** By calling ```!pip list``` or ```%pip list``` we are telling Jupyter to list every installed package.

In [None]:
%pip list
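
The same information is also available from inside Python, which is handy when you want to work with the list rather than just read it. This is a rough sketch using the standard library's `importlib.metadata` (Python 3.8 or newer); it is an alternative to `pip list`, not what `pip` itself runs.

```python
from importlib import metadata

# Collect (name, version) pairs for every installed distribution, much like `pip list`
installed = sorted(
    (str(dist.metadata['Name']), str(dist.version))
    for dist in metadata.distributions()
)

for name, version in installed[:5]:   # show only the first five entries
    print(name, version)
print('total packages found:', len(installed))
```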

## 1) Importing, Using, and Installing packages

To get information about any of these modules, we need to import a package, like numpy, and then use ```module?``` or help(module) to get additional information about this package. What is displayed is from a specifically formatted comment called a docstring (documentation string) that comes along with the package.

In [None]:
import numpy
# If we want to learn more about a module, we can type its name followed by a question mark: name?
numpy?
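
To see where that help text comes from, here is a small sketch with a docstring of our own. The function `celsius_to_kelvin` is invented for this example; the point is only that the first string inside a function, class, or module becomes its documentation.

```python
import inspect

def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temp_c + 273.15

# The docstring is stored on the object itself
print(celsius_to_kelvin.__doc__)

# inspect.getdoc returns the same text with any indentation cleaned up
print(inspect.getdoc(celsius_to_kelvin))
```

In a notebook, `celsius_to_kelvin?` or `help(celsius_to_kelvin)` would display this same string.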

### 1.A) importing different packages or sub-modules

```import``` in Python is the command to load a particular ```package``` (a large collection of code, usually a folder holding several submodules) or ```module``` (a single file of code, which may define functions and classes).

When you import you have several options:

```python
# import the entire package all at once
import numpy

# import the package and use a (simpler) variable name, "np" to refer to it
import numpy as np

# import a specific submodule or function from the package
from numpy import linalg as la

# import only one function from the linear algebra submodule contained in numpy
from numpy.linalg import svd as svd
```
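
A quick way to convince yourself that these forms all point at the same code is to compare the resulting names. The sketch below uses the standard library's `math` and `os` instead of numpy so that it runs anywhere; the file name passed to `join` is made up.

```python
import math                  # whole module, accessed as math.<name>
import math as m             # same module under a shorter alias
from os import path as p     # one submodule, renamed
from math import sqrt        # a single function

print(m is math)             # True: an alias is just another name for the same module
print(sqrt is math.sqrt)     # True: the function object is shared as well
print(p.join('data', 'run1.csv'))
print(sqrt(16.0))            # 4.0
```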

In [None]:
import numpy as np
# if we want to know the version:
np.__version__

In [None]:
from numpy import linalg as la #import only the linear algebra file
# or if we want to know where a specific file is:
la.__file__

In [None]:
# or if we want just the docstring:
la.__doc__

In [None]:
from numpy.linalg import svd as svd
svd?

#### Importing local packages
You can also write your own packages and then import them, but their folders need to be laid out in the right format. To learn more, see the modules page in Python's own documentation: https://docs.python.org/3/tutorial/modules.html
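
As a minimal sketch of the idea, the snippet below writes a one-function module into a scratch folder, adds that folder to `sys.path`, and imports it. The module name `my_local_tools` and its `double` function are hypothetical; a real project would normally keep the `.py` file next to the notebook instead of generating it.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway folder holding a one-function module
folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'my_local_tools.py'), 'w') as f:
    f.write('def double(x):\n')
    f.write('    """Return twice the input."""\n')
    f.write('    return 2 * x\n')

# Python only finds modules in folders listed on sys.path
sys.path.insert(0, folder)
importlib.invalidate_caches()   # make sure the newly written file is noticed

import my_local_tools
print(my_local_tools.double(21))   # 42
print(my_local_tools.__file__)
```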

### 1.B) Using packages

Once we have imported a package, we can call any of its submodules or functions with ```module.function()```.

In [None]:
arr = np.ones([10,100])
la.svd(arr)
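
Many functions return more than one object, and it is usually clearer to unpack them into named variables. The sketch below repeats the call above with the optional `full_matrices=False` argument and checks that the three factors multiply back to the original matrix; the variable names are our own choice.

```python
import numpy as np
from numpy import linalg as la

arr = np.ones([10, 100])

# svd returns three arrays; full_matrices=False gives the compact shapes
U, s, Vt = la.svd(arr, full_matrices=False)
print(U.shape, s.shape, Vt.shape)   # (10, 10) (10,) (10, 100)

# Multiplying the factors back together recovers the original matrix
rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(rebuilt, arr))    # True
```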

### 1.C) Installing new packages

New packages are usually hosted on [PyPI](https://pypi.org/), the Python Package Index. Searching for specific modules within PyPI is difficult, so I recommend finding packages by searching academic papers or Google Scholar for your specific purpose and then visiting the authors' GitHub repositories!

From there you can install a package in Colab by using ```!pip``` according to the package's instructions.

In [None]:
# let's install Biopython
# https://biopython.org/

!pip install biopython
import Bio
print('')
print('The version of BioPython installed:')
print(Bio.__version__)

### 1.D) os - "operating system"

```os``` is the main package that lets you navigate and change directories, similar to how you move around in a command prompt or terminal. ```os``` also performs miscellaneous operating system functions such as reading environment variables or checking process IDs.

https://docs.python.org/3/library/os.html


Another popular file path manager is ```pathlib```.

https://docs.python.org/3/library/pathlib.html
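
For comparison, here is a short sketch of the same kind of navigation with ```pathlib```. It works in a scratch folder created by `tempfile` so that nothing real is touched; the folder and file names (`results`, `run1.txt`) are invented.

```python
import tempfile
from pathlib import Path

print(Path.cwd())                          # current working directory, like os.getcwd()

# Build paths with the / operator instead of gluing strings together
workdir = Path(tempfile.mkdtemp())         # scratch folder
data_file = workdir / 'results' / 'run1.txt'
data_file.parent.mkdir(parents=True, exist_ok=True)
data_file.write_text('hello\n')

print(data_file.name)                      # run1.txt
print(data_file.suffix)                    # .txt
print(data_file.exists())                  # True
print(sorted(item.name for item in workdir.iterdir()))   # ['results']
```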

In [None]:
import os
print(os.getcwd())  #what is our current working directory (cwd)?

Once again, for us to import ```os```, the ```os``` module must exist in our system path so Python can "see" it.



In [None]:
print('We are using the following file as the package os:')
print(os.__file__) # Print the actual file that we imported the os packages from
print('')
print('Here is our current sys.path:')
print(sys.path)
# Note how this file we just imported from is inside our path!

Let's take a look at our current directory and its files using the ```os``` module:

In [None]:
#list all files in a directory, '.' denotes our current working directory
os.listdir('.')

```os``` understands Unix-like relative paths:

``` .``` a single period denotes my current working directory

``` ..``` a double period denotes the parent folder one level above my current working directory

``` ~``` a tilde represents the current user's home directory (expand it with ```os.path.expanduser```)

In [None]:
print('Files and folders in my current directory')
print(os.listdir('.'))
print('')
print('Files and folders one directory above my current directory')
print(os.listdir('..'))

In [None]:
print(os.path.expanduser('~'))
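
Relative paths like these can be turned into full paths, and new paths can be assembled piece by piece, with the helpers in ```os.path```. A small sketch follows; the `data/notes.txt` file is hypothetical and is never created.

```python
import os

cwd = os.getcwd()

# abspath turns a relative path into a full one, resolving '.' and '..'
print(os.path.abspath('.'))      # same folder as cwd
print(os.path.abspath('..'))     # the parent of cwd

# join builds a path with the right separator for the current operating system
notes = os.path.join(cwd, 'data', 'notes.txt')
print(notes)
print(os.path.basename(notes))   # notes.txt
print(os.path.exists(notes))     # False unless such a file happens to exist
```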

## 2) Reading and Writing files

Opening a file is done with ```open```, which opens a buffer to a particular file on your hard drive. It's best practice to use ``` with open() as object:``` so the buffer closes on its own once the code inside the ``` with ``` block has finished. You will typically see people use ```f``` to denote their file buffer object.

 ```python
with open('example.txt', 'r') as f:
    print(f.readlines())  # call readlines() to get a list of the lines in the file
 ```

 ```open``` takes an additional flag to denote whether you are reading, ```'r'```, writing, ```'w'```, or appending, ```'a'```, to a file.

Additionally, if you are not reading a text file, you may have to read / write a file in binary mode, ```'b'```, by appending b to your flag (```'wb'``` would mean write in binary mode).
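
The difference between the flags is easiest to see by using several of them on the same file. This is a minimal sketch that writes to a scratch file made with `tempfile`; the file name `modes_demo.txt` is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')   # scratch file

with open(path, 'w') as f:      # 'w' creates the file, or empties it if it exists
    f.write('first\n')

with open(path, 'a') as f:      # 'a' keeps the contents and adds to the end
    f.write('second\n')

with open(path, 'r') as f:      # 'r' returns text (str)
    both_lines = f.read()
print(both_lines)

with open(path, 'rb') as f:     # 'rb' returns the raw bytes instead
    print(f.read())

with open(path, 'w') as f:      # opening with 'w' again discards both lines
    f.write('third\n')

with open(path, 'r') as f:
    final_text = f.read()
print(final_text)
```

Note that ```'w'``` silently overwrites an existing file, so double check the file name before writing.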

In [None]:
text_lines = ['hi im some lines to write to a file \n', '1\n','2\n','3\n']
# \n is a newline character, i.e. pressing enter

# I'm writing a regular text file, 'w'
with open('test.txt','w') as f:
  f.writelines(text_lines)

# I'm opening a file to read it, 'r'
with open('test.txt','r') as f:
  read_in_lines = f.readlines()

read_in_lines


Try opening the file we just made in your Finder or file explorer. It should have been saved in the current working directory, which is normally the same folder as this notebook file.

We can also write in binary mode and see what that looks like:

In [None]:
import numpy as np

# let's write some arbitrary array to a file as binary
print('writing array to bin file:')
# fix the dtype so every integer takes exactly 8 bytes, which the read-back below assumes
array = np.array([5,1,4,2,3,6], dtype=np.int64)
print(array)
with open('test.bin','wb') as f:
  for i in array:
    f.write(i.tobytes())  # raw bytes of one integer

# let's use the shell to print out what is stored in the binary file
print('')
print('stored in file:')
!head test.bin
print('')


# let's read that back into Python, one byte at a time
with open('test.bin','rb') as f:
  bitarray = []
  byte = f.read(1)
  while byte != b"":  # an empty bytes object marks the end of the file
    bitarray.append(byte)
    byte = f.read(1)

# converting back from a byte array to a usable numpy array
print('')
read_array = []
for i in range(int(len(bitarray)/8) ):
  read_array.append(b''.join(bitarray[i*8:8 + i*8]))
read_array = [int.from_bytes(x,byteorder='little') for x in read_array]

print('array after read back in:')
print(read_array)
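
Reading one byte at a time shows what is happening, but it is rarely how you would do this in practice. One alternative, sketched here with the standard library's `struct` module rather than the approach used in the cell above, packs and unpacks all six integers in a single call. The format string and the scratch file name are choices made for this example.

```python
import os
import struct
import tempfile

values = [5, 1, 4, 2, 3, 6]
path = os.path.join(tempfile.mkdtemp(), 'struct_demo.bin')   # scratch file

# '<' = little-endian, 'q' = signed 8-byte integer, repeated once per value
fmt = '<%dq' % len(values)

with open(path, 'wb') as f:
    f.write(struct.pack(fmt, *values))

with open(path, 'rb') as f:
    raw = f.read()

print(len(raw))                            # 48 bytes: 6 integers x 8 bytes each
read_back = list(struct.unpack(fmt, raw))
print(read_back)                           # [5, 1, 4, 2, 3, 6]
```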

We will see more about numpy later, but for now you can also save a numpy array to its own file format using ```np.save(file, array)```

In [None]:
np.save('test.npy', array)
# print out this new file
!head test.npy

## 3) Pandas file management

```pandas``` is a library that builds on NumPy by adding row and column labels. The primary data format it uses is called a "DataFrame." Pandas can read a plethora of file formats into the DataFrame format for easy object-oriented manipulation within Python. When you are done with the manipulation, you can write the file back out or return the underlying NumPy array. We won't go into great detail with ```pandas``` in this course; however, it is a very powerful tool for manipulating data, just as ```numpy``` is for numerical analyses.

https://pandas.pydata.org/docs/getting_started/index.html#getting-started


### 3.A) Opening a file with Pandas

Let's open some example data into Jupyter.

In [None]:
import pandas as pd

#Let's read some pre-generated data that is commonly used in machine learning practice:
housing_dataframe = pd.read_csv('./california_housing_test.csv')

In [None]:
housing_dataframe # let's display this dataframe (long tables are shortened to the first and last rows)

Getting columns and Logicals

The (dis)advantage of pandas is that the slicing that you would do with numpy can be done with strings instead of indexes.

Pros:

* Comprehension

It's clear which column you are using at a time since you can use text label headers.

Cons:
* Readability

When you string together long logical expressions to slice a dataframe, the code can quickly become hard to read.


In [None]:
# let's get only the rows with a housing median age > 50
housing_dataframe[housing_dataframe['housing_median_age'  ] > 50]

In [None]:
# getting just the housing median age
all_median_ages = housing_dataframe['housing_median_age']

# all rows (every column) whose housing median age is above 25
ages_above_25 = housing_dataframe[housing_dataframe['housing_median_age'] > 25]

print(len(all_median_ages))
print(len(ages_above_25))
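
One way to keep longer logical slices readable is to give each condition its own name before combining them. The sketch below uses a tiny invented table: the column names mimic the housing data, but the numbers are made up.

```python
import pandas as pd

# A tiny made-up table standing in for the housing data
df = pd.DataFrame({'housing_median_age': [10, 30, 52, 45, 52],
                   'total_rooms':        [800, 1500, 600, 2500, 3000]})

# Name each condition instead of nesting everything in one long expression
is_old = df['housing_median_age'] > 50
is_large = df['total_rooms'] > 1000

# Combine masks with & (and), | (or), ~ (not); plain `and` / `or` do not work on columns
old_and_large = df[is_old & is_large]
print(old_and_large)

# .loc takes the row mask and the column label together
print(df.loc[is_old, 'total_rooms'].tolist())   # [600, 3000]
```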


In [None]:
# Let's plot some of that data with a module called matplotlib
import matplotlib.pyplot as plt

# pull the values out of the pandas column (they are stored as a numpy array)
housing_data_array = housing_dataframe['housing_median_age'].values

plt.hist(housing_data_array,50,density=True)
plt.xlabel('age')
plt.ylabel('probability density')

| Attribute    | Description |
| ----------- | ------------ |
| .values |  get a numpy array of all values within the dataframe    |
| .head() |  return the top 5 rows    |
| .tail() |  return the bottom 5 rows    |
| .iloc[] |  lets you use integer indexes as if the dataframe were a numpy array    |
| .columns |  return all the headers of the columns in the dataframe |



In [None]:
housing_dataframe.columns
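
Here is a short sketch that exercises each entry of the table on a small invented dataframe. Notice which entries are called with parentheses, which use square brackets, and which are plain attributes.

```python
import pandas as pd

# Small invented table used only to illustrate the attributes listed above
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [10, 20, 30, 40, 50, 60]})

print(list(df.columns))            # ['a', 'b']
print(df.head(2))                  # first two rows; the default is five
print(df.tail(2))                  # last two rows
print(df.iloc[0, 1])               # row 0, column 1 by position: 10
print(df.iloc[1:3, 0].tolist())    # rows 1 and 2 of column 0: [2, 3]
print(df.values.shape)             # (6, 2): the underlying NumPy array
```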

### 3.B) Writing a csv file with Pandas

Let's write some simple CSVs using pandas. The ```DataFrame``` constructor accepts several input formats, such as dictionaries, lists, and NumPy arrays; row labels (indexes) are optional.

In [None]:
# The easiest way to build a new dataframe is to generate a dictionary of your data,
# keyed by column header
new_dataframe_dict = {'column 1': [1,2,3,4],
                      'column 2': ['1','2','3','4'],
                      'last col': ['a','b','c','d']}

# optional row labels, we have 4 entries for each column, so we need 4 labels
row_labels = ['row 1', 'row 2', 'row 3', 'row 4']

# Generate our new pandas dataframe
new_df = pd.DataFrame(new_dataframe_dict, index=row_labels)
print('in memory:')
print(new_df.head())

# We can write this data frame to a .csv for ease of access in other software
print('')
new_df.to_csv('test_csv.csv',  index_label='blank')
print('on disk:')
!head test_csv.csv
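
It is worth reading a CSV back in to check what actually survived the trip. The sketch below rebuilds the same table, writes it to a scratch file, and reloads it; the file name `round_trip.csv` is arbitrary.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'column 1': [1, 2, 3, 4],
                   'column 2': ['1', '2', '3', '4'],
                   'last col': ['a', 'b', 'c', 'd']},
                  index=['row 1', 'row 2', 'row 3', 'row 4'])

path = os.path.join(tempfile.mkdtemp(), 'round_trip.csv')   # scratch file
df.to_csv(path, index_label='blank')

# index_col tells pandas which column holds the row labels
df_back = pd.read_csv(path, index_col='blank')

print(df_back)
print(df['column 2'].tolist())        # strings: ['1', '2', '3', '4']
print(df_back['column 2'].tolist())   # integers: [1, 2, 3, 4]
```

A CSV file stores only text, so pandas guesses each column's type when reading. Strings that look like numbers come back as integers unless you ask otherwise, for example with `dtype={'column 2': str}`.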

### 3.C) Creating a dataframe from a numpy array

This is as simple as passing the array and the correct number of column headers into ```pd.DataFrame```.

In [None]:
import numpy as np
random_array = np.random.randint(0,1000,size=(5,100)).T
headers = ['col'+str(x) for x in range(0,5)]

new_df = pd.DataFrame(random_array,columns=headers)

In [None]:
new_df.head()

## 4) Saving numpy arrays with npy files

 ```numpy``` also has a built-in file format for you to save n-dimensional arrays and read them back in as needed.

In [None]:
import numpy as np

my_array = np.ones((3,100,100))
for i in range(3):
  my_array[i] = my_array[i]*i

print(my_array)

print('')
print('file on the disk:')
np.save('test_save.npy', my_array)

!head test_save.npy

loaded_array = np.load('test_save.npy')

print('')
print('')
print('loaded array:')
print(loaded_array)

A note about memory sizes and datatypes:

NumPy arrays can get large fast when you are working in multiple dimensions. You can save disk space by storing an array in the smallest datatype that can hold your values, so keep in mind the minimum and maximum integers for each data type!

| dtype | min | max | hex (min, max) |
| --- | --- | --- | --- |
| int8 | -128 | 127 | (-0x80, 0x7F) |
| uint8 | 0 | 255 | (0x00, 0xFF) |
| int16 | -32768 | 32767 | (-0x8000, 0x7FFF) |
| uint16 | 0 | 65535 | (0x0000, 0xFFFF) |
| int32 | -2147483648 | 2147483647 | (-0x80000000, 0x7FFFFFFF) |
| uint32 | 0 | 4294967295 | (0x00000000, 0xFFFFFFFF) |
| int64 | -9223372036854775808 | 9223372036854775807 | (-0x8000000000000000, 0x7FFFFFFFFFFFFFFF) |
| uint64 | 0 | 18446744073709551615 | (0x0000000000000000, 0xFFFFFFFFFFFFFFFF) |

----

Note that these limits are for **integer** values. Representing floating point values is more complicated and beyond the scope of this course, but in nearly all cases they use 32 or 64 bits of memory, just like the larger integer data types. If you are interested, check out the Wikipedia page on double-precision floating point:
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
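
You do not have to memorize the integer limits, because ```numpy``` can report them. The sketch below looks up a few limits with `np.iinfo`, compares memory use for the same values stored as two different types, and shows what happens when a value does not fit. The array contents are arbitrary.

```python
import numpy as np

# np.iinfo reports the limits of any integer dtype
for dtype in (np.int8, np.uint8, np.int16, np.int32):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# The same 1000 values take 8x less memory as int8 than as int64
counts = np.arange(1000) % 100            # all values fit in 0..99
as_int64 = counts.astype(np.int64)
as_int8 = counts.astype(np.int8)
print(as_int64.nbytes, as_int8.nbytes)    # 8000 1000

# Values outside the range wrap around silently when cast
too_big = np.array([300]).astype(np.uint8)
print(too_big)                            # [44], because 300 - 256 = 44
```

Because the wrap-around raises no error, it is worth checking an array's minimum and maximum against these limits before converting to a smaller type.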





In [None]:
dtypes = {'int16': np.int16,
          'int32': np.int32,
          'float32':np.float32,
          'float64':np.float64,
          'bool':bool}
data_type = 'int16' #@param ["int16","int32","float32", 'float64','bool']
dtype = dtypes[data_type]
array = np.ones([300,300,300]).astype(dtype)
totalbytes = array.itemsize*array.size

print('total MB of this array with type %s: %1.4f MB'% (str(dtype),totalbytes/1e6))

## Questions

In [None]:
## Print the current working directory using the module os

In [None]:
## Open a file and write a short message in text, save the file as 'example_file.txt'

In [None]:
## List out the default packages installed in your environment,
## pick one, and view its docstring / documentation