### Data Manipulation and Analysis with Python
#### Robert Palmere, 2021
#### Email: rdp135@chem.rutgers.edu
--------------------------------------------------------------------
#### Topics:
1. Using Standard Python
* Retrieving data from external sources
* Altering data (e.g. normalization)
* Data output

####
2. Using NumPy library
* Retrieving data from external sources
* Altering data (e.g. normalization)
* Some Convenience functions of NumPy
* Data output

####
3. Brief Use of Pandas library
* Retrieving data from external sources
* Organizing and Displaying data

####
4. Basic Applications of SymPy

#### Standard Python Data Retrieval (touched on in Session 1):

##### First we will import data from a data set included with the "sklearn" package of Python.
##### This is so that we have a data set to work with throughout the session.
##### These data include attributes of benign and malignant breast cancer cell nuclei of patients in Wisconsin. 
More information on this data set can be found [Here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

In [None]:
def generate_data():
    from sklearn.datasets import load_breast_cancer # load in data from sklearn
    X, y = load_breast_cancer(return_X_y=True)
    avg_radius = X[:, 0]
    avg_num_concaves = X[:, 7]
    lines = list(zip(avg_radius.astype(str), avg_num_concaves.astype(str)))
    lines = [' '.join(x) for x in lines]
    with open('Data.txt', 'w') as fp:
        fp.write('\n'.join(lines)) # the joined lines form one string, so write() is enough
       
generate_data()

##### Now that we have working data in the "Data.txt" file let's move on to retrieving it using standard Python.

In [None]:
data = open('Data.txt', 'r') # open() returns a file wrapper object with methods to access the file's contents, here in 'read' mode

##### We see that this is a file "IO" (in/out) wrapper data type (class).

In [None]:
print(type(data))

##### Let's display the functions contained within this class:

In [None]:
for method in dir(data): print(f"'{method}'", end=' ')

##### The 'readlines()' method seems like a reasonable choice to get our data.

In [None]:
import inspect
print(inspect.signature(data.readlines))
print(inspect.signature(open))

##### It takes only an optional `hint` argument, so we can call it without arguments. Let's try it.

In [None]:
lines = data.readlines()
print(type(lines))
print(repr(data.readlines)) # Not very useful here.

##### We see that data.readlines() returns a list of the lines within the file. Let's print the first 5 lines to make sure.

In [None]:
print(lines[0:5])

##### For analysis we want this data to be in numeric form (float), not string. We also want to remove '\n'. split() does this nicely, as it defaults to splitting a string on whitespace.

In [None]:
for i in range(5): print(lines[i].split())

##### We can now use the map() function of Python to change each element of these lists to floats (Syntax map(func, iterable)).

In [None]:
for i in range(5): print(map(float, lines[i].split()))

##### A map object is just an iterator.

In [None]:
iterator = map(float, ['1', '2', '3'])
for i in iterator: print(i, end=' ')
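
One consequence of a map object being an iterator is that it can only be consumed once. The short sketch below (variable names are illustrative) shows that a second pass over the same map object yields nothing.

```python
# A map object is a one-shot iterator: once consumed, it yields nothing more.
numbers = map(float, ['1', '2', '3'])

first_pass = list(numbers)   # consumes the iterator
second_pass = list(numbers)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

If the values are needed more than once, store them in a list first.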

##### To fix returning map objects we can just use the list() function to turn the map() returned iterator into a list.

In [None]:
for i in range(5): print(list(map(float, lines[i].split())))

##### Now let's generate two lists (x, y) from the columns of our data.

In [None]:
avg_radius = []
avg_concavities = []

for line in lines:
    values = list(map(float, line.split())) # convert each line once
    avg_radius.append(values[0])
    avg_concavities.append(values[1])

print(avg_radius[0:5])
print(avg_concavities[0:5])

##### We can do this with list comprehension as well.

In [None]:
avg_radius = [list(map(float, line.split()))[0] for line in lines]
avg_concavities = [list(map(float, line.split()))[1] for line in lines]

print(avg_radius[0:5])
print(avg_concavities[0:5])

data.close()

##### Now our data is ready for analysis and manipulation by other packages. We can place all of this code into a single method.

In [None]:
def retrieve_data(filename):
    '''
    params: 'filename' - file name containing space-separated data columns in string format
    returns: the first two columns as lists
    '''
    file = open(filename, 'r')
    lines = file.readlines()
    xs = [list(map(float, line.split()))[0] for line in lines]
    ys = [list(map(float, line.split()))[1] for line in lines]
    file.close()
    return xs, ys

x, y = retrieve_data('Data.txt')
print(x[0:5])
print(y[0:5])
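
As a possible variation on the function above, the sketch below parses each line only once and uses a `with` block so the file is closed even if an error occurs. The name `retrieve_columns`, the temporary file, and the sample values are assumptions made for illustration; they are not part of the session's data.

```python
import os
import tempfile

def retrieve_columns(filename):
    """Return the first two whitespace-separated columns of a text file as lists of floats."""
    xs, ys = [], []
    with open(filename, 'r') as fp:      # the file is closed automatically
        for line in fp:                  # iterate over the file one line at a time
            fields = line.split()
            if len(fields) < 2:          # skip blank or incomplete lines
                continue
            xs.append(float(fields[0]))
            ys.append(float(fields[1]))
    return xs, ys

# Small self-made file standing in for 'Data.txt' (values are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'w') as fp:
        fp.write('17.99 0.1471\n20.57 0.07017\n\n19.69 0.1279')
    demo_x, demo_y = retrieve_columns(path)

print(demo_x)
print(demo_y)
```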

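##### We can do some simple manipulations on this data. For example, we can normalize each list by its maximum so that the largest value becomes 1 (for non-negative data, every value then lies in [0, 1]).

In [None]:
x_max, y_max = max(x), max(y) # compute each maximum once instead of once per element
norm_x = [i/x_max for i in x]
norm_y = [i/y_max for i in y]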
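
Dividing by the maximum sets the largest value to 1, but the smallest value only reaches 0 if the data already contain a 0. If the goal is for the data to span exactly [0, 1], min-max scaling is one common alternative. The sketch below compares the two; the function names and the sample list are illustrative assumptions.

```python
def scale_by_max(values):
    """Divide by the maximum so the largest value becomes 1."""
    largest = max(values)
    return [v / largest for v in values]

def min_max_scale(values):
    """Shift and scale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError('all values are equal; min-max scaling is undefined')
    return [(v - lo) / (hi - lo) for v in values]

sample = [2.0, 4.0, 6.0, 10.0]   # illustrative values, not from the data set
by_max = scale_by_max(sample)
min_max = min_max_scale(sample)
print(by_max)
print(min_max)
```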

##### Here are some other basic functionalities that we can implement on our data without external libraries.

In [None]:
# Average
def mean(i):
    return sum(i)/len(i)

avg_x = mean(x)
avg_y = mean(y)
print(avg_x, avg_y)

In [None]:
# Standard Deviation
def sqrt(value):
    return value**(1/2)

def std(i):
    mean = sum(i)/len(i)
    s = sum([((x - mean)**2) for x in i])
    return sqrt(s/len(i))

print(std(x), std(y))
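
The `std` function above divides by the number of values, which gives the population standard deviation. As a cross-check, the standard library's `statistics` module offers `pstdev` (population) and `stdev` (sample, dividing by n - 1). The sketch below uses an illustrative list rather than the session's data.

```python
import math
import statistics

def population_std(values):
    """Population standard deviation: divide the squared deviations by n."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values

ours = population_std(sample)
library_population = statistics.pstdev(sample)  # divides by n
library_sample = statistics.stdev(sample)       # divides by n - 1
print(ours, library_population, library_sample)
```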

In [None]:
# Sum
sum_x = sum(x)
sum_y = sum(y)
print(sum_x, sum_y)

In [None]:
# Max / Min
x_max = max(x)
x_min = min(x)
print(x_max, x_min)

In [None]:
# Index where Max / Min found - since x, y are lists we can use the index method

min_indx = x.index(x_min)
max_indx = x.index(x_max)
print(min_indx, max_indx)
print(x[min_indx], x[max_indx])

##### Say that, after normalizing the data so that the maximum value of each list is 1, we want to write the result to a file.

In [None]:
def output_normalized(x_norm, y_norm):
    '''
    Function to write normalized x, y data to text file in cwd
    params:
        x_norm - list of normalized x values
        y_norm - list of normalized y values
    '''
    lines = list(zip(list(map(str, x_norm)), list(map(str, y_norm))))
    lines = [' '.join(x) for x in lines]
    with open('Output.txt', 'w') as fp:
        fp.write('\n'.join(lines)) # the joined lines form one string, so write() is enough
        
output_normalized(norm_x, norm_y)

#### NumPy

##### We can do the same types of things with the conveniently pre-written functions in the NumPy library.

##### So why use NumPy instead of standard Python? For element-wise operations on large arrays, NumPy is usually much faster.

In [None]:
import numpy as np
from timeit import default_timer as timer

def speedtest(key, length):
    d = {'array' : np.array([x for x in range(length)]),
         'list' : [x for x in range(length)]}
    time = []
    if key == list(d.keys())[0]:
        for i in range(10000):
            start = timer()

            mult = d['array'] * d['array']

            end = timer()
            dt = end - start
            time.append(dt)
    elif key == list(d.keys())[1]:
        for i in range(10000):
            start = timer()

            for n, i in enumerate(d['list']):
                mult = d['list'][n] * d['list'][n]

            end = timer()
            dt = end - start
            time.append(dt)
    return time

t = speedtest('list', 1000)
t2 = speedtest('array', 1000)

import matplotlib.pyplot as plt
plt.plot(t)
plt.plot(t2)
plt.ylabel('Time (s)')
plt.xlabel('Iteration')
plt.text(400, np.max(t)/2, s="Numpy is a lot faster.");

##### Why is this? In this case, NumPy performs the element-wise multiplication in a single call to compiled code, whereas the list version runs a Python-level loop that multiplies one pair of elements at a time.

However, this advantage mainly appears for large arrays.

In [None]:
test1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test2 = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

t = speedtest('list', 3)
t2 = speedtest('array', 3)
plt.plot(t)
plt.plot(t2)
plt.ylabel('Time (s)')
plt.xlabel('Iteration')
plt.text(400, np.max(t)/2, s="Now it's not so clear");
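
A more compact way to run this kind of comparison is the standard library's `timeit` module. The sketch below is one possible setup, not a rigorous benchmark: the array length, the repeat counts, and the helper names are assumptions, and the measured times will vary between machines.

```python
import timeit
import numpy as np

length = 1000
values_list = list(range(length))
values_array = np.arange(length)

def square_list():
    """Square every element with a Python-level loop."""
    return [v * v for v in values_list]

def square_array():
    """Square every element with one vectorized NumPy operation."""
    return values_array * values_array

# Both approaches should produce the same numbers
same_result = square_list() == square_array().tolist()

# timeit.repeat runs each callable `number` times per repeat; keep the best repeat
list_time = min(timeit.repeat(square_list, number=200, repeat=5))
array_time = min(timeit.repeat(square_array, number=200, repeat=5))
print(same_result, list_time, array_time)
```

Changing `length` to a small value such as 3 is a quick way to see how the gap between the two timings depends on array size.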

In [None]:
f = 'Data.txt'
data = np.loadtxt(f)
print(type(data))

In [None]:
for method in dir(data):  print(f"'{method}'", end=' ')

##### Other ways to retrieve data from a file using NumPy:

In [None]:
np.genfromtxt(f) # Better for incomplete CSVs etc with function keyword options such as "filling_values="

##### We can see that the methods we wrote in standard Python are available for a NumPy array.

In [None]:
data.shape # The shape of the matrix (569 rows with 2 columns)

##### We can use usual slicing methods to define x and y.

In [None]:
x = np.asarray(data[:, 0])
y = np.asarray(data[:, 1]) # Slicing a 2D array already returns a 1D NumPy array; np.asarray() leaves it unchanged
print(type(x))
print(x[0:5])

##### Now using NumPy let's manipulate the data sets as we did above. Notice NumPy did the data cleaning for us.

In [None]:
# Averages

x_avg = np.mean(x)
y_avg = np.mean(y)
print(x_avg, y_avg)

In [None]:
# Sqrt
np.sqrt(2)

In [None]:
# Standard Dev.
np.std(x)

In [None]:
# Sums
x.sum() # or
np.sum(x)

In [None]:
xnorm = x/np.max(x) # Note that / operation carried out for each element of the np.array

# Can also use NumPy functions for these mathematical operations

test = np.divide(x, np.max(x))
np.any(test == xnorm) # True if at least one element matches; np.all() would check that every element matches

##### We can find out what np.any() is doing here from its `__doc__` attribute.
##### The Python interpreter automatically stores the first string literal in the body of a class or function as its `__doc__`.

In [None]:
def example():
    '''Super helpful doc string.'''

example.__doc__

def doc(func):
    if callable(func):
        return func.__doc__
    else:
        raise ValueError('Argument must be a function.')

In [None]:
doc(example) # Our function works

In [None]:
doc(x) # Raises ValueError: x is not callable (it has no __call__ method, whereas all functions do)

In [None]:
print(type(example))

In [None]:
print(np.any.__doc__) 
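
For multi-line docstrings, the raw `__doc__` text can carry leading blank lines and source indentation, depending on the Python version. The standard library's `inspect.getdoc` returns a cleaned-up version. The sketch below redefines `example` with a longer, made-up docstring purely for illustration.

```python
import inspect

def example():
    '''
    Super helpful doc string.
        Indented detail line.
    '''

raw = example.__doc__              # raw text; may include blank lines and indentation
cleaned = inspect.getdoc(example)  # common indentation and surrounding blank lines removed
print(repr(raw))
print(repr(cleaned))
```

The built-in `help(example)` prints the same cleaned text along with the function signature.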

##### In fact, comparisons apply element-wise for NumPy arrays

In [None]:
t1 = np.asarray([[1, 2], [3, 4]])
t2 = np.asarray([[5, 6], [7, 8]])

t2 > t1 # True wherever an element of t2 is greater than the element of t1 at the same index

#### Convenience Functions

In [None]:
a = np.array([1, 2, 3, 4, 5, 6])
a.reshape(2, 3) # Two rows three columns

In [None]:
# We can try to restore the array using reshape, but notice that the array is not flattened
print(a.reshape(1, a.size))

In [None]:
# Ravel will flatten the array into a single 1D array
a = a.ravel()
print(a)

In [None]:
np.zeros(3) # Array of zeros (here a 1D array of length 3)

In [None]:
np.ones(3) # Array of ones (here a 1D array of length 3)

In [None]:
id_mat = np.identity(3) # Identity Matrix
print(id_mat)

In [None]:
np.trace(id_mat) # Trace (sum along the i=j elements of matrix)

In [None]:
n1 = np.array([0, 0, 0])
n2 = np.array([1, 1, 1])
print(np.vstack((n1, n2))) # stack arrays vertically
print(np.hstack((n1, n2))) # stack arrays horizontally

In [None]:
tiled = np.tile(n1, (3, 3)) # we can "tile" or repeat an array over (i,j) iterations
print(tiled)
print(tiled.size)

In [None]:
new_tiled = np.insert(tiled, 5, 1)
print(new_tiled)
print(new_tiled.shape)
new_tiled = np.delete(new_tiled, 0) # Delete element index 0
print(new_tiled.shape)

In [None]:
new_tiled.reshape(3, 9)

In [None]:
print(new_tiled) # doesn't affect original array

In [None]:
np.hsplit(new_tiled, 3) # Similar to reshape() but generates a list of arrays one for each row

In [None]:
print(new_tiled[4])
print(np.roll(new_tiled, 1)[5]) # roll shifts values of set axis over by 1
print(new_tiled)
print(np.roll(new_tiled, 1)) # useful for periodic boundary conditions

In [None]:
new_tiled.fill(0) # affects original array to set all elements to zero
print(new_tiled)

In [None]:
np.random.rand(1, 3) # Random floats on the half-open interval [0, 1)

In [None]:
np.arange(0, 10.1, .1, dtype=float) # advantage over range() - can use other types (Syntax: [start, stop, step])

In [None]:
range(0, 10.1, .1) # Range can't handle non-integers

In [None]:
ex = np.asarray([['Human', 'Gorilla', 'Chimpanzee'], [60, 70, 80]])
print(ex)

In [None]:
print(np.sort(ex, axis=1)) # axis=1 sorts along the columns, i.e. within each row independently

# Notice that although the first row is sorted, the values in the second row have not moved with their names

In [None]:
indxs = np.argsort(ex[0, :]) # return the indices after sorting and then apply to the original array
print(indxs)

In [None]:
ex = ex[:, indxs]
print(ex) # Sorted alphabetically by name, with each value kept under its name
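
Because `ex` mixes names and numbers in one array, NumPy stores every entry as a string. One possible alternative, sketched below, keeps the names and the numbers in two separate arrays and applies the same `argsort` indices to both, so the numeric values stay numeric. The array names are illustrative.

```python
import numpy as np

names = np.array(['Human', 'Gorilla', 'Chimpanzee'])
values = np.array([60, 70, 80])   # stays an integer array

order = np.argsort(names)         # indices that sort the names alphabetically
sorted_names = names[order]
sorted_values = values[order]     # the same reordering keeps the pairs aligned

print(sorted_names)
print(sorted_values)
```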

##### Output with NumPy

In [None]:
np.savetxt('numpy_output.txt', ex, fmt='%s', delimiter=' ')

In [None]:
ex.tofile('numpy_output.txt', sep=' ')

There are many more functions that are not covered here.

#### Pandas Library

In [None]:
import pandas as pd

In [None]:
# pd.read_fwf() "Fixed width formatted"
data = pd.read_fwf('Data.txt', header=None)
print(type(data)) # Stores as data frame object

In [None]:
data = pd.read_csv('Data.txt', sep=" ", header=None)
print(type(data)) # Stores as data frame object
print(data)

In [None]:
data = data.rename(columns={0: "Average Radius", 1: "Average # Concavities"})
print(data)

In [None]:
data.head() # First 5 entries 

In [None]:
data.tail() # Last 5 entries

In [None]:
data.info() # General information about the df

In [None]:
data['Average Radius'] # We can select data columns using their column labels

In [None]:
type(data['Average Radius'].array) # some NumPy functionalities included in Pandas

In [None]:
np.asarray(data['Average Radius']).sum() # Ex

In [None]:
data['Average Radius'].sum()

#### Basic SymPy - Symbolic Computation in Python

In [None]:
from sympy import *

In [None]:
x, y, t = symbols('x y t')
print(x, y, t)

In [None]:
diff(sin(x))

In [None]:
x**2

In [None]:
diff(x**2)

In [None]:
integrate(sin(x)) # Indefinite integral (antiderivative)

In [None]:
integrate(sin(x), (x, pi/2, pi)) # Definite

In [None]:
limit(sin(x), x, pi/2) # Limits

In [None]:
y = Function('y')
print(y)

Solve the differential equation $y'' - y = e^{t}$

In [None]:
dsolve(Eq(y(t).diff(t, t) - y(t), exp(t)), y(t)) # required variable as 't'

Let's plot the results just to see what our solution looks like (we will go over plotting in more detail in a future session).

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def y(t, c1, c2): # general solution from dsolve, with integration constants c1, c2
    return c2*np.exp(-t)+(c1+t/2)*np.exp(t)

t = np.linspace(0, 10, 10)

for i in range(10):
    for j in range(10):
        plt.plot(t, y(t, i, j))

plt.show() # The rate changes as a function of c1, c2
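
As a quick sanity check on the general solution, one can substitute it back into the differential equation numerically. The sketch below approximates the second derivative with a central difference and evaluates y'' - y - exp(t), which should be close to zero. The function names, the step size `h`, and the chosen constants are assumptions for illustration.

```python
import math

def general_solution(t, c1, c2):
    """Candidate solution y(t) = c2*exp(-t) + (c1 + t/2)*exp(t)."""
    return c2 * math.exp(-t) + (c1 + t / 2) * math.exp(t)

def residual(t, c1, c2, h=1e-4):
    """Central-difference estimate of y'' - y - exp(t); should be close to zero."""
    second = (general_solution(t + h, c1, c2)
              - 2 * general_solution(t, c1, c2)
              + general_solution(t - h, c1, c2)) / h**2
    return second - general_solution(t, c1, c2) - math.exp(t)

residuals = [residual(t, c1=1.0, c2=2.0) for t in (0.0, 0.5, 1.0)]
print(residuals)
```

The residuals are not exactly zero because the finite difference is only an approximation of the derivative.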

In [None]:
factor(x**3 + x**2 + x + 1)

#### Applications to Chemical Systems

##### **1.) Balancing Chemical Equations**

Balancing chemical equations by hand is discussed during introductory chemistry classes. Let's write a function using SymPy to do the work for us. We will generate a matrix of coefficients for each atom and then convert the matrix to reduced row echelon form (RREF) for our answer.

C<sub>3</sub>H<sub>8</sub> + O<sub>2</sub> &#8594; CO<sub>2</sub> + H<sub>2</sub>O

We want to balance the combustion equation for propane. The matrix for this system by element is:

In [None]:
from sympy import *
import numpy as np
Matrix([['a', 'b', 'c', 'd'], [3, 0, -1, 0], [8, 0, 0, -2], [0, 2, -2, -1]])

Here a, b, c, and d are our desired coefficients in front of each chemical.

The procedure to acquire RREF: perform row operations until the first non-zero entry (the "pivot") in each non-zero row is 1, every other entry in a pivot's column is zero, each pivot lies to the right of the pivot in the row above, and any all-zero rows sit at the bottom.

Legal row operations:

1. Switching row positions
2. Multiplying a row by a non-zero number
3. Adding a multiple of one row to another row

We can write our own code following the pseudo-code presented on [wikipedia](https://en.wikipedia.org/wiki/Row_echelon_form), or we can write a shorter function using SymPy which is what we'll do here.

In [None]:
matrix = np.array([[3, 0, -1, 0], [8, 0, 0, -2], [0, 2, -2, -1]])
print(matrix)

In [None]:
Matrix(matrix).rref()

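This means:\
a = (1/4)d\
b = (5/4)d\
c = (3/4)d

(The last column of the RREF holds -1/4, -5/4, and -3/4; moving those terms to the other side of each equation makes them positive.)

d is the factor required to make every coefficient an integer: the lowest common denominator, which is 4.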

In [None]:
def balance(m):
    cols = ('a', 'b', 'c', 'd', 'e', 'f', 'g')
    reduced, pivots = Matrix(m).rref() # rref() returns (reduced matrix, pivot columns)
    last = reduced.shape[1] - 1 # index of the last column
    for n in range(reduced.shape[0]):
        value = reduced[n, last]
        denom = fraction(together(value))[1]
        print('{} = {}'.format(cols[n], abs(value*denom)))
    print('{} = {}'.format(cols[n+1], denom)) # assumes every entry shares this denominator, as it does here

In [None]:
balance(matrix)

Balanced Equation: C<sub>3</sub>H<sub>8</sub> + **5**O<sub>2</sub> &#8594; **3**CO<sub>2</sub> + **4**H<sub>2</sub>O
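
A simple way to double-check the balanced equation is to count the atoms of each element on both sides. The sketch below does this with plain dictionaries; the helper name `count_atoms` and the data layout are illustrative choices.

```python
# Atom counts per molecule
C3H8 = {'C': 3, 'H': 8}
O2 = {'O': 2}
CO2 = {'C': 1, 'O': 2}
H2O = {'H': 2, 'O': 1}

def count_atoms(side):
    """Sum atom counts over (coefficient, molecule) pairs."""
    totals = {}
    for coefficient, molecule in side:
        for element, n in molecule.items():
            totals[element] = totals.get(element, 0) + coefficient * n
    return totals

reactants = count_atoms([(1, C3H8), (5, O2)])
products = count_atoms([(3, CO2), (4, H2O)])
print(reactants)
print(products)
print(reactants == products)
```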

NumPy does not have this functionality. If you'd like practice and want to contribute this function to NumPy check out [this](https://numpy.org/doc/stable/dev/index.html) link. The pseudo-code for RREF can be found on [wikipedia](https://en.wikipedia.org/wiki/Row_echelon_form).
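
For reference, a minimal pure-Python sketch of the row reduction is given here. It uses `fractions.Fraction` for exact arithmetic and applies the three row operations listed earlier. It is a teaching sketch under simple assumptions (a rectangular list of rows with integer or rational entries), not a replacement for SymPy's `rref()`.

```python
from fractions import Fraction

def rref(rows):
    """Return the reduced row echelon form of a matrix given as a list of rows."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        # Find a row at or below pivot_row with a non-zero entry in this column
        candidates = [r for r in range(pivot_row, n_rows) if m[r][col] != 0]
        if not candidates:
            continue
        r = candidates[0]
        m[pivot_row], m[r] = m[r], m[pivot_row]            # 1. switch rows
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]   # 2. scale so the pivot is 1
        for other in range(n_rows):                        # 3. clear the rest of the column
            if other != pivot_row and m[other][col] != 0:
                factor = m[other][col]
                m[other] = [a - factor * b for a, b in zip(m[other], m[pivot_row])]
        pivot_row += 1
    return m

propane = [[3, 0, -1, 0], [8, 0, 0, -2], [0, 2, -2, -1]]
reduced = rref(propane)
for row in reduced:
    print([str(v) for v in row])
```

For the propane matrix, the last column of `reduced` holds the same fractions that SymPy reports.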

##### **2.) Chemical Kinetics**

In the previous example, we saw that coefficients of elementary chemical equations can be calculated by formulating a matrix.

We can also integrate chemical rate equations using linear algebra as outlined in [this](https://pubs.acs.org/doi/pdf/10.1021/ed067p375) paper in the Journal of Chemical Education.

The steps are:

1. Generate a matrix, K, of the rate constants.
2. Compute the eigenvalues and eigenvectors of the matrix.
3. Compute the scalar coefficients

Let's do this for the example of two unimolecular steps presented in the paper using NumPy.

X<sub>1</sub> &#8594;<sup>k<sub>1</sub></sup> X<sub>2</sub> &#8594; <sup>k<sub>2</sub></sup> X<sub>3</sub>

The rate equations are:
    
   dX<sub>1</sub>/dt = -k<sub>1</sub>X<sub>1</sub>\
   dX<sub>2</sub>/dt = k<sub>1</sub>X<sub>1</sub> - k<sub>2</sub>X<sub>2</sub>\
   dX<sub>3</sub>/dt = k<sub>2</sub>X<sub>2</sub>
   
So the matrix containing the rates is:

In [None]:
k1, k2 = symbols('k1 k2') # define the rate constants as symbols so that later cells can use k1 and k2
K = Matrix([[-k1, 0, 0], [k1, -k2, 0], [0, k2, 0]])
K

In [None]:
K.eigenvals() # Returns a dictionary where the keys are the eigenvalues

In [None]:
K.eigenvects() # Returns a list of (eigenvalue, multiplicity, eigenvectors) tuples

In [None]:
X = Matrix(['X1', 'X2', 'X3']) # a single list is considered a column vector for SymPy
X

In [None]:
C = Matrix([['k1/k2-1',0,0],['-k1/k2',0,-1],[1,1,1]]) # columns are our eigenvectors (for eigenvalues -k1, 0, -k2)
C

In [None]:
CT = Matrix(transpose(C)) # Take the transpose and swap second and third rows
CT.row_swap(1, 2)
CT

In [None]:
simplify(CT.row(0)*-k2)# Multiplied first row by -k2 then second row by -1

In [None]:
CT = Matrix([['k2-k1', 0, 0], ['k1', 1, 0], ['-k2', -1, 1]])
CT

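This rescaling is fair game because any non-zero multiple of an eigenvector is still an eigenvector with the same eigenvalue: if AX = ${\lambda}$X, then A(cX) = ${\lambda}$(cX). The columns of this final matrix are the rescaled eigenvectors for ${\lambda}$ = -k<sub>1</sub>, -k<sub>2</sub>, and 0.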

We can get $\alpha$ by taking the inverse of C<sub>n</sub> and multiplying it with X<sub>n</sub>. $\alpha$ = C<sub>n</sub><sup>-1</sup>X<sub>n</sub>

In [None]:
a = CT.inv()*X # a (alpha) contains our scalar coefficients
a

Our original equation was a set of first order differential equations. For reasons we won't go over here, our solutions to X<sub>n</sub>(t) will be exponential functions.

e.g. $\vec{x}'(t) = A\vec{x}(t)$ has solutions $\vec{x}(t) = \vec{c}\,e^{\lambda t}$

The paper derives: ![title](./DEsoln.png)

We have all the components to solve X<sub>n</sub>(t) so let's plug them in and check out the results.

In [None]:
print(*list(K.eigenvals().keys())) # Eigenvalues

${\alpha}$<sub>1</sub>C<sub>1</sub>exp(${\lambda}$<sub>1</sub>t)

In [None]:
t = symbols('t')
Matrix(a[0]*CT.col(0)*exp(-k1*t))[0]

${\alpha}$<sub>1</sub>C<sub>1</sub>exp(${\lambda}$<sub>1</sub>t) + ${\alpha}$<sub>2</sub>C<sub>2</sub>exp(${\lambda}$<sub>2</sub>t)

In [None]:
Matrix(a[0]*CT.col(0)*exp(-k1*t)+a[1]*CT.col(1)*exp(-k2*t))[1]

${\alpha}$<sub>1</sub>C<sub>1</sub>exp(${\lambda}$<sub>1</sub>t) + ${\alpha}$<sub>2</sub>C<sub>2</sub>exp(${\lambda}$<sub>2</sub>t) + ${\alpha}$<sub>3</sub>C<sub>3</sub>exp(${\lambda}$<sub>3</sub>t)

In [None]:
Matrix(a[0]*CT.col(0)*exp(-k1*t)+a[1]*CT.col(1)*exp(-k2*t)+a[2]*CT.col(2)*exp(0))[2]

Now that we have our solutions for X<sub>1</sub>(t), X<sub>2</sub>(t), and X<sub>3</sub>(t), let's plot them, given a starting concentration of 1 for X<sub>1</sub> and 0 for X<sub>2</sub> and X<sub>3</sub>.

In [None]:
import matplotlib.pyplot as plt

ts = np.linspace(0, 10, num=50)

# Functions for each chemical species
def s1(x1, k1, t):
    return x1*np.exp(-k1*t)

def s2(x1, x2, k1, k2, t):
    # every denominator is (k2 - k1), which comes from alpha_1 = x1/(k2 - k1)
    return (x1*k1)/(-k1+k2)*np.exp(-k1*t) + (((-x1*k1)/(-k1+k2))+x2)*np.exp(-k2*t)

def s3(x1, x2, x3, k1, k2, t):
    return -(x1*k2)/(-k1+k2)*np.exp(-k1*t)+x1+x2+x3+(((x1*k1)/(-k1+k2))-x2)*np.exp(-k2*t)

In [None]:
plt.plot(ts, s1(1, 1, ts), '-b^', mfc='white')
plt.plot(ts, s2(1, 0, 1, .4, ts),'-r^', mfc='white')
plt.plot(ts, s3(1, 0, 0, 1, .4, ts), '-k^', mfc='white')
plt.show()
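
As a numerical cross-check, the same three steps (build K, compute eigenvalues and eigenvectors, solve for the scalar coefficients) can be sketched with NumPy. The rate constants and initial concentrations below match the values used in the plot; the function name `concentrations` and the other variable names are illustrative assumptions.

```python
import numpy as np

k1, k2 = 1.0, 0.4
K = np.array([[-k1, 0.0, 0.0],
              [k1, -k2, 0.0],
              [0.0, k2, 0.0]])
x0 = np.array([1.0, 0.0, 0.0])      # initial concentrations of X1, X2, X3

eigenvalues, C = np.linalg.eig(K)   # columns of C are the eigenvectors
alpha = np.linalg.solve(C, x0)      # scalar coefficients from C @ alpha = x0

def concentrations(t):
    """X(t) = sum_i alpha_i * C_i * exp(lambda_i * t) for a scalar time t."""
    return (C * alpha) @ np.exp(eigenvalues * t)

ts = np.linspace(0, 10, num=50)
X = np.array([concentrations(t) for t in ts])   # one row per time point, one column per species

print(X[0])    # concentrations at t = 0
print(X[-1])   # concentrations at t = 10
```

Each column of `X` can be passed to `plt.plot(ts, ...)` for comparison with the analytic curves, and the rows should sum to the total starting concentration because the scheme only converts one species into another.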