# Activity 8 - curve fitting and processing multiple files

## Analyzing enzyme kinetics using data from a plate reader

We are analyzing the kinetic properties of a newly discovered enzyme. For this analysis, we need product concentration versus time (from which we obtain reaction rates) at multiple initial substrate concentrations.

The instrument used to collect the data was the Protoreader X, a great 96-well plate reader that writes files in a format that isn't the easiest to work with. Its output files have a `.prx` extension. Each file contains notes fields giving the total time the reaction has run (in seconds) and the initial substrate concentration of each well, followed by the data (product concentration, in micromolar). Note that the data field contains measurements for all 96 wells, some of which may be empty (as indicated in the initial concentration field).

Write a function that extracts the time point at which the data were acquired from a file's contents.

In [None]:
def extract_incubation_time(prx_file)->float:
    pass

with open('data/enzyme/3_5_21_1.prx') as prx_file:
    print(extract_incubation_time(prx_file.read()))
# should return 0.1

Now write a function that extracts all of the starting substrate concentrations from a file's contents.

In [None]:
def extract_substrate_concentrations(prx_file)->list[float]:
    pass

with open('data/enzyme/3_5_21_1.prx') as prx_file:
    print(sum(extract_substrate_concentrations(prx_file.read())))
# should return 0.1

Now write a function that extracts all of the absorbance readings from a file's contents.

In [None]:
def extract_absorbance_measurements(prx_file)->list[float]:
    pass

with open('data/enzyme/3_5_21_1.prx') as prx_file:
    print(sum(extract_absorbance_measurements(prx_file.read())))
# should return 0.1

Now we have the tools to load all of the data - the question is how we should format it. We would like to access absorbance vs time data at different substrate concentrations. For this we need a list or numpy array of all of the time points and a corresponding list of absorbance measurements at each substrate concentration. We can accomplish this by:

- building an array of the time points
- building an array of the substrate concentrations
- building a two dimensional array of the absorbance measurements
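
The layout described in the list above can be sketched with made-up numbers. This is only an illustration under stated assumptions: the values in `fake_files` and `substrate` are hypothetical stand-ins for whatever your extraction functions return, and a real plate would have 96 wells rather than three.

```python
import numpy as np

# Hypothetical stand-ins for extracted data: (time in seconds, readings for three wells).
fake_files = [
    (20.0, [0.0, 0.21, 0.39]),
    (0.0, [0.0, 0.01, 0.02]),
    (10.0, [0.0, 0.11, 0.20]),
]
# Hypothetical initial substrate concentration of each well (micromolar).
substrate = np.array([0.0, 5.0, 10.0])

# Files may arrive in any order, so sort by time to keep rows chronological.
fake_files.sort(key=lambda item: item[0])

times = np.array([t for t, _ in fake_files])
readings = np.array([r for _, r in fake_files])  # shape: (n_times, n_wells)

# Column j of `readings` is the time course for the well with substrate[j].
well = 2
time_course = readings[:, well]
print(times, time_course)
```

With rows indexed by time and columns indexed by well, a single slice such as `readings[:, well]` gives the measurements to pair with `times` for one substrate concentration.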

Build an array of the time points represented in the dataset in the cell below.

It is often the case that datasets are composed of more than one file. How can we process more than one file using Python? 

The files in the `data/enzyme` directory are from a study measuring an enzyme's activity. The assay uses a plate reader to monitor the absorbance of multiple reactions, each with a different initial substrate concentration, over time. Plate readers are great because they allow lots of data to be collected, but dealing with it all can be a challenge! Let's develop a Python workflow to handle these data. First, how can we deal with multiple files? The [`pathlib`](https://docs.python.org/3/library/pathlib.html) module of the standard library allows for easy iteration through a directory. The `.name` attribute of a path object contains the file name. Complete the following `for` loop to print the name of each `.prx` file in the `data/enzyme` directory.

In [None]:
from pathlib import Path

for p in Path('data/enzyme').glob('*.prx'):
    pass
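
As a rough illustration of how `Path.glob` and `.name` behave, the sketch below creates a temporary directory with placeholder files. The file names are hypothetical and are not the real dataset. Because `glob` makes no promise about the order of its results, the sketch sorts the paths so the order is reproducible.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical file names used only for this demonstration.
    for name in ("b_2.prx", "a_1.prx", "notes.txt"):
        (folder / name).write_text("placeholder")

    # Only files matching the pattern are returned; sorted() fixes the order.
    names = [p.name for p in sorted(folder.glob("*.prx"))]

print(names)
```

Sorting by name is only one option. For time-course data it may be more useful to sort by the time point extracted from each file.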

In [None]:
for p in Path('data/enzyme').glob('*.prx'):
    pass

Build an array of the substrate concentrations represented in the dataset in the cell below.

In [None]:
for p in Path('data/enzyme').glob('*.prx'):
    pass

Build a two-dimensional array of the absorbance measurements in the cell below.

In [None]:
for p in Path('data/enzyme').glob('*.prx'):
    pass

Try to work with the data by plotting two different datasets (absorbance vs time) contained within the arrays you constructed.

The next step is to fit the product concentration to the time at each substrate concentration. This is where the `scipy.optimize.curve_fit` function is useful. `curve_fit` takes a model function as its first argument. The model function takes the independent variable (as a numpy array) as its first argument, followed by the parameters to be fitted, and returns the dependent variable. Complete the following code by assigning the `time_data` and `product_data` variables to the relevant data for a single substrate concentration from your dataset. Note that if your data are stored in a pandas DataFrame, a column can be converted to a numpy array using the [`.to_numpy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) method.

In [None]:
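import numpy as np
from scipy.optimize import curve_fit

def rate_equation(time: np.ndarray, rate: float, offset: float) -> np.ndarray:
    # Linear model: product increases at a constant rate from an initial offset.
    return rate * time + offset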

time_data = None
product_data = None

results,_ = curve_fit(rate_equation,time_data,product_data)
results
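
To show what `curve_fit` returns, here is a minimal sketch on synthetic data. The arrays `demo_time` and `demo_product` are invented for illustration (a line with slope 0.5 and intercept 2.0 plus small fixed deviations) and do not come from the enzyme dataset. The function `line` is a hypothetical stand-in for a linear model.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(time, rate, offset):
    # Same linear form as a constant-rate model: product = rate * time + offset.
    return rate * time + offset

# Synthetic data: slope 0.5, intercept 2.0, plus small made-up deviations.
demo_time = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
demo_product = 0.5 * demo_time + 2.0 + np.array([0.1, -0.1, 0.05, -0.05, 0.0])

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(line, demo_time, demo_product)
rate, offset = params

# One-standard-deviation uncertainties are the square roots of the diagonal.
std_errors = np.sqrt(np.diag(covariance))
print(rate, offset, std_errors)
```

The fitted parameters come back in the same order as the model function's arguments after the independent variable, here `rate` and then `offset`.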

The relationship between $[S]_{0}$ and the initial rate $V_{0}$ is given by the Michaelis-Menten equation $$ V_{0} = \frac{V_{max} [S]_{0}}{K_{M} + [S]_{0}} $$ where $V_{0}$ is the rate at a given initial substrate concentration. Calculate the rates for all substrate concentrations in your dataset and use the resulting rates to fit the values of $K_{M}$ and $V_{max}$.
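
One possible shape for the final fit is sketched below on synthetic rates. Everything here is an assumption for illustration: the data are generated from the equation itself with $V_{max} = 2$ and $K_{M} = 8$, and `michaelis_menten`, `demo_s0`, and `demo_v0` are hypothetical names. The initial guess passed as `p0` (largest observed rate, median substrate concentration) is a common heuristic, not a requirement.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(s0, v_max, k_m):
    # V0 = Vmax * [S]0 / (KM + [S]0)
    return v_max * s0 / (k_m + s0)

# Synthetic, noise-free rates generated with Vmax = 2.0 and KM = 8.0.
demo_s0 = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0])
demo_v0 = michaelis_menten(demo_s0, 2.0, 8.0)

# A nonlinear fit benefits from a reasonable starting guess.
initial_guess = [demo_v0.max(), np.median(demo_s0)]
(v_max_fit, k_m_fit), _ = curve_fit(
    michaelis_menten, demo_s0, demo_v0, p0=initial_guess
)
print(v_max_fit, k_m_fit)
```

With real data, the rates passed in place of `demo_v0` would be the slopes obtained from the linear fits at each substrate concentration.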