# Activity 8 - curve fitting and processing multiple files
Datasets are often spread across more than one file. How can we process several files using Python?

The files in the `data/enzyme` directory are from a study measuring an enzyme's activity. The assay uses a plate reader to monitor the absorbance of multiple reactions, each with a different initial substrate concentration, over time. Plate readers are great because they generate lots of data, but dealing with it all can be a challenge! Let's develop a Python workflow to handle this data.

First, how can we deal with multiple files? The [`pathlib`](https://docs.python.org/3/library/pathlib.html) module of the standard library allows for easy iteration through a directory. The `.name` attribute of a path object contains the file name. Complete the following `for` loop to print the name of each `.prx` file in the `data/enzyme` directory.

In [None]:
from pathlib import Path

for p in Path('data/enzyme').glob('*.prx'):
    pass
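
As a minimal sketch of the `glob` and `.name` pattern described above, the following example builds a temporary directory with made-up file names (they are assumptions for illustration, not the real data files) and lists only those ending in `.prx`.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with hypothetical file names (not the real data).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("run_1.prx", "run_2.prx", "notes.txt"):
        (folder / name).touch()

    # glob yields paths in arbitrary order, so sort for a stable listing.
    prx_names = sorted(p.name for p in folder.glob("*.prx"))

print(prx_names)
```

Sorting is optional, but it makes the processing order reproducible between runs and machines.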

To determine the kinetic parameters of the enzyme, we need the product concentration, [P], as a function of time (i.e. reaction rates) at several initial substrate concentrations, [S].

The plate reader is the Protoreader X, a great 96-well plate reader, although its interface is not the best. It writes its output to `.prx` files. Each file contains notes fields giving the total time the reaction has run (in seconds) and the initial substrate concentration of the corresponding wells. These are followed by the data: the product concentration in each well, in micromolar. Note that the data field contains measurements for all 96 wells, some of which may be empty (as indicated in the initial concentration field).

Complete the function below, which is intended to process a plate reader file into a DataFrame with time, [S] and [P] columns. Note that it is currently missing the product concentration column.

In [None]:
import pandas as pd

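def process_plate_reader_file(prx_file) -> pd.DataFrame:
    # prx_file is an open text file object, not a path
    data = prx_file.read()
    # Elapsed time in seconds: the text between 'Time:' and the next comma
    time_start = data.index('Time:') + len('Time:')
    time = float(data[time_start:data.index(',', time_start)])
    # Initial substrate concentrations: whitespace-separated values between the first '[' and ']'
    subs = [float(i) for i in data[data.index('[') + 1:data.index(']')].split()]
    # TODO: parse the product concentrations and add them as a '[P]' column
    return pd.DataFrame({'time': [time] * len(subs), '[S]': subs})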

with open('data/enzyme/3_5_21_1.prx') as prx_file:
    enzyme_df = process_plate_reader_file(prx_file)
enzyme_df
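
The string slicing inside `process_plate_reader_file` can be easier to follow on a small example. The sketch below applies the same `str.index` approach to a made-up notes line. The line's layout (including the `Initial:` label) is an assumption for illustration only, so open one of the real `.prx` files to check the actual format.

```python
# Hypothetical notes line; the real .prx layout may differ.
notes = "Time: 30.0, Initial: [0.0 5.0 10.0 20.0]"

# The text between 'Time:' and the next comma is the elapsed time in seconds.
start = notes.index("Time:") + len("Time:")
end = notes.index(",", start)
elapsed = float(notes[start:end])  # float() ignores the surrounding spaces

# Whitespace-separated numbers between the first '[' and the first ']'.
substrate = [float(s) for s in notes[notes.index("[") + 1:notes.index("]")].split()]

print(elapsed, substrate)
```

Passing `start` as the second argument to `index` makes the search for the comma begin after the `Time:` label, so an earlier comma in the file would not be picked up by mistake.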

Complete the `for` loop so that it processes every file with the `process_plate_reader_file` function and collects the results in the `enzyme_df` DataFrame. Use the `.open` method of the variable `p` as you would the builtin `open` function (i.e. in a context manager). Note that Pandas has a [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function that can be used to combine DataFrames. Set the `ignore_index` keyword argument to `True` to reindex the row labels of the resulting combined DataFrame.

In [None]:
enzyme_df = None

for p in Path('data/enzyme').glob('*.prx'):
    pass
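
To illustrate what `ignore_index=True` does, here is a small sketch that combines two toy DataFrames. The values are invented and stand in for the output of two plate reader files.

```python
import pandas as pd

# Two toy DataFrames standing in for the results of two files (made-up values).
frames = [
    pd.DataFrame({"time": [0.0, 0.0], "[S]": [1.0, 2.0]}),
    pd.DataFrame({"time": [30.0, 30.0], "[S]": [1.0, 2.0]}),
]

# Without ignore_index=True the row labels would be 0, 1, 0, 1.
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

One common design choice is to append each per-file DataFrame to a list inside the loop and call `pd.concat` once at the end, which avoids copying the growing DataFrame on every iteration.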


The next step is to fit the product concentration as a function of time at each substrate concentration. This is where the `scipy.optimize.curve_fit` function is useful. Its first argument is the model function to fit. The model function takes the independent variable as its first argument (as a numpy array), takes the parameters to be fitted as its remaining arguments, and returns the dependent variable. Complete the following code by assigning the `time_data` and `product_data` variables to the relevant data for a single substrate concentration from your DataFrame. Note that a column can be converted to a numpy array using the [`.to_numpy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) method.

In [None]:
import numpy as np
from scipy.optimize import curve_fit

def rate_equation(time: np.ndarray, rate: float, offset: float) -> np.ndarray:
    # Linear model: [P] = rate * time + offset
    return rate * time + offset

time_data = None
product_data = None

results,_ = curve_fit(rate_equation,time_data,product_data)
results
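
As a hedged sketch of the selection and fitting steps, the example below uses a small synthetic DataFrame rather than the real data. The `[P]` column name, the helper name `line`, and all the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def line(t, rate, offset):
    # Same linear model as rate_equation: product = rate * time + offset
    return rate * t + offset

# Synthetic, noise-free measurements for two substrate concentrations.
toy_df = pd.DataFrame({
    "time": [0.0, 30.0, 60.0, 0.0, 30.0, 60.0],
    "[S]":  [5.0, 5.0, 5.0, 10.0, 10.0, 10.0],
    "[P]":  [2.0, 17.0, 32.0, 2.0, 26.0, 50.0],
})

# Keep only the rows for one substrate concentration, then convert to arrays.
subset = toy_df[toy_df["[S]"] == 5.0]
t = subset["time"].to_numpy()
p = subset["[P]"].to_numpy()

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(line, t, p)
fitted_rate, fitted_offset = params

print(fitted_rate, fitted_offset)
```

The fitted parameters come back in the same order as the model function's arguments after the independent variable, here `rate` followed by `offset`.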

The relationship between the initial substrate concentration $[S]_{0}$ and the initial rate $V_{0}$ is given by the Michaelis-Menten equation $$ V_{0} = \frac{V_{max} [S]_{0}}{K_{M} + [S]_{0}} $$ where $V_{0}$ is the rate for a given substrate concentration. Calculate the rates for all substrate concentrations in your DataFrame and use the resulting information to fit the values for $K_{M}$ and $V_{max}$.
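
One possible shape for this two-stage analysis is sketched below on synthetic, noise-free data generated from known parameters. The column name `[P]`, the function names, the starting guesses, and the "true" values are all assumptions chosen for illustration, not results from the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def line(t, rate, offset):
    return rate * t + offset

def michaelis_menten(s0, v_max, k_m):
    return v_max * s0 / (k_m + s0)

# Generate synthetic data from assumed parameters (not the real enzyme's).
true_v_max, true_k_m = 2.0, 8.0
times = np.array([0.0, 30.0, 60.0, 90.0])
rows = []
for s in [1.0, 2.0, 5.0, 10.0, 20.0, 50.0]:
    v = michaelis_menten(s, true_v_max, true_k_m)
    rows.extend({"time": t, "[S]": s, "[P]": v * t} for t in times)
toy_df = pd.DataFrame(rows)

# Stage 1: one linear fit per substrate concentration gives the rate V0.
substrate, rates = [], []
for s, group in toy_df.groupby("[S]"):
    (rate, _), _ = curve_fit(line, group["time"].to_numpy(), group["[P]"].to_numpy())
    substrate.append(s)
    rates.append(rate)

# Stage 2: fit V_max and K_M to the (substrate, rate) pairs.
# Starting guesses: the largest observed rate and a mid-range concentration.
(v_max_fit, k_m_fit), _ = curve_fit(
    michaelis_menten, np.array(substrate), np.array(rates),
    p0=[max(rates), np.median(substrate)],
)

print(v_max_fit, k_m_fit)
```

Supplying `p0` is a design choice: the default starting values of 1 for every parameter can be far from reasonable for a nonlinear model, so rough estimates taken from the data tend to make the fit more reliable.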