Non-Linear Curve Fitting, Part 1
=========================

<div class="overview-this-is-a-title overview">
<p class="overview-title">Overview</p>
<p>Questions</p>
    <ul>
        <li>How can I analyze enzyme kinetics data in Python?</li>
        <li>What is the process for non-linear least squares curve fitting in Python?</li>
    </ul>
<p>Objectives:</p>
    <ul>
        <li> Create a pandas dataframe with enzyme kinetics data from a .csv file</li>
        <li> Add velocity calculations to the dataframe</li>
        <li> Perform the non-linear regression calculations</li>
    </ul>
    
<p>In this module, we will calculate initial rates from the raw data ($\Delta$A$_{405}$) in an enzyme kinetics experiment with alkaline phosphatase. Along the way, we will import the raw data into a pandas dataframe, use some pandas tools to reorganize the data, and produce a second pandas dataframe that contains the substrate concentrations and the initial rate at each concentration. Finally, we will export this information to a csv file to use in the next module, where you will explore nonlinear curve fitting in Python.
    </p>
</div>

### Setting up the first dataframe
We start by importing data from a csv file, as we did earlier with the data for linear regression. These data follow the appearance of p-nitrophenol (as absorbance at 405 nm over time) for a series of p-nitrophenyl phosphate (pNPP) concentrations in the presence of alkaline phosphatase. We will import the libraries we need, import the data, and set up the dataframe.

In [1]:
# import the libraries we need
import os # to build the file path for the .csv file
import pandas as pd # for importing the .csv file and creating a dataframe
from scipy import stats # for the linear regressions that give the initial rates

In [9]:
pwd  ## make sure we are in '/Users/username/Desktop/python-scripting-biochemistry'

'/Users/pac8612/Desktop/python-scripting-biochemistry'

Your output here may be different. It's important to move to the correct directory level. You may need to move up or down the directory tree.


In [10]:
# You can build the file path once you are in the right directory
datafile = os.path.join('biochemist-python', 'chapters', 'data', 'AP_kinetics.csv') # file path created
print(datafile)  # file path confirmed

biochemist-python/chapters/data/AP_kinetics.csv
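
Before reading the file, it can help to confirm that the relative path resolves from the current working directory. The following is a minimal sketch, assuming the same folder layout as above; `os.getcwd()` and `os.path.exists()` are standard-library calls, and the variable name `path_ok` is only illustrative.

```python
import os

# Build the same relative path used in this lesson
datafile = os.path.join('biochemist-python', 'chapters', 'data', 'AP_kinetics.csv')

# Show where Python is currently running
print(os.getcwd())

# True only if the relative path resolves from the current directory
path_ok = os.path.exists(datafile)
print(path_ok)
```

If `path_ok` is `False`, you are probably one level too high or too low in the directory tree.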


In [12]:
# Creating the pandas dataframe using read_csv
AP_kinetics_df = pd.read_csv(datafile)  # Use pandas to create a dataframe of the alkaline phosphatase kinetics data
AP_kinetics_df.head()  # dataframe confirmed

Unnamed: 0,pNPP (mM),0.25,0.5,0.75,1,1.25,1.5,1.75,2,2.25,...,2.75,3,3.25,3.5,3.75,4,4.25,4.5,4.75,5
0,20.0,0.073923,0.139234,0.226077,0.287081,0.366029,0.434928,0.522488,0.574163,0.67177,...,0.828947,0.818182,0.933014,1.044976,1.098086,1.182775,1.256699,1.266029,1.431818,1.392344
1,10.0,0.066055,0.143119,0.208486,0.26422,0.330275,0.39633,0.481651,0.522936,0.606881,...,0.794725,0.784404,0.849771,0.915138,1.042431,1.06789,1.193119,1.250917,1.294266,1.444954
2,7.0,0.063797,0.130253,0.205348,0.25519,0.328956,0.394747,0.455886,0.515696,0.610063,...,0.738323,0.789494,0.889842,0.911772,0.986867,1.105823,1.095854,1.244051,1.325791,1.262658
3,4.0,0.060612,0.121224,0.192857,0.237551,0.303061,0.367347,0.441429,0.499592,0.567551,...,0.666735,0.72,0.764082,0.848571,0.881633,0.950204,1.061633,1.113061,1.186531,1.17551
4,2.0,0.052759,0.104483,0.147414,0.215172,0.271552,0.322759,0.372931,0.409655,0.465517,...,0.568966,0.614483,0.652241,0.753103,0.744828,0.786207,0.861724,0.921724,1.012241,1.075862


### Datatype
To simplify the analysis below, we will need to do a bit of data processing. Right now the column headers are the time values and each row holds the data for one substrate concentration, which is listed in the first column. We are going to transpose the dataframe so that the column headers are the concentrations and the indexes (row labels) are the time values.

Before doing that, we need to check the datatypes for the numbers. We must ensure that the numbers are floats, rather than strings, so we can do calculations on them.

Notice that the df.dtypes command lists the datatype for each of the columns. The final line, `dtype: object`, describes the pandas Series that holds this list, not the numbers in the dataframe.

In [13]:
AP_kinetics_df.dtypes # checking to see if the numbers are strings or floats

pNPP (mM)    float64
0.25         float64
0.5          float64
0.75         float64
1            float64
1.25         float64
1.5          float64
1.75         float64
2            float64
2.25         float64
2.5          float64
2.75         float64
3            float64
3.25         float64
3.5          float64
3.75         float64
4            float64
4.25         float64
4.5          float64
4.75         float64
5            float64
dtype: object
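
If a column had been read in as text instead, the numbers would need to be converted before doing any arithmetic. The sketch below uses a made-up toy dataframe (the names `toy_df`, `as_text`, and `as_float` are hypothetical) to show how `dtypes` reveals the problem and how `pd.to_numeric` is one way to fix it.

```python
import pandas as pd

# Hypothetical toy data: one column stored as strings, one as floats
toy_df = pd.DataFrame({'as_text': ['0.07', '0.14', '0.23'],
                       'as_float': [0.07, 0.14, 0.23]})
print(toy_df.dtypes)  # the text column is not reported as float64

# Convert the text column to numbers so that calculations work
toy_df['as_text'] = pd.to_numeric(toy_df['as_text'])
print(toy_df.dtypes)  # both columns are now float64
```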

### Calculating initial velocities

The first column in our dataframe is the pNPP concentration in mM ('pNPP (mM)'). The other column headers are the times in minutes for the kinetic data. Notice that these are listed as strings. To calculate initial velocities, these need to be changed to floats.

We need to set up the column headers as our x values. For the y values, we need to skip the first value ('pNPP (mM)') and then use the remaining values (A$_{405}$ as a function of time) to calculate slopes and get our initial velocities. The extinction coefficient for p-nitrophenol under these buffer conditions is 15.0 mM<sup>-1</sup>cm<sup>-1</sup>.
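
As a rough sketch of the unit conversion, the Beer-Lambert law ($A = \varepsilon c l$) turns a slope in absorbance units per minute into a rate in concentration units per minute. The slope value below is made up for illustration, and the 1 cm path length is an assumption that the text does not state.

```python
# Hypothetical slope from a linear fit, in absorbance units per minute
slope = 0.30

epsilon_mM = 15.0  # extinction coefficient in mM^-1 cm^-1 (from the text)
path_cm = 1.0      # assumed path length in cm

# Initial rate in mM/min
v0_mM = slope / (epsilon_mM * path_cm)

# Same rate in µM/min; this is equivalent to dividing the slope by 0.015
v0_uM = v0_mM * 1000

print(v0_mM, v0_uM)
```

Dividing by 0.015 rather than 15.0 therefore reports the velocities in µM/min instead of mM/min.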

Before transposing the dataframe, we will explicitly define the index as the concentrations, rather than as the row numbers (the default index).

In [None]:
# Set index to concentrations
AP_kinetics_df.set_index('pNPP (mM)', inplace=True)

In [None]:
AP_kinetics_df.head()

In [None]:
# Transpose to get columns as rows
AP_kinetics_df_transpose = AP_kinetics_df.T
AP_kinetics_df_transpose.head()

In [None]:
AP_kinetics_df_transpose.tail()

In [None]:
# Make sure the index is a float
AP_kinetics_df_transpose.index = AP_kinetics_df_transpose.index.astype('float64')

In the original dataframe, the column headers (the kinetics time points) are still strings. The `AP_kinetics_df.columns` check below confirms this.

In [None]:
# Check to see how the data look
# Using the plot command that is available with the dataframe
AP_kinetics_df_transpose.plot(marker = 'o')

In [None]:
AP_kinetics_df.columns # checking to see if the column labels are strings or floats

The plot above shows $\Delta$A$_{405}$ as a function of time (in minutes) for each substrate concentration.

In [None]:
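# View the time values as floats, not strings (this does not change the dataframe)
# 'pNPP (mM)' is now the index, so every remaining column header is a time value
AP_kinetics_df.columns.values.astype('float64')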

We want to calculate the slope for each column. We can use "apply" to do this. Apply takes all of the row or column values and applies a function.

We will define a function that returns only the slope. It will take in a pandas series. A pandas series always includes an index and a set of values. We have set the index to be the x-values, so we just need to give a series to this function.


In [None]:
def linregress_column(df_series):
    """Return the slope of a linear fit of the series values against its index."""

    # Often in Python, if you do not plan to use a variable, you can just
    # name it with an underscore. This tells anyone reading your code
    # that you don't intend to do anything with the values in the variable.

    # Since we only want the slope, we'll just name the rest of the
    # variables with underscores.
    slope, _, _, _, _ = stats.linregress(df_series.index, df_series.values)
    return slope

In [None]:
# Make an empty dataframe
MM_df = pd.DataFrame()

In [None]:
# Apply function to get slopes and save in empty dataframe
MM_df['slopes'] = AP_kinetics_df_transpose.apply(linregress_column)

In [None]:
MM_df
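
To check that `apply` with a slope function behaves as expected, one option is to try it on synthetic data whose slopes are known in advance. This is a minimal sketch: the time points, the column labels, and the names `toy_df`, `slope_of`, and `toy_slopes` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def slope_of(series):
    # x-values come from the index, y-values from the column
    return stats.linregress(series.index, series.values).slope

# Hypothetical time points (min) and two perfectly linear traces
time = np.array([0.25, 0.5, 0.75, 1.0])
toy_df = pd.DataFrame({20.0: 0.30 * time, 2.0: 0.20 * time}, index=time)

# apply() calls slope_of once per column and returns a Series of slopes
toy_slopes = toy_df.apply(slope_of)
print(toy_slopes)
```

The slopes that come back should match the values used to build each column (0.30 and 0.20), labeled by the column headers.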

In [None]:
# Calculate initial velocities in µM/min
# 15.0 mM^-1 cm^-1 is the same as 0.015 µM^-1 cm^-1 (this assumes a 1 cm path length)
MM_df['initial velocities'] = MM_df['slopes'] / 0.015
MM_df

In [None]:
MM_df.to_csv('MM_data.csv')

We will use this dataframe to perform the nonlinear regression fit with the SciPy library in part 2 of this lesson. To save these data for part 2, we need to write them to a csv file in our data directory.

In [None]:
MM_df.to_csv('biochemist-python/chapters/data/MM_data.csv')
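
When the file is read back in, the concentrations that served as the index are stored as the first column of the csv file. The sketch below shows one way to restore them, using a small stand-in dataframe and a temporary folder so that nothing in the data directory is touched; the names `toy_MM_df` and `toy_MM_data.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for MM_df, indexed by substrate concentration
toy_MM_df = pd.DataFrame({'slopes': [0.30, 0.20]},
                         index=pd.Index([20.0, 2.0], name='pNPP (mM)'))

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmpdir:
    outfile = os.path.join(tmpdir, 'toy_MM_data.csv')
    toy_MM_df.to_csv(outfile)
    # index_col=0 restores the concentrations as the index
    reloaded_df = pd.read_csv(outfile, index_col=0)

print(reloaded_df)
```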

<div class="exercise-this-is-a-title exercise">
<p class="exercise-title">Check your understanding</p>
    <p>You will find an Excel file in your data folder, chymotrypsin_kinetics.xlsx, with some kinetic data from a chymotrypsin experiment. Apply the principles above to create dataframes and a .csv file for creating a Michaelis-Menten plot with these data. Under these assay conditions the extinction coefficient for p-nitrophenol is 18,320 M<sup>-1</sup>cm<sup>-1</sup>.</p>

```{admonition} Hint
:class: dropdown
    You will need to get the data into a layout and file format that is easily read by pandas. 
    <ul>
        <li>Delete the first seven lines of the Excel file.</li>
        <li>Delete the first column of the Excel file.</li>
        <li>Save the file as chymotrypsin_kinetics.csv.</li>
        <li>Your data should look something like this:
        <img src="biochemist-python/chapters/images/csv_image.png" alt="csv image"></li>
    </ul>
```

```{admonition} Solution
:class: dropdown
    
    
    
```
    
</div>


In [None]:
import os 
import pandas as pd 
import numpy as np 
from scipy import stats 
datafile = os.path.join('biochemist-python', 'chapters', 'data', 'chymotrypsin_kinetics.csv') # file path created
chymo_rates_df = pd.read_csv(datafile)

def slope_only(xdata, ydata):  # SciPy linregress has five outputs; I only want the slope
    slope, intercept, rvalue, pvalue, stderr = stats.linregress(xdata, ydata)
    return slope

slope_list = []  # setting up a list to contain the slope values
for i in range(0, len(chymo_rates_df)):  # looping through the pandas dataframe. Is there a better way to do this?
    xdata = chymo_rates_df.columns.values[2:len(chymo_rates_df.columns)].astype('float64')
    ydata = chymo_rates_df.iloc[i, 2:len(chymo_rates_df.columns)]
    slope = slope_only(xdata, ydata)
    slope_list.append(slope)

chymo_MM_df = pd.DataFrame(chymo_rates_df, columns = ['[pNPA] (mM)'])      
chymo_MM_df
chymo_MM_df['slopes'] = slope_list 
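# 18,320 M^-1 cm^-1 is the same as 18.32 mM^-1 cm^-1, so these velocities are in mM/min
chymo_MM_df['Initial Velocities'] = chymo_MM_df['slopes'] / 18.32 
chymo_MM_df
chymo_MM_df.to_csv('biochemist-python/chapters/data/chymo_MM_data.csv')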
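
On the question in the loop comment above: one possible alternative is to let `apply` walk through the rows with `axis=1`, so that no explicit `for` loop or list is needed. This is only a sketch on a made-up dataframe; the column names and values are hypothetical and do not come from the chymotrypsin file.

```python
import pandas as pd
from scipy import stats

# Hypothetical layout: one concentration column, then string time headers
toy_df = pd.DataFrame({'[S] (mM)': [1.0, 2.0],
                       '0.5': [0.05, 0.10],
                       '1.0': [0.10, 0.20],
                       '1.5': [0.15, 0.30]})

time_cols = toy_df.columns[1:]          # every column after the concentrations
xdata = [float(c) for c in time_cols]   # time headers converted to numbers

# axis=1 hands each row to the function as a Series
toy_df['slopes'] = toy_df[time_cols].apply(
    lambda row: stats.linregress(xdata, row.values.astype('float64')).slope,
    axis=1)
print(toy_df[['[S] (mM)', 'slopes']])
```

Transposing the dataframe and applying the function column by column, as in the alkaline phosphatase example, would work equally well.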
