Working with Pandas
=================

<div class="overview-this-is-a-title overview">
<h2 class="overview-title">Overview</h2>
    
<p>Questions</p>
    <ul>
        <li>How can I import data for analysis in my notebook?</li>
    </ul>
<p>Objectives:</p>
    <ul>
        <li>Import the pandas library.</li>
        <li>Use pandas library functions to import data from a csv file.</li>
        <li>Import data from a .csv formatted file.</li>
        <li>Perform linear regression on the data and obtain best fit statistics.</li>
        <li>Create a plot of your data that includes uncertainty.</li>
    </ul>
<p>Keypoints:</p>
    <ul>
        <li>Use the pandas library to create dataframes from csv formatted data.</li>
        <li>Use SciPy functions to perform linear regression with statistical output.</li>
        <li>Use matplotlib and seaborn to prepare a plot with data and a best fit line.</li>
    </ul>
</div>

## Why Linear Regression?
When I was a biochemistry grad student, we almost always manipulated our data into a linear format so that we could do linear regression on our handheld calculators. The most prominent example was the manipulation of enzyme kinetic data for Lineweaver-Burk or Eadie-Hofstee plots, so that we could determine the kinetic parameters (**Note**: we'll actually do non-linear curve fitting for enzyme kinetics in the next module). I also remember doing semi-log plots of enzyme inactivation because they were linear. Now we have many more options, especially with Jupyter notebooks.

However, some data can still be analyzed by simple linear regression. Perhaps the most common case is the protein assay. Whether you use Lowry, Bradford or BCA methods, it is still most common to use a linear regression fit to the results.

In this module, we will explore linear regression in Jupyter notebooks using Python. Please keep in mind that this is just a beginning. If you take a course in data science, you are likely to encounter a much deeper look at linear regression and trend forecasting.

We'll begin by looking at the libraries we will need to use, including a few new ones. Then we will use one of these libraries (pandas) to import the data for this module. Next, we will perform the linear regression with two different approaches, using scipy and seaborn. Finally, we will learn to plot the data using matplotlib.pyplot and seaborn.

## Libraries you will need

To perform linear regression, we will need to import a Python **library**. A **library** is a collection of modules, each containing related functions that can be used to complete specific tasks. Using libraries in Python reduces the amount of code you have to write. A function usually takes some type of input and returns a particular output. To use a function that is in a library, you use the dot notation introduced in the previous lesson.
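As a minimal sketch of the dot notation, the example below calls one function from the standard `math` library and one from `os`. The file name `example.csv` is hypothetical and is used only for illustration.

```python
import math
import os

# library.function(input) returns an output
root = math.sqrt(16.0)
print(root)

# Nested modules chain with additional dots: os -> path -> join
example_path = os.path.join("data", "example.csv")  # hypothetical file name
print(example_path)
```

The pattern is always the same: the name of the library (or module), a dot, then the function you want to call.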

In the last lesson, we imported the `os` library, which makes it easy to assign the location of a file to a single variable (e.g., `datafile`), rather than having to type the full path every time (e.g., Users/user_name/Desktop/python-scripting-biochemistry/biochemist-python/chapters/data). In this lesson, we will be using the `numpy`, `scipy`, `pandas`, `matplotlib`, and `seaborn` libraries, which are described briefly in the table below.

| Library | Uses | Abbreviation |
| :------- | :----: | :------------: |
| numpy | calculations | np  | 
| SciPy | calculations and statistics | sp or sc |
| pandas | data management | pd |
| matplotlib.pyplot | creating plots | plt |
| seaborn | higher level plotting | sns |
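The abbreviations in the table are conventions rather than requirements. A short sketch of how two of them are used in practice follows; the array values are made up for illustration, and the plotting libraries are shown only as comments because they are not needed for this small example.

```python
import numpy as np
import pandas as pd

# The plotting libraries follow the same "import ... as ..." pattern:
#   import matplotlib.pyplot as plt
#   import seaborn as sns

# Made-up numbers, used only to show the abbreviations in action
values = np.array([1.0, 2.0, 3.0])
print(np.mean(values))      # numpy function called through the np abbreviation

series = pd.Series(values)  # pandas object created through the pd abbreviation
print(series.mean())
```

Once a library is imported with an abbreviation, every call to it uses that abbreviation before the dot.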


### Stages of this module
1. Importing the correct libraries
1. Importing data with pandas
1. Running simple linear regression
1. Plotting the data with matplotlib.pyplot
1. Plotting the data with seaborn

### Practice with pandas

The first part of the module is modeled after an exercise in Charlie Weiss's excellent online textbook, *Scientific Computing for Chemists*, which you can find on his GitHub site, [SciCompforChemists](https://github.com/weisscharlesj/SciCompforChemists).

In [1]:
import pandas as pd

In [8]:
import os

In [34]:
ls data

COVID-19_proteins.csv       protein_assay2.xlsx
PDB_files/                  remdesivir.pdb
enzymes/                    remdesivir.sdf
protein_assay.csv           remdesivir.xyz
protein_assay.xlsx          thrombin_with_ligands.csv
protein_assay2.csv          thrombin_with_ligands.xlsx


In [29]:
# Build the path to the csv file from its folder and file name
thrombin_file = os.path.join('data', 'thrombin_with_ligands.csv')
print(thrombin_file)

data/thrombin_with_ligands.csv


In [25]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file in

In [48]:
# Read the csv file into a pandas dataframe
thrombin_df = pd.read_csv(thrombin_file)
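If you want to try `pd.read_csv` without the lesson's data folder, a minimal sketch is to read csv text held in memory. The two rows below are made up for illustration (they are not entries from `thrombin_with_ligands.csv`), and `demo_df` is a hypothetical name.

```python
import io

import pandas as pd

# Made-up csv text: a header line followed by two data rows
csv_text = "PDB ID,Resolution\nAAAA,1.5\nBBBB,2.0\n"

# read_csv accepts a file path or a file-like object such as StringIO
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.shape)   # (number of rows, number of columns)
print(demo_df.dtypes)  # pandas infers a data type for each column
```

The first line of the csv text becomes the column names, and pandas works out that the `Resolution` column holds floating-point numbers.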

In [47]:
# Select a single column by its name
thrombin_df['Resolution']

0     1.90
1     1.43
2     1.80
3     1.55
4     1.27
5     1.30
6     2.80
7     2.50
8     2.14
9     2.20
10    2.25
Name: Resolution, dtype: float64
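Selecting one column with square brackets gives back a pandas Series, which has its own methods for quick summaries. The sketch below uses a small made-up dataframe (`demo_df` and its values are hypothetical, not the thrombin data).

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame(
    {"PDB ID": ["AAAA", "BBBB", "CCCC"], "Resolution": [1.5, 2.0, 2.5]}
)

resolution = demo_df["Resolution"]  # a single column is a Series
print(type(resolution).__name__)

# Series methods summarize the column
print(resolution.mean())
print(resolution.min(), resolution.max())
```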

In [65]:
# iloc selects by integer position: row 2, column 5 (both counted from 0)
thrombin_df.iloc[2,5]

'D-leucyl-N-(4-carbamimidoylbenzyl)-L-prolinamide'

In [69]:
# loc selects by label: the row labeled 3 and the column named 'Resolution'
thrombin_df.loc[3,'Resolution']

1.55
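The difference between `iloc` and `loc` is easiest to see when the row labels are not simply 0, 1, 2. The sketch below uses a made-up dataframe with row labels 10, 20, and 30 (the names and values are hypothetical).

```python
import pandas as pd

# Made-up data with non-default row labels
demo_df = pd.DataFrame(
    {"PDB ID": ["AAAA", "BBBB", "CCCC"], "Resolution": [1.5, 2.0, 2.5]},
    index=[10, 20, 30],
)

# iloc uses integer positions: second row, second column
by_position = demo_df.iloc[1, 1]

# loc uses labels: the row labeled 20, the column named "Resolution"
by_label = demo_df.loc[20, "Resolution"]

print(by_position, by_label)
```

Both lines pick out the same cell. In the thrombin dataframe the row labels happen to match the row positions, so the two approaches look more alike there than they really are.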

In [70]:
# Show the first five rows of the dataframe
thrombin_df.head()

|  | PDB ID | Method | Resolution | Structure | Ligand ID | Ligand name |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 3SHC | X-RAY DIFFRACTION | 1.9 | Human Thrombin | B01 | D-phenylalanyl-N-[(4-chloropyridin-2-yl)methyl... |
| 1 | 3P17 | X-RAY DIFFRACTION | 1.43 | Thrombin | 99P | D-phenylalanyl-N-(pyridin-3-ylmethyl)-L-prolin... |
| 2 | 2ZNK | X-RAY DIFFRACTION | 1.8 | Thrombin | 31U | D-leucyl-N-(4-carbamimidoylbenzyl)-L-prolinamide |
| 3 | 3SI3 | X-RAY DIFFRACTION | 1.55 | Human Thrombin | B03 | D-phenylalanyl-N-(pyridin-2-ylmethyl)-L-prolin... |
| 4 | 3SI4 | X-RAY DIFFRACTION | 1.27 | Human Thrombin | B04 | D-phenylalanyl-N-[(1-methylpyridinium-2-yl)met... |


In [71]:
# Show the last five rows of the dataframe
thrombin_df.tail()

|  | PDB ID | Method | Resolution | Structure | Ligand ID | Ligand name |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 6 | 1UVU | X-RAY DIFFRACTION | 2.8 | Bovine Thrombin | DCH | 3-(7-DIAMINOMETHYL-NAPHTHALEN-2-YL)-PROPIONIC ... |
| 7 | 1UVT | X-RAY DIFFRACTION | 2.5 | Bovine Thrombin | I48 | N-{3-METHYL-5-[2-(PYRIDIN-4-YLAMINO)-ETHOXY]-P... |
| 8 | 2C8Z | X-RAY DIFFRACTION | 2.14 | Thrombin | C2A | 1-(3-CHLOROPHENYL)METHANAMINE |
| 9 | 2C8Y | X-RAY DIFFRACTION | 2.2 | Thrombin | C3M | N-[(2R,3S)-3-AMINO-2-HYDROXY-4-PHENYLBUTYL]NAP... |
| 10 | 2C90 | X-RAY DIFFRACTION | 2.25 | Thrombin | C1M | 1-(4-CHLOROPHENYL)-1H-TETRAZOLE |
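Both `head` and `tail` return five rows by default, and both accept a number if you want more or fewer. A minimal sketch with a made-up eleven-row dataframe (`demo_df` is a hypothetical name):

```python
import pandas as pd

# Made-up dataframe with eleven rows, values 0 through 10
demo_df = pd.DataFrame({"value": range(11)})

default_head = demo_df.head()  # no argument: first five rows
first_three = demo_df.head(3)  # first three rows
last_two = demo_df.tail(2)     # last two rows

print(first_three)
print(last_two)
```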
