Linear Regression
===============

<div class="overview-this-is-a-title overview">
<h2 class="overview-title">Overview</h2>
    
<p>Questions</p>
    <ul>
        <li>How can I perform linear regression and obtain fit statistics in Python?</li>
    </ul>
<p>Objectives:</p>
    <ul>
        <li>Use a pandas dataframe for data analysis</li>
        <li>Perform linear regression on the data and obtain best fit statistics</li>
    </ul>
<p>Keypoints:</p>
    <ul>
        <li>Use pandas to create dataframes from csv formatted data.</li>
        <li>Use SciPy functions to perform linear regression with statistical output.</li>
    </ul>
</div>

## Why Linear Regression?
When I was a biochemistry grad student, we almost always manipulated our data into a linear format so that we could do linear regression on our handheld calculators. The most prominent example was the manipulation of enzyme kinetic data for Lineweaver-Burk or Eadie-Hofstee plots, so that we could determine the kinetic parameters (**Note**: we'll actually do non-linear curve fitting for enzyme kinetics in a future lesson). I also remember doing semi-log plots of enzyme inactivation because they were linear. Now we have many more options by using Python in Jupyter notebooks.

However, some data can still be analyzed by simple linear regression. Perhaps the most common case is the protein assay. Whether you use Lowry, Bradford or BCA methods, it is still most common to use a linear regression fit to the results.

In this module, we will explore linear regression in Jupyter notebooks using Python. Please keep in mind - this is just a beginning. If you take a course in data science, you are likely to encounter a much deeper look at linear regression where you explore the relationships among many variables in a dataset.

We're going to build on the previous lesson, using the pandas library to import the data for this lesson. Then, we will perform the linear regression using the scipy library. 

## Libraries you will need

In previous lessons, we have used `os`, `numpy`, and `pandas`. In this lesson, we will add the SciPy library. We will use the dot notation we introduced earlier to access the functions in these libraries. I have expanded our table of libraries to help you keep track of your tools. Please note that the abbreviations listed are just the most common. You can actually define any abbreviation you like, but it is best to use the conventional abbreviations. This will help future coders (including yourself six months from now) as they work with the code. 

| Library | Uses | Abbreviation |
| :------- | :----: | :------------: |
| os | file management in operating systems | os |
| numpy | calculations | np  | 
| pandas | data management | pd |
| scipy | calculations and statistics | sc or sp | 


### Stages of this module
1. Importing the correct libraries
1. Importing data with pandas
1. Running simple linear regression to print out the desired statistics

The first part of the module is modeled after an exercise in Charlie Weiss's excellent online textbook, *Scientific Computing for Chemists*, which you can find on his GitHub site, [SciCompforChemists](https://github.com/weisscharlesj/SciCompforChemists).

In [None]:
# Import the libraries you need
import os
import pandas as pd
import scipy as sp

## Importing data with pandas
In this lesson you will use `os` and `pandas` to import your data. This is a very straightforward case, but the same code could be applied to much more complex situations. When using pandas, people often append \_df to the end of a variable name as a reminder that it is a dataframe, so the dataframe that contains our results might be called `results_df`. The code in the next cell creates the dataframe for this lesson. Note that `DataFrame` is the name of the pandas class that builds dataframes (`pd.DataFrame`). It is not a reserved word in Python, but you should still avoid using it as a variable name so that it is not confused with the class.

In [None]:
protein_file = os.path.join('data', 'protein_assay.csv') # gives a path to your csv file
results_df = pd.read_csv(protein_file) # use pandas to read the csv file into a dataframe
results_df  # display your dataframe
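
Before selecting columns, it can help to check what pandas actually read. The sketch below uses a small in-memory CSV in place of `data/protein_assay.csv`. The column headers match the ones used in this lesson, but the numbers and the name `example_df` are invented for illustration and are not the lesson's data.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/protein_assay.csv.
# The values below are invented for illustration only.
csv_text = """mass (ug),A-595
0,0.00
2,0.11
4,0.20
6,0.31
8,0.39
10,0.50
"""

# read_csv accepts a file path or a file-like object such as StringIO
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.head())     # first five rows
print(example_df.columns)    # column headers, needed to select columns later
print(example_df.dtypes)     # data type pandas inferred for each column
print(example_df.shape)      # (number of rows, number of columns)
```

With your own file, the same four calls on `results_df` show the headers and confirm that the numeric columns were read as numbers rather than text.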

In a dataframe, you can refer to the data in a column just by the column header. These are strings, so you need to put quotes around them when you use them. In pandas, the term "Series" is often used to refer to the data in a single column of a dataframe. So we are going to use the two column headers to define the data for our linear regression.

In [None]:
xdata = results_df['mass (ug)']  # Setting the x values
ydata = results_df['A-595']  # setting the y values
print(xdata,ydata) # checking to make sure everything is in place
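
As a quick illustration of column selection, the sketch below builds a small dataframe by hand and pulls out one column as a Series. The values and the names `example_df` and `xdata_example` are hypothetical. Only the column headers are taken from this lesson.

```python
import pandas as pd

# Hypothetical values for illustration only
example_df = pd.DataFrame({
    'mass (ug)': [0, 2, 4, 6, 8, 10],
    'A-595': [0.00, 0.11, 0.20, 0.31, 0.39, 0.50],
})

# Selecting one column by its header returns a pandas Series
xdata_example = example_df['mass (ug)']
print(type(xdata_example))
print(xdata_example.to_numpy())    # the underlying values as a NumPy array
print(example_df['A-595'].max())   # Series come with summary methods

# The header must match exactly, including spaces and units
try:
    example_df['mass']
except KeyError as err:
    print("No such column:", err)
```

The `KeyError` at the end is the error you are most likely to see if a header is mistyped, which is one reason to display the dataframe before selecting columns.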

## Linear Regression with SciPy
Now that the data are in place, we simply need to import the stats module from SciPy to perform the linear regression. 

Notice that the `linregress` function in `scipy.stats` accepts the two data series and returns five values: slope, intercept, r-value, p-value, and the standard error of the slope. In the next cell, all of those values are assigned to variables in a single command. Then the results are presented with a series of print statements.

In [None]:
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(xdata, ydata)
print("Slope = ", slope, " per microgram", sep = "")  # x is mass in ug, so the slope is absorbance per ug
print("Intercept = ", intercept)
print("R-squared = ", r_value**2)
print("P value = ", p_value)
print("Standard error = ", std_err)
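
A protein assay standard curve is usually fitted so that an unknown sample can be read off it. The sketch below shows one way that might look, assuming hypothetical standard-curve values and a hypothetical unknown absorbance (`mass_ug`, `a595`, and `unknown_a595` are invented names and numbers). It also reads the statistics from the object returned by `linregress` instead of unpacking them, and checks the slope against the textbook least-squares formula.

```python
import numpy as np
from scipy import stats

# Hypothetical standard-curve data, invented for illustration only
mass_ug = np.array([0, 2, 4, 6, 8, 10], dtype=float)
a595 = np.array([0.00, 0.11, 0.20, 0.31, 0.39, 0.50])

# linregress returns a result object; the values can be read by name
result = stats.linregress(mass_ug, a595)
print(f"Slope = {result.slope:.4f} per microgram")
print(f"Intercept = {result.intercept:.4f}")
print(f"R-squared = {result.rvalue**2:.4f}")
print(f"Standard error of slope = {result.stderr:.5f}")

# Check: least-squares slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
x_dev = mass_ug - mass_ug.mean()
y_dev = a595 - a595.mean()
slope_manual = np.sum(x_dev * y_dev) / np.sum(x_dev**2)
print(f"Slope by hand = {slope_manual:.4f}")

# Invert y = slope * x + intercept to estimate an unknown sample
unknown_a595 = 0.25  # hypothetical absorbance reading
unknown_mass = (unknown_a595 - result.intercept) / result.slope
print(f"Estimated mass = {unknown_mass:.2f} ug")
```

The estimate is only meaningful for absorbances that fall within the range covered by the standards.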

<div class="exercise-this-is-a-title exercise">
<p class="exercise-title">Exercise</p>
    <p> Now that you have completed a simple linear regression exercise with protein assay data, here is a problem with a slightly larger dataset, taken from a ground water survey of wells in Texas, kindly provided by <a href="https://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/slr/frames/frame.html">Houghton-Mifflin</a>. Using the skills you have learned with pandas and SciPy, get the linear regression statistics for the relationship between pH (dependent variable) and bicarbonate levels (ppm in well water in Texas; independent variable). The data are available in the file <code>Ground_water.csv</code> in the data folder. Once complete, your output should look like this:</p>
        
<p>Slope =  -0.0030521595419827677 <br>
Intercept =  8.097595134597833 <br>
R-squared =  0.1152673937227531 <br>
P value =  0.04948248037131796 <br>
Standard error =  0.0014948066523110296
</p>

```{admonition} Hint
:class: dropdown

You will need to use a multi-step process to complete this exercise.
1. Use os to set the file path.
1. Use pandas to create a dataframe from the csv file. 
1. Explore the file to find the column headers for your data. 
1. Assign your independent and dependent variables.
1. Use `scipy.stats` to perform linear regression.
```
    
```{admonition} Solution
:class: dropdown
    import os
    import pandas as pd
    import scipy as sp
    water_file = os.path.join('data', 'Ground_water.csv')
    water_df = pd.read_csv(water_file)
    water_df.head() # to see the headers for each series (column)
    xdata = water_df['Bicarb']
    ydata = water_df['pH']
    from scipy import stats
    slope, intercept, r_value, p_value, std_err = stats.linregress(xdata, ydata)
    print("Slope = ", slope)
    print("Intercept = ", intercept)
    print("R-squared = ", r_value**2)
    print("P value = ", p_value)
    print("Standard error = ", std_err)
```
    
</div>