# Intro to Python in Chemistry

This lesson and subsequent ones aim to teach you about data visualization, data tidying, statistics, and a bit of Python coding. If you don't know how to code, don't worry! These lessons assume no prior knowledge of code or Python.

A few things to start:

1.   These lessons only work in Google Chrome.
2.   If you want to save your progress, go to File > Save a copy in Drive, then choose a location in your Drive folder.
3.   You can save these notebooks and run them offline as Jupyter Notebooks.

If you have questions, feel free to contact Dr. Chris Berndsen in the JMU Chemistry Department.

# Introduction to Python


Python is a programming language with many uses in and out of science, including statistical analysis and data visualization in chemistry and biochemistry. While Microsoft Excel and Google Sheets can perform many of the tasks that we will learn to do in Python, these programs are limited in their ability to work with and visualize large or complex data sets. Moreover, much of Excel and Sheets is designed for simplicity over customization.

While Python may be difficult to use in the beginning, learning a bit of computational thinking and practicing problem-solving skills pays off outside of the chemistry world too. Through these modules you will explore Python programming and develop your skills in data analysis and visualization.


---
# Definitions
We start with a few definitions to help you succeed in learning Python and completing the tasks.

A **code chunk** is what you see below:

In [None]:
## comments look like this
code looks like this

These code chunks show examples of code along with **comments**. A comment starts with the `#` symbol; in these lessons, comments usually start with two of them (`##`). The computer does not read comment lines; they are there for you, the reader, to read and learn from. As we move forward in these modules, comments will also serve as notes to future you and your instructor as you explain what you are doing. *Think of comments like your lab notebook: they are a way to keep notes on what you are doing and what you observe.*


---

There will also be interactive chunks of code like the one shown below:

In [None]:
## This chunk of code is interactive and you can type into these blocks

## Substitute your name in the blank space.
print("Hello, ________. You just ran Python code!")

## Press the Run Code button at the left side of this chunk to see the result

## Assigning Values

A fundamental skill in Python is assigning a value, which allows you to refer to one or more pieces of data in a simple way. To assign a value, first pick a name like x or df and then type an equals sign `=`. Then assign the value.

For example, let's assign the value of 12 to the letter w. We would type `w = 12` and then press the "Run" button to let Python know that w has a value of 12. Try the exercise below and assign values as instructed.
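As a minimal sketch of assignment, the lines below store 12 under the name `w` and then reuse it. The name `doubled` is only an illustration and is not used elsewhere in these lessons.

```python
# assign the value 12 to the name w
w = 12

# a name can be reused in later calculations
doubled = w * 2

print(w)        # shows the value stored in w
print(doubled)  # shows the result of the calculation
```

Assigning a new value to `w` later would replace the old one, so it helps to pick names that describe what they hold.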

In [None]:
## assign 3 to the letter t

## type t, then press Run Code button
t

## Why assign values rather than just refer to the number?

At first glance, value assignment seems ridiculous. Why refer to 3 as t, rather than use 3 in an equation? For simple cases this objection is fair. However, we often want to look at a set of data and assign multiple values to a single name. For example, a student measures the pH of one solution five times to determine the accuracy and precision of their measurement. The values are 4.9, 5.1, 5.1, 4.8, and 5.0. To calculate the average value and the error, you could type each number into a calculator or the Python console and do all the math. Alternatively, you could assign these values to a single name and simplify your typing. In doing this, you save yourself some effort and reduce the opportunities for errors.

### Naming multiple values

So, how do we assign multiple values to a name? First, you select your name (e.g. `ph`), type your `=`, then type `[]`. The `[]` takes multiple inputs and combines them into a single unit, which we name. The values are entered between the `[]`, separated by commas. If we take the pH values from the above section and put them into `[]`, our result is:

`ph = [4.9, 5.1, 5.1, 4.8, 5.0]`

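The `ph` object can now be used to calculate statistical values without having to type and re-type these numbers every time. The formal Python name for the `ph` object is a **list**, although these lessons also refer to it as an array. A list can technically hold a mix of data types, but for calculations you should keep all the values the same type, meaning all numbers or all text. The NumPy arrays introduced later in this lesson store a single data type.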
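The short sketch below stores the pH readings under one name and then inspects them. It only uses built-in Python functions, and the choice of `max()` and `min()` is just for illustration.

```python
# the pH readings from the text, stored under one name
ph = [4.9, 5.1, 5.1, 4.8, 5.0]

print(type(ph))          # reports what kind of object ph is
print(ph[0])             # first value; Python starts counting at 0
print(max(ph), min(ph))  # largest and smallest readings
```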

In [None]:
## assign mass the values of 120.1, 122.0, 121.6, 120.1, and 127

## type mass



Now, calculate the average and standard deviation of this group of numbers.

In [None]:
## calculate the length (number of values) of mass with the len() function
## put mass inside the parentheses before running
len()

## use the sum() function to calculate the sum of mass


The above code is our first step toward programming and statistical analysis, as we used *functions* to perform math on our array. How could we use the above code to calculate the average mass value?

### Mathematical calculations

In addition to functions, you can do math on both numbers and named objects with one or more values. The language to do mathematical calculations is similar to that used in Google Sheets and Microsoft Excel.

  - `*` is for multiplication
  - `/` is for division
  - `+` is for addition
  - `-` is for subtraction
  - `**` raises the number to a power

If we want to add two numbers together, we type `2 + 14` and then run the code to produce `16`.
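The operators listed above can be tried out directly. The numbers below are arbitrary and chosen only for illustration.

```python
print(2 + 14)    # addition
print(20 - 4)    # subtraction
print(4 * 4)     # multiplication
print(32 / 2)    # division; the result is a decimal number (float)
print(4 ** 2)    # 4 raised to the power of 2

x = 8
print(x * 2)     # named values work the same way as numbers
```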



In [None]:
# calculate the average value in the mass array using the sum() and len() functions plus a math function


If you produced a value of ~122 you did your first bit of Python programming!



---
Maybe we want to convert the numbers in mass from grams to kilograms. Unfortunately, because mass is a plain Python list, we can't just type `mass/1000` and get an answer. We need to iterate through the list of numbers in mass. An example is shown below:


In [None]:
# create the loop parameters
for i in mass:
  # do the math
  kg = i/1000
  # report out the results, kg is the number and we follow that with the units
  print(kg, "kg")

The above is called a *loop*, as we wrote commands for Python to loop through the array and do something with each value. These types of loops are used frequently in data analysis and do the same job as dragging a formula down a column in Excel or Sheets to create a new column of numbers.
While it is a lot of code to write to divide by 1000, sometimes you just have to write some code.
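The sketch below shows both points with made-up volumes: dividing a plain list directly raises an error, while a loop can convert each value and collect the results in a new list. The names `volumes_ml` and `volumes_l` are hypothetical and not part of the lesson data.

```python
# illustrative volumes in mL (not data from the lesson)
volumes_ml = [25.0, 10.0, 50.0]

# dividing the whole list at once is not allowed for a plain list
try:
    volumes_ml / 1000
except TypeError as err:
    print("Python says:", err)

# loop through the list and collect each converted value
volumes_l = []
for v in volumes_ml:
    volumes_l.append(v / 1000)

print(volumes_l)
```

Collecting the results with `append` keeps the converted numbers available for later steps, rather than only printing them.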

---
Let's combine some ideas together in a quick check of knowledge.

In [None]:
# create the array `abs`, which consists of 5 absorbance measurements
# 0.23, 0.21, 0.212, 0.223, 0.23

# calculate the average absorbance value


# convert absorbance to molarity using Beer's law in a for loop
# pathlength = 0.3 cm
# extinction coefficient = 1250 M-1 cm-1
# hint there are a few ways that this can be done, but adapting the layout above is a good start



For loops are fun to write (when they work), but there are ways to speed up the analysis and reduce the complexity of the code you write.



---


## NumPy

Just as we download software or apps onto devices to do new things, we can customize Python with add-ons called packages or libraries. These packages contain ready-made functions that can do math or handle data in ways that make our work easier. The code below tells Python to use the NumPy package and that we will prefix any NumPy command with `np`. An example of calculating an average is shown.

In [None]:
import numpy as np

# calculate the average of mass
np.mean(mass)

That was so much easier! We can even do math on the individual elements of the array without a for loop. Try it below:

In [None]:
# make a np.array called mass with the values of 120.1, 122.0, 121.6, 120.1, and 127
mass = np.array([120.1, 122.0, 121.6, 120.1, 127])

# divide each value in mass by 1000, but do not use a loop, just divide mass by 1000


This is the real power of packages: using others' work and code to reduce how much we have to do.
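As a small sketch of this element-by-element math, the example below converts made-up temperatures from degrees Celsius to kelvin. The names `temp_c` and `temp_k` are hypothetical.

```python
import numpy as np

# illustrative temperatures in degrees Celsius (not data from the lesson)
temp_c = np.array([20.0, 25.0, 37.0])

# the addition is applied to every element, so no loop is needed
temp_k = temp_c + 273.15

print(temp_k)
```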

Take a moment to search the internet on how to carry out the math requested below using the numpy package. In some cases, numpy is spelled out and others it is abbreviated using np, like we have done above.

In [None]:
# create a np.array called abs which consists of 5 absorbance measurements
# 0.23, 0.21, 0.212, 0.223, 0.23

# calculate the average or mean value of abs


# calculate the standard deviation of abs


# convert abs to molar using beer's law
# pathlength = 0.3 cm
# extinction coefficient = 1250 M-1 cm-1



Now that we have learned a bit of Python programming, we can begin to explore data and plotting the data in informative ways.
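For reference, Beer's law relates absorbance to concentration as A = ε × l × c, so the concentration is c = A / (ε × l). The sketch below applies that rearrangement to made-up numbers; the absorbance values, the extinction coefficient, and the pathlength are all assumptions chosen for easy arithmetic. The array is named `absorbance` here because `abs` is also the name of a built-in Python function.

```python
import numpy as np

# illustrative values only (not the exercise data)
absorbance = np.array([0.10, 0.25, 0.50])
epsilon = 5000.0     # extinction coefficient in M-1 cm-1 (assumed)
pathlength = 1.0     # pathlength in cm (assumed)

# Beer's law rearranged for concentration: c = A / (epsilon * l)
conc = absorbance / (epsilon * pathlength)

print(conc)  # concentrations in molar (M)
```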


---
## Data frames
In chemistry and biochemistry, we often are observing the effect of something on an outcome. The effect of reactant concentration on a reaction rate or how temperature changes the structure of a protein are good examples. In both of these examples, there are two sets of values to be considered and we might structure data in a table:


In [None]:
import pandas as pd
import numpy as np

## create two arrays
## one called concentration and one called rate
concentration = np.array([0, 1, 2, 4, 8, 16, 32, 64, 128])
rate = np.array([0, 0.1, 0.24, 0.37, 0.82, 1.55, 3.1, 6.7, 12.2])

df = pd.DataFrame(zip(concentration, rate),
                  columns=["concentration", "rate"])
df

The table above is structured more like we see in a spreadsheet and how we will work with data for the most part in the future. *But* we can treat each column like an array above by referring to it as below:


In [None]:
# treat it like an array
rate_arr = df['rate'].multiply(3)
print(rate_arr)

# put result in a new column
df['rate*3'] = df['rate'].multiply(3)
print(df)

# put result in the original column
df['rate'] = df['rate'].multiply(3)
print(df)

Notice how we can create new columns or modify old ones with only small changes to the code. This is a nice feature but can result in lost data and time if you are not careful with naming. A good tip is to make a copy of your data in the code using `df.copy()` and work with the copy so that you can get back to the original data if you make a mistake.
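A minimal sketch of the copy tip, using a small made-up table. The names `original` and `working` are hypothetical.

```python
import pandas as pd

# small illustrative table (not data from the lesson)
original = pd.DataFrame({"rate": [0.1, 0.2, 0.4]})

# work on an independent copy so the original stays unchanged
working = original.copy()
working["rate"] = working["rate"].multiply(3)

print(original)  # still holds the starting values
print(working)   # holds the modified values
```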

In addition to creating the data frame in Python, you could enter the data into Microsoft Excel or Google Sheets as a spreadsheet and then import the file into Python. We will use both methods eventually.
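One common route for importing is to save the spreadsheet as a CSV file and read it with `pd.read_csv`. The sketch below uses a few lines of text as a stand-in for a real file so that it runs on its own; the file name mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# stand-in for a spreadsheet saved as a CSV file; with a real file you
# would pass its name instead, e.g. pd.read_csv("my_data.csv")
csv_text = io.StringIO(
    "concentration,rate\n"
    "0,0\n"
    "1,0.1\n"
    "2,0.24\n"
)

# the first line of the file becomes the column names
df_from_file = pd.read_csv(csv_text)
print(df_from_file)
```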

---

### Tidy tables and data

Data frames are only useful forms of data if they are constructed carefully and **tidily**. In a tidy table, each column contains observations of a single type and each row represents a single observation. An untidy table is shown below:

In [None]:
#@markdown press play to see the messy table

concentration = np.array(["0", "1 M", "2000 mM", "4.00001", "8", "16", "32 M", "64 M", "128 M"])
rate = np.array(["0", "0.1 nmol/sec", "0.24 U/sec", "0.37", "0.82", "1.55", "3.1", "6.7", "12.2"])
color = np.array(["red", "red", "pink", "pink-ish", "cloudy but pink", "pink", "clear", "transparent", "light pink"])
measurements = np.array(["2 replicates", "3", "3", "2 + 1", "4", "6", "3", "3", "3"])

mess = pd.DataFrame(zip(concentration, rate, color, measurements),
                  columns=["concentration", "rate", "color", "measurements"])

mess

***The table above has issues!***

  1. The *concentration* and *rate* columns have issues with numbers with and without units.

  2. The *concentration* and *rate* columns show different units of measurement.

  3. The *color* column has inconsistent wording of color. Defined colors are used sometimes, and then -ish or words like transparent and clear are used at other times.

  4. The *measurements* column has replicates used in one observation and then `2 + 1`. What does that even mean?

  5. Significant figures are a mess and inconsistent!

We likely all have made tables like the untidy one when we started out or were in a hurry. However, untidy tables create issues with accessibility and communicating with others in addition to creating problems when trying to analyze and visualize the data. How can we fix the table?


---

One easy fix would be to make sure that all the numbers have consistent units, which are indicated in the column headers. A second, more difficult fix would be to ensure that all measurements are standardized or scored using the same scale to avoid pink vs. clear issues. This latter solution requires good communication and a bit of planning in advance, but will save you from some frustrating times later on.
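If a messy column already exists, the units can sometimes be cleaned up in code. The sketch below is one possible approach, not the only one: it splits each entry into a number and an optional unit, then converts everything to molar. It assumes that entries without a unit are already in M, which would need to be checked against the lab notebook. The names `raw`, `parts`, and `conc_M` are hypothetical.

```python
import pandas as pd

# a few entries in the style of the messy concentration column
raw = pd.Series(["0", "1 M", "2000 mM", "4.00001", "8"])

# split each entry into a number (column 0) and an optional unit (column 1)
parts = raw.str.extract(r"^\s*([\d.]+)\s*(mM|M)?\s*$")
value = pd.to_numeric(parts[0])

# conversion factor to molar; ASSUMPTION: a missing unit means M
factor = parts[1].map({"mM": 1e-3, "M": 1.0}).fillna(1.0)

conc_M = value * factor
print(conc_M)
```

Recording consistent units at the bench is still the better fix, since code like this has to guess what an entry without a unit means.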

Let's look at the same data, but in tidy format:

In [None]:
#@markdown Press play to see the tidy table

#make the table
concentration = np.array([0, 1, 2, 4, 8, 16, 32, 64, 128])
rate = np.array([0, 0.10, 0.24, 0.37, 0.82, 1.5, 3.1, 6.7, 12])
absorbance = np.array([0.05, 0.07, 0.08, 0.10, 0.11, 0.13, 0.15, 0.18, 0.21])
turbidity = np.array([1, 1, 1, 1, 1, 2, 3, 6, 10])
number = np.array([2, 3, 3, 3, 4, 6, 3, 3, 3])

tidy = pd.DataFrame(zip(concentration, rate, absorbance, turbidity, number),
                    columns=["concentration (M)", "rate (mol/sec)", "color (Abs 450 nm)", "Turbidity (NTU)", "number of measurements"])

# view the table
tidy

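|   | concentration (M) | rate (mol/sec) | color (Abs 450 nm) | Turbidity (NTU) | number of measurements |
|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.05 | 1 | 2 |
| 1 | 1 | 0.1 | 0.07 | 1 | 3 |
| 2 | 2 | 0.24 | 0.08 | 1 | 3 |
| 3 | 4 | 0.37 | 0.1 | 1 | 3 |
| 4 | 8 | 0.82 | 0.11 | 1 | 4 |
| 5 | 16 | 1.5 | 0.13 | 2 | 6 |
| 6 | 32 | 3.1 | 0.15 | 3 | 3 |
| 7 | 64 | 6.7 | 0.18 | 6 | 3 |
| 8 | 128 | 12.0 | 0.21 | 10 | 3 |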


This table looks a lot better! Notice how units are included in the column headers and the numbers in each column use the same units. In addition, the color column was changed from using a qualitative and inconsistent measurement like pink or red to using a spectrophotometer to measure absorbance. Then a new set of observations called turbidity was collected specifically focusing on the clarity of the sample. Finally, the number of trials is standardized. Now it is clear what the numbers represent and we can start to think about analyzing these data.


---



### Math in data frames

Like arrays, data in data frames can be analyzed and summarized. Run the code below and compare the differences between the first command and the second command.

In [None]:
## create the data frame of mass and volume measurements
mass = np.array([100, 105, 102, 92, 117])
volume = np.array([15, 14, 9.0, 16, 13])

tbl = pd.DataFrame(zip(mass, volume),
                   columns = ["mass", "volume"])

## summarize the data in tbl
print(tbl.describe())

# summarize just the data in the mass column
print(tbl['mass'].describe())

The command in line 9 summarizes both columns, while the command in line 12 summarizes just one column.

Notice in the second command, the inclusion of `['mass']`. A data frame name, followed by `[]` and a column name (from the data frame), allows you to isolate only that column within the data frame for a function or doing calculations. For summarizing data, this can reduce the size of the results especially if you have parts of the data frame that aren't numbers or that you are not focusing on.
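The sketch below shows column selection on a small made-up table that also contains a text column. The table `runs` and its values are hypothetical.

```python
import pandas as pd

# illustrative titration volumes in mL (not data from the lesson)
runs = pd.DataFrame({
    "trial": ["a", "b", "c"],
    "volume": [24.9, 25.1, 25.0],
})

# pick out one column by name, then apply a function to it
print(runs["volume"].mean())  # average of the volume column only
print(runs["volume"].max())   # largest value in the volume column
```

Selecting the `volume` column first means the text in the `trial` column never gets in the way of the calculation.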

Pandas and numpy both have a number of data exploration and descriptor functions that you can access by indicating which dataframe, which column (if any), and the command. For example, the mean of the mass column in tbl could be calculated by `tbl['mass'].mean()`. In the code below, try using some of what we learned above to work through a table of masses and volumes to get the average density and the density for each sample.

In [None]:
# create the table
mass = np.array([100, 105, 102, 92, 117])
volume = np.array([15, 14, 9.0, 16, 13])

tbl = pd.DataFrame(zip(mass, volume),
                   columns = ["mass", "volume"])

## calculate the average of mass and name that mass_avg

## calculate the average of volume and name that vol_avg

## calculate the average density from the average of mass and volume


## calculate the density of each pair of samples


## try the same command, but assign it to a new column called density in the data frame.
## Hint: if the column already existed in the data frame how would you refer to it?


## calculate the statistics of the density column

