<div>
<img src="img/CRC1333_long.png" style="float: right" height="150"/>
</div>

<div>
<img src="img/RDM_course_2024_alpha_crop.png" height="150"/>
</div>

# **CRC 1333 Summer School 2024**

## **Workshop: Python for Beginners**

> **2024-10-01**

---

Welcome to this **Jupyter Notebook**!

Jupyter is a coding environment that combines documentation and code, which makes it well suited for micropublications and for explaining code. It has become a fundamental tool in scientific scripting because it aids reproducibility. Ultimately, a Jupyter notebook can make it easier to understand what has been done than a plain script.

There are two types of cells:
- **Code**: Here the actual code will be executed
- **Markdown**: Explanations and any kind of text that helps to understand the code

Each code cell can be executed individually by using the **play-button** or the key-combo `shift` `+` `enter`. New cells can be added by using the **plus-button (+)** or the key-combo `alt` `+` `enter` which will execute the current cell and add a new one.


### 1 - Python fundamentals

Python is a high-level programming language that is widely used in scientific computing. It is known for its readability and ease of use. High-level means that Python is abstracted from the machine code, meaning it resembles human language more closely, and is therefore easier to read and write. Python is an interpreted language, which means that the code is executed line by line. This is different from compiled languages like C or Fortran, where the code is translated into machine code before execution. This makes Python slower than compiled languages, but also easier to use as you can directly see the results of your code. It is one of the most popular programming languages in the world and is used in a wide range of scientific applications.

#### 1.1 - Python as a calculator

In [None]:
# You can just use Python to calculate stuff, no special syntax needed
2 * 2

#### 1.2 Variables

Variables are used to store data that can be accessed and manipulated later in the code. They act as "containers" or "sticky notes" for data values. They are one of the most essential concepts in programming.

- **Naming**: Variable names should be descriptive and meaningful. They can contain letters, numbers, and underscores, but must start with a letter or an underscore.
- **Assignment**: Use the `=` operator to assign a value to a variable. For example, `x = 5` assigns the value `5` to the variable `x`.
- **Types**: Variables can store different types of data, such as numbers, strings, lists, and more. Python is dynamically typed, meaning you don't need to declare the type of a variable explicitly.
- **Reassignment**: You can change the value of a variable by assigning a new value to it.
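
The last two points, dynamic typing and reassignment, can be sketched in a few lines. The name `value` is only an illustrative placeholder and is not used elsewhere in this notebook:

```python
# A variable is created by assigning a value to a name
value = 5
print(type(value).__name__)

# Reassignment: the old value can be used to compute the new one
value = value + 1
print(value)

# Dynamic typing: the same name may later refer to a different type
value = "five"
print(type(value).__name__)
```

No type declaration is needed at any point; Python determines the type from the value that is currently assigned.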

In [None]:
number = 2 * 2

# Jupyter notebooks display the value of the last line in a cell
number

In [None]:
# But in a script, you would need to use the print function
print(number)

In [None]:
# Use variables like you would use numbers
another_number = 20

multiplication_result = number * another_number

multiplication_result

In [None]:
# There are many operators in Python

# Division
division_result = 80 / another_number
print("Division: ", division_result)

# Exponents
power_result = 80 ** 2
print("Exponent: ", power_result)

# Modulo
modulo_result = 80 % 2
print("Modulo: ", modulo_result)

# There are, of course, many more operators!

In [None]:
# Variables do not have to be numbers
string = "We call text 'strings' in programming!"

string

In [None]:
# There are special string operations, like concatenation (combining strings)
string1 = "Hello, "
string2 = "World!"

string1 + string2
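
Later cells in this notebook use f-strings (formatted string literals) to place values inside text. A minimal sketch of how they work, with illustrative variable names that are not part of the workshop data:

```python
name = "World"
temperature = 21.4567

# Prefix the string with f and put expressions inside curly braces
greeting = f"Hello, {name}!"
print(greeting)

# A format specifier after the colon controls the output, here two decimals
rounded_text = f"The temperature is {temperature:.2f} degrees."
print(rounded_text)
```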

#### 1.3 Lists

Lists are a fundamental data structure in Python that can be used to store a collection of items.

- Contains a sequence of arbitrary data types
- Started with `[`, followed by comma-separated entries, and closed by `]`
- Individual entries can be accessed using indices
- Order is important

In [9]:
# Let us assume three variables (that may have been measured)
time1 = 0.0
time2 = 2.0
time3 = 5.0

In [None]:
# Lists are ordered collections of things; here we reuse the variables from above
measurement = [time1, time2]

print(measurement)

In [None]:
# Lists can be empty
empty_list = []

empty_list

In [None]:
# You can add things to a list
measurement.append(time3)

print(measurement)

In [None]:
# You can access elements of a list by their index, starting from 0 and
# by using square brackets
print(measurement[0])

# You can also access list elements starting from the end using -1
print(measurement[-1])
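
Beyond single indices, a few more list operations are often useful. The following is a small sketch with hypothetical values (the list `times` is not part of the workshop data):

```python
# Hypothetical example values
times = [0.0, 2.0, 5.0, 9.0]

# len() returns the number of entries
number_of_entries = len(times)
print(number_of_entries)

# A slice [start:stop] returns a new list; the stop index is not included
first_two = times[0:2]
print(first_two)

# Lists are mutable: an entry can be replaced via its index
times[1] = 2.5
print(times)
```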

#### 1.4 Dictionaries

- Associate a `key` with a `value`
- Can be regarded as a collection of "sticky notes"
- Behaves similarly to a list, except that the numeric indices are replaced by custom keys
- Order is not important, but keys must be unique

In [None]:
projects = {
    "Liam": "B06",
    "Alex": "C03",
    "Ruba": "A04",
    "Max": "INF"
}

projects

In [None]:
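# Inside an f-string delimited by double quotes, use single quotes for the key
# (reusing the same quote type is a syntax error before Python 3.12)
print(f"Ruba's project is {projects['Ruba']}.")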

In [None]:
# The values in a dictionary can be lists
projects["Prof. Buchmeiser"] = ["MGK", "Z01"]

projects
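
The requirement that keys must be unique has a practical consequence, sketched below with a hypothetical dictionary `room_numbers` that is unrelated to the `projects` dictionary above:

```python
# Hypothetical example dictionary
room_numbers = {"Anna": "1.01", "Ben": "2.17"}

# Assigning to an existing key overwrites the old value (keys are unique)
room_numbers["Anna"] = "3.05"
print(room_numbers)

# The `in` keyword checks whether a key exists
ben_is_listed = "Ben" in room_numbers
print(ben_is_listed)

# .get() returns a fallback value instead of raising a KeyError
clara_room = room_numbers.get("Clara", "unknown")
print(clara_room)
```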

#### 1.5 The `for` loop

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20191101172216/for-loop-python.jpg"/>

- Used to dynamically iterate over each entry in a list or dictionary
- Removes the need to access each list entry individually by its index
- The most commonly used loop in Python, and straightforward to write

**Procedure**

1. The for-loop "pulls" an entry per round from the list.
2. The entry is stored under the variable name you have chosen, e.g. `variable`.
3. Inside the "body", operations can be executed as usual.

Start a loop with the `for` keyword, followed by a variable name, the `in` keyword, an iterable (like a list), and a colon `:`.
The body of the loop is indented by four spaces.  
Indentation is important, as it tells Python which code block is to be executed in each iteration of the loop. The code block is executed once for each element in the iterable. The current element is accessible through the variable name defined after the `for` keyword.

```python
for variable in iterable:
    # Do something with variable
```

In [None]:
# Let us assume a list of sidelengths of squares
sidelengths_of_squares = [1, 2, 3, 5, 8, 13, 21]

# Into the loop we go!
for sidelength in sidelengths_of_squares:

    # We calculate the area of the square and print it for each sidelength
    area_of_square = sidelength ** 2
    print(f"The area of a square with sidelength {sidelength} is {area_of_square}.")
    
print("This is not indented, so it is not part of the loop!")
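
Loops work on dictionaries as well. The sketch below uses `.items()`, which hands over one key and one value per round, and collects results in a second dictionary. The dictionary `masses_g` and its values are hypothetical:

```python
# Hypothetical sample masses in grams
masses_g = {"sample_a": 1.5, "sample_b": 0.25, "sample_c": 2.0}

# An empty dictionary that the loop fills step by step
masses_mg = {}

# .items() yields one (key, value) pair per round
for sample, mass in masses_g.items():
    masses_mg[sample] = mass * 1000

print(masses_mg)
```

The same pattern (loop over `.items()`, compute something, store it under the same key) appears again in the titration study.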

---

### 2 - Titration study

In this "study", we simulate the **titration of hydrochloric acid (HCl)** of unknown concentration **with sodium hydroxide solution (NaOH)**. Since HCl is a strong acid and NaOH is a strong base, finding the equivalence point and calculating the HCl concentration is rather easy, which makes it perfect for this example.

**Simulated Experiment**

Consider 100 aqueous HCl solutions of 50 ml with unknown concentrations. They have been titrated with a 0.2 molar solution of NaOH and both the volume of NaOH solution used and the measured pH were recorded. The objectives are:

- Load a larger dataset into Python.
- Visualize pH curves for each titration.
- Write a function to find the equivalence point of an easy titration curve.
- Calculate the concentration of HCl for each solution.
- Take a look at the statistics of the dataset.

#### 2.1 - Load data

Max and I have already prepared the data for this titration study. The Jupyter Hub allows you to upload files directly to the university server so that they can be accessed within the Jupyter environment. For today we will skip this step and instead pull the prepared data from a GitHub repository that we set up for this workshop. Talking in depth about GitHub is beyond the scope of this workshop; in short, it is a platform for version control and collaboration that is widely used in software development and data science.

In [None]:
# Get data from GitHub
!git clone --quiet https://github.com/FAIRChemistry/pySummerSchool24
%cd pySummerSchool24

#### 2.2 - Import libraries

Usually, the first step in any Python script is to install and import the required libraries. A library is a collection of functions and methods that allows you to perform many actions without writing your own code. One huge advantage of Python is the vast number of libraries already available for different tasks. For data analysis and visualization, the most important libraries are `numpy`, `pandas`, and `matplotlib`, which we will use in this workshop. They enhance Python's capabilities and make it easier to work with various data.  

- **numpy**: Provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- **pandas**: Provides data structures and data analysis tools for Python. It is built on top of numpy and makes it easier to work with large tabular datasets.
- **matplotlib**: A plotting library for Python and its numerical mathematics extension `numpy`. It provides an object-oriented API for embedding plots into applications.

In [19]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
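
The `import ... as ...` form used above gives a library a shorter alias. The idea can be sketched with `statistics`, a module from Python's standard library that needs no installation; the alias `st` and the example values are arbitrary choices for illustration:

```python
# Bind the library to a shorter name
import statistics as st

values = [1.0, 2.0, 3.0, 6.0]

# Functions of the library are reached through the alias and a dot
mean_value = st.mean(values)
median_value = st.median(values)

print(mean_value)
print(median_value)
```

The aliases `plt`, `np`, and `pd` are widely used conventions, which is why they appear in most examples found online.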

#### 2.3 - Read data

First, we will read the data from the file into a `pandas` `DataFrame`. You can think of a DataFrame as a table with rows and columns, similar to an Excel spreadsheet. It is a powerful data structure that makes it easy to work with data in Python. 

In [None]:
# Pandas has a special function making it easy to read CSV files
df = pd.read_csv('data/titration_data.csv')
df

In [None]:
# Let us also have a quick look at the data using the plot function
df.plot(x='Volume NaOH / l', legend=False, ylabel='pH')
plt.show()

#### 2.4 - Data analysis

To start the data analysis, we need the concentration of the NaOH solution and the volume of the HCl samples for later use, so we store them in two variables. We also extract the list of NaOH volumes used in the titrations and save it in a variable for convenient access. We do something similar for the pH values, but since we have multiple titrations, we store them in a dictionary instead of a list. This way, we can access the pH values of each titration by using the titration name as the key; the corresponding value is the list of pH values.

In [22]:
NaOH_concentration = 0.2  # mol/l
HCl_volume = 0.05         # l

In [23]:
# With this we can get the list of volumes of NaOH from the DataFrame
volume_naoh = df['Volume NaOH / l'].values

In [24]:
# Here we generate the dictionary of titrations
ph_dict = {col: df[col].values for col in df.columns[1:]}
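
The cell above uses a dictionary comprehension, a compact way of writing a loop that fills a dictionary. The sketch below shows both forms side by side. It uses a plain dictionary named `table` with made-up numbers as a stand-in for the DataFrame, so it only illustrates the syntax and not the pandas behavior:

```python
# Hypothetical stand-in for the table: column name -> list of values
table = {
    "Volume NaOH / l": [0.00, 0.01, 0.02],
    "Titration 1": [1.0, 1.2, 1.5],
    "Titration 2": [1.3, 1.5, 1.9],
}

column_names = list(table.keys())

# Long form: a loop over all columns except the first one
ph_by_titration = {}
for col in column_names[1:]:
    ph_by_titration[col] = table[col]

# Short form: the same result written as a dictionary comprehension
ph_by_titration_short = {col: table[col] for col in column_names[1:]}

print(ph_by_titration == ph_by_titration_short)
```

The slice `[1:]` skips the first column, which holds the volumes rather than pH values.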

We now need a function that finds the equivalence point of a titration curve.  
A function is a block of code that only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result. Functions are a convenient way to divide your code into useful blocks, allowing us to order our code, make it more readable, reuse it, and save some time.  
Writing the function for the equivalence point is not easy for beginners, so we will do it the way it would most likely happen in real life: use ChatGPT to generate the function for us. Below is an example prompt that we could use to generate the function.

**Prompt**:
> Can you generate a Python function called "find_equivalence_point", with two arguments "volume" and "ph", that finds the equivalence point of a titration curve and returns the volume of the titrant at the equivalence point?
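
Whatever code the prompt returns, every Python function follows the same anatomy. A minimal sketch with a hypothetical function `rectangle_area`, which is unrelated to the titration data:

```python
# `def` starts the definition; the names in parentheses are the parameters
def rectangle_area(width, height):
    """Return the area of a rectangle."""
    area = width * height
    # `return` hands the result back to the caller
    return area

# The function only runs when it is called
result = rectangle_area(3, 4)
print(result)
```

Like the requested `find_equivalence_point`, this function takes two arguments and returns a single value.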

In [25]:
# Code generated by ChatGPT will go here

In [None]:
# We can now use the new function to calculate the equivalence point
HCl_eq_volume = find_equivalence_point(volume_naoh, ph_dict["Titration 1"])
print(HCl_eq_volume)

In [None]:
# Let us create a new dictionary to store the equivalence points
equivalence_dict = {}

# Using a loop we can easily calculate the equivalence points for all
# 100 titrations in one step
for titration_curve, ph_values in ph_dict.items():
    equivalence_dict[titration_curve] = find_equivalence_point(volume_naoh, ph_values)

print(equivalence_dict)

In [None]:
# It is the concentration of HCl that we are interested in, let us
# create yet another dictionary to hold the results
results = {}

# Again, we use a loop to calculate the HCl concentration for each
# titration and store it in the result dictionary
for titration_curve, equivalence_volume in equivalence_dict.items():
    HCl_concentration = (NaOH_concentration * equivalence_volume) / HCl_volume
    results[titration_curve] = HCl_concentration

print(results)
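
The formula inside the loop follows from the 1:1 reaction of HCl with NaOH: at the equivalence point, the amount of NaOH added equals the amount of HCl in the sample. A worked sketch for a single titration, where the equivalence volume of 0.025 l is a hypothetical value chosen only for illustration:

```python
# Values taken from the text: 0.2 mol/l NaOH and 50 ml HCl sample
naoh_concentration = 0.2    # mol/l
hcl_volume = 0.05           # l

# Hypothetical equivalence volume, chosen only for illustration
equivalence_volume = 0.025  # l

# Amount of NaOH added at the equivalence point (n = c * V)
naoh_amount = naoh_concentration * equivalence_volume  # mol

# HCl and NaOH react 1:1, so n(HCl) = n(NaOH)
hcl_concentration = naoh_amount / hcl_volume  # mol/l

print(f"{hcl_concentration:.2f} mol/l")
```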

#### 2.5 - Results

Now that we have our desired results (the concentrations of the HCl solutions), we can have some fun with the data. We can calculate some statistics, like the mean, median, and standard deviation of the HCl concentrations. We can find out which solutions had the highest and lowest concentrations. And we can also visualize the concentrations per solution in a bar plot.

In [29]:
# Some statistics
concentrations = list(results.values())               # all HCl concentrations as a list
mean_concentration = np.mean(concentrations)          # mean
median_concentration = np.median(concentrations)      # median
std_concentration = np.std(concentrations)            # standard deviation
max_concentration = np.max(concentrations)            # maximum concentration
max_titration_curve = max(results, key=results.get)   # titration curve with maximum concentration
min_concentration = np.min(concentrations)            # minimum concentration
min_titration_curve = min(results, key=results.get)   # titration curve with minimum concentration

In [None]:
print("*************************")
print("* Experiment statistics *")
print("*************************")
print(f"The mean HCl concentration is {mean_concentration:.2f} mol/l.")
print(f"The median HCl concentration is {median_concentration:.2f} mol/l.")
print(f"The standard deviation of the HCl concentration is {std_concentration:.2f} mol/l.")
print(f"The maximum HCl concentration of {max_concentration:.2f} mol/l is found in {max_titration_curve}.")
print(f"The minimum HCl concentration of {min_concentration:.2f} mol/l is found in {min_titration_curve}.")
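
One detail about the standard deviation reported above: by default, `np.std` computes the population standard deviation (dividing by N), whereas spreadsheet functions such as Excel's `STDEV.S` report the sample standard deviation (dividing by N - 1). The sketch below shows the difference using Python's built-in `statistics` module and hypothetical concentrations. With NumPy, the sample version would be `np.std(values, ddof=1)`.

```python
import statistics

# Hypothetical concentrations in mol/l, not taken from the workshop data
concentrations = [0.10, 0.12, 0.11, 0.13]

# Population standard deviation (divides by N), matching the np.std default
population_std = statistics.pstdev(concentrations)

# Sample standard deviation (divides by N - 1), matching np.std(..., ddof=1)
sample_std = statistics.stdev(concentrations)

print(f"population: {population_std:.4f}, sample: {sample_std:.4f}")
```

For 100 titrations the two values are very close, but the difference matters when comparing numbers against a spreadsheet.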

In [None]:
plt.figure(figsize=(10, 5), dpi=300)
plt.bar(results.keys(), results.values())
plt.ylabel("HCl concentration / mol/l")
plt.xticks(rotation=90, fontsize=5)
plt.show()

#### 2.6 - Export data

Finally, we can export the results to a file. This is useful if we want to share the results with others or use them in another program. We will export the results back to a CSV file, which is a common file format for storing tabular data.

In [32]:
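# Convert the results dictionary into a two-column DataFrame
result_df = pd.DataFrame(list(results.items()), columns=["Experiment", "HCl concentration / mol/l"])

# to_csv accepts a file path directly, so no explicit open() is needed
result_df.to_csv("data/hcl_concentrations.csv", index=False)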

---

### 3 - Summary

We hope we could give you a short and useful introduction to Python. We believe that neither Python nor Excel is superior, but that both are powerful tools which cater to different data analysis requirements. Using Excel for smaller datasets is efficient and easy, but with increasing data complexity or size you may find Python to be more flexible and a real time-saver!

---

In [None]:
assert False, "Stop! This is uncharted territory!"

In [1]:
def find_equivalence_point(volume, ph):
    """
    Finds the equivalence point of a titration curve.

    Parameters:
    - volume: List or numpy array of titrant volumes
    - ph: List or numpy array of pH values corresponding to each volume

    Returns:
    - equivalence_volume: The volume at the equivalence point
    """
    # Ensure inputs are numpy arrays
    volume = np.array(volume)
    ph = np.array(ph)

    # Compute the first derivative of the pH curve (numerical differentiation)
    dpH_dV = np.gradient(ph, volume)

    # Find the index where the slope is the steepest (maximum derivative)
    max_slope_index = np.argmax(dpH_dV)

    # The equivalence point corresponds to the volume at this index
    equivalence_volume = volume[max_slope_index]

    return equivalence_volume
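
The steepest-slope idea behind this function can be checked on a curve where the answer is known in advance. The sketch below is a pure-Python stand-in, not the NumPy implementation above: it builds a hypothetical S-shaped curve and approximates the slope with simple central differences, which is similar in spirit to what `np.gradient` does for interior points.

```python
import math

# Hypothetical synthetic titration curve: 51 volumes from 0 to 0.05 l
volumes = [i * 0.001 for i in range(51)]
center = volumes[25]  # chosen position of the steepest point

# A smooth S-shaped pH curve that jumps around `center` (illustrative only)
ph_values = [7.0 + 5.0 * math.tanh((v - center) / 0.002) for v in volumes]

# Central differences approximate the slope at every interior point
best_index = 1
best_slope = 0.0
for i in range(1, len(volumes) - 1):
    slope = (ph_values[i + 1] - ph_values[i - 1]) / (volumes[i + 1] - volumes[i - 1])
    if slope > best_slope:
        best_slope = slope
        best_index = i

equivalence_volume = volumes[best_index]
print(f"Steepest slope at {equivalence_volume:.3f} l")
```

Because the synthetic curve was constructed to be steepest at `center`, recovering that volume suggests the approach works for clean, strong acid/strong base curves. Noisy measurements may need smoothing first.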