# Toney Group Retreat 2024 - Professional Development - Data Workflows Tutorial 

Introduction:

This notebook has 3 different sections, which have different target audiences:
* Level 1: For those that are unfamiliar with or new to using python, jupyter, and basic data science tools
* Level 2: For everyone, relatively basic yet still useful and often overlooked tools
* Level 3: For those that are already comfortable with the fundamentals of python data science work

So if you're new or unfamiliar with python data science, then complete levels 1 and 2. 
If you're an advanced python data science user, then complete levels 2 and 3.  

This tutorial notebook serves as a starting point and guide to various tools that we have found useful in our own data workflows. Additional references for learning more about the topics mentioned will be included at the end of each level section. 

## Level 1: Jupyter, NumPy, Matplotlib, and Pandas

In this level you will learn how to navigate jupyterlab and use the fundamental python data science packages numpy, matplotlib, and pandas. 

You will use these packages to load, process, and plot example beamtime data.

### Jupyter 

Jupyter Notebooks and JupyterLab are interactive ways to use python in your data workflows. From __[Jupyter's website](https://jupyter.org/about)__:
> "Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages. Jupyter will always be 100% open-source software, free for all to use and released under the liberal terms of the modified BSD license."

The following two subsections are adapted from one of Michael Shirt's python class notebooks:

#### Entering data in Jupyter notebooks 

This **notebook** is an interactive session. We just type the command `print("Hello, World.")` and then press `Shift+Enter` to execute.

In [None]:
print("Hello, World.")

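A cell also displays the value of its last expression, but not the values of earlier expressions. In the cell below, the two `print` calls produce output explicitly, while only the final `x+1` is displayed automatically.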

In [None]:
x=8
y=7
print(y+1)
print(x+1)
x+1

A useful feature of JupyterLab is the ability to quickly view docstrings and source code within the notebook. You can press `Shift+Tab` while the cursor is at the end of a function name or inside its call parentheses, or you can type a function name and replace the parentheses with a '?' to view the docstring or '??' to view the source code. 

Try placing your cursor at the end of the print function in the two cells below and pressing `Shift+Tab` to pull up a floating, scrollable docstring. Then run the cell to see the output from using question mark.

In [None]:
# since print is a builtin function, double question marks won't show the source code in this case
print?
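
Outside of IPython's `?` and `??` syntax, roughly the same information is available from the standard library's `inspect` module. The following is a minimal sketch; `json.dumps` is used only as an arbitrary example of a function written in python, so that its source can be retrieved.

```python
import inspect
import json

# Roughly what `print?` displays: the docstring of the object
doc = inspect.getdoc(print)
print(doc)

# Roughly what `json.dumps??` displays: the source code.
# This only works for functions written in python; builtins such as
# print are implemented in C, so inspect.getsource(print) raises TypeError.
src = inspect.getsource(json.dumps)
print(src[:300])  # show only the first 300 characters
```

This can be handy in plain python scripts or terminals where the question mark syntax is not available.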

#### Execution order

**Note**: When using Jupyter notebooks, it is very important to know that the cells can be executed out of order
(intentionally or not). The state of the environment (e.g., values of variables, imports, etc.) is defined by the execution
order, i.e. the order in which you run cells.  This can come back to bite you if you reopen a notebook and try to pick up where you left off! To see this, try running the cells below in different orders!

In [None]:
x=8

In [None]:
x=5

In [None]:
print(x)

This may seem problematic if you are used to programming in environments where the state is linked to the order of the commands as *written*, not as *executed*.

**Again, notice that the state of the environment is determined by the execution order.**
Note also that the square brackets to the left of each cell show the order in which the cells were executed. If you scroll to the top, you should see that the code cells show an execution order of `[1]`, `[4]`, and `[3]` (or similar numbers), indicating the actual execution order.

There are some useful menu commands at the top of the Jupyter notebook to help with these problems and make sure you retain the execution order as expected.

Some important commands to remember:
* You can clear the current state with the menu item `Kernel | Restart & Clear Output`
* It is often useful to clear the state using the menu command just described, and then execute all the lines above the currently selected cell using `Cell | Run All Above` .
* You can clear all the state and re-run the entire notebook using `Kernel | Restart & Run All`.

#### Navigating Jupyter notebook cells

When working in Jupyter notebooks, you will frequently want to add or modify cells as your notebook grows. 

There are buttons at the top of the notebook to add cells, copy cells, paste cells, run cells and more (hovering over each button tells you what it does). However, it is much faster and more convenient to use keyboard shortcuts for most of these actions:

1. `esc` = move cursor out of text editing mode within the cell to select cells mode
2. `enter` = move cursor into selected cell 
<br>When in select cells mode:
3. `A` = add cell above selected cell
4. `B` = add cell below selected cell
5. `Arrow-key-up`, `Arrow-key-down` = select cells up or down
6. `D,D` = delete selected cell (press 'D' twice)

Jupyter encourages the use of these and other keyboard shortcuts for improved productivity and ease of use. Try creating, deleting, editing, and selecting various cells below. You can always press `CMD+Z` or `CTRL+Z` to undo any action (also available in `Edit | Undo`).

In [None]:
# add a cell above me, then delete it and this cell

In [None]:
# write something else in here, then add a cell below it

#### Navigating the JupyterLab sidebar

You may have already noticed the sidebar to the left when working in JupyterLab. There are 4 different tabs by default:
1. `File Browser`
    <br> Navigate around directories within the directory where you launched JupyterLab. You double-click on folders to move the file browser to inside that folder and on files to view their contents in new tabs. JupyterLab can view .ipynb Jupyter notebooks, .py python scripts, most types of text files, many image types, and more. 
    <br>**Useful tip:** You can go into `File | Open from Path` to open a path directly. This can save you lots of time when working with multiple directories.
    
2. `Running Terminals and Kernels`
    <br>Shows open tabs, kernels, and terminals. A kernel is the running python process behind an active notebook, and it holds any objects that were loaded when the notebook was run. Kernels store their variables in your computer's RAM, so it is good to keep an eye on them to make sure none is growing too large in memory and that you don't have too many open simultaneously. 
    <br> Terminals shows you any open terminal tabs. 
    
3. `Table of Contents`
    <br> Shows an outline of all the section headers within a notebook. Useful for quickly navigating around notebooks!
    
4. `Extension Manager`
    <br> Here you can install extensions to potentially enhance your local JupyterLab setup. 

In [None]:
# Check all of the sidebar menus out now!

### NumPy, Matplotlib, and Pandas

This section will demonstrate some very basic ways to use numpy, matplotlib, and pandas. This  __[online data analysis handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)__ is a great resource for learning about these tools in more depth when you're starting out. 
Some excerpts from the linked handbook for some context are below:
> "NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you."

> "Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack. It was conceived by John Hunter in 2002, originally as a patch to IPython for enabling interactive MATLAB-style plotting via gnuplot from the IPython command line. IPython's creator, Fernando Perez, was at the time scrambling to finish his PhD, and let John know he wouldn’t have time to review the patch for several months. John took this as a cue to set out on his own, and the Matplotlib package was born, with version 0.1 released in 2003."

> "Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs."

In the following cells, you will generate artificial data with numpy, visualize it with matplotlib, and save the data as a .csv file using pandas. 

In [None]:
# Imports
import pathlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

<div class="alert alert-block alert-info"><b>Tip:</b> You can check available autocompletions in a module by pressing 'Tab' when the cursor is after the '.' when accessing a method or attribute of a module/class. 
</div>

#### Use numpy and matplotlib

In [None]:
# Generate data with numpy
x = np.linspace(-np.pi, np.pi, 300)  # generate a 1D array of 300 values evenly spaced between negative pi and positive pi
y = np.sin(x)  # generate another 1D array of the sin of each value contained in 'x'

# Show data with matplotlib.pyplot directly
plt.plot(x, y)  # quickly plot the x array along the x axis and the y array along the y axis
plt.show()  # this is technically not needed in jupyter notebooks, but it suppresses the text output of the plt.plot() call

In [None]:
# Let's make some minor adjustments:

plt.plot(x, y)

plt.gcf().set(size_inches=(5,2.5), dpi=150)  # Access the current figure to quickly adjust its size and dpi
plt.title('sin(x)') # maybe we want a title?
plt.xlabel('radians')  # maybe we want an xlabel?

plt.show()  # This is technically not needed in jupyter notebooks

<div class="alert alert-block alert-info"><b>Using plt.plot() vs plt.subplots():</b> plt.plot(), as shown above, is useful for quickly plotting data, but is not generally recommended for more complicated figures with detailed formatting and/or multiple plots.
    <br> In those situations, it is better to use a format like <b>fig, ax = plt.subplots()</b>. Check the documentation on plt.subplots() to learn more about how and when to use it. 
</div>
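
As a rough illustration of the `fig, ax = plt.subplots()` pattern mentioned in the note above, the sketch below draws two panels in one figure. The variable names (`ax_sin`, `ax_cos`) and the choice of a cosine curve for the second panel are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 300)

# One figure with two side-by-side axes that share the y axis
fig, (ax_sin, ax_cos) = plt.subplots(1, 2, sharey=True, figsize=(7, 2.5), dpi=150)

# Each axes object is formatted independently of the other
ax_sin.plot(x, np.sin(x), color='tab:blue')
ax_sin.set(title='sin(x)', xlabel='radians')

ax_cos.plot(x, np.cos(x), color='tab:orange')
ax_cos.set(title='cos(x)', xlabel='radians')

fig.tight_layout()  # reduce overlap between the two panels
```

In a notebook the figure is rendered when the cell finishes running. Because every formatting call goes through an explicit `fig` or `ax` object, it stays clear which panel is being modified as the figure grows.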

#### Make a pandas DataFrame

In [None]:
# Put the data in a 2D numpy array with columns x and y (x and y are 1D arrays, so one way to do this
# is to stack them as rows and then transpose the array)
data_array = np.vstack((x, y)).T

# Make pandas dataframe with data and labeled columns. The commented-out set_index('x') call would
# use the 'x' column as the index in place of the integer index that is automatically generated
df = pd.DataFrame(data=data_array, columns=('x', 'y'))  #.set_index('x')
df
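
An alternative that skips the stack-and-transpose step is to build the DataFrame from a dictionary of columns. The sketch below also shows what `set_index` does; the names `df_from_dict` and `df_indexed` are only for illustration.

```python
import numpy as np
import pandas as pd

x = np.linspace(-np.pi, np.pi, 300)
y = np.sin(x)

# Each dictionary key becomes a column name, each 1D array a column
df_from_dict = pd.DataFrame({'x': x, 'y': y})

# set_index returns a new DataFrame that uses the 'x' column as the row index
# instead of the default integer index (0, 1, 2, ...)
df_indexed = df_from_dict.set_index('x')

print(df_from_dict.shape)  # 300 rows, 2 columns
print(df_indexed.shape)    # 300 rows, 1 column ('x' moved into the index)
```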

#### Saving as csv file

In [None]:
# Use pd.DataFrame.to_csv() function to save data as csv
# We usually don't care about saving the index values as a column, so we can disable this
# by setting index=False
df.to_csv('numpy_sin_data.csv', index=False)  # By default this saves your file to the notebook working directory

# To choose other directories, put the full path you want to use. 
# I like pathlib for this, here I save the csv to an 'output' folder within my working directory
notebookPath = pathlib.Path.cwd()  # Saves current working directory as Windows or Posix Path depending on your OS
outputPath = notebookPath.joinpath('output')  # Specify path to save to
outputPath.mkdir(exist_ok=True)  # Makes directory if not already made
df.to_csv(outputPath.joinpath('numpy_sin_data.csv'), index=False)  # Joinpath also can be used for filename
df
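
Paths can also be joined with the `/` operator, which is equivalent to `joinpath`. The following sketch writes a small csv file and reads it back to check the round trip. It uses a temporary directory as a stand-in for the notebook folder, and the file name and data values are made up for the example.

```python
import pathlib
import tempfile

import pandas as pd

df = pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [0.0, 1.0, 4.0]})

# A temporary directory stands in for your notebook folder in this sketch
with tempfile.TemporaryDirectory() as tmp:
    output_path = pathlib.Path(tmp) / 'output'      # same result as .joinpath('output')
    output_path.mkdir(parents=True, exist_ok=True)  # parents=True also creates missing parent folders

    csv_path = output_path / 'example_data.csv'
    df.to_csv(csv_path, index=False)

    file_was_written = csv_path.exists()
    df_roundtrip = pd.read_csv(csv_path)  # read the file back before the folder is removed

print(file_was_written)
print(df_roundtrip)
```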

#### Loading and plotting data from csv files

In [None]:
# Use pandas to load data frame from a csv file

df_read = pd.read_csv(outputPath.joinpath('numpy_sin_data.csv'))
df_read

#### Plotting

In [None]:
# Directly from DataFrame columns
plt.plot(df_read['x'], df_read['y'])
plt.show()

In [None]:
# Or if you prefer to work with numpy arrays
array = np.array(df_read)
plt.plot(array[:, 0], array[:, 1])
plt.show()
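
Besides `np.array(df)`, pandas offers the `to_numpy()` method for converting a DataFrame or a single column to a numpy array. A minimal sketch with made-up values:

```python
import pandas as pd

df_example = pd.DataFrame({'x': [0.0, 0.5, 1.0], 'y': [0.0, 0.25, 1.0]})

array_2d = df_example.to_numpy()       # whole table as a 2D array
x_values = df_example['x'].to_numpy()  # single column as a 1D array

print(array_2d.shape)  # (rows, columns)
print(x_values.shape)  # (rows,)
```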

### Beamtime data example

In this example, you are going to: 
1. Load some NEXAFS data from a .txt file into a numpy array
2. Plot it and save a .png file of the plot
3. Make a pandas dataframe and plot it again
4. Save a .csv file of the same data now with column headers

In [None]:
# The first steps are to import the necessary packages and define relevant paths:

# Imports
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Define paths:
notebookPath = pathlib.Path.cwd()
nexafsPath = notebookPath.joinpath('Y6-CF_nexafs_data.txt')

In [None]:
# Now let's load the nexafs data using np.loadtxt()

nexafs_data = np.loadtxt(nexafsPath)
nexafs_data.shape

Notice that the nexafs_data numpy array has the shape (rows, columns). This is NEXAFS data from a pure film of the small molecule acceptor Y6, a common high-performance material used in organic photovoltaics. The first column is the X-ray energy, and each remaining column is the spectrum measured at one $\theta$ angle, where $\theta$ is the angle between the X-ray polarization and the substrate normal. <br> The columns are [energy, 20°, 30°, 55°, 70°, 90°].

In [None]:
# Make a simple plot with matplotlib 

plt.plot(nexafs_data[:,0], nexafs_data[:,1], label='20')
plt.plot(nexafs_data[:,0], nexafs_data[:,3], label='55')
plt.plot(nexafs_data[:,0], nexafs_data[:,5], label='90')
plt.legend()
plt.show()

Again, using plt.plot() is nice to quickly plot and see your data in very few lines of code. But we probably want to make some immediate changes/improvements to this plot:
1. Add title and axis labels
2. Improve dpi & increase font size (adjust figure size)
3. Plot all angles and use a reasonable color scheme

This will be easier to do using plt.subplots():

In [None]:
# Generate an empty figure and plot axes, set the size and dpi of the figure
fig, ax = plt.subplots()
fig.set(size_inches=(6,3), dpi=120)

# Make a list of RGBA color values, one per angle column (the energy column is excluded)
colors = plt.cm.viridis_r(np.linspace(0, 1, nexafs_data.shape[1] - 1))

# Set values for the plot not to be looped over
energy = nexafs_data[:, 0]  # Set the energy values from the loaded text file
angles = [20, 30, 55, 70, 90]  # Specify the angles now
for i, angle in enumerate(angles):  
    column_num = i+1  # We are excluding the first column since that is the energy
    nf_spectra = nexafs_data[:, column_num]  # Select the appropriate nexafs data for the corresponding angle    
    # Draw the line on the ax subplot, add the angle as a label and set the color to the correct RGBA value
    ax.plot(energy, nf_spectra, label=angle, color=colors[i]) 
    
# Set the title, axis labels, and x limits
ax.set(title='Y6 Angle-Dependent NEXAFS', xlabel='Photon Energy [eV]', 
       ylabel='Normalized Intensity [arb. units]', xlim=([270,325]))
ax.legend(title=r'$\theta$ [$\degree$]')

# Show the plot, then close it to reduce memory usage
plt.show()
plt.close('all')
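
Step 2 of this example also calls for saving a .png file of the plot. One way to do this is `fig.savefig()`, called before the figure is closed. The sketch below uses synthetic stand-in data and a temporary folder so that it runs on its own; the file name `nexafs_plot.png` is an assumption, and in practice you would point the path at your own output folder.

```python
import pathlib
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Stand-in data so the sketch runs on its own (not real NEXAFS data)
x = np.linspace(270, 325, 200)
y = np.exp(-((x - 285) ** 2) / 4)

fig, ax = plt.subplots(figsize=(6, 3), dpi=120)
ax.plot(x, y)
ax.set(xlabel='Photon Energy [eV]', ylabel='Normalized Intensity [arb. units]')

# A temporary folder is used here; point this at your own output folder instead
with tempfile.TemporaryDirectory() as tmp:
    png_path = pathlib.Path(tmp) / 'nexafs_plot.png'
    # dpi sets the resolution of the saved file; bbox_inches='tight' trims extra whitespace
    fig.savefig(png_path, dpi=300, bbox_inches='tight')
    png_size_bytes = png_path.stat().st_size

plt.close(fig)  # free the figure's memory once it has been saved
print(png_size_bytes > 0)
```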

Further basic data manipulation tools that are good practice:

In [None]:
# Make a pandas dataframe

In [None]:
# Plot using the dataframe

In [None]:
# Save a .csv file from the pandas dataframe
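
The three placeholder cells above correspond to steps 3 and 4 of this example. One possible way to fill them in is sketched below. It is not the tutorial's own solution: a synthetic array with the same column layout stands in for `nexafs_data`, and the column names (`energy`, `20deg`, ...) and the file name are assumptions chosen for illustration.

```python
import pathlib
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in with the same layout as nexafs_data: energy + 5 angle columns
energy = np.linspace(270, 325, 100)
angles = [20, 30, 55, 70, 90]
spectra = [np.exp(-((energy - 285) ** 2) / 4) * (1 + angle / 90) for angle in angles]
demo_data = np.column_stack([energy, *spectra])

# Make a pandas dataframe with labeled columns
column_names = ['energy'] + [f'{angle}deg' for angle in angles]
df_nexafs = pd.DataFrame(demo_data, columns=column_names)

# Plot using the dataframe: one line per angle column
fig, ax = plt.subplots(figsize=(6, 3), dpi=120)
for name in column_names[1:]:
    ax.plot(df_nexafs['energy'], df_nexafs[name], label=name)
ax.set(xlabel='Photon Energy [eV]', ylabel='Normalized Intensity [arb. units]')
ax.legend()

# Save a .csv file with column headers (temporary folder used for this sketch)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = pathlib.Path(tmp) / 'nexafs_data_labeled.csv'
    df_nexafs.to_csv(csv_path, index=False)
    header_line = csv_path.read_text().splitlines()[0]

print(header_line)
```

With the real data, `demo_data` would be replaced by the `nexafs_data` array loaded earlier.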

## Level 2: Bash, Git (and GitHub), VS Code, IPython Magic, Pathlib & Data Organization

In this level you will learn about:
* Useful bash commands and information
* Using git and github to version control your code
* Using VS Code as your interactive development environment
* IPython magic commands
* Using pathlib and making data organization easier

### Bash / terminal commands

In [None]:
# Basics

In [None]:
# Tar

In [None]:
# Bash scripts

### Git and GitHub

In [None]:
# Git

In [None]:
# GitHub

### VS Code

### IPython magic commands

In [None]:
# %whos lists the variables currently defined in the kernel (the % prefix marks a line magic)
%whos

### Pathlib and data organization

In [None]:
# Why use pathlib?
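
One reason to use pathlib is that a path becomes an object with useful attributes and methods, rather than a plain string that has to be split and joined by hand. The sketch below builds a small throwaway folder tree and then inspects it; the folder and file names are hypothetical.

```python
import pathlib
import tempfile

# Build a small throwaway folder tree so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp) / 'beamtime_2024' / 'nexafs'
    data_dir.mkdir(parents=True)
    for name in ('sample_A.txt', 'sample_B.txt', 'notes.md'):
        (data_dir / name).write_text('placeholder')

    # glob finds files by pattern; sorted() makes the order reproducible
    txt_files = sorted(data_dir.glob('*.txt'))
    txt_names = [p.name for p in txt_files]

    # Path attributes replace manual string splitting
    first = txt_files[0]
    stem, suffix, parent_name = first.stem, first.suffix, first.parent.name

print(txt_names)     # ['sample_A.txt', 'sample_B.txt']
print(stem, suffix)  # sample_A .txt
print(parent_name)   # nexafs
```

The same code works on Windows, macOS, and Linux because pathlib handles the path separator for you.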

## Level 3: Object-Oriented Programming, Rclone, Xarray, Dask

In this level you will learn about:
* Object-oriented programming and the basics of using classes
* Using rclone to transfer data between remote and local destinations
* Using xarray to easily work with large, multidimensional datasets
* Using dask and file formats such as zarr and netCDF to efficiently load and save big datasets

You will load, process, and plot example beamtime data. 

### Object-oriented programming

In [None]:
# What are classes and why you should be familiar with them:
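
As a rough sketch of the idea, a class bundles data together with the functions that operate on it. The `Spectrum` class below is hypothetical and written only for this illustration; it is not part of any library used in this tutorial, and the data are synthetic.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Spectrum:
    """Hypothetical container for one spectrum and its metadata."""
    energy: np.ndarray     # photon energies [eV]
    intensity: np.ndarray  # measured intensity at each energy
    angle: float           # measurement angle [degrees]

    def normalized(self) -> "Spectrum":
        """Return a new Spectrum scaled so the maximum intensity is 1."""
        return Spectrum(self.energy, self.intensity / self.intensity.max(), self.angle)

    def peak_energy(self) -> float:
        """Return the energy at which the intensity is largest."""
        return float(self.energy[np.argmax(self.intensity)])


energy = np.linspace(270, 325, 56)  # 1 eV steps
spectrum = Spectrum(energy, 3.0 * np.exp(-((energy - 285) ** 2) / 4), angle=55)

print(spectrum.peak_energy())                 # 285.0
print(spectrum.normalized().intensity.max())  # 1.0
```

Many of the objects used earlier (DataFrames, figures, axes, paths) are instances of classes, which is why they carry both attributes and methods.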

### Rclone

In [None]:
# Rclone description, how it is useful in our workflows

### Xarray

In [None]:
# I don't use pandas too much anymore, mostly xarray now
# Like pandas but n-dimensional and much more applicable for large dataset work
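
To make the comparison with pandas concrete, the sketch below builds a small 3D `xarray.DataArray` from random numbers and then selects and reduces it by dimension name. The dimension names, coordinate values, and the array name `intensity` are assumptions made for this example.

```python
import numpy as np
import xarray as xr

# Synthetic 3D data: intensity as a function of time, energy, and angle
time = np.arange(5)                     # hypothetical time steps [s]
energy = np.linspace(270, 325, 56)      # photon energy [eV], 1 eV steps
angle = np.array([20, 30, 55, 70, 90])  # angle [degrees]
values = np.random.default_rng(0).random((time.size, energy.size, angle.size))

da = xr.DataArray(
    values,
    dims=('time', 'energy', 'angle'),
    coords={'time': time, 'energy': energy, 'angle': angle},
    name='intensity',
)

# Select by coordinate value rather than by integer position
spectrum_55 = da.sel(angle=55, time=2)             # 1D: energy only
near_285 = da.sel(energy=285.2, method='nearest')  # 2D: time x angle

# Reduce along a named dimension
time_average = da.mean(dim='time')                 # 2D: energy x angle

print(spectrum_55.dims, near_285.dims, time_average.dims)
```

Selecting by label (`angle=55`) avoids having to remember which integer index corresponds to which axis or measurement.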

### Dask and zarr/netCDF 

In [None]:
# Dask

In [None]:
# Saving xarray dataarrays and/or datasets

### Beamtime data example
In this example, you are going to: 
1. Load an xarray DataArray of some time-resolved GIWAXS data from a .zarr store 
2. Plot a few slices/reductions of the data and save .png's

In [None]:
# Imports and paths:

# Imports:
import pathlib
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr

# Paths:
notebookPath = pathlib.Path.cwd()
zarrPath = notebookPath.joinpath('trGIWAXS.zarr')