# Guide to using the spectrum module for data analysis

This guide shows how to use the ``Data_macro`` class to read, process, and analyse optical and magneto-optical spectral data, and how to use the static methods of the ``Treatment`` class for further processing.

In [1]:
# Preliminaries
import os
from pathlib import Path
from spectrum import Data_macro, Treatment
current_path = Path.cwd()

# Data paths
path001 = os.path.join(current_path, "Data", "Magneto-optics", "Facet_001")
path100 = os.path.join(current_path, "Data", "Magneto-optics", "Facet_100")

## Getting started: Initialization
The macro takes five main inputs: the path(s) to the folder of measurement data, ``data_path``; the path to the folder of reference data, ``ref_path``; the units, ``units``; a check for whether the file names contain magnetic fields or temperatures, ``col_names``; and a check for whether the data should be normalized by using two zero-field references or using reference files matched to the measurement files, ``zero_field``. A description of these inputs, as well as the type of value they can take, is listed in the table below.

| **Variable name** | data_path                | ref_path               | units                    | col_names                                                  | zero_field                  |
| -----             | ----                     | ----                   | ----                     | ----                                                       | ----                        |   
| **Description**   | Path(s) to measurement data | Path to reference data | Units for the index      | Whether the columns should be field or temperature values  | Whether to use two zero-field references (True) or match references to data files (False) |
| **Type**          | string or list of strings | string (optional)      | "cm-1" or "meV" or "THz" | "B" or "T"                                                 |  Boolean                     |

### Setting path names
The macro is initialized using two inputs: ``data_path`` should be a string or list of strings which are paths to folders of measurement data, while ``ref_path`` should be a string which is the path to a folder of reference data. 

```python
example = Data_macro(data_path, ref_path)
```

If only ``data_path`` is entered, the macro will assume that the reference data is stored in the same folder as the measurement data.

If neither path is entered, the macro will ask for the data path and reference path as user inputs.

```python
example = Data_macro()
```
```
Enter the path to the measurement data (or drag and drop):
Enter the path to the zero-field data (or drag and drop):
```

In [None]:
# Example for getting path from user input
example = Data_macro()

### Setting units
By default, the ``units`` variable is set to ``"cm-1"``. This can be changed to ``"meV"`` or ``"THz"`` as follows.

```python
example = Data_macro(data_path, units="meV")
```

The macro can also be initialized with ``units`` set to ``None``, in which case the units will be requested as an input from the user.

```python
example = Data_macro(data_path, units=None)
```
```
Enter the units to be used for the index (cm-1, meV, or THz):
```

### File-handling variables
If ``col_names`` is set to ``"B"``, the macro reads magnetic field names from the file names. This is done by assuming that the field is denoted as a string starting with ``a``, ending with ``T``, and with ``p`` replacing the decimal point. For example, ``a03p500T`` indicates a field strength of 3.5 T.
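
The field-naming convention above can be illustrated with a short sketch. The helper ``parse_field`` below is hypothetical and is not part of the module; it only assumes the ``a...p...T`` pattern described in this section and does not try to handle negative fields.

```python
import re


def parse_field(file_name):
    """Hypothetical helper: read a field value such as 'a03p500T' from a file name."""
    # 'a', then digits, then 'p' in place of the decimal point, then digits, then 'T'
    match = re.search(r"a(\d+)p(\d+)T", file_name)
    if match is None:
        raise ValueError(f"No field value found in {file_name!r}")
    whole, fraction = match.groups()
    return float(f"{whole}.{fraction}")


print(parse_field("Facet_001_a03p500T.txt"))
```

A file name that does not contain the pattern raises a ``ValueError`` here, which mirrors the kind of error described later in this guide.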

If ``col_names`` is set to  ``"T"``, the macro reads temperature values from the file names. This is done by assuming that the temperature is the last integer in the file name, excluding the last zero.

If ``zero_field`` is set to ``True``, the macro normalizes the measurement data using the two zero-field references, which are interpolated to the same columns as the measurement data.

If ``zero_field`` is set to ``False``, the macro normalizes the measurement data by dividing each column by the corresponding column in the reference data. If the file names end in ``.cor``, the macro does not perform this normalization.
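
As a rough illustration of the ``zero_field=False`` case, the column-by-column division can be written in plain pandas. This is a sketch rather than the module's own code: the function name ``normalize_by_reference`` is hypothetical, and it assumes that the measurement and reference dataframes share the same index and column labels.

```python
import pandas as pd


def normalize_by_reference(data, ref):
    """Hypothetical sketch: divide each measurement column by the matching reference column."""
    # DataFrame.div aligns on both index and column labels before dividing
    return data.div(ref)


energies = [100.0, 200.0]
data = pd.DataFrame([[2.0, 4.0], [6.0, 8.0]], index=energies, columns=[0.0, 1.0])
ref = pd.DataFrame([[2.0, 2.0], [3.0, 4.0]], index=energies, columns=[0.0, 1.0])
normalized = normalize_by_reference(data, ref)
```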

## Using the macro
Once the macro is initialized, calling the method ``auto()`` returns a dataframe.

```python
df = example.auto()
```

This method returns a dataframe of normalized spectral intensities with the following attributes.

* ``df.columns``: 1D array of magnetic field or temperature values, depending on ``col_names``.
* ``df.index``: 1D array of energy values/wavenumbers. The units are by default cm$^{-1}$ but can be set to meV or THz in the initialization.
* ``df.values``: 2D array of normalized intensities.
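
The following sketch shows how these attributes are typically accessed on a pandas dataframe. The dataframe ``df_demo`` is synthetic and stands in for the output of the macro; the energy and field values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a dataframe returned by the macro
energies = np.array([100.0, 150.0, 200.0, 250.0])  # index, e.g. in cm-1
fields = [0.0, 3.5, 7.0]                           # columns, e.g. in T
df_demo = pd.DataFrame(np.ones((4, 3)), index=energies, columns=fields)

fields_out = df_demo.columns.to_numpy()    # 1D array of field values
energies_out = df_demo.index.to_numpy()    # 1D array of energies
intensities = df_demo.values               # 2D array of intensities

spectrum_3p5 = df_demo[3.5]                # single spectrum at 3.5 T
window = df_demo.loc[150.0:200.0]          # rows between 150 and 200 (inclusive)
```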

In [None]:
example001 = Data_macro(path001, units="cm-1", col_names="B", zero_field=True, flag_txt=False)
df = example001.auto()
df

## Advanced initialization
As well as the five inputs described in the first section, there are a further six inputs that can be used to control how data is processed. All of these inputs are optional. A full list of the inputs, as well as the type of value they can take and their default value, is listed in the table below.

| **Variable name** | auto                                            | data_head                                       | ref_head                                         | flag_txt                          | as_folder                                         | sep_paths |
| -----             | ----                                            | ----                                           | ----                                             | ----                              | ----                                              | ---- |
| **Description**   | Whether the simpler "Auto" method should be used | Method to read data file headers              | Method to read reference file headers            | Add "txt_files" to each path name | Whether the path contains folders for different energy windows | Whether the measurement and reference data are stored in separate paths |
| **Type**          | Boolean, defaults to True                        | "all" or "none" or "check", defaults to "none"| "all" or "none" or "check", defaults to "none" | Boolean, defaults to True          | Boolean, defaults to False                                         | Boolean |

### File-handling variables
The variables ``data_head`` and ``ref_head`` determine how the headers in the measurement and reference files are respectively handled. These are initialized in a similar way to ``units`` as described above. Each variable takes three possible values:
* ``"all"``: Assume all files have headers.
* ``"none"``: Assume no files have headers.
* ``"check"``: Check if each file has a header. Note that this method fails for headers which contain only floats, such as those generated by the module. In this case, ``"all"`` should be used instead.
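
The limitation of ``"check"`` can be made concrete with a small sketch. The module's actual detection logic is not shown in this guide, so the helper ``has_header`` below is a hypothetical stand-in that assumes a header is recognized as a first line containing at least one non-numeric entry.

```python
def has_header(first_line):
    """Hypothetical check: treat the first line as a header if any entry is not a float."""
    tokens = first_line.replace(",", " ").split()
    try:
        for token in tokens:
            float(token)
    except ValueError:
        return True   # at least one entry is not a number
    return False      # every entry parsed as a float


print(has_header("Energy Intensity"))  # text header
print(has_header("100.0 0.95"))        # plain data row
print(has_header("0.0 3.5 7.0"))       # float-only header, indistinguishable from data
```

Under this assumption a header made up only of floats looks exactly like a data row, which is why ``"all"`` is the safer choice for such files.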

There are a number of Boolean-type checks as follows.
* ``flag_txt``: Set to ``True`` if the readable (.txt) data files are in a separate folder, and ``False`` otherwise.
* ``as_folder``: Set to ``True`` if the path entered contains folders for different energy windows, and ``False`` if the path entered is directly a folder of datafiles.
* ``sep_paths``: Set to ``True`` if the measurement and reference data are stored in separate paths, and ``False`` if the same path contains both sets of data.

## Some notes about format
The module makes the following assumptions about the format of the data files, the file names, and the folders.
* Data in different windows (``as_folder = True``) are stored in folders called either ``FIR``, ``HMIR``, ``LMIR``, ``MIR``, or ``NIR``. The names are not case-sensitive. The folders are ordered by energy - in particular, there must be an overlap between the energy windows of consecutive folders.
* Data within any given folder covers the same approximate range of energies, though the ranges don't need to be identical.
* If the reference data are stored in folders in the same format (``sep_paths = False``), the measurement folders have names like ``run`` or ``sam`` or ``sample``, while the reference folders have names like ``ref``. The names are not case-sensitive.
* If the reference data are not stored in the same path as the measurement folders (``sep_paths = True``), the references are stored in just one folder.

If these conditions are not met, it may be simpler to use the ``auto`` method for each window (see above) and to merge the windows manually (see below).

## Using the macro: Single folder
Once the macro is initialized, if the data are stored in a single folder (``as_folder = False``), calling the method ``load_all()`` returns a dataframe.

```python
df = example.load_all()
```

This method returns a dataframe of normalized spectral intensities with the following attributes.

* ``df.columns``: 1D array of magnetic field or temperature values. Note that the first two columns are the minimum and maximum and the rest are ordered.
* ``df.index``: 1D array of energy values/wavenumbers. The units are by default cm$^{-1}$ but can be set to meV or THz in the initialization.
* ``df.values``: 2D array of normalized intensities.

In [None]:
example001 = Data_macro(path001, auto=False, flag_txt=False, as_folder=False, col_names="B", zero_field=True, sep_paths=False, units="cm-1")
df = example001.load_all()
df

## Using the macro: Multiple windows
Once the macro is initialized, if the data is stored in multiple folders for different windows (``as_folder = True``), calling the method ``load_all()`` returns a dataframe of merged windows. For this case, the method takes two parameters, ``high_to_low`` and ``mult``, to determine how the windows should be merged. These parameters are described in the table below.

| **Variable name** | high_to_low                                                           | mult                                                    |
| -----             | ----                                                                  | ----                                                    |
| **Description**   | Method: adjust high to match low (True) or low to match high (False)? | Method: multiply/divide (True) or add/subtract (False)? |
| **Type**          | Boolean (optional), defaults to True                                             | Boolean (optional), defaults to False                              |

By default, the windows are merged pairwise by adjusting the baseline of the higher energy window to match the lower energy window, and by adding or subtracting the difference between the baselines.

```python
df = example.load_all()
```

The method for adjusting baselines can be changed using the parameters described above, as follows.

```python
df = example.load_all(high_to_low=False, mult=True)
```

The parameters above can instead be input by the user if one or both of the parameters is set to ``None``:

```python
df = example.load_all(high_to_low=None, mult=None)
```

```
Adjust baselines high to low? Y/N: 
Adjust baselines by adding? Y/N:
```

This method returns a dataframe of corrected or normalized spectral intensities with the following attributes.

* ``df.columns``: 1D array of magnetic field or temperature values. Note that the first two columns are the minimum and maximum and the rest are ordered.
* ``df.index``: 1D array of energy values/wavenumbers. The units are by default cm$^{-1}$ but can be set to meV or THz in the initialization.
* ``df.values``: 2D array of normalized intensities.

# Manual window merging
The module is equipped to handle data which is stored in separate paths for measurement and reference data, each of which is stored in folders for different measurement windows. However, suppose that the measurement and reference data are stored in the same path, with different folders for each measurement window. In this case, it may be best to read the data from each folder separately and merge the windows manually. 

This section provides a guide for manual reading and merging of different energy windows, as well as demonstrating how the module adjusts the baselines to merge the data.

## Reading the data
Suppose that a set of magneto-optical data is located by the paths ``pathFIR``, ``pathMIR``, and ``pathNIR``, with each folder containing both measurement data and zero-field references. The dataframes for each folder can be processed and stored in a dictionary as follows. 

Note that the order of the windows should be either ascending or descending in energy, so that adjacent elements in the dictionary have overlapping energy windows.

```python
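# Store one normalized dataframe per window, ordered by energy.
# Note: the name ``dict`` shadows Python's built-in type of the same name.
dict = {}
# Far-infrared window
inst_FIR = Data_macro(pathFIR, col_names="B", zero_field=True, units="cm-1")
dict["FIR"] = inst_FIR.auto()
# Mid-infrared window
inst_MIR = Data_macro(pathMIR, col_names="B", zero_field=True, units="cm-1")
dict["MIR"] = inst_MIR.auto()
# Near-infrared window
inst_NIR = Data_macro(pathNIR, col_names="B", zero_field=True, units="cm-1")
dict["NIR"] = inst_NIR.auto()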
```

### Handling multiple runs
Suppose that one of the windows, FIR, has two (or more) runs, in paths ``pathFIR1`` and ``pathFIR2``, which should be averaged to form the final dataset. This can be handled by passing a list of paths as ``data_path`` and calling the method ``auto()`` as follows.

```python
inst_FIR = Data_macro([pathFIR1,pathFIR2], col_names="B", zero_field=True, units="cm-1")
avg_df = inst_FIR.auto()
dict["FIR"] = avg_df
```
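
How the module averages the runs internally is not spelled out in this guide. For runs that have already been loaded separately, a minimal sketch of the same idea in plain pandas is given below; the function name ``average_runs`` is hypothetical, and it assumes that all runs share the same energy index and columns and that a simple arithmetic mean is wanted.

```python
import pandas as pd


def average_runs(runs):
    """Hypothetical sketch: average dataframes that share the same index and columns."""
    # Stack the runs, then average the rows that share the same energy value
    return pd.concat(runs).groupby(level=0).mean()


run1 = pd.DataFrame([[1.0], [1.0]], index=[100.0, 200.0], columns=[0.0])
run2 = pd.DataFrame([[3.0], [3.0]], index=[100.0, 200.0], columns=[0.0])
avg_demo = average_runs([run1, run2])
```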

## Visualizing the data

Once the dictionary ``dict`` is created, it is often useful to visualize the windows by plotting a single column in each dataframe and examining the regions of overlap. This can be done using the static method ``Treatment.show_windows``. This plots the desired column ``col`` on a single plot with a log scale for the energy index.

```python
Treatment.show_windows(dict,col)
```

## Merging the windows
The windows can be merged into a single dataframe using the static method ``Treatment.merge_windows``. Before merging, the baselines of the dataframes must be adjusted so that there are no steps or jumps in the merged data. 

The method for this adjustment is determined by two parameters, ``high_to_low`` and ``mult``, which have been described in previous sections. By default, when calling this method directly, these parameters are both set by user input.

```python
df = Treatment.merge_windows(dict)
```
```
Adjust baselines high to low? Y/N: 
Adjust baselines by adding? Y/N:
```

The user input can be bypassed by setting ``high_to_low`` and ``mult`` to either ``True`` or ``False`` when calling the static method.

```python
df = Treatment.merge_windows(dict, high_to_low=True, mult=False)
```

After baseline adjustment and merging, the method returns a single dataframe which contains the merged data for all of the windows in the dictionary ``dict``.

# Error handling
The module may raise errors if expected conditions on the data or user inputs are not met. Here is a guide on how to interpret and avoid each error.

### Value errors
```No folders found in (path name) Check that folder names fit format.```

This error occurs if none of the folders fit the expected format - ``FIR``, ``MIR``, ``HMIR``, ``LMIR``, or ``NIR``, not case sensitive. Check the folder names and rename if necessary. This can also occur if the input ``as_folder`` is set to ``True`` but the path entered leads directly to a list of files. Try changing ``as_folder`` to ``False``.

```Unable to read columns from file names. Check that file names fit format.```

This error occurs if none of the file names fit the expected format. Values of the magnetic field should be denoted as a string starting with ``a``, ending with ``T``, and with ``p`` replacing the decimal point. For example, ``a03p500T`` indicates a field strength of 3.5 T. Values of the temperature should be denoted as the last integer in the file name, excluding the last zero. 

```No files found in (path name)```

This error occurs if a folder read by the module contains no files. Check that there are no empty folders with names matching the above list.

### Other outputs
```Invalid input: (description)``` or ```Unrecognized input```

These errors occur if one of the variables input by a user does not fit the expected format. Check the table at the start of the demo for expected values.

```Inconsistent indices in files (list of file names)```

This warns the user if the index values of the files in a folder do not have the same length or do not agree. The module automatically interpolates to fill any missing values. If the output data shows artefacts of this interpolation, it may be worth checking through the files in this list for inconsistencies.

# Methods in the Treatment class

The Treatment class contains a set of static methods which can be used for further treatment and processing of the data.

## Interpolation

The static method ``Treatment.interpolate`` returns a linearly interpolated dataframe according to the specified list of columns.

| **Variable name** | data             | B0                | B                                                    |
| -----             | ----             | ----              | ----                                                 |
| **Description**   | Spectral data    | Original columns  | New columns to which the data should be interpolated |
| **Type**          | pandas dataframe | numpy array, list | numpy array, list                                    |
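
The idea of interpolating to a new list of columns can be sketched with ``numpy.interp``. This is an illustration only, not the module's implementation: the function name ``interpolate_columns`` is hypothetical, and the sketch assumes that ``B0`` is sorted in increasing order, as ``numpy.interp`` requires.

```python
import numpy as np
import pandas as pd


def interpolate_columns(data, B0, B):
    """Hypothetical sketch: linearly interpolate each row from columns B0 to columns B."""
    B0 = np.asarray(B0, dtype=float)
    B = np.asarray(B, dtype=float)
    # Apply 1D linear interpolation along each row (one row per energy value)
    values = np.apply_along_axis(
        lambda row: np.interp(B, B0, row), 1, data.to_numpy(dtype=float)
    )
    return pd.DataFrame(values, index=data.index, columns=B)


data = pd.DataFrame([[0.0, 2.0], [10.0, 30.0]], index=[100.0, 200.0], columns=[0.0, 2.0])
interp_demo = interpolate_columns(data, [0.0, 2.0], [0.0, 1.0, 2.0])
```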

## Changing units

The static method ``Treatment.change_units`` changes the units of the index between ``cm-1``, ``meV``, and ``THz``. Note that the name of the index column must match the format expected for data processed by the ``Data_macro`` class.

| **Variable name** | data                                  | units                     | 
| -----             | ----                                  | ----                      | 
| **Description**   | Spectral data processed by Data_macro | Units to set the index to | 
| **Type**          | pandas dataframe                      | "cm-1" or "meV" or "THz"  | 
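
For reference, the conversion between the three units follows from 1 meV ≈ 8.065544 cm$^{-1}$ and 1 THz ≈ 33.35641 cm$^{-1}$. The sketch below applies these factors to an array of index values; the function name ``convert_energy`` is hypothetical and the code is not taken from the module.

```python
import numpy as np

# Value of one unit expressed in cm-1
CM1_PER_UNIT = {"cm-1": 1.0, "meV": 8.065544, "THz": 33.35641}


def convert_energy(values, old_units, new_units):
    """Hypothetical sketch: convert energies between cm-1, meV, and THz."""
    values = np.asarray(values, dtype=float)
    # Convert to cm-1 first, then to the requested units
    return values * CM1_PER_UNIT[old_units] / CM1_PER_UNIT[new_units]


in_meV = convert_energy([8.065544, 80.65544], "cm-1", "meV")
thz_in_meV = convert_energy([1.0], "THz", "meV")
```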

## Visualizing windows

The static method ``Treatment.show_windows`` plots the intensities of the different pre-merged dataframes in a dictionary, using a log scale for the energy index. This can be used to decide whether the data is ready to be merged, and how the baselines should be adjusted (see manual window merging above and ``Treatment.match_baseline`` below). 

If the dataframes have multiple columns, the column that should be used for plotting must be entered. Plotting a single column in this way makes the regions of overlapping energies clearer in the plot.

| **Variable name** | dict                           | col                                | 
| -----             | ----                           | ----                               |
| **Description**   | Spectral data prior to merging | Column to be plotted               | 
| **Type**          | dictionary                     | float (optional), defaults to None | 

## Baseline matching

The static method ``Treatment.match_baseline`` takes two dataframes with overlapping windows, adjusts the baseline of one so that the baselines match at the region of overlap, and returns the adjusted dataframe only.

The parameter ``high_to_low`` determines whether the dataframe with the higher window is adjusted to match the dataframe with the lower window or vice versa. The parameter ``mult`` determines whether the baseline is adjusted by multiplying/dividing by the ratio or by adding/subtracting the difference.
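
The effect of the two parameters can be illustrated with a short pandas sketch. This is not the module's implementation: the function name ``match_baseline_sketch`` is hypothetical, and it assumes that the baseline of each window is estimated as the mean intensity in the overlapping energy range, which the guide does not specify.

```python
import pandas as pd


def match_baseline_sketch(df_low, df_high, high_to_low=True, mult=False):
    """Hypothetical sketch: adjust one window so the baselines agree in the overlap."""
    start, stop = df_high.index.min(), df_low.index.max()
    if start > stop:
        raise ValueError("The two windows do not overlap in energy")
    # Assumed baseline estimate: mean of each column over the overlap region
    base_low = df_low.loc[start:stop].mean()
    base_high = df_high.loc[start:stop].mean()
    if high_to_low:
        # Adjust the higher window to match the lower one
        return df_high * (base_low / base_high) if mult else df_high + (base_low - base_high)
    # Adjust the lower window to match the higher one
    return df_low * (base_high / base_low) if mult else df_low + (base_high - base_low)


low = pd.DataFrame({0.0: [1.0, 1.0, 1.0, 1.0]}, index=[1.0, 2.0, 3.0, 4.0])
high = pd.DataFrame({0.0: [2.0, 2.0, 2.0, 2.0]}, index=[3.0, 4.0, 5.0, 6.0])
shifted_high = match_baseline_sketch(low, high, high_to_low=True, mult=False)
scaled_low = match_baseline_sketch(low, high, high_to_low=False, mult=True)
```

In this example the higher window is shifted down by 1 in the first call, and the lower window is scaled up by a factor of 2 in the second.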

The static method ``Treatment.merge_windows`` described in the main demo above uses this static method for baseline matching of pairs of dataframes in a dictionary.

| **Variable name** | df_low           | df_high          | high_to_low                                                    | mult |
| -----             | ----             | ----             | ----                                                      | ----  |
| **Description**   | Spectral data    | Spectral data    | Method: adjust high to match low (True) or low to match high (False)? | Method: multiply/divide (True) or add/subtract (False)? |
| **Type**          | pandas dataframe | pandas dataframe | Boolean | Boolean |

## Merging

Once the baselines of two windows have been matched using ``Treatment.match_baseline`` as described above, the dataframes can be merged into a single dataframe using ``Treatment.merge``. This method takes two dataframes which must overlap in energy, the first of which must have a lower energy range than the second. The data is interpolated to the combined indexes in the overlapping region.

The static method ``Treatment.merge_windows`` described in the main demo above uses this static method for pairwise merging of dataframes in a dictionary.

| **Variable name** | df_low           | df_high          |
| -----             | ----             | ----             |
| **Description**   | Spectral data    | Spectral data    |
| **Type**          | pandas dataframe | pandas dataframe |
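
A rough sketch of this kind of merge in plain pandas is given below. The function name ``merge_sketch`` is hypothetical, and the sketch makes one assumption that the guide does not confirm: after interpolating both windows to the combined index, the two windows are averaged in the overlapping region.

```python
import pandas as pd


def merge_sketch(df_low, df_high):
    """Hypothetical sketch: merge two overlapping windows on their combined index."""
    combined = df_low.index.union(df_high.index)

    def to_combined(df):
        # Keep only the combined index points that lie inside this window's range
        inside = combined[(combined >= df.index.min()) & (combined <= df.index.max())]
        # Linear interpolation in energy fills the points taken from the other window
        return df.reindex(inside).interpolate(method="index")

    # Assumption: average the two windows wherever both provide a value
    return pd.concat([to_combined(df_low), to_combined(df_high)]).groupby(level=0).mean()


low = pd.DataFrame({0.0: [1.0, 2.0, 3.0]}, index=[1.0, 2.0, 3.0])
high = pd.DataFrame({0.0: [2.5, 3.5, 4.5]}, index=[2.5, 3.5, 4.5])
merged_demo = merge_sketch(low, high)
```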

## Derivatives

The static method ``Treatment.derivative`` returns the derivative of the dataframe with respect to energy (if ``axis=0``) or magnetic field/temperature (if ``axis=1``). The arguments are described in the table below.

| **Variable name** | data             | axis                                              | edge                                                               |
| -----             | ----             | ----                                              | ----                                                               |
| **Description**   | Spectral data    | Axis to differentiate along: rows (0) or columns (1) | Handling of edges of range (refer to ``numpy.gradient`` documentation) |
| **Type**          | pandas dataframe | int, defaults to 0                                | int, defaults to 1                                                 |
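
Since the table refers to ``numpy.gradient``, the calculation can be sketched as follows. The function name ``derivative_sketch`` is hypothetical, and the sketch assumes that ``edge`` is passed on as the ``edge_order`` argument of ``numpy.gradient`` and that the index and column labels are numeric.

```python
import numpy as np
import pandas as pd


def derivative_sketch(data, axis=0, edge=1):
    """Hypothetical sketch: differentiate along the index (axis=0) or the columns (axis=1)."""
    # Use the actual energy or field values as coordinates, so uneven spacing is handled
    coords = np.asarray(data.index if axis == 0 else data.columns, dtype=float)
    grad = np.gradient(data.to_numpy(dtype=float), coords, axis=axis, edge_order=edge)
    return pd.DataFrame(grad, index=data.index, columns=data.columns)


x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
demo = pd.DataFrame({0.0: x**2, 1.0: 3.0 * x}, index=x)
d_energy = derivative_sketch(demo, axis=0)
```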

## Baseline correction

The static method ``Treatment.bs_correct`` returns a dataframe adjusted so that the baseline of the specified region is normalized to 1.

| **Variable name** | data             | region                                         | 
| -----             | ----             | ----                                           |
| **Description**   | Spectral data    | Region (e.g. energy interval) to be normalized | 
| **Type**          | pandas dataframe | list, tuple                                    | 
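
One simple way to picture this correction is to divide each column by its mean intensity within the chosen region. The sketch below does exactly that; the function name ``bs_correct_sketch`` is hypothetical, and the use of the mean as the baseline estimate is an assumption rather than a description of the module's internals.

```python
import pandas as pd


def bs_correct_sketch(data, region):
    """Hypothetical sketch: scale each column so its mean over `region` equals 1."""
    low, high = region
    # Assumed baseline estimate: mean of each column within the energy interval
    baseline = data.loc[low:high].mean()
    return data / baseline


demo = pd.DataFrame({0.0: [2.0, 2.0, 4.0, 4.0]}, index=[1.0, 2.0, 3.0, 4.0])
corrected = bs_correct_sketch(demo, (1.0, 2.0))
```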

## Smoothing

The static method ``Treatment.sg_smooth`` returns a dataframe which has been smoothed using a Savitzky–Golay filter.

| **Variable name** | data             | window                                         | poly                       |
| -----             | ----             | ----                                           | ----                       |
| **Description**   | Spectral data    | Number of coefficients in the smoothing window | Order of polynomial to fit |
| **Type**          | pandas dataframe | int, odd                                       | int, smaller than window   |
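
A Savitzky–Golay filter is available in SciPy as ``scipy.signal.savgol_filter``. The sketch below applies it along the energy axis of a dataframe; the function name ``sg_smooth_sketch`` is hypothetical, and whether the module calls SciPy in this way is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter


def sg_smooth_sketch(data, window, poly):
    """Hypothetical sketch: smooth each column along the energy axis."""
    smoothed = savgol_filter(
        data.to_numpy(dtype=float), window_length=window, polyorder=poly, axis=0
    )
    return pd.DataFrame(smoothed, index=data.index, columns=data.columns)


x = np.arange(11, dtype=float)
demo = pd.DataFrame({0.0: 2.0 * x + 1.0}, index=x)
smooth_demo = sg_smooth_sketch(demo, window=5, poly=2)
```

A straight line is left unchanged by the filter, since it is fitted exactly by a polynomial of order 2; real spectra are smoothed according to the choice of ``window`` and ``poly``.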

## Kramers-Kronig analysis 

The static method ``Treatment.kramers_kronig`` is described in full in a separate demo, but is listed here for completeness. 

This method takes a single column of a dataframe and parameters for high- and low-energy extrapolation, and returns a dataframe with five columns for the real part of the dielectric function "er", the imaginary part of the dielectric function "ei", the magnitude of reflectivity "rf", the real part of the optical conductivity "sr", and the phase of the reflectivity "phase". 

Note that entered parameters should have units consistent with the energy index. Note also that the calculation can be somewhat time-consuming, with the processing time scaling as the square of the number of datapoints.

| **Variable name** | df                               | n                  |  model                                                                                                        | w_free                                                     | ptail                       | b                                 |
| -----             | ----                             | ----               | ----                                                                                                         |  ----                                                          | ----                       | ----                           |
| **Description**   | Spectral data                    | Number of points to calculate low-frequency extrapolation | Model to be used for low-energy extrapolation                                                                | Free-electron or plasma frequency | Exponent for interband region (usually 0 or 1) | Conductivity for Hagen-Rubens, reflectance for insulator, or exponent for power law |
| **Type**          | pandas dataframe (single column) | int                            | "Hagen-Rubens", "Insulator", "Power law", "Metal", "Marginal Fermi liquid", "Gorter-Casimir two-fluid model", or "Superconducting" | float                                                       | int, defaults to 4            | float (optional), defaults to None             |