# Object Oriented Programming and Documentation

There are many good resources which provide an introduction to object oriented programming. This tutorial won't provide a comprehensive introduction and will assume some prior knowledge. [Here's a good starting point.](https://realpython.com/python3-object-oriented-programming/)

We are interested in using object oriented programming because it lets us represent things that we are familiar with in code. For example, concrete things like data files can be represented as classes. Classes let us group together data and the functions that operate on that data; when we are working with the data in a notebook, we can then create as many objects of a class as there are data files. We can also represent more abstract things, like datasets or models, as classes.

Note that [there are several different programming paradigms](https://www.geeksforgeeks.org/introduction-of-programming-paradigms/). We'll likely stick to object oriented and functional programming, but there aren't hard rules for when to use one or the other, and within a single project you may find yourself using both.

### An update on the bug

The bug discussed in the previous tutorial arises again here, this time affecting our ability to read the header (technically the same issue existed in the last tutorial, but we weren't using any information from the header). For some reason, .dat files are sometimes comma separated and sometimes tab separated. I think Excel is to blame (that is, if you open a .dat file in Excel and save it, it will be comma separated). We'll need to update our code to handle both cases.

The functions for reading both headers and data sections now account for this difference.
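As a minimal sketch of the fallback idea (not the exact code used in the class below), the following splits some text with a tab delimiter first and retries with a comma if no row was actually split. The helper name `split_rows` and the sample header lines are made up for illustration.

```python
import csv
import io


def split_rows(text: str, delimiter: str = "\t") -> list[list[str]]:
    """Split text into rows, retrying with a comma if the tab delimiter fails."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # if no row was split into more than one field, the delimiter was probably wrong
    if delimiter == "\t" and all(len(row) <= 1 for row in rows):
        rows = list(csv.reader(io.StringIO(text), delimiter=","))
    return rows


# hypothetical header lines: one tab separated, one comma separated
tab_text = "INFO\t12.5\tSAMPLE_MASS\n[Data]\n"
comma_text = "INFO,12.5,SAMPLE_MASS\n[Data]\n"

tab_rows = split_rows(tab_text)
comma_rows = split_rows(comma_text)
print(tab_rows == comma_rows)
```

Both inputs end up as the same list of rows, so the rest of the code does not need to know which delimiter the file used.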

In [1]:
import hashlib
import csv
import re
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass
from collections import OrderedDict

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

The classes we'll write in this tutorial will be able to handle all of the following files:

In [2]:
mvsh1_path = Path("data/mvsh1.dat")
mvsh2_path = Path("data/mvsh2.dat")
mvsh2a_path = Path("data/mvsh2a.dat")
mvsh2b_path = Path("data/mvsh2b.dat")
mvsh3_path = Path("data/mvsh3.dat")
mvsh4_path = Path("data/mvsh4.dat")
zfcfc1_path = Path("data/zfcfc1.dat")
zfcfc2_path = Path("data/zfcfc2.dat")
zfcfc3_path = Path("data/zfcfc3.dat")
fc4a_path = Path("data/fc4a.dat")
fc4b_path = Path("data/fc4b.dat")
zfc4a_path = Path("data/zfc4a.dat")
zfc4b_path = Path("data/zfc4b.dat")
zfcfc4_path = Path("data/zfcfc4.dat")
dataset4_path = Path("data/dataset4.dat")

all_files = [
    mvsh1_path, mvsh2_path, mvsh2a_path, mvsh2b_path, mvsh3_path, mvsh4_path,
    zfcfc1_path, zfcfc2_path, zfcfc3_path, fc4a_path, fc4b_path, zfc4a_path,
    zfc4b_path, zfcfc4_path, dataset4_path
]

The functions `label_clusters()`, `unique_values()`, `find_outlier_indices()`, and `find_temp_turnaround_point()` from the previous tutorial have been rewritten here. Some have been slightly modified to handle the data we'll find in this tutorial.

Examples of code documentation in the form of docstrings have been added to these functions. Docstrings are a way of documenting functions, classes, and modules in Python. They are enclosed in triple quotes and are the first thing in a function, class, or module. They are accessible via the `__doc__` attribute of the function, class, or module. For example, `label_clusters.__doc__` will return the docstring for the `label_clusters()` function.
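Here is a small, self-contained illustration of reading a docstring back at runtime. The function `celsius_to_kelvin()` is a made-up example and is not part of this tutorial's code; `inspect.getdoc()` is a standard library helper that returns the same text as `__doc__` with the indentation cleaned up.

```python
import inspect


def celsius_to_kelvin(temp_c: float) -> float:
    """Convert a temperature from degrees Celsius to Kelvin.

    Parameters
    ----------
    temp_c : float
        The temperature in degrees Celsius.

    Returns
    -------
    float
        The temperature in Kelvin.
    """
    return temp_c + 273.15


raw_doc = celsius_to_kelvin.__doc__             # docstring exactly as written
clean_doc = inspect.getdoc(celsius_to_kelvin)   # same text, indentation removed
first_line = clean_doc.splitlines()[0]
print(first_line)
```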

[Here is an excellent video from ArjanCodes on code documentation](https://youtu.be/L7Ry-Fiij-M). The whole video is worth watching, but section 3 (starting at 10:44) is specifically about docstrings. As was done in the video, the templates for the docstrings are automatically created in VS Code using the "autoDocstring - Python Docstring Generator" extension, which you should install for use later in this tutorial. The format we'll be using is the [numpy docstring format](https://numpydoc.readthedocs.io/en/latest/format.html).

Docstrings and proper Python type hinting add a lot of value to code, especially when working in an IDE which fully utilizes them. Take note of how the VS Code pop-ups help you when working with the functions in the following cell.

In [3]:
def label_clusters(
        vals: pd.Series,
        eps: float = 0.001,
        min_samples: int = 10
    ) -> np.ndarray:
    """For determining the nominal values of data in a series containing one or more
    nominal values with some noise.

    Parameters
    ----------
    vals : pd.Series
        A series of data containing one or more nominal values with some noise.
    eps : float, optional
        Passed to `sklearn.cluster.DBSCAN()`. The maximum distance between two samples
        for one to be considered as in the neighborhood of the other, by default 0.001.
    min_samples : int, optional
        Passed to `sklearn.cluster.DBSCAN()`. The number of samples in a neighborhood
        for a point to be considered as a core point, by default 10.

    Returns
    -------
    np.ndarray
        An array of the same size as `vals` which contains the cluster labels for each
        element in `vals`. Noisy samples are given the label -1. A `vals` series
        containing, for example, one nominal temperature with noise should return an
        array with only one cluster label of -1.
   
    """
    reshaped_vals = vals.values.reshape(-1, 1)
    scaler = StandardScaler()
    reshaped_normalized_vals = scaler.fit_transform(reshaped_vals)
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    cluster_labels = dbscan.fit_predict(reshaped_normalized_vals)
    return cluster_labels

def unique_values(
    x: pd.Series, eps: float = 0.001, min_samples: int = 10
) -> list[float]:
    """Given a series of data containing one or more nominal values with some noise,
    returns a list of the nominal values.

    Parameters
    ----------
    x : pd.Series
        A series of data containing one or more nominal values with some noise.
    eps : float, optional
        Passed to `sklearn.cluster.DBSCAN()`. The maximum distance between two samples
        for one to be considered as in the neighborhood of the other, by default 0.001.
    min_samples : int, optional
        Passed to `sklearn.cluster.DBSCAN()`. The number of samples in a neighborhood
        for a point to be considered as a core point, by default 10.

    Returns
    -------
    list[float]
        The nominal values in `x` with the noise removed.
    """
    cluster_labels = label_clusters(x, eps=eps, min_samples=min_samples)
    unique_values = []
    for i in np.unique(cluster_labels):
        # average the values in each cluster
        unique_val = np.mean(x[cluster_labels == i])
        unique_val = round(unique_val, 1)
        unique_values.append(unique_val)
    return unique_values

def find_outlier_indices(x: pd.Series, threshold: float = 3) -> list[int]:
    """Finds the indices of outliers in a series of data.

    Parameters
    ----------
    x : pd.Series
        A series of data.
    threshold : float, optional
        The number of standard deviations from the mean to consider a value an outlier,
        by default 3.

    Returns
    -------
    list[int]
        The indices of the outliers in `x`.
    """
    z_scores = (x - x.mean()) / x.std()
    outliers = z_scores.abs() > threshold
    return list(outliers[outliers].index)

def find_temp_turnaround_point(df: pd.DataFrame) -> int:
    """Finds the index of the temperature turnaround point in a dataframe of
    a ZFCFC experiment which includes a column "Temperature (K)". Can handle two cases
    in which a single dataframe contains first a ZFC experiment, then a FC experiment.
    Case 1: ZFC temperature monotonically increases, then FC temperature monotonically
    decreases. Case 2: ZFC temperature monotonically increases, the temperature is
    reset to a lower value, then FC temperature monotonically increases. 
    
    Parameters
    ----------
    df : pd.DataFrame
        A dataframe of a ZFCFC experiment which includes a column "Temperature (K)".

    Returns
    -------
    int
        The index of the temperature turnaround point.
            
    """
    outlier_indices = find_outlier_indices(df['Temperature (K)'].diff())
    if len(outlier_indices) == 0:
        # zfc temp increases, fc temp decreases
        zero_point = abs(df['Temperature (K)'].iloc[20:-20].diff()).idxmin()
        return zero_point
    else:
        # zfc temp increases, reset temp, fc temp increases
        return outlier_indices[0]

We'll make a few classes to organize our datafiles into a useful form for further use in data analysis. The first will be one to handle individual .dat files, then a second to handle a dataset containing several .dat files on the same sample, and finally a third that will organize sample information.

Standard Python classes typically start with an `__init__()` function, which is called when an object of the class is created. Instance attributes (variables attached to each object) are typically defined here. `self` is the first parameter of every instance method. It is a reference to the object itself, and is used to access the object's attributes and methods. The [linked Real Python article](https://realpython.com/python3-object-oriented-programming/) gives a good explanation of `self` and how it is used.
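To make the role of `__init__()` and `self` concrete, here is a minimal sketch using a made-up `Thermometer` class (it is not used elsewhere in this tutorial). Each object stores its own attribute values, and methods reach them through `self`.

```python
class Thermometer:
    def __init__(self, label: str, temp_k: float) -> None:
        # attributes are attached to the object being created via `self`
        self.label = label
        self.temp_k = temp_k

    def temp_c(self) -> float:
        # `self` gives the method access to this particular object's data
        return self.temp_k - 273.15


room = Thermometer("room", 295.15)
cryostat = Thermometer("cryostat", 2.0)

print(room.label, round(room.temp_c(), 2))
print(cryostat.label, round(cryostat.temp_c(), 2))
```

Calling `room.temp_c()` is equivalent to calling `Thermometer.temp_c(room)`; Python passes the object in as `self` for you.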

An outline of the `DatFile` class has been given in the following cell. The `__init__()` function contains attribute definitions, where the values of the attributes are determined by various functions within the class. This is a good pattern for keeping the `__init__()` function readable -- instead of writing out the code to determine the attribute values, we can just call the functions that do that work. The functions will be defined later in the class.

Several functions have already been written. Most of these are adapted from previous tutorials. Take note of the differences when used in classes -- for example, `_get_comments()` no longer needs a `file` argument, since the `self` argument gives it access to the attributes of the `DatFile` object, such as `data`.

A note on style: the functions `_get_comments()`, `_read_header()`, and `_read_data()` are prefixed with an underscore. This is a convention that indicates that the function is intended to be used only within the class. It is not enforced by Python, but is a good practice to follow.

Note that the function definitions for `_determine_length()`, `_determine_hash()`, and `_get_date_created()` are single lines with the word `pass`. The `pass` keyword is used as a placeholder for code that will be written later. In this case, it's useful because we generally know what we want to do within the `__init__()` method but we need the class to have partial functionality while we write the rest of it. For example, you'll need access to the `local_path` and `header` attributes when writing those functions, and it's helpful to have a functioning partially defined class to work with while writing the rest of the class.
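The following sketch shows why `pass` is handy as a placeholder. The `Report` class is a made-up example: its `__init__()` already calls a method that has not been written yet, and the class can still be instantiated while the method is filled in later.

```python
class Report:
    def __init__(self, title: str) -> None:
        self.title: str = title
        self.summary = self._build_summary()

    def _build_summary(self):
        pass  # placeholder: the method runs and returns None until it is written


draft = Report("M vs H")
print(draft.title, draft.summary)
```

A function whose body is only `pass` returns `None`, so `draft.summary` is `None` until `_build_summary()` is implemented.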

In [4]:
class DatFile:
    """A class for reading and storing data from a Quantum Design .dat file from a
    MPMS3 magnetometer.

    Attributes
    ----------
    local_path : Path
        The path to the .dat file.
    header : list[list[str]]
        The header of the .dat file.
    data : pd.DataFrame
        The data from the .dat file.
    comments : OrderedDict[str, list[str]]
        Any comments found within the "[Data]" section of the .dat file.
    length : int
        The length of the .dat file in bytes.
    sha512 : str
        The SHA512 hash of the .dat file.
    date_created : datetime
        The date and time the .dat file was created.
    experiments_in_file : list[str]
        The experiments contained in the .dat file. Can include "mvsh", "zfc", "fc",
        and/or "zfcfc".
    """
    def __init__(self, file_path: str | Path) -> None:
        self.local_path: Path = Path(file_path)
        self.header: list[list[str]] = self._read_header()
        self.data: pd.DataFrame = self._read_data()
        self.comments: OrderedDict[str, list[str]] = self._get_comments()
        self.length: int = self._determine_length()
        self.sha512: str = self._determine_hash()
        self.date_created: datetime = self._get_date_created()
        self.experiments_in_file: list[str] = self._get_experiments_in_file()

    def __str__(self) -> str:
        return f"DatFile({self.local_path.name})"
    
    def __repr__(self) -> str:
        return f"DatFile({self.local_path.name})"

    def _read_header(self, delimiter: str = "\t") -> list[list[str]]:
        header: list[list[str]] = []
        with self.local_path.open() as f:
            reader = csv.reader(f, delimiter=delimiter)
            for row in reader:
                header.append(row)
                if row[0] == "[Data]":
                    break
        if len(header[2]) == 1:
            # some .dat files have a header that is delimited by commas
            header = self._read_header(delimiter=",")
        return header

    def _read_data(self, sep: str = "\t") -> pd.DataFrame:
        skip_rows = len(self.header)
        df = pd.read_csv(self.local_path, sep=sep, skiprows=skip_rows)
        if df.shape[1] == 1:
            # some .dat files have a data section that is delimited by commas
            df = self._read_data(sep=",")
        return df
    
    def _get_comments(self) -> OrderedDict[str, list[str]]:
        comments = self.data['Comment'].dropna()
        comments = OrderedDict(comments)
        for key, value in comments.items():
            comments[key] = [comment.strip() for comment in value.split(',')]
        return comments
        
    def _determine_length(self) -> int:
        return self.local_path.stat().st_size

    def _determine_hash(self) -> str:
        buf_size = 4 * 1024 * 1024 # 4MB chunks
        hasher = hashlib.sha512()
        with self.local_path.open("rb") as f:
            while data := f.read(buf_size): #the `:=` is called the walrus operator
                # https://realpython.com/python-walrus-operator/
                hasher.update(data)
        return hasher.hexdigest()

    def _get_date_created(self) -> datetime:
        for line in self.header:
            if line[0] == "FILEOPENTIME":
                day = line[2]
                hour = line[3]
                break
        hour24 = datetime.strptime(hour, "%I:%M %p")
        day = [int(x) for x in day.split("/")]
        return datetime(day[2], day[0], day[1], hour24.hour, hour24.minute)
    
    def _get_experiments_in_file(self) -> list[str]:
        experiments = []
        if self.comments:
            for comments in self.comments.values():
                for comment in comments:
                    if comment.lower() in ["mvsh", "zfc", "fc", "zfcfc"]:
                        experiments.append(comment.lower())
        else:
            if len(self.data['Magnetic Field (Oe)'].unique()) == 1:
                experiments.append('zfcfc')
            else:
                experiments.append('mvsh')
        return experiments
        

### Exercise 3.1

Go back to the definition of the `DatFile` class and write the `__str__()` and `__repr__()` functions for the `DatFile` class. In this case, both should return a string with the format: "DatFile(<name of file>)". For example, if the file name is "mvsh1.dat" then `__str__()` and `__repr__()` should both return "DatFile(mvsh1.dat)".

Remember that by using the `self` argument you have access to all attributes within the class, including the `local_path` attribute, which contains the file as a `Path` object. One of the `Path` methods should be useful here.

The `__str__()` function is what is called when you `print()` an instance of the class. The `__repr__()` is the "official" string representation of the object, and is what is returned when you call `repr()` on an instance of the class. [Here is a good explanation of the difference between the two](https://stackoverflow.com/questions/1436703/difference-between-str-and-repr). In this case they can be the same thing, but that isn't always the case.
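As an illustration of how the two can differ, the standard library `datetime` class defines both. The `Point` class below is a made-up example that defines only `__repr__()`, in which case `str()` falls back to it.

```python
from datetime import datetime

moment = datetime(2020, 7, 11, 11, 7)
readable = str(moment)   # '2020-07-11 11:07:00'
official = repr(moment)  # 'datetime.datetime(2020, 7, 11, 11, 7)'


class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __repr__(self) -> str:
        return f"Point(x={self.x}, y={self.y})"


p = Point(1, 2)
# with no __str__() defined, str() uses __repr__()
print(str(p), repr(p))
# containers such as lists show the repr() of their elements
print([p])
```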

In [5]:
assert str(DatFile(mvsh1_path)) == "DatFile(mvsh1.dat)"
assert repr(DatFile(dataset4_path)) == "DatFile(dataset4.dat)"

### Exercise 3.2

Go back to the definition of the `DatFile` class and finish writing the `_determine_length()` function. `_determine_length()` should return the size in bytes of the file. See [`Path.stat()`](https://docs.python.org/3/library/stat.html).
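As a quick, self-contained illustration of `Path.stat()`, the sketch below writes a small temporary file (the file name and contents are made up) and reads its size back from the `st_size` field.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example.dat"
    # 8 + 1 + 6 + 1 = 16 bytes
    example_path.write_bytes(b"[Header]\n[Data]\n")
    size_in_bytes = example_path.stat().st_size

print(size_in_bytes)
```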



In [6]:
assert DatFile(mvsh2_path).length == 277199
assert DatFile(zfcfc2_path).length == 255376

### Exercise 3.3

Go back to the definition of the `DatFile` class and finish writing the `_determine_hash()` function. `_determine_hash()` should return the SHA512 hash of the file. [See this Stack Overflow post](https://stackoverflow.com/questions/22058048/hashing-a-file-in-python)
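One common approach, sketched here on a made-up temporary file, is to feed the file to the hasher in chunks so that large files never need to fit in memory at once. The helper name `sha512_of_file` and the chunk size are illustrative choices; the result is the same as hashing all the bytes in one call.

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of_file(path: Path, chunk_size: int = 1024) -> str:
    """Hash a file chunk by chunk and return the hex digest."""
    hasher = hashlib.sha512()
    with path.open("rb") as f:
        # f.read() returns b"" at the end of the file, which ends the loop
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


payload = b"0123456789" * 1000
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "example.dat"
    example_path.write_bytes(payload)
    chunked_digest = sha512_of_file(example_path)

one_shot_digest = hashlib.sha512(payload).hexdigest()
print(chunked_digest == one_shot_digest)
```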

In [7]:
assert DatFile(mvsh3_path).sha512 == "ea925d9931781ce2797c5ced4825d09f2a1254e6ee0ec453667b896ec5d7eaa366680c32138c14ed42a4fa9df9d719d32e052a32c2f2201ce6eff7ac63909c94"
assert DatFile(zfcfc3_path).sha512 == "2771939279adecf506904d637cc0eb312c97c68926ac90f9a59fc37dd11505a61e6a2e1fe403c8fa22fbcb09b143d3f65c8d9a1e012193c83373dd926e92ad99"

### Exercise 3.4

Go back to the definition of the `DatFile` class and finish writing the `_get_date_created()` function. `_get_date_created()` should return the date the file was created as a `datetime` object. In this case, the .dat file header contains information about the date the file was created. Use that rather than, for example, `Path.stat().st_ctime`, which returns the date the file was last changed. Here's a good [deep dive on datetime](https://youtu.be/TFa38ONq5PY), though you can probably figure out what you need for this function from the [Python docs](https://docs.python.org/3/library/datetime.html).

Note that in the `assert` tests below we use one of the string representations of the datetime object, in particular the ISO formatted date. One of the nice things about the `datetime` class is the easy way in which you can read in dates of standard formats and output them to other standard formats. It's a good argument for writing data files that contain date information in a standard format.
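Here is a short sketch of reading a date in one format and writing it in another. The date and time strings are made-up examples in US month/day/year and 12-hour clock form, which is an assumption about the input rather than a statement about every .dat header.

```python
from datetime import datetime

# hypothetical date and time strings
day_text = "07/11/2020"
time_text = "11:07 AM"

# parse the combined text using an explicit format
created = datetime.strptime(f"{day_text} {time_text}", "%m/%d/%Y %I:%M %p")

# write it out in ISO 8601 format, then read that text back in
iso_text = created.isoformat()
round_trip = datetime.fromisoformat(iso_text)
print(iso_text)
```

Because the ISO text round-trips without needing a custom format string, it is a convenient form for storing dates in files.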

In [8]:
assert DatFile(mvsh1_path).date_created.isoformat() == "2020-07-11T11:07:00"
assert DatFile(mvsh4_path).date_created.isoformat() == "2022-05-03T22:44:00"

We can start to see the benefit of classes by how our workflow changes now that we have the `DatFile` class to work with. We can make as many objects of the `DatFile` class as we want, and each one will have all of the attributes and functions defined in the class.

In [9]:
for file in all_files:
    dat_file = DatFile(file)
    print(f"{dat_file} experiments: {dat_file.experiments_in_file}")

DatFile(mvsh1.dat) experiments: ['mvsh']
DatFile(mvsh2.dat) experiments: ['mvsh']
DatFile(mvsh2a.dat) experiments: ['mvsh']
DatFile(mvsh2b.dat) experiments: ['mvsh']
DatFile(mvsh3.dat) experiments: ['mvsh']
DatFile(mvsh4.dat) experiments: ['mvsh']
DatFile(zfcfc1.dat) experiments: ['zfcfc']
DatFile(zfcfc2.dat) experiments: ['zfcfc']
DatFile(zfcfc3.dat) experiments: ['zfcfc']
DatFile(fc4a.dat) experiments: ['fc']
DatFile(fc4b.dat) experiments: ['fc']
DatFile(zfc4a.dat) experiments: ['zfc']
DatFile(zfc4b.dat) experiments: ['zfc']
DatFile(zfcfc4.dat) experiments: ['zfc', 'fc', 'zfc', 'fc']
DatFile(dataset4.dat) experiments: ['zfc', 'fc', 'zfc', 'fc', 'mvsh']


In [10]:
mvsh1 = DatFile(mvsh1_path)
mvsh1.data.head()

Unnamed: 0,Comment,Time Stamp (sec),Temperature (K),Magnetic Field (Oe),Moment (emu),M. Std. Err. (emu),Transport Action,Averaging Time (sec),Frequency (Hz),Peak Amplitude (mm),...,Map 07,Map 08,Map 09,Map 10,Map 11,Map 12,Map 13,Map 14,Map 15,Map 16
0,,3803627317,2.000165,70000.375,0.736924,0.000996,1,1,13.006381,0.999015,...,,,,,,,,,,
1,,3803627320,2.000241,69995.39844,0.736522,0.001055,1,1,13.006381,0.999028,...,,,,,,,,,,
2,,3803627325,1.999892,69746.85938,0.7374,0.00147,1,1,13.006381,0.999024,...,,,,,,,,,,
3,,3803627334,2.000141,69286.15625,0.736039,0.000992,1,1,13.006381,0.999066,...,,,,,,,,,,
4,,3803627335,1.999827,69246.48438,0.737444,0.00102,1,1,13.006381,0.999066,...,,,,,,,,,,


In [11]:
zfcfc4 = DatFile(zfcfc4_path)
zfcfc4.comments

OrderedDict([(0, ['ZFC', '100']),
             (1894, ['FC', '100']),
             (3766, ['ZFC', '1000']),
             (5659, ['FC', '1000'])])

## Datasets

Let's write a new class, `Dataset`, to handle datasets, which we'll define as the collection of files associated with a single sample.

Two important concepts in object-oriented programming are [inheritance and composition](https://realpython.com/inheritance-composition-python/). Composition is perhaps easier to understand, and is what we'll be using here. The `Dataset` class will be composed of `DatFile` objects -- that is, the `Dataset.files` attribute will be a list of `DatFile` objects. The `Dataset` class will have its own attributes and functions, but through the objects in `files` it will also have access to all of the attributes and functions of the `DatFile` class. Here's [a more advanced discussion of composition and inheritance](https://youtu.be/0mcP8ZpUR38), and why composition is generally preferred when possible to make code more flexible.
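A minimal sketch of composition, using made-up `Page` and `Notebook` classes rather than the ones in this tutorial: the `Notebook` holds a list of `Page` objects and builds its own behavior on top of theirs, without inheriting from `Page`.

```python
class Page:
    def __init__(self, text: str) -> None:
        self.text = text

    def word_count(self) -> int:
        return len(self.text.split())


class Notebook:
    def __init__(self, pages: list[Page]) -> None:
        # composition: a Notebook *has* pages, it is not itself a Page
        self.pages = pages

    def total_words(self) -> int:
        # delegate to each contained object
        return sum(page.word_count() for page in self.pages)


notebook = Notebook([Page("zero field cooled"), Page("field cooled")])
print(notebook.total_words())
print(notebook.pages[0].word_count())
```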

We'll reserve the next cell for the `SampleInfo` class, which we'll come back to later.

In [12]:
@dataclass
class SampleInfo:
    """Information specific to the particular sample used for magnetic measurements.

    Attributes
    ----------
    material : str | None
        The material used for the sample. Possibly a chemical formula.
    comment : str | None
        Any comments about the sample.
    mass : float | None
        The mass of the sample in milligrams.
    volume : float | None
        The volume of the sample in milliliters or cubic centimeters.
    molecular_weight : float | None
        The molecular weight of the sample in grams per mole.
    size : float | None
        The size of the sample in millimeters.
    shape : str | None
        The shape of the sample.
    holder : str | None
        The type of sample holder used. Usually "quartz", "straw", or "brass".
    holder_detail : str | None
        Any additional details about the sample holder.
    offset : float | None
        The vertical offset of the sample holder in millimeters.
    eicosane_mass : float | None
        The mass of the eicosane in milligrams.
    diamagnetic_correction : float | None
        The diamagnetic correction in emu/mol.
    """
    material: str | None = None
    comment: str | None = None
    mass: float | None = None
    volume: float | None = None
    molecular_weight: float | None = None
    size: float | None = None
    shape: str | None = None
    holder: str | None = None
    holder_detail: str | None = None
    offset: float | None = None
    eicosane_mass: float | None = None
    diamagnetic_correction: float | None = None

    def rinehart_usage(self) -> None:
        """
        The Rinehart group uses the "Sample Volume" and "Sample Size" fields to store
        the eicosane mass and diamagnetic correction, respectively. This function
        moves those values to their correct fields and sets the "Sample Volume" and
        "Sample Size" fields to None.
        """
        self.eicosane_mass = self.volume
        self.diamagnetic_correction = self.size
        self.volume, self.size = None, None


The following cell contains an outline for the `Dataset` class with the `__init__()` and `_get_data_from_commented_file()` methods already written. More on the latter method in a bit.

In [13]:
class Dataset:
    """A dataset is a collection of magnetometry data files containing experiments run
    on the same sample.

    Attributes
    ----------
    id : str
        A unique identifier for the dataset.
    files : list[DatFile]
        A list of the .dat files that make up the dataset as DatFile objects.
    mvsh : dict[float, pd.DataFrame]
        A dictionary of the mvsh data for each temperature in the dataset. The keys are
        the nominal temperatures in Kelvin and the values are the mvsh data as a
        pandas DataFrame.
    zfc : dict[float, pd.DataFrame]
        A dictionary of the zfc data for each field in the dataset. The keys are the
        nominal fields in Oe and the values are the zfc data as a pandas DataFrame.
    fc : dict[float, pd.DataFrame]
        A dictionary of the fc data for each field in the dataset. The keys are the
        nominal fields in Oe and the values are the fc data as a pandas DataFrame.
    sample_info : SampleInfo
        Information specific to the particular sample used for magnetic measurements.
    """
    def __init__(self, id: str, dat_files: list[DatFile]) -> None:
        self.id: str = id
        self.files: list[DatFile] = dat_files
        self.mvsh: dict[float, pd.DataFrame] = self._get_mvsh()
        self.zfc: dict[float, pd.DataFrame] = self._get_zfc()
        self.fc: dict[float, pd.DataFrame] = self._get_fc()
        self.sample_info: SampleInfo = self._get_sample_info()

    def __str__(self) -> str:
        return f"Dataset({self.id})"
    
    def __repr__(self) -> str:
        return f"Dataset({self.id})"

    def _get_mvsh(self) -> dict[float, pd.DataFrame]:
        mvsh_files = [file for file in self.files if 'mvsh' in file.experiments_in_file]
        mvsh_dfs = {}
        for file in mvsh_files:
            if file.comments:
                mvsh_dfs.update(self._get_data_from_commented_file(file, 'mvsh'))
            else:
                df = file.data.copy()
                df['cluster'] = label_clusters(df['Temperature (K)'])
                # np.unique() sorts the labels, matching the order used in unique_values()
                for temp, cluster in zip(unique_values(df['Temperature (K)']), np.unique(df['cluster'])):
                    mvsh_dfs[temp] = df[df['cluster'] == cluster].drop(columns=['cluster']).reset_index(drop=True)
        return mvsh_dfs
    
    def _get_zfc(self) -> dict[float, pd.DataFrame]:
        zfc_files = [file for file in self.files if set(['zfc', 'zfcfc']).intersection(file.experiments_in_file)]
        zfc_dfs = {}
        for file in zfc_files:
            if file.comments:
                zfc_dfs.update(self._get_data_from_commented_file(file, 'zfc'))
            else:
                df = file.data.copy()
                turnaround_point = find_temp_turnaround_point(df)
                zfc = df.iloc[:turnaround_point].reset_index(drop=True)
                avg_field = round(zfc['Magnetic Field (Oe)'].mean())
                zfc_dfs[avg_field] = zfc
        return zfc_dfs

    def _get_fc(self) -> dict[float, pd.DataFrame]:
        fc_files = [file for file in self.files if set(['fc', 'zfcfc']).intersection(file.experiments_in_file)]
        fc_dfs = {}
        for file in fc_files:
            if file.comments:
                fc_dfs.update(self._get_data_from_commented_file(file, 'fc'))
            else:
                df = file.data.copy()
                turnaround_point = find_temp_turnaround_point(df)
                fc = df.iloc[turnaround_point:].reset_index(drop=True)
                avg_field = round(fc['Magnetic Field (Oe)'].mean())
                fc_dfs[avg_field] = fc
        return fc_dfs

    @staticmethod
    def _get_data_from_commented_file(file: DatFile, experiment: str) -> dict[float, pd.DataFrame]:
        data = file.data
        comments = file.comments
        experiment_dfs = {}
        for i, (dat_idx, comment_list) in enumerate(comments.items()):
            if experiment.lower() in map(str.lower, comment_list):
                for comment in comment_list:
                    if match := re.search(r'\d+', comment):
                        nominal_value = float(match.group())
                        break
                start_idx = dat_idx + 1
                end_idx = list(comments.keys())[i+1] if i+1 < len(comments) else (len(data))
                experiment_dfs[nominal_value] = data.iloc[start_idx:end_idx].reset_index(drop=True)
        return experiment_dfs
    
    def _get_sample_info(self) -> SampleInfo:
        sample = SampleInfo()
        for line in self.files[0].header:
            category = line[0]
            if category != "INFO":
                continue
            if not line[1]:
                continue
            info = line[2]
            if info == "SAMPLE_MATERIAL":
                sample.material = line[1]
            elif info == "SAMPLE_COMMENT":
                sample.comment = line[1]
            elif info == "SAMPLE_MASS":
                sample.mass = float(line[1])
            elif info == "SAMPLE_VOLUME":
                sample.volume = float(line[1])
            elif info == "SAMPLE_MOLECULAR_WEIGHT":
                sample.molecular_weight = float(line[1])      
            elif info == "SAMPLE_SIZE":
                sample.size = float(line[1])
            elif info == "SAMPLE_SHAPE":
                sample.shape = line[1]
            elif info == "SAMPLE_HOLDER":
                sample.holder = line[1]
            elif info == "SAMPLE_HOLDER_DETAIL":
                sample.holder_detail = line[1]
            elif info == "SAMPLE_OFFSET":
                sample.offset = float(line[1])
        return sample

The following cell should run at any stage of your progress on `Dataset`. It creates a `Dataset` object for each sample in the "data" folder.

Similar to the last tutorial, the files in the "data" folder contain .dat files formatted in several different ways:
- "mvsh1.dat" and "mvsh2.dat" contain multiple M vs H experiments at different nominal temperatures within the same files.
- "mvsh2a.dat" and "mvsh2b.dat" contain the same experiments as "mvsh2.dat", but with separate files for the 5 and 300 K experiments. "mvsh3.dat" also contains a single M vs H experiment.
- "mvsh4.dat" contains a single M vs H experiment with a user-created comment within the "[Data]" section of the file indicating the nominal temperature of the experiment.
- "zfcfc1.dat", "zfcfc2.dat", and "zfcfc3.dat" contain ZFC and FC experiments at one nominal field in each file, though the first two have ZFC and FC experiments both with monotonically increasing temperatures while the third has a ZFC with monotonically increasing temperatures and an FC with monotonically decreasing temperatures.
- "zfc4a.dat" and "zfc4b.dat" contain ZFC experiments at 100 and 1000 Oe respectively. "fc4a.dat" and "fc4b.dat" contain FC experiments at 100 and 1000 Oe respectively. All contain user-created comments within the "[Data]" section of the file indicating the nominal field of the experiment.
- "zfcfc4.dat" contains the previously mentioned ZFC and FC experiments for sample 4 but all in one file. The experiments are separated by user-created comments.
- "dataset4.dat" contains the data within "zfcfc4.dat" and "mvsh4.dat", with user-created comments separating the experiments.

In [14]:
# run this cell before running assert tests
dset1 = Dataset("dset1", [DatFile(mvsh1_path), DatFile(zfcfc1_path)])
dset2_1 = Dataset("dset2", [DatFile(mvsh2_path), DatFile(zfcfc2_path)])
dset2_2 = Dataset("dset2", [DatFile(mvsh2a_path), DatFile(mvsh2b_path), DatFile(zfcfc2_path)])
dset3 = Dataset("dset3", [DatFile(mvsh3_path), DatFile(zfcfc3_path)])
dset4_1 = Dataset("dset4", [DatFile(mvsh4_path), DatFile(zfcfc4_path)])
dset4_2 = Dataset("dset4", [DatFile(mvsh4_path), DatFile(zfc4a_path), DatFile(zfc4b_path), DatFile(fc4a_path), DatFile(fc4b_path)])
dset4_3 = Dataset("dset4", [DatFile(dataset4_path)])

all_datasets = [dset1, dset2_1, dset2_2, dset3, dset4_1, dset4_2, dset4_3]

### Exercise 3.5

Write the `__str__()` and `__repr__()` functions for the `Dataset` class. In this case, both should return a string with the format: "Dataset(<name of sample>)", where the name of the sample is the `Dataset.id` attribute. For example, if the sample name is "dset1" then `__str__()` and `__repr__()` should both return "Dataset(dset1)".

In [15]:
assert str(dset1) == repr(dset1) == "Dataset(dset1)"
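
As a hint for the pattern (not the solution itself), here is a minimal sketch on a hypothetical `Sample` class that is not part of the tutorial code. When a class defines only `__repr__()`, Python's default `__str__()` falls back to it, so `str()` and `repr()` return the same string.

```python
class Sample:
    """Hypothetical stand-in class, used only to illustrate the pattern."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        # Build a "ClassName(identifier)" style string from an attribute
        return f"Sample({self.name})"


s = Sample("abc")
print(str(s))   # str() falls back to __repr__ because __str__ is not defined
print(repr(s))
```

Defining `__str__()` separately is only needed when the two strings should differ.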

### Exercise 3.6

Write the `_get_mvsh()`, `_get_zfc()`, and `_get_fc()` functions for the `Dataset` class. You should be able to use combinations of functions you wrote for the previous tutorial (the ones given near the top of this file).

- `_get_mvsh()` should return a dictionary with keys corresponding to the nominal temperatures of individual M vs H experiments with values of dataframes containing the data for each experiment. For example, if the sample has M vs H experiments at 5 and 300 K, then `_get_mvsh()` should return a dictionary with keys of 5 and 300 and values of dataframes containing the data for each experiment.
- `_get_zfc()` and `_get_fc()` should return a dictionary with the keys corresponding to the nominal fields of individual ZFC or FC experiments with values of dataframes containing the data for each experiment. For example, if the sample has ZFC experiments at 100 and 1000 Oe, then `_get_zfc()` should return a dictionary with keys of 100 and 1000 and values of dataframes containing the data for each experiment.

All three functions should have the same general structure:
1. Go through the `Dataset.files` attribute and find the files that contain the relevant experiments. Note that for ZFC (FC) experiments the `file.experiments_in_file` may be either "zfc" ("fc") or "zfcfc". [Python `set()` methods may help you here.](https://realpython.com/python-sets/)
2. Create an empty dictionary
3. Go through the files that contain the relevant experiments and add the data for each experiment to the dictionary. If the files contain user-commented data, you'll need to use `Dataset._get_data_from_commented_file()` to get the data. Otherwise you'll need to implement the algorithms we created in the previous tutorial to get the data. [See the `dict.update()` method.](https://www.geeksforgeeks.org/python-dictionary-update-method/)
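
The three steps above can be sketched with stand-in objects. This is only an illustration under assumptions: `experiments_in_file` is assumed to be a list of strings, the file names are made up, and `fake_results` is a placeholder for whatever your real extraction code returns for each file.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for DatFile objects; only the attributes used here.
files = [
    SimpleNamespace(name="a.dat", experiments_in_file=["mvsh"]),
    SimpleNamespace(name="b.dat", experiments_in_file=["zfc"]),
    SimpleNamespace(name="c.dat", experiments_in_file=["zfcfc"]),
    SimpleNamespace(name="d.dat", experiments_in_file=["mvsh", "zfcfc"]),
]

# Placeholder per-file results: {nominal field: data}. Real code would
# return dataframes here instead of strings.
fake_results = {
    "b.dat": {100: "zfc data at 100 Oe"},
    "c.dat": {1000: "zfc data at 1000 Oe"},
    "d.dat": {5000: "zfc data at 5000 Oe"},
}

# Step 1: keep files whose experiment labels overlap the labels we want.
wanted = {"zfc", "zfcfc"}
zfc_files = [f for f in files if wanted.intersection(f.experiments_in_file)]

# Step 2: create an empty dictionary.
zfc = {}

# Step 3: merge each file's {field: data} dictionary into the result.
for f in zfc_files:
    zfc.update(fake_results[f.name])

print([f.name for f in zfc_files])
print(sorted(zfc))
```

An empty intersection is falsy, so files with no matching label are skipped. Note that `dict.update()` overwrites existing keys, which matters if two files contain an experiment at the same nominal field.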

#### Static Methods

You may have noticed a `@staticmethod` line above the definition of `_get_data_from_commented_file()`. The `@` symbol marks that line as a [decorator](https://realpython.com/primer-on-python-decorators/), which is applied to the function defined directly below it. A ["static method"](https://realpython.com/instance-class-and-static-methods-demystified/) is a method that does not require an instance of the class to be called (i.e., it doesn't take the `self` argument). It is placed in the class because it is closely tied to what the class does. Technically, you can call `_get_data_from_commented_file()` from outside the class, but it's unlikely that you'll need to. If the function were potentially useful outside the class, you would define it outside the class, as we did with `label_clusters()`, `unique_values()`, `find_outlier_indices()`, and `find_temp_turnaround_point()`.
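
To make the idea concrete, here is a small sketch using a hypothetical `FieldList` class (not part of the tutorial code). The static method takes no `self`, so it can be called on the class itself, while instance methods can still reach it through `self`.

```python
class FieldList:
    """Hypothetical class used only to illustrate @staticmethod."""

    def __init__(self, fields_oe: list[float]) -> None:
        self.fields_oe = fields_oe

    @staticmethod
    def _oe_to_koe(field_oe: float) -> float:
        # No `self` parameter: the method depends only on its argument
        return field_oe / 1000

    def in_koe(self) -> list[float]:
        # Inside the class, the static method is still reachable via `self`
        return [self._oe_to_koe(f) for f in self.fields_oe]


# Called on the class, with no instance created
print(FieldList._oe_to_koe(1000.0))

# Called indirectly through an instance method
print(FieldList([1000.0, 70000.0]).in_koe())
```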

In [16]:
# calling `_get_data_from_commented_file()` from outside of a class instance works
some_data = Dataset._get_data_from_commented_file(DatFile(dataset4_path), 'mvsh')
some_data[293].head()

Unnamed: 0,Comment,Time Stamp (sec),Temperature (K),Magnetic Field (Oe),Moment (emu),M. Std. Err. (emu),Transport Action,Averaging Time (sec),Frequency (Hz),Peak Amplitude (mm),...,Map 07,Map 08,Map 09,Map 10,Map 11,Map 12,Map 13,Map 14,Map 15,Map 16
0,,3860779996,293.223587,-70000.0,,,6.0,,,,...,,,,,,,,,,
1,,3860780025,293.209259,-65000.42188,,,6.0,,,,...,,,,,,,,,,
2,,3860780055,293.174545,-60000.35547,,,6.0,,,,...,,,,,,,,,,
3,,3860780086,293.203033,-55000.37891,,,6.0,,,,...,,,,,,,,,,
4,,3860780116,293.191849,-50000.18359,,,6.0,,,,...,,,,,,,,,,


In [17]:
assert list(dset1.mvsh.keys()) == [2, 4, 6, 8, 10, 12, 300]
assert dset1.mvsh[2].shape == (1124, 89)
assert list(dset1.zfc.keys()) == [100]
assert dset1.zfc[100].shape == (252, 89)
assert dset2_1.mvsh.keys() == dset2_2.mvsh.keys()
assert dset2_1.mvsh[5]["Magnetic Field (Oe)"].equals(dset2_2.mvsh[5]["Magnetic Field (Oe)"])
assert dset4_1.mvsh.keys() == dset4_2.mvsh.keys() == dset4_3.mvsh.keys()
assert dset4_1.zfc.keys() == dset4_2.zfc.keys() == dset4_3.zfc.keys()

Let's add one more bit of utility to the `Dataset` class. Since all files within a dataset pertain to a single magnetometry sample, it would be useful to be able to get sample information (e.g. mass, molecular weight, diamagnetic correction, etc.) from the dataset. That sample information is stored in the header of the .dat files.

In [18]:
for line in dset1.files[0].header:
    if line[0] == "INFO":
        print(line)

['INFO', 'MPMS3 Measurement Release 1.1.16 Build 424', ' MultiVu Release 2.3.4.19', 'APPNAME', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['INFO', 'Linear Motor Servo Controller', 'MOTOR_MODULE_NAME', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
['INFO', '3101-100 J0', 'MOTOR_HW_VERSION', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''

This information would be well-suited to a `dataclass`. Python `dataclass` objects are similar to classes, but are intended to primarily store data rather than perform operations. They have a convenient format for defining them and come with some useful built-in methods. Here's [a great intro video](https://youtu.be/CvQ7e6yUtnw) and [article](https://realpython.com/python-data-classes/).

In short, we can define a `dataclass` using the `@dataclass` decorator (making sure to include `from dataclasses import dataclass` at the top of the file). Defining the attributes of the class simply requires giving the name and type of each attribute. We then get `__init__()`, `__repr__()`, and `__eq__()` methods for free, and because `str()` falls back to `__repr__()` when no `__str__()` is defined, printing works out of the box too. We can also define methods within the class as we would for a normal class.

In [19]:
@dataclass
class SimpleSampleInfo:
    material: str | None = None
    mass: float | None = None

The above class uses optional arguments: the type of `material` is `str` or `None` (the pipe symbol, `|`, means "or"), with a default of `None`. We need the arguments of the `SampleInfo` class we write below to be optional because most of the time the .dat files won't have all of the fields filled in.

Let's make a couple instances of the `SimpleSampleInfo` class to see how it works.

In [20]:
simple1 = SimpleSampleInfo("Fe3O4", 1.3) # all fields are present, so no keywords needed

simple2 = SimpleSampleInfo(mass = 2.6) # material is missing, so mass must be keyword argument

simple3 = SimpleSampleInfo() # since all fields are optional, no arguments are needed
simple3.mass = 3.9 # fields can be set after initialization

We get nice `__str__()` and `__repr__()` methods built in to the class.

In [21]:
assert str(simple1) == repr(simple1)
print(simple1)

SimpleSampleInfo(material='Fe3O4', mass=1.3)


### Exercise 3.7

Go back to the cell immediately above the one where we defined `Dataset` and create a `SampleInfo` class that has the following attributes:
- material
- comment
- mass
- volume
- molecular_weight
- size
- shape
- holder
- holder_detail
- offset
- eicosane_mass
- diamagnetic_correction

Attributes are optional with a default value of `None`, and the type of the attribute (if it exists) is either `str` or `float` (e.g. `material` is a `str`, `mass` is a `float`, etc.).

Add one method called `RinehartUsage()` to `SampleInfo`. The Rinehart group uses the "Sample Volume" and "Sample Size" fields in the .dat file to store the eicosane mass and diamagnetic correction, respectively. The `RinehartUsage()` method should move those values to their correct fields, set the "Sample Volume" and "Sample Size" fields to `None`, and return `None`.

### Exercise 3.8

Add a method `_get_sample_info()` to `Dataset` which reads the header of the first file in `Dataset.files` and returns a `SampleInfo` object with the information from the header. If the header does not include any sample information, the function should still return an empty `SampleInfo` object (i.e. all of the attributes are `None`).

In [22]:
assert dset1.sample_info.molecular_weight == 1566.22
assert dset2_1.sample_info.mass == 0.7
assert dset3.sample_info.holder == "Straw"
assert dset4_1.sample_info.comment == "brown powder"

## Documentation

Now that we have complete functions and classes, we can add documentation to them. Use the `numpy` formatting standards to add docstrings to `DatFile`, `SampleInfo`, and `Dataset`. Since these are classes, the docstrings should be added to the class definition, not the `__init__()` method. Further, because these classes only contain "private" methods (i.e. methods that are only used within the class), we can skip adding docstrings to those methods. Proper type hinting, meaningful variable names, and inline comments should be enough for those simple methods to help anyone who needs to use them.

From before, here is [a video on code documentation](https://youtu.be/L7Ry-Fiij-M) and here are the [numpy formatting standards](https://numpydoc.readthedocs.io/en/latest/format.html).
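
As a reminder of the layout, here is a minimal sketch of a class-level docstring in the numpy style. The `Thermometer` class is hypothetical and exists only to show where the sections go; your own docstrings will describe different parameters and attributes.

```python
class Thermometer:
    """
    A hypothetical temperature reading, used only to show the docstring layout.

    Parameters
    ----------
    kelvin : float
        Temperature in kelvin.

    Attributes
    ----------
    kelvin : float
        Temperature in kelvin, as passed at construction.
    celsius : float
        Temperature in degrees Celsius.
    """

    def __init__(self, kelvin: float) -> None:
        self.kelvin = kelvin
        self.celsius = kelvin - 273.15


# The docstring is attached to the class, so help() and IDE tooltips can show it
print(Thermometer.__doc__)
```

The constructor arguments go in the "Parameters" section of the class docstring, which is why `__init__()` itself carries no docstring.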