### Assignment 1

The folder `results` contains the results of 11 density functional theory (DFT) calculations of mine. The goal of this exercise is to traverse the files in its subfolders, collect the key values in a summary dictionary, and save that dictionary as a JSON file. To do so:

- Traverse the subfolders in the `results` folder with the function `walk` of the built-in `os` Python package (see info [here](https://docs.python.org/3/library/os.html#os.walk)) and store all file paths in a list. For working with paths you may want to consider using the `pathlib` package (see info [here](https://www.pythonmorsels.com/pathlib-module/)), but this is optional (plain strings with the `os.path.join` function work too).
- Loop through the path list and open the `log.txt` and `duration.txt` files with a context manager (i.e., use: `with open("log.txt", "r") as f:`). Further information on opening and writing files can be found [here](https://www.geeksforgeeks.org/python/difference-between-modes-a-a-w-w-and-r-in-built-in-open-function/) and [here](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files). 
    - Inside a `log.txt` file: Search the lines of text for the phrases *"the Fermi energy is"* and *"total energy"* and save each value in a variable (see screenshot below). Remember the string methods `split` and `strip`, which may be helpful for cleaning up the string before converting it to a float value.

    ![Relevant information of log.txt file](./dft_log_example.jpg)
    - Inside a `duration.txt` file: Take the first line, which has the form `hours:minutes:seconds`, convert it to seconds, and save it in an integer variable.
- Handle the case where the required information is not available (one of the calculations encountered an error, so its values are missing). One option is the `try` and `except` approach. Another is the `else` clause of a `for` loop: `break` out of the loop once a value is found, and the `else` block runs only if the loop finishes without hitting `break` (i.e., no value was found).
- Save all the collected information from all folders in a dictionary that has the folder name as an integer key and the corresponding value is another dictionary that contains the keys "duration_seconds" (with an integer value), "Fermi_energy" (with a float value), "total_energy" (with a float value), and "error" (with a boolean value). The final dictionary should look something like this (the values may of course look different, but the structure should look basically the same):
```Python
results = {
    0: {
        "duration_seconds": 478, 
        "Fermi_energy": 4.743, 
        "total_energy": -7056.654239, 
        "error": False
        }, 
    1: {
        "duration_seconds": 1289, 
        "Fermi_energy": 7.047, 
        "total_energy": -6429.00935, 
        "error": False
        },
    2: {
        "duration_seconds": None, 
        "Fermi_energy": None, 
        "total_energy": None, 
        "error": True
        },
     ...
     }
```
- Finally, save that dictionary to a JSON file using the built-in `json` package utilizing a context manager (similar to how you opened files).
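
The parsing steps above can be sketched roughly as follows. This is only a minimal illustration, not the reference solution: the helper names `extract_float` and `to_seconds` are hypothetical, and the two sample log lines are made up for the example (the real `log.txt` layout may differ, so the token handling may need adjusting).

```python
def extract_float(lines, marker):
    """Return the number that follows `marker` on the first matching line, or None."""
    for line in lines:
        if marker in line:
            # Keep only the text after the marker and drop a possible "=" sign
            tail = line.split(marker, 1)[1].replace("=", " ")
            value = float(tail.split()[0])
            break  # value found: the else branch below is skipped
    else:
        # Runs only if the loop ended without hitting `break`
        value = None
    return value


def to_seconds(hms):
    """Convert an 'hours:minutes:seconds' string to an integer number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Made-up sample lines (assumed format, for illustration only)
sample_log = [
    "     the Fermi energy is     4.7430 ev",
    "!    total energy              =   -7056.654239 Ry",
]

print(extract_float(sample_log, "the Fermi energy is"))  # 4.743
print(extract_float(sample_log, "total energy"))         # -7056.654239
print(extract_float(sample_log, "not in the log"))       # None
print(to_seconds("0:07:58"))                             # 478
```

With real files, `lines` can simply be the open file object inside the `with` block, since iterating over a file yields its lines one by one.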

In [None]:
# Traverse the 'results' folder and collect all file paths
import os
from pathlib import Path

results_dir = Path("results")
file_paths = []
for root, dirs, files in os.walk(results_dir):
    for fname in files:
        file_paths.append(str(Path(root) / fname))

# Print the collected file paths
print(f"Found {len(file_paths)} files under '{results_dir}'")
print("Path list:")
for p in file_paths:
    print(p)

Found 22 files under 'results'
Path list:
results/9/log.txt
results/9/duration.txt
results/0/log.txt
results/0/duration.txt
results/7/log.txt
results/7/duration.txt
results/6/log.txt
results/6/duration.txt
results/1/log.txt
results/1/duration.txt
results/10/log.txt
results/10/duration.txt
results/8/log.txt
results/8/duration.txt
results/4/log.txt
results/4/duration.txt
results/3/log.txt
results/3/duration.txt
results/2/log.txt
results/2/duration.txt
results/5/log.txt
results/5/duration.txt


In [3]:
import json

results_summary = {}
results_dir = Path("results")

for sub in sorted(results_dir.iterdir(), key=lambda p: int(p.name) if p.name.isdigit() else p.name):
    if not sub.is_dir():
        continue
    key = int(sub.name) if sub.name.isdigit() else sub.name
    fermi = None
    total = None
    duration_seconds = None
    error = False

    # Parse log.txt
    log_path = sub / "log.txt"
    try:
        with open(log_path, "r") as f:
            for line in f:
                if 'the Fermi energy is' in line:
                    parts = line.split()
                    try:
                        # assume last token is the numeric value
                        fermi = float(parts[-1])
                    except Exception:
                        try:
                            # try second last
                            fermi = float(parts[-2])
                        except Exception:
                            fermi = None
                if 'total energy' in line:
                    parts = line.split()
                    for tok in reversed(parts):
                        try:
                            total = float(tok)
                            break
                        except Exception:
                            continue
    except Exception:
        # Covers a missing file (FileNotFoundError) and any other read failure
        error = True

    # Parse duration.txt
    duration_path = sub / "duration.txt"
    try:
        with open(duration_path, "r") as f:
            first = f.readline().strip()
            if first:
                # expected format H:M:S
                try:
                    h, m, s = [int(x) for x in first.split(":")]
                    duration_seconds = h*3600 + m*60 + s
                except Exception:
                    duration_seconds = None
    except Exception:
        # Covers a missing file (FileNotFoundError) and any other read failure
        error = True

    # If any required value is missing, mark error True
    if fermi is None or total is None or duration_seconds is None:
        error = True

    results_summary[key] = {
        "duration_seconds": duration_seconds,
        "Fermi_energy": fermi,
        "total_energy": total,
        "error": error
    }

# Save to JSON
out_path = Path("results_summary.json")
with open(out_path, "w") as f:
    json.dump(results_summary, f, indent=2)

print(f"Parsed {len(results_summary)} folders. Summary saved to {out_path}.")

Parsed 11 folders. Summary saved to results_summary.json.
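
One detail to keep in mind when reading the summary back: JSON object keys are always strings, so `json.dump` writes the integer folder keys as `"0"`, `"1"`, and so on. The short sketch below (with a made-up two-entry dictionary, not the real results) shows the round trip and one possible way to restore integer keys after loading.

```python
import json

# Made-up miniature summary with integer keys, for illustration only
summary = {0: {"error": False}, 10: {"error": True}}

# Serialize and parse again, as a file write followed by a read would do
text = json.dumps(summary)
loaded = json.loads(text)
print(list(loaded.keys()))  # ['0', '10']

# Convert the top-level keys back to integers
restored = {int(key): value for key, value in loaded.items()}
print(restored == summary)  # True
```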


### Assignment 2

Utilize Python to automate a task you have done in the past. This can be anything you previously did manually, but now can automate using Python. A few random examples, to give you an idea of what that could be:

- Read some CSV output files that you obtained from an experimental device and do a basic analysis. You can use the built-in `csv` package here. For XLSX files there is a library called `openpyxl`, but it is not a built-in library (i.e., you would need to install it). In that case, it may be easier to convert your XLSX files to CSV.
- Copy files from one folder to another (e.g., for backup or sorting purposes). This could utilize the `os`, `pathlib`, and `shutil` built-in packages.
- Using the `input` function, you could write a basic Python script that converts one set of units to another.
- Be creative! Anything you have previously done that required manual repetition on a computer or with files can be automated with Python!
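
As a rough sketch of the unit-conversion idea, the snippet below converts between a few length units. The names `TO_METRES`, `convert_length`, and `prompt_and_convert` are hypothetical and chosen only for this example; the `ask` parameter defaults to `input` so the prompts can be replaced by a stand-in function when testing.

```python
# Conversion factors from each unit to metres
TO_METRES = {
    "mm": 0.001,
    "cm": 0.01,
    "m": 1.0,
    "km": 1000.0,
    "in": 0.0254,
    "ft": 0.3048,
}


def convert_length(value, from_unit, to_unit):
    """Convert `value` from `from_unit` to `to_unit` by going through metres."""
    return value * TO_METRES[from_unit] / TO_METRES[to_unit]


def prompt_and_convert(ask=input):
    """Ask for a value and two units, then return the converted value."""
    value = float(ask("Value: "))
    from_unit = ask("From unit: ").strip()
    to_unit = ask("To unit: ").strip()
    return convert_length(value, from_unit, to_unit)


print(convert_length(2, "km", "m"))  # 2000.0
```

Calling `prompt_and_convert()` in a notebook cell or script starts the interactive prompts. An unknown unit raises a `KeyError`, which a fuller script would probably catch and report.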

In [5]:
import csv
from statistics import mean, median, pstdev

csv_path = Path("SI HAADF 20250723 1334 88000 x.csv")
summary = {}

with open(csv_path, newline='') as csvfile:
    reader = csv.reader(csvfile)
    rows = list(reader)

# Create column names
# Use the first row (index 0) as names and the third row (index 2) as units
col_names = [c.strip('"') for c in rows[0]]
units = [c.strip('"') for c in rows[2]]

# data starts at row index 3
data_rows = rows[3:]
# Convert to floats where possible
cols = list(zip(*data_rows))
num_cols = {}
for i, col in enumerate(cols):
    nums = []
    for v in col:
        try:
            nums.append(float(v))
        except Exception:
            pass
    num_cols[i] = nums

# Build summary
summary['file'] = str(csv_path)
summary['n_rows'] = len(data_rows)
summary['columns'] = []
for i, name in enumerate(col_names):
    col_summary = {
        'index': i,
        'name': name.strip('"'),
        'unit': units[i].strip('"') if i < len(units) else None,
        'numeric': len(num_cols[i]) > 0
    }
    if col_summary['numeric']:
        vals = num_cols[i]
        col_summary.update({
            'count': len(vals),
            'mean': mean(vals),
            'median': median(vals),
            'std_pop': pstdev(vals) if len(vals) > 1 else 0.0,
            'min': min(vals),
            'max': max(vals)
        })
    summary['columns'].append(col_summary)

# capture head (first 10 data rows)
summary['head'] = [row for row in data_rows[:10]]

out_path = Path("csv_summary_SI_HAADF_20250723.json")
with open(out_path, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"CSV analyzed: {csv_path} -> summary saved to {out_path}")

CSV analyzed: SI HAADF 20250723 1334 88000 x.csv -> summary saved to csv_summary_SI_HAADF_20250723.json


In [None]:
import shutil


def copy_files(src, dst, pattern='*', overwrite=False, dry_run=True):
    """Copy files matching pattern from src to dst.

    Args:
        src (str/Path): source directory
        dst (str/Path): destination directory
        pattern (str): glob or fnmatch pattern (e.g. '*.csv')
        overwrite (bool): overwrite existing files
        dry_run (bool): if True, only print actions
    Returns:
        list: list of (src_path, dst_path, action) tuples
    """
    src = Path(src)
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    actions = []
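    for p in src.rglob(pattern):
        # Skip directories and anything already inside the destination
        # (rglob is recursive, so a dst located under src would be revisited)
        if not p.is_file() or dst.resolve() in p.resolve().parents:
            continue
        target = dst / p.name
        if target.exists() and not overwrite:
            # Record the skip and move on, in dry-run mode as well
            actions.append((str(p), str(target), 'skip'))
            continue
        actions.append((str(p), str(target), 'copy'))
        if not dry_run:
            shutil.copy2(p, target)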
    for a in actions:
        print(a)
    return actions


# Perform the actual copy (dry_run=False); set dry_run=True to only preview the actions
copy_files(
    "/Users/emma/Documents/GitHub/ML4MSD-HW--PeiyunF-/Homework1/Homework/Homework1",
    "/Users/emma/Documents/GitHub/ML4MSD-HW--PeiyunF-/Homework1/Homework/Homework1/BackUp",
    pattern="*.csv",
    overwrite=False,
    dry_run=False
)

('/Users/emma/Documents/GitHub/ML4MSD-HW--PeiyunF-/Homework1/Homework/Homework1/SI HAADF 20250723 1334 88000 x.csv', '/Users/emma/Documents/GitHub/ML4MSD-HW--PeiyunF-/Homework1/Homework/Homework1/BackUp/SI HAADF 20250723 1334 88000 x.csv', 'copy')


[('/Users/emma/Documents/GitHub/ML4MSD-HW--PeiyunF-/Homework1/Homework/Homework1/SI HAADF 20250723 1334 88000 x.csv',
  '/Users/emma/Documents/GitHub/ML4MSD-HW--PeiyunF-/Homework1/Homework/Homework1/BackUp/SI HAADF 20250723 1334 88000 x.csv',
  'copy')]

### Assignment 3

Go through the `ML4MSD-Files/Resources/Python_resources.md` file and summarize any interesting observations or things you learned from it.

I have not gone through all the resources yet, but I will! Thank you for the information, Professor!