# Regular Expressions *(regex)*: XRDML file parser

## Background
Regular expressions (*regex*) are a text-based pattern-matching tool. They were developed to extract information from text and are especially handy for machine-generated files, since such files are generally standardized and keep a consistent format.

In regex, patterns (combinations of regular characters and metacharacters) are matched against strings. The large number of available tokens and token combinations often makes complex patterns hard to read. [regex101](https://www.regex101.com) is a very useful site for exploring regular expressions and trying them out interactively.
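
To make the idea of matching a pattern against a string concrete, here is a minimal sketch. The input line and its field names are made up for illustration and do not come from an XRDML file.

```python
import re

# Hypothetical one-line record, as a measuring instrument might write it
line = "scan_017 temperature=23.5C counts=1042"

# "counts=" consists of regular characters and is matched literally;
# \d+ is a metacharacter sequence that matches one or more digits;
# the parentheses capture the digits as a group
pattern = re.compile(r"counts=(\d+)")

match = pattern.search(line)
counts = int(match.group(1))
print(counts)  # 1042
```

The same principle (literal anchors around a capture group) is what the XRDML patterns in this exercise rely on.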


## Task
For this project, we want to write a regex-based file parser for XRDML files, a standardized XML-based format for XRD data. I have created a [shareable example](https://regex101.com/r/ddp03v/3) for the following exercise, in which the `sample name` is already matched as an example. (Please note that the number of intensity counts in this online version has been reduced for performance reasons.)



## Inputs:
- You can download the XRDML files [here](https://hessenbox-a10.rz.uni-frankfurt.de/getlink/fi6WviJEF3n4FdeidUXDqW/xdrml). If you are using Google Colab, drag and drop them into your runtime environment; if you are working locally, use the local path to the files.
- The file content can also be imported directly from the cloud using the `requests` module.
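
Both input routes can be combined in one small helper. The following is only a sketch: the function name `load_text` is hypothetical, and it assumes that the files are UTF-8 encoded and that anything starting with `http://` or `https://` should be downloaded.

```python
def load_text(source):
    """Return the content of a local file or of a URL as one string."""
    if source.startswith(("http://", "https://")):
        import requests  # imported here so that local use works without it

        response = requests.get(source, timeout=30)
        response.raise_for_status()  # stop early if the download failed
        return response.content.decode("utf-8")
    # otherwise treat the argument as a local file path
    with open(source, "r", encoding="utf-8") as file:
        return file.read()
```

With such a helper, `fileString = load_text(filePath)` and `fileString = load_text(url)` would behave the same way for the rest of the exercise.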


## Steps:
- use the `re`, `pandas` and `matplotlib` modules
- define patterns that match and extract: `sample name`, `comment`, `begin 2Theta`, `end 2Theta`, `intensities`  
- write a file parser that makes use of implemented regex-based data extraction functions
- plot the XRD diffractogram


### The first block is solved as an example!

In [6]:
import re

'''
#import via file path
filePath = '/content/2023_0219_MB1_10.xrdml'
with open(filePath, 'r') as file:
    fileString = file.read()
'''
#import via hyperlink
import requests
url = "https://hessenbox-a10.rz.uni-frankfurt.de/dl/fi2ZisSCaud2xa8AqmJ6MN/2023_0219_MB1_10.xrdml"
fileString = requests.get(url).content.decode('utf-8')  # decode() already returns a str


pattern_name = r'<sample.*>[\s\w]*<id>(.*)<\/id>'
re_name = re.compile(pattern_name)
# findall() returns a list holding the content of the capture group `(.*)` for every match;
# [0] selects the first (and only) match
matched_group = re_name.findall(fileString)[0]

print(matched_group)

0219_MB1_10


Pattern explanation for `'<sample.*>[\s\w]*<id>(.*)<\/id>'`:

- `<sample` matches the literal text "<sample"
- `.` matches any character except a newline, `*` means `zero or more` repetitions of the character/expression defined before
- `>` matches the character ">"
- `[\s\w]*` matches any word and whitespace character 0-inf times (e.g., spaces, newlines, tabs or other entries within the `<sample>` element)
- `<id>` matches "<id>"
- `(.*)` matches any character via `.` for `zero or more` times via `*`. The round brackets let us extract this information as a capturing group, i.e., the portion we ultimately want to extract.
- `<\/id>` matches `</id>`, which marks the end of our pattern. Please note that the forward slash `/` is escaped as `\/` because some regex flavors (for example those that use `/` as pattern delimiter, as regex101 does by default) require it. In Python's `re` module the escape is optional and harmless.

In summary:
- `<sample.*>` matches the sample tag including any attribute (in our case the ` type="To be analyzed"` portion)
- `[\s\w]*` bridges the whitespace (newlines, tabs, spaces) and any word characters between the opening `<sample>` tag and `<id>`
- `<id>(.*)<\/id>` matches the `id` tag including everything `.*` within it. The round brackets around `(.*)` define a group, making it possible to extract just the information of interest (i.e., the name itself) from within our pattern.
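
The following sketch runs the pattern on a short synthetic snippet so that the whole match and the capture group can be compared. The snippet only imitates the structure of a `<sample>` block; its content is invented.

```python
import re

# Synthetic snippet imitating a <sample> block (not taken from a real file)
snippet = """<sample type="To be analyzed">
    <id>demo_sample_01</id>
</sample>"""

# In Python the forward slash needs no escaping, so </id> is written as is
pattern_name = re.compile(r'<sample.*>[\s\w]*<id>(.*)</id>')

match = pattern_name.search(snippet)
whole_match = match.group(0)  # everything from "<sample" up to "</id>"
name = match.group(1)         # only the content of the capture group

print(name)                           # demo_sample_01
print(pattern_name.findall(snippet))  # ['demo_sample_01']
```

`search()` gives access to both the whole match and the group, whereas `findall()` directly returns the group content for every match, which is why the example above indexes its result with `[0]`.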



Next, we need to extract the `<intensities>...</intensities>` entry, which holds our y-axis data.

- define a pattern to grab intensity values as a capture group from within the `<intensities>...</intensities>` tag
- use `findall()` to extract information from the capture group into a `match` variable
- create an empty list `y`
- write a loop to iteratively access individual intensities within `match` using `split()` and `append()` these as `float()` to `y`
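
One possible way to follow these steps is sketched below on a synthetic snippet. The `unit` attribute and the numbers are assumptions made for illustration; check the real file to see what the opening tag looks like.

```python
import re

# Synthetic snippet; the attribute and the values are invented
snippet = '<intensities unit="counts">12 15 340 1280 310 18</intensities>'

# [^>]* skips optional attributes in the opening tag,
# ([^<]*) captures everything up to the next tag
pattern_intensities = re.compile(r'<intensities[^>]*>([^<]*)</intensities>')

# findall() returns a list with one string per matched tag
match = pattern_intensities.findall(snippet)

y = []
for value in match[0].split():  # split() without arguments splits on any whitespace
    y.append(float(value))

print(y)  # [12.0, 15.0, 340.0, 1280.0, 310.0, 18.0]
```

Using `[^<]*` instead of `.*` keeps the capture group from running past the closing tag, even if the values are spread over several lines.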


In [None]:
pattern_intensities = re.compile(r"")




The XRDML format does not store every 2Theta value (x-axis) individually. It only stores the range as `StartPosition` and `EndPosition`, which we read into `xMin` and `xMax`, respectively.

- Define a pattern that contains two capture groups, which match both `StartPosition` and `EndPosition` and store the values in `xMin` and `xMax` variables.
- The x-array is generated using the `xMin` and `xMax` variables in combination with the number of measurements (`len(y)`). Create a list in which the number of values matches that of `y`, and the minimum and maximum values are `xMin` and `xMax`, respectively.
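
The two steps can be sketched as follows. The snippet is synthetic, and the exact spelling of the tags (`startPosition`/`endPosition`) is an assumption, which is why the pattern is compiled with `re.IGNORECASE`. The variable `n` stands in for `len(y)`.

```python
import re

# Synthetic snippet; tag spelling, attributes and values are assumptions
snippet = """<positions axis="2Theta" unit="deg">
    <startPosition>10.0</startPosition>
    <endPosition>20.0</endPosition>
</positions>"""

# Two capture groups: one for the start value, one for the end value
pattern_angles = re.compile(
    r'<startPosition>([\d.]+)</startPosition>\s*<endPosition>([\d.]+)</endPosition>',
    re.IGNORECASE,
)

# With two groups, findall() returns a list of tuples
start, end = pattern_angles.findall(snippet)[0]
xMin, xMax = float(start), float(end)

n = 5  # stands in for len(y)
step = (xMax - xMin) / (n - 1)  # n points span n - 1 intervals
x = [xMin + i * step for i in range(n)]

print(x)  # [10.0, 12.5, 15.0, 17.5, 20.0]
```

If a file contains more than one pair of positions, the pattern may have to include the `axis` attribute to pick the 2Theta entry. `numpy.linspace(xMin, xMax, n)` would generate the same values as the list comprehension.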


In [None]:
pattern_angles = re.compile(r'')






In a final step, create a function `parse_xrdml`, which takes `filePath` as its input argument and returns `x`, `y` and `name`, by reusing the code fragments developed above. This enables us to easily plot multiple spectra within one plot.

- create a `list` containing the 3 filenames
- iterate over the `list` and invoke the function in every step to receive `x`, `y` and the sample name, then plot the respective spectrum using the sample name as label
- create a legend & axis labels
- Please place `plt.show()` after the loop, not inside it, in order to receive one plot with three spectra instead of three individual plots.
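
A possible structure is sketched below. It is not the reference solution: `parse_xrdml_text` and `plot_spectra` are hypothetical names, the parser takes the file content as a string instead of a path, the intensity and angle patterns carry the same assumptions about tag spelling as any pattern written without looking at the real file, and the axis units in the labels are assumed.

```python
import re

PATTERN_NAME = re.compile(r'<sample.*>[\s\w]*<id>(.*)</id>')
PATTERN_INTENSITIES = re.compile(r'<intensities[^>]*>([^<]*)</intensities>')
PATTERN_ANGLES = re.compile(
    r'<startPosition>([\d.]+)</startPosition>\s*<endPosition>([\d.]+)</endPosition>',
    re.IGNORECASE,
)


def parse_xrdml_text(fileString):
    """Extract x, y and the sample name from XRDML content given as a string."""
    name = PATTERN_NAME.findall(fileString)[0]
    y = [float(v) for v in PATTERN_INTENSITIES.findall(fileString)[0].split()]
    start, end = PATTERN_ANGLES.findall(fileString)[0]
    xMin, xMax = float(start), float(end)
    step = (xMax - xMin) / (len(y) - 1)
    x = [xMin + i * step for i in range(len(y))]
    return x, y, name


def plot_spectra(spectra):
    """Plot a list of (x, y, name) tuples into a single figure."""
    import matplotlib.pyplot as plt  # imported here so parsing works without it

    for x, y, name in spectra:
        plt.plot(x, y, label=name)
    plt.xlabel("2Theta (deg)")        # unit is an assumption
    plt.ylabel("Intensity (counts)")  # unit is an assumption
    plt.legend()
    plt.show()  # called once, after the loop


# Synthetic stand-in for a file, only used to try out the parser
demo = """<sample type="To be analyzed">
    <id>demo_sample_01</id>
</sample>
<positions axis="2Theta" unit="deg">
    <startPosition>10.0</startPosition>
    <endPosition>12.0</endPosition>
</positions>
<intensities unit="counts">5 50 5</intensities>"""

x, y, name = parse_xrdml_text(demo)
print(name, x, y)  # demo_sample_01 [10.0, 11.0, 12.0] [5.0, 50.0, 5.0]
```

For the real files, each entry of the list would first be read into a string (from disk or via `requests`), passed through the parser, and the resulting tuples collected in a list that is handed to `plot_spectra` once.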

In [None]:
def parse_xrdml(filePath):






    return x, y, name


files = [
    "https://hessenbox-a10.rz.uni-frankfurt.de/dl/fi2ZisSCaud2xa8AqmJ6MN/2023_0219_MB1_10.xrdml",
    "https://hessenbox-a10.rz.uni-frankfurt.de/dl/fiVFAV4oBe3jDaMwpCbgnD/2023_0235_MB1_2.xrdml",
    "https://hessenbox-a10.rz.uni-frankfurt.de/dl/fiQvuZbzZjfQPHKdqeg9TH/2023_0248_MB1_66.xrdml",
]
