# X-ray Diffraction Practical Class - Data Analysis (Demonstrator's copy)

## Class Introduction

This [Jupyter](http://jupyter.org) notebook accompanies the X-ray Diffraction (XRD) practical class. It will help you complete Part 2 and Part 3 of the assessment.

## Learning Objectives - Whole Course
By the end of the two afternoon practical sessions you will:
- Have a basic understanding of the practical design of a diffractometer and be able to identify the key components of both kinds
- Be familiar with the concept of X-ray radiation in the context of powder diffraction and understand why X-rays of different energies are used to analyse samples with different compositions
- Have applied Bragg's law to convert between diffraction angle (&deg;2&theta;) and d-spacing (&#8491;)
- Have experience preparing a powder sample for measurement
- Appreciate the key components of a standard powder diffraction measurement and how to analyse the results
- Have a rudimentary knowledge of phase identification and have applied it to diffraction patterns of
    - A pure sample
    - A mixed sample
- Be able to interpret the diffraction patterns of mixtures of known composition to build a calibration curve, and use that curve to determine the composition of an unknown mixture from its pattern
- Have developed an appreciation of the use of Python for data analysis

# Important notes

Please read through the text accompanying the code cells carefully. Failure to do so will result in an inability to pass the quiz.

## Jupyter Notebook Introduction
[Jupyter notebook](http://www.jupyter.org) is an open-source web application, which can be hosted locally, for building rich project documents that combine documentation, images, live code and results. Projects are easy to share and, since we use the Jupyter interface for [Python](http://www.python.org) based projects, the system is truly cross-platform: it runs on Windows, macOS and Linux. Notebooks are increasingly used to report scientific data in peer-reviewed journals.[1]

The notebook is built up of cells; each cell is either <code>Markdown</code> or <code>source code</code>.

*This notebook is yours so feel free to edit the cells and expand or alter the examples to help you learn and understand the process. Remember to make a backup copy first though.*

[1]:https://doi.org/10.1103/PhysRevLett.116.061102

# Internal use

In [1]:
answer_dir = r'./answer_sheets/'

def write_to_file(filename, content):
    """Write the content to a file, creating the file if it doesn't exist."""
    with open(filename, 'a') as file:  # 'a' mode appends to the file or creates it if it doesn't exist
        file.write(f"{content}\n")

# write_to_file(answer_dir + "E.txt", output)


# Load imports

The `os` module provides functions for interacting with the operating system. You can use it to handle file paths, read or write files, and manage directories.

The `pandas` library is essential for data manipulation and analysis. It provides powerful data structures like DataFrames, which allow for easy handling of tabular data.

`numpy` is a fundamental package for numerical computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. 


`matplotlib` is a plotting library for creating static, animated, and interactive visualizations in Python. The pyplot module provides a MATLAB-like interface for creating plots and graphs.

`scipy` is a library used for scientific and technical computing. The find_peaks function from scipy.signal is used to identify peaks (local maxima) in data, which is particularly useful for signal processing.

`pybaselines` is a library for baseline correction, which is a common preprocessing step in signal processing.


$\color{red}{\text{IMPORTANT}}$

Ask a demonstrator for assistance if the imports fail to load correctly. Alternatively, you may try running the notebook on your own personal laptop.
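
Before asking for help, it can be useful to see which package is actually missing. The check below is only a sketch: it uses the standard-library `importlib.util.find_spec`, the package list is copied from the descriptions above, and `check_packages` is a hypothetical helper name that is not used anywhere else in this notebook.

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Package names taken from the descriptions above
required = ["pandas", "numpy", "matplotlib", "scipy", "pybaselines"]
status = check_packages(required)
for name, available in status.items():
    print(f"{name:12s} {'found' if available else 'MISSING'}")
```

Note that `pybaselines` is loaded from a local folder by the `sys.path.insert` line in the next cell, so it may be reported as missing until that line has run.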

In [2]:
%matplotlib qt5

import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.signal import find_peaks

# Make the bundled copy of pybaselines importable (os and sys are already imported above)
sys.path.insert(0, os.path.expanduser(os.path.join(".", "pybaselines-main")))

from pybaselines import Baseline
from pybaselines import polynomial

# Part 2 : Identification of pure unknown phase X

## Overview

You are given the XRD spectrum of an unknown phase X.

The broad objectives are to:

1. Identify a suitable background subtraction method for the XRD spectrum
2. Identify the 2-theta positions of the 3 strongest peaks in the XRD spectrum
3. Calculate the d-spacings corresponding to the peaks you found in Q2
4. Cross-reference the 2-theta peak positions against a reference materials database called 'PDF Database Index.csv' and hence identify the unknown material

Use the following code to help you answer the questions in Part 2 of the Blackboard quiz.


## Load data

In [3]:
relative_dir = 'data'
filename = 'part_2_unknown.csv'
# os.path.join inserts the correct separator on Windows, macOS and Linux
full_path = os.path.join(os.getcwd(), relative_dir, filename)

In [4]:
data = pd.read_csv(full_path, usecols=[1,2])

y = data['Intensity'].to_numpy()
x = data['2 theta'].to_numpy()

data.plot(x='2 theta', y='Intensity')
plt.title('Pure A')

Text(0.5, 1.0, 'Pure A')

## Identify the best background subtraction method

An anonymous scientist, S.H., collected this XRD spectrum from a sample on a glass substrate. The glass contributes a large background hump to the spectrum, which is detrimental to the analysis.

There are three different methods for background subtraction - only one of them will provide the right answer for quantitative analysis later on.

$\color{red}{\text{IMPORTANT}}$

The correct method should be fairly obvious. If you do not get this right, $\textbf{the error will propagate through the whole script}$ and your answers to the quiz will be wrong. Please check with a demonstrator if you are unsure.

Once you have settled on a particular method, $\textbf{rerun ALL cells under that method header}$ to ensure that the variables hold the correctly subtracted intensity values.

### Demonstrator's notes

Method 2 should yield the correct baseline. 
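
For demonstrators who want to explain why a morphological method copes well with a broad hump under narrow peaks, the sketch below may help. It is not the pybaselines implementation: it assumes only that the `mor` method builds on a morphological opening (a rolling minimum followed by a rolling maximum), and applies a plain opening to a synthetic pattern. The function name `opening_baseline` and all the numbers are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def opening_baseline(y, half_window):
    """Morphological opening: a rolling minimum followed by a rolling maximum."""
    window = 2 * half_window + 1
    padded = np.pad(y, half_window, mode="edge")
    eroded = sliding_window_view(padded, window).min(axis=1)
    padded = np.pad(eroded, half_window, mode="edge")
    return sliding_window_view(padded, window).max(axis=1)

# Synthetic pattern (made-up numbers): a broad hump plus two narrow peaks
x_demo = np.linspace(10, 90, 4000)
hump = 800 * np.exp(-((x_demo - 25) / 12) ** 2)
sharp_peaks = (3000 * np.exp(-((x_demo - 28.3) / 0.1) ** 2)
               + 1500 * np.exp(-((x_demo - 47.0) / 0.1) ** 2))
y_demo = hump + sharp_peaks

baseline = opening_baseline(y_demo, half_window=20)
y_subtracted = np.clip(y_demo - baseline, 0, None)   # same clipping as in the cells below
print(f"max of raw data:       {y_demo.max():.0f}")
print(f"max after subtraction: {y_subtracted.max():.0f}")
```

The window (here 41 points) is wider than a peak but much narrower than the hump, so the rolling minimum removes the peaks while the hump survives almost unchanged.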

### Try different background subtraction methods

#### Method 1

In [5]:
poly_order = 6
baseline, params = polynomial.imodpoly(data=y,x_data=x, poly_order=poly_order)
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

In [286]:

fig, ax = plt.subplots(tight_layout={'pad': 0.2})
data_handle = ax.plot(y)
baseline_handle = ax.plot(baseline, '--')
ax.legend(
    (data_handle[0], baseline_handle[0]),
    ('data','fit baseline'), frameon=False
)
plt.show()

In [17]:

plt.figure()
plt.title('Post-background subtraction')
plt.plot(x, y_subtracted, alpha=0.5, label='background subtracted')
plt.plot(x, y, alpha=0.5, label='original')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


#### Method 2

In [6]:
baseline_fitter = Baseline(x_data=x)
half_window = 20
baseline = baseline_fitter.mor(y, half_window=half_window)[0]
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

In [11]:

plt.figure()
plt.plot(y, label='data')
plt.plot(baseline, label=f'baseline')
plt.legend()

<matplotlib.legend.Legend at 0x275de30b700>

In [10]:

plt.figure()
plt.title('Post-background subtraction')
plt.plot(x, y_subtracted, alpha=0.5, label='background subtracted')
plt.plot(x, y, alpha=0.5, label='original')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


#### Method 3

In [12]:
baseline_fitter = Baseline(x_data=x)
baseline = baseline_fitter.beads( y, lam_0=0.00006, lam_1=0.00008, lam_2=0.05, fit_parabola=False, tol=1e-3, freq_cutoff=0.04, asymmetry=3)[0]
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

In [13]:
plt.figure()
plt.plot(y, label='data')
plt.plot(baseline, label=f'baseline')
plt.legend()

<matplotlib.legend.Legend at 0x275de62df40>

In [14]:
plt.figure()
plt.title('Post-background subtraction')
plt.plot(x, y_subtracted, alpha=0.5, label='background subtracted')
plt.plot(x, y, alpha=0.5, label='original')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


## Identify the 2-theta peak positions corresponding to the three largest d-spacings using scipy find_peaks function

Ensure that you have re-run the cells corresponding to the background subtraction method you have chosen.

Use the plotted graph to check that the first three prominent peaks have been found correctly before comparing them with the database.

Hint: the parameters of `find_peaks` have to be tuned for the first three peaks to be identified correctly.
Have a look at the documentation to see what `distance`, `height` and `prominence` do.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
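
To get a feel for the three parameters, the sketch below runs `find_peaks` on a synthetic pattern, first with no constraints and then with all three set. The data and threshold values are invented for illustration and are not the values needed for the quiz. In `find_peaks`, `height` is the minimum peak height, `distance` is the minimum separation between peaks in samples (not degrees), and `prominence` is how far a peak must stand out from its surroundings.

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic pattern (made-up numbers): three real peaks plus a small ripple
x_demo = np.linspace(20, 60, 2000)
y_demo = (
    1000 * np.exp(-((x_demo - 28.3) / 0.15) ** 2)
    + 600 * np.exp(-((x_demo - 47.0) / 0.15) ** 2)
    + 300 * np.exp(-((x_demo - 55.7) / 0.15) ** 2)
    + 20 * np.sin(40 * x_demo) ** 2   # ripple standing in for noise
)

# No constraints: every local maximum is reported, including the ripple
loose, _ = find_peaks(y_demo)
# With constraints: only tall, well-separated, prominent peaks survive
strict, _ = find_peaks(y_demo, height=100, distance=30, prominence=50)

print(f"no constraints:   {len(loose)} peaks")
print(f"with constraints: {len(strict)} peaks at 2-theta {np.round(x_demo[strict], 1)}")
```

Plotting `x_demo[strict]` against `y_demo[strict]`, as in the cell further down, is the quickest way to confirm that the surviving peaks are the ones you expect.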


### Demonstrator's notes

Student's copy will have messed up parameters. 

Correct parameters are roughly:
- find_peaks(y_subtracted, distance=30, height=350, prominence=100)  

In [7]:
peaks, _ = find_peaks(y_subtracted, distance=30, height=350, prominence=100)  
# Adjust distance/height as needed to only include peaks and not background
# Play with other parameters if necessary
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html


# Plot original data and highlight peaks
plt.figure()
plt.plot(x, y_subtracted, label='XRD Data')
plt.plot(x[peaks], y_subtracted[peaks], 'x', label='Peaks', color='red')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


In [8]:
for i in range(3):
    print(f'peak {i} :' + f'{round(x[peaks][i],2)}')

peak 0 :28.3
peak 1 :47.01
peak 2 :55.74


In [9]:
# # internal use only

# write_to_file(answer_dir + "part_2_ans.txt", f'Q2: Identify and key in values for the three peaks to the nearest 3 sf in degrees')

# for i in range(3):
#     write_to_file(answer_dir + "part_2_ans.txt", f'Q2: peak {i} { round(x[peaks][i],2)}')

## Calculate d-spacing for the three peaks identified

### Bragg's Law

Bragg's law is the simple numerical relationship between the angle (&theta;) at which diffracted intensity is observed, the wavelength of the radiation source (&lambda;) and the lattice spacing of the material being probed (d): $n\lambda = 2d\sin\theta$, where n is the order of diffraction.


### Hints


X-ray diffraction data is acquired over a range of 2&theta; angles, in degrees, using Cu K-alpha radiation. The lower and upper limits of the diffraction pattern govern the maximum and minimum lattice spacings observable within the pattern.

To calculate either the minimum or maximum lattice spacing (d-spacing) that can be measured, remember that 2&theta; and d-spacing are inversely related: first identify which limit is of interest, then rearrange Bragg's law to give the d-spacing.

`np.radians` converts the angles, which are measured in degrees, into radians, the unit that `np.sin` expects.
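
As a worked illustration of the inverse relationship, the sketch below evaluates Bragg's law at the two ends of a scan. The wavelength is the Cu K-alpha value used in this notebook; the scan limits of 10 and 90 degrees are assumed for illustration, and `bragg_d_spacing` is a hypothetical helper, separate from the function you are asked to fix below.

```python
import math

CU_K_ALPHA_NM = 0.15418  # Cu K-alpha wavelength used in this notebook, in nm

def bragg_d_spacing(two_theta_deg, wavelength=CU_K_ALPHA_NM, n=1):
    """d = n * wavelength / (2 * sin(theta)), where theta is half of 2-theta."""
    theta = math.radians(two_theta_deg / 2)
    return n * wavelength / (2 * math.sin(theta))

# Assumed scan limits, for illustration only
two_theta_min, two_theta_max = 10.0, 90.0
d_max = bragg_d_spacing(two_theta_min)   # smallest angle -> largest d-spacing
d_min = bragg_d_spacing(two_theta_max)   # largest angle -> smallest d-spacing
print(f"d_max = {d_max:.4f} nm, d_min = {d_min:.4f} nm")
```

With these assumed limits, d-spacings from roughly 0.109 nm up to roughly 0.885 nm fall inside the pattern.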


$\color{red}{\text{IMPORTANT}}$
    
There is an error in the Bragg's law function below. You must correct it, or it will give the wrong answer.



### Demonstrator's notes

The angle read from the pattern is 2&theta;, so it needs to be divided by 2 before taking the sine.

Change this to work out the answers in nm instead of Angstroms

In [9]:
def calculate_d_spacing(n, wavelength, theta):
    """
    Calculate the d-spacing using Bragg's law.

    Parameters:
        n (int): Order of diffraction
        wavelength (float): Wavelength of X-rays (in nm)
        theta (float): Diffraction angle 2-theta as read from the pattern (in degrees); it is halved below

    Returns:
        float: d-spacing (nm)
    """
    # Convert theta from degrees to radians
    theta = np.radians(theta)
    
    # Calculate d-spacing using Bragg's law
    d_spacing = (n * wavelength) / (2 * np.sin(theta/2))
    
    return d_spacing

n = 1
wavelength = 0.15418 # corresponding to Cu k_alpha 

peak_index = 2 # change this to match the peak index
theta = x[peaks][peak_index]

d_spacing = calculate_d_spacing(n, wavelength, theta)
print(f"The d-spacing is: {round(d_spacing,3)} nm")


The d-spacing is: 0.165 nm


In [11]:
# # internal use

# write_to_file(answer_dir + "part_2_ans.txt", f'Q2: Enter the correct d-spacing for the first three peaks that you find')

# for i, value in enumerate(peaks):
#     n = 1
#     wavelength = 1.5418
#     peak_index = i # change this to match the peak index
#     theta = x[peaks][peak_index]

#     d_spacing = calculate_d_spacing(n, wavelength, theta)
#     write_to_file(answer_dir + "part_2_ans.txt", f'Q3: peak {i} {round(d_spacing,2)} Angstroms')

## Compare found 2-theta peak positions to a reference material database

The powder diffraction pattern is a “fingerprint” of the material. We can use a database of peak positions and relative intensities to find entries that match your pattern; these entries are often referred to as “phases”. However, diffraction patterns are not always unique, so you must use chemical sense and select phases that could plausibly be present in your sample. Note that intensities can vary with preferred orientation and sample quality, and can differ from measurement to measurement, so sample rotation and good preparation are normally employed to help mitigate these factors.

Does it match? If so, on how many peaks? Is the match on peak position only, or on intensity as well?
Are there residual peaks left over? Could there be more than one phase?

Are there parameters you can change to reduce the number of matches? Keep it simple, the fewer phases the better.
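
The database lists intensities in its `I [%]` column, relative to the strongest peak of each phase. To compare measured peak heights on the same footing, they can be normalised in the same way. The sketch below does this for the three CaF2 peak heights that appear in the 'CaF2 100%' row of the calibration table later in this notebook; the variable names are illustrative only.

```python
# Background-subtracted CaF2 peak heights, copied from the 'CaF2 100%' row later in this notebook
peak_heights = {28.3: 2276.661, 47.0: 2344.43256, 55.75: 737.904756}

# Express every height as a percentage of the strongest peak
strongest = max(peak_heights.values())
relative = {pos: 100 * h / strongest for pos, h in peak_heights.items()}

for pos, pct in relative.items():
    print(f"2-theta {pos:6.2f}: I = {pct:5.1f} %")
```

Measured relative intensities rarely agree exactly with the database, for the preferred-orientation and sample-quality reasons given above, so treat them as a secondary check after peak position.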


### Load database

In [10]:
database_loc = 'PDF Database Index.csv'

database = pd.read_csv(database_loc,keep_default_na=False)

for col in database.columns:
    # Coerce column values to numbers wherever possible; leave text columns unchanged.
    # (errors='ignore' is deprecated in recent pandas, hence the explicit try/except.)
    try:
        database[col] = pd.to_numeric(database[col], downcast='integer')
    except (ValueError, TypeError):
        pass

database.head()


Unnamed: 0,Reference Code,PDF FileName,Powder Pattern,Name,No.,h,k,l,d,2Theta[deg],I [%]
0,00-004-0673,00-004-0673.pdf,00-004-0673.RD,Tin,1.0,2.0,0.0,0.0,2.915,30.645,100.0
1,00-004-0674,00-004-0673.pdf,00-004-0673.RD,Tin,2.0,1.0,0.0,1.0,2.793,32.019,90.0
2,00-004-0675,00-004-0673.pdf,00-004-0673.RD,Tin,3.0,2.0,2.0,0.0,2.062,43.872,34.0
3,,,,,,,,,,,
4,00-004-0783,00-004-0783.pdf,00-004-0783.RD,Silver,1.0,1.0,1.0,1.0,2.359,38.117,100.0


### Match the three found peaks to the database

$\color{red}{\text{IMPORTANT}}$


You need to input the correct `match_tolerance` level for the match to occur. 

E.g. if your experimentally found peak is at 27.4 and the database entry is at 27.9, you need a `match_tolerance` slightly greater than 0.5, because the code tests for a difference strictly less than the tolerance.

However, increasing `match_tolerance` too much can result in many false positive matches with the database.

Identify the corresponding pdf reference that matches to the correct phase.
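
The trade-off can be seen on a toy example. The sketch below uses a hypothetical four-entry reference table (the phase names and angles are invented, not taken from the PDF database) and the measured value of 27.4 from the example above; `match_peak` is an illustrative helper that applies the same strict `<` test as the matching cells below.

```python
# Hypothetical reference table: (phase name, reference 2-theta in degrees)
reference_peaks = [
    ("Phase P", 27.9),
    ("Phase Q", 27.1),
    ("Phase R", 28.6),
    ("Phase S", 30.6),
]

def match_peak(measured, references, tolerance):
    """Return names whose reference peak lies within `tolerance` degrees of `measured`."""
    return [name for name, ref in references if abs(ref - measured) < tolerance]

measured_peak = 27.4  # the example value from the text above
for tol in (0.2, 0.6, 1.5):
    print(f"tolerance {tol}: {match_peak(measured_peak, reference_peaks, tol)}")
```

A tolerance of 0.2 finds nothing, 0.6 finds two candidates and 1.5 finds three, which is why the tolerance should be only as large as the measurement uncertainty requires.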

#### Demonstrator's notes

correct match_tolerance = 0.3

Answer should be 'Calcium Fluoride'

#### Matching the first peak

In [11]:
target_peak = x[peaks][0] # this is finding matches for the first identified peak

match_tolerance = 0.3 # change as necessary
match_1 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_peak) < match_tolerance)]
match_1

24         Zinc Sulfide
41         Zinc Sulfide
136    Calcium Fluoride
Name: Name, dtype: object

#### Matching the second peak

In [12]:
target_value = x[peaks][1] # this is finding matches for the second identified peak
match_tolerance = 0.3 # change as necessary
match_2 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_value) < match_tolerance)]
match_2

137    Calcium Fluoride
Name: Name, dtype: object

#### Matching the third peak

In [15]:
target_value = x[peaks][2] # this is finding matches for the third identified peak
match_tolerance = 0.3 # change as necessary
match_3 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_value) < match_tolerance)]
match_3

138    Calcium Fluoride
Name: Name, dtype: object

#### Finding the common intersection of matches amongst the three peaks using set operations

Ideally, there should only be one final answer
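
The idea can be sketched with plain Python sets, using the candidate names printed by the three matching cells above. `set.intersection` accepts any number of sets, so all three peaks can be combined in one call.

```python
# Candidate names per peak, copied from the three match outputs above
candidates_per_peak = [
    ["Zinc Sulfide", "Zinc Sulfide", "Calcium Fluoride"],
    ["Calcium Fluoride"],
    ["Calcium Fluoride"],
]

# Converting to sets removes duplicates; the intersection keeps names present for every peak
common = set.intersection(*(set(names) for names in candidates_per_peak))
print(sorted(common))
# ['Calcium Fluoride']
```

If the intersection comes back empty, the tolerance is probably too tight for at least one peak, or one of the detected peaks does not belong to the phase.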

In [16]:
# Names common to all three peaks; try pairs (e.g. 1,3 or 2,3) if the full intersection is empty
intersecting_set = set(match_1).intersection(match_2, match_3)
intersecting_set = list(intersecting_set)
intersecting_set

['Calcium Fluoride']

In [None]:
# # internal use
# write_to_file(answer_dir + "part_2_ans.txt", f'Q4: Enter the material that you found' )

# write_to_file(answer_dir + "part_2_ans.txt", f'Q4: {intersecting_set}' )

# Part 3 Identifying unknown weight fractions of a known mixture of two unknown phases

A diffraction pattern is a function of the material being measured, both crystalline and amorphous. Once the phases, and therefore the peaks, of a pattern have been determined (qualitative analysis), it is possible to determine the quantity of each phase within the sample (quantitative analysis). This is possible because the diffracted peak area and, to a first approximation, the intensity of the recorded peak are functions of the X-ray source and the sample.


The area under a diffracted peak scales with the amount of the corresponding phase in the sample, so quantitative information can be gained by comparing peaks from different phases. The area under a peak is more difficult to calculate, so to a first approximation the recorded peak intensity (height) serves as a reasonable alternative, and the relative peak heights of different phases can be compared.
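
Why height can stand in for area is easy to check numerically. The sketch below assumes Gaussian peak shapes of identical width (both the shape and the width are assumptions made for illustration) and integrates two peaks of different height with the trapezoidal rule.

```python
import numpy as np

def gaussian_peak(x, height, centre, sigma):
    """Gaussian peak profile with the given height, centre and width."""
    return height * np.exp(-0.5 * ((x - centre) / sigma) ** 2)

x_demo = np.linspace(27.0, 29.6, 2601)   # 2-theta grid in degrees
sigma = 0.05                             # assumed width, identical for both peaks

ratios = []
for height in (1000.0, 2000.0):
    y_demo = gaussian_peak(x_demo, height, centre=28.3, sigma=sigma)
    # Trapezoidal rule written out explicitly
    area = np.sum(0.5 * (y_demo[1:] + y_demo[:-1]) * np.diff(x_demo))
    ratios.append(area / height)
    print(f"height {height:6.0f} -> area {area:7.2f}, area/height {area / height:.4f}")
```

The area-to-height ratio is the same for both peaks, so doubling the height doubles the area. This only holds while the peak width stays constant; if peaks broaden between samples, height is no longer a reliable proxy for area.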

There are several methods for quantitative analysis. They all rely on correlating some known parameter with a parameter recorded from the diffraction pattern itself, such as lattice parameter vs concentration or intensity vs concentration.

1. External Standard Method
2. Direct Comparison Method
3. Internal Standard Method

For the purpose of this course we are utilising a version of method 3 where we mix one crystalline material with a second and plot the intensity vs the calculated weight percent for known components.

## Overview

Each student is given a mix of CaF2 and an unknown phase A/B/C/D/E at random. As such, do not blindly copy your coursemate's answers. 

For illustration purposes, we will use CaF2-A for the following discussion.

3 known weight fractions are given - Pure CaF2, Pure A, and a 50-50 mixture of CaF2:A. There is one unknown weight fraction 'A Mystery%.csv', and your task is to identify the weight percentage of CaF2/A present in this file. 

To accomplish this, the task is broken down into the following steps:

1. Perform background subtraction as necessary.
2. Identify the 2-theta peak positions of the three strongest peaks for unknown material A and CaF2.
3. Identify material A from 'PDF Database Index.csv' (refer to Part 2).
4. Construct a calibration chart using the known weight fractions (Pure CaF2/Pure A/50-50 CaF2-A) and the peak intensities of the three strongest peaks from CaF2 and A.
5. Identify the background-subtracted peak intensities of the three strongest peaks of unknown material A in 'A Mystery%.csv'.
6. Use the calibration chart and these peak intensities to determine the unknown weight fraction of CaF2/A for 'A Mystery%.csv'.

#### Demonstrator's notes

Check answer sheets for peak location

## Load XRD spectra for pure unknown phase

Hint:

Load the appropriate data according to the unknown phase you are given on the quiz

In [47]:
file_path = os.path.join('.', 'data', 'part_3_data', 'Student', 'CaF2 - D', 'D 100%.csv')

data = pd.read_csv(file_path, usecols=[1,2])
data.plot(x='2 theta', y='Intensity')

<Axes: xlabel='2 theta'>

In [48]:
data.head()

Unnamed: 0,2 theta,Intensity
0,10.0001,1106.004145
1,10.020451,1080.012278
2,10.040802,1093.049056
3,10.061153,1098.008998
4,10.081504,1104.980224


## Perform background subtraction

Copy the appropriate code from the background subtraction code in Part 2.

In [49]:
baseline_fitter = Baseline(x_data=data['2 theta'])
half_window = 20
baseline = baseline_fitter.mor(data['Intensity'], half_window=half_window)[0]
y_subtracted = data['Intensity'] - baseline
y_subtracted[y_subtracted<0]=0
peaks, _ = find_peaks(y_subtracted, distance=30, height=350, prominence=100)  

In [50]:
# Plot original data and highlight peaks
plt.figure()
plt.plot(data['2 theta'], y_subtracted, label='XRD Data')
plt.plot(data['2 theta'][peaks], y_subtracted[peaks], 'x', label='Peaks', color='red')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


## Identify 2-theta peak positions for pure unknown phase

Fill in the 2-theta peak positions to the nearest 0.1 degree precision in `unknown_first_peak_loc`, `unknown_sec_peak_loc`, `unknown_third_peak_loc`

A code snippet to match the first peak is provided. However this is insufficient and you should check the other 2 peaks as well. Refer back to Part 2 if you are unsure.

In [51]:
x = data['2 theta'].to_numpy()
for i in range(3):
    print(f'peak {i} :' + f'{round(x[peaks][i],2)}')

peak 0 :36.74
peak 1 :42.68
peak 2 :61.94


In [52]:
# Fill in the peak locations to the nearest 0.1 degree precision.
unknown_first_peak_loc = 36.74

unknown_sec_peak_loc = 42.68


unknown_third_peak_loc = 61.94



In [187]:
# # internal use
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q5: Identify the highest intensity peak positions for unknown material')
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q5: {unknown_first_peak_loc}, {unknown_sec_peak_loc}, {unknown_third_peak_loc}')

## Identify pure unknown phase using the database

#### Hint

You may adapt the code from Part 2 to do an intersecting match.

Again, the match_tolerance parameter will affect the confidence of your matches.

A match between two out of the three peaks identified is sufficient to proceed.


In [53]:
target_peak = unknown_first_peak_loc # this is finding matches for the first identified peak

match_tolerance = 0.3 # change as necessary
match_1 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_peak) < match_tolerance)]
match_1

68              Uranium
93              Uranium
148    Titanium Nitride
152     Magnesium Oxide
Name: Name, dtype: object

In [54]:
target_value = unknown_sec_peak_loc # this is finding matches for the second identified peak
match_tolerance = 0.2 # change as necessary
match_2 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_value) < match_tolerance)]
match_2

149    Titanium Nitride
Name: Name, dtype: object

In [55]:
target_value = unknown_third_peak_loc # this is finding matches for the third identified peak
match_tolerance = 0.3 # change as necessary
match_3 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_value) < match_tolerance)]
match_3

150    Titanium Nitride
Name: Name, dtype: object

In [2]:
# # internal use
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q6: Identify unknown material')
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q6: MgO')

## Form the calibration curves using the known weight fractions

### Grab all known weight fraction xrd pattern csv file locations for loading


$\color{red}{\text{IMPORTANT}}$

Load the appropriate data according to the unknown phase you are given on the quiz

Ensure that the files are sorted in the order [B 100%..., B 50%..., CaF2 100%..., B Mystery%]. 

You might have to manually sort them:

`part_3_data_sorted = [part_3_data[1], part_3_data[0], part_3_data[3], part_3_data[2]]`

Even if the data is sorted when loaded, you still have to update the variable 'part_3_data_sorted'


For example:

['C 100%', 'C 50%', 'C Mystery%', 'CaF2 100%']

has to be re-arranged to

['C 100%', 'C 50%',  'CaF2 100%', 'C Mystery%']
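
If you would rather not work out the indices by hand, the required order can also be expressed as a sort key. This is only a sketch: `calibration_order` is a hypothetical helper, and it assumes the file names follow the pattern shown above (a phase label, a space, then either a percentage or `Mystery%`).

```python
def calibration_order(name):
    """Sort key: unknown-phase files by falling percentage, then CaF2, then the mystery file."""
    if "Mystery" in name:
        return (2, 0.0)
    if name.startswith("CaF2"):
        return (1, 0.0)
    percent = float(name.split()[-1].rstrip("%"))
    return (0, -percent)

names = ["C 100%", "C 50%", "C Mystery%", "CaF2 100%"]
print(sorted(names, key=calibration_order))
# ['C 100%', 'C 50%', 'CaF2 100%', 'C Mystery%']
```

The key works on file names without their extension, so to sort full paths you would first pass each path through `get_filename_without_extension`, defined in the next cell.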

In [56]:
def find_files(directory):
    """
    Find all CSV files within the specified directory and its subdirectories.

    Parameters:
        directory (str): The directory path to search for CSV files.

    Returns:
        list: A list of paths to CSV files.
    """
    csv_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".csv"):
                csv_files.append(os.path.join(root, file))
    return csv_files

def get_filename_without_extension(file_path):
    # Split the path into root and extension
    root, ext = os.path.splitext(file_path)
    # Split the root into directory and filename
    directory, filename = os.path.split(root)
    return filename

In [57]:
relative_dir = os.path.join('data', 'part_3_data', 'Student', 'CaF2 - D')

part_3_data = sorted(find_files(os.path.join(os.getcwd(), relative_dir)))

[get_filename_without_extension(file) for file in part_3_data]

['CaF2 100%', 'D 100%', 'D 50%', 'D Mystery%']

In [59]:
part_3_data_sorted = [part_3_data[1],part_3_data[2],part_3_data[0],part_3_data[3]]
[get_filename_without_extension(file) for file in part_3_data_sorted]

['D 100%', 'D 50%', 'CaF2 100%', 'D Mystery%']

### Create empty data structure first holding the known weight fraction peak intensities

$\color{red}{\text{IMPORTANT}}$

As a sanity check, confirm that the `weight_frac_CaF2` and `weight_frac_unknown` columns match the file names in the `intensity_data` dataframe created.

In [60]:
# 2-theta peaks for CaF2 are given
CaF2_first_peak_loc = 28.3
CaF2_sec_peak_loc = 47.0
CaF2_third_peak_loc = 55.75

In [61]:
default_peaks_list = [CaF2_first_peak_loc, CaF2_sec_peak_loc, CaF2_third_peak_loc, unknown_first_peak_loc, unknown_sec_peak_loc, unknown_third_peak_loc]
# You might have to change the variable name to match the correct phase you are assigned e.g. A/C or A/D

intensity_data = pd.DataFrame(index=[os.path.splitext(os.path.basename(text))[0] for text in part_3_data_sorted[:-1]], columns=default_peaks_list)
intensity_data['weight_frac_unknown'] = [100, 50, 0]

intensity_data['weight_frac_CaF2'] = 100 - intensity_data['weight_frac_unknown']
intensity_data

Unnamed: 0,28.3,47.0,55.75,36.74,42.68,61.94,weight_frac_unknown,weight_frac_CaF2
D 100%,,,,,,,100,0
D 50%,,,,,,,50,50
CaF2 100%,,,,,,,0,100


### Populate data structure with values from each XRD pattern

The following code block populates the empty dataframe above with the peak intensities at the six 2-theta positions specified (three for CaF2 and three for the unknown phase).

For each csv file:
- The XRD spectrum is plotted
- Peak fitting is conducted
- If a peak is found close to a default 2-theta position defined in the dataframe above, the corresponding intensity is keyed into the dataframe
- If no peak is found, the intensity at the data point closest to the default 2-theta position is keyed in instead

Contrast this with doing it manually by opening each file yourself.
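
The lookup rule in the steps above can be sketched in isolation on a handful of made-up points. `intensity_near` is a hypothetical helper written for this illustration; the 0.1 degree tolerance mirrors the value used in the cell below.

```python
import numpy as np

def intensity_near(x, y, peak_indices, target, atol=0.1):
    """Height of a fitted peak within `atol` of `target`, else the nearest data point."""
    close = np.isclose(x[peak_indices], target, atol=atol, rtol=0)
    if close.any():
        return y[peak_indices[close][0]]
    return y[np.abs(x - target).argmin()]

# Made-up data with one fitted peak at index 2 (2-theta = 28.32)
x_demo = np.array([28.28, 28.30, 28.32, 28.34, 36.72, 36.74, 36.76])
y_demo = np.array([900.0, 1150.0, 1200.0, 950.0, 40.0, 55.0, 45.0])
peaks_demo = np.array([2])

print(intensity_near(x_demo, y_demo, peaks_demo, 28.3))    # 1200.0, from the fitted peak
print(intensity_near(x_demo, y_demo, peaks_demo, 36.74))   # 55.0, from the nearest point
```

Preferring the fitted peak means a small shift in peak position between files does not lead to the intensity being read off the flank of the peak.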

#### Demonstrator notes

Students are expected to fill in the lines of code necessary for background subtraction from 'method 2'


    half_window = 20
    baseline_fitter = Baseline(x_data = x)
    baseline = baseline_fitter.mor(y, half_window=half_window)[0]
    y_subtracted = y - baseline
    y_subtracted[y_subtracted<0]=0

#### Hint

Ensure that you have performed the correct background subtraction.

In [62]:
for file in part_3_data_sorted[:-1]:

    current_index = str(get_filename_without_extension(file))

    print(f'Currently on {current_index}')

    data = pd.read_csv(file, usecols=[1,2])

    x = data['2 theta'].to_numpy()
    y = data['Intensity'].to_numpy()

    # Background subtraction (Method 2 from Part 2)
    half_window = 20
    baseline_fitter = Baseline(x_data = x)
    baseline = baseline_fitter.mor(y, half_window=half_window)[0]
    y_subtracted = y - baseline
    y_subtracted[y_subtracted<0]=0
    y = y_subtracted

    # Find peaks in this file (do not reuse `peaks` from the pure-phase cell above)
    peaks_ind, _ = find_peaks(y, distance=30, height=350, prominence=10)

    plt.figure()
    plt.title(f'{current_index}')
    plt.plot(x, y_subtracted)
    plt.plot(x[peaks_ind], y_subtracted[peaks_ind], 'x', label='Peaks', color='red')

    for default_peak in default_peaks_list:
        # Fallback: intensity at the data point closest to the default 2-theta position
        nearest = np.abs(x - default_peak).argmin()
        intensity_data.loc[current_index, default_peak] = y[nearest]

        # If a fitted peak lies within 0.1 degrees of the default position, use its height instead
        close = np.isclose(default_peak, x[peaks_ind], atol=0.1)
        if close.any():
            intensity_data.loc[current_index, default_peak] = y[peaks_ind[close][0]]

Currently on D 100%
Currently on D 50%
Currently on CaF2 100%


## Check that the table is completely filled

The data structure is now complete, but check that the values are sensible: the peak heights at A's 2-theta positions should decrease as less of A is present, and vice versa for CaF2.

If they are not, now is a good time to check if you have correctly edited `part_3_data_sorted` above.

In [63]:
intensity_data

Unnamed: 0,28.3,47.0,55.75,36.74,42.68,61.94,weight_frac_unknown,weight_frac_CaF2
D 100%,64.961474,123.901334,84.971489,3103.691618,4132.417375,2151.87048,100,0
D 50%,1215.359178,1224.177384,422.216619,1615.960726,2091.605862,1129.219617,50,50
CaF2 100%,2276.661,2344.43256,737.904756,147.984238,69.959026,111.992318,0,100


## Identify the equation of the line for each of the three peaks identified for the unknown phase

With the populated dataframe, we can now plot the calibration curves using the known weight fractions of CaF2 and the unknown phase.

$\color{red}{\text{IMPORTANT}}$

If you do not obtain a straight line, check the following
- `part_3_data_sorted` contains the files in the correct order specified
- background subtraction is performed consistently using the same method across all the given files
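
One way to put a number on "straight line" is the coefficient of determination, R². The sketch below fits the three calibration points from the 36.74 column of the filled table above with `np.polyfit` and computes R² by hand; using R² as the check is a suggestion, not part of the assessed method.

```python
import numpy as np

# Values copied from the 36.74 column of the filled table above
weight_frac_CaF2 = np.array([0.0, 50.0, 100.0])
intensity = np.array([3103.691618, 1615.960726, 147.984238])

# Straight-line fit: intensity = slope * weight_frac_CaF2 + intercept
slope, intercept = np.polyfit(weight_frac_CaF2, intensity, deg=1)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
predicted = slope * weight_frac_CaF2 + intercept
ss_res = np.sum((intensity - predicted) ** 2)
ss_tot = np.sum((intensity - intensity.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"I = {slope:.2f} * w + {intercept:.0f}, R^2 = {r_squared:.5f}")
```

An R² very close to 1 indicates that the three points lie on a line; a noticeably lower value usually points to one of the two problems listed above.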

In [64]:
intensity_data.plot(x='weight_frac_CaF2', y=28.3, kind='scatter')

<Axes: xlabel='weight_frac_CaF2', ylabel='28.3'>

In [65]:
# Peaks to process
unknown_peaks = [unknown_first_peak_loc, unknown_sec_peak_loc, unknown_third_peak_loc]

for intensity in unknown_peaks:

    intensity_data.plot(x='weight_frac_CaF2', y=intensity, kind='scatter')

    # Fit a linear regression line
    coefficients = np.polyfit(x=intensity_data['weight_frac_CaF2'], y=pd.to_numeric(intensity_data[intensity]), deg=1)

    line = np.poly1d(coefficients)
    print(f'{intensity} equation of line: {line}')

    # Plot the linear regression line
    plt.plot(intensity_data['weight_frac_CaF2'], line(intensity_data['weight_frac_CaF2']), color='red')
    plt.title(f'{intensity}')
    # # Plot the equation of the line
    plt.text(0.95, 0.05, f'{line}', color='red',
             ha='right', va='bottom', transform=plt.gca().transAxes)

    plt.show()


36.74 equation of line:  
-29.56 x + 3100
42.68 equation of line:  
-40.62 x + 4129
61.94 equation of line:  
-20.4 x + 2151


In [133]:
# # internal use
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q7: Key in the equation of the lines for all 3 peaks for B, give gradient, intercept')

# for ind, intensity in enumerate(unknown_peaks):
#     coefficients = np.polyfit(x=intensity_data['weight_frac_CaF2'], y=pd.to_numeric(intensity_data[intensity]), deg=1)
#     line = np.poly1d(coefficients,2)
#     write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q7: peak {ind} m,c: {coefficients}')

## Identify background-subtracted peak intensities corresponding to the smallest 2-theta peak positions of unknown phase A in the unknown mixture

In [66]:
unknown_mixture =  pd.read_csv(part_3_data_sorted[-1], usecols=[1,2])

plt.close()
plt.figure()
plt.plot(unknown_mixture['2 theta'], unknown_mixture['Intensity'])

[<matplotlib.lines.Line2D at 0x2573a63c3a0>]

In [67]:
x = unknown_mixture['2 theta'].to_numpy()
y = unknown_mixture['Intensity'].to_numpy()

half_window = 20
baseline_fitter = Baseline(x_data = x)
baseline = baseline_fitter.mor(y, half_window=half_window)[0]
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

plt.figure()
plt.plot(x,y_subtracted)

[<matplotlib.lines.Line2D at 0x2573a949070>]

In [68]:
# Ensure that this variable holds the correct values (3 most intense peaks for B/C/D)
unknown_peaks = [unknown_first_peak_loc, unknown_sec_peak_loc, unknown_third_peak_loc]

# Plot the data
plt.close()
plt.figure()
plt.plot(x, y_subtracted, label='y_subtracted')

# Add visual indicators for peaks
for peak in unknown_peaks:
    # Find the index of the closest value to the peak in x
    index = np.abs(x - peak).argmin()
    # Get the corresponding x and y values
    x_value = x[index]
    y_value = y_subtracted[index]
    # Add a vertical line
    plt.axvline(x=x_value, color='red', linestyle='--', label=f'Peak near {peak}')
    # Add text annotation for the y value
    plt.text(x_value, y_value, f'{y_value:.2f}', color='blue', ha='center', va='bottom')

# Customize the plot
plt.xlabel('2-theta')
plt.ylabel('Intensity (background subtracted)')
plt.legend()
plt.show()

# Identify the weight fraction of the unknown phase A in the mystery mixture

## Hint

You can use np.mean

$\color{red}{\text{IMPORTANT}}$

Make sure you use the correct specified peak and its corresponding equation of the line or you will not arrive at the correct answer.

Be clear on what is the 'X' and 'Y' in the equation of the line.
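
In the fitted lines, X is the weight percent of CaF2 and Y is the peak intensity, so each line has to be inverted to recover a weight percent from a measured intensity. The sketch below shows the arithmetic using the gradients and intercepts printed earlier; the three "measured" intensities are hypothetical values chosen for illustration, not the values from any mystery file.

```python
# (gradient m, intercept c) for each peak, copied from the fitted lines printed earlier
lines = {36.74: (-29.56, 3100.0), 42.68: (-40.62, 4129.0), 61.94: (-20.4, 2151.0)}

# Hypothetical background-subtracted intensities, for illustration only
measured = {36.74: 1000.0, 42.68: 1300.0, 61.94: 700.0}

estimates = []
for peak, (m, c) in lines.items():
    w_CaF2 = (measured[peak] - c) / m   # invert I = m * w + c for w
    estimates.append(100 - w_CaF2)      # the line was fitted against CaF2 wt%

mean_unknown = sum(estimates) / len(estimates)
print(f"unknown phase: {mean_unknown:.1f} wt%")
```

Averaging the three estimates reduces the influence of any single peak; a large spread between them is a sign that a peak position or intensity has been misread.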

In [69]:
# internal use
coefficients_list = []

for ind, intensity in enumerate(unknown_peaks):
    coefficients = np.polyfit(x=intensity_data['weight_frac_CaF2'], y=pd.to_numeric(intensity_data[intensity]), deg=1)
    coefficients_list.append(coefficients)

y_list = []

for peak in unknown_peaks:
    # Find the index of the closest value to the peak in x
    index = np.abs(x - peak).argmin()
    # Get the corresponding x and y values
    x_value = x[index]
    y_value = y_subtracted[index]
    y_list.append(y_value)

x_list = []

for i in range(len(y_list)):   
    x_value = round((y_list[i]-coefficients_list[i][1])/coefficients_list[i][0])
    x_list.append(x_value)

100 - np.mean(x_list)

29.0

In [138]:
# # internal use
# coefficients_list = []

# for ind, intensity in enumerate(unknown_peaks):
#     coefficients = np.polyfit(x=intensity_data['weight_frac_CaF2'], y=pd.to_numeric(intensity_data[intensity]), deg=1)
#     coefficients_list.append(coefficients)

# y_list = []

# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q8: key in the intensities of unknown peaks that have been background subtracted')

# for peak in unknown_peaks:
#     # Find the index of the closest value to the peak in x
#     index = np.abs(x - peak).argmin()
#     # Get the corresponding x and y values
#     x_value = x[index]
#     y_value = y_subtracted[index]
#     y_list.append(y_value)
#     write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q8: {y_value}')

# x_list = []

# for i in range(len(y_list)):   
#     x = round((y_list[i]-coefficients_list[i][1])/coefficients_list[i][0])
#     x_list.append(x)

# np.mean(x_list)
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q9: Use answers from Q7/Q8,  key into equation of lines to get the averaged weight fraction round of B to 2 sf.')
# write_to_file(answer_dir + f"{unknown_phase}.txt", f'Q9: {100 - np.mean(x_list)} +- 3')