# X-ray Diffraction Practical Class - Data Analysis (Student's copy)

## Class Introduction

This [Jupyter](http://jupyter.org) notebook accompanies the X-ray Diffraction (XRD) practical class. It will help you complete Part 2 and Part 3 of the assessment.

## Learning Objectives - Whole Course
By the end of the two afternoon practical sessions you will:
- Have a basic understanding of the practical design of a diffractometer and be able to identify key components from both kinds
- Be familiar with the concept of X-ray radiation in the context of powder diffraction and understand why X-rays of different energies are used to analyse samples with different compositions
- Have applied Bragg's law to convert between diffraction angle (&deg;2&theta;) and d-spacing (&#8491;)
- Have experience preparing a powder sample for measurement
- Appreciate the key components of a standard powder diffraction measurement and of analysing the results
- Have a rudimentary knowledge of phase identification and have applied it to diffraction patterns of
    - A pure sample
    - A mixed sample
- Be able to interpret the diffraction patterns from mixtures of known composition to build a calibration curve, and use it to determine the composition of an unknown mixture from its pattern
- Have developed an appreciation of the use of Python for data analysis

# READ THIS

Please read through the text accompanying the code cells carefully. Failure to do so will result in an inability to pass the quiz.

## Jupyter Notebook Introduction
[Jupyter notebook](http://www.jupyter.org) is an open-source web application that can be hosted locally and lets you create rich project documents containing documentation, images, live code and results. Projects are easily shared, and because we are using the Jupyter interface for [Python](http://www.python.org)-based projects the system is cross-platform: it runs on Windows, macOS and Linux. It is increasingly used to report scientific data in peer-reviewed journals.[1]

The notebook is built up of cells. Each cell is either <code>Markdown</code> or <code>source code</code>.

*This notebook is yours so feel free to edit the cells and expand or alter the examples to help you learn and understand the process. Remember to make a backup copy first though.*

[1]:https://doi.org/10.1103/PhysRevLett.116.061102

# Load imports

The `os` module provides functions for interacting with the operating system. You can use it to handle file paths, read or write files, and manage directories.

The `pandas` library is essential for data manipulation and analysis. It provides powerful data structures like DataFrames, which allow for easy handling of tabular data.

`numpy` is a fundamental package for numerical computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. 


`matplotlib` is a plotting library for creating static, animated, and interactive visualizations in Python. The pyplot module provides a MATLAB-like interface for creating plots and graphs.

`scipy` is a library used for scientific and technical computing. The find_peaks function from scipy.signal is used to identify peaks (local maxima) in data, which is particularly useful for signal processing.

`pybaselines` is a library for baseline correction, which is a common preprocessing step in signal processing.


$\color{red}{\text{IMPORTANT}}$

Ask a demonstrator for assistance if the imports fail to load correctly. Alternatively, you may try running the notebook on your own personal laptop.
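Before running the import cell, it can help to see which packages are actually installed. The sketch below is an optional check and not part of the original practical. The helper name `check_packages` is hypothetical, and it only reports whether each package can be located, without importing it.

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Packages used in this notebook. pybaselines may instead live in the local
# "pybaselines-main" folder, which this simple check does not look at.
required = ["pandas", "numpy", "matplotlib", "scipy", "pybaselines"]
for name, found in check_packages(required).items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, that is the one to mention to a demonstrator.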

In [4]:
%matplotlib qt5

import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.signal import find_peaks

# Make the bundled copy of pybaselines importable
sys.path.insert(0, os.path.expanduser(os.path.join(".", "pybaselines-main")))

from pybaselines import Baseline
from pybaselines import polynomial

# Part 2 : Identification of pure unknown phase X

## Overview

You are given an XRD spectrum of an unknown phase X.

The broad objectives are to:

- Identify a suitable background subtraction method for the XRD spectrum
- Identify the 2-theta positions of the 3 strongest peaks in the XRD spectrum
- Calculate the d-spacings corresponding to the peaks you have found
- Cross-reference the 2-theta peak positions against the reference materials database 'PDF Database Index.csv' and hence identify the unknown material

Use the following code to assist you in answering the questions in Part 2 of the Blackboard quiz.


## Load data

If the file does not load, make sure that your working directory (`pwd`) is the `xrd_practical` folder that you have just unzipped.

Check by running the `pwd` cell below.

If you are not in the `xrd_practical` folder, use the command `cd` to change your directory.
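The same check can be done from Python. The following is a small sketch that assumes the data files sit in a `data` folder directly inside the working directory, as the loading cell below expects. The path in the commented-out `os.chdir` line is a placeholder.

```python
import os

cwd = os.getcwd()                     # same information as `pwd`
data_dir = os.path.join(cwd, "data")  # folder the loading cell reads from

print("Current working directory:", cwd)
print("data folder found:", os.path.isdir(data_dir))

# If the data folder is not found, change directory, for example:
# os.chdir("/path/to/xrd_practical")  # placeholder path, edit to suit
```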

In [None]:
pwd


In [5]:
relative_dir = 'data'
filename = 'part_2_unknown.csv'
# Build the path from separate parts so that it works on Windows, macOS and Linux
full_path = os.path.join(os.getcwd(), relative_dir, filename)

In [None]:
data = pd.read_csv(full_path, usecols=[1,2])

y = data['Intensity'].to_numpy()
x = data['2 theta'].to_numpy()

data.plot(x='2 theta', y='Intensity')
plt.title('Pure X')

Text(0.5, 1.0, 'Pure X')

## Identify the best background subtraction method

An anonymous scientist S.H. has collected this XRD spectrum on a glass substrate. The glass substrate has contributed a large background hump to the spectrum that is detrimental to analysis.

There are three different methods for background subtraction - only one of them will provide the right answer for quantitative analysis later on.

$\color{red}{\text{IMPORTANT}}$

The correct method should be fairly obvious. If you do not get this right, $\textbf{the error will propagate through the whole script}$ and your answers to the quiz will be wrong. Please check with a demonstrator if you are unsure.

Once you have settled on a particular method, $\textbf{rerun ALL cells under that method header}$ again to ensure that the variables are storing the correctly subtracted intensity values.
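To see what a baseline fit is trying to achieve, here is a minimal sketch on made-up data. It uses a simplified iterative polynomial fit with clipping, written with plain NumPy; it is not the `pybaselines` implementation and is not one of the three methods below. All names ending in `_demo` and the function `iterative_poly_baseline` are hypothetical.

```python
import numpy as np

def iterative_poly_baseline(x, y, poly_order=2, n_iter=50):
    """Estimate a baseline by repeated polynomial fitting with clipping."""
    y_work = y.astype(float).copy()
    for _ in range(n_iter):
        fit = np.polynomial.Polynomial.fit(x, y_work, poly_order)(x)
        # Clip points above the fit so that peaks stop pulling the baseline up
        y_work = np.minimum(y_work, fit)
    return fit

# Synthetic pattern (made-up numbers): smooth background plus two sharp peaks
x_demo = np.linspace(10, 90, 801)
background_demo = 500 - 0.1 * (x_demo - 35) ** 2
peaks_demo = (1000 * np.exp(-((x_demo - 40) / 0.2) ** 2)
              + 600 * np.exp(-((x_demo - 60) / 0.2) ** 2))
y_demo = background_demo + peaks_demo

baseline_demo = iterative_poly_baseline(x_demo, y_demo)
y_sub_demo = np.clip(y_demo - baseline_demo, 0, None)

print("largest baseline error:", np.abs(baseline_demo - background_demo).max())
print("strongest peak after subtraction:", y_sub_demo.max())
```

A good baseline follows the smooth background closely and leaves the peak heights essentially unchanged. That is the behaviour to look for when comparing the three methods.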

### Try different background subtraction methods

#### Method 1

In [None]:
poly_order = 6
baseline, params = polynomial.imodpoly(data=y,x_data=x, poly_order=poly_order)
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

In [2]:

fig, ax = plt.subplots(tight_layout={'pad': 0.2})
data_handle = ax.plot(y)
baseline_handle = ax.plot(baseline, '--')
ax.legend(
    (data_handle[0], baseline_handle[0]),
    ('data','fit baseline'), frameon=False
)
plt.show()


In [17]:

plt.figure()
plt.title('Post-background subtraction')
plt.plot(x, y_subtracted, alpha=0.5, label='background subtracted')
plt.plot(x, y, alpha=0.5, label='original')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


#### Method 2

In [7]:
baseline_fitter = Baseline(x_data=x)
half_window = 20
baseline = baseline_fitter.mor(y, half_window=half_window)[0]
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

In [11]:

plt.figure()
plt.plot(y, label='data')
plt.plot(baseline, label='baseline')
plt.legend()

<matplotlib.legend.Legend at 0x275de30b700>

In [10]:

plt.figure()
plt.title('Post-background subtraction')
plt.plot(x, y_subtracted, alpha=0.5, label='background subtracted')
plt.plot(x, y, alpha=0.5, label='original')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


#### Method 3

In [12]:
baseline_fitter = Baseline(x_data=x)
baseline = baseline_fitter.beads( y, lam_0=0.00006, lam_1=0.00008, lam_2=0.05, fit_parabola=False, tol=1e-3, freq_cutoff=0.04, asymmetry=3)[0]
y_subtracted = y - baseline
y_subtracted[y_subtracted<0]=0

In [13]:
plt.figure()
plt.plot(y, label='data')
plt.plot(baseline, label='baseline')
plt.legend()

<matplotlib.legend.Legend at 0x275de62df40>

In [14]:
plt.figure()
plt.title('Post-background subtraction')
plt.plot(x, y_subtracted, alpha=0.5, label='background subtracted')
plt.plot(x, y, alpha=0.5, label='original')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


## Identify the 2-theta peak positions corresponding to the three largest d-spacings using scipy find_peaks function

Ensure that you have re-run the cells corresponding to the background subtraction method you have chosen.

Use the plotted graph to check that the first three prominent peaks are correctly found, so that they can be compared with the database.

Hint - the parameters in `find_peaks` have to be tuned until the red 'x' markers identify only the real peaks in the spectrum and not the background.

Have a look at the documentation to see what `distance`, `height` and `prominence` do.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
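The effect of these parameters is easiest to see on a signal where the answer is known. The sketch below builds a made-up pattern with three peaks and a small ripple standing in for background noise, then compares an unconstrained call with a constrained one. The threshold values are illustrative only and will not suit your data.

```python
import numpy as np
from scipy.signal import find_peaks

x_demo = np.linspace(10, 90, 801)

def gaussian_peak(x, centre, height, width=0.3):
    """Simple Gaussian-shaped peak used to build the synthetic pattern."""
    return height * np.exp(-((x - centre) / width) ** 2)

# Three real peaks plus a small ripple standing in for background noise
pattern_demo = (gaussian_peak(x_demo, 30, 1000)
                + gaussian_peak(x_demo, 50, 400)
                + gaussian_peak(x_demo, 70, 150)
                + 5 * np.sin(20 * x_demo))

# Without constraints every small ripple maximum counts as a peak
loose_peaks, _ = find_peaks(pattern_demo)
# Minimum height, spacing (in samples) and prominence remove the ripple
strict_peaks, _ = find_peaks(pattern_demo, distance=10, height=50, prominence=50)

print("no constraints:", len(loose_peaks), "peaks")
print("with constraints:", np.round(x_demo[strict_peaks], 1))
```

Note that `distance` is counted in data points, not in degrees, while `height` and `prominence` are in intensity units.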


In [None]:
peaks, _ = find_peaks(y_subtracted, distance=10, height=10, prominence=10)  
# Adjust distance/height as needed to only include peaks and not background
# Play with other parameters if necessary

# Plot original data and highlight peaks
plt.figure()
plt.plot(x, y_subtracted, label='XRD Data')
plt.plot(x[peaks], y_subtracted[peaks], 'x', label='Peaks', color='red')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


In [9]:
for i in range(3):
    print(f'peak {i} :' + f'{round(x[peaks][i],2)}')

peak 0 :10.08
peak 1 :10.33
peak 2 :10.65


## Calculate d-spacing for the three peaks identified

### Bragg's Law

Bragg's law is the simple numerical relationship between the angle (&theta;) at which diffracted intensity is observed, the wavelength of the radiation source (&lambda;) and the lattice spacing of the material being probed (d).
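Written out, with $n$ the order of diffraction and $\theta$ the angle between the incident beam and the diffracting planes, Bragg's law and its rearrangement for the lattice spacing are:

$$
n\lambda = 2d\sin\theta \quad\Longrightarrow\quad d = \frac{n\lambda}{2\sin\theta}
$$

The d-spacing comes out in the same length unit as the wavelength $\lambda$.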


### Hints


X-ray diffraction data is acquired over a range of 2&theta; angles in degrees using Cu-K-alpha radiation. The upper and lower limits of the diffraction pattern govern the minimum and maximum lattice spacings observable within the pattern.

To calculate either the minimum or maximum lattice spacing (d-spacing) that can be measured in any crystal, we first need to identify which limit is of interest and then simply rearrange Bragg's law to give the answer as a d-spacing. Remember that it is a reciprocal relationship: a larger 2&theta; corresponds to a smaller d-spacing.

`np.radians` is required to convert our data, which is measured in degrees, into radians, the format that `np.sin` expects.
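A quick check shows why the conversion matters. The angle of 30 degrees is an arbitrary illustrative value.

```python
import numpy as np

angle_deg = 30.0
print(np.sin(angle_deg))              # wrong: 30 is interpreted as radians
print(np.sin(np.radians(angle_deg)))  # sine of 30 degrees
```

Only the second line gives the familiar value of one half.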


$\color{red}{\text{IMPORTANT}}$
    
There is an error in the Bragg's law function below. You must correct it or it will give the wrong answer.

Check the units.


In [None]:
def calculate_d_spacing(n, wavelength, theta):
    """
    Calculate the d-spacing using Bragg's law.

    Parameters:
        n (int): Order of diffraction
        wavelength (float): Wavelength of X-rays (in nm)
        theta (float): Angle of incidence (in degrees)

    Returns:
        float: d-spacing (nm)
    """
    # Convert theta from degrees to radians
    theta = np.radians(theta)
    
    # Calculate d-spacing using Bragg's law
    d_spacing = (n * wavelength) / (2 * np.sin(theta))
    
    return d_spacing

n = 1
wavelength = 0.15418 # corresponding to Cu k_alpha 

peak_index = 2 # change this to match the peak index
theta = x[peaks][peak_index]

d_spacing = calculate_d_spacing(n, wavelength, theta)

print(f"The d-spacing for peak {peak_index}: {round(d_spacing,3)} nm")


The d-spacing for peak 2: 0.417 nm


## Compare found 2-theta peak positions to a reference material database

The powder diffraction pattern is a “fingerprint” of the material. We can use a database of peak positions and relative intensities to find matches to your pattern; the matching entries are often referred to as “phases”. However, diffraction patterns are not always unique, so you must use chemical sense and select phases/material patterns that could plausibly be present in your sample. Note that intensities can vary with preferred orientation and sample quality. These can differ from measurement to measurement, so sample rotation and good preparation are normally employed to help mitigate these factors.

Does it match? If so, on how many peaks? Is the match on peak position only, or on intensity as well?
Are there residual peaks left over? Could there be more than one phase?

Are there parameters you can change to reduce the number of matches? Keep it simple, the fewer phases the better.


### Load database

In [None]:
database_loc = 'PDF Database Index.csv'

database = pd.read_csv(database_loc,keep_default_na=False)

for col in database.columns:
    # Coerce column values to numbers whenever possible; text columns are left unchanged
    try:
        database[col] = pd.to_numeric(database[col], downcast='integer')
    except (ValueError, TypeError):
        pass

database.head() # This just shows a preview of the database that has just been loaded



| | Reference Code | PDF FileName | Powder Pattern | Name | No. | h | k | l | d | 2Theta[deg] | I [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00-004-0673 | 00-004-0673.pdf | 00-004-0673.RD | Tin | 1.0 | 2.0 | 0.0 | 0.0 | 2.915 | 30.645 | 100.0 |
| 1 | 00-004-0674 | 00-004-0673.pdf | 00-004-0673.RD | Tin | 2.0 | 1.0 | 0.0 | 1.0 | 2.793 | 32.019 | 90.0 |
| 2 | 00-004-0675 | 00-004-0673.pdf | 00-004-0673.RD | Tin | 3.0 | 2.0 | 2.0 | 0.0 | 2.062 | 43.872 | 34.0 |
| 3 | | | | | | | | | | | |
| 4 | 00-004-0783 | 00-004-0783.pdf | 00-004-0783.RD | Silver | 1.0 | 1.0 | 1.0 | 1.0 | 2.359 | 38.117 | 100.0 |


### Match the three found peaks to the database

$\color{red}{\text{IMPORTANT}}$


You need to input the correct `match_tolerance` level for the match to occur.

E.g. if your experimentally found 2-theta peak position is 27.4 deg and the PDF database has a 2-theta peak position of 27.9 deg, you need a `match_tolerance` slightly greater than 0.5, because the code below uses a strict "less than" comparison.

However, increasing `match_tolerance` too much can result in many false positive matches with the database.

Identify the corresponding PDF reference that matches the correct phase.
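The trade-off can be seen on a tiny table. The sketch below uses four rows copied from the database preview above and a hypothetical measured peak position. The helper `names_within` is not part of the notebook and is only one possible way of writing the comparison.

```python
import pandas as pd

# Four rows copied from the database preview above
reference_demo = pd.DataFrame({
    "Name": ["Tin", "Tin", "Tin", "Silver"],
    "2Theta[deg]": [30.645, 32.019, 43.872, 38.117],
})

def names_within(table, target, tolerance):
    """Names of reference peaks lying within `tolerance` degrees of `target`."""
    close = (table["2Theta[deg]"] - target).abs() < tolerance
    return sorted(set(table.loc[close, "Name"]))

measured_demo = 38.0  # hypothetical measured peak position in degrees 2-theta
for tol in (0.2, 10.0):
    print(tol, names_within(reference_demo, measured_demo, tol))
```

With the tight tolerance only one material matches; with the loose tolerance every material in the table matches, which tells you nothing.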

#### Matching the first peak

In [12]:
target_peak = x[peaks][0] # this is finding matches for the first identified peak

match_tolerance = 100 # change as necessary
match_1 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_peak) < match_tolerance)]
match_1

0                   Tin
1                   Tin
2                   Tin
4                Silver
5                Silver
             ...       
149    Titanium Nitride
150    Titanium Nitride
152     Magnesium Oxide
153     Magnesium Oxide
154     Magnesium Oxide
Name: Name, Length: 117, dtype: object

#### Matching the second peak

In [13]:
target_value = x[peaks][1] # this is finding matches for the second identified peak
match_tolerance = 100 # change as necessary
match_2 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_value) < match_tolerance)]
match_2

0                   Tin
1                   Tin
2                   Tin
4                Silver
5                Silver
             ...       
149    Titanium Nitride
150    Titanium Nitride
152     Magnesium Oxide
153     Magnesium Oxide
154     Magnesium Oxide
Name: Name, Length: 117, dtype: object

#### Matching the third peak

In [14]:
target_value = x[peaks][2] # this is finding matches for the third identified peak
match_tolerance = 100 # change as necessary
match_3 = database['Name'][database['2Theta[deg]'].apply(lambda x: abs(x - target_value) < match_tolerance)]
match_3

0                   Tin
1                   Tin
2                   Tin
4                Silver
5                Silver
             ...       
149    Titanium Nitride
150    Titanium Nitride
152     Magnesium Oxide
153     Magnesium Oxide
154     Magnesium Oxide
Name: Name, Length: 117, dtype: object

#### Finding the common intersection of matches amongst the three peaks using set operations

Ideally, there should only be one final answer.

The code below only takes the intersection of the matches for the smallest and 2nd smallest 2-theta peak positions. You can try other intersections, such as `match_2` and `match_3`. With the correct `match_tolerance`, a match between two out of the three identified 2-theta peak positions should be sufficient to arrive at a single answer.

In [15]:
intersecting_set = set(match_1).intersection(match_2)
intersecting_set = list(intersecting_set)
# use different matches, e.g. 1,3 or 2,3
intersecting_set

['Titanium Oxide',
 'Copper Sulfate Hydrate',
 'Aluminum',
 'Isotactic Polypropylene',
 'Silver',
 'Carbon',
 'Sucrose',
 'Iron Oxide',
 'Copper Chloride Hydrate',
 'Copper',
 'Platinum',
 'Chromium',
 'Zirconium',
 'Potassium Bromide',
 'Barium Titanium Oxide',
 'Titanium Nitride',
 'Tin',
 'Calcium Fluoride',
 'Cobalt Iron',
 'Zinc Sulfide',
 'Uranium',
 'Gold',
 'Polypropylene',
 'Sodium Chloride',
 'Magnesium Oxide',
 'Copper Oxide']

Once you have identified the compound, cross-reference 'PDF Database Index.csv' to find the corresponding Reference Code.

Optional: Try to obtain the reference code by selecting the appropriate rows in `database` and reading the `Reference Code` column.

Refer to https://pandas.pydata.org/docs/user_guide/indexing.html

Otherwise, just look it up in the spreadsheet.

In [16]:
database.head()

| | Reference Code | PDF FileName | Powder Pattern | Name | No. | h | k | l | d | 2Theta[deg] | I [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00-004-0673 | 00-004-0673.pdf | 00-004-0673.RD | Tin | 1.0 | 2.0 | 0.0 | 0.0 | 2.915 | 30.645 | 100.0 |
| 1 | 00-004-0674 | 00-004-0673.pdf | 00-004-0673.RD | Tin | 2.0 | 1.0 | 0.0 | 1.0 | 2.793 | 32.019 | 90.0 |
| 2 | 00-004-0675 | 00-004-0673.pdf | 00-004-0673.RD | Tin | 3.0 | 2.0 | 2.0 | 0.0 | 2.062 | 43.872 | 34.0 |
| 3 | | | | | | | | | | | |
| 4 | 00-004-0783 | 00-004-0783.pdf | 00-004-0783.RD | Silver | 1.0 | 1.0 | 1.0 | 1.0 | 2.359 | 38.117 | 100.0 |


# Part 3 Identifying unknown weight fractions of a known mixture of two unknown phases

The diffraction pattern produced is a function of the material being measured, both crystalline and amorphous. Once the phases, and therefore the peaks, of a pattern have been determined (qualitative analysis), it is possible to determine the quantity of each phase within the sample (quantitative analysis). This is possible because the diffracted peak area, and to a first approximation the recorded peak intensity, is a function of the X-ray source and the sample.


The area under a diffracted peak scales with the amount of the corresponding phase in the sample, so quantitative information can be gained by comparing peaks from different phases. The area under a peak is more difficult to calculate, so to a first approximation the recorded peak intensity (height) serves as a reasonable alternative, allowing the relative peak heights of different phases to be compared.
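The link between peak height and peak area can be checked numerically. The sketch below uses two made-up peaks of identical shape, one half the height of the other, and a trapezoidal-rule area written out by hand. The helper `peak_area` is hypothetical.

```python
import numpy as np

x_demo = np.linspace(27, 30, 301)

def peak_area(x, y):
    """Trapezoidal-rule area under y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

# Two made-up peaks with the same shape, one half the height of the other
full_demo = 2000 * np.exp(-((x_demo - 28.3) / 0.1) ** 2)
half_demo = 0.5 * full_demo

print("height ratio:", half_demo.max() / full_demo.max())
print("area ratio:  ", peak_area(x_demo, half_demo) / peak_area(x_demo, full_demo))
```

The two ratios agree only because the peak shape and width are the same in both cases. If the peak width changes between samples, height is no longer a reliable stand-in for area.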

There are several methods for quantitative analysis. They all rely on correlating some known parameter with a parameter recorded from the diffraction pattern itself, such as lattice parameter vs concentration or intensity vs concentration.

1. External Standard Method
2. Direct Comparison Method
3. Internal Standard Method

For the purposes of this course we use a version of method 3, in which we mix one crystalline material with a second and plot the intensity against the calculated weight percent for known components.

## Overview

Each student is given, at random, a mixture of CaF2 and an unknown phase A/B/C/D/E. As such, do not blindly copy your coursemates' answers.

For illustration purposes, we will use CaF2-A for the following discussion.

Three known weight fractions are given: pure CaF2, pure A, and a 50-50 mixture of CaF2:A. There is one unknown weight fraction, 'A Mystery%.csv', and your task is to determine the weight percentage of CaF2/A present in this file.

To accomplish this, the task has to be broken down further:

1. Perform background subtraction as necessary.
2. Identify the 2-theta peak positions of the three strongest peaks for unknown material A and CaF2.
3. Identify material A from 'PDF Database Index.csv' (refer to Part 2).
4. Construct a calibration chart using the known weight fractions (Pure CaF2/Pure A/50-50 CaF2-A) and the peak intensities of the three strongest peaks from CaF2 and A.
5. Identify the background-subtracted peak intensities of the three strongest peaks for unknown material A in 'A Mystery%.csv'.
6. Use the calibration chart and the peak intensities to determine the unknown weight fraction of CaF2/A for 'A Mystery%.csv'.

## Load XRD spectra for pure unknown phase

Hint:

Load the appropriate data according to the unknown phase you are given on the quiz

In [None]:
file_path = os.path.join('data', 'part_3_data', 'CaF2 - Z', 'Z 100%.csv') # Edit this to match the unknown phase you are given on the quiz

data = pd.read_csv(file_path, usecols=[1,2])
data.plot(x='2 theta', y='Intensity')

<Axes: xlabel='2 theta'>

In [248]:
data.head()

| | 2 theta | Intensity |
|---|---|---|
| 0 | 10.0001 | 1105.908119 |
| 1 | 10.020451 | 1079.959928 |
| 2 | 10.040802 | 1092.980528 |
| 3 | 10.061153 | 1098.039187 |
| 4 | 10.081504 | 1105.047293 |


## Perform background subtraction


In [249]:
baseline_fitter = Baseline(x_data=data['2 theta'])
half_window = 20
baseline = baseline_fitter.mor(data['Intensity'], half_window=half_window)[0]
y_subtracted = data['Intensity'] - baseline
y_subtracted[y_subtracted<0]=0
peaks, _ = find_peaks(y_subtracted, distance=30, height=350, prominence=100)  

In [251]:
# Plot original data and highlight peaks
plt.figure()
plt.plot(data['2 theta'], y_subtracted, label='XRD Data')
plt.plot(data['2 theta'][peaks], y_subtracted[peaks], 'x', label='Peaks', color='red')
plt.legend()
plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.show()


## Identify 2-theta peak positions for pure unknown phase A

Fill in the 2-theta peak positions to the nearest 0.1 degree precision in `unknown_first_peak_loc`, `unknown_sec_peak_loc`, `unknown_third_peak_loc`

In [252]:
x = data['2 theta'].to_numpy()
for i in range(3):
    print(f'peak {i} :{round(x[peaks][i],2)}')

peak 0 :44.94
peak 1 :65.42
peak 2 :82.88


In [None]:
# Fill in the 2-theta peak positions to the nearest 0.1 degree precision for the unknown pure A phase.
unknown_first_peak_loc = 

unknown_sec_peak_loc =

unknown_third_peak_loc =

## Identify pure unknown phase using the database

#### Hint

You may adapt the code from Part 2 to do an intersecting match. When doing so, make sure that the variables being used are holding the right values

Again, the match_tolerance parameter will affect the confidence of your matches.

A match between two out of the three peaks identified is sufficient to proceed.


## Form the calibration curves using the known weight fractions

### Grab all known weight fraction xrd pattern csv file locations for loading


$\color{red}{\text{IMPORTANT}}$

Load the appropriate data according to the unknown phase you are given on the quiz by editing `relative_dir`

Ensure that the variable `part_3_data_sorted` holds appropriately sorted files - ['C 100%', 'C 50%',  'CaF2 100%', 'C Mystery%']

For example:

`part_3_data = ['C 100%', 'C 50%', 'C Mystery%', 'CaF2 100%']`

has to be re-arranged to

`part_3_data_sorted  = ['C 100%', 'C 50%',  'CaF2 100%', 'C Mystery%']`

You might have to manually sort them using indexing:

`part_3_data_sorted = [part_3_data[1],part_3_data[0],part_3_data[3], part_3_data[2]]`

Even if the data is sorted when loaded, you still have to update the variable `part_3_data_sorted`

In [257]:
def find_files(directory):
    """
    Find all csv files within the specified directory and its subdirectories.

    Parameters:
        directory (str): The directory path to search for csv files.

    Returns:
        list: A list of paths to csv files.
    """
    csv_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".csv"):
                csv_files.append(os.path.join(root, file))
    return csv_files

def get_filename_without_extension(file_path):
    # Split the path into root and extension
    root, ext = os.path.splitext(file_path)
    # Split the root into directory and filename
    directory, filename = os.path.split(root)
    return filename

In [None]:
relative_dir = os.path.join('data', 'part_3_data', 'CaF2 - Z') # Edit this to match the unknown phase you are given on the quiz

part_3_data = sorted(find_files(os.path.join(os.getcwd(), relative_dir)))

[get_filename_without_extension(file) for file in part_3_data]

['A 100%', 'A 50%', 'A Mystery%', 'CaF2 100%']

In [None]:
part_3_data_sorted = [part_3_data[0],part_3_data[2],part_3_data[3],part_3_data[1]] # Example of an index re-ordering. You have to edit the numbers
[get_filename_without_extension(file) for file in part_3_data_sorted]

['A 100%', 'A 50%', 'CaF2 100%', 'A Mystery%']

### Create empty data structure first holding the known weight fraction peak intensities


$\color{red}{\text{IMPORTANT}}$

Fill in the variables `CaF2_sec_peak_loc` and `CaF2_third_peak_loc` with the 2nd and 3rd smallest 2-theta peak positions of the pure CaF2 XRD spectrum. You should have them from answering Part 2. The smallest 2-theta peak position has been filled in for you. Use this to do a sanity check.

Also check that the `weight_frac_CaF2` and `weight_frac_unknown` columns match the file names in the `intensity_data` dataframe created.

In [None]:

CaF2_first_peak_loc = 28.3
CaF2_sec_peak_loc = 
CaF2_third_peak_loc = 

In [None]:
default_peaks_list = [CaF2_first_peak_loc, CaF2_sec_peak_loc, CaF2_third_peak_loc, unknown_first_peak_loc, unknown_sec_peak_loc, unknown_third_peak_loc]

intensity_data = pd.DataFrame(index=[os.path.splitext(os.path.basename(text))[0] for text in part_3_data_sorted[:-1]], columns=default_peaks_list)
intensity_data['weight_frac_unknown'] = [100, 50, 0]

intensity_data['weight_frac_CaF2'] = 100 - intensity_data['weight_frac_unknown']
intensity_data

| | 28.3 | 47.0 | 55.75 | 44.94 | 65.42 | 82.88 | weight_frac_unknown | weight_frac_CaF2 |
|---|---|---|---|---|---|---|---|---|
| A 100% | | | | | | | 100 | 0 |
| A 50% | | | | | | | 50 | 50 |
| CaF2 100% | | | | | | | 0 | 100 |


### Populate data structure with values from each XRD pattern

The following code block populates the empty dataframe above with the peak intensities at the six 2-theta positions specified (corresponding to CaF2 and the unknown phase).

For each csv file:

- The XRD spectrum is background-subtracted and plotted
- Peak finding is conducted
- If a peak is found close to a default 2-theta position defined in the dataframe above, the corresponding intensity is keyed into the dataframe
- If no peak is found, the intensity at the data point closest to the default 2-theta position is keyed into the dataframe instead

Contrast this with how you would have to do this manually by opening each file yourself.
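The "closest data point" lookup is a one-liner with NumPy. The sketch below uses the first five rows of the data preview shown earlier in this part and a hypothetical target position.

```python
import numpy as np

# First five rows of the data preview shown earlier in this part
two_theta_demo = np.array([10.0001, 10.020451, 10.040802, 10.061153, 10.081504])
intensity_demo = np.array([1105.908119, 1079.959928, 1092.980528, 1098.039187, 1105.047293])

target_demo = 10.05  # hypothetical 2-theta position of interest
# Index of the measured point whose 2-theta value is closest to the target
nearest = np.abs(two_theta_demo - target_demo).argmin()
print(two_theta_demo[nearest], intensity_demo[nearest])
```

Because the measured 2-theta values are on a fixed grid, the target position itself is rarely present in the data, which is why a nearest-point lookup is needed.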

In [274]:
for file in part_3_data_sorted[:-1]:

    current_index = str(get_filename_without_extension(file))

    print(f'Currently on {current_index}')

    data = pd.read_csv(file, usecols=[1,2])

    x = data['2 theta'].to_numpy()
    y = data['Intensity'].to_numpy()

    # Background subtraction (same method and settings for every file)
    half_window = 20
    baseline_fitter = Baseline(x_data = x)
    baseline = baseline_fitter.mor(y, half_window=half_window)[0]
    y_subtracted = y - baseline
    y_subtracted[y_subtracted<0]=0

    data['Intensity'] = y_subtracted
    y = y_subtracted

    # Find the peaks of this pattern before plotting them
    peaks_ind, _ = find_peaks(y, distance=30, height=350, prominence=10)

    plt.figure()
    plt.title(f'{current_index}')
    plt.plot(x, y_subtracted)
    plt.plot(x[peaks_ind], y_subtracted[peaks_ind], 'x', label='Peaks', color='red')

    for default_peak in default_peaks_list:
        # Default: intensity at the data point closest to the target 2-theta value
        nearest = np.abs(x - default_peak).argmin()
        intensity_data.loc[current_index, default_peak] = y[nearest]

        # If a peak is found close to the default position, use its height instead
        close = np.isclose(x[peaks_ind], default_peak, atol=0.1)
        if close.any():
            intensity_data.loc[current_index, default_peak] = y[peaks_ind[close][0]]

Currently on A 100%
Currently on A 50%
Currently on CaF2 100%


## Check that the table is completely filled

The data structure is now complete, although make sure to check that the values are sensible - peak heights corresponding to A's smallest 2-theta positions should have decreasing values as less of A is present, and vice versa for CaF2. 

If they are not, now is a good time to check if you have correctly edited `part_3_data_sorted` above.

In [275]:
intensity_data

| | 28.3 | 47.0 | 55.75 | 44.94 | 65.42 | 82.88 | weight_frac_unknown | weight_frac_CaF2 |
|---|---|---|---|---|---|---|---|---|
| A 100% | 64.993524 | 123.712748 | 84.641444 | 15975.054772 | 2061.785257 | 3997.632071 | 100 | 0 |
| A 50% | 1215.282191 | 1224.173727 | 422.529564 | 8032.444201 | 1035.017387 | 2030.493163 | 50 | 50 |
| CaF2 100% | 2276.661 | 2344.43256 | 737.904756 | 101.467615 | 56.380877 | 63.364535 | 0 | 100 |


## Identify the equation of the line for each of the three peaks identified for the unknown phase

With the populated dataframe, we can now plot the calibration curves using the known weight fractions of CaF2 and the unknown phase.

$\color{red}{\text{IMPORTANT}}$

If you do not obtain a straight line, check the following
- `part_3_data_sorted` contains the files in the correct order specified
- background subtraction is performed consistently using the same method across all the given files
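One way to put a number on "straight line" is the coefficient of determination, R². The sketch below uses the three intensities at 44.94 degrees from the example table above. The R² calculation is an addition for illustration and is not part of the original practical.

```python
import numpy as np

# Values copied from the example table above (peak at 44.94 degrees)
wt_caf2_demo = np.array([0.0, 50.0, 100.0])
intensity_demo = np.array([15975.054772, 8032.444201, 101.467615])

slope, intercept = np.polyfit(wt_caf2_demo, intensity_demo, deg=1)
predicted = slope * wt_caf2_demo + intercept

# Coefficient of determination: 1.0 means the points lie exactly on the line
ss_res = np.sum((intensity_demo - predicted) ** 2)
ss_tot = np.sum((intensity_demo - intensity_demo.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.1f}, intercept = {intercept:.0f}, R^2 = {r_squared:.6f}")
```

A value of R² well below 1 suggests that one of the two checks listed above has failed.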

In [277]:
intensity_data.plot(x='weight_frac_CaF2', y=unknown_first_peak_loc, kind='scatter')
plt.title('A intensities')

Text(0.5, 1.0, 'A intensities')

In [278]:
# Peaks to process
unknown_peaks = [unknown_first_peak_loc, unknown_sec_peak_loc, unknown_third_peak_loc]

for intensity in unknown_peaks:

    intensity_data.plot(x='weight_frac_CaF2', y=intensity, kind='scatter')

    # Fit a linear regression line
    coefficients = np.polyfit(x=intensity_data['weight_frac_CaF2'], y=pd.to_numeric(intensity_data[intensity]), deg=1)

    line = np.poly1d(coefficients)
    print(f'{intensity} equation of line: {line}')

    # Plot the linear regression line
    plt.plot(intensity_data['weight_frac_CaF2'], line(intensity_data['weight_frac_CaF2']), color='red')
    plt.title(f'{intensity}')
    # Show the equation of the line on the plot
    plt.text(0.95, 0.05, f'{line}', color='red',
             ha='right', va='bottom', transform=plt.gca().transAxes)

    plt.show()


44.94 equation of line:  
-158.7 x + 1.597e+04
65.42 equation of line:  
-20.05 x + 2054
82.88 equation of line:  
-39.34 x + 3998


## Identify background-subtracted peak intensities corresponding to the smallest 2-theta peak positions of unknown phase A in the unknown mixture

$\color{red}{\text{IMPORTANT}}$

Perform background subtraction using the same method that has been used for populating the values in the `intensity_data` dataframe.

Refer to section 'Populate data structure with values from each XRD pattern' and copy the necessary code

You may wish to plot to check that the background subtraction is working as intended.

### Load data

In [None]:
unknown_mixture =  pd.read_csv(part_3_data_sorted[-1], usecols=[1,2])

x = unknown_mixture['2 theta'].to_numpy()
y = unknown_mixture['Intensity'].to_numpy()

plt.close()
plt.figure()
plt.plot(unknown_mixture['2 theta'], unknown_mixture['Intensity'])

[<matplotlib.lines.Line2D at 0x2644e277580>]

### Perform background subtraction

In [None]:
# Insert your background subtraction code here.

plt.figure()
plt.plot(x,y_subtracted) 
# Note: this cell runs even without any background subtraction code, because `y_subtracted`
# still holds values from previous cells. Make sure you overwrite `y_subtracted` first.

[<matplotlib.lines.Line2D at 0x26452866820>]

### Identify the peak heights corresponding to the three smallest 2-theta positions for the unknown phase in the unknown mixture

A vertical line marker is provided in the figure plotted below, but it isn't always accurate. Please zoom in using the interactive plot and read as accurately as possible the corresponding peak heights.
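As a cross-check on the value you read from the plot, the highest point within a small window around the expected position can also be found in code. This is a sketch on made-up data. The helper `peak_height_near` and the window of 0.3 degrees are assumptions, and the result should still be compared against the plot.

```python
import numpy as np

def peak_height_near(x, y, position, window=0.3):
    """Return (x, y) of the highest point within +/- window of position."""
    mask = np.abs(x - position) <= window
    if not mask.any():
        raise ValueError("no data points inside the search window")
    # Ignore everything outside the window when looking for the maximum
    local = np.argmax(np.where(mask, y, -np.inf))
    return x[local], y[local]

# Made-up pattern whose true maximum (45.1) is offset from the marker (45.0)
x_demo = np.linspace(44, 46, 201)
y_demo = 800 * np.exp(-((x_demo - 45.1) / 0.1) ** 2)

print(peak_height_near(x_demo, y_demo, position=45.0))
```

Reading the height at the marker itself would underestimate this peak, which is why the window search (or zooming in on the plot) is needed.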

In [None]:
unknown_peaks = [unknown_first_peak_loc, unknown_sec_peak_loc, unknown_third_peak_loc]

# Plot the data
plt.close()
plt.figure()
plt.plot(x, y_subtracted, label='y_subtracted')

# Add visual indicators for peaks
for peak in unknown_peaks:
    # Find the index of the closest value to the peak in x
    index = np.abs(x - peak).argmin()
    # Get the corresponding x and y values
    x_value = x[index]
    y_value = y_subtracted[index]
    # Add a vertical line
    plt.axvline(x=x_value, color='red', linestyle='--', label=f'Peak near {peak}')

plt.xlabel('2-theta')
plt.ylabel('Intensity')
plt.legend()
plt.show()

# Identify the weight fraction of the unknown phase A in the mystery mixture

$\color{red}{\text{IMPORTANT}}$

You now have three equations of lines corresponding to the three smallest 2-theta peak positions of unknown phase A, along with the corresponding intensities for the mystery A mixture.

Use them to determine the average weight fraction of A in the mystery A mixture.
- Hint: Use `np.mean`

Make sure you pair each peak with its own equation of the line, or you will not arrive at the correct answer.

Be clear on what 'X' and 'Y' are in the equation of the line.
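The arithmetic can be sketched as follows. The line coefficients are the example values printed by the fitting cell earlier, where the fit was made against the weight percent of CaF2. The peak heights are hypothetical numbers chosen for illustration and are not taken from any mystery file. Your own coefficients and heights will differ.

```python
import numpy as np

# Example (slope, intercept) pairs from the fit output shown earlier, where
# x is the weight percent of CaF2 and y is the peak intensity
lines_demo = {44.94: (-158.7, 15970.0), 65.42: (-20.05, 2054.0), 82.88: (-39.34, 3998.0)}

# Hypothetical peak heights read from a mystery pattern (not real data)
heights_demo = {44.94: 11209.0, 65.42: 1452.5, 82.88: 2817.8}

# Rearrange y = m*x + c to x = (y - c) / m for each peak
wt_caf2_estimates = [(heights_demo[p] - c) / m for p, (m, c) in lines_demo.items()]
wt_caf2_mean = np.mean(wt_caf2_estimates)
wt_unknown_mean = 100 - wt_caf2_mean

print(np.round(wt_caf2_estimates, 1), round(wt_caf2_mean, 1), round(wt_unknown_mean, 1))
```

If the three individual estimates disagree strongly, revisit the peak heights and the background subtraction before averaging.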