To graph multiple spectra simultaneously, glob was used to collect the names of multiple files in a list called Data.

The loop below adds the name of every file with the .txt suffix to the 'Data' list.

In [None]:
import glob
Data = []
for each_file in glob.iglob('*.txt'):
    Data.append(each_file)
print(Data)
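
glob returns matches in arbitrary order, and several later steps pair lists by position, so a reproducible order can help. A minimal sketch, assuming that sorting the names alphabetically is acceptable; the helper name list_spectra and the file names in the demonstration are hypothetical:

```python
import glob
import os
import tempfile

def list_spectra(folder, pattern="*.txt"):
    """Return the matching file names in `folder`, sorted alphabetically."""
    paths = glob.glob(os.path.join(folder, pattern))
    # Keep only the file name and sort for a stable, repeatable order
    return sorted(os.path.basename(p) for p in paths)

# Demonstration in a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as folder:
    for fname in ["S2.txt", "B_blank.txt", "S1.txt", "notes.csv"]:
        open(os.path.join(folder, fname), "w").close()
    demo_names = list_spectra(folder)
print(demo_names)
```

Only the .txt files are returned, and the order no longer depends on the file system.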

Once the file names have been collected, the next challenge is to read every file in the Data list and store the resulting tables in a new list, Data1. At the same time the column names are changed to be more informative. MRE is the main value we are interested in.

In [None]:
import pandas as pd
Data1 = []
for each_file in Data:
    each_file = pd.read_csv(each_file, skiprows=19, sep=r'\s+', header=None)
    each_file = each_file.rename(columns = 
           {0:"Wavelength", 1:"MRE", 2:"HTV"})
    Data1.append(each_file)
print(Data1)

These two lists were then zipped together to form the dictionary dict1, which links each file name to its data so that information can later be extracted from the file names.

In [None]:
dict1 = dict(zip(Data,Data1))
print(dict1)
    

Having established this dictionary, we next need to pull out the blank data and then subtract the blank MRE data from the sample data. First we need to recognise the blank from its file name (here, names matching the pattern 'B_*') and then pull that entry from the dictionary. I initially tried using wildcard characters as with glob, but dictionary lookups do not support them, so the fnmatch module is required. We import it and then check that it is working.

In [None]:
import fnmatch
fnmatch.fnmatch('test', 't??t')

Next we create a loop using fnmatch to identify the entry in the dictionary whose file name marks it as the blank, extract its data and store it in the variable 'blank'. Initially this code did not work, because the source it was adapted from (https://stackoverflow.com/questions/52656701/wildcard-in-dictionary-key) used it to produce flexible keys within dictionaries. This was solved by changing 'if fnmatch.fnmatch(blank_identification, name):' to 'if fnmatch.fnmatch(name, blank_identification):', since fnmatch.fnmatch takes the name first and the pattern second. After this, I needed to remove the blank dataset from the dictionary, which proved harder than expected as its name would change. Adding del or pop() inside the for loop didn't work, as it changed the size of the dictionary in the middle of the loop, which threw up an error.

In [None]:
# Pattern that marks the blank measurement: file names starting with 'B_'
blank_identification = 'B_*'
for name, data in dict1.items():
    # fnmatch takes the name first and the pattern second
    if fnmatch.fnmatch(name, blank_identification):
        blank = pd.DataFrame(data)
print(blank)
print(blank["MRE"])
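
The argument order that caused the original problem can be checked directly: fnmatch.fnmatch expects the file name first and the pattern second. A small sketch with hypothetical file names:

```python
import fnmatch

pattern = "B_*"
names = ["B_buffer.txt", "S1_sample.txt"]   # hypothetical file names

# Correct order: fnmatch(name, pattern)
matches = [n for n in names if fnmatch.fnmatch(n, pattern)]
# Swapped order treats the file name as the pattern, so nothing matches
swapped = [n for n in names if fnmatch.fnmatch(pattern, n)]
print(matches, swapped)
```

With the arguments swapped, each file name is interpreted as a literal pattern and compared against the string 'B_*', which is why the adapted code found no blank.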

The blank was removed using a rather roundabout method. By using a for loop to search through the keys, this time with 'dict1.keys()', the name of the blank could be stored in a variable. Outside of the loop, the pop() function could then be used to remove the blank dataset from the dictionary. This avoids problems further down the line, as the file name of the blank does not usually follow the same format as the sample names. It also had to be removed from the Data list of file names, as this list is used below to form blanked_dict.

In [None]:
blank_identification = 'B_*'
for name in dict1.keys():
    if fnmatch.fnmatch(name, blank_identification):
        print(name)
        blank_name = name
print(dict1.pop(blank_name))
Data.remove(blank_name)
print(Data)
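
The error described above comes from resizing a dictionary while iterating over it. One alternative, sketched here with a stand-in dictionary of hypothetical names (the helper pop_matching is not part of the original notebook), is to collect the matching keys first and pop them afterwards:

```python
import fnmatch

def pop_matching(mapping, pattern):
    """Remove every entry whose key matches `pattern`; return the removed entries."""
    # Collect the keys first so the dictionary is not resized while being iterated
    matching = [name for name in mapping if fnmatch.fnmatch(name, pattern)]
    return {name: mapping.pop(name) for name in matching}

demo = {"B_buffer.txt": "blank", "S1.txt": "sample 1", "S2.txt": "sample 2"}  # hypothetical

# Removing inside the loop fails because the dictionary changes size mid-iteration
broken = dict(demo)
try:
    for name in broken:
        if fnmatch.fnmatch(name, "B_*"):
            broken.pop(name)
    raised = False
except RuntimeError:
    raised = True

removed = pop_matching(demo, "B_*")
print(raised, removed, list(demo))
```

The helper returns the removed entries, so the blank data remains available after it has been taken out of the dictionary.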

To blank the data we must now subtract the MRE data of the blank from the MRE data of each sample in the dictionary. To do this we loop through the dictionary, subtracting the blank from each sample. Initially this was tried by calling column 2 by position, but this appeared to give the second row. To avoid this confusion the column was called by name using sample["MRE"].

In [None]:
blanked_data = []
for sample in dict1.values():
    df = pd.DataFrame(columns=['Wavelength', 'MRE', 'HTV'])
    df["MRE"] = (sample["MRE"]-blank["MRE"])
    df["Wavelength"] = sample["Wavelength"]
    df["HTV"] = sample["HTV"]
    blanked_data.append(df)
blanked_dict = dict(zip(Data,blanked_data))
print(blanked_dict)
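
The original attempt is not shown, but one likely source of the row and column confusion is positional indexing: `.iloc[1]` selects the second row, whereas `.iloc[:, 1]` or a column label selects the second column. A small sketch with made-up numbers (not measured data):

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({"Wavelength": [260, 259, 258],
                       "MRE": [1.5, 2.5, 3.5],
                       "HTV": [300, 310, 320]})
blank = pd.DataFrame({"Wavelength": [260, 259, 258],
                      "MRE": [0.5, 0.5, 1.0],
                      "HTV": [290, 300, 310]})

second_row = sample.iloc[1]        # positional row lookup: the row at 259 nm
second_column = sample.iloc[:, 1]  # positional column lookup: the MRE column
by_name = sample["MRE"]            # label lookup, unambiguous

# Subtraction aligns on the row index, so both tables need the same rows
blanked = sample["MRE"] - blank["MRE"]
print(blanked.tolist())
```

Because the subtraction aligns on the row index rather than on wavelength, the blank and the samples are assumed to have been recorded over the same wavelength range in the same order.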

This code uses the re module, which recognises patterns in text and in this case extracts values from the file names. The re.match function relies on the standardised naming system: it looks for the code at the beginning of each value (e.g. Tc for temperature in Celsius) and captures the value after it as a group (in brackets). Initially it was difficult to use the extracted values, as they are returned as strings. For each value we therefore use the int() function to turn the string into an integer that can be used in calculations. These values are used to calculate a 'normalisation factor', which normalises the data to the protein concentration, the path length of the cell it is measured in, and the total number of amide bonds in the protein, so that the output values remain comparable when these quantities vary.

In [None]:
import re
normalisation_factor = []
for name in dict1.keys():
    print(name)
    m = re.match(r".*Tc(.*)_TPC(.*)um_AB(.*)_PL(.*)mm_Ti_(.*)\.txt", name)
    Total_Protein_Concentration = m.group(2)
    Total_Protein_Concentration = int(Total_Protein_Concentration)
    print(Total_Protein_Concentration)
    Total_Amide_Bonds = m.group(3)
    Total_Amide_Bonds = int(Total_Amide_Bonds)
    print(Total_Amide_Bonds)
    Path_Length = m.group(4)
    Path_Length = int(Path_Length)
    print(Path_Length)
    normalisation_factor.append(1000000/(Total_Protein_Concentration*Total_Amide_Bonds*Path_Length))
print(normalisation_factor)
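
The extraction and the factor calculation can also be wrapped in one function. This is a sketch only: the file name in the usage is hypothetical, the helper name parse_name is not from the original, and the digits-only groups (\d+) assume integer values, as the int() calls above do:

```python
import re

# Named groups document which part of the file name each value comes from
NAME_PATTERN = re.compile(
    r".*Tc(?P<temp>\d+)_TPC(?P<conc>\d+)um_AB(?P<bonds>\d+)"
    r"_PL(?P<path>\d+)mm_Ti_(?P<title>.*)\.txt"
)

def parse_name(name):
    """Return (title, normalisation factor) parsed from a file name."""
    m = NAME_PATTERN.match(name)
    if m is None:
        raise ValueError("File name does not follow the naming scheme: " + name)
    conc = int(m.group("conc"))
    bonds = int(m.group("bonds"))
    path = int(m.group("path"))
    # Same formula as the loop above
    return m.group("title"), 1000000 / (conc * bonds * path)

title, factor = parse_name("P1_Tc20_TPC50um_AB100_PL1mm_Ti_Demo.txt")
print(title, factor)
```

Raising an error for names that do not match gives a clearer message than the AttributeError that m.group() produces when re.match returns None.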

In this section the normalisation factors are used to create a new dictionary (normalised_dict) in which the blanked MRE data has been converted to normalised data. Initially I tried to do this step in the same for loop as above, without storing the normalisation factors in their own list, but as the above loop cycles through the dictionary keys and here we need to loop through the values, a separate loop was used.

In [None]:
normalised_data = []
# Pair each blanked dataset with its normalisation factor (both follow the order of Data)
for data, factor in zip(blanked_dict.values(), normalisation_factor):
    df = pd.DataFrame(columns=['Wavelength', 'MRE', 'HTV'])
    df['MRE'] = data['MRE']*factor
    df["Wavelength"] = data["Wavelength"]
    df["HTV"] = data["HTV"]
    normalised_data.append(df)
normalised_dict = dict(zip(Data,normalised_data))
print(normalised_dict)
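
The two loops could also be merged, because dict.items() yields the key and the value together. A sketch under the assumption that the blanked data is keyed by file name, as in blanked_dict; the file name and numbers are invented for the example:

```python
import re
import pandas as pd

pattern = re.compile(r".*Tc(.*)_TPC(.*)um_AB(.*)_PL(.*)mm_Ti_(.*)\.txt")

# Stand-in for blanked_dict with one hypothetical entry and invented values
demo_blanked = {
    "P1_Tc20_TPC50um_AB100_PL1mm_Ti_Demo.txt": pd.DataFrame(
        {"Wavelength": [260, 259], "MRE": [0.5, 0.25], "HTV": [300, 310]}
    )
}

demo_normalised = {}
for name, data in demo_blanked.items():   # key and value in one loop
    m = pattern.match(name)
    factor = 1000000 / (int(m.group(2)) * int(m.group(3)) * int(m.group(4)))
    df = data.copy()                       # leave the blanked data untouched
    df["MRE"] = data["MRE"] * factor
    demo_normalised[name] = df
print(demo_normalised)
```

Building the dictionary directly inside the loop also removes the need to zip the results back onto the list of file names.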

Using the same regular expression as in the normalisation step, the titles are extracted and put into their own list. This list is then used to add the titles to the graphs below.

In [None]:
titles = []
for names in Data:
    m = re.match(r".*Tc(.*)_TPC(.*)um_AB(.*)_PL(.*)mm_Ti_(.*)\.txt", names)
    Title = m.group(5)
    titles.append(Title)
print(titles)

Plot the normalised MRE data against wavelength. The characteristic double-dip shape of the plot indicates that this protein is alpha helical.

In [None]:
import matplotlib.pyplot as plt
# Draw one MRE plot and one HTV plot for each dataset
for heading, data in zip(titles, normalised_data):
    fig, ax = plt.subplots()
    data.plot(x="Wavelength", y="MRE", ax=ax)
    ax.set_title(heading + " MRE vs Wavelength")
    ax.set_xlabel("Wavelength (nm)")
    ax.set_ylabel("MRE (deg cm2 (dmol res)-1)")

    fig, ax1 = plt.subplots()
    data.plot(x="Wavelength", y="HTV", ax=ax1)
    ax1.set_title("HTV vs Wavelength")
    ax1.set_xlabel("Wavelength (nm)")
    ax1.set_ylabel("HTV (AU)")