# Import raw urine spectra (part 1)

The goal of this notebook is to create a dictionary file that maps each patient (key) to their spectra (value). The spectral data is stored in several thousand txt files generated by LabSpectra, the software that operates the Raman microscope we used.
The notebook takes about 260 s to run.

## Imports important modules

In [1]:
# imports modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob # imports to find nested files
import os
import re

## Searches for file names and detects corrupted files

In [2]:
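# search for file names: substrate folders end in "NPs", patient folders end in a digit
# (without recursive=True, "**" behaves like "*", so a single "*" is used)
path = "Data raw urine spectra"
substrate_folders = glob.glob(path + "/*NPs")
patients_folders = glob.glob(path + "/*NPs/*[0-9]")
txt_naming = glob.glob(path + "/*NPs/*[0-9]/*.txt")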

In [3]:
# visualize txt_naming, first 5 entries
txt_naming[:5]

['Data raw urine spectra\\Ag_100nm_AgNPs\\1\\1-1.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\1\\1-2.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\1\\1-3.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\1\\2-1.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\1\\2-2.txt']

The output above shows how the data is organized: a substrate folder, then a patient ID folder, then the spectrum files. The cell below finds "corrupted" files.

In [4]:
# Find corrupted files and store their paths in a list.
# Corrupted files are identified by size: they are larger than 1 MB (1,000,000 bytes).
corrupted_files = []
for file_path in txt_naming:
    size = os.stat(file_path).st_size
    if size > 1000000:
        corrupted_files.append(file_path)
corrupted_files

['Data raw urine spectra\\Ag_100nm_AgNPs\\1\\3-1.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\107\\3-1.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\108\\1-1.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\116\\2-2.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\128\\1-1.txt',
 'Data raw urine spectra\\Ag_100nm_AgNPs\\62\\3-1.txt',
 'Data raw urine spectra\\Ag_100nm_AuNPs\\108\\2-1.txt',
 'Data raw urine spectra\\Ag_100nm_AuNPs\\110\\3-1.txt',
 'Data raw urine spectra\\Ag_100nm_AuNPs\\115\\2-1.txt',
 'Data raw urine spectra\\Ag_updated_100nm_AuNPs\\137\\1-1.txt',
 'Data raw urine spectra\\Al_tape_100nm_AuNPs\\113\\1-1.txt',
 'Data raw urine spectra\\Al_tape_60nm_AuNPs\\104\\1-1.txt',
 'Data raw urine spectra\\Au_60nm_AuNPs\\133\\1-1.txt',
 'Data raw urine spectra\\Au_60nm_AuNPs\\133\\2-1.txt',
 'Data raw urine spectra\\Au_60nm_AuNPs\\36\\3-1.txt',
 'Data raw urine spectra\\Si_3x_100nm_AuNPs\\113\\1-1.txt',
 'Data raw urine spectra\\Si_3x_100nm_AuNPs\\98\\1-1.txt']
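The size check above can also be wrapped in a small reusable helper. The sketch below is only an illustration: the function name `split_by_size`, the temporary files, and the 50-byte threshold are assumptions made for the demo, not part of the original notebook.

```python
import os
import tempfile
from pathlib import Path


def split_by_size(paths, max_bytes=1_000_000):
    """Return (kept, oversized); a file is oversized if it exceeds max_bytes."""
    kept, oversized = [], []
    for p in paths:
        if os.path.getsize(p) > max_bytes:
            oversized.append(p)
        else:
            kept.append(p)
    return kept, oversized


# Toy demonstration with two temporary files and a deliberately small threshold.
with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "small.txt"
    large = Path(tmp) / "large.txt"
    small.write_bytes(b"x" * 10)
    large.write_bytes(b"x" * 100)
    demo_kept, demo_oversized = split_by_size([str(small), str(large)], max_bytes=50)
    demo_kept_names = [Path(p).name for p in demo_kept]
    demo_oversized_names = [Path(p).name for p in demo_oversized]
```

With the real data, `split_by_size(txt_naming)` would presumably return the cleaned file list and the corrupted files in one pass.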

In [5]:
# Remove corrupted files from the list by their file paths.
for corrupted_file in corrupted_files:
    txt_naming.remove(corrupted_file)

In [6]:
len(txt_naming)

7279

## Create dictionaries: Keys - patients, Value - spectra
First, we define a function that organizes the data into a dictionary.

In [7]:
def raman_create_dict(files_path, substrate):
    """
    Example: files_path contains paths such as
    'Data raw urine spectra\\Au_100nm_AuNPs\\99\\3-3.txt', where
    "Au_100nm_AuNPs" is the experimental set,
    99 is the ID of a patient,
    3-3.txt is the file with the spectral data.
    
    Input:
    1) files_path - a list of file paths
    2) substrate - a string, one of the following values:
    experimental_sets = ["Ag_100nm_AgNPs",
                        "Ag_100nm_AuNPs",
                        "Al_tape_60nm_AuNPs",
                        "Al_tape_100nm_AuNPs",
                        "Au_60nm_AuNPs",
                        "Au_100nm_AuNPs",
                        "Si_60nm_AuNPs",
                        "Si_3x_100nm_AuNPs",
                        "Au_40nm_AuNPs",
                        "Au_no_AuNPs",
                        "Au_HSA_AuNPs",
                        "glass_no_AuNPs",
                        "Si_no_AuNPs",
                        "Ag_updated_100nm_AuNPs",
                        "Ag_500rods_AuNPs",
                        "Au_650rods_AuNPs"]
    
    Output:
    dictionary with Keys - patients, and Values - spectra
    """
    
    # keep only the paths that belong to the requested substrate (experimental set)
    rel_files_path = []
    for file_path in files_path:
        if re.search(substrate, file_path) is not None:
            rel_files_path.append(file_path)
    
    # create a dictionary with placeholder numeric keys; unused keys are deleted later
    # no patient ID in our research group exceeds 200, so 300 keys are more than enough
    dict_raw_spectra = {}
    for i in range(0, 300):
        dict_raw_spectra[i] = []
    
    # create matrix from relevant path
    for file_path in rel_files_path:
        # keys in dict
        y = re.findall(r"\\([0-9]*)\\", file_path)
        key = int(y[0])
        
        # values in dict
        value = pd.read_table(file_path)
        # drop the first two columns and convert the rest to a NumPy array
        value_sliced = np.array(value.iloc[:,2:])
        # skip files with an unexpected number of rows (more than 100)
        if value_sliced.shape[0] <= 100:
            dict_raw_spectra[key].append(value_sliced)
    
    # delete keys without values; for keys with values, concatenate all arrays into one
    for key in list(dict_raw_spectra.keys()):
        if len(dict_raw_spectra[key]) == 0:
            del dict_raw_spectra[key]
        else:
            dict_raw_spectra[key] = np.concatenate(dict_raw_spectra[key])

    return dict_raw_spectra
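The function above extracts the patient ID with a regular expression that expects Windows path separators (`\\`). As a hedged alternative, the sketch below splits a path into its parts with `pathlib.PureWindowsPath`, which accepts both `\` and `/`. The helper name `parse_spectrum_path` is hypothetical, and it assumes the `<root>/<substrate>/<patient>/<file>.txt` layout shown earlier.

```python
from pathlib import PureWindowsPath


def parse_spectrum_path(file_path):
    """Split '<root>/<substrate>/<patient>/<file>.txt' into its last three parts."""
    # PureWindowsPath treats both "\\" and "/" as separators,
    # so the same code handles paths produced on Windows or elsewhere.
    parts = PureWindowsPath(file_path).parts
    substrate, patient, file_name = parts[-3], parts[-2], parts[-1]
    return substrate, int(patient), file_name


example_windows = parse_spectrum_path("Data raw urine spectra\\Au_100nm_AuNPs\\99\\3-3.txt")
example_posix = parse_spectrum_path("Data raw urine spectra/Au_100nm_AuNPs/99/3-3.txt")
```

Both calls return the substrate name, the patient ID as an integer, and the file name, so the patient ID could serve as the dictionary key regardless of the operating system that produced the path.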

The cell below creates a dictionary for each substrate. All of them are stored in a single dictionary called 'raw_urine_spectra'.

In [8]:
# names of experimental sets
experimental_sets = ["Ag_100nm_AgNPs",
                    "Ag_100nm_AuNPs",
                    "Al_tape_60nm_AuNPs",
                    "Al_tape_100nm_AuNPs",
                    "Au_60nm_AuNPs",
                    "Au_100nm_AuNPs",
                    "Si_60nm_AuNPs",
                    "Si_3x_100nm_AuNPs",
                    "Au_40nm_AuNPs",
                    "Au_no_AuNPs",
                    "Au_HSA_AuNPs",
                    "glass_no_AuNPs",
                    "Si_no_AuNPs",
                    "Ag_updated_100nm_AuNPs",
                    "Ag_500rods_AuNPs",
                    "Au_650rods_AuNPs"]

In [9]:
# create a dictionary for each substrate (experimental set)
raw_urine_spectra = {}

for exp_set in experimental_sets:
    raw_urine_spectra[exp_set] = raman_create_dict(txt_naming, exp_set)

In [10]:
# check results
raw_urine_spectra.keys()

dict_keys(['Ag_100nm_AgNPs', 'Ag_100nm_AuNPs', 'Al_tape_60nm_AuNPs', 'Al_tape_100nm_AuNPs', 'Au_60nm_AuNPs', 'Au_100nm_AuNPs', 'Si_60nm_AuNPs', 'Si_3x_100nm_AuNPs', 'Au_40nm_AuNPs', 'Au_no_AuNPs', 'Au_HSA_AuNPs', 'glass_no_AuNPs', 'Si_no_AuNPs', 'Ag_updated_100nm_AuNPs', 'Ag_500rods_AuNPs', 'Au_650rods_AuNPs'])
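Checking only the top-level keys does not show what each substrate contains. A minimal sketch of a deeper check is given below; the helper `summarize_spectra` is hypothetical, and the toy dictionary with made-up array shapes stands in for the real `raw_urine_spectra`.

```python
import numpy as np


def summarize_spectra(nested):
    """Map each substrate to {patient_id: shape of the stacked spectra array}."""
    return {
        substrate: {pid: arr.shape for pid, arr in patients.items()}
        for substrate, patients in nested.items()
    }


# Toy stand-in for raw_urine_spectra; the shapes are invented for illustration.
toy = {
    "Au_100nm_AuNPs": {99: np.zeros((6, 4)), 12: np.zeros((3, 4))},
    "Si_no_AuNPs": {},
}
toy_summary = summarize_spectra(toy)
```

Applied to the real dictionary, `summarize_spectra(raw_urine_spectra)` would list, per substrate, which patient IDs are present and how many rows were concatenated for each.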

## Saves output dictionaries into pickle

In [11]:
# save the output dictionary into a pickle file
import pickle

with open("raw_urine_spectra.pkl", "wb") as file:
    pickle.dump(raw_urine_spectra, file)
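To confirm that a pickle file can be read back, a round trip along the following lines may help. This is a sketch on a toy dictionary written to a temporary folder (the file name `toy_spectra.pkl` and the data are assumptions); the real file would be opened the same way with `"rb"`.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy nested dictionary with the same structure: substrate -> patient ID -> array.
toy_spectra = {"Au_100nm_AuNPs": {99: np.arange(6.0).reshape(2, 3)}}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy_spectra.pkl")
    with open(pkl_path, "wb") as file:
        pickle.dump(toy_spectra, file)
    with open(pkl_path, "rb") as file:
        reloaded = pickle.load(file)

# Compare keys and array contents after the round trip.
same_keys = reloaded.keys() == toy_spectra.keys()
same_values = np.array_equal(
    reloaded["Au_100nm_AuNPs"][99], toy_spectra["Au_100nm_AuNPs"][99]
)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.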