# Peak Vector Nodes

This notebook briefly explores the idea of representing each exosome Raman spectrum sample as a vector in an n-dimensional space, with each sample acting as a node. The relationships between the nodes would be given by either the distance or the cosine similarity between their vectors. To get a vector for each sample, we find the peaks in that sample's spectrum and use the "Absorbance" value at each peak as the components of the vector.
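As a minimal sketch of the two relationships mentioned above, the snippet below computes the cosine similarity and the Euclidean distance between equal-length vectors using NumPy. The helper names (`cosine_similarity`, `euclidean_distance`) and the three sample vectors are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

# Illustrative peak-intensity vectors (hypothetical values)
sample_a = [3.0, 4.0, 0.0]
sample_b = [6.0, 8.0, 0.0]   # same direction as sample_a, twice the intensity
sample_c = [0.0, 0.0, 5.0]   # no peaks in common with sample_a

print(cosine_similarity(sample_a, sample_b))   # 1.0
print(euclidean_distance(sample_a, sample_b))  # 5.0
print(cosine_similarity(sample_a, sample_c))   # 0.0
```

Cosine similarity ignores overall intensity (`sample_a` and `sample_b` score 1.0 despite differing in scale), while Euclidean distance does not, so the choice depends on whether absolute intensity should matter when linking nodes.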

In [1]:
# Import relevant libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from scipy.signal import find_peaks
from scipy.signal import peak_widths
from scipy.signal import peak_prominences
from scipy.integrate import simpson as simps  # `simps` was removed in SciPy 1.14; the alias keeps the old name
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
import numpy as np

In [2]:
# Import raw spectral data
spectra_df = pd.read_csv("../../data/exosomes.raw_spectrum_1.csv")

In [3]:
# Create empty lists
peaks = []
widths = []
prominences = []
areas = []

# Create a copy of the dataframe
df = spectra_df.copy()

# Find the index, width and prominence of each peak, one spectrum at a time
for _, group in df.groupby('SpecID'):

    # Find the peaks
    peak_index, _ = find_peaks(x=group['Absorbance'], prominence=75, width=6)

    # Calculate the widths of each peak
    widths += list(peak_widths(group['Absorbance'], peaks=peak_index, rel_height=0.5)[0])

    # Calculate prominence of each peak
    prominences += list(peak_prominences(group['Absorbance'], peaks=peak_index)[0])

    # Record each peak's index label in the full dataframe
    peaks += list(group.iloc[peak_index].index.values)

# Select the peak rows by label (the stored values are index labels, not positions)
peaks_df = df.loc[peaks]
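To illustrate what the `prominence` and `width` arguments do, here is a small sketch on a synthetic signal (not the exosome data). It assumes two broad Gaussian bands plus a single-point spike. The thresholds are the same as in the cell above, but the signal shape and values are invented.

```python
import numpy as np
from scipy.signal import find_peaks

x = np.arange(500)

# Synthetic signal: two broad Gaussian bands (made-up positions and heights)
signal = (
    200 * np.exp(-0.5 * ((x - 120) / 10) ** 2)
    + 150 * np.exp(-0.5 * ((x - 350) / 15) ** 2)
)
signal[250] += 300  # narrow one-point spike, standing in for noise

# Without filters every local maximum is returned, including the spike
all_peaks, _ = find_peaks(signal)

# With the notebook's thresholds the spike is rejected because it is too narrow
filtered_peaks, properties = find_peaks(signal, prominence=75, width=6)

print(all_peaks)       # [120 250 350]
print(filtered_peaks)  # [120 350]
```

The returned `properties` dictionary also holds the `prominences` and `widths` of the retained peaks, so in this sketch they are available without separate calls.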

In [4]:
# Group the dataframe by relevant columns
grouped = peaks_df.groupby(['SpecID', 'Status'])

# Create empty dictionary
arrays_dict = {}

# Build dictionary with "SpecID" and "Status" as a key and our vector as a value
for group_name, group_data in grouped:
    absorbance_values = group_data['Absorbance'].values
    arrays_dict[group_name] = absorbance_values
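Vectors built this way have one entry per detected peak, so their length can differ from sample to sample, and distance or cosine similarity needs equal-length vectors. One possible workaround, sketched below under stated assumptions, is to map peaks onto a fixed grid of bins so that every sample gets a vector of the same length. The `WaveNumber` column, the bin width, the helper `to_fixed_vector` and all values are hypothetical and are not taken from the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical peak table: the "WaveNumber" column and all values are made up
toy_peaks = pd.DataFrame({
    "SpecID":     ["s1", "s1", "s2", "s2", "s2"],
    "WaveNumber": [410.0, 1250.0, 420.0, 800.0, 1260.0],
    "Absorbance": [100.0, 50.0, 90.0, 30.0, 60.0],
})

# Bin edges shared by every sample (the bin width of 100 is an arbitrary choice)
bin_edges = np.arange(400, 1401, 100)   # 10 bins: [400, 500), ..., [1300, 1400)

def to_fixed_vector(group):
    """Return one value per bin: the largest peak absorbance in that bin, else 0."""
    vector = np.zeros(len(bin_edges) - 1)
    bin_ids = np.digitize(group["WaveNumber"].to_numpy(), bin_edges) - 1
    for bin_id, absorbance in zip(bin_ids, group["Absorbance"].to_numpy()):
        if 0 <= bin_id < len(vector):   # ignore peaks outside the grid
            vector[bin_id] = max(vector[bin_id], absorbance)
    return vector

vectors = {spec_id: to_fixed_vector(g) for spec_id, g in toy_peaks.groupby("SpecID")}

for spec_id, vector in vectors.items():
    print(spec_id, vector)
```

Both toy samples end up as length-10 vectors even though they have two and three peaks respectively, so they can be compared entry by entry.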

After looking at our dictionary, we decided to set this approach aside for now. Because it shares similarities with algorithms such as SVM and K-means clustering, other methodologies seemed to have more promise, so we chose to focus on those instead.

In [5]:
arrays_dict

{('201210-1-00',
  'Normal'): array([1842.571 , 1851.9185, 1746.4041, 1702.7238, 1475.0056]),
 ('201210-1-01',
  'Normal'): array([1971.2809, 2034.2784, 2031.1062, 1977.515 , 1868.7283, 1779.342 ,
        1735.8127, 1735.0914, 1716.4309, 1733.3473]),
 ('201210-1-02',
  'Normal'): array([2214.0876, 2846.9824, 2166.9624, 3696.4109, 2275.5774, 2203.6494,
        2195.7212, 2190.1194, 2257.9094, 2418.2576, 2250.4634, 2063.5022]),
 ('201210-1-03',
  'Normal'): array([10350.545 ,  8466.6904,  7053.9658,  2536.3599,  2446.353 ,
         3342.7229,  2360.0664,  3452.3679,  2854.2239,  2866.572 ,
         2702.4502,  3134.1235,  2604.0405,  2910.6362,  3426.8677,
         2551.9229,  2552.4478]),
 ('201210-1-04',
  'Normal'): array([2277.2156, 2264.5063, 2240.8013, 2238.8494, 2278.3433, 2366.2205,
        2356.8567, 2429.5254, 2458.5142]),
 ('201210-1-05',
  'Normal'): array([2414.5728, 2334.2688, 2287.447 , 2296.0637, 2368.3093, 2463.7754,
        2459.7104, 2539.3604, 2461.4141, 2504.2095, 26