## Patient selection criteria

### Objective: find MS patients with contrasting features

#### 1. Compute the ARMSS score for each patient
#### 2. Select two groups of patients: N patients with low ARMSS scores and N patients with high ARMSS scores
#### 3. From each group, select M patients with the same diet, in a 70:30 female-to-male ratio
Include diet as a confounding variable in the model.
#### 4. Fit a GLM for each metabolite using the selected 2M patient samples (serum, feces, or their ratio? Chosen: feces)
Fit separate models for targeted and untargeted data.
#### 5. Select metabolites with a significant relationship with the ARMSS score difference
Send the selected metabolites to Maura and Shaobo.

Manually curate the selected metabolites.
#### 6. Create a patient profile from the selected metabolites
#### 7. Compute the patient similarity matrix
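
Steps 2 and 3 of the plan could be sketched roughly as follows. This is only an illustration on toy data: the function `select_contrasting_groups` and the column names `gARMSS`, `Sex` and `Diet` are hypothetical placeholders, not the study's actual schema, and the values of N and M are arbitrary.

```python
import pandas as pd


def select_contrasting_groups(df, n, m, diet, female_frac=0.7,
                              score_col="gARMSS", sex_col="Sex", diet_col="Diet"):
    """Pick m patients from each ARMSS extreme with a fixed female:male split.

    Column names are hypothetical placeholders chosen for this sketch.
    """
    ranked = df.sort_values(score_col)
    groups = {"low": ranked.head(n), "high": ranked.tail(n)}
    n_female = round(m * female_frac)
    n_male = m - n_female
    selected = {}
    for label, group in groups.items():
        same_diet = group[group[diet_col] == diet]
        # Keep the most extreme scores first: lowest for "low", highest for "high".
        same_diet = same_diet.sort_values(score_col, ascending=(label == "low"))
        females = same_diet[same_diet[sex_col] == "F"].head(n_female)
        males = same_diet[same_diet[sex_col] == "M"].head(n_male)
        selected[label] = pd.concat([females, males])
    return selected


# Toy cohort: 40 patients, increasing scores, repeating F, F, M pattern.
toy_cohort = pd.DataFrame({
    "Record ID": [f"P{i}" for i in range(40)],
    "gARMSS": [i / 4 for i in range(40)],
    "Sex": ["M" if i % 3 == 2 else "F" for i in range(40)],
    "Diet": ["omnivore"] * 40,
})
picked = select_contrasting_groups(toy_cohort, n=20, m=10, diet="omnivore")
print(picked["low"]["Sex"].value_counts().to_dict())
print(picked["high"]["Sex"].value_counts().to_dict())
```

If a group has fewer eligible females or males than requested, this sketch simply returns fewer rows; a real selection would need to decide how to handle that case.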

In [1]:
import pandas as pd
import os
import sys
sys.path.insert(0, '..')
from paths import *

In [2]:
CLINICAL_DATA_PATH = '../../wetlab/data/patient_selection/iMSMS_clinical_subset_20240105.xlsx'
SAVE_DATA_PATH = '../../wetlab/data/patient_selection/clinical_data_for_ARMSS_computation.csv'


In [3]:
clinical_data = pd.read_excel(CLINICAL_DATA_PATH, engine='openpyxl')


### Preparing data for computing the ARMSS score with the web service:
https://aliman.shinyapps.io/ARMSS/

#### The input data must meet the following specs:

#### 1. The file must be in CSV format (.csv).
#### 2. It must contain three variables named: ageatedss, dd and edss.
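
The specs above can be checked before uploading. The following is a minimal sketch, not part of the web service: `check_armss_input` is a hypothetical helper, and the 0 to 10 bound reflects the standard EDSS scale rather than any documented rule of the service.

```python
import pandas as pd

REQUIRED_COLUMNS = ["ageatedss", "dd", "edss"]


def check_armss_input(df):
    """Return a list of problems; an empty list means no problem was detected."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    for col in REQUIRED_COLUMNS:
        if df[col].isna().any():
            problems.append(f"{col} contains missing values")
    # EDSS is scored on a 0-10 scale, so anything outside is suspicious.
    if not df["edss"].dropna().between(0, 10).all():
        problems.append("edss outside 0-10")
    return problems


toy_input = pd.DataFrame({"ageatedss": [34, 51], "dd": [5.0, 12.5], "edss": [2.0, 6.5]})
print(check_armss_input(toy_input))
print(check_armss_input(toy_input.drop(columns="dd")))
```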

#### Step 1: Selecting only MS patients

In [6]:

# Take an explicit copy so later edits do not act on a slice of clinical_data
clinical_data_ms = clinical_data[clinical_data.Case_Control=='MS Participant'].copy()


#### Step 2: Removing patients without EDSS date

In [8]:
# Reassign instead of dropping in place, which avoids SettingWithCopyWarning
clinical_data_ms = clinical_data_ms.dropna(subset=['EDSS_Date'])


#### Step 3: Removing patients without EDSS score

In [9]:
# Reassign instead of dropping in place, which avoids SettingWithCopyWarning
clinical_data_ms = clinical_data_ms.dropna(subset=['EDSS'])


(1382, 73)
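
Steps 1 to 3 can be reproduced on toy data as a quick check of the filtering logic. This is a sketch only: the rows are made up, and the `'Control'` label is a placeholder for whatever non-MS values the real `Case_Control` column holds.

```python
import pandas as pd

toy_clinical = pd.DataFrame({
    "Case_Control": ["MS Participant", "Control", "MS Participant", "MS Participant"],
    "EDSS_Date": ["2019-03-01", None, None, "2020-07-15"],
    "EDSS": [2.5, None, 3.0, None],
})

# Explicit copy: the result is an independent frame, not a slice of toy_clinical.
ms_only = toy_clinical[toy_clinical["Case_Control"] == "MS Participant"].copy()
# Drop rows missing either field in one non-in-place call.
ms_only = ms_only.dropna(subset=["EDSS_Date", "EDSS"])
print(ms_only.shape)
```

Only the first row survives here, because the other two MS rows each lack either the EDSS date or the EDSS score.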

#### Step 4: Calculating Age at EDSS

In [86]:
# Age at EDSS is approximated as the EDSS year minus the year of birth (YOB)
edss_date = pd.to_datetime(clinical_data_ms['EDSS_Date'])
clinical_data_ms = clinical_data_ms.assign(
    EDSS_Date=edss_date,
    EDSS_Year=edss_date.dt.year,
    ageatedss=edss_date.dt.year - clinical_data_ms['YOB'],
)
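
Because only the year of birth (`YOB`) is used, `ageatedss` is a calendar-year difference and can overstate the true age by up to one year. A small sketch on made-up rows shows the computation; the dates and birth years are illustrative only.

```python
import pandas as pd

toy_dates = pd.DataFrame({
    "EDSS_Date": ["2019-06-15", "2021-01-03"],
    "YOB": [1980, 1995],
})

edss_date = pd.to_datetime(toy_dates["EDSS_Date"])
# assign() returns a new frame, so nothing is written into a slice.
toy_dates = toy_dates.assign(
    EDSS_Date=edss_date,
    ageatedss=edss_date.dt.year - toy_dates["YOB"],
)
print(toy_dates["ageatedss"].tolist())
```

The second patient is reported as 26 even though, assessed on 3 January 2021, they may still have been 25 if their birthday falls later in the year.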

#### Step 5: Renaming 'EDSS' column to 'edss'

In [87]:
clinical_data_ms = clinical_data_ms.rename(columns={'EDSS':'edss'})


#### Step 6: Renaming 'Disease Duration (years)' to 'dd'

In [88]:
clinical_data_ms = clinical_data_ms.rename(columns={'Disease Duration (years)':'dd'})



#### Step 7: Selecting only untreated patients using metabolomics data

In [10]:
sample = "serum"

filename = SHORT_CHAIN_FATTY_ACID_DATA_FILENAME
mapping_filename = "short_chain_fatty_acid_spoke_map.csv"
file_path = os.path.join(DATA_ROOT_PATH, filename)

metabolomics_data = pd.read_excel(file_path, engine='openpyxl')
metabolomics_data = metabolomics_data[metabolomics_data["Client Matrix"]==sample]
untreated_patient_id = metabolomics_data[(metabolomics_data.Treatment == 'Off')]['Client Sample ID'].unique()


clinical_data_ms = clinical_data_ms[clinical_data_ms['Record ID'].isin(untreated_patient_id)]


(233, 73)
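
The filter in Step 7 can be illustrated on toy data. This sketch assumes, as the code above does, that `Client Sample ID` values in the metabolomics table match `Record ID` values in the clinical table; the `'On'` treatment label and all rows are made up for the example.

```python
import pandas as pd

toy_metabolomics = pd.DataFrame({
    "Client Sample ID": ["P1", "P1", "P2", "P3"],
    "Client Matrix": ["serum", "feces", "serum", "serum"],
    "Treatment": ["Off", "Off", "On", "Off"],
})
toy_patients = pd.DataFrame({
    "Record ID": ["P1", "P2", "P3", "P4"],
    "edss": [2.0, 3.5, 6.0, 1.0],
})

# Restrict to one sample matrix, then collect the IDs of untreated patients.
serum_rows = toy_metabolomics[toy_metabolomics["Client Matrix"] == "serum"]
untreated_ids = serum_rows.loc[serum_rows["Treatment"] == "Off", "Client Sample ID"].unique()

# Keep only clinical rows whose ID appears among the untreated IDs.
untreated_patients = toy_patients[toy_patients["Record ID"].isin(untreated_ids)]
print(untreated_patients["Record ID"].tolist())
```

Note that a patient with no row in the metabolomics table (P4 here) is dropped as well, since `isin` only keeps IDs that were seen with `Treatment == 'Off'`.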

#### Step 8: Extracting relevant columns for ARMSS processing

In [90]:
clinical_data_ms_prepared = clinical_data_ms[['Record ID', 'ageatedss', 'dd', 'edss']]


#### Step 9: Dropping rows with any nan values for the selected columns

In [91]:
# Reassign instead of dropping in place, which avoids SettingWithCopyWarning
clinical_data_ms_prepared = clinical_data_ms_prepared.dropna()


#### Step 10: Save the data

In [92]:
clinical_data_ms_prepared.to_csv(SAVE_DATA_PATH, index=False, header=True)


In [93]:
n_patients = clinical_data_ms_prepared['Record ID'].nunique()
print(f'Total {n_patients} MS patients (untreated) are selected to compute ARMSS score')

Total 230 MS patients (untreated) are selected to compute ARMSS score


#### The output file from the ARMSS computation contains the following scores:
####    gARMSS: global ARMSS
####    ugMSSS: updated global MSSS
####    ogMSSS: original MSSS
####    lMSSS: local MSSS
####    lARMSS: local ARMSS

#### Ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5700773/pdf/10.1177_1352458517690618.pdf

Notes (from the reference above):
Creation of the global ARMSS matrix
A global ARMSS matrix was constructed using the cross-sectional data set. This matrix included the ARMSS scores obtained for EDSS scores recorded between ages of 18 and 75 years.
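
Given the 18 to 75 year range quoted above, it may be useful to flag patients whose `ageatedss` falls outside it before interpreting their global ARMSS. The sketch below is an assumption-laden illustration: `split_by_age_coverage` is a hypothetical helper, treating both bounds as inclusive is a guess, and how the web service actually handles out-of-range ages is not stated in the source.

```python
import pandas as pd

# Age range of the global ARMSS matrix, taken from the quoted reference.
GLOBAL_MATRIX_AGE_RANGE = (18, 75)


def split_by_age_coverage(df, age_col="ageatedss"):
    """Split rows into those inside and outside the global matrix age range."""
    low, high = GLOBAL_MATRIX_AGE_RANGE
    covered = df[age_col].between(low, high)  # inclusive on both ends (assumed)
    return df[covered], df[~covered]


toy_ages = pd.DataFrame({
    "Record ID": ["P1", "P2", "P3"],
    "ageatedss": [17, 42, 80],
})
inside, outside = split_by_age_coverage(toy_ages)
print(inside["Record ID"].tolist(), outside["Record ID"].tolist())
```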