# Overview

### Clinical Study
* In this study, we collect voice recordings of people with Parkinson's disease (PD) and other neurodegenerative diseases like ataxia (AX), as well as healthy controls (HC).
* The data collection is conducted using our app, **Vocapp**, which has two interfaces, for **users** (PD, HC), and for **samplers**.
* The conductors of the study are **samplers**. Samplers can register new users. When registering a PD participant, the sampler also assesses the participant's condition with clinical questionnaires such as MDS-UPDRS and MoCA, and documents everything in **Vocapp**.
* After the registration, PDs will record themselves once every month, before taking medication and one hour afterwards. 
* HC and participants with other diseases will record themselves once without any further assessment or requirements.

### Vocapp
* The app has two interfaces, one for users and one for samplers.
* When a user starts the recording exercises, the app starts a session. All the recordings and the self-report questionnaires are saved under this session. The filekey of a file uploaded by a **user** has the following path: `{username}/{session}/{filename}`.
* A sampler uploads questionnaire files for a given user, but not under any session. The filekey of a file uploaded by a **sampler** is `{username}/{filename}`. This includes registration, medications, MDS-UPDRS, etc.
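
The two filekey layouts above can be told apart by the number of path components. The following is a minimal sketch, not the app's actual parser: `ParsedFilekey` and `parse_filekey` are hypothetical names, and the example keys are made up.

```python
from typing import NamedTuple, Optional


class ParsedFilekey(NamedTuple):
    username: str
    session: Optional[str]  # None for sampler uploads, which have no session
    filename: str


def parse_filekey(filekey: str) -> ParsedFilekey:
    """Split a filekey into username, optional session, and filename."""
    parts = filekey.split("/")
    if len(parts) == 3:  # user upload: {username}/{session}/{filename}
        return ParsedFilekey(parts[0], parts[1], parts[2])
    if len(parts) == 2:  # sampler upload: {username}/{filename}
        return ParsedFilekey(parts[0], None, parts[1])
    raise ValueError(f"Unexpected filekey layout: {filekey!r}")


# Made-up keys, for illustration only.
print(parse_filekey("user123/session456/recording.wav"))
print(parse_filekey("user123/registration.csv"))
```

A missing session (`None`) is what later makes it necessary to match sampler files to a session.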

### Alerts
In order to collect the data effectively, we need an alerts system with the following alerts:
* During session: if a user stops recording in the middle of a session, i.e. has not uploaded a new file for 3 minutes, the app sends a WhatsApp message with a reminder, or with suggestions for how to operate the app in case it is stuck.
* Middle session: one hour after medication intake, send a reminder to record again.
* Monthly reminder: send a monthly reminder to perform a recording session, then repeat it weekly until the user has recorded.
* Feedback: we want to send feedback as a reward to PDs who have completed the full session (before + after medication). The feedback takes the recordings made before medication, extracts some voice qualities into a dictionary, plots this dictionary, and sends the plot as a WhatsApp message.
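
The timing rules for the first three alerts can be sketched as plain functions. This is only an illustration under stated assumptions: the function names are hypothetical, the date on which a session becomes due is assumed to be computed elsewhere and passed in, and sending the WhatsApp message is out of scope.

```python
from datetime import datetime, timedelta
from typing import Optional

STALL_TIMEOUT = timedelta(minutes=3)     # "during session" rule
POST_MED_DELAY = timedelta(hours=1)      # "middle session" rule
REMINDER_INTERVAL = timedelta(weeks=1)   # weekly repeats of the monthly reminder


def session_is_stalled(last_upload: datetime, now: datetime) -> bool:
    """True if no new file has arrived for at least 3 minutes."""
    return now - last_upload >= STALL_TIMEOUT


def post_medication_reminder_time(medication_time: datetime) -> datetime:
    """When to remind the user to record the second part of the session."""
    return medication_time + POST_MED_DELAY


def weekly_reminder_due(session_due: datetime,
                        last_reminder: Optional[datetime],
                        now: datetime) -> bool:
    """True if a reminder should go out now.

    `session_due` (assumed input) is when the next monthly session is due.
    The first reminder goes out once that date is reached; later ones
    follow at weekly intervals until the caller stops asking, i.e. until
    the user has recorded.
    """
    if now < session_due:
        return False
    if last_reminder is None:
        return True
    return now - last_reminder >= REMINDER_INTERVAL


t0 = datetime(2024, 1, 1, 12, 0)
print(session_is_stalled(t0, t0 + timedelta(minutes=4)))
print(post_medication_reminder_time(t0))
```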

### Database
* The database feeds the dashboard, which is used to track and manage the clinical study. We extract all the metadata in the following order:
    * List all the files from the bucket (AWS).
    * Extract the metadata from the filekey name **only** (`bucket.csv`).
    * Open the csv files and extract additional metadata from them, including registration data and questionnaires (e.g. `raw.csv`).
    * Files & Sessions:
        * Update needed attributes for filekeys and merge with samplers and phone numbers databases.
        * Match a session to every filekey (samplers' files do not originally have a session).
        * Resolve session issues (mostly by merging sessions).
        * Propagate metadata to all filekeys.
        * Save the full database (`all_files.csv`) and the sessions (`sessions.csv`).

### Dashboard
The dashboard uses the database to plot some figures:
* Distribution of attributes
* Patient registration over time
* Users per sampler
* Number of full/partial sessions vs. session number
* Questionnaire results
* Broken (partial) sessions
* Users that need to record again

# List all files from the bucket

- List all filekeys from the AWS bucket.
- Extract the metadata that is embedded in the file path **only**:
    - filekey
    - username
    - Entity: PD, HC, Sampler (SA)
    - time stamps
    - pattern: RECORDING, RECORDING1, FOG, UPDRS3, etc.
    - exercise: the exercise name for RECORDING patterns; for all other patterns, exercise = pattern.
    - timing: 
        - pre: first recording part of the session, before medication.
        - post: second recording part of the session, after medication.
        - healthy: session of a healthy participant; one part only.
    - onmed: whether the patient *actually* took the medication.
    - onoff: whether the patient feels the effect of the medication (ON) or it has already worn off (OFF).
- New filekeys are appended to the bucket csv file.

```python
def get_bucket(skip=True) -> None:
    if (not exists(Settings.BUCKET_CSV)) or (skip==False):
        bucket = pd.DataFrame(columns=Bucket.values())
        processed_filekeys = []
    else:
        bucket = pd.read_csv(Settings.BUCKET_CSV, dtype=str)
        processed_filekeys = bucket['filekey'].to_list()       
    
    filekeys = list_bucket()
    filekeys = [f for f in filekeys if f not in processed_filekeys]

    dfs = []
    for filekey in tqdm(filekeys, desc="Adding new files to database"):
        df = pd.DataFrame(columns=Bucket.values())
        df.loc[0, Bucket.FILEKEY] = filekey
        filename = filekey.split('/')[-1]
        for pattern in Patterns.values():
            if re.match(pattern.value, filename):
                username = filekey.split('/')[0]
                df.loc[0, Bucket.USERNAME] = username
                if username.startswith("hc_"):
                        df.loc[0, Bucket.ENTITY] = Entity.HC  # Ataxia will be resolved later
                elif len(username)==40:
                    df.loc[0, Bucket.ENTITY] = Entity.PD
                else:
                    df.loc[0, Bucket.ENTITY] = Entity.SA
                
                df.loc[0, Bucket.PATTERN] = pattern.name
                df.loc[0, Bucket.EXERCISE] = pattern.name.lower()
                if pattern.name not in ["REGISTRATION0", 'APKINSON']:
                    df.loc[0, Bucket.DATE] = extract_from_filename(filekey, 'date')
                    df.loc[0, Bucket.TIME] = extract_from_filename(filekey, 'time')
                    df.loc[0, Bucket.DATETIME] = extract_from_filename(filekey, 'datetime')

                if pattern.name!="UPDATE":
                    df.loc[0, Bucket.LANG] = extract_from_filename(filekey, 'language')
                
                if pattern.name in ['RECORDING', 'RECORDING1', 'FOG', 'SDQ', 'WOQ', 'UPDATE']:
                    df.loc[0, Bucket.SESSION] = filekey.split('/')[1]
                
                if pattern.name in ['UPDRS', 'UPDRS3', 'UPDRS124']:
                    df.loc[0, 'timing'] = extract_from_filename(filekey, 'timing')                    

                if pattern.name=='RECORDING':
                    df.loc[0, Bucket.EXERCISE] = extract_from_filename(filekey, 'exercise') # override
                    df.loc[0, Bucket.TIMING] = extract_from_filename(filekey, 'timing')
                    df.loc[0, Bucket.ONMED] = extract_from_filename(filekey, 'onmed')
                    df.loc[0, Bucket.ONOFF] = extract_from_filename(filekey, 'onoff')
                elif pattern.name=='RECORDING1':
                    df.loc[0, Bucket.EXERCISE] = extract_from_filename(filekey, 'exercise') # override
                    df.loc[0, Bucket.TIMING] = extract_from_filename(filekey, 'timing')
                    df.loc[0, Bucket.ONMED] = OnMed.ONMED if filekey.endswith("_on") else OnMed.NOTONMED
                    df.loc[0, Bucket.ONOFF] = FeelOnOff.UNKNOWN
                
                dfs.append(df)
                break
    if dfs:
        dfs = pd.concat(dfs, ignore_index=True)
        bucket = pd.concat([bucket, dfs], ignore_index=True)
        bucket.to_csv(Settings.BUCKET_CSV, index=False)
```
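
The entity rule buried in `get_bucket()` can be read in isolation. The sketch below mirrors that rule with plain strings standing in for the `Entity` enum; `classify_entity` is a hypothetical helper name, not part of the codebase.

```python
def classify_entity(username: str) -> str:
    """Infer the entity from the username alone, as get_bucket() does."""
    if username.startswith("hc_"):
        return "HC"  # healthy control; ataxia is resolved later in the pipeline
    if len(username) == 40:
        return "PD"  # PD usernames are 40 characters long
    return "SA"      # everything else is treated as a sampler


# Made-up usernames, for illustration only.
for name in ["hc_" + "a" * 40, "b" * 40, "sampler_dana"]:
    print(name[:12], "->", classify_entity(name))
```

The prefix check comes first, matching the order of the branches in `get_bucket()`.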

# Fetch users login credentials from the VM (EC2)

### Python code I run from Jupyter
We did not keep the credentials on the VM from the beginning, so I had to concatenate the VM data with other logs to get all the phone numbers. This is the purpose of `combine_yahav_ec2()`. In a later version, we can omit it.

```python
def users_data():
    print("\nDownloading credentials from ec2...")
    os_type = get_os()
    if os_type == "Linux" or os_type == "Darwin":
        run_shell_command("chmod +x src/download_users.sh")
        run_shell_command("./src/download_users.sh")
    elif os_type == "Windows":
        run_windows_shell_command("./src/download_users_windows.ps1")
    else:
        print(f"Unsupported OS: {os_type}")
        raise SystemExit
    
    combine_yahav_ec2()
    healthy_ec2()



def combine_yahav_ec2() -> None:
    pd_yahav = pd.read_csv(Settings.USERS_YAHAV_CSV, dtype=str, index_col=0)
    pd_yahav[ExtraCols.PASSWORD.value] = np.nan
    pd_ec2 = pd.read_csv(Settings.USERSPD, dtype=str)
    pd_ec2.columns = [ExtraCols.USER_PHONE.value, ExtraCols.PASSWORD.value]
    pd_ec2['username'] = pd_ec2[ExtraCols.USER_PHONE.value].apply(hash_phone_number)
    combo = pd.concat([pd_yahav, pd_ec2], ignore_index=True)

    phonesNpasswords = pd.read_csv('resources/passwords.csv', dtype=str)
    # Merge the dataframes on ExtraCols.USER_PHONE.value with an outer join to ensure all users are included
    combined_df = pd.merge(combo, phonesNpasswords, on=ExtraCols.USER_PHONE.value, how='outer', suffixes=('_users', '_passwords'))

    # Fill missing passwords in users_df with passwords from passwords_df
    combined_df[ExtraCols.PASSWORD.value] = combined_df['password_users'].combine_first(combined_df['password_passwords'])

    # Drop the now redundant columns
    combined_df.drop(columns=['password_users', 'password_passwords'], inplace=True)
    combined_df = combined_df.drop_duplicates(subset=ExtraCols.USER_PHONE.value, keep='last', ignore_index=True)
    combined_df.to_csv(Settings.USERS_EC2_CSV)



def healthy_ec2() -> None:
    hc_ec2 = pd.read_csv(Settings.USERSHC, dtype=str)
    hc_ec2.columns = [ExtraCols.USER_PHONE.value, ExtraCols.PASSWORD.value]
    hc_ec2[Bucket.USERNAME] = hc_ec2[ExtraCols.USER_PHONE.value].apply(hash_phone_number)
    hc_ec2[Bucket.USERNAME] = "hc_" + hc_ec2[Bucket.USERNAME]

    tocat = pd.read_csv(Settings.HC_PHONES_CSV, dtype=str, index_col=0)
    tocat[ExtraCols.PASSWORD.value] = np.nan
    hc_ec2 = pd.concat([hc_ec2, tocat], ignore_index=True)
    hc_ec2 = hc_ec2.drop_duplicates(subset=ExtraCols.USER_PHONE.value, keep='last', ignore_index=True)
    hc_ec2.to_csv(Settings.HC_EC2_CSV)
```
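
The password-filling step in `combine_yahav_ec2()` relies on an outer merge followed by `combine_first`. The toy example below shows that pattern on made-up data; the `phone` column name is an assumption standing in for `ExtraCols.USER_PHONE.value`.

```python
import numpy as np
import pandas as pd

# Users table: some passwords are missing.
users = pd.DataFrame({
    "phone": ["111", "222", "333"],
    "password": ["pw-a", np.nan, np.nan],
})
# Separate passwords log: covers one known user and one extra phone.
passwords = pd.DataFrame({
    "phone": ["222", "444"],
    "password": ["pw-b", "pw-d"],
})

# The outer join keeps every phone number from both tables.
merged = pd.merge(users, passwords, on="phone", how="outer",
                  suffixes=("_users", "_passwords"))

# Prefer the users-table password; fall back to the passwords log.
merged["password"] = merged["password_users"].combine_first(
    merged["password_passwords"]
)
merged = merged.drop(columns=["password_users", "password_passwords"])
print(merged)
```

A phone number that appears in neither source with a password (here `333`) keeps a missing value, which is worth checking before sending reminders.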

### shell script to access the VM

```shell
#!/bin/bash

# Define variables
PEM_FILE="../build-key.pem"
EC2_USER="ec2-user"
EC2_HOST="ec2-3-83-206-91.compute-1.amazonaws.com"
CONTAINER_NAME="vocabe"
REMOTE_FILE_PATH1="/data/.usershc.csv"
EC2_LOCAL_PATH1="/home/ec2-user/.usershc.csv"
LOCAL_FILE_PATH1="resources/usershc.csv"
REMOTE_FILE_PATH2="/data/.userspd.csv"
EC2_LOCAL_PATH2="/home/ec2-user/.userspd.csv"
LOCAL_FILE_PATH2="resources/userspd.csv"

# Log into AWS EC2 and copy the file from the Docker container to the EC2 instance's home directory
ssh -i $PEM_FILE $EC2_USER@$EC2_HOST << EOF
  sudo docker cp $CONTAINER_NAME:$REMOTE_FILE_PATH1 $EC2_LOCAL_PATH1
  sudo docker cp $CONTAINER_NAME:$REMOTE_FILE_PATH2 $EC2_LOCAL_PATH2
EOF

# Check if the SSH command was successful
if [ $? -eq 0 ]; then
  echo "File copied from container to EC2 instance successfully."
else
  echo "Failed to copy file from container to EC2 instance."
  exit 1
fi

# Download both files from the EC2 instance to the local machine.
# The commands are chained with && so the status check below covers both transfers.
scp -i "$PEM_FILE" "$EC2_USER@$EC2_HOST:$EC2_LOCAL_PATH1" "$LOCAL_FILE_PATH1" &&
scp -i "$PEM_FILE" "$EC2_USER@$EC2_HOST:$EC2_LOCAL_PATH2" "$LOCAL_FILE_PATH2"

# Check if the SCP command was successful
if [ $? -eq 0 ]; then
  echo "File downloaded to local machine successfully."
else
  echo "Failed to download file to local machine."
  exit 1
fi
```

# Extract raw data
Open the csv files to extract:
- Participants at registration
- Data updates
- Answers to questionnaires
- Medications list

```python
def get_raw_data(skip=True, print_filekey=False):
    pd.set_option('future.no_silent_downcasting', True)

    def check_exist_and_return(filepath: str, skip=skip) -> pd.DataFrame:
        if (not exists(filepath)) or (skip==False):
            return pd.DataFrame()
        else:
            return pd.read_csv(filepath, dtype=str)
    
    def put_filekey_first(df: pd.DataFrame) -> pd.DataFrame:
        if 'filekey' in df.columns:
            cols = df.columns.tolist()
            cols.remove('filekey')
            df = df[['filekey'] + cols]
        else:
            print("The DataFrame does not contain a 'filekey' column.")
        return df

    bucket = pd.read_csv(Settings.BUCKET_CSV, dtype=str)
    raw = pd.read_csv(Settings.RAW_CSV, dtype=str) if exists(Settings.RAW_CSV) else pd.DataFrame()
    if exists(Settings.RAW_CSV):
        new_filekeys = bucket.loc[~bucket[Bucket.FILEKEY].isin(raw[Bucket.FILEKEY])].copy()
    else:
        new_filekeys = bucket.copy()
    

    updrs = check_exist_and_return(Settings.UPDRS_CSV)
    moca = check_exist_and_return(Settings.MOCA_CSV)
    pdq8 = check_exist_and_return(Settings.PDQ8_CSV)
    fog = check_exist_and_return(Settings.FOG_CSV)
    sdq = check_exist_and_return(Settings.SDQ_CSV)
    woq = check_exist_and_return(Settings.WOQ_CSV)
    registration = check_exist_and_return(Settings.REGISTRATION_CSV)
    update = check_exist_and_return(Settings.UPDATE_CSV)
    medications = check_exist_and_return(Settings.MEDICATION_CSV)

    for ii,row in tqdm(new_filekeys.iterrows(), desc="Extracting raw data", total=len(new_filekeys)):
        filekey = row[Bucket.FILEKEY]
        pattern = row[Bucket.PATTERN]
        if print_filekey:
            print(filekey)
        if pattern in ["UPDRS", "UPDRS3", "UPDRS124"]:
            if Qnnrs.UPDRS1.value not in row or pd.isna(row[Qnnrs.UPDRS1.value]) or row[Qnnrs.UPDRS1.value]=='':
                df = download_csv_to_df(filekey)
                new_filekeys.loc[ii, Qnnrs.UPDRS1] = df.loc[0, UPDRS.updrs1.value].astype(int).sum()
                new_filekeys.loc[ii, Qnnrs.UPDRS2] = df.loc[0, UPDRS.updrs2.value].astype(int).sum()
                new_filekeys.loc[ii, Qnnrs.UPDRS3] = df.loc[0, UPDRS.updrs3.value].astype(int).sum()
                new_filekeys.loc[ii, Qnnrs.UPDRS4] = df.loc[0, UPDRS.updrs4.value].astype(int).sum()
                new_filekeys.loc[ii, Qnnrs.HY] = df.loc[0, UPDRS.hy.value]
                new_filekeys.loc[ii, Bucket.SAMPLER] = df.loc[0, Bucket.SAMPLER] if Bucket.SAMPLER in df else pd.NaT
                if updrs.empty or (filekey not in updrs[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    updrs = pd.concat([updrs, df], ignore_index=True)
        
        if pattern=="MOCA":
            if Qnnrs.MOCA.value not in row or pd.isna(row[Qnnrs.MOCA.value]) or row[Qnnrs.MOCA.value]=='':
                df = download_csv_to_df(filekey)
                df = df.replace({"True": 1, "False": 0})
                new_filekeys.loc[ii, Qnnrs.MOCA] = df.loc[0, MoCA.moca.value].astype(int).sum()
                new_filekeys.loc[ii, Bucket.SAMPLER] = df.loc[0, Bucket.SAMPLER] if Bucket.SAMPLER in df else pd.NaT
                if moca.empty or (filekey not in moca[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    moca = pd.concat([moca, df], ignore_index=True)
        
        if pattern=="PDQ8":
            if Qnnrs.PDQ8.value not in row or pd.isna(row[Qnnrs.PDQ8.value]) or row[Qnnrs.PDQ8.value]=='':
                df = download_csv_to_df(filekey)
                new_filekeys.loc[ii, Qnnrs.PDQ8] = df.loc[0, PDQ8.pdq8.value].astype(int).sum()
                new_filekeys.loc[ii, Bucket.SAMPLER] = df.loc[0, Bucket.SAMPLER] if Bucket.SAMPLER in df else pd.NaT
                if pdq8.empty or (filekey not in pdq8[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    pdq8 = pd.concat([pdq8, df], ignore_index=True)
        
        if pattern=="FOG":
            if Qnnrs.FOG.value not in row or pd.isna(row[Qnnrs.FOG.value]) or row[Qnnrs.FOG.value]=='':
                df = download_csv_to_df(filekey)
                new_filekeys.loc[ii, Qnnrs.FOG] = df.loc[0, FOG.fog.value].astype(int).sum()
                if fog.empty or (filekey not in fog[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    fog = pd.concat([fog, df], ignore_index=True)
        
        if pattern=="SDQ":
            if Qnnrs.SDQ.value not in row or pd.isna(row[Qnnrs.SDQ.value]) or row[Qnnrs.SDQ.value]=='':
                df = download_csv_to_df(filekey)
                df = df.replace({"True": 1, "False": 0})
                score = df.loc[0, SDQ.sdq.value[:-1]].astype(int).sum()
                # The replace above already mapped "True" to 1, so compare with 1 rather than "True"
                respiratory = 2.5 if df.loc[0, SDQ.sdq.value[-1]]==1 else 0.5
                score += respiratory
                new_filekeys.loc[ii, Qnnrs.SDQ] = score
                if sdq.empty or (filekey not in sdq[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    sdq = pd.concat([sdq, df], ignore_index=True)
        
        if pattern=="WOQ":
            if Qnnrs.WOQ_PRE.value not in row or pd.isna(row[Qnnrs.WOQ_PRE.value]) or row[Qnnrs.WOQ_PRE.value]=='':
                df = download_csv_to_df(filekey)
                df = df.replace({"True": 1, "False": 0})
                try:
                    new_filekeys.loc[ii, Qnnrs.WOQ_PRE] = df.loc[0, WOQ.pre.value].astype(int).sum()
                    new_filekeys.loc[ii, Qnnrs.WOQ_POST] = df.loc[0, WOQ.pre.value].astype(int).sum() - df.loc[0, WOQ.post.value].astype(int).sum()
                except Exception:  # e.g. missing WOQ columns or non-numeric answers
                    new_filekeys.loc[ii, Qnnrs.WOQ_PRE] = -1
                    new_filekeys.loc[ii, Qnnrs.WOQ_POST] = -1
                if woq.empty or (filekey not in woq[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    woq = pd.concat([woq, df], ignore_index=True)

        if pattern=="REGISTRATION":
            if Registration.BIRTHDATE.value not in row or pd.isna(row[Registration.BIRTHDATE.value]) or row[Registration.BIRTHDATE.value]=='':
                df = download_csv_to_df(filekey)
                for col in Registration.values():
                    if col in df:
                        new_filekeys.loc[ii, col] = str(df.loc[0, col])
                if registration.empty or (filekey not in registration[Bucket.FILEKEY].tolist()):
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    registration = pd.concat([registration, df], ignore_index=True)
    
        if pattern=="UPDATE":
            if Update.IS_DBS.value not in row or pd.isna(row[Update.IS_DBS.value]) or row[Update.IS_DBS.value]=='':
                df = download_csv_to_df(filekey)
                for col in Update.values():
                    if col in df:
                        new_filekeys.loc[ii, col] = str(df.loc[0, col])
                if update.empty or (filekey not in update[Bucket.FILEKEY].tolist()):
                    # df was already downloaded above, so it is not fetched again here
                    df[Bucket.FILEKEY] = filekey
                    df = put_filekey_first(df)
                    update = pd.concat([update, df], ignore_index=True)

        if pattern=="MEDICATIONS":
            if medications.empty or (filekey not in medications[Bucket.FILEKEY].tolist()):
                df = download_csv_to_df(filekey)
                df[Bucket.FILEKEY] = filekey
                df = put_filekey_first(df)
                medications = pd.concat([medications, df.astype(str)], ignore_index=True)

        
    updrs.to_csv(Settings.UPDRS_CSV, index=False)
    moca.to_csv(Settings.MOCA_CSV, index=False)
    pdq8.to_csv(Settings.PDQ8_CSV, index=False)
    fog.to_csv(Settings.FOG_CSV, index=False)
    sdq.to_csv(Settings.SDQ_CSV, index=False)
    woq.to_csv(Settings.WOQ_CSV, index=False)
    update.to_csv(Settings.UPDATE_CSV, index=False)
    registration.to_csv(Settings.REGISTRATION_CSV, index=False)
    medications.to_csv(Settings.MEDICATION_CSV, index=False)

    raw = pd.concat([raw, new_filekeys], ignore_index=True)
    raw = raw.sort_values(by=['date', 'username', 'time'], ascending=False, ignore_index=True)
    raw.to_csv(Settings.RAW_CSV, index=False)
```
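
The SDQ and WOQ scoring rules used in `get_raw_data()` are easier to check when separated from the DataFrame handling. This is a minimal sketch with hypothetical function names; the inputs are assumed to be already converted to numbers or booleans.

```python
from typing import Sequence, Tuple


def sdq_score(items: Sequence[int], last_item_yes: bool) -> float:
    """Sum all items but the last, then add 2.5 for a 'yes' on the
    last (yes/no) item or 0.5 for a 'no'."""
    return sum(items) + (2.5 if last_item_yes else 0.5)


def woq_scores(pre: Sequence[bool], post: Sequence[bool]) -> Tuple[int, int]:
    """Return (pre total, pre total minus post total), matching the
    WOQ_PRE and WOQ_POST columns."""
    pre_total = sum(pre)
    return pre_total, pre_total - sum(post)


# Made-up answers, for illustration only.
print(sdq_score([1, 2, 0], last_item_yes=True))
print(woq_scores([True, True, False], [True, False, False]))
```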

# "Get all files"

* `change_columns()`: update values of attributes (columns) for specific filekeys.
* `resolve()`: remove unneeded filekeys, update attributes for old patterns or filenames.
* `add_sampler_phone()`: merge the sampler's name (from the registration csv) with the samplers' phones csv.
* `add_patient_phone()`: use the data from the VM to merge the phone numbers of the patients (users).
* `add_session_to_all()`: find the closest session (in time) for the sampler's files.
* `resolve_sessions()`: resolve session issues, mostly by merging sessions.
* `propagate_values()`: propagate the data from the csv files to all filekeys.
* Create `all_files.csv`, `sessions.csv`, and `all_users.csv`.
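
The "closest session in time" idea behind `add_session_to_all()` can be sketched for a single user as follows. The real function is not shown here and may work differently; `closest_session` and the session names below are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional


def closest_session(file_time: datetime,
                    session_starts: Dict[str, datetime]) -> Optional[str]:
    """Return the session whose start time is nearest to file_time.

    `session_starts` maps session id -> start time for ONE user.
    Returns None when the user has no sessions at all.
    """
    if not session_starts:
        return None
    return min(session_starts,
               key=lambda s: abs(session_starts[s] - file_time))


# Made-up sessions of one user, for illustration only.
sessions = {
    "session_jan": datetime(2024, 1, 1, 9, 0),
    "session_feb": datetime(2024, 2, 1, 9, 0),
}
print(closest_session(datetime(2024, 1, 3, 14, 0), sessions))
```

Because the nearest session wins regardless of direction, a sampler file uploaded shortly before a session is matched to that session rather than to an older one.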


```python
def get_all_files():
    print("Arranging data ...", end=' ')
    df = pd.read_csv(Settings.RAW_CSV, dtype=str)
    df = change_columns(df)
    df = resolve(df)

    df = add_sampler_phone(df)
    df = add_patient_phone(df)
    df = add_caregiver_phone(df)

    df['datetime'] = pd.to_datetime(df['datetime'], format=Settings.DATETIME, errors='coerce')
    df = add_session_to_all(df)

    real_sessions = df.copy()
    real_sessions = add_session_number(real_sessions)
    real_sessions = add_sampler_to_HC(real_sessions)

    df = resolve_sessions(df)
    df = add_session_number(df)
    df = remove_qnnrs_duplicates(df)
    df = add_sampler_to_HC(df)
    print("Done!")

    df = propagate_values(df)
    df = add_age(df)
    df = add_updrs_columns(df)
    df = df.sort_values(by=['date', 'username', 'time'], ascending=False)
    df.to_csv(Settings.ALL_FILES, index=False)
    
    sessions = get_sessions(df)
    sessions.to_csv(Settings.SESSIONS, index=False)
    real_sessions = get_sessions(real_sessions)
    real_sessions.to_csv(Settings.REAL_SESSIONS, index=False)

    all_users = get_all_users(df)
    all_users.to_csv(Settings.ALL_USERS, index=False)
```

# Plots

### Pie plots
![](pies.png)

### Users over time
![](overtime.png)

### Users per sampler
![How many users (PD/HC) did each sampler sample?](userpersampler.png)

### Number of sessions vs. session number
![](sessions.png)

### Questionnaires results
![](qnnrs.png)

### Incomplete sessions (broken sessions)
![](broken.png)

### Users that need to record their next session
![](password.png)