# EEG EDA

#### Dataset:

So there are ~23k eeg recordings with readme files on the web avaliable for thoes with access to it.
Each sample got its readme. Session differs in every detail so it's crucial to pick most standarized, or somehow define standard criterium, also filter out sessions with such conditions as autism, epilepsia etc.

#### Goal:

Select N samples to fit as an controll group for eeg research.

#### Main problems:

Two differend readme file formats with even more inner differences.

#### Main prefferences:

1. International 10-20 System
2. Normal EEG.
3. State: awake.
4. Medicals: none.
5. Avoid persons with conditions disturbing brainwaves (for now, as side controll groups with such might help):
    - Narcolepsy
    - Anxiety
    - Obsessive–compulsive disorder
    - Attention deficit hyperactivity disorder (ADHD) and its three subtypes.
    - Juvenile chronic arthritis
    - Postural orthostatic tachycardia syndrome (PoTS)
    - Ehlers-Danlos Syndrome
    - Temporal low-voltage irregular delta wave
    - Parasomnias
    - Parkinson's
    - Diabetes
    - Insulin
    - Alcoholism
    - Drugs
    - Sleep deprivation
    - Dementia
    - wieloletnie doświadczenie w medytacji


# Plan:

1. Differ samples by readme structure
2. Present new structure to unify informations. Focus on useful informations.
3. Parse data into new unified structure.
4. Pick what we need into a new dataframe with key features.
5. Pick samples on specified criteria as an controll group.

# Part 1:

Differ samples by readme data structure.

In [248]:
import pandas as pd

# DATA
df = pd.read_csv('edf_reports.csv') #, index=None)

# 
def repport_type(text):
    
    text = str(text).strip()
    
    first = text.split()
    if first:
        first = first[0]
    else:
        first = ''
        
    if first.isupper():
        return 'standard_1'
    elif first.title():
        return 'standard_2'
    else:        
        return 'unclassified'

df['readme_type'] = df.readme.apply(repport_type)

s1 = df[df['readme_type'] == 'standard_1']
s2 = df[df['readme_type'] == 'standard_2']
print('standard_1', len(s1))
print('standard_2', len(s2))
print('unclassified', len(df) - (len(s1) + len(s2)))


standard_1 18611
standard_2 4399
unclassified 1


# Part 2:

Present new structure to unify informations. Focus on useful informations.

In [None]:
data_format = {
    'clinical_history' : '',
    'age' : '',
    'sex' : '',
    'medications' : '',
    'sedation' : '',
    'eeg_type' : '',
    'technique' : '',
    'study_details' : '',
    'impression' : '',
    'correlations' : '',
    'others' : '',
}

# Part 3:

Parse data into new unified structure.

In [249]:
def clean_medication_s1(text):
    text = text[12:].strip()
    if text != 'None' and text != None:
        return text
    else:
        return ''  
    
def clean_medication_s2(text):
    text = text[13:].strip()
    if text != 'None' and text != None:
        return text
    else:
        return ''  
    
def add_text_to(obj_data, category, line):
    if not obj_data[category]:
        obj_data[category] = line
    else:
        obj_data[category] += ' ' + line
    return obj_data, category

def get_data_out_of_readme_standard_2(readme):
    readme = str(readme)
    obj_data = data_format.copy()
    last_structure = None
    for i, line in enumerate(readme.split('\n')):
        
        line = line.strip()

        if line.startswith('History'):
            obj_data, last_structure = add_text_to(obj_data, 'clinical_history', line)
        elif line.startswith('Medications'):
            obj_data, last_structure = add_text_to(obj_data, 'medications', line)
            obj_data[last_structure] = clean_medication_s2(obj_data[last_structure])
        elif line.startswith('Sedation'):
            obj_data, last_structure = add_text_to(obj_data, 'sedation', line)
        elif line.startswith('EEG Type'):
            obj_data, last_structure = add_text_to(obj_data, 'eeg_type', line)
        elif line.startswith('Technique'):
            obj_data, last_structure = add_text_to(obj_data, 'technique', line)
        elif line.startswith('Description'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('Interpretation'):
            obj_data, last_structure = add_text_to(obj_data, 'impression', line)
        elif line.startswith('Summary of Findings'):
            obj_data, last_structure = add_text_to(obj_data, 'correlations', line)
        else:
            if last_structure is not None:
                obj_data[last_structure] += ' ' + line
                obj_data[last_structure] = obj_data[last_structure].strip()
            else:
                obj_data['others'] += ' ' + line
    return obj_data  

def get_data_out_of_readme_standard_1(readme):
    readme = str(readme)
    obj_data = data_format.copy()
    last_structure = None
    for i, line in enumerate(readme.split('\n')):

        line = line.strip()
        
        # MAIN FEATURE FIELDS.
        # --------------------

        # HISTORY [might contain age, sex]
        if line.startswith('CLINICAL HISTORY'):
            obj_data, last_structure = add_text_to(obj_data, 'clinical_history', line)
        elif line.startswith('HISTORY'):
            obj_data, last_structure = add_text_to(obj_data, 'clinical_history', line)

        # MEDS
        elif line.startswith('MEDICATIONS'):
            obj_data, last_structure = add_text_to(obj_data, 'medications', line)
            obj_data[last_structure] = clean_medication_s1(obj_data[last_structure])
            
        # AGE
        elif line.startswith('AGE'):
            obj_data, last_structure = add_text_to(obj_data, 'age', line)

        # SEX
        elif line.startswith('SEX'):
            obj_data, last_structure = add_text_to(obj_data, 'sex', line)

        # TECH
        elif line.startswith('TECHNIQUE'):
            obj_data, last_structure = add_text_to(obj_data, 'technique', line)
        elif line.startswith('EEG TYPE'):
            obj_data, last_structure = add_text_to(obj_data, 'eeg_type', line)   

        # --------------------------------------------------

        # STUDY DETAILS
        elif line.startswith('INTRODUCTION'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)  
        elif line.startswith('REASON FOR STUDY'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('BACKGROUND'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('GENERALIZED SLOWING'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('SPORADIC EPILEPTIFORM ACTIVITY'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('PAROXYSMAL ACTIVITY'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('ABNORMAL DISCHARGES'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('TECHNICAL DIFFICULTIES'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('EVENTS'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('RECORDING ENVIRONMENT'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('DESCRIPTION OF THE RECORD'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('SEDATION'): # Meds sedation. Not important.
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        # DATES
        elif line.startswith('START DATE OF STUDY'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('END DATE OF STUDY'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)
        elif line.startswith('DURATION OF STUDY'):
            obj_data, last_structure = add_text_to(obj_data, 'study_details', line)

        # IMPRESSIONS
        elif line.startswith('IMPRESSION'):
            obj_data, last_structure = add_text_to(obj_data, 'impression', line)
        elif line.startswith('DAY 1 IMPRESSION'):
            obj_data, last_structure = add_text_to(obj_data, 'impression', line)
        elif line.startswith('DAY 2 IMPRESSION'):
            obj_data, last_structure = add_text_to(obj_data, 'impression', line)
        elif line.startswith('FINAL IMPRESSION '):
            obj_data, last_structure = add_text_to(obj_data, 'impression', line)

        # CORELLATIONS
        elif line.startswith('CLINICAL CORRELATION:'):
            obj_data, last_structure = add_text_to(obj_data, 'correlations', line)
        elif line.startswith('FINAL CLINICAL CORRELATION'):
            obj_data, last_structure = add_text_to(obj_data, 'correlations', line)

        else:
            if last_structure is not None:
                obj_data[last_structure] += ' ' + line
                obj_data[last_structure] = obj_data[last_structure].strip()
            else:
                obj_data['others'] += ' ' + line
                
    return obj_data


# MAIN

for i, readme in enumerate(df.readme):
    readme_type = df.loc[i, 'readme_type']

    if readme_type == 'standard_1':
        obj_data = get_data_out_of_readme_standard_1(readme)
    elif readme_type == 'standard_2':
        obj_data = get_data_out_of_readme_standard_2(readme)

    for k, v in obj_data.items():
        df.loc[i, k] = v
        
df.to_csv("edf_reports_data_sorted.csv", index=False)

# Part 4:

Pick what we need into a new dataframe with key features.

In [755]:
import nltk
from nltk.stem import WordNetLemmatizer
WNL = WordNetLemmatizer()

df2 = pd.read_csv("edf_reports_data_sorted.csv")

def clean_sedation(text):
    text = str(text)
    text = text.strip()
    if 'Sedation:' in text:
        text = text[9:]
    if '\t' in text:
        text = text[2:]
    if 'None' in text:
        text = ''
    return text

def clean_eeg_type(text):
    text = str(text)
    if text.startswith('EEG Type') or text.startswith('EEG TYPE'):
        text = text[8:]
    if '\t' in text:
        text = text[2:]
    text = text.strip()
    if text.startswith(':'):
        text = text[2:]
    
    return text
    
def get_person_state(text):
    text = str(text)
    defined_items = ['routine', 'coma', 'awake', 'drowsy', 'asleep', 'neonatal', 'portable']
    text_alpha = ' '.join([''.join([c for c in w.lower() if c.isalpha()]) for w in text.split()])
    items = ', '.join([w for w in text_alpha.split() if w in defined_items])
    return items

def eeg_type_time(text):
    text = str(text)
    defined_items = ['routine', 'coma', 'awake', 'drowsy', 'asleep', 'neonatal', 'portable']
    text_alpha = ' '.join([''.join([c for c in w.lower() if c.isalpha()]) for w in text.split()])
    items = ', '.join([w for w in text_alpha.split() if w in defined_items])
    return items    

def norm_abnorm(text):
    text = str(text)
    text = text.upper()

    if 'NORMAL EEG' in text:
        return 'Normal'
    elif 'ABNORMAL' in text:
        return 'Abnormal'
    else:
        return ''
    
def get_system(text):
    text = str(text)
    if '10/20' in text or '10-20' in text:
        return '10/20'
    else:
        return ''
    
def is_awake(text):
    text = str(text).lower()
    if 'awake' in text:
        return 1
    else:
        return 0
    
def is_not_asleep(text):
    text = str(text).lower()
    asleep = ['coma', 'asleep', 'neonatal']
    for w in text.split(', '):
        if w in asleep:
            return 0
    return 1
    
def clean_meds_again(text):
    if isinstance(text, float):
        text = ''
    if text == None:
        text = ''
    text = str(text)
    text = text.strip()
    text = text.lower()
    if text.startswith('medicatio'):
        text = text[12:].strip()
    if text == 'NaN' or text == 'none':
        text = ''
    return text

def get_age(text):
    words = str(text).lower().strip().split()
    age = ''
    for i, w in enumerate(words):
        
        if age != '':
            break #continue
            
        if '-year-old' in w:
            age = w.split('-')[0]
        elif 'yr' in w and w[:2].isdigit():
            age = w[:2]
        elif 'yo' in w and w[:2].isdigit():
            age = w[:2]
        elif 'year-old' in w:
            age = words[i-1]
        elif w == 'year':
            age = words[i-1]
        elif w.startswith('y.o'):
            age = words[i-1]
        elif w.startswith('.o'):
            age = words[i-2]
        elif w.startswith('yo'):
            age = words[i-1]
        elif w.startswith('yr'):
            age = words[i-1]
        elif w.startswith('y/o'):
            age = words[i-1]
        elif w.startswith('y/o'):
            age = words[i-1]
        elif w == 'y' and words[i+1] == 'o':
            age = words[i-1]
        elif w == 'y' and words[i+1] == '0':
            age = words[i-1]
        elif w == 'r' and words[i+1] == 'old':
            age = words[i-1]
        elif 'y/o' in w:
            age = w[:2]
        elif 'yo' in w and 'eeg' in w:
            age = w[3:5]
        elif 'y.o' in w and len(w) == 5:
            age = w[:2]  
        elif '-year' in w:
            age = w[:2]
        elif 'years' == w:
            age = words[i-1]
    
    if age != None:       
        if age.isdigit():
            age = age
        else:
            age2 = ''.join([w for w in age if w.isdigit()])
            if len(age2) == 0:
                age = ''
            elif len(age) > 2 and '.' in age or '-' in age or '&' in age:
                age = ''
            else:
                age = age2
    else:
        age = ''
        
    return age

def get_sex(text):
    words = str(text).lower().strip().split()
    sex = ''
    for i, w in enumerate(words):
        w = ''.join([c for c in w if c.isalpha()])
        if sex:
            return sex
        if w == 'lady':
            sex = 'F'
        elif w == 'gentleman':
            sex = 'M'
        elif w == 'woman':
            sex = 'F'
        elif w == 'man':
            sex = 'M'
        elif w == 'male':
            sex = 'M'
        elif w == 'female':
            sex = 'F'
        elif w == 'girl':
            sex = 'F'
        elif w == 'boy':
            sex = 'M'
    return 'Unknown'

# Some conditions may or may not influence brainwaves, let's mark samples with a such possibility.
# Narcolepsy
# Anxiety
# Obsessive–compulsive disorder
# Attention deficit hyperactivity disorder (ADHD) and its three subtypes.[31]
# Juvenile chronic arthritis[32]
# Postural orthostatic tachycardia syndrome (PoTS) [33]
# Ehlers-Danlos Syndrome[34]
# Temporal low-voltage irregular delta wave
# Parasomnias
# Parkinson's
# Diabetes
# Insulin
# Alcoholism
# Sleep deprivation
# dementia
# objawy odstawienne 
# wieloletnie doswiadczenie w medytacji

def lemmatize_words(text, WNL):
    return ' '.join([WNL.lemmatize(word, pos='v') for word in text.split()])
def get_condition_flags(text): # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5804435/
    text = str(text)
    text = ''.join([c for c in text if c.isspace() or c.isalpha()])
    conditions = ['autism', 'epilepsy', 'depression','mania','adhd', 'dementia', 'schizophrenia', 'alcoholism', 'diabetes','parasomnias', 'parkinson', 'narcolepsy', 'anexiety', 'obsessive', 'compulsive', 'arthritis', 'tachycardia', 'ehlers']
    text = lemmatize_words(text, WNL)
    condition_flags = ''
    for w in text.split():
        w = w.lower()
        if w in conditions:
            condition_flags += ' ' + w
            condition_flags = condition_flags.strip()            
    return condition_flags

df2['sex'] = df2.clinical_history.apply(get_sex)
df2['age'] = df2.clinical_history.apply(get_age)
df2['system'] = df2.study_details.apply(get_system)
df2['eeg_type'] = df2.eeg_type.apply(clean_eeg_type)
df2['person_state'] = df2.eeg_type.apply(get_person_state)
df2['is_awake'] = df2.person_state.apply(is_awake)
df2['not_asleep'] = df2.person_state.apply(is_not_asleep)

df2['medications'] = df2.medications.apply(clean_meds_again)
df2['eeg_norm_abnorm'] = df2.readme.apply(norm_abnorm)
df2['sensations'] = df2.sedation.apply(clean_sedation)
df2['sedation'] = df2.sedation.apply(clean_sedation)
df2['cond_flags'] = df2.readme.apply(get_condition_flags)

# Part 5:

Pick samples on specified criteria as an controll group.

In [756]:
normal_eeg = len(df2[df2['eeg_norm_abnorm'] == 'Normal'])
not_asleep = len(df2[(df2['not_asleep'] == 1)])
awake = len(df2[df2['is_awake'] == 1])
no_meds = len(df2[(df2['medications'] == '')])
system = len(df2[(df2['system'] == '')])
conditions = len(df2[df2['cond_flags'] == ''])

print('N of samples fulfilling key criteriums:')
print('Normal eeg: {}'.format(normal_eeg))
print('Not asleep: {}'.format(not_asleep))
print('No medicals: {}'.format(no_meds))
print('No special conditions: {}'.format(flags))
print('Standard system: {}'.format(system))

N of samples fulfilling key criteriums:
Normal eeg: 15268
Not asleep: 22340
No medicals: 2546
No special conditions: 14214
Standard system: 5843


In [757]:
cond1 = (df2['system'] == '10/20') # Standard system
cond2 = (df2['eeg_norm_abnorm'] == 'Normal') # Normal eeg
cond3 = (df2['not_asleep'] == 1) # Not asleep
cond4 = (df2['medications'] == '') # No meds
cond5 = (df2['cond_flags'] == '') # No special conditions as autism or epilepsia


picked_samples_A = df2[cond1 & cond2 & cond3 & cond4]
picked_samples_B = df2[cond1 & cond2 & cond3 & cond4 & cond5]

def minimize_data_frame(df):
    df = df.loc[:,['url', 'readme', 'age', 'sex', 'medications', 'system', 'eeg_norm_abnorm', 'not_asleep', 'cond_flags']]
    return df

# SAVE
picked_samples_A.to_csv('initial_controll_group_A_962_ALL.csv', index=False,encoding='utf-8' )
picked_samples_B.to_csv('initial_controll_group_B_467_ALL.csv', index=False,encoding='utf-8' )
minimize_data_frame(picked_samples_A).to_csv('initial_controll_group_A_962.csv', index=False,encoding='utf-8' )
minimize_data_frame(picked_samples_B).to_csv('initial_controll_group_B_467.csv', index=False,encoding='utf-8' )


print('Samples fulfilling most criteria: {}'.format(len(picked_samples_A)))
print('Samples fulfilling all criteria: {}'.format(len(picked_samples_B)))

# NOTES:
# > THERE ARE ABOUT 3-5% DUPLICATES OR SESSIONS WITH THE SAME PERSON. NOT REMOVING.
# > THERE MIGHT BE MUCH MORE CONDITIONS TO EXCLUDE PERSON FROM CONTROLL GROUP, NOT FILTRED.
# > THERE MIGHT BE MUCH MORE ON TECHNICAL DETAILS / METHOD TO AVOID LATER PROBLEMS WITH DATA INTEGRATION.

Samples fulfilling most criteria: 962
Samples fulfilling all criteria: 408


In [753]:
show_n = 0
minimize_data_frame(picked_samples_A).head(show_n)

Unnamed: 0,url,readme,age,sex,medications,system,eeg_norm_abnorm,not_asleep,cond_flags


In [754]:
show_n = 0
minimize_data_frame(picked_samples_B).head(show_n)

Unnamed: 0,url,readme,age,sex,medications,system,eeg_norm_abnorm,not_asleep,cond_flags
