# Project Description

**What**: Create a machine/deep learning model that successfully predicts whether a 20-minute baseline EEG session belongs to a patient with schizophrenia.

**Why**: To speed up diagnosis of patients who arrive at the emergency room, so we can tell whether they have schizophrenia and need specialized care.

### Notes:
* Arriving at the ER brings obvious contextual issues that will have to be accounted for if we manage to build a working model from the dataset.

---

## Notebook Description

This notebook is for exploring and taking notes on the [TUH EEG](https://www.isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v1.1.0/) dataset, along with ideas on what to extract from it.

### Notes:
* The next step will be to extract the necessary samples. 
    * Around 300-500 control samples and maybe 100 from diagnosed schizophrenics (not on meds) to enhance our initial dataset.

### Goals:
* Find the best way to traverse through the [TUH EEG](https://www.isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v1.1.0/) dataset.


# Structure

* Based on the directory structure, we won't be able to download everything locally, and there's no need to.
* Based on this [README.txt](https://www.isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v1.1.0/_AAREADME.txt), it looks like the .txt files inside each patient's data contain `the EEG report corresponding to the patient and session`, which is (based on some samples) written in a semi-structured way. This enables a keyword search across the directory tree.
    * Not sure a burst of GET requests is a good idea. We may need to space them about 1 second apart so we don't effectively DDoS their server. Maybe even go through a VPN just in case?
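
A minimal sketch of the keyword search mentioned above, assuming the reports have already been collected as `(url, readme)` pairs. The `find_reports` helper, the keyword list and the sample snippets are all hypothetical and only illustrate the idea.

```python
import re
from collections import namedtuple

# Mirrors the (url, readme) pairs collected by the crawler in this notebook.
EDFData = namedtuple("EDFData", ["url", "readme"])


def find_reports(records, keywords):
    """Return the records whose report text mentions any keyword.

    Matching is case-insensitive and anchored at word starts, so
    "schizo" matches both "schizophrenia" and "Schizoaffective".
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")",
        re.IGNORECASE,
    )
    return [rec for rec in records if pattern.search(rec.readme)]


# Made-up report snippets, only to show the call.
sample = [
    EDFData("session_a/", "CLINICAL HISTORY: adult male with schizophrenia."),
    EDFData("session_b/", "CLINICAL HISTORY: seizures, no psychiatric history."),
    EDFData("session_c/", "History of Schizoaffective disorder."),
]
hits = find_reports(sample, ["schizo"])
print([rec.url for rec in hits])
```

A plain keyword match cannot tell "schizophrenia" from "no history of schizophrenia", so the hits would still need a manual pass.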

# Dataset statistics
```
---
 Files and Sessions:

               no. patients: 13,539
               no. sessions: 23,002
  avg. no. sessions/patient: 1.70
              no. edf files: 53,506
             total duration: 56,726,510 secs (15,757 hrs)

 Signal Data:
   over 40 different channel configurations
   sample frequency varies from 250 Hz to 1024 Hz
   95% of the data includes a 10/20 configuration as a subset of the
      available channels

---
```
* ~The only real issue I see is with the channel configurations... should we include or exclude for now anything with a different than our standard (btw, what IS our standard?)?~ 
    * NVM, it looks like most of the dataset contains 10/20 configuration subset, which saves us quite nicely.
* The sample frequency is quite high; in pre-processing we might want to downsample it a bit.
* Not sure about patients with multiple sessions. For the additional schizophrenics, our best bet is probably the first session (hopefully some of those were recorded before any medication).
    * And as for control, it might be best to just use data from unique patients.
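
One way the "first session per patient" idea could be sketched, working only on the session URLs. The URL layout in `SESSION_RE` is an assumption (patient ID directory followed by an `sNNN_date` session directory) and the example URLs are made up, so both should be checked against the real directory tree.

```python
import re

# ASSUMPTION: session URLs end in ".../<patient_id>/s<NNN>_<date>/",
# e.g. ".../00000258/s002_2003_07_21/". Verify against the real tree.
SESSION_RE = re.compile(r"/(?P<patient>\d+)/s(?P<session>\d+)_[^/]*/?$")


def first_sessions(urls):
    """Map each patient ID to the URL of its lowest-numbered session."""
    best = {}
    for url in urls:
        match = SESSION_RE.search(url)
        if match is None:
            continue  # URL does not follow the assumed layout
        patient = match["patient"]
        session = int(match["session"])
        if patient not in best or session < best[patient][0]:
            best[patient] = (session, url)
    return {patient: url for patient, (_, url) in best.items()}


# Made-up URLs, only to show the call.
urls = [
    "edf/01_tcp_ar/002/00000258/s002_2003_07_21/",
    "edf/01_tcp_ar/002/00000258/s001_2003_07_16/",
    "edf/01_tcp_ar/003/00000301/s004_2010_01_05/",
]
print(first_sessions(urls))
```

The lowest session number present in the dataset is not guaranteed to be the patient's first clinical encounter, so this only approximates "first session".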

In [2]:
import requests
from requests.auth import HTTPBasicAuth
import re
from collections import namedtuple
from time import sleep
import pandas as pd
from IPython.display import clear_output


s = requests.Session()
s.auth = (HTTPBasicAuth('some_auth', 'some_auth')) # Paste real auth here.

EDFData = namedtuple("EDFData", ["url", "readme"])

class MultipleFilesFound(Exception):
    """Raised when multiple files found when only one should be found."""
    pass

class NoFilesFound(Exception):
    """Raised when no files were found but we expected something."""
    pass

def crawl(init_link):
    to_crawl = [init_link]
    i = 0
    while to_crawl:
#         if i > 100:
#             break
        
        current_url = to_crawl.pop(-1)
        r = s.get(current_url)
        # First link is always parent directory link
        new_dirs = re.findall(r'<a href="([^?].*\/)">', r.text)[1:]
        
        to_crawl.extend([current_url + new_url for new_url in new_dirs])
        
        # If we get to the end, download the TXT content and save as tuple
        try:
            if not new_dirs:
                txt_file_url = re.findall(r'<a href="([^?].*\.txt)">', r.text)
                # Sanity checks to make sure we only get one TXT file.
                if not txt_file_url:
                    raise NoFilesFound
                if len(txt_file_url) > 1:
                    raise MultipleFilesFound
                i+=1
                txt_file_content = s.get(current_url + txt_file_url[0])
                clear_output(wait=True)
                print(f"Found {i} sessions out of 23,002. {(i/23002)*100:.2f}%")
                
                yield EDFData(url=current_url, readme=txt_file_content.text)

        except NoFilesFound:
            with open("no_files_error.txt", "a+") as f:
                f.write(f"{current_url}\n")
        except MultipleFilesFound:
            with open("multiple_files_error.txt", "a+") as f:
                f.write(f"{current_url}\n")
                
        
edf_reports = crawl("https://isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v1.1.0/edf/")
# Segmenting it so there are 5 parts. Last time it disconnected right by the end.
edf_reports_df = pd.DataFrame.from_records(edf_reports, columns=EDFData._fields)
edf_reports_df.to_csv("edf_reports_all.csv", index=False)
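
The crawl above does not space out its requests, even though the notes in the Structure section suggest roughly one per second. A minimal sketch of a throttle that could be called before each `s.get(...)`; the `Throttle` class is hypothetical, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=1.0)
# Inside crawl(), before each request:
#     throttle.wait()
#     r = s.get(current_url)
```

Each session needs at least two requests (the directory listing and the report), so at one request per second a full crawl of 23,002 sessions would take more than 12 hours. A shorter interval may be a more practical compromise.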

# Async Version
Below you can find the async version of the script above. It cuts the runtime from around 1-1.5 hours to about 15 minutes.

Please run it as a standalone script rather than inside Jupyter: the notebook already runs its own event loop, so `asyncio.run` raises a `RuntimeError` there.

Please note that `fetch_content` could be done better, but I haven't had time to polish it.

In [None]:
import asyncio
import aiohttp
import re
from collections import namedtuple
import pandas as pd
import time

fetch_counter = 0
EDFData = namedtuple("EDFData", ["url", "readme"])


# Quick error classes in case something happens
class MultipleFilesFound(Exception):
    """Raised when multiple files found when only one should be found."""
    pass


class NoFilesFound(Exception):
    """Raised when no files were found but we expected something."""
    pass


# MAIN ASYNC SCRIPT
async def fetch_content(session, url):
    try:
        async with session.get(url) as response:
            try:
                return await response.text()
            # In some cases it couldn't read it because
            # it couldn't convert to UTF8 due to some artifacts.
            # Needs a better exception handling, for now it works.
            except Exception:
                return await response.content.read()
    # General exceptions are bad. Might be better to specify Timeout
    # and to bind it to a max level of recursion.
    # Had to add it because the sheer volume of requests probably
    # triggered those timeouts.
    except Exception as e:
        print(f"Request failed ({e!r}); retrying in 30 s...")
        await asyncio.sleep(30)
        return await fetch_content(session, url)


async def get_relevant_content(session, url, relevant_data):
    global fetch_counter

    url_content = await asyncio.shield(fetch_content(session, url))
    new_dirs = re.findall(r'<a href="([^?].*\/)">', url_content)[1:]
    to_crawl = [url + new_url for new_url in new_dirs]

    if to_crawl:
        tasks = [get_relevant_content(session, new_url, relevant_data)
                 for new_url in to_crawl]
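        for f in asyncio.as_completed(tasks):
            # Skip sessions with missing or ambiguous reports instead of
            # letting one bad directory abort the whole crawl.
            try:
                data = await f
            except (NoFilesFound, MultipleFilesFound) as e:
                print(f"Skipping a session: {e!r}")
                continue
            if data:
                relevant_data.append(data)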

    if not to_crawl:
        txt_file_url = re.findall(r'<a href="([^?].*\.txt)">', url_content)
        # Sanity checks to make sure we only get one TXT file.
        if not txt_file_url:
            raise NoFilesFound(url)
        if len(txt_file_url) > 1:
            raise MultipleFilesFound(url)
        fetch_counter += 1
        print(f"Found {fetch_counter} sessions out of 23,002. {(fetch_counter/23002)*100:.2f}%")
        txt_file_content = await fetch_content(session, url + txt_file_url[0])

        return EDFData(
            url=url,
            readme=txt_file_content
        )


async def async_crawl(init_url):
    """Main function.

    Note:
        Works by adding the found data to a list which is
        specified in this function.
        Please update the BasicAuth with proper login info.
    """
    data = []
    # Add the auth info below
    async with aiohttp.ClientSession(auth=aiohttp.BasicAuth(AUTH_INFO, AUTH_INFO), headers={"Connection": "close"}) as s:
        await get_relevant_content(s, init_url, data)

    return data

# Might change this to directly dropping it in a DataFrame.
t1 = time.time()
records = asyncio.run(async_crawl("https://isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v1.1.0/edf/"))
t2 = time.time()
print(f"Time it took: {t2 - t1:.1f} s")
print(f"Length: {len(records)} (make sure it's 23,002)")

edf_reports_df = pd.DataFrame.from_records(records, columns=EDFData._fields)
edf_reports_df.to_csv("edf_reports_all.csv", index=False)
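
The comments in `fetch_content` note that the retry should be bound to a maximum depth and that the sheer volume of requests probably caused the timeouts. A rough sketch of both ideas, assuming a generic `fetch` coroutine; `fetch_with_retry`, `flaky_fetch` and the limits used here are hypothetical and not part of the script above.

```python
import asyncio


async def fetch_with_retry(fetch, url, semaphore, max_retries=3, base_delay=1.0):
    """Call `await fetch(url)` with capped concurrency and bounded retries.

    `fetch` is any coroutine function taking a URL. Waits base_delay,
    2 * base_delay, ... between attempts and re-raises the last error
    once the retries run out.
    """
    for attempt in range(max_retries + 1):
        try:
            async with semaphore:  # limits simultaneous requests
                return await fetch(url)
        except (asyncio.TimeoutError, OSError):
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def demo():
    calls = 0

    async def flaky_fetch(url):
        # Stand-in for a real request: fails twice, then succeeds.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise asyncio.TimeoutError
        return f"content of {url}"

    semaphore = asyncio.Semaphore(10)
    text = await fetch_with_retry(flaky_fetch, "some/url/", semaphore, base_delay=0.01)
    return text, calls


result = asyncio.run(demo())
print(result)
```

With aiohttp, the `except` tuple would likely also need `aiohttp.ClientError`, and the semaphore size would have to be tuned to whatever the server tolerates.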
