# COGS 118A - Final Project

# Insert title here

## Group members

- Satomi Ito
- Pudan Xu 
- Joakim Nguyen
- Wilson Tan
- Boyong Liu

# Abstract 
This project focuses on extracting events of interest from EEG data, specifically K-complexes and sleep spindles. Our goal is to use EEG recordings of sleep and naps from thirty participants to automate the labeling of K-complexes and spindles, applying the machine learning algorithms that we learned in class. We separate the data into three classes: K-complexes, spindles, and neither. Using this cleaned and labeled data, we can train our models to classify these three classes.

# Background

This study uses EEG data from thirty participants, comprising more than 100,000 data points. Before we dive into the technical details of predicting K-complexes and spindles, we need to understand how the brain behaves during sleep. There are four stages of sleep. When a person falls asleep, bursts of neural oscillatory activity occur, that is, rhythmic patterns of neural activity in the central nervous system. Stages 1 to 3 are called non-rapid eye movement (NREM) sleep, also known as quiet sleep. Stage 4 is rapid eye movement (REM) sleep, also known as active sleep or paradoxical sleep [1].

To better analyze brain activity during sleep, scientists use electroencephalography (EEG), a non-invasive test that records the brain's electrical activity. Stage 2 of NREM sleep has two hallmarks: the large multicomponent K-complex (KC) and the rhythmic spindle. Both can be seen in an EEG recording [2].

According to clinical psychologist John Cline, “K complexes are large waves that stand out from the background and often occur in response to environmental stimuli such as sounds in the bedroom. Sleep spindles are brief bursts of fast activity that appear something like the shape of an "eye" as they rapidly increase in amplitude and then rapidly decay.” [3]

Manual data labeling is time-consuming because it requires a human to inspect large amounts of data. "Manual data labeling has the potential to be somewhat labour intensive. Each instance of labeling may take seconds but the multiplicative effect of thousands of images could create a backlog and impede a project." [5] Because of this, automating image labeling is highly valuable: it saves time and energy and supports a smooth, efficient workflow.

In our project, we will use supervised machine learning models to classify K-complexes and spindles and thereby automate their labeling.

# Problem Statement

We want to automate the labeling of K-complexes and spindles in EEG so that researchers do not have to do it manually. To distinguish efficiently between K-complexes and sleep spindles, we want to train K-nearest neighbors, SVM, and ideally a convolutional neural network on our dataset. We will use more than 100,000 data points collected from 30 participants, obtained from the OSFHOME website; the datasets have already been cleaned. With this abundance of data, we can be more confident in the accuracy of our model. Since the original EEG files are very complicated, it is important to regularize them first so that our model can label them more accurately.


# Data

Detail how/where you obtained the data and cleaned it (if necessary)

If the data cleaning process is very long (e.g., elaborate text processing) consider describing it briefly here in text, and moving the actual cleaning process to another notebook in your repo (include a link here!).  The idea behind this approach: this is a report, and if you blow up the flow of the report to include a lot of code it makes it hard to read.

Please give the following information for each dataset you are using
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


```python
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
```

# Proposed Solution

As a baseline model, we will use K-Nearest Neighbors (KNN) with the default sklearn settings. We chose KNN because it is a simple model and because we believe we have enough data to overcome the curse of dimensionality.
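A minimal sketch of such a baseline is shown here. It runs on synthetic data from `make_classification` as a stand-in for the EEG features; the sample count, feature count, and class encoding (0 = neither, 1 = K-complex, 2 = spindle) are assumptions for illustration, not properties of the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the EEG features (assumed shapes and labels):
# 0 = neither, 1 = K-complex, 2 = spindle.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default sklearn settings: n_neighbors=5, uniform weights,
# Minkowski distance with p=2 (Euclidean).
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

knn_accuracy = knn.score(X_test, y_test)
print(f"KNN baseline accuracy: {knn_accuracy:.3f}")
```

The accuracy printed here describes only the synthetic data and says nothing about performance on the EEG recordings.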

In addition, we will use Support Vector Machines (SVMs). Because SVMs are binary classifiers, we will combine them in a One-vs-One scheme. SVMs do not natively output probabilities that can be compared across classifiers, so we prefer One-vs-One over One-vs-Rest; with only 3 classes, One-vs-One requires only 3 classifiers. We will also experiment with different hyperparameters: the regularization parameter C (how heavily we weight the loss), the kernel (linear, polynomial, or radial), and the degree and gamma values where appropriate.
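One way this search could be set up is sketched below with sklearn's `SVC`, which trains One-vs-One classifiers internally for multi-class problems. The synthetic data, the feature scaling step, and the specific grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the EEG features (assumed shapes).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

# decision_function_shape="ovo" exposes one decision value per class pair,
# i.e. 3 * (3 - 1) / 2 = 3 values for 3 classes.
pipeline = make_pipeline(StandardScaler(), SVC(decision_function_shape="ovo"))

# Illustrative grid: C for every kernel, gamma for rbf, degree for poly.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", 0.01]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10],
     "svc__degree": [2, 3]},
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")

# Pairwise decision values for the first 5 samples: shape (5, 3).
pairwise_scores = search.decision_function(X[:5])
```

Scaling the features before fitting is a common choice for kernel SVMs, but whether it suits the EEG inputs depends on how they are represented.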

We have a large amount of data, so we can either use a simple train/test split or implement K-Fold Cross-Validation.
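The two options can be compared side by side. The sketch below again uses synthetic data; the 80/20 split and the choice of 5 stratified folds are assumptions, not values fixed by the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class stand-in for the EEG features (assumed shapes).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
model = KNeighborsClassifier()

# Option 1: a single hold-out split (80% train, 20% test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# Option 2: stratified 5-fold cross-validation, one accuracy per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
print(f"5-fold accuracy: {fold_accuracies.mean():.3f} "
      f"(std {fold_accuracies.std():.3f})")
```

The spread across folds gives a rough sense of how much a single hold-out estimate could vary, at the cost of fitting the model once per fold.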

# Evaluation Metrics

This project is a classification task, so we can use the standard zero-one classification loss, which is the proportion of data points labeled incorrectly. In this case, that is the proportion of K-complex and spindle images that are mislabeled, or equivalently one minus the number of images labeled correctly divided by the total number of images.

$$ e = \frac{1}{n}\sum_{i=1}^{n} 1(y_i \neq f(x_i; W)) $$

However, this loss is discontinuous where a decision switches from correct to incorrect (effectively an infinite gradient) and has zero gradient everywhere else, which makes it very hard to minimize directly. Therefore, it may be best to use it only as a baseline.
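As a small worked example of the error formula, the sketch below computes $e$ for a handful of made-up labels. The helper name `zero_one_error` and the label values are hypothetical.

```python
import numpy as np

def zero_one_error(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Made-up labels: 0 = neither, 1 = K-complex, 2 = spindle.
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 2, 0, 0])

error = zero_one_error(y_true, y_pred)  # 2 of 8 labels differ
accuracy = 1.0 - error
print(error, accuracy)  # prints: 0.25 0.75
```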

Another common classification metric is the area under the ROC curve (ROC-AUC). Because the area under the curve reflects how well the model separates the classes, maximizing ROC-AUC is another potential evaluation metric. We considered the ROC curve because it would give us the opportunity to visualize decision thresholds and the trade-off between false negatives and false positives that the model makes.
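A sketch of how ROC-AUC might be computed with sklearn follows. Since ROC analysis is defined for binary decisions, it assumes a one-vs-rest treatment when more than two classes are present; the synthetic data and the use of KNN class probabilities as scores are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in (assumed): 0 = neither, 1 = K-complex, 2 = spindle.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = KNeighborsClassifier().fit(X_train, y_train)
proba = model.predict_proba(X_test)  # one column of scores per class

# Macro-averaged one-vs-rest AUC across the three classes.
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest ROC-AUC: {macro_auc:.3f}")

# ROC curve for one class against the rest (here class 1), which exposes
# the thresholds and the false positive / true positive trade-off.
is_class_1 = (y_test == 1).astype(int)
fpr, tpr, thresholds = roc_curve(is_class_1, proba[:, 1])
```

Plotting `fpr` against `tpr` with matplotlib would give the usual ROC plot for that class.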

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

The data used in this project come from human participants. However, this project does not use any personally identifiable information, nor does it require any further human participation. The original EEG experiment also involved informed consent, and the data are permitted to be used non-commercially.

We did not label the data ourselves and are unable to determine whether it was labeled correctly. One way to address this would be to look at other examples of K-complexes and spindles and check whether they resemble the correspondingly labeled images. A limitation of this approach is that, lacking expertise, we would not be fully confident in identifying the correct labels for certain images.

Because our data will be images, there is a worry about the “curse of dimensionality”. However, we are confident that we will still be able to get good training sets, because the abundance of available data helps counteract how sparsely each region of a high-dimensional input space is represented.

One issue arises from the fact that our model will not predict with 100 percent accuracy. If the model is used for K-complex and spindle classification, the predicted class of an image would still need to be verified by a human. The problem arises when this verification is skipped and the model's prediction is taken as absolutely correct. This could not only affect the results of studies but also waste the resources that go into them. One way to counter this would be to run classification on the images with multiple models and use the majority classification across the models, which could help make more accurate predictions. We could also be transparent about our results and inform users that the model is not perfect and that its results require verification. In the worst-case scenario, we would recall the entire algorithm.
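The majority-classification idea can be illustrated with a short sketch. The function name `majority_vote`, the three model names, and the string labels are all hypothetical; samples with no strict majority are returned as `None` so that they can be routed to a human reviewer.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample.

    Returns None for a sample when no label wins a strict majority,
    so that the sample can be deferred to human verification.
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        label, count = Counter(sample_votes).most_common(1)[0]
        voted.append(label if count > len(sample_votes) / 2 else None)
    return voted

# Hypothetical predictions from three models on four images.
knn_preds = ["kc", "spindle", "neither", "kc"]
svm_preds = ["kc", "neither", "neither", "spindle"]
cnn_preds = ["spindle", "spindle", "neither", "neither"]

combined = majority_vote([knn_preds, svm_preds, cnn_preds])
print(combined)  # prints: ['kc', 'spindle', 'neither', None]
```

The last image receives three different labels, so it is flagged rather than assigned a class.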

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
1: The 4 Stages of Sleep. What's Happening During NREM and REM Sleep

https://www.verywellhealth.com/the-four-stages-of-sleep-2795920

2: The Emergence of Spindles and K-Complexes and the Role of the Dorsal Caudal Part of the Anterior Cingulate as the Generator of K-Complexes

https://www.frontiersin.org/articles/10.3389/fnins.2019.00814/full

3: Sleep Spindles. Sleep spindles signal processes that refresh our memories.

https://www.psychologytoday.com/us/blog/sleepless-in-america/201104/sleep-spindles

4: A high-density scalp EEG dataset acquired during brief naps after a visual working memory task

https://www.sciencedirect.com/science/article/pii/S2352340918304268?via%3Dihub

5: Automated Data Labeling vs Manual Data Labeling: Optimizing Annotation

https://keymakr.com/blog/automated-data-labeling-vs-manual-data-labeling-optimizing-annotation/
