# EEG Eye State Dataset Case Study
The objective of this case study is to explore data handling techniques, particularly for datasets with time-series characteristics and missing values. 
The [EEG Eye State Dataset](https://archive.ics.uci.edu/dataset/264/eeg+eye+state) is a collection of electroencephalogram (EEG) recordings that capture the brain activity of individuals while their eye states (open or closed) are monitored. This dataset is commonly used in research related to brain-computer interfaces, cognitive science, and machine learning. The duration of measurement is 117 seconds.

**Data Source**: a [variant of the EEG Eye State](../attachments/eeg_eye_state.csv) dataset from the UCI Machine Learning Repository.

**Delivery Items**:
1. Source code (Jupyter Notebook or Python scripts) implementing the tasks outlined below.
2. A brief report (Markdown or PDF) of about two pages (approximately 1000 words) summarizing your findings and methodology. The report should include:
   1. Outputs from each part of the case study (you can copy-paste relevant plots and tables from your code outputs).
   2. Which AI tools or assistants you used (e.g., GitHub Copilot, ChatGPT)
   3. Limitations and challenges you faced while using AI
      1. Do the tools make mistakes? If so, what kind?
      2. How did you verify the correctness of the AI-generated code or suggestions?
   4. Ethical considerations when using AI for data science tasks
   5. References to any external resources or documentation you consulted.

The questions for this case study are organized into four parts:

## Part 1 — Data Understanding (15 pts)
1. Dataset description: size, features, and meaning of the target (5 pts)
2. Basic statistical summary (5 pts)
3. Distribution of at least 2 randomly chosen EEG channels, with the patterns you observe (5 pts)

<!-- ## Part 1 — Data Understanding (15 pts)
1. Describe the EEG Eye State dataset.
	- How many instances and features are there?
	- What does the target variable represent?
2. Inspect the data quality.
	- What percentage of values are missing per feature?
	- Is missingness distributed uniformly or concentrated in certain sensors?
3. Plot the distribution of at least 2 randomly chosen EEG channels.
	- What patterns do you notice? -->
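
As a starting point for Part 1, the sketch below collects the dataset size, the share of missing values per feature, and the class balance. It is a minimal illustration, not a required structure: the target column name `eye_state`, the channel names, and the synthetic stand-in data are all assumptions, so check the actual column names in the CSV before reusing it.

```python
import numpy as np
import pandas as pd


def describe_dataset(df: pd.DataFrame, target: str) -> dict:
    """Collect size, per-feature missing percentage, and class balance."""
    features = [c for c in df.columns if c != target]
    return {
        "n_rows": len(df),
        "n_features": len(features),
        "missing_pct": (df[features].isna().mean() * 100).round(2),
        "class_share": df[target].value_counts(normalize=True).sort_index(),
    }


# Synthetic stand-in so the sketch runs without the attachment.
# With the real file: df = pd.read_csv("../attachments/eeg_eye_state.csv")
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(4300, 30, size=(200, 3)),
    columns=["ch_a", "ch_b", "ch_c"],  # hypothetical channel names
)
demo.loc[rng.choice(200, size=20, replace=False), "ch_b"] = np.nan
demo["eye_state"] = rng.integers(0, 2, size=200)  # hypothetical target name

summary = describe_dataset(demo, target="eye_state")
print(summary["n_rows"], summary["n_features"])
print(summary["missing_pct"])
print(demo.describe().T[["mean", "std", "min", "max"]])
```

`DataFrame.describe()` skips missing values when computing statistics, so report the missing percentages alongside the summary rather than reading the summary alone.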

## Part 2 — Data Handling (35 pts)
1. Compare dropping rows with missing values against imputation. If you impute, compare at least two methods and evaluate their quality (10 pts)
   - Sensitivity discussion: which features are most affected by imputation?
2. Feature scaling/normalization, if needed based on your judgment (5 pts)
3. Handling class imbalance, if needed (e.g., oversampling, undersampling, class weights) (5 pts)
4. Feature importance analysis (10 pts)
5. Dimensionality reduction/visualization (e.g., PCA or t-SNE) (5 pts)
<!-- 4. Compare at least two imputation strategies (e.g., mean/median, KNN, MICE, or a simple neural-network-based imputation like GAIN). Show code for each method.
	- Which features are most sensitive to imputation?
5.	Evaluate imputation quality.
	- Use a mask to compute reconstruction error on observed values.
	- Which method performs best? Why? -->
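
One common way to evaluate imputation quality in Part 2 is to hide a random share of the values that are actually observed, impute them, and measure the error on the hidden entries only. The sketch below does this for scikit-learn's `SimpleImputer` and `KNNImputer`. The helper name `masked_rmse`, the 10% hiding rate, and the synthetic correlated channels are assumptions for illustration; results on the real data may rank the methods differently.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer


def masked_rmse(X, imputer, hide_frac=0.1, seed=0):
    """Hide a random share of observed entries, impute, and score only those entries."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Only entries that were truly observed can serve as ground truth.
    hide = observed & (rng.random(X.shape) < hide_frac)
    X_masked = X.copy()
    X_masked[hide] = np.nan
    X_filled = imputer.fit_transform(X_masked)
    return float(np.sqrt(np.mean((X_filled[hide] - X[hide]) ** 2)))


# Synthetic stand-in: four channels driven by one shared signal plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
X = latent @ np.ones((1, 4)) + 0.1 * rng.normal(size=(300, 4))
X[rng.random(X.shape) < 0.05] = np.nan  # pre-existing gaps

scores = {
    "mean": masked_rmse(X, SimpleImputer(strategy="mean")),
    "knn": masked_rmse(X, KNNImputer(n_neighbors=5)),
}
print(scores)
```

Using the same `seed` for every method hides the same entries each time, so the scores are directly comparable. Computing the error per column instead of over the whole matrix gives a simple view of which features are most sensitive to imputation.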

## Part 3 — Modeling and Classification (35 pts)
1. Baseline classifier performance comparison, drop vs. impute (10 pts)
	- Metrics: accuracy, F1-score, AUC
	- Comparison table/plot
2. Temporal structure discussion (10 pts)
	- Why is EEG inherently temporal?
3. Advanced model training and performance discussion (15 pts)
	- Model choice justification
	- Performance comparison with baseline
<!-- 6.	Train a baseline classifier (Logistic Regression or Random Forest) on:
A. Dataset with missing values removed (drop rows)
B. Dataset after your best imputation
	- Compare accuracy, F1-score, and AUC.
7.	Explain how temporal structure matters in this dataset.
  	- Although not explicitly time-tagged, why is EEG inherently temporal?
	- What limitations does this introduce for your model?
8.	Train a more advanced model such as:
	- Gradient Boosted Trees
	- 1D-CNN
	- LSTM (if you reconstruct small sequences)

Discuss whether it improves performance. -->
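
Because consecutive EEG samples are strongly correlated, a shuffled train/test split can place near-identical neighboring samples on both sides and make scores look better than they are. The sketch below shows one alternative for Part 3: hold out the last portion of the recording and report accuracy, F1-score, and AUC for a logistic regression baseline. The helper names, the 20% test share, and the synthetic block-structured data are assumptions for illustration, not a prescribed protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def chronological_split(X, y, test_frac=0.2):
    """Hold out the last test_frac of rows, preserving recording order."""
    cut = len(X) - int(len(X) * test_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and report accuracy, F1, and ROC AUC on the held-out rows."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba),
    }


# Synthetic stand-in: the label switches every 50 samples, like eye-state runs.
rng = np.random.default_rng(0)
y = (np.arange(1000) // 50) % 2
X = rng.normal(size=(1000, 4)) + 1.5 * y[:, None]

X_tr, X_te, y_tr, y_te = chronological_split(X, y)
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
metrics = evaluate(baseline, X_tr, X_te, y_tr, y_te)
print(metrics)
```

Running the same `evaluate` call on the dropped-rows dataset and on the imputed dataset yields the rows of the comparison table. Check that the held-out segment contains both eye states, since AUC is undefined otherwise.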


<!-- ## Part 4 — Feature Importance & Interpretation
9.	Compute feature importance using either:
	- Permutation importance
	- SHAP values

Which EEG channels contribute most to predicting eye state?

10.	Discuss whether missing data in those important channels caused major performance loss. -->


<!-- ## Part 5 — Robustness & Sensitivity Analysis
11.	Simulate more missing data (20%, 30%, MCAR vs MAR).
	- How does your classifier degrade?
	- Plot performance vs. missing rate.
12.	Test robustness by intentionally corrupting a critical EEG channel.
	- How much accuracy drops
	- What does this imply about sensor reliability? -->

## Part 4 — Final Reflection (15 pts)

1. Write a brief (150–200 words) conclusion summarizing your findings (5 pts)
2. In real-world EEG applications, how would you handle missing data? (5 pts)
	- Consider sensor failures, noise, and real-time constraints.
3. What further analyses or models would you explore with more time or resources? (5 pts)