# Text Classification 
The objective of this case study is to explore data handling techniques. 


![image](../attachments/spam.png)


**Data Source**: The [Spambase Dataset](../attachments/spambase.csv) is a collection of email messages labeled as spam or not spam.

## Instructions
1. Read the entire case study description before you start coding; note down the control flow, the expected functionality of the various methods, and why you are implementing them. <br>
2. You may use scikit-learn libraries for building models and libraries for text preprocessing. **If you use any other tools or IDEs, please mention them.** Undeclared use of AI tools, if detected, will be considered academic dishonesty.<br>
3. Answer all listed questions, including the sub-questions, in your report to get full credit. Your code should be well commented to explain your logic.<br>
4. Compress the following deliverables and submit them as a single zip file named `<lastname>_<firstname>_final_case_study.zip` via **Brightspace**.
	**Deliverables**:
	1. Source code (Jupyter Notebook or Python scripts) implementing the tasks outlined below.
	2. A brief report (Markdown or PDF) summarizing your findings and methodologies in two pages (approximately 1000 words). The report should include:
		1. Outputs from each part of the case study (you can copy and paste relevant plots and tables from your code outputs).
		2. Which AI tools or assistants you used (e.g., GitHub Copilot, ChatGPT).
		3. Limitations and challenges you faced while using AI:
			1. Do the tools make mistakes? If so, what kind?
			2. How did you verify the correctness of the AI-generated code or suggestions?
		4. Ethical considerations when using AI for data science tasks.
		5. References to any external resources or documentation you consulted.

If you have any questions, please email me directly; I can also answer them over a Zoom call. Please complete the work on your own and do not share your code with others.

## Tasks
Questions you need to answer in this case study are organized into 4 parts:

### Part 1 — Data Understanding (15 pts)
1. Import Libraries and Load Data (5 pts)
2. Dataset description (size, features, target meaning, etc.) (5 pts)
	 - display the top 5 rows
	 - get the dimensions of the data
	 - get the class distribution
	 - generate a bar plot to display the class distribution
3. Create a separate feature set (data matrix X) and target (1D vector y), and print the dimensions of each (5 pts)
   - create train and test sets
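
As a starting point, the Part 1 steps might look like the minimal sketch below. It builds a small synthetic DataFrame so that it runs on its own; the label column name `spam`, the `feat_*` column names, and the 80/20 stratified split are assumptions for illustration, not the actual headers or a required setting for the Spambase CSV.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in so the sketch runs without the CSV. In the case study
# you would load the provided file instead, e.g. with pd.read_csv(...).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)), columns=[f"feat_{i}" for i in range(5)])
df["spam"] = (rng.random(200) < 0.4).astype(int)  # hypothetical label column

print(df.head())   # top 5 rows
print(df.shape)    # (number of rows, number of columns)

# Class distribution; class_counts.plot(kind="bar") would draw the bar plot
# (that call requires matplotlib to be installed).
class_counts = df["spam"].value_counts().sort_index()
print(class_counts)

# Separate the feature matrix X from the 1D target vector y.
X = df.drop(columns="spam")
y = df["spam"]
print(X.shape, y.shape)

# Stratify so both splits keep roughly the same spam/non-spam ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Creating the split this early means every later step (outlier thresholds, scaling, feature ranking) can be fitted on the training set only.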


### Part 2 — Data Handling (35 pts)
1. Data Quality & Outlier Detection (10 pts)
   - Identify and visualize outliers using statistical methods 
   - Discuss the impact of outliers on classification performance; decide whether to keep, remove, or cap them, and justify your choice
2. Feature Scaling/Normalization (10 pts)
   - Apply standardization (z-score) and/or normalization (min-max)
   - Justify why scaling is important for spam classification
3. Handling Class Imbalance (5 pts)
   - If the classes are imbalanced, apply a suitable technique (oversampling, undersampling, or class weights)
4. Feature Importance Analysis (5 pts)
   - Identify which word/character features are most predictive of spam
   <!-- - Use correlation analysis, mutual information, or permutation importance -->
   - Visualize the 10-15 most important features
5. Dimensionality Reduction/Visualization (5 pts)
   - Apply PCA or t-SNE to reduce features for visualization
   <!-- - Plot spam vs. non-spam clusters in 2D space -->
   - Discuss separability and implications for classification
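
One possible way to chain the Part 2 steps is sketched below on synthetic data from `make_classification`, so it runs without the CSV. The specific choices here (the 1.5 × IQR rule with capping, z-score scaling, balanced class weights, mutual information for ranking, and PCA for the 2D projection) are illustrative assumptions; other options listed above are equally valid if you justify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 1. Outliers: flag values outside 1.5 * IQR, using training-set quartiles.
q1, q3 = np.percentile(X_train, [25, 75], axis=0)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (X_train < lower) | (X_train > upper)
print("outliers per feature:", outlier_mask.sum(axis=0))

# Capping is one of the three options (keep, remove, cap).
X_train_capped = np.clip(X_train, lower, upper)
X_test_capped = np.clip(X_test, lower, upper)

# 2. Scaling: fit on the training data only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train_capped)
X_train_scaled = scaler.transform(X_train_capped)
X_test_scaled = scaler.transform(X_test_capped)

# 3. Class imbalance: weights inversely proportional to class frequency.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print("class weights:", class_weight)

# 4. Feature importance: rank features by mutual information with the label.
mi = mutual_info_classif(X_train_scaled, y_train, random_state=0)
top_features = np.argsort(mi)[::-1][:5]
print("top feature indices:", top_features)

# 5. Dimensionality reduction: project to 2D for a scatter plot by class.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X_train_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The `class_weight` dictionary can be passed to most scikit-learn classifiers through their `class_weight` parameter, and `X_2d` can be scatter-plotted with points colored by `y_train` to judge separability.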

### Part 3 — Modeling and Classification (35 pts + 5 pts Extra Credit)
1. Baseline classifier performance comparison (drop vs impute) (10 pts)
	- Metrics: accuracy, F1-score, AUC
    - Comparison table/plot
2. Evaluate impact on metrics (accuracy, precision, recall, F1-score) (10 pts)
    - Compare model performance with and without scaling
    - Compare model performance with and without handling class imbalance
3. Advanced model training and performance discussion (15 pts)
	- Model choice justification
	- Performance comparison with baseline
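
The comparisons in Part 3 could be organized along the lines of the sketch below. It again uses synthetic data so it runs standalone, and the model choices (logistic regression as the baseline, a random forest as the "advanced" model) and hyperparameters are assumptions for illustration rather than the expected answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each entry is one experimental condition to compare.
models = {
    "logreg (raw)": LogisticRegression(max_iter=1000),
    "logreg (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "logreg (scaled, balanced)": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced")),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba),
    }

# Print one row per condition as a simple comparison table.
for name, metrics in results.items():
    row = "  ".join(f"{k}={v:.3f}" for k, v in metrics.items())
    print(f"{name:28s} {row}")
```

Reporting training accuracy next to test accuracy makes the gap between the two visible, which is useful evidence when discussing generalization and the bias-variance tradeoff.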

**(Extra Credit 5 pts) Can you beat a training accuracy of 97% and a testing accuracy of 94%? Is your approach generalizable (bias-variance tradeoff)? Explain your approach and discuss.**



### Part 4 — Final Reflection (15 pts)

1.	Write a brief (150–200 words) key insights summary (5 pts)
2.	In a real-world spam application, how would you deal with challenges related to data distribution? (5 pts)
3.	What further analyses or models would you explore with more time or resources? (5 pts)