# Text Classification 
The objective of this case study is to explore data handling techniques. 

**Data Source**: The [Spambase Dataset](../attachments/spambase.csv) is a collection of email messages labeled as spam or not spam.

## Instructions
1. Read the entire case study description before you start coding. Note down the control flow, the expected functionality of each method, and why you are implementing it.
2. You may use scikit-learn libraries for building models and libraries for text preprocessing. **If you use any other tools or IDEs, please mention them in the report.** Failing to declare the use of AI tools will be considered academic dishonesty.
3. Answer all listed questions, including the sub-questions, in your report to receive full credit. Your code should be well commented to explain your logic.
4. Compress the following deliverables into a single zip file named `<lastname>_<firstname>_final_case_study.zip` and submit it via **Brightspace**.
	**Deliverables**:
	1. Source code (Jupyter Notebook or Python scripts) implementing the tasks outlined below.
	2. A brief report (Markdown or PDF) summarizing your findings and methodologies in two pages (approximately 1000 words). The report should include:
		1. Outputs from each part of the case study (you can copy and paste relevant plots and tables from your code outputs).
		2. Which AI tools or assistants you used (e.g., GitHub Copilot, ChatGPT, etc.).
		3. Limitations and challenges you faced while using AI:
			1. Do the tools make mistakes? If so, what kind?
			2. How did you verify the correctness of the AI-generated code or suggestions?
		4. Ethical considerations when using AI for data science tasks.
		5. References to any external resources or documentation you consulted.

If you have any questions, please contact me directly by email; I can also answer over a Zoom call. Complete the work on your own and do not share your code with others.

## Tasks
Questions you need to answer in this case study are organized into 4 parts:

### Part 1 — Data Understanding (15 pts)
1. Import libraries and load the data (5 pts)
2. Describe the dataset (size, features, meaning of the target, etc.) (5 pts)
   - display the first 5 rows
   - get the dimensions of the data
   - get the class distribution
   - generate a bar plot of the class distribution
3. Create a separate feature set (data matrix X) and target (1D vector y) and print the dimensions of each (5 pts)
   - create train and test sets

```python
import pandas as pd
import numpy as np

# Load the spambase dataset
df = pd.read_csv('../attachments/spambase.csv')

# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nTotal missing values:", df.isnull().sum().sum())
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df) * 100).round(2))
print("\nDataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
```
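
The Part 1 items (first rows, dimensions, class distribution, X/y split, train/test split) can be sketched as follows. This is a minimal sketch rather than a reference solution: the helper `describe_and_split` and the small stand-in frame are hypothetical, and the code assumes the target is the last column, coded 1 for spam and 0 for non-spam. With the real data, pass the frame returned by `pd.read_csv('../attachments/spambase.csv')` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def describe_and_split(df, test_size=0.3, seed=42):
    """Print a short dataset description and return a stratified split.

    Assumes (hypothetically) that the last column of `df` is the target.
    """
    print(df.head())                          # first 5 rows
    print("Shape (rows, columns):", df.shape)

    X = df.iloc[:, :-1]                       # feature matrix
    y = df.iloc[:, -1]                        # 1D target vector
    print("X shape:", X.shape, "| y shape:", y.shape)

    # Class distribution, ordered by label so that 0 comes before 1
    class_counts = y.value_counts().sort_index()
    print(class_counts)
    # For the bar plot: class_counts.plot(kind="bar") (requires matplotlib)

    # stratify=y keeps the class ratio the same in both subsets
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )


# Stand-in data so the sketch runs without the CSV file
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 4)), columns=["f1", "f2", "f3", "f4"])
demo["spam"] = np.r_[np.zeros(60, dtype=int), np.ones(40, dtype=int)]

X_train, X_test, y_train, y_test = describe_and_split(demo)
print("Train:", X_train.shape, "| Test:", X_test.shape)
```

Stratifying the split is a design choice, not a requirement of the task; it keeps the spam share comparable between the train and test sets.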

### Part 2 — Data Handling (35 pts)
1. Data Quality & Outlier Detection (10 pts)
   - Identify and visualize outliers using statistical methods 
   - Discuss the impact of outliers on classification performance; decide whether to keep, remove, or cap them, and justify your choice
2. Feature Scaling/Normalization (10 pts)
   - Apply standardization (z-score) and/or normalization (min-max)
   - Justify why scaling is important for spam classification
3. Handling Class Imbalance (5 pts)
   - If the classes are imbalanced, apply a balancing technique (oversampling, undersampling, or class weights)
4. Feature Importance Analysis (5 pts)
   - Identify which word/character features are most predictive of spam
   <!-- - Use correlation analysis, mutual information, or permutation importance -->
   - Visualize top 10-15 important features
5. Dimensionality Reduction/Visualization (5 pts)
   - Apply PCA or t-SNE to reduce features for visualization
   <!-- - Plot spam vs. non-spam clusters in 2D space -->
   - Discuss separability and implications for classification

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Load the spambase dataset
df = pd.read_csv('../attachments/spambase.csv')

# Separate features (X) and target (y)
# Assuming last column is the target (spam: 1, not spam: 0)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

print("Dataset shape:", X.shape)
print("Target distribution:")
print(y.value_counts())
print("\nFeature statistics (first 5 features):")
print(X.iloc[:, :5].describe())
```

#### 2.1 Data Quality & Outlier Detection

```python
# Outlier detection using IQR method
def detect_outliers_iqr(data, column):
    """Return a boolean mask marking values outside the 1.5 * IQR fences."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Note: if Q1 == Q3 the IQR is 0, so every value that differs from
    # that quartile is flagged as an outlier.
    return (data[column] < lower_bound) | (data[column] > upper_bound)

# Detect outliers in each feature
outlier_counts = {}
for col in X.columns:
    outliers = detect_outliers_iqr(X, col).sum()
    outlier_counts[col] = outliers

print("Outlier counts per feature (IQR method):")
print(f"Features with outliers: {sum(1 for v in outlier_counts.values() if v > 0)}")
print(f"Total outlier instances: {sum(outlier_counts.values())}")

# Visualize the four features with the most flagged values
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
top_features = sorted(outlier_counts.items(), key=lambda x: x[1], reverse=True)[:4]

for idx, (feature, count) in enumerate(top_features):
    axes[idx].boxplot([X[feature][y == 0], X[feature][y == 1]])
    # Set tick labels separately; the boxplot `labels` keyword is
    # deprecated in recent Matplotlib releases.
    axes[idx].set_xticklabels(['Non-Spam', 'Spam'])
    axes[idx].set_title(f'{feature}\n(Outliers: {count})')
    axes[idx].set_ylabel('Value')

plt.tight_layout()
plt.savefig('outlier_detection.png', dpi=100, bbox_inches='tight')
plt.show()

print("\nDecision: Keep outliers (they may represent legitimate spam characteristics like excessive punctuation)")
print("Outliers will be retained as they are informative for spam detection.")
```
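
If you decide to cap outliers instead of keeping them, one option is to clip each feature to its IQR fences, computing the fences on the training data only so that no information from the test set leaks in. The following is a minimal sketch that assumes the same 1.5 × IQR rule as the detection step; `iqr_cap_bounds` and `apply_cap` are hypothetical helper names, and the two small frames are stand-in data.

```python
import pandas as pd


def iqr_cap_bounds(train_df, k=1.5):
    """Return per-column (lower, upper) fences computed from training data."""
    q1 = train_df.quantile(0.25)
    q3 = train_df.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def apply_cap(df, lower, upper):
    """Clip each column of `df` to the given per-column fences."""
    return df.clip(lower=lower, upper=upper, axis=1)


# Stand-in data (assumption): one frequency-like feature with an extreme value
train = pd.DataFrame({"freq_a": [0.0, 1.0, 2.0, 3.0, 100.0]})
test = pd.DataFrame({"freq_a": [-50.0, 2.5, 80.0]})

lower, upper = iqr_cap_bounds(train)          # fences from the training set only
train_capped = apply_cap(train, lower, upper)
test_capped = apply_cap(test, lower, upper)   # same fences reused on the test set

print("Fences:", lower["freq_a"], upper["freq_a"])
print("Capped test values:", test_capped["freq_a"].tolist())
```

For a feature whose quartiles coincide (IQR = 0), both fences equal that quartile and clipping collapses the column to a single value, which is one argument for keeping outliers in sparse frequency features.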

#### 2.2 Feature Scaling/Normalization

```python
# Compare feature scaling approaches
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. No scaling (baseline)
lr_no_scale = LogisticRegression(max_iter=1000, random_state=42)
lr_no_scale.fit(X_train, y_train)
y_pred_no_scale = lr_no_scale.predict(X_test)
y_pred_proba_no_scale = lr_no_scale.predict_proba(X_test)[:, 1]

acc_no_scale = accuracy_score(y_test, y_pred_no_scale)
f1_no_scale = f1_score(y_test, y_pred_no_scale)
auc_no_scale = roc_auc_score(y_test, y_pred_proba_no_scale)

print("Performance WITHOUT scaling:")
print(f"  Accuracy: {acc_no_scale:.4f}, F1-Score: {f1_no_scale:.4f}, AUC: {auc_no_scale:.4f}\n")

# 2. Standardization (Z-score); the scaler is fitted on the training set only
scaler_std = StandardScaler()
X_train_std = scaler_std.fit_transform(X_train)
X_test_std = scaler_std.transform(X_test)

lr_std = LogisticRegression(max_iter=1000, random_state=42)
lr_std.fit(X_train_std, y_train)
y_pred_std = lr_std.predict(X_test_std)
y_pred_proba_std = lr_std.predict_proba(X_test_std)[:, 1]

acc_std = accuracy_score(y_test, y_pred_std)
f1_std = f1_score(y_test, y_pred_std)
auc_std = roc_auc_score(y_test, y_pred_proba_std)

print("Performance WITH Standardization (Z-score):")
print(f"  Accuracy: {acc_std:.4f}, F1-Score: {f1_std:.4f}, AUC: {auc_std:.4f}\n")

# 3. Min-Max Normalization
scaler_minmax = MinMaxScaler()
X_train_minmax = scaler_minmax.fit_transform(X_train)
X_test_minmax = scaler_minmax.transform(X_test)

lr_minmax = LogisticRegression(max_iter=1000, random_state=42)
lr_minmax.fit(X_train_minmax, y_train)
y_pred_minmax = lr_minmax.predict(X_test_minmax)
y_pred_proba_minmax = lr_minmax.predict_proba(X_test_minmax)[:, 1]

acc_minmax = accuracy_score(y_test, y_pred_minmax)
f1_minmax = f1_score(y_test, y_pred_minmax)
auc_minmax = roc_auc_score(y_test, y_pred_proba_minmax)

print("Performance WITH Min-Max Normalization:")
print(f"  Accuracy: {acc_minmax:.4f}, F1-Score: {f1_minmax:.4f}, AUC: {auc_minmax:.4f}\n")

# Comparison table
scaling_comparison = pd.DataFrame({
    'Scaling Method': ['No Scaling', 'Standardization', 'Min-Max Normalization'],
    'Accuracy': [acc_no_scale, acc_std, acc_minmax],
    'F1-Score': [f1_no_scale, f1_std, f1_minmax],
    'AUC': [auc_no_scale, auc_std, auc_minmax]
})

print("Scaling Comparison:")
print(scaling_comparison)

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
scaling_comparison.set_index('Scaling Method')[['Accuracy', 'F1-Score', 'AUC']].plot(kind='bar', ax=ax)
plt.title('Impact of Feature Scaling on Logistic Regression Performance')
plt.ylabel('Score')
plt.xticks(rotation=45, ha='right')
plt.legend(loc='lower right')
plt.tight_layout()
plt.savefig('scaling_comparison.png', dpi=100, bbox_inches='tight')
plt.show()

# Report the best method by F1-score instead of hard-coding the conclusion
best_method = scaling_comparison.loc[scaling_comparison['F1-Score'].idxmax(), 'Scaling Method']
print(f"\nBest F1-score: {best_method}. The standardized features are used in the following sections.")
```

#### 2.3 Handling Class Imbalance

```python
# Requires the imbalanced-learn package, which is separate from scikit-learn
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Analyze class distribution (sorted by label so that 0 always comes first)
class_counts = y.value_counts().sort_index()
print("Class Distribution (Original Data):")
print(class_counts)
print(f"\nClass ratio (Spam:Non-Spam) = {y.sum()} : {len(y) - y.sum()}")
print(f"Imbalance ratio: {(len(y) - y.sum()) / y.sum():.2f}:1\n")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Original distribution
class_counts.plot(kind='bar', ax=axes[0], color=['steelblue', 'orange'])
axes[0].set_title('Original Class Distribution')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Non-Spam (0)', 'Spam (1)'], rotation=0)

# Pie chart
axes[1].pie(class_counts, labels=['Non-Spam', 'Spam'], autopct='%1.1f%%', colors=['steelblue', 'orange'])
axes[1].set_title('Proportion of Spam vs. Non-Spam')

plt.tight_layout()
plt.savefig('class_distribution.png', dpi=100, bbox_inches='tight')
plt.show()

# Apply oversampling to the training set only, so the test set stays untouched
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train_std, y_train)

print("Class Distribution (After Oversampling):")
print(pd.Series(y_resampled).value_counts())
print(f"New dataset size: {len(y_resampled)} (was {len(y_train)})")

# Train model with balanced data
lr_balanced = LogisticRegression(max_iter=1000, random_state=42)
lr_balanced.fit(X_resampled, y_resampled)
y_pred_balanced = lr_balanced.predict(X_test_std)
y_pred_proba_balanced = lr_balanced.predict_proba(X_test_std)[:, 1]

# Evaluate with multiple metrics
from sklearn.metrics import precision_score, recall_score

print("\nPerformance Comparison (on test set):")
comparison_df = pd.DataFrame({
    'Model': ['Without Balancing', 'With Oversampling'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_std),
        accuracy_score(y_test, y_pred_balanced)
    ],
    'Precision': [
        precision_score(y_test, y_pred_std),
        precision_score(y_test, y_pred_balanced)
    ],
    'Recall': [
        recall_score(y_test, y_pred_std),
        recall_score(y_test, y_pred_balanced)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_std),
        f1_score(y_test, y_pred_balanced)
    ]
})

print(comparison_df)

print("\nExpected pattern: oversampling tends to raise recall (more spam caught) at some cost in precision.")
print("Confirm this against the table above before drawing a conclusion.")
```
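
Class weights are an alternative to resampling that needs only scikit-learn. The sketch below compares an unweighted logistic regression with `class_weight='balanced'`. The imbalanced data set generated with `make_classification` is a stand-in for Spambase (an assumption made so the example runs on its own), so the printed numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): roughly 80% negatives and 20% positives
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

results = {}
for name, weight in [("unweighted", None), ("balanced", "balanced")]:
    # The pipeline fits the scaler on the training data only
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight=weight),
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    print(name, results[name])
```

Unlike oversampling, this leaves the training set size unchanged; the minority class simply contributes more to the loss.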

#### 2.4 Feature Importance Analysis

```python
# Feature importance using Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train_std, y_train)

# Get feature importances
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 15 Most Important Features:")
print(feature_importance.head(15))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Top 15 features bar plot
top_15 = feature_importance.head(15)
axes[0].barh(range(len(top_15)), top_15['Importance'])
axes[0].set_yticks(range(len(top_15)))
axes[0].set_yticklabels(top_15['Feature'])
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance Score')
axes[0].set_title('Top 15 Most Important Features for Spam Detection')

# Cumulative importance (as a NumPy array, so positions are unambiguous)
cumsum = (feature_importance['Importance'].cumsum() / feature_importance['Importance'].sum()).to_numpy()
# x starts at 1 so the axis reads as "number of features included"
axes[1].plot(range(1, len(cumsum) + 1), cumsum, marker='o', linestyle='-', linewidth=2)
axes[1].axhline(y=0.8, color='r', linestyle='--', label='80% threshold')
axes[1].axhline(y=0.9, color='orange', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Cumulative Importance')
axes[1].set_title('Cumulative Feature Importance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('feature_importance.png', dpi=100, bbox_inches='tight')
plt.show()

# Calculate how many features are needed for 80% and 90% importance
n_features_80 = int(np.argmax(cumsum >= 0.8)) + 1
n_features_90 = int(np.argmax(cumsum >= 0.9)) + 1

print(f"\nFeatures needed for 80% importance: {n_features_80}")
print(f"Features needed for 90% importance: {n_features_90}")
print(f"Total features: {len(X.columns)}")

print("\nInterpretation: The most important features are likely word/character frequencies")
print("that are characteristic of spam emails (e.g., '$', 'free', '!', etc.).")
```
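
Impurity-based importances from a random forest are computed on the training data and can favour features with many distinct values, so permutation importance on held-out data is a useful cross-check. The sketch below runs on synthetic stand-in data; the data set and the `feature_i` names are assumptions, not Spambase columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 10 features, 3 of them informative
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure the drop in F1
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, scoring="f1"
)

perm_importance = pd.DataFrame({
    "Feature": feature_names,
    "Mean": result.importances_mean,
    "Std": result.importances_std,
}).sort_values("Mean", ascending=False)
print(perm_importance.head(5))
```

Features that rank high in both the impurity-based and the permutation-based tables are the safer ones to report as predictive.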

#### 2.5 Dimensionality Reduction & Visualization

```python
# PCA for dimensionality reduction
pca = PCA()
X_pca = pca.fit_transform(X_train_std)

# Calculate cumulative variance explained
cumsum_var = np.cumsum(pca.explained_variance_ratio_)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cumulative explained variance (first 50 components at most)
axes[0].plot(range(1, min(51, len(cumsum_var)+1)), cumsum_var[:50], marker='o', linestyle='-', linewidth=2)
axes[0].axhline(y=0.95, color='r', linestyle='--', label='95% variance')
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('Cumulative Variance Explained')
axes[0].set_title('PCA: Cumulative Variance Explained')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# PCA 2D visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_train_std)

scatter = axes[1].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_train, cmap='RdYlBu', alpha=0.6, s=30)
axes[1].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%} variance)')
axes[1].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%} variance)')
axes[1].set_title('PCA: 2D Visualization of Spam vs Non-Spam')
cbar = plt.colorbar(scatter, ax=axes[1])
cbar.set_label('Class (0=Non-Spam, 1=Spam)')

plt.tight_layout()
plt.savefig('pca_analysis.png', dpi=100, bbox_inches='tight')
plt.show()

n_components_95 = (cumsum_var >= 0.95).argmax() + 1
print(f"\nPCA Analysis:")
print(f"Components needed for 95% variance: {n_components_95} out of {X_train_std.shape[1]}")
print(f"Variance explained by first 2 components: {cumsum_var[1]:.2%}")

# t-SNE visualization (optional, slower)
print("\nApplying t-SNE (this may take a moment)...")
# The iteration count is left at its default because the parameter name
# differs between scikit-learn versions (n_iter in older releases, max_iter in newer ones).
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
n_tsne = min(5000, len(X_train_std))  # use a subset for speed
X_tsne = tsne.fit_transform(X_train_std[:n_tsne])

fig, ax = plt.subplots(figsize=(10, 8))
# .iloc selects labels by position so they line up with the rows of X_train_std
scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_train.iloc[:n_tsne], cmap='RdYlBu', alpha=0.6, s=30)
ax.set_xlabel('t-SNE Dimension 1')
ax.set_ylabel('t-SNE Dimension 2')
ax.set_title('t-SNE: 2D Visualization of Spam vs Non-Spam')
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Class (0=Non-Spam, 1=Spam)')
plt.tight_layout()
plt.savefig('tsne_analysis.png', dpi=100, bbox_inches='tight')
plt.show()

print("\nObservations (check these against the plots above):")
print("- PCA shows moderate separation between spam and non-spam emails")
print("- t-SNE reveals better clustering, suggesting the classes are somewhat separable")
print("- High-dimensional feature space contains discriminative information")
```

### Part 3 — Modeling and Classification (35 pts)
1. Baseline classifier performance comparison (drop vs impute) (10 pts)
   - Metrics: accuracy, F1-score, AUC
   - Comparison table/plot
2. Evaluate the impact on metrics (accuracy, precision, recall, F1-score)
   - Compare model performance with and without scaling
   - Compare model performance with and without handling class imbalance
3. Advanced model training and performance discussion (15 pts)
   - Model choice justification
   - Performance comparison with baseline

(Extra Credit) Can you beat a training accuracy of 97% and a testing accuracy of 94%? Is your approach generalizable (bias-variance tradeoff)? Explain your approach and discuss.
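
One way to organise the Part 3 comparisons is to wrap preprocessing and model in a pipeline and score every candidate on the same cross-validation splits. The sketch below is an illustration under stated assumptions, not the expected solution: it uses synthetic stand-in data from `make_classification`, and gradient boosting is only one possible choice of advanced model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the Spambase X and y
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.6, 0.4], random_state=42
)

candidates = {
    "LogReg (no scaling)": LogisticRegression(max_iter=1000),
    "LogReg (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "LogReg (scaled, balanced)": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "Gradient boosting": GradientBoostingClassifier(random_state=42),
}

# The same stratified folds are reused for every candidate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

rows = []
for name, model in candidates.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=scoring, return_train_score=True
    )
    row = {"Model": name}
    for metric in scoring:
        row[f"test_{metric}"] = scores[f"test_{metric}"].mean()
    # Train accuracy is kept to expose the gap between training and test
    row["train_accuracy"] = scores["train_accuracy"].mean()
    rows.append(row)

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison.round(3))
```

A large gap between `train_accuracy` and `test_accuracy` points to overfitting, which is the kind of evidence the extra-credit question on generalization asks for.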

### Part 4 — Final Reflection (15 pts)

1. Write a brief (150–200 words) conclusion summarizing your key findings. (5 pts)
2. In a real-world spam-filtering application, how would you handle challenges related to data distribution? (5 pts)
3. What further analyses or models would you explore with more time or resources? (5 pts)