# AIM 460 – Group Project #2: Comparative Classification from Scratch

In this project, each group will identify and assemble **two distinct classification problems** of your own choosing. You may **not** use any prepackaged datasets from Kaggle, UCI, Hugging Face or similar repositories; instead you must collect datasets that contain at least 3 000 examples each, multiple input features, and a clearly defined target label. In your notebook you must document exactly how and where you obtained each dataset, include any cleaning scripts or SQL queries you wrote, and discuss any privacy or ethical considerations.

Your submission will be a single Jupyter or Colab notebook that implements an end-to-end pipeline for each dataset. Begin with a **Data Description** section where you explain the origin, structure and quirks of your data, show summary statistics and visualizations, and describe any missing or corrupted values. Follow that with a **Preprocessing & Feature Engineering** section in which you remove or impute missing values, detect and handle outliers, encode categorical fields, scale numerical features, and create at least three new features that reflect domain knowledge. Be sure to explain in prose why you chose each transformation and what you expect it to contribute.
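
As a rough illustration of the imputation, encoding, and scaling steps described above, the following sketch chains them with scikit-learn's `ColumnTransformer`. The column names (`age`, `income`, `city`) and the toy values are hypothetical placeholders, not part of the assignment; your own data and choice of imputation strategy will differ and should be justified in prose.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for a collected dataset.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "income": [40_000.0, 52_000.0, np.nan, 61_000.0],
    "city": ["A", "B", np.nan, "A"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the mode, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Fitting the transformer on the training split only, and then calling `transform` on validation and test data, avoids leaking statistics such as the median or the scaling parameters across splits.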

Next, implement **Logistic Regression** in two ways: first, write your own gradient-descent solver from scratch (deriving and coding the loss function, gradient calculation, learning-rate schedule, and stopping criterion), and second, compare it to scikit-learn’s `LogisticRegression`. As an innovation, you should experiment with at least one variant (such as adding momentum to the gradient step, adding L₁ or elastic-net regularization, or applying a custom learning-rate decay) and report how this changes convergence speed and final accuracy.
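
One possible shape for the from-scratch solver is sketched below, using only NumPy. It assumes binary labels in {0, 1}, a constant learning rate, an optional L₂ penalty, and a stopping rule based on the change in loss; the function names, default values, and synthetic data are all illustrative choices rather than requirements. Setting `momentum` above zero turns plain gradient descent into the classical momentum variant.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def log_loss(w, b, X, y, l2=0.0):
    # Mean binary cross-entropy plus an optional L2 penalty on the weights.
    p = sigmoid(X @ w + b)
    eps = 1e-12
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + 0.5 * l2 * np.sum(w ** 2)

def fit_logreg(X, y, lr=0.1, momentum=0.0, l2=0.0, max_iter=5000, tol=1e-6):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    vw, vb = np.zeros(d), 0.0  # velocity terms (stay unused when momentum=0)
    history = [log_loss(w, b, X, y, l2)]
    for _ in range(max_iter):
        err = sigmoid(X @ w + b) - y      # derivative of the loss w.r.t. the logit
        grad_w = X.T @ err / n + l2 * w
        grad_b = err.mean()
        vw = momentum * vw - lr * grad_w
        vb = momentum * vb - lr * grad_b
        w, b = w + vw, b + vb
        history.append(log_loss(w, b, X, y, l2))
        # Stop once the loss changes by less than tol between iterations.
        if abs(history[-2] - history[-1]) < tol:
            break
    return w, b, history

# Synthetic, noisy two-feature data used only to exercise the solver.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 2 * X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(float)

w_gd, b_gd, hist_gd = fit_logreg(X, y)
w_mom, b_mom, hist_mom = fit_logreg(X, y, momentum=0.9)
acc = np.mean((sigmoid(X @ w_gd + b_gd) > 0.5) == y)
print(len(hist_gd), len(hist_mom), acc)
```

Plotting `hist_gd` against `hist_mom` is one simple way to compare convergence speed between the baseline and the variant; the same `history` list can be compared with the final loss reached by scikit-learn's solver on identical data.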

After logistic regression, apply **Support Vector Machines** using both a linear kernel and at least one non-linear kernel (e.g., RBF). You must tune the key hyperparameters (C and kernel parameters) via grid or randomized search and include code that logs training time, number of support vectors, and classification metrics at each candidate setting. Use visualizations—such as decision-boundary plots or support-vector heatmaps—to highlight how model complexity evolves as you change C or the kernel scale.
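
Because the brief asks for training time and support-vector counts at every candidate setting, one option is an explicit loop over a `ParameterGrid` rather than a bare `GridSearchCV`. The sketch below assumes a single held-out validation split and uses `make_moons` as stand-in data; the grid values are arbitrary examples, and cross-validation would give more reliable estimates in the actual project.

```python
import time

from sklearn.datasets import make_moons
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

# Stand-in data; replace with your own preprocessed features and labels.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid: gamma only applies to the RBF kernel.
grid = ParameterGrid([
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1]},
])

log = []
for params in grid:
    model = SVC(**params)
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    log.append({
        **params,
        "fit_seconds": elapsed,
        "n_support": int(model.n_support_.sum()),  # total across both classes
        "val_f1": f1_score(y_val, model.predict(X_val)),
    })

best = max(log, key=lambda row: row["val_f1"])
for row in log:
    print(row)
print("best:", best)
```

The `log` list converts directly to a DataFrame, which makes it straightforward to plot the number of support vectors or the validation score against C and the kernel scale.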

Then implement **Naive Bayes** classification appropriate to your data (GaussianNB for continuous features or Multinomial/BernoulliNB for counts/binaries). You must show the effect of different smoothing parameters on performance, discuss the conditional-independence assumption, and compare how Naive Bayes handles rare or zero-frequency features versus logistic regression and SVM.
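
The zero-frequency issue can be seen on a very small example. In the sketch below, the count matrix over three hypothetical tokens is invented for illustration: the second token never occurs in class 0, so the smoothing parameter `alpha` of `MultinomialNB` largely decides how strongly a single occurrence of that token counts against class 0.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented token counts; column 1 never appears in class 0.
X = np.array([
    [3, 0, 1],
    [2, 0, 2],
    [4, 0, 1],
    [0, 3, 1],
    [0, 2, 2],
    [1, 4, 1],
])
y = np.array([0, 0, 0, 1, 1, 1])

# New example that mostly resembles class 0 but contains the unseen token once.
x_new = np.array([[2, 1, 1]])

p_class0 = {}
for alpha in (1e-3, 0.1, 1.0, 10.0):
    nb = MultinomialNB(alpha=alpha).fit(X, y)
    p_class0[alpha] = nb.predict_proba(x_new)[0, 0]
    print(f"alpha={alpha:<6} P(class 0 | x_new) = {p_class0[alpha]:.3f}")
```

With very little smoothing, the near-zero likelihood of the unseen token dominates the product of per-feature probabilities, which is a direct consequence of the conditional-independence factorization. Repeating the sweep on your own data, with a proper validation metric instead of a single example, is what the assignment asks for.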

Throughout your notebook, measure and compare model performance using accuracy, precision, recall, F1 score, and (if applicable) ROC-AUC. Include confusion matrices and ROC curves where relevant. Finally, perform a **cross-domain generalization** test by training the models on Dataset A and evaluating them on Dataset B (after re-preprocessing), and vice versa. In prose, interpret which algorithms transfer best across domains and why.
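
To keep the comparison consistent across models and datasets, it may help to compute all of the metrics listed above in one place. The helper below is a minimal sketch: the name `evaluate`, the 0.5 threshold, and the toy labels and scores are assumptions for illustration, and it covers the binary case only.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def evaluate(y_true, y_score, threshold=0.5):
    """Return the core binary-classification metrics for one model."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),  # uses scores, not hard labels
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

# Toy labels and predicted probabilities, invented for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9, 0.7, 0.2]

report = evaluate(y_true, y_score)
for name, value in report.items():
    print(name, value, sep=": ")
```

Calling the same helper on within-domain test data and on the other dataset gives directly comparable rows for the cross-domain generalization table.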

Your notebook must be fully reproducible: set random seeds, document all library versions, and include a brief “How to run” section at the top. The written portion should be woven into markdown cells. You will separately hand in the slide deck that you will present. Upload the notebook, the slide deck, and the data you use to GitHub. Ensure all group members' names are on the work; only one group member needs to submit.
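
For the reproducibility requirements above, a first notebook cell along the following lines is one option. The seed value and the helper name `set_seeds` are arbitrary choices; add any other libraries you rely on to the version report, and pass the seed explicitly as `random_state` to scikit-learn estimators and splitters.

```python
import platform
import random

import numpy as np
import sklearn

SEED = 460  # arbitrary value chosen for illustration

def set_seeds(seed):
    """Seed Python's and NumPy's global RNGs and return a NumPy Generator."""
    random.seed(seed)
    np.random.seed(seed)
    return np.random.default_rng(seed)

rng = set_seeds(SEED)

# Record the environment so readers can recreate it.
versions = {
    "python": platform.python_version(),
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}=={version}")
```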

Ensure that the narrative of data collection, model innovation, hyperparameter testing, results and insights is clear to any reader. You must demonstrate original coding, thoughtful feature work, and critical analysis at every step.  
