# Final Project Overview

For the final project, you will identify an **unsupervised learning problem**, then perform EDA and model analysis on it. The project is worth **140 total points**. The instructions include a summary of the criteria you will use to guide your own submission and to review others’ submissions. You will submit **three deliverables**:

## Deliverables

### Deliverable 1
A **Jupyter notebook** showing an unsupervised learning problem description, EDA procedure, analysis (model building and training), results, and discussion/conclusion.

- If your work becomes too large to fit into one notebook (or you think a single large notebook would be less readable), consider creating several notebooks or scripts in a GitHub repository (as part of Deliverable 3) and submitting a report-style notebook or PDF instead.
- If your project doesn’t fit the Jupyter notebook format, write up your approach as a report and submit it as a PDF.

### Deliverable 2
A **video presentation or demo** of your work. The presentation should be a condensed version, as if you're doing a short pitch to advertise your work, so focus on the highlights:

1. **What problem do you solve?**
2. **What ML approach do you use, or what methods does your app use?**
3. **Show the results or run an app demo.**

- Minimum video length: **5 minutes**.
- Maximum video length: **15 minutes**.
- Recommended length: **10 minutes**.
- Submit the video in `.mp4` format.

### Deliverable 3
A **public GitHub repository** with your work (please also include the GitHub repo URL in your notebook/report and slides).

### Data Byproduct
If your project creates data and you want to share it, a **Kaggle dataset or similar** is an excellent way to do so. Likewise, if you want to make your video public, we recommend uploading it to **YouTube** or a similar service and posting the link(s) in your repository or blog rather than uploading the file directly to GitHub.

It is generally good practice not to upload big files to a Git repository.

## Review Criteria
Three of your peers will review each of your three deliverables (Jupyter notebook or PDF report, video presentation, and GitHub repository) based on the rubrics for each deliverable.

Use the rubrics to guide your project to include all parts for the grade you want to achieve. The project has **140 total points**.

## Peer Review
One of the essential components of this project is **peer review**. As a professional data scientist, a critical part of your job will likely involve communicating results to key stakeholders and convincing decision-makers of your conclusions.

To further your professional development, think of your peers as work colleagues and use your report, video presentation, and GitHub repository to communicate with them. Imagine that you are writing code that co-workers will collaborate on and maintain, so make sure to organize and appropriately comment on the codebase.

Reviewing your peers' work also has critical professional value. In your career, you will maintain or work with existing code. As you assess your peers' projects, imagine that you will be collaborating with them and need to understand their codebase and evaluate whether their results lead to the conclusions they claim.

## Instructions

### Step 1: Gather Data (3 points)
- **Gather data**, and determine the method of data collection and the provenance of the data.
- In the earliest phase, select a **data source** and **problem**.
- Feel free to share and discuss your idea on the class discussion board.

### Step 2: Identify an Unsupervised Learning Problem (6 points)
- **Model building and training** may depend on data type(s) and task type(s).
- When using multiple models, at least **one** of them should be an **unsupervised approach**.
- If you're using a **Kaggle competition** or similar, focus more on model building and/or analysis to make it a valid project.
- It is reasonable to add different approaches and compare them with existing Kaggle kernels.
- Another option is to find a research paper, implement an algorithm from it, and run experiments comparing its performance to other algorithms.

### Step 3: Exploratory Data Analysis (EDA) - Inspect, Visualize, and Clean the Data (26 points)
Go through the initial data cleaning and EDA to judge whether you need to collect more or different data.

**EDA Procedure Example:**

- Describe the factors or components that make up the dataset.
- Use a box plot, scatter plot, histogram, etc., to describe the data distribution.
- Describe correlations between different factors and justify your assumptions.
- Determine whether any data needs to be transformed, and indicate which transformation you would apply (for example, a log transform).
- Check for outliers and missing values. Decide if you will discard, interpolate, or substitute them.
- Mention if specific factors are more important than others and why.
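
As a rough illustration of the procedure above, the following sketch collects missing-value counts, IQR-based outlier counts, skewness, and correlations with pandas. The toy data, the column names (`income`, `age`), the helper name `eda_summary`, the 1.5 × IQR rule, and the choice of median substitution are all assumptions made for the example; your dataset may call for different choices.

```python
import numpy as np
import pandas as pd


def eda_summary(df: pd.DataFrame) -> dict:
    """Collect basic EDA facts for the numeric columns of df (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")

    # Missing values per column.
    missing = df.isna().sum()

    # IQR rule (an assumed convention): count values more than
    # 1.5 * IQR below the first quartile or above the third quartile.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    outliers = outlier_mask.sum()

    # Strong skewness hints that a column may benefit from a log transform.
    skew = numeric.skew()

    # Pairwise Pearson correlations between numeric columns.
    corr = numeric.corr()

    return {"missing": missing, "outliers": outliers, "skew": skew, "corr": corr}


# Made-up toy data: one right-skewed column and one roughly normal column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10.0, sigma=1.0, size=200),
    "age": rng.normal(loc=40.0, scale=10.0, size=200),
})
df.loc[[3, 17], "age"] = np.nan  # inject two missing values

summary = eda_summary(df)
print(summary["missing"])
print(summary["outliers"])
print(summary["skew"])

# One possible cleaning decision: substitute missing ages with the median
# and log-transform the skewed column.
clean = df.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["log_income"] = np.log1p(clean["income"])
```

Whatever you decide (discard, interpolate, or substitute), state the reasoning in the notebook next to the code so reviewers can follow it.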

### Step 4: Perform Analysis Using Unsupervised Learning Models of Your Choice, Present Discussion, and Conclusions (70 points)
- Model building and training may depend on data type(s) and task type(s).
- Compare multiple models to show your understanding of which models work better and why.
- At least **one** model should be an **unsupervised approach**.
- Show evidence of hyperparameter optimization.
- If your project involves making a **web app** (not required), include a demo.
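
One way to approach the model comparison and hyperparameter points above is sketched below: two scikit-learn clustering models are swept over the number of clusters and scored with the silhouette coefficient. The synthetic `make_blobs` data, the two chosen models, the range of `k`, the helper name `sweep`, and the use of silhouette score as the selection metric are all assumptions for illustration, not requirements of the project.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)


def sweep(make_model, k_values):
    """Fit one model per k and return {k: silhouette score} (hypothetical helper)."""
    scores = {}
    for k in k_values:
        labels = make_model(k).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


k_values = range(2, 8)
results = {
    "kmeans": sweep(
        lambda k: KMeans(n_clusters=k, n_init=10, random_state=0), k_values
    ),
    "agglomerative": sweep(
        lambda k: AgglomerativeClustering(n_clusters=k, linkage="ward"), k_values
    ),
}

# For each model, pick the k with the highest silhouette score.
best = {name: max(scores, key=scores.get) for name, scores in results.items()}

for name, scores in results.items():
    print(name, "best k:", best[name], "score:", round(scores[best[name]], 3))
```

A single internal metric rarely settles the question, so consider discussing why one model scores better than another and whether the resulting clusters make sense for your problem.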

### Step 5: Produce Deliverables: High-Quality, Organized Jupyter Notebook Report, Video Presentation, and GitHub Repository (35 points)
These deliverables serve two purposes:

1. **Grade for this course**.
2. **Project portfolio** for job applications.

If you haven’t used GitHub previously, please find a tutorial and get acquainted with it before the project deadline.

- Use GitHub to showcase your codebase.
- For versioning with GitHub, consider using [ReviewNB](https://www.reviewnb.com) for Jupyter notebooks.
