# About the Final Project

For the final project, you will identify a supervised learning problem and perform EDA and model analysis on it. The project is worth **140 total points**. The instructions include a summary of the criteria you will use to guide your submission and to review others’ submissions. You will submit **three deliverables**:

## Deliverable 1

A Jupyter notebook showing a supervised learning problem description, EDA procedure, analysis (model building and training), results, and discussion/conclusion.

If your work becomes too large to fit into one notebook (or you think one large notebook would be less readable), you can split it into several notebooks or scripts in a GitHub repository (as deliverable 3) and submit a report-style notebook or PDF instead.

If your project doesn’t fit into the Jupyter notebook format (e.g., you built an app that uses ML), write up your approach as a report and submit it as a PDF.

## Deliverable 2

A video presentation or demo of your work. The presentation should be a condensed version as if you're doing a short pitch to advertise your work, so please focus on the highlights:

1. **What problem do you solve?**
2. **What ML approach do you use, or what methods does your app use?**
3. **Show the result or run an app demo.**

The minimum video length is **5 minutes**, and the maximum length is **15 minutes**. The recommended length is about **10 minutes**. Submit the video in `.mp4` format.

## Deliverable 3

A public GitHub repository for your project (please also include the GitHub repo URL in your notebook/report and slides).

### Data By-Product
If your project creates data and you want to share it, a Kaggle dataset or a similar platform is an excellent way to do so. Similarly, if you want to make your video public, we recommend uploading it to YouTube or a similar service and posting the link(s) in your repository or blog instead of uploading the file directly to GitHub.

It is generally a good practice not to upload big files to a Git repository.


---

# Supervised Learning Rubric

## Prompt 1 — Submit Deliverable One: Jupyter Notebook or PDF Report

The Jupyter notebook should show a brief problem description, EDA procedure, analysis (model building and training), results, and discussion/conclusion. If your work doesn't fit into one notebook (or you think it will be less readable by having one large notebook), make several notebooks or scripts in the GitHub repository (as deliverable 3) and submit a report-style notebook or PDF instead.

If your project doesn't fit into the Jupyter notebook format (e.g., you built an app that uses ML), write up your approach as a report and submit it as a PDF.


| **Prompt**             | **Points** | **Description**                                                                                                           |
|------------------------|------------|---------------------------------------------------------------------------------------------------------------------------|
| **Project Topic**       |            | **Is there a clear explanation of what this project is about? Does it state clearly which type of problem?** E.g., type of learning and type of the task.  |
|                        | **0 pts**  | Not included in the project                                                                                               |
|                        | **1 pt**   | Provides one of the following: explanation of what the project is about, the type of learning/algorithms, or the type of task  |
|                        | **2 pts**  | Provides two of the following: explanation of what the project is about, states the type of learning/algorithms, or states the type of task |
|                        | **3 pts**  | Gives a clear explanation of what the project is about and clearly states both the type of learning/algorithms and the type of task |
| **Project Topic**       |            | **Is the goal of the project clearly stated?** E.g., why it’s important, what goal the author wants to achieve, or what they want to learn.  |
|                        | **0 pts**  | Not included in the project                                                                                               |
|                        | **1 pt**   | Needs improvement: attempts but doesn’t get across the motivation or goal for the project                                  |
|                        | **2 pts**  | Very Good—clearly states the motivation or the goal for the project                                                       |
| **Data**               |            | **Is the data source properly cited and described?** (including links, brief explanations)                                |
|                        | **0 pts**  | Does not include a brief explanation of where the data is from/how it was gathered or does not include a citation using a style manual like APA for a public dataset |
|                        | **1 pts**  | Includes a brief explanation of where the data is from/how it was gathered and, if the data is from a public source, cites the dataset using a style manual like APA |


| **Prompt**             | **Points** | **Description**                                                                                                           |
|------------------------|------------|---------------------------------------------------------------------------------------------------------------------------|
| **Data**               |            | **Is the data description explained properly? The data description should include the data size.**                         |
|                        |            | ● **Tabulated data:** Number of samples/rows, number of features/columns, file size in bytes (if the file is very large), data type of each feature, summary of key features, and whether the data is in multi-table form or gathered from multiple data sources.  |
|                        |            | ● **Images:** Number of samples, number of channels (e.g., color or grayscale), image file format, and whether images have the same dimensions. |
|                        |            | ● **Sequential data:** Number of documents or sound files, typical length, and other properties relevant to the data type.|
|                        | **0 pts**  | Does not include any description of the data or the data size                                                             |
|                        | **2 pts**  | Partially describes the data but does not refer to the data size or does not describe the data size appropriately for the type of data |
|                        | **4 pts**  | Describes the data, including the data size appropriately for the type of data                                            |
| **Data Cleaning**      |            | **Does the data cleaning section include clear explanations of how and why cleaning was performed?**                       |
|                        |            | 1. **Clear Explanations:** E.g., the author dropped a feature because it had too many NaN values, or imputed certain values due to a small number of missing samples. |
|                        |            | 2. **Conclusions or Discussions:** Data cleaning summary, findings, foreseen difficulties, and/or analysis strategy.       |
|                        |            | 3. **Proper Visualizations:** Utilized visualizations to address data-specific issues and ensure completeness of the cleaning process. |
|                        | **0 pts**  | Uses a dataset that hasn’t been cleaned without attempting any cleaning                                                   |
|                        | **5 pts**  | Uses a clean dataset or attempts cleaning but is missing explanations, conclusions, or proper visualizations               |
|                        | **10 pts** | Includes clear explanations, discussions, and visualizations for cleaning steps performed                                 |
| **Exploratory Data Analysis** |     | **Does the EDA include clear explanations of how and why the analysis was performed?**                                     |
|                        |            | 1. **Proper Visualizations:** E.g., histogram, correlation matrix, feature importance (if applicable), etc.                |
|                        |            | 2. **Proper Analysis:** Shows the statistical patterns or insights that guide further modeling.                            |
|                        |            | 3. **Conclusions or Discussions:** EDA summary, findings, foreseen difficulties, and/or next steps based on analysis.      |
|                        | **0 pts**  | EDA section not included                                                                                                  |
|                        | **5 pts**  | EDA does not have proper visualizations, analysis, or discussions                                                         |
|                        | **10 pts** | Simple plots like histograms and box plots without deeper analysis or conclusions                                         |
|                        | **15 pts** | EDA meets expectations with good explanations and proper visualizations, e.g., correlation matrix with analysis            |
|                        | **20 pts** | EDA goes above expectations, including advanced visualizations and multiple analyses (e.g., statistical tests)            |

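As a rough illustration of the Data, Data Cleaning, and EDA criteria above, the Python sketch below reports the data size, drops a mostly empty feature, imputes a few missing values, and computes a correlation matrix. The toy DataFrame, its column names, and the 50% missing-value threshold are hypothetical and chosen only for illustration; your dataset and cleaning decisions will differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real project dataset.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 42, 38],
    "income": [52000.0, 61000.0, 58000.0, np.nan, 75000.0, 49000.0],
    "notes": [np.nan, np.nan, np.nan, "vip", np.nan, np.nan],
    "churned": [0, 1, 0, 0, 1, 0],
})

# Data size: rows, columns, memory footprint, and per-column data types.
n_rows, n_cols = df.shape
n_bytes = df.memory_usage(deep=True).sum()
print(f"{n_rows} rows x {n_cols} columns, about {n_bytes} bytes in memory")
print(df.dtypes)

# The fraction of missing values per column guides the cleaning decisions.
missing_frac = df.isna().mean()
print(missing_frac)

# Drop features that are mostly missing (the threshold is an assumption).
MAX_MISSING = 0.5
to_drop = missing_frac[missing_frac > MAX_MISSING].index.tolist()
clean = df.drop(columns=to_drop)

# Impute the few remaining gaps in numeric columns with the column median.
numeric_cols = clean.select_dtypes(include="number").columns
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# A correlation matrix of the numeric features is a common first EDA step.
corr = clean[numeric_cols].corr()
print(corr.round(2))
```

In a notebook, pair each step with a sentence on why you did it (for example, why a column was dropped rather than imputed), since the rubric rewards the explanation and not just the code.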

| **Prompt**             | **Points** | **Description**                                                                                                           |
|------------------------|------------|---------------------------------------------------------------------------------------------------------------------------|
| **Models**             |            | **Some questions to consider:**                                                                                           |
|                        |            | ● Is the choice of model(s) appropriate for the problem?                                                                  |
|                        |            | ● Is the author aware of interaction/collinearity between features and its impact on the model choice?                     |
|                        |            | ● Did the author use multiple appropriate models?                                                                         |
|                        |            | ● Did the author investigate feature importance using appropriate metrics from the model?                                 |
|                        |            | ● Did the author use techniques to reduce overfitting or address data imbalance?                                           |
|                        |            | ● Did the author use new techniques/models not covered in the course?                                                     |
|                        | **0 pts**  | No models attempted                                                                                                       |
|                        | **5 pts**  | Model section does not choose an appropriate single model                                                                  |
|                        | **10 pts** | Single model included but does not address the rubric components above or other relevant considerations                   |
|                        | **15 pts** | Single model included and at least one of the following: addressing collinearity, feature engineering, multiple models, hyperparameter tuning, regularization, or data balancing |
|                        | **20 pts** | Single model included and at least two of the following: addressing collinearity, feature engineering, multiple models, hyperparameter tuning, regularization, or data balancing |
|                        | **25 pts** | Single model included and at least three of the following: addressing collinearity, feature engineering, multiple models, hyperparameter tuning, regularization, or data balancing |
| **Results and Analysis**|            | **Some questions to consider:**                                                                                           |
|                        |            | ● Does it have a summary of results and analysis?                                                                          |
|                        |            | ● Does it include proper visualizations (e.g., tables, graphs/plots, heat maps, statistics summary with interpretation)?   |
|                        |            | ● Does it use different evaluation metrics appropriately (e.g., F1, ROC, or AUC for imbalanced data)? Does it explain why the metric was chosen? |
|                        | **0 pts**  | No results or analysis attempted                                                                                          |
|                        | **5 pts**  | Results and analysis section does not meet expectations (lacks basic results and analysis)                                 |
|                        | **10 pts** | Includes a summary with basic results and analysis                                                                        |
|                        | **15 pts** | Includes a summary with basic results, analysis, and one of the following: good visualizations, tries different metrics, or iterates training to improve performance |
|                        | **20 pts** | Includes a summary with basic results, analysis, and two of the following: good visualizations, tries different metrics, or iterates training to improve performance |
|                        | **25 pts** | Summary includes basic results, analysis, and three of the following: good visualizations, tries different metrics, iterates training to improve performance, or compares multiple models |
| **Discussion and Conclusion** |     | **Does it include key insights, learning, and strategies for improvement?**                                                |
|                        | **0 pts**  | No discussion or conclusion attempted                                                                                     |
|                        | **5 pts**  | Includes one of the following: learning takeaways, discussion of why something didn’t work, or suggestions for improvement  |
|                        | **10 pts** | Meets expectations with two of the following: learning takeaways, discussion of why something didn’t work, or suggestions for improvement |
|                        | **15 pts** | Goes above expectations with all three of the following: learning takeaways, discussion of why something didn’t work, and suggestions for improvement |
| **Write-up**           |            | **Is the write-up organized and clear?**                                                                                   |
|                        | **0 pts**  | No, the write-up is not organized and clear                                                                                |
|                        | **5 pts**  | Yes, the write-up is organized and clear                                                                                   |

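To see why the rubric asks you to justify your evaluation metric, consider the minimal pure-Python sketch below. The labels and predictions are made up, and the helper names (`confusion_counts`, `accuracy`, `f1`) are illustrative only; in a real project you would more likely rely on a library such as scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class.

    Returns 0.0 when there are no true positives, false positives,
    or false negatives (a convention assumed for this sketch).
    """
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Baseline that always predicts the majority class.
always_negative = [0] * 100

# Hypothetical model: 6 false alarms, finds 4 of the 5 positives.
y_model = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

for name, y_pred in [("baseline", always_negative), ("model", y_model)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, "
          f"F1={f1(y_true, y_pred):.2f}")
# baseline: accuracy=0.95, F1=0.00
# model: accuracy=0.93, F1=0.53
```

The baseline scores higher on accuracy while never detecting a positive case, which is the kind of reasoning a Results and Analysis section can spell out when explaining why a metric was chosen.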

## Prompt 2 — Submit Deliverable Two: Video Presentation

Record a video of a presentation or demo of your work. The presentation should be a condensed version, as if you're doing a short pitch to advertise your work. Focus on the highlights:

1. **What problem do you solve?**
2. **What ML approach do you use, or what methods does your app use?**
3. **Show the result or run an app demo.**

### Video Requirements
- **Minimum Length**: 5 minutes  
- **Maximum Length**: 15 minutes  
- **Recommended Length**: About 10 minutes  
- **Format**: .mp4


| **Prompt**             | **Points** | **Description**                                                                                                           |
|------------------------|------------|---------------------------------------------------------------------------------------------------------------------------|
| **Does the video explain the following?** |            |                                                                                                                           |
|                        | **0 pts**  | Video presentation not included                                                                                           |
|                        | **3 pts**  | Presentation needs improvement. E.g., includes only one of the following: problem the project solves, the ML approach/methods used, or shows the results/runs an app demo |
|                        | **7 pts**  | Average presentation. E.g., includes two of the following: problem the project solves, the ML approach/methods used, or shows the results/runs an app demo |
|                        | **10 pts** | Excellent presentation. E.g., includes all of the following: problem the project solves, the ML approach/methods used, and shows the results/runs an app demo |
| **Is the video clear and organized?**  |            | Consider the following: The presentation follows a logical sequence, gives appropriate time to each section, and is well-rehearsed |
|                        | **0 pts**  | Video presentation not included                                                                                           |
|                        | **1 pt**   | Video presentation is not clear or organized, does not seem rehearsed, does not follow a logical sequence, or does not meet time length requirements |
|                        | **3 pts**  | Average quality presentation. E.g., presentation includes two of the following: follows a logical sequence, gives appropriate time to each section, or is well-rehearsed |
|                        | **5 pts**  | Very good clarity and organization. E.g., presentation has all of the following: follows a logical sequence, gives appropriate time to each section, and is well-rehearsed |


## Prompt 3 — Submit Deliverable Three: GitHub Repository Link

Create a public GitHub repository for your project (please include the GitHub repository URL in your notebook/report and slides). The repository must be public so that your peers can access it, and it must be dedicated to this project.

### Data By-Product
If your project creates data and you want to share it, **do not upload the data to GitHub**. Instead, an excellent way to share would be through a Kaggle dataset or similar platforms.

### Video Uploads
Similarly, **do not upload videos to GitHub**. If you want to share videos, you can upload them to YouTube and post the links in your GitHub repository.


| **Prompt**                                     | **Points** | **Description**                                                                                                           |
|------------------------------------------------|------------|---------------------------------------------------------------------------------------------------------------------------|
| **Does the project have a public GitHub repository with code specifically for this project?** | **0 pts**  | No, the project does not have a public GitHub repository                                                                  |
|                                                | **5 pts**  | Yes, the project has a public GitHub repository                                                                           |
| **Does the code include comments to help you understand the code?** | **0 pts**  | No, the code does not include comments                                                                                   |
|                                                | **5 pts**  | Yes, the code includes comments to indicate why the code is there or to explain tricky sections                           |
| **Is the code organized?**                      | **0 pts**  | No, the code is not organized                                                                                             |
|                                                | **5 pts**  | Yes, the code is well-organized, the repository's file structure makes sense, and the code is easy to read and follow     |
