# Project Final Report

### Due: Midnight on April 27 (2-hour grace period) — 50 points  

### No late submissions will be accepted.


## Overview

Your final submission consists of **three components**:

---

### 1. Final Report Notebook [40 pts]

Complete all sections of this notebook to document your final decisions, results, and broader context.

- **Part A**: Select the single best model from your Milestone 2 experiments. Now that you’ve finalized your model, revisit your decisions from Milestones 1 and 2. Are there any steps you would change—such as cleaning, feature engineering, or model evaluation—given what you now know?

- **Part B**: Write a technical report following standard conventions, for example:
  - [CMU guide to structure](https://www.stat.cmu.edu/~brian/701/notes/paper-structure.pdf)
  - [Data science report example](https://www.projectpro.io/article/data-science-project-report/620)
  - The Checklist given in this week's Blackboard Lesson (essentially the same as in HOML).
    
  Your audience here is technically literate but unfamiliar with your work, like your manager or other data scientists. Be clear and precise, and include code (for illustration), charts/plots/illustrations, and an explanation of what you discovered and your reasoning process.

The idea here is that Part A would be a repository of the most important code, for further work to come, and Part B is
the technical report which summarizes your project for the data science group at your company. Do NOT assume that readers of Part B are intimately familiar with Part A; provide code for illustration as needed, but not to run.

Submit this notebook as a group via your team leader’s Gradescope account.

---

### 2. PowerPoint Presentation [10 pts]

Create a 10–15 minute presentation designed for a general audience (e.g., sales or marketing team).

- Prepare 8–12 slides, following the general outline of the sections of Part B. 
- Focus on storytelling, visuals (plots and illustrations), and clear, simplified language. No code!
- Use any presentation tool you like, but upload a PDF version.
- List all team members on the first slide.

Submit as a group via your team leader’s Gradescope account.

---

### 3. Individual Assessment

Each team member must complete the Individual Assessment Form (same as in Milestone 1), sign it, and upload it via their own Gradescope account.

---

## Submission Checklist

-  Final Report Notebook — Team leader submission
-  PDF Slides — Team leader submission
-  Individual Assessment Form — Each member submits their own


## Part A: Final Model and Design Reassessment [10 pts]

In this part, you will finalize your best-performing model and revisit earlier decisions to determine if any should be revised in light of your complete modeling workflow. You’ll also consolidate and present the key code used to run your model on the preprocessed dataset, with thoughtful documentation of your reasoning.

**Requirements:**

- Reconsider **at least one decision from Milestone 1** (e.g., preprocessing, feature engineering, or encoding). Explain whether you would keep or revise that decision now that you know which model performs best. Justify your reasoning.
  
- Reconsider **at least one decision from Milestone 2** (e.g., model evaluation, cross-validation strategy, or feature selection). Again, explain whether you would keep or revise your original decision, and why.

- Below, include all code necessary to **run your final model** on the processed dataset. This section should be a clean, readable summary of the most important steps from Milestones 1 and 2, adapted as needed to fit your final model choice and your reconsiderations as just described. 

- Use Markdown cells and inline comments to explain the structure of the code clearly but concisely. The goal is to make your reasoning and process easy to follow for instructors and reviewers.

> Remember: You are not required to change your earlier choices, but you *are* required to reflect on them and justify your final decisions.


#TODO: Dumb question here but do we paste in from

In [1]:
# Add as many code cells as you need

**And don't forget about commentary cells!**

____

## Part B: Final Data Science Project Report Assignment [30 pts]

This final report is the culmination of your semester-long Data Science project, building upon the exploratory analyses and modeling milestones you've already completed. Your report should clearly communicate your findings, analysis approach, and conclusions to a technical audience. The following structure and guidelines, informed by best practices, will help you prepare a professional and comprehensive document.

### Required Sections

Your report must include the following sections:


___

#### **1. Executive Summary (Abstract) [2 pts]**
- **Brief overview of the entire project (150–200 words)**
- **Clearly state the objective, approach, and key findings**

TODO: I wonder if the key findings should be *after* our changes here? aka re-do the last paragraph when done? 

**Rough Draft w/ TaxValue** The goal of this project is to develop a predictive model using Zillow’s housing dataset to estimate the tax-assessed value of residential properties (taxvaluedollarcnt). By leveraging various home features—such as square footage, location, and year built—the model aims to build on the success of Zillow’s widely-used Zestimate tool by expanding its capabilities to include predictions of a home’s tax-assessed value. This would offer homeowners greater transparency into how their property taxes—one of the most significant ongoing expenses after purchasing a home—are determined.


Our approach is to apply regression techniques to identify the key factors that influence property valuations, build a model that predicts taxvaluedollarcnt from the features Zillow provides, and minimize prediction error as measured by metrics such as Root Mean Squared Error (RMSE).

Our key finding was that, although ensemble methods such as Gradient Boosted Trees improved RMSE over baseline estimates, the models still had an average error above \$300,000. Because the models performed significantly worse on homes with a taxvaluedollarcnt over \$2,000,000 (consistently undershooting the target), we suspect that this average is inflated by extreme outliers in the data (some close to \$49,000,000).

___

#### **2. Introduction [2 pts]**
- **Clearly introduce the topic and context of your project**
- **Describe the problem you are addressing (the problem statement)**
- **Clearly state the objectives and goals of your analysis**

Note: You may imaginatively consider this project as taking place in a real estate company with a small data science group in-house, and write your introduction from this point of view (don't worry about verisimilitude to an actual company!).  

TODO: There was a bunch of talk in the live session about it being the tax value and not the market value, and that we should humor them and pretend that we're pitching this in some way -- so let me know what you think

**Rough Draft** 

Released 15 years ago, Zillow’s **Zestimate** feature revolutionized transparency in the real estate sector by giving home buyers instant access to accurate, market-based property valuations -- allowing consumers to easily compare listing prices to market estimates with just a click.

But while this market value is critical during the initial real estate transaction, there is another metric that impacts the finances of homeowners long after the initial purchase, and that is the **tax assessed value**. 

Yet although property taxes represent one of the largest ongoing costs for homeowners across the nation, many homeowners have little visibility into how this value is determined, or whether it is fair when compared to similar properties.

As part of our in-house data science initiative, we aim to address this gap by leveraging machine learning to develop a model that accurately predicts a property's tax assessed value. In this, our main objectives were clear: 

- Identify the most important features when assessing a home's tax value. 

- Develop a machine learning model to accurately predict a home's tax assessed value from Zillow's database.
 



___

#### **3. Data Description [2 pts]**
- **Describe the source of your dataset (described in Milestone 1)**
- **Clearly state the characteristics of your data (size, types of features, missing values, target, etc.)**

**Data Set Source**


The dataset for this project is taken from the Kaggle competition that Zillow ran in 2017 to refine its 'Zestimate' (reference 1). The dataset was originally edited down to 55 columns and 77,613 rows. Although the original dataset was structured with the home value as the target variable, the target variable in our case was 'taxvaluedollarcnt'.

**Duplicated and Missing Values:**

Our initial analysis revealed that 199 of these rows were duplicates, and that the following fields contained null / missing values:

In [None]:
# profile_dataset(df)
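The commented-out `profile_dataset(df)` call above refers to a helper whose body is not shown in this notebook. As a rough sketch of what such a helper might report (the function body, the toy data, and the column names other than `taxvaluedollarcnt` are assumptions for illustration, not our actual implementation or data):

```python
import numpy as np
import pandas as pd


def profile_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: report duplicate rows and per-column null counts."""
    n_dupes = int(df.duplicated().sum())  # identical rows after their first occurrence
    print(f"rows={len(df)}, columns={df.shape[1]}, duplicate rows={n_dupes}")

    nulls = df.isna().sum()
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": nulls,
        "pct_null": (nulls / len(df) * 100).round(1),
    })
    # Keep only columns that actually have missing values, worst first
    return summary[summary["n_null"] > 0].sort_values("pct_null", ascending=False)


# Toy data standing in for the real dataset (all values are made up)
toy = pd.DataFrame({
    "bedroomcnt": [3, 3, 2, 4],
    "poolcnt": [1.0, 1.0, np.nan, np.nan],
    "taxvaluedollarcnt": [250000.0, 250000.0, np.nan, 410000.0],
})
report = profile_dataset(toy)
print(report)
```

Sorting by percentage missing puts the columns most likely to need dropping or imputation at the top of the report.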

After analyzing the dataset, the following samples were dropped: 

- Samples with excessive null / missing value counts:
  - Samples missing more than 60% of their values were dropped because such rows likely lack sufficient information to contribute meaningfully to the analysis or model training.

- Samples with missing target values:
  - Any sample missing a target value was removed, as it could not contribute to a supervised learning task, which requires a target value for every sample.

- Extreme outliers in the target variable:
  - We removed samples that are outliers in the target variable (using the IQR method) because these extreme values might be due to data errors or represent atypical cases that could skew the model.
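The three filters above can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the 1.5 multiplier on the IQR is the conventional choice and is an assumption here, and the toy rows and the feature column names are made up.

```python
import numpy as np
import pandas as pd

TARGET = "taxvaluedollarcnt"


def clean_samples(df, target=TARGET, max_missing_frac=0.60, iqr_k=1.5):
    """Sketch of the row-level filters; iqr_k=1.5 is an assumed default."""
    # 1. Drop samples missing more than 60% of their values
    row_missing_frac = df.isna().mean(axis=1)
    out = df[row_missing_frac <= max_missing_frac]

    # 2. Drop samples with no target value
    out = out.dropna(subset=[target])

    # 3. Drop target outliers outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = out[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out[target].between(q1 - iqr_k * iqr, q3 + iqr_k * iqr)
    return out[in_range].reset_index(drop=True)


# Toy data: row 8 is a target outlier, row 9 is mostly empty, row 10 has no target
toy = pd.DataFrame({
    "bedroomcnt": [3, 2, 4, 3, 3, 5, 4, 4, 6, np.nan, 3],
    "bathroomcnt": [2, 1, 3, 2, 2, 3, 2, 3, 5, np.nan, 2],
    "calculatedfinishedsquarefeet": [1500, 1100, 2100, 1600, 1700, 2500,
                                     1900, 2200, 9000, np.nan, 1400],
    "yearbuilt": [1980, 1955, 2001, 1975, 1990, 2005, 1988, 1999, 2010, np.nan, 1960],
    TARGET: [200e3, 220e3, 250e3, 260e3, 300e3, 310e3, 350e3, 400e3, 49e6, 240e3, np.nan],
})
cleaned = clean_samples(toy)
print(f"{len(toy)} rows before, {len(cleaned)} rows after")
```

The order matters: the quartiles in step 3 are computed only after the sparse and unlabeled rows are gone, so they reflect the samples that would actually reach the model.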


**Categorical Features**

Further analysis revealed that the following fields were categorical: 
 
- hashottuborspa
- propertycountylandusecode
- propertyzoningdesc
- fireplaceflag
- taxdelinquencyflag

In addition, the dataset includes several enumerated fields that, while technically numeric, represent categorical data. Of the many such fields, the following could be paired with label mappings from Zillow's Kaggle enumeration key (taken from the original dataset in reference 1):

- HeatingOrSystemTypeID
- PropertyLandUseTypeID
- StoryTypeID
- AirConditioningTypeID
- ArchitecturalStyleTypeID
- TypeConstructionTypeID
- BuildingClassTypeID
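One way to treat such an enumerated field as categorical is to decode the numeric IDs into labels before encoding. The sketch below assumes a small placeholder key; the labels `type_a`, `type_b`, and `type_c` are hypothetical stand-ins and do not reproduce Zillow's actual enumeration key, and the helper name `decode_enumerated` is ours for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ID -> label excerpt; the real labels come from the
# competition's data dictionary and are NOT reproduced here.
HEATING_KEY = {1: "type_a", 2: "type_b", 3: "type_c"}


def decode_enumerated(df, column, key, unknown="unknown"):
    """Replace numeric codes with labels and store them as a categorical."""
    # IDs that are missing or absent from the key fall back to `unknown`
    labels = df[column].map(lambda code: key.get(code, unknown))
    out = df.copy()
    out[column] = pd.Categorical(labels, categories=[*key.values(), unknown])
    return out


toy = pd.DataFrame({"HeatingOrSystemTypeID": [2.0, 1.0, np.nan, 7.0, 2.0]})
decoded = decode_enumerated(toy, "HeatingOrSystemTypeID", HEATING_KEY)

# One-hot encode so a model does not read the IDs as ordered quantities
encoded = pd.get_dummies(decoded, columns=["HeatingOrSystemTypeID"])
print(encoded.columns.tolist())
```

Decoding first keeps the one-hot column names readable in feature-importance plots, and the explicit `unknown` category avoids silently dropping rows whose ID is missing.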

____

#### 4. Methodology (What you did, and why)  [12 pts]

**Focus this section entirely on the steps you took and your reasoning behind them. Emphasize the process and decision-making, not the results themselves**


- Clearly outline your data cleaning and preprocessing steps
  - Describe what issues you encountered in the raw data and how you addressed them.
  - Mention any key decisions (e.g., removing samples with too many missing values).
  - What worked and what didn't work?
- Describe your feature engineering approach
  - Explain any transformations, combinations, or derived features.
  - Discuss why certain features were chosen or created, even if they were later discarded.
  - What worked and what didn't work?
- Describe your analytical framework 
  - Use of validation curves to see the effect of various hyperparameter choices, and
  - Choice of RMSE as primary error metric
- Detail your model selection process 
  - Outline the models you experimented with and why.
  - Discuss how you evaluated generalization (e.g., cross-validation, shape and relationships of plots).
  - Mention how you tuned hyperparameters or selected the final model.



**Analytical Framework**
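As a minimal sketch of the framework named in the bullets above (validation curves with RMSE as the primary metric, evaluated by cross-validation), the example below uses scikit-learn's `validation_curve` on synthetic data. The estimator settings, the `max_depth` range, and the data are assumptions chosen for illustration, not the configuration or results of our final model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import validation_curve

# Synthetic regression data standing in for the processed Zillow features
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

depths = [1, 2, 3, 5]  # assumed hyperparameter range, for illustration only
train_scores, val_scores = validation_curve(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# scikit-learn negates RMSE so that higher is better; flip the sign back
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_rmse, val_rmse):
    print(f"max_depth={d}: train RMSE={tr:.1f}, CV RMSE={va:.1f}")

best_depth = depths[int(np.argmin(val_rmse))]
print("lowest cross-validated RMSE at max_depth =", best_depth)
```

Reading the two curves together is the point: a training RMSE that keeps falling while the cross-validated RMSE flattens or rises suggests the extra depth is fitting noise.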

____

#### 5. Results and Evaluation (What you found, and how well it worked) [10 pts]

**Focus purely on outcomes, with metrics, visuals, and insights. This is where you present evidence to support your conclusions.**

- Provide a clear and detailed narrative of your analysis and reasoning using the analytical approach described in (4). 
- Discuss model performance metrics and results (RMSE, R2, etc.)
- **Include relevant visualizations (graphs, charts, tables) with appropriate labels and captions**
- Error analysis
  - Highlight specific patterns of error, outliers, or questionable features.
  - Note anything surprising or worth improving in future iterations.
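For the error-analysis bullets above, one simple way to surface patterns of error is to break RMSE and mean residual out by bands of the true target value. The sketch below uses made-up numbers and a hypothetical helper `rmse_by_band`; the \$2,000,000 split is only an example threshold.

```python
import numpy as np
import pandas as pd


def rmse_by_band(y_true, y_pred, edges, labels):
    """Hypothetical helper: RMSE and mean residual per band of the true value."""
    df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
    df["band"] = pd.cut(df["y_true"], bins=edges, labels=labels, right=False)
    df["resid"] = df["y_pred"] - df["y_true"]  # negative means undershooting
    df["sq_err"] = df["resid"] ** 2
    grouped = df.groupby("band", observed=True)
    return pd.DataFrame({
        "n": grouped.size(),
        "rmse": np.sqrt(grouped["sq_err"].mean()),
        "mean_residual": grouped["resid"].mean(),
    })


# Made-up true values and predictions, for illustration only
y_true = [300e3, 500e3, 900e3, 2.5e6, 4.0e6]
y_pred = [320e3, 480e3, 950e3, 1.9e6, 2.8e6]

table = rmse_by_band(
    y_true, y_pred,
    edges=[0, 2e6, np.inf],
    labels=["under_2M", "2M_plus"],
)
print(table)
```

A strongly negative mean residual in the top band would indicate systematic undershooting on high-value homes, which a single overall RMSE cannot show.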


____

#### 6. Conclusion [2 pts]
- Clearly state your main findings and how they address your original objectives
- Highlight the business or practical implications of your findings 
- Discuss the limitations and constraints of your analysis clearly and transparently
- Suggest potential improvements or future directions

___

## **References**  
1. Kaggle. (2017). Zillow Prize: Zillow’s Home Value Prediction (Zestimate) [Dataset]. Kaggle. https://www.kaggle.com/competitions/zillow-prize-1/data
