# Initial Project Report

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add the util_ directory to the import path so its modules can be imported
import sys
sys.path.append("./util_")

# Wrangle, EDA, and modeling helper modules (currently commented out)
# import acquire_
# import prepare_
# import explore_
# import hyp_test_
# import final_visuals_
# import model_
```

## Goal

- Predict the tax-assessed values of single-family properties.

- Find the key drivers of property value for single family properties.


## Acquire

I am using the Zillow property data from the Codeup database.

- Query the following columns:
    - `bedroomcnt, bathroomcnt, calculatedfinishedsquarefeet, taxvaluedollarcnt, yearbuilt, taxamount, fips`

- 2152863 rows and 7 columns.
- 7 numeric columns and 0 object columns.
- 22778 null values in total (about 1% of the data).
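
The counts above can be reproduced with a few pandas calls. The following is a minimal sketch on a made-up three-column frame; the helper name `summarize_nulls` and the toy values are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd

def summarize_nulls(df: pd.DataFrame) -> dict:
    """Return shape, dtype counts, and null totals for a DataFrame."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        "numeric_columns": int(df.select_dtypes(include="number").shape[1]),
        "object_columns": int(df.select_dtypes(include="object").shape[1]),
        # Total number of missing cells across all columns.
        "null_cells": int(df.isna().sum().sum()),
        # Share of rows that contain at least one missing value.
        "pct_rows_with_null": round(100 * df.isna().any(axis=1).sum() / n_rows, 2),
    }

# Tiny made-up frame standing in for the queried table (hypothetical values).
toy = pd.DataFrame({
    "bedroomcnt": [3.0, 4.0, np.nan, 2.0],
    "bathroomcnt": [2.0, 2.5, 1.0, np.nan],
    "taxvaluedollarcnt": [250000.0, 410000.0, 180000.0, 95000.0],
})
summary = summarize_nulls(toy)
print(summary)
```

Reporting both the number of null cells and the share of affected rows makes it clear which of the two a percentage refers to.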

## Prepare

- Remove all nulls (about 1% of the data).
- Remove duplicated rows.
- Convert bedrooms and bathrooms from float to int.
- Remove outliers.
- Replace the FIPS codes with county names and encode the county column.
- Split the data into train, validate, and test sets (`60/20/20` split).
- Scale the numeric variables (both discrete and continuous) and keep an unscaled copy of the original data frame. Scaled columns:
    - `bedrooms, bathrooms, sqr_feet, year_built`
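
The preparation steps above could be sketched roughly as follows. Several details are assumptions made for illustration and are not confirmed by the report: the IQR rule for outliers, min-max scaling, the random seed, and the helper names `prepare`, `split_60_20_20`, and `min_max_scale`. The FIPS codes are the standard ones for these three California counties; that the data contains only these three is also an assumption.

```python
import numpy as np
import pandas as pd

FIPS_TO_COUNTY = {6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"}
RENAMES = {
    "bedroomcnt": "bedrooms", "bathroomcnt": "bathrooms",
    "calculatedfinishedsquarefeet": "sqr_feet", "taxvaluedollarcnt": "tax_value",
    "yearbuilt": "year_built", "taxamount": "tax_amount",
}

def prepare(df, k=1.5):
    """Drop nulls and duplicates, cast counts to int, trim outliers, name counties."""
    df = df.dropna().drop_duplicates().rename(columns=RENAMES)
    # Note: casting to int truncates half bathrooms (2.5 becomes 2).
    df = df.astype({"bedrooms": int, "bathrooms": int})
    # Assumed outlier rule: keep values within k * IQR of the quartiles.
    for col in ("sqr_feet", "tax_value"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
    county = df["fips"].astype(int).map(FIPS_TO_COUNTY)
    df = df.drop(columns="fips").assign(county=county)
    # One-hot encode the county while keeping the readable name.
    dummies = pd.get_dummies(df["county"], prefix="county", dtype=int)
    return pd.concat([df, dummies], axis=1)

def split_60_20_20(df, seed=42):
    """Shuffle, then slice into 60% train, 20% validate, 20% test."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    a, b = n * 6 // 10, n * 8 // 10
    return shuffled.iloc[:a], shuffled.iloc[a:b], shuffled.iloc[b:]

def min_max_scale(train, validate, test, columns):
    """Scale with the train minimum and range only; returns new frames."""
    lo = train[columns].min()
    span = (train[columns].max() - lo).replace(0, 1)
    return tuple(
        part.assign(**{c: (part[c] - lo[c]) / span[c] for c in columns})
        for part in (train, validate, test)
    )

# Synthetic stand-in data (hypothetical values, not the real table).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "bedroomcnt": rng.integers(1, 6, n).astype(float),
    "bathroomcnt": rng.integers(1, 4, n).astype(float),
    "calculatedfinishedsquarefeet": rng.normal(1800, 400, n).round(),
    "taxvaluedollarcnt": rng.normal(400000, 90000, n).round(),
    "yearbuilt": rng.integers(1950, 2016, n).astype(float),
    "taxamount": rng.normal(5000, 900, n).round(2),
    "fips": rng.choice([6037.0, 6059.0, 6111.0], n),
})
raw.loc[0, "bedroomcnt"] = np.nan                # a null row to drop
raw.loc[1, "taxvaluedollarcnt"] = 50_000_000.0   # an obvious outlier

clean = prepare(raw)
train, validate, test = split_60_20_20(clean)
scale_cols = ["bedrooms", "bathrooms", "sqr_feet", "year_built"]
train_s, validate_s, test_s = min_max_scale(train, validate, test, scale_cols)
```

Fitting the scaling parameters on the train split only avoids leaking information from validate and test, and returning new frames leaves the unscaled originals available.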

## Data Dictionary

| Column Name | Description |
| ----------- | ----------- |
| bedrooms | The number of bedrooms in the property. Bedrooms refer to individual rooms used primarily for sleeping and are typically found in residential properties. |
| bathrooms | The number of bathrooms in the property. Bathrooms refer to rooms containing a toilet, sink, and typically a bathtub or shower, used for personal hygiene. |
| sqr_feet | The total square footage of the property. Square footage is a measurement of the area covered by the property, indicating its size or living space. It is often used to estimate the property's value or to determine the price per square foot. |
| tax_value | The assessed value of the property for tax purposes. Tax value represents the estimated worth of the property as determined by the local tax authority. It is used to calculate property taxes. |
| year_built | The year in which the property was constructed or built. This indicates the age of the property and can be useful in assessing its condition or historical significance. |
| tax_amount | The amount of tax owed on the property. Tax amount refers to the actual dollar amount that needs to be paid in property taxes based on the assessed tax value and local tax rates. |
| county | The county where the property is located. County refers to a specific geographic region or administrative division within a state or country. It helps identify the property's location within a broader jurisdiction. |


## Explore

**1. Is there a linear relationship between the square footage of the property and the tax value?**

- There is enough evidence to say that square feet and tax value are related, and that this relationship is **strong**.

**A. Does the mean tax value differ significantly between properties with different square footage?**

- $H_0$: The mean tax value differs significantly between properties with different square footage.

    - `We reject the null hypothesis`
    
- The test says that the mean tax value does not differ significantly between properties with different square footage.

**2. Do the number of bedrooms or bathrooms impact the tax value?**
    
- The Spearman test shows that there is enough evidence to say that bedrooms and tax value are related, and the correlation coefficient shows that this relationship is **weak**.

- The Spearman test shows that there is enough evidence to say that bathrooms and tax value are related, and the correlation coefficient shows that this relationship is **moderate**.
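
A Spearman rank correlation of this kind can be run with `scipy.stats.spearmanr`. The sketch below uses synthetic data, and both the significance level of 0.05 and the cut-offs in `strength` are illustrative assumptions; the report does not state which thresholds it used.

```python
import numpy as np
from scipy import stats

def strength(r):
    """Label |r| with illustrative cut-offs (an assumption, not the report's)."""
    size = abs(r)
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.6:
        return "moderate"
    return "strong"

# Synthetic data: tax value rises with bedroom count, plus noise.
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 7, size=500)
tax_value = 100_000 * bedrooms + rng.normal(0, 150_000, size=500)

alpha = 0.05  # assumed significance level
r, p_value = stats.spearmanr(bedrooms, tax_value)
print(f"r = {r:.2f}, p = {p_value:.3g}, strength = {strength(r)}")
print("evidence of a relationship" if p_value < alpha else "no evidence of a relationship")
```

The p-value answers whether a monotonic relationship is detectable at all, while the coefficient describes how strong it is, which is why the two are reported separately.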

**A. Is there a significant difference in the mean tax value between properties with few bedrooms or bathrooms and properties with many bedrooms or bathrooms?**

- $H_0$: There is a significant difference in the mean tax value between properties with few bedrooms or bathrooms and properties with many bedrooms or bathrooms.
    
    - `We fail to reject the null hypothesis`

- The test says that there is a `significant difference` in the mean tax value between properties with few bedrooms or bathrooms and properties with many bedrooms or bathrooms.
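
The report does not name the test used to compare group means. As one possible approach, the sketch below uses Welch's two-sample t-test (`scipy.stats.ttest_ind` with `equal_var=False`) on made-up values. In this sketch the null hypothesis is that the two group means are equal, and the significance level of 0.05 is an assumption.

```python
import numpy as np
from scipy import stats

# Made-up tax values for two hypothetical groups of properties.
few_rooms = np.array([210000, 250000, 190000, 230000, 220000, 240000], dtype=float)
many_rooms = np.array([410000, 390000, 450000, 430000, 400000, 420000], dtype=float)

alpha = 0.05  # assumed significance level
# Welch's t-test does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(many_rooms, few_rooms, equal_var=False)

if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3g}: reject equal means")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3g}: fail to reject equal means")
```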

**3. Does the county where the property is located affect the tax value?**

- The Spearman test shows that there is enough evidence to say that county and tax value are related, and the correlation coefficient shows that this relationship is **very weak**.

**A. Does the mean tax value differ significantly between properties in Los Angeles County compared to the other counties?**

- $H_0$: The mean tax value differs significantly between properties in Los Angeles County compared to the other counties.

    - `We reject the null hypothesis`
    
- The test shows that the mean tax value does not differ significantly between properties in Los Angeles County and properties in the other counties.

**4. Does the year the property was built have any influence on its tax value?**

- The Spearman test shows that there is enough evidence to say that year built and tax value are related, and that this relationship is **very weak**.

**A. Is there a significant difference in the mean tax value between properties built in different years?**

- $H_0$: There is a significant difference in the mean tax value between properties built in different years.

    - `We reject the null hypothesis`
    
- There is enough evidence to say that there is no significant difference in the mean tax value between properties built in different years.

### Explore Summary

- The test says that the mean tax value does not differ significantly between properties with different square footage.

- The test says that there is a significant difference in the mean tax value between properties with few bedrooms or bathrooms and properties with many bedrooms or bathrooms.

- The test shows that the mean tax value does not differ significantly between properties in Los Angeles County and properties in the other counties.

- There is enough evidence to say that there is no significant difference in the mean tax value between properties built in different years.

## Modeling

**Model features**

- Features: `bedroomcnt, bathroomcnt, calculatedfinishedsquarefeet, yearbuilt, county`
- Target: `taxvaluedollarcnt`
    
**Baseline RMSE**

- Mean baseline: `288337.85`
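
A mean baseline predicts the average training target for every property, and its RMSE is the bar the models need to beat. The sketch below shows the calculation on four made-up values; the helper name `rmse` is illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up training targets (hypothetical tax values).
y_train = np.array([100000.0, 200000.0, 300000.0, 400000.0])

# The baseline "model" predicts the training mean everywhere.
baseline_pred = np.full_like(y_train, y_train.mean())
baseline_rmse = rmse(y_train, baseline_pred)
print(round(baseline_rmse, 2))
```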


### Linear Regression (OLS)

OLS Regressor

- Train RMSE: `235399.31`
- Validate RMSE: `235138.62`
- RMSE Difference: `-260.69`
- R2_validate: `0.33`
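
The four metrics above can be produced with scikit-learn as in the sketch below, which runs on synthetic data rather than the project's features. The reported difference matches validate RMSE minus train RMSE (235138.62 − 235399.31 = −260.69), so the sketch uses the same convention.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

def make_xy(n):
    """Synthetic features and target (hypothetical, for illustration only)."""
    X = rng.uniform(0, 1, size=(n, 3))
    y = 300000 * X[:, 0] + 100000 * X[:, 1] + rng.normal(0, 50000, n)
    return X, y

X_train, y_train = make_xy(600)
X_validate, y_validate = make_xy(200)

model = LinearRegression().fit(X_train, y_train)

# RMSE is the square root of the mean squared error.
train_rmse = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
validate_rmse = float(np.sqrt(mean_squared_error(y_validate, model.predict(X_validate))))
rmse_difference = validate_rmse - train_rmse
r2_validate = r2_score(y_validate, model.predict(X_validate))

print(f"Train RMSE: {train_rmse:.2f}")
print(f"Validate RMSE: {validate_rmse:.2f}")
print(f"RMSE Difference: {rmse_difference:.2f}")
print(f"R2_validate: {r2_validate:.2f}")
```

A small difference between train and validate RMSE suggests the model is not overfitting.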


### LassoLars

Lasso + Lars

- Train RMSE: `235399.22` 
- Validate RMSE:  `235138.92`
- RMSE Difference:  `-260.30`
- R2_validate: `0.33`

### TweedieRegressor (GLM: Generalized Linear Model)

Tweedie Regressor

- Train RMSE: `236386.04` 
- Validate RMSE:  `236127.92`
- RMSE Difference:  `-258.11`
- R2_validate: `0.33`

### Polynomial Features

Polynomial feature regressor (degree 3)

- Train RMSE: `229654.86` 
- Validate RMSE:  `229364.37`
- RMSE Difference:  `-290.49`
- R2_validate: `0.37`
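
One common way to build a degree-3 polynomial regressor is to expand the features with `PolynomialFeatures` and fit a linear model on the result. The sketch below does this on synthetic data with a cubic signal; pairing the expansion with `LinearRegression` is an assumption, since the report does not say which estimator followed the expansion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_xy(n):
    """Synthetic single-feature data with a cubic relationship (hypothetical)."""
    x = rng.uniform(-2, 2, size=(n, 1))
    y = 50000 * x[:, 0] ** 3 + rng.normal(0, 20000, n)
    return x, y

X_train, y_train = make_xy(600)
X_validate, y_validate = make_xy(200)

linear = LinearRegression().fit(X_train, y_train)
# Expand to degree-3 terms, then fit ordinary least squares on them.
cubic = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

def validate_rmse(model):
    return float(np.sqrt(mean_squared_error(y_validate, model.predict(X_validate))))

linear_rmse = validate_rmse(linear)
cubic_rmse = validate_rmse(cubic)
print(f"linear: {linear_rmse:.2f}, cubic: {cubic_rmse:.2f}")
```

Wrapping both steps in a pipeline ensures the same expansion is applied to train, validate, and test data.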

### Test the Cubic Regression

- Train RMSE: `000` 
- Validate RMSE:  `000`
- RMSE Difference:  `000`
- R2_validate: `000`

### Comparing Models

| Model | Validation RMSE | R<sup>2</sup> Value |
| ---- | ---- | ---- |
| Baseline | 288337.85 | |
| Linear Regression (OLS) | 235138.62 | 0.33 |
| Tweedie Regressor (GLM) | 236127.92 | 0.33 |
| Lasso Lars | 235138.92 | 0.33 |
| Cubic Regression | 229364.37 | 0.37 |
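
To put the table in relative terms, the sketch below computes each model's validation RMSE reduction as a percentage of the baseline RMSE. The numbers are copied from the table; the helper name `pct_improvement` is illustrative.

```python
baseline_rmse = 288337.85

# Validation RMSE values copied from the comparison table.
validation_rmse = {
    "Linear Regression (OLS)": 235138.62,
    "Tweedie Regressor (GLM)": 236127.92,
    "Lasso Lars": 235138.92,
    "Cubic Regression": 229364.37,
}

def pct_improvement(model_rmse, baseline):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100 * (baseline - model_rmse) / baseline

improvements = {
    name: round(pct_improvement(value, baseline_rmse), 2)
    for name, value in validation_rmse.items()
}
best_model = min(validation_rmse, key=validation_rmse.get)

for name, pct in improvements.items():
    print(f"{name}: {pct}% lower RMSE than baseline")
print("Best on validation:", best_model)
```

On these figures the linear models reduce RMSE by roughly 18% and the cubic regression by roughly 20%.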

## Conclusion

### Explore

- The test says that the mean tax value does not differ significantly between properties with different square footage.

- The test says that there is a significant difference in the mean tax value between properties with few bedrooms or bathrooms and properties with many bedrooms or bathrooms.

- The test shows that the mean tax value does not differ significantly between properties in Los Angeles County and properties in the other counties.

- There is enough evidence to say that there is no significant difference in the mean tax value between properties built in different years.

### Modeling

The final (Cubic Regression) model performed better than the baseline on the validation set, reducing RMSE from 288337.85 to 229364.37, a drop of about $58973 (roughly 20%).

- Bedrooms, bathrooms, square feet, year built, and county appear to be moderate drivers of tax value.

### Recommendations

Properties located close to each other can differ greatly in value, mostly because of size, year built, and the number of rooms. To better predict home value, we should look further into the condition of each house and identify distinctive indoor or outdoor features that make some homes worth more or less in the current housing market.

### Next Steps

- Collect better data on current home conditions.