## Modeling for Sale Price

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error

In [76]:
df = pd.read_csv('../../data/linear_regression_model_data/processed_data/feature_data.csv')
df.head()

Unnamed: 0,id,pid,lot_frontage,lot_area,lot_shape,overall_qual,overall_cond,year_built,year_remod_add,mas_vnr_area,...,garage_type_Detchd,garage_type_Not Applicable,sale_type_CWD,sale_type_Con,sale_type_ConLD,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD
0,109,533352170,70.0,13517,2,6,8,1976,2005,289.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,2,7,5,1996,1997,132.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,3,5,7,1953,2007,0.0,...,1,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,3,5,5,2006,2007,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,2,6,8,1900,1993,0.0,...,1,0,0,0,0,0,0,0,0,1


In [77]:
df.columns

Index(['id', 'pid', 'lot_frontage', 'lot_area', 'lot_shape', 'overall_qual',
       'overall_cond', 'year_built', 'year_remod_add', 'mas_vnr_area',
       ...
       'garage_type_Detchd', 'garage_type_Not Applicable', 'sale_type_CWD',
       'sale_type_Con', 'sale_type_ConLD', 'sale_type_ConLI',
       'sale_type_ConLw', 'sale_type_New', 'sale_type_Oth', 'sale_type_WD '],
      dtype='object', length=191)

In [78]:
X = df.drop(columns=['sale_price', 'id', 'pid'])
y = df['sale_price']

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

In [80]:
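# Drop two-valued columns whose values sum to less than 10 in the training split
# (for 0/1 dummies: fewer than 10 positive rows)
rare_features = [col for col in X_train.columns if X_train[col].nunique() == 2 and X_train[col].sum() < 10]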
X_train = X_train.drop(columns=rare_features)
X_test = X_test.drop(columns=rare_features)

In [81]:
# Identify columns whose maximum in X_train exceeds 10
# (leaves dummies and small ordinal ratings unscaled)
columns_to_scale = [col for col in X_train.columns if X_train[col].max() > 10]

# Standard scale only those columns (fit on training data only)
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[columns_to_scale] = scaler.fit_transform(X_train[columns_to_scale])
X_test_scaled[columns_to_scale] = scaler.transform(X_test[columns_to_scale])
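
The cell above fits the scaler on the training split only and reuses those statistics for the test split. A minimal pure-Python sketch of the same idea follows; the lot areas and the `standardize` helper are illustrative assumptions, not values or code from this notebook.

```python
from statistics import fmean, pstdev

# Hypothetical lot areas; not taken from feature_data.csv.
train_lot_area = [8000.0, 10000.0, 12000.0]
test_lot_area = [9000.0, 14000.0]

# "Fit": learn the mean and population standard deviation from training data only.
# StandardScaler also uses the population (ddof=0) standard deviation.
mu = fmean(train_lot_area)
sigma = pstdev(train_lot_area)

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted statistics."""
    return [(x - mu) / sigma for x in values]

train_scaled = standardize(train_lot_area, mu, sigma)
test_scaled = standardize(test_lot_area, mu, sigma)  # reuses training mu and sigma
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, which is expected: the test split must not influence the fitted statistics.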

In [82]:
# Train Lasso model
lasso = Lasso(alpha=1.0, max_iter=10_000)
lasso.fit(X_train_scaled, y_train)


In [83]:
# Predict and evaluate
y_pred = lasso.predict(X_test_scaled)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Root Mean Squared Error: ${rmse:,.0f}")
print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error: ${mae:,.0f}")

Root Mean Squared Error: $27,530
R² Score: 0.8858
Mean Absolute Error: $19,631
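
`root_mean_squared_error` returns the square root of the mean squared error, so the first value above is in dollars, like the MAE. As a rough illustration of how the three metrics relate, here is a hand computation on a few made-up prices (the numbers are assumptions for the sketch, not model output):

```python
import math

# Hypothetical sale prices and predictions, for illustration only.
y_true = [200_000.0, 150_000.0, 320_000.0, 180_000.0]
y_hat = [210_000.0, 140_000.0, 300_000.0, 185_000.0]

errors = [t - p for t, p in zip(y_true, y_hat)]
n = len(errors)

mse = sum(e ** 2 for e in errors) / n   # mean squared error, in dollars squared
rmse = math.sqrt(mse)                   # root mean squared error, in dollars
mae = sum(abs(e) for e in errors) / n   # mean absolute error, in dollars

# R² compares the residual sum of squares with the variation around the mean.
mean_y = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few large misses dominate the error.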


## Interpreting Coefficients

In [84]:
# Get the standard deviations used by the scaler
scaler_std = scaler.scale_  # this aligns with columns_to_scale
scale_map = dict(zip(columns_to_scale, scaler_std))

# Convert to Series for easier mapping
coef_series = pd.Series(lasso.coef_, index=X_train.columns)

# Convert each nonzero coefficient to dollars per raw unit of its feature
adjusted_coef = {}

for feature, coef_value in coef_series.items():
    if coef_value != 0:
        if feature in scale_map:
            # Feature was standardized: divide by its std to get dollars per raw unit
            true_coef = coef_value / scale_map[feature]
        else:
            # Feature was not scaled (dummy or small ordinal rating): already per raw unit
            true_coef = coef_value
        adjusted_coef[feature] = true_coef

# Convert to DataFrame for display
adjusted_coef_df = pd.Series(adjusted_coef).sort_values(ascending=False).reset_index()
adjusted_coef_df.columns = ['Feature', 'Estimated Dollar Impact']
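
Dividing a coefficient by the scaler's standard deviation converts "dollars per standard deviation" into "dollars per raw unit". The sketch below checks that conversion on a one-feature, unregularized least-squares fit with made-up data. The `ols_slope` helper and all numbers are assumptions for illustration. A Lasso coefficient is additionally shrunk by the penalty, but the unit conversion itself is the same.

```python
from statistics import fmean, pstdev

def ols_slope(x, y):
    """Slope of a one-feature least-squares fit: cov(x, y) / var(x)."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

# Hypothetical living areas (sq ft) and prices; not taken from the dataset.
area = [1000.0, 1500.0, 2000.0, 2500.0]
price = [150_000.0, 200_000.0, 250_000.0, 300_000.0]

mu, sigma = fmean(area), pstdev(area)
area_scaled = [(a - mu) / sigma for a in area]

slope_scaled = ols_slope(area_scaled, price)  # dollars per standard deviation of area
slope_raw = ols_slope(area, price)            # dollars per square foot
slope_recovered = slope_scaled / sigma        # same conversion as coef / scaler.scale_
```

Here `slope_recovered` matches `slope_raw`, which is why the adjusted values can be read as dollars per raw unit of each feature.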

In [88]:
# Show the top 50 features (sorted by adjusted coefficient, descending)
adjusted_coef_df.head(50)

Unnamed: 0,Feature,Estimated Dollar Impact
0,neighborhood_StoneBr,52055.60904
1,garage_type_Not Applicable,50073.125034
2,neighborhood_NridgHt,35014.96099
3,bldg_type_2fmCon,31916.122895
4,condition_1_PosN,27241.638924
5,ms_subclass_75,24387.905036
6,roof_matl_Tar&Grv,17559.96527
7,exterior_2nd_Brk Cmn,17055.690152
8,garage_type_BuiltIn,16801.979125
9,neighborhood_NoRidge,16748.078386
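
Because the table is sorted in descending order, large negative coefficients end up at the bottom and are easy to overlook. One option, sketched here with invented feature names and values, is to rank by magnitude instead:

```python
# Hypothetical adjusted coefficients in dollars; the negative entries are invented
# to show what a descending sort would push to the bottom of the table.
adjusted = {
    "feature_a": 52_000.0,
    "feature_b": -41_000.0,
    "feature_c": 16_700.0,
    "feature_d": -3_200.0,
}

# Rank by magnitude so large negative effects appear next to large positive ones.
ranked = sorted(adjusted.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<12}{value:>+12,.0f}")
```

With the notebook's DataFrame, a similar ranking should be possible by sorting on the absolute value of the `Estimated Dollar Impact` column.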


## Data-Driven Recommendations to Increase Sale Price

### Neighborhood

| Feature                 | Impact ($) |
|-------------------------|------------|
| `neighborhood_StoneBr` | +52,056     |
| `neighborhood_NridgHt` | +35,015     |
| `neighborhood_NoRidge` | +16,748     |

**Recommendation**:  
Focus on acquiring, developing, or improving homes in **premium neighborhoods** like Stone Brook, Northridge Heights, and Northridge, which are consistently associated with higher sale prices.

---

### Garage Type & Quality

| Feature                        | Impact ($) |
|--------------------------------|------------|
| `garage_type_Not Applicable`   | +50,073     |
| `garage_type_BuiltIn`          | +16,802     |
| `garage_type_Detchd`           | +13,676     |
| `garage_type_Attchd`           | +13,290     |
| `garage_type_Basment`          | +12,201     |
| `garage_qual`                  | +11,841     |
| `garage_cars`                  | +9,809      |

**Recommendation**:  
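Every garage type in the table carries a positive coefficient, so these values are best read as offsets relative to the garage categories left out of the table, not as the dollar value of a garage. In particular, the large `garage_type_Not Applicable` coefficient does not mean that **removing a garage** adds value: homes without a garage also forgo whatever `garage_qual` and `garage_cars` contribute, and the dummy likely compensates for that. The more actionable signals are **garage quality** and **space for multiple cars**.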
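
Dummy-variable coefficients such as `garage_type_*` are offsets from a reference category, not absolute dollar amounts. The sketch below uses invented prices and an arbitrarily chosen baseline to show this for the simplest case: an unregularized least-squares fit with an intercept and a single dummy-encoded feature. It computes the values such a fit would assign from group means and does not run a regression.

```python
from statistics import fmean

# Hypothetical prices grouped by garage type; not taken from the dataset.
prices = {
    "CarPort": [100_000.0, 110_000.0],   # treated as the dropped baseline here
    "Detchd": [130_000.0, 140_000.0],
    "Attchd": [180_000.0, 200_000.0],
}
baseline = "CarPort"

group_mean = {k: fmean(v) for k, v in prices.items()}

# With an intercept and one dummy per non-baseline category, least squares gives
# intercept = baseline mean and each dummy coefficient = group mean - baseline mean.
intercept = group_mean[baseline]
dummy_coef = {k: m - intercept for k, m in group_mean.items() if k != baseline}
```

Both remaining categories get a positive coefficient simply because the baseline is the cheapest group. Choosing a different baseline would shift every coefficient by the same amount.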

---

### Exterior Materials & Features

| Feature                      | Impact ($) |
|------------------------------|------------|
| `exterior_2nd_Brk Cmn`       | +17,056     |
| `exterior_1st_BrkFace`       | +15,995     |
| `roof_matl_Tar&Grv`          | +17,560     |
| `roof_matl_CompShg`          | +15,761     |
| `mas_vnr_type_Stone`         | +15,592     |
| `exterior_1st_CemntBd`       | +13,883     |
| `mas_vnr_type_BrkFace`       | +8,563      |
| `exterior_2nd_Wd Sdng`       | +7,770      |
| `exterior_2nd_VinylSd`       | +4,001      |
| `exterior_2nd_Plywood`       | +3,931      |
| `exterior_2nd_BrkFace`       | +3,759      |

**Recommendation**:  
Use high-end exterior materials such as **brick face**, **stone veneer**, and **cement board siding**. Roofing materials like **tar & gravel** and **composite shingles** are associated with higher value homes.

---

### House Style & Structure

| Feature               | Impact ($) |
|------------------------|------------|
| `house_style_1Story`  | +12,483     |
| `house_style_SLvl`    | +8,847      |
| `house_style_SFoyer`  | +3,570      |
| `foundation_Slab`     | +4,461      |
| `roof_style_Hip`      | +11,768     |
| `roof_style_Gable`    | +4,077      |
| `lot_config_CulDSac`  | +9,164      |
| `bldg_type_2fmCon`    | +31,916     |

**Recommendation**:  
Invest in or favor homes with **single-story or split-level layouts**, **hip roofs**, **slab foundations**, and lots in **cul-de-sacs**. Two-family conversions (2fmCon) also show strong returns.

---

### Functional Layout & Features

| Feature            | Impact ($) |
|--------------------|------------|
| `kitchen_qual`     | +9,156      |
| `exter_qual`       | +11,069     |
| `bsmt_full_bath`   | +8,762      |
| `full_bath`        | +7,685      |
| `half_bath`        | +6,816      |
| `bsmt_exposure`    | +6,299      |
| `bsmt_qual`        | +4,537      |
| `overall_qual`     | +8,731      |
| `overall_cond`     | +3,974      |

**Recommendation**:  
Boost **interior quality metrics** such as **kitchen**, **basement**, and **overall finish**. Adding or updating **bathrooms** and improving **basement exposure** also contributes significantly to sale price.

---

### Sale Type

| Feature               | Impact ($) |
|------------------------|------------|
| `sale_type_New`       | +15,037     |
| `sale_type_ConLD`     | +12,379     |

**Recommendation**:
Homes sold as **new construction** (`sale_type_New`) or on a **low-down-payment contract** (`sale_type_ConLD`) tend to command higher prices.  
These sale types may reflect **modern builds**, **better condition**, or **favorable buyer-seller terms**, all of which are attractive to buyers.  
When listing or investing, **highlight that a property is new or offered on favorable sale terms** to help maximize perceived value and sale price.

---

### Dwelling Type Recommendations (MS SubClass)

| MS SubClass | Dwelling Type Description                                 | Impact ($) |
|-------------|------------------------------------------------------------|------------|
| 075         | 2-1/2 STORY ALL AGES                                       | +24,388     |
| 050         | 1-1/2 STORY FINISHED ALL AGES                              | +11,656     |
| 030         | 1-STORY 1945 & OLDER                                       | +5,345      |
| 070         | 2-STORY 1945 & OLDER                                       | +4,831      |
| 020         | 1-STORY 1946 & NEWER ALL STYLES                            | +4,694      |
| 180         | PUD - MULTILEVEL - INCL SPLIT LEVEL/FOYER                 | +3,792      |

**Recommendation**:
- Consider **investing in or marketing properties** that fall into high-impact dwelling categories like:
  - **2-1/2 story homes** (MS Subclass 075)
  - **Finished 1-1/2 story homes** (Subclass 050)
- These styles may offer **unique design appeal**, **extra space**, or simply be associated with **higher-value areas** in your data.
- Even **older 1- and 2-story homes** (Subclasses 030, 070) show added value, possibly due to location, lot size, or charm.
- **PUD-style homes (Subclass 180)** also carry a moderate premium.

Keep in mind that these coefficients describe **associations** with higher sale prices: belonging to a given SubClass does not guarantee a price bump, but it does **signal market preference** in your dataset.

---

### Location & Condition

| Feature               | Impact ($) |
|------------------------|------------|
| `condition_1_PosN`    | +27,242     |
| `condition_1_Norm`    | +8,849      |

**Recommendation**:  
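`condition_1` describes a property's proximity to surrounding features, not its physical state. Homes **near a positive off-site feature** such as a park or greenbelt (`PosN`), and homes with **normal surroundings** (`Norm`), are associated with higher prices than the remaining categories, which mostly cover proximity to busy roads or railroads. Favor lots with these location characteristics when acquiring property.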

---

### Summary
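These results are based on a Lasso regression model trained on housing data, with dollar values derived from rescaled coefficients. Each value estimates the change in sale price for a one-unit increase in the feature (or for membership in a category), holding the other features fixed. The features above show strong associations with **higher sale price**, whether through location, material choice, structure, or functionality.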