# Urban biodiversity: street tree health modeling

This notebook explores a sample of New York City street tree records with the goal of understanding which stewardship and site factors are associated with tree health. The workflow loads a reproducible dataset stored in this repository, performs exploratory analysis, and trains a logistic regression model to predict whether a tree is in good health.

## 1. Load reproducible tree census data

The NYC Street Tree Census is too large to ship directly with this repository, so a stratified sample with 1,000 observations is stored locally at `data/nyc_street_trees_sample.csv`. This avoids brittle network requests while still capturing the categorical structure of the original dataset.

In [None]:
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path().resolve()
if PROJECT_ROOT.name == 'notebooks':
    PROJECT_ROOT = PROJECT_ROOT.parent
DATA_PATH = PROJECT_ROOT / 'data' / 'nyc_street_trees_sample.csv'

trees = pd.read_csv(DATA_PATH)
print(f'Loaded {len(trees):,} tree records from {DATA_PATH}')
trees.head()

## 2. Explore the dataset

In [None]:
trees.info()

In [None]:
trees.describe(include='all')

In [None]:
trees['health'].value_counts(normalize=True).rename('share').to_frame().style.format({'share': '{:.1%}'})

### Health distribution by borough and stewardship

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid')
fig, axes = plt.subplots(1, 2, figsize=(14, 4), sharey=True)
sns.countplot(data=trees, x='health', hue='boroname', ax=axes[0], order=['Good','Fair','Poor'])
axes[0].set_title('Health distribution by borough')
axes[0].set_xlabel('Health')
axes[0].set_ylabel('Count')
axes[0].legend(title='Borough', bbox_to_anchor=(1.05, 1), loc='upper left')

sns.countplot(data=trees, x='health', hue='steward', ax=axes[1], order=['Good','Fair','Poor'])
axes[1].set_title('Health distribution by stewardship level')
axes[1].set_xlabel('Health')
axes[1].set_ylabel('')
axes[1].legend(title='Stewardship', bbox_to_anchor=(1.05, 1), loc='upper left')

fig.tight_layout()
plt.show()

In [None]:
trees.groupby(['boroname', 'steward', 'health']).size().rename('count').reset_index().head(10)

## 3. Build a predictive model

We frame the problem as binary classification: a tree in **good** health (1) versus one rated fair or poor (0). A logistic regression model on one-hot encoded categorical variables yields one coefficient per category, which keeps the model easy to interpret. Class weights are balanced so that the less common declining class is not swamped by the more prevalent healthy trees.
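To make the balanced class weights concrete, the sketch below reproduces the formula that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`. The 80/20 label split and the helper name `balanced_class_weights` are illustrative assumptions, not values taken from the sample dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 8 healthy trees (1) and 2 declining trees (0).
example_labels = [1] * 8 + [0] * 2
example_weights = balanced_class_weights(example_labels)
print(example_weights)
```

The rarer class receives the larger weight, so misclassifying a declining tree costs the model more than misclassifying a healthy one.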

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, roc_auc_score

model_df = trees.dropna(subset=['health']).copy()
model_df['is_healthy'] = (model_df['health'] == 'Good').astype(int)
feature_cols = ['spc_common', 'boroname', 'steward', 'guards', 'sidewalk', 'curb_loc', 'soil', 'tree_dbh']
X = model_df[feature_cols]
y = model_df['is_healthy']

numeric_features = ['tree_dbh']
categorical_features = [col for col in feature_cols if col not in numeric_features]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

clf = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('logreg', LogisticRegression(max_iter=2000, class_weight='balanced'))
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
clf.fit(X_train, y_train)

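# Format with an f-string: roc_auc_score may return a plain Python float, which has no .round() method
print(f'ROC AUC: {roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]):.3f}')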
print('\nClassification report:')
print(classification_report(y_test, clf.predict(X_test)))

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.title('Tree health classifier confusion matrix')
plt.show()
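One way to read the ROC AUC printed above: it is the probability that a randomly chosen healthy tree receives a higher predicted score than a randomly chosen declining tree, with ties counted as half. The sketch below computes that pairwise quantity directly. The labels, scores, and the helper name `pairwise_auc` are made up for illustration and do not come from the model.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities of being healthy.
example_y = [1, 1, 1, 0, 0]
example_scores = [0.9, 0.7, 0.4, 0.6, 0.2]
example_auc = pairwise_auc(example_y, example_scores)
print(f'Pairwise AUC: {example_auc:.3f}')
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a reference scale for the score reported by the classifier.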

## 4. Interpret model coefficients

In [None]:
import numpy as np

onehot = clf.named_steps['preprocess'].named_transformers_['cat']
encoded_feature_names = onehot.get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_features, encoded_feature_names])
coefficients = clf.named_steps['logreg'].coef_[0]
coef_df = (
    pd.DataFrame({'feature': feature_names, 'coefficient': coefficients})
    .sort_values('coefficient', ascending=False)
)

coef_df.head(10)

In [None]:
coef_df.tail(10)
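Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to discuss. The sketch below shows the conversion on invented coefficient values. The feature names and numbers are hypothetical placeholders, not output from the fitted model.

```python
import math

# Hypothetical coefficients on the log-odds scale (not from the fitted model).
example_coefficients = {
    'example_feature_up': 0.40,
    'example_feature_down': -0.70,
}

# exp(coefficient) is the multiplicative change in the odds of good health
# when the indicator is on, with the other features held fixed.
example_odds_ratios = {
    name: math.exp(value) for name, value in example_coefficients.items()
}

for name, ratio in example_odds_ratios.items():
    print(f'{name}: odds ratio {ratio:.2f}')
```

An odds ratio above 1 means the category raises the odds of good health, and one below 1 means it lowers them. In the notebook, the same transformation could be applied to the `coefficient` column of `coef_df`. Because `tree_dbh` is passed through unscaled, its odds ratio refers to a one-unit increase in that column.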

### Takeaways

* In this sample, trees with more engaged stewards and protective guards are associated with a higher predicted probability of good health, which is consistent with community care having an impact.
* Species such as ginkgo and Japanese zelkova, along with well-maintained sidewalks, are associated with better predicted health, while sidewalk damage and compacted soils are associated with a lower likelihood of good outcomes.
* These coefficients describe associations rather than causal effects. Even so, the interpretable logistic regression pipeline offers a baseline for prioritizing field inspections and targeting interventions in neighborhoods with higher predicted risk of decline.