# Urban Biodiversity Tree Health Modeling

This notebook retrieves urban tree census data, explores key characteristics, and builds a predictive model for tree health to support biodiversity monitoring.

## 1. Retrieve data
We will use a subset of the 2015 New York City Street Tree Census, hosted as a CSV file on GitHub. The dataset contains tree species, health assessments, and stewardship information recorded throughout the city.

In [None]:
import pandas as pd
import numpy as np

DATA_URL = "https://raw.githubusercontent.com/charleyferrari/CUNY_DATA608/master/module4/data/trees_count_limited.csv"
trees = pd.read_csv(DATA_URL)
trees.head()
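
Because this file is a reduced subset of the full census, it may not contain every column used later in the notebook. The sketch below is one way to check before modeling. The helper `missing_columns` and the small `demo` frame are hypothetical and exist only for illustration; the column list is copied from the modeling section.

```python
import pandas as pd

# Columns the later sections rely on (target plus model features).
REQUIRED_COLUMNS = [
    'health', 'spc_common', 'boroname', 'steward', 'guards',
    'sidewalk', 'curb_loc', 'soil', 'tree_dbh',
]

def missing_columns(df: pd.DataFrame, required: list) -> list:
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Made-up one-row frame, used only to demonstrate the check.
demo = pd.DataFrame({'health': ['Good'], 'boroname': ['Bronx'], 'tree_dbh': [5]})
absent = missing_columns(demo, REQUIRED_COLUMNS)
print(absent)
```

On the notebook's data the equivalent call would be `missing_columns(trees, REQUIRED_COLUMNS)`; an empty list means every column is present.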

## 2. Explore the dataset

In [None]:
trees.info()

In [None]:
trees.describe(include='all')

### Health distribution by borough

In [None]:
trees['health'].value_counts(normalize=True)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
sns.countplot(data=trees, x='health', hue='boroname')
plt.title('Tree health status across NYC boroughs')
plt.xlabel('Health')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
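
Raw counts are hard to compare when boroughs hold very different numbers of trees. A minimal sketch of a per-borough share table follows, assuming the `boroname` and `health` columns used in the plot above. The helper name `health_share_by_borough` and the sample rows are hypothetical, not census values.

```python
import pandas as pd

def health_share_by_borough(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each health category within each borough (each row sums to 1)."""
    # normalize='index' divides every row by its own total.
    return pd.crosstab(df['boroname'], df['health'], normalize='index')

# Made-up rows purely for illustration.
sample = pd.DataFrame({
    'boroname': ['Bronx', 'Bronx', 'Queens', 'Queens', 'Queens', 'Queens'],
    'health': ['Good', 'Poor', 'Good', 'Good', 'Fair', 'Good'],
})
shares = health_share_by_borough(sample)
print(shares)
```

On the notebook's data the equivalent call would be `health_share_by_borough(trees)`. Rows with a missing borough or health value are left out of the table.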

## 3. Preprocess features
We will prepare categorical and numerical features for modeling. The target is binary: 1 if a tree's recorded health is *Good*, and 0 for any other recorded status. Rows with no health value are dropped.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# Binary target: good health vs. other states
trees_model = trees.dropna(subset=['health']).copy()
trees_model['is_healthy'] = (trees_model['health'] == 'Good').astype(int)

feature_cols = ['spc_common', 'boroname', 'steward', 'guards', 'sidewalk', 'curb_loc', 'soil', 'tree_dbh']
X = trees_model[feature_cols]
y = trees_model['is_healthy']

numeric_features = ['tree_dbh']
categorical_features = [col for col in feature_cols if col not in numeric_features]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

model = Pipeline(
    steps=[
        ('preprocess', preprocessor),
        ('clf', LogisticRegression(max_iter=1000))
    ]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model.fit(X_train, y_train)
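
The pipeline above passes `tree_dbh` through unscaled. A possible variant standardizes the numeric column, which can help the solver converge and puts its coefficient on a per-standard-deviation scale. This is a sketch of an alternative, not the notebook's method; `build_scaled_model` and the demo data are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_scaled_model(numeric_features, categorical_features):
    """Same pipeline shape as above, but numeric columns are standardized."""
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ]
    )
    return Pipeline(
        steps=[
            ('preprocess', preprocessor),
            ('clf', LogisticRegression(max_iter=1000)),
        ]
    )

# Made-up rows purely for illustration.
demo = pd.DataFrame({
    'tree_dbh': [2, 4, 10, 12, 20, 30],
    'boroname': ['Bronx', 'Queens', 'Bronx', 'Queens', 'Bronx', 'Queens'],
})
demo_target = [0, 0, 0, 1, 1, 1]

scaled_model = build_scaled_model(['tree_dbh'], ['boroname'])
scaled_model.fit(demo, demo_target)
demo_pred = scaled_model.predict(demo)
```

With the notebook's variables, the call would be `build_scaled_model(numeric_features, categorical_features)`, fitted on `X_train` and `y_train`.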

## 4. Evaluate the model

In [None]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title('Confusion matrix: tree health classifier')
plt.show()
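
If one class dominates the target, accuracy alone can look strong even for a model that learns nothing. One way to put the report above in context is a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier`; the helper `majority_baseline_scores` and the synthetic arrays are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def majority_baseline_scores(X_train, y_train, X_test, y_test):
    """Score a classifier that always predicts the most frequent training class."""
    baseline = DummyClassifier(strategy='most_frequent')
    baseline.fit(X_train, y_train)
    y_base = baseline.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_base),
        'balanced_accuracy': balanced_accuracy_score(y_test, y_base),
    }

# Synthetic target with an 8:2 class split; the features are ignored by the baseline.
X_demo = np.zeros((10, 1))
y_demo = np.array([1] * 8 + [0] * 2)
scores = majority_baseline_scores(X_demo, y_demo, X_demo, y_demo)
print(scores)
```

With the notebook's split, the call would be `majority_baseline_scores(X_train, y_train, X_test, y_test)`. The logistic regression is informative to the extent that it beats these numbers.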

## 5. Feature influence
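The logistic regression coefficients show which features are associated with a higher or lower likelihood of a tree being healthy. A positive coefficient raises the predicted log-odds of good health; a negative one lowers it.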

In [None]:
# Extract feature names from the one-hot encoder
onehot = model.named_steps['preprocess'].named_transformers_['cat']
encoded_cat_features = onehot.get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_features, encoded_cat_features])

coef = model.named_steps['clf'].coef_[0]
coef_df = pd.DataFrame({'feature': feature_names, 'coefficient': coef})
coef_df.sort_values(by='coefficient', ascending=False).head(10)

In [None]:
coef_df.sort_values(by='coefficient').head(10)
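
Each coefficient is a change in log-odds, so exponentiating it gives an odds ratio, which is often easier to read. A minimal sketch follows, assuming a frame shaped like `coef_df` with `feature` and `coefficient` columns. The helper `add_odds_ratios` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def add_odds_ratios(coef_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an odds_ratio column, sorted from largest to smallest."""
    out = coef_df.copy()
    # exp(coefficient) is the multiplicative change in the odds of the positive class.
    out['odds_ratio'] = np.exp(out['coefficient'])
    return out.sort_values('odds_ratio', ascending=False).reset_index(drop=True)

# Made-up coefficients purely for illustration.
demo = pd.DataFrame({
    'feature': ['a', 'b', 'c'],
    'coefficient': [0.0, np.log(2.0), -np.log(2.0)],
})
result = add_odds_ratios(demo)
print(result)
```

An odds ratio above 1 means higher odds of good health relative to the baseline, and below 1 means lower odds. `LogisticRegression` applies L2 regularization by default, so these values are shrunk toward 1 compared with an unpenalized fit.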

The largest positive and negative coefficients indicate which stewardship levels and species are most strongly associated with tree health in this model. These associations can guide urban biodiversity management by pointing to the species and boroughs at higher risk of declining tree health.