## Text and Category Processing Tasks

### Imports

In [28]:
import pandas as pd
import numpy as np
import time
import warnings

warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, f1_score, classification_report

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

### Load Cleaned Data

In [35]:
nyc_clean = pd.read_csv("../data/processed/nyc_311_cleaned.csv")
nyc_clean.head()

| | unique_key | created_date | problem_formerly_complaint_type | problem_detail_formerly_descriptor | borough | agency | location_type | incident_zip | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67859228 | 2026-02-05 22:37:25 | ILLEGAL PARKING | Commercial Overnight Parking | STATEN ISLAND | NYPD | STREET/SIDEWALK | 10307.0 | 40.501312 | -74.243295 |
| 1 | 67846595 | 2026-02-05 00:59:19 | SMOKING OR VAPING | Allowed in Smoke Free Area | STATEN ISLAND | DOHMH | RESIDENTIAL BUILDING | 10307.0 | 40.503493 | -74.244997 |
| 2 | 67860855 | 2026-02-05 16:48:01 | GENERAL CONSTRUCTION/PLUMBING | Cons - Contrary/Beyond Approved Plans/Permits | STATEN ISLAND | DOB | UNKNOWN | 10307.0 | 40.503809 | -74.249895 |
| 3 | 67868108 | 2026-02-05 13:08:39 | SNOW OR ICE | Roadway | STATEN ISLAND | DSNY | STREET | 10307.0 | 40.506658 | -74.24608 |
| 4 | 67862279 | 2026-02-05 07:17:07 | ILLEGAL PARKING | Double Parked Blocking Traffic | STATEN ISLAND | NYPD | STREET/SIDEWALK | 10307.0 | 40.508722 | -74.244688 |


In [3]:
print(nyc_clean.shape)
nyc_clean.info()

(10096, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10096 entries, 0 to 10095
Data columns (total 10 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   unique_key                          10096 non-null  int64  
 1   created_date                        10096 non-null  object 
 2   problem_formerly_complaint_type     10096 non-null  object 
 3   problem_detail_formerly_descriptor  10096 non-null  object 
 4   borough                             10096 non-null  object 
 5   agency                              10096 non-null  object 
 6   location_type                       10096 non-null  object 
 7   incident_zip                        10096 non-null  float64
 8   latitude                            10096 non-null  float64
 9   longitude                           10096 non-null  float64
dtypes: float64(3), int64(1), object(6)
memory usage: 788.9+ KB


In [11]:
nyc_clean["problem_formerly_complaint_type"].value_counts().head()

problem_formerly_complaint_type
ILLEGAL PARKING        1624
HEAT/HOT WATER         1497
BLOCKED DRIVEWAY        956
NOISE - RESIDENTIAL     792
SNOW OR ICE             752
Name: count, dtype: int64

### Task # 1

**Classify complaint descriptions** into standardized categories using traditional ML
models (e.g., logistic regression, SVM, random forest)

#### Define X and Y

X (feature): `problem_detail_formerly_descriptor`

y (target): `problem_formerly_complaint_type`

In [4]:
X = nyc_clean["problem_detail_formerly_descriptor"]
y = nyc_clean["problem_formerly_complaint_type"]

print(sum(X.isna()))
print(sum(y.isna()))

0
0


#### Train and Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

#### Pipeline

**TF-IDF Notes:**

Convert text to features using Term Frequency-Inverse Document Frequency (TF-IDF), which transforms text into a numerical matrix.

TF-IDF builds a vocabulary of all words in the text after removing stop words. It then represents each complaint as a row of numbers, one per vocabulary word. TF (term frequency) asks how often a word appears in this complaint. IDF (inverse document frequency) asks how common the word is across all complaints. For example, if every complaint contains the word 'street', that word is down-weighted. A word that is rare and specific is up-weighted.
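
To make the up- and down-weighting concrete, here is a minimal sketch that computes the IDF term by hand on a three-complaint toy corpus. The corpus and the helper name `smoothed_idf` are made up for illustration and are not part of the notebook. The formula is the smoothed variant that scikit-learn's `TfidfVectorizer` applies by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import math

# Toy corpus (hypothetical, not taken from the 311 data)
docs = [
    "street light out",
    "street pothole",
    "street noise loud music",
]

def smoothed_idf(term, docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    # document frequency: number of documents containing the term
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

idf_street = smoothed_idf("street", docs)    # appears in every document
idf_pothole = smoothed_idf("pothole", docs)  # appears in one document

print(f"street:  {idf_street:.3f}")   # street:  1.000
print(f"pothole: {idf_pothole:.3f}")  # pothole: 1.693
```

The word that occurs in every complaint gets the minimum weight of 1, while the rarer word gets a larger weight. The vectorizer then multiplies these weights by the term counts and, by default, L2-normalizes each row.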

In [15]:
pipelines = {
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(stop_words= "english", max_features= 5000)),
        ("clf", LogisticRegression(max_iter = 2000, class_weight = "balanced"))
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(stop_words= "english", max_features= 5000)),
        ("clf", LinearSVC(multi_class='ovr', class_weight = "balanced"))
    ]),
    "Random Forest": Pipeline([
        ("tfidf", TfidfVectorizer(stop_words= "english", max_features= 5000)),
        ("clf", RandomForestClassifier(
            n_estimators = 300, random_state = 50, n_jobs = -1, class_weight = "balanced"
        ))
    ]),
    "KNN": Pipeline([
        ("tfidf", TfidfVectorizer(stop_words= "english", max_features= 5000)),
        ("clf", KNeighborsClassifier(n_neighbors=3))
    ])
}

#### Model Evaluation Function

In [16]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    # training time
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    # prediction time
    t1 = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_time = time.perf_counter() - t1

    # metrics
    acc = accuracy_score(y_test, y_pred)
    macro_f1 = f1_score(y_test, y_pred, average = "macro")
    weighted_f1 = f1_score(y_test, y_pred, average = "weighted")

    return {
        "Model": name,
        "Accuracy": acc,
        "Macro-F1": macro_f1,
        "Weighted-F1": weighted_f1,
        "Train Time (s)": train_time,
        "Predict Time (s)": predict_time}, y_pred


#### Run all models and build results table

In [25]:
rows = []
preds = {}

for name, model in pipelines.items():
    row, y_pred = evaluate_model(name, model, X_train, y_train, X_test, y_test)
    rows.append(row)
    preds[name] = y_pred

results = pd.DataFrame(rows).sort_values(by = "Macro-F1", ascending = False).round(4)
results

| | Model | Accuracy | Macro-F1 | Weighted-F1 | Train Time (s) | Predict Time (s) |
|---|---|---|---|---|---|---|
| 2 | Random Forest | 0.9267 | 0.7863 | 0.9369 | 1.607 | 0.4392 |
| 1 | Linear SVM | 0.9262 | 0.7831 | 0.9357 | 12.653 | 0.0125 |
| 3 | KNN | 0.9495 | 0.7599 | 0.9418 | 0.0812 | 0.4495 |
| 0 | Logistic Regression | 0.9124 | 0.7486 | 0.9254 | 1.6746 | 0.032 |


#### Best Model

In [14]:
best_name = results.iloc[0]["Model"]
print("Best Model:", best_name)
print(classification_report(y_test, preds[best_name]))

Best Model: Random Forest
                                         precision    recall  f1-score   support

                      ABANDONED VEHICLE       1.00      1.00      1.00        22
                            AIR QUALITY       1.00      0.92      0.96        13
                       ANIMAL IN A PARK       1.00      1.00      1.00         1
                           ANIMAL-ABUSE       1.00      0.40      0.57         5
                              APPLIANCE       1.00      1.00      1.00        15
                       BLOCKED DRIVEWAY       1.00      1.00      1.00       217
                                BOILERS       1.00      1.00      1.00         4
                   BROKEN PARKING METER       0.50      0.33      0.40         3
                           BUILDING/USE       1.00      1.00      1.00         8
             BUS STOP SHELTER COMPLAINT       1.00      1.00      1.00         2
                      CANNABIS RETAILER       0.67      1.00      0.80         2
 

#### Results Discussion

KNN achieved the highest overall accuracy (0.9495), but its lower macro-F1 score (0.7599) indicates weaker performance on less common complaint categories. Random Forest achieved the highest macro-F1 score (0.7863), which indicates more balanced performance across classes. Linear SVM results were nearly identical to Random Forest (macro-F1 of 0.7831), but it took longer to train (12.653 s vs 1.607 s) and was faster at prediction (0.0125 s vs 0.4392 s). Because the models are ranked by macro-F1, Random Forest is selected as the best-performing model, by a narrow margin.
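
The gap between accuracy and macro-F1 can be reproduced on a tiny example. The sketch below uses ten made-up labels (the class names are borrowed from the report above, the data is not) and a hypothetical helper `f1_per_class` to show how a model that ignores a rare class still scores high on accuracy and weighted-F1 while macro-F1 drops sharply.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Per-class F1 computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Toy data: nine common complaints, one rare complaint that is never predicted
y_true = ["ILLEGAL PARKING"] * 9 + ["ANIMAL-ABUSE"]
y_pred = ["ILLEGAL PARKING"] * 10

scores = f1_per_class(y_true, y_pred)
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(scores.values()) / len(scores)  # every class counts equally
weighted_f1 = sum(scores[c] * support[c] for c in scores) / len(y_true)

print(f"accuracy:    {accuracy:.3f}")     # accuracy:    0.900
print(f"macro-F1:    {macro_f1:.3f}")     # macro-F1:    0.474
print(f"weighted-F1: {weighted_f1:.3f}")  # weighted-F1: 0.853
```

Macro-F1 averages the per-class scores without regard to class size, so the single missed rare class halves it. This is why it is the more informative metric when many complaint types have only a handful of test examples.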

---

### Task # 2

**Estimate severity or sentiment** using lexicon-based methods (e.g., VADER, TextBlob) 
and simple ML classifiers trained on labeled subsets

#### Part 1: Lexicon-based Sentiment using VADER and TextBlob

In [36]:
analyzer = SentimentIntensityAnalyzer()

# create a sentiment score column in the df
nyc_clean["sentiment_score_vader"] = nyc_clean["problem_detail_formerly_descriptor"].apply(
    lambda x: analyzer.polarity_scores(str(x))["compound"]
)

nyc_clean["sentiment_score_textblob"] = nyc_clean["problem_detail_formerly_descriptor"].apply(
    lambda x: TextBlob(str(x)).sentiment.polarity
)

# compound score ranges from -1 (very negative) through 0 (neutral) to +1 (very positive)

In [None]:
nyc_clean["sentiment_score_vader"].describe().round(3)

count    10096.000
mean        -0.098
std          0.169
min         -0.758
25%         -0.273
50%          0.000
75%          0.000
max          0.586
Name: sentiment_score_vader, dtype: float64

In [30]:
nyc_clean["sentiment_score_textblob"].describe().round(3)

count    10096.000
mean        -0.007
std          0.092
min         -0.800
25%          0.000
50%          0.000
75%          0.000
max          0.700
Name: sentiment_score_textblob, dtype: float64

Note: VADER is used instead of TextBlob. Complaints are expected to skew negative, and the VADER scores reflect this more than the TextBlob scores do (mean of -0.098 vs -0.007, and TextBlob's quartiles are all 0). The `sentiment_score_textblob` column is therefore removed from the `nyc_clean` data.

In [None]:
nyc_clean = nyc_clean.drop(columns=["sentiment_score_textblob"])
nyc_clean.head()

| | unique_key | created_date | problem_formerly_complaint_type | problem_detail_formerly_descriptor | borough | agency | location_type | incident_zip | latitude | longitude | sentiment_score_vader |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67859228 | 2026-02-05 22:37:25 | ILLEGAL PARKING | Commercial Overnight Parking | STATEN ISLAND | NYPD | STREET/SIDEWALK | 10307.0 | 40.501312 | -74.243295 | 0.0 |
| 1 | 67846595 | 2026-02-05 00:59:19 | SMOKING OR VAPING | Allowed in Smoke Free Area | STATEN ISLAND | DOHMH | RESIDENTIAL BUILDING | 10307.0 | 40.503493 | -74.244997 | 0.5106 |
| 2 | 67860855 | 2026-02-05 16:48:01 | GENERAL CONSTRUCTION/PLUMBING | Cons - Contrary/Beyond Approved Plans/Permits | STATEN ISLAND | DOB | UNKNOWN | 10307.0 | 40.503809 | -74.249895 | 0.4215 |
| 3 | 67868108 | 2026-02-05 13:08:39 | SNOW OR ICE | Roadway | STATEN ISLAND | DSNY | STREET | 10307.0 | 40.506658 | -74.24608 | 0.0 |
| 4 | 67862279 | 2026-02-05 07:17:07 | ILLEGAL PARKING | Double Parked Blocking Traffic | STATEN ISLAND | NYPD | STREET/SIDEWALK | 10307.0 | 40.508722 | -74.244688 | -0.3818 |


Now the sentiment will be classified into severity categories for the ML classifier:

In [42]:
def severity_label(score):
    if score < -0.35:
        return "High Severity"
    elif score < 0:
        return "Medium Severity"
    else:
        return "Low Severity"
    
nyc_clean["severity_label"] = nyc_clean["sentiment_score_vader"].apply(severity_label)

Since the complaints are mostly neutral to slightly negative, the thresholds were chosen based on the distribution of the sentiment scores. Complaints with a sentiment score below -0.35 were labeled `High Severity`, those from -0.35 up to (but not including) 0 were labeled `Medium Severity`, and all non-negative scores were labeled `Low Severity`.
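
A quick boundary check makes the cutoffs explicit. The sketch below repeats the `severity_label` function from the cell above and applies it to a handful of scores, including the two exact thresholds. The list of sample scores is illustrative: a few values are taken from the outputs shown earlier and the rest are made up.

```python
def severity_label(score):
    # Same thresholds as the notebook cell above
    if score < -0.35:
        return "High Severity"
    elif score < 0:
        return "Medium Severity"
    else:
        return "Low Severity"

# Sample scores, including the exact cutoffs -0.35 and 0.0
examples = [-0.758, -0.3818, -0.35, -0.001, 0.0, 0.5106]
labels = {s: severity_label(s) for s in examples}

for s, label in labels.items():
    print(f"{s:>7}: {label}")
```

A score of exactly -0.35 falls into `Medium Severity`, and exactly 0 falls into `Low Severity`. Positive compound scores, such as the 0.5106 assigned to "Allowed in Smoke Free Area", are also labeled `Low Severity` under this scheme.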

In [40]:
nyc_clean["severity_label"].value_counts()

severity_label
Low Severity       7113
Medium Severity    2052
High Severity       931
Name: count, dtype: int64

#### Part 2: Train ML Classifier on Labeled Subset

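The goal is to predict the severity of a complaint with an ML model instead of the lexicon. Since there are no ground-truth severity labels, the lexicon-derived labels are used as weak supervision, so the evaluation below measures agreement with the VADER-based labels rather than with independently assessed severity.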

In [None]:
# Note: X is already defined in the task above

y = nyc_clean["severity_label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 50)

#### Build Model Pipeline, train, predict, & evaluate

Note: Logistic Regression was selected because the relationship between words and severity is expected to be mostly linear (negative words -> high severity, neutral words -> low severity), and because the model is interpretable and computationally efficient.

In [46]:
severity_model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words= "english", max_features= 5000)),
    ("clf", LogisticRegression(max_iter = 1000, class_weight = "balanced", solver = "lbfgs"))
])

severity_model.fit(X_train, y_train)

y_pred = severity_model.predict(X_test)

print(classification_report(y_test, y_pred))

                 precision    recall  f1-score   support

  High Severity       0.99      0.98      0.99       185
   Low Severity       1.00      1.00      1.00      1396
Medium Severity       1.00      0.99      0.99       439

       accuracy                           1.00      2020
      macro avg       0.99      0.99      0.99      2020
   weighted avg       1.00      1.00      1.00      2020

