## Text and Category Processing Tasks

In [4]:
import pandas as pd

nyc_clean = pd.read_csv("../data/processed/nyc_311_cleaned.csv")
nyc_clean.head()

| | unique_key | created_date | problem_formerly_complaint_type | problem_detail_formerly_descriptor | borough | agency | location_type | incident_zip | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67859228 | 2026-02-05 22:37:25 | ILLEGAL PARKING | Commercial Overnight Parking | STATEN ISLAND | NYPD | STREET/SIDEWALK | 10307.0 | 40.501312 | -74.243295 |
| 1 | 67846595 | 2026-02-05 00:59:19 | SMOKING OR VAPING | Allowed in Smoke Free Area | STATEN ISLAND | DOHMH | RESIDENTIAL BUILDING | 10307.0 | 40.503493 | -74.244997 |
| 2 | 67860855 | 2026-02-05 16:48:01 | GENERAL CONSTRUCTION/PLUMBING | Cons - Contrary/Beyond Approved Plans/Permits | STATEN ISLAND | DOB | UNKNOWN | 10307.0 | 40.503809 | -74.249895 |
| 3 | 67868108 | 2026-02-05 13:08:39 | SNOW OR ICE | Roadway | STATEN ISLAND | DSNY | STREET | 10307.0 | 40.506658 | -74.24608 |
| 4 | 67862279 | 2026-02-05 07:17:07 | ILLEGAL PARKING | Double Parked Blocking Traffic | STATEN ISLAND | NYPD | STREET/SIDEWALK | 10307.0 | 40.508722 | -74.244688 |


In [3]:
print(nyc_clean.shape)
nyc_clean.info()

(10096, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10096 entries, 0 to 10095
Data columns (total 10 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   unique_key                          10096 non-null  int64  
 1   created_date                        10096 non-null  object 
 2   problem_formerly_complaint_type     10096 non-null  object 
 3   problem_detail_formerly_descriptor  10096 non-null  object 
 4   borough                             10096 non-null  object 
 5   agency                              10096 non-null  object 
 6   location_type                       10096 non-null  object 
 7   incident_zip                        10096 non-null  float64
 8   latitude                            10096 non-null  float64
 9   longitude                           10096 non-null  float64
dtypes: float64(3), int64(1), object(6)
memory usage: 788.9+ KB


### Step 1:

**Classify complaint descriptions** into standardized categories using traditional ML
models (e.g., logistic regression, SVM, random forest).

X (feature): `problem_detail_formerly_descriptor`

y (target): `problem_formerly_complaint_type`

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [24]:
X = nyc_clean["problem_detail_formerly_descriptor"]
y = nyc_clean["problem_formerly_complaint_type"]

In [9]:
# Count missing values in the feature and target columns
print(X.isna().sum())
print(y.isna().sum())

0
0


Split into training (80%) and testing (20%) data.

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

Convert text to features using Term Frequency-Inverse Document Frequency (TF-IDF), which transforms text into a numerical matrix.

TF-IDF builds a vocabulary from the training text (here with English stop words removed) and represents each complaint as a row of numeric weights, one per vocabulary term. TF asks how often a word appears in this complaint. IDF asks how common the word is across all complaints. For example, if every complaint contains the word 'street', that word is down-weighted; a rare, specific word is up-weighted.

In [26]:
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
# max_features=5000 caps the vocabulary at the 5,000 terms with the highest
# term frequency across the training corpus

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Model 1: Logistic Regression

In [34]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_tfidf, y_train)

y_pred_lr = lr.predict(X_test_tfidf)

print(classification_report(y_test, y_pred_lr))

                                         precision    recall  f1-score   support

                      ABANDONED VEHICLE       1.00      1.00      1.00        22
                            AIR QUALITY       1.00      0.62      0.76        13
                       ANIMAL IN A PARK       1.00      1.00      1.00         1
                           ANIMAL-ABUSE       1.00      0.40      0.57         5
                              APPLIANCE       1.00      1.00      1.00        15
                       BLOCKED DRIVEWAY       1.00      1.00      1.00       217
                                BOILERS       1.00      1.00      1.00         4
                   BROKEN PARKING METER       0.00      0.00      0.00         3
                           BUILDING/USE       0.78      0.88      0.82         8
             BUS STOP SHELTER COMPLAINT       1.00      1.00      1.00         2
                      CANNABIS RETAILER       0.50      0.50      0.50         2
          COMMERCIAL DISPOS

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
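The truncated `_warn_prf` lines above appear to be scikit-learn's undefined-metric warnings, which are raised when a class in `y_test` is never predicted, so its precision has a zero denominator (the BROKEN PARKING METER row, with precision 0.00, fits that pattern). A minimal sketch of one way to handle this, using hypothetical labels rather than the notebook's data, is to set `zero_division` explicitly:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: class "C" occurs in y_true but is never predicted
y_true = ["A", "A", "B", "B", "C"]
y_hat = ["A", "A", "B", "B", "A"]

# zero_division=0 reports 0.0 for undefined precision without warning
report = classification_report(
    y_true, y_hat, zero_division=0, output_dict=True
)

print(report["C"])
```

This only silences the warning. The affected classes still score zero, which is worth keeping in mind when reading the averaged metrics for rare complaint types.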


Model 2: SVM

In [None]:
from sklearn.svm import LinearSVC
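The notebook stops at the import here. As a rough sketch of how this step might continue, the block below fits a `LinearSVC` on TF-IDF features. The `toy_text` and `toy_labels` values are hypothetical stand-ins for `X_train` and `y_train`, and the pipeline wrapper is an assumption rather than the author's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical descriptor/label pairs standing in for X_train / y_train
toy_text = [
    "double parked blocking traffic",
    "commercial overnight parking",
    "blocked hydrant parking",
    "roadway snow not plowed",
    "sidewalk ice not cleared",
    "snow on roadway",
]
toy_labels = ["ILLEGAL PARKING"] * 3 + ["SNOW OR ICE"] * 3

# Vectorize and classify in one estimator so the same vocabulary
# is applied at fit and predict time
svm_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(random_state=50),
)
svm_model.fit(toy_text, toy_labels)

toy_pred = svm_model.predict(["car parked overnight", "icy roadway snow"])
print(toy_pred)
```

In the notebook itself, the analogous calls would presumably be `LinearSVC().fit(X_train_tfidf, y_train)` followed by `classification_report` on `X_test_tfidf`, mirroring Model 1.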

Model 3: Random Forest
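No code is given for this model. A minimal sketch of what it could look like, again on hypothetical toy data (`toy_text`, `toy_labels`) and with assumed hyperparameters (`n_estimators=100`, `random_state=50`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical descriptor/label pairs standing in for X_train / y_train
toy_text = [
    "double parked blocking traffic",
    "commercial overnight parking",
    "blocked hydrant parking",
    "roadway snow not plowed",
    "sidewalk ice not cleared",
    "snow on roadway",
]
toy_labels = ["ILLEGAL PARKING"] * 3 + ["SNOW OR ICE"] * 3

# Random forests accept the sparse TF-IDF matrix directly
rf_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=50),
)
rf_model.fit(toy_text, toy_labels)

rf_pred = rf_model.predict(["car parked overnight", "icy roadway snow"])
print(rf_pred)
```

On the real data the fitted forest would be evaluated with the same `classification_report` call used for Model 1, so the three models can be compared on identical test rows.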