# Project Template

Copy this repository to get started with the project.

## Setup

Modify the following two cells to import the libraries and helper files that you might need for your project.

In [None]:
# Some possible helper files from class

!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/audio_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/image_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/text_utils.py

In [None]:
# Some possible libraries.
# This isn't complete.

import matplotlib.pyplot as plt
import pandas as pd
import PIL.Image as PImage

from os import listdir, path

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

from data_utils import object_from_json_url, classification_error, display_confusion_matrix
from image_utils import get_pixels, make_image
from text_utils import get_top_words

## Milestone 01

Preparing and loading the data

In [None]:
df = pd.read_csv("FINAL data_311_complaints_2014 only_ CSV.csv")
df.head()

## Milestone 02



#### Converting Date and Time into Numeric Features
In this step, I first rebuilt a full timestamp for each row by concatenating the separate Year, Month, Date and Time columns into a single string and converting it to a pandas datetime stored in df["datetime"]. From this column I derived two numeric features. day_of_year gives the position of the date within the year, from 1 to 365 (2014 is not a leap year). minutes_since_midnight combines the hour and minute into a single value from 0 to 1439, so that time of day is represented on one continuous scale. Finally, I used df.head() and then a narrower view of ["Year","Month","Date","Time","day_of_year","minutes_since_midnight","datetime"] with .head() to check that the new columns line up with the original date and time information.
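The same transformation can be sanity-checked on a few made-up rows before it is applied to the full table. This is a minimal sketch: the `toy` frame and its values are hypothetical, and it assumes the Time column holds "HH:MM" strings.

```python
import pandas as pd

# Hypothetical rows mimicking the Year / Month / Date / Time columns
toy = pd.DataFrame({
    "Year": [2014, 2014, 2014],
    "Month": [1, 3, 12],
    "Date": [1, 1, 31],
    "Time": ["00:00", "13:45", "23:59"],  # assumed "HH:MM" format
})

# Concatenate the parts into one string per row and parse it
toy["datetime"] = pd.to_datetime(
    toy["Year"].astype(str) + "-" +
    toy["Month"].astype(str) + "-" +
    toy["Date"].astype(str) + " " +
    toy["Time"]
)

# Position in the year (1-365 in 2014) and position in the day (0-1439)
toy["day_of_year"] = toy["datetime"].dt.dayofyear
toy["minutes_since_midnight"] = toy["datetime"].dt.hour * 60 + toy["datetime"].dt.minute

print(toy[["datetime", "day_of_year", "minutes_since_midnight"]])
```

The first and last rows hit the extremes of both ranges, which makes off-by-one mistakes easy to spot.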

In [None]:
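# Build a full timestamp from the separate Year, Month, Date and Time columns.
# astype(str) on Time is a safeguard in case the column was not read as text.
df["datetime"] = pd.to_datetime(
    df["Year"].astype(str) + "-" +
    df["Month"].astype(str) + "-" +
    df["Date"].astype(str) + " " +
    df["Time"].astype(str)
)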

df["day_of_year"] = df["datetime"].dt.dayofyear
df["minutes_since_midnight"] = df["datetime"].dt.hour * 60 + df["datetime"].dt.minute

In [None]:
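# Only the last expression in a cell is shown automatically, so display the first one explicitly
display(df.head())
df[["Year","Month","Date","Time","day_of_year","minutes_since_midnight","datetime"]].head()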

I created an extra hour column from the datetime to summarize time of day more intuitively for some of the plots.

In [None]:
# Hour of day (0–23)
df["hour"] = df["datetime"].dt.hour

I plotted a histogram of minutes_since_midnight to see when during the day most complaints are concentrated. This showed how noise activity is distributed across the full 24-hour cycle and helped reveal daily rhythms such as late-night peaks or daytime construction patterns.

In [None]:
plt.figure(figsize=(10, 4))
plt.hist(df["minutes_since_midnight"], bins=60)
plt.xlabel("Minutes since midnight")
plt.ylabel("Number of complaints")
plt.title("Distribution of complaints by minute of the day (2014, Harlem)")
plt.tight_layout()
plt.show()

The histogram shows that Harlem residents file most of their noise complaints very late at night and just after midnight, with another peak late in the evening. Complaints drop sharply during the daytime, suggesting that daytime noise (traffic, construction, commercial activity) may be perceived as more “normal” or expected, while nighttime noise, especially music, parties and residential disturbances, appears to trigger complaints much more frequently.
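To put a number on the night-time concentration described above, the minutes can be binned by clock hour and the share of complaints in a night window computed. This is a rough sketch on made-up values: `hourly_counts` is a hypothetical helper, and the 22:00 to 03:59 window is an assumption, not a definition taken from the dataset.

```python
import numpy as np
import pandas as pd

def hourly_counts(minutes_since_midnight):
    """Count complaints per clock hour (0-23) from minutes since midnight."""
    hours = pd.Series(minutes_since_midnight).astype(int) // 60
    return pd.Series(np.bincount(hours, minlength=24), index=range(24), name="complaints")

# Hypothetical values; in the notebook this would be df["minutes_since_midnight"]
toy_minutes = [5, 30, 65, 720, 1380, 1400, 1439]
counts = hourly_counts(toy_minutes)

# Assumed night window: 22:00 through 03:59
night_hours = [22, 23, 0, 1, 2, 3]
night_share = counts.loc[night_hours].sum() / counts.sum()
print(counts[counts > 0])
print(f"Share of complaints at night: {night_share:.2f}")
```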

I plotted a histogram of the Month column to examine how noise complaints vary across the year. This helps capture potential seasonal effects, such as higher complaint volumes in warmer months when people are more likely to be outside, play music or open windows.

In [None]:
plt.figure(figsize=(8, 4))
plt.hist(df["Month"], bins=range(1, 14), align="left", rwidth=0.8)
plt.xlabel("Month")
plt.ylabel("Number of complaints")
plt.title("Distribution of complaints by month (2014, Harlem)")
plt.xticks(range(1, 13))
plt.tight_layout()
plt.show()

The monthly distribution shows a clear seasonal pattern. Noise complaints rise steadily from winter into late spring, peak during May–August and then gradually decline toward the end of the year. This aligns with warmer weather months when people spend more time outdoors, host gatherings and keep windows open, making both the production and perception of sound more prominent. The summer peak also reflects Harlem’s active street life, public events and music culture, which become more audible and more contested during periods of increased social and residential activity.
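The size of the summer peak can also be expressed as a share of all complaints. The sketch below uses invented month values, and `month_shares` is a hypothetical helper; in the notebook the input would be df["Month"].

```python
import pandas as pd

def month_shares(months):
    """Fraction of complaints per calendar month, with empty months kept as 0."""
    counts = pd.Series(months).value_counts().reindex(range(1, 13), fill_value=0)
    return counts / counts.sum()

# Hypothetical month values
toy_months = [1, 5, 5, 6, 7, 7, 8, 12]
shares = month_shares(toy_months)

# Label-based slice, inclusive on both ends: May through August
summer_share = shares.loc[5:8].sum()
print(f"May-August share: {summer_share:.2f}")
```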

Here I visualize the distribution of complaints by Complaint Type, which is a higher-level and more standardized classification than the Descriptor field.

In [None]:
plt.figure(figsize=(10, 6))
df["Complaint Type"].value_counts().plot(kind="bar")
plt.title("Complaints by Complaint Type (2014, Harlem)")
plt.xlabel("Complaint Type")
plt.ylabel("Number of Complaints")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

The distribution shows that noise complaints in Harlem are overwhelmingly concentrated in residential settings, which account for the vast majority of reports. Street and sidewalk noise form the second-largest category, followed by commercial noise. Vehicle noise, park-related complaints and helicopter noise are present but occur far less frequently. This hierarchy suggests that most sound conflicts in Harlem emerge within or directly around people’s homes, reflecting how residential density, building conditions and nighttime activity contribute directly to complaint behavior.

Here I plotted the raw distribution of all Descriptor values to see how individual noise categories are represented before any grouping or simplification.

In [None]:
descriptor_counts = df["Descriptor"].value_counts()

plt.figure(figsize=(10, 6))
descriptor_counts.plot(kind="bar")
plt.xlabel("Descriptor")
plt.ylabel("Number of complaints")
plt.title("Number of complaints by descriptor (2014, Harlem)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

This visualization immediately shows that the Descriptor field is extremely fragmented, with a few categories appearing very frequently and dozens of others appearing rarely or only once. The result is a long-tailed distribution that is difficult to interpret meaningfully in its raw form, since many descriptors overlap, differ only in wording or represent variations of the same underlying sound type.
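One way to quantify how long-tailed the field is: sort the categories by frequency and count how many are needed to cover a given share of all complaints. The labels below are placeholders and `categories_needed` is a hypothetical helper; the 80% target is an arbitrary choice for illustration.

```python
import pandas as pd

def categories_needed(labels, target=0.8):
    """Number of most-frequent categories needed to cover `target` of all rows."""
    # value_counts sorts from most to least frequent; cumsum gives running coverage
    cumulative = pd.Series(labels).value_counts(normalize=True).cumsum()
    return int((cumulative < target).sum()) + 1

# Placeholder labels; in the notebook this would be df["Descriptor"]
toy_labels = ["A"] * 5 + ["B"] * 2 + ["C"] * 2 + ["D"]
n_needed = categories_needed(toy_labels, target=0.8)
print(f"{n_needed} of {len(set(toy_labels))} categories cover 80% of the rows")
```

A small number here relative to the total number of distinct descriptors would support the long-tail reading of the bar chart.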

Because the raw Descriptor categories are too granular and inconsistent to analyze directly, I grouped them into broader sound categories to produce a clearer and more interpretable representation of Harlem’s noise landscape.

In [None]:
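def simplify_descriptor(desc):
    """Map a raw Descriptor string to a broader sound category."""
    # Missing or non-string descriptors cannot be matched; treat them as "Other"
    if not isinstance(desc, str):
        return "Other"
    desc = desc.lower()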

    if "music" in desc or "party" in desc:
        return "Music/Party"
    if "construction" in desc or "jack hammer" in desc:
        return "Construction"
    if "car" in desc or "truck" in desc or "engine" in desc or "horn" in desc:
        return "Vehicle/Street"
    if "talk" in desc or "banging" in desc or "pounding" in desc:
        return "Talking/Impact Noise"
    if "dog" in desc:
        return "Animals"
    if "television" in desc:
        return "Television"
    return "Other"

df["Descriptor_Grouped"] = df["Descriptor"].apply(simplify_descriptor)

In [None]:
plt.figure(figsize=(10, 6))
df["Descriptor_Grouped"].value_counts().plot(kind="bar")
plt.title("Complaints by Descriptor Group (2014, Harlem)")
plt.xlabel("Grouped Descriptor")
plt.ylabel("Number of Complaints")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

After grouping similar descriptors into broader sound categories, a much clearer pattern emerges. Music- and party-related noise overwhelmingly dominates Harlem’s complaints, followed by talking and impact noise. Construction, vehicle noise and other categories appear at much lower levels. This grouping highlights the major sound sources shaping Harlem’s daily noise landscape more effectively than the fragmented raw descriptor labels.
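Because the grouping relies on plain substring checks, a short keyword such as "car" would also match inside unrelated words. A possible variant, sketched here as an assumption rather than as the method used above, keeps the same keywords and the same priority order but only matches a keyword at the start of a word. The name `group_descriptor`, the `RULES` table and the example strings are all illustrative.

```python
import re

# Same keywords and priority order as the if-chain, expressed as a table
RULES = [
    ("Music/Party", ["music", "party"]),
    ("Construction", ["construction", "jack hammer"]),
    ("Vehicle/Street", ["car", "truck", "engine", "horn"]),
    ("Talking/Impact Noise", ["talk", "banging", "pounding"]),
    ("Animals", ["dog"]),
    ("Television", ["television"]),
]

def group_descriptor(desc):
    """Return the first category whose keyword starts a word in `desc`."""
    if not isinstance(desc, str):
        return "Other"  # missing values fall through to "Other"
    text = desc.lower()
    for label, keywords in RULES:
        # \b anchors the keyword to a word start; suffixes like "talking" still match
        if any(re.search(r"\b" + re.escape(keyword), text) for keyword in keywords):
            return label
    return "Other"

# Illustrative strings, not taken from the dataset
for example in ["Car/Truck Music", "Loud Talking", "Jack Hammering", "Scare device", None]:
    print(example, "->", group_descriptor(example))
```

Rule order still matters: "Car/Truck Music" lands in Music/Party because that rule is checked first, which mirrors the behavior of the original if-chain.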

I created a crosstab to measure how each grouped sound descriptor is distributed across different location types, allowing me to compare where different kinds of noise tend to occur.

In [None]:
# Crosstab: proportion of location types within each grouped descriptor
descriptor_loc_prop = pd.crosstab(
    df["Descriptor_Grouped"],
    df["Location Type"],
    normalize="index")

descriptor_loc_prop.head()

I then visualized this crosstab as a stacked bar chart to clearly show the proportion of location types within each grouped descriptor category and reveal spatial patterns in the noise complaints.

In [None]:
# DataFrame.plot opens its own figure, so the size is passed here rather than via plt.figure
descriptor_loc_prop.plot(kind="bar", stacked=True, figsize=(10, 6))
plt.xlabel("Grouped Descriptor")
plt.ylabel("Proportion of complaints")
plt.title("Location Type distribution within each grouped descriptor")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Location Type", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

This plot shows how each grouped noise descriptor is distributed across different location types. Most categories, especially Music/Party and Talking/Impact Noise, are dominated by residential building complaints, confirming that sound conflicts in Harlem are primarily household or neighbor-related. Construction and vehicle-related noise extend more into street and commercial spaces, reflecting their outdoor and mobility-based nature.

The single tall blue bar (“Above Address”) belongs almost entirely to helicopter-related complaints, which form their own small cluster in the dataset. This outlier appears because helicopter noise is always reported as occurring “above address,” making it a structurally different sound type compared to ground-level noise categories.
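A quick way to confirm which raw descriptors sit behind a single location type is to filter on that location and count the Descriptor values. The sketch below uses a made-up frame with the same column names; the rows and their values are hypothetical. It also checks that normalize="index" makes each row of the crosstab sum to 1.

```python
import pandas as pd

# Hypothetical rows with the same column names as the notebook's df
toy = pd.DataFrame({
    "Descriptor": ["Loud Music/Party", "Loud Music/Party", "Helicopter",
                   "Helicopter", "Banging/Pounding"],
    "Descriptor_Grouped": ["Music/Party", "Music/Party", "Other",
                           "Other", "Talking/Impact Noise"],
    "Location Type": ["Residential Building/House", "Street/Sidewalk", "Above Address",
                      "Above Address", "Residential Building/House"],
})

# Row-normalized crosstab: each grouped descriptor's row sums to 1
prop = pd.crosstab(toy["Descriptor_Grouped"], toy["Location Type"], normalize="index")
row_sums = prop.sum(axis=1)

# Raw descriptors reported with the "Above Address" location type
above = toy.loc[toy["Location Type"] == "Above Address", "Descriptor"].value_counts()
print(prop.round(2))
print(above)
```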

<span style="color: red">should we do something about it?</span>

To explore whether different sound types tend to occur in different north–south areas of Harlem, I compared latitude distributions across the most common grouped descriptors. This boxplot shows how each major noise category is distributed along the north–south axis of the neighborhood.

In [None]:
# Focusing on the most common grouped descriptors
top_groups = df["Descriptor_Grouped"].value_counts().head(5).index
subset = df[df["Descriptor_Grouped"].isin(top_groups)]

# boxplot(by=...) opens its own figure, so the size is passed here rather than via plt.figure
subset.boxplot(column="Latitude", by="Descriptor_Grouped", rot=45, figsize=(10, 6))
plt.ylabel("Latitude")
plt.title("Latitude distribution by grouped descriptor (top 5)")
plt.suptitle("")
plt.tight_layout()
plt.show()

The latitude distributions for the major sound categories overlap heavily, showing that these noise types occur throughout Harlem without strong north–south clustering. While some categories show slight shifts, no descriptor group is concentrated in a distinct latitude band.
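The overlap seen in the boxplot can be backed up with the numbers the boxes are drawn from: the quartiles of Latitude within each group. This is a small sketch on invented coordinates; in the notebook the same call would run on `subset`.

```python
import pandas as pd

# Invented coordinates for two groups
toy = pd.DataFrame({
    "Descriptor_Grouped": ["Music/Party"] * 4 + ["Construction"] * 4,
    "Latitude": [40.80, 40.81, 40.82, 40.83, 40.805, 40.815, 40.825, 40.835],
})

# Quartiles per group; the interquartile range (IQR) is the height of each box
summary = toy.groupby("Descriptor_Grouped")["Latitude"].describe()[["25%", "50%", "75%"]]
summary["IQR"] = summary["75%"] - summary["25%"]
print(summary)
```

If the medians of two groups differ by much less than their IQRs, as in this toy data, the boxes overlap heavily.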

To understand how the key numeric features relate to one another, I computed both the correlation and covariance matrices for day_of_year, minutes_since_midnight, Latitude and Longitude. This helps reveal whether temporal and spatial patterns co-vary in meaningful ways and whether any of these features carry overlapping information that could influence the predictive model.

In [None]:
numeric_cols = ["day_of_year", "minutes_since_midnight", "Latitude", "Longitude"]

corr_matrix = df[numeric_cols].corr()
cov_matrix = df[numeric_cols].cov()

# Show each matrix as its own table instead of a single tuple
display(corr_matrix)
display(cov_matrix)

In [None]:
plt.figure(figsize=(7, 5))
plt.imshow(corr_matrix, vmin=-1, vmax=1)
plt.colorbar(label="Correlation")
plt.xticks(range(len(numeric_cols)), numeric_cols, rotation=45, ha="right")
plt.yticks(range(len(numeric_cols)), numeric_cols)
plt.title("Correlation matrix of key numeric features")
plt.tight_layout()
plt.show()

The correlation matrix shows that the key numeric features in the dataset are largely uncorrelated with one another. Both temporal variables (day_of_year and minutes_since_midnight) have near-zero correlation with the spatial variables (Latitude and Longitude), indicating that the timing of complaints does not vary linearly across Harlem’s geography. Latitude and Longitude show only a modest correlation, which likely reflects the general orientation of the neighborhood. Overall, this suggests that the numeric features are not linearly redundant, so each can contribute its own information to the model. Because correlation only measures linear association, this does not rule out nonlinear relationships between them.
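The two matrices are linked by the relation r(x, y) = cov(x, y) / (s_x · s_y), where s_x and s_y are the sample standard deviations. The sketch below checks this on made-up numbers and shows why the covariance matrix is harder to read than the correlation matrix when features live on very different scales, as day_of_year and Latitude do. The column names `x` and `y` are placeholders.

```python
import pandas as pd

# Made-up values; x and y stand in for any two numeric features
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})

cov_xy = toy.cov().loc["x", "y"]
corr_xy = toy.corr().loc["x", "y"]

# Correlation is covariance divided by the product of standard deviations
corr_from_cov = cov_xy / (toy["x"].std() * toy["y"].std())

# Rescaling x changes the covariance but leaves the correlation unchanged
scaled = toy.assign(x=toy["x"] * 1000)
cov_scaled = scaled.cov().loc["x", "y"]
corr_scaled = scaled.corr().loc["x", "y"]

print(cov_xy, corr_xy, corr_from_cov)
print(cov_scaled, corr_scaled)
```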