# Project 2: Housing Violations and Rodent Inspections in New York City

In New York City, complaints about poor housing conditions, rodents, and other pests are part of everyday life. Two city agencies look at different sides of this problem:

- **Housing Maintenance Code violations**, issued by the Department of Housing Preservation and Development (HPD) when apartments or buildings do not meet basic standards  
- **Rodent inspections**, carried out by the Department of Health and Mental Hygiene (DOHMH) in response to rat and pest complaints  

These two sets of records are related but not identical. A borough can have a high number of housing violations but relatively few rodent inspections, or the opposite. I wanted a clear, data-driven view of how these patterns compare across the five boroughs.

## Main question

The main question for my project is: how do the five NYC boroughs compare in **2024** in the number of **housing maintenance code violations** recorded by HPD and the number of **rodent inspections** conducted by DOHMH, and what does the relationship between the two look like when they are shown together?

In other words:

> Do boroughs with more housing code violations also see more rodent inspections, and which boroughs appear to receive more rodent inspection attention relative to their volume of violations?

## Data and sources

I work with two datasets from NYC Open Data:

1. **Housing Maintenance Code Violations**  
   - Source: NYC Open Data  
   - URL: `https://data.cityofnewyork.us/resource/wvxf-dwi5.csv`  
   - Each row represents a single violation issued for a building.  
   - I use the `inspectiondate` column to select records from the **2024 calendar year** and summarize them by `boro`.

2. **Rodent Inspection**  
   - Source: NYC Open Data  
   - URL: `https://data.cityofnewyork.us/resource/p937-wjvj.csv`  
   - Each row represents a single rodent inspection at a specific location.  
   - I use the `inspection_date` column to keep only inspections from **2024** and summarize them by `borough`.

Both datasets are loaded directly from their CSV API endpoints with `pandas.read_csv`, and all cleaning, filtering, and aggregation is done in Python.

## Brief overview of what I do in this notebook

1. Load each dataset from its NYC Open Data CSV endpoint.  
2. Convert the inspection date columns to datetime and filter to **2024**.  
3. Aggregate each dataset to get **one row per borough** with:
   - total housing violations in 2024  
   - total rodent inspections in 2024  
4. Merge the two borough-level summaries into a single combined table.  
5. Create two visualizations:
   - a **scatter plot** showing the relationship between violations and inspections by borough  
   - a **bar chart** of rodent inspections **per 10,000 housing violations** by borough  
6. Conclude with an interpretation section that discusses the main patterns and some limitations of this simple analysis.


In [241]:
# ensure the visualizations render properly across VSCode, Jupyter Book, etc.
# https://plotly.com/python/renderers/

import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

## Setup

In this section I import the Python libraries I use throughout the project and set up my notebook.

- `pandas` for loading, cleaning, and aggregating the data.  
- `plotly.express` for interactive charts.


In [242]:
import pandas as pd
import plotly.express as px


Here I load the **Housing Maintenance Code Violations** dataset from NYC Open Data
into a DataFrame called `hpd_raw` and preview the first rows with `.head()` to make
sure the data loaded correctly. The full dataset is very large, so I add a `$limit=50000` parameter to pull a large but manageable subset. This subset is simply the first 50,000 rows the API returns, not a random or complete sample, so the 2024 counts below describe this subset rather than citywide totals.

In [243]:
# HPD Housing Maintenance Code Violations CSV API URL
# Use a limit so the notebook is lighter and works on GitHub Pages.
hpd_url = "https://data.cityofnewyork.us/resource/wvxf-dwi5.csv?$limit=50000"

hpd_raw = pd.read_csv(hpd_url)
hpd_raw.head()



Columns (12) have mixed types. Specify dtype option on import or set low_memory=False.



|   | violationid | buildingid | registrationid | boroid | boro | housenumber | lowhousenumber | highhousenumber | streetname | streetcode | ... | violationstatus | rentimpairing | latitude | longitude | communityboard | councildistrict | censustract | bin | bbl | nta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10081311 | 375411 | 306067 | 3 | BROOKLYN | 22 FRONT | 22 FRON | 22 FRONT | STAGG STREET | 80930 | ... | Close | N |  |  |  |  |  |  |  |  |
| 1 | 10299683 | 375411 | 306067 | 3 | BROOKLYN | 22 FRONT | 22 FRON | 22 FRONT | STAGG STREET | 80930 | ... | Close | N |  |  |  |  |  |  |  |  |
| 2 | 10299685 | 375411 | 306067 | 3 | BROOKLYN | 22 FRONT | 22 FRON | 22 FRONT | STAGG STREET | 80930 | ... | Close | N |  |  |  |  |  |  |  |  |
| 3 | 10299686 | 375411 | 306067 | 3 | BROOKLYN | 22 FRONT | 22 FRON | 22 FRONT | STAGG STREET | 80930 | ... | Open | N |  |  |  |  |  |  |  |  |
| 4 | 10299690 | 375411 | 306067 | 3 | BROOKLYN | 22 FRONT | 22 FRON | 22 FRONT | STAGG STREET | 80930 | ... | Open | N |  |  |  |  |  |  |  |  |
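Because `$limit` only truncates the result, one alternative is to let the API do the filtering and counting. The sketch below only builds the query URL; it assumes the endpoint accepts the standard SoQL parameters `$select`, `$where`, and `$group`, and that `inspectiondate` is stored as a timestamp column that can be compared against ISO date strings. The helper name `build_borough_count_url` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/wvxf-dwi5.csv"


def build_borough_count_url(base_url, date_col, borough_col, year):
    """Build a SoQL URL that asks for one row per borough with a row count.

    Assumption: date_col is a timestamp column on the server side.
    """
    params = {
        # count(*) AS n gives the number of matching rows in each group
        "$select": f"{borough_col}, count(*) AS n",
        # half-open interval: from Jan 1 of `year` up to (not including) Jan 1 of the next year
        "$where": (
            f"{date_col} >= '{year}-01-01T00:00:00' "
            f"AND {date_col} < '{year + 1}-01-01T00:00:00'"
        ),
        "$group": borough_col,
    }
    return f"{base_url}?{urlencode(params)}"


hpd_counts_url = build_borough_count_url(BASE_URL, "inspectiondate", "boro", 2024)
print(hpd_counts_url)
```

Passing the resulting URL to `pd.read_csv` should return a small table of per-borough counts for the whole year, instead of whichever 50,000 rows happen to come back first.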


Here I list all the columns in `hpd_raw` to see which fields are available and to identify the specific columns I want to use. The sample confirms that `inspectiondate` and `boro` look usable for filtering and grouping.


In [244]:
hpd_raw.columns.to_list()


['violationid',
 'buildingid',
 'registrationid',
 'boroid',
 'boro',
 'housenumber',
 'lowhousenumber',
 'highhousenumber',
 'streetname',
 'streetcode',
 'zip',
 'apartment',
 'story',
 'block',
 'lot',
 'class',
 'inspectiondate',
 'approveddate',
 'originalcertifybydate',
 'originalcorrectbydate',
 'newcertifybydate',
 'newcorrectbydate',
 'certifieddate',
 'ordernumber',
 'novid',
 'novdescription',
 'novissueddate',
 'currentstatusid',
 'currentstatus',
 'currentstatusdate',
 'novtype',
 'violationstatus',
 'rentimpairing',
 'latitude',
 'longitude',
 'communityboard',
 'councildistrict',
 'censustract',
 'bin',
 'bbl',
 'nta']

I define convenient variables for the HPD columns I plan to use:

- `borough_col` for the borough column (`"boro"`)  
- `hpd_date_col` for the inspection date column (`"inspectiondate"`)

I then convert the date column to datetime and keep only the rows from 2024.

In [245]:
borough_col = "boro"            # borough column
hpd_date_col = "inspectiondate" # date column

# Converting inspectiondate to datetime
hpd_raw[hpd_date_col] = pd.to_datetime(hpd_raw[hpd_date_col], errors="coerce")

# Filtering to calendar year 2024
hpd_2024 = hpd_raw[hpd_raw[hpd_date_col].dt.year == 2024].copy()
hpd_2024.shape


(16126, 41)
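To make the behavior of `errors="coerce"` concrete, here is a small sketch on made-up values (the `toy` DataFrame is illustrative, not taken from the HPD data). Anything that cannot be parsed becomes `NaT`, and `NaT` rows never satisfy the `.dt.year == 2024` test, so they are dropped silently by the filter.

```python
import pandas as pd

# Illustrative values only: two valid 2024 dates, one 2023 date,
# one unparseable string, and one missing value.
toy = pd.DataFrame({
    "inspectiondate": [
        "2024-03-15T00:00:00.000",
        "2023-12-31T00:00:00.000",
        "not a date",
        None,
        "2024-11-02T00:00:00.000",
    ]
})

parsed = pd.to_datetime(toy["inspectiondate"], errors="coerce")

# Rows that could not be parsed (or were missing) show up as NaT
n_unparsed = int(parsed.isna().sum())

# NaT compares as False here, so only real 2024 dates are kept
in_2024 = toy[parsed.dt.year == 2024]

print(f"unparsed rows: {n_unparsed}, rows in 2024: {len(in_2024)}")
```

Checking the number of `NaT` values before filtering is a cheap way to see how many rows the year filter discards for formatting reasons rather than because of their date.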

Here I count the 2024 HPD violations in each borough.

In [246]:
hpd_by_boro = (
    hpd_2024
    .groupby(borough_col, as_index=False)
    .size()
    .rename(columns={borough_col: "boro", "size": "hpd_violations_2024"})
)

hpd_by_boro


|   | boro | hpd_violations_2024 |
|---|---|---|
| 0 | BRONX | 6450 |
| 1 | BROOKLYN | 4812 |
| 2 | MANHATTAN | 2570 |
| 3 | QUEENS | 2056 |
| 4 | STATEN ISLAND | 238 |
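A grouped count only lists boroughs that actually appear in the filtered rows. With a capped download, a borough could in principle be missing entirely and would then drop out of the later inner merge without warning. The sketch below, on a made-up `toy` table, shows one way to guarantee one row per borough by reindexing against a fixed list; the `BOROUGHS` constant is an assumption introduced for this example.

```python
import pandas as pd

# Fixed list of expected borough labels (HPD-style upper case)
BOROUGHS = ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

# Illustrative rows only: note that STATEN ISLAND never appears
toy = pd.DataFrame({"boro": ["BRONX", "BRONX", "QUEENS", "BROOKLYN", "MANHATTAN"]})

counts = (
    toy["boro"]
    .value_counts()                      # rows per borough that is present
    .reindex(BOROUGHS, fill_value=0)     # add missing boroughs with a count of 0
    .rename_axis("boro")
    .reset_index(name="hpd_violations_2024")
)

print(counts)
```

In the notebook's actual output all five boroughs are present, so this is a safeguard rather than a correction.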


Next I work with the **Rodent Inspection** dataset from the Department of Health and Mental Hygiene (DOHMH).  
This dataset contains one row per inspection and includes the inspection date and borough. As with the HPD data, I cap the download, here with `$limit=20000`.

In [247]:
# Rodent Inspection dataset (DOHMH)
rats_url = "https://data.cityofnewyork.us/resource/p937-wjvj.csv?$limit=20000"

rats_raw = pd.read_csv(rats_url)
rats_raw.head()


|   | inspection_type | job_ticket_or_work_order_id | job_id | job_progress | bbl | boro_code | block | lot | house_number | street_name | ... | borough | inspection_date | result | approved_date | location | community_board | council_district | census_tract | bin | nta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Initial | 11670593 | PC6530234 | 1 | 2032890000.0 | 2 | 3289 | 25 | 326.0 | EAST 198 STREET | ... | Bronx | 2010-08-30T15:23:11.000 | Passed | 2010-09-03T10:43:36.000 | `\n, \n(40.867726534028, -73.887461100839)` | 7.0 | 15.0 | 40502.0 | 2016678.0 | Bedford Park |
| 1 | Initial | 11758853 | PC6101553 | 1 | 1013290000.0 | 1 | 1329 | 121 | 245.0 | EAST 55 STREET | ... | Manhattan | 2011-08-18T12:05:54.000 | Passed | 2011-08-19T12:02:56.000 | `\n, \n(40.758511490599, -73.967433834067)` | 6.0 | 4.0 | 10801.0 | 1038588.0 | East Midtown-Turtle Bay |
| 2 | Initial | 12504178 | PC7270050 | 1 |  | 3 | 3141 | 20 |  | MONTIETH STREET | ... | Brooklyn | 2018-10-10T12:57:02.000 | Passed | 2018-10-11T08:59:21.000 | `\n, \n(40.7014506131434, -73.9354065814951)` |  |  |  |  |  |
| 3 | Initial | 12560587 | PC6481130 | 1 | 1021110000.0 | 1 | 2111 | 15 | 470.0 | WEST 165 STREET | ... | Manhattan | 2019-02-07T12:48:34.000 | Passed | 2019-02-13T10:28:33.000 | `\n, \n(40.83764407994, -73.93777140242)` | 12.0 | 10.0 | 24301.0 | 1062635.0 | Washington Heights (South) |
| 4 | Initial | 12345229 | PC6794074 | 1 | 2031490000.0 | 2 | 3149 | 90 | 2110.0 | RYER AVENUE | ... | Bronx | 2017-10-16T13:02:51.000 | Rat Activity | 2017-10-27T14:31:42.000 | `\n, \n(40.853455091584, -73.900632420841)` | 5.0 | 15.0 | 381.0 | 2013535.0 | Mount Hope |


With the rodent inspections loaded into `rats_raw` and previewed with `.head()`, I list the column names to confirm that `inspection_date` and `borough` are available for filtering and grouping.

In [248]:
rats_raw.columns.to_list()


['inspection_type',
 'job_ticket_or_work_order_id',
 'job_id',
 'job_progress',
 'bbl',
 'boro_code',
 'block',
 'lot',
 'house_number',
 'street_name',
 'zip_code',
 'x_coord',
 'y_coord',
 'latitude',
 'longitude',
 'borough',
 'inspection_date',
 'result',
 'approved_date',
 'location',
 'community_board',
 'council_district',
 'census_tract',
 'bin',
 'nta']


Here, I define variables for the rodent inspection dataset:

- `rats_borough_col` for the borough column (`"borough"`)  
- `rats_date_col` for the inspection date column (`"inspection_date"`)

These keep the cleaning and summarizing code below easier to read.

In [249]:
rats_borough_col = "borough"
rats_date_col = "inspection_date"

# Converting inspection_date to datetime
rats_raw[rats_date_col] = pd.to_datetime(rats_raw[rats_date_col], errors="coerce")

# Filtering to calendar year 2024
rats_2024 = rats_raw[rats_raw[rats_date_col].dt.year == 2024].copy()
rats_2024.shape


(312, 25)
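Only 312 rows of the capped rodent download fall in 2024, and the preview above shows dates as early as 2010, so it is worth checking what period the download actually covers. The sketch below defines a hypothetical helper, `date_coverage`, and runs it on a few made-up rows; on the real data it would be called as `date_coverage(rats_raw, "inspection_date", 2024)`.

```python
import pandas as pd


def date_coverage(df, date_col, year):
    """Summarize the date range of a table and how much of it falls in one year."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    in_year = dates.dt.year == year
    return {
        "rows": len(df),
        "earliest": dates.min(),
        "latest": dates.max(),
        "rows_in_year": int(in_year.sum()),
        "share_in_year": float(in_year.mean()),
    }


# Illustrative rows only, mixing older and 2024 inspections
toy = pd.DataFrame({
    "inspection_date": [
        "2010-08-30T15:23:11.000",
        "2018-10-10T12:57:02.000",
        "2024-05-01T09:00:00.000",
        "2024-07-19T14:30:00.000",
    ]
})

summary = date_coverage(toy, "inspection_date", 2024)
print(summary)
```

A small share of rows in the target year suggests the capped download is spread across many years, which means the per-borough counts are a thin slice of the year's inspections rather than a full tally.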

Here I count the 2024 rodent inspections in each borough.

In [250]:
rats_by_boro = (
    rats_2024
    .groupby(rats_borough_col, as_index=False)
    .size()
    .rename(columns={rats_borough_col: "borough", "size": "rat_inspections_2024"})
)

rats_by_boro


|   | borough | rat_inspections_2024 |
|---|---|---|
| 0 | Bronx | 60 |
| 1 | Brooklyn | 128 |
| 2 | Manhattan | 88 |
| 3 | Queens | 31 |
| 4 | Staten Island | 5 |


HPD and DOHMH store borough names in slightly different formats: the column is called `boro` in one dataset and `borough` in the other, and the capitalization differs.  
Here I create a shared `Borough` column in both summary tables using a consistent title case format (e.g., `"Bronx"`, `"Brooklyn"`) so I can merge them reliably.

I then merge `hpd_by_boro` and `rats_by_boro` on the common `Borough` column. This produces a single `combined` table with **one row per borough** containing:

- HPD housing violations in 2024 (`hpd_violations_2024`)  
- rodent inspections in 2024 (`rat_inspections_2024`)

Finally, I select only the columns I need for visualization:

- `Borough`  
- `hpd_violations_2024`  
- `rat_inspections_2024`

and store them in a cleaner DataFrame called `combined_clean`.

In [251]:
# Making a formatted 'Borough' string in both tables
hpd_by_boro["Borough"] = hpd_by_boro["boro"].str.title()
rats_by_boro["Borough"] = rats_by_boro["borough"].str.title()

# Merging HPD and rodent summaries
combined = hpd_by_boro.merge(
    rats_by_boro[["Borough", "rat_inspections_2024"]],
    on="Borough",
    how="inner"
)

# keeping only the main columns
combined_clean = combined[["Borough", "hpd_violations_2024", "rat_inspections_2024"]].copy()
combined_clean


|   | Borough | hpd_violations_2024 | rat_inspections_2024 |
|---|---|---|---|
| 0 | Bronx | 6450 | 60 |
| 1 | Brooklyn | 4812 | 128 |
| 2 | Manhattan | 2570 | 88 |
| 3 | Queens | 2056 | 31 |
| 4 | Staten Island | 238 | 5 |
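An inner merge drops any borough whose label does not match exactly in both tables. A quick way to confirm that nothing was lost is to repeat the merge as an outer join with pandas' `validate` and `indicator` options. The sketch below rebuilds the two summaries from the counts shown above; the variable names `hpd`, `rats`, `checked`, and `unmatched` are introduced only for this example.

```python
import pandas as pd

# Borough summaries re-entered from the tables above
hpd = pd.DataFrame({
    "Borough": ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
    "hpd_violations_2024": [6450, 4812, 2570, 2056, 238],
})
rats = pd.DataFrame({
    "Borough": ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
    "rat_inspections_2024": [60, 128, 88, 31, 5],
})

checked = hpd.merge(
    rats,
    on="Borough",
    how="outer",              # keep rows that appear on only one side
    validate="one_to_one",    # raise if a borough is duplicated in either table
    indicator=True,           # adds a _merge column: both / left_only / right_only
)

# Any row not marked "both" is a borough that failed to match
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

An empty `unmatched` table means the title-case normalization lined up all five boroughs, so the inner merge did not drop anything.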


## Visual 1 – Scatter plot: Violations vs Rodent Inspections

This scatter plot compares the **raw counts** for each borough:

- x axis: total HPD housing violations in 2024  
- y axis: total DOHMH rodent inspections in 2024  
- Each point represents one borough and is labeled with the borough name

This lets me see whether boroughs with more housing violations also tend to have more rodent inspections.

In [252]:
fig = px.scatter(
    combined_clean,
    x="hpd_violations_2024",
    y="rat_inspections_2024",
    text="Borough",
    hover_name="Borough",
    labels={
        "hpd_violations_2024": "Housing Code Violations (HPD, 2024)",
        "rat_inspections_2024": "Rodent Inspections (DOHMH, 2024)",
    },
    title="Housing Violations vs. Rodent Inspections by Borough, NYC 2024"
)

fig.update_traces(textposition="top center")

fig.show()

### Interpreting the scatter plot

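- The relationship is only loosely positive: the two boroughs with the fewest violations (Queens and Staten Island) also have the fewest inspections, but the ordering breaks down among the other three.  
- Brooklyn sits high on both axes, with 4,812 violations and 128 inspections, the most inspections of any borough.  
- The Bronx has the most violations (6,450) but only 60 inspections, fewer than Brooklyn or Manhattan.  
- Manhattan has fewer violations than Brooklyn and the Bronx but still a relatively high number of inspections (88).  
- Queens and Staten Island have lower counts of both violations and inspections.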

This view shows how the two datasets move together, but it does not show inspection *intensity* relative to violations. For that, I compute a normalized rate.
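To put a number on how closely the two counts move together, a correlation coefficient can be computed from the five borough rows. This is a rough sketch using the counts in `combined_clean`, re-entered here so the block runs on its own; with only five points the result is descriptive at best and should not be read as a statistical test.

```python
import pandas as pd

# Counts re-entered from the combined table above
combined = pd.DataFrame({
    "Borough": ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
    "hpd_violations_2024": [6450, 4812, 2570, 2056, 238],
    "rat_inspections_2024": [60, 128, 88, 31, 5],
})

x = combined["hpd_violations_2024"]
y = combined["rat_inspections_2024"]

# Pearson correlation: strength of the linear relationship between raw counts
pearson_r = x.corr(y)

# Rank-based (Spearman-style) correlation: Pearson applied to the ranks,
# which asks only whether the borough orderings agree
rank_rho = x.rank().corr(y.rank())

print(f"Pearson r: {pearson_r:.2f}, rank correlation: {rank_rho:.2f}")
```

Both values are positive but well short of 1, which matches the visual impression that the Bronx pulls the pattern away from a clean upward line.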


## Visual 2 – Rat inspections per 10,000 housing violations

To compare inspection intensity across boroughs more fairly, I calculate a simple rate:

> rodent inspections per 10,000 HPD violations

for each borough. This shows how many rodent inspections DOHMH conducts relative to the number of housing violations, not just the raw counts.
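Written as a formula for a borough $b$, with a worked check using the Manhattan and Bronx counts from the combined table:

$$
\text{rate}_b = \frac{\text{rodent inspections}_b}{\text{HPD violations}_b} \times 10{,}000
$$

$$
\text{rate}_{\text{Manhattan}} = \frac{88}{2570} \times 10{,}000 \approx 342.4,
\qquad
\text{rate}_{\text{Bronx}} = \frac{60}{6450} \times 10{,}000 \approx 93.0
$$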


In [253]:
combined_ratio = combined_clean.copy()

combined_ratio["rats_per_10k_violations"] = (
    combined_ratio["rat_inspections_2024"]
    / combined_ratio["hpd_violations_2024"]
    * 10000
)

combined_ratio

|   | Borough | hpd_violations_2024 | rat_inspections_2024 | rats_per_10k_violations |
|---|---|---|---|---|
| 0 | Bronx | 6450 | 60 | 93.023256 |
| 1 | Brooklyn | 4812 | 128 | 266.001663 |
| 2 | Manhattan | 2570 | 88 | 342.412451 |
| 3 | Queens | 2056 | 31 | 150.77821 |
| 4 | Staten Island | 238 | 5 | 210.084034 |


I sort the boroughs by this rate and keep a tidy version of the table with:

- `Borough`  
- `hpd_violations_2024`  
- `rat_inspections_2024`  
- `rats_per_10k_violations`

This makes it easier to see which boroughs receive more rodent inspections per violation.

In [254]:
combined_ratio_sorted = combined_ratio.sort_values(
    "rats_per_10k_violations", ascending=False
)

combined_ratio_sorted[["Borough", "hpd_violations_2024",
                       "rat_inspections_2024", "rats_per_10k_violations"]]

|   | Borough | hpd_violations_2024 | rat_inspections_2024 | rats_per_10k_violations |
|---|---|---|---|---|
| 2 | Manhattan | 2570 | 88 | 342.412451 |
| 1 | Brooklyn | 4812 | 128 | 266.001663 |
| 4 | Staten Island | 238 | 5 | 210.084034 |
| 3 | Queens | 2056 | 31 | 150.77821 |
| 0 | Bronx | 6450 | 60 | 93.023256 |


Finally, I plot the **rat inspections per 10,000 housing violations** as a bar chart by borough.  
This highlights which boroughs have higher or lower inspection intensity relative to their housing violation counts.

In [255]:
fig2 = px.bar(
    combined_ratio_sorted,
    x="Borough",
    y="rats_per_10k_violations",
    labels={
        "Borough": "Borough",
        "rats_per_10k_violations": "Rodent inspections per 10,000 HPD violations (2024)",
    },
    title="Rat Inspections Relative to Housing Violations, NYC 2024",
)

fig2.update_layout(
    yaxis_tickformat=".1f",   # e.g., 15.2 instead of long decimals
)

fig2.show()

## Overall takeaways and limitations

**Takeaways**

1. Housing violations and rodent inspections are loosely positively related in this data: the boroughs with the fewest violations (Queens and Staten Island) also have the fewest inspections, but the Bronx is a clear exception.  
2. Brooklyn carries a heavy burden on both measures, while the Bronx has the most violations (6,450) but comparatively few rodent inspections (60).  
3. When I normalize by violations, Manhattan has the highest rate (about 342 rodent inspections per 10,000 violations) and the Bronx the lowest (about 93).  
4. A simple borough-level merge of HPD and DOHMH data already reveals differences in how housing conditions and pest control are addressed across the city.

**Limitations**

This analysis has several limitations. First, the counts come from capped API downloads (`$limit=50000` for HPD and `$limit=20000` for rodent inspections), so they describe the rows the API returned rather than complete 2024 totals; only 312 rodent inspection rows in the download fall in 2024. Second, it only uses one year of data (2024), so the patterns I see might look different if I examined multiple years or longer-term trends. Third, I work with counts of records, not counts of unique buildings or households, which means some addresses may appear many times while others do not appear at all. I also do not adjust for population size, the amount of housing stock, or neighborhood-level characteristics, so comparisons across boroughs are not normalized for how many people or units are actually at risk. Finally, both reporting behavior and enforcement practices likely differ by borough, which can influence the number of violations recorded and the number of inspections conducted, independent of the true underlying conditions.
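The point about records versus unique buildings can be checked directly, since the HPD data includes a `buildingid` column. The sketch below uses a made-up `toy` table (the first `buildingid` comes from the preview above, the others are hypothetical) to compare the number of violation records with the number of distinct buildings per borough.

```python
import pandas as pd

# Illustrative rows only: three violations at one Brooklyn building,
# plus single violations at three other (hypothetical) buildings
toy = pd.DataFrame({
    "boro": ["BROOKLYN", "BROOKLYN", "BROOKLYN", "BROOKLYN", "BRONX", "BRONX"],
    "buildingid": [375411, 375411, 375411, 900001, 900002, 900003],
})

summary = (
    toy.groupby("boro")
    .agg(
        violations=("buildingid", "size"),     # one per record
        buildings=("buildingid", "nunique"),   # one per distinct building
    )
    .reset_index()
)
summary["violations_per_building"] = summary["violations"] / summary["buildings"]

print(summary)
```

A high ratio of violations to buildings would indicate that a borough's count is driven by a small number of addresses with many violations each, which the raw totals cannot show.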

## Note on the use of AI tools

I used ChatGPT to support parts of this project. At the beginning, I used it to help define and narrow the scope of the project and to sanity-check whether my research question about housing violations and rodent inspections made sense for the assignment requirements. I also asked for help with some coding details, such as how to filter by year using the inspection date columns, how to group and aggregate by borough, and how to structure the Plotly code for the scatter plot and bar chart. Finally, I used it to double-check whether my choice of graphs best shows the relationship between the two datasets and to improve the clarity of my markdown explanations.

