**Goal**  
Clean and explore the PLUTO tax lot data so it is ready to link to restaurants and to describe neighborhood property characteristics.

**Plan**
1. Load the PLUTO CSV from `data/raw`.
2. Keep only the columns we need.
3. Convert numeric fields (assessed value, year built) to the right types and handle missing values.
4. Check basic distributions for borough, land use, building class, assessed value, and year built.
5. Filter or flag obvious outliers or unusable rows if needed.
6. Save a cleaned version.
7. Write a short summary.

In [1]:
import pandas as pd

print("Loading PLUTO data...")

pluto_path = "../data/raw/Primary_Land_Use_Tax_Lot_Output_(PLUTO)_20251203.csv"
pluto = pd.read_csv(pluto_path, low_memory=False)

print(f"✓ Done! Loaded {len(pluto):,} rows and {len(pluto.columns)} columns")

Loading PLUTO data...
✓ Done! Loaded 858,284 rows and 101 columns
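
Step 2 of the plan (keep only the columns we need) can also be applied at load time, which avoids holding all 101 columns in memory. The following is a minimal sketch, not the approach used in this notebook: `WANTED` and `load_subset` are hypothetical names, the column labels are taken from the column listing printed later in this notebook, and the one-row CSV is a stand-in built from the first preview row.

```python
import io
import pandas as pd

# Hypothetical subset of columns, using the labels that appear in this export.
WANTED = ["borough", "Tax block", "Tax lot", "postcode", "yearbuilt"]

def load_subset(source, wanted=WANTED) -> pd.DataFrame:
    """Read only the wanted columns; `source` is a path or a file-like object."""
    # A callable `usecols` silently skips wanted names that are absent from the file.
    return pd.read_csv(source, usecols=lambda name: name in wanted, low_memory=False)

# Stand-in for the real file: one row with an extra column (firecomp) that gets skipped.
demo_csv = io.StringIO(
    "borough,Tax block,Tax lot,postcode,firecomp,yearbuilt\n"
    "BX,2869,47,10453.0,E075,2003.0\n"
)
subset = load_subset(demo_csv)
print(subset.columns.tolist())
```

Passing `pluto_path` instead of `demo_csv` would apply the same filter to the real file, assuming the labels in `WANTED` match its header exactly.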


In [4]:
pluto.head(80)

Unnamed: 0,borough,Tax block,Tax lot,community board,census tract 2010,cb2010,schooldist,council district,postcode,firecomp,...,bctcb2020,geom,basempdate,dcasdate,edesigdate,landmkdate,masdate,polidate,rpaddate,zoningdate
0,BX,2869,47,205.0,243.0,3000.0,10.0,14.0,10453.0,E075,...,2.024300e+10,,,,,,,,,
1,MN,675,39,104.0,99.0,1017.0,2.0,3.0,10001.0,E034,...,1.009902e+10,,,,,,,,,
2,MN,698,54,104.0,99.0,1030.0,2.0,3.0,10001.0,E003,...,1.009902e+10,,,,,,,,,
3,MN,698,56,104.0,99.0,1030.0,2.0,3.0,10001.0,E003,...,1.009902e+10,,,,,,,,,
4,MN,698,28,104.0,99.0,1030.0,2.0,3.0,10001.0,E003,...,1.009902e+10,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,MN,1288,1,105.0,102.0,1007.0,2.0,4.0,10022.0,E023,...,1.010200e+10,,,,,,,,,
76,MN,1288,3,105.0,102.0,1007.0,2.0,4.0,10022.0,E023,...,1.010200e+10,,,,,,,,,
77,MN,1288,69,105.0,102.0,1007.0,2.0,4.0,10022.0,E023,...,1.010200e+10,,,,,,,,,
78,MN,1287,62,105.0,102.0,1008.0,2.0,4.0,10022.0,E023,...,1.010200e+10,,,,,,,,,


In [6]:
# See all column names
print("All columns in PLUTO dataset:")
print(pluto.columns.tolist())

All columns in PLUTO dataset:
['borough', 'Tax block', 'Tax lot', 'community board', 'census tract 2010', 'cb2010', 'schooldist', 'council district', 'postcode', 'firecomp', 'policeprct', 'healtharea', 'sanitboro', 'sanitsub', 'address', 'zonedist1', 'zonedist2', 'zonedist3', 'zonedist4', 'overlay1', 'overlay2', 'spdist1', 'spdist2', 'spdist3', 'ltdheight', 'splitzone', 'bldgclass', 'landuse', 'easements', 'ownertype', 'ownername', 'lotarea', 'bldgarea', 'comarea', 'resarea', 'officearea', 'retailarea', 'garagearea', 'strgearea', 'factryarea', 'otherarea', 'areasource', 'numbldgs', 'numfloors', 'unitsres', 'unitstotal', 'lotfront', 'lotdepth', 'bldgfront', 'bldgdepth', 'ext', 'proxcode', 'irrlotcode', 'lottype', 'bsmtcode', 'assessland', 'assesstot', 'exempttot', 'yearbuilt', 'yearalter1', 'yearalter2', 'histdist', 'landmark', 'builtfar', 'residfar', 'commfar', 'facilfar', 'borocode', 'BBL', 'condono', 'tract2010', 'xcoord', 'ycoord', 'latitude', 'longitude', 'zonemap', 'zmcode', 'sanb

In [None]:
# Select key columns for our analysis
keep_cols = [
    'borough',           # Borough code
    'block',             # Tax block
    'lot',               # Tax lot
    'zipcode',           # Zip code
    'address',           # Street address
    'landuse',           # Land use category
    'bldgclass',         # Building class
    'yearbuilt',         # Year building was built
    'assesstot',         # Total assessed value
    'latitude',          # Latitude
    'longitude',         # Longitude
]

# Check which columns actually exist
available_cols = [col for col in keep_cols if col in pluto.columns]
missing_cols = [col for col in keep_cols if col not in pluto.columns]

print(f"Available columns: {available_cols}")
print(f"Missing columns: {missing_cols}")

Available columns: ['borough', 'address', 'landuse', 'bldgclass', 'yearbuilt', 'assesstot', 'latitude', 'longitude']
Missing columns: ['block', 'lot', 'zipcode']
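
The three names reported as missing do seem to be present under different labels: the column listing above includes `Tax block`, `Tax lot`, and `postcode`. A minimal sketch of renaming them to the short names used in `keep_cols`, assuming those columns carry the same information; `RENAME_MAP` and `standardize_columns` are hypothetical helpers, and the sample row reuses values from the first preview row.

```python
import pandas as pd

# Assumed mapping from the labels in this export to the short names in keep_cols.
RENAME_MAP = {"Tax block": "block", "Tax lot": "lot", "postcode": "zipcode"}

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with export-specific labels renamed; other labels are untouched."""
    return df.rename(columns=RENAME_MAP)

# One illustrative row, copied from the first row of the preview above.
sample = pd.DataFrame(
    {"borough": ["BX"], "Tax block": [2869], "Tax lot": [47], "postcode": [10453.0]}
)
renamed = standardize_columns(sample)
print(renamed.columns.tolist())
```

Applying `standardize_columns(pluto)` before the availability check would let all eleven entries of `keep_cols` resolve, if the mapping assumption holds for this file.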


In [None]:
# Create a smaller dataset with the available columns and preview it to see what the data looks like
pluto_small = pluto[available_cols].copy()
print(f"Selected {len(available_cols)} columns")
pluto_small.head()

# Land use codes in the preview: 3.0 = multi-family elevator buildings, 4.0 = mixed residential and commercial, 5.0 = commercial and office, 10.0 = parking facilities.

Selected 8 columns


Unnamed: 0,borough,address,landuse,bldgclass,yearbuilt,assesstot,latitude,longitude
0,BX,89 WEST TREMONT AVENUE,3.0,D1,2003.0,915750.0,40.850594,-73.912743
1,MN,606 WEST 30TH ST,4.0,D7,2020.0,49067100.0,40.753382,-74.004717
2,MN,530 WEST 27 STREET,5.0,J5,1920.0,4349700.0,40.75061,-74.003865
3,MN,534 WEST 27 STREET,5.0,J5,1920.0,2505150.0,40.750648,-74.003959
4,MN,511 WEST 26 STREET,10.0,G6,0.0,1191150.0,40.750129,-74.003136


In [9]:
# Check missing values
print("Missing values:")
print(pluto_small.isnull().sum())
print(f"\nTotal rows: {len(pluto_small):,}")

Missing values:
borough         0
address       444
landuse      2683
bldgclass     331
yearbuilt     334
assesstot     334
latitude     1259
longitude    1259
dtype: int64

Total rows: 858,284


In [None]:
# Filter out rows with missing coordinates 
pluto_clean = pluto_small.dropna(subset=['latitude', 'longitude']).copy()

print(f"Rows after removing missing coordinates: {len(pluto_clean):,}")
print(f"Rows removed: {len(pluto_small) - len(pluto_clean):,}")

Rows after removing missing coordinates: 857,025
Rows removed: 1,259


In [11]:
# Summary statistics
print("Properties by Borough:")
print(pluto_clean['borough'].value_counts())

print("\n" + "="*50)
print("\nLand Use Distribution:")
print(pluto_clean['landuse'].value_counts().head(10))

Properties by Borough:
borough
QN    324266
BK    275807
SI    125477
BX     89347
MN     42128
Name: count, dtype: int64


Land Use Distribution:
landuse
1.0     566637
2.0     131294
4.0      56181
11.0     24640
5.0      21155
3.0      13046
8.0      12038
6.0       9377
10.0      9244
7.0       6062
Name: count, dtype: int64
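
The numeric land use codes are easier to read with labels attached. The sketch below uses the category names as I understand them from the PLUTO data dictionary; they should be checked against the dictionary for this release. `LANDUSE_LABELS` is a hypothetical name, and the counts are the top five from the output above.

```python
import pandas as pd

# Assumed labels for PLUTO land use codes; verify against the data dictionary.
LANDUSE_LABELS = {
    1.0: "One & Two Family Buildings",
    2.0: "Multi-Family Walk-Up Buildings",
    3.0: "Multi-Family Elevator Buildings",
    4.0: "Mixed Residential & Commercial Buildings",
    5.0: "Commercial & Office Buildings",
    6.0: "Industrial & Manufacturing",
    7.0: "Transportation & Utility",
    8.0: "Public Facilities & Institutions",
    9.0: "Open Space & Outdoor Recreation",
    10.0: "Parking Facilities",
    11.0: "Vacant Land",
}

# Top five counts copied from the land use output above.
counts = pd.Series(
    {1.0: 566637, 2.0: 131294, 4.0: 56181, 11.0: 24640, 5.0: 21155}, name="count"
)
# Replace each numeric code in the index with its label.
labeled = counts.rename(index=LANDUSE_LABELS)
print(labeled)
```

The same dictionary can be passed to `pluto_clean['landuse'].map(...)` to add a label column; codes absent from the dictionary would come back as missing.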


In [12]:
# Year built and assessed value statistics
print("Year Built Statistics:")
print(pluto_clean['yearbuilt'].describe())

print("\n" + "="*50)
print("\nAssessed Value Statistics:")
print(pluto_clean['assesstot'].describe())

Year Built Statistics:
count    856692.000000
mean       1852.665070
std         406.830415
min           0.000000
25%        1920.000000
50%        1930.000000
75%        1960.000000
max        2025.000000
Name: yearbuilt, dtype: float64


Assessed Value Statistics:
count    8.566920e+05
mean     5.917970e+05
std      1.269877e+07
min      0.000000e+00
25%      4.545000e+04
50%      6.192000e+04
75%      1.125000e+05
max      8.248909e+09
Name: assesstot, dtype: float64
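
Two things stand out in these statistics. The `yearbuilt` minimum is 0 and the mean (about 1853) sits below the 25th percentile (1920), which suggests 0 is used as a placeholder for an unknown year. The `assesstot` maximum (about 8.2 billion) is far above the median (61,920), so a few very large lots dominate the mean. A minimal sketch of steps 3 and 5 of the plan under those assumptions: `clean_yearbuilt` and `flag_high_values` are hypothetical helpers, the 0.999 quantile is an arbitrary cutoff, and the sample values come from the five preview rows shown earlier.

```python
import pandas as pd

def clean_yearbuilt(years: pd.Series) -> pd.Series:
    """Treat 0 as missing (an assumption) and store years as nullable integers."""
    years = pd.to_numeric(years, errors="coerce")
    return years.mask(years == 0).astype("Int64")

def flag_high_values(values: pd.Series, quantile: float = 0.999) -> pd.Series:
    """Mark values above the given quantile; rows are flagged, not dropped."""
    return values > values.quantile(quantile)

# Values copied from the five preview rows shown earlier in this notebook.
sample = pd.DataFrame({
    "yearbuilt": [2003.0, 2020.0, 1920.0, 1920.0, 0.0],
    "assesstot": [915750.0, 49067100.0, 4349700.0, 2505150.0, 1191150.0],
})
sample["yearbuilt_clean"] = clean_yearbuilt(sample["yearbuilt"])
sample["assesstot_high"] = flag_high_values(sample["assesstot"])
print(sample)
```

Flagging rather than filtering keeps every lot available for linking while letting later summaries exclude the extremes explicitly.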


## Save Cleaned Data

In [13]:
# Save cleaned PLUTO data
output_path = '../data/processed/pluto_nyc_clean.csv'
pluto_clean.to_csv(output_path, index=False)

print(f"✓ Saved {len(pluto_clean):,} property records to: {output_path}")
print(f"✓ Columns: {pluto_clean.columns.tolist()}")
print("\nPLUTO data cleaning complete!")

✓ Saved 857,025 property records to: ../data/processed/pluto_nyc_clean.csv
✓ Columns: ['borough', 'address', 'landuse', 'bldgclass', 'yearbuilt', 'assesstot', 'latitude', 'longitude']

PLUTO data cleaning complete!
