# 🏙️ NYC Building Footprint Analysis

<img src="NYC_Building.webp" width="900"/>

# 📚 Importing Required Libraries
In this section, we import all necessary libraries for data analysis, visualization, geospatial processing, and automated reporting.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Spatial tools
import geopandas as gpd
from shapely import wkt
import folium

from ydata_profiling import ProfileReport

# 📥 Loading the Dataset
We load the NYC building footprint dataset and perform a quick overview to understand its structure.


In [None]:
df = pd.read_csv("C:/Users/Me/Downloads/building-footprints-pluto.csv")

# Convert 'the_geom' column to geometry
df['geometry'] = df['the_geom'].apply(wkt.loads)
gdf = gpd.GeoDataFrame(df, geometry='geometry')

df['geometry'] = df['geometry'].apply(lambda x: x.wkt if x else None)


# 🔍 Exploratory Data Analysis (EDA)
We use `ydata_profiling` to automatically generate a comprehensive report that includes:

- Dataset overview
- Missing values
- Duplicates
- Data types
- Univariate and multivariate analysis
- Correlations
- Warnings and data quality checks


In [None]:
profile = ProfileReport(df, title="NYC Building Footprint Report", explorative=True)
profile.to_notebook_iframe()

# 🧹 Data Cleaning
Based on the EDA report, we clean the dataset by dropping uninformative columns, handling missing values, and correcting data types.



### 🪓 1- Dropping Unnecessary Columns
We remove columns that are non-informative or constant.



In [None]:
df.drop(columns=['NAME'], inplace=True)

In [None]:
df.drop(columns=['SHAPE_AREA'], inplace=True)

In [None]:
df.drop(columns=['SHAPE_LEN'], inplace=True)

In [None]:
print(df.columns)
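The columns above were chosen by reading the profiling report. As a rough cross-check, constant columns can also be found programmatically. The sketch below is illustrative only: `find_constant_columns` is a hypothetical helper, and the toy frame (including the `SOME_FLAG` column) stands in for the real dataset.

```python
import pandas as pd

def find_constant_columns(frame: pd.DataFrame) -> list:
    """Return columns holding at most one distinct value (NaN counts as a value)."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]

# Hypothetical toy data, not the real building footprint file
sample = pd.DataFrame({
    "NAME": [None, None, None],            # entirely missing
    "HEIGHTROOF": [25.0, 40.5, 31.2],      # varies, so it is kept
    "SOME_FLAG": [1, 1, 1],                # same value in every row
})

constant_cols = find_constant_columns(sample)
print(constant_cols)
```

A column flagged this way carries no information for the charts or maps that follow, so dropping it is usually safe.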

### 🧩 2- Handling Missing Values
- Detect missing values in each column.
- Choose a filling or dropping strategy per column (mean, median, mode, or 0).

In [None]:
df['LSTSTATYPE'] = df['LSTSTATYPE'].fillna(df['LSTSTATYPE'].mode()[0])

In [None]:
df[['LSTSTATYPE']].isnull().sum()

In [None]:
median_HIGH = df['HEIGHTROOF'].median()
#print(median_HIGH)
df['HEIGHTROOF'] = df['HEIGHTROOF'].fillna(median_HIGH)

In [None]:
df[['HEIGHTROOF']].isnull().sum()

In [None]:
Mean_GND = df['GROUNDELEV'].mean()
df['GROUNDELEV'] = df['GROUNDELEV'].fillna(Mean_GND)

In [None]:
df[['GROUNDELEV']].isnull().sum()

In [None]:
df['GEOMSOURCE'] = df['GEOMSOURCE'].fillna('Other')

In [None]:
df[['GEOMSOURCE']].isnull().sum()

In [None]:
df['zipcode'] = df['zipcode'].fillna(df['zipcode'].mode()[0])

In [None]:
df[['zipcode']].isnull().sum()

In [None]:
df['bldgclass'] = df['bldgclass'].fillna(df['bldgclass'].mode()[0])

In [None]:
df[['bldgclass']].isnull().sum()

In [None]:
Mean_Xcord = df['xcoord'].mean()
df['xcoord'] = df['xcoord'].fillna(Mean_Xcord)

In [None]:
Mean_ycord = df['ycoord'].mean()
df['ycoord'] = df['ycoord'].fillna(Mean_ycord)

In [None]:
Mean_latitude = df['latitude'].mean()
df['latitude'] = df['latitude'].fillna(Mean_latitude)

In [None]:
Mean_longitude = df['longitude'].mean()
df['longitude'] = df['longitude'].fillna(Mean_longitude)

In [None]:
df.isnull().sum()
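The per-column cells above all follow the same pattern, so they could be collapsed into one helper. The sketch below is one possible way to do that, not the notebook's actual method: `fill_missing` and the toy frame are hypothetical, and the strategy names are an assumption of this example.

```python
import pandas as pd

def fill_missing(frame, strategies):
    """Fill NaNs per column. `strategies` maps column -> 'mean' | 'median' | 'mode' | literal value."""
    out = frame.copy()
    for col, how in strategies.items():
        if how == "mean":
            value = out[col].mean()
        elif how == "median":
            value = out[col].median()
        elif how == "mode":
            value = out[col].mode()[0]
        else:
            value = how  # anything else is used as a literal fill value
        out[col] = out[col].fillna(value)
    return out

# Hypothetical toy data using a few of the notebook's column names
sample = pd.DataFrame({
    "HEIGHTROOF": [10.0, None, 30.0, 100.0],
    "GROUNDELEV": [5.0, 15.0, None, 10.0],
    "bldgclass": ["A1", "A1", None, "B2"],
    "GEOMSOURCE": ["Photo", None, "Photo", "Photo"],
})

filled = fill_missing(sample, {
    "HEIGHTROOF": "median",
    "GROUNDELEV": "mean",
    "bldgclass": "mode",
    "GEOMSOURCE": "Other",
})
print(filled.isnull().sum())
```

One caveat worth keeping in mind: filling `latitude`, `longitude`, `xcoord`, or `ycoord` with the mean places every such building at the dataset's average location, which can create an artificial hotspot on the maps later on.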

### 🔄 3- Data Type Conversion

In [None]:
df.dtypes

In [None]:
df['LSTMODDATE'] = pd.to_datetime(df['LSTMODDATE'], format='%d/%m/%Y %I:%M:%S %p')
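With an explicit `format`, any string that does not match raises an error, and a day/month mix-up is an easy mistake to make. A cautious way to test the format first is to parse with `errors="coerce"` and count the values that failed. The strings below are hypothetical examples, not rows from the dataset.

```python
import pandas as pd

# Hypothetical raw values: one valid day-first date, one month-first date, one missing
raw = pd.Series(["25/12/2017 12:00:00 AM", "08/22/2017 12:00:00 AM", None])

# Unparseable strings become NaT instead of raising
parsed = pd.to_datetime(raw, format="%d/%m/%Y %I:%M:%S %p", errors="coerce")

# Values that were present but could not be parsed with this format
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(f"{n_failed} value(s) did not match the format")
```

If `n_failed` is large on the real column, the format string probably has day and month in the wrong order.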

In [None]:
df['zipcode'] = df['zipcode'].astype('Int64')

In [None]:
from shapely import wkt

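def safe_wkt_load(val):
    # Parse WKT strings; pass through values that are already geometries or missing
    try:
        return wkt.loads(val) if isinstance(val, str) else val
    except Exception:
        # Unparseable WKT becomes None instead of stopping the conversion
        return None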

df['the_geom'] = df['the_geom'].apply(safe_wkt_load)
df['geometry'] = df['geometry'].apply(safe_wkt_load)


In [None]:
import geopandas as gpd

gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')

In [None]:
gdf.dtypes

# 📊 Visual Questions & Charts


### 🏗️ How are buildings distributed by construction year?
- Explore the historical development trends by analyzing the distribution of construction years (`CNSTRCT_YR`).


In [None]:
%matplotlib inline
plt.figure(figsize=(10,5))
sns.histplot(df['CNSTRCT_YR'], bins=50, kde=True)
plt.title('Distribution of Construction Years')
plt.xlabel('Construction Year')
plt.ylabel('Number of Buildings')
plt.show()
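A histogram with 50 bins can hide which periods dominate. As a complementary view, the years can be grouped into decades and counted. This is a minimal sketch on hypothetical values; on the real data the series would be `df['CNSTRCT_YR']`.

```python
import pandas as pd

# Hypothetical construction years, including one missing value
years = pd.Series([1899, 1920, 1925, 1931, 2004, 2018, 2018, None])

# Floor each year to the start of its decade, e.g. 1925 -> 1920
decades = years.dropna().astype(int) // 10 * 10

# Count buildings per decade in chronological order
decade_counts = decades.value_counts().sort_index()
print(decade_counts)
```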

### 🏢 What is the distribution of building heights?
- Analyze how building heights vary across the city using the `HEIGHTROOF` attribute.


In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data=df, x='HEIGHTROOF', bins=50, kde=True, color='salmon')
plt.title('Distribution of Building Heights')
plt.xlabel('Building Height (feet)')
plt.ylabel('Number of Buildings')
plt.grid(True)
plt.show()
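The shape of the histogram can be backed up with a number. The sketch below, on hypothetical heights, uses pandas' sample skewness and the mean/median comparison as two quick indicators of a right-skewed distribution.

```python
import pandas as pd

# Hypothetical roof heights in feet: many low buildings and one very tall one
heights = pd.Series([12, 15, 18, 20, 22, 25, 30, 35, 60, 400], dtype=float)

skewness = heights.skew()                          # positive value means a longer right tail
mean_above_median = heights.mean() > heights.median()

print(f"skewness: {skewness:.2f}")
print(f"mean > median: {mean_above_median}")
```

On the real data, `df['HEIGHTROOF'].skew()` would give the same kind of summary. Note that the earlier median imputation adds mass at the median, which can slightly change the result.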

### 🏢 What are the most common building types?
- Analyze the most frequent building classes using the `bldgclass` column.

In [None]:
top_classes = df['bldgclass'].value_counts().nlargest(10)
top_classes.plot(kind='bar', title='Top 10 Building Classes', color='skyblue')
plt.xlabel('Building Class')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 🗺️ Distribution of Land Use Categories
- Visualize the proportion or count of buildings by each `landuse` category

In [None]:
landuse_counts = df['landuse'].value_counts()
landuse_counts.plot(kind='pie', autopct='%1.1f%%',textprops={'fontsize': 5} ,figsize=(8,8), title='Land Use Distribution')
plt.ylabel('')
plt.show()
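Pie charts become hard to read when many slices are tiny, which is likely why the label font above is so small. One option is to merge rare categories into a single slice before plotting. The helper `lump_small_categories`, the 3% cutoff, and the toy counts below are all assumptions for illustration.

```python
import pandas as pd

def lump_small_categories(counts, min_share=0.03, other_label="Other"):
    """Merge categories whose share of the total is below `min_share` into one entry."""
    shares = counts / counts.sum()
    kept = counts[shares >= min_share]
    remainder = counts[shares < min_share].sum()
    if remainder > 0:
        kept = pd.concat([kept, pd.Series({other_label: remainder})])
    return kept

# Hypothetical land use counts, not the real distribution
toy_counts = pd.Series({"1": 560, "2": 250, "4": 120, "5": 40, "8": 20, "10": 10})

lumped = lump_small_categories(toy_counts)
print(lumped)
```

The result can be passed to `.plot(kind='pie', ...)` exactly like the original counts.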

### 🏙️ Number of Buildings per Borough
- Display the building count across the five boroughs using the `borough` column

In [None]:
df['borough'].value_counts().plot(kind='barh', color='pink', title='Number of Buildings per Borough')
plt.xlabel('Count')
plt.ylabel('Borough')
plt.show()
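The `borough` column uses two-letter codes, which are terse as axis labels. A small mapping makes the chart easier to read. The sketch below assumes the standard PLUTO borough codes and runs on hypothetical values.

```python
import pandas as pd

# Two-letter borough codes mapped to full names
BOROUGH_NAMES = {
    "MN": "Manhattan",
    "BX": "Bronx",
    "BK": "Brooklyn",
    "QN": "Queens",
    "SI": "Staten Island",
}

# Hypothetical sample of borough codes
codes = pd.Series(["QN", "BK", "QN", "SI", "MN", "BX", "QN"])

# Replace codes with names, then count buildings per borough
named_counts = codes.map(BOROUGH_NAMES).value_counts()
print(named_counts)
```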

# 🌍🗺️ Spatial Analysis

In [None]:
import folium
from folium.plugins import MarkerCluster, HeatMap

map_center = [df['latitude'].mean(), df['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=11)

marker_cluster = MarkerCluster(name="Building Info").add_to(m)

for _, row in df.iterrows():
    if pd.notnull(row['latitude']) and pd.notnull(row['longitude']):
        folium.Marker(
            location=[row['latitude'], row['longitude']],
            popup=f"""
            <b>Borough:</b> {row['borough']}<br>
            <b>Year Built:</b> {row['CNSTRCT_YR']}<br>
            <b>Roof Height:</b> {row['HEIGHTROOF']} ft<br>
            <b>Land Use:</b> {row['landuse']}<br>
            <b>Building Class:</b> {row['bldgclass']}
            """
        ).add_to(marker_cluster)


heat_data = df[['latitude', 'longitude']].dropna().values.tolist()
HeatMap(heat_data, name="Building Density Heatmap").add_to(m)

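# Keep the height markers in their own layer so LayerControl can toggle them
height_layer = folium.FeatureGroup(name="Building Height").add_to(m)

for _, row in df.iterrows():
    if pd.notnull(row['HEIGHTROOF']) and pd.notnull(row['latitude']) and pd.notnull(row['longitude']):
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=3,
            fill=True,
            fill_opacity=0.5,
            # blue: under 50 ft, green: 50 to under 150 ft, red: 150 ft and above
            color='blue' if row['HEIGHTROOF'] < 50 else 'green' if row['HEIGHTROOF'] < 150 else 'red',
            popup=f"Height: {row['HEIGHTROOF']} ft"
        ).add_to(height_layer)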


folium.LayerControl().add_to(m)
m.save("city_buildings_analysis.html")
m
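The nested conditional that picks the marker color is easier to read and test as a small function. The sketch below mirrors the thresholds used in the map cell; the function name `height_color` and the gray fallback for missing heights are assumptions added for this example.

```python
import math

def height_color(height_ft):
    """Map a roof height in feet to the marker color used on the map."""
    # Missing heights get a neutral color (an assumption, not part of the original map)
    if height_ft is None or (isinstance(height_ft, float) and math.isnan(height_ft)):
        return "gray"
    if height_ft < 50:
        return "blue"    # short
    if height_ft < 150:
        return "green"   # medium
    return "red"         # tall: 150 ft and above

for h in (10, 50, 149.9, 150, float("nan")):
    print(h, height_color(h))
```

Inside the loop, `color=height_color(row['HEIGHTROOF'])` would replace the inline expression. Note that a height of exactly 150 ft falls in the red group.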



### 📍 Insights from Interactive Building Map

1. **Building Information Clustering:**
   - Marker Clustering allows for clear visualization of buildings’ attributes (like borough, construction year, roof height, land use, and building class) without overcrowding the map.
   - Clicking on clusters or individual markers provides detailed popup info per building.

2. **Building Density Hotspots:**
   - The heatmap highlights areas with the highest concentration of buildings.
   - These hotspots may indicate zones of urban intensity, potential congestion, or priority regions for city planning.

3. **Building Height Classification by Color:**
   - Buildings are color-coded by height:
     - 🔵 Blue: Short buildings (< 50 ft)
     - 🟢 Green: Medium buildings (50–150 ft)
     - 🔴 Red: Tall buildings (150 ft and above)
   - This offers immediate visual insight into the vertical profile of different city zones.

4. **Data Layer Interactivity:**
   - The map includes layer controls that allow toggling between:
     - Marker Clusters
     - Heatmap
     - Circle height markers
   - Users can interactively explore spatial patterns without overwhelming the view.

5. **Exportable & Shareable Visualization:**
   - The map is saved as `city_buildings_analysis.html`, making it easy to share or embed in dashboards, reports, or presentations.


In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
gdf.plot(ax=ax, color='lightblue', edgecolor='gray', linewidth=0.3)
ax.set_title('Spatial Layout of City Structures', fontsize=16)
ax.set_axis_off()
plt.show()

In [None]:
gdf.plot(column='borough', cmap='Set3', legend=True, figsize=(12, 12), edgecolor='black', linewidth=0.2)
plt.title('Buildings Colored by Borough', fontsize=15)
plt.axis('off')
plt.show()

In [None]:
def classify_age(year):
    # Missing years would otherwise fall through to 'New'
    if pd.isna(year):
        return 'Unknown'
    if year < 1950:
        return 'Old'
    elif year <= 2000:
        return 'Mid'
    else:
        return 'New'

gdf['age_group'] = gdf['CNSTRCT_YR'].apply(classify_age)

gdf.plot(column='age_group', cmap='coolwarm', legend=True, figsize=(12, 12), edgecolor='black', linewidth=0.2)
plt.title('Building Age Clusters', fontsize=15)
plt.axis('off')
plt.show()
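The same age grouping can be expressed without a row-by-row `apply` by using `pd.cut`. This is a sketch on hypothetical years that assumes construction years are whole numbers; with `pd.cut`, missing years stay missing rather than being assigned to a group.

```python
import numpy as np
import pandas as pd

# Hypothetical construction years, including boundary cases and one missing value
years = pd.Series([1899, 1949, 1950, 2000, 2001, np.nan])

# Right-closed bins: (-inf, 1949] -> Old, (1949, 2000] -> Mid, (2000, inf) -> New
age_group = pd.cut(
    years,
    bins=[-np.inf, 1949, 2000, np.inf],
    labels=["Old", "Mid", "New"],
)
print(age_group)
```

The result is an ordered categorical, so the legend on the map follows the Old, Mid, New order instead of alphabetical order.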

In [None]:
gdf.plot(column='borough', cmap='tab20', figsize=(12, 12), legend=True, edgecolor='black', linewidth=0.2)
plt.title('Static Map: Buildings by Borough', fontsize=15)
plt.axis('off')
plt.show()

In [None]:
gdf['Tall_Building'] = gdf['HEIGHTROOF'] > 43


fig, ax = plt.subplots(figsize=(14, 10))
gdf.plot(ax=ax, 
         color=gdf['Tall_Building'].map({True: 'red', False: 'cyan'}), 
         edgecolor='black', linewidth=0.1)

plt.title("Tall Buildings in NYC (Height > 43 ft)", fontsize=16)
plt.axis('off')
plt.show()
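The notebook does not say where the 43 ft cutoff comes from. If a data-driven cutoff is preferred, one option is to define "tall" relative to the distribution, for example as the top 10% of heights. The sketch below uses hypothetical values, and the 90th percentile is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical roof heights in feet
heights = pd.Series([12, 18, 20, 22, 25, 27, 30, 35, 43, 60, 120, 400], dtype=float)

# Height below which 90% of the buildings fall
threshold = heights.quantile(0.90)

# Boolean flag, analogous to the Tall_Building column
is_tall = heights > threshold
print(f"threshold: {threshold:.1f} ft, tall buildings: {int(is_tall.sum())}")
```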

# 💡 Insights

In this section, we summarize the most important findings from the spatial and statistical analysis of the dataset.


#### Key Insights

1. **The year 2018 recorded the highest number of buildings**, indicating a potential peak in construction or data registration activity during that year.

2. **Building class `B2` has the highest count among all classes**, suggesting a predominance of that building type in the dataset.

3. **Land Use category `1` dominates the dataset**, accounting for **56.1%** of all buildings, indicating a major portion of buildings share the same land use type.

4. **The borough `QN` (Queens) has the largest number of buildings**, followed by `SI` (Staten Island) and `BK` (Brooklyn), reflecting spatial distribution patterns across NYC boroughs.

5. **Building heights follow a right-skewed distribution**, as shown in the histogram: most buildings are low- to mid-rise structures, with a long tail of much taller ones.
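Findings such as the dominant building class, land use category, or borough can be recomputed directly from the data rather than read off the charts. The helper `top_value_share` and the toy values below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def top_value_share(series):
    """Return the most frequent value and its share among non-missing entries."""
    counts = series.value_counts(dropna=True)
    return counts.index[0], counts.iloc[0] / counts.sum()

# Hypothetical columns, not the real dataset
landuse = pd.Series(["1", "1", "1", "2", "4", None])
borough = pd.Series(["QN", "QN", "BK", "SI", "QN"])

top_landuse, landuse_share = top_value_share(landuse)
top_borough, borough_share = top_value_share(borough)

print(f"Most common land use: {top_landuse} ({landuse_share:.1%})")
print(f"Most common borough: {top_borough} ({borough_share:.1%})")
```

Because several columns were mode-filled during cleaning, shares computed after imputation will be somewhat higher for the most frequent value than shares computed on the raw data.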
