# Project 1: NYC Air Quality Index Analysis
### Welcome to my first project!

Where do you live in the New York Metropolitan Area — the Upper West Side, Flushing, or maybe Rockaway Beach? Believe it or not, the air you breathe can be quite different depending on where you stand.

In this project, I explore New York City’s air quality data from NYC Open Data, focusing on nitrogen dioxide (NO₂), one of the key pollutants that shape the city’s air health. Using Python, I clean and visualize the dataset to uncover how NO₂ levels change across neighborhoods, through seasons, and over the years.

You’ll see how places like Flushing and the Upper West Side compare, whether winter air is really worse than summer’s, and how NYC’s air quality has evolved over the past decade.

In [133]:
import pandas as pd

Here’s where we get picky with our data. First, we import the dataset and record the name of the numeric column we actually care about (`Data Value`). Then we do a little data “housecleaning”: we keep only the rows that report the 2023 annual average for nitrogen dioxide (NO₂), and only for the UHF42 areas (the United Hospital Fund scheme that divides NYC into 42 neighborhoods). Basically, we’re trimming out all the extra stuff so we can zoom in on the data that really matters.

In [134]:
# To import the dataset, specify the filename and the target numerical column
df = pd.read_csv('Air_Quality.csv')
target_column = 'Data Value'

# Filter the dataset to keep only the rows
# where 'Name' is 'Nitrogen dioxide (NO2)'
# and 'Time Period' is 'Annual Average 2023'
# and 'Geo Type Name' is 'UHF42', which divides NYC into 42 distinct neighborhoods
filtered_df = df[(df['Name'] == 'Nitrogen dioxide (NO2)') & 
                 (df['Time Period'] == 'Annual Average 2023') &
                 (df['Geo Type Name'] == 'UHF42')]
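
One thing to watch with exact string matches like these: a misspelled label doesn’t raise an error, it just gives you an empty table. Here’s a small sketch of the same three-condition filter on a hand-made table, followed by a check that something survived. The rows and values in `sample` are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used above; the values are made up.
sample = pd.DataFrame({
    "Name": ["Nitrogen dioxide (NO2)", "Ozone (O3)",
             "Nitrogen dioxide (NO2)", "Nitrogen dioxide (NO2)"],
    "Time Period": ["Annual Average 2023", "Summer 2023",
                    "Annual Average 2022", "Annual Average 2023"],
    "Geo Type Name": ["UHF42", "UHF42", "UHF42", "Borough"],
    "Data Value": [18.4, 30.1, 19.0, 17.2],
})

# Each comparison yields a boolean Series; & keeps rows where all three are True.
mask = (
    (sample["Name"] == "Nitrogen dioxide (NO2)")
    & (sample["Time Period"] == "Annual Average 2023")
    & (sample["Geo Type Name"] == "UHF42")
)
sample_filtered = sample[mask]

# Guard against a silent empty result caused by a misspelled label.
if sample_filtered.empty:
    raise ValueError("Filter matched no rows; check the label spellings.")

print(len(sample_filtered), "row(s) kept")
```

Only the first row passes all three conditions, so the sketch keeps one row.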

## Computing Mean, Median, and Mode
### Pandas Approach
Now that we’ve got our cleaned-up dataset, it’s time to get a quick “data vibe check.” Using pandas, we calculate the mean, median, and mode of our target column to see what the typical air quality values look like. These three stats give us a snapshot of the overall trend — the average level, the middle point, and the most common value. It’s a simple but powerful way to get a feel for the data before diving into visualizations.

In [135]:
# Calculating mean, median, and mode using pandas
mean_value = filtered_df[target_column].mean()
median_value = filtered_df[target_column].median()
mode_pandas = filtered_df[target_column].mode()[0]

print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value:.2f}")
print(f"Mode (Pandas): {mode_pandas:.2f}")

Mean: 16.79
Median: 16.37
Mode (Pandas): 18.41
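
A detail worth knowing about the mode line: `Series.mode()` returns every value tied for the highest count, sorted in ascending order, so `[0]` picks the smallest of them. A minimal sketch with made-up numbers (not values from the dataset):

```python
import pandas as pd

# Made-up values: 16.0 and 18.41 each appear twice, 20.0 appears once.
values = pd.Series([18.41, 16.0, 20.0, 18.41, 16.0])

all_modes = values.mode()    # Series holding every tied mode, sorted ascending
first_mode = all_modes[0]    # the value that mode()[0] would report

print(all_modes.tolist())
print(first_mode)
```

If ties matter for your analysis, it may be safer to look at the whole `mode()` result rather than just its first entry.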


### Hard Way Approach
Here, we’re getting hands-on and calculating the mean, median, and mode from scratch — no shortcuts with pandas this time!

First, we pull the numbers we need from our filtered dataset and store them in a list (dropping any missing values along the way). Then:

- Mean: We add everything up and divide by the number of data points, which gives the good old average.
- Median: We sort the list and find the middle value. If there’s an even number of values, we take the average of the two middle ones.
- Mode: We count how often each value appears and pick the one(s) that show up the most. If every value is unique, we note that there’s no mode.

In [136]:
# Extract the data from filtered_df into a list
data = filtered_df[target_column].dropna().tolist()

# Calculate Mean (average)
def calculate_mean(data):
    # Sum all values and divide by count
    total = sum(data)
    count = len(data)
    return total / count

mean_value = calculate_mean(data)
print(f"Mean: {mean_value:.2f}")


# Calculate Median (middle value)
def calculate_median(data):
    # Sort the data first
    sorted_data = sorted(data)
    n = len(sorted_data)
    
    # If odd number of elements, return middle one
    if n % 2 == 1:
        return sorted_data[n // 2]
    # If even number of elements, return average of two middle ones
    else:
        mid1 = sorted_data[n // 2 - 1]
        mid2 = sorted_data[n // 2]
        return (mid1 + mid2) / 2

median_value = calculate_median(data)
print(f"Median: {median_value:.2f}")


# Calculate Mode (most frequent value)
def calculate_mode(data):
    # Count frequency of each value
    frequency = {}
    for value in data:
        if value in frequency:
            frequency[value] += 1
        else:
            frequency[value] = 1
    
    # Find maximum frequency
    if not frequency:
        return None
    
    max_freq = max(frequency.values())
    
    # Find all values with maximum frequency
    modes = [key for key, freq in frequency.items() if freq == max_freq]
    
    # If all values appear only once, there's no mode
    if max_freq == 1:
        return "No mode (all values appear only once)"
    
    return modes

mode_value = calculate_mode(data)
print(f"Mode (Hard Way): {mode_value}")

Mean: 16.79
Median: 16.37
Mode (Hard Way): [18.41]
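
As a sanity check on hand-written functions like these, Python’s built-in `statistics` module offers the same three measures. The sketch below runs it on a short made-up list (not the real NO₂ values); the variable names are just for illustration.

```python
import statistics

# Made-up sample values, used only to illustrate the cross-check.
sample = [18.41, 16.0, 20.0, 18.41, 15.5, 17.2]

check_mean = statistics.mean(sample)
check_median = statistics.median(sample)     # averages the two middle values when n is even
check_modes = statistics.multimode(sample)   # list of all most-frequent values (Python 3.8+)

print(f"Mean: {check_mean:.2f}")
print(f"Median: {check_median:.2f}")
print(f"Modes: {check_modes}")
```

One difference to keep in mind: when every value is unique, `multimode` returns all of the values, whereas the hand-written `calculate_mode` above returns a “No mode” message.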


## Data Visualization

Now we finally turn our data into something visual. There’s no plotting library here: we “draw” the chart with plain text symbols, while pandas still handles the sorting and the percentile math. In this step, we first grab the neighborhood names and their corresponding NO₂ values from the filtered dataset, clean out any missing entries, and sort them from highest to lowest pollution levels.

Then, we calculate two thresholds, the 33rd and 67th percentiles, which divide all the neighborhoods into three groups:

- High pollution (H): at or above the 67th percentile
- Medium pollution (M): from the 33rd percentile up to (but not including) the 67th
- Low pollution (L): below the 33rd percentile
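
The notebook gets these thresholds from pandas’ `quantile()`. To see what happens under the hood, here’s a minimal sketch in plain Python, assuming linear interpolation between neighboring sorted values (which is pandas’ default). The helpers `quantile_linear` and `classify` and the `readings` list are made up for illustration.

```python
import math

def quantile_linear(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q          # fractional index into the sorted list
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)   # stay inside the list when q == 1
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def classify(value, low_cut, high_cut):
    """Label a value H, M, or L using the same >= comparisons as the groups above."""
    if value >= high_cut:
        return "H"
    if value >= low_cut:
        return "M"
    return "L"

# Made-up readings, not values from the dataset.
readings = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
low_cut = quantile_linear(readings, 0.33)
high_cut = quantile_linear(readings, 0.67)
labels = [classify(v, low_cut, high_cut) for v in readings]

print(f"33rd percentile: {low_cut:.2f}, 67th percentile: {high_cut:.2f}")
print(labels)
```

With these seven readings the cut points land at about 13.96 and 18.04, so the two lowest readings are labeled L, the next three M, and the top two H.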

Next comes the fun part: for each neighborhood, we print one text-based bar right in the terminal. The bar’s length is proportional to the NO₂ value, with the highest value in the list getting the full 30 characters, and we use a different symbol for each pollution level:

- High pollution: `#`
- Medium pollution: `=`
- Low pollution: `-`

This gives us a quick, visual way to compare air quality across NYC neighborhoods. Finally, we summarize how many neighborhoods fall into each category, giving a neat overview of the city’s 2023 NO₂ levels.
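
The bar scaling boils down to one proportion. Here’s a minimal standalone sketch of that idea; `render_bar`, the area names, and the values are hypothetical and only there to show the arithmetic.

```python
def render_bar(value, max_value, width=30, symbol="#"):
    """Return a run of `symbol` whose length is proportional to value / max_value."""
    if max_value <= 0:
        return ""  # avoid dividing by zero when there is nothing to scale against
    length = int(value / max_value * width)  # int() truncates, so 22.5 becomes 22
    return symbol * length

# Made-up (name, value, symbol) rows, sorted from highest to lowest.
rows = [("Area A", 24.0, "#"), ("Area B", 18.0, "="), ("Area C", 12.0, "-")]
max_value = max(value for _, value, _ in rows)

lines = [
    f"{name.ljust(8)} |{render_bar(value, max_value, 30, symbol)} {value:.2f}"
    for name, value, symbol in rows
]
print("\n".join(lines))
```

Because the length is truncated rather than rounded, two neighborhoods with slightly different values can end up with bars of the same length; the number printed after each bar is what separates them.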

In [137]:
# Get the filtered data with place names
place_data = filtered_df[['Geo Place Name', target_column]].dropna()
place_data = place_data.sort_values(by=target_column, ascending=False)

# Calculate thresholds for grouping
q33 = place_data[target_column].quantile(0.33)
q67 = place_data[target_column].quantile(0.67)

print("\n" + "="*65)
print(" NYC NEIGHBORHOODS - NO2 POLLUTION LEVELS (2023 Annual Avg)")
print("="*65)
print(f" HIGH >= {q67:.2f}  |  MEDIUM {q33:.2f}-{q67:.2f}  |  LOW < {q33:.2f}")
print("-"*65)

max_val = place_data[target_column].max()
scale = 30 / max_val  # Scale bars to max 30 characters for narrow screens

for _, row in place_data.iterrows():  # the index is not needed, only the row
    place_name = row['Geo Place Name']
    value = row[target_column]
    bar_length = int(value * scale)
    
    # Determine pollution level and use different symbols
    if value >= q67:
        bar = '#' * bar_length      # High: hash marks
        level = 'H'
    elif value >= q33:
        bar = '=' * bar_length      # Medium: equals
        level = 'M'
    else:
        bar = '-' * bar_length      # Low: dashes
        level = 'L'
    
    # Truncate long names to fit narrow screen
    display_name = place_name[:22].ljust(22)
    print(f"[{level}] {display_name} |{bar} {value:.2f}")

# Summary statistics
print("-"*65)
high_count = len(place_data[place_data[target_column] >= q67])
medium_count = len(place_data[(place_data[target_column] >= q33) & 
                               (place_data[target_column] < q67)])
low_count = len(place_data[place_data[target_column] < q33])

print(f" SUMMARY: [H] High: {high_count}  |  [M] Medium: {medium_count}  |  [L] Low: {low_count}")
print(f" TOTAL NEIGHBORHOODS: {len(place_data)}")
print("="*65)


 NYC NEIGHBORHOODS - NO2 POLLUTION LEVELS (2023 Annual Avg)
 HIGH >= 17.43  |  MEDIUM 15.50-17.43  |  LOW < 15.50
-----------------------------------------------------------------
[H] Chelsea - Clinton      |############################## 24.13
[H] Gramercy Park - Murray |############################# 23.60
[H] Greenwich Village - So |########################### 22.07
[H] Lower Manhattan        |########################## 21.61
[H] Greenpoint             |######################## 19.32
[H] Sunset Park            |####################### 19.11
[H] Upper East Side        |####################### 18.98
[H] Union Square - Lower E |####################### 18.97
[H] Long Island City - Ast |###################### 18.41
[H] West Queens            |###################### 18.41
[H] Upper West Side        |###################### 18.38
[H] Downtown - Heights - S |###################### 18.29
[H] Williamsburg - Bushwic |###################### 18.16
[H] Hunts Point - Mott Hav |#####################