## Mod 5 Lecture 2 Code-Along:  Feature Engineering & Scaling 

### Goals
* Create `hour_of_day` and `is_weekend` if not already done

* Create `night_weekend_interaction` = `is_weekend * is_night`

* Scale `hour_of_day` and `response_time_hrs` using both techniques (`StandardScaler` & `MinMaxScaler`)

### Data
We are using the same NYC 311 dataset (remember the full data is HUGE, so we extracted just one week). Dataset documentation is available [HERE](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/about_data).

In [None]:
# Read in data nyc311.csv 
import pandas as pd

In [4]:
df = pd.read_csv('/Users/ayemaq/Desktop/marcy_lab/DA2025_Lectures/Mod5/DataChallenges/data/nyc311.csv')

In [5]:
#Run this cell without changes!  You've done this in the previous Data Challenge 

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler


LOCAL_TZ = "America/New_York"

def to_utc(series, local_tz=LOCAL_TZ):
    """
    Idempotent conversion:
      1) Parse to datetime.
      2) If naive -> localize to local_tz (handle DST).
      3) Convert to UTC.
    Safe to re-run without raising 'Already tz-aware' errors.
    """
    s = pd.to_datetime(series, errors="coerce")

    # if tz-naive, localize; if tz-aware, leave as-is
    if s.dt.tz is None:
        s = s.dt.tz_localize(local_tz, nonexistent="shift_forward", ambiguous="NaT")

    return s.dt.tz_convert("UTC")

# --- Apply to your DataFrame (df) ---
# Ensure the columns exist; adjust names if your file uses different headers
required_cols = ["Created Date", "Closed Date"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

# Optionally drop rows that lack either timestamp before conversion
df = df.dropna(subset=["Created Date", "Closed Date"]).copy()

df["Created Date"] = to_utc(df["Created Date"])
df["Closed Date"]  = to_utc(df["Closed Date"])

# Compute response time in hours
delta = df["Closed Date"] - df["Created Date"]
df["response_time_hrs"] = delta.dt.total_seconds() / 3600

# Drop any rows that became NaT due to ambiguous DST cases
df = df.dropna(subset=["Created Date", "Closed Date"])

### Task 1:  Create Features 

Extract the hour and create a variable for weekends (we did this previously!). We will define “night” as any time from midnight to 6am.

In [6]:
df.head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location,response_time_hrs
7,66176906,2025-09-17 05:49:53+00:00,2025-09-17 07:00:34+00:00,DHS,Department of Homeless Services,Homeless Person Assistance,Non-Chronic,Store/Commercial,11385.0,55-25 MYRTLE AVENUE,...,,,,,,,40.699957,-73.907722,"(40.69995662802054, -73.90772175534917)",1.178056
37,66170659,2025-09-17 05:33:27+00:00,2025-09-17 05:42:59+00:00,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,11419.0,111-01 101 AVENUE,...,,,,,,,40.688089,-73.832615,"(40.68808859339035, -73.8326146957164)",0.158889
43,66175356,2025-09-17 05:29:57+00:00,2025-09-17 05:46:21+00:00,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10009.0,103 AVENUE B,...,,,,,,,40.724829,-73.981377,"(40.7248294637277, -73.9813765585239)",0.273333
47,66173993,2025-09-17 05:27:57+00:00,2025-09-17 05:29:09+00:00,NYPD,New York City Police Department,Noise - Commercial,Loud Talking,Club/Bar/Restaurant,11237.0,44 WILSON AVENUE,...,,,,,,,40.702777,-73.929386,"(40.70277711007218, -73.92938632693719)",0.02
50,66178986,2025-09-17 05:26:39+00:00,2025-09-17 05:28:51+00:00,NYPD,New York City Police Department,Non-Emergency Police Matter,Other (complaint details),Street/Sidewalk,11374.0,98-30 67 AVENUE,...,,,,,,,40.725022,-73.855031,"(40.72502241781684, -73.8550310677977)",0.036667


In [None]:
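# Create base features - dt.hour, dt.dayofweek
# Note: Created Date was converted to UTC in the setup cell, so these are UTC hours and weekdays
df['hour_of_day'] = df['Created Date'].dt.hour
df['is_weekend'] = df['Created Date'].dt.dayofweek >= 5  # Saturday = 5, Sunday = 6
df['is_night'] = df['hour_of_day'].isin([0,1,2,3,4,5])  # hours 0-5, i.e. midnight up to (not including) 6am
df['is_night_demo'] = df['hour_of_day'].between(0,5)  # same result; between() is inclusive on both ends, like SQL BETWEEN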
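
Because `Created Date` was converted to UTC in the setup cell, `.dt.hour` returns the UTC hour rather than the New York clock hour. If "night" is meant in local time, one option is to convert back with `tz_convert` before extracting the hour. The sketch below is only an illustration: the first timestamp is copied from the first row of `df.head()`, the second is a made-up value, and the names `created_utc`, `created_local`, and `is_night_local` are hypothetical.

```python
import pandas as pd

# First value comes from the df.head() output; second is a hypothetical afternoon timestamp
created_utc = pd.Series(
    pd.to_datetime(["2025-09-17 05:49:53", "2025-09-17 16:30:00"], utc=True)
)

# Convert to New York time before pulling out the hour
created_local = created_utc.dt.tz_convert("America/New_York")

hour_utc = created_utc.dt.hour
hour_local = created_local.dt.hour
is_night_local = hour_local.between(0, 5)  # inclusive: hours 0 through 5

print(pd.DataFrame({
    "hour_utc": hour_utc,
    "hour_local": hour_local,
    "is_night_local": is_night_local,
}))
```

In this toy case the first request falls at hour 5 in UTC but hour 1 in New York, so both versions flag it as night. Requests between 8pm and midnight local time would be flagged differently by the two versions.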

### Task 2:  Create Interaction Term 

Create the `night_weekend_interaction` feature 

In [8]:
# Requires the is_night and is_weekend fields created in Task 1
df['night_weekend_interaction'] = df['is_weekend'].astype(int) * df['is_night'].astype(int)

#Look at the data -- so many columns in the data so only showing the ones we need 
df[['hour_of_day', 'is_weekend', 'is_night', 'night_weekend_interaction']].head()

Unnamed: 0,hour_of_day,is_weekend,is_night,night_weekend_interaction
7,5,False,True,0
37,5,False,True,0
43,5,False,True,0
47,5,False,True,0
50,5,False,True,0
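
Every row shown above has an interaction value of 0 because none of them falls on a weekend. To see all four combinations, here is a small sketch on a made-up DataFrame (`toy` is hypothetical, not part of the 311 data). It also shows that multiplying two 0/1 flags gives the same result as a logical AND.

```python
import pandas as pd

# Hypothetical rows covering every weekend/night combination
toy = pd.DataFrame({
    "is_weekend": [False, False, True, True],
    "is_night":   [False, True, False, True],
})

# Product of the 0/1 flags: 1 only when both flags are True
toy["product"] = toy["is_weekend"].astype(int) * toy["is_night"].astype(int)

# Equivalent formulation using a boolean AND
toy["logical_and"] = (toy["is_weekend"] & toy["is_night"]).astype(int)

print(toy)
```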


### Task 3:  Scale Data 
* Use sklearn's StandardScaler object to scale hours and response time 
* Use sklearn's MinMaxScaler object to scale hours and response time 

**Note:  You will scale data before modeling in Mod 6; however, it will look slightly different because you will only scale a subset of the data (which we call "training data") vs. the whole dataset like we do here.  This is an important note!** 
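
As a rough preview of that idea, the sketch below fits the scaler on a "training" slice only and then reuses the learned mean and standard deviation on the remaining rows. The numbers and the manual split are assumptions made for illustration; Mod 6 may split the data differently.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical response times in hours; the last two rows stand in for unseen "test" data
X = np.array([[1.0], [2.0], [3.0], [4.0], [10.0], [20.0]])
X_train, X_test = X[:4], X[4:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_scaled = scaler.transform(X_test)        # apply the same mean/std, no refitting

print("training mean:", scaler.mean_[0])
print("scaled test rows:", X_test_scaled.ravel())
```

The key design choice is calling `transform` (not `fit_transform`) on the test rows, so information from unseen data never influences the scaling.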

In [11]:
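# Drop rows with missing values in the columns we are about to scale
df = df.dropna(subset=["response_time_hrs", "hour_of_day"])

scaler = StandardScaler()
# fit_transform refits the scaler on every call, so each column uses its own mean and std
df['hour_scaled'] = scaler.fit_transform(df[['hour_of_day']])
df['resp_scaled'] = scaler.fit_transform(df[['response_time_hrs']])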

In [12]:
#Run this cell without changes -- do you see the difference in the scaled column? 

df[['resp_scaled', 'response_time_hrs']]

Unnamed: 0,resp_scaled,response_time_hrs
7,-0.381873,1.178056
37,-0.438908,0.158889
43,-0.432504,0.273333
47,-0.446681,0.020000
50,-0.445748,0.036667
...,...,...
55810,2.637687,55.134444
55811,2.637687,55.134444
55812,2.637687,55.134444
55813,2.637687,55.134444
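
To make the `resp_scaled` values less mysterious: `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`, whereas pandas `.std()` defaults to `ddof=1`). The sketch below checks this on made-up numbers; `toy` and `manual_z` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical response times in hours, including one large value
toy = pd.DataFrame({"response_time_hrs": [0.5, 1.0, 2.0, 4.0, 55.0]})
col = toy["response_time_hrs"]

# Manual z-score: (x - mean) / population standard deviation
toy["manual_z"] = (col - col.mean()) / col.std(ddof=0)

# Same calculation via sklearn; ravel() flattens the (n, 1) result to 1-D
toy["resp_scaled"] = StandardScaler().fit_transform(toy[["response_time_hrs"]]).ravel()

print(toy)
print("match:", np.allclose(toy["manual_z"], toy["resp_scaled"]))
```

After scaling, the column has mean 0 and standard deviation 1, so a value like 2.6 reads as "about 2.6 standard deviations above the average response time".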


In [13]:
minmax = MinMaxScaler()
# Min-max scaling maps each column's minimum to 0 and its maximum to 1
df["hour_mm"] = minmax.fit_transform(df[["hour_of_day"]])
df["resp_mm"] = minmax.fit_transform(df[['response_time_hrs']])

In [14]:
#Run this cell without changes -- do you see the difference in all 3 of the scaled columns? 

df[['resp_mm', 'resp_scaled', 'response_time_hrs']]

Unnamed: 0,resp_mm,resp_scaled,response_time_hrs
7,0.434466,-0.381873,1.178056
37,0.429910,-0.438908,0.158889
43,0.430421,-0.432504,0.273333
47,0.429289,-0.446681,0.020000
50,0.429363,-0.445748,0.036667
...,...,...,...
55810,0.675696,2.637687,55.134444
55811,0.675696,2.637687,55.134444
55812,0.675696,2.637687,55.134444
55813,0.675696,2.637687,55.134444
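
`MinMaxScaler` rescales a column with `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. In the output above, response times close to 0 hours map to roughly 0.43 rather than 0, which suggests the column's minimum is well below zero (for example, requests whose closed date precedes the created date). The sketch below reproduces that effect on made-up numbers; `toy` and `manual_mm` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical response times, including one negative value
toy = pd.DataFrame({"response_time_hrs": [-10.0, 0.0, 5.0, 30.0]})
col = toy["response_time_hrs"]

# Manual min-max: (x - min) / (max - min)
toy["manual_mm"] = (col - col.min()) / (col.max() - col.min())

# Same calculation via sklearn; ravel() flattens the (n, 1) result to 1-D
toy["resp_mm"] = MinMaxScaler().fit_transform(toy[["response_time_hrs"]]).ravel()

print(toy)
```

Here a response time of 0 hours lands at 0.25 because the negative minimum anchors the bottom of the scale.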


Q1: Creating an interaction term (`is_weekend` times `is_night`) is better than using these features only separately because it allows a deeper analysis. Each field is valuable by itself: you can compare weekend complaints with weekday complaints, or night complaints with daytime complaints.

Together, they let you isolate the complaints and response times that occur on weekend nights, for example by using groupby on the interaction term.

For example, a complaint at 1 am is different from one at 1 pm. Combining the weekend and night flags helps you better understand the types of complaints and the response times when both conditions hold.
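
The groupby comparison mentioned in the answer above can be sketched as follows. The data is made up for illustration (`toy` and `summary` are hypothetical names), so the numbers say nothing about the real 311 requests.

```python
import pandas as pd

# Hypothetical requests with weekend/night flags and response times in hours
toy = pd.DataFrame({
    "is_weekend": [False, False, True, True, True, False],
    "is_night":   [False, True, False, True, True, False],
    "response_time_hrs": [2.0, 1.0, 3.0, 0.5, 1.5, 4.0],
})

# Same interaction term as in Task 2
toy["night_weekend_interaction"] = (
    toy["is_weekend"].astype(int) * toy["is_night"].astype(int)
)

# Compare request counts and average response time: weekend nights (1) vs. everything else (0)
summary = toy.groupby("night_weekend_interaction")["response_time_hrs"].agg(["count", "mean"])
print(summary)
```

On the real data, the same pattern with `df` in place of `toy` would show whether weekend-night requests are resolved faster or slower than the rest.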