# Q3: Data Wrangling

**Phase 4:** Data Wrangling & Transformation  
**Points:** 9

**Focus:** Parse datetime columns, set datetime index, extract time-based features.

**Lecture Reference:** Lecture 11, Notebook 2 ([`11/demo/02_wrangling_feature_engineering.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/02_wrangling_feature_engineering.ipynb)), Phase 4. Also see Lecture 09 (time series).

---

## Setup

In [19]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load cleaned data from Q2
df = pd.read_csv('output/q2_cleaned_data.csv')
print(f"Loaded {len(df):,} cleaned records")
print(df.dtypes) # Check data types

Loaded 78,177 cleaned records
Station Name                    object
Measurement Timestamp           object
Air Temperature                float64
Wet Bulb Temperature           float64
Humidity                         int64
Rain Intensity                 float64
Interval Rain                  float64
Total Rain                     float64
Precipitation Type             float64
Wind Direction                   int64
Wind Speed                     float64
Maximum Wind Speed             float64
Barometric Pressure            float64
Solar Radiation                  int64
Heading                        float64
Battery Life                   float64
Measurement Timestamp Label     object
Measurement ID                  object
dtype: object


---

## Objective

Parse datetime columns, set datetime index, and extract temporal features for time series analysis.

**Time Series Note:** This dataset is time-series data (sensor readings over time), unlike the lecture's event-based taxi data. You'll work with a datetime index and extract temporal features (hour, day_of_week, month) that are essential for time series analysis. See **Lecture 09** for time series operations. Use pandas datetime index properties (`.hour`, `.dayofweek`, `.month`, etc.) to extract temporal features from your datetime index.
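
As a minimal sketch of the properties named above, the snippet below builds a small made-up series (not the assignment data) and reads features from its `DatetimeIndex`. The `reading` and `timestamp` names and all values are illustrative assumptions. On a datetime *column* rather than an index, the same properties are reached through the `.dt` accessor.

```python
import pandas as pd

# Toy data: three made-up timestamps standing in for sensor readings
idx = pd.to_datetime(
    ["2022-01-01 00:00:00", "2022-01-03 13:00:00", "2022-07-10 23:00:00"]
)
toy = pd.DataFrame({"reading": [1.0, 2.0, 3.0]}, index=idx)

# Properties on a DatetimeIndex are accessed directly on the index
features = pd.DataFrame(index=toy.index)
features["hour"] = toy.index.hour
features["day_of_week"] = toy.index.dayofweek  # 0=Monday ... 6=Sunday
features["month"] = toy.index.month

# On a column (a Series), the same properties live under the .dt accessor
as_column = toy.reset_index().rename(columns={"index": "timestamp"})
hour_from_column = as_column["timestamp"].dt.hour

print(features)
```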

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q3_wrangled_data.csv`
**Format:** CSV file
**Content:** Dataset with datetime index set
**Requirements:**
- Datetime column parsed using `pd.to_datetime()`
- Datetime column set as index using `df.set_index()`
- Index sorted chronologically using `df.sort_index()`
- **When saving:** Reset index to save datetime as column: `df.reset_index().to_csv(..., index=False)`
- All original columns preserved
- **No extra index column** (save with `index=False`)
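
The requirements above can be sketched on a two-row toy CSV (made-up values, held in memory so no file is written). The point of the sketch is the save step: `reset_index()` turns the datetime index back into a column, and `index=False` keeps the new integer index out of the file.

```python
import io
import pandas as pd

# Toy CSV text standing in for the cleaned Q2 file (values are made up)
raw = io.StringIO(
    "Measurement Timestamp,Air Temperature\n"
    "2022-01-02 01:00:00,3.5\n"
    "2022-01-01 00:00:00,2.0\n"
)
toy = pd.read_csv(raw)

# Parse, set as index, and sort chronologically
toy["Measurement Timestamp"] = pd.to_datetime(toy["Measurement Timestamp"])
toy = toy.set_index("Measurement Timestamp").sort_index()

# With no path argument, to_csv() returns the CSV text instead of writing a file
saved = toy.reset_index().to_csv(index=False)
header = saved.splitlines()[0]
print(header)
```

If `index=False` were omitted after `reset_index()`, the file would gain an unnamed leading column holding 0, 1, 2, and so on.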

### 2. `output/q3_temporal_features.csv`
**Format:** CSV file
**Required Columns (exact names):** Must include at minimum:
- Original datetime column (e.g., `Measurement Timestamp` or `datetime`)
- `hour` (integer, 0-23)
- `day_of_week` (integer, 0=Monday, 6=Sunday)
- `month` (integer, 1-12)

**Optional but recommended:**
- `year` (integer)
- `day_name` (string, e.g., "Monday")
- `is_weekend` (integer, 0 or 1)

**Content:** DataFrame with datetime column and extracted temporal features
**Requirements:**
- At minimum: datetime column, `hour`, `day_of_week`, `month`
- All values must be valid (no NaN in required columns)
- **No index column** (save with `index=False`)

**Example columns:**
```csv
Measurement Timestamp,hour,day_of_week,month,year,day_name,is_weekend
2022-01-01 00:00:00,0,5,1,2022,Saturday,1
2022-01-01 01:00:00,1,5,1,2022,Saturday,1
...
```
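
One way to self-check the file before submitting is a small validation helper. The function below is a hypothetical sketch, not part of the assignment or any grader; it only tests the three required columns for presence, missing values, and the ranges listed above.

```python
import pandas as pd

def check_temporal_features(features):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    required = {"hour": (0, 23), "day_of_week": (0, 6), "month": (1, 12)}
    for col, (low, high) in required.items():
        if col not in features.columns:
            problems.append(f"missing column: {col}")
            continue
        if features[col].isna().any():
            problems.append(f"NaN values in: {col}")
        elif not features[col].between(low, high).all():
            problems.append(f"out-of-range values in: {col}")
    return problems

# Toy rows copied from the example columns above
example = pd.DataFrame({
    "Measurement Timestamp": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 01:00:00"]
    ),
    "hour": [0, 1],
    "day_of_week": [5, 5],
    "month": [1, 1],
})
print(check_temporal_features(example))
```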

### 3. `output/q3_datetime_info.txt`
**Format:** Plain text file
**Content:** Date range information after datetime parsing
**Required information:**
- Start date (earliest datetime)
- End date (latest datetime)
- Total duration (optional but recommended)

**Example format:**
```
Date Range After Datetime Parsing:
Start: 2022-01-01 00:00:00
End: 2027-09-15 07:00:00
Total Duration: 5 years, 8 months, 14 days, 7 hours
```
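
A calendar-aware duration like the one in the example can be computed with `dateutil.relativedelta`, which accounts for varying month lengths and leap years. This is a sketch under the assumption that `dateutil` is available (it is installed as a pandas dependency); the start and end values are the ones from the example format, not from the dataset.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Start and end taken from the example format above
start = datetime(2022, 1, 1, 0, 0, 0)
end = datetime(2027, 9, 15, 7, 0, 0)

# relativedelta(end, start) breaks the gap into calendar units
delta = relativedelta(end, start)
duration_text = (
    f"{delta.years} years, {delta.months} months, "
    f"{delta.days} days, {delta.hours} hours"
)

lines = [
    "Date Range After Datetime Parsing:",
    f"Start: {start}",
    f"End: {end}",
    f"Total Duration: {duration_text}",
]
print("\n".join(lines))
```

`pd.Timestamp` is a subclass of `datetime`, so `df.index.min()` and `df.index.max()` can be passed to `relativedelta` in the same way.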

---

## Requirements Checklist

- [ ] Datetime columns parsed correctly using `pd.to_datetime()`
- [ ] Datetime index set using `df.set_index()`
- [ ] Index sorted chronologically using `df.sort_index()`
- [ ] Temporal features extracted: `hour`, `day_of_week`, `month` (minimum)
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Parse datetime** - Convert datetime column using `pd.to_datetime()`
2. **Set datetime index** - Set as index and sort chronologically
3. **Extract temporal features** - Use datetime index properties (`.hour`, `.dayofweek`, `.month`, etc.)
4. **Save artifacts** - Remember to `reset_index()` before saving CSVs so the datetime becomes a column

---

## Decision Points

- **Datetime parsing:** What format is your datetime column? Use `pd.to_datetime()` with appropriate format string if needed: `pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S')`
- **Temporal features:** Extract at minimum: hour, day_of_week, month. Consider also: year, day_name, is_weekend, time_of_day categories. What makes sense for your analysis?
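
Both decision points can be illustrated with a short sketch. The first two strings use the 12-hour style seen in the `Measurement Timestamp Label` column; the third is a deliberately invalid value. The `time_of_day` bin edges and labels are illustrative assumptions, not something the assignment prescribes.

```python
import pandas as pd

labels = pd.Series(["09/27/2018 1:00 PM", "05/22/2015 7:00 PM", "not a date"])

# %I is the 12-hour clock and %p is AM/PM;
# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(labels, format="%m/%d/%Y %I:%M %p", errors="coerce")

# Bucket the hour into time-of-day categories.
# right=False makes each bin include its left edge: [0, 6), [6, 12), ...
time_of_day = pd.cut(
    parsed.dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

print(pd.DataFrame({"parsed": parsed, "time_of_day": time_of_day}))
```

Counting `parsed.isna().sum()` after a coerced parse is a quick way to see how many rows failed to match the format string.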

---

## Checkpoint

After Q3, you should have:
- [ ] Datetime columns parsed
- [ ] Datetime index set and sorted
- [ ] Temporal features extracted (at minimum: hour, day_of_week, month)
- [ ] All 3 artifacts saved: `q3_wrangled_data.csv`, `q3_temporal_features.csv`, `q3_datetime_info.txt`
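
The last checkpoint item can be verified programmatically. The helper below is a hypothetical sketch (the name `missing_artifacts` is not from the assignment); it is demonstrated on a temporary directory so nothing in `output/` is touched.

```python
import tempfile
from pathlib import Path

EXPECTED = (
    "q3_wrangled_data.csv",
    "q3_temporal_features.csv",
    "q3_datetime_info.txt",
)

def missing_artifacts(output_dir, expected=EXPECTED):
    """Return the expected filenames that are not present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected if not (out / name).is_file()]

# Demonstration: a throwaway directory containing only one of the three files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "q3_wrangled_data.csv").write_text("placeholder\n")
    demo_missing = missing_artifacts(tmp)

print(demo_missing)
```

In the notebook, `missing_artifacts("output")` should return an empty list once all three files are saved.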

---

**Next:** Continue to `q4_feature_engineering.md` for Feature Engineering.


In [20]:
#1. DATETIME INDEX
df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'], format='%Y-%m-%d %H:%M:%S')
df = df.set_index('Measurement Timestamp')
print("✓ Set 'Measurement Timestamp' as datetime index")
display(df.head())

df = df.sort_index()
display(df.head())
print("✓ Sorted data by datetime index")

print(f"\nIndex info:")
print(f"  Type: {type(df.index)}")
print(f"  Start: {df.index.min()}")
print(f"  End: {df.index.max()}")
print(f"  Total records: {len(df):,}")

# SAVE
df.reset_index().to_csv('output/q3_wrangled_data.csv', index=False)
print("✓ Saved: output/q3_wrangled_data.csv")

✓ Set 'Measurement Timestamp' as datetime index


Measurement Timestamp,Station Name,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,Wind Speed,Maximum Wind Speed,Barometric Pressure,Solar Radiation,Heading,Battery Life,Measurement Timestamp Label,Measurement ID
2018-09-27 13:00:00,Foster Weather Station,17.89,12.4,39,0.0,0.0,260.3,0.0,249,1.4,2.3,993.6,0,355.0,15.1,09/27/2018 1:00 PM,FosterWeatherStation201809271300
2018-09-27 15:00:00,Foster Weather Station,19.39,13.0,37,0.0,0.0,260.3,0.0,209,4.0,4.6,991.9,0,355.0,15.0,09/27/2018 3:00 PM,FosterWeatherStation201809271500
2018-09-27 16:00:00,Foster Weather Station,19.78,12.2,33,0.0,0.0,260.3,0.0,178,2.4,4.4,991.5,0,355.0,15.1,09/27/2018 4:00 PM,FosterWeatherStation201809271600
2018-09-27 17:00:00,63rd Street Weather Station,20.4,12.4,38,0.0,0.0,260.3,0.0,212,2.7,6.0,991.7,37,355.0,11.9,09/27/2018 5:00 PM,63rdStreetWeatherStation201809271700
2018-09-27 17:00:00,Foster Weather Station,20.0,12.4,33,0.0,0.0,260.3,0.0,200,2.5,3.1,990.9,0,355.0,15.1,09/27/2018 5:00 PM,FosterWeatherStation201809271700


Measurement Timestamp,Station Name,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,Wind Speed,Maximum Wind Speed,Barometric Pressure,Solar Radiation,Heading,Battery Life,Measurement Timestamp Label,Measurement ID
2015-05-22 19:00:00,Foster Weather Station,9.56,14.8,58,0.0,0.0,7.3,0.0,115,1.9,2.1,990.4,79,352.0,15.1,05/22/2015 7:00 PM,FosterWeatherStation201505221900
2015-05-22 20:00:00,Foster Weather Station,9.5,14.8,59,0.0,0.0,7.3,0.0,127,2.1,2.3,990.4,5,352.0,15.2,05/22/2015 8:00 PM,FosterWeatherStation201505222000
2015-05-22 21:00:00,Foster Weather Station,9.78,14.8,62,0.0,0.0,7.3,0.0,65,2.8,3.3,990.4,0,352.0,15.2,05/22/2015 9:00 PM,FosterWeatherStation201505222100
2015-05-22 22:00:00,Foster Weather Station,10.0,14.8,66,0.0,0.0,7.3,0.0,81,1.8,1.9,990.4,0,352.0,15.2,05/22/2015 10:00 PM,FosterWeatherStation201505222200
2015-05-22 23:00:00,Foster Weather Station,10.38,14.8,63,0.0,0.0,7.3,0.0,145,2.0,2.6,990.4,0,352.0,15.1,05/22/2015 11:00 PM,FosterWeatherStation201505222300


✓ Sorted data by datetime index

Index info:
  Type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
  Start: 2015-05-22 19:00:00
  End: 2025-12-04 13:00:00
  Total records: 78,177
✓ Saved: output/q3_wrangled_data.csv


In [23]:
#2. TEMPORAL FEATURES EXTRACTION
temporal_df = pd.DataFrame()

# Keep the original datetime column
temporal_df['Measurement Timestamp'] = df.index

# Extract required temporal features
print("Extracting features...")

# Hour (0-23)
temporal_df['hour'] = df.index.hour
print(f"✓ hour: {temporal_df['hour'].min()}-{temporal_df['hour'].max()}")

# Day of week (0=Monday, 6=Sunday)
temporal_df['day_of_week'] = df.index.dayofweek
print(f"✓ day_of_week: {temporal_df['day_of_week'].min()}-{temporal_df['day_of_week'].max()} (0=Mon, 6=Sun)")

# Month (1-12)
temporal_df['month'] = df.index.month
print(f"✓ month: {temporal_df['month'].min()}-{temporal_df['month'].max()}")

# Year
temporal_df['year'] = df.index.year
print(f"✓ year: {temporal_df['year'].min()}-{temporal_df['year'].max()}")

# Day name
temporal_df['day_name'] = df.index.day_name()
print(f"✓ day_name: {temporal_df['day_name'].unique()[:3].tolist()}...")

# Is weekend (1 if Saturday or Sunday, 0 otherwise)
temporal_df['is_weekend'] = (df.index.dayofweek >= 5).astype(int)
print(f"✓ is_weekend: {temporal_df['is_weekend'].sum():,} weekend records")

display(temporal_df.head())

# SAVE
temporal_df.to_csv('output/q3_temporal_features.csv', index=False)
print("✓ Saved: output/q3_temporal_features.csv")

Extracting features...
✓ hour: 0-23
✓ day_of_week: 0-6 (0=Mon, 6=Sun)
✓ month: 1-12
✓ year: 2015-2025
✓ day_name: ['Friday', 'Saturday', 'Sunday']...
✓ is_weekend: 22,172 weekend records


,Measurement Timestamp,hour,day_of_week,month,year,day_name,is_weekend
0,2015-05-22 19:00:00,19,4,5,2015,Friday,0
1,2015-05-22 20:00:00,20,4,5,2015,Friday,0
2,2015-05-22 21:00:00,21,4,5,2015,Friday,0
3,2015-05-22 22:00:00,22,4,5,2015,Friday,0
4,2015-05-22 23:00:00,23,4,5,2015,Friday,0


✓ Saved: output/q3_temporal_features.csv


In [None]:
#3. DATETIME INFORMATION
print("\nDatetime index information:")

start_date = df.index.min()
end_date = df.index.max()
duration = end_date - start_date


# Approximate the duration in years, months, days, hours
# (assumes 365-day years and 30-day months, so the breakdown is not calendar-exact)
years = duration.days // 365
months = (duration.days % 365) // 30
days = (duration.days % 365) % 30
hours = duration.seconds // 3600

# Create report
report = []
report.append("Date range information after parsing:")
report.append(f"Start: {start_date}")
report.append(f"End: {end_date}")
report.append(f"Total Duration: {years} years, {months} months, {days} days, {hours} hours")

# Save report to file
with open('output/q3_datetime_info.txt', 'w') as f:
    for line in report:
        f.write(line + '\n')
print("✓ Saved: output/q3_datetime_info.txt")



Datetime index information:
✓ Saved: output/q3_datetime_info.txt


In [None]:
#4. DECISION ANSWERS
#1. Measurement Timestamp is my datetime column, which I set as the index after converting it to datetime format.
#2. I extracted the hour, day_of_week, month, year, day_name, and is_weekend features from the datetime index.
#3. For my analysis, I think year and month will be the most useful for identifying seasonal trends and yearly patterns in the beach sensor data.