# Weather Data Processing Pipeline

This notebook implements a data pipeline that ingests, cleans, transforms, and analyzes weather data. The pipeline is modular, retains every input row (missing numeric values are imputed rather than dropped), and produces four visualizations to support the analysis.

In [1]:
import os
import sys

# Set working directory to project root
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

In [2]:
## Step 1: Import Dependencies and Modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_ingestion import load_weather_data
from src.data_cleaning import clean_weather_data
from src.data_transformation import transform_weather_data
from src.data_output import save_transformed_data, generate_temperature_report, plot_visualizations

In [3]:
# Set visualization style
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias for it

## Step 2: Data Ingestion
# Load the weather data from the CSV file
file_path = "../data/weather_data.csv"
df = load_weather_data(file_path)
df.head()

Successfully loaded data from ../data/weather_data.csv


Unnamed: 0,date,city,temperature_celsius,humidity_percent,wind_speed_kph,weather_condition
0,2023-01-01,New York,5.0,60.0,10.0,Sunny
1,01/02/2023,New York,,65.0,12.0,Cloudy
2,03-01-2023,New York,7.0,,8.0,Rainy
3,,London,8.0,70.0,15.0,Unknown
4,2023-01-02,London,6.0,75.0,20.0,Snowy
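
If the `Unnamed: 0` column in the preview above comes from a row index that was written into the CSV (an assumption; the implementation of `load_weather_data` is not shown here), reading the file with `index_col=0` would keep it out of the data columns. A minimal sketch, where `load_csv_sketch` and the inline CSV text are hypothetical stand-ins for the real loader and file:

```python
import io

import pandas as pd

# Hypothetical CSV text mirroring the first rows of the preview above.
# The empty first header cell is what pandas would label "Unnamed: 0".
csv_text = """,date,city,temperature_celsius,humidity_percent,wind_speed_kph,weather_condition
0,2023-01-01,New York,5.0,60.0,10.0,Sunny
1,01/02/2023,New York,,65.0,12.0,Cloudy
2,03-01-2023,New York,7.0,,8.0,Rainy
"""


def load_csv_sketch(source) -> pd.DataFrame:
    """Read a CSV and use its first column as the row index (assumed layout)."""
    return pd.read_csv(source, index_col=0)


df_sketch = load_csv_sketch(io.StringIO(csv_text))
print(df_sketch.columns.tolist())
```

Whether this applies depends on how the CSV was originally written, so it is worth checking the raw file before changing the loader.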


In [4]:
## Step 3: Data Cleaning
# Clean the data: impute missing numeric values and parse dates (dates that cannot be parsed become NaT)
df_cleaned = clean_weather_data(df)
df_cleaned.head()

Data cleaning completed.


Unnamed: 0,date,city,temperature_celsius,humidity_percent,wind_speed_kph,weather_condition
0,2023-01-01,New York,5.0,60.0,10.0,Sunny
1,NaT,New York,8.923529,65.0,12.0,Cloudy
2,NaT,New York,7.0,57.142857,8.0,Rainy
3,NaT,London,8.0,70.0,15.0,Unknown
4,2023-01-02,London,6.0,75.0,20.0,Snowy
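
The cleaned preview shows `NaT` for the rows whose dates were written as `01/02/2023` and `03-01-2023`, which suggests those formats were not parsed. One possible way to handle mixed formats is to try a list of candidate formats in order. The sketch below is not the code in `clean_weather_data`; `parse_mixed_date` and the format list are assumptions, in particular reading `01/02/2023` as month/day/year and `03-01-2023` as day-month-year.

```python
from datetime import datetime

import pandas as pd

# Assumed formats, inferred only from the sample rows shown above.
# The order matters: the first format that matches wins.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def parse_mixed_date(value):
    """Return a Timestamp for the first matching format, or NaT if none match."""
    if pd.isna(value):
        return pd.NaT
    text = str(value).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(text, fmt))
        except ValueError:
            continue  # try the next candidate format
    return pd.NaT


raw_dates = pd.Series(["2023-01-01", "01/02/2023", "03-01-2023", None, "2023-01-02"])
parsed_dates = pd.to_datetime(raw_dates.apply(parse_mixed_date))
print(parsed_dates)
```

Dates such as `01/02/2023` are inherently ambiguous, so the format order should be confirmed against the data source before relying on the parsed values.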


In [5]:
## Step 4: Data Transformation
# Transform the data: add temperature in Fahrenheit
df_transformed = transform_weather_data(df_cleaned)
df_transformed.head()

Data transformation completed.


Unnamed: 0,date,city,temperature_celsius,humidity_percent,wind_speed_kph,weather_condition,temperature_fahrenheit
0,2023-01-01,New York,5.0,60.0,10.0,Sunny,41.0
1,NaT,New York,8.923529,65.0,12.0,Cloudy,48.062353
2,NaT,New York,7.0,57.142857,8.0,Rainy,44.6
3,NaT,London,8.0,70.0,15.0,Unknown,46.4
4,2023-01-02,London,6.0,75.0,20.0,Snowy,42.8
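
The `temperature_fahrenheit` column is consistent with the standard conversion F = C × 9/5 + 32. As a sketch of that step (the helper name `celsius_to_fahrenheit` is hypothetical and not taken from `src.data_transformation`):

```python
import pandas as pd


def celsius_to_fahrenheit(celsius):
    """Convert Celsius to Fahrenheit; works on scalars and pandas Series."""
    return celsius * 9 / 5 + 32


# Celsius values taken from the preview rows above.
temps_c = pd.Series([5.0, 7.0, 8.0, 6.0], name="temperature_celsius")
temps_f = celsius_to_fahrenheit(temps_c).rename("temperature_fahrenheit")
print(temps_f)
```

Because the conversion is applied to the whole Series at once, imputed values are converted the same way as observed ones.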


In [6]:
## Step 5: Data Output
# Save the transformed data, generate a report, and create visualizations
output_path = "../outputs/transformed_weather_data.csv"
report_path = "../outputs/temperature_report.md"
viz_dir = "../outputs/"
save_transformed_data(df_transformed, output_path)
generate_temperature_report(df_transformed, report_path)
plot_visualizations(df_transformed, viz_dir)

Transformed data saved to ../outputs/transformed_weather_data.csv
Temperature report saved to ../outputs/temperature_report.md
Visualizations saved to outputs/
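
The save step assumes the `../outputs/` directory already exists. A minimal sketch of a save helper that creates the directory when needed is shown below; `save_csv_sketch` is a hypothetical name, not the implementation of `save_transformed_data`, and the demo writes to a temporary directory rather than the project's `outputs/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv_sketch(df: pd.DataFrame, output_path) -> Path:
    """Write df to output_path, creating missing parent directories first."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)  # index=False avoids writing an unnamed index column
    return path


demo = pd.DataFrame({"city": ["New York", "London"], "temperature_celsius": [5.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_csv_sketch(demo, Path(tmp_dir) / "outputs" / "demo.csv")
    reloaded = pd.read_csv(saved_path)  # read back while the temp dir still exists

print(reloaded)
```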


In [7]:
## Step 6: Display Visualizations
# The visualizations are saved in the outputs/ directory. The following charts are generated:
# - Average Temperature per City (Bar Chart)
# - Temperature Trend Over Time (Line Plot)
# - Humidity vs Wind Speed (Scatter Plot)
# - Weather Condition Distribution (Pie Chart)

#### Analysis of Visualizations

##### 1. Average Temperature per City (Bar Chart)
- The bar chart shows the average temperature in Celsius for three cities: Tokyo, New York, and London.
  - London has the highest average temperature, slightly above 9°C.
  - New York follows, with an average temperature around 8°C.
  - Tokyo has the lowest average temperature, around 7°C.
- **Insight**: London is the warmest city on average during the period (January 2023), while Tokyo is the coolest. The difference between the cities is relatively small (about 2°C), suggesting similar climatic conditions in January, though London’s slightly higher temperature might be due to its milder winter climate influenced by the Gulf Stream.
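
The per-city averages behind a bar chart like this can be computed with a `groupby`. The sketch below uses a small made-up sample (the values are illustrative only and are not the notebook's data), so its numbers will not match the chart:

```python
import pandas as pd

# Illustrative sample only; these rows are hypothetical.
sample = pd.DataFrame({
    "city": ["New York", "New York", "London", "London", "Tokyo", "Tokyo"],
    "temperature_celsius": [5.0, 7.0, 8.0, 6.0, 4.0, 9.0],
})

# Mean temperature per city, warmest first.
avg_temp = (
    sample.groupby("city")["temperature_celsius"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_temp)
```

Running the same expression on `df_transformed` would give the exact averages that the chart approximates.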

##### 2. Humidity vs Wind Speed by City (Scatter Plot)
- The scatter plot shows `humidity_percent` (x-axis) vs `wind_speed_kph` (y-axis), with points colored by city (New York: dark blue, London: teal, Tokyo: green) and sized by `temperature_celsius`.
  - Humidity ranges from around 30% to 90%, with most points clustering between 40% and 80%.
  - Wind speed ranges from 0 to 30 kph, with most points between 5 and 20 kph.
  - Larger points (higher temperatures) are more common at moderate humidity levels (40%-60%).
  - New York has points across the humidity range, with some high wind speeds (up to 30 kph) at higher humidity.
  - London shows a wide range of humidity but generally lower wind speeds (mostly below 20 kph).
  - Tokyo has points mostly at moderate humidity (40%-60%) and wind speeds (5-15 kph), with fewer extremes.
- **Insight**: There’s no strong linear relationship between humidity and wind speed, but higher temperatures (larger points) tend to occur at moderate humidity levels (40%-60%), suggesting that warmer days might have more balanced humidity. New York experiences more variable wind speeds, especially on humid days, which could indicate more dynamic weather patterns. Tokyo’s clustering at moderate levels suggests more stable weather conditions during this period.
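
The claim about the linear relationship can be checked numerically with a Pearson correlation. The sketch below uses only the five cleaned rows shown in the Step 3 preview, which is far too few to draw conclusions from; it illustrates the calculation, not the result for the full dataset.

```python
import pandas as pd

# The five cleaned rows from the Step 3 preview (not the full dataset).
preview = pd.DataFrame({
    "city": ["New York", "New York", "New York", "London", "London"],
    "humidity_percent": [60.0, 65.0, 57.142857, 70.0, 75.0],
    "wind_speed_kph": [10.0, 12.0, 8.0, 15.0, 20.0],
})

# Pearson correlation across all rows.
overall_r = preview["humidity_percent"].corr(preview["wind_speed_kph"])

# Pearson correlation within each city.
per_city_r = {
    city: group["humidity_percent"].corr(group["wind_speed_kph"])
    for city, group in preview.groupby("city")
}

print(f"overall r = {overall_r:.3f}")
print(per_city_r)
```

Applied to `df_transformed`, a coefficient close to zero would support the reading of the scatter plot, while a large value would call for a closer look.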

##### 3. Temperature Trend Over Time (Line Plot)
- The line plot shows temperature trends for New York (blue), London (orange), and Tokyo (green) over January 2023.
  - New York starts at 5°C, dips to around 0°C around January 5th, peaks at 10°C around January 13th, and ends at 0°C.
  - London starts at 7°C, peaks at 15°C around January 5th, fluctuates, and ends at 13.5°C.
  - Tokyo starts at 6°C, drops to -1.7°C around January 5th, gradually rises to 10°C by the end of the month.
  - London consistently has the highest temperatures throughout the month, while Tokyo experiences the lowest, including sub-zero temperatures.
- **Insight**: London shows the most significant temperature spike early in the month (15°C on January 5th), indicating a possible warm spell, while Tokyo experiences the coldest days (below 0°C), suggesting a colder winter climate. New York’s temperature fluctuates more moderately but shows a cooling trend toward the end of the month. The trends suggest that London has a milder and more volatile winter, while Tokyo’s temperatures are colder but more steadily increasing over time.
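
Statements about volatility can be backed by simple per-city statistics such as the range and standard deviation. The sketch below uses only the handful of temperatures quoted in the bullets above, not the full daily series, so it shows the calculation rather than a definitive ranking:

```python
import pandas as pd

# Temperatures quoted in the description above; the full series has more points.
quoted_points = pd.DataFrame({
    "city": ["New York"] * 4 + ["London"] * 3 + ["Tokyo"] * 3,
    "temperature_celsius": [5.0, 0.0, 10.0, 0.0, 7.0, 15.0, 13.5, 6.0, -1.7, 10.0],
})

# Minimum, maximum, and sample standard deviation per city.
summary = quoted_points.groupby("city")["temperature_celsius"].agg(["min", "max", "std"])
summary["range"] = summary["max"] - summary["min"]
print(summary.round(2))
```

Running the same aggregation on the full `df_transformed` frame would show whether London's series is in fact the most variable.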

##### 4. Weather Condition Distribution (Pie Chart)
- **Observation**: The pie chart shows the distribution of weather conditions across all cities in January 2023.
  - Unknown: 35.0%
  - Cloudy: 14.0%
  - Rainy: 12.0%
  - Sunny: 11.0%
  - Snowy: 9.0%
  - Unknown (small slice, possibly a typo in the chart): 10.0%
- **Insight**: The high share of "Unknown" conditions (35% + 10% = 45%) indicates significant missing or unclassified weather data, which could skew the analysis and suggests a need for better data collection or classification. Excluding "Unknown," Cloudy (14%) and Rainy (12%) are the most common conditions, consistent with typical winter weather in these cities. Sunny days (11%) are slightly less common, and Snowy conditions (9%) are the least frequent, in line with the temperature trends: the near- or sub-zero temperatures that snow usually requires are uncommon, especially in London.
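
The slice values listed above add up to 91% rather than 100%, and "Unknown" appears twice, so recomputing the shares directly from the data is a useful cross-check. One possible cause of a duplicated label is a variant spelling (extra whitespace or different casing); that is an assumption, and the sample below is hypothetical:

```python
import pandas as pd

# Hypothetical labels; the last entry is a variant spelling of "Unknown".
conditions = pd.Series(
    ["Sunny", "Cloudy", "Rainy", "Unknown", "Snowy", "unknown "],
    name="weather_condition",
)

# Normalize whitespace and casing so variant spellings collapse into one label.
normalized = conditions.str.strip().str.title()

# Share of each condition, in percent.
shares = normalized.value_counts(normalize=True) * 100
print(shares.round(1))
```

Shares computed this way sum to 100 by construction, which makes it easy to spot a chart whose labels or values have drifted from the data.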

#### Combined Insights
- **Temperature Patterns**: London is the warmest city on average (around 9°C) and shows the most significant temperature spike (15°C), indicating a milder but more volatile winter. Tokyo is the coldest, with sub-zero temperatures early in the month, while New York shows moderate fluctuations but a cooling trend by late January.
- **Weather Conditions**: The dataset has a high proportion of "Unknown" weather conditions (45%), which highlights a data quality issue. Excluding unknowns, cloudy and rainy conditions are most common, reflecting typical January weather in these cities, with fewer sunny or snowy days.
- **Humidity and Wind Speed**: There’s no clear correlation between humidity and wind speed, but warmer days (higher temperatures) tend to have moderate humidity (40%-60%). New York experiences more variable wind speeds on humid days, suggesting dynamic weather, while Tokyo’s weather appears more stable.
- **Data Quality**: The large number of "Unknown" weather conditions and the need to impute missing `humidity_percent` and `temperature_celsius` values indicate that the dataset has gaps that could affect analysis reliability. Future improvements could focus on better data collection or more sophisticated imputation methods.
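
As one example of a slightly more refined imputation method, missing values can be filled with a per-city median instead of a single dataset-wide value. This is a sketch under that assumption, not the method used by `clean_weather_data`; the sample rows follow the raw preview from Step 2.

```python
import pandas as pd

# Rows modeled on the raw preview, with the same missing values.
raw = pd.DataFrame({
    "city": ["New York", "New York", "New York", "London", "London"],
    "temperature_celsius": [5.0, None, 7.0, 8.0, 6.0],
    "humidity_percent": [60.0, 65.0, None, 70.0, 75.0],
})

numeric_cols = ["temperature_celsius", "humidity_percent"]

# Fill each missing value with the median of that column within the same city.
imputed = raw.copy()
imputed[numeric_cols] = raw.groupby("city")[numeric_cols].transform(
    lambda col: col.fillna(col.median())
)
print(imputed)
```

Group-wise statistics keep a cold city's gaps from being filled with a warmer city's values, at the cost of being less stable when a group has very few observations.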

## Insights from Visualizations


- **Temperature Patterns**:
  - **Average Temperature per City**: London is the warmest city on average (around 9°C), followed by New York (~8°C) and Tokyo (~7°C). London’s higher average temperature may be due to its milder winter climate influenced by the Gulf Stream.
  - **Temperature Trend Over Time**: London shows the most significant temperature spike (15°C on January 5th), indicating a possible warm spell, while Tokyo experiences the coldest days (below 0°C early in the month). New York’s temperature fluctuates moderately but cools toward the end of the month. This suggests London has a milder but more volatile winter, while Tokyo’s temperatures are colder but steadily increase over time.

- **Weather Conditions**:
  - **Weather Condition Distribution**: A large portion of the data (45%) is labeled as "Unknown," indicating significant missing or unclassified weather conditions, which could skew the analysis. Excluding "Unknown," cloudy (14%) and rainy (12%) conditions are the most common, consistent with typical winter weather in these cities. Sunny days (11%) are slightly less common, and snowy conditions (9%) are the least frequent, in line with the temperature trends: the near- or sub-zero temperatures that snow usually requires are uncommon, especially in London.

- **Humidity and Wind Speed**:
  - **Humidity vs Wind Speed by City**: There’s no strong linear relationship between humidity and wind speed, but warmer days (larger points, higher temperatures) tend to occur at moderate humidity levels (40%-60%). New York experiences more variable wind speeds, especially on humid days, suggesting more dynamic weather patterns. Tokyo’s clustering at moderate humidity and wind speed indicates more stable weather conditions during this period.

- **Data Quality**:
  - The high percentage of "Unknown" weather conditions and the need to impute missing `humidity_percent` and `temperature_celsius` values highlight gaps in the dataset. This could affect the reliability of the analysis. Future improvements could focus on better data collection or more advanced imputation techniques to enhance data quality.