# Hanoi Weather Data Collection

This notebook demonstrates how to collect historical weather data for Hanoi from the Visual Crossing Weather API. We'll gather about 10 years of daily records, each with 33+ features, for use in temperature forecasting.

## Objectives
1. Set up API connection to Visual Crossing Weather API
2. Collect 10 years of daily weather data for Hanoi
3. Understand the 33+ weather features available
4. Validate and explore the collected data
5. Save data for further processing

## 1. Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import time
import os
import json
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

# Set up plotting style
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_palette("husl")

print("All libraries imported successfully!")

## 2. API Configuration

Visual Crossing Weather API provides comprehensive historical weather data. We'll configure our API connection and understand the available parameters.
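
As a rough sketch of how a Timeline request is put together, the helper below assembles the URL and query string from the same parameters this notebook uses. The function name `build_timeline_url` and the placeholder key are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline"

def build_timeline_url(location, start_date, end_date, api_key, include="days"):
    """Return the full request URL for one date range (hypothetical helper)."""
    # Path segments: location, then start and end dates as YYYY-MM-DD
    path = f"{BASE_URL}/{quote(location)}/{start_date}/{end_date}"
    query = urlencode({
        "unitGroup": "metric",   # Celsius, km/h, mm
        "include": include,      # 'days' requests daily records only
        "key": api_key,
        "contentType": "json",
    })
    return f"{path}?{query}"

example_url = build_timeline_url("Hanoi,Vietnam", "2024-01-01", "2024-01-31", "YOUR_API_KEY")
print(example_url)
```

Printing the URL before sending it is a cheap way to confirm the date range and parameters are what you expect.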

In [None]:
# API Configuration
API_KEY = os.getenv('VISUAL_CROSSING_API_KEY')  # Get your free API key from visualcrossing.com
BASE_URL = "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline"
LOCATION = "Hanoi,Vietnam"

if not API_KEY:
    print("⚠️ Warning: API key not found!")
    print("Please set VISUAL_CROSSING_API_KEY in your .env file")
    print("You can get a free API key from: https://www.visualcrossing.com/weather/weather-data-services")
    API_KEY = input("Or enter your API key here: ")

print(f"API configured for location: {LOCATION}")
print(f"Base URL: {BASE_URL}")

## 3. Weather Data Collection Functions

Let's create functions to collect weather data with proper rate limiting and error handling.
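
One common complement to rate limiting is retrying transient failures with an increasing delay. The sketch below is a minimal illustration, not part of the collector class: `retry_with_backoff` and `flaky_request` are hypothetical names, and the `sleep` argument is injectable only so the demo runs instantly.

```python
import time

def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Demonstration with a stand-in function that fails twice, then succeeds
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"days": []}

recorded_delays = []
result = retry_with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, recorded_delays)
```

A helper like this only helps if the wrapped call raises on failure, so it would wrap the raw `requests.get(...)` call rather than a function that already swallows the exception.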

In [None]:
class HanoiWeatherCollector:
    """A class to collect weather data for Hanoi from Visual Crossing API."""
    
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = BASE_URL
        self.location = LOCATION
        self.last_request_time = 0
        self.rate_limit_delay = 1.1  # Seconds between requests (stay under rate limits)
    
    def _rate_limit_wait(self):
        """Implement rate limiting to avoid exceeding API limits."""
        current_time = time.time()
        time_since_last = current_time - self.last_request_time
        
        if time_since_last < self.rate_limit_delay:
            sleep_time = self.rate_limit_delay - time_since_last
            time.sleep(sleep_time)
        
        self.last_request_time = time.time()
    
    def fetch_weather_data(self, start_date, end_date, include_hours=False):
        """Fetch weather data for a specific date range."""
        self._rate_limit_wait()
        
        url = f"{self.base_url}/{self.location}/{start_date}/{end_date}"
        
        params = {
            'unitGroup': 'metric',
            'include': 'days,hours' if include_hours else 'days',
            'key': self.api_key,
            'contentType': 'json'
        }
        
        try:
            print(f"Fetching data from {start_date} to {end_date}...")
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            return response.json()
        
        except requests.exceptions.RequestException as e:
            print(f"Error fetching data: {e}")
            return None
    
    def collect_historical_data(self, years=10, include_hours=False):
        """Collect historical weather data for specified number of years."""
        end_date = datetime.now()
        start_date = end_date - timedelta(days=years * 365)
        
        print(f"Collecting {years} years of weather data for {self.location}")
        print(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
        
        # Collect data in chunks to avoid API limits (1 year per request)
        all_data = []
        current_date = start_date
        
        while current_date <= end_date:
            # Each chunk covers at most 365 days, with both end dates included
            chunk_end = min(current_date + timedelta(days=364), end_date)
            
            start_str = current_date.strftime("%Y-%m-%d")
            end_str = chunk_end.strftime("%Y-%m-%d")
            
            data = self.fetch_weather_data(start_str, end_str, include_hours)
            
            if data and 'days' in data:
                all_data.extend(data['days'])
                print(f"✓ Collected {len(data['days'])} days")
            else:
                print(f"✗ Failed to collect data for {start_str} to {end_str}")
            
            # Start the next chunk on the following day so boundary days are not fetched twice
            current_date = chunk_end + timedelta(days=1)
        
        print(f"\nTotal data collected: {len(all_data)} days")
        return all_data

# Initialize the collector
collector = HanoiWeatherCollector(API_KEY)
print("Weather collector initialized!")

## 4. Collect 10 Years of Daily Weather Data

Now let's collect 10 years of historical daily weather data for Hanoi. This will give us a comprehensive dataset for temperature forecasting.
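
To make the chunking idea concrete, here is a small sketch of splitting a long date range into consecutive, non-overlapping windows. The generator name `date_chunks` is an assumption for illustration, and both ends of each window are treated as inclusive.

```python
from datetime import date, timedelta

def date_chunks(start, end, chunk_days=365):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=chunk_days - 1), end)
        yield current, chunk_end
        # The next chunk begins the day after this one ends, so no day repeats
        current = chunk_end + timedelta(days=1)

chunks = list(date_chunks(date(2022, 1, 1), date(2024, 6, 30)))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "to", chunk_end, f"({(chunk_end - chunk_start).days + 1} days)")
```

Summing the chunk lengths and comparing against the full span is a quick sanity check that no day is skipped or counted twice.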

In [None]:
# Collect 10 years of daily weather data
print("Starting data collection...")
print("This may take several minutes due to API rate limiting.")
print("="*50)

raw_data = collector.collect_historical_data(years=10, include_hours=False)

if raw_data:
    print(f"\n✅ Successfully collected {len(raw_data)} days of weather data!")
else:
    print("❌ Failed to collect weather data. Please check your API key and connection.")

## 5. Convert to DataFrame and Initial Exploration

Let's convert the raw data to a pandas DataFrame and explore its structure.

In [None]:
# Convert to DataFrame
if raw_data:
    df = pd.DataFrame(raw_data)
    
    # Convert datetime column
    df['datetime'] = pd.to_datetime(df['datetime'])
    
    # Sort by date
    df = df.sort_values('datetime').reset_index(drop=True)
    
    print(f"Dataset shape: {df.shape}")
    print(f"Date range: {df['datetime'].min()} to {df['datetime'].max()}")
    print(f"Number of features: {len(df.columns)}")
    
    # Display first few rows
    print("\nFirst 5 rows:")
    display(df.head())
else:
    print("No data to process. Please run the data collection cell above.")

## 6. Understanding Weather Features (33+ Features)

Let's explore all the weather features available in our dataset and understand what each one represents for temperature forecasting.

In [None]:
# Weather feature descriptions
feature_descriptions = {
    # Temperature Features (Primary targets)
    'temp': 'Average temperature (°C) - Our main prediction target',
    'tempmax': 'Maximum temperature (°C) - Daily peak temperature',
    'tempmin': 'Minimum temperature (°C) - Daily lowest temperature',
    'feelslike': 'Feels-like temperature (°C) - Apparent temperature considering humidity and wind',
    'dew': 'Dew point temperature (°C) - Temperature at which air becomes saturated',
    
    # Atmospheric Conditions
    'humidity': 'Relative humidity (%) - Amount of moisture in the air',
    'pressure': 'Atmospheric pressure (hPa) - Air pressure at sea level',
    'visibility': 'Visibility distance (km) - How far you can see clearly',
    'cloudcover': 'Cloud coverage (%) - Percentage of sky covered by clouds',
    
    # Wind Characteristics
    'windspeed': 'Wind speed (km/h) - Average wind velocity',
    'winddir': 'Wind direction (degrees) - Direction wind is coming from (0-360°)',
    'windgust': 'Wind gust speed (km/h) - Maximum wind speed in gusts',
    
    # Precipitation
    'precip': 'Precipitation amount (mm) - Total rainfall/snowfall',
    'precipprob': 'Precipitation probability (%) - Chance of precipitation',
    'preciptype': 'Precipitation type - Rain, snow, sleet, etc.',
    'precipcover': 'Precipitation coverage (%) - Share of hours in the day with non-zero precipitation',
    'snow': 'Snow amount (cm) - Fresh snowfall',
    'snowdepth': 'Snow depth on ground (cm) - Accumulated snow',
    
    # Solar and Radiation
    'solarradiation': 'Solar radiation (W/m²) - Solar energy received',
    'solarenergy': 'Solar energy (MJ/m²) - Total solar energy for the day',
    'uvindex': 'UV Index - Ultraviolet radiation intensity (0-11+ scale)',
    
    # Celestial and Time
    'moonphase': 'Moon phase (0-1) - 0=new moon, 0.25=first quarter, 0.5=full moon, 0.75=last quarter',
    'sunrise': 'Sunrise time - When sun rises',
    'sunset': 'Sunset time - When sun sets',
    
    # Weather Conditions (Text Features)
    'conditions': 'Weather conditions - Brief description (Clear, Cloudy, Rain, etc.)',
    'description': 'Detailed weather description - More comprehensive weather summary',
    'icon': 'Weather icon code - Visual representation identifier',
    
    # Additional Features
    'severerisk': 'Severe weather risk (%) - Risk of severe weather events',
    'datetime': 'Date and time - Primary time index for our time series'
}

if 'df' in locals():
    print("Available Features in Our Dataset:")
    print("=" * 50)
    
    available_features = df.columns.tolist()
    
    for i, feature in enumerate(available_features, 1):
        description = feature_descriptions.get(feature, 'Feature description not available')
        print(f"{i:2d}. {feature:15s} - {description}")
    
    print(f"\nTotal features: {len(available_features)}")
    
    # Check data types
    print("\nData Types:")
    print(df.dtypes)

## 7. Data Quality Assessment

Let's assess the quality of our collected data by checking for missing values, duplicates, and data distribution.
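
For a daily time series, it can also help to check for missing calendar days and repeated dates. The sketch below uses a small synthetic frame (the values are made up) rather than the collected data.

```python
import pandas as pd

# Synthetic example: 2024-01-03 is missing and 2024-01-02 appears twice
sample = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-04", "2024-01-05"]
    ),
    "temp": [16.2, 17.0, 17.0, 15.4, 14.9],
})

# Every calendar day between the first and last record
expected_days = pd.date_range(sample["datetime"].min(), sample["datetime"].max(), freq="D")

# Days that should be present but are not
missing_days = expected_days[~expected_days.isin(sample["datetime"])]

# Second and later occurrences of a date
duplicate_days = sample.loc[sample["datetime"].duplicated(), "datetime"]

print("Missing days:", list(missing_days.strftime("%Y-%m-%d")))
print("Duplicated days:", list(duplicate_days.dt.strftime("%Y-%m-%d")))
```

The same three lines can be pointed at `df['datetime']` once the real data is loaded.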

In [None]:
if 'df' in locals():
    print("DATA QUALITY ASSESSMENT")
    print("=" * 50)
    
    # Basic statistics
    print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Date range
    print(f"\nDate Range:")
    print(f"From: {df['datetime'].min()}")
    print(f"To:   {df['datetime'].max()}")
    print(f"Span: {(df['datetime'].max() - df['datetime'].min()).days} days")
    
    # Missing values analysis
    print("\nMissing Values Analysis:")
    missing_counts = df.isnull().sum()
    missing_percent = (missing_counts / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Feature': missing_counts.index,
        'Missing_Count': missing_counts.values,
        'Missing_Percent': missing_percent.values
    })
    
    # Show only features with missing values
    missing_features = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Percent', ascending=False)
    
    if len(missing_features) > 0:
        print("Features with missing values:")
        display(missing_features)
    else:
        print("✅ No missing values found!")
    
    # Check for duplicates
    duplicates = df.duplicated(subset=['datetime']).sum()
    print(f"\nDuplicate records: {duplicates}")
    
    # Basic statistics for numerical features
    print("\nNumerical Features Summary:")
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    display(df[numerical_cols].describe())

## 8. Initial Temperature Analysis

Let's focus on our primary target variable - temperature - and understand its patterns over the 10-year period.

In [None]:
if 'df' in locals():
    # Temperature analysis
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle("Hanoi Temperature Analysis (10 Years)", fontsize=16, fontweight='bold')
    
    # 1. Temperature time series
    axes[0, 0].plot(df['datetime'], df['temp'], alpha=0.7, color='red', linewidth=0.5)
    axes[0, 0].set_title('Daily Average Temperature Over Time')
    axes[0, 0].set_xlabel('Date')
    axes[0, 0].set_ylabel('Temperature (°C)')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Add yearly rolling average
    df['temp_rolling_365'] = df['temp'].rolling(window=365, center=True).mean()
    axes[0, 0].plot(df['datetime'], df['temp_rolling_365'], color='darkred', linewidth=2, label='365-day average')
    axes[0, 0].legend()
    
    # 2. Temperature distribution
    axes[0, 1].hist(df['temp'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0, 1].axvline(df['temp'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["temp"].mean():.1f}°C')
    axes[0, 1].axvline(df['temp'].median(), color='orange', linestyle='--', linewidth=2, label=f'Median: {df["temp"].median():.1f}°C')
    axes[0, 1].set_title('Temperature Distribution')
    axes[0, 1].set_xlabel('Temperature (°C)')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Seasonal patterns
    df['month'] = df['datetime'].dt.month
    monthly_temp = df.groupby('month')['temp'].agg(['mean', 'min', 'max'])
    
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    axes[1, 0].plot(monthly_temp.index, monthly_temp['mean'], marker='o', linewidth=2, markersize=8, color='red', label='Average')
    axes[1, 0].fill_between(monthly_temp.index, monthly_temp['min'], monthly_temp['max'], alpha=0.3, color='red', label='Min-Max Range')
    axes[1, 0].set_title('Seasonal Temperature Patterns')
    axes[1, 0].set_xlabel('Month')
    axes[1, 0].set_ylabel('Temperature (°C)')
    axes[1, 0].set_xticks(range(1, 13))
    axes[1, 0].set_xticklabels(months)
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. Yearly temperature trends
    df['year'] = df['datetime'].dt.year
    yearly_temp = df.groupby('year')['temp'].mean()
    
    axes[1, 1].bar(yearly_temp.index, yearly_temp.values, alpha=0.7, color='green')
    axes[1, 1].plot(yearly_temp.index, yearly_temp.values, color='darkgreen', marker='o', linewidth=2, markersize=6)
    axes[1, 1].set_title('Average Temperature by Year')
    axes[1, 1].set_xlabel('Year')
    axes[1, 1].set_ylabel('Average Temperature (°C)')
    axes[1, 1].grid(True, alpha=0.3)
    
    # Rotate x-axis labels for better readability
    axes[1, 1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Print temperature statistics
    print("\nHANOI TEMPERATURE STATISTICS (10 Years)")
    print("=" * 45)
    print(f"Average Temperature: {df['temp'].mean():.2f}°C")
    print(f"Median Temperature:  {df['temp'].median():.2f}°C")
    print(f"Standard Deviation:  {df['temp'].std():.2f}°C")
    print(f"Minimum Temperature: {df['temp'].min():.2f}°C")
    print(f"Maximum Temperature: {df['temp'].max():.2f}°C")
    print(f"Temperature Range:   {df['temp'].max() - df['temp'].min():.2f}°C")
    
    # Seasonal analysis
    print("\nSeasonal Temperature Averages:")
    seasons = {
        'Winter (Dec-Feb)': [12, 1, 2],
        'Spring (Mar-May)': [3, 4, 5],
        'Summer (Jun-Aug)': [6, 7, 8],
        'Autumn (Sep-Nov)': [9, 10, 11]
    }
    
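    # Use a distinct loop variable so the month-name list defined above is not overwritten
    for season, season_months in seasons.items():
        season_temp = df[df['month'].isin(season_months)]['temp'].mean()
        print(f"{season}: {season_temp:.2f}°C")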

## 9. Key Weather Features Correlation Analysis

Let's examine how different weather features correlate with temperature to understand which features might be most important for forecasting.
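
Two details are easy to get wrong in this kind of analysis: the heatmap mask should be a boolean upper triangle, and ranking by absolute correlation should not discard the sign. The sketch below illustrates both on synthetic data; the column relationships are invented purely for demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(24, 5, 200)
demo = pd.DataFrame({
    "temp": temp,
    "tempmax": temp + rng.normal(4, 1, 200),        # moves with temp
    "humidity": 80 - temp + rng.normal(0, 3, 200),  # moves against temp
})

corr = demo.corr()

# Boolean mask that hides the upper triangle, including the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Rank by strength (absolute value) while keeping the original sign
temp_corr = corr["temp"].drop("temp")
ranked = temp_corr.reindex(temp_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Passing `mask=mask` to `sns.heatmap` then shows each pair only once.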

In [None]:
if 'df' in locals():
    # Select key numerical features for correlation analysis
    key_features = ['temp', 'tempmax', 'tempmin', 'feelslike', 'humidity', 'pressure', 
                   'windspeed', 'cloudcover', 'precip', 'solarradiation', 'uvindex', 'moonphase']
    
    # Filter features that exist in our dataset
    available_features = [f for f in key_features if f in df.columns]
    
    if len(available_features) > 1:
        # Correlation matrix
        correlation_matrix = df[available_features].corr()
        
        # Plot correlation heatmap
        plt.figure(figsize=(12, 10))
        # Boolean mask that hides the redundant upper triangle (and the diagonal)
        mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
        sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0, 
                   square=True, mask=mask, cbar_kws={'shrink': 0.8})
        plt.title('Weather Features Correlation Matrix\n(Focus on Temperature Relationships)', 
                 fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
        # Features most correlated with temperature (ranked by strength, sign preserved)
        temp_correlations = correlation_matrix['temp'].drop('temp')  # Remove self-correlation
        temp_correlations = temp_correlations.reindex(
            temp_correlations.abs().sort_values(ascending=False).index
        )
        
        print("\nFEATURES MOST CORRELATED WITH TEMPERATURE")
        print("=" * 50)
        for i, (feature, corr) in enumerate(temp_correlations.head(10).items(), 1):
            correlation_strength = "Strong" if abs(corr) > 0.7 else "Moderate" if abs(corr) > 0.3 else "Weak"
            direction = "Positive" if corr > 0 else "Negative"
            print(f"{i:2d}. {feature:15s}: {corr:6.3f} ({direction} {correlation_strength})")
        
        # Interesting insights
        print("\n🔍 KEY INSIGHTS FOR TEMPERATURE FORECASTING:")
        print("=" * 50)
        
        if 'tempmax' in temp_correlations:
            print(f"• Maximum temperature correlation: {correlation_matrix.loc['temp', 'tempmax']:.3f}")
            print("  → Maximum temperature is highly predictive of average temperature")
        
        if 'humidity' in temp_correlations:
            humidity_corr = correlation_matrix.loc['temp', 'humidity']
            print(f"• Humidity correlation: {humidity_corr:.3f}")
            if humidity_corr < -0.3:
                print("  → Higher humidity tends to be associated with lower temperatures")
        
        if 'solarradiation' in temp_correlations:
            solar_corr = correlation_matrix.loc['temp', 'solarradiation']
            print(f"• Solar radiation correlation: {solar_corr:.3f}")
            if solar_corr > 0.3:
                print("  → More solar radiation typically means higher temperatures")
        
        if 'cloudcover' in temp_correlations:
            cloud_corr = correlation_matrix.loc['temp', 'cloudcover']
            print(f"• Cloud cover correlation: {cloud_corr:.3f}")
            if cloud_corr < -0.2:
                print("  → More clouds generally associated with cooler temperatures")
    
    else:
        print("Insufficient numerical features for correlation analysis.")

## 10. Text Features Analysis

Let's explore the text-based weather features (conditions, description) that we'll need to process for machine learning.
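
One possible way to turn the `conditions` text into model inputs is multi-label indicator columns. The sketch below assumes that multi-part conditions are joined with a comma and a space (for example "Rain, Partially cloudy"); the sample strings are synthetic.

```python
import pandas as pd

# Synthetic labels, assuming multi-part conditions are comma-separated
conditions = pd.Series([
    "Rain, Partially cloudy",
    "Clear",
    "Rain, Overcast",
    "Partially cloudy",
])

# One 0/1 column per distinct condition; a day can have several set to 1
condition_flags = conditions.str.get_dummies(sep=", ")
print(condition_flags)
```

Compared with treating each full string as its own category, this keeps the number of columns small and lets "Rain" be shared across combined labels.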

In [None]:
if 'df' in locals():
    print("TEXT FEATURES ANALYSIS")
    print("=" * 30)
    
    # Analyze weather conditions
    if 'conditions' in df.columns:
        print("\n1. WEATHER CONDITIONS:")
        conditions_counts = df['conditions'].value_counts()
        print(f"Total unique conditions: {len(conditions_counts)}")
        print("\nTop 10 most common weather conditions:")
        for i, (condition, count) in enumerate(conditions_counts.head(10).items(), 1):
            percentage = (count / len(df)) * 100
            print(f"{i:2d}. {condition:25s}: {count:4d} days ({percentage:5.1f}%)")
        
        # Average temperature by weather condition
        temp_by_condition = df.groupby('conditions')['temp'].agg(['mean', 'std', 'count']).round(2)
        temp_by_condition = temp_by_condition[temp_by_condition['count'] >= 10]  # Only conditions with 10+ occurrences
        temp_by_condition = temp_by_condition.sort_values('mean', ascending=False)
        
        print("\nAverage temperature by weather condition (conditions with 10+ occurrences):")
        print(f"{'Condition':<25} {'Avg Temp (°C)':<12} {'Std Dev':<8} {'Count':<6}")
        print("-" * 55)
        for condition, row in temp_by_condition.head(15).iterrows():
            print(f"{condition:<25} {row['mean']:8.1f}     {row['std']:6.1f}   {row['count']:4.0f}")
    
    # Analyze weather descriptions
    if 'description' in df.columns:
        print("\n\n2. WEATHER DESCRIPTIONS:")
        description_counts = df['description'].value_counts()
        print(f"Total unique descriptions: {len(description_counts)}")
        print("\nTop 10 most common weather descriptions:")
        for i, (desc, count) in enumerate(description_counts.head(10).items(), 1):
            percentage = (count / len(df)) * 100
            # Truncate long descriptions
            desc_short = desc[:50] + "..." if len(desc) > 50 else desc
            print(f"{i:2d}. {desc_short:<53}: {count:4d} days ({percentage:5.1f}%)")
    
    # Weather icons analysis
    if 'icon' in df.columns:
        print("\n\n3. WEATHER ICONS:")
        icon_counts = df['icon'].value_counts()
        print(f"Total unique icons: {len(icon_counts)}")
        print("\nAll weather icon types:")
        for i, (icon, count) in enumerate(icon_counts.items(), 1):
            percentage = (count / len(df)) * 100
            print(f"{i:2d}. {icon:<20}: {count:4d} days ({percentage:5.1f}%)")
    
    # Create a visualization for weather conditions
    if 'conditions' in df.columns:
        plt.figure(figsize=(12, 8))
        
        # Top weather conditions
        top_conditions = conditions_counts.head(8)
        plt.subplot(2, 1, 1)
        top_conditions.plot(kind='bar', color='skyblue', edgecolor='black')
        plt.title('Most Common Weather Conditions in Hanoi (10 Years)', fontweight='bold')
        plt.xlabel('Weather Conditions')
        plt.ylabel('Number of Days')
        plt.xticks(rotation=45, ha='right')
        plt.grid(True, alpha=0.3)
        
        # Temperature distribution by weather condition
        plt.subplot(2, 1, 2)
        top_5_conditions = conditions_counts.head(5).index
        temp_data = [df[df['conditions'] == condition]['temp'].values for condition in top_5_conditions]
        plt.boxplot(temp_data, labels=top_5_conditions)
        plt.title('Temperature Distribution by Weather Condition', fontweight='bold')
        plt.xlabel('Weather Conditions')
        plt.ylabel('Temperature (°C)')
        plt.xticks(rotation=45, ha='right')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

## 11. Save Collected Data

Let's save our collected data to the appropriate directory for further processing and analysis.

In [None]:
if 'df' in locals():
    # Create data directory if it doesn't exist
    os.makedirs('../data/raw/daily', exist_ok=True)
    
    # Generate filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"hanoi_weather_daily_10years_{timestamp}.csv"
    filepath = f"../data/raw/daily/{filename}"
    
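    # Drop helper columns added during the analysis so the raw file holds only API fields
    df = df.drop(columns=['temp_rolling_365', 'month', 'year'], errors='ignore')
    
    # Save the data
    df.to_csv(filepath, index=False)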
    
    print(f"✅ Data successfully saved to: {filepath}")
    print(f"📊 Dataset summary:")
    print(f"   • Records: {len(df):,}")
    print(f"   • Features: {len(df.columns)}")
    print(f"   • Date range: {df['datetime'].min().date()} to {df['datetime'].max().date()}")
    print(f"   • File size: {os.path.getsize(filepath) / 1024**2:.2f} MB")
    
    # Also save feature descriptions for reference
    feature_info_file = f"../data/raw/daily/feature_descriptions_{timestamp}.json"
    with open(feature_info_file, 'w') as f:
        json.dump(feature_descriptions, f, indent=2)
    
    print(f"📝 Feature descriptions saved to: {feature_info_file}")
    
    # Create a summary report
    summary_report = {
        'collection_date': datetime.now().isoformat(),
        'location': LOCATION,
        'data_source': 'Visual Crossing Weather API',
        'records_collected': len(df),
        'features_count': len(df.columns),
        'date_range': {
            'start': df['datetime'].min().isoformat(),
            'end': df['datetime'].max().isoformat(),
            'days': (df['datetime'].max() - df['datetime'].min()).days
        },
        'temperature_stats': {
            'mean': float(df['temp'].mean()),
            'std': float(df['temp'].std()),
            'min': float(df['temp'].min()),
            'max': float(df['temp'].max())
        },
        'missing_values': df.isnull().sum().to_dict(),
        # Compare missing cells against the total number of cells, not the number of rows
        'data_quality': 'Good' if df.isnull().sum().sum() < df.size * 0.05 else 'Needs attention'
    }
    
    summary_file = f"../data/raw/daily/collection_summary_{timestamp}.json"
    with open(summary_file, 'w') as f:
        json.dump(summary_report, f, indent=2, default=str)
    
    print(f"📋 Collection summary saved to: {summary_file}")
    
else:
    print("❌ No data to save. Please run the data collection cells above first.")

## 12. Next Steps and Key Insights

### What We've Accomplished:
1. ✅ Collected 10 years of daily weather data for Hanoi
2. ✅ Reviewed the 33+ weather features available for temperature forecasting
3. ✅ Analyzed temperature patterns and seasonality
4. ✅ Examined feature correlations and relationships
5. ✅ Explored text-based weather features
6. ✅ Saved the raw data, feature descriptions, and a collection summary for later processing

### Key Insights from Hanoi Weather Data:

**Temperature Patterns:**
- Hanoi exhibits strong seasonal variation typical of a subtropical climate
- Summer months (June-August) are hottest, winter months (December-February) are coolest
- Daily temperature variations provide rich information for forecasting

**Important Features for Forecasting:**
- Temperature-related features (tempmax, tempmin, feelslike) are expected to correlate strongly with the daily average (`temp`)
- Atmospheric conditions (humidity, pressure, cloudcover) should be judged by the correlation values printed in Section 9
- Solar radiation and weather conditions are candidate predictors
- Text features (conditions, descriptions) contain categorical information that needs encoding before modeling

### Next Steps:
1. **Data Understanding & EDA**: Deep dive into feature relationships and patterns
2. **Data Processing**: Handle missing values, feature encoding, normalization
3. **Feature Engineering**: Create lag features, rolling statistics, seasonal components
4. **Model Development**: Build and train forecasting models
5. **Model Evaluation**: Test different algorithms and hyperparameters
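
As a small illustration of the feature engineering step, the sketch below builds lag and rolling features with pandas on a tiny synthetic series. The column names (`temp_lag_1`, `temp_roll_mean_3`) are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

daily = pd.DataFrame({
    "datetime": pd.date_range("2024-01-01", periods=6, freq="D"),
    "temp": [15.0, 16.0, 18.0, 17.0, 19.0, 20.0],
})

# Lag features: the temperature one and two days earlier
daily["temp_lag_1"] = daily["temp"].shift(1)
daily["temp_lag_2"] = daily["temp"].shift(2)

# Mean of the previous 3 days; shift(1) keeps the current day's value out
daily["temp_roll_mean_3"] = daily["temp"].shift(1).rolling(window=3).mean()

print(daily)
```

Shifting before rolling matters: without it, each row's rolling mean would include the very value the model is supposed to predict.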

### Recommendations for Temperature Forecasting:
- Use lagged values of multiple temperature features (max, min, average) as predictors; same-day values would leak the target
- Incorporate lag features (previous days' temperatures)
- Consider seasonal and cyclical patterns
- Process text features using NLP techniques
- Create ensemble models combining different algorithms
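
One way to represent the seasonal cycle is to encode the day of the year as a sine/cosine pair, so that late December and early January sit next to each other. This is a minimal sketch; `encode_day_of_year` is a hypothetical helper and the 365.25-day period is an approximation.

```python
import math
from datetime import date

def encode_day_of_year(d, period=365.25):
    """Map a date to (sin, cos) so 31 Dec and 1 Jan end up close together."""
    angle = 2 * math.pi * d.timetuple().tm_yday / period
    return math.sin(angle), math.cos(angle)

for d in (date(2024, 1, 1), date(2024, 7, 1), date(2024, 12, 31)):
    s, c = encode_day_of_year(d)
    print(d, f"sin={s:+.3f} cos={c:+.3f}")
```

A plain month or day-of-year number would instead place December and January at opposite ends of the scale.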

This dataset provides an excellent foundation for building accurate temperature forecasting models for Hanoi!