## Introduction

- This project analyzes one year of weather data for Ho Chi Minh City to study climate variation and weather trends in the region.
- It aims to build a comprehensive picture of significant weather changes, including fluctuations in temperature and humidity.
- Applying data analysis methods helps identify weather patterns and trends, which can support more accurate predictions of future weather conditions.
- This information is valuable not only to individuals interested in weather but also to urban planning and development, where it can inform decisions about agriculture, transportation, and environmental management.

## Crawl Data

- The data is crawled from the website https://openweathermap.org/.
- Because our OpenWeatherMap account is a student account, we can only collect data from the year preceding the crawl date.
- To collect the hourly weather data for one location, request a URL of the following form:
  https://history.openweathermap.org/data/2.5/history/city?id={id}&type=hour&start={dt}&appid={API_key}
    - Note:
        - id: City ID.
        - dt: Start time as a Unix timestamp (seconds since the Epoch, 1970-01-01 00:00:00 UTC).
        - API_key: Our API key is 626e8ec21c8de03a592d15a0f2dca7f9.
    - Example: https://history.openweathermap.org/data/2.5/history/city?id=1566083&type=hour&start=1671469200&appid=626e8ec21c8de03a592d15a0f2dca7f9
        - In this example, dt = 1671469200 (2022-12-20 00:00 in Ho Chi Minh City, UTC+7). Opening this URL no longer returns any data because the date is past the one-year limit.
    - Choose a date within one year of the day on which you run the crawl:
    - New example: https://history.openweathermap.org/data/2.5/history/city?id=1566083&type=hour&start=1685811600&appid=626e8ec21c8de03a592d15a0f2dca7f9
- Each request returns a JSON response containing the weather at the given timestamp (dt) and for the following 23 hours (24 hours in total).
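
The URL format above can be assembled programmatically. The following is a minimal sketch, not part of the original notebook: the helper names `to_timestamp` and `build_history_url` are hypothetical, and it assumes that each date means midnight in Ho Chi Minh City (UTC+7).

```python
from datetime import datetime, timedelta, timezone

# Assumption: dates are interpreted as local midnight in Ho Chi Minh City (UTC+7).
ICT = timezone(timedelta(hours=7))
BASE_URL = "https://history.openweathermap.org/data/2.5/history/city"


def to_timestamp(date_str):
    """Convert 'YYYY-MM-DD' (local midnight, UTC+7) to a Unix timestamp in seconds."""
    local_midnight = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=ICT)
    return int(local_midnight.timestamp())


def build_history_url(city_id, date_str, api_key):
    """Fill the id, start and appid parameters of the history URL (hypothetical helper)."""
    return (
        f"{BASE_URL}?id={city_id}&type=hour"
        f"&start={to_timestamp(date_str)}&appid={api_key}"
    )


print(to_timestamp("2023-06-04"))  # 1685811600
print(build_history_url("1566083", "2023-06-04", "YOUR_API_KEY"))
```

Attaching an explicit time zone keeps the timestamp independent of the time zone of the machine that runs the crawl.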
  
       

## Import

In [1]:
import os
import sys
import pandas as pd
import json
from datetime import datetime, timedelta
import requests

## Find the ID of Ho Chi Minh City
- The city IDs are listed in the file "city.list.json" (data/City_ID/city.list.json).
- Read the file into a DataFrame and look up the ID of Ho Chi Minh City.

In [2]:
# Specify the file path
file_path = '../data/City_ID/city.list.json'

# Read JSON data from the file
with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Create DataFrame
df = pd.json_normalize(data)

# Rename columns
df = df[['city.id.$numberLong', 'city.name']].rename(columns={'city.id.$numberLong': 'city_id', 'city.name': 'city_name'})

name_to_find = 'Thanh pho Ho Chi Minh'

def get_id_by_name(name):
    row = df.loc[df['city_name'] == name]
    if not row.empty:
        return row['city_id'].iloc[0]
    else:
        return None

ID_HCM_city = get_id_by_name(name_to_find)

ID_HCM_city


'1566083'
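
The dotted column names used above come from how `pd.json_normalize` flattens nested objects. The sketch below illustrates this on two hypothetical records whose shape is inferred from those column names (the second city and its ID are invented for illustration), and shows a dictionary-based lookup as one possible alternative to filtering the DataFrame.

```python
import pandas as pd

# Hypothetical records shaped after the column names used in the cell above;
# "Example City" and its ID are made up for illustration.
sample = [
    {"city": {"id": {"$numberLong": "1566083"}, "name": "Thanh pho Ho Chi Minh"}},
    {"city": {"id": {"$numberLong": "0000001"}, "name": "Example City"}},
]

# Nested keys are joined with "." into flat column names
sample_df = pd.json_normalize(sample)
print(sorted(sample_df.columns))

# Build a name -> ID mapping for repeated lookups
id_by_name = dict(zip(sample_df["city.name"], sample_df["city.id.$numberLong"]))
print(id_by_name.get("Thanh pho Ho Chi Minh"))  # 1566083
```

Note that the ID is stored as a string in this layout, which matches the quoted output above.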

In [3]:
def loadDataWeatherOneDay(BASE_URL, ID_HCM_city, timestamp, AIP_ID):
    """Create a DataFrame from a single API request (24 hours of data).

    Args:
        BASE_URL (str): 'https://history.openweathermap.org/data/2.5/history/city'
        ID_HCM_city (str): ID_HCM_city = '1566083'
        timestamp (int): Start time as a Unix timestamp (seconds since 1970-01-01 UTC)
        AIP_ID (str): '626e8ec21c8de03a592d15a0f2dca7f9'

    Returns:
        pd.DataFrame: The resulting DataFrame with the requested data,
            or None if the request failed or returned no data.
    """

    url = f"{BASE_URL}?id={ID_HCM_city}&type=hour&start={timestamp}&appid={AIP_ID}"
    response = requests.get(url)

    # A usable response has status code 200 and a JSON body with more than one key
    if response.status_code == 200 and len(response.json()) > 1:
        data = response.json()

        # Extract the 'list' key from the data
        list_data = data.get('list', [])

        # Create DataFrame
        df = pd.json_normalize(list_data)
        num_columns = pd.json_normalize(df['weather'][0]).shape[1]

        # Use json_normalize to flatten the 'weather' column for all rows
        weather_df = pd.concat([pd.json_normalize(weather) for weather in df['weather']], axis=1)

        # Reshape the DataFrame
        weather_df = pd.DataFrame(weather_df.values.reshape((-1, num_columns)), columns=weather_df.columns[:num_columns])

        # Concatenate the original DataFrame with the new weather DataFrame
        df = pd.concat([df, weather_df], axis=1)

        # Drop the original 'weather' column
        df = df.drop('weather', axis=1)
        return df

    # Print an error message if the API call was not successful
    print("Error in Loading the data. Status Code: " + str(response.status_code))
    return None


def loadDataWeather(BASE_URL, ID_HCM_city, AIP_ID, start_date, end_date):
    """Fetch the data for every day from start_date to end_date.

    Args:
        BASE_URL (str): 'https://history.openweathermap.org/data/2.5/history/city'
        ID_HCM_city (str): ID_HCM_city = '1566083'
        AIP_ID (str): '626e8ec21c8de03a592d15a0f2dca7f9'
        start_date (datetime): First day to collect data for.
        end_date (datetime): Last day to collect data for.

    Returns:
        pd.DataFrame: The resulting DataFrame with the requested data.
    """

    cur_month = start_date.month
    current_date = start_date

    # DataFrame that accumulates the data of all days
    df_full = pd.DataFrame()

    while current_date <= end_date:
        # Naive datetimes are interpreted in the machine's local time zone
        timestamp = int(current_date.timestamp())

        df_dataOneDay = loadDataWeatherOneDay(BASE_URL, ID_HCM_city, timestamp, AIP_ID)

        # Concatenate with the full DataFrame
        df_full = pd.concat([df_full, df_dataOneDay], axis=0)

        # Move on to the next day and report each completed month
        current_date += timedelta(days=1)
        if current_date.month != cur_month:
            print('Complete month: ', cur_month)
            cur_month = current_date.month

    return df_full
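
The `weather` field of each hourly record is a list. If an hour carries more than one entry in that list, the column-wise concat and reshape in `loadDataWeatherOneDay` would presumably yield additional rows whose other columns are empty. The sketch below shows one possible alternative that keeps only the first weather entry per hour. It is not the notebook's method, and the two sample records are hypothetical, modeled on the fields the crawl returns.

```python
import pandas as pd

# Hypothetical two-hour sample in the shape of the API's 'list' entries;
# the second hour carries two weather conditions.
list_data = [
    {"dt": 1671037200, "main": {"temp": 298.16},
     "weather": [{"id": 801, "main": "Clouds", "description": "few clouds", "icon": "02n"}]},
    {"dt": 1671040800, "main": {"temp": 298.16},
     "weather": [{"id": 500, "main": "Rain", "description": "light rain", "icon": "10n"},
                 {"id": 701, "main": "Mist", "description": "mist", "icon": "50n"}]},
]

df = pd.json_normalize(list_data)

# Keep only the first weather entry of each hour, so one row stays one row
first_weather = pd.json_normalize([entries[0] for entries in df["weather"]])

# Both frames share the default 0..n-1 index, so a column-wise concat aligns them
flat = pd.concat([df.drop(columns="weather"), first_weather], axis=1)
print(flat)
```

Dropping the secondary conditions loses some information, so whether this trade-off is acceptable depends on how the weather labels are used later in the analysis.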


## Note: 

- Edit start_date and end_date before running this cell because, as mentioned earlier, the website only serves data from within one year of the date on which the cell is run.
- The cell takes about 15 minutes to finish running.
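
Instead of editing the dates by hand, the window could be derived from the run date. The following is a minimal sketch under stated assumptions: the helper `crawl_window` and its `margin_days` safety margin are hypothetical, and the run date 2023-12-13 is chosen only so that the result reproduces the dates used in this notebook.

```python
from datetime import datetime, timedelta


def crawl_window(today, margin_days=2):
    """Return (start_date, end_date) covering roughly the past year.

    margin_days keeps the start date safely inside the one-year limit
    (hypothetical safety margin, not an API requirement).
    """
    end_date = today - timedelta(days=1)
    start_date = today - timedelta(days=365 - margin_days)
    return start_date, end_date


# Assumed run date, for illustration only
start_date, end_date = crawl_window(datetime(2023, 12, 13))
print(start_date.date(), end_date.date())  # 2022-12-15 2023-12-12
```

In practice, `datetime.now()` truncated to midnight could be passed as `today`.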

In [4]:

# Base URL used in all API calls
BASE_URL = 'https://history.openweathermap.org/data/2.5/history/city'

AIP_ID = '626e8ec21c8de03a592d15a0f2dca7f9'
# ID_HCM_city = '1566083'

start_date_str = '2022-12-15'
end_date_str = '2023-12-12'

start_date = datetime.strptime(start_date_str, '%Y-%m-%d')
end_date = datetime.strptime(end_date_str, '%Y-%m-%d')


# Crawl the data and show the first few rows of the DataFrame
df_weather_HCM_city = loadDataWeather(BASE_URL, ID_HCM_city, AIP_ID, start_date, end_date)
df_weather_HCM_city.head()

Complete month:  12
Complete month:  1
Complete month:  2
Complete month:  3
Complete month:  4
Complete month:  5
Complete month:  6
Complete month:  7
Complete month:  8
Complete month:  9
Complete month:  10
Complete month:  11


| | dt | main.temp | main.feels_like | main.pressure | main.humidity | main.temp_min | main.temp_max | wind.speed | wind.deg | clouds.all | id | main | description | icon | rain.1h | wind.gust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1671037000.0 | 298.16 | 298.62 | 1011.0 | 73.0 | 298.16 | 298.16 | 3.09 | 60.0 | 20.0 | 801 | Clouds | few clouds | 02n | | |
| 1 | 1671041000.0 | 298.16 | 298.62 | 1010.0 | 73.0 | 298.16 | 298.16 | 2.06 | 30.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
| 2 | 1671044000.0 | 297.16 | 297.78 | 1010.0 | 83.0 | 297.16 | 297.16 | 1.54 | 20.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
| 3 | 1671048000.0 | 296.16 | 296.81 | 1009.0 | 88.0 | 296.16 | 296.16 | 1.54 | 320.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
| 4 | 1671052000.0 | 296.16 | 296.81 | 1009.0 | 88.0 | 296.16 | 296.16 | 1.03 | 320.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |


In [5]:
df_weather_HCM_city.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9744 entries, 0 to 23
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   dt               8712 non-null   float64
 1   main.temp        8712 non-null   float64
 2   main.feels_like  8712 non-null   float64
 3   main.pressure    8712 non-null   float64
 4   main.humidity    8712 non-null   float64
 5   main.temp_min    8712 non-null   float64
 6   main.temp_max    8712 non-null   float64
 7   wind.speed       8712 non-null   float64
 8   wind.deg         8712 non-null   float64
 9   clouds.all       8712 non-null   float64
 10  id               8776 non-null   object 
 11  main             8776 non-null   object 
 12  description      8776 non-null   object 
 13  icon             8776 non-null   object 
 14  rain.1h          930 non-null    float64
 15  wind.gust        115 non-null    float64
dtypes: float64(12), object(4)
memory usage: 1.3+ MB


In [6]:
df_weather_HCM_city = df_weather_HCM_city.rename(columns={'dt': 'datetime', 
                                                            'main.temp': 'temp (K)',
                                                            'main.feels_like': 'feels_like',
                                                            'main.pressure': 'pressure',
                                                            'main.humidity': 'humidity',
                                                            'main.temp_min': 'temp_min',
                                                            'main.temp_max': 'temp_max',
                                                            'wind.speed': 'wind_speed',
                                                            'wind.deg': 'wind_deg',
                                                            'clouds.all': 'clouds_all',
                                                            'id': 'id_weather',
                                                            'main': 'main_weather',
                                                            'description': 'description_weather',
                                                            'icon': 'icon_weather',
                                                            'rain.1h': 'rain_1h',
                                                            'wind.gust': 'wind_gust',
                                                           })

- datetime: Time of the observation, as a Unix timestamp in seconds.
- temp (K): Temperature, measured in Kelvin.
- feels_like: "Feels like" temperature in Kelvin, taking into account factors such as wind and humidity.
- pressure: Atmospheric pressure (hPa).
- humidity: Air humidity (%).
- temp_min: Minimum temperature during the measurement period (Kelvin).
- temp_max: Maximum temperature during the measurement period (Kelvin).
- wind_speed: Wind speed (m/s).
- wind_deg: Wind direction, measured in degrees.
- clouds_all: Percentage of cloud coverage.
- id_weather: Identifier code of the weather condition.
- main_weather: Main weather condition (e.g., rain, clear sky).
- description_weather: Detailed description of the weather condition.
- icon_weather: Icon code representing the weather condition.
- rain_1h: Amount of rainfall in the last hour (mm).
- wind_gust: Wind gust speed (m/s).


In [7]:
df_weather_HCM_city.head()

| | datetime | temp (K) | feels_like | pressure | humidity | temp_min | temp_max | wind_speed | wind_deg | clouds_all | id_weather | main_weather | description_weather | icon_weather | rain_1h | wind_gust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1671037000.0 | 298.16 | 298.62 | 1011.0 | 73.0 | 298.16 | 298.16 | 3.09 | 60.0 | 20.0 | 801 | Clouds | few clouds | 02n | | |
| 1 | 1671041000.0 | 298.16 | 298.62 | 1010.0 | 73.0 | 298.16 | 298.16 | 2.06 | 30.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
| 2 | 1671044000.0 | 297.16 | 297.78 | 1010.0 | 83.0 | 297.16 | 297.16 | 1.54 | 20.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
| 3 | 1671048000.0 | 296.16 | 296.81 | 1009.0 | 88.0 | 296.16 | 296.16 | 1.54 | 320.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
| 4 | 1671052000.0 | 296.16 | 296.81 | 1009.0 | 88.0 | 296.16 | 296.16 | 1.03 | 320.0 | 40.0 | 802 | Clouds | scattered clouds | 03n | | |
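
Because `datetime` holds Unix timestamps and `temp (K)` is in Kelvin, both are usually converted before analysis. The sketch below shows one way to do this on a small hypothetical sample modeled on the first rows. The new column names `datetime_local` and `temp (C)` are assumptions, as is the choice of a fixed UTC+7 offset for Ho Chi Minh City.

```python
from datetime import timedelta, timezone

import pandas as pd

# Hypothetical sample modeled on the first rows of the crawled data
sample = pd.DataFrame({
    "datetime": [1671037200.0, 1671040800.0],
    "temp (K)": [298.16, 297.16],
})

# Assumption: local time in Ho Chi Minh City is a fixed UTC+7 offset
ICT = timezone(timedelta(hours=7))

# Unix seconds -> timezone-aware timestamps, then shift to local time
sample["datetime_local"] = (
    pd.to_datetime(sample["datetime"], unit="s", utc=True).dt.tz_convert(ICT)
)

# Kelvin -> degrees Celsius
sample["temp (C)"] = (sample["temp (K)"] - 273.15).round(2)

print(sample[["datetime_local", "temp (C)"]])
```

The conversion is left out of the raw export here, so the saved CSV keeps the values exactly as the API returned them.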


In [8]:

df_weather_HCM_city.to_csv("../data/crawl/raw_data.csv", sep=',', encoding='utf-8', index=False)