## Objective:
Input:
 - 'country', 'location_name', 'last_updated'
Output:
 - 'air_quality_PM2.5', 'air_quality_PM10', 'air_quality_Nitrogen_dioxide', 'air_quality_Ozone'

Problem type: Multi-output regression
Data: GlobalWeatherRepository.csv
Live-updated dataset: https://www.kaggle.com/datasets/nelgiriyewithana/global-weather-repository/data
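
To make the term concrete: in multi-output regression a single model predicts several target columns at once, so the target is a matrix with one column per pollutant rather than a single vector. The sketch below is only an illustration on synthetic data, using ordinary least squares from NumPy as a stand-in model; the sizes, the variable names, and the choice of model are assumptions and are not taken from this notebook.

```python
import numpy as np

# Synthetic stand-in data (assumption): 3 numeric inputs, 4 targets,
# mirroring the four air-quality outputs listed above.
rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 200, 3, 4

X = rng.normal(size=(n_samples, n_features))
true_W = rng.normal(size=(n_features, n_targets))
Y = X @ true_W + 0.01 * rng.normal(size=(n_samples, n_targets))

# Append a column of ones so the fit also learns one intercept per target.
X_design = np.column_stack([X, np.ones(n_samples)])

# lstsq accepts a 2-D right-hand side and solves for all targets in one call.
W_hat, *_ = np.linalg.lstsq(X_design, Y, rcond=None)

Y_pred = X_design @ W_hat
print(Y_pred.shape)  # one predicted column per target
```

Each column of `W_hat` holds the coefficients for one target, which is why the same design matrix can serve all four outputs.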


### Step 1: Load the data

In [7]:
import pandas as pd

df = pd.read_csv('GlobalWeatherRepository.csv')
df.head()

Unnamed: 0,country,location_name,latitude,longitude,timezone,last_updated_epoch,last_updated,temperature_celsius,temperature_fahrenheit,condition_text,...,air_quality_PM2.5,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,sunrise,sunset,moonrise,moonset,moon_phase,moon_illumination
0,Afghanistan,Kabul,34.52,69.18,Asia/Kabul,1715849100,5/16/2024 13:15,26.6,79.8,Partly Cloudy,...,8.4,26.6,1,1,4:50 AM,6:50 PM,12:12 PM,1:11 AM,Waxing Gibbous,55
1,Albania,Tirana,41.33,19.82,Europe/Tirane,1715849100,5/16/2024 10:45,19.0,66.2,Partly cloudy,...,1.1,2.0,1,1,5:21 AM,7:54 PM,12:58 PM,2:14 AM,Waxing Gibbous,55
2,Algeria,Algiers,36.76,3.05,Africa/Algiers,1715849100,5/16/2024 9:45,23.0,73.4,Sunny,...,10.4,18.4,1,1,5:40 AM,7:50 PM,1:15 PM,2:14 AM,Waxing Gibbous,55
3,Andorra,Andorra La Vella,42.5,1.52,Europe/Andorra,1715849100,5/16/2024 10:45,6.3,43.3,Light drizzle,...,0.7,0.9,1,1,6:31 AM,9:11 PM,2:12 PM,3:31 AM,Waxing Gibbous,55
4,Angola,Luanda,-8.84,13.23,Africa/Luanda,1715849100,5/16/2024 9:45,26.0,78.8,Partly cloudy,...,183.4,262.3,5,10,6:12 AM,5:55 PM,1:17 PM,12:38 AM,Waxing Gibbous,55


In [8]:
df.describe()

Unnamed: 0,latitude,longitude,last_updated_epoch,temperature_celsius,temperature_fahrenheit,wind_mph,wind_kph,wind_degree,pressure_mb,pressure_in,...,gust_kph,air_quality_Carbon_Monoxide,air_quality_Ozone,air_quality_Nitrogen_dioxide,air_quality_Sulphur_dioxide,air_quality_PM2.5,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,moon_illumination
count,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,...,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0,101334.0
mean,19.155224,22.085921,1738407000.0,22.720556,72.898746,8.200674,13.201111,170.783498,1014.026309,29.943536,...,18.451695,509.579213,62.192365,15.724127,11.126171,25.792763,52.795425,1.74813,2.730219,50.243403
std,24.451675,65.813015,13019540.0,8.860898,15.949427,7.785981,12.527817,102.7688,11.234957,0.331722,...,14.537516,833.376821,32.050807,25.794196,40.108342,40.276838,161.923022,0.975419,2.547446,35.00676
min,-41.3,-175.2,1715849000.0,-24.9,-12.8,2.2,3.6,1.0,947.0,27.96,...,3.6,-9999.0,0.0,0.0,-9999.0,0.168,-1848.15,1.0,1.0,0.0
25%,3.75,-6.8361,1727171000.0,18.0,64.5,4.0,6.5,83.0,1010.0,29.83,...,10.4,233.1,42.0,1.4,0.9,7.4,10.7,1.0,1.0,15.0
50%,17.25,23.3167,1738407000.0,24.8,76.6,6.9,11.2,164.0,1013.0,29.93,...,15.7,323.75,59.0,4.995,2.405,14.985,21.645,1.0,2.0,51.0
75%,40.4,50.58,1749718000.0,28.3,82.9,11.2,18.0,256.0,1018.0,30.05,...,24.4,501.35,79.0,17.945,8.9,29.415,44.955,2.0,3.0,85.0
max,64.15,179.22,1760860000.0,49.2,120.6,1841.2,2963.2,360.0,3006.0,88.77,...,2970.4,38879.398,480.7,427.7,521.33,1614.1,6037.29,6.0,10.0,100.0
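
The summary above shows minimum values of -9999.0 for `air_quality_Carbon_Monoxide` and `air_quality_Sulphur_dioxide`, and -1848.15 for `air_quality_PM10`. Concentrations cannot be negative, so these look like placeholder or corrupted readings rather than measurements. One possible way to handle them, sketched below on a small made-up frame, is to count the negative entries per column and replace them with NaN so that a later `dropna()` removes those rows. The helper name `mask_negative_readings` is hypothetical, and treating every negative value as invalid is an assumption.

```python
import pandas as pd

def mask_negative_readings(frame, columns):
    """Return a copy with negative values in `columns` set to NaN,
    together with the number of negative values found per column."""
    cleaned = frame.copy()
    negative = cleaned[columns] < 0
    counts = negative.sum()
    cleaned[columns] = cleaned[columns].mask(negative)
    return cleaned, counts

# Made-up rows that imitate the problem; these are not rows from the dataset.
sample = pd.DataFrame({
    "air_quality_Carbon_Monoxide": [233.1, -9999.0, 501.35],
    "air_quality_PM10": [10.7, 21.645, -1848.15],
})

cleaned_sample, negative_counts = mask_negative_readings(sample, list(sample.columns))
print(negative_counts)
print(cleaned_sample)
```

The function works on a copy, so the original frame stays available for comparison.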


### Step 2: Convert and filter the data

In [9]:
df['last_updated'] = pd.to_datetime(df['last_updated'])

# Select the required columns
features = [
    'country', 'location_name', 'last_updated', 'temperature_celsius', 'humidity',
    'wind_kph', 'pressure_mb', 'uv_index', 'cloud', 'precip_mm'
]

# Target
targets = [
    'air_quality_PM2.5', 'air_quality_PM10', 'air_quality_Nitrogen_dioxide', 'air_quality_Ozone'
]

data = df[features + targets].dropna()
data.head()

Unnamed: 0,country,location_name,last_updated,temperature_celsius,humidity,wind_kph,pressure_mb,uv_index,cloud,precip_mm,air_quality_PM2.5,air_quality_PM10,air_quality_Nitrogen_dioxide,air_quality_Ozone
0,Afghanistan,Kabul,2024-05-16 13:15:00,26.6,24,13.3,1012,7.0,30,0.0,8.4,26.6,1.1,103.0
1,Albania,Tirana,2024-05-16 10:45:00,19.0,94,11.2,1012,5.0,75,0.1,1.1,2.0,0.9,97.3
2,Algeria,Algiers,2024-05-16 09:45:00,23.0,29,15.1,1011,5.0,0,0.0,10.4,18.4,65.1,12.2
3,Andorra,Andorra La Vella,2024-05-16 10:45:00,6.3,61,11.9,1007,2.0,100,0.3,0.7,0.9,1.6,64.4
4,Angola,Luanda,2024-05-16 09:45:00,26.0,89,13.0,1011,8.0,50,0.0,183.4,262.3,72.7,19.0
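
The conversion in this step lets pandas infer the date format. The raw values shown in Step 1 (for example `5/16/2024 13:15`) appear to be month/day/year with a 24-hour time, so an explicit format is one alternative. The sketch below uses made-up strings in that shape; whether every row of the full CSV follows this format is an assumption that would need checking.

```python
import pandas as pd

# Made-up strings shaped like the raw `last_updated` column, plus one bad entry.
raw = pd.Series(["5/16/2024 13:15", "5/16/2024 9:45", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# entries that do not match into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that failed to parse
```

Because `last_updated` is among the selected columns, rows that end up as NaT would be dropped by the `dropna()` call.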


### Step 3: Process the data