# Predicting diseases from weather and climate data

This notebook walks through a workflow for predicting likely disease outcomes from the weather and health data in the file `global_climate_health_impact_tracker_2015_2025.csv`.

## 1. Import the required libraries and load the data

- Uses pandas, numpy, matplotlib, seaborn, and scikit-learn.
- Reads the file `global_climate_health_impact_tracker_2015_2025.csv` into a DataFrame.

In [16]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Load the health data (the main dataset)
file_path = 'data/global_climate_health_impact_tracker_2015_2025.csv'
df = pd.read_csv(file_path)

# Load the supplementary weather dataset (GlobalWeatherRepository)
weather_path = 'data/GlobalWeatherRepository.csv'
df_weather = pd.read_csv(weather_path)

print('Health rows,cols:', df.shape)
print('Weather rows,cols:', df_weather.shape)

# Show the first rows of each table as a quick check
display(df.head())
display(df_weather.head())

Health rows,cols: (14100, 30)
Weather rows,cols: (118882, 41)


Unnamed: 0,record_id,country_code,country_name,region,income_level,date,year,month,week,latitude,...,air_quality_index,respiratory_disease_rate,cardio_mortality_rate,vector_disease_risk_score,waterborne_disease_incidents,heat_related_admissions,healthcare_access_index,gdp_per_capita_usd,mental_health_index,food_security_index
0,1,USA,United States,North America,High,2015-01-04,2015,1,1,37.09,...,82.0,69.4,31.5,6.6,16.2,1.4,77.3,63627.0,71.2,90.2
1,2,USA,United States,North America,High,2015-01-11,2015,1,2,37.09,...,6.0,70.0,26.3,5.2,11.4,0.0,83.6,63627.0,70.6,94.0
2,3,USA,United States,North America,High,2015-01-18,2015,1,3,37.09,...,137.0,66.9,33.4,1.3,19.5,0.0,84.7,63627.0,63.4,100.0
3,4,USA,United States,North America,High,2015-01-25,2015,1,4,37.09,...,-3.0,47.0,35.0,6.0,9.7,9.0,84.3,63627.0,68.1,96.4
4,5,USA,United States,North America,High,2015-02-01,2015,2,5,37.09,...,48.0,61.3,28.3,1.4,22.6,27.3,83.6,63733.0,69.1,100.0


Unnamed: 0,country,location_name,latitude,longitude,timezone,last_updated_epoch,last_updated,temperature_celsius,temperature_fahrenheit,condition_text,...,air_quality_PM2.5,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,sunrise,sunset,moonrise,moonset,moon_phase,moon_illumination
0,Afghanistan,Kabul,34.52,69.18,Asia/Kabul,1715849100,2024-05-16 13:15,26.6,79.8,Partly Cloudy,...,8.4,26.6,1,1,04:50 AM,06:50 PM,12:12 PM,01:11 AM,Waxing Gibbous,55
1,Albania,Tirana,41.33,19.82,Europe/Tirane,1715849100,2024-05-16 10:45,19.0,66.2,Partly cloudy,...,1.1,2.0,1,1,05:21 AM,07:54 PM,12:58 PM,02:14 AM,Waxing Gibbous,55
2,Algeria,Algiers,36.76,3.05,Africa/Algiers,1715849100,2024-05-16 09:45,23.0,73.4,Sunny,...,10.4,18.4,1,1,05:40 AM,07:50 PM,01:15 PM,02:14 AM,Waxing Gibbous,55
3,Andorra,Andorra La Vella,42.5,1.52,Europe/Andorra,1715849100,2024-05-16 10:45,6.3,43.3,Light drizzle,...,0.7,0.9,1,1,06:31 AM,09:11 PM,02:12 PM,03:31 AM,Waxing Gibbous,55
4,Angola,Luanda,-8.84,13.23,Africa/Luanda,1715849100,2024-05-16 09:45,26.0,78.8,Partly cloudy,...,183.4,262.3,5,10,06:12 AM,05:55 PM,01:17 PM,12:38 AM,Waxing Gibbous,55


## 2. Explore and preprocess the data

- Check missing values, data types, and basic statistics.
- Clean and preprocess the data (handle missing values, encode categorical variables, ...).
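
As a minimal sketch of the first check, the missing counts, missing shares, and dtypes can be gathered into a single table. The toy DataFrame and the helper name `missing_report` are hypothetical stand-ins and are not part of this notebook.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing count, missing share, and dtype per column (hypothetical helper)."""
    report = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "share_missing": frame.isnull().mean(),
        "dtype": frame.dtypes.astype(str),
    })
    # Columns with the most missing values come first
    return report.sort_values("n_missing", ascending=False)


# Toy data standing in for the health DataFrame (values are illustrative)
toy = pd.DataFrame({
    "temperature_celsius": [4.59, np.nan, 3.99, 6.43],
    "region": ["North America", None, "Europe", "Europe"],
    "year": [2015, 2015, 2015, 2015],
})
report = missing_report(toy)
print(report)
```

On the real data, the same helper could be called as `missing_report(df)`.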

In [17]:
# Overview of the data
df.info()

# Descriptive statistics for the numeric columns
# (wrapped in display() because a cell only renders its last expression automatically)
display(df.describe())

# Count missing values per column
display(df.isnull().sum())

# Show the unique values of the object (string) columns
for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}: {df[col].unique()[:10]}")  # show the first 10 unique values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14100 entries, 0 to 14099
Data columns (total 30 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   record_id                     14100 non-null  int64  
 1   country_code                  14100 non-null  object 
 2   country_name                  14100 non-null  object 
 3   region                        14100 non-null  object 
 4   income_level                  14100 non-null  object 
 5   date                          14100 non-null  object 
 6   year                          14100 non-null  int64  
 7   month                         14100 non-null  int64  
 8   week                          14100 non-null  int64  
 9   latitude                      14100 non-null  float64
 10  longitude                     14100 non-null  float64
 11  population_millions           14100 non-null  int64  
 12  temperature_celsius           14100 non-null  float64
 13  t

In [18]:
# --- Align and merge health + weather datasets on approximate lat/lon and date ---
# Prepare health dataframe
df_health = df.copy()
df_health['date'] = pd.to_datetime(df_health['date'], errors='coerce').dt.date

# Prepare weather dataframe date (from last_updated or epoch)
if 'last_updated' in df_weather.columns:
    df_weather['date'] = pd.to_datetime(df_weather['last_updated'], errors='coerce').dt.date
elif 'last_updated_epoch' in df_weather.columns:
    df_weather['date'] = pd.to_datetime(df_weather['last_updated_epoch'], unit='s', errors='coerce').dt.date
else:
    df_weather['date'] = pd.NaT

# Ensure lat/lon numeric and create rounded versions for approximate join
for col in ['latitude','longitude']:
    if col in df_health.columns:
        df_health[col] = pd.to_numeric(df_health[col], errors='coerce')
    if col in df_weather.columns:
        df_weather[col] = pd.to_numeric(df_weather[col], errors='coerce')

df_health['lat_r'] = df_health['latitude'].round(2)
df_health['lon_r'] = df_health['longitude'].round(2)
df_weather['lat_r'] = df_weather['latitude'].round(2)
df_weather['lon_r'] = df_weather['longitude'].round(2)

# Show overlapping column names (lower-cased) to help choose common variables
cols_h = set([c.lower() for c in df_health.columns])
cols_w = set([c.lower() for c in df_weather.columns])
common = sorted(list(cols_h & cols_w))
print('Raw common column names (lowercased):', common)

# Manual candidate mapping for canonical features (keys we want in final X)
canonical_candidates = {
    # The Fahrenheit column is left out so a Celsius feature is never filled with Fahrenheit values
    'temperature_celsius': ['temperature_celsius','temp_celsius','temperature_celsius_health','temperature_celsius_weather'],
    'precipitation_mm': ['precipitation_mm','precip_mm','precipitation','precip_mm_weather'],
    'pm25_ugm3': ['pm25_ugm3','air_quality_pm2.5','air_quality_pm2_5','air_quality_PM2.5'],
    'humidity': ['humidity'],
    'pressure_mb': ['pressure_mb','pressure_mb_weather'],
    'wind_kph': ['wind_kph','wind_kph_weather'],
    'latitude': ['latitude'],
    'longitude': ['longitude'],
}

# Merge: left join weather onto health on rounded lat/lon and date
merged = pd.merge(df_health, df_weather, on=['lat_r','lon_r','date'], how='left', suffixes=('_health','_weather'))
print('Merged shape:', merged.shape)

# For each canonical feature, pick the first matching column from merged (case-insensitive)
selected_features = {}
for canon, candidates in canonical_candidates.items():
    found = None
    for cand in candidates:
        matches = [c for c in merged.columns if c.lower() == cand.lower() or cand.lower() in c.lower()]
        if matches:
            found = matches[0]
            break
    if found:
        selected_features[canon] = found

print('Selected feature columns to use:', selected_features)

# Build X_common using the selected columns (canonical names)
X_common = pd.DataFrame()
for canon, colname in selected_features.items():
    X_common[canon] = merged[colname]

# Add some contextual features from health if available
for extra in ['population_millions','year','month','region','income_level']:
    if extra in merged.columns:
        X_common[extra] = merged[extra]

# Labels (same as before)
label_cols = [
    'vector_disease_risk_score',
    'heat_related_admissions',
    'respiratory_disease_rate'
    ]
y_merged = merged[label_cols].copy()

# Keep rows where X has at least one non-null and y has at least one non-null
keep_mask = ~X_common.isnull().all(axis=1) & (~y_merged.isnull().all(axis=1))
X_common = X_common[keep_mask].copy()
y_merged = y_merged[keep_mask].copy()

print('Final X_common shape:', X_common.shape, 'y shape:', y_merged.shape)
display(X_common.head())
display(y_merged.head())

Raw common column names (lowercased): ['date', 'lat_r', 'latitude', 'lon_r', 'longitude', 'temperature_celsius']
Merged shape: (14100, 73)
Selected feature columns to use: {'temperature_celsius': 'temperature_celsius_health', 'precipitation_mm': 'precipitation_mm', 'pm25_ugm3': 'pm25_ugm3', 'humidity': 'humidity', 'pressure_mb': 'pressure_mb', 'wind_kph': 'wind_kph', 'latitude': 'latitude_health', 'longitude': 'longitude_health'}
Final X_common shape: (14100, 13) y shape: (14100, 3)


Unnamed: 0,temperature_celsius,precipitation_mm,pm25_ugm3,humidity,pressure_mb,wind_kph,latitude,longitude,population_millions,year,month,region,income_level
0,4.59,75.7,39.0,,,,37.09,-95.71,331,2015,1,North America,High
1,3.13,97.0,17.9,,,,37.09,-95.71,331,2015,1,North America,High
2,3.99,74.1,91.5,,,,37.09,-95.71,331,2015,1,North America,High
3,6.43,87.7,5.5,,,,37.09,-95.71,331,2015,1,North America,High
4,9.0,75.8,37.1,,,,37.09,-95.71,331,2015,2,North America,High


Unnamed: 0,vector_disease_risk_score,heat_related_admissions,respiratory_disease_rate
0,6.6,1.4,69.4
1,5.2,0.0,70.0
2,1.3,0.0,66.9
3,6.0,9.0,47.0
4,1.4,27.3,61.3
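
The first rows of `X_common` show `NaN` for `humidity`, `pressure_mb`, and `wind_kph`, which suggests that at least these health rows found no weather row with the same rounded coordinates and date. One way to quantify that, sketched here on toy frames with illustrative values, is to pass `indicator=True` to the merge and count how many left-hand rows actually matched.

```python
import pandas as pd

# Toy stand-ins for df_health and df_weather (values are illustrative only)
health = pd.DataFrame({
    "lat_r": [37.09, 37.09, 34.52],
    "lon_r": [-95.71, -95.71, 69.18],
    "date": pd.to_datetime(["2015-01-04", "2015-01-11", "2024-05-16"]).date,
    "respiratory_disease_rate": [69.4, 70.0, 55.0],
})
weather = pd.DataFrame({
    "lat_r": [34.52, 41.33],
    "lon_r": [69.18, 19.82],
    "date": pd.to_datetime(["2024-05-16", "2024-05-16"]).date,
    "humidity": [24, 94],
})

# indicator=True adds a `_merge` column: 'both' for matched rows, 'left_only' otherwise
merged_toy = health.merge(
    weather, on=["lat_r", "lon_r", "date"], how="left", indicator=True
)
match_rate = (merged_toy["_merge"] == "both").mean()

print(merged_toy["_merge"].value_counts())
print(f"Share of health rows with a weather match: {match_rate:.1%}")
```

If the match rate on the real data turns out to be close to zero, the weather-only columns carry almost no information and the join keys (coordinate rounding, date granularity) would need to be revisited.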


In [19]:
# Preprocess the data
# Example: fill missing values and encode categorical variables

df_clean = df.copy()

# Fill missing values in numeric columns with the column mean
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
for col in df_clean.select_dtypes(include=[np.number]).columns:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mean())

# Fill missing values in object columns with the mode
for col in df_clean.select_dtypes(include=['object']).columns:
    df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

# Encode categorical variables (if needed)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in df_clean.select_dtypes(include=['object']).columns:
    df_clean[col] = le.fit_transform(df_clean[col])

df_clean.head()
df_clean.columns


Index(['record_id', 'country_code', 'country_name', 'region', 'income_level',
       'date', 'year', 'month', 'week', 'latitude', 'longitude',
       'population_millions', 'temperature_celsius', 'temp_anomaly_celsius',
       'precipitation_mm', 'heat_wave_days', 'drought_indicator',
       'flood_indicator', 'extreme_weather_events', 'pm25_ugm3',
       'air_quality_index', 'respiratory_disease_rate',
       'cardio_mortality_rate', 'vector_disease_risk_score',
       'waterborne_disease_incidents', 'heat_related_admissions',
       'healthcare_access_index', 'gdp_per_capita_usd', 'mental_health_index',
       'food_security_index'],
      dtype='object')
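
Cell 19 refits one shared `LabelEncoder` for every column, so after the loop the encoder only remembers the classes of the last column it saw. If the original category names need to be recovered later, one option is to keep a separate fitted encoder per column. The sketch below uses toy data; the `encoders` dictionary is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns (illustrative values)
toy = pd.DataFrame({
    "region": ["North America", "Europe", "Europe", "Asia"],
    "income_level": ["High", "High", "Low", "Middle"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in ["region", "income_level"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each column can be mapped back to its original labels
restored = encoders["region"].inverse_transform(encoded["region"])
print(encoded)
print(list(restored))
```

Label encoding also imposes an arbitrary order on the categories, which is why the later preprocessing cell switches to one-hot encoding for `X`.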

## 3. Feature extraction and engineering

- Select or create features related to weather, climate, region, and similar factors that influence disease.
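
One example of a created feature, offered as an assumption rather than something this notebook does: `month` is cyclical (December is next to January), so it can be projected onto a circle with a sine and a cosine. The column names `month_sin` and `month_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"month": [1, 4, 7, 10, 12]})

# Map months 1..12 onto angles around a full circle
angle = 2 * np.pi * (toy["month"] - 1) / 12
toy["month_sin"] = np.sin(angle)
toy["month_cos"] = np.cos(angle)

print(toy.round(3))
```

With this encoding, December and January end up close together in feature space, which the raw integers 12 and 1 do not express.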

In [20]:
# Use the common features found by the merge step if available
label_cols = [
    'vector_disease_risk_score',
    'heat_related_admissions',
    'respiratory_disease_rate'
    ]
try:
    # If the alignment cell was executed, use X_common and y_merged
    X = X_common.copy()
    y = y_merged.copy()
    print('Using X_common from merged datasets')
except NameError:
    # Fallback: use df_clean as before
    features = [col for col in df_clean.columns if col not in label_cols]
    X = df_clean[features]
    y = df_clean[label_cols]
    print('X_common not found; using df_clean features')

print('Features used:', list(X.columns))
print('Disease labels:', list(y.columns))

Using X_common from merged datasets
Features used: ['temperature_celsius', 'precipitation_mm', 'pm25_ugm3', 'humidity', 'pressure_mb', 'wind_kph', 'latitude', 'longitude', 'population_millions', 'year', 'month', 'region', 'income_level']
Disease labels: ['vector_disease_risk_score', 'heat_related_admissions', 'respiratory_disease_rate']


## 4. Split the data into training and test sets

- Use train_test_split to split the data.

In [21]:
# Preprocessing before train/test split: ensure all X features are numeric, handle categoricals and missing values
X_proc = X.copy()
y_proc = y.copy()

# Ensure y numeric and drop rows with missing target values
y_proc = y_proc.apply(pd.to_numeric, errors='coerce')
mask_y = ~y_proc.isnull().any(axis=1)
X_proc = X_proc.loc[mask_y].copy()
y_proc = y_proc.loc[mask_y].copy()

# Convert categorical object columns in X to dummies (one-hot) if any
obj_cols = X_proc.select_dtypes(include=['object','category']).columns.tolist()
if obj_cols:
    X_proc = pd.get_dummies(X_proc, columns=obj_cols, dummy_na=True)

# Convert remaining values to numeric (coerce errors) and fill numeric NaNs with median
X_proc = X_proc.apply(pd.to_numeric, errors='coerce')
for col in X_proc.columns:
    if X_proc[col].dtype.kind in 'biufc':
        # Assign the result back; fillna(inplace=True) on a column selection is chained assignment
        X_proc[col] = X_proc[col].fillna(X_proc[col].median())

# Final assignments used by downstream cells
X = X_proc
y = y_proc

print('Preprocessing complete. X shape:', X.shape, 'y shape:', y.shape)

Preprocessing complete. X shape: (14100, 24) y shape: (14100, 3)
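
Cell 21 computes the medians over all rows before the split, so values from the future test set influence the imputation. One way to avoid this, sketched here on synthetic data, is to put the imputer inside a scikit-learn `Pipeline` so that it is fitted on the training split only. The data and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_toy = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["temperature_celsius", "precipitation_mm", "pm25_ugm3"],
)
# Knock out 10% of one column to simulate missing values
X_toy.loc[X_toy.sample(frac=0.1, random_state=0).index, "pm25_ugm3"] = np.nan
y_toy = 2.0 * X_toy["temperature_celsius"] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians learned during fit only
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_tr, y_tr)

# statistics_ holds one learned median per column, in column order
train_median = model.named_steps["impute"].statistics_[2]
print("pm25_ugm3 median learned from the training split:", train_median)
print("Test R2:", model.score(X_te, y_te))
```

The test rows are then imputed with the training medians when `predict` or `score` is called.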



In [22]:
# Limit the number of rows used for training
# To use every row instead:
# n_samples = df_clean.shape[0]
n_samples = 5000
X_limited = X.iloc[:n_samples]
y_limited = y.iloc[:n_samples]

X_train, X_test, y_train, y_test = train_test_split(
    X_limited, y_limited, test_size=0.2, random_state=42)

print('Training set size:', X_train.shape)
print('Test set size:', X_test.shape)

Training set size: (4000, 24)
Test set size: (1000, 24)
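
Note that `X.iloc[:n_samples]` keeps the first 5000 rows in file order. The preview in section 1 suggests the file is ordered by country and then by date; if so, this subset covers only some of the countries. A sketch of drawing a random subset instead, on a toy frame built to mimic that assumed ordering:

```python
import numpy as np
import pandas as pd

# Toy frame ordered by country, mimicking the layout suggested by df.head()
toy = pd.DataFrame({
    "country_code": np.repeat(["USA", "BRA", "IND", "VNM"], 250),
    "value": np.arange(1000),
})

n_samples = 200
first_rows = toy.iloc[:n_samples]                       # file order
random_rows = toy.sample(n=n_samples, random_state=42)  # random subset

print("Countries in first rows: ", sorted(first_rows["country_code"].unique()))
print("Countries in random rows:", sorted(random_rows["country_code"].unique()))
```

In the notebook, the sampled index would have to be applied to both `X` and `y`, for example `X_limited = X.sample(n=n_samples, random_state=42)` followed by `y_limited = y.loc[X_limited.index]`.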


## 5. Train the disease prediction model

- Use a regression model, because the three targets are continuous scores and rates (here a RandomForestRegressor wrapped in MultiOutputRegressor).

In [23]:
# Train a Random Forest model for multi-output regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor

clf = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42))
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)
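
After fitting, `MultiOutputRegressor` keeps one fitted forest per target in its `estimators_` attribute, which makes it possible to inspect which features each target relies on. The sketch below uses synthetic data; the column names are borrowed from this notebook, but the values and relationships are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["temperature_celsius", "precipitation_mm", "pm25_ugm3"],
)
# Synthetic targets: each one depends on a different feature
y_toy = pd.DataFrame({
    "heat_related_admissions": 3.0 * X_toy["temperature_celsius"] + rng.normal(scale=0.1, size=300),
    "respiratory_disease_rate": 3.0 * X_toy["pm25_ugm3"] + rng.normal(scale=0.1, size=300),
})

reg = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=42))
reg.fit(X_toy, y_toy)

# One column of importances per target, one row per feature
importances = pd.DataFrame(
    {target: est.feature_importances_ for target, est in zip(y_toy.columns, reg.estimators_)},
    index=X_toy.columns,
)
print(importances.round(3))
```

Applied to the notebook's model, the same pattern would use `clf.estimators_`, `X_train.columns`, and `label_cols`.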

## 6. Evaluate model performance

- Metrics used: mean squared error (MSE) and the coefficient of determination (R2), reported separately for each target.

In [24]:
# Evaluate the model for each disease label
from sklearn.metrics import mean_squared_error, r2_score
for i, label in enumerate(label_cols):
    print(f'--- {label} ---')
    print('MSE:', mean_squared_error(y_test.iloc[:, i], y_pred[:, i]))
    print('R2:', r2_score(y_test.iloc[:, i], y_pred[:, i]))
    print()

--- vector_disease_risk_score ---
MSE: 23.08840168599999
R2: 0.8769108324320471

--- heat_related_admissions ---
MSE: 25.892701113
R2: 0.725952719536976

--- respiratory_disease_rate ---
MSE: 111.25292805699998
R2: 0.4882360177695375
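
MSE is expressed in squared target units, which makes the three values hard to compare directly. Taking the square root (RMSE) returns to the units of each target; for example, an MSE of 111.25 corresponds to an RMSE of about 10.5. Comparing against a baseline that always predicts the mean gives a reference point. The sketch below uses synthetic values, not the notebook's predictions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(1)
y_true = rng.normal(loc=60.0, scale=10.0, size=500)  # synthetic target
y_hat = y_true + rng.normal(scale=5.0, size=500)     # synthetic predictions

rmse = np.sqrt(mean_squared_error(y_true, y_hat))
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

# Baseline that ignores the features and always predicts the mean of y
dummy_X = np.zeros((len(y_true), 1))
baseline = DummyRegressor(strategy="mean").fit(dummy_X, y_true)
baseline_rmse = np.sqrt(mean_squared_error(y_true, baseline.predict(dummy_X)))

print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R2: {r2:.3f}")
print(f"Mean-baseline RMSE: {baseline_rmse:.2f}")
```

In the notebook, the baseline would be fitted on `y_train` and scored on `y_test` so that it is evaluated under the same conditions as the forest.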

