## Exploratory Data Analysis - Before Preprocessing

This notebook performs exploratory data analysis (EDA) on the raw dataset 
before applying any preprocessing steps.

### Checks performed:
1. **Check overall structure**: function overview_structure
2. **Get station data (latitude, longitude, date range)**: functions get_lat_lon, get_date
3. **Check missing values**: function check_missing_values
4. **Check duplicate rows**: function check_duplicate_rows
5. **Describe data (min, max, quantiles, std)**: function describe_data

In [3]:
import pandas as pd

In [4]:
df_CaMau = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/Ca Mau_Final.csv')
df_LangSon = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/Lang Son_Final.csv')
df_LaoCai = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/Lao Cai_Final.csv')
df_NoiBai = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/NoiBai_Final.csv')
df_PhuBai = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/Phu Bai_Final.csv')
df_QuyNhon = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/Quy Nhon_Final.csv')
df_TPHCM = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/TPHCM_Final.csv')
df_Vinh = pd.read_csv('/kaggle/input/raw-at-data/DATA_SENT SV/Vinh_Final.csv')

station_dfs = {
    'Noi_Bai': df_NoiBai,
    'Lang_Son': df_LangSon,
    'Lao_Cai': df_LaoCai,

    'Vinh': df_Vinh,
    'Phu_Bai': df_PhuBai,
    'Quy_Nhon': df_QuyNhon,

    'TPHCM': df_TPHCM,
    'Ca_Mau': df_CaMau
}
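
The eight `read_csv` calls above differ only in the file name, so the same dictionary can be built from a name-to-file mapping. The following is a minimal sketch, not the notebook's own loader: `STATION_FILES` and `load_station_dfs` are hypothetical names introduced here, and the file names are copied from the calls above.

```python
from pathlib import Path

import pandas as pd

# Station key -> CSV file name, copied from the read_csv calls above.
# STATION_FILES and load_station_dfs are illustrative names, not part of the notebook.
STATION_FILES = {
    "Noi_Bai": "NoiBai_Final.csv",
    "Lang_Son": "Lang Son_Final.csv",
    "Lao_Cai": "Lao Cai_Final.csv",
    "Vinh": "Vinh_Final.csv",
    "Phu_Bai": "Phu Bai_Final.csv",
    "Quy_Nhon": "Quy Nhon_Final.csv",
    "TPHCM": "TPHCM_Final.csv",
    "Ca_Mau": "Ca Mau_Final.csv",
}


def load_station_dfs(data_dir, station_files=STATION_FILES):
    """Read one CSV per station and return {station_name: DataFrame}."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / file_name)
        for name, file_name in station_files.items()
    }
```

Calling `load_station_dfs('/kaggle/input/raw-at-data/DATA_SENT SV')` should then yield a dictionary with the same keys and order as `station_dfs`.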

In [14]:
def overview_structure(station_dfs):
    """Print the column names, dtypes and non-null counts of each station."""
    for station_name, station_df in station_dfs.items():
        print(f"------STATION {station_name.upper()}------")
        # info() prints its report itself and returns None, so it is not wrapped in print()
        station_df.info()
        print("\n")

overview_structure(station_dfs)

------STATION NOI_BAI------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12653 entries, 0 to 12652
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   YMD        12653 non-null  object 
 1   NAME       12653 non-null  object 
 2   LATITUDE   12653 non-null  float64
 3   LONGITUDE  12653 non-null  float64
 4   YEAR       12653 non-null  int64  
 5   MONTH      12653 non-null  int64  
 6   DAY        12653 non-null  int64  
 7   TMP_2      12653 non-null  float64
 8   DEW_2      12653 non-null  float64
 9   RH         12653 non-null  float64
 10  AT mean    12653 non-null  float64
 11  AT max     12653 non-null  float64
dtypes: float64(7), int64(3), object(2)
memory usage: 1.2+ MB


------STATION LANG_SON------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11880 entries, 0 to 11879
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   YMD        

In [6]:
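def get_lat_lon(station_dfs):
    """Print each station's coordinates, read from the first row of its frame."""
    for station_name, station_df in station_dfs.items():
        print(f"------STATION {station_name.upper()}------")
        # .iloc[0] selects by position, so it works even if the index does not start at 0
        print(f"Longitude: {station_df['LONGITUDE'].iloc[0]}")
        print(f"Latitude: {station_df['LATITUDE'].iloc[0]}")

get_lat_lon(station_dfs)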

------STATION NOI_BAI------
Longitude: 105.81
Latitude: 21.22
------STATION LANG_SON------
Longitude: 106.7666666
Latitude: 21.8333333
------STATION LAO_CAI------
Longitude: 103.9666666
Latitude: 22.5
------STATION VINH------
Longitude: 105.67
Latitude: 18.74
------STATION PHU_BAI------
Longitude: 107.7
Latitude: 16.4
------STATION QUY_NHON------
Longitude: 109.22
Latitude: 13.77
------STATION TPHCM------
Longitude: 106.65
Latitude: 10.82
------STATION CA_MAU------
Longitude: 105.15
Latitude: 9.1833333
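
Because `get_lat_lon` reads only the first row, it implicitly assumes that the coordinates are constant within each file. A minimal sketch for checking that assumption is shown here. `coordinate_summary` and its `decimals` parameter are hypothetical names, and the toy frames hold made-up values used only to show the call shape.

```python
import pandas as pd


def coordinate_summary(station_dfs, decimals=6):
    """Return one row per station: first coordinates and distinct-value counts."""
    rows = []
    for station_name, station_df in station_dfs.items():
        # Round first so that tiny floating-point noise does not count as a new value
        lat = station_df["LATITUDE"].round(decimals)
        lon = station_df["LONGITUDE"].round(decimals)
        rows.append({
            "station": station_name,
            "latitude": lat.iloc[0],
            "longitude": lon.iloc[0],
            "n_unique_lat": lat.nunique(),
            "n_unique_lon": lon.nunique(),
        })
    return pd.DataFrame(rows).set_index("station")


# Toy frames with made-up values, only to illustrate the output layout
toy = {
    "A": pd.DataFrame({"LATITUDE": [10.0, 10.0], "LONGITUDE": [100.0, 100.0]}),
    "B": pd.DataFrame({"LATITUDE": [20.0, 20.5], "LONGITUDE": [101.0, 101.0]}),
}
summary = coordinate_summary(toy)
print(summary)
```

A station whose `n_unique_lat` or `n_unique_lon` is greater than 1 would need a closer look before its first-row coordinates are treated as representative.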


In [7]:
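def get_date(station_dfs):
    """Print the first and last date, the year span, and the row count per station."""
    for station_name, station_df in station_dfs.items():
        # Assemble a datetime Series from the separate YEAR, MONTH and DAY columns
        ymd = pd.to_datetime(station_df[['YEAR', 'MONTH', 'DAY']])

        print(f"------STATION {station_name.upper()}------")
        print(f"First day: {ymd.min()}")
        print(f"Last day: {ymd.max()}")
        # Difference between the last and first calendar year, not the number of full years
        print(f"Years: {ymd.dt.year.max() - ymd.dt.year.min()}")
        # Number of rows in the frame
        print(f"Dates: {len(station_df)}")

get_date(station_dfs)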

------STATION NOI_BAI------
First day: 1990-01-01 00:00:00
Last day: 2024-08-31 00:00:00
Years: 34
Dates: 12653
------STATION LANG_SON------
First day: 1990-01-04 00:00:00
Last day: 2024-08-31 00:00:00
Years: 34
Dates: 11880
------STATION LAO_CAI------
First day: 1992-04-01 00:00:00
Last day: 2024-08-31 00:00:00
Years: 32
Dates: 11775
------STATION VINH------
First day: 1990-01-01 00:00:00
Last day: 2024-08-31 00:00:00
Years: 34
Dates: 12638
------STATION PHU_BAI------
First day: 1992-04-01 00:00:00
Last day: 2024-08-31 00:00:00
Years: 32
Dates: 11824
------STATION QUY_NHON------
First day: 1990-01-01 00:00:00
Last day: 2024-08-31 00:00:00
Years: 34
Dates: 12634
------STATION TPHCM------
First day: 1990-01-01 00:00:00
Last day: 2024-08-31 00:00:00
Years: 34
Dates: 12658
------STATION CA_MAU------
First day: 1990-01-02 00:00:00
Last day: 2024-08-31 00:00:00
Years: 34
Dates: 12608
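
Stations that share the same first and last day still report different row counts (for example 12653 for Noi Bai and 12638 for Vinh), which suggests that some calendar days have no record. One way to list those days is sketched below. `missing_calendar_days` is a hypothetical helper, and the toy frame contains made-up dates.

```python
import pandas as pd


def missing_calendar_days(station_df):
    """Return the calendar days between the first and last record that have no row."""
    dates = pd.to_datetime(station_df[["YEAR", "MONTH", "DAY"]])
    # Every day from the first to the last recorded date, inclusive
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(dates))


# Toy frame with made-up dates: 3 and 4 January are absent
toy_df = pd.DataFrame({
    "YEAR": [2024, 2024, 2024],
    "MONTH": [1, 1, 1],
    "DAY": [1, 2, 5],
})
gaps = missing_calendar_days(toy_df)
print(list(gaps.strftime("%Y-%m-%d")))  # ['2024-01-03', '2024-01-04']
```

Applied to each frame in `station_dfs`, the length of the returned index gives the number of absent days per station.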


In [8]:
def check_missing_values(station_dfs):
    """Print the number of missing (NaN) values in each column, per station."""
    for station_name, station_df in station_dfs.items():
        print(f"------STATION {station_name.upper()}------")
        print(station_df.isnull().sum())

check_missing_values(station_dfs)

------STATION NOI_BAI------
YMD          0
NAME         0
LATITUDE     0
LONGITUDE    0
YEAR         0
MONTH        0
DAY          0
TMP_2        0
DEW_2        0
RH           0
AT mean      0
AT max       0
dtype: int64
------STATION LANG_SON------
YMD          0
NAME         0
LATITUDE     0
LONGITUDE    0
YEAR         0
MONTH        0
DAY          0
DEW_2        0
TMP_2        0
RH           0
AT mean      0
AT max       0
dtype: int64
------STATION LAO_CAI------
YMD          0
NAME         0
LATITUDE     0
LONGITUDE    0
YEAR         0
MONTH        0
DAY          0
DEW_2        0
TMP_2        0
RH           0
AT mean      0
AT max       0
dtype: int64
------STATION VINH------
YMD          0
NAME         0
LATITUDE     0
LONGITUDE    0
YEAR         0
MONTH        0
DAY          0
TMP_2        1
DEW_2        1
RH           1
AT mean      1
AT max       1
dtype: int64
------STATION PHU_BAI------
YMD          0
NAME         0
LATITUDE     0
LONGITUDE    0
YEAR         0
MONTH        0


In [9]:
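def check_duplicate_rows(station_dfs):
    """Print the number of rows that exactly repeat an earlier row, per station."""
    for station_name, station_df in station_dfs.items():
        print(f"------STATION {station_name.upper()}------")
        # duplicated() compares all columns; the first occurrence is not counted
        print(station_df.duplicated().sum())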

check_duplicate_rows(station_dfs)

------STATION NOI_BAI------
0
------STATION LANG_SON------
0
------STATION LAO_CAI------
0
------STATION VINH------
0
------STATION PHU_BAI------
0
------STATION QUY_NHON------
0
------STATION TPHCM------
0
------STATION CA_MAU------
0
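
`duplicated()` with no arguments flags only rows that match an earlier row in every column, so two records for the same day with different measurements would not be counted. A stricter check restricted to the date columns could look like the sketch below. `DATE_COLS` and `count_duplicate_dates` are hypothetical names, and the toy frame uses made-up values.

```python
import pandas as pd

# Columns that together identify one calendar day in each station frame
DATE_COLS = ["YEAR", "MONTH", "DAY"]


def count_duplicate_dates(station_dfs):
    """Return {station_name: number of rows whose date repeats an earlier row}."""
    return {
        station_name: int(station_df.duplicated(subset=DATE_COLS).sum())
        for station_name, station_df in station_dfs.items()
    }


# Toy frame with made-up values: the same day appears twice with different TMP_2
toy = {
    "A": pd.DataFrame({
        "YEAR": [2024, 2024],
        "MONTH": [1, 1],
        "DAY": [1, 1],
        "TMP_2": [20.0, 21.0],
    })
}
full_row_duplicates = int(toy["A"].duplicated().sum())
date_duplicates = count_duplicate_dates(toy)
print(full_row_duplicates, date_duplicates)  # 0 {'A': 1}
```

If the date-based count is also zero for every station, each row can be treated as a distinct day.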


In [10]:
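def describe_data(station_dfs):
    """Print summary statistics (count, mean, std, min, quartiles, max) per station."""
    for station_name, station_df in station_dfs.items():
        print(f"------STATION {station_name.upper()}------")
        # By default describe() covers numeric columns only, so YMD and NAME are omitted
        print(station_df.describe())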

describe_data(station_dfs)

------STATION NOI_BAI------
           LATITUDE     LONGITUDE          YEAR         MONTH           DAY  \
count  1.265300e+04  1.265300e+04  12653.000000  12653.000000  12653.000000   
mean   2.122000e+01  1.058100e+02   2006.839959      6.482494     15.730973   
std    3.872611e-13  3.315523e-11     10.008054      3.441294      8.798758   
min    2.122000e+01  1.058100e+02   1990.000000      1.000000      1.000000   
25%    2.122000e+01  1.058100e+02   1998.000000      4.000000      8.000000   
50%    2.122000e+01  1.058100e+02   2007.000000      6.000000     16.000000   
75%    2.122000e+01  1.058100e+02   2016.000000      9.000000     23.000000   
max    2.122000e+01  1.058100e+02   2024.000000     12.000000     31.000000   

              TMP_2         DEW_2            RH       AT mean        AT max  
count  12653.000000  12653.000000  12653.000000  12653.000000  12653.000000  
mean      24.367013     20.088642     78.579162     21.875167     25.217853  
std        5.217426      5