# Load Data from the Bronze Layer to the Silver Layer

This notebook reads data from the Bronze layer (MinIO), transforms it, and loads it into Iceberg tables in the Silver layer through the Nessie catalog.

## 1. Import Libraries and Initialize the Spark Session

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
import csv, io, os, re
from datetime import datetime
from typing import Dict

# Configure AWS/MinIO credentials
os.environ.update({
    'AWS_REGION': 'us-east-1',
    'AWS_ACCESS_KEY_ID': 'admin',
    'AWS_SECRET_ACCESS_KEY': 'admin123'
})

# Initialize the Spark session with the Nessie catalog
spark = (
    SparkSession.builder
    .appName("Load_Bronze_To_Silver")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "1536m")
    .config("spark.executor.cores", "2")
    # Nessie Catalog
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://silver/")
    .config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # S3/MinIO Config
    .config("spark.sql.catalog.nessie.s3.endpoint", "http://minio:9000")
    .config("spark.sql.catalog.nessie.s3.access-key-id", "admin")
    .config("spark.sql.catalog.nessie.s3.secret-access-key", "admin123")
    .config("spark.sql.catalog.nessie.s3.path-style-access", "true")
    .config("spark.sql.catalog.nessie.s3.region", "us-east-1")
    # Hadoop S3A Config
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "admin123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.region", "us-east-1")
    # Executor Environment
    .config("spark.executorEnv.AWS_REGION", "us-east-1")
    .config("spark.executorEnv.AWS_ACCESS_KEY_ID", "admin")
    .config("spark.executorEnv.AWS_SECRET_ACCESS_KEY", "admin123")
    # Local JAR files
    .config("spark.jars", "/opt/spark/jars/hadoop-aws-3.3.4.jar,/opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("ERROR")
spark.sql("CREATE DATABASE IF NOT EXISTS nessie.silver_tables")
spark.sql("USE nessie.silver_tables")
print(f" Spark Session initialized | Master: {spark.sparkContext.master} | App ID: {spark.sparkContext.applicationId}")


25/12/07 08:17:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/07 08:17:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


 Spark Session initialized | Master: spark://spark-master:7077 | App ID: app-20251207081713-0002
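
The session builder repeats the MinIO endpoint and the `admin`/`admin123` credentials in several places. As a minimal sketch, assuming these values can be supplied through environment variables, the S3-related settings could be gathered once and then applied to the builder. The helper names `minio_settings` and `spark_s3_conf` and the `MINIO_ENDPOINT` variable are hypothetical; the config keys are the ones already used in the builder.

```python
import os


def minio_settings(env=None):
    """Collect MinIO/S3 settings, falling back to the notebook's local defaults."""
    env = os.environ if env is None else env
    return {
        # MINIO_ENDPOINT is a hypothetical variable name, not part of the original notebook
        "endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "access_key": env.get("AWS_ACCESS_KEY_ID", "admin"),
        "secret_key": env.get("AWS_SECRET_ACCESS_KEY", "admin123"),
        "region": env.get("AWS_REGION", "us-east-1"),
    }


def spark_s3_conf(settings):
    """Map the settings onto the config keys used by the session builder."""
    return {
        "spark.sql.catalog.nessie.s3.endpoint": settings["endpoint"],
        "spark.sql.catalog.nessie.s3.access-key-id": settings["access_key"],
        "spark.sql.catalog.nessie.s3.secret-access-key": settings["secret_key"],
        "spark.sql.catalog.nessie.s3.region": settings["region"],
        "spark.hadoop.fs.s3a.endpoint": settings["endpoint"],
        "spark.hadoop.fs.s3a.access.key": settings["access_key"],
        "spark.hadoop.fs.s3a.secret.key": settings["secret_key"],
        "spark.hadoop.fs.s3a.region": settings["region"],
    }


# Example with an explicit mapping instead of the real environment
conf = spark_s3_conf(minio_settings({"AWS_ACCESS_KEY_ID": "demo-key"}))
```

Each key/value pair can then be passed to `SparkSession.builder.config(key, value)` in a loop before `getOrCreate()`, which keeps the secrets out of the notebook source.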


## 2. Load the SCHOOL Table

In [None]:
print("=" * 80)
print("LOAD SCHOOL TABLE")
print("=" * 80)

# Read and merge all years
years = [2021, 2022, 2023, 2024, 2025]
base_path = "s3a://bronze/structured_data/danh sách các trường Đại Học (2021-2025)/Danh_sách_các_trường_Đại_Học_"
df_school = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv([f"{base_path}{year}.csv" for year in years])
    .select("TenTruong", "MaTruong", "TinhThanh")
    .dropDuplicates()
)

# Transform
df_school_silver = df_school.select(
    col("MaTruong").cast("string").alias("schoolId"),
    col("TenTruong").cast("string").alias("schoolName"),
    col("TinhThanh").cast("string").alias("province"),
    current_timestamp().alias("created_at"),
    current_timestamp().alias("updated_at")
).filter(col("schoolId").isNotNull() & col("schoolName").isNotNull())

# Write to Silver
df_school_silver.writeTo("nessie.silver_tables.school").using("iceberg").createOrReplace()
print(f"Wrote {df_school_silver.count()} rows to school")

# Verify
spark.table("nessie.silver_tables.school").show(5, truncate=False)

## 3. Load the MAJOR Table

In [None]:
from pyspark.sql.functions import col, lower, trim, regexp_replace, current_timestamp

df_major = spark.read.option("header", "true") \
    .option("inferSchema", "false") \
    .option("encoding", "UTF-8") \
    .csv("s3a://bronze/structured_data/danh sách các ngành đại học/Danh_sách_các_ngành.csv")

df_major_clean = df_major.select(
    regexp_replace(trim(col(df_major.columns[0])).cast("string"), r"\.0$", "").alias("majorId"),
    trim(col(df_major.columns[1])).cast("string").alias("majorName")
).filter(
    (col("majorId").isNotNull()) &
    (col("majorName").isNotNull()) &
    (col("majorId") != "") &
    (col("majorName") != "") &
    (lower(col("majorId")) != "nan")
)

# Normalize to lowercase so that deduplication is case-insensitive
df_major_silver = df_major_clean \
    .withColumn("majorId_lower", lower(col("majorId"))) \
    .dropDuplicates(["majorId_lower"]) \
    .select(
        col("majorId"),
        col("majorName"),
        current_timestamp().alias("created_at"),
        current_timestamp().alias("updated_at")
    )

df_major_silver.writeTo("nessie.silver_tables.major").using("iceberg").createOrReplace()

print(f"Wrote {df_major_silver.count()} rows to major")
spark.table("nessie.silver_tables.major").show(5, truncate=False)
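
The cleaning rules applied to `majorId` (trim, strip a trailing `.0`, drop empty and `nan` values, then deduplicate ignoring case) can be checked on a handful of values without Spark. The following is a plain-Python sketch of the same rules; the helper names and the sample IDs are illustrative assumptions, not taken from the source data.

```python
import re


def clean_major_id(raw):
    """Return the normalized majorId, or None when the row should be dropped."""
    if raw is None:
        return None
    value = re.sub(r"\.0$", "", raw.strip())  # "7480201.0" -> "7480201"
    if value == "" or value.lower() == "nan":
        return None
    return value


def dedupe_case_insensitive(ids):
    """Keep the first spelling seen for each lowercase key."""
    seen = set()
    kept = []
    for value in ids:
        key = value.lower()
        if key not in seen:
            seen.add(key)
            kept.append(value)
    return kept


# Illustrative sample values only
raw_ids = [" 7480201.0 ", "7340101A", "7340101a", "NaN", "", None]
cleaned = [v for v in map(clean_major_id, raw_ids) if v is not None]
unique_ids = dedupe_case_insensitive(cleaned)
```

One difference to keep in mind: this sketch keeps the first spelling it sees, while Spark's `dropDuplicates` does not guarantee which of the duplicate rows survives.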


## 4. Load the SUBJECT_GROUP and SUBJECT Tables

In [None]:
print("=" * 80)
print("LOAD SUBJECT_GROUP AND SUBJECT TABLES")
print("=" * 80)

# Read tohop_mon_fixed.csv
df_tohop = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "UTF-8")
    .csv("s3a://bronze/structured_data/tohop_mon_fixed.csv")
)

# --- SUBJECT_GROUP ---
df_subject_group_silver = df_tohop.select(
    col(df_tohop.columns[0]).cast("int").alias("subjectGroupId"),
    col(df_tohop.columns[1]).cast("string").alias("subjectGroupName"),
    col(df_tohop.columns[2]).cast("string").alias("subjectCombination"),
    current_timestamp().alias("created_at"),
    current_timestamp().alias("updated_at")
).filter(
    col("subjectGroupId").isNotNull()
    & col("subjectGroupName").isNotNull()
    & col("subjectCombination").isNotNull()
).dropDuplicates(["subjectGroupName", "subjectCombination"])
df_subject_group_silver.writeTo("nessie.silver_tables.subject_group").using("iceberg").createOrReplace()
print(f"Wrote {df_subject_group_silver.count()} rows to subject_group")

# --- SUBJECT ---
df_subject = (
    df_tohop.select(explode(split(col(df_tohop.columns[2]), "-")).alias("subjectName"))
            .withColumn("subjectName", trim(col("subjectName")))
            .filter(col("subjectName").isNotNull() & (col("subjectName") != ""))
            .withColumn("subjectName_lower", lower(col("subjectName")))
            # drop duplicates case-insensitively
            .dropDuplicates(["subjectName_lower"])
)

window_spec = Window.orderBy("subjectName_lower")
df_subject_silver = df_subject.withColumn("subjectId", row_number().over(window_spec)).select(
    col("subjectId").cast("int"),
    col("subjectName").cast("string"),
    current_timestamp().alias("created_at"),
    current_timestamp().alias("updated_at")
)
df_subject_silver.writeTo("nessie.silver_tables.subject").using("iceberg").createOrReplace()
print(f"Wrote {df_subject_silver.count()} rows to subject")

# Verify
spark.table("nessie.silver_tables.subject_group").orderBy("subjectGroupId").show(5, truncate=False)
spark.table("nessie.silver_tables.subject").show(5, truncate=False)
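
The SUBJECT table is derived by splitting each combination on `-`, trimming, deduplicating by lowercase name, and numbering the result in sorted order. A small plain-Python sketch of that derivation follows; the function name and the English subject names are hypothetical stand-ins for the real data.

```python
def extract_subjects(combinations):
    """Split 'A-B-C' combinations and assign 1-based IDs in lowercase-sorted order."""
    by_lower = {}
    for combo in combinations:
        for name in combo.split("-"):
            name = name.strip()
            if name:
                # Keep the first spelling seen for each lowercase key
                by_lower.setdefault(name.lower(), name)
    return {i: by_lower[key] for i, key in enumerate(sorted(by_lower), start=1)}


# Hypothetical combinations, mirroring the "A-B-C" layout of the third column
combos = ["Math-Physics-Chemistry", "Math - Chemistry - Biology", "math-Literature"]
subjects = extract_subjects(combos)
```

Because the IDs are positions in a sorted list, adding a new subject to the source file can shift the IDs of existing subjects on the next full reload, so tables that store `subjectId` would need to be reloaded together with this one.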

## 5. Load the SELECTION_METHOD Table

In [None]:
print("=" * 80)
print("LOAD SELECTION_METHOD TABLE")
print("=" * 80)

# Read the benchmark files to collect the admission (selection) methods
df_benchmark = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "UTF-8")
    .csv("s3a://bronze/structured_data/điểm chuẩn các trường (2021-2025)/Điểm_chuẩn_các_ngành_đại_học_năm(2021-2025)*.csv")
)

# Take PhuongThuc and strip the trailing "năm <year> ..." part
df_selection = (
    df_benchmark
    .select(trim(regexp_replace(col("PhuongThuc"), r"\s*năm\s+\d{4}.*$", "")).alias("selectionMethodName"))
    .filter(col("selectionMethodName").isNotNull() & (col("selectionMethodName") != ""))
    .distinct()
)

window_spec = Window.orderBy("selectionMethodName")
df_selection_method_silver = df_selection.withColumn("selectionMethodId", row_number().over(window_spec)).select(
    col("selectionMethodId").cast("int"),
    col("selectionMethodName").cast("string"),
    current_timestamp().alias("created_at"),
    current_timestamp().alias("updated_at")
)
df_selection_method_silver.writeTo("nessie.silver_tables.selection_method").using("iceberg").createOrReplace()
print(f"Wrote {df_selection_method_silver.count()} rows to selection_method")

# Verify
spark.table("nessie.silver_tables.selection_method").show(10, truncate=False)
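
The regular expression that removes the trailing year fragment from `PhuongThuc` is the piece of this cell most worth testing in isolation. Below is a plain-Python sketch using the same pattern; the sample strings are assumptions about what the column might contain, and `clean_selection_method` is a hypothetical helper name.

```python
import re

# Same pattern as the Spark regexp_replace call in this cell
YEAR_SUFFIX = re.compile(r"\s*năm\s+\d{4}.*$")


def clean_selection_method(raw):
    """Drop a trailing 'năm <year> ...' fragment and surrounding whitespace."""
    return YEAR_SUFFIX.sub("", raw).strip()


# Assumed example values; the real column contents may differ
samples = [
    "Điểm thi THPT năm 2023",
    "Điểm thi THPT năm 2024 (đợt 1)",
    "  Học bạ  ",
]
cleaned = sorted({clean_selection_method(s) for s in samples})

# Analogous to row_number() over an ordering by selectionMethodName
method_ids = {name: i for i, name in enumerate(cleaned, start=1)}
```

The first two samples collapse into a single method name, which is what allows the BENCHMARK load to join rows from different years against one `selectionMethodId`.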

## 6. Load the GRADING_SCALE Table

In [None]:
print("=" * 80)
print("LOAD GRADING_SCALE TABLE FROM PhanLoaiThangDiem")
print("=" * 80)

# 1. Read the raw data from the CSV files (same source as benchmark)
df_raw = (
    spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("encoding", "UTF-8")
        .csv("s3a://bronze/structured_data/điểm chuẩn các trường (2021-2025)/Điểm_chuẩn_các_ngành_đại_học_năm(2021-2025)*.csv")
)

# 2. Get the unique PhanLoaiThangDiem values
df_grading_raw = (
    df_raw
        .select(trim(col("PhanLoaiThangDiem")).alias("description"))
        .filter(col("description").isNotNull() & (col("description") != ""))
        .dropDuplicates(["description"])
)

# 3. Extract the numeric part of the description as "value" (if any, e.g. "thang 40" -> 40)
df_grading = (
    df_grading_raw
        .withColumn(
            "value",
            regexp_extract(col("description"), r"(\d+(?:\.\d+)?)", 1).cast("float")
        )
        # row_number() yields dense, deterministic IDs. Casting
        # monotonically_increasing_id() to int is unsafe because the partition ID
        # is packed into the upper bits of the 64-bit value.
        .withColumn("gradingScaleId", row_number().over(Window.orderBy("description")))
        .withColumn("created_at", current_timestamp())
        .withColumn("updated_at", current_timestamp())
        .select(
            "gradingScaleId",
            "value",
            "description",
            "created_at",
            "updated_at"
        )
)

# 4. Write to the Iceberg grading_scale table (createOrReplace recreates it if it already exists)
df_grading.writeTo("nessie.silver_tables.grading_scale") \
          .using("iceberg") \
          .createOrReplace()

print(f"Wrote {df_grading.count()} rows to grading_scale")

# 5. Verify
spark.table("nessie.silver_tables.grading_scale").show(truncate=False)
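
A note on integer surrogate keys in Spark: `monotonically_increasing_id()` is documented to place the partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits of a 64-bit value. The sketch below mimics that layout in plain Python and shows what happens when such values are narrowed to 32 bits, under the assumption that the narrowing keeps the low 32 bits (as a non-ANSI long-to-int cast is expected to do). The helper names and two of the three descriptions are hypothetical.

```python
def monotonic_id(partition_id, row_in_partition):
    """Mimic the documented layout: partition ID in the upper 31 bits, row index in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


def narrow_to_int32(value):
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


# First row of three different partitions
long_ids = [monotonic_id(p, 0) for p in range(3)]
int_ids = [narrow_to_int32(v) for v in long_ids]

# Dense alternative, analogous to row_number() over an ordering by description
descriptions = ["thang 40", "thang 30", "thang 100"]
dense_ids = {d: i for i, d in enumerate(sorted(descriptions), start=1)}
```

In this simulation the first row of every partition narrows to the same value, so distinct rows would share an ID. A dense numbering over a sorted column avoids the collision and stays stable as long as the set of descriptions does not change.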


## 6. Load the BENCHMARK Table

In [None]:
from pyspark.sql.functions import (
    col, trim, regexp_replace, current_timestamp,
    avg, round, expr
)

print("=" * 80)
print("LOAD BENCHMARK TABLE")
print("=" * 80)

# =========================
# 1. READ & NORMALIZE BRONZE DATA
# =========================

df_benchmark = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "UTF-8")
    .csv("s3a://bronze/structured_data/điểm chuẩn các trường (2021-2025)/Điểm_chuẩn_các_ngành_đại_học_năm(2021-2025)*.csv")
)

# Normalize the PhuongThuc column: strip the "năm XXXX ..." part
df_benchmark = df_benchmark.withColumn(
    "PhuongThuc_cleaned",
    trim(regexp_replace(col("PhuongThuc"), r"\s*năm\s+\d{4}.*$", ""))
)

# Lookup tables from Silver
df_selection_lookup     = spark.table("nessie.silver_tables.selection_method")
df_subject_group_lookup = spark.table("nessie.silver_tables.subject_group")
df_grading_scale_lookup = spark.table("nessie.silver_tables.grading_scale")

# Join the lookups and normalize
df_benchmark_base = (
    df_benchmark
    .join(
        df_selection_lookup,
        df_benchmark["PhuongThuc_cleaned"] == df_selection_lookup["selectionMethodName"],
        "left"
    )
    .join(
        df_subject_group_lookup,
        df_benchmark["KhoiThi"] == df_subject_group_lookup["subjectGroupName"],
        "left"
    )
    .join(
        df_grading_scale_lookup,
        trim(df_benchmark["PhanLoaiThangDiem"]) == df_grading_scale_lookup["description"],
        "left"
    )
    .select(
        col("MaTruong").cast("string").alias("schoolId"),
        col("MaNganh").cast("string").alias("majorId"),
        col("subjectGroupId").cast("int"),
        col("selectionMethodId").cast("int"),
        col("gradingScaleId").cast("int"),
        col("Nam").cast("int").alias("year"),
        col("DiemChuan").cast("double").alias("score"),
    )
    .filter(
        col("schoolId").isNotNull() &
        col("majorId").isNotNull() &
        col("gradingScaleId").isNotNull() &
        col("year").isNotNull() &
        col("score").isNotNull() &
        col("selectionMethodId").isNotNull()
        # col("subjectGroupId").isNotNull()  # uncomment to make the subject group mandatory
    )
    .dropDuplicates([
        "schoolId",
        "majorId",
        "subjectGroupId",
        "selectionMethodId",
        "year",
        "gradingScaleId",
        "score"
    ])
)

# =========================
# 2. GROUP BY & TAKE THE AVERAGE SCORE
# =========================

df_benchmark_grouped = (
    df_benchmark_base
    .groupBy(
        "schoolId",
        "majorId",
        "subjectGroupId",
        "selectionMethodId",
        "gradingScaleId",
        "year"
    )
    .agg(
        round(avg("score"), 2).alias("score")
    )
)

table_name = "nessie.silver_tables.benchmark"

# =========================
# 3. CHECK WHETHER THE SILVER TABLE ALREADY EXISTS
# =========================

try:
    spark.table(table_name)
    table_exists = True
    print(f"Table {table_name} already exists → using MERGE (upsert).")
except Exception:
    table_exists = False
    print(f"Table {table_name} does not exist yet → creating it with a full load.")

# =========================
# 4. FIRST RUN: CREATE THE FULL TABLE (USING xxhash64 AS benchmarkId)
# =========================

if not table_exists:
    df_benchmark_silver = (
        df_benchmark_grouped
        .withColumn(
            "benchmarkId",
            expr(
                """
                CAST(
                    xxhash64(
                        schoolId,
                        majorId,
                        COALESCE(subjectGroupId, -1),
                        selectionMethodId,
                        gradingScaleId,
                        year
                    ) AS BIGINT
                )
                """
            )
        )
        .withColumn("created_at", current_timestamp())
        .withColumn("updated_at", current_timestamp())
        .select(
            "benchmarkId",
            "schoolId",
            "majorId",
            "subjectGroupId",
            "selectionMethodId",
            "gradingScaleId",
            "year",
            "score",
            "created_at",
            "updated_at"
        )
    )

    df_benchmark_silver.writeTo(table_name).using("iceberg").createOrReplace()
    print(f"Created benchmark with {df_benchmark_silver.count()} rows")

# =========================
# 5. SUBSEQUENT RUNS: MERGE / UPSERT
# =========================

else:
    # Staging data from bronze after normalization + grouping
    df_staging = (
        df_benchmark_grouped
        .withColumn("created_at", current_timestamp())
        .withColumn("updated_at", current_timestamp())
    )

    df_staging.createOrReplaceTempView("benchmark_staging")

    # MERGE:
    # - MATCHED: update score + updated_at
    # - NOT MATCHED: insert a new record with benchmarkId = hash(business key)
    spark.sql(f"""
        MERGE INTO {table_name} AS t
        USING benchmark_staging AS s
        ON  t.schoolId          = s.schoolId
        AND t.majorId           = s.majorId
        AND COALESCE(t.subjectGroupId,  -1) = COALESCE(s.subjectGroupId,  -1)
        AND t.selectionMethodId = s.selectionMethodId
        AND t.gradingScaleId    = s.gradingScaleId
        AND t.year              = s.year

        WHEN MATCHED THEN UPDATE SET
            t.score      = s.score,
            t.updated_at = current_timestamp()

        WHEN NOT MATCHED THEN INSERT (
            benchmarkId,
            schoolId,
            majorId,
            subjectGroupId,
            selectionMethodId,
            gradingScaleId,
            year,
            score,
            created_at,
            updated_at
        ) VALUES (
            CAST(
                xxhash64(
                    s.schoolId,
                    s.majorId,
                    COALESCE(s.subjectGroupId, -1),
                    s.selectionMethodId,
                    s.gradingScaleId,
                    s.year
                ) AS BIGINT
            ),
            s.schoolId,
            s.majorId,
            s.subjectGroupId,
            s.selectionMethodId,
            s.gradingScaleId,
            s.year,
            s.score,
            s.created_at,
            s.updated_at
        )
    """)

    print("Merged new data into the benchmark table")

# =========================
# 6. VERIFY
# =========================

spark.table(table_name).show(5, truncate=False)
spark.table(table_name).groupBy("year").count().orderBy("year").show()
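
The MERGE matches rows on a six-column business key and wraps `subjectGroupId` in `COALESCE(..., -1)`. The reason is that in SQL `NULL = NULL` does not evaluate to true, so rows without a subject group would never match and would be inserted again on every run. The following plain-Python sketch imitates the upsert with a dictionary keyed by the business key; `business_key`, `upsert`, and the sample row are illustrative assumptions.

```python
from datetime import datetime, timezone


def business_key(row):
    """Matching key; a missing subjectGroupId maps to -1, like COALESCE(subjectGroupId, -1)."""
    group_id = row["subjectGroupId"]
    return (
        row["schoolId"],
        row["majorId"],
        -1 if group_id is None else group_id,
        row["selectionMethodId"],
        row["gradingScaleId"],
        row["year"],
    )


def upsert(target, staging, now=None):
    """Update score on a key match, insert otherwise. `target` maps key -> row dict."""
    now = now or datetime.now(timezone.utc)
    for row in staging:
        key = business_key(row)
        if key in target:
            target[key]["score"] = row["score"]
            target[key]["updated_at"] = now
        else:
            target[key] = {**row, "created_at": now, "updated_at": now}
    return target


# Hypothetical row without a subject group, loaded on two consecutive days
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 1, 2, tzinfo=timezone.utc)
row = {"schoolId": "ABC", "majorId": "7480201", "subjectGroupId": None,
       "selectionMethodId": 1, "gradingScaleId": 2, "year": 2024, "score": 24.5}
target = upsert({}, [row], now=t0)
target = upsert(target, [{**row, "score": 25.0}], now=t1)
```

After the second call there is still a single row: its score and `updated_at` have changed while `created_at` keeps the first load time, which matches the intent of the `WHEN MATCHED` branch.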


## 7. Load the REGION Table

In [None]:
print("=" * 80)
print("LOAD REGION TABLE")
print("=" * 80)

df_region = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "UTF-8")
    .csv("s3a://bronze/structured_data/region.csv")
)
df_region_silver = df_region.select(
    lpad(col(df_region.columns[0]).cast("string"), 2, "0").alias("regionId"),  # Pad to 2 digits: "1" -> "01"
    col(df_region.columns[1]).cast("string").alias("regionName"),
    current_timestamp().alias("created_at"),
    current_timestamp().alias("updated_at")
).filter(col("regionId").isNotNull() & col("regionName").isNotNull()).dropDuplicates(["regionId"])

df_region_silver.writeTo("nessie.silver_tables.region").using("iceberg").createOrReplace()
print(f"Wrote {df_region_silver.count()} rows to region")

# Verify
spark.table("nessie.silver_tables.region").show(10, truncate=False)
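
With `inferSchema` enabled, a region code such as `01` is likely read as the integer `1`, which is why the cell pads it back to two characters. The sketch below reproduces the padding rule in plain Python, including the fact that Spark's `lpad` shortens values longer than the target length. The `lpad` helper here handles only a single pad character, and the candidate number is a made-up example.

```python
def lpad(value, length, pad="0"):
    """Left-pad like Spark's lpad: shorter strings are padded, longer ones are cut to `length`."""
    if len(value) >= length:
        return value[:length]
    return value.rjust(length, pad)  # `pad` must be a single character here


region_ids = [lpad(str(n), 2) for n in (1, 9, 10, 64)]

# A hypothetical candidate number whose first two characters identify the region
sbd = "01000123"
matches_region = sbd[:2] == lpad("1", 2)
```

Keeping `regionId` as a two-character string is what lets it line up with the prefix that the STUDENT_SCORES load takes from the candidate number.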

## 8. Load the STUDENT_SCORES Table

In [None]:
from pyspark.sql.functions import (
    col, trim, regexp_replace, current_timestamp, lit,
    concat, substring, udf, input_file_name, regexp_extract
)
from pyspark.sql.types import MapType, IntegerType, DoubleType
from typing import Dict

print("=" * 80)
print("LOAD STUDENT_SCORES TABLE - INCREMENTAL BY FILE (DELETE + APPEND)")
print("=" * 80)

# =====================================================
# 0. CREATE THE INGEST LOG TABLE (TRACKS PROCESSED FILES) IF IT DOES NOT EXIST
# =====================================================
spark.sql("""
CREATE TABLE IF NOT EXISTS nessie.silver_tables.student_scores_ingest_log (
    path STRING,
    year INT,
    processed_at TIMESTAMP
) USING iceberg
""")

# =====================================================
# 1. LIST ALL CSV FILES CURRENTLY IN BRONZE
#    + EXCLUDE FILES THAT WERE ALREADY INGESTED (log)
# =====================================================

df_files = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.csv")
    .load("s3a://bronze/structured_data/điểm từng thí sinh/*/*.csv")
    .select("path")
)

df_log = spark.table("nessie.silver_tables.student_scores_ingest_log")

df_new_files = df_files.join(df_log, on="path", how="left_anti")
new_files = [r.path for r in df_new_files.collect()]

if not new_files:
    print("No new files, stopping the job.")
else:
    print(f"Found {len(new_files)} new file(s) to process.")

    # =====================================================
    # 2. READ ONLY THE NEW FILES + ADD THE YEAR COLUMN
    # =====================================================

    df_scores_raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "false")
        .option("encoding", "UTF-8")
        .csv(new_files)
        .withColumn("path", input_file_name())
    )

    df_scores_raw = df_scores_raw.withColumn(
        "Year",
        regexp_extract(col("path"), r"/(\d{4})/", 1).cast("int")
    )

    # =====================================================
    # 3. LOAD THE SUBJECT LOOKUP
    # =====================================================

    df_subject_lookup = spark.table("nessie.silver_tables.subject").select("subjectId", "subjectName")
    subject_map = {row.subjectName: row.subjectId for row in df_subject_lookup.collect()}
    print(f"\nLoaded {len(subject_map)} subjects for mapping")

    # =====================================================
    # 4. UDF TO PARSE SCORES → Map<subjectId, score>
    # =====================================================

    def parse_scores_with_subject_id(score_string: str) -> Dict[int, float]:
        """Parse "subject: score, subject: score" into {subjectId: score}."""
        if not score_string or score_string.strip() == "":
            return {}
        scores_dict = {}
        for pair in score_string.split(","):
            if ":" not in pair:
                continue
            # Split on the first colon only, so a stray ":" cannot abort the whole row
            subject_name, score = pair.split(":", 1)
            # Map subject name -> subjectId
            subject_id = subject_map.get(subject_name.strip())
            if subject_id is None:
                continue  # unknown subject name
            try:
                scores_dict[subject_id] = float(score.strip())
            except ValueError:
                pass  # skip scores that are not numeric
        return scores_dict

    parse_scores_udf = udf(parse_scores_with_subject_id, MapType(IntegerType(), DoubleType()))

    # =====================================================
    # 5. TRANSFORM → STAGING DATAFRAME (NO MERGE)
    # =====================================================

    # (1) Full DataFrame to append to silver
    df_student_scores_stage = (
        df_scores_raw
        .withColumn("studentId", concat(col("SBD"), col("Year").cast("string")))
        .withColumn("scores", parse_scores_udf(col("DiemThi")))   # UDF applied here
        .withColumn("regionId", substring(col("SBD"), 1, 2).cast("string"))
        .select(
            col("studentId").cast("string"),
            col("regionId").cast("string"),
            col("Year").cast("int").alias("year"),
            col("scores")
        )
        .filter(
            col("studentId").isNotNull() &
            col("year").isNotNull() &
            col("scores").isNotNull()
        )
    )
    
    # (2) Second DataFrame with studentId only (no UDF), intended for the DELETE step
    df_student_ids = (
        df_scores_raw
        .withColumn("studentId", concat(col("SBD"), col("Year").cast("string")))
        .select("studentId")
        .filter(col("studentId").isNotNull())
        # .dropDuplicates(["studentId"])
    )
    
    df_student_ids.createOrReplaceTempView("student_scores_new_ids")


    staging_count = df_student_scores_stage.count()
    print(f"Staging has {staging_count:,} rows.")

    table_name = "nessie.silver_tables.student_scores"

    # =====================================================
    # 6. DELETE OLD studentId ROWS BY COLLECTING TO PYTHON + DELETE IN (...)
    # =====================================================

    # Collect the distinct studentId values in the new batch
    new_ids = [
        row.studentId
        for row in df_student_scores_stage.select("studentId").distinct().collect()
    ]

    print(f"Distinct studentId values in the new batch: {len(new_ids):,}")

    # Check whether the silver table already exists
    try:
        spark.table(table_name)
        table_exists = True
        print(f"Table {table_name} already exists → DELETE by studentId list + APPEND.")
    except Exception:
        table_exists = False
        print(f"Table {table_name} does not exist yet → creating it from this batch, no delete needed.")

    silver_count = spark.table(table_name).count() if table_exists else 0
    print(f"Rows currently in the silver table: {silver_count:,}")

    if not table_exists:
        # (1) TABLE DOES NOT EXIST → CREATE IT
        (
            df_student_scores_stage
            .withColumn("created_at", current_timestamp())
            .withColumn("updated_at", current_timestamp())
            .writeTo(table_name)
            .using("iceberg")
            .createOrReplace()
        )
        print(f"Created table {table_name} with {staging_count:,} rows.")
    
    elif silver_count == 0:
        # (2) TABLE EXISTS BUT IS EMPTY → NO DELETE, APPEND ONLY
        print("The silver table exists but is empty → append only, no delete.")
    
        (
            df_student_scores_stage
            .withColumn("created_at", current_timestamp())
            .withColumn("updated_at", current_timestamp())
            .writeTo(table_name)
            .using("iceberg")
            .append()
        )
        print(f"Appended {staging_count:,} new rows to {table_name}.")
    
    elif new_ids:
        # (3) TABLE EXISTS AND new_ids IS NOT EMPTY → DELETE + APPEND
        print("The silver table has data → DELETE + APPEND.")

        chunk_size = 500
        from math import ceil

        num_chunks = ceil(len(new_ids) / chunk_size)
        print(f"Splitting studentId values into {num_chunks} chunk(s) for deletion...")
    
        for i in range(num_chunks):
            chunk = new_ids[i * chunk_size:(i + 1) * chunk_size]
            escaped_ids = [sid.replace("'", "''") for sid in chunk]
            in_list = ",".join([f"'{sid}'" for sid in escaped_ids])
    
            sql_delete = f"""
                DELETE FROM {table_name}
                WHERE studentId IN ({in_list})
            """
            spark.sql(sql_delete)
    
        print("Finished deleting the old studentId rows from silver.")
    
        (
            df_student_scores_stage
            .withColumn("created_at", current_timestamp())
            .withColumn("updated_at", current_timestamp())
            .writeTo(table_name)
            .using("iceberg")
            .append()
        )
        print(f"Appended {staging_count:,} new rows.")
    
    else:
        # (4) new_ids is empty → no delete, no append
        print("The new batch has no valid studentId → nothing to do.")


    # =====================================================
    # 7. LOG THE PROCESSED FILES
    # =====================================================

    df_new_files_log = (
        df_new_files
        .withColumn("year", regexp_extract(col("path"), r"/(\d{4})/", 1).cast("int"))
        .withColumn("processed_at", current_timestamp())
    )

    (
        df_new_files_log
        .writeTo("nessie.silver_tables.student_scores_ingest_log")
        .using("iceberg")
        .append()
    )

    print(f"Logged {df_new_files_log.count():,} processed file(s).")

    # =====================================================
    # 8. VERIFY
    # =====================================================

    print("\nSample student_scores rows:")
    spark.table(table_name).show(5, truncate=False)
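
The delete step builds `DELETE ... WHERE studentId IN (...)` statements in chunks of 500 IDs and doubles any single quote inside an ID. That string handling can be exercised without a Spark session. The sketch below uses hypothetical helper names (`chunked`, `delete_statements`) and a placeholder table name; it only produces the SQL text and does not execute anything.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_statements(table, ids, chunk_size=500):
    """Build one DELETE statement per chunk, escaping single quotes in each ID."""
    for chunk in chunked(ids, chunk_size):
        in_list = ",".join("'" + sid.replace("'", "''") + "'" for sid in chunk)
        yield f"DELETE FROM {table} WHERE studentId IN ({in_list})"


# Placeholder table name and IDs, with a small chunk size to show the split
statements = list(delete_statements("demo_table", ["a", "b'c", "d"], chunk_size=2))
```

Each generated string could then be passed to `spark.sql(...)`. Since every statement is a separate Iceberg commit, the chunk size trades statement length against the number of commits.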

## 9. Load the ARTICLE and COMMENT Tables from TikTok Data

In [None]:
import re
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import (
    col, lit, when, coalesce, trim,
    to_timestamp, current_timestamp,
    input_file_name, regexp_replace
)

# ====================================================
# CONFIGURATION
# ====================================================
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.conf.set("spark.sql.files.maxPartitionBytes", "33554432")

POSTS_PATH = "s3a://bronze/MangXaHoi/tiktok-data/posts/*.csv"
TABLE_ARTICLE = "nessie.silver_tables.article"
TABLE_LOG = "nessie.silver_tables.tiktok_posts_files_log"

print("=" * 80)
print("JOB 1: LOAD TIKTOK POSTS")
print("=" * 80)

# 1. Check for new files
spark.sql(f"CREATE TABLE IF NOT EXISTS {TABLE_LOG} (file_path STRING, load_time TIMESTAMP) USING iceberg")

df_all = spark.read.format("binaryFile").option("pathGlobFilter", "*.csv").load(POSTS_PATH).select("path")
try:
    df_processed = spark.table(TABLE_LOG).select("file_path")
    df_new_files = df_all.join(df_processed, df_all.path == col("file_path"), "left_anti")
except Exception:
    # Fall back to processing every file if the log cannot be read
    df_new_files = df_all

new_files = [r.path for r in df_new_files.collect()]

if not new_files:
    print("No new post files.")
else:
    print(f"Processing {len(new_files)} new post file(s).")

    # 2. Read & transform
    df_raw = spark.read.option("header","true").option("inferSchema","false").csv(new_files)

    df_trans = (
        df_raw
        .withColumn("timePublish", 
            coalesce(
                to_timestamp(col("TimePublish"), "dd-MM-yyyy"),
                to_timestamp(col("TimePublish"), "d-M-yyyy"), 
                # Lazy ".*?" lets the day group capture both digits (a greedy ".*" would leave only the last one)
                to_timestamp(regexp_replace(col("TimePublish"), r".*?(\d{1,2})\s+Tháng\s+(\d{1,2}),\s+(\d{4}).*", "$1-$2-$3"), "d-M-yyyy"),
                current_timestamp()
            ))
        .withColumn("likeCount", 
            when(col("Like").contains("K"), (regexp_replace(col("Like"), "K", "").cast("float")*1000).cast("int"))
            .when(col("Like").contains("M"), (regexp_replace(col("Like"), "M", "").cast("float")*1000000).cast("int"))
            .otherwise(coalesce(col("Like").cast("int"), lit(0))))
        .withColumn("commentCount", 
            when(col("Comment").contains("K"), (regexp_replace(col("Comment"), "K", "").cast("float")*1000).cast("int"))
            .when(col("Comment").contains("M"), (regexp_replace(col("Comment"), "M", "").cast("float")*1000000).cast("int"))
            .otherwise(coalesce(col("Comment").cast("int"), lit(0))))
        .withColumn("shareCount", 
            when(col("Share").contains("K"), (regexp_replace(col("Share"), "K", "").cast("float")*1000).cast("int"))
            .when(col("Share").contains("M"), (regexp_replace(col("Share"), "M", "").cast("float")*1000000).cast("int"))
            .otherwise(coalesce(col("Share").cast("int"), lit(0))))
        .select(
            trim(col("ID")).alias("articleID"),  # ID TikTok -> articleID
            col("Description").alias("description"), 
            col("Author").alias("author"),
            col("Url").alias("url"),
            col("timePublish"),
            col("likeCount"), col("commentCount"), col("shareCount"),
            lit("TikTok").alias("type"),
            current_timestamp().alias("created_at"),
            current_timestamp().alias("updated_at")
        )
    )

    # 3. Write the data (append)
    print("Writing to Iceberg...")
    df_trans.writeTo(TABLE_ARTICLE).using("iceberg").append()
    
    # 4. Write the log
    spark.createDataFrame([(f,) for f in new_files], ["file_path"]) \
          .withColumn("load_time", current_timestamp()) \
          .writeTo(TABLE_LOG).using("iceberg").append()
    
    print("Done.")
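
Two transformations in this cell are easy to get subtly wrong: expanding abbreviated counters such as `1.2K`, and pulling the day out of a month-name date. The plain-Python sketch below is an analogue of both, not a copy of the Spark expressions: `parse_count` is a hypothetical helper that rounds rather than truncates, and the sample date string is an assumption based on the pattern. It also shows why the prefix before the day group should be lazy.

```python
import re
from datetime import datetime


def parse_count(raw):
    """Convert '1.2K' / '3M' / '15' style counters to an int; unparseable input becomes 0."""
    text = (raw or "").strip()
    multiplier = 1
    if text.endswith("K"):
        multiplier, text = 1_000, text[:-1]
    elif text.endswith("M"):
        multiplier, text = 1_000_000, text[:-1]
    try:
        return int(round(float(text) * multiplier))
    except (ValueError, OverflowError):
        return 0


counts = [parse_count(v) for v in ("1.2K", "3M", "15", "", None)]

# Month-name dates: a greedy prefix swallows the first digit of a two-digit day
GREEDY = re.compile(r".*(\d{1,2})\s+Tháng\s+(\d{1,2}),\s+(\d{4}).*")
LAZY = re.compile(r".*?(\d{1,2})\s+Tháng\s+(\d{1,2}),\s+(\d{4}).*")

sample = "12 Tháng 5, 2024"  # assumed input shape
greedy_result = GREEDY.sub(r"\1-\2-\3", sample)
lazy_result = LAZY.sub(r"\1-\2-\3", sample)
parsed = datetime.strptime(lazy_result, "%d-%m-%Y")
```

With the greedy prefix the day comes out as `2` instead of `12`, and the resulting date still parses without error, so this kind of mistake is silent unless it is tested on a two-digit day.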

                                                                                

In [2]:
import re
import gc
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, LongType
from pyspark.sql.functions import (
    col, lit, when, coalesce, trim,
    to_timestamp, current_timestamp,
    input_file_name, regexp_replace,
    row_number, monotonically_increasing_id
)

# ====================================================
# LOW-RESOURCE CONFIGURATION
# ====================================================
# spark = SparkSession.builder... (assumes the session already exists)

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.conf.set("spark.sql.files.maxPartitionBytes", "16777216") # 16MB
spark.conf.set("spark.sql.shuffle.partitions", "20")

# Path & Table
POSTS_PATH = "s3a://bronze/MangXaHoi/tiktok-data/comments/*.csv"
TABLE_COMMENT = "nessie.silver_tables.comment"
TABLE_LOG = "nessie.silver_tables.tiktok_comments_files_log"

print("=" * 80)
print(f"JOB 2: LOAD TIKTOK COMMENTS (FULL RUN - NO BATCH LIMIT)")
print("=" * 80)

# ====================================================
# 1. PREPARE & FILTER FILES
# ====================================================

# Tạo bảng Log nếu chưa có
try:
    spark.table(TABLE_LOG)
except:
    print(f"-> Bang log {TABLE_LOG} chua ton tai. Dang tao moi...")
    log_schema = StructType([StructField("file_path", StringType(), False), StructField("load_time", TimestampType(), False)])
    spark.createDataFrame([], log_schema).writeTo(TABLE_LOG).using("iceberg").create()

# Scan the source files
print("-> Dang quet file nguon...")
df_all = spark.read.format("binaryFile").option("pathGlobFilter", "*.csv").load(POSTS_PATH).select("path")
total_files_count = df_all.count()
print(f"   Tong so file trong folder: {total_files_count}")

# Keep only new files (anti join against the log)
try:
    df_processed = spark.table(TABLE_LOG).select("file_path").distinct()
    processed_count = df_processed.count()
    print(f"   So file da xu ly truoc do: {processed_count}")
    
    df_new_files = df_all.alias("src").join(
        df_processed.alias("log"), 
        col("src.path") == col("log.file_path"), 
        "left_anti"
    )
except Exception as e:
    print(f"   [WARNING] Loi doc Log: {e}")
    df_new_files = df_all

# Collect the full list of files to process
# [CHANGED]: take everything; the batch limit has been removed
files_to_process = [r.path for r in df_new_files.collect()]

if not files_to_process:
    print("-> KHONG CO FILE COMMENT MOI.")
else:
    print(f"-> Tim thay {len(files_to_process)} file moi. Se xu ly TOAN BO.")

    # ====================================================
    # 2. SEQUENTIAL PROCESSING (LOOP)
    # ====================================================
    
    # Get the current max commentID
    try:
        max_id_row = spark.sql(f"SELECT COALESCE(MAX(commentID), 0) as max_id FROM {TABLE_COMMENT}").collect()
        max_id = max_id_row[0]["max_id"] if max_id_row else 0
        print(f"-> Max commentID hien tai: {max_id:,}")
    except Exception:
        max_id = 0
        print("-> Bang comment chua co du lieu, bat dau tu ID = 0")
    
    # Iterate over each file
    for i, file_path in enumerate(files_to_process):
        filename = file_path.split('/')[-1]
        print(f"\n--- [{i+1}/{len(files_to_process)}] Dang xu ly: {filename} ---")
        
        try:
            # Read the CSV (all columns as strings)
            df_raw = spark.read.option("header","true").option("inferSchema","false").csv(file_path)

            if "ID_Post" not in df_raw.columns:
                print("   [SKIP] File loi format (Thieu ID_Post).")
                # Log the skipped file so it is not read again on the next run
                spark.createDataFrame([(file_path,)], ["file_path"]).withColumn("load_time", current_timestamp()).writeTo(TABLE_LOG).using("iceberg").append()
                continue

            # Transform the data
            df_trans = df_raw.select(
                trim(col("ID_Post")).alias("articleID"), 
                col("Name").alias("name"),
                col("TagName").alias("tagName"),
                col("URL").alias("urlUser"),
                col("Comment").alias("comment"),
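                # commentTime fallback chain:
                #   1. a d-M-yyyy or d/M/yyyy date (any trailing time part is dropped);
                #   2. relative times containing "trước" ("ago") map to the 1970-01-01 sentinel;
                #   3. otherwise the load timestamp is used.
                coalesce(
                    to_timestamp(regexp_replace(col("Time"), r"(\d{1,2})[-/](\d{1,2})[-/](\d{4}).*", "$1-$2-$3"), "d-M-yyyy"),
                    to_timestamp(regexp_replace(col("Time"), r".*trước.*", "1970-01-01"), "yyyy-MM-dd"),
                    current_timestamp()
                ).alias("commentTime"),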
                coalesce(col("Likes").cast("int"), lit(0)).alias("commentLike"),
                when(col("LevelComment") == "Yes", 2).otherwise(1).alias("levelComment"),
                col("RepliedTo").alias("replyTo"),
                coalesce(col("NumberOfReplies").cast("int"), lit(0)).alias("numberOfReply"),
                current_timestamp().alias("created_at"),
                current_timestamp().alias("updated_at")
            ).filter(col("articleID").isNotNull() & (col("articleID") != ""))
            
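            # Assign consecutive commentIDs that continue from max_id.
            # A window without partitionBy moves every row to one partition, which is fine for small files.
            window_spec = Window.orderBy(lit(1))
            df_trans = df_trans.withColumn("commentID", (row_number().over(window_spec) + max_id).cast("bigint"))
            
            # Update max_id for the next iteration
            current_count = df_trans.count()  # this action runs the plan; the append below runs it again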
            max_id += current_count
            print(f"   -> File nay co {current_count:,} comment. Max ID ke tiep: {max_id:,}")
            
            if current_count > 0:
                # Write to Iceberg
                print("   -> Dang ghi vao Iceberg...")
                # Coalesce to a single partition to avoid many small files in Iceberg
                # (repartition(20) would spread ~1,000 rows across up to 20 tiny files)
                df_trans.coalesce(1).writeTo(TABLE_COMMENT).using("iceberg").append()
                print("   -> Ghi xong.")
            
            # Log that this file has been processed
            spark.createDataFrame([(file_path,)], ["file_path"]).withColumn("load_time", current_timestamp()).writeTo(TABLE_LOG).using("iceberg").append()
            
            # Free memory (unpersist() is a no-op here because df_trans was never cached)
            df_trans.unpersist()
            del df_trans
            del df_raw
            gc.collect()

        except Exception as e:
            print(f"   [ERROR] Loi xu ly file {filename}: {e}")
            # If this file fails, skip it and move on to the next one
            continue

    print("\n" + "=" * 80)
    print("DA HOAN TAT TAT CA FILE.")

JOB 2: LOAD TIKTOK COMMENTS (FULL RUN - NO BATCH LIMIT)
-> Dang quet file nguon...
   Tong so file trong folder: 143
   So file da xu ly truoc do: 0
-> Tim thay 143 file moi. Se xu ly TOAN BO.
-> Max commentID hien tai: 0

--- [1/143] Dang xu ly: comments_TikTok_1_part_93.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 1,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [2/143] Dang xu ly: comments_TikTok_1_part_95.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 2,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [3/143] Dang xu ly: comments_TikTok_1_part_96.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 3,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [4/143] Dang xu ly: comments_TikTok_1_part_94.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 4,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [5/143] Dang xu ly: comments_TikTok_1_part_88.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 5,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [6/143] Dang xu ly: comments_TikTok_1_part_118.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 6,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [7/143] Dang xu ly: comments_TikTok_1_part_129.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 7,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [8/143] Dang xu ly: comments_TikTok_1_part_116.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 8,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [9/143] Dang xu ly: comments_TikTok_1_part_53.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 9,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [10/143] Dang xu ly: comments_TikTok_1_part_102.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 10,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [11/143] Dang xu ly: comments_TikTok_1_part_133.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 11,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [12/143] Dang xu ly: comments_TikTok_1_part_17.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 12,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [13/143] Dang xu ly: comments_TikTok_1_part_89.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 13,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [14/143] Dang xu ly: comments_TikTok_1_part_115.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 14,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [15/143] Dang xu ly: comments_TikTok_1_part_62.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 15,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [16/143] Dang xu ly: comments_TikTok_1_part_44.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 16,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [17/143] Dang xu ly: comments_TikTok_1_part_136.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 17,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [18/143] Dang xu ly: comments_TikTok_1_part_27.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 18,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [19/143] Dang xu ly: comments_TikTok_1_part_54.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 19,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [20/143] Dang xu ly: comments_TikTok_1_part_141.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 20,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [21/143] Dang xu ly: comments_TikTok_1_part_103.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 21,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [22/143] Dang xu ly: comments_TikTok_1_part_34.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 22,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [23/143] Dang xu ly: comments_TikTok_1_part_63.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 23,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [24/143] Dang xu ly: comments_TikTok_1_part_130.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 24,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [25/143] Dang xu ly: comments_TikTok_1_part_73.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 25,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [26/143] Dang xu ly: comments_TikTok_1_part_134.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 26,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [27/143] Dang xu ly: comments_TikTok_1_part_32.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 27,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [28/143] Dang xu ly: comments_TikTok_1_part_79.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 28,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [29/143] Dang xu ly: comments_TikTok_1_part_46.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 29,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [30/143] Dang xu ly: comments_TikTok_1_part_48.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 30,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [31/143] Dang xu ly: comments_TikTok_1_part_119.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 31,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [32/143] Dang xu ly: comments_TikTok_1_part_75.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 32,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [33/143] Dang xu ly: comments_TikTok_1_part_33.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 33,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [34/143] Dang xu ly: comments_TikTok_1_part_71.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 34,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [35/143] Dang xu ly: comments_TikTok_1_part_128.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 35,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [36/143] Dang xu ly: comments_TikTok_1_part_28.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 36,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [37/143] Dang xu ly: comments_TikTok_1_part_76.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 37,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [38/143] Dang xu ly: comments_TikTok_1_part_74.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 38,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [39/143] Dang xu ly: comments_TikTok_1_part_59.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 39,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [40/143] Dang xu ly: comments_TikTok_1_part_12.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 40,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [41/143] Dang xu ly: comments_TikTok_1_part_18.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 41,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [42/143] Dang xu ly: comments_TikTok_1_part_121.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 42,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [43/143] Dang xu ly: comments_TikTok_1_part_140.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 43,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [44/143] Dang xu ly: comments_TikTok_1_part_14.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 44,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [45/143] Dang xu ly: comments_TikTok_1_part_137.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 45,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [46/143] Dang xu ly: comments_TikTok_1_part_26.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 46,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [47/143] Dang xu ly: comments_TikTok_1_part_72.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 47,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [48/143] Dang xu ly: comments_TikTok_1_part_77.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 48,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [49/143] Dang xu ly: comments_TikTok_1_part_43.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 49,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [50/143] Dang xu ly: comments_TikTok_1_part_122.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 50,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [51/143] Dang xu ly: comments_TikTok_1_part_51.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 51,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [52/143] Dang xu ly: comments_TikTok_1_part_112.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 52,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [53/143] Dang xu ly: comments_TikTok_1_part_111.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 53,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [54/143] Dang xu ly: comments_TikTok_1_part_52.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 54,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [55/143] Dang xu ly: comments_TikTok_1_part_125.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 55,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [56/143] Dang xu ly: comments_TikTok_1_part_135.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 56,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [57/143] Dang xu ly: comments_TikTok_1_part_13.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 57,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [58/143] Dang xu ly: comments_TikTok_1_part_30.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 58,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [59/143] Dang xu ly: comments_TikTok_1_part_65.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 59,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [60/143] Dang xu ly: comments_TikTok_1_part_2.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 60,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [61/143] Dang xu ly: comments_TikTok_1_part_19.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 61,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [62/143] Dang xu ly: comments_TikTok_1_part_1.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 62,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [63/143] Dang xu ly: comments_TikTok_1_part_131.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 63,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [64/143] Dang xu ly: comments_TikTok_1_part_67.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 64,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [65/143] Dang xu ly: comments_TikTok_1_part_16.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 65,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [66/143] Dang xu ly: comments_TikTok_1_part_126.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 66,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [67/143] Dang xu ly: comments_TikTok_1_part_80.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 67,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [68/143] Dang xu ly: comments_TikTok_1_part_138.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 68,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [69/143] Dang xu ly: comments_TikTok_1_part_120.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 69,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [70/143] Dang xu ly: comments_TikTok_1_part_49.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 70,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [71/143] Dang xu ly: comments_TikTok_1_part_10.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 71,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [72/143] Dang xu ly: comments_TikTok_1_part_50.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 72,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [73/143] Dang xu ly: comments_TikTok_1_part_139.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 73,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [74/143] Dang xu ly: comments_TikTok_1_part_132.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 74,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [75/143] Dang xu ly: comments_TikTok_1_part_21.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 75,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [76/143] Dang xu ly: comments_TikTok_1_part_114.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 76,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [77/143] Dang xu ly: comments_TikTok_1_part_124.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 77,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [78/143] Dang xu ly: comments_TikTok_1_part_6.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 78,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [79/143] Dang xu ly: comments_TikTok_1_part_3.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 79,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [80/143] Dang xu ly: comments_TikTok_1_part_97.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 80,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [81/143] Dang xu ly: comments_TikTok_1_part_45.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 81,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [82/143] Dang xu ly: comments_TikTok_1_part_142.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 82,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [83/143] Dang xu ly: comments_TikTok_1_part_61.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 83,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [84/143] Dang xu ly: comments_TikTok_1_part_20.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 84,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [85/143] Dang xu ly: comments_TikTok_1_part_11.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 85,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [86/143] Dang xu ly: comments_TikTok_1_part_68.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 86,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [87/143] Dang xu ly: comments_TikTok_1_part_69.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 87,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [88/143] Dang xu ly: comments_TikTok_1_part_81.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 88,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [89/143] Dang xu ly: comments_TikTok_1_part_36.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 89,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [90/143] Dang xu ly: comments_TikTok_1_part_123.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 90,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [91/143] Dang xu ly: comments_TikTok_1_part_31.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 91,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [92/143] Dang xu ly: comments_TikTok_1_part_60.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 92,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [93/143] Dang xu ly: comments_TikTok_1_part_70.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 93,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [94/143] Dang xu ly: comments_TikTok_1_part_113.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 94,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [95/143] Dang xu ly: comments_TikTok_1_part_5.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 95,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [96/143] Dang xu ly: comments_TikTok_1_part_25.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 96,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [97/143] Dang xu ly: comments_TikTok_1_part_15.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 97,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [98/143] Dang xu ly: comments_TikTok_1_part_58.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 98,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [99/143] Dang xu ly: comments_TikTok_1_part_64.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 99,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [100/143] Dang xu ly: comments_TikTok_1_part_127.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 100,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [101/143] Dang xu ly: comments_TikTok_1_part_29.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 101,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [102/143] Dang xu ly: comments_TikTok_1_part_7.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 102,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [103/143] Dang xu ly: comments_TikTok_1_part_117.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 103,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [104/143] Dang xu ly: comments_TikTok_1_part_9.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 104,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [105/143] Dang xu ly: comments_TikTok_1_part_8.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 105,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [106/143] Dang xu ly: comments_TikTok_1_part_78.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 106,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [107/143] Dang xu ly: comments_TikTok_1_part_23.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 107,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [108/143] Dang xu ly: comments_TikTok_1_part_4.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 108,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [109/143] Dang xu ly: comments_TikTok_1_part_84.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 109,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [110/143] Dang xu ly: comments_TikTok_1_part_82.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 110,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [111/143] Dang xu ly: comments_TikTok_1_part_24.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 111,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [112/143] Dang xu ly: comments_TikTok_1_part_85.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 112,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [113/143] Dang xu ly: comments_TikTok_1_part_66.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 113,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [114/143] Dang xu ly: comments_TikTok_1_part_47.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 114,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [115/143] Dang xu ly: comments_TikTok_1_part_104.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 115,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [116/143] Dang xu ly: comments_TikTok_1_part_35.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 116,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [117/143] Dang xu ly: comments_TikTok_1_part_22.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 117,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [118/143] Dang xu ly: comments_TikTok_1_part_110.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 118,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [119/143] Dang xu ly: comments_TikTok_1_part_101.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 119,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [120/143] Dang xu ly: comments_TikTok_1_part_55.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 120,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [121/143] Dang xu ly: comments_TikTok_1_part_106.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 121,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [122/143] Dang xu ly: comments_TikTok_1_part_108.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 122,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [123/143] Dang xu ly: comments_TikTok_1_part_83.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 123,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [124/143] Dang xu ly: comments_TikTok_1_part_37.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 124,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [125/143] Dang xu ly: comments_TikTok_1_part_38.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 125,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [126/143] Dang xu ly: comments_TikTok_1_part_56.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 126,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [127/143] Dang xu ly: comments_TikTok_1_part_39.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 127,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [128/143] Dang xu ly: comments_TikTok_1_part_57.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 128,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [129/143] Dang xu ly: comments_TikTok_1_part_41.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 129,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [130/143] Dang xu ly: comments_TikTok_1_part_40.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 130,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [131/143] Dang xu ly: comments_TikTok_1_part_42.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 131,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [132/143] Dang xu ly: comments_TikTok_1_part_90.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 132,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [133/143] Dang xu ly: comments_TikTok_1_part_87.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 133,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [134/143] Dang xu ly: comments_TikTok_1_part_143.csv ---
   -> File nay co 791 comment. Max ID ke tiep: 133,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [135/143] Dang xu ly: comments_TikTok_1_part_100.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 134,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [136/143] Dang xu ly: comments_TikTok_1_part_99.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 135,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [137/143] Dang xu ly: comments_TikTok_1_part_98.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 136,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [138/143] Dang xu ly: comments_TikTok_1_part_107.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 137,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [139/143] Dang xu ly: comments_TikTok_1_part_109.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 138,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [140/143] Dang xu ly: comments_TikTok_1_part_86.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 139,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [141/143] Dang xu ly: comments_TikTok_1_part_105.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 140,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [142/143] Dang xu ly: comments_TikTok_1_part_91.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 141,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [143/143] Dang xu ly: comments_TikTok_1_part_92.csv ---
   -> File nay co 1,000 comment. Max ID ke tiep: 142,791
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

DA HOAN TAT TAT CA FILE.
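
The log above shows the running `Max ID ke tiep` value advancing by each file's row count. The sketch below reproduces that bookkeeping in plain Python. `allocate_id_ranges` is a hypothetical helper, not part of the notebook, and the starting value of 99,000 is inferred from the log (100,000 after the 1,000 rows of file 100). The row counts are the ones printed for files 100 to 143.

```python
from typing import List, Tuple


def allocate_id_ranges(start_max_id: int, row_counts: List[int]) -> List[Tuple[int, int]]:
    """Return the (first_id, last_id) range assigned to each file.

    Hypothetical helper for illustration only. An empty file yields a
    range whose last_id is smaller than its first_id.
    """
    ranges = []
    max_id = start_max_id
    for count in row_counts:
        ranges.append((max_id + 1, max_id + count))
        max_id += count  # the next file continues after this one
    return ranges


# Files 100-133 have 1,000 rows each, file 134 has 791, files 135-143 have 1,000 each.
counts = [1000] * 34 + [791] + [1000] * 9
ranges = allocate_id_ranges(99_000, counts)

print(ranges[0])       # (99001, 100000)
print(ranges[34])      # (133001, 133791)
print(ranges[-1][1])   # 142791
```

The last ID matches the final `Max ID ke tiep` value in the log, so the visible part of the run is consistent with gap-free, non-overlapping ID ranges.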


In [3]:
from pyspark.sql.functions import col

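# Enable eager evaluation so DataFrames render as tables in the notebook.
# Column truncation is controlled separately by show(truncate=False) below.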
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

TABLE_COMMENT = "nessie.silver_tables.comment"

print(f"--- KIEM TRA DU LIEU BANG: {TABLE_COMMENT} ---")

try:
    df = spark.table(TABLE_COMMENT)
    
    # 1. Count the rows currently in the table
    total_count = df.count()
    print(f"-> TONG SO COMMENT HIEN CO: {total_count:,}")

    # 2. Take the first 2000 rows (sorted by ID to check the ordering)
    # Note: sorting ensures the sample starts from ID=1
    print("-> Dang tai 2000 dong dau tien...")
    df_2000 = df.orderBy("commentID").limit(2000)

    # 3. Display
    # truncate=False: show the full text of long comments
    # Add vertical=True to print each row as a vertical record instead of a table
    df_2000.show(2000, truncate=False)

except Exception as e:
    print(f"Loi: {e}")
    print("Có thể bảng chưa được tạo hoặc chưa có dữ liệu.")

--- KIEM TRA DU LIEU BANG: nessie.silver_tables.comment ---
-> TONG SO COMMENT HIEN CO: 142,791
-> Dang tai 2000 dong dau tien...


ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/local/lib/python3.10/socket.py", line 717, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt
[Stage 2158:====>                                              (624 + 4) / 7050]

KeyboardInterrupt: 
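
The run above was interrupted while the sort stage was still working through 7050 tasks. One possible way to inspect the lowest IDs more cheaply is to filter on `commentID` before sorting, so that Iceberg may be able to skip data files whose ID range lies outside the filter. The sketch below only builds the query string. `build_sample_query` is a hypothetical helper, not part of the notebook, and it assumes that IDs start at 1 and are contiguous, as the load log suggests.

```python
def build_sample_query(table: str, max_id: int) -> str:
    """Build a query for rows with commentID in [1, max_id], sorted by commentID.

    Hypothetical helper for illustration only.
    """
    if max_id < 1:
        raise ValueError("max_id must be at least 1")
    return (
        f"SELECT * FROM {table} "
        f"WHERE commentID BETWEEN 1 AND {int(max_id)} "
        f"ORDER BY commentID"
    )


query = build_sample_query("nessie.silver_tables.comment", 2000)
print(query)
```

In the notebook the string could be passed to `spark.sql(query).show(2000, truncate=False)`. Whether this actually reduces the number of tasks depends on the file-level statistics Iceberg has for `commentID`, so treat it as something to try rather than a guaranteed fix.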



## 10. Load the ARTICLE and COMMENT Tables from Facebook Data

In [None]:
import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import (
    col, trim, to_timestamp, lit, current_timestamp, 
    input_file_name, coalesce, udf
)

# ====================================================
# PARSE TIME UDF
# ====================================================
MONTH_MAP = {"Tháng 1": "01", "Tháng 2": "02", "Tháng 3": "03", "Tháng 4": "04", "Tháng 5": "05", "Tháng 6": "06", "Tháng 7": "07", "Tháng 8": "08", "Tháng 9": "09", "Tháng 10": "10", "Tháng 11": "11", "Tháng 12": "12"}

def parse_vietnam_datetime(dt_str):
    """Convert a Vietnamese date string to 'yyyy-MM-dd HH:mm:ss'; return None if it cannot be parsed."""
    if not dt_str:
        return None
    try:
        # Keep only the text after the first comma (drops a leading prefix such as the weekday)
        if "," in dt_str:
            dt_str = dt_str.split(",", 1)[1].strip()
        match = re.search(r"(\d+)\s+(Tháng\s+\d+)", dt_str)
        if not match:
            return None
        day = match.group(1)
        # Collapse repeated whitespace so the label matches the MONTH_MAP keys
        month_text = " ".join(match.group(2).split())
        month = MONTH_MAP.get(month_text, "01")
        # Fall back to year 2025 and time 00:00 when the string does not contain them
        year_match = re.search(r",\s*(\d{4})", dt_str)
        year = year_match.group(1) if year_match else "2025"
        time_match = re.search(r"lúc\s+(\d{1,2}:\d{2})", dt_str)
        time_str = time_match.group(1) if time_match else "00:00"
        return f"{year}-{month}-{int(day):02d} {time_str}:00"
    except Exception:
        return None

parse_vn_time_udf = udf(parse_vietnam_datetime, StringType())

# ====================================================
# CONFIGURATION
# ====================================================
POSTS_GLOB = "s3a://bronze/MangXaHoi/Face-data/posts/*.csv"
TABLE_ARTICLE = "nessie.silver_tables.article"
TABLE_LOG = "nessie.silver_tables.fb_posts_files_log"

print("=" * 80)
print("JOB 3: LOAD FACEBOOK POSTS (MERGE/UPSERT)")
print("=" * 80)

spark.sql(f"CREATE TABLE IF NOT EXISTS {TABLE_LOG} (file_path STRING, load_time TIMESTAMP) USING iceberg")

# 1. Filter for new files (paths not yet recorded in the log table)
df_all = spark.read.option("header", "true").csv(POSTS_GLOB).withColumn("file_path", input_file_name())
try:
    df_proc = spark.table(TABLE_LOG).select("file_path").distinct()
    new_files = [r.file_path for r in df_all.join(df_proc, "file_path", "left_anti").select("file_path").distinct().collect()]
except Exception:
    # If the log table cannot be read, treat every file as new
    new_files = [r.file_path for r in df_all.select("file_path").distinct().collect()]

if new_files:
    print(f"Xu ly {len(new_files)} file moi.")
    
    # 2. Transform
    df_raw = spark.read.option("header", "true").option("inferSchema", "false").csv(new_files)
    df_trans = df_raw.select(
        trim(col("ID")).alias("articleID"),         # Facebook post ID -> articleID
        trim(col("Description")).alias("description"),
        trim(col("Author")).alias("author"),
        trim(col("Url")).alias("url"),
        coalesce(to_timestamp(parse_vn_time_udf(col("TimePublish"))), to_timestamp(col("TimePublish")), current_timestamp()).alias("timePublish"),
        coalesce(col("Like").cast("int"), lit(0)).alias("likeCount"),
        coalesce(col("Share").cast("int"), lit(0)).alias("shareCount"),
        coalesce(col("Comment").cast("int"), lit(0)).alias("commentCount")
    )

    # 3. MERGE (Upsert)
    df_trans.createOrReplaceTempView("fb_source")
    
    # Update metrics
    spark.sql(f"""
    MERGE INTO {TABLE_ARTICLE} t USING fb_source s
    ON t.url = s.url AND t.type = 'facebook'
    WHEN MATCHED THEN UPDATE SET
        t.description = s.description, t.timePublish = s.timePublish,
        t.likeCount = s.likeCount, t.shareCount = s.shareCount, t.commentCount = s.commentCount,
        t.updated_at = current_timestamp()
    """)

    # Insert new
    spark.sql(f"""
    INSERT INTO {TABLE_ARTICLE}
    SELECT s.articleID, s.description, s.author, s.url, s.timePublish,
           s.likeCount, s.commentCount, s.shareCount, 'facebook', current_timestamp(), current_timestamp()
    FROM fb_source s
    WHERE NOT EXISTS (SELECT 1 FROM {TABLE_ARTICLE} t WHERE t.url = s.url AND t.type = 'facebook')
    """)
    
    # 4. Log
    spark.createDataFrame([(f,) for f in new_files], ["file_path"]).withColumn("load_time", current_timestamp()).writeTo(TABLE_LOG).using("iceberg").append()
    print("Hoan tat.")
else:
    print("Khong co file moi.")
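
The date parsing used by `parse_vn_time_udf` can be checked without Spark. The sketch below re-implements the same steps as a plain function. `parse_vn_datetime_sketch` is a hypothetical name, and the sample strings are made-up illustrations of the format the regular expressions imply, not values taken from the dataset.

```python
import re

# Month label -> two-digit month number, same mapping as MONTH_MAP in the cell above.
MONTHS = {f"Tháng {m}": f"{m:02d}" for m in range(1, 13)}


def parse_vn_datetime_sketch(text):
    """Return 'yyyy-MM-dd HH:mm:00', or None when the day/month cannot be found."""
    if not text:
        return None
    # Keep only the text after the first comma (assumed to follow the weekday).
    if "," in text:
        text = text.split(",", 1)[1].strip()
    m = re.search(r"(\d+)\s+Tháng\s+(\d+)", text)
    if not m:
        return None
    day = int(m.group(1))
    month = MONTHS.get(f"Tháng {m.group(2)}", "01")
    year_m = re.search(r",\s*(\d{4})", text)
    time_m = re.search(r"lúc\s+(\d{1,2}:\d{2})", text)
    year = year_m.group(1) if year_m else "2025"      # same fallback as the notebook
    time_str = time_m.group(1) if time_m else "00:00"
    return f"{year}-{month}-{day:02d} {time_str}:00"


print(parse_vn_datetime_sketch("Thứ Hai, 3 Tháng 11, 2025 lúc 10:30"))  # 2025-11-03 10:30:00
print(parse_vn_datetime_sketch("Thứ Ba, 14 Tháng 1 lúc 9:05"))          # 2025-01-14 9:05:00
print(parse_vn_datetime_sketch("3 Tháng 11, 2025 lúc 10:30"))           # None
```

The third call shows a limitation: when the string has no weekday prefix, the split on the first comma discards the day and month, so the function returns `None`. In the notebook that case falls through `coalesce` to `to_timestamp(col("TimePublish"))` and finally `current_timestamp()`.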

In [6]:
import re
import gc
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, LongType
from pyspark.sql.functions import (
    col, lit, when, coalesce, trim, broadcast,
    to_timestamp, current_timestamp,
    input_file_name, regexp_replace, row_number
)

# ====================================================
# LOW-RESOURCE CONFIGURATION
# ====================================================
# spark = SparkSession.builder... (assumes the session already exists)

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.conf.set("spark.sql.files.maxPartitionBytes", "33554432") # 32MB
spark.conf.set("spark.sql.shuffle.partitions", "50")

# Path & Table
COMMENTS_GLOB = "s3a://bronze/MangXaHoi/Face-data/comments/*.csv"
TABLE_COMMENT = "nessie.silver_tables.comment"
TABLE_LOG = "nessie.silver_tables.fb_comments_files_log"

print("=" * 80)
print("JOB 4: LOAD FACEBOOK COMMENTS (FULL RUN - NO BATCH LIMIT)")
print("=" * 80)

# ====================================================
# 1. PREPARE & FILTER FILES
# ====================================================

# Create the log table if it does not exist yet
try:
    spark.table(TABLE_LOG)
except Exception:
    print(f"-> Bang log {TABLE_LOG} chua ton tai. Dang tao moi...")
    log_schema = StructType([StructField("file_path", StringType(), False), StructField("load_time", TimestampType(), False)])
    spark.createDataFrame([], log_schema).writeTo(TABLE_LOG).using("iceberg").create()

# Scan the source files
print("-> Dang quet file nguon...")
df_all = spark.read.format("binaryFile").option("pathGlobFilter", "*.csv").load(COMMENTS_GLOB).select("path")
total_files_count = df_all.count()
print(f"   Tong so file trong folder: {total_files_count}")

# Keep only new files (anti join against the log table)
try:
    df_processed = spark.table(TABLE_LOG).select("file_path").distinct()
    processed_count = df_processed.count()
    print(f"   So file da xu ly truoc do: {processed_count}")
    
    df_new_files = df_all.alias("src").join(
        df_processed.alias("log"), 
        col("src.path") == col("log.file_path"), 
        "left_anti"
    )
except Exception as e:
    print(f"   [WARNING] Loi doc Log: {e}")
    df_new_files = df_all

# Collect the full list of files to process
files_to_process = [r.path for r in df_new_files.collect()]

if not files_to_process:
    print("-> KHONG CO FILE MOI.")
else:
    print(f"-> Tim thay {len(files_to_process)} file moi. Se xu ly TOAN BO.")

    # ====================================================
    # 2. SEQUENTIAL PROCESSING (LOOP)
    # ====================================================

    # Get the current max commentID
    try:
        max_id_row = spark.sql(f"SELECT COALESCE(MAX(commentID), 0) as max_id FROM {TABLE_COMMENT}").collect()
        max_id = max_id_row[0]["max_id"] if max_id_row else 0
        print(f"-> Max commentID hien tai: {max_id:,}")
    except Exception:
        max_id = 0
        print("-> Bang comment chua co du lieu, bat dau tu ID = 0")

    # Process each file in turn
    for i, file_path in enumerate(files_to_process):
        filename = file_path.split('/')[-1]
        print(f"\n--- [{i+1}/{len(files_to_process)}] Dang xu ly: {filename} ---")
        
        try:
            # Read the CSV with every column as a string
            df_raw = spark.read.option("header", "true").option("inferSchema", "false").csv(file_path)
            
            # Check for the required column
            if "Id_post" not in df_raw.columns:
                print("   [SKIP] File thieu cot 'Id_post'.")
                # Log the file anyway so it is not picked up again on the next run
                spark.createDataFrame([(file_path,)], ["file_path"]).withColumn("load_time", current_timestamp()).writeTo(TABLE_LOG).using("iceberg").append()
                continue

            # Transform (Facebook mapping: columns missing from the source are set to NULL)
            df_trans = df_raw.select(
                trim(col("Id_post")).alias("articleID"), 
                lit(None).cast("string").alias("name"),
                lit(None).cast("string").alias("tagName"),
                lit(None).cast("string").alias("urlUser"),
                (col("Comment") if "Comment" in df_raw.columns else lit("")).alias("comment"),
                lit(None).cast("timestamp").alias("commentTime"),
                lit(0).alias("commentLike"), # Default to 0 instead of NULL to avoid errors in later calculations
                lit(1).alias("levelComment"), # Default to level 1
                lit(None).cast("string").alias("replyTo"),
                lit(0).alias("numberOfReply"),
                current_timestamp().alias("created_at"),
                current_timestamp().alias("updated_at")
            ).filter(col("articleID").isNotNull() & (col("articleID") != ""))
            
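            # Assign consecutive commentIDs that continue from max_id.
            # Window.orderBy(lit(1)) has no partitioning, so Spark moves all rows into one partition; acceptable for small files.
            window_spec = Window.orderBy(lit(1))
            df_trans = df_trans.withColumn("commentID", (row_number().over(window_spec) + max_id).cast("bigint"))
            
            # Advance max_id by the number of rows in this file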
            current_count = df_trans.count()
            max_id += current_count
            print(f"   -> File co {current_count:,} dong. Max ID ke tiep: {max_id:,}")

            if current_count > 0:
                print("   -> Dang ghi vao Iceberg...")
                # repartition(1): Facebook files are usually small and scattered, so write one data file per append to avoid many tiny files on HDFS/S3
                df_trans.select(
                    "commentID", "articleID", "name", "tagName", "urlUser", "comment",
                    "commentTime", "commentLike", "levelComment", "replyTo", "numberOfReply",
                    "created_at", "updated_at"
                ).repartition(1).writeTo(TABLE_COMMENT).using("iceberg").append()
                print("   -> Ghi xong.")
            else:
                print("   -> File rong hoac khong co ID hop le.")

            # Record the file in the log table
            spark.createDataFrame([(file_path,)], ["file_path"]).withColumn("load_time", current_timestamp()).writeTo(TABLE_LOG).using("iceberg").append()
            
            # Clean up
            df_trans.unpersist()
            del df_trans
            del df_raw
            gc.collect()

        except Exception as e:
            print(f"   [ERROR] Loi xu ly file {filename}: {e}")
            continue

    print("\n" + "=" * 80)
    print("DA HOAN TAT TAT CA FILE FACEBOOK.")

JOB 4: LOAD FACEBOOK COMMENTS (FULL RUN - NO BATCH LIMIT)
-> Dang quet file nguon...
   Tong so file trong folder: 100
   So file da xu ly truoc do: 0
-> Tim thay 100 file moi. Se xu ly TOAN BO.
-> Max commentID hien tai: 142,791

--- [1/100] Dang xu ly: comments_part_79.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 144,206
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [2/100] Dang xu ly: comments_part_68.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 145,621
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [3/100] Dang xu ly: comments_part_80.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 147,036
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [4/100] Dang xu ly: comments_part_78.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 148,451
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [5/100] Dang xu ly: comments_part_74.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 149,866
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [6/100] Dang xu ly: comments_part_86.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 151,281
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [7/100] Dang xu ly: comments_part_73.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 152,696
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [8/100] Dang xu ly: comments_part_11.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 154,111
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [9/100] Dang xu ly: comments_part_66.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 155,526
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [10/100] Dang xu ly: comments_part_69.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 156,941
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [11/100] Dang xu ly: comments_part_82.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 158,356
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [12/100] Dang xu ly: comments_part_87.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 159,771
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [13/100] Dang xu ly: comments_part_75.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 161,186
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [14/100] Dang xu ly: comments_part_77.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 162,601
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [15/100] Dang xu ly: comments_part_10.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 164,016
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [16/100] Dang xu ly: comments_part_9.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 165,431
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [17/100] Dang xu ly: comments_part_19.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 166,846
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [18/100] Dang xu ly: comments_part_91.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 168,261
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [19/100] Dang xu ly: comments_part_15.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 169,676
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [20/100] Dang xu ly: comments_part_76.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 171,091
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [21/100] Dang xu ly: comments_part_18.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 172,506
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [22/100] Dang xu ly: comments_part_65.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 173,921
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [23/100] Dang xu ly: comments_part_100.csv ---
   -> File co 1,404 dong. Max ID ke tiep: 175,325
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [24/100] Dang xu ly: comments_part_85.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 176,740
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [25/100] Dang xu ly: comments_part_13.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 178,155
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [26/100] Dang xu ly: comments_part_5.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 179,570
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [27/100] Dang xu ly: comments_part_22.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 180,985
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [28/100] Dang xu ly: comments_part_23.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 182,400
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [29/100] Dang xu ly: comments_part_83.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 183,815
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [30/100] Dang xu ly: comments_part_47.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 185,230
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [31/100] Dang xu ly: comments_part_21.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 186,645
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [32/100] Dang xu ly: comments_part_3.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 188,060
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [33/100] Dang xu ly: comments_part_56.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 189,475
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [34/100] Dang xu ly: comments_part_53.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 190,890
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [35/100] Dang xu ly: comments_part_90.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 192,305
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [36/100] Dang xu ly: comments_part_14.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 193,720
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [37/100] Dang xu ly: comments_part_61.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 195,135
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [38/100] Dang xu ly: comments_part_84.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 196,550
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [39/100] Dang xu ly: comments_part_29.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 197,965
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [40/100] Dang xu ly: comments_part_8.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 199,380
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [41/100] Dang xu ly: comments_part_55.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 200,795
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [42/100] Dang xu ly: comments_part_99.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 202,210
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [43/100] Dang xu ly: comments_part_44.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 203,625
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [44/100] Dang xu ly: comments_part_50.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 205,040
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [45/100] Dang xu ly: comments_part_48.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 206,455
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [46/100] Dang xu ly: comments_part_97.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 207,870
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [47/100] Dang xu ly: comments_part_52.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 209,285
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [48/100] Dang xu ly: comments_part_81.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 210,700
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [49/100] Dang xu ly: comments_part_88.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 212,115
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [50/100] Dang xu ly: comments_part_41.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 213,530
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [51/100] Dang xu ly: comments_part_67.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 214,945
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [52/100] Dang xu ly: comments_part_34.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 216,360
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [53/100] Dang xu ly: comments_part_72.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 217,775
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [54/100] Dang xu ly: comments_part_63.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 219,190
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [55/100] Dang xu ly: comments_part_96.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 220,605
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [56/100] Dang xu ly: comments_part_20.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 222,020
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [57/100] Dang xu ly: comments_part_12.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 223,435
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [58/100] Dang xu ly: comments_part_94.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 224,850
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [59/100] Dang xu ly: comments_part_45.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 226,265
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [60/100] Dang xu ly: comments_part_36.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 227,680
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [61/100] Dang xu ly: comments_part_64.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 229,095
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [62/100] Dang xu ly: comments_part_35.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 230,510
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [63/100] Dang xu ly: comments_part_7.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 231,925
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [64/100] Dang xu ly: comments_part_51.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 233,340
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [65/100] Dang xu ly: comments_part_89.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 234,755
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [66/100] Dang xu ly: comments_part_70.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 236,170
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [67/100] Dang xu ly: comments_part_92.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 237,585
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [68/100] Dang xu ly: comments_part_54.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 239,000
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [69/100] Dang xu ly: comments_part_26.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 240,415
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [70/100] Dang xu ly: comments_part_27.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 241,830
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [71/100] Dang xu ly: comments_part_6.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 243,245
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [72/100] Dang xu ly: comments_part_37.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 244,660
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [73/100] Dang xu ly: comments_part_46.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 246,075
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [74/100] Dang xu ly: comments_part_57.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 247,490
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [75/100] Dang xu ly: comments_part_60.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 248,905
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [76/100] Dang xu ly: comments_part_17.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 250,320
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [77/100] Dang xu ly: comments_part_58.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 251,735
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [78/100] Dang xu ly: comments_part_71.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 253,150
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [79/100] Dang xu ly: comments_part_30.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 254,565
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [80/100] Dang xu ly: comments_part_98.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 255,980
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [81/100] Dang xu ly: comments_part_25.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 257,395
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [82/100] Dang xu ly: comments_part_95.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 258,810
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [83/100] Dang xu ly: comments_part_49.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 260,225
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [84/100] Dang xu ly: comments_part_59.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 261,640
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [85/100] Dang xu ly: comments_part_1.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 263,055
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [86/100] Dang xu ly: comments_part_33.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 264,470
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [87/100] Dang xu ly: comments_part_62.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 265,885
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [88/100] Dang xu ly: comments_part_28.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 267,300
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [89/100] Dang xu ly: comments_part_40.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 268,715
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [90/100] Dang xu ly: comments_part_2.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 270,130
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [91/100] Dang xu ly: comments_part_42.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 271,545
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [92/100] Dang xu ly: comments_part_38.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 272,960
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [93/100] Dang xu ly: comments_part_43.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 274,375
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [94/100] Dang xu ly: comments_part_24.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 275,790
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [95/100] Dang xu ly: comments_part_4.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 277,205
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [96/100] Dang xu ly: comments_part_32.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 278,620
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [97/100] Dang xu ly: comments_part_39.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 280,035
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [98/100] Dang xu ly: comments_part_31.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 281,450
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [99/100] Dang xu ly: comments_part_16.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 282,865
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

--- [100/100] Dang xu ly: comments_part_93.csv ---
   -> File co 1,415 dong. Max ID ke tiep: 284,280
   -> Dang ghi vao Iceberg...
   -> Ghi xong.

DA HOAN TAT TAT CA FILE FACEBOOK.
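
The log above suggests a simple invariant: every file contributes 1,415 rows, and the ID reported for the next file advances by exactly that amount (for example, 197,965 + 1,415 = 199,380). The following is a minimal sketch for checking this against captured log text. It assumes the log format shown above; the names `LINE_RE`, `parse_log`, and `check_id_continuity` are hypothetical helpers, not part of the original notebook, and the invariant itself is inferred from the log rather than confirmed by the loader code.

```python
import re

# Matches log lines such as:
#    -> File co 1,415 dong. Max ID ke tiep: 199,380
LINE_RE = re.compile(r"File co ([\d,]+) dong\. Max ID ke tiep: ([\d,]+)")


def parse_log(log_text):
    """Return a list of (row_count, next_id) tuples found in the log text."""
    entries = []
    for match in LINE_RE.finditer(log_text):
        rows = int(match.group(1).replace(",", ""))
        next_id = int(match.group(2).replace(",", ""))
        entries.append((rows, next_id))
    return entries


def check_id_continuity(entries):
    """Return indexes of entries whose ID does not follow from the previous one.

    Assumption (inferred from the log only): the ID reported for file k+1
    equals the ID reported for file k plus the row count of file k.
    """
    gaps = []
    for i in range(1, len(entries)):
        prev_rows, prev_id = entries[i - 1]
        _, current_id = entries[i]
        if current_id != prev_id + prev_rows:
            gaps.append(i)
    return gaps


# Three lines copied from the log above.
sample_log = """
   -> File co 1,415 dong. Max ID ke tiep: 197,965
   -> File co 1,415 dong. Max ID ke tiep: 199,380
   -> File co 1,415 dong. Max ID ke tiep: 200,795
"""

entries = parse_log(sample_log)
# An empty list means the IDs advance without gaps or overlaps.
print(check_id_continuity(entries))
```

A non-empty result would point at the position in the log where an ID range was skipped or reused, which is worth investigating before trusting the loaded IDs.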


In [7]:
from pyspark.sql.functions import col

# ====================================================
# CONFIGURATION
# ====================================================
TABLE_ARTICLE = "nessie.silver_tables.article"

print("=" * 80)
print("ARTICLE COUNT CHECK (FACEBOOK & TIKTOK)")
print("=" * 80)

# 1. Read the Article table (reuses the `spark` session created earlier)
try:
    df_article = spark.table(TABLE_ARTICLE)

    # 2. Count articles per source (the `type` column)
    print("\n--- Breakdown by source ---")
    df_stats = df_article.groupBy("type").count().orderBy("type")
    df_stats.show()

    # 3. Compute the overall total
    total_count = df_article.count()
    print(f"-> TOTAL: {total_count} articles.")

    # 4. Inspect sample rows (optional)
    # Note: rows loaded in the same batch share one created_at value,
    # so the order among ties may be arbitrary.
    print("\n--- 5 most recent articles ---")
    df_article.select("articleID", "type", "description", "created_at") \
              .orderBy(col("created_at").desc()) \
              .show(5, truncate=50)

except Exception as e:
    print(f"Error reading table {TABLE_ARTICLE}: {e}")
    print("The table may not have been created yet, or it may be empty.")

print("\nCheck complete.")

ARTICLE COUNT CHECK (FACEBOOK & TIKTOK)

--- Breakdown by source ---

+--------+-----+
|    type|count|
+--------+-----+
|  TikTok| 1346|
|facebook| 1949|
+--------+-----+

-> TOTAL: 3295 articles.

--- 5 most recent articles ---

25/12/07 09:12:35 ERROR TaskSchedulerImpl: Lost executor 1 on 172.18.0.5: worker lost: 172.18.0.5:8882 got disassociated

+----------------+--------+--------------------------------------------------+--------------------------+
|       articleID|    type|                                       description|                created_at|
+----------------+--------+--------------------------------------------------+--------------------------+
|1922491415274632|facebook|                          Trường Đại học Giáo dục.|2025-12-07 05:08:32.077436|
|1915006809356426|facebook|Tốt nghiệp xong, tân cử nhân Trường Đại học Thă...|2025-12-07 05:08:32.077436|
|1918505135673260|facebook|                    Topic: Ngành Kỹ thuật Hóa học.|2025-12-07 05:08:32.077436|
|1914657312724709|facebook|Topic: Ngành Kinh tế số (Học mà không biết là r...|2025-12-07 05:08:32.077436|
|1911454119711695|facebook|                           Topic: Đại học Cần Thơ.|2025-12-07 05:08:32.077436|
+----------------+--------+--------------------------------------------------+--------------------------+
only showing top 5 rows


Check complete.

In [8]:
# Stop the Spark session to release resources
spark.stop()
print("Spark session stopped!")

Spark session stopped!
