# ArXiv AI Papers – Reproducible EDA Demo

The goals of this notebook:

- Show how to **organize a pipeline** so that it is reproducible:
  - use `config.yaml` to configure the data,
  - use a script to log the environment (`capture_env.py`),
  - use a script to build a sample from the large file (`build_sample.py`).
- Then run a few basic EDA steps on the sample, so that `Restart & Run All` completes without errors.
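
The notebook calls `capture_env.py` without showing it. Below is a minimal sketch of what such a script might record, using only the standard library. The function name `capture_environment`, the package list, and the output layout are assumptions, not the repo's actual code:

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=("pandas", "numpy", "pyyaml")):
    """Collect interpreter, OS, and package versions as a plain dict (hypothetical layout)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


env = capture_environment()
print(json.dumps(env, indent=2))
```

Writing this dict to a file next to the run log would tie each run to the versions it used.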


In [1]:
import sys
import subprocess
from pathlib import Path

import yaml
import pandas as pd
import numpy as np

# ROOT = repo directory (assumes the notebook runs from a folder one level below it)
ROOT = (Path.cwd() / "..").resolve()
print("ROOT:", ROOT)

CONFIG_PATH = ROOT / "configs" / "config.yaml"

with CONFIG_PATH.open("r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

data_cfg = config["data"]
profiling_cfg = config["profiling"]

RAW_PATH = ROOT / data_cfg["raw_big_path"]
SAMPLE_PATH = ROOT / data_cfg["sample_path"]
SEED = int(data_cfg["random_state"])

print("RAW_PATH   :", RAW_PATH)
print("SAMPLE_PATH:", SAMPLE_PATH)
print("SEED       :", SEED)

np.random.seed(SEED)


ROOT: C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo
RAW_PATH   : C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo\data\raw\6_arXiv_scientific_dataset.csv
SAMPLE_PATH: C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo\data\processed\arxiv_ai_sample.csv
SEED       : 42
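
The cell above indexes `config["data"]` and `config["profiling"]` directly, so a missing key surfaces as a bare `KeyError`. A hedged sketch of a small check that reports every missing key at once follows. `REQUIRED_KEYS`, `missing_keys`, and the example dict are illustrative; the relative paths are inferred from the printed output, and the contents of `profiling` are not known from the notebook:

```python
# Keys the notebook reads from config.yaml; the helper itself is hypothetical.
REQUIRED_KEYS = {
    "data": ["raw_big_path", "sample_path", "random_state"],
    "profiling": [],  # the section must exist; its keys are not shown in the notebook
}


def missing_keys(config):
    """Return dotted names of required sections or keys absent from the config dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in config:
            missing.append(section)
            continue
        section_cfg = config[section] or {}  # an empty YAML section loads as None
        missing.extend(f"{section}.{key}" for key in keys if key not in section_cfg)
    return missing


# Example config mirroring the paths printed above, with one key left out on purpose.
example = {
    "data": {
        "raw_big_path": "data/raw/6_arXiv_scientific_dataset.csv",
        "sample_path": "data/processed/arxiv_ai_sample.csv",
    },
    "profiling": {},
}
print(missing_keys(example))  # ['data.random_state']
```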


In [2]:
print("[STEP] Capture environment...")
subprocess.run(
    [sys.executable, str(ROOT / "src" / "capture_env.py")],
    check=True
)

print("\n[STEP] Build sample if needed...")
result = subprocess.run(
    [sys.executable, str(ROOT / "src" / "build_sample.py")],
    capture_output=True,
    text=True
)

print("=== STDOUT from build_sample.py ===")
print(result.stdout)
print("=== STDERR from build_sample.py ===")
print(result.stderr)

if result.returncode != 0:
    raise RuntimeError(f"build_sample.py failed with code {result.returncode}")


[STEP] Capture environment...

[STEP] Build sample if needed...
=== STDOUT from build_sample.py ===
[build_sample] ROOT: C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo
[build_sample] CONFIG_PATH: C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo\configs\config.yaml
[build_sample] RAW: C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo\data\raw\6_arXiv_scientific_dataset.csv
[build_sample] SAMPLE: C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo\data\processed\arxiv_ai_sample.csv
[build_sample] N_ROWS: 10000
[build_sample] SEED: 42
[build_sample] Reading raw dataset...
[build_sample] Raw shape: (136238, 10)
[build_sample] Sampling...
[build_sample] Saved sample (10000, 10) to C:\Users\pc\OneDrive\Desktop\Work_N\Big Data\repo\data\processed\arxiv_ai_sample.csv

=== STDERR from build_sample.py ===



In [3]:
print("[STEP] Load sample dataset")

if not SAMPLE_PATH.exists():
    raise FileNotFoundError(f"Sample CSV not found: {SAMPLE_PATH}")

df = pd.read_csv(SAMPLE_PATH)
df.shape


[STEP] Load sample dataset


(10000, 10)

In [4]:
df.head()


,id,title,category,category_code,published_date,updated_date,authors,first_author,summary,summary_word_count
0,abs-1704.04688v1,Machine Learning and the Future of Realism,Machine Learning (Statistics),stat.ML,4/15/17,4/15/17,"['Giles Hooker', 'Cliff Hooker']",'Giles Hooker',The preceding three decades have seen the emer...,106
1,abs-2405.16697v1,CNN Autoencoder Resizer: A Power-Efficient LoS...,Machine Learning,cs.LG,5/26/24,5/26/24,"['Azim Akhtarshenas', 'Navid Ayoobi', 'David L...",'Azim Akhtarshenas',"Optimizing the design, performance, and resour...",143
2,abs-2109.03754v2,Memory and Knowledge Augmented Language Models...,Computation and Language (Natural Language Pro...,cs.CL,9/8/21,9/14/21,"['David Wilmot', 'Frank Keller']",'David Wilmot',Measuring event salience is essential in the u...,115
3,abs-2102.07318v4,A Global to Local Double Embedding Method for ...,Computer Vision and Pattern Recognition,cs.CV,2/15/21,10/17/21,"['Yiming Xu', 'Jiaxin Li', 'Yiheng Peng', 'Yan...",'Yiming Xu',Multi-person pose estimation is a fundamental ...,212
4,abs-1812.05788v2,AU R-CNN: Encoding Expert Prior Knowledge into...,Computer Vision and Pattern Recognition,cs.CV,12/14/18,8/25/19,"['Chen Ma', 'Li Chen', 'Junhai Yong']",'Chen Ma',Detecting action units (AUs) on human faces is...,245


In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  10000 non-null  object
 1   title               10000 non-null  object
 2   category            10000 non-null  object
 3   category_code       10000 non-null  object
 4   published_date      10000 non-null  object
 5   updated_date        10000 non-null  object
 6   authors             10000 non-null  object
 7   first_author        10000 non-null  object
 8   summary             10000 non-null  object
 9   summary_word_count  10000 non-null  int64 
dtypes: int64(1), object(9)
memory usage: 781.4+ KB


In [6]:
# Example: distribution of publication years, if a 'year' column exists
if "year" in df.columns:
    df["year"].value_counts().sort_index().plot(kind="bar", figsize=(10, 4), title="Papers per year")
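
According to `df.info()`, the sample has no `year` column, so the check above is skipped and no plot is drawn. One possible way to derive the year from `published_date` is sketched below. It assumes every value follows the `M/D/YY` pattern seen in `df.head()`; two-digit years follow the `%y` convention, which maps 69-99 to 1969-1999 and 00-68 to 2000-2068:

```python
import pandas as pd

# Dates copied from the df.head() output above.
demo = pd.DataFrame({"published_date": ["4/15/17", "5/26/24", "9/8/21", "2/15/21", "12/14/18"]})

# errors="coerce" turns values that do not match the format into NaT instead of raising.
parsed = pd.to_datetime(demo["published_date"], format="%m/%d/%y", errors="coerce")
demo["year"] = parsed.dt.year

# Convert to plain Python ints so the printed dict looks the same across versions.
year_counts = {int(year): int(n) for year, n in demo["year"].value_counts().sort_index().items()}
print(year_counts)  # {2017: 1, 2018: 1, 2021: 2, 2024: 1}
```

Applied to the real sample, the same two lines on `df` would create the `year` column that the plotting cell checks for.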


In [7]:
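# Top 10 authors (if the 'authors' column exists)
# Each value is a stringified list such as "['A', 'B']", so remove the
# brackets before splitting and the quotes after it.
if "authors" in df.columns:
    top_authors = (
        df["authors"]
        .str.strip("[]")
        .str.split(",")
        .explode()
        .str.strip(" '\"")
        .value_counts()
        .head(10)
    )
    # A bare expression inside an if block is not displayed, so print explicitly.
    print(top_authors)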
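
In the `df.head()` output, each `authors` value is a stringified Python list such as `['Giles Hooker', 'Cliff Hooker']`, so names cannot be counted reliably by comma-splitting the raw text alone. A sketch that parses each value with `ast.literal_eval` instead is shown below. `parse_authors` is a hypothetical helper, and the first three sample rows are copied from the output above:

```python
import ast
from collections import Counter


def parse_authors(raw):
    """Parse a stringified list of names; return [] when the text is not a valid list."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return [str(name).strip() for name in value] if isinstance(value, list) else []


rows = [
    "['Giles Hooker', 'Cliff Hooker']",
    "['David Wilmot', 'Frank Keller']",
    "['Chen Ma', 'Li Chen', 'Junhai Yong']",
    "['Giles Hooker']",  # made-up extra row so that one author appears twice
]
counts = Counter(name for raw in rows for name in parse_authors(raw))
print(counts.most_common(1))  # [('Giles Hooker', 2)]
```

With pandas, the same helper could be applied as `df["authors"].map(parse_authors).explode().value_counts().head(10)`.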


In [8]:
from datetime import datetime

LOG_PATH = ROOT / "metadata" / "run_log.txt"
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)

with LOG_PATH.open("a", encoding="utf-8") as f:
    f.write(f"Run at {datetime.now().isoformat()} | sample={SAMPLE_PATH.name} | rows={len(df)}\n")

LOG_PATH


WindowsPath('C:/Users/pc/OneDrive/Desktop/Work_N/Big Data/repo/metadata/run_log.txt')
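
Each run appends one line of the form `Run at <timestamp> | sample=<file> | rows=<n>` to `run_log.txt`. A small sketch of reading such a line back, for example to confirm that repeated runs report the same row count, could look like this. `parse_log_line` is a hypothetical helper and the timestamp in the example is invented:

```python
from datetime import datetime

PREFIX = "Run at "


def parse_log_line(line):
    """Split one run-log line into its timestamp, sample name, and row count."""
    stamp_part, sample_part, rows_part = (part.strip() for part in line.strip().split("|"))
    return {
        "run_at": datetime.fromisoformat(stamp_part[len(PREFIX):]),
        "sample": sample_part.split("=", 1)[1],
        "rows": int(rows_part.split("=", 1)[1]),
    }


# Example line in the format written by the cell above (timestamp is made up).
line = "Run at 2024-01-01T12:00:00.123456 | sample=arxiv_ai_sample.csv | rows=10000\n"
entry = parse_log_line(line)
print(entry["sample"], entry["rows"])  # arxiv_ai_sample.csv 10000
```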