I want to check whether every column name documented in the .json file is present in the .csv files.

In [1]:
import pandas as pd
import json

# --- Load DataFrame ---
# local_head = 'C:/Users/david/Documents/David/Unibe/Master_Thesis/'
local_head = '/home/dschwarz/Documents/MT/'

# Load the GFOC data more efficiently
GFOC_dir = local_head+'Dataset/Dataset_MSc/GFOC_RDCDFI.csv'
GFOC_data = pd.read_csv(GFOC_dir, low_memory=True)

# Load the SWMA data more efficiently
SWMA_dir = local_head+'Dataset/Dataset_MSc/SWMA_RDAWFI.csv'
SWMA_data = pd.read_csv(SWMA_dir, low_memory=True)
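
Since this check only needs the column names, reading just the header row would be enough. The following is a minimal sketch using the standard-library `csv` module; the helper name `read_header` and the demo file contents are hypothetical and not part of the original notebook.

```python
import csv
import os
import tempfile


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


# Demonstration on a tiny throwaway file (contents made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        f.write("X_IN [m],r [m],trend_c\n1.0,2.0,3.0\n")
    header = read_header(demo_path)

print(header)
```

With pandas, `pd.read_csv(path, nrows=0).columns` gives the header in the same way without loading any data rows.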

In [2]:

# --- Load JSON column documentation ---
with open("data_doc_all.json", "r") as f:
    data_doc = json.load(f)

# --- Extract all documented column names ---
documented_columns = set()
for section in data_doc.values():
    for entry in section:
        documented_columns.add(entry["Name"])

# --- Compare with DataFrame columns ---
print("GFOC_RDCDFI.csv and SWMA_RDAWFI.csv Column Documentation Check")

datasets = {"GFOC_RDCDFI.csv": GFOC_data, "SWMA_RDAWFI.csv": SWMA_data}
for i, (name, df) in enumerate(datasets.items()):
    # Blank line before every dataset except the first
    print(f"{name}:" if i == 0 else f"\n{name}:")
    df_columns = set(df.columns)

    missing_from_df = documented_columns - df_columns
    extra_in_df = df_columns - documented_columns

    # --- Output results ---
    print("🟥 Columns documented but missing from DataFrame:")
    for col in sorted(missing_from_df):
        print(f" - {col}")

    print("\n🟦 Columns in DataFrame but not documented:")
    for col in sorted(extra_in_df):
        print(f" - {col}")


GFOC_RDCDFI.csv and SWMA_RDAWFI.csv Column Documentation Check
GFOC_RDCDFI.csv:
🟥 Columns documented but missing from DataFrame:
 - Bx GSM
 - F30 (LASP)
 - VX_IN
 - VY_IN
 - VZ_IN
 - X_EF
 - X_IN
 - Y_EF
 - Y_IN
 - Z_EF
 - Z_IN
 - beta_sun
 - columns with _c
 - h_ell
 - is_outlier_maneuver_day
 - lat_ell
 - lat_sph
 - lon_ell
 - lon_sph
 - r
 - u_sat
 - u_sun

🟦 Columns in DataFrame but not documented:
 - Approximate Distance to SEL (Re)
 - VX_IN [m/s]
 - VY_IN [m/s]
 - VZ_IN [m/s]
 - X_EF [m]
 - X_IN [m]
 - Y_EF [m]
 - Y_IN [m]
 - Z_EF [m]
 - Z_IN [m]
 - beta_sun [deg]
 - h_ell [m]
 - is_man_or_missing
 - is_maneuver_unresolved
 - is_maneuver_unresolved_10m_decay
 - lat_ell [deg]
 - lat_sph [deg]
 - lon_ell [deg]
 - lon_sph [deg]
 - mean_altitude_c
 - orbital_decay_c
 - r [m]
 - res_c
 - res_std_c
 - residual_c
 - se_orbital_decay_c
 - se_slope
 - seasonal_1_c
 - seasonal_2_c
 - seasonal_3_c
 - seasonal_4_c
 - trend_c
 - u_sat [deg]
 - u_sun [deg]
 - unresolved_c

SWMA_RDAWFI.csv:
🟥 C

## Check missing Columns

Missing/different columns are the same for GFOC and SWMA.

Emoji convention:
- ✅: is in the database but under a different name than in the JSON
- ⛔: is not in the database
- ⚪️: is in the database but not in the JSON

🟥 Columns documented but missing from DataFrame:
 - ✅ Bx GSM **Same as Bx GSE**
 - ⛔ F30 (LASP) **Not in database**
 - ✅ VX_IN
 - ✅ VY_IN
 - ✅ VZ_IN
 - ✅ X_EF
 - ✅ X_IN
 - ✅ Y_EF
 - ✅ Y_IN
 - ✅ Z_EF
 - ✅ Z_IN
 - ✅ beta_sun
 - ⛔ columns with _c
 - ✅ h_ell
 - ✅ is_outlier_maneuver_day
 - ✅ lat_ell
 - ✅ lat_sph
 - ✅ lon_ell
 - ✅ lon_sph
 - ✅ r
 - ✅ u_sat
 - ✅ u_sun

🟦 Columns in DataFrame but not documented:
 - Approximate Distance to SEL (Re)
 - ✅ VX_IN [m/s]
 - ✅ VY_IN [m/s]
 - ✅ VZ_IN [m/s]
 - ✅ X_EF [m]
 - ✅ X_IN [m]
 - ✅ Y_EF [m]
 - ✅ Y_IN [m]
 - ✅ Z_EF [m]
 - ✅ Z_IN [m]
 - ✅ beta_sun [deg]
 - ✅ h_ell [m]
 - ✅ is_man_or_missing
 - ⚪️ is_maneuver_unresolved
 - ⚪️ is_maneuver_unresolved_10m_decay
 - ✅ lat_ell [deg]
 - ✅ lat_sph [deg]
 - ✅ lon_ell [deg]
 - ✅ lon_sph [deg]
 - ⚪️ mean_altitude_c
 - ⚪️ orbital_decay_c
 - ✅ r [m]
 - ⚪️ res_c
 - ⚪️ res_std_c
 - ⚪️ residual_c
 - ⚪️ se_orbital_decay_c
 - ⚪️ se_slope
 - ⚪️ seasonal_1_c
 - ⚪️ seasonal_2_c
 - ⚪️ seasonal_3_c
 - ⚪️ seasonal_4_c
 - ⚪️ trend_c
 - ✅ u_sat [deg]
 - ✅ u_sun [deg]
 - ⚪️ unresolved_c
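
Most of the ✅ entries above differ from the JSON only by a unit suffix such as ` [m]` or ` [deg]`, so part of this manual annotation could be automated. The sketch below assumes that a trailing bracketed unit is the only difference to normalise. The helper names `strip_unit` and `classify_columns` are hypothetical, and the sample lists are a small subset of the names shown above.

```python
import re

# Matches a trailing bracketed unit such as " [m]", " [m/s]" or " [deg]"
_UNIT_SUFFIX = re.compile(r"\s*\[[^\]]*\]$")


def strip_unit(name):
    """Remove a trailing bracketed unit from a column name (hypothetical helper)."""
    return _UNIT_SUFFIX.sub("", name)


def classify_columns(documented, df_columns):
    """Split the mismatches into renamed, missing and undocumented columns."""
    documented = set(documented)
    df_columns = set(df_columns)

    # CSV columns that carry a unit suffix, keyed by their unit-free name
    by_base = {}
    for col in df_columns - documented:
        base = strip_unit(col)
        if base != col:
            by_base[base] = col

    # Documented names that reappear in the CSV with a unit suffix
    renamed = {doc: by_base[doc] for doc in documented - df_columns if doc in by_base}
    # Documented names with no counterpart in the CSV
    missing = documented - df_columns - set(renamed)
    # CSV columns with no counterpart in the documentation
    undocumented = df_columns - documented - set(renamed.values())
    return renamed, missing, undocumented


# Small subset of the names listed above, for illustration only
documented = ["X_IN", "r", "beta_sun", "F30 (LASP)", "Bx GSM"]
csv_columns = ["X_IN [m]", "r [m]", "beta_sun [deg]", "trend_c", "se_slope"]

renamed, missing, undocumented = classify_columns(documented, csv_columns)
for doc, col in sorted(renamed.items()):
    print(f"renamed: {doc} -> {col}")
print("missing:", sorted(missing))
print("undocumented:", sorted(undocumented))
```

Renames that are not a unit suffix (for example `Bx GSM`, noted above as the same as `Bx GSE`) would still show up as missing and need manual review. The placeholder entry `columns with _c` is treated as an ordinary name in this sketch.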