I want to check whether every column name documented in the .json file is present in the .csv files.

In [1]:
import pandas as pd
import json

# --- Load DataFrame ---
# local_head = 'C:/Users/david/Documents/David/Unibe/Master_Thesis/'
local_head = '/home/dschwarz/Documents/MT/'

# Load the GFOC data (low_memory=True is the pandas default: the file is parsed in chunks)
GFOC_dir = local_head+'Dataset/Dataset_MSc/GFOC_RDCDFI.csv'
GFOC_data = pd.read_csv(GFOC_dir, low_memory=True)

# Load the SWMA data (same settings as for GFOC)
SWMA_dir = local_head+'Dataset/Dataset_MSc/SWMA_RDAWFI.csv'
SWMA_data = pd.read_csv(SWMA_dir, low_memory=True)

In [2]:

# --- Load JSON column documentation ---
with open("data_doc_all.json", "r") as f:
    data_doc = json.load(f)

# --- Extract all documented column names ---
documented_columns = set()
for section in data_doc.values():
    for entry in section:
        documented_columns.add(entry["Name"])

# --- Compare with DataFrame columns ---
print("GFOC_RDCDFI.csv and SWMA_RDAWFI.csv Column Documentation Check")
datasets = {"GFOC_RDCDFI.csv": GFOC_data, "SWMA_RDAWFI.csv": SWMA_data}

for i, (name, df) in enumerate(datasets.items()):
    # Separate the per-file reports with a blank line
    print(f"{name}:" if i == 0 else f"\n{name}:")
    df_columns = set(df.columns)

    missing_from_df = documented_columns - df_columns
    extra_in_df = df_columns - documented_columns

    # --- Output results ---
    print("🟥 Columns documented but missing from DataFrame:")
    for col in sorted(missing_from_df):
        print(f" - {col}")

    print("\n🟦 Columns in DataFrame but not documented:")
    for col in sorted(extra_in_df):
        print(f" - {col}")


GFOC_RDCDFI.csv and SWMA_RDAWFI.csv Column Documentation Check
GFOC_RDCDFI.csv:
🟥 Columns documented but missing from DataFrame:
 - Bx GSM
 - F30 (LASP)
 - VX_IN
 - VY_IN
 - VZ_IN
 - X_EF
 - X_IN
 - Y_EF
 - Y_IN
 - Z_EF
 - Z_IN
 - beta_sun
 - columns with _c
 - h_ell
 - is_outlier_maneuver_day
 - lat_ell
 - lat_sph
 - lon_ell
 - lon_sph
 - r
 - u_sat
 - u_sun

🟦 Columns in DataFrame but not documented:
 - Approximate Distance to SEL (Re)
 - VX_IN [m/s]
 - VY_IN [m/s]
 - VZ_IN [m/s]
 - X_EF [m]
 - X_IN [m]
 - Y_EF [m]
 - Y_IN [m]
 - Z_EF [m]
 - Z_IN [m]
 - beta_sun [deg]
 - h_ell [m]
 - is_man_or_missing
 - is_maneuver_unresolved
 - is_maneuver_unresolved_10m_decay
 - lat_ell [deg]
 - lat_sph [deg]
 - lon_ell [deg]
 - lon_sph [deg]
 - mean_altitude_c
 - orbital_decay_c
 - r [m]
 - res_c
 - res_std_c
 - residual_c
 - se_orbital_decay_c
 - se_slope
 - seasonal_1_c
 - seasonal_2_c
 - seasonal_3_c
 - seasonal_4_c
 - trend_c
 - u_sat [deg]
 - u_sun [deg]
 - unresolved_c

SWMA_RDAWFI.cs

## Check missing Columns

The missing and differently named columns are the same for GFOC and SWMA.
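
Since both files are checked against the same documentation, one way to confirm this is to compare the two column sets directly. The following is only a minimal sketch: `column_set_diff` and the demo lists are hypothetical stand-ins, and in the notebook one would pass `GFOC_data.columns` and `SWMA_data.columns` instead.

```python
def column_set_diff(left_columns, right_columns):
    """Return (names only in left, names only in right), each sorted."""
    left, right = set(left_columns), set(right_columns)
    return sorted(left - right), sorted(right - left)


# Hypothetical stand-in column lists; in the notebook these would be
# GFOC_data.columns and SWMA_data.columns.
gfoc_demo = ["r [m]", "u_sat [deg]", "trend_c"]
swma_demo = ["trend_c", "r [m]", "u_sat [deg]"]

only_gfoc, only_swma = column_set_diff(gfoc_demo, swma_demo)

# Both difference lists are empty exactly when the column sets are identical.
same_columns = not only_gfoc and not only_swma
print("Identical column sets:", same_columns)
```

If both lists come back empty, the two files necessarily produce the same "missing" and "not documented" reports.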

Emoji convention:
- ✅: in the database, but under a different name than in the JSON
- ⛔: not in the database
- ⚪️: in the database, but not in the JSON

🟥 Columns documented but missing from DataFrame:
 - ✅ Bx GSM **Same as Bx GSE**
 - ⛔ F30 (LASP) **Not in database**
 - ✅ VX_IN
 - ✅ VY_IN
 - ✅ VZ_IN
 - ✅ X_EF
 - ✅ X_IN
 - ✅ Y_EF
 - ✅ Y_IN
 - ✅ Z_EF
 - ✅ Z_IN
 - ✅ beta_sun
 - ⛔ columns with _c
 - ✅ h_ell
 - ✅ is_outlier_maneuver_day
 - ✅ lat_ell
 - ✅ lat_sph
 - ✅ lon_ell
 - ✅ lon_sph
 - ✅ r
 - ✅ u_sat
 - ✅ u_sun

🟦 Columns in DataFrame but not documented:
 - Approximate Distance to SEL (Re)
 - ✅ VX_IN [m/s]
 - ✅ VY_IN [m/s]
 - ✅ VZ_IN [m/s]
 - ✅ X_EF [m]
 - ✅ X_IN [m]
 - ✅ Y_EF [m]
 - ✅ Y_IN [m]
 - ✅ Z_EF [m]
 - ✅ Z_IN [m]
 - ✅ beta_sun [deg]
 - ✅ h_ell [m]
 - ✅ is_man_or_missing
 - ⚪️ is_maneuver_unresolved
 - ⚪️ is_maneuver_unresolved_10m_decay
 - ✅ lat_ell [deg]
 - ✅ lat_sph [deg]
 - ✅ lon_ell [deg]
 - ✅ lon_sph [deg]
 - ⚪️ mean_altitude_c
 - ⚪️ orbital_decay_c
 - ✅ r [m]
 - ⚪️ res_c
 - ⚪️ res_std_c
 - ⚪️ residual_c
 - ⚪️ se_orbital_decay_c
 - ⚪️ se_slope
 - ⚪️ seasonal_1_c
 - ⚪️ seasonal_2_c
 - ⚪️ seasonal_3_c
 - ⚪️ seasonal_4_c
 - ⚪️ trend_c
 - ✅ u_sat [deg]
 - ✅ u_sun [deg]
 - ⚪️ unresolved_c
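
Most of the renamed columns differ from their JSON entry only by a trailing bracketed unit such as ` [m]` or ` [deg]`, so the exact set comparison reports them twice. A possible refinement is to strip that suffix before comparing. The sketch below is an illustration, not part of the original notebook: `strip_unit`, `compare_columns`, and the demo lists are hypothetical names. The optional `group_entry` argument rests on the further assumption that the JSON entry `columns with _c` is meant to stand for every column ending in `_c`.

```python
import re

# Trailing bracketed unit such as " [m]", " [m/s]" or " [deg]".
UNIT_SUFFIX = re.compile(r"\s*\[[^\]]*\]$")


def strip_unit(name):
    """Drop a trailing bracketed unit from a column name."""
    return UNIT_SUFFIX.sub("", name)


def compare_columns(documented, df_columns, group_entry=None, group_suffix="_c"):
    """Return (missing, extra) after removing unit suffixes from df_columns.

    If group_entry is given and present in the documentation, it is treated
    as covering every column that ends with group_suffix (an assumption).
    """
    documented = set(documented)
    normalized = {strip_unit(col) for col in df_columns}

    grouped = set()
    if group_entry is not None and group_entry in documented:
        grouped = {col for col in normalized if col.endswith(group_suffix)}
        documented.discard(group_entry)

    missing = sorted(documented - normalized)
    extra = sorted(normalized - documented - grouped)
    return missing, extra


# Hypothetical stand-ins for documented_columns and df.columns.
documented_demo = ["X_IN", "beta_sun", "r", "F30 (LASP)"]
df_columns_demo = ["X_IN [m]", "beta_sun [deg]", "r [m]", "trend_c", "se_slope"]

missing, extra = compare_columns(documented_demo, df_columns_demo)
print(missing)  # ['F30 (LASP)']
print(extra)    # ['se_slope', 'trend_c']
```

Names with parentheses, such as `F30 (LASP)`, are left untouched because only square-bracketed suffixes are removed. Renames that are not unit suffixes (for example `is_outlier_maneuver_day` versus `is_man_or_missing`) would still need an explicit mapping.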