# Phase 1 — Data Preparation


This notebook reproduces the original SPSS data preparation pipeline in Python to ensure transparency and reproducibility before further analysis.

In [1]:
import pyreadstat

df, meta = pyreadstat.read_sav("data/raw/Data_Responden_TA_Christian_Revisi.sav")

In [2]:
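df.shape             # (n_rows, n_columns); evaluated but not displayed
df.head()            # first five rows; evaluated but not displayed
df.columns.tolist()  # only this last expression is echoed as the cell output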

['NoResponden',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'TotalX1',
 'X21',
 'X22',
 'X23',
 'X24',
 'X25',
 'TotalX2',
 'X31',
 'X32',
 'X33',
 'X34',
 'X35',
 'TotalX3',
 'Y1',
 'Y2',
 'Y3',
 'Y4',
 'Y5',
 'TotalY',
 'RES_1',
 'RES_2']

In [8]:
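# Keep the respondent ID and the 20 questionnaire items; the remaining
# columns (TotalX1, TotalX2, TotalX3, TotalY, RES_1, RES_2) are left out.
raw_cols = [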
    "NoResponden",
    "X11","X12","X13","X14","X15",
    "X21","X22","X23","X24","X25",
    "X31","X32","X33","X34","X35",
    "Y1","Y2","Y3","Y4","Y5"
]

df_raw = df[raw_cols].copy()

In [9]:
df_raw.describe().T[['min','max']]

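| Variable | min | max |
|---------|-----|-----|
| NoResponden | 1.0 | 79.0 |
| X11 | 2.0 | 4.0 |
| X12 | 2.0 | 4.0 |
| X13 | 2.0 | 4.0 |
| X14 | 1.0 | 4.0 |
| X15 | 2.0 | 4.0 |
| X21 | 1.0 | 4.0 |
| X22 | 1.0 | 4.0 |
| X23 | 1.0 | 4.0 |
| X24 | 1.0 | 4.0 |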
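
The min/max table is the only scale check in this notebook, and it has to be read by eye. The sketch below turns it into an explicit report of missing and out-of-range responses. It assumes a 1 to 4 response scale, which is inferred from the observed minima and maxima rather than stated in the source. `check_item_ranges` is a hypothetical helper, and the toy frame stands in for `df_raw`.

```python
import pandas as pd

def check_item_ranges(frame, items, low=1, high=4):
    """Report items that have missing or out-of-range responses.

    low/high are assumptions: the 1-4 scale is inferred from the observed
    minima and maxima, not taken from the questionnaire itself.
    """
    subset = frame[items]
    report = pd.DataFrame({
        "n_missing": subset.isna().sum(),
        # NaN compares as False on both sides, so missing values are
        # counted only in n_missing, never as out of range.
        "n_out_of_range": ((subset < low) | (subset > high)).sum(),
    })
    # Keep only the items that show at least one problem
    return report[(report > 0).any(axis=1)]

# Toy stand-in for df_raw (hypothetical values, not the thesis data)
toy = pd.DataFrame({
    "X11": [2.0, 3.0, 4.0],
    "X12": [1.0, 5.0, 2.0],   # 5.0 lies outside the assumed 1-4 scale
    "X13": [3.0, None, 4.0],  # one missing response
})
problems = check_item_ranges(toy, ["X11", "X12", "X13"])
print(problems)
```

On the real data the call would be `check_item_ranges(df_raw, raw_cols[1:])`, which skips `NoResponden`; an empty result means every item is complete and within the assumed scale.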


In [10]:
df_raw.to_csv(
    "data/raw/ta_christian_raw_clean.csv",
    index=False
)

## Construct Definition and Aggregation Plan

This section documents the mapping between questionnaire items and latent constructs as defined in the original undergraduate thesis. Each composite variable used in the analysis is reconstructed in Python as the unweighted sum of its five items, so the aggregation step is transparent and reproducible.

| Construct | Description | Items |
|---------|-------------|-------|
| X1 | Investment Knowledge | X11, X12, X13, X14, X15 |
| X2 | Knowledge of US–China Trade War | X21, X22, X23, X24, X25 |
| X3 | Risk Perception | X31, X32, X33, X34, X35 |
| Y  | Investment Decision | Y1, Y2, Y3, Y4, Y5 |


In [11]:
df_constructed = df_raw.copy()

df_constructed["X1"] = df_raw[["X11","X12","X13","X14","X15"]].sum(axis=1)
df_constructed["X2"] = df_raw[["X21","X22","X23","X24","X25"]].sum(axis=1)
df_constructed["X3"] = df_raw[["X31","X32","X33","X34","X35"]].sum(axis=1)
df_constructed["Y"]  = df_raw[["Y1","Y2","Y3","Y4","Y5"]].sum(axis=1)
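
The source file already carries SPSS-computed totals (`TotalX1`, `TotalX2`, `TotalX3`, `TotalY`), so the reconstructed sums can be compared with them directly. The following is a minimal sketch of that comparison, not part of the original pipeline. It assumes each `Total*` column is meant to equal the sum of the matching five items, a pairing inferred from the column names only. `build_composites` and `mismatched_rows` are hypothetical helpers, and the toy frame replaces the real `df`. Note that `sum(axis=1)` skips missing values by default; `min_count` is used here so that an incomplete response yields NaN instead of a deflated score.

```python
import pandas as pd

# Item-to-construct mapping, as listed in the construct table
CONSTRUCTS = {
    "X1": ["X11", "X12", "X13", "X14", "X15"],
    "X2": ["X21", "X22", "X23", "X24", "X25"],
    "X3": ["X31", "X32", "X33", "X34", "X35"],
    "Y": ["Y1", "Y2", "Y3", "Y4", "Y5"],
}
# Assumed pairing with the SPSS totals (inferred from column names)
SPSS_TOTALS = {"X1": "TotalX1", "X2": "TotalX2", "X3": "TotalX3", "Y": "TotalY"}

def build_composites(frame, constructs=CONSTRUCTS):
    """Sum each construct's items row by row."""
    composites = pd.DataFrame(index=frame.index)
    for name, items in constructs.items():
        # min_count=len(items): any missing item makes the composite NaN
        composites[name] = frame[items].sum(axis=1, min_count=len(items))
    return composites

def mismatched_rows(frame, composites, totals=SPSS_TOTALS):
    """Per construct, list the row labels where the sums disagree.

    NaN never equals anything, so rows with a missing composite or a
    missing SPSS total are reported as mismatches too.
    """
    result = {}
    for name, total_col in totals.items():
        differs = composites[name] != frame[total_col]
        result[name] = differs[differs].index.tolist()
    return result

# Toy stand-in for df (hypothetical values, not the thesis data)
all_items = [col for cols in CONSTRUCTS.values() for col in cols]
toy = pd.DataFrame({col: [2.0, 3.0, 4.0] for col in all_items})
for name, total_col in SPSS_TOTALS.items():
    toy[total_col] = toy[CONSTRUCTS[name]].sum(axis=1)
toy.loc[1, "TotalX2"] = 99.0  # deliberately corrupt one stored total

composites = build_composites(toy)
mismatches = mismatched_rows(toy, composites)
print(mismatches)
```

Against the real data, `mismatched_rows(df, build_composites(df))` returning empty lists for all four constructs would indicate that the Python aggregation matches the stored SPSS totals.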

In [12]:
df_constructed[["X1","X2","X3","Y"]].describe()

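| Statistic | X1 | X2 | X3 | Y |
|---------|----|----|----|---|
| count | 79.0 | 79.0 | 79.0 | 79.0 |
| mean | 16.088608 | 13.189873 | 15.911392 | 11.35443 |
| std | 2.392241 | 2.722536 | 1.889172 | 2.496346 |
| min | 10.0 | 5.0 | 13.0 | 5.0 |
| 25% | 15.0 | 13.0 | 15.0 | 10.0 |
| 50% | 15.0 | 14.0 | 15.0 | 11.0 |
| 75% | 17.0 | 14.5 | 17.0 | 13.0 |
| max | 20.0 | 18.0 | 20.0 | 17.0 |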


In [None]:
df_constructed.to_csv(
    "data/processed/ta_christian_constructed.csv",
    index=False
)
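
Since this export cell is what later phases depend on, a round-trip check can confirm that the CSV on disk reproduces the in-memory frame. The sketch below is one way to do that under stated assumptions: `export_and_verify` is a hypothetical helper, the toy frame replaces `df_constructed`, and a temporary directory is used in place of `data/processed/`. It also creates the target directory first, because `to_csv` does not create missing folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_and_verify(frame, path):
    """Write frame to CSV, read it back, and fail loudly if they differ."""
    path = Path(path)
    # to_csv raises an error if the parent directory is missing
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    # check_dtype=False tolerates int/float differences after the round trip
    pd.testing.assert_frame_equal(
        frame.reset_index(drop=True), reloaded, check_dtype=False
    )
    return reloaded

# Toy stand-in for df_constructed (hypothetical values)
toy = pd.DataFrame({
    "NoResponden": [1.0, 2.0],
    "X1": [16.0, 15.0],
    "Y": [11.0, 12.0],
})
with tempfile.TemporaryDirectory() as tmp:
    reloaded = export_and_verify(toy, Path(tmp) / "processed" / "constructed.csv")
print(reloaded.shape)
```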

## Conclusion

The raw SPSS dataset was imported with `pyreadstat`, the respondent ID and the 20 questionnaire items were extracted, and the four composite variables (X1, X2, X3, Y) were rebuilt as item sums according to the original construct definitions. Item ranges were inspected through their minimum and maximum values, and the composite scores of the 79 respondents fall between 5 and 20. The resulting dataset can be regenerated from the raw file and is ready for regression analysis.


**Status:** Phase 1 complete. The dataset has been cleaned and all composite variables have been reconstructed in Python in alignment with the original undergraduate thesis.