In [1]:
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [6]:
# Load the dataset
df = pd.read_csv("controlled_variable_combined_new(final).csv")

The Pair_ID variable marks pairs of consecutive decisions made by participants. It assigns the same identifier to both rows of each consecutive pair in the dataset: rows 1 and 2 share one Pair_ID, rows 3 and 4 share the next, and so on, because each pair of rows represents one paired decision.

Treating pairs as single observations: with the Pair_ID variable in place, the two rows of a pair can be weighted so that together they count as one observation, instead of each decision counting independently. This makes the paired nature of the decisions explicit in the analysis.
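As a minimal sketch of the grouping described above, the toy frame below (made-up values, not the study data) shows how integer division of the row index by 2 labels consecutive rows. It assumes the frame has the default 0, 1, 2, ... index and that the two rows of a pair are adjacent; the sanity check that each pair contains exactly one spared option is our addition for illustration.

```python
import pandas as pd

# Toy outcome column: two consecutive rows form one paired decision (values are made up)
toy = pd.DataFrame({"Spare": [1, 0, 0, 1, 1, 0]})

# Integer division of the default RangeIndex by 2 gives 0, 0, 1, 1, 2, 2
toy["Pair_ID"] = toy.index // 2
print(toy["Pair_ID"].tolist())  # [0, 0, 1, 1, 2, 2]

# Sanity check: within each pair exactly one option should be spared
spared_per_pair = toy.groupby("Pair_ID")["Spare"].sum()
print((spared_per_pair == 1).all())  # True
```

If the frame has been filtered, sorted, or re-indexed before this step, the index no longer lines up with the pairs, so calling `reset_index(drop=True)` first would be a sensible precaution.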

In logistic regression models, observations are typically assumed to be independent and identically distributed (i.i.d.). Paired decisions violate this assumption: within each pair the same participant chooses which of the two compared options to spare, so one is spared and the other is killed, and the two outcomes are not independent. To address this, we pass the freq_weights argument when fitting the model so that each pair of decisions carries the same total weight. Each pair then contributes equally to the estimation of the model parameters. For example, if some decision scenarios are more common in the dataset, frequency weights help to adjust their influence so that the analysis does not become biased towards these more frequent scenarios.
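The weighting can be illustrated on a toy example. The sketch below assumes a row-level train/test split has kept both rows of one pair but only one row of the others (the row selection is hypothetical); each row then receives a weight of 1 divided by the number of rows from its pair that remain, so every pair sums to a weight of one.

```python
import pandas as pd

# Toy data with two rows per pair (values are made up)
toy = pd.DataFrame({"Spare": [1, 0, 0, 1, 1, 0]})
toy["Pair_ID"] = toy.index // 2

# Hypothetical training subset: pair 0 is complete, pairs 1 and 2 lost one row each
train_toy = toy.loc[[0, 1, 2, 5]].copy()

# Weight = 1 / (rows of this pair present in the training subset)
pair_counts = train_toy.groupby("Pair_ID").size()
train_toy["freq_weights"] = 1 / train_toy["Pair_ID"].map(pair_counts)
print(train_toy["freq_weights"].tolist())  # [0.5, 0.5, 1.0, 1.0]

# Every pair contributes a total weight of one
weight_per_pair = train_toy.groupby("Pair_ID")["freq_weights"].sum()
print(weight_per_pair.tolist())  # [1.0, 1.0, 1.0]
```

With this scheme the weights sum to the number of distinct pairs in the training data rather than the number of rows. To our understanding, statsmodels bases the residual degrees of freedom on that weight sum, which would explain a model summary reporting fewer residual degrees of freedom than observations.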

Without the Pair_ID variable, the model won't have any information about which decisions belong to the same pair, and thus it won't be able to distinguish between decisions made within the same pair versus decisions made in different pairs.

Limitation: Assigning a unique identifier to each pair of decisions acknowledges that each decision to spare or not spare is made within the context of a pair, not in isolation. This matters because logistic regression assumes i.i.d. observations. Pair_ID is primarily used to calculate frequency weights, which adjust the analysis to account for how each pair is represented in the dataset. Frequency weights can help control for overrepresentation or underrepresentation of certain choice scenarios, so that the model's estimates are not biased by how often certain types of pairs occur.

Nonetheless, we did not directly model this dependency, in the sense of altering the logistic regression's estimation process to account for the paired nature of the data. Modeling the choice explicitly as a comparative decision (e.g., by using differences between the characteristics of the two options as predictors) would align more directly with the comparative nature of the decision-making process.
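A rough sketch of what the difference-based alternative could look like is given below. It only builds the pair-level predictors; it is not the analysis run in this notebook. The column names follow the formula used here, but the values, the 0/1 coding of Gender, and the helper columns `position`, `Age_diff`, `Gender_diff`, and `first_spared` are assumptions made for illustration.

```python
import pandas as pd

# Toy data: two rows per pair; values and the 0/1 Gender coding are made up
toy = pd.DataFrame({
    "Age":    [25, 70, 40, 10, 33, 33],
    "Gender": [0, 1, 1, 1, 0, 1],
    "Spare":  [1, 0, 0, 1, 1, 0],
})
toy["Pair_ID"] = toy.index // 2
toy["position"] = toy.index % 2  # 0 = first option in the pair, 1 = second

# Reshape to one row per pair with the two options side by side
wide = toy.pivot(index="Pair_ID", columns="position", values=["Age", "Gender", "Spare"])

# Predictors are differences (first option minus second); the outcome is
# whether the first option was the one spared
pair_level = pd.DataFrame({
    "Age_diff": wide[("Age", 0)] - wide[("Age", 1)],
    "Gender_diff": wide[("Gender", 0)] - wide[("Gender", 1)],
    "first_spared": wide[("Spare", 0)],
})
print(pair_level)
```

A frame like `pair_level` has one row per decision, so a logistic regression of `first_spared` on the difference columns would no longer need frequency weights to handle the pairing.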

In [7]:
# Note: The 'Age' variable is treated as continuous in this context

# Create a new variable for consecutive decision pairs directly in 'df'
df['Pair_ID'] = df.index // 2


# Define the formula with interaction terms
formula = 'Spare ~ Gender + Perspective + Age + Gender:Perspective + Perspective:Age'

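# Split the data into training and testing sets, keeping 'Pair_ID' for the frequency weights.
# Note: this is a row-level split, so the two rows of a pair can land in different sets.
train, test = train_test_split(df, test_size=0.3, random_state=42)
# Work on explicit copies so the column assignments below do not raise SettingWithCopyWarning
train, test = train.copy(), test.copy()

# Calculate frequency weights for each pair of decisions in the training data based on 'Pair_ID':
# each row gets 1 / (number of rows from its pair that ended up in the training set)
pair_counts_train = train.groupby('Pair_ID').size()
train['freq_weights'] = 1 / train['Pair_ID'].map(pair_counts_train)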

# Fit the logistic regression model using GLM.from_formula
model_cont_2 = sm.GLM.from_formula(formula=formula, data=train, family=sm.families.Binomial(), freq_weights=train['freq_weights'])
result_cont_2 = model_cont_2.fit()

# Print summary of the model
print("Model: Treating Age as Continuous with Interaction Terms")
print(result_cont_2.summary())


Model: Treating Age as Continuous with Interaction Terms
                 Generalized Linear Model Regression Results                  
Dep. Variable:                  Spare   No. Observations:                 1909
Model:                            GLM   Df Residuals:                     1228
Model Family:                Binomial   Df Model:                            5
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -717.03
Date:                Wed, 10 Apr 2024   Deviance:                       1434.1
Time:                        00:11:15   Pearson chi2:                 1.23e+03
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1344
Covariance Type:            nonrobust                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------

In [8]:
# Generate predicted probabilities for the test set
test['predicted_probability'] = result_cont_2.predict(test)

# Convert probabilities to binary predictions based on the threshold of 0.5
test['predicted'] = (test['predicted_probability'] >= 0.5).astype(int)

# Compute accuracy
accuracy = (test['predicted'] == test['Spare']).mean()
print(f"Accuracy: {accuracy:.4f}")



Accuracy: 0.6996
