`CODEBOOK`

In [1]:
# Import libraries
import pandas as pd
import researchpy as rp

In [2]:
# Read cleaned csv
df = pd.read_csv('EU_data2.csv')

In [3]:
# Info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23915710 entries, 0 to 23915709
Data columns (total 16 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Unnamed: 0              int64  
 1   ResponseID              object 
 2   ExtendedSessionID       object 
 3   UserID                  float64
 4   ScenarioOrder           int64  
 5   Intervention            int64  
 6   PedPed                  int64  
 7   Barrier                 int64  
 8   CrossingSignal          int64  
 9   AttributeLevel          object 
 10  ScenarioTypeStrict      object 
 11  ScenarioType            object 
 12  NumberOfCharacters      float64
 13  DiffNumberOFCharacters  float64
 14  Saved                   int64  
 15  UserCountry3            object 
dtypes: float64(3), int64(7), object(6)
memory usage: 2.9+ GB
None
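The `Unnamed: 0` column in the listing above is usually the row index that an earlier `DataFrame.to_csv` call wrote into the file. The sketch below uses a small in-memory CSV as a stand-in for `EU_data2.csv` and shows one way to read that column back as the index and to pick narrower dtypes for 0/1 flags, which can reduce memory use. The sample values and the dtype choices are assumptions for illustration, not the settings used to produce the output above.

```python
import io
import pandas as pd

# Tiny stand-in for EU_data2.csv (hypothetical values), saved with its index.
csv_text = """,ResponseID,Intervention,Saved,UserCountry3
0,abc,0,1,DEU
1,abc,1,0,DEU
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,  # treat the unnamed first column as the index
    dtype={"Intervention": "int8", "Saved": "int8", "UserCountry3": "category"},
)

print(sample.dtypes)
```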


The descriptions below are retrieved from [here](https://osf.io/wt6mc)

Please note the following keywords:

* Session: Each session consists of 13 scenarios (dilemmas) that are faced by an automated vehicle (AV). When a user visits the Judge mode, they respond to each dilemma by choosing one of the two outcomes (aka profiles). At the end of the 13 scenarios the user is presented with a summary of their decisions. At the end of the session users are also offered an optional survey (more on it below). Users may go through as many sessions as they like, and they may leave the website or close the window before completing a session. This data is in the SharedResponse.csv file. When users opt out of sharing their data by clicking the "Opt-out" link, their data is removed from the database and a counter is incremented.

* Scenario: Each scenario represents a dilemma faced by an AV. Each dilemma consists of two outcomes (aka profiles): one outcome is the result of the AV STAYing on course, and the other is the result of the AV SWERVing off course. When a user chooses an outcome, this choice is logged in the database.

* Outcome: An outcome is represented by a set of features describing the environment and the characters whose lives are at stake in that outcome. While the data is collected and stored in the database at the level of the scenario (each record/row is a scenario), for the purpose of the analysis, the data is stored in the files File 2-4 at the level of the outcome. In other words, each scenario is represented by 2 rows in the data set files File 2-4. Each completed session is represented by 26 rows in the data sets, while non-completed sessions are represented by an even number of rows that's less than 26.

* Survey: At the end of the session users are shown a summary of their results and are also offered an optional survey to take. The order of showing the results vs. offering the survey is randomly allocated: half of the users are offered the survey before they see the results, while the other half is shown the results first, and then are asked to complete a survey. The survey contains demographic questions (age, gender, etc.) as well as other questions which are not included in the data sets.

* Factors/attributes: we manipulated the following 9 attributes: kind of intervention (stay/swerve), relationship to AV (pedestrians/passengers), legality (lawful/unlawful), gender (males/females), age (younger/older), social status (higher/lower), fitness (fit/large), number of characters (more/fewer), species (humans/pets). The first three attributes are structural (they relate to the environment) and are manipulated in every scenario. The remaining 6 attributes describe the characters (the potential casualties). Each of the 6 character attributes is manipulated independently in two scenarios per session (i.e. 12 scenarios in total). The 13th scenario is generated completely at random. The order of these scenarios is randomised. Before you use these data set files, please refer to the Supplemental Information (SI) document in order to understand how these scenarios were generated.

* Characters: we considered 20 different characters who are characterised by some of the 6 character attributes. These characters are: man, woman, pregnant woman, baby stroller, elderly man, elderly woman, boy, girl, homeless person, large woman, large man, criminal, male executive, female executive, female athlete, male athlete, female doctor, male doctor, Dog, and Cat.

`ResponseID`

A unique, random set of characters that identifies the scenario. Since each scenario is represented by 2 rows, every row should share a 'ResponseID' with exactly one other row. Some rows in my sample did not have a matching 'ResponseID'; these have been deleted, so only rows with a duplicate 'ResponseID' remain in the dataset and all scenarios are complete.

In [6]:
print(df['ResponseID'].nunique())
# 11.957.855

11957855
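The count is consistent with the pairing described above: 23,915,710 rows is exactly twice 11,957,855 unique IDs. A minimal sketch of how such a pairing check and filter could look, on toy data with hypothetical IDs (this is not necessarily how the cleaning was done for this file):

```python
import pandas as pd

# Toy data (hypothetical IDs): "r3" has no partner row.
toy = pd.DataFrame({"ResponseID": ["r1", "r1", "r2", "r2", "r3"]})

# Number of rows sharing each row's ResponseID.
rows_per_id = toy["ResponseID"].map(toy["ResponseID"].value_counts())

# Keep only scenarios represented by exactly two rows.
paired = toy[rows_per_id == 2]

print(paired["ResponseID"].nunique())  # 2
print(len(paired))                     # 4
```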


`ExtendedSessionID`

A unique, random set of characters that represents an identifier of the session. This ID combines a randomly generated ID for the session, concatenated with the UserID. 
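Because a completed session is represented by 26 rows (13 scenarios with 2 outcomes each), grouping on this column gives a rough way to tell complete from incomplete sessions. The sketch below uses toy data with hypothetical session IDs. On this file the counts may differ from the ideal, since unmatched rows were removed during cleaning.

```python
import pandas as pd

# Toy data (hypothetical IDs): session "s1" has 26 rows, "s2" only 4.
toy = pd.DataFrame({
    "ExtendedSessionID": ["s1"] * 26 + ["s2"] * 4,
})

# Rows per session; 26 rows means all 13 scenarios are present.
rows_per_session = toy.groupby("ExtendedSessionID").size()
complete = rows_per_session[rows_per_session == 26].index

print(list(complete))  # ['s1']
```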

`ScenarioOrder`

Takes a value between 1 and 13, representing the order in which the scenario was presented in the session.

In [5]:
# Summary ScenarioOrder
rp.summary_cat(df['ScenarioOrder'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,ScenarioOrder,1,2144564,8.97
1,,2,2036098,8.51
2,,3,1957508,8.19
3,,4,1900082,7.94
4,,5,1853060,7.75
5,,6,1819832,7.61
6,,7,1791698,7.49
7,,8,1769458,7.4
8,,9,1752260,7.33
9,,10,1738578,7.27
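A frequency table of this kind can also be built with plain pandas, which makes it easy to list every value of the column. A minimal sketch on a toy series with hypothetical values:

```python
import pandas as pd

# Toy column (hypothetical values) standing in for df['ScenarioOrder'].
orders = pd.Series([1, 1, 2, 3, 3, 3, 13], name="ScenarioOrder")

# Counts per value, ordered by the value itself rather than by frequency.
counts = orders.value_counts().sort_index()
freq = pd.DataFrame({
    "Count": counts,
    "Percent": (counts / counts.sum() * 100).round(2),
})

print(freq)
```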


`Intervention`

Represents the decision of the AV (STAY or SWERVE) that would lead to this outcome. This is not the actual decision taken by the user, but rather a part of the structural characterisation of the scenario.
* 0: the characters would die if the AV stays,
* 1: the characters would die if the AV swerves.

In [7]:
# Summary Intervention
rp.summary_cat(df['Intervention'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,Intervention,0,11957855,50.0
1,,1,11957855,50.0


`PedPed`

Every scenario has either pedestrians vs. pedestrians or pedestrians vs. passengers (or passengers vs. pedestrians). This column provides information about not just this outcome, but about the combination of both outcomes in the scenario; whether the scenario pits pedestrians against each other or not. 
* 1: pedestrians vs. pedestrians, 
* 0: pedestrians vs. passengers (or vice versa)

In [8]:
# Summary PedPed
rp.summary_cat(df['PedPed'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,PedPed,0,13185172,55.13
1,,1,10730538,44.87


`Barrier`

Another structural column which describes whether the potential casualties in this outcome are passengers or pedestrians. This column was used to calculate PedPed (after matching rows on ResponseID).
* 1: passengers
* 0: pedestrians

In [9]:
# Summary Barrier
rp.summary_cat(df['Barrier'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,Barrier,0,17323124,72.43
1,,1,6592586,27.57
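The counts agree with the PedPed summary: 13,185,172 rows with PedPed=0 is exactly twice the 6,592,586 rows with Barrier=1, i.e. one passenger outcome per pedestrians vs. passengers scenario. The sketch below shows one way PedPed could be recomputed from Barrier after grouping on ResponseID. The toy data and the `PedPed_check` column name are hypothetical, and the original authors' code may differ.

```python
import pandas as pd

# Toy data (hypothetical): scenario "a" is pedestrians vs. pedestrians,
# scenario "b" is pedestrians vs. passengers.
toy = pd.DataFrame({
    "ResponseID": ["a", "a", "b", "b"],
    "Barrier":    [0,   0,   0,   1],
})

# A scenario is PedPed when neither of its outcomes involves passengers,
# i.e. the largest Barrier value within the scenario is 0.
max_barrier = toy.groupby("ResponseID")["Barrier"].transform("max")
toy["PedPed_check"] = (max_barrier == 0).astype(int)

print(toy)
```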


`CrossingSignal`

Another structural column which represents whether there is a traffic light in this outcome, and light colour if yes.
* 0: no legality involved
* 1: green or legally crossing, 
* 2: red or illegally crossing 

Every scenario that has pedestrians vs. pedestrians (i.e. PedPed=1) features one of three legality-relevant characterisations: 
- a) the pedestrians on both sides are crossing with no legal complications, 
- b) one group is crossing legally (on a green light), while the other is crossing illegally (on a red light), and 
- c) vice versa. 

Every scenario that has pedestrians vs. passengers (i.e. PedPed=0) also features one of three legality-relevant characterisations: 
- a) the pedestrians are crossing with no legal complications, 
- b) the pedestrians are crossing legally (on a green light), and 
- c) pedestrians are crossing illegally (on a red light). There are no legality concerns for passengers.

In [10]:
# Summary CrossingSignal
rp.summary_cat(df['CrossingSignal'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,CrossingSignal,0,14492014,60.6
1,,2,5089633,21.28
2,,1,4334063,18.12
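Since there are no legality concerns for passengers, rows with Barrier=1 would be expected to carry CrossingSignal=0. A cross-tabulation is a quick way to check this. The sketch below runs on toy data with hypothetical values; whether the real file satisfies the rule without exception has not been verified here.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "Barrier":        [1, 0, 0, 0, 1, 0],
    "CrossingSignal": [0, 1, 2, 0, 0, 2],
})

# Counts of each CrossingSignal value, split by Barrier.
table = pd.crosstab(toy["Barrier"], toy["CrossingSignal"])
print(table)

# Passenger rows that carry a legality code; expected to be empty
# if the description above holds.
violations = toy[(toy["Barrier"] == 1) & (toy["CrossingSignal"] != 0)]
print(len(violations))  # 0
```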


`AttributeLevel`

Depends on the scenario type. Each scenario type (except random) has two levels: 
- Gender: [Male: characters are males, Female: characters are females]
- Age: [Young: characters in this outcome are younger (Boy/Girl + Man/Woman) than in the other outcome, Old: characters in this outcome are older (Elderly Man/Woman and Man/Woman)].
- Fitness: [Fit: characters in this outcome are more fit (Male/Female Athlete and Man/Woman), Fat: characters in this outcome are less fit (Large Man/Woman and Man/Woman)].
- Social Value: this was changed in the analysis to "social status" instead, and the characters Male/Female Doctor and Criminal were filtered out [High: characters in this outcome have higher social status (Male/Female Executives and Man/Woman), Low: characters have a lower social status (Homeless and Man/Woman)]
- Species: [Hoomans: characters in this outcome are humans (all but Dog/Cat), Pets: characters in this side are pets (Dog/Cat)]
- Utilitarian: [More: there are more characters in this outcome, Less: there are fewer characters in this outcome]. In fact, the characters on the "More" side are the same characters as on the "Less" side, plus at least one more character. (excuse the error in using "Less" for a countable)
- Random: it has one value ["Rand": characters in both outcomes are randomly generated].

In [11]:
# Summary AttributeLevel
rp.summary_cat(df['AttributeLevel'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,AttributeLevel,Rand,2643938,11.06
1,,More,2207759,9.23
2,,Less,2207759,9.23
3,,Pets,2119889,8.86
4,,Hoomans,2119889,8.86
5,,Female,2107852,8.81
6,,Male,2107852,8.81
7,,Old,2038921,8.53
8,,Young,2038921,8.53
9,,Fat,1904667,7.96
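The paired counts in the summary (More/Less, Pets/Hoomans, Female/Male, Old/Young) reflect that the two rows of a scenario carry complementary levels. A minimal sketch of checking this per scenario follows. The `OPPOSITE` mapping and the `levels_match` helper are hypothetical names, the labels are taken from the descriptions and output above, and the code assumes exactly two rows per ResponseID.

```python
import pandas as pd

# Counterpart of each level; "Rand" is paired with itself.
OPPOSITE = {
    "Male": "Female", "Female": "Male",
    "Young": "Old", "Old": "Young",
    "Fit": "Fat", "Fat": "Fit",
    "High": "Low", "Low": "High",
    "Hoomans": "Pets", "Pets": "Hoomans",
    "More": "Less", "Less": "More",
    "Rand": "Rand",
}

# Toy data (hypothetical): scenario "c" is deliberately inconsistent.
toy = pd.DataFrame({
    "ResponseID":     ["a", "a", "b", "b", "c", "c"],
    "AttributeLevel": ["More", "Less", "Rand", "Rand", "Old", "Old"],
})

def levels_match(levels):
    """True if the two levels of a scenario are each other's counterpart."""
    first, second = levels.tolist()
    return OPPOSITE.get(first) == second

ok = toy.groupby("ResponseID")["AttributeLevel"].apply(levels_match)
print(ok)
```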


`ScenarioType` and `ScenarioTypeStrict`

These two columns have 7 values, corresponding to 7 types of scenarios (6 attributes + random). These are: 
- "Utilitarian",
- "Gender", 
- "Fitness", 
- "Age", 
- "Social Value" (labelled "Social Status" in this dataset), 
- "Species", and 
- "Random".

In the early stage of the website, we forgot to include code that records the scenario type (one of the 6 categories mentioned above + random). We had to write code to infer it from the character types. This is the "ScenarioType" column. Some scenarios that were generated as part of "random" could fit in one of the 6 other categories. Later, we used a clear parameter to capture this type, which is in "ScenarioTypeStrict". Thus, this column provides an accurate description, but it does not have a value for the early scenarios. In the analysis for the figures, whenever we filtered based on the scenario type, we used both columns. For example, to filter the age-related scenarios, we use:

ScenarioTypeStrict="Age" && ScenarioType="Age", where "&&" is the logical AND.
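In pandas the same filter is written with the element-wise `&` operator and parentheses around each comparison. A minimal sketch on toy data with hypothetical rows:

```python
import pandas as pd

# Toy data (hypothetical): row 1 was generated as "Random" but looks like "Age".
toy = pd.DataFrame({
    "ScenarioTypeStrict": ["Age", "Random", "Age", "Gender"],
    "ScenarioType":       ["Age", "Age",    "Age", "Gender"],
})

# Element-wise AND: pandas uses & (with parentheses), not "and" or "&&".
age_rows = toy[(toy["ScenarioTypeStrict"] == "Age") & (toy["ScenarioType"] == "Age")]

print(len(age_rows))  # 2
```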

In [12]:
# Summary ScenarioTypeStrict
rp.summary_cat(df['ScenarioTypeStrict'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,ScenarioTypeStrict,Utilitarian,4289370,17.94
1,,Species,4222962,17.66
2,,Fitness,4216902,17.63
3,,Age,4216776,17.63
4,,Gender,4212730,17.61
5,,Random,2104000,8.8
6,,Social Status,652970,2.73


In [13]:
# Summary ScenarioType
rp.summary_cat(df['ScenarioType'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,ScenarioType,Utilitarian,4415518,18.46
1,,Species,4239778,17.73
2,,Gender,4215704,17.63
3,,Age,4077842,17.05
4,,Fitness,3809334,15.93
5,,Random,2643938,11.06
6,,Social Status,513596,2.15


`NumberOfCharacters` 

Takes a value between 1 and 5, the total number of characters in this outcome. It also represents the number of characters who are saved or killed, depending on the "Saved" value.

In [14]:
# Summary NumberOfCharacters
rp.summary_cat(df['NumberOfCharacters'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,NumberOfCharacters,5.0,5940790,24.84
1,,1.0,4748461,19.85
2,,2.0,4498287,18.81
3,,3.0,4387746,18.35
4,,4.0,4340426,18.15


`DiffNumberOFCharacters`

Takes a value between 0 and 4; difference in number of characters between this outcome and the other outcome.

In [15]:
# Summary DiffNumberOFCharacters
rp.summary_cat(df['DiffNumberOFCharacters'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,DiffNumberOFCharacters,0.0,17936738,75.0
1,,1.0,1742104,7.28
2,,2.0,1575108,6.59
3,,3.0,1413894,5.91
4,,4.0,1247866,5.22
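A minimal sketch of how this difference could be recomputed from `NumberOfCharacters` after grouping on ResponseID. It assumes the stored value is the absolute difference and is identical on both rows of a scenario, which the 0 to 4 range suggests but the description does not state outright. The toy data and the `Diff_check` column name are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): scenario "a" pits 5 characters against 2.
toy = pd.DataFrame({
    "ResponseID":         ["a", "a", "b", "b"],
    "NumberOfCharacters": [5.0, 2.0, 3.0, 3.0],
})

grouped = toy.groupby("ResponseID")["NumberOfCharacters"]
# Largest minus smallest count within each scenario, broadcast to both rows.
toy["Diff_check"] = grouped.transform("max") - grouped.transform("min")

print(toy["Diff_check"].tolist())  # [3.0, 3.0, 0.0, 0.0]
```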


`Saved` 

This represents the actual decision made by the user:
* 1: user decided to save the characters in this outcome, 
* 0: user decided to kill the characters in this outcome. 

In [16]:
# Summary Saved
rp.summary_cat(df['Saved'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,Saved,1,11957855,50.0
1,,0,11957855,50.0
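The even 50/50 split follows from the data layout: in every scenario one outcome is saved and the other is not. Combined with `Intervention`, this allows the user's choice (stay or swerve) to be recovered per scenario. The sketch below is one reading of the two column descriptions, on toy data with hypothetical IDs. It is not taken from the original analysis code.

```python
import pandas as pd

# Toy data (hypothetical IDs).
toy = pd.DataFrame({
    "ResponseID":   ["a", "a", "b", "b"],
    "Intervention": [0,   1,   0,   1],
    "Saved":        [1,   0,   0,   1],
})

# The row with Saved == 0 describes the outcome the user let happen,
# so its Intervention value is the action the user chose for the AV.
chosen = (
    toy[toy["Saved"] == 0]
    .set_index("ResponseID")["Intervention"]
    .map({0: "stay", 1: "swerve"})
)

print(chosen.to_dict())  # {'a': 'swerve', 'b': 'stay'}
```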


`UserCountry3` 

The ISO alpha-3 code of the country from which the user accessed the website. It is derived from the user's IP address, which is collected but not shared here.

In [17]:
# Summary UserCountry3
rp.summary_cat(df['UserCountry3'])

Unnamed: 0,Variable,Outcome,Count,Percent
0,UserCountry3,DEU,4440290,18.57
1,,FRA,3822946,15.99
2,,ESP,1680388,7.03
3,,POL,1654370,6.92
4,,ITA,1646878,6.89
5,,CZE,1422984,5.95
6,,BEL,1347572,5.63
7,,HUN,1236058,5.17
8,,NLD,1018216,4.26
9,,SWE,975528,4.08
