# Analysis of Interview Data

This notebook analyzes the data collected during interviews with architects at the case company. During each interview, the participants were prompted to describe the issues they perceive with requirements quality.

In [8]:
import pandas as pd

## Data Preparation

First, we load the data and prepare it by removing properties that would disrupt the analysis.

In [9]:
df = pd.read_excel('../../data/raw/prq-data-0-10.xlsx', sheet_name='Data').fillna('na')



In [10]:
# list the variables containing codes, i.e., remove all variables which contain supplementary information or verbatim mentions
allvars = [ 
    'ID',
    'Quality Factor 1', 'Entity-Fact 1', 'Quality Factor 2', 'Entity-Fact 2',
    'Context Factor 1', 'Context Factor 2', 'Context Factor 3',
    'Activity 1', 'Attribute 1', 'Impact 1', 'Activity 2', 'Attribute 2', 'Impact 2'
]

### Removal of Irrelevant Data

We remove all rows in which the interview participant explicitly stated that they do not perceive a quality defect, keeping only the rows where column `M` is `True`.

In [11]:
# filter for rows where the participant explicitly mentioned a requirements quality impact
len_original = len(df)
df = df[df['M'] == True]
len_perceived = len(df)

print(f'Removed {len_original-len_perceived} rows where the interview participant stated that they did not perceive any defect (reduced the data set from {len_original} to {len_perceived} statements).')

Removed 24 rows where the interview participant stated that they did not perceive any defect (reduced the data set from 108 to 84 statements).
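
Because empty cells were replaced with the string `'na'` when loading the data, a flag column such as `M` can hold booleans and strings side by side. The following minimal sketch, built on a hypothetical toy frame (the values are illustrative and not taken from the interview data), shows why the explicit comparison against `True` is a reasonable way to build the filter mask in that situation.

```python
import pandas as pd

# Hypothetical toy data: 'M' mixes booleans with the 'na' placeholder,
# as it would after .fillna('na') on a column with empty cells.
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'M': [True, False, 'na', True],
})

# Comparing against True yields a clean boolean mask;
# rows holding False or 'na' are dropped.
mask = toy['M'] == True  # noqa: E712 (the explicit comparison is intentional)
kept = toy[mask]

print(kept['ID'].tolist())  # [1, 4]
```

Indexing with `toy[toy['M']]` directly would not work here, since the column is not a pure boolean column.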


### Removal of Unspecific Data

We additionally remove those data points where either the activity was unspecific (activity code `Processing`) or the attribute was unspecific (attribute code `Unspecific`). These data points are not refined enough to allow further inference.

In [12]:
n_unspecific_activity = 0
n_unspecific_attribute = 0
for index, row in df.iterrows():
    for actid in ['1', '2']:
        unspecific = False
        if row[f'Activity {actid}'] == 'Processing':
            n_unspecific_activity += 1
            unspecific = True
        if row[f'Attribute {actid}'] == 'Unspecific':
            n_unspecific_attribute += 1
            unspecific = True
        
        if unspecific:
            df.at[index, f'Activity {actid}'] = 'na'
            df.at[index, f'Attribute {actid}'] = 'na'
            df.at[index, f'Impact {actid}'] = 'na'

    if row['Activity 1'] == 'na' and row['Activity 2'] != 'na':
        df.at[index, 'Activity 1'] = row['Activity 2']
        df.at[index, 'Attribute 1'] = row['Attribute 2']
        df.at[index, 'Impact 1'] = row['Impact 2']

        df.at[index, 'Activity 2'] = 'na'
        df.at[index, 'Attribute 2'] = 'na'
        df.at[index, 'Impact 2'] = 'na'

print(f'Detected {n_unspecific_activity} unspecific activities and {n_unspecific_attribute} unspecific attributes.')

Detected 14 unspecific activities and 29 unspecific attributes.
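
One caveat about the loop above: `iterrows()` generally yields each row as a separate `Series`, so values written with `df.at` are not visible through `row` afterwards. The check that moves the second activity slot into the first therefore reads the values from before the clean-up. The sketch below is one possible variant that reads the updated values from the frame instead. The function name `clear_unspecific_and_shift` and the toy data are hypothetical, and this variant may retain different rows than the original loop, so it would not necessarily reproduce the counts reported in this notebook.

```python
import pandas as pd

SLOT_COLUMNS = ['Activity', 'Attribute', 'Impact']

def clear_unspecific_and_shift(frame: pd.DataFrame) -> pd.DataFrame:
    """Blank unspecific activity slots, then move a lone slot 2 into slot 1."""
    out = frame.copy()
    for index in out.index:
        for actid in ['1', '2']:
            if (out.at[index, f'Activity {actid}'] == 'Processing'
                    or out.at[index, f'Attribute {actid}'] == 'Unspecific'):
                for col in SLOT_COLUMNS:
                    out.at[index, f'{col} {actid}'] = 'na'

        # Read the updated values from the frame, not from a row snapshot.
        if out.at[index, 'Activity 1'] == 'na' and out.at[index, 'Activity 2'] != 'na':
            for col in SLOT_COLUMNS:
                out.at[index, f'{col} 1'] = out.at[index, f'{col} 2']
                out.at[index, f'{col} 2'] = 'na'
    return out

# Hypothetical toy rows; object dtype mirrors columns that mix numbers and 'na'.
toy = pd.DataFrame({
    'Activity 1': ['Processing', 'Understanding'],
    'Attribute 1': ['Duration', 'Unspecific'],
    'Impact 1': [-2.0, -1.0],
    'Activity 2': ['Verifying', 'na'],
    'Attribute 2': ['Completeness', 'na'],
    'Impact 2': [-1.0, 'na'],
}).astype(object)

result = clear_unspecific_and_shift(toy)
print(result.loc[0, ['Activity 1', 'Attribute 1', 'Impact 1']].tolist())
```

In the first toy row, the unspecific first slot is cleared and the specific second slot takes its place. In the second row, no specific slot remains, so the row would be dropped by the completeness filter.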


### Removal of Incomplete Data

We filter out rows that lack either a quality factor or an activity. Statements missing one of the two are incomplete and cannot contribute to the analysis.

In [13]:
df_spec = df.query('`Quality Factor 1` != "na" & `Activity 1` != "na"')
len_complete = len(df_spec)

print(f'Removed {len_perceived-len_complete} rows where the interview participant did not provide either an artifact mention or an activity mention (reduced the data set to {len_complete} statements).')

Removed 44 rows where the interview participant did not provide either an artifact mention or an activity mention (reduced the data set to 40 statements).
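
The filter above relies on backtick quoting, which lets `DataFrame.query` refer to column names that contain spaces. As a minimal sketch on hypothetical toy data, the same selection can be written as a boolean mask, which may be easier to read for those less familiar with the query syntax.

```python
import pandas as pd

# Hypothetical toy data with the 'na' placeholder for missing codes.
toy = pd.DataFrame({
    'Quality Factor 1': ['atomic', 'na', 'density'],
    'Activity 1': ['Translating', 'Planning', 'na'],
})

# Backticks let DataFrame.query reference column names that contain spaces.
via_query = toy.query('`Quality Factor 1` != "na" & `Activity 1` != "na"')

# Equivalent boolean mask written with regular indexing.
via_mask = toy[(toy['Quality Factor 1'] != 'na') & (toy['Activity 1'] != 'na')]

print(via_query.equals(via_mask))  # True
```

Only the first toy row has both a quality factor and an activity, so both variants keep exactly that row.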


## Analysis

We now proceed with the analysis using the rows of the data set which exhibit the following properties:

1. The interview participant did not explicitly deny perceiving an issue with the respective type of quality.
2. The statement contains at least one quality factor and at least one activity.
3. Neither the activity nor the attribute is unspecific.

First, we define a function that aggregates all statements mentioning the same combination of independent variables (i.e., the same combination of quality and context factors).

In [17]:
def aggregate_instances(df_filtered: pd.DataFrame) -> dict:
    
    instances = {}

    for _, row in df_filtered.iterrows():
        # assemble the entity-fact, which consists of all input variables (quality factor 1 & 2 plus context factor 1-3 where available)
        independent = f'{row["Quality Factor 1"]}-{row["Entity-Fact 1"]}'
        if row["Quality Factor 2"] != 'na':
            independent += f' & {row["Quality Factor 2"]}-{row["Entity-Fact 2"]}'
        for cid in ['1', '2', '3']:
            if row[f'Context Factor {cid}'] != "na":
                independent += f' & {row["Context Factor "+cid]}'

        # add the entity-fact to the list of instances if it does not already exist
        if independent not in instances:
            instances[independent] = {
                'support': 0,
                'activities': {}
            }

        # increment the support of the entity-fact, i.e., the number of statements mentioning this entity-fact
        instances[independent]['support'] += 1

        # record all activities and their impacted attributes that are affected by the entity-fact according to this statement
        for actid in ['1', '2']:
            if row[f'Activity {actid}'] != "na":
                activity = row[f'Activity {actid}']
                if activity not in instances[independent]['activities']:
                    instances[independent]['activities'][activity] = {}

                attribute = row[f'Attribute {actid}']
                if attribute not in instances[independent]['activities'][activity]:
                    instances[independent]['activities'][activity][attribute] = []
                instances[independent]['activities'][activity][attribute].append(row[f'Impact {actid}'])

    return instances
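
To illustrate the nested structure that the aggregation produces, the sketch below applies a compact variant of the same logic to two hypothetical statements. The function name `aggregate_compact` and the toy rows are assumptions made for illustration; the variant uses `dict.setdefault` instead of explicit membership checks but is meant to build the same kind of dictionary.

```python
import pandas as pd

def aggregate_compact(frame: pd.DataFrame) -> dict:
    """Group statements by their combination of quality and context factors."""
    instances = {}
    for _, row in frame.iterrows():
        # Assemble the key from quality factor 1, optional quality factor 2,
        # and any context factors that are present.
        parts = [f'{row["Quality Factor 1"]}-{row["Entity-Fact 1"]}']
        if row['Quality Factor 2'] != 'na':
            parts.append(f'{row["Quality Factor 2"]}-{row["Entity-Fact 2"]}')
        parts += [str(row[f'Context Factor {c}']) for c in '123'
                  if row[f'Context Factor {c}'] != 'na']
        key = ' & '.join(parts)

        entry = instances.setdefault(key, {'support': 0, 'activities': {}})
        entry['support'] += 1
        for a in '12':
            if row[f'Activity {a}'] != 'na':
                attributes = entry['activities'].setdefault(row[f'Activity {a}'], {})
                attributes.setdefault(row[f'Attribute {a}'], []).append(row[f'Impact {a}'])
    return instances

# Two hypothetical statements that share the same single quality factor.
toy = pd.DataFrame({
    'Quality Factor 1': ['atomic', 'atomic'], 'Entity-Fact 1': ['false', 'false'],
    'Quality Factor 2': ['na', 'na'], 'Entity-Fact 2': ['na', 'na'],
    'Context Factor 1': ['na', 'na'], 'Context Factor 2': ['na', 'na'],
    'Context Factor 3': ['na', 'na'],
    'Activity 1': ['Translating', 'Translating'],
    'Attribute 1': ['Duration', 'Duration'], 'Impact 1': [-2.0, -2.0],
    'Activity 2': ['Planning', 'na'],
    'Attribute 2': ['Stability', 'na'], 'Impact 2': [-2.0, 'na'],
})

result = aggregate_compact(toy)
print(result)
```

Both statements map to the key `atomic-false`, so its support is 2, and the impacts are collected per activity and attribute.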

In [24]:
def print_instances(instances: dict):
    for ind in instances:
        data = instances[ind]
        print(f'{ind} ({data["support"]}) has the following impact:')
        
        for ac in data['activities']:
            print(f'\t- {ac}:')
            for att in data['activities'][ac]:
                print(f'\t\t{att}: {data["activities"][ac][att]}')

### Single Quality Factors

First, we isolate all statements that contain a single quality factor and no context factor.

In [14]:
single_impact = df_spec.query('`Quality Factor 2` == "na" & `Context Factor 1` == "na" & `Context Factor 2` == "na" & `Context Factor 3` == "na"')

print(f'The {len(df_spec)} statements contain {len(single_impact)} statements with one quality factor and no context factors.')

The 40 statements contain 28 statements with one quality factor and no context factors.


In [18]:
single_quality_factors = aggregate_instances(single_impact)

print(f'The data set describes the isolated impact relationship of {len(single_quality_factors)} quality factors.')

The data set describes the isolated impact relationship of 17 quality factors.


In [25]:
# keep only the quality factors that are mentioned in at least two statements
single_quality_factors_supported = {
    instance: data
    for instance, data in single_quality_factors.items()
    if data['support'] >= 2
}

print('The following quality factors received a support of at least 2:')
print_instances(single_quality_factors_supported)

The following quality factors received a support of at least 2:
orientation-solution (8) has the following impact:
	- Understanding:
		Uniqueness: [-3.0, -2.0]
	- Verifying:
		Completeness: [-2.0, -2.0]
	- Estimating Effort:
		Traceability: [2.0]
	- Translating:
		Stability: [2.0]
	- Assessing Feasibility:
		Precision: [2.0]
	- Planning:
		Stability: [2.0]
atomic-false (2) has the following impact:
	- Translating:
		Duration: [-2.0, -2.0]
	- Planning:
		Stability: [-2.0]
concise-false (2) has the following impact:
	- Understanding:
		Uniqueness: [-1.0]
		Duration: [-2.0]
density-too high (3) has the following impact:
	- Understanding:
		Duration: [-2.0]
	- Verifying:
		Duration: [-2.0]
	- Interpreting:
		Uniqueness: [-2.0]
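
The nested output can also be flattened into a table with one row per instance, activity, and attribute, which may be easier to sort or export. The sketch below assumes the dictionary structure returned by `aggregate_instances`; the two entries are copied from the output above, while the column names (`instance`, `mentions`, and so on) are hypothetical.

```python
import pandas as pd

# Two entries in the structure returned by aggregate_instances,
# with values copied from the printed output.
instances = {
    'atomic-false': {
        'support': 2,
        'activities': {
            'Translating': {'Duration': [-2.0, -2.0]},
            'Planning': {'Stability': [-2.0]},
        },
    },
    'concise-false': {
        'support': 2,
        'activities': {
            'Understanding': {'Uniqueness': [-1.0], 'Duration': [-2.0]},
        },
    },
}

# One row per (instance, activity, attribute) combination.
rows = [
    {'instance': name, 'support': data['support'], 'activity': activity,
     'attribute': attribute, 'mentions': len(impacts), 'impacts': impacts}
    for name, data in instances.items()
    for activity, attributes in data['activities'].items()
    for attribute, impacts in attributes.items()
]
table = pd.DataFrame(rows)

print(table[['instance', 'activity', 'attribute', 'mentions']])
```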


### Quality Factor Interaction

Next, we isolate statements where two quality factors interact but no context factors are at play.

In [26]:
interaction = df_spec.query('`Quality Factor 2` != "na" & `Context Factor 1` == "na" & `Context Factor 2` == "na" & `Context Factor 3` == "na"')

print(f'The {len(df_spec)} statements contain {len(interaction)} statements where two quality factors interact but no context factors interact.')

The 40 statements contain 4 statements where two quality factors interact but no context factors interact.


In [28]:
multiple_quality_factors = aggregate_instances(interaction)

print(f'The data set describes the impact relationship of {len(multiple_quality_factors)} combinations of quality factors.')
print_instances(multiple_quality_factors)

The data set describes the impact relationship of 4 combinations of quality factors.
semantically redundant-true & horizontal traces-missing (1) has the following impact:
	- Implementing:
		Coherence: [-2.0]
level of detail-too little & type-non-functional (1) has the following impact:
	- Understanding:
		Uniqueness: [-2.0]
maturity-immature & committed-true (1) has the following impact:
	- Assessing Feasibility:
		Duration: [2.0]
jargonic-true & density-too high (1) has the following impact:
	- Assessing Feasibility:
		Precision: [-2.0]


### Context Factor Interaction

Finally, we investigate statements where context factors mediate the impact of quality factors.

In [29]:
context_interaction = df_spec.query('`Context Factor 1` != "na"')

print(f'The {len(df_spec)} statements contain {len(context_interaction)} statements where context factors mediate the effect of the quality factors.')

The 40 statements contain 8 statements where context factors mediate the effect of the quality factors.
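
The three subsets (28 single-factor statements, 4 quality factor interactions, and 8 context factor interactions) add up to the 40 complete statements. A quick sanity check like the one sketched below can confirm that every row lands in exactly one subset. It uses hypothetical toy data and assumes that context factors are always filled from `Context Factor 1` upward; a row with an empty first but a filled second context factor would fall into none of the subsets.

```python
import pandas as pd

# Hypothetical toy data covering all three cases.
toy = pd.DataFrame({
    'Quality Factor 2': ['na', 'density', 'na', 'jargonic'],
    'Context Factor 1': ['na', 'na', 'Experience', 'Novelty'],
    'Context Factor 2': ['na'] * 4,
    'Context Factor 3': ['na'] * 4,
})

no_context = ('`Context Factor 1` == "na" & `Context Factor 2` == "na" '
              '& `Context Factor 3` == "na"')
single = toy.query(f'`Quality Factor 2` == "na" & {no_context}')
interaction = toy.query(f'`Quality Factor 2` != "na" & {no_context}')
context = toy.query('`Context Factor 1` != "na"')

# The subset sizes should add up to the size of the full frame ...
sizes = (len(single), len(interaction), len(context))
# ... and together the subsets should cover every row.
covered = set(single.index) | set(interaction.index) | set(context.index)

print(sizes, covered == set(toy.index))
```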


In [30]:
context_factors = aggregate_instances(context_interaction)

print(f'The data set describes the impact relationship of {len(context_factors)} combinations of quality and context factors.')
print_instances(context_factors)

The data set describes the impact relationship of 7 combinations of quality and context factors.
orientation-solution & Involvement (1) has the following impact:
	- Understanding:
		Uniqueness: [0.0]
orientation-solution & Novelty (2) has the following impact:
	- Assessing Feasibility:
		Precision: [2.0]
	- Estimating Effort:
		Precision: [2.0]
	- Translating:
		Stability: [1.0]
atomic-true & Experience (1) has the following impact:
	- Understanding:
		Uniqueness: [3.0]
orientation-solution & Peer Review (1) has the following impact:
	- Understanding:
		Uniqueness: [0.0]
overloaded term-true & Involvement (1) has the following impact:
	- Understanding:
		Uniqueness: [2.0]
density-too high & Experience (1) has the following impact:
	- Understanding:
		Duration: [0.0]
density-too high & Supplementary Communication (1) has the following impact:
	- Verifying:
		Duration: [2.0]
