In [1]:
import pandas as pd

# Data Evaluation

This notebook generates descriptive statistics about the data and its cleaning process.

In [2]:
df = pd.read_excel('./r3a-data-extraction.xlsx', sheet_name='Data')

In [3]:
def get_unique_codes(df: pd.DataFrame, column: str) -> list[str]:
    """Returns a list of unique codes contained in a given column.

    parameters:
        df -- pandas.DataFrame containing the data
        column -- name of the column containing cells with codes or groups of codes separated by semicolons (e.g., "code1;code2")

    returns:
        list of unique codes
    """

    # obtain all codes of the column (which still contains code groups like "code1;code2")
    all_codes = list(df[column].value_counts().index)

    # split up code groups (the sum([...], []) flattens the list of lists)
    singular_codes = sum([code.split(';') for code in all_codes], [])

    # remove duplicates
    return list(set(singular_codes))
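
As a point of comparison, the same splitting and de-duplication can be sketched with pandas string methods. The function name `unique_codes_exploded` and the toy column values are assumptions made for illustration; unlike `get_unique_codes`, this variant also strips surrounding whitespace and returns the codes sorted.

```python
import pandas as pd


def unique_codes_exploded(df: pd.DataFrame, column: str) -> list[str]:
    """Hypothetical alternative to get_unique_codes based on str.split and explode."""
    codes = (
        df[column]
        .dropna()          # empty cells carry no codes
        .astype(str)
        .str.split(';')    # "code1;code2" -> ["code1", "code2"]
        .explode()         # one row per single code
        .str.strip()       # tolerate stray whitespace around a code
    )
    return sorted(codes.unique())


# toy data (not taken from the spreadsheet)
toy = pd.DataFrame({'Activity': ['elicit;specify', 'specify', None, 'validate']})
print(unique_codes_exploded(toy, 'Activity'))
```

The sorted result makes the output reproducible across runs, which a `set`-based list does not guarantee.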

## Data Cleaning

### Removing wrongfully included Experimentation Literature

The inclusion and exclusion criteria were applied to the title, abstract, and keywords of the 1446 primary studies retrieved by the database search. In some cases, however, the data extraction revealed that the assumptions under which a study had been included did not hold. These studies need to be excluded.

In [4]:
# identify the IDs of the wrongfully included papers, which have a True flag in the 'F' column
ids_of_wrongfully_included = set(df[(df['Type']=='E') & (df['F']==True)]['ID'].values)
print(f'{len(ids_of_wrongfully_included)} primary studies were wrongfully included and have to be removed.')

# remove all these wrongfully included primary studies (i.e., false positives) from the data set
df = df[df['F']==False]

22 primary studies were wrongfully included and have to be removed.
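
A filter of the form `df['F'] == False` keeps only rows whose flag is explicitly `False`. Whether the 'F' column of the spreadsheet contains empty cells is not visible from this notebook, so the following is only a sketch on assumed toy data showing how such a filter would behave if flags were missing, together with a variant that treats a missing flag as "not flagged".

```python
import pandas as pd

# toy data (an assumption, not taken from the spreadsheet); row 3 has no flag
toy = pd.DataFrame({
    'ID':   [1, 2, 3, 4],
    'Type': ['E', 'E', 'I', 'E'],
    'F':    [True, False, None, False],
})

# comparison-based filter: the row with the missing flag is dropped as well
kept_strict = toy[toy['F'] == False]

# variant: only rows explicitly flagged True are removed
flagged = toy['F'].eq(True)   # a missing value compares as False
kept_lenient = toy[~flagged]

print(list(kept_strict['ID']))
print(list(kept_lenient['ID']))
```

Which of the two behaviours is appropriate depends on how the flag column was filled in during extraction.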


### Removing non-valuating Attributes

Several dependent variables in the data set describe non-valuating attributes, i.e., properties of the activity that do not have a connection to quality. These need to be removed.

In [5]:
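# identify the IDs of the studies with non-valuating dependent variables, which have a True flag in the 'Val' column
# (the 'ID' column is selected explicitly, because iterating over a DataFrame yields its column labels, not its rows)
ids_of_nonvaluating = set(df[(df['Type']=='E') & (df['Val']==True)]['ID'].values)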
print(f'{len(ids_of_nonvaluating)} studies use non-valuating dependent variables that are irrelevant to this study.')

# remove all data points with non-valuating dependent variables from the data set
df = df[df['Val']==False]

12 studies use non-valuating dependent variables that are irrelevant to this study.
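
Counting the affected studies means counting distinct IDs among the flagged rows. The sketch below uses assumed toy data to contrast that count with what a `set` built from the filtered DataFrame itself contains.

```python
import pandas as pd

# toy data (an assumption, not taken from the spreadsheet); study 10 has two flagged rows
toy = pd.DataFrame({
    'ID':   [10, 10, 11, 12],
    'Type': ['E', 'E', 'E', 'I'],
    'Val':  [True, True, False, False],
})

mask = (toy['Type'] == 'E') & (toy['Val'] == True)

# distinct studies among the flagged rows
n_studies = toy.loc[mask, 'ID'].nunique()

# iterating over a DataFrame yields its column labels, not its rows
labels = set(toy[mask])

print(n_studies)
print(labels)
```

In this toy frame, the first count is 1 while the set holds the three column labels, so the two numbers are unrelated.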


## Data Evaluation

In the following code blocks, we evaluate the cleaned data and generate some general, high-level statistics.

### Number of textual Descriptions

First, we count the number of textual descriptions extracted for each data source type and for the data set as a whole.

In [6]:
data_source_types = {
    'E': 'Experimental Literature',
    'I': 'Interview Study',
    'S': 'Software Process Literature'
}

for dst in data_source_types:
    df_specific = df[df['Type'] == dst]
    n_activity_mentions = len(df_specific[df_specific['Activity Description'].notnull()])
    n_attribute_mentions = len(df_specific[df_specific['Attribute Description'].notnull()])
    print(f'The {data_source_types[dst]} contained {n_activity_mentions} descriptions of activities and {n_attribute_mentions} descriptions of attributes.')

The Experimental Literature contained 142 descriptions of activities and 355 descriptions of attributes.
The Interview Study contained 55 descriptions of activities and 0 descriptions of attributes.
The Software Process Literature contained 21 descriptions of activities and 1 descriptions of attributes.


In [7]:
n_activity_mentions = len(df[df['Activity Description'].notnull()])
n_attribute_mentions = len(df[df['Attribute Description'].notnull()])
print(f'The complete data set of extractions contained {n_activity_mentions} descriptions of activities and {n_attribute_mentions} descriptions of attributes.')

The complete data set of extractions contained 218 descriptions of activities and 356 descriptions of attributes.
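
The per-type and overall counts can also be obtained in one step each, since `count()` ignores missing values. This is a minimal sketch on assumed toy data that reuses the column names from the cells above.

```python
import pandas as pd

# toy data (an assumption, not taken from the spreadsheet)
toy = pd.DataFrame({
    'Type': ['E', 'E', 'I', 'S'],
    'Activity Description':  ['a', None, 'b', 'c'],
    'Attribute Description': ['x', 'y', None, None],
})

description_columns = ['Activity Description', 'Attribute Description']

# number of non-null descriptions per data source type
per_type = toy.groupby('Type')[description_columns].count()

# number of non-null descriptions in the whole data set
totals = toy[description_columns].count()

print(per_type)
print(totals)
```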


### Number of unique Codes

Next, we determine the number of unique codes in each of the four categories: activity, activity attribute, artifact, and artifact attribute.

In [8]:
activities = get_unique_codes(df, 'Activity')
activity_attributes = get_unique_codes(df, 'Activity Attributes')
artifacts = get_unique_codes(df, 'Artifact')
artifact_attributes = get_unique_codes(df, 'Artifact Attributes')

print(f'The data set contains {len(activities)} unique activities, {len(activity_attributes)} unique activity-attributes, {len(artifacts)} unique artifacts, and {len(artifact_attributes)} unique artifact-attributes.')

The data set contains 24 unique activities, 16 unique activity-attributes, 21 unique artifacts, and 26 unique artifact-attributes.
