# Validation of Variables prior to integration into truth dataset

**Note:** Each variable is validated individually, based on its isolated source file, before the merge.

In [1]:
import pandas as pd
import re, os

In [2]:
def group_and_count(df, varname, sort=True):
    """Count rows (non-null `lfdn` entries) per distinct value of `varname`."""
    counts = df[['lfdn', varname]].groupby(varname).count()
    if sort:
        # Most frequent values first
        counts = counts.sort_values('lfdn', ascending=False)
    return counts

## Overall Validation Status

### Quick Checks

In [None]:
#df = pd.read_csv('../../data/freetext_coded/v_350_usability_188_coded.csv', sep=';')
#df = pd.read_csv('../../data/freetext_coded/v_285_with_252_total_274_coded.csv', sep=';', encoding='latin-1')
#pd.set_option('display.max_rows', None)
#df.head(2)

### Status

| Variable | Content | Coding scheme | Status | Necessary Actions |
| :--- | :---: | ---: | --- | --- |
| v_2 | Sectors | `New Tags` | OK | None|
| v_3 | Project Participants | `New Tags` | OK | None|
| v_5 | System Class | `Code Back` | OK | None|
| v_18 | Respondent Role | `Code Back` | OK | None|
| v_19 | Experience | `New Tags` | OK | None|
| v_20 | Certification | `New Tags` | OK | None|
|**Relationship customer**|||||
| v_26 | Reasons bad Relationship | `New Tags` | OK | None|
| v_27 | Reasons good Relationship | `New Tags` | Unicode error (latin-1) - Otherwise OK | None|
|**Documentation**|||||
| v_60 | Documentation granularity | `Code Back` | OK | None|
|**Top Problems: Causes and Effects**|||||
| [v_277](#v_277) | Problems - Top 1: Cause | `New Tags` | Unicode error (latin-1), incomplete, category syntax| Clean up &rarr; Re-categorise &rarr; validate|
| v_278 | Problems - Top 2: Cause | `New Tags` | Unicode error (latin-1), category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_279 | Problems - Top 3: Cause | `New Tags` | Category syntax, missing values ("?") | Clean up &rarr; Re-categorise &rarr; validate|
| v_280 | Problems - Top 4: Cause | `New Tags` | Unicode error, category syntax, missing values ("?") | Clean up &rarr; Re-categorise &rarr; validate|
| v_281 | Problems - Top 5: Cause | `New Tags` | Category syntax | Clean up &rarr; Re-categorise &rarr; validate|
| v_282 | Problems - Top 1: Effect | `New Tags` | Category syntax | Clean up &rarr; Re-categorise &rarr; validate|
| v_283 | Problems - Top 2: Effect | `New Tags` | Category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_284 | Problems - Top 3: Effect | `New Tags` | Category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_285 | Problems - Top 4: Effect | `New Tags` | Unicode error, category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_286 | Problems - Top 5: Effect | `New Tags` | OK | Clean up &rarr; Re-categorise &rarr; validate|
|**Documentation NFRs**|||||
| v_343 | Compatibility | `New Tags` | OK | None|
| v_344 | Maintainability | `New Tags` | OK | None|
| v_345 | Performance efficiency | `New Tags` | Missing values | Clean up|
| v_346 | Portability | `New Tags` | OK | None|
| v_347 | Reliability | `New Tags` | OK | None|
| v_348 | Safety | `New Tags` | OK | Check semantics ("Not"?)|
| v_349 | Security | `New Tags` | OK | None|
| v_350 | Usability | `New Tags` | OK | None|
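
Several rows in the status table report a Unicode error that goes away when the file is read as latin-1. The following is a minimal sketch of how the required encoding could be detected before loading a file. `decode_with_fallback` is a hypothetical helper that is not part of the original notebook, and the sample bytes are made up for illustration. Because latin-1 maps every byte to a character and therefore never fails, it has to be the last candidate.

```python
from typing import Iterable, Tuple


def decode_with_fallback(raw: bytes,
                         encodings: Iterable[str] = ("utf-8", "latin-1")) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes `raw` cleanly.

    Hypothetical helper, not part of the original notebook.
    """
    last_error = None
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next candidate
    raise ValueError(f"none of the encodings worked: {last_error}")


# In-memory bytes stand in for a real source file (made-up content).
utf8_bytes = "lfdn;tag\n58;Übersicht\n".encode("utf-8")
latin1_bytes = "lfdn;tag\n58;Übersicht\n".encode("latin-1")

text_a, enc_a = decode_with_fallback(utf8_bytes)
text_b, enc_b = decode_with_fallback(latin1_bytes)
print(enc_a, enc_b)
```

The detected encoding name can then be passed to the `encoding` parameter of `pd.read_csv`.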

### Data Cleaning and Validation

#### Variable 277<a id="v_277"></a>

In [3]:
#df = pd.read_csv('../../data/freetext_coded/v_350_usability_188_coded.csv', sep=';')
df = pd.read_csv('../../data/freetext_coded/v_277_with_246_total_365_coded.csv', sep=';', encoding='latin-1')
pd.set_option('display.max_rows', None)
df.head(2)

Unnamed: 0,lfdn,v_246,v_277,tag
0,58,Communication flaws between the project and th...,mistrust,People: Lack of trust
1,120,Communication flaws between the project and th...,Geographic ; contract,Organization: Too high team distribution


In [4]:
df.shape

(366, 4)

**To-Do**
* Show whole list and note correction items
* Remove all categories by default
* List all entries and save to intermediate file for validation

_Manual work - syntactic revision_
* Revise missing values
* Revise duplicates, inconsistencies, and obvious typos manually based on list
* Sum up Rework in notebook as a note

_Correction and Interpretation_
* Load file for intermediate validation
* Add categories
* Summarise all tags including problem category and answer
* Visualise as intermediate _unvalidated_ result
* Validate based on sample


In [5]:
#df.drop_duplicates(subset='tag', keep=False)
df.sort_values('tag')

Unnamed: 0,lfdn,v_246,v_277,tag
14,446,Communication flaws between the project and th...,wrong or unknown reqs,Incomplete requirements
45,1402,Communication flaws between the project and th...,Communication skill sets weak in sales/marketing,Input: Communication flaws between team and cu...
55,1831,Communication flaws between the project and th...,Sometimes clients say they did not receive the...,Input: Communication flaws between team and cu...
49,1482,Communication flaws between the project and th...,BAD COMMUNICATION,Input: Communication flaws between team and cu...
47,1416,Communication flaws between the project and th...,BAD COMMUNICATION,Input: Communication flaws between team and cu...
43,1344,Communication flaws between the project and th...,LACK OF COMMUNICATION,Input: Communication flaws between team and cu...
38,1159,Communication flaws between the project and th...,CLIENT'S COMMUNICATION FAILURES,Input: Communication flaws between team and cu...
31,929,Communication flaws between the project and th...,Problems in the communication,Input: Communication flaws between team and cu...
27,857,Communication flaws between the project and th...,Difficulties in the explanations of details fo...,Input: Complexity of domain
123,710,Incomplete or hidden requirements,"shalow analysis from the product owner,lack of...",Input: Customer does not formally approve the ...


In [6]:
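# Drop the category prefix (text before the first ':'); tags without a prefix are kept unchanged
df['tag'] = df['tag'].str.split(':', n=1).str[-1].str.strip()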

In [7]:
df.sort_values('tag')

Unnamed: 0,lfdn,v_246,v_277,tag
49,1482,Communication flaws between the project and th...,BAD COMMUNICATION,Communication flaws between team and customer
38,1159,Communication flaws between the project and th...,CLIENT'S COMMUNICATION FAILURES,Communication flaws between team and customer
166,1727,Inconsistent requirements,Relationships with customers,Communication flaws between team and customer
161,1168,Inconsistent requirements,The communication between stakeholders is appl...,Communication flaws between team and customer
165,1650,Inconsistent requirements,Communication between customer stakeholders,Communication flaws between team and customer
43,1344,Communication flaws between the project and th...,LACK OF COMMUNICATION,Communication flaws between team and customer
45,1402,Communication flaws between the project and th...,Communication skill sets weak in sales/marketing,Communication flaws between team and customer
55,1831,Communication flaws between the project and th...,Sometimes clients say they did not receive the...,Communication flaws between team and customer
53,1742,Communication flaws between the project and th...,none communicated preassumptions regarding exi...,Communication flaws between team and customer
47,1416,Communication flaws between the project and th...,BAD COMMUNICATION,Communication flaws between team and customer


In [8]:
# keep=False drops every tag that occurs more than once, leaving only tags used exactly once
df.drop_duplicates(subset='tag', keep=False)

Unnamed: 0,lfdn,v_246,v_277,tag
0,58,Communication flaws between the project and th...,mistrust,Lack of trust
7,295,Communication flaws between the project and th...,Absence of sponsor,Weak management at customer side
8,304,Communication flaws between the project and th...,Time constraints;staff turnover,Strict time schedule by customer
27,857,Communication flaws between the project and th...,Difficulties in the explanations of details fo...,Complexity of domain
30,924,Communication flaws between the project and th...,interest of the team; team organization,Missing involvement of developers
36,1066,Communication flaws between the project and th...,concurrent engineering of implementation in pa...,Concurrent development activities
40,1283,Communication flaws between the project and th...,"Innappropriate people used as stakeholders, la...",Weak qualification of stakeholders
50,1483,Communication flaws between the project and th...,change resistance,Missing willingless to change
60,163,Communication flaws within the project team,misogynistic development team,Conflicts of personalities
61,168,Communication flaws within the project team,"No one wants to look stupid, so people refrain...",SE Team refrains from asking questions


In [9]:
group_and_count(df, 'tag')

Unnamed: 0_level_0,lfdn
tag,Unnamed: 1_level_1
Poor project management,21
Lack of time,15
Miscommunication between RE team,12
Communication flaws between team and customer,11
Lack of experience of RE team members,10
incomplete requirements,8
Insufficient collaboration in process,7
Missing customer involvement,7
Lack of a well-defined RE process,7
Unclear terminology,6


In [None]:
#df.to_csv('../../data/freetext_coded/v_277_with_246_total_365_validation.csv', sep=';', encoding='latin-1')
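
The to-do list for this variable ends with a validation based on a sample. One possible way to draw that sample reproducibly is sketched below. `draw_validation_sample`, the sample size and the seed are assumptions for illustration; the `lfdn` values are a few of the respondent ids shown in the outputs above.

```python
import random
from typing import List, Sequence


def draw_validation_sample(ids: Sequence[int], size: int, seed: int = 0) -> List[int]:
    """Draw a reproducible random sample of respondent ids without replacement."""
    rng = random.Random(seed)          # fixed seed makes the sample repeatable
    size = min(size, len(ids))         # never ask for more ids than exist
    return sorted(rng.sample(list(ids), size))


# A handful of lfdn values taken from the outputs above.
lfdn_values = [58, 120, 446, 857, 929, 1159, 1344, 1402, 1416, 1482]
sample = draw_validation_sample(lfdn_values, size=3, seed=42)
print(sample)
```

When working directly on the DataFrame, `df.sample(n=..., random_state=...)` offers the same kind of reproducible draw.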