# Validation of Variables Prior to Integration into Truth Dataset

**Note:** Each variable is validated on its own source file before the files are merged. Open questions:
* How can `NotCodable` tags be kept intact when cleaning entries?
* Can the steps be automated to iterate over all files? (Does this even make sense?)

In [1]:
import pandas as pd
import numpy as np
import re, os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
def group_and_count(df, varname, sort=True):
    """Count non-null 'lfdn' entries per value of `varname`."""
    counts = df[['lfdn', varname]].groupby(varname).count()
    # Most frequent values first unless sorting is switched off
    return counts.sort_values('lfdn', ascending=False) if sort else counts

## Overall Validation Status

### Quick Checks

In [None]:
#df = pd.read_csv('../../data/freetext_coded/v_350_usability_188_coded.csv', sep=';')
#df = pd.read_csv('../../data/freetext_coded/v_285_with_252_total_274_coded.csv', sep=';', encoding='latin-1')
#pd.set_option('display.max_rows', None)
#df.head(2)
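
The quick check is currently switched between files by commenting lines in and out. A minimal sketch of how it might be run over every coded file follows. The `*_coded.csv` pattern is inferred from the paths above; the helper names `quick_check` and `quick_check_all`, and the assumption that each file has a `tag` column, are hypothetical and not part of the notebook.

```python
import glob
import os

import pandas as pd


def quick_check(path, sep=';', encodings=('utf-8', 'latin-1')):
    """Load one coded file and return basic facts for the status table."""
    for enc in encodings:
        try:
            df = pd.read_csv(path, sep=sep, encoding=enc)
            break
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    else:
        raise ValueError(f'Could not decode {path}')
    return {
        'file': os.path.basename(path),
        'encoding': enc,
        'rows': len(df),
        # Assumption: the coded tags live in a column named 'tag'
        'missing_tags': int(df['tag'].isna().sum()) if 'tag' in df else None,
    }


def quick_check_all(directory):
    """Run quick_check on every '*_coded.csv' file in `directory`."""
    paths = sorted(glob.glob(os.path.join(directory, '*_coded.csv')))
    return pd.DataFrame([quick_check(p) for p in paths])
```

Calling `quick_check_all('../../data/freetext_coded')` would then give one row per file. Since `latin-1` can decode any byte sequence, it only makes sense as the last fallback in `encodings`.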

### Status

| Variable | Content | Coding scheme | Status | Necessary Actions |
| :--- | :---: | ---: | --- | --- |
| v_2 | Sectors | `New Tags` | OK | None|
| v_3 | Project Participants | `New Tags` | OK | None|
| v_5 | System Class | `Code Back` | OK | None|
| v_18 | Respondent Role | `Code Back` | OK | None|
| v_19 | Experience | `New Tags` | OK | None|
| v_20 | Certification | `New Tags` | OK | None|
|**Relationship customer**|||||
| v_26 | Reasons bad Relationship | `New Tags` | OK | None|
| v_27 | Reasons good Relationship | `New Tags` | Unicode error (latin-1) - Otherwise OK | None|
|**Documentation**|||||
| v_60 | Documentation granularity | `Code Back` | OK | None|
|**Top Problems: Causes and Effects**|||||
| [v_277](#v_277) | Problems - Top 1: Cause | `New Tags` | Unicode error (latin-1), incomplete, category syntax| Clean up &rarr; Re-categorise &rarr; validate|
| v_278 | Problems - Top 2: Cause | `New Tags` | Unicode error (latin-1), category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_279 | Problems - Top 3: Cause | `New Tags` | Category syntax, missing values ("?") | Clean up &rarr; Re-categorise &rarr; validate|
| v_280 | Problems - Top 4: Cause | `New Tags` | Unicode error, category syntax, missing values ("?") | Clean up &rarr; Re-categorise &rarr; validate|
| v_281 | Problems - Top 5: Cause | `New Tags` | Category syntax | Clean up &rarr; Re-categorise &rarr; validate|
| v_282 | Problems - Top 1: Effect | `New Tags` | Category syntax | Clean up &rarr; Re-categorise &rarr; validate|
| v_283 | Problems - Top 2: Effect | `New Tags` | Category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_284 | Problems - Top 3: Effect | `New Tags` | Category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_285 | Problems - Top 4: Effect | `New Tags` | Unicode error, category syntax, missing values | Clean up &rarr; Re-categorise &rarr; validate|
| v_286 | Problems - Top 5: Effect | `New Tags` | OK | Clean up &rarr; Re-categorise &rarr; validate|
|**Documentation NFRs**|||||
| v_343 | Compatibility | `New Tags` | OK | None|
| v_344 | Maintainability | `New Tags` | OK | None|
| v_345 | Performance efficiency | `New Tags` | Missing values | Clean up|
| v_346 | Portability | `New Tags` | OK | None|
| v_347 | Reliability | `New Tags` | OK | None|
| v_348 | Safety | `New Tags` | OK | Check semantics ("Not"?)|
| v_349 | Security | `New Tags` | OK | None|
| v_350 | Usability | `New Tags` | OK | None|

### Data Cleaning and Validation

#### Variable 277<a id="v_277"></a>

In [3]:
#df = pd.read_csv('../../data/freetext_coded/v_350_usability_188_coded.csv', sep=';')
df = pd.read_csv('../../data/freetext_coded/v_277_with_246_total_365_coded.csv', sep=';', encoding='latin-1')
pd.set_option('display.max_rows', None)
df.head(2)

| Unnamed: 0 | lfdn | v_246 | v_277 | tag |
| --- | --- | --- | --- | --- |
| 0 | 58 | Communication flaws between the project and th... | mistrust | People: Lack of trust |
| 1 | 120 | Communication flaws between the project and th... | Geographic ; contract | Organization: Too high team distribution |


In [4]:
df.shape

(366, 4)

**To-Do**
* Show whole list and note correction items
* Remove all categories by default
* List all entries and save to intermediate file for validation

_Manual work - syntactic revision_
* Revise missing values
* Revise duplicates, inconsistencies, and obvious typos manually based on list
* Summarise the rework in the notebook as a note

_Correction and Interpretation_
* Load file for intermediate validation
* Add categories
* Summarise all tags including problem category and answer
* Visualise as intermediate _unvalidated_ result
* Validate based on sample
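
The "revise missing values" step could start from a flag column that marks entries needing manual attention. The sketch below is one possible way to do that, not the notebook's method: the `"?"` placeholder is taken from the status table, while treating blank strings as missing and the helper name `flag_missing` are assumptions.

```python
import pandas as pd

# "?" appears in the status table; treating empty strings as missing is an assumption
PLACEHOLDERS = {'?', ''}


def flag_missing(df, col='tag'):
    """Return a boolean Series marking entries that still need manual revision."""
    text = df[col].astype('string').str.strip()
    return text.isna() | text.isin(PLACEHOLDERS)


# Hypothetical toy data, not taken from the survey files
toy = pd.DataFrame({
    'lfdn': [1, 2, 3, 4, 5],
    'tag': ['People: Lack of trust', '?', None, '  ', 'NotCodable'],
})
to_revise = toy[flag_missing(toy)]
print(to_revise)
```

`NotCodable` is deliberately not flagged here, since it is a valid coding result rather than a missing value.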

___


Show whole list

In [None]:
#df.drop_duplicates(subset='tag', keep=False)
df.sort_values('tag')

Clean up tag syntax (category prefixes, leading whitespace)

In [5]:
## Attempt via iterrows; likely the wrong approach for manipulating entries (fails with the KeyError below)
#for i, row in df.iterrows():
#    if df.get_value['tag'] == np.nan():
#        df['tag'].fillna('NotApplicable')
#    if df['NotCodable' not in df['tag']]:
#        df['tag'] = df['tag'].str.split(':').str[1] #Remove categories
#        df['tag'] = df['tag'].str.lstrip() #Remove whitespaces at beginning
#    else:
#        df['tag'] = df['tag'].str.lstrip() #Remove whitespaces at beginning

KeyError: True
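
The `KeyError: True` comes from `df['NotCodable' not in df['tag']]`: the `in` operator on a Series tests the index, so the expression evaluates to a plain `True`, which is then looked up as a column name. In addition, `.str.split(':').str[1]` yields `NaN` for any tag without a colon, which is presumably how `NotCodable` entries get lost. A vectorised alternative with a boolean mask might look like the sketch below; the helper name `strip_categories` is hypothetical, and the `NotApplicable` fill value is taken from the attempt above.

```python
import pandas as pd


def strip_categories(tags):
    """Drop the 'Category:' prefix from each tag, leaving NotCodable entries intact."""
    tags = tags.fillna('NotApplicable').str.strip()
    # Only tags that carry a prefix and are not marked NotCodable are rewritten
    has_prefix = tags.str.contains(':', regex=False)
    not_codable = tags.str.contains('NotCodable', regex=False)
    mask = has_prefix & ~not_codable
    cleaned = tags.copy()
    cleaned[mask] = tags[mask].str.split(':', n=1).str[1].str.strip()
    return cleaned


# Hypothetical toy data in the style of the 'tag' column
toy_tags = pd.Series([
    'People: Lack of trust',
    ' Organization: Too high team distribution',
    'NotCodable',
    None,
    'NotCodable: unclear',
])
print(strip_categories(toy_tags))
```

In the notebook this would be applied as `df['tag'] = strip_categories(df['tag'])`, replacing the row-wise loop.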

In [None]:
df['tag'] = df['tag'].str.lstrip()  # Remove leading whitespace

In [None]:
df.sort_values('tag')

Remove duplicates for coherent view

In [None]:
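# keep=False would drop every tag that occurs more than once; keep='first' leaves one row per tag
df.drop_duplicates(subset='tag', keep='first')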

In [None]:
group_and_count(df, 'tag')
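
Because `groupby` drops missing keys by default, rows with an empty `tag` do not show up in this count. The sketch below uses made-up data (the toy frame is hypothetical) to show the effect, and one way to make missing tags visible with `value_counts(dropna=False)`.

```python
import pandas as pd

# Hypothetical toy data mimicking the 'lfdn' / 'tag' columns of the coded files
toy = pd.DataFrame({
    'lfdn': [58, 120, 121, 122],
    'tag': ['Lack of trust', 'Time pressure', 'Lack of trust', None],
})

# Same expression as in group_and_count: the row with a missing tag is dropped
grouped = (
    toy[['lfdn', 'tag']]
    .groupby('tag')
    .count()
    .sort_values('lfdn', ascending=False)
)

# value_counts can keep missing entries as a row of their own
with_missing = toy['tag'].value_counts(dropna=False)

print(grouped)
print(with_missing)
```

Comparing the two totals against `len(df)` is a quick way to see how many entries still lack a tag.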

In [None]:
df.to_csv('../../data/freetext_coded/validation/v_277_with_246_total_365_validation.csv', sep=';', encoding='latin-1')