# Data Quality

**Note**: The following questions are meant to help you work through the topic of data quality and related issues such as data fitness.
The main question is:
> What do you think of the dataset? Is it appropriate to answer the question about employee satisfaction? Is the information provided sufficient to answer the question? If not, what other information would you use?

In [24]:
import pandas as pd
import numpy as np

data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}

df_es = pd.read_csv("./data/employees_satisfaction.csv", index_col=0)
df = pd.DataFrame(data)

#df_dss = pd.read_csv("./data/Latest_Data_Science_Salaries.csv", index_col=0)
#ppp_data_recent = pd.read_csv('./data/public_150k_plus_230630.csv')
#ppp_data = pd.read_csv('./data/public_up_to_150k_1_230630.csv')

Data quality is the assessment of usefulness and reliability of data to serve its purpose.

There are two axes of data quality:
1. the integrity of the data itself
2. the fit, or appropriateness, of the dataset with respect to a particular question.

### 1. Data Integrity:

The integrity of a dataset is evaluated using the data values and the descriptors that make it up.

Data integrity focuses on the quality of the data, or its accuracy. It strives to eliminate errors and redundant information, and to fill in missing information.

Data integrity requires that data meets the following 10 characteristics: 
1. _**high volume**_: 
2. _**of known provenance**_:
3. _**well-annotated**_:
4. _**timely**_:
5. _**complete**_: Completeness of data means that the main features/attributes required for the analysis do not have missing values. Missing values can skew the analysis and lead to misleading trends.
6. _**multivariate**_:
7. _**atomic**_:
8. _**consistent**_:
9. _**dimensionally structured**_:
10. _**clear**_:

Let's see what strategies we can use to assess each of these characteristics. 


**1.1. High Volume**:
Assesses whether the volume of the dataset is enough to answer your question. At minimum, a dataset will need to have sufficient records to support the type of analysis needed to answer your particular question. If what you need is a count—for example, the number of 311 calls that involved noise complaints in a particular year—then having “enough” data means having records of all of the 311 calls for that particular year.

> _important questions would be:_
>* _Is the number of records in your dataset sufficient?_

**_Solution:_**
From the metadata and supporting documents, we know that all employee records of the company are listed in this dataset, which means that the volume of the dataset is sufficient for the purpose of this survey.

In [25]:
print("Does this dataset include all 500 employees of this company? "+str(len(df_es.index)==500))

Does this dataset include all 500 employees of this company? True


**1.2. Complete**: 

Completeness measures the ratio of missing records for the features that are necessary (not optional) for the analysis. For example, if 30 out of 100 records in your analysis are missing the industry feature that you need for your marketing campaign, the completeness of your data is 70%.
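
As a rough sketch of this calculation, completeness per column can be computed as the share of non-missing values. The `toy` frame and its `industry` column are made-up illustration data, not part of the employee dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 records, 3 of them missing the `industry` feature
toy = pd.DataFrame({
    "customer": range(10),
    "industry": ["retail", np.nan, "tech", "tech", np.nan,
                 "health", "retail", np.nan, "tech", "health"],
})

# isna().mean() is the fraction of missing values per column,
# so 1 minus that fraction is the completeness
completeness = 1 - toy.isna().mean()
print((completeness * 100).round(1))
```

The same expression should work on `df_es` if you restrict it to the columns your analysis actually requires.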

> _important questions would be:_
>* _Are there any missing values in the data?_
>* _Find the earliest date(s) in our “recent” data file and confirm that they are before a specific date._

> _in case of two data sets:_
>* _Compare the file sizes and/or row counts of the two datasets to confirm that the more recent file is larger than the older file._
>* _Compare the data they contain to confirm that all the records in the earlier file already exist in the later file._
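
The two-dataset checks could look roughly like the sketch below. The `older` and `newer` frames are hypothetical stand-ins for an earlier and a later snapshot of the same data; the column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical snapshots: `older` should be fully contained in `newer`
older = pd.DataFrame({"emp_id": [1, 2, 3], "salary": [100, 200, 300]})
newer = pd.DataFrame({"emp_id": [1, 2, 3, 4], "salary": [100, 200, 300, 400]})

# 1) The more recent file should have more rows than the older one
print("Newer file is larger:", len(newer) > len(older))

# 2) Every row of the older file should also appear in the newer one.
# A left merge with indicator=True marks rows found only in `older` as 'left_only'.
merged = older.merge(newer, how="left", on=list(older.columns), indicator=True)
missing_from_newer = merged[merged["_merge"] == "left_only"]
print("Older records missing from newer file:", len(missing_from_newer))
```

Matching on all columns is a deliberately strict choice: a record whose values were corrected in the later file would be reported as missing, which may or may not be what you want.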

One important caveat is that the default methods of the pandas library cannot detect every kind of missing data, for example when a missing value has been encoded as an empty string or an arbitrary placeholder value.
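
A minimal sketch of how such disguised missing values might be surfaced, assuming you already know (for instance from the metadata) which placeholders were used. The toy data and the `placeholders` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with "disguised" missing values
toy = pd.DataFrame({
    "gender": ["m", "", "f", "unknown", None],
    "rating": [4.0, -1.0, 3.5, np.nan, 5.0],
})

# isna() only catches the real None/NaN entries
print(toy.isna().sum())

# Assumed placeholder encodings; adapt these to what the metadata documents
placeholders = {"gender": ["", "unknown"], "rating": [-1.0]}

cleaned = toy.copy()
for column, values in placeholders.items():
    # mask() sets every entry that matches a placeholder to a missing value
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(values))

print(cleaned.isna().sum())
```

After this step the usual `isnull()` checks count the placeholders as missing too.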

In [41]:
# For example, by checking for NaNs
check = df_es.isnull()

# Total number of NaNs in the whole DataFrame
if check.values.any():
    print("Sum of all Nans? "+str(check.values.sum()))

# Sum of NaNs per column
print("Sum of Nans per column? "+str(check.sum()))

# You can replace NaN values with a string of your choice if you want
#df_es.replace(np.nan, 'N/A')

Sum of all Nans? 506
Sum of Nans per column? emp_id                0
age                   0
Dept                  0
education             0
recruitment_type      0
job_level             0
rating               29
awards                0
certifications        0
salary                0
gender                3
entry_date            0
last_raise          474
satisfied             0
dtype: int64


You can check the count of each distinct value in a column with ```value_counts```.
Its ```dropna``` argument controls whether missing values are counted, and by default they are left out.
You have to explicitly pass ```dropna=False``` to make the function report the missing values as well.

In [29]:
print(df_es['gender'].value_counts())
#or
print(df_es['gender'].value_counts(dropna=False))

gender
m         207
f         187
Male       57
Female     46
Name: count, dtype: int64
gender
m         207
f         187
Male       57
Female     46
NaN         3
Name: count, dtype: int64


You can also find the earliest date(s) in our “recent” data file and confirm that they are before June 30, 2023.

In [36]:
# convert the values in the `entry_date` column to *actual* dates
df_es['entry_date'] = pd.to_datetime(df_es['entry_date'], format='%Y-%m-%d')

# print out the `min()` and `max()` values in the `entry_date` column
print(df_es['entry_date'].min())
print(df_es['entry_date'].max())

2004-01-05 00:00:00
2020-12-17 00:00:00


In section X we talk about ways to replace missing values.

**1.3. Consistent**: 

Consistency refers to the lack of contradictions within a dataset or between different datasets. A 5-foot-tall newborn or mismatching revenue between sales and usage tables are both examples of inconsistency in the data.

> _important questions would be:_
>* _Are there duplicates?_
>* _Do the data match when they are read from two different sources?_
>* _Are the descriptors used for the same value in the dataset consistent? (e.g. spelling of "male" vs "Male")_

In [None]:
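# Load the second dataset used in this cell (its load is commented out in the first cell)
df_dss = pd.read_csv("./data/Latest_Data_Science_Salaries.csv", index_col=0)

# Are there duplicates? duplicated() flags each row that repeats an earlier one
print(df_es.duplicated())
print(df_dss[df_dss.duplicated()])

print("---------------------------")

# You can drop duplicates as follows:
#df_dss.drop_duplicates()

# You can also detect partial duplication by specifying your target columns
df_dss[df_dss.duplicated(['Salary','Company Location'])]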

In [27]:
# Are the descriptors used for the same value in the dataset consistent? e.g. spelling of "male" vs "Male"
# Find inconsistencies by counting the records per distinct label:
print(df_es.groupby(['gender'])['emp_id'].count())
print("--------------------")



gender
Female     46
Male       57
f         187
m         207
Name: emp_id, dtype: int64
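
The output above shows two spellings for each category. One possible way to harmonize them is sketched below on hypothetical toy data; the mapping from `m`/`f` to `Male`/`Female` is an assumption about what the short codes mean.

```python
import pandas as pd

# Hypothetical toy column mirroring the mixed spellings seen above
toy = pd.DataFrame({"gender": ["m", "f", "Male", "Female", "m"]})

# Assumed mapping from the short codes to the long spelling
label_map = {"m": "Male", "f": "Female"}

# replace() with a dict swaps the listed values and leaves all others untouched
toy["gender"] = toy["gender"].replace(label_map)
print(toy["gender"].value_counts())
```

Applied to `df_es['gender']`, the same `replace` call would leave the missing values as they are, so they still need to be handled separately.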


**1.4. Timely**: 

Timeliness ensures that data is up to date and the most recent records reflect the most recent changes.

> _important questions would be:_
>* _Is the dataset up to date?_
>* _Does the dataset include the most recent records?_
>* _When was the last time the data was updated?_
>* _What are the minimum and maximum dates in the table?_
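
These questions can be approached with a few lines of pandas. The sketch below uses hypothetical toy data; the `updated_at` column name and the reference date are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; `updated_at` is an assumed column name
toy = pd.DataFrame({"updated_at": ["2023-05-02", "2023-06-28", "2022-11-15"]})
toy["updated_at"] = pd.to_datetime(toy["updated_at"], format="%Y-%m-%d")

# Assumed reference date, e.g. the day the dataset was published
reference_date = pd.Timestamp("2023-06-30")

oldest = toy["updated_at"].min()
newest = toy["updated_at"].max()
# Age of the most recent record relative to the reference date
staleness = reference_date - newest

print("Date range:", oldest.date(), "to", newest.date())
print("Days since the most recent record:", staleness.days)
```

How many days of staleness are acceptable depends on the question you want to answer, so no threshold is hard-coded here.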

**1.5. Of Known Provenance**: 

The dataset comes from a known, reliable source.

> _important questions would be:_
>* _Is the dataset from a reliable source?_

### 2. Data Fitness:

There are three metrics of data fitness:
1. Validity
2. Reliability
3. Representativeness

**2.1. Validity**:

Validity means that data has the right type, format and range. Data has the right type and format if it conforms to a set of pre-defined standards and definitions: for example, 08-12-2019 is invalid when the standard format is defined as YYYY/mm/dd.
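
A format check like this can be sketched with `pd.to_datetime`. The sample values are hypothetical, and the expected format is assumed to be YYYY/mm/dd as in the example.

```python
import pandas as pd

# Hypothetical toy data; the expected format is assumed to be YYYY/mm/dd
dates = pd.Series(["2019/12/08", "08-12-2019", "2020/01/15", "2020/13/01"])

# errors="coerce" turns every value that does not match the format into NaT
parsed = pd.to_datetime(dates, format="%Y/%m/%d", errors="coerce")

# Keep the original strings that could not be parsed
invalid = dates[parsed.isna()]
print(invalid)
```

Note that the check flags both a wrongly formatted value and a correctly formatted value with an impossible month.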

Validity also ensures that data falls within an acceptable range. Outliers, an important concept in statistics and data science, are data points that do not meet this requirement.

> _important questions would be:_
>* _Are there any outliers? A boxplot is a simple way to identify outliers._
>* _Are the type and format of the data correct and consistent? e.g. date formats_

>**PS**: Boxplots visualize a distribution by plotting five summary numbers: the minimum, the 25th percentile, the median, the 75th percentile and the maximum. To find outliers in the data, look for points that lie more than 1.5 × IQR (interquartile range) below the 25th percentile or above the 75th percentile, where the IQR is the distance between the 75th and 25th percentiles.
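
The 1.5 × IQR rule can also be applied numerically, without drawing the plot. The sketch below uses hypothetical toy values rather than the employee data.

```python
import pandas as pd

# Hypothetical toy salaries with one obviously extreme value
salaries = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 250])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Boxplot whisker limits: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Everything outside the limits is flagged as an outlier
outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers)
```

A flagged value is only a candidate for closer inspection; whether it is an error or a genuine extreme has to be decided from context.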
