# **1. DATA QUALITY ASSESSMENT**

Import libraries:

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import os

Import data:

In [None]:
BEERS = pd.read_csv("https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv")
#write the variable name on the last line of a cell to display its content
BEERS


Unnamed: 0,abv,ibu,id,name,style,brewery_id,ounces
0,0.050,,1436,Pub Beer,American Pale Lager,408,12.0
1,66.000,,2265,Devil's Cup,American Pale Ale (APA),177,12.0
2,71.000,,2264,Rise of the Phoenix,American IPA,177,12.0
3,0.090,,2263,Sinister,American Double / Imperial IPA,177,12.0
4,75.000,,2262,Sex and Candy,American IPA,177,12.0
...,...,...,...,...,...,...,...
2414,67.000,45.0,928,Belgorado,Belgian IPA,424,12.0
2415,0.052,,807,Rail Yard Ale,American Amber / Red Ale,424,12.0
2416,55.000,,620,B3K Black Lager,Schwarzbier,424,12.0
2417,55.000,40.0,145,Silverback Pale Ale,American Pale Ale (APA),424,12.0


Basic operation to inspect data:

In [42]:
#number of tuples and columns of the data source


In [43]:
#show the schema of the data source


In [44]:
#show the first 5 tuples of the data source


In [45]:
#head(K) shows the first K lines of the data source


In [46]:
#show the data type of each attribute; the type is inferred by analyzing the values


In [47]:
#unique displays the list of distinct values in a column


In [48]:
#nunique counts the number of distinct values


In [49]:
#value_counts() returns an object containing counts for each unique value


In [50]:
#here we want to inspect how many unique values have the same count
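
The inspection steps above could be sketched as follows. This is only an illustration: the small `beers` frame is a hypothetical stand-in with made-up rows, not the real BEERS data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows shaped like BEERS (not the real data)
beers = pd.DataFrame({
    "abv": [0.050, 0.066, 0.071, 0.090, 0.050],
    "ibu": [None, 45.0, None, 40.0, 45.0],
    "style": ["American IPA", "American Pale Ale (APA)", "American IPA",
              "Schwarzbier", "American IPA"],
    "ounces": [12.0, 12.0, 16.0, 12.0, 12.0],
})

n_rows, n_cols = beers.shape                  # number of tuples and columns
schema = list(beers.columns)                  # attribute names
first_rows = beers.head(3)                    # head(K) returns the first K rows
types = beers.dtypes                          # inferred type of each attribute
distinct_styles = beers["style"].unique()     # array of distinct values
n_distinct_styles = beers["style"].nunique()  # number of distinct values
style_counts = beers["style"].value_counts()  # occurrences of each distinct value
# how many distinct styles share the same occurrence count
count_of_counts = style_counts.value_counts()

print(n_rows, n_cols, n_distinct_styles)
print(count_of_counts)
```

Applying `value_counts()` twice is one way to answer the last question: the first call counts occurrences per value, the second counts how many values share each occurrence count.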


**DUPLICATION**

Duplication occurs when a real-world entity is stored twice or more in a data source.

*Definition*: A measure of unwanted duplication existing within a data set.

*Evaluation*: Number of duplicates

In [51]:
BEERS = pd.read_csv("https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv", header=None)

In [52]:
#duplicated returns a boolean Series denoting the duplicate rows (exact matching)


In [53]:
#any shows if duplicates exist
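
A minimal sketch of the duplicate check, assuming exact matching on all columns. The rows below are invented for illustration and are not taken from BEERS.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first one exactly
beers = pd.DataFrame({
    "name": ["Pub Beer", "Devil's Cup", "Pub Beer", "Sinister"],
    "style": ["American Pale Lager", "American Pale Ale (APA)",
              "American Pale Lager", "American IPA"],
    "ounces": [12.0, 12.0, 12.0, 12.0],
})

dup_mask = beers.duplicated()        # True for every repeat after the first occurrence
has_duplicates = bool(dup_mask.any())
n_duplicates = int(dup_mask.sum())   # evaluation: number of duplicates

print(has_duplicates, n_duplicates)
```

Note that a column holding a unique value per row (for example a row identifier) prevents exact matches on the full row; in that case `duplicated(subset=[...])` can restrict the comparison to the descriptive columns.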


**COMPLETENESS**

The completeness of a table characterizes the extent to which a table represents the corresponding real world.

Completeness in the relational model can be characterized by the presence of null values. In a model with null values, a null value generally denotes a missing value, i.e., a value that exists in the real world but is not available.

*Definition*: The degree to which a given data collection includes the data describing the corresponding set of real-world objects.

*Evaluation*: Number of not null values / Total number of values

In [54]:
BEERS = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv')

In [55]:
#isnull() shows which values are null


In [56]:
#display the number of not null values for each column


In [57]:
#total number of not null values


In [58]:
#display the number of null values for each column


In [59]:
#total number of null values


In [60]:
#total number of cells


COMPLETENESS EVALUATION:
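
The evaluation formula (not null values / total values) could be computed along these lines. The frame is a hypothetical example with a few injected nulls, not the real BEERS data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some missing values
beers = pd.DataFrame({
    "abv": [0.05, np.nan, 0.071, 0.09],
    "ibu": [np.nan, 45.0, np.nan, 40.0],
    "style": ["American IPA", None, "Schwarzbier", "American IPA"],
})

null_mask = beers.isnull()                  # True where a value is null
not_null_per_column = beers.count()         # not null values for each column
not_null_total = int(not_null_per_column.sum())
null_per_column = null_mask.sum()           # null values for each column
null_total = int(null_per_column.sum())
total_cells = beers.shape[0] * beers.shape[1]

completeness = not_null_total / total_cells
print("Completeness:", completeness)
```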

Dealing with missing values with a different format:

In [61]:
#we add 'na' and '--' to the set of strings that pandas treats as missing values
MISSING = ['na', '--']
PROPERTY = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/PROPERTY.csv', na_values=MISSING)
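
To see what the extra missing-value markers change, here is a small sketch that parses the same text with and without `na_values`. The CSV content and the column names `PID`, `NUM_BEDROOMS` and `NUM_BATH` are assumptions made up for this example; only the markers 'na' and '--' come from the comment above.

```python
import io

import pandas as pd

MISSING = ["na", "--"]  # extra strings to treat as missing

# Hypothetical CSV text; the last row also has an empty NUM_BATH field
csv_text = "PID,NUM_BEDROOMS,NUM_BATH\n1,3,1\n2,na,2\n3,--,1\n4,2,\n"

raw = pd.read_csv(io.StringIO(csv_text))                       # default markers only
clean = pd.read_csv(io.StringIO(csv_text), na_values=MISSING)  # defaults plus MISSING

print(raw["NUM_BEDROOMS"].isnull().sum())    # 'na' and '--' are kept as strings
print(clean["NUM_BEDROOMS"].isnull().sum())  # both are now parsed as NaN
```

Without `na_values`, the column is read as text and the two placeholders are counted as valid values, which inflates the completeness score.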

**ACCURACY**

*Definition*: The extent to which data are correct, reliable and certified.

Syntactic Accuracy is the closeness of a value v to the elements of the corresponding definition domain D.

Semantic Accuracy is defined as the closeness between a data value v and its true value v’, i.e., the value of the corresponding real-world object.

It is possible to calculate the accuracy of an attribute, i.e., attribute (or column) accuracy, of a relation, i.e., relation accuracy, or of a whole database, i.e., database accuracy.

*Evaluation*: Number of accurate values / Total number of values

In [62]:
#the styles data source contains the list of correct values for the attribute style in beers
STYLES = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/STYLES.csv')


In [63]:
#now we check if the values attribute style in beers contain errors
BEERS = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv')

In [64]:
#correct values in beers are the ones contained in styles


In [65]:
#we sum the true values in correct


In [66]:
#we count the not null values of the column style in beers


ACCURACY EVALUATION:
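
A sketch of the syntactic accuracy check for `style`, assuming the reference list is stored in a column also called `style`. That column name and all the rows below are hypothetical.

```python
import pandas as pd

# Hypothetical reference domain and data; 'Amercan IPA' is a deliberate typo
styles = pd.DataFrame({"style": ["American IPA", "Schwarzbier", "Belgian IPA"]})
beers = pd.DataFrame({
    "style": ["American IPA", "Amercan IPA", "Schwarzbier", None, "Belgian IPA"],
})

# a value is correct if it appears in the reference list (nulls give False)
correct = beers["style"].isin(styles["style"])
n_correct = int(correct.sum())            # sum of the True values
n_not_null = int(beers["style"].count())  # not null values of style

accuracy = n_correct / n_not_null
print("Accuracy of style:", accuracy)
```

Dividing by the not null count keeps missing values out of the accuracy score, so they are penalized by completeness only.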

In [67]:
#we assume that the values of attribute ibu in beers are correct only if they belong to a 5 to 100 range


In [68]:
#check Accuracy of ibu considering the acceptance range
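
The range rule could be checked as in the sketch below. Treating both endpoints (5 and 100) as valid is an assumption, and the `ibu` values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical ibu values, including one null and two out-of-range values
beers = pd.DataFrame({"ibu": [45.0, 3.0, np.nan, 100.0, 120.0, 5.0]})

# between() is inclusive on both ends by default; NaN yields False
in_range = beers["ibu"].between(5, 100)
n_in_range = int(in_range.sum())
n_not_null = int(beers["ibu"].count())

ibu_accuracy = n_in_range / n_not_null
print("Accuracy of ibu:", ibu_accuracy)
```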


**TIMELINESS**

*Definition*: The extent to which the age of the data is appropriate for the task at hand.

Timeliness has two components: currency and volatility. Currency is a measure of how old the information is, based on how long ago it was recorded. Volatility is a measure of information instability, i.e., how frequently the value of an entity attribute changes.

Currency = Age + (Delivery Time - Input Time)

*Evaluation*: Max(0, 1 - Currency/Volatility)

In [69]:

PROPERTY = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/PROPERTY.csv')#, parse_dates=['TS_UPDATE'], date_parser=dateparse)

In [70]:
#we assume that Volatility (the time in which information is valid in the real-world) is 80 days
#we assume also that the data are stored immediately (age = 0)
#now we compute the Currency and we add a column with its value for each tuple
VOLATILITY = 80


In [71]:
#adding a column with the Timeliness computation
#if Volatility is greater than Currency, the Timeliness is equal to 1 - Currency / Volatility; otherwise it is 0


In [72]:
#print("Average Timeliness: ", PROPERTY['Timeliness'].mean())
#print("Maximum  Timeliness:", PROPERTY['Timeliness'].max())
#print("Minimum Timeliness:", PROPERTY['Timeliness'].min())
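
A possible sketch of the timeliness computation under the assumptions stated above (Volatility of 80 days, Age = 0). The timestamps and the fixed delivery time `now` are hypothetical; in practice the delivery time would typically be the current date.

```python
import pandas as pd

VOLATILITY = 80  # days, as assumed above

# Hypothetical delivery time, fixed so the example is reproducible
now = pd.Timestamp("2024-03-01")

# Hypothetical update timestamps standing in for the TS_UPDATE column
prop = pd.DataFrame({
    "TS_UPDATE": pd.to_datetime(["2024-02-20", "2024-01-01", "2023-06-01"]),
})

# with Age = 0, Currency = Delivery Time - Input Time, expressed in days
prop["Currency"] = (now - prop["TS_UPDATE"]).dt.days

# Timeliness = max(0, 1 - Currency / Volatility)
prop["Timeliness"] = (1 - prop["Currency"] / VOLATILITY).clip(lower=0)

print("Average Timeliness:", prop["Timeliness"].mean())
print("Maximum Timeliness:", prop["Timeliness"].max())
print("Minimum Timeliness:", prop["Timeliness"].min())
```

Using `clip(lower=0)` applies the `Max(0, ...)` part of the formula to the whole column at once.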

**CONSISTENCY**

The consistency dimension captures the violation of semantic rules defined over (a set of) data items, where items can be tuples of relational tables or records in a file.

Semantic rules can be integrity constraints, data edits or business rules.

*Definition*: The satisfaction of semantic rules defined over a set of data items.

*Evaluation*: Number of consistent tuples / Total number of tuples

In [73]:
PROPERTY = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/PROPERTY.csv')

In [74]:
#we define a rule that the number of bathrooms should be lower than the number of bedrooms
#we add the column consistency
#we assign the value 1 if the rule is satisfied, 0 otherwise


In [75]:
#count the number of consistent tuples considering the rule


In [76]:
#count the total number of tuples in the property dataset


CONSISTENCY EVALUATION:
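
The rule and the evaluation formula could be sketched as follows. The column names `NUM_BEDROOMS` and `NUM_BATH` and the rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one has a missing number of bedrooms
prop = pd.DataFrame({
    "NUM_BEDROOMS": [3, 2, 1, np.nan],
    "NUM_BATH": [1, 2, 2, 1],
})

# rule: bathrooms < bedrooms; 1 if satisfied, 0 otherwise
# (a comparison with NaN is False, so such tuples count as inconsistent here)
prop["consistency"] = (prop["NUM_BATH"] < prop["NUM_BEDROOMS"]).astype(int)

n_consistent = int(prop["consistency"].sum())  # consistent tuples
n_tuples = len(prop)                           # total tuples

consistency = n_consistent / n_tuples
print("Consistency:", consistency)
```

Counting tuples with missing values as violations is a design choice of this sketch; an alternative would be to evaluate the rule only on tuples where both attributes are present.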