# Data Quality Experiments

Let us peek inside a dataset and see what we find.

We will be looking for the following:

1. Completeness:
Completeness refers to how much information is available for all entities within your dataset. It assesses whether there are any missing values in your data. A higher level of completeness indicates that there are fewer missing values, ensuring that your dataset contains a comprehensive representation of the entities it describes. To evaluate completeness, you can count the number of missing values for each attribute or entity.

1. Consistency:
Consistency measures how uniform and consistent your data is throughout the dataset. It involves identifying and quantifying the number of inconsistencies or discrepancies within your data. These inconsistencies could include variations in data formats, naming conventions, or conflicting information. Maintaining data consistency is crucial for reliable analysis and reporting.

1. Accuracy:
Accuracy assesses how correct and error-free your data is. It involves identifying and counting the number of errors within your dataset. Errors could be typographical, computational, or factual inaccuracies. Accurate data is essential for making informed decisions and avoiding misleading conclusions.

1. Relevancy/Auditability:
Relevancy or auditability focuses on the presence of relevant data within your dataset. It involves evaluating the number of irrelevant values or records that do not contribute to the goals of your analysis or business needs. Ensuring that your dataset contains only pertinent information enhances its usability and effectiveness.

1. Validity:
Validity checks whether the data in your dataset adheres to predefined rules or allowable values. It involves validating data against established constraints or criteria to ensure that it meets quality standards. Valid data is trustworthy and conforms to expected norms, reducing the risk of using incorrect or invalid information.

1. Uniqueness:
Uniqueness measures how many duplicate values or records exist within your dataset. It is essential to identify and eliminate duplicates, as they can skew analysis results and waste storage resources. Maintaining data uniqueness ensures that each entity or data point is represented only once.

1. Timeliness:
Timeliness assesses how up-to-date your data is. It involves determining whether the data has been regularly updated to reflect the current state of the entities it describes. Timely data is critical for making decisions based on current information, especially in dynamic environments where data can quickly become outdated.

Evaluating these data quality dimensions is essential for ensuring that your dataset is reliable, accurate, and suitable for your intended purposes. It helps in making informed decisions, conducting meaningful analyses, and maintaining data integrity over time.
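
As a rough illustration of how two of these dimensions can be turned into comparable numbers, the sketch below computes a completeness score and a uniqueness score between 0 and 1. The function name `quality_scores` and the toy data are hypothetical and not part of the CCPP dataset, and the function assumes a non-empty DataFrame.

```python
import pandas as pd


def quality_scores(df: pd.DataFrame) -> dict:
    """Return simple 0-1 scores for completeness and uniqueness (assumes a non-empty frame)."""
    total_cells = df.shape[0] * df.shape[1]
    missing_cells = int(df.isnull().sum().sum())
    duplicate_rows = int(df.duplicated().sum())
    return {
        # share of cells that hold a value
        "completeness": 1 - missing_cells / total_cells,
        # share of rows that are not exact repeats of an earlier row
        "uniqueness": 1 - duplicate_rows / len(df),
    }


# Hypothetical toy data: one missing value and one repeated row
toy = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "z"]})
scores = quality_scores(toy)
print(scores)
```

Scores like these are only a summary; the per-column checks later in the notebook show where the problems actually sit.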

In [None]:
# imports
# pandas library for i/o and dataframes 
import pandas as pd

import missingno as msno

## Loading data 

We will use the Combined Cycle Power Plant (CCPP) dataset. According to its source, the data has the following specifications:

Source: https://archive.ics.uci.edu/dataset/294/combined+cycle+power+plant

- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target we are trying to predict)
- No missing data
- 9568 instances
- Data spans 6 years (2006 to 2011)

All of this information will be used to help us validate the dataset!
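
One way to put these specifications to work is to write them down as data and count the values that fall outside each range. The sketch below is a minimal version of that idea. The helper `count_out_of_range` and the two-row sample are hypothetical, and the column names (`AT`, `V`, `AP`, `RH`, `PE`) are assumed to match the CSV header, following the names used in the range query later in this notebook.

```python
import pandas as pd

# Documented ranges from the dataset description.
# Assumption: the column names match the CSV header.
EXPECTED_RANGES = {
    "AT": (1.81, 37.11),
    "V": (25.36, 81.56),
    "AP": (992.89, 1033.30),
    "RH": (25.56, 100.16),
    "PE": (420.26, 495.76),
}


def count_out_of_range(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count, per column, the values outside the inclusive range.

    Missing values are counted as out of range, because `between` returns False for NaN.
    """
    counts = {}
    for column, (low, high) in ranges.items():
        counts[column] = int((~df[column].between(low, high)).sum())
    return pd.Series(counts, name="out_of_range")


# Hypothetical two-row sample: the second row has an impossible temperature
sample = pd.DataFrame({
    "AT": [14.96, 99.0],
    "V": [41.76, 62.96],
    "AP": [1024.07, 1020.04],
    "RH": [73.17, 59.08],
    "PE": [463.26, 444.37],
})
violations = count_out_of_range(sample, EXPECTED_RANGES)
print(violations)
```

Keeping the ranges in a dictionary means the same specification can drive both the report and any later filtering step.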

See image below for more details.

![Combined Cycle Power Plant dataset description](static/combined+cycle+power+plant_Dataset.png)

In [None]:
# get data from the source 
raw_ccpp_data = pd.read_csv('https://storage.googleapis.com/aipi_datasets/CCPP_data.csv',
                            skipinitialspace=True)

## Completeness Checks

We will simply count the number of missing values in each column.

In [None]:
completeness = raw_ccpp_data.isnull().sum()
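
Raw counts are easier to interpret next to percentages. The sketch below shows one possible way to report both per column, together with the number of rows that have at least one gap. The helper `missing_summary` and the sample frame are hypothetical; on the CCPP data every count should be zero if the "No missing data" specification holds.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })


# Hypothetical frame with one gap in RH
sample = pd.DataFrame({"AT": [14.96, 25.18, 5.11], "RH": [73.17, None, 92.14]})
summary = missing_summary(sample)
print(summary)

# Rows that contain at least one missing value
rows_with_gaps = int(sample.isnull().any(axis=1).sum())
print(rows_with_gaps)
```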

The "missingno" library provides a set of tools for visualizing and working with missing data in a dataset. It's particularly helpful for identifying and visualizing missing values in a dataset, which can be crucial for data quality assessment and data cleaning.

Once you've imported "missingno" with the alias "msno," you can use its functions and capabilities by prefixing them with "msno." For example, you can use "msno.matrix()" to create a matrix plot that visualizes missing values in your dataset.

For instance, "msno.matrix(raw_ccpp_data)" creates a visual representation of missing values in the DataFrame "raw_ccpp_data." This can be useful for quickly identifying which columns or rows have missing data in your dataset.

In [None]:
# Validity: keep only the rows where every column lies within its documented range
ccpp_data = raw_ccpp_data.query(
    '(AT >= 1.81 & AT <= 37.11) & '
    '(V >= 25.36 & V <= 81.56) & '
    '(AP >= 992.89 & AP <= 1033.30) & '
    '(RH >= 25.56 & RH <= 100.16) & '
    '(PE >= 420.26 & PE <= 495.76)'
)

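# The remaining checks use a small inventory table instead of the CCPP data.
# ASSUMPTION: these rows are hypothetical and exist only to make the checks runnable.
df = pd.DataFrame({
    'Product_ID': [101, 102, 103],
    'Product_Name': ['Widget A', 'Widget B', 'Widget C'],
    'Price': [10.99, 15.49, 7.25],
    'Stock_Level': [100, 0, 50],
    'Last_Update': ['2023-09-01', '2023-09-08', '2023-08-15'],
    'Supplier': ['Supplier X', 'Supplier Y', 'supplier x'],
})

# Consistency
# Count each distinct 'Supplier' value; variant spellings show up as separate entries
consistency = df['Supplier'].value_counts()

# Accuracy
# Simulate an accuracy issue by introducing a typo in 'Product_Name'
# (.loc avoids the chained assignment of df['Product_Name'][2] = ...)
df.loc[2, 'Product_Name'] = 'Widget C Typo'

# Relevancy/Auditability
# Select products with a stock level of zero as candidates for irrelevant records
auditability = df[df['Stock_Level'] == 0]

# Validity
# Select rows that break the rule that every price must be greater than zero
validity = df[df['Price'] <= 0]

# Uniqueness
# Introduce a duplicate product (DataFrame.append was removed in pandas 2.0, so use pd.concat)
duplicate_row = pd.DataFrame([{'Product_ID': 101, 'Product_Name': 'Widget A', 'Price': 10.99,
                               'Stock_Level': 100, 'Last_Update': '2023-09-01', 'Supplier': 'Supplier X'}])
df = pd.concat([df, duplicate_row], ignore_index=True)
# Select every row whose Product_ID occurs more than once
uniqueness = df[df.duplicated(subset='Product_ID', keep=False)]

# Timeliness
# Select rows whose last update is more than 7 days older than the reference date
today = pd.Timestamp('2023-09-10')
timeliness = df[(today - pd.to_datetime(df['Last_Update'])).dt.days > 7]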
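
Counting distinct values only reveals inconsistencies if you compare the raw values with a cleaned-up version. The sketch below illustrates one possible approach for the consistency check: normalize casing and whitespace, then see how many distinct spellings collapse. The supplier names are hypothetical, and the normalization rules (trim, collapse spaces, lowercase) are an assumption about what counts as "the same" supplier.

```python
import pandas as pd

# Hypothetical supplier names with inconsistent casing and spacing
suppliers = pd.Series(["Supplier X", "supplier x", " Supplier X ", "Supplier Y"])

raw_counts = suppliers.value_counts()

# Normalize: trim whitespace, collapse repeated spaces, and lowercase
normalized = (
    suppliers.str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)
normalized_counts = normalized.value_counts()

# Number of distinct spellings that disappear after normalization
n_inconsistent_variants = raw_counts.size - normalized_counts.size
print(raw_counts.size, normalized_counts.size, n_inconsistent_variants)
```

A non-zero difference suggests the column mixes several spellings of the same entity and is worth cleaning before any grouping or joining.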