
# Data Quality Concepts in Python

This notebook walks through key data quality concepts and illustrates each one with a short Python (pandas) example.

## 1. Accuracy

Ensuring data accurately reflects the real-world entities or events it is meant to represent.

In [1]:

import pandas as pd

# Example dataset
data = {'Product': ['A', 'B', 'C', 'D'],
        'Price': [100, 150, 'invalid', 200]}

df = pd.DataFrame(data)

# Check for accuracy by validating data types and values
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')  # Convert non-numeric to NaN
df['Price'].isnull().sum(), df  # Count invalid entries and show corrected data


(1,
   Product  Price
 0       A  100.0
 1       B  150.0
 2       C    NaN
 3       D  200.0)
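
Type coercion only catches values that are not numeric at all; a value can parse cleanly and still be wrong. The following is a minimal sketch of a range check, assuming a plausible price range of 0 to 1000. The bounds `MIN_PRICE` and `MAX_PRICE` and the sample data are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical bounds for a plausible price (assumptions for illustration)
MIN_PRICE, MAX_PRICE = 0, 1000

df_range = pd.DataFrame({'Product': ['A', 'B', 'C', 'D'],
                         'Price': [100, -5, 150, 99999]})

# Flag rows whose price falls outside the assumed plausible range
# (Series.between is inclusive on both ends by default)
out_of_range = ~df_range['Price'].between(MIN_PRICE, MAX_PRICE)
suspect_rows = df_range[out_of_range]
suspect_rows
```

In practice the bounds would come from domain knowledge or a reference source rather than being hard-coded.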

## 2. Completeness

Ensuring that all necessary data is present and accounted for.

In [2]:

# Adding some missing values
data_incomplete = {'Product': ['A', 'B', 'C', None],
                   'Price': [100, 150, 175, 200]}

df_incomplete = pd.DataFrame(data_incomplete)

# Check for missing values (nulls)
df_incomplete.isnull().sum()


Product    1
Price      0
dtype: int64
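
Null counts are easier to compare across columns when expressed as a ratio. As a rough sketch using the same illustrative data, the share of populated values per column and the rows with any gap could be computed like this (the names `completeness` and `rows_with_gaps` are introduced here for illustration):

```python
import pandas as pd

df_incomplete = pd.DataFrame({'Product': ['A', 'B', 'C', None],
                              'Price': [100, 150, 175, 200]})

# Share of non-null values per column (1.0 means fully populated)
completeness = df_incomplete.notna().mean()

# Rows that have at least one missing field
rows_with_gaps = df_incomplete[df_incomplete.isna().any(axis=1)]
completeness, rows_with_gaps
```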


## 3. Consistency

Ensuring data is consistent in format and value across different systems or sources.

In [3]:

# Example of inconsistent formats
data_inconsistent = {'Product': ['A', 'B', 'C', 'D'],
                     'Price': ['100', '150.0', 'two hundred', '200']}

df_inconsistent = pd.DataFrame(data_inconsistent)

# Normalize data format (convert strings to numeric)
df_inconsistent['Price'] = pd.to_numeric(df_inconsistent['Price'], errors='coerce')
df_inconsistent


  Product  Price
0       A  100.0
1       B  150.0
2       C    NaN
3       D  200.0
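
Consistency also covers agreement between sources, not only format within one column. A minimal sketch, assuming two hypothetical sources (`df_source_a` and `df_source_b`, both invented for this example) that should report the same price per product:

```python
import pandas as pd

# Two hypothetical sources that are expected to agree on prices
df_source_a = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [100, 150, 200]})
df_source_b = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [100, 155, 200]})

# Align the sources on Product, then flag rows where the prices disagree
comparison = df_source_a.merge(df_source_b, on='Product', suffixes=('_a', '_b'))
mismatches = comparison[comparison['Price_a'] != comparison['Price_b']]
mismatches
```

Which source is treated as authoritative when values disagree is a separate decision that this check does not make.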


## 4. Timeliness

Ensuring that data is current and available when needed.

In [4]:

# Example of checking if data is up-to-date based on a timestamp
data_time = {'Product': ['A', 'B', 'C'],
             'Last_Updated': ['2023-12-01', '2024-01-01', '2024-01-02']}

df_time = pd.DataFrame(data_time)

# Convert Last_Updated column to datetime
df_time['Last_Updated'] = pd.to_datetime(df_time['Last_Updated'])

# Check for timeliness: find the most recent update and the rows that carry it
latest_update = df_time['Last_Updated'].max()
latest_update, df_time[df_time['Last_Updated'] == latest_update]


(Timestamp('2024-01-02 00:00:00'),
   Product Last_Updated
 2       C   2024-01-02)
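
Finding the latest timestamp does not say whether the other records are still fresh enough to use. One possible approach is to compare each record's age against a threshold. The reference date and the 7-day threshold below are assumptions chosen for illustration.

```python
import pandas as pd

df_time = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Last_Updated': pd.to_datetime(['2023-12-01', '2024-01-01', '2024-01-02']),
})

# Assumed reference date and freshness threshold (both illustrative)
reference_date = pd.Timestamp('2024-01-03')
max_age = pd.Timedelta(days=7)

# Age of each record relative to the reference date
df_time['Age'] = reference_date - df_time['Last_Updated']

# Records older than the threshold are considered stale
stale = df_time[df_time['Age'] > max_age]
stale
```

In a live pipeline `reference_date` would typically be the current time, for example `pd.Timestamp.now()`.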

## 5. Uniqueness

Ensuring there are no redundant or duplicate entries in the dataset.

In [5]:

# Example of checking for duplicate data
data_duplicate = {'Product': ['A', 'B', 'A', 'C'],
                  'Price': [100, 150, 100, 200]}

df_duplicate = pd.DataFrame(data_duplicate)

# Count fully duplicated rows (all columns must match) and show the data without them
df_duplicate.duplicated().sum(), df_duplicate.drop_duplicates()


(1,
   Product  Price
 0       A    100
 1       B    150
 3       C    200)
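
A full-row check misses records that share a key but differ in other columns. As a sketch, assuming `Product` is meant to be a unique key (an assumption about this dataset), `duplicated` can be restricted to that column with `subset`:

```python
import pandas as pd

# Product 'A' appears twice with different prices
df_keys = pd.DataFrame({'Product': ['A', 'B', 'A', 'C'],
                        'Price': [100, 150, 120, 200]})

# No row is an exact copy of another, so the full-row check finds nothing
full_row_dupes = df_keys.duplicated().sum()

# Treating Product as the key reveals the conflict; keep=False marks every copy
key_dupes = df_keys[df_keys.duplicated(subset='Product', keep=False)]
full_row_dupes, key_dupes
```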

## 6. Integrity

Ensuring the structure and format of data remain intact.

In [6]:

# Example of checking data integrity with a structured format
data_integrity = {'Product': ['A', 'B', 'C'],
                  'Price': [100, 150, 200],
                  'Category': ['Electronics', 'Clothing', 'Groceries']}

df_integrity = pd.DataFrame(data_integrity)

# Check structure and format integrity
df_integrity.dtypes


Product     object
Price        int64
Category    object
dtype: object
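
Inspecting `dtypes` by eye does not scale, so the expected structure can be written down and checked automatically. The sketch below assumes a hypothetical `expected_schema` mapping and a helper `schema_problems`; both are invented for this example and are not part of pandas.

```python
import pandas as pd

# Hypothetical expected schema: column name -> predicate on the column
expected_schema = {
    'Product': pd.api.types.is_string_dtype,
    'Price': pd.api.types.is_numeric_dtype,
    'Category': pd.api.types.is_string_dtype,
}

def schema_problems(df, schema):
    """Return (missing columns, columns whose dtype fails its check)."""
    missing = [col for col in schema if col not in df.columns]
    wrong = [col for col, check in schema.items()
             if col in df.columns and not check(df[col])]
    return missing, wrong

df_integrity = pd.DataFrame({'Product': ['A', 'B', 'C'],
                             'Price': [100, 150, 200],
                             'Category': ['Electronics', 'Clothing', 'Groceries']})

# A broken variant: Price stored as text and Category missing
df_broken = pd.DataFrame({'Product': ['A'], 'Price': ['100']})

schema_problems(df_integrity, expected_schema), schema_problems(df_broken, expected_schema)
```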


## 7. Data Silos

Disconnected data systems leading to inconsistent data.

In [7]:

# Example of data silos (different data sources)
data_silo_1 = {'Product': ['A', 'B', 'C'],
               'Price': [100, 150, 200]}

data_silo_2 = {'Product': ['A', 'B', 'C'],
               'Stock': [10, 15, 20]}

df_silo_1 = pd.DataFrame(data_silo_1)
df_silo_2 = pd.DataFrame(data_silo_2)

# Merge data sources to combine them
df_combined = pd.merge(df_silo_1, df_silo_2, on='Product')
df_combined


  Product  Price  Stock
0       A    100     10
1       B    150     15
2       C    200     20
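
The default inner merge silently drops products that exist in only one silo. A minimal sketch of surfacing those gaps uses an outer merge with `indicator=True`; the silo contents are changed here (an assumption for illustration) so that the two sources no longer overlap perfectly.

```python
import pandas as pd

# Illustrative silos that only partly overlap
df_silo_1 = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [100, 150, 200]})
df_silo_2 = pd.DataFrame({'Product': ['B', 'C', 'D'], 'Stock': [15, 20, 5]})

# Outer merge keeps every product; indicator=True adds a _merge column
# with the values 'left_only', 'right_only', or 'both'
df_all = pd.merge(df_silo_1, df_silo_2, on='Product', how='outer', indicator=True)

# Products present in only one of the two silos
only_one_side = df_all[df_all['_merge'] != 'both']
only_one_side
```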


## 8. Resource Constraints

Limited resources can prevent data quality from being checked continuously.

In [8]:

# Simulated example: limited capacity for continuous data validation
resources = 2           # Available resources
tasks = 10              # Total tasks to process
tasks_per_resource = 3  # Each resource can process 3 tasks

# Number of tasks that can be completed with the available resources
tasks_completed = min(tasks, resources * tasks_per_resource)
tasks_completed, tasks - tasks_completed  # (completed, left unprocessed)


(6, 4)
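
When not every record can be validated, one common workaround is to validate a random sample and estimate the error rate from it. The sketch below assumes a budget of 6 checks out of 10 records, mirroring the numbers above; the dataset and the rule that a negative price is invalid are illustrative assumptions.

```python
import pandas as pd

# Hypothetical dataset of 10 records, one of which has an invalid (negative) price
df_full = pd.DataFrame({'Product': [f'P{i}' for i in range(10)],
                        'Price': [100, 150, 200, -1, 120, 130, 140, 160, 170, 180]})

# Assumed budget: only 6 of the 10 records can be validated in this run
budget = 6
sample = df_full.sample(n=budget, random_state=0)  # fixed seed for reproducibility

# Validate only the sampled rows and estimate the error rate from them
invalid_in_sample = int((sample['Price'] < 0).sum())
estimated_error_rate = invalid_in_sample / budget
invalid_in_sample, estimated_error_rate
```

A sample can miss rare errors entirely, so the estimate is only as reliable as the sample size allows.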