1. Definition of Data Quality:
    - Data quality refers to the degree to which data is accurate, complete, reliable, and relevant for the intended purpose.
2. Dimensions of Data Quality:
    - Accuracy: How well the data represents the real-world construct it is intended to measure.
    - Completeness: The extent to which all required data is available.
    - Consistency: The extent to which data is free from contradictions or discrepancies, both within a dataset and across datasets.
    - Timeliness: The extent to which data is up to date and available when needed.
    - Validity: The extent to which data conforms to defined business rules or constraints.
    - Relevance: The extent to which data is pertinent to the purpose for which it is used.
    - Uniqueness: The extent to which each real-world entity is recorded only once, with no duplicate records.
3. Data Profiling:
    - This involves examining the structure and content of the data to understand its characteristics. It includes summary statistics, distributions, and identifying missing or anomalous values.
4. Data Cleaning:
    - This process involves identifying and correcting errors, inconsistencies, and inaccuracies in the data. Techniques may include imputation, outlier handling, and data transformation.
5. Data Validation and Verification:
    - Validation checks ensure that the data adheres to defined business rules or constraints. Verification involves confirming that the data accurately reflects the real-world entities it represents.
6. Data Governance:
    - This involves establishing policies, processes, and standards for managing and ensuring the quality of data throughout its lifecycle.
7. Data Quality Metrics:
    - These are measurable criteria for assessing data quality, such as error rates, completeness percentages, and accuracy scores.
8. Data Quality Tools:
    - Various software tools are available to assist in data quality analysis, including data profiling tools, data cleaning and transformation tools, and data quality monitoring platforms.
9. Data Quality Frameworks:
    - Frameworks provide a structured approach to managing and improving data quality. Examples include the DAMA-DMBOK framework from DAMA (the Data Management Association) and the TDQM (Total Data Quality Management) framework.
10. Data Quality in Machine Learning:
    - High-quality data is essential for training accurate and reliable machine learning models. Data quality issues can lead to biased or erroneous predictions.
11. Data Quality Maintenance:
    - Continuous monitoring and maintenance of data quality are essential to ensure that the data remains reliable over time.
12. Data Privacy and Compliance:
    - Data quality efforts must be aligned with legal and regulatory requirements, such as GDPR (General Data Protection Regulation) or HIPAA (Health Insurance Portability and Accountability Act).
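
Several of the dimensions and metrics listed above can be expressed as simple ratios. The following is a minimal sketch, not a definitive implementation: the toy records, the helper name `quality_metrics`, and the rule that a valid age lies between 0 and 120 are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy records; the values are made up for illustration only.
records = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [34.0, None, 29.0, 150.0],
})


def quality_metrics(frame, key, column, low, high):
    """Return completeness, uniqueness, and validity as ratios in [0, 1]."""
    total = len(frame)
    non_null = frame[column].notna().sum()
    # Assumed business rule: a value is valid if it lies in [low, high].
    # Missing values never satisfy between(), so they are not counted here.
    in_range = frame[column].between(low, high).sum()
    return {
        "completeness": non_null / total,
        "uniqueness": frame[key].nunique() / total,
        # Validity is measured over the values that are present.
        "validity": in_range / non_null if non_null else float("nan"),
    }


metrics = quality_metrics(records, key="id", column="age", low=0, high=120)
print(metrics)
```

Measuring validity only over the non-missing values keeps it separate from completeness, so a column with many gaps is not penalised twice.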

# Import Data

In [1]:
# Import pandas and matplotlib for data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
%matplotlib inline

# Show up to 300 rows and 300 columns when displaying DataFrames
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)


# Load the Titanic dataset
df = pd.read_csv('./Dataset/Titanic.csv')

# Helper classes from the DataExplorationToolkit module
import DataExplorationToolkit as dtl
feature_selector = dtl.FeatureSelector()
visualizer = dtl.Visualization()
data_quality = dtl.DataQuality()


In [2]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Data Profiling:
    - This involves examining the structure and content of the data to understand its characteristics. It includes summary statistics, distributions, and identifying missing or anomalous values.

In [3]:
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
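
The summary statistics already hint at anomalous values in Fare: the maximum (512.3292) is far above the 75th percentile (31.0). One common heuristic, which the notebook itself does not use, is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles printed above; the variable names are illustrative.

```python
# Quartiles and maximum of Fare, copied from the df.describe() output above.
fare_q1, fare_q3, fare_max = 7.9104, 31.0, 512.3292

# Interquartile range and Tukey's fences (1.5 * IQR beyond each quartile).
iqr = fare_q3 - fare_q1
lower_fence = fare_q1 - 1.5 * iqr
upper_fence = fare_q3 + 1.5 * iqr

print(f"IQR = {iqr:.4f}, fences = [{lower_fence:.4f}, {upper_fence:.4f}]")
print("max Fare lies beyond the upper fence:", fare_max > upper_fence)
```

On the full DataFrame, `df[df["Fare"] > upper_fence]` would list the rows the rule flags. Whether those fares are errors or genuine high-priced tickets is a separate question that the rule cannot answer.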


In [8]:
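# Split the columns by type, then profile the DataFrame
cat_cols, num_cols, text_cols = feature_selector.return_categorical_numerical_columns(df)
profiling = data_quality.data_profiling(df, num_cols, cat_cols, text_cols)
# The result holds the profile's keys first and the profile itself second
keys = profiling[0]
profiling = profiling[1]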

In [10]:
keys

dict_keys(['num_rows', 'num_columns', 'column_names', 'numeric_summary', 'categorical_summary', 'text_summary', 'unique_values', 'missing_values', 'data_types'])
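
The internals of `data_profiling` are not shown in this notebook. As a rough sketch of how a dictionary with the same keys could be assembled using plain pandas, consider the hypothetical helper `profile_frame` below. It is an assumption about the general approach, not the toolkit's actual implementation, and it runs on a tiny made-up frame rather than the Titanic CSV.

```python
import pandas as pd


def profile_frame(frame, num_cols, cat_cols, text_cols):
    """Assemble a profiling dictionary with plain pandas (hypothetical helper)."""
    return {
        "num_rows": len(frame),
        "num_columns": frame.shape[1],
        "column_names": list(frame.columns),
        "numeric_summary": frame[num_cols].describe(),
        "categorical_summary": frame[cat_cols].describe(),
        "text_summary": frame[text_cols].describe(),
        # Frequency of each category, per categorical column.
        "unique_values": {c: frame[c].value_counts().to_dict() for c in cat_cols},
        "missing_values": frame.isna().sum(),
        "data_types": frame.dtypes,
    }


# Tiny made-up frame so the sketch runs without the Titanic CSV.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Sex": ["male", "female", "female"],
    "Name": ["A", "B", "C"],
})
toy_profile = profile_frame(toy, ["Age"], ["Sex"], ["Name"])
print(toy_profile["unique_values"])
print(toy_profile["missing_values"])
```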

In [11]:
profiling['numeric_summary']

,PassengerId,Age,Fare
count,891.0,714.0,891.0
mean,446.0,29.699118,32.204208
std,257.353842,14.526497,49.693429
min,1.0,0.42,0.0
25%,223.5,20.125,7.9104
50%,446.0,28.0,14.4542
75%,668.5,38.0,31.0
max,891.0,80.0,512.3292


In [12]:
profiling['categorical_summary']

,Sex,Embarked
count,891,889
unique,2,3
top,male,S
freq,577,644


In [13]:
profiling['text_summary']

,Name,Ticket,Cabin
count,891,891,204
unique,891,681,147
top,"Braund, Mr. Owen Harris",347082,B96 B98
freq,1,7,4


In [17]:
print(profiling['unique_values'])

profiling['missing_values']

{'Sex': {'male': 577, 'female': 314}, 'Embarked': {'S': 644, 'C': 168, 'Q': 77}}


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
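
Raw missing counts are easier to judge as a share of the 891 rows. The sketch below converts the three non-zero counts from the output above into percentages, which is one way to express the completeness dimension; the variable names are illustrative.

```python
import pandas as pd

# Non-zero counts copied from the missing_values output above.
# The row count of 891 comes from the count row of the numeric summary.
n_rows = 891
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of rows with a missing value, in percent.
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct)
```

These counts agree with the earlier summaries: 891 - 177 = 714 for Age, 891 - 687 = 204 for Cabin, and 891 - 2 = 889 for Embarked. On the full DataFrame, `df.isna().mean() * 100` should give the same percentages for every column at once.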

In [18]:
profiling['data_types']

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
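
The data types listed above can also serve as a simple validity check: if a later load of the file yields a different dtype for a column, something in the data has probably changed. The sketch below is one possible way to do this. The helper `dtype_mismatches` and the sample frame are hypothetical, and only a subset of the columns is checked for brevity.

```python
import pandas as pd

# Expected dtypes, copied from the data_types output above (subset for brevity).
expected_dtypes = {"PassengerId": "int64", "Age": "float64", "Name": "object"}


def dtype_mismatches(frame, expected):
    """Return {column: (expected, actual)} for columns that are absent or differ."""
    problems = {}
    for column, wanted in expected.items():
        actual = str(frame[column].dtype) if column in frame.columns else "missing"
        if actual != wanted:
            problems[column] = (wanted, actual)
    return problems


# Made-up frame in which Age was read as text, for example because of a stray
# non-numeric entry. Dtypes are set explicitly so the example does not depend
# on the pandas version's default string dtype.
sample = pd.DataFrame({
    "PassengerId": pd.Series([1, 2], dtype="int64"),
    "Age": pd.Series(["22", "unknown"], dtype="object"),
    "Name": pd.Series(["A", "B"], dtype="object"),
})
mismatches = dtype_mismatches(sample, expected_dtypes)
print(mismatches)
```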