# Data Quality Metrics
1. Completeness
This measures whether all the necessary data is present in a dataset. You can assess completeness at the record level or at the attribute level. Attribute-level completeness is a little more complex to measure, because not all fields are mandatory.
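The two views of completeness can be sketched with pandas. This is a minimal illustration, assuming a small hypothetical table (`records`) and a hypothetical list of mandatory fields (`mandatory`); neither comes from the dataset analyzed later.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value
records = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "age": [34, None, 51, 29],
    "job": ["IT", "HR", None, None],
})

# Assumption: only these fields are mandatory
mandatory = ["name", "age"]

# Attribute level: share of non-missing values in each column
attribute_completeness = records.notna().mean()

# Record level: share of rows whose mandatory fields are all present
record_completeness = records[mandatory].notna().all(axis=1).mean()

print(attribute_completeness)
print(record_completeness)
```

Restricting the record-level check to mandatory fields keeps optional columns such as `job` from lowering the score.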

2. Accuracy
How accurately does your data reflect the real-world object? In the financial sector, data accuracy is usually black or white – it either is or isn’t accurate. That’s because the number of pounds and pennies in an account is a precise number.
Data accuracy is critical in large organizations, where the penalties for failure are high.
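Accuracy can only be measured against something trusted. A minimal sketch, assuming a hypothetical reference table (`reference`) that holds the true balances for the same accounts:

```python
import pandas as pd

# Hypothetical balances held in the system being checked
recorded = pd.DataFrame({"account": ["A1", "A2", "A3"],
                         "balance": [100.00, 250.50, 75.25]})
# Hypothetical trusted reference for the same accounts
reference = pd.DataFrame({"account": ["A1", "A2", "A3"],
                          "balance": [100.00, 250.05, 75.25]})

merged = recorded.merge(reference, on="account",
                        suffixes=("_recorded", "_reference"))

# Compare in whole pennies so floating-point noise cannot cause a mismatch
pennies_recorded = (merged["balance_recorded"] * 100).round().astype(int)
pennies_reference = (merged["balance_reference"] * 100).round().astype(int)
merged["accurate"] = pennies_recorded == pennies_reference

# Share of accounts whose recorded balance matches the reference exactly
accuracy = merged["accurate"].mean()
print(merged[["account", "accurate"]])
print(accuracy)
```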

3. Consistency
Data held in different databases must stay synchronized. Automated software systems are usually the way to keep it consistent from day to day.
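A consistency check between two systems might look like the sketch below. The two tables (`crm`, `billing`) and their columns are hypothetical; the idea is to find records that exist in only one system or that disagree between the two.

```python
import pandas as pd

# Hypothetical extracts of the same customers from two systems
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "email": ["a@x.com", "b@x.com", "c@x.com"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4],
                        "email": ["a@x.com", "b@y.com", "d@x.com"]})

# An outer join keeps every customer; `_merge` records where each row came from
both = crm.merge(billing, on="customer_id", how="outer",
                 suffixes=("_crm", "_billing"), indicator=True)

# Customers present in only one of the two systems
only_one_system = both[both["_merge"] != "both"]

# Customers present in both systems but with different values
conflicting = both[(both["_merge"] == "both")
                   & (both["email_crm"] != both["email_billing"])]

print(only_one_system)
print(conflicting)
```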

4. Validity
Validity measures how well data conforms to the required format and value rules for each attribute. For example, every date should follow the same format, such as day/month/year or month/day/year.
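A format check like this can be sketched with `pd.to_datetime`. The sample values are hypothetical, and day/month/year is assumed to be the required format.

```python
import pandas as pd

# Hypothetical date strings in mixed formats
dates = pd.Series(["25/12/2022", "12/25/2022", "01/02/2023", "2023-02-01"])

# Assumption: day/month/year is the required format.
# errors="coerce" turns values that do not match into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

is_valid = parsed.notna()
validity = is_valid.mean()

print(dates[~is_valid])  # the values that break the format rule
print(validity)
```

A format check cannot catch every problem: `01/02/2023` parses under both conventions, so a date entered as month/day/year would still pass.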

5. Timeliness
Timeliness reflects the accuracy of data at a specific point in time. For example, when a customer moves to a new house, how quickly do they inform their bank of the new address? Few people do this immediately, so the timeliness of their data suffers.
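One way to quantify this is the lag between the real-world event and the update in the system. The sketch below assumes hypothetical columns (`moved_on`, `address_updated_on`) and a hypothetical tolerance of 30 days.

```python
import pandas as pd

# Hypothetical move dates and the dates the bank record was updated
moves = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "moved_on": pd.to_datetime(["2023-01-10", "2023-02-01", "2023-03-15"]),
    "address_updated_on": pd.to_datetime(["2023-01-12", "2023-04-01", "2023-03-15"]),
})

# Assumption: an update within 30 days counts as timely
max_lag_days = 30

# Days between the move and the update of the stored address
moves["lag_days"] = (moves["address_updated_on"] - moves["moved_on"]).dt.days
moves["timely"] = moves["lag_days"] <= max_lag_days

timeliness = moves["timely"].mean()
print(moves[["customer_id", "lag_days", "timely"]])
print(timeliness)
```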

6. Integrity
To ensure data integrity, it’s important to maintain all of the data quality metrics above as your data moves between systems. Integrity typically breaks when the same data is stored in multiple systems.

To measure the quality of a dataset, it may be helpful to analyze its missing values, duplicated values, multicollinearity and erroneous values.



# Importing Libraries

In [154]:
!pip install pyforest
!pip install ydata_quality
!pip install termcolor
!pip install --upgrade klib



In [155]:
import pandas_profiling
import pyforest

import ipywidgets
from ipywidgets import interact

import numpy as np
import pandas as pd 

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mticker

# Importing plotly and cufflinks in offline mode
import plotly.express as px
import cufflinks as cf
import plotly.offline
import io
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

from termcolor import colored
from termcolor import cprint
from wordcloud import WordCloud

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno 
import klib

import datetime as dt
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor 
from sklearn.ensemble import ExtraTreesRegressor, AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif, f_regression, mutual_info_regression
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
# The plot_* helpers are left out: scikit-learn 1.2 removed them
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import make_scorer, precision_score, precision_recall_curve
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, KFold, cross_val_predict, train_test_split
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, PowerTransformer, LabelEncoder 
from sklearn.svm import SVR, SVC
from sklearn.tree import plot_tree, DecisionTreeClassifier
from ydata_quality import DataQuality
from ydata_quality.duplicates import DuplicateChecker
from ydata_quality.erroneous_data import ErroneousDataIdentifier

from xgboost import XGBRegressor, XGBClassifier, plot_importance

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

# Figure&Display options
plt.rcParams["figure.figsize"] = (10,6)
pd.set_option('max_colwidth',200)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Functions for Data Summary

In [156]:
def Data_Summary(df):
    print(colored("Shape:", attrs=['bold']), df.shape,'\n', 
          colored('*'*100, 'red', attrs = ['bold']),
          colored("\nInfo:\n", attrs = ['bold']), sep = '')
    print(df.info(), '\n', 
          colored('*'*100, 'red', attrs = ['bold']), sep = '')
    print(colored("Number of Uniques:\n", attrs = ['bold']), df.nunique(),'\n',
          colored('*'*100, 'red', attrs = ['bold']), sep = '')
    #print(colored("Missing Values:\n", attrs=['bold']), missing_values(df),'\n', 
    #      colored('*'*100, 'red', attrs = ['bold']), sep = '')
    print(colored("All Columns:", attrs = ['bold']), list(df.columns),'\n', 
          colored('*'*100, 'red', attrs = ['bold']), sep = '')

    # Normalize column names in place (lower case; '&' and spaces become '_'); later cells rely on these names
    df.columns = df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
    #print(colored("Columns after rename:", attrs = ['bold']), list(df.columns),'\n',
    #      colored('*'*100, 'red', attrs = ['bold']), sep = '')  
    print(colored("Descriptive Statistics \n", attrs = ['bold']), df.describe().round(2),'\n',
          colored('*'*100, 'red', attrs = ['bold']), sep = '') # Gives a statistical breakdown of the data.
    #print(colored("Descriptive Statistics (Categorical Columns) \n", attrs = ['bold']), df.describe(include = object).T,'\n',
    #     colored('*'*100, 'red', attrs = ['bold']), sep = '') # Gives a statistical breakdown of the data.

# Functions for Missing Values, Multicollinearity and Duplicated Values

In [157]:
def missing_values(df):
    # Number of missing values per column, largest first
    missing_number = df.isnull().sum().sort_values(ascending = False)
    # Fraction (0 to 1) of missing values per column; despite the name it is not scaled to 100
    missing_percent = df.isnull().mean().sort_values(ascending = False)
    summary = pd.concat([missing_number, missing_percent], axis = 1, keys = ['Missing_Number', 'Missing_Percent'])
    # Keep only the columns that have at least one missing value
    return summary[summary['Missing_Number'] > 0]

In [158]:
def multicolinearity_control(df):
  feature = []
  collinear = []
  # Compute the correlation matrix once, on numeric columns only
  corr = df.select_dtypes(include = 'number').corr()
  for col in corr.columns:
    for i in corr.index:
      # Flag strongly correlated pairs; "< 1" skips each column's correlation with itself
      if (abs(corr.loc[i, col]) > .9 and abs(corr.loc[i, col]) < 1):
        feature.append(col)
        collinear.append(i)
        print(colored(f"Multicolinearity alert in between:{col} - {i}", 
                                  "red", attrs = ['bold']), df.shape,'\n',
                                  colored('*'*100, 'red', attrs = ['bold']), sep = '')
  if len(collinear)==0:
    print("No multicollinearity: no correlation between columns is above 0.9")

def duplicate_values(df):
    # Count rows that exactly repeat an earlier row (the first occurrence is not counted)
    duplicate_values = df.duplicated(subset = None, keep = 'first').sum()
    print("There are", duplicate_values, "duplicated observations in the dataset.")
    #if duplicate_values > 0:
        #df.drop_duplicates(keep = 'first', inplace = True)
        #print(duplicate_values, colored(" Duplicates were dropped!"),'\n',
              #colored('*'*100, 'red', attrs = ['bold']), sep = '')
#     else:
#         print(colored("There are no duplicates"),'\n',
#               colored('*'*100, 'red', attrs = ['bold']), sep = '')     
        
# def drop_columns(df, drop_columns):
#     if drop_columns != []:
#         df.drop(drop_columns, axis = 1, inplace = True)
#         print(drop_columns, 'were dropped')
#     else:
#         print(colored('We will now check the missing values and if necessary, the related columns will be dropped!', attrs = ['bold']),'\n',
#               colored('*'*100, 'red', attrs = ['bold']), sep = '')
        

# Loading Dataset

In [159]:
df=pd.read_csv("/content/QualityTestData.csv")

# Data Summary

In [160]:
Data_Summary(df)

Shape:(31, 9)
****************************************************************************************************
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        31 non-null     int64  
 1   Gender     30 non-null     object 
 2   Height     30 non-null     float64
 3   Weight     29 non-null     float64
 4   Salary     31 non-null     int64  
 5   Job        29 non-null     object 
 6   Sex        31 non-null     object 
 7   Weight LB  30 non-null     float64
 8   Name       29 non-null     object 
dtypes: float64(3), int64(2), object(4)
memory usage: 2.3+ KB
None
[1m[31m****************************************************************************************************[0m
[1mNumber of Uniques:
[0mAge          12
Gender        2
Height       22
Weight       16
Salary       14
Job           5
Sex

# Missing Values, Multicollinearity and Duplicated Values

In [161]:
feature = []
collinear = []
# Compute the correlation matrix once, on numeric columns only
corr = df.select_dtypes(include = 'number').corr()
for col in corr.columns:
  for i in corr.index:
    # Flag strongly correlated pairs; "< 1" skips each column's correlation with itself
    if (abs(corr.loc[i, col]) > .9 and abs(corr.loc[i, col]) < 1):
      feature.append(col)
      collinear.append(i)
      print(colored(f"Multicolinearity alert in between:{col} - {i}", 
                                  "red", attrs = ['bold']), df.shape,'\n',
                                  colored('*'*100, 'red', attrs = ['bold']), sep = '')
if len(collinear)==0:
  print("No multicollinearity: no correlation between columns is above 0.9")

[1m[31mMulticolinearity alert in between:weight - weight_lb[0m(31, 9)
[1m[31m****************************************************************************************************[0m
[1m[31mMulticolinearity alert in between:weight_lb - weight[0m(31, 9)
[1m[31m****************************************************************************************************[0m


In [162]:
multicolinearity_control(df)

[1m[31mMulticolinearity alert in between:weight - weight_lb[0m(31, 9)
[1m[31m****************************************************************************************************[0m
[1m[31mMulticolinearity alert in between:weight_lb - weight[0m(31, 9)
[1m[31m****************************************************************************************************[0m


In [163]:
duplicate_values(df)

There are 5 duplicated observations in the dataset.
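The count above comes from `df.duplicated(keep='first')`, which marks every repeat of a row but not its first occurrence. The sketch below uses a hypothetical table (`people`) to show how the `keep` argument changes what is flagged.

```python
import pandas as pd

# Hypothetical data: the row ("Ann", 34) appears three times
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann", "Dee"],
    "age": [34, 41, 34, 34, 29],
})

# keep="first": only the repeats are flagged, not the first occurrence
n_extra_copies = people.duplicated(keep="first").sum()

# keep=False: every member of a duplicated group is flagged, useful for inspection
all_copies = people[people.duplicated(keep=False)]

# Dropping duplicates keeps one row per group
deduplicated = people.drop_duplicates(keep="first")

print(n_extra_copies)
print(all_copies)
print(len(deduplicated))
```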


In [164]:
missing_values(df)

|  | Missing_Number | Missing_Percent |
|---|---|---|
| weight | 2 | 0.065 |
| job | 2 | 0.065 |
| name | 2 | 0.065 |
| gender | 1 | 0.032 |
| height | 1 | 0.032 |
| weight_lb | 1 | 0.032 |
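These per-column figures can be combined into a single number in more than one way. The sketch below reuses the counts from the output above (31 rows, 9 columns) and compares the mean over the affected columns, which is what the KPI function later in the notebook reports, with the share of all cells that are missing. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the missing_values(df) output above
missing_number = pd.Series({"weight": 2, "job": 2, "name": 2,
                            "gender": 1, "height": 1, "weight_lb": 1})
n_rows, n_cols = 31, 9  # shape reported by Data_Summary

# Mean of the per-column missing percentages, over affected columns only
mean_of_affected_columns = (missing_number / n_rows).mean() * 100

# Missing cells as a share of every cell in the dataset
share_of_all_cells = missing_number.sum() / (n_rows * n_cols) * 100

print(round(mean_of_affected_columns, 2))
print(round(share_of_all_cells, 2))
```

The first figure ignores the columns with no missing values, so it is always at least as large as the second.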


In [165]:
dc = DuplicateChecker(df=df)

In [166]:
results = dc.evaluate()
results.keys()

INFO | No duplicate columns were found.


[38;5;11m[1mPriority 2[0m - [1musage allowed, limited human intelligibility[0m:
	[38;5;11m*[0m [1m[DUPLICATES[0m - [4mEXACT DUPLICATES][0m Found 5 instances with exact duplicate feature values.



dict_keys(['exact_duplicates', 'entity_duplicates', 'duplicate_columns'])

In [167]:
# Retrieve the warnings (under a new name, so the `warnings` module imported earlier is not overwritten)
dc_warnings = dc.get_warnings()

In [168]:
exact_duplicates_out = dc.exact_duplicates()

In [169]:
dc.duplicate_columns()

INFO | No duplicate columns were found.


In [170]:
edi = ErroneousDataIdentifier(df=df)

In [171]:
edi.evaluate()



[38;5;11m[1mPriority 2[0m - [1musage allowed, limited human intelligibility[0m:
	[38;5;11m*[0m [1m[ERRONEOUS DATA[0m - [4mPREDEFINED ERRONEOUS DATA][0m Found 19 ED values in the dataset.



{'predefined_erroneous_data':         name
 unknown   15
 ?          4}

In [172]:
edi.predefined_erroneous_data()

|  | name |
|---|---|
| unknown | 15 |
| ? | 4 |


In [173]:
(edi.predefined_erroneous_data().sum()[0]/(df.shape[0]*df.shape[1]))*100

6.810035842293908
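The percentage above is the number of flagged values (19) divided by the number of cells (31 × 9 = 279). The same ratio can be sketched with pandas alone. The placeholder set and the `people` table below are assumptions for illustration and do not reproduce how `ErroneousDataIdentifier` decides what to flag.

```python
import pandas as pd

# Assumption: these strings stand in for missing or unusable values
placeholders = {"unknown", "?", "n/a"}

# Hypothetical data with a few placeholder entries
people = pd.DataFrame({
    "name": ["Ann", "Unknown", "?", "Dee"],
    "job": ["IT", "HR", "unknown", "IT"],
    "age": [34, 41, 51, 29],
})

# Compare every cell as lower-cased text against the placeholder set
flags = people.astype(str).apply(
    lambda col: col.str.strip().str.lower().isin(placeholders)
)

erroneous_per_column = flags.sum()
# Flagged cells as a percentage of all cells (rows x columns)
erroneous_percent = flags.sum().sum() / people.size * 100

print(erroneous_per_column)
print(erroneous_percent)
```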

In [174]:
df.sample(2)

|  | age | gender | height | weight | salary | job | sex | weight_lb | name |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 46 | Male | 198.0 | 66.0 | 15000 | IT | Male | 660.0 | Unknown |
| 8 | 35 | Male | 175.821 | 77.0 | 14500 |  | Male | 770.0 | Unknown |


In [176]:
#klib.missingval_plot(df)

# DATA QUALITY FUNCTION

In [177]:
def Quality_Check(df):
  print("*****************************************DATA SUMMARY***********************************************")
  # Data_Summary and duplicate_values print their own output and return None, so call them directly
  Data_Summary(df)
  print("****************************************MISSING VALUES**********************************************")
  print(missing_values(df))
  print("***************************************DUPLICATED VALUES********************************************")
  duplicate_values(df)
  # Check duplicate columns on the df passed in rather than on the global `dc`
  print(DuplicateChecker(df=df).duplicate_columns())
  print("*************************************MULTICOLINEARITY CHECK*****************************************")
  multicolinearity_control(df)
  print("*****************************************ERRONEOUS DATA*********************************************")
  print(ErroneousDataIdentifier(df=df).predefined_erroneous_data())


In [151]:
Quality_Check(df)

*****************************************DATA SUMMARY***********************************************
Shape:(31, 9)
****************************************************************************************************
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        31 non-null     int64  
 1   gender     30 non-null     object 
 2   height     30 non-null     float64
 3   weight     29 non-null     float64
 4   salary     31 non-null     int64  
 5   job        29 non-null     object 
 6   sex        31 non-null     object 
 7   weight_lb  30 non-null     float64
 8   name       29 non-null     object 
dtypes: float64(3), int64(2), object(4)
memory usage: 2.3+ KB
None
[1m[31m****************************************************************************************************[0m
[1mNumber of Uniques:
[

# KPI Function

In [178]:
def KPI(df):
  n_rows, n_cols = df.shape
  n_cells = n_rows * n_cols
  print("**********************************NUMBER OF COLUMNS AND ROWS****************************************")
  print("There are", n_rows, "rows", n_cols, "columns and", n_cells, "entries in this dataset")
  print()
  print("****************************************MISSING VALUES**********************************************")
  # Mean missing fraction over the columns that have missing values (0 if there are none)
  missing = missing_values(df)
  missing_ratio = missing['Missing_Percent'].mean() if len(missing) > 0 else 0.0
  print("Overall percentage of missing values is %", missing_ratio*100)
  print("")
  print("***************************************DUPLICATED VALUES********************************************")
  n_duplicates = df.duplicated(subset = None, keep = 'first').sum()
  duplicate_ratio = n_duplicates/len(df)
  print("There are", n_duplicates, "duplicated values.", "Overall percentage is %", duplicate_ratio*100)
  print("")
  print("*************************************MULTICOLINEARITY CHECK*****************************************")
  multicolinearity_control(df)
  # Repeat the check locally instead of relying on the global `collinear` list
  corr = df.select_dtypes(include = 'number').corr().abs()
  has_collinearity = bool(((corr > .9) & (corr < 1)).any().any())
  print("")
  print("******************************************ERRONEOUS DATA********************************************")
  # Build the identifier from the df passed in rather than using the global `edi`
  erroneous = ErroneousDataIdentifier(df=df).predefined_erroneous_data()
  # Guard in case nothing is returned when no erroneous values are found
  n_erroneous = 0 if erroneous is None else erroneous.sum().sum()
  erroneous_ratio = n_erroneous/n_cells
  print("Overall percentage of Erroneous Data is %", erroneous_ratio*100)
  print()
  print("***************************************OVERALL DATA QUALITY*****************************************")
  # All thresholds are fractions: 5% missing, 2% duplicated, 2% erroneous
  if missing_ratio < .05 and duplicate_ratio < .02 and not has_collinearity and erroneous_ratio < .02:
    print('\033[1m'+"HIGH QUALITY DATA"+'\033[0m')
  else:
    print('\033[1m'+"LOW QUALITY DATA"+'\033[0m')

# KPI Assessment
High Quality Data Criteria

1.   Overall missing value percentage is less than 5%, and
2.   Overall duplicated value percentage is less than 2%, and
3.   No multicollinearity (no correlation between columns higher than 0.9 in absolute value), and
4.   Overall erroneous data percentage is less than 2%.
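The four criteria can be written as explicit threshold checks. The sketch below uses a hypothetical helper, `quality_checks`, and feeds it the figures reported elsewhere in this notebook (about 4.8% missing, 5 of 31 rows duplicated, one collinear pair, about 6.8% erroneous). All ratios are passed as fractions, so each threshold is a fraction too.

```python
def quality_checks(missing_ratio, duplicate_ratio, n_collinear_pairs, erroneous_ratio):
    """Return pass/fail per criterion; every ratio is a fraction between 0 and 1."""
    return {
        "missing": missing_ratio < 0.05,
        "duplicates": duplicate_ratio < 0.02,
        "multicollinearity": n_collinear_pairs == 0,
        "erroneous": erroneous_ratio < 0.02,
    }

# Figures taken from the outputs in this notebook
checks = quality_checks(
    missing_ratio=0.048,
    duplicate_ratio=5 / 31,
    n_collinear_pairs=1,
    erroneous_ratio=0.068,
)

# The dataset is high quality only if every criterion passes
verdict = "HIGH QUALITY DATA" if all(checks.values()) else "LOW QUALITY DATA"
print(checks)
print(verdict)
```

Returning the individual results makes it easy to see which criterion caused a low-quality verdict.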


In [179]:
KPI(df)

**********************************NUMBER OF COLUMNS AND ROWS****************************************
There are 31 rows 9 columns and 279 entries in this dataset

****************************************MISSING VALUES**********************************************
Overall percentage of missing values is % 4.838709677419354

***************************************DUPLICATED VALUES********************************************
There are 5 duplicated values. Overall percentage is % 16.129032258064516

*************************************MULTICOLINEARITY CHECK*****************************************
[1m[31mMulticolinearity alert in between:weight - weight_lb[0m(31, 9)
[1m[31m****************************************************************************************************[0m
[1m[31mMulticolinearity alert in between:weight_lb - weight[0m(31, 9)
[1m[31m****************************************************************************************************[0m

************************