## Introduction

This document is a step-by-step guide to running a comprehensive data quality check on a public data source.
A full check examines the completeness, accuracy, consistency, validity, uniqueness, and integrity of the data. The steps below cover missing values, duplicate rows, and outliers.

## Prerequisites

Programming Knowledge:

- Proficiency in Python and familiarity with Jupyter Notebook or an IDE.

Libraries/Packages:

- pandas: For data manipulation (e.g., handling missing data, filtering, data transformation).
- numpy: For numerical operations, handling NaN values, and array manipulations.
- scipy.stats: For computing the Z-scores used in the outlier check.
- matplotlib & seaborn: For creating visualizations to detect patterns, trends, and outliers.
- plotly.express: For interactive visualizations, especially useful when exploring large datasets.
- sklearn.impute (SimpleImputer): For handling missing values by imputing (filling) them with various strategies.

Data Cleaning Techniques:

- Handling missing data (SimpleImputer and custom imputation techniques).
- Outlier detection and handling (through visualizations).
- Data transformation techniques such as normalization or encoding, if applicable.
- Exploratory Data Analysis (EDA) for insight generation.

Basic Data Science Skills:

- Familiarity with statistics (mean, median, standard deviation) for understanding data distributions.
- Knowledge of common data issues, such as multicollinearity or categorical encoding needs.

Project Setup:

- Organize code, document steps and transformations, and save important visualizations for future reference.

## Steps


Press Ctrl+Enter to run each cell.

### Import the libraries

In [8]:
import pandas as pd
import numpy as np
from scipy import stats

## Download the dataset from Kaggle
1. Open Kaggle and navigate to https://www.kaggle.com/datasets/vikrishnan/iris-dataset
2. Click Download
3. Extract the iris.data.csv file
4. Import the iris.data.csv file into Jupyter
  

## Import the dataset
Use pandas.read_csv() and pass it the file path of your CSV file.

In [11]:
def load_data(file_path):
    """Load the dataset into a Pandas DataFrame."""
    try:
        data = pd.read_csv(file_path)
        print(f"Dataset loaded successfully. Shape: {data.shape}")
        return data
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return None
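
By default, read_csv() treats the first record of a file as the header row. If the CSV has no header row, that first record is lost as data. The sketch below shows one way to handle that case, assuming a headerless file in the Iris layout. The column names in `IRIS_COLUMNS` and the helper `load_headerless_csv` are hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical column names; a headerless CSV does not carry any itself.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

def load_headerless_csv(source, column_names):
    """Read a CSV that has no header row and assign the given column names."""
    return pd.read_csv(source, header=None, names=column_names)

# Two sample rows in the Iris layout, used here in place of the real file.
sample = io.StringIO("5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n")
iris = load_headerless_csv(sample, IRIS_COLUMNS)

print(iris.shape)          # both records are kept as data
print(list(iris.columns))
```

With a real file, the same helper could be called as load_headerless_csv('iris.data.csv', IRIS_COLUMNS).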

## Missing Values
Count the missing values in each column:

- df.isnull() creates a DataFrame of the same shape as df that contains True where the original value is null and False otherwise.
- sum() then totals the True values, giving the count of missing values for each column.

In [13]:
def check_missing_values(df):
    """Check for missing values in each column."""
    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0]
    if not missing_values.empty:
        print("\nMissing Values:")
        print(missing_values)
    else:
        print("\nNo missing values found.")
    return missing_values
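
To make the isnull() and sum() steps concrete, here is a small sketch on toy data invented for illustration. It also computes the percentage of missing values per column, an extra that the function above does not report.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

mask = toy.isnull()                   # same shape as toy, True where a value is missing
missing_count = mask.sum()            # True counts as 1, so this is a per-column count
missing_percent = mask.mean() * 100   # share of missing rows per column, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(summary[summary["missing"] > 0])  # only columns with at least one missing value
```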


## Duplicate rows
Identify duplicate rows in the dataset.

In [15]:
def check_duplicates(df):
    """Check for duplicate rows in the dataset."""
    duplicate_rows = df[df.duplicated()]
    if not duplicate_rows.empty:
        print(f"\nFound {duplicate_rows.shape[0]} duplicate rows.")
    else:
        print("\nNo duplicate rows found.")
    return duplicate_rows
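
df.duplicated() marks only the second and later copies of a row, so the count above excludes the first occurrence. The sketch below, on toy data invented for illustration, contrasts that default with keep=False and shows drop_duplicates() as one possible way to remove the repeats.

```python
import pandas as pd

# Toy data, invented for illustration only; rows 0, 2 and 4 are identical.
toy = pd.DataFrame({"x": [1, 2, 1, 3, 1], "y": ["a", "b", "a", "c", "a"]})

repeats = toy[toy.duplicated()]                # later copies only (default keep="first")
all_copies = toy[toy.duplicated(keep=False)]   # every row that has a duplicate
deduplicated = toy.drop_duplicates()           # first occurrence of each row is kept

print(len(repeats), len(all_copies), len(deduplicated))
```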

## Outlier Detection
Flag values in numeric columns whose absolute Z-score exceeds a threshold (3 by default).

In [17]:
def detect_outliers(df, threshold=3):
    """Detect outliers in numerical columns using Z-score."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    outliers = {}
    
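    for col in numeric_cols:
        # nan_policy="omit" keeps a missing value from turning every Z-score into NaN
        z_scores = np.abs(stats.zscore(df[col], nan_policy="omit"))
        # Keep only the values whose absolute Z-score exceeds the threshold
        outliers[col] = df[col][z_scores > threshold]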
    
    for col, outlier_values in outliers.items():
        if not outlier_values.empty:
            print(f"\nOutliers detected in column {col}:")
            print(outlier_values)
        else:
            print(f"\nNo outliers found in column {col}.")
    return outliers
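
The Z-score of a value is its distance from the column mean, measured in standard deviations. As a rough illustration of what stats.zscore computes with its default settings (population standard deviation, ddof=0), the sketch below reproduces the calculation with numpy alone on values invented for the example.

```python
import numpy as np

# Invented values: twenty identical readings and one extreme reading.
values = np.array([10.0] * 20 + [100.0])

mean = values.mean()
std = values.std()                        # population standard deviation (ddof=0)
z_scores = np.abs((values - mean) / std)  # distance from the mean in standard deviations

threshold = 3
flagged = values[z_scores > threshold]
print(flagged)
```

With ddof=0 the largest possible absolute Z-score in a sample of n values is sqrt(n - 1), so a threshold of 3 can never flag anything in a column with fewer than 11 values.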

## Generate report


In [19]:
def generate_report(missing_values, duplicate_rows, outliers):
    """Generate a summary report of the data quality checks."""
    with open("data_quality_report.txt", "w") as report_file:
        report_file.write("Data Quality Check Report\n")
        report_file.write("=========================\n\n")
        
        report_file.write("Missing Values:\n")
        if not missing_values.empty:
            report_file.write(f"{missing_values}\n\n")
        else:
            report_file.write("No missing values found.\n\n")
        
        report_file.write("Duplicate Rows:\n")
        if not duplicate_rows.empty:
            report_file.write(f"Found {duplicate_rows.shape[0]} duplicate rows.\n\n")
        else:
            report_file.write("No duplicate rows found.\n\n")
        
        report_file.write("Outliers:\n")
        for col, outlier_values in outliers.items():
            if not outlier_values.empty:
                report_file.write(f"Outliers in column {col}:\n")
                report_file.write(f"{outlier_values}\n\n")
            else:
                report_file.write(f"No outliers found in column {col}.\n\n")

        print("\nSummary report generated: data_quality_report.txt")
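
Writing a Series through an f-string, as the report does, uses its printed representation, which ends with a dtype footer line. If a cleaner report is preferred, Series.to_string() is one option. The sketch below compares the two on counts invented for illustration.

```python
import pandas as pd

# Invented missing-value counts, for illustration only.
counts = pd.Series({"sepal_length": 2, "species": 1})

as_repr = f"{counts}"         # what the report writes: includes a "dtype: int64" footer
as_text = counts.to_string()  # index and values only, no footer

print(as_repr)
print("---")
print(as_text)
```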


## Run the checks

In [20]:
if __name__ == "__main__":
    # Load dataset
    dataset_path = 'iris.data.csv'
    df = load_data(dataset_path)
    
    if df is not None:
        # Perform data quality checks
        missing_values = check_missing_values(df)
        duplicate_rows = check_duplicates(df)
        outliers = detect_outliers(df)

        # Generate the report
        generate_report(missing_values, duplicate_rows, outliers)

Dataset loaded successfully. Shape: (149, 5)

No missing values found.

Found 3 duplicate rows.

No outliers found in column 5.1.

Outliers detected in column 3.5:
14    4.4
Name: 3.5, dtype: float64

No outliers found in column 1.4.

No outliers found in column 0.2.

Summary report generated: data_quality_report.txt

Note: the column names 5.1, 3.5, 1.4, and 0.2 in this output are values from the first data row. This suggests that iris.data.csv has no header row, so pd.read_csv() used the first record as the header, which would also explain why the shape is (149, 5) rather than 150 rows. Passing header=None together with a names list to pd.read_csv() keeps that first record as data.
