## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

For this exercise, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.
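
Before running any checks, it can help to confirm that the loaded file actually has the expected columns. The sketch below is only an illustration: `csv_text` is a hypothetical in-memory stand-in for `students.csv` (not its real contents), and `EXPECTED_COLUMNS` and `missing_columns` are illustrative names that do not appear in the notebook.

```python
import io
import pandas as pd

# Column names taken from the dataset description above.
EXPECTED_COLUMNS = ["ID", "Name", "Age", "Grade", "Email"]

# Hypothetical stand-in for students.csv (one made-up row).
csv_text = "ID,Name,Age,Grade,Email\n1,Ann,20,88,ann@example.com\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Any expected column that is absent from the loaded frame.
missing_columns = [c for c in EXPECTED_COLUMNS if c not in sample.columns]
print("Missing columns:", missing_columns)
```

An empty list means every expected column is present and the later checks can index them safely.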

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column-Specific Missing Value Check
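
The three completeness steps listed above can be sketched with pandas' built-in missing-value helpers. This is a minimal illustration, assuming a small hypothetical DataFrame (`sample`, made-up rows that only share the column names of `students.csv`); `missing_per_column`, `rows_with_gaps`, and `missing_email` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same columns as students.csv (not the real data).
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Ann", "Ben", "Cal"],
    "Age": [20.0, np.nan, 22.0],
    "Grade": [88, 92, 79],
    "Email": ["ann@example.com", "ben@example.com", None],
})

# Identify missing values: number of nulls in each column.
missing_per_column = sample.isnull().sum()

# Rows with missing data: keep rows that have a null in any column.
rows_with_gaps = sample[sample.isnull().any(axis=1)]

# Column-specific check: rows where Email is missing.
missing_email = sample[sample["Email"].isnull()]

print(missing_per_column)
print(rows_with_gaps)
print(missing_email)
```

Note that `isnull()` only flags true nulls (`NaN`, `None`); blank or whitespace-only strings need a separate comparison.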

In [2]:
import pandas as pd
import re

def is_grade_valid(grade):
    """
    Check if grade is a number between 0 and 100 inclusive.
    Return: bool
    """
    if pd.isna(grade):
        return False
    try:
        val = float(grade)
        return 0 <= val <= 100
    except (ValueError, TypeError):
        return False

def is_email_valid(email):
    """
    Validate email format using regex.
    Return: bool
    """
    if not isinstance(email, str) or not email.strip():
        return False
    email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return bool(re.fullmatch(email_pattern, email.strip()))

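def is_age_valid(age):
    """
    Check if age is a non-negative integer.
    Whole-number floats (e.g. 19.0) are accepted, because pandas stores
    a numeric column that contains NaN as float.
    Return: bool
    """
    if pd.isna(age):
        return False
    # bool is a subclass of int, so reject True/False explicitly
    if isinstance(age, bool):
        return False
    if isinstance(age, float) and age.is_integer():
        age = int(age)
    if not isinstance(age, int):
        return False
    return age >= 0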

def check_completeness(df, mandatory_fields):
    """
    Return rows where any mandatory field is null or blank (whitespace only).
    """
    subset = df[mandatory_fields]
    # Blank check: compare the stripped string form of each value to ''
    is_blank = subset.astype(str).apply(lambda col: col.str.strip()) == ''
    missing_mask = subset.isnull() | is_blank
    return df[missing_mask.any(axis=1)]

# Usage Example

df = pd.read_csv('students.csv')

df['Grade_Valid'] = df['Grade'].apply(is_grade_valid)
df['Email_Valid'] = df['Email'].apply(is_email_valid)
df['Age_Valid'] = df['Age'].apply(is_age_valid)

mandatory = ['ID', 'Name', 'Age', 'Grade', 'Email']
rows_with_missing = check_completeness(df, mandatory)

print("Invalid Grade rows:\n", df.loc[~df['Grade_Valid']])
print("Invalid Email rows:\n", df.loc[~df['Email_Valid']])
print("Invalid Age rows:\n", df.loc[~df['Age_Valid']])
print("Rows with missing mandatory fields:\n", rows_with_missing)


Invalid Grade rows:
    ID   Name   Age Grade              Email  Grade_Valid  Email_Valid  \
3   4  Diana  19.0   105      diana@example        False        False   
6   7  Grace  23.0   abc  grace@example.com        False         True   

   Age_Valid  
3       True  
6       True  
Invalid Email rows:
    ID   Name   Age Grade            Email  Grade_Valid  Email_Valid  Age_Valid
1   2    Bob  22.0    90  bob.example.com         True        False       True
3   4  Diana  19.0   105    diana@example        False        False       True
7   8   Hank  25.0    80              NaN         True        False       True
Invalid Age rows:
    ID     Name  Age Grade                Email  Grade_Valid  Email_Valid  \
2   3  Charlie  NaN    75  charlie@example.com         True         True   
5   6    Faith -1.0    95    faith@example.com         True         True   

   Age_Valid  
2      False  
5      False  
Rows with missing mandatory fields:
    ID     Name   Age Grade                Email