## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

For this, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.
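The contents of `students.csv` are not shown here, so the following is only a minimal sketch of a stand-in file for trying the checks. Every row is invented for illustration, and the file name `students_sample.csv` and the temporary directory are assumptions, chosen so that a real `students.csv` is never overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows, invented purely for illustration; the real file may differ.
sample = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Ann", "Ben", "Cara", "Dev"],
        "Age": [20, None, 22.5, 19],      # one missing value, one non-integer
        "Grade": [88, 105, 75, "A+"],     # one out of range, one non-numeric
        "Email": ["ann@example.com", "ben[at]example.com", None, "dev@example.org"],
    }
)

# Write to a temporary directory (an assumption made to keep the sketch side-effect free).
out_path = Path(tempfile.mkdtemp()) / "students_sample.csv"
sample.to_csv(out_path, index=False)

# Read the file back the same way the exercise loads its data.
reloaded = pd.read_csv(out_path)
print(reloaded.dtypes)
print(reloaded.isnull().sum())
```

To run the exercise code against this stand-in, pass `out_path` to `pd.read_csv` in place of `'students.csv'`.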

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column Specific Missing Value Check

In [1]:
# Write your code from here
import pandas as pd
import re

# Load the dataset
df = pd.read_csv('students.csv')

# --- 1. Check Accuracy ---

# a) Verify Numerical Data Accuracy for 'Grade' (assuming grade should be between 0 and 100)
def check_grade_accuracy(grade):
    try:
        val = float(grade)
        return 0 <= val <= 100
    except (TypeError, ValueError):  # non-numeric or missing grade
        return False

df['Grade_Accurate'] = df['Grade'].apply(check_grade_accuracy)

# b) Validate Email Format using regex
email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'

def validate_email(email):
    # Missing (NaN) or non-string values cannot be valid email addresses
    if not isinstance(email, str):
        return False
    return bool(re.fullmatch(email_pattern, email.strip()))

df['Email_Valid'] = df['Email'].apply(validate_email)

# c) Integer Accuracy Check for Age (should be integer and non-negative)
def age_check(age):
    if pd.isna(age):
        return False
    # A column containing NaN is stored as float, so accept whole-number floats such as 25.0
    if isinstance(age, float) and age.is_integer():
        age = int(age)
    return isinstance(age, int) and age >= 0

df['Age_Valid'] = df['Age'].apply(age_check)

# --- 2. Check Completeness ---

# a) Identify Missing Values (any null in the row)
df['Has_Missing'] = df.isnull().any(axis=1)

# b) Rows with Missing Data
rows_with_missing = df[df['Has_Missing']]

# c) Column Specific Missing Value Check
missing_per_column = df.isnull().sum()

# --- Summary Reports ---

print("Accuracy Checks Summary:")
print(df[['Grade_Accurate', 'Email_Valid', 'Age_Valid']].describe())

print("\nRows with Missing Data:")
print(rows_with_missing)

print("\nMissing Values per Column:")
print(missing_per_column)


Accuracy Checks Summary:
       Grade_Accurate Email_Valid Age_Valid
count               9           9         9
unique              2           2         2
top              True        True      True
freq                7           6         7

Rows with Missing Data:
   ID     Name   Age Grade                Email  Grade_Accurate  Email_Valid  \
2   3  Charlie   NaN    75  charlie@example.com            True         True   
7   8     Hank  25.0    80                  NaN            True        False   

   Age_Valid  Has_Missing  
2      False         True  
7       True         True  

Missing Values per Column:
ID                0
Name              0
Age               1
Grade             0
Email             1
Grade_Accurate    0
Email_Valid       0
Age_Valid         0
Has_Missing       0
dtype: int64
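
The per-column report above also lists the helper columns (`Grade_Accurate`, `Email_Valid`, `Age_Valid`, `Has_Missing`), which are always complete by construction. One possible refinement, sketched below on a small hypothetical frame, restricts the report to the columns that came from the file and adds a missing-value percentage. The names `source_columns` and `completeness` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame standing in for df after helper columns have been added.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Name": ["Ann", "Ben", "Cara"],
        "Age": [20.0, None, 22.0],
        "Grade": [88, 75, 91],
        "Email": ["ann@example.com", "ben@example.com", None],
        "Age_Valid": [True, False, True],
        "Has_Missing": [False, True, True],
    }
)

# Columns that came from the file, as listed in the exercise description.
source_columns = ["ID", "Name", "Age", "Grade", "Email"]

# Count of missing cells per source column.
missing_counts = df[source_columns].isnull().sum()

# isnull().mean() is the fraction of missing cells; scale it to a percentage.
missing_percent = (df[source_columns].isnull().mean() * 100).round(1)

completeness = pd.DataFrame({"missing": missing_counts, "missing_%": missing_percent})
print(completeness)
```

Reporting a percentage alongside the raw count makes columns comparable across datasets of different sizes.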
