## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

For this, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column-Specific Missing Value Check
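
The code cell in this section builds its sample data in memory. As a rough sketch of starting from a file instead, the snippet below reads CSV text with `pd.read_csv` and confirms that the five expected columns are present. The inline `csv_text` is a hypothetical stand-in for the contents of `students.csv`, and the names `students`, `expected_columns`, and `missing_columns` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of students.csv (assumed to have a header row).
csv_text = """ID,Name,Age,Grade,Email
101,Alice,20,88.5,alice@example.com
102,Bob,19,92.0,bob_at_example.com
103,Charlie,21,85.0,charlie@example.com
104,David,,90.0,david@example
105,Eva,22,,
"""

# With a real file, pass the path instead: pd.read_csv("students.csv")
students = pd.read_csv(io.StringIO(csv_text))

# Confirm the expected columns exist before running any quality checks.
expected_columns = ["ID", "Name", "Age", "Grade", "Email"]
missing_columns = [c for c in expected_columns if c not in students.columns]
print("Missing columns:", missing_columns)

# Empty CSV fields are read as NaN, so Age becomes a float column here.
print(students.dtypes)
```

Checking the column names first keeps the later checks from failing with a `KeyError` on a file that does not match the expected layout.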

In [1]:
import pandas as pd
import re

# Sample data creation (replace this with loading your actual CSV)
data = {
    'ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [20, 19, 21, None, 22],
    'Grade': [88.5, 92.0, 85.0, 90.0, None],
    'Email': [
        'alice@example.com',
        'bob_at_example.com',    # invalid email
        'charlie@example.com',
        'david@example',          # invalid email
        None                     # missing email
    ]
}

df = pd.DataFrame(data)

# 1. Accuracy Checks

## 1a. Verify Numerical Data Accuracy (Age should be between 0 and 120, Grade between 0 and 100)
age_valid = df['Age'].between(0, 120, inclusive='both') | df['Age'].isna()
grade_valid = df['Grade'].between(0, 100, inclusive='both') | df['Grade'].isna()

print("Age valid:\n", age_valid)
print("Grade valid:\n", grade_valid)

## 1b. Validate Email Format with regex
email_pattern = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')

def validate_email(email):
    if pd.isna(email):
        return False
    return bool(email_pattern.match(email))

df['Email_Valid'] = df['Email'].apply(validate_email)
print("\nEmail validity:\n", df[['Email', 'Email_Valid']])

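## 1c. Integer Accuracy Check for Age (Age should be a whole number if present)
# Age is stored as float because of the missing value, so test for a zero fractional part.
# This also works if Age is an integer column, where float.is_integer would raise a TypeError.
age_is_integer = df['Age'].dropna() % 1 == 0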
print("\nAge integer check (excluding missing values):\n", age_is_integer)

# 2. Completeness Checks

## 2a. Identify Missing Values (overall)
missing_values = df.isnull().sum()
print("\nMissing values per column:\n", missing_values)

## 2b. Rows with missing data
rows_with_missing = df[df.isnull().any(axis=1)]
print("\nRows with missing data:\n", rows_with_missing)

## 2c. Column-specific missing value check (example: Email)
missing_email = df['Email'].isnull()
print("\nRows with missing Email:\n", df[missing_email])


Age valid:
 0    True
1    True
2    True
3    True
4    True
Name: Age, dtype: bool
Grade valid:
 0    True
1    True
2    True
3    True
4    True
Name: Grade, dtype: bool

Email validity:
                  Email  Email_Valid
0    alice@example.com         True
1   bob_at_example.com        False
2  charlie@example.com         True
3        david@example        False
4                 None        False

Age integer check (excluding missing values):
 0    True
1    True
2    True
4    True
Name: Age, dtype: bool

Missing values per column:
 ID             0
Name           0
Age            1
Grade          1
Email          1
Email_Valid    0
dtype: int64

Rows with missing data:
     ID   Name   Age  Grade          Email  Email_Valid
3  104  David   NaN   90.0  david@example        False
4  105    Eva  22.0    NaN           None        False

Rows with missing Email:
     ID Name   Age  Grade Email  Email_Valid
4  105  Eva  22.0    NaN  None        False
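
In the output above, `Email_Valid` is `False` both for malformed addresses and for the missing one, so accuracy and completeness problems are mixed in a single column. The sketch below shows one possible way to keep them apart: a per-column completeness percentage, plus a filter for emails that are present but malformed. It rebuilds the same sample data under the illustrative name `df_check`; `completeness_pct` and `invalid_not_missing` are likewise assumed names, not part of the original cell.

```python
import pandas as pd

# Same sample data as the cell above, without the derived Email_Valid column.
df_check = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [20, 19, 21, None, 22],
    'Grade': [88.5, 92.0, 85.0, 90.0, None],
    'Email': ['alice@example.com', 'bob_at_example.com',
              'charlie@example.com', 'david@example', None],
})

# Completeness: share of non-missing values per column, as a percentage.
completeness_pct = (df_check.notna().mean() * 100).round(1)
print("Completeness (%):\n", completeness_pct)

# Accuracy: emails that are present but do not match the pattern.
email_regex = r'^[\w\.-]+@[\w\.-]+\.\w+$'
email_present = df_check['Email'].notna()
email_ok = df_check['Email'].str.match(email_regex, na=False).astype(bool)
invalid_not_missing = df_check[email_present & ~email_ok]
print("\nPresent but invalid emails:\n", invalid_not_missing[['ID', 'Email']])
```

Reporting the two cases separately makes it clearer whether a row needs a correction (malformed value) or a follow-up to collect the data (missing value).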

