## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

For this exercise, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.
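
The code cell in this exercise builds the data in memory rather than reading the file. If `students.csv` is available, loading it might look like the sketch below. The file contents are simulated with `io.StringIO` so the example runs on its own. The sample rows, the `EXPECTED_COLUMNS` list, and the `load_students` helper name are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of students.csv (illustrative rows only)
SAMPLE_CSV = """ID,Name,Age,Grade,Email
1,Alice,20,A,alice@example.com
2,Bob,150,B,bob@example
3,Charlie,19,C,charlie@example.com
4,David,,A,
5,Eve,25,B,eve@example.com
"""

EXPECTED_COLUMNS = ["ID", "Name", "Age", "Grade", "Email"]

def load_students(source):
    """Read the student data and confirm the expected columns are present."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df

# With a real file, pass the path instead: load_students("students.csv")
students = load_students(io.StringIO(SAMPLE_CSV))
print(students.dtypes)
```

Empty CSV fields are read as `NaN`, so the completeness checks in this exercise apply to a loaded file in the same way as to the in-memory data.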

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column Specific Missing Value Check
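
The steps listed above can also be sketched with vectorized pandas operations instead of row-wise `apply`. This is a compact alternative under the same assumptions as the exercise (ages between 5 and 100, a simple illustrative email pattern that is not a full address validator). The `quality_report` function name and the small `demo` frame are hypothetical.

```python
import pandas as pd

def quality_report(df):
    """Return a frame of boolean accuracy flags plus per-column missing counts."""
    age = pd.to_numeric(df["Age"], errors="coerce")
    flags = pd.DataFrame(index=df.index)
    # between() is inclusive on both ends by default; NaN yields False
    flags["Age_Valid"] = age.between(5, 100)
    # Whole-number check: NaN % 1 is NaN, and NaN == 0 is False
    flags["Age_Integer"] = (age % 1) == 0
    # na=False marks missing emails as invalid instead of leaving NaN
    flags["Email_Valid"] = df["Email"].str.fullmatch(r"[\w.-]+@[\w.-]+\.\w+", na=False)
    missing_counts = df.isnull().sum()
    return flags, missing_counts

# Small illustrative frame mirroring the exercise's columns
demo = pd.DataFrame({
    "Age": [20, 150, 19, None, 25],
    "Email": ["alice@example.com", "bob@example", "charlie@example.com",
              None, "eve@example.com"],
})
flags, missing_counts = quality_report(demo)
print(flags)
print(missing_counts)
```

Vectorized checks avoid a Python-level function call per row, which tends to matter on larger datasets. The row-wise functions in the cell below are easier to read step by step.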

In [2]:
import pandas as pd
import re

# --------- Sample Data Creation (Simulating students.csv) ---------
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [20, 150, 19, None, 25],  # 150 is invalid, None is missing
    'Grade': ['A', 'B', 'C', 'A', 'B'],
    'Email': ['alice@example.com', 'bob@example', 'charlie@example.com', None, 'eve@example.com']
}

df = pd.DataFrame(data)

print("=== Data Overview ===")
print(df)

# ----------- 1. Accuracy Checks -----------

# 1a. Verify Numerical Data Accuracy (Age should be between 5 and 100)
def check_age_accuracy(age):
    return 5 <= age <= 100 if pd.notnull(age) else False

df['Age_Valid'] = df['Age'].apply(check_age_accuracy)

print("\nAge accuracy check (True = valid, False = invalid or missing):")
print(df[['Age', 'Age_Valid']])

# 1b. Validate Email Format using regex
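# Simple illustrative pattern (local part, "@", domain, ".", final label);
# it is not a full email address validator.
email_pattern = r'[\w.-]+@[\w.-]+\.\w+'

def is_email_valid(email):
    # Missing or non-string values count as invalid; fullmatch requires the
    # whole string to match, so a trailing newline is rejected as well.
    return isinstance(email, str) and bool(re.fullmatch(email_pattern, email))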

df['Email_Valid'] = df['Email'].apply(is_email_valid)

print("\nEmail format validation (True = valid, False = invalid or missing):")
print(df[['Email', 'Email_Valid']])

# 1c. Integer Accuracy Check for Age
def is_integer(value):
    if pd.isnull(value):
        return False
    return isinstance(value, int) or (isinstance(value, float) and value.is_integer())

df['Age_Integer'] = df['Age'].apply(is_integer)

print("\nAge integer check (True = integer, False = non-integer or missing):")
print(df[['Age', 'Age_Integer']])


# ----------- 2. Completeness Checks -----------

# 2a. Identify Missing Values per column
missing_per_column = df.isnull().sum()
print("\nMissing values per column:")
print(missing_per_column)

# 2b. Rows with any missing data
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"\nRows with missing data (Total: {len(rows_with_missing)}):")
print(rows_with_missing)

# 2c. Column-specific missing value indices
print("\nIndices of missing values per column:")
for col in df.columns:
    missing_indices = df.index[df[col].isnull()].tolist()
    if missing_indices:
        print(f"{col}: {missing_indices}")

=== Data Overview ===
   ID     Name    Age Grade                Email
0   1    Alice   20.0     A    alice@example.com
1   2      Bob  150.0     B          bob@example
2   3  Charlie   19.0     C  charlie@example.com
3   4    David    NaN     A                 None
4   5      Eve   25.0     B      eve@example.com

Age accuracy check (True = valid, False = invalid or missing):
     Age  Age_Valid
0   20.0       True
1  150.0      False
2   19.0       True
3    NaN      False
4   25.0       True

Email format validation (True = valid, False = invalid or missing):
                 Email  Email_Valid
0    alice@example.com         True
1          bob@example        False
2  charlie@example.com         True
3                 None        False
4      eve@example.com         True

Age integer check (True = integer, False = non-integer or missing):
     Age  Age_Integer
0   20.0         True
1  150.0         True
2   19.0         True
3    NaN        False
4   25.0         True

Missing value