## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

This exercise uses a sample dataset, students.csv, with the following
columns: ID, Name, Age, Grade, Email.
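
The code cell in this section builds the sample data inline. If the data lives in a file instead, it could be loaded with `pd.read_csv`. The following is a minimal sketch, assuming a CSV with the five columns listed above. The file contents and the names `csv_text`, `students`, and `expected_columns` are hypothetical, and the CSV is read from an in-memory string so the example is self-contained:

```python
import io
import pandas as pd

# Hypothetical file contents; with a real file, pass its path to pd.read_csv instead.
csv_text = """ID,Name,Age,Grade,Email
101,Alice,20,85,alice@example.com
102,Bob,21,,bob@example
"""

students = pd.read_csv(io.StringIO(csv_text))

# Confirm that the expected columns are present before running any checks.
expected_columns = ['ID', 'Name', 'Age', 'Grade', 'Email']
missing_columns = [c for c in expected_columns if c not in students.columns]

print(missing_columns)
print(students.dtypes)
```

Checking the column names first makes a misnamed or missing column fail early, rather than surfacing later as a `KeyError` inside one of the checks.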

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column Specific Missing Value Check
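
The steps above can also be sketched in vectorized form, as an alternative to row-by-row `apply` calls. This is an illustrative sketch on a small hypothetical frame (`df` and its values are assumptions, not the tutorial's dataset). Note one behavioral difference: `pd.to_numeric` accepts a numeric string such as `'22'`, whereas a strict `isinstance(x, int)` type check flags it as invalid.

```python
import pandas as pd

# Small illustrative frame (hypothetical values).
df = pd.DataFrame({
    'Age': [20, '22', 200, None],
    'Email': ['a@example.com', 'b@example', None, 'd@sample.com'],
})

# Accuracy: coerce Age to numbers; values that cannot be parsed become NaN.
age_numeric = pd.to_numeric(df['Age'], errors='coerce')
# NaN compares as False, so missing or unparseable ages are marked invalid.
age_in_range = age_numeric.between(0, 120)

# Accuracy: the regex must match the whole string; missing emails count as invalid.
email_ok = df['Email'].str.fullmatch(r'[\w.-]+@[\w.-]+\.\w+', na=False)

# Completeness: number of missing values per column.
missing_counts = df.isnull().sum()

print(age_in_range.tolist())
print(email_ok.tolist())
print(missing_counts.to_dict())
```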

In [1]:
import pandas as pd
import re

# Sample students data
data = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [20, 21, '22', 19, 200],   # Note: Age has a string and an invalid value 200
    'Grade': [85, 90, 88, None, 95],
    'Email': ['alice@example.com', 'bob@example', 'charlie@example.com', 'david@sample.com', 'eve@sample.com']
})

# 1. Check Accuracy
# a. Verify Numerical Data Accuracy - Age should be an integer between 0 and 120
def is_valid_age(x):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(x, int) and not isinstance(x, bool) and 0 <= x <= 120

# is_valid_age already performs the type check, so no extra guard is needed here
data['Age_Valid'] = data['Age'].apply(is_valid_age)

# b. Validate Email Format
email_pattern = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')
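# Guard against missing (None/NaN) emails, which would make match() raise a TypeError
data['Email_Valid'] = data['Email'].apply(lambda x: isinstance(x, str) and bool(email_pattern.match(x)))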

# c. Integer Accuracy Check for Age (should be integer type)
data['Age_Is_Integer'] = data['Age'].apply(lambda x: isinstance(x, int))

# 2. Check Completeness
# a. Identify Missing Values
missing_per_column = data.isnull().mean() * 100  # percentage missing per column

# b. Rows with Missing Data (any missing)
rows_with_missing = data[data.isnull().any(axis=1)]

# c. Column Specific Missing Value Check - e.g., Email missing
missing_email_count = data['Email'].isnull().sum()

# Output
print("Data with validation columns:")
print(data)
print("\nMissing Data % per column:")
print(missing_per_column)
print("\nRows with any missing data:")
print(rows_with_missing)
print(f"\nNumber of missing emails: {missing_email_count}")


Data with validation columns:
    ID     Name  Age  Grade                Email  Age_Valid  Email_Valid  \
0  101    Alice   20   85.0    alice@example.com       True         True   
1  102      Bob   21   90.0          bob@example       True        False   
2  103  Charlie   22   88.0  charlie@example.com      False         True   
3  104    David   19    NaN     david@sample.com       True         True   
4  105     None  200   95.0       eve@sample.com      False         True   

   Age_Is_Integer  
0            True  
1            True  
2           False  
3            True  
4            True  

Missing Data % per column:
ID                 0.0
Name              20.0
Age                0.0
Grade             20.0
Email              0.0
Age_Valid          0.0
Email_Valid        0.0
Age_Is_Integer     0.0
dtype: float64

Rows with any missing data:
    ID   Name  Age  Grade             Email  Age_Valid  Email_Valid  \
3  104  David   19    NaN  david@sample.com       True         Tru