## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

For this, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column Specific Missing Value Check

In [1]:
# Build a small sample dataset with deliberate quality problems
import pandas as pd

# Sample data dictionary
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Gina", "Harry", "Ivy", "Jack"],
    "Age": [20, 22, 19, "abc", 21, 17, 24, 26, 20, 21],
    "Grade": [85, 90, 105, 75, 88, 92, 79, 65, None, 95],
    "Email": [
        "alice@example.com",
        "bob_at_example.com",
        "charlie@example.com",
        "david@example.com",
        None,
        "frank@example",
        "gina@example.com",
        "harry@example.com",
        "ivy@example.com",
        "jack@example.com"
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV locally
df.to_csv("students.csv", index=False)

print("students.csv created successfully!")


students.csv created successfully!


In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('students.csv')

# Show first few rows
print(df.head())


   ID     Name  Age  Grade                Email
0   1    Alice   20   85.0    alice@example.com
1   2      Bob   22   90.0   bob_at_example.com
2   3  Charlie   19  105.0  charlie@example.com
3   4    David  abc   75.0    david@example.com
4   5      Eva   21   88.0                  NaN
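
Before running the accuracy checks, it helps to confirm how pandas typed each column, since a single bad value such as `abc` keeps the whole `Age` column from being parsed as numbers. The following is a minimal sketch on an in-memory CSV; the names `csv_text` and `sample` are illustrative and not part of the exercise.

```python
import io

import pandas as pd

# Tiny in-memory CSV with one non-numeric Age value (illustrative data)
csv_text = "ID,Age\n1,20\n2,abc\n3,21\n"
sample = pd.read_csv(io.StringIO(csv_text))

# ID is parsed as an integer column; Age is not numeric because of "abc"
print(sample.dtypes)

# Every Age value, including "20" and "21", is stored as a Python string
age_value_types = sample["Age"].map(type).unique()
print(age_value_types)

age_is_numeric = pd.api.types.is_numeric_dtype(sample["Age"])
print("Age parsed as numeric:", age_is_numeric)
```

Because every value in such a column is a string, type-based tests like `isinstance(x, int)` will not match any row, so later checks should work on a numeric conversion instead.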


In [4]:
import pandas as pd

# Reload the dataset so this cell can run on its own
df = pd.read_csv('students.csv')

# Convert Age to numeric, invalid parsing becomes NaN
df['Age_numeric'] = pd.to_numeric(df['Age'], errors='coerce')

# Find rows where Age is not numeric
invalid_age = df[df['Age_numeric'].isna()]
print("Rows with non-numeric Age:\n", invalid_age)

# Check Age range (5 to 25)
age_out_of_range = df[(df['Age_numeric'] < 5) | (df['Age_numeric'] > 25)]
print("Rows with Age out of expected range (5-25):\n", age_out_of_range)


Rows with non-numeric Age:
    ID   Name  Age  Grade              Email  Age_numeric
3   4  David  abc   75.0  david@example.com          NaN
Rows with Age out of expected range (5-25):
    ID   Name Age  Grade              Email  Age_numeric
7   8  Harry  26   65.0  harry@example.com         26.0
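
The same range idea can be applied to `Grade`, which contains the value 105 in the sample data. This is a sketch only: the valid range of 0 to 100 is an assumption (the exercise does not state one), and `grades`, `GRADE_MIN`, and `GRADE_MAX` are illustrative names.

```python
import pandas as pd

# Same ID and Grade values as the sample dataset; None becomes NaN
grades = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Grade": [85, 90, 105, 75, 88, 92, 79, 65, None, 95],
})

# Assumed valid range (not given in the exercise): 0 to 100 inclusive
GRADE_MIN, GRADE_MAX = 0, 100

# between() is False for NaN, so require notna() to avoid reporting
# missing grades as out of range; they belong to the completeness checks
in_range = grades["Grade"].between(GRADE_MIN, GRADE_MAX)
grade_out_of_range = grades[grades["Grade"].notna() & ~in_range]

print("Rows with Grade outside the assumed 0-100 range:\n", grade_out_of_range)
```

Keeping the missing grade out of this result means accuracy problems and completeness problems are reported separately.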


In [5]:
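import re

# Basic pattern: local part, "@", domain, a dot, and a top-level domain
email_pattern = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')

# astype(str) turns a missing email (NaN) into the string "nan", which fails the
# pattern, so missing emails are reported here together with malformed ones
invalid_emails = df[~df['Email'].astype(str).apply(lambda x: bool(email_pattern.match(x)))]
print("Rows with invalid emails:\n", invalid_emails)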


Rows with invalid emails:
    ID   Name Age  Grade               Email  Age_numeric
1   2    Bob  22   90.0  bob_at_example.com         22.0
4   5    Eva  21   88.0                 NaN         21.0
5   6  Frank  17   92.0       frank@example         17.0
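
The result above mixes a missing email (Eva) with malformed ones (Bob, Frank). One possible way to report them separately is sketched below, using the same pattern with `Series.str.match`; the frame `emails` and the variable names are illustrative, not part of the exercise.

```python
import pandas as pd

# A few rows from the sample data, including one missing email
emails = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eva", "Frank"],
    "Email": ["alice@example.com", "bob_at_example.com", None, "frank@example"],
})

# Same basic pattern as the cell above
email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'

is_missing = emails["Email"].isna()
# na=False makes missing values evaluate to "no match" instead of NaN
is_match = emails["Email"].str.match(email_pattern, na=False)

# Malformed: a value is present but does not match the pattern
malformed_emails = emails[~is_missing & ~is_match]
missing_emails_only = emails[is_missing]

print("Malformed emails:\n", malformed_emails)
print("Missing emails:\n", missing_emails_only)
```

Splitting the two cases keeps the accuracy check (wrong format) distinct from the completeness check (no value at all).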


In [6]:
# Age is read from the CSV as strings, so isinstance(x, int) would flag every row.
# Use the numeric conversion instead: flag values that are non-numeric (NaN)
# or that have a fractional part.
non_integer_ages = df[df['Age_numeric'].isna() | (df['Age_numeric'] % 1 != 0)]
print("Rows where Age is not an integer:\n", non_integer_ages)


Rows where Age is not an integer:
    ID   Name  Age  Grade              Email  Age_numeric
3   4  David  abc   75.0  david@example.com          NaN


In [7]:
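# Count missing values per column. Age shows 0 because "abc" is a string, not a
# missing value; the helper column Age_numeric shows 1 for that same row.
missing_values = df.isnull().sum()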
print("Missing values per column:\n", missing_values)


Missing values per column:
 ID             0
Name           0
Age            0
Grade          1
Email          1
Age_numeric    1
dtype: int64
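
Raw counts are easier to compare across datasets when expressed as a share of rows. The sketch below computes a per-column missing percentage on a copy of the sample columns; `df_demo` and `completeness` are illustrative names, and the helper column `Age_numeric` is left out so that only the original columns are measured.

```python
import pandas as pd

# Subset of the sample dataset's columns (original columns only)
df_demo = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Grade": [85, 90, 105, 75, 88, 92, 79, 65, None, 95],
    "Email": [
        "alice@example.com", "bob_at_example.com", "charlie@example.com",
        "david@example.com", None, "frank@example", "gina@example.com",
        "harry@example.com", "ivy@example.com", "jack@example.com",
    ],
})

is_null = df_demo.isnull()

# The mean of a boolean column is the fraction of True values
completeness = pd.DataFrame({
    "missing_count": is_null.sum(),
    "missing_pct": (is_null.mean() * 100).round(1),
})

print(completeness)
```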


In [8]:
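# Select rows with at least one missing value. Row 3 (David) appears only because
# the helper column Age_numeric is NaN for the non-numeric age "abc".
rows_with_missing = df[df.isnull().any(axis=1)]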
print("Rows with missing data:\n", rows_with_missing)


Rows with missing data:
    ID   Name  Age  Grade              Email  Age_numeric
3   4  David  abc   75.0  david@example.com          NaN
4   5    Eva   21   88.0                NaN         21.0
8   9    Ivy   20    NaN    ivy@example.com         20.0


In [9]:
# Column-specific check: rows where Email is missing
missing_emails = df[df['Email'].isnull()]
print("Rows with missing emails:\n", missing_emails)


Rows with missing emails:
    ID Name Age  Grade Email  Age_numeric
4   5  Eva  21   88.0   NaN         21.0
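
The column-specific check generalizes to any list of required columns. Below is a minimal sketch of such a helper; the function name `missing_in_columns` and the choice of `Grade` and `Email` as required columns are assumptions made for illustration.

```python
import pandas as pd


def missing_in_columns(frame, columns):
    """Return the rows of `frame` where any of `columns` is missing."""
    return frame[frame[columns].isnull().any(axis=1)]


# Same ID, Grade, and Email values as the sample dataset
students = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Grade": [85, 90, 105, 75, 88, 92, 79, 65, None, 95],
    "Email": [
        "alice@example.com", "bob_at_example.com", "charlie@example.com",
        "david@example.com", None, "frank@example", "gina@example.com",
        "harry@example.com", "ivy@example.com", "jack@example.com",
    ],
})

# Hypothetical requirement: every student needs a grade and an email
incomplete = missing_in_columns(students, ["Grade", "Email"])
print("Rows missing a required field:\n", incomplete)
```

Passing the column list explicitly keeps helper columns such as `Age_numeric` from affecting the result.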
