## Check Uniqueness & Validity

**Objective**: Evaluate data quality by checking the uniqueness and validity of data entries.

For this activity, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.

**Steps**:
1. Check Uniqueness
    - Unique IDs
    - Unique Email Addresses
    - Unique Combination (Name + Age + Grade)

2. Check Validity
    - Validate Age Range
    - Validate Grade Scale
    - Validate Name Format

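The steps above can also be sketched with vectorized pandas operations instead of row-wise functions. This is a minimal illustration, not the required solution: the `sample` frame is hypothetical data standing in for `students.csv`, the bounds (0 to 120 for age, 0 to 100 for grade) and the name pattern are assumptions, and treating emails as equal after trimming and lowercasing is a design choice rather than something the dataset dictates.

```python
import pandas as pd

# Hypothetical rows standing in for students.csv (not the real file).
sample = pd.DataFrame({
    "ID": [1, 2, 2, 4],
    "Name": ["Alice", "Bob", "Bob", "D1ana"],
    "Age": [20, 17, 17, -3],
    "Grade": ["88", "92", "92", "abc"],
    "Email": ["alice@example.com", "bob@example.com",
              "BOB@example.com ", "diana@example.com"],
})

# --- Uniqueness: keep=False flags every member of a duplicate group ---
dup_ids = sample["ID"].duplicated(keep=False)
# Assumption: addresses that differ only by case or surrounding spaces are the same.
dup_emails = sample["Email"].str.strip().str.lower().duplicated(keep=False)
dup_combo = sample.duplicated(subset=["Name", "Age", "Grade"], keep=False)

# --- Validity: assumed bounds and an assumed name pattern ---
age_ok = sample["Age"].between(0, 120)
# errors="coerce" turns non-numeric grades into NaN, which then fails between().
grade_ok = pd.to_numeric(sample["Grade"], errors="coerce").between(0, 100)
# Letters, spaces, apostrophes, and hyphens only; missing names count as invalid.
name_ok = sample["Name"].str.fullmatch(r"[A-Za-z' -]+", na=False)

report = pd.DataFrame({
    "dup_id": dup_ids,
    "dup_email": dup_emails,
    "dup_combo": dup_combo,
    "age_ok": age_ok,
    "grade_ok": grade_ok,
    "name_ok": name_ok,
})
print(report)
```

Each check yields a boolean Series aligned with the original rows, so the flags can be combined or used directly as filters, for example `sample[~age_ok]`.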
In [1]:
# Write your code from here
import pandas as pd
import re

# Load dataset
df = pd.read_csv('students.csv')

# --- Uniqueness Checks ---

# 1. Unique IDs
ids_unique = df['ID'].is_unique
print(f"IDs unique? {ids_unique}")
if not ids_unique:
    print("Duplicate IDs:")
    print(df[df.duplicated(subset=['ID'], keep=False)][['ID', 'Name']])

# 2. Unique Emails
emails_unique = df['Email'].is_unique
print(f"Emails unique? {emails_unique}")
if not emails_unique:
    print("Duplicate Emails:")
    print(df[df.duplicated(subset=['Email'], keep=False)][['Email', 'Name']])

# 3. Unique combination of Name + Age + Grade
combo_duplicates = df[df.duplicated(subset=['Name', 'Age', 'Grade'], keep=False)]
combo_unique = combo_duplicates.empty
print(f"Name+Age+Grade combination unique? {combo_unique}")
if not combo_unique:
    print("Duplicate Name+Age+Grade combinations:")
    print(combo_duplicates[['Name', 'Age', 'Grade']])

# --- Validity Checks ---

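def valid_age(age):
    """Return True for a whole-number age between 0 and 120 inclusive."""
    try:
        val = float(age)
    except (ValueError, TypeError):
        # Covers None and non-numeric strings
        return False
    # NaN and infinity fail is_integer(), so missing ages are rejected too
    return val.is_integer() and 0 <= val <= 120

def valid_grade(grade):
    """Return True for a numeric grade between 0 and 100 inclusive."""
    try:
        val = float(grade)
        # NaN fails both comparisons, so missing grades are rejected
        return 0 <= val <= 100
    except (ValueError, TypeError):
        return False

def valid_name(name):
    """Return True for a non-empty name made of letters, spaces, apostrophes, hyphens."""
    if not isinstance(name, str) or not name.strip():
        return False
    # fullmatch already anchors the pattern at both ends
    pattern = r"[A-Za-z' -]+"
    return bool(re.fullmatch(pattern, name.strip()))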

df['Age_Valid'] = df['Age'].apply(valid_age)
df['Grade_Valid'] = df['Grade'].apply(valid_grade)
df['Name_Valid'] = df['Name'].apply(valid_name)

print("\nRows with invalid Age:")
print(df[~df['Age_Valid']][['ID', 'Name', 'Age']])

print("\nRows with invalid Grade:")
print(df[~df['Grade_Valid']][['ID', 'Name', 'Grade']])

print("\nRows with invalid Name:")
print(df[~df['Name_Valid']][['ID', 'Name']])


IDs unique? True
Emails unique? True
Name+Age+Grade combination unique? True

Rows with invalid Age:
   ID     Name  Age
2   3  Charlie  NaN
5   6    Faith -1.0

Rows with invalid Grade:
   ID   Name Grade
3   4  Diana   105
6   7  Grace   abc

Rows with invalid Name:
Empty DataFrame
Columns: [ID, Name]
Index: []
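
In the output above, Faith's age prints as `-1.0` rather than `-1`. A likely cause is that the missing value in the `Age` column makes pandas read the whole column as `float64`. The sketch below illustrates this with a hypothetical three-row CSV (not the real `students.csv`) and shows one possible alternative, the nullable `Int64` dtype, which keeps whole numbers alongside a missing value.

```python
import io

import pandas as pd

# Hypothetical CSV text with one missing Age (not the real students.csv).
csv_text = "ID,Age\n1,20\n2,\n3,-1\n"

# Default parsing: the empty field becomes NaN, so the column is float64.
default = pd.read_csv(io.StringIO(csv_text))
print(default["Age"].dtype)

# Nullable integer dtype: the empty field becomes <NA>, the rest stay integers.
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"Age": "Int64"})
print(nullable["Age"].dtype)
print(nullable)
```

Requesting `Int64` makes `read_csv` raise an error if the column holds non-integer text, so it suits columns already known to be whole numbers or blank.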
