## Check Uniqueness & Validity

**Objective**: Evaluate data quality by checking for uniqueness and validity of data entries.

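For this activity, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`. The notebook cell in this section builds the
sample in memory rather than reading the file.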
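
If the data does live in a file, loading it might look like the sketch below. It assumes `students.csv` is comma-separated with a header row matching the column names above; `load_students` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration, and the inline text is only a stand-in so the example runs without the real file.

```python
import io
import pandas as pd

# Column names taken from the dataset description (assumed to match the file header).
EXPECTED_COLUMNS = ["ID", "Name", "Age", "Grade", "Email"]

def load_students(source):
    """Read student data and confirm the expected columns are present.

    `source` may be a path such as "students.csv" or any file-like object.
    """
    df = pd.read_csv(source, skipinitialspace=True)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df[EXPECTED_COLUMNS]

# Stand-in for the file contents (assumed layout, not the real data).
sample_csv = io.StringIO(
    "ID,Name,Age,Grade,Email\n"
    "1,Alice,20,A,alice@example.com\n"
    "2,Bob123,150,B,bob@example.com\n"
)
students = load_students(sample_csv)
print(students.shape)
```

Checking the header up front means a misnamed or missing column surfaces as a clear error before any uniqueness or validity check runs.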

**Steps**:
1. Check Uniqueness
    - Unique IDs
    - Unique Email Addresses
    - Unique Combination (ID + Email)

2. Check Validity
    - Validate Age Range (5 to 100)
    - Validate Grade Scale (A, B, C, D, F)
    - Validate Name Format (letters and spaces only)

In [1]:
import pandas as pd
import re

# -------- Sample data creation --------
data = {
    'ID': [1, 2, 3, 4, 4],  # Duplicate ID (4)
    'Name': ['Alice', 'Bob123', 'Charlie', 'David', 'Eve'],
    'Age': [20, 150, 19, 25, 22],  # 150 invalid age
    'Grade': ['A', 'B', 'E', 'A', 'B'],  # 'E' invalid grade
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com', 'bob@example.com']  # Duplicate Email
}

df = pd.DataFrame(data)

print("=== Data Overview ===")
print(df)

# -------- 1. Check Uniqueness --------

print("\n--- Uniqueness Checks ---")

# Unique IDs
unique_ids = df['ID'].is_unique
print(f"Unique IDs: {unique_ids}")

# Unique Emails
unique_emails = df['Email'].is_unique
print(f"Unique Emails: {unique_emails}")

# Unique Combination of ID + Email
# The mask is True for every row whose (ID, Email) pair appears more than once.
dup_id_email = df.duplicated(subset=['ID', 'Email'], keep=False)
if dup_id_email.any():
    print("Duplicate ID + Email combinations found at rows:")
    print(df[dup_id_email])
else:
    print("All ID + Email combinations are unique.")

# -------- 2. Check Validity --------

print("\n--- Validity Checks ---")

# 2a. Validate Age Range (5 to 100, both bounds inclusive)
df['Age_Valid'] = df['Age'].between(5, 100)
print("Age validity (True=valid):")
print(df[['Age', 'Age_Valid']])

# 2b. Validate Grade Scale (A, B, C, D, F)
valid_grades = {'A', 'B', 'C', 'D', 'F'}
df['Grade_Valid'] = df['Grade'].isin(valid_grades)
print("\nGrade validity (True=valid):")
print(df[['Grade', 'Grade_Valid']])

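# 2c. Validate Name Format (only letters and spaces)
# A literal space is used instead of \s so tabs and newlines are rejected;
# re.fullmatch requires the whole string to match the pattern.
name_pattern = r'[A-Za-z ]+'
df['Name_Valid'] = df['Name'].apply(lambda x: bool(re.fullmatch(name_pattern, x)))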
print("\nName format validity (True=valid):")
print(df[['Name', 'Name_Valid']])

=== Data Overview ===
   ID     Name  Age Grade                Email
0   1    Alice   20     A    alice@example.com
1   2   Bob123  150     B      bob@example.com
2   3  Charlie   19     E  charlie@example.com
3   4    David   25     A    david@example.com
4   4      Eve   22     B      bob@example.com

--- Uniqueness Checks ---
Unique IDs: False
Unique Emails: False
All ID + Email combinations are unique.

--- Validity Checks ---
Age validity (True=valid):
   Age  Age_Valid
0   20       True
1  150      False
2   19       True
3   25       True
4   22       True

Grade validity (True=valid):
  Grade  Grade_Valid
0     A         True
1     B         True
2     E        False
3     A         True
4     B         True

Name format validity (True=valid):
      Name  Name_Valid
0    Alice        True
1   Bob123       False
2  Charlie        True
3    David        True
4      Eve        True
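
The `is_unique` checks above only report that duplicates exist, not where they are. A minimal sketch of listing the offending rows, reusing the `ID` and `Email` values from the sample data; `duplicate_rows` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

# Same ID and Email values as the sample data above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 4],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com",
              "david@example.com", "bob@example.com"],
})

def duplicate_rows(frame, columns):
    """Return every row whose values in `columns` occur more than once."""
    mask = frame.duplicated(subset=columns, keep=False)
    return frame[mask]

dup_ids = duplicate_rows(df, ["ID"])               # rows sharing an ID
dup_emails = duplicate_rows(df, ["Email"])         # rows sharing an Email
dup_pairs = duplicate_rows(df, ["ID", "Email"])    # rows sharing both

print(dup_ids)
print(dup_emails)
print(f"Duplicate ID + Email pairs: {len(dup_pairs)}")
```

This also shows why the combination check passes while both single-column checks fail: rows 3 and 4 share ID 4 but have different emails, and rows 1 and 4 share an email but have different IDs, so no (ID, Email) pair repeats.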

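
The three validity columns can also be combined into a single report of problem rows. The sketch below is one possible way to do that, assuming the same rules as above (age 5 to 100, grades A, B, C, D, F, names of letters and spaces); the `checks`, `failed`, and `Failed` names are introduced here for illustration.

```python
import pandas as pd

# Same Name, Age, and Grade values as the sample data above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob123", "Charlie", "David", "Eve"],
    "Age": [20, 150, 19, 25, 22],
    "Grade": ["A", "B", "E", "A", "B"],
})

# One boolean column per rule; True means the value is valid.
checks = pd.DataFrame({
    "Age": df["Age"].between(5, 100),
    "Grade": df["Grade"].isin(["A", "B", "C", "D", "F"]),
    "Name": df["Name"].str.fullmatch(r"[A-Za-z ]+"),
})

failed = ~checks                      # True where a rule is violated
has_failure = failed.any(axis=1)      # rows with at least one violation

# Keep only the problem rows and name the rules each one violates.
invalid = df[has_failure].copy()
invalid["Failed"] = failed[has_failure].apply(
    lambda row: ", ".join(row[row].index), axis=1
)
print(invalid)
```

Keeping the rules in one boolean frame makes it easy to add a new check as another column without changing the reporting code.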