## Check Uniqueness & Validity

**Objective**: Evaluate data quality by checking for uniqueness and validity of data entries.

For this activity, you will use a sample dataset, `students.csv`, which contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.
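
Before running any checks, it can help to confirm that the expected columns are actually present. The following is a minimal sketch, assuming the column list above is complete; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced only for illustration, and the one-row `sample` frame is a stand-in for the real file.

```python
import pandas as pd

# Columns the activity description says students.csv should contain.
EXPECTED_COLUMNS = ["ID", "Name", "Age", "Grade", "Email"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Tiny in-memory stand-in for students.csv, so the sketch runs without the file.
sample = pd.DataFrame({"ID": [1], "Name": ["Alice"], "Age": [20], "Grade": [85]})
print(missing_columns(sample))  # ['Email']
```

An empty list means every expected column was found.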

**Steps**:
1. Check Uniqueness
    - Unique IDs
    - Unique Email Addresses
    - Unique Combination (Name, Age, Email)

2. Check Validity
    - Validate Age Range
    - Validate Grade Scale
    - Validate Name Format
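
The two steps above can be sketched as a pair of small reusable helpers. This is only one possible shape, not the notebook's own solution: `duplicate_rows` and `out_of_range` are hypothetical names, the three-row `sample` frame is made up for illustration, and the bounds (5 to 25 for Age, 0 to 100 for Grade) are the ones the notebook's solution cell uses rather than limits stated in the task description.

```python
import pandas as pd


def duplicate_rows(df, columns):
    """Rows whose values in `columns` appear more than once (all copies kept)."""
    return df[df.duplicated(subset=columns, keep=False)]


def out_of_range(df, column, low, high):
    """Rows where `column` is non-numeric or falls outside [low, high]."""
    # errors="coerce" turns unparseable entries such as "abc" into NaN.
    values = pd.to_numeric(df[column], errors="coerce")
    return df[values.isna() | (values < low) | (values > high)]


# Hypothetical three-row sample, not the activity's dataset.
sample = pd.DataFrame({
    "ID": [1, 2, 2],
    "Age": [20, "abc", 30],
    "Grade": [85, 105, 70],
})

print(len(duplicate_rows(sample, ["ID"])))         # 2
print(len(out_of_range(sample, "Age", 5, 25)))     # 2
print(len(out_of_range(sample, "Grade", 0, 100)))  # 1
```

Returning the offending rows, rather than a boolean, keeps the helpers usable both for reporting and for later cleaning.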

In [1]:
# Write your code from here
import pandas as pd

# Sample data including duplicates and invalid values
data = {
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 8, 10],
    "Name": [
        "Alice", "Bob", "Charlie", "David", "Eva", 
        "Frank", "Gina", "Harry", "Harry", "Jack"
    ],
    "Age": [20, 22, 19, "abc", 21, 17, 24, 26, 26, 21],
    "Grade": [85, 90, 105, 75, 88, 92, 79, 65, 65, 95],
    "Email": [
        "alice@example.com", "bob_at_example.com", "charlie@example.com", 
        "david@example.com", None, "frank@example", "gina@example.com", 
        "harry@example.com", "harry@example.com", "jack@example.com"
    ]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv("students.csv", index=False)

print("students.csv created successfully!")


students.csv created successfully!
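
The sample data above deliberately includes malformed addresses such as `bob_at_example.com` and `frank@example`, which the listed steps do not cover. A format check could be sketched as follows, assuming a deliberately simple pattern (`something@domain.tld`) rather than full standards-compliant validation; `EMAIL_PATTERN` and `invalid_emails` are hypothetical names, and missing emails are skipped rather than reported.

```python
import pandas as pd

# Simplified pattern (an assumption): text without spaces or "@",
# then "@", a domain, a dot, and a final part.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"


def invalid_emails(emails):
    """Return the non-missing entries that do not match EMAIL_PATTERN."""
    present = emails.dropna().astype(str)
    return present[~present.str.match(EMAIL_PATTERN)]


emails = pd.Series([
    "alice@example.com", "bob_at_example.com", None, "frank@example",
])
print(invalid_emails(emails).tolist())  # ['bob_at_example.com', 'frank@example']
```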


In [6]:
import pandas as pd

# Load the dataset created in the previous cell and preview the first rows
df = pd.read_csv('students.csv')
print(df.head())


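# Check Uniqueness: Unique IDs
# keep=False flags every row that shares an ID, not just the later copies
duplicate_ids = df[df.duplicated(subset=['ID'], keep=False)]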
if duplicate_ids.empty:
    print("All IDs are unique.")
else:
    print("Duplicate IDs found:")
    print(duplicate_ids)


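# Check Uniqueness: Unique Email Addresses
# Missing emails are excluded so that several NaN values are not reported as duplicates
duplicate_emails = df[df['Email'].notnull() & df.duplicated(subset=['Email'], keep=False)]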
if duplicate_emails.empty:
    print("All Emails are unique (excluding missing emails).")
else:
    print("Duplicate Emails found:")
    print(duplicate_emails)


# Check Uniqueness: Unique Combination of Name, Age, and Email
duplicate_combination = df[df.duplicated(subset=['Name', 'Age', 'Email'], keep=False)]
if duplicate_combination.empty:
    print("All combinations of Name, Age, and Email are unique.")
else:
    print("Duplicate combinations of Name, Age, and Email found:")
    print(duplicate_combination)


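# Check Validity: Validate Age Range
# Non-numeric entries such as "abc" become NaN and are flagged below
df['Age_numeric'] = pd.to_numeric(df['Age'], errors='coerce')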

invalid_age = df[
    (df['Age_numeric'].isna()) | 
    (df['Age_numeric'] % 1 != 0) |  # non-integer
    (df['Age_numeric'] < 5) | 
    (df['Age_numeric'] > 25)
]

if invalid_age.empty:
    print("All Age values are valid integers between 5 and 25.")
else:
    print("Invalid Age values found:")
    print(invalid_age[['ID', 'Name', 'Age']])




# Check Validity: Validate Grade Scale (0 to 100)
df['Grade_numeric'] = pd.to_numeric(df['Grade'], errors='coerce')

invalid_grade = df[
    (df['Grade_numeric'].isna()) | 
    (df['Grade_numeric'] < 0) | 
    (df['Grade_numeric'] > 100)
]

if invalid_grade.empty:
    print("All Grade values are valid numbers between 0 and 100.")
else:
    print("Invalid Grade values found:")
    print(invalid_grade[['ID', 'Name', 'Grade']])


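import re

# Check Validity: Validate Name Format
# Allowed characters: letters, whitespace, hyphens, and apostrophes
name_pattern = re.compile(r"^[A-Za-z\s\-']+$")

# Missing names become empty strings so they fail the pattern
# instead of passing as the literal text "nan"
invalid_names = df[~df['Name'].fillna('').astype(str).apply(lambda x: bool(name_pattern.match(x)))]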

if invalid_names.empty:
    print("All Names are valid.")
else:
    print("Invalid Names found:")
    print(invalid_names[['ID', 'Name']])


   ID     Name  Age  Grade                Email
0   1    Alice   20     85    alice@example.com
1   2      Bob   22     90   bob_at_example.com
2   3  Charlie   19    105  charlie@example.com
3   4    David  abc     75    david@example.com
4   5      Eva   21     88                  NaN
Duplicate IDs found:
   ID   Name Age  Grade              Email
7   8  Harry  26     65  harry@example.com
8   8  Harry  26     65  harry@example.com
Duplicate Emails found:
   ID   Name Age  Grade              Email
7   8  Harry  26     65  harry@example.com
8   8  Harry  26     65  harry@example.com
Duplicate combinations of Name, Age, and Email found:
   ID   Name Age  Grade              Email
7   8  Harry  26     65  harry@example.com
8   8  Harry  26     65  harry@example.com
Invalid Age values found:
   ID   Name  Age
3   4  David  abc
7   8  Harr