### Task 1: Detecting Missing Values during Data Ingestion
**Description**: You have a CSV file with missing values in some columns. Write a Python script to detect and report missing values during the ingestion process.

**Steps**:
1. Load data
2. Check for missing values
3. Report missing values

In [2]:
import pandas as pd
import os

# Step 1: Load data from CSV or create sample if not found
file_path = 'your_file.csv'

if os.path.exists(file_path):
    try:
        df = pd.read_csv(file_path)
        print(f"\n✅ Successfully loaded data from {file_path}.\n")
    except Exception as e:
        print(f"\n❌ Error reading the file: {e}")
        df = None
else:
    print(f"\n⚠️ File '{file_path}' not found. Using sample data for testing...\n")
    from io import StringIO
    sample_data = StringIO("""
name,age,city
Alice,25,New York
Bob,,Los Angeles
Charlie,30,
David,22,Chicago
""")
    df = pd.read_csv(sample_data)

# Proceed only if the DataFrame was created
if df is not None:
    # Step 2: Check for missing values
    missing_values = df.isnull().sum()

    # Step 3: Report missing values
    print("📝 Missing Value Report:")
    missing_report = missing_values[missing_values > 0]

    if missing_report.empty:
        print("✅ No missing values found in the dataset.")
    else:
        print(missing_report)

        # Optional: Show rows with missing values
        print("\n🔍 Rows with missing data:")
        print(df[df.isnull().any(axis=1)])
else:
    print("❌ DataFrame not created. Skipping missing value check.")


⚠️ File 'your_file.csv' not found. Using sample data for testing...

📝 Missing Value Report:
age     1
city    1
dtype: int64

🔍 Rows with missing data:
      name   age         city
1      Bob   NaN  Los Angeles
2  Charlie  30.0          NaN
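
Raw counts are hard to compare across columns once a dataset grows, so the report above can be extended with a percentage per column. The following is a minimal sketch, not part of the original task: the helper name `missing_value_summary` and the column labels `missing_count` and `missing_pct` are made up for illustration, and the sample rows are the same ones used in the fallback data.

```python
from io import StringIO

import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have gaps.

    Hypothetical helper for illustration; the column labels are arbitrary.
    """
    is_missing = df.isnull()
    summary = pd.DataFrame({
        "missing_count": is_missing.sum(),
        # The mean of a boolean column is the fraction of True values
        "missing_pct": is_missing.mean().mul(100).round(1),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["missing_count"] > 0]


# Same rows as the fallback sample data in the cell above
sample = StringIO(
    "name,age,city\n"
    "Alice,25,New York\n"
    "Bob,,Los Angeles\n"
    "Charlie,30,\n"
    "David,22,Chicago\n"
)
df = pd.read_csv(sample)
summary = missing_value_summary(df)
print(summary)
```

Note that `isnull()` only sees values pandas has already parsed as missing. Placeholder strings that are not in the default `read_csv` missing-value list (for example a lone dash) would need to be passed through the `na_values` parameter at load time.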


### Task 2: Validate Data Types during Extraction
**Description**: You have a JSON file that should have specific data types for each field. Write a script to validate if the data types match the expected schema.

**Steps**:
1. Define expected schema
2. Load data
3. Validate data types

In [3]:
import json
import os

# Step 1: Define expected schema
expected_schema = {
    "id": int,
    "name": str,
    "age": int,
    "email": str
}

# Step 2: Try loading data from JSON, else use fallback data
file_path = "sample_data.json"
if os.path.exists(file_path):
    with open(file_path, "r") as f:
        data = json.load(f)
else:
    print("⚠️ JSON file not found. Using sample data for testing...")
    data = [
        {"id": 1, "name": "Alice", "age": 28, "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "age": "25", "email": "bob@example.com"},      # age is str
        {"id": 3, "name": "Charlie", "age": 30, "email": None}                 # email is None
    ]

# Step 3: Validate data types
print("\n🔍 Data Type Validation Report:")
for i, record in enumerate(data, start=1):
    errors = []
    for field, expected_type in expected_schema.items():
        actual_value = record.get(field)
        if not isinstance(actual_value, expected_type):
            errors.append(f"   - {field}: Expected {expected_type.__name__}, got {type(actual_value).__name__}")
    if errors:
        print(f"❌ Record {i} has type mismatches:\n" + "\n".join(errors))
    else:
        print(f"✅ Record {i} passed type validation.")

⚠️ JSON file not found. Using sample data for testing...

🔍 Data Type Validation Report:
✅ Record 1 passed type validation.
❌ Record 2 has type mismatches:
   - age: Expected int, got str
❌ Record 3 has type mismatches:
   - email: Expected str, got NoneType
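
Two edge cases are worth knowing about in the loop above. Because `record.get(field)` returns `None` for an absent key, a missing field is reported as `NoneType` rather than as missing, and because `bool` is a subclass of `int` in Python, a value of `True` passes an `int` check. The sketch below shows one way to handle both. The function name `validate_record` and the sample records are hypothetical and only serve to exercise the two cases.

```python
def validate_record(record: dict, schema: dict) -> list:
    """Return a list of error strings; an empty list means the record is valid.

    Hypothetical helper for illustration, stricter than a plain isinstance check.
    """
    errors = []
    for field, expected_type in schema.items():
        # Report absent keys separately from keys that hold None
        if field not in record:
            errors.append(f"{field}: missing field")
            continue
        value = record[field]
        # bool is a subclass of int, so reject it unless bool is expected
        bool_mismatch = isinstance(value, bool) and expected_type is not bool
        if bool_mismatch or not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


schema = {"id": int, "name": str, "age": int, "email": str}

# Made-up records chosen to trigger each edge case
records = [
    {"id": 1, "name": "Alice", "age": 28, "email": "alice@example.com"},
    {"id": True, "name": "Bob", "age": 25, "email": "bob@example.com"},  # id is bool
    {"id": 3, "name": "Charlie", "age": 30},                             # email key absent
]

results = [validate_record(record, schema) for record in records]
for i, errors in enumerate(results, start=1):
    print(f"Record {i}: {errors if errors else 'OK'}")
```

Returning the errors instead of printing them inside the function keeps the validation reusable, for example to count failures or to route bad records to a separate file.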


### Task 3: Remove Duplicate Records in Data
**Description**: You have a dataset with duplicate entries. Write a Python script to find and remove duplicate records using Pandas.

**Steps**:
1. Load data
2. Find duplicate records
3. Remove duplicates
4. Report results

In [4]:
import pandas as pd

# Sample dataset with duplicate entries
data = {
    "id": [1, 2, 2, 3, 4, 5, 5],
    "name": ["Alice", "Bob", "Bob", "Charlie", "David", "Eva", "Eva"],
    "age": [25, 30, 30, 35, 40, 28, 28]
}

# Step 1: Load data into DataFrame
df = pd.DataFrame(data)

print("📄 Original DataFrame:\n")
print(df)

# Step 2: Find duplicate records (rows that exactly repeat an earlier row)
duplicates = df[df.duplicated()]
print("\n🔍 Duplicate Records Found:\n")
print(duplicates)

# Step 3: Remove duplicates (keep first occurrence)
df_cleaned = df.drop_duplicates()

# Step 4: Report results
print("\n✅ Cleaned DataFrame (duplicates removed):\n")
print(df_cleaned)

print(f"\n📊 Total duplicates removed: {len(df) - len(df_cleaned)}")

📄 Original DataFrame:

   id     name  age
0   1    Alice   25
1   2      Bob   30
2   2      Bob   30
3   3  Charlie   35
4   4    David   40
5   5      Eva   28
6   5      Eva   28

🔍 Duplicate Records Found:

   id name  age
2   2  Bob   30
6   5  Eva   28

✅ Cleaned DataFrame (duplicates removed):

   id     name  age
0   1    Alice   25
1   2      Bob   30
3   3  Charlie   35
4   4    David   40
5   5      Eva   28

📊 Total duplicates removed: 2
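
The script above treats two rows as duplicates only when every column matches, and it lists only the later copies. In practice, records often share a key but differ elsewhere. The sketch below, which uses a made-up dataset where `id` 2 appears with two different names, shows the `subset` and `keep` parameters of `duplicated` and `drop_duplicates` for that situation. Treating `id` as the key and keeping the last row per key are assumptions for illustration, not requirements of the task.

```python
import pandas as pd

# Made-up data: id 2 appears twice with different names
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Alice", "Bob", "Robert", "Charlie"],
    "age": [25, 30, 30, 35],
})

# A full-row comparison finds nothing, because the names differ
full_row_duplicates = int(df.duplicated().sum())

# keep=False flags every row that shares an id, not only the later copies
conflicts = df[df.duplicated(subset=["id"], keep=False)]

# Keep the last row per id; which copy survives is a policy decision
deduped = df.drop_duplicates(subset=["id"], keep="last").reset_index(drop=True)

print(f"Full-row duplicates: {full_row_duplicates}")
print(conflicts)
print(deduped)
```

Showing all conflicting rows with `keep=False` before dropping anything makes it easier to decide which copy is the correct one.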
