### Task 1: Detecting Missing Values during Data Ingestion
**Description**: You have a CSV file with missing values in some columns. Write a Python script to detect and report missing values during the ingestion process.

**Steps**:
1. Load data
2. Check for missing values
3. Report missing values

In [2]:
# Write your code from here
import pandas as pd

def detect_missing_values(file_path):
    # Step 1: Load data
    df = pd.read_csv(file_path)
    
    # Step 2: Check for missing values
    missing_counts = df.isnull().sum()
    
    # Step 3: Report missing values
    total_missing = missing_counts.sum()
    if total_missing == 0:
        print("No missing values detected in the dataset.")
    else:
        print(f"Total missing values in dataset: {total_missing}")
        print("Missing values per column:")
        print(missing_counts[missing_counts > 0])

# Example usage
csv_file = 'your_data.csv'  # Replace with your actual CSV file path
detect_missing_values(csv_file)


Total missing values in dataset: 3
Missing values per column:
name     1
age      1
email    1
dtype: int64
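
The report can also be returned as a table rather than printed, which makes it easier to log or test during ingestion. The following is a minimal sketch under stated assumptions: the sample CSV text and the helper name `summarize_missing` are hypothetical and only illustrate the idea. The data is read from an in-memory string, so no file is needed.

```python
import io

import pandas as pd

# Hypothetical sample data: blank cells are read by pandas as NaN.
SAMPLE_CSV = """name,age,email
Alice,30,alice@example.com
,25,bob@example.com
Carol,,carol@example.com
Dave,41,
"""


def summarize_missing(df):
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["missing"] > 0]


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = summarize_missing(df)
print(report)
```

Reporting a percentage alongside the raw count helps when deciding whether a column can be repaired or should be dropped.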


### Task 2: Validate Data Types during Extraction
**Description**: You have a JSON file that should have specific data types for each field. Write a script to validate if the data types match the expected schema.

**Steps**:
1. Define expected schema
2. Validate data types

In [4]:
# Write your code from here
import json

def validate_data_types(json_file, schema):
    with open(json_file, 'r') as f:
        data = json.load(f)
    
    errors = []
    
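    # The checks below expect a list of records (dictionaries)
    if not isinstance(data, list):
        print("Expected a JSON array of records at the top level.")
        return
    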
    for i, record in enumerate(data):
        for field, expected_type in schema.items():
            if field not in record:
                errors.append(f"Record {i}: Missing field '{field}'")
            else:
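                value = record[field]
                # bool is a subclass of int, so a bool must only satisfy a bool field
                if isinstance(value, bool):
                    type_ok = expected_type is bool
                else:
                    # Special case: allow int where float expected
                    type_ok = isinstance(value, expected_type) or (
                        expected_type is float and isinstance(value, int)
                    )
                if not type_ok:
                    errors.append(f"Record {i}: Field '{field}' expected {expected_type.__name__}, got {type(value).__name__}")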
    
    if errors:
        print("Data type validation errors found:")
        for error in errors:
            print(error)
    else:
        print("All records match the expected schema.")

# Define expected schema: field -> Python type
expected_schema = {
    "id": int,
    "name": str,
    "age": int,
    "email": str,
    "is_active": bool,
    "balance": float
}

# Example usage
json_file_path = 'data.json'  # Replace with your JSON file path
validate_data_types(json_file_path, expected_schema)


Data type validation errors found:
Record 1: Field 'age' expected int, got str
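
The same checks can be written so that they return the list of errors instead of printing it, which lets the result be asserted on in a test. The following is a sketch, not a definitive implementation: the sample records and the names `type_matches` and `collect_type_errors` are hypothetical. It treats `bool` and `int` as distinct types, because in Python `isinstance(True, int)` is `True`.

```python
import json

# Hypothetical records standing in for the contents of a JSON file.
SAMPLE_JSON = """[
  {"id": 1, "name": "Alice", "age": 30, "is_active": true, "balance": 10},
  {"id": 2, "name": "Bob", "age": "25", "is_active": false, "balance": 3.5},
  {"id": true, "name": "Carol", "is_active": true, "balance": 0.0}
]"""

SCHEMA = {"id": int, "name": str, "age": int, "is_active": bool, "balance": float}


def type_matches(value, expected_type):
    """Check one value, treating bool and int as distinct types."""
    if isinstance(value, bool):
        return expected_type is bool
    if expected_type is float:
        # Whole numbers in JSON are parsed as int, so accept them for float fields
        return isinstance(value, (int, float))
    return isinstance(value, expected_type)


def collect_type_errors(records, schema):
    """Return one message per missing field or type mismatch."""
    errors = []
    for i, record in enumerate(records):
        for field, expected_type in schema.items():
            if field not in record:
                errors.append(f"Record {i}: Missing field '{field}'")
            elif not type_matches(record[field], expected_type):
                errors.append(
                    f"Record {i}: Field '{field}' expected {expected_type.__name__}, "
                    f"got {type(record[field]).__name__}"
                )
    return errors


errors = collect_type_errors(json.loads(SAMPLE_JSON), SCHEMA)
for error in errors:
    print(error)
```

Returning the errors keeps the validation logic separate from how the results are displayed or logged.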


### Task 3: Remove Duplicate Records in Data
**Description**: You have a dataset with duplicate entries. Write a Python script to find and remove duplicate records using pandas.

**Steps**:
1. Find duplicate records
2. Remove duplicates
3. Report results

In [5]:
# Write your code from here
from pathlib import Path

import pandas as pd

def remove_duplicates(file_path):
    # Load data
    df = pd.read_csv(file_path)
    
    # Step 1: Find duplicate records (rows identical to an earlier row in every column)
    duplicates = df[df.duplicated()]
    print(f"Number of duplicate records found: {len(duplicates)}")
    
    if len(duplicates) > 0:
        print("Duplicate records:")
        print(duplicates)
    
    # Step 2: Remove duplicates, keeping the first occurrence of each record
    df_cleaned = df.drop_duplicates()
    
    # Step 3: Report results
    print(f"Number of records after removing duplicates: {len(df_cleaned)}")
    
    # Optionally, save the cleaned data to a new CSV file next to the original.
    # Prefixing only the file name keeps the path valid when file_path includes a directory.
    source = Path(file_path)
    cleaned_file_path = source.with_name('cleaned_' + source.name)
    df_cleaned.to_csv(cleaned_file_path, index=False)
    print(f"Cleaned data saved to {cleaned_file_path}")

# Example usage
csv_file = 'your_data.csv'  # Replace with your actual CSV file path
remove_duplicates(csv_file)


Number of duplicate records found: 0
Number of records after removing duplicates: 5
Cleaned data saved to cleaned_your_data.csv
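
By default, `duplicated()` and `drop_duplicates()` compare every column, so two rows that differ in a single field are not treated as duplicates. The sketch below, using a small hypothetical DataFrame, contrasts exact duplicates with duplicates defined by a key column through the `subset` and `keep` parameters. Treating `email` as the key is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical; rows 1 and 3 share only an email.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Robert"],
    "email": ["alice@example.com", "bob@example.com",
              "alice@example.com", "bob@example.com"],
})

# Exact duplicates: every column must match an earlier row.
exact = df.drop_duplicates()

# Key-based duplicates: a row is dropped when its "email" value repeats.
by_email = df.drop_duplicates(subset=["email"], keep="first")

# keep=False flags every member of a duplicate group, which helps manual review.
all_exact_copies = df[df.duplicated(keep=False)]

print(len(df), len(exact), len(by_email), len(all_exact_copies))
```

Which definition is appropriate depends on the dataset: key-based removal discards rows that may carry different information, so it is worth inspecting the flagged rows before dropping them.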
