## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

You will use a sample dataset, `students.csv`, which contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column Specific Missing Value Check
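
The checks listed above could be sketched as follows. This is a minimal illustration, not a reference solution: the contents of `students.csv` are not shown here, so the block builds a small hypothetical DataFrame with the same column names. The 0 to 100 grade range and the simple email pattern are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for students.csv (the real file's rows are not shown here).
students = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Ana", "Ben", None, "Dee"],
    "Age": [20, 19.5, -3, None],
    "Grade": [88.0, 102.0, 75.0, None],
    "Email": ["ana@example.com", "ben[at]example.com", "cy@example.com", None],
})

# ---- 1. Check Accuracy ----

# Numerical accuracy: flag grades outside an ASSUMED valid range of 0-100.
# Missing grades are a completeness issue, so they are excluded here.
grade_present = students["Grade"].notna()
bad_grade = grade_present & ~students["Grade"].between(0, 100)

# Email format: a deliberately simple pattern (text@text.text), not a full validator.
email_pattern = r"[^@\s]+@[^@\s]+\.[^@\s]+"
email_present = students["Email"].notna()
bad_email = email_present & ~students["Email"].str.fullmatch(email_pattern, na=False)

# Integer accuracy for Age: a present value must be a non-negative whole number.
age_present = students["Age"].notna()
bad_age = age_present & ((students["Age"] % 1 != 0) | (students["Age"] < 0))

# ---- 2. Check Completeness ----

# Missing values per column.
missing_per_column = students.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = students[students.isna().any(axis=1)]

# Column-specific check: number of missing emails.
missing_emails = students["Email"].isna().sum()

print("Rows with suspect grades:\n", students[bad_grade])
print("Rows with malformed emails:\n", students[bad_email])
print("Rows with invalid ages:\n", students[bad_age])
print("Missing values per column:\n", missing_per_column)
print("Rows with missing data:\n", rows_with_missing)
print("Missing emails:", missing_emails)
```

With the real file, replacing the hypothetical frame with `pd.read_csv("students.csv")` should leave the rest of the checks unchanged, provided the column names match.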

In [1]:
import pandas as pd

# ----------------------------
# Part 2: Remove Duplicates & Fix Data Types
# ----------------------------

# Sample extended dataset with duplicates and incorrect types
data = {
    "ID": [101, 102, 103, 104, 102],
    "Name": ["Alice", "Bob", "Charlie", "David", "Bob"],
    "Age": ["25", "30", "35", "40", "30"],         # Age as string (should be int)
    "Join_Date": ["2023-01-10", "2022-12-01", "2021-07-15", "2023-03-22", "2022-12-01"]  # Should be datetime
}

# Task 1: Remove Duplicates

# Step 1: Load Data into DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

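# Step 2: Remove duplicate rows.
# .copy() makes the result an independent DataFrame, so the column
# assignments below do not trigger SettingWithCopyWarning.
df_no_duplicates = df.drop_duplicates().copy()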
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)

# Task 2: Fix Incorrect Data Types

# Convert 'Age' from string to integer
df_no_duplicates["Age"] = df_no_duplicates["Age"].astype(int)

# Task 3: Convert Data Type for Analysis

# Convert 'Join_Date' from string to datetime
df_no_duplicates["Join_Date"] = pd.to_datetime(df_no_duplicates["Join_Date"])

print("\nDataFrame after Fixing Data Types:")
print(df_no_duplicates)

# Check final data types
print("\nData Types After Conversion:")
print(df_no_duplicates.dtypes)


Original DataFrame:
    ID     Name Age   Join_Date
0  101    Alice  25  2023-01-10
1  102      Bob  30  2022-12-01
2  103  Charlie  35  2021-07-15
3  104    David  40  2023-03-22
4  102      Bob  30  2022-12-01

DataFrame after Removing Duplicates:
    ID     Name Age   Join_Date
0  101    Alice  25  2023-01-10
1  102      Bob  30  2022-12-01
2  103  Charlie  35  2021-07-15
3  104    David  40  2023-03-22

DataFrame after Fixing Data Types:
    ID     Name  Age  Join_Date
0  101    Alice   25 2023-01-10
1  102      Bob   30 2022-12-01
2  103  Charlie   35 2021-07-15
3  104    David   40 2023-03-22

Data Types After Conversion:
ID                    int64
Name                 object
Age                   int64
Join_Date    datetime64[ns]
dtype: object


Note: on the pandas version used for this run, assigning columns directly on the result of `drop_duplicates()`, without an explicit `.copy()`, emitted the following `SettingWithCopyWarning` twice, once for the `Age` assignment and once for the `Join_Date` assignment:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
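
Two common ways to avoid the warning above are sketched below on a small hypothetical frame (the names `raw`, `clean_a`, and `clean_b` are illustrative only). Option A takes an explicit copy before assigning columns. Option B uses `DataFrame.assign`, which returns a new DataFrame rather than modifying an existing one.

```python
import pandas as pd

# Small hypothetical frame with one duplicated row and string-typed columns.
raw = pd.DataFrame({
    "ID": [1, 2, 2],
    "Age": ["25", "30", "30"],
    "Join_Date": ["2023-01-10", "2022-12-01", "2022-12-01"],
})

# Option A: take an explicit copy, then assign columns as usual.
clean_a = raw.drop_duplicates().copy()
clean_a["Age"] = clean_a["Age"].astype(int)
clean_a["Join_Date"] = pd.to_datetime(clean_a["Join_Date"])

# Option B: build the converted frame in one chain; assign() works on its own copy.
clean_b = raw.drop_duplicates().assign(
    Age=lambda d: d["Age"].astype(int),
    Join_Date=lambda d: pd.to_datetime(d["Join_Date"]),
)

print(clean_a.dtypes)
print("Both options agree:", clean_b.equals(clean_a))
```

Either option leaves `raw` untouched, which keeps the original data available for comparison after cleaning.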
