### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether the prices in `company_prices.csv` match the prices in
`trusted_prices.csv`. Assume both files have a "product_id" and a "price" column.

### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative price values, which are invalid for prices.

In [4]:
# Write your code from here
import pandas as pd
import os

# === Step 1: Upload CSV (for Jupyter/Colab users) ===
try:
    from google.colab import files
    print("Please upload 'company_prices.csv'...")
    uploaded = files.upload()
except ImportError:
    print("Not in Google Colab. Ensure 'company_prices.csv' is in the same folder.")

# === Step 2: Check if file exists ===
file_name = "company_prices.csv"
if not os.path.exists(file_name):
    raise FileNotFoundError(f"'{file_name}' not found in the current directory. Please upload it.")

# === Step 3: Load the dataset ===
df = pd.read_csv(file_name)

# === Step 4: Detect negative price values ===
negative_prices = df[df["price"] < 0]

# === Step 5: Display results ===
if not negative_prices.empty:
    print("\nIncorrect Entries with Negative Prices:\n")
    print(negative_prices)
    print(f"\nTotal Negative Price Records: {len(negative_prices)}")
else:
    print("✅ No negative prices detected. All entries are valid.")


Not in Google Colab. Ensure 'company_prices.csv' is in the same folder.


FileNotFoundError: 'company_prices.csv' not found in the current directory. Please upload it.
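
Because the file was not found in this run, the check above never executed. The following is a minimal sketch of the same negative-price check on an in-memory sample. The sample rows are hypothetical stand-ins for `company_prices.csv`, and the extra `pd.to_numeric` step is an assumption added here to cope with non-numeric entries; it is not part of the original task.

```python
import io

import pandas as pd

# Hypothetical sample standing in for company_prices.csv
sample_csv = io.StringIO(
    "product_id,price\n"
    "101,20.0\n"
    "102,-5.0\n"
    "103,abc\n"
    "104,\n"
    "105,-0.5\n"
)
df = pd.read_csv(sample_csv)

# Coerce to numeric so that non-numeric entries become NaN instead of raising
price = pd.to_numeric(df["price"], errors="coerce")

# Negative prices are the incorrect values the task asks for
negative_prices = df[price < 0]

# Entries that were present but could not be parsed as numbers
unparseable = df[price.isna() & df["price"].notna()]

print("Negative prices:")
print(negative_prices)
print("\nUnparseable prices:")
print(unparseable)
```

Keeping negative and unparseable values in separate frames makes it easier to report them as different kinds of errors.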

### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in `customer_data.csv`.

In [2]:
import pandas as pd

# Task 1: measure price accuracy against a trusted source
# Step 1: Create simulated data for the company and trusted datasets
company_data = {
    "product_id": [101, 102, 103, 104, 105],
    "price": [20.0, 35.0, 50.0, 45.0, 60.0]
}

trusted_data = {
    "product_id": [101, 102, 103, 104, 105],
    "price": [20.0, 33.0, 50.0, 45.0, 59.0]  # Note: prices for 102 and 105 differ
}

# Step 2: Create DataFrames
company_df = pd.DataFrame(company_data)
trusted_df = pd.DataFrame(trusted_data)

# Step 3: Merge on product_id
merged_df = company_df.merge(trusted_df, on="product_id", suffixes=('_company', '_trusted'))

# Step 4: Compare prices
merged_df["price_match"] = merged_df["price_company"] == merged_df["price_trusted"]

# Step 5: Calculate price accuracy
total_products = merged_df.shape[0]
matched_prices = merged_df["price_match"].sum()
accuracy_percentage = (matched_prices / total_products) * 100

# Step 6: Output results
print(f"Total Products Compared: {total_products}")
print(f"Matching Prices: {matched_prices}")
print(f"Price Accuracy: {accuracy_percentage:.2f}%")

# Step 7: Show mismatches (if any)
mismatches = merged_df[~merged_df["price_match"]]
if not mismatches.empty:
    print("\nMismatched Entries:")
    print(mismatches[["product_id", "price_company", "price_trusted"]])
else:
    print("\nAll prices match with the trusted source.")

Total Products Compared: 5
Matching Prices: 3
Price Accuracy: 60.00%

Mismatched Entries:
   product_id  price_company  price_trusted
1         102           35.0           33.0
4         105           60.0           59.0
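
The comparison above uses exact float equality and an inner merge, so products present in only one dataset silently drop out. The following is a sketch of a more defensive variant. The helper name `price_accuracy`, the tolerance `tol`, and the sample rows are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def price_accuracy(company_df, trusted_df, tol=0.01):
    """Compare company prices with a trusted source.

    `tol` is a hypothetical absolute tolerance for float comparison.
    Returns the merged frame and the accuracy (%) over shared products.
    """
    merged = company_df.merge(
        trusted_df,
        on="product_id",
        how="outer",                      # keep products missing on either side
        suffixes=("_company", "_trusted"),
        indicator=True,                   # adds a "_merge" column
    )
    both = merged["_merge"] == "both"
    # np.isclose is False when either value is NaN, so unmatched rows never match
    merged["price_match"] = both & np.isclose(
        merged["price_company"], merged["price_trusted"], rtol=0, atol=tol
    )
    compared = int(both.sum())
    accuracy = merged.loc[both, "price_match"].mean() * 100 if compared else float("nan")
    return merged, accuracy


# Hypothetical sample data
company = pd.DataFrame({"product_id": [101, 102, 103, 106], "price": [20.0, 35.0, 50.0, 12.0]})
trusted = pd.DataFrame({"product_id": [101, 102, 103, 107], "price": [20.0, 33.0, 50.004, 9.0]})

merged, accuracy = price_accuracy(company, trusted)
print(f"Price accuracy on shared products: {accuracy:.2f}%")
print(merged[merged["_merge"] != "both"][["product_id", "_merge"]])
```

With real files, the two frames would come from `pd.read_csv("company_prices.csv")` and `pd.read_csv("trusted_prices.csv")`. Whether unmatched products should count against accuracy is a design choice; this sketch reports them separately.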


In [8]:
import pandas as pd
import os

# Step 1: Define the file name
file_path = "customer_data.csv"

# Step 2: Check if the file exists
if not os.path.exists(file_path):
    print(f"Error: File '{file_path}' not found. Please check the file name or path.")
else:
    # Step 3: Load the dataset
    df = pd.read_csv(file_path)

    # Step 4: Calculate missing data percentage
    missing_percentage = df.isnull().mean() * 100

    # Step 5: Display the results
    print("\nMissing Data Percentage by Column:\n")
    print(missing_percentage.sort_values(ascending=False).round(2))


Error: File 'customer_data.csv' not found. Please check the file name or path.
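
Since `customer_data.csv` was not available in this run, here is a minimal sketch of the same missing-rate calculation on an in-memory frame. The column names and rows are hypothetical. The overall rate across all cells is an extra figure added for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for customer_data.csv
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", None],
    "phone number": ["555-0101", "555-0102", None, "555-0104"],
})

# Share of missing values per column, in percent
missing_by_column = customers.isna().mean() * 100

# Share of missing values over every cell in the table, in percent
overall_missing = customers.isna().to_numpy().mean() * 100

print(missing_by_column.sort_values(ascending=False).round(2))
print(f"\nOverall missing rate: {overall_missing:.2f}%")
```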


### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with a missing "email" or "phone number" and decide whether to drop or fill them.

In [10]:
import pandas as pd
import os

# Step 1: Try loading the dataset
file_path = "customer_data.csv"

if not os.path.exists(file_path):
    print(f"❌ File not found: '{file_path}'\n📌 Please check the filename or place it in the same directory as this script.")
else:
    df = pd.read_csv(file_path)

    # Step 2: Identify records with missing 'email' or 'phone number'
    partial_records = df[df["email"].isnull() | df["phone number"].isnull()]
    
    print("📋 Records with missing 'email' or 'phone number':")
    print(partial_records)

    # Step 3A: Option to Drop records with missing contact info
    df_dropped = df.dropna(subset=["email", "phone number"])
    print(f"\n✅ Records after dropping missing contacts: {len(df_dropped)} out of {len(df)}")

    # Step 3B: Option to Fill missing values with placeholder
    df_filled = df.copy()
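    # Assign the result back; inplace=True on a column selection triggers a warning in newer pandas
    df_filled["email"] = df_filled["email"].fillna("unknown@example.com")
    df_filled["phone number"] = df_filled["phone number"].fillna("0000000000")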
    print("\n✅ Sample filled data (with placeholder values):")
    print(df_filled.head())

❌ File not found: 'customer_data.csv'
📌 Please check the filename or place it in the same directory as this script.
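
The cell above shows both options but leaves the drop-or-fill decision open. One possible rule, sketched below on hypothetical sample rows, is to drop a record only when both contact fields are missing and to flag the remaining partial records instead of filling them with placeholders. The rule and the `contact_complete` column name are assumptions for illustration, not a requirement of the task.

```python
import pandas as pd

# Hypothetical sample standing in for customer_data.csv
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "c@example.com", None],
    "phone number": ["555-0101", "555-0102", None, None],
})

missing_email = customers["email"].isna()
missing_phone = customers["phone number"].isna()

# Assumed rule: a customer with no email and no phone cannot be contacted, so drop it
unreachable = missing_email & missing_phone
kept = customers[~unreachable].copy()

# Flag partial records rather than inventing placeholder contact details
kept["contact_complete"] = kept[["email", "phone number"]].notna().all(axis=1)

print(f"Dropped {int(unreachable.sum())} unreachable record(s)")
print(kept)
```

Flagging keeps real missing values distinguishable from genuine data, which placeholder strings such as "0000000000" would obscure.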
