### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether the prices in `company_prices.csv` match the prices in
`trusted_prices.csv`. Assume both files have a `product_id` and a `price` column.

In [7]:
# Write your code from here
import pandas as pd

# Load the datasets
company_df = pd.read_csv("company_prices.csv")
trusted_df = pd.read_csv("trusted_prices.csv")

# Merge the two datasets on product_id
merged_df = company_df.merge(trusted_df, on="product_id", suffixes=('_company', '_trusted'))

# Compare the prices
merged_df["price_match"] = merged_df["price_company"] == merged_df["price_trusted"]

# Calculate accuracy
total_products = len(merged_df)
matching_prices = merged_df["price_match"].sum()
accuracy_percentage = (matching_prices / total_products) * 100

# Output results
print(f"Total products compared: {total_products}")
print(f"Matching prices: {matching_prices}")
print(f"Data Accuracy: {accuracy_percentage:.2f}%")

# Optional: Save mismatches to a CSV for review
mismatches = merged_df[~merged_df["price_match"]]
mismatches.to_csv("price_mismatches.csv", index=False)

Total products compared: 5
Matching prices: 3
Data Accuracy: 60.00%
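
The comparison above uses exact equality and an inner merge, so tiny floating-point differences count as mismatches and products present in only one file silently drop out of the total. The following is a minimal sketch of one possible variant, not the notebook's method: the helper name `price_accuracy`, the tolerance `tol`, and the sample data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def price_accuracy(company_df, trusted_df, tol=0.005):
    """Compare prices per product_id, tolerating small float differences."""
    merged = company_df.merge(
        trusted_df,
        on="product_id",
        how="outer",
        suffixes=("_company", "_trusted"),
        indicator=True,  # adds "_merge": both / left_only / right_only
    )
    # Only products found in both sources can be compared
    both = merged[merged["_merge"] == "both"].copy()
    both["price_match"] = np.isclose(
        both["price_company"], both["price_trusted"], rtol=0, atol=tol
    )
    # Products that appear in just one of the two sources
    unmatched = merged[merged["_merge"] != "both"]
    accuracy = both["price_match"].mean() * 100 if len(both) else float("nan")
    return accuracy, both, unmatched


# Hypothetical sample data, not the contents of the real CSV files
company = pd.DataFrame(
    {"product_id": [1, 2, 3, 4], "price": [9.99, 20.0, 5.5, 7.0]}
)
trusted = pd.DataFrame(
    {"product_id": [1, 2, 3, 5], "price": [9.99, 21.0, 5.5000001, 3.0]}
)

accuracy, compared, unmatched = price_accuracy(company, trusted)
print(f"Data Accuracy: {accuracy:.2f}%")
print(unmatched[["product_id", "_merge"]])
```

Reporting the unmatched products separately keeps the accuracy figure honest: a product missing from the trusted source is a coverage problem, not a price mismatch.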


### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative price values, which are invalid for prices.

In [8]:
# Write your code from here
import pandas as pd

# Load the dataset
company_df = pd.read_csv("company_prices.csv")

# Detect negative price values
invalid_prices = company_df[company_df["price"] < 0]

# Report results
print(f"Total records with negative prices: {len(invalid_prices)}")
print(invalid_prices)

# Optional: Save the incorrect records to a separate CSV for review
invalid_prices.to_csv("negative_price_records.csv", index=False)

Total records with negative prices: 0
Empty DataFrame
Columns: [product_id, price]
Index: []
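
A negative-value filter only works when the `price` column is already numeric. As a hedged sketch, the check could be widened to also flag missing and non-numeric entries. The sample frame and the flag column names (`is_negative`, `is_missing`, `is_non_numeric`) are hypothetical and not part of the original task.

```python
import pandas as pd

# Hypothetical sample data, not the contents of company_prices.csv
sample = pd.DataFrame(
    {"product_id": [1, 2, 3, 4, 5], "price": ["19.99", "-5.00", "abc", None, "0"]}
)

# Coerce to numbers; anything unparseable becomes NaN instead of raising
numeric_price = pd.to_numeric(sample["price"], errors="coerce")

report = sample.assign(
    is_negative=numeric_price < 0,
    is_missing=sample["price"].isna(),
    # NaN after coercion but present in the raw data means "not a number"
    is_non_numeric=numeric_price.isna() & sample["price"].notna(),
)

flags = ["is_negative", "is_missing", "is_non_numeric"]
invalid = report[report[flags].any(axis=1)]
print(invalid)
```

Keeping one flag column per rule makes it easy to see why each record was rejected when the file is reviewed later.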


### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in `customer_data.csv`.

In [9]:
# Write your code from here
import pandas as pd

# Load the dataset
customer_df = pd.read_csv("customer_data.csv")

# Calculate missing value percentages
missing_percentages = customer_df.isnull().mean() * 100

# Display the result
print("Percentage of Missing Values in Each Column:")
print(missing_percentages.sort_values(ascending=False))

# Optional: Save to a CSV file
missing_percentages.to_csv("missing_data_report.csv", header=["missing_percentage"])

Percentage of Missing Values in Each Column:
phone_number    37.5
address         25.0
email           25.0
customer_id      0.0
name             0.0
dtype: float64
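
The per-column rates above can be complemented by table-level and row-level views of the same information. This is a small sketch on made-up data. The variable names are illustrative assumptions, and the numbers do not describe the real `customer_data.csv`.

```python
import pandas as pd

# Hypothetical sample data, not the contents of customer_data.csv
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", None],
        "phone_number": ["111", "222", None, None],
    }
)

# Share of all cells in the table that are missing
overall_missing_pct = sample.isna().to_numpy().mean() * 100

# Share of rows that have at least one missing value
rows_with_missing_pct = sample.isna().any(axis=1).mean() * 100

# Number of missing fields in each row
missing_per_row = sample.isna().sum(axis=1)

print(f"Overall missing cells: {overall_missing_pct:.2f}%")
print(f"Rows with any missing value: {rows_with_missing_pct:.2f}%")
print(missing_per_row)
```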


### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with a missing `email` or `phone_number` and decide whether to drop or fill them.

In [10]:
# Write your code from here
import pandas as pd

# Load the dataset
df = pd.read_csv("customer_data.csv")

# Identify records with missing email or phone number
partial_records = df[df["email"].isnull() | df["phone_number"].isnull()]

print("Partially available records (missing email or phone number):")
print(partial_records)

# OPTION 1: Drop incomplete records
df_dropped = df.dropna(subset=["email", "phone_number"])
print(f"\nRecords after dropping incomplete entries: {len(df_dropped)}")

# OPTION 2: Fill missing values with placeholder text
df_filled = df.copy()
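# Assign the result back rather than calling fillna(..., inplace=True) on a column selection
df_filled["email"] = df_filled["email"].fillna("no_email@example.com")
df_filled["phone_number"] = df_filled["phone_number"].fillna("0000000000")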

print("\nRecords after filling missing values:")
print(df_filled)

# Save outputs for review
partial_records.to_csv("partial_records.csv", index=False)
df_dropped.to_csv("cleaned_data_dropped.csv", index=False)
df_filled.to_csv("cleaned_data_filled.csv", index=False)

Partially available records (missing email or phone number):
   customer_id     name                email  phone_number          address
1            2      Bob                  NaN  9.876543e+09   456 Orange Ave
2            3  Charlie  charlie@example.com           NaN  789 Banana Blvd
4            5      Eva                  NaN           NaN              NaN
6            7    Grace    grace@example.com           NaN     654 Grape Ln

Records after dropping incomplete entries: 4

Records after filling missing values:
   customer_id     name                 email  phone_number          address
0            1    Alice     alice@example.com  1234567890.0     123 Apple St
1            2      Bob  no_email@example.com  9876543210.0   456 Orange Ave
2            3  Charlie   charlie@example.com    0000000000  789 Banana Blvd
3            4    David     david@example.com  5555555555.0              NaN
4            5      Eva  no_email@example.com    0000000000              NaN
5           

**Note**: pandas emits a FutureWarning for the chained form `df[col].fillna(value, inplace=True)`. The behavior will change in pandas 3.0: this in-place call will never update the original DataFrame, because the intermediate object on which the values are set always behaves as a copy. Use `df.fillna({col: value}, inplace=True)` or `df[col] = df[col].fillna(value)` to perform the operation on the original object.
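
In the output above, phone numbers print as `9.876543e+09` and `1234567890.0` because a numeric column that contains missing values is parsed as float. One way to avoid this, sketched here under assumptions, is to read `phone_number` as text and fill both columns with a single DataFrame-level `fillna`. The in-memory CSV and the `was_incomplete` flag column are hypothetical stand-ins, not part of the original data.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for customer_data.csv
csv_text = """customer_id,name,email,phone_number
1,Alice,alice@example.com,1234567890
2,Bob,,9876543210
3,Charlie,charlie@example.com,
"""

# dtype=str keeps the digits exactly as written instead of converting to float
df = pd.read_csv(io.StringIO(csv_text), dtype={"phone_number": str})

# Flag rows before filling so placeholders can be told apart from real data
df["was_incomplete"] = df[["email", "phone_number"]].isna().any(axis=1)

# A dict passed to DataFrame.fillna fills each column with its own value
df_filled = df.fillna(
    {"email": "no_email@example.com", "phone_number": "0000000000"}
)
print(df_filled)
```

Whether to drop or fill depends on how the data is used downstream: placeholders keep the row count stable, but they should stay distinguishable from real values, which is what the flag column is for.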
