### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether the prices in `company_prices.csv` match the prices in
`trusted_prices.csv`. Assume both files have "product_id" and "price" columns.

In [9]:
import pandas as pd

# Sample company prices data
company_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [9.99, 15.49, 7.99, 20.00, 12.50]
}

# Sample trusted prices data
trusted_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [9.99, 15.00, 7.99, 20.00, 13.00]
}

# Create dataframes
company_df = pd.DataFrame(company_data)
trusted_df = pd.DataFrame(trusted_data)

# Inner merge on product_id to compare prices side by side;
# only products present in both datasets are compared
merged = pd.merge(company_df, trusted_df, on='product_id', suffixes=('_company', '_trusted'))

# Find mismatches where prices differ
mismatches = merged[merged['price_company'] != merged['price_trusted']]

# Summary stats (guard against an empty merge to avoid division by zero)
total_products = merged.shape[0]
mismatch_count = mismatches.shape[0]
accuracy_percent = (
    (total_products - mismatch_count) / total_products * 100
    if total_products else float('nan')
)

print(f"Total products compared: {total_products}")
print(f"Number of price mismatches: {mismatch_count}")
print(f"Price accuracy: {accuracy_percent:.2f}%")

if mismatch_count > 0:
    print("\nMismatched prices:")
    print(mismatches[['product_id', 'price_company', 'price_trusted']])
else:
    print("All prices match perfectly!")


Total products compared: 5
Number of price mismatches: 2
Price accuracy: 60.00%

Mismatched prices:
   product_id  price_company  price_trusted
1         102          15.49           15.0
4         105          12.50           13.0
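
Two caveats apply to the comparison above. Exact `!=` on floats can flag prices that differ only by rounding error once prices are computed rather than typed in, and the default inner merge silently ignores products that appear in only one dataset. The following is a minimal sketch of a more defensive check, assuming an absolute tolerance of half a cent. The function name `compare_prices`, the `tol` parameter, and the sample values are illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd


def compare_prices(company_df, trusted_df, tol=0.005):
    """Compare prices with a tolerance and report products missing from either side.

    `tol` is an assumed absolute tolerance (half a cent), not a value from the exercise.
    """
    merged = pd.merge(
        company_df, trusted_df,
        on="product_id", how="outer",
        suffixes=("_company", "_trusted"),
        indicator=True,  # adds a `_merge` column: both / left_only / right_only
    )
    both = merged[merged["_merge"] == "both"]
    unmatched = merged[merged["_merge"] != "both"]

    # np.isclose treats values within `tol` as equal (rtol=0 keeps the tolerance absolute)
    close = np.isclose(both["price_company"], both["price_trusted"], rtol=0, atol=tol)
    mismatches = both.loc[~close, ["product_id", "price_company", "price_trusted"]]
    return mismatches, unmatched


# Hypothetical sample data: 101 differs only by float rounding, 106 and 107 are unmatched
company_df = pd.DataFrame({
    "product_id": [101, 102, 103, 106],
    "price": [0.1 + 0.2, 15.49, 7.99, 4.25],
})
trusted_df = pd.DataFrame({
    "product_id": [101, 102, 103, 107],
    "price": [0.3, 15.00, 7.99, 3.10],
})

mismatches, unmatched = compare_prices(company_df, trusted_df)
print(mismatches)
print(unmatched[["product_id", "_merge"]])
```

With this sample, only product 102 is reported as a real mismatch, while products 106 and 107 are listed separately as present in only one dataset.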


### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative price values, which are invalid for prices.

In [10]:
import pandas as pd

# Sample company prices data with some negative values to simulate errors
company_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [9.99, -15.49, 7.99, -20.00, 12.50]
}

company_df = pd.DataFrame(company_data)

# Detect negative prices
negative_prices = company_df[company_df['price'] < 0]

if not negative_prices.empty:
    print("Negative price values detected:")
    print(negative_prices)
else:
    print("No negative price values found.")


Negative price values detected:
   product_id  price
1         102 -15.49
3         104 -20.00
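
Negative values are only one kind of incorrect price. When the data is read from a real CSV file, prices may also arrive as unparseable strings or be missing altogether. The following is a sketch of a broader check, assuming that a valid price is a number that is not negative. The helper name `find_invalid_prices` and the `reason` column are hypothetical.

```python
import pandas as pd


def find_invalid_prices(df, column="price"):
    """Return rows whose price is negative, missing, or cannot be parsed as a number."""
    # errors="coerce" turns unparseable entries (e.g. "N/A") into NaN
    numeric = pd.to_numeric(df[column], errors="coerce")

    invalid = df[numeric.isna() | (numeric < 0)].copy()
    # Label each flagged row so the cause is visible in the report
    invalid["reason"] = "negative"
    invalid.loc[numeric[invalid.index].isna(), "reason"] = "missing or not numeric"
    return invalid


# Hypothetical sample data with negative, non-numeric, and missing prices
company_df = pd.DataFrame({
    "product_id": [101, 102, 103, 104, 105],
    "price": [9.99, -15.49, "N/A", -20.00, None],
})

print(find_invalid_prices(company_df))
```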


### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in `customer_data.csv`.

In [11]:
import pandas as pd
import numpy as np

# Simulated customer data with some missing values
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, 22, np.nan],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', np.nan, 'eva@example.com']
}

df = pd.DataFrame(data)

# Calculate percentage of missing values per column
missing_percent = df.isnull().mean() * 100

print("Percentage of missing values per column:")
print(missing_percent)


Percentage of missing values per column:
CustomerID     0.0
Name          20.0
Age           40.0
Email         20.0
dtype: float64
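
The per-column rates above can be complemented with a single overall figure and a per-row view. A minimal sketch on the same sample data follows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", np.nan, "David", "Eva"],
    "Age": [25, np.nan, 30, 22, np.nan],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com",
              np.nan, "eva@example.com"],
})

is_missing = df.isnull()

# Overall rate: missing cells divided by all cells in the table
overall_percent = is_missing.to_numpy().mean() * 100

# Share of rows that have at least one missing value
rows_incomplete_percent = is_missing.any(axis=1).mean() * 100

print(f"Overall missing rate: {overall_percent:.1f}%")
print(f"Rows with at least one missing value: {rows_incomplete_percent:.1f}%")
```

For this sample, 4 of the 20 cells are missing (20.0%), yet 4 of the 5 rows are incomplete (80.0%), which shows how a low cell-level rate can hide a high share of affected records.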


### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with missing "email" or "phone number" and decide whether to drop or fill them.

In [12]:
import pandas as pd
import numpy as np

# Sample data simulating missing emails and phone numbers
data = {
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Email': ['alice@example.com', np.nan, 'charlie@example.com', None, 'eva@example.com'],
    'Phone': ['123-456-7890', '234-567-8901', None, np.nan, '567-890-1234']
}

df = pd.DataFrame(data)

# Identify records with missing Email or Phone
missing_contact = df[df['Email'].isnull() | df['Phone'].isnull()]
print("Records with missing Email or Phone:\n", missing_contact)

# Option 1: Drop records with missing Email or Phone
df_dropped = df.dropna(subset=['Email', 'Phone'])
print("\nData after dropping records with missing Email or Phone:\n", df_dropped)

# Option 2: Fill missing Email and Phone with placeholders
df_filled = df.copy()
df_filled['Email'] = df_filled['Email'].fillna('no_email@unknown.com')
df_filled['Phone'] = df_filled['Phone'].fillna('000-000-0000')

print("\nData after filling missing Email and Phone with placeholders:\n", df_filled)


Records with missing Email or Phone:
    CustomerID     Name                Email         Phone
1           2      Bob                  NaN  234-567-8901
2           3  Charlie  charlie@example.com          None
3           4    David                 None           NaN

Data after dropping records with missing Email or Phone:
    CustomerID   Name              Email         Phone
0           1  Alice  alice@example.com  123-456-7890
4           5    Eva    eva@example.com  567-890-1234

Data after filling missing Email and Phone with placeholders:
    CustomerID     Name                 Email         Phone
0           1    Alice     alice@example.com  123-456-7890
1           2      Bob  no_email@unknown.com  234-567-8901
2           3  Charlie   charlie@example.com  000-000-0000
3           4    David  no_email@unknown.com  000-000-0000
4           5      Eva       eva@example.com  567-890-1234
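
One way to turn "drop or fill" into an explicit decision is a simple rule: drop a record only when no contact channel is left, and otherwise keep it while marking what is missing, rather than inserting placeholder values that look like real data. The following sketch shows one such policy; it is an assumption, not the exercise's required answer, and the flag column names `email_missing` and `phone_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Email": ["alice@example.com", np.nan, "charlie@example.com", None, "eva@example.com"],
    "Phone": ["123-456-7890", "234-567-8901", None, np.nan, "567-890-1234"],
})

# Drop only rows where BOTH contact fields are missing (how="all")
reachable = df.dropna(subset=["Email", "Phone"], how="all").copy()

# Keep the gaps as missing values, but record them in explicit flag columns
reachable["email_missing"] = reachable["Email"].isnull()
reachable["phone_missing"] = reachable["Phone"].isnull()

print(reachable)
```

On the sample data this removes only David (CustomerID 4), who has neither an email nor a phone number, and keeps Bob and Charlie because each can still be reached through one channel.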
