### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether each price in `company_prices.csv` matches the price for the same product in
`trusted_prices.csv`. Assume both files have `product_id` and `price` columns.

In [1]:
import pandas as pd

# Data for company_prices.csv
company_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [19.99, 29.49, 15.00, 45.50, 10.99]
}
company_df = pd.DataFrame(company_data)
company_df.to_csv('company_prices.csv', index=False)

# Data for trusted_prices.csv
trusted_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [19.99, 29.99, 15.00, 45.50, 11.49]
}
trusted_df = pd.DataFrame(trusted_data)
trusted_df.to_csv('trusted_prices.csv', index=False)

print("CSV files created successfully!")


CSV files created successfully!


In [2]:
import pandas as pd

# Load the datasets
company_df = pd.read_csv('company_prices.csv')
trusted_df = pd.read_csv('trusted_prices.csv')

# Merge datasets on product_id (default inner join: products missing
# from either file are excluded from the comparison)
merged_df = pd.merge(company_df, trusted_df, on='product_id', suffixes=('_company', '_trusted'))

# Compare prices
merged_df['price_match'] = merged_df['price_company'] == merged_df['price_trusted']

# Calculate accuracy
accuracy = merged_df['price_match'].mean() * 100

print(f"Price accuracy: {accuracy:.2f}%")

# Optional: Show mismatches
mismatches = merged_df[~merged_df['price_match']]
print(f"Number of mismatches: {len(mismatches)}")
print(mismatches[['product_id', 'price_company', 'price_trusted']])


Price accuracy: 60.00%
Number of mismatches: 2
   product_id  price_company  price_trusted
1         102          29.49          29.99
4         105          10.99          11.49
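
The comparison above relies on exact float equality and an inner join. As a hedged alternative, the sketch below compares prices within a tolerance and keeps products that appear in only one file. The helper name `compare_prices` and the half-cent tolerance `tol` are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd


def compare_prices(company_df, trusted_df, tol=0.005):
    """Compare prices per product_id; tol is an assumed absolute tolerance."""
    merged = pd.merge(
        company_df, trusted_df,
        on='product_id', how='outer',
        suffixes=('_company', '_trusted'),
        indicator=True,  # adds '_merge': 'both', 'left_only' or 'right_only'
    )
    # np.isclose treats NaN as not close, so a product missing from one
    # file is counted as a mismatch rather than silently dropped.
    merged['price_match'] = np.isclose(
        merged['price_company'].to_numpy(),
        merged['price_trusted'].to_numpy(),
        rtol=0, atol=tol,
    )
    return merged


# Same sample values as the cells above
company_df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105],
    'price': [19.99, 29.49, 15.00, 45.50, 10.99],
})
trusted_df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105],
    'price': [19.99, 29.99, 15.00, 45.50, 11.49],
})

result = compare_prices(company_df, trusted_df)
accuracy = result['price_match'].mean() * 100
mismatch_ids = result.loc[~result['price_match'], 'product_id'].tolist()
print(f"Price accuracy: {accuracy:.2f}%")
print(f"Mismatched product IDs: {mismatch_ids}")
```

With the sample data this flags the same two products (102 and 105). The `_merge` column would additionally show which file a product is missing from if the two ID sets differed.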


### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative prices, which are invalid values for a price.

In [3]:
import pandas as pd

# Load the company prices dataset
company_df = pd.read_csv('company_prices.csv')

# Filter rows with negative prices
negative_prices = company_df[company_df['price'] < 0]

if not negative_prices.empty:
    print("Negative prices found:")
    print(negative_prices)
else:
    print("No negative prices detected.")


No negative prices detected.
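
Negative numbers are only one kind of invalid price. The sketch below extends the same filter to missing and non-numeric entries by coercing the column with `pd.to_numeric`. The `sample_df` values are hypothetical and chosen to trigger each check; they are not taken from `company_prices.csv`.

```python
import pandas as pd

# Hypothetical sample data (not from company_prices.csv)
sample_df = pd.DataFrame({
    'product_id': [201, 202, 203, 204],
    'price': ['19.99', '-5.00', 'N/A', None],
})

# Coerce to numeric; unparseable entries become NaN instead of raising.
numeric_price = pd.to_numeric(sample_df['price'], errors='coerce')

checks = pd.DataFrame({
    'product_id': sample_df['product_id'],
    'is_missing': sample_df['price'].isna(),
    # Present in the raw column but not parseable as a number
    'is_non_numeric': numeric_price.isna() & sample_df['price'].notna(),
    # Comparisons with NaN evaluate to False, so only real negatives match
    'is_negative': numeric_price < 0,
})
checks['is_invalid'] = checks[['is_missing', 'is_non_numeric', 'is_negative']].any(axis=1)
print(checks)
```

Keeping one boolean column per rule makes it easy to report why a row was rejected, not just that it was.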


### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in each column of `customer_data.csv`.

In [4]:

import pandas as pd

# Load the customer data
customer_df = pd.read_csv('/workspaces/AI_DATA_ANALYSIS_/src/Module 3/Hands-on - Data Quality Assessment & Profiling/customer_data.csv')

# Calculate percentage of missing values per column
missing_percentage = customer_df.isnull().mean() * 100

print("Percentage of missing values per column:")
print(missing_percentage)


Percentage of missing values per column:
CustomerID     0.0
Name           0.0
Email         10.0
Phone         10.0
Gender         0.0
dtype: float64
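
Per-column percentages can be complemented by a table-wide rate and a row-level rate. The sketch below shows one way to compute all three. The small `df` is a hypothetical stand-in, not the real `customer_data.csv`.

```python
import pandas as pd

# Hypothetical stand-in for customer_data.csv
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Email': ['a@example.com', None, 'c@example.com', None],
    'Phone': ['111-111-1111', '222-222-2222', None, None],
})

is_missing = df.isnull()

# Share of missing cells across the whole table
overall_pct = is_missing.to_numpy().mean() * 100

# Share of rows that have at least one missing value
rows_with_missing_pct = is_missing.any(axis=1).mean() * 100

# Counts and percentages side by side, one row per column
summary = pd.DataFrame({
    'missing_count': is_missing.sum(),
    'missing_pct': is_missing.mean() * 100,
})

print(summary)
print(f"Overall missing cells: {overall_pct:.2f}%")
print(f"Rows with any missing value: {rows_with_missing_pct:.2f}%")
```

The row-level rate is the more useful number when deciding whether dropping incomplete records is affordable.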


### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with a missing `Email` or `Phone` value and decide whether to drop or fill them.

In [7]:
import pandas as pd

# Load customer data
customer_df = pd.read_csv('/workspaces/AI_DATA_ANALYSIS_/src/Module 3/Hands-on - Data Quality Assessment & Profiling/customer_data.csv')

# Identify records with missing email or phone number
missing_contact = customer_df[customer_df['Email'].isnull() | customer_df['Phone'].isnull()]

print(f"Records with missing email or phone number: {len(missing_contact)}")
print(missing_contact)

# Option 1: Drop records with missing email or phone number
dropped_df = customer_df.dropna(subset=['Email', 'Phone'])
print(f"\nAfter dropping incomplete records, remaining rows: {len(dropped_df)}")

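# Option 2: Fill missing values with placeholder strings.
# Assign the result back instead of calling fillna(inplace=True) on a
# selected column, which is deprecated chained assignment in recent pandas.
filled_df = customer_df.copy()
filled_df['Email'] = filled_df['Email'].fillna('noemail@example.com')
filled_df['Phone'] = filled_df['Phone'].fillna('000-000-0000')

print("\nAfter filling missing values:")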
print(filled_df.loc[filled_df['Email'] == 'noemail@example.com'])
print(filled_df.loc[filled_df['Phone'] == '000-000-0000'])

Records with missing email or phone number: 2
   CustomerID     Name                Email         Phone Gender
2           3  Charlie  charlie@example.com           NaN   Male
7           8    Henry                  NaN  789-012-3456   Male

After dropping incomplete records, remaining rows: 8

After filling missing values:
   CustomerID   Name                Email         Phone Gender
7           8  Henry  noemail@example.com  789-012-3456   Male
   CustomerID     Name                Email         Phone Gender
2           3  Charlie  charlie@example.com  000-000-0000   Male
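
The cell above shows both options but leaves the choice open. One possible rule, sketched below, is to drop incomplete rows only when they are a small share of the data and fill them otherwise. The helper `handle_missing_contacts`, the 5% cutoff `max_drop_pct`, and the sample `customers` frame are all assumptions for illustration, not part of the exercise.

```python
import pandas as pd


def handle_missing_contacts(df, columns, placeholders, max_drop_pct=5.0):
    """Drop incomplete rows if few enough, otherwise fill with placeholders.

    max_drop_pct is an assumed cutoff, not a rule from the exercise.
    """
    incomplete_pct = df[columns].isnull().any(axis=1).mean() * 100
    if incomplete_pct <= max_drop_pct:
        return df.dropna(subset=columns), 'dropped'
    # Passing a dict fills each column with its own placeholder.
    return df.fillna(value=placeholders), 'filled'


# Hypothetical stand-in for customer_data.csv
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Email': ['a@example.com', None, 'c@example.com', 'd@example.com'],
    'Phone': ['111-111-1111', '222-222-2222', None, '444-444-4444'],
})
placeholders = {'Email': 'noemail@example.com', 'Phone': '000-000-0000'}

cleaned, action = handle_missing_contacts(
    customers, columns=['Email', 'Phone'], placeholders=placeholders
)
print(f"Action taken: {action}, rows kept: {len(cleaned)}")
```

Here half of the sample rows are incomplete, so the helper fills rather than drops. Whether placeholder values are acceptable depends on how the contact fields are used downstream, so the cutoff should be set per project.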
