# Ex No 2.a Analyzing Academic Performance
Problem Statement:
Assess the performance trends of students across different subjects, focusing on data imperfections such as missing values, duplicates, and the need to uniquely identify records.

Objective Scenario:
A high school intends to analyze semester results to enhance teaching strategies and provide targeted support. The dataset includes student roll numbers but contains issues like missing entries and duplicates that must be addressed for accurate analysis.

Dataset:
Data is provided as a Python list of tuples, one per record, each holding a student's roll number followed by that student's scores in Mathematics, Science, and English. The list includes missing entries (represented as None) and intentionally duplicated records. Example: [(101, 45, 78, None), (102, 65, 56, 77), (103, 95, 85, 92), (102, 65, 56, 77), (104, 45, None, 88), (101, 45, 78, None)].

Tasks to be performed:

Data Conversion and Inspection:

- Convert the list of student records into a NumPy array.
- Identify and count the number of missing values in each subject.
- Detect duplicate entries based on student roll numbers.
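
The inspection tasks above can also be sketched with a float array, where NumPy stores each None as NaN. This is an alternative to an object array, not the method this exercise prescribes; the names `missing_per_subject` and `duplicated_rolls` are illustrative.

```python
import numpy as np

# Sample records from the exercise: (roll number, Mathematics, Science, English)
data = [
    (101, 45, 78, None),
    (102, 65, 56, 77),
    (103, 95, 85, 92),
    (102, 65, 56, 77),
    (104, 45, None, 88),
    (101, 45, 78, None),
]

# With dtype=float, each None is stored as NaN
arr = np.array(data, dtype=float)

# Count NaN entries in each subject column (columns 1 to 3)
missing_per_subject = np.isnan(arr[:, 1:]).sum(axis=0)

# Roll numbers that occur more than once
rolls, counts = np.unique(arr[:, 0], return_counts=True)
duplicated_rolls = rolls[counts > 1]

print("Missing per subject:", missing_per_subject.tolist())
print("Duplicated roll numbers:", duplicated_rolls.astype(int).tolist())
```

A float array makes `np.isnan` and the NaN-aware reductions available, at the cost of storing roll numbers as floats.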

Data Cleaning:
- Handle missing values by replacing them with the median score of the respective subject.
- Remove duplicate records, ensuring data integrity by retaining only the first occurrence of each student's record.
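
The order of the two cleaning steps affects the imputed values, because duplicated rows count twice when the median is taken over all rows. The sketch below compares both orders on the sample data, assuming a float array with NaN for missing entries; which order is appropriate is a judgment call that the tasks above do not settle.

```python
import numpy as np

data = [
    (101, 45, 78, None),
    (102, 65, 56, 77),
    (103, 95, 85, 92),
    (102, 65, 56, 77),
    (104, 45, None, 88),
    (101, 45, 78, None),
]
arr = np.array(data, dtype=float)  # None becomes NaN

# Row index of the first occurrence of each roll number, in original order
_, first_idx = np.unique(arr[:, 0], return_index=True)
first_idx = np.sort(first_idx)

# Order A (as listed in the tasks): medians over all rows, duplicates included
medians_all = np.nanmedian(arr[:, 1:], axis=0)

# Order B (alternative): drop duplicates first, then take medians
deduped = arr[first_idx]
medians_dedup = np.nanmedian(deduped[:, 1:], axis=0)

print("Medians over all rows:   ", medians_all)
print("Medians over unique rows:", medians_dedup)

# Fill the NaN entries using order B and reattach the roll numbers
scores = deduped[:, 1:]
filled = np.where(np.isnan(scores), medians_dedup, scores)
cleaned = np.column_stack([deduped[:, 0], filled])
print(cleaned)
```

On this data the English median is 82.5 over all six rows but 88 over the four unique rows, so the two orders produce different cleaned values.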

Data Normalization:
Apply min-max normalization to the scores (excluding roll numbers) to scale them between 0 and 1. This adjustment facilitates fair comparisons across different subjects.
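
Min-max normalization computes (x - min) / (max - min) per column, which divides by zero when every value in a column is identical. The helper below is a minimal sketch that guards against that case; `min_max_normalize` is a hypothetical name, the `demo` values are made up for illustration, and mapping a constant column to 0 is an assumed convention rather than part of the exercise.

```python
import numpy as np

def min_max_normalize(scores):
    """Scale each column to [0, 1]; constant columns map to 0 (assumed convention)."""
    scores = np.asarray(scores, dtype=float)
    col_min = scores.min(axis=0)
    col_range = scores.max(axis=0) - col_min
    # Divide by 1 instead of 0 where a column has no spread
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (scores - col_min) / safe_range

# Illustrative input: the second column is constant
demo = np.array([[45.0, 70.0],
                 [65.0, 70.0],
                 [95.0, 70.0]])
normalized = min_max_normalize(demo)
print(normalized)
```

The first column scales to 0, 0.4, and 1, while the constant column becomes all zeros instead of NaN.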

Statistical Analysis:
- Calculate the mean, median, and standard deviation of the normalized scores for each subject.
- Identify the subject with the highest variability in scores.
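
When reading the variability result, note that min-max scaling divides each column by its own range, so the standard deviation of the normalized scores equals the raw standard deviation divided by that range. The sketch below checks this on the cleaned scores that the sample data yields, and also shows the sample standard deviation (`ddof=1`) next to NumPy's default population value. The variable names are illustrative.

```python
import numpy as np

subjects = ["Mathematics", "Science", "English"]

# Cleaned scores for roll numbers 101 to 104 (missing values filled with medians)
scores = np.array([
    [45.0, 78.0, 82.5],
    [65.0, 56.0, 77.0],
    [95.0, 85.0, 92.0],
    [45.0, 78.0, 88.0],
])

col_range = scores.max(axis=0) - scores.min(axis=0)
normalized = (scores - scores.min(axis=0)) / col_range

std_normalized = normalized.std(axis=0)      # population std (ddof=0), the default
std_sample = normalized.std(axis=0, ddof=1)  # sample std
std_raw = scores.std(axis=0)                 # spread in original score points

print("Std of normalized scores:", std_normalized)
print("Sample std (ddof=1):     ", std_sample)
print("Std of raw scores:       ", std_raw)
print("Most variable (normalized):", subjects[int(np.argmax(std_normalized))])
print("Most variable (raw):       ", subjects[int(np.argmax(std_raw))])
```

For this data both views pick Mathematics, but the two rankings need not agree in general, since a subject with a narrow raw range can still look widely spread after scaling.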

In [1]:
import numpy as np

# -------------------------------
# 1. Dataset Setup
# -------------------------------
data = [
    (101, 45, 78, None),
    (102, 65, 56, 77),
    (103, 95, 85, 92),
    (102, 65, 56, 77),
    (104, 45, None, 88),
    (101, 45, 78, None)
]

# Convert to NumPy array (object type to allow None)
arr = np.array(data, dtype=object)

# -------------------------------
# 2. Data Inspection
# -------------------------------
# Count missing values per subject (columns 1 to 3)
missing_counts = [(arr[:, i] == None).sum() for i in range(1, 4)]
print("Missing values per subject (Math, Science, English):", missing_counts)

# Detect duplicate roll numbers
_, unique_indices = np.unique(arr[:, 0], return_index=True)
duplicate_indices = np.setdiff1d(np.arange(arr.shape[0]), unique_indices)
print("Duplicate entries based on roll number:", arr[duplicate_indices])

# -------------------------------
# 3. Data Cleaning
# -------------------------------
# Replace None with the per-subject median
# (computed over all rows, duplicates included, because de-duplication runs afterwards)
cleaned_arr = arr.copy()
for col in range(1, 4):
    # Get column without None values
    col_values = [x for x in cleaned_arr[:, col] if x is not None]
    median_val = np.median(col_values)
    # Replace None with median
    cleaned_arr[:, col] = [median_val if x is None else x for x in cleaned_arr[:, col]]

# Remove duplicates, keeping first occurrence
cleaned_arr = cleaned_arr[np.sort(unique_indices)]

print("\nCleaned Data:\n", cleaned_arr)

# -------------------------------
# 4. Data Normalization (Min-Max)
# -------------------------------
scores = cleaned_arr[:, 1:].astype(float)
min_vals = scores.min(axis=0)
max_vals = scores.max(axis=0)
normalized_scores = (scores - min_vals) / (max_vals - min_vals)

print("\nNormalized Scores:\n", normalized_scores)

# -------------------------------
# 5. Statistical Analysis
# -------------------------------
means = normalized_scores.mean(axis=0)
medians = np.median(normalized_scores, axis=0)
std_devs = normalized_scores.std(axis=0)  # population standard deviation (ddof=0)

subjects = ["Mathematics", "Science", "English"]
variability_subject = subjects[np.argmax(std_devs)]

print("\nNormalized Means:", means)
print("Normalized Medians:", medians)
print("Normalized Standard Deviations:", std_devs)
print("Subject with highest variability:", variability_subject)


Missing values per subject (Math, Science, English): [np.int64(0), np.int64(1), np.int64(2)]
Duplicate entries based on roll number: [[102 65 56 77]
 [101 45 78 None]]

Cleaned Data:
 [[101 45 78 np.float64(82.5)]
 [102 65 56 77]
 [103 95 85 92]
 [104 45 np.float64(78.0) 88]]

Normalized Scores:
 [[0.         0.75862069 0.36666667]
 [0.4        0.         0.        ]
 [1.         1.         1.        ]
 [0.         0.75862069 0.73333333]]

Normalized Means: [0.35       0.62931034 0.525     ]
Normalized Medians: [0.2        0.75862069 0.55      ]
Normalized Standard Deviations: [0.40926764 0.37645872 0.37739973]
Subject with highest variability: Mathematics


# Ex No 2.b Analyzing Healthcare Service Efficiency
Problem Statement:
Evaluate the performance trends of different hospital departments, addressing issues such as missing data, duplicates, and the need for unique identifiers.

Objective Scenario:
A hospital's administration wants to analyze patient treatment outcomes to enhance service delivery and target improvements in care. The dataset includes department ID numbers but contains missing entries and duplicates that must be resolved before the analysis can be trusted.

Dataset:
Data is provided as a Python list of tuples, one per record, each holding a department's ID followed by that department's performance scores for patient satisfaction, treatment success rate, and wait times. The list includes missing entries (represented as None) and intentionally duplicated records. Example: [(701, 90, 95, None), (702, 88, 90, 85), (703, 92, None, 80), (702, 88, 90, 85), (704, 85, None, 78), (701, 90, 95, None)].

Tasks to be performed:

Data Conversion and Inspection:

- Convert the list of department records into a NumPy array.
- Identify and count the number of missing values in each performance metric.
- Detect duplicate entries based on department ID numbers.

Data Cleaning:

- Handle missing values by replacing them with the median score of the respective metric.
- Remove duplicate records, ensuring data integrity by retaining only the first occurrence of each department's record.

Data Normalization:

Apply min-max normalization to the scores (excluding ID numbers) to scale them between 0 and 1. This adjustment facilitates fair comparisons across different metrics.
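
One point to consider when comparing metrics is direction. The sketch below assumes, as a labeled assumption that the dataset description does not confirm, that the wait-time column is one where lower values are better. Under that assumption the normalized column can be flipped so that 1 is the best value for every metric. The values are the cleaned wait-time scores that the sample data yields, and `wait_inverted` is an illustrative name.

```python
import numpy as np

# Cleaned wait-time column for departments 701 to 704
wait_raw = np.array([82.5, 85.0, 80.0, 78.0])

# Standard min-max scaling to [0, 1]
wait_norm = (wait_raw - wait_raw.min()) / (wait_raw.max() - wait_raw.min())

# ASSUMPTION: lower wait time is better, so flip the scale
wait_inverted = 1.0 - wait_norm

print("Normalized:", wait_norm)
print("Inverted:  ", wait_inverted)
print("Std unchanged:", np.isclose(wait_norm.std(), wait_inverted.std()))
```

Flipping changes the mean and median but leaves the standard deviation as it was, so the variability ranking is unaffected either way.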

Statistical Analysis:

- Calculate the mean, median, and standard deviation of the normalized scores for each performance metric.
- Identify the metric with the highest variability in scores.
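
Because this exercise follows the same steps as Ex No 2.a, the whole task list can be wrapped in one reusable function. The sketch below is one possible arrangement, not a required solution; `analyze` and its dictionary keys are hypothetical names, and it assumes a float array with NaN for missing entries, medians taken over all rows before de-duplication, and constant columns mapped to 0.

```python
import numpy as np

def analyze(records, names):
    """Inspect, clean, normalize, and summarize (id, score, ...) tuples."""
    arr = np.array(records, dtype=float)            # None becomes NaN
    raw = arr[:, 1:]
    missing = np.isnan(raw).sum(axis=0)             # missing count per metric
    medians = np.nanmedian(raw, axis=0)             # imputation values (all rows)
    scores = np.where(np.isnan(raw), medians, raw)  # fill missing entries
    _, first_idx = np.unique(arr[:, 0], return_index=True)
    scores = scores[np.sort(first_idx)]             # keep first occurrence per ID
    span = scores.max(axis=0) - scores.min(axis=0)
    safe_span = np.where(span == 0, 1.0, span)      # guard against constant columns
    normalized = (scores - scores.min(axis=0)) / safe_span
    std = normalized.std(axis=0)                    # population std (ddof=0)
    return {
        "missing": missing.tolist(),
        "mean": normalized.mean(axis=0),
        "median": np.median(normalized, axis=0),
        "std": std,
        "most_variable": names[int(np.argmax(std))],
    }

departments = [
    (701, 90, 95, None),
    (702, 88, 90, 85),
    (703, 92, None, 80),
    (702, 88, 90, 85),
    (704, 85, None, 78),
    (701, 90, 95, None),
]
result = analyze(departments, ["Patient Satisfaction", "Treatment Success", "Wait Time"])
print("Missing:", result["missing"])
print("Means:", result["mean"])
print("Most variable:", result["most_variable"])
```

Calling the same function with the student records and subject names from Ex No 2.a would run that exercise without duplicating the code.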

In [2]:
import numpy as np

# --------------------------------------
# 1. Dataset Setup
# --------------------------------------
data = [
    (701, 90, 95, None),
    (702, 88, 90, 85),
    (703, 92, None, 80),
    (702, 88, 90, 85),
    (704, 85, None, 78),
    (701, 90, 95, None)
]

# Convert to NumPy array (object type to allow None values)
arr = np.array(data, dtype=object)

# --------------------------------------
# 2. Data Inspection
# --------------------------------------
# Count missing values in each metric
missing_counts = [(arr[:, i] == None).sum() for i in range(1, 4)]
print("Missing values per metric (Satisfaction, Success, Wait time):", missing_counts)

# Detect duplicate department IDs
_, unique_indices = np.unique(arr[:, 0], return_index=True)
duplicate_indices = np.setdiff1d(np.arange(arr.shape[0]), unique_indices)
print("Duplicate entries based on Dept ID:\n", arr[duplicate_indices])

# --------------------------------------
# 3. Data Cleaning
# --------------------------------------
cleaned_arr = arr.copy()

# Replace None with the median of the column
# (computed over all rows, duplicates included, because de-duplication runs afterwards)
for col in range(1, 4):
    col_values = [x for x in cleaned_arr[:, col] if x is not None]
    median_val = np.median(col_values)
    cleaned_arr[:, col] = [median_val if x is None else x for x in cleaned_arr[:, col]]

# Remove duplicates, keeping first occurrence
cleaned_arr = cleaned_arr[np.sort(unique_indices)]

print("\nCleaned Data:\n", cleaned_arr)

# --------------------------------------
# 4. Data Normalization (Min-Max)
# --------------------------------------
scores = cleaned_arr[:, 1:].astype(float)
min_vals = scores.min(axis=0)
max_vals = scores.max(axis=0)
normalized_scores = (scores - min_vals) / (max_vals - min_vals)

print("\nNormalized Scores:\n", normalized_scores)

# --------------------------------------
# 5. Statistical Analysis
# --------------------------------------
means = normalized_scores.mean(axis=0)
medians = np.median(normalized_scores, axis=0)
std_devs = normalized_scores.std(axis=0)  # population standard deviation (ddof=0)

metrics = ["Patient Satisfaction", "Treatment Success", "Wait Time"]
highest_var_metric = metrics[np.argmax(std_devs)]

print("\nNormalized Means:", means)
print("Normalized Medians:", medians)
print("Normalized Std Deviations:", std_devs)
print("Metric with highest variability:", highest_var_metric)


Missing values per metric (Satisfaction, Success, Wait time): [np.int64(0), np.int64(2), np.int64(2)]
Duplicate entries based on Dept ID:
 [[702 88 90 85]
 [701 90 95 None]]

Cleaned Data:
 [[701 90 95 np.float64(82.5)]
 [702 88 90 85]
 [703 92 np.float64(92.5) 80]
 [704 85 np.float64(92.5) 78]]

Normalized Scores:
 [[0.71428571 1.         0.64285714]
 [0.42857143 0.         1.        ]
 [1.         0.5        0.28571429]
 [0.         0.5        0.        ]]

Normalized Means: [0.53571429 0.5        0.48214286]
Normalized Medians: [0.57142857 0.5        0.46428571]
Normalized Std Deviations: [0.36943144 0.35355339 0.37584938]
Metric with highest variability: Wait Time
