
Task 1 solution:


In [6]:
import pandas as pd

# Example dataset
data_dict = {
    "Patient ID": ["P001", "P002", "P003", "P004", "P005", "P006", "P007", "P008", "P009", "P010", "P003"],
    "Age": [45, 67, 34, 78, 54, 23, None, 36, 62, 41, 34],
    "Diagnosis": ["Flu", "Pneumonia", "Fracture", "COVID-19", "Heart Disease", "Flu", "Pneumonia", "Fracture", "COVID-19", "Heart Disease", "Fracture"],
    "Length of Stay": [3, 7, 2, 14, 10, 5, 8, None, 12, 6, 2],
    "Hospital Department": ["Emergency", "General Medicine", "Orthopedics", "Cardiology", "Orthopedics", "Emergency", "General Medicine", "Orthopedics", "Cardiology", "General Medicine", "Orthopedics"],
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data_dict)

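# Step 1: Cleaning the dataset
# Remove exact duplicate rows; .copy() gives an independent DataFrame so the
# assignments below do not trigger SettingWithCopyWarning
df_cleaned = df.drop_duplicates().copy()

# Handle missing values by filling each numeric column with its mean
# (assign the result back instead of calling fillna(inplace=True) on a column)
df_cleaned["Age"] = df_cleaned["Age"].fillna(df_cleaned["Age"].mean())
df_cleaned["Length of Stay"] = df_cleaned["Length of Stay"].fillna(df_cleaned["Length of Stay"].mean())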

# Save cleaned data to a CSV file
df_cleaned.to_csv("cleaned_patient_data.csv", index=False)

# Step 2: Perform statistical analysis
statistics = {
    "Mean Age": df_cleaned["Age"].mean(),
    "Median Age": df_cleaned["Age"].median(),
    "Standard Deviation (Age)": df_cleaned["Age"].std(),
    "Mean Length of Stay": df_cleaned["Length of Stay"].mean(),
    "Median Length of Stay": df_cleaned["Length of Stay"].median(),
    "Standard Deviation (Length of Stay)": df_cleaned["Length of Stay"].std(),
}

# Convert statistics to a summary table
stats_summary = pd.DataFrame(
    list(statistics.items()),
    columns=["Metric", "Value"]
)

# Display results
print("Cleaned Dataset:")
print(df_cleaned)
print("\nStatistical Analysis Summary:")
print(stats_summary)


Cleaned Dataset:
  Patient ID        Age      Diagnosis  Length of Stay Hospital Department
0       P001  45.000000            Flu        3.000000           Emergency
1       P002  67.000000      Pneumonia        7.000000    General Medicine
2       P003  34.000000       Fracture        2.000000         Orthopedics
3       P004  78.000000       COVID-19       14.000000          Cardiology
4       P005  54.000000  Heart Disease       10.000000         Orthopedics
5       P006  23.000000            Flu        5.000000           Emergency
6       P007  48.888889      Pneumonia        8.000000    General Medicine
7       P008  36.000000       Fracture        7.444444         Orthopedics
8       P009  62.000000       COVID-19       12.000000          Cardiology
9       P010  41.000000  Heart Disease        6.000000    General Medicine

Statistical Analysis Summary:
                                Metric      Value
0                             Mean Age  48.888889
1                          


Task 2 solution:
1. Assess the Extent of the Problem

Perform an initial exploration of the dataset with pandas tools such as df.info() and df.isnull().sum() to count the missing values in each column; df.isnull().mean() * 100 gives the same information as a percentage.

Identify inaccurate data by validating against logical constraints and expected ranges. For example:

Ensure ages are within a plausible range (e.g., 0–120 years).

Verify lengths of stay correspond to realistic treatment durations.
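The checks in this step could be sketched as follows. This is a minimal illustration, not a definitive implementation: the sample values are made up, and the bounds (0 to 120 years, non-negative stay) are assumptions taken from the examples above.

```python
import pandas as pd

# Small illustrative frame; the values are invented for this sketch.
df = pd.DataFrame({
    "Patient ID": ["P001", "P002", "P003", "P004"],
    "Age": [45, None, 134, 23],
    "Length of Stay": [3, 7, -1, None],
})

# Percentage of missing values per column (isnull().mean() is the missing fraction).
missing_pct = df.isnull().mean() * 100

# Flag present values that fall outside the assumed plausible ranges.
# Missing values are handled separately, so they are not flagged here.
age_out_of_range = df["Age"].notna() & ~df["Age"].between(0, 120)
stay_out_of_range = df["Length of Stay"].notna() & (df["Length of Stay"] < 0)

print(missing_pct)
print(df[age_out_of_range | stay_out_of_range])
```

Keeping the missing-value check and the range check separate makes it easier to report each kind of problem on its own.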

2. Categorize Data Based on Importance

Critical Data: Columns essential for analysis, such as patient IDs, diagnoses, or treatment information, require thorough handling.

Non-Critical Data: Less important columns, such as optional demographic details, can be deprioritized or omitted if incomplete.

3. Handle Missing Data

Remove Records:

If missing values in critical fields affect only a small percentage of rows, those rows can be removed with little risk of biasing the results.

Impute Missing Values:

For numerical data, use statistical methods such as mean, median, or mode. Advanced techniques, like regression or K-Nearest Neighbors (KNN) imputation, may also be considered for higher accuracy.

For categorical data, impute using the most frequent category or predictive modeling based on related variables.

Flag Imputed Data:

Create a flag column to indicate records where imputation has been applied. This allows for transparency and further review if necessary.
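A minimal sketch of imputation with flag columns, assuming median imputation for a numeric column and most-frequent-category imputation for a categorical one. The flag column names (Age_imputed, Diagnosis_imputed) are hypothetical, and the sample values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [45.0, 67.0, None, 36.0],
    "Diagnosis": ["Flu", None, "Flu", "Fracture"],
})

# Record which rows will be imputed BEFORE filling, otherwise the information is lost.
df["Age_imputed"] = df["Age"].isna()
df["Diagnosis_imputed"] = df["Diagnosis"].isna()

# Numeric column: fill with the median of the observed values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the most frequent observed category.
df["Diagnosis"] = df["Diagnosis"].fillna(df["Diagnosis"].mode().iloc[0])

print(df)
```

The flags let later analysis filter out or down-weight imputed rows without re-deriving which values were originally missing.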

4. Validate Data Accuracy

Cross-check the dataset against reliable external sources (e.g., reference databases or clinical standards) to confirm the validity of key fields.

Apply domain-specific business logic to detect errors (e.g., ensuring that the length of stay aligns with the diagnosis and treatment).
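One way to express such a business rule is a lookup of expected limits per diagnosis. The thresholds below are hypothetical placeholders chosen for illustration, not clinical guidance, and the needs_review column name is an assumption.

```python
import pandas as pd

# Hypothetical upper bounds (in days) per diagnosis; real limits would come
# from clinical standards or domain experts.
max_expected_stay = {"Flu": 7, "Fracture": 10, "Pneumonia": 21}

df = pd.DataFrame({
    "Patient ID": ["P001", "P002", "P003"],
    "Diagnosis": ["Flu", "Fracture", "Pneumonia"],
    "Length of Stay": [30, 2, 8],
})

# Look up the limit for each row's diagnosis and flag stays that exceed it.
limit = df["Diagnosis"].map(max_expected_stay)
df["needs_review"] = df["Length of Stay"] > limit

print(df[df["needs_review"]])
```

Flagging rows for review, rather than correcting them automatically, keeps the decision with someone who knows the clinical context.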

5. Document Assumptions and Modifications

Maintain detailed records of all actions taken during the cleaning process, including:

The method of imputation used for missing values.

The percentage of data removed, modified, or flagged.

Assumptions made during corrections.

Share these records with stakeholders to ensure transparency and reproducibility.

6. Test the Robustness of the Cleaned Dataset

Conduct analyses both with and without imputed data to validate the consistency of results.

Use sensitivity analysis to measure the impact of missing or modified data on key metrics and insights.
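A simple form of this comparison, sketched under the assumption that missing values were mean-imputed and marked with a flag column (stay_imputed is a hypothetical name; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"Length of Stay": [3.0, 7.0, None, 14.0, 10.0]})

# Flag the missing rows, then fill them with the mean of the observed values.
df["stay_imputed"] = df["Length of Stay"].isna()
df["Length of Stay"] = df["Length of Stay"].fillna(df["Length of Stay"].mean())

# Compute the same summary statistics with and without the imputed rows.
metrics = ["mean", "median", "std"]
with_imputed = df["Length of Stay"].agg(metrics)
without_imputed = df.loc[~df["stay_imputed"], "Length of Stay"].agg(metrics)

comparison = pd.DataFrame({
    "with imputed": with_imputed,
    "without imputed": without_imputed,
})
comparison["difference"] = comparison["with imputed"] - comparison["without imputed"]

print(comparison)
```

Mean imputation leaves the mean unchanged but shrinks the standard deviation, which is exactly the kind of effect this comparison is meant to surface.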

Task 3 solution:

Which of the following is NOT a typical step in data cleaning?

Answer: b) Filling missing data with random values

Explanation: Data cleaning typically involves removing duplicates, standardizing formats, and identifying outliers, but filling missing data with random values is not a common practice. Instead, missing data is often handled using methods like imputation (mean, median, mode) or deletion based on the context.

What is the purpose of normalization in data analysis?

Answer: b) To ensure all variables are on a similar scale

Explanation: Normalization refers to the process of adjusting the values of numeric columns to a common scale without distorting differences in the ranges of values. It is often used when features have different units or scales, ensuring that all variables contribute equally to the analysis.
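As a small illustration, two common rescaling approaches are min-max scaling and z-score scaling (the latter is often called standardization). The sample values below are invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 45, 67, 78],
    "Length of Stay": [5, 3, 7, 14],
})

# Min-max scaling: each column is mapped onto the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score scaling: each column gets mean 0 and (sample) standard deviation 1.
z_score = (df - df.mean()) / df.std()

print(min_max)
print(z_score)
```

After either transformation, Age and Length of Stay are on comparable scales even though their original units (years and days) differ.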