# **DS331 Lab 4: CRISP-DM Phase 3, Data Preparation**

üë©üèª‚ÄçüîßWe will work on cleaning and preparing a real-world dataset from a AI in Healthcare, Building upon the quality checks we previously conducted, we will now address missing values, correct inconsistencies, adjust data types, identify outliers, and get the dataset ready for analysis.

### 📥 Download AI in Healthcare Dataset  
[<button style="background-color:#008CBA; color:white; padding:10px 15px; border:none; border-radius:5px;">Click Here to Download</button>](https://drive.google.com/drive/folders/18UHGmcat5yFGkQPmzgwkntlCnMz89UCw?usp=drive_link)

In [7]:
# Import libraries
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

df = pd.read_csv("AI_Dataset_Unclean.csv")
print(df.head())

   Patient_ID      Age  Gender      Blood_Pressure Heart_Rate  \
0           1       62  Female  102.34913367982614         79   
1           2       65    MALE  137.76093329695257         72   
2           3      NaN    MAle   97.61856352030196         57   
3           4  unknown   Ma le  121.74375249886894         67   
4           5       85   Mal e         129.5304503         64   

         Temperature      Diagnosis     Medication Treatment_Duration  \
0  98.91236105995344   Hypertension        staTiNs     tWetn ty f ouR   
1        98.91250761   Hypertension       InsU LIn       Twenty niNre   
2  99.18972805060343      Influenza  CrHemotherapY           FiftEcen   
3        96.03348678  Heart Disease   chemotheraPY              five    
4        99.07767466  Heart Disease        Insulin                ten   

  Insurance_Type  Doctor_Name        Hospital_Name  Lab_Test_Results  \
0      Uninsured    Dr. Brown  Children's Hospital        114.906151   
1     Uninsured      Dr. W

## **⚙ 1. Handling Data-Type Issues in the Dataset**

Incorrect data types can break summary statistics, slow pipelines, and silently bias models.  
The goal is to make every column’s dtype accurately reflect its semantics.

---

### ❗ Common Data Type Issues (from the dataset)


The following columns in the dataset contain data-type or formatting issues that need to be cleaned:

| Column Name             | Issue Type                           | Description |
|-------------------------|--------------------------------------|-------------|
| Age                  | Numeric stored as text / Mixed types | Contains values like "unknown" and numbers stored as strings |
| Blood_Pressure       | Numeric stored as text               | Stored as strings instead of numeric values |
| Heart_Rate           | Numeric stored as text               | Stored as strings instead of numeric values |
| Temperature          | Numeric stored as text               | Stored as strings instead of numeric values |
| Treatment_Duration   | Numeric stored as words              | Durations spelled out as number words with random casing, stray spaces, and typos (e.g., "tWetn ty f ouR", "FiftEcen") |
| Gender               | Mixed types / Inconsistent format    | Variants like "MALE", "Ma le", "Female" with inconsistent casing and spacing |
| Insurance_Type       | Inconsistent strings                 | Variants like "Private", "Privat e" with extra spaces |
| Recovery_Time        | Mixed types                          | Contains numbers, scientific notation ("0.0003e3"), and suffixes like "7d" |
| Patient_Satisfaction | Mixed types                          | Contains numeric values, "-1", and words like "good" |
| X-ray_Results        | Categorical inconsistency            | Inconsistent casing such as "Normal", "AbnORmal" |


### ✅ Data Cleaning Steps


| Step | Column Name             | Cleaning Action |
|------|-------------------------|------------------|
| 1    | Age                  | Convert to numeric using pd.to_numeric(errors='coerce') to handle "unknown" |
| 2    | Gender               | Standardize by lowercasing, removing spaces, then capitalizing first letter |
| 3    | Blood_Pressure       | Convert to numeric using pd.to_numeric(errors='coerce') |
| 4    | Heart_Rate           | Convert to numeric using pd.to_numeric(errors='coerce') |
| 5    | Temperature          | Convert to numeric using pd.to_numeric(errors='coerce') |
| 6    | Treatment_Duration   | Map text labels to numbers with a lookup dict, then convert using pd.to_numeric(errors='coerce') |
| 7    | Insurance_Type       | Remove spaces and standardize capitalization |
| 8    | Recovery_Time        | Extract numeric values (e.g., "7d", "0.0003e3") using regex and convert to float |
| 9    | Patient_Satisfaction | Replace non-numeric values like "good" with numeric equivalents, then convert to float |
| 10   | X-ray_Results        | Standardize text casing to "Normal" and "Abnormal" |
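
As a hedged illustration of steps 1 and 8, the sketch below runs on toy values invented for this example (they are not rows from the dataset). It shows how `errors="coerce"` turns unparseable text into NaN, and one possible regex that keeps a scientific-notation exponent when extracting a number. The name `number_pattern` is an assumption introduced here, not something the lab defines.

```python
import pandas as pd

# Toy values modelled on the issues listed above (not the real dataset).
age = pd.Series(["62", "unknown", None, "85"])
recovery = pd.Series(["5", "7d", "0.0003e3", "n/a"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
age_num = pd.to_numeric(age, errors="coerce")

# Hypothetical pattern: optional sign, digits, optional decimals, optional
# exponent, so "0.0003e3" is read as 0.3 instead of being cut off at 0.0003.
number_pattern = r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
recovery_num = pd.to_numeric(
    recovery.str.extract(number_pattern)[0], errors="coerce"
)

print(age_num.tolist())
print(recovery_num.tolist())
```

Whether "0.0003e3" should really be read as 0.3 is a judgment call about the data source; the point is only that the extraction pattern decides which reading you get.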



In [10]:
print(df.columns)

Index(['Patient_ID', 'Age', 'Gender', 'Blood_Pressure', 'Heart_Rate',
       'Temperature', 'Diagnosis', 'Medication', 'Treatment_Duration',
       'Insurance_Type', 'Doctor_Name', 'Hospital_Name', 'Lab_Test_Results',
       'X-ray_Results', 'Surgery_Type', 'Recovery_Time', 'Allergies',
       'Family_History', 'Patient_Satisfaction', 'AI_Diagnosis_Confidence'],
      dtype='object')


In [12]:
object_columns = df.select_dtypes(include='object').columns

for col in object_columns:
    unique_vals = df[col].unique()
    print(f"\n--- {col} ---")
    print("Sample unique values:")
    print(unique_vals[:10])  

    suspicious_vals = []
    for val in unique_vals:
        if isinstance(val, str):
            if (
                any(char.isalpha() for char in val) and not val.islower()
            ) or ' ' in val or val.lower() in ['unknown', 'yes', 'no', 'true', 'false', 'good', '-1']:
                suspicious_vals.append(val)

    if suspicious_vals:
        print(">> Suspicious values:")
        print(suspicious_vals[:5]) 
    else:
        print(">> No obvious suspicious values found.")


--- Age ---
Sample unique values:
['62' '65' nan 'unknown' '85' '27' '39' '150' '76' '64']
>> Suspicious values:
['unknown']

--- Gender ---
Sample unique values:
['Female' 'MALE' 'MAle' 'Ma le' 'Mal e' 'Male' 'FEMALE' 'M ale' 'Fema le'
 ' Male']
>> Suspicious values:
['Female', 'MALE', 'MAle', 'Ma le', 'Mal e']

--- Blood_Pressure ---
Sample unique values:
['102.34913367982614' '137.76093329695257' '97.61856352030196'
 '121.74375249886894' '129.5304503' nan 'high' '130.0858193967417'
 '105.31530033505123' '300']
>> No obvious suspicious values found.

--- Heart_Rate ---
Sample unique values:
['79' '72' '57' '67' '64' '78' '84' '82c' '220' '75']
>> Suspicious values:
['8 7', '7 6', '8 3', '6 6', '6 7']

--- Temperature ---
Sample unique values:
['98.91236105995344' '98.91250761' '99.18972805060343' '96.03348678'
 '99.07767466' 'low' '99.85369152' nan '100.25547386446549'
 '97.61390779734232']
>> No obvious suspicious values found.

--- Diagnosis ---
Sample unique values:
['Hypertensio

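# **💡 What the check does**

For each object-dtype column, the loop above:

1. Prints a sample of up to 10 unique values to give a quick overview.
2. Flags suspicious or inconsistent values that could indicate data-quality issues, such as:
   - Strings that contain alphabetic characters but are not entirely lowercase (e.g., "MALE", "FeMale"). This also flags clean title-case values such as "Female".
   - Strings that contain spaces (e.g., "Privat e")
   - Known problematic text values like "unknown", "yes", "no", "true", "false", "good", or "-1"
3. Collects the flagged values and prints a sample (up to 5) of them for manual inspection.

The heuristic is not exhaustive: lowercase words that are not on the list, such as "high" in Blood_Pressure or "low" in Temperature, are not flagged, so the sample of unique values still needs a manual read.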
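
The same three checks can be sketched as a small reusable function. This is only an illustration on made-up values; `flag_suspicious` and `known_bad` are names introduced here, not part of the lab code.

```python
def flag_suspicious(values, known_bad=("unknown", "yes", "no", "true", "false", "good", "-1")):
    """Return the string values that trip any of the three heuristics."""
    flagged = []
    for val in values:
        if not isinstance(val, str):
            continue  # NaN and numeric entries are skipped
        not_all_lowercase = any(ch.isalpha() for ch in val) and not val.islower()
        has_space = " " in val
        is_known_bad = val.lower() in known_bad
        if not_all_lowercase or has_space or is_known_bad:
            flagged.append(val)
    return flagged


# Toy input: "male", "72" and "high" pass the checks, the rest are flagged.
sample = ["Female", "male", "Privat e", "unknown", "-1", "72", float("nan"), "high"]
print(flag_suspicious(sample))  # ['Female', 'Privat e', 'unknown', '-1']
```

Note that "high" passes unflagged here, which is why a lowercase word inside a numeric column can slip through this kind of check.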

In [15]:
# Age: coerce non-numeric entries such as "unknown" to NaN
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Gender: trim outer whitespace and unify case
# (spaces inside a value, e.g. "Ma le", are removed in Section 2)
df["Gender"] = df["Gender"].astype(str).str.strip().str.lower().str.capitalize()

# Blood_Pressure: convert to numeric
df["Blood_Pressure"] = pd.to_numeric(df["Blood_Pressure"], errors="coerce")

# Heart_Rate: convert to numeric
df["Heart_Rate"] = pd.to_numeric(df["Heart_Rate"], errors="coerce")

# Temperature: convert to numeric
df["Temperature"] = pd.to_numeric(df["Temperature"], errors="coerce")

# Treatment_Duration: replace text with numeric via mapping
# NOTE: none of these keys match the number words actually stored in the
# column (e.g. "five", "FiftEcen"), so every value is coerced to NaN (see output)
duration_map = {"Low": 1, "Moderate": 3, "High": 5, "Short": 1, "Medium": 3, "Long": 5}
df["Treatment_Duration"] = pd.to_numeric(
    df["Treatment_Duration"].replace(duration_map), errors="coerce"
)

# Insurance_Type: trim outer whitespace and unify case
df["Insurance_Type"] = df["Insurance_Type"].astype(str).str.strip().str.lower().str.capitalize()

# Recovery_Time: extract the leading number (handles suffixes like '7d')
# NOTE: the pattern stops before an exponent, so '0.0003e3' becomes 0.0003, not 0.3
df["Recovery_Time"] = pd.to_numeric(
    df["Recovery_Time"].astype(str).str.extract(r'(\d+\.?\d*)')[0], errors="coerce"
)

# Patient_Satisfaction: replace text with numeric and convert
satisfaction_map = {"good": 4, "excellent": 5, "poor": 2, "average": 3, "-1": None}
df["Patient_Satisfaction"] = pd.to_numeric(
    df["Patient_Satisfaction"].replace(satisfaction_map), errors="coerce"
)

# X-ray_Results: standardize casing
df["X-ray_Results"] = df["X-ray_Results"].astype(str).str.strip().str.lower().str.capitalize()


columns_to_check = [
    ("Age", "🧓"),
    ("Gender", "👫🏻"),
    ("Blood_Pressure", "🩸"),
    ("Heart_Rate", "💓"),
    ("Temperature", "🌡"),
    ("Treatment_Duration", "⏱"),
    ("Insurance_Type", "💳"),
    ("Recovery_Time", "⏳"),
    ("Patient_Satisfaction", "⭐"),
    ("X-ray_Results", "🩻"),
]

for col, emoji in columns_to_check:
    col_dtype = df[col].dtype
    non_null_count = df[col].notna().sum()
    sample_values = df[col].dropna().unique()[:5]

    print(f"-- {emoji} {col} --")
    print(f"dtype : {col_dtype}")
    print(f"#non-null : {non_null_count}")
    print(f"sample : {sample_values.tolist()}\n")

-- 🧓 Age --
dtype : float64
#non-null : 4292
sample : [62.0, 65.0, 85.0, 27.0, 39.0]

-- 👫🏻 Gender --
dtype : object
#non-null : 5000
sample : ['Female', 'Male', 'Ma le', 'Mal e', 'M ale']

-- 🩸 Blood_Pressure --
dtype : float64
#non-null : 4312
sample : [102.34913367982614, 137.76093329695257, 97.61856352030196, 121.74375249886894, 129.5304503]

-- 💓 Heart_Rate --
dtype : float64
#non-null : 4031
sample : [79.0, 72.0, 57.0, 67.0, 64.0]

-- 🌡 Temperature --
dtype : float64
#non-null : 4298
sample : [98.91236105995344, 98.91250761, 99.18972805060343, 96.03348678, 99.07767466]

-- ⏱ Treatment_Duration --
dtype : float64
#non-null : 0
sample : []

-- 💳 Insurance_Type --
dtype : object
#non-null : 5000
sample : ['Uninsured', 'Private', 'Privat e', 'Medicaid', 'Medic aid']

-- ⏳ Recovery_Time --
dtype : float64
#non-null : 4044
sample : [5.0, 0.0003, 7.0, 6.0, 4.0]

-- ⭐ Patient_Sa
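
The output above shows that Treatment_Duration ends up with no usable values. One possible way to recover garbled number words is fuzzy matching against a vocabulary of spelled-out numbers. The sketch below is an assumption-laden illustration, not the approach this lab takes: `words_to_number`, the 1 to 99 vocabulary, and the 0.8 similarity cutoff are all choices made here for the example, and heavily garbled values could still be matched to the wrong number.

```python
import difflib
import math
import re

ONES = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

# Vocabulary of spelled-out numbers 1..99 without spaces, e.g. "twentyfour" -> 24
VOCAB = {word: i + 1 for i, word in enumerate(ONES)}
for t_idx, tens_word in enumerate(TENS):
    base = (t_idx + 2) * 10
    VOCAB[tens_word] = base
    for o_idx, ones_word in enumerate(ONES[:9]):
        VOCAB[tens_word + ones_word] = base + o_idx + 1


def words_to_number(text, cutoff=0.8):
    """Map a noisy number word to a float, or NaN if nothing is close enough."""
    if not isinstance(text, str):
        return math.nan
    # Lowercase and drop everything except letters (spaces, digits, punctuation)
    cleaned = re.sub(r"[^a-z]", "", text.lower())
    if not cleaned:
        return math.nan
    match = difflib.get_close_matches(cleaned, list(VOCAB), n=1, cutoff=cutoff)
    return float(VOCAB[match[0]]) if match else math.nan


# Values copied from the df.head() output at the top of the notebook
for raw in ["tWetn ty f ouR", "Twenty niNre", "FiftEcen", "five ", "ten"]:
    print(repr(raw), "->", words_to_number(raw))
```

On a DataFrame this could be applied with `df["Treatment_Duration"].apply(words_to_number)`, ideally followed by a manual spot check of a sample of raw and parsed pairs before trusting the result.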

## **2. Handling Inconsistent Data 🔠**

### **🕵🏻‍♀️ Find inconsistencies in categorical features**

In [19]:
# Define the list of categorical columns
categorical_features = ["Gender", "Medication", "Insurance_Type", "X-ray_Results"]

# Show sorted unique values for each categorical feature
for col in categorical_features:
  print(f"Number of unique values in {[col]} is: {df[col].nunique()}")
  print(df[col].value_counts().sort_index())
  print()
  print("-" * 50)

Number of unique values in ['Gender'] is: 10
Gender
F emale      94
Fe male      94
Fem ale      95
Fema le      82
Femal e      93
Female     2077
M ale       119
Ma le       134
Mal e       137
Male       2075
Name: count, dtype: int64

--------------------------------------------------
Number of unique values in ['Medication'] is: 3658
Medication
A N TI b ioti cS    1
A N tibIoticS       1
A Nt ibIot iCS      1
A Nt ibIoticsz      1
A Nti biOTics       1
                   ..
swTantsins          1
swtatins            1
sxtatinS            1
sztatIns            1
sztatins            1
Name: count, Length: 3658, dtype: int64

--------------------------------------------------
Number of unique values in ['Insurance_Type'] is: 32
Insurance_Type
M edicaid       42
M edicare       25
Me dicaid       34
Me dicare       30
Med icaid       37
Med icare       29
Medi caid       34
Medi care       39
Medic aid       36
Medic are       33
Medica id       28
Medica re       35
Medicai d       40

### **⛏️ Fix inconsistencies in categorical features**

In [22]:
# 1- Clean up formatting (whitespace and case)
# 2- Replace known misspellings and variants with consistent values

# Standardize Gender
# (spaces are removed first, so only space-free misspellings need a mapping)
df["Gender"] = df["Gender"].str.lower().str.replace(" ", "").replace({
    "femaile": "female"
}).str.title()

# -------------------------------------------
# Collapse any run of a repeated character into a single character
def remove_repeated(text):
    return re.sub(r'(.)\1+', r'\1', text)

# Standardize Medication
df["Medication"] = df["Medication"].str.lower().str.replace(" ", "").apply(remove_repeated)

def replace_med(df, column):
    replacement_dict = {
        # Aspirin: Match 'a' + any letter (a-z) or none + 's' or 'p' + any chars
        r'^a[a-z]?[sp].*': 'Aspirin',
        r'^a[a-z]?[nt].*': 'Antibiotics',
        # Statins: Match 's' + optional 't' or 'q' + any chars
        r'^s[tq]?.*': 'Statins',
        r'^i[lnosp]?.*': 'Insulin',
        r'^c[h]?.*': 'Chemotherapy'
    }
    for pattern, replacement in replacement_dict.items():
        df[column] = df[column].str.replace(pattern, replacement, regex=True)

replace_med(df, 'Medication')

# -------------------------------------------

# Standardize Insurance_Type values
# (no mapping needed: removing spaces already merges variants like "medi care")
df["Insurance_Type"] = df["Insurance_Type"].str.lower().str.replace(" ", "").str.title()


# Standardize X-ray_Results
# (same idea: "ab normal" becomes "abnormal" once spaces are removed)
df["X-ray_Results"] = df["X-ray_Results"].str.lower().str.replace(" ", "").str.title()


# Re-check after cleaning
print("\n-------------------- Post-Cleaning Consistency Checks --------------------")
for col in categorical_features:
    print(df[col].value_counts().sort_index())
    print(f"Number of unique values: {df[col].nunique()}")
    print("-" * 50)


-------------------- Post-Cleaning Consistency Checks --------------------
Gender
Female    2535
Male      2465
Name: count, dtype: int64
Number of unique values: 2
--------------------------------------------------
Medication
Antibiotics     1014
Aspirin         1027
Chemotherapy    1017
Insulin          962
Statins          980
Name: count, dtype: int64
Number of unique values: 5
--------------------------------------------------
Insurance_Type
Medicaid     1234
Medicare     1228
Private      1295
Uninsured    1243
Name: count, dtype: int64
Number of unique values: 4
--------------------------------------------------
X-ray_Results
Abnormal    2509
Normal      2491
Name: count, dtype: int64
Number of unique values: 2
--------------------------------------------------
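
After a cleanup like this, it can help to check each column against the set of values you expect instead of reading the counts by eye. A minimal sketch on toy data follows; `unexpected_values` is a helper name introduced here, and the allowed sets would have to be chosen per column.

```python
import pandas as pd


def unexpected_values(series, allowed):
    """Return the sorted non-null values of `series` that are not in `allowed`."""
    return sorted(set(series.dropna().unique()) - set(allowed))


# Toy column with one spacing variant and one missing entry
toy = pd.Series(["Male", "Female", "Ma le", "Male", None])
print(unexpected_values(toy, {"Male", "Female"}))      # ['Ma le']

# Same normalization as used above: lowercase, drop spaces, title-case
cleaned = toy.str.lower().str.replace(" ", "").str.title()
print(unexpected_values(cleaned, {"Male", "Female"}))  # []
```

An empty list means every non-missing value belongs to the expected categories; anything else points to a variant the cleaning rules did not cover.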


## **3. Handling Missing Values 🧩**

In [25]:
# Missing Values Report

df.replace("unknown", np.nan, inplace=True)
missing_report = pd.DataFrame({
  "Count": df.isna().sum(),
  "Percentage": df.isna().mean() * 100
})

missing_report[missing_report["Count"] > 0].sort_values("Percentage", ascending=False)

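| Column               | Count | Percentage |
|----------------------|-------|------------|
| Treatment_Duration   | 5000  | 100.0      |
| Patient_Satisfaction | 1011  | 20.22      |
| Heart_Rate           | 969   | 19.38      |
| Allergies            | 964   | 19.28      |
| Recovery_Time        | 956   | 19.12      |
| Age                  | 708   | 14.16      |
| Temperature          | 702   | 14.04      |
| Blood_Pressure       | 688   | 13.76      |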


In [27]:
# Check skewness (rule of thumb)

# |skew| <= 0.5      → Fairly symmetric  → Mean or median both okay
# 0.5 < |skew| <= 1  → Moderately skewed → Prefer median
# |skew| > 1         → Highly skewed     → Strongly prefer median

numeric_cols = ["Age", "Blood_Pressure", "Heart_Rate", "Temperature", "Lab_Test_Results", "Recovery_Time", "Patient_Satisfaction"]
df[numeric_cols].skew()

Age                      0.911577
Blood_Pressure           1.979088
Heart_Rate              60.831839
Temperature             -2.628412
Lab_Test_Results        -0.002264
Recovery_Time           -0.275791
Patient_Satisfaction     3.066130
dtype: float64
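
The rule of thumb in the comments above can be turned into a small helper that picks the fill value from the skewness. This is a sketch on toy data under the stated 0.5 threshold; `impute_by_skew` is a name introduced here, and the lab itself chooses the statistic per column by hand.

```python
import pandas as pd


def impute_by_skew(series, threshold=0.5):
    """Fill NaNs with the mean if |skew| <= threshold, otherwise with the median.

    Returns the filled series and the name of the statistic that was used.
    """
    use_mean = abs(series.skew()) <= threshold
    fill_value = series.mean() if use_mean else series.median()
    return series.fillna(fill_value), ("mean" if use_mean else "median")


# Toy data: one symmetric column and one with a single extreme value
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])
skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 100.0, None])

filled_sym, stat_sym = impute_by_skew(symmetric)
filled_skw, stat_skw = impute_by_skew(skewed)
print(stat_sym, filled_sym.iloc[-1])
print(stat_skw, filled_skw.iloc[-1])
```

The median is preferred for skewed columns because a few extreme values (such as the 100.0 above) pull the mean far away from the typical value.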

In [29]:
# Drop Treatment_Duration: 100% missing after conversion, far above the 30% cutoff

df.drop(columns=["Treatment_Duration"], inplace=True)

In [31]:
# Fill in missing values in numeric columns:
# median for skewed columns, mean for the near-symmetric Lab_Test_Results

df['Age'] = df['Age'].fillna(df['Age'].median())
df['Blood_Pressure'] = df['Blood_Pressure'].fillna(df['Blood_Pressure'].median())
df['Heart_Rate'] = df['Heart_Rate'].fillna(df['Heart_Rate'].median())
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].median())
df['Lab_Test_Results'] = df['Lab_Test_Results'].fillna(df['Lab_Test_Results'].mean())
df['Recovery_Time'] = df['Recovery_Time'].fillna(df['Recovery_Time'].median())
df['Patient_Satisfaction'] = df['Patient_Satisfaction'].fillna(df['Patient_Satisfaction'].median())

sample_cols = ["Age", "Blood_Pressure", "Heart_Rate", "Temperature", "Lab_Test_Results", "Recovery_Time", "Patient_Satisfaction"]

print("Sample of modified columns:")
display(df[sample_cols].sample(10))

missing_after_fill = df[sample_cols].isna().sum()

print("\nMissing values after imputation:")
print(missing_after_fill)

Sample of modified columns:


|      | Age  | Blood_Pressure | Heart_Rate | Temperature | Lab_Test_Results | Recovery_Time | Patient_Satisfaction |
|------|------|----------------|------------|-------------|------------------|---------------|----------------------|
| 2709 | 56.0 | 145.352833     | 65.0       | 98.915399   | 75.471405        | 0.0003        | 0.0                  |
| 1691 | 36.0 | 137.194024     | 76.0       | 98.078334   | 93.893225        | 2.0           | 5.0                  |
| 1498 | 42.0 | 128.222638     | 76.0       | 98.432802   | 100.26404        | 5.0           | 4.0                  |
| 1615 | 77.0 | 104.220786     | 71.0       | 100.010355  | 102.694197       | 0.0003        | 4.0                  |
| 651  | 89.0 | 119.537082     | 0.0        | 97.644889   | 86.209396        | 5.0           | 200.0                |
| 3807 | 26.0 | 140.031741     | 0.0        | 98.301811   | 139.83587        | 5.0           | 3.0                  |
| 4638 | 65.0 | 123.448391     | 76.0       | 99.265693   | 139.570254       | 6.0           | 200.0                |
| 3909 | 54.0 | 150.290462     | 77.0       | 98.432802   | 91.236714        | 5.0           | 5.0                  |
| 2075 | 67.0 | 113.699359     | 73.0       | 97.190932   | 110.091099       | 5.0           | 4.0                  |
| 4894 | 84.0 | 30.0           | 77.0       | 98.432802   | 138.967125       | 5.0           | 4.0                  |



Missing values after imputation:
Age                     0
Blood_Pressure          0
Heart_Rate              0
Temperature             0
Lab_Test_Results        0
Recovery_Time           0
Patient_Satisfaction    0
dtype: int64


In [33]:
# Fill-in Missing Values in Categorical Columns (Allergies)

df["Allergies"] = df["Allergies"].fillna(df["Allergies"].mode()[0])

print("Sample of modified 'Allergies' column:")
display(df["Allergies"].sample(10))

missing_after_fill_allergies = df["Allergies"].isna().sum()

print("\nMissing values after imputation for 'Allergies':")
print(missing_after_fill_allergies)

Sample of modified 'Allergies' column:


3744     Shellfish
721     Penicillin
2682       Peanuts
3621     Shellfish
2428       Peanuts
925      Shellfish
3717       Peanuts
1152         Latex
1438         Latex
3805     Shellfish
Name: Allergies, dtype: object


Missing values after imputation for 'Allergies':
0


## **4. Handling Outliers 🪁**

In [36]:

def handle_outliers(df, column, lower_bound, upper_bound, method="remove"):
    """
    Handle outliers in a specified column and return a new DataFrame.

    method:
    - "remove": drop rows outside [lower_bound, upper_bound] (NaN rows are dropped too)
    - "cap":    clip values to [lower_bound, upper_bound]
    - "log":    apply log(1 + x); the bounds are ignored
    """
    original_rows = df.shape[0]

    # Work on a copy so the caller's DataFrame is not modified in place
    # (this also avoids SettingWithCopyWarning on an already-filtered frame)
    df = df.copy()

    # Step 1: Convert to numeric if needed
    df[column] = pd.to_numeric(df[column], errors='coerce')

    if method == "remove":
        df = df[df[column].between(lower_bound, upper_bound)]
    elif method == "cap":
        df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    elif method == "log":
        df[column] = np.log1p(df[column])  # log(1 + x)
    else:
        raise ValueError(f"Unknown method: {method!r}")

    print(f"🧹 {column}: cleaned using '{method}'. Rows before: {original_rows}, after: {df.shape[0]}")
    return df


### **Clean Age**


In [39]:
df = handle_outliers(df, 'Age', lower_bound=0, upper_bound=120, method="remove")

🧹 Age: cleaned using 'remove'. Rows before: 5000, after: 4508


### **Clean Blood_Pressure (if numeric average)**


In [42]:
df = handle_outliers(df, 'Blood_Pressure', lower_bound=70, upper_bound=250, method="remove")

🧹 Blood_Pressure: cleaned using 'remove'. Rows before: 4508, after: 4054


### **Clean Heart_Rate**


In [45]:
df = handle_outliers(df, 'Heart_Rate', lower_bound=30, upper_bound=200, method="remove")

🧹 Heart_Rate: cleaned using 'remove'. Rows before: 4054, after: 3644


**Clean Age column:**

This removes unrealistic or invalid age values, such as negatives or values over 120, so that only plausible human ages remain for analysis.

**Clean Blood Pressure column:**

This filters out extreme or physiologically implausible blood pressure values (outside 70–250), which helps prevent analysis from being skewed by data-entry errors or noise.

**Clean Heart Rate column:**

This excludes heart rate readings outside a plausible human range (30–200 bpm), removing abnormal entries that could distort modeling results.
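
Removing rows one column at a time compounds: a row is lost if any single column is out of range. Before committing to removal, it can be useful to count how many values fall outside each range and compare removal with capping. The sketch below uses a toy DataFrame invented for this example together with the same bounds as above; `bounds` and `in_range` are names introduced here.

```python
import pandas as pd

# Toy data (not from the dataset) with a few implausible readings
toy = pd.DataFrame({
    "Age": [34, 150, 61, -2, 47],
    "Blood_Pressure": [120.0, 118.0, 300.0, 95.0, 30.0],
    "Heart_Rate": [72, 80, 0, 65, 220],
})
bounds = {"Age": (0, 120), "Blood_Pressure": (70, 250), "Heart_Rate": (30, 200)}

# One boolean column per feature: True where the value lies inside its range
in_range = pd.DataFrame(
    {col: toy[col].between(lo, hi) for col, (lo, hi) in bounds.items()}
)
print((~in_range).sum())  # out-of-range count per column

# "remove": keep only rows that are valid in every checked column
removed = toy[in_range.all(axis=1)]

# "cap": keep every row but clip values to the nearest bound
capped = toy.copy()
for col, (lo, hi) in bounds.items():
    capped[col] = capped[col].clip(lower=lo, upper=hi)

print(len(toy), len(removed), len(capped))
```

Removal keeps only fully plausible rows at the cost of sample size, while capping keeps every row but replaces likely errors with boundary values; which trade-off is acceptable depends on the analysis that follows.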

### **Exporting Cleaned Data 🫧✨**

In [54]:
df.to_csv("AI_dataset_cleaned.csv", index=False)
print("Cleaned dataset saved!")

Cleaned dataset saved!


## **Documenting Pre-processing Steps 📑**
We cleaned a real-world AI in Healthcare dataset 🩺 containing patient information such as age, gender, diagnosis, and medications. Building on our initial quality check, we addressed data issues in four key steps: handling data-type problems, inconsistent text, missing data, and outliers. No duplicate records were found, so deduplication wasn’t needed. The cleaned dataset, AI_dataset_cleaned.csv, is now ready for frequent pattern mining in the next lab! 🚀

### **How We Handled the Issues 👀⛏**

#### **1. Data-Type Issues 🔢**
Problem: Columns like age, blood pressure, heart rate, and recovery time were stored as text or mixed formats (e.g., “7d” or “unknown”).  
Solution:  
- Converted text to proper numbers, turning unparseable entries such as “unknown” into missing values (NaN).  
- Mapped patient satisfaction terms (e.g., “excellent”) to numerical values.  
- ⚠ Note: Treatment duration (e.g., “twenty”) was too inconsistent for the mapping we used; every value became missing, and the column was dropped in step 3.  

Highlight: Ensured numerical columns are correctly formatted for analysis! ✅

#### **2. Inconsistent Data 📝**
Problem: Gender (e.g., “MALE,” “Ma le”), medications, insurance types, and X-ray results had varied formats.  
Solution:  
- Standardized to consistent options:  
  - Gender: “Male” or “Female”  
  - X-ray results: “Normal” or “Abnormal”  
  - Similar fixes for medications and insurance types.  

Highlight: Uniform formats across categorical columns for reliable mining! 🧹

#### **3. Missing Data ❓**
Problem: Columns like heart rate, allergies, age, and others had missing entries; treatment duration was entirely missing after the failed conversion.  
Solution:  
- Filled missing numerical values with the median for skewed columns (e.g., heart rate, age) and with the mean for the near-symmetric Lab_Test_Results.  
- Filled missing allergies with the most frequent value (mode).  
- Dropped treatment duration, which was 100% missing.  

Highlight: Preserved data integrity by using median imputation for skewed numerical data! 📊

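#### **4. Outlier Values 🚨**
Problem: Extreme values skewed the dataset (e.g., ages <0 or >120, blood pressures outside 70–250, heart rates outside 30–200).  
Solution:  
- Removed rows with values outside these ranges, which reduced the dataset from 5000 to 3644 rows.  
- Other numeric columns were not range-checked; the imputation sample above still shows values such as a Patient_Satisfaction of 200.  

Highlight: Eliminated implausible ages, blood pressures, and heart rates for a more realistic and trustworthy dataset! 🛡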

## **Outcome 🎉**
The cleaned dataset, saved as AI_dataset_cleaned.csv, now features:  
- Corrected formats for numerical and categorical columns.  
- Consistent text across the cleaned categorical columns.  
- Age, blood pressure, and heart rate restricted to plausible ranges.  
- Data ready for frequent pattern mining in the next lab.  

**Key Takeaway**: By **addressing data types**, **inconsistencies**, **missing values**, and **outliers**, we’ve made the dataset far more robust and analysis-ready! 🚀📈