**Part 1: Load and Inspect the Dataset**

1. Import the necessary libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, model_selection, preprocessing

2. Load the dataset:

In [2]:
from google.colab import files
uploaded = files.upload()
df = pd.read_csv("hospital data analysis.csv")

Saving hospital data analysis.csv to hospital data analysis.csv


3. Display the first 5 rows:

In [3]:
df.head()

|   | Patient_ID | Age | Gender | Condition | Procedure | Cost | Length_of_Stay | Readmission | Outcome | Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | Female | Heart Disease | Angioplasty | 15000 | 5 | No | Recovered | 4 |
| 1 | 2 | 60 | Male | Diabetes | Insulin Therapy | 2000 | 3 | Yes | Stable | 3 |
| 2 | 3 | 32 | Female | Fractured Arm | X-Ray and Splint | 500 | 1 | No | Recovered | 5 |
| 3 | 4 | 75 | Male | Stroke | CT Scan and Medication | 10000 | 7 | Yes | Stable | 2 |
| 4 | 5 | 50 | Female | Cancer | Surgery and Chemotherapy | 25000 | 10 | No | Recovered | 4 |


4. Check the structure using df.info() and df.shape:

In [4]:
print("\nStructure of the dataset:")
print(df.info())  # info() prints the report itself and returns None, so print() adds the trailing "None"

print("\nShape of the dataset:")
print(df.shape)


Structure of the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Patient_ID      984 non-null    int64 
 1   Age             984 non-null    int64 
 2   Gender          984 non-null    object
 3   Condition       984 non-null    object
 4   Procedure       984 non-null    object
 5   Cost            984 non-null    int64 
 6   Length_of_Stay  984 non-null    int64 
 7   Readmission     984 non-null    object
 8   Outcome         984 non-null    object
 9   Satisfaction    984 non-null    int64 
dtypes: int64(5), object(5)
memory usage: 77.0+ KB
None

Shape of the dataset:
(984, 10)


**Part 2: Handle Duplicates:**

1. Check for duplicate rows:

In [5]:
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0
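
Since the hospital dataset contains no duplicates, the effect of `duplicated()` and `drop_duplicates()` is easier to see on a small frame. The sketch below uses a hypothetical `toy` DataFrame (not part of the assignment data) in which one row is repeated.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the second one.
toy = pd.DataFrame({
    "Age": [45, 60, 60],
    "Cost": [15000, 2000, 2000],
})

# duplicated() marks each repeat of an earlier row as True.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Number of duplicate rows: {n_duplicates}")
print(deduped)
```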


2. Remove any duplicate rows:

In [6]:
df = df.drop_duplicates()

**Part 3: Explore Attributes:**

1. For each numerical attribute, display the minimum and maximum values:

In [7]:
# Drop the ID column: it identifies a patient and is not a feature
df = df.drop(columns=['Patient_ID'])

print("Minimum and Maximum values for each numerical column:\n")
for col in df.select_dtypes(include=['int64', 'float64']).columns:
    print(f"{col}: Min = {df[col].min()}, Max = {df[col].max()}")


Minimum and Maximum values for each numerical column:

Age: Min = 25, Max = 78
Cost: Min = 100, Max = 25000
Length_of_Stay: Min = 1, Max = 76
Satisfaction: Min = 2, Max = 5


2. For each categorical attribute, display their unique values:

In [8]:
print("Unique values for each categorical column:\n")

for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}: {df[col].unique()}\n")

Unique values for each categorical column:

Gender: ['Female' 'Male']

Condition: ['Heart Disease' 'Diabetes' 'Fractured Arm' 'Stroke' 'Cancer'
 'Hypertension' 'Appendicitis' 'Fractured Leg' 'Heart Attack'
 'Allergic Reaction' 'Respiratory Infection' 'Prostate Cancer'
 'Childbirth' 'Kidney Stones' 'Osteoarthritis']

Procedure: ['Angioplasty' 'Insulin Therapy' 'X-Ray and Splint'
 'CT Scan and Medication' 'Surgery and Chemotherapy'
 'Medication and Counseling' 'Appendectomy' 'Cast and Physical Therapy'
 'Cardiac Catheterization' 'Epinephrine Injection' 'Antibiotics and Rest'
 'Radiation Therapy' 'Delivery and Postnatal Care' 'Lithotripsy'
 'Physical Therapy and Pain Management']

Readmission: ['No' 'Yes']

Outcome: ['Recovered' 'Stable']



**Part 4: Handle Missing Values for Categorical Attributes:**

1. Replace missing values in categorical columns with the most frequent:

In [9]:
for col in df.select_dtypes(include=['object']).columns:
    mode_value = df[col].mode()[0]
    df[col] = df[col].fillna(mode_value)
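
The hospital data has no missing categorical values, so the loop above leaves it unchanged. As a minimal sketch of what the same two calls do when a gap exists, the example below uses a hypothetical `toy` frame with one missing `Gender` entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"Gender": ["Female", "Male", np.nan, "Female"]})

# mode() returns the most frequent value(s); [0] takes the first one.
mode_value = toy["Gender"].mode()[0]

# Fill the gap with that value and assign the result back.
toy["Gender"] = toy["Gender"].fillna(mode_value)

print(mode_value)
print(toy)
```

When several values are tied for most frequent, `mode()` returns all of them, so `[0]` is a tie-breaking choice rather than a unique answer.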

**Part 5: Handle Missing Values for Numerical Attributes:**

1. Identify numerical columns with missing values:

In [10]:
print("Numerical columns with missing values:\n")
num_df = df.select_dtypes(include=['int64', 'float64'])
missing_num_cols = num_df.columns[num_df.isnull().any()]
print(missing_num_cols)

Numerical columns with missing values:

Index([], dtype='object')


2. Replace missing values in Age or Length_of_Stay with median values:

In [11]:
for col in ['Age', 'Length_of_Stay']:
    if col in df.columns:
        median_value = df[col].median()
        # Assign the result back; df[col].fillna(..., inplace=True) raises a pandas FutureWarning
        df[col] = df[col].fillna(median_value)


3. Replace missing values in Cost or Satisfaction with mean values:

In [12]:
for col in ['Cost', 'Satisfaction']:
    if col in df.columns:
        mean_value = df[col].mean()
        # Assign the result back; df[col].fillna(..., inplace=True) raises a pandas FutureWarning
        df[col] = df[col].fillna(mean_value)


**Part 6: Encode Categorical Attributes:**

1. Encode Gender, Condition, Procedure, Readmission, and Outcome using one-hot
encoding or label encoding:

In [13]:
from sklearn.preprocessing import LabelEncoder

cols_to_encode = ['Gender', 'Condition', 'Procedure', 'Readmission', 'Outcome']

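# Apply label encoding, keeping one fitted encoder per column so the
# same category-to-integer mapping can be reused on new data
encoders = {}
for col in cols_to_encode:
    if col in df.columns:
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])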
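
The task statement allows one-hot encoding as an alternative to label encoding. The sketch below shows that alternative with `pd.get_dummies` on a hypothetical `toy` frame; it is an illustration only and is not what the rest of this notebook uses.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "Condition": ["Stroke", "Diabetes", "Cancer"],
    "Cost": [10000, 2000, 25000],
})

# One-hot encoding creates one 0/1 column per category,
# so no artificial ordering is implied between conditions.
one_hot = pd.get_dummies(toy, columns=["Condition"], dtype=int)

print(one_hot)
```

Each row has exactly one `Condition_*` column set to 1. With 15 conditions and 15 procedures in the real data, this would add 30 columns.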

**Part 7: Linear Regression:**

1. Define the target variable as Outcome and select the features:

In [14]:
# Define target variable
y = df['Outcome']

# Define feature set: every remaining column except the target (Patient_ID was already dropped in Part 3)
feature_cols = [col for col in df.columns if col not in ['Outcome']]
X = df[feature_cols]


print("Target variable (y): 'Outcome'")
print("Feature columns (X):", feature_cols)

Target variable (y): 'Outcome'
Feature columns (X): ['Age', 'Gender', 'Condition', 'Procedure', 'Cost', 'Length_of_Stay', 'Readmission', 'Satisfaction']


2. Split the dataset into training and testing sets:

In [15]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
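
Because Outcome is a two-class label, a plain random split can leave the classes unevenly represented in the test set. One option, shown here as a sketch on hypothetical toy data (`X_toy` and `y_toy` are made-up names), is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 16 rows of class 0 and 4 rows of class 1.
X_toy = pd.DataFrame({"feature": np.arange(20)})
y_toy = pd.Series([0] * 16 + [1] * 4)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```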

3. Fit a Linear Regression model on the training set to predict Outcome:

In [16]:
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()
# Fit the model on the training split
model.fit(X_train, y_train)

4. Use the trained model to predict the Outcome for the test set:

In [17]:
y_pred = model.predict(X_test)

print("First 10 predicted values:", y_pred[:10])

First 10 predicted values: [ 0.67411438  0.77476654  0.67426313  0.67389125 -0.25759142  0.59284445
  0.16501784  0.04055549  0.86739942  0.25596126]


5. Evaluate the model using metrics such as Mean Squared Error (MSE), R² score, and accuracy:

In [18]:
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score


# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Accuracy by rounding predictions to 0 or 1
accuracy = accuracy_score(y_test, np.round(y_pred))

print("Model Evaluation Metrics:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")
print(f"Accuracy (rounded predictions): {accuracy}")

Model Evaluation Metrics:
Mean Squared Error (MSE): 0.05206393857133216
R² Score: 0.7832493679452017
Accuracy (rounded predictions): 0.9644670050761421
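
To make the three metrics concrete, the sketch below recomputes them by hand with NumPy on four hypothetical predictions (`y_true` and `y_hat` are made-up values, not taken from the model above). MSE is the mean of squared residuals, R² is one minus the ratio of residual to total sum of squares, and accuracy is the share of rounded predictions that match the label.

```python
import numpy as np

# Hypothetical labels and regression outputs.
y_true = np.array([1, 1, 0, 0])
y_hat = np.array([0.8, 0.6, 0.3, -0.1])

residuals = y_true - y_hat

# MSE: average squared residual.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - SS_res / SS_tot, where SS_tot measures spread around the mean label.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Accuracy: fraction of rounded predictions equal to the true label.
accuracy_manual = np.mean(np.round(y_hat) == y_true)

print(mse_manual, r2_manual, accuracy_manual)
```

Note that rounding a regression output is only a rough classifier: a prediction above 1.5 or below -0.5 would round to a value that is not a valid label.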


6. Compare the predicted outcomes with the actual outcomes to assess model performance:

In [19]:
# Create a DataFrame to compare actual vs predicted outcomes
comparison_df = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_pred,
    'Rounded_Predicted': np.round(y_pred)
})

#Display the first 10 rows for comparison
print(comparison_df.head(10))

   Actual  Predicted  Rounded_Predicted
0       1   0.674114                1.0
1       1   0.774767                1.0
2       1   0.674263                1.0
3       1   0.673891                1.0
4       0  -0.257591               -0.0
5       0   0.592844                1.0
6       0   0.165018                0.0
7       0   0.040555                0.0
8       1   0.867399                1.0
9       0   0.255961                0.0


7. Test the model on new patients:

In [20]:
#Create new patients DataFrame
new_patients = pd.DataFrame({
    'Age': [40, 65, 30],
    'Gender': ['Female', 'Male', 'Female'],
    'Condition': ['Heart Disease', 'Diabetes', 'Fractured Arm'],
    'Procedure': ['Angioplasty', 'Insulin Therapy', 'X-Ray and Splint'],
    'Cost': [12000, 5000, 300],
    'Length_of_Stay': [4, 6, 1],
    'Readmission': ['No', 'Yes', 'No'],
    'Satisfaction': [4, 2, 5]
})

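cols_to_encode = ['Gender', 'Condition', 'Procedure', 'Readmission', 'Outcome']

# Apply Label Encoding
# Caution: fitting a fresh LabelEncoder on these three rows assigns codes from
# these rows alone (e.g. 'Heart Disease' becomes 2 here but was 8 in training),
# so the codes do not match the training data and the predictions below are unreliable.
label_encoder = LabelEncoder()
for col in cols_to_encode:
    if col in new_patients.columns:
        new_patients[col] = label_encoder.fit_transform(new_patients[col])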

#Predict
new_predictions = model.predict(new_patients)
print("Predicted outcomes for new patients:", new_predictions)


Predicted outcomes for new patients: [0.10198959 0.50330696 0.20303385]
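
For predictions on new rows to be meaningful, each category must receive the same integer it had during training. A minimal sketch of one way to do this, assuming a dictionary of per-column encoders (the names `train`, `new`, and `encoders` here are hypothetical), is to fit each encoder once on the training data and only call `transform` on new data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and new data sharing one categorical column.
train = pd.DataFrame({"Condition": ["Heart Disease", "Diabetes", "Stroke", "Cancer"]})
new = pd.DataFrame({"Condition": ["Stroke", "Diabetes"]})

# Fit one encoder per column on the training data and keep it.
encoders = {}
for col in ["Condition"]:
    encoders[col] = LabelEncoder().fit(train[col])
    train[col] = encoders[col].transform(train[col])

# Reuse the fitted encoders on new rows: transform only, never refit.
for col, enc in encoders.items():
    new[col] = enc.transform(new[col])

print(dict(zip(encoders["Condition"].classes_, range(len(encoders["Condition"].classes_)))))
print(new)
```

`transform` raises a `ValueError` for a category that was not seen during fitting, which is usually preferable to silently assigning it a wrong code.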


**Part 8: Detect and Replace Outliers Using IQR:**

1. Create a function to detect outliers in numerical columns

In [22]:
def find_outliers_IQR(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Calculate bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Detect outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
    print(f"Outliers in {column}:\n", outliers)
    return Q1, Q3, lower_bound, upper_bound
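
As a worked example of the same bounds, the sketch below applies the IQR rule to a hypothetical Series `s` with one extreme value, then replaces values outside the bounds with Q1 or Q3 using vectorized `mask` calls instead of `apply`.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (50).
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 50], dtype=float)

q1 = s.quantile(0.25)   # 2.75
q3 = s.quantile(0.75)   # 6.25
iqr = q3 - q1           # 3.5

lower_bound = q1 - 1.5 * iqr   # -2.5
upper_bound = q3 + 1.5 * iqr   # 11.5

# Values outside [lower_bound, upper_bound] are flagged as outliers.
outliers = s[(s < lower_bound) | (s > upper_bound)]

# Replace low outliers with Q1 and high outliers with Q3.
replaced = s.mask(s < lower_bound, q1).mask(s > upper_bound, q3)

print(outliers.tolist())
print(replaced.tolist())
```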

2. Replace outliers below the lower bound (Q1 - 1.5 * IQR) with Q1 and outliers above the upper bound (Q3 + 1.5 * IQR) with Q3:

In [23]:
# Select numeric columns automatically
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Apply detection and replacement
for col in numeric_cols:
    Q1, Q3, lower_bound, upper_bound = find_outliers_IQR(df, col)

    # Replace values below the lower bound with Q1 and values above the upper bound with Q3
    df[col] = df[col].apply(lambda x: Q1 if x < lower_bound else (Q3 if x > upper_bound else x))

# Check updated DataFrame
print(df.head())


Outliers in Age:
 Series([], Name: Age, dtype: int64)
Outliers in Gender:
 Series([], Name: Gender, dtype: int64)
Outliers in Condition:
 Series([], Name: Condition, dtype: int64)
Outliers in Procedure:
 Series([], Name: Procedure, dtype: int64)
Outliers in Cost:
 Series([], Name: Cost, dtype: int64)
Outliers in Length_of_Stay:
 Series([], Name: Length_of_Stay, dtype: int64)
Outliers in Readmission:
 Series([], Name: Readmission, dtype: int64)
Outliers in Outcome:
 Series([], Name: Outcome, dtype: int64)
Outliers in Satisfaction:
 Series([], Name: Satisfaction, dtype: int64)
   Age  Gender  Condition  Procedure   Cost  Length_of_Stay  Readmission  \
0   45       0          8          0  15000               5            0   
1   60       1          4          8   2000               3            1   
2   32       0          5         14    500               1            0   
3   75       1         14          3  10000               7            1   
4   50       0          2         13  

**Part 9: Feature Engineering:**

1. Create a new attribute risk_score:

In [24]:
# Create risk_score = Cost * Length_of_Stay / Age
df['risk_score'] = (df['Cost'] * df['Length_of_Stay']) / df['Age']

# Check the first few rows
print(df.head())

   Age  Gender  Condition  Procedure   Cost  Length_of_Stay  Readmission  \
0   45       0          8          0  15000               5            0   
1   60       1          4          8   2000               3            1   
2   32       0          5         14    500               1            0   
3   75       1         14          3  10000               7            1   
4   50       0          2         13  25000              10            0   

   Outcome  Satisfaction   risk_score  
0        0             4  1666.666667  
1        1             3   100.000000  
2        0             5    15.625000  
3        1             2   933.333333  
4        0             4  5000.000000  
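
The formula can be checked by hand against the five rows shown above. The sketch below rebuilds just those rows in a hypothetical `rows` frame and recomputes `risk_score`; for the first patient, 15000 * 5 / 45 gives about 1666.67.

```python
import numpy as np
import pandas as pd

# Age, Cost, and Length_of_Stay copied from the first five rows of the output above.
rows = pd.DataFrame({
    "Age": [45, 60, 32, 75, 50],
    "Cost": [15000, 2000, 500, 10000, 25000],
    "Length_of_Stay": [5, 3, 1, 7, 10],
})

# risk_score = Cost * Length_of_Stay / Age
rows["risk_score"] = rows["Cost"] * rows["Length_of_Stay"] / rows["Age"]

print(rows["risk_score"].round(6).tolist())
```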


**Part 10: Save the Cleaned Dataset:**

1. Save the cleaned DataFrame as a CSV file:

In [26]:
df.to_csv('hospital_data_analysis_cleaned.csv', index=False)

# Reload the saved file and display the first 5 rows to verify the export
df_cleaned = pd.read_csv('hospital_data_analysis_cleaned.csv')
print("\n✅ First 5 rows of the cleaned dataset:")
print(df_cleaned.head())


✅ First 5 rows of the cleaned dataset:
   Age  Gender  Condition  Procedure   Cost  Length_of_Stay  Readmission  \
0   45       0          8          0  15000               5            0   
1   60       1          4          8   2000               3            1   
2   32       0          5         14    500               1            0   
3   75       1         14          3  10000               7            1   
4   50       0          2         13  25000              10            0   

   Outcome  Satisfaction   risk_score  
0        0             4  1666.666667  
1        1             3   100.000000  
2        0             5    15.625000  
3        1             2   933.333333  
4        0             4  5000.000000  


**Part 11: Reflection Questions:**

1. Why is it important to replace missing values rather than dropping rows?

===>Dropping rows shrinks the dataset and can bias it when values are not missing at random, whereas imputing keeps the information in the other columns of each row.

2. What is the difference between mean, median, and mode when imputing missing data?

===>Mean is the average (sensitive to outliers), median is the middle value (robust), and mode is the most frequent (for categorical data).
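
A small numeric illustration of this answer, using a hypothetical Series `values` with one extreme entry: the mean is dragged toward the outlier while the median and mode stay at the typical value.

```python
import pandas as pd

# Hypothetical values: four small numbers and one extreme value.
values = pd.Series([1, 2, 2, 3, 100])

mean_value = values.mean()      # pulled upward by 100
median_value = values.median()  # middle value after sorting
mode_value = values.mode()[0]   # most frequent value

print(f"mean={mean_value}, median={median_value}, mode={mode_value}")
```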

3. Why do we use IQR to detect outliers instead of standard deviation?

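===>IQR is built from quartiles, so a few extreme values barely move it and it works for skewed data. The mean and standard deviation are themselves pulled by outliers, and the usual "k standard deviations" rule is calibrated for roughly normal data.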
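
The sketch below illustrates this answer on a hypothetical Series `s` of eight values, one of which is far from the rest. Because the outlier inflates the standard deviation, a common "more than 3 standard deviations from the mean" rule (the threshold of 3 is an assumption for this example) does not flag it, while the IQR rule does.

```python
import pandas as pd

# Hypothetical values: seven typical readings and one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 300], dtype=float)

# Z-score rule: flag values more than 3 sample standard deviations from the mean.
z_scores = (s - s.mean()) / s.std()
z_flags = z_scores.abs() > 3

# IQR rule: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_flags = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print("Flagged by z-score rule:", s[z_flags].tolist())
print("Flagged by IQR rule:", s[iqr_flags].tolist())
```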

4. How could encoding categorical variables affect predictive models?

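===>Encoding converts categories to numbers. Label encoding assigns arbitrary integers, which a linear model reads as an order and magnitude that do not exist; one-hot encoding avoids this at the cost of more columns. A poor choice can introduce false relationships and reduce model accuracy.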