# Step 1: Data Preprocessing

#### We'll focus on predicting the `Test Results` column. Our goal is to develop a model that predicts whether a patient's test result is "Normal", "Abnormal", or "Inconclusive" based on their medical condition, demographics, and other factors.

In [86]:
import pandas as pd

df = pd.read_csv('/Users/rayya/JupyterLab/healthcare_dataset.csv')
df

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.782410,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55495,eLIZABeTH jaCkSOn,42,Female,O+,Asthma,2020-08-16,Joshua Jarvis,Jones-Thompson,Blue Cross,2650.714952,417,Elective,2020-09-15,Penicillin,Abnormal
55496,KYle pEREz,61,Female,AB-,Obesity,2020-01-23,Taylor Sullivan,Tucker-Moyer,Cigna,31457.797307,316,Elective,2020-02-01,Aspirin,Normal
55497,HEATher WaNG,38,Female,B+,Hypertension,2020-07-13,Joe Jacobs DVM,"and Mahoney Johnson Vasquez,",UnitedHealthcare,27620.764717,347,Urgent,2020-08-10,Ibuprofen,Abnormal
55498,JENniFER JOneS,43,Male,O-,Arthritis,2019-05-25,Kimberly Curry,"Jackson Todd and Castro,",Medicare,32451.092358,321,Elective,2019-05-31,Ibuprofen,Abnormal


In [87]:
df = df[['Name', 'Age', 'Gender', 'Blood Type', 'Medical Condition', 'Admission Type', 'Medication', 'Date of Admission', 'Discharge Date','Test Results']]

In [88]:
from sklearn.preprocessing import LabelEncoder

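# Encode every selected column as integer codes (this includes the numeric Age and the date columns).
# The single encoder `le` is refit for each column, so after this cell it only
# remembers the classes of the last column fitted ('Test Results').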
le = LabelEncoder()
df['Name']= le.fit_transform(df['Name'])
df['Age']= le.fit_transform(df['Age'])
df['Gender']= le.fit_transform(df['Gender'])
df['Blood Type']= le.fit_transform(df['Blood Type'])
df['Medical Condition']= le.fit_transform(df['Medical Condition'])
df['Admission Type']= le.fit_transform(df['Admission Type'])
df['Medication']= le.fit_transform(df['Medication'])
df['Date of Admission']= le.fit_transform(df['Date of Admission'])
df['Discharge Date']= le.fit_transform(df['Discharge Date'])
df['Test Results']= le.fit_transform(df['Test Results'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Name']= le.fit_transform(df['Name'])
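The warning above appears because `df` was created by selecting columns from another DataFrame and was then assigned to. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable here; the toy frame and the names `raw` and `subset` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy frame standing in for the healthcare data (values are illustrative only).
raw = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Gender": ["Male", "Female", "Male"],
    "Room Number": [328, 265, 205],
})

# .copy() makes the column subset an independent DataFrame,
# so later assignments do not target a slice of `raw`.
subset = raw[["Name", "Gender"]].copy()
subset["Gender"] = subset["Gender"].map({"Female": 0, "Male": 1})

print(subset)
```

In the notebook, the equivalent change would be appending `.copy()` to the column selection in cell 87.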

In [89]:
X = df.drop(['Test Results'], axis=1)
y = df['Test Results']

In [90]:
# Reduce the dataset size by keeping only the first n_samples rows
n_samples = 20000

X_small = X.iloc[:n_samples]
y_small = y.iloc[:n_samples]
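Taking the first `n_samples` rows assumes the file is in no particular order. If that is not known, a random subsample is a possible alternative. The sketch below uses synthetic data; the column names `f1` and `f2` are placeholders, and only `DataFrame.sample` and index alignment with `.loc` are the point:

```python
import numpy as np
import pandas as pd

# Synthetic features and target standing in for X and y.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "f1": rng.integers(0, 10, size=100),
    "f2": rng.integers(0, 5, size=100),
})
y = pd.Series(rng.integers(0, 3, size=100), name="target")

n_samples = 20

# Draw rows at random (reproducibly), then pick the matching targets by index.
X_small = X.sample(n=n_samples, random_state=100)
y_small = y.loc[X_small.index]

print(X_small.shape, y_small.shape)
```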

In [91]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, test_size=0.2, random_state=100)

print(X_train.shape)
print(X_test.shape)

(16000, 9)
(4000, 9)


# Step 2: Model Building 

## RandomForest 

### Training Model

In [92]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=2, random_state=100)
rf.fit(X_train, y_train)

### Applying model for predictions

In [93]:
y_rf_train_pred = rf.predict(X_train)
y_rf_test_pred = rf.predict(X_test)

print(y_rf_train_pred)
print(y_rf_test_pred)

[0 2 0 ... 2 0 2]
[0 0 2 ... 2 0 2]


### Evaluating Model Performance

In [94]:
from sklearn.metrics import accuracy_score, classification_report 

In [95]:
# Accuracy Score

rf_train_as = accuracy_score(y_train, y_rf_train_pred)
rf_test_as = accuracy_score(y_test, y_rf_test_pred)

print(rf_train_as, rf_test_as)

0.3524375 0.32875


In [96]:
# Classification Report

rf_train_cr = classification_report(y_train, y_rf_train_pred)
rf_test_cr = classification_report(y_test, y_rf_test_pred)

print(rf_train_cr, rf_test_cr)

              precision    recall  f1-score   support

           0       0.34      0.67      0.46      5382
           1       0.45      0.05      0.09      5282
           2       0.36      0.33      0.34      5336

    accuracy                           0.35     16000
   macro avg       0.39      0.35      0.30     16000
weighted avg       0.39      0.35      0.30     16000
               precision    recall  f1-score   support

           0       0.33      0.66      0.44      1323
           1       0.31      0.03      0.05      1371
           2       0.34      0.31      0.32      1306

    accuracy                           0.33      4000
   macro avg       0.32      0.33      0.27      4000
weighted avg       0.32      0.33      0.27      4000
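The three classes are nearly balanced (test supports of 1323, 1371 and 1306), so always predicting a single class would already score roughly 0.33. One way to make that reference point explicit is a trivial baseline. This is a sketch on synthetic data, assuming scikit-learn's `DummyClassifier` is an acceptable stand-in for "guess the most frequent class":

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data with no real signal (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = rng.integers(0, 3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

# Always predicts the class that is most frequent in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print(f"Baseline accuracy: {baseline_acc:.3f}")
```

A model whose test accuracy is not clearly above this baseline has probably not learned anything useful from the features.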



## SVC

### Training Model

In [97]:
from sklearn import svm

svc = svm.SVC()
svc.fit(X_train, y_train)
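`SVC` with its default RBF kernel is sensitive to feature scale, and the label-encoded columns here span very different ranges (for example, `Name` codes versus `Gender` codes). A possible variant, not what the notebook ran, is to standardize the features in a pipeline first. The data below is synthetic and `scaled_svc` is a hypothetical name:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two synthetic features on very different scales, mimicking wide and narrow code ranges.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 40000, size=200),
    rng.integers(0, 2, size=200),
]).astype(float)
y = rng.integers(0, 3, size=200)

# The scaler is fit on the training data only and reapplied at predict time.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X, y)

preds = scaled_svc.predict(X[:5])
print(preds)
```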

### Applying Model for Predictions 

In [98]:
y_svc_train_pred = svc.predict(X_train)
y_svc_test_pred = svc.predict(X_test)

print(y_svc_train_pred)
print(y_svc_test_pred)

[0 0 0 ... 1 0 2]
[2 0 2 ... 2 0 0]


### Evaluating Model Performance 

In [99]:
from sklearn.metrics import accuracy_score, classification_report 

In [100]:
# Accuracy Score

svc_train_as = accuracy_score(y_train, y_svc_train_pred)
svc_test_as = accuracy_score(y_test, y_svc_test_pred)

print(svc_train_as, svc_test_as)

0.33775 0.339


In [101]:
# Classification Report 

svc_train_cr = classification_report(y_train, y_svc_train_pred)
svc_test_cr = classification_report(y_test, y_svc_test_pred)

print(svc_train_cr, svc_test_cr)

              precision    recall  f1-score   support

           0       0.34      0.54      0.42      5382
           1       0.34      0.13      0.19      5282
           2       0.34      0.34      0.34      5336

    accuracy                           0.34     16000
   macro avg       0.34      0.34      0.31     16000
weighted avg       0.34      0.34      0.31     16000
               precision    recall  f1-score   support

           0       0.34      0.55      0.42      1323
           1       0.35      0.12      0.18      1371
           2       0.34      0.35      0.34      1306

    accuracy                           0.34      4000
   macro avg       0.34      0.34      0.32      4000
weighted avg       0.34      0.34      0.31      4000



## Naive Bayes

### Training Model

In [102]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)

### Applying Model for Predictions 

In [103]:
y_gnb_train_pred = gnb.predict(X_train)
y_gnb_test_pred = gnb.predict(X_test)

print(y_gnb_train_pred)
print(y_gnb_test_pred)

[0 0 1 ... 2 0 2]
[1 0 2 ... 2 0 0]


### Evaluating Model Performance 

In [104]:
# Accuracy Score

gnb_train_as = accuracy_score(y_train, y_gnb_train_pred)
gnb_test_as = accuracy_score(y_test, y_gnb_test_pred)

print(gnb_train_as, gnb_test_as)

0.3454375 0.3375


In [105]:
# Classification Report 

gnb_train_cr = classification_report(y_train, y_gnb_train_pred)
gnb_test_cr = classification_report(y_test, y_gnb_test_pred)

print(gnb_train_cr, gnb_test_cr)

              precision    recall  f1-score   support

           0       0.34      0.40      0.37      5382
           1       0.35      0.27      0.31      5282
           2       0.35      0.36      0.35      5336

    accuracy                           0.35     16000
   macro avg       0.35      0.35      0.34     16000
weighted avg       0.35      0.35      0.34     16000
               precision    recall  f1-score   support

           0       0.34      0.41      0.37      1323
           1       0.34      0.25      0.29      1371
           2       0.33      0.35      0.34      1306

    accuracy                           0.34      4000
   macro avg       0.34      0.34      0.33      4000
weighted avg       0.34      0.34      0.33      4000



#### Best Model

In [106]:
models = {
    "Random Forest": (rf, accuracy_score(y_test, y_rf_test_pred)),
    "SVC": (svc, accuracy_score(y_test, y_svc_test_pred)),
    "Gaussian Naive Bayes": (gnb, accuracy_score(y_test, y_gnb_test_pred))
}

best_model_name = max(models, key=lambda k: models[k][1])
best_model, best_accuracy = models[best_model_name]

print(f"Best Model: {best_model_name}")
print(f"Best Accuracy: {best_accuracy:.3f}")

Best Model: SVC
Best Accuracy: 0.339
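The test accuracies being compared differ by about one percentage point on a single 4,000-row split, which may be within the noise of the split itself. A hedged alternative is to compare models with k-fold cross-validation and look at the spread across folds. The sketch uses synthetic data and only two of the three models to stay quick; `candidates` and `cv_scores` are illustrative names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)

candidates = {
    "Random Forest": RandomForestClassifier(max_depth=2, random_state=100),
    "Gaussian Naive Bayes": GaussianNB(),
}

# One accuracy value per fold for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between the means, picking a single "best" model from one split is not well supported.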


#### Testing

In [108]:
def decode_prediction(encoded_value):
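    # Assumes 0 = Normal, 1 = Abnormal, 2 = Inconclusive.
    # NOTE: LabelEncoder assigns codes in sorted label order, so check this mapping
    # against the fitted encoder's classes_ before trusting the decoded names.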
    mapping = {0: "Normal", 1: "Abnormal", 2: "Inconclusive"}
    return mapping.get(encoded_value, "Unknown")
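Rather than hard-coding the mapping, it can be read from the fitted encoder. `LabelEncoder` stores the labels in sorted order in `classes_`, so for these three labels the codes would be 0 = Abnormal, 1 = Inconclusive, 2 = Normal. In the notebook, `le` was last fitted on `Test Results`, so `le.classes_` and `le.inverse_transform` should reflect that column. A minimal sketch on illustrative labels, where `target_le` is a hypothetical dedicated encoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; in the notebook this would be the 'Test Results' column.
results = pd.Series(["Normal", "Inconclusive", "Normal", "Abnormal"])

# Hypothetical name: an encoder kept only for the target column.
target_le = LabelEncoder()
codes = target_le.fit_transform(results)

# classes_ is sorted, so the label at position i is the one encoded as i.
mapping = {i: str(label) for i, label in enumerate(target_le.classes_)}
print(mapping)

# inverse_transform turns predicted codes back into the original labels.
decoded = [str(label) for label in target_le.inverse_transform([0, 1, 2])]
print(decoded)
```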

In [109]:
# Select a few samples from your dataset
num_samples = 5
test_samples = X_test.iloc[:num_samples]
actual_results = y_test.iloc[:num_samples]

# Make predictions using each model
rf_predictions = rf.predict(test_samples)
svc_predictions = svc.predict(test_samples)
gnb_predictions = gnb.predict(test_samples)

# Print the results
print("Sample | Actual | RandomForest | SVC | Naive Bayes")
print("-" * 60)
for i in range(num_samples):
    actual = decode_prediction(actual_results.iloc[i])
    rf_pred = decode_prediction(rf_predictions[i])
    svc_pred = decode_prediction(svc_predictions[i])
    gnb_pred = decode_prediction(gnb_predictions[i])
    
    print(f"{i+1:6d} | {actual:7s} | {rf_pred:11s} | {svc_pred:3s} | {gnb_pred:11s}")

# Calculate accuracy for these samples
rf_accuracy = accuracy_score(actual_results, rf_predictions)
svc_accuracy = accuracy_score(actual_results, svc_predictions)
gnb_accuracy = accuracy_score(actual_results, gnb_predictions)

print("\nAccuracy on these samples:")
print(f"Random Forest: {rf_accuracy:.2f}")
print(f"SVC: {svc_accuracy:.2f}")
print(f"Naive Bayes: {gnb_accuracy:.2f}")

Sample | Actual | RandomForest | SVC | Naive Bayes
------------------------------------------------------------
     1 | Abnormal | Normal      | Inconclusive | Abnormal   
     2 | Abnormal | Normal      | Normal | Normal     
     3 | Normal  | Inconclusive | Inconclusive | Inconclusive
     4 | Normal  | Normal      | Normal | Normal     
     5 | Abnormal | Inconclusive | Inconclusive | Inconclusive

Accuracy on these samples:
Random Forest: 0.20
SVC: 0.20
Naive Bayes: 0.40


In [110]:
# Create a sample input
sample_input = pd.DataFrame({
    'Name': [0],  # Use encoded values here
    'Age': [30],  # Age was label-encoded above, so this is an encoded code, not an age in years
    'Gender': [0],
    'Blood Type': [1],
    'Medical Condition': [2],
    'Admission Type': [0],
    'Medication': [3],
    'Date of Admission': [100],
    'Discharge Date': [105]
})

# Make predictions
rf_pred = rf.predict(sample_input)[0]
svc_pred = svc.predict(sample_input)[0]
gnb_pred = gnb.predict(sample_input)[0]

print("Predictions for the sample input:")
print(f"Random Forest: {decode_prediction(rf_pred)}")
print(f"SVC: {decode_prediction(svc_pred)}")
print(f"Naive Bayes: {decode_prediction(gnb_pred)}")

Predictions for the sample input:
Random Forest: Abnormal
SVC: Abnormal
Naive Bayes: Inconclusive
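The sample input above is built from hand-picked integer codes, which is hard to relate to a real patient record. One possible approach, assuming a separate encoder is kept per column (the notebook reuses a single `le`), is to encode a raw record with the same fitted encoders. The frame, the `encoders` dict and `new_patient` are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the training columns (values are illustrative only).
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Blood Type": ["B-", "A+", "O+"],
})

# Fit and keep one encoder per column so each mapping can be reused later.
encoders = {}
encoded = train.copy()
for col in train.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(train[col])

# Encode a raw record with the already-fitted encoders (transform, not fit_transform).
new_patient = {"Gender": "Male", "Blood Type": "A+"}
sample_input = pd.DataFrame({
    col: encoders[col].transform([value]) for col, value in new_patient.items()
})

print(sample_input)
```

Note that `transform` raises an error for a value the encoder did not see during fitting, which is a useful guard against silently feeding a model meaningless codes.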
