Machine Learning Task - Classification

Build several classification models from historical data on patients and their responses to different medications. Then use the trained models to predict the class of an unknown patient, that is, to find a suitable drug for a new patient.

About the dataset
Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.


Part of your job is to build a model that finds out which drug might be appropriate for a future patient with the same illness. The features in this dataset are the patient's Age, Sex, Blood Pressure (BP), Cholesterol, and Na_to_K (judging by the column name, a sodium-to-potassium ratio), and the target is the drug that each patient responded to.


You can use the training part of the dataset to build logistic regression, decision tree, random forest, KNN, SVM, and Naive Bayes classifiers, and then use them to predict the class of an unknown patient, that is, to suggest a drug for a new patient.


Downloading the Data
To download the data, we will use !wget to download it from IBM Object Storage.


Download Dataset

In [4]:
!wget -O drug200.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv

--2024-04-24 07:33:03--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6027 (5.9K) [text/csv]
Saving to: ‘drug200.csv’


2024-04-24 07:33:03 (1.21 GB/s) - ‘drug200.csv’ saved [6027/6027]



Importing Libraries and Dataset

In [30]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [5]:
#read the data
drug_data = pd.read_csv("/content/drug200.csv")

In [6]:
drug_data.head()

   Age Sex      BP Cholesterol  Na_to_K   Drug
0   23   F    HIGH        HIGH   25.355  drugY
1   47   M     LOW        HIGH   13.093  drugC
2   47   M     LOW        HIGH   10.114  drugC
3   28   F  NORMAL        HIGH    7.798  drugX
4   61   F     LOW        HIGH   18.043  drugY


In [7]:
# number of rows and columns
drug_data.shape

(200, 6)

In [8]:
drug_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB


In [9]:
# summary statistics of the numerical columns
drug_data.describe()

             Age    Na_to_K
count      200.0      200.0
mean      44.315  16.084485
std    16.544315   7.223956
min         15.0      6.269
25%         31.0    10.4455
50%         45.0    13.9365
75%         58.0      19.38
max         74.0     38.247


In [10]:
#data cleaning
drug_data.drop_duplicates(inplace=True)
drug_data.dropna(inplace=True)
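
Before dropping rows, it can help to see how many duplicates and missing values are actually present, so the effect of the cleaning step is visible. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from drug200.csv) and only standard pandas calls.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few drug200 columns; values are made up.
toy = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", None],
    "Na_to_K": [25.355, 13.093, 13.093, 7.798, 18.043],
    "Drug": ["drugY", "drugC", "drugC", "drugX", "drugY"],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Count missing values per column.
n_missing = toy.isna().sum()

# Same cleaning as in the cell above, but without inplace=True so the
# original frame stays available for comparison.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(f"duplicate rows: {n_duplicates}")
print(n_missing)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

On the real data, `drug_data.info()` already reports 200 non-null values in every column, so `dropna` is expected to remove nothing there.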

In [11]:
# separate the features (x) from the target (y)
x = drug_data.drop("Drug", axis=1)
y = drug_data["Drug"]

In [12]:
x

     Age Sex      BP Cholesterol  Na_to_K
0     23   F    HIGH        HIGH   25.355
1     47   M     LOW        HIGH   13.093
2     47   M     LOW        HIGH   10.114
3     28   F  NORMAL        HIGH    7.798
4     61   F     LOW        HIGH   18.043
..   ...  ..     ...         ...      ...
195   56   F     LOW        HIGH   11.567
196   16   M     LOW        HIGH   12.006
197   52   M  NORMAL        HIGH    9.894
198   23   M  NORMAL      NORMAL   14.020


In [13]:
y

0      drugY
1      drugC
2      drugC
3      drugX
4      drugY
       ...  
195    drugC
196    drugC
197    drugX
198    drugX
199    drugX
Name: Drug, Length: 200, dtype: object

In [14]:
# List of categorical and numerical columns
# to_list must be called (with parentheses); otherwise the bound method itself is printed
Numerical_column = drug_data.select_dtypes(include = "number").columns.to_list()
Categorical_column = drug_data.select_dtypes(exclude = "number").columns.to_list()
print(f"Numerical column in the data : {Numerical_column}")
print(f"Categorical column in the data : {Categorical_column}")

Numerical column in the data : ['Age', 'Na_to_K']
Categorical column in the data : ['Sex', 'BP', 'Cholesterol', 'Drug']


In [15]:
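# Encode categorical variables as integers.
# LabelEncoder assigns codes in sorted label order, so here
# Sex: F=0, M=1; BP: HIGH=0, LOW=1, NORMAL=2; Cholesterol: HIGH=0, NORMAL=1.
# The encoder is refit for each column, so it only remembers the last one.
label_encoder = LabelEncoder()
x["Sex"] = label_encoder.fit_transform(x["Sex"])
x["BP"] = label_encoder.fit_transform(x["BP"])
x["Cholesterol"] = label_encoder.fit_transform(x["Cholesterol"])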
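
Because `LabelEncoder` sorts labels alphabetically, the integer codes for BP do not follow the natural LOW < NORMAL < HIGH order. The sketch below, on a made-up `bp` series, shows how to inspect the mapping and, as one possible alternative rather than the notebook's method, how `OrdinalEncoder` with an explicit `categories` list keeps the intended order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample of blood-pressure labels.
bp = pd.Series(["HIGH", "LOW", "NORMAL", "LOW", "HIGH"])

# LabelEncoder: codes follow the sorted order stored in classes_.
le = LabelEncoder()
le.fit(bp)
label_mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(label_mapping)

# OrdinalEncoder: the order is stated explicitly, so LOW=0, NORMAL=1, HIGH=2.
oe = OrdinalEncoder(categories=[["LOW", "NORMAL", "HIGH"]])
ordered_codes = oe.fit_transform(bp.to_frame()).ravel()
print(ordered_codes)
```

Tree-based models are largely indifferent to the code order, while distance-based and linear models (KNN, SVM, logistic regression) treat the codes as magnitudes, so the ordering can matter for them.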

In [16]:
# Normalization/Standardization
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

In [17]:
# Data splitting
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)
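
The cells above fit the scaler on all rows and split afterwards, which lets statistics from the test rows influence the scaling of the training rows. A common alternative, sketched here on synthetic data (all `*_demo` names and values are hypothetical), is to split first, fit the scaler on the training part only, and reuse it for the test part. The `stratify` argument additionally keeps class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and drug labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = np.repeat(["drugA", "drugB", "drugC", "drugX", "drugY"], 20)

# 1) Split first, keeping the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2) Learn mean and standard deviation from the training rows only.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# 3) Apply the same, already fitted transformation to the test rows.
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per feature
print(X_te_scaled.shape)
```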

Model Building

In [19]:
# Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)

In [20]:
# Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train, y_train)

In [21]:
# Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

In [24]:
# K-Nearest Neighbors (KNN)
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)

In [26]:
# Support Vector Machine (SVM)
svm_model = SVC()
svm_model.fit(x_train, y_train)

In [27]:
# Naive Bayes
nb_model = GaussianNB()
nb_model.fit(x_train, y_train)

In [31]:
# Model Evaluation
models = {
    "Logistic Regression": logistic_model,
    "Decision Tree": dt_model,
    "Random Forest": rf_model,
    "KNN": knn_model,
    "SVM": svm_model,
    "Naive Bayes": nb_model
}

for name, model in models.items():
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy}")
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

Logistic Regression Accuracy: 0.925
              precision    recall  f1-score   support

       drugA       0.86      1.00      0.92         6
       drugB       1.00      1.00      1.00         3
       drugC       1.00      0.80      0.89         5
       drugX       1.00      0.91      0.95        11
       drugY       0.88      0.93      0.90        15

    accuracy                           0.93        40
   macro avg       0.95      0.93      0.93        40
weighted avg       0.93      0.93      0.93        40

[[ 6  0  0  0  0]
 [ 0  3  0  0  0]
 [ 0  0  4  0  1]
 [ 0  0  0 10  1]
 [ 1  0  0  0 14]]
Decision Tree Accuracy: 1.0
              precision    recall  f1-score   support

       drugA       1.00      1.00      1.00         6
       drugB       1.00      1.00      1.00         3
       drugC       1.00      1.00      1.00         5
       drugX       1.00      1.00      1.00        11
       drugY       1.00      1.00      1.00        15

    accuracy                  
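
With only 40 test rows, a single train/test split gives a noisy accuracy estimate, and a perfect score may not carry over to another split. One way to get a steadier comparison is k-fold cross-validation. The following is a minimal sketch on synthetic data from `make_classification`; the data, the two chosen models, and the names `X_cv`, `y_cv`, `candidates`, and `cv_results` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the encoded patient features.
X_cv, y_cv = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Five stratified folds: every fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline, so it is refit on each training fold.
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_results = {}
for model_name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_results[model_name] = scores
    print(f"{model_name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds (`std`) indicates how much of the difference between two models could be due to the particular split.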

In [33]:
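# Prediction (using a trained model)
# The models were trained on label-encoded, standardized features, so a new
# patient must be encoded the same way and passed through the fitted scaler.
# Illustrative patient: Age=30, Sex=F (0), BP=HIGH (0), Cholesterol=HIGH (0), Na_to_K=25.355
new_patient_data = pd.DataFrame([[30, 0, 0, 0, 25.355]], columns=x.columns)
new_patient_scaled = scaler.transform(new_patient_data)
predicted_drug = rf_model.predict(new_patient_scaled)
print("Predicted Drug for New Patient:", predicted_drug)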

Predicted Drug for New Patient: ['drugY']
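
Encoding and scaling a new patient by hand is easy to get wrong. One option is to bundle preprocessing and model into a single scikit-learn `Pipeline`, so raw values such as "F" or "HIGH" can be passed directly. The sketch below is an assumption-laden illustration, not the notebook's method: it swaps `LabelEncoder` for `OneHotEncoder`, and the training rows in `demo_train` (including their drug labels) are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up training set with the same column names as drug200.csv.
demo_train = pd.DataFrame({
    "Age": [23, 47, 28, 61, 35, 52],
    "Sex": ["F", "M", "F", "F", "M", "M"],
    "BP": ["HIGH", "LOW", "NORMAL", "LOW", "HIGH", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "NORMAL", "NORMAL", "HIGH"],
    "Na_to_K": [25.4, 13.1, 7.8, 18.0, 9.5, 9.9],
    "Drug": ["drugY", "drugC", "drugX", "drugY", "drugA", "drugX"],
})

# Scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Na_to_K"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "BP", "Cholesterol"]),
])

demo_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
demo_pipeline.fit(demo_train.drop(columns="Drug"), demo_train["Drug"])

# A new patient described with raw, human-readable values.
demo_patient = pd.DataFrame([{
    "Age": 30, "Sex": "F", "BP": "HIGH", "Cholesterol": "HIGH", "Na_to_K": 25.0,
}])
demo_prediction = demo_pipeline.predict(demo_patient)
print("Predicted drug:", demo_prediction)
```

The pipeline applies exactly the transformations learned during `fit`, so training and prediction cannot drift apart.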
