# DRUG TYPE PREDICTION USING MACHINE LEARNING CLASSIFICATION MODELS

This project aim to classify the appropriate drug type for patients based on their health parameters using a variety of machine learning classification models. The dataset includes features such as age, gender, blood pressure (BP), cholesterol levels, and the sodium-to-potassium ratio (Na_to_K), with the target variable being the prescribed drug category. By implementing models such as K-Nearest Neighbors (KNN),Random Forest, Decision Tree, and Logistic Regression, this project seeks to compare model performance and determine the most effective approach for accurate drug prediction.

Importing Dataset and libraries

In [None]:
import pandas as pd
import warnings 
warnings.filterwarnings('ignore')
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
df=pd.read_csv(r"E:\data_analytics\ml_works\machine_learning_projects\drug200.csv")
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,DrugY


Information About Data

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB


Encoding Categorical data

In [3]:
le=LabelEncoder()
column_to_convert=['BP','Cholesterol','Drug']
for col in column_to_convert:
    df[col]=le.fit_transform(df[col])
df.head(10)

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,0,0,25.355,0
1,47,M,1,0,13.093,3
2,47,M,1,0,10.114,3
3,28,F,2,0,7.798,4
4,61,F,1,0,18.043,0
5,22,F,2,0,8.607,4
6,49,F,2,0,16.275,0
7,41,M,1,0,11.037,3
8,60,M,2,0,15.171,0
9,43,M,1,1,19.368,0


Independent and Dependent Variable

In [4]:
x=df.iloc[:,[0,2,3,4]].values
x=pd.DataFrame(x)
y=df.iloc[:,5].values
y=pd.DataFrame(y)
print(x)
print(y)

        0    1    2       3
0    23.0  0.0  0.0  25.355
1    47.0  1.0  0.0  13.093
2    47.0  1.0  0.0  10.114
3    28.0  2.0  0.0   7.798
4    61.0  1.0  0.0  18.043
..    ...  ...  ...     ...
195  56.0  1.0  0.0  11.567
196  16.0  1.0  0.0  12.006
197  52.0  2.0  0.0   9.894
198  23.0  2.0  1.0  14.020
199  40.0  1.0  1.0  11.349

[200 rows x 4 columns]
     0
0    0
1    3
2    3
3    4
4    0
..  ..
195  3
196  3
197  4
198  4
199  4

[200 rows x 1 columns]


Splitting Variable to test and train

In [5]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)

Feature Scaling

In [6]:
st_x=StandardScaler()
x_train=st_x.fit_transform(x_train)
x_test=st_x.fit_transform(x_test)
print(x_train)

[[-1.56623476  1.29344713  1.09812675 -0.95102405]
 [-0.27267433  0.10315836  1.09812675  0.36126848]
 [-1.44303853  1.29344713 -0.91064169 -1.0445412 ]
 [ 1.26727857 -1.08713041 -0.91064169  0.02937974]
 [-0.33427244  1.29344713 -0.91064169 -0.83697198]
 [-0.27267433  1.29344713  1.09812675  0.93929879]
 [ 0.2201106  -1.08713041  1.09812675 -1.36893747]
 [-1.56623476 -1.08713041  1.09812675  2.7061346 ]
 [-0.39587056  1.29344713  1.09812675  0.15120178]
 [ 0.77449364  0.10315836 -0.91064169  3.06799323]
 [-1.56623476 -1.08713041 -0.91064169 -0.67616134]
 [ 0.65129741 -1.08713041 -0.91064169  1.28478499]
 [ 1.76006349 -1.08713041 -0.91064169 -0.9113417 ]
 [-0.14947809  0.10315836 -0.91064169 -0.105346  ]
 [ 0.03531625  1.29344713  1.09812675 -1.227968  ]
 [ 0.95928799 -1.08713041 -0.91064169  1.29588494]
 [ 0.28170872  1.29344713  1.09812675 -0.04790372]
 [ 1.14408234 -1.08713041  1.09812675  0.66554608]
 [ 1.02088611  1.29344713 -0.91064169  0.06365086]
 [ 0.5896993  -1.08713041 -0.91

### Decision Tree Model

In [31]:
from sklearn.tree import DecisionTreeClassifier

Model fitting

In [7]:
classifier=DecisionTreeClassifier(criterion='entropy',random_state=42)
classifier.fit(x_train,y_train)

Model Prediction

In [8]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[4 0 4 3 0 0 0 4 1 4 1 4 0 1 2 0 2 4 3 0 2 4 4 0 0 0 3 4 0 4 0 3 3 0 2 0 4
 1 0 1 4 4 4 0 0 3 0 0 0 4]
     0
95   4
15   0
30   4
158  3
128  0
115  0
69   0
170  4
174  1
45   4
66   1
182  4
165  0
78   1
186  2
177  0
56   2
152  4
82   3
68   0
124  2
16   4
148  4
93   0
65   0
60   0
84   3
67   4
125  0
132  4
9    0
18   3
55   3
75   0
150  1
104  0
135  4
137  1
164  0
76   1
79   4
197  4
38   4
24   0
122  0
195  3
29   0
19   0
143  0
86   4


Model Evaluation

In [9]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.02
mean squared error: 0.02
root mean squared error: 0.1414213562373095


Prediction Accuracy

In [10]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

98.0 %


### SVM Model

In [11]:
from sklearn.svm import SVC

Model Fitting

In [12]:
classifier=SVC(kernel='linear',random_state=42)
classifier.fit(x_train,y_train)

Model Prediction

In [13]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[4 0 4 3 0 0 0 4 1 4 1 4 0 1 2 0 2 4 3 0 2 4 4 0 0 0 3 4 0 4 0 3 3 0 2 0 4
 1 0 1 4 4 4 0 0 3 0 0 2 4]
     0
95   4
15   0
30   4
158  3
128  0
115  0
69   0
170  4
174  1
45   4
66   1
182  4
165  0
78   1
186  2
177  0
56   2
152  4
82   3
68   0
124  2
16   4
148  4
93   0
65   0
60   0
84   3
67   4
125  0
132  4
9    0
18   3
55   3
75   0
150  1
104  0
135  4
137  1
164  0
76   1
79   4
197  4
38   4
24   0
122  0
195  3
29   0
19   0
143  0
86   4


Model Evaluation

In [14]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.06
mean squared error: 0.1
root mean squared error: 0.31622776601683794


Prediction Accuracy

In [15]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

96.0 %


### KNN Model

In [16]:
from sklearn.neighbors import KNeighborsClassifier

Model Fitting

In [17]:
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train,y_train)

Model Prediction

In [18]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[4 0 4 3 0 0 0 4 0 4 1 4 0 1 2 0 2 4 3 0 2 4 4 0 0 0 3 4 0 4 0 3 3 0 2 0 4
 1 0 1 4 4 4 0 0 3 0 0 2 4]
     0
95   4
15   0
30   4
158  3
128  0
115  0
69   0
170  4
174  1
45   4
66   1
182  4
165  0
78   1
186  2
177  0
56   2
152  4
82   3
68   0
124  2
16   4
148  4
93   0
65   0
60   0
84   3
67   4
125  0
132  4
9    0
18   3
55   3
75   0
150  1
104  0
135  4
137  1
164  0
76   1
79   4
197  4
38   4
24   0
122  0
195  3
29   0
19   0
143  0
86   4


Model Evaluation

In [19]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.08
mean squared error: 0.12
root mean squared error: 0.34641016151377546


Prediction Accuracy

In [20]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

94.0 %


### Random Forest Model

In [21]:
from sklearn.ensemble import RandomForestClassifier

Model Fitting

In [22]:
classifier=RandomForestClassifier(criterion='entropy',n_estimators=10,random_state=42)
classifier.fit(x_train,y_train)

Model Prediction

In [23]:
y_pred=classifier.predict(x_test)
print(y_test)
print(y_pred)

     0
95   4
15   0
30   4
158  3
128  0
115  0
69   0
170  4
174  1
45   4
66   1
182  4
165  0
78   1
186  2
177  0
56   2
152  4
82   3
68   0
124  2
16   4
148  4
93   0
65   0
60   0
84   3
67   4
125  0
132  4
9    0
18   3
55   3
75   0
150  1
104  0
135  4
137  1
164  0
76   1
79   4
197  4
38   4
24   0
122  0
195  3
29   0
19   0
143  0
86   4
[4 0 4 3 0 0 0 4 1 4 1 4 0 1 2 0 2 4 3 0 2 4 4 0 0 0 3 4 0 4 0 3 4 0 2 0 4
 1 0 1 4 4 4 0 0 3 0 0 0 4]


Model Evaluation

In [24]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.04
mean squared error: 0.04
root mean squared error: 0.2


Prediction Accuracy

In [25]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

96.0 %


### Logistic Regression Model

In [26]:
from sklearn.linear_model import LogisticRegression

Model Fitting

In [27]:
model=LogisticRegression(max_iter=1000)
model.fit(x_test,y_test)

Model Prediction

In [28]:
y_pred=model.predict(x_test)
print(y_test)
print(y_pred)

     0
95   4
15   0
30   4
158  3
128  0
115  0
69   0
170  4
174  1
45   4
66   1
182  4
165  0
78   1
186  2
177  0
56   2
152  4
82   3
68   0
124  2
16   4
148  4
93   0
65   0
60   0
84   3
67   4
125  0
132  4
9    0
18   3
55   3
75   0
150  1
104  0
135  4
137  1
164  0
76   1
79   4
197  4
38   4
24   0
122  0
195  3
29   0
19   0
143  0
86   4
[4 0 4 3 0 0 0 4 1 4 1 4 0 1 2 0 2 4 3 0 1 4 4 0 0 0 3 4 0 4 0 3 3 0 1 0 4
 1 0 1 4 4 4 0 0 3 0 0 2 4]


Model Evaluation

In [29]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.06
mean squared error: 0.1
root mean squared error: 0.31622776601683794


Prediction Accuracy

In [30]:
score=metrics.accuracy_score(y_test,y_pred)
print(score*100,"%")

96.0 %


Conclusion:
The classification project aimed to predict the appropriate drug type for patients using various machine learning models, including KNN,SVM,Random Forest, Decision Tree, and Logistic Regression. The Decision Tree model emerged as the best performer, achieving an impressive accuracy of 98%. This high accuracy indicates that the model effectively captured the relationships between patient health parameters and the prescribed drug type.