# About Dataset
## **Context**
For a beginner in machine learning, this dataset is a good opportunity to try out some techniques for predicting which drug might be appropriate for a patient.

## **Content**
The target feature is:
* Drug type

The features are:
* Age
* Sex
* Blood Pressure Levels (BP)
* Cholesterol Levels
* Na to Potassium Ratio
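
As a minimal sketch of how these columns separate into features and target, the snippet below builds a two-row frame from the rows shown in the `data.sample(2)` output later in this notebook and splits it. The names `target_col`, `X_example`, and `y_example` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the data.sample(2) output shown later in this notebook.
example = pd.DataFrame({
    "Age": [60, 57],
    "Sex": ["M", "F"],
    "BP": ["HIGH", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH"],
    "Na_to_K": [13.934, 14.216],
    "Drug": ["drugB", "drugX"],
})

target_col = "Drug"                              # the label to predict
X_example = example.drop(columns=[target_col])   # the five feature columns
y_example = example[target_col]                  # the target column

print(list(X_example.columns))
print(y_example.tolist())
```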

## **Inspiration**
The main challenge here is not just the features and the target but also the approach a beginner takes to solving this type of problem. So best of luck.

# import 

In [1]:
import numpy as np

import pandas as pd

from pandas_profiling import ProfileReport

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, classification_report

# read data

In [2]:
data = pd.read_csv('../input/drug-classification/drug200.csv')

# pandas_profiling

In [3]:
profile = ProfileReport(data, title="Drug Pandas Profiling Report")
profile




In [4]:
data.isna().sum()

Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
Drug           0
dtype: int64

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB


In [6]:
data.sample(2)

,Age,Sex,BP,Cholesterol,Na_to_K,Drug
80,60,M,HIGH,HIGH,13.934,drugB
167,57,F,NORMAL,HIGH,14.216,drugX


In [7]:
# x = data.drop('Drug',axis=1)
# y = data[['Drug']]

# Preprocessing data

In [8]:
data_num = data.select_dtypes(['int64','float64'])
data_obj = data.select_dtypes(['object'])

In [9]:
col = list(data_obj.columns)
for i in col:
    print(i)
    print(data_obj[i].unique(),'\n\n')

Sex
['F' 'M'] 


BP
['HIGH' 'LOW' 'NORMAL'] 


Cholesterol
['HIGH' 'NORMAL'] 


Drug
['DrugY' 'drugC' 'drugX' 'drugA' 'drugB'] 




In [10]:
### obj data

# nominal columns (no natural order): Sex
# ordinal columns (ordered levels): BP, Cholesterol
nominal_cols = ['Sex']
ordinal_data = ['BP', 'Cholesterol']

# Encode each ordinal column with an explicit level order:
# BP: HIGH=0, NORMAL=1, LOW=2; Cholesterol: HIGH=0, NORMAL=1
ordinal_encoder = OrdinalEncoder(categories=[['HIGH', 'NORMAL', 'LOW'],
                                             ['HIGH', 'NORMAL']])
cat_encoded_ = ordinal_encoder.fit_transform(data_obj[ordinal_data])
data_obj_ord = pd.DataFrame(cat_encoded_, columns=ordinal_data)

def nominal_data(df, i):
    """One-hot encode column `i` of `df`; output columns are named i0, i1, ..."""
    cat_encoder = OneHotEncoder()
    x = cat_encoder.fit_transform(df[[i]])
    qw = [f'{i}{r}' for r in range(len(df[i].unique()))]
    df = pd.DataFrame(x.toarray(), dtype=np.float64, columns=qw)
    return df

Sex = nominal_data(data_obj, 'Sex')
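
The ordinal and one-hot steps above can also be combined in a single scikit-learn `ColumnTransformer`. The following is only a sketch under assumptions: the rows in `toy` are made up for illustration (they are not taken from `drug200.csv`), and the names `toy`, `preprocess`, and `encoded_df` are hypothetical. `sparse_threshold=0` is set so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows for illustration only; not the real dataset.
toy = pd.DataFrame({
    "Age": [25, 40, 60, 57],
    "Sex": ["F", "M", "M", "F"],
    "BP": ["HIGH", "LOW", "HIGH", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH", "NORMAL", "NORMAL"],
    "Na_to_K": [20.0, 10.0, 15.0, 12.0],
})

ordinal_cols = ["BP", "Cholesterol"]
nominal_cols = ["Sex"]

preprocess = ColumnTransformer(
    transformers=[
        # Same level order as the notebook: HIGH=0, NORMAL=1, LOW=2.
        ("ordinal",
         OrdinalEncoder(categories=[["HIGH", "NORMAL", "LOW"], ["HIGH", "NORMAL"]]),
         ordinal_cols),
        ("nominal", OneHotEncoder(), nominal_cols),
    ],
    remainder="passthrough",   # keep Age and Na_to_K unchanged
    sparse_threshold=0,        # always return a dense array
)

encoded = preprocess.fit_transform(toy)

# Output order: ordinal columns, one-hot columns, then the passthrough columns.
sex_levels = preprocess.named_transformers_["nominal"].categories_[0]
columns = ordinal_cols + [f"Sex_{level}" for level in sex_levels] + ["Age", "Na_to_K"]
encoded_df = pd.DataFrame(encoded, columns=columns, index=toy.index)
print(encoded_df)
```

Keeping the encoders inside one transformer object means the same fitted mapping can later be applied to new rows with `preprocess.transform(...)`.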

In [11]:
ALL_Data = pd.concat([data_num,data_obj_ord,Sex],axis=1)

In [12]:
def scaler(data):
    # Standardize every column to zero mean and unit variance.
    num_scaler = StandardScaler()
    scaled = num_scaler.fit_transform(data)
    data = pd.DataFrame(scaled, columns=data.columns, index=data.index)
    return data

train_ordinal_data_scaler = scaler(ALL_Data)

In [13]:
train_ordinal_data_scaler

,Age,Na_to_K,BP,Cholesterol,Sex0,Sex1
0,-1.291591,1.286522,-1.116921,-0.970437,1.040833,-1.040833
1,0.162699,-0.415145,1.272214,-0.970437,-0.960769,0.960769
2,0.162699,-0.828558,1.272214,-0.970437,-0.960769,0.960769
3,-0.988614,-1.149963,0.077647,-0.970437,1.040833,-1.040833
4,1.011034,0.271794,1.272214,-0.970437,1.040833,-1.040833
...,...,...,...,...,...,...
195,0.708057,-0.626917,1.272214,-0.970437,1.040833,-1.040833
196,-1.715759,-0.565995,1.272214,-0.970437,-0.960769,0.960769
197,0.465676,-0.859089,0.077647,-0.970437,-0.960769,0.960769
198,-1.291591,-0.286500,0.077647,1.030464,-0.960769,0.960769


# splitting data

In [14]:
X = train_ordinal_data_scaler
y = data['Drug']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
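
In this notebook the scaler is fitted on the whole table before the split, so the test rows contribute to the scaling statistics. A common alternative, sketched here on synthetic stand-in data, is to split first and fit `StandardScaler` on the training rows only. The `stratify` argument is an optional addition that is not used in the original cell, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 2 numeric features, 4 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_demo = np.repeat(["a", "b", "c", "d"], 50)

# Split first; stratify keeps the class proportions equal in both parts.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr_demo)  # statistics come from training rows only
X_te_scaled = demo_scaler.transform(X_te_demo)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))
print(np.unique(y_te_demo, return_counts=True))
```

Tree ensembles such as random forests are generally insensitive to feature scaling, so this mainly matters if the same preprocessing is reused with a scale-sensitive model.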

# RandomForestClassifier

In [15]:
RFC = RandomForestClassifier()
RFC.fit(X_train,y_train)
ypred = RFC.predict(X_test)
print(RFC,":",accuracy_score(y_test,ypred)*100)

RandomForestClassifier() : 95.0


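RandomForestClassifier() : 96.6 was also observed; because the forest is created without a fixed `random_state`, the accuracy can change from run to run.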
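
One way to get a repeatable score that depends less on a single split is to fix `random_state` and average over cross-validation folds. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the drug dataset), and the names `X_syn`, `y_syn`, `scores_a`, and `scores_b` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption), roughly the size of the notebook's table.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Two forests with the same seed should produce identical fold scores.
scores_a = cross_val_score(RandomForestClassifier(random_state=42), X_syn, y_syn, cv=5)
scores_b = cross_val_score(RandomForestClassifier(random_state=42), X_syn, y_syn, cv=5)

print("fold accuracies:", scores_a)
print("mean accuracy  :", scores_a.mean())
print("repeatable     :", bool((scores_a == scores_b).all()))
```

In the notebook itself, the same idea would be `RandomForestClassifier(random_state=...)` with `X` and `y` passed to `cross_val_score`.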

# classification_report

In [16]:
print(classification_report(y_test,ypred))

              precision    recall  f1-score   support

       DrugY       1.00      1.00      1.00        26
       drugA       1.00      1.00      1.00         7
       drugB       1.00      1.00      1.00         3
       drugC       1.00      0.50      0.67         6
       drugX       0.86      1.00      0.92        18

    accuracy                           0.95        60
   macro avg       0.97      0.90      0.92        60
weighted avg       0.96      0.95      0.94        60
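
The report shows a recall of 0.50 for drugC and a precision of 0.86 for drugX, which would be consistent with some drugC rows being predicted as drugX. A confusion matrix can confirm where the errors go. The sketch below uses made-up label lists purely for illustration (`y_true_toy` and `y_pred_toy` are hypothetical and are not the notebook's results).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["DrugY", "drugA", "drugB", "drugC", "drugX"]

# Made-up labels for illustration only; not the notebook's predictions.
y_true_toy = ["DrugY", "drugA", "drugB", "drugC", "drugC", "drugX"]
y_pred_toy = ["DrugY", "drugA", "drugB", "drugC", "drugX", "drugX"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```

In the notebook, passing `y_test` and `ypred` in place of the toy lists would show the actual breakdown.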



# Notes 😃😃😃😃
* Thank you for reading my analysis and classification. 😃😃😃😃

* If you have any questions or advice, please write them in the comments. ❤️❤️❤️❤️

* If anyone has a model with a higher accuracy, please tell me 🤝🤝🤝; it will support me.

# Vote ❤️😃
## If you liked my work, please upvote it

# The End...