# Drug Prediction for Patients

## Problem Statement:
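The dataset (`drug200.csv`) contains 200 patient records. Each record holds general information (age, sex), diagnostic measurements (blood pressure, cholesterol, sodium-to-potassium ratio), and the drug assigned to the patient.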

## Objective:
The task is to develop a machine learning model that predicts the type of drug suited to a patient from these features. Such a model could support personalized prescribing decisions.

### Loading Dataset and EDA

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('drug200.csv')

# Display the first few rows of the dataset to get an overview
print("First 5 rows of the dataset:")
print(df.head())

# Check the information of the dataset (data types, non-null counts)
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# Get the statistical summary of the dataset for numerical columns
print("\nStatistical Summary:")
print(df.describe())

# Check for missing values in the dataset
print("\nMissing Values:")
print(df.isnull().sum())

First 5 rows of the dataset:
   Age Sex      BP Cholesterol  Na_to_K   Drug
0   23   F    HIGH        HIGH   25.355  drugY
1   47   M     LOW        HIGH   13.093  drugC
2   47   M     LOW        HIGH   10.114  drugC
3   28   F  NORMAL        HIGH    7.798  drugX
4   61   F     LOW        HIGH   18.043  drugY

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB

Statistical Summary:
              Age     Na_to_K
count  200.000000  200.000000
mean    44.315000   16.084485
std     16.544315    7.223956
min     15.000000    6.26
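
The statistical summary covers only the numeric columns. Before modelling, it can also help to look at how the target classes and the categorical features are distributed. The following is a minimal sketch that uses only the five rows printed above as stand-in data; the name `sample` is an assumption for illustration, and in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() above.
# In the notebook, the full `df` loaded from drug200.csv would be used instead.
sample = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
    "Na_to_K": [25.355, 13.093, 10.114, 7.798, 18.043],
    "Drug": ["drugY", "drugC", "drugC", "drugX", "drugY"],
})

# Count rows per target class, in absolute terms and as proportions.
class_counts = sample["Drug"].value_counts()
class_share = sample["Drug"].value_counts(normalize=True)

# Number of distinct values in each categorical feature.
categorical_levels = sample[["Sex", "BP", "Cholesterol"]].nunique()

print(class_counts)
print(class_share.round(2))
print(categorical_levels)
```

On the full dataset, a strongly uneven `class_counts` would be a reason to consider a stratified split.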

### Encoding

In [2]:
from sklearn.preprocessing import LabelEncoder

# Encoding the binary 'Sex' column (classes are sorted alphabetically: F -> 0, M -> 1)
sex_encoder = LabelEncoder()
df['Sex'] = sex_encoder.fit_transform(df['Sex'])

# Encoding the ordinal 'BP' and 'Cholesterol' columns with explicit mappings
bp_mapping = {'LOW': 0, 'NORMAL': 1, 'HIGH': 2}
cholesterol_mapping = {'NORMAL': 0, 'HIGH': 1}
df['BP'] = df['BP'].map(bp_mapping)
df['Cholesterol'] = df['Cholesterol'].map(cholesterol_mapping)

# Encoding the target variable 'Drug' with its own encoder, so that
# drug_encoder.classes_ can later map predictions back to drug names
drug_encoder = LabelEncoder()
df['Drug'] = drug_encoder.fit_transform(df['Drug'])
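
A `LabelEncoder` remembers only its most recent fit, so an encoder dedicated to the target is what makes it possible to translate integer predictions back into drug names. The sketch below illustrates this with the five `Drug` values printed earlier; the variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in target values taken from the rows printed by df.head().
drugs = pd.Series(["drugY", "drugC", "drugC", "drugX", "drugY"], name="Drug")

# A dedicated encoder for the target keeps the class-name mapping available.
drug_encoder = LabelEncoder()
encoded = drug_encoder.fit_transform(drugs)

# classes_ is sorted alphabetically; a label's position is its integer code.
code_to_name = dict(enumerate(drug_encoder.classes_))

# inverse_transform maps integer codes (e.g. model predictions) back to labels.
decoded = drug_encoder.inverse_transform(encoded)

print(code_to_name)
print(list(decoded))
```

With only three of the drug labels present in this stand-in, the codes differ from those produced on the full dataset, where all classes are seen during fitting.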

### Train-Test Split and Model Training

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

X = df.drop(columns=['Drug'])
y = df['Drug']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

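# Standardize the continuous features only; the scaler is fitted on the
# training set and then applied to the test set to avoid data leakage
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[['Age', 'Na_to_K']] = scaler.fit_transform(X_train[['Age', 'Na_to_K']])
X_test_scaled[['Age', 'Na_to_K']] = scaler.transform(X_test[['Age', 'Na_to_K']])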

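# Initialize models (default hyperparameters unless noted)
dt_model = DecisionTreeClassifier()  # No scaling needed for Decision Tree
knn_model = KNeighborsClassifier()   # Distance-based, so it needs scaled features
# One-vs-rest logistic regression. Note: `multi_class` is deprecated since scikit-learn 1.5,
# where OneVsRestClassifier(LogisticRegression()) is the suggested equivalent
lr_model = LogisticRegression(multi_class='ovr')  # Needs scaling
svm_model = SVC()  # RBF kernel by default; needs scaling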

# Train models
dt_model.fit(X_train, y_train)
knn_model.fit(X_train_scaled, y_train)
lr_model.fit(X_train_scaled, y_train)
svm_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_dt = dt_model.predict(X_test)
y_pred_knn = knn_model.predict(X_test_scaled)
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate models
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("k-NN Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))

# Print detailed classification reports
# (zero_division=0 reports 0 for a class with no predicted samples instead of emitting a warning)
print("\nDecision Tree Report:\n", classification_report(y_test, y_pred_dt, zero_division=0))
print("\nk-NN Report:\n", classification_report(y_test, y_pred_knn, zero_division=0))
print("\nLogistic Regression Report:\n", classification_report(y_test, y_pred_lr, zero_division=0))
print("\nSVM Report:\n", classification_report(y_test, y_pred_svm, zero_division=0))



Decision Tree Accuracy: 1.0
k-NN Accuracy: 0.8833333333333333
Logistic Regression Accuracy: 0.8666666666666667
SVM Accuracy: 0.95

Decision Tree Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      1.00      1.00         3
           2       1.00      1.00      1.00         6
           3       1.00      1.00      1.00        18
           4       1.00      1.00      1.00        26

    accuracy                           1.00        60
   macro avg       1.00      1.00      1.00        60
weighted avg       1.00      1.00      1.00        60


k-NN Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.50      1.00      0.67         3
           2       1.00      0.33      0.50         6
           3       0.82      1.00      0.90        18
           4       1.00      0.88      0.94        26

    accuracy    
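
The manual scaling step in the cell above can also be expressed as a scikit-learn `Pipeline` with a `ColumnTransformer`, which keeps the scaler and the model together and fits the scaler on training data only. This is a sketch under stated assumptions, not the notebook's method: the data is a synthetic stand-in with the same column names, and the target rule is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same column names as the encoded dataset.
rng = np.random.default_rng(0)
n = 120
X_demo = pd.DataFrame({
    "Age": rng.integers(15, 75, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "BP": rng.integers(0, 3, size=n),
    "Cholesterol": rng.integers(0, 2, size=n),
    "Na_to_K": rng.uniform(6.0, 38.0, size=n),
})
# Hypothetical target rule, used only so the demo has something learnable.
y_demo = (X_demo["Na_to_K"] > 15).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Scale only the continuous columns; pass the encoded categoricals through.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["Age", "Na_to_K"])],
    remainder="passthrough",
)

# The pipeline fits the scaler on the training data only, then the model.
svm_pipeline = Pipeline([("preprocess", preprocess), ("model", SVC())])
svm_pipeline.fit(X_tr, y_tr)

demo_accuracy = svm_pipeline.score(X_te, y_te)
print(f"Demo accuracy: {demo_accuracy:.3f}")
```

Because the preprocessing lives inside the estimator, the same object can be passed to cross-validation or grid search without leaking test-fold statistics into the scaler.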



### Evaluating Model Performance

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate_model_performance(y_true, y_pred, model_name):
    """Print accuracy and support-weighted precision and recall for one model."""
    accuracy = accuracy_score(y_true, y_pred)
    # zero_division=0 scores a class with no predicted samples as 0 without a warning
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)

    print(f"Performance of {model_name}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print("="*30)

# Evaluate models' performance
evaluate_model_performance(y_test, y_pred_dt, "Decision Tree")
evaluate_model_performance(y_test, y_pred_knn, "k-NN")
evaluate_model_performance(y_test, y_pred_lr, "Logistic Regression")
evaluate_model_performance(y_test, y_pred_svm, "SVM")

Performance of Decision Tree:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
Performance of k-NN:
Accuracy: 0.8833
Precision: 0.9205
Recall: 0.8833
Performance of Logistic Regression:
Accuracy: 0.8667
Precision: 0.7926
Recall: 0.8667
Performance of SVM:
Accuracy: 0.9500
Precision: 0.9750
Recall: 0.9500
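
To make the `average='weighted'` setting concrete, the k-NN precision can be recomputed by hand from the per-class values in the k-NN classification report. This is a small arithmetic sketch: the inputs are the two-decimal figures printed in that report, so the result matches the reported 0.9205 only up to rounding.

```python
# Per-class precision and support for k-NN, copied from the classification report
# (precision values are rounded to two decimals there).
precision_per_class = [1.00, 0.50, 1.00, 0.82, 1.00]
support_per_class = [7, 3, 6, 18, 26]

total = sum(support_per_class)

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(
    p * s for p, s in zip(precision_per_class, support_per_class)
) / total

# Macro average: every class counts equally, regardless of its size.
macro_precision = sum(precision_per_class) / len(precision_per_class)

print(f"Weighted precision: {weighted_precision:.3f}")
print(f"Macro precision:    {macro_precision:.3f}")
```

The weighted figure is dominated by the two largest classes, while the macro average is pulled down by the 0.50 precision of the three-sample class, so the two averages answer different questions about the same predictions.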




### Conclusion of the Drug Prediction Model

The goal of this project was to predict the type of drug a patient should be prescribed from five patient features, using four classification models. Below is a summary of each model's performance on the 60-sample test set:

1. **Decision Tree**:
   - Achieved a perfect score on every metric: Accuracy, Precision, and Recall were all **1.0000**.
   - The Decision Tree correctly classified all 60 instances in the test set.
   - A perfect test score does not by itself indicate **overfitting**, but it rests on a single 70/30 split of only 200 rows, so it should be confirmed with cross-validation or additional data.

2. **k-Nearest Neighbors (k-NN)**:
   - Accuracy: **0.8833**
   - Precision: **0.9205**
   - Recall: **0.8833**
   - The k-NN model performed well overall, and its weighted precision was higher than its recall.
   - The per-class report shows that its errors are concentrated in the small classes: class 1 had a precision of 0.50 and class 2 a recall of 0.33, so the overall accuracy hides weaker performance on the rarer drugs.

3. **Logistic Regression**:
   - Accuracy: **0.8667**
   - Precision: **0.7926**
   - Recall: **0.8667**
   - Logistic Regression had the lowest accuracy of the four models, which suggests that the drug classes are not well separated by linear decision boundaries in this feature space.
   - Its weighted precision was the lowest of all models and well below its recall, which may mean that at least one class received few or no correct predictions.

4. **Support Vector Machine (SVM)**:
   - Accuracy: **0.9500**
   - Precision: **0.9750**
   - Recall: **0.9500**
   - The SVM was the second-best model after the Decision Tree, with high accuracy and precision.
   - With its default RBF kernel, the SVM can model non-linear class boundaries, which may explain its advantage over Logistic Regression. As with the other models, this result comes from a single test split.
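
Since all four results come from one train-test split, stratified k-fold cross-validation is one way to check how stable a score is. The sketch below is illustrative only: it uses `make_classification` to generate a synthetic stand-in with 200 samples, 5 features, and 5 classes, so its numbers say nothing about the drug dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in roughly matching the dataset's shape (assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
tree = DecisionTreeClassifier(random_state=42)

# One held-out accuracy per fold.
fold_scores = cross_val_score(tree, X_demo, y_demo, cv=cv, scoring="accuracy")

# Training accuracy on the full data, for comparison with the held-out folds.
train_accuracy = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

print("Fold accuracies:", fold_scores.round(3))
print(f"Mean CV accuracy: {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
print(f"Training accuracy: {train_accuracy:.3f}")
```

A large gap between the training accuracy and the mean fold accuracy would point to overfitting, whereas consistently high fold scores would support a strong single-split result.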

### Key Takeaways:
- **Decision Tree** achieved perfect test accuracy. Because the dataset is small, this should be validated with cross-validation or with larger, more varied data before it is relied on.
- **k-NN** performed well overall but is sensitive to the choice of `k` and to feature scaling. It was trained on standardized `Age` and `Na_to_K` so that no single feature dominates the distance computation.
- **Logistic Regression** had the lowest accuracy and precision here, so it is probably not the best choice for this problem, although it offers interpretability and simplicity.
- **SVM** combined high precision and accuracy, making it a strong candidate when the class boundaries are non-linear.
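
The sensitivity of k-NN to `k` can be explored with a small grid search. The following sketch assumes a synthetic stand-in dataset and a hypothetical list of candidate values for `n_neighbors`; neither comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in roughly matching the dataset's shape (assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

# Scaling sits inside the pipeline so each CV fold is scaled independently.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate neighbourhood sizes (hypothetical choice for illustration).
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}

# 5-fold cross-validated search over k, scored by accuracy.
search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print(f"Best k: {best_k}, mean CV accuracy: {search.best_score_:.3f}")
```

The same pattern could be applied to the SVM's `C` and `gamma` parameters by swapping the final pipeline step and the parameter grid.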
