For a beginner in machine learning, this dataset is a good opportunity to try out techniques for predicting which drug is likely to suit a patient.

#### Content 
##### The target feature is:
- Drug type

##### The feature sets are:

- Age
- Sex
- Blood Pressure Levels (BP)
- Cholesterol Levels
- Na to Potassium Ratio (Na_to_K)

#### Inspiration

The main challenge here is not just the feature and target sets but also the approach a beginner takes to solving this type of problem. Best of luck.

Source: [Kaggle](https://www.kaggle.com/datasets/prathamtripathi/drug-classification)

In [6]:
import pandas as pd

df = pd.read_csv("data.csv")
n_rows, n_cols = df.shape
print(f"There are {n_rows} rows and {n_cols} columns")
df.head()

There are 200 rows and 6 columns


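|   | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|-----|-----|----|-------------|---------|------|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |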


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB
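
`df.info()` reports four `object` columns (`Sex`, `BP`, `Cholesterol`, `Drug`). Before encoding them, it can help to list the distinct labels in each column and how often they occur. The following is a minimal sketch, assuming a small stand-in frame built from the five rows printed by `df.head()`; the names `sample` and `category_counts` are illustrative and not part of the notebook.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() in this notebook.
sample = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
    "Na_to_K": [25.355, 13.093, 10.114, 7.798, 18.043],
    "Drug": ["DrugY", "drugC", "drugC", "drugX", "DrugY"],
})


def category_counts(frame):
    """Return {column: value counts} for every non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number").columns
    return {col: frame[col].value_counts() for col in text_cols}


counts = category_counts(sample)
for col, series in counts.items():
    print(f"{col}: {series.to_dict()}")
```

On the full dataset, `category_counts(df)` would also show how balanced the `Drug` classes are, which matters when judging an accuracy score.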


In [4]:
# Calculate the total number of missing values in each column
missing_values = df.isnull().sum()

# Total number of rows, used as the denominator for each column's percentage
total_cells = df.shape[0]

# Calculate the percentage of missing data for each column
percentage_missing = (missing_values / total_cells) * 100

# Create a DataFrame to store the results
missing_data_info = pd.DataFrame({
    'Column Name': missing_values.index,
    'Missing Values': missing_values,
    'Percentage Missing': percentage_missing
})

# Sort the DataFrame by the percentage of missing data in descending order
missing_data_info = missing_data_info.sort_values(by='Percentage Missing', ascending=False)

# Format the 'Percentage Missing' column to display 1 digit after the decimal point
missing_data_info['Percentage Missing'] = missing_data_info['Percentage Missing'].round(1)

# Display the result
print(missing_data_info)


             Column Name  Missing Values  Percentage Missing
Age                  Age               0                 0.0
Sex                  Sex               0                 0.0
BP                    BP               0                 0.0
Cholesterol  Cholesterol               0                 0.0
Na_to_K          Na_to_K               0                 0.0
Drug                Drug               0                 0.0
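
The same summary can be computed more compactly, because the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch, assuming a hypothetical helper `missing_summary` and a toy frame with a few deliberately missing entries (the dataset itself has none):

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Per-column count and percentage of missing values, worst first."""
    mask = frame.isna()
    summary = pd.DataFrame({
        "Missing Values": mask.sum(),
        "Percentage Missing": (mask.mean() * 100).round(1),
    })
    return summary.sort_values("Percentage Missing", ascending=False)


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Age": [23, np.nan, 47, 28],
    "BP": ["HIGH", "LOW", None, None],
    "Na_to_K": [25.355, 13.093, 10.114, 7.798],
})

summary = missing_summary(toy)
print(summary)
```

The column names already serve as the index, so the separate `Column Name` column is not needed in this variant.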


In [11]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# 'df' is the DataFrame loaded earlier in the notebook.
# Preprocessing (imputation, scaling, one-hot encoding) is handled by a
# scikit-learn Pipeline and ColumnTransformer. The imputers are only a
# safeguard: this dataset has no missing values.

# Separate the features from the target
X = df.drop('Drug', axis=1)  # Features
y = df['Drug']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing steps
numeric_features = ['Age', 'Na_to_K']
categorical_features = ['Sex', 'Cholesterol', 'BP']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier(random_state=42))])

# Train the model
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 1.00
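
A single 80/20 split leaves only 40 test rows, so one accuracy figure can look better or worse than it should by chance. Stratified k-fold cross-validation gives a spread of scores instead. The following is a sketch under stated assumptions: the data is synthetic (generated with a made-up labelling rule, not the Kaggle file), the helper names `make_toy_data` and `build_model` are hypothetical, and the imputation steps are omitted because the dataset has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_data(n_rows=200, seed=0):
    """Synthetic stand-in with the notebook's column names (not real data)."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "Age": rng.integers(15, 75, n_rows),
        "Sex": rng.choice(["F", "M"], n_rows),
        "BP": rng.choice(["HIGH", "LOW", "NORMAL"], n_rows),
        "Cholesterol": rng.choice(["HIGH", "NORMAL"], n_rows),
        "Na_to_K": rng.uniform(6.0, 38.0, n_rows).round(3),
    })
    # Made-up labelling rule, only so the toy target is learnable.
    frame["Drug"] = np.where(
        frame["Na_to_K"] > 15, "DrugY",
        np.where(frame["BP"] == "HIGH", "drugA", "drugX"),
    )
    return frame


def build_model():
    """Same structure as the notebook's pipeline, minus the imputers."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Age", "Na_to_K"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["Sex", "Cholesterol", "BP"]),
    ])
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(random_state=42)),
    ])


toy = make_toy_data()
X, y = toy.drop(columns="Drug"), toy["Drug"]

# Stratification keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(build_model(), X, y, cv=cv, scoring="accuracy")
print(f"Fold accuracies: {np.round(scores, 2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

To apply this to the real data, replace `toy` with `df`. Because the preprocessing lives inside the pipeline, the scaler and encoder are refit on each training fold and never see the held-out rows.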
