<p align="center"><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="260" height="110" /></p>


## DRUG PREDICTION (MACHINE LEARNING - INTERMEDIATE)

**Submitted by NARAYAN V. SHANBHAG**

**GCD Student, INSAID**

**Batch: May 9, 2021**

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

---
<a name = Section1></a>
# **1. Introduction**
---
### Company Introduction - inLab

##### Your client for this project is a pharmaceutical company, inLab.
- 1) They have a long history of making effective drugs and are the leading producer of antibiotics for bacterial infections.
- 2) Their research and development (R&D) team has recently developed five types of drugs to fight chronic throat infection.
- 3) They want to release the drugs to market quickly so that they can cure people and increase the company's revenue.
- 4) The R&D team analysed the chemical composition of each drug and reported that each drug has a different effect depending on the patient's health.
- 5) A drug with a higher concentration of chemicals should be given only to those people whose health report meets the criteria suggested by the R&D team.


### Current Scenario
  - The R&D group has invited some groups of people to test the drug, but going through each person’s health report might take a lot of time and cause a delay in launching the drug in the market.

---
<a name = Section2></a>
# **2. Problem Statement**
---
- This section gives a general introduction to the problem, one that many companies face.

### The current process suffers from the following problems:
- The testing phase takes a lot of time and is done manually, because each person must be carefully examined for side effects.
- Most of that time is spent checking each person’s health report and dispensing a specific drug according to the health metrics suggested by the R&D team.
- This process is time-consuming and wastes resources.

The company has hired you as data science consultants. They want to **automate** the process of assigning a drug to each person according to their health report.

### Your Role
- You are given a dataset containing the health report of the people from the test group.
- Your task is to build a multi-class classification model using the dataset.
- Because the company had no machine learning model for this problem, you don’t have a quantifiable win condition.
- You need to build the best possible model.

### Project Deliverable
- Deliverable: **Drug classification.**
- Machine Learning Task: **Multi-class classification**
- Target Variable: **Drug**
- Win Condition: **N/A (best possible model)**

### Evaluation Metric
   - The model evaluation will be based on the **Accuracy Score.**
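
As a reminder of what the accuracy score measures, here is a minimal sketch in plain Python. The `accuracy` helper and the labels are hypothetical and only illustrate the definition (correct predictions divided by total predictions); the notebook itself relies on scikit-learn for this.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels, for illustration only (not taken from the dataset)
y_true = ["drugX", "DrugY", "drugA", "DrugY"]
y_pred = ["drugX", "DrugY", "drugB", "DrugY"]

score = accuracy(y_true, y_pred)  # 3 of the 4 predictions match
print(score)
```

Accuracy treats every class equally per sample, so on an imbalanced target it can look high even when a rare class is always missed.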

# **3. Installing & Importing Libraries**
---
- This section covers installing and importing the libraries required for the project.

### **Installing Libraries**

In [None]:
!pip install -q datascience                                         
# Package that is required by pandas profiling
!pip install -q pandas-profiling                                    
# Library to generate basic statistics about data

# To install more libraries insert your code here..


### **Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** to make the libraries in sync.

- Make sure not to execute the cell under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [None]:
!pip install -q --upgrade pandas-profiling                          
# Upgrading pandas profiling to the latest version

### **Importing Libraries**

- You can get a head start with the basic libraries imported in the cell below.

- If you want to import some additional libraries, feel free to do so.

In [1]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing package pandas (For Panel Data Analysis)
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import scipy as sp                                                  # Importing library for scientific calculations
#-------------------------------------------------------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

# 4. Data Description
This step provides an in-depth description of the dataset associated with this project.

- The dataset contains all the necessary information about each person’s health, such as their sex, BP, age and cholesterol level.

- These health metrics are the essential factors for prescribing a drug to a person without side effects.

- The drug is the value that we have to predict for future samples.
 
#### The dataset is divided into two parts: Train and Test sets.

#### Training Set: 
 - The train set contains **160 rows** and **7 columns**.
 - The last column **Drug** is the **target variable**.

#### Testing Set:
 - The test set contains **40 rows** and **6 columns**.
 - The test set **doesn’t contain** the **Drug** column.
 - It needs to be predicted for the test set.
 

**Train Set:**

| Records |Features|Target Variable|
|:--|:--|:--|
|**160**|**7**|**Drug**|

**Test Set:**

|Records|Features|Predicted Variable|
|:--|:--|:--|
|**40**|**6**|**Drug**|
 
- The Dataset contains the following columns:
 
|**ID**|**Feature**|**Description**|
|:--|:--|:--|
|01| Id        | Unique identity of the sample|
|02| Age       | Age of the person|
|03| Sex       | Sex of the person (M or F)|
|04| BP        | Blood pressure level of the person|
|05| Cholesterol       | Cholesterol level of the person|
|06| Na_to_K   | Sodium-to-potassium ratio|
|07| Drug      | Target: 5 classes of drugs, encoded as drug A: 3, drug B: 4, drug C: 2, drug X: 0, drug Y: 1|

---
<a name = Section4></a>
# **5. Data Acquisition**
---

- This section covers acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or use a direct link to the source (generally preferred in most cases).

- You will be working with a direct link so that you can get started right away.

- Before going further, you should have a good idea of the features of the data set.



In [None]:
#from google.colab import files
#uploaded = files.upload()
#import io
#data = pd.read_csv(io.BytesIO(uploaded['Churn_train.csv']))


In [2]:
#data = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/car_sales.csv', encoding='cp1252')
data = pd.read_csv("D://drug_train.csv",skipinitialspace=True)
print('Data Shape:', data.shape)
data.head()
# Initial analysis shows there are 160 RECORDS/ROWS and 7 FEATURES/COLUMNS

Data Shape: (160, 7)


Unnamed: 0,Id,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,79,32,F,LOW,NORMAL,10.84,drugX
1,197,52,M,NORMAL,HIGH,9.894,drugX
2,38,39,F,NORMAL,NORMAL,9.709,drugX
3,24,33,F,LOW,HIGH,33.486,DrugY
4,122,34,M,NORMAL,HIGH,22.456,DrugY


 ### **Data Information**
- **In total there are 7 columns/features, of which 3 are NUMERICAL and 4 are CATEGORICAL**
- **Check whether the NUMERICAL Columns have 0s and if they are relevant**


In [3]:
# Insert your code here
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Id           160 non-null    int64  
 1   Age          160 non-null    int64  
 2   Sex          160 non-null    object 
 3   BP           160 non-null    object 
 4   Cholesterol  160 non-null    object 
 5   Na_to_K      160 non-null    float64
 6   Drug         160 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 8.9+ KB


### **Data Description**

- **To get a quick description of the data, you can use the `describe` method defined in the pandas library.**
- **It gives the five-number summary along with other details such as the count, mean and standard deviation of the data set.**

In [4]:
data.describe()

Unnamed: 0,Id,Age,Na_to_K
count,160.0,160.0,160.0
mean,99.075,45.3875,16.194988
std,59.374894,16.101481,7.254689
min,0.0,15.0,6.269
25%,45.5,32.0,10.44525
50%,100.5,46.0,14.0765
75%,149.5,58.25,19.48075
max,199.0,74.0,38.247


In [5]:
data_final = pd.read_csv("D://drug_test.csv",skipinitialspace=True)
print('Data Shape:', data_final.shape)
data_final.head()

Data Shape: (40, 6)


Unnamed: 0,Id,Age,Sex,BP,Cholesterol,Na_to_K
0,95,36,M,LOW,NORMAL,11.424
1,15,16,F,HIGH,NORMAL,15.516
2,30,18,F,NORMAL,NORMAL,8.75
3,158,59,F,LOW,HIGH,10.444
4,128,47,M,LOW,NORMAL,33.542


In [6]:
data_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Id           40 non-null     int64  
 1   Age          40 non-null     int64  
 2   Sex          40 non-null     object 
 3   BP           40 non-null     object 
 4   Cholesterol  40 non-null     object 
 5   Na_to_K      40 non-null     float64
dtypes: float64(1), int64(2), object(3)
memory usage: 2.0+ KB


In [7]:
data_final.describe()

Unnamed: 0,Id,Age,Na_to_K
count,40.0,40.0,40.0
mean,101.2,40.025,15.642475
std,52.12667,17.778534,7.173492
min,9.0,16.0,7.261
25%,65.75,24.5,10.735
50%,94.0,38.0,12.8675
75%,148.5,53.25,19.10025
max,186.0,74.0,33.542


---
<a name = Section5></a>
# **6. Data Pre-Profiling**
---

- This section focuses on getting a report about the data.

- You need to perform pandas profiling and draw some observations from it.

In [None]:
# Insert your code here...
profile_satisfaction_train = ProfileReport(df=data)
profile_satisfaction_train

**5.1 Data Pre-Profiling for TEST SET**

In [None]:
profile_satisfaction_test = ProfileReport(df=data_final)
profile_satisfaction_test

---
<a name = Section6></a>
# **7. Data Pre-Processing**
---

- This section covers data manipulation of unstructured data for further processing and analysis.

- To turn unstructured data into structured data, you need to verify and restore the integrity of the data by:
  - **Handling/Checking Duplicate Data for both the TRAIN and TEST Data Sets**

  - Handling missing data,

  - Handling redundant data,

  - Handling inconsistent data,

  - Handling outliers,

  - Handling typos

- **There are NO DUPLICATE RECORDS/ROWS in either data set**



In [8]:
# Insert your code here...
data[data.duplicated()]

Unnamed: 0,Id,Age,Sex,BP,Cholesterol,Na_to_K,Drug


In [9]:
data_final[data_final.duplicated()]

Unnamed: 0,Id,Age,Sex,BP,Cholesterol,Na_to_K


- **Now check ALL the NUMERICAL COLUMNS for missing and ZERO values and replace them with appropriate values where needed**

In [10]:
data.isnull().sum()

Id             0
Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
Drug           0
dtype: int64

In [11]:
(data == 0 ).sum(axis = 0)

Id             1
Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
Drug           0
dtype: int64

In [12]:
data_final.isnull().sum()

Id             0
Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
dtype: int64

In [13]:
(data_final == 0 ).sum(axis = 0)

Id             0
Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
dtype: int64

### **Check whether the DataSet is Balanced for the label column**

In [None]:
data.Drug.value_counts().plot(kind='pie',autopct='%1.1f%%', startangle=90);
data.Drug.value_counts()

In [None]:
figure = plt.figure(figsize = (20,10))
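# Correlation is only defined for numeric columns, so select them explicitly
HeatMap = sns.heatmap(data.select_dtypes(include='number').corr(),annot = True,cmap = 'coolwarm',vmin = -1, vmax = 1,linecolor = 'black',linewidths = 1)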
HeatMap.set_title('Correlation HeatMap', fontdict = {'fontsize':16})

### OBSERVATIONS
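- BP and Na_to_K appear to be strongly associated with Drug. Note that the correlation heatmap covers only the numeric columns (Id, Age, Na_to_K), so it cannot show the categorical BP and Drug columns directly.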
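
Because BP and Drug are both categorical, one way to inspect how they relate is a row-normalised contingency table rather than a Pearson correlation. The sketch below uses a small hypothetical frame (the rows are made up for illustration, not taken from the dataset) and `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "BP":   ["HIGH", "HIGH", "LOW", "LOW", "NORMAL", "NORMAL"],
    "Drug": ["drugA", "drugB", "drugC", "drugX", "drugX", "drugX"],
})

# Row-normalised contingency table: share of each Drug within each BP level
bp_vs_drug = pd.crosstab(toy["BP"], toy["Drug"], normalize="index")
print(bp_vs_drug)
```

Rows that concentrate on one or two drugs suggest that the BP level carries information about the target; rows that look alike suggest it does not.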

In [14]:
data.columns

Index(['Id', 'Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K', 'Drug'], dtype='object')

In [15]:
data_categorical = data[['Sex', 'BP', 'Cholesterol']]
data_categorical

Unnamed: 0,Sex,BP,Cholesterol
0,F,LOW,NORMAL
1,M,NORMAL,HIGH
2,F,NORMAL,NORMAL
3,F,LOW,HIGH
4,M,NORMAL,HIGH
...,...,...,...
155,M,NORMAL,HIGH
156,F,NORMAL,HIGH
157,F,HIGH,HIGH
158,F,NORMAL,HIGH


In [16]:
data_numerical = data [['Age', 'Na_to_K']]
data_numerical

Unnamed: 0,Age,Na_to_K
0,32,10.840
1,52,9.894
2,39,9.709
3,33,33.486
4,34,22.456
...,...,...
155,22,11.953
156,50,12.703
157,29,29.450
158,67,15.891


In [17]:
data_target = data [['Drug']]
data_target

Unnamed: 0,Drug
0,drugX
1,drugX
2,drugX
3,DrugY
4,DrugY
...,...
155,drugX
156,drugX
157,DrugY
158,DrugY


### Applying Scale Mapper
#scale_mapper = {"drugX":0, "DrugY":1, "drugC":2, "drugA":3,"drugB":4}
#data_target["Drug"] = data_target["Drug"].replace(scale_mapper)
#data_target
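
A minimal sketch of how such a mapper could be applied and reversed, assuming the encoding listed in the column description. The `drugs` series is hypothetical, and `Series.map` is used here instead of `replace` so that unknown labels surface as `NaN` rather than passing through silently.

```python
import pandas as pd

# Encoding taken from the column description table
scale_mapper = {"drugX": 0, "DrugY": 1, "drugC": 2, "drugA": 3, "drugB": 4}

# Hypothetical labels, for illustration only
drugs = pd.Series(["drugX", "DrugY", "drugB"], name="Drug")

encoded = drugs.map(scale_mapper)  # labels missing from the mapper would become NaN

# Reverse lookup, useful for turning numeric predictions back into drug names
inverse_mapper = {code: name for name, code in scale_mapper.items()}
decoded = encoded.map(inverse_mapper)
print(encoded.tolist(), decoded.tolist())
```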

### Apply Label Encoding to Categorical data

from sklearn.preprocessing import LabelEncoder
data_categorical = data_categorical.apply(LabelEncoder().fit_transform)
data_categorical

In [18]:
data_categorical = pd.get_dummies(data_categorical,drop_first=True)
data_categorical

Unnamed: 0,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL
0,0,1,0,1
1,1,0,1,0
2,0,0,1,1
3,0,1,0,0
4,1,0,1,0
...,...,...,...,...
155,1,0,1,0
156,0,0,1,0
157,0,0,0,0
158,0,0,1,0


In [19]:
data_model = pd.concat([data_categorical,data_numerical], axis=1)
data_model

Unnamed: 0,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL,Age,Na_to_K
0,0,1,0,1,32,10.840
1,1,0,1,0,52,9.894
2,0,0,1,1,39,9.709
3,0,1,0,0,33,33.486
4,1,0,1,0,34,22.456
...,...,...,...,...,...,...
155,1,0,1,0,22,11.953
156,0,0,1,0,50,12.703
157,0,0,0,0,29,29.450
158,0,0,1,0,67,15.891


In [20]:
data_model.columns

Index(['Sex_M', 'BP_LOW', 'BP_NORMAL', 'Cholesterol_NORMAL', 'Age', 'Na_to_K'], dtype='object')

In [21]:
x = data_model[['Sex_M', 'BP_LOW', 'BP_NORMAL', 'Cholesterol_NORMAL', 'Age', 'Na_to_K']]
x

Unnamed: 0,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL,Age,Na_to_K
0,0,1,0,1,32,10.840
1,1,0,1,0,52,9.894
2,0,0,1,1,39,9.709
3,0,1,0,0,33,33.486
4,1,0,1,0,34,22.456
...,...,...,...,...,...,...
155,1,0,1,0,22,11.953
156,0,0,1,0,50,12.703
157,0,0,0,0,29,29.450
158,0,0,1,0,67,15.891


In [None]:
data_model

In [22]:
y = data_target[['Drug']]
y

Unnamed: 0,Drug
0,drugX
1,drugX
2,drugX
3,DrugY
4,DrugY
...,...
155,drugX
156,drugX
157,DrugY
158,DrugY


### Classification Report

### **5.2 Data Cleaning**

- In this section, we will **remove** columns which are **redundant** for model.

- We will first remove the **label** column then remove the columns with **standard deviation = 0** from the dataset.

- Standard Deviation = 0 means that **every data point in a column is equal to its mean**.

- Also means that all of a column's values are **identical** and such columns do **not help** us in **prediction** so we will drop them.
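
The zero-standard-deviation check described above can be sketched as follows. The frame and its `Const` column are hypothetical; in this dataset no such column may exist, in which case nothing is dropped.

```python
import pandas as pd

# Hypothetical frame; "Const" holds one repeated value, so its std is 0
frame = pd.DataFrame({
    "Age": [32, 52, 39],
    "Na_to_K": [10.84, 9.894, 9.709],
    "Const": [1, 1, 1],
})

stds = frame.std()                              # per-column standard deviation
zero_std_cols = stds[stds == 0].index.tolist()  # columns with no variation
cleaned = frame.drop(columns=zero_std_cols)
print(zero_std_cols, list(cleaned.columns))
```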

---
# **6. Exploratory Data Analysis**

#### **Question:** What is the distribution of the **target** feature?

In [None]:
# Plot a displot on the target variable
# Drug is categorical, so a KDE curve is not meaningful and is left out
data_sns = data
sns.displot(x='Drug', data=data_sns, aspect=3)

# Add some cosmetics
plt.xticks(size=12)
plt.yticks(size=12)
plt.xlabel(xlabel='Values', size=14)
plt.ylabel(ylabel='Count', size=14)
plt.title(label='Displot on target', size=16)
plt.grid(True)

# Display the plot
plt.show()

# **7. Data Post Processing**

### **7.2 Data Splitting**

- Now we will split our data into train set and test set.

- We will keep **80%** data in the **train** set, and **20%** data in the **test** set.
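
With only 160 rows and five classes, a plain random split can leave a rare class under-represented in the 20% hold-out. As a hedged alternative, `train_test_split` accepts a `stratify` argument that keeps class proportions similar in both parts. The arrays below are hypothetical and are not the notebook's `x` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 2 balanced classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array(["drugX"] * 10 + ["DrugY"] * 10)

# stratify=y_demo keeps the 50/50 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```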

### Import `train_test_split` and split the data into train and test sets for x and y.

- `from sklearn.model_selection import train_test_split` provides the function used to split x and y.


In [23]:
from sklearn.model_selection import train_test_split

In [24]:
# Splitting data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# Display the shape of training and testing data
print('x_train shape: ', x_train.shape)
print('y_train shape: ', y_train.shape)
print('x_test shape: ', x_test.shape)
print('y_test shape: ', y_test.shape)

x_train shape:  (128, 6)
y_train shape:  (128, 1)
x_test shape:  (32, 6)
y_test shape:  (32, 1)


In [26]:
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PowerTransformer,StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

#dtree=DecisionTreeClassifier()
#rforest = RandomForestClassifier()
#dtree.fit(x_train,y_train)
#rforest.fit(x_train,y_train)
gboost = GradientBoostingClassifier()
gboost.fit(x_train,y_train)
print("Testing Accuracy for gboost")
print(gboost.score(x_test,y_test))

print("Training Accuracy for gboost")
print(gboost.score(x_train,y_train))

Testing Accuracy for gboost
1.0
Training Accuracy for gboost
1.0


In [27]:
y_train

Unnamed: 0,Drug
60,DrugY
115,DrugY
2,drugX
123,DrugY
45,drugB
...,...
71,DrugY
106,drugC
14,drugX
92,DrugY


# **8. Model Development & Evaluation**

### **8.1 Model Development & Evaluation without PCA**

In [28]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
import plotly.graph_objs as go                                      # Importing plotly for interactive visualizations
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To properly split the dataset into train and test sets
from sklearn.ensemble import RandomForestClassifier                 # To create a random forest classifier model
from sklearn.tree import DecisionTreeClassifier                     # To create a decision tree classifier model
from sklearn.linear_model import LogisticRegression                 # To create a logistic regression model
from sklearn import metrics                                         # Importing to evaluate the classification models
from sklearn.model_selection import cross_val_score
#-------------------------------------------------------------------------------------------------------------------------------
from random import randint                                          # Importing to generate random integers
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

### Importing the evaluation metrics for Classification model - Logistic Regression


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
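
For a multi-class target such as Drug, `f1_score` needs an explicit averaging mode, because its default (`average='binary'`) only applies to two-class problems. The sketch below uses hypothetical labels and `average="macro"`, which is one reasonable choice among several (`"micro"` and `"weighted"` are others).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels, for illustration only
y_true = ["drugX", "drugX", "DrugY", "drugA"]
y_pred = ["drugX", "DrugY", "DrugY", "drugA"]

acc = accuracy_score(y_true, y_pred)

# average="macro": unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(acc, macro_f1)
```

Macro averaging gives every class the same weight, so a poorly predicted rare drug lowers the score as much as a poorly predicted common one.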

In [None]:
logreg_model =  LogisticRegression(solver='liblinear')

In [None]:
dt_model = DecisionTreeClassifier()

In [None]:
randforest_model = RandomForestClassifier(random_state=42)

In [None]:
logreg_model.fit(x_train, y_train)

In [None]:
dt_model.fit(x_train, y_train)

In [None]:
randforest_model.fit(x_train, y_train)

In [None]:
y_pred_train_lr = logreg_model.predict(x_train)
y_pred_train_lr

In [None]:
y_pred_test_lr = logreg_model.predict(x_test)
y_pred_test_lr

In [None]:
y_pred_train_dt = dt_model.predict(x_train)
y_pred_train_dt

In [None]:
y_pred_test_dt = dt_model.predict(x_test)
y_pred_test_dt

In [None]:
y_pred_train_rf = randforest_model.predict(x_train)
y_pred_train_rf

In [None]:
y_pred_test_rf = randforest_model.predict(x_test)
y_pred_test_rf

In [None]:
y_train

### ACCURACY SCORE and F1-SCORE of Logistic regression Train Set

In [None]:
print(accuracy_score(y_train,y_pred_train_lr))
#ACC_SCORE_TRAIN_LR = accuracy_score(y_train,y_pred_train_lr)
#F_ONE_SCO_TRAIN_LR = f1_score(y_train,y_pred_train_lr)
#print(ACC_SCORE_TRAIN_LR)
#print(F_ONE_SCO_TRAIN_LR)

### ACCURACY SCORE and F1-SCORE of Logistic regression Test Set

In [None]:
ACCURACY_SCORE_TEST_LR = accuracy_score(y_test,y_pred_test_lr)
#F1_SCORE_TEST_LR = f1_score(y_test,y_pred_test_lr)
print(ACCURACY_SCORE_TEST_LR)
#print(F1_SCORE_TEST_LR)

### ACCURACY SCORE and F1-SCORE of Decision Tree Classifier Train Set

In [None]:
ACCURACY_SCORE_TRAIN_DT = accuracy_score(y_train,y_pred_train_dt)
#F1_SCORE_TRAIN_DT = f1_score(y_train,y_pred_train_dt)
print(ACCURACY_SCORE_TRAIN_DT)
#print(F1_SCORE_TRAIN_DT)

### ACCURACY SCORE and F1-SCORE of Decision Tree Classifier Test Set

In [None]:
ACCURACY_SCORE_TEST_DT = accuracy_score(y_test,y_pred_test_dt)
#F1_SCORE_TEST_DT = f1_score(y_test,y_pred_test_dt)
print(ACCURACY_SCORE_TEST_DT)
#print(F1_SCORE_TEST_DT)

### ACCURACY SCORE and F1-SCORE of Random Forest Classifier Train Set

In [None]:
ACCURACY_SCORE_TRAIN_RF = accuracy_score(y_train,y_pred_train_rf)
#F1_SCORE_TRAIN_RF = f1_score(y_train,y_pred_train_rf)
print(ACCURACY_SCORE_TRAIN_RF)
#print(F1_SCORE_TRAIN_RF)

### ACCURACY SCORE and F1-SCORE of Random Forest Classifier Test Set

In [None]:
ACCURACY_SCORE_TEST_RF = accuracy_score(y_test,y_pred_test_rf)
#F1_SCORE_TEST_RF = f1_score(y_test,y_pred_test_rf)
print(ACCURACY_SCORE_TEST_RF)
#print(F1_SCORE_TEST_RF)

### Using Random Forest Classifier

In [None]:
# training a RandomForestClassifier....score changes for n-estimators = 2 ==> 0.90625, 7 ==> 0.96875, 8 ==> 1
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators = 8, random_state = 42, n_jobs = -1).fit(x_train,y_train)
rf_model
#selector = SelectFromModel(RandomForestClassifier(n_estimators = 100, random_state = 42, n_jobs = -1))
#selector.fit(X, y)

In [None]:
rf_predictions = rf_model.predict(x_test)
rf_predictions

In [None]:
# accuracy on X_test
accuracy_rf = rf_model.score(x_test, y_test)
print(accuracy_rf)

### Using Support Vector Machine Classifier

In [None]:
# training a linear SVM classifier
from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(x_train, y_train)
svm_model_linear

In [None]:
y_pred_train_svc = svm_model_linear.predict(x_train)
y_pred_train_svc

In [None]:
y_pred_test_svc = svm_model_linear.predict(x_test)
y_pred_test_svc

### ACCURACY SCORE and F1-SCORE of Support Vector Machine Classifier Train Set

In [None]:
ACCURACY_SCORE_TRAIN_SVC = accuracy_score(y_train,y_pred_train_svc)
#F1_SCORE_TRAIN_SVC = f1_score(y_train,y_pred_train_svc)
print(ACCURACY_SCORE_TRAIN_SVC)
#print(F1_SCORE_TRAIN_SVC)

### ACCURACY SCORE and F1-SCORE of Support Vector Machine Classifier Test Set

In [None]:
ACCURACY_SCORE_TEST_SVC = accuracy_score(y_test,y_pred_test_svc)
#F1_SCORE_TEST_SVC = f1_score(y_test,y_pred_test_svc)
print(ACCURACY_SCORE_TEST_SVC)
#print(F1_SCORE_TEST_SVC)

### Using K Nearest Neighbour Classifier

In [None]:
# training a KNN classifier ...Score changes for n_neighbors = 2 ==> 0.6875, 4 ==> 0.625, 6 ==> 0.75, 7 ==> 0.71875 8 ==> 0.65625
from sklearn.neighbors import KNeighborsClassifier
knn_model= KNeighborsClassifier(n_neighbors = 6).fit(x_train, y_train)
knn_model

In [None]:
y_pred_train_knn = knn_model.predict(x_train)
y_pred_train_knn

In [None]:
y_pred_test_knn = knn_model.predict(x_test)
y_pred_test_knn

### ACCURACY SCORE and F1-SCORE of KNN Classifier Train Set

In [None]:
ACCURACY_SCORE_TRAIN_KNN = accuracy_score(y_train,y_pred_train_knn)
#F1_SCORE_TRAIN_KNN = f1_score(y_train,y_pred_train_knn)
print(ACCURACY_SCORE_TRAIN_KNN)
#print(F1_SCORE_TRAIN_KNN)

### ACCURACY SCORE and F1-SCORE of KNN Classifier Test Set

In [None]:
ACCURACY_SCORE_TEST_KNN = accuracy_score(y_test,y_pred_test_knn)
#F1_SCORE_TEST_KNN = f1_score(y_test,y_pred_test_knn)
print(ACCURACY_SCORE_TEST_KNN)
#print(F1_SCORE_TEST_KNN)
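
KNN measures distances between rows, so features on larger scales (here Age and Na_to_K next to 0/1 dummy columns) can dominate the result. One possible remedy, not used in the cells above, is to standardise the features inside a pipeline. The sketch below uses hypothetical data and the `make_pipeline` helper from scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature has a much larger range than the first
X_demo = np.array([[0, 100], [1, 900], [0, 200], [1, 800], [0, 150], [1, 950]])
y_demo = np.array([0, 1, 0, 1, 0, 1])

# Scale each feature to zero mean and unit variance before measuring distances
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_scaled.fit(X_demo, y_demo)

pred = knn_scaled.predict(np.array([[0, 120], [1, 850]]))
print(pred)
```

Keeping the scaler inside the pipeline means it is fitted on the training rows only and then re-applied unchanged at prediction time.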

In [None]:
scores_dt = cross_val_score(dt_model,x_test,y_test, cv = 5, scoring = 'accuracy')

In [None]:
print(scores_dt)
print()
print(np.mean(scores_dt))

In [None]:
scores_rf = cross_val_score(randforest_model,x_test,y_test, cv = 5, scoring = 'f1_macro')

In [None]:
print(scores_rf)
print()
print(np.mean(scores_rf))

In [34]:
scores_gb = cross_val_score(gboost,x_test,y_test, cv = 12, scoring = 'accuracy')

In [35]:
print(scores_gb)
print()
print(np.mean(scores_gb))

[1.         1.         0.66666667 0.66666667 1.         0.66666667
 1.         0.66666667 1.         1.         1.         1.        ]

0.8888888888888888
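
The scores above come from 12 folds over a 32-row hold-out, so each fold scores only two or three rows, which explains the jumps between 0.67 and 1.0. A hedged alternative is to cross-validate on the larger training portion with stratified folds. The sketch below uses hypothetical toy data and a decision tree purely to show the mechanics.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 30 samples, 1 feature, 3 classes of 10 samples each
X_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = np.repeat(["drugA", "drugB", "drugC"], 10)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=cv, scoring="accuracy"
)
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean makes it easier to see how much the estimate depends on the particular folds.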


## Analyzing the Test File

### Separate out the Categorical and Numerical Columns in the Test DataSet

In [124]:
data_submission = data_final['Id']
data_submission

0      95
1      15
2      30
3     158
4     128
5     115
6      69
7     170
8     174
9      45
10     66
11    182
12    165
13     78
14    186
15    177
16     56
17    152
18     82
19     68
20    124
21     16
22    148
23     93
24     65
25     60
26     84
27     67
28    125
29    132
30      9
31     18
32     55
33     75
34    150
35    104
36    135
37    137
38    164
39     76
Name: Id, dtype: int64

In [90]:
data_final_numerical = data_final[['Age', 'Na_to_K']]
data_final_numerical

Unnamed: 0,Age,Na_to_K
0,36,11.424
1,16,15.516
2,18,8.75
3,59,10.444
4,47,33.542
5,51,18.295
6,18,24.276
7,28,12.879
8,42,12.766
9,66,8.107


In [125]:
data_final_categorical = data_final[['Sex', 'BP', 'Cholesterol']]
data_final_categorical

Unnamed: 0,Sex,BP,Cholesterol
0,M,LOW,NORMAL
1,F,HIGH,NORMAL
2,F,NORMAL,NORMAL
3,F,LOW,HIGH
4,M,LOW,NORMAL
5,M,HIGH,HIGH
6,F,HIGH,NORMAL
7,F,NORMAL,HIGH
8,M,HIGH,NORMAL
9,F,NORMAL,NORMAL


In [28]:
#data_final_categorical = data_final_categorical.apply(LabelEncoder().fit_transform)
#data_final_categorical

Unnamed: 0,Sex,BP,Cholesterol
0,1,1,1
1,0,0,1
2,0,2,1
3,0,1,0
4,1,1,1
5,1,0,0
6,0,0,1
7,0,2,0
8,1,0,1
9,0,2,1


In [126]:
data_final_categorical = pd.get_dummies(data_final_categorical,drop_first=True)
data_final_categorical

Unnamed: 0,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL
0,1,1,0,1
1,0,0,0,1
2,0,0,1,1
3,0,1,0,0
4,1,1,0,1
5,1,0,0,0
6,0,0,0,1
7,0,0,1,0
8,1,0,0,1
9,0,0,1,1
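
Calling `pd.get_dummies` separately on the train and test frames only yields matching columns when both contain the same categories. Here they do, but as a safeguard the test columns can be aligned to the training columns. The sketch below uses hypothetical frames in which the test data has no `LOW` blood pressure rows.

```python
import pandas as pd

# Hypothetical frames: the test frame happens to contain no "LOW" BP rows
train_cat = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL"]})
test_cat = pd.DataFrame({"BP": ["HIGH", "NORMAL"]})

train_dummies = pd.get_dummies(train_cat, drop_first=True)
test_dummies = pd.get_dummies(test_cat, drop_first=True)

# Re-use the training columns; categories absent from the test frame become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Without this step the model would receive fewer columns than it was trained on and `predict` would fail.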


In [127]:
data_final_model = pd.concat([data_final_categorical,data_final_numerical], axis=1)
data_final_model

Unnamed: 0,Sex_M,BP_LOW,BP_NORMAL,Cholesterol_NORMAL,Age,Na_to_K
0,1,1,0,1,36,11.424
1,0,0,0,1,16,15.516
2,0,0,1,1,18,8.75
3,0,1,0,0,59,10.444
4,1,1,0,1,47,33.542
5,1,0,0,0,51,18.295
6,0,0,0,1,18,24.276
7,0,0,1,0,28,12.879
8,1,0,0,1,42,12.766
9,0,0,1,1,66,8.107


### Predict the label values using the earlier trained Logistic Regression Model

In [None]:
y_pred_test_final_logistic = logreg_model.predict(data_final_model)
y_pred_test_final_logistic

### Convert the array into a DataFrame

In [None]:
y_pred_test_final_logistic = pd.DataFrame(y_pred_test_final_logistic)

In [None]:
y_pred_test_final_logistic

In [None]:
y_pred_test_final_random_forest = randforest_model.predict(data_final_model)
y_pred_test_final_random_forest

In [None]:
y_pred_test_final_random_forest = pd.DataFrame(y_pred_test_final_random_forest)
y_pred_test_final_random_forest

In [57]:
y_pred_test_final_dtree= dtree.predict(data_final_model)

In [58]:
y_pred_test_final_dtree

array(['drugX', 'DrugY', 'drugX', 'drugC', 'DrugY', 'DrugY', 'DrugY',
       'drugX', 'drugA', 'drugX', 'drugA', 'drugX', 'DrugY', 'drugA',
       'drugB', 'DrugY', 'drugB', 'drugX', 'drugC', 'DrugY', 'drugB',
       'drugX', 'drugX', 'DrugY', 'DrugY', 'DrugY', 'drugC', 'drugX',
       'DrugY', 'drugX', 'DrugY', 'drugC', 'drugC', 'DrugY', 'drugA',
       'DrugY', 'drugX', 'drugA', 'DrugY', 'drugA'], dtype=object)

In [61]:
y_pred_test_final_dtree = pd.DataFrame(y_pred_test_final_dtree)
y_pred_test_final_dtree

Unnamed: 0,0
0,drugX
1,DrugY
2,drugX
3,drugC
4,DrugY
5,DrugY
6,DrugY
7,drugX
8,drugA
9,drugX


In [94]:
y_pred_test_final_rforest= rforest.predict(data_final_model)

In [95]:
y_pred_test_final_rforest

array(['drugX', 'DrugY', 'drugX', 'drugC', 'DrugY', 'DrugY', 'DrugY',
       'drugX', 'drugA', 'drugX', 'drugA', 'drugX', 'DrugY', 'drugA',
       'drugB', 'DrugY', 'drugB', 'drugX', 'drugC', 'DrugY', 'drugB',
       'drugX', 'drugX', 'DrugY', 'DrugY', 'DrugY', 'drugC', 'drugX',
       'DrugY', 'drugX', 'DrugY', 'drugC', 'drugC', 'DrugY', 'drugA',
       'DrugY', 'drugX', 'drugA', 'DrugY', 'drugA'], dtype=object)

In [96]:
y_pred_test_final_rforest = pd.DataFrame(y_pred_test_final_rforest)
y_pred_test_final_rforest

Unnamed: 0,0
0,drugX
1,DrugY
2,drugX
3,drugC
4,DrugY
5,DrugY
6,DrugY
7,drugX
8,drugA
9,drugX


In [128]:
y_pred_test_final_gboost= gboost.predict(data_final_model)
y_pred_test_final_gboost

array(['drugX', 'DrugY', 'drugX', 'drugC', 'DrugY', 'DrugY', 'DrugY',
       'drugX', 'drugA', 'drugX', 'drugA', 'drugX', 'DrugY', 'drugA',
       'drugB', 'DrugY', 'drugB', 'drugX', 'drugC', 'DrugY', 'drugB',
       'drugX', 'drugX', 'DrugY', 'DrugY', 'DrugY', 'drugC', 'drugX',
       'DrugY', 'drugX', 'DrugY', 'drugC', 'drugX', 'DrugY', 'drugA',
       'DrugY', 'drugX', 'drugA', 'DrugY', 'drugA'], dtype=object)

In [129]:
y_pred_test_final_gboost = pd.DataFrame(y_pred_test_final_gboost)
y_pred_test_final_gboost

Unnamed: 0,0
0,drugX
1,DrugY
2,drugX
3,drugC
4,DrugY
5,DrugY
6,DrugY
7,drugX
8,drugA
9,drugX


## Prepare the submission file which should have only two columns viz. the KEY/INDEX column(Id) and TARGET column(label)

### Applying Scale Mapper to convert Drug names to Labels
#scale_mapper = {"drugX":0, "DrugY":1, "drugC":2, "drugA":3,"drugB":4}
#data_target["Drug"] = data_target["Drug"].replace(scale_mapper)
#data_target

In [59]:
y_pred_test_final_random_forest.replace( {"drugX":0, "DrugY":1, "drugC":2, "drugA":3,"drugB":4}, inplace = True)
y_pred_test_final_random_forest

NameError: name 'y_pred_test_final_random_forest' is not defined

In [62]:
y_pred_test_final_dtree.replace( {"drugX":0, "DrugY":1, "drugC":2, "drugA":3,"drugB":4}, inplace = True)
y_pred_test_final_dtree

Unnamed: 0,0
0,0
1,1
2,0
3,2
4,1
5,1
6,1
7,0
8,3
9,0


In [97]:
y_pred_test_final_rforest.replace( {"drugX":0, "DrugY":1, "drugC":2, "drugA":3,"drugB":4}, inplace = True)
y_pred_test_final_rforest

Unnamed: 0,0
0,0
1,1
2,0
3,2
4,1
5,1
6,1
7,0
8,3
9,0


In [130]:
y_pred_test_final_gboost.replace( {"drugX":0, "DrugY":1, "drugC":2, "drugA":3,"drugB":4}, inplace = True)
y_pred_test_final_gboost

Unnamed: 0,0
0,0
1,1
2,0
3,2
4,1
5,1
6,1
7,0
8,3
9,0


### Concatenate the Id column with the predictions

In [131]:
submission_file = pd.concat([data_submission,y_pred_test_final_gboost], axis = 1)

In [132]:
submission_file

Unnamed: 0,Id,0
0,95,0
1,15,1
2,30,0
3,158,2
4,128,1
5,115,1
6,69,1
7,170,0
8,174,3
9,45,0


In [63]:
submission_file = pd.concat([data_submission,y_pred_test_final_dtree], axis = 1)

In [64]:
submission_file

Unnamed: 0,Id,0
0,95,0
1,15,1
2,30,0
3,158,2
4,128,1
5,115,1
6,69,1
7,170,0
8,174,3
9,45,0


### To write the final data to the submission file which is .csv without HEADER and INDEX

In [133]:
submission_file.to_csv('D://Drug_Prediction_Submission_GBoost.csv', header=False, index=False)
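
To check the submission layout before writing to disk, the same `to_csv` call can target an in-memory buffer. This is a minimal sketch with hypothetical ids and predictions; the `Drug` column name is an assumption and does not appear in the output, because the header is suppressed.

```python
import io
import pandas as pd

# Hypothetical ids and encoded predictions, for illustration only
ids = pd.Series([95, 15, 30], name="Id")
preds = pd.Series([0, 1, 0], name="Drug")

submission = pd.concat([ids, preds], axis=1)

buffer = io.StringIO()  # stands in for a file path
submission.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each line of the result holds one Id and one encoded drug label, with no header row and no index column.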

In [None]:
y_pred_test_final_logistic.to_csv('D://Drug_Prediction_Intermediate.csv')

### Thank You !!!