# CLINICAL TRIAL - Machine Learning Model

#Outline
---
1. Install required libraries, import them, and read the dataset
2. Combine attributes and rename the feature and target columns
3. Pre-processing:
    - Tokenize sentences
    - Train a gensim Doc2Vec model
    - Convert all text to lowercase and apply the Doc2Vec model to vectorize the text column
4. Machine learning modelling:
    - Splitting data into training and validation sets
    - k-fold cross-validation
    - List of models:
        - DecisionTreeClassifier
        - RandomForestClassifier
        - SVC
        - LogisticRegression
5. Results:
    - Accuracy
    - Precision
    - Recall
    - F1

#Install required library, import and read Dataset

In [None]:
#Libraries
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# word_tokenize needs the punkt tokenizer data
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df= pd.read_csv("/content/drive/MyDrive/Practicum/Final_folders/Submission/modelling_ML/Full_dataset.csv")
df.head(2)

Unnamed: 0.1,Unnamed: 0,NCT Number,Status,Conditions,Sponsor,Age,Funded_Bys,Study_Designs,Locations,Summerised_pdf_based_on_keywords,links
0,0,NCT03763474,Completed,diabetes mellitus type 1,aristotle university of thessaloniki,years to years child adult,other,allocation randomized intervention model paral...,endocrine unit of rd department of pediatrics ...,-The former England captain is among the most...,https://ClinicalTrials.gov/ProvidedDocs/74/NCT...
1,1,NCT05013294,Completed,diabetes,ku leuven jomo kenyatta university of agricult...,child adult older adult,other,allocation randomized intervention model paral...,jomo kenyatta university of agriculture and te...,The former England captain is among the most ...,https://ClinicalTrials.gov/ProvidedDocs/94/NCT...


#Required columns from the dataframe:
- Combine the descriptive attributes into a single text feature column and keep Status as the target column, with one row per clinical trial (identified by its NCT number).

In [None]:
df["text"] = df[["Conditions", "Sponsor","Age","Funded_Bys","Study_Designs","Locations","Summerised_pdf_based_on_keywords"]].apply("-".join, axis=1)
dataframe= df[['NCT Number','text','Status']]
dataframe= dataframe.rename(columns ={'NCT Number':"NCT_Number"})
dataframe.head(2)

Unnamed: 0,NCT_Number,text,Status
0,NCT03763474,diabetes mellitus type 1 -aristotle university...,Completed
1,NCT05013294,diabetes -ku leuven jomo kenyatta university o...,Completed
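
The `"-".join` step above only works when every combined column holds strings; a missing value (a float NaN) would raise a `TypeError`. The following is a minimal sketch of a more defensive join, assuming missing cells should simply become empty strings. The toy frame and its values are hypothetical and are not taken from Full_dataset.csv.

```python
import pandas as pd

# Toy frame with hypothetical values; the second row has a missing Sponsor.
toy = pd.DataFrame({
    "Conditions": ["diabetes", "asthma"],
    "Sponsor": ["ku leuven", None],
    "Age": ["adult", "child adult"],
})

text_cols = ["Conditions", "Sponsor", "Age"]

# Fill missing cells with "" and force str dtype so join never sees a float NaN.
toy["text"] = toy[text_cols].fillna("").astype(str).apply("-".join, axis=1)
print(toy["text"].tolist())
```

Whether an empty string is the right placeholder depends on how much weight missing attributes should carry in the vectorized text; that choice is an assumption here.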


In [None]:
dataframe['Status'] = dataframe['Status'].map({'Completed': 1, 'Not Completed': 0})
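
`Series.map` with a dictionary returns NaN for any label that is not a key, so a status other than the two listed would silently become a missing target. A small sketch of a check for that case follows; the label "Terminated" is a hypothetical example, not a value known to occur in the dataset.

```python
import pandas as pd

# Hypothetical labels; "Terminated" stands in for any value missing from the mapping.
status = pd.Series(["Completed", "Not Completed", "Terminated"])
mapping = {"Completed": 1, "Not Completed": 0}

encoded = status.map(mapping)

# Labels that were not in the dictionary come back as NaN.
unmapped = status[encoded.isna()].unique().tolist()
print(unmapped)
```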

#Modelling

##Tokenize sentences

In [None]:
sentences=dataframe['text'].tolist()
tok_sent = []
for s in sentences:
    tok_sent.append(word_tokenize(s.lower()))

##Train a gensim Doc2Vec model

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tok_sent)]

In [None]:
model = Doc2Vec(tagged_data, vector_size = 5, window = 2, min_count = 1, epochs = 1000)

##Convert all text into lowercase and apply the doc2vec model to vectorize the column text

In [None]:
sentence=dataframe['text'][0]

test_doc = word_tokenize(sentence.lower())
test_doc_vector = model.infer_vector(test_doc)

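def text_to_array(df: pd.DataFrame) -> np.ndarray:
    """Infer one Doc2Vec vector per row of df['text'] and stack them into a 2-D array."""
    text_list = []
    for _, row in df.iterrows():
        text_vector = model.infer_vector(word_tokenize(row['text'].lower()))
        text_list.append(text_vector)
    return np.array(text_list)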

# Obtained vector for text data using Doc2Vec 

In [None]:
text_array= text_to_array(dataframe)

In [None]:
X = text_array
Y = dataframe['Status'].to_numpy()
print(X.shape)
print(Y.shape)

(363, 5)
(363,)


In [None]:
print(X)
print(Y)

[[ 1.497994    1.1934664   1.9280154   1.1846226   0.13251962]
 [ 1.0638301   2.3907447   2.0549786   0.7462042  -0.2883812 ]
 [ 0.9122217   2.0932562   4.236477    0.80196506 -0.17591155]
 ...
 [ 0.858576    1.4652501   0.7683924   1.5630264  -0.74809825]
 [ 1.8869973   0.44585297 -0.13334319  1.7142289  -0.04603564]
 [-0.33040267  2.256687    2.2843857   1.2810794  -0.43557316]]
['Completed' 'Completed' 'Completed' 'Not Completed' 'Completed'
 'Completed' 'Completed' 'Completed' 'Completed' 'Completed'
 'Not Completed' 'Completed' 'Completed' 'Completed' 'Completed'
 'Completed' 'Completed' 'Completed' 'Completed' 'Completed'
 'Not Completed' 'Completed' 'Not Completed' 'Not Completed'
 'Not Completed' 'Not Completed' 'Not Completed' 'Not Completed'
 'Completed' 'Not Completed' 'Completed' 'Not Completed' 'Not Completed'
 'Completed' 'Completed' 'Completed' 'Completed' 'Completed' 'Completed'
 'Completed' 'Completed' 'Completed' 'Completed' 'Completed' 'Completed'
 'Completed' 'Compl

#Modelling Steps

##Separating data into training and validation sets

In [None]:
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=0.20, random_state=1)
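
The split above is purely random, so the class proportions in the validation set can drift from those of the full data. As a sketch of one alternative, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The arrays below are synthetic stand-ins (`X_demo`, `y_demo` are hypothetical names), not the notebook's `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 5 features, 60/40 class balance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo keeps the 60/40 ratio in both the training and validation parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)
print(y_tr.mean(), y_val.mean())
```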

#Machine Learning Algorithms

In [None]:
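#Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC(gamma='auto')))
models.append(('RandomForest', RandomForestClassifier()))
# evaluate each model in turn with stratified 10-fold cross-validation on the training split
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
results = []
names = []
# the loop variable is named clf so that it does not overwrite the Doc2Vec `model`
for name, clf in models:
    cv_results = cross_val_score(clf, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    # report mean and standard deviation of the fold accuracies
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))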
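
The outline lists accuracy, precision, recall and F1, while the loop above scores accuracy only. A minimal sketch of collecting all four in one pass with `cross_validate` follows, assuming a binary target encoded as 0/1. The data comes from `make_classification` and is a synthetic stand-in, not the clinical-trial vectors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary data with 5 features, standing in for the Doc2Vec vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

metrics = ["accuracy", "precision", "recall", "f1"]
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# cross_validate returns one array of per-fold scores for each requested metric.
scores = cross_validate(
    LogisticRegression(solver="liblinear"),
    X_demo, y_demo, cv=kfold, scoring=metrics,
)
for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```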

##1. SVC

In [None]:
# make predictions
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.6027397260273972
[[19 21]
 [ 8 25]]
               precision    recall  f1-score   support

    Completed       0.70      0.47      0.57        40
Not Completed       0.54      0.76      0.63        33

     accuracy                           0.60        73
    macro avg       0.62      0.62      0.60        73
 weighted avg       0.63      0.60      0.60        73
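
The per-class figures in the report above follow directly from the confusion matrix. The sketch below redoes that arithmetic, using scikit-learn's convention that rows are true classes and columns are predicted classes, in the order (Completed, Not Completed).

```python
import numpy as np

# Confusion matrix printed by the SVC cell: rows = true class, columns = predicted class.
cm = np.array([[19, 21],
               [8, 25]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums = number of predictions per class
recall = true_pos / cm.sum(axis=1)     # row sums = support per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for label, p, r, f in zip(["Completed", "Not Completed"], precision, recall, f1):
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"accuracy={accuracy:.4f}")
```

For example, precision for Completed is 19 / (19 + 8), and recall for Completed is 19 / 40.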



## 2. Logistic Regression

In [None]:
#make prediction
model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.5616438356164384
[[24 16]
 [16 17]]
               precision    recall  f1-score   support

    Completed       0.60      0.60      0.60        40
Not Completed       0.52      0.52      0.52        33

     accuracy                           0.56        73
    macro avg       0.56      0.56      0.56        73
 weighted avg       0.56      0.56      0.56        73



## 3. Decision Tree Classifier

In [None]:
# make predictions
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.5342465753424658
[[20 20]
 [14 19]]
               precision    recall  f1-score   support

    Completed       0.59      0.50      0.54        40
Not Completed       0.49      0.58      0.53        33

     accuracy                           0.53        73
    macro avg       0.54      0.54      0.53        73
 weighted avg       0.54      0.53      0.53        73



## 4. Random Forest Classifier

In [None]:
# make predictions
model = RandomForestClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.6438356164383562
[[26 14]
 [12 21]]
               precision    recall  f1-score   support

    Completed       0.68      0.65      0.67        40
Not Completed       0.60      0.64      0.62        33

     accuracy                           0.64        73
    macro avg       0.64      0.64      0.64        73
 weighted avg       0.65      0.64      0.64        73
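
The four model cells repeat the same fit, predict and report steps. One way to cut the repetition is a small helper; the sketch below is one possible shape for it, not the notebook's own code. The name `evaluate` and the synthetic data are hypothetical; in the notebook the inputs would be the training and validation splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(estimator, X_tr, y_tr, X_val, y_val):
    """Fit on the training split; return hold-out accuracy and confusion matrix."""
    estimator.fit(X_tr, y_tr)
    predictions = estimator.predict(X_val)
    return accuracy_score(y_val, predictions), confusion_matrix(y_val, predictions)


# Synthetic stand-in data with 5 features, mirroring the 5-dimensional Doc2Vec vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

candidates = {
    "SVC": SVC(gamma="auto"),
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(random_state=1),
    "RandomForest": RandomForestClassifier(random_state=1),
}
summary = {name: evaluate(est, X_tr, y_tr, X_val, y_val) for name, est in candidates.items()}
for name, (acc, cm) in summary.items():
    print(f"{name}: accuracy={acc:.3f}")
```

Fixing `random_state` on the tree-based models is an added assumption here; it makes their scores repeatable between runs.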



#Results

Models | Accuracy (%)
--- | ---
Random Forest Classifier | 64.38
SVC | 60.27
Logistic Regression | 56.16
Decision Tree Classifier | 53.42


On the hold-out validation set of 73 trials, the Random Forest classifier had the highest accuracy at 64.38 percent, followed by the Support Vector Classifier at 60.27 percent and Logistic Regression at 56.16 percent. The Decision Tree classifier had the lowest accuracy at 53.42 percent.
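
As a cross-check, the accuracies in the table can be recomputed from the four confusion matrices printed earlier: accuracy is the sum of the diagonal divided by the total count (73). A short sketch:

```python
import numpy as np

# Confusion matrices copied from the four model cells.
matrices = {
    "SVC": [[19, 21], [8, 25]],
    "Logistic Regression": [[24, 16], [16, 17]],
    "Decision Tree Classifier": [[20, 20], [14, 19]],
    "Random Forest Classifier": [[26, 14], [12, 21]],
}

# Accuracy in percent = correct predictions (diagonal) / all predictions.
accuracy = {
    name: 100 * np.trace(np.array(cm)) / np.sum(cm)
    for name, cm in matrices.items()
}
for name, acc in sorted(accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```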