In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import pyreadr

result = pyreadr.read_r('Diag_full.rds')
Medline = result[None]


# Select relevant columns and remove duplicates
Medline = Medline[['PMID', 'T_A', 'diagnosis']]
Medline = Medline.drop_duplicates()

# Separate data into diagnosis and no diagnosis
Medline_diag = Medline[Medline['diagnosis'] == 'Y']
Medline_nodiag = Medline[Medline['diagnosis'] == 'N']

# Set random seed and create training and test sets
np.random.seed(42)
Medline_train = pd.concat([Medline_diag[:500], Medline_nodiag[:500]])
Medline_train = Medline_train.sample(frac=1)

Medline_test = pd.concat([Medline_diag[500:700], Medline_nodiag[500:700]])
Medline_test = Medline_test.sample(frac=1)

Medline = pd.concat([Medline_train, Medline_test])

# Normalize text for machine learning
stop_words = ["introduction", "conclusion", "objective", "aim", "methods", "results", "conclusions",
              "background", "percent", "may", "use", "used", "however", "p", "cancer", "study", "lung",
              "prostate", "prostatic", "patient", "colorectal"]

Medline['T_A'] = Medline['T_A'].str.replace(r'\d+', '', regex=True)  # Remove numbers (regex=True is required in pandas >= 2.0)
Medline['T_A'] = Medline['T_A'].str.lower()  # Convert to lowercase
Medline['T_A'] = Medline['T_A'].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)  # Replace non-alphanumeric characters with spaces
Medline['T_A'] = Medline['T_A'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))  # Drop stop words

# Extract features using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(Medline['T_A'])

# Encode the 'diagnosis' column
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(Medline['diagnosis'])

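# Split the combined data 70/30 into training and test sets
# (note: this re-splits Medline and does not reuse Medline_train / Medline_test from above)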
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Set up the classifier (Support Vector Machine)
classifier = SVC(probability=True)
classifier.fit(X_train, y_train)

# Test the model
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(confusion)





Accuracy: 0.7785714285714286
Confusion Matrix:
[[165  29]
 [ 64 162]]
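
The accuracy printed above can be recovered by hand from the confusion matrix. The short sketch below redoes that arithmetic; it assumes the scikit-learn convention (rows are true labels, columns are predictions) and that LabelEncoder mapped 'N' to 0 and 'Y' to 1, which follows from its alphabetical ordering of the labels.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels (scikit-learn convention).
# Assumption: LabelEncoder sorted the labels, so index 0 is 'N' and index 1 is 'Y'.
confusion = np.array([[165, 29],
                      [64, 162]])

# Accuracy is the share of all predictions that sit on the diagonal
accuracy = np.trace(confusion) / confusion.sum()

# Per-class recall: correct predictions divided by the true count in each row
recall_per_class = np.diag(confusion) / confusion.sum(axis=1)

print("Accuracy:", accuracy)
print("Recall for 'N' and 'Y':", recall_per_class.round(3))
```

Here the model recovers the 'N' class more reliably than the 'Y' class, which the single accuracy figure does not show.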


Exercise 1: sample data (8 minutes)
Let's see if we can improve on the current accuracy.
Does increasing the amount of training data change the accuracy of the model?
In #7. and #8., change the training and test sizes to:
1. training 500, test 200: 0.64
2. training 750, test 300: 0.68
3. training 1000, test 400: 0.73
4. training 1500, test 700: 0.70

Which training size seems to give the best accuracy?
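
In the Python notebook, the same experiment could be run as a loop over training sizes. The sketch below is only an illustration: make_toy_corpus and accuracy_for_size are hypothetical helpers, the toy corpus stands in for the real Medline data, and the sizes are taken per class (as in the notebook's Medline_diag[:500] slicing), which is an assumption about how the exercise counts rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


def make_toy_corpus(n_per_class, seed=0):
    """Hypothetical stand-in for Medline: short, noisy two-class documents."""
    rng = np.random.default_rng(seed)
    diag_words = ["sensitivity", "specificity", "biopsy", "screening", "imaging"]
    other_words = ["therapy", "survival", "dose", "trial", "chemotherapy"]
    shared_words = ["tumour", "cohort", "analysis", "risk", "clinical"]
    rows = []
    for label, own_words in (("Y", diag_words), ("N", other_words)):
        for _ in range(n_per_class):
            # About 20% of documents carry no class-specific words (label noise)
            topic = own_words if rng.random() < 0.8 else shared_words
            words = list(rng.choice(topic, size=2)) + list(rng.choice(shared_words, size=4))
            rows.append({"T_A": " ".join(words), "diagnosis": label})
    return pd.DataFrame(rows)


def accuracy_for_size(data, n_train, n_test):
    """Train on the first n_train rows per class, test on the next n_test rows per class."""
    diag = data[data["diagnosis"] == "Y"]
    nodiag = data[data["diagnosis"] == "N"]
    train = pd.concat([diag[:n_train], nodiag[:n_train]])
    test = pd.concat([diag[n_train:n_train + n_test], nodiag[n_train:n_train + n_test]])

    # Fit the vocabulary on the training text only, then reuse it for the test text
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train["T_A"])
    X_test = vectorizer.transform(test["T_A"])

    model = SVC().fit(X_train, train["diagnosis"])
    return accuracy_score(test["diagnosis"], model.predict(X_test))


corpus = make_toy_corpus(n_per_class=200)
results = {}
for n_train, n_test in [(20, 10), (50, 20), (100, 40)]:
    results[n_train] = accuracy_for_size(corpus, n_train, n_test)
    print(f"training {n_train} per class, test {n_test} per class: {results[n_train]:.2f}")
```

With the real data, corpus would be replaced by the cleaned Medline frame and the size pairs by the ones listed in the exercise.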

Let's add probabilities
At the end of #6., add the following line of code to check how confident the algorithm is in its predictions:
learner$predict_type = "prob"
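
In the Python notebook, the rough counterpart of this setting is the probability=True argument already passed to SVC, which makes predict_proba available. The sketch below shows how the probabilities might be inspected; it uses synthetic numeric data from make_classification as a stand-in for the TF-IDF features, so the numbers themselves carry no meaning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF matrix X and the encoded labels y
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# probability=True fits an extra calibration step so predict_proba is available
classifier = SVC(probability=True, random_state=42)
classifier.fit(X_train, y_train)

# One row per test sample, one column per class (ordered as classifier.classes_)
proba = classifier.predict_proba(X_test)

# Most probable class and its probability for each sample.
# Note: for SVC this can occasionally differ from classifier.predict(X_test).
most_likely = classifier.classes_[proba.argmax(axis=1)]
confidence = proba.max(axis=1)

for label, conf in list(zip(most_likely, confidence))[:5]:
    print(f"class {label} with probability {conf:.3f}")
```

Predictions with a probability close to 0.5 are the ones the model is least sure about.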

Exercise 2: change the algorithm (8 minutes)
mlr3 comes with the learner 'classif.rpart' (#6.), but we can switch to other learners using the mlr3learners package (https://github.com/mlr-org/mlr3learners).
Try the following algorithms instead in #6.:
1. classif.svm: 0.735
2. classif.glmnet: 0.7
3. classif.ranger: 0.77

Others are available; try them if you have time.

(If you get an error, you may need to install a package as prompted, e.g. install.packages("ranger").)
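
The Python notebook already imports three scikit-learn classifiers that can serve as rough analogues of these learners: SVC for classif.svm, LogisticRegression for classif.glmnet, and RandomForestClassifier for classif.ranger. That mapping is an approximation, not an exact equivalence. A minimal sketch of swapping them in, again on synthetic stand-in data rather than the Medline features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF matrix X and the encoded labels y
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate classifiers, all sharing the same fit/predict interface
candidates = {
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the same split and record its test accuracy
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {results[name]:.3f}")
```

Because every model is trained and tested on the same split, the accuracies are directly comparable.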

Exercise 3: Let's change the balance of the data to see if it affects the prediction (15 minutes)

In #2., let's see whether an unbalanced vs. balanced dataset changes the accuracy of the model.

Create two datasets, Medline_diag and Medline_nodiag, by separating the diagnosis and non-diagnosis cases (using filter).

Then create two datasets, Medline_train and Medline_test, with respectively:
1. training --> nodiag: 700 diag: 300 | test --> nodiag: 200 diag: 200 --> 0.69
2. training --> nodiag: 300 diag: 700 | test --> nodiag: 200 diag: 200 --> 0.66
3. training --> nodiag: 500 diag: 500 | test --> nodiag: 200 diag: 200 --> 0.76

To build the training set from the two subsets, you can use the following code to help you:
Medline_train <- bind_rows(Medline_diag[1:700,], Medline_nodiag[1:300,])

Once the training and test datasets have been created, don't forget to randomise the training dataset
(e.g. Medline_train <- Medline_train[sample(nrow(Medline_train)),])

Then use bind_rows to join the training and test sets back into Medline.

Check how this affects the accuracy of the model. 
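
For the Python notebook, the same unbalanced split could be built with pandas. The sketch below is one possible version: make_split is a hypothetical helper, and the small placeholder frame only mimics the PMID, T_A and diagnosis columns of the real Medline data.

```python
import pandas as pd

# Hypothetical stand-in for Medline with the same three columns
Medline = pd.DataFrame({
    "PMID": range(2000),
    "T_A": ["placeholder abstract"] * 2000,
    "diagnosis": ["Y", "N"] * 1000,
})

# Separate the diagnosis and non-diagnosis cases
Medline_diag = Medline[Medline["diagnosis"] == "Y"]
Medline_nodiag = Medline[Medline["diagnosis"] == "N"]


def make_split(diag, nodiag, n_diag_train, n_nodiag_train, n_test=200):
    """Take the first rows of each class for training and the next n_test of each for testing."""
    train = pd.concat([diag[:n_diag_train], nodiag[:n_nodiag_train]])
    test = pd.concat([diag[n_diag_train:n_diag_train + n_test],
                      nodiag[n_nodiag_train:n_nodiag_train + n_test]])
    # Shuffle the training rows so the classes are not grouped together
    train = train.sample(frac=1, random_state=42)
    return train, test


# Scenario 1 from the list: training nodiag 700, diag 300; test 200 of each
Medline_train, Medline_test = make_split(Medline_diag, Medline_nodiag,
                                         n_diag_train=300, n_nodiag_train=700)

# Join the training and test sets back into a single frame
Medline = pd.concat([Medline_train, Medline_test])
print(Medline_train["diagnosis"].value_counts())
```

Changing the two training counts reproduces the other scenarios; the test set stays balanced at 200 rows per class in each case.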