Compile a list of stock ticker identifiers from the website stockanalysis.com. The list is incomplete: it contains only the larger companies, which might have implications down the line. There are 1175 companies listed in the file, which is named healthcare_stocks.csv.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the website
url = 'https://stockanalysis.com/stocks/sector/healthcare/'

def scrape_table(url):
    # Fetch the content from the URL
    response = requests.get(url)
    response.raise_for_status()  # Raises an HTTPError for bad responses

    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

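    # Find the table; fail with a clear message if the page layout has changed
    table = soup.find('table', {'class': 'symbol-table'})
    if table is None:
        raise ValueError("No table with class 'symbol-table' was found on the page.")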

    # Prepare lists to store the data
    no_list = []
    symbol_list = []
    company_name_list = []

    # Iterate through each row of the table
    for row in table.find_all('tr')[1:]:  # skip the header row
        cols = row.find_all('td')
        if len(cols) >= 3:
            no_list.append(cols[0].text.strip())
            symbol_list.append(cols[1].text.strip())
            company_name_list.append(cols[2].text.strip())

    # Create a DataFrame
    df = pd.DataFrame({
        'No.': no_list,
        'Symbol': symbol_list,
        'Company Name': company_name_list
    })

    return df

# Scrape the table
df = scrape_table(url)

# Specify the path within Google Drive where the CSV will be saved
csv_file = '/content/drive/MyDrive/healthcare_stocks.csv'
df.to_csv(csv_file, index=False)
print(f'Data saved to {csv_file}')

A previously compiled table of clinical trial preemptive reports (merged_clinical_trials.csv), with a unique NCT number for each trial, is loaded and filtered for industry sponsors, as opposed to academic trials (NHS etc.). The unique sponsor names (company names) from the trial reports are then saved, tab-separated, as "company_names.txt".

In [None]:
import pandas as pd

# Load the CSV file into a DataFrame
file_path = '/content/drive/MyDrive/ClinicalTrialsFull/merged_clinical_trials.csv'
df = pd.read_csv(file_path, encoding='utf-8', engine='python', on_bad_lines='skip')

# Standardize column names to lower case and strip whitespace
df.columns = df.columns.str.lower().str.strip()

# Check if 'funder type' column exists after standardization
if 'funder type' not in df.columns:
    raise KeyError("The column 'funder type' was not found in the dataset. Please check the column names.")

# Filter the rows where 'funder type' is 'INDUSTRY'
df_industry = df[df['funder type'] == 'INDUSTRY']

# Extract unique company names (sponsors), dropping missing values so join() only sees strings
company_names = df_industry['sponsor'].dropna().astype(str).unique()

# Write company names to a text file, tab-separated
output_file_path = '/content/drive/MyDrive/ClinicalTrialsFull/company_names.txt'
with open(output_file_path, 'w', encoding='utf-8') as f:
    f.write('\t'.join(company_names))

print("Company names saved successfully to company_names.txt.")

Load the trial report table and the stock ticker list. Match each sponsor (the company) with the appropriate stock ticker. Company names can vary between the two sources, so a match is made when the first 5 letters of the sponsor name equal the first 5 letters (case-insensitive) of a company name in the ticker list.
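
Because the lookup key is only five letters long, distinct companies can share a key, and a dictionary built from those keys silently keeps whichever company comes last. The following is a minimal sketch of how such collisions could be listed before matching. The company names and tickers are made-up placeholders, and `find_prefix_collisions` is a hypothetical helper that is not part of the notebook.

```python
from collections import defaultdict

def find_prefix_collisions(names_and_tickers, prefix_len=5):
    """Group tickers by lowercased name prefix; return prefixes shared by more than one ticker."""
    groups = defaultdict(set)
    for name, ticker in names_and_tickers:
        groups[name[:prefix_len].lower()].add(ticker)
    return {prefix: sorted(tickers) for prefix, tickers in groups.items() if len(tickers) > 1}

# Made-up placeholder rows (not real listings)
rows = [
    ("Alpha Therapeutics", "AAA"),
    ("Alphabet Biosciences", "BBB"),
    ("Beta Pharma", "CCC"),
]

collisions = find_prefix_collisions(rows)
print(collisions)  # {'alpha': ['AAA', 'BBB']}
```

Running a check like this on the real ticker list would show how many sponsors risk being assigned the wrong ticker, which matters because a wrong ticker produces a wrong label further down.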

From the trial report table it takes the date the results were first posted, and computes the average closing price over the 30 days before and the 30 days after that date. The code divides the average before by the average after, so a ratio below 1 means the price increased after the results were posted; that trial is labeled "Success", and otherwise "Failure". The results are saved as "drug_trial_results.csv".
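
The labeling rule can be sketched on a toy price series. In this sketch the ratio is taken as the average after divided by the average before, so a value above 1 means the price rose; the notebook cell computes the inverse and tests for values below 1, which yields the same label. The prices are invented, and `label_trial` is a hypothetical helper, not the notebook's code.

```python
from statistics import mean

def label_trial(prices_before, prices_after):
    """Return (ratio, status), where ratio = mean(after) / mean(before)."""
    if not prices_before or not prices_after:
        # No usable prices on one side: leave the trial unlabeled instead of guessing
        return None, None
    ratio = mean(prices_after) / mean(prices_before)
    status = "Success" if ratio > 1 else "Failure"
    return ratio, status

# Invented closing prices around a results date
before = [10.0, 10.0, 10.0, 10.0]
after = [11.0, 12.0, 11.0, 12.0]

ratio, status = label_trial(before, after)
print(round(ratio, 3), status)  # 1.15 Success
```

Note that this label measures the share price movement around the posting date, which is a proxy for the trial outcome and also reflects everything else that moved the stock in that window.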

It also saves a list of the companies that couldn't be assigned a stock ticker, to perhaps be matched up manually later (when we hire a student assistant).

In [1]:
# Install necessary libraries
!pip install pandas yfinance

import pandas as pd
import yfinance as yf
from datetime import datetime, timedelta
import os

# Function to load the dataset using different delimiters
def load_csv_with_fallbacks(file_path):
    delimiters = [',', ';', '\t']  # List of common delimiters to try
    for delimiter in delimiters:
        try:
            df = pd.read_csv(file_path, delimiter=delimiter, on_bad_lines='skip', engine='python')
            print(f"File loaded successfully using delimiter: '{delimiter}'")
            return df
        except pd.errors.ParserError as e:
            print(f"ParserError with delimiter '{delimiter}': {e}")
        except Exception as e:
            print(f"Error with delimiter '{delimiter}': {e}")
    raise ValueError("Failed to load the CSV file with common delimiters.")

# Load the dataset
csv_file_path = '/content/drive/MyDrive/ClinicalTrialsFull/merged_clinical_trials.csv'
df = load_csv_with_fallbacks(csv_file_path)

# Load the ticker mapping CSV file
ticker_file_path = '/content/drive/MyDrive/healthcare_stocks.csv'
df_tickers = pd.read_csv(ticker_file_path)

# Create a dictionary to map company names to stock tickers based on the first 5 letters
ticker_dict = {name[:5].lower(): ticker for name, ticker in zip(df_tickers['Company Name'], df_tickers['Symbol'])}

# Standardize column names to lower case and strip whitespace
df.columns = df.columns.str.lower().str.strip()

# Check if 'results first posted' column exists
if 'results first posted' not in df.columns:
    raise KeyError("The column 'results first posted' was not found in the dataset. Please check the column names.")

# Filter rows where 'funder type' is 'industry'
df_industry = df[df['funder type'] == 'INDUSTRY']

# Display the first few rows to check if the dataset is loaded correctly
print(df_industry.head())

# Initialize a list to collect results and a set to track unmatched companies
results = []
unmatched_companies = set()

# Iterate through each row in the filtered DataFrame
for index, row in df_industry.iterrows():
    try:
        nct_number = row['nct number']
        sponsor = row['sponsor']
        results_date = row['results first posted']

        # Check if results_date is missing or NaN
        if pd.isna(results_date):
            print(f"Results date is missing for NCT Number {nct_number}. Skipping...")
            continue

        # Convert results_date to datetime
        results_date = datetime.strptime(results_date, '%Y-%m-%d')

        # Calculate 30 days before and after the results date
        start_date = results_date - timedelta(days=30)
        end_date = results_date + timedelta(days=30)

        # Get the stock ticker by matching the first 5 letters of the sponsor name
        ticker_symbol = ticker_dict.get(sponsor[:5].lower(), None)

        if not ticker_symbol:
            print(f"No stock ticker found for sponsor: {sponsor}. Adding to unmatched list...")
            unmatched_companies.add(sponsor)
            continue

        # Fetch stock price data using yfinance
        stock_data = yf.download(ticker_symbol, start=start_date, end=end_date)

        # Handle cases where the download might fail or return empty
        if stock_data.empty:
            print(f"No data found for ticker: {ticker_symbol} for NCT Number {nct_number}. Skipping...")
            continue

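        # Average closing price over the window before and the window after the results date.
        # Note: the row for results_date itself, if it is a trading day, falls in both slices.
        avg_before = stock_data.loc[:results_date].iloc[-30:]['Close'].mean()
        avg_after = stock_data.loc[results_date:].iloc[:30]['Close'].mean()

        # Ratio of the average price before to the average price after
        ratio = avg_before / avg_after

        # Without prices on both sides the ratio is NaN, which would otherwise be labeled 'Failure'
        if pd.isna(ratio):
            print(f"Insufficient price data for NCT Number {nct_number}. Skipping...")
            continue

        # A ratio below 1 means the average price rose after the results were posted
        status = 'Success' if ratio < 1 else 'Failure'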

        # Collect the result
        results.append({'NCT Number': nct_number, 'Sponsor': sponsor, 'Status': status})

    except Exception as e:
        print(f"An error occurred for NCT Number {nct_number}: {e}")

# Convert the results list to a DataFrame
results_df = pd.DataFrame(results)

# Save the results to a new CSV file
results_csv_path = '/content/drive/MyDrive/ClinicalTrialsCSV/drug_trial_results.csv'
results_df.to_csv(results_csv_path, index=False)
print(f"Results saved successfully to {results_csv_path}")

# Save unmatched companies to a separate CSV file
unmatched_csv_path = '/content/drive/MyDrive/ClinicalTrialsCSV/unmatched_companies.csv'

# If the file already exists, load existing unmatched companies into the set
if os.path.exists(unmatched_csv_path):
    existing_unmatched_df = pd.read_csv(unmatched_csv_path)
    existing_unmatched_set = set(existing_unmatched_df['Unmatched Sponsor'])
    unmatched_companies.update(existing_unmatched_set)

# Convert the set to a DataFrame and save it
unmatched_df = pd.DataFrame(list(unmatched_companies), columns=['Unmatched Sponsor'])
unmatched_df.to_csv(unmatched_csv_path, index=False)
print(f"Unmatched sponsors saved successfully to {unmatched_csv_path}")

ERROR: Operation cancelled by user

KeyboardInterrupt: 

Plot the successes and failures to check the class balance. At the moment the two classes are roughly equal, which is good, as a balanced training set won't introduce bias and accidentally cause more type 1 or type 2 errors. On the other hand, it doesn't reflect the real-world success/failure ratio, as only about 10% of trials succeed across all phases, whereas our dataset contains only phase 3 trials.
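
The class balance also sets the bar a classifier has to clear: always predicting the more frequent label already scores that label's share of the data. A small sketch with hypothetical counts follows; the real counts would come from the `Status` column of drug_trial_results.csv, and `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

# Hypothetical label counts, for illustration only
counts = {"Success": 520, "Failure": 480}

label, baseline = majority_baseline(counts)
print(f"Always predicting {label!r} scores {baseline:.1%}")
```

Any model accuracy reported later is best read against this baseline and not against a flat 50%.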

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the results CSV file
results_csv_path = '/content/drive/MyDrive/ClinicalTrialsCSV/drug_trial_results.csv'
results_df = pd.read_csv(results_csv_path)

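# Count successes and failures in a fixed order, so green is always 'Success' and red is always 'Failure'
status_counts = results_df['Status'].value_counts().reindex(['Success', 'Failure'], fill_value=0)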

# Plot the results
plt.figure(figsize=(8, 6))
status_counts.plot(kind='bar', color=['green', 'red'])
plt.title('Number of Successes and Failures in Clinical Trials')
plt.xlabel('Status')
plt.ylabel('Number of Trials')
plt.xticks(rotation=0)
plt.show()

Bar plot of how often each company appears in our trial set. Eli Lilly, Merck, and Pfizer sponsor a large share of the trials. This could have some implications down the line.
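
One possible implication: when a few sponsors contribute most of the rows, a random train/test split puts the same company on both sides, and a model can score well by recognizing sponsor-specific wording. A minimal sketch of splitting by sponsor instead is shown below. The sponsor labels are hypothetical, and `split_by_sponsor` is an illustrative helper that is not part of the notebook.

```python
import random
from collections import Counter

def split_by_sponsor(sponsors, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) so that no sponsor appears on both sides."""
    counts = Counter(sponsors)
    order = sorted(counts)                 # deterministic starting order
    random.Random(seed).shuffle(order)     # reproducible shuffle of the sponsors
    target = test_size * len(sponsors)     # approximate number of test rows wanted
    test_sponsors, n_test = set(), 0
    for sponsor in order:
        if n_test >= target:
            break
        test_sponsors.add(sponsor)
        n_test += counts[sponsor]
    train_idx = [i for i, s in enumerate(sponsors) if s not in test_sponsors]
    test_idx = [i for i, s in enumerate(sponsors) if s in test_sponsors]
    return train_idx, test_idx

# Hypothetical sponsor column: a few large sponsors and several small ones
sponsors = ["A"] * 6 + ["B"] * 5 + ["C"] * 3 + ["D", "E", "F", "G", "H", "I"]

train_idx, test_idx = split_by_sponsor(sponsors)
train_set = {sponsors[i] for i in train_idx}
test_set = {sponsors[i] for i in test_idx}
print(sorted(test_set), len(test_idx), "of", len(sponsors), "rows in the test set")
```

The test share is only approximate, because whole sponsors are moved at once. scikit-learn's `GroupShuffleSplit` implements the same idea for use with arrays.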

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the results CSV file
results_csv_path = '/content/drive/MyDrive/ClinicalTrialsCSV/drug_trial_results.csv'
results_df = pd.read_csv(results_csv_path)

# Count the frequency of each company (sponsor)
company_frequency = results_df['Sponsor'].value_counts()

# Select the top 20 companies by frequency
top_companies = company_frequency.head(20)

# Plot the company frequencies using a horizontal bar plot
plt.figure(figsize=(10, 8))
top_companies.plot(kind='barh', color='skyblue')
plt.title('Top 20 Companies by Frequency in Clinical Trials Dataset')
plt.xlabel('Number of Trials')
plt.ylabel('Company (Sponsor)')
plt.gca().invert_yaxis()  # Invert y-axis to have the company with the most trials at the top
plt.tight_layout()
plt.show()

Term Frequency-Inverse Document Frequency (TF-IDF) is used to build features based on word frequency, and a Random Forest classifier is trained to predict success/failure from those features. Accuracy, as reported by the output, is 55%. GPT interprets the confusion matrix in the output as showing that the model predicts successes better than failures (perfect, let's keep it that way). Perhaps more importantly, the output lists the terms the model relies on most. These are feature importances, which rank how useful a term is for separating the classes but do not say whether it points toward success or failure. The top words include "model," "allocation," and "treatment", which are terms describing study design, and "plasma," "cmax," "dose," and "pharmacokinetics", which describe specific outcome measures used in trials. Maybe we can decode which types of trials have higher success rates. "Plasma," "cmax," "dose," and "pharmacokinetics" are outcome measures used in dose-determination studies and toxicity/safety studies. They generally come before the efficacy studies, but could have stronger implications for the viability of the drug (if it's toxic, it's no good).
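
As a rough illustration of what the TF-IDF weights express, here is a sketch of the textbook formula on three invented "trial descriptions": a term scores high in a document when it is frequent there and rare elsewhere. This is not the exact computation in scikit-learn, whose `TfidfVectorizer` by default smooths the inverse document frequency and normalizes each row; `tfidf` below is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: (term count / document length) * ln(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Toy descriptions, invented for illustration
docs = [
    "plasma cmax dose study",
    "efficacy study placebo",
    "dose efficacy study",
]

for doc, weights in zip(docs, tfidf(docs)):
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    print(doc, "->", [(term, round(weight, 3)) for term, weight in ranked])
```

The word "study" appears in every toy document, so its weight is zero, while "placebo" appears in only one and gets the highest weight there.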

In [None]:
# Install necessary libraries
!pip install pandas scikit-learn nltk

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load the clinical trials and results datasets
clinical_trials_path = '/content/drive/MyDrive/ClinicalTrialsFull/merged_clinical_trials.csv'
trial_results_path = '/content/drive/MyDrive/ClinicalTrialsCSV/drug_trial_results.csv'

df_clinical_trials = pd.read_csv(clinical_trials_path)
df_trial_results = pd.read_csv(trial_results_path)

# Standardize column names to lower case and strip whitespace
df_clinical_trials.columns = df_clinical_trials.columns.str.lower().str.strip()
df_trial_results.columns = df_trial_results.columns.str.lower().str.strip()

# Merge the two datasets on 'nct number'
df_merged = pd.merge(df_clinical_trials, df_trial_results, on='nct number', how='inner')

# Combine all relevant text columns into one
text_columns = ['study title', 'brief summary', 'conditions', 'interventions',
                'primary outcome measures', 'secondary outcome measures',
                'other outcome measures', 'study design']
df_merged['combined_text'] = df_merged[text_columns].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

# Preprocess the combined text data
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

df_merged['processed_text'] = df_merged['combined_text'].apply(preprocess_text)

# Feature extraction using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X = tfidf_vectorizer.fit_transform(df_merged['processed_text']).toarray()
y = df_merged['status'].apply(lambda x: 1 if x == 'Success' else 0)  # Encode success as 1, failure as 0

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Identify important features
feature_importances = model.feature_importances_
important_features = sorted(zip(tfidf_vectorizer.get_feature_names_out(), feature_importances), key=lambda x: x[1], reverse=True)

# Print top 20 important features
print("Top 20 important features correlated to success/failure:")
for feature, importance in important_features[:20]:
    print(f"{feature}: {importance:.4f}")



Try out some more strategies:
-Use BERT to create contextual embeddings of the trial descriptions. BERT captures the meaning of words and the relationships between them.
-Use sentiment analysis to extract "the vibe".

Several models are run: Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and Neural Network (MLP).
A second Random Forest, labeled as the ensemble model, is also trained and evaluated. It uses the same settings as the first one, so it is not a separate stacking ensemble.

Results show that none of the models significantly outperforms the others. It seems the optimization needs to happen at the feature extraction level, or at the dataset level (maybe we need to find some more information-dense training data?). The models perform at around 55-57% accuracy, which is not great. Still, an edge above random could in principle average out to a profit given enough trades (if we deploy it to live trading), provided it holds up on unseen data and exceeds trading costs.
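
Whether 55-57% can be told apart from a coin flip depends on how many trials are in the test set. The sketch below computes an approximate 95% interval for a measured accuracy using the normal approximation to the binomial. The test-set sizes are hypothetical and would be replaced by `len(y_test)`; `accuracy_interval` is an illustrative helper.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_test samples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Hypothetical test-set sizes; substitute len(y_test) from the notebook
for n_test in (100, 300, 1000):
    low, high = accuracy_interval(0.56, n_test)
    print(f"n_test={n_test}: {low:.3f} to {high:.3f}")
```

With 100 test trials the interval around 56% still includes 50%, while with 1000 it does not, so the size of the test set decides how much weight the reported accuracy can carry.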

In [None]:
# Install necessary libraries
!pip install pandas scikit-learn nltk transformers tensorflow keras

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.corpus import stopwords
from transformers import BertTokenizer, TFBertModel
from nltk.sentiment import SentimentIntensityAnalyzer
import re
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('vader_lexicon')
stop_words = set(stopwords.words('english'))
sia = SentimentIntensityAnalyzer()

# Load datasets
clinical_trials_path = '/content/drive/MyDrive/ClinicalTrialsFull/merged_clinical_trials.csv'
trial_results_path = '/content/drive/MyDrive/ClinicalTrialsCSV/drug_trial_results.csv'

df_clinical_trials = pd.read_csv(clinical_trials_path)
df_trial_results = pd.read_csv(trial_results_path)

# Standardize column names
df_clinical_trials.columns = df_clinical_trials.columns.str.lower().str.strip()
df_trial_results.columns = df_trial_results.columns.str.lower().str.strip()

# Merge datasets on 'nct number'
df_merged = pd.merge(df_clinical_trials, df_trial_results, on='nct number', how='inner')

# Combine text columns for analysis
text_columns = ['study title', 'brief summary', 'conditions', 'interventions',
                'primary outcome measures', 'secondary outcome measures',
                'other outcome measures', 'study design']
df_merged['combined_text'] = df_merged[text_columns].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

# Preprocess text data
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

df_merged['processed_text'] = df_merged['combined_text'].apply(preprocess_text)

# Add sentiment analysis
df_merged['sentiment_score'] = df_merged['processed_text'].apply(lambda x: sia.polarity_scores(x)['compound'])

# Advanced feature extraction using BERT embeddings
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='tf', truncation=True, padding=True, max_length=512)
    outputs = bert_model(inputs)
    return np.mean(outputs.last_hidden_state[0].numpy(), axis=0)

df_merged['bert_embedding'] = df_merged['processed_text'].apply(get_bert_embedding)

# Prepare features and target variable
X_text = np.array(df_merged['bert_embedding'].tolist())
X_sentiment = df_merged[['sentiment_score']].values
X = np.hstack((X_text, X_sentiment))
y = df_merged['status'].apply(lambda x: 1 if x == 'Success' else 0)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different machine learning models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel='linear'),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300)
}

# Train and evaluate each model
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Results for {model_name}:")
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

# "Ensemble" model: a second Random Forest with the same settings as above (this is not stacking)
ensemble_model = RandomForestClassifier(n_estimators=100, random_state=42)
ensemble_model.fit(X_train, y_train)
y_pred_ensemble = ensemble_model.predict(X_test)
print("Results for Ensemble Model (Random Forest):")
print(classification_report(y_test, y_pred_ensemble))
print(confusion_matrix(y_test, y_pred_ensemble))


At the feature extraction level, I decided to try medicine-specific BERT models (ClinicalBERT, BioBERT) and something called "aspect-based sentiment", to see whether they can more accurately identify meaningful features in the trials, as they might understand the language better.

No significant performance increase (55-58% now).

In [None]:
# Install necessary libraries
!pip install pandas scikit-learn transformers tensorflow keras torch nltk

import pandas as pd
import numpy as np
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
from transformers import BertTokenizer, TFBertModel
import torch
from transformers import AutoTokenizer, AutoModel
from nltk.sentiment import SentimentIntensityAnalyzer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from nltk.corpus import stopwords
from transformers import pipeline

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('vader_lexicon')
stop_words = set(stopwords.words('english'))
sia = SentimentIntensityAnalyzer()

# Load datasets
clinical_trials_path = '/content/drive/MyDrive/ClinicalTrialsFull/merged_clinical_trials.csv'
trial_results_path = '/content/drive/MyDrive/ClinicalTrialsCSV/drug_trial_results.csv'

df_clinical_trials = pd.read_csv(clinical_trials_path)
df_trial_results = pd.read_csv(trial_results_path)

# Standardize column names
df_clinical_trials.columns = df_clinical_trials.columns.str.lower().str.strip()
df_trial_results.columns = df_trial_results.columns.str.lower().str.strip()

# Merge datasets on 'nct number'
df_merged = pd.merge(df_clinical_trials, df_trial_results, on='nct number', how='inner')

# Combine text columns for analysis
text_columns = ['study title', 'brief summary', 'conditions', 'interventions',
                'primary outcome measures', 'secondary outcome measures',
                'other outcome measures', 'study design']
df_merged['combined_text'] = df_merged[text_columns].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

# Preprocess text data
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

df_merged['processed_text'] = df_merged['combined_text'].apply(preprocess_text)

# Load BioBERT and ClinicalBERT Tokenizers and Models
biobert_tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
biobert_model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

clinicalbert_tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
clinicalbert_model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Function to get BioBERT and ClinicalBERT embeddings
def get_bert_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return np.mean(outputs.last_hidden_state[0].numpy(), axis=0)

df_merged['biobert_embedding'] = df_merged['processed_text'].apply(lambda x: get_bert_embedding(x, biobert_tokenizer, biobert_model))
df_merged['clinicalbert_embedding'] = df_merged['processed_text'].apply(lambda x: get_bert_embedding(x, clinicalbert_tokenizer, clinicalbert_model))

# Aspect-Based Sentiment Analysis: Define aspects
aspects = ['efficacy', 'safety', 'side effects', 'quality of life']

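# Build the pipeline once; creating it inside the function would reload the model for every row.
# NOTE: "distilbert-base-uncased" is a base checkpoint without a trained sentiment head,
# so these scores are not calibrated sentiment; a sentiment-fine-tuned checkpoint may be needed.
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased")

def get_aspect_sentiment(text, aspect):
    truncated_text = (text[:512] + '..') if len(text) > 512 else text
    result = sentiment_pipeline(f"{aspect}: {truncated_text}", max_length=512, truncation=True)[0]
    return result['score']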

# Apply aspect-based sentiment analysis
for aspect in aspects:
    df_merged[f'sentiment_{aspect}'] = df_merged['processed_text'].apply(lambda x: get_aspect_sentiment(x, aspect))

# Prepare features and target variable
X_biobert = np.array(df_merged['biobert_embedding'].tolist())
X_clinicalbert = np.array(df_merged['clinicalbert_embedding'].tolist())
X_sentiment_aspects = df_merged[[f'sentiment_{aspect}' for aspect in aspects]].values
X = np.hstack((X_biobert, X_clinicalbert, X_sentiment_aspects))

y = df_merged['status'].apply(lambda x: 1 if x == 'Success' else 0)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different machine learning models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel='linear'),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300)
}

# Train and evaluate each model
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Results for {model_name}:")
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

# "Ensemble" model: a second Random Forest with the same settings as above (this is not stacking)
ensemble_model = RandomForestClassifier(n_estimators=100, random_state=42)
ensemble_model.fit(X_train, y_train)
y_pred_ensemble = ensemble_model.predict(X_test)
print("Results for Ensemble Model (Random Forest):")
print(classification_report(y_test, y_pred_ensemble))
print(confusion_matrix(y_test, y_pred_ensemble))


Next steps?
-LSTM (Long Short-Term Memory) models?
-BERT-based transformer?
-Deep learning model?
-Advanced topic extraction with Latent Dirichlet Allocation (LDA)? This can extract themes and meanings "between the lines", so to speak.
-Deploy pre-trained LLMs to extract parameters/features through API access. https://huggingface.co/ hosts many open-source LLMs that are free to use within limits.
-Consider pulling data from the internet/PubMed/press releases/Twitter that relates to the trial drug, to gather more nuanced datasets to incorporate into the analysis. Perhaps LLMs can automate that (feature extraction from CSV -> search the web for related material -> feature extraction from web material -> success/failure).