<a href="https://colab.research.google.com/github/Stephan-Dupoux/DSCI-691-Project/blob/main/Presentation_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining Adverse Drug Events from Tweets
### [Project Repository](https://github.com/Stephan-Dupoux/DSCI-691-Project)

*Group Members:* 
1. Layla Bouzoubaa - lb3338@drexel.edu 
2. Stephan Dupoux - sgd45@drexel.edu
3. Hannah Wurzel - hjw35@drexel.edu
4. Zifeng Wang - zw438@drexel.edu 


### **Introduction**

Adverse drug reactions (ADRs) are described as “harmful reactions that are caused by the intake of medication”. In fact, 3%-7% of hospitalizations occur because of an ADR. Additionally, 10%-20% of hospital patients will experience an ADR while hospitalized. It is thought that nearly 10%-20% of these cases are severe and can lead to damaging side effects and, in some cases, even death. Since all drugs have the potential for adverse drug reactions, risk-benefit analysis (weighing the likelihood of benefit against the risk of ADRs) is necessary whenever a drug is prescribed. Moreover, the incidence and severity of ADRs may vary by patient characteristics and drug factors, which makes detecting ADRs from traditional consumer medication reports, or even modern electronic health records, a time-consuming and challenging task.

With the rise of social networks, people are more inclined to share their treatment experiences on social media, posting about their use of prescription drugs and related side effects. This behavior makes user posts on social media an important source for ADR detection. Because almost 90% of drug-related user posts are not associated with ADRs, posts discussing ADRs must be identified before further detection can occur. Detecting the presence of an ADR in each user post is therefore key to the success of further data mining.

Mining tweets in general has also been a challenge for NLP specialists for quite some time. Due to their relaxed nature, tweets tend to contain a significant amount of noise. This noise can come from slang, emojis, misspelled words, etc. It is difficult to deal with because it is often vital to understanding the message the user is attempting to get across.

The overarching goal of this project is to develop a basic framework for mining social media posts as a source of “signals” of potential ADRs, paying particular attention to the value such information might have for detecting adverse events earlier than currently possible and for detecting effects not easily captured by traditional means.

#### Rationale

We based our rationale on a few different papers. Fortunately, this task has been run in previous #SMM4H challenges, so we were able to get a baseline from those attempts. Additionally, we drew on other papers that tackled tasks similar to ours. From these papers we gathered inspiration on how to approach these tasks.

In addition to the papers we read, we also took inspiration from our first homework assignment. Similar to the problem we are looking to solve, homework one asked us to create a binary classifier to determine if tweets were in a chunk of text or not. By tweaking the code from this first homework assignment to fit our data, we were able to get a model up and running. This is discussed further in the Models and Results section.


#### Data

Most of the data was provided to us by the organizers of SMM4H 2022. This data includes nearly 17,500 labeled tweets, which will be used for training. Of these tweets, 1,235 are labeled as `ADE`. SMM4H also provides participants with around 10,000 tweets, which will be used for testing. The SMM4H organizers will be providing a validation and test set in the near future.

#### Task

Social Media Mining for Health Applications (#SMM4H) is a yearly NLP challenge held by the Health Language Processing Lab at the University of Pennsylvania. Each year they devise ten different tasks and accompanying subtasks related to NLP in the health sector. Our group chose Task 1: classification, detection, and normalization of Adverse Event (AE) mentions in tweets (in English). This task contains three subtasks; however, our group only picked the first two:
1. Classify tweets reporting ADEs (Adverse Drug Events)
2. Detect ADE spans in tweets

Our group was interested in this task because of the concerning statistics surrounding ADEs. In fact, 3%-7% of hospitalizations occur because of an ADE. Additionally, 10%-20% of hospital patients will experience an ADE while hospitalized. It is thought that nearly 10%-20% of ADE cases are severe and can lead to damaging side effects and even death<sup>1</sup>.

### Preprocessing 

We wanted to keep the tweets as close to their original form as possible. However, there was some noise in these tweets that we decided to remove. This includes user mentions ('@USER'), any "_" following them, emojis, and stopwords. ADE labels were converted from strings to binary values: `ADE`:`1`, `NoADE`:`0`.


```
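import re

from nltk.corpus import stopwords

# Assumption: stop_words is NLTK's English stopword list (its definition is not
# shown here); this requires a one-time nltk.download('stopwords').
stop_words = set(stopwords.words('english'))

def clean_tweet(tweet, remove_stopwords = True):
    # lowercase and strip leading/trailing whitespace
    new = tweet.lower().strip()
    # remove any instance of '@USER' followed by '_'
    new = re.sub(r'@\w+_', '', new)
    # replace extraneous (non-word, non-space) characters with spaces
    new = re.sub(r'[^\w\s]', ' ', new).strip()
    # drop remaining non-ASCII characters such as emojis
    new = new.encode('ascii', errors='ignore').decode('utf8').strip()
    # remove stopwords
    if remove_stopwords:
        new = ' '.join([word for word in new.split() if word not in stop_words])

    return new
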
def tokenize(text, space = False, clean = True):
    if clean: text = clean_tweet(text)
    tokens = []
    for token in re.split("([0-9a-zA-Z'-]+)", text):
        if not space:
            token = re.sub("[ ]+", "", token)
        if not token:
            continue
        if re.search("[0-9a-zA-Z'-]", token):                    
            tokens.append(token)
        else: 
            tokens.extend(token)
    return tokens

```
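The label conversion described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `LABEL_TO_ID` and `encode_labels` are hypothetical, and only the `ADE`/`NoADE` mapping comes from the text.

```python
# Hypothetical mapping for the two label strings described above.
LABEL_TO_ID = {"ADE": 1, "NoADE": 0}

def encode_labels(labels):
    """Map string labels to binary integers; unknown labels raise a KeyError."""
    return [LABEL_TO_ID[label] for label in labels]

# Made-up example labels, only to show the conversion.
example_labels = ["NoADE", "ADE", "NoADE"]
encoded = encode_labels(example_labels)
print(encoded)  # [0, 1, 0]
```

Raising an error on an unexpected label, rather than silently mapping it to `0`, makes mislabeled rows easier to spot.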

### Models and Results

#### **Subtask 1**
A training set made up of 80% of the data was created. The remaining 20% was reserved as a test set. When examining the training set, there was an observable imbalance between tweets that contain an ADE and those that do not (Figure 1).

![Figure 1](./img/train-dist.png)
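One way to produce an 80/20 split like the one described above is sketched here with `sklearn`'s `train_test_split`. The toy data, the `random_state`, and the use of `stratify` (which keeps the class ratio the same in both splits) are assumptions for illustration; the text does not say whether the original split was stratified.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in data: 100 tweets, 10% labeled ADE (1). The real corpus is not reproduced here.
texts = [f"tweet {i}" for i in range(100)]
labels = [1 if i % 10 == 0 else 0 for i in range(100)]

# Hold out 20% for testing; stratify preserves the ADE/NoADE ratio in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```

With data this imbalanced, an unstratified split can leave the test set with very few positive examples, which makes precision and recall estimates noisy.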

#### Feature based models

Logistic regression and support vector machines (SVM) were chosen as the baseline models against which to compare additional hyperparameter-tuned and engineered models, as well as the subsequent neural models.

Both the logistic regression and SVM models were based on TF-IDF features. TF-IDF was selected over a bag-of-words (BOW) representation because it captures a word's salience by balancing how frequently the word occurs against the number of documents it appears in.
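For reference, the textbook form of the weighting is given below. The exact variant used in this project is not stated, so treat this as a sketch. Here $\mathrm{tf}(t, d)$ is the count of term $t$ in tweet $d$, $N$ is the number of tweets, and $\mathrm{df}(t)$ is the number of tweets containing $t$:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}
$$

As a worked example, with $N = 1000$ tweets, a term appearing in 10 tweets has $\mathrm{idf} = \ln 100 \approx 4.61$, while a term appearing in 900 tweets has $\mathrm{idf} = \ln(1000/900) \approx 0.11$, so the rarer term is weighted far more heavily. Note that `sklearn`'s `TfidfVectorizer` by default uses a smoothed variant, $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$, followed by L2 normalization of each document vector.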

1. Logistic Regression

  1. vectorize training, test
  2. calculate tf-idf
  3. fit logistic regression with balanced weights
  4. apply model to predict classes

  This `sklearn` implementation of a logistic classifier resulted in the following performance metrics:

  ```
  AUC: 0.818335056178608
  Precision: 0.40
  Recall: 0.72
  F1 Score: 0.51
  ```

2. SVM

  1. SVM model with linear kernel and balanced weights on tf-idf features

  The `sklearn` implementation of the SVM classifier resulted in the following performance metrics:

  ```
  AUC: 0.7773205167496922
  Precision: 0.47
  Recall: 0.61
  F1 Score: 0.53
  ```
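The two feature-based baselines above can be sketched as `sklearn` pipelines. This is a minimal sketch under stated assumptions, not the project's actual code: the toy tweets, the helper name `fit_and_score`, `max_iter=1000`, and computing AUC from `decision_function` scores are all illustrative choices. Only TF-IDF features, balanced class weights, and the linear kernel come from the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in data (hypothetical); 1 = ADE, 0 = NoADE.
train_texts = [
    "this drug gave me a terrible headache",
    "felt dizzy and sick after taking the pills",
    "the medication made my stomach hurt",
    "picked up my prescription today",
    "doctor renewed my medication",
    "taking my pills with breakfast",
    "pharmacy was busy this morning",
    "new prescription arrived in the mail",
]
train_labels = [1, 1, 1, 0, 0, 0, 0, 0]
test_texts = ["these pills gave me a headache", "picked up pills at the pharmacy"]
test_labels = [1, 0]

def fit_and_score(classifier):
    """Fit TF-IDF + classifier on the training split and score it on the test split."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predictions, average="binary", zero_division=0
    )
    # AUC is computed from continuous decision scores rather than hard labels.
    auc = roc_auc_score(test_labels, model.decision_function(test_texts))
    return {"predictions": list(predictions), "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}

logreg_scores = fit_and_score(LogisticRegression(class_weight="balanced", max_iter=1000))
svm_scores = fit_and_score(SVC(kernel="linear", class_weight="balanced"))
print(logreg_scores)
print(svm_scores)
```

Wrapping the vectorizer and classifier in one pipeline ensures the TF-IDF vocabulary is learned from the training split only, so no information from the test tweets leaks into the features.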

#### Neural Architecture

1. Baseline: Multilayer Perceptron

  Initial experimentation for a baseline neural framework used a multilayer perceptron (MLP) model. An MLP differs markedly from the Bi-LSTM in how it computes predictions and in how it incorporates the input data. This baseline MLP is a simple feed-forward network built with scikit-learn and trained with the Adam optimizer.

```
Precision: 0.53
Recall: 0.40
F1 Score: 0.46
AUC: 0.6867
```
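A feed-forward baseline of this kind can be sketched with `sklearn`'s `MLPClassifier`. The toy data, the TF-IDF input, the single hidden layer of 32 units, `max_iter=500`, and `random_state=0` are assumptions for illustration; only the use of scikit-learn and the Adam optimizer comes from the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); 1 = ADE, 0 = NoADE.
train_texts = [
    "this drug gave me a terrible headache",
    "felt dizzy and sick after taking the pills",
    "the medication made my stomach hurt",
    "picked up my prescription today",
    "doctor renewed my medication",
    "taking my pills with breakfast",
]
train_labels = [1, 1, 1, 0, 0, 0]

mlp_model = make_pipeline(
    TfidfVectorizer(),
    # hidden_layer_sizes and max_iter are illustrative guesses, not the project's settings.
    MLPClassifier(hidden_layer_sizes=(32,), solver="adam", max_iter=500, random_state=0),
)
mlp_model.fit(train_texts, train_labels)

mlp_predictions = mlp_model.predict(["these pills gave me a headache"])
print(mlp_predictions)
```

Unlike the logistic regression and SVM classifiers, `MLPClassifier` has no `class_weight` option, so class imbalance would need to be handled some other way, for example by resampling the training data.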

RE: word embeddings
> Hidden layers in neural networks act like kernels in disentangling linearly inseparable input data layer by layer, step by step (Chollet, 2017)


2. Bi-LSTM

The Bi-LSTM was another neural architecture that we wanted to use in our experiments. The Bi-LSTM model is built with one recurrent layer and one time-distributed layer. It uses the same preprocessing techniques as the other neural network model.

What makes this experiment interesting is that feeding TF-IDF features to a Bi-LSTM is not the typical use of this type of architecture. This was done intentionally, to gauge the model's predictive power and to establish a first step before doing more feature engineering with it.

```
Bi-LSTM with TF-IDF preprocessing
Precision: 0.410
Recall: 0.405
F1 Score: 0.407
```
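Since F1 is the harmonic mean of precision and recall, the F1 scores reported above can be sanity-checked from the reported precision and recall values. The sketch below does this; the helper name `f1_from` is hypothetical, and the inputs are the rounded figures from this section.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in this section.
reported = {
    "Logistic regression": (0.40, 0.72),
    "SVM": (0.47, 0.61),
    "MLP": (0.53, 0.40),
    "Bi-LSTM": (0.410, 0.405),
}

f1_scores = {name: round(f1_from(p, r), 2) for name, (p, r) in reported.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score}")
```

The recomputed values agree with the reported F1 scores to two decimal places.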

### Discussion



#### Challenges

There were a few challenges our group dealt with while working on these tasks. The first issue we ran into was that our data is extremely unbalanced. Specifically, 7.2% of data points are classified as `ADE` while the remaining 92.8% are classified as `NoADE`. Data this unbalanced can skew a model toward the majority class, and it is still an ongoing problem we are working on fixing.
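To make the imbalance concrete, the sketch below computes the per-class weights that `sklearn`'s `class_weight="balanced"` option would assign, using its documented formula `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is hypothetical, and the counts are per 1,000 tweets as implied by the 7.2% / 92.8% split rather than the actual dataset counts.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Counts per 1,000 tweets implied by the reported 7.2% / 92.8% split.
weights = balanced_class_weights({"ADE": 72, "NoADE": 928})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Under this scheme, an error on an `ADE` tweet costs roughly 13 times as much as an error on a `NoADE` tweet during training.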

The next challenge we faced was determining whether or not to keep all of the noise in the tweets. We believe that this noise (usernames, emojis, slang, etc.) is vital to understanding the message the user is trying to get across. However, we believe our models would suffice if we removed the usernames and emojis from the tweets.

Lastly, when using Word2Vec on our data, we noticed that our models performed very poorly. This is because our data is so dense. This is a challenge that we are still trying to fix, and it is discussed a little more in the next section.

#### Next Steps

Our work is still ongoing, as we plan to be finished with our models by Friday. There are still quite a few things our group hopes to implement based on the literature we found. The first is a RoBERTa-based model. We saw a lot of promise for RoBERTa in the literature and think it would perform well for our tasks.

Next, we want to include a lexicon of health-related words and phrases. There are a few different places where we can obtain these. Our top three choices at the moment are the Unified Medical Language System (UMLS), MedDRA, and DrugBank.

- RoBERTa
- domain-specific vocabulary (UMLS, MedDRA, DrugBank)
  - RedMed encoding

### References

1. Marsh, D. E. S. (2022, June 2). Adverse drug reactions - clinical pharmacology. Merck Manuals Professional Edition. Retrieved June 6, 2022, from https://www.merckmanuals.com/professional/clinical-pharmacology/adverse-drug-reactions/adverse-drug-reactions 