<div align="center">

# Drug Reviews Sentiment Analysis with Word2Vec

</div>

<div align="center">

## 🌟 Deep Learning for Natural Language Processing 🌟

**Advanced Text Analytics | Drug Review Analysis | Transfer Learning**

---

### 🎯 **Project Objectives**
1. 📊 **Load Data** - Process WebMD drug review dataset
2. 🧠 **Build Models** - Create Word2Vec embeddings & LSTM networks  
3. ☁️ **Deploy** - Upload to cloud platform

---

*Transforming text into insights through the power of neural networks* ✨

</div>


# 📊 Project Summary: Sentiment Analysis with Word2Vec

## 🎯 Main Goal
**Building a system to analyze text reviews and predict satisfaction levels using deep learning.**

---

## 🔄 Step-by-Step Process

### 1️⃣ **Data Loading** 📂
- 📥 Loads a WebMD dataset with drug reviews
- 🎯 Uses **"Reviews"** column as input text and **"Satisfaction"** column as target
- 🧹 Cleans the data and splits it into training/testing sets

### 2️⃣ **Word2Vec Training (Model 1)** 🤖
- 🏋️ Trains your own Word2Vec model on the review text
- 🔢 Converts words into numerical vectors that capture word meanings
- 📐 Creates embeddings with **60 dimensions** per word

### 3️⃣ **Transfer Learning (Model 2)** 🚀
- ⬇️ Loads pre-trained model: **"glove-wiki-gigaword-50"**
- 🌐 Model trained on Wikipedia and Gigaword news text, a far larger corpus than the review dataset
- 📚 Like using a dictionary that already knows word relationships

### 4️⃣ **Text Processing** ⚙️
- 🔗 Converts review sentences into sequences of word vectors
- 📏 Pads sequences to uniform length (**200 words max**)
- 🎛️ Prepares data for the neural network
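
The embedding and padding steps above can be sketched with plain NumPy. This is a minimal illustration, not the notebook's implementation: `toy_vectors`, `embed_tokens` and `pad_post` are hypothetical names, and the 3-dimensional vectors are made up. It mirrors the behaviour the notebook relies on, namely that unknown words are skipped, zeros are appended at the end, and over-long sequences keep their last `maxlen` rows (the default truncation side of Keras `pad_sequences`).

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model's vectors.
toy_vectors = {
    "good": np.array([0.9, 0.1, 0.0], dtype="float32"),
    "drug": np.array([0.2, 0.7, 0.1], dtype="float32"),
    "bad": np.array([-0.8, 0.1, 0.3], dtype="float32"),
}

def embed_tokens(tokens, vectors):
    """Stack the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    # reshape keeps a (0, dim) shape when no token is known
    return np.array(known, dtype="float32").reshape(-1, dim)

def pad_post(matrices, maxlen):
    """Zero-pad at the end; longer sequences keep their last maxlen rows."""
    dim = matrices[0].shape[1]
    out = np.zeros((len(matrices), maxlen, dim), dtype="float32")
    for i, m in enumerate(matrices):
        m = m[-maxlen:]
        out[i, : len(m)] = m
    return out

reviews = [["good", "drug"], ["bad", "unknownword", "drug", "bad", "good"]]
embedded = [embed_tokens(r, toy_vectors) for r in reviews]
padded = pad_post(embedded, maxlen=3)
print(padded.shape)  # (samples, timesteps, embedding dim)
```

The resulting 3-dimensional array has the layout an LSTM layer expects as input.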

### 5️⃣ **Neural Network Training** 🧠
- 🏗️ Builds an **LSTM** (Long Short-Term Memory) model
- 📝 LSTM excels at understanding text sequences
- 🎓 Trains model to predict satisfaction from review text
- ⏹️ Uses early stopping to prevent overfitting
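
Early stopping with a patience value can be sketched in a few lines. The function below is a simplified, hypothetical stand-in for what a callback such as Keras `EarlyStopping` does when monitoring validation loss: it is not the library's code, and the loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering the stop condition.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
losses = [0.67, 0.66, 0.65, 0.66, 0.67, 0.68, 0.66, 0.69]
stop, best = early_stopping_epoch(losses, patience=3)
print(f"stopped after epoch index {stop}, best epoch index {best}")
```

Restoring the best weights then means keeping the model state from `best`, not from the epoch where training stopped.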

### 6️⃣ **Evaluation** 📈
- 🧪 Tests model on unseen data for accuracy assessment
- ⚖️ Compares performance: Custom Word2Vec vs. Pre-trained embeddings
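
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below is an addition, not part of the original notebook: `majority_baseline_accuracy` is a hypothetical helper and the label arrays are toy data. It computes the accuracy of always predicting the most frequent training class.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

# Toy labels for illustration only.
y_train_toy = np.array([0, 0, 0, 1, 1])
y_test_toy = np.array([0, 1, 0, 0])
baseline = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(baseline)
```

A model is only useful if its test accuracy clearly exceeds this number.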

---

## 💡 **Key Insight**
> **Transfer learning** (using pre-trained embeddings) typically outperforms training from scratch because pre-trained models have been exposed to vastly more text data! 🌟


# 1. Data Loading and Libs


In [None]:
import gensim
import numpy as np

import os
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# --- 1. Locate and load the file ---
possible_paths = [
    '../Toms_Raw_Data/Data_Set_webmd_6174_Rows_Randomised.csv'
 ]
file_path = None
for path in possible_paths:
    if os.path.exists(path):
        file_path = path
        print(f'Found file: {file_path}')
        break
if file_path is None:
    raise FileNotFoundError('Data_Set_webmd_6174_Rows_Randomised.csv file not found!')

# --- 2. Try loading as Excel, CSV, TSV, and with different encodings ---
df = None
if file_path.endswith('.xlsx'):
    try:
        df = pd.read_excel(file_path)
        print('Loaded as Excel')
    except Exception as e:
        print('Could not load as Excel:', e)
if df is None:
    encodings = ['utf-8', 'latin-1']  # 'iso-8859-1' is an alias of 'latin-1', so listing both was redundant
    for enc in encodings:
        try:
            df = pd.read_csv(file_path, encoding=enc)
            print(f'Loaded as CSV with encoding {enc}')
            break
        except Exception:
            try:
                df = pd.read_csv(file_path, sep='\t', encoding=enc)
                print(f'Loaded as TSV with encoding {enc}')
                break
            except Exception:
                continue
if df is None:
    raise Exception('Could not load the file with any tried encoding or format.')

print('Columns:', df.columns.tolist())
print(df.head())

# --- 3. Use 'Reviews' as input and 'Satisfaction' as output ---
text_col = 'Reviews' #"Sides" for my Tom model
label_col = 'Satisfaction'
if text_col not in df.columns or label_col not in df.columns:
    raise Exception(f'Columns not found: {text_col} or {label_col}. Available columns: {df.columns.tolist()}')
print(f'Using text column: {text_col}, label column: {label_col}')

# --- 4. Clean and convert labels to binary if needed ---
df = df.dropna(subset=[text_col, label_col])
if pd.api.types.is_numeric_dtype(df[label_col]):  # covers every integer and float width
    median_rating = df[label_col].median()
    df['binary_sentiment'] = (df[label_col] > median_rating).astype(int)
    label_col = 'binary_sentiment'
else:
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    df['encoded_labels'] = le.fit_transform(df[label_col])
    label_col = 'encoded_labels'

# --- 5. Train/test split and tokenize ---
X = df[text_col].astype(str).values
y = df[label_col].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train = [text_to_word_sequence(text) for text in X_train]
X_test = [text_to_word_sequence(text) for text in X_test]

print(f'Train samples: {len(X_train)}, Test samples: {len(X_test)}')

FileNotFoundError: Data_Set_webmd_6174_Rows_Randomised.csv file not found!
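
The median-based label conversion in step 4 of the cell above can be illustrated on toy ratings. This is a sketch with invented values, not data from the WebMD file. It also shows an edge case worth checking on real data: because the comparison is strict (`>`), a rating distribution whose median equals the maximum rating yields only zeros.

```python
import numpy as np

# Hypothetical satisfaction ratings on a 1-5 scale.
ratings = np.array([1, 2, 3, 4, 5, 5])
median = np.median(ratings)
labels = (ratings > median).astype(int)
print(median, labels)

# Skewed toy ratings: the median equals the top rating, so no review is labelled 1.
skewed = np.array([5, 5, 5, 4, 5, 1])
labels_skewed = (skewed > np.median(skewed)).astype(int)
print(np.median(skewed), labels_skewed)
```

Printing `np.bincount` of the resulting label column is a quick way to confirm that both classes are present before training.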

# 2. [Model 1] Word2Vec Training


### Instantiation and Functions for Embedding and Padding


The functions below convert your training and test data into arrays you can feed into an RNN.


In [7]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=X_train, vector_size=60, min_count=10, window=10)

from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space
def embed_sentence(word2vec, sentence):
    embedded_sentence = []
    for word in sentence:
        if word in word2vec.wv:
            embedded_sentence.append(word2vec.wv[word])

    return np.array(embedded_sentence)

# Function that converts a list of sentences into a list of matrices
def embedding(word2vec, sentences):
    embed = []

    for sentence in sentences:
        embedded_sentence = embed_sentence(word2vec, sentence)
        embed.append(embedded_sentence)

    return embed

# Embed the training and test sentences
X_train_embed = embedding(word2vec, X_train)
X_test_embed = embedding(word2vec, X_test)


# Pad the training and test embedded sentences
X_train_pad = pad_sequences(X_train_embed, dtype='float32', padding='post', maxlen=200)
X_test_pad = pad_sequences(X_test_embed, dtype='float32', padding='post', maxlen=200)

☝️ To be sure that it worked, let's check the following for `X_train_pad` and `X_test_pad`:
- they are numpy arrays
- they are 3-dimensional
- the last dimension is of the size of your word2vec embedding space (you can get it with `word2vec.wv.vector_size`)
- the first dimension is of the size of your `X_train` and `X_test`

✅ **Good Practice** ✅ Such tests are quite important, not only in this exercise but in real-life applications. They catch errors early and keep them from propagating through the entire notebook.

In [8]:
# TEST ME
for X_pad in [X_train_pad, X_test_pad]:
    assert isinstance(X_pad, np.ndarray)  # numpy array
    assert X_pad.ndim == 3                # (samples, timesteps, embedding dim)
    assert X_pad.shape[-1] == word2vec.wv.vector_size


assert X_train_pad.shape[0] == len(X_train)
assert X_test_pad.shape[0] == len(X_test)
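
One further check may be useful, although it is not part of the original exercise: a review whose words are all missing from the vocabulary becomes a row made only of padding. The helper `count_empty_rows` below is a hypothetical sketch on a toy array. On the real data you would pass `X_train_pad` or `X_test_pad`.

```python
import numpy as np

def count_empty_rows(X_pad):
    """Count samples in which every timestep is an all-zero vector."""
    return int(np.sum(~X_pad.any(axis=(1, 2))))

# Toy padded batch: 3 samples, 4 timesteps, 2 embedding dimensions.
toy = np.zeros((3, 4, 2), dtype="float32")
toy[0, 0] = [0.5, -0.1]
toy[2, :2] = [[0.1, 0.2], [0.3, 0.4]]
n_empty = count_empty_rows(toy)
print(n_empty)
```

Such rows carry no information for the network, so a large count would suggest lowering `min_count` or using a larger vocabulary.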

# 3. [Model 2] Pre-trained Word2Vec: Transfer Learning

The accuracy obtained with Model 1 might be quite low. There are several ways to improve it, such as cleaning the data or improving the quality of the embedding.

We won't dig into data cleaning strategies here. Let's try to improve the quality of our embedding instead. Rather than just training on a larger corpus, why not benefit from an embedding that others have already learned?

The quality of an embedding, i.e. the proximity of related words, can be learned on one task and reused on another. This is exactly what transfer learning is.
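
The "proximity of the words" is usually measured with cosine similarity between their vectors. The sketch below uses made-up 2-dimensional vectors (`vec_a`, `vec_b`, `vec_c` are hypothetical, not taken from any trained model) to show the computation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented vectors for illustration only.
vec_a = np.array([1.0, 0.0])
vec_b = np.array([2.0, 0.0])  # same direction as vec_a, different length
vec_c = np.array([0.0, 3.0])  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

In a good embedding, words used in similar contexts end up with a similarity close to 1.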


In [9]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# --- 1. Locate and load the file ---
possible_paths = [
    '/home/xf/code/TomX79/06-Deep-Learning/04-RNN-and-NLP/data-sentiment-analysis-with-word2vec/Data_Set_webmd_6174_Rows_Randomised.csv'
 ]
file_path = None
for path in possible_paths:
    if os.path.exists(path):
        file_path = path
        print(f'Found file: {file_path}')
        break
if file_path is None:
    raise FileNotFoundError('Data_Set_webmd_6174_Rows_Randomised.csv file not found!')

# --- 2. Try loading as CSV, then TSV ---
try:
    df = pd.read_csv(file_path)
    print('Loaded as CSV')
except Exception:
    try:
        df = pd.read_csv(file_path, sep='\t')
        print('Loaded as TSV')
    except Exception as e:
        print('Could not load file:', e)
        raise

print('Columns:', df.columns.tolist())
print(df.head())

# --- 3. Use explicit columns for MODEL 2 ---
text_col = 'Reviews'
label_col = 'Satisfaction'
if text_col not in df.columns or label_col not in df.columns:
    raise Exception(f'Columns not found: {text_col} or {label_col}. Available columns: {df.columns.tolist()}')
print(f'Using text column: {text_col}, label column: {label_col}')

# --- 4. Clean and convert labels to binary if needed ---
df = df.dropna(subset=[text_col, label_col])
if pd.api.types.is_numeric_dtype(df[label_col]):  # covers every integer and float width
    median_rating = df[label_col].median()
    df['binary_sentiment'] = (df[label_col] > median_rating).astype(int)
    label_col = 'binary_sentiment'
else:
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    df['encoded_labels'] = le.fit_transform(df[label_col])
    label_col = 'encoded_labels'

# --- 5. Train/test split and tokenize ---
X = df[text_col].astype(str).values
y = df[label_col].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train = [text_to_word_sequence(text) for text in X_train]
X_test = [text_to_word_sequence(text) for text in X_test]

print(f'Train samples: {len(X_train)}, Test samples: {len(X_test)}')

Found file: /home/xf/code/TomX79/06-Deep-Learning/04-RNN-and-NLP/data-sentiment-analysis-with-word2vec/Data_Set_webmd_6174_Rows_Randomised.csv
Loaded as CSV
Columns: ['Random Number', 'Age', 'Condition', 'Date', 'Drug', 'DrugId', 'EaseofUse', 'Effectiveness', 'Reviews', 'Satisfaction', 'Sex', 'Sides']
   Random Number    Age   Condition      Date  \
0       0.000004  55-64  Depression  05-01-10   
1       0.000008  25-34  Depression  12-02-08   
2       0.000049  35-44  Depression  10-09-09   
3       0.000104  45-54  Depression  03-08-12   
4       0.000140  35-44  Depression  01-11-08   

                                        Drug  DrugId  EaseofUse  \
0  bupropion hcl sr tablet, extended release   13507          5   
1                              wellbutrin xl   76851          5   
2                                     celexa    8603          5   
3                                     prozac    6997          1   
4                              fluoxetine dr    1774          2   


In [10]:
import gensim.downloader as api
print(list(api.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


ℹ️ You can also find the list of the models and their size on the [`gensim-data` repository](https://github.com/RaRe-Technologies/gensim-data#models).

❓ **Question** ❓ Load one of the pre-trained word2vec embedding spaces. 

You can do that with `api.load(the-model-of-your-choice)`, and store it in `word2vec_transfer`

<details>
    <summary>💡 Hint</summary>
    
The `glove-wiki-gigaword-50` model is a good candidate to start with as it is smaller (65 MB).

</details>

In [25]:
word2vec_transfer = api.load("glove-wiki-gigaword-50")

❓ **Question** ❓ Check the size of the vocabulary, but also the size of the embedding space.

In [26]:
print(len(word2vec_transfer.key_to_index))
print(len(word2vec_transfer['art']))

400000
50
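
These two numbers give a rough idea of the memory involved. The arithmetic below is a back-of-the-envelope sketch that assumes 4-byte `float32` values. It ignores the vocabulary index and any other overhead, so real usage will be somewhat higher.

```python
vocab_size = 400_000     # words in the pre-trained vocabulary (printed above)
embedding_dim = 50       # length of each word vector (printed above)
bytes_per_float32 = 4    # assumption: vectors are stored as float32

# Size of the full embedding table.
table_bytes = vocab_size * embedding_dim * bytes_per_float32
print(f"embedding table: {table_bytes / 1e6:.0f} MB")

# Size of one review once padded to 200 timesteps.
maxlen = 200
sample_bytes = maxlen * embedding_dim * bytes_per_float32
print(f"one padded review: {sample_bytes / 1e3:.0f} kB")
```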


# 4. Text Processing for RNN


❓ Let's embed `X_train` and `X_test`, as in the first section where we provided the functions to do so! (The only difference in `embed_sentence_with_TF` is that the pre-trained object is indexed directly, `word2vec[word]`, instead of through `word2vec.wv`.)

In [27]:
# Function to convert a sentence (list of words) into a matrix representing the words in the embedding space
def embed_sentence_with_TF(word2vec, sentence):
    embedded_sentence = []
    for word in sentence:
        if word in word2vec:
            embedded_sentence.append(word2vec[word])

    return np.array(embedded_sentence)

# Function that converts a list of sentences into a list of matrices
def embedding(word2vec, sentences):
    embed = []

    for sentence in sentences:
        embedded_sentence = embed_sentence_with_TF(word2vec, sentence)
        embed.append(embedded_sentence)

    return embed

# Embed the training and test sentences
X_train_embed_2 = embedding(word2vec_transfer, X_train)
X_test_embed_2 = embedding(word2vec_transfer, X_test)

**Focus**   Do not forget to pad your results and store them in `X_train_pad_2` and `X_test_pad_2`.

In [28]:
# Pad the training and test embedded sentences
X_train_pad_2 = pad_sequences(X_train_embed_2, dtype='float32', padding='post', maxlen=200)
X_test_pad_2 = pad_sequences(X_test_embed_2, dtype='float32', padding='post', maxlen=200)

# 5. Neural Network Training


❓ **Question** ❓ Reinitialize a model and fit it on your new embedded (and padded) data!  Evaluate it on your test set and compare it to your previous accuracy.

❗ **Remark** ❗ The training here could take some time. You can limit it to about 10 epochs (this is **not** good practice, it just avoids a long wait) and go to the next exercise while it trains. Or take a break, you probably deserve it ;)

In [29]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Masking
from tensorflow.keras.optimizers import Adam
import numpy as np

def init_model():
    # Infer input shape from X_train_pad_2
    input_shape = (X_train_pad_2.shape[1], X_train_pad_2.shape[2])
    # Infer output shape (binary or multiclass)
    n_classes = len(np.unique(y_train))
    model = Sequential()
    model.add(Masking(mask_value=0., input_shape=input_shape))
    model.add(LSTM(64, return_sequences=False))
    if n_classes == 2:
        model.add(Dense(1, activation='sigmoid'))
        model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
    else:
        model.add(Dense(n_classes, activation='softmax'))
        model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
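
As a rough size check on the network defined above, the number of trainable parameters can be worked out by hand. The sketch assumes the binary branch (a single sigmoid unit) and the 50-dimensional GloVe input. `lstm_param_count` and `dense_param_count` are hypothetical helper names, and the formulas assume the layers' default bias terms.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=50, units=64)
dense_params = dense_param_count(input_dim=64, units=1)
total_params = lstm_params + dense_params  # the Masking layer has no parameters
print(lstm_params, dense_params, total_params)
```

Calling `model.summary()` on the real model is the way to confirm these figures.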

In [32]:
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, restore_best_weights=True)

model = init_model()

model.fit(X_train_pad_2, y_train,
          batch_size = 32,
          epochs=20,
          validation_split=0.3,
          callbacks=[es]
         )

Epoch 1/20
108/108 ━━━━━━━━━━━━━━━━━━━━ 16s 128ms/step - accuracy: 0.5319 - loss: 0.6916 - val_accuracy: 0.6019 - val_loss: 0.6695
Epoch 2/20
108/108 ━━━━━━━━━━━━━━━━━━━━ 13s 116ms/step - accuracy: 0.6185 - loss: 0.6605 - val_accuracy: 0.6262 - val_loss: 0.6552
Epoch 3/20
108/108 ━━━━━━━━━━━━━━━━━━━━ 13s 118ms/step - accuracy: 0.6363 - loss: 0.6352 - val_accuracy: 0.6370 - val_loss: 0.6487
Epoch 4/20

<keras.src.callbacks.history.History at 0x7280ca0e83b0>

# 6. Evaluation


In [35]:
# 🎯 Model Performance Evaluation
res = model.evaluate(X_test_pad_2, y_test, verbose=0)

# 📊 Display Results with Beautiful Formatting
accuracy = res[1] * 100
print("="*60)
print("🏆 MODEL PERFORMANCE RESULTS 🏆")
print("="*60)
print(f"📈 Test Set Accuracy: {accuracy:.3f}%")
print("-"*60)

# 🎨 Performance Rating
if accuracy >= 80:
    rating = "🌟 EXCELLENT"
    emoji = "🚀"
elif accuracy >= 70:
    rating = "✅ GOOD"
    emoji = "👍"
elif accuracy >= 60:
    rating = "⚠️ FAIR"
    emoji = "📊"
else:
    rating = "❌ NEEDS IMPROVEMENT"
    emoji = "📉"

print(f"{emoji} Performance Rating: {rating}")
print(f"🎯 Achieved: {accuracy:.3f}% accuracy on unseen data")
print("="*60)

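============================================================
🏆 MODEL PERFORMANCE RESULTS 🏆
============================================================
📈 Test Set Accuracy: 68.664%
------------------------------------------------------------
📊 Performance Rating: ⚠️ FAIR
🎯 Achieved: 68.664% accuracy on unseen data
============================================================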


Because the pre-trained embedding was trained on a large corpus, it has a representation for far more words than the Word2Vec model trained on your small dataset, especially since that model discarded words appearing fewer than `min_count=10` times in the train set. As a result, more words in your train and test sets are embedded, so the sequences are longer and each training iteration takes longer than before.
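
The effect described above can be quantified as token coverage, i.e. the share of tokens that have a vector. The sketch below is a toy illustration with invented sentences. `min_count_vocabulary` and `token_coverage` are hypothetical helpers, and `large_vocab` merely stands in for a big pre-trained vocabulary.

```python
from collections import Counter

def min_count_vocabulary(sentences, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(t for s in sentences for t in s)
    return {t for t, c in counts.items() if c >= min_count}

def token_coverage(sentences, vocabulary):
    """Fraction of all tokens that are present in the vocabulary."""
    tokens = [t for s in sentences for t in s]
    if not tokens:
        return 0.0
    return sum(t in vocabulary for t in tokens) / len(tokens)

toy_train = [
    ["the", "drug", "works"],
    ["the", "drug", "failed"],
    ["the", "nausea", "stopped"],
]
small_vocab = min_count_vocabulary(toy_train, min_count=2)
large_vocab = {"the", "drug", "works", "failed", "nausea", "stopped"}  # stand-in

coverage_small = token_coverage(toy_train, small_vocab)
coverage_large = token_coverage(toy_train, large_vocab)
print(f"{coverage_small:.2f} vs {coverage_large:.2f}")
```

On the real data, the same comparison could be run with `word2vec.wv.key_to_index` and `word2vec_transfer.key_to_index` as the two vocabularies.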


