# Skip-gram (Word2Vec) Preprocessing for Cancer Text Classification

This notebook trains a **Word2Vec Skip-gram** model on the cancer research text corpus and prepares
embedding matrices and tokenized sequences ready for downstream classification models (GRU, LSTM, RNN).

## Why Skip-gram for This Dataset?

Skip-gram is particularly well-suited for biomedical text classification for several reasons:

1. **Rare word performance**: Skip-gram learns better representations for infrequent words compared to CBOW.
   Biomedical text contains many specialized, low-frequency terms (gene names, drug names, anatomical terms)
   that are critical for distinguishing cancer types. Skip-gram treats each context-target pair independently,
   giving rare words more gradient updates relative to their frequency.

2. **Semantic relationships in medical vocabulary**: Skip-gram captures fine-grained semantic similarities.
   For example, it can learn that "thyroidectomy" and "lobectomy" are related surgical procedures,
   or that "BRAF" and "RAS" are both oncogenes, even when they appear in different contexts.

3. **Small-to-medium corpus size**: With ~7,500 documents, our corpus is relatively small by NLP standards.
   Skip-gram is known to outperform CBOW on smaller datasets because it creates more training examples
   per word occurrence (one for each context word in the window).

4. **Domain-specific embeddings**: Pre-trained general embeddings (GloVe, FastText) may not capture
   biomedical semantics well. Training Skip-gram directly on our corpus produces embeddings tailored
   to the cancer classification domain.

### Output
All preprocessed artifacts are saved to `../preprocesseddata/` for reuse across model notebooks.

In [13]:
!pip install gensim

import numpy as np
import pandas as pd
import re
import pickle
import warnings
import os
warnings.filterwarnings('ignore')

import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from collections import Counter

from gensim.models import Word2Vec

print('All imports successful!')

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.9/27.9 MB[0m [31m80.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: gensim
Successfully installed gensim-4.4.0
All imports successful!


In [41]:
# Remove duplicates
duplicates = df.duplicated().sum()
print(f'Duplicates found: {duplicates}')
df = df.drop_duplicates()
print(f'After cleanup: {len(df)} samples')

Duplicates found: 0
After cleanup: 7570 samples


In [None]:
file_path = '../alldata_1_for_kaggle.csv'
df = pd.read_csv(file_path, encoding='latin1')
df.columns = ['id', 'cancer_type', 'text'] + list(df.columns[3:])

print(f'Dataset shape: {df.shape}')
print(f'Columns: {df.columns.tolist()}')
print(f'\nClass distribution:')
print(df['cancer_type'].value_counts())
df.head()

## 1. Load Dataset

## 2. Text Preprocessing

We apply the same preprocessing pipeline used in the GRU classification notebook to ensure consistency
across all team members' model inputs:
- Lowercasing
- URL and email removal
- Special character removal (keeping only alphabetic characters)
- Stopword removal
- Lemmatization
- Filtering tokens shorter than 3 characters

In [42]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Clean and tokenize text for embedding training."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and len(t) > 2]
    return ' '.join(tokens)

print('Sample:')
print('Original:', df['text'].iloc[0][:100], '...')
print('Cleaned :', preprocess_text(df['text'].iloc[0])[:100], '...')

Sample:
Original: Thyroid surgery in  children in a single institution from Osama Ibrahim Almosallama Ali Aseerib Ahme ...
Cleaned : thyroid surgery child single institution osama ibrahim almosallama ali aseerib ahmed alhumaida ali a ...


In [43]:
from tqdm import tqdm
tqdm.pandas()

print('Preprocessing texts...')
df['cleaned_text'] = df['text'].progress_apply(preprocess_text)
df = df[df['cleaned_text'].str.strip() != '']
print(f'Final samples: {len(df)}')

Preprocessing texts...


100%|██████████| 7570/7570 [02:57<00:00, 42.77it/s]

Final samples: 7570





## 3. Label Encoding & Train/Val/Test Split

Using the same split ratios and random state as the GRU notebook for fair comparison:
- 70% train, 15% validation, 15% test
- Stratified split to preserve class distribution

In [44]:
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['cancer_type'])

print('Label mapping:')
for i, name in enumerate(label_encoder.classes_):
    print(f'  {name} -> {i}')

X = df['cleaned_text'].values
y = df['label'].values

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)

print(f'\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}')

Label mapping:
  Colon_Cancer -> 0
  Lung_Cancer -> 1
  Thyroid_Cancer -> 2

Train: 5301, Val: 1133, Test: 1136


## 4. Train Word2Vec Skip-gram Model

### Skip-gram Architecture

Skip-gram predicts surrounding context words given a center word. For each word in the corpus,
it creates training pairs (center_word, context_word) within a sliding window.

**Key hyperparameters chosen for this dataset:**

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `vector_size=100` | Matches GRU embedding dimension for direct comparison |
| `window=5` | Captures local context typical of scientific writing |
| `min_count=2` | Keeps rare but meaningful biomedical terms (appearing at least twice) |
| `sg=1` | Selects Skip-gram (vs CBOW which uses sg=0) |
| `negative=5` | Negative sampling for efficient training |
| `epochs=20` | Sufficient convergence for our corpus size |
| `workers=4` | Parallel training threads |

In [47]:
# Tokenize training texts into lists of words for Word2Vec
train_tokenized = [text.split() for text in X_train]

print(f'Training Word2Vec Skip-gram on {len(train_tokenized)} documents...')
print(f'Total tokens in training set: {sum(len(doc) for doc in train_tokenized):,}')

skipgram_model = Word2Vec(
    sentences=train_tokenized,
    vector_size=100,    # embedding dimension
    window=5,           # context window size
    min_count=5,        # minimum word frequency (changed from 2 to 5)
    sg=1,               # 1 = Skip-gram, 0 = CBOW
    negative=5,         # negative sampling
    epochs=10,          # training iterations (changed from 20 to 10)
    workers=4,          # parallel threads
    seed=42             # reproducibility
)

print(f'\nSkip-gram model trained!')
print(f'Vocabulary size (Word2Vec): {len(skipgram_model.wv)}')
print(f'Embedding dimension: {skipgram_model.wv.vector_size}')

Training Word2Vec Skip-gram on 5301 documents...
Total tokens in training set: 11,451,874

Skip-gram model trained!
Vocabulary size (Word2Vec): 162502
Embedding dimension: 100


In [48]:
# Explore learned embeddings - check if Skip-gram captured meaningful medical relationships
test_words = ['cancer', 'tumor', 'thyroid', 'colon', 'lung', 'patient', 'treatment', 'cell']

print('Most similar words learned by Skip-gram:')
print('=' * 60)
for word in test_words:
    if word in skipgram_model.wv:
        similar = skipgram_model.wv.most_similar(word, topn=5)
        similar_str = ', '.join([f'{w} ({s:.3f})' for w, s in similar])
        print(f'  {word:12s} -> {similar_str}')
    else:
        print(f'  {word:12s} -> [not in vocabulary]')


Most similar words learned by Skip-gram:
  cancer       -> breast (0.780), prostate (0.683), lung (0.639), nonsmall (0.634), colorectal (0.631)
  tumor        -> tumour (0.647), microenvironment (0.637), teaming (0.606), malignant (0.600), solid (0.598)
  thyroid      -> papillary (0.745), carcinomapapillary (0.715), carcinomaneeds (0.708), associationguidelines (0.702), dierentiatedthyroid (0.701)
  colon        -> secondtumor (0.660), andrectum (0.656), betteroutcomes (0.656), rectum (0.654), modelsbreast (0.651)
  lung         -> nonsmall (0.759), nonsmallcell (0.679), nsclc (0.641), cancer (0.639), thoraconcol (0.637)
  patient      -> patientswith (0.735), nftdtpi (0.674), npoor (0.667), inmedocc (0.645), composingof (0.644)
  treatment    -> therapy (0.684), chemotherapy (0.610), treated (0.592), effective (0.592), modality (0.571)
  cell         -> proliferation (0.692), migration (0.668), migation (0.663), apoptosis (0.663), bothpdac (0.660)


## 5. Build Vocabulary & Embedding Matrix

We construct a vocabulary of the top 5,000 most frequent words (matching the GRU notebook's `MAX_WORDS`)
and build a pre-trained embedding matrix from the Skip-gram vectors. Words not in the Skip-gram
vocabulary are initialized to zero (they will be mapped to the `<UNK>` token).

In [49]:
MAX_WORDS = 5000
MAX_LEN = 300
EMBEDDING_DIM = 100

# Build vocabulary from training data (same as GRU notebook)
word_counts = Counter()
for text in X_train:
    word_counts.update(text.split())

vocab = ['<PAD>', '<UNK>'] + [w for w, c in word_counts.most_common(MAX_WORDS - 2)]
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}
vocab_size = len(vocab)

print(f'Vocabulary size: {vocab_size}')
print(f'Max sequence length: {MAX_LEN}')
print(f'Embedding dimension: {EMBEDDING_DIM}')

Vocabulary size: 5000
Max sequence length: 300
Embedding dimension: 100


In [50]:
# Build pre-trained embedding matrix from Skip-gram vectors
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
found_count = 0
missing_count = 0

for word, idx in word2idx.items():
    if word in skipgram_model.wv:
        embedding_matrix[idx] = skipgram_model.wv[word]
        found_count += 1
    else:
        missing_count += 1

coverage = found_count / vocab_size * 100
print(f'Embedding matrix shape: {embedding_matrix.shape}')
print(f'Words with Skip-gram vectors: {found_count}/{vocab_size} ({coverage:.1f}%)')
print(f'Words without vectors (will use zeros): {missing_count}')

Embedding matrix shape: (5000, 100)
Words with Skip-gram vectors: 4998/5000 (100.0%)
Words without vectors (will use zeros): 2


## 6. Convert Texts to Sequences

Each document is converted to a fixed-length integer sequence:
- Words are mapped to their vocabulary index
- Unknown words map to index 1 (`<UNK>`)
- Sequences are truncated or zero-padded to `MAX_LEN=300`

In [51]:
def text_to_sequence(text, word2idx, max_len):
    """Convert text to padded integer sequence."""
    tokens = text.split()[:max_len]
    seq = [word2idx.get(w, 1) for w in tokens]  # 1 = <UNK>
    seq = seq + [0] * (max_len - len(seq))       # 0 = <PAD>
    return seq

X_train_seq = np.array([text_to_sequence(t, word2idx, MAX_LEN) for t in X_train])
X_val_seq = np.array([text_to_sequence(t, word2idx, MAX_LEN) for t in X_val])
X_test_seq = np.array([text_to_sequence(t, word2idx, MAX_LEN) for t in X_test])

print(f'Sequence shapes:')
print(f'  Train: {X_train_seq.shape}')
print(f'  Val:   {X_val_seq.shape}')
print(f'  Test:  {X_test_seq.shape}')

Sequence shapes:
  Train: (5301, 300)
  Val:   (1133, 300)
  Test:  (1136, 300)


## 7. Save All Preprocessed Data

Saving the following artifacts to `../preprocesseddata/`:

| File | Contents |
|------|----------|
| `skipgram_embedding_matrix.npy` | Pre-trained embedding matrix (5000 x 100) |
| `skipgram_sequences.npz` | Train/val/test integer sequences |
| `skipgram_labels.npz` | Train/val/test labels |
| `skipgram_vocab.pkl` | Vocabulary mappings (word2idx, idx2word) |
| `skipgram_model.model` | Trained Word2Vec model (for further analysis) |
| `skipgram_config.pkl` | Hyperparameters and metadata |

In [52]:
output_dir = '../preprocesseddata'
os.makedirs(output_dir, exist_ok=True)

# 1. Save embedding matrix
np.save(os.path.join(output_dir, 'skipgram_embedding_matrix.npy'), embedding_matrix)
print(f'Saved: skipgram_embedding_matrix.npy  {embedding_matrix.shape}')

# 2. Save sequences
np.savez(os.path.join(output_dir, 'skipgram_sequences.npz'),
         X_train=X_train_seq, X_val=X_val_seq, X_test=X_test_seq)
print(f'Saved: skipgram_sequences.npz')

# 3. Save labels
np.savez(os.path.join(output_dir, 'skipgram_labels.npz'),
         y_train=y_train, y_val=y_val, y_test=y_test)
print(f'Saved: skipgram_labels.npz')

# 4. Save vocabulary
with open(os.path.join(output_dir, 'skipgram_vocab.pkl'), 'wb') as f:
    pickle.dump({
        'word2idx': word2idx,
        'idx2word': idx2word,
        'vocab': vocab,
        'label_encoder': label_encoder
    }, f)
print(f'Saved: skipgram_vocab.pkl')

# 5. Save trained Word2Vec model
skipgram_model.save(os.path.join(output_dir, 'skipgram_word2vec.model'))
print(f'Saved: skipgram_word2vec.model')

# 6. Save config/metadata
config = {
    'MAX_WORDS': MAX_WORDS,
    'MAX_LEN': MAX_LEN,
    'EMBEDDING_DIM': EMBEDDING_DIM,
    'vocab_size': vocab_size,
    'num_classes': len(label_encoder.classes_),
    'class_names': list(label_encoder.classes_),
    'train_size': len(X_train),
    'val_size': len(X_val),
    'test_size': len(X_test),
    'skipgram_params': {
        'vector_size': 100,
        'window': 5,
        'min_count': 2,
        'sg': 1,
        'negative': 5,
        'epochs': 20
    }
}
with open(os.path.join(output_dir, 'skipgram_config.pkl'), 'wb') as f:
    pickle.dump(config, f)
print(f'Saved: skipgram_config.pkl')

print(f'\nAll files saved to {output_dir}/')

Saved: skipgram_embedding_matrix.npy  (5000, 100)
Saved: skipgram_sequences.npz
Saved: skipgram_labels.npz
Saved: skipgram_vocab.pkl
Saved: skipgram_word2vec.model
Saved: skipgram_config.pkl

All files saved to ../preprocesseddata/


## 8. Verification - Load and Check Saved Data

Quick sanity check to confirm everything was saved correctly.

In [53]:
# Verify saved files
print('Verifying saved data...')
print('=' * 50)

# Load and check embedding matrix
emb = np.load(os.path.join(output_dir, 'skipgram_embedding_matrix.npy'))
print(f'Embedding matrix: {emb.shape}')

# Load and check sequences
seqs = np.load(os.path.join(output_dir, 'skipgram_sequences.npz'))
print(f'Train sequences: {seqs["X_train"].shape}')
print(f'Val sequences:   {seqs["X_val"].shape}')
print(f'Test sequences:  {seqs["X_test"].shape}')

# Load and check labels
labels = np.load(os.path.join(output_dir, 'skipgram_labels.npz'))
print(f'Train labels: {labels["y_train"].shape}')
print(f'Val labels:   {labels["y_val"].shape}')
print(f'Test labels:  {labels["y_test"].shape}')

# Load and check vocab
with open(os.path.join(output_dir, 'skipgram_vocab.pkl'), 'rb') as f:
    vocab_data = pickle.load(f)
print(f'Vocab size: {len(vocab_data["word2idx"])}')
print(f'Classes: {list(vocab_data["label_encoder"].classes_)}')

# Load config
with open(os.path.join(output_dir, 'skipgram_config.pkl'), 'rb') as f:
    cfg = pickle.load(f)
print(f'\nConfig: {cfg}')

print('\nAll verifications passed!')

Verifying saved data...
Embedding matrix: (5000, 100)
Train sequences: (5301, 300)
Val sequences:   (1133, 300)
Test sequences:  (1136, 300)
Train labels: (5301,)
Val labels:   (1133,)
Test labels:  (1136,)
Vocab size: 5000
Classes: ['Colon_Cancer', 'Lung_Cancer', 'Thyroid_Cancer']

Config: {'MAX_WORDS': 5000, 'MAX_LEN': 300, 'EMBEDDING_DIM': 100, 'vocab_size': 5000, 'num_classes': 3, 'class_names': ['Colon_Cancer', 'Lung_Cancer', 'Thyroid_Cancer'], 'train_size': 5301, 'val_size': 1133, 'test_size': 1136, 'skipgram_params': {'vector_size': 100, 'window': 5, 'min_count': 2, 'sg': 1, 'negative': 5, 'epochs': 20}}

All verifications passed!
