<a href="https://colab.research.google.com/github/EzraMW/ML/blob/main/Transfer_Learning_for_Newsgroups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using pre-trained word embeddings

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2020/05/05<br>
**Last modified:** 2020/05/05<br>
**Description:** Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.

Taken from: 
https://keras.io/examples/nlp/pretrained_word_embeddings/

Modified by Avi Rosenfeld on 2023/02/21

## Setup

In [4]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization, Embedding
from sklearn.datasets import fetch_20newsgroups
import gensim.downloader as api

In [30]:
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
glove_model = api.load("glove-wiki-gigaword-200")

## Download the Newsgroup20 data

Note that as opposed to our previous notebook, here we download the entire 20_newsgroup data.

In [3]:
newsgroups_all = fetch_20newsgroups(subset='all', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
newsgroups_all_headers = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

In [37]:
def transfer_learning(model, d, size, stop_words=None, words=None):
  embedding_dim = model.vector_size 
  print("Embedding Dimension: ", embedding_dim)

  samples = d.data
  labels =  d.target
  class_names = d.target_names

  seed = 1337
  rng = np.random.RandomState(seed)
  rng.shuffle(samples)
  rng = np.random.RandomState(seed)
  rng.shuffle(labels)

  # Extract a training & validation split
  validation_split = 0.2
  num_validation_samples = int(validation_split * len(samples))
  train_samples = samples[:-num_validation_samples]
  val_samples = samples[-num_validation_samples:]
  train_labels = labels[:-num_validation_samples]
  val_labels = labels[-num_validation_samples:]

  vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=size)
  text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
  vectorizer.adapt(text_ds)

  voc = vectorizer.get_vocabulary()
  
  # If a word list was passed in, use those specific words as the vocabulary.
  # Note: the vectorizer above keeps its own adapted vocabulary, so these
  # indices only match the token ids it emits when `words` is None.
  if words is not None:
    voc = words

  word_index = dict(zip(voc, range(len(voc))))

  num_tokens = len(voc) + 2
  embedding_matrix = np.zeros((num_tokens, embedding_dim))

  hits = 0
  misses = 0

  # Prepare embedding matrix. Words missing from the pre-trained model, and
  # stop words when a stop-word collection is given, keep an all-zero vector
  # and are counted as misses.
  for word, i in word_index.items():
    if word in model and (stop_words is None or word not in stop_words):
      hits += 1
      embedding_matrix[i] = model[word]
    else:
      misses += 1
  print("Converted %d words (%d misses)" % (hits, misses))

  embedding_layer = Embedding(
      num_tokens,
      embedding_dim,
      embeddings_initializer=keras.initializers.Constant(embedding_matrix),
      trainable=False,
  ) 

  int_sequences_input = keras.Input(shape=(None,), dtype="int64")
  embedded_sequences = embedding_layer(int_sequences_input)
  x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
  x = layers.MaxPooling1D(5)(x)
  x = layers.Conv1D(128, 5, activation="relu")(x)
  x = layers.MaxPooling1D(5)(x)
  x = layers.Conv1D(128, 5, activation="relu")(x)
  x = layers.GlobalMaxPooling1D()(x)
  x = layers.Dense(128, activation="relu")(x)
  x = layers.Dropout(0.5)(x)
  preds = layers.Dense(len(class_names), activation="softmax")(x)
  model = keras.Model(int_sequences_input, preds)
  # print(model.summary())

  x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
  x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

  y_train = np.array(train_labels)
  y_val = np.array(val_labels)

  model.compile(
      loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["acc"],
  )
  print("Running Model:")
  model.fit(x_train, y_train, batch_size=128, epochs=30, validation_data=(x_val, y_val))

  string_input = keras.Input(shape=(1,), dtype="string")
  x = vectorizer(string_input)
  preds = model(x)
  end_to_end_model = keras.Model(string_input, preds)
  return end_to_end_model
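
The shuffle-and-split step at the top of `transfer_learning` relies on re-seeding the generator so that samples and labels receive the same permutation. A minimal sketch of that idea follows, using the standard library's `random` module instead of NumPy's `RandomState` (a substitution made here so the snippet has no dependencies). The function name `shuffle_and_split` and the toy data are hypothetical.

```python
import random

def shuffle_and_split(samples, labels, validation_split=0.2, seed=1337):
    """Shuffle two parallel lists identically, then hold out the tail for validation.

    Assumes the split yields at least one validation sample.
    """
    # Copy first so the caller's lists are left untouched.
    samples, labels = list(samples), list(labels)
    # Re-seeding before each shuffle applies the same permutation to both lists.
    random.Random(seed).shuffle(samples)
    random.Random(seed).shuffle(labels)
    num_val = int(validation_split * len(samples))
    train = (samples[:-num_val], labels[:-num_val])
    val = (samples[-num_val:], labels[-num_val:])
    return train, val

docs = [f"doc{i}" for i in range(10)]   # toy documents
targets = list(range(10))               # label i belongs to doc i
(train_x, train_y), (val_x, val_y) = shuffle_and_split(docs, targets)
print(len(train_x), len(val_x))
```

Unlike the notebook function, this sketch copies its inputs, so repeated calls do not reshuffle the original dataset in place.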

With Headers, Footers, and Quotes

In [7]:
transfer_learning(glove_model, newsgroups_all_headers, 200)

Embedding Dimension:  200
Converted 17655 words (2345 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fa0942a6d60>

The model did very well here. Note that this run uses `newsgroups_all_headers`, so the headers, footers, and quotes are still in the text. It got a training accuracy of about 99% and a validation accuracy of around 85%.

Without Headers, Footers, and Quotes

In [6]:
transfer_learning(glove_model, newsgroups_all, 200)

Embedding Dimension:  200
Converted 18326 words (1674 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fa5381cc850>

However, when we removed the headers, footers, and quotes (`newsgroups_all`), the training accuracy decreased slightly (surprisingly) and the validation accuracy dropped dramatically to less than 60%. This suggests that much of the earlier validation accuracy came from the model fitting to metadata in the headers, footers, and quotes rather than to the message text itself.

So, going forward, we will remove the headers, footers, and quotes from the data.

# Improve Performance for 4 Categories

In [6]:
cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# newsgroups_cats_header = fetch_20newsgroups(subset='all',categories=cats)
newsgroups_cats = fetch_20newsgroups(subset='all',categories=cats, remove=('headers', 'footers', 'quotes'))

Let's establish a baseline using the GloVe model with these 4 categories.

In [13]:
transfer_learning(glove_model, newsgroups_cats, 200)

Embedding Dimension:  200
Converted 15881 words (4119 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc58190c0a0>

We achieved over 97% training accuracy and over 75% validation accuracy. This isn't as good as what we got using all 20 categories, but it is very solid.

Let's try transferring from different pre-trained models.

In [41]:
glove_model_300 = api.load("glove-wiki-gigaword-300")

In [10]:
transfer_learning(glove_model_300, newsgroups_cats, 300)

Embedding Dimension:  300
Converted 16166 words (3834 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fa45406ff10>

Very solid results, but not very different from the 200-dimensional GloVe model.
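
Note that the third argument of `transfer_learning` is passed to `output_sequence_length`, so it sets the number of tokens kept per document. The embedding dimension comes from `model.vector_size`. The sketch below traces how that sequence length shrinks through the three `Conv1D(128, 5)` layers and two `MaxPooling1D(5)` layers, assuming Keras defaults ("valid" padding, convolution stride 1, pooling stride equal to the pool size). The helper names are hypothetical.

```python
def conv1d_valid(length, kernel_size):
    # Output length of a stride-1 convolution with "valid" padding.
    return length - kernel_size + 1

def max_pool1d(length, pool_size):
    # Output length of pooling with stride == pool_size and "valid" padding.
    return (length - pool_size) // pool_size + 1

def steps_before_global_pool(sequence_length):
    """Time steps left just before GlobalMaxPooling1D in the notebook's stack."""
    n = conv1d_valid(sequence_length, 5)
    n = max_pool1d(n, 5)
    n = conv1d_valid(n, 5)
    n = max_pool1d(n, 5)
    n = conv1d_valid(n, 5)
    return n

for size in (200, 300):
    print(size, "->", steps_before_global_pool(size))
```

With 200 tokens the stack leaves 3 positions before the global pooling layer, and with 300 tokens it leaves 7, so changing `size` alters how much text the classifier sees as well as the shape of its intermediate feature maps.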

In [7]:
w2v_model = api.load("word2vec-google-news-300")

In [8]:
transfer_learning(w2v_model, newsgroups_cats, 300)

Embedding Dimension:  300
Converted 14143 words (5857 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc5da07b880>

This also didn't really improve on the previous model
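
One thing the printed "Converted ... words (... misses)" lines do show is how much of the vectorizer vocabulary each pre-trained model covers on the 4-category data. The short sketch below turns the counts reported above into coverage fractions. The `coverage` helper is hypothetical, and the misses include whatever non-word entries the vocabulary contains.

```python
# (hits, misses) as printed by the three baseline runs on the 4 categories
runs = {
    "glove-wiki-gigaword-200": (15881, 4119),
    "glove-wiki-gigaword-300": (16166, 3834),
    "word2vec-google-news-300": (14143, 5857),
}

def coverage(hits, misses):
    """Fraction of vocabulary entries that received a pre-trained vector."""
    return hits / (hits + misses)

for name, (hits, misses) in runs.items():
    print(f"{name}: {coverage(hits, misses):.1%}")
```

This gives roughly 79.4%, 80.8%, and 70.7% respectively, so the word2vec run starts with noticeably more all-zero rows in its embedding matrix.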

## Remove Stop Words

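Let's try removing stop words and see if that helps. Note that `transfer_learning` does this by leaving the stop words' embedding vectors at zero rather than by deleting them from the text.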
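
In `transfer_learning`, the `stop_words` argument does not delete anything from the documents. A stop word still occupies its position in each token sequence, but its row in the embedding matrix stays at zero and it is counted as a miss. Here is a simplified sketch of that masking, using a plain dictionary as a stand-in for the gensim model. The vectors, vocabulary, and function name are all hypothetical, and the two extra rows the notebook allocates are omitted.

```python
def build_embedding_rows(vocabulary, vectors, stop_words=None):
    """Return one row per vocabulary entry; unknown words and stop words stay all-zero."""
    dim = len(next(iter(vectors.values())))
    rows, hits, misses = [], 0, 0
    for word in vocabulary:
        if word in vectors and (stop_words is None or word not in stop_words):
            rows.append(list(vectors[word]))
            hits += 1
        else:
            rows.append([0.0] * dim)
            misses += 1
    return rows, hits, misses

# Toy 2-dimensional "pre-trained" vectors (made up for illustration)
toy_vectors = {"the": [0.1, 0.2], "orbit": [0.9, 0.3], "pixel": [0.4, 0.8]}
vocabulary = ["", "[UNK]", "the", "orbit", "pixel"]

rows, hits, misses = build_embedding_rows(vocabulary, toy_vectors, stop_words={"the"})
print(hits, misses)
print(rows[vocabulary.index("the")])
```

Here "the" has a vector available but is masked, which is why the miss count in the stop-word runs is slightly higher than in the baseline.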

In [9]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [12]:
transfer_learning(glove_model, newsgroups_cats, 200, ENGLISH_STOP_WORDS)

Embedding Dimension:  200
Converted 15858 words (4142 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc578165580>

Other than the last epoch, which seems like an outlier, we got a training accuracy above 97% and a validation accuracy around 78%. This is slightly better than our baseline model.

## Use the top CHI2 Words
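
The chi-squared score ranks a term by how far its per-class counts deviate from what would be expected if the term were independent of the class. The sketch below is a simplified, pure-Python stand-in for that computation on a toy corpus. It is not scikit-learn's implementation, it tokenizes by whitespace only, and the documents and labels are made up.

```python
from collections import Counter

def chi2_scores(docs, labels):
    """Chi-squared score per term, computed from raw term counts."""
    classes = sorted(set(labels))
    class_prob = {c: labels.count(c) / len(labels) for c in classes}
    observed = {c: Counter() for c in classes}   # term counts within each class
    totals = Counter()                           # term counts over all documents
    for doc, label in zip(docs, labels):
        counts = Counter(doc.lower().split())
        observed[label].update(counts)
        totals.update(counts)
    scores = {}
    for term, total in totals.items():
        score = 0.0
        for c in classes:
            expected = class_prob[c] * total     # count expected under independence
            score += (observed[c][term] - expected) ** 2 / expected
        scores[term] = score
    return scores

docs = ["rocket orbit launch", "orbit moon the", "pixel shader the", "pixel render image"]
labels = ["space", "space", "graphics", "graphics"]
scores = chi2_scores(docs, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2], scores["the"])
```

In this toy corpus "orbit" and "pixel" each appear twice in a single class and score highest, while "the" is split evenly between the classes and scores zero.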

In [32]:
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(newsgroups_cats.data)
chi2score = chi2(vectors, newsgroups_cats.target)[0]
wscores = zip(vectorizer.get_feature_names_out(),chi2score)
wchi2 = sorted(wscores,key=lambda x:x[1],reverse = True) 

In [36]:
chi2_words = [t[0] for t in wchi2]
top_chi2_words = chi2_words[:5000]

In [38]:
transfer_learning(glove_model, newsgroups_cats, 200, words=top_chi2_words)

Embedding Dimension:  200
Converted 4667 words (333 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc51c291250>

Ouch! This significantly reduced the validation accuracy
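
One possible contributor to this drop is an assumption drawn from reading `transfer_learning`, not something the runs above verify. When `words` is passed, the embedding matrix rows are indexed by position in that word list, while the vectorizer keeps emitting token ids from its own adapted vocabulary. The toy sketch below shows how the two index spaces can disagree. Both word lists are hypothetical.

```python
# Hypothetical token order learned by the vectorizer (ids it emits)
vectorizer_vocab = ["", "[UNK]", "the", "space", "image", "god"]
# Hypothetical replacement list, ordered by chi-squared score instead
chi2_words = ["god", "space", "image"]

token_id = {word: i for i, word in enumerate(vectorizer_vocab)}
matrix_row = {word: i for i, word in enumerate(chi2_words)}
row_owner = {i: word for word, i in matrix_row.items()}

# For every token the vectorizer can emit, see whose vector sits in that row.
for word in vectorizer_vocab:
    emitted = token_id[word]
    looked_up = row_owner.get(emitted, "<zero or out-of-range row>")
    print(f"{word!r}: id {emitted} -> row holds {looked_up!r}")

# Words whose emitted id does not point at their own vector
mismatched = [w for w in chi2_words if row_owner.get(token_id[w]) != w]
print(mismatched)
```

In this toy case every selected word is looked up in the wrong row, and even the padding token picks up a real word's vector. One way to keep the two aligned might be to give the word list to the vectorizer itself, for example through the `vocabulary` argument of `TextVectorization`, instead of swapping `voc` after `adapt`.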

In [40]:
transfer_learning(w2v_model, newsgroups_cats, 300, words=top_chi2_words)

Embedding Dimension:  300
Converted 4102 words (898 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc47a673c70>

The word2vec model did slightly better than GloVe but significantly underperformed the baseline.

## Different Categories 

These are the suggested categories which are very different from one another

Again, we will remove the headers, footers, and quotes so that the results can be compared with the previous iterations.

In [21]:
diff_cats = ['talk.religion.misc', 'comp.sys.ibm.pc.hardware', 'sci.space']
newsgroups_diff_cats = fetch_20newsgroups(subset='all',categories=diff_cats, remove=('headers', 'footers', 'quotes'))

In [22]:
transfer_learning(glove_model, newsgroups_diff_cats, 200)

Embedding Dimension:  200
Converted 15764 words (4236 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc581b66700>

WOW! These results are much better. The validation broke 90% accuracy very quickly and even the training accuracy is slightly better

In [23]:
transfer_learning(glove_model, newsgroups_diff_cats, 200, ENGLISH_STOP_WORDS)

Embedding Dimension:  200
Converted 15444 words (4556 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc5707a6b50>

Removing the stop words doesn't seem to make much of a difference here, although this run also did really well. Perhaps stop words matter less because these categories are so distinct that the model already has an easy time determining which words are most relevant.

In [24]:
transfer_learning(w2v_model, newsgroups_diff_cats, 300, ENGLISH_STOP_WORDS)

Embedding Dimension:  300
Converted 13823 words (6177 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc57016fbe0>

Interestingly, the word2vec model did slightly worse on validation accuracy than the GloVe model, so it didn't make much headway here either.

## Similar Data

In [25]:
same_cats = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware','comp.windows.x']
newsgroups_same_cats = fetch_20newsgroups(subset='all',categories=same_cats, remove=('headers', 'footers', 'quotes'))

In [26]:
transfer_learning(glove_model, newsgroups_same_cats, 200)

Embedding Dimension:  200
Converted 12007 words (7993 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc534701fd0>

Mirroring the result for the different categories, the model had a very hard time distinguishing between the similar categories, achieving a validation accuracy of only 65%.

In [27]:
transfer_learning(glove_model, newsgroups_same_cats, 200, ENGLISH_STOP_WORDS)

Embedding Dimension:  200
Converted 11770 words (8230 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc53454bfa0>

Removing the stopwords helped slightly

In [28]:
transfer_learning(w2v_model, newsgroups_same_cats, 200, ENGLISH_STOP_WORDS)

Embedding Dimension:  300
Converted 10149 words (9851 misses)
Running Model:
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.engine.functional.Functional at 0x7fc53451d670>

The word2vec model behaved quite similarly to the GloVe model.

## Conclusion

Overall, there was not much substantial change between the different iterations of transfer learning. The biggest differences came from how similar the categories in the dataset were to one another. The model also did better when trained on all 20 categories. But using the top chi2 words, switching pre-trained models, and removing stop words did not significantly improve the model.