<a href="https://colab.research.google.com/github/HurleyJames/GoogleColabExercise/blob/master/Using_Word2Vec_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Category Classification using Word2Vec embeddings

University of Leeds

COMP5623 Artificial Intelligence

---

We will use two Python libraries:

1. **sklearn** (scikit-learn), a machine learning library

2. **Gensim**, a library for unsupervised topic modeling and natural language processing, using modern statistical machine learning

---


# 1. Dataset preparation - 20 Newsgroups

We will use sklearn to download **20 Newsgroups** (http://qwone.com/~jason/20Newsgroups/), a publicly available dataset of approximately 20,000 newsgroup posts, partitioned across 20 different newsgroups. We will only load 3 categories for this example.

In [0]:
categories = [
      'comp.graphics',
      'sci.med',
      'rec.sport.baseball'
]

Load the subset of the 20 Newsgroups dataset.

In [0]:
from sklearn.datasets import fetch_20newsgroups

train_set = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
)

Look at some sample data.

In [0]:
train_set.target_names

In [0]:
print("Total number of news articles:", len(train_set.data))

In [0]:
print("\n".join(train_set.data[0].split("\n")[:10]))

In [0]:
print(train_set.target_names[train_set.target[0]])

# 2. Training a Word2Vec model

Now we will train a Word2Vec model, which we will use to map each word in each news article to a feature representation. While the best models are trained on very large amounts of data, limited computing resources mean we will train our own model on this small corpus.

First we pre-process the data into lists of words.

In [0]:
parsed_train_data = []
for article in train_set.data:
  parsed_train_data.append(article.replace('\n',' ').split())
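Splitting on whitespace keeps punctuation and capitalisation, so `disease`, `Disease` and `disease,` become three different vocabulary entries. As an optional alternative, here is a minimal sketch of a slightly stricter tokenizer. The helper name `simple_tokenize` and the regular expression are illustrative assumptions, not part of the original exercise.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Three surface forms of the same word collapse into one token
toy_tokens = simple_tokenize("Disease, disease. DISEASE!\nIt's 1993")
print(toy_tokens)
```

If you switch tokenizers, apply the same function to the training articles and to any new articles, so that both are mapped onto the same vocabulary.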

Train the model.

In [0]:
from gensim.models import Word2Vec

feature_length = 50

word2vec_model = Word2Vec(
    sentences=parsed_train_data,
    window=5,
    sg=1,       # Use skip-gram
    vector_size=feature_length  # Named `size` in gensim < 4.0
)
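To make the `window` and `sg=1` settings more concrete: skip-gram learns by predicting, for each word, the words that appear within `window` positions of it. The sketch below only lists those (center, context) pairs for a toy sentence. The helper `skipgram_pairs` is hypothetical, and the real training procedure includes further details (such as down-sampling frequent words) that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # A word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

example_pairs = skipgram_pairs(["the", "pitcher", "threw", "a", "fastball"], window=1)
print(example_pairs)
```

With `window=5`, as used above, each word is paired with up to five neighbours on each side.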

How good is it? Sanity check...

In [0]:
word2vec_model.wv.most_similar("disease", topn=5)

In [0]:
word2vec_model.wv.most_similar("baseball", topn=5)
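The similarity scores returned here are cosine similarities between word vectors. As a rough sketch of that ranking, using made-up two-dimensional vectors rather than the trained model (the helpers `cosine_similarity` and `most_similar_sketch` and the toy vectors are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(query, vectors, topn=5):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, for illustration only
toy_vectors = {
    "baseball": [1.0, 0.1],
    "pitcher": [0.9, 0.2],
    "tumour": [0.1, 1.0],
}
ranked = most_similar_sketch("baseball", toy_vectors, topn=2)
print(ranked)
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated words score closer to 0.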

What is the feature representation for one word?

In [0]:
word2vec_model.wv["baseball"]

# 3. Performing classification on news articles using feature embeddings

Finally, we will represent each news article as a block of word features, and perform classification on the embedded representations.

As an over-simplification of the problem (for the purposes of illustration), we will represent each article by its first N words found in the model vocabulary (N = 13 below), so that every article has a feature vector of the same fixed size.

In [0]:
article_size = 13 * feature_length
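Here `article_size` counts individual feature values, not words: 13 words of 50 features each gives 650 values. A minimal sketch of the same idea at toy scale, with 3 words of 2 features each (the helper `flatten_and_truncate` and the numbers are hypothetical):

```python
def flatten_and_truncate(word_vectors, cutoff):
    """Concatenate per-word vectors in order and keep the first `cutoff` values."""
    flat = [value for vector in word_vectors for value in vector]
    return flat[:cutoff]

toy_feature_length = 2
toy_article_size = 3 * toy_feature_length  # Keep the first 3 words

# Four made-up word vectors; the fourth is dropped by the cutoff
toy_word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_features = flatten_and_truncate(toy_word_vectors, toy_article_size)
print(toy_features)
```

Because the cutoff is a multiple of the feature length, the truncation always falls on a word boundary.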

In [0]:
import numpy as np

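def embed_article(article, cutoff):
  # Collect the feature representation of each in-vocabulary word, in order
  embedded_article = []
  for word in article:
    try:
      embedded_article.append(word2vec_model.wv[word])
    except KeyError:  # Ignore words not in the model vocabulary
      pass
  # Concatenate the word vectors and keep the first `cutoff` values
  flat = np.array(embedded_article).flatten()[:cutoff]
  # Zero-pad short articles so that every article has exactly `cutoff` features
  return np.pad(flat, (0, cutoff - len(flat)), mode='constant')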

In [0]:
embedded_train_set = []

for article in parsed_train_data:
  embedded_train_set.append(embed_article(article, article_size))

Now we can try training a simple linear classifier to classify the news articles into their categories. 

In [0]:
from sklearn.linear_model import SGDClassifier

In [0]:
linear_classifier = SGDClassifier()

linear_classifier.fit(
      embedded_train_set,
      train_set.target,
)
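Once fitted, a linear classifier of this kind predicts by computing one score per class (a weighted sum of the features plus a bias) and choosing the class with the highest score. A minimal sketch with made-up weights; `predict_linear` and all the numbers are hypothetical, not values learned by the model above:

```python
def predict_linear(features, class_weights, class_biases):
    """Return the index of the class with the highest linear score."""
    scores = [
        sum(w * x for w, x in zip(weights, features)) + bias
        for weights, bias in zip(class_weights, class_biases)
    ]
    return max(range(len(scores)), key=lambda k: scores[k])

# One weight vector and one bias per class (3 classes, 2 features)
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_biases = [0.0, 0.0, 0.5]

toy_prediction = predict_linear([0.2, 0.9], toy_weights, toy_biases)
print(toy_prediction)
```

In our case each weight vector would have `article_size` entries, one for every feature value of the embedded article.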

Can it predict on new articles?

In [0]:
new_articles = [
      # Article one - from latest NVidia news
      # (each fragment ends with a space so words are not fused when concatenated)
      "Nvidia RTX 2080 Ti Cyberpunk 2077 GPU is revealed – but you can’t buy one. " +
      "77 limited edition graphics cards are to be given away in a competition " +
      "The mysterious Cyberpunk 2077-themed GPU Nvidia recently teased has been revealed, " +
      "and the reality of the graphics card may be a touch disappointing for some folks, in that " +
      "it isn’t a new model – and you won’t be able to buy one. " +
      "The card is simply a GeForce RTX 2080 Ti (and appears to be exactly the same model, " +
      "and shroud design) decked out with the Cyberpunk 2077 colors and logo, which admittedly " +
      "looks pretty cool, but isn’t the GeForce RTX 2080 Ti Super " +
      "AMD confirms ‘Nvidia killer’ graphics card will be out in 2020",
      # Article two - from Health -> Oncology
      "Breast cancer test could predict chances of disease return 20 years later, study shows " +
      "'Molecular nature of a woman’s breast cancer determines how their disease could progress, " +
      "not just for the first five years, but also later,' says researcher " +
      "A new test could identify breast cancers that are likely to return more than 20 years later " +
      "development that might herald an era of personalised medicine. " +
      "The way a patient’s cancer will progress can be determined by categorising molecular and " +
      "genetic markers of breast tumours into 11 subtypes, University of Cambridge researchers found. " +
      "Following around 2,000 women over 20 years, the team funded by the Cancer Research charity found " +
      "some women with initially aggressive cancers had a low chance of tumours returning after five years."
]

# Parse and embed
parsed_na = [a.replace('\n',' ').split() for a in new_articles]
embedded_new_articles = [embed_article(a, article_size) for a in parsed_na]

In [0]:
predicted = linear_classifier.predict(embedded_new_articles)

In [0]:
for i, category in enumerate(predicted):
  print("New article", i, "predicted category =>", train_set.target_names[category])