<h1><center> Model Notebook </center></h1>

<h2> <center> 1. Data Preparation </center> </h2>
<h3> <center> 1.1. Data Cleaning </center></h3>
The dataset used can be found on Kaggle: <a href="https://www.kaggle.com/datasets/fabiochiusano/medium-articles"> Click here to view the dataset </a> <br/>
It contains 190k+ Medium articles, but for our training purposes, only the first <i> <b> N_ROWS = 1000 </b> </i> rows are used.

<b> Data Description </b> <br/>

Each row in the data is a different article published on Medium. For each article, you have the following features: <br/><ul>
    <li> <b> title </b> <i>[string]</i>: The title of the article. </li>
    <li> <b> text </b> <i>[string]</i>: The text content of the article. </li>
    <li> <b> url </b> <i>[string]</i>: The URL associated with the article. </li>
    <li> <b> authors </b> <i>[list of strings]</i>: The article authors. </li>
    <li> <b> timestamp </b> <i>[string]</i>: The publication datetime of the article. </li>
    <li> <b> tags </b> <i>[list of strings]</i>: List of tags associated with the article. </li>
</ul>

For our training purposes, only the <b> tags </b> column is relevant; the other columns do not contribute to embedding creation.

In [1]:
# Library Imports (standard library and third-party)
import pandas as pd                  # Data processing
import numpy as np                   # Math
from typing import List, Tuple, Dict # Type hinting
import ast                           # Literal evaluation

# Model creation - PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Prediction making
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
DATA_PATH = "..//..//..//data/medium_articles.csv"

N_ROWS = 1000

df = pd.read_csv(DATA_PATH, nrows=N_ROWS)
print(f"Data successfully loaded! Data shape: {df.shape}")

Data successfully loaded! Data shape: (1000, 6)


In [3]:
df.head() # Show first 5 entries, for example's sake

Unnamed: 0,title,text,url,authors,timestamp,tags
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology..."


In [4]:
df.to_csv("..//..//..//data/articles_tag_1k.csv", columns=["tags"], index=False) # Save dataset

<h3> <center> 1.2. Data Preparation </center></h3>

In [5]:
df["tags"] = df["tags"].apply(ast.literal_eval) # String that looks like list becomes a literal list
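To illustrate what this conversion does, here is a minimal sketch on a single hypothetical cell value. The string in raw is made up for illustration; it only assumes that each tags cell is stored in the CSV as text that looks like a Python list.

```python
import ast

# Hypothetical raw cell value, shaped like a `tags` entry as read from the CSV
raw = "['Mental Health', 'Health', 'Psychology']"

# literal_eval only evaluates Python literals (lists, strings, numbers, ...),
# so it is a safer choice than eval() for untrusted text
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0])
```

Before the conversion, indexing the value would return single characters of the string; afterwards it returns whole tags.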

In [6]:
def encode_tags() -> Tuple[List[str], Dict[str, int]]:
    """
    Get a list of unique tags and associate each tag with its index.

    Returns
    -------
    Tuple[List[str], Dict[str, int]]
        Tuple of two elements:
            (1) tags : List[str] 
                List of unique tags
            (2) tagToInd : Dict[str, int]
                Dictionary that maps each tag to its index in `tags`
    """
    df_exploded = df["tags"].explode() # Expand each list element into a separate row
    
    tags     = df_exploded.unique().tolist()                  # Extract unique values, and convert to list, for ease of use
    tagToInd = {tag: index for index, tag in enumerate(tags)} # Associate each tag with its index
    
    return (tags, tagToInd)


tags, tagToInd = encode_tags()
print(f"There are {len(tags)} unique tags!")

There are 660 unique tags!


In [7]:
def generate_dataset() -> np.ndarray:
    """
    Generate dataset to be used in training.
    Each entry in dataset is in format (article_id, tag_id, label):
        article_id : int
            ID of the article the tag is checked against.
        tag_id : int
            ID of the tag that is or is not present on the article.
        label : {1, 0}
            If `label` is 1, then the tag is present.
            If `label` is 0, then the tag is not present.

    Returns
    -------
    numpy.ndarray
        Shuffled array of shape (n_samples, 3), with rows in format (article_id, tag_id, label), as described.
    """
    samples = [] # Local name chosen so the built-in `input` is not shadowed inside the function

    for idx, row in df.iterrows():
        tags_in_row     = row["tags"]                            # List of tags in given row
        present_tag_ids = {tagToInd[tag] for tag in tags_in_row} # IDs of those tags, so that IDs are compared with IDs

        # True data
        for tag in tags_in_row:
            samples.append((idx, tagToInd[tag], 1)) # Each of these tags is present, therefore label it with 1

        # False data - generate 5 random tags that are not present in this row
        cnt = 0
        while cnt != 5:
            potential_not_present_tag = np.random.randint(0, len(tags)) # Random tag ID in [0, len(tags)); the upper bound is exclusive
            if potential_not_present_tag not in present_tag_ids:
                samples.append((idx, potential_not_present_tag, 0)) # This randomly generated tag is not present, therefore label it with 0
                cnt += 1

    samples = np.random.permutation(samples) # Shuffle the samples, so the model doesn't learn patterns from their order

    return samples

input = generate_dataset()
print(f"Input has been successfully generated! Input shape: {input.shape}")
print(f"Example of single input: {input[0]}")

Input has been successfully generated! Input shape: (9931, 3)
Example of single input: [258 267   0]
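The negative-sampling idea described in the docstring can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: the helper name sample_negative_tag_ids, the toy vocabulary and the toy article are hypothetical. It checks candidate IDs against the IDs of the tags present on the article, and draws only from the valid range [0, num_tags).

```python
import random
from typing import List, Set

def sample_negative_tag_ids(present_ids: Set[int], num_tags: int, k: int,
                            rng: random.Random) -> List[int]:
    """Draw up to k distinct tag IDs in [0, num_tags) that are not in present_ids."""
    candidates = [tag_id for tag_id in range(num_tags) if tag_id not in present_ids]
    return rng.sample(candidates, min(k, len(candidates)))

# Hypothetical toy vocabulary and article, for illustration only
toy_tag_to_ind = {"Health": 0, "Science": 1, "Brain": 2, "Startup": 3, "Writing": 4, "Python": 5}
toy_article_tags = ["Health", "Brain"]

rng = random.Random(0)  # Fixed seed so the sketch is reproducible
present = {toy_tag_to_ind[tag] for tag in toy_article_tags}  # Compare IDs with IDs
negatives = sample_negative_tag_ids(present, len(toy_tag_to_ind), k=3, rng=rng)

# Samples for article 0: positives labelled 1, negatives labelled 0
samples = [(0, tag_id, 1) for tag_id in sorted(present)]
samples += [(0, tag_id, 0) for tag_id in negatives]
print(samples)
```

One design difference worth noting: this sketch draws distinct negatives, whereas a rejection loop built on repeated random draws may pick the same absent tag more than once for an article.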


<h2> <center> 2. Model Creation </center> </h2>

<h3> <center> 2.1. Model Description </center> </h3>

We aim to create a model that maps articles into an N-dimensional vector space. The closer two articles are within that space, the more similar we consider them. The aim is therefore to create an <b> article embedding </b>. <br/>

Embeddings contain <b> weights </b> that are learned. We create a <b> logistic regression model </b>, i.e. a <b> binary classification model </b>, and fit it to our data. We do not split the data into training and test sets, because we are not interested in the model's predictions, only in the learned weights that best fit the given data. <b> Cosine similarity </b> between the learned article embeddings is then computed to find the most similar articles.
By computing the dot product between the <b> article representation </b> (<b> article_embedding </b>) and the <b> tag representation </b> (<b> tag_embedding </b>), we push similarly-tagged articles closer to each other within the N-dimensional vector space.
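To make the scoring rule concrete, the following dependency-free sketch computes the sigmoid of a dot product for hand-picked vectors. The function name predict_presence and all vector values are hypothetical and chosen only for illustration; they are not learned embeddings.

```python
import math
from typing import Sequence

def predict_presence(article_vec: Sequence[float], tag_vec: Sequence[float]) -> float:
    """Sigmoid of the dot product: estimated probability that the tag is on the article."""
    dot = sum(a * t for a, t in zip(article_vec, tag_vec))
    return 1.0 / (1.0 + math.exp(-dot))

# Hypothetical 3-dimensional embeddings, chosen by hand
article        = [1.0, 2.0, 0.5]
aligned_tag    = [1.0, 1.0, 0.0]    # dot product =  3.0
orthogonal_tag = [2.0, -1.0, 0.0]   # dot product =  0.0
opposed_tag    = [-1.0, -1.0, 0.0]  # dot product = -3.0

for name, tag in [("aligned", aligned_tag),
                  ("orthogonal", orthogonal_tag),
                  ("opposed", opposed_tag)]:
    print(f"{name:>10}: {predict_presence(article, tag):.4f}")
```

A tag vector pointing in the same direction as the article vector yields a probability near 1, an orthogonal one yields exactly 0.5, and an opposed one yields a probability near 0. Training with label 1 therefore pulls an article towards its tags, which is how articles sharing tags end up near each other.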

In [8]:
class ArticleEmbeddingModel(nn.Module):
    def __init__(self, articles_num, tags_num, embedding_dim):
        """
        Initialize a new instance of ArticleEmbeddingModel.

        Parameters
        ----------
        - articles_num : int
            Number of total / unique articles - each one getting its own embedding. 
        - tags_num : int 
            Number of total / unique tags - each one getting its own embedding.
        - embedding_dim : int
            Dimension of a singular embedding, i.e. number of dimensions used to represent a single article / tag.
        """
        super(ArticleEmbeddingModel, self).__init__()

        self.articles_embedding = nn.Embedding(articles_num, embedding_dim) # Embeddings of shape (articles_num x embedding_dim)
        self.tags_embedding     = nn.Embedding(tags_num,     embedding_dim) # Embeddings of shape (tags_num     x embedding_dim)

        self.sigmoid = nn.Sigmoid() # Sigmoid squashing function, used for logistic regression

    def forward(self, input):
        """
        Perform a forward pass, on a singular (article_id, tag_id) input.

        Parameters
        ----------
        input : Tuple[torch.Tensor, torch.Tensor]
            Input in format of (article_id, tag_id).

        Returns
        -------
        ...
        """
        article_id, tag_id = input # Unpack tuple

        try:
            article_embedding = self.articles_embedding(article_id) # Representation of given article in embedding_dim-ensional space
            tag_embedding     = self.tags_embedding(tag_id)         # Representation of given tag     in embedding_dim-ensional space
        except IndexError:
            print("INDEX OUT OF BOUNDS: ", article_id, tag_id)
            raise # Re-raise: continuing without the embeddings would only fail below with an unrelated NameError
            
        dot_product = torch.dot(article_embedding, tag_embedding) # Compute dot product

        prediction = self.sigmoid(dot_product) # Squash the value for final prediction
        return prediction

<h3> <center> 2.2. Training Model </center> </h3>

The model is ported to the GPU (CUDA) if possible; otherwise, it is trained on the CPU. Both options are relatively fast for the given dataset size. <br/>
Additionally, the <b> criterion function </b> used is <a href="https://en.wikipedia.org/wiki/Cross-entropy"> Binary Cross Entropy </a>, a standard choice for logistic regression tasks, while the <b> optimizer </b> is <a href="https://optimization.cbe.cornell.edu/index.php?title=Adam"> Adam </a>, an extension of <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent"> Stochastic Gradient Descent </a>.
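For a single prediction p and label y, Binary Cross Entropy is -(y * log(p) + (1 - y) * log(1 - p)). The sketch below evaluates this formula in plain Python. The function name binary_cross_entropy and the eps clamp are assumptions of this sketch; library implementations may guard against log(0) differently.

```python
import math

def binary_cross_entropy(prediction: float, label: float, eps: float = 1e-12) -> float:
    """BCE for one example: -(y * log(p) + (1 - y) * log(1 - p))."""
    p = min(max(prediction, eps), 1.0 - eps)  # Clamp to avoid log(0); an assumption of this sketch
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

confident_right = binary_cross_entropy(0.9, 1.0)  # Tag present, model says 0.9
confident_wrong = binary_cross_entropy(0.9, 0.0)  # Tag absent,  model says 0.9
undecided       = binary_cross_entropy(0.5, 1.0)  # Model says 0.5

print(f"confident and right: {confident_right:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
print(f"undecided:           {undecided:.4f}")
```

The undecided case equals ln 2, roughly 0.693, which is a useful reference point when reading per-epoch average losses: values below it indicate the model does better than always predicting 0.5.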

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu" # Port models and data to GPU, if possible, for faster processing

articles_num  = df.shape[0] + 10 # Number of rows loaded, plus 10 spare embedding rows as a safety margin against out-of-range IDs
tags_num      = len(tags) + 10   # Number of unique tags, plus 10 spare embedding rows as a safety margin against out-of-range IDs
embedding_dim = 3                # Map each article / tag in a `embedding_dim`-dimensional continuous vector space

model = ArticleEmbeddingModel(articles_num, tags_num, embedding_dim).to(device) # Create model with relevant data

criterion = nn.BCELoss()                            # Binary Cross-Entropy Loss is used with binary classification tasks, such as one in our case
optimizer = optim.Adam(model.parameters(), lr=0.01) # Adam optimizer with a learning rate of 0.01

The model trains for <b> epoch_num </b> epochs. Batch size is <b> 1 </b>, for simplicity, and the training loop reports the average loss for each epoch.
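With a batch size of 1, a single dot product per step is enough. Should larger batches ever be wanted, the forward pass needs one dot product per row, because torch.dot accepts only 1-D tensors. The NumPy sketch below shows that row-wise computation; the table sizes and random values are hypothetical, and NumPy stands in for PyTorch here only to keep the example small. The analogous PyTorch expression would be (a * t).sum(dim=1).

```python
import numpy as np

rng = np.random.default_rng(0)  # Fixed seed so the sketch is reproducible

# Hypothetical sizes and random embedding tables, for illustration only
num_articles, num_tags, dim = 4, 6, 3
article_table = rng.normal(size=(num_articles, dim))
tag_table     = rng.normal(size=(num_tags, dim))

# A batch of three (article_id, tag_id) pairs
article_ids = np.array([0, 1, 3])
tag_ids     = np.array([2, 5, 0])

a = article_table[article_ids]   # Looked-up article vectors, shape (batch, dim)
t = tag_table[tag_ids]           # Looked-up tag vectors,     shape (batch, dim)

logits = (a * t).sum(axis=1)                  # One dot product per row, shape (batch,)
predictions = 1.0 / (1.0 + np.exp(-logits))   # Element-wise sigmoid

print(predictions.shape)
```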

In [10]:
epoch_num = 10 # Self-explanatory: Number of epochs to run
for epoch in range(epoch_num):
    total_loss    = 0 # Total loss per epoch
    total_batches = len(input)
    
    for article_id, tag_id, label in input:
        # Turn every piece of data into a torch.Tensor
        article_id = torch.tensor(article_id).to(device)
        tag_id     = torch.tensor(tag_id).to(device)
        label      = torch.tensor(label, dtype=torch.float32).to(device)
        
        # Forward pass
        output = model((article_id, tag_id))

        # Calculate loss
        loss = criterion(output, label)

        # Accumulate total loss
        total_loss += loss.item()

        # Back-propagate
        optimizer.zero_grad() # Zero-out the gradient
        loss.backward()       # Back-propagate
        optimizer.step()      # Make a step

    # Calculate average loss for the epoch
    average_loss = total_loss / total_batches
    print(f"Epoch: #{epoch + 1: 5} Average loss: {average_loss}") 

Epoch: #    1 Average loss: 0.9168626622551921
Epoch: #    2 Average loss: 0.641786002538289
Epoch: #    3 Average loss: 0.4784869268747256
Epoch: #    4 Average loss: 0.399404675861504
Epoch: #    5 Average loss: 0.3524544360282816
Epoch: #    6 Average loss: 0.31811857758596607
Epoch: #    7 Average loss: 0.29173004053455004
Epoch: #    8 Average loss: 0.2708357645426466
Epoch: #    9 Average loss: 0.25388813678528604
Epoch: #   10 Average loss: 0.23916833732098916


<h3> <center> 2.3. Model Testing </center> </h3>

As previously described, we take <b> cosine similarity </b> as the measure of similarity between two articles. As a proof of concept, we pick a <b> TEST_ARTICLE </b> and find the top 5 most and least similar articles in our dataset.

In [11]:
# Extract the learned article embeddings as a NumPy array
article_embeddings = model.articles_embedding.weight.detach().cpu().numpy()

# Normalize articles, so cosine similarity makes sense
article_embeddings = article_embeddings / np.linalg.norm(article_embeddings, axis=1).reshape((-1, 1))

In [12]:
def calculate_cosine_similarity(vec1, vec2) -> np.float32:
    """
    Calculate cosine similarity between two embedding vectors.

    Parameters
    ----------
    vec1 : numpy.ndarray
        The first vector.
    vec2 : numpy.ndarray
        The second vector.
    
    Returns
    -------
    numpy.float32
        Cosine similarity between two given vectors.
    """
    return cosine_similarity([vec1], [vec2])[0, 0]

def find_top_similar_different_articles(target_embedding, embeddings, wanted_articles: int):
    """
    Find top-X most and least similar articles, compared to target article.

    Parameters
    ----------
    target_embedding : numpy.ndarray
        Target article every other article is compared to.
    embeddings : numpy.ndarray
        List of all article embeddings (retrieved from model).
    wanted_articles : int
        Number of wanted top-X articles. For example, if we wish to retrieve top-5 articles, `wanted_articles` would equal 5.

    Returns
    -------
    Tuple[List[Tuple[int, numpy.float32]], List[Tuple[int, numpy.float32]]]
        Returns a tuple of two lists: the top-X most similar and the top-X least similar articles.
        Each article is represented as a tuple of (article_id, similarity).
    """
    # Calculate cosine similarity for all articles
    similarities = [calculate_cosine_similarity(target_embedding, other_embedding)
                    for other_embedding in embeddings]

    # Find top `wanted_articles` most similar articles
    top_similar_articles   = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)[:wanted_articles]

    # Find top `wanted_articles` least similar articles
    top_different_articles = sorted(enumerate(similarities), key=lambda x: x[1])[:wanted_articles]

    return (top_similar_articles, top_different_articles)
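Because cosine similarity of unit-length vectors is just their dot product, the per-pair loop above can also be expressed as one matrix-vector product followed by a sort. The sketch below shows this on a hand-made toy matrix. The function name top_k_by_cosine and the toy values are hypothetical, and the sketch assumes no embedding row is the zero vector.

```python
import numpy as np

def top_k_by_cosine(target: np.ndarray, embeddings: np.ndarray, k: int):
    """Return (most_similar_ids, least_similar_ids, similarities) for the target vector."""
    # Scale every row to unit length; assumes no all-zero rows
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit @ (target / np.linalg.norm(target))  # Shape (n,)
    order = np.argsort(similarities)                         # Ascending by similarity
    return order[::-1][:k], order[:k], similarities

# Hypothetical 2-dimensional embeddings, chosen by hand
toy = np.array([[ 1.0, 0.0],
                [ 2.0, 0.1],   # Nearly the same direction as row 0
                [ 0.0, 1.0],   # Orthogonal to row 0
                [-1.0, 0.0]])  # Opposite to row 0

most, least, sims = top_k_by_cosine(toy[0], toy, k=2)
print("most similar: ", most.tolist())
print("least similar:", least.tolist())
```

As in the loop-based version, the target article itself appears first among the most similar results, with similarity 1.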

In [13]:
TEST_ARTICLE    = 300 # Random article used for testing
WANTED_ARTICLES = 5   # Number of top-X articles needed

target_embedding = article_embeddings[TEST_ARTICLE]  # Use the embedding at index TEST_ARTICLE as the target embedding
similar_articles, different_articles = find_top_similar_different_articles(target_embedding, article_embeddings, WANTED_ARTICLES)

print(type(similar_articles[0][1]))

<class 'numpy.float32'>


In [14]:
print(f"Article for reference: {'':25}", df.iloc[TEST_ARTICLE]["tags"])

print("\nTop 5 most similar articles:")
for article_idx, similarity in similar_articles:
    if article_idx < len(df): # Skip spare embedding rows that do not correspond to a real article
        article_tags = df.iloc[article_idx]['tags'] # Separate name, so the global `tags` list is not overwritten
        print(f"Article {article_idx}: {similarity:25} --> Tags: {article_tags}")

print("\nTop 5 most different articles:")
for article_idx, similarity in different_articles:
    if article_idx < len(df): # Skip spare embedding rows that do not correspond to a real article
        article_tags = df.iloc[article_idx]['tags']
        print(f"Article {article_idx}: {similarity:25} --> Tags: {article_tags}")

Article for reference:                           ['Mobile App Development', 'Mobile Apps', 'Development', 'Technology', 'Startup']

Top 5 most similar articles:
Article 300:                       1.0 --> Tags: ['Mobile App Development', 'Mobile Apps', 'Development', 'Technology', 'Startup']
Article 960:        0.9995712041854858 --> Tags: ['Life Lessons', 'Writing', 'Creativity', 'Short Story', 'Inspiration']
Article 657:        0.9992546439170837 --> Tags: ['Health', 'Mental Health', 'Covid 19', 'Society', 'Fitness']
Article 650:        0.9986250996589661 --> Tags: ['Programming', 'Kubernetes', 'Microservices', 'Raspberry Pi', 'Engineering']
Article 345:        0.9980693459510803 --> Tags: ['Happiness', 'Productivity', 'Psychology', 'Self', 'Motivation']

Top 5 most different articles:
Article 769:       -0.8886616230010986 --> Tags: ['Entrepreneurship', 'Marketing', 'Data Science', 'Data Visualization', 'Storytelling']
Article 688:       -0.7373985648155212 --> Tags: ['Matplotlib', '

In [15]:
torch.save(model.state_dict(), "article_embedding_model.pth") # Save model weights, in the end, so it can be used within the project