# Solution: Article Similarity Calculation
This notebook provides the complete solution for the assignment.

### 1. Read the CSV File
The input will be a CSV file named `articles.csv` with three columns: `id`, `title`, and `content`.

In [8]:
import csv
import numpy as np

def read_articles(file_path):
    articles = []
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(file_path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            articles.append({
                'id': int(row['id']),
                'title': row['title'],
                'content': row['content']
            })
    return articles

articles = read_articles('articles.csv')
print(f"Read {len(articles)} articles.")
print(articles)

Read 50 articles.
[{'id': 1, 'title': 'The Rise of AI', 'content': 'Artificial intelligence is transforming industries globally. Machine learning and deep learning are key components...'}, {'id': 2, 'title': 'Future of Robotics', 'content': 'Robotics is advancing rapidly, integrating with AI for autonomous systems. Human-robot interaction is a growing field...'}, {'id': 3, 'title': 'Data Engineering', 'content': 'Data Engineering leverages tools and computation to move massive amounts of data. Big data analytics is crucial...'}, {'id': 4, 'title': 'The Innovations of Cybersecurity', 'content': 'This article discusses the innovations of cybersecurity. This article discusses the innovations of cybersecurity. This article discusses the innovations of cybersecurity. It is an important field in modern technology with rapid advancements. As of 2022, there are 375 companies investing in this! Is it the future? Yes, absolutely.'}, {'id': 5, 'title': 'The Advanced of Robotics', 'content': 'This
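
As a small illustration of why `csv.DictReader` is used rather than splitting each line on commas, the sketch below parses an in-memory CSV whose `content` field itself contains commas. The sample row is made up for illustration and is not taken from `articles.csv`.

```python
import csv
import io

# Hypothetical one-row sample; the quoted content field contains commas.
sample_csv = (
    'id,title,content\n'
    '1,The Rise of AI,"Machine learning, deep learning, and more."\n'
)

reader = csv.DictReader(io.StringIO(sample_csv))
rows = [
    {'id': int(row['id']), 'title': row['title'], 'content': row['content']}
    for row in reader
]

print(rows[0]['content'])
```

The quoted field comes back as a single string, commas included, and `id` is converted to an integer just as in `read_articles`.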

### 2. Clean Article Content
In this step, I cleaned the article content by converting it to lowercase, removing punctuation and numerical digits, and splitting the result into individual word tokens.

In [19]:
import re

def clean_and_tokenize(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (any character that is not a word character or whitespace) and underscores
    text = re.sub(r'[^\w\s]|_', '', text)
    # Remove numerical digits
    text = re.sub(r'\d+', '', text)
    # Tokenize the cleaned content into individual words
    tokens = text.split()
    return tokens

for article in articles:
    article['tokens'] = clean_and_tokenize(article['content'])
    
print(f"Tokens for the first article: {articles[0]['tokens']}")

Tokens for the first article: ['artificial', 'intelligence', 'is', 'transforming', 'industries', 'globally', 'machine', 'learning', 'and', 'deep', 'learning', 'are', 'key', 'components']
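
To see how the cleaning step treats digits and punctuation, here is a minimal self-contained sketch that applies the same three steps to one sentence taken from the content of article 4. The function is repeated so the cell runs on its own; `sample` and `sample_tokens` are names introduced only for this illustration.

```python
import re

def clean_and_tokenize(text):
    # Same steps as the cell above: lowercase, strip punctuation, strip digits, split
    text = text.lower()
    text = re.sub(r'[^\w\s]|_', '', text)
    text = re.sub(r'\d+', '', text)
    return text.split()

sample = "As of 2022, there are 375 companies investing in this!"
sample_tokens = clean_and_tokenize(sample)
print(sample_tokens)
# ['as', 'of', 'there', 'are', 'companies', 'investing', 'in', 'this']
```

One side effect worth knowing: because punctuation is replaced with an empty string rather than a space, a hyphenated word such as "Human-robot" becomes the single token `humanrobot`.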


### 3. Build Global Bag-of-Words (BoW) Vocabulary

In [23]:
vocabulary = set()
for article in articles:
    # set.update() adds every token from this article to the vocabulary (an in-place union)
    vocabulary.update(article['tokens'])

# Sort the vocabulary to ensure consistent ordering
bow_list = sorted(list(vocabulary))
bow_size = len(bow_list)

# Create a mapping from word to index (used when building the vectors in the next step)
bow_index = {word: i for i, word in enumerate(bow_list)}
print(f"Global vocabulary size: {bow_size} unique words")

Global vocabulary size: 75 unique words


### 4. Build Vector Representation

In [28]:
for article in articles:
    vector = np.zeros(bow_size, dtype=int)
    for token in article['tokens']:
        if token in bow_index:
            vector[bow_index[token]] = 1
    article['vector'] = vector

print(f"Shape of first vector: {articles[0]['vector'].shape}")
print(f"First vector: {articles[0]['vector']}")

Shape of first vector: (75,)
First vector: [0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0]
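
The vocabulary and vector steps can be hard to picture at 75 dimensions, so here is a minimal sketch on two made-up token lists. The names `toy_docs`, `toy_vocab`, `toy_index`, and `to_binary_vector` are hypothetical and exist only for this example.

```python
import numpy as np

# Two tiny, made-up token lists standing in for article['tokens']
toy_docs = [
    ['ai', 'is', 'transforming', 'ai'],
    ['robotics', 'is', 'advancing'],
]

# Sorted global vocabulary and word -> index mapping
toy_vocab = sorted(set(token for doc in toy_docs for token in doc))
toy_index = {word: i for i, word in enumerate(toy_vocab)}

def to_binary_vector(tokens, index):
    # 1 if the word occurs in the document at all, 0 otherwise
    vector = np.zeros(len(index), dtype=int)
    for token in tokens:
        if token in index:
            vector[index[token]] = 1
    return vector

toy_vectors = [to_binary_vector(doc, toy_index) for doc in toy_docs]
print(toy_vocab)       # ['advancing', 'ai', 'is', 'robotics', 'transforming']
print(toy_vectors[0])  # [0 1 1 0 1]
print(toy_vectors[1])  # [1 0 1 1 0]
```

Note that "ai" appears twice in the first list but still contributes a single 1: the vectors record presence, not counts.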


### 5. Calculate Cosine Similarity
Here we calculate the cosine similarity between every pair of articles and store the results in a similarity matrix, which is later used to find the most similar articles.
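
For reference, the standard definition of cosine similarity between two vectors $a$ and $b$ is:

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

For binary vectors like the ones built in step 4, the dot product counts the words two articles share and each squared norm counts an article's distinct words. Writing $A$ and $B$ for the two sets of distinct words (notation introduced here for illustration), the formula reduces to:

$$
\cos(a, b) = \frac{|A \cap B|}{\sqrt{|A| \, |B|}}
$$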

In [38]:
num_articles = len(articles)
similarity_matrix = np.zeros((num_articles, num_articles))

for i in range(num_articles):
    for j in range(num_articles):
        vec_a = articles[i]['vector']
        vec_b = articles[j]['vector']
        
        dot_product = np.dot(vec_a, vec_b)
        norm_a = np.linalg.norm(vec_a)
        norm_b = np.linalg.norm(vec_b)
        
        if norm_a == 0 or norm_b == 0:
            similarity_matrix[i][j] = 0.0
        else:
            similarity_matrix[i][j] = dot_product / (norm_a * norm_b)

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"Similarity matrix:\n{similarity_matrix}")

Similarity matrix shape: (50, 50)
Similarity matrix:
[[1.         0.07161149 0.14322297 ... 0.10878566 0.21350421 0.1067521 ]
 [0.07161149 1.         0.06666667 ... 0.15191091 0.1490712  0.1490712 ]
 [0.14322297 0.06666667 1.         ... 0.10127394 0.0993808  0.1987616 ]
 ...
 [0.10878566 0.15191091 0.10127394 ... 1.         0.90582163 0.90582163]
 [0.21350421 0.1490712  0.0993808  ... 0.90582163 1.         0.88888889]
 [0.1067521  0.1490712  0.1987616  ... 0.90582163 0.88888889 1.        ]]
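
The nested loop above computes each pair separately. As an alternative, the whole matrix can be obtained with one matrix product after normalizing each row. This is a sketch of that approach, not the method used in the cell above; `cosine_similarity_matrix` and `demo_vectors` are hypothetical names, and zero vectors are given similarity 0 to match the loop's behavior.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    # Avoid division by zero: an all-zero row stays all zero after "normalizing"
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = X / safe_norms[:, None]
    # Dot products of unit vectors are cosine similarities
    return unit @ unit.T

demo_vectors = [
    [0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],  # empty document
]
demo_matrix = cosine_similarity_matrix(demo_vectors)
print(np.round(demo_matrix, 4))
```

The first two rows share one word out of three each, so their similarity is 1/3, and every entry involving the empty document is 0.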


### 6. Output Similarity Matrix to PKL
In Python, `pickle` is a built-in module for serializing and deserializing Python objects.
- __Note__: saving the similarity matrix with `pickle` lets us load it later without recomputing it.

In [40]:
import pickle

with open('similarities.pkl', 'wb') as f:
    pickle.dump(similarity_matrix, f)

print("Saved similarity matrix to similarities.pkl")

Saved similarity matrix to similarities.pkl
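
To confirm that a saved matrix can be read back unchanged, the sketch below does a round trip through a temporary directory so it does not touch the real `similarities.pkl`. The small `matrix` here is a made-up stand-in for the real similarity matrix.

```python
import os
import pickle
import tempfile

import numpy as np

# Made-up 2x2 matrix standing in for similarity_matrix
matrix = np.array([[1.0, 0.5], [0.5, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'similarities.pkl')
    with open(path, 'wb') as f:
        pickle.dump(matrix, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

round_trip_ok = np.array_equal(matrix, restored)
print(round_trip_ok)
```

Keep in mind that `pickle.load` can execute arbitrary code from the file, so it should only be used on files you created or trust.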


### 7. Find Most Similar Articles

In [54]:
def get_top_3_similar_articles(article_id):
    # Find the index of the target article
    target_index = -1
    for i, article in enumerate(articles):
        if article['id'] == article_id:
            target_index = i
            break
            
    if target_index == -1:
        print("Article ID not found.")
        return []
    
    # Get similarities for this article
    similarities = similarity_matrix[target_index]
    
    # Pair each article with its similarity score, excluding the target article itself
    article_similarities = []
    for i, article in enumerate(articles):
        if i != target_index:
            article_similarities.append({
                'id': article['id'],
                'title': article['title'], 
                'score': similarities[i],
            })
            
    # Sort by similarity in descending order
    article_similarities.sort(key=lambda x: x['score'], reverse=True)
    
    # Format the top 3 as display strings (title, id, and score)
    top_3 = [f"{item['title']} - (id:{item['id']} - score:{item['score']:.2f})" for item in article_similarities[:3]]
    return top_3

# Example usage: pick an article by its position in the list, then look it up by its ID
article_index = 21
test_article_id = articles[article_index]['id']
print(f"Target Article: '{articles[article_index]['title']}'")
print("- Top 3 Most Similar:")
top_3_titles = get_top_3_similar_articles(test_article_id)
for rank, title in enumerate(top_3_titles, 1):
    print(f"  {rank}. {title}")

Target Article: 'The Innovations of IoT'
- Top 3 Most Similar:
  1. The Future of IoT - (id:14 - score:0.98)
  2. The Innovations of Cybersecurity - (id:4 - score:0.96)
  3. The Innovations of 5G - (id:13 - score:0.96)
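
The ranking step can also be done on the matrix row directly with `np.argsort`. The following is a sketch of that alternative, assuming rows are addressed by position rather than by article ID; `top_k_similar` and `demo` are hypothetical names and the scores are made up.

```python
import numpy as np

def top_k_similar(similarity_matrix, target_index, k=3):
    scores = np.asarray(similarity_matrix[target_index], dtype=float).copy()
    # Exclude the target itself by pushing its score to the bottom
    scores[target_index] = -np.inf
    # Never return more results than there are other articles
    k = min(k, len(scores) - 1)
    order = np.argsort(-scores, kind='stable')[:k]
    return [(int(i), float(scores[i])) for i in order]

# Made-up symmetric similarity matrix for four articles
demo = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.8],
    [0.4, 0.3, 0.8, 1.0],
])
print(top_k_similar(demo, 0, k=2))  # [(2, 0.9), (3, 0.4)]
```

The function returns `(index, score)` pairs; mapping each index back to `articles[index]['title']` would give the same kind of output as `get_top_3_similar_articles`.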
