#**Mini Project 3: Attention Mechanism and Transformers**

##**Problem Statement**

##Business Context

In today's fast-paced media industry, swiftly categorizing and curating content is crucial. With an overwhelming flow of news across diverse topics, efficient systems are needed to deliver the right content to the right audience and maintain engagement.

Key Challenges:

Information Overload: The vast number of articles makes manual categorization impractical.

Timeliness: Delays in classification can lead to outdated or misplaced content.

##Problem Definition

E-news Express, a news aggregation startup, struggles with efficiently categorizing diverse articles across sports, entertainment, politics, and more. Manual classification is time-consuming, error-prone, and risks delays or reputational damage. To address this, the startup aims to adopt machine learning to automate categorization.

As the data scientist on this project, your task is to build an unsupervised model that groups articles by content, and then to validate those groups against human-assigned labels, so that news delivery becomes faster, more accurate, and better personalized.

##Installing and importing necessary libraries

In [None]:
!pip install -U sentence-transformers -q

In [None]:
# to read and manipulate the data
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', None)    # show the full text of each column instead of truncating it

# to visualise data
import matplotlib.pyplot as plt
import seaborn as sns

# to compute distances
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score

# importing the PyTorch Deep Learning library
import torch

# to import the model
from sentence_transformers import SentenceTransformer

# to cluster the data
from sklearn.cluster import KMeans

# to compute metrics
from sklearn.metrics import classification_report

# to avoid displaying unnecessary warnings
import warnings
warnings.filterwarnings("ignore")


##Load Dataset

In [None]:
reviews = pd.read_csv("/content/news_articles.csv")


##Data Overview

In [None]:
reviews.head()

In [None]:
reviews.tail()

In [None]:
reviews.shape

In [None]:
df = reviews.copy()

In [None]:
df = df.drop_duplicates()

# resetting the dataframe index
df.reset_index(drop=True, inplace=True)

df.duplicated().sum()


##Model Building

In [None]:
# loading the pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# encoding the dataset (passing a plain list of strings rather than a pandas Series)
embedding_matrix = model.encode(df['Text'].tolist(), show_progress_bar=True, device=device)

embedding_matrix.shape

In [None]:
embedding_matrix

##Semantic Search

In [None]:
# defining a function to compute the cosine similarity between the embeddings of two texts
def cosine_score(text1,text2):
    # encoding the text
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)

    # calculating the L2 norm of the embedding vector
    norm1 = np.linalg.norm(embeddings1)
    norm2 = np.linalg.norm(embeddings2)

    # computing the cosine similarity
    cosine_similarity_score = ((np.dot(embeddings1,embeddings2))/(norm1*norm2))

    return cosine_similarity_score


In [None]:
a= "i love apple"
b= "apple is a fruit"
c= "i like this table"
print(cosine_score(a,b))
print(cosine_score(b,c))
print(cosine_score(a,c))
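
The formula inside `cosine_score` can be checked without loading the model. The following is a minimal sketch on hand-picked toy vectors (hypothetical values, not real embeddings): parallel vectors should score close to 1, perpendicular vectors close to 0, and opposite vectors close to -1.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy vectors standing in for sentence embeddings (hypothetical values)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector (as in the first pair) leaves the similarity unchanged.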


In [None]:
# We can also use prebuilt method to calculate similarity score

a= "i love apple"
b= "apple is a fruit"
c= "i like this table"

from sentence_transformers import util

embeddings1 = model.encode(a)
embeddings2 = model.encode(b)
embeddings3 = model.encode(c)

print(util.cos_sim(embeddings1, embeddings2))
print(util.cos_sim(embeddings2, embeddings3))
print(util.cos_sim(embeddings1, embeddings3))


In [None]:
# defining a function to find the top k similar sentences for a given query
def top_k_similar_sentences(embedding_matrix, query_text, k):
    # encoding the query text
    query_embedding = model.encode(query_text)

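    # a plain dot product equals cosine similarity only for unit-length vectors,
    # so the article embeddings and the query are normalized before scoring
    unit_matrix = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    unit_query = query_embedding / np.linalg.norm(query_embedding)
    score_vector = np.dot(unit_matrix, unit_query)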

    # sorting the scores in descending order and choosing the first k
    top_k_indices = np.argsort(score_vector)[::-1][:k]

    # returning the corresponding articles
    return df.loc[list(top_k_indices), 'Text']
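
Ranking by a raw dot product and ranking by cosine similarity agree only when every vector has unit length. The sketch below uses a small toy matrix (hypothetical values) to show how the two rankings can differ otherwise; `top_k_by_cosine` is an illustrative helper, not part of the notebook.

```python
import numpy as np

# toy "embedding matrix": row 0 is long but points away from the query,
# row 1 is short but points exactly along the query (hypothetical values)
toy_matrix = np.array([[3.0, 4.0],
                       [0.1, 0.0],
                       [0.0, 1.0]])
toy_query = np.array([1.0, 0.0])

def top_k_by_cosine(matrix, query, k):
    # scale every row and the query to unit length, then rank by dot product
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    return np.argsort(scores)[::-1][:k]

# ranking by raw dot product favours the long vector
raw_ranking = np.argsort(toy_matrix @ toy_query)[::-1][:2]
# ranking by cosine similarity favours the vector with the closest direction
cosine_ranking = top_k_by_cosine(toy_matrix, toy_query, 2)

print(raw_ranking, cosine_ranking)
```

If the model already returns unit-length embeddings, the two rankings coincide. `model.encode` also accepts `normalize_embeddings=True`, which is one way to guarantee that.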


In [None]:
# defining the query text
query_text = "Budget for elections"

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix, query_text, 5)

for i in top_k_reviews:
    print(i, end="\n")
    print("*******************************************************************")
    print("\n")


In [None]:
# defining the query text
query_text = "High imports and exports"

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix, query_text, 5)

for i in top_k_reviews:
    print(i, end="\n")
    print("*******************************************************************")
    print("\n")


##**Categorization**

**- Using k-means Clustering to categorize the data**

In [None]:
meanDistortions = []
cluster_list = range(2, 11)

for k in cluster_list:
    clusterer = KMeans(n_clusters=k, random_state=1)
    clusterer.fit(embedding_matrix)

    # average squared Euclidean distance from each article to its nearest centroid
    distortion = np.mean(
        np.min(cdist(embedding_matrix, clusterer.cluster_centers_, "euclidean"), axis=1) ** 2
    )
    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(cluster_list, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
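
The quantity on the elbow plot can be reproduced by hand. The sketch below, on toy 2-D points with hand-picked centers (all hypothetical), computes the sum of squared distances to the nearest center, which is the same quantity scikit-learn's `KMeans` stores in its `inertia_` attribute, and the per-point average.

```python
import numpy as np

def sum_squared_distortion(points, centers):
    # distance from every point to every center, shape (n_points, n_centers)
    diffs = points[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # squared distance from each point to its nearest center
    nearest_sq = np.min(dists, axis=1) ** 2
    return float(nearest_sq.sum())

# two well-separated pairs of points (toy data)
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_center = np.array([[5.0, 1.0]])                 # k = 1
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])   # k = 2

total_k1 = sum_squared_distortion(toy_points, one_center)
total_k2 = sum_squared_distortion(toy_points, two_centers)
mean_k2 = total_k2 / len(toy_points)

print(total_k1, total_k2, mean_k2)
```

The sharp drop from one center to two is the kind of "elbow" the plot is meant to reveal. Dividing by the number of points changes the scale of the curve but not its shape.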

**Silhouette Scores**

In [None]:
sil_score = []
cluster_list = range(2, 11)

for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)

    preds = clusterer.fit_predict((embedding_matrix))

    score = silhouette_score(embedding_matrix, preds)
    sil_score.append(score)

    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))


In [None]:
plt.plot(cluster_list, sil_score, "bx-")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.show()
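
To see what `silhouette_score` measures, here is a minimal hand-rolled sketch on toy 1-D data (hypothetical values). For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The helper `mean_silhouette` is illustrative only and assumes at least two clusters.

```python
import numpy as np

def mean_silhouette(points, labels):
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # a singleton cluster scores 0 by convention
            continue
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

toy_points = [[0.0], [1.0], [10.0], [11.0]]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])  # neighbours grouped together
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])   # neighbours split apart

print(good_score, bad_score)
```

A score near 1 means points sit much closer to their own cluster than to any other, while a negative score suggests many points would fit better elsewhere.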


In [None]:
# defining the number of clusters/categories
n_categories = 5

# fitting the model
Kmeans=KMeans(n_clusters=n_categories,random_state=1).fit(embedding_matrix)


In [None]:
# creating a copy of the data
clustered_data = df.copy()

# assigning the cluster/category labels
clustered_data['Category'] = Kmeans.labels_

clustered_data.head()


In [None]:
# for each cluster, printing 5 random news articles
for i in range(n_categories):
    print("CLUSTER",i)
    print(clustered_data.loc[clustered_data.Category == i, 'Text'].sample(5, random_state=1).values)
    print("*****************************************************************")
    print("\n")


In [None]:
# dictionary of cluster label to category, chosen by reading the sampled articles printed above
category_dict = {
    0: 'Sports',
    1: 'Politics',
    2: 'Entertainment',
    3: 'Business',
    4: 'Technology'
}
# mapping cluster labels to categories
clustered_data['Category'] = clustered_data['Category'].map(category_dict)

clustered_data.head()


##Comparing with actual data

In [None]:
# loading the actual labels
labels = pd.read_csv("/content/news_article_labels.csv")
# checking the unique labels
labels['Label'].unique()


In [None]:
# adding the actual categories to our dataframe
clustered_data['Actual Category'] = labels['Label'].values

In [None]:
print(classification_report(clustered_data['Actual Category'], clustered_data['Category']))
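
The report above is only meaningful if the hand-written `category_dict` matches the way k-means happened to number its clusters. One way to cross-check that mapping is a majority vote: give each cluster the most common true label among its members. The sketch below uses toy cluster ids and labels (hypothetical values), and `majority_vote_mapping` is an illustrative helper.

```python
from collections import Counter, defaultdict

def majority_vote_mapping(cluster_ids, true_labels):
    # count how often each true label appears inside each cluster
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    # keep the most frequent label per cluster
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# toy cluster assignments and human labels (hypothetical values)
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_labels = ["Sports", "Sports", "Politics", "Business", "Business",
              "Technology", "Technology", "Sports"]

mapping = majority_vote_mapping(toy_clusters, toy_labels)
predicted = [mapping[c] for c in toy_clusters]
accuracy = sum(p == t for p, t in zip(predicted, toy_labels)) / len(toy_labels)

print(mapping, accuracy)
```

Two caveats apply. A majority vote can assign the same label to two clusters, and a mapping derived from the labels makes the resulting accuracy somewhat optimistic. If a strict one-to-one assignment is needed, `scipy.optimize.linear_sum_assignment` on the cluster-by-label count matrix is a common alternative.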

In [None]:
correct_cat_data = clustered_data[clustered_data['Actual Category'] == clustered_data['Category']]
print("Examples of Correctly Categorized Articles:")
for i in range(5):
    print("\nActual Category:", correct_cat_data.iloc[i]['Actual Category'])
    print("Predicted Category:", correct_cat_data.iloc[i]['Category'])
    print("Text:", correct_cat_data.iloc[i]['Text'][:200] + "...") # Print first 200 characters
    print("*******************************************************************")

In [None]:
category_counts = clustered_data['Actual Category'].value_counts()
print(category_counts)

##**Incorrect Prediction**

In [None]:
incorrect_cat_data = clustered_data[clustered_data['Actual Category'] != clustered_data['Category']]
incorrect_cat_data.shape

In [None]:
incorrect_cat_data.head()

In [None]:
embedding_matrix[24].reshape(1,-1)

In [None]:
Kmeans.cluster_centers_[[2]]
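
Printing one embedding next to one centroid suggests a next step: measure how far the article sits from every centroid to see why it landed where it did. The sketch below does this on toy 2-D vectors (hypothetical values, not the real embeddings); with the fitted model, `Kmeans.transform(embedding_matrix[24].reshape(1, -1))` should return the same kind of distance row.

```python
import numpy as np

# toy cluster centers and a single toy article embedding (hypothetical values)
toy_centers = np.array([[0.0, 0.0],
                        [4.0, 0.0],
                        [0.0, 5.0]])
toy_embedding = np.array([3.0, 1.0]).reshape(1, -1)

# Euclidean distance from the article to each center
distances = np.linalg.norm(toy_centers - toy_embedding, axis=1)

# k-means assigns the article to the nearest center
assigned_cluster = int(np.argmin(distances))

# clusters ordered from nearest to farthest; a small gap between the first
# two distances suggests a borderline article
ranked_clusters = np.argsort(distances)
margin = float(distances[ranked_clusters[1]] - distances[ranked_clusters[0]])

print(distances, assigned_cluster, margin)
```

Misclassified articles with a small margin are likely to sit between two topics, which is worth checking before concluding that the clustering itself is at fault.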