# Website Classification Notebook
This notebook takes data from "The UK Web Archive & Partners" and tries to apply Machine Learning methods to classify websites given their title.

You will learn how to use the pandas and Matplotlib Python libraries for data exploration and visualization, and you will use Sentence Transformers, which are built on pre-trained language models, to see how well they work for website classification.

In [None]:
#import libraries and install one we'll use later on
from tqdm import tqdm
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

!pip install sentence_transformers


In [None]:
#use this bash command to download the dataset
!curl -O https://data.webarchive.org.uk/opendata/ukwa.ds.1/classification/classification.tsv

In [None]:
#set the file path and open with Pandas
website_data_path = './classification.tsv'
website_df = pd.read_csv(website_data_path, sep='\t',header=0, on_bad_lines='skip')


In [None]:
#use ".head()" to view the first 5 rows of the table (pass a number, e.g. .head(10), to see more)
website_df.head()

In [None]:
# Write code here that collects just the "Primary Category" column into a pandas Series
# Hint: website_df['Column_Name'] returns a Series holding all the values in that column

all_categories = website_df['Primary Category']

In [None]:
# Answer the following questions:

#1: How many websites are in this dataset? Hint: use len() to find how many category labels you collected

num_websites = len(all_categories)

print('There are this many websites:',num_websites)

#2: How many unique Primary Categories are there? Hint: Series.unique() returns an array with the duplicates removed
unique_categories = all_categories.unique()
num_unique_primary = len(unique_categories)
print('There are this many Primary Categories:',num_unique_primary)

#3: How many unique Secondary Categories are there?
num_unique_secondary = len(website_df['Secondary Category'].unique())
print('There are this many Secondary Categories:',num_unique_secondary)



In [None]:
#Q4: What are the top three most represented categories found in the dataset?

# The following creates a bar chart of how often each label occurs; use this plot to answer the question.
all_categories.value_counts(sort=False).plot(kind='bar')

# Sentence Embeddings
Imagine you have a really long book, and you want to understand the main ideas without reading every single word. Sentence embedding is like creating a shortcut for that.

In more technical terms, it’s a way to convert a sentence into a list of numbers (a vector) that captures its meaning. Each sentence gets turned into a fixed-length representation that reflects its context and the relationships between words. This makes it easier for computers to compare sentences, find similar ones, or understand their meaning.

Think of it like giving every sentence a unique fingerprint that highlights its essence. This helps in tasks like finding relevant information, translating languages, or even summarizing text!
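
As a rough illustration of what "comparing sentences" means once they are vectors, the sketch below uses three hand-made 4-dimensional vectors and compares them with cosine similarity. The sentences and numbers are invented for this example and do not come from any model; real sentence embeddings behave the same way but have hundreds of dimensions.

```python
import numpy as np

# Toy "embeddings": invented values, NOT the output of a real model.
toy_embeddings = {
    "Local football club news": np.array([0.9, 0.1, 0.0, 0.2]),
    "Rugby league fixtures": np.array([0.8, 0.2, 0.1, 0.3]),
    "Medieval poetry archive": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

football = toy_embeddings["Local football club news"]
rugby = toy_embeddings["Rugby league fixtures"]
poetry = toy_embeddings["Medieval poetry archive"]

# Sentences about similar topics should score higher than unrelated ones.
sim_sport = cosine_similarity(football, rugby)
sim_mixed = cosine_similarity(football, poetry)
print(f"football vs rugby:  {sim_sport:.3f}")
print(f"football vs poetry: {sim_mixed:.3f}")
```

The two sports sentences point in nearly the same direction, so their similarity is much higher than that of the football and poetry pair.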

In [None]:
from sentence_transformers import SentenceTransformer

#grab a pre-trained model that will embed our sentences
#use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device=device)

# Sentences are encoded by calling model.encode(). Here we embed the first element of the unique_categories array, e.g. "Arts and Humanities"
embedding = model.encode(unique_categories[0])

In [None]:
# Question 5: How many features are in each embedding? Hint: find the shape of the embedding array with its .shape attribute
# embedding.shape

We want to compare the Category Embeddings with our Title Embeddings to see if we can automatically classify a website given its title.

To test out this theory, we will do the following:
- embed each category into its vector representation
- for each Website Title, we will embed it and then select the Category Embedding it is closest to.

There are many ways to measure "closeness"; we will try the standard Euclidean distance.
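
The two steps above can be sketched on toy data with NumPy. The Euclidean distance between two vectors is the square root of the sum of squared differences between their components. All names and values below are invented for illustration, and the vectors have only 3 dimensions so the arithmetic is easy to follow.

```python
import numpy as np

# Toy category embeddings, one row per category (invented values).
toy_category_names = ["Sports", "Literature", "Science"]
toy_category_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.2],
    [0.1, 0.2, 0.9],
])

# Toy embedding of a single website title (also invented).
toy_title_embedding = np.array([0.8, 0.2, 0.1])

# Broadcasting subtracts the title vector from every row;
# the norm over axis=1 then gives one distance per category.
distances = np.linalg.norm(toy_category_embeddings - toy_title_embedding, axis=1)

# The predicted category is the row with the smallest distance.
closest = int(np.argmin(distances))
print("distances:", distances)
print("predicted category:", toy_category_names[closest])
```

Cosine similarity is a common alternative to Euclidean distance for sentence embeddings; which one classifies better on this dataset is something to check empirically.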

In [None]:
# here is a function that finds which row of B an array A is closest to

def closest_row_euclidean(A, B, device=None):
    # Use the GPU when one is available unless a device is given explicitly
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Convert A and B to PyTorch tensors and move to the specified device
    A = torch.tensor(A, device=device).reshape(1, -1)  # Reshape A to be 2D (1 x embedding size)
    B = torch.tensor(B, device=device)  # Convert B to a tensor

    # Calculate the Euclidean distance from A to every row of B
    distances = torch.norm(B - A, dim=1)

    # Find the index of the closest row
    closest_index = torch.argmin(distances)

    # the closest row is B[closest_index], but we only care about the index,
    # returned as a plain Python int so it is easy to collect in a list
    return closest_index.item()


In [None]:
#Encoding all of our Categories (Labels)

#create empty numpy array to hold all embeddings
category_embeddings = np.zeros((num_unique_primary,384))

for i, category in enumerate(unique_categories):
  embedding = model.encode(category)
  category_embeddings[i,:] = embedding


In [None]:
# Write code here to create a dictionary mapping such that:
# Arts and Humanities = 0
# Business Economy and Industry = 1
# Company Websites = 2
# ... etc

#HINT: Use dict() and zip() together as dict(zip(array1, array2)), where array2 = np.arange(len(unique_categories))
category_mapping = dict(zip(unique_categories,np.arange(len(unique_categories))))
category_mapping

In [None]:
# Go through each Website Title, get its sentence embedding, and compare it to the
# embeddings of the categories. Label the website with the category it's "closest" to

predicted_labels = []
truth_labels = []

for i in tqdm(range(len(website_df))):
  title = website_df.iloc[i]['Title']
  title_embedding = model.encode(title).astype(float)

  predicted_label = closest_row_euclidean(title_embedding,category_embeddings)

  predicted_labels.append(predicted_label)

  #collect the truth category so we can test how well this theory works
  truth_category = website_df.iloc[i]['Primary Category']
  truth_label = category_mapping[truth_category]
  truth_labels.append(truth_label)

#convert the lists to tensors
predicted_labels = torch.tensor(predicted_labels)
truth_labels = torch.tensor(truth_labels)

In [None]:
# how well did we do?
from sklearn.metrics import accuracy_score

accuracy_score(truth_labels, predicted_labels)